<table align="left">
<tr>

<th style="background-color:white">
<img src="https://github.com/mlgill/ODSC_East_2017_PythonNLP/blob/master/assets/logo.png?raw=true" width="140" height="100">
</th>

<th style="background-color:white">
<div align="left">
<h1>Learning from Text: <br> Introduction to Natural Language Processing with Python</h1>  
<h2>Michelle L. Gill, Ph.D.</h2>     
Senior Data Scientist, Metis  
ODSC East  
May 3, 2017 
</div>
</th>

</tr>
</table>  

## LDA Walkthrough and Exercises

## The Data

We will be using a sample of the well-known 20 Newsgroups data set, which contains approximately 20,000 posts partitioned nearly evenly across 20 different newsgroups. Our sample covers 5 topics and about 3,000 posts.

We will begin by loading the data.

In [1]:
import nltk
from accessory_functions import nltk_path

# Setup nltk corpora path
nltk.data.path.insert(0, nltk_path)

In [None]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

topic_list = ['sci.space', 'comp.sys.mac.hardware', 'rec.autos',
              'rec.sport.baseball', 'sci.med']

dataset = fetch_20newsgroups(shuffle=True, random_state=1, data_home='../data',
                             categories=topic_list,
                             remove=('headers', 'footers', 'quotes'))

data = pd.DataFrame(dataset['data'], columns=['text'])
print(len(data))

2956


## Preprocess the Data

Next we will preprocess the text using the `preprocess_series_text` convenience function from `accessory_functions`.

In [None]:
from accessory_functions import preprocess_series_text

data['text'] = preprocess_series_text(data.text, 
                                      nltk_path=nltk_path)

In [None]:
data.head()

## Create Numerical Features

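Use `CountVectorizer` to create a document-term matrix: one row per post, one column per term, with each cell holding the number of times the term occurs in the post.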

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

n_features = 1000
cv = CountVectorizer(max_df=0.95, min_df=2, 
                     max_features=n_features)
X = cv.fit_transform(data.text)

print(X.shape)

In [None]:
# `get_feature_names_out` replaces `get_feature_names` in recent scikit-learn releases
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out()).head()
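
To make the roles of `min_df` and `max_df` concrete, here is a minimal pure-Python sketch of a document-term matrix. It is an illustration only, not how `CountVectorizer` is implemented: the function name `build_document_term_matrix` and the toy sentences are made up for this example, tokenization is a plain whitespace split, and `max_features` is not modelled.

```python
from collections import Counter


def build_document_term_matrix(docs, min_df=1, max_df=1.0):
    """Return (vocabulary, rows) for whitespace-tokenized documents.

    min_df is an absolute document count and max_df a proportion of
    documents, matching how the two arguments are used in the cell above.
    """
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Keep terms that are neither too rare nor too common.
    max_count = max_df * len(docs)
    vocabulary = sorted(term for term, df in doc_freq.items()
                        if min_df <= df <= max_count)

    # One row per document, one column per vocabulary term.
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Hypothetical toy corpus
docs = ["the rocket reached orbit",
        "the engine powers the car",
        "the rocket engine failed"]

vocabulary, rows = build_document_term_matrix(docs, min_df=2, max_df=0.95)
print(vocabulary)
for row in rows:
    print(row)
```

In this toy corpus "the" appears in every document and is dropped by `max_df=0.95`, while terms that appear in only one document are dropped by `min_df=2`.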

## Create an LDA Model

Use Scikit-learn's [`LatentDirichletAllocation`](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) to fit an LDA model.

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

# Display the docstring (the class must be imported first)
LatentDirichletAllocation?

In [None]:

n_topics = len(topic_list)
# The `n_topics` argument was renamed to `n_components` in scikit-learn 0.19
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

lda.fit(X)
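
Once a model is fitted, it holds two things worth inspecting: the topic-word weights in `components_` and, via `transform`, a topic distribution for each document. The sketch below shows both on a made-up four-sentence corpus. It assumes scikit-learn 0.19 or later (for the `n_components` argument), and the `toy_*` names are hypothetical; on the notebook's data the equivalent call would be `lda.transform(X)`.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only
toy_docs = ["rocket orbit launch rocket",
            "engine car wheel engine",
            "orbit launch satellite",
            "car wheel brake"]

toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)

# fit_transform returns one row per document and one column per topic
doc_topics = toy_lda.fit_transform(toy_X)

print(doc_topics.shape)              # (number of documents, number of topics)
print(doc_topics.sum(axis=1))        # each row is a probability distribution
print(toy_lda.components_.shape)     # (number of topics, number of terms)

# Index of the highest-weighted topic for each document
dominant_topic = doc_topics.argmax(axis=1)
print(dominant_topic)
```

Which topic index ends up associated with which theme depends on the random initialization, so the topic numbers themselves carry no meaning.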

## Print Top Words
Print the top words associated with each topic.

In [None]:
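def print_top_words(model, feature_names, n_top_words):
    """Print the `n_top_words` highest-weighted terms for each topic."""
    # Each row of `components_` holds one topic's weight for every term
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % (topic_idx+1))
        # argsort is ascending, so the reversed slice takes the largest weights first
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()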

In [None]:
n_top_words = 20

print("Topics in LDA model:")
cv_feature_names = cv.get_feature_names_out()
print_top_words(lda, cv_feature_names, n_top_words)
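
The slice `[:-n_top_words - 1:-1]` inside `print_top_words` is compact but easy to misread. The sketch below walks through it on a small made-up weight vector (the terms and weights are hypothetical) and checks it against a more explicit alternative.

```python
import numpy as np

# Hypothetical weights of five terms within a single topic
weights = np.array([0.1, 2.5, 0.7, 1.9, 0.3])
terms = np.array(["car", "orbit", "engine", "rocket", "doctor"])
n_top = 3

# argsort returns indices that sort the weights in ascending order
ascending = weights.argsort()

# Walk backwards from the end and stop after n_top items:
# this yields the indices of the n_top largest weights, largest first
top_idx = ascending[:-n_top - 1:-1]
top_terms = terms[top_idx].tolist()
print(top_idx, top_terms)

# An equivalent, arguably more readable form when there are no tied weights
alt_idx = np.argsort(-weights)[:n_top]
print(alt_idx)
```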

## Visualize the LDA Model

A visualization of the topic model can be easily created with `pyLDAvis`. 

In [None]:
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()
pyLDAvis.sklearn.prepare(lda, X, cv)

## Exercises

* Fit an LDA model with a different number of topics and compare the top 20 words to those from the model above.
* Create a different document-term matrix by changing the input parameters (`max_features`, etc.) or by switching to `TfidfVectorizer`, and use it to fit another LDA model.
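
When comparing two models, reading two lists of top words side by side gets tedious. One possible aid, sketched here under stated assumptions, is to score the overlap between the top-word sets of every pair of topics with the Jaccard index. The helper names `top_word_sets` and `jaccard_matrix` are made up for this example, and the small arrays are hypothetical stand-ins for the `components_` attributes of two fitted models.

```python
import numpy as np


def top_word_sets(components, feature_names, n_top_words):
    """Return one set of top words per topic (per row of `components`)."""
    return [
        {feature_names[i] for i in row.argsort()[:-n_top_words - 1:-1]}
        for row in components
    ]


def jaccard_matrix(sets_a, sets_b):
    """Jaccard overlap between each topic of model A and each topic of model B."""
    return np.array([[len(a & b) / len(a | b) for b in sets_b]
                     for a in sets_a])


# Hypothetical topic-word weights for a 2-topic and a 3-topic model
names = ["car", "engine", "orbit", "rocket", "doctor", "patient"]
model_a = np.array([[5, 4, 0, 1, 0, 0],
                    [0, 1, 5, 4, 0, 0]], dtype=float)
model_b = np.array([[4, 5, 1, 0, 0, 0],
                    [0, 0, 4, 5, 1, 0],
                    [0, 0, 0, 1, 5, 4]], dtype=float)

overlap = jaccard_matrix(top_word_sets(model_a, names, 2),
                         top_word_sets(model_b, names, 2))
print(overlap)
```

A value of 1 means two topics share exactly the same top words and 0 means they share none. With real models, something like `top_word_sets(lda.components_, cv_feature_names, 20)` would supply the first argument.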