# Using LDA to Understand Documents

## Introduction

We'll use part of the 20 Newsgroups dataset, available through scikit-learn's `fetch_20newsgroups`, to practice topic modeling with LDA. Let's start by loading 3 categories from the dataset. I recommend choosing 3 fairly disparate categories to begin with. You can see the full list of available categories here: http://scikit-learn.org/stable/datasets/twenty_newsgroups.html.

## Question 1

* Load ONLY the training set from each of these categories 
* Remove the headers, footers, and quotes from each member of the set
* Explore the dataset and verify that you have inputs from each category (are the datapoints randomized?)

In [5]:
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)
import numpy as np
import nltk
import os
from sklearn import datasets

categories = ['alt.atheism', 'comp.graphics', 'rec.sport.baseball']
ng_train = datasets.fetch_20newsgroups(subset='train', 
                                       categories=categories, 
                                       remove=('headers', 
                                               'footers', 'quotes'))

In [6]:
print(ng_train.data[2])
print("++\n", ng_train.data[1504])
print("++\n", ng_train.data[1000])



	Sorry, I was, but I somehow have misplaced my diskette from the last 
couple of months or so. However, thanks to the efforts of Bobby, it is being 
replenished rather quickly!  

	Here is a recent favorite:

	--


       "Satan and the Angels do not have freewill.  
        They do what god tells them to do. "

        S.N. Mozumder (snm6394@ultb.isc.rit.edu) 


--


       "Satan and the Angels do not have freewill.  
        They do what god tells them to do. "
++
 

Why not use the PD C library for reading/writing TIFF files? It took me a
good 20 minutes to start using them in your own app.

Martin

--
---------------------------------------------------------------------------
++
 
Indeed, if the color teal on a team's uniforms is any indication of the
future, the Marlins are in dire trouble! Refer to the San Jose Sharks for
proof... But I have hope for the Marlins. I was a sometime member of the
Rene Lachemann fan club at the Oakland Coliseum, and have a deep respect
for the guy
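
Printing a few posts shows what the text looks like, but not how many posts each category contributes or whether they arrive in label order. Here is a minimal sketch of that check, using a small hand-made label array in place of the real one. `category_counts`, `toy_target`, and `toy_names` are illustrative names; in the notebook you would pass `ng_train.target` and `ng_train.target_names` instead.

```python
import numpy as np

def category_counts(target, target_names):
    """Return {category name: number of documents} from integer labels."""
    counts = np.bincount(np.asarray(target), minlength=len(target_names))
    return {name: int(n) for name, n in zip(target_names, counts)}

# Hypothetical labels standing in for ng_train.target.
toy_target = [0, 2, 1, 1, 0, 2, 2]
toy_names = ['alt.atheism', 'comp.graphics', 'rec.sport.baseball']

counts = category_counts(toy_target, toy_names)
print(counts)

# If the labels never decrease, the documents are grouped by category
# rather than shuffled.
is_sorted = bool(np.all(np.diff(toy_target) >= 0))
print("sorted by category:", is_sorted)
```

If every category has a non-zero count and the labels are not sorted, the set contains all three classes in mixed order.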

## Question 2

* Pre-process all words in your document, including removing stop words.
* Remove words that show up in more than 60% of the documents.
* Vectorize your documents using n-grams


In [7]:
from sklearn.feature_extraction.text import CountVectorizer
# student section here


## Question 3

* Create an LDA model with 3 topics. You can do this with Gensim or scikit-learn.
* Print out the topics and the 20 words most associated with each topic.
* Try using more or fewer topics. Is there a sweet spot that lets us separate the three input classes?
* Find a document that is clearly about baseball. Does the model assign it predominantly to the baseball topic?
* Use pyLDAvis (pip install pyldavis) to create an interactive visualization of the topics
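
To make the "dominant topic" idea concrete, here is a small sketch with scikit-learn on a made-up corpus. The names `toy_docs`, `toy_vec`, `lda_toy`, and `doc_topics` are illustrative and are chosen so they do not clash with the notebook's own variables. `fit_transform` returns one row per document holding its topic proportions, and the largest entry in a row is that document's dominant topic. Which topic index ends up meaning what depends on the random seed, so treat the printed numbers as an illustration only.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents from two loose themes.
toy_docs = [
    "pitcher threw ball catcher inning",
    "catcher dropped ball inning pitcher",
    "render image graphics card polygons",
    "graphics card draws polygons image",
]

toy_vec = CountVectorizer(stop_words="english")
toy_X = toy_vec.fit_transform(toy_docs)

lda_toy = LatentDirichletAllocation(n_components=2, max_iter=10, random_state=0)
doc_topics = lda_toy.fit_transform(toy_X)  # shape: (n_docs, n_topics)

# Index of the highest-weighted topic for each document.
dominant = doc_topics.argmax(axis=1)
for doc, weights, k in zip(toy_docs, doc_topics, dominant):
    print(k, np.round(weights, 2), doc)
```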

In [8]:
from sklearn.decomposition import LatentDirichletAllocation
n_topics = 4
n_iter = 10
# student section here


# student section ends here
data[0]  # topic proportions for document 0; assumes `data` (the LDA document-topic matrix) is defined in the student section above

In [None]:
print(ng_train.data[0]) # 99% composed of topic 3!

In [None]:
def display_topics(model, feature_names, no_top_words):
    for ix, topic in enumerate(model.components_):
        print("Topic ", ix)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
        
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
display_topics(lda, count_vectorizer.get_feature_names_out(), 20)

**Topic 3 is baseball stuff and our post on baseball is predicted to be 99% topic 3. Nice!**

## Question 4

* Open a new dataset from `data/ap/ap_split.txt` (Source: http://www.cs.columbia.edu/~blei/lda-c/). This is a dataset of articles from the Associated Press with no predetermined scheme of topics.
* Split this raw file into a set of documents. There is a clear marker between each article.
* Clean the text data and prepare it for modeling (note that each document has some \<XYZ\> tags as well as extra spaces)
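
The cleaning step can be sketched with two regular expressions: one that removes anything shaped like a tag, and one that collapses runs of whitespace. The example below uses a short made-up string rather than the real file. The `<DOC>` and `<TEXT>` tag names are placeholders for whatever tags the file actually contains, and `clean_doc` is an illustrative helper name. The `---` separator matches the one used in the loading cell.

```python
import re

# Made-up raw text with two "articles" separated by the marker.
toy_raw = (
    "<DOC>  <TEXT> First   article text. </TEXT> </DOC>"
    "---"
    "<DOC> <TEXT>Second article,\n with a line break.</TEXT></DOC>"
)

def clean_doc(text):
    """Replace <...> tags with spaces, then collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", no_tags).strip()

toy_clean = [clean_doc(d) for d in toy_raw.split("---")]
toy_clean = [d for d in toy_clean if d]  # drop pieces that are empty after cleaning
print(toy_clean)
```

Replacing tags with a space rather than an empty string avoids gluing together words that were only separated by a tag.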

In [None]:
with open('../data/ap_split.txt','r') as f:
    raw_text = f.read()
docs = raw_text.split('---')
docs[1]

In [None]:
import re
# student section here




# student section ends here
docs[1]

In [None]:
print(len(docs))

## Question 5

* Do LDA modeling to find topics in this collection of articles. Try many different numbers of topics and processing techniques. You can use Gensim or scikit-learn.
* Note: In this case, there isn't a "right" answer, but some answers are better than others. Try to find an answer where you're getting clear topics that make sense and seem consistent. Assign labels to each topic, after investigating it by eye, if you can.
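
If you want a number to look at alongside the by-eye inspection, one option is to fit a model for each candidate topic count and compare perplexity. This is a rough sketch on a made-up corpus (`toy_docs` and `scores` are illustrative names). Perplexity measured on the training data is only a loose guide and does not replace reading the topics, so treat it as a way to narrow the search rather than as the answer.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents; in the notebook you would vectorize the cleaned articles.
toy_docs = [
    "oil prices rose as opec cut production",
    "hurricane winds hit the coast and flooded towns",
    "senate campaign ads aired on tv networks",
    "drought hurt corn and wheat farmers",
    "oil production fell after the hurricane",
    "farmers asked the senate for drought aid",
]
toy_X = CountVectorizer(stop_words="english").fit_transform(toy_docs)

scores = {}
for k in (2, 3, 4):
    model = LatentDirichletAllocation(n_components=k, max_iter=10, random_state=0)
    model.fit(toy_X)
    scores[k] = model.perplexity(toy_X)  # lower is better, all else equal
    print(k, round(scores[k], 1))
```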

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# student section here




In [None]:
from sklearn.decomposition import LatentDirichletAllocation

# student section here




In [None]:
def display_topics(model, feature_names, no_top_words, topic_names=None):
    for ix, topic in enumerate(model.components_):
        if not topic_names or not topic_names[ix]:
            print("\nTopic ", ix)
        else:
            print("\nTopic: '",topic_names[ix],"'")
        print(", ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
# Inspect the topics first, then add the labels afterwards.
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
display_topics(lda, count_vectorizer.get_feature_names_out(), 20)

In [None]:
tn = ["Political Media",None,"Financials",None,"Nordstrom Scandal","Oil","Hurricanes","North Korea","NASA","US Politics","TV Networks","Forest Fires",
      None,"Agriculture/Drought","Middle East","US Political Campaigns","Pollution","Caribbean","Health/Medical","Theatre/Arts","Global Warming",
      "Advertisements","Southern US Weather","South America",None]
display_topics(lda, count_vectorizer.get_feature_names_out(), 20, topic_names=tn)