# Crunchbase Assignment
## By Mehul Shah

**Task**: Perform topic modelling on a corpus of 10,000 documents.  
**Data**: Text file containing corpus of 10,000 documents in JSON format. 

I will explore the data before performing two types of topic modelling: 

1. [Nonnegative Matrix Factorization (NMF)](https://en.wikipedia.org/wiki/Non-negative_matrix_factorization)
2. [Latent Dirichlet Allocation (LDA)](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)

Then I'll visualize the data using the helpful package [pyLDAvis](http://pyldavis.readthedocs.io/en/latest/).

In [1]:
#to store/explore the data
import json
import pandas as pd
import numpy as np
#to perform feature extraction & the two types of topic modelling algorithms
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
#to visualize the data
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

# Loading data and initial peek

In [2]:
data = []
with open('corpus.txt','r') as f:
    for line in f:
        data.append(json.loads(line))
print("There are " + str(len(data)) + " documents")

There are 10000 documents


Now that we have the data loaded into Python, let's make use of Pandas DataFrame object to peek into the data & its format:

In [3]:
df = pd.DataFrame(data)
df.head(5)

| | author | crawlName | date | html | humanLanguage | pageUrl | siteName | tags | text | thumbnailUrl | title |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'string': 'Eleanor Cummins'} | {'string': 'slate_tech'} | 1507313040000 | | {'string': 'en'} | http://www.slate.com/articles/technology/techn... | {'string': 'Slate Magazine'} | [{'score': 0.75, 'count': 10, 'label': 'Puerto... | Joe Raedle/Getty Images\nLast week, I wrote a ... | {'string': 'http://www.slate.com/content/dam/s... | There’s Reason to Be Skeptical of a Tesla-Powe... |
| 1 | {'string': 'Kelly McGowan'} | {'string': 'demoines_register_business'} | 1507248000000 | | {'string': 'en'} | http://www.desmoinesregister.com/story/news/cr... | {'string': 'Des Moines Register'} | [{'score': 0.65, 'count': 3, 'label': 'West De... | Zach Boyden-Holmes/The Register\nPolice have i... | {'string': 'https://www.gannett-cdn.com/-mm-/9... | Police: Threats to West Des Moines students we... |
| 2 | {'string': 'Chicago Tribune'} | {'string': 'chicago_tribue_business'} | 1507161600000 | | {'string': 'en'} | http://www.chicagotribune.com/news/local/break... | {'string': 'chicagotribune.com'} | [{'score': 0.38, 'count': 1, 'label': 'The Bla... | Chicago police are investigating whether Steph... | {'string': 'http://www.trbimg.com/img-59d66185... | Gunman reserved two rooms at Blackstone |
| 3 | | {'string': 'upi_business'} | 1507055340000 | | {'string': 'en'} | https://www.upi.com/Sports_News/College-Footba... | {'string': 'UPI'} | [] | NASHVILLE, Tenn. -- No. 5 Georgia, coming off ... | {'string': 'https://cdnph.upi.com/svc/sv/upi/2... | No. 5 Georgia Bulldogs, Vanderbilt Commodores ... |
| 4 | | {'string': 'cb3_chicago_tribue_business'} | 1506776400000 | | {'string': 'en'} | http://www.chicagotribune.com/entertainment/tv... | {'string': 'chicagotribune.com'} | [] | | {'string': 'http://www.trbimg.com/img-59cea4c3... | Craig Robinson and Adam Scott buddy up in Fox'... |


In [4]:
df['text'][0]

'Joe Raedle/Getty Images\nLast week, I wrote a story for Slate about Hurricane Maria’s sole silver lining: In the wake of unprecedented destruction, Puerto Rico actually has a unique opportunity to build a better, greener grid. On Thursday, Elon “Batteries Suck” Musk stepped in to suggest, over Twitter, that he (and his companies Tesla and SolarCity) could be the one to do the heavy lifting for the ravaged island territory. The governor of Puerto Rico, Ricardo Rosselló, soon tweeted back to say, “Let’s talk.”\nThis is not what I had in mind. Musk’s day-old plan for Puerto Rico is underdeveloped, to say the least, but the billionaire’s past projects in energy infrastructure could point the way. Back in 2016, Musk installed a microgrid on the island of Ta’u in American Samoa. Up until November of that year, the island had run primarily on diesel shipped in from the mainland. So Musk stepped in to install more than 5,000 solar panels and 60 Tesla Powerpacks for storage, as the Verge repor

We can see a number of data fields (columns) present in the data. The most useful seems to be the *text* field, which we will use as the source of our features. However, further experimentation may make use of fields like *tags*, *crawlName*, or even the *title* to improve either speed or accuracy.

# Init variables and preprocess data

In [5]:
# in order: number of documents, vocabulary size (features per document), number of topics to extract, number of top words to display per topic
n_samples = 10000
n_features = 1000
n_components = 10
n_top_words = 20

In [6]:
# pre-process data by ensuring all texts are plain strings (not dicts of the form {'string': 'text'})
# assign the whole column at once to avoid pandas chained-assignment pitfalls
df['text'] = df['text'].apply(lambda t: t if isinstance(t, str) else t['string'])

# traindata is a list (vector) consisting of Strings, where each index contains the entire article text from a document
traindata = list(df['text'])

# Feature extraction

In the topic modeling area of NLP, there are two usual suspects in terms of feature extraction: 

1. [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
2. [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)
    
Note: TfidfVectorizer is equivalent to CountVectorizer followed by [TfidfTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html)
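
As a quick sanity check of that equivalence, here is a minimal sketch on a three-sentence toy corpus (the sentences are made up for illustration, and all vectorizer parameters are left at their defaults):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

docs = [
    "the market rose as tech shares gained",
    "the team won the game in overtime",
    "tech company shares fell after the game",
]

# One step: TF-IDF weights directly from raw text.
one_step = TfidfVectorizer().fit_transform(docs)

# Two steps: raw counts first, then the TF-IDF reweighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# Both routes should give the same matrix (up to floating-point error).
same = np.allclose(one_step.toarray(), two_step.toarray())
print(same)
```

The two-step route is handy when both the raw counts and the TF-IDF weights are needed, since the counting work is done only once.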

We need to tokenize the data and turn it into numeric features before fitting a model. Each vectorizer produces a sparse document-term matrix: one row per document and one column per vocabulary word, holding raw counts (CountVectorizer) or TF-IDF weights (TfidfVectorizer). There is plenty of room for experimentation w/ n-grams, regex to remove certain characters, and frequency cutoffs. But I will keep it simple for the sake of the assignment, and leave that for future work.

A brief explanation of the parameters for the two types of vectorization:

* max_df=0.90 - any word that appears in more than 90% of the documents will not be included in the vocabulary
* min_df=5 - any word that appears in fewer than 5 documents will not be included in the vocabulary
* max_features=n_features - build a vocabulary of only the `n_features` most frequent words across the corpus
* stop_words='english' - drop the words in scikit-learn's built-in English stop word list
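
To make the document-frequency cutoffs concrete, here is a rough pure-Python sketch of the filtering idea. The four-sentence corpus and the threshold values are illustrative only, and splitting on whitespace is a simplification of what the scikit-learn tokenizer actually does:

```python
from collections import Counter

docs = [
    "the market rose today",
    "the team won today",
    "the market fell",
    "the coach praised the team",
]

# Document frequency: number of documents containing each term at least once.
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(doc.split()))

n_docs = len(docs)
max_df = 0.90   # float: drop terms found in more than 90% of documents
min_df = 2      # int: drop terms found in fewer than 2 documents

kept = sorted(
    term for term, df_count in doc_freq.items()
    if min_df <= df_count <= max_df * n_docs
)
print(kept)
```

Here "the" is dropped for being in every document, the words that occur in a single document are dropped as too rare, and only the mid-frequency terms survive.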

In [7]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.90, min_df=5,
                                   max_features=n_features,
                                   stop_words='english')

tfidf = tfidf_vectorizer.fit_transform(traindata)

In [8]:
tf_vectorizer = CountVectorizer(max_df=0.90, min_df=5,
                                max_features=n_features,
                                stop_words='english')

tf = tf_vectorizer.fit_transform(traindata)

In [9]:
"TF_IDF & TF feature sets have shapes of {}, {} respectively".format(tfidf.shape, tf.shape)

'TF_IDF & TF feature sets have shapes of (10000, 1000), (10000, 1000) respectively'

# Defining model parameters and fitting data

In [10]:
# pd.DataFrame(tf.toarray(), columns=tf_vectorizer.get_feature_names())
# We can use this to visualize our tokenization of the data, to see the frequency of words in our vocabulary

In [11]:
nmf_tfidf = NMF(n_components=n_components, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)

nmf_tf = NMF(n_components=n_components, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tf)

In [12]:
lda_tfidf = LatentDirichletAllocation(n_components=n_components, random_state=0).fit(tfidf)

lda_tf = LatentDirichletAllocation(n_components=n_components, random_state=0).fit(tf)



# Visualizing Results

In [13]:
def display_topics_and_words(model, feature_name, n_words):
    for index, topic in enumerate(model.components_):
        message = "Topic #%d: " % (index+1)
        message += " ".join([feature_name[i] for i in topic.argsort()[:-n_words - 1:-1]])
        print(message)
        print()
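
The slice `[:-n_words - 1:-1]` in `display_topics_and_words` is easy to misread, so here is a small sketch of what it does. The vocabulary and weights below are made up for illustration:

```python
import numpy as np

# Hypothetical weights of one topic over a six-word vocabulary.
feature_names = ["bank", "coach", "game", "market", "season", "tax"]
topic = np.array([0.1, 0.7, 0.9, 0.2, 0.8, 0.05])
n_words = 3

# argsort() orders indices from smallest to largest weight; the slice
# [:-n_words - 1:-1] walks backwards from the end and stops after n_words items,
# so it yields the indices of the n_words largest weights, largest first.
top_indices = topic.argsort()[:-n_words - 1:-1]
top_words = [feature_names[i] for i in top_indices]
print(top_words)
```

An equivalent and arguably more readable spelling is `topic.argsort()[::-1][:n_words]`.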

In [14]:
print("\nTopics in NMF model w/ TF-IDF features (Frobenius norm):")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
display_topics_and_words(nmf_tfidf, tfidf_feature_names, n_top_words)


Topics in NMF model w/ TF-IDF features (Frobenius norm):
Topic #1: like just people time says don know new women think life ve love really way going good want year years

Topic #2: game season games points team play coach win league second players scored yards said goal ball played teams 10 lead

Topic #3: said state mr government court minister people school told public party officials statement law health country city case department national

Topic #4: percent million year market billion tax cent growth quarter bank 2017 sales shares rate price investors prices oil stock company

Topic #5: trump president house donald white campaign senate tax republican administration republicans washington mr russia russian election congress american news states

Topic #6: company data business com services technology information market new products solutions customers industry global service www companies management based platform

Topic #7: police man officers old car said cou

In [15]:
print("\nTopics in LDA TF model:")
tf_feature_names = tf_vectorizer.get_feature_names()
display_topics_and_words(lda_tf, tf_feature_names, n_top_words)


Topics in LDA TF model:
Topic #1: game said team season games year points play win just time second players league coach going good got won week

Topic #2: said police year people old family years told man home time life just did says children day car death say

Topic #3: new apple amazon google app store iphone device video phone best devices available users use black features like mobile space

Topic #4: said trump president government state law house minister court country party mr states national political federal new north korea public

Topic #5: women said mr place new film sexual york news men ms times star weinstein told director hotel city sex allegations

Topic #6: like just people time don think make want way know good new ve going need really work things says right

Topic #7: year million percent tax said 2017 billion market bank cent quarter growth company financial oil income net sales 2016 share

Topic #8: company market business data com services technology information

In [16]:
print("\nTopics in LDA TF_IDF model:")
display_topics_and_words(lda_tfidf, tfidf_feature_names, n_top_words)


Topics in LDA TF_IDF model:
Topic #1: said people like just time new year family says years film love know women life don black day children music

Topic #2: police said court man old county shooting officers car gun year authorities death killed investigation charges vehicle according hospital case

Topic #3: company new market business said data technology services com information apple customers industry products companies based service global use 2018

Topic #4: free 30 city subscribe county st local access sign park unlimited trial story miss digital 10 day water state said

Topic #5: star tech happy devices competition sex model employees maybe value felt 14 sign news hour original record design insurance feature

Topic #6: said trump government president state minister mr party house china people country election political court national tax law korea new

Topic #7: game season games team points league coach win said play players second scored just year time player ball going y

In [17]:
pyLDAvis.sklearn.prepare(lda_tf,tf,tf_vectorizer)

In [18]:
pyLDAvis.sklearn.prepare(lda_tfidf,tfidf,tfidf_vectorizer)

Cool! Looks like we were able to get some decent separation between topics using the CountVectorizer tokenized feature set. It's clear we have topics such as:

* Finance
* Technology
* Politics
* Education
* Sports
* Entertainment
* Lifestyle/Gossip
* Stocks
* News about police/crime

Now, let's take a look at a few documents from each topic to see if our topic modeling had any real success! I'll be using the NMF model for this section, as its topics seem to form cleaner categories, w/ fewer miscellaneous terms.

In [19]:
nmf_doc_distrib = nmf_tfidf.transform(tfidf)
nmf_doc_distrib.shape

(10000, 10)

In [20]:
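# one bucket of document indices per topic
doc_topics = [[] for _ in range(n_components)]
# topic labels assigned by hand after reading each NMF topic's top words
topics = ['Lifestyle','Sports', 'Law', 'Stocks/Finance', 'US Politics (Trump)', 'Business', 'Crime', 'Advertisements', 'Tech' ,'Intl Politics']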
for i in range(len(nmf_doc_distrib)):
    doc_topics[np.argmax(nmf_doc_distrib[i])].append(i)

for i in range(len(doc_topics)):
    print(topics[i], end=": ")
    print(doc_topics[i][:5])
    print()

Lifestyle: [0, 4, 5, 6, 7]

Sports: [3, 22, 23, 32, 34]

Law: [8, 16, 19, 20, 21]

Stocks/Finance: [9, 48, 49, 54, 56]

US Politics (Trump): [28, 58, 65, 75, 91]

Business: [11, 12, 13, 14, 26]

Crime: [1, 2, 24, 35, 36]

Advertisements: [17, 25, 89, 119, 158]

Tech: [38, 62, 64, 79, 95]

Intl Politics: [137, 185, 206, 236, 303]



Looking at any of these individual documents will allow us to see how accurate our model was. For the sake of time, I won't go through every single topic, but that should be done in future work.

# Conclusions & Next Steps

Using the extremely useful package [sklearn](http://scikit-learn.org/), we were able to fit two different types of models to our documents: nonnegative matrix factorization and latent Dirichlet allocation. I loaded the data, vectorized it so that it could be fed into a model, and then fit each model with 10 topics. I chose to keep it fairly simple as this was a small assignment. However, there is a lot of room for future exploration! Here are some ideas I had regarding future possibilities:

* Hyperparameter optimization using grid-search (i.e. what are good values of decay, learning rate, regularization, etc.)
* Experiment w/ the number of topics to prevent overlap/bad classification
* Look into probability distribution of each document across topics, and be able to see how confidently each document is classified
* For documents that aren't confidently in one topic, manually look into them to see what the content is, and use that knowledge to further improve the model
* Preprocess the text using n-grams, and look into regex to remove characters/numerics that may worsen the performance of the model
* Obtain some ground-truth data to come up w/ priors for topics and use cross-validation to determine actual accuracy of the model
* Read the literature on topic modeling to experiment w/ the state of the art
* Look into the pros/cons of models (i.e. Kullback-Leibler vs least squares (Frobenius), regularization, topic top words, etc.)
* Look into other packages (Gensim, Factorie, etc.) especially if dealing w/ more documents and performance is an issue (i.e. in production)
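
As a starting point for the idea of checking how confidently each document is classified, here is a minimal sketch. The matrix below is made up and stands in for the output of a fitted model's `transform()`. NMF weights are not probabilities, so the row-normalized "shares" are only a heuristic, and the 0.5 cutoff is an arbitrary assumption:

```python
import numpy as np

# Hypothetical document-topic weights (3 documents, 4 topics).
doc_topic = np.array([
    [0.60, 0.10, 0.10, 0.20],
    [0.25, 0.25, 0.30, 0.20],
    [0.00, 0.00, 0.00, 0.00],   # a document with no in-vocabulary terms
])

row_sums = doc_topic.sum(axis=1, keepdims=True)
# Normalize rows to sum to 1; leave all-zero rows at zero instead of dividing by 0.
shares = np.divide(doc_topic, row_sums,
                   out=np.zeros_like(doc_topic), where=row_sums > 0)

best_topic = shares.argmax(axis=1)
confidence = shares.max(axis=1)

threshold = 0.5   # assumed cutoff; would need tuning on real data
needs_review = np.where(confidence < threshold)[0]
print(best_topic, confidence, needs_review)
```

Note that `argmax` returns index 0 for an all-zero row, so a document with empty text would silently land in the first topic. Flagging low-confidence rows like this is one way to catch those cases before reading documents by hand.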

I had a lot of fun playing around with this data, and wanted to thank you for the opportunity! 