# Challenge: Topic extraction on new data

Take the well-known [20 newsgroups](http://qwone.com/~jason/20Newsgroups/) dataset and apply each of the three topic-extraction methods (LSA, LDA, and NNMF) to it. Your goal is to determine which method, if any, best reproduces the topics represented by the newsgroups. Write up a report in which you evaluate each method against the 'ground truth': the known source newsgroup of each post. Which works best, and why do you think this is the case?

In [1]:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
import timeit
%matplotlib inline

from sklearn.datasets import fetch_20newsgroups

# Strip headers, signature footers, and quoted replies so that only the body of each post remains.
twenty_newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))

## Data Preview

In [2]:
# View 20 newsgroups 
twenty_newsgroups.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [3]:
# Print the first entry in the dataset
print(twenty_newsgroups.data[0])

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.


## Generating the tf-idf matrix
(term frequency - inverse document frequency)
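To make the weighting concrete, here is a minimal pure-Python sketch of tf-idf on a made-up three-document corpus. It assumes the smoothed idf and L2 row normalization that `TfidfVectorizer` applies by default, and it uses plain whitespace tokenization with no stop-word removal, so it will not match the vectorizer's output on real text. The names `toy_docs`, `toy_tfidf`, and `toy_vectors` are illustrative only.

```python
import math
from collections import Counter

# Made-up corpus for illustration; not taken from the 20 newsgroups data.
toy_docs = [
    "the car has a fast engine",
    "the engine of the car",
    "the church and the faith",
]
toy_tokens = [doc.split() for doc in toy_docs]
n_docs = len(toy_tokens)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in toy_tokens for term in set(tokens))

def toy_tfidf(tokens):
    """Return an L2-normalized {term: weight} mapping for one document."""
    term_freq = Counter(tokens)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_freq.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

toy_vectors = [toy_tfidf(tokens) for tokens in toy_tokens]

# "the" occurs in every document, so it is weighted below the rarer "fast".
print(round(toy_vectors[0]["the"], 3), round(toy_vectors[0]["fast"], 3))
```

Terms that appear in every document keep an idf of 1, while rarer terms are weighted up, which is why common filler words contribute less to each document vector.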

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Creating tf-idf matrix
vectorizer = TfidfVectorizer(stop_words='english')
twenty_newsgroups_tfidf = vectorizer.fit_transform(twenty_newsgroups.data)

# Getting the word list (get_feature_names() was removed in scikit-learn 1.2).
terms = vectorizer.get_feature_names_out()

# Number of topics.
ntopics=20

# Linking words to topics
def word_topic(tfidf, solution, wordlist):

    # Loading scores for each word on each topic/component:
    # (terms x documents) @ (documents x topics) -> (terms x topics).
    words_by_topic = tfidf.T @ solution

    # Linking the loadings to the words in an easy-to-read way.
    components = pd.DataFrame(words_by_topic, index=wordlist)

    return components

# Extracts the top N words and their loadings for each topic.
def top_words(components, n_top_words):
    entries = []
    index = []
    for column in range(components.shape[1]):
        # Sort the column so that highest loadings are at the top.
        sortedwords = components.iloc[:, column].sort_values(ascending=False)
        # Choose the N highest loadings.
        chosen = sortedwords[:n_top_words]
        # Combine word and rounded loading into a string, e.g. "car 1.23".
        entries.extend(word + " " + str(round(loading, 2))
                       for word, loading in chosen.items())
        # Label each entry with its topic number.
        index.extend([column] * len(chosen))
    return pd.Series(entries, index=index)

# Number of words to look at for each topic.
n_top_words = 10

# Fitting the three topic-extraction models

### LSA (Latent Semantic Analysis)

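Applies a truncated singular value decomposition to the tf-idf matrix. Each resulting component is a weighted combination of terms, and the terms with the highest loadings on a component can be read as a topic.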
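As a rough illustration of the idea, the sketch below runs a plain NumPy SVD on a tiny made-up count matrix and keeps the two largest components. The matrix, the term list, and every `toy_` name are assumptions for the example; the notebook itself uses scikit-learn's `TruncatedSVD` on the full tf-idf matrix.

```python
import numpy as np

# Made-up document-term counts: rows are documents, columns are terms.
toy_terms = ["engine", "car", "wheel", "god", "faith", "church"]
toy_X = np.array([
    [2, 3, 1, 0, 0, 0],
    [1, 2, 2, 0, 0, 0],
    [0, 0, 0, 3, 2, 1],
    [0, 0, 0, 1, 2, 2],
], dtype=float)

toy_k = 2  # number of components (topics) to keep

# Full SVD, then truncate to the toy_k largest singular values.
U, s, Vt = np.linalg.svd(toy_X, full_matrices=False)
toy_doc_topic = U[:, :toy_k] * s[:toy_k]   # document coordinates in topic space
toy_term_topic = Vt[:toy_k]                # term loadings for each component

# Signs of singular vectors are arbitrary, so rank terms by absolute loading.
toy_top_terms = [
    [toy_terms[j] for j in np.argsort(-np.abs(toy_term_topic[i]))[:3]]
    for i in range(toy_k)
]
for i, words in enumerate(toy_top_terms):
    print("Component {}: {}".format(i, words))
```

Because the two groups of documents share no terms, each component picks up one group's vocabulary. Real text overlaps far more, which is one reason the first LSA component often ends up dominated by generic words.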

In [5]:
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Parameters for LSA: reduce to ntopics components, then scale each document vector to unit length.
svd = TruncatedSVD(ntopics)
lsa = make_pipeline(svd, Normalizer(copy=False))

# Time and run LSA model
start_time = timeit.default_timer()
twenty_newsgroups_lsa = lsa.fit_transform(twenty_newsgroups_tfidf)
elapsed_lsa = timeit.default_timer() - start_time

# Extract most common words for LSA
components_lsa = word_topic(twenty_newsgroups_tfidf, twenty_newsgroups_lsa, terms)
topwords=pd.DataFrame()
topwords['LSA']=top_words(components_lsa, n_top_words)                

### LDA (Latent Dirichlet Allocation)

Estimates the probability that a topic will be in a document, and the probability that a word will be in a topic.

In [6]:
from sklearn.decomposition import LatentDirichletAllocation as LDA

# Parameters for LDA
lda = LDA(n_components=ntopics, # Number of topics (this argument was called n_topics in older scikit-learn releases)
          doc_topic_prior=None, # None defaults to 1/n_components
          topic_word_prior=1/ntopics,
          learning_decay=0.7, # Controls the learning rate in the online learning method
          learning_offset=10.0, # Causes earlier iterations to have less influence on the learning
          max_iter=10, # When to stop even if the model is not converging (to prevent running forever)
          evaluate_every=-1, # Do not evaluate perplexity, as it slows training time.
          mean_change_tol=0.001, # Stop updating the document topic distribution in the E-step when mean change is < tol
          max_doc_update_iter=100, # When to stop updating the document topic distribution in the E-step even if tol is not reached
          n_jobs=-1, # Use all available CPUs to speed up processing time.
          verbose=0, # Amount of output to give while iterating
          random_state=0
         )

# Time and run LDA model
start_time = timeit.default_timer()
twenty_newsgroups_lda = lda.fit_transform(twenty_newsgroups_tfidf)
elapsed_lda = timeit.default_timer() - start_time

# Extract most common words for LDA
components_lda = word_topic(twenty_newsgroups_tfidf, twenty_newsgroups_lda, terms)
topwords['LDA']=top_words(components_lda, n_top_words)



### NNMF (Non-negative Matrix Factorization)
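NNMF approximates a non-negative matrix X by the product of two smaller non-negative matrices, W (documents by topics) and H (topics by terms). The sketch below is a minimal illustration using the classic multiplicative update rules on a made-up matrix. This is a different solver from the coordinate descent (`solver='cd'`) chosen for the scikit-learn model in this notebook, and all `toy_` names are assumptions for the example.

```python
import numpy as np

# Made-up non-negative document-term matrix.
toy_X = np.array([
    [2, 3, 1, 0, 0, 0],
    [1, 2, 2, 0, 0, 0],
    [0, 0, 0, 3, 2, 1],
    [0, 0, 0, 1, 2, 2],
], dtype=float)

toy_k = 2      # number of topics
eps = 1e-10    # guards against division by zero

# Random strictly positive starting values.
rng = np.random.default_rng(0)
toy_W = rng.random((toy_X.shape[0], toy_k)) + 0.1
toy_H = rng.random((toy_k, toy_X.shape[1])) + 0.1

toy_errors = []
for _ in range(200):
    # Multiplicative updates keep every entry non-negative.
    toy_H *= (toy_W.T @ toy_X) / (toy_W.T @ toy_W @ toy_H + eps)
    toy_W *= (toy_X @ toy_H.T) / (toy_W @ toy_H @ toy_H.T + eps)
    # Track the Frobenius norm of the reconstruction error.
    toy_errors.append(np.linalg.norm(toy_X - toy_W @ toy_H))

print("first error: {:.3f}, last error: {:.3f}".format(toy_errors[0], toy_errors[-1]))
```

Since no entry can go negative, topics can only add weight to terms and never cancel each other out, which tends to make the resulting word lists easier to interpret.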



In [7]:
from sklearn.decomposition import NMF

# Parameters for NNMF
nmf = NMF(alpha_W=0.0, # No regularization (this argument was called alpha before scikit-learn 1.0)
          init='nndsvdar', # How starting values are calculated
          l1_ratio=0.0, # Sets whether regularization is L2 (0), L1 (1), or a combination (values between 0 and 1)
          max_iter=200, # When to stop even if the model is not converging (to prevent running forever)
          n_components=ntopics, 
          random_state=0, 
          solver='cd', # Use Coordinate Descent to solve
          tol=0.0001, # Tolerance of the stopping condition
          verbose=0 # Amount of output to give while iterating
         )

# Time and run NNMF model
start_time = timeit.default_timer()
twenty_newsgroups_nmf = nmf.fit_transform(twenty_newsgroups_tfidf)
elapsed_nnmf = timeit.default_timer() - start_time

# Extract most common words for NNMF
components_nmf = word_topic(twenty_newsgroups_tfidf, twenty_newsgroups_nmf, terms)
topwords['NNMF']=top_words(components_nmf, n_top_words)

# Analysis

In [8]:
for topic in range(ntopics):
    print('Topic {}:'.format(topic))
    print(topwords.loc[topic])

Topic 0:
            LSA             LDA         NNMF
0    like 87.76   armenian 2.92    just 3.23
0     don 86.51  armenians 2.22     don 3.22
0    just 85.16    turkish 1.92   think 2.61
0    know 80.12     people 1.75    like 2.49
0  people 79.19       like 1.54  people 2.08
0   think 74.78       know 1.53    know 1.99
0    does 62.14       just 1.48    good 1.63
0    good 62.07        don 1.43      ve 1.54
0    time 61.16       does 1.37    time 1.51
0     use 59.87     thanks 1.29     say 1.43
Topic 1:
              LSA          LDA           NNMF
1    windows 32.2      ax 5.25        use 2.2
1    thanks 30.84    know 1.24       mac 1.42
1      card 20.73    just 1.15  software 1.26
1     drive 19.48    like 1.12     apple 0.97
1       dos 17.02     don 1.08      like 0.97
1       use 16.74    does 1.06      need 0.88
1      file 16.07  thanks 1.05   problem 0.87
1  software 16.06   think 0.98     modem 0.86
1      mail 15.78     use 0.91      used 0.84
1        pc 15.53  people 0

In [9]:
print('LSA (Latent Semantic Analysis) runtime: {} \n'.format(elapsed_lsa))
print('LDA (Latent Dirichlet Allocation) runtime: {} \n'.format(elapsed_lda))
print('NNMF (Non-Negative Matrix Factorization) runtime: {} '.format(elapsed_nnmf))

LSA (Latent Semantic Analysis) runtime: 5.2903826422031175 

LDA (Latent Dirichlet Allocation) runtime: 1516.9622960999652 

NNMF (Non-Negative Matrix Factorization) runtime: 38.7235158709434 
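
The challenge asks for a comparison against the known newsgroup of each post. One possible way to quantify this, sketched here under stated assumptions, is cluster purity: assign each document to its highest-scoring topic, then count how many documents belong to the most common true label within their topic. The functions `dominant_topic` and `purity` and the toy data are hypothetical helpers for illustration, not part of the original analysis.

```python
from collections import Counter, defaultdict

def dominant_topic(row):
    """Index of the topic with the highest score for one document."""
    return max(range(len(row)), key=lambda k: row[k])

def purity(doc_topic, labels):
    """Fraction of documents whose true label is the majority label of their assigned topic."""
    labels_by_topic = defaultdict(Counter)
    for row, label in zip(doc_topic, labels):
        labels_by_topic[dominant_topic(row)][label] += 1
    majority_total = sum(counts.most_common(1)[0][1]
                         for counts in labels_by_topic.values())
    return majority_total / len(labels)

# Made-up document-topic scores and true labels.
toy_doc_topic = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
toy_labels = ["rec.autos", "rec.autos", "sci.space", "sci.space"]

toy_score = purity(toy_doc_topic, toy_labels)
print(toy_score)  # 0.75: three of the four documents match their topic's majority label
```

In this notebook the same function could be called as, for example, `purity(twenty_newsgroups_nmf, twenty_newsgroups.target)`. LSA scores can be negative, so taking absolute values before picking the dominant topic may be more appropriate for that model.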
