# Session 7 - Topic extraction

#### For this notebook you will need to be running a recent version of scikit-learn.
Make sure you have at least version 0.17.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances, cosine_similarity
import warnings
# Suppress deprecation warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

import nltk
pd.set_option('display.max_colwidth', 150000) #important for getting all the text



### Topic extraction and application 


#### Let's talk about Non-negative Matrix Factorization.

Our vector spaces are also matrices and we can split them (factor them) into two components in which all the entries are positive or zero (non-negative). Why?

What if the thing we are trying to get at isn't actually represented by the numbers in our matrix?  What if they represent some unobserved ("latent") characteristic that we want to approximate?
Example:  Movie reviews - the number you give to a particular movie may reflect your feelings about the actors, director, or genre as much as the individual movie.
For this analysis: the topic you are looking to extract and apply to a particular document may not be represented by a word that actually appears in that document.  (Remember the "base", "hit", "homer" example.)

By factoring the matrix, we can fill in the matrix so it isn't so sparse and we can better apply the labels that capture some of the latent features for a document.

Math: http://www.slideshare.net/BenjaminBengfort/non-negative-matrix-factorization

Another example: https://de.dariah.eu/tatom/topic_model_python.html

https://robinsones.github.io/Topic-Modeling-the-New-York-Times-and-Trump/

In [2]:
#using the example math from above to show how the factorization works
import numpy

def matrix_factorization(R, P, Q, K, steps=5000, alpha=0.0002, beta=0.02):
    Q = Q.T
    for step in range(steps):
        for i in range(len(R)):
            for j in range(len(R[i])):
                if R[i][j] > 0:
                    eij = R[i][j] - numpy.dot(P[i,:],Q[:,j])
                    for k in range(K):
                        P[i][k] = P[i][k] + alpha * (2 * eij * Q[k][j] - beta * P[i][k])
                        Q[k][j] = Q[k][j] + alpha * (2 * eij * P[i][k] - beta * Q[k][j])
        eR = numpy.dot(P,Q)
        e = 0
        for i in range(len(R)):
            for j in range(len(R[i])):
                if R[i][j] > 0:
                    e = e + pow(R[i][j] - numpy.dot(P[i,:],Q[:,j]), 2)
                    for k in range(K):
                        e = e + (beta/2) * (pow(P[i][k],2) + pow(Q[k][j],2))
        if e < 0.001:
            break
    return P, Q.T
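
Read as math, the loop above performs gradient descent on a regularized squared error, one observed entry at a time. The block below is only a restatement of that code, writing $r_{ij}$ for `R[i][j]`, $p_{ik}$ for `P[i][k]`, and $q_{kj}$ for `Q[k][j]` after the transpose. Entries with $r_{ij} = 0$ are skipped, so zeros are treated as missing values rather than as true zeros, and nothing in these updates forces the factors to stay non-negative.

$$
\begin{aligned}
e_{ij} &= r_{ij} - \sum_{k=1}^{K} p_{ik}\, q_{kj} \\
p_{ik} &\leftarrow p_{ik} + \alpha \left( 2\, e_{ij}\, q_{kj} - \beta\, p_{ik} \right) \\
q_{kj} &\leftarrow q_{kj} + \alpha \left( 2\, e_{ij}\, p_{ik} - \beta\, q_{kj} \right) \\
E &= \sum_{(i,j)\,:\, r_{ij} > 0} \left[ e_{ij}^{2} + \frac{\beta}{2} \sum_{k=1}^{K} \left( p_{ik}^{2} + q_{kj}^{2} \right) \right]
\end{aligned}
$$

The loop stops early once $E < 0.001$; otherwise it runs for the full number of `steps`.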


In [3]:
R = [
     [5,3,0,1],
     [4,0,0,1],
     [1,1,0,5],
     [1,0,0,4],
     [0,1,5,4],
    ]
 
R = np.array(R)

N = len(R)
M = len(R[0])
K = 2
 
P = np.random.rand(N,K)
Q = np.random.rand(M,K)
 
nP, nQ = matrix_factorization(R, P, Q, K)
nR = np.dot(nP, nQ.T)


In [4]:
print(R)
print("~~~~~~~~~~~~~~~~")
print(P)
print("~~~~~~~~~~~~~~~~")
print(Q)
print("~~~~~~~~~~~~~~~~")
print(nR)

[[5 3 0 1]
 [4 0 0 1]
 [1 1 0 5]
 [1 0 0 4]
 [0 1 5 4]]
~~~~~~~~~~~~~~~~
[[ 2.34270556 -0.19845106]
 [ 1.866818   -0.06908812]
 [ 0.8207655   1.96608698]
 [ 0.70065436  1.56158232]
 [ 1.15613634  1.46313113]]
~~~~~~~~~~~~~~~~
[[ 2.10763731 -0.33086903]
 [ 1.23488645 -0.11099837]
 [ 2.23750482  1.58118649]
 [ 0.61819314  2.26703334]]
~~~~~~~~~~~~~~~~
[[5.00323494 2.91500309 4.92802685 0.99834933]
 [3.95743438 2.31297692 4.06777308 0.99742902]
 [1.0793587  0.79531974 4.94521691 4.96457632]
 [0.96004605 0.69189549 4.03687038 3.9732989 ]
 [1.95261132 1.26529193 4.90034381 4.03168261]]
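
Notice that the factors printed above contain negative entries, so that routine is matrix factorization but not strictly *non-negative* matrix factorization. One common way to keep every entry at zero or above is the multiplicative update rule of Lee and Seung. The function below is a minimal sketch under some assumptions: the name `nmf_multiplicative` and its parameters are made up for illustration, and unlike the routine above it treats zeros as real zeros instead of missing values, so the reconstruction will look different.

```python
import numpy as np

def nmf_multiplicative(V, k, steps=500, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (n x m) into W (n x k) and H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k))   # non-negative random start
    H = rng.random((k, m))
    for _ in range(steps):
        # Each update multiplies by a non-negative ratio, so signs never flip
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)

W, H = nmf_multiplicative(V, k=2)
print(np.round(W, 2))
print(np.round(H, 2))
print(np.round(W @ H, 2))   # rank-2 approximation of V
```

Because every factor stays non-negative, each row of `H` can be read as a set of additive weights, which is what makes the "topic" interpretation later in the notebook possible.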


In [5]:
#what about our group text example?


friend1 = "Machine learning is super fun"
friend2 = "Python is super, super cool"
friend3 = "Statistics is cool, too"
friend4 = "Fun? Data science is more than fun"
friend5 = "Python is great for machine learning"
friend6 = "I like football"
friend7 = "Football is great to watch"
textStr = [friend1, friend2, friend3, friend4, friend5, friend6, friend7]
print(textStr)




cvnorm = TfidfVectorizer(binary=False, stop_words='english', use_idf = False) #default is to normalize
cvnorm_dm = cvnorm.fit_transform(textStr)
print(cvnorm_dm.toarray())



['Machine learning is super fun', 'Python is super, super cool', 'Statistics is cool, too', 'Fun? Data science is more than fun', 'Python is great for machine learning', 'I like football', 'Football is great to watch']
[[0.         0.         0.         0.5        0.         0.5
  0.         0.5        0.         0.         0.         0.5
  0.        ]
 [0.40824829 0.         0.         0.         0.         0.
  0.         0.         0.40824829 0.         0.         0.81649658
  0.        ]
 [0.70710678 0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.70710678 0.
  0.        ]
 [0.         0.40824829 0.         0.81649658 0.         0.
  0.         0.         0.         0.40824829 0.         0.
  0.        ]
 [0.         0.         0.         0.         0.5        0.5
  0.         0.5        0.5        0.         0.         0.
  0.        ]
 [0.         0.         0.70710678 0.         0.         0.
  0.70710678 0.         0.         0.   

In [6]:
R = numpy.array(cvnorm_dm.toarray())

N = len(R)
M = len(R[0])
K = 2
 
P = numpy.random.rand(N,K)
Q = numpy.random.rand(M,K)
 
nP, nQ = matrix_factorization(R, P, Q, K)
nR = numpy.dot(nP, nQ.T)
print(nR)

[[0.30308779 0.25060509 0.67170275 0.47417101 0.38702892 0.54861057
  0.35659203 0.50594075 0.16080538 0.44223947 0.2960322  0.44683018
  0.5697743 ]
 [0.53422802 0.57300349 0.887964   0.8535573  0.76010818 0.90628848
  0.75535629 0.7459321  0.44368831 0.7104549  0.64273132 0.7864986
  0.86362123]
 [0.55153099 0.63650945 0.81538559 0.88728853 0.81140593 0.91485936
  0.82324088 0.72015791 0.51292348 0.70982695 0.70495462 0.8115983
  0.8434306 ]
 [0.38636829 0.49558468 0.45918612 0.62830632 0.59791199 0.61791982
  0.62470851 0.44929942 0.41997118 0.47112974 0.53961872 0.56814185
  0.53761515]
 [0.34859759 0.45731127 0.39135816 0.56826168 0.54550008 0.55280873
  0.57346641 0.39407375 0.39133454 0.41972211 0.49623907 0.51251662
  0.47415693]
 [0.44571866 0.50116502 0.68877812 0.71526934 0.64788398 0.7454585
  0.65252124 0.59669072 0.39837034 0.58060256 0.55752105 0.65600161
  0.69579184]
 [0.36868983 0.40717011 0.58639171 0.59065719 0.53153429 0.62004298
  0.53261999 0.50177411 0.32051109 

In [7]:
#transforms the matrix so that it can extract feature names that may apply to the topic in a document
# even if that feature name doesn't appear in that document.

#But you don't need to do the math directly
from sklearn.decomposition import NMF
n_topics = 3
n_top_words = 2

# Fit the NMF model
nmf = NMF(n_components=n_topics, random_state=1).fit(cvnorm_dm)


names_texts = cvnorm.get_feature_names()

print(names_texts)


['cool', 'data', 'football', 'fun', 'great', 'learning', 'like', 'machine', 'python', 'science', 'statistics', 'super', 'watch']


In [8]:
print(type(nmf.components_))
print(len(nmf.components_))
print(nmf.components_)
text_components = nmf.components_

<class 'numpy.ndarray'>
3
[[0.         0.16521867 0.         0.66965112 0.21701335 0.58223715
  0.         0.58223715 0.30339442 0.16521867 0.         0.45984276
  0.        ]
 [0.         0.         0.78442041 0.         0.40568026 0.01598515
  0.42295974 0.01598515 0.03754171 0.         0.         0.
  0.36146067]
 [0.6775754  0.         0.         0.         0.         0.
  0.         0.         0.15622711 0.         0.48415424 0.33782021
  0.        ]]


In [9]:
# nice function for printing topic information - 
# https://stackoverflow.com/questions/34429635/topic-modelling-assign-a-document-with-top-2-topics-as-category-label-sklear

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

    

In [10]:
print("\nTopics for texts in NMF model:")
print_top_words(nmf, names_texts, n_top_words)


Topics for texts in NMF model:
Topic #0:
fun machine
Topic #1:
football like
Topic #2:
cool statistics
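
The slice `[:-n_top_words - 1:-1]` inside `print_top_words` is compact but easy to misread. Here is a small sketch with made-up numbers (the `words` and `weights` arrays are illustrative and do not come from the model above) showing that it picks the indices of the largest weights, largest first.

```python
import numpy as np

words = np.array(["cool", "fun", "like", "super"])
weights = np.array([0.1, 0.7, 0.0, 0.4])
n_top_words = 2

order = weights.argsort()              # indices from smallest to largest weight
top = order[:-n_top_words - 1:-1]      # walk backwards from the end, n_top_words steps

print(order)        # [2 0 3 1]
print(top)          # [1 3]
print(words[top])   # ['fun' 'super']
```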



In [11]:
# a "real" example - let's torture the news articles again

newsdf = pd.read_csv("nytimes2013.csv", index_col = 0) 
print(newsdf.shape)
print(list(newsdf))

(3848, 5)
['date', 'description', 'headline', 'url', 'text']


In [12]:
# a little preprocessing
import re
news_dict = {'united states':'usa','states':'state', 'years':'year',
             'new york': 'ny', 'republicans':'republican', 'schools':'school',
            'companies':'company', 'de blasio':'nymayor'}



def multiple_replace(dict, text): 

  """ Replace in 'text' all occurrences of any key in the given
  dictionary by its corresponding value.  Returns the new string.""" 
  text = str(text).lower()

  # Create a regular expression  from the dictionary keys
  regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys())))

  # For each match, look-up corresponding value in dictionary
  return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)

newsdf['cleantext'] = newsdf.text.apply(lambda x: multiple_replace(news_dict, x))
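
One thing to watch with this kind of replacement: the pattern matches anywhere in the text, so a key like `states` would also match inside a longer word such as "estates". The variant below is a sketch of one possible guard, not a change to the cell above. The name `replace_whole_words` and the sample sentence are made up for illustration; it adds word boundaries and tries longer keys first.

```python
import re

replacements = {'united states': 'usa', 'states': 'state', 'new york': 'ny'}

def replace_whole_words(mapping, text):
    """Replace whole-word occurrences of each key in `mapping` with its value."""
    text = str(text).lower()
    # Longest keys first so 'united states' is tried before 'states'
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(%s)\b" % "|".join(map(re.escape, keys)))
    return pattern.sub(lambda mo: mapping[mo.group(0)], text)

result = replace_whole_words(replacements,
                             "The United States has 50 states and many estates.")
print(result)   # the usa has 50 state and many estates.
```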


In [13]:
from sklearn.feature_extraction import text 
from nltk.corpus import stopwords

skl_stopwords = list(text.ENGLISH_STOP_WORDS)

#stop words didn't get saved, need to recreate that list
my_stopwords = skl_stopwords + ["mr", "ms", "say","said", "0", '000','10', '100', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '200', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '21', '22', '23', '24', '25', '27', '30', '300', '35', '40', '45', '50', '500', '60', '70', '80']



tfidf = TfidfVectorizer(lowercase=True, 
                        stop_words= my_stopwords, 
                        max_df=0.95, 
                        min_df=0.05) 

# fit and transform text

tfidf_dm = tfidf.fit_transform(newsdf['cleantext'])

#print matrix shape(s)
print(tfidf_dm.shape)
names_news = tfidf.get_feature_names()
print(type(names_news), len(names_news))


(3848, 1110)
<class 'list'> 1110


In [14]:
n_topics = 10
n_top_words = 5

# Fit the NMF model
nmf = NMF(n_components=n_topics, random_state=1).fit(tfidf_dm)
news_components_10_5 = nmf.components_

names_news = tfidf.get_feature_names()
print(len(names_news))

print("\nTopics for news in NMF model:")
print_top_words(nmf, names_news, n_top_words)

1110

Topics for news in NMF model:
Topic #0:
like just year time don
Topic #1:
military obama american officials government
Topic #2:
game team season players games
Topic #3:
republican senate house democrats senator
Topic #4:
company percent industry market year
Topic #5:
city mayor ny police brooklyn
Topic #6:
court justice case state law
Topic #7:
school students education children college
Topic #8:
health insurance care state law
Topic #9:
dr women study medical research



### This is where you decipher these collections of words to decide whether they make sense.  Maybe 10 topics is too many - are any of these "overlapping"?  Is 5 words the right number to capture distinctions?

Let's try to put labels on the different collections:
* Topic 0: 
* Topic 1: 
* Topic 2: 
* Topic 3: 
* Topic 4: 
* Topic 5: 
* Topic 6: 
* Topic 7:
* Topic 8: 
* Topic 9: 


In [15]:
# try different configurations
# fewer topics, lots of words
n_topics = 5
n_top_words = 20

# Fit the NMF model
nmf = NMF(n_components=n_topics, random_state=1).fit(tfidf_dm)
news_components_5_20 = nmf.components_

names_news = tfidf.get_feature_names()
print(len(names_news))

print("\nTopics for news in NMF model:")
print_top_words(nmf, names_news, n_top_words)

1110

Topics for news in NMF model:
Topic #0:
like year just time old family people don mother life didn know way home day good ny ve told make
Topic #1:
military obama american officials government usa president weapons security administration intelligence official war foreign al forces nations china minister attack
Topic #2:
game team season players games league play coach teams player fans field ball year win world second played playing goal
Topic #3:
republican senate house democrats senator obama vote party legislation president state law congress conservative democrat democratic budget representative health spending
Topic #4:
company percent city state health school year federal people students million new dr law care insurance workers ny industry public



### You can see that sports, politics, government, and our wild card category are pretty much the same but now business, local, and health are lumped together. 

In [16]:
#more topics, fewer words
n_topics = 20
n_top_words = 5

# Fit the NMF model
nmf = NMF(n_components=n_topics, random_state=1).fit(tfidf_dm)
news_components_20_5 = nmf.components_

names_news = tfidf.get_feature_names()
print(len(names_news))

print("\nTopics for news in NMF model:")
print_top_words(nmf, names_news, n_top_words)

1110

Topics for news in NMF model:
Topic #0:
like don just people think
Topic #1:
military government american weapons officials
Topic #2:
game team games season play
Topic #3:
republican senate democrats senator house
Topic #4:
company data industry technology information
Topic #5:
city mayor ny campaign brooklyn
Topic #6:
court justice judge case law
Topic #7:
school students education college student
Topic #8:
health insurance care law plans
Topic #9:
dr study medical research university
Topic #10:
police officers department crime enforcement
Topic #11:
food oil water add restaurant
Topic #12:
building street apartment park buildings
Topic #13:
obama president administration white house
Topic #14:
percent market tax year economy
Topic #15:
women men woman sex book
Topic #16:
mother family father children parents
Topic #17:
china usa party north countries
Topic #18:
state new jersey laws federal
Topic #19:
players league season player team



### Once you are happy with the topic assignment, it's time to label each document with the appropriate topic.

#### What does "appropriate" mean?  It's the label CLOSEST to the document content.  

Yup, this is where all that distance and similarity stuff comes in.  We'll use those calculations to decide which topic label we want to apply to each document.

(Why are we doing this?  Remember, classification is one of the reasons we do data mining, and this is a form of classification.  We can also use it for prediction: build predictive models that assign topics to new documents based on the model built from this training dataset.)
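
To make "closest" concrete, here is a minimal sketch on made-up data: the vocabulary, `doc_term`, and `topics` arrays are hypothetical stand-ins for a document-term matrix and a model's `components_`. It uses scikit-learn's pairwise `cosine_similarity`, which works on whole 2-D matrices, to compare every document with every topic in one call and then takes the best match per row.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity as pairwise_cosine

# Hypothetical toy data: 3 documents and 2 topics over a 4-word vocabulary
feature_names = ["ball", "game", "senate", "vote"]
doc_term = np.array([[0.7, 0.7, 0.0, 0.0],
                     [0.0, 0.1, 0.6, 0.8],
                     [0.6, 0.6, 0.2, 0.1]])
topics = np.array([[0.9, 0.8, 0.0, 0.1],
                   [0.0, 0.1, 0.7, 0.9]])

sims = pairwise_cosine(doc_term, topics)   # shape (n_docs, n_topics)
best = sims.argmax(axis=1)                 # most similar topic for each document

def topic_label(topic, n_top_words=2):
    """Join the top-weighted words of one topic into a label string."""
    top = topic.argsort()[:-n_top_words - 1:-1]
    return " ".join(feature_names[i] for i in top)

labels = [topic_label(topics[t]) for t in best]
print(labels)   # ['ball game', 'vote senate', 'ball game']
```

Working on the whole matrix at once avoids a Python-level loop over documents, which tends to matter once the corpus has thousands of rows.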

In [17]:
import math

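#define a function for cosine similarity between two 1-D vectors - sklearn's version expects 2-D arrays
#note: this replaces the cosine_similarity imported from sklearn at the top of the notebook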
def cosine_similarity(a, b):
    return sum([i*j for i,j in zip(a, b)])/(math.sqrt(sum([i*i for i in a]))* math.sqrt(sum([i*i for i in b])))




In [18]:
#let's make a  dataframe
textdf = pd.DataFrame(textStr)
textdf.columns = ['text']
print(textdf)

                                   text
0         Machine learning is super fun
1           Python is super, super cool
2               Statistics is cool, too
3    Fun? Data science is more than fun
4  Python is great for machine learning
5                       I like football
6            Football is great to watch


In [19]:
# define a function for determining the most similar topic
# credit Leo Ji

def topic_sim(arr, feature_names, n_top_words, topics):
    """
    @type  arr: array of number
    @param arr: One document's row from the document-term matrix.
    @type  feature_names: array of string
    @param feature_names: The array of feature names.
    @type  n_top_words: number
    @param n_top_words: The number of top words used to label the topic.
    @type  topics: array of array of number
    @param topics: Topic-by-word weights from topic extraction (e.g. model.components_).
    
    @rtype:   string
    @return:  Top words of the most similar topic, separated by spaces.
    """
    top_sim = 0
    top_topic = np.array([])
    # iterate over topics
    for idx, topic in enumerate(topics):
        # calculate cosine similarity and keep the topic with the largest value
        # (if you switch to euclidean_distances, keep the smallest distance instead)
        sim = cosine_similarity(arr, topic)
        if sim > top_sim:
            top_sim = sim
            top_topic = topic
    
    # argsort sorts in ascending order, so take the last n_top_words indices (largest first)
    selected_topic_index = top_topic.argsort()[:-n_top_words-1:-1]
    # return the text feature names by indexing back into feature_names (assigned earlier)   
    return " ".join([feature_names[i] for i in selected_topic_index])




In [20]:
# create a vector of topic labels that can be appended to the original dataframe
textdf['nmf_topics'] = np.ma.apply_along_axis(topic_sim, axis=1, 
        arr=cvnorm_dm.toarray(), feature_names=cvnorm.get_feature_names(), n_top_words=2, topics=text_components)
textdf

Unnamed: 0,text,nmf_topics
0,Machine learning is super fun,fun machine
1,"Python is super, super cool",cool statistics
2,"Statistics is cool, too",cool statistics
3,Fun? Data science is more than fun,fun machine
4,Python is great for machine learning,fun machine
5,I like football,football like
6,Football is great to watch,football like


In [21]:
# let's apply this to the news example - this takes some time
import time

t0 = time.time()



#apply most similar topic to each document
newsdf['topics_20_5'] = np.ma.apply_along_axis(topic_sim, axis=1, 
        arr=tfidf_dm.toarray(), feature_names=names_news, n_top_words=5, topics=news_components_20_5)
t1 = time.time()

t1-t0

553.9061460494995

In [22]:
# count topics and view dataframe

newsdf['topics_20_5'].value_counts()



like don just people think                        751
military government american weapons officials    317
game team games season play                       271
republican senate democrats senator house         213
company data industry technology information      209
percent market tax year economy                   204
building street apartment park buildings          198
mother family father children parents             179
food oil water add restaurant                     178
city mayor ny campaign brooklyn                   175
players league season player team                 150
court justice judge case law                      141
obama president administration white house        140
dr study medical research university              139
school students education college student         125
health insurance care law plans                   116
police officers department crime enforcement      111
state new jersey laws federal                      93
women men woman sex book    

In [23]:
newsdf[['description','topics_20_5']].head(10)

Unnamed: 0,description,topics_20_5
0,"Ending a climactic showdown in the final hours of the 112th Congress, the House sent to President Obama legislation to avert big income tax increases on most Americans.",republican senate democrats senator house
1,A report on nearly three million people found that those whose body mass index ranked them as overweight had less risk of dying than people of normal weight.,dr study medical research university
2,"As the United States prepares to withdraw from an unpopular war in Afghanistan, it faces challenges similar to what the country’s last occupier, the Soviet Union, had experienced.",military government american weapons officials
3,"The popularity of the drinks reflects success in convincing consumers that they provide an edge, but most of their ingredients have no or little benefit, research shows.",dr study medical research university
4,"New Hampshire, which again chose a woman to be governor, will also become the first state in history to have an all-female delegation in Washington.",women men woman sex book
5,"For a 65th birthday, a first burial pitch.",like don just people think
6,"The United Nations’ human rights chief, Navi Pillay, on Wednesday voiced dismay over an analysis that far exceeds earlier estimates of the toll in 22-month-old war.",military government american weapons officials
7,The depth of the anger that followed the House’s refusal to take up a package of assistance for Hurricane Sandy victims was extraordinary and exceedingly personal.,republican senate democrats senator house
8,How the breakout star (and daughter of David) turned “Girls” from a show about a trio to one about a quartet.,like don just people think
9,Three memorable television characters who dressed for success (and sometimes to excess).,women men woman sex book


### There are other methods for doing topic extraction

#### Latent Dirichlet Allocation
* Introduction: http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/

LDA assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution, like the ones in our walkthrough model. In other words, LDA assumes a document is made from the following steps:

* Determine the number of words in a document. Let’s say our document has 6 words.
* Determine the mixture of topics in that document. For example, the document might contain 1/2 the topic “health” and 1/2 the topic “vegetables.”
* Using each topic’s multinomial distribution, output words to fill the document’s word slots. In our example, the “health” topic is 1/2 our document, or 3 words. The “health” topic might have the word “diet” at 20% probability or “exercise” at 15%, so it will fill the document word slots based on those probabilities.
* Given this assumption of how documents are created, LDA backtracks and tries to figure out what topics would create those documents in the first place.
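
The generative steps above can be sketched in a few lines. Everything in this block is a made-up illustration: the vocabulary, the two topic distributions, and the 50/50 mixture are hypothetical, and the document length and topic mixture are fixed by hand here, whereas a full LDA model would draw them from probability distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical topics: each one is a probability distribution over a tiny vocabulary
vocab = ["diet", "exercise", "broccoli", "carrot"]
topic_word = {
    "health":     np.array([0.50, 0.40, 0.05, 0.05]),
    "vegetables": np.array([0.05, 0.05, 0.50, 0.40]),
}
topic_names = list(topic_word)

n_words = 6                            # step 1: number of words in the document
doc_topic_mix = np.array([0.5, 0.5])   # step 2: half "health", half "vegetables"

document = []
for _ in range(n_words):
    # step 3: pick a topic for this word slot, then a word from that topic
    topic = topic_names[rng.choice(len(topic_names), p=doc_topic_mix)]
    word = vocab[rng.choice(len(vocab), p=topic_word[topic])]
    document.append(word)

print(document)
```

Fitting LDA runs this story in reverse: given only documents like `document`, it estimates the topic-word distributions and each document's mixture.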

(From: https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html)

Another version of the same example: https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/

Both use the gensim package rather than sklearn.

Remember this example?  http://brandonrose.org/clustering

In [24]:
from sklearn.decomposition import LatentDirichletAllocation
n_topics = 3
n_top_words = 2

# Fit the LDA model
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=5,
                                learning_method='online', learning_offset=50.,
                                random_state=0)

lda.fit(cvnorm_dm)

feature_names = cvnorm.get_feature_names()

#what are the topics for this corpus?
print("\nTopics for texts in LDA model:")
print_top_words(lda, feature_names, n_top_words)


Topics for texts in LDA model:
Topic #0:
cool football
Topic #1:
fun data
Topic #2:
watch super



In [25]:
%%time
#apply most similar topic to each document
textdf['lda_topics'] = np.ma.apply_along_axis(topic_sim, axis=1, 
        arr=cvnorm_dm.toarray(), feature_names=cvnorm.get_feature_names(), n_top_words=2, topics=lda.components_)

CPU times: user 3.37 ms, sys: 562 µs, total: 3.93 ms
Wall time: 4.2 ms


In [26]:
textdf

Unnamed: 0,text,nmf_topics,lda_topics
0,Machine learning is super fun,fun machine,watch super
1,"Python is super, super cool",cool statistics,cool football
2,"Statistics is cool, too",cool statistics,cool football
3,Fun? Data science is more than fun,fun machine,fun data
4,Python is great for machine learning,fun machine,fun data
5,I like football,football like,cool football
6,Football is great to watch,football like,watch super


In [27]:
#and the news stories?
from sklearn.decomposition import LatentDirichletAllocation
n_topics = 10
n_top_words = 5

# Fit the LDA model
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=5,
                                learning_method='online', learning_offset=50.,
                                random_state=0)

lda.fit(tfidf_dm)

feature_names = tfidf.get_feature_names()

print("\nTopics for news in LDA model:")
print_top_words(lda, feature_names, n_top_words)


Topics for news in LDA model:
Topic #0:
women china children company college
Topic #1:
officers police says government department
Topic #2:
cases year chicago american office
Topic #3:
republican senate state health obama
Topic #4:
company market year like people
Topic #5:
military government police obama officials
Topic #6:
music energy year dr child
Topic #7:
year like team people just
Topic #8:
city year company women school
Topic #9:
republican vote director senate american



In [28]:
%%time
n_topics = 20
n_top_words = 5

# Fit the model
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=50,
                                learning_method='online', learning_offset=50.,
                                random_state=0)

lda.fit(tfidf_dm)

feature_names = tfidf.get_feature_names()

print("\nTopics for news in LDA model:")
print_top_words(lda, feature_names, n_top_words)


Topics for news in LDA model:
Topic #0:
children dr child school city
Topic #1:
mayor year games children health
Topic #2:
year like team game just
Topic #3:
love state daughter building million
Topic #4:
teams research report defense percent
Topic #5:
military officials women government rules
Topic #6:
music base space year place
Topic #7:
company story city death car
Topic #8:
building buildings street apartment west
Topic #9:
republican senate vote house book
Topic #10:
city mayor people says bar
Topic #11:
american police intelligence security attack
Topic #12:
father table class china mother
Topic #13:
city air rights people age
Topic #14:
building house president obama white
Topic #15:
republican state government obama year
Topic #16:
read ny job neighborhood tell
Topic #17:
minister prime private president leader
Topic #18:
school like year bar students
Topic #19:
company north french women students

CPU times: user 1min 29s, sys: 1.91 s, total: 1min 30s
Wall time: 53.7 s


In [29]:
%%time
#apply most similar topic to each document
t0 = time.time()
newsdf['lda_topics'] = np.ma.apply_along_axis(topic_sim, axis=1, 
        arr=tfidf_dm.toarray(), feature_names=tfidf.get_feature_names(), n_top_words=4, topics=lda.components_)

t1 = time.time()

t1-t0

CPU times: user 7min 20s, sys: 11.1 s, total: 7min 31s
Wall time: 9min 37s


577.3352110385895

In [30]:
newsdf['lda_topics'].value_counts()


year like team game                      1978
republican state government obama        1745
school like year bar                       43
building buildings street apartment        42
children dr child school                   12
city mayor people says                      8
city air rights people                      6
american police intelligence security       5
company north french women                  3
mayor year games children                   2
teams research report defense               2
minister prime private president            1
republican senate vote house                1
Name: lda_topics, dtype: int64

In [31]:
newsdf[['description','topics_20_5','lda_topics']].head(10)

Unnamed: 0,description,topics_20_5,lda_topics
0,"Ending a climactic showdown in the final hours of the 112th Congress, the House sent to President Obama legislation to avert big income tax increases on most Americans.",republican senate democrats senator house,republican state government obama
1,A report on nearly three million people found that those whose body mass index ranked them as overweight had less risk of dying than people of normal weight.,dr study medical research university,year like team game
2,"As the United States prepares to withdraw from an unpopular war in Afghanistan, it faces challenges similar to what the country’s last occupier, the Soviet Union, had experienced.",military government american weapons officials,republican state government obama
3,"The popularity of the drinks reflects success in convincing consumers that they provide an edge, but most of their ingredients have no or little benefit, research shows.",dr study medical research university,republican state government obama
4,"New Hampshire, which again chose a woman to be governor, will also become the first state in history to have an all-female delegation in Washington.",women men woman sex book,republican state government obama
5,"For a 65th birthday, a first burial pitch.",like don just people think,year like team game
6,"The United Nations’ human rights chief, Navi Pillay, on Wednesday voiced dismay over an analysis that far exceeds earlier estimates of the toll in 22-month-old war.",military government american weapons officials,republican state government obama
7,The depth of the anger that followed the House’s refusal to take up a package of assistance for Hurricane Sandy victims was extraordinary and exceedingly personal.,republican senate democrats senator house,republican state government obama
8,How the breakout star (and daughter of David) turned “Girls” from a show about a trio to one about a quartet.,like don just people think,year like team game
9,Three memorable television characters who dressed for success (and sometimes to excess).,women men woman sex book,year like team game


## Singular value decomposition and LSA

####  LSA code and diagrams borrowed from Will Stanton (with many thanks): http://www.williamgstanton.com/ - unfortunately, the slides have disappeared from his website.

Other suggested reading:
* http://lsa.colorado.edu/papers/dp1.LSAintro.pdf
* http://mccormickml.com/2016/03/25/lsa-for-text-classification-tutorial/


Remember, SVD is a dimensionality reduction technique similar to PCA.  Here we use it to reduce the feature space (get rid of zeros similar to the NMF) and then extract "important" pieces.
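
As a rough illustration of what "reduce the feature space" means, the sketch below runs a plain SVD with NumPy on a small made-up matrix `A` (not one of the notebook's datasets) and keeps only the two largest singular values. The variable names are illustrative. The truncated product `A_k` is the best rank-2 approximation of `A` in the least-squares sense, and `doc_coords` gives each row's coordinates in the reduced space, which is the kind of output a truncated SVD transform is expected to return.

```python
import numpy as np

# Made-up 4 x 4 "document-term" matrix with two obvious groups of rows
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.9, 0.0, 0.1],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.1, 1.0, 0.9]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                          # number of components to keep
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation of A
doc_coords = U[:, :k] * s[:k]                  # each row expressed in k components

print(np.round(s, 3))        # singular values, largest first
print(np.round(A_k, 2))      # close to A, but built from only 2 components
print(np.round(doc_coords, 2))
```

Note that, unlike NMF, the components can contain negative values, which is why the LSA tables in this section have negative entries.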

In [32]:
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer

In [33]:
# Fit LSA. Use algorithm='randomized' for large datasets
# apply to our text examples again

lsa = TruncatedSVD(3, algorithm = 'arpack')
cvnorm_lsa = lsa.fit_transform(cvnorm_dm)
cvnorm_lsa = Normalizer(copy=False).fit_transform(cvnorm_lsa)

* Each LSA component is a linear combination of words 

In [34]:
pd.DataFrame(lsa.components_,index = ["component_1","component_2","component_3"],columns = cvnorm.get_feature_names())

Unnamed: 0,cool,data,football,fun,great,learning,like,machine,python,science,statistics,super,watch
component_1,0.205953,0.083666,0.136799,0.392147,0.274684,0.411709,0.049008,0.411709,0.319064,0.083666,0.073782,0.489158,0.087791
component_2,-0.143532,-0.059996,0.698419,-0.198971,0.384218,-0.021209,0.371972,-0.021209,-0.009318,-0.059996,-0.076444,-0.213156,0.326448
component_3,0.621099,-0.183674,0.056415,-0.474265,-0.015689,-0.138379,0.040643,-0.138379,0.142182,-0.183674,0.447455,0.24037,0.015772


* Each document is a linear combination of the LSA components

In [35]:
pd.DataFrame(cvnorm_lsa, index = textStr, columns = ["component_1","component_2","component_3"])

Unnamed: 0,component_1,component_2,component_3
Machine learning is super fun,0.928146,-0.24748,-0.278028
"Python is super, super cool",0.738578,-0.284539,0.61118
"Statistics is cool, too",0.248376,-0.195317,0.948768
Fun? Data science is more than fun,0.558296,-0.303859,-0.771994
Python is great for machine learning,0.968419,0.227201,-0.102684
I like football,0.170351,0.981357,0.088985
Football is great to watch,0.333741,0.941908,0.037766


In [36]:
xs = [w[0] for w in cvnorm_lsa]
ys = [w[1] for w in cvnorm_lsa]
xs, ys

([0.9281456333640472,
  0.7385777114863382,
  0.24837620403530086,
  0.5582961773012041,
  0.9684191697560601,
  0.17035099239586765,
  0.3337414717255542],
 [-0.24747973299406437,
  -0.28453856431053604,
  -0.1953166236526416,
  -0.3038592430937842,
  0.2272010915726264,
  0.9813573449925506,
  0.9419078180327609])

## Document similarity using LSA

https://technowiki.wordpress.com/2011/08/27/latent-semantic-analysis-lsa-tutorial/

In [38]:
# Compute document similarity using LSA components
# (rows of cvnorm_lsa are L2-normalized, so each dot product is a cosine similarity)
similarity = cvnorm_lsa @ cvnorm_lsa.T
pd.DataFrame(similarity,index=textStr, columns=textStr).head(10)

Unnamed: 0,Machine learning is super fun,"Python is super, super cool","Statistics is cool, too",Fun? Data science is more than fun,Python is great for machine learning,I like football,Football is great to watch
Machine learning is super fun,1.0,0.586,0.015082,0.808015,0.871155,-0.109496,0.066158
"Python is super, super cool",0.586,1.0,0.818888,0.026978,0.587847,-0.099031,0.001567
"Statistics is cool, too",0.015082,0.818888,1.0,-0.534427,0.098733,-0.064938,-0.065245
Fun? Data science is more than fun,0.808015,0.026978,-0.534427,1.0,0.550899,-0.271784,-0.129036
Python is great for machine learning,0.871155,0.587847,0.098733,0.550899,1.0,0.378799,0.533326
I like football,-0.109496,-0.099031,-0.064938,-0.271784,0.378799,1.0,0.984562
Football is great to watch,0.066158,0.001567,-0.065245,-0.129036,0.533326,0.984562,1.0
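Because each row of the LSA table was scaled to unit length, a plain dot product between two rows is already their cosine similarity. The sketch below checks this with three rows copied from the normalized LSA table earlier; the `cosine` helper is only there for comparison and is not part of the notebook.

```python
import numpy as np

# Rows copied from the normalized LSA table (component_1, component_2, component_3).
like_football = np.array([0.170351, 0.981357, 0.088985])    # "I like football"
football_watch = np.array([0.333741, 0.941908, 0.037766])   # "Football is great to watch"
stats_cool = np.array([0.248376, -0.195317, 0.948768])      # "Statistics is cool, too"

def cosine(a, b):
    """Cosine similarity computed the long way, with explicit norms."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

row_norm = float(np.linalg.norm(like_football))    # should be very close to 1
sim_football = float(like_football @ football_watch)
sim_mixed = float(stats_cool @ like_football)

print("norm:", round(row_norm, 4))
print("football pair:", round(sim_football, 4), round(cosine(like_football, football_watch), 4))
print("statistics vs football:", round(sim_mixed, 4))
```

Up to rounding, the two dot products should agree with the corresponding cells of the similarity table above (0.984562 and -0.064938).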


In [39]:
# back to our regularly scheduled feature - use LSA for topic extraction

feature_names = names_texts
n_top_words = 2

print("\nTopics for texts in LSA model:")
print_top_words(lsa, feature_names, n_top_words)


Topics for texts in LSA model:
Topic #0:
super machine
Topic #1:
football great
Topic #2:
cool statistics



In [40]:
# let's add these topic labels to our dataframe

textdf['lsa_topics'] = np.ma.apply_along_axis(topic_sim, axis=1, 
        arr=cvnorm_dm.toarray(), feature_names=cvnorm.get_feature_names(), n_top_words=2, topics=lsa.components_)
textdf

Unnamed: 0,text,nmf_topics,lda_topics,lsa_topics
0,Machine learning is super fun,fun machine,watch super,super machine
1,"Python is super, super cool",cool statistics,cool football,super machine
2,"Statistics is cool, too",cool statistics,cool football,cool statistics
3,Fun? Data science is more than fun,fun machine,fun data,super machine
4,Python is great for machine learning,fun machine,fun data,super machine
5,I like football,football like,cool football,football great
6,Football is great to watch,football like,watch super,football great


## Let's try it on our news articles

In [41]:
n_topics = 20
n_top_words = 5

# Fit LSA. Use algorithm='randomized' for large datasets
lsa = TruncatedSVD(n_topics, algorithm = 'randomized')
tfidf_lsa = lsa.fit_transform(tfidf_dm)
tfidf_lsa = Normalizer(copy=False).fit_transform(tfidf_lsa)

In [42]:

feature_names = tfidf.get_feature_names()

print("\nTopics for news in LSA model:")
print_top_words(lsa, feature_names, n_top_words)


Topics for news in LSA model:
Topic #0:
year like people city new
Topic #1:
republican obama senate government president
Topic #2:
game team season players games
Topic #3:
republican senate democrats house senator
Topic #4:
health company percent state insurance
Topic #5:
city mayor police ny school
Topic #6:
school students court state children
Topic #7:
school students education percent college
Topic #8:
health insurance care city obama
Topic #9:
dr women study research medical
Topic #10:
state china country military percent
Topic #11:
food oil state court obama
Topic #12:
building house street dr apartment
Topic #13:
police food officers government oil
Topic #14:
percent police food tax economy
Topic #15:
women police men students military
Topic #16:
women food company usa china
Topic #17:
china obama police president white
Topic #18:
state intelligence china american usa
Topic #19:
state china players obama police



In [43]:
#apply most similar topic to each document
t0 = time.time()
newsdf['lsa_topics'] = np.ma.apply_along_axis(topic_sim, axis=1, 
        arr=tfidf_dm.toarray(), feature_names=names_news, n_top_words=5, topics=lsa.components_)
t1 = time.time()

t1-t0

537.5284647941589
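Roughly nine minutes is a long time for this step. A likely cause is that the labeling function is called once per row on a dense array. The sketch below shows a vectorized alternative under a stated assumption: that the goal is to give each document the label of the component with the highest cosine similarity. The definition of `topic_sim` is not shown in this section, so this may not reproduce its output exactly; `unit_rows`, `top_word_labels` and `assign_topics` are hypothetical helper names, and the data is a toy example.

```python
import numpy as np

def unit_rows(matrix):
    """Scale each row to unit length, leaving all-zero rows unchanged."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return matrix / norms

def top_word_labels(components, feature_names, n_top_words):
    """Build one label per component from its highest-weighted words."""
    labels = []
    for row in components:
        top = np.argsort(row)[::-1][:n_top_words]
        labels.append(" ".join(feature_names[i] for i in top))
    return labels

def assign_topics(doc_term, components, feature_names, n_top_words):
    """Label every document with its most similar component in one matrix product."""
    sims = unit_rows(doc_term) @ unit_rows(components).T   # n_docs x n_components
    best = sims.argmax(axis=1)
    labels = top_word_labels(components, feature_names, n_top_words)
    return [labels[i] for i in best]

# Toy example: 2 documents, 2 components, 4 words (all values made up).
feature_names = ["football", "watch", "python", "cool"]
components = np.array([[0.9, 0.4, 0.0, 0.0],
                       [0.0, 0.0, 0.8, 0.6]])
doc_term = np.array([[1.0, 1.0, 0.0, 0.0],
                     [0.0, 0.0, 2.0, 1.0]])

assigned = assign_topics(doc_term, components, feature_names, n_top_words=2)
print(assigned)
```

The design choice is to replace many small Python calls with a single matrix product followed by `argmax`, which NumPy evaluates in compiled code.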

In [44]:
newsdf['lsa_topics'].value_counts()

year like people city new                       3071
game team season players games                   189
republican obama senate government president     122
city mayor police ny school                       71
republican senate democrats house senator         62
dr women study research medical                   56
school students education percent college         48
health insurance care city obama                  48
food oil state court obama                        48
school students court state children              36
women police men students military                23
building house street dr apartment                21
china obama police president white                11
health company percent state insurance            10
police food officers government oil                8
state intelligence china american usa              7
percent police food tax economy                    6
state china country military percent               5
women food company usa china                  

In [45]:
newsdf[['description','topics_20_5','lda_topics','lsa_topics']].head(10)

Unnamed: 0,description,topics_20_5,lda_topics,lsa_topics
0,"Ending a climactic showdown in the final hours of the 112th Congress, the House sent to President Obama legislation to avert big income tax increases on most Americans.",republican senate democrats senator house,republican state government obama,republican senate democrats house senator
1,A report on nearly three million people found that those whose body mass index ranked them as overweight had less risk of dying than people of normal weight.,dr study medical research university,year like team game,dr women study research medical
2,"As the United States prepares to withdraw from an unpopular war in Afghanistan, it faces challenges similar to what the country’s last occupier, the Soviet Union, had experienced.",military government american weapons officials,republican state government obama,year like people city new
3,"The popularity of the drinks reflects success in convincing consumers that they provide an edge, but most of their ingredients have no or little benefit, research shows.",dr study medical research university,republican state government obama,year like people city new
4,"New Hampshire, which again chose a woman to be governor, will also become the first state in history to have an all-female delegation in Washington.",women men woman sex book,republican state government obama,women police men students military
5,"For a 65th birthday, a first burial pitch.",like don just people think,year like team game,year like people city new
6,"The United Nations’ human rights chief, Navi Pillay, on Wednesday voiced dismay over an analysis that far exceeds earlier estimates of the toll in 22-month-old war.",military government american weapons officials,republican state government obama,year like people city new
7,The depth of the anger that followed the House’s refusal to take up a package of assistance for Hurricane Sandy victims was extraordinary and exceedingly personal.,republican senate democrats senator house,republican state government obama,republican senate democrats house senator
8,How the breakout star (and daughter of David) turned “Girls” from a show about a trio to one about a quartet.,like don just people think,year like team game,year like people city new
9,Three memorable television characters who dressed for success (and sometimes to excess).,women men woman sex book,year like team game,women police men students military


### Some comparisons:
* http://scikit-learn.org/0.17/auto_examples/applications/topics_extraction_with_nmf_lda.html
* https://medium.com/@aneesha/topic-modeling-with-scikit-learn-e80d33668730
* http://aclweb.org/anthology/D/D12/D12-1087.pdf

In [48]:
# can we predict the NMF topic labels (topics_20_5) from the tf-idf features?
from sklearn.model_selection import train_test_split

X = tfidf_dm.toarray()  #remember this is the output from the vectorizer and we are turning it into an array
y = newsdf['topics_20_5'].values #this is an array of labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)  # random_state fixes the seed so the split is reproducible

# Decision Tree Classifier
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

# fit a CART model to the data
model = DecisionTreeClassifier(random_state = 42)
model.fit(X_train, y_train)

# make predictions
clf1_expected = y_test
clf1_predicted = model.predict(X_test)


print(model.score(X_test,y_test))

# summarize the fit of the model
print("accuracy: " + str(metrics.accuracy_score(clf1_expected, clf1_predicted)))
print(metrics.classification_report(clf1_expected, clf1_predicted))

0.6103896103896104
accuracy: 0.6103896103896104
                                                precision    recall  f1-score   support

      building street apartment park buildings       0.50      0.53      0.51        59
               china usa party north countries       0.75      0.75      0.75        24
               city mayor ny campaign brooklyn       0.56      0.54      0.55        46
  company data industry technology information       0.61      0.61      0.61        74
                  court justice judge case law       0.53      0.65      0.59        26
          dr study medical research university       0.55      0.53      0.54        34
                 food oil water add restaurant       0.64      0.62      0.63        52
                   game team games season play       0.75      0.68      0.71        87
               health insurance care law plans       0.71      0.69      0.70        32
                    like don just people think       0.64      0.62    

In [49]:
# can we predict the LDA topic labels (lda_topics) from the same tf-idf features?

X = tfidf_dm.toarray()  #remember this is the output from the vectorizer and we are turning it into an array
y = newsdf['lda_topics'].values #this is an array of labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)  # random_state fixes the seed so the split is reproducible

# Decision Tree Classifier
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

# fit a CART model to the data
model = DecisionTreeClassifier(random_state = 42)
model.fit(X_train, y_train)

# make predictions
clf1_expected = y_test
clf1_predicted = model.predict(X_test)


print(model.score(X_test,y_test))

# summarize the fit of the model
print("accuracy: " + str(metrics.accuracy_score(clf1_expected, clf1_predicted)))
print(metrics.classification_report(clf1_expected, clf1_predicted))

0.8199134199134199
accuracy: 0.8199134199134199
                                       precision    recall  f1-score   support

american police intelligence security       0.00      0.00      0.00         0
  building buildings street apartment       0.00      0.00      0.00        14
             children dr child school       0.00      0.00      0.00         4
               city air rights people       0.00      0.00      0.00         1
               city mayor people says       0.00      0.00      0.00         3
           company north french women       0.00      0.00      0.00         1
    republican state government obama       0.83      0.84      0.84       527
                 school like year bar       0.09      0.08      0.08        13
        teams research report defense       0.00      0.00      0.00         1
                  year like team game       0.84      0.85      0.85       591

                             accuracy                           0.82      1155
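The 0.82 accuracy here is not directly comparable to the 0.61 obtained for the NMF labels, because the LDA labels are concentrated in two classes. A quick way to see this is to compare against a majority-class baseline. The sketch below uses the support column copied from the report above; the baseline itself is an added point of comparison, not something the notebook computes.

```python
# Test-set support per LDA label, copied from the classification report above.
support = {
    "american police intelligence security": 0,
    "building buildings street apartment": 14,
    "children dr child school": 4,
    "city air rights people": 1,
    "city mayor people says": 3,
    "company north french women": 1,
    "republican state government obama": 527,
    "school like year bar": 13,
    "teams research report defense": 1,
    "year like team game": 591,
}

total = sum(support.values())

# Accuracy of a classifier that always predicts the most common label.
majority_label = max(support, key=support.get)
baseline_accuracy = support[majority_label] / total

# Share of the test set covered by the two largest labels.
top_two_share = sum(sorted(support.values(), reverse=True)[:2]) / total

print(total, majority_label)
print("majority baseline:", round(baseline_accuracy, 3))
print("share in two largest labels:", round(top_two_share, 3))
```

Always guessing the most common label would already be right about half the time, and the two largest labels cover almost the whole test set. The tree does well on those two classes and gets almost nothing right on the rest.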
  




## Just for fun: how closely do topics and cluster assignments line up?

In [51]:
from sklearn.cluster import KMeans
import time
t0= time.time()
news_dm = tfidf_dm.toarray()
My_k = 20
km = KMeans(n_clusters=My_k, init='k-means++', max_iter=100, random_state = 42)
news_k = km.fit(news_dm)
clusters = km.labels_.tolist()
newsdf['clusters'] = clusters
t1 = time.time()

t1-t0

4.561550855636597

In [52]:
print("Clusters:")
print(newsdf['clusters'].value_counts())
print("\nNMF:")
print(newsdf['topics_20_5'].value_counts())
print("\nLDA:")
print(newsdf['lda_topics'].value_counts())
print("\nLSA")
print(newsdf['lsa_topics'].value_counts())

Clusters:
14    750
10    402
0     267
16    259
15    244
8     205
13    194
11    167
2     160
1     155
6     149
5     145
19    142
4     124
18    114
7     106
9     105
12     56
17     56
3      48
Name: clusters, dtype: int64

NMF:
like don just people think                        751
military government american weapons officials    317
game team games season play                       271
republican senate democrats senator house         213
company data industry technology information      209
percent market tax year economy                   204
building street apartment park buildings          198
mother family father children parents             179
food oil water add restaurant                     178
city mayor ny campaign brooklyn                   175
players league season player team                 150
court justice judge case law                      141
obama president administration white house        140
dr study medical research university              139

In [53]:
newsdf[['headline','topics_20_5','lda_topics','lsa_topics', 'clusters']].head(10)

Unnamed: 0,headline,topics_20_5,lda_topics,lsa_topics,clusters
0,Divided House Passes Tax Deal in End to Latest Fiscal Standoff,republican senate democrats senator house,republican state government obama,republican senate democrats house senator,13
1,Study Suggests Lower Mortality Risk for People Deemed to Be Overweight,dr study medical research university,year like team game,dr women study research medical,1
2,"With U.S. Set to Leave Afghanistan, Echoes of 1989",military government american weapons officials,republican state government obama,year like people city new,0
3,"Energy Drinks Promise Edge, but Experts Say Proof Is Scant",dr study medical research university,republican state government obama,year like people city new,1
4,"From Congress to Halls of State, in New Hampshire, Women Rule",women men woman sex book,republican state government obama,women police men students military,9
5,"Wanted: Mausoleum w/Wi-Fi, Cable, River Vu",like don just people think,year like team game,year like people city new,6
6,"More Than 60,000 Have Died in Syrian Conflict, U.N. Says",military government american weapons officials,republican state government obama,year like people city new,8
7,Stalling of Storm Aid Makes Northeast Republicans Furious,republican senate democrats senator house,republican state government obama,republican senate democrats house senator,13
8,Zosia Mamet Is Still Getting Used to Being Your New Best Friend,like don just people think,year like team game,year like people city new,14
9,To Thine Own Character,women men woman sex book,year like team game,women police men students military,14


In [54]:
# what is the distribution of topics within a single cluster (here, cluster 8)?
cluster_8 = newsdf.loc[newsdf['clusters'] == 8]
print("\nNMF:")
print(cluster_8['topics_20_5'].value_counts())
print("\nLDA:")
print(cluster_8['lda_topics'].value_counts())
print("\nLSA")
print(cluster_8['lsa_topics'].value_counts())


NMF:
military government american weapons officials    138
china usa party north countries                     9
like don just people think                          8
republican senate democrats senator house           7
court justice judge case law                        7
police officers department crime enforcement        7
city mayor ny campaign brooklyn                     4
percent market tax year economy                     4
company data industry technology information        4
building street apartment park buildings            3
women men woman sex book                            3
obama president administration white house          3
state new jersey laws federal                       3
mother family father children parents               3
health insurance care law plans                     1
food oil water add restaurant                       1
Name: topics_20_5, dtype: int64

LDA:
republican state government obama        189
year like team game                        5
bu
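One rough way to summarize how well a cluster lines up with a topic labeling is its purity: the share of the cluster's documents that carry its most common topic label. The sketch below computes this for cluster 8 using the NMF counts copied from the output above. Purity is an added summary here, not something the notebook computes.

```python
from collections import Counter

# NMF topic counts for the documents in cluster 8, copied from the output above.
nmf_counts_cluster_8 = Counter({
    "military government american weapons officials": 138,
    "china usa party north countries": 9,
    "like don just people think": 8,
    "republican senate democrats senator house": 7,
    "court justice judge case law": 7,
    "police officers department crime enforcement": 7,
    "city mayor ny campaign brooklyn": 4,
    "percent market tax year economy": 4,
    "company data industry technology information": 4,
    "building street apartment park buildings": 3,
    "women men woman sex book": 3,
    "obama president administration white house": 3,
    "state new jersey laws federal": 3,
    "mother family father children parents": 3,
    "health insurance care law plans": 1,
    "food oil water add restaurant": 1,
})

cluster_size = sum(nmf_counts_cluster_8.values())
top_topic, top_count = nmf_counts_cluster_8.most_common(1)[0]
purity = top_count / cluster_size

print(cluster_size, top_topic)
print("purity:", round(purity, 3))
```

About two thirds of cluster 8 falls under a single NMF topic, so for this cluster the two methods largely agree. To look at every cluster at once, `pd.crosstab(newsdf['clusters'], newsdf['topics_20_5'])` gives the full cluster-by-topic table.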

### Main takeaways:

* More examples: https://medium.com/@aneesha/topic-modeling-with-scikit-learn-e80d33668730
* Documents can be classified by topic using words that don't appear in that document (content vs expression!)
* We learned 3 different methods for extracting topics: NMF, LDA, and LSA
* The number of topics and the number of words used to describe a topic are decisions that data scientists make, along with the choice of classifier and its settings, what external information to include (and how), label creation options (dictionary/index), preprocessing options (stop word lists, stemming/lemmatization), vectorizer options, and other parameter settings

