# Topic Modelling

- Understand topic modelling
- Learn Latent Dirichlet Allocation (LDA)
- Implement LDA
- Understand Non-negative Matrix Factorization (NMF)
- Implement NMF
- Apply LDA and NMF in a project

In [68]:
# overview

- efficiently analyze large volumes of text by clustering documents into topics
- with unlabelled data, we can't apply the supervised learning methods used previously
- instead, when the data is unlabelled, we can attempt to 'discover' labels
- this means discovering clusters of documents, grouped together by topic
- the results are difficult to evaluate, since there are no true labels to compare against
- one approach is Latent Dirichlet Allocation (LDA)

LDA is based on a probability distribution (the Dirichlet distribution)
- assumptions
 - documents with similar topics use similar groups of words
 - latent topics can be found by searching for groups of words that frequently occur together in documents

- assumptions of LDA for topic modelling
 - documents are probability distributions over latent topics
 - topics themselves are probability distributions over words

In [69]:
# what happens here is:
## we decide how many topics we need
## we assign a random topic to each word, then count the occurrences and repeatedly adjust the assignments
## doing so gives a representation of each document as a mixture of topics


- decide the number of words N the document will have
- choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of k topics)
- e.g. 60% business, 20% politics, 10% food (the proportions must sum to 100%, so the rest goes to the remaining topics)
- generate each word in the document by:
 - first picking a topic according to the multinomial distribution sampled above (60% business, ...)
 - then using that topic to generate the word itself (according to the topic's multinomial distribution over words)
 - e.g. if we selected food, we would generate the word "apple" with higher probability than a word like "home"
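
The generative story above can be sketched in a few lines of NumPy. This is only an illustration: the three topics, the six-word vocabulary, the per-topic word probabilities and the `generate_document` helper are all made up for the example and have nothing to do with the NPR data used later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 3 topics and a 6-word vocabulary.
topics = ["business", "politics", "food"]
vocab = ["market", "stock", "vote", "senate", "apple", "bread"]

# Each row is one topic's probability distribution over the vocabulary.
topic_word = np.array([
    [0.45, 0.45, 0.03, 0.03, 0.02, 0.02],  # business
    [0.03, 0.03, 0.45, 0.45, 0.02, 0.02],  # politics
    [0.02, 0.02, 0.03, 0.03, 0.50, 0.40],  # food
])

def generate_document(n_words, alpha=(1.0, 1.0, 1.0)):
    # Step 1: draw the document's topic mixture from a Dirichlet distribution.
    topic_mixture = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        # Step 2: pick a topic for this word according to the mixture.
        t = rng.choice(len(topics), p=topic_mixture)
        # Step 3: pick the word according to that topic's word distribution.
        w = rng.choice(len(vocab), p=topic_word[t])
        words.append(vocab[w])
    return topic_mixture, words

mixture, doc = generate_document(n_words=8)
print(mixture.round(2), doc)
```

LDA itself works in the opposite direction: it only sees the words and tries to recover the topic mixtures and word distributions that could have produced them.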


- imagine we have a set of documents
- we choose a fixed number k of topics to discover
- go through the documents and randomly assign each word in each document to one of the k topics
- this random assignment already gives both topic representations of all the documents and word distributions of all the topics (note: these initial random topics won't make sense)
- we then iterate over every word in every document to improve these topics
- for every word w in every document d, and for each topic t, we calculate:
 - p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t
 - p(word w | topic t) = the proportion of assignments to topic t, over all documents, that come from word w
- reassign w to a new topic, choosing topic t with probability proportional to p(topic t | document d) * p(word w | topic t)
- this is essentially the probability that topic t generated word w
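
A rough sketch of this reassignment loop on a tiny corpus is below. It follows the steps in the list, with two additions that are my own assumptions and not part of the notes: a small smoothing constant `smooth` so that no probability is exactly zero, and removing a word's current assignment from the counts before resampling it (as in collapsed Gibbs sampling). The word ids and the corpus are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy corpus: each document is a list of word ids (6-word vocabulary).
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 5, 4], [0, 1, 1, 0, 3], [4, 5, 5, 3, 4]]
n_topics, n_words = 2, 6
smooth = 0.1  # assumption: small smoothing so no probability is exactly zero

# Random initial topic for every word occurrence.
assign = [[int(rng.integers(n_topics)) for _ in doc] for doc in docs]

# Count tables implied by the current assignments.
doc_topic = np.zeros((len(docs), n_topics))
topic_word = np.zeros((n_topics, n_words))
for d, doc in enumerate(docs):
    for w, t in zip(doc, assign[d]):
        doc_topic[d, t] += 1
        topic_word[t, w] += 1

for _ in range(50):  # sweeps over the whole corpus
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            # Remove this word's current assignment from the counts.
            t_old = assign[d][i]
            doc_topic[d, t_old] -= 1
            topic_word[t_old, w] -= 1
            # p(topic t | document d) and p(word w | topic t), up to a constant.
            p_t_given_d = doc_topic[d] + smooth
            p_w_given_t = (topic_word[:, w] + smooth) / (
                topic_word.sum(axis=1) + smooth * n_words
            )
            p = p_t_given_d * p_w_given_t
            # Sample the new topic in proportion to the product.
            t_new = int(rng.choice(n_topics, p=p / p.sum()))
            assign[d][i] = t_new
            doc_topic[d, t_new] += 1
            topic_word[t_new, w] += 1

print(doc_topic)  # per-document topic counts after the sweeps
```

Note that scikit-learn's `LatentDirichletAllocation`, used below, fits the model with variational Bayes instead of this sampling loop, so this is only meant to make the intuition concrete.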

- we end up with an output such as:
 - document assigned to topic #4
 - most common words (highest probability) for topic #4
  - [cat, vet, birds, dog, ...]
 - it is up to the user to interpret these topics

# Latent Dirichlet Allocation with Python

In [70]:
import pandas as pd


In [71]:
npr = pd.read_csv('resources/npr.csv')

In [72]:
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [73]:
len(npr)

11992

In [74]:
from sklearn.feature_extraction.text import CountVectorizer


In [75]:
# max_df=0.9 (a float) ignores words that appear in more than 90% of the documents
# min_df=2 (an integer) ignores words that appear in fewer than 2 documents
cv = CountVectorizer(max_df = 0.9, min_df = 2, stop_words='english')
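
To see what `max_df` and `min_df` do, here is a small check on a made-up four-sentence corpus (the sentences and the name `cv_toy` are hypothetical). With these settings only a word that appears in at least 2 documents, but in no more than 90% of them, should survive.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (not the NPR data).
docs = [
    "cats chase mice",
    "cats chase dogs",
    "cats like fish",
    "cats chase birds",
]

# max_df=0.9 drops terms found in more than 90% of documents ("cats" is in all 4);
# min_df=2 drops terms found in fewer than 2 documents (everything except "chase").
cv_toy = CountVectorizer(max_df=0.9, min_df=2)
cv_toy.fit(docs)
print(sorted(cv_toy.vocabulary_))  # ['chase']
```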

In [76]:
dtm = cv.fit_transform(npr['Article'])

In [77]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [78]:
from sklearn.decomposition import LatentDirichletAllocation

In [79]:
LDA = LatentDirichletAllocation(n_components =7, random_state= 42)

In [80]:
LDA.fit(dtm)

LatentDirichletAllocation(n_components=7, random_state=42)

In [81]:
# grab the vocab of words
# grab the topics
# grab the highest probability words per topic

## grab the vocabulary of words

In [82]:
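# note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
cv.get_feature_names()[2000]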

'africa'

In [83]:
type(cv.get_feature_names())

list

In [84]:
len(cv.get_feature_names())

54777


In [85]:
import random 
random_word_id = random.randint(0, 54000)
cv.get_feature_names()[random_word_id]

'whisked'

In [86]:
len(LDA.components_)

7

In [87]:
type(LDA.components_)

numpy.ndarray

In [88]:
LDA.components_.shape

(7, 54777)

In [89]:
LDA.components_

array([[8.64332806e+00, 2.38014333e+03, 1.42900522e-01, ...,
        1.43006821e-01, 1.42902042e-01, 1.42861626e-01],
       [2.76191749e+01, 5.36394437e+02, 1.42857148e-01, ...,
        1.42861973e-01, 1.42857147e-01, 1.42906875e-01],
       [7.22783888e+00, 8.24033986e+02, 1.42857148e-01, ...,
        6.14236247e+00, 2.14061364e+00, 1.42923753e-01],
       ...,
       [3.11488651e+00, 3.50409655e+02, 1.42857147e-01, ...,
        1.42859912e-01, 1.42857146e-01, 1.42866614e-01],
       [4.61486388e+01, 5.14408600e+01, 3.14281373e+00, ...,
        1.43107628e-01, 1.43902481e-01, 2.14271779e+00],
       [4.93991422e-01, 4.18841042e+02, 1.42857151e-01, ...,
        1.42857146e-01, 1.43760101e-01, 1.42866201e-01]])

In [90]:
single_topic = LDA.components_[0]
second_topic = LDA.components_[1]

In [91]:
len(single_topic)

54777

In [92]:
single_topic.argsort()

array([ 2475, 18302, 35285, ..., 22673, 42561, 42993], dtype=int64)

In [93]:
# ARGSORT ---> INDEX POSITION SORTED FROM LEAST ---> GREATEST
# TOP 10 VALUES (10 GREATEST VALUES)
# LAST 10 VALUES OF ARGSORT()

single_topic.argsort()[-10:] # grab the last 10 values of argsort()

array([33390, 36310, 21228, 10425, 31464,  8149, 36283, 22673, 42561,
       42993], dtype=int64)
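
A tiny example makes the `argsort()` trick easier to follow. The weights here are made up; the point is that `argsort()` returns index positions ordered from the smallest value to the largest, so the last few entries point at the largest values.

```python
import numpy as np

# Hypothetical weights for a 5-word vocabulary.
weights = np.array([0.1, 2.5, 0.3, 7.0, 1.2])

order = weights.argsort()      # indices from smallest to largest weight
top_3 = order[-3:]             # indices of the 3 largest weights, still ascending
top_3_desc = order[::-1][:3]   # the same indices, largest weight first

print(order)       # [0 2 4 1 3]
print(top_3)       # [4 1 3]
print(top_3_desc)  # [3 1 4]
```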

In [103]:
top_ten_words = single_topic.argsort()[-10:]

In [104]:
for index in top_ten_words:
    print(cv.get_feature_names()[index])

new
percent
government
company
million
care
people
health
said
says


In [96]:
# GRAB THE HIGHEST PROBABILITY WORDS PER TOPIC

In [97]:
feature_names = cv.get_feature_names()  # fetch the vocabulary once instead of on every lookup
for i, topic in enumerate(LDA.components_):
    print(f"THE TOP 15 WORDS FOR TOPIC #{i}")
    print([feature_names[index] for index in topic.argsort()[-15:]])
    print('\n')
    print('\n')

THE TOP 15 WORDS FOR TOPIC #0
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']




THE TOP 15 WORDS FOR TOPIC #1
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']




THE TOP 15 WORDS FOR TOPIC #2
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']




THE TOP 15 WORDS FOR TOPIC #3
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']




THE TOP 15 WORDS FOR TOPIC #4
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']




THE TOP 15 WORDS FOR TOPIC #5
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know'

In [98]:
topic_result = LDA.transform(dtm)

In [99]:
topic_result[0].round(2)

array([0.02, 0.68, 0.  , 0.  , 0.3 , 0.  , 0.  ])

In [100]:
topic_result[0].argmax()

1

In [101]:
npr['Topic'] = topic_result.argmax(axis = 1)

In [102]:
npr

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",2
...,...,...
11987,The number of law enforcement officers shot an...,1
11988,"Trump is busy these days with victory tours,...",4
11989,It’s always interesting for the Goats and Soda...,3
11990,The election of Donald Trump was a surprise to...,4


## Non-negative Matrix Factorization

- unsupervised
- simultaneously performs dimensionality reduction and clustering (each cluster corresponds to a topic)
- we can use it in conjunction with TF-IDF to model topics across documents

- input --> non-negative data matrix A, here the TF-IDF matrix
- number of basis vectors k (refers to how many topics we want)
- goal: a k-dimensional factorization of A in terms of W and H, so that A is approximately WH
- initial values for the factors W and H (e.g. random matrices)



- dimensions (n features, m objects, k topics)
 - n * m ---> A (data matrix --> rows = features, cols = objects)
 - n * k ---> W (basis vectors --> rows = features)
 - k * m ---> H (coefficient matrix --> cols = objects)

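- get some measure of the error between A and its approximation WH
 - 1/2 * ||A - WH||^2 (half the sum of squared element-wise differences)
- optimize it with an expectation-maximization-style procedure
- alternate between updating H and updating W until convergence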
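
The alternating updates can be sketched with the classic multiplicative update rules of Lee and Seung. This is one common choice and an assumption on my part; the notes do not say which update rule is meant. The matrix `A` here is random toy data, using the features x objects orientation from the list above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical non-negative data matrix A (n features x m objects).
A = rng.random((6, 5))
k = 2  # number of basis vectors (topics)

# Random non-negative initial factors.
W = rng.random((A.shape[0], k))
H = rng.random((k, A.shape[1]))
eps = 1e-10  # avoids division by zero

def objective(A, W, H):
    # 1/2 * sum of squared element-wise differences between A and WH.
    return 0.5 * np.sum((A - W @ H) ** 2)

start = objective(A, W, H)
for _ in range(200):
    # Multiplicative updates: every factor stays non-negative because
    # each step multiplies by a ratio of non-negative quantities.
    H *= (W.T @ A) / (W.T @ W @ H + eps)
    W *= (A @ H.T) / (W @ H @ H.T + eps)
end = objective(A, W, H)

print(start, end)  # the error after the updates should be lower than at the start
```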

### How it's going to work for us

- construct a vector space model for the documents (after stop-word filtering)
- the result is a term-document matrix A
- apply TF-IDF term weight normalization to A
- normalize the TF-IDF vectors to unit length
- initialise the factors using NNDSVD on A
- apply projected gradient NMF to A
- basis vectors: the topics (clusters) in the data
- coefficient matrix: the membership weights for documents relative to each topic (cluster)
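
These steps map fairly directly onto scikit-learn, as in the sketch below on a made-up four-sentence corpus (the names `tfidf_toy` and `nmf_toy` are hypothetical). Two caveats: scikit-learn stores the matrix as documents x terms, the transpose of the layout in the notes above, so its `W` holds the document memberships and `components_` holds the topics; and its default solver is coordinate descent, not projected gradient.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (not the NPR data).
docs = [
    "the senate passed the health care bill",
    "voters went to the polls for the election",
    "the health insurance bill covers more patients",
    "the election campaign targeted undecided voters",
]

# TF-IDF weighting with stop-word filtering; rows are L2-normalized by default.
tfidf_toy = TfidfVectorizer(stop_words="english")
A = tfidf_toy.fit_transform(docs)   # documents x terms

# NNDSVD initialization, two topics.
nmf_toy = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf_toy.fit_transform(A)        # document-topic membership weights
H = nmf_toy.components_             # topic-term weights (the topics)

print(W.shape, H.shape)
print(W.argmax(axis=1))             # most likely topic per document
```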

## Non-negative Matrix Factorization 2

In [105]:
import pandas as pd

In [107]:
npr = pd.read_csv('resources/npr.csv')

In [108]:
# preprocessing / feature extraction changes -
## for LDA we could only use CountVectorizer, but for NMF we can preprocess with TF-IDF
## my guess at the reason: LDA deals with probabilities, so it needs something like raw word counts,
## whereas here we are optimizing coefficients (the weights that decide which topic a document belongs to),
## so it may only require some sort of non-negative weight associated with each word



In [109]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [112]:
tfidf = TfidfVectorizer(max_df = 0.95, min_df = 2, stop_words ='english' )

In [113]:
dtm = tfidf.fit_transform(npr['Article'])

In [114]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.float64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [116]:
from sklearn.decomposition import NMF

In [117]:
nmf_model = NMF(n_components = 7, random_state = 42)

In [118]:
nmf_model.fit(dtm)



NMF(n_components=7, random_state=42)

In [120]:
tfidf.get_feature_names()[5000]

'bask'

In [124]:
feature_names = tfidf.get_feature_names()  # fetch the vocabulary once instead of on every lookup
for index, topic in enumerate(nmf_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC - #{index}')
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n\n')

THE TOP 15 WORDS FOR TOPIC - #0
['new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']



THE TOP 15 WORDS FOR TOPIC - #1
['gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']



THE TOP 15 WORDS FOR TOPIC - #2
['senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']



THE TOP 15 WORDS FOR TOPIC - #3
['officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']



THE TOP 15 WORDS FOR TOPIC - #4
['primary', 'cruz', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']



THE TOP 15 WORDS FOR TOPIC - #5
['love',

In [125]:
topic_results = nmf_model.transform(dtm)

In [126]:
topic_results.argmax(axis = 1)

array([1, 1, 1, ..., 0, 4, 3], dtype=int64)

In [128]:
npr['Topic']= topic_results.argmax(axis =1)

In [129]:
npr.head()

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3
4,"From photography, illustration and video, to d...",6


In [130]:
# one label per topic id 0-6 (the original dict repeated key 2 and had no entry for topic 0)
topic_dic = {6:'edu', 5:'music', 4:'election', 3:'poli', 2:'legis', 1:'ele', 0:'health'}

In [131]:
npr['Topic label'] = npr['Topic'].map(topic_dic)

In [132]:
npr.head()

Unnamed: 0,Article,Topic,Topic label
0,"In the Washington of 2016, even when the polic...",1,ele
1,Donald Trump has used Twitter — his prefe...,1,ele
2,Donald Trump is unabashedly praising Russian...,1,ele
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3,poli
4,"From photography, illustration and video, to d...",6,edu
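
When mapping topic ids to labels with `Series.map`, any id that has no entry in the dictionary silently becomes `NaN`, so a quick coverage check can be worth running. The series and labels below are hypothetical, only there to show the behaviour.

```python
import pandas as pd

# Hypothetical topic ids and labels; topic 0 deliberately has no label.
topics = pd.Series([0, 1, 3, 6])
labels = {1: "a", 3: "b", 6: "c"}

mapped = topics.map(labels)

# Ids missing from the dictionary show up as NaN after the mapping.
missing_ids = sorted(set(topics) - set(labels))
print(mapped.isna().sum())  # 1
print(missing_ids)          # [0]
```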


# Assessment

In [9]:
# import pandas and open the dataset

In [10]:
import pandas as pd
import numpy as np

In [11]:
df = pd.read_csv('resources/quora_questions.csv')

In [12]:
df.head()

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


In [13]:
# preprocessing , tfidf vectorization to create a vectorized document term matrix
# use same parameters as previous lecture

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [15]:
tfidf = TfidfVectorizer(max_df = 0.95, min_df = 2, stop_words = 'english')

In [16]:
dtm = tfidf.fit_transform(df['Question'])

In [17]:
dtm

<404289x38669 sparse matrix of type '<class 'numpy.float64'>'
	with 2002912 stored elements in Compressed Sparse Row format>

## Non negative matrix factorization

- using NMF , 20 topics, random_state = 42

In [18]:
from sklearn.decomposition import NMF

In [19]:
model = NMF()

In [26]:
model = NMF(n_components = 20, random_state = 42)

In [27]:
model.fit(dtm)



NMF(n_components=20, random_state=42)

In [28]:
model.components_

array([[0.00000000e+00, 5.63036920e-02, 5.40156715e-05, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.23892290e-03, 0.00000000e+00, 3.45251649e-05, ...,
        0.00000000e+00, 3.65013292e-03, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       ...,
       [4.07891142e-04, 4.92671304e-03, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [7.84654148e-05, 4.54449173e-04, 6.05797981e-05, ...,
        1.70479939e-03, 0.00000000e+00, 1.70479939e-03],
       [3.45021413e-04, 0.00000000e+00, 4.81932600e-06, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])

In [29]:
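feature_names = tfidf.get_feature_names()  # fetch the vocabulary once instead of on every lookup
for i, topic in enumerate(model.components_):
    print(f'words belong to topic {i}')
    # use a separate name for the word index so it does not shadow the topic index i
    print([feature_names[index] for index in topic.argsort()[-10:]])
    print('\n\n')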

words belong to topic 0
['phone', 'buy', 'laptop', 'movie', 'ways', '2016', 'books', 'book', 'movies', 'best']



words belong to topic 1
['use', 'exist', 'really', 'compare', 'cost', 'long', 'feel', 'work', 'mean', 'does']



words belong to topic 2
['improvement', 'delete', 'asked', 'google', 'answers', 'answer', 'ask', 'question', 'questions', 'quora']



words belong to topic 3
['internet', 'free', 'home', 'easy', 'youtube', 'ways', 'earn', 'online', 'make', 'money']



words belong to topic 4
['live', 'want', 'change', 'moment', 'real', 'important', 'thing', 'meaning', 'purpose', 'life']



words belong to topic 5
['china', 'business', 'country', 'olympics', 'available', 'job', 'spotify', 'war', 'pakistan', 'india']



words belong to topic 6
['hacking', 'want', 'python', 'languages', 'java', 'learning', 'start', 'language', 'programming', 'learn']



words belong to topic 7
['vote', 'better', 'election', 'did', 'win', 'hillary', 'president', 'clinton', 'donald', 'trump']



words