# Topic Modelling

![nlp](https://wrm5sysfkg-flywheel.netdna-ssl.com/wp-content/uploads/2019/01/NLP-Technology-in-Healthcare.jpg)

Section goals:
- Understand topic modeling
- Learn Latent Dirichlet Allocation (LDA)
- Implement LDA
- Understand non-negative matrix factorization (NMF)

# Introduction

Before focusing on methods such as LDA and NMF, we need to understand what topic modeling is and why it is important. Topic modeling lets us efficiently analyze large volumes of documents by clustering them into topics. A large amount of data is unlabeled, which means we can't apply supervised learning methods to build ML models for it.

Where we have unlabeled data, we can attempt to discover labels. For text data, this means attempting to discover clusters of documents grouped by topic. It is important to be mindful that we don't know the correct topic, or the 'right answer'. All we can know is that clustered documents share similar topic ideas. The onus is on the user to identify what these topics represent.

# 7.1.0 - LDA - Latent Dirichlet Allocation

Johann Dirichlet was a German mathematician in the 1800s who contributed much to the field of modern mathematics. There is a probability distribution named after him, the "Dirichlet distribution".
- LDA is based on this probability distribution.
- LDA was published in 2003 as a graphical model for topic discovery.

Applying LDA to topic modelling relies on two critical assumptions:
1. Documents with similar topics use similar groups of words
2. Latent topics can then be found by searching for groups of words that frequently occur together in documents across the corpus.

## Theory Description
**Documents are probability distributions over latent topics** - For a given number of latent topics, each document has a probability distribution over them. If we declare that there are 5 latent topics across a collection of documents, then any document in the collection has a probability of belonging to each topic.

**Topics themselves are probability distributions over words** - Each topic assigns every word in the vocabulary a probability of belonging to that topic.

LDA represents documents as mixtures of topics that spit out words with certain probabilities. It assumes the documents are produced in the following fashion:
- Decide on the number of words `N` the document will have.
- Choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of `K` topics), e.g. 60% business, 20% politics, 10% arts, 10% culinary.
- Generate each word in the document by:
    - first picking a topic according to the topic mixture sampled in the previous step;
    - then using that topic's word distribution to generate the word itself. For example, if we selected the food topic we might generate the word "apple" with 60% probability.

Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.
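The generative story above can be sketched in a few lines of NumPy. This is only an illustration: the vocabulary, the three topics, their word probabilities and the `generate_document` helper are all made up for the example, and the Dirichlet parameter `alpha` is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 3 topics over a 6-word vocabulary.
vocab = ["market", "stock", "vote", "senate", "apple", "recipe"]
topic_names = ["business", "politics", "food"]

# Each row is one topic's probability distribution over the vocabulary.
topic_word = np.array([
    [0.45, 0.45, 0.02, 0.02, 0.05, 0.01],  # business
    [0.03, 0.02, 0.45, 0.45, 0.03, 0.02],  # politics
    [0.02, 0.02, 0.03, 0.03, 0.60, 0.30],  # food: "apple" has probability 0.60
])

def generate_document(n_words, alpha):
    """Sample one document following LDA's assumed generative story."""
    # Draw the document's topic mixture from a Dirichlet distribution.
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        # Pick a topic for this word according to the document's mixture.
        topic = rng.choice(len(topic_names), p=theta)
        # Pick a word according to that topic's word distribution.
        word_index = rng.choice(len(vocab), p=topic_word[topic])
        words.append(vocab[word_index])
    return theta, words

theta, words = generate_document(n_words=10, alpha=[0.5, 0.5, 0.5])
print(dict(zip(topic_names, theta.round(2))))
print(" ".join(words))
```

Fitting LDA runs this story in reverse: only the words are observed, and the topic mixtures and topic-word distributions have to be inferred.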

## Practical Description 
Suppose we have a fixed set of documents and have chosen a fixed number `k` of topics to discover. We want to use LDA to learn the topic representation of each document and the words associated with each topic.

We start by going through each document and randomly assigning each word in the document to one of the `k` topics. This random assignment already gives both a topic representation of every document and a word distribution for every topic, although the initial random assignment won't make much sense.

Then we iterate to improve these assignments. For every word `w` in every document `d`, and for each topic `t`, we calculate:
- `p`(topic `t` | document `d`) = proportion of words in the doc currently assigned to topic `t`
- `p`(word `w` | topic `t`) = proportion of assignments to topic `t` over all documents that come from word `w`.

We then reassign `w` to a new topic, choosing topic `t` with probability proportional to:
- `p`(topic `t` | document `d`) * `p`(word `w` | topic `t`)

After repeating this refinement step a large number of times, we eventually reach a roughly steady state where the assignments are acceptable. Every word is then assigned to a topic, so for each topic we can look up the words with the highest probability of belonging to it.
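The refinement loop described above can be sketched on a tiny, made-up corpus. This is an assumption-laden illustration in the style of a Gibbs sampler, not what scikit-learn runs internally (its `LatentDirichletAllocation` uses variational Bayes). The corpus, the `smooth` constant and the 50 sweeps are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy corpus, already tokenised.
docs = [
    ["health", "care", "insurance", "health"],
    ["vote", "election", "party", "vote"],
    ["health", "insurance", "vote", "election"],
]
k = 2
vocab = sorted({w for doc in docs for w in doc})
word_id = {w: i for i, w in enumerate(vocab)}

# Random initial assignment of every word occurrence to one of the k topics.
assignments = [[int(rng.integers(k)) for _ in doc] for doc in docs]

# Count tables implied by the assignments.
doc_topic = np.zeros((len(docs), k))
topic_word = np.zeros((k, len(vocab)))
for d, doc in enumerate(docs):
    for w, t in zip(doc, assignments[d]):
        doc_topic[d, t] += 1
        topic_word[t, word_id[w]] += 1

smooth = 0.1  # assumed smoothing constant so no probability is exactly zero

for _ in range(50):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            v = word_id[w]
            old = assignments[d][i]
            # Take the current word out of the counts before re-scoring it.
            doc_topic[d, old] -= 1
            topic_word[old, v] -= 1
            # p(topic t | document d) and p(word w | topic t) from current counts.
            p_t_given_d = (doc_topic[d] + smooth) / (doc_topic[d].sum() + k * smooth)
            p_w_given_t = (topic_word[:, v] + smooth) / (
                topic_word.sum(axis=1) + len(vocab) * smooth
            )
            weights = p_t_given_d * p_w_given_t
            # Sample the new topic with probability proportional to the product.
            new = int(rng.choice(k, p=weights / weights.sum()))
            assignments[d][i] = new
            doc_topic[d, new] += 1
            topic_word[new, v] += 1

print(doc_topic)   # words per topic in each document
print(topic_word)  # occurrences of each vocabulary word per topic
```

Because the reassignment is sampled rather than chosen greedily, the result depends on the random seed; the count tables only ever describe the current state of the assignments.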

Two important notes:
1. The user must decide on the number of topics present in the corpus.
2. The user must interpret what the topics are.

# 7.2.0 - LDA with Python

In [1]:
import pandas as pd

In [2]:
npr = pd.read_csv('./resources/npr.csv')
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [3]:
from sklearn.feature_extraction.text import CountVectorizer

In [5]:
# max_df=0.9 - ignore words that appear in more than 90% of the documents
# min_df=2 - ignore words that appear in fewer than 2 documents
cv = CountVectorizer(max_df=0.9, min_df=2, stop_words='english')

The next step is to call `fit_transform`. Note that we can't use `train_test_split` because there are no labels to train against; we're in unsupervised territory.

In [7]:
dtm = cv.fit_transform(npr['Article'])

In [8]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>
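To make the two frequency cut-offs concrete, here is a small sketch on a made-up four-document corpus. The `toy_` names and sentences are illustrative only and are not part of the NPR data; stop-word filtering is left out so that the effect of `max_df` is visible.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus: "the" occurs in every document.
toy_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "the bird sang",
]

# Same cut-offs as above: drop terms found in more than 90% of documents
# and terms found in fewer than 2 documents.
toy_cv = CountVectorizer(max_df=0.9, min_df=2)
toy_dtm = toy_cv.fit_transform(toy_docs)

# "the" is removed by max_df; "mat", "log", "chased", "bird" and "sang"
# are removed by min_df because each appears in only one document.
print(sorted(toy_cv.vocabulary_))
print(toy_dtm.toarray())
```

The last document ends up as an all-zero row, which shows that aggressive cut-offs can leave short documents with no features at all.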

In [9]:
from sklearn.decomposition import LatentDirichletAllocation

In [10]:
LDA = LatentDirichletAllocation(n_components=7, random_state=42)

In [11]:
LDA.fit(dtm)

LatentDirichletAllocation(n_components=7, random_state=42)

In [15]:
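# grab the vocabulary of words
# (get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out())
print(f"type: {type(cv.get_feature_names())}  length:{len(cv.get_feature_names())}")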

type: <class 'list'>  length:54777


In [27]:
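# random word selection from the set

import random
# randrange excludes the upper bound, so the index is always valid
# (randint includes it and could raise an IndexError)
word = random.randrange(len(cv.get_feature_names()))

cv.get_feature_names()[word]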

'interjects'

In [29]:
# grab the topics
len(LDA.components_)

7

In [31]:
single_topic = LDA.components_[0]
single_topic.argsort()

array([ 2475, 18302, 35285, ..., 22673, 42561, 42993])

In [None]:
# argsort returns index positions sorted from least to greatest
# top 10 values, or the last 10 values of argsort.
single_topic.argsort()[-10:]
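As a quick check of how `argsort` slicing behaves, here is a small sketch on a made-up weight vector (the numbers are arbitrary and stand in for one row of `LDA.components_`).

```python
import numpy as np

# Hypothetical word weights for a single topic.
weights = np.array([0.2, 3.5, 1.1, 0.7, 2.4])

order = weights.argsort()       # indices from smallest to largest weight
top_two = order[-2:]            # indices of the two largest weights, ascending
top_two_desc = order[::-1][:2]  # the same two indices, largest first

print(order)         # [0 3 2 4 1]
print(top_two)       # [4 1]
print(top_two_desc)  # [1 4]
```

Slicing with `[-10:]` therefore lists the strongest words last, which is why the most characteristic word of each topic appears at the end of the printed lists.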

In [34]:
top_picks = single_topic.argsort()[-20:]

In [35]:
for index in top_picks:
    print(cv.get_feature_names()[index])

president
state
tax
insurance
trump
companies
money
year
federal
000
new
percent
government
company
million
care
people
health
said
says


In [37]:
# grab the highest probability words per topic

for i, topic in enumerate(LDA.components_):
    print(f"Top picks for topic {i}")
    print([cv.get_feature_names()[index] for index in topic.argsort()[-20:]])
    print('\n')

Top picks for topic 0
['president', 'state', 'tax', 'insurance', 'trump', 'companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']


Top picks for topic 1
['white', 'according', 'attack', 'reported', 'war', 'military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']


Top picks for topic 2
['little', 'know', 'don', 'year', 'make', 'way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']


Top picks for topic 3
['world', 'research', 'university', 'percent', 'care', 'time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']


Top picks for topic 4
['donald', 'political', 'states', 'law', 'just', 'voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republica

In [38]:
res = LDA.transform(dtm)

In [40]:
npr['topic'] = res.argmax(axis=1)

In [41]:
npr

Unnamed: 0,Article,topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",2
...,...,...
11987,The number of law enforcement officers shot an...,1
11988,"Trump is busy these days with victory tours,...",4
11989,It’s always interesting for the Goats and Soda...,3
11990,The election of Donald Trump was a surprise to...,4
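Taking `argmax` keeps only the single most likely topic and discards the rest of each document's topic mixture. The sketch below, on a made-up four-document corpus with hypothetical `toy_` names, shows that each row returned by `transform` is a probability distribution, so the maximum value can be kept as a rough confidence score alongside the label.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus with two rough themes.
toy_docs = [
    "health care insurance costs",
    "health insurance coverage plan",
    "election vote party campaign",
    "vote election results party",
]

toy_counts = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)

# One row per document, one column per topic.
doc_topics = toy_lda.fit_transform(toy_counts)

print(doc_topics.round(3))
print(doc_topics.sum(axis=1))     # each row sums to 1
print(doc_topics.argmax(axis=1))  # hard topic label per document
print(doc_topics.max(axis=1))     # probability of that label
```

A document whose highest probability is close to `1 / n_components` is not clearly about any one topic, which the hard label alone would hide.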


# 7.3.0 - Non-negative Matrix Factorization

NMF is another unsupervised algorithm; it performs dimensionality reduction and clustering simultaneously. It can be used in conjunction with TF-IDF to model topics across documents.

#### Exploring the math behind NMF

Given a non-negative matrix `A`, find a `k`-dimensional approximation in terms of non-negative factors `W` and `H`, such that $A \approx WH$.

- $A \in \mathbb{R}^{n \times m}$: data matrix, rows = features, cols = objects

- $W \in \mathbb{R}^{n \times k}$: basis vectors, rows = features

- $H \in \mathbb{R}^{k \times m}$: coefficient matrix, cols = objects
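The shapes can be checked on a small random matrix. Note one assumption about conventions: scikit-learn stores documents as rows and terms as columns, which is the transpose of the layout listed above, so in this sketch `W` holds per-document topic weights and `H` (`components_`) holds per-topic term weights. The matrix `X` and its sizes are made up.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Hypothetical non-negative data: 6 documents (rows) x 5 terms (columns).
X = rng.random((6, 5))

model = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = model.fit_transform(X)  # shape (6, 2): document-to-topic weights
H = model.components_       # shape (2, 5): topic-to-term weights

# Frobenius norm of the part of X that the rank-2 product does not capture.
error = np.linalg.norm(X - W @ H)

print(W.shape, H.shape)
print(round(float(error), 3))
```

Both factors contain only non-negative entries, which is what lets each row of `H` be read as a list of term weights for one topic.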

#### Process

1. Construct a vector space model for documents (after stop-word filtering), to produce matrix A.
2. Apply TF-IDF term weight normalization to matrix A.
3. Normalize TF-IDF vectors to unit length.
4. Initialise factors using NNDSVD on matrix A.
5. Apply projected gradient NMF to A.

**Note:** steps 1-3 can be performed with scikit-learn's `TfidfVectorizer`.
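As a small check of that note, the sketch below runs `TfidfVectorizer` with its default settings on a made-up corpus (the `toy_` names and sentences are illustrative) and confirms that every document vector comes out non-negative and with unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus.
toy_docs = [
    "health care costs rise",
    "election results and health policy",
    "care for the election",
]

# Stop-word filtering, TF-IDF weighting and L2 normalisation in one step.
toy_tfidf = TfidfVectorizer(stop_words="english")
toy_matrix = toy_tfidf.fit_transform(toy_docs)

# Euclidean length of each document vector.
row_norms = np.linalg.norm(toy_matrix.toarray(), axis=1)

print(toy_matrix.shape)
print(row_norms)
```

The non-negativity matters here because NMF requires a non-negative input matrix.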

# 7.4.0 - NMF with Python

In [42]:
npr = pd.read_csv('./resources/npr.csv')

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [44]:
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [45]:
dtm = tfidf.fit_transform(npr['Article'])

In [46]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.float64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [47]:
from sklearn.decomposition import NMF

In [48]:
nmf = NMF(n_components=7, random_state=42)

In [49]:
nmf.fit(dtm)

NMF(n_components=7, random_state=42)

In [50]:
for index, topic in enumerate(nmf.components_):
    print(f"Top picks for topic {index}")
    print([tfidf.get_feature_names()[i] for i in topic.argsort()[-20:]])
    print('\n')

Top picks for topic 0
['years', 'brain', 'university', 'researchers', 'scientists', 'new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']


Top picks for topic 1
['intelligence', 'office', 'nominee', 'republicans', 'comey', 'gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']


Top picks for topic 2
['insurers', 'federal', 'said', 'aca', 'repeal', 'senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']


Top picks for topic 3
['killed', 'reported', 'military', 'justice', 'city', 'officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']


Top picks for topic 4
['candidate', 'said', 'win', 'candidat

In [51]:
res = nmf.transform(dtm)

In [52]:
npr['topic'] = res.argmax(axis=1)

In [53]:
npr

Unnamed: 0,Article,topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3
4,"From photography, illustration and video, to d...",6
...,...,...
11987,The number of law enforcement officers shot an...,3
11988,"Trump is busy these days with victory tours,...",1
11989,It’s always interesting for the Goats and Soda...,0
11990,The election of Donald Trump was a surprise to...,4
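Since the topic numbers carry no meaning on their own, the final step is to attach human-readable names once the top words have been inspected. The sketch below uses a small stand-in frame and a purely hypothetical `topic_labels` mapping; the label names are one possible reading of the word lists, not something the model produces.

```python
import pandas as pd

# Hypothetical stand-in for the `npr` frame after the topic column is added.
articles = pd.DataFrame({
    "Article": [
        "Researchers study a new virus",
        "The president spoke to his campaign",
        "Medicaid coverage under the health law",
    ],
    "topic": [0, 1, 2],
})

# Assumed labels chosen by a person after reading each topic's top words.
topic_labels = {
    0: "health and science",
    1: "politics",
    2: "health insurance",
}

# Map each numeric topic id to its label; unmapped ids would become NaN.
articles["topic_label"] = articles["topic"].map(topic_labels)
print(articles[["topic", "topic_label"]])
```

Keeping the mapping in a separate dictionary makes it easy to revise the names if the model is refitted with a different number of topics.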
