<img src="./imgs/NLP_topics.png" />

# Objectives
At the end of this notebook the students should:   

* Develop a basic understanding of how to get started with text data
* Perform basic preprocessing & vectorization of text data 
* Build and interpret an NMF topic model 

Data:  
We'll work with [one million ABC News headlines](https://www.kaggle.com/code/thebrownviking20/k-means-clustering-of-1-million-headlines/data).

# Building an NLP Pipeline

For the pair problem today, we'll build a pipeline which manages the *basic* requirements for an NLP project. The goal is to build a toolbox for converting one or more strings of text into a matrix (retaining textual information along the way).

## Step 1: Read in Data

In [10]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('~/Downloads/abcnews-date-text.csv')
df.head()

| | publish_date | headline_text |
|---|---|---|
| 0 | 20030219 | aba decides against community broadcasting lic... |
| 1 | 20030219 | act fire witnesses must be aware of defamation |
| 2 | 20030219 | a g calls for infrastructure protection summit |
| 3 | 20030219 | air nz staff in aust strike for pay rise |
| 4 | 20030219 | air nz strike to affect australian travellers |


## Step 2: Vectorize (part 1)

Using one of the vectorizers below, provided by scikit-learn, **convert the `headline_text` pandas Series to a matrix**, where each row represents a document (here, one headline) and each column represents a term (a word that appears somewhere in the documents). The full collection of documents is called the "corpus", so the number of rows should match the number of rows in `df`. The set of *distinct* terms (i.e., words) in the corpus is called the "vocabulary", so the number of columns should equal the vocabulary size.

**Build the matrix such that the value at `(i,j)` is the *Count* of term (column) `j` in document (row) `i`.**

**What are the terms in this corpus?** *Hint: When using one of these vectorizers, what is the difference between `.vocabulary_` and `.get_feature_names_out()` (called `.get_feature_names()` before scikit-learn 1.0)?*

*Note: By default, the vectorizers return a sparse matrix.*

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

In [4]:
docs = df.headline_text

In [5]:
vec = CountVectorizer()

In [6]:
doc_term = vec.fit_transform(docs)

In [11]:
vec.vocabulary_

{'aba': 4665,
 'decides': 25198,
 'against': 5913,
 'community': 21254,
 'broadcasting': 15400,
 'licence': 51363,
 'act': 5288,
 'fire': 33684,
 'witnesses': 94686,
 'must': 59152,
 'be': 11516,
 'aware': 9888,
 'of': 62095,
 'defamation': 25374,
 'calls': 16958,
 'for': 34625,
 'infrastructure': 44473,
 'protection': 68858,
 'summit': 83464,
 'air': 6232,
 'nz': 61725,
 'staff': 81689,
 'in': 43928,
 'aust': 9583,
 'strike': 82846,
 'pay': 64904,
 'rise': 73710,
 'to': 86856,
 'affect': 5775,
 'australian': 9638,
 'travellers': 87872,
 'ambitious': 7142,
 'olsson': 62368,
 'wins': 94525,
 'triple': 88206,
 'jump': 46985,
 'antic': 7854,
 'delighted': 25653,
 'with': 94643,
 'record': 71323,
 'breaking': 15014,
 'barca': 10886,
 'aussie': 9577,
 'qualifier': 69690,
 'stosur': 82616,
 'wastes': 92973,
 'four': 34953,
 'memphis': 55926,
 'match': 54705,
 'addresses': 5422,
 'un': 89301,
 'security': 77124,
 'council': 22877,
 'over': 63341,
 'iraq': 45396,
 'australia': 9635,
 'is': 455

In [12]:
doc_term.shape

(1103665, 96687)

In [13]:
df.shape

(1103665, 2)

## Vectorize (part 2)

**Build the matrix such that the value at `(i,j)` represents a sort of *normalized frequency*,** which takes into account (a) the density of term `j` in document `i`, as well as (b) the number of documents in which that term occurs.

*Hint: Try `TfidfVectorizer`. What is this?*
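As a rough sketch of what `TfidfVectorizer` computes with its default settings, the toy example below rebuilds the weights by hand: raw counts, multiplied by a smoothed inverse document frequency, then each row scaled to unit length. The corpus and variable names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, made up for illustration
toy_docs = [
    "police probe fatal crash",
    "council backs water plan",
    "police seek crash witnesses",
]

# (a) term counts per document
counts = CountVectorizer().fit_transform(toy_docs).toarray()

# (b) number of documents each term occurs in
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF, as used by scikit-learn when smooth_idf=True (the default)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then L2-normalize each row (norm='l2' is the default)
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer().fit_transform(toy_docs).toarray()
matches = np.allclose(manual_tfidf, sklearn_tfidf)
print(matches)
```

Terms that appear in many documents get a smaller IDF, so they contribute less to each row than rare terms do.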

In [14]:
vec = TfidfVectorizer()

In [15]:
doc_term = vec.fit_transform(docs.values)

In [16]:
# Look at the first row to see what the TF-IDF weights look like
# (on scikit-learn < 1.0, use vec.get_feature_names() instead)
tfidf_df = pd.DataFrame(doc_term[0].toarray(), columns=vec.get_feature_names_out())
tfidf_df.head()



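| | 000 | 002 | 005 | 006 | 007 | 01 | 0101 | 010115 | 010213 | 010215 | ... | zydelig | zygar | zygiefs | zygier | zyl | zylvester | zynga | zyngier | zz | zzz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |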


In [7]:
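# Let's do a bit better: drop English stop words, very common and very rare
# terms, and any token that is not purely alphabetic
vec = TfidfVectorizer(stop_words='english', max_df=.75, min_df=6,
                      token_pattern=r'(?u)\b[A-Za-z]+\b')
doc_term = vec.fit_transform(docs.values)

# On scikit-learn < 1.0, use vec.get_feature_names() instead
tfidf_df = pd.DataFrame(doc_term[0].toarray(), columns=vec.get_feature_names_out())
tfidf_df.head()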



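| | aa | aaa | aaco | aacta | aamer | aami | aant | aapt | aaron | ab | ... | zsa | zuckerberg | zuckerbergs | zullo | zuma | zurich | zusak | zverev | zvonareva | zygier |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |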


## What can you do with this?

Try a few different operations, and try to **interpret their meaning and use case**:

* Calculate the correlation between documents, or between terms
* Consider bigrams or n-grams in your vectorizer
* Determine if there is multicollinearity between documents, or between terms
* Try to incorporate the `publish_date` into your analysis
* Build a Python Class to make your work repeatable
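As one possible starting point for the first two ideas, the sketch below adds bigrams via `ngram_range` and measures how similar documents are with cosine similarity. Cosine similarity is swapped in here as a common stand-in for correlation on sparse text vectors; the toy headlines and variable names are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus, made up for illustration
toy_docs = [
    "police probe fatal crash",
    "police probe fatal car crash",
    "council backs water plan",
]

# ngram_range=(1, 2) keeps single words and adds adjacent word pairs
bigram_vec = TfidfVectorizer(ngram_range=(1, 2))
bigram_matrix = bigram_vec.fit_transform(toy_docs)

# Pairwise similarity between documents (rows); use bigram_matrix.T for terms
doc_sim = cosine_similarity(bigram_matrix)

print("police probe" in bigram_vec.vocabulary_)
print(doc_sim.round(2))
```

Documents that share words and word pairs score close to 1, while documents with no terms in common score 0. On the full corpus the pairwise matrix would be far too large to compute in one go, so this is best tried on a sample.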


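### NMF

Non-negative matrix factorization (NMF) approximates the document-term matrix as the product of two smaller non-negative matrices: a document-topic matrix `W` and a topic-term matrix `H`.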

<img src= 'https://media.geeksforgeeks.org/wp-content/uploads/20210429213042/Intuition1-660x298.png'>
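The factorization pictured above can be sketched on a tiny hand-written matrix. The counts below are hypothetical (4 documents by 6 terms), and `init="nndsvda"` with a fixed `random_state` is just one reasonable choice for a reproducible toy run.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical document-term counts: 4 documents (rows) x 6 terms (columns)
toy_X = np.array([
    [2, 1, 0, 0, 0, 1],
    [1, 2, 0, 0, 1, 0],
    [0, 0, 2, 1, 0, 0],
    [0, 0, 1, 2, 0, 1],
], dtype=float)

toy_nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
toy_W = toy_nmf.fit_transform(toy_X)  # document-topic weights, shape (4, 2)
toy_H = toy_nmf.components_           # topic-term weights, shape (2, 6)

# W @ H is a low-rank, non-negative approximation of the original matrix
toy_X_hat = toy_W @ toy_H
recon_error = np.linalg.norm(toy_X - toy_X_hat)

print(toy_W.shape, toy_H.shape)
print(toy_X_hat.round(2))
```

Each row of `toy_W` says how strongly a document expresses each topic, and each row of `toy_H` says how strongly each term belongs to a topic.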

In [18]:
doc_term.shape

(1103665, 32362)

In [8]:
from sklearn.decomposition import NMF
nmf = NMF(n_components=10, init='random')
W = nmf.fit_transform(doc_term)
H = nmf.components_

# Display the fitted model
nmf

In [11]:
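vocab = vec.get_feature_names_out()  # vec.get_feature_names() on scikit-learn < 1.0
# Note: this refits the model from a fresh random start, so the topics can
# differ from the W and H computed in the previous cell
id_topic = nmf.fit_transform(doc_term)
n_top_words = 10
topic_words = {}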

for topic, comp in enumerate(nmf.components_):
    word_idx = np.argsort(comp)[::-1][:n_top_words]
    # store the words most relevant to the topic
    topic_words[topic] = [vocab[i] for i in word_idx]



In [12]:
for k,v in topic_words.items():
    print(k,v)
    print('\n')

0 ['new', 'zealand', 'cases', 'laws', 'year', 'coronavirus', 'york', 'records', 'covid', 'home']


1 ['govt', 'council', 'says', 'plan', 'water', 'health', 'urged', 'qld', 'funding', 'government']


2 ['interview', 'extended', 'michael', 'david', 'john', 'nrl', 'smith', 'james', 'ben', 'scott']


3 ['news', 'abc', 'rural', 'national', 'business', 'weather', 'market', 'sport', 'analysis', 'entertainment']


4 ['australia', 'day', 'south', 'world', 'cup', 'test', 'coronavirus', 'live', 'vs', 'china']


5 ['country', 'hour', 'nsw', 'tas', 'wa', 'vic', 'august', 'drum', 'october', 'sa']


6 ['crash', 'car', 'killed', 'dies', 'fatal', 'woman', 'road', 'driver', 'plane', 'dead']


7 ['man', 'charged', 'murder', 'missing', 'jailed', 'stabbing', 'arrested', 'guilty', 'death', 'sydney']


8 ['court', 'accused', 'face', 'murder', 'charges', 'faces', 'told', 'case', 'high', 'sex']


9 ['police', 'investigate', 'probe', 'missing', 'search', 'death', 'hunt', 'officer', 'shooting', 'seek']




### Using GloVe Embeddings to Plot Emotions

<img src="imgs/emotions.png" />