**Import data**

In [1]:
import pandas as pd
import os

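# Read the raw file, join its lines, and start a new line at each '(CNN) - ' marker
with open('data/news/news.txt') as newsfile:
    newsdata = newsfile.read()
    newsdata = newsdata.replace('\n', '').replace('(CNN) - ', '\n')

# Write the cleaned text in one call; 'w' overwrites any stale temp.txt
with open('temp.txt', 'w') as f:
    f.write(newsdata)

# One article per line; the tab delimiter keeps commas inside the text
news = pd.read_csv('temp.txt', delimiter="\t", header=None, names=['Text'])
os.remove("temp.txt")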

print('Number of news: ', news.shape[0])
news.head()

Number of news:  60


| | Text |
|---|---|
| 0 | Lauren London broke her silence Tuesday and pa... |
| 1 | Eric Holder, the man police think fatally shot... |
| 2 | Hours after Nipsey Hussle was gunned down in t... |
| 3 | Music brought Nipsey Hussle together with his ... |
| 4 | While celebrating her victory in becoming the ... |
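The loading step above goes through a temporary file. As a minimal sketch of an alternative, the same split can be done in memory. The helper name `split_articles` and the two-article sample string are hypothetical and stand in for the real `data/news/news.txt`. Results may differ from the `read_csv` route if an article contains tab or quote characters, since `read_csv` treats those specially.

```python
import pandas as pd

def split_articles(raw, marker="(CNN) - "):
    """Split one raw text blob into a list of article strings (hypothetical helper)."""
    flat = raw.replace("\n", "")      # join lines, as the cell above does
    parts = flat.split(marker)        # one chunk per marker occurrence
    return [p.strip() for p in parts if p.strip()]  # drop empty chunks

# Hypothetical sample standing in for the contents of the news file
sample = "(CNN) - First story\nline two.(CNN) - Second story."
articles = split_articles(sample)

news_demo = pd.DataFrame({"Text": articles})
print("Number of news:", news_demo.shape[0])
```

Note that joining lines with an empty string, as the original cell does, glues the last word of one line to the first word of the next unless the source lines end in a space.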


**Create vocabulary and vectorize the documents**
* Use *TfidfVectorizer*
* Keep only words that appear in at most 80% of the documents. (max_df)
* Keep only words that appear in at least 2 documents. (min_df)
* Remove English stopwords

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=2, stop_words='english')
doc_term_matrix = tfidf_vectorizer.fit_transform(news['Text'].values.astype('U'))
doc_term_matrix

<60x2627 sparse matrix of type '<class 'numpy.float64'>'
	with 12001 stored elements in Compressed Sparse Row format>
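To make the effect of the two document-frequency filters concrete, here is a small sketch on a made-up five-document corpus (the documents are illustrative assumptions, not taken from the news data). A word that appears in every document is dropped by `max_df=0.8`, and words that appear only once are dropped by `min_df=2`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): 'trump' is in all 5 docs, 'border' and 'health' in 2 each
docs = [
    "trump border wall",
    "trump health care",
    "trump border mexico",
    "trump uber driver",
    "trump health poll",
]

vec = TfidfVectorizer(max_df=0.8, min_df=2)
X = vec.fit_transform(docs)

# vocabulary_ maps each retained term to its column index
vocab = sorted(vec.vocabulary_)
print(vocab)
print(X.shape)
```

Only `border` and `health` survive here. The fourth document contains no retained word, so its row in the matrix is all zeros.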

**Create topic model**
* Use NMF on the vectorized documents
* Divide into 5 topics. (n_components)
* Build the topic-term weight matrix, which holds a non-negative weight for every word in the vocabulary for every topic. These weights are not normalized probabilities.

In [3]:
from sklearn.decomposition import NMF

nmf = NMF(n_components=5, random_state=42)
nmf.fit(doc_term_matrix)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
    n_components=5, random_state=42, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)
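As a rough illustration of what the fitted model holds, the sketch below factorizes a small random non-negative matrix (a stand-in for the real document-term matrix) and checks the shapes of the two factors. The `init` and `max_iter` values are choices made for this toy example, not settings from the notebook. The last step shows one way to rescale each topic row so that it sums to 1, if probability-like values are wanted.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
X = rng.rand(6, 8)   # toy non-negative "document-term" matrix: 6 docs, 8 terms

model = NMF(n_components=2, init="nndsvd", random_state=42, max_iter=500)
W = model.fit_transform(X)   # document-topic weights, shape (6, 2)
H = model.components_        # topic-term weights, shape (2, 8)

print(W.shape, H.shape)

# The raw weights are non-negative but their rows need not sum to 1.
# Dividing each row by its sum gives a probability-like distribution per topic.
H_norm = H / H.sum(axis=1, keepdims=True)
print(H_norm.sum(axis=1))
```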

**Get each topic with its 10 highest-weight words**
* Use *components_* to fetch a topic.
* Use *argsort()* to sort the words by weight and fetch their indices. The sort is ascending, so the last 10 indices are the 10 heaviest words.
* Use *get_feature_names()* to retrieve the words from the vectorizer (vocabulary) using the indices. In scikit-learn 1.0 and later this method is replaced by *get_feature_names_out()*.

In [4]:
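# Fetch the vocabulary once (get_feature_names_out() on scikit-learn 1.0+)
feature_names = tfidf_vectorizer.get_feature_names()

for i, topic in enumerate(nmf.components_):
    print(f'Top 10 words for Topic {i+1}:')
    # argsort() is ascending, so the heaviest word is printed last
    for j in topic.argsort()[-10:]:
        print(feature_names[j])
    print('\n')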

Top 10 words for Topic 1:
immigration
mexico
republican
republicans
care
health
president
obamacare
border
trump


Top 10 words for Topic 2:
driver
carolina
holbrook
vehicle
rowland
ride
police
columbia
uber
josephson


Top 10 words for Topic 3:
kushner
subpoena
newbold
clearance
committee
kline
cummings
white
security
house


Top 10 words for Topic 4:
angeles
los
london
chicago
mayor
hussle
city
lightfoot
nipsey
police


Top 10 words for Topic 5:
attorney
trump
poll
obstruction
report
investigation
special
counsel
barr
mueller
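The lists above are printed in ascending order of weight, so the strongest word of each topic comes last. The sketch below shows one possible way to get the words in descending order, plus a small compatibility wrapper for the vocabulary lookup. Both helper names (`top_terms`, `feature_names`) and the toy weights are hypothetical and used only for illustration.

```python
import numpy as np

def feature_names(vectorizer):
    """Return the vocabulary as a list on both old and new scikit-learn versions."""
    if hasattr(vectorizer, "get_feature_names_out"):   # scikit-learn 1.0 and later
        return list(vectorizer.get_feature_names_out())
    return vectorizer.get_feature_names()              # older releases

def top_terms(weights, terms, n=10):
    """Return the n terms with the largest weights, heaviest first."""
    order = np.argsort(weights)[::-1][:n]   # reverse the ascending sort
    return [terms[i] for i in order]

# Toy topic row and vocabulary (made-up values)
terms = ["barr", "mueller", "poll", "report"]
weights = np.array([0.4, 0.9, 0.1, 0.3])
print(top_terms(weights, terms, n=2))
```

With the fitted objects from the notebook, a call would look like `top_terms(nmf.components_[0], feature_names(tfidf_vectorizer))`.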




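**Predict the topic for each news text**
* Use *argmax(axis=1)* to get the topic with the highest weight for each document. The resulting index is 0-based, so a value of 3 corresponds to Topic 4 in the word lists above.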

In [5]:
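# Document-topic weights: one row per article, one column per topic
news_topics = nmf.transform(doc_term_matrix)
# Assign each article the 0-based index of its highest-weight topic
news['Topic'] = news_topics.argmax(axis=1)
news.head()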

| | Text | Topic |
|---|---|---|
| 0 | Lauren London broke her silence Tuesday and pa... | 3 |
| 1 | Eric Holder, the man police think fatally shot... | 3 |
| 2 | Hours after Nipsey Hussle was gunned down in t... | 3 |
| 3 | Music brought Nipsey Hussle together with his ... | 3 |
| 4 | While celebrating her victory in becoming the ... | 3 |
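Because the `Topic` column is 0-based while the printed word lists are numbered from 1, it can help to store a matching label and the winning weight next to the index. The sketch below does this on a made-up document-topic matrix. The column names `TopicLabel` and `Weight` are assumptions introduced here, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy document-topic weights (hypothetical): 3 documents, 5 topics
doc_topic = np.array([
    [0.00, 0.02, 0.01, 0.45, 0.00],
    [0.30, 0.00, 0.00, 0.05, 0.10],
    [0.00, 0.00, 0.00, 0.00, 0.00],   # document with no retained vocabulary
])

demo = pd.DataFrame({"Text": ["doc a", "doc b", "doc c"]})
demo["Topic"] = doc_topic.argmax(axis=1)                          # 0-based index
demo["TopicLabel"] = "Topic " + (demo["Topic"] + 1).astype(str)   # matches the printed headings
demo["Weight"] = doc_topic.max(axis=1)                            # strength of the assignment
print(demo)
```

The third row shows a caveat of `argmax`: a document whose weights are all zero is still assigned index 0, so a `Weight` of 0 signals that the assignment carries no information.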
