**Import data**

In [1]:
import pandas as pd
import os

with open('data/news/news.txt') as newsfile:
    newsdata = newsfile.read()
    newsdata = newsdata.replace('\n', '').replace('(CNN) - ', '\n')

# Write the reshaped text in one call; mode 'w' avoids appending to a leftover temp.txt
with open('temp.txt', 'w') as f:
    f.write(newsdata)
        
news = pd.read_csv('temp.txt', delimiter="\t", header=None, names=['Text'])
os.remove("temp.txt")

print('Number of news: ', news.shape[0])
news.head()

Number of news:  60


Unnamed: 0,Text
0,Lauren London broke her silence Tuesday and pa...
1,"Eric Holder, the man police think fatally shot..."
2,Hours after Nipsey Hussle was gunned down in t...
3,Music brought Nipsey Hussle together with his ...
4,While celebrating her victory in becoming the ...
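The same split can be sketched without a temporary file. This is only an illustration on a hypothetical in-memory string (`raw` stands in for the contents of `data/news/news.txt`), and it assumes every article starts with the `(CNN) - ` marker. Unlike `read_csv`, it does not interpret tabs or quote characters, so results on the real file may differ slightly.

```python
import pandas as pd

# Hypothetical stand-in for the contents of data/news/news.txt
raw = "(CNN) - First story,\nstill first.\n(CNN) - Second story."

# Drop the line breaks, then treat each '(CNN) - ' marker as the start of an article
flat = raw.replace('\n', '')
articles = [part.strip() for part in flat.split('(CNN) - ') if part.strip()]

# One row per article, in a single 'Text' column
news_demo = pd.DataFrame({'Text': articles})
print('Number of news: ', news_demo.shape[0])
```

Filtering out empty parts removes the blank entry that appears before the first marker.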


**Vectorize documents with the vocabulary matrix of all words**
* Use *CountVectorizer*
* Keep only words that appear in at most 80% of the documents (max_df=0.8)
* Keep only words that appear in at least 2 documents (min_df=2)
* Remove English stopwords

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(max_df=0.8, min_df=2, stop_words='english')
doc_term_matrix = count_vectorizer.fit_transform(news['Text'].values.astype('U'))
doc_term_matrix

<60x2627 sparse matrix of type '<class 'numpy.int64'>'
	with 12001 stored elements in Compressed Sparse Row format>
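To see what the two document-frequency cutoffs do, here is a minimal sketch on a hypothetical five-document corpus (the `docs` list is invented for illustration). "police" occurs in every document and is dropped by `max_df=0.8`; "vote", "budget" and "trump" occur in only one document each and are dropped by `min_df=2`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the news data
docs = [
    "police report police",
    "police mayor city",
    "police mayor vote",
    "police city budget",
    "police trump report",
]

vec = CountVectorizer(max_df=0.8, min_df=2, stop_words='english')
X = vec.fit_transform(docs)

# vocabulary_ maps each kept term to its column index
vocab = sorted(vec.vocabulary_)
print(vocab)
print(X.shape)
```

Only terms with a document frequency between 2 documents and 80% of the corpus survive, which is why the vocabulary shrinks to the mid-frequency words.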

**Create topic model**
* Use LDA on the vectorized documents
* Divide into 5 topics. (n_components)
* Learn, for each topic, a weight distribution over the words in the vocabulary

In [3]:
from sklearn.decomposition import LatentDirichletAllocation

LDA = LatentDirichletAllocation(n_components=5, random_state=42)
LDA.fit(doc_term_matrix)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=5, n_jobs=None,
                          perp_tol=0.1, random_state=42, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)
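The fitted `components_` array holds one row of word weights per topic. These weights are pseudo-counts, so dividing each row by its sum gives a per-topic word distribution. The sketch below uses a hypothetical random count matrix (`counts`) and a separate model (`lda_demo`) so that it runs on its own.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical count matrix: 20 documents x 8 terms
rng = np.random.RandomState(0)
counts = rng.randint(0, 5, size=(20, 8))

lda_demo = LatentDirichletAllocation(n_components=3, random_state=42)
lda_demo.fit(counts)

# Normalize each topic row so that its weights sum to 1
topic_word = lda_demo.components_ / lda_demo.components_.sum(axis=1, keepdims=True)
print(topic_word.shape)
print(topic_word.sum(axis=1))
```

Ranking words within a topic gives the same order before and after normalization, so the raw weights are enough when only the top words are needed.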

**Get each topic with its top 10 highest probability words**
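* Use *components_* to fetch the word weights of each topic (one row per topic; unnormalized pseudo-counts rather than probabilities).
* Use *argsort()* to sort the words by weight and take the indices of the last 10; these come back in ascending order, so the heaviest word is printed last.
* Use *get_feature_names()* to map the indices back to words in the vectorizer vocabulary (on scikit-learn 1.2 and later, where this method was removed, use *get_feature_names_out()*).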

In [4]:
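# Look up the vocabulary once instead of on every inner iteration
feature_names = count_vectorizer.get_feature_names()

for i, topic in enumerate(LDA.components_):
    print(f'Top 10 words for Topic {i+1}:')
    # argsort() is ascending, so the last 10 indices are the highest-weight words
    for j in topic.argsort()[-10:]:
        print(feature_names[j])
    print('\n')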

Top 10 words for Topic 1:
carolina
rowland
vehicle
time
cnn
ride
columbia
police
uber
josephson


Top 10 words for Topic 2:
counsel
justice
border
president
poll
report
investigation
barr
mueller
trump


Top 10 words for Topic 3:
think
step
prison
act
lightfoot
city
president
mayor
buttigieg
people


Top 10 words for Topic 4:
kline
newbold
told
committee
nipsey
police
cummings
security
white
house


Top 10 words for Topic 5:
white
republican
women
border
care
cnn
health
house
president
trump
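Because `argsort()` sorts in ascending order, the lists above end with the strongest word. A small sketch of listing the heaviest words first, using hypothetical weights and a five-word vocabulary invented for illustration:

```python
import numpy as np

# Hypothetical vocabulary and topic weights
vocab = np.array(['border', 'house', 'mueller', 'police', 'trump'])
weights = np.array([0.5, 2.0, 1.0, 0.1, 3.0])

# Reverse the ascending argsort, then keep the first 3 indices
top_idx = np.argsort(weights)[::-1][:3]
top3 = vocab[top_idx]
print(list(top3))
```

Applied to a row of `components_`, the same indexing would print each topic's words from most to least heavily weighted.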




**Predict Topic for news text**
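* Use *transform()* to get the topic distribution of each document
* Use *argmax(axis=1)* to get the topic with max probability; the stored index is zero-based, so Topic 3 in this table corresponds to "Topic 4" in the word lists printed earlier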

In [5]:
news_topics = LDA.transform(doc_term_matrix)
news_topics.shape
news['Topic'] = news_topics.argmax(axis=1)
news.head()

Unnamed: 0,Text,Topic
0,Lauren London broke her silence Tuesday and pa...,3
1,"Eric Holder, the man police think fatally shot...",3
2,Hours after Nipsey Hussle was gunned down in t...,3
3,Music brought Nipsey Hussle together with his ...,3
4,While celebrating her victory in becoming the ...,2
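The zero-based index can be shifted to match the 1-based topic numbers printed earlier, and the winning probability can be kept next to it. The sketch below uses a hypothetical document-topic matrix (`doc_topic`) in place of the output of `LDA.transform`, and the column names `TopicLabel` and `TopicProb` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic probabilities: 3 documents x 3 topics
doc_topic = np.array([
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
])

demo = pd.DataFrame({'Topic': doc_topic.argmax(axis=1)})
demo['TopicLabel'] = demo['Topic'] + 1        # 1-based, like the printed topic numbers
demo['TopicProb'] = doc_topic.max(axis=1)     # probability of the winning topic
print(demo)
```

Keeping the probability makes it easier to spot documents whose top topic wins only narrowly.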
