# Topic Modelling Using Non-Negative Matrix Factorization

## Data reading and inspection

Let's load the required libraries.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option("display.max_colwidth", 200)

We will use the 20 Newsgroups dataset, which scikit-learn can download for us.

In [2]:
from sklearn.datasets import fetch_20newsgroups

dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data
len(documents)

11314

In [4]:
dataset.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

The training subset (the default for `fetch_20newsgroups`) has 11,314 text documents distributed across 20 newsgroups.

<br>

## Data Preprocessing

To start, we will clean the text as much as possible. We first replace every character that is not a letter or `#` with a space, using the regex `[^a-zA-Z#]`; this removes punctuation, digits, and other special characters in one step. Next we drop words of three characters or fewer, because they usually carry little information. Finally, we convert all text to lowercase so that the same word in different cases is treated as one term.

In [5]:
news_df = pd.DataFrame({'document':documents})

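# remove everything except letters and '#'
# (regex=True is required in pandas >= 2.0, where str.replace defaults to literal matching)
news_df['clean_doc'] = news_df['document'].str.replace("[^a-zA-Z#]", " ", regex=True)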

# removing short words
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))

# make all text lowercase
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: x.lower())
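
To see what these three steps do, the sketch below applies them to a single made-up sentence. The sample text and the `clean_text` helper are illustrative assumptions, not part of the notebook, and the sketch uses Python's `re` module instead of pandas so that it runs on its own.

```python
import re

def clean_text(text, min_len=4):
    """Hypothetical helper mirroring the three cleaning steps above."""
    # 1. replace everything except letters and '#' with a space
    text = re.sub(r"[^a-zA-Z#]", " ", text)
    # 2. keep only words with at least min_len characters (same as len(w) > 3)
    text = " ".join(w for w in text.split() if len(w) >= min_len)
    # 3. lowercase the result
    return text.lower()

sample = "Call me at 555-1234: the SCSI drive (2GB) is for sale!!"
cleaned = clean_text(sample)
print(cleaned)  # call scsi drive sale
```

Note that the length filter also discards short but meaningful tokens such as "GB", which is the price of this simple heuristic.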

In [6]:
from nltk.corpus import stopwords
# requires a one-time nltk.download('stopwords')
stop_words = set(stopwords.words('english'))  # a set makes membership tests fast

# tokenization
tokenized_doc = news_df['clean_doc'].apply(lambda x: x.split()) 

# remove stop-words
tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])

# de-tokenization: join each token list back into a single string
news_df['clean_doc'] = tokenized_doc.apply(' '.join)
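
The tokenize, filter, and re-join pattern above can be sketched on one sentence. The five-word stop list here is a made-up stand-in for NLTK's full English list, chosen only so the example runs without downloading anything.

```python
# a tiny illustrative stop-word set; the notebook uses NLTK's English list instead
stop_words = {"this", "that", "with", "have", "from"}

doc = "this drive works with that controller"
tokens = doc.split()                                # tokenization
kept = [t for t in tokens if t not in stop_words]   # stop-word removal
result = " ".join(kept)                             # de-tokenization
print(result)  # drive works controller
```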

## Document-Term Matrix

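Building a document-term matrix is the first step towards topic modeling. We will use scikit-learn's `TfidfVectorizer` to build one restricted to the 1,000 most frequent terms, ignoring any term that appears in more than half of the documents (`max_df=0.5`).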

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', 
                             max_features= 1000, # keep top 1000 terms 
                             max_df = 0.5, 
                             smooth_idf=True)

X = vectorizer.fit_transform(news_df['clean_doc'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
X_feature_names = vectorizer.get_feature_names_out()

X.shape # check shape of the document-term matrix

(11314, 1000)
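
Each entry of this matrix is a TF-IDF weight. The sketch below reproduces the calculation by hand on a made-up 3 x 4 count matrix, assuming the vectorizer's documented defaults: raw counts as term frequency, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The terms and counts are invented for illustration.

```python
import numpy as np

# toy document-term count matrix: 3 documents x 4 terms (made-up counts)
terms = ["drive", "game", "scsi", "team"]
counts = np.array([
    [2, 0, 1, 0],
    [0, 3, 0, 1],
    [1, 0, 0, 0],
], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf (smooth_idf=True)
tfidf = counts * idf                        # weight raw counts by idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalise each row

print(dict(zip(terms, idf.round(3))))
print(tfidf.round(3))
```

"drive" occurs in two of the three documents, so it receives a lower idf than the terms that occur in only one.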

In [9]:
from sklearn.decomposition import NMF

# number of topics
num_topics = 20

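# Note: `alpha` was removed in scikit-learn 1.2 in favour of `alpha_W` and `alpha_H`,
# whose penalties are scaled differently, so 0.1 does not carry over unchanged.
nmf = NMF(n_components=num_topics, random_state=1, 
          alpha=.1, l1_ratio=.5, init='nndsvd').fit(X)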

## Display and Evaluate Topics

In [14]:
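def display_topics(model, feature_names, no_top_words):
    # each row of model.components_ holds one topic's weight for every term
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}: ")
        # argsort is ascending, so slice from the end to get the heaviest terms first
        top_indices = topic.argsort()[:-no_top_words - 1:-1]
        print(" ".join(feature_names[i] for i in top_indices))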
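
The slice `[:-no_top_words - 1:-1]` is terse, so here is a small worked example of what it selects. The weights and the five-term vocabulary are made up for illustration.

```python
import numpy as np

feature_names = ["apple", "banana", "cherry", "date", "elder"]
topic = np.array([0.1, 0.7, 0.3, 0.9, 0.0])   # one made-up row of components_
no_top_words = 3

order = topic.argsort()                    # indices sorted by ascending weight
top = order[:-no_top_words - 1:-1]         # walk backwards from the end, 3 steps
top_words = [feature_names[i] for i in top]

print(order.tolist())   # [4, 0, 2, 1, 3]
print(top.tolist())     # [3, 1, 2]
print(top_words)        # ['date', 'banana', 'cherry']
```

An equivalent and arguably more readable form is `topic.argsort()[::-1][:no_top_words]`.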

In [15]:
top_n_words = 10
display_topics(nmf, X_feature_names, top_n_words)

Topic 0: 
right believe make really said point want jesus things question
Topic 1: 
thanks mail advance looking info help information address email appreciated
Topic 2: 
game team year games season players play hockey league teams
Topic 3: 
drive scsi hard drives disk floppy controller internal tape cable
Topic 4: 
windows file files program version directory using running software graphics
Topic 5: 
chip clipper encryption government keys phone data escrow algorithm chips
Topic 6: 
like sounds looks look sound things thing make sure really
Topic 7: 
card video monitor cards drivers driver color memory board mode
Topic 8: 
know anybody want need program appreciated sure maybe help getting
Topic 9: 
people government rights person world country force guns society life
Topic 10: 
think wrong really pretty steve remember makes wait agree original
Topic 11: 
problem problems using apple screen fine error work worked solution
Topic 12: 
good thing pretty better year looking quality world id