<p style="font-family:Roboto; font-size: 28px; color: magenta"> Python for NLP: Topic Modeling</p>

In [11]:
'''
  Topic modeling is one of the most sought after research areas in NLP. 
  It is used to group large volumes of unlabeled text data.
'''


In [12]:
'''
Topic modeling is an unsupervised technique for analyzing large volumes of text data
by clustering the documents into groups.
In topic modeling, the text data does not have any labels attached to it.
Instead, the documents are grouped into clusters based on similar characteristics.
'''

<p style="font-family:consolas; font-size: 26px; color: orange; text-decoration-line: overline; "> 1: _Latent Dirichlet Allocation (LDA)</p>

In [13]:
'''
LDA is based on two general assumptions:

    Documents that have similar words usually have the same topic.
    Documents that have groups of words frequently occurring together usually have the same topic.
'''

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _LDA for Topic Modeling in Python</p>

In [14]:
'''The data set contains user reviews for different products in the food category. 
We will use LDA to group the user reviews into 5 categories.'''
import pandas as pd
import numpy as np

reviews_datasets = pd.read_csv(r'data/Reviews.csv')
reviews_datasets = reviews_datasets.head(20000)
# dropna() returns a new DataFrame; without assignment, reviews_datasets itself is unchanged
reviews_datasets.dropna()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...
...,...,...,...,...,...,...,...,...,...,...
19995,19996,B002C50X1M,A1XRXZI5KOMVDD,"KAF1958 ""amandaf0626""",0,0,4,1307664000,Crispy and tart,Deep River Salt & Vinegar chips are thick and ...
19996,19997,B002C50X1M,A7G9M0IE7LABX,Kevin,0,0,5,1307059200,Exceeded my expectations. One of the best chip...,I was very skeptical about buying a brand of c...
19997,19998,B002C50X1M,A38J5PRUDESMZF,ray,0,0,5,1305763200,"Awesome Goodness! (deep river kettle chips, sw...",Before you turn to other name brands out there...
19998,19999,B002C50X1M,A17TPOSAG43GSM,Herrick,0,0,3,1303171200,"Pretty good, but prefer other jalapeno chips","I was expecting some ""serious flavor"" as it wa..."
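
By default, pandas' `dropna()` returns a new DataFrame and leaves the original untouched, so its result has to be assigned if the cleaned rows are meant to be used later. A minimal sketch with a hypothetical three-row frame (`toy` and `cleaned` are illustrative names, not part of the dataset above):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing review, for illustration only.
toy = pd.DataFrame({"Text": ["good tea", np.nan, "great coffee"]})

# dropna() builds a new frame without the rows that contain missing values.
cleaned = toy.dropna()

print(len(toy))      # the original frame still has all of its rows
print(len(cleaned))  # the returned frame has the incomplete row removed
```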


In [15]:
'''We will apply LDA to the "Text" column, since it contains the reviews; the rest of the columns will be ignored.'''
reviews_datasets['Text'][350]

'These chocolate covered espresso beans are wonderful!  The chocolate is very dark and rich and the "bean" inside is a very delightful blend of flavors with just enough caffine to really give it a zing.'

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _create vocabulary of all the words</p>

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
'''We include only those words that appear in at most 80% of the documents and in at least 2 documents.
English stop words are removed as well.'''

count_vect = CountVectorizer(max_df=0.8, min_df=2, stop_words='english')
doc_term_matrix = count_vect.fit_transform(reviews_datasets['Text'].values.astype('U'))

In [17]:
doc_term_matrix

<20000x14546 sparse matrix of type '<class 'numpy.int64'>'
	with 594703 stored elements in Compressed Sparse Row format>

In [18]:
'''Each of the 20k documents is represented as a 14546-dimensional vector, which means that our vocabulary has 14546 words.'''

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _use LDA to create topics</p>

In [19]:
from sklearn.decomposition import LatentDirichletAllocation
'''The parameter n_components specifies the number of categories, or topics, that we want our text to be divided into'''
LDA = LatentDirichletAllocation(n_components=5, random_state=42)
LDA.fit(doc_term_matrix)

In [27]:
import random

feature_names = count_vect.get_feature_names_out()
for i in range(10):
    # randrange excludes the upper bound, so the index is always valid
    random_id = random.randrange(len(feature_names))
    print(feature_names[random_id])

bonus
severely
frog
cutesy
received
overstuffed
pedestrian
spritzers
reports
pursue


In [22]:
'''Let's find the 10 words with the highest probability for the first topic.
To get the first topic, index the components_ attribute with 0.'''
first_topic = LDA.components_[0]

In [23]:
'''To sort the indexes according to their probability values, we can use the argsort() function.
It sorts in ascending order, so the 10 words with the highest probabilities end up in the last 10 positions of the sorted array.
The following script returns the indexes of the 10 words with the highest probabilities:'''
top_topic_words = first_topic.argsort()[-10:]
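
The slicing idiom above can be checked on a small array. This is a minimal sketch with hypothetical weights over a five-word vocabulary; the variable names are illustrative.

```python
import numpy as np

# Hypothetical topic weights over a five-word vocabulary.
weights = np.array([0.1, 2.5, 0.7, 3.0, 1.2])

# argsort() returns indexes ordered from the smallest to the largest weight,
# so the last three entries point at the three largest weights.
top3_ascending = weights.argsort()[-3:]

# Reversing the slice lists the strongest word first.
top3_descending = top3_ascending[::-1]

print(top3_ascending)
print(top3_descending)
```

Because the slice keeps ascending order, the strongest word comes last in the printed lists; reversing the slice is one way to put it first.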

In [25]:
'''These indexes can then be used to look up the corresponding words in the count_vect object'''
for i in top_topic_words:
    print(count_vect.get_feature_names_out()[i])

water
great
just
drink
sugar
good
flavor
taste
like
tea


In [None]:
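'''The words show that the first topic might be about tea.
Let's now print the 10 words with the highest probabilities for each of the topics.'''
feature_names = count_vect.get_feature_names_out()
for i, topic in enumerate(LDA.components_):
    print(f'Top 10 words for topic #{i}:')
    print([feature_names[index] for index in topic.argsort()[-10:]])
    print('\n')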
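
The rows of `components_` hold unnormalized topic-word weights rather than probabilities that sum to one. A minimal sketch, using a small hypothetical matrix in place of `LDA.components_`, of how each row could be normalized into a per-topic word distribution:

```python
import numpy as np

# Hypothetical stand-in for LDA.components_: 2 topics x 4 words.
components = np.array([
    [8.0, 1.0, 0.5, 0.5],
    [1.0, 1.0, 6.0, 2.0],
])

# Dividing each row by its sum gives a distribution over words per topic.
topic_word_probs = components / components.sum(axis=1, keepdims=True)

print(topic_word_probs)
print(topic_word_probs.sum(axis=1))
```

The ranking of words within a topic is the same before and after normalization, so the top-10 lists do not change.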

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _add a column to the original data frame that will store the topic for the text.</p> 

In [None]:
'''We can use the LDA.transform() method and pass it our document-term matrix.
This method assigns a probability for every topic to each document.'''
topic_values = LDA.transform(doc_term_matrix)
topic_values.shape

In [None]:
'''In the output, you will see (20000, 5), which means that each document has 5 columns,
where each column corresponds to the probability of a particular topic.'''
'''
To find the topic index with the maximum value, we can call the argmax() method and pass 1 as the value for the axis parameter.
'''
reviews_datasets['Topic'] = topic_values.argmax(axis=1)
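
The effect of `argmax(axis=1)` can be seen on a small hypothetical document-topic matrix. The numbers below are invented for illustration and are not output of the model above.

```python
import numpy as np

# Hypothetical document-topic matrix: 3 documents x 5 topics.
toy_topic_values = np.array([
    [0.05, 0.70, 0.10, 0.10, 0.05],
    [0.60, 0.10, 0.10, 0.10, 0.10],
    [0.10, 0.10, 0.10, 0.15, 0.55],
])

# axis=1 scans across the columns of each row, giving one topic index per document.
dominant_topic = toy_topic_values.argmax(axis=1)

# The winning value itself hints at how clear-cut each assignment is.
dominant_score = toy_topic_values.max(axis=1)

print(dominant_topic)
print(dominant_score)
```

Keeping the winning score next to the topic index is optional, but it makes it easier to spot documents that do not clearly belong to any single topic.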

In [None]:
reviews_datasets.head()

<p style="font-family:consolas; font-size: 26px; color: orange; text-decoration-line: overline; "> 2: _Non-Negative Matrix Factorization (NMF)</p>

In [None]:
'''
Non-negative matrix factorization is also an unsupervised learning technique, which performs clustering
as well as dimensionality reduction.
It can be used in combination with the TF-IDF scheme to perform topic modeling.
'''
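
NMF approximates a non-negative matrix V by the product of two smaller non-negative matrices, W (documents by topics) and H (topics by terms). A minimal sketch on a hypothetical 4x4 matrix; the values, the `init` choice, and the `max_iter` setting are assumptions made for illustration:

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative document-term matrix: 4 documents x 4 terms.
V = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
])

toy_nmf = NMF(n_components=2, init="nndsvd", random_state=42, max_iter=500)
W = toy_nmf.fit_transform(V)  # document-topic weights, shape (4, 2)
H = toy_nmf.components_       # topic-term weights, shape (2, 4)

print(W.shape, H.shape)
print(np.round(W @ H, 2))     # approximate reconstruction of V
```

Each document is now described by 2 topic weights instead of 4 term counts, which is the dimensionality reduction, and the largest weight in each row of W can serve as the document's cluster.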

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _NMF for Topic Modeling in Python</p> 

In [21]:
'''In the previous section we used the count vectorizer,
but in this section we will use the TF-IDF vectorizer, since NMF works with TF-IDF features'''
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(max_df=0.8, min_df=2, stop_words='english')
doc_term_matrix = tfidf_vect.fit_transform(reviews_datasets['Text'].values.astype('U'))
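
Unlike raw counts, TF-IDF gives a lower weight to words that occur in many documents. A minimal sketch on a hypothetical three-document corpus, using the default `TfidfVectorizer` settings (the documents and variable names are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "tea" occurs in every document, the other words in one each.
toy_docs = ["green tea", "black tea", "iced tea"]

toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_docs)

vocab = toy_tfidf.vocabulary_  # maps each term to its column index
first_doc = toy_matrix[0].toarray().ravel()

green_weight = first_doc[vocab["green"]]
tea_weight = first_doc[vocab["tea"]]

print(green_weight, tea_weight)
```

In the first document both words occur once, yet `green` receives the larger weight because it is rarer across the corpus than `tea`.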

In [28]:
from sklearn.decomposition import NMF

nmf = NMF(n_components=5, random_state=42)
nmf.fit(doc_term_matrix )

In [29]:
feature_names = tfidf_vect.get_feature_names_out()
for i in range(10):
    # randrange excludes the upper bound, so the index is always valid
    random_id = random.randrange(len(feature_names))
    print(feature_names[random_id])

crunchie
weighing
edmund
pepsin
compliments
apt
000
alergic
unremarkable
5x


In [30]:
'''Next, we will retrieve the vector of word weights for the first topic and
the indexes of the ten words with the highest weights:'''
first_topic = nmf.components_[0]
top_topic_words = first_topic.argsort()[-10:]

In [32]:
for i in top_topic_words:
    print(tfidf_vect.get_feature_names_out()[i])

really
chocolate
love
flavor
just
product
taste
great
good
like


In [33]:
'''
The words for the first topic (topic #0) suggest that it might contain reviews for chocolates.
Let's now print the ten words with the highest weights for each of the topics:
'''

In [36]:
for i,topic in enumerate(nmf.components_):
    print(f'Top 10 words for topic #{i}:')
    print([tfidf_vect.get_feature_names_out()[i] for i in topic.argsort()[-10:]])
    print('\n')

Top 10 words for topic #0:
['really', 'chocolate', 'love', 'flavor', 'just', 'product', 'taste', 'great', 'good', 'like']


Top 10 words for topic #1:
['like', 'keurig', 'roast', 'flavor', 'blend', 'bold', 'strong', 'cups', 'cup', 'coffee']


Top 10 words for topic #2:
['com', 'amazon', 'orange', 'switch', 'water', 'drink', 'soda', 'sugar', 'juice', 'br']


Top 10 words for topic #3:
['bags', 'flavor', 'drink', 'iced', 'earl', 'loose', 'grey', 'teas', 'green', 'tea']


Top 10 words for topic #4:
['old', 'love', 'cat', 'eat', 'treat', 'loves', 'dogs', 'food', 'treats', 'dog']




In [38]:
'''
The words for topic 1 show that this topic contains reviews about coffee.
Similarly, the words for topic 2 suggest that it contains reviews about sodas and juices.
Topic 3 again contains reviews about drinks, this time tea.
Finally, topic 4 may contain reviews about pet food, since it contains words such as "cat", "dog", "treat", etc.
'''
topic_values = nmf.transform(doc_term_matrix)
reviews_datasets['Topic'] = topic_values.argmax(axis=1)
reviews_datasets.head(3)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,Topic
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...,4
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,0
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...,4
