## 4 -- KEYWORD EXTRACTION

Keyword extraction automatically identifies the words and phrases that best represent a text. The extracted keywords can also serve as a starting point for topic modeling, which makes this an efficient way to get insights from large amounts of unstructured text.

### 4.1 -- Installing and Importing Dependencies

Scikit-learn is a free machine learning library for Python. It provides algorithms such as support vector machines, random forests, and k-nearest neighbours, and it is built to interoperate with Python's numerical and scientific libraries, NumPy and SciPy.

In [1]:
!pip install scikit-learn

Defaulting to user installation because normal site-packages is not writeable


The Natural Language Toolkit (NLTK) is a platform for building Python programs that work with human language data, with a focus on statistical natural language processing (NLP). It contains text processing libraries for tokenization, parsing, classification, stemming, tagging, and semantic reasoning.

In [2]:
!pip install nltk

Defaulting to user installation because normal site-packages is not writeable


Importing warnings and silencing scikit-learn's DataConversionWarning:

In [3]:
import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

Importing the basic libraries needed for the rest of the process:

In [4]:
import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.corpus import wordnet

### 4.2 -- Reading csv file to perform Keyword Extraction

In [5]:
# Read the CSV file with pandas' read_csv
df_5 = pd.read_csv('03_Sentiment_Analysis.csv')

In [6]:
df_5.head()

Unnamed: 0,SKU,PRODUCT_NAME,PRICE,PRODUCT_CATEGORY,PACK_SIZE,REVIEW_COUNT,REVIEW_DATE,REVIEW_TIME,PRICE_RATING,QUALITY_RATING,VALUE_RATING,REVIEW_CONTENT,URL,DATE_OF_CREATION,LAST_UPDATED_DATE,STATES,SENTIMENT_SCORE,SENTIMENT,REVIEW_PREPROCESSED_TEXT
0,8904417301762,Vitamin C Daily Glow Face Cream With Vitamin C...,249.0,skin,80g,1.0,2022-08-29,16:38:37,5.0,0.0,0.0,Mamaearth always wins my heart with new surpri...,https://mamaearth.in/product/vitamin-c-daily-g...,2022-08-17,2022-09-04,Rajasthan,5,Positive,mamaearth always win heart new surprise called...
1,8904417300338,Green Tea Face Wash With Green Tea & Collagen ...,399.0,skin,100ml,47.0,2022-09-02,16:34:36,5.0,0.0,0.0,"I've had acne my entire life, and this appears...",https://mamaearth.in/product/green-tea-face-wa...,2022-08-17,2022-09-01,Madhya Pradesh,5,Positive,ive acne entire life appears face wash doesnt ...
2,8904417300338,Green Tea Face Wash With Green Tea & Collagen ...,399.0,skin,100ml,47.0,2022-09-02,16:34:26,5.0,0.0,0.0,"Great cleanser, gentle and makes my face fresh...",https://mamaearth.in/product/green-tea-face-wa...,2022-08-17,2022-09-01,Karnataka,5,Positive,great cleanser gentle make face fresh clean us...
3,8904417300338,Green Tea Face Wash With Green Tea & Collagen ...,399.0,skin,100ml,47.0,2022-09-02,16:34:15,5.0,0.0,0.0,I use Mamaearth green tea range and the result...,https://mamaearth.in/product/green-tea-face-wa...,2022-08-17,2022-09-01,Goa,5,Positive,use mamaearth green tea range result shocked p...
4,8904417300338,Green Tea Face Wash With Green Tea & Collagen ...,399.0,skin,100ml,47.0,2022-09-02,16:34:01,4.0,0.0,0.0,"I have sensitive skin, and I did not experienc...",https://mamaearth.in/product/green-tea-face-wa...,2022-08-17,2022-09-01,Haryana,4,Positive,sensitive skin experience breakout using produ...


In [7]:
df_5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28228 entries, 0 to 28227
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   SKU                       28228 non-null  int64  
 1   PRODUCT_NAME              28228 non-null  object 
 2   PRICE                     28228 non-null  object 
 3   PRODUCT_CATEGORY          28228 non-null  object 
 4   PACK_SIZE                 28228 non-null  object 
 5   REVIEW_COUNT              28228 non-null  float64
 6   REVIEW_DATE               28228 non-null  object 
 7   REVIEW_TIME               28228 non-null  object 
 8   PRICE_RATING              28228 non-null  float64
 9   QUALITY_RATING            28228 non-null  float64
 10  VALUE_RATING              28228 non-null  float64
 11  REVIEW_CONTENT            28228 non-null  object 
 12  URL                       28228 non-null  object 
 13  DATE_OF_CREATION          28228 non-null  object 
 14  LAST_U

In [8]:
df_5['REVIEW_PREPROCESSED_TEXT'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 28228 entries, 0 to 28227
Series name: REVIEW_PREPROCESSED_TEXT
Non-Null Count  Dtype 
--------------  ----- 
28192 non-null  object
dtypes: object(1)
memory usage: 220.7+ KB


In [9]:
# # Data is already pre_processed that's why I've commented this out
# # Lower casing (punctuation is handled in the next cell)
# df_5['REVIEW_PREPROCESSED_TEXT'] = df_5['REVIEW_PREPROCESSED_TEXT'].apply(lambda x: " ".join(x.lower() for x in x.split()))
# df_5.REVIEW_PREPROCESSED_TEXT.head(5)

In [10]:
# # Data is already pre_processed that's why I've commented this out
# # Removing punctuation and other special characters with a regex (\w keeps letters, digits and underscores)
# df_5['REVIEW_PREPROCESSED_TEXT'] = df_5['REVIEW_PREPROCESSED_TEXT'].str.replace(r'[^\w\s]', " ", regex=True)
# df_5.REVIEW_PREPROCESSED_TEXT.head(5)

### 4.3 -- Removing Stop Words

In [11]:
# # Data is already pre_processed that's why I've commented this out
# # Lambda function for removing stopwords
# stop_words = stopwords.words('english')
# # stop_words.extend(['karen','pata','usmein','jain','samajh'])
# df_5['REVIEW_PREPROCESSED_TEXT'] = df_5['REVIEW_PREPROCESSED_TEXT'].apply(lambda x: " ".join(x for x in x.split() if x not in stop_words))
# df_5.REVIEW_PREPROCESSED_TEXT.head()

### 4.4 -- Spelling Correction

In [12]:
# Spelling correction takes a long time, so it is commented out

# from textblob import TextBlob
# df_5['REVIEW_PREPROCESSED_TEXT'] = df_5['REVIEW_PREPROCESSED_TEXT'].apply(lambda x: str(TextBlob(x).correct()))
# df_5.REVIEW_PREPROCESSED_TEXT.head()

### 4.5 -- Lemmatization

Lemmatization is one of the most common text preprocessing techniques in NLP. It reduces a given word to its dictionary form, which is called the lemma. In the reviews above, for example, "wins" becomes "win" and "makes" becomes "make".
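To make the idea concrete, here is a minimal sketch of lemmatization as a table lookup. The `LEMMAS` table and the `lemmatize` helper are hypothetical and exist only for illustration; a real lemmatizer consults a full lexical database rather than a hand-written dictionary.

```python
# Toy lemma table (hypothetical, for illustration only).
LEMMAS = {"wins": "win", "makes": "make", "products": "product"}

def lemmatize(text):
    """Replace each token with its lemma when the table knows one."""
    return " ".join(LEMMAS.get(token, token) for token in text.split())

print(lemmatize("mamaearth always wins heart"))
# mamaearth always win heart
```

Tokens that are not in the table pass through unchanged, which is also how a dictionary-based lemmatizer treats words it does not recognise.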

In [13]:
# # Data is already pre_processed that's why I've commented this out
# from nltk.stem import WordNetLemmatizer

# from textblob import TextBlob
# from textblob import Word

In [14]:
# # Data is already pre_processed that's why I've commented this out
# # Lambda function for lemmatizing each review
# df_5['REVIEW_PREPROCESSED_TEXT'] = df_5['REVIEW_PREPROCESSED_TEXT'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
# df_5.REVIEW_PREPROCESSED_TEXT.head()

### 4.6 -- SKLEARN feature_extraction module

The sklearn feature_extraction module extracts features from raw data such as text and images, in a format that machine learning algorithms can work with.

Importing TfidfVectorizer and CountVectorizer (only TfidfVectorizer is used below):

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
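The sections that follow rank terms by their mean TF-IDF score across all reviews. As a rough illustration of what that score means, the pure-Python sketch below computes TF-IDF on a toy corpus. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 row normalisation, which correspond to TfidfVectorizer's documented defaults, but it splits on whitespace only and is not a replacement for the library. The corpus and the helper names `tfidf` and `mean_scores` are made up for this example.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the review dataset).
docs = ["good product", "good face wash", "nice product"]

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

def mean_scores(rows):
    """Average each term's weight over all documents (absent terms count as 0)."""
    totals = Counter()
    for row in rows:
        totals.update(row)
    return {t: s / len(rows) for t, s in totals.items()}

rows = tfidf(docs)
scores = mean_scores(rows)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:3])
# ['product', 'good', 'nice']
```

Terms that appear in many documents with a high weight, such as "product" here, rise to the top of the ranking.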

### 4.7 -- TfidfVectorizer for unigram range

In [16]:
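# Fit TF-IDF on single words. 36 reviews are missing; fillna('') keeps
# astype('str') from turning them into the literal token 'nan'.
cv_unigram = TfidfVectorizer(ngram_range=(1,1))
Data_unigram = cv_unigram.fit_transform(df_5.REVIEW_PREPROCESSED_TEXT.fillna('').values.astype('str'))
# Mean TF-IDF score of each word across all reviews
avg_unigram = np.asarray(Data_unigram.mean(axis=0))
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
avg_unigram = pd.DataFrame(avg_unigram, columns=cv_unigram.get_feature_names_out())
avg_unigram = avg_unigram.T
avg_unigram = avg_unigram.rename(columns={0:'SCORE'})
avg_unigram['WORD'] = avg_unigram.index
avg_unigram = avg_unigram.sort_values('SCORE', ascending=False)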



In [17]:
avg_unigram

Unnamed: 0,SCORE,WORD
product,0.123058,product
good,0.111096,good
nice,0.072872,nice
like,0.055286,like
love,0.038185,love
...,...,...
usmein,0.000003,usmein
karen,0.000003,karen
pata,0.000003,pata
kah,0.000003,kah


### 4.8 -- TfidfVectorizer for trigram range

In [18]:
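# Fit TF-IDF on three-word sequences; missing reviews become empty strings
cv_trigram = TfidfVectorizer(ngram_range=(3,3))
Data_trigram = cv_trigram.fit_transform(df_5.REVIEW_PREPROCESSED_TEXT.fillna('').values.astype('str'))
# Mean TF-IDF score of each trigram across all reviews
avg_trigram = np.asarray(Data_trigram.mean(axis=0))
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
avg_trigram = pd.DataFrame(avg_trigram, columns=cv_trigram.get_feature_names_out())
avg_trigram = avg_trigram.T
avg_trigram = avg_trigram.rename(columns={0:'SCORE'})
avg_trigram['WORD'] = avg_trigram.index
avg_trigram = avg_trigram.sort_values('SCORE', ascending=False)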
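With ngram_range=(3,3), the vectorizer's features are runs of three consecutive words instead of single words. The short sketch below shows the idea on one sentence. The `word_ngrams` helper and the sample text are hypothetical, and the helper splits on whitespace only, so it is a simplification of the library's tokenizer.

```python
def word_ngrams(text, n=3):
    """Return every run of n consecutive whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

trigrams = word_ngrams("really good product love mamaearth")
print(trigrams)
# ['really good product', 'good product love', 'product love mamaearth']
```

A review with fewer than three words yields no trigrams at all, so very short reviews do not contribute to the trigram scores.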



In [19]:
avg_trigram

Unnamed: 0,SCORE,WORD
really good product,0.002986,really good product
like product much,0.001465,like product much
love mamaearth product,0.001443,love mamaearth product
best face wash,0.001438,best face wash
mama earth product,0.001434,mama earth product
...,...,...
se bahut jain,0.000003,se bahut jain
product ke main,0.000003,product ke main
hi achcha program,0.000003,hi achcha program
se kah rahe,0.000003,se kah rahe


In [20]:
# Lists of unigrams and trigrams, both in descending score order
unigram_list = avg_unigram['WORD'].tolist()
trigram_list = avg_trigram['WORD'].tolist()

In [21]:
def convert(listt):
    # Split each trigram string into a list of its three words
    return [i.split() for i in listt]

trigram_split = convert(trigram_list)

In [22]:
test = pd.DataFrame(columns=['TOPIC','SUB_TOPIC'])

In [23]:
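# For each unigram (topic), collect the first five trigrams that contain it.
# trigram_split is in descending score order, so these are the top-scoring matches.
rows = []
for j in unigram_list:
    counter = 0
    for k in trigram_split:
        if j in k:
            rows.append({'TOPIC': j, 'SUB_TOPIC': ' '.join(k)})
            counter += 1
            if counter == 5:
                break  # stop scanning once five sub-topics are found
# Build the DataFrame once instead of calling pd.concat inside the loop
test = pd.DataFrame(rows, columns=['TOPIC','SUB_TOPIC'])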
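The nested loop above compares every unigram with every trigram, which becomes slow on a large vocabulary. One possible alternative, sketched here on toy data, is to build an inverted index from each word to the trigrams that contain it and then look each unigram up directly. The `top_subtopics` helper and the sample lists are hypothetical stand-ins for unigram_list and trigram_list.

```python
from collections import defaultdict

# Hypothetical ranked inputs standing in for unigram_list and trigram_list.
unigrams = ["product", "good", "wash"]
trigrams = ["really good product", "like product much", "best face wash"]

def top_subtopics(unigrams, trigrams, limit=5):
    """Map each unigram to the first `limit` trigrams that contain it."""
    index = defaultdict(list)
    for tri in trigrams:
        # set() avoids listing a trigram twice when a word repeats in it
        for word in set(tri.split()):
            index[word].append(tri)
    return {u: index[u][:limit] for u in unigrams if u in index}

subtopics = top_subtopics(unigrams, trigrams)
print(subtopics["product"])
# ['really good product', 'like product much']
```

Because the trigrams are indexed in their ranked order, each unigram still receives its highest-scoring trigrams first.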


In [24]:
test_new = test.groupby(['TOPIC'], as_index=False, sort=False).agg({'SUB_TOPIC':', '.join})
test_new

Unnamed: 0,TOPIC,SUB_TOPIC
0,product,"really good product, like product much, love m..."
1,good,"really good product, product really good, good..."
2,nice,"really nice product, nice product love, nice f..."
3,like,"like product much, really like product, like m..."
4,love,"love mamaearth product, good product love, lov..."
...,...,...
12038,usmein,"usmein happy ki, kah rahe usmein, rahe usmein ..."
12039,karen,"jain use karen, karen iske yah, use karen iske"
12040,pata,"kiya mujhe pata, pata chala thank, mujhe pata ..."
12041,kah,"acche se kah, kah rahe usmein, se kah rahe"


In [25]:
# Optional: save the result to a separate CSV file
test_new.to_csv('04_Keyword_Extraction_using_TF-IDF.csv',index=False)

These are the steps for keyword extraction, applied to the review data from the previous section. We now have the keywords of the reviews, each paired with its top-scoring trigrams.