In this final session we will cover the following:

1) A classification model for sentiment analysis. For this we will use VADER sentiment scores (which we covered as part of Marketing Analytics).

2) Part-of-speech (POS) tagging, and then re-creating our topic model based on the POS tags.

Let's start with VADER sentiment and the classification model.

In [2]:
!pip install vaderSentiment --upgrade

Collecting vaderSentiment
  Using cached vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [3]:
# Import the library


from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


# Import the data: 

import pandas as pd

df = pd.read_excel("Amazon.xlsx")

df.head()


Unnamed: 0,id,asins,brand,categories,colors,dateAdded,dateUpdated,dimension,ean,keys,...,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.userCity,reviews.userProvince,reviews.username,sizes,upc,weight
0,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I initially had trouble deciding between the p...,"Paperwhite voyage, no regrets!",,,Cristina M,,,205 grams
1,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,Allow me to preface this with a little history...,One Simply Could Not Ask For More,,,Ricky,,,205 grams
2,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,4.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I am enjoying it so far. Great for reading. Ha...,Great for those that just want an e-reader,,,Tedd Gardiner,,,205 grams
3,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I bought one of the first Paperwhites and have...,Love / Hate relationship,,,Dougal,,,205 grams
4,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I have to say upfront - I don't like coroporat...,I LOVE IT,,,Miljan David Tanic,,,205 grams


In [4]:
# Let's create a sentiment intensity analyzer object

senti = SentimentIntensityAnalyzer()

In [5]:
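# We will work only on the review text column.
# .copy() gives us an independent DataFrame, so adding columns later
# does not trigger pandas' SettingWithCopyWarning.

df_text = df[['reviews.text']].copy()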

df_text.head()

Unnamed: 0,reviews.text
0,I initially had trouble deciding between the p...
1,Allow me to preface this with a little history...
2,I am enjoying it so far. Great for reading. Ha...
3,I bought one of the first Paperwhites and have...
4,I have to say upfront - I don't like coroporat...


In [6]:
# Let's create a user-defined function using a lambda.
# It returns VADER's compound score for a piece of text.

function = lambda x : senti.polarity_scores(x)['compound']

In [7]:
# Let's apply the function to our data frame

df_text['Polarity'] = df_text['reviews.text'].apply(function)

df_text.head()


Unnamed: 0,reviews.text,Polarity
0,I initially had trouble deciding between the p...,0.9882
1,Allow me to preface this with a little history...,0.9886
2,I am enjoying it so far. Great for reading. Ha...,0.4364
3,I bought one of the first Paperwhites and have...,0.9755
4,I have to say upfront - I don't like coroporat...,0.998


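Now we can define the buckets as follows. Each bucket includes its upper boundary (for example, a score of exactly 0.2 is Neutral and a score of exactly 0.5 is Pos):

-1 to -0.5 --------> Very Neg

-0.5 to -0.2 ------> Neg

-0.2 to +0.2 ------> Neutral (labelled 'Net' in the code)

+0.2 to +0.5 ------> Pos

+0.5 to +1 --------> Very Pos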

In [8]:
# Creating a new column of Bucket based on above bucket values

import numpy as np

df_text['Bucket'] = np.where(df_text['Polarity']> 0.5 , 'Very Pos', 
                            np.where((df_text['Polarity'] <= 0.5) & (df_text['Polarity'] > 0.2), 'Pos',
                                    np.where((df_text['Polarity']<=0.2) & (df_text['Polarity'] > -0.2), 'Net',
                                            np.where((df_text['Polarity']<=-0.2) & (df_text['Polarity']> -0.5), 'Neg',
                                                    np.where(df_text['Polarity']<= -0.5, 'Very Neg', 'NA')))))


df_text.head()



Unnamed: 0,reviews.text,Polarity,Bucket
0,I initially had trouble deciding between the p...,0.9882,Very Pos
1,Allow me to preface this with a little history...,0.9886,Very Pos
2,I am enjoying it so far. Great for reading. Ha...,0.4364,Pos
3,I bought one of the first Paperwhites and have...,0.9755,Very Pos
4,I have to say upfront - I don't like coroporat...,0.998,Very Pos
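The same bucketing can also be sketched with pd.cut instead of nested np.where calls. This is only an illustration: the polarity values below are hypothetical stand-ins for df_text['Polarity'], and the bin edges and labels are taken from the bucket definitions in this session.

```python
import pandas as pd

# Hypothetical compound scores standing in for df_text['Polarity'].
polarity = pd.Series([0.9882, 0.4364, 0.2, -0.2, -0.35, -0.5, -1.0])

# Bin edges and labels follow the bucket definitions used in this session.
bins = [-1.0, -0.5, -0.2, 0.2, 0.5, 1.0]
labels = ['Very Neg', 'Neg', 'Net', 'Pos', 'Very Pos']

# right=True makes every bin (lower, upper]; include_lowest=True keeps -1.0.
bucket = pd.cut(polarity, bins=bins, labels=labels,
                right=True, include_lowest=True)

print(bucket.tolist())
```

Keeping the edges and labels in two short lists makes it easier to see (and change) where one bucket ends and the next begins.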


In [9]:
# Let's count how many reviews fall in each bucket

df_text.groupby('Bucket').size().reset_index()

Unnamed: 0,Bucket,0
0,Neg,35
1,Net,96
2,Pos,95
3,Very Neg,63
4,Very Pos,1308


Let's build a classification model on these review sentiments, where Bucket is the target variable and reviews.text is the input variable.

In [10]:
# Step 1: Define your X and Y

X = df_text['reviews.text']
Y = df_text['Bucket']

# Step 2: Split the data into training and test

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.2,
                                                   random_state = 1234)

len(X_train), len(X_test), len(Y_train), len(Y_test)

(1277, 320, 1277, 320)
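Because the buckets are very unevenly sized, a plain random split can leave the small buckets under-represented in the test set. One possible variation, sketched below on hypothetical toy data (the review strings and the 90/10 class ratio are made up for illustration), is to pass stratify=Y so that both parts keep roughly the same class ratio.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 90 'Very Pos' reviews and 10 'Neg' reviews.
X = pd.Series([f"review {i}" for i in range(100)])
Y = pd.Series(['Very Pos'] * 90 + ['Neg'] * 10)

# stratify=Y asks the splitter to keep the class ratio in both parts.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=1234, stratify=Y
)

print(Y_train.value_counts().to_dict())
print(Y_test.value_counts().to_dict())
```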

As X_train (my input variable) is text data, we must first convert it into a document-term matrix (DTM).

In [11]:
# Step 3: Let's create the TF-IDF DTM

from sklearn.feature_extraction.text import TfidfVectorizer

# Step 3.1: Create vector object

tfidf_vec = TfidfVectorizer()

In [12]:
# Step 3.2: I will fit the vector object on my X_train to create the DTM

X_train_dtm = tfidf_vec.fit_transform(X_train)

X_train_dtm

<1277x5854 sparse matrix of type '<class 'numpy.float64'>'
	with 114807 stored elements in Compressed Sparse Row format>

You can use any classification method, such as SVM, Multinomial Naive Bayes, random forest (RF), decision tree (DT), or a deep learning model trained with backpropagation.

For the demo I will use a simple decision tree.

In [13]:
# Step 4: Creating the model object

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier()

# Step 5: Fit the model on X_train_dtm, Y_train

model = tree.fit(X_train_dtm, Y_train)

model

DecisionTreeClassifier()

In [14]:
# Step 6: Use your X_test data to find the model accuracy.

# Step 6.1: You must also convert X_test into a DTM.

# NOTE: Use the same vectorizer object that was fitted on X_train.
# NOTE: Use transform (not fit_transform) to create the test DTM, so that
# the test DTM has the same columns as the training DTM.

X_test_dtm = tfidf_vec.transform(X_test)

X_test_dtm

<320x5854 sparse matrix of type '<class 'numpy.float64'>'
	with 28742 stored elements in Compressed Sparse Row format>
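The difference between fit_transform and transform can be seen on a tiny example. The two corpora below are hypothetical stand-ins for X_train and X_test; the point is only that the test DTM gets the same number of columns as the training DTM, and that words never seen in training are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora standing in for X_train and X_test.
train_docs = ["great kindle", "bad battery", "great screen"]
test_docs = ["great tablet", "terrible battery life"]

vec = TfidfVectorizer()
train_dtm = vec.fit_transform(train_docs)  # learns the vocabulary and IDF weights
test_dtm = vec.transform(test_docs)        # reuses them; unseen words are dropped

print(sorted(vec.vocabulary_))
print(train_dtm.shape, test_dtm.shape)
```

Calling fit_transform on the test data instead would build a new vocabulary, and the model could no longer be applied to the resulting matrix.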

In [15]:
# Step 7: Let's predict the bucket using X_test_dtm

Y_pred = model.predict(X_test_dtm)

Y_pred

array(['Very Pos', 'Neg', 'Very Pos', 'Pos', 'Very Pos', 'Very Pos',
       'Very Pos', 'Very Pos', 'Very Pos', 'Very Pos', 'Very Pos',
       'Very Pos', 'Very Pos', 'Very Pos', 'Very Pos', 'Very Pos',
       'Very Pos', 'Very Pos', 'Very Pos', 'Very Pos', 'Very Pos', 'Net',
       'Very Pos', 'Very Pos', 'Very Pos', 'Very Neg', 'Very Pos',
       'Very Pos', 'Neg', 'Very Pos', 'Very Pos', 'Very Pos', 'Very Pos',
       'Neg', 'Very Pos', 'Very Pos', 'Very Pos', 'Very Pos', 'Very Pos',
       'Very Pos', 'Very Pos', 'Very Pos', 'Very Pos', 'Very Pos',
       'Very Pos', 'Pos', 'Very Pos', 'Very Pos', 'Very Pos', 'Very Pos',
       'Very Pos', 'Net', 'Very Pos', 'Very Pos', 'Very Pos', 'Very Pos',
       'Very Pos', 'Very Pos', 'Very Pos', 'Very Pos', 'Pos', 'Very Pos',
       'Net', 'Net', 'Very Pos', 'Very Pos', 'Very Pos', 'Very Pos',
       'Very Pos', 'Neg', 'Net', 'Very Pos', 'Very Pos', 'Very Pos',
       'Very Neg', 'Pos', 'Very Pos', 'Very Pos', 'Very Pos', 'Very Pos',
       

In [16]:
# Step 8: Create the confusion matrix between Y_test and Y_pred

pd.crosstab(Y_test, Y_pred)

Bucket (actual) \ Predicted,Neg,Net,Pos,Very Neg,Very Pos
Neg,1,1,1,0,6
Net,2,8,1,0,5
Pos,0,5,3,0,7
Very Neg,2,2,2,5,4
Very Pos,3,10,9,6,237


In [17]:
# Accuracy = correctly classified reviews (the diagonal of the matrix) / all test reviews

acc = (1+8+3+5+237)/320

acc

0.79375
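Rather than typing the diagonal by hand, the accuracy can be computed from the matrix itself. The sketch below copies the crosstab values shown in this session into a NumPy array (rows are actual buckets, columns are predicted buckets) and also computes two extra figures that are not in the original notebook: the accuracy of always predicting the largest bucket, and the recall per bucket.

```python
import numpy as np

# Confusion matrix copied from the crosstab output
# (rows = actual, columns = predicted),
# in the order Neg, Net, Pos, Very Neg, Very Pos.
cm = np.array([
    [1,  1, 1, 0,   6],
    [2,  8, 1, 0,   5],
    [0,  5, 3, 0,   7],
    [2,  2, 2, 5,   4],
    [3, 10, 9, 6, 237],
])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
baseline = cm.sum(axis=1).max() / cm.sum()  # always predicting the largest bucket
recall = np.diag(cm) / cm.sum(axis=1)       # share of each actual bucket found

print(accuracy, baseline)
print(recall.round(2))
```

With these numbers the diagonal sums to 254 of 320 reviews, while 265 of the 320 test reviews are Very Pos, so overall accuracy alone says little about how well the small buckets are classified.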

"What a horrible product. I was ashamed usign this product. Go to hell"

In [25]:
# Let's try to classify a new review using the model

val_data = pd.DataFrame({'reviews.text': ["Extremely Superb product !!!!"]})

val_data

Unnamed: 0,reviews.text
0,Extremely Superb product !!!!


In [26]:
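# Let's convert the review text into a DTM.
# NOTE: pass the text column, not the whole DataFrame. Iterating over a
# DataFrame yields its column names, so tfidf_vec.transform(val_data) would
# vectorize the string 'reviews.text' instead of the review itself (the
# output recorded below came from that call).

val_data_dtm = tfidf_vec.transform(val_data['reviews.text'])

# Let's use the model to predict the review

val_pred = model.predict(val_data_dtm)

val_pred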

array(['Net'], dtype=object)
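One thing worth checking when predicting from a DataFrame is what the vectorizer actually receives. scikit-learn vectorizers iterate over their input, and iterating over a DataFrame yields column names rather than cell values. A small pandas-only sketch:

```python
import pandas as pd

val_data = pd.DataFrame({'reviews.text': ["Extremely Superb product !!!!"]})

# Iterating over a DataFrame yields its column labels, not the cell values.
from_frame = list(val_data)
print(from_frame)

# Iterating over the column (a Series) yields the review strings.
from_column = list(val_data['reviews.text'])
print(from_column)
```

So the text column (or a plain list of strings) is the safer thing to pass to transform.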

We often see this kind of misclassification because of class imbalance in the data (here, 1308 of the 1597 reviews are Very Pos), which can be addressed with techniques such as SMOTE.

For text analytics prediction models, we often prefer deep learning algorithms, which offer a good number of control parameters for compensating for imbalanced data (to a good extent).
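As a lighter alternative to resampling (this is class weighting, not SMOTE), rare buckets can be given a larger weight during training. The sketch below computes the "balanced" weights, n_samples / (n_classes * class_count), for the bucket counts found earlier in this session. In scikit-learn, passing class_weight='balanced' to DecisionTreeClassifier applies the same heuristic; whether it helps on this data set is an assumption that would need to be tested.

```python
# Bucket counts from the groupby output in this session.
counts = {'Neg': 35, 'Net': 96, 'Pos': 95, 'Very Neg': 63, 'Very Pos': 1308}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: weight = n_samples / (n_classes * class_count).
weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}

for label, w in weights.items():
    print(f"{label:>8}: {w:.2f}")
```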

POS (part-of-speech) tagging processes a sequence of words and attaches a part-of-speech tag to each word.

The list of POS tags can be found at the link below:

https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [32]:
# You will need the following downloads

import nltk

nltk.download('punkt')

nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/amitchoudhary/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/amitchoudhary/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [38]:
# Lets import the libraries

import nltk

from nltk import pos_tag, word_tokenize

# Step 1: Break your sentence into words using word_tokenize

text = word_tokenize("Hello welcome to the world of learning categorizing and POS tagging with NLTK and Python")

text

['Hello',
 'welcome',
 'to',
 'the',
 'world',
 'of',
 'learning',
 'categorizing',
 'and',
 'POS',
 'tagging',
 'with',
 'NLTK',
 'and',
 'Python']

In [52]:
# Step 2: Use pos_tag to attach a POS tag to each word

nltk.pos_tag(text)

[('Hello', 'NNP'),
 ('welcome', 'NN'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('world', 'NN'),
 ('of', 'IN'),
 ('learning', 'VBG'),
 ('categorizing', 'VBG'),
 ('and', 'CC'),
 ('POS', 'NNP'),
 ('tagging', 'VBG'),
 ('with', 'IN'),
 ('NLTK', 'NNP'),
 ('and', 'CC'),
 ('Python', 'NNP')]

In [64]:
# Let's see how to filter all the proper nouns (tag NNP) from the above list of words

is_noun = lambda x : x =="NNP"

nouns = [y for (y,x) in pos_tag(text) if is_noun(x)]

print(nouns)

['Hello', 'POS', 'NLTK', 'Python']
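The filter above keeps one tag only. In the Penn Treebank tag set the noun tags are NN, NNS, NNP and NNPS, so a prefix test is one way to keep every kind of noun. The sketch below works on the tagged tokens copied from the pos_tag output above, so it runs without NLTK.

```python
# Tagged tokens copied from the pos_tag output above.
tagged = [('Hello', 'NNP'), ('welcome', 'NN'), ('to', 'TO'), ('the', 'DT'),
          ('world', 'NN'), ('of', 'IN'), ('learning', 'VBG'),
          ('categorizing', 'VBG'), ('and', 'CC'), ('POS', 'NNP'),
          ('tagging', 'VBG'), ('with', 'IN'), ('NLTK', 'NNP'),
          ('and', 'CC'), ('Python', 'NNP')]

# All Penn Treebank noun tags start with "NN".
all_nouns = [word for word, tag in tagged if tag.startswith('NN')]

# Proper nouns only (singular and plural).
proper_nouns = [word for word, tag in tagged if tag in ('NNP', 'NNPS')]

print(all_nouns)
print(proper_nouns)
```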


We will use the same news article data set to create a topic model using POS tagging.

In [65]:
# Let's try this on the 20 Newsgroups data set

from sklearn.datasets import fetch_20newsgroups

# Let's pick a few categories from the data

cat = ['comp.sys.mac.hardware', 'rec.autos', 'sci.space',
      'talk.politics.guns']

newsgroup = fetch_20newsgroups(categories=cat,
                              remove=('headers','footers', 'quotes'))

raw_data = newsgroup.data

raw_data

['\n\n       Money orders operate pretty much like checks, with both parties being\nsupposed to sign them.  I assume you\'d have to show the buy-back people\nan ID, and you\'d then have a money order made out to that ID.  \n\n       As far as traceable as a practical matter, I don\'t know, it would\ndepend on whether they bother to computerize who the recipient\'s name is\non the money order and bother keying that sort of thing in.  I\'d say\ncertainly the police and the buyback people would keep a record of who\nthey gave money orders out to.\n\n\n       There might be some questions asked, I suppose, if somebody \nbrought in a number of weapons each time over a series of "buy back"\nprograms.\n\n        ',
 '\nFlame on!!\n\nIs this guy serious????\n\nIf he would ever really pay attention to the news (oops I forgot that the media\n   for the most part loves to jump right on top of a story before all the facts \n   are known, as well as to manipulate what we see and thus what we believ

In [66]:
# Lets convert the raw text into a data frame

import pandas as pd

df = pd.DataFrame(raw_data)

df = df.rename(columns = {0 : "Text"})

df.head()

Unnamed: 0,Text
0,\n\n Money orders operate pretty much li...
1,\nFlame on!!\n\nIs this guy serious????\n\nIf ...
2,\nThe Partition button in Apple's HD Setup let...
3,Here are some recent observations taken by the...
4,You will need Driver ver 3.5.2 to work with Qu...


In [68]:

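# Clean the text: lower-case it and drop everything except letters,
# apostrophes and spaces. regex=True is needed because recent pandas
# versions treat the pattern as a literal string by default.

doc = df['Text'].str.lower().str.replace("[^a-z' ]", '', regex=True)

# Requires the stop word list: nltk.download('stopwords')

from nltk.corpus import stopwords

stop = stopwords.words('english')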

# Create a UDF to remove the stop words

def sw(x):
    x = [word for word in x.split() if word not in stop]
    return " ".join(x)

doc_clean = doc.apply(sw)

doc_clean.head()

0    money orders operate pretty much like checks p...
1    flame onis guy seriousif would ever really pay...
2    partition button apple's hd setup lets set aux...
3    recent observations taken hubble space telesco...
4    need driver ver work quadracentris downloadit ...
Name: Text, dtype: object
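Some tokens in the cleaned text are two words glued together (for example "onis" and "seriousif"). This happens because newlines and punctuation are deleted rather than replaced by a space. A possible variation, shown here on a short excerpt of the second document, is to replace the unwanted characters with a space and then collapse repeated whitespace. It is a sketch of an alternative, not the cleaning used for the outputs in this session.

```python
import pandas as pd

raw = pd.Series(["\nFlame on!!\n\nIs this guy serious????\n\nIf he would"])

# Deleting every disallowed character also deletes the newlines between words.
glued = raw.str.lower().str.replace(r"[^a-z' ]", "", regex=True)

# Replacing them with a space, then collapsing whitespace, keeps words apart.
spaced = (raw.str.lower()
             .str.replace(r"[^a-z' ]", " ", regex=True)
             .str.replace(r"\s+", " ", regex=True)
             .str.strip())

print(glued[0])
print(spaced[0])
```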

In [69]:
# Let's create a user-defined function that applies POS tags to each word
# and keeps only the nouns

def nouns(text):

    # Word tokenizer

    tokens = word_tokenize(text)

    # Apply the POS tags and keep the words tagged "NN"
    # (singular common nouns; NNS, NNP and NNPS are not included)

    all_nouns = [word for (word, tag) in pos_tag(tokens) if tag == "NN"]

    # Join the words again so that we return a sentence
    # containing only the nouns

    return ' '.join(all_nouns)

In [70]:
# Let's apply the UDF to doc_clean

doc_final = doc_clean.apply(nouns)

doc_final.head()

0    money sign show id money order id matter bothe...
1    flame onis guy seriousif attention news part j...
2    partition button apple setup something willhow...
3    space telescope faint spectrograph fos pluto m...
4    need driver ver work quadracentris downloadit ...
Name: Text, dtype: object

In [79]:
# Create a count vectorizer (raw term counts, not TF-IDF)

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(max_df = .3, min_df = 10)

# DTM all nouns

dtm_noun = vec.fit_transform(doc_final)

dtm_noun

<2311x930 sparse matrix of type '<class 'numpy.int64'>'
	with 26595 stored elements in Compressed Sparse Row format>

In [80]:
from sklearn.decomposition import LatentDirichletAllocation

# Step 1: To build a LDA object

lda_model = LatentDirichletAllocation(n_components=4, random_state = 1234)

# Step 2: To fit this model object on our DTM

lda_output = lda_model.fit_transform(dtm_noun)

lda_output

array([[0.9587259 , 0.01357941, 0.01390174, 0.01379295],
       [0.80587142, 0.15170483, 0.02111073, 0.02131302],
       [0.03052917, 0.02789344, 0.71664234, 0.22493504],
       ...,
       [0.95139281, 0.01665146, 0.01570117, 0.01625456],
       [0.93041163, 0.02346373, 0.02304993, 0.0230747 ],
       [0.12898957, 0.61204148, 0.13143773, 0.12753122]])
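Each row of lda_output is the topic mixture of one document, so the row sums to about 1 and the largest entry marks the dominant topic. A small sketch using the first three rows printed above:

```python
import numpy as np

# First three rows of the lda_output array shown above.
doc_topic = np.array([
    [0.9587259,  0.01357941, 0.01390174, 0.01379295],
    [0.80587142, 0.15170483, 0.02111073, 0.02131302],
    [0.03052917, 0.02789344, 0.71664234, 0.22493504],
])

# Index of the most likely topic for each document.
dominant = doc_topic.argmax(axis=1)

print(doc_topic.sum(axis=1).round(3))
print(dominant.tolist())
```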

In [81]:
# Let's create a user-defined function to find the top n words from each
# topic

def show_words(vectorizer, model, n_words):
    # Step 1: Create an array of all the words in the DTM
    # (get_feature_names_out replaces get_feature_names, which was
    # removed in scikit-learn 1.2)
    keywords = np.array(vectorizer.get_feature_names_out())

    # Step 2: Create an empty list
    topic_keywords = []

    # Step 3: Use the topic-term matrix of the LDA model passed in to find
    # the top words of each topic by weight

    for topic_weights in model.components_:
        word_loc = (-topic_weights).argsort()[:n_words]

        # Append the matching words from the keywords array

        topic_keywords.append(keywords.take(word_loc))
    return topic_keywords

In [82]:
show_words(vec, lda_model, 20)

[array(['car', 'time', 'government', 'way', 'someone', 'something',
        'anyone', 'thing', 'problem', 'day', 'point', 'engine', 'course',
        'state', 'get', 'gas', 'year', 'anything', 'fire', 'question'],
       dtype='<U14'),
 array(['gun', 'control', 'crime', 'law', 'rate', 'bill', 'firearm',
        'weapon', 'state', 'use', 'person', 'section', 'number', 'year',
        'file', 'police', 'committee', 'handgun', 'article', 'study'],
       dtype='<U14'),
 array(['apple', 'mac', 'system', 'problem', 'drive', 'anyone', 'software',
        'bit', 'memory', 'use', 'card', 'disk', 'monitor', 'work', 'scsi',
        'hardware', 'cable', 'power', 'color', 'video'], dtype='<U14'),
 array(['space', 'launch', 'program', 'orbit', 'earth', 'system',
        'mission', 'information', 'moon', 'time', 'science', 'technology',
        'flight', 'research', 'center', 'station', 'satellite', 'year',
        'rocket', 'cost'], dtype='<U14')]
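The ranking step inside show_words can be checked on a small, hand-made example. The matrix and vocabulary below are hypothetical stand-ins for lda_model.components_ and the vectorizer's feature names; negating a row before argsort ranks its words from highest to lowest weight.

```python
import numpy as np

# Hypothetical 2-topic x 5-word matrix standing in for lda_model.components_.
components = np.array([
    [0.1, 5.0, 0.3, 2.0, 0.2],
    [4.0, 0.1, 3.0, 0.2, 0.5],
])
vocab = np.array(['apple', 'car', 'disk', 'engine', 'gas'])

# Normalising each row gives the word distribution of a topic.
topic_word = components / components.sum(axis=1, keepdims=True)

# argsort on the negated row ranks words from most to least likely.
top2 = [vocab[(-row).argsort()[:2]].tolist() for row in topic_word]

print(top2)
```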

For topic modelling, prefer a CountVectorizer DTM, because LDA models raw term counts rather than TF-IDF weights.