# Problem #2B: Sentiment Analysis
The goal is to use NLP methods to identify the best number of clusters for the responses to the following question:

"What one action can faculty take to improve your educational experience at UW?"

No assumptions are made about how many clusters (groups) these responses will fall into. The goal of
this portion of the NLP project is to identify the optimal number of clusters to support future coding of
these responses. 

This will be accomplished by representing the students' responses in three parts:
part A: topic, part B: sentiment analysis, and part C: semantic similarity.


In [62]:
import numpy as np 
import pandas as pd 
import re
import nltk 
import matplotlib.pyplot as plt
%matplotlib inline

The main idea behind unsupervised learning is that the model is given no prior assumptions or definitions about the outcome of the variables fed into it. You simply supply the (preprocessed) data and let the model learn its structure. This is extremely useful when you have no labeled data, or when you are unsure about the structure of the data and want to learn more about the process you are analyzing without assuming its outcome in advance.
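
To make the idea concrete, here is a minimal sketch of clustering without labels: a plain 1-D k-means run on a handful of made-up sentiment scores, reporting the within-cluster sum of squares (inertia) for several values of k. The data, the function name `kmeans_1d`, and the initialization scheme are assumptions chosen for illustration; the notebook itself does not prescribe a clustering algorithm.

```python
def kmeans_1d(values, k, iterations=50):
    """Cluster 1-D values into k groups; returns (centers, inertia)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted data.
    centers = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        new_centers = [sum(g) / len(g) if g else centers[j]
                       for j, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    # Inertia: squared distance from each value to its closest center.
    inertia = sum(min((v - c) ** 2 for c in centers) for v in values)
    return centers, inertia

# Hypothetical sentiment scores, not taken from the survey data.
scores = [-0.60, -0.55, -0.50, -0.02, 0.00, 0.05, 0.60, 0.65, 0.70]

for k in (1, 2, 3, 4):
    centers, inertia = kmeans_1d(scores, k)
    print(k, [round(c, 3) for c in centers], round(inertia, 4))
```

A sharp drop in inertia that then levels off (the "elbow") is one common hint for a reasonable number of clusters. With real responses the curve is usually less clear-cut than in this toy example.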

In [63]:
df = pd.read_csv('/Users/nehakardam/Documents/UWclasses /EE517 NLP/Project/FacultySupport_SA1_June2_2021.csv')

In [64]:
df

Unnamed: 0,RemoteTrad,Class,Quarter,Year,A1_Status,A2_Major,A4.2,A5.2,B1_Age,B2_Gender,B3.1_USStatus,B3.2_Country,B4.1_Race,B4.2,B5_Income,B6_GPA,B7_MotherEd,B8_FatherEd,B9_SocioClass,SA1
0,2,EE233_Spring2020,Spring,2020,2,1,2018,2022,20,2,1,0,6,,6,3.2,5,6,4,Restructure quizzes and stuff. In 235 we had a...
1,2,EE233_Spring2020,Spring,2020,4,1,2018,2022,19,1,1,0,1,,7,3.23,5,4,5,"Flexible late turn in policies, especially in ..."
2,2,EE235_Spring2020,Spring,2020,2,1,2018,2022,19,2,1,0,8,,8,3.4,3,5,5,Leniency on deadlines. It can be hard to stay...
3,2,EE331_Spring2020,Spring,2020,3,1,2018,2022,19,1,1,0,1,,,3.81,6,7,,be flexible to possible changes and take stude...
4,2,EE233_Spring2020,Spring,2020,2,1,2018,2022,19,2,3,1,1,,4,3.55,4,6,5,have some strict action to make sure every stu...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1167,1,EE271_Spring2018,Spring,2018,2,1,2016,2019,19,2,1,,1,Middle east,8,3.40,4,7,5,Be more clear on what they want when grading a...
1168,1,EE271_Spring2018,Spring,2018,2,1,2016,2020,19,2,1,,8,,5,3.71,6,7,4,"When giving assignments, clarification on exac..."
1169,1,EE271_Spring2018,Spring,2018,2,1,2016,,20,2,1,,1,,7,3.64,6,5,4,Give sample exams
1170,1,EE271_Spring2018,Spring,2018,3,1,2017,2020,18,1,1,,1,,5,3.77,3,2,3,"Being there for students outside of class, whe..."


In [65]:
df.SA1

0       Restructure quizzes and stuff. In 235 we had a...
1       Flexible late turn in policies, especially in ...
2        Leniency on deadlines. It can be hard to stay...
3       be flexible to possible changes and take stude...
4       have some strict action to make sure every stu...
                              ...                        
1167    Be more clear on what they want when grading a...
1168    When giving assignments, clarification on exac...
1169                                    Give sample exams
1170    Being there for students outside of class, whe...
1171    Make sure that the assignments and tests given...
Name: SA1, Length: 1172, dtype: object

# NLTK
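NLTK's VADER sentiment analysis tool uses a bag-of-words approach (a lookup table of positive and negative words) combined with simple heuristics, such as adjusting the intensity of the sentiment when modifiers like "really", "so", or "a bit" are present. For each text it returns the proportions of negative (`neg`), neutral (`neu`), and positive (`pos`) content, plus a normalized `compound` score between -1 and 1.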
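
The lookup-plus-heuristics idea can be sketched in a few lines. This is a toy illustration only: the word lists, the weights, and the function name `toy_score` are invented for this example and are not VADER's actual lexicon or rules.

```python
# Toy lexicon and booster table: illustrative values only, not VADER's data.
LEXICON = {"good": 1.9, "helpful": 1.8, "bad": -2.5, "confusing": -1.6}
BOOSTERS = {"really": 0.3, "so": 0.3}

def toy_score(text):
    """Sum word valences, strengthening a word that follows a booster."""
    total = 0.0
    words = text.lower().split()
    for i, word in enumerate(words):
        valence = LEXICON.get(word, 0.0)
        if valence and i > 0 and words[i - 1] in BOOSTERS:
            boost = BOOSTERS[words[i - 1]]
            # Push the valence further from zero, keeping its sign.
            valence += boost if valence > 0 else -boost
        total += valence
    return total

print(toy_score("good lectures"))
print(toy_score("really good lectures"))
print(toy_score("really confusing homework"))
```

Words missing from the lexicon contribute nothing, which is why short factual responses such as "Give sample exams" come out as fully neutral under a lexicon-based scorer.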

In [66]:
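import nltk
nltk.download('vader_lexicon')  # fetch the VADER lexicon (no-op if already present)
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

# Print the VADER scores (neg, neu, pos, compound) for every response.
for i in range(df.shape[0]):
    s = str(df.SA1[i])  # missing responses (NaN) become the string 'nan'
    print(sid.polarity_scores(s))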
    

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/nehakardam/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


{'neg': 0.0, 'neu': 0.899, 'pos': 0.101, 'compound': 0.4019}
{'neg': 0.0, 'neu': 0.844, 'pos': 0.156, 'compound': 0.6115}
{'neg': 0.15, 'neu': 0.734, 'pos': 0.116, 'compound': 0.031}
{'neg': 0.0, 'neu': 0.872, 'pos': 0.128, 'compound': 0.2263}
{'neg': 0.124, 'neu': 0.826, 'pos': 0.05, 'compound': -0.5267}
{'neg': 0.0, 'neu': 0.878, 'pos': 0.122, 'compound': 0.2023}
{'neg': 0.0, 'neu': 0.648, 'pos': 0.352, 'compound': 0.7717}
{'neg': 0.0, 'neu': 0.773, 'pos': 0.227, 'compound': 0.6486}
{'neg': 0.041, 'neu': 0.823, 'pos': 0.136, 'compound': 0.7083}
{'neg': 0.0, 'neu': 0.794, 'pos': 0.206, 'compound': 0.0772}
{'neg': 0.106, 'neu': 0.673, 'pos': 0.221, 'compound': 0.4432}
{'neg': 0.0, 'neu': 0.882, 'pos': 0.118, 'compound': 0.4215}
{'neg': 0.0, 'neu': 0.735, 'pos': 0.265, 'compound': 0.6597}
{'neg': 0.073, 'neu': 0.927, 'pos': 0.0, 'compound': -0.3318}
{'neg': 0.068, 'neu': 0.815, 'pos': 0.117, 'compound': 0.2023}
{'neg': 0.031, 'neu': 0.88, 'pos': 0.09, 'compound': 0.4767}
{'neg': 0.057, 

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.0, 'neu': 0.618, 'pos': 0.382, 'compound': 0.4754}
{'neg': 0.074, 'neu': 0.774, 'pos': 0.152, 'compound': 0.34}
{'neg': 0.0, 'neu': 0.818, 'pos': 0.182, 'compound': 0.4404}
{'neg': 0.0, 'neu': 0.633, 'pos': 0.367, 'compound': 0.4404}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.0, 'neu': 0.708, 'pos': 0.292, 'compound': 0.4391}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.048, 'neu': 0.952, 'pos': 0.0, 'compound': -0.3612}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.0, 'neu': 0.81, 'pos': 0.19, 'compound': 0.7425}
{'neg': 0.0, 'neu': 0.819, 'pos': 0.181, 'compound': 0.4754}
{'neg': 0.0, 'neu': 0.708, 'pos': 0.292, 'compound': 0.4391}
{'neg': 0.0, 'neu': 0.704, 'pos': 0.296, 'compound': 0.7425}
{'neg': 0.373

In [67]:
d = sid.polarity_scores(df.SA1[0])
d['neg']

0.0

In [83]:
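# Collect the four VADER scores for every response, one row per response,
# so that the new frame lines up with df row for row.
new_table = []

for i in range(df.shape[0]):
    s = str(df.SA1[i])  # missing responses (NaN) become the string 'nan'
    d = sid.polarity_scores(s)
    new_table.append([d['neg'], d['neu'], d['pos'], d['compound']])

new_df = pd.DataFrame(new_table, columns=['neg', 'neu', 'pos', 'compound'])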

In [112]:
result = pd.concat([df, new_df], axis=1)
result

Unnamed: 0,RemoteTrad,Class,Quarter,Year,A1_Status,A2_Major,A4.2,A5.2,B1_Age,B2_Gender,...,B5_Income,B6_GPA,B7_MotherEd,B8_FatherEd,B9_SocioClass,SA1,neg,neu,pos,compound
0,2,EE233_Spring2020,Spring,2020,2,1,2018,2022,20,2,...,6,3.2,5,6,4,Restructure quizzes and stuff. In 235 we had a...,0.000,0.899,0.101,0.4019
1,2,EE233_Spring2020,Spring,2020,4,1,2018,2022,19,1,...,7,3.23,5,4,5,"Flexible late turn in policies, especially in ...",0.000,0.844,0.156,0.6115
2,2,EE235_Spring2020,Spring,2020,2,1,2018,2022,19,2,...,8,3.4,3,5,5,Leniency on deadlines. It can be hard to stay...,0.150,0.734,0.116,0.0310
3,2,EE331_Spring2020,Spring,2020,3,1,2018,2022,19,1,...,,3.81,6,7,,be flexible to possible changes and take stude...,0.000,0.872,0.128,0.2263
4,2,EE233_Spring2020,Spring,2020,2,1,2018,2022,19,2,...,4,3.55,4,6,5,have some strict action to make sure every stu...,0.124,0.826,0.050,-0.5267
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1167,1,EE271_Spring2018,Spring,2018,2,1,2016,2019,19,2,...,8,3.40,4,7,5,Be more clear on what they want when grading a...,0.000,0.656,0.344,0.4927
1168,1,EE271_Spring2018,Spring,2018,2,1,2016,2020,19,2,...,5,3.71,6,7,4,"When giving assignments, clarification on exac...",0.000,0.552,0.448,0.7579
1169,1,EE271_Spring2018,Spring,2018,2,1,2016,,20,2,...,7,3.64,6,5,4,Give sample exams,0.000,1.000,0.000,0.0000
1170,1,EE271_Spring2018,Spring,2018,3,1,2017,2020,18,1,...,5,3.77,3,2,3,"Being there for students outside of class, whe...",0.000,0.788,0.212,0.7783


In [85]:
result.to_csv(r'/Users/nehakardam/Documents/UWclasses /EE517 NLP/Project/FS_Sentiment_SA1_June2_2021.csv', index = False)

In [86]:
# #Number of Response (Total, Positive, Negative, Neutral)
# SA1_list = pd.DataFrame(SA1_list)
# neutral_list = pd.DataFrame(neutral_list)
# negative_list = pd.DataFrame(negative_list)
# positive_list = pd.DataFrame(positive_list)
# print("total number: ", len(SA1_list))
# print("positive number: ", len(positive_list))
# print("negative number: ", len(negative_list))
# print("neutral number: ", len(neutral_list))
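
One way to obtain such counts is to turn each `compound` score into a coarse label first. The sketch below assumes the commonly used cut-offs of ±0.05 for VADER's compound score; the function name `label_from_compound` is hypothetical, and the five scores are the ones visible for rows 0 to 4 in the table above.

```python
from collections import Counter

def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores of the first five responses shown in the table above.
compounds = [0.4019, 0.6115, 0.0310, 0.2263, -0.5267]
counts = Counter(label_from_compound(c) for c in compounds)

print("total number:", len(compounds))
print("positive number:", counts["positive"])
print("negative number:", counts["negative"])
print("neutral number:", counts["neutral"])
```

On the full data the same function could be applied to the `compound` column, for example with `result['compound'].map(label_from_compound)`. The threshold is a parameter so that the sensitivity of the counts to the cut-off can be checked.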

In [87]:
# import pandas as pd
# import matplotlib.pyplot as plt
# import numpy as np
# from matplotlib import pyplot as plt
# # result = {'SA1','neg','neu','pos'}
  
# # df2 = pd.DataFrame(result,columns=['SA1','neg','neu','pos'])
# # df2.plot(x ='SA1', y='neg''neu''pos', kind = 'bar')
# # plt.show()

# # Creating dataset
# Response = ['SA1']

# data = [23, 17, 35, 29, 12, 41]

# # Creating plot
# fig = plt.figure(figsize =(10, 7))
# plt.pie(data, labels = cars)

# # show plot
# plt.show()


# Textblob
TextBlob's sentiment analysis works in a similar way to NLTK's, using a lexicon-based bag-of-words approach, with the advantage that it also reports subjectivity (how factual or opinionated a piece of text is). Polarity ranges from -1 to 1 and subjectivity from 0 to 1.
Its rules differ from VADER's, so the two tools can score the same response differently.

In [88]:
# !pip install -U textblob
# !python -m textblob.download_corpora

In [90]:
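from textblob import TextBlob

# Print TextBlob's polarity and subjectivity for every response.
for i in range(df.shape[0]):
    s = str(df.SA1[i])  # missing responses (NaN) become the string 'nan'
    print(TextBlob(s).sentiment)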

Sentiment(polarity=0.0, subjectivity=1.0)
Sentiment(polarity=0.186, subjectivity=0.6300000000000001)
Sentiment(polarity=-0.2916666666666667, subjectivity=0.5416666666666666)
Sentiment(polarity=0.0, subjectivity=1.0)
Sentiment(polarity=0.11666666666666665, subjectivity=0.662962962962963)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.5, subjectivity=0.5)
Sentiment(polarity=-0.13333333333333333, subjectivity=0.6)
Sentiment(polarity=0.1605218855218855, subjectivity=0.4162457912457913)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.5316666666666667, subjectivity=0.75)
Sentiment(polarity=0.5, subjectivity=0.5)
Sentiment(polarity=0.4666666666666667, subjectivity=0.6666666666666667)
Sentiment(polarity=-0.1375, subjectivity=0.575)
Sentiment(polarity=0.0, subjectivity=0.7000000000000001)
Sentiment(polarity=0.07857142857142858, subjectivity=0.46785714285714286)
Sentiment(polarity=0.33999999999999997, subjectivity=0.54)
Sentiment(polarity=0.5, subjectivity=0.5)
Sen

Sentiment(polarity=0.25, subjectivity=0.5)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.325, subjectivity=0.17500000000000002)
Sentiment(polarity=-0.027380952380952388, subjectivity=0.5916666666666666)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.27366666666666667, subjectivity=0.5096666666666667)
Sentiment(polarity=0.2, subjectivity=0.25)
Sentiment(polarity=0.16944444444444443, subjectivity=0.46388888888888885)
Sentiment(polarity=0.2, subjectivity=0.3)
Sentiment(polarity=-0.084375, subjectivity=0.475)
Sentiment(polarity=0.20166666666666666, subjectivity=0.4066666666666666)
Sentiment(polarity=0.0, subjectivity=1.0)
Sentiment(polarity=0.3653174603174603, subjectivity=0.5546957671957672)
Sentiment(polarity=0.18333333333333332, subjectivity=0.43333333333333335)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.25, subjectivity=0.25)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=-0.12,

In [109]:
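# Collect polarity and subjectivity for every response, one row per response,
# so that the new frame lines up with result row for row.
new_table1 = []
for i in range(df.shape[0]):
    s1 = str(df.SA1[i])  # missing responses (NaN) become the string 'nan'
    d1 = TextBlob(s1).sentiment
    new_table1.append([d1.polarity, d1.subjectivity])

new_df1 = pd.DataFrame(new_table1, columns=['polarity', 'subjectivity'])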

In [114]:
result1 = pd.concat([result, new_df1], axis=1)
result1

Unnamed: 0,RemoteTrad,Class,Quarter,Year,A1_Status,A2_Major,A4.2,A5.2,B1_Age,B2_Gender,...,B7_MotherEd,B8_FatherEd,B9_SocioClass,SA1,neg,neu,pos,compound,polarity,subjectivity
0,2,EE233_Spring2020,Spring,2020,2,1,2018,2022,20,2,...,5,6,4,Restructure quizzes and stuff. In 235 we had a...,0.000,0.899,0.101,0.4019,0.000000,1.000000
1,2,EE233_Spring2020,Spring,2020,4,1,2018,2022,19,1,...,5,4,5,"Flexible late turn in policies, especially in ...",0.000,0.844,0.156,0.6115,0.186000,0.630000
2,2,EE235_Spring2020,Spring,2020,2,1,2018,2022,19,2,...,3,5,5,Leniency on deadlines. It can be hard to stay...,0.150,0.734,0.116,0.0310,-0.291667,0.541667
3,2,EE331_Spring2020,Spring,2020,3,1,2018,2022,19,1,...,6,7,,be flexible to possible changes and take stude...,0.000,0.872,0.128,0.2263,0.000000,1.000000
4,2,EE233_Spring2020,Spring,2020,2,1,2018,2022,19,2,...,4,6,5,have some strict action to make sure every stu...,0.124,0.826,0.050,-0.5267,0.116667,0.662963
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1167,1,EE271_Spring2018,Spring,2018,2,1,2016,2019,19,2,...,4,7,5,Be more clear on what they want when grading a...,0.000,0.656,0.344,0.4927,0.300000,0.441667
1168,1,EE271_Spring2018,Spring,2018,2,1,2016,2020,19,2,...,6,7,4,"When giving assignments, clarification on exac...",0.000,0.552,0.448,0.7579,0.525000,0.500000
1169,1,EE271_Spring2018,Spring,2018,2,1,2016,,20,2,...,6,5,4,Give sample exams,0.000,1.000,0.000,0.0000,0.000000,0.000000
1170,1,EE271_Spring2018,Spring,2018,3,1,2017,2020,18,1,...,3,2,3,"Being there for students outside of class, whe...",0.000,0.788,0.212,0.7783,0.125000,0.400000
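
With both sets of scores side by side, a quick check is to look for responses where the two tools disagree on direction. The sketch below uses the `compound` and `polarity` values of rows 0 to 4 from the table above; the helper name `opposite_signs` is made up for this example, and scores of exactly zero are treated as "no disagreement".

```python
# (compound, polarity) pairs for rows 0-4, copied from the table above.
rows = [
    (0.4019, 0.000000),
    (0.6115, 0.186000),
    (0.0310, -0.291667),
    (0.2263, 0.000000),
    (-0.5267, 0.116667),
]

def opposite_signs(a, b):
    """True when one score is strictly positive and the other strictly negative."""
    return a * b < 0

disagreements = [i for i, (compound, polarity) in enumerate(rows)
                 if opposite_signs(compound, polarity)]
print(disagreements)  # → [2, 4]
```

Rows flagged this way are good candidates for manual reading before either score is used as a clustering feature.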


# Flair
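Flair’s sentiment classifier has been described as a character-level LSTM neural network that takes sequences of letters and words into account when predicting. The `en-sentiment` model loaded in this notebook appears to be DistilBERT-based, judging by the file name in the loading log (`sentiment-en-mix-distillbert_4.pt`).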

In [121]:
# !pip3 install flair
import flair
flair_sentiment = flair.models.TextClassifier.load('en-sentiment')

# Score each response separately: build the Sentence and call predict()
# inside the loop, otherwise the same label is printed for every row.
for i in range(df.shape[0]):
    s1 = flair.data.Sentence(str(df.SA1[i]))
    flair_sentiment.predict(s1)
    print(s1.labels)


2021-06-03 11:49:58,901 loading file /Users/nehakardam/.flair/models/sentiment-en-mix-distillbert_4.pt

In [None]:
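# Collect the Flair label and confidence for every response.
# (The column names 'flair_label' and 'flair_score' are chosen here for illustration.)
new_table_1 = []
for i in range(df.shape[0]):
    s = flair.data.Sentence(str(df.SA1[i]))
    flair_sentiment.predict(s)
    label = s.labels[0]
    new_table_1.append([label.value, label.score])

new_df2 = pd.DataFrame(new_table_1, columns=['flair_label', 'flair_score'])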

# DeepMoji
This last one isn’t technically a sentiment analysis tool, because it predicts emojis for a sentence. I’ve included it here because this type of classification demonstrates an awareness of sentiment (and even emotion) in the model.

In [None]:
# !git clone https://github.com/huggingface/torchMoji
# import os
# os.chdir('torchMoji')
# !pip3 install -e .
# !python3 scripts/download_weights.py
# !python3 examples/text_emojize.py --text f" {s} "

Reference: https://medium.com/@b.terryjack/nlp-pre-trained-sentiment-analysis-1eb52a9d742c

In [48]:
plot_size = plt.rcParams["figure.figsize"] 
print(plot_size[0]) 
print(plot_size[1])

plot_size[0] = 8
plot_size[1] = 6
plt.rcParams["figure.figsize"] = plot_size

6.0
4.0


In [49]:
# `result` has no `airline_sentiment` column, so plot the mean VADER scores per class
# (one possible summary; grouping by 'Class' is an assumption).
FS_sentiment = result.groupby('Class')[['neg', 'neu', 'pos', 'compound']].mean()
FS_sentiment.plot(kind='bar')
