# Group 1 - Sentiment Analysis 
   
**⛳️ Goal**: Analyze the sentiment of the speeches using an unsupervised machine learning technique.
<br>
<font color='red'> ***To-Do***
* Write a class that applies the functions below to every speech file.
* Check that each step produces correct results.
* Decide which method suits us best (we need to learn more about each method).
</font>
</br> 
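One possible shape for the looping class in the first to-do item is sketched below. `SpeechBatch` and its `run` method are hypothetical names, not part of the project's code, and the callable passed to `run` stands in for whichever extraction or sentiment function is eventually chosen. The demo at the bottom uses throwaway files so that the sketch runs on its own.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict


class SpeechBatch:
    """Apply one function to every matching file in a folder (hypothetical helper)."""

    def __init__(self, folder, pattern="*.pdf"):
        # Sort so that results come back in a stable order.
        self.files = sorted(Path(folder).glob(pattern))

    def run(self, func: Callable[[Path], object]) -> Dict[str, object]:
        # Key each result by the file name without its extension.
        return {path.stem: func(path) for path in self.files}


# Demo with throwaway files standing in for the speech PDFs.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "speech_b.pdf").write_text("x")
    (Path(tmp) / "speech_a.pdf").write_text("yy")
    (Path(tmp) / "notes.txt").write_text("ignored")
    demo = SpeechBatch(tmp).run(lambda p: len(p.read_text()))

print(demo)
```

In the notebook, the lambda would be replaced by a function that opens the PDF and returns its cleaned text or its sentiment scores.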

<br>

***Note***
* Based on the linked comparison, the `Stanza` pipeline seems better suited to our case than `TextBlob`. TextBlob's continuous polarity scale [-1, 1] can capture some nuance, but for longer texts `Stanza` is the better choice. (See: http://monkeythinkmonkeycode.com/nlp-in-python-a-quick-library-comparison-for-sentiment-analysis/)
</br>

<h2> Getting PDFs </h2>

* Used Jolien's code
* Used only "Farewell_to_Staff_and_Supporters.pdf"

In [11]:
from pathlib import Path

import pdfplumber
import re

import sys
src_path = str(Path.cwd().parent / "pdfs")
sys.path.append(src_path)


src_path = str(Path.cwd().parent / "src")
sys.path.append(src_path)
from pdf_processing import *

In [4]:
import pandas as pd
import numpy as np 

In [12]:
pdf_dir = Path.cwd().parent / "pdfs"
pdfs = list(pdf_dir.glob('*.pdf'))  
print("current number of PDFs:", len(pdfs))

current number of PDFs: 4


In [13]:
filepath = pdfs[1]

In [14]:
pdf = PDFHandler(filepath)

In [15]:
start = r"(?:hheettoorriicc\.\.ccoomm)"
date = r"(.*[dD]elivered\s+(?P<day>[0-9]{1,2})\s+(?P<mon>[A-Z][a-z]+)\s+(?P<year>[0-9]{2,4})"
loc = r"(,\s+(?P<location_small>[A-Za-z0-9. ]+),\s+(?P<location_big>[A-Za-z0-9., ]+))?"
auth = r"(?:\s+AUTHENTICITY CERTIFIED: Text version below transcribed directly from audio))?"
content = r"\s+(?P<content>.*)\n+"
end = r"(?:(Transcription\s+by\s+.*)?(Property\s+of\s+)?AmericanRhetoric\.com)"

pat = re.compile(start + date + loc + auth + content + end, re.DOTALL)

speech = pdf.extract_speech(pat)
print(speech)

Michelle and I, we've really been milking this goodbye thing, so it behooves me to be very 
 
brief.
Audience Members: No, no!  
President Obama: Yes, yes.  
You know, I said before and I will say again, that when we started on this journey we did so 
with an abiding faith in the American people and their ability, out ability, to join together to 
change the country in ways that would make life better for our kids and our grandkids, that 
 
change didn’t happen from the top down, but it happened from the bottom up.
It was met sometimes with skepticism and doubt. Some folks didn’t think we could pull it off. 
There were those who felt that the institutions of power and privilege in this country were too 
 
deeply entrenched. And yet, all of you came together, in small towns and big cities, a whole bunch of you really 
young, and you decided to believe. And you knocked on doors and you made phone calls, and 
you talked to your parents who didn’t know how to pronounce Barack Obama. And yo

In [16]:
pdf.print_info()

Title: Farewell_to_Staff_and_Supporters
Number of pages: 3
Date: ['20', 'January', '2017']
Location: ['Prince George County', 'Maryland']
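The `Date` and `Location` fields come from the named groups in the pattern. The sketch below isolates the date and location parts of that pattern and runs them on a made-up header line. The `header` string is an assumption written for illustration, not text taken from the PDF.

```python
import re

# Date and location sub-patterns with the same named groups as in the notebook.
date = r"[dD]elivered\s+(?P<day>[0-9]{1,2})\s+(?P<mon>[A-Z][a-z]+)\s+(?P<year>[0-9]{2,4})"
loc = r"(,\s+(?P<location_small>[A-Za-z0-9. ]+),\s+(?P<location_big>[A-Za-z0-9., ]+))?"

# Hypothetical header line, written to resemble the metadata printed above.
header = "delivered 20 January 2017, Prince George County, Maryland"

match = re.search(date + loc, header)
fields = match.groupdict()  # one entry per named group

print([fields["day"], fields["mon"], fields["year"]])
print([fields["location_small"], fields["location_big"]])
```

Because the location group is optional, a header without a location would still match; the two location entries in `fields` would then be `None`.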


In [17]:
old = [r'-+', r'\.{2,}', r'[’‘]', r'"', r'’’', r'‘‘', r'“', r'”', r',', r'\[sic\]', r'\s+']
new = [r' ' , r' '     , r"'"   , r'' , r''  , r''  , r'' , r'' , r',', r' '      , r' '  ]

clean_speech = pdf.replace(speech, old, new)
print(clean_speech)

Michelle and I, we've really been milking this goodbye thing, so it behooves me to be very brief. Audience Members: No, no! President Obama: Yes, yes. You know, I said before and I will say again, that when we started on this journey we did so with an abiding faith in the American people and their ability, out ability, to join together to change the country in ways that would make life better for our kids and our grandkids, that change didn't happen from the top down, but it happened from the bottom up. It was met sometimes with skepticism and doubt. Some folks didn't think we could pull it off. There were those who felt that the institutions of power and privilege in this country were too deeply entrenched. And yet, all of you came together, in small towns and big cities, a whole bunch of you really young, and you decided to believe. And you knocked on doors and you made phone calls, and you talked to your parents who didn't know how to pronounce Barack Obama. And you got to know each
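`PDFHandler.replace` is not shown in this notebook. A minimal sketch of what such a helper could look like is given below, assuming it applies each `old` pattern with `re.sub` in list order. `replace_all` is a hypothetical name and the `sample` string is made up.

```python
import re


def replace_all(text, old, new):
    """Apply each regex in `old` in order, substituting the matching entry of `new`."""
    for pattern, repl in zip(old, new):
        text = re.sub(pattern, repl, text)
    return text


# A subset of the patterns used in the cell above.
old = [r'-+', r'\.{2,}', r'[’‘]', r'\s+']
new = [r' ', r' ', r"'", r' ']

sample = "change didn’t happen...  \n from the top-down"
cleaned = replace_all(sample, old, new)
print(cleaned)
```

The order of the lists matters under this assumption: collapsing whitespace (`\s+`) comes last so that the spaces introduced by the earlier substitutions are merged as well.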

<h2> Preprocessing </h2>

Note that these are standalone snippets showing one way to do each step. They are not required for `Stanza`, whose pipeline handles the preprocessing and the sentiment analysis in a single pass.

<h3> Tokenization </h3>

In [26]:
from nltk.tokenize import sent_tokenize , word_tokenize

In [42]:
sentence = sent_tokenize(clean_speech)
print(sentence[0:4])

["Michelle and I, we've really been milking this goodbye thing, so it behooves me to be very brief.", 'Audience Members: No, no!', 'President Obama: Yes, yes.', "You know, I said before and I will say again, that when we started on this journey we did so with an abiding faith in the American people and their ability, out ability, to join together to change the country in ways that would make life better for our kids and our grandkids, that change didn't happen from the top down, but it happened from the bottom up."]


In [43]:
words_1 = word_tokenize(sentence[0])
print(words_1)

['Michelle', 'and', 'I', ',', 'we', "'ve", 'really', 'been', 'milking', 'this', 'goodbye', 'thing', ',', 'so', 'it', 'behooves', 'me', 'to', 'be', 'very', 'brief', '.']


<h3> Lemmatization </h3>

In [38]:
from nltk.stem import WordNetLemmatizer 

In [41]:
wordnet_lemmatizer = WordNetLemmatizer() 

In [47]:
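# Lemmatize every token of the first sentence.
# lemmatize() treats each word as a noun unless `pos` is given,
# which is why verb forms such as 'been' and 'milking' come back unchanged.
word_list = [(w, wordnet_lemmatizer.lemmatize(w)) for w in words_1]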

In [48]:
word_list

[('Michelle', 'Michelle'),
 ('and', 'and'),
 ('I', 'I'),
 (',', ','),
 ('we', 'we'),
 ("'ve", "'ve"),
 ('really', 'really'),
 ('been', 'been'),
 ('milking', 'milking'),
 ('this', 'this'),
 ('goodbye', 'goodbye'),
 ('thing', 'thing'),
 (',', ','),
 ('so', 'so'),
 ('it', 'it'),
 ('behooves', 'behooves'),
 ('me', 'me'),
 ('to', 'to'),
 ('be', 'be'),
 ('very', 'very'),
 ('brief', 'brief'),
 ('.', '.')]

<h2> Stanza Pipeline </h2>

Mainly follows the code in the Stanza documentation: <a href="https://stanfordnlp.github.io/stanza/">https://stanfordnlp.github.io/stanza/</a>.
<br> This allows us to **tokenize & lemmatize** all at the same time.

In [51]:
import stanza
stanza.download('en',verbose=False)

In [113]:
nlp = stanza.Pipeline('en', processors='tokenize, mwt, pos, lemma, depparse,sentiment',
                      use_gpu=False, 
                      verbose=False, pos_batch_size=3000) 

Use the Stanza pipeline to do both tokenization and sentiment analysis for our first text file.

Sentiment levels:
* Negative = 0 
* Neutral = 1
* Positive = 2
<br>
Note: "Sentiment is added to the stanza pipeline by using a CNN classifier" (convolutional neural network, see <a href="https://arxiv.org/abs/1408.5882">https://arxiv.org/abs/1408.5882</a>).
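Because these levels are class indices rather than scores, it can help to keep their names next to them and to re-centre them on zero before comparing with TextBlob's [-1, 1] polarity. The sketch below shows one way to do that; `STANZA_LABELS` and `to_polarity` are hypothetical helper names.

```python
# Stanza's three sentiment classes, as listed above.
STANZA_LABELS = {0: "negative", 1: "neutral", 2: "positive"}


def to_polarity(label):
    """Shift a Stanza class (0, 1, 2) onto the scale -1, 0, 1."""
    if label not in STANZA_LABELS:
        raise ValueError(f"unexpected Stanza sentiment label: {label!r}")
    return label - 1


for label, name in STANZA_LABELS.items():
    print(label, name, to_polarity(label))
```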


In [94]:
doc = nlp(clean_speech)

doc_sent = []
for i, sentence in enumerate(doc.sentences):
    print(i, sentence.sentiment)
    doc_sent.append(sentence.sentiment)

0 1
1 0
2 1
3 1
4 2
5 1
6 0
7 0
8 1
9 0
10 1
11 0
12 0
13 1
14 1
15 1
16 2
17 0
18 0
19 0
20 1
21 1
22 1
23 1
24 0
25 1
26 2
27 1
28 1
29 1
30 2
31 1
32 0
33 0
34 1
35 1
36 1
37 2
38 1
39 2
40 0
41 1
42 1
43 1
44 1
45 1
46 1
47 1
48 1
49 1


In [95]:
# Average of the sentiment labels (0 = negative, 1 = neutral, 2 = positive)
sum(doc_sent) / len(doc_sent)  # 0.86: close to neutral, leaning slightly negative

0.86
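A mean over class labels hides how the sentences are distributed. The sketch below takes the 50 labels printed above, counts each class, and re-centres the mean on zero so that neutral becomes 0.

```python
from collections import Counter

# The 50 per-sentence labels printed by the Stanza cell above.
doc_sent = [1, 0, 1, 1, 2, 1, 0, 0, 1, 0,
            1, 0, 0, 1, 1, 1, 2, 0, 0, 0,
            1, 1, 1, 1, 0, 1, 2, 1, 1, 1,
            2, 1, 0, 0, 1, 1, 1, 2, 1, 2,
            0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

counts = Counter(doc_sent)
print(counts[0], counts[1], counts[2])  # negative, neutral, positive

mean_label = sum(doc_sent) / len(doc_sent)
print(round(mean_label, 2), round(mean_label - 1, 2))
```

Here 31 of the 50 sentences are neutral, 13 negative and 6 positive, so the re-centred mean of -0.14 sits slightly on the negative side of neutral.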

<h2> TextBlob </h2>

In [143]:
from textblob import TextBlob

tb_speech = TextBlob(clean_speech)
print(tb_speech.sentiment)
tb_speech.polarity

Sentiment(polarity=0.1824178501810081, subjectivity=0.48478139793929265)


0.1824178501810081

<h3> Stanza vs. TextBlob  </h3> 

In [144]:
nlp = stanza.Pipeline(lang='en', processors='tokenize,sentiment',verbose=False)

In [152]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

In [153]:
sentence = sent_tokenize(clean_speech)

In [156]:
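df_speech = []

for sent in sentence:
    word_tokens = word_tokenize(sent)
    # Drop stop words; the match is case-sensitive, so capitalised words such as 'And' are kept.
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    # Re-join with a space before every token, including the first.
    df_speech.append(''.join(' ' + w for w in filtered_sentence))

df_speech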

[" Michelle I , 've really milking goodbye thing , behooves brief .",
 ' Audience Members : No , !',
 ' President Obama : Yes , yes .',
 " You know , I said I say , started journey abiding faith American people ability , ability , join together change country ways would make life better kids grandkids , change n't happen top , happened bottom .",
 ' It met sometimes skepticism doubt .',
 " Some folks n't think could pull .",
 ' There felt institutions power privilege country deeply entrenched .',
 ' And yet , came together , small towns big cities , whole bunch really young , decided believe .',
 " And knocked doors made phone calls , talked parents n't know pronounce Barack Obama .",
 ' And got know .',
 " And went communities maybe 'd never even thought visiting .",
 " And met people surface seemed completely different n't look like talk like watch TV programs .",
 ' And yet , started talking , turned something common .',
 ' And grew , built .',
 ' And people took notice .',
 ' And t
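Many of the filtered sentences still start with `And`, `It` or `There`. The stop-word check is case-sensitive and NLTK's English stop-word list is all lowercase, so capitalised words slip through. The sketch below shows the effect with a tiny stand-in stop list, which is an assumption made so that the example runs without NLTK data.

```python
# Tiny stand-in for NLTK's English stop-word list (the real list is much longer).
stop_words = {"and", "it", "was", "with", "there"}

tokens = ["It", "was", "met", "sometimes", "with", "skepticism", "and", "doubt", "."]

# Case-sensitive check, as in the cell above: 'It' is kept.
case_sensitive = [w for w in tokens if w not in stop_words]

# Lowercasing only for the comparison also removes sentence-initial stop words.
case_insensitive = [w for w in tokens if w.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Whether the capitalised stop words should be removed depends on the classifier; the comparison above only makes the current behaviour visible.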

In [183]:
df = pd.DataFrame({'text': df_speech})

# Placeholder columns, filled in by the cells below.
df['sentiment_stanza'] = ''
df['sentiment_blob'] = ''


In [184]:
def blob_sentiment(txt):
    """Return TextBlob's polarity score for `txt`, a float in [-1, 1]."""
    return TextBlob(txt).sentiment.polarity

In [185]:
df['sentiment_blob'] = df['text'].apply(blob_sentiment)

In [186]:
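nlp = stanza.Pipeline(lang='en', processors='tokenize,sentiment')
for idx in df.index:
    doc = nlp(df.loc[idx, 'text'])
    # Shift Stanza's labels 0/1/2 to -1/0/1 so they line up with TextBlob's polarity sign.
    # If Stanza splits the text into several sentences, the last one wins.
    for sentence in doc.sentences:
        df.loc[idx, 'sentiment_stanza'] = float(sentence.sentiment - 1)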

In [187]:
df.head()

|   | text | sentiment_stanza | sentiment_blob |
|---|------|------------------|----------------|
| 0 | Michelle I , 've really milking goodbye thing... | 0.0 | 0.1 |
| 1 | Audience Members : No , ! | -1.0 | 0.0 |
| 2 | President Obama : Yes , yes . | 0.0 | 0.0 |
| 3 | You know , I said I say , started journey abi... | 0.0 | 0.333333 |
| 4 | It met sometimes skepticism doubt . | 0.0 | 0.0 |


<h2> Vader on the entire speech </h2>

In [18]:
# install vader if not already available
# !pip install vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [84]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()

In [85]:
def sentiment_analyzer_scores(sentence):
    """Print `sentence` followed by its VADER scores (neg, neu, pos, compound)."""
    score = analyser.polarity_scores(sentence)
    print("{:-<40} {}".format(sentence, score))
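`polarity_scores` returns a dictionary with `neg`, `neu`, `pos` and `compound` entries. To turn the `compound` value into a label, the vaderSentiment README suggests cut-offs of 0.05 and -0.05, and the sketch below applies them. `label_compound` is a hypothetical helper, and the example dictionary is made up rather than being the score of the speech.

```python
def label_compound(scores, threshold=0.05):
    """Map a VADER score dictionary to 'positive', 'neutral' or 'negative'."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up dictionary in the shape returned by polarity_scores().
example = {"neg": 0.1, "neu": 0.7, "pos": 0.2, "compound": 0.31}
print(label_compound(example))
```

Applied per sentence, such a label would be directly comparable with the three Stanza classes.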

In [22]:
sentiment_analyzer_scores(clean_speech)

Michelle and I, we've really been milking this goodbye thing, so it behooves me to be very brief. Audience Members: No, no! President Obama: Yes, yes. You know, I said before and I will say again, that when we started on this journey we did so with an abiding faith in the American people and their ability, out ability, to join together to change the country in ways that would make life better for our kids and our grandkids, that change didn't happen from the top down, but it happened from the bottom up. It was met sometimes with skepticism and doubt. Some folks didn't think we could pull it off. There were those who felt that the institutions of power and privilege in this country were too deeply entrenched. And yet, all of you came together, in small towns and big cities, a whole bunch of you really young, and you decided to believe. And you knocked on doors and you made phone calls, and you talked to your parents who didn't know how to pronounce Barack Obama. And you got to know each