In [None]:
#!pip install nltk textblob 

### Sentiment Analysis

Sentiment analysis has a long tradition in machine learning and natural language processing (NLP): trying to computationally evaluate the sentiment of a text. Sentiment is often scored as positive, neutral, or negative, and those are the categories that we will play with this week. (There are more sophisticated models out there that evaluate different emotions, etc.)

Note: as usual, a human reader will be much better at assessing the sentiment of anything written in their language(s). Sentiment analysis is most often used on very large corpora of text, essentially as a way of creating an advanced search.

We will be playing with ML-based natural language processing libraries:

**NLTK**: Natural Language Tool Kit https://www.nltk.org/ 
This has various tools for tokenizing language, including taggers that label individual words by part of speech (verbs, proper nouns, etc.), along with sentiment analysis tools.

**TextBlob**: https://textblob.readthedocs.io/en/dev/
This, in fact, builds on NLTK as well as pattern (which is a bit less easy to use), and has its own sentiment analysis approaches that we can evaluate as well.

**spaCy**: https://spacy.io/
This is a newer natural language library that has some powerful applications. We are going to brush by it (no need to install!) because its sentiment analysis module, which is relatively new, is not significantly different from NLTK's.

Install the following:

`pip install nltk`

`pip install numpy` (dependency and useful!)


`pip install -U textblob`

`python -m textblob.download_corpora`

The download_corpora might not actually work, but it's harmless.

## Training
The thing to understand about these libraries is that they are built from training data: a large data set was produced in which documents or words were tagged with various levels of sentiment. So how useful they are depends on how closely what they were trained on matches what we want to evaluate.

How they are trained:

NLTK uses the **VADER** (Valence Aware Dictionary and sEntiment Reasoner) lexicon
https://github.com/cjhutto/vaderSentiment

Essentially, different words are tagged with a sentiment intensity on a scale from -4 to +4.

"okay" = 0.9

"good" = 1.9

"great" = 3.1

"horrible" = -2.5

":(" = -2.2

**TextBlob** uses a similar approach to VADER, but its lexicon is based on **product reviews**.

**TextBlob's NaiveBayesAnalyzer** is a machine learning classifier trained on movie reviews.

More on training later!
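
Before moving on, here is a minimal sketch of the lexicon idea behind VADER described above. It is not the real VADER code: `toy_lexicon` only holds the five example entries listed in this section, `toy_compound` is a made-up name for illustration, and the sketch ignores VADER's rules for negation, capitalization, and intensifiers. The squashing step, `total / sqrt(total^2 + 15)`, follows the normalization in the vaderSentiment source, which keeps the result between -1 and +1.

```python
import math

# Toy lexicon: only the five example entries listed above
# (the real VADER lexicon has thousands of entries).
toy_lexicon = {"okay": 0.9, "good": 1.9, "great": 3.1, "horrible": -2.5, ":(": -2.2}

def toy_compound(text, alpha=15):
    """Sum the valence of known tokens, then squash the total into (-1, 1)."""
    total = 0.0
    for token in text.lower().split():
        # Try the raw token first (so ":(" survives), then strip trailing punctuation
        valence = toy_lexicon.get(token)
        if valence is None:
            valence = toy_lexicon.get(token.strip(".,!?"), 0.0)
        total += valence
    return total / math.sqrt(total * total + alpha)

print(toy_compound("The food was great!"))     # positive
print(toy_compound("horrible traffic :("))      # negative
print(toy_compound("the meeting is at noon"))  # 0.0, no known words
```

Words that are not in the lexicon contribute nothing, which is one reason a lexicon tuned on social media text can miss the sentiment of, say, a 19th-century novel.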


#### import NLTK and download all

I would prefer to limit this to just a few downloads, as download all is going to download quite a bit and take a little time. But there is just too large of a chance of getting errors (actually in TextBlob) without downloading everything.

In [2]:
import nltk
# nltk.download('vader_lexicon') #sentiment https://github.com/cjhutto/vaderSentiment
# nltk.download('punkt') #this finds sentences in a text, not easy!
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/jonthirkield/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /Users/jonthirkield/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /Users/jonthirkield/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /Users/jonthirkield/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /Users/jonthirkield/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_

True

#### using Vader 
You get four scores: `neg`, `neu`, `pos`, and `compound`.

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

sia = SIA()
# sia.polarity_scores("Excited about the upcoming weekend getaway!")
sia.polarity_scores("Traffic was terrible this morning.")


The compound score is the main one: a single normalized value between -1 (most negative) and +1 (most positive).
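
Since the compound score is one number, a common way to use it is to bucket it into labels. A minimal sketch, assuming the conventional cutoffs of +0.05 and -0.05 suggested in the vaderSentiment README; `label_from_compound` is a hypothetical helper, not part of NLTK, and the cutoff is something you may want to tune for your own texts.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER-style compound score in [-1, 1] to a coarse label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Example scores made up for illustration (not real VADER output)
for score in (0.62, -0.48, 0.0):
    print(score, label_from_compound(score))
```

With the `sia` object above, this could be applied as `label_from_compound(sia.polarity_scores(sent)["compound"])`.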

In [None]:
sent = "Savoring the flavors of a home-cooked meal. Simple joys are the heart of happiness."
sia.polarity_scores(sent)


In [None]:
sent = "Spent hours creating the perfect playlist for every mood. Music is my therapy."
sia.polarity_scores(sent)

In [None]:
sent = "Celebrating a milestone at work! 🎉"
sia.polarity_scores(sent)


$ pip install -U textblob
$ python -m textblob.download_corpora
https://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html

In [None]:
from textblob import TextBlob
from textblob import Blobber
from textblob.sentiments import NaiveBayesAnalyzer
# https://github.com/clips/pattern

In [None]:
# Create a TextBlob object
output = TextBlob("Apple's name was inspired by Steve Jobs' visit to an apple farm while on a fruitarian diet.")

# Print out POS tagging
output.tags



In [None]:
blob = TextBlob("Excited about the upcoming weekend getaway!")
blob.sentiment

In [None]:
blobber = Blobber(analyzer=NaiveBayesAnalyzer())

blob = blobber("Excited about the upcoming weekend getaway!")
blob.sentiment

In [None]:
frankenPh= """These reflections have dispelled the agitation with which I began my
letter, and I feel my heart glow with an enthusiasm which elevates me to
heaven; for nothing contributes so much to tranquillize the mind as a
steady purpose,--a point on which the soul may fix its intellectual eye.
This expedition has been the favourite dream of my early years. I have
read with ardour the accounts of the various voyages which have been
made in the prospect of arriving at the North Pacific Ocean through the
seas which surround the pole. You may remember, that a history of all
the voyages made for purposes of discovery composed the whole of our
good uncle Thomas’s library. My education was neglected, yet I was
passionately fond of reading. These volumes were my study day and night,
and my familiarity with them increased that regret which I had felt, as
a child, on learning that my father’s dying injunction had forbidden my
uncle to allow me to embark in a sea-faring life."""

In [None]:
fn = TextBlob(frankenPh)
fn.sentences

In [None]:
import spacy
import asent

# load spacy pipeline
nlp = spacy.blank('en')
nlp.add_pipe('sentencizer')

# add the rule-based sentiment model
nlp.add_pipe('asent_en_v1')


In [None]:
# try an example
text = 'Why is everything so bad'
doc = nlp(text)

# print polarity of document, scaled to be between -1, and 1
print(doc._.polarity)



In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import re


In [None]:
taggart_poem = """To breathe and stretch one's arms again
to breathe through the mouth to breathe to
breathe through the mouth to utter in
the most quiet way not to whisper not to whisper
to breathe through the mouth in the most quiet way to
breathe to sing to breathe to sing to breathe"""

In [None]:
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

print(porter_stemmer.stem('challenging'))
print(porter_stemmer.stem('vibes'))
print(porter_stemmer.stem('reigns'))
print(porter_stemmer.stem('tenderness'))
print(porter_stemmer.stem('Overflowing'))
print(porter_stemmer.stem('blessings'))
print(porter_stemmer.stem('adoration'))

In [None]:
import pandas as pd
pd.set_option("display.max_colwidth", 200)

df = pd.DataFrame({'content': [
    "Just finished a challenging workout routine.",
    "Political discussions heating up on the timeline.",
    "Traffic was terrible this morning.",
    "Enjoying a beautiful day at the park!",
    "The new movie release is a must-watch!",
    "Sending affectionate vibes to friends and family.",
    "Overflowing adoration for a cute rescue puppy!",
    "Confusion reigns as I try to make sense of recent events.",
    "A moment of tenderness, connecting with loved ones.",
    "Overflowing with gratitude for life's blessings",
]})
df

In [None]:
texts =   ["Just finished a challenging workout routine.",
    "Political discussions heating up on the timeline.",
    "Traffic was terrible this morning.",
    "Enjoying a beautiful day at the park!",
    "The new movie release is a must-watch!",
    "Sending affectionate vibes to friends and family.",
    "Overflowing adoration for a cute rescue puppy!",
    "Confusion reigns as I try to make sense of recent events.",
    "A moment of tenderness, connecting with loved ones.",
    "Overflowing with gratitude for life's blessings"]


In [None]:
porter_stemmer = PorterStemmer()

def make_stems(string_in):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", string_in).lower().split()
    words = [porter_stemmer.stem(word) for word in words]
    return words

count_vectorizer = CountVectorizer(stop_words='english', tokenizer=make_stems)
X = count_vectorizer.fit_transform(texts)
print(count_vectorizer.get_feature_names_out())

In [None]:
def get_scores(content):
    # TextBlob's default (lexicon-based) analyzer
    blob = TextBlob(content)
    # TextBlob with the NaiveBayesAnalyzer created above
    nb_blob = blobber(content)
    # NLTK's VADER scores
    sia_scores = sia.polarity_scores(content)
    # spaCy pipeline with the asent sentiment component
    doc = nlp(content)
    return pd.Series({
        'content': content,
        'textblob': blob.sentiment.polarity,
        'textblob_bayes': nb_blob.sentiment.p_pos - nb_blob.sentiment.p_neg,
        'nltk': sia_scores['compound'],
        'spacy': doc._.polarity.compound
    })

scores = df.content.apply(get_scores)
scores.style.background_gradient(cmap='PiYG', axis=None, low=0.3, high=0.3)

In [None]:
from nltk.corpus import stopwords

In [None]:
stopwords.words('english')[:20]

In [None]:
import spacy
import asent

# load spacy pipeline
nlp = spacy.blank('en')
nlp.add_pipe('sentencizer')

# add the rule-based sentiment model
nlp.add_pipe('asent_en_v1')




In [None]:
import pandas as pd

df = pd.read_csv("sentiment140-subset.csv", nrows=30000)
df.head()

In [None]:
df.shape

In [None]:
df.polarity.value_counts()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer


In [None]:
vectorizer = TfidfVectorizer(max_features=1000, tokenizer=make_stems)
vectors = vectorizer.fit_transform(df.text)
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names_out())
words_df.head()
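
To see roughly what the numbers in `words_df` mean, here is a minimal sketch of TF-IDF weighting in plain Python. It assumes scikit-learn's documented defaults (raw counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row). `toy_tfidf` is a made-up helper that splits on whitespace instead of using `make_stems`, so it will not reproduce the table above exactly.

```python
import math
from collections import Counter

def toy_tfidf(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed idf: words that appear in fewer documents get a larger weight
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = toy_tfidf(["traffic was terrible", "the park was beautiful", "traffic was slow"])
for word, weight in zip(vocab, rows[0]):
    print(f"{word:10s} {weight:.3f}")
```

In the first toy document, "terrible" (which appears in one document) ends up with a larger weight than "was" (which appears in all three), which is the point of the idf part.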

In [None]:
X = words_df
y = df.polarity

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

In [None]:
%%time
# Create and train a logistic regression
logreg = LogisticRegression(C=1e9, solver='lbfgs', max_iter=1000)
logreg.fit(X, y)

In [None]:
%%time
# Create and train a random forest classifier
forest = RandomForestClassifier(n_estimators=50)
forest.fit(X, y)

In [None]:
%%time
# Create and train a linear support vector classifier (LinearSVC)
svc = LinearSVC()
svc.fit(X, y)

In [None]:
%%time
# Create and train a multinomial naive bayes classifier (MultinomialNB)
bayes = MultinomialNB()
bayes.fit(X, y)

In [None]:
pd.set_option("display.max_colwidth", 200)

test_set = pd.DataFrame({'content': [
   "@kenbakernow is just about the coolest guy I know!!  Thanks kenny!",
    "yum yum swedish fishies  mmm lets chat.haha",
    "@Kbelize it should have also been stopped once legends such as janet, whitney &amp; prince jumped on bored. hurts my soul. ",
    "Today's the big day for the iPhone update in the UK.  Not ready yet though ",
    "&quot;Dear Sleep Diary, i'm sorry i've hurt your feelings by saying that you're imaginary. i'll make it up to you by buying you a new cover..&quot; ",
    "#Primeval won't return for a 4th season! DNW! ",
    "Going to bed. He's grounded for another week  this sucks i miss him really bad and now i cant even talk to him",
    "@JonathanRKnight OMJ I go do the dishes and this is what I come back to... LOL I want a baby too ",
    "Supernatural me faz falta, Carry on my wayward son, there'll be peace when you are done ",
    "yah, todays gunna suck, prob be really busy and not on the computer much  | workin on a site w/ tabbed navigation, so far i have PS open",
    "@TK2575 dude am i gonna miss your first recital ??????? ",
    "Bowl of Lucky Charms hit the spot."
]})
test_set

In [None]:
# Put it through the vectoriser

# transform, not fit_transform, because we already learned all our words
unknown_vectors = vectorizer.transform(test_set.content)
unknown_words_df = pd.DataFrame(unknown_vectors.toarray(), columns=vectorizer.get_feature_names_out())
unknown_words_df.head()
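
The difference between `fit_transform` and `transform` can be sketched with a toy count vectorizer. This is only an illustration under simplifying assumptions (whitespace tokens, raw counts, no stemming); `fit_vocabulary` and `transform` below are hypothetical helpers, not scikit-learn functions.

```python
def fit_vocabulary(train_texts):
    """'fit': learn a word -> column index mapping from the training texts only."""
    words = sorted({w for text in train_texts for w in text.lower().split()})
    return {w: i for i, w in enumerate(words)}

def transform(texts, vocabulary):
    """'transform': count words in the already-learned columns; unknown words are dropped."""
    rows = []
    for text in texts:
        row = [0] * len(vocabulary)
        for w in text.lower().split():
            if w in vocabulary:
                row[vocabulary[w]] += 1
        rows.append(row)
    return rows

vocabulary = fit_vocabulary(["this sucks", "this is the coolest"])
print(vocabulary)
print(transform(["swedish fishies is the coolest"], vocabulary))
```

The new text gets exactly the same columns as the training data, which is what the classifiers expect; "swedish" and "fishies" were never seen during fitting, so they simply do not count.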

In [None]:
# Logistic Regression predictions + probabilities
test_set['pred_logreg'] = logreg.predict(unknown_words_df)
test_set['pred_logreg_proba'] = logreg.predict_proba(unknown_words_df)[:,1]

# Random forest predictions + probabilities
test_set['pred_forest'] = forest.predict(unknown_words_df)
test_set['pred_forest_proba'] = forest.predict_proba(unknown_words_df)[:,1]

# SVC predictions
test_set['pred_svc'] = svc.predict(unknown_words_df)

# Bayes predictions + probabilities
test_set['pred_bayes'] = bayes.predict(unknown_words_df)
test_set['pred_bayes_proba'] = bayes.predict_proba(unknown_words_df)[:,1]

In [None]:
test_set

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
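
Up to this point the classifiers were fit on all of `X`, so there was nothing left over to check them against. Holding some rows back fixes that. Below is a rough sketch of the idea behind `train_test_split`, assuming its documented default of holding out 25% of the rows after shuffling; `toy_train_test_split` is a made-up helper, and its rounding and shuffling details are not guaranteed to match scikit-learn's.

```python
import random

def toy_train_test_split(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle the row indices, then hold out a fraction of them as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Rows and labels are picked with the same indices so they stay paired
    rows_train = [rows[i] for i in train_idx]
    rows_test = [rows[i] for i in test_idx]
    labels_train = [labels[i] for i in train_idx]
    labels_test = [labels[i] for i in test_idx]
    return rows_train, rows_test, labels_train, labels_test

demo_rows = ["line %d" % i for i in range(8)]
demo_labels = [i % 2 for i in range(8)]
rows_train, rows_test, labels_train, labels_test = toy_train_test_split(demo_rows, demo_labels)
print(len(rows_train), len(rows_test))
```

In the real function, passing `random_state=...` makes the split repeatable from run to run.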

In [None]:
%%time

print("Training logistic regression")
logreg.fit(X_train, y_train)

print("Training random forest")
forest.fit(X_train, y_train)

print("Training SVC")
svc.fit(X_train, y_train)

print("Training Naive Bayes")
bayes.fit(X_train, y_train)

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
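
To read these tables: each row is what a tweet really is, each column is what the model guessed. A minimal sketch of how such a table is counted, using made-up labels; `toy_confusion_matrix` is a hypothetical helper that follows the same rows-are-true, columns-are-predicted layout as scikit-learn's `confusion_matrix`.

```python
def toy_confusion_matrix(y_true, y_pred, labels):
    """Rows are the true labels, columns are the predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix

# Made-up labels for illustration: 0 = negative, 1 = positive
y_true_demo = [0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 0, 1]
print(toy_confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]))
```

The diagonal holds the correct guesses; the off-diagonal cells are the two kinds of mistakes (a negative called positive, and a positive called negative).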

In [None]:
y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

In [None]:
y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

In [None]:
y_true = y_test
y_pred = bayes.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

In [None]:
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

In [None]:
y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

In [None]:
y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

In [None]:
y_true = y_test
y_pred = bayes.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
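
The `.div(matrix.sum(axis=1), axis=0)` step divides every row by its own total, so each row becomes a set of rates that sum to 1. A minimal sketch with made-up counts (not results from the models above); `row_normalize` is a hypothetical helper and assumes no row is empty.

```python
def row_normalize(matrix):
    """Divide each row by its total so every row sums to 1."""
    return [[cell / sum(row) for cell in row] for row in matrix]

# Made-up counts for illustration: rows = true class, columns = predicted class
counts = [[30, 10],
          [5, 55]]
rates = row_normalize(counts)
accuracy = sum(counts[i][i] for i in range(len(counts))) / sum(map(sum, counts))
print(rates)     # the diagonal entries are the share of each true class that was caught
print(accuracy)  # overall share of correct predictions
```

Comparing rates rather than raw counts makes it easier to line up the four classifiers, especially when one class has more rows than the other.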

In [None]:
import pandas as pd
pd.__version__

In [None]:
dfs = pd.read_csv("Shakespeare_data.csv")
dfs.head()

In [None]:
dfs = dfs[dfs['Player'].notna()]
dfs.head()

In [None]:
dfs = dfs[dfs['ActSceneLine'].notna()]
dfs.head()

In [None]:
df2 =dfs[dfs['Play'].isin(["Hamlet","Othello","King Lear"])]
dfs[dfs['Play'].isin(["Twelfth Night","Much Ado about nothing","As you like it"])]
# dfs[dfs['Play'].str.contains('As you')]

In [None]:
playlist = ["Hamlet","Othello","King Lear","Twelfth Night","Much Ado about nothing","As you like it"]
trag_list = ["Hamlet","Othello","King Lear"]

df2 =dfs[dfs['Play'].isin(playlist)].copy()


In [None]:
df2.shape

In [None]:
df2["score"] = 1


In [None]:
df2.head(10)

In [None]:
df2.loc[df2["Play"].isin(trag_list), 'score'] = 0

In [None]:

df2[df2['Play'].isin(trag_list)]

In [None]:
df_final = df2[['score', 'PlayerLine']].copy()
df_final.head()
df_final.shape

In [None]:
df_final = df_final.sample(frac=1).reset_index(drop=True)
df_final.shape

In [None]:
df_final.head()

In [None]:
df_final.to_csv('shakes_sentiment.csv',index=False)

In [None]:
import pandas as pd

df = pd.read_csv("shakes_sentiment.csv", nrows=15000)
df.head()

In [None]:
df.shape

In [None]:
df.score.value_counts()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer


In [None]:
vectorizer = TfidfVectorizer(max_features=1000, tokenizer=make_stems)
vectors = vectorizer.fit_transform(df.PlayerLine)
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names_out())
words_df.head()

In [None]:
X = words_df
y = df.score

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

In [None]:
%%time
# Create and train a logistic regression
logreg = LogisticRegression(C=1e9, solver='lbfgs', max_iter=1000)
logreg.fit(X, y)

In [None]:
%%time
# Create and train a random forest classifier
forest = RandomForestClassifier(n_estimators=50)
forest.fit(X, y)

In [None]:
%%time
# Create and train a linear support vector classifier (LinearSVC)
svc = LinearSVC()
svc.fit(X, y)

In [None]:
%%time
# Create and train a multinomial naive bayes classifier (MultinomialNB)
bayes = MultinomialNB()
bayes.fit(X, y)

In [None]:
pd.set_option("display.max_colwidth", 200)

test_set = pd.DataFrame({'content': [
 "Cassio came hither: I shifted him away,",
"herself wittingly.",
"Your answer, sir, is enigmatical:",
"To spy into abuses, and oft my jealousy",
"That we present us to him.",
"Fit for the mountains and the barbarous caves,",
"Merry, amen. I will, sir, I will.",
"Starts up, and stands on end. O gentle son,",
"Thou must be patient, we came crying hither:",
"Can labour ought in sad invention,",
"The same, my lord, and your poor servant ever.",
"Nor scar that whiter skin of hers than snow,",
"say her mind freely, or the blank verse shall halt",
"If it be made of penetrable stuff,",
"Are you good men and true?",
"Benedick bear it, pluck off the bull's horns and set",
"would have it at the Lady Hero's chamber-window.",
"And every measure fail me.",
"signior, walk aside with me: I have studied eight",
"She shall be buried with her face upwards.",
"Is he not jealous?",
"Be not amazed, right noble is his blood.",
"As of a father: for let the world take note,",
"Traitors ensteep'd to clog the guiltless keel,--",
"And with what wing the staniel cheques at it!",
"care for her frowning, now thou art an O without a",
"You will never run mad, niece.",
"Hail to your grace!",
"From Goneril his mistress salutations,",
"What a piece of work is a man! how noble in reason!"
]})
test_set

In [None]:
# Put it through the vectoriser

# transform, not fit_transform, because we already learned all our words
unknown_vectors = vectorizer.transform(test_set.content)
unknown_words_df = pd.DataFrame(unknown_vectors.toarray(), columns=vectorizer.get_feature_names_out())
unknown_words_df.head()

In [None]:
# Logistic Regression predictions + probabilities
test_set['pred_logreg'] = logreg.predict(unknown_words_df)
test_set['pred_logreg_proba'] = logreg.predict_proba(unknown_words_df)[:,1]

# Random forest predictions + probabilities
test_set['pred_forest'] = forest.predict(unknown_words_df)
test_set['pred_forest_proba'] = forest.predict_proba(unknown_words_df)[:,1]

# SVC predictions
test_set['pred_svc'] = svc.predict(unknown_words_df)

# Bayes predictions + probabilities
test_set['pred_bayes'] = bayes.predict(unknown_words_df)
test_set['pred_bayes_proba'] = bayes.predict_proba(unknown_words_df)[:,1]

In [None]:
test_set

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
%%time

print("Training logistic regression")
logreg.fit(X_train, y_train)

print("Training random forest")
forest.fit(X_train, y_train)

print("Training SVC")
svc.fit(X_train, y_train)

print("Training Naive Bayes")
bayes.fit(X_train, y_train)

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

In [None]:
y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

In [None]:
y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

In [None]:
y_true = y_test
y_pred = bayes.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

In [None]:
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

In [None]:
y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

In [None]:
y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

In [None]:
y_true = y_test
y_pred = bayes.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)