# Capstone

Continuation from Scraping notebook

### Model Walkthrough continued

# 1 GLOBAL IMPORTS

In [1]:
import numpy as np
import pandas as pd
import math
from decimal import Decimal
import re
import string
import unicodedata as unicode
import nltk
import contractions
import inflect
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer

%config InlineBackend.figure_format = 'retina'


# 2 EDA

### What are your variables of interest?

For the capstone, my features of interest are the words and phrases that represent risk factors: trigger words that could signal the need for early intervention in mental health, specifically depression.

### What outliers did you remove?

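In the context of text parsing, outliers were removed by restricting the vocabulary to trigrams (three-word sequences) that appear in at least 2 posts (`min_df=2`) and in no more than half of the posts (`max_df=0.5`). Any word combination that did not meet those criteria was treated as an outlier and dropped.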
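
The document-frequency filter described above can be sketched in plain Python. This is only an illustration of the idea, not the scikit-learn implementation: the helper names and the toy posts are hypothetical, and tokenisation is a simple whitespace split.

```python
from collections import Counter

def trigrams(tokens):
    """Return the consecutive three-word sequences in a token list."""
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

def filter_by_document_frequency(docs, min_df=2):
    """Keep trigrams that occur in at least `min_df` different documents."""
    doc_freq = Counter()
    for doc in docs:
        # set() so a trigram repeated within one post is counted once
        doc_freq.update(set(trigrams(doc.lower().split())))
    return {gram: df for gram, df in doc_freq.items() if df >= min_df}

# Hypothetical toy posts, not taken from the scraped thread
toy_posts = [
    "sending virtual hugs today",
    "I am sending virtual hugs",
    "hope you feel better soon",
]
kept = filter_by_document_frequency(toy_posts, min_df=2)
print(kept)
```

Only the trigram shared by two of the toy posts survives; every trigram seen in a single post is dropped.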

### What types of data imputation did you perform?

Extending the stopword list after visually examining the postings was the form of data imputation performed. Uppercase was converted to lowercase and accents were stripped from words. During scraping, HTML tags and script elements were removed.
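
As a rough sketch of the lowercasing and accent stripping mentioned above, the standard library's `unicodedata` module can decompose accented characters and drop the combining marks. The function name is hypothetical, and this only approximates what a vectorizer's `lowercase=True` and `strip_accents="unicode"` options do.

```python
import unicodedata

def strip_accents_and_lowercase(text):
    """Lowercase text and drop combining accent marks (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

cleaned = strip_accents_and_lowercase("Café Noël")
print(cleaned)
```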

In [2]:
# General imports for pickling and evaluating runtime performance
import pickle
import time

# sklearn imports for text preprocessing and topic modelling
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

#imports for LDA topic model visualisation
from __future__ import print_function
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()


## 2.1 Reading Data

In [3]:
# Read scraped data into pandas dataframe
postings = pd.read_excel("postings.xlsx")

In [4]:
# Only the post content is of interest for now, so this first iteration of EDA focuses on it.
# Scraping was broader so that likes, badges and dates can be used for other modelling in later stages.
postings.head()

Unnamed: 0,date,author,badge,likes,content
1,['3 March 2018'],Doolhof,Community Champion,4.0,Right now I feel like I don't have the energy ...
2,['3 March 2018'],quirkywords,Community Champion,3.0,Mrs dool.\nI am sending you a big reassuring h...
3,['3 March 2018'],Summer Rose,Valued Contributor,2.0,Hi Doolhof\nI'm sorry that you feel so low. Y...
4,['4 March 2018'],Doolhof,Community Champion,2.0,"Hi Quirky,\nThanks for the virtual hug, I need..."
5,['4 March 2018'],Doolhof,Community Champion,3.0,"Hi Summer Rose,\nThanks for your encouragement..."


In [5]:
postings.shape

(150, 5)

There are 150 posts from one of the discussion threads in the beyondblue mental health forum's depression subforum. This is a preliminary dataset and we will be expanding our dataset in the next iteration.

## 2.2 Text Preprocessing and Normalisation

After looking at the posts in Excel, we decided to preprocess only the post content for now. The dates can be dealt with later, when we perform time-series modelling; we scraped them because the final goal is an application where the time and order of the conversational flow matter. The fields other than content and date are fairly clean thanks to the HTML tag stripping done during the Beautiful Soup scraping process. The EDA therefore focuses on the content of the postings.

Initial exploration of the text data was done by visually examining each individual post in the Excel spreadsheet. We found words that could be excluded in addition to the standard stopword list in the NLTK library.

In [6]:
# extend the standard NLTK English stopwords with forum usernames, greetings and other uninformative words found by visual examination in Excel
stopwords = nltk.corpus.stopwords.words('english')
newStopWords = ["just", "like", "pamela","karen","dools","hi","hello","chloe","pamelar","pammy","mrs", "mr", "quirky", "doolhof", \
                "summer", "rose","demonblaster","quirkywords","db","Shell","etc","D","Shelley", "anne","Quercus","Nat","Dr","PamelaR","Bev","Mary", \
               "Agapanthus","Claire", "Weekes", "Paul","blondguy","em","BB","grandy","TonyWK","Tony","Ggrand","white", "knight","Mam", \
                "Laters","birdy", "Hiya","Croix", "Croix's","White","Knight","SN","DB","Queensland","Doolsy","bye","Hi/bye","Deebi","Weetbix", \
                "Willow", "(WW)","Chloe", "Chlo","BTW"]
stopwords.extend(newStopWords)
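
One thing to watch with the list above: scikit-learn's vectorizers lowercase the text before tokenising and filtering, and they do not lowercase the supplied stop-word list, so mixed-case entries such as "Shell" will not match. With the default token pattern (runs of two or more word characters), entries like "(WW)", "Hi/bye" and "D" cannot match a token as written either. A minimal sketch of normalising the custom entries first is shown below; the helper name is hypothetical and the sample is a small subset of the list.

```python
import re

# Same pattern as scikit-learn's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def normalise_stop_words(words):
    """Lowercase and tokenise stop words so they match vectorizer tokens."""
    normalised = set()
    for word in words:
        normalised.update(TOKEN_PATTERN.findall(word.lower()))
    return sorted(normalised)

sample = ["Shell", "PamelaR", "(WW)", "Hi/bye", "Croix's", "D"]
normalised_sample = normalise_stop_words(sample)
print(normalised_sample)
```

Single-character entries such as "D" disappear because the default tokeniser never emits them in the first place.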

In [7]:
def getcvt_freq_words(sparse_counts, columns):
   # sparse_counts is a sparse matrix, so sum() returns a 'matrix' datatype ...
   #   which we then convert into a 1-D ndarray for sorting
   word_counts = np.asarray(sparse_counts.sum(axis=0)).reshape(-1)

   # argsort() returns smallest first, so we reverse the result
   largest_count_indices = word_counts.argsort()[::-1]

   # build a Series of counts indexed by term, most frequent first
   freq_words = pd.Series(word_counts[largest_count_indices],
                          index=columns[largest_count_indices])

   return freq_words

In [8]:
def gettfidf_freq_words(sparse_counts, columns):
   # sparse_counts is a sparse matrix, so sum() returns a 'matrix' datatype ...
   #   which we then convert into a 1-D ndarray for sorting
   word_counts = np.asarray(sparse_counts.sum(axis=0)).reshape(-1)

   # argsort() returns smallest first, so we reverse the result
   largest_count_indices = word_counts.argsort()[::-1]

   # build a Series of summed TF-IDF weights indexed by term, largest first
   freq_words = pd.Series(word_counts[largest_count_indices],
                          index=columns[largest_count_indices])

   return freq_words

## 2.3 ETL: Transforming Text into Word Vectors

In [9]:
# ETL algorithms to normalise text into word vectors - testing countvectorizer and tfidfvectorizer

In [19]:
#CountVectorizer
cvt_vectorizer = CountVectorizer(analyzer="word", stop_words= stopwords, ngram_range=(3,3),min_df = 2, max_df=0.5, max_features=10000, lowercase=True, \
                      strip_accents="unicode")
X_cvt =  cvt_vectorizer.fit_transform(postings.content)
columns = np.array(cvt_vectorizer.get_feature_names())          # ndarray (for indexing below)
print(X_cvt.shape)
print("Requires {} ints to do a .toarray()!".format(X_cvt.shape[0] * X_cvt.shape[1]))
freq_words_cvt = getcvt_freq_words(X_cvt, columns)
freq_words_cvt.to_csv("freq_words_cvt.csv")
freq_words_cvt = pd.read_csv("freq_words_cvt.csv", header=None, names=["word", "count"])

(150, 65)
Requires 9750 ints to do a .toarray()!


In [20]:
freq_words_cvt

Unnamed: 0,word,count
0,mental health issues,8
1,cuddling black dog,6
2,sending virtual hugs,5
3,darn black dog,4
4,going try make,4
5,sorry hear struggling,3
6,dance around house,3
7,positivity motivation humour,3
8,help feel better,2
9,hope okay today,2


In [18]:
# tfidf vectorizer
tfidf_vectorizer = TfidfVectorizer(analyzer="word", stop_words= stopwords, ngram_range=(3,3),max_df=0.5, max_features=10000,
                             min_df=2,
                             use_idf=True)
X_tfidf =  tfidf_vectorizer.fit_transform(postings.content)
columns = np.array(tfidf_vectorizer.get_feature_names())          # ndarray (for indexing below)
print(X_tfidf.shape)
print("Requires {} ints to do a .toarray()!".format(X_tfidf.shape[0] * X_tfidf.shape[1]))
freq_words_tfidf = gettfidf_freq_words(X_tfidf, columns)
freq_words_tfidf.to_csv("freq_words_tfidf.csv")
freq_words_tfidf = pd.read_csv("freq_words_tfidf.csv", header=None, names=["word", "count"])

(150, 65)
Requires 9750 ints to do a .toarray()!


In [21]:
freq_words_tfidf

Unnamed: 0,word,count
0,mental health issues,5.005902
1,sending virtual hugs,4.049465
2,cuddling black dog,3.846836
3,positivity motivation humour,2.685494
4,sorry hear struggling,2.266668
5,dance around house,2.192610
6,darn black dog,2.111416
7,got nan dog,2.000000
8,love go camping,2.000000
9,enjoy time family,2.000000
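
The "count" column here is a sum of TF-IDF weights over posts, not a raw count. A small worked sketch of the weighting, assuming scikit-learn's defaults (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row), is given below. The function name is hypothetical. It suggests one possible reading of the exact 2.000000 scores: a trigram that appears in two posts, and is the only retained trigram in each, contributes a weight of 1.0 per post. That reading is an assumption; the individual posts were not checked.

```python
import math

def tfidf_row(term_counts, doc_freq, n_docs):
    """L2-normalised TF-IDF weights for one document, with smoothed IDF."""
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

# A post whose only retained trigram occurs in 2 of the 150 posts
single = tfidf_row({"love go camping": 1}, {"love go camping": 2}, n_docs=150)
print(single)

# A post with two retained trigrams: the rarer one gets the larger weight
pair = tfidf_row(
    {"mental health issues": 1, "love go camping": 1},
    {"mental health issues": 8, "love go camping": 2},
    n_docs=150,
)
print(pair)
```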


# 3 Modelling

## 3.1 Topic Modelling

### 3.1.1 LDA (Latent Dirichlet Allocation) topic modelling

In [22]:
#LDA topic model fit based on CountVectorizer 

lda_cvt = LatentDirichletAllocation(n_components=20, random_state=0, n_jobs=-1, learning_method="online")
lda_cvt.fit_transform(X_cvt)

array([[0.0125    , 0.0125    , 0.0125    , ..., 0.0125    , 0.0125    ,
        0.0125    ],
       [0.05      , 0.05      , 0.05      , ..., 0.05      , 0.05      ,
        0.05      ],
       [0.0125    , 0.0125    , 0.0125    , ..., 0.0125    , 0.0125    ,
        0.0125    ],
       ...,
       [0.01666667, 0.01666667, 0.68333333, ..., 0.01666667, 0.01666667,
        0.01666667],
       [0.025     , 0.025     , 0.025     , ..., 0.025     , 0.525     ,
        0.025     ],
       [0.05      , 0.05      , 0.05      , ..., 0.05      , 0.05      ,
        0.05      ]])
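
The values in the matrix above follow a simple pattern. scikit-learn normalises each row of the document-topic matrix, and with the default `doc_topic_prior` of `1 / n_components` a post with no retained trigrams ends up uniform at 1/20 = 0.05. The sketch below reproduces the other values under the simplifying assumption that all of a post's trigram counts are assigned to a single topic; the function name and the topic indices are hypothetical.

```python
def doc_topic_row(n_topics, tokens_per_topic):
    """Normalise gamma = prior + expected counts into a topic distribution."""
    prior = 1.0 / n_topics  # scikit-learn's default doc_topic_prior
    gamma = [prior + tokens_per_topic.get(k, 0.0) for k in range(n_topics)]
    total = sum(gamma)
    return [g / total for g in gamma]

empty = doc_topic_row(20, {})           # no retained trigrams
one = doc_topic_row(20, {18: 1.0})      # one trigram, all on one topic
two = doc_topic_row(20, {2: 2.0})       # two trigrams, all on one topic
three = doc_topic_row(20, {5: 3.0})     # three trigrams, all on one topic

print(round(empty[0], 4), round(one[18], 4), round(one[0], 4))
print(round(two[2], 5), round(two[0], 5), round(three[0], 4))
```

Under this reading, rows of 0.05 are posts that lost every trigram to the `min_df`, `max_df` and stop-word filters, which may be why the topic model carries little signal for many posts.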

In [15]:
#import GridSearchCV
from sklearn.model_selection import GridSearchCV

In [16]:
# Define Search Param
search_params = {'n_components': [2,5,7,10,20,30], 'learning_decay': [.5, .7, .9]}

# Init the Model
lda = LatentDirichletAllocation()

# Init Grid Search Class
model = GridSearchCV(lda, param_grid=search_params)

# Do the Grid Search for CVT
model.fit(X_cvt)





GridSearchCV(cv=None, error_score='raise',
       estimator=LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=10, n_jobs=1,
             n_topics=None, perp_tol=0.1, random_state=None,
             topic_word_prior=None, total_samples=1000000.0, verbose=0),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_components': [2, 5, 7, 10, 20, 30], 'learning_decay': [0.5, 0.7, 0.9]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [17]:
#Best parameters for cvt
model.best_params_

{'learning_decay': 0.9, 'n_components': 2}

In [23]:
# LDA refit on CountVectorizer output with the grid-searched learning_decay=0.9; n_components is kept at 20 rather than the grid-search optimum of 2

lda_cvt = LatentDirichletAllocation(n_components=20, random_state=0, n_jobs=-1, learning_method="online", learning_decay=0.9)
lda_cvt.fit_transform(X_cvt)

array([[0.0125    , 0.0125    , 0.0125    , ..., 0.0125    , 0.0125    ,
        0.0125    ],
       [0.05      , 0.05      , 0.05      , ..., 0.05      , 0.05      ,
        0.05      ],
       [0.0125    , 0.0125    , 0.0125    , ..., 0.0125    , 0.0125    ,
        0.0125    ],
       ...,
       [0.01666667, 0.01666667, 0.01666667, ..., 0.01666667, 0.01666667,
        0.01666667],
       [0.025     , 0.025     , 0.025     , ..., 0.025     , 0.525     ,
        0.025     ],
       [0.05      , 0.05      , 0.05      , ..., 0.05      , 0.05      ,
        0.05      ]])

In [24]:
#LDA topic model fit based on tfidf_vectorizer 
lda_tfidf = LatentDirichletAllocation(n_components=20, random_state=0, n_jobs=-1, learning_method='online')
lda_tfidf.fit_transform(X_tfidf)

array([[0.01837593, 0.01837593, 0.01837593, ..., 0.01837593, 0.01837593,
        0.01837593],
       [0.05      , 0.05      , 0.05      , ..., 0.05      , 0.05      ,
        0.05      ],
       [0.01830127, 0.01830127, 0.01830127, ..., 0.01830127, 0.01830127,
        0.01830127],
       ...,
       [0.02071068, 0.02071068, 0.02071068, ..., 0.02071068, 0.02071068,
        0.02071068],
       [0.025     , 0.025     , 0.025     , ..., 0.025     , 0.525     ,
        0.025     ],
       [0.05      , 0.05      , 0.05      , ..., 0.05      , 0.05      ,
        0.05      ]])

In [20]:
# Define Search Param
search_params = {'n_components': [2,5,7,10,20,30], 'learning_decay': [.5, .7, .9]}

# Init the Model
lda = LatentDirichletAllocation()

# Init Grid Search Class
model = GridSearchCV(lda, param_grid=search_params)

# Do the Grid Search for tfidf
model.fit(X_tfidf)





GridSearchCV(cv=None, error_score='raise',
       estimator=LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=10, n_jobs=1,
             n_topics=None, perp_tol=0.1, random_state=None,
             topic_word_prior=None, total_samples=1000000.0, verbose=0),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_components': [2, 5, 7, 10, 20, 30], 'learning_decay': [0.5, 0.7, 0.9]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [21]:
#Best parameters for tfidf
model.best_params_

{'learning_decay': 0.9, 'n_components': 2}

In [25]:
# LDA refit on tfidf_vectorizer output with the grid-searched learning_decay=0.9; n_components is kept at 20 rather than the grid-search optimum of 2
lda_tfidf = LatentDirichletAllocation(n_components=20, random_state=0, n_jobs=-1, learning_method='online', learning_decay=0.9)
lda_tfidf.fit_transform(X_tfidf)

array([[0.01837593, 0.01837593, 0.01837593, ..., 0.01837593, 0.01837593,
        0.01837593],
       [0.05      , 0.05      , 0.05      , ..., 0.05      , 0.05      ,
        0.05      ],
       [0.01830127, 0.01830127, 0.01830127, ..., 0.01830127, 0.01830127,
        0.01830127],
       ...,
       [0.02071068, 0.02071068, 0.02071068, ..., 0.02071068, 0.02071068,
        0.02071068],
       [0.025     , 0.025     , 0.025     , ..., 0.025     , 0.525     ,
        0.025     ],
       [0.05      , 0.05      , 0.05      , ..., 0.05      , 0.05      ,
        0.05      ]])

### 3.1.2 Visualising LDA (Latent Dirichlet Allocation) topic models


In [26]:
# Visualising LDA topic model based on CountVectorizer
pyLDAvis.sklearn.prepare(lda_cvt, X_cvt, cvt_vectorizer, n_jobs=1, mds='tsne')


In [28]:
#Visualising LDA Topic model based on tfidf_vectorizer
pyLDAvis.sklearn.prepare(lda_tfidf, X_tfidf, tfidf_vectorizer, n_jobs=-1, mds='tsne')

## 3.2 Sentiment Analysis

In [29]:
# import vader sentiment analyser
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [30]:
#instantiation of sentiment analyser
sentiment_analyzer = SentimentIntensityAnalyzer()

In [31]:
#create empty dataframe to store sentiment scores for each post
polarity_scores = pd.DataFrame(columns=["post", "sentiment_score"])

In [32]:
# computing sentiment scores for each post and populating dataframe
posts = []
polarities = []
for post in postings.content:
    polarity = sentiment_analyzer.polarity_scores(post)
    posts.append(post)
    polarities.append(polarity)

polarity_scores["post"] = posts
polarity_scores["sentiment_score"] = polarities

In [33]:
polarity_scores

Unnamed: 0,post,sentiment_score
0,Right now I feel like I don't have the energy ...,"{'neg': 0.26, 'neu': 0.544, 'pos': 0.196, 'com..."
1,Mrs dool.\nI am sending you a big reassuring h...,"{'neg': 0.158, 'neu': 0.608, 'pos': 0.234, 'co..."
2,Hi Doolhof\nI'm sorry that you feel so low. Y...,"{'neg': 0.127, 'neu': 0.686, 'pos': 0.187, 'co..."
3,"Hi Quirky,\nThanks for the virtual hug, I need...","{'neg': 0.144, 'neu': 0.727, 'pos': 0.129, 'co..."
4,"Hi Summer Rose,\nThanks for your encouragement...","{'neg': 0.156, 'neu': 0.708, 'pos': 0.136, 'co..."
5,Dools so sorry to hear you're in deep darl.\nI...,"{'neg': 0.045, 'neu': 0.643, 'pos': 0.313, 'co..."
6,Dear Dools how are you feeling today darl 🤗\nI...,"{'neg': 0.086, 'neu': 0.633, 'pos': 0.282, 'co..."
7,"Dear DB,\nThank you so very much. I have been ...","{'neg': 0.086, 'neu': 0.738, 'pos': 0.176, 'co..."
8,Hi all \nYeah it does pull us under its so dam...,"{'neg': 0.079, 'neu': 0.569, 'pos': 0.351, 'co..."
9,"Hi DB,\nWoke up this morning wondering why I h...","{'neg': 0.199, 'neu': 0.731, 'pos': 0.07, 'com..."


In [34]:
polarity_scores.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 2 columns):
post               150 non-null object
sentiment_score    150 non-null object
dtypes: object(2)
memory usage: 2.4+ KB


<u><b>Manual evaluation of Vader sentiment analysis by random sampling of 2 positive and 2 neutral posts. Negative accuracy was confirmed using proof by contradiction.</b></u>

2 positive posts

In [35]:
polarity_scores.post[8]

"Hi all \nYeah it does pull us under its so damned powerful. I did that this last down, was heavy but not like they mostly were. Went with it. Easy to slide without resistance though\n Yeah agree there's something in talking to it, liking your wording Dools.. talking to it, challenging thoughts I've been saying but yours is better doesnt sound like effort where challenging does. Will think of you when I word it the same ☺ thanks \nIm not saying acceptance doesnt work ..just I don't get it yet.. may not.. does sound easier. Maybe going with it is part of acceptance ..dunno \nI think it's a good release glad you're talking Dools\nHope you have a restful secure sleep  🤗..eveyone..\nnigh nite 😊"

In [36]:
polarity_scores.sentiment_score[8]

{'compound': 0.9935, 'neg': 0.079, 'neu': 0.569, 'pos': 0.351}

In [37]:
polarity_scores.post[23]

"Beautifully said Rose \nThe stars ... oh yeah ..\nWe're part of what we see\nThe beauty. Intrigue. Love em"

In [38]:
polarity_scores.sentiment_score[23]

{'compound': 0.9313, 'neg': 0.0, 'neu': 0.535, 'pos': 0.465}

2 neutral posts

In [39]:
polarity_scores.post[132]

"Hi Dools,\nThat lady at the Op Shop sounds lovely! Its great that you have people that you can openly talk to. It certainly helps.  \nI try to think of my depression as a feral cat and my anxiety as a vicious dog. LOL. \nHow's everyone feeling today? My cat is still sleeping but my dog is awake and I can feel it breathing down my neck :(\nChloe"

In [40]:
polarity_scores.sentiment_score[132]

{'compound': 0.2402, 'neg': 0.121, 'neu': 0.718, 'pos': 0.16}

In [41]:
polarity_scores.post[149]

"Hugs are special, especially tight ones that make you feel like the person hugging you never wants to let you go. I'm going to try and see if I can get him to meet me at the shops tomorrow I really need to talk to someone face to face that I know will listen and not judge (and give me hugs lol). \nDepression todsy is one of the worst days I've had... Feeling very sad and suicidal. Even though I had a good laugh this morning I am still very bad today... Dumb dog and cat getting the better of me again. I just want to go to sleep but I won't be able too .\nHope you all had lovely days, it might cheer me up to hear that you guys did \nchloe x"

In [42]:
polarity_scores.sentiment_score[149]

{'compound': 0.9342, 'neg': 0.096, 'neu': 0.695, 'pos': 0.209}
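
For reference, the VADER README suggests labelling a text from its compound score: positive at 0.05 or above, negative at -0.05 or below, and neutral in between. The sketch below applies that convention to the four compound scores shown above. The function name is hypothetical, and the "positive" and "neutral" groupings in this notebook may have been assigned on a different basis (for example, the largest of the neg/neu/pos proportions), so this is a comparison rather than a correction.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a label using the README thresholds."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores printed above for posts 8, 23, 132 and 149
compound_scores = {8: 0.9935, 23: 0.9313, 132: 0.2402, 149: 0.9342}
labels = {post_id: label_from_compound(score)
          for post_id, score in compound_scores.items()}
print(labels)
```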

In [43]:
#Cumulative sentiment score for thread of 150 posts
negative = 0.0
neutral = 0.0
positive = 0.0
compound = 0.0
for scores in polarity_scores.sentiment_score:
    negative += scores["neg"]
    neutral += scores["neu"]
    positive += scores["pos"]
    compound += scores["compound"]

print("Negative score: " + str(negative))
print("Neutral score: " + str(neutral))
print("Positive score: " + str(positive))
print("Compound score: " + str(compound))


Negative score: 9.863999999999995
Neutral score: 105.87299999999998
Positive score: 34.261
Compound score: 121.31860000000003
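
Because these are sums over 150 posts, dividing by the number of posts gives a per-post average that is easier to compare with the individual scores shown earlier. A short sketch using the totals printed above (rounded to the precision shown in the output):

```python
# Totals copied from the output above
totals = {"neg": 9.864, "neu": 105.873, "pos": 34.261, "compound": 121.3186}
n_posts = 150

# Mean score per post, rounded to four decimal places
means = {name: round(total / n_posts, 4) for name, total in totals.items()}
print(means)
```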


# 4 Summary of Analysis:
### Model Selection and Implementation

For the first phase, we selected two vectorizers (CountVectorizer and TfidfVectorizer), fit an LDA model on the output of each and tuned the LDA hyperparameters using grid search. For topic modelling, LDA was used as a simple, highly interpretable model to get an initial high-level understanding of the significant terms and themes in the thread. For sentiment analysis we chose Vader because the library can handle emoticons and text that has not undergone significant normalisation (see the feature listing at https://github.com/cjhutto/vaderSentiment).

### Implementation and Evaluation

Overall, in this capstone:

<b>Scraping</b>
* Scraped 1 thread in the beyondblue depression forum using BeautifulSoup
* While scraping, used regex to extract the elements without HTML tags to save time in text preprocessing. Status codes were printed to monitor the status of scraping
* The scraped dataframe was saved to Excel for persistence and for use in EDA and modelling

<b>EDA and modelling</b>
* Visually examined the data and added stopwords to the standard NLTK stopword list.
* Used both CountVectorizer and TfidfVectorizer to preprocess and transform the text into word vectors.
* Tuned the LDA hyperparameters (`n_components`, `learning_decay`) on each vectorizer's output using grid-search cross-validation
* Fit an LDA topic model to the output of both vectorizers and compared them
* Separately performed sentiment analysis using Vader on the individual posts without text preprocessing
* In conducting sentiment analysis, we summed the sentiment scores of the individual posts to see the overall sentiment of the forum thread.
* To check the accuracy of the sentiment analysis, a random sample of 2 neutral and 2 positive posts was examined to compare human labelling with algorithmic output
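
The last step, comparing human labels with algorithmic output, could be scaled up along the lines of the sketch below. The function name is hypothetical and the labels are invented purely for illustration; they are not results from the thread.

```python
from collections import Counter

def agreement(human_labels, model_labels):
    """Return overall accuracy and a tally of (human, model) label pairs."""
    if len(human_labels) != len(model_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(human_labels, model_labels))
    correct = sum(1 for human, model in pairs if human == model)
    return correct / len(pairs), Counter(pairs)

# Hypothetical labels for illustration only
human = ["positive", "positive", "neutral", "negative", "neutral"]
model = ["positive", "positive", "positive", "negative", "neutral"]

accuracy, confusion = agreement(human, model)
print(accuracy)
print(confusion[("neutral", "positive")])
```

The pair tally shows which kinds of disagreement occur, which matters more than the headline accuracy when the classes are as unbalanced as they are in this thread.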


### Inference

From the topic models, we can make inferences about the commonly used phrases in the forum thread. Using sentiment analysis, we also inferred that the overall sentiment of the forum thread was mainly neutral and positive (summed scores of 105.87 neutral and 34.26 positive against 9.86 negative across the 150 posts). There was hardly any negativity in the thread, which is expected in a forum of this kind. Positive reinforcement, active listening and constructive suggestions were present in this forum.

# 5 Next Steps

* Phase 2: Increase the dataset size and rerun the phase 1 tasks with improvements, such as pickling the scraped dataframe instead of saving it to Excel to minimise data loss
* Phase 3: Introduce subject matter experts and continue to fine-tune the models. Grow the dataset until there is enough data to train recurrent neural networks such as LSTMs and GRUs, which handle vanishing/exploding gradients well.
* Phase 4: Refine the model for deployment and incorporate it into the chatbot
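
As a minimal sketch of the pickling idea in Phase 2, the standard library's `pickle` module round-trips Python objects, including newlines and emoji in the text, without a spreadsheet format in between. The record below is hypothetical; for a dataframe, pandas offers `DataFrame.to_pickle` and `pd.read_pickle` for the same purpose.

```python
import os
import pickle
import tempfile

# Hypothetical scraped record, for illustration only
records = [
    {"author": "example_user", "likes": 4.0, "content": "Line one\nLine two 🤗"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "postings.pkl")
    # Write the records to disk in binary form
    with open(path, "wb") as handle:
        pickle.dump(records, handle, protocol=pickle.HIGHEST_PROTOCOL)
    # Read them back and confirm nothing changed
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == records)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.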
    

# APPENDIX

## Scripts for future use

In [123]:
# sklearn imports for text preprocessing and NLP models
from sklearn.decomposition import TruncatedSVD # SVD singular value decomposition
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB # multinomial Naive Bayes

In [124]:
Krunal = LogisticRegression()

In [136]:
Krunal.__dir__()

['penalty',
 'dual',
 'tol',
 'C',
 'fit_intercept',
 'intercept_scaling',
 'class_weight',
 'random_state',
 'solver',
 'max_iter',
 'multi_class',
 'verbose',
 'warm_start',
 'n_jobs',
 '__module__',
 '__doc__',
 '__init__',
 'fit',
 'predict_proba',
 'predict_log_proba',
 '_get_param_names',
 'get_params',
 'set_params',
 '__repr__',
 '__getstate__',
 '__setstate__',
 '__dict__',
 '__weakref__',
 '__hash__',
 '__str__',
 '__getattribute__',
 '__setattr__',
 '__delattr__',
 '__lt__',
 '__le__',
 '__eq__',
 '__ne__',
 '__gt__',
 '__ge__',
 '__new__',
 '__reduce_ex__',
 '__reduce__',
 '__subclasshook__',
 '__init_subclass__',
 '__format__',
 '__sizeof__',
 '__dir__',
 '__class__',
 'decision_function',
 'predict',
 '_predict_proba_lr',
 '_estimator_type',
 'score',
 'densify',
 'sparsify']

In [69]:
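#NLP normalisation functions - keeping these functions here in case I need them later

def replace_contractions(text):
    """Replace contractions in string of text"""
    return contractions.fix(text)

def lemmatize_verbs(words):
    """Lemmatize verbs in list of tokenized words"""
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word, pos='v') for word in words]

def remove_stopwords(words):
    """Drop stopwords from list of tokenized words (assumed helper: it was called below but never defined)"""
    # relies on the extended `stopwords` list built in section 2.2
    return [word for word in words if word.lower() not in stopwords]

def normalize(words):
    """Lemmatize verbs, then remove stopwords, and return the new list"""
    words = lemmatize_verbs(words)
    words = remove_stopwords(words)
    return words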
    

In [147]:
#Textblob metric sentiment analysis evaluation
from textblob import TextBlob

neg_threshold = 0.514
pos_threshold = 0.01


pos_count = 0
pos_correct = 0

for line in postings.content:
    analysis = TextBlob(line)

    if analysis.sentiment.polarity >= pos_threshold:
        if analysis.sentiment.polarity > 0:
            pos_correct += 1
        pos_count +=1


neg_count = 0
neg_correct = 0

for line in postings.content:
    analysis = TextBlob(line)
    if analysis.sentiment.polarity < neg_threshold:
        if analysis.sentiment.polarity <= 0:
            neg_correct += 1
        neg_count +=1

print("Positive accuracy = {}% via {} samples".format(pos_correct/pos_count*100.0, pos_count))
print("Negative accuracy = {}% via {} samples".format(neg_correct/neg_count*100.0, neg_count))

Positive accuracy = 100.0% via 137 samples
Negative accuracy = 9.090909090909092% via 143 samples


In [58]:
#Vader metric evaluation
neg_threshold = 0.514
pos_threshold = 0.26
pos_count = 0
pos_correct = 0

for line in postings.content:
    polarity = sentiment_analyzer.polarity_scores(line)
    if not polarity['neg'] > pos_threshold:
        if polarity['pos']-polarity['neg'] > 0:
            pos_correct += 1
        pos_count +=1


neg_count = 0
neg_correct = 0

for line in postings.content:
    polarity = sentiment_analyzer.polarity_scores(line)
    if not polarity['pos'] > neg_threshold:
        if polarity['pos']-polarity['neg'] <= 0:
            neg_correct += 1
        neg_count +=1

print("Positive accuracy = {}% via {} samples".format(pos_correct/pos_count*100.0, pos_count))
print("Negative accuracy = {}% via {} samples".format(neg_correct/neg_count*100.0, neg_count))

Positive accuracy = 92.66666666666666% via 150 samples
Negative accuracy = 7.333333333333333% via 150 samples
