# Sentiment analysis

For the first task in Part 2, a custom model was developed to 
apply sentiment analysis to the tweets streamed in Part 1.

The training data was the *Sentiment140* dataset 
(Go, Bhayani and Huang, 2009), acquired via
 [Kaggle](https://www.kaggle.com/kazanova/sentiment140).

## Load and sample data


In [19]:
import pandas as pd
from brexittweets.custom_funcs.process_text_functions import preprocess_training_data
from brexittweets.config import training_data_path

# column names for the training CSV
names = ['polarity', 'id', 'date', 'flag', 'user', 'text']
classified_tweets = pd.read_csv(training_data_path, names=names, encoding='ISO-8859-1')

# apply the same preprocessing as was applied to streaming tweets
classified_tweets['preprocessed'] = classified_tweets['text'].map(preprocess_training_data)

# tweets are ordered by sentiment, so take a sample from the middle;
# keep only the first (polarity) and last (preprocessed) columns
classified_tweets = classified_tweets.iloc[650000:950000, [0, -1]]

classified_tweets.head()

Unnamed: 0,polarity,preprocessed
0,0,man thats insane im waiting for my g jailbrea...
1,0,bright and early day last full day in key wes...
2,0,megafail on the chair front though brought my...
3,0,isnt feelin well
4,0,itsucks when you are the only one home and get...


Get the polarity of the Tweets from the classified dataset. Check that we have equal numbers of positive (4) and
 negative (0) Tweets.
 

In [22]:
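Y = classified_tweets.pop('polarity')
print(f'Unique values: {Y.unique()}')
# (Y == 0) is a boolean Series; sum() counts its True entries,
# whereas len() would return the length of the whole Series
print(f'Number of negative Tweets: {(Y == 0).sum()}')
print(f'Number of positive Tweets: {(Y == 4).sum()}')

Unique values: [0 4]
Number of negative Tweets: 150000
Number of positive Tweets: 150000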


## Apply text preprocessing

For each tweet, apply tokenization, stemming, and lemmatization 
so that all forms of each unique word are grouped together. 

Then count the occurrence of each word to create a dictionary of 
features for each tweet.

In [4]:
from brexittweets.custom_funcs.process_text_functions import tokenize_tag, stem_lemmatize

preprocessed_tokenized = []

for tweet in classified_tweets['preprocessed']:
    tokenized_tweet = tokenize_tag(tweet)
    preprocessed_tokenized.append(stem_lemmatize(tokenized_tweet))

# Check that tokenisation, stemming and lemmatization have worked.
preprocessed_tokenized[0]

['that', 'insan', 'wait', 'jailbreak', 'miss', 'flexibl']

In [5]:
from typing import Dict, List

def get_features(tokens: List[str]) -> Dict[str, int]:
    """
    Count each unique word in a tokenized tweet and return a dictionary of words and their counts.
    """
    features = {}
    for word in tokens:
        if word not in features:
            # First occurrence: create the word key and initialise it to 1
            features[word] = 1
        else:
            # Increase count of word
            features[word] += 1
    return features

# Pair the word counts for every tweet with its polarity label
all_features = [(get_features(tweet), label)
                for tweet, label in zip(preprocessed_tokenized, Y)]

In [6]:
len(all_features)

300000
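
The counting done by `get_features` is a bag-of-words count, which the standard library's `collections.Counter` also provides. The sketch below is an alternative under that assumption, not the function used in this notebook. The name `get_features_counter` is hypothetical, and the token list is the example output shown earlier with one extra `'wait'` appended to show a count above 1.

```python
from collections import Counter
from typing import Dict, List


def get_features_counter(tokens: List[str]) -> Dict[str, int]:
    """Hypothetical alternative to get_features: count each token in a tweet."""
    # Counter tallies occurrences; dict() converts it to a plain dictionary
    return dict(Counter(tokens))


# Example tokens (the extra 'wait' is added for illustration only)
example_tokens = ['that', 'insan', 'wait', 'jailbreak', 'miss', 'flexibl', 'wait']
example_features = get_features_counter(example_tokens)
print(example_features)
```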

## Split data

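The data is ordered by sentiment, with negative tweets first, so the middle 210,000 
tweets are used for training and the 45,000 tweets at each end are held out for testing. 
Both sets therefore contain equal numbers of negative and positive tweets.

The training data is supplied to NLTK's `NaiveBayesClassifier`. 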

In [7]:
from nltk.classify import NaiveBayesClassifier

# Split data into training and testing.

training_data = all_features[45000:255000]
testing_data = all_features[0:45000] + all_features[255000:]

nbclassifier = NaiveBayesClassifier.train(training_data)
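
Because the labels are sorted, slicing the ends off gives a balanced test set only when the sample is centred on the boundary between the two classes. The toy sketch below illustrates this with a hypothetical helper, `split_ends`, and made-up labels; it is not part of the notebook's code.

```python
from collections import Counter
from typing import List, Tuple, TypeVar

T = TypeVar('T')


def split_ends(items: List[T], n_each_end: int) -> Tuple[List[T], List[T]]:
    """Hypothetical helper: hold out n_each_end items from each end for testing."""
    if n_each_end < 0 or 2 * n_each_end > len(items):
        raise ValueError('n_each_end must be between 0 and half the list length')
    training = items[n_each_end:len(items) - n_each_end]
    testing = items[:n_each_end] + items[len(items) - n_each_end:]
    return training, testing


# Toy labels ordered by sentiment: ten negative (0) followed by ten positive (4)
toy_labels = [0] * 10 + [4] * 10
toy_train, toy_test = split_ends(toy_labels, 3)

print(Counter(toy_train))  # class counts in the training portion
print(Counter(toy_test))   # class counts in the held-out portion
```

With the notebook's numbers, the same split would be `split_ends(all_features, 45000)`.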

## Examine the classifier

We can inspect the classifier to find out which features are most informative, then 
predict sentiment for the testing set and evaluate accuracy.

In [8]:
nbclassifier.show_most_informative_features(20)

Most Informative Features
                  farrah = 1                   0 : 4      =    321.0 : 1.0
                    iran = 1                   0 : 4      =    114.6 : 1.0
              squarespac = 1                   0 : 4      =     97.7 : 1.0
                  hayfev = 1                   0 : 4      =     73.0 : 1.0
            followfriday = 1                   4 : 0      =     69.4 : 1.0
                  divorc = 1                   0 : 4      =     52.6 : 1.0
                   cramp = 1                   0 : 4      =     50.3 : 1.0
                wolverin = 1                   4 : 0      =     44.6 : 1.0
                  glasto = 1                   0 : 4      =     41.7 : 1.0
                  father = 2                   0 : 4      =     40.6 : 1.0
                    ceci = 1                   0 : 4      =     33.7 : 1.0
                  father = 1                   0 : 4      =     30.0 : 1.0
                  booooo = 1                   0 : 4      =     27.0 : 1.0

In [9]:
# turn testing data into DataFrame; Series.append was removed in pandas 2.0,
# so the two held-out label slices are joined with pd.concat
test_df = pd.DataFrame(data={'features': [feature[0] for feature in testing_data],
                             'polarity': pd.concat([Y.iloc[:45000], Y.iloc[255000:]])})

# Add column of predicted values.
test_df['classify'] = test_df['features'].apply(nbclassifier.classify)

test_df.head()

Unnamed: 0,features,polarity,classify
650000,"{'that': 1, 'insan': 1, 'wait': 1, 'jailbreak'...",0,0
650001,"{'bright': 1, 'earli': 1, 'west': 1, 'dont': 1...",0,0
650002,"{'megafail': 1, 'chair': 1, 'bring': 1, 'camp'...",0,0
650003,"{'isnt': 1, 'feelin': 1}",0,0
650004,"{'itsuck': 1, 'home': 1, 'stuck': 1, 'bathroom...",0,0


In [11]:
# Labels are 0 (negative) and 4 (positive), so the sum and difference of the
# true and predicted labels identify each of the four outcomes
test_df['sum'] = test_df['polarity'] + test_df['classify']
test_df['diff'] = test_df['polarity'] - test_df['classify']

TP = len(test_df[test_df['sum']==8]) # true positives
TN = len(test_df[test_df['sum']==0]) # true negatives
FP = len(test_df[test_df['diff']==-4]) # false positives
FN = len(test_df[test_df['diff']==4]) # false negatives
total = TP+TN+FP+FN  # named total to avoid shadowing the built-in all()

print(f'True positives: {TP}\nTrue negatives: {TN}\nFalse positives: {FP}\nFalse negatives: {FN}')

True positives: 32561
True negatives: 34240
False positives: 10760
False negatives: 12439


In [12]:
# Evaluate model metrics

accuracy = (TP+TN)/total
recall = TP/(TP+FN)
precision = TP/(TP+FP)
f1_score = 2 * ((precision * recall)/(precision + recall))

print(f'Classifier metrics:\n--------------------\nAccuracy: {accuracy}\nRecall: {recall}\nPrecision: {precision}\nf1 score: {f1_score}')


Classifier metrics:
--------------------
Accuracy: 0.7422333333333333
Recall: 0.7235777777777778
Precision: 0.7516216153828398
f1 score: 0.7373331370795169
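
For reference, these are the standard definitions of the four metrics in terms of the confusion-matrix counts. Substituting the counts printed above reproduces the reported values (rounded to four decimal places here):

```latex
\begin{align*}
\text{accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{32561 + 34240}{90000} \approx 0.7422 \\
\text{recall} &= \frac{TP}{TP + FN} = \frac{32561}{45000} \approx 0.7236 \\
\text{precision} &= \frac{TP}{TP + FP} = \frac{32561}{43321} \approx 0.7516 \\
F_1 &= \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \approx 0.7373
\end{align*}
```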


## Apply the model to the new data

Use the trained classifier to predict the sentiment of the streamed tweets.

In [13]:
def classify_tweet(tweet_text: str) -> int:
    """
    Predict the sentiment of a tweet by applying text preprocessing
    and supplying the preprocessed tweet to the sentiment classifier.
    Returns the polarity label: 0 (negative) or 4 (positive).
    """
    preprocessed_tokenized_tweet = stem_lemmatize(tokenize_tag(tweet_text))
    features = get_features(preprocessed_tokenized_tweet)
    return nbclassifier.classify(features)
    
# Test the classifier
test_text = ["This is a rubbish tweet!",
             "This is a great tweet!",
             "I don't like mushrooms, they taste horrible",
             "I love mushrooms, they are great"]

for sentence in test_text:
    print(sentence)
    print(classify_tweet(sentence))


This is a rubbish tweet!
0
This is a great tweet!
4
I don't like mushrooms, they taste horrible
0
I love mushrooms, they are great
4


In [1]:
# Apply to tweets in the database.

from brexittweets.config import sqlite_db_path
from brexittweets.custom_funcs.sqlite_functions import size_of_database, get_data
import sqlite3

db = sqlite3.connect(sqlite_db_path)
cursor = db.cursor()

# Get length of the db
dbLength = size_of_database(cursor)

# Apply sentiment analysis to Tweets in database
for index in range(1, dbLength + 1):
    # Get tweet from the db
    tweet = get_data('text', index, cursor)
    tweet_sentiment = classify_tweet(tweet)
    # Pass values as query parameters rather than formatting them into the SQL string
    cursor.execute("UPDATE tweets SET sentiment = ? WHERE id = ?;", (tweet_sentiment, index))
    db.commit()
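
The update loop can be tried out without the project database. The sketch below uses an in-memory SQLite database with a `tweets` table assumed to have `id`, `text` and `sentiment` columns, and a stand-in `fake_classify` function in place of `classify_tweet`; both are assumptions for illustration. It batches the updates with `executemany` and commits once.

```python
import sqlite3


def fake_classify(text: str) -> int:
    """Stand-in for classify_tweet (assumption): 4 if 'great' appears, else 0."""
    return 4 if 'great' in text.lower() else 0


# In-memory database with an assumed schema, for illustration only
demo_db = sqlite3.connect(':memory:')
demo_cursor = demo_db.cursor()
demo_cursor.execute('CREATE TABLE tweets (id INTEGER PRIMARY KEY, text TEXT, sentiment REAL)')
demo_cursor.executemany('INSERT INTO tweets (text) VALUES (?)',
                        [('This is a rubbish tweet!',), ('This is a great tweet!',)])

# Classify every tweet, then write the labels back with parameterised updates
rows = demo_cursor.execute('SELECT id, text FROM tweets').fetchall()
updates = [(fake_classify(text), tweet_id) for tweet_id, text in rows]
demo_cursor.executemany('UPDATE tweets SET sentiment = ? WHERE id = ?', updates)
demo_db.commit()

results = demo_cursor.execute('SELECT id, sentiment FROM tweets ORDER BY id').fetchall()
print(results)
demo_db.close()
```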

Grab a few Tweets to see how they were classified.

In [2]:
from brexittweets.custom_funcs.sqlite_functions import check_sentiment_analysis

test_df = check_sentiment_analysis(cursor, 'custom')
test_df


Unnamed: 0,id,text,sentiment_TextBlob
0,1,Ben Mellor want the Tories out He rocks up su...,0.0
1,2,Richard Ayoade doesnt even sound like he belie...,0.0
2,3,If sht was chocolate no body would starve,0.0
3,4,Let me guess you also voted brexit ?,4.0
4,5,Join our webinar with speakers from Hogan Love...,0.0


In [3]:
# Close the database
db.close()

## References

Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment 
classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.