# Twitter Sentiment Analysis -- Classification and Live Plotting

![](images/live-twitter-sentiment.gif)

# Introduction and Problem Statement

Our project aims to perform high-level sentiment analysis on Twitter data. If successful, our work will allow us to gain real-time insight into how Twitter users feel about any given topic and how that feeling changes over time. Tweets are an important and widely used form of communication, providing not only entertainment but also useful news updates. They are also an interesting source of data by nature: each is constrained by the 140-character limit yet still exhibits the full variability of human language. Traditional methods for sentiment analysis and topic modelling may not perform as well on tweets because a tweet carries much less information than a full-length document. This is interesting as a Big Data project because of the large number of new tweets posted daily (and other types of social media posts, were this work extended). A model, once trained, could be used to classify live social media data and to measure trends as they occur, which would be useful to anyone interested in understanding such trends.

# Data Description

We will develop our models using static data and, after establishing a working pipeline, integrate with `tweepy` to stream, classify, and visualize live tweets.

Our data comes from SemEval-2017 Task 4: Sentiment Analysis in Twitter. This task publishes data for training Twitter sentiment analysis models and hosts a popular competition, with about 50 teams competing in 2017. The dataset is fairly general and varied in terms of topics and was created with the intent of furthering work in the area of sentiment analysis. It contains 16,041 tweets.

# Cleaning and Embedding

We begin by making our sentiment analysis classification a binary task. That is, we classify tweets as either positive or negative. Therefore, we remove tweets with the 'neutral' label from our dataset. The filtered dataset contains 9,201 tweets, of which roughly 75% are labelled positive and the remaining 25% are labelled negative.

In [None]:
import pickle
import pandas as pd
from sklearn.model_selection import train_test_split

path_to_data = 'result.txt'
all_tweets_df = pd.read_table(path_to_data, names = ['ID', 'sentiment', 'tweet'])

# remove neutral tweets (for now)
all_tweets_df = all_tweets_df[all_tweets_df['sentiment'] != 'neutral']

# convert sentiment labels to binary (1 = positive, 0 = negative)
all_tweets_df['sentiment'] = (all_tweets_df['sentiment'] == 'positive').astype(int)

# split features and labels
X = all_tweets_df['tweet']
y = all_tweets_df['sentiment']

# separate the training and test data
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    shuffle=True)

# X_train and X_test are one-dimensional Series of raw tweets, so report n only
print('Training data size n = {}'.format(X_train.shape[0]))
print('Test data size n = {}'.format(X_test.shape[0]))

## Text Processing: Punctuation, Stopwords, Lemmatization, Bag of Words, TF-IDF

Our preliminary text processing for all tweets includes removing stopwords and accents and leveraging lemmatization.

We decided to keep the punctuation present in the original tweets. During initial exploration of the dataset, we noticed several examples of punctuation conveying meaning that could affect the overall sentiment (e.g. `'!'`, `'...'`, `':)'`, `':('`, `'>>'`). Therefore, punctuation is included when we process each tweet.

We then filter out common stopwords from our `CountVectorizer` vocabulary using its built-in list of English stopwords (`stop_words='english'`). This is a rather simple way to remove the most commonly occurring stopwords. Additionally, we remove accents from the text using `CountVectorizer`'s `strip_accents` parameter.

We also use `NLTK`'s `WordNetLemmatizer` to lemmatize each tweet. The goal of lemmatization is to transform each term back into its base form. For example, given the terms 'walks', 'walking', and 'walked', a lemmatizer that knows they are verbs would convert all of them to 'walk'. The idea is to represent terms that have different inflectional endings but the same meaning as a single base term. The lemmatizer we use takes an optional part-of-speech parameter. Since we have chosen not to provide it, the default is 'noun', meaning the lemmatizer treats every term as a noun: plural nouns are reduced to their singular form (e.g. 'cars' becomes 'car'), while verb forms such as 'walking' are generally left unchanged.

In terms of our document matrix, we first use the term frequencies provided by `CountVectorizer`. In this approach, also known as bag-of-words, each row corresponds to a document (tweet) and each column represents a unique term. Each element in the matrix is the count of that term in that particular document. While this bag-of-words model gives us some insight into which terms are important in a given document, it has no means of comparing term frequencies across documents. For example, a term may occur frequently in a given document and therefore appear important, yet also occur frequently in most of the other documents, in which case it does little to distinguish them.
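
As a minimal, dependency-free sketch of the document-term matrix described above: the three toy tweets and the whitespace tokenizer are assumptions made for illustration, and `CountVectorizer`'s own tokenization and column ordering may differ.

```python
from collections import Counter

# Toy documents (hypothetical, not taken from the SemEval data).
docs = ["good movie !", "bad movie", "good good day"]

# Naive whitespace tokenization, assumed here for simplicity.
tokenized = [doc.split() for doc in docs]

# Vocabulary: one column per unique term, in sorted order.
vocab = sorted({term for tokens in tokenized for term in tokens})

# Document-term matrix: one row per document, one count per vocabulary term.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[term] for term in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Note that 'movie' has a nonzero count in two of the three rows, which is exactly the kind of cross-document information that raw counts alone do not account for.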

To address this, we can apply the TF-IDF transformation to the bag-of-words model. The term frequency measures the local importance of a term (the standard bag-of-words count), while the inverse document frequency is $\log\left(\frac{a}{b}\right)$, where $a$ is the total number of documents and $b$ is the number of documents containing term $t$. Terms that occur often within a given document but rarely in other documents therefore receive higher values. The product of the term frequency and the inverse document frequency makes up each element of the document matrix.
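
The weighting can be sketched in a few lines of plain Python. This follows the textbook formula given above on a hypothetical toy corpus; `sklearn`'s `TfidfTransformer` uses a smoothed IDF and normalizes each row by default, so its values will differ from the ones computed here.

```python
import math

# Toy tokenized documents (hypothetical).
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "day"]]

def idf(term, documents):
    """Textbook IDF: log(a / b), with a = number of documents and
    b = number of documents containing the term (assumed to be > 0)."""
    n_containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / n_containing)

def tf_idf(term, doc, documents):
    """Raw count of the term in `doc` multiplied by the term's IDF."""
    return doc.count(term) * idf(term, documents)

# 'movie' appears in 2 of 3 documents and 'day' in only 1,
# so 'day' receives the larger weight despite the same count.
print(tf_idf("movie", docs[0], docs))
print(tf_idf("day", docs[2], docs))
```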

Further text processing and embedding approaches are discussed below in the hyperparameter tuning section.

In [0]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer

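class LemmaTokenizer(object):
    """Callable tokenizer for CountVectorizer: tokenize, then lemmatize each token."""
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        # no part of speech is passed, so each token is lemmatized as a noun
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]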

# Training Classification Models

We could use several different types of models to classify our Twitter data as positive/negative (along with the different embedding strategies discussed above). Here we'll try the following classifiers: Multinomial Naive Bayes Classifiers, Logistic Regression Classifiers, Stochastic Gradient Descent Classifiers, and Support Vector Classifiers. We use `sklearn` to build several pipelines that allow us to chain different combinations of embedding strategies and classification models. After building them here, we'll use cross-validation to compare their performance and find the best combinations of embedding strategies and classification models.

In [0]:
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# embedding methods
embedding_tuple = ('vect', CountVectorizer(tokenizer=LemmaTokenizer(),
                             stop_words='english',
                             strip_accents='unicode'))
tfidf_tuple = ('tfidf', TfidfTransformer())

# classifiers
mnb_tuple = ('clf', MultinomialNB())
svm_tuple = ('clf', SVC())
logreg_tuple = ('clf', LogisticRegression())
sgd_tuple = ('clf', SGDClassifier())

### Classifier Pipelines
# Multinomial Naive Bayes
tfidf__mnb_classifier = Pipeline([embedding_tuple, tfidf_tuple, mnb_tuple])
bow__mnb_classifier = Pipeline([embedding_tuple, mnb_tuple])

# SVC (linear and rbf)
tfidf__svm_classifier = Pipeline([embedding_tuple, tfidf_tuple, svm_tuple])
bow__svm_classifier = Pipeline([embedding_tuple, svm_tuple])

# Logistic regression
tfidf__logreg_classifier = Pipeline([embedding_tuple, tfidf_tuple, logreg_tuple])
bow__logreg_classifier = Pipeline([embedding_tuple, logreg_tuple])

# Stochastic gradient descent
tfidf__sgd_classifier = Pipeline([embedding_tuple, tfidf_tuple, sgd_tuple])
bow__sgd_classifier = Pipeline([embedding_tuple, sgd_tuple])

## Parameter Tuning via Cross-Validation

With our model pipelines in place, we next use cross-validation to find the optimal parameters and embedding strategy for each. 

We used `sklearn`'s `GridSearchCV` to find the optimal parameters for each classifier, given a dictionary of parameters with corresponding candidate values, using 3-fold cross-validation. Each of the parameter tuning choices is discussed below:

Text Processing (for all classifiers)
+ `max_df`: Removes terms that appear too frequently (e.g. `max_df = 0.5` means that we ignore terms that appear in more than 50% of the documents). We decided to tune with two values: 0.5 and 0.75.
+ `max_features`: Builds the model's vocabulary from only the `max_features` most frequent terms. We decided to tune with three values: 2000, 5000, and None (no restriction on the number of features included in the vocabulary).
+ `ngram_range`: Specifies the n-gram combinations included in the model's vocabulary. We decided to tune with three values: (1,1) for unigrams only, (1,2) for unigrams and bigrams, and (2,2) for bigrams only. We did not include longer n-grams such as trigrams because we felt they would capture phrases unique to a specific tweet that do not generalize to other tweets.

Multinomial Naive Bayes and Stochastic Gradient Descent:
+ No hyperparameter tuning for these models

Logistic Regression

+ C: Inversely proportional to regularization strength (smaller values indicate stronger regularization). Regularization adds a penalty on large coefficients, which discourages overfitting to our training data. We decided to tune with the following values: 0.01, 0.1, 1, 10, 100.

SVMs
+ C: As in logistic regression, the C parameter controls the trade-off between a simpler model and fitting the training data closely. For large values of C, the optimization will choose a smaller-margin hyperplane (if that leads to better training classification performance). A very small value of C will cause the optimization to look for a larger-margin separating hyperplane, even if that leads to greater misclassification of training data. We decided to tune with the following values: 0.01, 0.1, 1, 10, 100.
+ Gamma: The gamma parameter of the rbf kernel, $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$, describes how far the influence of a single training example reaches: low values mean the influence reaches far, and high values mean it is limited to nearby points. Gamma has no effect when the linear kernel is used. We decided to tune with the following values: 0.01, 0.1, 1, 10, 100.
+ Kernel: We decided to tune with two kernels: linear (no kernel) and the rbf (radial basis function) kernel, which is the default option for `sklearn` SVM classifiers.
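
To get a rough sense of the cost of these searches, the following sketch counts the grid points implied by the candidate values listed above. It assumes an exhaustive grid (as `GridSearchCV` performs) and counts one fit per grid point per fold; the helper name `n_combinations` is hypothetical.

```python
from itertools import product

# Candidate values copied from the lists above.
text_grid = {
    "max_df": (0.5, 0.75),
    "max_features": (None, 2000, 5000),
    "ngram_range": ((1, 1), (1, 2), (2, 2)),
}
c_values = (0.01, 0.1, 1, 10, 100)
gamma_values = (0.01, 0.1, 1, 10, 100)
kernels = ("linear", "rbf")

def n_combinations(*value_lists):
    """Number of points in the Cartesian product of the given value lists."""
    return sum(1 for _ in product(*value_lists))

n_text = n_combinations(*text_grid.values())  # MNB and SGD: text parameters only
n_logreg = n_combinations(*text_grid.values(), c_values)
n_svm = n_combinations(*text_grid.values(), c_values, gamma_values, kernels)

folds = 3
for name, n in [("text only", n_text), ("logistic regression", n_logreg), ("SVM", n_svm)]:
    print(f"{name}: {n} combinations, {n * folds} fits with {folds}-fold CV")
```

The SVM grid is by far the largest, which is worth keeping in mind when choosing how many values to try for each parameter.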


In [0]:
from sklearn.model_selection import GridSearchCV

# First specify parameter dictionaries for the various classifiers

# text-processing parameters, tuned for every classifier
# (Multinomial Naive Bayes and SGD tune only these)
parameters = {
    'vect__max_df': (0.5, 0.75),
    'vect__max_features': (None, 2000, 5000),
    'vect__ngram_range': ((1, 1), (1, 2), (2, 2)),  # unigrams, unigrams + bigrams, bigrams only
}

# logistic regression: text-processing parameters plus C
parameters_lr = {
    **parameters,
    'clf__C': (0.01, 0.1, 1, 10, 100),
}

# SVM: text-processing parameters plus kernel, C, and gamma
parameters_svm = {
    **parameters,
    'clf__kernel': ('linear', 'rbf'),
    'clf__C': (0.01, 0.1, 1, 10, 100),
    'clf__gamma': (0.01, 0.1, 1, 10, 100),
}

# Pair each pipeline with the parameter dictionary that matches its classifier
searches = [
    ('Multinomial Bayes with TF-IDF', tfidf__mnb_classifier, parameters),
    ('Multinomial Bayes with Bag of Words', bow__mnb_classifier, parameters),
    ('SVM with TF_IDF', tfidf__svm_classifier, parameters_svm),
    ('SVM with Bag of Words', bow__svm_classifier, parameters_svm),
    ('Logistic Regression with TF_IDF', tfidf__logreg_classifier, parameters_lr),
    ('Logistic Regression with Bag of Words', bow__logreg_classifier, parameters_lr),
    ('SGD with TF_IDF', tfidf__sgd_classifier, parameters),
    ('SGD with Bag of Words', bow__sgd_classifier, parameters),
]

# Use Grid Search CV (3-fold) to find optimal model parameters for each classifier
for label, classifier, param_grid in searches:
    print('Tuning ' + label)
    grid_search = GridSearchCV(classifier, param_grid, cv=3, n_jobs=-1, verbose=0)
    grid_search.fit(X_train, y_train)
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(param_grid.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))


## Fitting Models with Optimal Parameters
We will now use the parameters computed from above to train our classifiers. We will also predict on the test data to assess the performance of our classifiers.

All the classifier performances are roughly between 80-85% for accuracy/precision and 90-95% for recall. Our classification performance is fairly high and consistent for each classifier. Furthermore, even though we have an imbalanced class situation with approx. 75% positive tweets, our classifiers go beyond choosing the majority class.
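
To put these numbers in context, the sketch below computes the same three metrics for a baseline that always predicts the majority class. The labels are synthetic, constructed only to match the approximate 75/25 class balance of our filtered data.

```python
# Synthetic labels with a 75/25 positive/negative split (1 = positive).
y_true = [1] * 75 + [0] * 25

# Baseline: always predict the majority (positive) class.
y_pred = [1] * len(y_true)

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)
```

Such a baseline reaches 75% accuracy and precision with 100% recall, so accuracy and precision above 75% together with recall below 100% indicate that a classifier is predicting at least some negative tweets correctly rather than labelling everything positive.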

In [0]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

tfidf_tuple = ('tfidf', TfidfTransformer())
mnb_tuple = ('clf', MultinomialNB())
sgd_tuple = ('clf', SGDClassifier())

### Classifier Pipelines
# Multinomial Naive Bayes
tfidf_mnb_classifier = Pipeline([('vect', CountVectorizer(tokenizer=LemmaTokenizer(), stop_words='english', strip_accents='unicode', max_df=0.5, max_features=2000, ngram_range=(1, 2))), tfidf_tuple, mnb_tuple])

bow_mnb_classifier = Pipeline([('vect', CountVectorizer(tokenizer=LemmaTokenizer(), stop_words='english', strip_accents='unicode', max_df=0.5, max_features=5000, ngram_range=(1, 2))), mnb_tuple])

# Logistic regression
tfidf_logreg_classifier = Pipeline([('vect', CountVectorizer(tokenizer=LemmaTokenizer(), stop_words='english', strip_accents='unicode', max_df=0.5, max_features=None, ngram_range=(1, 2))), tfidf_tuple, ('clf', LogisticRegression(C=100))])

bow_logreg_classifier = Pipeline([('vect', CountVectorizer(tokenizer=LemmaTokenizer(), stop_words='english', strip_accents='unicode', max_df=0.5, max_features=5000, ngram_range=(1, 2))),('clf', LogisticRegression(C=1))])

# Stochastic gradient descent
tfidf_sgd_classifier = Pipeline([('vect', CountVectorizer(tokenizer=LemmaTokenizer(), stop_words='english', strip_accents='unicode', max_df=0.75, max_features=None, ngram_range=(1, 2))), tfidf_tuple, sgd_tuple])
bow_sgd_classifier = Pipeline([('vect', CountVectorizer(tokenizer=LemmaTokenizer(), stop_words='english', strip_accents='unicode', max_df=0.5, max_features=None, ngram_range=(1, 2))), sgd_tuple])

# Support Vector Machines
tfidf_svm_classifier = Pipeline([('vect', CountVectorizer(tokenizer=LemmaTokenizer(), stop_words='english', strip_accents='unicode', max_df=0.5, max_features=None, ngram_range=(1, 2))), tfidf_tuple, ('clf', SVC(C=10, gamma=0.01, kernel='linear'))])
bow_svm_classifier = Pipeline([('vect', CountVectorizer(tokenizer=LemmaTokenizer(), stop_words='english', strip_accents='unicode', max_df=0.5, max_features=None, ngram_range=(1, 1))), ('clf', SVC(C=10, gamma=0.01, kernel='rbf'))])

# Fit each of the classifiers and evaluate them on the test data
print('Fitting Multinomial Bayes with Tf-idf')
tfidf_mnb_classifier.fit(X_train, y_train)
pred = tfidf_mnb_classifier.predict(X_test)
print('here is the mnb tf-idf accuracy, precision, recall: ', [accuracy_score(y_test, pred), precision_score(y_test, pred), recall_score(y_test, pred)])

print('Fitting Multinomial Bayes with bow')
bow_mnb_classifier.fit(X_train, y_train)
pred = bow_mnb_classifier.predict(X_test)
print('here is the mnb bow accuracy, precision, recall: ', [accuracy_score(y_test, pred), precision_score(y_test, pred), recall_score(y_test, pred)])

print('Fitting Logistic Regression with tf-idf')
tfidf_logreg_classifier.fit(X_train, y_train)
pred = tfidf_logreg_classifier.predict(X_test)
print('here is the log reg tf-idf accuracy, precision, recall: ', [accuracy_score(y_test, pred), precision_score(y_test, pred), recall_score(y_test, pred)])

print('Fitting Logistic Regression with bow')
bow_logreg_classifier.fit(X_train, y_train)
pred = bow_logreg_classifier.predict(X_test)
print('here is the log reg bow accuracy, precision, recall: ', [accuracy_score(y_test, pred), precision_score(y_test, pred), recall_score(y_test, pred)])

print('Fitting SGD with tf-idf')
tfidf_sgd_classifier.fit(X_train, y_train)
pred = tfidf_sgd_classifier.predict(X_test)
print('here is the sgd tf-idf accuracy, precision, recall: ', [accuracy_score(y_test, pred), precision_score(y_test, pred), recall_score(y_test, pred)])

print('Fitting SGD with bow')
bow_sgd_classifier.fit(X_train, y_train)
pred = bow_sgd_classifier.predict(X_test)
print('here is the sgd bow accuracy, precision, recall: ', [accuracy_score(y_test, pred), precision_score(y_test, pred), recall_score(y_test, pred)])

print('Fitting SVM with tf-idf')
tfidf_svm_classifier.fit(X_train, y_train)
pred = tfidf_svm_classifier.predict(X_test)
print('here is the svm tf-idf accuracy, precision, recall: ', [accuracy_score(y_test, pred), precision_score(y_test, pred), recall_score(y_test, pred)])

print('Fitting SVM with bow')
bow_svm_classifier.fit(X_train, y_train)
pred = bow_svm_classifier.predict(X_test)
print('here is the svm bow accuracy, precision, recall: ', [accuracy_score(y_test, pred), precision_score(y_test, pred), recall_score(y_test, pred)])

# dictionary that contains model name as key and fitted model as value
models = {'tfidf__mnb_classifier': tfidf_mnb_classifier, 'bow__mnb_classifier': bow_mnb_classifier, 'tfidf__logreg_classifier': tfidf_logreg_classifier,'bow__logreg_classifier': bow_logreg_classifier, 'tfidf_sgd_classifier': tfidf_sgd_classifier, 'bow__sgd_classifier': bow_sgd_classifier,'tfidf__svm_classifier': tfidf_svm_classifier, 'bow__svm_classifier': bow_svm_classifier}

## Pickling the Best Models

Training our models takes a little while each time. We don't want to repeat this process every time we want to use them. Thus we'll pickle our models for use whenever we need them (such as in the next section).

In [0]:
for name, model in models.items():
    # assumes the pickled_classifiers/ directory already exists
    with open('pickled_classifiers/' + name + '.pickle', 'wb') as f:
        pickle.dump(model, f)

## Using a VoteClassifier

With all of our models tuned appropriately via cross-validation, we can get a classification score along with some measure of confidence by aggregating the predictions of each model and doing a simple 'majority rule' vote. We first implement the `VoteClassifier` class to take a group of classification models and create a majority vote method (`classify()`) as well as a method to get a rough confidence for the given prediction -- simply the proportion of the group that agrees (`confidence()`).

In [0]:
from nltk.classify import ClassifierI

class VoteClassifier(ClassifierI):
    def __init__(self, classifiers):
        self._classifiers = classifiers
        
    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.predict(features)[0]
            votes.append(v)
        mode = max(set(votes), key=votes.count)
        return mode
    
    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.predict(features)[0]
            votes.append(v)
        mode = max(set(votes), key=votes.count)
        choice_votes = votes.count(mode)
        conf = choice_votes / len(votes)
        return conf
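
The voting logic can be checked without any trained models. The sketch below uses hypothetical stand-in classifiers (`ConstantClassifier`) and a standalone `vote()` function that mirrors `classify()` and `confidence()`; neither name is part of our pipeline.

```python
class ConstantClassifier:
    """Hypothetical stand-in for a fitted pipeline: always predicts one label."""
    def __init__(self, label):
        self.label = label

    def predict(self, features):
        return [self.label for _ in features]

def vote(classifiers, features):
    """Return (majority label, share of classifiers that agree with it)."""
    votes = [c.predict(features)[0] for c in classifiers]
    mode = max(set(votes), key=votes.count)
    return mode, votes.count(mode) / len(votes)

# Three of four stand-ins vote positive (1), one votes negative (0).
ensemble = [ConstantClassifier(1), ConstantClassifier(1),
            ConstantClassifier(1), ConstantClassifier(0)]
label, conf = vote(ensemble, ["some tweet"])
print(label, conf)
```

With an even number of classifiers a tie is possible. In that case the confidence is 0.5 and the winning label depends on the iteration order of the set of votes, so an odd number of models may be preferable.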

# Streaming Twitter Data

With a way to classify a tweet as positive/negative along with a rough confidence in our classification, we can apply the models we've trained and tuned to classifying data streaming live from Twitter. To achieve this, we use the `tweepy` library. Here we implement a `TwitterListener` class to stream and classify tweets, writing the classification along with the confidence to a text file.

In [None]:
import json
import os
import pickle

# note: StreamListener is part of tweepy versions before 4.0
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener

# read Twitter API credentials from environment variables rather than hardcoding them
CONSUMER_KEY = os.environ['TWITTER_CONSUMER_KEY']
CONSUMER_SECRET = os.environ['TWITTER_CONSUMER_SECRET']
ACCESS_TOKEN = os.environ['TWITTER_ACCESS_TOKEN']
ACCESS_SECRET = os.environ['TWITTER_ACCESS_SECRET']

# load pickled models and instantiate VoteClassifier
models = []
model_names = ['tfidf__mnb_classifier', 'bow__mnb_classifier', 'tfidf__logreg_classifier','bow__logreg_classifier', 'tfidf_sgd_classifier', 'bow__sgd_classifier', 'tfidf__svm_classifier', 'bow__svm_classifier']

for name in model_names:
    with open('pickled_classifiers/' + name + '.pickle', 'rb') as f:
        models.append(pickle.load(f))

VOTECLASSIFIER = VoteClassifier(models)

class TwitterListener(StreamListener):
    def on_data(self, data):
        try:
            json_data = json.loads(data)
            tweet = json_data['text']
            sentiment = 'positive' if VOTECLASSIFIER.classify([tweet]) else 'negative'
            confidence = VOTECLASSIFIER.confidence([tweet])
            print(sentiment, confidence)

            with open('twitter-out.txt', 'a') as output:
                output.write(str(sentiment) + ',' + str(confidence))
                output.write('\n')
        except Exception as exception:
            print(exception)
            pass

        return True

    def on_error(self, status):
        print(status)
        pass

# authenticate
auth = OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

# Live Classification

With our `TwitterListener` in place and authenticated to connect to Twitter, we're ready to predict whether each incoming tweet is positive or negative. The Twitter API allows us to filter tweets by language and topic. Here we'll consider only tweets in English. Filtering by topic gives us a sense of how the public (or at least the subsection of the public with Twitter accounts) feels about a given topic. For this example, we'll see how the English-speaking Twitter population feels about Donald Trump.

In [0]:
# create streaming object
topics_to_track = ['donald trump']
twitterStream = Stream(auth, TwitterListener())
twitterStream.filter(languages=['en'], track=topics_to_track)

# Live Visualization

The last step in answering our question is to plot our live results. We use `matplotlib`'s `pyplot` and `animation` modules to read the text file containing our live results and display three plots: a bar chart of positive and negative tweet counts, a bar chart of the mean confidence associated with each class, and a rolling average of sentiment over the most recent tweets.

In [0]:
import time

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib import style

style.use("ggplot")

fig = plt.figure()
subplot_1 = fig.add_subplot(2,2,1)
subplot_2 = fig.add_subplot(2,2,2)
subplot_3 = fig.add_subplot(2,1,2)

def animate(i):
    # re-read the whole results file on every frame and close it afterwards
    with open("twitter-out.txt", "r") as results_file:
        lines = results_file.read().split('\n')

    negative_count = 0
    positive_count = 0
    neg_confidence_list = []
    pos_confidence_list = []

    all_sentiments_list = []

    for line in lines:
        line = line.split(',')
        if len(line) > 1:

            if line[0] == 'positive':
                positive_count += 1
                pos_confidence_list.append(float(line[1]))
                all_sentiments_list.append(1)
            elif line[0] == 'negative':
                negative_count += 1
                neg_confidence_list.append(float(line[1]))
                all_sentiments_list.append(0)
            else:
                continue


    # pos/neg plot
    bar_heights = [negative_count, positive_count]
    subplot_1.clear()
    subplot_1.bar([1,2], bar_heights)
    subplot_1.set_xticks([1, 2])
    subplot_1.set_xticklabels(('NEG', 'POS'))
    subplot_1.set_title('Pos/Neg Counts')

    # confidence plot
    neg_mean_confidence = 0
    pos_mean_confidence = 0
    if len(neg_confidence_list) > 0:
        neg_mean_confidence = sum(neg_confidence_list) * 1.0 / len(neg_confidence_list)
    if len(pos_confidence_list) > 0:
        pos_mean_confidence = sum(pos_confidence_list) * 1.0 / len(pos_confidence_list)
    bar_heights = [neg_mean_confidence, pos_mean_confidence]
    subplot_2.clear()
    subplot_2.bar([1,2], bar_heights)
    subplot_2.set_xticks([1, 2])
    subplot_2.set_xticklabels(('NEG', 'POS'))
    subplot_2.set_title('Mean Confidence')

    # moving average plot over a window of (up to) 200 values
    subplot_3.clear()
    if all_sentiments_list:  # np.convolve raises an error on empty input
        lookback_number = min(len(all_sentiments_list), 200)
        average_array = np.ones((lookback_number,)) / lookback_number
        moving_averages = np.convolve(all_sentiments_list, average_array, mode='valid')
        # look at the most recent 5000 averages only
        moving_averages = moving_averages[-5000:]
        subplot_3.plot(moving_averages)
        subplot_3.hlines(0.5, xmin=0, xmax=len(moving_averages))
    subplot_3.set_title('Rolling 200-Tweet Average Sentiment')
    subplot_3.set_ylabel('Positivity (0 - 1)')

ani = animation.FuncAnimation(fig, animate, interval=1000)
plt.show()
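
The rolling average in the third subplot can be illustrated without `numpy`. The function below is a dependency-free sketch that we believe matches what `np.convolve(..., mode='valid')` computes with a uniform window; the name `moving_average` is hypothetical.

```python
def moving_average(values, window):
    """Mean of each full window of `window` consecutive values.

    Only windows that fit entirely inside `values` are used, so the
    result has len(values) - window + 1 entries.
    """
    if window <= 0 or window > len(values):
        return []
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# 1 = positive tweet, 0 = negative tweet
sentiments = [1, 0, 1, 1, 0, 0]
print(moving_average(sentiments, 2))
```

One consequence of using only full windows is that, while fewer than 200 tweets have been collected, the window equals the number of tweets seen so far and the plot contains a single point, namely the overall mean.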

# Future Work

There are several ways to extend this work. Some areas we would like to explore are:

1. Doing predictions for more classes than positive/negative (binary classification)
2. Using incoming tweets to adapt the current model in addition to classifying them
3. Further exploring text processing with emojis and punctuation (e.g. including only punctuation that potentially captures meaning and filtering out the rest)

# References

The following research papers were helpful in understanding current work on text processing and sentiment classification techniques:
+ Apoorv Agarwal et al. -- Sentiment Analysis of Twitter Data
+ Apoorv Agarwal, Jasneet Singh Sabharwal -- End-to-End Sentiment Analysis of Twitter Data
+ Pang, Lee, and Vaithyanathan -- Thumbs up? Sentiment classification using machine learning techniques

The following resources/tutorials were leveraged in constructing our model pipelines and VoteClassifier:
+ http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html
+ https://pythonprogramming.net/combine-classifier-algorithms-nltk-tutorial/
