# Statistical Natural Language Processing course PROJECT
### Group S: Mateusz Gierlach

The main goal of this project is to build a system for sentiment analysis of tweets that express political opinions or react to recent political events. The system classifies each tweet as positive or negative.

The main problem I encountered was the lack of a sentiment-annotated dataset of political tweets. I therefore build and evaluate models on well-known datasets that carry sentiment labels, then run them on a previously unseen set of political tweets and assess whether such a system gives satisfactory results on data drawn from a different distribution.

The project has three parts. In part 1, I use IMDB movie reviews to build a word2vec model and a Random Forest classifier. In part 2, I do the same with the Sentiment140 tweet dataset. In part 3, I run the models from both parts on political tweets to see how well they perform, and then draw conclusions from the experiments.

## Part 1: word2vec + RF classification on IMDB reviews data

First, let's build a general word2vec embedding model on the widely used IMDB reviews dataset.

Importing the libraries:

In [20]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup 
import re
from nltk.corpus import stopwords
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
from gensim.models import word2vec
from sklearn.ensemble import RandomForestClassifier

Loading in the IMDB datasets for training and testing:

In [3]:
train = pd.read_csv("imdb/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
test = pd.read_csv("imdb/testData.tsv",header=0, delimiter="\t", quoting=3)

Downloading and loading tokenizers from NLTK library:

In [4]:
import nltk.data
nltk.download('popular')
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /Users/mg/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to
[nltk_data]    |     /Users/mg/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to /Users/mg/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to
[nltk_data]    |     /Users/mg/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to
[nltk_data]    |     /Users/mg/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /Users/mg/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to /Users/

Function that converts a review to a list of words. It strips HTML tags, replaces non-letter characters with spaces, lower-cases the text, and optionally removes stopwords:

In [6]:
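def review_wordlist(review, remove_stopwords=False):
    # Strip HTML markup; the parser is named explicitly so the result
    # does not depend on which parsers happen to be installed
    review_text = BeautifulSoup(review, "html.parser").get_text()
    # Replace everything except letters with spaces
    review_text = re.sub("[^a-zA-Z]", " ", review_text)
    words = review_text.lower().split()
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]
    return words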
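
As a rough illustration of what this cleaning step produces, here is a simplified sketch that uses only the standard library. It is not the function above: a regular expression stands in for BeautifulSoup's tag stripping, and `TOY_STOPWORDS` is a made-up stand-in for the NLTK stopword list.

```python
import re

# Hypothetical, tiny stand-in for stopwords.words("english")
TOY_STOPWORDS = {"this", "is", "a", "the", "it"}

def toy_wordlist(text, remove_stopwords=False):
    # Crude tag removal (assumption: simple, well-formed tags only)
    text = re.sub(r"<[^>]+>", " ", text)
    # Keep letters only, with the same pattern as review_wordlist
    text = re.sub("[^a-zA-Z]", " ", text)
    words = text.lower().split()
    if remove_stopwords:
        words = [w for w in words if w not in TOY_STOPWORDS]
    return words

sample = "This movie is GREAT!<br />I'd watch it 10 times."
print(toy_wordlist(sample))
print(toy_wordlist(sample, remove_stopwords=True))
```

Two side effects of the letters-only pattern are visible here: the contraction "I'd" is split into two tokens, and the number disappears entirely.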

Function for converting a review to sentences, using the downloaded tokenizer:

In [8]:
def review_sentences(review, tokenizer, remove_stopwords=False):
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = []
    for raw_sentence in raw_sentences:
        if len(raw_sentence)>0:
            sentences.append(review_wordlist(raw_sentence, remove_stopwords))
    return sentences

Using the above functions to parse all the reviews from training dataset:

In [9]:
sentences = []
for review in train["review"]:
    sentences += review_sentences(review, tokenizer)



Initializing and training a word2vec model with the gensim library. The parameters are set at the top, and the trained model is saved for later use.

In [12]:
num_features = 300
min_word_count = 40
num_workers = 4
context = 10
downsampling = 1e-3

model1 = word2vec.Word2Vec(sentences, workers=num_workers, size=num_features, min_count=min_word_count, window=context,
                          sample=downsampling)

model1.init_sims(replace=True)
model1_name = "imdb_model"
model1.save(model1_name)

2019-04-28 10:43:02,858 : INFO : collecting all words and their counts
2019-04-28 10:43:02,860 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-04-28 10:43:02,915 : INFO : PROGRESS: at sentence #10000, processed 225803 words, keeping 17776 word types
2019-04-28 10:43:02,994 : INFO : PROGRESS: at sentence #20000, processed 451892 words, keeping 24948 word types
2019-04-28 10:43:03,066 : INFO : PROGRESS: at sentence #30000, processed 671315 words, keeping 30034 word types
2019-04-28 10:43:03,157 : INFO : PROGRESS: at sentence #40000, processed 897815 words, keeping 34348 word types
2019-04-28 10:43:03,220 : INFO : PROGRESS: at sentence #50000, processed 1116963 words, keeping 37761 word types
2019-04-28 10:43:03,287 : INFO : PROGRESS: at sentence #60000, processed 1338404 words, keeping 40723 word types
2019-04-28 10:43:03,344 : INFO : PROGRESS: at sentence #70000, processed 1561580 words, keeping 43333 word types
2019-04-28 10:43:03,409 : INFO : PROGRESS: 

2019-04-28 10:43:30,755 : INFO : EPOCH 4 - PROGRESS: at 49.74% examples, 666198 words/s, in_qsize 7, out_qsize 0
2019-04-28 10:43:31,764 : INFO : EPOCH 4 - PROGRESS: at 63.04% examples, 633014 words/s, in_qsize 7, out_qsize 0
2019-04-28 10:43:32,771 : INFO : EPOCH 4 - PROGRESS: at 79.47% examples, 639183 words/s, in_qsize 7, out_qsize 1
2019-04-28 10:43:33,780 : INFO : EPOCH 4 - PROGRESS: at 93.81% examples, 628288 words/s, in_qsize 7, out_qsize 0
2019-04-28 10:43:34,146 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-04-28 10:43:34,148 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-04-28 10:43:34,160 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-04-28 10:43:34,167 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-04-28 10:43:34,167 : INFO : EPOCH - 4 : training on 5920724 raw words (4045400 effective words) took 6.4s, 628747 effective words/s
2019-04-28 10:43:35,181 : INFO : EPOCH 5 - PROG

Now let's move to building a classification model for sentiment.

First, we need functions that compute an averaged feature vector for each review. Sentiment is defined for the whole review, tweet, or document, so each text is represented by the average of the word vectors of the words it contains.

The functions below compute these averages using the word2vec model. Stopwords are removed only at this stage: word2vec can still learn semantics from them during training, but they are left out when averaging a review.

In [15]:
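def featureVecMethod(words, model, num_features):
    featureVec = np.zeros(num_features, dtype="float32")
    nwords = 0
    # Vocabulary as a set for fast membership tests
    index2word_set = set(model.wv.index2word)
    for word in words:
        if word in index2word_set:
            nwords += 1
            featureVec = np.add(featureVec, model.wv[word])
    # Average only if at least one word was in the vocabulary;
    # dividing by zero would fill the vector with NaN
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec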

In [16]:
def getAvgFeatureVecs(reviews, model, num_features):
    counter = 0
    reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32")
    for review in reviews:
        reviewFeatureVecs[counter] = featureVecMethod(review, model, num_features)
        counter = counter+1   
    return reviewFeatureVecs
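
The averaging idea can be checked on a toy example. The sketch below is self-contained and uses a hypothetical dictionary `toy_vectors` with made-up 3-dimensional values in place of a trained word2vec model; `average_vector` is an illustrative name, not part of the notebook.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; a real word2vec model has hundreds of dimensions
toy_vectors = {
    "great": np.array([1.0, 0.0, 2.0], dtype="float32"),
    "movie": np.array([0.0, 1.0, 0.0], dtype="float32"),
    "boring": np.array([-1.0, 0.0, -2.0], dtype="float32"),
}

def average_vector(words, vectors, num_features):
    # Sum the vectors of in-vocabulary words, then divide by how many were found
    total = np.zeros(num_features, dtype="float32")
    found = 0
    for word in words:
        if word in vectors:
            total += vectors[word]
            found += 1
    # A text with no known words keeps the zero vector instead of becoming NaN
    return total / found if found else total

print(average_vector(["great", "movie", "xyzzy"], toy_vectors, 3))
print(average_vector(["xyzzy"], toy_vectors, 3))
```

The unknown word "xyzzy" is skipped, so the first call averages over two vectors only. The second call shows why the guard matters: without it, a text made up entirely of out-of-vocabulary words would yield NaN features.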

Now, let's run the above methods to get average features for my training dataset. This is the preparation step before fitting the classification model.

In [18]:
clean_train_reviews = []
for review in train['review']:
    clean_train_reviews.append(review_wordlist(review, remove_stopwords=True)) 
trainDataVecs = getAvgFeatureVecs(clean_train_reviews, model1, num_features)

  


Doing the same thing for test dataset:

In [19]:
clean_test_reviews = []
for review in test["review"]:
    clean_test_reviews.append(review_wordlist(review,remove_stopwords=True))
testDataVecs = getAvgFeatureVecs(clean_test_reviews, model1, num_features)

  


Let's train the Random Forest classification model for sentiment prediction.

In [21]:
forest1 = RandomForestClassifier(n_estimators = 100)  
forest1 = forest1.fit(trainDataVecs, train["sentiment"])

Let's use the trained RF model to predict sentiment classes on the test set and look at how the first 15 test reviews were classified.

In [22]:
result1 = forest1.predict(testDataVecs)
output1 = pd.DataFrame(data = {"review": test["review"], "sentiment": result1})
output1.head(15)

| | review | sentiment |
|---|---|---|
| 0 | "Naturally in a film who's main themes are of ... | 1 |
| 1 | "This movie is a disaster within a disaster fi... | 0 |
| 2 | "All in all, this is a movie for kids. We saw ... | 1 |
| 3 | "Afraid of the Dark left me with the impressio... | 0 |
| 4 | "A very accurate depiction of small time mob l... | 1 |
| 5 | "...as valuable as King Tut's tomb! (OK, maybe... | 1 |
| 6 | "This has to be one of the biggest misfires ev... | 0 |
| 7 | "This is one of those movies I watched, and wo... | 0 |
| 8 | "The worst movie i've seen in years (and i've ... | 0 |
| 9 | "Five medical students (Kevin Bacon, David Lab... | 1 |


Saving the predicted outputs to a CSV file:

In [24]:
output1.to_csv("output_imdb.csv", index=False)

I do not validate the model at this stage. I use all the annotated data for training. Validation is performed in part 3 for political tweets, as this is the main point of the analysis.

## Part 2: word2vec + RF classification on Sentiment140 Twitter data

In this part, let's build a word2vec model and a classification model on the popular sentiment-annotated Sentiment140 dataset.

Importing needed libraries:

In [48]:
pd.options.mode.chained_assignment = None
from copy import deepcopy
from string import punctuation
from random import shuffle
import pickle
import h5py
import json
import matplotlib.pyplot as plt 
import gensim
from gensim.models.word2vec import Word2Vec
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")
from nltk.tokenize import TweetTokenizer
from nltk import word_tokenize
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

Loading training dataset from Sentiment140:

In [32]:
def ingest():
    # Note: with the default header setting, pandas reads the first line
    # of the file as column names before they are overwritten below
    data = pd.read_csv('sentiment140/tweetstrain.csv', encoding='latin-1')
    data.columns = ["Sentiment", "ItemID", "Date", "Blank", "SentimentSource", "SentimentText"]
    data.drop(['ItemID', 'SentimentSource'], axis=1, inplace=True)
    # Keep only rows that have a sentiment label
    data = data[data.Sentiment.notnull()]
    # Map the original labels to binary classes: 4 -> 1, 0 -> 0
    data['Sentiment'] = data['Sentiment'].map({4: 1, 0: 0})
    # Keep only rows that have tweet text
    data = data[data['SentimentText'].notnull()]
    data.reset_index(drop=True, inplace=True)
    return data

data = ingest()

Preprocessing the data before building the model. The functions below tokenize and filter the tweets and apply some additional processing.

In [33]:
tokenizer = TweetTokenizer()
def tokenize(tweet):
    try:
        tweet = tweet.lower()
        tokens = tokenizer.tokenize(tweet)
        # Drop mentions, hashtags and links
        tokens = [t for t in tokens if not t.startswith(('@', '#', 'http'))]
        return tokens
    except Exception:
        # Input that cannot be tokenized (e.g. a non-string value) is
        # marked with 'NC' so that postprocess() can drop the row
        return 'NC'
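
To see the effect of the filtering without installing NLTK, here is a minimal sketch. It assumes plain whitespace splitting as a stand-in for `TweetTokenizer`, so punctuation handling differs from the real function; `toy_tokenize` and the example tweet are made up for illustration.

```python
def toy_tokenize(tweet):
    # Whitespace splitting is a simplification; TweetTokenizer also
    # separates punctuation and keeps emoticons together
    tokens = tweet.lower().split()
    # Drop mentions, hashtags and links, mirroring the filtering in tokenize()
    return [t for t in tokens if not t.startswith(("@", "#", "http"))]

print(toy_tokenize("@alice Loving the new #budget plan http://example.com so much"))
```

Note that dropping hashtags also removes whatever topical or emotional cue the hashtag carried, which may matter for political tweets.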

In [34]:
def postprocess(data):
    data['tokens'] = data['SentimentText'].progress_map(tokenize)
    data = data[data.tokens != 'NC']
    data.reset_index(inplace=True)
    data.drop('index', inplace=True, axis=1)
    return data

In [35]:
data = postprocess(data)

progress-bar: 100%|██████████| 1599999/1599999 [01:39<00:00, 16098.14it/s]


Splitting the data into training and testing sets, before building a word2vec model. I use the first million examples.

In [36]:
x_train, x_test, y_train, y_test = train_test_split(np.array(data.head(1000000).tokens),
                                                np.array(data.head(1000000).Sentiment), test_size=0.2)

Function that tags each tweet with a unique label built from the set name and the tweet's position (for example `TRAIN_0` or `TEST_0`):

In [37]:
LabeledSentence = gensim.models.doc2vec.LabeledSentence

In [38]:
def labelizeTweets(tweets, label_type):
    labelized = []
    for i,v in tqdm(enumerate(tweets)):
        label = '%s_%s'%(label_type,i)
        labelized.append(LabeledSentence(v, [label]))
    return labelized

Let's use the above methods to label the data.

In [39]:
x_train = labelizeTweets(x_train, 'TRAIN')
x_test = labelizeTweets(x_test, 'TEST') 

  """
800000it [00:07, 103476.50it/s]
200000it [00:00, 234105.68it/s]


In [42]:
data_labellised= labelizeTweets(np.array(data.tokens), 'data')

  """
1599999it [00:15, 104106.12it/s]


Now, let's build the word2vec model on the labelled tweets. As before, I use the gensim package.

In [43]:
n=1000000
n_dim = 200
tweet_w2v = Word2Vec(size=n_dim, min_count=10)
tweet_w2v.build_vocab([x.words for x in tqdm(data_labellised)])
tweet_w2v.train([x.words for x in tqdm(data_labellised)], total_examples = tweet_w2v.corpus_count, 
                epochs = tweet_w2v.iter) 

100%|██████████| 1599999/1599999 [00:00<00:00, 2331007.57it/s]
2019-04-28 11:57:32,107 : INFO : collecting all words and their counts
2019-04-28 11:57:32,109 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-04-28 11:57:32,170 : INFO : PROGRESS: at sentence #10000, processed 151409 words, keeping 13541 word types
2019-04-28 11:57:32,236 : INFO : PROGRESS: at sentence #20000, processed 300464 words, keeping 20437 word types
2019-04-28 11:57:32,301 : INFO : PROGRESS: at sentence #30000, processed 448553 words, keeping 26065 word types
2019-04-28 11:57:32,363 : INFO : PROGRESS: at sentence #40000, processed 598296 words, keeping 30941 word types
2019-04-28 11:57:32,437 : INFO : PROGRESS: at sentence #50000, processed 749113 words, keeping 35737 word types
2019-04-28 11:57:32,508 : INFO : PROGRESS: at sentence #60000, processed 899949 words, keeping 39931 word types
2019-04-28 11:57:32,586 : INFO : PROGRESS: at sentence #70000, processed 1048619 words, keeping

2019-04-28 11:57:37,235 : INFO : PROGRESS: at sentence #710000, processed 10739146 words, keeping 182771 word types
2019-04-28 11:57:37,295 : INFO : PROGRESS: at sentence #720000, processed 10888974 words, keeping 184357 word types
2019-04-28 11:57:37,354 : INFO : PROGRESS: at sentence #730000, processed 11039031 words, keeping 185936 word types
2019-04-28 11:57:37,415 : INFO : PROGRESS: at sentence #740000, processed 11190160 words, keeping 187684 word types
2019-04-28 11:57:37,474 : INFO : PROGRESS: at sentence #750000, processed 11341059 words, keeping 189366 word types
2019-04-28 11:57:37,534 : INFO : PROGRESS: at sentence #760000, processed 11492647 words, keeping 190808 word types
2019-04-28 11:57:37,590 : INFO : PROGRESS: at sentence #770000, processed 11643905 words, keeping 192396 word types
2019-04-28 11:57:37,638 : INFO : PROGRESS: at sentence #780000, processed 11796213 words, keeping 194001 word types
2019-04-28 11:57:37,687 : INFO : PROGRESS: at sentence #790000, processe

2019-04-28 11:57:41,060 : INFO : PROGRESS: at sentence #1420000, processed 20964663 words, keeping 302907 word types
2019-04-28 11:57:41,110 : INFO : PROGRESS: at sentence #1430000, processed 21103510 words, keeping 304359 word types
2019-04-28 11:57:41,157 : INFO : PROGRESS: at sentence #1440000, processed 21244727 words, keeping 305666 word types
2019-04-28 11:57:41,197 : INFO : PROGRESS: at sentence #1450000, processed 21389765 words, keeping 307103 word types
2019-04-28 11:57:41,247 : INFO : PROGRESS: at sentence #1460000, processed 21533732 words, keeping 308767 word types
2019-04-28 11:57:41,300 : INFO : PROGRESS: at sentence #1470000, processed 21676377 words, keeping 310340 word types
2019-04-28 11:57:41,349 : INFO : PROGRESS: at sentence #1480000, processed 21819577 words, keeping 311736 word types
2019-04-28 11:57:41,394 : INFO : PROGRESS: at sentence #1490000, processed 21959556 words, keeping 313123 word types
2019-04-28 11:57:41,443 : INFO : PROGRESS: at sentence #1500000,

2019-04-28 11:58:25,317 : INFO : EPOCH 2 - PROGRESS: at 56.94% examples, 617298 words/s, in_qsize 5, out_qsize 0
2019-04-28 11:58:26,320 : INFO : EPOCH 2 - PROGRESS: at 61.19% examples, 622160 words/s, in_qsize 5, out_qsize 0
2019-04-28 11:58:27,323 : INFO : EPOCH 2 - PROGRESS: at 65.63% examples, 628361 words/s, in_qsize 6, out_qsize 1
2019-04-28 11:58:28,339 : INFO : EPOCH 2 - PROGRESS: at 69.87% examples, 631099 words/s, in_qsize 6, out_qsize 0
2019-04-28 11:58:29,340 : INFO : EPOCH 2 - PROGRESS: at 74.19% examples, 634936 words/s, in_qsize 5, out_qsize 0
2019-04-28 11:58:30,346 : INFO : EPOCH 2 - PROGRESS: at 78.61% examples, 639759 words/s, in_qsize 6, out_qsize 0
2019-04-28 11:58:31,348 : INFO : EPOCH 2 - PROGRESS: at 81.28% examples, 631185 words/s, in_qsize 5, out_qsize 0
2019-04-28 11:58:32,355 : INFO : EPOCH 2 - PROGRESS: at 86.03% examples, 637866 words/s, in_qsize 4, out_qsize 1
2019-04-28 11:58:33,371 : INFO : EPOCH 2 - PROGRESS: at 89.58% examples, 635415 words/s, in_qsiz

2019-04-28 11:59:29,745 : INFO : EPOCH 5 - PROGRESS: at 13.29% examples, 776547 words/s, in_qsize 5, out_qsize 0
2019-04-28 11:59:30,747 : INFO : EPOCH 5 - PROGRESS: at 16.88% examples, 739088 words/s, in_qsize 5, out_qsize 0
2019-04-28 11:59:31,748 : INFO : EPOCH 5 - PROGRESS: at 19.67% examples, 690672 words/s, in_qsize 5, out_qsize 1
2019-04-28 11:59:32,783 : INFO : EPOCH 5 - PROGRESS: at 23.29% examples, 678696 words/s, in_qsize 4, out_qsize 1
2019-04-28 11:59:33,825 : INFO : EPOCH 5 - PROGRESS: at 26.52% examples, 659011 words/s, in_qsize 5, out_qsize 0
2019-04-28 11:59:34,830 : INFO : EPOCH 5 - PROGRESS: at 30.32% examples, 659146 words/s, in_qsize 5, out_qsize 0
2019-04-28 11:59:35,845 : INFO : EPOCH 5 - PROGRESS: at 35.11% examples, 678368 words/s, in_qsize 5, out_qsize 0
2019-04-28 11:59:36,846 : INFO : EPOCH 5 - PROGRESS: at 39.07% examples, 681073 words/s, in_qsize 5, out_qsize 0
2019-04-28 11:59:37,860 : INFO : EPOCH 5 - PROGRESS: at 42.22% examples, 669313 words/s, in_qsiz

(85542300, 117647830)

Optionally, save and load the model:

In [44]:
tweet_w2v.save('w2vmodel')
#new_w2vmodel = gensim.models.Word2Vec.load('w2vmodel')

2019-04-28 11:59:51,784 : INFO : saving Word2Vec object under w2vmodel, separately None
2019-04-28 11:59:51,789 : INFO : not storing attribute vectors_norm
2019-04-28 11:59:51,790 : INFO : not storing attribute cum_table
2019-04-28 11:59:52,716 : INFO : saved w2vmodel


Additionally, I will weight the word vectors by their tf-idf values, so that I can also take into account the words that are specific to each document/tweet.

In [45]:
vectorizer = TfidfVectorizer(analyzer=lambda x: x, min_df=10)
matrix = vectorizer.fit_transform([x.words for x in data_labellised])
tfidf = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))
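
The `tfidf` dictionary above maps each token to the vectorizer's `idf_` value. To make that number concrete, the sketch below computes a smoothed IDF by hand on a few toy documents. The formula ln((1 + n) / (1 + df)) + 1 is, to my understanding, what `TfidfVectorizer` uses with its default `smooth_idf=True`; the sketch ignores the `min_df=10` cut-off, and `smoothed_idf` and `toy_docs` are illustrative names.

```python
import math
from collections import Counter

def smoothed_idf(docs):
    # docs: list of token lists; each token counts once per document
    n_docs = len(docs)
    df = Counter(token for doc in docs for token in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return {tok: math.log((1 + n_docs) / (1 + count)) + 1 for tok, count in df.items()}

toy_docs = [["good", "day"], ["good", "vote"], ["bad", "vote"], ["good", "news"]]
idf = smoothed_idf(toy_docs)
for token, value in sorted(idf.items(), key=lambda kv: kv[1]):
    print(f"{token}: {value:.4f}")
```

Tokens that occur in many documents ("good") receive a lower weight than tokens that occur in few ("bad"), which is the effect the weighting is meant to have on the tweet vectors.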

Function for building the feature vectors, on which we will build the classification model. I use both word2vec values and tf-idf values for determination of the features.

In [46]:
def buildWordVector(tokens, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += tweet_w2v[word].reshape((1, size)) * tfidf[word]
            count += 1.
        except KeyError:
            continue
    if count != 0:
        vec /= count
    return vec
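
Written out, the vector that `buildWordVector` computes for a tweet $t$ can be summarized as follows. The symbols $K_t$, $\mathbf{e}_w$ and $\mathrm{idf}(w)$ are notation introduced here for illustration only:

$$
\mathbf{v}(t) = \frac{1}{|K_t|} \sum_{w \in K_t} \mathrm{idf}(w)\,\mathbf{e}_w
$$

Here $K_t$ is the list of tokens of $t$ (with repeats) that have both a word2vec vector $\mathbf{e}_w$ and an entry in the `tfidf` dictionary; all other tokens are skipped through the `KeyError` branch. If $K_t$ is empty, $\mathbf{v}(t)$ stays the zero vector. Because a repeated token is added once per occurrence, term frequency enters through the sum, and the normalization divides by the token count rather than by the sum of the weights.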

Now, let's use the above method to build the feature vectors for all tweets in training and testing datasets.

In [47]:
train_vecs_w2v = np.concatenate([buildWordVector(z, n_dim) for z in tqdm(map(lambda x: x.words, x_train))])
train_vecs_w2v = scale(train_vecs_w2v)

test_vecs_w2v = np.concatenate([buildWordVector(z, n_dim) for z in tqdm(map(lambda x: x.words, x_test))])
test_vecs_w2v = scale(test_vecs_w2v)

  
800000it [03:07, 4277.77it/s]
  
200000it [00:48, 4146.09it/s]
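
One point worth keeping in mind: `scale` standardizes each array with that array's own column means and standard deviations, so the training and test sets above are scaled with slightly different statistics, and a single query vector cannot be scaled this way at all. A common alternative is to estimate the statistics once on the training data and reuse them everywhere. The sketch below shows the idea with NumPy on made-up data; the `toy_*` arrays and `standardize` are hypothetical and not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical feature matrices standing in for train_vecs_w2v and test_vecs_w2v
toy_train = rng.normal(loc=2.0, scale=3.0, size=(1000, 4))
toy_test = rng.normal(loc=2.0, scale=3.0, size=(200, 4))
toy_query = np.array([[2.0, 2.0, 2.0, 2.0]])

# Estimate the statistics once, on the training data only
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)
std[std == 0] = 1.0  # avoid dividing by zero for constant features

def standardize(x):
    # Apply the training statistics to any later batch, including a single query
    return (x - mean) / std

train_scaled = standardize(toy_train)
test_scaled = standardize(toy_test)
query_scaled = standardize(toy_query)
print(train_scaled.mean(axis=0).round(3))
print(query_scaled.shape)
```

With this approach the classifier sees test and query vectors on the same scale as the vectors it was trained on.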


Now, let's train the Random Forest classifier on the feature vectors. Here, I only load a previously trained model; uncomment the code below to train the model from scratch.

In [49]:
#forest2 = RandomForestClassifier(n_estimators = 100)   
#forest2 = forest2.fit(train_vecs_w2v, y_train)

Code for saving the RF model:

In [50]:
#filename = 'finalized_model.sav'
#pickle.dump(forest2, open(filename, 'wb'))

Code for loading the RF model:

In [53]:
filename = 'finalized_model.sav'
forest2 = pickle.load(open(filename, 'rb'))

Let's check the accuracy of the RF model on the test set:

In [54]:
score2 = forest2.score(X = test_vecs_w2v, y = y_test)
print(score2)

0.79369


Before analyzing real tweets, let's quickly test the model on some possible, easy-to-classify query tweets.

In [66]:
def query_sentiment(query):
    n_dim = 200
    querytokens=tokenize(query)
    query_vecs_w2v = buildWordVector(querytokens, n_dim)
    pred = forest2.predict(query_vecs_w2v).item()
    sent = ""
    if(pred == 0):
        sent = "NEGATIVE"
    elif(pred == 1):
        sent = "POSITIVE"
    print(query, " - ", sent)

In [84]:
query1 = "I hate Trump's policies."
query2 = "Incredible legislation"

query_sentiment(query1)
query_sentiment(query2)

I hate Trump's policies.  -  NEGATIVE
Incredible legislation  -  POSITIVE


  
  


## Part 3: Inference on politics-related tweets

Now, let's test both RF models (forest1 trained on IMDB reviews, forest2 trained on tweets) on political tweets. I draw a test set of 30 tweets from a larger politics-related dataset and label their sentiment myself, by hand. I then compare the predictions of both models with my own labels and draw conclusions about whether the models are useful for predicting sentiment in political tweets.

Importing the dataset of political tweets:

In [369]:
kt = pd.read_csv("keyword-tweets.csv", header=None)
kt.columns = ["type", "text"]
kt = kt[kt.type == "POLIT"]
kt.reset_index(drop=True, inplace=True)

Drawing tweets:

In [100]:
import random
political_tweets = list()

In [258]:
for j in range(30):
    i = random.randint(0, kt.shape[0]-1)
    query = kt["text"][i]
    print(query)
    political_tweets.append(query)

EPA Jackson: people want to keep out cold and save on energy bills. "Environmentalism seems like enclave of the well-off." #AGIF
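
Because `random.randint` picks each index independently, the loop above can select the same tweet more than once, and without a seed the draw cannot be repeated. A minimal sketch of an alternative, using a made-up list `toy_tweets` in place of `kt["text"]` and an arbitrary seed value:

```python
import random

# Hypothetical stand-in for the list of tweet texts in kt["text"]
toy_tweets = [f"tweet {k}" for k in range(100)]

rng = random.Random(42)  # fixed seed (arbitrary value) so the draw can be repeated
# sample() draws without replacement, so no tweet can appear twice
drawn = rng.sample(toy_tweets, 30)

print(len(drawn), len(set(drawn)))
```

Re-creating `random.Random(42)` and calling `sample` again yields the same 30 items, which makes the hand-labelled evaluation set reproducible.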


Now, I create the final DataFrame, which stores the text of the 30 tweets, the predictions of both models, and my own evaluation.

In [266]:
pt_df = pd.DataFrame(np.array(political_tweets), columns=["tweet"])
pt_df["imdb_model_pred"] = -1
pt_df["s140_model_pred"] = -1
pt_df["my_own_eval"] = -1

Here, I run both RF models on the final 30 political tweets and add my own evaluation.

In [373]:
import warnings
warnings.filterwarnings('ignore')

# my evaluation
my_evaluation = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0]
pt_df["my_own_eval"] = my_evaluation

#forest1 - IMDB-trained RF model
clean_test_reviews = []
for review in pt_df["tweet"]:
    clean_test_reviews.append(review_wordlist(review,remove_stopwords=True))
testDataVecs = getAvgFeatureVecs(clean_test_reviews, model1, num_features)
pt_df["imdb_model_pred"] = forest1.predict(testDataVecs)

#forest2 - Sentiment140-trained RF model
for i in range(pt_df.shape[0]):
    n_dim = 200
    querytokens = tokenize(pt_df["tweet"][i])
    query_vecs_w2v = buildWordVector(querytokens, n_dim)
    pred = forest2.predict(query_vecs_w2v).item()
    # .loc writes into pt_df itself; chained indexing
    # (pt_df[col][i] = ...) may write to a temporary copy
    pt_df.loc[i, "s140_model_pred"] = pred

Final DataFrame:

In [376]:
pt_df

| | tweet | imdb_model_pred | s140_model_pred | my_own_eval |
|---|---|---|---|---|
| 0 | RT @Mark_Meed: Breaking: Obama Official Linked... | 1 | 0 | 0 |
| 1 | The Democrat stimulus plan is a mechanism whos... | 1 | 0 | 0 |
| 2 | is still trying to figure out why Obama got th... | 0 | 0 | 0 |
| 3 | RT @Lyn_Sue: RT @TellTheTruth1 http://bit.ly/v... | 0 | 0 | 0 |
| 4 | @lauraflyme We need to destroy every company t... | 0 | 0 | 0 |
| 5 | YOU crazies are what give women and "feminism"... | 0 | 0 | 0 |
| 6 | RT @worldprayr: U.S. soldier captured in Afgha... | 0 | 0 | 1 |
| 7 | RT @billmaher: I'd give a weeks pay 4 10min w/... | 1 | 0 | 0 |
| 8 | @Dumb_Ox new congress report out, cf my post t... | 1 | 0 | 1 |
| 9 | Strange activity,Twitter says account suspende... | 1 | 0 | 0 |


Evaluation of model accuracy, when compared with my own "predictions":

In [378]:
for1 = 0
for2 = 0
for k in range(pt_df.shape[0]):
    if(pt_df["imdb_model_pred"][k] == pt_df["my_own_eval"][k]):
        for1 += 1
    if(pt_df["s140_model_pred"][k] == pt_df["my_own_eval"][k]):
        for2 += 1
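
The same comparison can be written as a small helper, and it is useful to put a majority-class baseline next to it. The sketch below uses only the ten rows of `pt_df` displayed above, so its numbers are not the 30-tweet accuracies; `accuracy` and `majority_baseline` are illustrative helper names.

```python
# First ten rows of pt_df as displayed above; the full evaluation uses 30 tweets
imdb_pred = [1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
s140_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
own_eval = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

def accuracy(pred, truth):
    # Fraction of positions where prediction and reference label agree
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def majority_baseline(truth):
    # Accuracy of always predicting the most frequent reference label
    return max(truth.count(0), truth.count(1)) / len(truth)

print("IMDB model (10 rows):", accuracy(imdb_pred, own_eval))
print("Sentiment140 model (10 rows):", accuracy(s140_pred, own_eval))
print("Majority baseline (10 rows):", majority_baseline(own_eval))
```

On these ten rows the Sentiment140 model predicts 0 everywhere, so it cannot be told apart from the baseline. In the full `my_evaluation` list, 21 of the 30 labels are 0, so a model that always answers "negative" would already agree with 70% of the hand labels; any accuracy estimate on this set is best read against that figure.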

Estimating the accuracy of models for sentiment prediction of political tweets:

In [1]:
print("Estimated accuracy of IMDB reviews-trained model: ", for1/30)
print("Estimated accuracy of Sentiment140 tweets-trained model: ", for2/30)

NameError: name 'for1' is not defined

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv("out-preds.csv")

In [9]:
df.head(5)

| | tweet | imdb_model_pred | s140_model_pred | my_own_eval |
|---|---|---|---|---|
| 0 | RT @Mark_Meed: Breaking: Obama Official Linked... | 1 | 0 | 0 |
| 1 | The Democrat stimulus plan is a mechanism whos... | 1 | 0 | 0 |
| 2 | is still trying to figure out why Obama got th... | 0 | 0 | 0 |
| 3 | RT @Lyn_Sue: RT @TellTheTruth1 http://bit.ly/v... | 0 | 0 | 0 |
| 4 | @lauraflyme We need to destroy every company t... | 0 | 0 | 0 |


In [10]:
for i in range(df.shape[0]):
    print("TWEET: ", df.iloc[i, 0])
    print("IMDB PREDICTION: ", df.iloc[i, 1])
    print("S140 PREDICTION: ", df.iloc[i, 2])
    # Column 3 holds my_own_eval; column 2 is the Sentiment140 prediction
    print("MY OWN EVALUATION: ", df.iloc[i, 3])
    print("-------------------------")

TWEET:  RT @Mark_Meed: Breaking: Obama Official Linked to Racially Charged Boycott of Glenn Beck - HUMAN EVENTS http://bit.ly/53pKk -Infuriating!
IMDB PREDICTION:  1
S140 PREDICTION:  0
MY OWN EVALUATION:  0
-------------------------
TWEET:  The Democrat stimulus plan is a mechanism whose goal is the destruction of the traditional American way of life. Nancy Cappock
IMDB PREDICTION:  1
S140 PREDICTION:  0
MY OWN EVALUATION:  0
-------------------------
TWEET:  is still trying to figure out why Obama got the nobel peace prize
IMDB PREDICTION:  0
S140 PREDICTION:  0
MY OWN EVALUATION:  0
-------------------------
TWEET:  RT @Lyn_Sue: RT @TellTheTruth1 http://bit.ly/v2s9V Sotomayor is nvolved n bkruptcy fraud & she lied. Vote delayd SHE Cn B STOPPED!! #tea ...
IMDB PREDICTION:  0
S140 PREDICTION:  0
MY OWN EVALUATION:  0
-------------------------
TWEET:  @lauraflyme We need to destroy every company that took any Obama money. Sink the entire banking sector!
IMDB PREDICTION:  0
S140 PREDICT