# Deep Learning Assignment 4: Analyzing Sentiment
Tyler Bevan

 * some data: http://help.sentiment140.com/for-students

**Columns:**

    0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
    1 - the id of the tweet (2087)
    2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
    3 - the query (lyx). If there is no query, then this value is NO_QUERY.
    4 - the user that tweeted (robotickilldozr)
    5 - the text of the tweet (Lyx is cool)

In [1]:
from collections import Counter
import numpy as np
import string
import pandas as pd
import re
from sklearn import linear_model
from sklearn import naive_bayes
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
import gensim.models.keyedvectors as w
from nltk.stem import WordNetLemmatizer



#### the data has already been split into train and test sets

In [2]:
cols = ['polarity','id', 'date', 'query', 'user', 'tweet']

data = pd.read_csv('sentiment.csv',names=cols, encoding='ISO-8859-1')
print('length of data {}'.format(len(data)))

length of data 1600000


This is a lot of data. That's great! However, running the whole notebook on all of it takes a long time, so the next cell includes an optional line (commented out here) that randomly samples about 10% of the rows. We also don't need all of the columns, so let's keep only `polarity` and `tweet`.

In [3]:
#data=data.sample(frac=0.1,random_state=200)
data = data.drop(['id', 'date', 'query', 'user'], axis=1)
data[:3]

|   | polarity | tweet |
|---|----------|-------|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | is upset that he can't update his Facebook by ... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... |


Which polarity values appear in the data?

In [4]:
set(data.polarity)

{0, 4}

## 1.) How many of each polarity are there?

* Hint: use a mask over the `data` dataframe

In [5]:
count = Counter(data.polarity)
count

Counter({0: 800000, 4: 800000})
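
The hint above suggests a mask rather than a `Counter`. A minimal sketch of that approach follows, using a small made-up frame (`toy` and its rows are hypothetical stand-ins for `data`):

```python
import pandas as pd

# Hypothetical stand-in for the real `data` frame.
toy = pd.DataFrame({
    'polarity': [0, 4, 4, 0, 4],
    'tweet': ['ugh', 'yay', 'nice', 'meh', 'love it'],
})

# A boolean mask is True on the rows that satisfy the condition.
neg_mask = toy.polarity == 0
pos_mask = toy.polarity == 4

# Summing a boolean mask counts its True entries.
n_neg = int(neg_mask.sum())
n_pos = int(pos_mask.sum())
print(n_neg, n_pos)
```

The same masks can also select rows, e.g. `toy[neg_mask]` returns only the negative tweets.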

## 2.) Change all 4s in polarity to 1

* Hint: a lambda function might be useful here

In [6]:
def four_to_one(x):
    return 1 if x == 4 else x
    
data['polarity'] = data.polarity.map(four_to_one)

In [7]:
count = Counter(data.polarity)
count

Counter({0: 800000, 1: 800000})
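
For comparison, here is a sketch of the lambda version from the hint next to pandas' `Series.replace`, which should give the same result. The `polarity` series below is made-up data, not the real column:

```python
import pandas as pd

polarity = pd.Series([0, 4, 4, 0])  # hypothetical labels

# Lambda version, as the hint suggests.
via_lambda = polarity.map(lambda x: 1 if x == 4 else x)

# Equivalent dictionary-based replacement.
via_replace = polarity.replace({4: 1})

print(via_lambda.tolist())
print(via_replace.tolist())
```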

## 3.) split the data into 10% test, 10% dev, and 80% train

* create `test`, `dev`, and `train` data tables
* you can use the `.sample()` method for the dataframe
* print out the shapes of each of the three tables
* What is the baseline for this task? 

In [8]:
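# Draw the three splits with DataFrame.sample().
# Note: each call samples independently and with replacement,
# so the same row can appear in more than one split.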
train = data.sample(frac=0.8, replace=True)
dev = data.sample(frac=0.1, replace=True)
test = data.sample(frac=0.1, replace=True)

train.shape, dev.shape, test.shape

((1280000, 2), (160000, 2), (160000, 2))

The baseline is 0.5: the two classes are balanced (800,000 tweets each), so random guessing or always predicting one class gets about 50% accuracy.
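
Because independent `.sample()` calls can pick the same row more than once, the three tables may overlap. One possible way to get disjoint splits is to shuffle once and slice. This is only a sketch on made-up data; `toy` and the `*_toy` names are hypothetical, and it is not the split that produced the numbers in this notebook:

```python
import pandas as pd

# Hypothetical stand-in for `data`: 100 rows with balanced labels.
toy = pd.DataFrame({
    'polarity': [0, 1] * 50,
    'tweet': ['tweet %d' % i for i in range(100)],
})

# Shuffle every row exactly once (sampling without replacement).
shuffled = toy.sample(frac=1.0, random_state=0)

n = len(shuffled)
n_train = n * 8 // 10   # 80%
n_dev = n // 10         # 10%

# Consecutive slices of the shuffled frame cannot share rows.
train_toy = shuffled.iloc[:n_train]
dev_toy = shuffled.iloc[n_train:n_train + n_dev]
test_toy = shuffled.iloc[n_train + n_dev:]

print(train_toy.shape, dev_toy.shape, test_toy.shape)
```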

## 4.) Use a LabelEncoder to convert the tweet column to numbers

* I do this for you. Just run the following cells to see how well representing a full tweet as an index number works for this task.

In [9]:
y = train.polarity.values  # .as_matrix() is deprecated; .values returns the same array

y.shape

(1280000,)

In [10]:
leX = preprocessing.LabelEncoder()
leX.fit(data.tweet) # use the original data df so all possibilities are encoded
X = leX.transform(train.tweet)
X = X.reshape(X.shape[0], 1)

X.shape

(1280000, 1)

In [11]:
model = linear_model.LogisticRegression(penalty='l2', solver='lbfgs')
model.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

In [12]:
Xdev = leX.transform(dev.tweet)
Xdev = Xdev.reshape(Xdev.shape[0], 1)
ydev = dev.polarity.values

In [13]:
accuracy_score(ydev, model.predict(Xdev))

0.49763125

## 5.) Analysis of LabelEncoder

* How well does LabelEncoder perform compared to the baseline?
* Why does it perform so poorly? What does it have to do with the way the features are represented?

It is about the same as the baseline (0.498 vs. 0.5). It does so poorly because nearly every tweet is unique, so each tweet gets its own integer code. That code says nothing about the words in the tweet, so there are no trends for the model to learn from.
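
A small sketch of what `LabelEncoder` does to whole tweets may make this concrete. The three tweets are made up; the codes are simply each string's position in sorted order:

```python
from sklearn.preprocessing import LabelEncoder

tweets = ['sad day', 'happy day', 'sad night']  # hypothetical tweets

encoder = LabelEncoder()
codes = encoder.fit_transform(tweets)

# classes_ holds the distinct strings in sorted order;
# each tweet's code is its index in that list.
print(list(encoder.classes_))
print(codes.tolist())
```

Two tweets that share most of their words get codes that are no closer together than any other pair, so a single numeric feature like this carries no sentiment signal.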

## 6.) one-hot encoding

* Repeat the steps of preparing the test and dev data as in #4, only this time use one-hot vectors instead of the label encoder
* Hint: do you want to represent the entire tweet as a vector, or each word? (Hint: use words to make the one-hot encoder, then sum them to represent the entire tweet)
* Hint: try `get_dummies()`, alternatively use scikit-learn's OneHotEncoder

In [14]:
vectorizer = CountVectorizer()
vectorizer.fit(data.tweet.values)
X = vectorizer.transform(train.tweet.values)
y = train.polarity.values

In [15]:
model = linear_model.LogisticRegression(penalty='l2', solver='lbfgs', max_iter=300)
model.fit(X, y)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=300, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

In [16]:
Xtest = vectorizer.transform(test.tweet.values)
ytest = test.polarity.values

accuracy_score(ytest, model.predict(Xtest))

0.828125
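
The cells above use `CountVectorizer` rather than building one-hot vectors by hand. As a sanity check, the sketch below (on two made-up documents) suggests that its output matches the "one-hot per word, then sum" representation from the hint; `sum_of_one_hots` is a hypothetical helper written only for this illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ['good good day', 'bad day']  # hypothetical tweets

vec = CountVectorizer()
counts = vec.fit_transform(docs).toarray()
vocab = vec.vocabulary_  # maps each word to its column index

def sum_of_one_hots(doc):
    """Build one one-hot vector per word and add them up."""
    total = np.zeros(len(vocab), dtype=int)
    for word in doc.split():
        one_hot = np.zeros(len(vocab), dtype=int)
        one_hot[vocab[word]] = 1
        total += one_hot
    return total

manual = np.array([sum_of_one_hots(d) for d in docs])
print(counts.tolist())
print(manual.tolist())
```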

## 7.) word2vec

* download the `GoogleNews-vectors-negative300.bin` file from https://github.com/mmihaltz/word2vec-GoogleNews-vectors and unzip the file
* load the file by running the cell below (you may need to pip install gensim and you may need to change the path to the file)

In [17]:
w2v = w.KeyedVectors.load_word2vec_format(r'D:\GoogleNews-vectors-negative300.bin', binary=True)  # raw string keeps the backslash literal

* You can access vectors like a dictionary:

In [18]:
w2v['red'][:3] # show the first three values for the vector for 'red'

array([ 0.09716797, -0.08496094,  0.27148438], dtype=float32)

* vectors are length 300

In [19]:
len(w2v['red'])

300

* Repeat the steps of preparing the test and dev data as in #4, only this time use w2v vectors
* How do you represent a tweet, which is multiple words, as a single vector? (Hint: try summing the vectors)
* Note: w2v only has lower-cased words
* Hint: if w2v doesn't have a word you are looking for, just ignore that word

In [20]:
def generate_vector(tweet):
    """Represent a tweet as the sum of the word2vec vectors of its words."""
    # Words that are missing from the w2v vocabulary are skipped.
    vectors = [w2v[word] for word in tweet.lower().split() if word in w2v]
    if not vectors:
        # No known words: fall back to a zero vector of the same length.
        return np.zeros(300, dtype='f4')
    return np.sum(vectors, axis=0)

In [21]:
X = train['tweet'].map(generate_vector).values
X = np.array(X.tolist())
y = train['polarity'].values

In [22]:
model = linear_model.LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000)
model.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=1000, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

In [23]:
Xtest = test.tweet.map(generate_vector).values
Xtest = np.array(Xtest.tolist())
ytest = test.polarity.values

accuracy_score(ytest, model.predict(Xtest))

0.7328125
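
To see the summing step without loading the 300-dimensional GoogleNews file, here is a toy sketch. The 3-dimensional `toy_w2v` dictionary and its values are invented for illustration and are not real word2vec vectors:

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings" standing in for w2v.
toy_w2v = {
    'good': np.array([1.0, 0.0, 2.0], dtype='f4'),
    'day': np.array([0.5, 1.0, 0.0], dtype='f4'),
}
DIM = 3

def tweet_vector(tweet, vectors=toy_w2v, dim=DIM):
    """Sum the vectors of the known words; unknown words are ignored."""
    known = [vectors[w] for w in tweet.lower().split() if w in vectors]
    if not known:
        return np.zeros(dim, dtype='f4')
    return np.sum(known, axis=0)

v = tweet_vector('Good day everyone')   # 'everyone' is not in toy_w2v
empty = tweet_vector('???')             # no known words at all
print(v.tolist(), empty.tolist())
```

Whatever the tweet's length, the result always has the same number of dimensions, which is what lets logistic regression train on it.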

## 8.) Comparing the three approaches

* Now that you've tried things out on your `dev` set, train on your `train`+`dev` data and test on your `test` data for all three approaches and report the results. 
* Why do you think one-hot and word2vec worked better than the label encoder?
* Did one-hot or word2vec work better? Why do you think that is the case?
* What do you think would happen if you cleaned up the tweets (e.g., removed punctuation, emojis, etc.)? 

The label encoder doesn't decompose the tweet into components, so there is no way to compare tweets that aren't the same. The other two are breaking the tweets into words, and as such can assign weights to specific words in the tweet.

One-hot did better in all my tests (0.828 vs. 0.733 accuracy in the runs above). I suspect this is because it gives every distinct word its own dimension, while word2vec compresses each tweet into a fixed vector of length 300. In addition, tokens that contain symbols or unusual spellings don't appear in the word2vec vocabulary and are dropped. As a result, the one-hot representation keeps more word-level detail than word2vec, and that detail leads to better models.

Cleaning up the tweets better would improve the cohesion of the data set, as words would no longer be treated as separate tokens just because they had a period after them or started with a #. That should improve the accuracy and shrink the vocabulary as well.

## 9.) Compare Machine Learning Approaches

* Instead of using LogisticRegression, try another classifier from [scikit-learn](http://scikit-learn.org/stable/supervised_learning.html#supervised-learning) or nltk (e.g., decision trees, multi-layer perceptron, SVM, or maximum entropy)
* Compare the results to logistic regression. 
* Why do you think one approach works better than another?

In [24]:
vectorizer = CountVectorizer()
vectorizer.fit(data.tweet.values)
X = vectorizer.transform(train.tweet.values)
y = train.polarity.values
Xtest = vectorizer.transform(test.tweet.values)
ytest = test.polarity.values

In [25]:
model = naive_bayes.BernoulliNB()
model.fit(X,y)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [26]:
accuracy_score(ytest, model.predict(Xtest))

0.81331875

I chose to use the Bernoulli naive Bayes model because it is almost as accurate as logistic regression (0.813 vs. 0.828). However, it trains in less than a second, versus the couple of minutes that logistic regression takes, so it's a pretty good trade-off.

## 10.) Pre-process the text

* Now pre-process the text by doing one or more of the following:
  * stemming
  * lemmatizing
  * removing stop-words
* Re-run questions #6, #7, and #9 with the now preprocessed text (i.e., redo the one-hot and word2vec steps, and compare the results with LogisticRegression and your chosen approach in #9)
* Answer the following in a markdown cell:
  * How might you restructure your notebook and programming approach to allow you to perform different pre-processing and machine learning training/evaluation steps more systematically?

In [27]:
# clean_string removes @mentions and URLs, drops periods and commas,
# and splits '!' and '?' off as separate tokens.

r1 = re.compile(r'@[0-9A-Za-z_\-]+')              # @mentions
r2 = re.compile(r'https?://[A-Za-z0-9@_\-\./]*')  # URLs
r3 = re.compile(r'[\.,]')                         # periods and commas
r4 = re.compile(r'!')
r5 = re.compile(r'\?')
def clean_string(x):
    x = r1.sub('', x)
    x = r2.sub('', x)   # must run before r3, which would strip the dots in URLs
    x = r3.sub('', x)
    x = r4.sub(' !', x)
    x = r5.sub(' ?', x)
    return x



lemmatizer = WordNetLemmatizer()

def line_prep(line):
    # Clean and lower-case the tweet, then lemmatize each word and rejoin with spaces.
    words = clean_string(line).lower().split()
    return ' '.join(lemmatizer.lemmatize(word) for word in words)

# Map cleaning function
data_clean = data.tweet.map(lambda x : clean_string(x).lower())
full_words = set()
for tweet in data_clean.values:
    for word in tweet.split():
        full_words.add(lemmatizer.lemmatize(word))
        
prepped = list(full_words)

The cell above uses regular expressions to strip @mentions, URLs, periods, and commas and to split `!` and `?` into separate tokens, then uses the WordNet lemmatizer to reduce each word to its base form.
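
To show the effect of the regex cleaning on a single tweet, here is a self-contained sketch that mirrors the same patterns. The `clean` function and the example tweet are made up for illustration, and lemmatization is left out so the block runs without NLTK data:

```python
import re

mention = re.compile(r'@[0-9A-Za-z_\-]+')
url = re.compile(r'https?://[A-Za-z0-9@_\-\./]*')
dots_commas = re.compile(r'[\.,]')

def clean(x):
    x = mention.sub('', x)        # drop @mentions
    x = url.sub('', x)            # drop URLs before their dots are removed
    x = dots_commas.sub('', x)    # drop periods and commas
    # Put a space before '!' and '?' so they become separate tokens.
    x = x.replace('!', ' !').replace('?', ' ?')
    return x

# Hypothetical tweet, not taken from the data set.
tokens = clean('@user check http://example.com/a_b, so cool!').lower().split()
print(tokens)
```

The order of the substitutions matters: if periods were removed first, the URL pattern would stop matching at the first missing dot and leave fragments behind.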

In [28]:
vectorizer = CountVectorizer()
vectorizer.fit(prepped)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [29]:
prepX = train.tweet.map(line_prep)
X = vectorizer.transform(prepX.values)
y = train.polarity.values
prepXtest = test.tweet.map(line_prep)
Xtest = vectorizer.transform(prepXtest.values)
ytest = test.polarity.values

The one-hot logistic regression and Bernoulli naive Bayes both use these `X` and `y` sets.

In [30]:
model = linear_model.LogisticRegression(penalty='l2', solver='lbfgs', max_iter=500)
model.fit(X,y)
accuracy_score(ytest, model.predict(Xtest))



0.81513125

The cleaned-up data does slightly worse than the raw data (0.815 vs. 0.828 accuracy with logistic regression). This could be because some of the variation being removed is actually meaningful.

In [31]:
Xw2v = prepX.map(generate_vector).values
Xw2v = np.array(Xw2v.tolist())
Xw2vtest = prepXtest.map(generate_vector).values
Xw2vtest = np.array(Xw2vtest.tolist())
model = linear_model.LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000)
model.fit(Xw2v, y)
accuracy_score(ytest, model.predict(Xw2vtest))

0.74520625

word2vec, however, does better on the cleaned text (0.745 vs. 0.733), probably because the cleaned-up and lemmatized words are more likely to be in its vocabulary.
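
One way to check the vocabulary explanation would be to measure how many tokens are missing from the vocabulary before and after cleaning. The sketch below does this on a made-up vocabulary; `toy_vocab` and `oov_rate` are hypothetical names, and with the real model the membership test would be `word in w2v`:

```python
import string

toy_vocab = {'lyx', 'is', 'cool'}  # hypothetical stand-in for the w2v vocabulary

def oov_rate(tweet, vocab, strip_punct=False):
    """Fraction of whitespace-separated tokens that are not in vocab."""
    words = tweet.lower().split()
    if strip_punct:
        # Remove leading/trailing punctuation and drop tokens that become empty.
        words = [w.strip(string.punctuation) for w in words]
        words = [w for w in words if w]
    if not words:
        return 0.0
    missing = sum(1 for w in words if w not in vocab)
    return missing / len(words)

raw = oov_rate('Lyx is cool.', toy_vocab)
cleaned = oov_rate('Lyx is cool.', toy_vocab, strip_punct=True)
print(raw, cleaned)
```

Here the trailing period makes `cool.` an unknown token until the punctuation is stripped.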

In [32]:
model = naive_bayes.BernoulliNB()
model.fit(X,y)
accuracy_score(ytest, model.predict(Xtest))

0.79548125

Bernoulli naive Bayes also does slightly worse on the cleaned text (0.795 vs. 0.813), in line with the one-hot logistic regression.

My dataset generation and cleaning steps are not consistent: some are wrapped in functions and some are done by hand. Some of the steps could also be merged into fewer cells to make restarting after exiting simpler. I should wrap each preprocessing, training, and evaluation step in a function so the notebook is easier to rerun with different settings.
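
One possible restructuring, sketched on a handful of made-up tweets, is to loop over named preprocessing functions and classifiers and build a scikit-learn pipeline for each pair. The data, the `preprocessors` and `classifiers` dictionaries, and their entries are all assumptions for illustration, not the setup used for the results in this notebook:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the train/test tables.
train_texts = ['love this', 'so happy today', 'great day',
               'hate this', 'so sad today', 'awful day']
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ['happy day', 'sad day']
test_labels = [1, 0]

# Each preprocessing variant is a function from raw text to cleaned text.
preprocessors = {
    'raw': lambda s: s.lower(),
    'no_punct': lambda s: ''.join(
        ch for ch in s.lower() if ch.isalnum() or ch.isspace()),
}

# Factories, so every run gets a fresh, unfitted model.
classifiers = {
    'logreg': lambda: LogisticRegression(max_iter=300),
    'bernoulli_nb': lambda: BernoulliNB(),
}

results = {}
for p_name, prep in preprocessors.items():
    for c_name, make_clf in classifiers.items():
        pipe = make_pipeline(CountVectorizer(preprocessor=prep), make_clf())
        pipe.fit(train_texts, train_labels)
        # Pipeline.score reports accuracy for classifiers.
        results[(p_name, c_name)] = pipe.score(test_texts, test_labels)

for key, acc in sorted(results.items()):
    print(key, acc)
```

Adding a new cleaning step or model then means adding one dictionary entry instead of copying several cells.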