# Incorporating Deep Learning

* Continuing with movie reviews from the v2.0 polarity dataset at http://www.cs.cornell.edu/people/pabo/movie-review-data.
    * It contains written reviews of movies tagged as positive or negative.
* This notebook builds upon the tutorial "Working With Text Data" found at   
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
* You may find this information about deep learning helpful:   
https://scikit-learn.org/stable/modules/neural_networks_supervised.html

**Required Python libraries:**
* NumPy (www.numpy.org) 
* Matplotlib (matplotlib.org) 
* Scikit-learn (scikit-learn.org) 
* PyTorch (www.pytorch.org)

In [1]:
import matplotlib.pylab as py
%matplotlib inline

# Load in the movie review data, create TF-IDF features

In [None]:
from sklearn.datasets import load_files

# Training data folder must be passed as first argument.
# Note the 'txt_sentoken' contains two folders: 'neg' and 'pos'
# The 'neg' ('pos') folder contains text files of negative (positive) movie reviews
movie_reviews_data_folder = 'txt_sentoken'
print("loading text...")
dataset = load_files(movie_reviews_data_folder, shuffle=False)
print("n_samples: %d" % len(dataset.data))

In [3]:
from sklearn.model_selection import train_test_split

# split the dataset in training and test sets:
docs_train, docs_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.25, random_state=None)

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Use TfidfVectorizer to create features for the text documents
features = TfidfVectorizer(min_df=3, max_df=0.95, ngram_range=(1,1))
X_train=features.fit_transform(docs_train)
X_test=features.transform(docs_test)
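
The `min_df` and `max_df` arguments prune the vocabulary by document frequency: an integer is read as an absolute number of documents, a float as a proportion of documents. The sketch below is a simplified plain-Python imitation of that filter, not scikit-learn's own code; `filter_vocabulary` and the toy corpus are made up for illustration.

```python
import re
from collections import Counter

# Same default token pattern scikit-learn uses: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def filter_vocabulary(docs, min_df=3, max_df=0.95):
    """Hypothetical helper: keep terms whose document frequency is in range.

    min_df is an absolute document count; max_df is a proportion of documents.
    """
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, not once per occurrence.
        doc_freq.update(set(TOKEN_PATTERN.findall(doc.lower())))
    max_count = max_df * len(docs)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)


# Toy corpus (made up for illustration, not taken from the review data).
toy_docs = [
    "the good movie",
    "the bad movie",
    "the good plot",
    "the bad acting",
]
vocabulary = filter_vocabulary(toy_docs, min_df=2, max_df=0.95)
print(vocabulary)  # ['bad', 'good', 'movie']
```

Here "the" is dropped because it appears in every document (above `max_df`), while "plot" and "acting" are dropped because they appear in only one (below `min_df`).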

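## Implement PCA to project original features into lower dimensional space

PCA (which relies on Singular Value Decomposition, or SVD) reduces the number of features, which in turn reduces the run time for our code. The cell below uses scikit-learn's `TruncatedSVD`, which, unlike `PCA`, does not center the data before decomposing it and can therefore operate directly on the sparse TF-IDF matrix.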

In [5]:
from sklearn.decomposition import TruncatedSVD
import torch

svd = TruncatedSVD(n_components=1000)
svd.fit(X_train)
print(sum(svd.explained_variance_ratio_))

0.847397146331857
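
The printed value is the share of variance retained by the 1000 components. If one wanted to pick the number of components from a target share instead of fixing it up front, a minimal sketch could look like the following. The helper `components_needed` and the toy ratios are hypothetical; in the notebook the ratios would come from `svd.explained_variance_ratio_`.

```python
from itertools import accumulate


def components_needed(explained_variance_ratio, target):
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio reaches `target`, or None if it is never reached."""
    for k, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative >= target:
            return k
    return None


# Hypothetical ratios for illustration only.
toy_ratios = [0.40, 0.25, 0.15, 0.10, 0.05]
print(components_needed(toy_ratios, target=0.75))  # 3
```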


In [6]:
X_train = svd.transform(X_train)
X_test = svd.transform(X_test)

In [7]:
X_train = torch.from_numpy(X_train)
X_test = torch.from_numpy(X_test)

# Train Multi-Layer Perceptron classifier using the movie review dataset

Use the PyTorch library to implement a simple MLP classifier.   
Note that you will need to install PyTorch first, for example using Anaconda.

In [8]:
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

class MLPClassifier(nn.Module):
    def __init__(self):
        super(MLPClassifier, self).__init__()       # equivalent to super().__init__()
        self.fc1 = nn.Linear(X_train.shape[1], 500)
        self.fc2 = nn.Linear(500, 400)
        self.fc3 = nn.Linear(400, 300)
        self.fc4 = nn.Linear(300, 200)
        self.fc5 = nn.Linear(200, 1)

    def forward(self, input):
        # relu = rectified linear
        # See link for other options: https://pytorch.org/docs/stable/nn.functional.html
        intermediate_vector1 = F.relu(self.fc1(input))
        intermediate_vector2 = F.relu(self.fc2(intermediate_vector1))
        intermediate_vector3 = F.relu(self.fc3(intermediate_vector2))
        intermediate_vector4 = F.relu(self.fc4(intermediate_vector3))
        prediction_vector = torch.sigmoid(self.fc5(intermediate_vector4))

        return prediction_vector

In [9]:
# Create instance
mlp = MLPClassifier() 

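# forward() already applies a sigmoid, so use BCELoss, which expects
# probabilities (BCEWithLogitsLoss expects raw logits and applies its own sigmoid).
loss_func = nn.BCELoss()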
optimizer = optim.Adam(mlp.parameters())
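
PyTorch offers two binary cross-entropy losses: `nn.BCELoss` expects probabilities, while `nn.BCEWithLogitsLoss` expects raw logits and applies a sigmoid internally. The loss therefore has to match what `forward` returns. The sketch below uses plain Python rather than PyTorch to show, for a single made-up prediction, what happens numerically when a probability is passed through a second sigmoid before the loss is taken.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce(p, y):
    """Binary cross-entropy for one predicted probability p and label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


p = 0.99   # made-up confident probability, as produced by a final sigmoid
y = 1      # true label

# Loss taken directly on the probability (the BCELoss convention).
loss_on_probability = bce(p, y)

# Loss when the probability is treated as a logit and squashed again.
loss_with_extra_sigmoid = bce(sigmoid(p), y)

print(round(loss_on_probability, 4))
print(round(loss_with_extra_sigmoid, 4))
```

Because a sigmoid maps any value in [0, 1] into roughly [0.5, 0.73], the second loss can never get close to zero, even for a confident and correct prediction.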

This will take a few minutes to train.

In [10]:
num_epochs = 500
y_train = y_train.reshape(-1,1)

for i in range(1, num_epochs+1):
    # Step 1. set gradients to zero
    optimizer.zero_grad()

    # Step 2. compute output
    y_pred = mlp(X_train.float())

    # Step 3. compute loss
    loss = loss_func(y_pred, torch.tensor(y_train, dtype=torch.float))

    # Step 4. use loss to produce gradients
    loss.backward()

    # Step 5. use optimizer to take gradient step
    optimizer.step()
    
    if i % 100 == 0:
        print('Epoch ' + str(i)  + ' completed.')

Epoch 100 completed.
Epoch 200 completed.
Epoch 300 completed.
Epoch 400 completed.
Epoch 500 completed.


In [11]:
# Inference only, so gradient tracking is not needed
with torch.no_grad():
    y_predicted = mlp(X_test.float())

In [12]:
y_predicted = (torch.round(y_predicted)).detach().numpy().astype(int)

In [None]:
from sklearn import metrics
cm = metrics.confusion_matrix(y_test, y_predicted)
print(cm)
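
The confusion matrix can be condensed into a few summary numbers. A minimal sketch, assuming scikit-learn's layout (rows are true labels, columns are predicted labels, label 0 first) and treating label 1 as the positive class. The helper name and the counts are made up for illustration and are not results from this notebook.

```python
def summarize_confusion_matrix(cm):
    """Hypothetical helper for a 2x2 matrix laid out as scikit-learn does:
    rows are true labels, columns are predicted labels, label 0 first."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Made-up counts for illustration only.
toy_cm = [[40, 10],
          [5, 45]]
summary = summarize_confusion_matrix(toy_cm)
print(summary)
```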


In [None]:
py.scatter(range(len(y_predicted)), y_predicted)

# Transfer learning

This is the main point of the project: apply transfer learning to an NN trained on movie reviews. How will the NN perform on tweets? It turns out, not so well. However, adding only 100 labeled tweets to the movie database of 2100 reviews and retraining the network improved the performance of the NN remarkably.  

A sample of 100 tweets (regarding Gwen Berry), labeled by humans, yielded a point estimate of 0.31 for the proportion of positive tweets (95% CI [0.23, 0.41]). An NN trained only on movie reviews categorized 70% of the tweets as positive. After the 100 labeled tweets were added to the training set of 2100 movie reviews, the retrained network categorized 28% of the tweets as positive, which is very close to the point estimate and inside the confidence interval.
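
The text does not say how the confidence interval was computed. One candidate is the Wilson score interval, which for 31 positive tweets out of 100 reproduces the reported bounds to two decimals. The sketch below is an assumption about the method, and `wilson_interval` is an illustrative helper.

```python
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * ((p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) ** 0.5) / denom
    return center - half_width, center + half_width


# 31 of the 100 hand-labeled tweets were positive (point estimate 0.31).
low, high = wilson_interval(31, 100)
print(round(low, 2), round(high, 2))  # 0.23 0.41
```

The retrained network's 28% falls inside this interval, while the 70% from the network trained on movie reviews alone falls far outside it.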

1. Import data (tweets + metadata) from MongoDB
1. Extract tweet full_text from data 
1. Use the trained network to classify tweets using (simple) transfer learning.

In [13]:
import json
import pymongo

In [14]:
q = '@MzBerryThrows'
    
# The connection string for a remote hosted mongodb running on MongoDB atlas
# client = pymongo.MongoClient("mongodb+srv://test:epsabre@cluster0.fup2q.mongodb.net/myFirstDatabase?retryWrites=true&w=majority")

# A local MongoDB instance running on your personal machine, installed by following the documentation:
#    https://docs.mongodb.com/manual/tutorial/install-mongodb-on-windows/ 
client = pymongo.MongoClient("mongodb://127.0.0.1:27017")

# Get a reference to a particular database    
db = client['twitter']
    
# Reference a particular collection in the database
coll = db['statuses_'+q]

In [None]:
# Do a search in MongoDB

# This pulls all records, but not all records have full text.
#cursor = coll.find({})

# This should pull all records with full text, as all tweets should 
# contain '@' in the full_text field given the twitter search
query = { 'full_text' : {'$regex': '.*@.*'} }

# Collect the text of every matching tweet in a single pass.
# (A cursor is exhausted after one iteration, so counting and collecting
# in separate loops would require running the query twice.)
X_tweet = [tweet['full_text'] for tweet in coll.find(query)]

# Count queried tweets, i.e. tweets w/ full text
print('Number of tweets: ', len(X_tweet))

In [None]:
len(X_tweet)

In [17]:
X_tweet_features = features.transform(X_tweet)

A small visualization to see if the movie review features and the tweets have at least some overlap.  
Not much!!!

In [None]:
py.spy(X_tweet_features[:,:200])
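
The overlap can also be expressed as a single number: the fraction of tweet tokens that appear in the vocabulary learned from the movie reviews. A minimal sketch with toy stand-ins follows. `vocabulary_coverage` is a hypothetical helper, and the tokenization only approximates what `TfidfVectorizer` does by default. In the notebook one could pass `X_tweet` and `features.vocabulary_` instead of the toy values.

```python
import re

# Same default token pattern scikit-learn uses: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def vocabulary_coverage(docs, vocabulary):
    """Hypothetical helper: fraction of tokens in `docs` found in `vocabulary`."""
    tokens = [t for doc in docs for t in TOKEN_PATTERN.findall(doc.lower())]
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in vocabulary)
    return known / len(tokens)


# Toy stand-ins, made up for illustration.
toy_vocabulary = {"great", "movie", "boring", "plot"}
toy_tweets = ["Great throw today!", "boring take, no plot"]
coverage = vocabulary_coverage(toy_tweets, toy_vocabulary)
print(coverage)
```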

We use our pretrained preprocessing chain.

In [18]:
X_tweet_features_projected = svd.transform(X_tweet_features)
X_tweet_features_projected = torch.from_numpy(X_tweet_features_projected)

We use our pretrained neural network.

In [19]:
y_tweet_predicted = mlp(X_tweet_features_projected.float())

Do we get something reasonable?  
Some clearly negative tweets receive high scores (> 0.9)

In [None]:
for i in range(8):
    print(X_tweet[i])
    print(y_tweet_predicted[i])
    print()

## Sentiment Ratio
#### Some ways to measure the sentiment of the tweets

In [None]:
# Test - this pulls the number out of the tensor
#len(y_tweet_predicted)
y_tweet_predicted[0].item()

In [None]:
# Two ways we can do this. 
# 1. Require all tweets to be either positive or negative
# 2. classify tweets as pos, neg, neutral

# Parameters
pos_lo_bnd = .66
mid = .5
neg_hi_bnd = .33


# Initialize
pos_sent = 0
pos_neut_sent = 0
neg_neut_sent = 0
neg_sent = 0

pos_cnt = 0
pos_neut_cnt = 0
neg_neut_cnt = 0
neg_cnt = 0

for y in y_tweet_predicted:
    score = y.item()           # network output in [0, 1]
    scaled = 2 * score - 1     # scale to [-1,1]
    if score < neg_hi_bnd:
        neg_sent += scaled
        neg_cnt += 1
    elif score < mid:
        neg_neut_sent += scaled
        neg_neut_cnt += 1
    elif score < pos_lo_bnd:
        pos_neut_sent += scaled
        pos_neut_cnt += 1
    else:
        pos_sent += scaled
        pos_cnt += 1
        
# If all tweets are required to be either pos or neg
sent_ratio = -(pos_sent + pos_neut_sent) / (neg_sent + neg_neut_sent)
cnt_ratio = (pos_cnt + pos_neut_cnt) / (neg_cnt + neg_neut_cnt)
print("sentiment ratio w/ no neutral tweets: ", sent_ratio)
print("count ratio: ", cnt_ratio)
print("positive tweets: ", pos_cnt + pos_neut_cnt)
print("negative tweets: ", neg_cnt + neg_neut_cnt)

# If we allow for neutral tweets
sent_ratio = -(pos_sent / neg_sent)
cnt_ratio = pos_cnt / neg_cnt
print("sentiment ratio allowing for neutral tweets: ", sent_ratio)
print("count ratio: ", cnt_ratio)
print("positive tweets: ", pos_cnt)
print("neutral tweets: ", pos_neut_cnt + neg_neut_cnt)
print("negative tweets: ", neg_cnt)
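
The ratios above divide by the negative counts and sums, so they fail when no tweet lands in the negative bucket. A minimal sketch of the three-bucket variant as a reusable function that guards against that case follows. `sentiment_summary` and the toy scores are hypothetical, and the default thresholds simply mirror `neg_hi_bnd` and `pos_lo_bnd` above.

```python
def sentiment_summary(scores, neg_hi_bnd=0.33, pos_lo_bnd=0.66):
    """Hypothetical helper: bucket scores in [0, 1] and compare positive vs negative.

    Returns the counts per bucket and the positive/negative count ratio,
    or None for the ratio when no score falls in the negative bucket.
    """
    counts = {"neg": 0, "neutral": 0, "pos": 0}
    for score in scores:
        if score < neg_hi_bnd:
            counts["neg"] += 1
        elif score < pos_lo_bnd:
            counts["neutral"] += 1
        else:
            counts["pos"] += 1
    ratio = counts["pos"] / counts["neg"] if counts["neg"] else None
    return counts, ratio


# Made-up scores for illustration only.
toy_scores = [0.05, 0.20, 0.45, 0.55, 0.70, 0.95, 0.99]
counts, ratio = sentiment_summary(toy_scores)
print(counts, ratio)
```

With the notebook's tensor of predictions, the scores could be obtained with `y_tweet_predicted.detach().flatten().tolist()`.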
