# Project 2
### Goal: Build classifiers
- Logistic Regression
- Linear Discriminant Analysis
- Support Vector Machines

### Project Notes
- Always train on ‘train/’ directory files and test on ‘test/’ directory files.
- Features should capture sentiment, not information about the movie itself.
- No external data related to movies; other external data related to sentiment analysis (e.g. sentiment scores of different words) is allowed.
- Vectorization of the text (TF-IDF, bigrams, etc.): packages are allowed, but we may need to go beyond what they do.

In [1]:
!pip install py-readability-metrics
!pip install nltk
import nltk
nltk.download('punkt')
nltk.download('stopwords')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## All Functionality Imported from project2.py

In [2]:
import numpy as np
import pandas as pd
from readability import Readability
from sklearn.utils import shuffle
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
import datetime
import os

np.random.seed(123)

from project2 import *

---
## Part 1:
You will construct your own feature set by analyzing the textual data and vectorizing it as you see fit.

Your project involves a competition between the existing feature set provided to you as part of the dataset and a feature set that you develop yourselves.

When setting up your vectorization process, please note the following facts. The training set has no more than 30 reviews per individual movie. The movies in the test set are different from the movies in the training set. This assignment is therefore about sentiment detection from the text, not about determining the relationship between movies and their ratings. Unlike Project 1, where your goal was to bring in features from outside datasets, your feature extraction for Project 2 shall concentrate solely on the text of the reviews.
You can use outside datasets related to language (e.g., datasets that specify word sentiment), but do not use any outside sources of information about the movies themselves. Your review models shall largely be agnostic of the specific movies.


#### Get Train and Test Reviews into DataFrames
**Variables:**
   - filename: full path to the file the review is found in 
   - text: entire text of the review
   - sentiment: 1 for positive, 0 for negative

In [3]:
def get_train():
    print("\nGetting positive reviews...\n")
    pos_train_path = '/data401/reviews/train/pos/'
    pos_train = []
    i = 0
    start = datetime.datetime.now()
    for filename in pos_train_files:
        with open(pos_train_path + filename) as f:
            text = f.read().replace('<br />','\n')
        pos_train.append({
            'filename': pos_train_path + filename,
            'text': text
        })

        if i%1000 == 0:
            print(i, datetime.datetime.now() - start)

        i += 1

    pos_train_df = pd.DataFrame(pos_train)
    pos_train_df['sentiment'] = 1
    
    print("\nGetting negative reviews...\n")
    neg_train_path = '/data401/reviews/train/neg/'
    neg_train = []
    i = 0
    start = datetime.datetime.now()
    for filename in neg_train_files:
        with open(neg_train_path + filename) as f:
            text = f.read().replace('<br />','\n')
        neg_train.append({
            'filename': neg_train_path + filename,
            'text': text
        })

        if i%1000 == 0:
            print(i, datetime.datetime.now() - start)

        i += 1

    neg_train_df = pd.DataFrame(neg_train)
    neg_train_df['sentiment'] = 0
    
    print('\nCombining DataFrames\n')
    train_df = pd.concat([pos_train_df, neg_train_df], sort = False).fillna(0)
    return train_df


def get_test():
    print("\nGetting positive reviews...\n")
    pos_test_path = '/data401/reviews/test/pos/'
    pos_test = []
    i = 0
    start = datetime.datetime.now()
    for filename in pos_test_files:
        with open(pos_test_path + filename) as f:
            text = f.read().replace('<br />','\n')
        pos_test.append({
            'filename': pos_test_path + filename,
            'text': text
        })

        if i%1000 == 0:
            print(i, datetime.datetime.now() - start)

        i += 1

    pos_test_df = pd.DataFrame(pos_test)
    pos_test_df['sentiment'] = 1
    
    print("\nGetting negative reviews...\n")
    neg_test_path = '/data401/reviews/test/neg/'
    neg_test = []
    i = 0
    start = datetime.datetime.now()
    for filename in neg_test_files:
        with open(neg_test_path + filename) as f:
            text = f.read().replace('<br />','\n')
        
        neg_test.append({
            'filename': neg_test_path+filename,
            'text': text
        })

        if i%1000 == 0:
            print(i, datetime.datetime.now() - start)

        i += 1

    neg_test_df = pd.DataFrame(neg_test)
    neg_test_df['sentiment'] = 0
    
    print('\nCombining DataFrames\n')
    test_df = pd.concat([pos_test_df, neg_test_df], sort = False).fillna(0)
    return test_df
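
The four loops in `get_train` and `get_test` differ only in the directory they read and the label they assign. As a minimal sketch, they could be folded into one helper. The name `load_reviews` and its `base_dir` parameter are hypothetical, and the sketch assumes the file lists come from `os.listdir` (the notebook obtains `pos_train_files` and the other lists elsewhere).

```python
import os

def load_reviews(base_dir, split):
    """Read every review under base_dir/split/{pos,neg} into a list of dicts.

    Hypothetical helper: base_dir and the pos/neg layout are assumptions
    modelled on the paths used in get_train and get_test.
    """
    rows = []
    for label_dir, sentiment in (("pos", 1), ("neg", 0)):
        folder = os.path.join(base_dir, split, label_dir)
        for name in sorted(os.listdir(folder)):
            path = os.path.join(folder, name)
            with open(path, encoding="utf-8") as f:
                # Same clean-up as above: HTML line breaks become newlines.
                text = f.read().replace("<br />", "\n")
            rows.append({"filename": path, "text": text, "sentiment": sentiment})
    return rows
```

Under these assumptions, `pd.DataFrame(load_reviews('/data401/reviews', 'train'))` would yield the same three columns (`filename`, `text`, `sentiment`) as `get_train()`.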

In [4]:
if 'train.csv' not in os.listdir('project2_data'):
    # This takes about 6 minutes
    train_df = get_train()
    train_df.to_csv('project2_data/train.csv', index = False)
else:
    train_df = pd.read_csv('project2_data/train.csv')

In [5]:
if 'test.csv' not in os.listdir('project2_data'):
    # This takes about 6 minutes
    test_df = get_test()
    test_df.to_csv('project2_data/test.csv', index = False)
else:
    test_df = pd.read_csv('project2_data/test.csv')

#### Add Features
**Basic Variables**
- number of sentences
- average number of words per sentence
- average word length

**Readability / Complexity**
- Dale-Chall readability score

**Sentiment**

Looking at:
- positive / negative emoticon use
- positive / negative word use
- boosters and diminishers

Variables:
- positive emoticons: number of positive emoticons adjusted for number of words
- negative emoticons: number of negative emoticons adjusted for number of words
- positive words: number of positive words adjusted for number of words, weighted by looking at boosting and diminishing words in the same sentence.
- negative words: number of negative words adjusted for number of words, weighted by looking at boosting and diminishing words in the same sentence.
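
To make the weighting concrete, here is a small worked sketch of the positive and negative word scores. The four mini-lexicons are hypothetical stand-ins for the real word lists, and the sketch matches whole lowercase tokens, which is a simplification of the feature code in this notebook.

```python
# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"good", "great"}
NEGATIVE = {"bad", "boring"}
BOOSTERS = {"very", "really"}
DIMINISHERS = {"slightly", "somewhat"}

def sentence_weight(tokens):
    """Return 1.5 for net boosting, 0.5 for net diminishing, else 1.0."""
    net = sum(t in BOOSTERS for t in tokens) - sum(t in DIMINISHERS for t in tokens)
    if net > 0:
        return 1.5
    if net < 0:
        return 0.5
    return 1.0

def word_scores(sentences):
    """Weighted positive/negative word counts, divided by the total word count."""
    tokenized = [s.lower().split() for s in sentences]
    num_words = sum(len(tokens) for tokens in tokenized)
    pos = neg = 0.0
    for tokens in tokenized:
        weight = sentence_weight(tokens)
        pos += weight * sum(t in POSITIVE for t in tokens) / num_words
        neg += weight * sum(t in NEGATIVE for t in tokens) / num_words
    return pos, neg

pos, neg = word_scores(["the acting was very good", "the plot was slightly boring"])
```

With ten words in total, the boosted "good" contributes 1.5/10 to the positive score and the diminished "boring" contributes 0.5/10 to the negative score.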

In [6]:
def calculate_readability(text):
    # Readability requires at least 100 words. If there aren't enough,
    # concatenate the text to itself and try again. The number of retries
    # is capped so an empty or unscoreable text cannot recurse forever.
    for _ in range(10):
        try:
            return Readability(text).dale_chall().score
        except Exception:
            text = text + ' ' + text
    return np.nan
    
def extract_emoticons(text, num_words):
    added_features = {
        'positive_emoticons': 0,
        'negative_emoticons': 0
    }
    
    for emoticon in positive_emoticons:
        if emoticon in text:
            added_features['positive_emoticons'] +=  1/num_words
            
    for emoticon in negative_emoticons:
        if emoticon in text:
            added_features['negative_emoticons'] +=  1/num_words
            
    return added_features

def get_weight(sentence):
    # Checking for boosters and diminishers in a sentence
    # Baseline weight is 1
    # If net boosting: 1.5
    # If net diminishing: 0.5
    
    net = 0
    for dim in diminisher_words:
        if dim in sentence:
            net -= 1
    for boost in booster_words:
        if boost in sentence:
            net += 1
         
    negated = False
    for neg in negation_words:
        if neg in sentence:
            negated = True
            
    if net == 0:
        w = 1
    elif net > 0:
        w = 1.5
    else:
        w = 0.5
        
    return w, negated
    

def extract_words(sentences, num_words):
    added_features = {
        'positive_words': 0,
        'negative_words': 0
    }
    
    for sentence in sentences:
        words = word_tokenize(sentence)
        weight, negated = get_weight(sentence)

        # Negation handling is disabled: an earlier version swapped the two
        # counters when the sentence was negated, so `negated` is unused here.
        for word in words:
            if word in positive:
                added_features['positive_words'] += weight/num_words
            if word in negative:
                added_features['negative_words'] += weight/num_words
    
    return added_features
    
def get_features(text, filename):
    features = {
        'filename': filename
    }
       
    # Basic Variables
    sentences = sent_tokenize(text)
    words = word_tokenize(text)
    features['num_sentences'] = len(sentences)
    features['words_per_senence'] = len(words) / len(sentences)

    word_lengths = np.array([len(w) for w in words])
    features['avg_word_length'] = word_lengths.mean()
    
    # Readability / Complexity
    features['dale_chall_readability'] = calculate_readability(text)
    
    # Sentiment
    features.update(extract_emoticons(text, len(words)))
    features.update(extract_words(sentences, len(words)))
            
    return features

Test feature extraction on one observation

In [7]:
get_features(train_df['text'][0], train_df['filename'][0])

{'filename': '/data401/reviews/train/pos/2655_10.txt',
 'num_sentences': 17,
 'words_per_senence': 51.294117647058826,
 'avg_word_length': 4.283256880733945,
 'dale_chall_readability': 10.530646269822672,
 'positive_emoticons': 0,
 'negative_emoticons': 0,
 'positive_words': 0.046444954128440394,
 'negative_words': 0.04300458715596331}

Apply feature extraction to test and train dataframes. Save new data frames to CSV.

In [8]:
from multiprocessing import Pool

def parallelize_dataframe(df, func, n_cores=8):
    df_split = np.array_split(df, n_cores)
    pool = Pool(n_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

def get_features_parallel(df):
    trained_features = []
    for i, row in df.iterrows():
        trained_features.append(get_features(row['text'], row['filename']))
    return pd.DataFrame(trained_features)

In [9]:
if 'train_features.csv' not in os.listdir('project2_data'):
    train_features = parallelize_dataframe(train_df, get_features_parallel)
    
    train_features_df = pd.DataFrame(train_features).reset_index(drop=True)
    train_features_df.to_csv('project2_data/train_features.csv', index = False)
else:
    train_features_df = pd.read_csv('project2_data/train_features.csv')

In [10]:
if 'test_features.csv' not in os.listdir('project2_data'):
    test_features = parallelize_dataframe(test_df, get_features_parallel)

    test_features_df = pd.DataFrame(test_features).reset_index(drop=True)
    test_features_df.to_csv('project2_data/test_features.csv', index = False)
else:
    test_features_df = pd.read_csv('project2_data/test_features.csv')

In [11]:
test_features_df.head()

| | filename | num_sentences | words_per_senence | avg_word_length | dale_chall_readability | positive_emoticons | negative_emoticons | positive_words | negative_words |
|---|---|---|---|---|---|---|---|---|---|
| 0 | /data401/reviews/test/pos/2655_10.txt | 23 | 30.173913 | 4.18732 | 9.801523 | 0 | 0 | 0.048991 | 0.033141 |
| 1 | /data401/reviews/test/pos/4521_7.txt | 10 | 21.9 | 4.0 | 9.001824 | 0 | 0 | 0.043379 | 0.054795 |
| 2 | /data401/reviews/test/pos/12429_10.txt | 7 | 25.714286 | 3.861111 | 7.975813 | 0 | 0 | 0.044444 | 0.083333 |
| 3 | /data401/reviews/test/pos/3384_10.txt | 13 | 36.230769 | 4.214437 | 9.042558 | 0 | 0 | 0.045648 | 0.016985 |
| 4 | /data401/reviews/test/pos/6697_7.txt | 9 | 15.777778 | 3.887324 | 8.609708 | 0 | 0 | 0.066901 | 0.080986 |


#### Merging Polarity

In [12]:
filesofinterest = ['filename','polarity']

In [13]:
neg_test_polarity = pd.read_csv('project2_data/polarity/neg_test.csv')
pos_test_polarity = pd.read_csv('project2_data/polarity/pos_test.csv')

# .copy() makes the column assignment below act on an independent frame
# rather than on a slice of the frame that was read from disk.
neg_test_polarity = neg_test_polarity[filesofinterest].copy()
neg_test_polarity['filename'] = neg_test_polarity['filename'].apply(lambda x: "/data401/reviews/test/neg/"+x)

pos_test_polarity = pos_test_polarity[filesofinterest].copy()
pos_test_polarity['filename'] = pos_test_polarity['filename'].apply(lambda x: "/data401/reviews/test/pos/"+x)

test_polarity = pd.concat([neg_test_polarity, pos_test_polarity])

In [14]:
neg_train_polarity = pd.read_csv('project2_data/polarity/neg_train.csv')
pos_train_polarity = pd.read_csv('project2_data/polarity/pos_train.csv')

# .copy() makes the column assignment below act on an independent frame
# rather than on a slice of the frame that was read from disk.
neg_train_polarity = neg_train_polarity[filesofinterest].copy()
neg_train_polarity['filename'] = neg_train_polarity['filename'].apply(lambda x: "/data401/reviews/train/neg/"+x)

pos_train_polarity = pos_train_polarity[filesofinterest].copy()
pos_train_polarity['filename'] = pos_train_polarity['filename'].apply(lambda x: "/data401/reviews/train/pos/"+x)

train_polarity = pd.concat([neg_train_polarity, pos_train_polarity])

In [15]:
mean_score = pd.concat([test_polarity, train_polarity])['polarity'].mean()
train_polarity = train_polarity.fillna(mean_score)
test_polarity = test_polarity.fillna(mean_score)

In [16]:
test_features_df = test_features_df.merge(test_polarity, on='filename')
train_features_df = train_features_df.merge(train_polarity, on='filename')
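
`merge` defaults to an inner join, so any review without a polarity row would silently disappear from the feature frame. The sketch below, on made-up toy frames, shows one way to keep every review and list the unmatched files. The toy data and the choice to fill gaps with the mean are illustrative assumptions.

```python
import pandas as pd

# Toy frames standing in for the feature and polarity tables.
features = pd.DataFrame({
    "filename": ["a.txt", "b.txt", "c.txt"],
    "num_sentences": [3, 5, 2],
})
polarity = pd.DataFrame({
    "filename": ["a.txt", "b.txt"],
    "polarity": [0.4, -0.1],
})

# A left join keeps every review; indicator=True adds a `_merge` column
# that records whether a matching polarity row was found.
merged = features.merge(polarity, on="filename", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "filename"].tolist()

# Drop the bookkeeping column and fill the gaps with the mean polarity.
merged = merged.drop(columns="_merge")
merged["polarity"] = merged["polarity"].fillna(polarity["polarity"].mean())
```

Here `unmatched` holds `c.txt`, the one review with no polarity score, and `merged` still has all three rows.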

---
## Part 2:
You will use your classifier implementations to fit models for classifying movie reviews into positive and negative both on the feature set provided to you and on the feature set you created.

Implement three linear classifiers we discussed in class: Logistic Regression, Linear Discriminant Analysis, and Support Vector Machines. Use gradient descent/stochastic gradient descent for Logistic Regression and SVM classifiers. For Linear Discriminant Analysis use NumPy’s methods for discovering eigenvalues and eigenvectors of a matrix.
For this project you can limit your implementations to two-class classifiers, as Stanford’s Large Movie Review dataset has two classes.

For this project you will train the classifiers on the training set provided to you and evaluate them on the test set. Because of how both the training and the test set are constructed, do not use cross-validation, or other evaluation techniques, as they might produce skewed results.
As both positive and negative reviews are balanced in the test and training sets, and both are equally important to detect, your key measure is accuracy.  In addition, the software you build shall produce confusion matrices to give you an idea of what errors are more prevalent, and allow you to tune your classifier models.
Where your method comes with parameters, use grid search to hyper tune them. Do not consider learning rate a parameter of the model for methods where gradient descent is involved. Find the appropriate learning rate for each classification task, though.
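
The two evaluation measures named above can be sketched in a few lines of NumPy. The function names are hypothetical, and the matrix uses rows for true labels and columns for predicted labels, the same layout as `sklearn.metrics.confusion_matrix`.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """2x2 confusion matrix for 0/1 labels: rows are true, columns predicted."""
    cm = np.zeros((2, 2), dtype=int)
    for t, p in zip(np.asarray(y_true), np.asarray(y_pred)):
        cm[t, p] += 1
    return cm

def accuracy_from_cm(cm):
    """Fraction of correct predictions: diagonal sum over total count."""
    return np.trace(cm) / cm.sum()

cm = confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
acc = accuracy_from_cm(cm)
```

In this toy case `cm[0, 1]` counts negative reviews predicted as positive and `cm[1, 0]` counts positive reviews predicted as negative, which is the breakdown needed to see which error is more prevalent.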



In [17]:
Train = train_features_df.merge(
    train_df[['filename','sentiment']], 
    on = 'filename')
X_train = Train.drop(
    columns = ['sentiment']
)
y_train = Train['sentiment']

Test = test_features_df.merge(
    test_df[['filename','sentiment']],
    on = 'filename'
)
X_test = Test.drop(
    columns = ['sentiment']
)
y_test = Test['sentiment']

X_train = X_train.drop(columns = ['filename']).fillna(mean_score)
X_test = X_test.drop(columns = ['filename']).fillna(mean_score) 

In [18]:
scaler = StandardScaler()

In [19]:
col_names = X_train.columns.tolist()[:-1]

In [20]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way.
combined = pd.concat([X_train, X_test])
combined[col_names] = scaler.fit_transform(combined[col_names])

In [21]:
X_train = combined.head(25000)
X_test = combined.tail(25000)
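
The cells above fit the scaler on the training and test rows stacked together. A common alternative, sketched here on synthetic data, is to fit the scaler on the training rows only and reuse its statistics for the test rows, so the test set does not influence the transform. The toy arrays and names below are assumptions for illustration, and this is not the procedure that produced the accuracies in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the feature matrices.
rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test_toy = rng.normal(loc=5.0, scale=2.0, size=(40, 3))

toy_scaler = StandardScaler()
# Means and standard deviations are estimated from the training rows only...
X_train_scaled = toy_scaler.fit_transform(X_train_toy)
# ...and then applied unchanged to the test rows.
X_test_scaled = toy_scaler.transform(X_test_toy)
```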

### Logistic Regression
Using Gradient Descent

In [22]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(solver = 'sag')
clf.fit(X_train, y_train) 

predictions = clf.predict(X_test)
accuracy = sum(predictions == np.array(y_test))/len(predictions)

print(accuracy)

0.74828


In [23]:
model = fitLogistic(X_train, y_train, rate = 0.01, tol = 1, maxiter = 1000)
train_predictions = classifyLogistic(X_train, model)
train_accuracy = (train_predictions == y_train).sum()/len(train_predictions)
print("Train Accuracy =",train_accuracy)

test_predictions = classifyLogistic(X_test, model)
test_accuracy = (test_predictions == y_test).sum()/len(test_predictions)
print("Test Accuracy =",test_accuracy)

Train Accuracy = 0.75068
Test Accuracy = 0.74668
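
For readers without `project2.py` at hand, batch gradient descent for logistic regression can be sketched as follows. This is an illustrative implementation with hypothetical names (`fit_logistic_gd`, `predict_logistic`), not the `fitLogistic` code used above, and its `rate`, `tol` and `maxiter` defaults are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, rate=0.1, maxiter=1000, tol=1e-6):
    """Minimise the mean log-loss with batch gradient descent."""
    X = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    w = np.zeros(X.shape[1])
    for _ in range(maxiter):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)  # gradient of the mean log-loss
        w -= rate * grad
        if np.linalg.norm(grad) < tol:  # stop once the gradient is tiny
            break
    return w

def predict_logistic(X, w):
    X = np.column_stack([np.ones(len(X)), X])
    return (sigmoid(X @ w) >= 0.5).astype(int)

# Toy data: two well-separated Gaussian blobs labelled 0 and 1.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y_toy = np.array([0] * 100 + [1] * 100)
w_toy = fit_logistic_gd(X_toy, y_toy)
toy_accuracy = (predict_logistic(X_toy, w_toy) == y_toy).mean()
```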


### Linear Discriminant Analysis
Implemented using eigenvalues and eigenvectors in [the Python script](project2.py).

In [24]:
w, mu0, mu1 = fitLDA(X_train, y_train)
train_predictions = classifyLDA(X_train, w, mu0, mu1)
train_accuracy = (train_predictions == y_train).sum()/len(train_predictions)
print("Train Accuracy =",train_accuracy)

test_predictions = classifyLDA(X_test, w, mu0, mu1)
test_accuracy = (test_predictions == y_test).sum()/len(test_predictions)
print("Test Accuracy =",test_accuracy)

Train Accuracy = 0.75104
Test Accuracy = 0.74856
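
The eigenvector approach can be sketched as below. This is a generic two-class Fisher discriminant with hypothetical names, not the `fitLDA` code from `project2.py`. It assumes equal class priors (the dataset is balanced), so the decision threshold sits at the midpoint of the projected class means.

```python
import numpy as np

def fit_lda_eig(X, y):
    """Two-class Fisher LDA: top eigenvector of pinv(S_w) @ S_b."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: summed outer products of centred rows.
    S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    # Between-class scatter: outer product of the mean difference.
    diff = (mu1 - mu0).reshape(-1, 1)
    S_b = diff @ diff.T
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    # Eigenvectors have arbitrary sign; orient w so class 1 projects higher.
    if w @ (mu1 - mu0) < 0:
        w = -w
    return w, mu0, mu1

def classify_lda(X, w, mu0, mu1):
    threshold = w @ (mu0 + mu1) / 2  # midpoint of the projected means
    return (X @ w > threshold).astype(int)

# Toy data: two Gaussian blobs labelled 0 and 1.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y_toy = np.array([0] * 100 + [1] * 100)
w_toy, mu0_toy, mu1_toy = fit_lda_eig(X_toy, y_toy)
toy_accuracy = (classify_lda(X_toy, w_toy, mu0_toy, mu1_toy) == y_toy).mean()
```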


### Support Vector Machine
Using Gradient Descent

In [25]:
from sklearn.svm import SVC
clf = SVC(C=1, tol=1, max_iter = 100000, kernel = 'linear')
clf.fit(X_train, y_train) 

predictions = clf.predict(X_test)
accuracy = sum(predictions == np.array(y_test))/len(predictions)

print(accuracy)

0.74732


In [26]:
w = fitSVM(X_train,y_train,.01,.0000001,.4)
svm_out = predictSVM(X_test.values, w)
accuracy = accuracy_score(y_test, svm_out)
print('Accuracy:', accuracy)

Accuracy: 0.696
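
A linear SVM trained by stochastic subgradient descent on the regularised hinge loss can be sketched as follows. This is an illustrative version with hypothetical names and arbitrary defaults, not the `fitSVM` code used above. It assumes labels coded as -1/+1 and, for brevity, regularises the intercept along with the other weights.

```python
import numpy as np

def fit_svm_sgd(X, y, rate=0.01, lam=0.01, epochs=50, seed=0):
    """Linear SVM via stochastic subgradient descent; y must be in {-1, +1}."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            grad = lam * w  # gradient of the L2 penalty
            if y[i] * (X[i] @ w) < 1:  # point is inside the margin
                grad = grad - y[i] * X[i]  # add the hinge-loss subgradient
            w -= rate * grad
    return w

def predict_svm(X, w):
    X = np.column_stack([np.ones(len(X)), X])
    return np.where(X @ w >= 0, 1, -1)

# Toy data: two Gaussian blobs labelled -1 and +1.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y_toy = np.array([-1] * 100 + [1] * 100)
w_toy = fit_svm_sgd(X_toy, y_toy)
toy_accuracy = (predict_svm(X_toy, w_toy) == y_toy).mean()
```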


## Part 3:
You will compare the performance of the three classifiers to each other on each of the feature sets.

Question 2. How do the three classification techniques you implemented compare on the Large Movie Reviews dataset?


#### Our initial feature set still performs worse than the baseline, but what if we combine the two feature sets?

In [27]:
import scipy.sparse as sp
from sklearn.datasets import load_svmlight_file

In [28]:
with open('../data401/reviews/imdb.vocab') as f:
    vocab = [l.replace('\n','') for l in f.readlines()]

X_test_other, y_test_other = load_svmlight_file(
    '../data401/reviews/test/labeledBow.feat',
    n_features = len(vocab)
)
y_test_other = np.array([-1 if y <=5 else 1 for y in y_test_other])
y_test_other_01 = np.array([0 if y == -1 else 1 for y in y_test_other])

X_train_other, y_train_other = load_svmlight_file(
    '../data401/reviews/train/labeledBow.feat',
    n_features = len(vocab)
)
y_train_other = np.array([-1 if y <=5 else 1 for y in y_train_other])
y_train_other_01 = np.array([0 if y == -1 else 1 for y in y_train_other])

In [29]:
X_train_s = pd.DataFrame(X_train_other.todense()).iloc[:,:500]
X_test_s = pd.DataFrame(X_test_other.todense()).iloc[:,:500]

In [30]:
X_train_combined = pd.concat([X_train, X_train_s],axis=1,sort=False)
X_test_combined = pd.concat([X_test, X_test_s],axis=1,sort=False)
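
Densifying the bag-of-words matrix works for 500 columns but would not scale to the full vocabulary. As a sketch of an alternative, `scipy.sparse.hstack` can join the handcrafted features to the sparse columns without converting them to a dense array. The toy matrices below are made up for illustration.

```python
import numpy as np
import scipy.sparse as sp

# Toy sparse bag-of-words matrix: 6 reviews, 1000 vocabulary columns.
rows = np.array([0, 1, 3, 5])
cols = np.array([2, 499, 500, 999])
vals = np.array([1.0, 2.0, 3.0, 4.0])
bow = sp.csr_matrix((vals, (rows, cols)), shape=(6, 1000))

# Toy dense block standing in for the handcrafted features.
dense_features = np.arange(18, dtype=float).reshape(6, 3)

top_k = 500
# Slice the first top_k vocabulary columns and stack them to the right of
# the handcrafted features; the result stays sparse throughout.
combined_sparse = sp.hstack(
    [sp.csr_matrix(dense_features), bow[:, :top_k]], format="csr"
)
```

scikit-learn estimators such as `LogisticRegression` and `LinearSVC` accept sparse input, so a matrix built this way could be passed to `fit` directly.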

## Logistic top 500 and Combined

In [31]:
clf = LogisticRegression(solver = 'sag')
clf.fit(X_train_s, y_train) 

predictions = clf.predict(X_test_s)
accuracy = sum(predictions == np.array(y_test))/len(predictions)

print(accuracy)

0.84108




In [32]:
clf = LogisticRegression(solver = 'sag')
clf.fit(X_train_combined, y_train) 

predictions = clf.predict(X_test_combined)
accuracy = sum(predictions == np.array(y_test))/len(predictions)

print(accuracy)

0.87852




## LDA combined and top 500

In [33]:
w, mu0, mu1 = fitLDA(X_train_combined, y_train)
train_predictions = classifyLDA(X_train_combined, w, mu0, mu1)
train_accuracy = (train_predictions == y_train).sum()/len(train_predictions)
print("Train Accuracy =",train_accuracy)

test_predictions = classifyLDA(X_test_combined, w, mu0, mu1)
test_accuracy = (test_predictions == y_test).sum()/len(test_predictions)
print("Test Accuracy =",test_accuracy)

Train Accuracy = 0.87928
Test Accuracy = 0.87336


In [34]:
w, mu0, mu1 = fitLDA(X_train_s, y_train)
train_predictions = classifyLDA(X_train_s, w, mu0, mu1)
train_accuracy = (train_predictions == y_train).sum()/len(train_predictions)
print("Train Accuracy =",train_accuracy)

test_predictions = classifyLDA(X_test_s, w, mu0, mu1)
test_accuracy = (test_predictions == y_test).sum()/len(test_predictions)
print("Test Accuracy =",test_accuracy)

Train Accuracy = 0.84644
Test Accuracy = 0.8386


## SVM top 500 and combined

In [35]:
from sklearn.svm import LinearSVC
clf = LinearSVC(C=1, tol=1, max_iter = 10000)
clf.fit(X_train_s, y_train) 

predictions = clf.predict(X_test_s)
accuracy = sum(predictions == np.array(y_test))/len(predictions)

print(accuracy)

0.84336




In [36]:
clf = LinearSVC(C=1, tol=1, max_iter = 10000)
clf.fit(X_train_combined, y_train) 

predictions = clf.predict(X_test_combined)
accuracy = sum(predictions == np.array(y_test))/len(predictions)

print(accuracy)

0.8822


