# HW04: ML and DL

Remember that these homework assignments are graded on completion. **You can skip one section without losing credit.**

## Load and Pre-process Text
We perform sentiment analysis on the [Movie Review Data](https://www.cs.cornell.edu/people/pabo/movie-review-data/). For more background on the data, see [the paper](https://www.cs.cornell.edu/home/llee/papers/pang-lee-stars.pdf) (optional reading).

In [2]:
# In this tutorial, we do sentiment analysis
# download the data
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar xf aclImdb_v1.tar.gz

!wget https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_data.tar.gz
!wget https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_whole_review.tar.gz
 
!tar xf scale_data.tar.gz 
!tar xf scale_whole_review.tar.gz


First, we load the data with the function provided below. Note that it preprocesses each review with gensim's simple_preprocess() function, and that we split the data into a training set and a test set right after loading.

In [3]:
import os
from gensim.utils import simple_preprocess
from sklearn.model_selection import train_test_split

def load_data():
    examples, labels = [], []
    authors = os.listdir("scale_whole_review")
    for author in authors:
        # the id file and the rating file are read in parallel, one review per line
        fn_ids = os.path.join("scaledata", author, "id." + author)
        fn_ratings = os.path.join("scaledata", author, "rating." + author)
        with open(fn_ids) as ids, open(fn_ratings) as ratings:
            for idx, rating in zip(ids, ratings):
                labels.append(float(rating.strip()))
                filename_text = os.path.join("scale_whole_review", author, "txt.parag", idx.strip() + ".txt")
                with open(filename_text, encoding='latin-1') as f:
                    # simple_preprocess lowercases and tokenizes; we re-join the tokens into one string
                    examples.append(" ".join(simple_preprocess(f.read())))
    return examples, labels

X, y = load_data()
# hold out 33% of the reviews as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print("text:", X_train[0], "\nlabel:", y_train[0])

text: for what it worth correctly guessed the identity of the killer in scream well sort of suppose should feel satisfied at my own cleverness since dimension and the makers of scream have put so much effort into keeping that piece of information secret even more so than in the original scream writer kevin williamson goes to ridiculous extremes to keep the audience guessing whodunnit so ridiculous that the film becomes too focused on the one thing which should have been least important as horror film it solid piece of work as satire it frequently hilarious as mystery it tries way way too hard scream takes place two years after the events of the original just in time for hollywood to cash in on the woodsboro high murders the non fiction book by reporter gale weathers courteney cox has become popular horror film called stab which in turn appears to have generated copycat killer when two college students turn up dead at the film premiere sidney prescott neve campbell once again begins to 

## Vectorize the data

In [4]:
# train a TF_IDF Vectorizer on X_train and vectorize X_train and X_test
from sklearn.feature_extraction.text import TfidfVectorizer

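vec = TfidfVectorizer(min_df=0.01,  # keep terms that occur in at least 1% of the docs
                      max_df=.5,    # drop terms that occur in more than 50% of the docs
                      stop_words='english',
                      ngram_range=(1, 2))  # unigrams and bigrams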

##TODO train vectorizer
vec.fit(X_train)

# labels y do not have to be transformed as they already are in the correct range

##TODO transform X_train to TF-IDF values
X_train_tfidf = vec.transform(X_train)
##TODO transform X_test to TF-IDF values
X_test_tfidf = vec.transform(X_test)

In [5]:
##TODO scale both training and test data with the standard scaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=False)

scaler.fit(X_train_tfidf)

X_train_scaled = scaler.transform(X_train_tfidf)
X_test_scaled = scaler.transform(X_test_tfidf)
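
The scaler above is created with with_mean=False because TF-IDF output is a sparse matrix, and centering would turn almost every zero into a non-zero value. The sketch below illustrates this on a small hand-made sparse matrix (the matrix and the names `X_sparse` and `toy_scaler` are assumptions for illustration): the columns are divided by their standard deviation, the sparsity pattern is preserved, and asking for centering raises an error.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler

# Small sparse matrix standing in for TF-IDF features (values are made up).
X_sparse = sp.csr_matrix(np.array([[1.0, 0.0],
                                   [0.0, 2.0],
                                   [3.0, 0.0],
                                   [0.0, 0.0]]))

# with_mean=False: only divide by the per-column standard deviation.
toy_scaler = StandardScaler(with_mean=False)
X_scaled_toy = toy_scaler.fit_transform(X_sparse)

print(sp.issparse(X_scaled_toy))            # still sparse
print(X_scaled_toy.toarray().std(axis=0))   # each column now has unit variance

# Centering a sparse matrix is refused by scikit-learn.
centering_failed = False
try:
    StandardScaler(with_mean=True).fit(X_sparse)
except ValueError:
    centering_failed = True
print(centering_failed)
```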

## ElasticNet

In [6]:
##TODO train an elastic net on the transformed output of the scaler
from sklearn.linear_model import ElasticNet

en = ElasticNet(alpha=0.01)

##TODO train the ElasticNet

en.fit(X_train_scaled, y_train)

##TODO predict the testset

y_pred = en.predict(X_test_scaled)

from sklearn.metrics import r2_score, accuracy_score, mean_squared_error, balanced_accuracy_score
##TODO print mean squared error and r2 score on the test set

# Both scores evaluate the performance of regression models
# The balanced_accuracy_score is a classification metric (thus finds no application in this task)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print('MSE: {}, r2: {}'.format(mse, r2))

MSE: 0.01587112006124458, r2: 0.5037917808908448
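
As a quick check of what the two numbers mean, the sketch below recomputes both metrics by hand on four made-up values (`y_true` and `y_hat` are illustrative assumptions, not the homework predictions). The MSE is the mean of the squared residuals, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true labels.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions for illustration.
y_true = np.array([0.2, 0.4, 0.6, 0.8])
y_hat = np.array([0.3, 0.4, 0.5, 0.9])

# Mean squared error: average squared residual.
mse_manual = np.mean((y_true - y_hat) ** 2)

# R^2 = 1 - SS_res / SS_tot, where SS_tot measures the spread around the mean.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Both values should agree with scikit-learn's implementations.
print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

An R² of 1 would mean perfect predictions, while 0 corresponds to always predicting the mean rating.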


## Logistic Regression

Next, we train a logistic regression model for binary prediction on these movie reviews. To get two bins, we transform the continuous ratings into two classes, where one class contains all the negative ratings (value < 0.5) and the other class all the positive ratings (value >= 0.5).

In [7]:
y_train = [1 if i >= 0.5 else 0 for i in y_train]
y_test = [1 if i >= 0.5 else 0 for i in y_test]


In [23]:
##TODO train logistic regression on X_train
from sklearn.linear_model import LogisticRegression
logistic_regression = LogisticRegression(max_iter=200)

##TODO train a logistic regression

logistic_regression.fit(X_train_scaled, y_train)

##TODO predict the testset

y_pred = logistic_regression.predict(X_test_scaled)


## Note: predict() already returns hard class labels (0 or 1), so the 0.5 threshold below leaves them unchanged.
## It would only matter for continuous outputs, e.g. the predictions of a regression model.
def map_predictions(predicted):
    predicted = [1 if i > 0.5 else 0 for i in predicted]
    return predicted

##TODO print the accuracy of our classifier on the testset
## (balanced_accuracy_score is the mean of the per-class recalls, not the plain accuracy)

acc = balanced_accuracy_score(y_test, map_predictions(y_pred))
print('Acc: {}'.format(acc))

## TODO print the 10 most informative words of the regression (the 10 words having the highest coefficients)

Acc: 0.7200720320282454


# Deep Learning

## MLP

In [24]:
# Import the AG news dataset (same as hw01)
# If train.csv is not in the working directory yet, uncomment the next line to download it
#!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv

import pandas as pd
import nltk
# the CSV has no header row, so the first line must be read as data
df = pd.read_csv('train.csv', header=None)

df.columns = ["label", "title", "lead"]
label_map = {1:"world", 2:"sport", 3:"business", 4:"sci/tech"}
def replace_label(x):
    return label_map[x]
df["label"] = df["label"].apply(replace_label) 
df["text"] = df["title"] + " " + df["lead"]
df = df.sample(n=10000) # only use 10K datapoints
df.head()

FileNotFoundError: ignored

In [None]:
# create a new variable "business" that takes value 1 if the label is business and 0 otherwise
df['business'] = df['label'].apply(lambda x: int(x=='business'))
y = df['business'].values
df['business'].head()

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
from sklearn.feature_extraction.text import CountVectorizer

# pre-process text as you did in HW02
def tokenize(x):
    # lemmatize and lowercase, dropping stop words, punctuation and digits
    return [w.lemma_.lower() for w in nlp(x) if not w.is_stop and not w.is_punct and not w.is_digit]
df["tokens"] = df["text"].apply(tokenize)
# join the tokens back into one whitespace-separated string per document
df["preprocessed_text"] = df["tokens"].apply(" ".join)

##TODO vectorize the pre-processed text using CountVectorizer

Your goal here is to use the features from the vectorized text to predict whether a snippet comes from a business article.

In [None]:
import math

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
from torchsummary import summary

## TODO build a MLP model with at least 2 hidden layers with ReLU activation, followed by dropout and an output layer with sigmoid activation
## TODO summarize the model using torchsummary
## TODO fit the model using early stopping to predict the business label
# (hint: early stopping means that if the validation score does not improve for more than "patience" epochs, training should stop and the best model so far should be loaded)
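
The following is one possible sketch of the three TODOs, not a reference solution. It makes several assumptions: it trains on synthetic count-like features instead of the vectorized AG news text, uses full-batch updates for brevity, monitors the validation loss (lower is better) rather than a validation score, and prints the model with `print(model)` instead of torchsummary to avoid the extra dependency. The class name `MLP` and all hyperparameters are illustrative choices.

```python
import copy
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split

torch.manual_seed(0)
rng = np.random.default_rng(0)

# Synthetic stand-in for the count features (assumption: 200 samples, 50 terms).
X = rng.poisson(0.3, size=(200, 50)).astype(np.float32)
y = (X[:, :5].sum(axis=1) > 1).astype(np.float32)  # toy binary label

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
X_tr, X_val = torch.from_numpy(X_tr), torch.from_numpy(X_val)
y_tr = torch.from_numpy(y_tr).unsqueeze(1)
y_val = torch.from_numpy(y_val).unsqueeze(1)

class MLP(nn.Module):
    """Two hidden ReLU layers, each followed by dropout, and a sigmoid output."""
    def __init__(self, n_features, hidden=32, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

model = MLP(X_tr.shape[1])
print(model)
criterion = nn.BCELoss()  # expects probabilities, which the sigmoid provides
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

patience, max_epochs = 5, 200
best_val_loss, best_state, epochs_without_improvement = float("inf"), None, 0

for epoch in range(max_epochs):
    model.train()  # enables dropout
    optimizer.zero_grad()
    loss = criterion(model(X_tr), y_tr)
    loss.backward()
    optimizer.step()

    model.eval()  # disables dropout for validation
    with torch.no_grad():
        val_loss = criterion(model(X_val), y_val).item()

    if val_loss < best_val_loss:
        # New best model: remember its weights and reset the counter.
        best_val_loss = val_loss
        best_state = copy.deepcopy(model.state_dict())
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement > patience:
            break  # no improvement for more than `patience` epochs

# Restore the best weights seen during training.
model.load_state_dict(best_state)
model.eval()
with torch.no_grad():
    restored_val_loss = criterion(model(X_val), y_val).item()
    val_acc = ((model(X_val) > 0.5).float() == y_val).float().mean().item()
print(f"stopped at epoch {epoch}, best val loss {best_val_loss:.3f}, val acc {val_acc:.3f}")
```

For the real task, the synthetic arrays would be replaced by the dense count features and the `business` labels, and mini-batches from a DataLoader would likely be preferable to full-batch updates on 10K documents.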