# HW03: Supervised Machine Learning, Regression and XGBoost

Remember that these homework work as a completion grade. **You can skip one section without losing credit.**

## Load and Pre-process Text
We do sentiment analysis on the [Movie Review Data](https://www.cs.cornell.edu/people/pabo/movie-review-data/). If you would like to know more about the data, have a look at [the paper](https://www.cs.cornell.edu/home/llee/papers/pang-lee-stars.pdf) (but no need to do so).

In [1]:
# In this tutorial, we do sentiment analysis
# download the data
#!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
#!tar xf aclImdb_v1.tar.gz

!wget https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_data.tar.gz
!wget https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_whole_review.tar.gz
 
!tar xf scale_data.tar.gz 
!tar xf scale_whole_review.tar.gz

--2021-03-26 10:12:01--  https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_data.tar.gz
Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 132.236.207.36
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.36|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4029756 (3.8M) [application/x-gzip]
Saving to: ‘scale_data.tar.gz.4’


2021-03-26 10:12:02 (4.81 MB/s) - ‘scale_data.tar.gz.4’ saved [4029756/4029756]

--2021-03-26 10:12:02--  https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_whole_review.tar.gz
Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 132.236.207.36
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.36|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8853204 (8.4M) [application/x-gzip]
Saving to: ‘scale_whole_review.tar.gz.4’


2021-03-26 10:12:04 (7.69 MB/s) - ‘scale_whole_review.tar.gz.4’ saved [8853204/8853204]



First, we have to load the data for which we provide the function below. Note how we also preprocess the text using gensim's simple_preprocess() function and how we already split the data into a train and test split.

In [2]:
import os
from gensim.utils import simple_preprocess
def load_data():
    examples, labels = [], []
    authors = os.listdir("scale_whole_review")
    for author in authors:
        path = os.listdir(os.path.join("scale_whole_review", author, "txt.parag"))
        fn_ids = os.path.join("scaledata", author, "id." + author)
        fn_ratings = os.path.join("scaledata", author, "rating." + author)
        with open(fn_ids) as ids, open(fn_ratings) as ratings:
            for idx, rating in zip(ids, ratings):
                labels.append(float(rating.strip()))
                filename_text = os.path.join("scale_whole_review", author, "txt.parag", idx.strip() + ".txt")
                with open(filename_text, encoding='latin-1') as f:
                    examples.append(" ".join(simple_preprocess(f.read())))
    return examples, labels
                  
X,y  = load_data()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print ("text:", X_train[0], "\nlabel:", y_train[0])

text: for what it worth correctly guessed the identity of the killer in scream well sort of suppose should feel satisfied at my own cleverness since dimension and the makers of scream have put so much effort into keeping that piece of information secret even more so than in the original scream writer kevin williamson goes to ridiculous extremes to keep the audience guessing whodunnit so ridiculous that the film becomes too focused on the one thing which should have been least important as horror film it solid piece of work as satire it frequently hilarious as mystery it tries way way too hard scream takes place two years after the events of the original just in time for hollywood to cash in on the woodsboro high murders the non fiction book by reporter gale weathers courteney cox has become popular horror film called stab which in turn appears to have generated copycat killer when two college students turn up dead at the film premiere sidney prescott neve campbell once again begins to 

## Vectorize the data

In [3]:
# train a TF_IDF Vectorizer on X_train and vectorize X_train and X_test
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(min_df=0.01, # at min 1% of docs
                        max_df=.5,  
                        stop_words='english',
                        ngram_range=(1,2))

##TODO train vectorizer
vec.fit(X_train)
##TODO transform X_train to TF-IDF values
X_train_tfidf = vec.transform(X_train)
##TODO transform X_test to TF-IDF values
X_test_tfidf = vec.transform(X_test)
print (X_train_tfidf.shape)

(3354, 5624)


In [4]:
##TODO scale the data with the standard scaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=False)
X_train_scaled = scaler.fit_transform(X_train_tfidf)
X_test_scaled = scaler.transform(X_test_tfidf)

## ElasticNet

In [5]:
##TODO train an elastic net on the transformed output of the scaler
from sklearn.linear_model import ElasticNet

en = ElasticNet(alpha=0.01)

##TODO train
en.fit(X_train_scaled, y_train)
##TODO predict the testset
predicted = en.predict(X_test_scaled)

from sklearn.metrics import r2_score, accuracy_score, mean_squared_error, balanced_accuracy_score
##TODO print mean squared error and r2 score on the test set
print ("mse", mean_squared_error(y_test, predicted))
print ("r2", r2_score(y_test, predicted))

mse 0.015871120061244585
r2 0.5037917808908446


## Logistic Regression

Next, we train an OLS model doing binary prediction on these movie reviews. Two get two bins, we transform the continuous ratings into two classes, where one class contains all the negative ratings (value < 0.5), the other class all the positive ratings (value > 0.5)

In [6]:
y_train = [1 if i >= 0.5 else 0 for i in y_train]
y_test = [1 if i >= 0.5 else 0 for i in y_test]


In [14]:
##TODO train logistic regression on X_train
from sklearn.linear_model import LogisticRegression
logistic_regression = LogisticRegression()

##TODO train a logistic regression
logistic_regression.fit(X_train_scaled, y_train)
##TODO predict the testset 
predicted = en.predict(X_test_scaled)

##since we have continuous output, we need to post-process our labels similarly 
def map_predictions(predicted):
    predicted = [1 if i > 0.5 else 0 for i in predicted]
    return predicted

binary_predictions = map_predictions(predicted)
print (accuracy_score(y_test, binary_predictions))

## TODO print the 10 most informative words of the regression (the 10 words having the highest coefficients)


import numpy as np
id2word = vec.get_feature_names()
coefs = logistic_regression.coef_.squeeze()
indices = np.argsort(coefs)
print (logistic_regression.coef_.shape)
for i in indices[-10:]:
    print (coefs[i], id2word[i])

0.8184019370460048
(1, 5624)
0.16272136822550148 saves
0.16445601435035043 focus
0.16869513291903118 equal
0.16890444374119462 bit
0.1702246287134625 surprisingly
0.17141737344344304 grabs
0.18172123469104762 area
0.18417674474610274 exciting
0.20035051528196415 fine
0.23154446158133757 fascinating


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


## XGBoost

Lastly, we train an XGBoost classifier to do topic prediction on the AG news dataset, which is a multi-class prediction problem (4 classes). We again have to vectorize the data, train the classifier, predict the testset and output an evaluation metric (we go for accuracy).

In [8]:
!pip install xgboost



In [9]:
#Import the AG news dataset (same as hw01)
#Download them from here 
#!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv

import pandas as pd
import nltk
df = pd.read_csv('train.csv')

df.columns = ["label", "title", "lead"]
df["text"] = df["title"] + " " + df["lead"]
df.head()

Unnamed: 0,label,title,lead,text
0,3,Carlyle Looks Toward Commercial Aerospace (Reu...,Reuters - Private investment firm Carlyle Grou...,Carlyle Looks Toward Commercial Aerospace (Reu...
1,3,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\ab...,Oil and Economy Cloud Stocks' Outlook (Reuters...
2,3,Iraq Halts Oil Exports from Main Southern Pipe...,Reuters - Authorities have halted oil export\f...,Iraq Halts Oil Exports from Main Southern Pipe...
3,3,"Oil prices soar to all-time record, posing new...","AFP - Tearaway world oil prices, toppling reco...","Oil prices soar to all-time record, posing new..."
4,3,"Stocks End Up, But Near Year Lows (Reuters)",Reuters - Stocks ended slightly higher on Frid...,"Stocks End Up, But Near Year Lows (Reuters) Re..."


In [10]:
# vectorize the data
from sklearn.feature_extraction.text import TfidfVectorizer

# only consider 10% of the data
dfs = df.sample(frac=0.1)

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(dfs["text"], dfs["label"], test_size=0.33, random_state=42)

vec = TfidfVectorizer(min_df=5, # at min 1% of docs
                        max_df=.5,  
                        stop_words='english',
                        max_features=2000,
                        ngram_range=(1,2))

# transform into TF-IDF values
X_train_tfidf = vec.fit_transform(X_train).todense()
X_test_tfidf = vec.transform(X_test).todense()


XGBoost provides an interface to SKLearn classifiers, e.g. they implement the same train and predict methods as an SKLearn classifier would. If you are interested in a more detailed overview, have a look at the [official documentation](https://xgboost.readthedocs.io/en/latest/python/index.html).

In [12]:
param_dist = {'objective':'multi:softmax', 'num_class': 5, 'n_estimators':25}
import xgboost as xgb

clf = xgb.XGBModel(**param_dist)

##TODO train the XGBModel 
clf.fit(X_train_tfidf, y_train,
        eval_set=[(X_train_tfidf, y_train), (X_test_tfidf, y_test)],
        eval_metric='merror',
        verbose=True)

evals_result = clf.evals_result()

predictions = clf.predict(X_test_tfidf)
print (predictions[:5], y_test[:5])
print (accuracy_score(y_test, predictions))

[0]	validation_0-merror:0.43470	validation_1-merror:0.44192
[1]	validation_0-merror:0.36580	validation_1-merror:0.37702
[2]	validation_0-merror:0.33756	validation_1-merror:0.34848
[3]	validation_0-merror:0.30012	validation_1-merror:0.31591
[4]	validation_0-merror:0.27512	validation_1-merror:0.29697
[5]	validation_0-merror:0.26244	validation_1-merror:0.28510
[6]	validation_0-merror:0.24055	validation_1-merror:0.26263
[7]	validation_0-merror:0.23259	validation_1-merror:0.25884
[8]	validation_0-merror:0.22488	validation_1-merror:0.24773
[9]	validation_0-merror:0.21405	validation_1-merror:0.23586
[10]	validation_0-merror:0.20386	validation_1-merror:0.22803
[11]	validation_0-merror:0.19055	validation_1-merror:0.22046
[12]	validation_0-merror:0.18644	validation_1-merror:0.22020
[13]	validation_0-merror:0.17848	validation_1-merror:0.21465
[14]	validation_0-merror:0.16878	validation_1-merror:0.21010
[15]	validation_0-merror:0.16244	validation_1-merror:0.20707
[16]	validation_0-merror:0.15871	v