# HW04: Supervised Machine Learning, Regression and XGBoost

Remember that these homework assignments are graded on completion. **You can skip one section without losing credit.**

## Load and Pre-process Text
We do sentiment analysis on the [Movie Review Data](https://www.cs.cornell.edu/people/pabo/movie-review-data/). If you would like to know more about the data, have a look at [the paper](https://www.cs.cornell.edu/home/llee/papers/pang-lee-stars.pdf) (but no need to do so).

In [1]:
import numpy as np
import pandas as pd

In [2]:
# In this tutorial, we do sentiment analysis
# download the data
#!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
#!tar xf aclImdb_v1.tar.gz

!wget -nc https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_data.tar.gz
!wget -nc https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_whole_review.tar.gz
 
!tar xf scale_data.tar.gz 
!tar xf scale_whole_review.tar.gz

File ‘scale_data.tar.gz’ already there; not retrieving.

File ‘scale_whole_review.tar.gz’ already there; not retrieving.



First, we load the data with the function provided below. Note that it also preprocesses the text using gensim's `simple_preprocess()` function. Afterwards, we split the data into a train set and a test set.
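To make the preprocessing step concrete, here is a rough pure-Python approximation of what `simple_preprocess()` does to a review. It is not gensim's implementation: the function name `approx_simple_preprocess` and the sample sentence are made up for illustration, and the length limits of 2 and 15 characters are assumed to match gensim's defaults.

```python
import re

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim's simple_preprocess (an approximation):
    lowercase, keep alphabetic tokens, drop very short and very long ones."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

# Hypothetical review snippet, not taken from the dataset.
review = "A GREAT film! I'd rate it 9/10."
tokens = approx_simple_preprocess(review)

# load_data() joins the tokens back into one string per review.
cleaned = " ".join(tokens)
print(cleaned)
```

Punctuation, digits, and one-letter tokens disappear, which is why the training example printed further down reads like a stream of lowercase words.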

In [3]:
# provided helper (not written by me): loads the reviews and ratings, one folder per author
import os
from gensim.utils import simple_preprocess
def load_data():
    examples, labels = [], []
    authors = os.listdir("scale_whole_review")
    for author in authors:
        fn_ids = os.path.join("scaledata", author, "id." + author)
        fn_ratings = os.path.join("scaledata", author, "rating." + author)
        with open(fn_ids) as ids, open(fn_ratings) as ratings:
            for idx, rating in zip(ids, ratings):
                labels.append(float(rating.strip()))
                filename_text = os.path.join("scale_whole_review", author, "txt.parag", idx.strip() + ".txt")
                with open(filename_text, encoding='latin-1') as f:
                    examples.append(" ".join(simple_preprocess(f.read())))
    return examples, labels
                  
X, y = load_data()

# all ratings are expected to lie in [0, 1]
assert all(0 <= label <= 1 for label in y)

In [4]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print("text:", X_train[0][0:100], '...')
print("label:", y_train[0])

text: bloody child the director writer cinematographer nina menkes screenwriter tinka menkes editors nina  ...
label: 0.6


## Vectorize the data

In [5]:
# train a TF-IDF vectorizer on X_train and vectorize X_train and X_test
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(min_df=0.01, # at min 1% of docs
                        max_df=.5,  
                        stop_words='english',
                        ngram_range=(1,2))

## train vectorizer
vec.fit(X_train)

## transform X_train to TF-IDF values
X_train_tfidf = vec.transform(X_train)
## transform X_test to TF-IDF values
X_test_tfidf = vec.transform(X_test)
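As a sketch of what the vectorizer computes, the following pure-Python snippet builds TF-IDF rows by hand on a made-up three-document corpus. It assumes scikit-learn's default weighting (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalised rows) and does not model `min_df`, `max_df`, stop words, or bigrams.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenised by whitespace.
docs = ["good film", "bad film", "good good plot"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
df = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Term count times IDF, then scale the row to unit L2 norm."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]
for doc, row in zip(docs, matrix):
    print(doc, [round(v, 3) for v in row])
```

Terms that occur in fewer documents ("bad", "plot") get a larger IDF than terms shared across documents ("film", "good").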

In [6]:
## scale both training and test data with the standard scaler
from sklearn.preprocessing import StandardScaler

# The StandardScaler normalizes columnwise:
assert np.isclose(1, np.var(StandardScaler().fit_transform(np.random.random(size=(4, 4))), axis=0)).all()

scaler = StandardScaler(with_mean=False)

X_train_tfidf = scaler.fit_transform(X_train_tfidf)
X_test_tfidf = scaler.transform(X_test_tfidf)
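The scaler is created with `with_mean=False` because the TF-IDF matrix is sparse: subtracting the column means would turn almost every zero into a non-zero value. The NumPy sketch below, on a made-up dense matrix, illustrates the idea of scaling without centering. It is an illustration of the arithmetic only, not scikit-learn's code.

```python
import numpy as np

# Hypothetical small feature matrix with many zeros.
X = np.array([[0.0, 2.0],
              [0.0, 4.0],
              [3.0, 0.0],
              [0.0, 6.0]])

# Divide each column by its (population) standard deviation, no centering.
scale = X.std(axis=0)
X_scaled = X / scale

print(X_scaled.var(axis=0))   # each column now has unit variance
print(X_scaled.mean(axis=0))  # the means are not zero, since we did not center
```

Zeros stay zeros under this transformation, so a sparse matrix keeps its sparsity pattern.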

## ElasticNet

In [7]:
## train an elastic net on the transformed output of the scaler
from sklearn.linear_model import ElasticNet

en = ElasticNet(alpha=0.01)

## train the ElasticNet
en.fit(X_train_tfidf, y_train)

## predict the testset
y_pred = en.predict(X_test_tfidf)


In [9]:
from sklearn.metrics import r2_score, accuracy_score, mean_squared_error, balanced_accuracy_score
## print mean squared error and r2 score on the test set

print(pd.Series({f.__name__: f(y_test, y_pred) for f in [mean_squared_error, r2_score]}).to_string())

mean_squared_error    0.016355
r2_score              0.483110
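For reference, both metrics can be computed by hand. The sketch below uses made-up values (not the notebook's predictions) and the standard definitions: the mean of squared residuals, and one minus the ratio of residual to total sum of squares.

```python
import numpy as np

# Hypothetical ratings and predictions, for illustration only.
y_true = np.array([0.2, 0.5, 0.7, 0.9])
y_hat = np.array([0.3, 0.5, 0.6, 0.8])

# Mean squared error: average squared residual.
mse = np.mean((y_true - y_hat) ** 2)

# R^2: share of the variance in y_true explained by the predictions.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, r2)
```

An R² of about 0.48, as reported above, means the model explains roughly half of the variance in the test ratings.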


## Logistic Regression

Next, we train a logistic regression model doing binary prediction on these movie reviews. To get two bins, we transform the continuous ratings into two classes, where one class contains all the negative ratings (value < 0.5) and the other class all the positive ratings (value > 0.5).
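One detail worth checking is what happens to a rating of exactly 0.5. The short sketch below, on made-up ratings, shows that `np.round` rounds halves to the nearest even number, so 0.5 ends up in the negative class, whereas an explicit `>= 0.5` threshold would put it in the positive class.

```python
import numpy as np

# Hypothetical ratings in [0, 1].
ratings = np.array([0.1, 0.4, 0.5, 0.6, 0.9])

# np.round uses round-half-to-even, so 0.5 becomes 0.0.
rounded = np.round(ratings)

# An explicit threshold assigns 0.5 to the positive class instead.
thresholded = (ratings >= 0.5).astype(float)

print(rounded)
print(thresholded)
```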

In [10]:
y_train = np.round(y_train)
y_test = np.round(y_test)

In [11]:
## train logistic regression on X_train
from sklearn.linear_model import LogisticRegression
logistic_regression = LogisticRegression(max_iter=1000)

## train a logistic regression
logistic_regression.fit(X_train_tfidf, y_train)

## predict the test set
## predict() already returns class labels (0.0 or 1.0), so no thresholding is needed
y_pred = logistic_regression.predict(X_test_tfidf)

## print the accuracy of our classifier on the testset
print("Accuracy: {:.2g}%".format(100 * accuracy_score(y_test, y_pred)))

## print the 10 most informative words of the regression (the 10 words having the highest coefficients)
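# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0
predictors = pd.Series(index=vec.get_feature_names_out(), data=logistic_regression.coef_.flatten()).nlargest(10)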
print("Predictors:")
print(predictors.to_string())

Accuracy: 80%
Predictors:
powerful        0.246628
easy            0.216025
solid           0.215400
speaks          0.197862
breathtaking    0.197833
equal           0.196745
honest          0.195792
delightful      0.195505
technique       0.195227
great           0.194487


## XGBoost

Lastly, we train an XGBoost classifier to do topic prediction on the AG news dataset, which is a multi-class prediction problem (4 classes). We again have to vectorize the data, train the classifier, predict the testset and output an evaluation metric (we go for accuracy).

In [12]:
%%capture
!pip install xgboost

In [13]:
#Import the AG news dataset (same as hw01)
#Download them from here 
#!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv

import pandas as pd
import nltk
df = pd.read_csv('train.zip')

df.columns = ["label", "title", "lead"]
df["text"] = df["title"] + " " + df["lead"]
df.head()

| | label | title | lead | text |
|---|---|---|---|---|
| 0 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... | Carlyle Looks Toward Commercial Aerospace (Reu... |
| 1 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... | Oil and Economy Cloud Stocks' Outlook (Reuters... |
| 2 | 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... | Iraq Halts Oil Exports from Main Southern Pipe... |
| 3 | 3 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... | Oil prices soar to all-time record, posing new... |
| 4 | 3 | Stocks End Up, But Near Year Lows (Reuters) | Reuters - Stocks ended slightly higher on Frid... | Stocks End Up, But Near Year Lows (Reuters) Re... |


In [14]:
# vectorize the data
from sklearn.feature_extraction.text import TfidfVectorizer

# only consider 10% of the data
dfs = df.sample(frac=0.1)

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(dfs["text"], dfs["label"], test_size=0.33, random_state=42)

vec = TfidfVectorizer(min_df=5, # term must appear in at least 5 docs
                        max_df=.5,  
                        stop_words='english',
                        max_features=2000,
                        ngram_range=(1,2))

# transform into TF-IDF values
# toarray() returns a plain ndarray, whereas todense() returns an np.matrix
X_train_tfidf = vec.fit_transform(X_train).toarray()
X_test_tfidf = vec.transform(X_test).toarray()


XGBoost provides a scikit-learn-compatible interface, i.e. its models implement the same `fit` and `predict` methods as a scikit-learn classifier would. If you are interested in a more detailed overview, have a look at the [official documentation](https://xgboost.readthedocs.io/en/latest/python/index.html).
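The convention in question is small: `fit(X, y)` learns from the training data and returns the estimator, and `predict(X)` returns one label per row. The toy class below is a hypothetical illustration of that convention in plain Python. It is neither XGBoost nor scikit-learn code.

```python
from collections import Counter

class MajorityClassifier:
    """Hypothetical toy estimator that follows the fit/predict convention."""

    def fit(self, X, y):
        # "Training" here just remembers the most frequent label.
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # Return one prediction per input row.
        return [self.majority_ for _ in X]

# Made-up features and labels.
clf_toy = MajorityClassifier().fit([[0], [1], [2]], [1, 3, 3])
preds = clf_toy.predict([[5], [6]])
print(preds)
```

Any object with these two methods can be swapped into the training and evaluation code used in this notebook, which is what makes the XGBoost wrapper convenient here.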

In [15]:
param_dict = {'objective': 'multi:softmax', 'num_class': 5, 'n_estimators': 25, 'eval_metric': 'mlogloss'}

# note how we only have 4 labels, but we need to pass "num_class": 5:
# the labels run from 1 to 4, and XGBoost expects labels in [0, num_class),
# so with "num_class": 4 we get the error "label must be in [0, num_class)."
import xgboost as xgb

clf = xgb.XGBModel(**param_dict)

## train the XGBModel 
clf.fit(X_train_tfidf, y_train)

XGBModel(base_score=0.5, booster='gbtree', colsample_bylevel=1,
         colsample_bynode=1, colsample_bytree=1, eval_metric='mlogloss',
         gamma=0, gpu_id=-1, interaction_constraints='',
         learning_rate=0.300000012, max_delta_step=0, max_depth=6,
         min_child_weight=1, monotone_constraints='()', n_estimators=25,
         n_jobs=8, num_class=5, num_parallel_tree=1, objective='multi:softmax',
         random_state=0, reg_alpha=0, reg_lambda=1, subsample=1,
         tree_method='exact', validate_parameters=1)
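An alternative to passing `num_class: 5` would be to re-encode the labels so that they start at zero, which allows `num_class` to equal the true number of classes. The NumPy sketch below shows one way to do this on made-up labels. It assumes, based on the error message quoted above, that the AG news labels run from 1 to 4.

```python
import numpy as np

# Hypothetical labels in the 1..4 range.
labels = np.array([3, 1, 4, 2, 3, 4])

# classes holds the sorted unique labels; encoded holds their zero-based indices.
classes, encoded = np.unique(labels, return_inverse=True)
print(classes)
print(encoded)

# After predicting zero-based classes, map them back to the original labels.
decoded = classes[encoded]
print(decoded)
```

With zero-based labels, predictions need to be mapped back through `classes` before comparing them to the original `y_test`.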

In [16]:
## predict the testset 
y_pred = clf.predict(X_test_tfidf)

## evaluate the predictions using accuracy as a metric
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.8042929292929293
