# HW04: ML and DL

Remember that homework is graded on completion. **You can skip one section without losing credit.**

## Load and Pre-process Text
We perform sentiment analysis on the [Movie Review Data](https://www.cs.cornell.edu/people/pabo/movie-review-data/). To learn more about the data, see [the paper](https://www.cs.cornell.edu/home/llee/papers/pang-lee-stars.pdf) (optional reading).

In [1]:
# In this tutorial, we do sentiment analysis
# download the data
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar xf aclImdb_v1.tar.gz

!wget https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_data.tar.gz
!wget https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_whole_review.tar.gz
 
!tar xf scale_data.tar.gz 
!tar xf scale_whole_review.tar.gz

--2023-03-22 13:40:19--  https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2023-03-22 13:40:24 (16.6 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]

--2023-03-22 13:40:42--  https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_data.tar.gz
Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 132.236.207.36
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.36|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4029756 (3.8M) [application/x-gzip]
Saving to: ‘scale_data.tar.gz’


2023-03-22 13:40:42 (17.9 MB/s) - ‘scale_data.tar.gz’ saved [4029756/4029756]

--2023-03-22 13:40:42--  https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_whole_review.tar.gz

First, we load the data with the function provided below. Note that it also preprocesses the text with gensim's simple_preprocess() function, and that the data is then split into a training set and a test set.

In [2]:
import os
from gensim.utils import simple_preprocess
def load_data():
    examples, labels = [], []
    authors = os.listdir("scale_whole_review")
    for author in authors:
        path = os.listdir(os.path.join("scale_whole_review", author, "txt.parag"))
        fn_ids = os.path.join("scaledata", author, "id." + author)
        fn_ratings = os.path.join("scaledata", author, "rating." + author)
        with open(fn_ids) as ids, open(fn_ratings) as ratings:
            for idx, rating in zip(ids, ratings):
                labels.append(float(rating.strip()))
                filename_text = os.path.join("scale_whole_review", author, "txt.parag", idx.strip() + ".txt")
                with open(filename_text, encoding='latin-1') as f:
                    examples.append(" ".join(simple_preprocess(f.read())))
    return examples, labels
                  
X,y  = load_data()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print ("text:", X_train[0], "\nlabel:", y_train[0])

text: starring jodie foster liam neeson natasha richardson richard libertini nick searcy director michael apted producers renee missel and jodie foster screenplay william nicholson and mark handley based on the play idioglossia by mark handley cinematography dante spinotti music mark isham released by twentieth century fox nell jodie foster return to dramatic acting following flirtation with maverick action comedy is an entirely human movie in this lush green world of rolling hills and crystal pools technology is an unwelcome intruder civilization threatening monster both are slaves to the avaricious nell is about the importance of communication and interaction about how the events of childhood shape life and about the difficulty and rewards of reaching out to others nell foster has lived her entire life alone in the woods with an aging mother she is eventually discovered by local doctor jerome lovell liam neeson who comes to her secluded ramshackle hut on the occasion of her mother de
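The cleaned text above has no digits, punctuation, capitals, or single-letter words. A minimal standard-library sketch of that behaviour follows. It only approximates the defaults of `simple_preprocess` (lowercasing, alphabetic tokens, token length between 2 and 15); the helper name `approx_simple_preprocess` and the sample sentence are made up for illustration.

```python
import re

# Assumption: this only approximates gensim's simple_preprocess defaults;
# it is not the library's implementation.
TOKEN_RE = re.compile(r"[^\W\d]+")  # runs of word characters without digits

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Return lowercase alphabetic tokens whose length is within bounds."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens
            if min_len <= len(t) <= max_len and not t.startswith("_")]

# Hypothetical sample sentence, not taken from the dataset.
sample = "Nell (1994) is a 10/10 movie, directed by Michael Apted!"
cleaned = " ".join(approx_simple_preprocess(sample))
print(cleaned)
```

The year, the score, and the one-letter word "a" are dropped, which matches the style of the printed training example.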

In [3]:
len(X_train)

3354

## Vectorize the data

(3354, 1000)

In [4]:
##TODO vectorize the pre-processed text using TfidfVectorizer
##TODO transform X_train to TF-IDF values
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(min_df=0.01, 
                        max_df=0.9,  
                        max_features=1000,
                        stop_words='english',
                        use_idf=True, # the new piece
                        ngram_range=(1,2))

X_train_tfidf = tfidf.fit_transform(X_train)

##TODO transform X_test to TF-IDF values
X_test_tfidf = tfidf.transform(X_test)
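To make the weighting concrete, here is a small hand-rolled sketch of TF-IDF on toy data. It assumes the default smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 row normalization; the function name `tfidf_rows` and the toy documents are illustrative only, and the sketch ignores `min_df`, `max_df`, n-grams, and stop words.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with smoothed IDF and L2-normalized rows (toy version)."""
    n = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return idf, rows

# Made-up, pre-tokenized documents.
toy_docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
idf, rows = tfidf_rows(toy_docs)
print(idf)
print(rows[1])
```

Rare terms such as "bad" receive a larger weight than "movie", which occurs in two of the three toy documents. Note that the IDF values are learned on the training data only, which is why the test set is passed through `transform` rather than `fit_transform`.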

In [5]:
##TODO scale both training and test data with the standard scaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=False)

X_train_scaled = scaler.fit_transform(X_train_tfidf)
X_test_scaled = scaler.transform(X_test_tfidf)
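The scaler is created with `with_mean=False` because subtracting the mean would turn the zeros of a sparse matrix into non-zero values. The sketch below shows the idea on a tiny dense example: each column is divided by its population standard deviation and not centered. The helper name `scale_columns_no_center` and the numbers are made up for illustration.

```python
from statistics import pstdev

def scale_columns_no_center(rows):
    """Divide each column by its population std without subtracting the mean."""
    n_cols = len(rows[0])
    stds = []
    for j in range(n_cols):
        s = pstdev([row[j] for row in rows])
        stds.append(s if s > 0 else 1.0)  # constant columns are left unscaled
    return [[row[j] / stds[j] for j in range(n_cols)] for row in rows]

# Toy matrix: the first column is mostly zero, the second is constant.
toy = [[0.0, 1.0], [0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
scaled = scale_columns_no_center(toy)
print(scaled)
```

Zeros stay zero, so sparsity is preserved, while the first column ends up with unit variance.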

## ElasticNet

In [6]:
##TODO train the ElasticNet

from sklearn.linear_model import ElasticNet
enet_reg = ElasticNet(alpha=.1, l1_ratio=.0001)
enet_reg.fit(X_train_scaled, y_train)
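For intuition about `alpha` and `l1_ratio`, the following sketch computes the two penalty terms of the elastic net objective, assuming the usual form `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`. The function name and the weight vector are hypothetical.

```python
def elastic_net_penalty(weights, alpha, l1_ratio):
    """Return the (L1 term, L2 term) of the elastic net penalty."""
    l1 = sum(abs(w) for w in weights)
    l2_sq = sum(w * w for w in weights)
    l1_term = alpha * l1_ratio * l1
    l2_term = 0.5 * alpha * (1 - l1_ratio) * l2_sq
    return l1_term, l2_term

# Made-up coefficients, with the hyperparameters used in this cell.
toy_weights = [0.5, -0.2, 0.0, 1.0]
l1_term, l2_term = elastic_net_penalty(toy_weights, alpha=0.1, l1_ratio=0.0001)
print(l1_term, l2_term)
```

With `l1_ratio=.0001` the L1 term is negligible next to the L2 term, so this configuration behaves almost like ridge regression.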



In [7]:
##TODO predict the testset
y_pred = enet_reg.predict(X_test_scaled)
from sklearn.metrics import r2_score, accuracy_score, mean_squared_error, balanced_accuracy_score

##TODO print mean squared error and r2 score on the test set
print(f"MSE:{mean_squared_error(y_test, y_pred)}")
print(f"r2 score:{r2_score(y_test, y_pred)}")

MSE:0.019260634645141047
r2 score:0.40548886766376624
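Both metrics are easy to compute by hand, which helps when reading the numbers above. The sketch below uses made-up ratings; `mse_manual` and `r2_manual` are illustrative names, not scikit-learn functions.

```python
def mse_manual(y_true, y_pred):
    """Mean of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_manual(y_true, y_pred):
    """1 - SS_res / SS_tot: share of variance explained by the predictions."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy ratings on the same 0-1 scale as the movie reviews.
toy_true = [0.2, 0.4, 0.6, 0.8]
toy_pred = [0.3, 0.4, 0.5, 0.9]
toy_mse = mse_manual(toy_true, toy_pred)
toy_r2 = r2_manual(toy_true, toy_pred)
baseline_r2 = r2_manual(toy_true, [0.5] * 4)  # always predict the mean
print(toy_mse, toy_r2, baseline_r2)
```

Because the ratings lie between 0 and 1, a small MSE is expected. The R2 score is the more informative number here, since it compares the model against always predicting the mean rating.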


## Logistic Regression

Next, we train a logistic regression model for binary prediction on these movie reviews. To get two bins, we transform the continuous ratings into two classes: one class contains all the negative ratings (value < 0.5), the other all the positive ratings (value >= 0.5).

In [8]:
y_train = [1 if i >= 0.5 else 0 for i in y_train]
y_test = [1 if i >= 0.5 else 0 for i in y_test]


In [9]:
##TODO train logistic regression on X_train
from sklearn.linear_model import LogisticRegression
logistic_regression = LogisticRegression()

##TODO train a logistic regression

logistic_regression.fit(X_train_scaled, y_train)
##TODO predict the testset 
y_pred = logistic_regression.predict(X_test_scaled)

##LogisticRegression.predict already returns class labels (0 or 1), so the 0.5 threshold below leaves them unchanged; it only matters for continuous scores
def map_predictions(predicted):
    pred = [1 if i > 0.5 else 0 for i in predicted]
    return pred

##TODO print the accuracy of our classifier on the testset
accuracy_score(y_test, map_predictions(y_pred))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.7378934624697336
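A short sketch of how a logistic model turns a linear score into a class label, and why thresholding the resulting hard labels at 0.5 a second time changes nothing. The scores are made up; `predict_label` is an illustrative helper, not part of scikit-learn.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(z, threshold=0.5):
    """Class 1 if the predicted probability exceeds the threshold."""
    return 1 if sigmoid(z) > threshold else 0

# Hypothetical linear scores w.x + b for four reviews.
scores = [-2.0, -0.1, 0.3, 4.0]
labels = [predict_label(z) for z in scores]

# Thresholding labels that are already 0 or 1 is a no-op.
relabelled = [1 if p > 0.5 else 0 for p in labels]
print(labels, relabelled)
```

A probability above 0.5 corresponds to a positive linear score, so the decision boundary is the hyperplane where the score is zero.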

# Deep Learning

## MLP

In [10]:
#Import the AG news dataset (same as hw01)
#Download them from here 
!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv

import pandas as pd
import nltk
df = pd.read_csv('train.csv')

df.columns = ["label", "title", "lead"]
label_map = {1:"world", 2:"sport", 3:"business", 4:"sci/tech"}
def replace_label(x):
	return label_map[x]
df["label"] = df["label"].apply(replace_label) 
df["text"] = df["title"] + " " + df["lead"]
df = df.sample(n=10000) # only use 10K data points
df.head()

--2023-03-22 13:42:23--  https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29470338 (28M) [text/plain]
Saving to: ‘train.csv’


2023-03-22 13:42:24 (174 MB/s) - ‘train.csv’ saved [29470338/29470338]



| | label | title | lead | text |
|---|---|---|---|---|
| 41249 | sport | Weingartner, Washington fall in Seoul | Seoul, Korea (Sports Network) - Third-seeded G... | Weingartner, Washington fall in Seoul Seoul, K... |
| 109431 | sport | McCarthy #39;s late header sends Porto through | DEFENDING champions Porto sneaked through to t... | McCarthy #39;s late header sends Porto through... |
| 75588 | world | Australian Govt. in Control of Both Houses (AP) | AP - Final election results Thursday showed Jo... | Australian Govt. in Control of Both Houses (AP... |
| 905 | world | National pharmacare program would reduce hospi... | Canadian Press - TORONTO (CP) - A national pha... | National pharmacare program would reduce hospi... |
| 96220 | sci/tech | EMI's download music sales soar | EMI sees download music sales rise by nearly 6... | EMI's download music sales soar EMI sees downl... |


In [11]:
# create a new variable "business" that takes value 1 if the label is business and 0 otherwise
df['business'] = df['label'].apply(lambda x: int(x=='business'))
y = df['business'].values
df['business'].head()

41249     0
109431    0
75588     0
905       0
96220     0
Name: business, dtype: int64

In [12]:
import spacy
nlp = spacy.load('en_core_web_sm')
from sklearn.feature_extraction.text import CountVectorizer

# pre-process text as you did in HW02
def tokenize(x):
    return [w.lemma_.lower() for w in nlp(x) if not w.is_stop and not w.is_punct and not w.is_digit]
df["tokens"] = df["text"].apply(lambda x: tokenize(x))
df["preprocessed"] = df["tokens"].apply(lambda x: " ".join(x))

##TODO vectorize the pre-processed text using CountVectorizer

vec = CountVectorizer(min_df=0.01, # keep terms found in at least 1% of documents
                        max_df=.9,  
                        max_features=1000,
                        stop_words='english',
                        ngram_range=(1,3))
X_count = vec.fit_transform(df['preprocessed'])
X_count.shape

pd.to_pickle(X_count, 'X.pkl')
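The `min_df` and `max_df` arguments prune the vocabulary by document frequency before any counting matters. The sketch below mimics that filter on toy data, assuming float values are read as fractions of the number of documents. `filter_vocab` and the documents are hypothetical, and the sketch ignores stop words, n-grams, and `max_features`.

```python
from collections import Counter

def filter_vocab(docs, min_df=0.01, max_df=0.9):
    """Keep terms whose document frequency lies within the given fractions."""
    n_docs = len(docs)
    # in how many documents each term appears at least once
    doc_freq = Counter(t for doc in docs for t in set(doc))
    kept = [t for t, d in doc_freq.items()
            if min_df * n_docs <= d <= max_df * n_docs]
    return sorted(kept)

# Made-up tokenized headlines.
toy_docs = [
    ["stock", "market", "the"],
    ["market", "the"],
    ["oil", "the"],
    ["stock", "the", "the"],
]
vocab_kept = filter_vocab(toy_docs, min_df=0.5, max_df=0.9)
print(vocab_kept)
```

Here "the" is dropped for appearing in every document and "oil" for appearing in only one, which leaves the terms most likely to discriminate between articles.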



In [13]:
vocab = vec.get_feature_names_out()
pd.to_pickle(vocab, 'vocab.pkl')

Y = df['business']

Your goal here is to use features from the vectorized text to predict whether the snippet comes from a business article.

In [14]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.callbacks import EarlyStopping

## TODO build a MLP model with at least 2 hidden layers with ReLU activation, followed by dropout and an output layer with sigmoid activation
model = Sequential()
model.add(Dense(50, # the layer is of type Dense and there are 50 neurons in layer
                input_dim=X_count.shape[1], #number of inputs
                activation='relu')) # optional activation function

# adding more layers
# only the first layer needs the input dimension; Keras infers it for the later layers
model.add(Dense(50, activation='relu')) #hidden layer

# add the output layer; sigmoid keeps the prediction in (0, 1), as binary cross-entropy expects
model.add(Dense(1, activation='sigmoid')) #output layer
model.summary()


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 50)                21200     
                                                                 
 dense_1 (Dense)             (None, 50)                2550      
                                                                 
 dense_2 (Dense)             (None, 1)                 51        
                                                                 
Total params: 23,801
Trainable params: 23,801
Non-trainable params: 0
_________________________________________________________________
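The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. In the sketch below the input width of 423 is not stated anywhere in the notebook; it is inferred from the first row of the summary (21200 / 50 - 1).

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return (n_inputs + 1) * n_units

# Assumption: 423 input features, inferred from the summary above.
layer_sizes = [423, 50, 50, 1]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)
print(counts, total)
```

Nearly all parameters sit in the first layer, because the vocabulary is much wider than the hidden layers.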


In [15]:
## TODO compile the model
model.compile(loss="binary_crossentropy", #specify loss function
              optimizer="adam",
              metrics=["accuracy"]) 
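Binary cross-entropy takes logarithms of the predicted probability and of its complement, so it is only well defined for outputs strictly between 0 and 1. The sketch below, with made-up pre-activation values, illustrates why the sigmoid output requested in the TODO fits this loss while a ReLU output does not.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y, p):
    """Loss for a single example; only defined for 0 < p < 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

logits = [-3.0, 0.0, 2.5]  # hypothetical pre-activation values
sigmoid_out = [sigmoid(z) for z in logits]
relu_out = [relu(z) for z in logits]

# Check which activation always yields a valid probability.
sigmoid_ok = all(0.0 < p < 1.0 for p in sigmoid_out)
relu_ok = all(0.0 < p < 1.0 for p in relu_out)
loss_at_half = binary_cross_entropy(1, sigmoid(0.0))
print(sigmoid_ok, relu_ok, loss_at_half)
```

ReLU can emit exactly 0 or values above 1, where the logarithms are undefined. A framework may clip such values instead of failing, but the resulting loss no longer measures a probability.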

In [None]:
## TODO fit the model using early stopping to predict the business label

model_trained = model.fit(X_count.toarray(), Y, epochs=4) 

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
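The TODO mentions early stopping, which halts training once the validation loss stops improving. The following is a simplified, framework-free sketch of patience-based stopping. The function name and the loss values are made up, and it ignores options such as a minimum improvement threshold.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops once the monitored loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses per epoch.
toy_val_losses = [0.50, 0.40, 0.41, 0.42, 0.30]
stop_epoch, best_epoch = early_stopping_epoch(toy_val_losses, patience=2)
print(stop_epoch, best_epoch)
```

With a patience of 2, training stops before the late improvement in the last epoch, which shows the trade-off in choosing the patience value. In Keras the analogous setup would be an `EarlyStopping(monitor="val_loss", patience=2)` callback passed to `model.fit` through `callbacks=[...]`, together with `validation_split` so that a validation loss is available.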


## Autoencoders

In [None]:
from keras import backend as K

def r2(y_true, y_pred):
    SS_res =  K.sum(K.square( y_true-y_pred )) 
    SS_tot = K.sum(K.square( y_true - K.mean(y_true) ) ) 
    return ( 1 - SS_res/(SS_tot + K.epsilon()) )

##TODO build a simple autoencoder with two compression layers and two reconstruction layers using ReLU
model = Sequential()
model.add(Dense(100,
                input_dim=X_count.shape[1],
                activation='relu'))
model.add(Dense(25, activation='relu', name='compression_layer'))
model.add(Dense(100, activation='relu'))
model.add(Dense(X_count.shape[1], activation='relu'))

model.summary()
##TODO compile and fit the model minimizing "mean_squared_error"
model.compile(loss='mean_squared_error',
              optimizer='adam',
              metrics=[r2])

model_info = model.fit(X_count.toarray(), X_count.toarray(),
                       epochs=10,
                       validation_split=.2)

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_11 (Dense)            (None, 100)               40900     
                                                                 
 compression_layer (Dense)   (None, 25)                2525      
                                                                 
 dense_12 (Dense)            (None, 100)               2600      
                                                                 
 dense_13 (Dense)            (None, 408)               41208     
                                                                 
Total params: 87,233
Trainable params: 87,233
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
