## HW 4
### Text Analysis and Neural Networks
Gian Zlupko 

#### Part I: Build a classification model using text data

#### *Import the text data, vectorize the clickbait headline column into an X matrix.  Then run logistic regression at least three times and select a single best model.  Note that you should create three logistic regression models with different tokenization approaches.  You should not change your modeling approach, you should simply experiment with different tokenizers. Be sure to explain your choices and evaluate your models using cross validation and using test set data.* 

First, I import the data below. 


In [23]:
import pandas as pd 
import os 

# change working directory 
os.chdir('/Users/gianzlupko/Desktop/GR5073 ML/gr5073_ML/data')

# import the data
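# .squeeze("columns") turns each single-column file into a Series
# (the `squeeze=True` keyword of read_csv was removed in pandas 2.0)
X = pd.read_csv("X_train.csv").squeeze("columns")
y_labels = pd.read_csv("y_train.csv").squeeze("columns")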

# one hot encode the y data 
y =  pd.get_dummies(y_labels)
y = y.iloc[:, 0] # select only first column (note: clickbait = 1)

# inspect the data sets 
print(y.head()) 
X.head()

0    0
1    0
2    1
3    1
4    1
Name: clickbait, dtype: uint8


0       MyBook Disk Drive Handles Lots of Easy Backups
1                       CIT Posts Eighth Loss in a Row
2    Candy Carson Singing The "National Anthem" Is ...
3    Why You Need To Stop What You're Doing And Dat...
4    27 Times Adele Proved She's Actually The Reale...
Name: headline, dtype: object

With the data loaded, I try the first of three tokenization strategies. The first simply splits each headline into single-word tokens, which makes it the simplest of the three strategies that I compare.

#### Model I: Unigram Tokenization

In [24]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer().fit(X)
X_tokens = vect.transform(X) # name this data set 'X_tokens' so that it does not overwrite the original raw X data 
print("X_tokens:\n{}".format(repr(X_tokens))) 

X_tokens:
<24979x20332 sparse matrix of type '<class 'numpy.int64'>'
	with 220242 stored elements in Compressed Sparse Row format>


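The matrix has 24,979 rows (one per headline) and 20,332 columns (one per unique token in the vocabulary). The 220,242 stored elements are the nonzero entries, that is, the distinct headline-token pairs, not the total number of words. I used the default settings for `CountVectorizer`, so tokenization returns a sparse matrix of token counts. 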
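
To make the shape and stored-element figures concrete, here is a minimal sketch on three made-up headlines (the `toy_*` names and the headlines are illustrative, not part of the assignment data). It shows that the number of stored elements counts nonzero cells, so a word repeated within one headline adds to the count value but not to the number of stored elements.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy headlines standing in for the real headline column.
toy_headlines = ["the cat sat", "the cat sat on the mat", "dogs bark"]

toy_vect = CountVectorizer().fit(toy_headlines)
toy_counts = toy_vect.transform(toy_headlines)

n_docs, n_vocab = toy_counts.shape   # rows = headlines, columns = unique tokens
n_stored = toy_counts.nnz            # nonzero (headline, token) cells
n_words = int(toy_counts.sum())      # total token occurrences

print("shape:", (n_docs, n_vocab))
print("stored elements:", n_stored)
print("total words:", n_words)
```

In the second toy headline "the" appears twice, so the total word count is one higher than the number of stored elements.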

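Next, I follow a standard ML approach to tuning and fitting a logistic regression model on the matrix of token counts. I hold out a test set, then use `GridSearchCV` with 5-fold cross-validation on the training set to tune the regularization strength `C`, scoring each candidate by F1.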

In [34]:
# Set up training and test data
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# train test split 
X_train, X_test, y_train, y_test = train_test_split(X_tokens, y, random_state=42)

# fit and tune model using grid search CV 
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(solver = 'liblinear'), param_grid, cv=5, scoring = 'f1')
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)


Best cross-validation score: 0.97
Best parameters:  {'C': 10}


The results indicate that the best logistic regression model (`C = 10`) achieved a mean cross-validated F1 score of 0.97. This is the score to beat in the subsequent rounds of testing different tokenization strategies.
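
The 0.97 above is a cross-validation score on the training split; the held-out test split is not scored in this part. A minimal sketch of a test-set check follows, using made-up headlines (`topics`, `firms`, and the headline templates are hypothetical stand-ins for the real data). It also wraps the vectorizer in a `Pipeline`, one common way to make sure the vocabulary is learned from the training folds only. Treat it as an illustration under those assumptions, not as the procedure used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy data (1 = clickbait, 0 = news); not the assignment data.
topics = ["Cats", "Pizza", "Adele", "Coffee", "Mondays",
          "Dogs", "Summer", "College", "Netflix", "Brunch"]
firms = ["CIT", "Acme", "Globex", "Initech", "Umbrella",
         "Hooli", "Stark", "Wayne", "Wonka", "Tyrell"]
clickbait = [f"{n + 10} Things You Need To Know About {t}" for n, t in enumerate(topics)]
clickbait += [f"Why {t} Will Change Your Life" for t in topics]
news = [f"{f} Posts Quarterly Loss" for f in firms]
news += [f"{f} Shares Fall After Earnings Report" for f in firms]
texts = clickbait + news
labels = [1] * len(clickbait) + [0] * len(news)

# Split the raw text first, so the vectorizer never sees test headlines.
X_train_raw, X_test_raw, y_train_toy, y_test_toy = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# The pipeline refits the vectorizer inside every cross-validation fold.
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression(solver="liblinear")),
])
toy_grid = GridSearchCV(pipe, {"clf__C": [0.001, 0.01, 0.1, 1, 10]}, cv=5, scoring="f1")
toy_grid.fit(X_train_raw, y_train_toy)

# Score the refit best model once on the held-out test split.
test_f1 = f1_score(y_test_toy, toy_grid.predict(X_test_raw))
print("Best parameters: ", toy_grid.best_params_)
print("Test-set F1: {:.2f}".format(test_f1))
```

The same pattern would apply to the real data by passing the raw headline Series and labels to `train_test_split` in place of the toy lists.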

The next tokenization strategy adds bigrams to the unigrams. With `ngram_range = (1,2)`, the vectorizer keeps every single word and also every pair of adjacent words. 

#### Model II: Unigram and Bigram Tokenization

In [61]:
vect_bigrams = CountVectorizer(ngram_range = (1,2)).fit(X)
X_bigrams = vect_bigrams.transform(X) # name this data set 'X_bigrams' so that it does not overwrite X or X_tokens 
print("X_tokens:\n{}".format(repr(X_bigrams))) 

X_tokens:
<24979x135950 sparse matrix of type '<class 'numpy.int64'>'
	with 418238 stored elements in Compressed Sparse Row format>


After adding bigrams, the matrix has far more columns (135,950 versus 20,332), because there are many more distinct pairs of adjacent words than there are individual words. 
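
A small sketch of what `ngram_range` controls, on a made-up headline: `(1, 2)` keeps single words and adds adjacent word pairs, while `(2, 2)` would keep pairs only. The example headline and variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical one-headline corpus, used only to list the extracted features.
example = ["the cat sat"]

# Unigrams plus bigrams, as in the cell above.
uni_and_bi = sorted(CountVectorizer(ngram_range=(1, 2)).fit(example).vocabulary_)

# Bigrams only, for comparison.
bi_only = sorted(CountVectorizer(ngram_range=(2, 2)).fit(example).vocabulary_)

print("(1, 2):", uni_and_bi)
print("(2, 2):", bi_only)
```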

In [62]:
# train test split on the new data
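# use the same random_state as Model I so that both models are tuned on the same rows
X_bigrams_train, X_bigrams_test, y_bigrams_train, y_bigrams_test = train_test_split(X_bigrams, y, random_state=42)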

# fit and tune hyperparameters 
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(solver = 'liblinear'), param_grid, cv=5, scoring = 'f1')
grid.fit(X_bigrams_train, y_bigrams_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)

Best cross-validation score: 0.97
Best parameters:  {'C': 10}


With unigrams plus bigrams, the model achieved the same cross-validated F1 score (0.97) as it did with unigram tokenization alone, so the extra features did not help at this level of rounding.

Next, I will try one more tokenization strategy to see if I can improve model fit. 

#### Model III: N-gram Tokenization with TF-IDF and Stop Word Removal 

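For my third attempt, I use TF-IDF to rescale the token counts and remove English stop words. I had planned to extract unigrams through trigrams as well, but the vectorizer below keeps the default `ngram_range` of `(1, 1)`, so it extracts unigrams only; passing `ngram_range = (1,3)` would be needed to include bigrams and trigrams. 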
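
As a rough illustration of what the TF-IDF rescaling does, the sketch below reproduces one row of `TfidfVectorizer` output by hand on two made-up documents. It assumes the scikit-learn defaults (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the documents and variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-document corpus; "cat" appears in both, "sat" and "ran" in one each.
docs = ["cat sat", "cat ran"]

tfidf = TfidfVectorizer().fit(docs)
sk_row = tfidf.transform(docs).toarray()[0]   # TF-IDF weights for "cat sat"

# Manual computation; columns follow the sorted vocabulary.
vocab = sorted(tfidf.vocabulary_)
n_docs = len(docs)
df = np.array([sum(term in d.split() for d in docs) for term in vocab])
idf = np.log((1 + n_docs) / (1 + df)) + 1     # smoothed inverse document frequency
tf = np.array([docs[0].split().count(term) for term in vocab])
raw = tf * idf
manual_row = raw / np.linalg.norm(raw)        # L2-normalize the row

for term, weight in zip(vocab, manual_row):
    print("{}: {:.3f}".format(term, weight))
```

The word shared by both documents ends up with a smaller weight than the word unique to the first document, which is the down-weighting of common terms that TF-IDF is meant to provide.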

In [43]:

from sklearn.feature_extraction.text import TfidfVectorizer

vect_scaled = TfidfVectorizer(stop_words = "english").fit(X)
X_scaled = vect_scaled.transform(X)

X_scaled_train, X_scaled_test, y_scaled_train, y_scaled_test = train_test_split(X_scaled, y, random_state=42)

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(solver  = "liblinear", max_iter = 10000), param_grid, cv=5, scoring = "f1")
grid.fit(X_scaled_train, y_scaled_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)

Best cross-validation score: 0.95
Best parameters:  {'C': 10}


The results show that TF-IDF rescaling combined with stop word removal did not improve the model: the cross-validated F1 score fell from 0.97 to 0.95. This combination of preprocessing and tokenization performed the worst of the three attempts. 
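
The sketch below shows what `stop_words = "english"` removes from a made-up clickbait-style headline. Whether losing these words explains the lower score is an assumption that was not tested here; the headline is hypothetical and the code only lists which tokens the built-in stop word list drops.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical clickbait-style headline, not taken from the data set.
headline = ["Why You Need To See What This Dog Did"]

plain_vocab = sorted(CountVectorizer().fit(headline).vocabulary_)
filtered_vocab = sorted(CountVectorizer(stop_words="english").fit(headline).vocabulary_)

# Tokens present without filtering but dropped by the English stop word list.
removed = sorted(set(plain_vocab) - set(filtered_vocab))

print("kept:   ", filtered_vocab)
print("removed:", removed)
```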

#### Part II: Build a predictive neural network using Keras

*To complete part two of the homework do the following: Train test split the iris dataset and then run a multilayer perceptron (feed forward neural network) with two hidden layers on the iris dataset using the keras Sequential interface. Data can be imported via the following link: http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv Fit two models with different numbers of hidden layers and/or hidden neurons and evaluate each on a test set.  Describe the differences in the predictive accuracy of models with different numbers of hidden units/neurons.  Describe the predictive strength of your best model.  Be sure to explain your choice and evaluate this model using the test set.*

In [59]:
# load data
iris = pd.read_csv("http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv") 
iris = iris.drop(iris.columns[0], axis = 1) # drop the first column, which holds the row numbers stored in the CSV 
iris.head() 

   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
