Part 1: Build a classification model using text data

In part one of the homework, you will solve a text classification task.

You can download the following datasets from the HW data folder on the course website:

HW4_Text_train_data.csv and HW4_Text_test_data.csv

The data consists of reviews from a women's fashion online shop. Each record contains the review
text and whether the review author would recommend the product.

We are trying to determine whether a reviewer will recommend a product or not based on each review.


In a real application this might allow us to find out what is good or bad about certain products or to feature more typical reviews (like a very critical and a very positive one).

Use cross-validation to evaluate the results. Use a metric that’s appropriate for imbalanced classification (AUC or average precision for example), and inspect all models by visualizing the coefficients.

To complete part one of the homework do the following:

Import the text data and vectorize the review column into an X matrix. Then run at least three models and select a single best model. Note that you can also create three models that simply use different types of explanatory variables, such as a logistic regression with different n-grams or different tokenizers. Be sure to explain your choice and evaluate this model using the test set.
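As a minimal sketch of the "different n-grams" option mentioned in the task, the snippet below fits `CountVectorizer` twice on a two-review toy corpus, once with unigrams only and once with unigrams plus bigrams. The toy reviews and variable names are invented for illustration and are not taken from the homework data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy reviews, invented for illustration only.
toy_reviews = ["love this dress", "do not love this"]

# Unigrams only (the CountVectorizer default).
unigram_vect = CountVectorizer(ngram_range=(1, 1)).fit(toy_reviews)

# Unigrams and bigrams: word pairs such as "not love" become features too.
bigram_vect = CountVectorizer(ngram_range=(1, 2)).fit(toy_reviews)

# vocabulary_ maps each term to its column index; sort the terms for display.
unigram_vocab = sorted(unigram_vect.vocabulary_)
bigram_vocab = sorted(bigram_vect.vocabulary_)

print(unigram_vocab)
print(bigram_vocab)
```

Bigrams let a linear model distinguish "love" from "not love", at the cost of a larger and sparser feature matrix.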


In [0]:
from google.colab import files
test =files.upload()

Saving HW4_Text_test_data.csv to HW4_Text_test_data.csv


In [0]:
train =files.upload()

Saving HW4_Text_train_data.csv to HW4_Text_train_data.csv


In [0]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


In [0]:
test = pd.read_csv("HW4_Text_test_data.csv")
train = pd.read_csv("HW4_Text_train_data.csv")

In [0]:
frames = [test,train]
data = pd.concat(frames, ignore_index=True)
X = np.array(data['Review'])
y = np.array(data['Recommended'])

In [0]:
vect_X = CountVectorizer().fit(X)
X_all_train = vect_X.transform(X)
print("X:\n{}".format(repr(X_all_train)))

X:
<22642x14679 sparse matrix of type '<class 'numpy.int64'>'
	with 999907 stored elements in Compressed Sparse Row format>


In [0]:
# Use `data` here: `sorted_data` is only defined in the next cell.
data['Recommended'].value_counts()

1    18541
0     4101
Name: Recommended, dtype: int64
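The counts above show a clear class imbalance. The short calculation below, a sketch using only the two printed counts, works out the share of the majority class, which is also the accuracy a model would reach by always predicting "recommended". The variable names are illustrative.

```python
# Class counts copied from the value_counts() output.
n_recommended = 18541
n_not_recommended = 4101
n_total = n_recommended + n_not_recommended

# Share of the majority class = accuracy of always predicting "recommended".
majority_share = n_recommended / n_total
imbalance_ratio = n_recommended / n_not_recommended

print("Total reviews: {}".format(n_total))
print("Majority-class baseline accuracy: {:.3f}".format(majority_share))
print("Recommended : not recommended = {:.2f} : 1".format(imbalance_ratio))
```

Any accuracy figure on the full data should be read against this baseline, which is why the task statement suggests AUC or average precision instead.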

In [0]:
sorted_data = data.sort_values(by=['Recommended'])
# Sorting is ascending, so the 4101 not-recommended reviews (label 0) come first.
text_train_neg = np.array(sorted_data['Review'])[0:100]        # label 0
text_train_pos = np.array(sorted_data['Review'])[10000:10100]  # label 1

text_train = np.concatenate((text_train_neg, text_train_pos), axis=0)
print(text_train.shape)

(200,)


In [0]:
vect = CountVectorizer().fit(text_train)
X = vect.transform(text_train)
print("X:\n{}".format(repr(X)))

X:
<200x1880 sparse matrix of type '<class 'numpy.int64'>'
	with 9746 stored elements in Compressed Sparse Row format>


In [0]:
feature_names = vect.get_feature_names()
print("Number of features: {}".format(len(feature_names)))
print("First 20 features:\n{}".format(feature_names[:20]))
print("Features 210 to 230:\n{}".format(feature_names[210:230]))
print("Every 200th feature:\n{}".format(feature_names[::200]))

Number of features: 1880
First 20 features:
['00p', '10', '100', '11', '110', '115', '115lbs', '118', '12', '120', '125', '125lb', '128', '130', '130lbs', '135lbs', '138', '14', '140', '145lbs']
Features 210 to 230:
['blend', 'blog', 'blouse', 'blouses', 'blown', 'blue', 'board', 'bod', 'bodice', 'body', 'bohemian', 'boho', 'bold', 'booties', 'boots', 'booty', 'boring', 'both', 'bother', 'bottom']
Every 200th feature:
['00p', 'beware', 'cry', 'fits', 'initially', 'mock', 'predicted', 'shirt', 'target', 'weeks']


In [0]:
# Labels for the same two slices that were used to build text_train.
y_1 = np.array(sorted_data['Recommended'])[0:100]
y_2 = np.array(sorted_data['Recommended'])[10000:10100]

y = np.append(y_1, y_2)

print(X.shape)
print(y.shape)

(200, 1880)
(200,)


In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(150, 1880)
(150,)
(50, 1880)
(50,)


In [0]:
print(X_test)

### Logistic Regression 

In [0]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

Mean cross-validation accuracy: 0.67
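`cross_val_score` reports accuracy by default for classifiers, while the task statement asks for a metric suited to imbalanced classes. The sketch below shows how the `scoring` argument switches to ROC AUC or average precision. It runs on synthetic stand-in data from `make_classification` (an assumption made so the example is self-contained), not on the review data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (roughly 80% positives), not the review data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, weights=[0.2, 0.8], random_state=0
)

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = LogisticRegression(max_iter=1000)

auc_scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="roc_auc")
ap_scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="average_precision")

print("Mean ROC AUC: {:.2f}".format(auc_scores.mean()))
print("Mean average precision: {:.2f}".format(ap_scores.mean()))
```

The same `scoring` argument is accepted by `GridSearchCV`, so a search over `C` can be tuned on AUC as well.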


In [0]:
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)

Best cross-validation score: 0.67
Best parameters:  {'C': 0.01}


In [0]:
print("Test score: {:.2f}".format(grid.score(X_test, y_test)))

Test score: 0.72


In [0]:
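import warnings
from sklearn.model_selection import KFold

# NOTE: X and y are ordered by class (rows 0-99 are label 0, rows 100-199 are label 1),
# so unshuffled KFold produces test folds dominated by a single class.
kfold = KFold(n_splits=5)

# Keep the call inside the block so warnings are actually suppressed.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    scores = cross_val_score(LogisticRegression(), X, y, cv=kfold)

print("Cross-validation scores:\n{}".format(scores))
print("Cross-validation scores Mean:\n{}".format(scores.mean()))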

Cross-validation scores:
[0.55  0.525 0.45  0.45  0.3  ]
Cross-validation scores Mean:
0.45499999999999996
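One plausible reason these fold scores sit below 0.5 is the ordering of the data: `X` and `y` were built from 100 reviews of one class followed by 100 of the other, and `KFold` without shuffling takes contiguous blocks. The sketch below uses stand-in labels with that same layout (an assumption for illustration) to compare the class mix of each test fold under `KFold` and `StratifiedKFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Stand-in labels with the same layout as y: 100 zeros followed by 100 ones.
y_sorted = np.array([0] * 100 + [1] * 100)
X_dummy = np.zeros((200, 1))  # features are irrelevant for the split itself

def positive_share_per_fold(splitter):
    """Fraction of class-1 samples in each test fold."""
    return [float(y_sorted[test_idx].mean())
            for _, test_idx in splitter.split(X_dummy, y_sorted)]

plain = positive_share_per_fold(KFold(n_splits=5))
stratified = positive_share_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

With contiguous folds, four of the five test folds contain a single class, so the model is scored on a class mix very different from the one it was trained on. Stratified or shuffled folds avoid this.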


In [0]:
lgs = LogisticRegression()
lgs.fit(X_train.toarray(),y_train)
pred = lgs.predict(X_test.toarray())

In [0]:
from sklearn.metrics import confusion_matrix, classification_report

print("accuracy: {:.2f}".format(lgs.score(X_test.toarray(), y_test)))
print(confusion_matrix(y_test,pred))
print(classification_report(y_test,pred))

accuracy: 0.80
[[20  3]
 [ 7 20]]
              precision    recall  f1-score   support

           0       0.74      0.87      0.80        23
           1       0.87      0.74      0.80        27

    accuracy                           0.80        50
   macro avg       0.81      0.81      0.80        50
weighted avg       0.81      0.80      0.80        50
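The figures in the report follow directly from the confusion matrix. The sketch below recomputes them by hand in plain Python, using the scikit-learn convention that rows are true classes and columns are predicted classes. The helper name `per_class_metrics` is illustrative.

```python
# Confusion matrix printed for logistic regression:
# rows are true classes (0, 1), columns are predicted classes (0, 1).
cm = [[20, 3],
      [7, 20]]

def per_class_metrics(cm, k):
    """Precision, recall, F1, and support for class k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)   # column sum
    actual_k = sum(cm[k])                     # row sum (support)
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, actual_k

total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(2)) / total
metrics = [per_class_metrics(cm, k) for k in range(2)]

# Weighted average: each class's precision weighted by its support.
weighted_precision = sum(p * support for p, _, _, support in metrics) / total

for k, (p, r, f1, support) in enumerate(metrics):
    print("class {}: precision={:.2f} recall={:.2f} f1={:.2f} support={}".format(
        k, p, r, f1, support))
print("accuracy={:.2f} weighted precision={:.2f}".format(accuracy, weighted_precision))
```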



### Naive Bayes




In [0]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold

In [0]:
nb = GaussianNB()
nb.fit(X_train.toarray(),y_train)
pred = nb.predict(X_test.toarray())

In [0]:
print("accuracy: {:.2f}".format(nb.score(X_test.toarray(), y_test)))
print(confusion_matrix(y_test,pred))
print(classification_report(y_test,pred))

accuracy: 0.66
[[17  6]
 [11 16]]
              precision    recall  f1-score   support

           0       0.61      0.74      0.67        23
           1       0.73      0.59      0.65        27

    accuracy                           0.66        50
   macro avg       0.67      0.67      0.66        50
weighted avg       0.67      0.66      0.66        50



In [0]:
kfold = KFold(n_splits=5)
# Run cross-validation once and reuse the scores for both printouts.
nb_scores = cross_val_score(nb, X.toarray(), y, cv=kfold)

print("Cross-validation scores:\n{}".format(nb_scores))
print("Cross-validation scores Mean:\n{}".format(nb_scores.mean()))

Cross-validation scores:
[0.275 0.325 0.45  0.525 0.375]
Cross-validation scores Mean:
0.39


### Support Vector Machines

In [0]:
from sklearn.svm import LinearSVC

In [0]:
svmc = LinearSVC()

In [0]:
svmc.fit(X_train,y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

In [0]:
pred = svmc.predict(X_test)

In [0]:
print("accuracy: {:.2f}".format(svmc.score(X_test, y_test)))
print(confusion_matrix(y_test,pred))
print(classification_report(y_test,pred))

accuracy: 0.72
[[18  5]
 [ 9 18]]
              precision    recall  f1-score   support

           0       0.67      0.78      0.72        23
           1       0.78      0.67      0.72        27

    accuracy                           0.72        50
   macro avg       0.72      0.72      0.72        50
weighted avg       0.73      0.72      0.72        50



In [0]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

In [0]:
import warnings

kfold = KFold(n_splits=5)

# Keep the call inside the block so warnings are actually suppressed.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    svm_scores = cross_val_score(svmc, X, y, cv=kfold)

print("Cross-validation scores:\n{}".format(svm_scores))
print("Cross-validation scores Mean:\n{}".format(svm_scores.mean()))

Cross-validation scores:
[0.55  0.55  0.475 0.475 0.325]
Cross-validation scores Mean:
0.4750000000000001


Logistic regression performs best of the three on the held-out split. It has the highest weighted-average precision (0.81, versus 0.73 for the support vector machine and 0.67 for Naive Bayes), and its confusion matrix shows that it classified the most recommended reviews correctly (20 of 27, versus 18 and 16). Because the goal is a model that predicts well on unseen data, logistic regression is the recommended approach, as it predicts more accurately than Naive Bayes and the support vector machine.
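The task statement also asks for the models to be inspected through their coefficients. A minimal sketch of how that could look for logistic regression is given below. It uses a six-review toy corpus invented for illustration (not the homework data) and prints the terms with the most negative and most positive coefficients instead of plotting them.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy reviews and labels, invented for illustration (1 = recommended).
toy_reviews = [
    "love this dress perfect fit",
    "perfect color love it",
    "great fit great fabric",
    "cheap fabric returned it",
    "poor fit returned",
    "cheap and poor quality",
]
toy_labels = [1, 1, 1, 0, 0, 0]

vect_demo = CountVectorizer().fit(toy_reviews)
clf_demo = LogisticRegression().fit(vect_demo.transform(toy_reviews), toy_labels)

# vocabulary_ maps term -> column index; order the terms by column.
terms = sorted(vect_demo.vocabulary_, key=vect_demo.vocabulary_.get)
coefs = clf_demo.coef_.ravel()
order = np.argsort(coefs)

most_negative = [terms[i] for i in order[:3]]
most_positive = [terms[i] for i in order[-3:]]
print("Pushes toward 'not recommended':", most_negative)
print("Pushes toward 'recommended':   ", most_positive)
```

On the real data the same pairing of terms and coefficients could feed a bar chart of the largest positive and negative weights.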


Part 2: Build a predictive neural network using Keras

To complete part two of the homework do the following:

Run a multilayer perceptron (feed forward neural network) with two hidden layers on the iris dataset using the keras Sequential interface.

Data can be imported via the following link:

http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv

Include code for selecting the number of hidden units using GridSearchCV and evaluation on a test-set.  Describe the differences in the predictive accuracy of models with different numbers of hidden units.  Describe the predictive strength of your best model.  Be sure to explain your choice and evaluate this model using the test set.

In [0]:
!pip install tensorflow

In [0]:
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Activation, Dense

model = Sequential()

In [0]:
from keras.utils import to_categorical

In [0]:
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.datasets import load_iris

In [0]:
iris = load_iris()

In [0]:
X = iris['data']
y = to_categorical(iris['target'])
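`to_categorical` turns the integer class labels into one-hot rows, which is the target format that a 3-unit softmax output trained with `categorical_crossentropy` expects. The NumPy sketch below illustrates the same encoding on a few example labels. It is an illustration of the idea, not a description of how Keras implements it.

```python
import numpy as np

# Integer class labels in the same form as iris['target'] (values 0, 1, 2).
labels = np.array([0, 2, 1, 2])
n_classes = 3

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(n_classes)[labels]
print(one_hot)

# argmax along each row recovers the original integer labels.
recovered = one_hot.argmax(axis=1)
print(recovered)
```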

In [0]:
X_train,X_test, y_train,y_test = train_test_split(X,y, test_size = 0.3)

In [0]:
model = Sequential()

In [40]:
model = Sequential([
    Dense(10, input_shape=(4,)),
    Activation('relu'),
    Dense(10),
    Activation('relu'),
    Dense(3),
    Activation('softmax'),
])


model.summary()

Model: "sequential_42"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_52 (Dense)             (None, 10)                50        
_________________________________________________________________
activation_52 (Activation)   (None, 10)                0         
_________________________________________________________________
dense_53 (Dense)             (None, 10)                110       
_________________________________________________________________
activation_53 (Activation)   (None, 10)                0         
_________________________________________________________________
dense_54 (Dense)             (None, 3)                 33        
_________________________________________________________________
activation_54 (Activation)   (None, 3)                 0         
Total params: 193
Trainable params: 193
Non-trainable params: 0
_______________________________________________________
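The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The short sketch below reproduces the three counts and the total. The helper name `dense_params` is illustrative.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Layer widths of the network in the summary: 4 inputs -> 10 -> 10 -> 3.
layer_sizes = [4, 10, 10, 3]
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print(per_layer)
print(sum(per_layer))
```

The activation layers add no parameters, which is why they show 0 in the summary.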

In [0]:
model.compile(loss ='categorical_crossentropy',optimizer = "adam", metrics=['accuracy'])

In [42]:
model.fit(X_train,y_train, validation_data=(X_test,y_test), epochs=50, batch_size=10)

Train on 105 samples, validate on 45 samples
Epoch 1/50
...
Epoch 50/50


<keras.callbacks.callbacks.History at 0x7f2ced38c128>

In [0]:
from sklearn.model_selection import GridSearchCV
from keras.wrappers.scikit_learn import KerasClassifier
import numpy as np


### Adam Optimizer 


In [0]:
def create_model():
  # Two hidden layers of 10 ReLU units each, softmax output over the 3 iris classes.
  # Sequential, Dense, and Activation are imported in the cells above.
  model = Sequential([
    Dense(10, input_shape=(4,)),
    Activation('relu'),
    Dense(10),
    Activation('relu'),
    Dense(3),
    Activation('softmax'),])
  model.compile(loss ='categorical_crossentropy',optimizer = "adam", metrics=['accuracy'])
  return model

In [0]:
model = KerasClassifier(build_fn=create_model, verbose=0) 


param_grid = dict(epochs=[100,200,300])
grid = GridSearchCV(estimator=model, param_grid=param_grid)
grid_result = grid.fit(X, y)

In [0]:
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.913333 using {'epochs': 300}


### Nadam Optimizer

In [0]:
def create_model():
  # Same architecture as before; only the optimizer changes to Nadam.
  # Sequential, Dense, and Activation are imported in the cells above.
  model = Sequential([
    Dense(10, input_shape=(4,)),
    Activation('relu'),
    Dense(10),
    Activation('relu'),
    Dense(3),
    Activation('softmax'),])
  model.compile(loss ='categorical_crossentropy',optimizer = "nadam", metrics=['accuracy'])
  return model

In [0]:
model = KerasClassifier(build_fn=create_model, verbose=0) 


param_grid = dict(epochs=[100,200,300])
grid = GridSearchCV(estimator=model, param_grid=param_grid)
grid_result = grid.fit(X, y)

In [50]:
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.933333 using {'epochs': 200}


For this neural network, three epoch counts (100, 200, and 300) and two optimizers (Adam and Nadam) were compared to find the most predictive model. "Adam is a combination of RMSprop with momentum and Nadam is a combination of RMSprop with Nesterov momentum." The grid search shows that Nadam with 200 epochs performs best, with a mean cross-validated accuracy of 93.33%, meaning that 93.33% of the predictions made on the held-out folds were correct. Adam needed 300 epochs and reached only 91.33%, a difference of 2 percentage points. The model performs quite well given the parameters chosen and the data provided. Among the 100-, 200-, and 300-epoch models, the 200-epoch model is the one to use, and between Adam and Nadam, Nadam is recommended for training and predicting on this data.
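The grids above vary only the number of epochs, while the task statement asks for a search over the number of hidden units. A hedged sketch of such a search follows. It swaps in scikit-learn's `MLPClassifier` for the Keras `Sequential` model, because the Keras scikit-learn wrapper is not available in every Keras version. The grid values, the scaling step, and all variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, stratify=y_iris, random_state=0
)

# Two hidden layers of equal width; the widths in the grid are arbitrary choices.
pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(max_iter=2000, random_state=0),
)
param_grid_units = {
    "mlpclassifier__hidden_layer_sizes": [(2, 2), (10, 10), (50, 50)]
}

search = GridSearchCV(pipe, param_grid_units, cv=3)
search.fit(X_tr, y_tr)  # the test split is held out from the search

print("Best hidden layer sizes:", search.best_params_)
print("Mean CV accuracy per setting:", search.cv_results_["mean_test_score"])
print("Test accuracy: {:.2f}".format(search.score(X_te, y_te)))
```

Fitting the search on the training split only, and scoring once on the test split afterwards, keeps the test accuracy an honest estimate of performance on unseen data.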