# Reflection

Upon the start of the term and this project I had an intermediate skill level in regards to general Python scripting, an extremely rudimentary understanding of Machine Learning, and an intermediate level of understanding of the basic techniques and tools of data scientists. I had been using Python for about 1.5 years, had been studying data science for about half a year, and had really no formal experience with Machine Learning. Fortunately, the process of completing this project helped noticeably improve my proficiency in all three of these domains. In particular, this was the first project I took part in where I employed Scikit-Learn and employed several Machine Learning models, bringing me from absolute Machine Learning novice firmly to a beginner. Additionally, I had studied pandas on my own time in the previous quarter, and felt that this project yielded the perfect opportunity to actually apply my new-found skills to something. I used pandas DataFrames to hold the features for each training instance which I thought may be useful for classification. I also figured out how to easily use pandas in conjunction with Scikit-Learn, which I know will be highly useful for quite some time. Once getting the data into the appropriate form for use with the various Scikit-Learn Machine Learning algorithm classes (converting DataFrames to numpy arrays is extremely easy), I tried to find the best classification algorithm for assigning a text review of a business a star rating (from one to five). This turned out to be much more complicated than the usual binary classification problem, which is all I had worked with up till this point. Thus this process gave me my first real experience with multi-class classification, and showed me which Machine Learning classification algorithms are compatible with the multi-class problem. In addition, given the ordinal nature of the problem (_2 stars_ is much closer to _1 star_ than it is to _5 stars_), it also gave me insight into various algorithms for classification of ordinal classes. I read mostly about ordinal regression, which I found isn't currently employed in Scikit-Learn. Thus I employed the third-party module mord to perform a rudimentary ordinal regression. Looking forward, I hope to perform many more projects such as this. Perhaps I may even continue this project and manage to make it into work decent enough to publish a paper on! In terms of self-study, I plan to begin studying TensorFlow and neural networks very soon, along with continuing my studies of more advanced Python techniques.

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

# Data Preparation

We need to get the data into numpy arrays. Luckily, pandas makes it extremely easy to convert a DataFrame to an array, using the values attribute.

In [2]:
train = pd.read_pickle('train.pkl')

In [3]:
train.head(10)

Unnamed: 0,stars,length,one_interesting,two_interesting,three_interesting,four_interesting,five_interesting,punc,sentiment,subjectivity
132541,3,327,1,0,0,0,0,5,0.3075,0.52375
231497,4,483,0,0,0,0,1,0,0.20551,0.353375
43791,5,743,1,0,0,0,3,11,0.259984,0.366984
16507,2,1412,5,0,0,0,9,21,0.068792,0.464641
357253,1,233,1,0,0,0,0,3,0.105556,0.544444
73016,3,1028,0,0,0,0,1,21,0.172989,0.576623
67692,4,1741,3,0,1,2,0,20,0.281471,0.512352
299721,1,1241,3,2,0,1,1,5,0.100668,0.413503
320024,4,1313,0,0,0,2,0,11,0.033741,0.668297
266360,5,829,3,1,0,0,3,2,0.307685,0.550926


In [4]:
yolo = train.values

In [5]:
yolo

array([[ 3.00000000e+00,  3.27000000e+02,  1.00000000e+00, ...,
         5.00000000e+00,  3.07500000e-01,  5.23750000e-01],
       [ 4.00000000e+00,  4.83000000e+02,  0.00000000e+00, ...,
         0.00000000e+00,  2.05509642e-01,  3.53374656e-01],
       [ 5.00000000e+00,  7.43000000e+02,  1.00000000e+00, ...,
         1.10000000e+01,  2.59984127e-01,  3.66984127e-01],
       ...,
       [ 5.00000000e+00,  9.50000000e+01,  0.00000000e+00, ...,
         2.00000000e+00,  6.00000000e-01,  7.70000000e-01],
       [ 4.00000000e+00,  8.00000000e+02,  0.00000000e+00, ...,
         0.00000000e+00,  1.45553221e-01,  4.45378151e-01],
       [ 1.00000000e+00,  8.05000000e+02,  1.00000000e+00, ...,
         8.00000000e+00, -2.27678571e-01,  5.15773810e-01]])

In [6]:
stars = train['stars'].values

In [7]:
y = stars

In [8]:
# The stars are what we're trying to find, so we definitely need to remove that from the training set
del train['stars']
X = train.values

In [9]:
X.shape

(378951, 9)

In [10]:
y.shape

(378951,)

In [11]:
# Shuffle the data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [12]:
X[0]

array([3.2700e+02, 1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
       0.0000e+00, 5.0000e+00, 3.0750e-01, 5.2375e-01])

In [13]:
y

array([3, 4, 5, ..., 5, 4, 1], dtype=uint8)

Scaling the data makes many of the algorithms in Scikit-Learn more precise. It is interesting to note that Random Forests aren't affected by this preprocessing at all.

In [14]:
# Scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
X_test_scaled = scaler.fit_transform(X_test.astype(np.float64))

# Perform classification

I'm really not sure which algorithms to use here, so I think it may be best just to try some out and see which works best. This is not a binary classification, so we are limited to a subset of the available Scikit-Learn algorithms which support multi-class classification. Since we've already munged the data, fit it into numpy arrays, and standardized it, performing the machine learning itself is a breeze. After we get some baseline results for each algorithm, I plan to select the ones that perform well, tune their hyperparameters, and then create a hard-voting ensemble to maximize our performance.
## Random Forest

In [15]:
# First let's just try a Random Forest and see what happens
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

forest_clf = RandomForestClassifier(random_state = 0, n_estimators = 20)
scores = cross_val_score(forest_clf, X_train_scaled, y_train, cv = 10, scoring = 'f1_micro')
scores.mean()


0.4928091236010471

In [16]:
# Evaluate performance on the test data
from sklearn.metrics import f1_score
forest_clf.fit(X_train_scaled, y_train)
y_pred = forest_clf.predict(X_test_scaled)
print(f1_score(y_pred, y_test, average = 'micro'))
print(f1_score(y_pred, y_test, average = None, labels = [1, 2, 3, 4, 5]))


0.493765750550857
[0.60486255 0.17073895 0.18828674 0.32049361 0.65530058]


This seems pretty decent. I'm under the impression that as long as it's above the baseline of randomly guessing, including it in an ensemble will improve results (law of large numbers?). Looks like we're much better at getting 1 or 5 star reviews correct than the middle ones. This definitely makes sense to me, since these are the least represented classes, and there were very few interesting 2 and 3 star words compared to the remaining categories.

### Optimization

For each algorithm, perform a grid search in order to find the best hyperparameters. This will take a long time. 

In [64]:
from sklearn.model_selection import GridSearchCV
forest = RandomForestClassifier(random_state = 0)
param_grid = {"n_estimators" : [1, 5, 10, 20],
              "max_depth": [3, None],
              "max_features": [1, 3, 9]}

grid_search = GridSearchCV(forest, param_grid = param_grid)
grid_search.fit(X_train_scaled, y_train)


GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_estimators': [1, 5, 10, 20], 'max_depth': [3, None], 'max_features': [1, 3, 9]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [65]:
grid_search.best_params_

{'max_depth': 3, 'max_features': 9, 'n_estimators': 20}

In [38]:
# Ok, now train the forest classifier again with the best params
forest = RandomForestClassifier(max_depth = 3, max_features = 9, n_estimators = 20, random_state = 0)
scores = cross_val_score(forest, X_train_scaled, y_train, cv = 10, scoring = 'f1_micro')
scores.mean()

0.5053173578351511

In [39]:
forest.fit(X_train_scaled, y_train)
y_pred = forest.predict(X_test_scaled)
print(f1_score(y_pred, y_test, average = 'micro'))
print(f1_score(y_pred, y_test, average = None, labels = [1, 2, 3, 4, 5]))


0.5055349579765408
[0.60796471 0.         0.         0.28187404 0.68491555]


  'recall', 'true', average, warn_for)


## Logistic Regression

In [17]:
from sklearn.linear_model import LogisticRegression
log_clf = LogisticRegression(multi_class = 'multinomial', solver = 'newton-cg')
log_clf.fit(X_train_scaled, y_train)

scores2 = cross_val_score(log_clf, X_train_scaled, y_train, cv = 10, scoring = 'f1_micro')
scores2.mean()




0.5254057206996384

In [22]:
scores2

array([0.52694769, 0.52516657, 0.52665083, 0.52406241, 0.52419435,
       0.52557971, 0.52490434, 0.52597895, 0.52502227, 0.52555009])

In [19]:
y_pred2 = log_clf.predict(X_test_scaled)
print(f1_score(y_pred2, y_test, average = 'micro'))
print(f1_score(y_pred2, y_test, average = None, labels = [1, 2, 3, 4, 5]))

0.5259859350054756
[0.61783984 0.07250418 0.11152148 0.26508059 0.68794503]


The results here looks good. This seems to the best at identifying 1 and 5 star reviews.

### Optimization

In [40]:
from sklearn.linear_model import LogisticRegression
param_grid2 = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] }
log = LogisticRegression()
grid_search2 = GridSearchCV(log, param_grid=param_grid2)
grid_search2.fit(X_train_scaled, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [41]:
grid_search2.best_params_

{'C': 10}

In [42]:
log = LogisticRegression(C = 10)
log.fit(X_train_scaled, y_train)

scores2 = cross_val_score(log, X_train_scaled, y_train, cv = 10, scoring = 'f1_micro')
scores2.mean()

0.5210515994836349

In [43]:
y_pred2 = log.predict(X_test_scaled)
print(f1_score(y_pred2, y_test, average = 'micro'))
print(f1_score(y_pred2, y_test, average = None, labels = [1, 2, 3, 4, 5]))

0.522291564961539
[0.61921738 0.03388823 0.0688295  0.23749865 0.68414029]


Looks like grid search actually gave us a slightly lower result on the test set? Interesting, but it's miniscule.

## Linear SVC

In [None]:
from sklearn.svm import LinearSVC
svc_clf = LinearSVC(multi_class = 'crammer_singer')
svc_clf.fit(X_train_scaled, y_train)

y_pred3 = svc_clf.predict(X_test)
print(f1_score(y_pred3, y_test, average = 'micro'))
print(f1_score(y_pred3, y_test, average = None, labels = [1, 2, 3, 4, 5]))


This is worse than just randomly guessing, and it took forever! Not a good choice. (I ran this and it gave me something like 17%. It took several hours to do. The result is gone now and I don't think re-running it is worth the time, so I'll just leave the code here).
## Bernoulli Naive Bayes

In [20]:
from sklearn.naive_bayes import BernoulliNB

ber_clf = BernoulliNB()
ber_clf.fit(X_train_scaled, y_train)

scores3 = cross_val_score(ber_clf, X_train_scaled, y_train, cv = 10, scoring = 'f1_micro')
scores3.mean()

0.4699235133137211

In [21]:
y_pred4 = ber_clf.predict(X_test_scaled)
print(f1_score(y_pred4, y_test, average = 'micro'))
print(f1_score(y_pred4, y_test, average = None, labels = [1, 2, 3, 4, 5]))

0.4704384425591429
[0.4439941  0.07918196 0.11491012 0.18343751 0.65270592]


### Optimize

In [45]:
from sklearn.naive_bayes import BernoulliNB

ber = BernoulliNB()
param_grid3 = {'alpha' : [0.1, 1, 3, 5, 10]}
grid_search3 = GridSearchCV(ber, param_grid = param_grid3)
grid_search3.fit(X_train_scaled, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'alpha': [0.1, 1, 3, 5, 10]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score='warn', scoring=None, verbose=0)

In [46]:
grid_search3.best_params_

{'alpha': 0.1}

In [48]:
ber = BernoulliNB(alpha = 0.1)
ber.fit(X_train_scaled, y_train)

scores3 = cross_val_score(ber, X_train_scaled, y_train, cv = 10, scoring = 'f1_micro')
scores3.mean()

0.4699235133137211

In [49]:
y_pred4 = ber.predict(X_test_scaled)
print(f1_score(y_pred4, y_test, average = 'micro'))
print(f1_score(y_pred4, y_test, average = None, labels = [1, 2, 3, 4, 5]))

0.4704384425591429
[0.4439941  0.07918196 0.11491012 0.18343751 0.65270592]


## Gaussian NB

I'm under the impression that Gaussian NB has no hyperparameters to mess with, so we don't have to optimize this one.

In [51]:
from sklearn.naive_bayes import GaussianNB

gauss_clf = GaussianNB()
gauss_clf.fit(X_train_scaled, y_train)
scores4 = cross_val_score(gauss_clf, X_train_scaled, y_train, cv = 10, scoring = 'f1_micro')
scores4.mean()


0.4721731156566008

In [26]:
y_pred5 = gauss_clf.predict(X_test_scaled)
print(f1_score(y_pred5, y_test, average = 'micro'))
print(f1_score(y_pred5, y_test, average = None, labels = [1, 2, 3, 4, 5]))

0.47262867622804816
[0.50651904 0.1555837  0.16765441 0.15225431 0.63409363]


# Create the ensemble?

In [52]:
from sklearn.ensemble import VotingClassifier
vote_clf = VotingClassifier(estimators = [
    ('forest', forest),
    ('log', log),
    ('ber', ber),
    ('gauss', gauss_clf)
], voting = 'hard')

vote_clf.fit(X_train_scaled, y_train)

VotingClassifier(estimators=[('forest', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=3, max_features=9, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight...iNB(alpha=0.1, binarize=0.0, class_prior=None, fit_prior=True)), ('gauss', GaussianNB(priors=None))],
         flatten_transform=None, n_jobs=1, voting='hard', weights=None)

In [53]:
final_pred = vote_clf.predict(X_test_scaled)
print(f1_score(final_pred, y_test, average = 'micro'))
print(f1_score(final_pred, y_test, average = None, labels = [1, 2, 3, 4, 5]))

0.5170270876489292
[0.59534308 0.06287754 0.1056527  0.22495395 0.68361429]


  if diff:


In [54]:
final_scores = cross_val_score(vote_clf, X_train_scaled, y_train, cv = 10, scoring = 'f1_micro')
final_scores.mean()

  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:


0.5156452378747759

So the ensemble seems to be a bit worse than some of our previously established methods. Perhaps this could be made better with a bit of tuning? Perhaps I could try soft-voting instead of hard in the future. Still, it feels cool to have set up my own ensemble. Scikit-Learn makes it much easier than it would be otherwise!

## Ordinal Regression
This isn't supported by SciKit-Learn, but I found a third party package --- _mord_ --- that claims to do it in a similar fashion.
The documentation for this module is EXTREMELY barebones, so I'm not sure if I'll even be able to get it to work. Might as well give it a shot. I'll try out each of the algorithms that are provided in the module (that I can get to work, some of them seem to not exist anymore?).

In [58]:
import mord

ord_clf = mord.LogisticIT()
ord_clf.fit(X_train_scaled, y_train)

LogisticIT(alpha=1.0, max_iter=1000, verbose=0)

In [59]:
ord_scores = cross_val_score(ord_clf, X_train_scaled, y_train, cv = 10, scoring = 'f1_micro')
ord_scores.mean()

0.506887452170471

In [61]:
ord_pred = ord_clf.predict(X_test_scaled)
print(f1_score(ord_pred, y_test, average = 'micro'))
print(f1_score(ord_pred, y_test, average = None, labels = [1, 2, 3, 4, 5]))

0.5066828515259068
[0.58639519 0.         0.         0.20384836 0.68965432]


  'recall', 'true', average, warn_for)


In [60]:
ord_clf2 = mord.LogisticAT()
ord_clf2.fit(X_train_scaled, y_train)

LogisticAT(alpha=1.0, max_iter=1000, verbose=0)

In [62]:
ord_scores2 = cross_val_score(ord_clf2, X_train_scaled, y_train, cv = 10, scoring = 'f1_micro')
ord_scores2.mean()

0.46190790341733734

In [63]:
ord_pred2 = ord_clf2.predict(X_test_scaled)
print(f1_score(ord_pred2, y_test, average = 'micro'))
print(f1_score(ord_pred2, y_test, average = None, labels = [1, 2, 3, 4, 5]))

0.4620601390666438
[0.49196267 0.21603734 0.21226194 0.37056341 0.64304772]


I'm not sure if I'm doing something wrong here, and there are no instructions whatsoever in the documentation. But these seem to be performing worse than the ensemble we've already created. Perhaps sometime in a future I can figure out a way to get these to perform better. It is interesting to note this model is performing by far the best on 2 and 3 star reviews out of all the ones we've tried!