<div style="text-align: right"> Ali Emre Öz </div>
<div style="text-align: right"> 213950785 </div>

<h3 align="center">EECS 461 - MACHINE LEARNING PROJECT FINAL REPORT</h3> 

#### Classification Algorithms and Performance Measurements

For classification, I use four algorithms: a Naive Bayes classifier, a random forest classifier, logistic regression, and an AdaBoost classifier.

To measure performance, I use the Hamming loss, which is the fraction of labels that are incorrectly predicted. According to the scikit-learn documentation: "In multiclass classification, the Hamming loss correspond to the Hamming distance between y_true and y_pred".
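
As a minimal sketch of what this metric computes in the multiclass case, the snippet below counts the positions where the prediction differs from the true label and divides by the number of samples. The helper name `hamming_loss_multiclass` and the label values are made up for illustration; they are not part of scikit-learn or of the project data.

```python
def hamming_loss_multiclass(y_true, y_pred):
    """Fraction of samples whose predicted label differs from the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mismatches = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return mismatches / len(y_true)


# Hypothetical labels, for illustration only.
y_true_demo = [0, 2, 1, 3, 2]
y_pred_demo = [0, 1, 1, 3, 0]

demo_loss = hamming_loss_multiclass(y_true_demo, y_pred_demo)
print(demo_loss)  # 2 mismatches out of 5 samples
```

In this single-label setting the value equals one minus the accuracy, so a lower Hamming loss is better.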

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.ensemble import AdaBoostClassifier as AdaBoost
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.metrics import hamming_loss
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
import csv

In [2]:
x_test = pd.read_csv("x_test")

In [3]:
x_train = pd.read_csv("x_train")

In [4]:
y_test = pd.read_csv("y_test")

In [5]:
y_train = pd.read_csv("y_train")

In [6]:
del x_test["Unnamed: 0"]
del x_train["Unnamed: 0"]
del y_test["Unnamed: 0"]
del y_train["Unnamed: 0"]

In [7]:
x_train = x_train.values
x_test = x_test.values
y_train = y_train.values
y_test = y_test.values

##### 1. Naive Bayes Classifier

"Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of independence between every pair of features."

In [8]:
from sklearn.naive_bayes import GaussianNB
# one Gaussian Naive Bayes model per class (one-vs-rest)
nb_mdl = OneVsRestClassifier(GaussianNB(), n_jobs=-1)
nb_mdl.fit(x_train, y_train)
nb_pred = nb_mdl.predict(x_test)
hl = hamming_loss(y_test, nb_pred)
print("Naive Bayes")
print("Hamming Loss of NB is " + str(hl))

Naive Bayes
Hamming Loss of NB is 0.792222484734


As the Hamming loss shows, nearly 80% of the predictions are wrong. This is not a good result, and we need to improve it with other algorithms.
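
For reference, the "naive" independence assumption quoted above can be written out as follows. The notation is introduced here for illustration: $x = (x_1, \dots, x_n)$ is a feature vector and $y$ is a class. Treating the features as independent given the class lets the posterior factor into one term per feature:

$$
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

`GaussianNB` models each $P(x_i \mid y)$ as a Gaussian, and the predicted class is the one with the largest value of the right-hand side.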

##### 2. Logistic Regression

In [11]:
from sklearn.linear_model import LogisticRegression
lr_mdl = OneVsRestClassifier(LogisticRegression(class_weight='balanced', C=100, solver='sag', max_iter = 2500), n_jobs=-1)
lr_mdl.fit(x_train, y_train)
lr_pred = lr_mdl.predict(x_test)
hl = hamming_loss(y_test, lr_pred)
print("Logistic Regression")
print("Hamming Loss of LR is " + str(hl))

Logistic Regression
Hamming Loss of LR is 0.300182647572


As the Hamming loss shows, nearly 30% of the predictions are wrong. This is better than Naive Bayes.

##### 3. Logistic Regression with PCA

In [15]:
pca = PCA(n_components=0.95)
pca.fit(x_train)
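# number of leading components whose cumulative explained variance is still below 95%
# (the variable name says 90, but the threshold used here is 0.95)
var90pcs = len(pca.explained_variance_ratio_[np.cumsum(pca.explained_variance_ratio_)<.95])
# reduce data using PCA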
X_train_reduced = pca.transform((x_train))
X_test_reduced = pca.transform((x_test))
lr_mdl_pca = OneVsRestClassifier(LogisticRegression(class_weight='balanced', C=100, solver='sag', max_iter = 5000), n_jobs=-1)
lr_mdl_pca.fit(X_train_reduced[:,0:var90pcs], y_train)
lr_pred_pca = lr_mdl_pca.predict(X_test_reduced[:,0:var90pcs])
hl = hamming_loss(y_test, lr_pred_pca)

print("Logistic Regression with PCA")
print("Hamming Loss of LR with PCA is " + str(hl))

Logistic Regression with PCA
Hamming Loss of LR with PCA is 0.304294489677


Applying PCA left the Hamming loss essentially unchanged (0.304 versus 0.300). However, the speed of the algorithm improved.
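
To make the component selection concrete, here is a small sketch in plain Python. The ratios are made-up values and `components_for_variance` is a hypothetical helper, not part of scikit-learn. It contrasts the smallest number of components that reaches a variance threshold with the number of components whose cumulative variance is still below it, which is the pattern used for `var90pcs`.

```python
from itertools import accumulate


def components_for_variance(explained_variance_ratio, threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    for k, total in enumerate(accumulate(explained_variance_ratio), start=1):
        if total >= threshold:
            return k
    return len(explained_variance_ratio)


# Hypothetical explained-variance ratios, sorted in decreasing order.
ratios_demo = [0.5, 0.25, 0.125, 0.0625, 0.0625]
threshold_demo = 0.9

# Components needed to reach the threshold.
k_reaching = components_for_variance(ratios_demo, threshold_demo)

# Components whose cumulative ratio is still strictly below the threshold.
k_below = sum(1 for total in accumulate(ratios_demo) if total < threshold_demo)

print(k_reaching, k_below)
```

With these values the two counts differ by one, so slicing the PCA output with the "below threshold" count may drop the last retained component. Whether that matters for the results here was not measured.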

##### 4. Random Forest Classifier

In [13]:
rfc_mdl = RFC(n_estimators=50, max_depth=25, class_weight ='balanced', n_jobs=-1).fit(x_train,y_train)
rf_pred = rfc_mdl.predict(x_test)
hl = hamming_loss(y_test, rf_pred)
print("Random Forest")
print("Hamming Loss of RF is " + str(hl))

Random Forest
Hamming Loss of RF is 0.128712016575


The random forest classifier gives a good result compared with the previous algorithms: it predicts about 87% of the data correctly.

##### 5. Random Forest Classifier with PCA

In [14]:
rfc_mdl_pca = RFC(random_state = 0).fit(X_train_reduced[:,0:var90pcs],y_train)
pred_pca = rfc_mdl_pca.predict(np.array(X_test_reduced[:,0:var90pcs]))
hl = hamming_loss(y_test, pred_pca)
print("Random Forest with PCA")
print("Hamming Loss of RF with PCA is " + str(hl))

Random Forest with PCA
Hamming Loss of RF with PCA is 0.107235024716


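With PCA, the Hamming loss decreases further, to about 0.107. Note that this model uses the default hyperparameters (with `random_state = 0`) rather than the settings from the previous section, so the two results are not a like-for-like comparison.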

##### 6. AdaBoost Classifier

"An AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases." from scikit-learn documentation

In [16]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
ada_clf = OneVsRestClassifier(AdaBoostClassifier(DecisionTreeClassifier(max_depth=9), n_estimators=50, algorithm="SAMME", learning_rate=0.5))
ada_clf.fit(x_train, y_train)
ada_clf_pred = ada_clf.predict(x_test)
hl = hamming_loss(y_test, ada_clf_pred)
print("Adaboost Classifier for MultiClass Prediction with SAMME")
print("Hamming Loss of Adaboost Classifier for MultiClass Prediction with SAMME is " + str(hl))

Adaboost Classifier for MultiClass Prediction with SAMME
Hamming Loss of Adaboost Classifier for MultiClass Prediction with SAMME is 0.100465251527


The Hamming loss shows that the ensemble gave a better result: about 0.100, the best so far.
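
The reweighting idea behind AdaBoost, described in the quotation above, can be sketched as a single boosting round in plain Python. This follows the textbook SAMME update under simplifying assumptions and is not scikit-learn's actual implementation. The helper name `samme_reweight` and the sample weights are hypothetical.

```python
import math


def samme_reweight(weights, is_wrong, n_classes, learning_rate=1.0):
    """One SAMME-style round: return (alpha, new normalized sample weights)."""
    total = sum(weights)
    # Weighted error of the current weak learner.
    err = sum(w for w, wrong in zip(weights, is_wrong) if wrong) / total
    if err <= 0.0 or err >= 1.0 - 1.0 / n_classes:
        raise ValueError("weak learner must beat chance and must not be perfect")
    # Weight of this weak learner in the final vote.
    alpha = learning_rate * (math.log((1.0 - err) / err) + math.log(n_classes - 1.0))
    # Increase the weight of misclassified samples only, then renormalize.
    boosted = [w * math.exp(alpha) if wrong else w for w, wrong in zip(weights, is_wrong)]
    norm = sum(boosted)
    return alpha, [w / norm for w in boosted]


# Hypothetical round: four equally weighted samples, the last one misclassified.
weights_demo = [0.25, 0.25, 0.25, 0.25]
wrong_demo = [False, False, False, True]

alpha_demo, new_weights_demo = samme_reweight(weights_demo, wrong_demo, n_classes=2)
print(alpha_demo, new_weights_demo)
```

After the update, the misclassified sample carries half of the total weight, so the next weak learner concentrates on it. With `OneVsRestClassifier`, each underlying AdaBoost model solves a binary problem, which is why the sketch uses `n_classes=2`.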

#### Discussion

With a larger dataset, or with information about each film's director, actors, and screenwriter, I could get better results.

As a further enhancement, film posters could also be used as a feature.

In addition, the estimates could be improved by predicting more than one genre per film, as in multilabel classification, rather than a single genre.
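
As a rough sketch of how the multilabel variant would be scored, each film's set of genres can be encoded as a row of 0/1 indicators, and the Hamming loss then counts the individual label slots that differ. The genre names, the example films, and both helper functions are hypothetical and were not part of this project.

```python
def to_indicator(label_sets, classes):
    """Encode each set of labels as a 0/1 row over a fixed class order."""
    return [[1 if c in labels else 0 for c in classes] for labels in label_sets]


def hamming_loss_multilabel(Y_true, Y_pred):
    """Fraction of individual label slots where prediction and truth differ."""
    total = sum(len(row) for row in Y_true)
    wrong = sum(
        1
        for row_true, row_pred in zip(Y_true, Y_pred)
        for t, p in zip(row_true, row_pred)
        if t != p
    )
    return wrong / total


# Hypothetical genres and films, for illustration only.
genres_demo = ["Action", "Comedy", "Drama", "Horror"]
true_sets = [{"Action", "Drama"}, {"Comedy"}]
pred_sets = [{"Action"}, {"Comedy", "Drama"}]

Y_true_demo = to_indicator(true_sets, genres_demo)
Y_pred_demo = to_indicator(pred_sets, genres_demo)

ml_loss = hamming_loss_multilabel(Y_true_demo, Y_pred_demo)
print(ml_loss)  # 2 wrong slots out of 8
```

Unlike the single-label case, a prediction here can be partly right: each film above has one wrong slot out of four rather than counting as entirely wrong.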