# Introductory applied machine learning
# Assignment 3 (Part B): Mini-Challenge [25%]

<div align="right"><font color="blue" size="5">Your Score was 80.0 out of a total of 100.0, or 80.0%</font></div>

## Important Instructions

**It is important that you follow the instructions below to the letter - we will not be responsible for incorrect marking due to non-standard practices.**

1. <font color='red'>We have split Assignment 3 into two parts to make it easier for you to work on them separately and for the markers to give you feedback. This is part B of Assignment 3 - Part A is an introduction to Object Recognition. Both Assignments together are still worth 50% of CourseWork 2. **Remember to submit both notebooks (you can submit them separately).**</font>

1. You *MUST* have your environment set up as in the [README](https://github.com/michael-camilleri/IAML2018) and you *must activate this environment before running this notebook*:
```
source activate py3iaml
cd [DIRECTORY CONTAINING GIT REPOSITORY]
jupyter notebook
# Navigate to this file
```

1. Read the instructions carefully, especially where asked to name variables with a specific name. Wherever you are required to produce code you should use code cells, otherwise you should use markdown cells to report results and explain answers. In most cases we indicate the nature of answer we are expecting (code/text), and also provide the code/markdown cell where to put it.

1. This part of the Assignment is the same for all students i.e. irrespective of whether you are taking the Level 10 version (INFR10069) or the Level-11 version of the course (INFR11182 and INFR11152).

1. The .csv files that you will be using are located at `./datasets` (i.e. use the `datasets` directory **adjacent** to this file).

1. In the textual answer, you are given a word-count limit of 600 words: exceeding this will lead to penalisation.

1. Make sure to distinguish between **attributes** (columns of the data) and **features** (which typically refers only to the independent variables, i.e. excluding the target variables).

1. Make sure to show **all** your code/working. 

1. Write readable code. While we do not expect you to follow [PEP8](https://www.python.org/dev/peps/pep-0008/) to the letter, the code should be adequately understandable, with plots/visualisations correctly labelled. **Do** use inline comments when doing something non-standard. When asked to present numerical values, make sure to represent real numbers in the appropriate precision to exemplify your answer. Marks *WILL* be deducted if the marker cannot understand your logic/results.

1. **Collaboration:** You may discuss the assignment with your colleagues, provided that the writing that you submit is entirely your own. That is, you must NOT borrow actual text or code from others. We ask that you provide a list of the people who you've had discussions with (if any). Please refer to the [Academic Misconduct](http://web.inf.ed.ac.uk/infweb/admin/policies/academic-misconduct) page for what constitutes a breach of the above.


### SUBMISSION Mechanics

**IMPORTANT:** You must submit this assignment by **Thursday 22/11/2018 at 16:00**. 

**Late submissions:** The policy stated in the School of Informatics is that normally you will not be allowed to submit coursework late. See the [ITO webpage](http://web.inf.ed.ac.uk/infweb/student-services/ito/admin/coursework-projects/late-coursework-extension-requests) for exceptions to this, e.g. in case of serious medical illness or serious personal problems.

**Resubmission:** If you submit your file(s) again, the previous submission is **overwritten**. We will mark the version that is in the submission folder at the deadline.

**N.B.**: This Assignment requires submitting **two files (electronically as described below)**:
 1. This Jupyter Notebook (Part B), *and*
 1. The Jupyter Notebook for Part A
 
All submissions happen electronically. To submit:

1. Fill out this notebook (as well as Part A), making sure to:
   1. save it with **all code/text and visualisations**: markers are NOT expected to run any cells,
   1. keep the name of the file **UNCHANGED**, *and*
   1. **keep the same structure**: retain the questions, **DO NOT** delete any cells and **avoid** adding unnecessary cells unless absolutely necessary, as this makes the job harder for the markers. This is especially important for the textual description and probability output (below).

1. Submit it using the `submit` functionality. To do this, you must be on a DICE environment. Open a Terminal, and:
   1. **On-Campus Students**: navigate to the location of this notebook and execute the following command:
   
      ```submit iaml cw2 03_A_ObjectRecognition.ipynb 03_B_MiniChallenge.ipynb```
      
   1. **Distance Learners:** These instructions also apply to those students who work on their own computer. First you need to copy your work onto DICE (so that you can use the `submit` command). For this, you can use `scp` or `rsync` (you may need to install these yourself). You can copy files to `student.ssh.inf.ed.ac.uk`, then ssh into it in order to submit. The following is an example. Replace entries in `[square brackets]` with your specific details: i.e. if your student number is for example s1234567, then `[YOUR USERNAME]` becomes `s1234567`.
   
    ```
    scp -r [FULL PATH TO 03_A_ObjectRecognition.ipynb] [YOUR USERNAME]@student.ssh.inf.ed.ac.uk:03_A_ObjectRecognition.ipynb
    scp -r [FULL PATH TO 03_B_MiniChallenge.ipynb] [YOUR USERNAME]@student.ssh.inf.ed.ac.uk:03_B_MiniChallenge.ipynb
    ssh [YOUR USERNAME]@student.ssh.inf.ed.ac.uk
    ssh student.login
    submit iaml cw2 03_A_ObjectRecognition.ipynb 03_B_MiniChallenge.ipynb
    ```
    
   What actually happens in the background is that your file is placed in a folder available to markers. If you submit a file with the same name into the same location, **it will *overwrite* your previous submission**. You should receive an automatic email confirmation after submission.
  


### Marking Breakdown

The Level 10 and Level 11 points are marked out of different totals; however, these are all normalised to 100%. Note that Part A (Object Recognition) is worth 75% of the total Mark for Assignment 3, while Part B (this notebook) is worth 25%. Keep this in mind when allocating time for this assignment.

**70-100%** results/answer correct plus extra achievement at understanding or analysis of results. Clear explanations, evidence of creative or deeper thought will contribute to a higher grade.

**60-69%** results/answer correct or nearly correct and well explained.

**50-59%** results/answer in right direction but significant errors.

**40-49%** some evidence that the student has gained some understanding, but not answered the questions properly.

**0-39%** serious error or slack work.

Note that while this is not a programming assignment, in questions which involve visualisation of results and/or long code snippets, some marks may be deducted if the code is not adequately readable.

## Imports

Use the cell below to include any imports you deem necessary.

In [1]:
# Nice Formatting within Jupyter Notebook
%matplotlib inline
from IPython.display import display # Allows multiple displays from a single code-cell

# System functionality
import sys
sys.path.append('..')

# Import Here any Additional modules you use. To import utilities we provide, use something like:
#   from utils.plotter import plot_hinton

# Your Code goes here:

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import log_loss, accuracy_score
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline



# Mini challenge

In this second part of the assignment we will have a mini object-recognition challenge. Using the same type of data as in Part A, you are asked to find the best classifier for the person/no person classification task. You can apply any preprocessing steps to the data that you think fit and employ any classifier you like (with the provision that you can explain what the classifier is/preprocessing steps are doing). You can also employ any lessons learnt during the course, either from previous Assignments, the Labs or the lecture material to try and squeeze out as much performance as you possibly can. The only restriction is that all steps must be performed in `Python` by using the `numpy`, `pandas` and `sklearn` packages. You can also make use of `matplotlib` and `seaborn` for visualisation.

### DataSet Description

The datasets we use here are similar in composition but not the same as the ones used in Part A: *it will be useful to revise the description in that notebook*. Specifically, you have access to three new datasets: a training set (`Images_C_Train.csv`), a validation set (`Images_C_Validate.csv`), and a test set (`Images_C_Test.csv`). You must use the former two for training and evaluating your models (as you see fit). As before, the full data-set has 520 attributes (dimensions). Of these you only have access to the 500 features (`dim1` through `dim500`) to test your model on: i.e. the test set does not have any of the class labels.

### Model Evaluation

Your results will be evaluated in terms of the logarithmic loss metric, specifically the [`log_loss`](http://scikit-learn.org/0.19/modules/model_evaluation.html#log-loss) function from SKLearn. You should familiarise yourself with this. To estimate this metric you will need to provide probability outputs, as opposed to the discrete predictions which we have used so far to compute classification accuracies. Most models in `sklearn` implement a `predict_proba()` method which returns the probabilities for each class. For instance, if your test set consists of `N` datapoints and there are `K` class-labels, the method will return an `N` x `K` matrix (with rows summing to 1).
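
As a rough illustration of what the metric rewards, the sketch below computes the binary log loss by hand for three sets of predicted probabilities. The function name `binary_log_loss` and the toy labels and probabilities are made up for this example and are not part of the assignment data.

```python
import math

def binary_log_loss(y_true, prob_pos, eps=1e-15):
    """Mean negative log-likelihood of the true labels under predicted P(class = 1)."""
    total = 0.0
    for y, p in zip(y_true, prob_pos):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Toy labels and probabilities (illustrative only, not taken from the datasets).
y_true = [0, 1, 1, 0]
confident_right = [0.1, 0.9, 0.8, 0.2]
uninformative = [0.5, 0.5, 0.5, 0.5]
confident_wrong = [0.9, 0.1, 0.2, 0.8]

loss_right = binary_log_loss(y_true, confident_right)
loss_flat = binary_log_loss(y_true, uninformative)
loss_wrong = binary_log_loss(y_true, confident_wrong)
print(loss_right, loss_flat, loss_wrong)
```

Lower is better: confident correct probabilities score below the uninformative 0.5 guess, while confident wrong probabilities are penalised heavily. This is why a classifier with reasonable accuracy can still have a poor log loss.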

### Submission and Scoring

This part of Assignment 3 carries 25% of the total marks. Within this, you will be scored on two criteria:
 1. 80% of the mark will depend on the thoroughness of the exploration of various approaches. This will be assessed through your code, as well as a brief description (<600 words) justifying the approaches you considered, your exploration pattern and your suggested final approach (and why you chose it).
 1. 20% of the mark will depend on the quality of your predictions: this will be evaluated based on the logarithmic loss metric.
Note here that just getting exceptional performance is not enough: in fact, you should focus more on analysing your results than just getting the best score!

You have to submit the following:
 1. **All Code-Cells** which show your **working** with necessary output/plots already generated.
 1. In **TEXT** cell `#ANSWER_TEXT#` you are to write your explanation (<600 words) as described above. Keep this brief and to the point. **Make sure** to keep the token `#ANSWER_TEXT#` as the first line of the cell!
 1. In **CODE** cell `#ANSWER_PROB#` you are to submit your predictions. To do this:
    1. Once you have chosen your favourite model (and pre-processing steps) apply it to the test-set and estimate the posterior probabilities for the data points in the test set.
    1. Store these probabilities in a 2D numpy array named `pred_probabilities`, with predictions along the rows i.e. each row should be a complete probability distribution over whether the image contains a person or not. Note that due to the encoding of the `is_person` class, the negative case (i.e. there is no person) comes first.
    1. Execute the `#ANSWER_PROB#` code cell, making sure to not change anything. This cell will do some checks to ensure that you are submitting the right shape of array.
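
Before running the submission cell, the array can be sanity-checked independently. The following is a minimal sketch, assuming a hypothetical helper `check_probabilities` (not part of the provided notebook); it compares row sums with a tolerance, since exact equality with 1.0 can fail for floating-point sums.

```python
import numpy as np

def check_probabilities(probs, n_rows, n_classes=2, tol=1e-8):
    """Return a list of problems found in a predicted-probability array (empty if none)."""
    probs = np.asarray(probs, dtype=float)
    problems = []
    if probs.shape != (n_rows, n_classes):
        problems.append("shape is {}, expected {}".format(probs.shape, (n_rows, n_classes)))
        return problems
    if ((probs < 0) | (probs > 1)).any():
        problems.append("values outside [0, 1]")
    # Row sums are compared with a tolerance rather than with == 1.0.
    if not np.allclose(probs.sum(axis=1), 1.0, atol=tol):
        problems.append("rows do not sum to 1")
    return problems

# Toy arrays: column 0 = P(no person), column 1 = P(person).
good = np.array([[0.9, 0.1], [0.3, 0.7], [0.5, 0.5]])
bad = np.array([[0.9, 0.2], [0.3, 0.7], [0.5, 0.5]])
print(check_probabilities(good, n_rows=3))
print(check_probabilities(bad, n_rows=3))
```

For the real submission the call would use the test-set size, for example `check_probabilities(pred_probabilities, n_rows=1114)`.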

You may create as many code cells as you need (within reason) for training your models, evaluating the data etc: however, the text cell `#ANSWER_TEXT#` and code-cell `#ANSWER_PROB#` showing your answers must be the last two cells in the notebook.

In [50]:
# This is where your working code should start. Feel free to add as many code-cells as necessary.
#  Make sure however that all working code cells come BEFORE the #ANSWER_TEXT# and #ANSWER_PROB#
#  cells below.

# Your Code goes here:

# Read the data
dataPathTr = os.path.join(os.getcwd(),'datasets','Images_C_Train.csv')
dataPathVal = os.path.join(os.getcwd(),'datasets','Images_C_Validate.csv')
dataPathTst = os.path.join(os.getcwd(),'datasets','Images_C_Test.csv')
train_C = pd.read_csv(dataPathTr)
valid_C = pd.read_csv(dataPathVal)
test_C = pd.read_csv(dataPathTst)

In [3]:
# First I want to inspect the sizes of the training and validation sets
display(train_C.shape)
display(valid_C.shape)

(2113, 520)

(1113, 520)

In [4]:
# Combine training and validation set as I am not happy with the split
trainvalid_C = pd.concat([train_C, valid_C])  # pd.concat also works on pandas >= 2.0, where DataFrame.append no longer exists

In [5]:
keep_cs = [c for c in train_C.columns if c.startswith("dim") or c == "is_person"]
trainvalid_C = trainvalid_C[keep_cs] # Extract the columns I want

In [6]:
trainvalid_C.describe()

| | dim1 | dim2 | dim3 | dim4 | dim5 | dim6 | dim7 | dim8 | dim9 | dim10 | ... | dim492 | dim493 | dim494 | dim495 | dim496 | dim497 | dim498 | dim499 | dim500 | is_person |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 3226.0 | 3226.0 | 3226.0 | 3226.0 | 3226.0 | 3226.0 | 3226.0 | 3226.0 | 3226.0 | 3226.0 | ... | 3226.0 | 3226.0 | 3226.0 | 3226.0 | 3226.0 | 3226.0 | 3226.0 | 3226.0 | 3226.0 | 3226.0 |
| mean | 0.028997 | 0.033535 | 0.03306 | 0.025186 | 0.029205 | 0.033261 | 0.033809 | 0.029431 | 0.035209 | 0.036372 | ... | 0.034582 | 0.030724 | 0.030742 | 0.029553 | 0.033326 | 0.034241 | 0.025802 | 0.033678 | 0.034998 | 0.456913 |
| std | 0.415827 | 0.472264 | 0.390741 | 0.376364 | 0.397972 | 0.452469 | 0.468027 | 0.383038 | 0.463219 | 0.476861 | ... | 0.49988 | 0.378243 | 0.427903 | 0.41217 | 0.454718 | 0.45602 | 0.354771 | 0.470745 | 0.461739 | 0.498217 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.000808 | 0.0 | 0.001488 | 0.000781 | 0.001116 | 0.001019 | 0.00034 | 0.000679 | 0.000762 | 0.001111 | ... | 0.0 | 0.000679 | 0.000679 | 0.001783 | 0.001019 | 0.000744 | 0.001116 | 0.001019 | 0.001116 | 0.0 |
| 50% | 0.001563 | 0.00034 | 0.003736 | 0.001698 | 0.002038 | 0.00186 | 0.000781 | 0.001698 | 0.001698 | 0.002056 | ... | 0.0 | 0.002717 | 0.001359 | 0.003125 | 0.001875 | 0.001953 | 0.002038 | 0.002038 | 0.002378 | 0.0 |
| 75% | 0.002378 | 0.000899 | 0.006454 | 0.002717 | 0.003125 | 0.003057 | 0.001488 | 0.003397 | 0.002717 | 0.00372 | ... | 0.000679 | 0.006793 | 0.002232 | 0.004836 | 0.003057 | 0.003889 | 0.003057 | 0.003397 | 0.004076 | 1.0 |
| max | 9.984 | 9.122238 | 7.6768 | 9.695738 | 8.762671 | 9.489078 | 9.751526 | 8.691076 | 9.013933 | 9.602705 | ... | 9.673318 | 7.375434 | 9.672255 | 9.348755 | 9.299061 | 9.951019 | 9.036268 | 9.963328 | 9.505755 | 1.0 |


In [7]:
# Clearly there are outliers (inspecting max/mean values)
# Let's remove them
trainvalid_C_clean = trainvalid_C.loc[trainvalid_C.le(trainvalid_C.quantile(q=0.995, axis=0), axis=1).all(axis=1)]
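
To make the behaviour of this filter concrete, here is a small NumPy analogue of the pandas expression on a toy matrix. The data, the injected outliers and the variable names are assumptions made for illustration only.

```python
import numpy as np

# Toy matrix: 200 rows, 3 identical columns of small values, plus two injected extremes.
X = np.tile(np.linspace(0.0, 0.01, 200).reshape(-1, 1), (1, 3))
X[10, 0] = 9.5   # artificial outlier in column 0
X[50, 2] = 8.7   # artificial outlier in column 2

cutoffs = np.quantile(X, 0.995, axis=0)   # one 99.5th-percentile cutoff per column
keep = (X <= cutoffs).all(axis=1)         # a row survives only if every column passes
X_clean = X[keep]

print(X.shape, X_clean.shape)
print(np.flatnonzero(~keep))              # indices of the removed rows
```

In this toy case the two injected rows are removed, and so is the row holding the largest value of the untouched middle column: a per-column quantile cutoff trims the top of every column, whether or not that value is extreme.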

In [8]:
trainvalid_C_clean.describe()

| | dim1 | dim2 | dim3 | dim4 | dim5 | dim6 | dim7 | dim8 | dim9 | dim10 | ... | dim492 | dim493 | dim494 | dim495 | dim496 | dim497 | dim498 | dim499 | dim500 | is_person |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 3206.0 | 3206.0 | 3206.0 | 3206.0 | 3206.0 | 3206.0 | 3206.0 | 3206.0 | 3206.0 | 3206.0 | ... | 3206.0 | 3206.0 | 3206.0 | 3206.0 | 3206.0 | 3206.0 | 3206.0 | 3206.0 | 3206.0 | 3206.0 |
| mean | 0.001748 | 0.000737 | 0.004457 | 0.001896 | 0.002289 | 0.002182 | 0.001025 | 0.002475 | 0.002059 | 0.00271 | ... | 0.000575 | 0.004666 | 0.001556 | 0.003631 | 0.002199 | 0.002874 | 0.002196 | 0.002458 | 0.003019 | 0.45758 |
| std | 0.001198 | 0.001392 | 0.003762 | 0.001369 | 0.001584 | 0.001681 | 0.000877 | 0.002851 | 0.001953 | 0.002316 | ... | 0.001216 | 0.005827 | 0.00121 | 0.002594 | 0.001633 | 0.003229 | 0.001394 | 0.001953 | 0.002705 | 0.498275 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.000781 | 0.0 | 0.001488 | 0.000781 | 0.001116 | 0.001019 | 0.00034 | 0.000679 | 0.000744 | 0.001065 | ... | 0.0 | 0.000679 | 0.000679 | 0.001756 | 0.001019 | 0.000744 | 0.001116 | 0.001019 | 0.001116 | 0.0 |
| 50% | 0.001563 | 0.00034 | 0.00372 | 0.001698 | 0.002038 | 0.00186 | 0.000744 | 0.001698 | 0.001698 | 0.002038 | ... | 0.0 | 0.002717 | 0.001359 | 0.003125 | 0.00186 | 0.00186 | 0.002038 | 0.002038 | 0.002378 | 0.0 |
| 75% | 0.002378 | 0.000781 | 0.006454 | 0.002717 | 0.003057 | 0.003057 | 0.001488 | 0.003397 | 0.002717 | 0.00372 | ... | 0.000679 | 0.006552 | 0.002232 | 0.004836 | 0.003057 | 0.003736 | 0.003057 | 0.003397 | 0.004076 | 1.0 |
| max | 0.009851 | 0.022135 | 0.027514 | 0.010789 | 0.010417 | 0.021739 | 0.005774 | 0.02983 | 0.028372 | 0.02038 | ... | 0.021739 | 0.053329 | 0.010234 | 0.024457 | 0.013346 | 0.029225 | 0.008492 | 0.014509 | 0.028533 | 1.0 |


In [9]:
# Size of the combined dataset before outlier removal
# (the cleaned set used for the train/test split below has 3206 rows)
trainvalid_C.shape

(3226, 501)

In [10]:
# Split in an 80/20 ratio
training, valid = train_test_split(trainvalid_C_clean,train_size=0.8, test_size=0.2, random_state=0, shuffle=True)

In [11]:
# Look at the first few elements
training.head()

| | dim1 | dim2 | dim3 | dim4 | dim5 | dim6 | dim7 | dim8 | dim9 | dim10 | ... | dim492 | dim493 | dim494 | dim495 | dim496 | dim497 | dim498 | dim499 | dim500 | is_person |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 736 | 0.003397 | 0.00034 | 0.007133 | 0.001019 | 0.00034 | 0.002378 | 0.001698 | 0.004755 | 0.000679 | 0.001019 | ... | 0.00034 | 0.01019 | 0.006793 | 0.004076 | 0.005435 | 0.001698 | 0.000679 | 0.00034 | 0.007812 | 1 |
| 970 | 0.002038 | 0.0 | 0.008832 | 0.001019 | 0.003397 | 0.001019 | 0.002717 | 0.00034 | 0.002038 | 0.00034 | ... | 0.002038 | 0.003397 | 0.002038 | 0.006793 | 0.002717 | 0.007473 | 0.002717 | 0.001019 | 0.001359 | 0 |
| 921 | 0.001563 | 0.000391 | 0.001563 | 0.002344 | 0.001172 | 0.002344 | 0.001563 | 0.000391 | 0.0 | 0.001953 | ... | 0.0 | 0.00625 | 0.002344 | 0.003125 | 0.001172 | 0.005078 | 0.001172 | 0.001172 | 0.001953 | 1 |
| 1020 | 0.00186 | 0.001116 | 0.000744 | 0.000744 | 0.001116 | 0.000744 | 0.000744 | 0.002976 | 0.00186 | 0.00186 | ... | 0.000744 | 0.000372 | 0.00186 | 0.002976 | 0.001488 | 0.005208 | 0.000744 | 0.000372 | 0.002604 | 1 |
| 535 | 0.0 | 0.001698 | 0.002378 | 0.0 | 0.00034 | 0.002038 | 0.00034 | 0.002378 | 0.004755 | 0.002717 | ... | 0.001019 | 0.001019 | 0.00034 | 0.004076 | 0.000679 | 0.002378 | 0.0 | 0.002038 | 0.0 | 0 |


In [12]:
# Create the variables I need for training/testing my models
X_train = training.drop(columns=['is_person'])
y_train = training['is_person']

X_val = valid.drop(columns=['is_person'])
y_val = valid['is_person']

In [13]:
# Use some classifiers without any further pre-processing

bl = DummyClassifier(strategy='prior')
bl.fit(X_train, y_train)
prob_bl = bl.predict_proba(X_val)
print('Base Line Classifier | score: {:.3f}  log loss: {:.5f}'.format(bl.score(X_val, y_val),
                                                                      log_loss(y_val, prob_bl)))

rf = RandomForestClassifier(random_state=0, n_estimators=500)
rf.fit(X_train, y_train)
prob_rf = rf.predict_proba(X_val)
print('Random Forest Classifier | score: {:.3f}  log loss: {:.5f}'.format(rf.score(X_val, y_val),
                                                                      log_loss(y_val, prob_rf)))

lr = LogisticRegression(solver='lbfgs')
lr.fit(X_train, y_train)
prob_lr = lr.predict_proba(X_val)
print('Logistic Regression Classifier | score: {:.3f}  log loss: {:.5f}'.format(lr.score(X_val, y_val),
                                                                      log_loss(y_val, prob_lr)))

svm = SVC(kernel='rbf', probability=True)
svm.fit(X_train, y_train)
prob_svm = svm.predict_proba(X_val)
print('SVM Classifier | score: {:.3f}  log loss: {:.5f}'.format(svm.score(X_val, y_val),
                                                                      log_loss(y_val, prob_svm)))

Base Line Classifier | score: 0.551  log loss: 0.68811
Random Forest Classifier | score: 0.724  log loss: 0.57238
Logistic Regression Classifier | score: 0.558  log loss: 0.67148
SVM Classifier | score: 0.551  log loss: 0.74630
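
The baseline figure gives a useful reference point: a classifier that always outputs the training class prior has a log loss equal to the cross-entropy between the validation class balance and that prior. The sketch below works this out for assumed class fractions; the function `prior_log_loss` and the fractions passed to it are hypothetical, chosen only to be roughly in line with the class balance shown earlier.

```python
import math

def prior_log_loss(p_train_pos, p_val_pos):
    """Log loss of always predicting the training prior p_train_pos, evaluated on a
    validation set whose fraction of positives is p_val_pos."""
    return -(p_val_pos * math.log(p_train_pos)
             + (1.0 - p_val_pos) * math.log(1.0 - p_train_pos))

# Illustrative fractions only (assumed, not read from the actual split).
matched = prior_log_loss(0.45, 0.45)      # prior matches the validation balance
mismatched = prior_log_loss(0.40, 0.45)   # prior slightly off
print(matched, mismatched)
```

Any model whose log loss is above this reference is doing worse than ignoring the features altogether, which appears to be the case for the unscaled SVM in the output above.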


In [14]:
# Scale the data
stdiser = StandardScaler()

stdiser.fit(X_train) 
X_train_scaled = stdiser.transform(X_train)
X_val_scaled = stdiser.transform(X_val)
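
What the scaler does can be written out by hand. The following is a minimal sketch on made-up toy arrays (the names mirror the cell above but the values are invented): the mean and standard deviation are estimated on the training split only and then reused for the validation split.

```python
import numpy as np

# Toy data (made up): two features on very different scales.
X_train_toy = np.array([[0.001, 10.0], [0.003, 30.0], [0.005, 50.0]])
X_val_toy = np.array([[0.002, 20.0], [0.007, 70.0]])

mu = X_train_toy.mean(axis=0)     # statistics come from the training split only
sigma = X_train_toy.std(axis=0)   # population std (ddof=0), as StandardScaler uses

X_train_toy_scaled = (X_train_toy - mu) / sigma
X_val_toy_scaled = (X_val_toy - mu) / sigma   # validation reuses the training mean and std

print(X_train_toy_scaled.mean(axis=0), X_train_toy_scaled.std(axis=0))
print(X_val_toy_scaled)
```

After scaling, both toy features live on the same scale, which matters for distance-based models such as an SVM with an rbf kernel. The validation columns are not exactly zero-mean, and that is expected.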

In [15]:
# Turn the data back into dataframes in case I want to inspect it again
X_train = pd.DataFrame(X_train_scaled, index=X_train.index, columns=X_train.columns)
X_val = pd.DataFrame(X_val_scaled, index=X_val.index, columns=X_val.columns)

In [16]:
# Create KFold object
kf = KFold(n_splits=5, shuffle=True, random_state=0)

In [17]:
# Optimise random forest
rf = RandomForestClassifier(n_estimators=250, random_state=0)

params = {"max_depth": [None,5,10],
          "max_features": ['auto','log2'],
          "bootstrap": [True, False],
          "criterion": ["gini", "entropy"]}
gs_rf = GridSearchCV(rf, cv=kf, param_grid=params, scoring='neg_log_loss', n_jobs=-1, verbose=True).fit(X_train, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   34.6s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:  2.0min finished


In [18]:
print("Random Forest")
print("Optimal parameter values: {}".format(gs_rf.best_params_))
print("Neg log loss on validation set: {:.5f}".format(gs_rf.score(X_val, y_val)))
print("Accuracy on validation set: {:.3f}".format(accuracy_score(y_val, gs_rf.predict(X_val))))

# Save the parameter values in case I want them later
op_rf = gs_rf.best_params_

Random Forest
Optimal parameter values: {'bootstrap': False, 'criterion': 'entropy', 'max_depth': None, 'max_features': 'auto'}
Neg log loss on validation set: -0.55887
Accuracy on validation set: 0.726


In [19]:
# Optimise Logistic Regression
lr = LogisticRegression(solver='lbfgs')

params = {'C': np.logspace(-5, 5, 20)}

gs_lr = GridSearchCV(lr, cv=kf, param_grid=params, scoring='neg_log_loss', n_jobs=-1, verbose=True).fit(X_train, y_train)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=-1)]: Done  70 tasks      | elapsed:    8.0s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   13.2s finished


In [20]:
print("Logistic Regression")
print("Optimal parameter values: {}".format(gs_lr.best_params_))
print("Neg log loss on validation set: {:.5f}".format(gs_lr.score(X_val, y_val)))
print("Accuracy on validation set: {:.3f}".format(accuracy_score(y_val, gs_lr.predict(X_val))))

# Save the parameter values in case I want them later
op_lr = gs_lr.best_params_

Logistic Regression
Optimal parameter values: {'C': 0.0012742749857031334}
Neg log loss on validation set: -0.57321
Accuracy on validation set: 0.713
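
The searches above report only the best candidate. As a sketch of how the score of every candidate could be inspected (and then plotted against `C`), the example below runs a small grid search on synthetic data and reads `cv_results_`. It is an alternative arrangement rather than the approach used in this notebook: the scaler is placed inside a `Pipeline` so that it is refit on each training fold. The synthetic data and the step names `scale` and `lr` are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the notebook would use the unscaled training features.
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=0)

pipe = Pipeline(steps=[("scale", StandardScaler()),
                       ("lr", LogisticRegression(solver="lbfgs", max_iter=1000))])
grid = {"lr__C": np.logspace(-3, 3, 7)}
cv = KFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(pipe, param_grid=grid, cv=cv, scoring="neg_log_loss")
search.fit(X_toy, y_toy)

# cv_results_ holds one entry per candidate: its parameters and fold-averaged score.
for params, mean, std in zip(search.cv_results_["params"],
                             search.cv_results_["mean_test_score"],
                             search.cv_results_["std_test_score"]):
    print("C = {:10.3f} | mean neg log loss = {:.4f} (+/- {:.4f})".format(
        params["lr__C"], mean, std))
```

The same loop works for any fitted `GridSearchCV` object, so the mean scores could be passed to `plt.plot` or `sns.heatmap` to show how sensitive each model is to its hyperparameters.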


In [24]:
# Optimise SVM (rbf kernel)
svm = SVC(kernel = 'rbf', probability=True)

params = {'C' : np.logspace(-2,2,10),
          "gamma" : np.logspace(-3,2,5)}

gs_svm = GridSearchCV(svm, cv=kf, param_grid=params, scoring='neg_log_loss', n_jobs=-1, verbose=True).fit(X_train, y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  3.7min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 16.4min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed: 21.4min finished


In [25]:
print("SVM, rbf kernel")
print("Optimal parameter values: {}".format(gs_svm.best_params_))
print("Neg log loss on validation set: {:.5f}".format(gs_svm.score(X_val, y_val)))
print("Accuracy on validation set: {:.3f}".format(accuracy_score(y_val, gs_svm.predict(X_val))))

# Save the parameter values in case I want them later
op_svm_rbf = gs_svm.best_params_

SVM, rbf kernel
Optimal parameter values: {'C': 0.5994842503189409, 'gamma': 0.001}
Neg log loss on validation set: -0.53568
Accuracy on validation set: 0.749


In [26]:
# Optimise SVM (polynomial kernel)
svm = SVC(kernel = 'poly', probability=True)

params = {'C' : np.logspace(-2,2,5),
          "degree" : [1,2,3]}

gs_svm = GridSearchCV(svm, cv=kf, param_grid=params, scoring='neg_log_loss', n_jobs=-1, verbose=True).fit(X_train, y_train)

Fitting 5 folds for each of 15 candidates, totalling 75 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  3.3min
[Parallel(n_jobs=-1)]: Done  75 out of  75 | elapsed:  5.9min finished


In [27]:
print("SVM, polynomial kernel")
print("Optimal parameter values: {}".format(gs_svm.best_params_))
print("Neg log loss on validation set: {:.5f}".format(gs_svm.score(X_val, y_val)))
print("Accuracy on validation set: {:.3f}".format(accuracy_score(y_val, gs_svm.predict(X_val))))

# Save the parameter values in case I want them later
op_svm_poly = gs_svm.best_params_

SVM, polynomial kernel
Optimal parameter values: {'C': 1.0, 'degree': 1}
Neg log loss on validation set: -0.58219
Accuracy on validation set: 0.704


In [28]:
pca_components=[1, 2, 5, 10, 50, 100, 200, 350, 500]

In [29]:
# PCA - Random Forest
rf = RandomForestClassifier(n_estimators=250, random_state=0, bootstrap=False, criterion='entropy',
                            max_depth=None, max_features='auto')
pca_rf = PCA()
pipe_rf = Pipeline(steps=[('pca', pca_rf), ('rf', rf)])

params_rf = {"pca__n_components" : pca_components}

gs_rf = GridSearchCV(pipe_rf, cv=kf, param_grid=params_rf, scoring='neg_log_loss', n_jobs=-1, verbose=True).fit(X_train, y_train)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:  2.7min finished


In [30]:
print("Random Forest | optimal # components: {}, accuracy: {:.3f}, neg log loss: {:.5f}".format(gs_rf.best_params_['pca__n_components'],
                                                                                            accuracy_score(y_val, gs_rf.predict(X_val)),
                                                                                            gs_rf.score(X_val, y_val)))

Random Forest | optimal # components: 50, accuracy: 0.731, neg log loss: -0.56187


In [31]:
# PCA - Logistic Regression
lr = LogisticRegression(solver='lbfgs', C=op_lr['C'] )
pca_lr = PCA()
pipe_lr = Pipeline(steps=[('pca', pca_lr), ('lr', lr)])

params_lr = {"pca__n_components" : pca_components}

gs_lr_pca = GridSearchCV(pipe_lr, cv=kf, param_grid=params_lr, scoring='neg_log_loss', n_jobs=-1, verbose=True).fit(X_train, y_train)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:   11.7s finished


In [32]:
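# Note: the accuracy printed here comes from gs_rf (the PCA + random forest search) and the
# log loss from gs_lr (the search without PCA); only the component count is from gs_lr_pca.
print("Logistic Regression | optimal # components: {}, accuracy: {:.3f}, neg log loss: {:.5f}".format(gs_lr_pca.best_params_['pca__n_components'],
                                                                                               accuracy_score(y_val, gs_rf.predict(X_val)),
                                                                                               gs_lr.score(X_val, y_val)))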

Logistic Regression | optimal # components: 500, accuracy: 0.731, neg log loss: -0.57321


In [33]:
# PCA - SVM (with rbf kernel as poly does not seem viable)
svm = SVC(kernel='rbf', probability=True, C=op_svm_rbf['C'], gamma=op_svm_rbf['gamma'])
pca_svm = PCA()
pipe_svm = Pipeline(steps=[('pca', pca_svm), ('svc', svm)])

params_svm = {"pca__n_components" : pca_components}

gs_svm = GridSearchCV(pipe_svm, cv=kf, param_grid=params_svm, scoring='neg_log_loss', n_jobs=-1, verbose=True).fit(X_train, y_train)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:  1.3min finished


In [34]:
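# Note: the accuracy printed here comes from gs_rf (the PCA + random forest search);
# the component count and the log loss are from gs_svm.
print("SVM | optimal # components: {}, accuracy: {:.3f}, neg log loss: {:.5f}".format(gs_svm.best_params_['pca__n_components'],
                                                                                      accuracy_score(y_val, gs_rf.predict(X_val)),
                                                                                      gs_svm.score(X_val, y_val)))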

SVM | optimal # components: 500, accuracy: 0.731, neg log loss: -0.53557
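
When the search selects all 500 components, PCA (without whitening) only centres and rotates the data, so the distances between points are unchanged. Since an rbf kernel depends only on those distances, this would be consistent with the near-identical SVM log loss with and without PCA. The sketch below checks the distance claim on random toy data, using an SVD-based PCA written out by hand; all names and data are illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)
X_demo = rng.rand(30, 5)                   # toy data: 30 points, 5 features

# PCA via SVD of the centred data; the rows of Vt are the principal directions.
Xc = X_demo - X_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z_full = Xc @ Vt.T                         # keep all 5 components
Z_two = Xc @ Vt[:2].T                      # keep only the first 2 components

def pairwise_sq_dists(A):
    """Squared Euclidean distance between every pair of rows of A."""
    sq = (A ** 2).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)

same_full = np.allclose(pairwise_sq_dists(Xc), pairwise_sq_dists(Z_full))
same_two = np.allclose(pairwise_sq_dists(Xc), pairwise_sq_dists(Z_two))
print(same_full, same_two)
```

Keeping every component preserves all pairwise distances, while truncating to two components does not. Models that are not rotation-invariant, such as a random forest splitting on individual features, can still behave differently after PCA.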


#ANSWER_TEXT#

On my first inspection of the data we were given I noticed the training set had around 2100 images and the validation set had around 1100. This seemed to me like an unnecessarily large amount of data to validate my models on, so I decided to merge the two datasets and create my own train/test split in an 80/20 ratio, to give me more data to train my models on.

I then decided that instead of creating a new validation set to tune the hyperparameters of whichever model I picked, I would do so using cross validation with 5 folds (ideally I would have chosen a higher value for k such as 10 but I did not have enough time to do this). I did this to combat the problem of overfitting my hyperparameters to a single validation set.

I then noticed that the max values for each dimension were a few orders of magnitude greater than the means. Remembering that each dimension stores tf-idf values I realised these outliers could be invalid data and I decided to get rid of them by removing any instance which included values above the 99.5th percentile; this removed around 20 instances.

To get a basis for future tests I trained some models on the data I had, including a baseline classifier which simply looked at priors, a random forest classifier, logistic regression and an SVM. From the log loss scores of each classifier I saw that random forest seemed to be the most promising but I wanted to pursue all three non-baseline classifiers further.

Before moving any further I decided, after another look at the data, to scale all the features to zero mean and unit variance, as I felt that the features varied too widely in scale. This can have a negative effect (in terms of time and/or performance) on a model.

By tuning some of their respective hyperparameters I noticed a definite increase in performance in all three classifiers, especially my SVM with an rbf kernel. I also tried an optimised SVM with a polynomial kernel, but this proved less effective than the rbf kernel. My best neg log loss score was -0.536.

I then hypothesised that my data may exist within a subspace of the original input space and decided to try PCA before tuning my hyperparameters. The effect of this was pretty much negligible and I saw very little (if any) increase in performance. 

My best performing model was an optimised svm with rbf kernel. Since PCA did not actually decrease the dimensionality of the space I was working in I have decided that it is not worth carrying out. 

I feel that there were many other hyperparameters for each classifier that I would have liked to try and optimise. I also would have liked to try more values for the hyperparameters that I did optimise; however, I had neither the time nor the computational resources (my poor little laptop) to do so.

As a final note I believe the amount of data we have to train our models on is far too small, considering our input space has 500 dimensions and we only have ~6 times that amount of data (taking into account both the training and validation data we were given). This suggests that our classifiers could struggle to generalise very well.

<div align="right"><font color="blue" size="4">AutoRanking: 60/80</font></div>

<div align="left"><font color="green" size="4">Sensible handling of outliers and choice of baseline.

This is very good work; you explore a good variety of models and take a thorough approach to tuning their parameters. One reservation - I would have liked to have seen some use of visualisations in the analysis of the hyperparameter tuning.

I had my doubts about the redistribution of data from validation to training - but you make a good point about the dimensionality of the data.</font></div>

In [51]:
# Final predictions

# Pre-processing:
# Get rid of unwanted columns
test_C = test_C[keep_cs]
test_C = test_C.drop(columns=['is_person'])
# Scale data
test_C = stdiser.transform(test_C)
# Optimised svm - predict
svm = SVC(kernel='rbf', probability=True, C=op_svm_rbf['C'], gamma=op_svm_rbf['gamma'])
svm.fit(X_train, y_train)

SVC(C=0.5994842503189409, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [52]:
# Perform predictions
pred_probabilities = svm.predict_proba(test_C)

In [53]:
#ANSWER_PROB#
# Run this cell when you are ready to submit your test-set probabilities. This cell will generate some
# warning messages if something is not right: make sure to address them!
if pred_probabilities.shape != (1114, 2):
    print('Array is of incorrect shape. Rectify this before submitting.')
elif (pred_probabilities.sum(axis=1) != 1.0).all():
    print('Submitted values are not correct probabilities. Rectify this before submitting.')
else:
    for _prob in pred_probabilities:
        print('{:.8f}, {:.8f}'.format(_prob[0], _prob[1]))

0.87120327, 0.12879673
0.85787971, 0.14212029
0.32377962, 0.67622038
0.26966267, 0.73033733
0.65430474, 0.34569526
0.13813640, 0.86186360
0.22080547, 0.77919453
0.10373689, 0.89626311
0.79887104, 0.20112896
0.77672110, 0.22327890
0.48190661, 0.51809339
0.78269772, 0.21730228
0.61480488, 0.38519512
0.74005135, 0.25994865
0.04313994, 0.95686006
0.58121896, 0.41878104
0.10495153, 0.89504847
0.52388072, 0.47611928
0.82944276, 0.17055724
0.40173627, 0.59826373
0.91486596, 0.08513404
0.45700406, 0.54299594
0.65175409, 0.34824591
0.57659979, 0.42340021
0.80757900, 0.19242100
0.09059834, 0.90940166
0.17467677, 0.82532323
0.81525750, 0.18474250
0.28673258, 0.71326742
0.52288485, 0.47711515
0.82556120, 0.17443880
0.47550040, 0.52449960
0.88606823, 0.11393177
0.61706198, 0.38293802
0.71564626, 0.28435374
0.42225965, 0.57774035
0.14671470, 0.85328530
0.71964194, 0.28035806
0.86334154, 0.13665846
0.65975854, 0.34024146
0.73949757, 0.26050243
0.62347468, 0.37652532
0.58728399, 0.41271601
0.88286711,

<div align="right"><font color="blue" size="4">AutoRanking: 20/20</font></div>

<div align="left"><font color="green" size="4">Congratulations: Your solution was in the top 10%. You nailed it!!! You completely destroyed my random forest baseline. I fitted the Random Forest model out of the box with minimal tuning (I increased number of trees and performed a short search for a good value for max_features). I would always recommend this model for tabular data as your first step after a Dummy Baseline. This model got a logloss of 0.5965.</font></div>