# Introductory applied machine learning (INFR10069)

# Assignment 4: Feature Engineering

## Marking Breakdown

**70-100%** results/answer correct plus extra achievement in understanding or analysing the results. Clear explanations and evidence of creative or deeper thought will contribute to a higher grade.

**60-69%** results/answer correct or nearly correct and well explained.

**50-59%** results/answer in right direction but significant errors.

**40-49%** some evidence that the student has gained some understanding, but not answered the questions properly.

**0-39%** serious error or slack work.


## Mechanics

You should produce a Jupyter notebook in answer to this assignment.
**You need to submit this notebook electronically as described below.**

Place your notebook in a directory called `iamlans` and submit this directory using the submit command on a DICE machine. The format is:

`submit iaml 4 iamlans`

You can check the status of your submissions with the `show_submissions` command.

**Late submissions:** The policy stated in the School of Informatics MSc Degree Guide is that normally you will not be allowed to submit coursework late. See http://www.inf.ed.ac.uk/teaching/years/msc/courseguide10.html#exam for exceptions to this, e.g. in case of serious medical illness or serious personal problems.

**Collaboration:** You may discuss the assignment with your colleagues, provided that the writing that you submit is entirely your own. That is, you should NOT borrow actual text or code from other students. We ask that you provide a list of the people who you've had discussions with (if any).

## Important Instructions

1. In the following questions you are asked to run experiments using Python (version 2.7) and the following packages:
    * Numpy
    * Pandas
    * Scikit-learn 0.17
    * Matplotlib
    * Seaborn

2. Before you start, make sure you have set up a virtual environment (or conda environment if you are working on your own machine) with the required packages installed. Instructions on how to set up the working environment and install the required packages can be found in `01_Lab_1_Introduction`.

3. Wherever you are required to produce code you should use code cells, otherwise you should use markdown cells to report results and explain answers. **You are welcome to split your answer into multiple cells with intermediate printing.**

4. The .csv files that you will be using are located at `./datasets` (the `datasets` directory is adjacent to this file).

5. **IMPORTANT:** Keep your answers brief and concise. Most questions can be answered with 2-3 lines of explanation (excluding coding questions), unless stated otherwise.

## Imports

In this assignment you are asked to import all the packages and modules you will need. Include all required imports and execute the cell below.

In [1]:
from __future__ import division #print_function
import os
import math
from collections import OrderedDict
import numpy as np 
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, confusion_matrix
from sklearn.decomposition import PCA
from sklearn.neighbors import DistanceMetric
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.svm import LinearSVC, SVC
from sklearn.cross_validation import train_test_split, KFold, cross_val_predict, cross_val_score
from sklearn.naive_bayes import GaussianNB
%matplotlib inline



In [2]:
seed = 0
rng = np.random.RandomState(seed=seed)

## Description of the datasets


This assignment is based on two datasets:
1. the 20 Newsgroups Dataset (you should recognise it from Assignment 1)
2. the MNIST digits dataset

### 20 Newsgroups

For convenience, we repeat the description here. This dataset is a collection of approximately **20,000 newsgroup documents**, partitioned (nearly) evenly across 20 different newsgroups, each corresponding to a different topic. Some of the newsgroups are **very closely related** to each other (e.g. comp.sys.ibm.pc.hardware, comp.sys.mac.hardware), while others are **highly unrelated** (e.g. misc.forsale, soc.religion.christian).

To save you time and to make the problem manageable with limited computational resources, we preprocessed the original dataset. We will use documents from only **5 out of the 20 newsgroups**, which results in a 5-class problem. More specifically the 5 classes correspond to the following newsgroups: 
1. `alt.atheism`
2. `comp.sys.ibm.pc.hardware`
3. `comp.sys.mac.hardware`
4. `rec.sport.baseball`
5. `rec.sport.hockey`

However, note here that classes **2-3** and **4-5** are rather closely related.

**In contrast to Assignment 1**, we have opted to use **tf-idf** weights ([term frequency - inverse document frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)) for each word instead of the frequency counts. These weights represent the importance of a word to a document with respect to a collection of documents. The importance increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word.
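
As a rough illustration of the weighting described above, the sketch below computes one textbook variant of tf-idf (raw term count times the logarithm of the inverse document frequency) on a made-up three-document corpus. The corpus, the `tf_idf` helper and the exact formula are assumptions for illustration only; the preprocessing used to build the assignment data may apply a different smoothing or normalisation.

```python
import math

# Toy corpus, made up for illustration: each document is a list of tokens.
docs = [
    ["hockey", "puck", "goal", "goal"],
    ["baseball", "bat", "goal"],
    ["mac", "hardware", "disk"],
]


def tf_idf(term, doc, corpus):
    """Textbook tf-idf: raw term count times log(N / document frequency)."""
    tf = doc.count(term)                       # occurrences in this document
    df = sum(1 for d in corpus if term in d)   # documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(float(len(corpus)) / df)


# Weights of every distinct word in the first document.
weights = {term: tf_idf(term, docs[0], docs) for term in sorted(set(docs[0]))}
```

In this toy corpus `goal` occurs twice in the first document but also appears in a second document, so it ends up with a lower weight than `hockey`, which occurs once but only in this document.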

Additionally, we preprocess the data to keep only the **1000 most frequent words** that appear in **more than 2 documents** and in fewer than half of all documents, and that are not **[stop words](https://en.wikipedia.org/wiki/Stop_words)**.
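
The filtering rule above could be sketched roughly as follows. The function name `select_vocabulary`, its parameters and the toy corpus are hypothetical, and the sketch assumes that "most frequent" refers to the total number of occurrences across the corpus; the actual preprocessing script is not shown here and may differ.

```python
from collections import Counter


def select_vocabulary(docs, stop_words, max_words=1000, min_docs=2):
    """Keep the most frequent words that occur in more than `min_docs`
    documents, in fewer than half of all documents, and are not stop words."""
    n_docs = len(docs)
    term_counts = Counter()  # total occurrences of each word
    doc_counts = Counter()   # number of documents containing each word
    for doc in docs:
        term_counts.update(doc)
        doc_counts.update(set(doc))
    candidates = [
        w for w in term_counts
        if doc_counts[w] > min_docs
        and doc_counts[w] < n_docs / 2.0
        and w not in stop_words
    ]
    # Most frequent first; ties are broken alphabetically for determinism.
    candidates.sort(key=lambda w: (-term_counts[w], w))
    return candidates[:max_words]


# Toy corpus of 8 documents, made up for illustration.
toy_docs = [
    ["goal", "the"], ["goal", "the"], ["goal", "the", "goal"],
    ["the", "bat"], ["the", "bat", "of"], ["the", "of"],
    ["the", "disk", "of"], ["mac"],
]
vocabulary = select_vocabulary(toy_docs, stop_words={"the", "of"})
```

Only `goal` survives here: `the` is both a stop word and too widespread, `of` is a stop word, and the remaining words appear in too few documents.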

We will perform all this preprocessing for you.


### MNIST
The MNIST dataset is a collection of handwritten digits. The samples are partitioned (nearly) evenly across the **10 different digit classes {0, 1, ..., 9}**. We use a preprocessed version for which the data are **$8 \times 8$** pixel images containing one digit each. For further details on how the digits are preprocessed, see the sklearn documentation. The **images are grayscale**, with each pixel taking values in **{0, 1, ..., 16}**, where 0 corresponds to black (weakest intensity) and 16 corresponds to white (strongest intensity). Therefore, the dataset is an **N × 64** dimensional matrix where each dimension corresponds to a pixel from the image and N is the number of images.
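
To make the N × 64 layout concrete, here is a minimal sketch that flattens a stack of $8 \times 8$ images into feature vectors. The images are random stand-ins rather than real digits, the variable names are made up for this example, and row-by-row flattening is an assumption about how the pixels are ordered.

```python
import numpy as np

rng_demo = np.random.RandomState(0)

# Stand-in for N = 3 digit images: integers in {0, ..., 16}, shape (N, 8, 8).
images = rng_demo.randint(0, 17, size=(3, 8, 8))

# Flatten each image row by row into a 64-dimensional feature vector.
X_flat = images.reshape(len(images), -1)

# With this ordering, pixel (row r, column c) of image i sits in column 8 * r + c.
r, c = 2, 5
same_pixel = X_flat[1, 8 * r + c] == images[1, r, c]

# Reshaping back recovers the original images, e.g. for plotting.
images_again = X_flat.reshape(-1, 8, 8)
```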

Again, to save you time, we perform the import for you.

In [3]:
X = np.array([[1, 2, 3, 4],
             [5, 6, 7, 8],
             [9, 0, 1, 2],
             [1, 6, 9, 9],
             [1, 7, 9, 9]])

In [4]:
Xmean = np.mean(X, axis=0)
Xmean

array([ 3.4,  4.2,  5.8,  6.4])

In [5]:
pcaFull = PCA(n_components=X.shape[1], random_state=seed)

In [6]:
pcaFull.fit(X)

PCA(copy=True, iterated_power='auto', n_components=4, random_state=0,
  svd_solver='auto', tol=0.0, whiten=False)

In [7]:
pcaFull.components_

array([[ 0.44347627, -0.471378  , -0.57201705, -0.50391276],
       [-0.89536311, -0.26423878, -0.25063936, -0.25628634],
       [ 0.03985121, -0.76318776,  0.6447192 ,  0.01713116],
       [ 0.00809852,  0.35429521,  0.44081011, -0.82467919]])

In [8]:
XfullTranform = pcaFull.transform(X)
XfullTranform

array([[ 2.78372692,  4.04707419, -0.2629584 , -0.05392414],
       [-1.63159924, -2.61903614, -0.50890312, -0.13982557],
       [ 9.42615272, -1.5735017 ,  0.2585261 ,  0.07001177],
       [-5.05345119,  0.20485121,  0.63826159, -0.11527863],
       [-5.5248292 , -0.05938756, -0.12492617,  0.23901658]])

In [9]:
#X.dot(pcaFull.components_) # that's wrong: subtract the mean first, then multiply by components_.T

In [10]:
pcaFull.explained_variance_ # these are the eigenvalues of the data covariance matrix

array([  3.10649427e+01,   5.15191167e+00,   1.63589931e-01,
         1.95557478e-02])

In [11]:
#X.dot( pcaFull.components_)
pcaFull.transform

<bound method PCA.transform of PCA(copy=True, iterated_power='auto', n_components=4, random_state=0,
  svd_solver='auto', tol=0.0, whiten=False)>

In [12]:
pcaOne = PCA(n_components=1, random_state=seed)

In [13]:
pcaOne.fit_transform(X)

array([[ 2.78372692],
       [-1.63159924],
       [ 9.42615272],
       [-5.05345119],
       [-5.5248292 ]])

In [14]:
Xnorm = X - Xmean

In [15]:
Xnorm.dot(pcaOne.components_[0])

array([ 2.78372692, -1.63159924,  9.42615272, -5.05345119, -5.5248292 ])
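
The manual projection above extends to all components at once. The sketch below reproduces it with NumPy only, taking the principal directions from the SVD of the centred data instead of from `PCA`; the names `X_demo`, `Vt` and `scores` are chosen for this example. Each direction is only defined up to its sign, so individual columns may come out negated relative to the scikit-learn output.

```python
import numpy as np

# The same toy matrix as in the cells above, under a separate name.
X_demo = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 0, 1, 2],
                   [1, 6, 9, 9],
                   [1, 7, 9, 9]], dtype=float)

# Step 1: centre each column.
X_centered = X_demo - X_demo.mean(axis=0)

# Step 2: principal directions from the SVD of the centred data.
# Rows of Vt play the role of components_ (up to the sign of each row).
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Step 3: project. The transpose is needed because directions are stored as rows.
scores = X_centered.dot(Vt.T)
```

The first column of `scores` matches the one-component result above in absolute value, and the whole matrix equals `U * S`, which is another way of writing the same projection.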

In [16]:
pcaOne.components_.dot(pcaOne.components_.T)

array([[ 1.]])
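
The unit norm checked above is one part of a more general property: when all components are kept, the component matrix has orthonormal rows, so its transpose is also its inverse. A minimal NumPy-only sketch, assuming the directions are taken from the SVD of the centred data (`W` stands in for `components_`):

```python
import numpy as np

X_demo = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 0, 1, 2],
                   [1, 6, 9, 9],
                   [1, 7, 9, 9]], dtype=float)
X_centered = X_demo - X_demo.mean(axis=0)

# Rows of W are the principal directions.
_, _, W = np.linalg.svd(X_centered, full_matrices=False)

# Pairwise dot products between directions: 1 on the diagonal, 0 elsewhere.
gram = W.dot(W.T)

# For a square matrix with orthonormal rows, the inverse is the transpose.
W_inverse = np.linalg.inv(W)
```

This is why multiplying the transformed data by the component matrix, without any explicit matrix inversion, is enough to undo the projection.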

#### Reverse process

In [17]:
Xnorm_recovered = XfullTranform.dot(pcaFull.components_)
assert np.allclose(Xnorm_recovered, Xnorm)
Xnorm_recovered

array([[-2.4, -2.2, -2.8, -2.4],
       [ 1.6,  1.8,  1.2,  1.6],
       [ 5.6, -4.2, -4.8, -4.4],
       [-2.4,  1.8,  3.2,  2.6],
       [-2.4,  2.8,  3.2,  2.6]])

In [18]:
Xrecovered = Xnorm_recovered + Xmean
assert np.allclose(Xrecovered, X)
Xrecovered

array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6.,  7.,  8.],
       [ 9.,  0.,  1.,  2.],
       [ 1.,  6.,  9.,  9.],
       [ 1.,  7.,  9.,  9.]])
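
The reverse process is exact only when every component is kept. As a sketch of what happens otherwise, the NumPy-only example below reconstructs the toy data from the first `k` directions and compares the average squared reconstruction error with the variance of the discarded directions. The helpers `reconstruct` and `mean_squared_error` are made up for this example, and variances are computed by dividing by N.

```python
import numpy as np

X_demo = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 0, 1, 2],
                   [1, 6, 9, 9],
                   [1, 7, 9, 9]], dtype=float)
mean = X_demo.mean(axis=0)
X_centered = X_demo - mean

# Singular values S and principal directions (rows of W).
_, S, W = np.linalg.svd(X_centered, full_matrices=False)


def reconstruct(k):
    """Project onto the first k directions, then map back to the original space."""
    scores_k = X_centered.dot(W[:k].T)
    return scores_k.dot(W[:k]) + mean


def mean_squared_error(k):
    """Average squared distance between each sample and its reconstruction."""
    return np.mean(np.sum((X_demo - reconstruct(k)) ** 2, axis=1))


# Variance (dividing by N) carried by each direction.
variances = S ** 2 / len(X_demo)
```

For every `k`, `mean_squared_error(k)` equals `variances[k:].sum()`: the information lost by truncating is exactly the variance along the dropped directions, and it vanishes when `k` equals the number of features.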

#### More PCA for explained variance

In [19]:
pcaFull.explained_variance_

array([  3.10649427e+01,   5.15191167e+00,   1.63589931e-01,
         1.95557478e-02])

In [20]:
pcaFull.explained_variance_ / np.sum(pcaFull.explained_variance_)

array([  8.53432490e-01,   1.41536035e-01,   4.49422888e-03,
         5.37245818e-04])

In [21]:
pcaFull.explained_variance_ratio_

array([  8.53432490e-01,   1.41536035e-01,   4.49422888e-03,
         5.37245818e-04])
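
The link between explained variance and eigenvalues can be checked without scikit-learn. The sketch below computes the eigenvalues of the covariance matrix of the toy data with NumPy. The values printed above sum to 36.4, which is the total variance when dividing by N, so the sketch uses `bias=True`; some scikit-learn releases divide by N - 1 instead, in which case the raw values would be larger by a factor of N / (N - 1). The ratios are the same under either convention.

```python
import numpy as np

X_demo = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 0, 1, 2],
                   [1, 6, 9, 9],
                   [1, 7, 9, 9]], dtype=float)

# Covariance of the columns, normalised by N (bias=True).
cov_biased = np.cov(X_demo, rowvar=False, bias=True)

# eigvalsh returns eigenvalues in ascending order; reverse to get largest first.
eigvals = np.linalg.eigvalsh(cov_biased)[::-1]

# Fraction of the total variance carried by each direction.
ratio = eigvals / eigvals.sum()

# The same ratios computed with the N - 1 normalisation.
eigvals_unbiased = np.linalg.eigvalsh(np.cov(X_demo, rowvar=False))[::-1]
ratio_unbiased = eigvals_unbiased / eigvals_unbiased.sum()
```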

#### Another dataset

In [23]:
X=np.array([
        [-2, 1],
        [-1, 1],
        [-2, 3],
        [-1, 3],
        [3, 3],
        [4, 3],
        [3, 1],
        [4, 1]
    ])

In [24]:
pca = PCA(n_components=X.shape[1], random_state=seed)

In [25]:
pca.fit(X)

PCA(copy=True, iterated_power='auto', n_components=2, random_state=0,
  svd_solver='auto', tol=0.0, whiten=False)

In [26]:
pca.components_

array([[-1., -0.],
       [-0., -1.]])
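
The axis-aligned components above follow from the structure of this dataset: its two features are uncorrelated, so the covariance matrix is already diagonal and the principal directions coincide with the coordinate axes. A minimal NumPy-only check (variable names chosen for this example; the sign of each direction is arbitrary, which is why scikit-learn can report -1 rather than 1):

```python
import numpy as np

X2 = np.array([[-2, 1], [-1, 1], [-2, 3], [-1, 3],
               [3, 3], [4, 3], [3, 1], [4, 1]], dtype=float)

# Covariance normalised by N: the off-diagonal entries are zero.
cov2 = np.cov(X2, rowvar=False, bias=True)

# eigh returns eigenvalues in ascending order with eigenvectors as columns.
eigvals2, eigvecs2 = np.linalg.eigh(cov2)

# Reorder to largest variance first and store directions as rows.
directions = eigvecs2[:, ::-1].T
```

The first feature has the larger variance (6.5 against 1.0), so it becomes the first principal component.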