# Winery classification with the multivariate Gaussian

In this notebook, we return to winery classification, using the full set of 13 features.

## 1. Load in the data 

As usual, we start by loading in the Wine data set. Make sure the file `wine.data.txt` is in the same directory as this notebook.

Recall that there are 178 data points, each with 13 features and a label (1, 2, or 3). As before, we will divide these into a training set of 130 points and a test set of 48 points.

In [1]:
# Standard includes
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
# Useful module for dealing with the Gaussian density
from scipy.stats import norm, multivariate_normal 

In [2]:
# Load data set.
data = np.loadtxt('wine.data.txt', delimiter=',')
# Names of features
featurenames = ['Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash','Magnesium', 'Total phenols', 
                'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue', 
                'OD280/OD315 of diluted wines', 'Proline']
# Split 178 instances into training set (trainx, trainy) of size 130 and test set (testx, testy) of size 48
np.random.seed(0)
perm = np.random.permutation(178)
trainx = data[perm[0:130],1:14]
trainy = data[perm[0:130],0]
testx = data[perm[130:178], 1:14]
testy = data[perm[130:178],0]
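The splitting recipe above can be sanity-checked without the data file. The following is a minimal sketch that applies the same permutation-and-slice steps to a stand-in array; `fake_data` and the `split_*` names are hypothetical and the values are random, not the Wine data.

```python
import numpy as np

# Hypothetical stand-in for the Wine data: 178 rows, column 0 holds a label
# in {1, 2, 3}, columns 1..13 hold features. Values are random, not real data.
rng = np.random.RandomState(1)
fake_data = np.hstack([rng.randint(1, 4, size=(178, 1)), rng.randn(178, 13)])

# Same recipe as above: fix the seed, permute the row indices, then slice.
np.random.seed(0)
perm = np.random.permutation(178)
train_idx, test_idx = perm[0:130], perm[130:178]

split_trainx, split_trainy = fake_data[train_idx, 1:14], fake_data[train_idx, 0]
split_testx, split_testy = fake_data[test_idx, 1:14], fake_data[test_idx, 0]

# The two index sets are disjoint and together cover every row exactly once.
assert len(np.intersect1d(train_idx, test_idx)) == 0
assert len(np.union1d(train_idx, test_idx)) == 178
print(split_trainx.shape, split_testx.shape)
```

Because the seed is fixed, rerunning the cell reproduces the same split; a different seed would generally assign different points to the test set.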

## 2. Fit a Gaussian generative model

We now define a function that fits a Gaussian generative model to the data.
For each class (`j=1,2,3`), we have:
* `pi[j]`: the class weight
* `mu[j,:]`: the mean, a 13-dimensional vector
* `sigma[j,:,:]`: the 13x13 covariance matrix

Because Python arrays are indexed starting at zero and we aren't using `j=0`, each array carries one unused leading slot: `pi` is a length-4 array, `mu` is a 4x13 array and `sigma` is a 4x13x13 array.

In [3]:
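def fit_generative_model(x,y):
    k = 3  # labels 1,2,...,k
    d = (x.shape)[1]  # number of features
    # Index 0 of each array is left unused so that index j matches label j
    mu = np.zeros((k+1,d))
    sigma = np.zeros((k+1,d,d))
    pi = np.zeros(k+1)
    for label in range(1,k+1):
        indices = (y == label)  # boolean mask selecting the points of this class
        mu[label] = np.mean(x[indices,:], axis=0)
        # bias=True normalizes by N (maximum-likelihood estimate) rather than N-1
        sigma[label] = np.cov(x[indices,:], rowvar=False, bias=True)
        pi[label] = float(np.sum(indices))/float(len(y))  # fraction of points in this class
    return mu, sigma, pi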

In [4]:
# Fit a Gaussian generative model to the training data
mu, sigma, pi = fit_generative_model(trainx,trainy)
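A few properties of the fitted parameters are easy to check: the class weights should sum to one, and each covariance matrix should be symmetric with no negative eigenvalues. The sketch below restates the fitting step under a hypothetical name (`fit_gaussian_per_class`) and runs it on synthetic data (`toy_x`, `toy_y`), so it does not depend on the Wine file.

```python
import numpy as np

def fit_gaussian_per_class(x, y, k=3):
    """Hypothetical restatement of the per-class Gaussian fit, for illustration."""
    d = x.shape[1]
    mu = np.zeros((k + 1, d))
    sigma = np.zeros((k + 1, d, d))
    pi = np.zeros(k + 1)
    for label in range(1, k + 1):
        rows = x[y == label]
        mu[label] = rows.mean(axis=0)
        sigma[label] = np.cov(rows, rowvar=False, bias=True)
        pi[label] = len(rows) / len(y)
    return mu, sigma, pi

# Synthetic data: 130 points in 13 dimensions, with class sizes 50, 40 and 40.
rng = np.random.RandomState(0)
toy_y = np.repeat([1, 2, 3], [50, 40, 40])
toy_x = rng.randn(130, 13) + toy_y[:, None]  # shift each class by its label

toy_mu, toy_sigma, toy_pi = fit_gaussian_per_class(toy_x, toy_y)

# Shapes include the unused slot at index 0.
print(toy_pi.shape, toy_mu.shape, toy_sigma.shape)

# Class weights sum to one; covariances are symmetric and positive semidefinite.
assert np.isclose(toy_pi.sum(), 1.0)
for label in (1, 2, 3):
    assert np.allclose(toy_sigma[label], toy_sigma[label].T)
    assert np.all(np.linalg.eigvalsh(toy_sigma[label]) > -1e-10)
```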

In [5]:
test_label = 1
test_d = trainx.shape[1]
test_mu = np.zeros((4, test_d))
test_indices = (trainy == test_label)
test_mu[test_label] = np.mean(trainx[test_indices, :], axis = 0)
test_mu[1]

array([1.37853488e+01, 2.02232558e+00, 2.42790698e+00, 1.68813953e+01,
       1.05837209e+02, 2.85162791e+00, 2.99627907e+00, 2.89069767e-01,
       1.93069767e+00, 5.63023256e+00, 1.06232558e+00, 3.16674419e+00,
       1.14190698e+03])

In [6]:
sigma[1, :, :].shape

(13, 13)

In [7]:
test_features = [0, 2]
testx[1, test_features].shape

(2,)

In [8]:
mu[1, test_features]

array([13.78534884,  2.42790698])

In [9]:
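print(sigma[1])
# Indexing with two lists pairs them elementwise, so this returns the diagonal
# entries (0,0) and (2,2), i.e. the two variances, not the 2x2 submatrix.
sigma[1, test_features, test_features]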

[[ 2.33252785e-01 -1.35961601e-02 -3.93531639e-03 -3.13598161e-01
   1.05226609e+00  6.06773391e-02  7.52687399e-02  4.65613845e-03
   6.21497566e-02  2.21752244e-01  1.14922120e-02 -1.16165495e-03
   4.04223580e+01]
 [-1.35961601e-02  4.31329475e-01 -9.77187669e-03  2.38159546e-01
  -2.44040022e-01 -1.37782044e-02 -4.24053002e-02 -1.89085992e-03
  -5.46760411e-02 -2.14098215e-01 -3.71030827e-02  1.33378042e-02
  -4.66765279e+01]
 [-3.93531639e-03 -9.77187669e-03  3.67746890e-02  2.35263386e-01
   5.65473229e-01  3.68015143e-03 -1.40778799e-03  4.47014602e-03
  -1.05729584e-02  1.23742564e-04  5.77928610e-03 -5.94402380e-03
   6.13422390e+00]
 [-3.13598161e-01  2.38159546e-01  2.35263386e-01  6.04011898e+00
   5.56208761e+00 -1.75295295e-01 -3.16022715e-01  1.40292050e-02
  -2.37452136e-01 -5.97414278e-01 -1.85613845e-02 -6.50373175e-02
  -8.17156842e+01]
 [ 1.05226609e+00 -2.44040022e-01  5.65473229e-01  5.56208761e+00
   1.18461871e+02  1.38398594e+00  6.48464035e-01  1.17290427e-01


array([0.23325279, 0.03677469])
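The two-element result above comes from how NumPy pairs index lists. As a minimal sketch on a small made-up matrix (the name `cov` and its values are illustrative only), the following contrasts that elementwise pairing with `np.ix_`, which selects the full submatrix:

```python
import numpy as np

# Small symmetric matrix standing in for one class's covariance (made-up values).
cov = np.array([[4.0, 1.0, 0.5],
                [1.0, 3.0, 0.2],
                [0.5, 0.2, 2.0]])
features = [0, 2]

# Two index lists are paired elementwise: entries (0,0) and (2,2), the variances.
diag_entries = cov[features, features]

# np.ix_ builds an open mesh, combining every row index with every column index.
submatrix = cov[np.ix_(features, features)]

print(diag_entries.shape)  # (2,)
print(submatrix.shape)     # (2, 2)
```

For the 3-dimensional `sigma` array, the same idea would read `sigma[1][np.ix_(test_features, test_features)]`.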

## 3. Use the model to make predictions on the test set

<font color="magenta">**For you to do**</font>: Define a general purpose testing routine that takes as input:
* the arrays `pi`, `mu`, `sigma` defining the generative model, as above
* the test set (points `tx` and labels `ty`)
* a list of features `features` (chosen from 0-12)

It should return the number of mistakes made by the generative model on the test data, *when restricted to the specified features*. For instance, using just the three features 2 (`'Ash'`), 4 (`'Magnesium'`) and 6 (`'Flavanoids'`) results in 7 mistakes (out of 48 test points), so 

        `test_model(mu, sigma, pi, [2,4,6], testx, testy)` 

should print 7/48.

**Hint:** The way you restrict attention to a subset of features is by choosing the corresponding coordinates of the full 13-dimensional mean and the appropriate submatrix of the full 13x13 covariance matrix.
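In symbols, the rule such a routine would implement can be sketched as follows. The notation is introduced here for illustration: $F$ denotes the chosen feature subset, $x_F$ and $\mu_{j,F}$ the corresponding coordinates of the point and of the class mean, and $\Sigma_{j,FF}$ the submatrix of the class covariance with rows and columns in $F$.

```latex
\hat{y}(x) = \arg\max_{j \in \{1,2,3\}}
  \Big[ \log \pi_j + \log \mathcal{N}\big(x_F \,;\, \mu_{j,F},\, \Sigma_{j,FF}\big) \Big]
```

Working with log-probabilities rather than products of densities avoids numerical underflow when the densities are very small.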

In [10]:
# Now test the performance of a predictor based on a subset of features
from scipy.stats import multivariate_normal

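def test_model(mu, sigma, pi, features, tx, ty):
    ###
    ### Your code goes here
    ###
    k = 3
    num_pts = len(ty)
    # Column 0 is unused so that column j holds the score for label j
    scores = np.zeros((num_pts, k + 1))
    for i in range(num_pts):
        for label in range(1, k+1):
            # Score = log class weight + log density under the class Gaussian.
            # Note: sigma[label, features, features] pairs the two index lists
            # elementwise, so cov receives only the variances of the chosen
            # features (a diagonal covariance). The full submatrix would be
            # sigma[label][np.ix_(features, features)].
            scores[i, label] = np.log(pi[label]) + \
                               multivariate_normal.logpdf(tx[i, features],
                                                          mean = mu[label, features],
                                                          cov = sigma[label, features, features])
    # Predict the label with the highest score, skipping the unused column 0
    predictions = np.argmax(scores[:, 1:4], axis = 1) + 1
    errors = np.sum(predictions != ty)
    print(scores)
    print(predictions)
    print(errors)
    return errors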
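For comparison, here is a hedged sketch of a variant that follows the hint literally by passing the full covariance submatrix, and that returns the error count instead of printing the scores. It is not the notebook's solution: the function name `count_errors_full_cov` and the synthetic `toy_*` data are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def count_errors_full_cov(mu, sigma, pi, features, tx, ty, k=3):
    """Hypothetical variant: full covariance submatrix, returns the error count."""
    features = list(features)
    sub = np.ix_(features, features)  # selects rows and columns in `features`
    scores = np.zeros((len(ty), k + 1))  # column 0 is unused
    for label in range(1, k + 1):
        # logpdf accepts a batch of points, so no loop over test points is needed.
        scores[:, label] = np.log(pi[label]) + multivariate_normal.logpdf(
            tx[:, features], mean=mu[label, features], cov=sigma[label][sub])
    predictions = np.argmax(scores[:, 1:], axis=1) + 1
    return int(np.sum(predictions != ty))

# Synthetic, well-separated classes in 4 dimensions (not the Wine data).
rng = np.random.RandomState(0)
d = 4
toy_mu = np.zeros((4, d))
toy_sigma = np.zeros((4, d, d))
toy_pi = np.array([0.0, 1 / 3, 1 / 3, 1 / 3])
for label in (1, 2, 3):
    toy_mu[label] = 10.0 * label   # class means far apart
    toy_sigma[label] = np.eye(d)   # unit covariance
toy_ty = np.repeat([1, 2, 3], 20)
toy_tx = toy_mu[toy_ty] + rng.randn(60, d)

errors = count_errors_full_cov(toy_mu, toy_sigma, toy_pi, [0, 2], toy_tx, toy_ty)
print("%d/%d" % (errors, len(toy_ty)))
```

With a single feature the two approaches coincide, since a 1x1 submatrix is just the variance. With several correlated features, the diagonal and full-covariance versions can give different scores and therefore different error counts.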

### <font color="magenta">Fast exercises</font>

*Note down the answers to these questions. You will need to enter them as part of this week's assignment.*

Exercise 1. How many errors are made on the test set when using the single feature 'Ash'?

In [11]:
test_model(mu, sigma, pi, [2], testx, testy)

[[  0.          -0.73103155  -1.29075971  -1.14761825]
 [  0.          -0.71281882  -0.71242668  -0.79974203]
 [  0.          -0.38438893  -0.83740347  -0.47955322]
 [  0.          -1.83571576  -0.77606112  -2.17137311]
 [  0.          -0.87550022  -1.39172724  -1.3764216 ]
 [  0.          -1.83571576  -0.77606112  -2.17137311]
 [  0.          -0.43649757  -0.78292091  -0.51080773]
 [  0.          -1.23785764  -1.61770454  -1.93492599]
 [  0.          -0.82462474  -1.35718094  -1.29641686]
 [  0.          -0.59623838  -0.72791067  -0.66979803]
 [  0.          -1.04444226  -1.50070885  -1.63885751]
 [  0.          -2.25625273  -2.16582811  -3.45552555]
 [  0.          -0.44446564  -1.03594161  -0.66300357]
 [  0.          -0.43649757  -0.78292091  -0.51080773]
 [  0.          -0.43649757  -0.78292091  -0.51080773]
 [  0.          -2.0593276   -2.06556965  -3.16472278]
 [  0.          -0.98540898  -1.46349119  -1.54764192]
 [  0.          -0.43649757  -0.78292091  -0.51080773]
 [  0.    

Exercise 2. How many errors when using 'Alcohol' and 'Ash'?

In [12]:
test_model(mu, sigma, pi, [0,2], testx, testy)

[[  0.          -0.94135798  -5.94331748  -2.34890616]
 [  0.          -4.89998652  -1.01937348  -2.05067452]
 [  0.          -2.61472664  -1.56502041  -0.98557798]
 [  0.          -5.14119704  -1.19032026  -3.05218148]
 [  0.          -1.07089625  -5.77042046  -2.45580296]
 [  0.          -3.12376806  -2.08387151  -2.47709815]
 [  0.          -3.15273889  -1.33725891  -1.17293344]
 [  0.          -1.42948493  -5.67946638  -2.87959119]
 [  0.          -2.93144638  -2.1395051   -1.76723243]
 [  0.          -5.08144086  -1.01994569  -2.05480382]
 [  0.          -1.3980856   -4.33666898  -2.14562609]
 [  0.          -2.46273565  -6.76290351  -4.63173084]
 [  0.          -6.8696285   -1.41637004  -2.99850856]
 [  0.          -0.71801137  -3.92531505  -1.11215283]
 [  0.          -1.69409583  -2.11783529  -0.8135628 ]
 [  0.         -10.78500388  -2.83271926  -6.74562978]
 [  0.          -4.99890045  -1.78364182  -2.72233833]
 [  0.          -0.6514103   -4.39322994  -1.27706804]
 [  0.    

Exercise 3. How many errors when using 'Alcohol', 'Ash', and 'Flavanoids'?

In [13]:
test_model(mu, sigma, pi, [0,2,6], testx, testy)

[[  0.          -1.96222922  -8.43093705 -55.21817814]
 [  0.          -9.26448331  -1.71976352  -9.6135414 ]
 [  0.         -14.51458033  -3.11481762  -1.35152751]
 [  0.         -24.27323096  -3.88268978  -2.88065025]
 [  0.          -1.58395535  -7.87233721 -49.41839493]
 [  0.          -3.51854111  -2.96843817 -26.12309541]
 [  0.         -13.71645004  -2.70058056  -2.11716321]
 [  0.          -1.5474256   -6.72385248 -30.5072356 ]
 [  0.         -21.74771515  -4.77845169  -1.55034105]
 [  0.          -6.03879743  -1.75887084 -21.09532866]
 [  0.          -2.0896953   -5.1291799  -23.06563876]
 [  0.          -2.44334731  -8.05129758 -37.44004639]
 [  0.          -8.48445531  -2.08188505 -18.54038554]
 [  0.          -0.81947255  -5.58931088 -40.84898432]
 [  0.         -10.13692072  -3.20917731  -3.14306435]
 [  0.         -12.63973684  -3.48452523 -21.26536268]
 [  0.         -24.13093437  -4.47601135  -2.55080709]
 [  0.          -0.93750271  -5.32667374 -26.21638417]
 [  0.    

Exercise 4. How many errors when using all 13 features?

In [14]:
test_model(mu, sigma, pi, range(0,13), testx, testy)

[[   0.          -16.84364058  -35.59331471 -120.09544067]
 [   0.          -59.90487919  -20.96702235  -50.06360376]
 [   0.          -66.55858554  -29.27494293  -17.50471788]
 [   0.          -75.26536063  -35.9641739   -16.33168785]
 [   0.          -16.53351659  -36.05894088  -90.86292247]
 [   0.          -18.48575531  -22.3475438   -68.06005308]
 [   0.          -58.53120536  -26.40336355  -18.57864695]
 [   0.          -14.48586549  -41.19337457  -89.57747127]
 [   0.          -57.76154262  -23.83655325  -19.45071216]
 [   0.          -37.98318038  -17.70772309  -53.74304314]
 [   0.          -14.91513575  -25.61279158  -64.18663585]
 [   0.          -16.29908513  -27.18448248  -95.34359054]
 [   0.          -37.49474111  -17.48063489  -59.07424474]
 [   0.          -14.15990096  -46.96131664 -104.66747501]
 [   0.          -67.43882716  -48.19537414  -17.63237814]
 [   0.          -58.01556369  -22.2559594   -51.6702847 ]
 [   0.          -81.38574274  -40.24876434  -20.8238987

Exercise 5. In lecture, we got somewhat different answers to these questions. Why do you think that might be?