#Project 1:  Global Baseline Predictors and RMSE
By Latif Masud

##Functions and Libraries
The following libraries and helper function keep the calculations simple. Most of the work is done by NumPy and its masked arrays.

In [3]:
import numpy as np
import numpy.ma as ma

from math import sqrt

# Returns the root-mean-square error between two equally shaped arrays.
# If either argument is a masked array, masked entries are left out of the mean.
def rmse(predictions, targets):
    return sqrt(((predictions - targets) ** 2).mean(axis=None))
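
As a quick check of how `rmse` treats masked arrays, here is a minimal sketch with made-up numbers. The arrays `demo_actual` and `demo_guess` are illustrative only and are not part of the project data. Only the unmasked entries contribute to the mean:

```python
import numpy as np
import numpy.ma as ma
from math import sqrt

def rmse(predictions, targets):
    return sqrt(((predictions - targets) ** 2).mean(axis=None))

# Illustrative data: the second entry is masked (a mask value of 1 hides it)
demo_actual = ma.array([1.0, 100.0, 3.0], mask=[0, 1, 0])
demo_guess = np.array([2.0, 2.0, 2.0])

# The unmasked differences are -1 and 1, so the RMSE is 1.0;
# the masked 100.0 does not enter the calculation
demo_rmse = rmse(demo_actual, demo_guess)
```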

##Defining Data
I start by defining a random set of ratings from five users, on a scale from 1 to 5, where a `0` stands in for a rating that is missing. Initially, I set it as a 2D list that will soon be converted to a NumPy masked array. I also define a `missing` array of ones and zeros, in which a `0` flags a missing rating and a `1` flags a rating that is present.

In [4]:
ratings = [[1,0,2,2,1],
           [0,4,5,4,5],
           [3,2,0,2,3],
           [1,5,1,4,0],
           [2,3,2,0,2]]
    
missing = np.array([[1,0,1,1,1],
                [0,1,1,1,1],
                [1,1,0,1,1],
                [1,1,1,1,0],
                [1,1,1,0,1]])

###Separating Data sets
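To separate the data into a training and a test dataset, I use NumPy masks. In a NumPy mask, a `1` marks a value to be masked (not used), while a `0` marks a value to be kept, so the zeros in `mask_test` pick out the five test ratings. The training mask hides every test rating and every missing rating.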

In [7]:
mask_test = np.array([[0,1,1,1,1],
                          [1,0,1,1,1],
                          [1,1,1,0,1],
                          [1,1,0,1,1],
                          [1,1,1,1,0]])
 
mask_training = np.logical_not(np.logical_and(mask_test, missing))

test = ma.array(ratings, mask = mask_test)
training = ma.array(ratings, mask = mask_training)

In [14]:
print test

[[1 -- -- -- --]
 [-- 4 -- -- --]
 [-- -- -- 2 --]
 [-- -- 1 -- --]
 [-- -- -- -- 2]]


In [15]:
print training

[[-- -- 2 2 1]
 [-- -- 5 4 5]
 [3 2 -- -- 3]
 [1 5 -- 4 --]
 [2 3 2 -- --]]
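
A small sanity check, sketched here on the assumption that the arrays are defined exactly as above, confirms that the two sets do not overlap and that together they cover every rating that is present. The names `in_test`, `in_training`, `overlap` and `covers_all_present` are introduced only for this illustration:

```python
import numpy as np
import numpy.ma as ma

ratings = [[1, 0, 2, 2, 1],
           [0, 4, 5, 4, 5],
           [3, 2, 0, 2, 3],
           [1, 5, 1, 4, 0],
           [2, 3, 2, 0, 2]]
missing = np.array([[1, 0, 1, 1, 1],
                    [0, 1, 1, 1, 1],
                    [1, 1, 0, 1, 1],
                    [1, 1, 1, 1, 0],
                    [1, 1, 1, 0, 1]])
mask_test = np.array([[0, 1, 1, 1, 1],
                      [1, 0, 1, 1, 1],
                      [1, 1, 1, 0, 1],
                      [1, 1, 0, 1, 1],
                      [1, 1, 1, 1, 0]])
mask_training = np.logical_not(np.logical_and(mask_test, missing))

test = ma.array(ratings, mask=mask_test)
training = ma.array(ratings, mask=mask_training)

# True where an entry is visible (not masked) in each set
in_test = ~ma.getmaskarray(test)
in_training = ~ma.getmaskarray(training)

# No rating should be visible in both sets
overlap = np.logical_and(in_test, in_training).any()

# Every present rating should be visible in exactly one of the two sets
covers_all_present = np.array_equal(np.logical_or(in_test, in_training),
                                    missing.astype(bool))
```

With this data, `test.count()` and `training.count()` give the number of visible ratings in each set (5 and 15).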


##Analysis
###Mean of Data set
To get the mean of the training dataset, I call the masked array's `mean` method, which averages only the unmasked entries:

In [8]:
print "Training Mean: ", training.mean()

Training Mean:  2.93333333333


###RMSE Values
To get the RMSE values, I call the rmse function defined above. The function takes the element-wise difference between the two matrices, squares it, and takes the mean of the resulting matrix. The square root of that mean is returned as the RMSE.

In [11]:
mean_matrix = np.full((5,5), training.mean())

rmse_training = rmse (training, mean_matrix)
rmse_test = rmse (test, mean_matrix)

print "Test: ", rmse_test, " Training: ", rmse_training

Test:  1.43913554299  Training:  1.33998341615
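
The same arithmetic can be followed by hand for the test set. This is a minimal sketch in plain Python, with the training and test values copied from the printed arrays above; the variable names are illustrative:

```python
from math import sqrt

# Unmasked values read off the printed training and test arrays
training_values = [2, 2, 1, 5, 4, 5, 3, 2, 3, 1, 5, 4, 2, 3, 2]
test_values = [1, 4, 2, 1, 2]

# Mean of the 15 training ratings (44 / 15)
training_mean = sum(training_values) / float(len(training_values))

# Difference, square, mean, then square root
squared_errors = [(v - training_mean) ** 2 for v in test_values]
test_rmse_by_hand = sqrt(sum(squared_errors) / len(squared_errors))
```

The result agrees with the test RMSE printed above.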


###Calculating Bias
To calculate the biases, I take the mean of each row (one per user) and of each column (one per sample) of the training data, and then subtract the overall training mean from it.

In [13]:
user_bias = training.mean(axis=1) - training.mean()
sample_bias = training.mean(axis=0) - training.mean()

print "User Bias: ", user_bias
print "Sample Bias: ", sample_bias

User Bias:  [-1.2666666666666664 1.7333333333333338 -0.2666666666666666
 0.40000000000000036 -0.5999999999999996]
Sample Bias:  [-0.9333333333333331 0.40000000000000036 0.06666666666666687
 0.40000000000000036 0.06666666666666687]
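
To make the row-wise calculation concrete, the sketch below recomputes the bias of the first user on its own and counts how many training ratings stand behind each bias. It assumes the same data and masks as above; `user0_bias`, `ratings_per_user` and `ratings_per_sample` are names introduced only for this illustration:

```python
import numpy as np
import numpy.ma as ma

ratings = [[1, 0, 2, 2, 1],
           [0, 4, 5, 4, 5],
           [3, 2, 0, 2, 3],
           [1, 5, 1, 4, 0],
           [2, 3, 2, 0, 2]]
missing = np.array([[1, 0, 1, 1, 1],
                    [0, 1, 1, 1, 1],
                    [1, 1, 0, 1, 1],
                    [1, 1, 1, 1, 0],
                    [1, 1, 1, 0, 1]])
mask_test = np.array([[0, 1, 1, 1, 1],
                      [1, 0, 1, 1, 1],
                      [1, 1, 1, 0, 1],
                      [1, 1, 0, 1, 1],
                      [1, 1, 1, 1, 0]])
training = ma.array(ratings,
                    mask=np.logical_not(np.logical_and(mask_test, missing)))

global_mean = training.mean()

# The first user has three training ratings (2, 2, 1), so the row mean is 5/3
user0_bias = training[0].mean() - global_mean

# Number of unmasked training ratings behind each user and sample bias
ratings_per_user = training.count(axis=1)
ratings_per_sample = training.count(axis=0)
```

With this split, every user bias and every sample bias is estimated from only three ratings.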


###Calculating Baseline
To calculate the baseline values, I start from a copy of the matrix of mean values, then loop over the user and sample biases and add the matching pair to each entry to produce the overall result.

In [18]:
# Copy the matrix: plain assignment would only alias mean_matrix,
# so re-running this cell would add the biases a second time
baseline = mean_matrix.copy()

for n in range(0, user_bias.shape[0]):
    for m in range(0, sample_bias.shape[0]):
        baseline[n][m] = baseline[n][m] + user_bias[n] + sample_bias[m]

print baseline

[[ 0.73333333  2.06666667  1.73333333  2.06666667  1.73333333]
 [ 3.73333333  5.06666667  4.73333333  5.06666667  4.73333333]
 [ 1.73333333  3.06666667  2.73333333  3.06666667  2.73333333]
 [ 2.4         3.73333333  3.4         3.73333333  3.4       ]
 [ 1.4         2.73333333  2.4         2.73333333  2.4       ]]


###Calculating Baseline RMSE

To calculate the baseline RMSE values for the training and test sets, I call the rmse function with the baseline matrix:

In [19]:
training_rmse = rmse(training, baseline)
test_rmse = rmse(test, baseline)

print "Training RMSE: ", training_rmse
print "Test RMSE: ", test_rmse

Training RMSE:  0.771722460186
Test RMSE:  1.28582010147
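
The double loop can also be expressed with NumPy broadcasting. The following is an alternative sketch, not the code used above: it rebuilds the training array from the same data, and `global_mean` and `baseline_vec` are names introduced here. It also assumes that a user or sample with no training ratings should get a bias of 0, which does not occur in this dataset.

```python
import numpy as np
import numpy.ma as ma

ratings = [[1, 0, 2, 2, 1],
           [0, 4, 5, 4, 5],
           [3, 2, 0, 2, 3],
           [1, 5, 1, 4, 0],
           [2, 3, 2, 0, 2]]
missing = np.array([[1, 0, 1, 1, 1],
                    [0, 1, 1, 1, 1],
                    [1, 1, 0, 1, 1],
                    [1, 1, 1, 1, 0],
                    [1, 1, 1, 0, 1]])
mask_test = np.array([[0, 1, 1, 1, 1],
                      [1, 0, 1, 1, 1],
                      [1, 1, 1, 0, 1],
                      [1, 1, 0, 1, 1],
                      [1, 1, 1, 1, 0]])
training = ma.array(ratings,
                    mask=np.logical_not(np.logical_and(mask_test, missing)))

global_mean = training.mean()

# ma.filled converts the masked-array results to plain arrays; a row or
# column without any training rating would get a bias of 0 (an assumption)
user_bias = ma.filled(training.mean(axis=1) - global_mean, 0.0)
sample_bias = ma.filled(training.mean(axis=0) - global_mean, 0.0)

# Broadcasting: a (5, 1) column plus a (1, 5) row fills the full (5, 5) grid,
# and nothing is modified in place, so re-running the code is harmless
baseline_vec = global_mean + user_bias[:, np.newaxis] + sample_bias[np.newaxis, :]
```

Because the result is built in a single expression, it never accumulates into an existing matrix, which sidesteps the aliasing issue that in-place updates can run into.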


##Summary
The training and test RMSE values are quite far apart. This is most likely because the five test values, [1,4,2,1,2], are for the most part very low, while the training values cover a far wider range.