# <a href=https://competitions.codalab.org/competitions/16151>The Movie Recommendation Challenge</a> 

<i>Adapted from the original code by Isabelle Guyon by the Yellow Team:<br>
Sihem ABDOUN, Stephen BATIFOL, Abdallah BENZINE, Abdelhak LOUKKAL, Clément THIERRY and Yaohui WANG</i>

ALL INFORMATION, SOFTWARE, DOCUMENTATION, AND DATA ARE PROVIDED "AS-IS". THE CDS, CHALEARN, AND/OR OTHER ORGANIZERS OR CODE AUTHORS DISCLAIM ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY PARTICULAR PURPOSE, AND THE WARRANTY OF NON-INFRINGEMENT OF ANY THIRD PARTY'S INTELLECTUAL PROPERTY RIGHTS. IN NO EVENT SHALL AUTHORS AND ORGANIZERS BE LIABLE FOR ANY SPECIAL,
INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF SOFTWARE, DOCUMENTS, MATERIALS, PUBLICATIONS, OR INFORMATION MADE AVAILABLE FOR THE CHALLENGE. 

## Introduction


In [None]:
codedir = 'sample_code/'                        # Change this to the directory where you put the code
from sys import path; path.append(codedir)
%matplotlib inline
import seaborn as sns; sns.set()

## Fetch the data and load it

In [None]:
datadir = 'public_data/'                        # Change this to the directory where you put the input data
dataname = 'movierec'
basename = datadir  + dataname
!ls $basename*

In [None]:
import data_io
import eval
reload(data_io)
data = data_io.read_as_df(basename)                          # The data are loaded as a Pandas Data Frame
#data.to_csv(basename + '_train.csv', index=False)           # This allows saving the data in csv format

In [None]:
data.head()

In [None]:
data.describe() 

## Building a predictive model

Data matrices for training and making predictions.

In [None]:
import numpy as np
X_train = data.drop('target', axis=1)                   # This is the data matrix you already loaded (training data)
y_train = data['target'].values                         # These are the target values encoded as categorical variables
print 'Dimensions X_train=', X_train.shape, 'y_train=', y_train.shape
X_valid = data_io.read_as_df(basename, 'valid')

X_test = data_io.read_as_df(basename, 'test')

The initial regressor in your starting kit (in the <code>sample_code/</code> directory).

In [None]:
import regressor
reload(regressor)                               # If you make changes to your code you have to reload it
from regressor import Regressor
Regressor??
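The rest of this notebook calls <code>fit</code>, <code>predict</code>, <code>save</code> and <code>load</code> on the model. As a rough illustration of that interface, here is a minimal sketch of a baseline that always predicts the mean training target. The class name <code>MeanRegressor</code> and the <code>_model.pickle</code> file suffix are assumptions made for this example; the actual <code>regressor.py</code> in the starting kit may be organized differently.

```python
import pickle
import numpy as np

class MeanRegressor(object):
    """Hypothetical baseline: always predicts the mean of the training targets."""

    def __init__(self):
        self.mean_ = None

    def fit(self, X, y):
        # X is ignored by this baseline; only the targets are used
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        if self.mean_ is None:
            raise ValueError("call fit() before predict()")
        # One prediction per row of X
        return np.full(len(X), self.mean_)

    def save(self, path):
        # Assumed file naming; only the fitted state is pickled
        with open(path + '_model.pickle', 'wb') as f:
            pickle.dump(self.__dict__, f)

    def load(self, path):
        with open(path + '_model.pickle', 'rb') as f:
            self.__dict__.update(pickle.load(f))
        return self
```

Any model that exposes the same four methods could be dropped into the cells that follow in the same way.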

Train, run, and save your model and your predictions. If you saved a trained model and/or prediction results, the evaluation script looks for them and uses them in this order of priority: (1) use the saved predictions; (2) if there are no predictions, use the saved model without retraining and only test; (3) if neither exists, train and test a model from scratch. The cell below computes the predictions with <code>predict</code>, since the model is a regressor.

In [None]:
%%time
result_dir = 'res/'
outname = result_dir + dataname
clf = Regressor()
clf.fit(X_train, y_train)
Y_valid = clf.predict(X_valid)
Y_test = clf.predict(X_test)
clf.save(outname)
#clf.load(outname) # Uncomment to check reloading works
data_io.write(outname + '_valid.predict', Y_valid)
data_io.write(outname + '_test.predict', Y_test)

!ls $outname*

Compute the training error (mean squared error on the training data).

In [None]:
# Predict on the training data and score against the known training targets
y_predict = clf.predict(X_train)

print 'training MSE =', eval.mse(y_train, y_predict)
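The <code>eval</code> module is not shown in this notebook. Assuming its <code>mse</code> and <code>mae</code> functions follow the standard definitions of mean squared error and mean absolute error, they would behave like the stand-ins below (the function bodies are this example's own, not the starting kit's code).

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error: average of absolute differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy example: one prediction out of four is off by 2
y_true = [1, 2, 3, 4]
y_pred = [1, 2, 3, 2]
print(mse(y_true, y_pred))  # 4 / 4 = 1.0
print(mae(y_true, y_pred))  # 2 / 4 = 0.5
```

For both metrics lower is better, and the MSE penalizes large errors more heavily than the MAE.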

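Compute the cross-validation error. This is usually worse (higher) than the training error. Note that we split the training data internally into a training set and a validation set, because we do NOT have the labels of X_valid and X_test. Also note that this cell reports the mean absolute error (<code>eval.mae</code>) while the training cell reports the mean squared error (<code>eval.mse</code>), so the two numbers are not directly comparable.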

In [None]:
from sklearn.cross_validation import StratifiedShuffleSplit
# Example: 5 random stratified splits, each holding out 50% of the training data
skf = StratifiedShuffleSplit(y_train, n_iter=5, test_size=0.5, random_state=0)
for i, (idx_t, idx_v) in enumerate(skf, 1):
    Xtr = X_train.iloc[idx_t]                   # Training part of this split
    Ytr = y_train[idx_t]
    Xva = X_train.iloc[idx_v]                   # Held-out part of this split
    Yva = y_train[idx_v]
    clf = Regressor()                           # Fresh model for every split
    clf.fit(Xtr, Ytr)
    Y_predict = clf.predict(Xva)
    print 'Fold', i, 'validation MAE =', eval.mae(Y_predict, Yva)
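To make the splitting procedure explicit, here is a minimal sketch of repeated random 50/50 splits written with NumPy only. It is not stratified, unlike the scikit-learn splitter used above, and it runs on synthetic ratings with a mean-predictor baseline; the helper name <code>shuffle_splits</code> and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def shuffle_splits(n_samples, n_splits=5, test_size=0.5, seed=0):
    """Yield (train_idx, valid_idx) pairs from independent random permutations."""
    rng = np.random.RandomState(seed)
    n_valid = int(round(n_samples * test_size))
    for _ in range(n_splits):
        perm = rng.permutation(n_samples)
        yield perm[n_valid:], perm[:n_valid]

# Synthetic integer ratings between 1 and 5 (stand-in for the real targets)
y = np.random.RandomState(0).randint(1, 6, size=100).astype(float)

scores = []
for idx_t, idx_v in shuffle_splits(len(y)):
    baseline = y[idx_t].mean()                            # "Train": mean of training targets
    scores.append(np.mean(np.abs(y[idx_v] - baseline)))   # Validation MAE for this split

print('mean validation MAE over %d splits = %.3f' % (len(scores), np.mean(scores)))
```

Averaging the per-split scores gives a more stable estimate than any single split.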

It is <b><span style="color:red">important that you test your submission files before submitting them</span></b>. All you have to do to make a submission is modify the file <code>regressor.py</code> in the <code>sample_code/</code> directory, then run this test to make sure everything works fine. This is the actual program that will be run on the server to test your submission. The program looks for saved results and saved models in the subdirectory <code>res/</code>. If it finds them, it will use them: (1) if results are found, they are copied to the output directory; (2) if no results are found but a trained model is, the model is reloaded and no training occurs; (3) if nothing is found, a fresh model is trained and tested.

In [None]:
outdir = '../outputs'        # If you use result_dir as output directory, your submission will include your results

In [None]:
!python run.py $datadir $outdir
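The lookup order described above (saved results first, then a saved model, then training from scratch) can be sketched as follows. This is not the code of <code>run.py</code>: the function name <code>choose_action</code> and the <code>_model.pickle</code> suffix are assumptions, while the <code>_valid.predict</code> and <code>_test.predict</code> suffixes are the ones written earlier in this notebook.

```python
import os

def choose_action(result_dir, dataname):
    """Return which of the three cases applies for a given results directory."""
    base = os.path.join(result_dir, dataname)
    has_predictions = (os.path.exists(base + '_valid.predict') and
                       os.path.exists(base + '_test.predict'))
    has_model = os.path.exists(base + '_model.pickle')   # Assumed model file name
    if has_predictions:
        return 'copy_predictions'   # (1) reuse the saved results
    if has_model:
        return 'load_model'         # (2) reload the model, no training
    return 'train'                  # (3) train and test from scratch

print(choose_action('res/', 'movierec'))
```

If you change your model, stale files left in <code>res/</code> would take priority under this logic, so clear them before re-running the test.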

## Making your submission

The test program <code>run.py</code> prepares your <code>zip</code> file, ready to go. You will find it in the directory above the one where you ran your program. For large datasets, we recommend that <b><span style="color:red">you do NOT bundle the data with your submission</span></b>. The data directory is passed as an argument to <code>run.py</code>, and it is already there on the test server.
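Before uploading, you may want to check that no data files ended up in the archive. The sketch below lists the entries of a zip file whose path contains a data directory name; the helper name <code>bundled_data_files</code> is hypothetical, and the path of the zip file is whatever <code>run.py</code> produced on your machine.

```python
import zipfile

def bundled_data_files(zip_path, data_dirname='public_data'):
    """Return the archive entries that live under a directory named data_dirname."""
    with zipfile.ZipFile(zip_path) as zf:
        return [name for name in zf.namelist()
                if data_dirname in name.split('/')]
```

An empty list means that no entry sits under a directory with that name; it does not rule out data files stored elsewhere in the archive.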