groupe-vanilla/mini-projet

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <a href=https://competitions.codalab.org/competitions/16151>The Movie Recommendation Challenge</a> \n",
    "\n",
    "<i> Adapted from original code of Isabelle Guyon by the Yellow Team:<br>\n",
    "Sihem ABDOUN, Stephen BATIFOL, Abdallah BENZINE, Abdelhak LOUKKAL, Clément THIERRY and Yaohui WANG</i>\n",
    "\n",
    "ALL INFORMATION, SOFTWARE, DOCUMENTATION, AND DATA ARE PROVIDED \"AS-IS\". The CDS, CHALEARN, AND/OR OTHER ORGANIZERS OR CODE AUTHORS DISCLAIM ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY PARTICULAR PURPOSE, AND THE WARRANTY OF NON-INFRIGEMENT OF ANY THIRD PARTY'S INTELLECTUAL PROPERTY RIGHTS. IN NO EVENT SHALL AUTHORS AND ORGANIZERS BE LIABLE FOR ANY SPECIAL, \n",
    "INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF SOFTWARE, DOCUMENTS, MATERIALS, PUBLICATIONS, OR INFORMATION MADE AVAILABLE FOR THE CHALLENGE. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "codedir = 'sample_code/'                        # Change this to the directory where you put the code\n",
    "from sys import path; path.append(codedir)\n",
    "%matplotlib inline\n",
    "import seaborn as sns; sns.set()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fetch the data and load it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "datadir = 'public_data/'                        # Change this to the directory where you put the input data\n",
    "dataname = 'movierec'\n",
    "basename = datadir  + dataname\n",
    "!ls $basename*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import data_io\n",
    "import eval\n",
    "reload(data_io)\n",
    "data = data_io.read_as_df(basename)                          # The data are loaded as a Pandas Data Frame\n",
    "#data.to_csv(basename + '_train.csv', index=False)           # This allows saving the data in csv format"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "data.describe() "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Building a predictive model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data matrices for training and making predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "X_train = data.drop('target', axis=1)                   # This is the data matrix you already loaded (training data)\n",
    "y_train = data['target'].values                         # These are the target values encoded as categorical variables\n",
    "print 'Dimensions X_train=', X_train.shape, 'y_train=', y_train.shape\n",
    "X_valid = data_io.read_as_df(basename, 'valid')\n",
    "\n",
    "X_test = data_io.read_as_df(basename, 'test')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The initial classifier in your starting kit (in the sample_code directory)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import regressor\n",
    "reload(regressor)                               # If you make changes to your code you have to reload it\n",
    "from regressor import Regressor\n",
    "Regressor??"
   ]
  },
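  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, here is a minimal sketch of what a <code>Regressor</code>-like class can look like: a scikit-learn model wrapped behind the <code>fit</code>/<code>predict</code>/<code>save</code>/<code>load</code> interface used in this notebook. It is an illustration only (the model choice and the pickle file suffix are assumptions), not the actual starting-kit code, which you can inspect with <code>Regressor??</code> above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Illustrative sketch only -- NOT the starting-kit regressor.py.\n",
    "# It follows the fit/predict/save/load interface used in this notebook;\n",
    "# the underlying model and the '_model.pickle' suffix are assumptions.\n",
    "import pickle\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "\n",
    "class SketchRegressor(object):\n",
    "    def __init__(self):\n",
    "        self.model = RandomForestRegressor(n_estimators=100, random_state=0)\n",
    "\n",
    "    def fit(self, X, y):\n",
    "        self.model.fit(X, y)\n",
    "        return self\n",
    "\n",
    "    def predict(self, X):\n",
    "        return self.model.predict(X)\n",
    "\n",
    "    def save(self, path):\n",
    "        with open(path + '_model.pickle', 'wb') as f:\n",
    "            pickle.dump(self, f)\n",
    "\n",
    "    def load(self, path):\n",
    "        with open(path + '_model.pickle', 'rb') as f:\n",
    "            return pickle.load(f)"
   ]
  },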
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Train, run, and save your classifier and your predictions. If you saved a trained model and/or prediction results, the evaluation script will look for those and use those in priority [(1) use saved predictions; (2) if no predictions, use saved model, do not retrain, just test; (3) if neither, train and test model from scratch]. Compute the predictions with predict_proba, this is more versatile."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%time \n",
    "result_dir = 'res/'\n",
    "outname = result_dir + dataname\n",
    "%timeit \n",
    "clf = Regressor()\n",
    "clf.fit(X_train, y_train)\n",
    "Y_valid = clf.predict(X_valid)\n",
    "Y_test = clf.predict(X_test)\n",
    "clf.save(outname)\n",
    "#clf.load(outname) # Uncomment to check reloading works\n",
    "data_io.write(outname + '_valid.predict', Y_valid)\n",
    "data_io.write(outname + '_test.predict', Y_test)\n",
    "\n",
    "!ls $outname*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute the training accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics import confusion_matrix\n",
    "from sklearn.metrics import mean_squared_error\n",
    "# Directly predicts\n",
    "\n",
    "y_predict = clf.predict(X_train)\n",
    "\n",
    "print 'valid accuracy =', eval.mse(y_train, y_predict)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute cross-validation accuracy. This is usually worse than the training accuracy. Notice that we internally split the training data into training and validation set (this is because we do NOT have the labels of X_valid and X_test)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.cross_validation import StratifiedShuffleSplit\n",
    "# This is just an example of 2-fold cross-validation\n",
    "skf = StratifiedShuffleSplit(y_train, n_iter=5, test_size=0.5, random_state=0)\n",
    "i=0\n",
    "for idx_t, idx_v in skf:\n",
    "    i=i+1\n",
    "    Xtr = X_train.iloc[idx_t]\n",
    "    Ytr = y_train[idx_t]\n",
    "    Xva = X_train.iloc[idx_v]\n",
    "    Yva = y_train[idx_v]\n",
    "    clf = Regressor()\n",
    "    clf.fit(Xtr, Ytr)\n",
    "    Y_predict = clf.predict(Xva)\n",
    "    print 'Fold', i, 'validation accuracy = ', eval.mae(Y_predict, Yva)"
   ]
  },
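  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that <code>sklearn.cross_validation</code> was removed in scikit-learn 0.20. If you run a more recent version, the cell below is an equivalent sketch using <code>sklearn.model_selection</code> (a plain <code>ShuffleSplit</code> is used, since stratification is not required for a regression target); it assumes the same <code>Regressor</code> and <code>eval</code> helpers as above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Same 5 x 50/50 shuffle split with the newer scikit-learn API (sketch only).\n",
    "from sklearn.model_selection import ShuffleSplit\n",
    "ss = ShuffleSplit(n_splits=5, test_size=0.5, random_state=0)\n",
    "for i, (idx_t, idx_v) in enumerate(ss.split(X_train), 1):\n",
    "    clf = Regressor()\n",
    "    clf.fit(X_train.iloc[idx_t], y_train[idx_t])\n",
    "    Y_predict = clf.predict(X_train.iloc[idx_v])\n",
    "    print 'Fold', i, 'validation MAE =', eval.mae(y_train[idx_v], Y_predict)"
   ]
  },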
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is <b><span style=\"color:red\">important that you test your submission files before submitting them</span></b>. All you have to do to make a submission is modify the file <code>regressor.py</code> in the <code>sample_code/</code> directory, then run this test to make sure everything works fine. This is the actual program that will be run on the server to test your submission.  The program looks for saved results and saved models in the subdirectory <code>res/</code>. If it finds them, it will use them: (1) If results are found, then are copied to the output directory; (2) If no results but a trained model is found, it is reloaded and no training occurs; (3) If nothing is found a fresh model is trained and tested."
   ]
  },
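  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make that priority concrete, the cell below is a minimal sketch of the decision logic described above, written with the <code>Regressor</code> and <code>data_io</code> helpers already used in this notebook. It is a simplification for illustration (the saved-model file name and the <code>load</code> return convention are assumptions); the authoritative logic is in <code>run.py</code>."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Sketch of the reload priority described above (illustration only;\n",
    "# the authoritative implementation is run.py).\n",
    "import os, shutil\n",
    "\n",
    "def prepare_outputs(dataname, result_dir, output_dir, X_train, y_train, X_valid, X_test):\n",
    "    saved_valid = os.path.join(result_dir, dataname + '_valid.predict')\n",
    "    saved_test = os.path.join(result_dir, dataname + '_test.predict')\n",
    "    saved_model = os.path.join(result_dir, dataname + '_model.pickle')  # hypothetical file name\n",
    "    if os.path.isfile(saved_valid) and os.path.isfile(saved_test):\n",
    "        shutil.copy(saved_valid, output_dir)   # (1) reuse saved predictions as-is\n",
    "        shutil.copy(saved_test, output_dir)\n",
    "        return\n",
    "    clf = Regressor()\n",
    "    if os.path.isfile(saved_model):\n",
    "        clf = clf.load(os.path.join(result_dir, dataname))  # (2) reload, do not retrain (assumes load returns the model)\n",
    "    else:\n",
    "        clf.fit(X_train, y_train)              # (3) nothing saved: train from scratch\n",
    "    data_io.write(os.path.join(output_dir, dataname + '_valid.predict'), clf.predict(X_valid))\n",
    "    data_io.write(os.path.join(output_dir, dataname + '_test.predict'), clf.predict(X_test))"
   ]
  },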
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "outdir = '../outputs'        # If you use result_dir as output directory, your submission will include your results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!python run.py $datadir $outdir"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Making your submission\n",
    "\n",
    "The test program <code>run.py</code> prepares your <code>zip</code> file, ready to go. You find it in the directory above where you ran your program. For large datasets, we recommend that <b><span style=\"color:red\">you do NOT bundle the data with your submission</span></b>. The data directory is passed as an argument to run.py, and it is already there on the test server."
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
