# Run

This notebook runs every step needed to reproduce the result we obtained on Kaggle. The same pipeline is also available as a script, `run.py`, but since we ran all our tests in Jupyter, we provide a Python notebook as well.

Import the libraries needed

In [1]:
# All the "long" functions are in the file called helpers_run.py
from helpers_run import *

Names of the different data sets

In [2]:
# Name of the train file
TRAIN = 'train.csv'
try:
    # Open and immediately close the file to check that it is readable
    with open(TRAIN, 'r'):
        pass
except IOError:
    raise IOError('Cannot open file %s! Are you sure it exists in this directory?' % TRAIN)

# Name of the test file
TEST = 'test.csv'
try:
    # Same readability check for the test file
    with open(TEST, 'r'):
        pass
except IOError:
    raise IOError('Cannot open file %s! Are you sure it exists in this directory?' % TEST)

# Name of the training data
TRAINING_DATA = ['train_jet_0_wout_mass.csv', 'train_jet_0_with_mass.csv',
                 'train_jet_1_wout_mass.csv', 'train_jet_1_with_mass.csv',
                 'train_jet_2_wout_mass.csv', 'train_jet_2_with_mass.csv',
                 'train_jet_3_wout_mass.csv', 'train_jet_3_with_mass.csv']

# Name of the test data
TESTING_DATA = ['test_jet_0_wout_mass.csv', 'test_jet_0_with_mass.csv',
                'test_jet_1_wout_mass.csv', 'test_jet_1_with_mass.csv',
                'test_jet_2_wout_mass.csv', 'test_jet_2_with_mass.csv',
                'test_jet_3_wout_mass.csv', 'test_jet_3_with_mass.csv']

## Data Analysis and Splitting

If the files listed in `TRAINING_DATA` and `TESTING_DATA` do not exist yet, run the next cell. It splits the training and testing data sets into 8 subsets, one per model, as explained in the report.

In [None]:
data_analysis_splitting(TRAIN, TEST, TRAINING_DATA, TESTING_DATA)
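
The file names suggest that the split is made on the jet number (0 to 3) and on whether a mass value is present. The sketch below illustrates that idea only; it is not the code of `data_analysis_splitting`. The column names `PRI_jet_num` and `DER_mass_MMC`, the sentinel `-999.0` for a missing mass, and the helper `split_rows` are all assumptions.

```python
import csv
import io

# Assumed column names and missing-value sentinel (not confirmed by this notebook).
JET_COL = 'PRI_jet_num'
MASS_COL = 'DER_mass_MMC'
MISSING = -999.0


def split_rows(csv_file):
    """Group rows by (jet number, mass present or not): up to 8 groups."""
    groups = {}
    for row in csv.DictReader(csv_file):
        jet = int(float(row[JET_COL]))
        has_mass = float(row[MASS_COL]) != MISSING
        suffix = 'with_mass' if has_mass else 'wout_mass'
        key = 'jet_{}_{}'.format(jet, suffix)
        groups.setdefault(key, []).append(row)
    return groups


# Tiny in-memory example standing in for the real train.csv.
sample = io.StringIO(
    'Id,DER_mass_MMC,PRI_jet_num\n'
    '1,125.3,0\n'
    '2,-999.0,0\n'
    '3,98.1,3\n'
)
groups = split_rows(sample)
for name in sorted(groups):
    print(name, [r['Id'] for r in groups[name]])
```

Each group could then be written to its own CSV file, which would give one training file per model.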

## Cross-Validation

To run the cross-validation yourself, run the next cell. To use our $\lambda^*$ and $degree^*$ directly, skip it and run the cell after it.

The cross-validation takes a while to produce all the results.

In [3]:
%%time
perc_right_pred, degree_star, lambda_star = cross_validation(TRAINING_DATA, verbose=True)
print(u'Percentage of right pred on training set: {0:f}'.format(perc_right_pred))
print('degree_star = ', degree_star)
print('lambda_star = ', lambda_star)

Cross-validation with file train_jet_0_wout_mass.csv
-----------------------------------------------------
  Start the 10-fold Cross Validation!
  Start degree 8
  Finished Degree 8. Best lambda is  4.040e-03 with percentage wrong pred 0.050153
  --------------------
  Start degree 9
  Finished Degree 9. Best lambda is  5.040e+00 with percentage wrong pred 0.050153
  --------------------
  Start degree 10
  Finished Degree 10. Best lambda is  2.270e+03 with percentage wrong pred 0.049579
  --------------------
  Start degree 11
  Finished Degree 11. Best lambda is  6.500e-04 with percentage wrong pred 0.049579
  --------------------
  Start degree 12
  Finished Degree 12. Best lambda is  9.000e-06 with percentage wrong pred 0.049196
  --------------------
 10-fold Cross Validation finished!

  Max pred = 0.950804
  Lambda* =  9.000e-06
  Degree* = 12


Cross-validation with file train_jet_0_with_mass.csv
-----------------------------------------------------
  Start the 10-fold Cross Va

KeyboardInterrupt: 

In [4]:
degree_star = [12, 9, 7, 9, 10, 10, 8, 9]
lambda_star = [9e-06, 0.0212, 1.65e-05, 0.00027, 2.42e-06, 0.000309, 4e-05, 3.63e-10]
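
The submission file name used later (`RR_8models_10foldCV_CT.csv`) suggests ridge regression with 10-fold cross-validation. As a rough sketch of what selecting the best lambda for one fixed degree involves, the example below assumes closed-form ridge regression on a plain polynomial expansion and labels in {-1, 1}. The helpers `build_poly`, `ridge_regression` and `cv_error` are illustrative names, not functions taken from `helpers_run.py`, and the data are synthetic.

```python
import numpy as np


def build_poly(x, degree):
    """Stack a bias column and the powers x^1..x^degree of every feature."""
    powers = [x ** d for d in range(1, degree + 1)]
    return np.hstack([np.ones((x.shape[0], 1))] + powers)


def ridge_regression(y, tx, lambda_):
    """Closed-form ridge solution of (tx^T tx + lambda I) w = tx^T y."""
    a = tx.T @ tx + lambda_ * np.eye(tx.shape[1])
    return np.linalg.solve(a, tx.T @ y)


def cv_error(y, x, degree, lambda_, k=10, seed=1):
    """Mean fraction of wrong sign predictions over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_regression(y[train_idx], build_poly(x[train_idx], degree), lambda_)
        pred = np.where(build_poly(x[test_idx], degree) @ w >= 0, 1, -1)
        errors.append(np.mean(pred != y[test_idx]))
    return float(np.mean(errors))


# Synthetic data: the label is the sign of a noisy quadratic function of x.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(200, 1))
y = np.where(x[:, 0] ** 2 - 0.3 + 0.05 * rng.normal(size=200) >= 0, 1, -1)

# Grid search over lambda for a fixed degree; keep the lowest CV error.
lambdas = np.logspace(-6, 0, 7)
errors = [cv_error(y, x, degree=2, lambda_=l) for l in lambdas]
best_lambda = lambdas[int(np.argmin(errors))]
print('best lambda: {:.1e}, error: {:.3f}'.format(best_lambda, min(errors)))
```

Repeating this search for every candidate degree and every training file would yield one (degree, lambda) pair per model, which is the shape of the two lists hardcoded above.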

## Training

The next step is to train the models. First, we define some hardcoded booleans, which were tuned by hand as explained in the report. The training is quite fast.

In [18]:
# Per-model booleans for the cross-terms (ct), square-root (sqrt) and square (square) features
ct = [False, True, False, True, True, True, False, True]
sqrt = [True, True, True, True, False, True, False, True]
square = [False, True, False, True, False, True, True, False]

weights, prediction_train = training(TRAINING_DATA, degree_star, lambda_star, ct, sqrt, square)
print(u'\nIn total, there was {0:.2f}% of good predictions on the training set.\n'.format(prediction_train))


Training with file train_jet_0_wout_mass.csv
-----------------------------------------------------
  Good prediction: 95.142212
Training with file train_jet_0_with_mass.csv
-----------------------------------------------------
  Good prediction: 81.403732
Training with file train_jet_1_wout_mass.csv
-----------------------------------------------------
  Good prediction: 92.832584
Training with file train_jet_1_with_mass.csv
-----------------------------------------------------
  Good prediction: 80.487840
Training with file train_jet_2_wout_mass.csv
-----------------------------------------------------
  Good prediction: 94.681572
Training with file train_jet_2_with_mass.csv
-----------------------------------------------------
  Good prediction: 85.335357
Training with file train_jet_3_wout_mass.csv
-----------------------------------------------------
  Good prediction: 97.833446
Training with file train_jet_3_with_mass.csv
-----------------------------------------------------
  Goo
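
The boolean lists switch optional feature expansions on or off for each of the 8 models. One possible shape for such an expansion is sketched below. The function `expand_features` and the definition of each expansion (pairwise products for the cross-terms, square root of the absolute value) are assumptions, since the actual implementation lives in `helpers_run.py`. The `square` option is left out because its exact meaning is not spelled out in this notebook.

```python
from itertools import combinations

import numpy as np


def expand_features(x, degree, ct=False, sqrt=False):
    """Polynomial expansion with optional extra feature blocks (hypothetical)."""
    cols = [np.ones((x.shape[0], 1))]               # bias term
    cols += [x ** d for d in range(1, degree + 1)]  # x, x^2, ..., x^degree
    if ct:
        # Cross-terms: product of every pair of distinct original features.
        pairs = combinations(range(x.shape[1]), 2)
        cols += [(x[:, i] * x[:, j])[:, None] for i, j in pairs]
    if sqrt:
        # Square root of the absolute value, so negative inputs stay valid.
        cols.append(np.sqrt(np.abs(x)))
    return np.hstack(cols)


x = np.array([[1.0, 4.0, 9.0],
              [4.0, 1.0, 0.0]])
tx = expand_features(x, degree=2, ct=True, sqrt=True)
print(tx.shape)
```

With 3 input features and degree 2, the expanded matrix holds 1 bias column, 6 polynomial columns, 3 cross-term columns and 3 square-root columns.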

## Testing

Now, we just need to apply the weights to the test data and create the Kaggle submission.

In [19]:
test(TESTING_DATA, degree_star, ct, sqrt, square, weights, 'RR_8models_10foldCV_CT.csv')

Testing with file test_jet_0_wout_mass.csv
-----------------------------------------------------
Testing with file test_jet_0_with_mass.csv
-----------------------------------------------------
Testing with file test_jet_1_wout_mass.csv
-----------------------------------------------------
Testing with file test_jet_1_with_mass.csv
-----------------------------------------------------
Testing with file test_jet_2_wout_mass.csv
-----------------------------------------------------
Testing with file test_jet_2_with_mass.csv
-----------------------------------------------------
Testing with file test_jet_3_wout_mass.csv
-----------------------------------------------------
Testing with file test_jet_3_with_mass.csv
-----------------------------------------------------
Concatenate the predictions.
  0/568238 concatenated
  100000/568238 concatenated
  200000/568238 concatenated
  300000/568238 concatenated
  400000/568238 concatenated
  500000/568238 concatenated
Data are ready to be submi
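
Applying the weights and assembling the submission could look roughly like the sketch below. It assumes that a prediction is the sign of the linear score, that the per-model predictions are merged back in event-id order, and that the submission is a two-column CSV with the header `Id,Prediction`. The helpers `predict_labels` and `write_submission`, the ids and the weights are all hypothetical.

```python
import csv
import io


def predict_labels(weights, rows):
    """Return 1 or -1 for each feature row, from the sign of the dot product."""
    labels = []
    for row in rows:
        score = sum(w * v for w, v in zip(weights, row))
        labels.append(1 if score >= 0 else -1)
    return labels


def write_submission(pairs, out_file):
    """Write a header line, then one 'id,label' line per event."""
    writer = csv.writer(out_file, lineterminator='\n')
    writer.writerow(['Id', 'Prediction'])
    writer.writerows(pairs)


# Two hypothetical subsets: (event ids, feature rows, weights of that model).
subset_a = ([2, 3], [[1.0, 0.2], [1.0, 0.9]], [0.5, -1.0])
subset_b = ([1], [[1.0, 2.0]], [-1.0, 1.0])

pairs = []
for ids, rows, w in (subset_a, subset_b):
    pairs.extend(zip(ids, predict_labels(w, rows)))
pairs.sort()  # restore the event-id order before writing

buffer = io.StringIO()
write_submission(pairs, buffer)
print(buffer.getvalue())
```

In a real run, `buffer` would be replaced by a file opened for writing under the chosen submission name.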