In [1]:
# Useful starting lines
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
%load_ext autoreload
%autoreload 2
from IPython import display
# Import everything in the functions folder
from functions.proj1_helpers import *
from functions.clean_data import *
from functions.least_squares import *
from functions.split import *

# Cleaning and Analysis

In this notebook, we try different ways of cleaning the data and analyse their effect on training and prediction. To measure the effect, we fit a least-squares model on 80% of the training data and test it on the remaining 20%.

Before doing that, we need to make sure that any problem present in the training data is also present in the test data.

## Check Problems in Training and Testing data

First, we load the training and testing data.

In [28]:
DATA_TRAIN_PATH = 'data/train.csv' 
_, _, _, headers = load_data(DATA_TRAIN_PATH)
y, tX, ids = load_csv_data(DATA_TRAIN_PATH)

DATA_TEST_PATH = 'data/test.csv'
_, tX_test, ids_test = load_csv_data(DATA_TEST_PATH)

Check the share of NaNs in each column (printed as a fraction between 0 and 1) for both the training and the test data.

In [3]:
nan_train = perc_nan(tX)
nan_test = perc_nan(tX_test)
print("Number of variables: %i"%(len(tX[0])))

print("  train      test  \t Parameters")
for i in range(len(nan_train)):
    print("%f - %f \t %s"%(nan_train[i], nan_test[i], headers[i+2]))

Number of variables: 30
  train      test  	 Parameters
0.152456 - 0.152204 	 DER_mass_MMC
0.000000 - 0.000000 	 DER_mass_transverse_met_lep
0.000000 - 0.000000 	 DER_mass_vis
0.000000 - 0.000000 	 DER_pt_h
0.709828 - 0.708851 	 DER_deltaeta_jet_jet
0.709828 - 0.708851 	 DER_mass_jet_jet
0.709828 - 0.708851 	 DER_prodeta_jet_jet
0.000000 - 0.000000 	 DER_deltar_tau_lep
0.000000 - 0.000000 	 DER_pt_tot
0.000000 - 0.000000 	 DER_sum_pt
0.000000 - 0.000000 	 DER_pt_ratio_lep_tau
0.000000 - 0.000000 	 DER_met_phi_centrality
0.709828 - 0.708851 	 DER_lep_eta_centrality
0.000000 - 0.000000 	 PRI_tau_pt
0.000000 - 0.000000 	 PRI_tau_eta
0.000000 - 0.000000 	 PRI_tau_phi
0.000000 - 0.000000 	 PRI_lep_pt
0.000000 - 0.000000 	 PRI_lep_eta
0.000000 - 0.000000 	 PRI_lep_phi
0.000000 - 0.000000 	 PRI_met
0.000000 - 0.000000 	 PRI_met_phi
0.000000 - 0.000000 	 PRI_met_sumet
0.000000 - 0.000000 	 PRI_jet_num
0.399652 - 0.400286 	 PRI_jet_leading_pt
0.399652 - 0.400286 	 PRI_jet_leading_eta
0.399652 -

We can see that whenever the share of NaNs for a given parameter is greater than 0 in the training data, it is also greater than 0 in the test data, and the two values are always close. Therefore, any cleaning operation we apply to the training data can be applied to the test data as well.
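The helper `perc_nan` is not shown in this notebook. As a rough sketch of what such a check could look like, the snippet below assumes that missing values are encoded with the sentinel -999 (the value mentioned at the end of this notebook) and uses a hypothetical function name, `fraction_missing`:

```python
import numpy as np

MISSING = -999.0  # assumed sentinel for missing values


def fraction_missing(x, missing=MISSING):
    """Return, for each column of x, the fraction of entries equal to the sentinel."""
    x = np.asarray(x, dtype=float)
    # Comparing gives a boolean array; its column-wise mean is the fraction of matches.
    return np.mean(x == missing, axis=0)


# Toy data: column 0 has one missing value out of four rows, column 1 has none.
toy = np.array([[1.0, 2.0],
                [-999.0, 3.0],
                [4.0, 5.0],
                [6.0, 7.0]])
toy_fractions = fraction_missing(toy)  # column 0: 0.25, column 1: 0.0
print(toy_fractions)
```

Computing this vector once for the training set and once for the test set gives the two columns printed above.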

## Benchmark

Let's apply least squares to the training data as they are right now. This gives us a benchmark to see whether we can do better just by removing problems.

First, we split the data into training and testing sets.

In [4]:
ratio = 0.8
x_train, y_train, x_test, y_test = split_non_random(tX, y, ratio)
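The implementation of `split_non_random` lives in `functions/split.py` and is not reproduced here. A plausible reading of the name is a deterministic split that keeps the first rows for training and the rest for testing; the sketch below shows that idea under this assumption, with the hypothetical name `split_in_order`:

```python
import numpy as np


def split_in_order(x, y, ratio):
    """Hypothetical stand-in: first `ratio` of the rows for training, the rest for testing."""
    n_train = int(ratio * len(y))
    return x[:n_train], y[:n_train], x[n_train:], y[n_train:]


# Toy data: 10 rows, 2 features.
x_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)
x_tr, y_tr, x_te, y_te = split_in_order(x_demo, y_demo, 0.8)
print(len(y_tr), len(y_te))
```

A deterministic split means every cleaning variant in this notebook is evaluated on the same held-out rows, so the accuracies are directly comparable. It does rely on the rows not being ordered in a way that correlates with the label.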

Now we can run least squares on the training data and apply the resulting weights to the test data.

In [5]:
loss, w_star = least_square(y_train, x_train)
print("Loss = %f"%(loss))
prediction(y_test, x_test, w_star)

Loss = 0.823999
Good prediction: 37229/50000 (74.458000%)
Wrong prediction: 12771/50000 (25.542000%)
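For readers without access to the `functions` folder, here is a minimal sketch of the two steps in this cell. It is not the notebook's own `least_square` or `prediction`: the names `least_squares_sketch` and `accuracy` are hypothetical, the loss is assumed to be the mean squared error (the notebook's loss may be defined differently, for example as an RMSE), and labels are assumed to be in {-1, +1} with a decision threshold at 0.

```python
import numpy as np


def least_squares_sketch(y, tx):
    """Solve min_w ||y - tx @ w||^2 and return (mse, w)."""
    # lstsq is used instead of an explicit inverse so that
    # rank-deficient feature matrices are still handled.
    w, *_ = np.linalg.lstsq(tx, y, rcond=None)
    residual = y - tx @ w
    mse = np.mean(residual ** 2)  # assumed loss definition
    return mse, w


def accuracy(y_true, tx, w):
    """Fraction of rows where the sign of tx @ w matches labels in {-1, +1}."""
    y_pred = np.where(tx @ w >= 0, 1, -1)
    return np.mean(y_pred == y_true)


# Toy data: labels are the sign of a known linear score.
rng = np.random.default_rng(0)
tx_demo = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y_demo = np.where(tx_demo @ w_true >= 0, 1, -1)

loss_demo, w_demo = least_squares_sketch(y_demo, tx_demo)
print(loss_demo, accuracy(y_demo, tx_demo, w_demo))
```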


# Cleaning and Testing

## Remove columns

The first thing we can clean is the columns with a high percentage of NaNs. Let's try removing all the columns with around 70% of NaNs.

In [6]:
tX_without_nan, _, _ = delete_column_nan(nan_train, tX, headers, threshold = 0.65)
print("Number of variables: %i"%(len(tX_without_nan[0])))

Number of variables: 23
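The column filter itself can be sketched in a few lines. This is an illustration rather than the notebook's `delete_column_nan`: the name `drop_sparse_columns` is hypothetical, and it assumes the rule is "drop every column whose missing fraction exceeds the threshold".

```python
import numpy as np


def drop_sparse_columns(x, missing_fraction, threshold):
    """Keep only the columns whose missing fraction is at most `threshold`."""
    keep = np.asarray(missing_fraction) <= threshold
    return x[:, keep], keep


# Toy data: 3 columns with missing fractions 0.0, 0.71 and 0.40.
x_demo = np.arange(6, dtype=float).reshape(2, 3)
fractions_demo = [0.0, 0.71, 0.40]
x_kept, keep_mask = drop_sparse_columns(x_demo, fractions_demo, threshold=0.65)
print(x_kept.shape, keep_mask)
```

Returning the boolean mask alongside the filtered matrix makes it easy to apply exactly the same column selection to the test set and to the header list.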


We can redo the test we did with the benchmark and check if it becomes better or not.

In [7]:
ratio = 0.8
x_train, y_train, x_test, y_test = split_non_random(tX_without_nan, y, ratio)
# Do the training with LS and prediction
loss, w_star = least_square(y_train, x_train)
print("Loss = %f"%(loss))
prediction(y_test, x_test, w_star)

Loss = 0.845954
Good prediction: 36303/50000 (72.606000%)
Wrong prediction: 13697/50000 (27.394000%)


That's an interesting result: if we remove the columns with the most NaN values, the prediction becomes worse (72.6% correct against 74.5% for the benchmark). Let's try again, this time removing all columns that contain NaNs.

In [8]:
tX_without_nan, _, _ = delete_column_nan(nan_train, tX, headers, threshold = 0.01)
print("Number of variables: %i"%(len(tX_without_nan[0])))

ratio = 0.8
x_train, y_train, x_test, y_test = split_non_random(tX_without_nan, y, ratio)
# Do the training with LS and prediction
loss, w_star = least_square(y_train, x_train)
print("Loss = %f"%(loss))
prediction(y_test, x_test, w_star)

Number of variables: 19
Loss = 0.851050
Good prediction: 36161/50000 (72.322000%)
Wrong prediction: 13839/50000 (27.678000%)


Apparently, **removing columns with NaNs is a bad idea**. So we'll now try replacing the NaNs with the mean of the non-NaN values.

## Mean

In [13]:
tX_replaced_mean = replace_by_mean(tX, nan_train)
print("Number of variables: %i"%(len(tX_replaced_mean[0])))


ratio = 0.8
x_train, y_train, x_test, y_test = split_non_random(tX_replaced_mean, y, ratio)
# Do the training with LS and prediction
loss, w_star = least_square(y_train, x_train)
print("Loss = %f"%(loss))
prediction(y_test, x_test, w_star)

Number of variables: 30
Loss = 0.829123
Good prediction: 36988/50000 (73.976000%)
Wrong prediction: 13012/50000 (26.024000%)
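To make the imputation step concrete, here is a minimal sketch of mean replacement. It is not the notebook's `replace_by_mean`: the name `impute_with_mean` is hypothetical, and missing values are assumed to be encoded as -999.

```python
import numpy as np


def impute_with_mean(x, missing=-999.0):
    """Replace sentinel entries in each column by the mean of that column's valid entries."""
    x = np.array(x, dtype=float)  # work on a copy so the input is left untouched
    for j in range(x.shape[1]):
        mask = x[:, j] == missing
        # Skip columns with nothing to fill, or with no valid value to average.
        if mask.any() and not mask.all():
            x[mask, j] = x[~mask, j].mean()
    return x


toy = np.array([[1.0, -999.0],
                [3.0, 4.0],
                [-999.0, 6.0]])
filled = impute_with_mean(toy)
print(filled)  # column 0 gap filled with 2.0, column 1 gap filled with 5.0
```

The mean has to be computed over the valid entries only; averaging over the raw column would pull the fill value toward -999.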


## Median


In [15]:
tX_replaced_median = replace_by_median(tX, nan_train)
print("Number of variables: %i"%(len(tX_replaced_median[0])))


ratio = 0.8
x_train, y_train, x_test, y_test = split_non_random(tX_replaced_median, y, ratio)
# Do the training with LS and prediction
loss, w_star = least_square(y_train, x_train)
print("Loss = %f"%(loss))
prediction(y_test, x_test, w_star)

Number of variables: 30
Loss = 0.829123
Good prediction: 36988/50000 (73.976000%)
Wrong prediction: 13012/50000 (26.024000%)


Apparently, keeping the data as they are is the **best option**, so we won't clean the data for the moment. However, this conclusion only holds for least squares. So let's keep the data with the -999 values replaced by the median, and apply ridge regression to them in another notebook.

In [24]:
write_data('data/train_cleaned.csv', y, tX_replaced_median, ids, headers, 'train')

Apply the same procedure to the test data.

In [21]:
tX_replaced_median_test = replace_by_median(tX_test, nan_train)
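Whether `replace_by_median` computes the medians from the matrix it is given or from some other source is not visible in this notebook. One common choice, sketched below under that caveat, is to compute the medians on the training set once and reuse them for the test set, so that both sets go through exactly the same transformation. The names `column_medians` and `fill_missing` are hypothetical, and the -999 sentinel is again an assumption.

```python
import numpy as np


def column_medians(x, missing=-999.0):
    """Median of each column, ignoring entries equal to the sentinel."""
    x = np.asarray(x, dtype=float)
    masked = np.where(x == missing, np.nan, x)
    return np.nanmedian(masked, axis=0)


def fill_missing(x, fill_values, missing=-999.0):
    """Replace sentinel entries by the fill value of their column."""
    x = np.array(x, dtype=float)  # copy, the input is left untouched
    rows, cols = np.where(x == missing)
    x[rows, cols] = np.asarray(fill_values)[cols]
    return x


train_demo = np.array([[1.0, -999.0],
                       [3.0, 10.0],
                       [5.0, 20.0]])
test_demo = np.array([[-999.0, 7.0],
                      [2.0, -999.0]])

medians_demo = column_medians(train_demo)            # learned on training data only
test_filled = fill_missing(test_demo, medians_demo)  # reused for the test data
print(medians_demo, test_filled)
```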


In [32]:
write_data('data/test_cleaned.csv', _, tX_replaced_median_test, ids_test, headers, 'test')