# Step by Step

This script works under the assumption that local optima lead to a global optimum: each processing stage is tuned on its own, and the best choice is carried forward to the next stage.
The data is cleaned in three ways, replacing missing values with zero, with the mean, or with the median, and is then normalized.
Cross-validation selects the best cleaning method.

The data is then transformed using PCA, keeping between 30 and 20 principal components.
Cross-validation selects the number of principal components that gives the best results.

Next, the data is expanded with polynomial features, and the best combination of lambda and degree is selected.

Finally, a Kaggle submission is created.

### Importing libraries and personal libraries

In [21]:
# standard libraries
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import time
%load_ext autoreload
%autoreload 2

# own functions

import proj1_helpers as P1H
import dataprocessing as DP
import implementations as ME
import cross_validation as CV
from grad_loss import*

#constants
train_path = 'train.csv'
test_path = 'test.csv'



### Importing data

In [22]:
# sub_sample=True loads only a subset of the rows, which keeps the exploration fast
orig_y, orig_x, orig_ids = P1H.load_csv_data(train_path, sub_sample=True)
pred_y, pred_x, pred_ids = P1H.load_csv_data(test_path, sub_sample=True)

### Stacking training and test data together to easily perform the same transformations on both sets

In [7]:
all_x = np.vstack((orig_x, pred_x))

# row index at which all_x is split back into training and test data
split_coord = len(orig_y)
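
The stack-then-split pattern can be illustrated on toy arrays. This is only a sketch: the arrays and the doubling step are made-up stand-ins for the real data and transformations.

```python
import numpy as np

# Toy stand-ins for the training and test feature matrices (hypothetical values)
train_x = np.arange(6.0).reshape(3, 2)
test_x = np.arange(6.0, 10.0).reshape(2, 2)

stacked = np.vstack((train_x, test_x))
split = len(train_x)  # first row that belongs to the test set

# Stand-in for any transformation that is applied to all rows at once
transformed = stacked * 2

# Slicing at the split index recovers the two sets in their original order
train_t, test_t = transformed[:split], transformed[split:]
```

Because `np.vstack` keeps row order, slicing at `split` always returns the training rows first and the test rows second.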

### Performing the three different ways of cleaning the data

In [8]:
# Work on a copy so it is clear which data is being processed
x = np.copy(all_x)

# Replace missing values (-999) with zero, the mean, or the median
no_clean = np.copy(x)  # baseline without any cleaning
clean_zero = DP.clean_data(x)
clean_mean = DP.clean_data(x, replace_no_measure_with_mean=True)
clean_medi = DP.clean_data(x, replace_no_measure_with_median=True)

# Collect all variants so they can be compared later
cleanDataArray = [no_clean, clean_zero, clean_mean, clean_medi]
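
`DP.clean_data` is not shown in this notebook. As a rough illustration of the three strategies, the sketch below uses a hypothetical `replace_missing` helper and assumes that the statistics are computed per column over the observed entries only; the real implementation may differ.

```python
import numpy as np

MISSING = -999.0  # sentinel marking "no measurement"

def replace_missing(x, strategy="zero"):
    """Return a copy of x with MISSING entries replaced (hypothetical helper)."""
    x = np.array(x, dtype=float)  # copy, so the caller's array is untouched
    mask = x == MISSING
    if strategy == "zero":
        return np.where(mask, 0.0, x)
    reducer = {"mean": np.nanmean, "median": np.nanmedian}[strategy]
    # Hide the sentinel values so they do not distort the column statistics
    observed = np.where(mask, np.nan, x)
    fill = reducer(observed, axis=0)  # one fill value per column
    return np.where(mask, fill, x)

demo = np.array([[1.0, -999.0],
                 [3.0, 4.0],
                 [-999.0, 8.0],
                 [11.0, 6.0]])
demo_zero = replace_missing(demo, "zero")
demo_mean = replace_missing(demo, "mean")
demo_median = replace_missing(demo, "median")
```

A column in which every entry is missing would end up filled with NaN under the mean and median strategies, so a real implementation would need to handle that case separately.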

### Normalizing the data: 

In [9]:
# Normalize every cleaned variant
normalizedDataArray = [DP.normalize(data) for data in cleanDataArray]
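
`DP.normalize` is not shown either. One common choice is column-wise standardization, sketched below under that assumption; the `standardize` name is hypothetical, and the real helper may scale differently.

```python
import numpy as np

def standardize(x):
    """Shift each column to zero mean and scale it to unit standard deviation."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # constant columns are left at zero instead of dividing by 0
    return (x - mean) / std

rng = np.random.default_rng(0)
sample = np.hstack([rng.normal(5.0, 3.0, size=(100, 2)), np.full((100, 1), 7.0)])
scaled = standardize(sample)
```

Since the notebook normalizes the stacked array, the statistics in such a scheme would be computed over training and test rows together.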
  

### Comparing the different ways of cleaning data, using 5-fold cross-validation:

In [10]:
lambda_=2.33572146909e-05 #taken from exploration of basic methods
k_folds=5
avg_losses=[]
avg_preds_all=[]
for data in normalizedDataArray:
    avg_loss, losses, avg_preds, pred_acc_percents = CV.cross_validation(ME.ridge_regression, orig_y, data[:split_coord,:], k_folds, lambda_)
    avg_losses.append(avg_loss)
    avg_preds_all.append(avg_preds)
print("this is lambda", lambda_)
print("this is average losses: ",avg_losses)
print("this is average prediction error:",avg_preds_all)

this is lambda 2.33572146909e-05
this is average losses:  [0.81817196959179417, 0.80883503277725899, 0.81225633044988821, 0.81707134549374738]
this is average prediction error: [0.2522, 0.2444, 0.2476, 0.2496]
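
The internals of `CV.cross_validation` and `ME.ridge_regression` are not part of this notebook. The sketch below shows one way k-fold cross-validation with ridge regression can be written. It assumes a closed-form ridge solution with a `2 * n * lambda_` penalty scaling, RMSE as the loss, and labels in {-1, 1}; the function names and return values are illustrative only.

```python
import numpy as np

def ridge_regression(y, tx, lambda_):
    """Closed-form ridge fit; returns (training RMSE, weights)."""
    n, d = tx.shape
    a = tx.T @ tx + 2 * n * lambda_ * np.eye(d)  # assumed penalty scaling
    w = np.linalg.solve(a, tx.T @ y)
    rmse = np.sqrt(np.mean((y - tx @ w) ** 2))
    return rmse, w

def k_fold_cv(y, tx, k_folds, lambda_, seed=1):
    """Return (mean validation RMSE, mean misclassification rate) over the folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k_folds)
    losses, errors = [], []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the one held out for validation
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        _, w = ridge_regression(y[train_idx], tx[train_idx], lambda_)
        scores = tx[test_idx] @ w
        losses.append(np.sqrt(np.mean((y[test_idx] - scores) ** 2)))
        predicted = np.where(scores >= 0, 1, -1)
        errors.append(np.mean(predicted != y[test_idx]))
    return np.mean(losses), np.mean(errors)

# Synthetic, linearly separable data for a quick check
rng = np.random.default_rng(0)
toy_x = rng.normal(size=(200, 3))
toy_y = np.where(toy_x @ np.array([1.0, -2.0, 0.5]) >= 0, 1.0, -1.0)
toy_loss, toy_error = k_fold_cv(toy_y, toy_x, 5, 1e-5)
```

Averaging over the folds gives one loss and one error rate per candidate, which is what the comparison above relies on.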


### Choosing the method of cleaning that minimizes loss:

In [11]:
# Based on this, we continue with the data where all missing values are replaced by 0.

chosenData=normalizedDataArray[np.argmin(avg_losses)]
print(np.argmin(avg_losses))

# To test what happens if we choose clean mean instead:
#chosenData=normalizedDataArray[2]

# To test what happens if we choose clean median instead:
#chosenData=normalizedDataArray[3]

1


### Performing PCA, keeping different numbers of dimensions

In [12]:
# Perform PCA on the chosen data, keeping from 30 down to 20 dimensions
numberOfDimensions=(30,29,28,27,26,25,24,23,22,21,20)
pcas=[]
for n_dims in numberOfDimensions:
    # Keep only the first return value, which is used as the transformed data
    pca_i=DP.pca(chosenData,n_dims)[0]
    pcas.append(pca_i)
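
`DP.pca` is not shown here. A minimal sketch of PCA via the singular value decomposition follows; the `pca_project` name and its return values are assumptions chosen for illustration, not the actual interface of `DP.pca`.

```python
import numpy as np

def pca_project(x, n_components):
    """Project x onto its leading principal directions.

    Returns (projected data, principal directions, explained variance ratios).
    """
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    explained = s[:n_components] ** 2 / np.sum(s ** 2)
    return centered @ components.T, components, explained

rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 5)) @ rng.normal(size=(5, 5))  # correlated features
projected, directions, ratios = pca_project(toy, 3)
```

The projected columns are uncorrelated, and the ratios indicate how much of the total variance each kept direction explains.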

### Comparing the different dimensions, using 5-fold cross-validation:

In [13]:
lambda_=2.33572146909e-05 #taken from exploration of basic methods
k_folds=5
avg_losses=[]
avg_preds_all=[]
for data in pcas:
    avg_loss, losses, avg_preds, pred_acc_percents = CV.cross_validation(ME.ridge_regression, orig_y, data[:split_coord,:], k_folds, lambda_)
    avg_losses.append(avg_loss)
    avg_preds_all.append(avg_preds)
print("this is lambda", lambda_)
print("this is average losses: ",avg_losses)
print("this is average prediction error:",avg_preds_all)

this is lambda 2.33572146909e-05
this is average losses:  [0.86957757215410347, 0.8695775722204877, 0.87160488127106439, 0.87174138193975514, 0.87184296963476027, 0.87890148652423172, 0.88062914093091427, 0.88093517590255066, 0.88820784924404061, 0.88903865691775974, 0.89003075433178724]
this is average prediction error: [0.2692, 0.2692, 0.2666, 0.2672, 0.2682, 0.268, 0.2694, 0.27, 0.2762, 0.2752, 0.2764]


### Choosing the number of dimensions that minimizes loss:

In [14]:
chosenData=pcas[np.argmin(avg_losses)]
print(np.argmin(avg_losses))

0


### Finding the best combination of polynomial degree and lambda_ using cross-validation

In [15]:
degrees=(3,4,5,6,7,8,9)
lambdas=np.logspace(-9,1,15)
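min_loss=np.inf  # any real loss is smaller than this starting value
min_degree=0
min_lambda=0
max_acc=0  # holds the prediction error at the best loss, despite its name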
avg_losses=np.zeros((len(degrees),len(lambdas)))
avg_acc=np.zeros((len(degrees),len(lambdas)))
for d,degree in enumerate(degrees):
    phi=DP.build_poly(chosenData[:split_coord,:],degree)
    for l,lambda_ in enumerate(lambdas):
        avg_loss, losses, avg_preds, pred_acc_percents = CV.cross_validation(ME.ridge_regression, orig_y, phi, k_folds, lambda_)
        avg_losses[d,l]=avg_loss
        avg_acc[d,l]=avg_preds
        if avg_loss < min_loss:
            min_loss=avg_loss
            min_degree=degree
            min_lambda=lambda_
            max_acc=avg_preds
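
`DP.build_poly` is not shown in this notebook. The sketch below assumes a feature-wise expansion without cross terms and with a leading constant column; the actual helper may order or build the columns differently.

```python
import numpy as np

def build_poly(x, degree):
    """Columns: a constant 1, then x**1, x**2, ..., x**degree for every feature."""
    powers = [x ** d for d in range(1, degree + 1)]
    return np.hstack([np.ones((x.shape[0], 1))] + powers)

# One sample with two features, expanded up to degree 3
phi_demo = build_poly(np.array([[2.0, 3.0]]), 3)
```

With this layout an input with `D` features produces `1 + D * degree` columns, so the expanded matrix grows linearly in the degree.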
            

In [16]:
print("The average minmal loss is: ", min_loss, "which is found using a polynomial of degree ",min_degree, " with lambda_=", min_lambda)
print("The average prediction error when compared to the real values is: ",max_acc)

The average minmal loss is:  0.748601946659 which is found using a polynomial of degree  9  with lambda_= 1e-09
The average prediction error when compared to the real values is:  0.1958
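
Since the search stores every result in a grid, the best combination can also be recovered from the grid afterwards instead of tracking the minimum inside the loop. The sketch below uses a small grid with made-up loss values purely for illustration.

```python
import numpy as np

degrees = (3, 4, 5)
lambdas = np.logspace(-9, 1, 4)

# Hypothetical loss grid of shape (len(degrees), len(lambdas)); values are made up
loss_grid = np.array([[0.82, 0.81, 0.83, 0.90],
                      [0.80, 0.79, 0.84, 0.91],
                      [0.81, 0.80, 0.85, 0.92]])

# argmin works on the flattened grid; unravel_index maps it back to (row, column)
d_best, l_best = np.unravel_index(np.argmin(loss_grid), loss_grid.shape)
best_degree, best_lambda = degrees[d_best], lambdas[l_best]
```

Keeping the full grid also makes it possible to inspect how sensitive the loss is to each parameter, for example with a heat map.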


### Choosing the combination that minimizes loss:

In [17]:
chosenData=DP.build_poly(chosenData,min_degree)

### Creating a Kaggle submission

In [18]:
loss,w=ME.ridge_regression(orig_y,chosenData[:split_coord,:],min_lambda)


In [19]:
y_predicted=P1H.predict_labels(w,chosenData[split_coord:,:])
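
`P1H.predict_labels` is not shown here. A plausible sketch, assuming labels in {-1, 1} and a decision threshold at zero, is given below; the `predict_labels_sketch` name is hypothetical.

```python
import numpy as np

def predict_labels_sketch(weights, data):
    """Map linear scores to class labels: 1 for positive scores, -1 otherwise."""
    scores = data @ weights
    return np.where(scores > 0, 1, -1)

demo_labels = predict_labels_sketch(
    np.array([1.0, -1.0]),
    np.array([[2.0, 1.0], [1.0, 2.0], [1.0, 1.0]]),
)
```

In this sketch a score of exactly zero is assigned to the negative class; the real helper may treat that boundary differently.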

In [20]:
name="step_by_step.csv"
#name="step_by_step_clean_mean.csv"
#name="step_by_step_clean_median.csv"

P1H.create_csv_submission(pred_ids, y_predicted, name)

### Kaggle score: 
- clean zero (step_by_step): 0.77544
- clean mean: 0.79162
- clean median: 0.79344 