## Import necessary libraries

In [1]:
# Useful starting lines
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np
import matplotlib.pyplot as plt

from costs import *
from models import *
from helpers import * 
from evaluation import *
from gradient import *

These function modules are kept clean. Please avoid changing them where possible; if you have to, let me know and describe the changes when you commit.

1. **models**: 6 model functions
2. **costs**: calculate_loss (calculate mse/mae/rmse/log_loss)
3. **gradient**: compute_gradient (stoch_gradient, gradient_sigmoid, sigmoid, hessian)
4. **helpers**: standardize, build_poly, batch_iter, load_csv_data, load_header, predict_labels, create_csv_submission
5. **evaluation**: cross_validation

## Preprocessing
**Load the training data into feature matrix, class labels, and record IDs**

We wrote our own `load_csv_data` function to import the CSV data; it returns the prediction column, the feature matrix, and the ID of each record.

In [3]:
DATA_TRAIN_PATH = 'data/train.csv' # TODO: download train data and supply path here 
y, tx, ids = load_csv_data(DATA_TRAIN_PATH, sub_sample=True)

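We use [feature scaling](https://en.wikipedia.org/wiki/Feature_scaling) to standardize our feature matrix, i.e. to rescale each column of `tx` to zero mean and unit variance (`standardize` also returns the mean and standard deviation it used), so that features with large magnitudes neither dominate the fit nor cause numerical problems.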

In [4]:
tx, mean_tx, std_tx = standardize(tx)
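
As a rough illustration of this step, here is a minimal sketch of a standardization helper. It assumes that `standardize` subtracts the column mean and divides by the column standard deviation, as the returned `mean_tx` and `std_tx` suggest; the name `standardize_sketch` and its optional `mean`/`std` arguments are hypothetical and not part of the `helpers` module.

```python
import numpy as np

def standardize_sketch(x, mean=None, std=None):
    """Rescale each column of x to zero mean and unit variance (hypothetical helper)."""
    if mean is None:
        mean = x.mean(axis=0)
    if std is None:
        std = x.std(axis=0)
    # Guard against constant columns, which would otherwise divide by zero.
    safe_std = np.where(std == 0, 1.0, std)
    return (x - mean) / safe_std, mean, std

# Toy data standing in for the real feature matrix.
rng = np.random.default_rng(0)
x_train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
x_std, mu, sigma = standardize_sketch(x_train)

# New data can be rescaled with the statistics computed on the training set.
x_new = rng.normal(loc=50.0, scale=10.0, size=(5, 3))
x_new_std, _, _ = standardize_sketch(x_new, mu, sigma)
```

Passing the training mean and standard deviation when transforming new data keeps both sets on the same scale, which is one reason a helper like this might return them.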

## Model Selection

Let's begin with a simple linear regression, `least_squares`, solved with the **normal equations**. We don't consider least squares with gradient descent or stochastic gradient descent here because the **optimal w can be derived analytically**, so there is no need to approximate w iteratively.

In [5]:
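gamma = 0.5      # not passed to least_squares: the normal equations need no step size
max_iter = 1000  # not passed to least_squares either
loss, w = least_squares(y, tx)
w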

array([  6.66100353e-01,   1.43944820e-01,  -1.48024702e+00,
        -1.95325672e+00,   2.07611316e-01,  -2.05273171e+01,
         2.35244629e-01,  -3.64674518e+00,   8.50084408e+01,
         7.31821343e-02,   8.27790918e+00,  -4.55995963e+01,
         2.41325681e+01,   2.63030240e+01,  -5.27006907e+00,
         1.96424301e+00,   6.13727054e-01,  -4.80474078e+00,
         5.35500372e+00,   4.57906017e+00,   5.13016525e-01,
        -3.60672656e-01,  -5.31315200e-01,  -6.34098601e+01,
        -5.33032834e-01,  -7.51674008e-01,   1.42630115e+00,
        -1.22661995e+00,  -1.21167317e+00,   1.80519787e-01,
        -8.02665771e+00])
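
For reference, the closed-form solution mentioned above can be sketched as follows. This is not necessarily how `least_squares` in `models` is written: the name `least_squares_sketch` is hypothetical, and the 1/(2N) scaling of the MSE is an assumption.

```python
import numpy as np

def least_squares_sketch(y, tx):
    """Solve the normal equations (tx^T tx) w = tx^T y and return (mse, w)."""
    a = tx.T @ tx
    b = tx.T @ y
    # Solving the linear system is more stable than forming an explicit inverse.
    w = np.linalg.solve(a, b)
    e = y - tx @ w
    mse = e @ e / (2 * len(y))  # assumed convention: 1/(2N) * sum of squared errors
    return mse, w

# Noiseless toy problem: the recovered weights should match w_true.
rng = np.random.default_rng(0)
tx_demo = np.c_[np.ones(100), rng.normal(size=(100, 2))]
w_true = np.array([1.0, -2.0, 0.5])
y_demo = tx_demo @ w_true
loss_demo, w_demo = least_squares_sketch(y_demo, tx_demo)
```

If `tx.T @ tx` is close to singular, for example because some features are strongly correlated, `np.linalg.lstsq` is a more robust alternative to `np.linalg.solve`.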

We run cross-validation 4 times on the training data to assess the least-squares performance.

In [None]:
# TODO
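
Until this cell is filled in, the idea can be sketched as below, assuming that "4 times" means 4-fold cross-validation. The helper names `build_k_indices` and `cross_validate_ls` are hypothetical and are not the `cross_validation` function from the `evaluation` module; the 1/(2N) MSE scaling is also an assumption.

```python
import numpy as np

def build_k_indices(n, k_fold, seed=1):
    """Shuffle the row indices and split them into k_fold equally sized folds."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    fold_size = n // k_fold
    return [perm[i * fold_size:(i + 1) * fold_size] for i in range(k_fold)]

def cross_validate_ls(y, tx, k_fold=4, seed=1):
    """Return the held-out MSE of least squares for each fold."""
    folds = build_k_indices(len(y), k_fold, seed)
    test_mse = []
    for k in range(k_fold):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(k_fold) if j != k])
        # Fit on the k_fold - 1 training folds only.
        w, *_ = np.linalg.lstsq(tx[train_idx], y[train_idx], rcond=None)
        # Evaluate on the fold that was left out.
        e = y[test_idx] - tx[test_idx] @ w
        test_mse.append(e @ e / (2 * len(test_idx)))
    return np.array(test_mse)

# Toy regression problem with a little noise.
rng = np.random.default_rng(0)
tx_demo = np.c_[np.ones(200), rng.normal(size=(200, 2))]
y_demo = tx_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
fold_mse = cross_validate_ls(y_demo, tx_demo, k_fold=4)
```

The mean and spread of `fold_mse` give a rough estimate of how the model would perform on unseen data.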

Analysis of this model.

### Logistic Regression

Choose initial parameters

In [None]:
n_iters = 10000
gamma = 0.00001

Train with logistic regression

In [None]:
loss, w = logistic_regression(y, tx, gamma, n_iters)
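
For context, here is a minimal sketch of logistic regression trained with plain gradient descent. It is not necessarily the implementation in `models`: it assumes labels in {0, 1}, a zero initial weight vector, and a gradient averaged over the samples, and the name `logistic_regression_sketch` is hypothetical.

```python
import numpy as np

def sigmoid(t):
    """Logistic function mapping real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

def logistic_regression_sketch(y, tx, gamma, n_iters):
    """Gradient descent on the mean negative log-likelihood; returns (loss, w)."""
    w = np.zeros(tx.shape[1])
    for _ in range(n_iters):
        # Gradient of the mean log-loss with respect to w.
        grad = tx.T @ (sigmoid(tx @ w) - y) / len(y)
        w -= gamma * grad
    # Clip probabilities so the logarithms stay finite.
    p = np.clip(sigmoid(tx @ w), 1e-12, 1 - 1e-12)
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return loss, w

# Linearly separable toy data with a bias column.
rng = np.random.default_rng(0)
x = rng.normal(size=(300, 2))
tx_demo = np.c_[np.ones(300), x]
y_demo = (x[:, 0] + x[:, 1] > 0).astype(float)
loss_demo, w_demo = logistic_regression_sketch(y_demo, tx_demo, gamma=0.5, n_iters=500)
accuracy = np.mean((sigmoid(tx_demo @ w_demo) > 0.5) == (y_demo == 1))
```

The step size that works depends on whether the gradient is summed or averaged over the samples, so the `gamma` used in this toy example is not comparable to the one chosen in the notebook.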

Predict with our trained model

In [None]:
test_x = np.genfromtxt('data/test.csv', delimiter=',', skip_header=1)
# Remove the id and prediction columns; standardize returns (tx, mean, std), so unpack it
test_x, _, _ = standardize(test_x[:, 2:])
# could've used load_csv_data
create_csv_submission(list(range(350000, 918238)), predict_labels(test_x, w), 'res.csv')

## Feature Engineering
TODO

## Prediction

**Generate predictions and save output in CSV format for submission**