## Import necessary libraries

In [1]:
# Useful starting lines
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
%load_ext autoreload
%autoreload 2
import math

In [2]:
from costs import *
from models import *
from helpers import * 
from evaluation import *

## Preprocessing
**Load the training data into a feature matrix, class labels, and record IDs**

We wrote our own `load_csv_data` function to import the CSV data. It returns the prediction column, the feature matrix, and the ID of each record.

In [3]:
from proj1_helpers import *
DATA_TRAIN_PATH = 'data/train.csv' # TODO: download train data and supply path here 
y, tx, ids = load_csv_data(DATA_TRAIN_PATH, sub_sample=True)

We use [feature scaling](https://en.wikipedia.org/wiki/Feature_scaling) to standardize the feature matrix, i.e. to rescale `tx` to the range [0, 1], which avoids numerical problems caused by large values.

In [4]:
tx = standardize(tx)
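
The body of `standardize` is not shown in this notebook. As an illustration only, here is a minimal sketch of column-wise min-max rescaling to [0, 1], which is one way to get the behaviour described above. The name `min_max_scale` is hypothetical, and mapping constant columns to 0 is an assumption.

```python
import numpy as np

def min_max_scale(x):
    """Rescale each column of x to [0, 1] (hypothetical helper, not the notebook's standardize)."""
    x = np.asarray(x, dtype=float)
    col_min = x.min(axis=0)
    col_range = x.max(axis=0) - col_min
    # Assumption: a constant column has zero range, so map it to 0 instead of dividing by 0.
    col_range[col_range == 0] = 1.0
    return (x - col_min) / col_range

# Small usage example on a toy matrix with one constant column
demo = np.array([[1.0, 10.0, 5.0],
                 [3.0, 20.0, 5.0],
                 [2.0, 30.0, 5.0]])
scaled = min_max_scale(demo)
```

Each column is scaled independently, so features with very different units end up on the same range.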

## Model Selection

Let's begin with a simple linear regression, solving least squares with the **normal equations**. We don't use gradient descent or stochastic gradient descent for least squares here because the **optimal w can be derived theoretically** in closed form, so there is no need to estimate w iteratively.

In [5]:
gamma = 0.5
max_iter = 1000
loss, w = least_squares(y, tx)
w

array([  6.66100353e-01,   1.43944820e-01,  -1.48024702e+00,
        -1.95325672e+00,   2.07611316e-01,  -2.05273171e+01,
         2.35244629e-01,  -3.64674518e+00,   8.50084408e+01,
         7.31821343e-02,   8.27790918e+00,  -4.55995963e+01,
         2.41325681e+01,   2.63030240e+01,  -5.27006907e+00,
         1.96424301e+00,   6.13727054e-01,  -4.80474078e+00,
         5.35500372e+00,   4.57906017e+00,   5.13016525e-01,
        -3.60672656e-01,  -5.31315200e-01,  -6.34098601e+01,
        -5.33032834e-01,  -7.51674008e-01,   1.42630115e+00,
        -1.22661995e+00,  -1.21167317e+00,   1.80519787e-01,
        -8.02665771e+00])
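
The `least_squares` function lives in `models` and is not shown here. As a rough sketch of the idea, the normal equations $X^\top X\, w = X^\top y$ can be solved directly with NumPy. The name `least_squares_sketch`, the `(loss, w)` return order, and the choice of half the mean squared error as the loss are assumptions made for illustration.

```python
import numpy as np

def least_squares_sketch(y, tx):
    """Solve the normal equations (tx^T tx) w = tx^T y (hypothetical helper)."""
    gram = tx.T @ tx
    # np.linalg.solve avoids forming the explicit inverse of the Gram matrix.
    w = np.linalg.solve(gram, tx.T @ y)
    residual = y - tx @ w
    # Assumption: loss is half the mean squared error.
    loss = np.mean(residual ** 2) / 2
    return loss, w

# Usage on noise-free synthetic data, where the true weights should be recovered
rng = np.random.default_rng(0)
tx_demo = np.c_[np.ones(50), rng.normal(size=(50, 2))]
w_true = np.array([1.0, -2.0, 0.5])
y_demo = tx_demo @ w_true
loss_demo, w_demo = least_squares_sketch(y_demo, tx_demo)
```

If the Gram matrix is singular or badly conditioned, `np.linalg.lstsq` is usually a safer choice than `np.linalg.solve`.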

We run 4-fold cross-validation on the training data to evaluate the performance of least squares.

In [6]:
ws, losses, accs = cross_validation(y, tx, 4, 0, least_squares, method='rmse')
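
The `cross_validation` helper is defined in `evaluation` and its body is not shown. The following is only a minimal sketch of k-fold cross-validation with a least-squares fit and an RMSE score on each held-out fold. The name `k_fold_rmse`, the `seed` parameter, and the use of `np.linalg.lstsq` for the fit are assumptions.

```python
import numpy as np

def k_fold_rmse(y, tx, k, seed=0):
    """Fit least squares on k-1 folds and report RMSE on the held-out fold (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))
    folds = np.array_split(indices, k)
    ws, rmses = [], []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Least-squares fit on the training folds only
        w, *_ = np.linalg.lstsq(tx[train_idx], y[train_idx], rcond=None)
        err = y[test_idx] - tx[test_idx] @ w
        ws.append(w)
        rmses.append(np.sqrt(np.mean(err ** 2)))
    return ws, rmses

# Usage on noise-free synthetic data with 4 folds
rng = np.random.default_rng(1)
tx_cv = np.c_[np.ones(80), rng.normal(size=(80, 3))]
y_cv = tx_cv @ np.array([0.5, 1.0, -1.0, 2.0])
ws_cv, rmses_cv = k_fold_rmse(y_cv, tx_cv, 4)
```

Each sample is held out exactly once, so the spread of the per-fold scores gives a rough idea of how stable the fit is.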

In [9]:
accs, ws, losses

([0.32400000000000001,
  0.32319999999999999,
  0.33760000000000001,
  0.32240000000000002],
 [array([  6.17887006e-01,   1.44753238e-01,  -1.44017078e+00,
          -2.02554857e+00,   2.49075218e-01,  -2.73717186e+01,
           2.86539824e-01,  -4.34507327e+00,   8.68566499e+01,
           4.98385802e-02,   1.25955746e+01,  -4.44435798e+01,
           2.48005245e+01,   3.18972854e+01,  -9.54590940e+00,
           5.93871885e-01,   2.13091547e+00,  -9.05394092e+00,
           7.30194088e+00,   5.30407990e+00,   4.67990396e-01,
           1.98184629e+00,  -4.89507554e-01,  -6.36371398e+01,
          -7.09165322e-01,  -8.93778255e-01,   1.75881540e+00,
          -9.69905164e-01,  -8.08825410e-01,   1.37861988e+00,
          -1.23612166e+01]),
  array([  6.32526280e-01,   1.62908598e-01,  -1.52801347e+00,
          -1.89714031e+00,   1.98243687e-01,  -1.78317055e+01,
           2.55967696e-01,  -2.70397158e+00,   8.09544603e+01,
           1.12667856e-01,   7.79990260e+00,  -4.26609703e+

Analysis of this model.

### Logistic Regression

Choose initial parameters

In [None]:
n_iters = 2000
gamma = 0.000003

Train with logistic regression

In [None]:
w = logistic_regression(y, tx, gamma, n_iters)
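
The `logistic_regression` function comes from `models` and is not reproduced in this notebook. As a hedged sketch of what such a routine might do, here is plain gradient descent on the logistic loss. The name `logistic_regression_sketch`, the zero initialisation, and the assumption that labels are in {0, 1} are illustrative choices, not the notebook's confirmed behaviour.

```python
import numpy as np

def sigmoid(t):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-t))

def logistic_regression_sketch(y, tx, gamma, n_iters):
    """Gradient descent on the logistic loss (hypothetical helper; assumes y in {0, 1})."""
    w = np.zeros(tx.shape[1])
    for _ in range(n_iters):
        # Gradient of the summed negative log-likelihood
        grad = tx.T @ (sigmoid(tx @ w) - y)
        w = w - gamma * grad
    return w

# Usage on a small linearly separable toy problem
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 2))
tx_lr = np.c_[np.ones(200), x_demo]
y_lr = (x_demo[:, 0] + x_demo[:, 1] > 0).astype(float)
w_lr = logistic_regression_sketch(y_lr, tx_lr, gamma=0.01, n_iters=300)
acc_lr = np.mean((sigmoid(tx_lr @ w_lr) > 0.5) == (y_lr == 1))
```

Because the gradient here is summed over all samples rather than averaged, the step size has to shrink as the dataset grows, which is consistent with choosing a very small `gamma` for a large training set.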

Predict with our trained model

In [None]:
# Load the test set directly; load_csv_data could have been used instead
test_x = np.genfromtxt('data/test.csv', delimiter=',', skip_header=1)
test_x = standardize(test_x[:, 2:])  # drop the id and prediction columns
test_ids = list(range(350000, 918238))  # ids of the test records
create_csv_submission(test_ids, log_reg_predict(test_x, w), 'res.csv')
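
`log_reg_predict` is not defined in this notebook. One plausible reading, sketched below, is that it thresholds the predicted probability at 0.5 and returns class labels. The name `log_reg_predict_sketch`, the 0.5 threshold, and the -1/1 output labels are all assumptions.

```python
import numpy as np

def log_reg_predict_sketch(tx, w, threshold=0.5):
    """Turn logistic-regression scores into -1/1 labels (hypothetical helper)."""
    prob = 1.0 / (1.0 + np.exp(-(tx @ w)))
    # Assumption: probabilities above the threshold map to 1, everything else to -1.
    return np.where(prob > threshold, 1, -1)

# Usage with hand-picked weights: the sign of the second feature decides the label
tx_pred = np.array([[1.0, 2.0],
                    [1.0, -2.0]])
w_pred = np.array([0.0, 1.0])
labels = log_reg_predict_sketch(tx_pred, w_pred)
```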

## Feature Engineering
TODO

## Prediction

**Generate predictions and save the output in CSV format for submission**