## Import necessary libraries

In [1]:
# Useful starting lines
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
%load_ext autoreload
%autoreload 2
import math

In [2]:
from costs import *
from models import *
from helpers import * 
from evaluation import *

## Preprocessing
**Load the training data into a feature matrix, class labels, and record IDs**

We wrote our own `load_csv_data` function to import the CSV data. It returns the prediction column (class labels), the feature matrix, and the ID of each record.

In [3]:
from proj1_helpers import *
DATA_TRAIN_PATH = 'data/train.csv' # TODO: download train data and supply path here 
y, tx, ids = load_csv_data(DATA_TRAIN_PATH, sub_sample=True)
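
For readers without `proj1_helpers`, here is a minimal sketch of what such a loader might look like. It is not the project's `load_csv_data`. The column layout (ID first, label second, features after), the numeric labels, and the `step` parameter used for sub-sampling are all assumptions made for illustration.

```python
import io
import numpy as np

def load_csv_sketch(source, sub_sample=False, step=50):
    """Hypothetical loader returning (labels, feature matrix, record ids)."""
    # Assumed layout: column 0 = id, column 1 = label in {-1, 1}, rest = features.
    data = np.genfromtxt(source, delimiter=",", skip_header=1)
    ids = data[:, 0].astype(int)
    y = data[:, 1]
    tx = data[:, 2:]
    if sub_sample:
        # Assumption: sub-sampling keeps every `step`-th row to speed up experiments.
        y, tx, ids = y[::step], tx[::step], ids[::step]
    return y, tx, ids

# Tiny in-memory CSV standing in for the real training file.
demo_csv = io.StringIO(
    "Id,Prediction,f1,f2\n"
    "100,1,0.5,2.0\n"
    "101,-1,1.5,3.0\n"
    "102,1,2.5,4.0\n"
)
y_demo, tx_demo, ids_demo = load_csv_sketch(demo_csv)
```

If the real file stores labels as strings rather than numbers, they would need to be mapped to numeric values before this kind of parsing.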

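We use [feature scaling](https://en.wikipedia.org/wiki/Feature_scaling) to bring the columns of `tx` onto a comparable scale, which avoids numerical problems caused by large values. Strictly speaking, standardization centers each feature to zero mean and unit variance, whereas rescaling to [0, 1] is min-max normalization. Which of the two applies here depends on how `standardize` is implemented.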

In [4]:
tx = standardize(tx)
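
To make the two scaling options concrete, the following sketch implements both column-wise. The function names are hypothetical and the code is not taken from the project's `standardize`, so treat it as an illustration of the techniques only.

```python
import numpy as np

def standardize_sketch(x):
    """Center each column to zero mean and scale it to unit variance."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled to avoid division by zero
    return (x - mean) / std

def min_max_scale_sketch(x):
    """Rescale each column linearly to the interval [0, 1]."""
    lo = x.min(axis=0)
    span = x.max(axis=0) - lo
    span[span == 0] = 1.0  # constant columns map to 0
    return (x - lo) / span

x_demo = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
z_demo = standardize_sketch(x_demo)    # columns with mean 0 and std 1
m_demo = min_max_scale_sketch(x_demo)  # columns spanning [0, 1]
```

Whichever variant is used, the scaling statistics are usually computed on the training data and then reused for the test data.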

## Model Selection

Let's begin with simple linear regression, solving least squares via the **normal equations**. We do not use gradient descent or stochastic gradient descent here because the **optimal `w` can be derived analytically**, so there is no need to approximate it iteratively.
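
As a sketch of the closed form, the normal equations $X^\top X w = X^\top y$ can be solved directly. The function below is hypothetical and is not the project's `least_squares`. It returns `(loss, w)` only to mirror the call order used in this notebook, and it assumes a mean squared error loss with a factor of 1/2.

```python
import numpy as np

def least_squares_sketch(y, tx):
    """Solve the normal equations (X^T X) w = X^T y and return (loss, w)."""
    gram = tx.T @ tx
    # Solving the linear system is more stable than forming an explicit inverse.
    w = np.linalg.solve(gram, tx.T @ y)
    err = y - tx @ w
    loss = 0.5 * np.mean(err ** 2)  # assumed loss: half the mean squared error
    return loss, w

# Noiseless toy data generated from y = 1 + 2x, with a bias column of ones.
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
tx_demo = np.column_stack([np.ones_like(x_demo), x_demo])
y_demo = 1.0 + 2.0 * x_demo
loss_demo, w_demo = least_squares_sketch(y_demo, tx_demo)
```

This requires $X^\top X$ to be invertible. For rank-deficient feature matrices, `np.linalg.lstsq` is a safer choice.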

In [5]:
gamma = 0.5
max_iter = 1000
loss, w = least_squares(y, tx)
w

AttributeError: 'tuple' object has no attribute 'T'

We run 4-fold cross-validation on the training data to evaluate the performance of least squares.

In [6]:
ws, losses, accs = cross_validation(y, tx, 4, 0, least_squares, method='rmse')
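
For reference, a k-fold procedure of this kind can be sketched as follows. This is not the project's `cross_validation`. The helper names are hypothetical, and the sketch assumes that the third argument is the number of folds, the fourth a random seed, and that the model function returns `(loss, w)`.

```python
import numpy as np

def build_k_indices(n, k_fold, seed):
    """Shuffle the row indices and split them into k_fold equal groups."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    fold = n // k_fold
    return [perm[i * fold:(i + 1) * fold] for i in range(k_fold)]

def fit_least_squares(y, tx):
    """Stand-in model: least-squares fit returning (loss, w)."""
    w, *_ = np.linalg.lstsq(tx, y, rcond=None)
    return 0.5 * np.mean((y - tx @ w) ** 2), w

def cross_validation_sketch(y, tx, k_fold, seed, fit):
    """Train on k_fold - 1 groups, compute RMSE on the held-out group, repeat."""
    k_indices = build_k_indices(len(y), k_fold, seed)
    ws, losses = [], []
    for k in range(k_fold):
        test_idx = k_indices[k]
        train_idx = np.concatenate([k_indices[j] for j in range(k_fold) if j != k])
        _, w = fit(y[train_idx], tx[train_idx])
        err = y[test_idx] - tx[test_idx] @ w
        losses.append(np.sqrt(np.mean(err ** 2)))  # RMSE on the held-out fold
        ws.append(w)
    return ws, losses

# Noiseless synthetic data, so every fold should recover the same weights.
rng_demo = np.random.default_rng(0)
tx_demo = np.column_stack([np.ones(40), rng_demo.normal(size=40)])
y_demo = tx_demo @ np.array([1.0, -2.0])
ws_demo, losses_demo = cross_validation_sketch(y_demo, tx_demo, 4, 0, fit_least_squares)
```

Averaging the per-fold losses gives a single estimate of out-of-sample error, and the spread across folds indicates how stable that estimate is.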

In [7]:
accs, ws, losses

([0.32400000000000001,
  0.32319999999999999,
  0.33760000000000001,
  0.32240000000000002],
 [array([  6.17887006e-01,   1.44753238e-01,  -1.44017079e+00,
          -2.02554856e+00,   2.49075217e-01,  -2.73717186e+01,
           2.86539824e-01,  -4.34507328e+00,   8.68566498e+01,
           4.98385803e-02,   1.25955747e+01,  -4.44435797e+01,
           2.48005245e+01,   3.18972854e+01,  -9.54590943e+00,
           5.93871880e-01,   2.13091547e+00,  -9.05394096e+00,
           7.30194088e+00,   5.30407990e+00,   4.67990396e-01,
           1.98184629e+00,  -4.89507554e-01,  -6.36371398e+01,
          -7.09165322e-01,  -8.93778256e-01,   1.75881540e+00,
          -9.69905163e-01,  -8.08825411e-01,   1.37861988e+00,
          -1.23612166e+01]),
  array([  6.32526280e-01,   1.62908598e-01,  -1.52801347e+00,
          -1.89714031e+00,   1.98243688e-01,  -1.78317055e+01,
           2.55967696e-01,  -2.70397158e+00,   8.09544604e+01,
           1.12667856e-01,   7.79990257e+00,  -4.26609704e+

Analysis of this model.

### Logistic Regression

Choose initial parameters

In [21]:
n_iters = 2000
gamma = 0.000003
y[y == -1] = 0  # map labels from {-1, 1} to {0, 1} for logistic regression

Train with logistic regression

In [22]:
y

array([ 1.,  0.,  0., ...,  1.,  0.,  1.])

In [23]:
w = logistic_regression(y, tx, gamma, n_iters)

Current iteration=0, the loss=3465.7359027999587, gradient=0.9545252713713858
Current iteration=100, the loss=3466.6988576252706, gradient=0.9510697831081719
Current iteration=200, the loss=3467.6594251008987, gradient=0.9476275128954119
Current iteration=300, the loss=3468.6176050105964, gradient=0.9441984124968245
Current iteration=400, the loss=3469.573397206277, gradient=0.9407824338583028
Current iteration=500, the loss=3470.52680160707, gradient=0.9373795291072456
Current iteration=600, the loss=3471.477818198603, gradient=0.9339896505518602
Current iteration=700, the loss=3472.4264470322596, gradient=0.9306127506804989


KeyboardInterrupt: 
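
In the log above, the reported loss grows from one checkpoint to the next while the gradient norm shrinks. This may point to a sign or scaling issue in the loss or the update, but that is only a guess because the implementation of `logistic_regression` is not shown here. For comparison, the sketch below is a minimal gradient-descent logistic regression for labels in {0, 1}. All function names are hypothetical. At `w = 0` its loss equals N ln 2, which appears consistent with the starting values printed in this notebook.

```python
import numpy as np

def sigmoid(t):
    """Numerically stable logistic function 1 / (1 + exp(-t))."""
    return np.exp(-np.logaddexp(0.0, -t))

def logistic_loss(y, tx, w):
    """Negative log-likelihood: sum over samples of log(1 + exp(z)) - y * z."""
    z = tx @ w
    return np.sum(np.logaddexp(0.0, z) - y * z)

def logistic_gradient(y, tx, w):
    """Gradient of the negative log-likelihood with respect to w."""
    return tx.T @ (sigmoid(tx @ w) - y)

def logistic_regression_sketch(y, tx, gamma, n_iters):
    """Plain gradient descent starting from w = 0; returns (w, loss history)."""
    w = np.zeros(tx.shape[1])
    losses = []
    for _ in range(n_iters):
        w = w - gamma * logistic_gradient(y, tx, w)  # step against the gradient
        losses.append(logistic_loss(y, tx, w))
    return w, losses

# Synthetic binary data with one informative feature plus a bias column.
rng_demo = np.random.default_rng(0)
feature = rng_demo.normal(size=200)
tx_demo = np.column_stack([np.ones(200), feature])
y_demo = (feature + 0.3 * rng_demo.normal(size=200) > 0).astype(float)
w_demo, losses_demo = logistic_regression_sketch(y_demo, tx_demo, gamma=0.01, n_iters=200)
```

With a correctly signed loss and a sufficiently small step size, the recorded losses should fall over the iterations rather than rise.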

In [20]:
cross_validation(y, tx, 4, 0, logistic_regression, 0.000003)

Current iteration=0, the loss=2599.3019270998116, gradient=0.9671443056725588
Current iteration=100, the loss=2600.030396614757, gradient=0.9636101927671796
Current iteration=200, the loss=2600.7570658034606, gradient=0.9600896762068446
Current iteration=300, the loss=2601.4819342885316, gradient=0.9565827059525435
Current iteration=400, the loss=2602.2050017469824, gradient=0.9530892321556135
Current iteration=500, the loss=2602.9262679096246, gradient=0.9496092051570207
Current iteration=600, the loss=2603.645732560428, gradient=0.9461425754866506
Current iteration=700, the loss=2604.363395536049, gradient=0.9426892938625869
Current iteration=800, the loss=2605.0792567251688, gradient=0.9392493111904139
Current iteration=900, the loss=2605.7933160679827, gradient=0.9358225785625014
Final loss=2606.4984598997953


NameError: name 'predict_labels' is not defined

Predict with our trained model

In [None]:
test_x = np.genfromtxt('data/test.csv', delimiter=',', skip_header=1)
test_x = standardize(test_x[:, 2:])  # remove id and prediction columns
# could've used load_csv_data
test_ids = list(range(350000, 918238))  # ids of the test records
create_csv_submission(test_ids, log_reg_predict(test_x, w), 'res.csv')
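
The prediction step for a logistic model usually thresholds the predicted probability. The sketch below shows one way this could work. It is not the project's `log_reg_predict`, and both the 0.5 threshold and the mapping back to labels in {-1, 1} are assumptions.

```python
import numpy as np

def log_reg_predict_sketch(tx, w, threshold=0.5):
    """Hypothetical predictor: probability >= threshold maps to 1, otherwise -1."""
    prob = np.exp(-np.logaddexp(0.0, -(tx @ w)))  # stable sigmoid of the linear score
    # Assumption: the submission format expects labels in {-1, 1}.
    return np.where(prob >= threshold, 1, -1)

tx_demo = np.array([[2.0], [-3.0], [0.5]])
w_demo = np.array([1.0])
pred_demo = log_reg_predict_sketch(tx_demo, w_demo)
```

Since training used labels in {0, 1}, any such predictor has to convert its output back to whatever label encoding the submission file requires.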

## Feature Engineering
TODO

## Prediction

**Generate predictions and save the output in CSV format for submission**