## Import necessary libraries

In [None]:
# Useful starting lines
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from costs import *
from models import *
from helpers import * 
from evaluation import *
from gradient import *

These modules hold our cleaned-up functions. Please avoid changing them where possible; if a change is necessary, let me know and describe it in your commit message.

1. **models**: 6 model functions
2. **costs**: calculate_loss (calculate mse/mae/rmse/log_loss)
3. **gradient**: compute_gradient (stoch_gradient, gradient_sigmoid, sigmoid, hessian)
4. **helpers**: standardize, build_poly, batch_iter, load_csv_data, load_header, predict_labels, create_csv_submission
5. **evaluation**: cross_validation

## Preprocessing
**Load the training data into a feature matrix, class labels, and record IDs**

We wrote our own `load_csv_data` function to import the CSV data. It returns the label column, the feature matrix, and the ID of each record.

In [None]:
DATA_TRAIN_PATH = 'data/train.csv' # TODO: download train data and supply path here 
y, tx, ids = load_csv_data(DATA_TRAIN_PATH, sub_sample=True)

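We use [feature scaling](https://en.wikipedia.org/wiki/Feature_scaling) to standardize our feature matrix, i.e. to center each column of `tx` and divide it by its standard deviation (`standardize` also returns the mean and standard deviation it used). Putting all features on a comparable scale avoids numerical problems caused by large values.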

In [None]:
tx, mean_tx, std_tx = standardize(tx)
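
For reference, a minimal sketch of what column-wise standardization does. The project's `standardize` lives in `helpers` and is not shown here, so `standardize_sketch` is a hypothetical stand-in; the zero-variance guard is an assumption and may not match the real helper.

```python
import numpy as np

def standardize_sketch(x):
    """Hypothetical stand-in for helpers.standardize (assumed behavior)."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    # Assumption: leave constant columns unscaled instead of dividing by zero.
    std_safe = np.where(std == 0, 1.0, std)
    return (x - mean) / std_safe, mean, std_safe

x_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
x_std, mean_demo, std_demo = standardize_sketch(x_demo)

# New data can be rescaled with the statistics of the training data.
x_new = np.array([[2.0, 40.0]])
x_new_std = (x_new - mean_demo) / std_demo
```

Returning the mean and standard deviation alongside the rescaled matrix makes it possible to apply exactly the same transformation to the test set later.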

## Model Selection

Let's begin with a simple linear regression, fitted by least squares using the **normal equations**. We don't use least squares with gradient descent or stochastic gradient descent here because **the optimal w can be derived analytically**, so there is no need to approximate it iteratively.

In [None]:
# gamma and max_iter are not used by the closed-form solver below
gamma = 0.5
max_iter = 1000
loss, w = least_squares(y, tx)
w
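
As an illustration of the closed-form solution, here is a minimal sketch that solves the normal equations with NumPy. It is not the `least_squares` from `models`; the name `least_squares_sketch` and the `1/(2N)` scaling of the loss are assumptions. Only the `(loss, w)` return order follows the call above.

```python
import numpy as np

def least_squares_sketch(y, tx):
    """Solve (tx^T tx) w = tx^T y and return (loss, w)."""
    gram = tx.T @ tx
    # np.linalg.solve is preferred over forming the inverse explicitly.
    w = np.linalg.solve(gram, tx.T @ y)
    residual = y - tx @ w
    # Assumption: loss is the mean squared error scaled by 1/2.
    loss = residual @ residual / (2 * len(y))
    return loss, w

# Noise-free toy data, so the true weights are recovered up to rounding.
rng = np.random.default_rng(0)
tx_demo = np.c_[np.ones(50), rng.normal(size=(50, 2))]
w_true = np.array([1.0, -2.0, 0.5])
y_demo = tx_demo @ w_true
loss_demo, w_demo = least_squares_sketch(y_demo, tx_demo)
```

If `tx.T @ tx` is singular or badly conditioned, `np.linalg.lstsq` is a more robust alternative.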

We run cross-validation 4 times on our training data to assess the performance of the least-squares model.

In [None]:
# TODO
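
Until this cell is filled in, the following is one possible sketch of the idea. It reads "4 times" as 4 folds, which is an assumption, and it does not use the project's `cross_validation` from `evaluation`, whose signature is not shown here. The helper names `k_fold_indices` and `cross_validate_ls` are hypothetical.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=1):
    """Shuffle the row indices and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def cross_validate_ls(y, tx, k=4, seed=1):
    """Return the held-out MSE of least squares for each of the k folds."""
    folds = k_fold_indices(len(y), k, seed)
    test_mses = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the k-1 training folds only.
        w = np.linalg.lstsq(tx[train_idx], y[train_idx], rcond=None)[0]
        err = y[test_idx] - tx[test_idx] @ w
        test_mses.append(np.mean(err ** 2))
    return np.array(test_mses)

# Noise-free toy data for a quick check.
rng = np.random.default_rng(0)
tx_cv = np.c_[np.ones(40), rng.normal(size=(40, 2))]
y_cv = tx_cv @ np.array([0.5, 1.0, -1.0])
scores = cross_validate_ls(y_cv, tx_cv, k=4)
```

The mean and spread of `scores` give a rough estimate of how the model generalizes beyond the rows it was fitted on.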

Analysis of this model.

### Logistic Regression

Choose initial parameters

In [None]:
n_iters = 10000
gamma = 0.00001

Train with logistic regression

In [None]:
loss, w = logistic_regression(y, tx, gamma, n_iters)
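
For readers without `models` at hand, a minimal sketch of logistic regression trained by plain gradient descent. This is not the project's `logistic_regression`; it assumes labels in {0, 1}, zero initial weights, and a gradient summed (not averaged) over the samples, none of which is confirmed by the notebook.

```python
import numpy as np

def sigmoid(t):
    """Logistic function, mapping real scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

def logistic_regression_sketch(y, tx, gamma, n_iters):
    """Gradient descent on the negative log-likelihood; returns (loss, w)."""
    w = np.zeros(tx.shape[1])  # assumption: start from zero weights
    for _ in range(n_iters):
        grad = tx.T @ (sigmoid(tx @ w) - y)  # summed over all samples
        w = w - gamma * grad
    p = sigmoid(tx @ w)
    eps = 1e-12  # keeps log() finite when p reaches 0 or 1
    loss = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return loss, w

# Tiny toy problem: first column is the bias term, labels are in {0, 1}.
tx_lr = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y_lr = np.array([0.0, 0.0, 1.0, 1.0])
loss_lr, w_lr = logistic_regression_sketch(y_lr, tx_lr, gamma=0.1, n_iters=500)
```

With a summed gradient, the step size has to shrink as the number of samples grows, which is one reason a very small `gamma` can be needed on a large training set.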

Predict with our trained model

In [None]:
test_x = np.genfromtxt('data/test.csv', delimiter=',', skip_header=1)
# remove id and prediction columns; standardize returns (x, mean, std), so keep only x
test_x = standardize(test_x[:, 2:])[0]
# could've used load_csv_data
# predict_labels takes the weights first, matching its other call in this notebook
create_csv_submission(list(range(350000, 918238)), predict_labels(w, test_x), 'res.csv')

In [None]:
import numpy as np

from helpers import *
from models import *
from evaluation import *
from gradient import *
from split import *

gamma = 0.00001
n_iters = 2000

y, x, ids = load_csv_data('data/train.csv')
x = build_poly(x)
split_train = split_jets(y, x)
test_y, test_x, test_ids = load_csv_data('data/test.csv')
split_test = split_jets(test_y, test_x)

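# Train one logistic regression model per jet group
ws = []
for group in split_train:
    sub_y, sub_x, id_indices = group
    sub_tx = standardize(sub_x)[0]  # keep only the standardized matrix
    loss, w = logistic_regression(sub_y, sub_tx, gamma, n_iters)
    ws.append(w)

# Predict each test group with the model trained on the matching train group
res = {}
for index, group in enumerate(split_test):
    sub_y, sub_x, id_indices = group
    sub_tx = standardize(sub_x)[0]
    pred_y = predict_labels(ws[index], sub_tx)
    # map record id -> predicted label
    res.update(dict(zip(test_ids[id_indices], pred_y)))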
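
The cell above relies on `split_jets` from `split`, which is not shown in this notebook. As a rough sketch of the idea, the hypothetical `split_by_category` below partitions the rows by the value of one categorical column and returns `(labels, features, row indices)` tuples, mirroring how the groups are unpacked above. Which column holds the jet number is an assumption left to the caller.

```python
import numpy as np

def split_by_category(y, x, column):
    """Group rows by the distinct values found in x[:, column]."""
    groups = []
    for value in np.unique(x[:, column]):
        # Row positions that belong to this category value.
        idx = np.flatnonzero(x[:, column] == value)
        groups.append((y[idx], x[idx], idx))
    return groups

# Toy data: column 0 plays the role of the categorical feature.
x_demo_split = np.array([[0.0, 1.5], [1.0, 2.5], [0.0, 3.5], [2.0, 4.5]])
y_demo_split = np.array([1, -1, 1, -1])
groups_demo = split_by_category(y_demo_split, x_demo_split, column=0)
```

Keeping the row indices in each group is what allows the predictions to be mapped back to the original record IDs afterwards.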

In [None]:
create_csv_submission(res.keys(), res.values(), 'hah.csv')

In [22]:
gamma = 0.000003
n_iters = 5000

In [24]:
import test
import helpers
import numpy as np
import evaluation

from models import logistic_regression

In [31]:
import importlib
importlib.reload(test)

<module 'test' from '/Users/junze/Documents/ML_course/PCML_project1/test.py'>

In [None]:
for cut in range(5):
    pass  # TODO: loop body was left unfinished

## Feature Engineering
TODO

## Prediction

**Generate predictions and save the output in CSV format for submission**