## Logistic Regression
### Definition
Regression: Statistical process for estimating the relationships among variables, often to make a prediction about some outcome.

Linear:
1. Used for:
    1. Continuous target variables, modeled as y = mx + b
 

Logistic:
1. Used for:
    1. binary target variable
    2. Well-behaved data
    3. Quick benchmarking
2. Not good for:
    1. Continuous target variables
    2. Fat or Skinny data
    3. Massive data sets
    4. Problems where top predictive performance is required
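
To connect the two models above: logistic regression can be read as a linear score (the analogue of mx + b) passed through the logistic (sigmoid) function, which turns it into a probability for the binary target. The following is a minimal sketch on synthetic data from `make_classification`, which is an assumption made for illustration and not the data set used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary-target data (illustrative assumption, not the notebook's data)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = LogisticRegression().fit(X, y)

# Linear part: one score per row, the analogue of m*x + b
linear_score = model.decision_function(X)

# The logistic (sigmoid) function maps each score to a probability in (0, 1)
prob_manual = 1.0 / (1.0 + np.exp(-linear_score))

# scikit-learn's own probability for the positive class
prob_sklearn = model.predict_proba(X)[:, 1]

print(np.allclose(prob_manual, prob_sklearn))
```

The comparison at the end checks that the hand-computed probabilities agree with `predict_proba` for this binary problem.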


### Hyperparameters
#### C
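1. Considered a regularization parameter

C is the inverse of the regularization strength $\lambda$: $C = 1/\lambda$

As $\lambda \to 0$, $C \to \infty$
Therefore, Low Regularization (large C) = High Complexity, and More Likely to OVERFIT

The inverse, High Regularization (small C), leads to UNDERFIT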
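
One way to see the effect of C is to watch the size of the fitted coefficients: stronger regularization (smaller C) shrinks them toward zero. This is a rough sketch on synthetic data; the data set, the three C values, and `max_iter=5000` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (illustrative assumption, not the notebook's data)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

coef_norms = {}
for C in [0.001, 1, 1000]:
    # Small C = strong regularization, large C = weak regularization
    model = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print('C={}: coefficient norm = {:.3f}'.format(C, coef_norms[C]))
```

A small coefficient norm corresponds to a simpler, more constrained model; a large one means the model is free to fit the training data more closely.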

In [1]:
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
# suppress sklearn warnings by replacing warnings.warn with a no-op
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

%matplotlib inline

tr_features = pd.read_csv('output/train_features.csv')
tr_labels = pd.read_csv('output/train_labels.csv')


tr_labels.head()

Unnamed: 0,Bullet
0,0
1,1
2,0
3,0
4,0


In [2]:
def print_results(results):
    print('BEST PARAMS: {}\n'.format(results.best_params_))
    
    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}.'.format(round(mean, 3), round(std*2, 3), params))
        

In [3]:
lr = LogisticRegression()
parameters = {
    'C':[0.001, 0.01, 0.1, 1, 10, 100, 1000]
}

cv = GridSearchCV(lr, parameters, cv=5)

cv.fit(tr_features, tr_labels.values.ravel())

print_results(cv)

BEST PARAMS: {'C': 0.1}

0.809 (+/-0.041) for {'C': 0.001}.
0.826 (+/-0.045) for {'C': 0.01}.
0.846 (+/-0.057) for {'C': 0.1}.
0.839 (+/-0.04) for {'C': 1}.
0.829 (+/-0.059) for {'C': 10}.
0.838 (+/-0.052) for {'C': 100}.
0.839 (+/-0.036) for {'C': 1000}.


In [4]:
cv.best_estimator_

In [5]:
joblib.dump(cv.best_estimator_,'output/lr_model.pkl')

['output/lr_model.pkl']
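
A model saved with `joblib.dump` can be read back with `joblib.load` and used for prediction without refitting. The sketch below shows the round trip on a throwaway model in a temporary directory; the synthetic data and the temporary path are assumptions, used so the example does not depend on the `output/` folder.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Throwaway model on synthetic data (illustrative assumption)
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model = LogisticRegression(C=0.1).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'lr_model.pkl')
    joblib.dump(model, path)       # write the fitted model to disk
    restored = joblib.load(path)   # read it back as a new object

# The restored model should predict exactly like the original
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```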

## Support Vector Machine
### Definition
Support vector machine (SVM): a classifier that finds the optimal hyperplane, the one that maximizes the margin between two classes.

The kernel trick (or kernel method) implicitly maps data that is not linearly separable in its original n-dimensional space into a higher-dimensional space where it is linearly separable.
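
The benefit of a kernel shows up on data that no straight line can split. The sketch below compares a linear kernel with an RBF kernel on two concentric circles generated by `make_circles`; the data set and its settings are assumptions chosen to make the contrast visible.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in 2-D (illustrative data)
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# Mean 5-fold cross-validated accuracy for each kernel
linear_acc = cross_val_score(SVC(kernel='linear'), X, y, cv=5).mean()
rbf_acc = cross_val_score(SVC(kernel='rbf'), X, y, cv=5).mean()

print('linear kernel accuracy: {:.3f}'.format(linear_acc))
print('rbf kernel accuracy:    {:.3f}'.format(rbf_acc))
```

The linear kernel has to separate the rings with a single line, while the RBF kernel can separate them in its implicit higher-dimensional space.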

When to use SVM:
1. Binary target variable
2. Feature-to-row ratio is high (Short and Fat data)
3. Very complex relationships
4. Lots of outliers

When not to use it:
1. Feature-to-row ratio is very low
2. Time or compute power are constraints
3. Transparency is important


### Hyperparameters
#### C
Regularization parameter: a penalty term that determines how closely the model fits the training set. As with logistic regression, a larger C means less regularization.
#### Kernel
Determines the type of transformation used to make the data linearly separable.

In [6]:
from sklearn.svm import SVC

SVC()

In [7]:
tr_features = pd.read_csv('output/train_features.csv')
tr_labels = pd.read_csv('output/train_labels.csv')

In [8]:
# Optimize hyperparameters
svc = SVC()
parameters = {
    'kernel': ['linear', 'rbf'],
    'C': [0.1, 1, 10]
}

cv = GridSearchCV(svc, parameters, cv=5)

cv.fit(tr_features, tr_labels.values.ravel())
print_results(cv)

BEST PARAMS: {'C': 10, 'kernel': 'rbf'}

0.809 (+/-0.09) for {'C': 0.1, 'kernel': 'linear'}.
0.793 (+/-0.062) for {'C': 0.1, 'kernel': 'rbf'}.
0.813 (+/-0.086) for {'C': 1, 'kernel': 'linear'}.
0.827 (+/-0.082) for {'C': 1, 'kernel': 'rbf'}.
0.804 (+/-0.076) for {'C': 10, 'kernel': 'linear'}.
0.844 (+/-0.049) for {'C': 10, 'kernel': 'rbf'}.


In [9]:
joblib.dump(cv.best_estimator_,'output/svm_model.pkl')

['output/svm_model.pkl']

## Multilayer Perceptron
### Definition
Classic feed-forward artificial neural network, the core component of deep learning.

Structured as a directed acyclic graph: information flows one way, from inputs to outputs.

What is it used for:
1. Categorical or continuous target values (both classification and regression)
2. Very complex relationships, care about performance above all else
3. Control over the training process is important

What is it not good for:
1. Image recognition, time series, etc.
2. Transparency is important
3. Quick benchmarking
4. Small data sets

### Hyperparameters

#### Activation
The function applied at the nodes in the hidden layers; it introduces non-linearity into the model.

Types
1. Sigmoid (s-curve)
2. TanH
3. ReLU (very popular)
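
The three activation types listed above are short formulas. This sketch writes them with NumPy to show what each does to a few sample inputs; the helper names are chosen here for illustration and are not scikit-learn functions.

```python
import numpy as np

def sigmoid(z):
    # S-curve: squashes any input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Like sigmoid, but squashes into the range (-1, 1)
    return np.tanh(z)

def relu(z):
    # Passes positive inputs through unchanged, clips negatives to 0
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print('sigmoid:', sigmoid(z))
print('tanh:   ', tanh(z))
print('relu:   ', relu(z))
```

In scikit-learn's `MLPClassifier` these correspond to `activation='logistic'`, `'tanh'`, and `'relu'`.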

#### Hidden Layer Size
How many hidden layers there are, and how many nodes are in each
Controls the complexity of the model
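
To make the hidden layer size concrete, the sketch below fits a small network with two hidden layers (10 nodes, then 5) and inspects the shape of each weight matrix. The synthetic data, the layer sizes, and the short `max_iter` are assumptions for illustration only.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic data with 4 input features (illustrative assumption)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# One tuple entry per hidden layer: 10 nodes, then 5 nodes
mlp = MLPClassifier(hidden_layer_sizes=(10, 5), max_iter=50, random_state=0)
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # a run this short may not converge
    mlp.fit(X, y)

# One weight matrix per connection between consecutive layers
weight_shapes = [w.shape for w in mlp.coefs_]
print(weight_shapes)
```

More layers or more nodes per layer means more weights to learn, which is why this setting controls model complexity.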

#### Learning Rate
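Controls whether the algorithm will find the optimal solution, and how quickly it does so.

High rate - takes large steps, so it trains quickly but can overshoot and miss the optimal solution
Low rate - takes small steps, so it is more likely to reach the optimal solution but will take a really long time. Too low a rate can stall before getting there and give a false optimal solution

Note that in scikit-learn's `MLPClassifier`, the `learning_rate` parameter selects the schedule (`constant`, `invscaling`, or `adaptive`) and is only used when `solver='sgd'`; the initial step size is set by `learning_rate_init`.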
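
The trade-off between a high and a low rate can be seen with plain gradient descent on a toy function. This sketch minimizes f(x) = x^2, whose optimum is x = 0; the function, the starting point, and the three rates are assumptions picked for illustration and have nothing to do with how `MLPClassifier` is implemented.

```python
def gradient_descent(learning_rate, steps=50, start=5.0):
    """Minimize f(x) = x**2 with a fixed learning rate; return the final x."""
    x = start
    for _ in range(steps):
        grad = 2.0 * x                 # derivative of x**2
        x = x - learning_rate * grad   # step against the gradient
    return x

# Too low, reasonable, and too high
results = {rate: gradient_descent(rate) for rate in [0.001, 0.1, 1.1]}
for rate, final_x in results.items():
    print('learning rate {}: final x = {:.4g}'.format(rate, final_x))
```

With the same number of steps, the lowest rate has barely moved from the start, the middle rate lands close to 0, and the highest rate overshoots further on every step and diverges.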


In [10]:
from sklearn.neural_network import MLPClassifier

In [11]:
tr_features = pd.read_csv('output/train_features.csv')
tr_labels = pd.read_csv('output/train_labels.csv')

In [12]:
mlp = MLPClassifier()
parameters = {
    'hidden_layer_sizes': [(10,),(50,),(100,)],
    'activation': ['relu', 'tanh', 'logistic'],
    'learning_rate':['constant', 'invscaling', 'adaptive']
}

cv = GridSearchCV(mlp, parameters, cv=5)

cv.fit(tr_features, tr_labels.values.ravel())
print_results(cv)

BEST PARAMS: {'activation': 'logistic', 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive'}

0.829 (+/-0.049) for {'activation': 'relu', 'hidden_layer_sizes': (10,), 'learning_rate': 'constant'}.
0.819 (+/-0.068) for {'activation': 'relu', 'hidden_layer_sizes': (10,), 'learning_rate': 'invscaling'}.
0.834 (+/-0.056) for {'activation': 'relu', 'hidden_layer_sizes': (10,), 'learning_rate': 'adaptive'}.
0.841 (+/-0.038) for {'activation': 'relu', 'hidden_layer_sizes': (50,), 'learning_rate': 'constant'}.
0.843 (+/-0.044) for {'activation': 'relu', 'hidden_layer_sizes': (50,), 'learning_rate': 'invscaling'}.
0.844 (+/-0.06) for {'activation': 'relu', 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive'}.
0.839 (+/-0.049) for {'activation': 'relu', 'hidden_layer_sizes': (100,), 'learning_rate': 'constant'}.
0.841 (+/-0.049) for {'activation': 'relu', 'hidden_layer_sizes': (100,), 'learning_rate': 'invscaling'}.
0.847 (+/-0.043) for {'activation': 'relu', 'hidden_layer_sizes': (10

In [13]:
joblib.dump(cv.best_estimator_,'output/mlp_model.pkl')

['output/mlp_model.pkl']

## Random Forest
Merges a collection of independent decision trees to get a more accurate and stable prediction

Is an ensemble method, which reduces bias and variance

Independently built decision trees on separate samples of data

When to use it:
1. Categorical or continuous target variables
2. Interested in significance of predictors
3. Need a quick benchmark model
4. Really messy data
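
For the point about significance of predictors, a fitted random forest exposes a `feature_importances_` array with one score per feature. The sketch below is a minimal example on synthetic data, where only the first two columns carry signal; the data set and its settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# With shuffle=False the two informative features are the first two columns;
# the remaining four are noise (illustrative data, not the notebook's)
X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

importances = rf.feature_importances_   # one score per feature
ranking = importances.argsort()[::-1]   # feature indices, most important first

for idx in ranking:
    print('feature {}: {:.3f}'.format(idx, importances[idx]))
```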

When not to use it:
1. Probably not the best model for a problem. Can get you 90% there, but probably not 100%
2. Quick to train, but not as quick to make predictions
3. Good performance, but not the best
4. Like a Swiss Army knife: handles many jobs adequately, but is rarely the best tool for any single one

### Hyperparameters

#### n_estimators
How many independent decision trees are built
Width of the model

#### max_depth
How deep the decision trees can go
Depth of the model
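
Both hyperparameters can be checked directly on a fitted forest: the individual trees live in `estimators_`, and each tree reports its own depth. This is a small sketch on synthetic data; the data set and the values 25 and 3 are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (illustrative assumption, not the notebook's data)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

rf = RandomForestClassifier(n_estimators=25, max_depth=3, random_state=0)
rf.fit(X, y)

# Width: how many independent trees were built
n_trees = len(rf.estimators_)

# Depth: the deepest tree can never exceed max_depth
deepest = max(tree.get_depth() for tree in rf.estimators_)

print('trees built: {}, deepest tree: {}'.format(n_trees, deepest))
```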

In [14]:
from sklearn.ensemble import RandomForestClassifier

In [15]:
tr_features = pd.read_csv('output/train_features.csv')
tr_labels = pd.read_csv('output/train_labels.csv')

In [16]:
rf = RandomForestClassifier()
parameters = {
    'n_estimators': [5, 50, 250],
    'max_depth': [2, 4, 8, 16, 32, None]
}

cv = GridSearchCV(rf, parameters, cv=5)

cv.fit(tr_features, tr_labels.values.ravel())
print_results(cv)

BEST PARAMS: {'max_depth': 2, 'n_estimators': 250}

0.851 (+/-0.055) for {'max_depth': 2, 'n_estimators': 5}.
0.857 (+/-0.057) for {'max_depth': 2, 'n_estimators': 50}.
0.861 (+/-0.053) for {'max_depth': 2, 'n_estimators': 250}.
0.849 (+/-0.046) for {'max_depth': 4, 'n_estimators': 5}.
0.852 (+/-0.056) for {'max_depth': 4, 'n_estimators': 50}.
0.859 (+/-0.047) for {'max_depth': 4, 'n_estimators': 250}.
0.819 (+/-0.055) for {'max_depth': 8, 'n_estimators': 5}.
0.852 (+/-0.051) for {'max_depth': 8, 'n_estimators': 50}.
0.846 (+/-0.05) for {'max_depth': 8, 'n_estimators': 250}.
0.829 (+/-0.06) for {'max_depth': 16, 'n_estimators': 5}.
0.837 (+/-0.042) for {'max_depth': 16, 'n_estimators': 50}.
0.834 (+/-0.051) for {'max_depth': 16, 'n_estimators': 250}.
0.818 (+/-0.065) for {'max_depth': 32, 'n_estimators': 5}.
0.834 (+/-0.046) for {'max_depth': 32, 'n_estimators': 50}.
0.836 (+/-0.052) for {'max_depth': 32, 'n_estimators': 250}.
0.826 (+/-0.048) for {'max_depth': None, 'n_estimators': 5}

In [17]:
joblib.dump(cv.best_estimator_,'output/rf_model.pkl')

['output/rf_model.pkl']

## Boosting
### Definition
Ensemble method that aggregates a number of weak (just better than random guess) models to create a strong (highly correlated and predictive) one.

Each model learns from the mistakes of the model before it. 
Slow to train, but fast to predict.
One of the most-used algorithms in machine learning.
Tends to overfit.
Trees for boosting should be shallower than they are for random forest
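
The idea that each model learns from the mistakes of the one before it can be sketched by hand. The example below is a simplified version for regression with squared error, where the "mistakes" are simply the residuals; it is not how `GradientBoostingClassifier` is implemented internally, and the toy data, the number of rounds, and the learning rate are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data: a noisy sine wave (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start from a constant guess
mse_history = [float(np.mean((y - prediction) ** 2))]

for _ in range(100):
    residuals = y - prediction           # mistakes of the ensemble so far
    # Weak learner: a depth-1 tree (a "stump") fitted to those mistakes
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    # Add a scaled-down correction to the running prediction
    prediction += learning_rate * stump.predict(X)
    mse_history.append(float(np.mean((y - prediction) ** 2)))

print('training MSE before boosting: {:.4f}'.format(mse_history[0]))
print('training MSE after boosting:  {:.4f}'.format(mse_history[-1]))
```

Each stump on its own is a very weak model, but the training error keeps falling as corrections accumulate. That same mechanism is why boosting tends to overfit if run for too many rounds.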

When is it used:
1. Categorical or continuous target variable
2. Useful on nearly any type of problem
3. Interested in significance of predictors
4. Prediction time is important

When is it not used:
1. Transparency is important
2. Training time or compute power are important
3. Data is very noisy

### Hyperparameters
The hyperparameters below are for gradient boosted trees, which is only one type of boosting.

#### n_estimators 
Same as random forest

#### max_depth
Same as random forest

#### learning_rate
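The actual learning rate: a step size that scales each tree's contribution. It remains constant throughout training, unlike the multilayer perceptron, where it can follow a schedule.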



In [18]:
from sklearn.ensemble import GradientBoostingClassifier

In [19]:
tr_features = pd.read_csv('output/train_features.csv')
tr_labels = pd.read_csv('output/train_labels.csv')

In [20]:
gb = GradientBoostingClassifier()
parameters = {
    'n_estimators': [5, 50, 250, 500],
    'max_depth': [1, 3, 5, 7, 9],
    'learning_rate':[0.01, 0.1, 1, 10, 100]
}

cv = GridSearchCV(gb, parameters, cv=5)

cv.fit(tr_features, tr_labels.values.ravel())
print_results(cv)

BEST PARAMS: {'learning_rate': 0.1, 'max_depth': 1, 'n_estimators': 50}

0.635 (+/-0.003) for {'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 5}.
0.828 (+/-0.033) for {'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 50}.
0.854 (+/-0.05) for {'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 250}.
0.856 (+/-0.05) for {'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 500}.
0.635 (+/-0.003) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 5}.
0.831 (+/-0.023) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 50}.
0.856 (+/-0.052) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 250}.
0.854 (+/-0.068) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 500}.
0.635 (+/-0.003) for {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 5}.
0.831 (+/-0.031) for {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 50}.
0.838 (+/-0.057) for {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 250}.
0.829 (+/-0.052) for {'le

In [21]:
joblib.dump(cv.best_estimator_,'output/gb_model.pkl')

['output/gb_model.pkl']