<a href="https://colab.research.google.com/github/mina19/machine_learning_algorithms/blob/main/MLAlgorithms.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
%matplotlib inline

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

from sklearn.metrics import accuracy_score, precision_score, recall_score
from time import time

import joblib

In [None]:
mydata = pd.read_csv('../../../file.csv')

features = pd.read_csv('../../filename.csv')
labels = pd.read_csv('../../labels.csv')

mydata.head()

1. Explore and clean the data.
2. Split data into training, validation, and testing.
3. Fit an initial model and evaluate.
4. Tune hyperparameters using k-fold cross validation.
5. Evaluate on validation set.
6. Select best model and evaluate on test set.

**Linear Regression**

**Example:** Number of umbrellas sold as a function of how much it rains

**Use when:** Continuous target variable

**Don't use when:**

In [None]:
# Drop all categorical features
categorical_features = ['PassengerID', 'Name']
mydata.drop(categorical_features, axis=1, inplace=True)

In [None]:
# Explore continuous features
mydata.describe()

# The count could reveal missing values.
mydata.groupby('TargetLabel').mean()

In [None]:
# Missing at random? Or in a systematic way?
mydata.groupby(mydata['Age'].isnull()).mean()

In [None]:
# Plot continuous features
for i in ['Feature1', 'Feature2']:
  died = list(titanic[titanic['Survived'] == 0][i].dropna())
  survived = list(titanic[titanic['Survived'] == 1][i].dropna())
  xmin = min(min(died), min(survived))
  xmax = max(max(died), max(survived))
  width = (xmax - xmin) / 40
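  # sns.distplot is deprecated in recent seaborn releases; histplot draws the same overlaid histograms
  sns.histplot(died, color='r', kde=False, alpha=0.4, bins=np.arange(xmin, xmax, width))
  sns.histplot(survived, color='g', kde=False, alpha=0.4, bins=np.arange(xmin, xmax, width))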
  plt.legend(['Did not survive', 'Survived'])
  plt.title('Overlaid histograms for {}'.format(i))
  plt.show()

In [None]:
for i, col in enumerate(['Pclass', 'SibSp', 'Parch']):
    plt.figure(i)
    sns.catplot(x=col, y='Survived', data=titanic, kind='point', aspect=2)

Fill missing values as needed.

In [None]:
df.isnull().sum()

In [None]:
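# Assign the result back; calling fillna(inplace=True) on a selected column may not update df in newer pandas
df['Age'] = df['Age'].fillna(df['Age'].mean())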

In [None]:
sns.catplot(x=col, y='Survived', data=df, kind='point', aspect=2)

In [None]:
df.drop(['A', 'B', 'C'], axis=1, inplace=True)

In [None]:
df.groupby(df['MissingCategory'].isnull())['y'].mean()

In [None]:
df['indicator'] = np.where(df['MissingCategory'].isnull(), 0, 1)

In [None]:
gender_num = {'male': 0, 'female': 1}
df['gender'] = df['gender'].map(gender_num)

In [None]:
# Explore categorical features
mydata.info()

for i, col in enumerate(['Pclass', 'SibSp', 'Parch']):
    plt.figure(i)
    sns.catplot(x=col, y='Survived', data=titanic, kind='point', aspect=2)

df.pivot_table('Survived', index='Sex', columns='Embarked', aggfunc='count')
df.pivot_table('Survived', index='Cabin_ind', columns='Embarked', aggfunc='count')

Split into training, validation, and test sets

In [None]:
features = df.drop('y', axis=1)
labels = df['y']

# Hold out 40% of the rows, then split that holdout in half: 60% train, 20% validation, 20% test
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=1)

**Logistic Regression**

**Use when:** binary target variable, transparency is important or interested in significance of features, fairly well-behaved data, need a quick initial benchmark

**Don't use when:** continuous target variable, massive amount of data (rows or columns), outliers or skewed features, performance is the only thing that matters

In [None]:
lr = LogisticRegression()


C is the inverse of the regularization strength (C = 1/lambda); the default is 1.

As C goes to infinity, lambda goes to zero (low regularization): misclassification is penalized heavily and the model is more likely to overfit.

As C goes to zero, lambda goes to infinity (high regularization): misclassification is penalized lightly and the model is more likely to underfit.
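
To make the effect of C concrete, here is a minimal sketch that fits the same model with three values of C and compares the size of the learned weights. The synthetic dataset from `make_classification` and the name `coef_norms` are assumptions for illustration; they are not part of the notebook's data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative assumption, not the notebook's data)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

coef_norms = {}
for C in [0.001, 1, 1000]:
    model = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    # L2 norm of the learned weights: smaller C (stronger regularization) shrinks it
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print('C={:<6} ||w|| = {:.3f}'.format(C, coef_norms[C]))
```

A small weight norm at C=0.001 reflects heavy regularization (risk of underfitting); the larger norm at C=1000 reflects a model that is free to fit the training data closely.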

**K-Fold Cross Validation**

The dataset is split into K subsets. Iterate through those subsets K times: on each pass, fit the model on K-1 subsets and test on the remaining one.
Generate a performance metric on each loop.
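
The loop described above can be sketched by hand with `KFold`. This is roughly what `GridSearchCV(..., cv=5)` does for each parameter setting; the synthetic data and the names `fold_scores` and `held_out` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Synthetic data (illustrative assumption)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
held_out = []
for train_idx, test_idx in kf.split(X):
    # Fit on K-1 subsets, test on the remaining subset
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    fold_scores.append(accuracy_score(y[test_idx], pred))
    held_out.extend(test_idx)

print('Fold accuracies:', np.round(fold_scores, 3))
print('Mean accuracy:', round(float(np.mean(fold_scores)), 3))
```

Every row is held out exactly once, so the mean of the fold scores uses all of the data for evaluation without ever testing on rows the model was fit on.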

In [None]:
parameters = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

In [None]:
cv = GridSearchCV(lr, parameters, cv=5)
cv.fit(X_train, y_train)
cv.best_estimator_

**Support Vector Machines:** a classifier that finds the hyperplane that maximizes the margin between two classes. Support vectors are the training points closest to the decision boundary; they determine where the boundary lies.

A **kernel trick/method** implicitly maps data that is not linearly separable in n-dimensional space to a higher-dimensional space where it may be linearly separable.
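
A minimal sketch of why the kernel matters, assuming a synthetic two-ring dataset from `make_circles` (not the notebook's data): no straight line separates the rings, so a linear kernel struggles while an RBF kernel separates them easily.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in 2-D (illustrative assumption)
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

kernel_scores = {}
for kernel in ['linear', 'rbf']:
    clf = SVC(kernel=kernel).fit(X_tr, y_tr)
    # Accuracy on the held-out rows for this kernel
    kernel_scores[kernel] = clf.score(X_te, y_te)
    print('{:<6} kernel accuracy: {:.3f}'.format(kernel, kernel_scores[kernel]))
```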

**Use when:** Binary target variable, feature-to-row ratio is very high (short and fat data), very complex relationships, lots of outliers

**Don't use when:** Feature-to-row ratio is very low, transparency is important or interested in feature importance, looking for a quick benchmark model

In [None]:
svc = SVC()
parameters = {'C': [0.1, 1, 10],
              'kernel': ['linear', 'rbf']}
cv = GridSearchCV(svc, parameters, cv=5)
cv.fit(X_train, y_train)
cv.best_estimator_

**Multilayer Perceptron** is a classic feed-forward artificial neural network, the core component of deep learning. It is a connected series of nodes (in the form of a directed acyclic graph) where each node represents a function or a model.

Input layer has one node for each feature.
Hidden layer with as many nodes as you want. Each node is a function.

Output layer with one node for each possible outcome (or a single node).
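
The layer structure above can be sketched as a single forward pass in NumPy. The layer sizes, the random untrained weights, and the `forward` helper are all assumptions for illustration; `MLPClassifier` learns its weights during `fit` and handles this internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 input features, 10 hidden nodes, 3 possible outcomes
n_features, n_hidden, n_classes = 4, 10, 3

# Random (untrained) weights, one matrix per layer transition
W1, b1 = rng.normal(size=(n_features, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, n_classes)), np.zeros(n_classes)

def forward(X):
    # Hidden layer: each node applies a function (here ReLU) to a weighted sum of inputs
    hidden = np.maximum(0, X @ W1 + b1)
    # Output layer: one score per possible outcome, turned into probabilities by softmax
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

X = rng.normal(size=(5, n_features))  # 5 example rows
probs = forward(X)
print(probs.shape)
```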

**Use when:** categorical or continuous target variable, very complex relationships or performance is the only thing that matters, when control over the training process is important

**Don't use when:** Transparency is important or interested in feature significance, need a quick benchmark model, limited data available


In [None]:
def print_results(results):
  print('Best params: {}\n'.format(results.best_params_))
  means = results.cv_results_['mean_test_score']
  stds = results.cv_results_['std_test_score']
  for mean, std, params in zip(means, stds, results.cv_results_['params']):
    print('{} (+/-{}) for {}'.format(round(mean, 3), round(std*2, 3), params))

In [None]:
mlp = MLPClassifier()
parameters = {'activation': ['relu', 'tanh', 'logistic'],
              'hidden_layer_sizes': [(10,), (50,), (100,)],
              'learning_rate': ['constant', 'invscaling', 'adaptive']}
cv = GridSearchCV(mlp, parameters, cv=5)
cv.fit(X_train, y_train.values.ravel())
print_results(cv)
cv.best_estimator_

The **learning rate** hyperparameter governs both how quickly the algorithm converges and whether it finds the optimal solution at all.
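
A toy illustration, assuming plain gradient descent on f(x) = x^2 rather than the MLP's actual training procedure: a tiny learning rate converges slowly, a moderate one converges quickly, and one that is too large overshoots and diverges. The `gradient_descent` helper is hypothetical.

```python
def gradient_descent(learning_rate, n_steps=50, x0=1.0):
    """Minimize f(x) = x**2 starting from x0; the minimum is at x = 0."""
    x = x0
    for _ in range(n_steps):
        grad = 2 * x                  # derivative of x**2
        x = x - learning_rate * grad  # step against the gradient
    return x

results = {lr: gradient_descent(lr) for lr in [0.01, 0.4, 1.1]}
for lr, x_final in results.items():
    print('learning_rate={:<5} final x = {:.3g}'.format(lr, x_final))
```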

**Random Forest** merges a collection of independent decision trees to get a more accurate and stable prediction. It is a type of ensemble method, which combines several machine learning models in order to decrease both bias and variance.

Take multiple data samples using sampling with replacement. Create a feature sample for each data sample. Then, for each data and feature subset, build a decision tree. To predict, run each example through all of the decision trees and take the majority vote as the final prediction.
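
The procedure above can be sketched by hand with plain decision trees. This is a simplified illustration under stated assumptions: the data is synthetic, the names `forest`, `majority`, and `bagged_accuracy` are hypothetical, and one feature subset is drawn per tree as described above, whereas scikit-learn's `RandomForestClassifier` draws a feature subset at each split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic data (illustrative assumption)
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

n_trees, n_sub_features = 25, 4  # odd tree count avoids tied votes
forest = []
for _ in range(n_trees):
    # Data sample: draw rows with replacement (a bootstrap sample)
    rows = rng.integers(0, len(X_tr), size=len(X_tr))
    # Feature sample: a random subset of columns for this tree
    cols = rng.choice(X_tr.shape[1], size=n_sub_features, replace=False)
    tree = DecisionTreeClassifier(max_depth=4, random_state=0)
    tree.fit(X_tr[rows][:, cols], y_tr[rows])
    forest.append((tree, cols))

# Run every test row through every tree, then take the majority vote
votes = np.array([tree.predict(X_te[:, cols]) for tree, cols in forest])
majority = (votes.mean(axis=0) > 0.5).astype(int)
bagged_accuracy = float((majority == y_te).mean())
print('Majority-vote accuracy:', round(bagged_accuracy, 3))
```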

**Use when**: categorical or continuous target variable, interested in significance of predictors, need a quick benchmark model, messy data with missing values or outliers

**Don't use when**: solving a very complex, novel problem, transparency is important, prediction time is important (quick to train but not quick for predictions)

In [None]:
rf = RandomForestClassifier()
parameters = {
    'n_estimators': [5, 50, 250],
    'max_depth': [2, 4, 8, 16, 32, None]}
cv = GridSearchCV(rf, parameters, cv=5)
cv.fit(X_train, y_train.values.ravel())

print_results(cv)

**n_estimators** hyperparameter controls how many individual decision trees will be built.

**max_depth** hyperparameter controls how deep each individual decision tree can go.

A max_depth of 4 with n_estimators of 50 should be decent.

**Boosting** is an ensemble method that aggregates a number of weak models to create one strong model.
A weak model is one that is only slightly better than random guessing. A strong model is one that is strongly correlated with the true classification. In boosting, the decision trees are not independent.

Boosting effectively learns from its mistakes with each iteration.

Take the first data sample and build a shallow decision tree. Evaluate its performance and upweight the misclassified samples. The next model focuses on the samples the first model couldn't quite figure out and builds a new weak model. Repeat the process again and again. By the end, you have n weak models that have learned from previous mistakes.

At prediction time, the models can run in parallel. Their votes are weighted by how each model performed during training.

Boosting is slow for fitting, but fast for prediction. It also has a tendency to overfit.
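
The reweighting and weighted voting described above match AdaBoost-style boosting, which can be sketched in a few lines with decision stumps. This is an illustrative sketch on synthetic data, with hypothetical names (`stumps`, `alphas`, `boosted_predict`); the `GradientBoostingClassifier` used in this notebook works differently, fitting each new tree to the errors (gradients) of the ensemble so far.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative assumption); relabel classes as -1 / +1
X, y01 = make_classification(n_samples=400, n_features=6, n_informative=4,
                             n_redundant=0, class_sep=1.5, random_state=0)
y = 2 * y01 - 1

n_rounds = 20
weights = np.full(len(X), 1 / len(X))  # start with equal sample weights
stumps, alphas = [], []
for _ in range(n_rounds):
    # Weak model: a depth-1 tree fit on the currently weighted samples
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=weights)
    pred = stump.predict(X)
    # Weighted error, clipped to avoid division by zero
    err = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # this model's voting weight
    # Upweight misclassified samples, downweight correct ones, renormalize
    weights = weights * np.exp(-alpha * y * pred)
    weights = weights / weights.sum()
    stumps.append(stump)
    alphas.append(alpha)

def boosted_predict(X_new):
    # Weighted vote across all weak models
    scores = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.where(scores >= 0, 1, -1)

train_accuracy = float((boosted_predict(X) == y).mean())
print('Training accuracy of the boosted ensemble:', round(train_accuracy, 3))
```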

**Use when**: categorical or continuous target variable, useful on nearly any type of problem, interested in significance of predictors, prediction time is important

**Don't use when**: transparency is important, training time is important or compute power is limited, data is really noisy

In [None]:
gb = GradientBoostingClassifier()
parameters = {
    'n_estimators': [5, 50, 250, 500],
    'max_depth': [1,3,5,7,9],
    'learning_rate': [0.01, 0.1, 1, 10, 100]
}

cv = GridSearchCV(gb, parameters, cv=5)
cv.fit(X_train, y_train.values.ravel())

In [None]:
models = {}
for mdl in ['LR', 'SVM', 'MLP', 'RF', 'GB']:
  models[mdl] = joblib.load('../../{}_model.pkl'.format(mdl))

In [None]:
def evaluate_model(name, model, features, labels):
  start = time()
  pred = model.predict(features)
  end = time()
  accuracy = round(accuracy_score(labels, pred), 3)
  precision = round(precision_score(labels, pred), 3)
  recall = round(recall_score(labels, pred), 3)
  # Convert elapsed seconds to milliseconds to match the 'ms' label
  print('{} -- Accuracy: {} / Precision: {} / Recall: {} / Latency: {}ms'.format(name,
                                                                                 accuracy,
                                                                                 precision,
                                                                                 recall,
                                                                                 round((end - start) * 1000, 1)))

In [None]:
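# Compare all models on the validation set created by the split above
for name, mdl in models.items():
  evaluate_model(name, mdl, X_val, y_val)

In [None]:
# Evaluate only the selected model on the held-out test set
evaluate_model('Random Forest', models['RF'], X_test, y_test)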

For spam detection problems, optimize for precision: if the model says an email is spam, it had better be spam, otherwise you miss real emails.

For fraud detection, optimize for recall, because missing any real fraudulent transaction could cost you a lot.
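
A small worked example of the two metrics, using made-up labels (an assumption for illustration) where 1 marks the positive class, such as spam or fraud:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 4 real positives, 6 real negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# The model flags 3 rows: 2 correctly (true positives) and 1 wrongly (false positive);
# it misses 2 real positives (false negatives)
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4
print('Precision: {:.3f} / Recall: {:.3f}'.format(precision, recall))
```

Precision answers "of everything flagged, how much was right?", while recall answers "of everything that should have been flagged, how much was caught?".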

If the best model for overall accuracy is also the slowest, you might choose a slightly less accurate model with much lower latency.

In [None]:
joblib.dump(cv.best_estimator_, '../../model.pkl')