# MACHINE LEARNING 1 - BASICS

# ML Tasks

In [None]:
Types of tasks
  - Classification
  - Regression
  - Structured annotation
  - Clustering
  - Transcription

In [None]:
Challenges
  - Quality of data 
  - Time-consuming tasks: data acquisition, feature extraction and retrieval all take considerable time.
  - Lack of specialists: experienced ML practitioners are still hard to find.
  - No clear objective for formulating business problems 
  - Issue of overfitting & underfitting 
  - Curse of dimensionality: data points with too many features make learning harder, because the data becomes sparse in the feature space.
  - Difficulty in deployment: the complexity of an ML model can make it hard to deploy in real life.

In [None]:
Applications
  - Emotion analysis
  - Sentiment analysis
  - Error detection and prevention
  - Weather forecasting and prediction
  - Stock market analysis and forecasting
  - Speech synthesis
  - Speech recognition
  - Customer segmentation
  - Object recognition
  - Fraud detection
  - Fraud prevention
  - Recommendation of products to customers in online shopping

# Categorisation of the problem

In [None]:
Categorize by input:
- If you have labelled data, it’s a supervised learning problem.
- If you have unlabelled data and want to find structure, it’s an unsupervised learning problem.
- If you want to optimize an objective function by interacting with an environment, it’s a reinforcement learning problem.

Categorize by output:
- If the output of your model is a number, it’s a regression problem.
- If the output of your model is a class, it’s a classification problem.
- If the output of your model is a set of input groups, it’s a clustering problem.
- If you want to detect unusual observations, it’s an anomaly detection problem.

# Learning Classification

## Supervised Learning

In [None]:
- The majority of practical machine learning uses supervised learning.
- Supervised learning is where you have input variables (X) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output.
- Y = f(X)
- It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process.

In [None]:
Supervised learning problems can be further grouped into regression and classification problems.
  - Classification: A classification problem is when the output variable is a category, such as “red” or “blue”, or “disease” and “no disease”.
  - Regression: A regression problem is when the output variable is a real value, such as a price in dollars or a weight.

Some common types of problems built on top of classification and regression include recommendation and time series prediction respectively.

Some popular examples of supervised machine learning algorithms are:
  - Linear regression for regression problems.
  - Random forest for classification and regression problems.
  - Support vector machines for classification problems.

## Unsupervised Learning

In [None]:
Unsupervised learning is where you only have input data (X) and no corresponding output variables.
The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.

This is called unsupervised learning because, unlike supervised learning above, there are no correct answers and there is no teacher. 
Algorithms are left to their own devices to discover and present the interesting structure in the data.

Unsupervised learning problems can be further grouped into clustering and association problems.
  - Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
  - Association:  An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y.

Some popular examples of unsupervised learning algorithms are:
  - k-means for clustering problems.
  - Apriori algorithm for association rule learning problems.

## Semi-Supervised Learning

In [None]:
Problems where you have a large amount of input data (X) and only SOME of the data is labeled (Y) are called semi-supervised learning problems.

These problems sit in between both supervised and unsupervised learning.
  - A good example is a photo archive where only some of the images are labeled (e.g. dog, cat, person) and the majority are unlabeled.
  - Many real world machine learning problems fall into this area.
  - This is because it can be expensive or time-consuming to label data, as it may require access to domain experts, whereas unlabeled data is cheap and easy to collect and store.

You can use unsupervised learning techniques to discover and learn the structure in the input variables.
You can also use supervised learning techniques to make best guess predictions for the unlabeled data, feed that data back into the supervised learning algorithm as training data and use the model to make predictions on new unseen data.
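
The second approach is often called pseudo-labelling. The sketch below is one possible way to do it, assuming scikit-learn; the synthetic data, the choice of logistic regression and the 10-labels-per-class split are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y_true = np.array([0] * 100 + [1] * 100)

# Pretend that only 10 points per class carry a label.
labelled = np.zeros(200, dtype=bool)
labelled[:10] = True
labelled[100:110] = True

# Step 1: train on the small labelled subset.
model = LogisticRegression()
model.fit(X[labelled], y_true[labelled])

# Step 2: make best-guess predictions (pseudo-labels) for the unlabelled points.
pseudo_labels = model.predict(X[~labelled])

# Step 3: feed labelled + pseudo-labelled data back in as training data.
X_all = np.vstack([X[labelled], X[~labelled]])
y_all = np.concatenate([y_true[labelled], pseudo_labels])
model.fit(X_all, y_all)

# y_true is only available here because the data is synthetic.
accuracy = model.score(X, y_true)
print("accuracy against the true labels:", accuracy)
```

In practice it is common to keep only the pseudo-labels the model is confident about; that filtering step is left out here to keep the sketch short.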

# Basic Stats

In [1]:
import pandas as pd
import statsmodels.formula.api as sm
df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
result = sm.ols(formula="A ~ B + C", data=df).fit()
print(result.params)

Intercept    14.952480
B             0.401182
C             0.000352
dtype: float64


In [2]:
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                      A   R-squared:                       0.579
Model:                            OLS   Adj. R-squared:                  0.158
Method:                 Least Squares   F-statistic:                     1.375
Date:                Tue, 16 Jun 2020   Prob (F-statistic):              0.421
Time:                        10:44:00   Log-Likelihood:                -18.178
No. Observations:                   5   AIC:                             42.36
Df Residuals:                       2   BIC:                             41.19
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     14.9525     17.764      0.842      0.4



## Interpretation of a regression Output

In [None]:
R-squared        - coefficient of determination: the share of the variance in the response that the model explains
Adj. R-squared   - same as above, adjusted for the number of predictors (through the degrees of freedom of the residuals)
F-Stat           - measure of how significant the overall fit is: mean square of the model / mean square of the residuals
Prob(F-stat)     - probability of getting an F-stat at least this large if the response were unrelated to the predictors (null hypothesis)
Log-Likelihood   - value of the log-likelihood function at the fitted parameters
AIC              - Akaike Information Criterion: penalizes the log-likelihood by the number of parameters in the model
BIC              - Bayesian Information Criterion: same idea as AIC, but the penalty per parameter is log(number of observations), which exceeds the AIC penalty once there are 8 or more observations
Df Residuals     - degrees of freedom of the residuals: number of observations - number of parameters
Df Model         - number of parameters in the model (not including the constant)
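
As a cross-check, the first block of the summary can be recomputed by hand. This is a sketch using plain NumPy least squares on the same A, B, C data as the statsmodels example; variable names are illustrative.

```python
import numpy as np

# Same data as the DataFrame used in the statsmodels example.
A = np.array([10, 20, 30, 40, 50], dtype=float)
B = np.array([20, 30, 10, 40, 50], dtype=float)
C = np.array([32, 234, 23, 23, 42523], dtype=float)

# Design matrix with an intercept column: A ~ 1 + B + C
X = np.column_stack([np.ones_like(A), B, C])
beta, *_ = np.linalg.lstsq(X, A, rcond=None)

residuals = A - X @ beta
ss_res = np.sum(residuals ** 2)           # residual sum of squares
ss_tot = np.sum((A - A.mean()) ** 2)      # total sum of squares

n, n_params = X.shape
df_model = n_params - 1                   # parameters excluding the constant
df_resid = n - n_params                   # observations - parameters

r_squared = 1 - ss_res / ss_tot
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / df_resid
f_stat = ((ss_tot - ss_res) / df_model) / (ss_res / df_resid)

print(f"R-squared:      {r_squared:.3f}")
print(f"Adj. R-squared: {adj_r_squared:.3f}")
print(f"F-statistic:    {f_stat:.3f}")
```

The printed values should agree with the R-squared, Adj. R-squared and F-statistic entries of the summary table.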

In [None]:
Coef             - estimated value of the coefficient
Std err          - standard error of the coefficient estimate
t                - t-stat: coef / std err (how many standard errors the estimate is away from 0)
P>|t|            - p-value: probability of a t-stat at least this extreme if the true coefficient were 0. If < 0.05, the term is conventionally treated as statistically significant
95% Conf.Int     - lower and upper bounds of the 95% confidence interval for the coefficient

In [None]:
Omnibus          - D'Agostino's test: a combined statistical test for skewness and kurtosis of the residuals
Prob(Omnibus)    - p-value of the test above
Skew             - measure of the asymmetry of the residuals around their mean
Kurtosis         - measure of how heavy the tails of the distribution are
Durbin-Watson    - test for autocorrelation of the residuals (important in time series)
Jarque-Bera      - a different test of skewness and kurtosis
Prob (JB)        - p-value of the Jarque-Bera test
Cond.No          - diagnostic for multicollinearity (predictors that are related to each other)

In [None]:
Log Likelihood
  - raw log-likelihoods are not a fair basis for comparing models with different numbers of parameters (better to use AIC or BIC)
  - The likelihood is the probability density of the observed data given a set of parameter estimates.
  - It is calculated by 
    - taking a set of parameter estimates
    - calculating the probability density of each observation under those estimates
    - multiplying the probability densities for all the observations together 
    - >> this follows from probability theory, in that P(A and B) = P(A)P(B) if A and B are independent
  - In practice, what this means for linear regression:
    - you take a set of parameter estimates (beta, sd)
    - plug them into the normal pdf
    - calculate the density for each observation y at that set of parameter estimates
    - multiply them all together. 
  - Typically, we choose to work with the log-likelihood because 
    - the product becomes a sum (log(a*b) = log(a) + log(b)), which is easier to compute and avoids numerical underflow when many small densities are multiplied. 

Log likelihood is used for almost everything. 
  - It is the basic quantity that we use to find parameter estimates (Maximum Likelihood Estimates) for a huge suite of models. 
  - For simple linear regression, these estimates turn out to be the same as those for least squares, but for more complicated models least squares may not work.

AIC
  - Lower value of AIC suggests "better" model, but it is a relative measure of model fit 
  - It is used for model selection (only), i.e. it lets you compare different models estimated on the same dataset
  - Lower indicates a better trade-off between goodness of fit and parsimony, relative to a model fit with a higher AIC.
  - Asymptotically, model selection conducted with the AIC is equivalent to leave-one-out cross validation 
  - Don't compare too many models with the AIC (as with p-values), because the lowest AIC does not mean that it is the most appropriate model
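
The recipe for linear regression can be sketched in a few lines of NumPy. It reuses the A, B, C data of the statsmodels example and assumes normally distributed errors, with the error variance set to its maximum likelihood estimate (residual sum of squares / n). The AIC and BIC lines use the standard definitions 2k - 2·logL and k·log(n) - 2·logL, where k counts the regression coefficients including the constant.

```python
import numpy as np

A = np.array([10, 20, 30, 40, 50], dtype=float)
B = np.array([20, 30, 10, 40, 50], dtype=float)
C = np.array([32, 234, 23, 23, 42523], dtype=float)

X = np.column_stack([np.ones_like(A), B, C])
beta, *_ = np.linalg.lstsq(X, A, rcond=None)
residuals = A - X @ beta
n, k = X.shape

# Maximum likelihood estimate of the error variance.
sigma2 = np.sum(residuals ** 2) / n

# Log of the normal pdf for each observation, then summed
# (summing logs replaces multiplying densities).
log_density = -0.5 * np.log(2 * np.pi * sigma2) - residuals ** 2 / (2 * sigma2)
log_likelihood = log_density.sum()

aic = 2 * k - 2 * log_likelihood
bic = k * np.log(n) - 2 * log_likelihood

print(f"Log-Likelihood: {log_likelihood:.3f}")
print(f"AIC: {aic:.2f}")
print(f"BIC: {bic:.2f}")
```

The three printed numbers should match the Log-Likelihood, AIC and BIC entries of the summary table. Note that with only 5 observations log(5) is below 2, so here the BIC comes out lower than the AIC.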

In [None]:
Omnibus          
  - We want something close to zero, which indicates normality of the residuals
        
Prob(Omnibus)    
  - p-value of a statistical test whose null hypothesis is that the residuals are normally distributed
  - we want something close to 1

In [None]:
Skew
  - closer to 0 means a more symmetric residual distribution
  - If skewness is less than −1 or greater than +1, the distribution is highly skewed.
  - If skewness is between −1 and −½ or between +½ and +1, the distribution is moderately skewed.
  - If skewness is between −½ and +½, the distribution is approximately symmetric.
    
Kurtosis
  - a uniform distribution has a kurtosis of 1.8 (excess -1.2); the lowest possible kurtosis is 1, reached by a discrete distribution with 2 equally likely outcomes
  - a normal distribution has a kurtosis of 3 (excess 0)
  - a logistic distribution has a kurtosis of 4.2 (excess 1.2)
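  - heavy-tailed distributions such as a Student's t distribution with few degrees of freedom have very high kurtosis (infinite or undefined for 4 degrees of freedom or fewer)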
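
A short sketch of how skewness and (non-excess) kurtosis are computed from the central moments of a sample. The helper names are illustrative; the two test samples are deterministic stand-ins for the two-outcome and the uniform distributions.

```python
import numpy as np

def skewness(x):
    """Third central moment divided by the variance to the power 3/2."""
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    return np.mean(d ** 3) / np.mean(d ** 2) ** 1.5

def kurtosis(x):
    """Fourth central moment divided by the squared variance (normal = 3)."""
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    return np.mean(d ** 4) / np.mean(d ** 2) ** 2

two_outcomes = [-1.0, 1.0]                    # 2 equally likely outcomes
uniform_grid = np.linspace(0.0, 1.0, 100001)  # fine grid approximating a uniform distribution

print("two outcomes kurtosis:", kurtosis(two_outcomes))
print("uniform kurtosis:     ", kurtosis(uniform_grid))
print("uniform skewness:     ", skewness(uniform_grid))
```

Subtracting 3 from the kurtosis gives the excess kurtosis quoted in parentheses in the list.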

In [None]:
Durbin-Watson
  - Value between zero and 4.0
  - we hope to get a value close to 2; a common rule of thumb treats roughly 1.5 to 2.5 as acceptable
  - A value of 2.0 means there is no autocorrelation detected in the sample. 
  - values from zero to 2.0 indicate positive autocorrelation
  - values from 2.0 to 4.0 indicate negative autocorrelation.
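
The statistic itself is simple: the sum of squared differences between consecutive residuals, divided by the sum of squared residuals. A minimal NumPy sketch, with two made-up residual series chosen to show both ends of the scale:

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences / sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Long runs of the same sign: positive autocorrelation, value below 2.
dw_positive = durbin_watson([1, 1, 1, 1, -1, -1, -1, -1])

# Sign flips at every step: negative autocorrelation, value above 2.
dw_negative = durbin_watson([1, -1, 1, -1, 1, -1, 1, -1])

print(dw_positive, dw_negative)
```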

In [None]:
Condition Number 
  - This number measures how sensitive the output of a function is to small changes in its input.
  - When we have multicollinearity, small changes in the data can cause much larger fluctuations in the estimated coefficients
  - We want a relatively small number (something below 30)
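
A sketch of the effect, using `np.linalg.cond` (ratio of the largest to the smallest singular value) on two hand-made design matrices. The columns are synthetic: in the first matrix the predictors are orthogonal, in the second one predictor is almost a copy of the other.

```python
import numpy as np

t = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
intercept = np.ones_like(t)
x1 = np.cos(t)

# Predictors that carry independent information.
X_orthogonal = np.column_stack([intercept, x1, np.sin(t)])

# Second predictor is x1 plus a tiny perturbation: near-perfect multicollinearity.
X_collinear = np.column_stack([intercept, x1, x1 + 1e-3 * np.sin(t)])

cond_orthogonal = np.linalg.cond(X_orthogonal)
cond_collinear = np.linalg.cond(X_collinear)

print("orthogonal predictors:", cond_orthogonal)
print("collinear predictors: ", cond_collinear)
```

The first value stays far below the threshold of 30, while the second is several orders of magnitude larger. Predictors on very different scales can also inflate the number, so a large value is a prompt to investigate rather than proof of multicollinearity.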

# Train Accuracy vs Test Accuracy vs Confusion matrix

In [None]:
Accuracy: 
  - The number of correct classifications / the total number of classifications.
  - The train accuracy is the accuracy of a model on the examples it was trained on.
  - The test accuracy is the accuracy of a model on examples it hasn't seen.

Confusion matrix: 
  - The accuracy can be derived from the confusion matrix: sum of the diagonal / sum of all entries 
  - A tabulation of the actual class (as rows, by the convention used here) against the predicted class (as columns).

# Confusion Matrix

In [None]:
Confusion matrix 
  - matrix (table) that can be used to measure the performance of a machine learning algorithm, usually a supervised learning one

By convention here
  - row = instances of an actual class 
  - column = instances of a predicted class

In [None]:
# A 2-class confusion matrix looks like this
-----------------------------------------------------------------
                    Predicted Negative      Predicted Positive
Actual Negative       True Negative            False Positive
Actual Positive       False Negative           True Positive
-----------------------------------------------------------------

Accuracy = (TN + TP) / (Total)
Precision = TP / (FP + TP)
Recall = TP / (FN + TP)

### Implementation in Numpy

In [4]:
import numpy as np

cm = np.array(
[[5825,    1,   49,   23,    7,   46,   30,   12,   21,   26],
 [   1, 6654,   48,   25,   10,   32,   19,   62,  111,   10],
 [   2,   20, 5561,   69,   13,   10,    2,   45,   18,    2],
 [   6,   26,   99, 5786,    5,  111,    1,   41,  110,   79],
 [   4,   10,   43,    6, 5533,   32,   11,   53,   34,   79],
 [   3,    1,    2,   56,    0, 4954,   23,    0,   12,    5],
 [  31,    4,   42,   22,   45,  103, 5806,    3,   34,    3],
 [   0,    4,   30,   29,    5,    6,    0, 5817,    2,   28],
 [  35,    6,   63,   58,    8,   59,   26,   13, 5394,   24],
 [  16,   16,   21,   57,  216,   68,    0,  219,  115, 5693]])

In [5]:
def precision(label, confusion_matrix):
    col = confusion_matrix[:, label]
    return confusion_matrix[label, label] / col.sum()
    
def recall(label, confusion_matrix):
    row = confusion_matrix[label, :]
    return confusion_matrix[label, label] / row.sum()

def precision_macro_average(confusion_matrix):
    rows, columns = confusion_matrix.shape
    sum_of_precisions = 0
    for label in range(rows):
        sum_of_precisions += precision(label, confusion_matrix)
    return sum_of_precisions / rows

def recall_macro_average(confusion_matrix):
    rows, columns = confusion_matrix.shape
    sum_of_recalls = 0
    for label in range(columns):
        sum_of_recalls += recall(label, confusion_matrix)
    return sum_of_recalls / columns

In [6]:
print("label precision recall")
for label in range(10):
    print(f"{label:5d} {precision(label, cm):9.3f} {recall(label, cm):6.3f}")

label precision recall
    0     0.983  0.964
    1     0.987  0.954
    2     0.933  0.968
    3     0.944  0.924
    4     0.947  0.953
    5     0.914  0.980
    6     0.981  0.953
    7     0.928  0.982
    8     0.922  0.949
    9     0.957  0.887


In [8]:
print("precision total:", precision_macro_average(cm))
print("recall total:", recall_macro_average(cm))

precision total: 0.9496885564052286
recall total: 0.9514531547877969


In [9]:
def accuracy(confusion_matrix):
    diagonal_sum = confusion_matrix.trace()
    sum_of_all_elements = confusion_matrix.sum()
    return diagonal_sum / sum_of_all_elements 

### Implementation in Scikit-Learn

In [10]:
from sklearn.metrics import confusion_matrix

y_actu = [2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2]
y_pred = [0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2]

confusion_matrix(y_actu, y_pred)

array([[3, 0, 0],
       [0, 1, 2],
       [2, 1, 3]], dtype=int64)

In [12]:
from sklearn.metrics import accuracy_score
accuracy_score(y_actu, y_pred)


0.5833333333333334
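
The macro-averaged precision and recall from the NumPy section have scikit-learn counterparts as well. The sketch below reuses the same `y_actu` and `y_pred` and checks the library values against the ones derived from the confusion matrix (rows = actual, columns = predicted).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_actu = [2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2]
y_pred = [0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2]

cm = confusion_matrix(y_actu, y_pred)

# Library versions: unweighted mean of the per-class scores.
macro_precision = precision_score(y_actu, y_pred, average="macro")
macro_recall = recall_score(y_actu, y_pred, average="macro")

# Same quantities derived from the matrix:
# precision divides the diagonal by column sums, recall by row sums.
precision_from_cm = np.mean(np.diag(cm) / cm.sum(axis=0))
recall_from_cm = np.mean(np.diag(cm) / cm.sum(axis=1))

print("macro precision:", macro_precision, precision_from_cm)
print("macro recall:   ", macro_recall, recall_from_cm)
```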

# Models Adequacy

In [None]:
# Good Article
https://www.hackernoon.com/choosing-the-right-machine-learning-algorithm-68126944ce1f
    
# Good Cheatsheet 
http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html


## Classification

In [None]:
Classification is the task of predicting the type or class of an object within a finite number of options. 
The output variable for classification is always a categorical variable. 

- K-Nearest neighbors algorithm – simple, but computationally expensive at prediction time, because each query is compared against the stored training points.
- Naive Bayes – Based on Bayes theorem.
- Logistic Regression – Linear model for binary classification.
- SVM – can be used for binary/multiclass classifications.
- Decision Tree – ‘If Else’ based classifier, more robust to outliers.
- Ensembles – Combination of multiple machine learning models to get better results than any single one.

## Regression

In [None]:
Regression is a set of problems where the output variable can take continuous values

- Linear Regression – Simplest baseline model for regression tasks; works well only when the relationship between the predictors and the target is approximately linear and little or no multicollinearity is present.
- Lasso Regression – Linear regression with L1 regularization.
- Ridge Regression – Linear regression with L2 regularization.
- SVM regression
- Decision Tree Regression etc.
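
The practical difference between the two penalties can be seen on a small example. This is a sketch assuming scikit-learn; the data is synthetic (only the first of five features influences the target) and the `alpha` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 5 features, but only feature 0 drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.5).fit(X, y)   # L2 penalty

print("Lasso coefficients:", lasso.coef_)
print("Ridge coefficients:", ridge.coef_)
```

The L1 penalty sets the coefficients of the irrelevant features exactly to zero, which is why Lasso is also used for feature selection. The L2 penalty only shrinks coefficients towards zero without removing them.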

## Clustering

In [None]:
Clustering is the task of grouping similar objects together. 
Machine learning models help to identify similar objects automatically without manual intervention. 
We cannot build effective supervised machine learning models (models that need to be trained with manually curated or labeled data) without homogeneous data, and clustering is one way to find homogeneous groups.

- K means – Simple, but the result varies with the random initialization of the centroids.
- K means++ – K means with a smarter initialization of the centroids.
- K medoids.
- Agglomerative clustering – A hierarchical clustering model.
- DBSCAN – Density-based clustering algorithm etc.

## Dimensionality Reduction

In [None]:
Dimensionality is the number of predictor variables used to predict the dependent variable or target.
Often in real world datasets the number of variables is too high. 
Too many variables also increase the risk of overfitting.
In practice among these large numbers of variables, not all variables contribute equally towards the goal.
In a large number of cases, we can actually preserve most of the variance with fewer variables.

- PCA – It creates a smaller number of new variables out of a large number of predictors. The new variables are uncorrelated with each other but less interpretable.
- t-SNE – Provides a lower-dimensional embedding of higher-dimensional data points, mainly used for visualization.
- SVD – Singular value decomposition factors a matrix into simpler component matrices, which allows more efficient computation.
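
PCA and SVD are closely linked: the principal components of a dataset can be obtained from the SVD of the centered data matrix. A minimal NumPy sketch on synthetic data (3 features that mostly vary along a single direction; all names and numbers are illustrative):

```python
import numpy as np

# Synthetic data: 200 points spread along the direction (1, 2, 3) plus small noise.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
direction = np.array([1.0, 2.0, 3.0])
X = np.outer(t, direction) + rng.normal(scale=0.05, size=(200, 3))

# PCA via SVD: center the data, then decompose.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Squared singular values are proportional to the variance along each component.
explained_variance_ratio = S ** 2 / np.sum(S ** 2)

# Keep only the first principal component: 3 variables become 1.
X_reduced = X_centered @ Vt[:1].T

print("explained variance ratio:", explained_variance_ratio)
print("reduced shape:", X_reduced.shape)
```

Here a single new variable preserves almost all of the variance of the three original ones, at the cost of being a blend of them rather than a directly interpretable quantity.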

## Deep Learning

In [None]:
Deep learning is a subset of machine learning which deals with neural networks. 
Based on the architecture of the neural network, the important deep learning models are:
- Multi-Layer perceptron
- Convolutional Neural Networks
- Recurrent Neural Networks
- Boltzmann machine
- Autoencoders etc.

# Cross Validation

In [None]:
There is always a need to validate the stability of a machine learning model and to get some kind of assurance that:
  - the model has got most of the patterns from the data correct
  - the model is not picking up too much on the noise
  - the model is low on bias and variance.

In [None]:
Validation
  - process of deciding whether the numerical results quantifying hypothesized relationships between variables, are acceptable as descriptions of the data.

Residuals
  - after training, the model's error is estimated by evaluating the residuals 
  - this gives a numerical estimate of the difference between predicted and original responses, also called the training error. 
  - However, this only gives us an idea about how well our model does on the data used to train it. 
  - It is possible that the model is underfitting or overfitting the data. 

Cross Validation:
  - Purpose: get an indication of how well the learner will generalize to an independent / unseen data set
  - How: discussed below

In [None]:
Model Bias / Variance

Bias
  - In an ideal scenario, the errors measured across the validation runs should sum up to zero. 
  - To estimate the model’s bias, we take the average of all the errors. 
  - The lower the average value, better the model.

Variance
  - Similarly, to estimate the model variance, we take the standard deviation of all the errors. 
  - A low value of standard deviation suggests our model does not vary a lot with different subsets of training data.

We should focus on achieving a balance between bias and variance. 
  - This can be done by reducing the variance and controlling bias to an extent.
  - This will result in a better predictive model.
  - This trade-off usually leads to building less complex predictive models as well. 

## K-Fold Cross Validation

In [None]:
The Problem
  - As there is never enough data to train your model, removing a part of it for validation poses a problem of underfitting. 
  - By reducing the training data, we risk losing important patterns/ trends in data set, which in turn increases error induced by bias. 

The Solution
  - What we require is a method that provides ample data for training the model and also leaves ample data for validation. 
  - K-Fold cross validation does exactly that.

K Fold cross validation
  - the data is divided into k subsets. 
  - the holdout method is repeated k times, such that each time:
      - one of the k subsets is used as the test set / validation set
      - the other k-1 subsets are put together to form a training set. 
  - The error estimation is averaged over all k trials to get total effectiveness of our model. 
  - As can be seen, every data point gets to be in a validation set exactly once, and gets to be in a training set k-1 times. 
  - This significantly reduces
      - bias, as we are using most of the data for fitting
      - variance, as most of the data is also being used in the validation set. 
  - Interchanging the training and test sets also adds to the effectiveness of this method. 
  - As a general rule backed by empirical evidence, K = 5 or 10 is generally preferred, but nothing’s fixed and it can take any value.
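
The splitting procedure described above can be sketched without any library support. `k_fold_indices` is a hypothetical helper written for this note, not a scikit-learn function; it shuffles the indices once and then rotates the validation fold.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)       # k subsets of (almost) equal size
    for i in range(k):
        val_idx = folds[i]                   # one subset for validation
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx             # the other k-1 subsets for training

# Count how often each of 23 data points lands in a validation set.
n_samples, k = 23, 5
validation_counts = np.zeros(n_samples, dtype=int)
for train_idx, val_idx in k_fold_indices(n_samples, k):
    validation_counts[val_idx] += 1
    print(f"train size = {len(train_idx)}, validation size = {len(val_idx)}")

print("validation count per point:", validation_counts)
```

Every point is counted exactly once in validation and therefore appears in the training set in the remaining k-1 rounds.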

In [None]:
Example of 5 Fold Cross Validation

Validation  XXXXXXXXXX  XXXXXXXXXX  XXXXXXXXXX  XXXXXXXXXX
XXXXXXXXXX  Validation  XXXXXXXXXX  XXXXXXXXXX  XXXXXXXXXX 
XXXXXXXXXX  XXXXXXXXXX  Validation  XXXXXXXXXX  XXXXXXXXXX 
XXXXXXXXXX  XXXXXXXXXX  XXXXXXXXXX  Validation  XXXXXXXXXX 
XXXXXXXXXX  XXXXXXXXXX  XXXXXXXXXX  XXXXXXXXXX  Validation

## Stratified K-Fold Cross Validation

In [None]:
In some cases, there may be a large imbalance in the response variables. 
  - For example, in a dataset of house prices, there might be a disproportionately large number of high-priced houses. 
  - Or in the case of classification, there might be several times more negative samples than positive samples. 

For such problems, a slight variation in the K-Fold cross validation technique is made:
  - Each fold contains approximately the same percentage of samples of each target class as the complete set
  - in the case of regression problems, the mean response value is approximately equal in all the folds. 

This variation is also known as Stratified K Fold.
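
A small sketch of the classification case, assuming scikit-learn. The target is made up: 90 negatives and 10 positives, so every validation fold of a 5-fold split should receive its fair share of the rare class.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced synthetic target: 90 negatives, 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)   # placeholder features, not used for stratification

skf = StratifiedKFold(n_splits=5)
positives_per_fold = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]
fold_sizes = [len(val_idx) for _, val_idx in skf.split(X, y)]

print("validation fold sizes:        ", fold_sizes)
print("positives per validation fold:", positives_per_fold)
```

With a plain, unshuffled K-Fold on the same data, all 10 positives would end up in the last fold and the other four validation sets would contain none.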

## Leave P-Out Cross Validation (exhaustive method)

In [None]:
Exhaustive methods compute all possible ways the data can be split into training and test sets.

In [None]:
Leave P-Out
  - Leaves p data points out of training data
  - Meaning: 
      - if there are n data points in the original sample then, n-p samples are used to train the model and p points are used as the validation set. 
      - This is repeated for all combinations in which original sample can be separated this way
      - Then the error is averaged for all trials, to give overall effectiveness.

This method is exhaustive in the sense that:
  - it needs to train and validate the model for all possible combinations
  - for moderately large p, it can become computationally infeasible.
    
A particular case of this method is when p = 1. 
  - This is known as Leave one out cross validation. 
  - This case is generally preferred over larger values of p because it is far less computationally intensive
  - The number of possible combinations is equal to the number of data points in the original sample, n.
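
The number of train/validation splits is the binomial coefficient "n choose p", which explains why the method becomes infeasible so quickly. A standard-library sketch; `leave_p_out_splits` is a hypothetical helper written for this note.

```python
from itertools import combinations
from math import comb

def leave_p_out_splits(n_samples, p):
    """Yield (train_idx, val_idx) for every way of holding out p points."""
    all_indices = set(range(n_samples))
    for val_idx in combinations(range(n_samples), p):
        train_idx = sorted(all_indices - set(val_idx))
        yield train_idx, list(val_idx)

# Enumerate the splits for a tiny sample: 5 points, 2 held out each time.
n_splits_small = sum(1 for _ in leave_p_out_splits(5, 2))
print("n=5,   p=2:", n_splits_small, "splits")

# For realistic sizes, only count the splits instead of enumerating them.
print("n=100, p=1:", comb(100, 1), "splits (leave one out)")
print("n=100, p=5:", comb(100, 5), "splits")
```

Each split means one full model fit, so the jump from p = 1 to p = 5 on only 100 points already turns 100 fits into tens of millions.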

## Python Implementation

In [None]:
                               
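# K-Fold
from sklearn import model_selection
from sklearn.linear_model import LinearRegression

# X is the feature set and y is the target
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=100)      # choose the number of folds (random_state requires shuffle=True)
model_kfold = LinearRegression()                                                # choose the model
results_kfold = model_selection.cross_val_score(model_kfold, X, y, cv=kfold)   # cross-validate on the modelling data, not on a held-out test set
print("Mean R^2: %.2f" % results_kfold.mean())                                  # the default score of a regressor is R^2, not accuracy
print(results_kfold)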

In [None]:
# Stratified k-fold cross validation

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, random_state=None)

# X is the feature set and y is the target (NumPy arrays)
for train_index, val_index in skf.split(X, y):
    print("Train:", train_index, "Validation:", val_index)
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]

In [None]:
# k-fold cross validation with repetition (if the train set does not adequately represent the entire population, stratified k-fold is not a good fit)
# In repeated cross-validation, the cross-validation procedure is repeated n times, yielding n random partitions of the original sample

from sklearn.model_selection import RepeatedKFold
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=None)

# X is the feature set and y is the target (NumPy arrays)
for train_index, val_index in rkf.split(X):
    print("Train:", train_index, "Validation:", val_index)
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]

In [14]:
# Using cross_val_score

from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_score

diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
lasso = linear_model.Lasso()
print(cross_val_score(lasso, X, y, cv=10))

[ 0.34557351  0.34848715  0.26654262 -0.01126674  0.24875619  0.08731544
  0.13386583  0.14000888  0.2873109   0.00960079]


# Optimization