# Introduction to Statistical Classification
* Machine learning method based on supervised learning.
* Categories are predefined (unlike clustering in unsupervised learning).
* For two categories, the problem is known as binary classification. 
* A classifier assigns the most probable class to a new observation, given the training set of examples.
* Bayes (density-based) classifiers vs. non-density-based classifiers.

## Curse of dimensionality

* When the dimensionality increases, the volume of the space increases so fast that the available data become sparse. 
* To obtain a statistically reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality.
* In classification, an enormous amount of training data is required to ensure that there are several samples with each combination of values.
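* The peaking paradox (Hughes phenomenon): for a fixed training set size, adding features improves performance only up to a point, after which it degrades.

![Peaking](pics/Peaking.png)

* Balancing feature size, training set size and the choice of the classifier is a basic problem in the design of classification systems.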
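
The sparsity described above can be illustrated with a small experiment. This is only a sketch on synthetic uniform data (the seed, sample size and dimensions are arbitrary assumptions): as the dimension grows, the nearest and the farthest neighbour of a query point end up at almost the same distance.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

def nearest_to_farthest_ratio(dim, n_points=500):
    """Ratio of the nearest to the farthest Euclidean distance from one query point."""
    points = rng.random((n_points, dim))  # uniform samples in the unit hypercube
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()

ratios = {dim: nearest_to_farthest_ratio(dim) for dim in (2, 10, 100, 1000)}
for dim, ratio in ratios.items():
    print(f"dim={dim:5d}  nearest/farthest distance = {ratio:.3f}")
```

A ratio close to 1 means that "near" and "far" points are hard to tell apart, which is one reason distance-based classifiers need far more data in high dimensions.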




## Overtraining 
* It happens when the system chosen for training is too complex for the given dataset.
* The trained system then adapts to the noise instead of to the class differences.
* The solution is to enlarge the dataset or, if that is not possible, to simplify the system.
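
A minimal sketch of the effect, using polynomial regression on a synthetic noisy sine instead of a classifier (the data, the noise level and the degrees are illustrative assumptions): the more complex model always fits the training points at least as well, while its error on fresh data can be compared in the printout.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)  # arbitrary seed for reproducibility

def noisy_sine(n):
    """Hypothetical toy data: a sine wave plus Gaussian noise."""
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

x_small, y_small = noisy_sine(15)   # small training set
x_fresh, y_fresh = noisy_sine(200)  # fresh data from the same process

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    model = Polynomial.fit(x_small, y_small, degree)  # least-squares fit
    results[degree] = (mse(model, x_small, y_small), mse(model, x_fresh, y_fresh))
    print(f"degree={degree}  train MSE={results[degree][0]:.3f}  "
          f"test MSE={results[degree][1]:.3f}")
```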

## Binary vs. Multiple
Wikipedia: Classification can be thought of as two separate problems: binary classification and multiclass classification. In binary classification, a better understood task, only two classes are involved, whereas multiclass classification involves assigning an object to one of several classes. Since many classification methods have been developed specifically for binary classification, multiclass classification often requires the combined use of multiple binary classifiers.
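
One common way of combining binary classifiers is the one-vs-rest scheme. The sketch below uses scikit-learn's `OneVsRestClassifier` on the bundled iris data (the dataset and the `max_iter` value are assumptions made only for the illustration): one binary logistic regression is trained per class.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X_iris, y_iris = load_iris(return_X_y=True)  # 3 classes

# one binary "this class vs. all others" model per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_iris, y_iris)

print("number of binary classifiers:", len(ovr.estimators_))
print("predictions for the first rows:", ovr.predict(X_iris[:3]))
```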

## Linear classifier
* Linear Discriminant Analysis (or Fisher's linear discriminant) (LDA): assumes Gaussian conditional density models.
* Naive Bayes classifier with multinomial or multivariate Bernoulli event models.
* Logistic regression: maximum likelihood estimation of $\vec{w}$, assuming that the observed training set was generated by a binomial model that depends on the output of the classifier.
* Perceptron: an algorithm that attempts to fix all errors encountered in the training set.
* Support vector machine: an algorithm that maximizes the margin between the decision hyperplane and the examples in the training set.


Non-linear alternatives: decision trees, random forests, neural networks.

## Binary classification metrics
* Precision, Accuracy, Sensitivity, Specificity
 
![Precisionrecall](pics/Precisionrecall.png)

## Precision, recall and ROC

* true positive  (TP)
* true negative  (TN)
* false positive (FP)
* false negative (FN)

* sensitivity, recall, hit rate, or true positive rate (TPR)
$$  {TPR} ={\frac { {TP} }{ {P} }}={\frac { {TP} }{ {TP} + {FN} }}=1- {FNR} $$

* specificity, selectivity or true negative rate (TNR)
$$ {TNR} ={\frac { {TN} }{ {N} }}={\frac { {TN} }{ {TN} + {FP} }}=1- {FPR} $$

* precision or positive predictive value (PPV)
$${  {PPV} ={\frac { {TP} }{ {TP} + {FP} }}=1- {FDR} }$$

* negative predictive value (NPV)
$${  {NPV} ={\frac { {TN} }{ {TN} + {FN} }}=1- {FOR} } $$

* false positive rate (FPR) (alpha - type I error)
$$ {  {FPR} ={\frac { {FP} }{ {N} }}={\frac { {FP} }{ {FP} + {TN} }}=1- {TNR} } $$

* false negative rate (FNR) (beta - type II error)
$$ {  {FNR} ={\frac { {FN} }{ {P} }}={\frac { {FN} }{ {FN} + {TP} }}=1- {TPR} }$$

* accuracy (ACC)
$$ {  {ACC} ={\frac { {TP} + {TN} }{ {P} + {N} }}={\frac { {TP} + {TN} }{ {TP} + {TN} + {FP} + {FN} }}} $$

* F1 score is the harmonic mean of precision and sensitivity
$${  {F} _{1}=2\cdot {\frac { {PPV} \cdot  {TPR} }{ {PPV} + {TPR} }}={\frac {2 {TP} }{2 {TP} + {FP} + {FN} }}} $$
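
The formulas above translate directly into code. A minimal sketch in plain Python; the function name `binary_metrics` and the example counts are made up for the illustration.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute the metrics defined above from the four confusion-matrix counts."""
    p = tp + fn  # all actual positives
    n = tn + fp  # all actual negatives
    return {
        "TPR": tp / p,                      # sensitivity / recall
        "TNR": tn / n,                      # specificity
        "PPV": tp / (tp + fp),              # precision
        "NPV": tn / (tn + fn),
        "FPR": fp / n,
        "FNR": fn / p,
        "ACC": (tp + tn) / (p + n),
        "F1": 2 * tp / (2 * tp + fp + fn),
    }

metrics = binary_metrics(tp=40, tn=30, fp=20, fn=10)  # hypothetical counts
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```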

![ROC_curves](pics/ROC_curves.png)

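Recall and precision generalize to multiclass problems class by class: assuming the rows of the confusion matrix are the true classes and the columns the predicted classes, the recall of a class is its diagonal entry divided by the sum over its row, and its precision is the diagonal entry divided by the sum over its column.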
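
A short NumPy sketch of the per-class computation. The 3-class confusion matrix is invented for the example, and rows are assumed to be true classes and columns predicted classes (the convention used by `sklearn.metrics.confusion_matrix`).

```python
import numpy as np

# hypothetical confusion matrix: rows = true class, columns = predicted class
cm_demo = np.array([[5, 1, 0],
                    [2, 6, 2],
                    [0, 1, 8]])

correct = np.diag(cm_demo)
recall_per_class = correct / cm_demo.sum(axis=1)     # divide by row sums
precision_per_class = correct / cm_demo.sum(axis=0)  # divide by column sums

print("recall:   ", np.round(recall_per_class, 3))
print("precision:", np.round(precision_per_class, 3))
```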

## Cross Validation
Covered in the last lesson; 10 can be replaced by any k.
* 10-fold cross-validation is very popular and has become a standard procedure in many papers and works.
* 10 times 10-fold cross-validation, useful for comparing classifiers as the 10 repeats shrink the standard deviations in the means of the estimated classifier errors.
* 10 times 2-fold cross-validation helps compute the significance of differences in classification error means.
* Leave-one-out (LOO) cross-validation. In this case the number of folds is equal to the number of objects in the design set. 
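
The repeated and leave-one-out variants listed above can be sketched with scikit-learn's splitters. The synthetic dataset, its size and the classifier settings are assumptions for the demonstration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (LeaveOneOut, RepeatedStratifiedKFold,
                                     cross_val_score)

# hypothetical toy data, small enough for leave-one-out to run quickly
X_cv, y_cv = make_classification(n_samples=60, n_features=5, random_state=0)
clf_cv = LogisticRegression(max_iter=1000)

# 10 times 10-fold cross-validation: 100 accuracy estimates
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores_10x10 = cross_val_score(clf_cv, X_cv, y_cv, cv=rskf, scoring="accuracy")

# leave-one-out: as many folds as there are objects
scores_loo = cross_val_score(clf_cv, X_cv, y_cv, cv=LeaveOneOut(), scoring="accuracy")

print("10x10-fold: %.3f +/- %.3f" % (scores_10x10.mean(), scores_10x10.std()))
print("LOO:        %.3f (%d folds)" % (scores_loo.mean(), len(scores_loo)))
```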

## Grid Search
* Helps us find the "best" hyperparameters of a given method by optimizing a loss function or an evaluation metric.
* Use pipelines and search as many combinations as time and resources allow.
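
A minimal sketch of a grid search over a pipeline with scikit-learn's `GridSearchCV`. The synthetic data, the step names `scaler` and `clf`, and the grid of `C` values are illustrative assumptions, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_gs, y_gs = make_classification(n_samples=200, n_features=8, random_state=0)

pipe_gs = Pipeline([("scaler", StandardScaler()),
                    ("clf", LogisticRegression(max_iter=1000))])

# parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe_gs, param_grid, cv=5, scoring="roc_auc")
search.fit(X_gs, y_gs)

print("best parameters:", search.best_params_)
print("best CV ROC AUC: %.3f" % search.best_score_)
```

Because the scaler sits inside the pipeline, it is refitted on the training part of every fold, so the validation folds do not leak into the preprocessing.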

Back to code ...




In [None]:
import numpy as np
import pandas as pd
import os
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

In [None]:
# load our dataset
data = 'data/attrition.csv'
df = pd.read_csv(data)

In [None]:
df.head()

In [None]:
df.describe(include=['object'])

In [None]:
df = df.sample(frac=1, random_state=42)
df.describe()

In [None]:
# let's drop the employee ID and other columns we won't use
df.drop(['EmployeeNumber', 'EmployeeCount','StandardHours','Over18'], axis=1, inplace=True)

In [None]:
# select response variable and features
target_col_name = 'Attrition'
num_feature_cols = [
        'Age', 'DailyRate','DistanceFromHome', 'Education',
        'HourlyRate', 'EnvironmentSatisfaction', 'JobInvolvement','JobLevel',
        'JobSatisfaction', 'NumCompaniesWorked', 'PercentSalaryHike',
        'RelationshipSatisfaction', 'StockOptionLevel', 'PerformanceRating',
        'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
        'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
        'YearsWithCurrManager', 'MonthlyRate']
cat_feature_cols = [x for x in df.columns if x not in num_feature_cols and x not in [target_col_name]]

In [None]:
sns.pairplot(df, hue="Attrition",vars=['Age','DailyRate','DistanceFromHome','YearsAtCompany','YearsSinceLastPromotion'],palette="hls")

In [None]:
plt.figure(figsize=(22,22))
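# correlations are only defined for numeric columns (recent pandas raises on string columns)
sns.heatmap(df.select_dtypes(include='number').corr(), cmap='Reds', linewidth=.5, annot=True)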
sns.set(font_scale=1)

In [None]:
# cast numerical columns as float
for col in num_feature_cols:
    df[col] = df[col].astype(float)

In [None]:
df_target = np.ravel(df[[target_col_name]])
df_features = df[num_feature_cols]
df_target

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(df_features, df_target, test_size=0.33, random_state=42)

In [None]:
# Use standard version of logistic regression
logmodel=LogisticRegression()
logmodel.fit(X_train, Y_train)

In [None]:
Y_pred = logmodel.predict(X_test)
target_names = ['No', 'Yes']
# pass target_names by keyword: the third positional parameter is `labels`
print(classification_report(Y_test, Y_pred, target_names=target_names))
# Note that in binary classification, 
# recall of the positive class is also known as “sensitivity”;
# recall of the negative class is “specificity”.

In [None]:
confusion_matrix(Y_test, Y_pred)


In [None]:
sns.heatmap(confusion_matrix(Y_test,Y_pred),annot=True,fmt='3.0f',cmap="winter")
plt.title('Confusion matrix', y=1, size=12)

In [None]:
#Example of R style analysis :-D
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

mod1 = smf.glm(formula='Attrition ~Age * DailyRate + DistanceFromHome + Education + HourlyRate', data=df, family=sm.families.Binomial()).fit()
mod1.summary()



In [None]:
#Creating K fold Cross-validation 
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
kf = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(logmodel, # model
                         X_train, # Feature matrix
                         Y_train, # Target vector
                         cv=kf, # Cross-validation technique
                         scoring="accuracy", # Evaluation metric
                         n_jobs=-1) # Use all CPU cores
print('10 fold CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))


In [None]:
# Support Vector Machines
from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train, Y_train)
Y_predSVC = svc.predict(X_test)
print(classification_report(Y_test, Y_predSVC, target_names=target_names))



# Neural Networks
* [ ] https://karpathy.github.io/neuralnets/
* [ ] https://playground.tensorflow.org/

### XOR using NN
Let's predict Exclusive OR using NN

```
+---+---+---------+
| A | B | A XOR B |
+---+---+---------+
| 1 | 1 | 0       |
+---+---+---------+
| 1 | 0 | 1       |
+---+---+---------+
| 0 | 1 | 1       |
+---+---+---------+
| 0 | 0 | 0       |
+---+---+---------+
```

It's a binary classification problem, and a supervised one. Our task is to create a neural network that will predict the values of one logical function given the inputs.

Let's assume a simple shallow architecture with one hidden layer
![NNet](pics/NNEt.png)

* `X1,X2` = data input
* `N1,N2,N3` = neurons
* `B1,B2` = bias
* `W..` = weights
* `b..` = bias weights
* `Y` = output

The bias node is "always on" -- in a NN it plays the role of the intercept and provides a way to get a non-zero activation on zero inputs. Without bias, the activation for all-zero features would always be zero, so a sigmoid neuron fed the input (0,0), for example, would always output 0.5.

We'll assume sigmoid neurons:
* they accept real values as inputs
* the value of activation is a dot product of weights and inputs (+ bias), i.e. `W*out_{prev} + b`
* output is a sigmoid function of its activation value
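
A single sigmoid neuron can be sketched as follows. The weights and the bias value are hypothetical numbers; the point is that on an all-zero input the output is pinned to 0.5 unless a bias is added.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def neuron_output(inputs, weights, bias=0.0):
    """Sigmoid of the activation, i.e. of the dot product of weights and inputs plus bias."""
    activation = np.dot(weights, inputs) + bias
    return sigmoid(activation)

demo_weights = np.array([0.8, -1.5])  # hypothetical weights
zero_input = np.zeros(2)

out_without_bias = neuron_output(zero_input, demo_weights)
out_with_bias = neuron_output(zero_input, demo_weights, bias=2.0)
print(out_without_bias, out_with_bias)
```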

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
def sigmoid(x):
    return 1/(1+np.exp(-x))

In [None]:
def sigmoid_derivative(x):
    """
    x is assumed to be the output of the sigmoid function, i.e. x = sigmoid(z)!
    """
    return x*(1-x)

In [None]:
x = np.arange(-10,10,0.01)

In [None]:
x

In [None]:
plt.plot(x, sigmoid(x), x, sigmoid_derivative(sigmoid(x)))
plt.show()
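
As a sanity check, the derivative formula can be compared with a numerical central difference. This is only a sketch: the grid of points and the step size `h` are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(s):
    # s is the sigmoid output, s = sigmoid(z)
    return s * (1 - s)

z_grid = np.linspace(-5, 5, 11)
h = 1e-5

numeric = (sigmoid(z_grid + h) - sigmoid(z_grid - h)) / (2 * h)  # central difference
analytic = sigmoid_derivative(sigmoid(z_grid))

max_abs_diff = np.max(np.abs(numeric - analytic))
print("largest difference:", max_abs_diff)
```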

#### Learning
Information is stored in the connections between the neurons -- the weights. A NN learns by updating its weights according to a learning algorithm that helps it converge to the expected output.

* Initialize the weights and biases randomly.
* Iterate over the data
    * Compute the predicted output
    * Compute the loss (distance from the data)
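    * `W(new) = W(old) - α ∆W`
    * `b(new) = b(old) - α ∆b`
* Repeat until the error is minimal

Here `∆W` and `∆b` are the gradients of the loss with respect to the weights and biases and `α` is the learning rate, so each update step moves in the direction of the negative gradient (gradient descent).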

There are two parts to the learning algorithm
* Forward pass (computation of the predicted output)
* Backward pass, aka backpropagation (change of the weights)
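
The update rule can be tried on a toy problem before applying it to a network. The sketch below minimizes the hypothetical one-dimensional function f(w) = (w - 3)^2, whose gradient is 2(w - 3); the function, the starting point and the step count are assumptions for the illustration.

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Repeat w(new) = w(old) - lr * gradient for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# gradient of f(w) = (w - 3)**2
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print("found minimum near w =", w_min)
```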

#### Data
Let's create our learning dataset

In [None]:
X = np.array([[0,0],[0,1],[1,0],[1,1]])
Y = np.array([[0],[1],[1],[0]])

In [None]:
X

In [None]:
Y

#### Initialization of the weights and biases
Let's sample the weights from a standard normal distribution and set the biases to zero

In [None]:
def init_params(n_X,n_H,n_Y):
    """
    n_X ... number of input layer neurons
    n_H ... number of hidden layer neurons
    n_Y ... number of output layer neurons
    """
    W1 = np.random.randn(n_X, n_H)
    b1 = np.zeros((1, n_H))
    W2 = np.random.randn(n_H, n_Y)
    b2 = np.zeros((1, n_Y))
    
    return W1,b1,W2,b2

In [None]:
(W1, b1, W2, b2) = init_params(2, 2, 1)
print("Initial hidden weights: ",end='')
print(*W1)
print("Initial hidden biases: ",end='')
print(*b1)
print("Initial output weights: ",end='')
print(*W2)
print("Initial output biases: ",end='')
print(*b2)

In [None]:
epochs = 10000
lr = 0.1

#### Forward propagation
Computing the predicted values

In [None]:
def forward_prop(X,W1,b1,W2,b2):
    hidden_layer_activation = np.dot(X,W1) + b1
    hidden_layer_output = sigmoid(hidden_layer_activation)

    output_layer_activation = np.dot(hidden_layer_output,W2) + b2 
    Y_hat = sigmoid(output_layer_activation)
    
    return hidden_layer_output, Y_hat

#### Error (loss) function
Let's use mean squared error. Typically used rather for regression problems, but we'll ignore that for now.

$E = \frac{1}{2}(Y - Y_{hat})^2$

We're going to need its derivative (w.r.t. $Y_{hat}$)

$\frac{\partial E}{\partial Y_{hat}} = -(Y - Y_{hat})$

In [None]:
def loss_function(Y,Y_hat):
    loss = 1/2*np.sum(np.power(Y-Y_hat,2))
    
    return loss

In [None]:
def loss_derivative(Y,Y_hat):
    d_loss = -(Y - Y_hat)
    
    return d_loss

#### Backpropagation
The goal of backpropagation is to change the weights in order to minimize the error/loss. Since the output is a function of the activation, and the activation is in turn a function of the weights, the chain rule gives

$\frac{\partial E}{\partial w21} = \frac{\partial E}{\partial Y_{hat}} * \frac{\partial Y_{hat}}{\partial net_{N3}} * \frac{\partial net_{N3}}{\partial w21}$


We already know the derivative of the error with respect to the prediction. The second term is the derivative of the sigmoid

$\frac{\partial Y_{hat}}{\partial net_{N3}} = Y_{hat} * (1 - Y_{hat})$

and the last term is the derivative of the activation with respect to the weight, i.e. the output of the hidden layer.

$\frac{\partial net_{N3}}{\partial w21} = out_{N1}$

The same logic applies to the other weights.
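
The chain-rule expression can be verified numerically. The sketch below builds a small 2-2-1 sigmoid network with random weights and compares the analytic gradient with respect to the output weights against finite differences. All names ending in `_chk` are hypothetical helpers introduced only for this check.

```python
import numpy as np

rng_chk = np.random.default_rng(0)

def sigmoid_chk(z):
    return 1 / (1 + np.exp(-z))

def forward_chk(X, W1, b1, W2, b2):
    H = sigmoid_chk(X @ W1 + b1)           # hidden layer output
    return H, sigmoid_chk(H @ W2 + b2)     # prediction

def loss_chk(Y, Y_hat):
    return 0.5 * np.sum((Y - Y_hat) ** 2)

X_chk = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y_chk = np.array([[0], [1], [1], [0]], dtype=float)
W1_chk, b1_chk = rng_chk.normal(size=(2, 2)), np.zeros((1, 2))
W2_chk, b2_chk = rng_chk.normal(size=(2, 1)), np.zeros((1, 1))

# analytic gradient: dE/dY_hat * dY_hat/dnet * dnet/dW2
H, Y_hat_chk = forward_chk(X_chk, W1_chk, b1_chk, W2_chk, b2_chk)
delta_out = -(Y_chk - Y_hat_chk) * Y_hat_chk * (1 - Y_hat_chk)
grad_analytic = H.T @ delta_out

# numerical gradient: perturb one weight at a time
eps = 1e-6
grad_numeric = np.zeros_like(W2_chk)
for idx in np.ndindex(*W2_chk.shape):
    W_plus, W_minus = W2_chk.copy(), W2_chk.copy()
    W_plus[idx] += eps
    W_minus[idx] -= eps
    loss_plus = loss_chk(Y_chk, forward_chk(X_chk, W1_chk, b1_chk, W_plus, b2_chk)[1])
    loss_minus = loss_chk(Y_chk, forward_chk(X_chk, W1_chk, b1_chk, W_minus, b2_chk)[1])
    grad_numeric[idx] = (loss_plus - loss_minus) / (2 * eps)

print("analytic:", grad_analytic.ravel())
print("numeric: ", grad_numeric.ravel())
```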

In [None]:
def back_prop(d_loss, Y_hat, W2, hidden_layer_output):
    d_Y_hat = d_loss * sigmoid_derivative(Y_hat)
    error_hidden_layer = d_Y_hat.dot(W2.T)
    d_hidden_layer = error_hidden_layer * sigmoid_derivative(hidden_layer_output)
    
    return d_Y_hat, d_hidden_layer

In [None]:
#Training algorithm
for i in range(epochs):
    #Forward Propagation
    hidden_layer_output, Y_hat = forward_prop(X,W1,b1,W2,b2)

    # Compute loss and derivatives
    loss = loss_function(Y,Y_hat)
    d_loss = loss_derivative(Y,Y_hat)
    
    #Backpropagation
    d_Y_hat, d_hidden_layer = back_prop(d_loss, Y_hat, W2, hidden_layer_output)
    
    #Updating Weights and Biases (Gradient descent)
    W2 = W2 - hidden_layer_output.T.dot(d_Y_hat) * lr
    b2 = b2 - np.sum(d_Y_hat,axis=0,keepdims=True) * lr
    W1 = W1 - X.T.dot(d_hidden_layer) * lr
    b1 = b1 - np.sum(d_hidden_layer,axis=0,keepdims=True) * lr
    
    if(i%1000 == 0):
        print('Loss after iteration# {:d}: {:f}'.format(i, loss))

In [None]:
print("Final hidden weights: ",end='')
print(*W1)
print("Final hidden bias: ",end='')
print(*b1)
print("Final output weights: ",end='')
print(*W2)
print("Final output bias: ",end='')
print(*b2)

print("\nOutput from neural network after 10,000 epochs: ",end='')
print(*Y_hat)

## Alternatives & Improvements

### Loss functions
* The choice of loss affects how the weights are learned/updated
* Choose according to the problem
    * Regression: MSE, MSLE, MAE
    * Binary Classification: Binary Cross-Entropy, Hinge loss,...
    * Multi-class Classification: KL-divergence, Categorical Cross-Entropy,...
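
For reference, binary cross-entropy can be written in a few lines of NumPy. This is a sketch: the clipping constant `eps` and the example probabilities are assumptions made to keep the logarithm finite and the demonstration small.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1 - y_true) * np.log(1 - y_prob)))

y_bce = [1, 0, 1, 0]
bce_good = binary_cross_entropy(y_bce, [0.9, 0.1, 0.8, 0.2])  # mostly right
bce_bad = binary_cross_entropy(y_bce, [0.1, 0.9, 0.2, 0.8])   # mostly wrong
print("good predictions: %.3f, bad predictions: %.3f" % (bce_good, bce_bad))
```

Unlike the squared error, this loss grows without bound as a confident prediction turns out to be wrong.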
    
### Activation functions
* Linear function
    * combinations are linear, so poor approximation properties
    * unbounded range
    * constant gradient
* Sigmoids
    * output in (0,1)
    * non-linear, good for stacking layers
    * sensitive around zero
    * "vanishing gradients" -- the gradient is near zero for large |x|, so learning slows down
* Tanh
    * "Scaled Sigmoid", $2*sigmoid(2x)-1$
    * steeper gradient
* ReLU
    * $A(x) = max(x,0)$
    * does not fire all neurons at once as sigmoids do -- that's costly in big networks
    * combinations are non-linear -- good approximator
    * less computationally demanding
    * "dying ReLU" -- neurons stop responding because the gradient = 0 for negative values
    * "leaky ReLU" -- $y=0.01x,\, \mathrm{for}\, x<0$, the idea is to keep a small gradient so the neuron can recover
    
* Generally, ReLUs are the most used, but sigmoids can be better approximators (if more costly ones)
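
The activation functions listed above, sketched in NumPy. The helper names `relu` and `leaky_relu` and the sample points are chosen for the illustration; the leaky slope 0.01 follows the value in the list.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0)

def leaky_relu(x, slope=0.01):
    return np.where(x < 0, slope * x, x)

x_act = np.linspace(-3, 3, 7)

# tanh really is a scaled and shifted sigmoid
tanh_from_sigmoid = 2 * sigmoid(2 * x_act) - 1

print("x:         ", x_act)
print("sigmoid:   ", np.round(sigmoid(x_act), 3))
print("tanh:      ", np.round(np.tanh(x_act), 3))
print("relu:      ", relu(x_act))
print("leaky relu:", leaky_relu(x_act))
```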

### Learning step
* there are some alternatives to backprop, but mostly minor ones
* https://www.technologyreview.com/s/608911/is-ai-riding-a-one-trick-pony/
* many alternatives to plain gradient descent!
    * proximal gradient methods
    * evolutionary algorithms
    * adaptive optimizers (AdaGrad, Adam, ...)
    * stochastic gradient descent
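
As an example of one of these alternatives, here is a minimal sketch of mini-batch stochastic gradient descent on a synthetic linear regression problem. The data, the true weights, the learning rate and the batch size are all assumptions for the demonstration; each update uses the gradient from a small random batch instead of the full dataset.

```python
import numpy as np

rng_sgd = np.random.default_rng(0)

# hypothetical data: y = X w_true + small noise
n_samples, n_features = 200, 3
X_sgd = rng_sgd.normal(size=(n_samples, n_features))
w_true = np.array([1.5, -2.0, 0.5])
y_sgd = X_sgd @ w_true + rng_sgd.normal(scale=0.01, size=n_samples)

w_sgd = np.zeros(n_features)
lr_sgd, batch_size = 0.1, 20

for epoch in range(50):
    order = rng_sgd.permutation(n_samples)  # reshuffle the data every epoch
    for start in range(0, n_samples, batch_size):
        batch = order[start:start + batch_size]
        residual = X_sgd[batch] @ w_sgd - y_sgd[batch]
        grad = X_sgd[batch].T @ residual / len(batch)  # gradient of the mean of 1/2 squared error
        w_sgd = w_sgd - lr_sgd * grad

print("estimated weights:", np.round(w_sgd, 3))
```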
    
    
### Architecture
* many variants, mostly of deep networks
* CNNs (use convolution instead of general matrix multiplication in one or more layers)
* RNNs (use recurrence, e.g. long short-term memory -- good for time-series data, simulate lags)
* more engineering than math

## Keras
High-level framework for NNets.

In [None]:
from keras.models import Sequential
from keras.layers import Dense
import pandas as pd
import numpy as np

In [None]:
# load our dataset
data = 'data/attrition.csv'
df = pd.read_csv(data)

In [None]:
# let's drop EmployeeNumber and EmployeeCount
df.drop(['EmployeeNumber', 'EmployeeCount'], axis=1, inplace=True)

In [None]:
df.head()

In [None]:
# split to input and predicted data
Xk = df.loc[:,df.columns!='Attrition']
Yk = df.loc[:,'Attrition']

In [None]:
# we're going to reuse some tricks from the last lesson
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

In [None]:
df.columns

In [None]:
# let's label numerical columns
num_feature_cols = [ 'Age', 'DailyRate','DistanceFromHome', 'Education',
        'HourlyRate', 'EnvironmentSatisfaction', 'JobInvolvement', 'JobLevel',
        'JobSatisfaction', 'MonthlyIncome','NumCompaniesWorked', 'PercentSalaryHike',
        'RelationshipSatisfaction', 'StockOptionLevel', 'PerformanceRating', 'StandardHours',
        'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
        'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
        'YearsWithCurrManager', 'MonthlyRate']

In [None]:
# and categorical
cat_feature_cols = [x for x in Xk.columns if x not in num_feature_cols]

In [None]:
cat_feature_cols

In [None]:
# Pipelines!
num_transformer = Pipeline(steps=[
                  ('imputer', SimpleImputer(strategy='median')),
                  ('scaler', RobustScaler())])

cat_transformer = Pipeline(steps=[
                  ('imputer', SimpleImputer(strategy='most_frequent')),
                  ('onehot', OneHotEncoder(categories='auto', 
                                     sparse_output=False,  # this parameter is named `sparse` in scikit-learn < 1.2
                                     handle_unknown='ignore'))])

pipeline_preprocess = ColumnTransformer(transformers=[
        ('numerical_preprocessing', num_transformer, num_feature_cols),
        ('categorical_preprocessing', cat_transformer, cat_feature_cols)],
        remainder='passthrough')

pipe0 = Pipeline([("transform_inputs", pipeline_preprocess)])

In [None]:
# apply transformations
Xkk = pipe0.fit_transform(Xk)

In [None]:
# alternative way to get booleans
Ykk = Yk.str.get_dummies().iloc[:,1]

In [None]:
Xk_train, Xk_test, Yk_train, Yk_test = train_test_split(Xkk, Ykk, test_size=0.3)


In [None]:
model = Sequential()
#First Hidden Layer
model.add(Dense(4, activation='relu', kernel_initializer='random_normal', input_dim=Xk_train.shape[1]))
#Second  Hidden Layer
model.add(Dense(4, activation='relu', kernel_initializer='random_normal'))
#Output Layer
model.add(Dense(1, activation='sigmoid', kernel_initializer='random_normal'))

In [None]:
#Compiling the neural network
model.compile(optimizer ='adam',loss='binary_crossentropy', metrics =['accuracy'])

In [None]:
model.fit(Xk_train,Yk_train, batch_size=10, epochs=100)

In [None]:
[loss_train, accuracy_train] = model.evaluate(Xk_train, Yk_train)
print('Loss and accuracy on train set: {:f}, {:f}'.format(loss_train, accuracy_train))

In [None]:
# Let's predict on test set
Yk_pred = model.predict(Xk_test)
Yk_pred = (Yk_pred>0.5)

In [None]:
# Evaluate the model
cm = confusion_matrix(Yk_test, Yk_pred)
accuracy = accuracy_score(Yk_test, Yk_pred)
cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print('Confusion matrix, without normalization')
print(cm,'\n')
print('Normalized confusion matrix')
print(cm_norm,'\n')
print('Accuracy of classification {:f}'.format(accuracy))