# Supervised Learning with scikit-learn

## Exploration

### EDA (Exploratory Data Analysis)

In [None]:
df.boxplot('life', 'Region', rot=60)

### Correlation

In [None]:
sns.heatmap(df.corr(), square=True, cmap='RdYlGn')

## Pre-processing

### Dummy variables

In [None]:
df_region = pd.get_dummies(df, drop_first=True)
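
As a small illustration of what `drop_first=True` does, the sketch below encodes a made-up frame (the values and the name `toy` are assumptions, not the dataset used above). With three regions only two indicator columns remain, since the dropped category is implied when both are zero.

```python
import pandas as pd

# Hypothetical toy data; the real `df` is assumed to have a similar 'Region' column.
toy = pd.DataFrame({
    "life": [75.3, 58.3, 82.1],
    "Region": ["America", "Asia", "Europe"],
})

# Encode the categorical column and drop the first category ("America").
toy_dummies = pd.get_dummies(toy, drop_first=True)
print(toy_dummies.columns.tolist())
```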

### Missing values

Replacing the placeholder string 'missing' with real missing values (NaN):

In [None]:
df[df == 'missing'] = np.nan

Print the number of missing values per column:

In [None]:
print(df.isnull().sum())

Dropping missing values:

In [None]:
df = df.dropna()

#### Imputation

In [None]:
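from sklearn.impute import SimpleImputer

The available **SimpleImputer** strategies are "mean", "median", "most_frequent" and "constant". (Before scikit-learn 0.22 this class was `sklearn.preprocessing.Imputer`, which has since been removed.)

Replacing missing values with the most common value in each column:

In [None]:
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')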
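
To see the effect of the most-frequent strategy, here is a minimal sketch on a made-up matrix (`X_toy` is an assumption for illustration). It assumes a recent scikit-learn, where the imputer lives in `sklearn.impute` as `SimpleImputer`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy matrix with one missing entry per column.
X_toy = np.array([[1.0, 2.0],
                  [1.0, np.nan],
                  [np.nan, 2.0],
                  [3.0, 5.0]])

# Each NaN is replaced by the most frequent value of its column.
imp_demo = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
X_filled = imp_demo.fit_transform(X_toy)
print(X_filled)
```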

#### Centering and scaling

In [None]:
from sklearn.preprocessing import scale
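
A quick check of what `scale` does, on a made-up array (`X_toy` is an assumption): each column should end up with mean 0 and standard deviation 1, regardless of its original range.

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical features on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])

# Standardize each column: subtract its mean, divide by its standard deviation.
X_scaled = scale(X_toy)
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```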

## Classification

### k-Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier

Create a k-NN classifier with 6 neighbors, `knn`:

In [None]:
knn = KNeighborsClassifier(n_neighbors=6)
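
A classifier like this is typically fitted and then used for prediction. The sketch below does so on the bundled iris data, which is an assumption made for illustration and not the dataset of this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Load a small bundled dataset (assumed here purely as example data).
X_iris, y_iris = load_iris(return_X_y=True)

# Fit the 6-neighbor classifier and predict labels for the first five samples.
knn_demo = KNeighborsClassifier(n_neighbors=6)
knn_demo.fit(X_iris, y_iris)
pred = knn_demo.predict(X_iris[:5])
print(pred)
```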

### Logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

### Support Vector Machine

In [None]:
from sklearn.svm import SVC

## Regression

### Linear regression

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
reg = LinearRegression()

### Regularization I: Lasso

Lasso regression can double as feature selection, which is useful when you want a regression model with few variables. Lasso performs regularization by adding to the loss function a penalty term: the sum of the absolute values of the coefficients, multiplied by some alpha. The penalty tends to shrink the coefficients of less important features to exactly zero. This is also known as **L1** regularization because the regularization term is the L1 norm of the coefficients.
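
The feature-selection effect can be sketched on synthetic data. Everything below (the generated data, `alpha=1.0`, the `_demo` and `_syn` names) is an assumption chosen for illustration: only 3 of the 10 features carry signal, and the lasso is expected to zero out most of the others.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, of which only 3 influence the target.
X_syn, y_syn = make_regression(n_samples=200, n_features=10, n_informative=3,
                               noise=5.0, random_state=0)

# Fit a lasso; the L1 penalty drives uninformative coefficients to zero.
lasso_demo = Lasso(alpha=1.0)
lasso_demo.fit(X_syn, y_syn)

# Count the features the model actually kept.
n_selected = int(np.sum(lasso_demo.coef_ != 0))
print(n_selected, "of", X_syn.shape[1], "coefficients are non-zero")
```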

In [None]:
from sklearn.linear_model import Lasso

Instantiate a lasso regressor:

In [None]:
# The `normalize` argument was removed in scikit-learn 1.2; scale the features beforehand instead.
lasso = Lasso(alpha=0.4)

Fit the regressor and extract the coefficients:

In [None]:
lasso_coef = lasso.fit(X, y).coef_

### Regularization II: Ridge

Ridge regression instead adds to the loss function the sum of the squared coefficients multiplied by some alpha. This is known as **L2** regularization because the penalty is the squared L2 norm of the coefficients.
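
The two penalty terms can be computed by hand for comparison. The coefficient vector and alpha below are made-up values, used only to show the arithmetic.

```python
import numpy as np

coef = np.array([0.5, -2.0, 0.0, 3.0])  # hypothetical model coefficients
alpha = 0.4                             # hypothetical regularization strength

# Lasso (L1): alpha times the sum of absolute coefficient values.
l1_penalty = alpha * np.sum(np.abs(coef))

# Ridge (L2): alpha times the sum of squared coefficient values.
l2_penalty = alpha * np.sum(coef ** 2)

print(l1_penalty, l2_penalty)
```

Large coefficients dominate the squared penalty far more than the absolute one, which is one way to see why ridge shrinks big coefficients strongly but rarely sets any to exactly zero.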

In [None]:
from sklearn.linear_model import Ridge

In [None]:
# The `normalize` argument was removed in scikit-learn 1.2; scale the features beforehand instead.
ridge = Ridge()

### ElasticNet

In [None]:
from sklearn.linear_model import ElasticNet

## Training & Testing

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y)
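
The `stratify=y` argument keeps the class proportions the same in both splits. A minimal sketch with made-up imbalanced labels (all `_toy` names are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 zeros and 20 ones.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

# Both splits keep the 20% share of the positive class.
print(y_tr.mean(), y_te.mean())
```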

### Cross validation

In [None]:
from sklearn.model_selection import cross_val_score

Compute 5-fold cross-validation scores: cv_scores

In [None]:
cv_scores = cross_val_score(reg, X, y, cv=5)

Print the 5-fold cross-validation scores:

In [None]:
print(cv_scores)

print("Average 5-Fold CV Score: {}".format(np.mean(cv_scores)))
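
For a version that runs end to end, the sketch below applies the same calls to the bundled diabetes dataset (an assumption for illustration; the notebook's own `X` and `y` are not shown). For a regressor, the default score is R².

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Bundled regression dataset, used here only as stand-in data.
X_d, y_d = load_diabetes(return_X_y=True)

# One R^2 score per fold.
scores = cross_val_score(LinearRegression(), X_d, y_d, cv=5)
print(scores)
print("Average 5-Fold CV Score: {}".format(np.mean(scores)))
```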

## Accuracy

### Metrics for regression

In [None]:
reg.score(X_test, y_test)  # for regressors, score() returns R^2

#### Root mean squared error

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
np.sqrt(mean_squared_error(y_test, y_pred))
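
As a sanity check, the RMSE can also be computed by hand. The target and prediction values below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true values and predictions.
y_true_toy = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_toy = np.array([2.5, 0.0, 2.0, 8.0])

# RMSE via scikit-learn and via the definition: sqrt(mean(squared errors)).
rmse = np.sqrt(mean_squared_error(y_true_toy, y_pred_toy))
rmse_manual = np.sqrt(np.mean((y_true_toy - y_pred_toy) ** 2))
print(rmse, rmse_manual)
```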

### Metrics for classification

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
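
To read the confusion matrix, it helps to try it on a handful of made-up labels (the `_toy` lists are assumptions). Rows are true classes and columns are predicted classes, so for binary labels the layout is `[[TN, FP], [FN, TP]]`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted binary labels.
y_true_toy = [0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 1, 1, 1, 0, 0]

# 2 true negatives, 1 false positive, 1 false negative, 2 true positives.
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)
print(classification_report(y_true_toy, y_pred_toy))
```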

#### ROC

In [None]:
from sklearn.metrics import roc_curve

Compute predicted probabilities: y_pred_prob

In [None]:
y_pred_prob = logreg.predict_proba(X_test)[:,1]

Generate ROC curve values: fpr, tpr, thresholds

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

#### AUC computation

In [None]:
from sklearn.metrics import roc_auc_score

In [None]:
roc_auc_score(y_test, y_pred_prob)

In [None]:
cv_auc = cross_val_score(logreg, X, y, cv=5, scoring='roc_auc')
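
Put together, the ROC and AUC steps might look like the following sketch. The synthetic data, `max_iter=1000`, and all `_demo` names are assumptions made so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data as a stand-in.
X_c, y_c = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_c, y_c, test_size=0.2, random_state=42, stratify=y_c)

# Fit the model and take the predicted probability of the positive class.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# ROC curve points and the area under that curve.
fpr_demo, tpr_demo, thr_demo = roc_curve(y_te, proba)
auc_demo = roc_auc_score(y_te, proba)
print(auc_demo)
```

Plotting `tpr_demo` against `fpr_demo` gives the ROC curve; an AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking.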

## Hyperparameter tuning

### Whole grid

In [None]:
from sklearn.model_selection import GridSearchCV

Logistic regression example:

In [None]:
logreg_cv = GridSearchCV(logreg, param_grid, cv=5) 

### Randomized

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
tree = DecisionTreeClassifier()
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)

### Best scoring model

In [None]:
logreg_cv.best_params_
logreg_cv.best_score_
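
The `best_params_` and `best_score_` attributes only exist after the search has been fitted. A minimal end-to-end sketch with a randomized search over a decision tree follows; the digits data, `n_iter=5`, and the `random_state` values are assumptions made for a quick, repeatable run.

```python
from scipy.stats import randint
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Bundled dataset with 64 features, used here only as stand-in data.
X_dg, y_dg = load_digits(return_X_y=True)

# Distributions and lists to sample hyperparameters from.
dist = {"max_depth": [3, None],
        "max_features": randint(1, 9),
        "min_samples_leaf": randint(1, 9),
        "criterion": ["gini", "entropy"]}

# Try 5 random combinations, each scored with 5-fold cross-validation.
search = RandomizedSearchCV(DecisionTreeClassifier(random_state=0), dist,
                            n_iter=5, cv=5, random_state=0)
search.fit(X_dg, y_dg)

print(search.best_params_)
print(search.best_score_)
```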

#### Model-specific hyperparameter grids

Logistic regression:

In [None]:
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

Decision trees:

In [None]:
from scipy.stats import randint

param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

## Pipelines

In [None]:
from sklearn.pipeline import Pipeline

Scaling in a pipeline:

In [None]:
from sklearn.preprocessing import StandardScaler

Setting up pipeline steps:

In [None]:
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer

steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
         ('scaler', StandardScaler()),
         ('SVM', SVC())]

Create the pipeline: pipeline

In [None]:
pipeline = Pipeline(steps)

Fit the pipeline to the train set:

In [None]:
pipeline.fit(X_train, y_train)

Predict the labels of the test set:

In [None]:
y_pred = pipeline.predict(X_test)

### Hyperparameters in a pipeline

Specify the hyperparameter space using the following notation: 'step_name__parameter_name'. Here, the step_name is SVM, and the parameter_names are C and gamma.

In [None]:
parameters = {'SVM__C':[1, 10, 100],
              'SVM__gamma':[0.1, 0.01]}
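
A grid in this notation is passed to a search object together with the pipeline. The sketch below assumes `GridSearchCV`, the bundled iris data, and a shortened two-step pipeline, none of which are prescribed by the text above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data with no missing values, so the imputation step is omitted.
X_i, y_i = load_iris(return_X_y=True)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("SVM", SVC())])

# 'SVM__C' routes the values to the C parameter of the step named 'SVM'.
grid = {"SVM__C": [1, 10, 100],
        "SVM__gamma": [0.1, 0.01]}

cv_search = GridSearchCV(pipe, grid, cv=5)
cv_search.fit(X_i, y_i)
print(cv_search.best_params_)
```

Because scaling happens inside the pipeline, it is refitted on the training folds of every split, which avoids leaking information from the validation fold.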

## Datasets

In [None]:
from sklearn import datasets

Digits dataset:

In [None]:
digits = datasets.load_digits()
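
The returned object bundles the flattened features, the original 8x8 images, and the labels. A short sketch of how to inspect it (the name `digits_demo` is just for this example):

```python
from sklearn import datasets

digits_demo = datasets.load_digits()

# 1797 samples; each image is 8x8 pixels, flattened to 64 features in `data`.
print(digits_demo.data.shape)    # (1797, 64)
print(digits_demo.images.shape)  # (1797, 8, 8)
print(digits_demo.target[:10])   # labels are the digits 0 through 9
```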