# Basic Machine Learning with Scikit-Learn

In this notebook we will look at the basics of building ML models with Scikit-Learn. Scikit-Learn is the most widely used Python library for ML, especially outside of deep learning (where there are several contenders; I recommend Keras, a package that provides a simple API on top of underlying frameworks such as TensorFlow and PyTorch).

## Some Terminology

*Machine Learning* can be considered the process of building models from data, either to gain insight into the data or to make predictions from the data. There are two main categories:

- *supervised learning*, in which the data is labelled with an outcome, and the aim is to predict outcomes for new data; when the outcome is a category this is called *classification*, and when the outcome is a continuous quantity it is called *regression*;
- *unsupervised learning*, in which the data is analyzed for underlying patterns to gain insight; common examples are *clustering* (finding similar cases) and *dimensionality reduction* (reducing the number of variables needed to represent the data, essentially a form of lossy compression).

The data used in ML is typically tabular. The columns are called *features*. For supervised learning, we call the output the *label*. We'll often refer to the vector of features as X and the output (label) as y, and say that we're trying to find a function f such that f(X) approximates y; this function is our *model* and is characterized by some *model parameters*. We usually choose the type or class of model, and then use ML techniques to learn the model parameters that minimize the *error* (the difference between the predicted and actual output). More generally we can think of this as an optimization problem, where we are trying to learn the parameters that minimize a *loss function*. That loss function is typically some cumulative function of the errors; a common choice is RMSE (root mean squared error).
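
To make the loss-function idea concrete, here is a minimal sketch of RMSE computed by hand. The values in `y_true` and `y_pred` are made up for illustration and do not come from any dataset used in this notebook.

```python
import math

# Hypothetical actual and predicted outputs (illustrative values only).
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

# Error for each example: predicted minus actual.
errors = [p - t for p, t in zip(y_pred, y_true)]

# RMSE: square the errors, average them, then take the square root.
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))
print(rmse)
```

Squaring makes every error count as positive and penalizes large errors more heavily than small ones; the final square root brings the result back to the units of y.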

In this notebook we are going to focus on practical ML. We will not be going into the details of the various algorithms.


## A Simple Example - Predicting Species in Iris Data

In practice, we would usually store the data in the form of a Pandas DataFrame or NumPy array, the *features matrix*, and assign it to a variable `X`. We would usually start with some exploratory analysis of the data. A very common approach, if the number of features is not too high, is to use a *pairplot*. We'll use the famous Iris dataset in this example, and the value we are interested in predicting is the species:


In [None]:
%matplotlib inline

In [None]:
import seaborn as sns

iris = sns.load_dataset('iris')
iris.head()

In [None]:
sns.set()
# `size` was renamed to `height` in newer seaborn releases.
sns.pairplot(iris, hue='species', height=1.5);

In the pairplot, we're looking for several things: 

- are there features that are strongly correlated (or strongly inversely correlated)? If so, we may want to exclude one of the features, to simplify the model;
- are there obvious clusters that may be linearly separable? If so, we may get far with simple linear models (as opposed to more complex models like decision trees);
- are there features which seem correlated but which our intuition or domain knowledge tells us should not be? In this case we may be missing features that are important to the model (these are called *confounding variables*).

For example, we may see a correlation between daily ice cream consumption and drive-by shooting rates, even though we would not expect either one to cause the other. The confounding variable may be weather: in hotter weather people are more likely to be outside, resulting in an increase in both ice cream consumption and drive-by shootings.
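
The first check in the list above (strongly correlated features) can also be done numerically rather than by eye. The sketch below is one possible way to do it, assuming the copy of the Iris data bundled with Scikit-Learn (`load_iris`) in place of the seaborn copy used in this notebook; the column names therefore differ slightly.

```python
from sklearn.datasets import load_iris

# Load iris as a DataFrame; column names here differ from seaborn's copy.
frame = load_iris(as_frame=True).frame
features = frame.drop(columns="target")

# Pairwise Pearson correlation between the four numeric features.
corr = features.corr()
print(corr.round(2))
```

Values close to 1 or -1 flag pairs of features that carry largely the same information, such as petal length and petal width here.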

Now, let's get the data into the X/y form we need for training a model:

In [None]:
X = iris.drop('species', axis=1)
X.head()

In [None]:
y = iris['species'] 
y.head()

Scikit-Learn has a simple and consistent API that makes it easy to learn. After getting the data in the right form, the steps are just:

- import the model type
- instantiate the model with the appropriate *hyperparameters*
- call the `fit()` API to learn the model parameters
- call `predict()` to predict the results for new data

We want to test our model on new data; the typical way to do this is to split the data into a *training set* and a *testing set*; a common split is 80/20 or 75/25.


In [None]:
from sklearn.model_selection import train_test_split
        
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=1, test_size=0.2)
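
As an optional refinement that the cell above does not use, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both subsets. The sketch below assumes the Iris copy bundled with Scikit-Learn, and the variable names (`X_demo`, `Xtr`, and so on) are chosen only for this illustration.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# stratify=y_demo keeps the class proportions equal in both subsets.
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

print(Xtr.shape, Xte.shape)
print(Counter(yte.tolist()))
```

With 150 rows and `test_size=0.2`, this gives 120 training rows and 30 test rows, with each of the three species equally represented in the test set.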


The first model type we will apply is Gaussian naive Bayes; this is simple and fast and needs no hyperparameters so it is a good baseline:

In [None]:
from sklearn.naive_bayes import GaussianNB
        
model = GaussianNB()
model.fit(Xtrain, ytrain)

In [None]:
y_model = model.predict(Xtest)
y_model

To evaluate the model we want to know the accuracy:

In [None]:
from sklearn.metrics import accuracy_score
        
accuracy_score(ytest, y_model)

So even a naive model is 97% accurate on this data.
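
Accuracy here is simply the fraction of predictions that match the true labels. A minimal sketch with made-up labels (not the notebook's actual predictions) shows that computing it by hand agrees with `accuracy_score`:

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels, for illustration only.
true_labels = ["setosa", "virginica", "versicolor", "virginica", "setosa"]
predicted = ["setosa", "virginica", "virginica", "virginica", "setosa"]

# Fraction of positions where prediction and truth agree.
manual = sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)

print(manual, accuracy_score(true_labels, predicted))
```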

Let's look at a different model, so you can see how easy it is to switch between different approaches. We'll try a decision tree model:

In [None]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=4)
model.fit(Xtrain, ytrain)
y_model = model.predict(Xtest)
accuracy_score(ytest, y_model)

Another way to see how well we have done is to use a *confusion matrix*:

In [None]:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt 

mat = confusion_matrix(ytest, y_model)

sns.heatmap(mat, square=True, annot=True, cbar=False)
plt.xlabel('predicted value')
plt.ylabel('true value');

We can see from the above that we got 29 predictions right, and only one wrong.

Confusion matrices are particularly useful for binary classifiers (only two output states), where we can clearly see the number of false positives and false negatives.
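
For the binary case, the four cells of the matrix can be unpacked directly. The sketch below uses made-up labels, with 1 as the positive class; Scikit-Learn orders the rows by true label and the columns by predicted label, so flattening the 2x2 matrix yields true negatives, false positives, false negatives and true positives, in that order.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical binary labels (1 = positive class), for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1]

# Rows are true labels, columns are predicted labels:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)
```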

## Unsupervised Learning - Dimensionality Reduction with PCA on Iris Dataset

We'll now look at an example of unsupervised learning, using Principal Component Analysis (PCA) to reduce the dataset to two dimensions for visualization.


In [None]:
from sklearn.decomposition import PCA 

model = PCA(n_components=2)
model.fit(X)  # No need to specify y for unsupervised.
X_2D = model.transform(X) 
# See the first few entries
X_2D[:5]

In [None]:
# Let's make a DataFrame and plot it.
import pandas as pd
df = pd.DataFrame(X_2D, columns=['PCA1', 'PCA2'])
df['species'] = y
df.head()

In [None]:
# Newer seaborn releases require x and y to be passed as keyword arguments.
sns.lmplot(x="PCA1", y="PCA2", hue='species', data=df, fit_reg=False);

The interesting thing to note is how the species are well separated. But when we generated the PCA values we did not tell the model anything about the species! This means that we should be able to use the input features (or the PCA features) and generate a good classification model. 
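
Since dimensionality reduction is a form of lossy compression, it can be useful to check how much information the two components retain. One way to do this, sketched here on the Iris copy bundled with Scikit-Learn rather than the seaborn DataFrame, is the fitted model's `explained_variance_ratio_` attribute:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_demo, _ = load_iris(return_X_y=True)

# Fit a two-component PCA, as in the cell above.
pca = PCA(n_components=2).fit(X_demo)

# Fraction of the total variance captured by each component.
ratios = pca.explained_variance_ratio_
print(ratios, ratios.sum())
```

A sum close to 1 means the 2D plot loses very little of the variation present in the original four features.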

## Unsupervised Learning - Clustering on Iris Dataset

Now let's look at clustering. We'll use k-means clustering:

In [None]:
from sklearn.cluster import KMeans

model = KMeans(n_clusters=3)
model.fit(X)
y_kmeans = model.predict(X)

We can see how well we have done by plotting each cluster separately. We'll use the PCA pseudo-features so we can still use 2D plots.

In [None]:
df['cluster'] = y_kmeans
# Newer seaborn releases require x and y to be passed as keyword arguments.
sns.lmplot(x="PCA1", y="PCA2", data=df, hue='species', col='cluster', fit_reg=False);

It's clear we have done very well in clustering in the blue case, but a bit less well with the other two species; this makes sense if you look at the earlier PCA plot, where you can see the blue species is more cleanly separated than the other two.
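
The same comparison can be made numerically. The sketch below is one possible approach, assuming the Iris copy bundled with Scikit-Learn: it cross-tabulates species against cluster and computes the adjusted Rand index, which is 1.0 for a perfect match and close to 0 for a random assignment. The `random_state` and `n_init` values are fixed here only for repeatability and are not part of the original cell.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

data = load_iris()
species = data.target_names[data.target]  # species name for each row

# random_state and n_init are fixed here only for repeatability.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data.data)

# Count how many flowers of each species landed in each cluster.
table = pd.crosstab(
    pd.Series(species, name="species"), pd.Series(labels, name="cluster")
)
print(table)
print(adjusted_rand_score(species, labels))
```

Cluster numbers are arbitrary, so the table should be read by looking for one dominant cluster per species rather than expecting the counts to fall on the diagonal.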

## Model Validation

### Cross Validation

### The Bias-Variance Trade-Off

### Learning Curves

## Handling Missing Data

## Feature Engineering