# Supervised learning with scikit-learn - sklearn
1. Classification
2. Regression
3. Fine-tuning your model
4. Preprocessing and Pipelines

Background:
- What is machine learning? Giving computers the ability to learn to make decisions from data without being explicitly programmed.
- Supervised learning - labeled data
- Unsupervised learning - uncovering hidden patterns from unlabeled data
- Reinforcement learning - software agents interact with an environment and learn how to optimize their behavior given a system of rewards and punishments; draws inspiration from behavioral psychology. E.g. AlphaGo, the first computer program to defeat a world champion in Go

Supervised learning
- predictor variables/features and a target variable
- Aim: predict the target variable, given the predictor variables (e.g. target variable: species, predictor variables: sepal length and width)
- Classification: target variable consists of categories
- Regression: Target variable is continuous

Naming conventions:
- Features = predictor variables = independent variables
- Target variable = dependent variable = response variable

Goals of Supervised learning:
- Automate time-consuming or expensive manual tasks (e.g. a doctor's diagnosis)
- Make predictions about the future (e.g. will a customer click an ad or not)
- Need labeled data (e.g. historical data with labels, experiments that produce labeled data such as ad clicks, crowd-sourced labels)

Tools:
- scikit-learn/sklearn - integrates well with the SciPy stack, including NumPy
- other libraries: TensorFlow, Keras

## 1. Classification

### a. EDA

In [None]:
from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')

# load dataset
iris = datasets.load_iris()
type(iris)
# out: sklearn.datasets.base.Bunch
# a Bunch is like a dictionary
print(iris.keys())
# out: dict_keys(['data','target_names','DESCR','feature_names','target'])
# the data and target are numpy arrays
iris.data.shape
# out: (150, 4) # 150 samples and 4 features
iris.target_names
# out: array(['setosa','versicolor','virginica'], dtype='<U10')

# initial EDA
X = iris.data
y = iris.target
df = pd.DataFrame(X, columns=iris.feature_names)
print(df.head())

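# Visual EDA: c sets the point color (here by target class), s the marker size
# (pd.scatter_matrix was removed from pandas; use pd.plotting.scatter_matrix)
_ = pd.plotting.scatter_matrix(df, c=y, figsize=[8,8], s=150, marker='D')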


#### i. Numerical EDA
In this chapter, you'll be working with a dataset obtained from the UCI Machine Learning Repository consisting of votes made by US House of Representatives Congressmen. Your goal will be to predict their party affiliation ('Democrat' or 'Republican') based on how they voted on certain key issues. Here, it's worth noting that we have preprocessed this dataset to deal with missing values. This is so that your focus can be directed towards understanding how to train and evaluate supervised learning models. Once you have mastered these fundamentals, you will be introduced to preprocessing techniques in Chapter 4 and have the chance to apply them there yourself - including on this very same dataset!

Before thinking about what supervised learning models you can apply to this, however, you need to perform Exploratory data analysis (EDA) in order to understand the structure of the data. For a refresher on the importance of EDA, check out the first two chapters of Statistical Thinking in Python (Part 1).

Get started with your EDA now by exploring this voting records dataset numerically. It has been pre-loaded for you into a DataFrame called df. Use pandas' .head(), .info(), and .describe() methods in the IPython Shell to explore the DataFrame.

In [None]:
# explore structure of data
df.head()
df.info()
df.describe()

#### ii. Visual EDA
The Numerical EDA you did in the previous exercise gave you some very important information, such as the names and data types of the columns, and the dimensions of the DataFrame. Following this with some visual EDA will give you an even better understanding of the data. In the video, Hugo used the scatter_matrix() function on the Iris data for this purpose. However, you may have noticed in the previous exercise that all the features in this dataset are binary; that is, they are either 0 or 1. So a different type of plot would be more useful here, such as Seaborn's countplot.

The following code generates a countplot of the 'education' bill:

plt.figure()

sns.countplot(x='education', hue='party', data=df, palette='RdBu')

plt.xticks([0,1], ['No', 'Yes'])

plt.show()

In sns.countplot(), we specify the x-axis data to be 'education', and hue to be 'party'. Recall that 'party' is also our target variable. So the resulting plot shows the difference in voting behavior between the two parties for the 'education' bill, with each party colored differently. We manually specified the color to be 'RdBu', as the Republican party has been traditionally associated with red, and the Democratic party with blue.

It seems like Democrats voted resoundingly against this bill, compared to Republicans. This is the kind of information that our machine learning model will seek to learn when we try to predict party affiliation solely based on voting behavior. An expert in U.S. politics may be able to predict this without machine learning, but probably not instantaneously, and certainly not if we are dealing with hundreds of samples!

In the IPython Shell, explore the voting behavior further by generating countplots for the 'satellite' and 'missile' bills, and answer the following question: for which of these two bills do Democrats vote resoundingly in favor, compared to Republicans? Be sure to begin your plotting statements for each figure with plt.figure() so that a new figure will be set up. Otherwise, your plots will be overlaid onto the same figure.

In [None]:
# generate countplots for 'satellite' and 'missile' bills
# seaborn provides countplot; plt and df are assumed to be available already
import seaborn as sns

# satellite bill
plt.figure()
sns.countplot(x='satellite', hue='party', data=df, palette='RdBu')
plt.xticks([0,1], ['No', 'Yes'])
plt.show()
# republicans 'no', democrats 'yes'

# missile bill
plt.figure()
sns.countplot(x='missile', hue='party', data=df, palette='RdBu')
plt.xticks([0,1], ['No', 'Yes'])
plt.show()
# republicans 'no', democrats 'yes'

### b. The classification challenge
- Training data: already labeled data

k-Nearest Neighbors
- idea is to predict the label of a data point by looking at the 'k' closest labeled data points
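
The idea above can be sketched in a few lines of plain Python. This is an illustrative toy version, not how scikit-learn implements it: the function name `knn_predict` and the toy data are hypothetical, distance is assumed to be Euclidean, and ties in the vote are broken by whichever label was seen first.

```python
from collections import Counter
import math


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest labeled points."""
    # Euclidean distance from x_new to every labeled point, paired with that point's label
    distances = [(math.dist(x, x_new), label) for x, label in zip(X_train, y_train)]
    # Labels of the k closest points
    nearest_labels = [label for _, label in sorted(distances, key=lambda d: d[0])[:k]]
    # Majority vote among those labels
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two features, two well-separated classes
X_toy = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
y_toy = ['a', 'a', 'a', 'b', 'b', 'b']

print(knn_predict(X_toy, y_toy, (1.1, 1.0), k=3))  # a
print(knn_predict(X_toy, y_toy, (4.0, 4.0), k=3))  # b
```

In practice `KNeighborsClassifier` does the same job with faster neighbor search and more options.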

Training a model on the data = 'fitting' a model to the data
- .fit() method

Predicting labels of new data
- .predict() method

In [None]:
# Using scikit-learn to fit a classifier
from sklearn.neighbors import KNeighborsClassifier
# set 'k', number of neighbors to 6
knn = KNeighborsClassifier(n_neighbors=6)
# fit classifier to training set with args: features, target
# requires args to be a NumPy array or pandas DataFrame
# requires no missing values
knn.fit(iris['data'], iris['target'])
# out: KNeighborsClassifier(algorithm='auto', leaf_size=30,
# metric='minkowski', metric_params=None, n_jobs=1,
# n_neighbors=6, p=2, weights='uniform')

# check out iris data
iris['data'].shape
# out: (150, 4)

# target has to be same # rows as feature data
iris['target'].shape
# out: (150,)

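# predict on unlabeled data
# X_new: array of new, unlabeled observations with the same 4 features
# (assumed to be defined elsewhere; it is not created in this notebook)
X_new.shape
# out: (3, 4)
prediction = knn.predict(X_new)
print('Prediction: {}'.format(prediction))
# out: Prediction: [1 1 0]
# which means 1=versicolor for the first 2 observations, and 0=setosa for the third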

#### i. k-Nearest Neighbors: Fit
Having explored the Congressional voting records dataset, it is time now to build your first classifier. In this exercise, you will fit a k-Nearest Neighbors classifier to the voting dataset, which has once again been pre-loaded for you into a DataFrame df.

In the video, Hugo discussed the importance of ensuring your data adheres to the format required by the scikit-learn API. The features need to be in an array where each column is a feature and each row a different observation or data point - in this case, a Congressman's voting record. The target needs to be a single column with the same number of observations as the feature data. We have done this for you in this exercise. Notice we named the feature array X and response variable y: This is in accordance with the common scikit-learn practice.
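
As a quick check of that format, here is a minimal sketch on a made-up DataFrame. The names `toy_df`, `bill_a`, `bill_b`, `X_toy`, and `y_toy` are hypothetical stand-ins for the voting data, which is not reproduced here.

```python
import pandas as pd

# Hypothetical stand-in for the voting DataFrame: one row per member, 0/1 votes
toy_df = pd.DataFrame({
    'party': ['democrat', 'republican', 'democrat', 'republican'],
    'bill_a': [1, 0, 1, 0],
    'bill_b': [0, 1, 1, 1],
})

# Target: 1-D array with one label per observation
y_toy = toy_df['party'].values
# Features: 2-D array, one row per observation and one column per feature
X_toy = toy_df.drop('party', axis=1).values

print(X_toy.shape)  # (4, 2)
print(y_toy.shape)  # (4,)
```

The first dimension of both arrays must match, since each row of features needs exactly one label.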

Your job is to create an instance of a k-NN classifier with 6 neighbors (by specifying the n_neighbors parameter) and then fit it to the data. The data has been pre-loaded into a DataFrame called df.

In [None]:
# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier

# Create arrays for the features and the response variable
# Note sklearn practice: X for feature array, y for response variable
# Note: the '.values' attribute returns NumPy arrays
y = df['party'].values
X = df.drop('party', axis=1).values

# Create a k-NN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the data
knn.fit(X, y)

#Out[1]: 
#KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
#           metric_params=None, n_jobs=1, n_neighbors=6, p=2,
#           weights='uniform')

#### ii. k-Nearest Neighbors: Predict
Having fit a k-NN classifier, you can now use it to predict the label of a new data point. However, there is no unlabeled data available since all of it was used to fit the model! You can still use the .predict() method on the X that was used to fit the model, but it is not a good indicator of the model's ability to generalize to new, unseen data.
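
To see why scoring on the training data is misleading, consider the extreme case of a single neighbor. The sketch below uses the Iris data rather than the voting data, and `n_neighbors=1` is chosen purely for illustration: each training point is then its own nearest neighbor, so the model essentially looks up the label it was given.

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

# Iris is used here only because it ships with scikit-learn
X_iris, y_iris = datasets.load_iris(return_X_y=True)

# With k=1, every training point is its own nearest neighbor
knn_1 = KNeighborsClassifier(n_neighbors=1)
knn_1.fit(X_iris, y_iris)

# Fraction of training labels the model reproduces
train_acc = knn_1.score(X_iris, y_iris)
print("Accuracy on the data used for fitting: {}".format(train_acc))
```

A high number here says little about how the classifier behaves on points it has never seen.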

In the next video, Hugo will discuss a solution to this problem. For now, a random unlabeled data point has been generated and is available to you as X_new. You will use your classifier to predict the label for this new data point, as well as on the training data X that the model has already seen. Using .predict() on X_new will generate 1 prediction, while using it on X will generate 435 predictions: 1 for each sample.

The DataFrame has been pre-loaded as df. This time, you will create the feature array X and target variable array y yourself.

In [None]:
# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier 

# Create arrays for the features and the response variable
y = df['party'].values
X = df.drop('party', axis=1).values

# Create a k-NN classifier with 6 neighbors: knn
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the data
knn.fit(X, y)

# Predict the labels for the training data X
y_pred = knn.predict(X)

# Predict and print the label for the new data point X_new
new_prediction = knn.predict(X_new)
print("Prediction: {}".format(new_prediction))

# out: Prediction: ['democrat']

### c. Measuring model performance
- accuracy - commonly used metric of how well a model generalizes
- accuracy = fraction of correct predictions on new data
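
As a small worked example of that definition, the sketch below computes accuracy by hand with NumPy. The label arrays `y_true` and `y_hat` are made up for illustration.

```python
import numpy as np

# Hypothetical true labels and predicted labels for six new samples
y_true = np.array([0, 1, 1, 2, 2, 0])
y_hat = np.array([0, 1, 2, 2, 2, 1])

# Accuracy = number of correct predictions / total number of predictions
accuracy = np.mean(y_true == y_hat)
print(accuracy)  # 4 of 6 correct, about 0.667
```

For scikit-learn classifiers, the `.score()` method returns this same quantity for the data passed to it.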

Split data into training and test set
- Fit/train the classifier on the training set
- Make predictions on test set

In [None]:
# Train/Test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y)

# Create a k-NN classifier with 8 neighbors
knn = KNeighborsClassifier(n_neighbors=8)

# Fit the classifier to the data
knn.fit(X_train, y_train)

# Predict the labels for the test data X_test
y_pred = knn.predict(X_test)

print("Test set predictions:\n {}".format(y_pred))

# check accuracy
knn.score(X_test, y_test)
# out: 0.9555555556

Model complexity for KNN
- larger k = smoother decision boundary = less complex model
- smaller k = more complex model = can lead to overfitting and is sensitive to noise
- Model complexity curve - shows over/underfitting with too small or large k
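
One way to build such a curve is to fit a classifier for each k and record its accuracy on the training and test sets. The sketch below assumes the Iris data and an arbitrary range of k from 1 to 15; the variable names are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = datasets.load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=21, stratify=y_iris)

# Candidate values of k (the range is an arbitrary choice for illustration)
neighbors = list(range(1, 16))
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

for i, k in enumerate(neighbors):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_tr, y_tr)
    # Accuracy on data the model has seen vs. data it has not
    train_accuracy[i] = model.score(X_tr, y_tr)
    test_accuracy[i] = model.score(X_te, y_te)

for k, tr, te in zip(neighbors, train_accuracy, test_accuracy):
    print("k={:2d}  train={:.3f}  test={:.3f}".format(k, tr, te))
```

Plotting both arrays against `neighbors` (for example with `plt.plot`) gives the curve: a large gap between training and test accuracy at small k suggests overfitting, while both dropping at large k suggests underfitting.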

#### i. The digits recognition dataset

#### ii. Train/Test Split + Fit/Predict/Accuracy

#### iii. Overfitting and Underfitting

## 2. Regression

## 3. Fine-tuning your model

## 4. Preprocessing and Pipelines