# Supervised learning with scikit-learn - sklearn
1. Classification
2. Regression
3. Fine-tuning model
4. Preprocessing and Pipelines

Background:
- What is machine learning? Giving computers the ability to learn to make decisions from data without being explicitly programmed.
- Supervised learning - labeled data
- Unsupervised learning - uncovering hidden patterns from unlabeled data
- Reinforcement learning - software agents interact with an environment and learn how to optimize their behavior, given a system of rewards and punishments; draws inspiration from behavioral psychology. E.g. AlphaGo - the first computer program to defeat a world champion in Go

Supervised learning
- predictor variables/features and a target variable
- Aim: predict the target variable, given the predictor variables (e.g. target variable: species, predictor variables: sepal length and width)
- Classification: target variable consists of categories
- Regression: Target variable is continuous

Naming conventions:
- Features = predictor variables = independent variables
- Target variable = dependent variable = response variable

Goals of Supervised learning:
- Automate time-consuming or expensive manual tasks (e.g. a doctor's diagnosis)
- Make predictions about the future (e.g. will a customer click an ad or not)
- Need labeled data (e.g. historical data with labels, experiments to get labeled data such as ad clicks, crowd-sourced labeled data)

Tools:
- scikit-learn/sklearn - integrates well with the SciPy stack, including NumPy
- other libraries: TensorFlow, Keras

## 1. Classification

### a. EDA

In [None]:
from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')

# load dataset
iris = datasets.load_iris()
type(iris)
# out: sklearn.datasets.base.Bunch
# a Bunch is like a dictionary
print(iris.keys())
# out: dict_keys(['data','target_names','DESCR','feature_names','target'])
# the data and target are numpy arrays
iris.data.shape
# out: (150, 4) # 150 samples and 4 features
iris.target_names
# out: array(['setosa','versicolor','virginica'], dtype='<U10')

# initial EDA
X = iris.data
y = iris.target
df = pd.DataFrame(X, columns=iris.feature_names)
print(df.head())

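# Visual EDA: c sets the color (here by target label), s the marker size
# (scatter_matrix lives in pd.plotting in current pandas versions)
_ = pd.plotting.scatter_matrix(df, c=y, figsize=[8,8], s=150, marker='D')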


#### i. Numerical EDA
In this chapter, you'll be working with a dataset obtained from the UCI Machine Learning Repository consisting of votes made by US House of Representatives Congressmen. Your goal will be to predict their party affiliation ('Democrat' or 'Republican') based on how they voted on certain key issues. Here, it's worth noting that we have preprocessed this dataset to deal with missing values. This is so that your focus can be directed towards understanding how to train and evaluate supervised learning models. Once you have mastered these fundamentals, you will be introduced to preprocessing techniques in Chapter 4 and have the chance to apply them there yourself - including on this very same dataset!

Before thinking about what supervised learning models you can apply to this, however, you need to perform Exploratory data analysis (EDA) in order to understand the structure of the data. For a refresher on the importance of EDA, check out the first two chapters of Statistical Thinking in Python (Part 1).

Get started with your EDA now by exploring this voting records dataset numerically. It has been pre-loaded for you into a DataFrame called df. Use pandas' .head(), .info(), and .describe() methods in the IPython Shell to explore the DataFrame, and select the statement below that is not true.

In [None]:
# explore structure of data
df.head()
df.info()
df.describe()

#### ii. Visual EDA
The Numerical EDA you did in the previous exercise gave you some very important information, such as the names and data types of the columns, and the dimensions of the DataFrame. Following this with some visual EDA will give you an even better understanding of the data. In the video, Hugo used the scatter_matrix() function on the Iris data for this purpose. However, you may have noticed in the previous exercise that all the features in this dataset are binary; that is, they are either 0 or 1. So a different type of plot would be more useful here, such as Seaborn's countplot.

Given on the right is a countplot of the 'education' bill, generated from the following code:

plt.figure()

sns.countplot(x='education', hue='party', data=df, palette='RdBu')

plt.xticks([0,1], ['No', 'Yes'])

plt.show()

In sns.countplot(), we specify the x-axis data to be 'education', and hue to be 'party'. Recall that 'party' is also our target variable. So the resulting plot shows the difference in voting behavior between the two parties for the 'education' bill, with each party colored differently. We manually specified the color to be 'RdBu', as the Republican party has been traditionally associated with red, and the Democratic party with blue.

It seems like Democrats voted resoundingly against this bill, compared to Republicans. This is the kind of information that our machine learning model will seek to learn when we try to predict party affiliation solely based on voting behavior. An expert in U.S politics may be able to predict this without machine learning, but probably not instantaneously - and certainly not if we are dealing with hundreds of samples!

In the IPython Shell, explore the voting behavior further by generating countplots for the 'satellite' and 'missile' bills, and answer the following question: Of these two bills, which ones did Democrats vote resoundingly in favor of, compared to Republicans? Be sure to begin your plotting statements for each figure with plt.figure() so that a new figure will be set up. Otherwise, your plots will be overlaid onto the same figure.

In [None]:
# generate countplots for 'satellite' and 'missile' bills

# satellite bill
plt.figure()
sns.countplot(x='satellite', hue='party', data=df, palette='RdBu')
plt.xticks([0,1], ['No', 'Yes'])
plt.show()
# republicans 'no', democrats 'yes'

# missile bill
plt.figure()
sns.countplot(x='missile', hue='party', data=df, palette='RdBu')
plt.xticks([0,1], ['No', 'Yes'])
plt.show()
# republicans 'no', democrats 'yes'

### b. The classification challenge
- Training data: already labeled data

k-Nearest Neighbors
- idea is to predict the label of a data point by looking at the 'k' closest labeled data points
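
The idea can be sketched from scratch in a few lines of NumPy. This is only an illustrative sketch, not how scikit-learn implements it: the helper `knn_predict` and the toy arrays `X_toy` and `y_toy` are hypothetical names made up for this example, and Euclidean distance with a simple majority vote is assumed.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two well-separated clusters in 2D
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

label_a = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3)
label_b = knn_predict(X_toy, y_toy, np.array([5.1, 5.0]), k=3)
print(label_a, label_b)
```

Each new point takes the label of the cluster it sits in, because all three of its nearest neighbors carry that label.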

Training a model on the data = 'fitting' a model to the data
- .fit() method
Predict labels of new data with...
- .predict() method

In [None]:
# Using scikit-learn to fit a classifier
from sklearn.neighbors import KNeighborsClassifier
# set 'k', number of neighbors to 6
knn = KNeighborsClassifier(n_neighbors=6)
# fit classifier to training set with args: features, target
# requires args to be Numpy array or Pandas dataframe
# requires no missing values
knn.fit(iris['data'], iris['target'])
# out: KNeighborsClassifier(algorithm='auto', leaf_size=30,
# metric='minkowski', metric_params=None, n_jobs=1,
# n_neighbors=6, p=2, weights='uniform')

# check out iris data
iris['data'].shape
# out: (150, 4)

# target has to be same # rows as feature data
iris['target'].shape
# out: (150,)

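# predict on unlabeled data
# X_new: array of new, unlabeled observations (assumed to be defined
# elsewhere), with the same 4 feature columns as the training data
prediction = knn.predict(X_new)
X_new.shape
# out: (3, 4)
print('Prediction: {}'.format(prediction))
# out: Prediction: [1 1 0]
# which means 1=versicolor for the first 2 observations, and 0=setosa for the third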

#### i. k-Nearest Neighbors: Fit
Having explored the Congressional voting records dataset, it is time now to build your first classifier. In this exercise, you will fit a k-Nearest Neighbors classifier to the voting dataset, which has once again been pre-loaded for you into a DataFrame df.

In the video, Hugo discussed the importance of ensuring your data adheres to the format required by the scikit-learn API. The features need to be in an array where each column is a feature and each row a different observation or data point - in this case, a Congressman's voting record. The target needs to be a single column with the same number of observations as the feature data. We have done this for you in this exercise. Notice we named the feature array X and response variable y: This is in accordance with the common scikit-learn practice.

Your job is to create an instance of a k-NN classifier with 6 neighbors (by specifying the n_neighbors parameter) and then fit it to the data. The data has been pre-loaded into a DataFrame called df.

In [None]:
# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier

# Create arrays for the features and the response variable
# Note sklearn practice: X for feature array, y for response variable
# Note: the '.values' attribute returns NumPy arrays
y = df['party'].values
X = df.drop('party', axis=1).values

# Create a k-NN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the data
knn.fit(X, y)

#Out[1]: 
#KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
#           metric_params=None, n_jobs=1, n_neighbors=6, p=2,
#           weights='uniform')

#### ii. k-Nearest Neighbors: Predict
Having fit a k-NN classifier, you can now use it to predict the label of a new data point. However, there is no unlabeled data available since all of it was used to fit the model! You can still use the .predict() method on the X that was used to fit the model, but it is not a good indicator of the model's ability to generalize to new, unseen data.
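
Why predictions on the training data say little about generalization can be seen in a small sketch. Everything here is an assumption for illustration: the data is random, the labels are deliberately unrelated to the features, and the names `X_rand`, `y_rand` and `knn_demo` are hypothetical. A 1-nearest-neighbor classifier still scores perfectly on the data it was fit on, because each point is its own nearest neighbor.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_rand = rng.normal(size=(300, 5))
y_rand = rng.integers(0, 2, size=300)  # labels carry no information about X_rand

# First 200 rows are used for fitting, the remaining 100 are held back
X_seen, y_seen = X_rand[:200], y_rand[:200]
X_unseen, y_unseen = X_rand[200:], y_rand[200:]

knn_demo = KNeighborsClassifier(n_neighbors=1)
knn_demo.fit(X_seen, y_seen)

train_acc = knn_demo.score(X_seen, y_seen)       # evaluated on data the model has seen
unseen_acc = knn_demo.score(X_unseen, y_unseen)  # evaluated on held-back data
print("Accuracy on seen data:   {}".format(train_acc))
print("Accuracy on unseen data: {}".format(unseen_acc))
```

The gap between the two numbers is the reason to evaluate on data that was not used for fitting.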

In the next video, Hugo will discuss a solution to this problem. For now, a random unlabeled data point has been generated and is available to you as X_new. You will use your classifier to predict the label for this new data point, as well as on the training data X that the model has already seen. Using .predict() on X_new will generate 1 prediction, while using it on X will generate 435 predictions: 1 for each sample.

The DataFrame has been pre-loaded as df. This time, you will create the feature array X and target variable array y yourself.

In [None]:
# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier 

# Create arrays for the features and the response variable
y = df['party'].values
X = df.drop('party', axis=1).values

# Create a k-NN classifier with 6 neighbors: knn
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the data
knn.fit(X, y)

# Predict the labels for the training data X
y_pred = knn.predict(X)

# Predict and print the label for the new data point X_new
new_prediction = knn.predict(X_new)
print("Prediction: {}".format(new_prediction))

# out: Prediction: ['democrat']

### c. Measuring model performance
- accuracy - commonly used metric of how well a classifier generalizes
- accuracy = fraction of correct predictions on new data

Split data into training and test set
- Fit/train the classifier on the training set
- Make predictions on test set
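
As a small worked example of the accuracy definition above, the sketch below computes the fraction of correct predictions by hand and compares it with scikit-learn's `accuracy_score`. The label arrays `y_true_demo` and `y_pred_demo` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for 8 test samples
y_true_demo = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_pred_demo = np.array([0, 1, 0, 0, 1, 1, 1, 1])

# Fraction of positions where the prediction matches the true label
manual_accuracy = np.mean(y_true_demo == y_pred_demo)
sklearn_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(manual_accuracy, sklearn_accuracy)
```

Six of the eight predictions match, so both values are 0.75. For classifiers, the `.score()` method reports this same quantity.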

In [None]:
# Train/Test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y)

# Create a k-NN classifier with 8 neighbors
knn = KNeighborsClassifier(n_neighbors=8)

# Fit the classifier to the data
knn.fit(X_train, y_train)

# Predict the labels for the test data X_test
y_pred = knn.predict(X_test)

print("Test set predictions:\n {}".format(y_pred))

# check accuracy
knn.score(X_test, y_test)
# out: 0.9555555556

Model complexity for KNN
- larger k = smoother decision boundary = less complex model
- smaller k = more complex model = can lead to overfitting and sensitive to noise
- Model complexity curve - shows over/underfitting with too small or large k

#### i. the digits recognition dataset
Up until now, you have been performing binary classification, since the target variable had two possible outcomes. Hugo, however, got to perform multi-class classification in the videos, where the target variable could take on three possible outcomes. Why does he get to have all the fun?! In the following exercises, you'll be working with the MNIST digits recognition dataset, which has 10 classes, the digits 0 through 9! A reduced version of the MNIST dataset is one of scikit-learn's included datasets, and that is the one we will use in this exercise.

Each sample in this scikit-learn dataset is an 8x8 image representing a handwritten digit. Each pixel is represented by an integer in the range 0 to 16, indicating varying levels of black. Recall that scikit-learn's built-in datasets are of type Bunch, which are dictionary-like objects. Helpfully for the MNIST dataset, scikit-learn provides an 'images' key in addition to the 'data' and 'target' keys that you have seen with the Iris data. Because it is a 2D array of the images corresponding to each sample, this 'images' key is useful for visualizing the images, as you'll see in this exercise (for more on plotting 2D arrays, see Chapter 2 of DataCamp's course on Data Visualization with Python). On the other hand, the 'data' key contains the feature array - that is, the images as a flattened array of 64 pixels.

Notice that you can access the keys of these Bunch objects in two different ways: By using the . notation, as in digits.images, or the [] notation, as in digits['images'].
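
A quick sketch of both access styles, and of how the 'images' and 'data' keys relate. The variable names `digits_demo`, `same_object` and `flattened` are chosen for this example only.

```python
import numpy as np
from sklearn import datasets

digits_demo = datasets.load_digits()

# Attribute access and key access return the same underlying array
same_object = digits_demo.images is digits_demo['images']
print(same_object)

# 'images' holds one 8x8 array per sample, 'data' the same pixels flattened to 64 values
print(digits_demo.images.shape)
print(digits_demo.data.shape)

# Flattening each image reproduces the feature array
flattened = digits_demo.images.reshape(len(digits_demo.images), -1)
print(np.array_equal(flattened, digits_demo.data))
```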

For more on the MNIST data, check out this exercise in Part 1 of DataCamp's Importing Data in Python course. There, the full version of the MNIST dataset is used, in which the images are 28x28. It is a famous dataset in machine learning and computer vision, and frequently used as a benchmark to evaluate the performance of a new model.

In [None]:
# Import necessary modules
from sklearn import datasets
import matplotlib.pyplot as plt

# Load the digits dataset: digits
digits = datasets.load_digits()

# Print the keys and DESCR of the dataset
print(digits.keys())
print(digits.DESCR)

# Print the shape of the images and data keys
print(digits.images.shape)
print(digits.data.shape)

# Display digit 1010
plt.imshow(digits.images[1010], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

# it shows the hand-written number 5

#### ii. Train/Test Split + Fit/Predict/Accuracy
- build a classifier that can make this prediction not only for this image, but for all the other ones in the dataset

Now that you have learned about the importance of splitting your data into training and test sets, it's time to practice doing this on the digits dataset! After creating arrays for the features and target variable, you will split them into training and test sets, fit a k-NN classifier to the training data, and then compute its accuracy using the .score() method.

In [None]:
# Import necessary modules
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.model_selection import train_test_split

# Create feature and target arrays
X = digits.data
y = digits.target

# Split into training and test set
# stratify: Stratify the split according to the labels so that 
# they are distributed in the training and test sets as they are 
# in the original dataset.
X_train, X_test, y_train, y_test = train_test_split(X, y, 
            test_size = 0.2, random_state=42, stratify=y)

# Create a k-NN classifier with 7 neighbors: knn
knn = KNeighborsClassifier(n_neighbors=7)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Print the accuracy
print(knn.score(X_test, y_test))

# out: 0.983333333333

#### iii. Overfitting and Underfitting
Remember the model complexity curve that Hugo showed in the video? You will now construct such a curve for the digits dataset! In this exercise, you will compute and plot the training and testing accuracy scores for a variety of different neighbor values. By observing how the accuracy scores differ for the training and testing sets with different values of k, you will develop your intuition for overfitting and underfitting.

The training and testing sets are available to you in the workspace as X_train, X_test, y_train, y_test. In addition, KNeighborsClassifier has been imported from sklearn.neighbors.

In [None]:
# create model complexity curve for different k values in knn

# Setup arrays to store train and test accuracies
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Loop over different values of k
for i, k in enumerate(neighbors):
    # Setup a k-NN Classifier with k neighbors: knn
    knn = KNeighborsClassifier(n_neighbors=k)

    # Fit the classifier to the training data
    knn.fit(X_train, y_train)
    
    #Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)

    #Compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()


Conclusion: It looks like the test accuracy is highest when using 3 and 5 neighbors. Using 8 neighbors or more seems to result in a simple model that underfits the data. 

## 2. Regression
- continuous variables

In [None]:
# Boston housing data example
boston = pd.read_csv('boston.csv')
print(boston.head())

# creating feature and target arrays
X = boston.drop('MEDV', axis=1).values
y = boston['MEDV'].values

# predict house value from a single feature
X_rooms = X[:,5]
# check type: they are both NumPy arrays
type(X_rooms), type(y)
y = y.reshape(-1,1)
X_rooms = X_rooms.reshape(-1,1)
# plot house value vs number of rooms
plt.scatter(X_rooms, y)
plt.ylabel('Value of house /1000 ($)')
plt.xlabel('Number of rooms')
plt.show()

# Fit a regression model
import numpy as np
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(X_rooms, y)
prediction_space = np.linspace(min(X_rooms),
                              max(X_rooms)).reshape(-1,1)
plt.scatter(X_rooms, y, color='blue')
plt.plot(prediction_space, reg.predict(prediction_space),
        color='black', linewidth=3)
plt.show()

### a. Importing data for supervised learning
In this chapter, you will work with Gapminder data that we have consolidated into one CSV file available in the workspace as 'gapminder.csv'. Specifically, your goal will be to use this data to predict the life expectancy in a given country based on features such as the country's GDP, fertility rate, and population. As in Chapter 1, the dataset has been preprocessed.

Since the target variable here is quantitative, this is a regression problem. To begin, you will fit a linear regression with just one feature: 'fertility', which is the average number of children a woman in a given country gives birth to. In later exercises, you will use all the features to build regression models.

Before that, however, you need to import the data and get it into the form needed by scikit-learn. This involves creating feature and target variable arrays. Furthermore, since you are going to use only one feature to begin with, you need to do some reshaping using NumPy's .reshape() method. Don't worry too much about this reshaping right now, but it is something you will have to do occasionally when working with scikit-learn so it is useful to practice.

In [None]:
# Import numpy and pandas
import numpy as np
import pandas as pd

# Read the CSV file into a DataFrame: df
df = pd.read_csv('gapminder.csv')

# Create arrays for features and target variable
y = df['life'].values
X = df['fertility'].values

# Print the dimensions of X and y before reshaping
print("Dimensions of y before reshaping: {}".format(y.shape))
print("Dimensions of X before reshaping: {}".format(X.shape))

# Reshape X and y
y = y.reshape(-1,1)
X = X.reshape(-1,1)

# Print the dimensions of X and y after reshaping
print("Dimensions of y after reshaping: {}".format(y.shape))
print("Dimensions of X after reshaping: {}".format(X.shape))


<script.py> output:
    Dimensions of y before reshaping: (139,)
    Dimensions of X before reshaping: (139,)
    Dimensions of y after reshaping: (139, 1)
    Dimensions of X after reshaping: (139, 1)

Notice the differences in shape before and after applying the .reshape() method. Getting the feature and target variable arrays into the right format for scikit-learn is an important precursor to model building.

### b. Exploring the Gapminder data
As always, it is important to explore your data before building models. On the right, we have constructed a heatmap showing the correlation between the different features of the Gapminder dataset, which has been pre-loaded into a DataFrame as df and is available for exploration in the IPython Shell. Cells that are in green show positive correlation, while cells that are in red show negative correlation. Take a moment to explore this: Which features are positively correlated with life, and which ones are negatively correlated? Does this match your intuition?

Then, in the IPython Shell, explore the DataFrame using pandas methods such as .info(), .describe(), .head().

In case you are curious, the heatmap was generated using Seaborn's heatmap function (http://seaborn.pydata.org/generated/seaborn.heatmap.html) and the following line of code, where df.corr() computes the pairwise correlation between columns:

sns.heatmap(df.corr(), square=True, cmap='RdYlGn')

Once you have a feel for the data, consider the statements below and select the one that is not true. After this, Hugo will explain the mechanics of linear regression in the next video and you will be on your way building regression models!

In [None]:
# explore df
df.info()
df.describe()
df.head()

### c. Basics of linear regression

Regression mechanics:

y = ax + b
- y = target
- x = single feature
- a, b = parameters of model

How do we choose a and b?
1. Define an error function (loss/cost function) for any given line
2. Choose the line that minimizes the error function

The loss function: Ordinary Least Squares (OLS) - minimize sum of squares of residuals
- residual = vertical distance between dot (data) and regression line
- same as minimizing the mean squared errors on the training set
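
For a single feature, the line that minimizes the sum of squared residuals has a well-known closed form: a = cov(x, y) / var(x) and b = mean(y) - a * mean(x). The sketch below applies it to a tiny made-up dataset (`x_demo` and `y_demo` are hypothetical) and checks the result against `np.polyfit`.

```python
import numpy as np

# Made-up data that roughly follows y = 2x
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form OLS solution for y = a*x + b
x_centered = x_demo - x_demo.mean()
y_centered = y_demo - y_demo.mean()
a = np.sum(x_centered * y_centered) / np.sum(x_centered ** 2)
b = y_demo.mean() - a * x_demo.mean()

# Residuals: vertical distances between the data and the fitted line
residuals = y_demo - (a * x_demo + b)
sum_sq_residuals = np.sum(residuals ** 2)

# np.polyfit with degree 1 solves the same least-squares problem
a_np, b_np = np.polyfit(x_demo, y_demo, 1)
print(a, b, sum_sq_residuals)
```

Here a comes out at about 1.99 and b at about 0.05. Any other choice of a and b gives a larger sum of squared residuals on this data.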


Linear regression in higher dimensions:

y = a1*x1 + a2*x2 + b
- To fit a linear regression model here, you need to specify 3 parameters: a1, a2 and b
- In higher dimensions, must specify coefficient for each feature (ai) and variable b
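
A minimal sketch of the two-feature case, assuming noiseless synthetic data generated from known parameters (a1 = 2, a2 = -1, b = 0.5). The names `X_two`, `y_two` and `design` are hypothetical. It recovers the parameters with NumPy's least-squares solver and with `LinearRegression`, which exposes one coefficient per feature in `coef_` and b in `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X_two = rng.normal(size=(100, 2))
# Target built from known parameters, without noise
y_two = 2.0 * X_two[:, 0] - 1.0 * X_two[:, 1] + 0.5

# Append a column of ones so the solver can also estimate the intercept b
design = np.column_stack([X_two, np.ones(len(X_two))])
params, *_ = np.linalg.lstsq(design, y_two, rcond=None)
a1, a2, b = params

reg_two = LinearRegression().fit(X_two, y_two)
print(a1, a2, b)
print(reg_two.coef_, reg_two.intercept_)
```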

In [None]:
# Example: Linear regression on all features
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.3, random_state=42)

reg_all = linear_model.LinearRegression()
reg_all.fit(X_train, y_train)
y_pred = reg_all.predict(X_test)
reg_all.score(X_test, y_test)

# R2 ("R squared") - default scoring method for linear regression

# Note: usually will use Regularization instead of 
# linear regression like this to put further constraints on
# model coefficients

#### i. Fit and Predict for regression (1 feature)
Now, you will fit a linear regression and predict life expectancy using just one feature. You saw Andy do this earlier using the 'RM' feature of the Boston housing dataset. In this exercise, you will use the 'fertility' feature of the Gapminder dataset. Since the goal is to predict life expectancy, the target variable here is 'life'. The array for the target variable has been pre-loaded as y and the array for 'fertility' has been pre-loaded as X_fertility.

A scatter plot with 'fertility' on the x-axis and 'life' on the y-axis has been generated. As you can see, there is a strongly negative correlation, so a linear regression should be able to capture this trend. Your job is to fit a linear regression and then predict the life expectancy, overlaying these predicted values on the plot to generate a regression line. You will also compute and print the R2 (R-squared) score using scikit-learn's .score() method.

In [None]:
# Import LinearRegression
from sklearn.linear_model import LinearRegression

# Create the regressor: reg
reg = LinearRegression()

# Create the prediction space
prediction_space = np.linspace(min(X_fertility), 
                               max(X_fertility)).reshape(-1,1)

# Fit the model to the data
reg.fit(X_fertility, y)

# Compute predictions over the prediction space: y_pred
y_pred = reg.predict(prediction_space)

# Print R^2 
print(reg.score(X_fertility, y))

# Plot regression line
plt.plot(prediction_space, y_pred, color='black', linewidth=3)
plt.show()

# out: 0.619244216774

#### ii. Train/Test split for regression
As you learned in Chapter 1, train and test sets are vital to ensure that your supervised learning model is able to generalize well to new data. This was true for classification models, and is equally true for linear regression models.

In this exercise, you will split the Gapminder dataset into training and testing sets, and then fit and predict a linear regression over all features. In addition to computing the R2 score, you will also compute the Root Mean Squared Error (RMSE), which is another commonly used metric to evaluate regression models. The feature array X and target variable array y have been pre-loaded for you from the DataFrame df.
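
Both metrics can be computed by hand, which makes their meaning concrete: R2 is 1 minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared residual. The sketch below uses made-up values (`y_obs` and `y_hat` are hypothetical) and checks the results against `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical observed values and model predictions
y_obs = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_obs - y_hat) ** 2)         # residual sum of squares
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

rmse_manual = np.sqrt(np.mean((y_obs - y_hat) ** 2))

print(r2_manual, r2_score(y_obs, y_hat))
print(rmse_manual, np.sqrt(mean_squared_error(y_obs, y_hat)))
```

RMSE is expressed in the units of the target variable, whereas R2 is unitless.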

In [None]:
# Import necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.3,
                                                    random_state=42)

# Create the regressor: reg_all
reg_all = LinearRegression()

# Fit the regressor to the training data
reg_all.fit(X_train, y_train)

# Predict on the test data: y_pred
y_pred = reg_all.predict(X_test)

# Compute and print R^2 and RMSE
print("R^2: {}".format(reg_all.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))

# out:
# R^2: 0.838046873142936
# Root Mean Squared Error: 3.2476010800377213

Using all features has improved the model score. This makes sense, as the model has more information to learn from. However, there is one potential pitfall to this process. Can you spot it? You'll learn about this, as well as how to better validate your models, next.

### d. Cross-validation
Motivation:
- model performance is dependent on way the data is split
- not representative of model's ability to generalize

Note:
- k folds = k-fold CV
- Tradeoff: more folds = more computationally expensive
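
What k-fold CV does can be sketched with an explicit loop: split the data into k folds, hold each fold out once as the test set, fit on the remaining folds, and record the score. The sketch assumes synthetic regression data (`X_cv` and `y_cv` are made up) and compares the loop with `cross_val_score`, which for a regressor and an integer `cv` uses unshuffled `KFold` splits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X_cv = rng.normal(size=(100, 3))
y_cv = X_cv @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X_cv):
    # Fit on 4 folds, score (R^2) on the held-out fold
    model = LinearRegression().fit(X_cv[train_idx], y_cv[train_idx])
    manual_scores.append(model.score(X_cv[test_idx], y_cv[test_idx]))

auto_scores = cross_val_score(LinearRegression(), X_cv, y_cv, cv=5)

print(np.array(manual_scores))
print(auto_scores)
```

Each of the 5 folds requires its own fit, which is why more folds cost more computation.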

In [None]:
# example: cross-validation
from sklearn.model_selection import cross_val_score
reg = linear_model.LinearRegression()
# specify cv = number of folds utilized
cv_results = cross_val_score(reg, X, y, cv=5)
print(cv_results)
# compute mean
np.mean(cv_results)

#### i. 5-fold cross-validation
Cross-validation is a vital step in evaluating a model. It maximizes the amount of data that is used to train the model, as during the course of training, the model is not only trained, but also tested on all of the available data.

In this exercise, you will practice 5-fold cross validation on the Gapminder data. By default, scikit-learn's cross_val_score() function uses R^2 as the metric of choice for regression. Since you are performing 5-fold cross-validation, the function will return 5 scores. Your job is to compute these 5 scores and then take their average.

The DataFrame has been loaded as df and split into the feature/target variable arrays X and y. The modules pandas and numpy have been imported as pd and np, respectively.

In [None]:
# Import the necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Create a linear regression object: reg
reg = LinearRegression()

# Compute 5-fold cross-validation scores: cv_scores
cv_scores = cross_val_score(reg, X, y, cv=5)

# Print the 5-fold cross-validation scores
print(cv_scores)

print("Average 5-Fold CV Score: {}".format(np.mean(cv_scores)))

# out
#[ 0.81720569  0.82917058  0.90214134  0.80633989  0.94495637]
#Average 5-Fold CV Score: 0.8599627722793232

#### ii. K-Fold CV comparison
Cross validation is essential but do not forget that the more folds you use, the more computationally expensive cross-validation becomes. In this exercise, you will explore this for yourself. Your job is to perform 3-fold cross-validation and then 10-fold cross-validation on the Gapminder dataset.

In the IPython Shell, you can use %timeit to see how long 3-fold CV takes compared to 10-fold CV by executing the following with cv=3 and then cv=10:

%timeit cross_val_score(reg, X, y, cv = ____)

pandas and numpy are available in the workspace as pd and np. The DataFrame has been loaded as df and the feature/target variable arrays X and y have been created.

In [None]:
# Import necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Create a linear regression object: reg
reg = LinearRegression()

# Perform 3-fold CV
cvscores_3 = cross_val_score(reg, X, y, cv=3)
print(np.mean(cvscores_3))

# Perform 10-fold CV
cvscores_10 = cross_val_score(reg, X, y, cv=10)
print(np.mean(cvscores_10))

# out
# 0.871871278262
# 0.843612862013

In [None]:
# time 3 vs 10 fold CV to see increase in computational expense/time
%timeit cross_val_score(reg, X, y, cv=3)
%timeit cross_val_score(reg, X, y, cv=10)

#100 loops, best of 3: 9.91 ms per loop
#10 loops, best of 3: 31.2 ms per loop

### e. Regularized regression
- Linear regression minimizes a loss function
- It chooses a coefficient for each feature variable
- Large coefficients can lead to overfitting (the model becomes flexible enough to fit noise)
- Regularization = penalize large coefficients

#### i. Regularization 1: Lasso (feature selection)
- Loss function = OLS loss function + alpha * sum of abs value of coeff
- Used for feature selection
    - How? By shrinking the coefficients of less important feature to exactly 0
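
The shrink-to-zero behavior can be seen on synthetic data. In this sketch (an illustration only, with hypothetical names `X_syn`, `y_syn` and `lasso_demo`), the target depends on the first two of five features, and the remaining three are pure noise. The alpha value of 0.5 is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 5))
# Only features 0 and 1 influence the target; features 2-4 are irrelevant
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=0.1, size=200)

lasso_demo = Lasso(alpha=0.5).fit(X_syn, y_syn)
coef = lasso_demo.coef_
print(coef)

# Features whose coefficient survived the L1 penalty
selected = np.flatnonzero(coef)
print(selected)
```

The coefficients of the irrelevant features end up at exactly 0, while the two informative ones stay non-zero, although shrunk toward 0 compared with the values used to generate the data.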

In [None]:
# example: Lasso regression
from sklearn.linear_model import Lasso

X_train, X_test, y_train, y_test = train_test_split(X,y,
        test_size=0.3, random_state=42)

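# normalize arg ensures all variables on same scale
# (note: normalize was removed in scikit-learn 1.2; on newer versions,
# scale the features beforehand instead)
lasso = Lasso(alpha=0.1, normalize=True)
lasso.fit(X_train, y_train)
lasso_pred = lasso.predict(X_test)
lasso.score(X_test, y_test)
# out: 0.595022...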

In [None]:
# example: Lasso regularization used for feature selection
from sklearn.linear_model import Lasso

# store feature names in names
names = boston.drop('MEDV', axis=1).columns

lasso = Lasso(alpha=0.1)
lasso_coef = lasso.fit(X, y).coef_

# plot the coefficients as a function of feature name
_ = plt.plot(range(len(names)), lasso_coef)
_ = plt.xticks(range(len(names)), names, rotation=60)
_ = plt.ylabel('Coefficients')
plt.show()

Regularization I: Lasso
In the video, you saw how Lasso selected out the 'RM' feature as being the most important for predicting Boston house prices, while shrinking the coefficients of certain other features to 0. Its ability to perform feature selection in this way becomes even more useful when you are dealing with data involving thousands of features.

In this exercise, you will fit a lasso regression to the Gapminder data you have been working with and plot the coefficients. Just as with the Boston data, you will find that the coefficients of some features are shrunk to 0, with only the most important ones remaining.

The feature and target variable arrays have been pre-loaded as X and y.

In [None]:
# Import Lasso
from sklearn.linear_model import Lasso

# Instantiate a lasso regressor: lasso
lasso = Lasso(alpha=0.4, normalize=True)

# Fit the regressor to the data
lasso.fit(X, y)

# Compute and print the coefficients
lasso_coef = lasso.coef_
print(lasso_coef)

# Plot the coefficients
plt.plot(range(len(df_columns)), lasso_coef)
plt.xticks(range(len(df_columns)), df_columns.values, rotation=60)
plt.margins(0.02)
plt.show()

# plot shows child_mortality feature is important when predicting
# life expectancy

#### ii. Regularization 2: Ridge (1st choice in building regression models)
- Loss function = OLS loss function + alpha * sum of coefficient^2
- Need to choose alpha parameter for the best performing model
- Picking alpha is similar to picking k in k-NN
- aka Hyperparameter tuning
- alpha (sometimes lambda) controls model complexity
    - alpha = 0, get back OLS and overfitting
    - very high alpha can lead to underfitting
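
A small sketch of how alpha controls model complexity, assuming synthetic data from `make_regression` and an arbitrary list of alpha values (neither is from the course): as alpha grows, the L2 penalty pulls the coefficients toward 0.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression data (assumption for illustration)
X, y = make_regression(n_samples=100, n_features=5, noise=10.0,
                       random_state=0)

alphas = [0.01, 1.0, 100.0, 10000.0]
coef_norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X, y)
    # Size of the coefficient vector for this alpha
    coef_norms.append(float(np.linalg.norm(ridge.coef_)))
    print("alpha={:>8}: ||coef|| = {:.3f}".format(alpha, coef_norms[-1]))
```

A very small alpha gives coefficients close to the OLS solution; a very large alpha shrinks them so much that the model can underfit.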

In [None]:
# example: Ridge regression
from sklearn.linear_model import Ridge

X_train, X_test, y_train, y_test = train_test_split(X,y,
        test_size=0.3, random_state=42)

# normalize arg ensures all variables on same scale
ridge = Ridge(alpha=0.1, normalize=True)
ridge.fit(X_train, y_train)
ridge_pred = ridge.predict(X_test)
ridge.score(X_test, y_test)
# out: 0.69969...

Regularization II: Ridge
Lasso is great for feature selection, but when building regression models, Ridge regression should be your first choice.

Recall that lasso performs regularization by adding to the loss function a penalty term of the absolute value of each coefficient multiplied by some alpha. This is also known as L1 regularization because the regularization term is the L1 norm of the coefficients. This is not the only way to regularize, however.

If instead you took the sum of the squared values of the coefficients multiplied by some alpha - like in Ridge regression - you would be computing the L2 norm. In this exercise, you will practice fitting ridge regression models over a range of different alphas, and plot cross-validated R2 scores for each, using this function that we have defined for you, which plots the R2 score as well as standard error for each alpha:

def display_plot(cv_scores, cv_scores_std):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.plot(alpha_space, cv_scores)

    std_error = cv_scores_std / np.sqrt(10)

    ax.fill_between(alpha_space, cv_scores + std_error, cv_scores - std_error, alpha=0.2)
    ax.set_ylabel('CV Score +/- Std Error')
    ax.set_xlabel('Alpha')
    ax.axhline(np.max(cv_scores), linestyle='--', color='.5')
    ax.set_xlim([alpha_space[0], alpha_space[-1]])
    ax.set_xscale('log')
    plt.show()

Don't worry about the specifics of how the above function works. The motivation behind this exercise is for you to see how the R2 score varies with different alphas, and to understand the importance of selecting the right value for alpha.

In [None]:
# example: Ridge regularization

# Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Setup the array of alphas and lists to store scores
alpha_space = np.logspace(-4, 0, 50)
ridge_scores = []
ridge_scores_std = []

# Create a ridge regressor: ridge
ridge = Ridge(normalize=True)

# Compute scores over range of alphas
for alpha in alpha_space:

    # Specify the alpha value to use: ridge.alpha
    ridge.alpha = alpha
    
    # Perform 10-fold CV: ridge_cv_scores
    ridge_cv_scores = cross_val_score(ridge,X,y,cv=10)
    
    # Append the mean of ridge_cv_scores to ridge_scores
    ridge_scores.append(np.mean(ridge_cv_scores))
    
    # Append the std of ridge_cv_scores to ridge_scores_std
    ridge_scores_std.append(np.std(ridge_cv_scores))

# Display the plot
display_plot(ridge_scores, ridge_scores_std)


## 3. Fine-tuning your model

Model performance measured with accuracy, but not always a useful metric.
- consider class imbalance: if 99% of emails are real and 1% are spam, a classifier that predicts 'real' for every email is 99% accurate but catches no spam
More nuanced performance metrics:
- Diagnosing classification predictions
    - Confusion matrix (T/F Positive/Negative)
- Classification report (Confusion matrix metrics):
    - Accuracy
    - Precision (PPV - positive predictive value)
        = tp / (tp + fp)
        - high precision: not many real emails predicted as spam
    - Recall = tp / (tp + fn) aka Sensitivity, Hit rate, True Positive Rate
        - high recall: predicted most spam emails correctly
    - F1 score = 2*(precision*recall)/(precision+recall) = harmonic mean of precision and recall
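
The formulas above can be worked through by hand on a tiny example. The labels below are hypothetical (10 emails, 1 = spam, 0 = real); the sketch computes each metric from the confusion matrix counts and compares the results with scikit-learn's own metric functions.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels for 10 emails: 1 = spam, 0 = real
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)                           # 3 / 4 = 0.75
recall = tp / (tp + fn)                              # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)   # 0.75
accuracy = (tp + tn) / (tp + tn + fp + fn)           # 8 / 10 = 0.8

print(precision, recall, f1, accuracy)

# Same values from scikit-learn's metric functions
print(precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```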

In [None]:
# Confusion matrix in scikit-learn
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

knn = KNeighborsClassifier(n_neighbors=8)
X_train, X_test, y_train, y_test = train_test_split(X, y,
        test_size=0.4, random_state=42)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# create confusion matrix
# note: y_test is the true label, prediction label is the 2nd arg
print(confusion_matrix(y_test, y_pred))
# print classification report: precision, recall, f1-score, support
print(classification_report(y_test, y_pred))

Metrics for classification
In Chapter 1, you evaluated the performance of your k-NN classifier based on its accuracy. However, accuracy is not always an informative metric. In this exercise, you will dive more deeply into evaluating the performance of binary classifiers by computing a confusion matrix and generating a classification report.

You may have noticed in the video that the classification report consisted of three rows, and an additional support column. 

Support column in classification report: 
- The support gives the number of samples of the true response that lie in that class - so in the video example, the support was the number of Republicans or Democrats in the test set on which the classification report was computed. The precision, recall, and f1-score columns, then, gave the respective metrics for that particular class.

Here, you'll work with the PIMA Indians dataset obtained from the UCI Machine Learning Repository. The goal is to predict whether or not a given female patient will contract diabetes based on features such as BMI, age, and number of pregnancies. Therefore, it is a binary classification problem. A target value of 0 indicates that the patient does not have diabetes, while a value of 1 indicates that the patient does have diabetes. As in Chapters 1 and 2, the dataset has been preprocessed to deal with missing values.

The dataset has been loaded into a DataFrame df and the feature and target variable arrays X and y have been created for you. In addition, sklearn.model_selection.train_test_split and sklearn.neighbors.KNeighborsClassifier have already been imported.

Your job is to train a k-NN classifier to the data and evaluate its performance by generating a confusion matrix and classification report.

In [None]:
# example: knn confusion matrix and classification report

# Import necessary modules
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# Create training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.4, random_state=42)

# Instantiate a k-NN classifier: knn
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)

# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# out
#[[176  30]
# [ 52  50]]
#             precision    recall  f1-score   support

#          0       0.77      0.85      0.81       206
#          1       0.62      0.49      0.55       102

#avg / total       0.72      0.73      0.72       308

### 3.1 Logistic regression and ROC curve
- used in classification problems (not regression)

Logistic regression for binary classification:
- logistic regression outputs probabilities
    - if probability, p, > 0.5: data labeled '1'
    - p < 0.5: data labeled '0'
- Probability thresholds
    - default logistic regression threshold = 0.5
    - thresholds are not specific to logistic regression: k-NN classifiers also have them
    - What happens if the threshold varies? What happens to True Positive and False Positive rates with threshold varied? Look at ROC Curve
    
- ROC curve (Receiver Operating Characteristic curve)
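
To see what happens when the threshold varies, the sketch below applies several thresholds to the predicted probabilities and computes the true positive and false positive rates by hand. It assumes synthetic data from `make_classification`, and `rates_at` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption for illustration)
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

logreg = LogisticRegression().fit(X_train, y_train)
y_pred_prob = logreg.predict_proba(X_test)[:, 1]

def rates_at(threshold):
    """Return (true positive rate, false positive rate) at a threshold."""
    y_pred = (y_pred_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_test == 1))
    fp = np.sum((y_pred == 1) & (y_test == 0))
    tpr = tp / np.sum(y_test == 1)
    fpr = fp / np.sum(y_test == 0)
    return tpr, fpr

# Each threshold gives one (fpr, tpr) point on the ROC curve
results = [(t, *rates_at(t)) for t in (0.1, 0.3, 0.5, 0.7, 0.9)]
for t, tpr, fpr in results:
    print("threshold={:.1f}  TPR={:.2f}  FPR={:.2f}".format(t, tpr, fpr))
```

Raising the threshold makes the classifier more conservative, so both rates can only stay the same or fall. A threshold of 0 labels everything '1' (both rates equal 1).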

In [None]:
# example: logistic regression in scikit-learn

# Import necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# instantiate logistic regression classifier
logreg = LogisticRegression()

# split the data in training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.4, random_state=42)

# Fit the classifier to the training data
logreg.fit(X_train, y_train)

# Predict on test set
y_pred = logreg.predict(X_test)

In [None]:
# example: plot ROC curve
from sklearn.metrics import roc_curve
# y_pred_prob = predicted probabilities
# .predict_proba returns an array with 2 columns, one per target
# value (0 and 1); each column holds the probability of that label.
# We take the 2nd column (index 1): the probability that the
# label is 1
y_pred_prob = logreg.predict_proba(X_test)[:,1]
# fpr = false positive rate, tpr = true positive rate
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

plt.plot([0,1],[0,1],'k--')
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.show();

#### 3.1.a Building a logistic regression model
Building a logistic regression model

Time to build your first logistic regression model! As Hugo showed in the video, scikit-learn makes it very easy to try different models, since the Train-Test-Split/Instantiate/Fit/Predict paradigm applies to all classifiers and regressors - which are known in scikit-learn as 'estimators'. You'll see this now for yourself as you train a logistic regression model on exactly the same data as in the previous exercise. Will it outperform k-NN? There's only one way to find out!

The feature and target variable arrays X and y have been pre-loaded, and train_test_split has been imported for you from sklearn.model_selection.

In [None]:
# example: build logistic regression model - binary classification

# Import the necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state=42)

# Create the classifier: logreg
logreg = LogisticRegression()

# Fit the classifier to the training data
logreg.fit(X_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = logreg.predict(X_test)

# Compute and print the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# out
#[[176  30]
# [ 35  67]]
#             precision    recall  f1-score   support

#          0       0.83      0.85      0.84       206
#          1       0.69      0.66      0.67       102

#avg / total       0.79      0.79      0.79       308

#### 3.1.b Plotting ROC curve - visually evaluate models
Classification reports and confusion matrices are great methods to quantitatively evaluate model performance, while ROC curves provide a way to visually evaluate models. 

.predict_proba()
- Most classifiers in scikit-learn have a .predict_proba() method which returns the probability of a given sample being in a particular class. Having built a logistic regression model, you'll now evaluate its performance by plotting an ROC curve. In doing so, you'll make use of the .predict_proba() method and become familiar with its functionality.

Here, you'll continue working with the PIMA Indians diabetes dataset. The classifier has already been fit to the training data and is available as logreg.

In [None]:
# Import necessary modules
from sklearn.metrics import roc_curve

# Compute predicted probabilities: y_pred_prob
y_pred_prob = logreg.predict_proba(X_test)[:,1]

# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr) # x-axis: fpr, y-axis: tpr
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

#### 3.1.c Precision-Recall Curve
When looking at your ROC curve, you may have noticed that the y-axis (True positive rate) is also known as recall. Indeed, in addition to the ROC curve, there are other ways to visually evaluate model performance. One such way is the precision-recall curve, which is generated by plotting the precision and recall for different thresholds. As a reminder, precision and recall are defined as:

- Precision=TP/(TP+FP)
- Recall=TP/(TP+FN)

On the right, a precision-recall curve has been generated for the diabetes dataset. The classification report and confusion matrix are displayed in the IPython Shell.

Study the precision-recall curve and then consider the statements given below. Choose the one statement that is not true. Note that here, the class is positive (1) if the individual has diabetes.

             precision    recall  f1-score   support

          0       0.83      0.85      0.84       206
          1       0.69      0.66      0.67       102

    avg / total       0.79      0.79      0.79       308

    [[176  30]
     [ 35  67]]
     
     
TRUE: A recall of 1 corresponds to a classifier with a low threshold in which all females who contract diabetes were correctly classified as such, at the expense of many misclassifications of those who did not have diabetes.

TRUE: Precision is undefined for a classifier which makes no positive predictions, that is, classifies everyone as not having diabetes.

TRUE: When the threshold is very close to 1, precision is also 1, because the classifier is absolutely certain about its predictions.

FALSE: Precision and recall take true negatives into consideration.
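
The precision-recall values behind such a curve can be computed with `precision_recall_curve` from `sklearn.metrics`. The sketch below is not the course's plot: it assumes a synthetic, mildly imbalanced dataset in place of the diabetes data and prints a few points instead of drawing the figure.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes data (assumption): about one
# third of the samples belong to the positive class
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.67, 0.33], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

logreg = LogisticRegression().fit(X_train, y_train)
y_pred_prob = logreg.predict_proba(X_test)[:, 1]

# One (precision, recall) pair per threshold, plus a final point
# with precision 1 and recall 0 that has no threshold
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)

# Print a handful of points along the curve
step = max(1, len(thresholds) // 5)
for p, r, t in zip(precision[:-1:step], recall[:-1:step],
                   thresholds[::step]):
    print("threshold={:.2f}  precision={:.2f}  recall={:.2f}".format(t, p, r))

# To draw the curve: plt.plot(recall, precision)
```

Neither formula uses true negatives, which is why the curve depends only on how the positive class is handled.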

### 3.2 Area under the ROC curve (AUC)
- One popular metric for classification models
- Larger AUC = better model

In [None]:
# example: Compute AUC
from sklearn.metrics import roc_auc_score

logreg = LogisticRegression()

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                        test_size = 0.4, random_state=42)

# Fit the classifier to the training data
logreg.fit(X_train, y_train)

# Compute predicted probabilities to compute AUC
y_pred_prob = logreg.predict_proba(X_test)[:,1]
# Pass true labels and predicted probabilities to ROC AUC score
roc_auc_score(y_test, y_pred_prob)
# out: 0.997466216

In [None]:
# Compute AUC using cross-validation
from sklearn.model_selection import cross_val_score

# pass the estimator, features, and target
cv_scores = cross_val_score(logreg, X, y, cv=5, scoring='roc_auc')

print(cv_scores)
# out: [ 0.9967   0.99183  0.99583  1.   0.961406]

#### AUC computation
Say you have a binary classifier that in fact is just randomly making guesses. It would be correct approximately 50% of the time, and the resulting ROC curve would be a diagonal line in which the True Positive Rate and False Positive Rate are always equal. The Area under this ROC curve would be 0.5. This is one way in which the AUC, which Hugo discussed in the video, is an informative metric to evaluate a model. If the AUC is greater than 0.5, the model is better than random guessing. Always a good sign!
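
The 0.5 baseline can be checked numerically. A minimal sketch, assuming randomly generated labels and scores (not the diabetes data): scores unrelated to the labels give an AUC near 0.5, while scores that rank every positive above every negative give an AUC of 1.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(42)

# Random binary labels (assumption for illustration)
y_true = rng.randint(0, 2, size=10000)

# A "classifier" that guesses: its scores ignore the labels
random_scores = rng.rand(10000)
auc_random = roc_auc_score(y_true, random_scores)

# A perfect ranking: every positive scores higher than every negative
perfect_scores = y_true.astype(float)
auc_perfect = roc_auc_score(y_true, perfect_scores)

print("AUC, random guessing: {:.3f}".format(auc_random))
print("AUC, perfect ranking: {:.3f}".format(auc_perfect))
```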

In this exercise, you'll calculate AUC scores using the roc_auc_score() function from sklearn.metrics as well as by performing cross-validation on the diabetes dataset.

X and y, along with training and test sets X_train, X_test, y_train, y_test, have been pre-loaded for you, and a logistic regression classifier logreg has been fit to the training data.

In [None]:
# example: calculate AUC with 2 methods: roc_auc_score() and
# using cross_val_score()

# Import necessary modules
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

# Method 1
# Compute predicted probabilities: y_pred_prob
y_pred_prob = logreg.predict_proba(X_test)[:,1]

# Compute and print AUC score
print("AUC: {}".format(roc_auc_score(y_test, y_pred_prob)))

# Method 2
# Compute cross-validated AUC scores: cv_auc
cv_auc = cross_val_score(logreg, X, y, cv=5, scoring='roc_auc')

# Print list of AUC scores
print("AUC scores computed using 5-fold cross-validation: {}".
      format(cv_auc))

# out
#AUC: 0.8254806777079764
#AUC scores computed using 5-fold cross-validation: 
#[ 0.80148148  0.8062963   0.81481481  0.86245283  0.8554717 ]

### 3.3 Hyperparameter tuning
Review
- Linear regression: choose parameters
- Ridge/Lasso regression: choose alpha
- k-Nearest Neighbors: choose n_neighbors

Hyperparameters: parameters that need to be specified before model fitting (so can't be learned by fitting model)
- parameters like alpha and k

Choosing the correct hyperparameter
- try a bunch of different hyperparameter values
- fit all of them separately
- see how well each performs
- choose the best performing one
- essential to use cross-validation (using train-test-split alone risks overfitting hyperparameter to test set)

Methods:
- Grid search cross-validation
- Randomized Search CV
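
The steps listed under "Choosing the correct hyperparameter" can be written out as a plain loop, which is roughly what grid search automates. This is a sketch under assumptions: it uses the iris data and a small range of `n_neighbors` values chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
k_values = np.arange(1, 11)

# Manual search: cross-validate each candidate value separately
mean_scores = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5)
    mean_scores.append(scores.mean())

manual_best_k = int(k_values[int(np.argmax(mean_scores))])
manual_best_score = max(mean_scores)
print("manual search: k={}, score={:.4f}".format(
    manual_best_k, manual_best_score))

# The same search with GridSearchCV
knn_cv = GridSearchCV(KNeighborsClassifier(),
                      {'n_neighbors': k_values}, cv=5)
knn_cv.fit(X, y)
print("GridSearchCV:  {}, score={:.4f}".format(
    knn_cv.best_params_, knn_cv.best_score_))
```

Both approaches score every candidate with 5-fold cross-validation, so the best mean scores agree. GridSearchCV additionally refits the best model on all the data passed to `.fit()`.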

In [None]:
# GridSearchCV
from sklearn.model_selection import GridSearchCV
# keys are hyperparameter names like 'n_neighbors' in k-nn, or
# alpha in Ridge/Lasso regression
param_grid = {'n_neighbors': np.arange(1, 50)}
knn = KNeighborsClassifier()
# args: model, grid, number of folds for cross validation
knn_cv = GridSearchCV(knn, param_grid, cv=5)
# use this to fit data
knn_cv.fit(X, y)

knn_cv.best_params_
knn_cv.best_score_

#### 3.3.a Hyperparameter tuning with GridSearchCV
Hugo demonstrated how to tune the n_neighbors parameter of the KNeighborsClassifier() using GridSearchCV on the voting dataset. You will now practice this yourself, but by using logistic regression on the diabetes dataset instead!

Like the alpha parameter of lasso and ridge regularization that you saw earlier, logistic regression also has a regularization parameter: C. C controls the inverse of the regularization strength, and this is what you will tune in this exercise. A large C can lead to an overfit model, while a small C can lead to an underfit model.

The hyperparameter space for C has been setup for you. Your job is to use GridSearchCV and logistic regression to find the optimal C in this hyperparameter space. The feature array is available as X and target variable array is available as y.

You may be wondering why you aren't asked to split the data into training and test sets. Good observation! Here, we want you to focus on the process of setting up the hyperparameter grid and performing grid-search cross-validation. In practice, you will indeed want to hold out a portion of your data for evaluation purposes, and you will learn all about this in the next video!

In [None]:
# Tune (optimize) C regularization parameter for logistic regression
# This example doesn't have a test and training set, but normally
# you should

# Import necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression()

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit it to the data
logreg_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_)) 
print("Best score is {}".format(logreg_cv.best_score_))

# out
#Tuned Logistic Regression Parameters: {'C': 3.7275937203149381}
#Best score is 0.7708333333333334
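
To see why C is described as the inverse of the regularization strength, the sketch below fits the same model with three values of C and compares the size of the coefficients. It assumes synthetic data from `make_classification`; the C values and `max_iter=1000` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

norms = {}
for C in [0.001, 1.0, 1000.0]:
    logreg = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    # Size of the fitted coefficient vector for this C
    norms[C] = float(np.linalg.norm(logreg.coef_))
    print("C={:>8}: ||coef|| = {:.3f}".format(C, norms[C]))
```

A small C means strong regularization and small coefficients (risk of underfitting); a large C means weak regularization and larger coefficients (risk of overfitting).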

#### 3.3.b Hyperparameter tuning with RandomizedSearchCV
- faster alternative to GridSearchCV

GridSearchCV can be computationally expensive, especially if you are searching over a large hyperparameter space and dealing with multiple hyperparameters. A solution to this is to use RandomizedSearchCV, in which not all hyperparameter values are tried out. Instead, a fixed number of hyperparameter settings is sampled from specified probability distributions. You'll practice using RandomizedSearchCV in this exercise and see how this works.

Here, you'll also be introduced to a new model: the Decision Tree. Don't worry about the specifics of how this model works. Just like k-NN, linear regression, and logistic regression, decision trees in scikit-learn have .fit() and .predict() methods that you can use in exactly the same way as before. Decision trees have many parameters that can be tuned, such as max_features, max_depth, and min_samples_leaf: This makes it an ideal use case for RandomizedSearchCV.

As before, the feature array X and target variable array y of the diabetes dataset have been pre-loaded. The hyperparameter settings have been specified for you. Your goal is to use RandomizedSearchCV to find the optimal hyperparameters. Go for it!

In [None]:
# Import necessary modules
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

# Setup the parameters and distributions to sample from: param_dist
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

# Instantiate a Decision Tree classifier: tree
tree = DecisionTreeClassifier()

# Instantiate the RandomizedSearchCV object: tree_cv
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)

# Fit it to the data
tree_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))

# out
#Tuned Decision Tree Parameters: {'max_depth': 3, 
#    'criterion': 'entropy', 'min_samples_leaf': 3, 
#    'max_features': 5}
#Best score is 0.7330729166666666

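#### Note: when sampling from the same hyperparameter space, RandomizedSearchCV cannot beat an exhaustive GridSearchCV, since it tries only a subset of the candidates, but it saves computation time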
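
The time saving comes from the `n_iter` argument of RandomizedSearchCV, which fixes how many settings are sampled (10 by default). A minimal sketch, assuming synthetic data from `make_classification` and `n_iter=5` chosen for illustration:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 8 features (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Same search space as above: 2 * 8 * 8 * 2 = 256 combinations
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

# Sample only 5 settings; random_state makes the draw repeatable
tree_cv = RandomizedSearchCV(DecisionTreeClassifier(random_state=42),
                             param_dist, n_iter=5, cv=5, random_state=42)
tree_cv.fit(X, y)

n_tried = len(tree_cv.cv_results_["params"])
print("Settings tried: {} of 256 possible".format(n_tried))
print("Best of those tried: {}".format(tree_cv.best_params_))
```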

### 3.4 Hold-out set for final evaluation

#### 3.4.a Hold-out set reasoning
- Use test data set to evaluate model performance
- Using all data for cross-validation is not ideal: no unseen data remains to estimate how the tuned model performs on new data
- split data into training and hold-out (test) set at beginning
- perform grid search cross-validation on training set
- choose best hyperparameters and evaluate on hold-out set

#### 3.4.b Hold-out set in practice I: Classification
You will now practice evaluating a model with tuned hyperparameters on a hold-out set. The feature array and target variable array from the diabetes dataset have been pre-loaded as X and y.

In addition to C, logistic regression has a 'penalty' hyperparameter which specifies whether to use 'l1' or 'l2' regularization. 

Your job in this exercise is to create a hold-out set and tune the 'C' and 'penalty' hyperparameters of a logistic regression classifier using GridSearchCV on the training set.

In [None]:
# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Create the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space, 'penalty': ['l1', 'l2']}

# Instantiate the logistic regression classifier: logreg
# (the 'l1' penalty needs a solver that supports it, such as 'liblinear')
logreg = LogisticRegression(solver='liblinear')

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,
test_size=0.4, random_state=42)

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit it to the training data
logreg_cv.fit(X_train, y_train)

# Print the optimal parameters and best score
print("Tuned Logistic Regression Parameter: {}".format(logreg_cv.best_params_))
print("Tuned Logistic Regression Accuracy: {}".format(logreg_cv.best_score_))

# out
# Tuned Logistic Regression Parameter: {'C': 0.43939705607607948, 'penalty': 'l1'}
# Tuned Logistic Regression Accuracy: 0.7652173913043478

#### 3.4.c Hold-out set in practice II: Regression
Remember lasso and ridge regression from the previous chapter? Lasso used the L1 penalty to regularize, while ridge used the L2 penalty. There is another type of regularized regression known as the elastic net. 

In elastic net regularization, the penalty term is a linear combination of the L1 and L2 penalties:
- a∗L1+b∗L2

In scikit-learn, this term is represented by the 'l1_ratio' parameter: An 'l1_ratio' of 1 corresponds to an L1 penalty, and anything lower is a combination of L1 and L2.
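
In scikit-learn's ElasticNet, the penalty is `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`, so `l1_ratio` sets the mix and `alpha` the overall strength. The sketch below checks the `l1_ratio=1` case against Lasso. It assumes synthetic data from `make_regression`, and `alpha=0.5` is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic regression data (assumption for illustration)
X, y = make_regression(n_samples=100, n_features=5, noise=10.0,
                       random_state=0)

lasso = Lasso(alpha=0.5).fit(X, y)
enet_l1 = ElasticNet(alpha=0.5, l1_ratio=1.0).fit(X, y)   # pure L1
enet_mix = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)  # L1 + L2 mix

# With l1_ratio=1 the elastic net reduces to Lasso
same_as_lasso = np.allclose(lasso.coef_, enet_l1.coef_)
print("l1_ratio=1 matches Lasso:", same_as_lasso)
print("Lasso coefficients:      ", np.round(lasso.coef_, 2))
print("l1_ratio=0.5 coefficients:", np.round(enet_mix.coef_, 2))
```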

In this exercise, you will use GridSearchCV to tune the 'l1_ratio' of an elastic net model trained on the Gapminder data. As in the previous exercise, use a hold-out set to evaluate your model's performance.

In [None]:
# Import necessary modules
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
test_size=0.4, random_state=42)

# Create the hyperparameter grid
l1_space = np.linspace(0, 1, 30)
param_grid = {'l1_ratio': l1_space}

# Instantiate the ElasticNet regressor: elastic_net
elastic_net = ElasticNet()

# Setup the GridSearchCV object: gm_cv
# Use GridSearchCV with 5-fold cross-validation to tune 'l1_ratio' 
# on the training data X_train and y_train. This involves 
# first instantiating the GridSearchCV object with the correct 
# parameters and then fitting it to the training data.
gm_cv = GridSearchCV(elastic_net, param_grid, cv=5)

# Fit it to the training data
gm_cv.fit(X_train, y_train)

# Predict on the test set and compute metrics
y_pred = gm_cv.predict(X_test)
r2 = gm_cv.score(X_test, y_test)
mse = mean_squared_error(y_test, y_pred)
print("Tuned ElasticNet l1 ratio: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))
print("Tuned ElasticNet MSE: {}".format(mse))

# out
# Tuned ElasticNet l1 ratio: {'l1_ratio': 0.20689655172413793}
# Tuned ElasticNet R squared: 0.8668305372460283
# Tuned ElasticNet MSE: 10.05791413339844

## 4. Preprocessing and Pipelines

### 4.1 Preprocessing Data
Dealing with categorical features
- scikit-learn will not accept categorical features
- need to encode these features numerically
- convert to 'dummy variables'
    - 0: Observation was NOT that category
    - 1: Observation was that category
- example: 3 origins for a car
    - origin_Asia 0 or 1
    - origin_Europe 0 or 1
    - origin_USA 0 or 1
    - one column is redundant: a car that is from neither Europe nor the USA must be from Asia, so one dummy column (e.g. origin_Asia) can be dropped; the duplicated information can cause issues in some models
    
Dealing with categorical features:
- scikit-learn: OneHotEncoder()
- pandas: get_dummies()
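
Both options can be tried on a toy DataFrame. The data below is hypothetical (four cars with an `mpg` and an `origin` column), made up so that the sketch runs without any CSV file.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data
df = pd.DataFrame({"mpg": [18.0, 26.0, 33.5, 24.0],
                   "origin": ["USA", "Europe", "Asia", "USA"]})

# pandas: one dummy column per category; numeric columns are kept
df_origin = pd.get_dummies(df)
print(list(df_origin.columns))

# drop_first=True drops the first category (here origin_Asia),
# which is implied when all remaining dummies are 0
df_reduced = pd.get_dummies(df, drop_first=True)
print(list(df_reduced.columns))

# scikit-learn: OneHotEncoder returns a sparse matrix by default,
# so convert it to a dense array to inspect it
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["origin"]]).toarray()
print(encoder.categories_[0])
print(encoded)
```

`get_dummies` is convenient for quick exploration in pandas, while `OneHotEncoder` remembers the categories it was fit on, which helps when the same encoding has to be applied to new data.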

Boxplots are useful in visualizing categorical features

In [None]:
# Example: dummy variables with pandas get_dummies()
# convert origins for car (noted above) into dummy variables
import pandas as pd
df= pd.read_csv('auto.csv')
df_origin = pd.get_dummies(df)
print(df_origin.head())
# drop origin_Asia column since we can infer it from the others
df_origin = df_origin.drop('origin_Asia', axis=1)
print(df_origin.head())

# Ridge Linear regression with dummy variables
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
X_train, X_test, y_train, y_test = train_test_split(X, y,
        test_size=0.3, random_state=42)
ridge = Ridge(alpha=0.5, normalize=True).fit(X_train, y_train)
ridge.score(X_test, y_test)
# out: 0.7190645
# can compute R^2

#### 4.1.a Exploring categorical features
The Gapminder dataset that you worked with in previous chapters also contained a categorical 'Region' feature, which we dropped in previous exercises since you did not have the tools to deal with it. Now however, you do, so we have added it back in!

Your job in this exercise is to explore this feature. 

Boxplots are particularly useful for visualizing categorical features such as this.

#### 4.1.b 

#### 4.1.c 

### 4.2 Handling Missing Data

#### 4.2.a 

#### 4.2.b 

#### 4.2.c 

### 4.3 Centering and Scaling


#### 4.3.a 

#### 4.3.b 

#### 4.3.c 

#### 4.3.d 

# Supervised learning with scikit-learn - sklearn
1. Classification
2. Regression
3. Fine-tuning model
4. Preprocessing and Pipelines

Background:
- What is machine learning? Giving computers the ability to learn to make decisions from Data without being explicitly programmed.
- Supervised learning - labeled data
- Unsupervised learning - uncovering hidden patterns from unlabeled data
- Reinforcement learning - software agents interact with an environment; learn how to optimize their behavior, given a system of rewards and punishments; draws inspiration from behavioral psychology. E.g. AlphaGo - 1st computer program to defeat a world champion in Go

Supervised learning
- predictor variables/features and a target variable
- Aim: predict the target variable, given the predictor variables (e.g. target variable: species, predictor variables: sepal length and width)
- Classification: target variable consists of categories
- Regression: Target variable is continuous

Naming conventions:
- Features = predictor variables = independent variables
- Target variable = dependent variable = response variable

Goals of Supervised learning:
- Automate time-consuming or expensive manual tasks (e.g. a doctor's diagnosis)
- Make predictions about the future (e.g. will a customer click an ad or not)
- Need labeled data (e.g. historical data with labels, experiments to get labeled data like click on ad, crowd-sourcing labeled data)

Tools:
- scikit-learn/sklearn - integrates well with the SciPy stack, including NumPy
- other libraries: TensorFlow, Keras

## 1. Classification

### a. EDA

In [None]:
from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')

# load dataset
iris = datasets.load_iris()
type(iris)
# out: sklearn.datasets.base.Bunch
# a Bunch is like a dictionary
print(iris.keys())
# out: dict_keys(['data','target_names','DESCR','feature_names','target'])
# the data and target are numpy arrays
iris.data.shape
# out: (150, 4) # 150 samples and 4 features
iris.target_names
# out: array(['setosa','versicolor','virginica'], dtype='<U10')

# initial EDA
X = iris.data
y = iris.target
df = pd.DataFrame(X, columns=iris.feature_names)
print(df.head())

# Visual EDA: c sets the color (by target), s the marker size
_ = pd.plotting.scatter_matrix(df, c=y, figsize=[8,8], s=150, marker='D')


#### i. Numerical EDA
In this chapter, you'll be working with a dataset obtained from the UCI Machine Learning Repository consisting of votes made by US House of Representatives Congressmen. Your goal will be to predict their party affiliation ('Democrat' or 'Republican') based on how they voted on certain key issues. Here, it's worth noting that we have preprocessed this dataset to deal with missing values. This is so that your focus can be directed towards understanding how to train and evaluate supervised learning models. Once you have mastered these fundamentals, you will be introduced to preprocessing techniques in Chapter 4 and have the chance to apply them there yourself - including on this very same dataset!

Before thinking about what supervised learning models you can apply to this, however, you need to perform Exploratory data analysis (EDA) in order to understand the structure of the data. For a refresher on the importance of EDA, check out the first two chapters of Statistical Thinking in Python (Part 1).

Get started with your EDA now by exploring this voting records dataset numerically. It has been pre-loaded for you into a DataFrame called df. Use pandas' .head(), .info(), and .describe() methods in the IPython Shell to explore the DataFrame, and select the statement below that is not true.

In [None]:
# explore structure of data
df.head()
df.info()
df.describe()

#### ii. Visual EDA
The Numerical EDA you did in the previous exercise gave you some very important information, such as the names and data types of the columns, and the dimensions of the DataFrame. Following this with some visual EDA will give you an even better understanding of the data. In the video, Hugo used the scatter_matrix() function on the Iris data for this purpose. However, you may have noticed in the previous exercise that all the features in this dataset are binary; that is, they are either 0 or 1. So a different type of plot would be more useful here, such as Seaborn's countplot.

Given on the right is a countplot of the 'education' bill, generated from the following code:

plt.figure()

sns.countplot(x='education', hue='party', data=df, palette='RdBu')

plt.xticks([0,1], ['No', 'Yes'])

plt.show()

In sns.countplot(), we specify the x-axis data to be 'education', and hue to be 'party'. Recall that 'party' is also our target variable. So the resulting plot shows the difference in voting behavior between the two parties for the 'education' bill, with each party colored differently. We manually specified the color to be 'RdBu', as the Republican party has been traditionally associated with red, and the Democratic party with blue.

It seems like Democrats voted resoundingly against this bill, compared to Republicans. This is the kind of information that our machine learning model will seek to learn when we try to predict party affiliation solely based on voting behavior. An expert in U.S politics may be able to predict this without machine learning, but probably not instantaneously - and certainly not if we are dealing with hundreds of samples!
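
The bar heights in a countplot are just counts per (vote, party) pair, so the same information can be tabulated numerically. The sketch below uses `pd.crosstab` on a small made-up DataFrame; the name `toy` and its values are assumptions, not the course data.

```python
import pandas as pd

# Hypothetical votes on one bill (0 = No, 1 = Yes); not the real dataset
toy = pd.DataFrame({
    'party': ['democrat', 'democrat', 'democrat',
              'republican', 'republican', 'republican'],
    'education': [0, 0, 1, 1, 1, 0],
})

# Rows: vote, columns: party; each cell is one bar of the countplot
counts = pd.crosstab(toy['education'], toy['party'])
print(counts)
```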

In the IPython Shell, explore the voting behavior further by generating countplots for the 'satellite' and 'missile' bills, and answer the following question: For which of these two bills do Democrats vote resoundingly in favor, compared to Republicans? Be sure to begin the plotting statements for each figure with plt.figure() so that a new figure is set up. Otherwise, your plots will be overlaid on the same figure.

In [None]:
# generate countplots for 'satellite' and 'missile' bills

# satellite bill
plt.figure()
sns.countplot(x='satellite', hue='party', data=df, palette='RdBu')
plt.xticks([0,1], ['No', 'Yes'])
plt.show()
# republicans 'no', democrats 'yes'

# missile bill
plt.figure()
sns.countplot(x='missile', hue='party', data=df, palette='RdBu')
plt.xticks([0,1], ['No', 'Yes'])
plt.show()
# republicans 'no', democrats 'yes'

### b. The classification challenge
- Training data: already labeled data

k-Nearest Neighbors
- idea is to predict the label of a data point by looking at the 'k' closest labeled data points

Training a model on the data = 'fitting' a model to the data
- .fit() method

Predicting labels of new data
- .predict() method

In [None]:
# Using scikit-learn to fit a classifier
from sklearn.neighbors import KNeighborsClassifier
# set 'k', number of neighbors to 6
knn = KNeighborsClassifier(n_neighbors=6)
# fit classifier to training set with args: features, target
# requires args to be Numpy array or Pandas dataframe
# requires no missing values
knn.fit(iris['data'], iris['target'])
# out: KNeighborsClassifier(algorithm='auto', leaf_size=30,
# metric='minkowski', metric_params=None, n_jobs=1,
# n_neighbors=6, p=2, weights='uniform')

# check out iris data
iris['data'].shape
# out: (150, 4)

# target has to be same # rows as feature data
iris['target'].shape
# out: (150,)

# predict on unlabeled data
# X_new: array of new, unlabeled observations (assumed to be defined already)
prediction = knn.predict(X_new)
X_new.shape
# out: (3, 4)
print('Prediction: {}'.format(prediction))
# out: Prediction: [1 1 0]
# which means 1=versicolor for the first 2 observations, and 0=setosa for the third
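
To make the "look at the k closest labeled points" idea concrete, here is a minimal from-scratch sketch of the majority vote, using Euclidean distance. The helper `knn_predict_one` and the query point `x_query` are hypothetical names introduced for illustration; scikit-learn's own implementation is more elaborate (for example in how it searches for neighbors and breaks ties).

```python
import numpy as np
from collections import Counter
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

def knn_predict_one(X_train, y_train, x, k):
    """Predict one label by majority vote among the k nearest training points."""
    # Euclidean distance from x to every training point
    distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

iris = datasets.load_iris()
X, y = iris['data'], iris['target']

# Made-up measurements that resemble a setosa flower
x_query = np.array([5.0, 3.4, 1.5, 0.2])

manual = knn_predict_one(X, y, x_query, k=6)
library = KNeighborsClassifier(n_neighbors=6).fit(X, y).predict([x_query])[0]
print(manual, library)
```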

#### i. k-Nearest Neighbors: Fit
Having explored the Congressional voting records dataset, it is time now to build your first classifier. In this exercise, you will fit a k-Nearest Neighbors classifier to the voting dataset, which has once again been pre-loaded for you into a DataFrame df.

In the video, Hugo discussed the importance of ensuring your data adheres to the format required by the scikit-learn API. The features need to be in an array where each column is a feature and each row a different observation or data point - in this case, a Congressman's voting record. The target needs to be a single column with the same number of observations as the feature data. We have done this for you in this exercise. Notice we named the feature array X and response variable y: This is in accordance with the common scikit-learn practice.

Your job is to create an instance of a k-NN classifier with 6 neighbors (by specifying the n_neighbors parameter) and then fit it to the data. The data has been pre-loaded into a DataFrame called df.

In [None]:
# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier

# Create arrays for the features and the response variable
# Note sklearn practice: X for feature array, y for response variable
# Note: the '.values' attribute returns NumPy arrays
y = df['party'].values
X = df.drop('party', axis=1).values

# Create a k-NN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the data
knn.fit(X, y)

#Out[1]: 
#KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
#           metric_params=None, n_jobs=1, n_neighbors=6, p=2,
#           weights='uniform')

#### ii. k-Nearest Neighbors: Predict
Having fit a k-NN classifier, you can now use it to predict the label of a new data point. However, there is no unlabeled data available since all of it was used to fit the model! You can still use the .predict() method on the X that was used to fit the model, but it is not a good indicator of the model's ability to generalize to new, unseen data.

In the next video, Hugo will discuss a solution to this problem. For now, a random unlabeled data point has been generated and is available to you as X_new. You will use your classifier to predict the label for this new data point, as well as on the training data X that the model has already seen. Using .predict() on X_new will generate 1 prediction, while using it on X will generate 435 predictions: 1 for each sample.

The DataFrame has been pre-loaded as df. This time, you will create the feature array X and target variable array y yourself.

In [None]:
# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier 

# Create arrays for the features and the response variable
y = df['party'].values
X = df.drop('party', axis=1).values

# Create a k-NN classifier with 6 neighbors: knn
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the data
knn.fit(X, y)

# Predict the labels for the training data X
y_pred = knn.predict(X)

# Predict and print the label for the new data point X_new
new_prediction = knn.predict(X_new)
print("Prediction: {}".format(new_prediction))

# out: Prediction: ['democrat']

### c. measuring Model performance
- accuracy - a commonly used metric of how well a model generalizes
- accuracy = Fraction of correct predictions on new data

Split data into training and test set
- Fit/train the classifier on the training set
- Make predictions on test set

In [None]:
# Train/Test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y)

# Create a k-NN classifier with 8 neighbors
knn = KNeighborsClassifier(n_neighbors=8)

# Fit the classifier to the data
knn.fit(X_train, y_train)

# Predict the labels for the test data X_test
y_pred = knn.predict(X_test)

print("Test set predictions:\n {}".format(y_pred))

# check accuracy
knn.score(X_test, y_test)
# out: 0.9555555556
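
For a classifier, `.score()` is simply the fraction of test predictions that match the true labels. The sketch below checks this on the Iris data with the same split settings as above; it is an illustration, and the exact accuracy you see may differ from the value quoted in these notes.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris['data'], iris['target'],
    test_size=0.3, random_state=21, stratify=iris['target'])

knn = KNeighborsClassifier(n_neighbors=8).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Accuracy by hand: share of predictions equal to the true label
manual_accuracy = np.mean(y_pred == y_test)
library_accuracy = knn.score(X_test, y_test)
print(manual_accuracy, library_accuracy)
```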

Model complexity for KNN
- larger k = smoother decision boundary = less complex model
- smaller k = more complex model = can lead to overfitting and sensitive to noise
- Model complexity curve - shows over/underfitting with too small or large k

#### i. the digits recognition dataset
Up until now, you have been performing binary classification, since the target variable had two possible outcomes. Hugo, however, got to perform multi-class classification in the videos, where the target variable could take on three possible outcomes. Why does he get to have all the fun?! In the following exercises, you'll be working with the MNIST digits recognition dataset, which has 10 classes, the digits 0 through 9! A reduced version of the MNIST dataset is one of scikit-learn's included datasets, and that is the one we will use in this exercise.

Each sample in this scikit-learn dataset is an 8x8 image representing a handwritten digit. Each pixel is represented by an integer in the range 0 to 16, indicating varying levels of black. Recall that scikit-learn's built-in datasets are of type Bunch, which are dictionary-like objects. Helpfully for the MNIST dataset, scikit-learn provides an 'images' key in addition to the 'data' and 'target' keys that you have seen with the Iris data. Because it is a 2D array of the images corresponding to each sample, this 'images' key is useful for visualizing the images, as you'll see in this exercise (for more on plotting 2D arrays, see Chapter 2 of DataCamp's course on Data Visualization with Python). On the other hand, the 'data' key contains the feature array - that is, the images as a flattened array of 64 pixels.

Notice that you can access the keys of these Bunch objects in two different ways: By using the . notation, as in digits.images, or the [] notation, as in digits['images'].
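
A short sketch of both access styles, and of how the 'images' and 'data' keys relate: flattening each 8x8 image should give back the corresponding 64-value row of the feature array. The variable names below are chosen for illustration.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Dot notation and bracket notation return the same array
images_dot = digits.images
images_key = digits['images']

# Flatten each 8x8 image into a row of 64 pixel values
flattened = digits.images.reshape(len(digits.images), -1)

print(images_dot.shape, digits.data.shape)
print(np.array_equal(flattened, digits.data))
```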

For more on the MNIST data, check out this exercise in Part 1 of DataCamp's Importing Data in Python course. There, the full version of the MNIST dataset is used, in which the images are 28x28. It is a famous dataset in machine learning and computer vision, and frequently used as a benchmark to evaluate the performance of a new model.

In [None]:
# Import necessary modules
from sklearn import datasets
import matplotlib.pyplot as plt

# Load the digits dataset: digits
digits = datasets.load_digits()

# Print the keys and DESCR of the dataset
print(digits.keys())
print(digits.DESCR)

# Print the shape of the images and data keys
print(digits.images.shape)
print(digits.data.shape)

# Display digit 1010
plt.imshow(digits.images[1010], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

# it shows the hand-written number 5

#### ii. Train/Test Split + Fit/Predict/Accuracy
- build a classifier that can make this prediction not only for this image, but for all the other ones in the dataset

Now that you have learned about the importance of splitting your data into training and test sets, it's time to practice doing this on the digits dataset! After creating arrays for the features and target variable, you will split them into training and test sets, fit a k-NN classifier to the training data, and then compute its accuracy using the .score() method.

In [None]:
# Import necessary modules
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.model_selection import train_test_split

# Create feature and target arrays
X = digits.data
y = digits.target

# Split into training and test set
# stratify: Stratify the split according to the labels so that 
# they are distributed in the training and test sets as they are 
# in the original dataset.
X_train, X_test, y_train, y_test = train_test_split(X, y, 
            test_size = 0.2, random_state=42, stratify=y)

# Create a k-NN classifier with 7 neighbors: knn
knn = KNeighborsClassifier(n_neighbors=7)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Print the accuracy
print(knn.score(X_test, y_test))

# out: 0.983333333333
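
To see what `stratify=y` buys you, the sketch below compares the share of each digit class in the full data with its share in the training and test sets. The variable names (`full_share`, `max_gap`, and so on) are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X, y = digits.data, digits.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Fraction of samples belonging to each of the 10 classes
full_share = np.bincount(y) / len(y)
train_share = np.bincount(y_train) / len(y_train)
test_share = np.bincount(y_test) / len(y_test)

# Largest deviation from the original class proportions
max_gap = max(np.abs(train_share - full_share).max(),
              np.abs(test_share - full_share).max())
print(max_gap)
```

With stratification the gap stays very small; without it, a class can end up noticeably over- or under-represented in the test set.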

#### iii. Overfitting and Underfitting
Remember the model complexity curve that Hugo showed in the video? You will now construct such a curve for the digits dataset! In this exercise, you will compute and plot the training and testing accuracy scores for a variety of different neighbor values. By observing how the accuracy scores differ for the training and testing sets with different values of k, you will develop your intuition for overfitting and underfitting.

The training and testing sets are available to you in the workspace as X_train, X_test, y_train, y_test. In addition, KNeighborsClassifier has been imported from sklearn.neighbors.

In [None]:
# create model complexity curve for different k values in knn

# Setup arrays to store train and test accuracies
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Loop over different values of k
for i, k in enumerate(neighbors):
    # Setup a k-NN Classifier with k neighbors: knn
    knn = KNeighborsClassifier(n_neighbors=k)

    # Fit the classifier to the training data
    knn.fit(X_train, y_train)
    
    #Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)

    #Compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()


Conclusion: It looks like the test accuracy is highest when using 3 and 5 neighbors. Using 8 neighbors or more seems to result in a simple model that underfits the data. 

## 2. Regression
- continuous variables

In [None]:
# Boston housing data example
boston = pd.read_csv('boston.csv')
print(boston.head())

# creating feature and target arrays
X = boston.drop('MEDV', axis=1).values
y = boston['MEDV'].values

# predict house value from a single feature
X_rooms = X[:,5]
# check type: they are both NumPy arrays
type(X_rooms), type(y)
y = y.reshape(-1,1)
X_rooms = X_rooms.reshape(-1,1)
# plot house value vs number of rooms
plt.scatter(X_rooms, y)
plt.ylabel('Value of house /1000 ($)')
plt.xlabel('Number of rooms')
plt.show()

# Fit a regression model
import numpy as np
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(X_rooms, y)
prediction_space = np.linspace(min(X_rooms),
                              max(X_rooms)).reshape(-1,1)
plt.scatter(X_rooms, y, color='blue')
plt.plot(prediction_space, reg.predict(prediction_space),
        color='black', linewidth=3)
plt.show()

### a. Importing data for supervised learning
In this chapter, you will work with Gapminder data that we have consolidated into one CSV file available in the workspace as 'gapminder.csv'. Specifically, your goal will be to use this data to predict the life expectancy in a given country based on features such as the country's GDP, fertility rate, and population. As in Chapter 1, the dataset has been preprocessed.

Since the target variable here is quantitative, this is a regression problem. To begin, you will fit a linear regression with just one feature: 'fertility', which is the average number of children a woman in a given country gives birth to. In later exercises, you will use all the features to build regression models.

Before that, however, you need to import the data and get it into the form needed by scikit-learn. This involves creating feature and target variable arrays. Furthermore, since you are going to use only one feature to begin with, you need to do some reshaping using NumPy's .reshape() method. Don't worry too much about this reshaping right now, but it is something you will have to do occasionally when working with scikit-learn so it is useful to practice.

In [None]:
# Import numpy and pandas
import numpy as np
import pandas as pd

# Read the CSV file into a DataFrame: df
df = pd.read_csv('gapminder.csv')

# Create arrays for features and target variable
y = df['life'].values
X = df['fertility'].values

# Print the dimensions of X and y before reshaping
print("Dimensions of y before reshaping: {}".format(y.shape))
print("Dimensions of X before reshaping: {}".format(X.shape))

# Reshape X and y
y = y.reshape(-1,1)
X = X.reshape(-1,1)

# Print the dimensions of X and y after reshaping
print("Dimensions of y after reshaping: {}".format(y.shape))
print("Dimensions of X after reshaping: {}".format(X.shape))


<script.py> output:
    Dimensions of y before reshaping: (139,)
    Dimensions of X before reshaping: (139,)
    Dimensions of y after reshaping: (139, 1)
    Dimensions of X after reshaping: (139, 1)

Notice the differences in shape before and after applying the .reshape() method. Getting the feature and target variable arrays into the right format for scikit-learn is an important precursor to model building.
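
A minimal sketch of why the reshape matters, using made-up numbers: scikit-learn estimators expect the feature array to be 2-D (rows are samples, columns are features), so a single feature has to be turned into one column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

values = np.array([1.0, 2.0, 3.0, 4.0])   # made-up single feature, shape (4,)
target = np.array([2.0, 4.0, 6.0, 8.0])   # made-up target

# -1 lets NumPy infer the number of rows; 1 means a single column
column = values.reshape(-1, 1)
print(values.shape, column.shape)

# A 1-D feature array is rejected by fit(); the 2-D column is accepted
try:
    LinearRegression().fit(values, target)
    rejected_1d = False
except ValueError:
    rejected_1d = True

reg = LinearRegression().fit(column, target)
print(rejected_1d, reg.coef_)
```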

### b. Exploring the Gapminder data
As always, it is important to explore your data before building models. On the right, we have constructed a heatmap showing the correlation between the different features of the Gapminder dataset, which has been pre-loaded into a DataFrame as df and is available for exploration in the IPython Shell. Cells that are in green show positive correlation, while cells that are in red show negative correlation. Take a moment to explore this: Which features are positively correlated with life, and which ones are negatively correlated? Does this match your intuition?

Then, in the IPython Shell, explore the DataFrame using pandas methods such as .info(), .describe(), .head().

In case you are curious, the heatmap was generated using Seaborn's heatmap function (http://seaborn.pydata.org/generated/seaborn.heatmap.html) and the following line of code, where df.corr() computes the pairwise correlation between columns:

sns.heatmap(df.corr(), square=True, cmap='RdYlGn')

Once you have a feel for the data, consider the statements below and select the one that is not true. After this, Hugo will explain the mechanics of linear regression in the next video and you will be on your way building regression models!

In [None]:
# explore df
df.info()
df.describe()
df.head()

### c. Basics of linear regression

Regression mechanics:

y = ax + b
- y = target
- x = single feature
- a, b = parameters of model

How do we choose a and b?
1. Define an error function (loss/cost function) for any given line
2. Choose the line that minimizes the error function

The loss function: Ordinary Least Squares (OLS) - minimize sum of squares of residuals
- residual = vertical distance between dot (data) and regression line
- same as minimizing the mean squared error on the training set
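
For a single feature, the line that minimizes the sum of squared residuals has a well-known closed form: a = cov(x, y) / var(x) and b = mean(y) - a * mean(x). The sketch below applies it to synthetic data (the true slope 2.5 and intercept 1.0 are made up) and compares the result with `LinearRegression`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y is roughly 2.5 * x + 1.0 plus noise (assumed values)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + 1.0 + rng.normal(scale=0.5, size=100)

# Closed-form OLS estimates for slope a and intercept b
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

# Same fit via scikit-learn
reg = LinearRegression().fit(x.reshape(-1, 1), y)
print(a, reg.coef_[0])
print(b, reg.intercept_)
```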


Linear regression in higher dimensions:

y = a1*x1 + a2*x2 + b
- To fit a linear regression model here, you need to specify 3 parameters: a1, a2, and b
- In higher dimensions, you must specify a coefficient for each feature (ai) and the intercept b

In [None]:
# Example: Linear regression on all features
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.3, random_state=42)

reg_all = linear_model.LinearRegression()
reg_all.fit(X_train, y_train)
y_pred = reg_all.predict(X_test)
reg_all.score(X_test, y_test)

# R2 ("R squared") - default scoring method for linear regression

# Note: usually will use Regularization instead of 
# linear regression like this to put further constraints on
# model coefficients
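
R^2 can be computed by hand as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the true values from their mean. The sketch below checks this against `.score()` on synthetic data from `make_regression`; the data is an assumption for illustration, not the Boston or Gapminder set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression problem (illustration only)
X, y = make_regression(n_samples=200, n_features=3, noise=10.0,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

ss_res = np.sum((y_test - y_pred) ** 2)          # unexplained variation
ss_tot = np.sum((y_test - y_test.mean()) ** 2)   # total variation
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, reg.score(X_test, y_test))
```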

#### i. Fit and Predict for regression (1 feature)
Now, you will fit a linear regression and predict life expectancy using just one feature. You saw Andy do this earlier using the 'RM' feature of the Boston housing dataset. In this exercise, you will use the 'fertility' feature of the Gapminder dataset. Since the goal is to predict life expectancy, the target variable here is 'life'. The array for the target variable has been pre-loaded as y and the array for 'fertility' has been pre-loaded as X_fertility.

A scatter plot with 'fertility' on the x-axis and 'life' on the y-axis has been generated. As you can see, there is a strongly negative correlation, so a linear regression should be able to capture this trend. Your job is to fit a linear regression and then predict the life expectancy, overlaying these predicted values on the plot to generate a regression line. You will also compute and print the R2 (R-squared) score using scikit-learn's .score() method.

In [None]:
# Import LinearRegression
from sklearn.linear_model import LinearRegression

# Create the regressor: reg
reg = LinearRegression()

# Create the prediction space
prediction_space = np.linspace(min(X_fertility), 
                               max(X_fertility)).reshape(-1,1)

# Fit the model to the data
reg.fit(X_fertility, y)

# Compute predictions over the prediction space: y_pred
y_pred = reg.predict(prediction_space)

# Print R^2 
print(reg.score(X_fertility, y))

# Plot regression line
plt.plot(prediction_space, y_pred, color='black', linewidth=3)
plt.show()

# out: 0.619244216774

#### ii. Train/Test split for regression
As you learned in Chapter 1, train and test sets are vital to ensure that your supervised learning model is able to generalize well to new data. This was true for classification models, and is equally true for linear regression models.

In this exercise, you will split the Gapminder dataset into training and testing sets, and then fit and predict a linear regression over all features. In addition to computing the R2 score, you will also compute the Root Mean Squared Error (RMSE), which is another commonly used metric to evaluate regression models. The feature array X and target variable array y have been pre-loaded for you from the DataFrame df.

In [None]:
# Import necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.3,
                                                    random_state=42)

# Create the regressor: reg_all
reg_all = LinearRegression()

# Fit the regressor to the training data
reg_all.fit(X_train, y_train)

# Predict on the test data: y_pred
y_pred = reg_all.predict(X_test)

# Compute and print R^2 and RMSE
print("R^2: {}".format(reg_all.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))

# out:
# R^2: 0.838046873142936
# Root Mean Squared Error: 3.2476010800377213

Using all features has improved the model score. This makes sense, as the model has more information to learn from. However, there is one potential pitfall to this process. Can you spot it? You'll learn about this, as well as how to better validate your models, next.

### d. Cross-validation
Motivation:
- model performance is dependent on the way the data is split
- a single split may not be representative of the model's ability to generalize

Note:
- k folds = k-fold CV
- Tradeoff: more folds = more computationally expensive

In [None]:
# example: cross-validation
from sklearn.model_selection import cross_val_score
reg = linear_model.LinearRegression()
# specify cv = number of folds utilized
cv_results = cross_val_score(reg, X, y, cv=5)
print(cv_results)
# compute mean
np.mean(cv_results)
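
What `cross_val_score` does can be sketched by hand: split the data into k folds, hold each fold out once as the test set, train on the rest, and record the score. The version below uses `KFold` on synthetic data (an assumption for illustration); for a regressor with an integer `cv`, the two approaches should give matching scores.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustration only)
X, y = make_regression(n_samples=100, n_features=4, noise=5.0,
                       random_state=1)

manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    # Train on 4 folds, score (R^2) on the held-out fold
    reg = LinearRegression().fit(X[train_idx], y[train_idx])
    manual_scores.append(reg.score(X[test_idx], y[test_idx]))

library_scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(manual_scores)
print(library_scores)
```

Each of the k iterations fits a fresh model, which is why more folds cost more compute.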

#### i. 5-fold cross-validation
Cross-validation is a vital step in evaluating a model. It maximizes the amount of data that is used to train the model, as during the course of training, the model is not only trained, but also tested on all of the available data.

In this exercise, you will practice 5-fold cross validation on the Gapminder data. By default, scikit-learn's cross_val_score() function uses R^2 as the metric of choice for regression. Since you are performing 5-fold cross-validation, the function will return 5 scores. Your job is to compute these 5 scores and then take their average.

The DataFrame has been loaded as df and split into the feature/target variable arrays X and y. The modules pandas and numpy have been imported as pd and np, respectively.

In [None]:
# Import the necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Create a linear regression object: reg
reg = LinearRegression()

# Compute 5-fold cross-validation scores: cv_scores
cv_scores = cross_val_score(reg, X, y, cv=5)

# Print the 5-fold cross-validation scores
print(cv_scores)

print("Average 5-Fold CV Score: {}".format(np.mean(cv_scores)))

# out
#[ 0.81720569  0.82917058  0.90214134  0.80633989  0.94495637]
#Average 5-Fold CV Score: 0.8599627722793232

#### ii. K-Fold CV comparison
Cross validation is essential but do not forget that the more folds you use, the more computationally expensive cross-validation becomes. In this exercise, you will explore this for yourself. Your job is to perform 3-fold cross-validation and then 10-fold cross-validation on the Gapminder dataset.

In the IPython Shell, you can use %timeit to see how long 3-fold CV takes compared to 10-fold CV by executing the following with cv=3 and cv=10:

%timeit cross_val_score(reg, X, y, cv = ____)

pandas and numpy are available in the workspace as pd and np. The DataFrame has been loaded as df and the feature/target variable arrays X and y have been created.

In [None]:
# Import necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Create a linear regression object: reg
reg = LinearRegression()

# Perform 3-fold CV
cvscores_3 = cross_val_score(reg, X, y, cv=3)
print(np.mean(cvscores_3))

# Perform 10-fold CV
cvscores_10 = cross_val_score(reg, X, y, cv=10)
print(np.mean(cvscores_10))

# out
# 0.871871278262
# 0.843612862013

In [None]:
# time 3 vs 10 fold CV to see increase in computational expense/time
%timeit cross_val_score(reg, X, y, cv=3)
%timeit cross_val_score(reg, X, y, cv=10)

#100 loops, best of 3: 9.91 ms per loop
#10 loops, best of 3: 31.2 ms per loop

### e. Regularized regression
- Linear regression minimizes a loss function
- It chooses a coefficient for each feature variable
- Large coefficients can lead to overfitting (predict anything)
- Regularization = penalize large coefficients

#### i. Regularization 1: Lasso (feature selection)
- Loss function = OLS loss function + alpha * sum of abs value of coeff
- Used for feature selection
    - How? By shrinking the coefficients of less important feature to exactly 0
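
A small sketch of this shrinking-to-zero behavior on synthetic data. Here only features 0 and 2 actually influence the target (the coefficients 3.0 and -2.0 and the alpha value are made up for illustration), and Lasso is expected to zero out the rest.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 5 features, but only features 0 and 2 matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
print(lasso.coef_)

# Indices of features whose coefficient survived the penalty
selected = np.flatnonzero(lasso.coef_)
print(selected)
```

Note that the surviving coefficients are also shrunk toward zero, so they come out smaller in magnitude than the values used to generate the data.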

In [None]:
# example: Lasso regression
from sklearn.linear_model import Lasso

X_train, X_test, y_train, y_test = train_test_split(X,y,
        test_size=0.3, random_state=42)

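# normalize arg ensures all variables on same scale
# (note: `normalize` was removed in scikit-learn 1.2; on newer versions,
# scale the features yourself before fitting)
lasso = Lasso(alpha=0.1, normalize=True)
lasso.fit(X_train, y_train)
lasso_pred = lasso.predict(X_test)
lasso.score(X_test, y_test)
# out: 0.595022...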

In [None]:
# example: Lasso regularization used for feature selection
from sklearn.linear_model import Lasso

# store feature names in names
names = boston.drop('MEDV', axis=1).columns

lasso = Lasso(alpha=0.1)
lasso_coef = lasso.fit(X, y).coef_

# plot the coefficients as a function of feature name
_ = plt.plot(range(len(names)), lasso_coef)
_ = plt.xticks(range(len(names)), names, rotation=60)
_ = plt.ylabel('Coefficients')
plt.show()

Regularization I: Lasso
In the video, you saw how Lasso selected out the 'RM' feature as being the most important for predicting Boston house prices, while shrinking the coefficients of certain other features to 0. Its ability to perform feature selection in this way becomes even more useful when you are dealing with data involving thousands of features.

In this exercise, you will fit a lasso regression to the Gapminder data you have been working with and plot the coefficients. Just as with the Boston data, you will find that the coefficients of some features are shrunk to 0, with only the most important ones remaining.

The feature and target variable arrays have been pre-loaded as X and y.

In [None]:
# Import Lasso
from sklearn.linear_model import Lasso

# Instantiate a lasso regressor: lasso
lasso = Lasso(alpha=0.4, normalize=True)

# Fit the regressor to the data
lasso.fit(X, y)

# Compute and print the coefficients
lasso_coef = lasso.coef_
print(lasso_coef)

# Plot the coefficients
plt.plot(range(len(df_columns)), lasso_coef)
plt.xticks(range(len(df_columns)), df_columns.values, rotation=60)
plt.margins(0.02)
plt.show()

# plot shows child_mortality feature is important when predicting
# life expectancy

#### ii. Regularization 2: Ridge (1st choice in building regression models)
- Loss function = OLS loss function + alpha * sum of coefficient^2
- Need to choose alpha parameter for the best performing model
- Picking alpha is similar to picking k in k-NN
- aka Hyperparameter tuning
- alpha (sometimes lambda) controls model complexity
    - alpha = 0: get back OLS (can lead to overfitting)
    - very high alpha can lead to underfitting
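
The effect of alpha on model complexity can be seen directly in the size of the fitted coefficients. The sketch below fits Ridge for a few alpha values on synthetic data (the data and the alpha grid are assumptions for illustration) and prints the L2 norm of the coefficient vector, which should shrink as alpha grows.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression data (illustration only)
X, y = make_regression(n_samples=50, n_features=10, noise=20.0,
                       random_state=0)

alphas = [0.01, 1.0, 100.0, 10000.0]
coef_norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X, y)
    # Overall size of the coefficients for this alpha
    coef_norms.append(np.linalg.norm(ridge.coef_))

for alpha, norm in zip(alphas, coef_norms):
    print(alpha, norm)
```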

In [None]:
# example: Ridge regression
from sklearn.linear_model import Ridge

X_train, X_test, y_train, y_test = train_test_split(X,y,
        test_size=0.3, random_state=42)

# normalize arg ensures all variables on same scale
ridge = Ridge(alpha=0.1, normalize=True)
ridge.fit(X_train, y_train)
ridge_pred = ridge.predict(X_test)
ridge.score(X_test, y_test)
# out: 0.69969...

Regularization II: Ridge
Lasso is great for feature selection, but when building regression models, Ridge regression should be your first choice.

Recall that lasso performs regularization by adding to the loss function a penalty term of the absolute value of each coefficient multiplied by some alpha. This is also known as L1 regularization because the regularization term is the L1 norm of the coefficients. This is not the only way to regularize, however.

If instead you took the sum of the squared values of the coefficients multiplied by some alpha - like in Ridge regression - you would be using the squared L2 norm of the coefficients as the penalty, which is known as L2 regularization. In this exercise, you will practice fitting ridge regression models over a range of different alphas, and plot cross-validated R2 scores for each, using this function that we have defined for you, which plots the R2 score as well as standard error for each alpha:

def display_plot(cv_scores, cv_scores_std):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.plot(alpha_space, cv_scores)

    std_error = cv_scores_std / np.sqrt(10)

    ax.fill_between(alpha_space, cv_scores + std_error, cv_scores - std_error, alpha=0.2)
    ax.set_ylabel('CV Score +/- Std Error')
    ax.set_xlabel('Alpha')
    ax.axhline(np.max(cv_scores), linestyle='--', color='.5')
    ax.set_xlim([alpha_space[0], alpha_space[-1]])
    ax.set_xscale('log')
    plt.show()
Don't worry about the specifics of how the above function works. The motivation behind this exercise is for you to see how the R2 score varies with different alphas, and to understand the importance of selecting the right value for alpha.

In [None]:
# example: Ridge regularization

# Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Setup the array of alphas and lists to store scores
alpha_space = np.logspace(-4, 0, 50)
ridge_scores = []
ridge_scores_std = []

# Create a ridge regressor: ridge
ridge = Ridge(normalize=True)

# Compute scores over range of alphas
for alpha in alpha_space:

    # Specify the alpha value to use: ridge.alpha
    ridge.alpha = alpha
    
    # Perform 10-fold CV: ridge_cv_scores
    ridge_cv_scores = cross_val_score(ridge,X,y,cv=10)
    
    # Append the mean of ridge_cv_scores to ridge_scores
    ridge_scores.append(np.mean(ridge_cv_scores))
    
    # Append the std of ridge_cv_scores to ridge_scores_std
    ridge_scores_std.append(np.std(ridge_cv_scores))

# Display the plot
display_plot(ridge_scores, ridge_scores_std)


## 3. Fine-tuning your model

Model performance can be measured with accuracy, but accuracy is not always a useful metric.
- consider class imbalance: if 99% of emails are real and 1% are spam, a classifier that labels every email as real is 99% accurate yet never catches any spam

More nuanced performance metrics:
- Diagnosing classification predictions
    - Confusion matrix (T/F Positive/Negative)
- Classification report (Confusion matrix metrics):
    - Accuracy
    - Precision (PPV - positive predictive value)
        = tp / (tp + fp)
        - high precision: not many real emails predicted as spam
    - Recall = tp / (tp + fn) aka Sensitivity, Hit rate, True Positive Rate
        - high recall: predicted most spam emails correctly
    - F1 score = 2*(precision*recall)/(precision+recall) = harmonic mean of precision and recall
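
The formulas above can be checked by hand from the four confusion-matrix counts. The sketch below uses a tiny made-up set of labels (1 = spam, 0 = real) and compares the hand-computed values with scikit-learn's metric functions.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score)

# Made-up labels: 1 = spam (positive class), 0 = real email
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```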

In [None]:
# Confusion matrix in scikit-learn
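from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=8)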
X_train, X_test, y_train, y_test = train_test_split(X, y,
        test_size=0.4, random_state=42)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# create confusion matrix
# note: y_test is the true label, prediction label is the 2nd arg
print(confusion_matrix(y_test, y_pred))
# print classification report: precision, recall, f1-score, support
print(classification_report(y_test, y_pred))

Metrics for classification
In Chapter 1, you evaluated the performance of your k-NN classifier based on its accuracy. However, accuracy is not always an informative metric. In this exercise, you will dive more deeply into evaluating the performance of binary classifiers by computing a confusion matrix and generating a classification report.

You may have noticed in the video that the classification report consisted of three rows, and an additional support column. 

Support column in classification report: 
- The support gives the number of samples of the true response that lie in that class - so in the video example, the support was the number of Republicans or Democrats in the test set on which the classification report was computed. The precision, recall, and f1-score columns, then, gave the respective metrics for that particular class.
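
To make the meaning of support concrete, here is a small sketch on made-up labels (the lists `y_true` and `y_hat` are assumptions for illustration). It uses the `output_dict=True` option of `classification_report` to read the support values and compares them with a plain count of the true labels.

```python
from collections import Counter
from sklearn.metrics import classification_report

# Made-up true and predicted labels for a binary problem
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

report = classification_report(y_true, y_hat, output_dict=True)
true_counts = Counter(y_true)

# Support for each class equals the number of true samples in that class
for label in ("0", "1"):
    print(label, report[label]["support"], true_counts[int(label)])
```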

Here, you'll work with the PIMA Indians dataset obtained from the UCI Machine Learning Repository. The goal is to predict whether or not a given female patient will contract diabetes based on features such as BMI, age, and number of pregnancies. Therefore, it is a binary classification problem. A target value of 0 indicates that the patient does not have diabetes, while a value of 1 indicates that the patient does have diabetes. As in Chapters 1 and 2, the dataset has been preprocessed to deal with missing values.

The dataset has been loaded into a DataFrame df and the feature and target variable arrays X and y have been created for you. In addition, sklearn.model_selection.train_test_split and sklearn.neighbors.KNeighborsClassifier have already been imported.

Your job is to train a k-NN classifier to the data and evaluate its performance by generating a confusion matrix and classification report.

In [None]:
# example: knn confusion matrix and classification report

# Import necessary modules
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# Create training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.4, random_state=42)

# Instantiate a k-NN classifier: knn
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)

# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# out
#[[176  30]
# [ 52  50]]
#             precision    recall  f1-score   support

#          0       0.77      0.86      0.81       206
#          1       0.62      0.47      0.54       102

#avg / total       0.72      0.73      0.72       308

### 3.1 Logistic regression and ROC curve
- despite the name, used in classification problems (not regression)

Logistic regression for binary classification:
- logistic regression outputs probabilities
    - if probability, p, > 0.5: data labeled '1'
    - p < 0.5: data labeled '0'
- Probability thresholds
    - default logistic regression threshold = 0.5
    - thresholds are not specific to logistic regression; k-NN classifiers can also be thresholded
    - What happens to the True Positive and False Positive rates as the threshold varies? Look at the ROC curve

- ROC curve (Receiver Operating Characteristic curve): plots the True Positive Rate against the False Positive Rate over all thresholds
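
The effect of moving the threshold can be sketched by hand. The labels and probabilities below are made up for illustration, and `rates_at` is a hypothetical helper, not a scikit-learn function. Each threshold yields one (FPR, TPR) point; the ROC curve is the set of all such points.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities of class 1
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
p_hat = np.array([0.05, 0.2, 0.3, 0.45, 0.4, 0.6, 0.55, 0.7, 0.8, 0.95])

def rates_at(threshold):
    """Return (false positive rate, true positive rate) at a threshold."""
    y_hat = (p_hat >= threshold).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    tn = np.sum((y_hat == 0) & (y_true == 0))
    return fp / (fp + tn), tp / (tp + fn)

for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    fpr, tpr = rates_at(t)
    print(f"threshold={t:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

A threshold of 0 labels everything as '1' (both rates equal 1), while a threshold of 1 labels everything as '0' here (both rates equal 0).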

In [None]:
# example: logistic regression in scikit-learn

# Import necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# instantiate logistic regression classifier
logreg = LogisticRegression()

# split the data in training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.4, random_state=42)

# Fit the classifier to the training data
logreg.fit(X_train, y_train)

# Predict on test set
y_pred = logreg.predict(X_test)

In [None]:
# example: plot ROC curve
from sklearn.metrics import roc_curve
# y_pred_prob = predicted probabilities
# .predict_proba returns an array with 2 columns, each column
# contains probabilities for the respective target values,
# we choose the 2nd column, which is index 1, the probabilities
# being 1
y_pred_prob = logreg.predict_proba(X_test)[:,1]
# fpr = false positive rate, tpr = true positive rate
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

plt.plot([0,1],[0,1],'k--')
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.show();

#### 3.1.a Building a logistic regression model

Time to build your first logistic regression model! As Hugo showed in the video, scikit-learn makes it very easy to try different models, since the Train-Test-Split/Instantiate/Fit/Predict paradigm applies to all classifiers and regressors - which are known in scikit-learn as 'estimators'. You'll see this now for yourself as you train a logistic regression model on exactly the same data as in the previous exercise. Will it outperform k-NN? There's only one way to find out!

The feature and target variable arrays X and y have been pre-loaded, and train_test_split has been imported for you from sklearn.model_selection.

In [None]:
# example: build logistic regression model - binary classification

# Import the necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state=42)

# Create the classifier: logreg
logreg = LogisticRegression()

# Fit the classifier to the training data
logreg.fit(X_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = logreg.predict(X_test)

# Compute and print the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# out
#[[176  30]
# [ 35  67]]
#             precision    recall  f1-score   support

#          0       0.83      0.85      0.84       206
#          1       0.69      0.66      0.67       102

#avg / total       0.79      0.79      0.79       308

#### 3.1.b Plotting ROC curve - visually evaluate models
Classification reports and confusion matrices are great methods to quantitatively evaluate model performance, while ROC curves provide a way to visually evaluate models. 

.predict_proba()
- Most classifiers in scikit-learn have a .predict_proba() method which returns the probability of a given sample being in a particular class. Having built a logistic regression model, you'll now evaluate its performance by plotting an ROC curve. In doing so, you'll make use of the .predict_proba() method and become familiar with its functionality.
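
As a rough illustration of what `.predict_proba()` returns, the sketch below fits a logistic regression to synthetic data (generated with `make_classification` as a stand-in; it is not the diabetes dataset, and the names `X_demo`, `y_demo`, and `clf` are assumptions).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo[:5])
print(clf.classes_)        # column order of predict_proba
print(proba)               # one row per sample, one column per class
print(proba.sum(axis=1))   # each row sums to 1

# For binary logistic regression, .predict() agrees with thresholding
# the class-1 probability at 0.5
manual = (proba[:, 1] > 0.5).astype(int)
print(manual, clf.predict(X_demo[:5]))
```

Selecting `[:, 1]` keeps only the probability of the positive class, which is what `roc_curve` expects as its score argument.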

Here, you'll continue working with the PIMA Indians diabetes dataset. The classifier has already been fit to the training data and is available as logreg.

In [None]:
# Import necessary modules
from sklearn.metrics import roc_curve

# Compute predicted probabilities: y_pred_prob
y_pred_prob = logreg.predict_proba(X_test)[:,1]

# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr) # x-axis: fpr, y-axis: tpr
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

#### 3.1.c Precision-Recall Curve
When looking at your ROC curve, you may have noticed that the y-axis (True positive rate) is also known as recall. Indeed, in addition to the ROC curve, there are other ways to visually evaluate model performance. One such way is the precision-recall curve, which is generated by plotting the precision and recall for different thresholds. As a reminder, precision and recall are defined as:

- Precision=TP/(TP+FP)
- Recall=TP/(TP+FN)

On the right, a precision-recall curve has been generated for the diabetes dataset. The classification report and confusion matrix are displayed in the IPython Shell.

Study the precision-recall curve and then consider the statements given below. Choose the one statement that is not true. Note that here, the class is positive (1) if the individual has diabetes.

             precision    recall  f1-score   support

          0       0.83      0.85      0.84       206
          1       0.69      0.66      0.67       102

    avg / total       0.79      0.79      0.79       308

    [[176  30]
     [ 35  67]]
     
     
TRUE: A recall of 1 corresponds to a classifier with a low threshold in which all females who contract diabetes were correctly classified as such, at the expense of many misclassifications of those who did not have diabetes.

TRUE: Precision is undefined for a classifier which makes no positive predictions, that is, classifies everyone as not having diabetes.

TRUE: When the threshold is very close to 1, precision is also 1, because the classifier is absolutely certain about its predictions.

FALSE: Precision and recall take true negatives into consideration.
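
The precision-recall trade-off described in this exercise can be reproduced on a small scale with `precision_recall_curve` from `sklearn.metrics`. The labels and probabilities below are made up for illustration; only the function itself comes from scikit-learn.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical true labels and predicted probabilities of class 1
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
p_hat = np.array([0.05, 0.2, 0.3, 0.45, 0.4, 0.6, 0.55, 0.7, 0.8, 0.95])

precision, recall, thresholds = precision_recall_curve(y_true, p_hat)

# precision and recall have one more entry than thresholds: the final
# point (precision=1, recall=0) has no corresponding threshold
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold>={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Recall falls as the threshold rises, and neither quantity involves the true negatives.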

### 3.2 Area under the ROC curve (AUC)
- one popular metric for classification models
- Larger AUC = better model

In [None]:
# example: Compute AUC
from sklearn.metrics import roc_auc_score

logreg = LogisticRegression()

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                        test_size = 0.4, random_state=42)

# Fit the classifier to the training data
logreg.fit(X_train, y_train)

# Compute predicted probabilities to compute AUC
y_pred_prob = logreg.predict_proba(X_test)[:,1]
# Pass true labels and predicted probabilities to ROC AUC score
roc_auc_score(y_test, y_pred_prob)
# out: 0.997466216

In [None]:
# Compute AUC using cross-validation
from sklearn.model_selection import cross_val_score

# pass the estimator, features, and target
cv_scores = cross_val_score(logreg, X, y, cv=5, scoring='roc_auc')

print(cv_scores)
# out: [ 0.9967   0.99183   0.99583   1.   0.961406]

#### AUC computation
Say you have a binary classifier that in fact is just randomly making guesses. It would be correct approximately 50% of the time, and the resulting ROC curve would be a diagonal line in which the True Positive Rate and False Positive Rate are always equal. The Area under this ROC curve would be 0.5. This is one way in which the AUC, which Hugo discussed in the video, is an informative metric to evaluate a model. If the AUC is greater than 0.5, the model is better than random guessing. Always a good sign!
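
The claim that random guessing gives an AUC of about 0.5 can be checked numerically. The sketch below uses an equivalent reading of AUC (the fraction of positive/negative pairs in which the positive sample gets the higher score); `auc_by_ranking` is a hypothetical helper written for this illustration, and all data is randomly generated.

```python
import numpy as np

def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

rng = np.random.default_rng(0)
y_true = np.repeat([0, 1], 1000)   # 1000 negatives, 1000 positives

# Scores unrelated to the labels (random guessing)
random_scores = rng.random(y_true.size)
# Scores that tend to be higher for positives (an informative model)
informative_scores = y_true + rng.normal(0, 1, y_true.size)

print(auc_by_ranking(y_true, random_scores))
print(auc_by_ranking(y_true, informative_scores))
```

The first value should land close to 0.5 and the second clearly above it.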

In this exercise, you'll calculate AUC scores using the roc_auc_score() function from sklearn.metrics as well as by performing cross-validation on the diabetes dataset.

X and y, along with training and test sets X_train, X_test, y_train, y_test, have been pre-loaded for you, and a logistic regression classifier logreg has been fit to the training data.

In [None]:
# example: calculate AUC with 2 methods: roc_auc_score() and
# using cross_val_score()

# Import necessary modules
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

# Method 1
# Compute predicted probabilities: y_pred_prob
y_pred_prob = logreg.predict_proba(X_test)[:,1]

# Compute and print AUC score
print("AUC: {}".format(roc_auc_score(y_test, y_pred_prob)))

# Method 2
# Compute cross-validated AUC scores: cv_auc
cv_auc = cross_val_score(logreg, X, y, cv=5, scoring='roc_auc')

# Print list of AUC scores
print("AUC scores computed using 5-fold cross-validation: {}".
      format(cv_auc))

# out
#AUC: 0.8254806777079764
#AUC scores computed using 5-fold cross-validation: 
#[ 0.80148148  0.8062963   0.81481481  0.86245283  0.8554717 ]

### 3.3 Hyperparameter tuning
Review
- Linear regression: choose parameters
- Ridge/Lasso regression: choose alpha
- k-Nearest Neighbors: choose n_neighbors

Hyperparameters: parameters that need to be specified before model fitting (so can't be learned by fitting model)
- parameters like alpha and k

Choosing the correct hyperparameter
- try a bunch of different hyperparameter values
- fit all of them separately
- see how well each performs
- choose the best performing one
- essential to use cross-validation (using train-test-split alone risks overfitting hyperparameter to test set)
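
The procedure in the list above can be written as an explicit loop, which is roughly what grid search automates. This is a minimal sketch on the iris data as a stand-in; the candidate list and the names `X_demo`, `y_demo`, and `mean_scores` are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)   # stand-in dataset

candidates = [1, 3, 5, 7, 9, 11]
mean_scores = {}
for k in candidates:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validated accuracy for this value of k
    mean_scores[k] = cross_val_score(knn_k, X_demo, y_demo, cv=5).mean()

# Keep the hyperparameter value with the best cross-validated score
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best n_neighbors:", best_k)
```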

Methods:
- Grid search cross-validation
- Randomized Search CV

In [None]:
# GridSearchCV
from sklearn.model_selection import GridSearchCV
# keys are hyperparameter names like 'n_neighbors' in k-nn, or
# alpha in Ridge/Lasso regression
param_grid = {'n_neighbors': np.arange(1, 50)}
knn = KNeighborsClassifier()
# args: model, grid, number of folds for cross validation
knn_cv = GridSearchCV(knn, param_grid, cv=5)
# use this to fit data
knn_cv.fit(X, y)

knn_cv.best_params_
knn_cv.best_score_

#### 3.3.a Hyperparameter tuning with GridSearchCV
Hugo demonstrated how to tune the n_neighbors parameter of the KNeighborsClassifier() using GridSearchCV on the voting dataset. You will now practice this yourself, but by using logistic regression on the diabetes dataset instead!

Like the alpha parameter of lasso and ridge regularization that you saw earlier, logistic regression also has a regularization parameter: C. C controls the inverse of the regularization strength, and this is what you will tune in this exercise. A large C can lead to an overfit model, while a small C can lead to an underfit model.
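
One way to see that C is the inverse of the regularization strength is to watch the size of the fitted coefficients. The sketch below uses synthetic data from `make_classification` as a stand-in (not the diabetes dataset); a smaller C means a stronger penalty and therefore smaller coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

coef_norms = {}
for C in (0.001, 1.0, 1000.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    # Euclidean norm of the coefficient vector for this value of C
    coef_norms[C] = np.linalg.norm(model.coef_)

print(coef_norms)
```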

The hyperparameter space for C has been set up for you. Your job is to use GridSearchCV and logistic regression to find the optimal C in this hyperparameter space. The feature array is available as X and target variable array is available as y.

You may be wondering why you aren't asked to split the data into training and test sets. Good observation! Here, we want you to focus on the process of setting up the hyperparameter grid and performing grid-search cross-validation. In practice, you will indeed want to hold out a portion of your data for evaluation purposes, and you will learn all about this in the next video!

In [None]:
# Tune (optimize) C regularization parameter for logistic regression
# This example doesn't have a test and training set, but normally
# you should

# Import necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression()

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit it to the data
logreg_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_)) 
print("Best score is {}".format(logreg_cv.best_score_))

# out
#Tuned Logistic Regression Parameters: {'C': 3.7275937203149381}
#Best score is 0.7708333333333334

#### 3.3.b Hyperparameter tuning with RandomizedSearchCV
- faster alternative to GridSearchCV

GridSearchCV can be computationally expensive, especially if you are searching over a large hyperparameter space and dealing with multiple hyperparameters. A solution to this is to use RandomizedSearchCV, in which not all hyperparameter values are tried out. Instead, a fixed number of hyperparameter settings is sampled from specified probability distributions. You'll practice using RandomizedSearchCV in this exercise and see how this works.
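
The difference between trying every combination and sampling a fixed number of settings can be sketched without fitting any model, using `ParameterGrid` and `ParameterSampler` from `sklearn.model_selection`. The k-NN hyperparameter names and ranges below are illustrative assumptions.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Exhaustive grid: every combination is a candidate
grid = {"n_neighbors": list(range(1, 50)), "weights": ["uniform", "distance"]}
all_settings = list(ParameterGrid(grid))

# Randomized search: draw a fixed number of settings from distributions
# (randint(1, 50) samples integers from 1 to 49)
dist = {"n_neighbors": randint(1, 50), "weights": ["uniform", "distance"]}
sampled = list(ParameterSampler(dist, n_iter=5, random_state=0))

print(len(all_settings), len(sampled))
for setting in sampled:
    print(setting)
```

With cross-validation, each candidate setting costs one fit per fold, so evaluating 5 settings instead of 98 is where the time saving comes from.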

Here, you'll also be introduced to a new model: the Decision Tree. Don't worry about the specifics of how this model works. Just like k-NN, linear regression, and logistic regression, decision trees in scikit-learn have .fit() and .predict() methods that you can use in exactly the same way as before. Decision trees have many parameters that can be tuned, such as max_features, max_depth, and min_samples_leaf: This makes it an ideal use case for RandomizedSearchCV.

As before, the feature array X and target variable array y of the diabetes dataset have been pre-loaded. The hyperparameter settings have been specified for you. Your goal is to use RandomizedSearchCV to find the optimal hyperparameters. Go for it!

In [None]:
# Import necessary modules
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

# Setup the parameters and distributions to sample from: param_dist
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

# Instantiate a Decision Tree classifier: tree
tree = DecisionTreeClassifier()

# Instantiate the RandomizedSearchCV object: tree_cv
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)

# Fit it to the data
tree_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))

# out
#Tuned Decision Tree Parameters: {'max_depth': 3, 
#    'criterion': 'entropy', 'min_samples_leaf': 3, 
#    'max_features': 5}
#Best score is 0.7330729166666666

#### Note: RandomizedSearchCV saves computation time, but when both search the same candidate values it cannot outperform GridSearchCV, since it only evaluates a subset of them

### 3.4 Hold-out set for final evaluation

#### 3.4.a Hold-out set reasoning
- Use test data set to evaluate model performance
- Using all data for cross-validation is not ideal
- split data into training and hold-out (test) set at beginning
- perform grid search cross-validation on training set
- choose best hyperparameters and evaluate on hold-out set

#### 3.4.b Hold-out set in practice I: Classification
You will now practice evaluating a model with tuned hyperparameters on a hold-out set. The feature array and target variable array from the diabetes dataset have been pre-loaded as X and y.

In addition to C, logistic regression has a 'penalty' hyperparameter which specifies whether to use 'l1' or 'l2' regularization. 

Your job in this exercise is to create a hold-out set and tune the 'C' and 'penalty' hyperparameters of a logistic regression classifier using GridSearchCV on the training set.

In [None]:
# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Create the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space, 'penalty': ['l1', 'l2']}

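# Instantiate the logistic regression classifier: logreg
# (the 'liblinear' solver supports both 'l1' and 'l2'; the default solver
# in recent scikit-learn versions does not support 'l1')
logreg = LogisticRegression(solver='liblinear')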

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,
test_size=0.4, random_state=42)

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit it to the training data
logreg_cv.fit(X_train, y_train)

# Print the optimal parameters and best score
print("Tuned Logistic Regression Parameter: {}".format(logreg_cv.best_params_))
print("Tuned Logistic Regression Accuracy: {}".format(logreg_cv.best_score_))

# out
# Tuned Logistic Regression Parameter: {'C': 0.43939705607607948, 'penalty': 'l1'}
# Tuned Logistic Regression Accuracy: 0.7652173913043478

#### 3.4.c Hold-out set in practice II: Regression
Remember lasso and ridge regression from the previous chapter? Lasso used the L1 penalty to regularize, while ridge used the L2 penalty. There is another type of regularized regression known as the elastic net. 

In elastic net regularization, the penalty term is a linear combination of the L1 and L2 penalties:
- a∗L1+b∗L2

In scikit-learn, this term is represented by the 'l1_ratio' parameter: An 'l1_ratio' of 1 corresponds to an L1 penalty, and anything lower is a combination of L1 and L2.
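
For reference, the scikit-learn documentation writes the ElasticNet penalty as `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`, so in the notation above a = alpha * l1_ratio and b = 0.5 * alpha * (1 - l1_ratio). The sketch below evaluates that expression for a made-up coefficient vector; `elastic_net_penalty` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def elastic_net_penalty(w, alpha=1.0, l1_ratio=0.5):
    """Penalty term in the parameterization used by scikit-learn's ElasticNet."""
    l1 = np.sum(np.abs(w))   # L1 norm of the coefficients
    l2 = np.sum(w ** 2)      # squared L2 norm of the coefficients
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1 - l1_ratio) * l2

w = np.array([1.0, -2.0, 0.0, 3.0])   # made-up coefficient vector
for ratio in (0.0, 0.5, 1.0):
    print(ratio, elastic_net_penalty(w, alpha=1.0, l1_ratio=ratio))
```

With `l1_ratio=1` only the L1 term remains, and with `l1_ratio=0` only the L2 term remains.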

In this exercise, you will use GridSearchCV to tune the 'l1_ratio' of an elastic net model trained on the Gapminder data. As in the previous exercise, use a hold-out set to evaluate your model's performance.

In [None]:
# Import necessary modules
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
test_size=0.4, random_state=42)

# Create the hyperparameter grid
l1_space = np.linspace(0, 1, 30)
param_grid = {'l1_ratio': l1_space}

# Instantiate the ElasticNet regressor: elastic_net
elastic_net = ElasticNet()

# Setup the GridSearchCV object: gm_cv
# Use GridSearchCV with 5-fold cross-validation to tune 'l1_ratio' 
# on the training data X_train and y_train. This involves 
# first instantiating the GridSearchCV object with the correct 
# parameters and then fitting it to the training data.
gm_cv = GridSearchCV(elastic_net, param_grid, cv=5)

# Fit it to the training data
gm_cv.fit(X_train, y_train)

# Predict on the test set and compute metrics
y_pred = gm_cv.predict(X_test)
r2 = gm_cv.score(X_test, y_test)
mse = mean_squared_error(y_test, y_pred)
print("Tuned ElasticNet l1 ratio: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))
print("Tuned ElasticNet MSE: {}".format(mse))

# out
# Tuned ElasticNet l1 ratio: {'l1_ratio': 0.20689655172413793}
# Tuned ElasticNet R squared: 0.8668305372460283
# Tuned ElasticNet MSE: 10.05791413339844

## 4. Preprocessing and Pipelines

### 4.1 Preprocessing Data
Dealing with categorical features
- scikit-learn will not accept categorical features
- need to encode these features numerically
- convert to 'dummy variables'
    - 0: Observation was NOT that category
    - 1: Observation was that category
- example: 3 origins for a car
    - origin_Asia 0 or 1
    - origin_Europe 0 or 1
    - origin_USA 0 or 1
    - one column can be dropped: if a car is not from Europe or the USA, it is implicitly from Asia; keeping all three duplicates information, which can cause issues in some models
    
Dealing with categorical features:
- scikit-learn: OneHotEncoder()
- pandas: get_dummies()
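
A minimal sketch of dummy encoding with pandas follows. The small `cars` table is made up to stand in for the cars data, and the column values are assumptions.

```python
import pandas as pd

# Tiny made-up table standing in for the cars data
cars = pd.DataFrame({"mpg": [18.0, 31.5, 24.0, 36.1],
                     "origin": ["US", "Asia", "Europe", "Asia"]})

# One 0/1 column per category
dummies = pd.get_dummies(cars, columns=["origin"], dtype=int)
print(dummies)

# drop_first=True removes the first category (alphabetically, 'Asia'),
# which is implied when all remaining origin columns are 0
reduced = pd.get_dummies(cars, columns=["origin"], drop_first=True, dtype=int)
print(reduced)
```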

Boxplots are useful in visualizing categorical features

In [None]:
# Example: dummy variables with pandas get_dummies()
# convert origins for car (noted above) into dummy variables
import pandas as pd
df = pd.read_csv('auto.csv')
df_origin = pd.get_dummies(df)
print(df_origin.head())
# drop the origin_Asia column since it can be inferred from the others
df_origin = df_origin.drop('origin_Asia', axis=1)
print(df_origin.head())

# Ridge Linear regression with dummy variables
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
X_train, X_test, y_train, y_test = train_test_split(X, y,
        test_size=0.3, random_state=42)
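# note: normalize= was removed in scikit-learn 1.2; scale features explicitly on newer versions
ridge = Ridge(alpha=0.5, normalize=True).fit(X_train, y_train)
ridge.score(X_test, y_test)
# out: 0.7190645
# for regressors, .score() returns R^2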

#### 4.1.a Exploring categorical features
The Gapminder dataset that you worked with in previous chapters also contained a categorical 'Region' feature, which we dropped in previous exercises since you did not have the tools to deal with it. Now however, you do, so we have added it back in!

Your job in this exercise is to explore this feature. 

Boxplots are particularly useful for visualizing categorical features such as this.

#### 4.1.b 

#### 4.1.c 

### 4.2 Handling Missing Data

#### 4.2.a 

#### 4.2.b 

#### 4.2.c 

### 4.3 Centering and Scaling


#### 4.3.a 

#### 4.3.b 

#### 4.3.c 

#### 4.3.d 

# Supervised learning with scikit-learn - sklearn
1. Classification
2. Regression
3. Fine-tuning model
4. Preprocessing and Pipelines

Background:
- What is machine learning? Giving computers the ability to learn to make decisions from Data without being explicitly programmed.
- Supervised learning - labeled data
- Unsupervised learning - uncovering hidden patterns from unlabeled data
- Reinforcement learning - software agents interact with an environment and learn how to optimize their behavior, given a system of rewards and punishments; draws inspiration from behavioral psychology. E.g. AlphaGo - the first computer program to defeat a world champion in Go

Supervised learning
- predictor variables/features and a target variable
- Aim: predict the target variable, given the predictor variables (e.g. target variable: species, predictor variables: sepal length and width)
- Classification: target variable consists of categories
- Regression: Target variable is continuous

Naming conventions:
- Features = predictor variables = independent variables
- Target variable = dependent variable = response variable

Goals of Supervised learning:
- Automate time-consuming or expensive manual tasks (e.g. a doctor's diagnosis)
- Make predictions about the future (e.g. will a customer click an ad or not)
- Need labeled data (e.g. historical data with labels, experiments to get labeled data such as ad clicks, crowd-sourced labeled data)

Tools:
- scikit-learn/sklearn - integrates well with SciPy stack including Numpy
- other libraries: TensorFlow, keras

## 1. Classification

### a. EDA

In [None]:
from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')

# load dataset
iris = datasets.load_iris()
type(iris)
# out: sklearn.datasets.base.Bunch
# a Bunch is like a dictionary
print(iris.keys())
# out: dict_keys(['data','target_names','DESCR','feature_names','target'])
# the data and target are numpy arrays
iris.data.shape
# out: (150, 4) # 150 samples and 4 features
iris.target_names
# out: array(['setosa','versicolor','virginica'], dtype='<U10')

# initial EDA
X = iris.data
y = iris.target
df = pd.DataFrame(X, columns=iris.feature_names)
print(df.head())

# Visual EDA: c is color, s is marker size
_ = pd.plotting.scatter_matrix(df, c=y, figsize=[8,8], s=150, marker='D')


#### i. Numerical EDA
In this chapter, you'll be working with a dataset obtained from the UCI Machine Learning Repository consisting of votes made by US House of Representatives Congressmen. Your goal will be to predict their party affiliation ('Democrat' or 'Republican') based on how they voted on certain key issues. Here, it's worth noting that we have preprocessed this dataset to deal with missing values. This is so that your focus can be directed towards understanding how to train and evaluate supervised learning models. Once you have mastered these fundamentals, you will be introduced to preprocessing techniques in Chapter 4 and have the chance to apply them there yourself - including on this very same dataset!

Before thinking about what supervised learning models you can apply to this, however, you need to perform Exploratory data analysis (EDA) in order to understand the structure of the data. For a refresher on the importance of EDA, check out the first two chapters of Statistical Thinking in Python (Part 1).

Get started with your EDA now by exploring this voting records dataset numerically. It has been pre-loaded for you into a DataFrame called df. Use pandas' .head(), .info(), and .describe() methods in the IPython Shell to explore the DataFrame, and select the statement below that is not true.

In [None]:
# explore structure of data
df.head()
df.info()
df.describe()

#### ii. Visual EDA
The Numerical EDA you did in the previous exercise gave you some very important information, such as the names and data types of the columns, and the dimensions of the DataFrame. Following this with some visual EDA will give you an even better understanding of the data. In the video, Hugo used the scatter_matrix() function on the Iris data for this purpose. However, you may have noticed in the previous exercise that all the features in this dataset are binary; that is, they are either 0 or 1. So a different type of plot would be more useful here, such as Seaborn's countplot.

Given on the right is a countplot of the 'education' bill, generated from the following code:

plt.figure()

sns.countplot(x='education', hue='party', data=df, palette='RdBu')

plt.xticks([0,1], ['No', 'Yes'])

plt.show()

In sns.countplot(), we specify the x-axis data to be 'education', and hue to be 'party'. Recall that 'party' is also our target variable. So the resulting plot shows the difference in voting behavior between the two parties for the 'education' bill, with each party colored differently. We manually specified the color to be 'RdBu', as the Republican party has been traditionally associated with red, and the Democratic party with blue.

It seems like Democrats voted resoundingly against this bill, compared to Republicans. This is the kind of information that our machine learning model will seek to learn when we try to predict party affiliation solely based on voting behavior. An expert in U.S politics may be able to predict this without machine learning, but probably not instantaneously - and certainly not if we are dealing with hundreds of samples!

In the IPython Shell, explore the voting behavior further by generating countplots for the 'satellite' and 'missile' bills, and answer the following question: Of these two bills, which do Democrats vote resoundingly in favor of, compared to Republicans? Be sure to begin your plotting statements for each figure with plt.figure() so that a new figure will be set up. Otherwise, your plots will be overlaid onto the same figure.

In [None]:
# generate countplots for 'satellite' and 'missile' bills
import seaborn as sns

# satellite bill
plt.figure()
sns.countplot(x='satellite', hue='party', data=df, palette='RdBu')
plt.xticks([0,1], ['No', 'Yes'])
plt.show()
# republicans 'no', democrats 'yes'

# missile bill
plt.figure()
sns.countplot(x='missile', hue='party', data=df, palette='RdBu')
plt.xticks([0,1], ['No', 'Yes'])
plt.show()
# republicans 'no', democrats 'yes'

### b. The classification challenge
- Training data: already labeled data

k-Nearest Neighbors
- idea is to predict the label of a data point by looking at the 'k' closest labeled data points

Training a model on the data = 'fitting' a model to the data
- .fit() method
Predict labels of new data with...
- .predict() method
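
The k-Nearest Neighbors idea can be sketched from scratch in a few lines. This is an illustration of the concept only, not how scikit-learn implements it; `knn_predict` and the toy arrays are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(distances)[:k]                   # indices of the k closest
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Two well-separated toy clusters with labels 0 and 1
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0])))
print(knn_predict(X_toy, y_toy, np.array([4.0, 4.0])))
```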

In [None]:
# Using scikit-learn to fit a classifier
from sklearn.neighbors import KNeighborsClassifier
# set 'k', number of neighbors to 6
knn = KNeighborsClassifier(n_neighbors=6)
# fit classifier to training set with args: features, target
# requires args to be Numpy array or Pandas dataframe
# requires no missing values
knn.fit(iris['data'], iris['target'])
# out: KNeighborsClassifier(algorithm='auto', leaf_size=30,
# metric='minkowski', metric_params=None, n_jobs=1,
# n_neighbors=6, p=2, weights='uniform')

# check out iris data
iris['data'].shape
# out: (150, 4)

# target has to be same # rows as feature data
iris['target'].shape
# out: (150,)

# predict on unlabeled data
prediction = knn.predict(X_new)
X_new.shape
# out: (3, 4)
print('Prediction: {}'.format(prediction))
# out: Prediction: [1 1 0]
# i.e. 1=versicolor for the first 2 observations, and 0=setosa for the third

#### i. k-Nearest Neighbors: Fit
Having explored the Congressional voting records dataset, it is time now to build your first classifier. In this exercise, you will fit a k-Nearest Neighbors classifier to the voting dataset, which has once again been pre-loaded for you into a DataFrame df.

In the video, Hugo discussed the importance of ensuring your data adheres to the format required by the scikit-learn API. The features need to be in an array where each column is a feature and each row a different observation or data point - in this case, a Congressman's voting record. The target needs to be a single column with the same number of observations as the feature data. We have done this for you in this exercise. Notice we named the feature array X and response variable y: This is in accordance with the common scikit-learn practice.

Your job is to create an instance of a k-NN classifier with 6 neighbors (by specifying the n_neighbors parameter) and then fit it to the data. The data has been pre-loaded into a DataFrame called df.

In [None]:
# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier

# Create arrays for the features and the response variable
# Note sklearn practice: X for feature array, y for response variable
# Note: the '.values' attribute returns NumPy arrays
y = df['party'].values
X = df.drop('party', axis=1).values

# Create a k-NN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the data
knn.fit(X, y)

#Out[1]: 
#KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
#           metric_params=None, n_jobs=1, n_neighbors=6, p=2,
#           weights='uniform')

#### ii. k-Nearest Neighbors: Predict
Having fit a k-NN classifier, you can now use it to predict the label of a new data point. However, there is no unlabeled data available since all of it was used to fit the model! You can still use the .predict() method on the X that was used to fit the model, but it is not a good indicator of the model's ability to generalize to new, unseen data.

In the next video, Hugo will discuss a solution to this problem. For now, a random unlabeled data point has been generated and is available to you as X_new. You will use your classifier to predict the label for this new data point, as well as on the training data X that the model has already seen. Using .predict() on X_new will generate 1 prediction, while using it on X will generate 435 predictions: 1 for each sample.

The DataFrame has been pre-loaded as df. This time, you will create the feature array X and target variable array y yourself.

In [None]:
# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier 

# Create arrays for the features and the response variable
y = df['party'].values
X = df.drop('party', axis=1).values

# Create a k-NN classifier with 6 neighbors: knn
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the data
knn.fit(X, y)

# Predict the labels for the training data X
y_pred = knn.predict(X)

# Predict and print the label for the new data point X_new
new_prediction = knn.predict(X_new)
print("Prediction: {}".format(new_prediction))

# out: Prediction: ['democrat']
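
Why predictions on the training data say little about generalization can be seen with a small experiment. The sketch below uses hypothetical random data (not the voting dataset): the labels are coin flips, so there is nothing real to learn, yet a 1-nearest-neighbor model still scores perfectly on the points it was fit on, because each point is its own nearest neighbor.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Hypothetical data: features are noise and labels are coin flips
X_seen = rng.normal(size=(200, 4))
y_seen = rng.integers(0, 2, size=200)
X_unseen = rng.normal(size=(200, 4))
y_unseen = rng.integers(0, 2, size=200)

# With k=1, every training point is its own nearest neighbor
knn_1 = KNeighborsClassifier(n_neighbors=1).fit(X_seen, y_seen)
seen_accuracy = knn_1.score(X_seen, y_seen)
unseen_accuracy = knn_1.score(X_unseen, y_unseen)
print(seen_accuracy, unseen_accuracy)
```

The gap between the two scores is the reason a held-out set is needed.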

### c. measuring Model performance
- accuracy - commonly used metric for classification performance
- accuracy = fraction of correct predictions
- to measure how well a model generalizes, compute accuracy on new data the model has not seen

Split data into training and test set
- Fit/train the classifier on the training set
- Make predictions on test set

In [None]:
# Train/Test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y)

# Create a k-NN classifier with 8 neighbors
knn = KNeighborsClassifier(n_neighbors=8)

# Fit the classifier to the data
knn.fit(X_train, y_train)

# Predict the labels for the test data X_test
y_pred = knn.predict(X_test)

print("Test set predictions:\n {}".format(y_pred))

# check accuracy
knn.score(X_test, y_test)
# out: 0.9555555556
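
For a classifier, `.score()` is just the fraction of correct predictions. A minimal check on the iris data, assuming the same split settings as above (the variable names `X_tr`, `X_te`, `knn_demo` and `y_hat` are chosen here for illustration):

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=21, stratify=iris.target)

knn_demo = KNeighborsClassifier(n_neighbors=8).fit(X_tr, y_tr)
y_hat = knn_demo.predict(X_te)

# Accuracy by hand: share of test points whose predicted label is correct
manual_accuracy = np.mean(y_hat == y_te)
print(manual_accuracy, accuracy_score(y_te, y_hat), knn_demo.score(X_te, y_te))
```

All three printed numbers should agree.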

Model complexity for KNN
- larger k = smoother decision boundary = less complex model
- smaller k = more complex model = can lead to overfitting and sensitive to noise
- Model complexity curve - shows over/underfitting with too small or large k

#### i. the digits recognition dataset
Up until now, you have been performing binary classification, since the target variable had two possible outcomes. Hugo, however, got to perform multi-class classification in the videos, where the target variable could take on three possible outcomes. Why does he get to have all the fun?! In the following exercises, you'll be working with the MNIST digits recognition dataset, which has 10 classes, the digits 0 through 9! A reduced version of the MNIST dataset is one of scikit-learn's included datasets, and that is the one we will use in this exercise.

Each sample in this scikit-learn dataset is an 8x8 image representing a handwritten digit. Each pixel is represented by an integer in the range 0 to 16, indicating varying levels of black. Recall that scikit-learn's built-in datasets are of type Bunch, which are dictionary-like objects. Helpfully for the MNIST dataset, scikit-learn provides an 'images' key in addition to the 'data' and 'target' keys that you have seen with the Iris data. Because it is a 2D array of the images corresponding to each sample, this 'images' key is useful for visualizing the images, as you'll see in this exercise (for more on plotting 2D arrays, see Chapter 2 of DataCamp's course on Data Visualization with Python). On the other hand, the 'data' key contains the feature array - that is, the images as a flattened array of 64 pixels.

Notice that you can access the keys of these Bunch objects in two different ways: By using the . notation, as in digits.images, or the [] notation, as in digits['images'].
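
A quick sketch of both access styles, and of how the 'images' and 'data' keys relate (`digits_demo` is just a local name used here):

```python
import numpy as np
from sklearn import datasets

digits_demo = datasets.load_digits()

# Attribute access and key access give the same array
same_images = np.array_equal(digits_demo.images, digits_demo['images'])

# Flattening each 8x8 image row by row gives the 64-value feature vector
flattened = digits_demo.images.reshape(len(digits_demo.images), -1)
matches_data = np.array_equal(flattened, digits_demo.data)

print(digits_demo.images.shape, digits_demo.data.shape)
print(same_images, matches_data)
```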

For more on the MNIST data, check out this exercise in Part 1 of DataCamp's Importing Data in Python course. There, the full version of the MNIST dataset is used, in which the images are 28x28. It is a famous dataset in machine learning and computer vision, and frequently used as a benchmark to evaluate the performance of a new model.

In [None]:
# Import necessary modules
from sklearn import datasets
import matplotlib.pyplot as plt

# Load the digits dataset: digits
digits = datasets.load_digits()

# Print the keys and DESCR of the dataset
print(digits.keys())
print(digits.DESCR)

# Print the shape of the images and data keys
print(digits.images.shape)
print(digits.data.shape)

# Display digit 1010
plt.imshow(digits.images[1010], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

# it shows the hand-written number 5

#### ii. Train/Test Split + Fit/Predict/Accuracy
- build a classifier that can make this prediction not only for this image, but for all the other ones in the dataset

Now that you have learned about the importance of splitting your data into training and test sets, it's time to practice doing this on the digits dataset! After creating arrays for the features and target variable, you will split them into training and test sets, fit a k-NN classifier to the training data, and then compute its accuracy using the .score() method.

In [None]:
# Import necessary modules
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.model_selection import train_test_split

# Create feature and target arrays
X = digits.data
y = digits.target

# Split into training and test set
# stratify: Stratify the split according to the labels so that 
# they are distributed in the training and test sets as they are 
# in the original dataset.
X_train, X_test, y_train, y_test = train_test_split(X, y, 
            test_size = 0.2, random_state=42, stratify=y)

# Create a k-NN classifier with 7 neighbors: knn
knn = KNeighborsClassifier(n_neighbors=7)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Print the accuracy
print(knn.score(X_test, y_test))

# out: 0.983333333333
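
To see what `stratify=y` does, the class proportions of the full dataset can be compared with those of the two splits. A small sketch using the same split settings as above (variable names are illustrative):

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits_demo = datasets.load_digits()
X_all, y_all = digits_demo.data, digits_demo.target

_, _, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all)

# Fraction of samples belonging to each digit class
full_share = np.bincount(y_all) / len(y_all)
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)

print(np.round(full_share, 3))
print(np.round(train_share, 3))
print(np.round(test_share, 3))
```

The three rows should be nearly identical, which is the point of stratifying.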

#### iii. Overfitting and Underfitting
Remember the model complexity curve that Hugo showed in the video? You will now construct such a curve for the digits dataset! In this exercise, you will compute and plot the training and testing accuracy scores for a variety of different neighbor values. By observing how the accuracy scores differ for the training and testing sets with different values of k, you will develop your intuition for overfitting and underfitting.

The training and testing sets are available to you in the workspace as X_train, X_test, y_train, y_test. In addition, KNeighborsClassifier has been imported from sklearn.neighbors.

In [None]:
# create model complexity curve for different k values in knn

# Setup arrays to store train and test accuracies
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Loop over different values of k
for i, k in enumerate(neighbors):
    # Setup a k-NN Classifier with k neighbors: knn
    knn = KNeighborsClassifier(n_neighbors=k)

    # Fit the classifier to the training data
    knn.fit(X_train, y_train)
    
    #Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)

    #Compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()


Conclusion: It looks like the test accuracy is highest when using 3 and 5 neighbors. Using 8 neighbors or more seems to result in a simple model that underfits the data. 

## 2. Regression
- continuous variables

In [None]:
# Boston housing data example
boston = pd.read_csv('boston.csv')
print(boston.head())

# creating feature and target arrays
X = boston.drop('MEDV', axis=1).values
y = boston['MEDV'].values

# predict house value from a single feature
X_rooms = X[:,5]
# check type: they are both NumPy arrays
type(X_rooms), type(y)
y = y.reshape(-1,1)
X_rooms = X_rooms.reshape(-1,1)
# plot house value vs number of rooms
plt.scatter(X_rooms, y)
plt.ylabel('Value of house /1000 ($)')
plt.xlabel('Number of rooms')
plt.show()

# Fit a regression model
import numpy as np
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(X_rooms, y)
prediction_space = np.linspace(X_rooms.min(),
                              X_rooms.max()).reshape(-1,1)
plt.scatter(X_rooms, y, color='blue')
plt.plot(prediction_space, reg.predict(prediction_space),
        color='black', linewidth=3)
plt.show()

### a. Importing data for supervised learning
In this chapter, you will work with Gapminder data that we have consolidated into one CSV file available in the workspace as 'gapminder.csv'. Specifically, your goal will be to use this data to predict the life expectancy in a given country based on features such as the country's GDP, fertility rate, and population. As in Chapter 1, the dataset has been preprocessed.

Since the target variable here is quantitative, this is a regression problem. To begin, you will fit a linear regression with just one feature: 'fertility', which is the average number of children a woman in a given country gives birth to. In later exercises, you will use all the features to build regression models.

Before that, however, you need to import the data and get it into the form needed by scikit-learn. This involves creating feature and target variable arrays. Furthermore, since you are going to use only one feature to begin with, you need to do some reshaping using NumPy's .reshape() method. Don't worry too much about this reshaping right now, but it is something you will have to do occasionally when working with scikit-learn so it is useful to practice.

In [None]:
# Import numpy and pandas
import numpy as np
import pandas as pd

# Read the CSV file into a DataFrame: df
df = pd.read_csv('gapminder.csv')

# Create arrays for features and target variable
y = df['life'].values
X = df['fertility'].values

# Print the dimensions of X and y before reshaping
print("Dimensions of y before reshaping: {}".format(y.shape))
print("Dimensions of X before reshaping: {}".format(X.shape))

# Reshape X and y
y = y.reshape(-1,1)
X = X.reshape(-1,1)

# Print the dimensions of X and y after reshaping
print("Dimensions of y after reshaping: {}".format(y.shape))
print("Dimensions of X after reshaping: {}".format(X.shape))


<script.py> output:
    Dimensions of y before reshaping: (139,)
    Dimensions of X before reshaping: (139,)
    Dimensions of y after reshaping: (139, 1)
    Dimensions of X after reshaping: (139, 1)

Notice the differences in shape before and after applying the .reshape() method. Getting the feature and target variable arrays into the right format for scikit-learn is an important precursor to model building.
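
A minimal illustration of the reshape on a hypothetical four-element array:

```python
import numpy as np

values = np.array([2.5, 1.8, 3.1, 4.0])   # hypothetical 1-D feature, shape (4,)
column = values.reshape(-1, 1)            # -1 lets NumPy infer the number of rows
print(values.shape, column.shape)
```

The data are unchanged; only the shape goes from one dimension to a single-column 2-D array, which is what scikit-learn expects for a feature array.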

### b. Exploring the Gapminder data
As always, it is important to explore your data before building models. On the right, we have constructed a heatmap showing the correlation between the different features of the Gapminder dataset, which has been pre-loaded into a DataFrame as df and is available for exploration in the IPython Shell. Cells that are in green show positive correlation, while cells that are in red show negative correlation. Take a moment to explore this: Which features are positively correlated with life, and which ones are negatively correlated? Does this match your intuition?

Then, in the IPython Shell, explore the DataFrame using pandas methods such as .info(), .describe(), .head().

In case you are curious, the heatmap was generated using Seaborn's heatmap function (http://seaborn.pydata.org/generated/seaborn.heatmap.html) and the following line of code, where df.corr() computes the pairwise correlation between columns:

sns.heatmap(df.corr(), square=True, cmap='RdYlGn')

Once you have a feel for the data, consider the statements below and select the one that is not true. After this, Hugo will explain the mechanics of linear regression in the next video and you will be on your way building regression models!

In [None]:
# explore df
df.info()
df.describe()
df.head()

### c. Basics of linear regression

Regression mechanics:

y = ax + b
- y = target
- x = single feature
- a, b = parameters of model

How do we choose a and b?
1. Define an error function (loss/cost function) for any given line
2. Choose the line that minimizes the error function

The loss function: Ordinary Least Squares (OLS) - minimize sum of squares of residuals
- residual = vertical distance between dot (data) and regression line
- same as minimizing the mean squared errors on the training set
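
For a single feature, the least-squares a and b have a closed form: a is the covariance of x and y divided by the variance of x, and b makes the line pass through the point of means. The sketch below checks this against `LinearRegression` on hypothetical synthetic data (the true slope 2 and intercept 1 are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y_line = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)

# Closed-form least-squares solution for a single feature
a_hat = np.sum((x - x.mean()) * (y_line - y_line.mean())) / np.sum((x - x.mean()) ** 2)
b_hat = y_line.mean() - a_hat * x.mean()

# Residuals: vertical distances between the data and the fitted line
residuals = y_line - (a_hat * x + b_hat)
sum_of_squares = np.sum(residuals ** 2)

reg_check = LinearRegression().fit(x.reshape(-1, 1), y_line)
print(a_hat, b_hat, sum_of_squares)
print(reg_check.coef_[0], reg_check.intercept_)
```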


Linear regression in higher dimensions:

y = a1*x1 + a2*x2 + b
- To fit a linear regression model here, you need to specify 3 parameters: a1, a2 and b
- In higher dimensions, must specify a coefficient for each feature (ai) and the intercept b
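
In scikit-learn the fitted coefficients end up in `coef_` and the intercept in `intercept_`. A small sketch on hypothetical noise-free data built from known parameters (a1 = 3, a2 = -2, b = 5), so the fit should recover them:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_two = rng.normal(size=(100, 2))
# Noise-free target built from known parameters a1=3, a2=-2, b=5
y_two = 3.0 * X_two[:, 0] - 2.0 * X_two[:, 1] + 5.0

reg_two = LinearRegression().fit(X_two, y_two)
print(reg_two.coef_, reg_two.intercept_)
```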

In [None]:
# Example: Linear regression on all features
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.3, random_state=42)

reg_all = linear_model.LinearRegression()
reg_all.fit(X_train, y_train)
y_pred = reg_all.predict(X_test)
reg_all.score(X_test, y_test)

# R2 ("R squared") - default scoring method for linear regression

# Note: usually will use Regularization instead of 
# linear regression like this to put further constraints on
# model coefficients
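
R squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below verifies that this matches `.score()` on hypothetical synthetic data (coefficients and noise level are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X_syn = rng.normal(size=(200, 3))
y_syn = X_syn @ np.array([1.5, -0.7, 2.0]) + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=42)
reg_demo = LinearRegression().fit(X_tr, y_tr)
y_hat = reg_demo.predict(X_te)

ss_res = np.sum((y_te - y_hat) ** 2)          # unexplained variation
ss_tot = np.sum((y_te - y_te.mean()) ** 2)    # total variation around the mean
r2_manual = 1 - ss_res / ss_tot
print(r2_manual, reg_demo.score(X_te, y_te))
```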

#### i. Fit and Predict for regression (1 feature)
Now, you will fit a linear regression and predict life expectancy using just one feature. You saw Andy do this earlier using the 'RM' feature of the Boston housing dataset. In this exercise, you will use the 'fertility' feature of the Gapminder dataset. Since the goal is to predict life expectancy, the target variable here is 'life'. The array for the target variable has been pre-loaded as y and the array for 'fertility' has been pre-loaded as X_fertility.

A scatter plot with 'fertility' on the x-axis and 'life' on the y-axis has been generated. As you can see, there is a strongly negative correlation, so a linear regression should be able to capture this trend. Your job is to fit a linear regression and then predict the life expectancy, overlaying these predicted values on the plot to generate a regression line. You will also compute and print the R2 (R-squared) score using scikit-learn's .score() method.

In [None]:
# Import LinearRegression
from sklearn.linear_model import LinearRegression

# Create the regressor: reg
reg = LinearRegression()

# Create the prediction space
prediction_space = np.linspace(X_fertility.min(),
                               X_fertility.max()).reshape(-1,1)

# Fit the model to the data
reg.fit(X_fertility, y)

# Compute predictions over the prediction space: y_pred
y_pred = reg.predict(prediction_space)

# Print R^2 
print(reg.score(X_fertility, y))

# Plot regression line
plt.plot(prediction_space, y_pred, color='black', linewidth=3)
plt.show()

# out: 0.619244216774

#### ii. Train/Test split for regression
As you learned in Chapter 1, train and test sets are vital to ensure that your supervised learning model is able to generalize well to new data. This was true for classification models, and is equally true for linear regression models.

In this exercise, you will split the Gapminder dataset into training and testing sets, and then fit and predict a linear regression over all features. In addition to computing the R2 score, you will also compute the Root Mean Squared Error (RMSE), which is another commonly used metric to evaluate regression models. The feature array X and target variable array y have been pre-loaded for you from the DataFrame df.

In [None]:
# Import necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.3,
                                                    random_state=42)

# Create the regressor: reg_all
reg_all = LinearRegression()

# Fit the regressor to the training data
reg_all.fit(X_train, y_train)

# Predict on the test data: y_pred
y_pred = reg_all.predict(X_test)

# Compute and print R^2 and RMSE
print("R^2: {}".format(reg_all.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))

# out:
# R^2: 0.838046873142936
# Root Mean Squared Error: 3.2476010800377213

Using all features has improved the model score. This makes sense, as the model has more information to learn from. However, there is one potential pitfall to this process. Can you spot it? You'll learn about this, as well as how to better validate your models, next.

### d. Cross-validation
Motivation:
- with a single train/test split, the measured performance depends on how the data happens to be split
- so a single score may not be representative of the model's ability to generalize

Note:
- k folds = k-fold CV
- Tradeoff: more folds = more computationally expensive
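
What k-fold CV does can be written out as an explicit loop: split the data into k folds, hold each fold out once as the test set, fit on the rest, and score. The sketch below uses hypothetical synthetic data and assumes that, for a regressor, `cv=5` corresponds to `KFold(n_splits=5)` without shuffling, so the manual scores should match `cross_val_score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(3)
X_syn = rng.normal(size=(100, 3))
y_syn = X_syn @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.3, size=100)

manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X_syn):
    # Fit on 4 folds, score (R^2) on the held-out fold
    fold_model = LinearRegression().fit(X_syn[train_idx], y_syn[train_idx])
    manual_scores.append(fold_model.score(X_syn[test_idx], y_syn[test_idx]))

library_scores = cross_val_score(LinearRegression(), X_syn, y_syn, cv=5)
print(np.round(manual_scores, 4))
print(np.round(library_scores, 4))
```

The loop also makes the cost visible: k folds means k separate model fits.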

In [None]:
# example: cross-validation
from sklearn.model_selection import cross_val_score
reg = linear_model.LinearRegression()
# specify cv = number of folds utilized
cv_results = cross_val_score(reg, X, y, cv=5)
print(cv_results)
# compute mean
np.mean(cv_results)

#### i. 5-fold cross-validation
Cross-validation is a vital step in evaluating a model. It maximizes the amount of data that is used to train the model, as during the course of training, the model is not only trained, but also tested on all of the available data.

In this exercise, you will practice 5-fold cross validation on the Gapminder data. By default, scikit-learn's cross_val_score() function uses R^2 as the metric of choice for regression. Since you are performing 5-fold cross-validation, the function will return 5 scores. Your job is to compute these 5 scores and then take their average.

The DataFrame has been loaded as df and split into the feature/target variable arrays X and y. The modules pandas and numpy have been imported as pd and np, respectively.

In [None]:
# Import the necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Create a linear regression object: reg
reg = LinearRegression()

# Compute 5-fold cross-validation scores: cv_scores
cv_scores = cross_val_score(reg, X, y, cv=5)

# Print the 5-fold cross-validation scores
print(cv_scores)

print("Average 5-Fold CV Score: {}".format(np.mean(cv_scores)))

# out
#[ 0.81720569  0.82917058  0.90214134  0.80633989  0.94495637]
#Average 5-Fold CV Score: 0.8599627722793232

#### ii. K-Fold CV comparison
Cross validation is essential but do not forget that the more folds you use, the more computationally expensive cross-validation becomes. In this exercise, you will explore this for yourself. Your job is to perform 3-fold cross-validation and then 10-fold cross-validation on the Gapminder dataset.

In the IPython Shell, you can use %timeit to compare how long 3-fold CV takes with how long 10-fold CV takes by running the following command with cv=3 and then with cv=10:

%timeit cross_val_score(reg, X, y, cv = ____)

pandas and numpy are available in the workspace as pd and np. The DataFrame has been loaded as df and the feature/target variable arrays X and y have been created.

In [None]:
# Import necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Create a linear regression object: reg
reg = LinearRegression()

# Perform 3-fold CV
cvscores_3 = cross_val_score(reg, X, y, cv=3)
print(np.mean(cvscores_3))

# Perform 10-fold CV
cvscores_10 = cross_val_score(reg, X, y, cv=10)
print(np.mean(cvscores_10))

# out
# 0.871871278262
# 0.843612862013

In [None]:
# time 3 vs 10 fold CV to see increase in computational expense/time
%timeit cross_val_score(reg, X, y, cv=3)
%timeit cross_val_score(reg, X, y, cv=10)

#100 loops, best of 3: 9.91 ms per loop
#10 loops, best of 3: 31.2 ms per loop

### e. Regularized regression
- Linear regression minimizes a loss function
- It chooses a coefficient for each feature variable
- Large coefficients can lead to overfitting (the model fits noise in the training data)
- Regularization = penalize large coefficients

#### i. Regularization 1: Lasso (feature selection)
- Loss function = OLS loss function + alpha * sum of abs value of coeff
- Used for feature selection
    - How? By shrinking the coefficients of less important feature to exactly 0
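
The shrink-to-zero behavior is easy to see on hypothetical synthetic data where only the first two of five features affect the target. The value alpha=0.5 is an arbitrary choice for this illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
X_syn = rng.normal(size=(200, 5))
# Only the first two features influence the target
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=0.1, size=200)

lasso_demo = Lasso(alpha=0.5).fit(X_syn, y_syn)
print(lasso_demo.coef_)
```

The coefficients of the three irrelevant features should come out as exactly 0, while the two relevant ones stay nonzero but are pulled toward 0 by the penalty.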

In [None]:
# example: Lasso regression
from sklearn.linear_model import Lasso

X_train, X_test, y_train, y_test = train_test_split(X,y,
        test_size=0.3, random_state=42)

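# normalize arg ensures all variables on same scale
# (note: `normalize` was removed from scikit-learn's linear models in
# version 1.2; a common replacement is to scale the features with
# StandardScaler in a pipeline)
lasso = Lasso(alpha=0.1, normalize=True)
lasso.fit(X_train, y_train)
lasso_pred = lasso.predict(X_test)
lasso.score(X_test, y_test)
# out: 0.595022...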

In [None]:
# example: Lasso regularization used for feature selection
from sklearn.linear_model import Lasso

# store feature names in names
names = boston.drop('MEDV', axis=1).columns

lasso = Lasso(alpha=0.1)
lasso_coef = lasso.fit(X, y).coef_

# plot the coefficients as a function of feature name
_ = plt.plot(range(len(names)), lasso_coef)
_ = plt.xticks(range(len(names)), names, rotation=60)
_ = plt.ylabel('Coefficients')
plt.show()

Regularization I: Lasso
In the video, you saw how Lasso selected out the 'RM' feature as being the most important for predicting Boston house prices, while shrinking the coefficients of certain other features to 0. Its ability to perform feature selection in this way becomes even more useful when you are dealing with data involving thousands of features.

In this exercise, you will fit a lasso regression to the Gapminder data you have been working with and plot the coefficients. Just as with the Boston data, you will find that the coefficients of some features are shrunk to 0, with only the most important ones remaining.

The feature and target variable arrays have been pre-loaded as X and y.

In [None]:
# Import Lasso
from sklearn.linear_model import Lasso

# Instantiate a lasso regressor: lasso
lasso = Lasso(alpha=0.4, normalize=True)

# Fit the regressor to the data
lasso.fit(X, y)

# Compute and print the coefficients
lasso_coef = lasso.coef_
print(lasso_coef)

# Plot the coefficients
# df_columns: the feature names (pre-loaded in the exercise)
plt.plot(range(len(df_columns)), lasso_coef)
plt.xticks(range(len(df_columns)), df_columns.values, rotation=60)
plt.margins(0.02)
plt.show()

# plot shows child_mortality feature is important when predicting
# life expectancy

#### ii. Regularization 2: Ridge (1st choice in building regression models)
- Loss function = OLS loss function + alpha * sum of coefficient^2
- Need to choose alpha parameter for the best performing model
- Picking alpha is similar to picking k in k-NN
- aka Hyperparameter tuning
- alpha (sometimes lambda) controls model complexity
    - alpha = 0: get back OLS, which can lead to overfitting
    - very high alpha can lead to underfitting
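
How alpha controls the size of the coefficients can be checked directly. In this sketch on hypothetical synthetic data, the norm of the ridge coefficients is compared across increasing alphas and against plain OLS (the alpha values are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(5)
X_syn = rng.normal(size=(100, 4))
y_syn = X_syn @ np.array([4.0, -3.0, 2.0, 1.0]) + rng.normal(scale=0.5, size=100)

ols_norm = np.linalg.norm(LinearRegression().fit(X_syn, y_syn).coef_)

alphas = [0.01, 1.0, 100.0, 10000.0]
# Size of the coefficient vector for each penalty strength
ridge_norms = [np.linalg.norm(Ridge(alpha=a).fit(X_syn, y_syn).coef_) for a in alphas]
print(ols_norm)
print(ridge_norms)
```

A tiny alpha gives nearly the OLS coefficients, and the coefficients shrink toward 0 as alpha grows, which is where underfitting comes from.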

In [None]:
# example: Ridge regression
from sklearn.linear_model import Ridge

X_train, X_test, y_train, y_test = train_test_split(X,y,
        test_size=0.3, random_state=42)

# normalize arg ensures all variables on same scale
ridge = Ridge(alpha=0.1, normalize=True)
ridge.fit(X_train, y_train)
ridge_pred = ridge.predict(X_test)
ridge.score(X_test, y_test)
# out: 0.69969...

Regularization II: Ridge
Lasso is great for feature selection, but when building regression models, Ridge regression should be your first choice.

Recall that lasso performs regularization by adding to the loss function a penalty term of the absolute value of each coefficient multiplied by some alpha. This is also known as L1 regularization because the regularization term is the L1 norm of the coefficients. This is not the only way to regularize, however.

If instead you took the sum of the squared values of the coefficients multiplied by some alpha - like in Ridge regression - you would be computing the L2 norm. In this exercise, you will practice fitting ridge regression models over a range of different alphas, and plot cross-validated R2 scores for each, using this function that we have defined for you, which plots the R2 score as well as standard error for each alpha:

def display_plot(cv_scores, cv_scores_std):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.plot(alpha_space, cv_scores)

    std_error = cv_scores_std / np.sqrt(10)

    ax.fill_between(alpha_space, cv_scores + std_error, cv_scores - std_error, alpha=0.2)
    ax.set_ylabel('CV Score +/- Std Error')
    ax.set_xlabel('Alpha')
    ax.axhline(np.max(cv_scores), linestyle='--', color='.5')
    ax.set_xlim([alpha_space[0], alpha_space[-1]])
    ax.set_xscale('log')
    plt.show()

Don't worry about the specifics of how the above function works. The motivation behind this exercise is for you to see how the R2 score varies with different alphas, and to understand the importance of selecting the right value for alpha.

In [None]:
# example: Ridge regularization

# Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Setup the array of alphas and lists to store scores
alpha_space = np.logspace(-4, 0, 50)
ridge_scores = []
ridge_scores_std = []

# Create a ridge regressor: ridge
ridge = Ridge(normalize=True)

# Compute scores over range of alphas
for alpha in alpha_space:

    # Specify the alpha value to use: ridge.alpha
    ridge.alpha = alpha
    
    # Perform 10-fold CV: ridge_cv_scores
    ridge_cv_scores = cross_val_score(ridge,X,y,cv=10)
    
    # Append the mean of ridge_cv_scores to ridge_scores
    ridge_scores.append(np.mean(ridge_cv_scores))
    
    # Append the std of ridge_cv_scores to ridge_scores_std
    ridge_scores_std.append(np.std(ridge_cv_scores))

# Display the plot
display_plot(ridge_scores, ridge_scores_std)


## 3. Fine-tuning your model

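So far model performance has been measured with accuracy, but accuracy is not always a useful metric.
- consider class imbalance: if 99% of emails are real and 1% are spam, a classifier that labels every email as real is 99% accurate but never catches any spam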
More nuanced performance metrics:
- Diagnosing classification predictions
    - Confusion matrix (T/F Positive/Negative)
- Classification report (Confusion matrix metrics):
    - Accuracy
    - Precision (PPV - positive predictive value)
        = tp / (tp + fp)
        - high precision: not many real emails predicted as spam
    - Recall = tp / (tp + fn) aka Sensitivity, Hit rate, True Positive Rate
        - high recall: predicted most spam emails correctly
    - F1 score = 2*(precision*recall)/(precision+recall) = harmonic mean of precision and recall
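
The formulas above can be worked through with concrete numbers. The counts below are hypothetical (1000 emails, spam as the positive class), and the label arrays are rebuilt from those counts only to cross-check the arithmetic against scikit-learn's metric functions:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical counts for 1000 emails, with spam as the positive class
tp, fp, fn, tn = 40, 10, 20, 930

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Rebuild label arrays with the same counts to cross-check against scikit-learn
y_true = np.array([1] * tp + [1] * fn + [0] * fp + [0] * tn)
y_hat = np.array([1] * tp + [0] * fn + [1] * fp + [0] * tn)

# Baseline that labels every email as real (class 0)
baseline_accuracy = np.mean(y_true == 0)
print(accuracy, precision, recall, f1, baseline_accuracy)
```

The do-nothing baseline is almost as accurate as the classifier here even though its recall for spam is 0, which is why precision and recall are worth reporting alongside accuracy.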

In [None]:
# Confusion matrix in scikit-learn
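from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=8)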
X_train, X_test, y_train, y_test = train_test_split(X, y,
        test_size=0.4, random_state=42)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# create confusion matrix
# note: y_test is the true label, prediction label is the 2nd arg
print(confusion_matrix(y_test, y_pred))
# print classification report: precision, recall, f1-score, support
print(classification_report(y_test, y_pred))

Metrics for classification
In Chapter 1, you evaluated the performance of your k-NN classifier based on its accuracy. However, accuracy is not always an informative metric. In this exercise, you will dive more deeply into evaluating the performance of binary classifiers by computing a confusion matrix and generating a classification report.

You may have noticed in the video that the classification report consisted of three rows, and an additional support column. 

Support column in classification report: 
- The support gives the number of samples of the true response that lie in that class - so in the video example, the support was the number of Republicans or Democrats in the test set on which the classification report was computed. The precision, recall, and f1-score columns, then, gave the respective metrics for that particular class.

Here, you'll work with the PIMA Indians dataset obtained from the UCI Machine Learning Repository. The goal is to predict whether or not a given female patient will contract diabetes based on features such as BMI, age, and number of pregnancies. Therefore, it is a binary classification problem. A target value of 0 indicates that the patient does not have diabetes, while a value of 1 indicates that the patient does have diabetes. As in Chapters 1 and 2, the dataset has been preprocessed to deal with missing values.

The dataset has been loaded into a DataFrame df and the feature and target variable arrays X and y have been created for you. In addition, sklearn.model_selection.train_test_split and sklearn.neighbors.KNeighborsClassifier have already been imported.

Your job is to train a k-NN classifier to the data and evaluate its performance by generating a confusion matrix and classification report.

In [None]:
# example: knn confusion matrix and classification report

# Import necessary modules
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# Create training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.4, random_state=42)

# Instantiate a k-NN classifier: knn
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)

# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# out
#[[176  30]
# [ 52  50]]
#             precision    recall  f1-score   support

#          0       0.77      0.85      0.81       206
#          1       0.62      0.49      0.55       102

#avg / total       0.72      0.73      0.72       308

### 3.1 Logistic regression and ROC curve
- used in classification problems (not regression)

Logistic regression for binary classification:
- logistic regression outputs probabilities
    - if probability, p, > 0.5: data labeled '1'
    - p < 0.5: data labeled '0'
- Probability thresholds
    - default logistic regression threshold = 0.5
    - thresholds are not specific to logistic regression; k-NN classifiers also have them
    - What happens if the threshold varies? What happens to True Positive and False Positive rates with threshold varied? Look at ROC Curve

- ROC curve (Receiver Operating Characteristic curve)
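
Varying the threshold by hand makes the idea concrete. The sketch below fits a logistic regression to hypothetical synthetic data from `make_classification` and turns the predicted probabilities into labels at three thresholds (0.3 and 0.7 are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_syn, y_syn = make_classification(n_samples=300, n_features=5, random_state=0)
logreg_demo = LogisticRegression().fit(X_syn, y_syn)

# Predicted probability of class 1 for each sample
p_class1 = logreg_demo.predict_proba(X_syn)[:, 1]

default_labels = (p_class1 > 0.5).astype(int)   # same rule .predict() applies
lenient_labels = (p_class1 > 0.3).astype(int)   # more samples labeled 1
strict_labels = (p_class1 > 0.7).astype(int)    # fewer samples labeled 1
print(default_labels.sum(), lenient_labels.sum(), strict_labels.sum())
```

Lowering the threshold labels more points as 1, which can only raise both the true positive and false positive rates; the ROC curve traces that trade-off over all thresholds.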

In [None]:
# example: logistic regression in scikit-learn

# Import necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# instantiate logistic regression classifier
logreg = LogisticRegression()

# split the data in training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.4, random_state=42)

# Fit the classifier to the training data
logreg.fit(X_train, y_train)

# Predict on test set
y_pred = logreg.predict(X_test)

In [None]:
# example: plot ROC curve
from sklearn.metrics import roc_curve
# y_pred_prob = predicted probabilities
# .predict_proba returns an array with 2 columns, each column
# contains probabilities for the respective target values,
# we choose the 2nd column, which is index 1, the probabilities
# being 1
y_pred_prob = logreg.predict_proba(X_test)[:,1]
# fpr = false positive rate, tpr = true positive rate
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

plt.plot([0,1],[0,1],'k--')
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.show();

#### 3.1.a Building a logistic regression model
Building a logistic regression model

Time to build your first logistic regression model! As Hugo showed in the video, scikit-learn makes it very easy to try different models, since the Train-Test-Split/Instantiate/Fit/Predict paradigm applies to all classifiers and regressors - which are known in scikit-learn as 'estimators'. You'll see this now for yourself as you train a logistic regression model on exactly the same data as in the previous exercise. Will it outperform k-NN? There's only one way to find out!

The feature and target variable arrays X and y have been pre-loaded, and train_test_split has been imported for you from sklearn.model_selection.

In [None]:
# example: build logistic regression model - binary classification

# Import the necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state=42)

# Create the classifier: logreg
logreg = LogisticRegression()

# Fit the classifier to the training data
logreg.fit(X_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = logreg.predict(X_test)

# Compute and print the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# out
#[[176  30]
# [ 35  67]]
#             precision    recall  f1-score   support

#          0       0.83      0.85      0.84       206
#          1       0.69      0.66      0.67       102

#avg / total       0.79      0.79      0.79       308

#### 3.1.b Plotting ROC curve - visually evaluate models
Classification reports and confusion matrices are great methods to quantitatively evaluate model performance, while ROC curves provide a way to visually evaluate models. 

.predict_proba()
- Most classifiers in scikit-learn have a .predict_proba() method which returns the probability of a given sample being in a particular class. Having built a logistic regression model, you'll now evaluate its performance by plotting an ROC curve. In doing so, you'll make use of the .predict_proba() method and become familiar with its functionality.
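
The following is a small sketch of what `.predict_proba()` returns for a binary problem. It uses synthetic data from `make_classification` as a stand-in, since the diabetes arrays are not defined in these notes. The sample size and random seed are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data as a stand-in for the diabetes dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
logreg = LogisticRegression(max_iter=1000).fit(X, y)

proba = logreg.predict_proba(X)
print(proba.shape)             # one column per class
print(proba[:3].sum(axis=1))   # probabilities in each row sum to 1

# Thresholding the class-1 column at 0.5 reproduces .predict()
manual = (proba[:, 1] > 0.5).astype(int)
print(np.array_equal(manual, logreg.predict(X)))
```

Column 0 holds the probability of class 0 and column 1 the probability of class 1, which is why the ROC examples in these notes select `[:, 1]`.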

Here, you'll continue working with the PIMA Indians diabetes dataset. The classifier has already been fit to the training data and is available as logreg.

In [None]:
# Import necessary modules
from sklearn.metrics import roc_curve

# Compute predicted probabilities: y_pred_prob
y_pred_prob = logreg.predict_proba(X_test)[:,1]

# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr) # x-axis: fpr, y-axis: tpr
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

#### 3.1.c Precision-Recall Curve
When looking at your ROC curve, you may have noticed that the y-axis (True positive rate) is also known as recall. Indeed, in addition to the ROC curve, there are other ways to visually evaluate model performance. One such way is the precision-recall curve, which is generated by plotting the precision and recall for different thresholds. As a reminder, precision and recall are defined as:

- Precision=TP/(TP+FP)
- Recall=TP/(TP+FN)
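
As a quick check of these definitions, the sketch below plugs in the counts from the logistic regression confusion matrix printed earlier in these notes (`[[176, 30], [35, 67]]`). It assumes scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
# Counts taken from the confusion matrix [[176, 30], [35, 67]],
# assuming scikit-learn's layout: rows = true class, columns = predicted class
tn, fp = 176, 30
fn, tp = 35, 67

precision = tp / (tp + fp)  # of all predicted positives, how many were right
recall = tp / (tp + fn)     # of all actual positives, how many were found

print("precision: {:.2f}".format(precision))
print("recall: {:.2f}".format(recall))
```

Rounded to two decimals, these agree with the 0.69 precision and 0.66 recall reported for class 1 in the classification report.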

On the right, a precision-recall curve has been generated for the diabetes dataset. The classification report and confusion matrix are displayed in the IPython Shell.

Study the precision-recall curve and then consider the statements given below. Choose the one statement that is not true. Note that here, the class is positive (1) if the individual has diabetes.

             precision    recall  f1-score   support

          0       0.83      0.85      0.84       206
          1       0.69      0.66      0.67       102

    avg / total       0.79      0.79      0.79       308

    [[176  30]
     [ 35  67]]
     
     
TRUE: A recall of 1 corresponds to a classifier with a low threshold in which all females who contract diabetes were correctly classified as such, at the expense of many misclassifications of those who did not have diabetes.

TRUE: Precision is undefined for a classifier which makes no positive predictions, that is, classifies everyone as not having diabetes.

TRUE: When the threshold is very close to 1, precision is also 1, because the classifier is absolutely certain about its predictions.

FALSE: Precision and recall take true negatives into consideration.

### 3.2 Area under the ROC curve (AUC)
- One popular metric for classification models
- Larger AUC = better model

In [None]:
# example: Compute AUC
from sklearn.metrics import roc_auc_score

logreg = LogisticRegression()

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                        test_size = 0.4, random_state=42)

# Fit the classifier to the training data
logreg.fit(X_train, y_train)

# Compute predicted probabilities to compute AUC
y_pred_prob = logreg.predict_proba(X_test)[:,1]
# Pass true labels and predicted probabilities to ROC AUC score
roc_auc_score(y_test, y_pred_prob)
# out: 0.997466216

In [None]:
# Compute AUC using cross-validation
from sklearn.model_selection import cross_val_score

# pass the estimator, features, and target
cv_scores = cross_val_score(logreg, X, y, cv=5, scoring='roc_auc')

print(cv_scores)
# out: [ 0.9967  0.99183  0.99583  1.  0.961406]

#### AUC computation
Say you have a binary classifier that in fact is just randomly making guesses. It would be correct approximately 50% of the time, and the resulting ROC curve would be a diagonal line in which the True Positive Rate and False Positive Rate are always equal. The Area under this ROC curve would be 0.5. This is one way in which the AUC, which Hugo discussed in the video, is an informative metric to evaluate a model. If the AUC is greater than 0.5, the model is better than random guessing. Always a good sign!
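
The 0.5 figure can be verified numerically. The sketch below integrates a curve given as (FPR, TPR) points with the trapezoid rule. `trapezoid_auc` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import numpy as np

def trapezoid_auc(fpr, tpr):
    """Area under a curve given as (fpr, tpr) points, via the trapezoid rule."""
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Random guessing: TPR equals FPR at every threshold (the diagonal)
diag = np.linspace(0, 1, 11)
print(trapezoid_auc(diag, diag))

# A perfect classifier: TPR reaches 1 while FPR is still 0
print(trapezoid_auc([0, 0, 1], [0, 1, 1]))
```

The diagonal gives an area of about 0.5, and the perfect classifier gives 1, which are the two reference points for reading an AUC score.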

In this exercise, you'll calculate AUC scores using the roc_auc_score() function from sklearn.metrics as well as by performing cross-validation on the diabetes dataset.

X and y, along with training and test sets X_train, X_test, y_train, y_test, have been pre-loaded for you, and a logistic regression classifier logreg has been fit to the training data.

In [None]:
# example: calculate AUC with 2 methods: roc_auc_score() and
# using cross_val_score()

# Import necessary modules
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

# Method 1
# Compute predicted probabilities: y_pred_prob
y_pred_prob = logreg.predict_proba(X_test)[:,1]

# Compute and print AUC score
print("AUC: {}".format(roc_auc_score(y_test, y_pred_prob)))

# Method 2
# Compute cross-validated AUC scores: cv_auc
cv_auc = cross_val_score(logreg, X, y, cv=5, scoring='roc_auc')

# Print list of AUC scores
print("AUC scores computed using 5-fold cross-validation: {}".
      format(cv_auc))

# out
#AUC: 0.8254806777079764
#AUC scores computed using 5-fold cross-validation: 
#[ 0.80148148  0.8062963   0.81481481  0.86245283  0.8554717 ]

### 3.3 Hyperparameter tuning
Review
- Linear regression: choose parameters
- Ridge/Lasso regression: choose alpha
- k-Nearest Neighbors: choose n_neighbors

Hyperparameters: parameters that need to be specified before model fitting (so can't be learned by fitting model)
- parameters like alpha and k

Choosing the correct hyperparameter
- try a bunch of different hyperparameter values
- fit all of them separately
- see how well each performs
- choose the best performing one
- essential to use cross-validation (using train-test-split alone risks overfitting hyperparameter to test set)
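
The steps above can be sketched as a manual loop. This is only an illustration: it uses the iris dataset bundled with scikit-learn as stand-in data, and the candidate values for `n_neighbors` are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: the iris dataset bundled with scikit-learn
X, y = load_iris(return_X_y=True)

# Try a few hyperparameter values and score each with cross-validation
candidates = [1, 3, 5, 7, 9]
mean_scores = {}
for k in candidates:
    knn = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validated accuracy for this value of k
    scores = cross_val_score(knn, X, y, cv=5)
    mean_scores[k] = scores.mean()

# Choose the best performing value
best_k = max(mean_scores, key=mean_scores.get)
print("Mean CV accuracy per k:", mean_scores)
print("Best k:", best_k)
```

GridSearchCV automates this loop, including the bookkeeping of the best parameters and score.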

Methods:
- Grid search cross-validation
- Randomized Search CV

In [None]:
# GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
# keys are hyperparameter names like 'n_neighbors' in k-nn, or
# alpha in Ridge/Lasso regression
param_grid = {'n_neighbors': np.arange(1, 50)}
knn = KNeighborsClassifier()
# args: model, grid, number of folds for cross validation
knn_cv = GridSearchCV(knn, param_grid, cv=5)
# use this to fit data
knn_cv.fit(X, y)

knn_cv.best_params_
knn_cv.best_score_

#### 3.3.a Hyperparameter tuning with GridSearchCV
Hugo demonstrated how to tune the n_neighbors parameter of the KNeighborsClassifier() using GridSearchCV on the voting dataset. You will now practice this yourself, but by using logistic regression on the diabetes dataset instead!

Like the alpha parameter of lasso and ridge regularization that you saw earlier, logistic regression also has a regularization parameter: C. C controls the inverse of the regularization strength, and this is what you will tune in this exercise. A large C can lead to an overfit model, while a small C can lead to an underfit model.
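
One way to see the effect of C is to look at how strongly the coefficients are shrunk. The sketch below is an illustration on synthetic data (not the diabetes dataset), with three arbitrary values of C. It relies on scikit-learn's default L2 penalty for `LogisticRegression`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the diabetes dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

coef_norms = {}
for C in [0.001, 1.0, 1000.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    # L2 norm of the learned coefficients: smaller means more shrinkage
    coef_norms[C] = float(np.linalg.norm(model.coef_))

for C, norm in coef_norms.items():
    print("C={:<8} coefficient norm={:.4f}".format(C, norm))
```

A small C means strong regularization, so the coefficients should be pulled toward zero (risking underfitting). A large C leaves them nearly unconstrained (risking overfitting).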

The hyperparameter space for C has been setup for you. Your job is to use GridSearchCV and logistic regression to find the optimal C in this hyperparameter space. The feature array is available as X and target variable array is available as y.

You may be wondering why you aren't asked to split the data into training and test sets. Good observation! Here, we want you to focus on the process of setting up the hyperparameter grid and performing grid-search cross-validation. In practice, you will indeed want to hold out a portion of your data for evaluation purposes, and you will learn all about this in the next video!

In [None]:
# Tune (optimize) C regularization parameter for logistic regression
# This example doesn't have a test and training set, but normally
# you should

# Import necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression()

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit it to the data
logreg_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_)) 
print("Best score is {}".format(logreg_cv.best_score_))

# out
#Tuned Logistic Regression Parameters: {'C': 3.7275937203149381}
#Best score is 0.7708333333333334

#### 3.3.b Hyperparameter tuning with RandomizedSearchCV
- faster alternative to GridSearchCV

GridSearchCV can be computationally expensive, especially if you are searching over a large hyperparameter space and dealing with multiple hyperparameters. A solution to this is to use RandomizedSearchCV, in which not all hyperparameter values are tried out. Instead, a fixed number of hyperparameter settings is sampled from specified probability distributions. You'll practice using RandomizedSearchCV in this exercise and see how this works.
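
To get a feel for the savings, the sketch below counts candidate settings for a discretized decision-tree search space similar to the one used in this exercise. It uses only the standard library and does not fit any models. The budget of 10 sampled settings is an assumption for the illustration (it also happens to be the default `n_iter` of RandomizedSearchCV).

```python
import itertools
import random

# A discrete version of a decision-tree search space
grid = {
    "max_depth": [3, None],
    "max_features": list(range(1, 9)),      # 1..8
    "min_samples_leaf": list(range(1, 9)),  # 1..8
    "criterion": ["gini", "entropy"],
}

# Grid search: every combination is tried
all_settings = [dict(zip(grid, values))
                for values in itertools.product(*grid.values())]
print("Grid search settings:", len(all_settings))

# Randomized search: only a fixed number of settings is sampled
n_iter = 10  # budget assumed for this sketch
rng = random.Random(42)
sampled = [{name: rng.choice(options) for name, options in grid.items()}
           for _ in range(n_iter)]
print("Randomized search settings:", len(sampled))

cv = 5
print("Model fits with {}-fold CV: {} vs {}".format(
    cv, len(all_settings) * cv, len(sampled) * cv))
```

The grid has 2 x 8 x 8 x 2 = 256 settings, so 5-fold grid search needs 1280 fits, while the randomized search with this budget needs 50.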

Here, you'll also be introduced to a new model: the Decision Tree. Don't worry about the specifics of how this model works. Just like k-NN, linear regression, and logistic regression, decision trees in scikit-learn have .fit() and .predict() methods that you can use in exactly the same way as before. Decision trees have many parameters that can be tuned, such as max_features, max_depth, and min_samples_leaf: This makes it an ideal use case for RandomizedSearchCV.

As before, the feature array X and target variable array y of the diabetes dataset have been pre-loaded. The hyperparameter settings have been specified for you. Your goal is to use RandomizedSearchCV to find the optimal hyperparameters. Go for it!

In [None]:
# Import necessary modules
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

# Setup the parameters and distributions to sample from: param_dist
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

# Instantiate a Decision Tree classifier: tree
tree = DecisionTreeClassifier()

# Instantiate the RandomizedSearchCV object: tree_cv
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)

# Fit it to the data
tree_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))

# out
#Tuned Decision Tree Parameters: {'max_depth': 3, 
#    'criterion': 'entropy', 'min_samples_leaf': 3, 
#    'max_features': 5}
#Best score is 0.7330729166666666

#### Note: when both search the same set of candidate values, RandomizedSearchCV cannot outperform GridSearchCV, but it saves computation time

### 3.4 Hold-out set for final evaluation

#### 3.4.a Hold-out set reasoning
- Use test data set to evaluate model performance
- Using all data for cross-validation is not ideal
- split data into training and hold-out (test) set at beginning
- perform grid search cross-validation on training set
- choose best hyperparameters and evaluate on hold-out set

#### 3.4.b Hold-out set in practice I: Classification
You will now practice evaluating a model with tuned hyperparameters on a hold-out set. The feature array and target variable array from the diabetes dataset have been pre-loaded as X and y.

In addition to C, logistic regression has a 'penalty' hyperparameter which specifies whether to use 'l1' or 'l2' regularization. 

Your job in this exercise is to create a hold-out set and tune the 'C' and 'penalty' hyperparameters of a logistic regression classifier using GridSearchCV on the training set.

In [None]:
# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Create the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space, 'penalty': ['l1', 'l2']}

# Instantiate the logistic regression classifier: logreg
# (the 'l1' penalty needs a solver that supports it, such as 'liblinear')
logreg = LogisticRegression(solver='liblinear')

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.4, random_state=42)

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit it to the training data
logreg_cv.fit(X_train, y_train)

# Print the optimal parameters and best score
print("Tuned Logistic Regression Parameter: {}".format(logreg_cv.best_params_))
print("Tuned Logistic Regression Accuracy: {}".format(logreg_cv.best_score_))

# out
# Tuned Logistic Regression Parameter: {'C': 0.43939705607607948, 'penalty': 'l1'}
# Tuned Logistic Regression Accuracy: 0.7652173913043478

#### 3.4.c Hold-out set in practice II: Regression
Remember lasso and ridge regression from the previous chapter? Lasso used the L1 penalty to regularize, while ridge used the L2 penalty. There is another type of regularized regression known as the elastic net. 

In elastic net regularization, the penalty term is a linear combination of the L1 and L2 penalties:
- a∗L1+b∗L2

In scikit-learn, this term is represented by the 'l1_ratio' parameter: An 'l1_ratio' of 1 corresponds to an L1 penalty, and anything lower is a combination of L1 and L2.
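
As a rough sketch of how 'l1_ratio' maps onto the weights a and b, the function below follows the penalty parameterization given in scikit-learn's ElasticNet documentation: a = alpha * l1_ratio and b = 0.5 * alpha * (1 - l1_ratio), with L2 taken as the squared Euclidean norm. `elastic_net_penalty` and the coefficient vector `w` are hypothetical names used only for this illustration.

```python
import numpy as np

def elastic_net_penalty(w, alpha=1.0, l1_ratio=0.5):
    """Penalty term a*L1 + b*L2 with a = alpha * l1_ratio and
    b = 0.5 * alpha * (1 - l1_ratio), L2 being the squared Euclidean norm."""
    w = np.asarray(w, dtype=float)
    l1 = np.sum(np.abs(w))
    l2 = np.sum(w ** 2)
    return float(alpha * l1_ratio * l1 + 0.5 * alpha * (1 - l1_ratio) * l2)

w = [1.0, -2.0, 0.0]  # hypothetical coefficient vector
print(elastic_net_penalty(w, l1_ratio=1.0))  # pure L1 penalty
print(elastic_net_penalty(w, l1_ratio=0.0))  # pure L2 penalty
print(elastic_net_penalty(w, l1_ratio=0.5))  # a mix of both
```

For this vector the L1 norm is 3 and the squared L2 norm is 5, so the three calls give 3.0, 2.5, and 2.75 respectively.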

In this exercise, you will use GridSearchCV to tune the 'l1_ratio' of an elastic net model trained on the Gapminder data. As in the previous exercise, use a hold-out set to evaluate your model's performance.

In [None]:
# Import necessary modules
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.4, random_state=42)

# Create the hyperparameter grid
l1_space = np.linspace(0, 1, 30)
param_grid = {'l1_ratio': l1_space}

# Instantiate the ElasticNet regressor: elastic_net
elastic_net = ElasticNet()

# Setup the GridSearchCV object: gm_cv
# Use GridSearchCV with 5-fold cross-validation to tune 'l1_ratio' 
# on the training data X_train and y_train. This involves 
# first instantiating the GridSearchCV object with the correct 
# parameters and then fitting it to the training data.
gm_cv = GridSearchCV(elastic_net, param_grid, cv=5)

# Fit it to the training data
gm_cv.fit(X_train, y_train)

# Predict on the test set and compute metrics
y_pred = gm_cv.predict(X_test)
r2 = gm_cv.score(X_test, y_test)
mse = mean_squared_error(y_test, y_pred)
print("Tuned ElasticNet l1 ratio: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))
print("Tuned ElasticNet MSE: {}".format(mse))

# out
# Tuned ElasticNet l1 ratio: {'l1_ratio': 0.20689655172413793}
# Tuned ElasticNet R squared: 0.8668305372460283
# Tuned ElasticNet MSE: 10.05791413339844

## 4. Preprocessing and Pipelines

### 4.1 Preprocessing Data
Dealing with categorical features
- scikit-learn will not accept categorical features
- need to encode these features numerically
- convert to 'dummy variables'
    - 0: Observation was NOT that category
    - 1: Observation was that category
- example: 3 origins for a car
    - origin_Asia 0 or 1
    - origin_Europe 0 or 1
    - origin_USA 0 or 1
    - one column can be dropped because it is implied by the other two (a car that is not from Europe or the USA must be from Asia); keeping all three duplicates information, which can cause issues in some models
    
Dealing with categorical features:
- scikit-learn: OneHotEncoder()
- pandas: get_dummies()
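
Here is a minimal, self-contained sketch of the pandas route. The tiny DataFrame is made up for illustration and only mimics the shape of a car dataset with an 'origin' column.

```python
import pandas as pd

# Hypothetical miniature version of a car dataset
df = pd.DataFrame({
    "mpg": [18.0, 30.5, 24.0, 15.0],
    "origin": ["USA", "Asia", "Europe", "USA"],
})

# One indicator column per category
df_all = pd.get_dummies(df, columns=["origin"])
print(df_all.columns.tolist())

# drop_first=True drops the first category (here 'Asia'),
# which is implied when all remaining indicators are 0
df_dropped = pd.get_dummies(df, columns=["origin"], drop_first=True)
print(df_dropped.columns.tolist())
```

Using `drop_first=True` is one way to avoid the duplicated information mentioned above. Dropping a column by hand with `.drop()` achieves the same result.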

Boxplots are useful in visualizing categorical features

In [None]:
# Example: dummy variables with pandas get_dummies()
# convert origins for car (noted above) into dummy variables
import pandas as pd
df= pd.read_csv('auto.csv')
df_origin = pd.get_dummies(df)
print(df_origin.head())
# drop the origin_Asia column since it is implied by the other two
df_origin = df_origin.drop('origin_Asia', axis=1)
print(df_origin.head())

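# Ridge regression with dummy variables
# (X and y are assumed to be the feature and target arrays built from df_origin)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
X_train, X_test, y_train, y_test = train_test_split(X, y,
        test_size=0.3, random_state=42)
# Note: the normalize argument was removed in scikit-learn 1.2; on newer
# versions, drop it and scale the features beforehand
ridge = Ridge(alpha=0.5, normalize=True).fit(X_train, y_train)
# .score() returns R^2 for regressors
ridge.score(X_test, y_test)
# out: 0.7190645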

#### 4.1.a Exploring categorical features
The Gapminder dataset that you worked with in previous chapters also contained a categorical 'Region' feature, which we dropped in previous exercises since you did not have the tools to deal with it. Now however, you do, so we have added it back in!

Your job in this exercise is to explore this feature. 

Boxplots are particularly useful for visualizing categorical features such as this.

#### 4.1.b 

#### 4.1.c 

### 4.2 Handling Missing Data

#### 4.2.a 

#### 4.2.b 

#### 4.2.c 

### 4.3 Centering and Scaling


#### 4.3.a 

#### 4.3.b 

#### 4.3.c 

#### 4.3.d 