# Data Modeling with Scikit-Learn

**Outline:**

* [Intro to Scikit-Learn](#Intro-to-Scikit-Learn)
* [Importing Built-In Datasets](#Importing-Built-In-Datasets)
* [Importing Datasets using Pandas](#Importing-Datasets-using-Pandas)
* [Creating a Model](#Creating-a-Model)
* [Training and Testing a Model](#Training-and-Testing-a-Model)
  * [Performing Cross-Validation](#Performing-Cross-Validation)
  * [Selecting Features](#Selecting-Features)
  * [Searching for Optimal Model Parameters](#Searching-for-Optimal-Model-Parameters)
* [Evaluating a Model](#Evaluating-a-Model)
* [Model Persistence](#Model-Persistence)
* [Scikit-Learn Algorithm Cheat Sheet](#Scikit-Learn-Algorithm-Cheat-Sheet)

![](supervised-classification.png)
<div style="text-align: center;">
<strong>Credit:</strong> http://www.nltk.org/book/ch06.html
</div>

## Intro to Scikit-Learn

In [None]:
from IPython.display import HTML
HTML('<iframe src="https://scikit-learn.org/" width="800" height="350"></iframe>')

## Importing Built-In Datasets

In [None]:
from sklearn.datasets import load_iris

In [None]:
iris = load_iris()
type(iris)

In [None]:
iris.data

In [None]:
type(iris.data)

In [None]:
iris.data.shape

In [None]:
iris.feature_names

In [None]:
iris.target_names

In [None]:
iris.target

In [None]:
X = iris.data
y = iris.target

In [None]:
X

In [None]:
y

## Importing Datasets using Pandas

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("iris.csv")

In [None]:
df.head()

In [None]:
X = df[[
    "sepal.length",
    "sepal.width",
    "petal.length",
    "petal.width"
]]

In [None]:
X

In [None]:
y = df["variety"]

In [None]:
y

## Creating a Model

**Note:** Scikit-learn estimators share a 4-step modeling pattern: import, instantiate, fit, predict.

### K-nearest neighbors (KNN) classification

**Step 1:** Import the model (import)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

**Step 2:** Instantiate an estimator (instantiate)

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)

**Step 3:** Fit the model (fit)

In [None]:
knn.fit(X, y)

**Step 4:** Make a prediction (predict)

In [None]:
X_new = [
    [3, 5, 4, 2],
]
knn.predict(X_new)

In [None]:
X_new = [
    [3, 5, 4, 2],
    [5, 4, 3, 2]
]
knn.predict(X_new)

### Try a different value for K

In [None]:
# instantiate
knn = KNeighborsClassifier(n_neighbors=2)

# fit
knn.fit(X, y)

# predict
knn.predict(X_new)

### Use a different classification model

In [None]:
# import
from sklearn.linear_model import LogisticRegression

# instantiate (max_iter raised above the default of 100, which can trigger a ConvergenceWarning)
logreg = LogisticRegression(max_iter=200)

# fit
logreg.fit(X, y)

# predict
logreg.predict(X_new)

In [None]:
# import
from sklearn import svm

# instantiate
clf = svm.SVC()

# fit
clf.fit(X, y)

# predict
clf.predict(X_new)

## Training and Testing a Model

### Procedure 1: Train and test on the (same) entire dataset

In [None]:
from sklearn.datasets import load_iris
from sklearn import metrics

iris = load_iris()

X = iris.data
y = iris.target

### Logistic Regression

In [None]:
# import
from sklearn.linear_model import LogisticRegression

# instantiate (max_iter raised above the default of 100, which can trigger a ConvergenceWarning)
logreg = LogisticRegression(max_iter=200)

# fit
logreg.fit(X, y)

# predict
y_pred = logreg.predict(X)

metrics.accuracy_score(y, y_pred)

### KNN (K = 5)

In [None]:
# import
from sklearn.neighbors import KNeighborsClassifier

# instantiate
knn = KNeighborsClassifier(n_neighbors=5)

# fit
knn.fit(X, y)

# predict
y_pred = knn.predict(X)

metrics.accuracy_score(y, y_pred)

### KNN (K = 1)

In [None]:
# instantiate
knn = KNeighborsClassifier(n_neighbors=1)

# fit
knn.fit(X, y)

# predict
y_pred = knn.predict(X)

metrics.accuracy_score(y, y_pred)

### Procedure 2: Train and test split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

In [None]:
print(X_train.shape)
print(X_test.shape)

In [None]:
print(y_train.shape)
print(y_test.shape)

### Logistic Regression

In [None]:
# instantiate (max_iter raised above the default of 100, which can trigger a ConvergenceWarning)
logreg = LogisticRegression(max_iter=200)

# fit
logreg.fit(X_train, y_train)

# predict
y_pred = logreg.predict(X_test)

metrics.accuracy_score(y_test, y_pred)

### KNN (K = 5)

In [None]:
# instantiate
knn = KNeighborsClassifier(n_neighbors=5)

# fit
knn.fit(X_train, y_train)

# predict
y_pred = knn.predict(X_test)

metrics.accuracy_score(y_test, y_pred)

### KNN (K = 1)

In [None]:
# instantiate
knn = KNeighborsClassifier(n_neighbors=1)

# fit
knn.fit(X_train, y_train)

# predict
y_pred = knn.predict(X_test)

metrics.accuracy_score(y_test, y_pred)

### Find a better value for K

In [None]:
k_range = range(1, 26)
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    scores.append(metrics.accuracy_score(y_test, y_pred))

In [None]:
scores

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt

plt.plot(k_range, scores)
plt.xlabel("Value of K for KNN")
plt.ylabel("Testing Accuracy")

### Select the best value for K

In [None]:
# instantiate
knn = KNeighborsClassifier(n_neighbors=11)

# fit
knn.fit(X, y)

# predict
X_new = [[3, 5, 4, 2]]
knn.predict(X_new)
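
The value `n_neighbors=11` above is presumably read off the plot. As a rough sketch, the same choice can be made programmatically by collecting every K that reaches the highest test accuracy. The tie-breaking rule used here (take the middle of the tied values) is an assumption made for illustration, not something this notebook prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=4
)

# Test accuracy for each candidate K, as in the loop above
k_range = range(1, 26)
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores.append(metrics.accuracy_score(y_test, knn.predict(X_test)))

# All K values that tie for the best accuracy
best_score = max(scores)
best_ks = [k for k, s in zip(k_range, scores) if s == best_score]

# Assumed tie-break: take the middle of the tied values
best_k = best_ks[len(best_ks) // 2]
print(best_score, best_ks, best_k)
```

Choosing K on the test set makes the reported accuracy optimistic, which is one motivation for cross-validation.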

### Performing Cross-Validation

Cross-validation is commonly used for:

* Parameter tuning
* Model selection
* Feature selection

In [None]:
from sklearn.model_selection import KFold

data = range(1, 26)
kf = KFold(n_splits=5, random_state=None, shuffle=False)

for i, (train_index, test_index) in enumerate(kf.split(data)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")

In [None]:
from sklearn.model_selection import cross_val_score

knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10, scoring="accuracy")
print(scores)

In [None]:
print(scores.mean())
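
To make explicit what `cross_val_score` does, here is a minimal sketch that reproduces the 10 fold scores by hand. It relies on the documented behavior that, for a classifier and an integer `cv`, scikit-learn uses stratified folds without shuffling. The variable names `manual_scores` and `library_scores` are illustrative only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# For a classifier and an integer cv, cross_val_score uses stratified folds
skf = StratifiedKFold(n_splits=10)

manual_scores = []
for train_index, test_index in skf.split(X, y):
    model = clone(knn)  # fresh, unfitted copy for each fold
    model.fit(X[train_index], y[train_index])
    # score() returns accuracy for classifiers
    manual_scores.append(model.score(X[test_index], y[test_index]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(knn, X, y, cv=10, scoring="accuracy")
print(manual_scores.mean(), library_scores.mean())
```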

Let's try varying the value for K.

In [None]:
k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring="accuracy")
    k_scores.append(scores.mean())

print(k_scores)

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt

plt.plot(k_range, k_scores)
plt.xlabel("Value of K for KNN")
plt.ylabel("Cross-Validated Accuracy")

### Selecting Features

In [None]:
import pandas as pd

data = pd.read_csv("Advertising.csv")

In [None]:
data.head()

In [None]:
data = pd.read_csv("Advertising.csv", index_col=0)

In [None]:
data.head()

In [None]:
feature_cols = ["TV", "Radio", "Newspaper"]
X = data[feature_cols]
y = data.Sales

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error

scoring_metrics = make_scorer(
    mean_squared_error, 
    greater_is_better=False
)

lm = LinearRegression()
scores = cross_val_score(lm, X, y, cv=10, scoring=scoring_metrics)
scores

In [None]:
import numpy as np

mse_scores = -scores
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

In [None]:
import numpy as np

feature_cols = ["TV", "Radio"]
X = data[feature_cols]
np.sqrt(-cross_val_score(lm, X, y, cv=10, scoring=scoring_metrics)).mean()

In [None]:
import numpy as np

feature_cols = ["TV"]
X = data[feature_cols]
np.sqrt(-cross_val_score(lm, X, y, cv=10, scoring=scoring_metrics)).mean()

In [None]:
import numpy as np

feature_cols = ["Newspaper"]
X = data[feature_cols]
np.sqrt(-cross_val_score(lm, X, y, cv=10, scoring=scoring_metrics)).mean()

In [None]:
import numpy as np

feature_cols = ["Radio", "Newspaper"]
X = data[feature_cols]
np.sqrt(-cross_val_score(lm, X, y, cv=10, scoring=scoring_metrics)).mean()
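
The cells above compare feature subsets one at a time. A possible way to automate the comparison is to loop over every non-empty subset. The sketch below uses a synthetic stand-in for the advertising data (the values and the assumed relationship between `Sales` and the features are made up), so only the procedure carries over, not the numbers.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the advertising data (values are made up)
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "TV": rng.uniform(0, 300, n),
    "Radio": rng.uniform(0, 50, n),
    "Newspaper": rng.uniform(0, 100, n),
})
# Assumed relationship: Sales depends on TV and Radio only, plus noise
data["Sales"] = 0.05 * data["TV"] + 0.2 * data["Radio"] + rng.normal(0, 1, n)

y = data["Sales"]
lm = LinearRegression()
all_features = ["TV", "Radio", "Newspaper"]

# Mean 10-fold RMSE for every non-empty feature subset
results = {}
for size in range(1, len(all_features) + 1):
    for subset in combinations(all_features, size):
        mse = -cross_val_score(
            lm, data[list(subset)], y, cv=10, scoring="neg_mean_squared_error"
        )
        results[subset] = np.sqrt(mse).mean()

# Print subsets from lowest to highest RMSE
for subset, rmse in sorted(results.items(), key=lambda item: item[1]):
    print(subset, round(rmse, 3))

best_subset = min(results, key=results.get)
```

Here the built-in `"neg_mean_squared_error"` scorer string stands in for the `make_scorer` object used above; both yield negated MSE values.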

### Searching for Optimal Model Parameters

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
k_range = list(range(1, 31))
print(k_range)

In [None]:
param_grid = dict(n_neighbors=k_range)
print(param_grid)

In [None]:
param_grid["leaf_size"] = [15]
# param_grid["leaf_size"] = [15, 30]
print(param_grid)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

X = iris.data
y = iris.target

knn = KNeighborsClassifier()
grid = GridSearchCV(knn, param_grid, cv=10, scoring="accuracy")
grid.fit(X, y)
grid.cv_results_

In [None]:
# mean cross-validated accuracy for each parameter combination
grid_mean_scores = list(grid.cv_results_["mean_test_score"])

In [None]:
print(grid_mean_scores)

In [None]:
plt.plot(k_range, grid_mean_scores)
plt.xlabel("Value of K for KNN")
plt.ylabel("Cross-Validated Accuracy")

In [None]:
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)
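
The raw `cv_results_` dictionary is hard to read. One option, sketched below, is to load it into a pandas DataFrame and sort by rank. The grid here covers `n_neighbors` only, and the names `results` and `summary` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

param_grid = {"n_neighbors": list(range(1, 31))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring="accuracy")
grid.fit(X, y)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(grid.cv_results_)
columns = ["param_n_neighbors", "mean_test_score", "std_test_score", "rank_test_score"]
summary = results[columns].sort_values("rank_test_score")
print(summary.head())
```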

## Evaluating a Model

UCI Machine Learning Repository: [Spambase Data Set](https://archive.ics.uci.edu/ml/datasets/Spambase)

In [None]:
import pandas as pd

spam = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data", header=None)

In [None]:
spam.head()

In [None]:
X = spam.drop(57, axis=1)
y = spam[57]

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()

logreg.fit(X_train, y_train)

y_pred_class = logreg.predict(X_test)

In [None]:
from sklearn import metrics

print(metrics.accuracy_score(y_test, y_pred_class))

**Null accuracy:** accuracy that could be achieved by always predicting the most frequent class

In [None]:
pd.Series(y_test).value_counts()

In [None]:
pd.Series(y_test).value_counts().head(1) / len(y_test)
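
The same baseline can also be obtained from scikit-learn's `DummyClassifier`. The sketch below uses the built-in breast cancer dataset as a stand-in for the spam data, which is an assumption made to keep the example self-contained. Note that `DummyClassifier` learns the most frequent class from the training labels, whereas the cells above count the test labels; with a stratified split the two agree.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Stand-in binary dataset; stratify keeps class proportions equal in both splits
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)

# Null accuracy computed directly from the test labels
null_accuracy = pd.Series(y_test).value_counts().iloc[0] / len(y_test)
print(baseline_accuracy, null_accuracy)
```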

### Confusion Matrix

In [None]:
print(metrics.confusion_matrix(y_test, y_pred_class))

In [None]:
print(metrics.accuracy_score(y_test, y_pred_class))

In [None]:
print(metrics.recall_score(y_test, y_pred_class))

In [None]:
# rows are true classes, columns are predicted classes
confusion = metrics.confusion_matrix(y_test, y_pred_class)
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]

In [None]:
# precision: the fraction of predicted spam that is actually spam
print(TP / (TP + FP))
print(metrics.precision_score(y_test, y_pred_class))
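
The four counts also give sensitivity and specificity, the quantities that label the ROC axes later in this section. The following is a small sketch on hand-made labels (not the spam data), so the printed values illustrate the formulas only.

```python
from sklearn import metrics

# Small hand-made example: 1 = positive class, 0 = negative class
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
TN, FP, FN, TP = metrics.confusion_matrix(y_true, y_pred).ravel()

sensitivity = TP / (TP + FN)  # recall / true positive rate
specificity = TN / (TN + FP)  # true negative rate
false_positive_rate = FP / (FP + TN)  # equals 1 - specificity
print(sensitivity, specificity, false_positive_rate)
```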

### Classification Report

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred_class, target_names=["ham", "spam"]))

### Receiver Operating Characteristic (ROC)

In [None]:
y_pred_prob = logreg.predict_proba(X_test)[:, 1]
y_pred_prob

In [None]:
plt.hist(y_pred_prob, bins=8)
plt.xlim(0, 1)
plt.title("Histogram of predicted probabilities")
plt.xlabel("Predicted probability of spam")
plt.ylabel("Frequency")

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
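
Each point returned by `roc_curve` corresponds to one classification threshold. As a rough illustration on a tiny made-up example, the helper below (a hypothetical function named `rates_at`) recomputes the false and true positive rates at a single threshold by hand; the result should match one of the points on the curve.

```python
import numpy as np
from sklearn import metrics

# Tiny made-up example: true labels and predicted probabilities of class 1
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = metrics.roc_curve(y_true, y_prob)

def rates_at(threshold):
    """Return (FPR, TPR) when predicting class 1 for probabilities >= threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()
    return fp / (fp + tn), tp / (tp + fn)

manual_point = rates_at(0.35)
print(manual_point)
print(list(zip(fpr, tpr)))
```

Lowering the threshold raises sensitivity at the cost of specificity, which is the trade-off the ROC curve traces out.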

In [None]:
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title("ROC curve for spam classifier")
plt.xlabel("False Positive Rate (1 - Specificity)")
plt.ylabel("True Positive Rate (Sensitivity)")
plt.grid(True)

### Area Under Curve (AUC)

In [None]:
metrics.roc_auc_score(y_test, y_pred_prob)

## Model Persistence

Suppose we have trained the model below.

In [None]:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# fit on the training split only, so the test accuracy is not inflated
knn = KNeighborsClassifier(n_neighbors=20)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

metrics.accuracy_score(y_test, y_pred)

Dump the model into a pickle file.

In [None]:
import pickle

with open("knn.pkl", "wb") as f:
    pickle.dump(knn, f)

In [None]:
!ls

Load the model from the pickle file.

In [None]:
with open("knn.pkl", "rb") as f:
    stored_knn = pickle.load(f)

In [None]:
y_pred = stored_knn.predict(X_test)

metrics.accuracy_score(y_test, y_pred)
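
An alternative to `pickle` is `joblib`, which is installed alongside scikit-learn. The sketch below is a minimal round-trip check rather than a recommended workflow: it saves a model to a temporary directory (the file name `knn.joblib` is arbitrary), reloads it, and confirms that both copies predict the same labels. As with pickle files, only load files from sources you trust.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=20).fit(X, y)

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "knn.joblib")  # arbitrary file name
    joblib.dump(knn, path)
    restored_knn = joblib.load(path)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(knn.predict(X), restored_knn.predict(X))
print(same_predictions)
```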

## Scikit-Learn Algorithm Cheat Sheet

![](scikit-learn-algorithm-cheat-sheet.png)
<div style="text-align: center;">
<strong>Credit:</strong> http://peekaboo-vision.blogspot.de/2013/01/machine-learning-cheat-sheet-for-scikit.html
</div>