# Scikit-learn introduction

Scikit-learn is one of the most popular and actively developed machine learning libraries. It contains nearly all of the "canonical" preprocessing and classification techniques. Its preprocessing and classifier APIs share the same interface across techniques, so models can be quickly swapped out.

This notebook is a brief introduction to some of the core features of scikit-learn, specifically the uniform function calls for validation, preprocessing, model fitting, and testing.

More recently, the Keras deep learning module has introduced a scikit-learn interface. 

In [1]:
# Let's get some random data to play with.
import numpy as np
x = (np.random.rand(100, 10) - .5) * 100
b = np.random.rand(10, 1)
y = np.dot(x, b) + np.random.randn(100, 1)*20

## Validation

As will be discussed later, keeping training and testing sets separate is crucial to valid models and inferences. The scikit-learn APIs make it easy to keep them separate, and the `model_selection` submodule provides tools for good data-splitting hygiene.

In [2]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)  # random 80/20 train/test split
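
Beyond a single random split, `model_selection` also offers k-fold splitters. As a minimal sketch (the toy array `x_demo` and the fold settings are assumptions for illustration, not part of this notebook's data), `KFold` rotates which samples are held out so that every sample lands in a test fold exactly once:

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 10 samples, 2 features
x_demo = np.arange(20).reshape(10, 2)

# 5 folds; shuffling with a fixed seed keeps the split reproducible
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []  # (n_train, n_test) for each fold
held_out = []    # indices that appeared in some test fold
for train_idx, test_idx in kf.split(x_demo):
    fold_sizes.append((len(train_idx), len(test_idx)))
    held_out.extend(test_idx.tolist())

print(fold_sizes)
print(sorted(held_out))
```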

## Preprocessing

When doing your projects or working with real data, you'll learn very quickly that the regressions and classifications from this course don't generally work on their own. (Otherwise everyone would be doing this!) Every type of data (imaging, financial, social media, etc.) has its own characteristics and thus calls for different preprocessing techniques. You can write your own methods using numpy and scipy, and scikit-learn includes quite a few common ones in `sklearn.preprocessing`.

Here is one example of the preprocessing options:

In [3]:
import sklearn.preprocessing as preproc
scaler = preproc.StandardScaler()  # z-scores the dataset it is given and stores the mean and variance
x_train_sc = scaler.fit_transform(x_train)  # z-score the training set and store its mean and variance
x_test_sc = scaler.transform(x_test)  # apply the training mean and variance to the test set

In [4]:
# check the mean and variance
print(x_train_sc.mean(), x_train_sc.var())
print(x_test_sc.mean(), x_test_sc.var())

-1.9984014443252817e-17 1.0
0.1466607139294716 1.1202232165082364
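
Other scalers in `sklearn.preprocessing` follow the same `fit_transform` / `transform` pattern. The sketch below uses `MinMaxScaler` as one illustrative alternative; the stand-in arrays `x_tr` and `x_te` are made up here so the block runs on its own. As with the z-scoring above, the test set is transformed with statistics learned from the training set, so its values are not guaranteed to stay inside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
x_tr = (rng.rand(80, 10) - 0.5) * 100  # stand-in training set (assumption)
x_te = (rng.rand(20, 10) - 0.5) * 100  # stand-in test set (assumption)

minmax = MinMaxScaler()               # rescales each feature to [0, 1] by default
x_tr_mm = minmax.fit_transform(x_tr)  # learn per-feature min/max on the training set
x_te_mm = minmax.transform(x_te)      # reuse the training min/max on the test set

print(x_tr_mm.min(), x_tr_mm.max())
print(x_te_mm.min(), x_te_mm.max())
```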


## Linear models
Scikit-learn has the submodule `linear_model`, which contains many different linear approaches, including ordinary least squares and regularized regressions. Several common ones are demonstrated below and will be explored further later. For now, notice that the syntax for training and testing is uniform (you can fine-tune free parameters by updating the model object itself).

In [5]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
ols = LinearRegression()
# fit a model on the training data, and see how it does on the testing set
# (for regressors, .score() returns the R^2 coefficient of determination)
print(ols.fit(x_train_sc, y_train).score(x_test_sc, y_test))

ridge = Ridge(alpha=1.0)
print(ridge.fit(x_train_sc, y_train).score(x_test_sc, y_test))

lasso = Lasso(alpha=1.0)
print(lasso.fit(x_train_sc, y_train).score(x_test_sc, y_test))


0.891325478223941
0.8930797538114582
0.8921442159424512
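
One way to update a free parameter on an existing model object is the `set_params` method that scikit-learn estimators share. The sketch below is only illustrative: the data (`x_demo`, `y_demo`) and the `alpha` values are assumptions chosen to make the effect visible, not tuned settings. Stronger regularization should shrink the ridge coefficients.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
x_demo = rng.randn(100, 10)                             # stand-in features (assumption)
y_demo = x_demo @ rng.rand(10) + rng.randn(100) * 0.5   # linear signal plus noise

ridge = Ridge(alpha=1.0)
coef_norms = {}
for alpha in [0.1, 10.0, 1000.0]:
    ridge.set_params(alpha=alpha)  # update the free parameter on the same object
    ridge.fit(x_demo, y_demo)      # refit with the new regularization strength
    coef_norms[alpha] = np.linalg.norm(ridge.coef_)
    print(alpha, coef_norms[alpha])
```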


## Classification
Like linear models, classifiers have a common calling structure. We import a bunch of models below.

In [6]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import BaggingClassifier

Let's initialize a whole bunch of different classifiers. Each has its own free parameters that you can tweak (e.g. depth for trees, features for forests, kernels for SVM, etc).

In [7]:
h = .02  # step size in the mesh

names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", 
         "Decision Tree", "Random Forest", 
         "Naive Bayes", "QDA"]

classifiers = [
    KNeighborsClassifier(3),  # k = 3
    SVC(kernel="linear", C=0.025),  # linear SVM with C = 0.025
    SVC(gamma=2, C=1, kernel="rbf"),  # RBF SVM with C = 1, gamma = 2
    DecisionTreeClassifier(max_depth=5),  # depth 5 decision tree
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),  # depth 5 random forest with 10 trees
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]

Let's generate some random data to demonstrate classification.

In [8]:
# from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                           random_state=1, n_clusters_per_class=1)

rng = np.random.RandomState(2)  # add some noise
X += 2 * rng.uniform(size=X.shape)




In [9]:
# preprocess data
x_train, x_test, y_train, y_test = \
        train_test_split(X, y, test_size=.2, random_state=42)
    
x_train_sc = scaler.fit_transform(x_train)  # z-score the training set and store its mean and variance
x_test_sc = scaler.transform(x_test)  # apply the training mean and variance to the test set



In [10]:
# run the models in a loop: because each has .fit() and .score() methods, they can simply be swapped out
# (for classifiers, .score() returns the mean accuracy on the given data)
for name, clf in zip(names, classifiers):
    clf.fit(x_train_sc, y_train)
    score = clf.score(x_test_sc, y_test)
    print(name + ": " + str(score))


Nearest Neighbors: 0.95
Linear SVM: 0.9
RBF SVM: 0.95
Decision Tree: 0.9
Random Forest: 0.9
Naive Bayes: 0.95
QDA: 0.95
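
Because preprocessing steps and classifiers share this interface, they can also be chained into a single object. The sketch below uses `make_pipeline` from `sklearn.pipeline`, which this notebook does not otherwise cover, so treat it as one possible way to bundle the steps; the variable names (`X_demo`, `pipe`, etc.) are chosen here for illustration. Calling `.fit()` on the pipeline fits the scaler on the training data only, and `.score()` applies that same scaling to the test data before classifying.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature classification problem (same generator settings as above, without the added noise)
X_demo, y_demo = make_classification(n_features=2, n_redundant=0, n_informative=2,
                                     random_state=1, n_clusters_per_class=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Scaler and classifier bundled into one estimator with .fit() and .score()
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(3))
pipe.fit(X_tr, y_tr)          # fits the scaler, then the classifier, on training data
acc = pipe.score(X_te, y_te)  # scales the test data with training statistics, then scores
print("Pipeline (scaler + 3-NN): " + str(acc))
```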
