![](images/intro1.png)

![](images/intro2.png)

![](images/intro3.png)

![](images/intro4.png)

# About Ted Petrou

* Author of Pandas Cookbook ![](images/book1.png)

* Founder of Dunder Data - Classes available through Wintellect

* Founded Houston Data Science Meetup

# Outline

* What is Machine Learning

* Typical Workflow

* Overview of Scikit-Learn

* Demo with Dataset

* Questions

# What is Machine Learning?

* A process by which a machine (our computer) learns how to be successful at a certain task without being explicitly programmed to do so

* The computer must be given data in order to learn

* Success must be defined

* Algorithms use the definition of success to define a model that transforms the input into an output

* New data can then be transformed into an output upon which a decision can be made

# Examples of Machine Learning

* Voice Translation

* Predicting GDP of a country

* Learning to play poker

* Discovering topics in news articles

* Recommending a product to buy

# What Makes a Good Machine Learning Model?

* It must be better than a default guess

* It must generalize to new unseen data (and not memorize historical data)

* The cost of upkeep and maintenance must be less than the value it provides

# Types of Machine Learning

* Supervised Learning
    * Regression - continuous output
    * Classification - categorical output

* Unsupervised Learning
    * Look for structure within data
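
As a rough sketch of the three cases above, the toy example below fits one estimator of each kind. The data is made up, and the specific estimators (`LinearRegression`, `LogisticRegression`, `KMeans`) are just convenient choices for illustration, not the only options.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (made up): one feature, six samples
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

# Supervised, regression: the target is a continuous number
y_continuous = np.array([2.0, 4.0, 6.0, 20.0, 22.0, 24.0])
reg = LinearRegression().fit(X, y_continuous)

# Supervised, classification: the target is a category
y_category = np.array(['low', 'low', 'low', 'high', 'high', 'high'])
clf = LogisticRegression().fit(X, y_category)

# Unsupervised: no target at all, the estimator looks for groups on its own
km = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = km.fit_predict(X)

print(reg.predict([[5.0]]))   # a number
print(clf.predict([[5.0]]))   # a category
print(cluster_labels)         # a group id for each sample
```

Note that the cluster ids are arbitrary: the first group may be labeled 0 or 1.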

# Typical Workflow for Beginners
* Find dataset
    * [Kaggle Datasets](https://www.kaggle.com/datasets)
    * [data.world](https://data.world/)
    * [data.gov](https://www.data.gov/)
 

* Read data into Pandas

* Clean data

* Exploratory data analysis with basic statistics and visualizations

* Define Problem

* Extract to NumPy

* Train and Evaluate model with Scikit-Learn

# Overview of Scikit-Learn
* Most popular Python library for building basic machine learning models

* Easy to use and can train a model in 3 lines of code

* Does not focus on Deep Learning - Use TensorFlow or Keras instead

* Built on top of NumPy

* In addition to training machine learning models, Scikit-Learn provides a host of other tools for data preprocessing and model evaluation
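
A minimal sketch of the "3 lines of code" idea: instantiate, fit, score. The data here is made up so that the target is exactly `3 * x + 1`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: 5 samples, 1 feature, target is exactly 3 * x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 4.0, 7.0, 10.0, 13.0])

model = LinearRegression()   # 1. instantiate the estimator
model.fit(X, y)              # 2. learn from the data
r2 = model.score(X, y)       # 3. evaluate (R-squared for regressors)

print(model.coef_, model.intercept_, r2)
```

Because the toy target is a perfect line, the learned slope and intercept recover 3 and 1. Real data will not fit this cleanly.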

# Scikit-Learn Vocabulary

* The input data for all our machine learning models must be a **2-dimensional NumPy array** usually given the name **`X`**

* Each column of this array is a **feature**

* Each row is a **sample**

* The **estimator** is the Python object that learns from the data. This is our primary **object** that we will use to train and make predictions.

* In supervised learning, each **sample** (row) has a **target** - either a category or a continuous value. This is also known as a **label**. All the labels are separated into their own array, usually given the name **`y`**.
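
To make the vocabulary concrete, here is a small sketch with a hypothetical four-row table. The column names are borrowed from the housing data, but the values are invented.

```python
import pandas as pd

# Hypothetical mini-table: 4 houses (samples), 2 features, 1 target
df = pd.DataFrame({
    'GrLivArea': [1500, 2000, 1200, 1800],
    'OverallQual': [6, 8, 5, 7],
    'SalePrice': [180000, 260000, 140000, 220000],
})

y = df['SalePrice'].values                    # target / labels, 1-d
X = df[['GrLivArea', 'OverallQual']].values   # input, 2-d

n_samples, n_features = X.shape
print(X.shape)   # (4, 2): 4 samples (rows), 2 features (columns)
print(y.shape)   # (4,): one target per sample
```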

# Scikit-Learn Gotchas

* There can be no missing data in either the input or target arrays

* All input data must be in a numeric 2-d array, even if there is only one feature. String data must be encoded as numbers, for example as 0/1 indicator columns.
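
Both gotchas can be checked with plain NumPy before any estimator is involved. The sketch below uses invented values. Filling with the median is only one possible choice.

```python
import numpy as np

area = np.array([1500.0, 2000.0, np.nan, 1800.0])   # made-up values, 1-d

# Gotcha 1: missing values. Count them before handing data to an estimator.
n_missing = np.isnan(area).sum()

# Fill with the median of the observed values (one of several options)
area_filled = np.where(np.isnan(area), np.nanmedian(area), area)

# Gotcha 2: a single feature must still be 2-d, shape (n_samples, 1)
X = area_filled.reshape(-1, 1)

print(n_missing)   # 1
print(X.shape)     # (4, 1)
```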

# Ames Housing Dataset from Kaggle

* Compiled by Professor Dean De Cock; covers home sales in Ames, Iowa from 2006 to 2010

* Original dataset has 79 features and 1460 samples

* For simplicity, we will only look at 8 features

* Predict sale price

* Evaluation metric - R^2

# Read in data with pandas

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Show up to 100 columns when displaying a DataFrame
pd.options.display.max_columns = 100

%matplotlib inline

In [None]:
housing = pd.read_csv('data/housing.csv')
housing.head()

# Some Quick EDA

In [None]:
housing.groupby('OverallQual')['SalePrice'].mean().plot(figsize=(12, 5));

In [None]:
housing.groupby('GarageType')['SalePrice'].mean().plot(kind='bar', figsize=(12, 5));

In [None]:
housing.sample(frac=.2).plot(kind='scatter', x='GrLivArea', y='SalePrice', figsize=(12, 5));

In [None]:
housing.groupby(pd.cut(housing['GrLivArea'], 10))['SalePrice'] \
       .mean().sort_index().plot(kind='bar', figsize=(12, 5));

# Remedying missing values
* Replace missing numeric values with the median, mean, or mode

In [None]:
housing.isna().sum()

In [None]:
housing['LotFrontage'].median()

In [None]:
housing_ml = housing.copy()

In [None]:
# Technically this is data snooping: the median is computed from all rows,
# including rows later used for evaluation. There are better ways.

lot_frontage_median = housing_ml['LotFrontage'].median()
housing_ml['LotFrontage'] = housing_ml['LotFrontage'].fillna(lot_frontage_median)
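
One possible "better way", sketched here on synthetic data rather than the housing file, is to put the imputation inside a pipeline so the median is learned only from the training rows of each fold. `SimpleImputer` and `make_pipeline` are real Scikit-Learn tools. The data and the choice of `LinearRegression` are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data (an assumption for illustration): y depends linearly on x
rng = np.random.default_rng(0)
x = rng.uniform(20, 120, size=100)
y = 1000 * x + rng.normal(0, 5000, size=100)

# Knock out every 10th value to mimic missing LotFrontage entries
X = x.reshape(-1, 1).copy()
X[::10, 0] = np.nan

# The imputer is fit inside each training fold, so the median never
# sees the rows being used for evaluation
pipe = make_pipeline(SimpleImputer(strategy='median'), LinearRegression())
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```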

In [None]:
housing_ml['GarageType'] = housing_ml['GarageType'].fillna('Missing')

In [None]:
housing_ml.isna().sum()

# Categorical vs Continuous features

* Each feature (column) is either categorical or continuous

* Categorical features take on a limited set of distinct values and are usually strings (though they can be numbers as well)

* Continuous features can take on any value and are always numeric

* Scikit-Learn estimators do not accept string columns as input. The easiest way to encode them is with the pandas `get_dummies` function.

* `get_dummies` automatically binarizes (makes 0/1) each unique string in the categorical columns.
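
A small sketch of what `get_dummies` produces, using three hypothetical rows. The `dtype=int` argument is added here so the indicator columns show as 0/1 regardless of pandas version. Without it, newer versions return `True`/`False`.

```python
import pandas as pd

# Hypothetical rows; 'Attchd' and 'Detchd' mimic GarageType codes
df = pd.DataFrame({
    'GrLivArea': [1500, 2000, 1200],
    'GarageType': ['Attchd', 'Detchd', 'Attchd'],
})

# Numeric columns pass through; each unique string becomes its own 0/1 column
dummies = pd.get_dummies(df, dtype=int)
print(dummies.columns.tolist())
print(dummies.values)
```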

In [None]:
housing_ml = pd.get_dummies(housing_ml)
housing_ml.head()

In [None]:
# check data types - make sure there are no object columns
housing_ml.dtypes

# Export to NumPy
* Move **`SalePrice`** to its own variable

In [None]:
# Remove SalePrice and assign to variable
sale_price = housing_ml.pop('SalePrice')

In [None]:
sale_price.head()

In [None]:
y = sale_price.values

In [None]:
y

In [None]:
type(y)

In [None]:
X = housing_ml.values

In [None]:
X.shape

In [None]:
y.shape

# Ready for Machine Learning
* Begin with the dumbest model
* Helps form a baseline

In [None]:
# Import estimator
from sklearn.dummy import DummyRegressor

In [None]:
# Instantiate estimator
# guess the mean every single time
dummy_reg = DummyRegressor(strategy='mean')

In [None]:
# fit estimator
dummy_reg.fit(X, y)

In [None]:
# predict
dummy_reg.predict(X)

In [None]:
# score - by definition r-squared is 0 when guessing the mean
dummy_reg.score(X, y)

# Understanding R-squared
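* R-squared is a metric that tells us how much better our model is than the dumbest model, which always guesses the mean.
* More technically, it is the fraction of the target's variance that the model explains: 1 minus the model's squared error divided by the squared error of the mean-only model. It can be negative when the model does worse than guessing the mean.
* In the picture below, it measures the percentage of area decrease from the red squares to the blue squares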

![](images/r2.png)
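
The calculation can be done by hand with NumPy. The targets and predictions below are invented, and the variable names `ss_total` and `ss_residual` are just labels chosen for this sketch.

```python
import numpy as np

# Made-up targets and predictions from some model
y_true = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0, 300.0])

# Squared error of the baseline that always guesses the mean
ss_total = ((y_true - y_true.mean()) ** 2).sum()

# Squared error of the model
ss_residual = ((y_true - y_pred) ** 2).sum()

r2 = 1 - ss_residual / ss_total
print(r2)   # 0.984

# The baseline itself scores exactly 0
r2_baseline = 1 - ss_total / ss_total
```

Here the baseline's squared error is 25000 and the model's is 400, so the model removes 98.4% of it.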

# Slowly build more complex models

In [None]:
from sklearn.linear_model import LinearRegression
lr_model = LinearRegression()

In [None]:
X_lf = housing_ml[['LotFrontage']]

In [None]:
lr_model.fit(X_lf, y)

In [None]:
y_pred = lr_model.predict(X_lf)

In [None]:
lr_model.score(X_lf, y)

In [None]:
plt.figure(figsize=(12, 5))
plt.scatter(y_pred, y);

# Use better predictor

In [None]:
X_area = housing_ml[['GrLivArea']]

In [None]:
lr_model.fit(X_area, y)

In [None]:
y_pred = lr_model.predict(X_area)

In [None]:
lr_model.score(X_area, y)

In [None]:
plt.figure(figsize=(12, 5))
plt.scatter(y_pred, y);

# Use all predictors!

In [None]:
lr_model.fit(X, y)

In [None]:
y_pred = lr_model.predict(X)

In [None]:
lr_model.score(X, y)

In [None]:
plt.figure(figsize=(12, 5))
plt.scatter(y_pred, y);

In [None]:
# Simple holdout split: rows at even positions for training
X_train = X[::2]
y_train = y[::2]

# Rows at odd positions for testing
X_test = X[1::2]
y_test = y[1::2]

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
lr_model.fit(X_train, y_train)

In [None]:
# Worse performance on the test data
lr_model.score(X_test, y_test)

# Overfitting - It only counts on data that you have not seen

* One of the main purposes of machine learning is to be able to use your model with new unseen data

* Overfitting is akin to memorizing all the answers from a practice exam expecting to do well on the real one

* Machine learning models are evaluated on data that they have not been trained on
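
The memorization effect is easy to reproduce on synthetic data. This sketch assumes a noisy linear relationship and uses `train_test_split` to shuffle and hold out rows. An unrestricted decision tree then scores perfectly on the rows it trained on and noticeably worse on the held-out rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + rng.normal(0, 3, size=200)

# Shuffle and hold out 25% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# An unrestricted tree can memorize every training row
tree = DecisionTreeRegressor(random_state=0)
tree.fit(X_train, y_train)

train_r2 = tree.score(X_train, y_train)
test_r2 = tree.score(X_test, y_test)
print(train_r2, test_r2)
```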

# KFold Cross Validation
KFold cross validation is one of the most common methods for estimating how well a model will perform on unseen data.

![](images/kfold.png)
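
To see what the folds look like, the sketch below splits ten made-up rows into five folds with `KFold` and prints which rows are used for training and which are held out each time.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)   # 10 made-up samples, 1 feature

kf = KFold(n_splits=5)
test_rows_seen = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # Each fold trains on 8 rows and evaluates on the 2 held-out rows
    print(fold, train_idx, test_idx)
    test_rows_seen.extend(test_idx.tolist())
```

Every row lands in the held-out set exactly once, so each sample contributes to one evaluation score.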

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
# You don't even have to fit the estimator first. cross_val_score does everything for you
lr_model = LinearRegression()

In [None]:
cross_val_score(lr_model, X, y, cv=10)

# Using a different estimator
* All Scikit-Learn estimators are similar and have many of the same methods. The main method is **`fit`** which all supervised learning estimators have.
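
Because the interface is shared, swapping estimators usually means changing one line. A minimal sketch on synthetic data, where the three estimators are chosen only as examples:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(0, 0.5, size=100)

estimators = {
    'linear': LinearRegression(),
    'tree': DecisionTreeRegressor(random_state=0),
    'knn': KNeighborsRegressor(n_neighbors=5),
}

scores = {}
for name, est in estimators.items():
    est.fit(X, y)                    # same call for every estimator
    scores[name] = est.score(X, y)   # training R-squared, not a fair test
print(scores)
```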

### Using a Decision Tree

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
dtr = DecisionTreeRegressor()

In [None]:
cross_val_score(dtr, X, y, cv=10)

In [None]:
cross_val_score(dtr, X, y, cv=10).mean()

In [None]:
cross_val_score(lr_model, X, y, cv=10).mean()

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
# Higher R-squared than linear regression!
rfr = RandomForestRegressor()
cross_val_score(rfr, X, y, cv=10).mean()

# Classification
Classification is a different type of supervised learning where the target variable is a discrete value. In this example we attempt to determine whether the person is a student.

In [None]:
credit = pd.read_csv('data/credit.csv')

In [None]:
credit.head()

In [None]:
# check for missing values
credit.isna().sum()

In [None]:
credit['Student'].value_counts(normalize=True).plot(kind='bar')

In [None]:
credit.groupby('Student')['Age'].mean().plot(kind='bar')

In [None]:
credit.groupby('Student')['Rating'].mean().plot(kind='bar')

In [None]:
credit.groupby('Student')['Balance'].mean().plot(kind='bar')

# Binarize the categorical variables
First we need to remove the target variable. The target variable does not need to be numeric.
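
A small sketch of that split, using hypothetical rows loosely shaped like the credit data: the target is popped off and left as strings, while the remaining columns are turned into numbers. `DecisionTreeClassifier` is used here only because it needs no tuning on six rows.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical rows loosely shaped like the credit data
df = pd.DataFrame({
    'Age': [19, 21, 45, 52, 23, 60],
    'Married': ['No', 'No', 'Yes', 'Yes', 'No', 'Yes'],
    'Student': ['Yes', 'Yes', 'No', 'No', 'Yes', 'No'],
})

y = df.pop('Student').values               # target stays as strings
X = pd.get_dummies(df, dtype=int).values   # features become numeric

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.classes_)        # the labels the classifier learned
print(clf.predict(X[:2]))  # predictions come back as strings too
```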

In [None]:
y = credit.pop('Student').values

In [None]:
credit_dummies = pd.get_dummies(credit)

In [None]:
credit_dummies.head()

In [None]:
X = credit_dummies.values

# Use a classification estimator

In [None]:
from sklearn.dummy import DummyClassifier

In [None]:
dummy_cls = DummyClassifier(strategy='most_frequent')

In [None]:
dummy_cls.fit(X, y)

In [None]:
dummy_cls.score(X, y)
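
The score of a `most_frequent` dummy classifier is simply the share of the most common class. A sketch with made-up labels, 70% of which are `'No'`:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

y = np.array(['No'] * 7 + ['Yes'] * 3)   # made-up labels, 70% 'No'
X = np.zeros((len(y), 1))                # features are ignored by the dummy

dummy = DummyClassifier(strategy='most_frequent').fit(X, y)
baseline_accuracy = dummy.score(X, y)
print(baseline_accuracy)   # 0.7
```

Any real classifier has to beat this number to be worth using.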

# Must beat 73% accuracy

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logr_model = LogisticRegression()

In [None]:
logr_model.fit(X, y)

In [None]:
logr_model.predict(X)[:5]

In [None]:
logr_model.score(X, y)

# Must use cross validation

In [None]:
cross_val_score(logr_model, X, y, cv=5).mean()

# Parameter Tuning
Most models have parameters that you can tune in order to produce a better model.

In [None]:
from sklearn.neighbors import KNeighborsRegressor

In [None]:
knn = KNeighborsRegressor(n_neighbors=3)

In [None]:
# Switch back to the housing data (X and y currently hold the credit data)
X = housing_ml.values
y = sale_price.values

In [None]:
cross_val_score(knn, X, y, cv=10).mean()

In [None]:
knn = KNeighborsRegressor(n_neighbors=5)
cross_val_score(knn, X, y, cv=10).mean()

In [None]:
knn = KNeighborsRegressor(n_neighbors=10)
cross_val_score(knn, X, y, cv=10).mean()

# An automated way to tune parameters with Grid Search

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
grid = {'n_neighbors': np.arange(1, 30)}

In [None]:
gs = GridSearchCV(knn, param_grid=grid)

In [None]:
gs.fit(X, y)

In [None]:
gs.best_estimator_

In [None]:
cross_val_score(gs.best_estimator_, X, y, cv=15).mean()
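
After fitting, the search object also exposes the winning parameters and their cross-validated score directly. The sketch below runs on synthetic data (an assumption, not the housing file) and passes `cv=5` explicitly.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(120, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=120)

grid = {'n_neighbors': np.arange(1, 30)}
gs = GridSearchCV(KNeighborsRegressor(), param_grid=grid, cv=5)
gs.fit(X, y)

print(gs.best_params_)   # the winning value of n_neighbors
print(gs.best_score_)    # its mean cross-validated R-squared
```

Keep in mind that the best score was selected using the same data that did the tuning, so it tends to be slightly optimistic.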

# Summary
* Use Pandas to import and explore data
* Prepare for machine learning by filling in missing values and binarizing categorical features (`pd.get_dummies`)
* Export to NumPy by creating an `X` 2d numeric array and a `y` 1d array (can be strings).
* Import an estimator (regression or classification)
* Use `fit`, `predict`, and `score` methods
* Use `cross_val_score` to automate cross validation
* Use `GridSearchCV` with a parameter grid to automate parameter tuning
