# Day 8 

## Table of Contents

- [Pandas recap](#pandas-visualization-recap)
- [ML Intro](#intro-to-machine-learning)
- [Scikit-learn](#scikit-learn-primer-with-linear-regression)

## Pandas Visualization Recap

I missed a couple of the plot examples I wanted to show yesterday 😅, so let's try this again...

In [None]:
# import pandas and matplotlib.pyplot


In [None]:
# read data from csv from yesterday on wines


In [None]:
# box plot "by" another quantity like quality
# figsize is another argument, passed on a separate line.
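As a rough sketch of what this cell could look like: the snippet below draws one box per `quality` value with `DataFrame.boxplot`. The `wines` table here is a randomly generated stand-in (an assumption, since the real CSV is not reproduced in these notes), and only the column names `alcohol` and `quality` are taken from the exercises further down.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for yesterday's wine table (assumption: the real CSV
# has "alcohol" and "quality" columns; these values are randomly generated).
rng = np.random.default_rng(0)
wines = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.0, 300),
    "quality": rng.integers(3, 9, 300),  # integer scores from 3 to 8
})

# One box per quality score; figsize is given as its own keyword argument.
wines.boxplot(
    column="alcohol",
    by="quality",
    figsize=(8, 5),
)
fig = plt.gcf()  # the figure that pandas just created
```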


In [None]:
# let's now learn about the object-oriented approach to data visualization
# it's based on a figure object and axes objects
# the figure object is the container for all the axes objects
# the axes object is the container for the plot
# the plot object is the thing that gets drawn
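The figure/axes hierarchy described in the comments above can be sketched as follows. The numbers are randomly generated placeholders (an assumption, not the wine data), and the titles and labels are only illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
values = rng.normal(10.5, 1.0, 300)  # synthetic numbers, not the real data

# The figure is the container; the axes is the region a plot is drawn in.
fig, ax = plt.subplots(figsize=(6, 4))

# Plotting methods live on the axes and return the objects that get drawn.
boxes = ax.boxplot(values)

# Titles and labels are also set through the axes object.
ax.set_title("Object-oriented box plot")
ax.set_ylabel("value")
```

Here `boxes` is a dictionary of the drawn pieces (for example `boxes["medians"]`), which is handy when you want to restyle them later.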


In [None]:
# more complicated example with legend, colors, etc.
# use axes to add the mean and median to the plot with colors
# a legend can also be added with the axes object
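One possible version of this cell, shown here for a box plot: matplotlib's `Axes.boxplot` can draw the mean as a line next to the median, and the returned line objects can be handed to `ax.legend`. The data are synthetic and the colors are arbitrary choices, so treat this as a sketch rather than the exact plot from the lecture.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
alcohol = rng.normal(10.5, 1.0, 300)  # synthetic values, not the real data

fig, ax = plt.subplots(figsize=(5, 5))
parts = ax.boxplot(
    alcohol,
    showmeans=True,   # also draw the mean
    meanline=True,    # as a line rather than a marker
    medianprops={"color": "red"},
    meanprops={"color": "blue", "linestyle": "--"},
)
ax.set_ylabel("alcohol")
ax.set_title("Box plot with mean and median")

# Build the legend from the line objects that boxplot returned.
ax.legend([parts["medians"][0], parts["means"][0]], ["median", "mean"])
```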


In [None]:
# multiple separate boxplots in same figure
# we can set title, labels, and legend for each boxplot



In [None]:
# multiple boxplots (easier)
# you can also just pass in a list of column names to boxplot in `column` argument
# axes should match the number of columns in the list
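A hedged sketch of the "easier" route: pass a list to `column` together with an array of axes whose length matches the list. The `wines` table is again a synthetic stand-in, and grouping `by="quality"` is an assumption carried over from the earlier cell. pandas may warn that `sharex`/`sharey` are ignored when you supply your own axes.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the wine data (randomly generated values).
rng = np.random.default_rng(0)
wines = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.0, 300),
    "density": rng.normal(0.996, 0.002, 300),
    "pH": rng.normal(3.3, 0.15, 300),
    "quality": rng.integers(3, 9, 300),
})

columns = ["alcohol", "density", "pH"]

# Three axes for three columns: the counts have to match.
fig, axes = plt.subplots(1, len(columns), figsize=(12, 4))
wines.boxplot(column=columns, by="quality", ax=axes)
```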



In [None]:
# histograms
# histograms need to be done one at a time
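Drawing histograms "one at a time" could look like the loop below, with one `ax.hist` call per axes. The data are synthetic and the choice of two columns and 25 bins is arbitrary.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the wine data (randomly generated values).
rng = np.random.default_rng(0)
wines = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.0, 300),
    "pH": rng.normal(3.3, 0.15, 300),
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# One hist() call per axes, each with its own labels.
for ax, col in zip(axes, ["alcohol", "pH"]):
    ax.hist(wines[col], bins=25, alpha=0.8)
    ax.set_xlabel(col)
    ax.set_ylabel("count")
```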


### Lecture Exercises

1. Create a single figure with three histograms, one each for alcohol, density, and pH. On each histogram, plot the mean and the median as vertical lines.

2. Create a figure containing separate boxplots for every quantity in the dataframe. Change the `figsize` so it is somewhat readable :-) 

3. Create a scatter plot of `density` vs `alcohol` separated by median `quality` (half points in each group), using a blue/red color for the dots, some transparency, title, legends, and ensure appropriate labels.

## Intro to Machine Learning

**Credit:** Created by [Data School](https://www.dataschool.io). Watch all 10 videos on [YouTube](https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A). Download the notebooks from [GitHub](https://github.com/justmarkham/scikit-learn-videos).

### What is Machine Learning?

One definition: "Machine Learning is the semi-automated extraction of knowledge from data"

- **Knowledge from data**: Starts with a question that might be answerable using data
- **Automated extraction**: A computer provides the insight
- **Semi-automated**: Requires many smart decisions by a human

### What are the two main categories of Machine Learning?

**Supervised learning**: Making predictions using data
    
- Example: Is a given email "spam" or "ham"?
- There is an outcome we are trying to predict

![Spam filter](01_spam_filter.png)

**Unsupervised learning**: Extracting structure from data

- Example: Segment grocery store shoppers into clusters that exhibit similar behaviors
- There is no "right answer"

![Clustering](01_clustering.png)

### How does Machine Learning "work"?

High-level steps of supervised learning:

1. First, train a **Machine Learning model** using **labeled data**

    - "Labeled data" has been labeled with the outcome
    - "Machine Learning model" learns the relationship between the attributes of the data and its outcome

2. Then, make **predictions** on **new data** for which the label is unknown

![Supervised learning](01_supervised_learning.png)

The primary goal of supervised learning is to build a model that "generalizes": It accurately predicts the **future** rather than the **past**!

### Questions about Machine Learning

- How do I choose **which attributes** of my data to include in the model?
- How do I choose **which model** to use?
- How do I **optimize** this model for best performance?
- How do I ensure that I'm building a model that will **generalize** to unseen data?
- Can I **estimate** how well my model is likely to perform on unseen data?

### Benefits and drawbacks of scikit-learn

#### Benefits:

- **Consistent interface** to Machine Learning models
- Provides many **tuning parameters** but with **sensible defaults**
- Exceptional **documentation**
- Rich set of functionality for **companion tasks**
- **Active community** for development and support

#### Potential drawbacks:

- Harder (than R) to **get started with Machine Learning**
- Less emphasis (than R) on **model interpretability**

#### Further reading:

- Ben Lorica: [Six reasons why I recommend scikit-learn](https://www.oreilly.com/content/six-reasons-why-i-recommend-scikit-learn/)
- scikit-learn authors: [API design for machine learning software](https://arxiv.org/pdf/1309.0238v1.pdf)
- Data School: [Should you teach Python or R for data science?](https://www.dataschool.io/python-or-r-for-data-science/)

![scikit-learn logo](02_sklearn_logo.png)

## Installing scikit-learn

**Option 1:** [Install scikit-learn library](https://scikit-learn.org/stable/install.html) and dependencies (NumPy and SciPy)

**Option 2:** [Install Anaconda distribution](https://www.anaconda.com/products/individual) of Python, which includes:

- Hundreds of useful packages (including scikit-learn)
- IPython and Jupyter Notebook
- conda package manager
- Spyder IDE

## Scikit-Learn Primer with Linear regression

### Types of supervised learning

- **Classification:** Predict a categorical response
- **Regression:** Predict a continuous response

### Reading data using pandas

In [None]:
# import pandas just in case


In [None]:
# read CSV file called 'Advertising.csv'

# display the first 5 rows using `head()`


Primary object types:

- **DataFrame:** rows and columns (like a spreadsheet)
- **Series:** a single column

In [None]:
# display the last 5 rows with `tail()`


In [None]:
# check the shape of the DataFrame (rows, columns)


What are the features?
- **TV:** advertising dollars spent on TV for a single product in a given market (in thousands of dollars)
- **Radio:** advertising dollars spent on Radio
- **Newspaper:** advertising dollars spent on Newspaper

What is the response?
- **Sales:** sales of a single product in a given market (in thousands of items)

What else do we know?
- Because the response variable is continuous, this is a **regression** problem.
- There are 200 **observations** (represented by the rows), and each observation is a single market.

## Visualizing data using seaborn

**Seaborn:** Python library for statistical data visualization built on top of Matplotlib

- Anaconda users: run **`conda install seaborn`** from the command line
- Other users: [installation instructions](http://seaborn.pydata.org/installing.html)

In [None]:
# conventional way to import seaborn

# allow plots to appear within the notebook


In [None]:
# visualize the relationship between the features and the response using sns.pairplot()


## Linear regression

**Pros:** fast, no tuning required, highly interpretable, well-understood

**Cons:** unlikely to produce the best predictive accuracy (presumes a linear relationship between the features and response)

### Form of linear regression

$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$

- $y$ is the response
- $\beta_0$ is the intercept
- $\beta_1$ is the coefficient for $x_1$ (the first feature)
- $\beta_n$ is the coefficient for $x_n$ (the nth feature)

In this case:

$y = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times Newspaper$

The $\beta$ values are called the **model coefficients**. These values are "learned" during the model fitting step using the "least squares" criterion. Then, the fitted model can be used to make predictions!
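To make the "least squares" idea concrete, here is a small NumPy-only sketch. It generates synthetic data from known coefficients (the values 3.0, 0.05, 0.2 and 0.0 are made up for illustration, not the Advertising results) and recovers them with `np.linalg.lstsq`, which finds the $\beta$ values minimizing the sum of squared errors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic features (made-up ranges, loosely in "thousands of dollars").
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
newspaper = rng.uniform(0, 100, n)

# Synthetic response built from known coefficients plus noise.
sales = 3.0 + 0.05 * tv + 0.2 * radio + 0.0 * newspaper + rng.normal(0, 1.5, n)

# Design matrix: a column of ones for the intercept, then the features.
A = np.column_stack([np.ones(n), tv, radio, newspaper])

# Least squares: beta minimizes sum((sales - A @ beta) ** 2).
beta, *_ = np.linalg.lstsq(A, sales, rcond=None)
print(beta)  # beta_0, beta_1, beta_2, beta_3
```

The fitted values should land close to the coefficients used to generate the data, with small differences caused by the noise term.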

## Preparing X and y using pandas

- scikit-learn expects X (feature matrix) and y (response vector) to be NumPy arrays.
- However, pandas is built on top of NumPy.
- Thus, X can be a pandas DataFrame and y can be a pandas Series!

In [None]:
# create a Python list of feature names

# use the list to select a subset of the original DataFrame

# equivalent command to do this in one line

# print the first 5 rows


In [None]:
# check the type and shape of X


In [None]:
# select a Series from the DataFrame

# equivalent command that works if there are no spaces in the column name

# print the first 5 values


In [None]:
# check the type and shape of y
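The cells above could be filled in roughly like this. Because `Advertising.csv` is not reproduced here, `data` is a synthetic table with the same column names (an assumption), so the printed values will differ from the real file.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for Advertising.csv (randomly generated, 200 rows).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "TV": rng.uniform(0, 300, 200),
    "Radio": rng.uniform(0, 50, 200),
    "Newspaper": rng.uniform(0, 100, 200),
})
data["Sales"] = 3 + 0.05 * data["TV"] + 0.2 * data["Radio"] + rng.normal(0, 1.5, 200)

# Feature matrix: select a subset of columns with a list of names.
feature_cols = ["TV", "Radio", "Newspaper"]
X = data[feature_cols]
print(type(X), X.shape)

# Response vector: a single column is a Series.
y = data["Sales"]  # data.Sales also works, since the name has no spaces
print(type(y), y.shape)
```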


## Splitting X and y into training and testing sets

In [None]:
# separate into training and testing sets using train_test_split()
# keep in mind that the split is random


In [None]:
# default split is 75% for training and 25% for testing
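A minimal sketch of the split, again on a synthetic stand-in for the Advertising data. The `random_state=1` argument is an optional choice made here so that the random split is reproducible; it is not required.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Advertising.csv (randomly generated, 200 rows).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "TV": rng.uniform(0, 300, 200),
    "Radio": rng.uniform(0, 50, 200),
    "Newspaper": rng.uniform(0, 100, 200),
})
data["Sales"] = 3 + 0.05 * data["TV"] + 0.2 * data["Radio"] + rng.normal(0, 1.5, 200)

X = data[["TV", "Radio", "Newspaper"]]
y = data["Sales"]

# Rows are shuffled before splitting; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# With the default test_size, 200 rows become 150 training and 50 testing rows.
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```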


## Linear regression in scikit-learn

In [None]:
# import model LinearRegression from sklearn.linear_model

# create model

# fit the model to the training data (learn the coefficients)


### Interpreting model coefficients

In [None]:
# print the intercept and coefficients


In [None]:
# pair the feature names with the coefficients
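Putting the last few cells together, fitting the model and reading off its coefficients might look like the sketch below. It runs on synthetic data (an assumption), so the numbers it prints will not match the equation that follows, which comes from the real Advertising data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Advertising.csv (randomly generated, 200 rows).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "TV": rng.uniform(0, 300, 200),
    "Radio": rng.uniform(0, 50, 200),
    "Newspaper": rng.uniform(0, 100, 200),
})
data["Sales"] = 3 + 0.05 * data["TV"] + 0.2 * data["Radio"] + rng.normal(0, 1.5, 200)

feature_cols = ["TV", "Radio", "Newspaper"]
X_train, X_test, y_train, y_test = train_test_split(
    data[feature_cols], data["Sales"], random_state=1
)

# Create the model, then learn the coefficients from the training data.
linreg = LinearRegression()
linreg.fit(X_train, y_train)

# Fitted attributes end with an underscore.
print(linreg.intercept_)
print(linreg.coef_)

# Pair each feature name with its coefficient.
pairs = list(zip(feature_cols, linreg.coef_))
print(pairs)
```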


$$y = 2.88 + 0.0466 \times TV + 0.179 \times Radio + 0.00345 \times Newspaper$$

How do we interpret the **TV coefficient** (0.0466)?

- For a given amount of Radio and Newspaper ad spending, **a "unit" increase in TV ad spending** is associated with a **0.0466 "unit" increase in Sales**.
- Or more clearly: For a given amount of Radio and Newspaper ad spending, **an additional $1,000 spent on TV ads** is associated with an **increase in sales of 46.6 items**.

Important notes:

- This is a statement of **association**, not **causation**.
- If an increase in TV ad spending was associated with a **decrease** in sales, $\beta_1$ would be **negative**.

### Making predictions

In [None]:
# make predictions on the testing set


We need an **evaluation metric** in order to compare our predictions with the actual values!

## Model evaluation metrics for regression

Evaluation metrics for classification problems, such as **accuracy**, are not useful for regression problems. Instead, we need evaluation metrics designed for comparing continuous values.

Let's create some example numeric predictions, and calculate **three common evaluation metrics** for regression problems:

In [None]:
# define true and predicted response values


**Mean Absolute Error** (MAE) is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

In [None]:
# calculate MAE by hand

# calculate MAE using scikit-learn
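A small sketch of both MAE calculations. The four true and predicted values are made-up illustration numbers, not results from the Sales model.

```python
import numpy as np
from sklearn import metrics

# Made-up example values for illustration.
true = [100, 50, 30, 20]
pred = [90, 50, 50, 30]

# By hand: mean of the absolute errors, (10 + 0 + 20 + 10) / 4.
mae_by_hand = np.mean(np.abs(np.array(true) - np.array(pred)))

# Using scikit-learn.
mae_sklearn = metrics.mean_absolute_error(true, pred)

print(mae_by_hand, mae_sklearn)  # both are 10.0
```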


**Mean Squared Error** (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

In [None]:
# calculate MSE by hand

# calculate MSE using scikit-learn
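The same idea for MSE, using the same kind of made-up illustration values. Note how the single error of 20 contributes 400 of the 600 total squared error.

```python
import numpy as np
from sklearn import metrics

# Made-up example values for illustration.
true = [100, 50, 30, 20]
pred = [90, 50, 50, 30]

# By hand: mean of the squared errors, (100 + 0 + 400 + 100) / 4.
mse_by_hand = np.mean((np.array(true) - np.array(pred)) ** 2)

# Using scikit-learn.
mse_sklearn = metrics.mean_squared_error(true, pred)

print(mse_by_hand, mse_sklearn)  # both are 150.0
```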


**Root Mean Squared Error** (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

In [None]:
# calculate RMSE by hand


# calculate RMSE using scikit-learn
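For RMSE, one version-independent approach (a choice made for this sketch) is to take the square root of scikit-learn's MSE with NumPy. The values are again made up for illustration.

```python
import numpy as np
from sklearn import metrics

# Made-up example values for illustration.
true = [100, 50, 30, 20]
pred = [90, 50, 50, 30]

# By hand: square root of the mean of the squared errors.
rmse_by_hand = np.sqrt(np.mean((np.array(true) - np.array(pred)) ** 2))

# Using scikit-learn's MSE, followed by a square root.
rmse_sklearn = np.sqrt(metrics.mean_squared_error(true, pred))

print(rmse_by_hand, rmse_sklearn)  # both are sqrt(150), about 12.25
```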


Comparing these metrics:

- **MAE** is the easiest to understand, because it's the average absolute error.
- **MSE** is more popular than MAE, because MSE "punishes" larger errors.
- **RMSE** is even more popular than MSE, because RMSE is interpretable in the "y" units.

### Computing the RMSE for our Sales predictions
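A minimal end-to-end sketch of this step: split, fit, predict on the testing set, then compute the RMSE. It uses a synthetic stand-in for the Advertising data (an assumption), so the resulting number is not the RMSE of the real Sales model.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Advertising.csv (randomly generated, 200 rows).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "TV": rng.uniform(0, 300, 200),
    "Radio": rng.uniform(0, 50, 200),
    "Newspaper": rng.uniform(0, 100, 200),
})
data["Sales"] = 3 + 0.05 * data["TV"] + 0.2 * data["Radio"] + rng.normal(0, 1.5, 200)

X = data[["TV", "Radio", "Newspaper"]]
y = data["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

linreg = LinearRegression()
linreg.fit(X_train, y_train)

# Predict on rows the model has not seen, then compare with the truth.
y_pred = linreg.predict(X_test)
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print(rmse)
```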

## Feature selection

Does **Newspaper** "belong" in our model? In other words, does it improve the quality of our predictions?

Let's **remove it** from the model and check the RMSE!

In [None]:
# compute y_pred with new features


In [None]:
# check RMSE 

The RMSE **decreased** when we removed Newspaper from the model. (Error is something we want to minimize, so **a lower number for RMSE is better**.) Thus, it is unlikely that this feature is useful for predicting Sales, and it should be removed from the model.
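The comparison above can be sketched with a small helper that repeats the split, fit, predict, and RMSE steps for any list of features. `holdout_rmse` is a hypothetical name introduced for this example, and the data are synthetic, so the two numbers it prints are not the lecture's results and may not show the same decrease.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Advertising.csv (randomly generated, 200 rows).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "TV": rng.uniform(0, 300, 200),
    "Radio": rng.uniform(0, 50, 200),
    "Newspaper": rng.uniform(0, 100, 200),
})
data["Sales"] = 3 + 0.05 * data["TV"] + 0.2 * data["Radio"] + rng.normal(0, 1.5, 200)


def holdout_rmse(feature_cols):
    """Fit on the training split and return the RMSE on the testing split."""
    X = data[feature_cols]
    y = data["Sales"]
    # Same random_state, so both feature sets are judged on the same rows.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return np.sqrt(metrics.mean_squared_error(y_test, y_pred))


rmse_all = holdout_rmse(["TV", "Radio", "Newspaper"])
rmse_no_newspaper = holdout_rmse(["TV", "Radio"])
print(rmse_all, rmse_no_newspaper)
```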

### Lecture exercises

1. Make a scatter plot of `y_pred` vs `y` (your prediction of sales vs the true value of sales). If your prediction was "perfect", you would get all the data in a 1-to-1 line. Plot this line. 

2. Make a histogram of the "residuals" of your predictions for whatever model you like. This is a histogram of the following quantities (one for each prediction $i$): 

$$(y^{\rm pred}_{i} - y_{i}) / \sigma(y_{i})$$

where $\sigma(y_{i})$ is the standard deviation of the predictions. This is another good way to evaluate your model.

3. Write and train a new model that only uses Radio as the feature. What is the RMSE?


## Resources

Linear regression:

- [Longer notebook on linear regression](https://github.com/justmarkham/DAT4/blob/master/notebooks/08_linear_regression.ipynb) by Data School
- Chapter 3 of [An Introduction to Statistical Learning](https://www.statlearning.com/) and [related videos](https://www.dataschool.io/15-hours-of-expert-machine-learning-videos/) by Hastie and Tibshirani (Stanford)
- [Quick reference guide to applying and interpreting linear regression](https://www.dataschool.io/applying-and-interpreting-linear-regression/) by Data School
- [Introduction to linear regression](http://people.duke.edu/~rnau/regintro.htm) by Robert Nau (Duke)

Pandas:

- [pandas Q&A video series](https://www.dataschool.io/easier-data-analysis-with-pandas/) by Data School
- [Three-part pandas tutorial](http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/) by Greg Reda
- [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) and [read_table](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_table.html) documentation

Seaborn:

- [Official seaborn tutorial](http://seaborn.pydata.org/tutorial.html)
- [Example gallery](http://seaborn.pydata.org/examples/index.html)