> **Jupyter slideshow:** This notebook can be displayed as slides. To view it as a slideshow in your browser, type the following in the console:


> `> jupyter nbconvert [this_notebook.ipynb] --to slides --post serve`


> To toggle off the slideshow cell formatting, click the `CellToolbar` button, then `View --> Cell Toolbar --> None`

<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Research Design and Machine Learning

_Authors: Alexander Egorenkov (DC), Amy Roberts (NYC)_

---

<a id="learning-objectives"></a>
## Learning objectives
- Define a problem and types of data
- Identify data set types
- Define the data science workflow
- Introduce machine learning and its role in data science

### Lesson Guide
- [Learning objectives](#learning-objectives)
- [Experimental Design](#experimental-design)
	- [Let's review the data science workflow](#lets-review-the-data-science-workflow)
	- [Asking a good question](#asking-a-good-question)
	- [What is a good question?](#what-is-a-good-question)
- [Data Types](#data-types)
	- [Why data types matter](#why-data-types-matter)
	- [Cross-sectional data](#cross-sectional-data)
	- [Time series/longitudinal data](#time-serieslongitudinal-data)
- [What are some common questions in data science?](#what-are-some-common-questions-in-data-science)
- [Machine Learning](#machine-learning)
	- [What is machine learning?](#what-is-machine-learning)
	- [Types of machine learning](#types-of-machine-learning)
- [Machine learning terminology](#machine-learning-terminology)
- [Supervised learning](#supervised-learning)
	- [Classification vs. Regression](#classification-vs-regression)
	- [Regression or classification?](#regression-or-classification)
- [Unsupervised learning](#unsupervised-learning)
	- [Common types of unsupervised learning](#common-types-of-unsupervised-learning)
- [Intro to scikit-learn](#intro-to-sci-kit-learn)


<a id="experimental-design"></a>
## Experimental Design

<a id="lets-review-the-data-science-workflow"></a>
### Let's review the data science workflow
The steps:
1. Identify the problem
2. Acquire the data
3. Parse the data
4. Mine the data
5. Refine the data
6. Build a data model
7. Present the results

We’re going to focus on steps 1-2 (Identify the Problem and Acquire the Data).
We’ll also talk about modeling.

![](./assets/images/data-science-workflow.png)

<a id="asking-a-good-question"></a>
### Asking a good question

**Why do we need a good question?**

“A problem well stated is half solved.” -Charles Kettering
- Sets you up for success as you begin analysis
- Establishes the basis for reproducibility
- Enables collaboration through clear goals
    - Really hard to collaborate without a vision

One way to evaluate a question is with the SMART goals framework:

- **Specific**:  The dataset and key variables are clearly defined.
- **Measurable**:  The type of analysis and major assumptions are articulated.
- **Attainable**:  The question you are asking is feasible for your dataset and is not likely to be biased.
- **Reproducible**:  Another person (or future you) can read and understand exactly how your analysis is performed.
- **Time-bound**:  You clearly state the time period and population for which this analysis will pertain.

<a id="what-is-a-good-question"></a>
### What is a good question?

#### Demo: Diagramming An AIM

Determine the association of foods in the home with child dietary intake. Using one 24-hour recall from the cross-sectional NHANES 2007-2010 we will determine the factors associated with food available in the homes of American children and adolescents. We will test if reported availability of fruits, dark green vegetables, low fat milk or sugar sweetened beverages available in the home increases the likelihood that children and adolescents will meet their USDA recommended dietary intake for that food.

#### Hypothesis

Children will be more likely to meet the USDA recommended intake level when food is always available in their home compared to rarely or never.

**How data was collected:**

24-hour recall, self-reported

**What data was collected:**

Fruits, dark green vegetables, low fat milk or sugar sweetened beverages, always vs. rarely available

**How data will be analyzed:**

Using USDA recommendations as a gold-standard to measure the association

**The specific hypothesis & direction of the expected associations:**

Children will be more likely to meet their recommended intake level

#### Measurable
Determine the association of foods in the home with child dietary intake.

We will test if the reported availability of certain foods increases the likelihood that children and adolescents will meet their USDA recommended dietary intake for that food.

#### Attainable
Cross-sectional data has inherent limitations; one of the most common is that causal inference is typically not possible.

Note that we are determining association, not causation.

#### Reproducible
With all the specifics, it would be straightforward to pull the data from NHANES and reproduce the analysis.

#### Time-bound

Using one 24-hour recall from NHANES 2007-2010, we will determine the factors associated with food available in the homes of American children and adolescents.


The previous example laid out research goals.

In a business setting, you will need to articulate business objectives.

Example: Success for the Netflix recommendation engine might be defined as 70% of customers over the age of 18 selecting a movie from the recommended queue during Q3 of 2015.
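
Once an objective is stated this precisely, checking it becomes a small calculation. The sketch below is illustrative only: the `customers` table, its column names, and every value in it are invented, and it assumes the rows have already been restricted to Q3 of 2015.

```python
import pandas as pd

# Hypothetical customer activity for the quarter; all values are made up.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "age": [17, 25, 34, 41, 19, 52],
    "picked_from_recommended": [True, True, False, True, True, False],
})

# Restrict to the population named in the objective: customers over 18.
adults = customers[customers["age"] > 18]

# Share of those customers who selected a movie from the recommended queue.
selection_rate = adults["picked_from_recommended"].mean()

# Compare against the 70% target stated in the objective.
target_met = selection_rate >= 0.70
print(f"Selection rate: {selection_rate:.0%}, target met: {target_met}")
```

Because the population, the metric, and the threshold are all explicit, anyone with the same data can reproduce the result.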

Regardless of setting, planning ahead and outlining the entire project will help you save time. You'll prioritize tasks better, you won't get stuck on small details that don't matter, and most importantly, you'll know when to stop.

#### Knowledge Check
Answer the following questions

**Which of the following uses the SMART framework?  Why?  What is missing?**

I am looking to see if there is an association between the number of passengers with carry-on luggage and delayed take-off time.

Determine if the number of passengers on JetBlue, Delta and United domestic flights with carry-on luggage is associated with delayed take-off time using data from flightstats.com from January 2015 to December 2015.

<a id="data-types"></a>
## Data Types
**Cross-sectional**

All information is determined at the same time; all data comes from the same time period.

**Time series**

The information is collected over a period of time for a single group.

**Longitudinal/Panel**

The information is collected over a period of time for several groups.

(Check out the data structures available in Pandas http://pandas.pydata.org/pandas-docs/stable/overview.html)
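
To make the three definitions concrete, here is a minimal sketch of how each shape might look in pandas. The subjects, column names, and values are all invented for illustration; real data sets will be organized differently.

```python
import pandas as pd

# Cross-sectional: several subjects, all measured at one point in time.
cross_sectional = pd.DataFrame({
    "subject": ["a", "b", "c"],
    "height_cm": [150.0, 162.5, 171.0],
})

# Time series: a single group observed repeatedly over time.
dates = pd.date_range("2015-01-01", periods=4, freq="D")
time_series = pd.Series([10, 12, 9, 14], index=dates, name="delayed_flights")

# Longitudinal/panel: several groups, each observed repeatedly over time.
panel = pd.DataFrame({
    "group": ["A", "A", "B", "B"],
    "date": pd.to_datetime(["2015-01-01", "2015-01-02"] * 2),
    "delayed_flights": [3, 5, 2, 4],
}).set_index(["group", "date"])

print(cross_sectional.shape, time_series.shape, panel.shape)
```

The panel uses a two-level index (group, then date), which is one common way to keep track of both dimensions at once.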

<a id="why-data-types-matter"></a>
### Why data types matter
- Different data types have different limitations and strengths.
- Certain types of analyses aren’t possible with certain data types.

<a id="cross-sectional-data"></a>
### Cross-sectional data

**Strengths**
- Often population based
- Generalizability
- Reduced cost compared to other types of data collection methods

**Weaknesses**
- Exposure and outcome are measured at the same time, so there is no temporal distinction between them
- Separation of cause and effect may be difficult (or impossible)
- Variables/cases with long duration are over-represented

<a id="time-serieslongitudinal-data"></a>
### Time series/longitudinal data

**Strengths**
- Unambiguous temporal sequence - exposure precedes outcome
- Multiple outcomes can be measured

**Weaknesses**
- Expense
- Takes a long time to collect data
- Vulnerable to missing data

#### Knowledge Check

**What type of data is the flight_delays data?**

**Can you create a cross-sectional analysis from a longitudinal data collection? How?**

In [None]:
import pandas as pd

In [None]:
# Load the flight_delays dataset from the dataset folder
flights = pd.read_csv('../../dataset/flight_delays.csv')

<a id="what-are-some-common-questions-in-data-science"></a>
## What are some common questions in data science?
**Machine learning more or less asks the following questions:**
- Does X predict y, where X is a set of data and y is an outcome?
- Are there any distinct groups in our data?
- What are the key components of our data?
- Is one of our observations “weird”?

This seems limited, but we rewrite most questions to fit this form.

**From a business perspective:**
- What is the likelihood a customer will buy this product?
- Is this a good or bad review?
- How much demand will there be for my service tomorrow?
- Is this the cheapest way to deliver my goods?
- Is there a better way to segment my marketing strategies?
- What groups of products are purchased together?
- Can we automate this simple yes/no decision?

There are endless examples from various fields: https://www.kaggle.com/wiki/DataScienceUseCases

**Class activity: Rewriting questions**

Answer the following question

**How do the machine learning questions fit with the business application questions in the previous slide?**

**Activity: Write a research question with raw data**

Individually, look at the data from Kaggle’s Titanic competition and write a high-quality research question.

Make sure you answer the following questions:
- What type of data is this, cross-sectional or longitudinal?
- What will we be measuring?
- What is the SMART aim for this data?

When finished, split into pairs and share your answers with each other.

The data dictionary can be found here https://www.kaggle.com/c/titanic/data

In [None]:
titanic = pd.read_csv('../../dataset/titanic.csv')

<a id="machine-learning"></a>
## Machine Learning

<a id="what-is-machine-learning"></a>
### What is machine learning?

"A field of study that gives computers the ability to learn without being explicitly programmed."
~ Arthur Samuel, AI pioneer

"Seriously, I don't like the phrase "Big Data". I prefer "Data Science", which is the automatic (or semi-automatic) extraction of knowledge from data. That is here to stay, it's not a fad. The amount of data generated by our digital world is growing exponentially with high rate (at the same rate our hard-drives and communication networks are increasing their capacity). But the amount of human brain power in the world is not increasing nearly as fast. This means that now or in the near future most of the knowledge in the world will be extracted by machine and reside in machines. It's inevitable. An entire industry is building itself around this, and a new academic discipline is emerging."
~ Yann LeCun

One definition: “Machine learning is the semi-automatic extraction of knowledge from data.”

**Knowledge from data:** Starts with a question that might be answerable using data

**Automatic extraction:** A computer provides the insight

**Semi-automatic:** Requires many smart decisions by a human

<a id="types-of-machine-learning"></a>
### Types of machine learning
- supervised learning
    - classification - predict categories
    - regression - predict numbers
- unsupervised learning
    - clustering - identify groups within data
    - dimensionality reduction - identify components or key variables within data
    
(One could give a much more detailed taxonomy, but this is useful.)    

There are two main categories of machine learning: supervised learning and unsupervised learning.

**Supervised learning (aka “predictive modeling”):**
- Predict an outcome based on input data
- Example: predict whether an email is spam or ham
- Goal is “generalization”

**Unsupervised learning:**
- Extracting structure from data
- Example: segment grocery store shoppers into “clusters” that exhibit similar behaviors
- Goal is “representation”

It's typical to combine both in a project: learning a better representation from unlabeled data can reduce the cost of collecting labeled data. Reusing a representation learned on one task or dataset for another is referred to as transfer learning.

Unsupervised learning tends to be a more difficult problem because the goals are amorphous. Supervised learning has goals that are almost too clear, which can lead people into the trap of optimizing metrics without thinking about business value.

<a id="machine-learning-terminology"></a>
## Machine learning terminology
![](./assets/images/feature_matrix.png)

**Observations** are also known as: samples, examples, instances, records

**Features** are also known as: predictors, independent variables, inputs, regressors, covariates, attributes

**Response** is also known as: outcome, label, target, dependent variable
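
In code, these terms usually map onto a table with one row per observation, a 2-D block of feature columns (conventionally named `X`), and a 1-D response column (conventionally named `y`). A minimal sketch, with invented column names and illustrative values:

```python
import pandas as pd

# Hypothetical labeled data: each row is one observation.
data = pd.DataFrame({
    "size_mm": [19.0, 21.2, 17.9, 24.3],
    "mass_g": [2.5, 5.0, 2.3, 5.7],
    "coin_type": ["penny", "nickel", "dime", "quarter"],
})

# Features: one column per predictor, stored as a 2-D structure.
X = data[["size_mm", "mass_g"]]

# Response: the outcome we want to predict, stored as a 1-D structure.
y = data["coin_type"]

print(X.shape)  # (number of observations, number of features)
print(y.shape)  # (number of observations,)
```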

<a id="supervised-learning"></a>
## Supervised learning

How does supervised learning “work”?

1. Train a **machine learning model** using **labeled data**
    - “Labeled data” is data with a response variable
    - “Machine learning model” learns the relationship between the features and the response

2. Make predictions on **new data** for which the response is unknown

The primary goal of supervised learning is to build a model that “generalizes”: It accurately predicts the **future** rather than the **past**!

![](./assets/images/supervised-learning.png)
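
The two steps above can be sketched with scikit-learn. This is only a sketch under stated assumptions: the data is synthetic (generated by `make_classification`), logistic regression is an arbitrary choice of model, and holding out 25% of the labeled rows to stand in for "new data" is our choice rather than something the lesson prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic labeled data standing in for a real dataset.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Hold out some labeled rows to play the role of unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Step 1: train the model on labeled data.
model = LogisticRegression()
model.fit(X_train, y_train)

# Step 2: make predictions on data the model has not seen.
predictions = model.predict(X_test)

# Accuracy on held-out data is an estimate of how well the model generalizes.
test_accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {test_accuracy:.2f}")
```

Scoring on rows the model never saw during training is what separates predicting the future from memorizing the past.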

<a id="classification-vs-regression"></a>
### Classification vs. Regression

There are two categories of supervised learning:

**Regression**
- Outcome we are trying to predict is continuous
- Examples: price, blood pressure

**Classification**
- Outcome we are trying to predict is categorical (values in a finite set)
- Examples: spam/ham, cancer class of tissue sample

The type of supervised learning problem has nothing to do with the features; only the response matters!
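
A small sketch may help here: the same feature matrix can feed either kind of problem, and only the response changes. The data below is invented, and `LinearRegression` and `LogisticRegression` are simply convenient examples of a regressor and a classifier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Invented features: one column (years of experience), six observations.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Continuous response -> regression.
salary = np.array([40.0, 45.0, 52.0, 58.0, 66.0, 71.0])  # in thousands
regressor = LinearRegression().fit(X, salary)

# Categorical response -> classification.
is_senior = np.array([0, 0, 0, 1, 1, 1])
classifier = LogisticRegression().fit(X, is_senior)

new_X = np.array([[3.5]])
print(regressor.predict(new_X))   # a number on a continuous scale
print(classifier.predict(new_X))  # one of the known categories
```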

<a id="regression-or-classification"></a>
### Regression or classification?

#### Predict salary using demographic data
![](./assets/images/salary-regression.png)

#### Identify the numbers in a handwritten zip code
![](./assets/images/ocr-classification.png)

**Problem:** Children born prematurely are at high risk of developing infections, many of which are not detected until after the baby is sick

**Goal:** Detect subtle patterns in the data that predict infection before it occurs



**Data:** 16 vital signs such as heart rate, respiration rate, blood pressure, etc…

**Impact:** Model is able to predict the onset of infection 24 hours before the traditional symptoms of infection appear

**Case Study:** http://www.amazon.com/Big-Data-Revolution-Transform-Think/dp/0544002695

![](./assets/images/netflix.png)

**Supervised learning example: Coin classifier**

- **Observations:** Coins
- **Features:** Size and mass
- **Response:** Coin type, hand-labeled

- Train a machine learning model using labeled data
    - Model learns the relationship between the features and the coin type

- Make predictions on new data for which the response is unknown
    - Given a new coin, the model predicts the coin type automatically

![](./assets/images/supervised-coins.png)
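
The coin classifier can be sketched in a few lines. Everything specific here is an assumption: the coin measurements are simulated around illustrative values, and k-nearest neighbors is just one classifier that suits the example, not the method the lesson commits to.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Hypothetical typical (size in mm, mass in g) for four coin types.
centers = {
    "penny": (19.0, 2.5),
    "nickel": (21.2, 5.0),
    "dime": (17.9, 2.3),
    "quarter": (24.3, 5.7),
}

# Simulate 20 hand-labeled coins per type with small measurement noise.
X_parts, y_parts = [], []
for coin_type, center in centers.items():
    X_parts.append(rng.normal(loc=center, scale=0.1, size=(20, 2)))
    y_parts.extend([coin_type] * 20)
X = np.vstack(X_parts)
y = np.array(y_parts)

# Train: the model learns the relationship between features and coin type.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

# Predict: a new, unlabeled coin measured at 24.2 mm and 5.6 g.
new_coin = np.array([[24.2, 5.6]])
predicted_type = model.predict(new_coin)[0]
print(predicted_type)
```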

<a id="unsupervised-learning"></a>
## Unsupervised learning

**Supervised learning (aka “predictive modeling”):**
- Predict an outcome based on input data
- Example: predict whether an email is spam or ham
- Goal is “generalization”

**Unsupervised learning:**
- Extracting structure from data
- Example: segment grocery store shoppers into “clusters” that exhibit similar behaviors
- Goal is “representation”

<a id="common-types-of-unsupervised-learning"></a>
### Common types of unsupervised learning
- Clustering: group “similar” data points together
- Dimensionality Reduction: reduce the dimensionality of a dataset by extracting features that capture most of the variance in the data

Unsupervised learning has some clear differences from supervised learning. With unsupervised learning:

- There is no clear objective
- There is no “right answer” (hard to tell how well you are doing)
- There is no response variable, just observations with features
- Labeled data is not required

**Unsupervised learning example: Coin clustering and dimensionality reduction**

- **Observations:** Coins
- **Features:** Size and mass
- **Response:** There isn’t one (no hand-labeling required!)

![](./assets/images/unsupervised-coins.png)

**Clustering**
1. Perform unsupervised learning
2. Cluster the coins based on “similarity”
3. Inspect the grouping that the algorithm found
4. You’re done!

Hopefully this would put the coins into four separate groups.
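
As a rough sketch of the clustering steps, the example below runs k-means on simulated coin measurements. The data is invented, k-means is only one of many clustering algorithms, and asking for four clusters is an assumption we supply: the algorithm never sees any coin labels.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Simulated (size in mm, mass in g) measurements around four invented centers.
centers = np.array([[19.0, 2.5], [21.2, 5.0], [17.9, 2.3], [24.3, 5.7]])
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 2)) for c in centers])

# Cluster the coins based on similarity (distance in feature space).
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Inspect the grouping that the algorithm found.
for cluster_id in range(4):
    members = X[cluster_ids == cluster_id]
    print(cluster_id, len(members), members.mean(axis=0).round(1))
```

The cluster numbers are arbitrary identifiers; deciding what each group means is left to the person inspecting the output.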

**Dimensionality reduction**
1. Perform unsupervised learning
2. Combine the original features into a smaller set of derived features
3. Inspect the features produced by the algorithm
4. You’re done!

Hopefully the algorithm would recognize something like: 

$$\dfrac {mass} {size} = density$$
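
One common dimensionality reduction technique is principal component analysis (PCA). The sketch below applies it to invented coin data. Note the limitation: PCA finds linear combinations of the features, so it would not discover a ratio such as density on its own; it is shown only to illustrate collapsing two correlated features into one.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Invented coins: mass grows roughly in proportion to size, plus noise.
size = rng.uniform(17.0, 25.0, size=100)
mass = 0.25 * size + rng.normal(scale=0.2, size=100)
X = np.column_stack([size, mass])

# Standardize so neither feature dominates because of its units.
X_scaled = StandardScaler().fit_transform(X)

# Reduce the two correlated features to a single derived feature.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)
```

`explained_variance_ratio_` reports how much of the original variance the derived feature retains.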

Sometimes, unsupervised learning is used as a “preprocessing” step for supervised learning. (How?)

<a id="intro-to-sci-kit-learn"></a>
## Intro to scikit-learn

- We typically won't be implementing machine learning algorithms from scratch
- scikit-learn, commonly imported as `sklearn`, is a very popular machine learning library in Python
- It benefits from ease of use and great documentation
- It's possible to find other libraries and tools with better-performing algorithms, but sklearn is a great start

http://scikit-learn.org/stable/
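
Part of what makes the library easy to use is that its models share one interface: create an estimator, call `fit`, then call `predict`. A minimal sketch using the iris data set bundled with scikit-learn; the two models are arbitrary picks to show that the calls stay the same.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Small example data set that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Two different algorithms, used through exactly the same calls.
models = (
    KNeighborsClassifier(n_neighbors=5),
    DecisionTreeClassifier(random_state=0),
)
for model in models:
    model.fit(X, y)                  # learn from features and response
    predictions = model.predict(X[:3])  # predict for the first three rows
    print(type(model).__name__, predictions)
```

Swapping one algorithm for another usually means changing only the line that creates the estimator.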