<a id="part2"> </a>
# Part 2: Foundational Concepts of Machine Learning


<a id="ai"> </a>
## Data Science vs. Machine Learning vs. AI

- Machine learning (ML) is generally prediction-oriented, whereas statistical modeling is generally interpretation-oriented.

<img src="assets/ml.png" width="500"/>

Source: [Laurence Moroney](https://www.meetup.com/The-Telegraph-Engineering/events/251468961/) (Staff Developer Advocate at Google)

- We might use machine learning to build a solution (e.g. showing an ad based on a predicted age group).
- A system we build might exhibit behaviours that qualify as AI (e.g. a recommender system).

<img src="assets/AI-ML-DL.png" width="500"/>

Source: "[Deep Learning](https://www.deeplearningbook.org/contents/intro.html)", Ian Goodfellow and Yoshua Bengio and Aaron Courville, MIT Press, 2016

<a id="ds-problems"> </a>

## Data Science Problems

Example questions for Data Science:

- How many products will we sell tomorrow?
- Is this picture a hot dog or not a hot dog?
- Based on this user's purchase history, which other users should we target with similar ads?
- Is there something suspicious about this credit card transaction?

### How many products will we sell tomorrow?



This is a <strong style="color:green">regression</strong> problem, because the answer is a number on a **continuous** range.
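
As a minimal sketch of what "a number on a continuous range" means in practice, the snippet below fits a straight line to a week of made-up daily sales and extrapolates one day ahead. The data, the `fit_line` helper, and the choice of a simple linear trend are all assumptions for illustration; real sales forecasting would normally use more features and a proper library.

```python
# Hypothetical daily unit sales for the last 7 days (made-up, deliberately tidy numbers).
days = [1, 2, 3, 4, 5, 6, 7]
sales = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(days, sales)

# The prediction for day 8 ("tomorrow") can be any real number,
# which is what makes this a regression problem.
tomorrow = 8
predicted_sales = slope * tomorrow + intercept
print(predicted_sales)
```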

### Is this picture a hot dog or not a hot dog?



This is a <strong style="color:green">classification</strong> problem, because the answer is one of a **discrete set** of answers.

### Based on this user's purchase history, which other users should we target with similar ads?



This is a <strong style="color:green">clustering</strong> problem, because we are **grouping together** users **without knowing the groups in advance**.
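
To make "grouping without knowing the groups in advance" concrete, here is a small sketch of k-means on a single feature. The `monthly_spend` values, the `kmeans_1d` helper, and the choice of two clusters started at the minimum and maximum are assumptions made for this example, not part of the lesson's data.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
            for v in values
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels, centroids


# Hypothetical monthly spend per user (made-up numbers). No labels are supplied.
monthly_spend = [12.0, 15.0, 11.0, 95.0, 102.0, 99.0]
labels, centroids = kmeans_1d(
    monthly_spend, centroids=[min(monthly_spend), max(monthly_spend)]
)
print(labels, centroids)
```

The algorithm is never told which users are "low spenders" or "high spenders"; the two groups emerge from the data alone.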

### Is there something suspicious about this credit card transaction?



This is an <strong style="color:green">anomaly detection</strong> problem, because we are looking for things that are **outside some definition of "normal"**.
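
One simple way to define "normal" is by distance from the historical average, measured in standard deviations (a z-score). The sketch below assumes a made-up transaction history and a threshold of 3 standard deviations; both are illustrative choices, and production fraud detection uses far richer signals.

```python
import statistics

# Hypothetical past transaction amounts for one card (made-up numbers).
history = [21.5, 19.9, 23.1, 20.4, 22.0, 18.7, 20.9, 21.3]


def is_suspicious(amount, past_amounts, threshold=3.0):
    """Flag an amount whose z-score against past amounts exceeds the threshold."""
    mean = statistics.mean(past_amounts)
    spread = statistics.stdev(past_amounts)
    z_score = abs(amount - mean) / spread
    return z_score > threshold


print(is_suspicious(22.4, history))   # a typical amount
print(is_suspicious(950.0, history))  # far outside the usual range
```

Note that the baseline is computed from past transactions only. If the outlier were included when computing the mean and standard deviation of a small sample, it would inflate the spread and could hide itself.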

<a id="ds-problems-activity"> </a>

## Activity: Data Science Problems

In pairs, think of 2 examples each of a(n):

- regression task
- classification task
- clustering task
- anomaly detection task

Remember:

- Regression = **predicting a continuous outcome**
    - e.g. predicting tomorrow's sales
- Classification = **telling the difference between discrete outcomes**
    - e.g. is this a picture of a hot dog or not?
- Clustering = **finding similar things without a "true" answer**
    - e.g. finding similar users based on purchases
- Anomaly detection = **finding "strange" things**
    - e.g. identifying if a credit card transaction is suspicious (fraudulent)

<a id="supervised"> </a>
## Supervised vs. Unsupervised

There are two main categories of machine learning: supervised learning and unsupervised learning.


### Supervised Learning

Supervised learning tends to be the most frequent type of work that data scientists do and will be the main focus of this course. How does supervised learning work?



1) We train a **machine learning model** using **labeled data**. <br>
    - The “machine learning model” learns some kind of relationship between the features and the response.

2) We make predictions on **new data** for which the response is unknown. <br>

The primary goal of supervised learning is to build a model that “generalizes” — i.e., accurately predicts the **future** rather than the **past**!
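
The two steps above, plus a check on generalization, can be sketched as follows. The labeled data, the `train_threshold` and `predict` helpers, and the very simple "midpoint between class means" model are assumptions for illustration only. The key idea is that accuracy is measured on held-out examples the model never saw during training.

```python
# Hypothetical labeled data: (feature, response) pairs (made-up numbers).
labeled = [
    (1.0, "small"), (1.2, "small"), (0.8, "small"), (1.1, "small"),
    (3.0, "large"), (3.3, "large"), (2.9, "large"), (3.1, "large"),
]

# Hold out half of the labeled data to stand in for "new" data.
train = labeled[::2]
test = labeled[1::2]


def train_threshold(examples):
    """Step 1: learn a decision threshold halfway between the class means."""
    small = [x for x, y in examples if y == "small"]
    large = [x for x, y in examples if y == "large"]
    return (sum(small) / len(small) + sum(large) / len(large)) / 2


def predict(threshold, x):
    """Step 2: predict the response for a feature value."""
    return "large" if x > threshold else "small"


threshold = train_threshold(train)

# Generalization check: score the model only on examples it was not trained on.
correct = sum(predict(threshold, x) == y for x, y in test)
accuracy = correct / len(test)
print(threshold, accuracy)
```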

### Unsupervised Learning

- Extracts structure from data.
    - Example: Segmenting grocery store shoppers into “clusters” that exhibit similar behaviors.
- Attempts to represent the data in a simpler form (for example, as groups or as fewer dimensions).
- **Does not require** past data on the element we want to predict (no labeled data).

#### Common Types of Unsupervised Learning

- **Clustering:** Groups “similar” data points together.
- **Dimensionality reduction:** Reduces the dimensionality of a data set by extracting features that capture most of the variance in the data.
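
As a rough sketch of dimensionality reduction, the snippet below reduces two features to one by projecting onto the direction of greatest variance (the first principal component). It uses the closed-form angle that applies only to the two-feature case. The `points` data and the `first_principal_component` helper are assumptions for this example; in practice a library implementation of PCA would be used.

```python
import math


def first_principal_component(points):
    """Return the direction of greatest variance and each point's 1-D score."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    var_x = sum(x * x for x, _ in centered) / n
    var_y = sum(y * y for _, y in centered) / n
    cov_xy = sum(x * y for x, y in centered) / n

    # For a 2x2 covariance matrix, the axis of greatest variance
    # makes this angle with the x-axis.
    angle = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    direction = (math.cos(angle), math.sin(angle))

    # Project each centered point onto that axis: two features become one.
    scores = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, scores


# Hypothetical data where the second feature is exactly twice the first,
# so a single component captures all of the variance.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, scores = first_principal_component(points)
print(direction, scores)
```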

### Examples

**Supervised Learning: Coin Classifier**

- **Observations:** Coins.
- **Features:** Size and mass.
- **Response or target variable:** Hand-labeled coin type.

- Train a machine learning model using labeled data.
    - The model learns the relationship between the features and the coin type.

- Make predictions on new data for which the response is unknown.
    - Give the model a new coin and it will predict the coin type automatically.
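
The coin classifier could be sketched as a nearest-centroid model: for each coin type, learn the average size and mass, then assign a new coin to the closest average. The measurements, coin names, and the `train` and `predict` helpers below are made up for illustration, and nearest-centroid is just one of many possible models.

```python
import math

# Hypothetical hand-labeled coins: ((diameter in mm, mass in g), coin type).
labeled_coins = [
    ((19.0, 2.5), "penny"), ((19.1, 2.6), "penny"),
    ((21.2, 5.0), "nickel"), ((21.3, 4.9), "nickel"),
    ((24.3, 5.7), "quarter"), ((24.2, 5.6), "quarter"),
]


def train(examples):
    """Learn one centroid (mean size, mean mass) per coin type."""
    grouped = {}
    for features, label in examples:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(sum(column) / len(column) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(model, features):
    """Return the coin type whose centroid is closest to the new coin."""
    return min(model, key=lambda label: math.dist(model[label], features))


model = train(labeled_coins)

# A new coin for which the response is unknown.
new_coin = (21.0, 5.1)
predicted = predict(model, new_coin)
print(predicted)
```

Size and mass are on different scales here, so in a real project the features would usually be standardized before measuring distances.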
    
**Unsupervised Learning: Types of Customers at a Bar**

- **Observations:** Customers.
- **Features:** Drink purchases, people they interact with, etc.
- **Response or target variable:** There isn’t one — instead, we group similar customers together.

<a id="agood-questions"> </a>

### Asking a Good Question

Even though data science projects follow different flows, they all start in the same place: with a problem. From this problem statement arise questions that we ask of the data to gain more information and work toward a solution.


**Why do we need a good question?**

_“A problem well stated is half solved.”_ — Charles Kettering


A good question: 

- Sets you up for success as you begin analysis.
- Establishes the basis for reproducibility.
- Enables collaboration through clear goals.
    - It's hard to collaborate without a vision.

One way to approach formulating a question is through goal-setting via the SMART Goals Framework:


- **Specific**: The data set and key variables are clearly defined.



- **Measurable**: The type of analysis and major assumptions are articulated.


- **Attainable**: The question you are asking is feasible for your data set and not likely to be biased.


- **Reproducible**: Another person (or future you) can read and understand exactly how your analysis is performed.


- **Time-bound**: You clearly state the time period and population to which this analysis pertains.


#### What Are Some Common Questions Asked in Data Science?

**Machine learning more or less asks the following questions:**

- Does X predict Y? (Where X is a set of data and Y is an outcome.)
- Are there any distinct groups in our data?
- What are the key components of our data?
- Is one of our observations “weird”?


**From a business perspective, we can ask:**

- What is the likelihood that a customer will buy this product?
- Is this a good or bad review?
- How much demand will there be for my service tomorrow?
- Is this the cheapest way to deliver my goods?
- Is there a better way to segment my marketing strategies?
- What groups of products are customers purchasing together?
- Can we automate this simple yes/no decision?

_This list may seem limited, but most questions can be rewritten to fit one of these forms._

<a id="simple-workflow"> </a>

## A Simple Workflow

1. Business Question

2. Data Question

3. Data Answer

4. Business Answer

*Credit: Renee Teate, [Becoming a Data Scientist](http://becomingadatascientist.com)*

### Example

#### 1. Business Question

"Is it better to hold a sale of my pork pies in June or July?"

#### 2. Data Question

"In past years, did June or July have higher demand for pork pies?"

#### 3. Data Answer

"The average number of pork pies sold in June over the last 5 years was 25% higher than in July."

#### 4. Business Answer

"I recommend holding a sale in June based on seeing higher sales in the last 5 years."
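
The step from data question to data answer to business answer can be sketched in a few lines. The sales figures below are made up (chosen so that June averages 25% higher than July, matching the example), and the variable names are assumptions for illustration.

```python
from statistics import mean

# Hypothetical pork pie sales for each of the last 5 years (made-up numbers).
june_sales = [240, 260, 250, 245, 255]
july_sales = [190, 210, 200, 205, 195]

# Data answer: compare average demand in the two months.
june_avg = mean(june_sales)
july_avg = mean(july_sales)
percent_higher = (june_avg - july_avg) / july_avg * 100

# Business answer: translate the comparison into a recommendation.
recommended_month = "June" if june_avg > july_avg else "July"
print(
    f"June averaged {percent_higher:.0f}% higher than July; "
    f"hold the sale in {recommended_month}."
)
```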

<a id="complex-workflow"></a>
## A More Complex Data Science Workflow

---



- **Frame**: Develop a hypothesis-driven approach to your analysis.


- **Prepare**: Select, import, explore, and clean your data.


- **Analyse**: Structure, visualise, and complete your analysis.


- **Interpret**: Derive recommendations and business decisions from your data.



- **Communicate**: Present (edited) insights from your data to different audiences.

![](./assets/Data-Framework-White-BG.png)

#### Notes about this workflow

_Remember, these steps are not hard-set rules; instead, think of them as problem-solving guidelines._


- Some projects may not require every step.
- These steps are iterative; it's normal to go back and repeat certain steps a few times in a row.
- The process is cyclical; after completing the process, you may restart it on new findings.