# Introduction to Classification using Machine Learning

<img src="imgs/brain_learning.jpg">

<p style="text-align:left;">
    <a href="https://pixabay.com/illustrations/brain-chip-neurons-machine-learning-6010961/">Photo Credit: Pixabay</a>
    <span style="float:right;">
        March 21, 2022 <br>
        Firas Moosvi
    </span>
</p>

## Learning Context (5 mins)

<center>
<img src="imgs/avatars.jpg" width=50%>
</center>

### Academic program:

- Year 2 of the iSchool Master of Information program
- Concentration: Primarily HCDS students, some UXD, and C&T

### Course Details

- INF 2179 - Machine Learning with Applications in Python
- **Prerequisite**:
    - INF 1340 - Programming for Data Science
- Elective course
- Almost everyone in here wants to be here and is excited to learn more!

### Course Schedule

- Week 1: Review of Python
- Week 2: Loading and cleaning data
- Week 3: Data wrangling
- Week 4: This lesson (Introduction to Classification)

### Programming Experience

- Least experienced: Two terms of working in Python and R sporadically
- Most experienced:  Worked in software industry for 2+ years
- ~ 50 students in the class

### Learning Intentions

- Develop intuition about classification using machine learning.

- Identify the general steps of classification using machine learning.

- Summarize the kNN algorithm and examine its advantages and limitations.

- Critically evaluate the machine learning process and consider the importance of making human-centered choices.

## Overview (1 min)

In today's class we will discuss how to classify data using Machine Learning.

Here is the general workflow of a machine learning process, of which classification is one type:

<img src="https://www.sap.com/dam/application/shared/graphics/what-is-machine-learning-process.svg" width=100%>

**Caption**: General workflow of the Machine Learning process. Image is copyright of [SAP Insights](https://www.sap.com/canada/insights/what-is-machine-learning.html), used under the copyright exception.

## Building some intuition I (6 mins)

### Activity (3 mins)

- Set up a Game
    - We are going to see a series of objects, and your task is just to say "up" or "down".
    - After you vote, I will show you the "answer".
    - As we see more, you should get more and more accurate
    - Deciding whether or not to give instructions
    - Ask students to practice Zoom Annotations
- Show processing sketch (space for next, up for 1, down for 0)
- Show about 10-15 of these shapes to give people an intuition

<img src="imgs/shape.png" align="center" width=60%>

### Debrief (3 mins)

- What made you decide "up" or "down"?
- What "attributes" did you use to make a decision?
    - Attributes
    - Traits
    - Features
    - Variables
- Which of them were quantitative, and which were qualitative?
    - Colour: qualitative
    - Points: quantitative
    - Curved: binary
    - Transparency (after looking at enough of them, you will see that the colour is a red herring! It's actually the transparency that seems to matter)
    

- Can you imagine a computer doing the same task?
- What changes would you make to adapt this task for a machine?
    - Count number of points
    - Quantify the shape
    - Curved or not
    - Colour or Transparency

### Classification Process (2 mins)

This is the Classification process:

1. Start with a "Training data set" where the classification (answer) is known.
2. Build a model based on the "learning" that happens (with the training dataset).
3. Use the model on a different, previously unseen data set (the "Test data set") to classify its observations into categories.
4. Check the accuracy of the model.
    
<img src="" width=100%>

**Caption**: 
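The four steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the rows are made-up `(brightness, label)` pairs rather than the class data, and the "model" is a deliberately simple stand-in (a single brightness cut-off halfway between the two class means), not the method introduced later in this lesson.

```python
# Hypothetical labelled data: (brightness, label), where 1 = "up" and 0 = "down".
training_data = [(10, 1), (20, 1), (30, 1), (70, 0), (80, 0), (90, 0)]
test_data = [(15, 1), (35, 1), (45, 0), (85, 0)]


def mean(values):
    return sum(values) / len(values)


def build_model(rows):
    """Step 2: 'learn' a cut-off halfway between the two class means."""
    up_mean = mean([x for x, label in rows if label == 1])
    down_mean = mean([x for x, label in rows if label == 0])
    return (up_mean + down_mean) / 2


def classify(cutoff, brightness):
    """Step 3: label a new observation (assumes 'up' rows are the darker ones)."""
    return 1 if brightness < cutoff else 0


def accuracy(cutoff, rows):
    """Step 4: fraction of rows whose predicted label matches the known answer."""
    correct = sum(classify(cutoff, x) == label for x, label in rows)
    return correct / len(rows)


cutoff = build_model(training_data)          # Steps 1 and 2
test_accuracy = accuracy(cutoff, test_data)  # Steps 3 and 4
print(f"cut-off = {cutoff}, test accuracy = {test_accuracy}")
```

The important design point is that `test_data` is never used to choose the cut-off; it is only used to check how well the learned rule generalizes.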

In [47]:
from pandas import *
from altair import *

## Load and Visualize Data (1 min)

In [48]:
trial = read_csv('data/trial_data.csv')
trial.head()

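| Unnamed: 0 | brightness | points | curve | up_down |
| ---------- | ---------- | ------ | ----- | ------- |
| 0 | 19 | 0 | 0 | 1 |
| 1 | 20 | 0 | 0 | 1 |
| 2 | 21 | 0 | 0 | 1 |
| 3 | 22 | 0 | 0 | 1 |
| 4 | 22 | 3 | 0 | 1 |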


In [49]:
chart_trial = (
    Chart(trial)
    .mark_point(size=50)
    .encode(
        X("brightness", title="Brightness"),
        Y("points", title="Number of points"),
        Color("up_down:N", title=""),
    )
    .properties(title="Classifying a shape as Up (1) or Down (0)")
    .configure_title(anchor="start")
)

In [50]:
chart_trial

## Building some intuition II (5 mins)

- Pick a random point on the plot, ask students if it's "up" or "down"
- Do this several times, start with easy ones, and then gradually get closer and closer to the middle

In [51]:
chart_trial

- We clearly have some intuition about which point belongs to which group. 
- But let's try to quantify this (again, so it's easier to understand what Machine Learning is)

In [52]:
chart_trial

- What if we looked at the closest "neighbour" of the test point, and just classify based on the neighbouring point?

- How would we quantify this?
    - The Euclidean distance between point A, $P_A(x_1, y_1)$, and point B, $P_B(x_2, y_2)$, is:
       - $d = \sqrt{ (x_2 - x_1)^2 + (y_2 - y_1)^2 }$
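As a minimal sketch of this idea, the distance formula and the "copy the label of the closest neighbour" rule fit in a few lines of plain Python. The points and labels below are invented for illustration, and `nearest_neighbour_label` is a hypothetical helper name.

```python
import math


def euclidean_distance(p_a, p_b):
    """Distance between two 2D points given as (x, y) tuples."""
    (x1, y1), (x2, y2) = p_a, p_b
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)


def nearest_neighbour_label(test_point, labelled_points):
    """Return the label of the labelled point closest to test_point."""
    _, closest_label = min(
        labelled_points,
        key=lambda item: euclidean_distance(test_point, item[0]),
    )
    return closest_label


# Made-up (brightness, points) pairs, each with an "up"/"down" label.
labelled_points = [
    ((20, 0), "up"),
    ((25, 3), "up"),
    ((80, 5), "down"),
    ((90, 8), "down"),
]

print(euclidean_distance((0, 0), (3, 4)))                 # 5.0
print(nearest_neighbour_label((30, 2), labelled_points))  # up
```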

## $k$ Nearest Neighbours, kNN method (5 mins)

<img src ="imgs/knn.png" width=60%>

**Caption**: Pictorial description of the kNN algorithm, with three different classes (A, B, C) and a test point ($P_t$). The $k$ nearest neighbours of $P_t$ dictate which class the point belongs to. Source: {cite}`Atallah:2019`

- What is one problem with this method?
    - Problem: Points that look equidistant on the plot will not have the same computed distances unless the variables are on the same scale
    - Solution: Standardize each variable by subtracting its mean (centering) and then dividing by its standard deviation

- What is a second problem with this method?
    - Problem: Prone to errors due to outliers and random "chance"
    - Solution: Consider more than one neighbour.
        - How many neighbours?
            - Choose an odd number so there are no ties!
            - Start with a low number ($k_1 = 1$), run your classification on the training dataset, and record the accuracy; repeat for the next $k$, $k_2 = k_1 + 2$, etc.
            - Plot the accuracy vs. $k$, and pick the $k$ with the highest accuracy; if several values tie, take the lowest one! (Yes, really!)
            - Rough guideline: optimal $k \approx \sqrt{N}$, where $N$ is the number of data points
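Both fixes can be sketched together in plain Python: standardize each column, then let the $k$ closest neighbours vote. This is an illustrative sketch rather than the code used in class; the data are made up, and `fit_scaler`, `standardize`, and `knn_predict` are hypothetical helper names.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def fit_scaler(rows):
    """Return the per-column means and standard deviations of rows."""
    columns = list(zip(*rows))
    return [mean(col) for col in columns], [pstdev(col) for col in columns]


def standardize(row, means, stdevs):
    """Subtract the mean and divide by the standard deviation, column by column."""
    return tuple((value - m) / s for value, m, s in zip(row, means, stdevs))


def knn_predict(train_rows, train_labels, test_row, k):
    """Majority vote among the k training rows closest to test_row."""
    by_distance = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: math.dist(pair[0], test_row),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up (brightness, points) rows and their labels.
train_rows = [(20, 0), (25, 3), (30, 1), (80, 5), (90, 8), (85, 6)]
train_labels = ["up", "up", "up", "down", "down", "down"]
new_shape = (50, 0)

# The scaling is learned from the training rows and then reused for the new point.
means, stdevs = fit_scaler(train_rows)
scaled_train = [standardize(row, means, stdevs) for row in train_rows]
scaled_new = standardize(new_shape, means, stdevs)

for k in (1, 3, 5):
    print(k, knn_predict(scaled_train, train_labels, scaled_new, k))
```

With two classes, an odd $k$ guarantees that the vote in `knn_predict` cannot end in a tie.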

### Extending kNN

- So far we've only talked about two different classes (up or down) with just two input variables (Number of points and Brightness).

- It's a bit harder to imagine in multiple dimensions, but here is what the picture looks like in 3 dimensions:

<img src="https://inferentialthinking.com/_images/Implementing_the_Classifier_12_0.png" width=60%>

**Caption**: Visualization of points in 3-dimensional space with three input variables and two classes. Source: [Computational and Inferential Thinking: The Foundations of Data Science](https://inferentialthinking.com/chapters/intro.html) by Ani Adhikari, John DeNero, David Wagner distributed under a CC BY-NC-ND 4.0 license.

- To calculate the Euclidean distance between two points in three dimensions, we just extend the 2D formula.

- For two points in 3D space, $P_C$($x_1$,$y_1$,$z_1$) and $P_D$($x_2$,$y_2$,$z_2$):

    - $d = \sqrt{ (x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$


- And so on... for points in N-dimensions
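Written out, the general pattern for two points $P = (p_1, \dots, p_N)$ and $Q = (q_1, \dots, q_N)$ in $N$ dimensions is given below. The symbols $p_i$ and $q_i$ are introduced here only for illustration.

$$
d = \sqrt{\sum_{i=1}^{N} (q_i - p_i)^2}
$$

Setting $N = 2$ or $N = 3$ recovers the 2D and 3D formulas given earlier.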

### Demo of kNN (2 mins)

Steps of kNN:

- Load Training Data
- Visualize Data
- Create classifier
- Train Algorithm
- Check accuracy of classifier

In [53]:
# code to do all the above

## Checking the accuracy of the predictions (5 mins)

- How do we check how well we did?
- Let's apply the classifier to our training data and see how accurate the predictions are

In [54]:
# code to apply classifier to training data

### Separating the training and testing data (2 mins)
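One way to see why the split matters is to compare accuracy on the training rows with accuracy on rows the classifier has never seen. The sketch below is an illustration under stated assumptions, not the class code: the data are randomly generated, the split is a simple 80/20 shuffle with a fixed seed, and the classifier is a 1-nearest-neighbour rule written in plain Python.

```python
import math
import random

random.seed(0)  # fixed seed so the toy data and the shuffle are repeatable


def make_row(label):
    """Generate one made-up ((brightness, points), label) row."""
    centre = 40 if label == 1 else 60
    return (random.gauss(centre, 15), random.gauss(5, 2)), label


data = [make_row(label) for label in [0, 1] * 50]  # 100 labelled rows
random.shuffle(data)

split = int(0.8 * len(data))  # 80% for training, 20% held out for testing
train, test = data[:split], data[split:]


def predict_1nn(train_rows, point):
    """Copy the label of the single closest training row."""
    _, label = min(train_rows, key=lambda row: math.dist(row[0], point))
    return label


def accuracy(train_rows, rows):
    """Fraction of rows whose predicted label matches the known label."""
    correct = sum(predict_1nn(train_rows, point) == label for point, label in rows)
    return correct / len(rows)


train_accuracy = accuracy(train, train)  # each row is its own nearest neighbour
test_accuracy = accuracy(train, test)    # estimate on rows not used for training
print(f"training accuracy: {train_accuracy:.2f}")
print(f"test accuracy:     {test_accuracy:.2f}")
```

With $k = 1$, every training row finds itself at distance zero, so the training accuracy is perfect by construction and says little about how the classifier handles new data.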

## Context Matters (10 mins)

- Let's start with the accuracy numbers:
    - Our algorithm gets it right ~ 7 times out of 10.
    - That's not bad - it's okay if we mis-classify 3 shapes out of every 10.

- What if I told you that the data you've been working with has nothing to do with shapes, but people?

- Here's what the data actually means; all I did was change the names of the columns:

| Current Column Name | Actual Column Name |
| ------------------- | ------------------ |
| Number of Points | |
| Brightness | |
| Curve | |
| Up_Down | |

- Now re-evaluate the algorithm performance with this contextual information.

    - When we changed the context, we started to look beyond numbers and started seeing human beings instead.
    
    - Consider your choices carefully - they may seem innocuous initially, but hiding behind black-box algorithms and deploying Weapons of Math Destruction without appropriately scrutinizing the underlying data is DANGEROUS!
    
    - "Data" is not a panacea or a magical cure-all; it's not biased in itself, but it **does** reflect the biases of our society.
    
    - We will talk more about this as we go through the course.
    
- For now, let's dig into this example a bit more...

## Beyond "accuracy" (10 mins)

- Right and Wrong seem insufficient... it feels like we need more metrics, right?

### Activity: Brainstorm (2 mins)

- What are some analyses we can do to understand the *impact* of our classifier?
    - Hint: keep it simple!

### Common metrics (7 mins)

- Accuracy
- Sensitivity
- Specificity 
- Precision
- F-score
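All five metrics can be computed from the four counts in a confusion matrix. The sketch below assumes a binary classifier with 1 as the positive class, uses made-up label lists, and takes the F-score to mean the F1 score (the harmonic mean of precision and sensitivity). `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(actual, predicted):
    """Compute common binary-classification metrics (positive class = 1).

    This sketch does not guard against zero denominators.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)  # true positives
    tn = sum(a == 0 and p == 0 for a, p in pairs)  # true negatives
    fp = sum(a == 0 and p == 1 for a, p in pairs)  # false positives
    fn = sum(a == 1 and p == 0 for a, p in pairs)  # false negatives

    sensitivity = tp / (tp + fn)  # share of actual positives that were found
    precision = tp / (tp + fp)    # share of predicted positives that were right
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),  # share of actual negatives that were found
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Made-up labels for ten cases.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

for name, value in classification_metrics(actual, predicted).items():
    print(f"{name}: {value:.2f}")
```

In this toy example the accuracy is 0.80, yet one in four actual positives is missed (sensitivity 0.75), which is the kind of detail a single accuracy number hides.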

### Other Metrics (1 min)

- We'll talk about other metrics as we go through the course and they become more relevant (warning: there are lots!!)

## Summary (5 mins)

- Intuition on classification with Machine Learning
- Summary of kNN method
- Checking the accuracy of our classifier
- Importance of Context
- Other metrics beyond accuracy

## Summary of Learning Intentions

- Develop intuition about classification using machine learning.

- Identify the general steps of classification using machine learning.

- Summarize the kNN algorithm and examine its advantages and limitations.

- Critically evaluate the machine learning process and consider the importance of making human-centered choices.

## Activity: You Try (homework)

- Task 1: Use the same training data and increase the number of "neighbours", or the $k$ value.

- Task 2: Compute performance metrics of the new prediction.

- Task 3: Try adding the "curve" variable to the input data and re-train the model. Do you think it performs better?

- Task 4: Write a loop to do this classification for all $k$ values from 1 to $N$ where $N$ is the total number of data points.
    - Plot the accuracy vs. $k$; for which $k$ value is the accuracy highest?


We will briefly review these tasks at the start of next class!