# Dip into Data Tutorial

## Introduction

Welcome to the course Dip into Data! In this tutorial, we seek to give you a short introduction to data science (specifically machine learning).

But first, a few logistics.

This is a Jupyter Notebook. It's just like a regular file that you can read from and write to, but it also allows you to run code.

What you see below is a "cell." While you have the cell selected, click on the "play" symbol to run the cell. 

In [None]:
print("Hello Jupyter Notebook!")

Running the cell displays the output of the code snippet that you ran. We will be using the Python programming language throughout this tutorial. The code you saw above prints a string (a sequence of characters) as the output.

You can rerun the same cell multiple times. The number on the left-hand side shows the order in which you ran the cells. It is also important to note that the variables you create are saved throughout your session. For example, let's create the variable `a`:

In [None]:
a = 5
print(a)

Now, I can create a completely different cell and still access the variable `a`.

In [None]:
print(a)

If you want to start again from scratch, choose Kernel --> Restart & Clear Output. 

While there are ways to run the entire notebook at once, we recommend you read each line and go step by step. You can follow along with the class.

## What is machine learning? 

*Machine learning* "is the study of computer algorithms that allow computer programs to automatically improve through experience" [\[1\]](#references). 

Teacher: We see machine learning and artificial intelligence applied in many areas of our day-to-day lives. For example, Amazon gives us product recommendations based on our previous purchases, Google Maps finds the best route between two places, and virtual assistants like Siri understand our sentences and answer our questions!

In all these cases, algorithms use vast amounts of data to accomplish these amazing things!

<div>
<img src="attachment:image-2.png" width="600"/>
</div>

<div>
<img src="attachment:image.png" width="600"/>
</div>

<div>
<img src="attachment:image.png" width="600"/>
</div>

### Other examples

There are some other examples! Have you seen how your camera recognizes faces? Or how social media can automatically tag your friends in photos? How does this happen? How does the machine learn to recognize your friend's face? All of these examples use machine learning and artificial intelligence in some form.
* Automatic tagging of friends on social media
* Recognizing faces

### Activity 1 - Discussion

Can you think of other examples from your daily life that use artificial intelligence or machine learning?

While all of these examples broadly show how machines perform "intelligent" tasks, machine learning is one part of artificial intelligence, in which the machine learns how to perform human-like tasks!


There are two types of learning (among others) that we will mainly focus on here: supervised and unsupervised.
Let's start with supervised learning, where the model learns from labeled examples.

## How does machine learning work? 

Here's a simple example of how we might teach a machine learning model to recognize apples:
1. Show many pictures of apples. Include different colors, shapes, and sizes so that the model can recognize many different kinds of apples. (Training Set)
2. Show the model a different set of pictures and see how it performs. (Test Set) 

The "many pictures of apples" in step 1 is called the *training set*. Models learn from the training set.

<div>
<img src="../assets/classification-apple-orange.png" width="500"/>
</div>

The "different set of pictures" in step 2 is called the *test set*. Models do not learn from the test set; rather, they apply what they have learned from the training set. This way, we can evaluate how well our model learned. 

In this case, our model would classify the image to the left as an apple and the image to the right as not an apple. 

<div>
<img src="../assets/test-apple-orange.png" width="500"/>
</div>
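
As a rough sketch of the training/test idea above, the cell below splits a small set of examples into a training set and a test set. The labels, the 80/20 split, and the variable names are assumptions made for illustration; they are not part of the original apple example.

```python
import numpy as np

# Hypothetical labels for 10 pictures (illustration only).
labels = np.array(["apple"] * 6 + ["not apple"] * 4)

# Shuffle the picture indices so the split is random but repeatable.
rng = np.random.default_rng(seed=0)
indices = rng.permutation(len(labels))

# Use 80% of the pictures for training and the rest for testing.
n_train = int(0.8 * len(labels))
train_idx, test_idx = indices[:n_train], indices[n_train:]

print("Training pictures:", len(train_idx))
print("Test pictures:", len(test_idx))
```

The model would only ever learn from the pictures in `train_idx`; the pictures in `test_idx` are held back to check how well it learned.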

## Data 

How can we do this? By utilizing data that we have all around us! 

Previously, we discussed where and how machine learning was used all around us. Amazon recommends new products to us based on our previous purchases, and Facebook recognizes faces because it has shown a model many different pictures of faces. Google Maps knows the shortest path between two points because it has data about traffic, weather, and geography. 

Data come in all shapes and sizes. When we classified images in the apple example above, the data came in the form of images (sets of colored pixels). For Amazon recommendations, the data may include a list of your purchases, a table of users with purchasing patterns similar to yours, and a list of their purchases. For our class today and our subsequent activities, we will use tables of data.
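
To make these two forms of data concrete, here is a small sketch. Both the tiny image and the purchase table are made up for illustration, and the column names `user` and `item` are assumptions.

```python
import numpy as np
import pandas

# A hypothetical 2x2 color image: each pixel holds (red, green, blue) values from 0 to 255.
image = np.array([
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
])
print(image.shape)  # (rows, columns, color channels)

# A hypothetical purchase table: one row per purchase.
purchases = pandas.DataFrame({
    "user": ["ana", "ben", "ana"],
    "item": ["apples", "oranges", "pears"],
})
print(purchases)
```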

## Types of Machine Learning

### 1. Classification

*Classification* is the process of categorizing data into predefined classes. It is a form of supervised learning because we are given a labeled dataset.

For example, in the figure below, we must classify a picture as a cat or a dog. In this case, our classes are "cat" and "dog," and we classify our picture as a cat.  

<img src="../assets/cat-dog-classification.png" width="600">

*Images courtesy of photos-public-domain.com.*

Let's look at an example with data. We have a handful of points, and we want to classify them into two groups: group Red and group Blue.

The code below reads our data points from a file and then creates a scatterplot.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt # import necessary packages
import pandas
import numpy as np

data = pandas.read_csv("../assets/classification-synthetic.csv") # Read the data, and put it into a variable. 
print(data.head()) # show the first five data points
                   # in the data, "r" stands for red, and "b" stands for blue

In [None]:
plt.scatter(x=data["x"], y=data["y"], c=data["class"]) # Create a scatter plot. 
                                                       # "c" stands for color. We color our points by class.

The red points are in group Red, and the blue points are in group Blue. The two classes have different characteristics; the Red class is in the top left corner of the plot, and the Blue class is in the bottom right. 

**Question:** If you are given a new point like the one shown below, would you classify it as belonging to Red or Blue? 

In [None]:
new_point = pandas.DataFrame({"x":[3], "y":[7], "class":["k"]}) # Create a new point at (3,7) and color it black.
new_data = pandas.concat([data, new_point], ignore_index=True) # Add the new point to the dataset (DataFrame.append was removed in pandas 2.0).
plt.scatter(x=new_data["x"], y=new_data["y"], c=new_data["class"]) 

**Question:** What about the new point in the graph below? Is it Red or Blue? Why did you put it in the class that you did?

In [None]:
new_point = pandas.DataFrame({"x":[6], "y":[5], "class":["k"]}) # Create a new point at (6,5) and color it black.
new_data = pandas.concat([data, new_point], ignore_index=True) # Add the new point to the dataset (DataFrame.append was removed in pandas 2.0).
plt.scatter(x=new_data["x"], y=new_data["y"], c=new_data["class"]) 

One way to decide whether or not a point is in a given class is to draw a line that is equidistant from the two classes. Let's draw a line separating our two classes.

In [None]:
plt.scatter(x=data["x"], y=data["y"], c=data["class"]) 
x = np.linspace(0, 10, 1000)
plt.plot(x, 1.5*x-3, color='black')
plt.xlim(0,11)
plt.ylim(0,11)
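
The line drawn above, y = 1.5x - 3, can also be used as a simple decision rule. The sketch below assumes that points above the line belong to Red (the top-left group) and points below it belong to Blue; the function name `classify` is made up for illustration.

```python
def classify(x, y):
    """Return "Red" if (x, y) lies above the line y = 1.5*x - 3, else "Blue"."""
    line_y = 1.5 * x - 3  # height of the separating line at this x
    return "Red" if y > line_y else "Blue"

# The two new points from the questions above.
print(classify(3, 7))  # Red
print(classify(6, 5))  # Blue
```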

*Support vector machines* create lines between classes. The lines can be straight or curved, and if you have many classes, you can draw many lines separating your classes. 

<img src="../assets/iris-svm.png" width="300">

*Image courtesy of scikit-learn.*

If your data is three-dimensional, the separating line becomes a plane.

<img src="../assets/3d-svm.png" width="300">

*Image courtesy of [KDnuggets](https://www.kdnuggets.com/2019/09/friendly-introduction-support-vector-machines.html).*

<details open>
<summary>Note: Why the name support vector machine? </summary>
<br>
The points nearest to the line are called support vectors because they are "holding up" (i.e. supporting) the separating line between the classes. (Think of them as push pins holding up the separating line.)
<img src="../assets/support-vectors.png" width="400">
</details>

**Activity:** Think about the kinds of software you use and websites you visit. What sorts of applications require classification? Discuss. 

### 2. Clustering

*Clustering* categorizes data into groups based on how similar the data points are to one another. Unlike classification, the groups are not predefined.



In [None]:
data = pandas.read_csv("../assets/clustering-synthetic.csv") 
print(data.head())

In [None]:
plt.scatter(x=data["x"], y=data["y"])
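
This tutorial does not name a specific clustering algorithm. As one illustration, the sketch below implements k-means, a common clustering method that repeatedly assigns each point to its nearest group center and then moves each center to the average of its points. The six points are made up (they are not the points in the file above), and the function name `kmeans` and its parameters are assumptions.

```python
import numpy as np

def kmeans(points, k, n_iters=10):
    """Group points into k clusters with a very basic k-means loop."""
    # Start with k centers picked at evenly spaced positions in the data.
    centers = points[np.linspace(0, len(points) - 1, k).astype(int)].copy()
    for _ in range(n_iters):
        # Distance from every point to every center.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        # Assign each point to its nearest center.
        labels = distances.argmin(axis=1)
        # Move each center to the mean of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

# Hypothetical points forming two well-separated groups.
points = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)
labels, centers = kmeans(points, k=2)
print(labels)
```

Notice that we never told the algorithm which group each point belongs to; it found the groups from the positions of the points alone.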

**Question:** What is the difference between clustering and classification? 

**Question:** What is the difference between supervised and unsupervised learning? 

### 3. Regression

In both classification and clustering, you have seen that data is used to make discrete predictions about new pieces of data. For example, does a new data point belong to Class A or Class B? To which cluster does a new data point belong? 

However, in many cases in the real world, data labels are not discrete classes or clusters. *Regression* is a technique that predicts real numbers from data. The most common form of regression is *linear regression*, which involves drawing a straight line through your data. You may have come across linear regression in your math and science classes in school. 

One simple linear equation that you may have come across is
$$y = mx + b$$
where $m$ is the slope of the line and $b$ is the intercept (the value of $y$ when $x = 0$).

Let us take an example of going to the market and buying apples. We want to define the prices of apples with a linear equation:

In [None]:
# Importing python libraries that make drawing figures possible
%matplotlib inline
import matplotlib.pyplot as plt
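# This style was renamed to 'seaborn-v0_8-whitegrid' in Matplotlib 3.6; use whichever name is available
style = 'seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid'
plt.style.use(style)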
import numpy as np

In [None]:
# Define figure and axes
fig = plt.figure()
ax = plt.axes()

In [None]:
# Plot figure for y = mx + b
x = np.linspace(0, 10, 1000)
plt.plot(x, 2*x+1, color='black')        # specify color by name
plt.title('Y = mx + b')
plt.xlabel('Apple quantity (lb)')
plt.ylabel('Price of apple ($)')

#### What does this figure mean? 

This figure shows us a way to calculate the price of any quantity of apples that we wish to purchase. The intercept shows that there is a minimum amount of money that you have to pay (\$1), and each pound of apples adds \$2 to the price, so the total price is 2*weight + 1.
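
The same calculation can be written as a small function. The name `apple_price` is made up for illustration; the slope (2) and intercept (1) come from the line plotted above.

```python
def apple_price(pounds, m=2.0, b=1.0):
    """Price in dollars for a given weight of apples, using y = m*x + b."""
    return m * pounds + b

print(apple_price(0))  # 1.0 (the intercept)
print(apple_price(3))  # 7.0 (2 * 3 + 1)
```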

#### What does this have to do with linear regression? 

One simple way to describe linear regression: given a set of data points, we try to figure out the slope and the intercept of the line. That is, m and b are not known in advance, and we try to find the best-fitting line.

In [None]:
import pandas 
data = pandas.read_csv("../assets/regression-synthetic.csv") 
print(data.head())

This is the data that we have. We visualize it as a scatter plot below. Note that the data is noisy and does not fall exactly on the line that we are trying to find.

In [None]:
plt.scatter(data['apple'], data['prices'], color='green')
plt.title("Apples and Prices")
plt.xlabel("Apples (lb)")
plt.ylabel("Price ($)")

Below, we can see the line that we learned from the data that gives us the best fit.

In [None]:
# Plot the linear regression line in the same plot 
x = np.linspace(0, 12, 1000)
plt.scatter(data['apple'], data['prices'], color='green')
plt.plot(x, 2*x+1, color='black')
plt.title("Apples and Prices")
plt.xlabel("Apples (lb)")
plt.ylabel("Price ($)")
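
The cell above draws the line 2x + 1 directly. As a minimal sketch of how m and b could instead be estimated from data, the code below uses NumPy's `polyfit` to fit a straight line (a polynomial of degree 1). The data here is generated for illustration, assuming a true line of 2x + 1 plus random noise; it is not the data from the file above.

```python
import numpy as np

# Hypothetical noisy data scattered around the line y = 2x + 1.
rng = np.random.default_rng(seed=0)
pounds = np.linspace(0, 10, 50)
prices = 2 * pounds + 1 + rng.normal(0, 0.5, size=pounds.size)

# Fit a degree-1 polynomial; polyfit returns the slope first, then the intercept.
m, b = np.polyfit(pounds, prices, deg=1)
print(f"slope m = {m:.2f}, intercept b = {b:.2f}")
```

The estimated values should come out close to, but not exactly, 2 and 1, because the noise hides the true line a little.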

## Dataset 

During this course, we will use the World Happiness Report dataset, which includes data from UN surveys on global well-being. The dataset includes information from 2015 to 2019 on 155 countries, including the GDP per capita, quality of family life, life expectancy, amount of personal freedom, amount of interpersonal generosity, and amount of trust in the government. The dataset also includes a happiness score and ranking from the Gallup World Poll [\[ref\]](#references).

**For next time**, read the full description of the dataset [here](https://www.kaggle.com/unsdsn/world-happiness). After future lessons, you will complete activities based on the dataset. 

## Further Reading

### Machine Learning
1. [Stanford's Machine Learning course](https://www.youtube.com/playlist?list=PLoROMvodv4rMiGQp3WXShtMGgzqpfVfbU) - introductory machine learning course, taught by Dr. Andrew Ng

2. [Machine Learning Mastery](https://machinelearningmastery.com/start-here/) - blog for learning machine learning using code examples

### Python

1. [Real Python](https://realpython.com/) - blog for learning Python 

2. [Invent with Python](https://inventwithpython.com/) - free e-books with simple Python projects. Scroll down to "Programming Books by Al Sweigart" for the free online versions.


## References

\[1\] Tom Mitchell, *Machine Learning*, McGraw Hill, 1997.

Sustainable Development Solutions Network, "World Happiness Report, Version 2", *Kaggle*, 2019. \[Dataset\]. Available: https://www.kaggle.com/unsdsn/world-happiness.