## Learning Objectives 

By the end of this class, you should be able to...

- Recall what we have learned in DS 1.1 as a Data Analyst
- Understand what we are going to learn in DS 2.1 which covers:
    - Machine Learning Components
    - Data Preprocessing in Machine Learning
        - Data Scaling (Normalization)
        - Convert categorical feature to numerical

## What we learned in DS 1.1 as a Data Analyst

### Question: What do YOU remember from DS 1.1??

Don't look ahead! Share with the class your favorite topic, your most challenging topic, or both!

### Core Concepts:

* The principles of effective storytelling

* How to make proposals and decisions based on analysis

* Exploratory Data Analysis (EDA), focusing on the basics of statistical inference, hypothesis testing, correlation

* Check this out as a good resource for Data Analyst VS. Data Engineer VS. Data Scientist roles:

https://www.springboard.com/blog/data-engineer-vs-data-scientist/

## What we are going to learn in DS 2.1

### Question: What is machine learning?

Share with the class what you think machine learning is, and how we could use it!

### Our definition:

* Using data and algorithms to predict or classify unseen, upcoming data with acceptable accuracy

### Two major types of machine learning:

1. Supervised Learning

2. Unsupervised Learning

**Supervised learning** is used for:

1. Regression --> Temperature prediction, stock market prediction, next purchased item prediction

2. Classification --> Is an email spam or not-spam? Is the image of a dog or cat? Is a comment about a post positive, negative or neutral?

**Unsupervised learning** is used for:

1. Clustering data into groups

2. Reducing the dimension of features
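To make the distinction concrete, here is a minimal sketch using scikit-learn. The toy numbers are made up for illustration, and `LinearRegression` and `KMeans` are just two convenient representatives of each family, not the only options.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Supervised: every row of features X comes with a known target y.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])  # toy relationship: y = 2 * x
reg = LinearRegression().fit(X, y)
prediction = reg.predict(np.array([[5.0]]))  # predict for an unseen input

# Unsupervised: only features, no targets; the algorithm groups similar rows.
points = np.array([[0.0, 0.0], [0.1, 0.2], [10.0, 10.0], [10.2, 9.9]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = kmeans.labels_  # one cluster id per row

print(prediction)
print(labels)
```

The supervised model needs `y` to learn from, while the clustering model only ever sees `points`.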

## Machine Learning Components:

<img src="Images/machine_learning.png" width="600" height="600">

## What Level of Math Do We Need for Machine Learning?

DS 2.1 does require some higher-level math concepts. You should spend some time refreshing or learning the following topics:

**You can review these concepts in [QL 1.1](https://github.com/Make-School-Courses/QL-1.1-Quantitative-Reasoning)!**

1. Calculus - Derivatives, Linear Regression
    1. Class 8 and 9 from QL 1.1
1. Linear Algebra - Intro to Vectors and Matrix Multiplication
    1. Class 11 from QL 1.1
1. Probability and statistics - Mean, Median, Mode, Standard deviation, Variance and Percentiles, Correlation and Covariance, PDFs, CDFs
    1. Classes 2-7 from QL 1.1

<img src="Images/intro_2.png" width="400" height="400">

## Data Preprocessing for Machine Learning: Data Scaling 

**Data Scaling or Normalization**: Assume we have a dataset with 2 columns. The values of one column are much larger than the values of the second column. The goal of data scaling is to scale the columns to be in the same range.

To help us learn more about preprocessing and scaling, we're going to take some time to research these topics through the following guides, articles, and documentation. We will then come back to apply these learnings through some coding challenges!

**Please read the following:**

- Preparing Data section: https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_preparing_data.htm

- Section 6.3.1: https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling

- http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html

- Also, look at the first page here: https://www.analyticsvidhya.com/infographics/Scikit-Learn-Infographic.pdf
   

## Activity for Data Scaling: Max_Min Scaler

**You will do this activity in groups of 4**

Given a two-dimensional array (matrix) X, write a function `max_min_s` that, for each column:

1. Obtains the minimum value of the column (min)
1. Obtains the range of the column (range)
1. For each value in the column, calculate the following: (value-min)/range

**Remember**: the range is the difference between the maximum element and minimum element of each column
    
Use this dataset as an example input to your function: `X_train = np.array([[1000.0, 2.0], [1500.0, 3.0]])`
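As a quick check on the formula, the three steps can be written as one equation and worked by hand for the first column of `X_train` (the symbol $x'$ for the scaled value is our own notation):

$$
x' = \frac{x - \min(x)}{\max(x) - \min(x)}, \qquad
\frac{1000 - 1000}{1500 - 1000} = 0, \qquad
\frac{1500 - 1000}{1500 - 1000} = 1
$$

The smallest value in a column always maps to 0 and the largest to 1.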

We will use breakout rooms in Zoom for group work!

In [1]:
import numpy as np

def max_min_s(X):
    # Subtract each column's minimum, then divide by that column's range
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

X_train = np.array([[1000.0, 2.0], [1500.0, 3.0]])
print(max_min_s(X_train))

[[0. 0.]
 [1. 1.]]


### Use the Sklearn preprocessing package to do the same thing:

In [3]:
# Let's use the Sklearn preprocessing package to do the same thing as our above function:

from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()
X_minmax = min_max_scaler.fit_transform(X_train)
print(X_minmax)

[[0. 0.]
 [1. 1.]]
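One reason to prefer the scaler object over a hand-written function is that it remembers the statistics it learned. The sketch below fits on `X_train` and then reuses the same minimum and range on new rows; `X_new` is a hypothetical array made up for this example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1000.0, 2.0], [1500.0, 3.0]])
# Hypothetical unseen rows, invented for illustration.
X_new = np.array([[1250.0, 2.5], [2000.0, 1.0]])

scaler = MinMaxScaler()
scaler.fit(X_train)  # learns the per-column min and max from the training data only

# transform() reuses the training min and max; it does not recompute them.
X_new_scaled = scaler.transform(X_new)

print(scaler.data_min_, scaler.data_max_)
print(X_new_scaled)
```

Values outside the training range land outside [0, 1], which is expected: the scaler only knows the minimum and maximum it saw during `fit`.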


## Activity for Data Scaling: Standard Scaler

**You will do this activity in groups of 4**

Given a two-dimensional array (matrix) X, write a function `standard_s` that, for each column:

1. Obtains the mean of the column (mean)
1. Obtains the standard deviation of the column (std)
1. For each value in the column, calculates the following: (value - mean)/std

We will use breakout rooms in Zoom for group work!

In [4]:
def standard_s(X):
    return (X - X.mean(axis=0)) / X.std(axis=0)

print(standard_s(X_train))

[[-1. -1.]
 [ 1.  1.]]


### Let's use the Sklearn preprocessing package to do the same thing:

In [5]:
standard_scaler = preprocessing.StandardScaler()
X_ss = standard_scaler.fit_transform(X_train)
print(X_ss)

[[-1. -1.]
 [ 1.  1.]]
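The two results agree because NumPy's `std` and `StandardScaler` both use the population standard deviation (dividing by n). The sketch below compares that with the sample standard deviation (`ddof=1`, dividing by n - 1), which would give different scaled values; the variable names are our own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1000.0, 2.0], [1500.0, 3.0]])

pop_std = X_train.std(axis=0)             # default ddof=0: divide by n
sample_std = X_train.std(axis=0, ddof=1)  # divide by n - 1

# Manual standardization with the population standard deviation
manual = (X_train - X_train.mean(axis=0)) / pop_std
sklearn_result = StandardScaler().fit_transform(X_train)

print(pop_std)
print(sample_std)
print(np.allclose(manual, sklearn_result))
```

If you compute the standard deviation with another tool, such as pandas (which defaults to `ddof=1`), expect small differences from the scikit-learn output.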


## Data Preprocessing for Machine Learning: Label Encoding and One-Hot Encoding

- **Label Encoding** converts each category to an integer. For example, the country names (France, Germany, Spain) become the numbers (0, 1, 2).

- **One-Hot Encoding** converts each category (or the integer produced by a label encoder) into a vector consisting only of 0s and 1s, with a single 1 in the position that belongs to that category.

The image below shows an example of one-hot encoding. Each row in the second table represents the value in the same row of the Color table. That's why the Red column has 1s in the first two rows: the first two rows of the Color table are Red.

<img src="Images/One-Hot-Encoding-Diagram.png" width="400" height="400">
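Both encodings can be sketched by hand with NumPy before reaching for a library. The `colors` array below is a hypothetical example (it is not the exact table from the image), and this is only one of several ways to build the vectors.

```python
import numpy as np

# Hypothetical color column, invented for illustration.
colors = np.array(["Red", "Red", "Yellow", "Green", "Yellow"])

# Label encoding: map each category to an integer.
# np.unique sorts the categories, so Green=0, Red=1, Yellow=2.
categories, codes = np.unique(colors, return_inverse=True)

# One-hot encoding: pick row `code` of the identity matrix for each value,
# which yields a vector with a single 1 in that category's position.
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories)
print(codes)
print(one_hot)
```

Each row of `one_hot` has exactly one 1, and each column corresponds to one entry of `categories`.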

## Activity for Label-Encoder and One-Hot-Encoder:

**You will do this activity in groups of 4**

We want to be able to make future predictions on whether a person will exit (churn) based on a variety of features. However, not all of our data is numerical: Geography and Gender are categorical, so we can't make predictions using machine learning until we encode them as numbers.

Your goal is to convert the Geography and Gender columns of the [Churn_Modelling](Datasets/Churn_Modelling.csv) dataset to numbers, then to a one-hot representation.

We'll use breakout rooms again for this! Later we'll go over how to do this as a class

In [6]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Read in the CSV file
df = pd.read_csv('Churn_Modelling.csv')

# Print out the first 5 rows of each column in a readable format
print(df.head())

   RowNumber  CustomerId   Surname  CreditScore Geography  Gender  Age  \
0          1    15634602  Hargrave          619    France  Female   42   
1          2    15647311      Hill          608     Spain  Female   41   
2          3    15619304      Onio          502    France  Female   42   
3          4    15701354      Boni          699    France  Female   39   
4          5    15737888  Mitchell          850     Spain  Female   43   

   Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  \
0       2       0.00              1          1               1   
1       1   83807.86              1          0               1   
2       8  159660.80              3          1               0   
3       1       0.00              2          0               0   
4       2  125510.82              1          1               1   

   EstimatedSalary  Exited  
0        101348.88       1  
1        112542.58       0  
2        113931.57       1  
3         93826.63       0  
4         790

## Two columns have categorical values: Geography and Gender

In [7]:
# Print the unique values from the Geography column
print(df['Geography'].unique())

['France' 'Spain' 'Germany']


In [9]:
# Print the unique values from the Gender column
print(df['Gender'].unique())

['Female' 'Male']


## Define the feature matrix and target column

**Target column:** The column we are trying to predict values for. We will test against the values here to see if our predictions are accurate with the current data. Later, for new customers, we want to be able to predict the possibility that the customer will exit based on the 10 features we are looking at in our matrix.

In [65]:
# Feature matrix
# We don't care about first 3 columns (RowNumber, CustomerId, Surname),
# as those don't factor in to whether a person churns or not
X = df.iloc[:, 3:13].values

# Target column
# We want to predict the exit value,
# which therefore makes it our target column
y = df.iloc[:, 13].values
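Positional slicing with `iloc` depends on the column order never changing. As an alternative sketch, the same split can be done by column name; the `toy` frame below is a made-up miniature with only a few of the real columns.

```python
import pandas as pd

# Hypothetical miniature of the churn data, with invented values.
toy = pd.DataFrame({
    "RowNumber": [1, 2],
    "CustomerId": [101, 102],
    "Surname": ["A", "B"],
    "CreditScore": [600, 700],
    "Geography": ["France", "Spain"],
    "Exited": [1, 0],
})

# Drop the identifier columns and the target to get the feature matrix.
X_toy = toy.drop(columns=["RowNumber", "CustomerId", "Surname", "Exited"])
# The target column is selected by name.
y_toy = toy["Exited"]

print(X_toy.columns.tolist())
print(y_toy.tolist())
```

Selecting by name keeps working if columns are added or reordered, at the cost of a little more typing.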

## Print the first 10 rows of the feature matrix

In [66]:
print(X[0:10,:])

[[619 'France' 'Female' 42 2 0.0 1 1 1 101348.88]
 [608 'Spain' 'Female' 41 1 83807.86 1 0 1 112542.58]
 [502 'France' 'Female' 42 8 159660.8 3 1 0 113931.57]
 [699 'France' 'Female' 39 1 0.0 2 0 0 93826.63]
 [850 'Spain' 'Female' 43 2 125510.82 1 1 1 79084.1]
 [645 'Spain' 'Male' 44 8 113755.78 2 1 0 149756.71]
 [822 'France' 'Male' 50 7 0.0 2 1 1 10062.8]
 [376 'Germany' 'Female' 29 4 115046.74 4 1 0 119346.88]
 [501 'France' 'Male' 44 4 142051.07 2 0 1 74940.5]
 [684 'France' 'Male' 27 2 134603.88 1 1 1 71725.73]]


## Apply Label Encoder to the second and third columns

In [67]:
from sklearn.preprocessing import LabelEncoder

label_encoder_X_1 = LabelEncoder()
X[:, 1] = label_encoder_X_1.fit_transform(X[:, 1])
label_encoder_X_2 = LabelEncoder()
X[:, 2] = label_encoder_X_2.fit_transform(X[:, 2])
print(X[0:10,:])
print(X.shape)

[[619 0 0 42 2 0.0 1 1 1 101348.88]
 [608 2 0 41 1 83807.86 1 0 1 112542.58]
 [502 0 0 42 8 159660.8 3 1 0 113931.57]
 [699 0 0 39 1 0.0 2 0 0 93826.63]
 [850 2 0 43 2 125510.82 1 1 1 79084.1]
 [645 2 1 44 8 113755.78 2 1 0 149756.71]
 [822 0 1 50 7 0.0 2 1 1 10062.8]
 [376 1 0 29 4 115046.74 4 1 0 119346.88]
 [501 0 1 44 4 142051.07 2 0 1 74940.5]
 [684 0 1 27 2 134603.88 1 1 1 71725.73]]
(10000, 10)
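To see how the integers were assigned, a fitted `LabelEncoder` exposes its categories through `classes_` and can map codes back with `inverse_transform`. This is a small sketch on a hypothetical list of countries rather than the full dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the Geography values.
countries = ["France", "Spain", "France", "Germany"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ is sorted, and each category's index in it is its code.
print(encoder.classes_)
print(codes)

# inverse_transform maps codes back to the original labels.
print(encoder.inverse_transform([2, 1]))
```

The codes follow alphabetical order, not the order in which the countries first appear in the data.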


#### LabelEncoder has replaced France with 0, Germany with 1, and Spain with 2.  What else do you notice?

In [69]:
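from sklearn.preprocessing import OneHotEncoder

# Note: the `categorical_features` argument was removed in scikit-learn 0.22,
# so this cell only runs on older versions. Newer versions select the columns
# to encode with sklearn.compose.ColumnTransformer instead.
one_hot_encoder = OneHotEncoder(categorical_features=[1, 2])
X = one_hot_encoder.fit_transform(X).toarray()
print(pd.DataFrame(X[0:10,:]))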

     0    1    2    3    4    5    6      7     8    9         10   11   12  \
0  1.0  0.0  1.0  0.0  1.0  1.0  0.0  619.0  42.0  2.0       0.00  1.0  1.0   
1  1.0  0.0  0.0  1.0  0.0  1.0  0.0  608.0  41.0  1.0   83807.86  1.0  0.0   
2  1.0  0.0  1.0  0.0  1.0  1.0  0.0  502.0  42.0  8.0  159660.80  3.0  1.0   
3  1.0  0.0  1.0  0.0  1.0  1.0  0.0  699.0  39.0  1.0       0.00  2.0  0.0   
4  1.0  0.0  0.0  1.0  0.0  1.0  0.0  850.0  43.0  2.0  125510.82  1.0  1.0   
5  1.0  0.0  0.0  1.0  0.0  0.0  1.0  645.0  44.0  8.0  113755.78  2.0  1.0   
6  1.0  0.0  1.0  0.0  1.0  0.0  1.0  822.0  50.0  7.0       0.00  2.0  1.0   
7  0.0  1.0  1.0  0.0  0.0  1.0  0.0  376.0  29.0  4.0  115046.74  4.0  1.0   
8  1.0  0.0  1.0  0.0  1.0  0.0  1.0  501.0  44.0  4.0  142051.07  2.0  0.0   
9  1.0  0.0  1.0  0.0  1.0  0.0  1.0  684.0  27.0  2.0  134603.88  1.0  1.0   

    13         14  
0  1.0  101348.88  
1  1.0  112542.58  
2  0.0  113931.57  
3  0.0   93826.63  
4  1.0   79084.10  
5  0.0  14

Note: current versions of scikit-learn let `OneHotEncoder` work on string categories directly, so a `LabelEncoder` step is no longer required before one-hot encoding.
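On recent scikit-learn versions, one way to one-hot encode only some columns is `ColumnTransformer`. The sketch below is an assumption about how the step could be written today, not the original notebook's code; `X_small` is a hand-made array holding just CreditScore, Geography, and Gender for four customers.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hand-made sample: CreditScore, Geography, Gender.
X_small = np.array([
    [619, "France", "Female"],
    [608, "Spain", "Female"],
    [376, "Germany", "Female"],
    [645, "Spain", "Male"],
], dtype=object)

ct = ColumnTransformer(
    # One-hot encode columns 1 and 2; strings are handled directly.
    transformers=[("onehot", OneHotEncoder(), [1, 2])],
    # Keep the remaining columns unchanged, appended after the encoded ones.
    remainder="passthrough",
    # Always return a dense array instead of a sparse matrix.
    sparse_threshold=0,
)
encoded = ct.fit_transform(X_small)

print(encoded)
```

The output has three Geography columns (France, Germany, Spain), two Gender columns (Female, Male), and then the untouched CreditScore column.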


## We can do Label encoding and one-hot encoding at the same time in Pandas

In [70]:
import pandas as pd

X = df.iloc[:, 3:13]
y = df.iloc[:, 13]
pd.get_dummies(X).head(10)

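|   | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Geography_France | Geography_Germany | Geography_Spain | Gender_Female | Gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | 0 | 0 | 1 | 0 |
| 1 | 608 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 0 | 1 | 1 | 0 |
| 2 | 502 | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | 0 | 0 | 1 | 0 |
| 3 | 699 | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 1 | 0 | 0 | 1 | 0 |
| 4 | 850 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | 0 | 1 | 1 | 0 |
| 5 | 645 | 44 | 8 | 113755.78 | 2 | 1 | 0 | 149756.71 | 0 | 0 | 1 | 0 | 1 |
| 6 | 822 | 50 | 7 | 0.0 | 2 | 1 | 1 | 10062.8 | 1 | 0 | 0 | 0 | 1 |
| 7 | 376 | 29 | 4 | 115046.74 | 4 | 1 | 0 | 119346.88 | 0 | 1 | 0 | 1 | 0 |
| 8 | 501 | 44 | 4 | 142051.07 | 2 | 0 | 1 | 74940.5 | 1 | 0 | 0 | 0 | 1 |
| 9 | 684 | 27 | 2 | 134603.88 | 1 | 1 | 1 | 71725.73 | 1 | 0 | 0 | 0 | 1 |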


In [None]:
pd.get_dummies(X).head(10).values
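Recent pandas versions return `True`/`False` from `get_dummies` by default rather than 1/0. The sketch below, on a made-up three-row frame, passes `dtype=int` to get integer columns and uses `columns=` to name the columns to encode explicitly.

```python
import pandas as pd

# Hypothetical three-row sample with invented values.
toy = pd.DataFrame({
    "CreditScore": [619, 608, 376],
    "Geography": ["France", "Spain", "Germany"],
    "Gender": ["Female", "Female", "Male"],
})

# Encode only the named columns and force 0/1 integers in the output.
dummies = pd.get_dummies(toy, columns=["Geography", "Gender"], dtype=int)

print(dummies.columns.tolist())
print(dummies.values)
```

Untouched columns come first, followed by one new column per category, named `<column>_<category>`.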

## Resources

- [Full Tutorials Point guide on Machine Learning with Python](https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_quick_guide.htm)
- [Scikit Learn Preprocessing documentation](https://scikit-learn.org/stable/modules/preprocessing.html)