## <br><br><span style="color:rebeccapurple">Overview of machine learning</span>

The two most common types of machine learning:

1. Supervised learning
 - Data are labelled, i.e., input-output pairs
 - Goal: making "good" predictions
 - Obvious error metric/loss function where we can compare predictions with observations
 - EXAMPLE MODELS: linear regression (continuous output), logistic regression (discrete output), decision tree (classification and regression)
 - EXAMPLE PROBLEMS: classify the type of flower based on physical features such as petal width, length, and color


2. Unsupervised learning
 - Data are not labelled, i.e., no output
 - Goal: finding interesting patterns (knowledge discovery)
 - No obvious error metric
 - EXAMPLE MODELS: clustering, PCA, etc.
 - EXAMPLE PROBLEMS: identify groups of households that are similar to each other and create targeted marketing
 
Two main types of models

1. Parametric models
 - Make strong assumptions about data (normally distributed, etc.)
 - Learn and use a fixed number of parameters
 - Example: linear regression model
 
 
2. Nonparametric models
 - Make fewer assumptions about data (no assumed distribution)
 - Learn and use a flexible number of parameters
 - Example: decision tree


## <br><br><span style="color:rebeccapurple">Overview of the 3-day workshop</span>

1. Preprocessing data and building Machine Learning pipelines in scikit-learn
2. Supervised learning with parametric models
3. Cross-validation
4. Supervised learning with nonparametric models
5. Unsupervised learning

# <br><br><span style="color:rebeccapurple">Preprocessing</span>

Raw data can take on any range of values. By preprocessing data, we make it easier to interpret and use.

There are many ways to preprocess data for ML, depending on the modeling purpose and data characteristics.
<br><br>We're going to discuss several methods to deal with these three types of data:
- Numerical
- Categorical
- Missing

<br>We won't have time to cover every possible way of preprocessing data, but let's review some of your options. In a few minutes, we will learn how to combine your preprocessing steps using a scikit-learn tool called Pipeline.

In [1]:
import numpy as np
import pandas as pd

# <br><span style="color:purple"> Overview of data

In [75]:
## Import data
df = pd.read_csv('datasets/airfoil_self_noise.csv')
labels = df.columns
## Take a peek at the dataset
df.head()

| | frequency | angle | chord_len | velocity | thickness | sound_pressure |
|---|---|---|---|---|---|---|
| 0 | 800.0 | 0.0 | 0.3048 | 71.3 | 0.002663 | 126.201 |
| 1 | 1000.0 | 0.0 | 0.3048 | 71.3 | 0.002663 | 125.201 |
| 2 | 1250.0 | 0.0 | 0.3048 | 71.3 | 0.002663 | 125.951 |
| 3 | 1600.0 | 0.0 | 0.3048 | 71.3 | 0.002663 | 127.591 |
| 4 | 2000.0 | 0.0 | 0.3048 | 71.3 | 0.002663 | 127.461 |


The most common type of data in machine learning is tabular data as shown above. We will work with tabular data throughout this workshop. 

In tabular data, we have
- columns: contain data of a single type. We refer to the type of data in each column as a feature/attribute. 
- rows: contain a set of observations. We refer to each row as a sample/instance.

In (supervised) machine learning, we also have
- predictors: the input data of the algorithm, or the independent variables from a statistical perspective
- response/target: the output of the algorithm, or the dependent variable from a statistical perspective
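
As a minimal sketch of this terminology, the snippet below splits a small table into predictors and a response. The name `df_demo` is hypothetical and its rows are typed in from the preview above; treating `sound_pressure` as the response is an assumption made for illustration.

```python
import pandas as pd

# A few rows copied by hand from the preview above (df_demo is a hypothetical name).
df_demo = pd.DataFrame({
    "frequency": [800.0, 1000.0, 1250.0],
    "angle": [0.0, 0.0, 0.0],
    "chord_len": [0.3048, 0.3048, 0.3048],
    "velocity": [71.3, 71.3, 71.3],
    "thickness": [0.002663, 0.002663, 0.002663],
    "sound_pressure": [126.201, 125.201, 125.951],
})

# Assumption: sound_pressure is the response; every other column is a predictor.
X = df_demo.drop(columns="sound_pressure")  # predictors (one column per feature)
y = df_demo["sound_pressure"]               # response/target (one value per sample)

print(X.shape)
print(y.shape)
```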

# <br><span style="color:purple"> Numerical data

In [67]:
## Get a summary of the dataset
df.describe()

| | frequency | angle | chord_len | velocity | thickness | sound_pressure |
|---|---|---|---|---|---|---|
| count | 1503.0 | 1503.0 | 1503.0 | 1503.0 | 1503.0 | 1503.0 |
| mean | 2886.380572 | 6.782302 | 0.136548 | 50.860745 | 0.01114 | 124.835943 |
| std | 3152.573137 | 5.918128 | 0.093541 | 15.572784 | 0.01315 | 6.898657 |
| min | 200.0 | 0.0 | 0.0254 | 31.7 | 0.000401 | 103.38 |
| 25% | 800.0 | 2.0 | 0.0508 | 39.6 | 0.002535 | 120.191 |
| 50% | 1600.0 | 5.4 | 0.1016 | 39.6 | 0.004957 | 125.721 |
| 75% | 4000.0 | 9.9 | 0.2286 | 71.3 | 0.015576 | 129.9955 |
| max | 20000.0 | 22.2 | 0.3048 | 71.3 | 0.058411 | 140.987 |


## <br><span style="color:teal"> Standardization

This is the process of converting data into the standard format where each feature has zero mean and unit variance (i.e., std=1). 

In [68]:
from sklearn.preprocessing import StandardScaler
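
A minimal sketch of standardization on a small made-up array (the values in `X_demo` are illustrative, not taken from the workshop dataset). After scaling, each column should have mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
X_demo = np.array([
    [800.0, 0.0],
    [1000.0, 2.0],
    [1250.0, 4.0],
    [1600.0, 6.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)  # learn mean/std per column, then scale

# StandardScaler uses the population standard deviation (ddof=0),
# which matches NumPy's default for std().
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

Note that `df.describe()` in pandas reports the sample standard deviation (ddof=1), so the `std` row of a standardized DataFrame will be close to, but not exactly, 1.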

## <br><span style="color:teal"> Scaling features to a range

Apart from standardization, we can scale features to lie between a given minimum and maximum value, often between zero and one. This kind of scaling is robust to features with very small standard deviations. A related scaler, MaxAbsScaler, divides each feature by its maximum absolute value and therefore preserves zero entries in sparse data.

In [15]:
from sklearn.preprocessing import MinMaxScaler
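
A minimal sketch of range scaling, again on made-up values (`X_demo` is a hypothetical array). With the default `feature_range=(0, 1)`, each column's minimum maps to 0 and its maximum maps to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two features with different ranges.
X_demo = np.array([
    [200.0, 31.7],
    [800.0, 39.6],
    [4000.0, 71.3],
])

# feature_range=(0, 1) is the default; it is spelled out here for clarity.
X_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_demo)

print(X_minmax.min(axis=0))
print(X_minmax.max(axis=0))
```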

## <br><span style="color:teal"> Normalization

This is the process of scaling individual samples to have unit norm. So far, we have scaled data by feature, i.e., calculations are applied to individual columns. Sometimes we need to scale each row instead. For example, when clustering on cosine similarity, normalizing each sample to unit L2 norm makes the dot product of two samples equal to their cosine similarity.

In [18]:
from sklearn.preprocessing import Normalizer
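
A minimal sketch of row-wise normalization on a made-up array (`X_demo` is hypothetical). Each row is divided by its own L2 norm, so the dot products between normalized rows are cosine similarities.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical toy data: each row is one sample.
X_demo = np.array([
    [3.0, 4.0],
    [1.0, 0.0],
    [5.0, 12.0],
])

# Normalizer works on rows (samples), not columns (features).
X_norm = Normalizer(norm="l2").fit_transform(X_demo)

row_norms = np.linalg.norm(X_norm, axis=1)  # every row now has length 1
cosine_sim = X_norm @ X_norm.T              # pairwise cosine similarities

print(X_norm)
print(row_norms)
```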

# <br><span style="color:purple"> Categorical data
Data sometimes come in non-numeric values in predictors and/or response.

In [2]:
df = pd.read_csv('datasets/breast_cancer.csv')
df.head()

| | target | age | tumor-size | deg_malig | side | quad | irradiat |
|---|---|---|---|---|---|---|---|
| 0 | no-recurrence-events | 30-39 | 30-34 | 3 | left | left_low | no |
| 1 | no-recurrence-events | 40-49 | 20-24 | 2 | right | right_up | no |
| 2 | no-recurrence-events | 40-49 | 20-24 | 2 | left | left_low | no |
| 3 | no-recurrence-events | 60-69 | 15-19 | 2 | right | left_up | no |
| 4 | no-recurrence-events | 40-49 | 0-4 | 2 | right | right_low | no |


### <br><span style="color:blue"> Question:
Which features (age, tumor-size, deg_malig, side, quad, irradiat) are categorical? 

## <br><span style="color:teal"> Ordinal encoding 

This is the process of assigning each unique category an integer value. Doing so imposes an ordering on the categories, so it is appropriate when the categories have a natural order.
    
For example, age is naturally ordered, and we can map the different ranges to integer values: 30-39 => 0, 40-49 => 1, 50-59 => 2, etc.

In [21]:
from sklearn.preprocessing import OrdinalEncoder
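
A minimal sketch of the age mapping described above. The small `ages` frame and the `age_order` list are hypothetical and only cover the ranges mentioned in the text; passing `categories` explicitly is one way to control the order rather than relying on the encoder's default sorting.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample of the age column (values taken from the preview above).
ages = pd.DataFrame({"age": ["30-39", "40-49", "40-49", "60-69", "40-49"]})

# Assumed ordering, for illustration only; the real dataset may contain more ranges.
age_order = ["30-39", "40-49", "50-59", "60-69"]

# One list of categories per encoded column.
enc = OrdinalEncoder(categories=[age_order])
codes = enc.fit_transform(ages[["age"]])  # 2-D float array, one column

print(codes.ravel())
```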

## <br><span style="color:teal"> One-hot encoding

When there is no natural ordinal relationship among different categories, OrdinalEncoder is not an appropriate approach.

In addition, when the response variable has no ordinal relationship, encoding its labels as ordered integer values can result in poor performance. For example, suppose we encode the response's labels as 0, 1, 2. An algorithm that treats these as numbers can return a prediction of 1.5, which does not correspond to any label.
  
One-hot encoding is the process of transforming each label of the original categorical variable into a new binary variable. This means the total number of features will increase after preprocessing. 

In [27]:
from sklearn.preprocessing import OneHotEncoder
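
A minimal sketch of one-hot encoding on a hypothetical sample of the `side` column. Each distinct label becomes its own binary column, so one input column turns into two here.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample of the side column (values taken from the preview above).
sides = pd.DataFrame({"side": ["left", "right", "left", "right", "right"]})

enc = OneHotEncoder()
encoded = enc.fit_transform(sides[["side"]])  # SciPy sparse matrix by default
dense = encoded.toarray()                     # convert to a regular NumPy array

print(enc.categories_[0])  # labels, in the order of the new columns
print(dense)
```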

### <br><span style="color:blue"> Question:
    
Which features should be processed using one-hot encoding?

## <br><span style="color:teal"> Label encoding 

scikit-learn provides a separate class, LabelEncoder, for encoding the target variable.

In [34]:
from sklearn.preprocessing import LabelEncoder
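
A minimal sketch of encoding a target column. The list `y_demo` is hypothetical; in particular, the label `recurrence-events` does not appear in the preview above and is assumed here only so that the example has two classes.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values; "recurrence-events" is an assumed second class.
y_demo = ["no-recurrence-events", "recurrence-events", "no-recurrence-events"]

le = LabelEncoder()
y_encoded = le.fit_transform(y_demo)  # 1-D integer array

print(le.classes_)                        # labels, sorted; index = encoded value
print(y_encoded)
print(le.inverse_transform(y_encoded))    # map integers back to the labels
```

Unlike OrdinalEncoder and OneHotEncoder, LabelEncoder takes a 1-D sequence of labels rather than a 2-D table of features.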

## <br><span style="color:teal"> Imputation of missing values
    
Sometimes, your data will have missing values (NaN).

In [38]:
df = pd.read_csv('datasets/iris_numeric.csv')
df.head()

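| | sepal_length | sepal_width | petal_length | petal_width | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | NaN | 1.4 | 0.2 | 0 |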


Note that we use the dataset with transformed target labels here. If we apply a scikit-learn transformer that expects numeric input (for example, imputing with the mean) to a dataset that contains non-numeric values, it will raise an error.

We can check if there is any missing value in the entire dataset

In [781]:
df.isna().values.any()

True

We can also check which columns contain missing values

In [782]:
df.isna().any()

sepal_length    False
sepal_width      True
petal_length     True
petal_width     False
target          False
dtype: bool

In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   149 non-null    float64
 2   petal_length  149 non-null    float64
 3   petal_width   150 non-null    float64
 4   target        150 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 6.0 KB


We can get the specific rows with missing values

In [783]:
df[df.isna().any(axis=1)]

| | sepal_length | sepal_width | petal_length | petal_width | target |
|---|---|---|---|---|---|
| 4 | 5.0 | NaN | 1.4 | 0.2 | 0 |
| 10 | 5.4 | 3.7 | NaN | 0.2 | 0 |


There are several approaches to imputing missing values before building an estimator. We will explore the simplest approach, which involves replacing the missing values with
- constant values
- statistics like mean, median, mode

In [784]:
from sklearn.impute import SimpleImputer
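
A minimal sketch of both strategies on a small hypothetical frame (`df_demo` holds a few values typed in from the preview above, with one missing entry). The constant strategy fills the gap with a fixed value; the mean strategy fills it with the column mean computed from the observed values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one missing value in sepal_width.
df_demo = pd.DataFrame({
    "sepal_width": [3.5, 3.0, 3.2, 3.1, np.nan],
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4],
})

# Strategy 1: replace missing values with a constant.
imp_const = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0.0)
filled_const = pd.DataFrame(imp_const.fit_transform(df_demo), columns=df_demo.columns)

# Strategy 2: replace missing values with the mean of each column.
imp_mean = SimpleImputer(missing_values=np.nan, strategy="mean")
filled_mean = pd.DataFrame(imp_mean.fit_transform(df_demo), columns=df_demo.columns)

print(filled_const.loc[4, "sepal_width"])
print(filled_mean.loc[4, "sepal_width"])
```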

# <br><span style="color:purple"> Additional materials
    
If you want to see in detail how each of these methods alters your data, you can work through the practice below after class.

## <br><span style="color:purple"> Numerical data

In [11]:
# ## Import data
# df = pd.read_csv('datasets/airfoil_self_noise.csv')
# labels = df.columns

## <br><span style="color:teal"> Standardization

In [68]:
from sklearn.preprocessing import StandardScaler

Compute the mean and std to be used for later scaling.


In [2]:
# scaler = StandardScaler().fit(df)

In [3]:
# print(scaler.mean_)
# print(scaler.scale_)

Perform standardization by centering and scaling.

In [4]:
# my_array_scaled = scaler.transform(df)

The two methods above can be combined using the fit_transform method

In [5]:
# my_array_scaled = StandardScaler().fit_transform(df)

In [6]:
# df_scaled = pd.DataFrame(my_array_scaled, columns = labels)
# df_scaled.describe()

It is possible to disable either centering or scaling by either passing with_mean=False or with_std=False to the constructor of StandardScaler.

In [7]:
# my_array_scaled = StandardScaler(with_mean=False).fit_transform(df)

In [12]:
# df_scaled = pd.DataFrame(my_array_scaled, columns = labels)
# df_scaled.describe()

Center the data in df to mean 0, but disable scaling.

In [8]:
# my_array_scaled =

In [14]:
# df_scaled = pd.DataFrame(my_array_scaled, columns = labels)
#df_scaled.describe()

## <br><span style="color:teal"> Scaling features to a range

In [15]:
from sklearn.preprocessing import MinMaxScaler

Fit a MinMaxScaler and apply its transform method to scale the features.

In [9]:
# my_array_scaled = 

In [17]:
# df_scaled = pd.DataFrame(my_array_scaled, columns = labels)
#df_scaled.describe()

## <br><span style="color:teal"> Normalization

In [18]:
from sklearn.preprocessing import Normalizer

Fit a Normalizer and apply its transform method to normalize the samples.

In [10]:
# my_array_scaled = 

## <br><span style="color:purple"> Categorical data

In [12]:
# df = pd.read_csv('datasets/breast_cancer.csv')
# df_transformed = df.copy()

## <br><span style="color:teal"> Ordinal encoding 

In [21]:
from sklearn.preprocessing import OrdinalEncoder

We can fit the encoder to the data to get the labels and their corresponding values.

In [13]:
# enc_age = OrdinalEncoder().fit(df[["age"]])

In [14]:
# enc_age.categories_

The OrdinalEncoder class automatically sorts the labels in the feature and assigns integer values from 0 to n-1, where n is the number of unique labels. 

After fitting the data, we can transform the categorical values in age to numerical values.

In [15]:
# df_transformed["age"] = enc_age.transform(df[["age"]])
# df_transformed.head()

Encode the tumor-size variable

In [18]:
# # Fit the data to get labels
# enc_size = 

In [19]:
# # Transform the values and assign the new values to the column tumor-size
# df_transformed["tumor-size"] = 

Encode the irradiat variable

## <br><span style="color:teal"> One-hot encoding

In [27]:
from sklearn.preprocessing import OneHotEncoder

In [20]:
# enc_side = OneHotEncoder().fit(df[["side"]])

In [21]:
# enc_side.categories_

In [22]:
# side_transformed = enc_side.transform(df[["side"]])
# type(side_transformed)

By default, OneHotEncoder returns its results as a SciPy sparse matrix. We can obtain the dense values using the toarray() method.

In [23]:
# df_transformed[enc_side.categories_[0]] = side_transformed.toarray()
# df_transformed.head()

Encode the quad variable

In [24]:
# enc_quad = 

## <br><span style="color:teal"> Label encoding 

In [33]:
# df = pd.read_csv('datasets/iris.csv')
# df

scikit-learn provides a separate class, LabelEncoder, for encoding the target variable.

In [25]:
# from sklearn.preprocessing import LabelEncoder

In [26]:
# le = LabelEncoder().fit(df['target'])

In [27]:
# le.classes_

In [28]:
# target_transformed = le.transform(df['target'])

## <br><span style="color:teal"> Imputation of missing values

In [9]:
# df = pd.read_csv('datasets/iris_numeric.csv')
# df.head()

In [784]:
from sklearn.impute import SimpleImputer

### <br><span style="color:teal"> Using a constant value

In [10]:
# imp_const = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0.).fit_transform(df)

In [11]:
# df_transformed_const = pd.DataFrame(imp_const, columns=df.columns)
# df_transformed_const.isna().any()

In [12]:
# df_transformed_const.loc[[4,10]]

### <br><span style="color:teal"> Using a statistic

In [13]:
# imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform(df)

In [14]:
# df_transformed_mean = pd.DataFrame(imp_mean, columns=df.columns)
# df_transformed_mean.isna().any()

In [15]:
# df_transformed_mean.loc[[4,10]]