<a href="https://colab.research.google.com/github/hhan3/COMP479/blob/main/Copy_of_01_ml_notebook_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data-driven Aspects

---
Loyola University Chicago  
COMP 379-001/479-001, Spring 2024, Machine Learning  
Instructor: Daniel Moreira (dmoreira1@luc.edu)  
More at https://danielmoreira.github.io/teaching/ml-spr24/

---

Practical examples and exercises on the data-driven aspects of Machine Learning.

Language: Python 3  

Needed libraries:
* NumPy (https://numpy.org/)
* Pandas (https://pandas.pydata.org/)
* Scikit-learn (https://scikit-learn.org/)

-------
## Data Partition

In [None]:
# download the wine dataset
!curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

In [None]:
import numpy as np
import pandas as pd

# loads the wine dataset into memory
# (the file has no header row, so header=None keeps the first sample as data)
df_wine = pd.read_csv('/content/wine.data', header=None)

# adds headers to the dataset according to documentation
df_wine.columns = [
    'label', 'alcohol', 'malic acid', 'ash', 'alcalinity', 'magnesium',
    'phenols', 'flavanoids', 'nonflavanoid phenols', 'proanthocyanins',
    'color intensity', 'hue', 'protein concentration', 'proline']

# prints info
print('Data shape:', df_wine.shape)
print('Labels, Label count:', np.unique(df_wine['label'], return_counts=True))
print()

# first 10 samples
df_wine.head(10)

In [None]:
# data partition using sklearn
# reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
from sklearn.model_selection import train_test_split

# separation into data (X) and respective labels (y)
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
print('First 10 samples:', X[:10])
print('First 10 labels:', y[:10])
print()

# split configuration
test_size = 0.3 # fraction of the data going to test
random_seed = 0 # fixed seed, for reproducibility

# data split
X_train, X_test, y_train, y_test =\
  train_test_split(X, y,
                   random_state=random_seed,
                   test_size=test_size,
                   stratify=y)

# train info
print('Train data shape:', X_train.shape)
print('Train data labels, label count:', np.unique(y_train, return_counts=True))
print()
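
The effect of `stratify=y` can be checked on a small synthetic example. The sketch below is only an illustration: the labels are made up (80 samples of class 0 and 20 of class 1), and the names `X_toy`, `y_toy`, `y_tr`, and `y_te` are not part of the notebook. Under these assumptions, a stratified split should keep the 80/20 class ratio in both partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# made-up, imbalanced labels (assumption): 80 samples of class 0, 20 of class 1
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)  # dummy one-feature data

# stratified split: class proportions are preserved in each partition
_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy)

# per-class sample counts in each partition
print('Train class counts:', np.bincount(y_tr))
print('Test class counts:', np.bincount(y_te))
```

Without `stratify`, the class counts in each partition depend on the random shuffle, so a rare class may end up under-represented in one of them.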

### Exercise 1
Print info about the test partition.

In [None]:
# add your code here


-------
## Data Preprocessing


### Numerical Data

In [None]:
# download the California housing dataset
!pip install gdown
!gdown 1QkyFNJR8CloAprShjVz7QR8L0e1UuJOf

In [None]:
import numpy as np
import pandas as pd

# loads the housing dataset into memory
df_housing = pd.read_csv('/content/housing.csv')

# prints info
print('Data shape:', df_housing.shape)
print('All features are numerical except the last one.')

# last 5 samples
df_housing.tail(5)

#### Exercise 2
Using the official [Scikit-learn reference](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), split the California housing dataset into *train* and *test* partitions, with the test partition containing half of the data.

In [None]:
# add your code here


### Ordinal Data

In [None]:
# toy-case example
import pandas as pd

df = pd.DataFrame([['green', 'M', 10.1],
                   ['red', 'L', 13.5],
                   ['blue', 'XL', 15.3],
                   ['green', 'S', 8.9]])

df.columns = ['color', 'size', 'price']
df

In [None]:
# ordinal data mapping to numerical
size_mapping = {'XL': 3,
                'L': 2,
                'M': 1,
                'S': 0}

df['size'] = df['size'].map(size_mapping)
df

In [None]:
# inverse mapping, to recover the original size labels when explaining results
inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)
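
The same ordinal mapping can also be expressed with Scikit-learn. The sketch below assumes the `OrdinalEncoder` class from `sklearn.preprocessing`, with the category order given explicitly from smallest to largest. The names `df_sizes`, `size_codes`, and `size_labels` are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# same sizes as in the toy case above, in a separate frame
df_sizes = pd.DataFrame({'size': ['M', 'L', 'XL', 'S']})

# explicit category order, so that S -> 0, M -> 1, L -> 2, XL -> 3
encoder = OrdinalEncoder(categories=[['S', 'M', 'L', 'XL']])
size_codes = encoder.fit_transform(df_sizes[['size']])
print(size_codes.ravel())

# inverse mapping, back to the original labels
size_labels = encoder.inverse_transform(size_codes)
print(size_labels.ravel())
```

Passing the order explicitly matters: with the default `categories='auto'`, the categories are sorted alphabetically, which does not match the size order.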

### Nominal Data

In [None]:
# one-hot encoding example, but with collinearity

# needed libraries
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# column transformation
X = df.values
encoder = OneHotEncoder()
transformer = ColumnTransformer([('onehot', encoder, [0]), # [0]: color column
                                 ('keep', 'passthrough', [1, 2])])
X_transf = transformer.fit_transform(X).astype(float)
print(X_transf)

# pandas format
df_transf = pd.DataFrame(X_transf)
df_transf.columns = ['color_blue', 'color_green', 'color_red', 'size', 'price']
df_transf

#### Exercise 3

**Question**

What is the problem with this solution in terms of *collinearity*?

> *Add your answer here.*
>

In [None]:
# one-hot encoding example WITHOUT collinearity

# needed libraries
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# column transformation
X = df.values
encoder = OneHotEncoder(categories='auto', drop='first') # attention here!
transformer = ColumnTransformer([('onehot', encoder, [0]), # [0]: color column
                                 ('keep', 'passthrough', [1, 2])])
X_transf = transformer.fit_transform(X).astype(float)
print(X_transf)

# pandas format
df_transf = pd.DataFrame(X_transf)
df_transf.columns = ['color_green', 'color_red', 'size', 'price']
df_transf
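
For quick exploration, a similar encoding can be sketched directly in Pandas. The example below assumes `pd.get_dummies` with `drop_first=True`, applied to a made-up frame (`df_colors`) that mirrors the toy case. It is an alternative for illustration, not the approach used in the cells above.

```python
import pandas as pd

# made-up frame mirroring the toy case (color and price only)
df_colors = pd.DataFrame({'color': ['green', 'red', 'blue', 'green'],
                          'price': [10.1, 13.5, 15.3, 8.9]})

# one-hot encodes 'color' and drops the first category (alphabetical order)
df_dummies = pd.get_dummies(df_colors, columns=['color'],
                            drop_first=True, dtype=float)
print(df_dummies)
```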

#### Exercise 4

**Question**

With this encoding, how can you tell whether a shirt is blue?

> *Add your answer here.*
>

#### Exercise 5
Handle the "*ocean proximity*" column within the California dataset **train partition**.

In [None]:
# add your code here


#### Exercise 6
Handle the "*ocean proximity*" column within the California dataset **test partition**.

> Don't contaminate the data!

In [None]:
# add your code here


#### Exercise 7

**Question**

How did you avoid contaminating the test set?

> *Add your answer here.*
>

### Missing Data

In [None]:
# toy-case example
import pandas as pd
from io import StringIO

csv_data = \
'''A,B,C,D,E,F,G
1.0,2.0,3.0,4.0,5.0,6.0,7.0
8.0,9.0,,11.0,,13.0,14.0
15.0,16.0,17.0,,,,21.0
22.0,23.0,24.0,25.0,26.0,27.0,28.0'''

df = pd.read_csv(StringIO(csv_data))
print(df.values)
print()

df

#### Feature Drop

In [None]:
# drop columns that have one or more missing values
df_transf = df.dropna(axis=1)
df_transf

In [None]:
# drop columns that have fewer than <3> non-empty values
# i.e., all columns must have at least 3 non-empty values
df_transf = df.dropna(axis=1, thresh=3)
df_transf
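
To see why `thresh=3` keeps or drops a given column, it may help to count the non-missing values per column first. The sketch below rebuilds the same toy table under the illustrative name `df_demo` and compares the counts with the result of `dropna`.

```python
import pandas as pd
from io import StringIO

# same toy table as above, rebuilt so that this cell is self-contained
csv_data = '''A,B,C,D,E,F,G
1.0,2.0,3.0,4.0,5.0,6.0,7.0
8.0,9.0,,11.0,,13.0,14.0
15.0,16.0,17.0,,,,21.0
22.0,23.0,24.0,25.0,26.0,27.0,28.0'''
df_demo = pd.read_csv(StringIO(csv_data))

# number of non-missing values in each column
non_missing = df_demo.notna().sum()
print(non_missing)

# columns with at least 3 non-missing values, i.e., the ones thresh=3 keeps
kept = non_missing[non_missing >= 3].index.tolist()
print('Kept columns:', kept)
print('Same as dropna:', kept == list(df_demo.dropna(axis=1, thresh=3).columns))
```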

#### Sample Drop


In [None]:
# drop rows that have one or more missing values
df_transf = df.dropna(axis=0)
df_transf

In [None]:
# drop rows that have fewer than <5> non-empty values
# i.e., all rows must have at least 5 non-empty values
df_transf = df.dropna(axis=0, thresh=5)
df_transf

In [None]:
# drop rows whose specific columns have missing values
df_transf = df.dropna(subset=['D', 'F']) # D and F cannot have missing values
df_transf

#### Value Replacement

In [None]:
# original values, for reference
df

In [None]:
# replaces missing values with column-wise mean values
mean = df.mean()
df_transf = df.fillna(mean)
df_transf

In [None]:
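# replaces missing values in columns C and D with the column mean,
# and missing values in columns E and F with the column max
mean = df.mean()
col_max = df.max() # named col_max to avoid shadowing the built-in max()

df_transf = df.copy()
df_transf[['C', 'D']] = df[['C', 'D']].fillna(mean)
df_transf[['E', 'F']] = df[['E', 'F']].fillna(col_max)
df_transf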
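
Mean replacement can also be sketched with Scikit-learn. The example below assumes the `SimpleImputer` class from `sklearn.impute` and a small made-up array (`X_demo`). It is an illustration of the same idea, not a replacement for the Pandas cells above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# made-up data with missing values (assumption)
X_demo = np.array([[1.0, 2.0, 3.0],
                   [8.0, np.nan, 11.0],
                   [15.0, 16.0, np.nan]])

# learns the column-wise means and fills the missing values with them
imputer = SimpleImputer(strategy='mean')
X_filled = imputer.fit_transform(X_demo)

print('Learned means:', imputer.statistics_)
print(X_filled)
```

The learned means are stored in `imputer.statistics_`, so the same fitted object can later fill other data with `imputer.transform`.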

### Data Normalization

#### Data Scaling

In [None]:
# toy-case example (with train and test partition)
import pandas as pd

df_train = pd.DataFrame([[0.15, 1230, 0.00000005, 315.10],
                         [0.12, 4217, 0.00000027, 117.00],
                         [0.23,  943, 0.00000003, 230.40],
                         [0.18, 1014,        0.0,   3.14]])

df_test = pd.DataFrame([[0.25, 1500, 0.00000002,   3.14],
                        [0.16, 3500,        0.0, 100.00]])

df_train, df_test

In [None]:
# train data min-max scaling (manual implementation)
df_max = df_train.max()
df_min = df_train.min()

df_train_norm = (df_train - df_min) / (df_max - df_min)
df_train_norm

In [None]:
# test data min-max scaling (manual implementation)
df_max = df_train.max() # use the train data to avoid contamination!
df_min = df_train.min() # use the train data to avoid contamination!

df_test_norm = (df_test - df_min) / (df_max - df_min)
df_test_norm

In [None]:
# sklearn data scale implementation
from sklearn.preprocessing import MinMaxScaler

# scaler object
scaler = MinMaxScaler()

# train data
X_train = df_train.values
X_train_norm = scaler.fit_transform(X_train) # use fit only on the train set!

X_train_norm

In [None]:
# test data
X_test = df_test.values
X_test_norm = scaler.transform(X_test) # don't use fit on the test set!

X_test_norm
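
As a sanity check, the manual computation and `MinMaxScaler` should agree. The sketch below rebuilds the toy partitions under the illustrative names `train` and `test`, compares both results, and also shows that test values scaled with train statistics are not guaranteed to stay within [0, 1].

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# same toy partitions as above, rebuilt so that this cell is self-contained
train = pd.DataFrame([[0.15, 1230, 0.00000005, 315.10],
                      [0.12, 4217, 0.00000027, 117.00],
                      [0.23,  943, 0.00000003, 230.40],
                      [0.18, 1014,        0.0,   3.14]])
test = pd.DataFrame([[0.25, 1500, 0.00000002,   3.14],
                     [0.16, 3500,        0.0, 100.00]])

# manual scaling of the test data, with train statistics only
tr_min, tr_max = train.min(), train.max()
manual_test = (test - tr_min) / (tr_max - tr_min)

# sklearn scaling: fit on train, transform on test
scaler = MinMaxScaler().fit(train.values)
sk_test = scaler.transform(test.values)

print('Manual and sklearn agree:', np.allclose(manual_test.values, sk_test))
print('Largest scaled test value:', sk_test.max())
```

The largest scaled test value exceeds 1 because the first test sample (0.25) is above the train maximum (0.23) in the first feature.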

#### Data Standardization

In [None]:
# same toy-case example as before (with train and test partition)
import pandas as pd

df_train = pd.DataFrame([[0.15, 1230, 0.00000005, 315.10],
                         [0.12, 4217, 0.00000027, 117.00],
                         [0.23,  943, 0.00000003, 230.40],
                         [0.18, 1014,        0.0,   3.14]])

df_test = pd.DataFrame([[0.25, 1500, 0.00000002,   3.14],
                        [0.16, 3500,        0.0, 100.00]])

df_train, df_test

In [None]:
# train data standardization
df_mean = df_train.mean()
df_std = df_train.std(ddof=0)

df_train_norm = (df_train - df_mean) / df_std
df_train_norm

In [None]:
# test data standardization
df_mean = df_train.mean()     # use the train data to avoid contamination!
df_std = df_train.std(ddof=0) # use the train data to avoid contamination!

df_test_norm = (df_test - df_mean) / df_std
df_test_norm

In [None]:
# sklearn implementation
from sklearn.preprocessing import StandardScaler

# scaler object
scaler = StandardScaler()

# train data
X_train = df_train.values
X_train_norm = scaler.fit_transform(X_train) # use fit only on the train set!

X_train_norm

In [None]:
# test data
X_test = df_test.values
X_test_norm = scaler.transform(X_test) # don't use fit on the test set!

X_test_norm
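
The `ddof=0` argument in the manual cells matters: `StandardScaler` divides by the population standard deviation, while Pandas defaults to the sample standard deviation (`ddof=1`). The sketch below rebuilds the toy train partition under the illustrative name `train` and compares both variants against Scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# same toy train partition as above, rebuilt so that this cell is self-contained
train = pd.DataFrame([[0.15, 1230, 0.00000005, 315.10],
                      [0.12, 4217, 0.00000027, 117.00],
                      [0.23,  943, 0.00000003, 230.40],
                      [0.18, 1014,        0.0,   3.14]])

# sklearn standardization
sk_train = StandardScaler().fit_transform(train.values)

# manual standardization with population std (ddof=0) and sample std (ddof=1)
manual_pop = (train - train.mean()) / train.std(ddof=0)
manual_sample = (train - train.mean()) / train.std()  # pandas default: ddof=1

print('ddof=0 matches sklearn:', np.allclose(manual_pop.values, sk_train))
print('ddof=1 matches sklearn:', np.allclose(manual_sample.values, sk_train))
```

With only four samples the two variants differ noticeably. On large datasets the difference becomes negligible.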

#### Exercise 8

Execute data partition and preprocessing on the California housing dataset, which was provided above (see the "Numerical Data" section).   

Split the data into train and test partitions, with the test set containing 30% of the data.

Handle any categorical data, and replace all missing values with the median of their respective features. Lastly, normalize the dataset with standardization.

Make sure not to mix train and test partitions.   
**Do not contaminate the data!**


In [None]:
# add your code here
