<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/polyhedron-gdl/introduction-to-machine-learning-for-finance/blob/main/2022/1-notebooks/chapter-2-1.ipynb">
        <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

# Data Preprocessing

## Introduction: The Importance of Data Pre-Processing in Machine Learning

Before applying any machine learning model, it is essential to establish a rigorous understanding of **data pre-processing**.  Poorly prepared data can lead to misleading conclusions, rendering even the most sophisticated algorithms ineffective.

Data pre-processing introduces several key concepts that are **common to all of machine learning**. Among these, **handling missing values, feature scaling, encoding categorical variables, and detecting outliers** are crucial. These techniques ensure that the input data is structured, consistent, and suitable for training robust models. Without proper data preparation, models may struggle with convergence, exhibit bias, or fail to generalize to new data.

To develop intuition for these ideas, we will use **simple datasets** as our primary tool. This choice allows us to focus on the core issues of **data cleaning, transformation, and feature engineering** without being overwhelmed by complex datasets or domain-specific knowledge. By using well-defined examples, we can systematically explore how different pre-processing techniques affect the model’s performance.

In this lesson, we will cover the essential components of data pre-processing, including:

- **Handling missing data: imputation and removal strategies**
- **Feature scaling: normalization and standardization**
- **Encoding categorical variables**
- **Detecting and handling outliers**
- **Feature Selection and Dimensionality Reduction**

By mastering these principles early, we establish a solid foundation that will allow us to prepare datasets effectively for any machine learning application. With these concepts in place, we can confidently proceed to building and validating models, knowing that our data is well-structured and optimized for success.

## Definitions

Raw data rarely comes in the form needed for a learning algorithm to perform well, yet the success of a machine learning algorithm depends heavily on the quality of the data fed into it. Real-world data is often dirty: it may contain outliers, missing values, wrong data types, irrelevant features, or non-standardized values. Any of these can prevent the model from learning properly, so transforming raw data into a usable format is an essential stage of the machine learning process. It is therefore critical to examine and preprocess a dataset before feeding it to a learning algorithm. In this section, we discuss the essential data preprocessing techniques that help us build good machine learning models.

The topics that we will cover in this lesson are as follows:

- Removing and imputing missing values from the dataset
- Getting categorical data into shape for machine learning algorithms
- Selecting relevant features for the model construction
- Feature Normalization

## Philosophy of Use of Scikit-Learn

Scikit-learn is one of the most widely used Python libraries for machine learning. Its design philosophy is centered on **simplicity, modularity, and consistency**, making it accessible to both beginners and advanced users.

The **core principles** that guide the usage of scikit-learn are:
1. **Unified API**: Every model (regression, classification, clustering, dimensionality reduction, etc.) follows the same pattern.
2. **Minimal Configuration**: Most models work well with default parameters and can be fine-tuned later.
3. **Consistency**: Whether you are dealing with a linear regression, decision tree, or neural network, the interaction with models remains the same.
4. **Pipeline-Oriented**: Scikit-learn encourages a step-by-step workflow involving data preprocessing, model training, and prediction.

### The Core Methods: `fit()`, `transform()`, `predict()`

Scikit-learn is built around a **three-step workflow**: **fitting**, **transforming**, and **predicting**. Almost every estimator (a model or transformer) in scikit-learn implements `fit()`, together with `transform()`, `predict()`, or both.

**1. `fit()` – Learning from Data**

- This method is used to **train** a model or a transformer on the given dataset.
- It extracts relevant patterns, parameters, or statistics from the data.
- Used in **both preprocessing transformers (e.g., scalers, PCA)** and **models (e.g., linear regression, decision trees)**.

**Usage:**

```python
model.fit(X_train, y_train)
```
or, for transformers:
```python
scaler.fit(X_train)
```

**Example: Linear Regression**

```python
from sklearn.linear_model import LinearRegression

X_train = [[1], [2], [3], [4]]
y_train = [2, 4, 6, 8]

model = LinearRegression()
model.fit(X_train, y_train)  # Learns the relationship (y = 2x)
```

**Example: Standard Scaler**

```python
# The scaler methods in scikit-learn are preprocessing techniques used to normalize or 
# standardize numerical data before feeding it into a machine learning model.

from sklearn.preprocessing import StandardScaler

X_train = [[10], [20], [30], [40]]

scaler = StandardScaler()
scaler.fit(X_train)  # Computes mean and standard deviation
```

**2. `transform()` – Applying a Transformation**

- Used **only by transformers** (not predictive models).
- It applies a learned transformation to new data.
- Example use cases:
  - **Feature scaling (e.g., StandardScaler, MinMaxScaler)**
  - **Dimensionality reduction (e.g., PCA)**
  - **Encoding categorical variables (e.g., OneHotEncoder)**

**Usage:**

```python
X_transformed = transformer.transform(X_new)
```

**Example: Standard Scaler**

```python
X_test = [[25], [35]]

X_scaled = scaler.transform(X_test)  # Applies scaling learned from fit()
```

> **Important:** `fit_transform(X)` is a shortcut for `fit(X)` followed by `transform(X)`.

```python
X_scaled = scaler.fit_transform(X_train)  # Often used in pipelines
```
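The distinction between `fit_transform()` and `transform()` matters most when training and test data are kept separate. The following is a minimal sketch, assuming the toy arrays from the examples above, in which the scaler learns its statistics on the training data only and then reuses them on new data. The fitted attributes `mean_` and `scale_` are part of `StandardScaler`; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[10.0], [20.0], [30.0], [40.0]])
X_test = np.array([[25.0], [35.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on the training data, then scale it
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics; no refit

print(scaler.mean_, scaler.scale_)  # statistics learned from X_train
print(X_test_scaled.ravel())        # test values expressed in training-set units
```

Calling `fit_transform()` on the test data instead would recompute the mean and standard deviation from the test set, so the two sets would no longer be on the same scale.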

**3. `predict()` – Making Predictions**

- Used **only by predictive models** (not transformers).
- Takes new input data (`X_test`) and outputs predictions (`y_pred`).
- Works with both **classification (e.g., DecisionTreeClassifier, SVM)** and **regression (e.g., LinearRegression, RandomForestRegressor)**.

**Usage:**

```python
y_pred = model.predict(X_test)
```

**Example: Predicting with Linear Regression**

```python
X_test = [[5], [6]]

y_pred = model.predict(X_test)  # Approximately [10., 12.], since the model learned y = 2x
```

### How These Methods Fit Together in a Typical Pipeline

A typical **machine learning workflow** in scikit-learn follows these steps:

1. **Preprocess the data** (fit and transform):
   - Handle missing values, scale features, encode categorical variables.
   - Example: `StandardScaler().fit_transform(X)`

<p></p>

2. **Train the model** (fit):
   - Example: `model.fit(X_train, y_train)`

<p></p>

3. **Make predictions** (predict):
   - Example: `y_pred = model.predict(X_test)`

**Example: Full Pipeline**

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Define pipeline: Scaling + Linear Regression
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # First, scale features
    ('regressor', LinearRegression())  # Then, fit regression model
])

# Fit pipeline
pipeline.fit(X_train, y_train)

# Predict
y_pred = pipeline.predict(X_test)
```
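To make explicit what the pipeline does internally, here is a self-contained sketch that compares the pipeline with the equivalent manual sequence of `fit`, `transform`, and `predict` calls. The toy data (y = 2x) mirrors the earlier linear regression example; the variable names `y_pipe` and `y_manual` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0])
X_test = np.array([[5.0], [6.0]])

# Pipeline: scaling and regression are fitted and applied as one object
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
pipeline.fit(X_train, y_train)
y_pipe = pipeline.predict(X_test)

# Manual equivalent: fit the scaler on the training data, then reuse it for the test data
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)
y_manual = model.predict(scaler.transform(X_test))

print(y_pipe, y_manual)
```

Both routes give the same predictions; the pipeline simply guarantees that the test data is transformed with the statistics learned during training.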

**Key Takeaways**
- `fit()`: **Learns** from the data (used by both transformers and models).
- `transform()`: **Applies** learned transformations (only for transformers).
- `predict()`: **Generates predictions** from trained models (only for predictive models).
- **Scikit-learn enforces a uniform API**, making it easy to switch between models.
- **Pipelines** streamline the workflow by combining preprocessing and modeling in a single object.

## Data Cleaning

### Dealing with missing data

Real-world data often has many missing values, caused for example by data corruption or a failure to record data. Handling missing data is an important part of preprocessing, as many machine learning algorithms do not support missing values.

Let's create a simple example DataFrame from comma-separated values (CSV) data to get a better grasp of the problem:

In [81]:
import pandas as pd
#
# StringIO (from the standard-library io module) wraps a string in an in-memory,
# file-like object, so it can be passed to any function that expects an open text
# file, such as pd.read_csv. In Python 2 this class lived in a separate StringIO
# module; in Python 3 it is imported from io.
#
from io import StringIO

# The CSV text starts at column 0 so that no whitespace ends up in the column names
csv_data = \
'''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,
10.0,11.0,12.0,13.0'''
df = pd.read_csv(StringIO(csv_data))
df

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,
3,10.0,11.0,12.0,13.0
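On a larger DataFrame it is impractical to spot missing values by eye. A small sketch, assuming a DataFrame with the same values as the one above (rebuilt here as `df_demo` so the block is self-contained), counts the missing entries per column with `isnull()`:

```python
import numpy as np
import pandas as pd

# Same values as the example DataFrame above
df_demo = pd.DataFrame({
    "A": [1.0, 5.0, 10.0, 10.0],
    "B": [2.0, 6.0, 11.0, 11.0],
    "C": [3.0, np.nan, 12.0, 12.0],
    "D": [4.0, 8.0, np.nan, 13.0],
})

missing_per_column = df_demo.isnull().sum()   # number of missing values in each column
missing_share = df_demo.isnull().mean()       # fraction of missing values in each column

print(missing_per_column)
print(missing_share)
```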


### Delete Rows with Missing Values 

One of the easiest ways to deal with missing data is simply to remove the corresponding features (columns) or training examples (rows) from the dataset entirely. A common rule of thumb is to drop a column when more than half of its values are missing, and to drop a row when one or more of its values are missing.

In pandas, rows with missing values can be dropped via the **dropna** method with `axis=0`, and columns with missing values via `axis=1`:

In [82]:
df1 = df.dropna(axis=0)
df1

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
3,10.0,11.0,12.0,13.0


In [83]:
df2 = df.dropna(axis=1)
df2

Unnamed: 0,A,B
0,1.0,2.0
1,5.0,6.0
2,10.0,11.0
3,10.0,11.0


Although the removal of missing data seems to be a convenient approach, it also
comes with certain disadvantages; for example, we may end up removing too
many samples, which will make a reliable analysis impossible. Or, if we remove too
many feature columns, we will run the risk of losing valuable information that our
classifier needs to discriminate between classes. In the next section, we will look
at one of the most commonly used alternatives for dealing with missing values:
interpolation techniques.

**Pros**:
- Simple to apply, and the model is trained only on observed values, with no artificially filled-in entries.

**Cons**:
- Loss of a lot of information.
- Works poorly if the percentage of missing values is excessive in comparison to the complete dataset.
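Dropping every row or column that contains a single missing value is often too aggressive. As a hedged sketch, the `how`, `thresh`, and `subset` parameters of `dropna` allow more selective removal, and the "more than half missing" rule for columns can be written with a boolean mask. The DataFrame `df_demo` and the 0.5 cutoff are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "A": [1.0, 5.0, 10.0, np.nan],
    "B": [2.0, 6.0, np.nan, np.nan],
    "C": [3.0, np.nan, np.nan, np.nan],
})

rows_not_all_nan = df_demo.dropna(how="all")   # drop rows only if every value is missing
rows_min_two = df_demo.dropna(thresh=2)        # keep rows with at least 2 non-missing values
rows_with_c = df_demo.dropna(subset=["C"])     # drop rows only where column C is missing

# Keep columns in which at most half of the values are missing
cols_kept = df_demo.loc[:, df_demo.isnull().mean() <= 0.5]

print(rows_not_all_nan.index.tolist())
print(rows_min_two.index.tolist())
print(rows_with_c.index.tolist())
print(cols_kept.columns.tolist())
```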

### Imputing missing values

One of the most common interpolation techniques is **mean imputation**, where we simply replace each missing value with the mean of the entire feature column.

>**scikit-learn - SimpleImputer**
>
>A convenient way to achieve this is by
>using the **SimpleImputer** class from scikit-learn, which has built-in methods for these preprocessing steps. `SimpleImputer()` fills in missing values using a strategy of your choice (see the code below). The scikit-learn documentation lists the full options for data preprocessing [here](https://scikit-learn.org/stable/modules/preprocessing.html).

In [84]:
from sklearn.impute import SimpleImputer
import numpy as np
#
# define the imputing method
#
imr = SimpleImputer(missing_values=np.nan, strategy='mean')

imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)

imputed_data

array([[ 1.        ,  2.        ,  3.        ,  4.        ],
       [ 5.        ,  6.        ,  9.        ,  8.        ],
       [10.        , 11.        , 12.        ,  8.33333333],
       [10.        , 11.        , 12.        , 13.        ]])

Alternatively, an even more convenient way to impute missing values is by using
pandas' **fillna** method and providing an imputation method as an argument. For
example, using pandas, we could achieve the same mean imputation directly in the
DataFrame object via the following command:

In [85]:
df.fillna(df.mean())

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,9.0,8.0
2,10.0,11.0,12.0,8.333333
3,10.0,11.0,12.0,13.0


**Pros**:
- Prevents the data loss that results from deleting rows or columns.
- Works well with a small dataset and is easy to implement.

**Cons**:
- Mean imputation works only with numerical variables.
- Can cause data leakage if the imputation statistics are computed on the full dataset rather than on the training set only.
- Does not account for the covariance between features.
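The data leakage risk listed above can be reduced by learning the imputation statistics from the training data only. The sketch below assumes a hypothetical train/test split and uses the `median` strategy of `SimpleImputer`, which is less sensitive to extreme values than the mean; the arrays are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training and test sets with missing entries
X_train = np.array([[1.0, 2.0],
                    [np.nan, 4.0],
                    [3.0, np.nan],
                    [100.0, 6.0]])
X_test = np.array([[np.nan, 5.0],
                   [7.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy="median")
imputer.fit(X_train)                        # medians are computed from the training data only
X_test_imputed = imputer.transform(X_test)  # test gaps are filled with the training medians

print(imputer.statistics_)
print(X_test_imputed)
```

The outlier value 100 pulls the mean of the first column far from the typical values, while the median is unaffected by it.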

### Identify and Delete Zero-Variance Predictors

Zero-variance predictors are input features that take a single value across all observations. They add nothing to the prediction algorithm, because a constant input cannot explain any variation in the target, and some ML algorithms may even raise errors or produce wrong results when they are present.
The pandas `nunique` method counts the number of unique values in each column of a DataFrame:

In [86]:
csv_data = '''A,B,C,D,E,F,G,H
1.0,2.0,3.0,4.0,5.0,6.0,7.0,42.0
5.0,2.0,7.0,8.0,5.0,6.0,11.0,42.0
9.0,6.0,11.0,12.0,9.0,10.0,15.0,42.0
13.0,6.0,15.0,16.0,9.0,10.0,19.0,42.0
17.0,10.0,19.0,20.0,13.0,14.0,23.0,42.0
21.0,10.0,23.0,24.0,13.0,14.0,27.0,42.0
25.0,14.0,27.0,28.0,17.0,18.0,31.0,42.0
29.0,14.0,31.0,32.0,17.0,18.0,35.0,42.0
33.0,18.0,35.0,36.0,21.0,22.0,39.0,42.0
37.0,18.0,39.0,40.0,21.0,22.0,43.0,42.0'''

df = pd.read_csv(StringIO(csv_data))
# Get number of rows and columns
num_rows, num_columns = df.shape
print(num_rows, num_columns)
df

10 8


Unnamed: 0,A,B,C,D,E,F,G,H
0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,42.0
1,5.0,2.0,7.0,8.0,5.0,6.0,11.0,42.0
2,9.0,6.0,11.0,12.0,9.0,10.0,15.0,42.0
3,13.0,6.0,15.0,16.0,9.0,10.0,19.0,42.0
4,17.0,10.0,19.0,20.0,13.0,14.0,23.0,42.0
5,21.0,10.0,23.0,24.0,13.0,14.0,27.0,42.0
6,25.0,14.0,27.0,28.0,17.0,18.0,31.0,42.0
7,29.0,14.0,31.0,32.0,17.0,18.0,35.0,42.0
8,33.0,18.0,35.0,36.0,21.0,22.0,39.0,42.0
9,37.0,18.0,39.0,40.0,21.0,22.0,43.0,42.0


In [87]:
df.nunique()

A    10
B     5
C    10
D    10
E     5
F     5
G    10
H     1
dtype: int64

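The code below drops all columns that have a single value. With `inplace=False`, the result is returned as a new DataFrame (`df2`) and `df` itself is left unchanged, as the two printouts show; the cell after it uses `inplace=True` to update `df` directly.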

In [88]:
df2 = df.drop(columns = df.columns[df.nunique() == 1],inplace = False)
print(df2)
print(df)

      A     B     C     D     E     F     G
0   1.0   2.0   3.0   4.0   5.0   6.0   7.0
1   5.0   2.0   7.0   8.0   5.0   6.0  11.0
2   9.0   6.0  11.0  12.0   9.0  10.0  15.0
3  13.0   6.0  15.0  16.0   9.0  10.0  19.0
4  17.0  10.0  19.0  20.0  13.0  14.0  23.0
5  21.0  10.0  23.0  24.0  13.0  14.0  27.0
6  25.0  14.0  27.0  28.0  17.0  18.0  31.0
7  29.0  14.0  31.0  32.0  17.0  18.0  35.0
8  33.0  18.0  35.0  36.0  21.0  22.0  39.0
9  37.0  18.0  39.0  40.0  21.0  22.0  43.0
      A     B     C     D     E     F     G     H
0   1.0   2.0   3.0   4.0   5.0   6.0   7.0  42.0
1   5.0   2.0   7.0   8.0   5.0   6.0  11.0  42.0
2   9.0   6.0  11.0  12.0   9.0  10.0  15.0  42.0
3  13.0   6.0  15.0  16.0   9.0  10.0  19.0  42.0
4  17.0  10.0  19.0  20.0  13.0  14.0  23.0  42.0
5  21.0  10.0  23.0  24.0  13.0  14.0  27.0  42.0
6  25.0  14.0  27.0  28.0  17.0  18.0  31.0  42.0
7  29.0  14.0  31.0  32.0  17.0  18.0  35.0  42.0
8  33.0  18.0  35.0  36.0  21.0  22.0  39.0  42.0
9  37.0  18.0  39.0  40.0  21.0  22.0  43.0  42.0

In [89]:
df.drop(columns = df.columns[df.nunique() == 1], inplace = True)
df

Unnamed: 0,A,B,C,D,E,F,G
0,1.0,2.0,3.0,4.0,5.0,6.0,7.0
1,5.0,2.0,7.0,8.0,5.0,6.0,11.0
2,9.0,6.0,11.0,12.0,9.0,10.0,15.0
3,13.0,6.0,15.0,16.0,9.0,10.0,19.0
4,17.0,10.0,19.0,20.0,13.0,14.0,23.0
5,21.0,10.0,23.0,24.0,13.0,14.0,27.0
6,25.0,14.0,27.0,28.0,17.0,18.0,31.0
7,29.0,14.0,31.0,32.0,17.0,18.0,35.0
8,33.0,18.0,35.0,36.0,21.0,22.0,39.0
9,37.0,18.0,39.0,40.0,21.0,22.0,43.0


> **pandas reminder**: Here’s a concise reminder of the key **pandas** syntax used in the instruction above:
>
>1. **`df.nunique()`**  
>   - Returns the number of unique values for each column in the DataFrame.
>
>2. **`df.columns[...]`**  
>   - Retrieves the column labels of the DataFrame.
>   - `df.columns[df.nunique() == 1]` selects columns where all values are the same (i.e., with only one unique value).
>
>3. **`df.drop(columns=...)`**  
>   - Drops the specified columns from the DataFrame.
>   - `inplace=False` ensures that the original DataFrame remains unchanged, returning a new modified DataFrame (`df2` in this case). 
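
The same filtering can be done inside scikit-learn, which is convenient when the step has to be part of a pipeline. This is a sketch of an alternative, not the approach used in the cells above: `VarianceThreshold` from `sklearn.feature_selection` removes features whose variance does not exceed a threshold (0 by default). The small DataFrame `df_demo` is an illustrative assumption.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df_demo = pd.DataFrame({
    "A": [1.0, 5.0, 9.0],
    "B": [2.0, 2.0, 6.0],
    "H": [42.0, 42.0, 42.0],   # constant column, zero variance
})

selector = VarianceThreshold(threshold=0.0)
selector.fit(df_demo)

kept_columns = df_demo.columns[selector.get_support()]  # boolean mask of retained features
df_reduced = df_demo[kept_columns]

print(list(kept_columns))
```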

## Categorical Data

Categorical data takes on values within a finite set of discrete classes. Such values are labels rather than quantities, so they cannot be measured on a numeric scale. Categorical features are divided into two types: **ordinal** and **nominal**. 

**Ordinal** features can be understood as categorical
values that *can be sorted or ordered*. For example, t-shirt size would be an ordinal
feature, because we can define an order: XL > L > M. 

In contrast, **nominal** features
don't imply any order and, to continue with the previous example, we could think
of t-shirt color as a nominal feature since it typically doesn't make sense to say that,
for example, red is larger than blue.

### Encoding

Before we explore different techniques for handling such categorical data, let's create a new DataFrame to illustrate the problem:

In [90]:
# Define possible S&P ratings
ratings = ["AAA", "AA", "A", "BBB", "BB", "B", "CCC", "CC", "C", "D"]

num_samples = 10

# Generate an updated synthetic dataset
df = pd.DataFrame({
    "rating": np.random.choice(ratings, num_samples),                 # Random S&P rating assignment
    "income": np.random.randint(20000, 200000, num_samples),          # Income in dollars
    "age": np.random.randint(18, 75, num_samples),                    # Age of the individual
    "employment_status": np.random.choice(["Employed", "Unemployed", "Self-Employed"], num_samples),
    "loan_amount": np.random.randint(5000, 500000, num_samples),      # Loan amount in dollars
    "default_history": np.random.choice(["Yes", "No"], num_samples, p=[0.2, 0.8])  # 20% default history
})

df

Unnamed: 0,rating,income,age,employment_status,loan_amount,default_history
0,CC,172136,49,Unemployed,385030,No
1,CC,128712,46,Unemployed,294572,No
2,D,123129,48,Employed,73211,No
3,BB,153940,21,Employed,217991,No
4,BB,34905,22,Unemployed,44238,No
5,BB,22174,59,Self-Employed,130365,No
6,C,57352,62,Employed,26513,No
7,B,121403,40,Employed,5823,Yes
8,CC,147412,41,Self-Employed,29195,No
9,C,58682,30,Unemployed,394905,Yes


> **REMINDER - FEATURES AND LABELS**
> ***
> Remember that in machine learning, you have **features** and **labels**. *The features are the **descriptive** attributes*, and *the 
> label is what you're attempting to predict or forecast*. In this simple example, **rating**, **income**, **age**, **employment_status** and **loan_amount** are **features** while 
> **default_history** is the field that contains the **label** of the corresponding record.

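To make sure that the learning algorithm interprets the ordinal features correctly, we need to convert the categorical string values into integers that respect their order. There is no way to derive the correct order of the ratings automatically, so we pass it explicitly: first to scikit-learn's `OrdinalEncoder` through its `categories` argument, and then, equivalently, as a hand-written mapping dictionary applied with pandas' `map`. 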

In [91]:
from sklearn.preprocessing import OrdinalEncoder

ratings = [["AAA", "AA", "A", "BBB", "BB", "B", "CCC", "CC", "C", "D"]]

le = OrdinalEncoder(categories=ratings)
df["rating_encoded"] = le.fit_transform(df[["rating"]])
df

Unnamed: 0,rating,income,age,employment_status,loan_amount,default_history,rating_encoded
0,CC,172136,49,Unemployed,385030,No,7.0
1,CC,128712,46,Unemployed,294572,No,7.0
2,D,123129,48,Employed,73211,No,9.0
3,BB,153940,21,Employed,217991,No,4.0
4,BB,34905,22,Unemployed,44238,No,4.0
5,BB,22174,59,Self-Employed,130365,No,4.0
6,C,57352,62,Employed,26513,No,8.0
7,B,121403,40,Employed,5823,Yes,5.0
8,CC,147412,41,Self-Employed,29195,No,7.0
9,C,58682,30,Unemployed,394905,Yes,8.0


In [92]:
rating_map = {
    "AAA": 0, "AA": 1, "A": 2, "BBB": 3, "BB": 4, 
    "B": 5, "CCC": 6, "CC": 7, "C": 8, "D": 9
}
df["rating_numeric"] = df["rating"].map(rating_map)
df

Unnamed: 0,rating,income,age,employment_status,loan_amount,default_history,rating_encoded,rating_numeric
0,CC,172136,49,Unemployed,385030,No,7.0,7
1,CC,128712,46,Unemployed,294572,No,7.0,7
2,D,123129,48,Employed,73211,No,9.0,9
3,BB,153940,21,Employed,217991,No,4.0,4
4,BB,34905,22,Unemployed,44238,No,4.0,4
5,BB,22174,59,Self-Employed,130365,No,4.0,4
6,C,57352,62,Employed,26513,No,8.0,8
7,B,121403,40,Employed,5823,Yes,5.0,5
8,CC,147412,41,Self-Employed,29195,No,7.0,7
9,C,58682,30,Unemployed,394905,Yes,8.0,8


> **Preprocessing : sklearn.preprocessing**
> 
> Commonly used classes in this module include `OneHotEncoder`, `StandardScaler`, and `MinMaxScaler`. They encode categorical features as a one-hot numeric array, standardize features, and scale each feature to a given range, respectively. The module provides many other preprocessing methods as well.

Before moving on, we tidy up the DataFrame: the next cell drops the original string column and the redundant `rating_encoded` column, renames `rating_numeric` to `rating`, and moves it back to the first position:

In [93]:
df2 = df.drop(columns=["rating", "rating_encoded"], inplace=False)
df2.rename(columns={"rating_numeric":"rating"}, inplace=True)

cols = ["rating"] + [col for col in df2.columns if col != "rating"]
df2 = df2[cols]
df2

Unnamed: 0,rating,income,age,employment_status,loan_amount,default_history
0,7,172136,49,Unemployed,385030,No
1,7,128712,46,Unemployed,294572,No
2,9,123129,48,Employed,73211,No
3,4,153940,21,Employed,217991,No
4,4,34905,22,Unemployed,44238,No
5,4,22174,59,Self-Employed,130365,No
6,8,57352,62,Employed,26513,No
7,5,121403,40,Employed,5823,Yes
8,7,147412,41,Self-Employed,29195,No
9,8,58682,30,Unemployed,394905,Yes
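
A third way to encode an ordinal feature, shown here only as a sketch, is a pandas ordered categorical. It uses the same rating order as the cells above; the sample values, including the unknown rating `"NR"`, are illustrative assumptions. Values that are not in the category list receive the code `-1`, which makes unexpected categories easy to detect.

```python
import pandas as pd

rating_order = ["AAA", "AA", "A", "BBB", "BB", "B", "CCC", "CC", "C", "D"]
ratings_sample = pd.Series(["CC", "D", "BB", "AAA", "NR"])  # "NR" is not a known rating

# Ordered categorical: the position in rating_order becomes the integer code
cat = pd.Categorical(ratings_sample, categories=rating_order, ordered=True)
codes = pd.Series(cat.codes, index=ratings_sample.index)

print(codes.tolist())
```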


### Encoding Class Labels

Many machine learning libraries require that class labels are encoded as integer
values. Although most estimators for classification in scikit-learn convert class
labels to integers internally, it is considered good practice to provide class labels as
integer arrays to avoid technical glitches. To encode the class labels, we can use an
approach similar to the mapping of ordinal features discussed previously. We need
to remember that class labels are not ordinal, and it doesn't matter which integer
number we assign to a particular string label. Thus, we can simply enumerate
the class labels, starting at 0:

<div style = 'background-color:skyblue'>
    <strong>Python Pills</strong>
    <p>
    enumerate() function in Python
    </p>    
    <p>
    The enumerate() function adds a counter to an iterable and returns it as an enumerate object. This object can be used directly in for loops or converted into a list of tuples using list().
    </p>        
</div>

In [94]:
import numpy as np


class_mapping = {label: idx for idx, label in enumerate(np.unique(df2['default_history']))}
class_mapping

{'No': 0, 'Yes': 1}

Next, we can use the mapping dictionary to transform the class labels into integers:

In [95]:
df2['default_history'] = df2['default_history'].map(class_mapping)
df2

Unnamed: 0,rating,income,age,employment_status,loan_amount,default_history
0,7,172136,49,Unemployed,385030,0
1,7,128712,46,Unemployed,294572,0
2,9,123129,48,Employed,73211,0
3,4,153940,21,Employed,217991,0
4,4,34905,22,Unemployed,44238,0
5,4,22174,59,Self-Employed,130365,0
6,8,57352,62,Employed,26513,0
7,5,121403,40,Employed,5823,1
8,7,147412,41,Self-Employed,29195,0
9,8,58682,30,Unemployed,394905,1


We can reverse the key-value pairs in the mapping dictionary as follows to map the
converted class labels back to the original string representation:

In [96]:
inv_class_mapping = {v: k for k, v in class_mapping.items()}
df2['default_history'] = df2['default_history'].map(inv_class_mapping)
df2

Unnamed: 0,rating,income,age,employment_status,loan_amount,default_history
0,7,172136,49,Unemployed,385030,No
1,7,128712,46,Unemployed,294572,No
2,9,123129,48,Employed,73211,No
3,4,153940,21,Employed,217991,No
4,4,34905,22,Unemployed,44238,No
5,4,22174,59,Self-Employed,130365,No
6,8,57352,62,Employed,26513,No
7,5,121403,40,Employed,5823,Yes
8,7,147412,41,Self-Employed,29195,No
9,8,58682,30,Unemployed,394905,Yes


Alternatively, there is a convenient LabelEncoder class directly implemented in
scikit-learn to achieve this:

In [97]:
from sklearn.preprocessing import LabelEncoder

class_le = LabelEncoder()
y = class_le.fit_transform(df2['default_history'].values)
y

array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1])
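
A fitted `LabelEncoder` also remembers the mapping it learned, so the integer labels can be turned back into strings without building an inverse dictionary by hand. The following minimal sketch uses a made-up list of labels; `classes_` and `inverse_transform` are part of the `LabelEncoder` API.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["No", "No", "Yes", "No", "Yes"]  # illustrative class labels

class_le = LabelEncoder()
y = class_le.fit_transform(labels)          # strings -> integers (sorted order of the classes)
original = class_le.inverse_transform(y)    # integers -> original strings

print(class_le.classes_)  # position in this array is the integer code
print(y)
print(original)
```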

### Ordinal Encoding vs Label Encoding: Key Differences

Both **Ordinal Encoding** and **Label Encoding** transform categorical data into numerical values, but they serve different purposes and have distinct behaviors. 

**1. Ordinal Encoding (`OrdinalEncoder`)**

**Concept**:  

- Each unique category is **mapped to an integer** based on a specific order.
- **It preserves the order** of the categories.

**Example: Credit Ratings**

Here the order is passed explicitly, from the lowest rating (CCC) to the highest (AAA):

| Rating | Ordinal Encoding |
|---------|----------------|
| CCC     | 0              |
| B       | 1              |
| BB      | 2              |
| BBB     | 3              |
| A       | 4              |
| AA      | 5              |
| AAA     | 6              |

```python
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

data = pd.DataFrame({'rating': ['AAA', 'BB', 'A', 'CCC', 'AA']})

encoder = OrdinalEncoder(categories=[['CCC', 'B', 'BB', 'BBB', 'A', 'AA', 'AAA']])
data['rating_encoded'] = encoder.fit_transform(data[['rating']])

print(data)
```

**When to Use Ordinal Encoding?**

✔ When the categorical variable has an **intrinsic order** (e.g., credit rating, education level, survey ratings). 

❌ Not ideal for unordered categorical variables like `color`, `city`, or `car brands`.

**2. Label Encoding (`LabelEncoder`)**

**Concept**:

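- Assigns **a unique integer** to each distinct value, in sorted (alphabetical) order of the values rather than in any meaningful order.
- The numbers **do not represent any ranking**; they are just arbitrary identifiers.
- In scikit-learn, `LabelEncoder` is designed for the **target labels** `y` (a 1D array), not for the input features `X`.

**Example: Car Brands**

| Car Brand | Label Encoding |
|-----------|---------------|
| BMW       | 0             |
| Ford      | 1             |
| Tesla     | 2             |
| Toyota    | 3             |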

```python
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({'car_brand': ['Toyota', 'Ford', 'BMW', 'Tesla', 'Ford']})

encoder = LabelEncoder()
data['brand_encoded'] = encoder.fit_transform(data['car_brand'])

print(data)
```

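**When to Use Label Encoding?**

✔ When encoding **class labels** (the target `y`), where the integer assigned to each class carries no meaning.  

❌ Not suitable for ordinal variables (e.g., credit ratings): the codes follow alphabetical order, so they generally do **not match the true ranking** of the categories. For nominal input features, integer codes can also **mislead models into assuming an order that does not exist**.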

**Key Differences**

| Feature          | Ordinal Encoding | Label Encoding |
|-----------------|----------------|---------------|
| **Preserves Order?** | ✅ Yes (the order you specify) | ❌ No (codes follow alphabetical order) |
| **Use Case** | Ordered categorical features (e.g., credit rating, survey responses) | Class labels, i.e. the target `y` |
| **Assigns Numeric Values?** | ✅ Yes | ✅ Yes |
| **Numbers Represent Ranking?** | ✅ Yes | ❌ No |
| **Risk of Misinterpretation?** | 🚨 If order is incorrect | 🚨 If used on ordinal data or on nominal input features |
| **Scikit-Learn Class** | `OrdinalEncoder` | `LabelEncoder` |


**When to Avoid These Encodings?**

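If the categorical variable is **nominal** (no order), the integer codes produced by either method impose an artificial order that many models will treat as meaningful. The problem grows when the variable has **many unique values**, because the codes then span a wide numeric range with no real interpretation.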

### One-hot Encoding

When there is no natural order, we have to resort to a different approach called **one-hot encoding**. The idea behind this approach is to create a new dummy feature for each unique value in the nominal feature column. Here, we would convert the `employment_status` feature into three new features: *employed*, *self_employed*, and *unemployed*. Binary values can then be used to indicate the particular employment status of an example; for example, an employed customer can be encoded as *employed=1, self_employed=0, unemployed=0*. To perform this transformation, we can use the `OneHotEncoder` that is implemented in `scikit-learn`'s preprocessing module:

In [98]:
from sklearn.preprocessing import OneHotEncoder

X = df2[['employment_status']].values  # 2D array with a single column
status_ohe = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() converts it to a dense array
status_ohe.fit_transform(X).toarray()

array([[0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

Another way to create those dummy features via one-hot encoding is to use the `get_dummies` function implemented in pandas. Applied to a DataFrame, `get_dummies` converts only the string (object or categorical) columns and leaves all other columns unchanged. Recent pandas versions return boolean dummy columns, as in the output below:

In [99]:
pd.get_dummies(df2[['employment_status']])

Unnamed: 0,employment_status_Employed,employment_status_Self-Employed,employment_status_Unemployed
0,False,False,True
1,False,False,True
2,True,False,False
3,True,False,False
4,False,False,True
5,False,True,False
6,True,False,False
7,True,False,False
8,False,True,False
9,False,False,True


## Feature Normalization

Many machine learning algorithms require that the selected features are on the same scale for optimal performance. This process is called "Feature Normalization" and is the subject of this section.

Data normalization is a common practice in machine learning that consists of transforming numeric columns to a common scale. In practice, the values of some features can be many times larger than those of others, and the features with larger values will dominate the learning process. However, this does not mean that those variables are more important for predicting the outcome. Data normalization brings features measured on different scales onto the same scale. After normalization, all variables have a similar influence on the model, improving the stability and performance of the learning algorithm.

There are multiple normalization techniques in statistics. In this notebook, we will cover the most important ones:

- The maximum absolute scaling
- The min-max feature scaling
- The z-score method

### The maximum absolute scaling

The maximum absolute scaling rescales each feature between -1 and 1 by dividing every observation by its maximum absolute value.

$$
x_{new} = \frac{x_{old}}{\max \vert x_{old} \vert}
$$
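
As a quick check of the formula, the sketch below applies maximum absolute scaling both by hand with NumPy and with scikit-learn's `MaxAbsScaler`. The input matrix is an illustrative assumption; each column is divided by its own maximum absolute value.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X = np.array([[1.0, -20.0],
              [2.0, 10.0],
              [-4.0, 5.0]])

# Manual version of the formula: divide each column by its maximum absolute value
X_manual = X / np.abs(X).max(axis=0)

# scikit-learn version
scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)

print(scaler.max_abs_)  # per-column maximum absolute values
print(X_scaled)
```

Because the data is only divided, never shifted, zeros stay zeros and the signs of the values are preserved.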

### The min-max feature scaling

The min-max approach (often called normalization) rescales the feature to a fixed range of [0,1] by subtracting the minimum value of the feature and then dividing by the range:

$$
x_{new} = \frac{x_{old}-x_{min}}{x_{max}-x_{min}}
$$
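
The formula can be verified by hand on a single column. The sketch below uses ten salary values (in thousands) taken from the salary data of this notebook and applies the min-max formula directly with NumPy; the smallest value maps to 0 and the largest to 1.

```python
import numpy as np

salary = np.array([135.0, 105.0, 105.0, 220.0, 300.0,
                   270.0, 265.0, 260.0, 240.0, 265.0])

# x_new = (x_old - x_min) / (x_max - x_min)
salary_minmax = (salary - salary.min()) / (salary.max() - salary.min())

print(np.round(salary_minmax, 6))
```

For example, the first value becomes (135 - 105) / (300 - 105) = 30 / 195.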

The min-max scaling procedure is implemented in scikit-learn and can be used as
follows:

In [113]:
#
# Here we have to load the file 'salary_vs_age_1.csv'
#
if 'google.colab' in str(get_ipython()):
    from google.colab import files
    uploaded = files.upload()
    path = ''
else:
    path = './data/'

In [114]:
# Load the Pandas libraries with alias 'pd' 
import pandas as pd 
# Read data from file 'salary_vs_age_1.csv' 
# (in the same directory that your python process is based)
# Control delimiters, with read_table 
df1 = pd.read_table(path + "salary_vs_age_1.csv", sep=";") 
# Preview the first 5 lines of the loaded data 
print(df1.head())

columns_titles = ["Salary","Age"]
df2=df1.reindex(columns=columns_titles)
df2

df2['Salary'] = df2['Salary']/1000 
df2['Age2']=df2['Age']**2
df2['Age3']=df2['Age']**3
df2['Age4']=df2['Age']**4
df2['Age5']=df2['Age']**5
df2

   Age  Salary
0   25  135000
1   27  105000
2   30  105000
3   35  220000
4   40  300000


Unnamed: 0,Salary,Age,Age2,Age3,Age4,Age5
0,135.0,25,625,15625,390625,9765625
1,105.0,27,729,19683,531441,14348907
2,105.0,30,900,27000,810000,24300000
3,220.0,35,1225,42875,1500625,52521875
4,300.0,40,1600,64000,2560000,102400000
5,270.0,45,2025,91125,4100625,184528125
6,265.0,50,2500,125000,6250000,312500000
7,260.0,55,3025,166375,9150625,503284375
8,240.0,60,3600,216000,12960000,777600000
9,265.0,65,4225,274625,17850625,1160290625


In [115]:
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
df3 = pd.DataFrame(mms.fit_transform(df2))
df3

Unnamed: 0,0,1,2,3,4,5
0,0.153846,0.0,0.0,0.0,0.0,0.0
1,0.0,0.05,0.028889,0.015668,0.008065,0.003984
2,0.0,0.125,0.076389,0.043919,0.024019,0.012633
3,0.589744,0.25,0.166667,0.105212,0.063574,0.037162
4,1.0,0.375,0.270833,0.186776,0.124248,0.080515
5,0.846154,0.5,0.388889,0.291506,0.212486,0.151898
6,0.820513,0.625,0.520833,0.422297,0.335588,0.263127
7,0.794872,0.75,0.666667,0.582046,0.501718,0.428951
8,0.692308,0.875,0.826389,0.773649,0.719895,0.667377
9,0.820513,1.0,1.0,1.0,1.0,1.0
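
The first column of the scaled output can be checked by hand against the min-max formula. The sketch below copies the `Salary` values (in thousands) from the table above into a NumPy array and applies the formula directly; the variable names are chosen only for this illustration.

```python
import numpy as np

# Salary column (in thousands), copied from the table above
salary = np.array([135.0, 105.0, 105.0, 220.0, 300.0,
                   270.0, 265.0, 260.0, 240.0, 265.0])

# Min-max formula: subtract the minimum, then divide by the range
salary_minmax = (salary - salary.min()) / (salary.max() - salary.min())

print(np.round(salary_minmax, 6))
```

For the first row this gives (135 - 105) / (300 - 105) = 30 / 195 ≈ 0.153846, which agrees with the `MinMaxScaler` result.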


### Z-Score

The **z-score** method (often called **standardization**) transforms the data into a distribution with a mean of 0 and a standard deviation of 1. Each standardized value is computed by subtracting the mean of the corresponding feature and then dividing by the standard deviation.

$$
x_{new} = \frac{x_{old} - \mu}{\sigma}
$$

Unlike min-max scaling, the z-score does not rescale the feature to a fixed range. If the input is normally distributed, more than 99% of the z-scores fall between -3 and 3.

It is important to bear in mind that z-scores are not necessarily normally distributed. They just scale the data and follow the same distribution as the original input. This transformed distribution has a mean of 0 and a standard deviation of 1 and is going to be the standard normal distribution only if the input feature follows a normal distribution.
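
To illustrate that standardization changes location and scale but not the shape of the distribution, the following sketch standardizes a synthetic, strongly right-skewed sample and compares the skewness before and after. The exponential sample and the seed are arbitrary choices made for this example.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample (assumption: exponential data, fixed seed)
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=10_000)

# Standardize: the result has mean 0 and standard deviation 1
z = stats.zscore(x)

# Skewness is unaffected by shifting and rescaling
skew_before = stats.skew(x)
skew_after = stats.skew(z)

print(f"mean: {z.mean():.3f}, std: {z.std():.3f}")
print(f"skewness before: {skew_before:.3f}, after: {skew_after:.3f}")
```

The standardized sample is still skewed, so it is not a standard normal distribution even though its mean is 0 and its standard deviation is 1.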

Standardization can easily be achieved by using the built-in NumPy methods mean
and std:

In [116]:
import numpy as np

X = np.array([6, 7, 7, 12, 13, 13, 15, 16, 19, 22])

# Subtract the mean and divide by the (population) standard deviation;
# note that np.std uses ddof=0 by default
X_std = (X - X.mean()) / X.std()

print(X_std)

[-1.39443338 -1.19522861 -1.19522861 -0.19920477  0.          0.
  0.39840954  0.5976143   1.19522861  1.79284291]


Alternatively, use the `zscore` function from SciPy's `stats` module:

In [117]:
import scipy.stats as stats

stats.zscore(X)

array([-1.39443338, -1.19522861, -1.19522861, -0.19920477,  0.        ,
        0.        ,  0.39840954,  0.5976143 ,  1.19522861,  1.79284291])

Standardization is particularly useful for models trained with gradient descent. When the
features are on comparable scales, the optimizer typically needs fewer steps to reach a good
solution (for a convex cost function, the global cost minimum).
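
One way to make this concrete, without running an optimizer, is to look at the condition number of the Hessian of a least-squares cost: the larger it is, the more elongated the cost surface and the slower plain gradient descent tends to converge. The sketch below is only an illustration under stated assumptions: a linear model without intercept, the `Age` and `Age2` features from the table above, and a helper function `hessian_condition_number` introduced here for the example.

```python
import numpy as np

# Age and Age^2 features, as in the salary example above
age = np.array([25, 27, 30, 35, 40, 45, 50, 55, 60, 65], dtype=float)
X_raw = np.column_stack([age, age**2])

# Standardize each column (mean 0, standard deviation 1)
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

def hessian_condition_number(X):
    """Condition number of X^T X / n, which is proportional to the
    Hessian of the mean squared error of a linear model."""
    hessian = X.T @ X / X.shape[0]
    return np.linalg.cond(hessian)

cond_raw = hessian_condition_number(X_raw)
cond_std = hessian_condition_number(X_std)

print(f"raw features:          {cond_raw:.0f}")
print(f"standardized features: {cond_std:.0f}")
```

The standardized features give a much smaller condition number. It is still well above 1 because `Age` and `Age2` remain strongly correlated, which scaling alone does not remove.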

Similar to the MinMaxScaler class, scikit-learn also implements a class for
standardization:

In [118]:
from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
df4 = pd.DataFrame(stdsc.fit_transform(df2))
df4

Unnamed: 0,0,1,2,3,4,5
0,-1.170242,-1.359724,-1.189131,-1.041783,-0.920815,-0.824435
1,-1.601005,-1.210304,-1.102065,-0.994071,-0.895974,-0.812022
2,-1.601005,-0.986174,-0.958907,-0.908042,-0.846835,-0.785069
3,0.050256,-0.612623,-0.686823,-0.721391,-0.725003,-0.70863
4,1.198959,-0.239072,-0.37288,-0.473014,-0.538122,-0.573535
5,0.768195,0.134478,-0.017078,-0.154092,-0.266345,-0.351091
6,0.696401,0.508029,0.380582,0.244194,0.11282,-0.00448
7,0.624608,0.881579,0.820102,0.730661,0.624511,0.51226
8,0.337432,1.25513,1.301481,1.314127,1.296512,1.255243
9,0.696401,1.628681,1.824719,2.003411,2.159252,2.29176
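
A detail worth keeping in mind when comparing results across libraries: `StandardScaler` and `np.std` divide by the population standard deviation (`ddof=0`), whereas the pandas `std` method uses the sample standard deviation (`ddof=1`) by default. The sketch below reproduces the first column of the output above by hand; the `Salary` values are copied from the earlier table and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Salary column (in thousands), copied from the table above
salary = pd.Series([135.0, 105.0, 105.0, 220.0, 300.0,
                    270.0, 265.0, 260.0, 240.0, 265.0], name="Salary")

# Population standard deviation (ddof=0): the convention used by StandardScaler
z_population = (salary - salary.mean()) / salary.std(ddof=0)

# Sample standard deviation (pandas default, ddof=1): slightly smaller z-scores
z_sample = (salary - salary.mean()) / salary.std()

print(z_population.round(6).tolist())
print(z_sample.round(6).tolist())
```

With only ten observations the two conventions differ visibly; the gap shrinks as the number of samples grows.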


## Exercises

### Chocolate Bar Ratings

**Context**

Chocolate is one of the most popular candies in the world. Each year, residents of the United States collectively eat more than 2.8 billion pounds. However, not all chocolate bars are created equal! This dataset contains expert ratings of over 1,700 individual chocolate bars, along with information on their regional origin, percentage of cocoa, the variety of chocolate bean used, and where the beans were grown.

**Flavors of Cacao Rating System**:

- 5 = Elite (Transcending beyond the ordinary limits)
- 4 = Premium (Superior flavor development, character and style)
- 3 = Satisfactory (3.0) to praiseworthy (3.75) (well made with special qualities)
- 2 = Disappointing (Passable but contains at least one significant flaw)
- 1 = Unpleasant (mostly unpalatable)

**Link**

https://www.kaggle.com/rtatman/chocolate-bar-ratings

**Problem:** 

Download the `csv` file from the Kaggle web page above, load it, and take a first look at the data.


**Answer:**

<details>

First, import the pandas library, then define a variable `path` (the folder in which you saved the csv file), and finally load the file with the pandas function `read_csv`. Use the method `head()` to look at the first rows:
    
```python
import pandas as pd
    
path = './data'
df = pd.read_csv(path + "/flavors_of_cacao.csv")
df.head()
```
</details>

In [119]:
# put here your code


**Problem:** 

Rename the columns to:

- "Company"
- "Spec_Bean_Origin_or_Bar_Name"
- "Review_Date"
- "Cocoa_Percent"
- "Company_Location"
- "Bean_Type"
- "Broad_Bean_Origin"

**Answer:**

<details>

A possible solution is to pass a dictionary to `rename`. Note that column names read from a file sometimes contain unexpected characters, in particular the non-breaking space '\xa0', which must appear in the dictionary key exactly as in the original name, as in this example. This is a common issue with pandas data frames; see for example https://stackoverflow.com/questions/55442727/remove-unicode-xa0-from-pandas-column

```python
df = df.rename(columns={"Company\xa0\n(Maker-if known)": "Company",
                        "Specific Bean Origin\nor Bar Name": "Spec_Bean_Origin_or_Bar_Name",
                        "Review\nDate": "Review_Date",
                        "Cocoa\nPercent": "Cocoa_Percent",
                        "Company\nLocation": "Company_Location",
                        "Bean\nType": "Bean_Type",
                        "Broad Bean\nOrigin": "Broad_Bean_Origin"
                       })
```
</details>

In [120]:
# put here your code


**Problem:** 

Use the pandas data frame method `info()` to quickly check which data types are present and whether any data is missing. Do you notice anything strange?

**Answer:**

<details>
According to `info()`, the features `Broad_Bean_Origin` and `Bean_Type` each contain only one missing value out of 1795 samples. However, the data frame head shows that the first five rows of `Bean_Type` appear empty, so they should also be counted as missing values.

Since we do not know exactly what the first `Bean_Type` entry contains, we can fetch it, inspect its value, and use it as the marker for the entries to replace with NaN.

```python
import numpy as np

# The first Bean_Type entry looks empty but is not NaN:
# use its value as the marker for missing data
missing_val_indication_bean_type = df.Bean_Type.values[0]

def replace_with_nan(missing_val_indication, current_val):
    """Return NaN if the value equals the missing-value marker."""
    if current_val == missing_val_indication:
        return np.nan
    return current_val

# replace missing values of Bean_Type with np.nan
df["Bean_Type"] = df["Bean_Type"].apply(
    lambda x: replace_with_nan(missing_val_indication_bean_type, x))
```
    
</details>
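
The reason `info()` misses these entries is that a string made only of whitespace (for example the non-breaking space '\xa0') is a valid, non-null value for pandas. The following sketch shows the effect on a small made-up column; the data frame `toy` and its values are hypothetical and only mimic the situation described above.

```python
import pandas as pd

# Hypothetical column: two whitespace-only entries and one real missing value
toy = pd.DataFrame({"Bean_Type": ["\xa0", "Criollo", "\xa0", "Trinitario", None]})

# Only the real missing value is detected
missing_before = int(toy["Bean_Type"].isna().sum())

# Strip whitespace (including '\xa0') and turn empty strings into NaN
stripped = toy["Bean_Type"].str.strip()
cleaned = stripped.mask(stripped == "")
missing_after = int(cleaned.isna().sum())

print(missing_before, missing_after)
```

Stripping whitespace before testing for empty strings avoids having to know in advance which invisible character is used as the placeholder.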

**Problem:**

Find all categorical features.

**Answer:**

<details>
    
```python    
# get list of categorical features
# (use the builtin `object`: the alias `np.object` was removed in NumPy 1.24)
list_categorical_cols = list(df.columns[df.dtypes == object])
``` 
    
</details>    

In [121]:
# put your code here


**Problem:**

Find all numerical features.

**Answer:**

<details>
    
```python    
# get list of numerical features
# (use the builtin `object`: the alias `np.object` was removed in NumPy 1.24)
list_numerical_cols = list(df.columns[df.dtypes != object])
```

</details>

In [122]:
# put your code here


### Cleaning Data with Pandas

**Problem:**

Try to produce the following dataframe

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>numbers</th>
      <th>nums</th>
      <th>colors</th>
      <th>other_column</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>#23</td>
      <td>23</td>
      <td>green</td>
      <td>0</td>
    </tr>
    <tr>
      <th>1</th>
      <td>#24</td>
      <td>24</td>
      <td>red</td>
      <td>1</td>
    </tr>
    <tr>
      <th>2</th>
      <td>#18</td>
      <td>18</td>
      <td>yellow</td>
      <td>0</td>
    </tr>
    <tr>
      <th>3</th>
      <td>#14</td>
      <td>14</td>
      <td>orange</td>
      <td>2</td>
    </tr>
    <tr>
      <th>4</th>
      <td>#12</td>
      <td>NaN</td>
      <td>purple</td>
      <td>1</td>
    </tr>
    <tr>
      <th>5</th>
      <td>#10</td>
      <td>XYZ</td>
      <td>blue</td>
      <td>0</td>
    </tr>
    <tr>
      <th>6</th>
      <td>#35</td>
      <td>35</td>
      <td>pink</td>
      <td>2</td>
    </tr>
  </tbody>
</table>

**Answer:**

<details>
    
```python    
df = pd.DataFrame({"numbers": ["#23", "#24", "#18", "#14", "#12", "#10", "#35"],
                   "nums": ["23", "24", "18", "14", np.nan, "XYZ", "35"],
                   "colors": ["green", "red", "yellow", "orange", "purple", "blue", "pink"],
                   "other_column": [0, 1, 0, 2, 1, 0, 2]})
df
```

</details>

In [123]:
# put your code here

**Problem:**

What happens if we try to compute the mean of the `numbers` column?

**Answer:**

<details>
    
```python    
df["numbers"].mean()
```

</details>

In [124]:
# put your code here


**Problem:**

Did the previous computation work as expected? If not, why? How can you fix the error?

**Answer:**

<details>
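The mean cannot be computed because the `numbers` column holds strings such as '#23' rather than numbers (recent pandas versions raise a `TypeError`). You first have to convert these strings into numbers, for example by removing the leading '#' and casting the result to a numeric type.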
</details>    
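
A possible way to carry out this conversion is sketched below. It rebuilds the data frame from the exercise, strips the leading '#' from `numbers`, and, as an additional illustration, converts the `nums` column with `pd.to_numeric`, letting unparseable entries such as "XYZ" become NaN. The names `numbers_clean` and `nums_clean` are introduced only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"numbers": ["#23", "#24", "#18", "#14", "#12", "#10", "#35"],
                   "nums": ["23", "24", "18", "14", np.nan, "XYZ", "35"],
                   "colors": ["green", "red", "yellow", "orange", "purple", "blue", "pink"],
                   "other_column": [0, 1, 0, 2, 1, 0, 2]})

# Remove the leading '#' and cast the remaining digits to integers
numbers_clean = df["numbers"].str.lstrip("#").astype(int)
mean_numbers = numbers_clean.mean()

# Convert 'nums'; values that cannot be parsed (e.g. "XYZ") become NaN
nums_clean = pd.to_numeric(df["nums"], errors="coerce")
mean_nums = nums_clean.mean()  # NaN values are skipped by default

print(mean_numbers, mean_nums)
```

Using `errors="coerce"` is a design choice: it keeps the rows but marks invalid entries as missing, so they can be handled later with the imputation or removal strategies discussed earlier.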

## Appendix

A more realistic dataset for the credit risk example:

In [112]:
# Define rating categories and their default probabilities
rating_categories = {
    "AAA": {"default_prob": 0.01, "income_range": (100000, 200000), "loan_range": (50000, 300000)},
    "AA": {"default_prob": 0.02, "income_range": (90000, 180000), "loan_range": (40000, 250000)},
    "A": {"default_prob": 0.03, "income_range": (80000, 160000), "loan_range": (35000, 200000)},
    "BBB": {"default_prob": 0.05, "income_range": (60000, 140000), "loan_range": (30000, 150000)},
    "BB": {"default_prob": 0.10, "income_range": (40000, 120000), "loan_range": (20000, 100000)},
    "B": {"default_prob": 0.15, "income_range": (30000, 100000), "loan_range": (15000, 80000)},
    "CCC": {"default_prob": 0.25, "income_range": (25000, 80000), "loan_range": (10000, 60000)},
    "CC": {"default_prob": 0.35, "income_range": (20000, 70000), "loan_range": (8000, 40000)},
    "C": {"default_prob": 0.50, "income_range": (15000, 50000), "loan_range": (5000, 20000)},
    "D": {"default_prob": 0.80, "income_range": (10000, 30000), "loan_range": (2000, 10000)},
}

# Generate the dataset with improved consistency
num_samples = 10
ratings = list(rating_categories.keys())

df_consistent_credit_risk = pd.DataFrame()

# Generate data row by row ensuring consistency
for _ in range(num_samples):
    rating = np.random.choice(ratings)  # Select a credit rating
    rating_info = rating_categories[rating]

    income = np.random.randint(*rating_info["income_range"])  # Income based on rating
    loan_amount = np.random.randint(*rating_info["loan_range"])  # Loan based on rating
    age = np.random.randint(18, 75)  # Random age
    employment_status = np.random.choice(["Employed", "Unemployed", "Self-Employed"])

    # Default history based on rating's probability
    default_history = np.random.choice(["Yes", "No"], p=[rating_info["default_prob"], 1 - rating_info["default_prob"]])

    # Append row to DataFrame
    df_consistent_credit_risk = pd.concat([df_consistent_credit_risk, 
        pd.DataFrame([[rating, income, age, employment_status, loan_amount, default_history]], 
                     columns=["rating", "income", "age", "employment_status", "loan_amount", "default_history"])])

# Reset index
df_consistent_credit_risk.reset_index(drop=True, inplace=True)

df_consistent_credit_risk

Unnamed: 0,rating,income,age,employment_status,loan_amount,default_history
0,AA,105111,71,Employed,44245,No
1,AAA,176186,36,Self-Employed,89675,No
2,AAA,160161,23,Employed,207251,No
3,BB,106071,63,Self-Employed,53904,No
4,C,32744,71,Unemployed,14698,No
5,AAA,104120,65,Employed,165566,No
6,CCC,78300,34,Employed,30290,No
7,D,15262,19,Employed,8939,Yes
8,AA,169092,30,Unemployed,142462,No
9,CC,49753,29,Self-Employed,27251,Yes


## References and Credits

**WEB**

**Abhyankar Ameya**, "*Exploring Risk Analytics using PCA with Python*", [Medium](https://abhyankar-ameya.medium.com/exploring-risk-analytics-using-pca-with-python-3aca369cbfe4). Data files for the interest rate example and further details about the Python code can be downloaded from the author's GitHub repository [here](https://github.com/Ameya1983/TheAlchemist).