## **Chapter 2:** Data Processing

**Data preprocessing** is a critical step in the machine learning pipeline. It involves transforming raw data into a format that is more suitable for modeling. This chapter will delve into essential preprocessing techniques using scikit-learn.

<img src=https://i.stack.imgur.com/jEULG.png height=400>


### [2.1 Introduction to the dataset](#14-iris-dataset)

The *Iris dataset* is one of the most well-known datasets in the field of machine learning and statistics. It's often used as a simple, accessible example for teaching data visualization, machine learning, and data preprocessing techniques. The [dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) was introduced by the British biologist [Ronald Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher) in 1936 as an example of discriminant analysis.

```python 
# Load the data from sklearn
from sklearn import datasets
iris = datasets.load_iris()
```

When you load the Iris dataset using `load_iris()` from `sklearn.datasets`, you receive a dataset that consists of `150` samples of iris flowers. These samples are divided evenly across three different species of iris (`50` samples from each of `Setosa`, `Versicolor`, and `Virginica`). 

<img src=https://miro.medium.com/v2/resize:fit:1400/1*ZK9_HrpP_lhSzTq9xVJUQw.png height=200>

For each sample, the dataset includes four features (or `measurements`):

1. **Sepal Length:** The length of the sepal (the part of the flower that encases the bud before it blooms) in centimeters.
2. **Sepal Width:** The width of the sepal in centimeters.
3. **Petal Length:** The length of the petal (the colorful, often bright part of the flower) in centimeters.
4. **Petal Width:** The width of the petal in centimeters.

These measurements make up the feature matrix `X`, which is a `150x4` NumPy array (`np.ndarray`) where each row corresponds to a flower sample, and each column corresponds to one of the four features mentioned above.

In [None]:
from sklearn import datasets

# Load the Iris dataset
iris = datasets.load_iris()
x_iris = iris.data
y_iris = iris.target


print(f"{x_iris.shape = }")
print(f"{y_iris.shape = }")

The target variable `y` is a 1-dimensional array of size `150`, where each entry is an integer representing the species of the corresponding flower in the feature matrix. The species are encoded as `0`, `1`, and `2`, which correspond to `Setosa`, `Versicolor`, and `Virginica`, respectively. 
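
To make the integer encoding concrete, the following minimal sketch prints the names that `load_iris()` stores next to the numeric arrays and counts the samples per label. The variable name `class_counts` is only chosen for this illustration.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Human-readable names stored alongside the numeric arrays
print(iris.feature_names)
print(iris.target_names)

# Count how many samples carry each integer label (0, 1, 2)
class_counts = np.bincount(iris.target)
for label, name in enumerate(iris.target_names):
    print(f"{label} -> {name}: {class_counts[label]} samples")
```

The position of a name in `iris.target_names` is the integer used for that species in `y`.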

The primary use of the Iris dataset in machine learning is for classification tasks, where the goal is to predict the species of an iris flower based on the measurements of its petals and sepals. Because of its simplicity and small size, it's particularly useful for demonstrating the basics of machine learning techniques, from data preprocessing to building various types of models, such as decision trees, support vector machines, or neural networks. 

In this course, we will use `iris` to introduce many tools from `scikit-learn`.

### [2.2 Data preprocessing with sklearn](#22-data-preprocessing-with-sklearn)

Data preprocessing is an essential stage in the machine learning pipeline, involving several crucial steps to make raw data more suitable for building models. In this chapter, we explore these steps using scikit-learn. 

Please note that sklearn provides just *one* way to solve these challenges. You can often use `numpy` or `pandas` to solve these tasks in a similar fashion.

#### Handling Missing Values with `SimpleImputer` 

Missing values in a dataset can lead to inaccurate models. **Scikit-learn**'s `SimpleImputer` offers a convenient way of dealing with this issue by allowing us to replace missing values with a specified placeholder. 

For example, we can replace missing numerical values with the mean of the remaining non-missing values in each column.


In [None]:
import numpy as np
from sklearn.impute import SimpleImputer


# Example dataset with missing values
x = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])

# Using SimpleImputer to replace missing values with the mean of the column
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(x)

X_imputed
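
As a rough illustration of what mean imputation does under the hood, the sketch below reproduces the result with plain `numpy` and compares it to `SimpleImputer`. The names `x_toy`, `col_means`, `x_manual`, and `x_sklearn` are made up for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same toy data as in the cell above
x_toy = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])

# Column means computed while ignoring the NaN entries
col_means = np.nanmean(x_toy, axis=0)

# Replace every NaN with the mean of its own column (broadcast across rows)
x_manual = np.where(np.isnan(x_toy), col_means, x_toy)

# Compare against scikit-learn's imputer
x_sklearn = SimpleImputer(strategy="mean").fit_transform(x_toy)

print(col_means)
print(np.allclose(x_manual, x_sklearn))
```

The advantage of `SimpleImputer` over the manual version is that the fitted means are stored on the object, so the same values can later be applied to new data with `transform`.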

**Iris:** While the Iris dataset doesn't have missing values, let's simulate this scenario to demonstrate handling them:

In [None]:
import numpy as np
from sklearn.impute import SimpleImputer


# Simulate missing values in the first feature
x_missing = x_iris.copy()
x_missing[::10, 0] = np.nan

imputer = SimpleImputer(strategy='mean')
x_imputed = imputer.fit_transform(x_missing)

x_imputed

#### Feature Scaling via `StandardScaler`

Many machine learning algorithms perform better when numerical input features are on a comparable scale. `StandardScaler` is a tool in scikit-learn that standardizes each feature by subtracting its mean and dividing by its standard deviation, so that every column ends up with zero mean and unit variance.


In [None]:
from sklearn.preprocessing import StandardScaler

# Dataset
x = [[0, 15], [1, -10], [2, 20]]

# Standardizing the features
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)

x_scaled
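
Standardization computes $z = (x - \mu) / \sigma$ for every column, where $\mu$ is the column mean and $\sigma$ its standard deviation. The following sketch checks this by hand on the toy data. The variable names are illustrative, and the comparison assumes the default `StandardScaler` settings.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x_toy = np.array([[0, 15], [1, -10], [2, 20]], dtype=float)

# Per-column mean and (population) standard deviation
mean = x_toy.mean(axis=0)
std = x_toy.std(axis=0)  # ddof=0, the convention StandardScaler uses

# Manual standardization: z = (x - mean) / std
x_manual = (x_toy - mean) / std

x_sklearn = StandardScaler().fit_transform(x_toy)

print(np.allclose(x_manual, x_sklearn))
print(x_manual.mean(axis=0), x_manual.std(axis=0))
```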

**Iris:** Feature scaling can be demonstrated by standardizing the Iris dataset's features:

In [None]:
from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()
x_scaled = scaler.fit_transform(x_iris)

# x_scaled now has each feature standardized
x_scaled

#### Normalizing Features with `Normalizer`

Unlike `StandardScaler`, which works column by column, scikit-learn's `Normalizer` works row by row: it rescales each individual sample so that its feature vector has unit norm (the L2 norm by default). This can be particularly useful for sparse datasets.

In [None]:
from sklearn.preprocessing import Normalizer

# Dataset
x = [[4, 1, 2, 2], [1, 3, 9, 3], [5, 7, 5, 1]]

# Normalizing the features
normalizer = Normalizer()
x_normalized = normalizer.fit_transform(x)

x_normalized
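
To see what "unit norm" means in practice, this small sketch divides each row of the toy data by its Euclidean length and compares the result with `Normalizer`. It assumes the default `norm="l2"`; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

x_toy = np.array([[4, 1, 2, 2], [1, 3, 9, 3], [5, 7, 5, 1]], dtype=float)

# Euclidean (L2) length of every row, kept as a column for broadcasting
row_norms = np.linalg.norm(x_toy, axis=1, keepdims=True)

# Divide each row by its own length
x_manual = x_toy / row_norms

x_sklearn = Normalizer().fit_transform(x_toy)

print(row_norms.ravel())                  # lengths before normalizing: 5, 10, 10
print(np.linalg.norm(x_manual, axis=1))   # lengths after normalizing: all 1
print(np.allclose(x_manual, x_sklearn))
```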

**Iris:** Normalization ensures that each sample's feature vector has a unit norm. 

In [None]:
from sklearn.preprocessing import Normalizer

normalizer = Normalizer()
x_normalized = normalizer.fit_transform(x_iris)

# x_normalized now ensures each sample has a unit norm
x_normalized

#### Encoding Categorical Variables Using `OneHotEncoder`

Categorical variables are often represented as `strings` or `numbers` that indicate categories. To make these variables understandable to machine learning algorithms, we use *one-hot encoding*, which represents each category as a binary vector.

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Dataset with categorical features
x = [['Male', 1], ['Female', 3], ['Female', 2]]

# Applying one-hot encoding
encoder = OneHotEncoder()
x_encoded = encoder.fit_transform(x).toarray()

x_encoded
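
The encoded matrix is easier to read once you know which column stands for which category. The sketch below inspects the fitted encoder's `categories_` attribute and maps the encoded rows back with `inverse_transform`. Note that the numeric second column is treated as categorical too, so it is expanded into three binary columns.

```python
from sklearn.preprocessing import OneHotEncoder

x_toy = [['Male', 1], ['Female', 3], ['Female', 2]]

encoder = OneHotEncoder()
x_encoded = encoder.fit_transform(x_toy).toarray()

# One list of categories per input column, in the order used for the output
for column_index, categories in enumerate(encoder.categories_):
    print(column_index, list(categories))

# 2 categories for the first column + 3 for the second = 5 output columns
print(x_encoded.shape)

# Map the binary vectors back to the original values
x_restored = encoder.inverse_transform(x_encoded)
print(x_restored)
```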


**Iris:** The Iris dataset's target variable (`y`) is already numerical and represents classes as `0`, `1`, and `2`. Typically, you would **not** one-hot encode the target variable for most scikit-learn classifiers since they handle numerical class labels directly.

However, for demonstration:

In [None]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
y_encoded = encoder.fit_transform(y_iris.reshape(-1, 1)).toarray()

# y_encoded is now one-hot encoded
y_encoded


#### Feature Selection with `SelectKBest`

Not all features are equally important for a model. `SelectKBest` keeps the `k` features that score highest under a statistical measure, such as the [chi-squared](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html) test (which requires non-negative feature values). Dropping uninformative features can improve model performance and reduce overfitting.

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

# Sample dataset
x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
y = [0, 1, 0]

# Selecting the 2 best features based on chi-squared test
selector = SelectKBest(chi2, k=2)
x_selected = selector.fit_transform(x, y)

x_selected


**Iris:** To select the two best features according to their [ANOVA F-value](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html):

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=2)
x_selected = selector.fit_transform(x_iris, y_iris)

# x_selected now contains only the 2 features deemed most important
x_selected
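
The transformed array no longer tells you *which* columns were kept. As a minimal sketch, the fitted selector's `scores_` attribute and `get_support()` method can be used to recover that information; the name `selected_names` is only used for this illustration.

```python
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, f_classif

iris = datasets.load_iris()

selector = SelectKBest(f_classif, k=2).fit(iris.data, iris.target)

# ANOVA F-value of every feature (higher means more class-discriminative)
for name, score in zip(iris.feature_names, selector.scores_):
    print(f"{name}: F = {score:.1f}")

# Boolean mask marking the columns that survive the selection
selected_mask = selector.get_support()
selected_names = [
    name for name, keep in zip(iris.feature_names, selected_mask) if keep
]
print(selected_names)
```

On Iris, the two petal measurements receive the highest scores and are therefore the ones kept.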


#### Feature Extraction Techniques

[Feature extraction](https://scikit-learn.org/stable/modules/feature_extraction.html) is crucial for converting raw data into a structured format. Scikit-learn provides several tools for this purpose, such as `DictVectorizer` for feature arrays represented as lists of dictionaries and `CountVectorizer` for converting text documents into a matrix of token counts.


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
corpus = ['text data preprocessing', 'feature extraction with scikit-learn']

# Vectorizing text data
vectorizer = CountVectorizer()
x_vectorized = vectorizer.fit_transform(corpus).toarray()

x_vectorized
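
`DictVectorizer` is mentioned above but not shown, so here is a short sketch of how it might be used. The two records are invented for illustration only: string values are expanded into one binary column per value, while numeric values are passed through unchanged.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records; each dictionary describes one sample
records = [
    {"species": "setosa", "petal_length": 1.4},
    {"species": "versicolor", "petal_length": 4.7},
]

# sparse=False returns a dense NumPy array instead of a sparse matrix
dict_vectorizer = DictVectorizer(sparse=False)
x_from_dicts = dict_vectorizer.fit_transform(records)

# Column names, in the same order as the columns of the array
print(dict_vectorizer.feature_names_)
print(x_from_dicts)
```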

**Iris:** Feature extraction is more relevant to datasets with raw data like text or images. Since the Iris dataset is already in a structured format, feature extraction like text vectorization doesn't directly apply. However, you can transform or create new features based on existing ones, a process known as feature engineering, which might involve calculating interactions between features or other transformations.
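
As one possible example of such feature engineering, the sketch below adds an interaction feature to Iris. The "petal area" feature (petal length times petal width) is a hypothetical choice made for this illustration, not a feature that ships with the dataset.

```python
import numpy as np
from sklearn import datasets

x_iris = datasets.load_iris().data

# Columns 2 and 3 hold petal length and petal width (in cm)
petal_area = x_iris[:, 2] * x_iris[:, 3]

# Append the new feature as a fifth column
x_engineered = np.column_stack([x_iris, petal_area])

print(f"{x_iris.shape = }")
print(f"{x_engineered.shape = }")
```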

In summary, these examples demonstrate how to apply various preprocessing techniques to generic data and to the Iris dataset using scikit-learn, preparing the data for machine learning modeling. 

This chapter prepares you for the important task of data preprocessing, so that your dataset is in the best possible shape for training models.

### [2.3 Preparing for model training](#23-preparing-for-model-training)

Before training a model, it's vital to prepare your dataset properly. Besides data preprocessing, this preparation involves creating separate datasets for training and testing. By doing so, you can train your model on one portion of the data and evaluate its performance on a separate set that it hasn't seen before, which is crucial for assessing the model's ability to generalize.

<img src=https://miro.medium.com/v2/resize:fit:580/1*OECM6SWmlhVzebmSuvMtBg.png height=300>

#### Splitting a dataset using `train_test_split`

The purpose of splitting your dataset into training and testing sets is to provide an honest assessment of the model's performance. Scikit-learn offers the `train_test_split` function, which is an efficient way to randomly partition the dataset.

The `train_test_split` function shuffles your data and then splits it into training and testing subsets. You can specify the proportion of the dataset you want to allocate for testing with the `test_size` parameter.


In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

# Generate dummy data
x_dummy = np.random.rand(100, 4)  # 100 samples, 4 features
y_dummy = np.random.randint(0, 2, 100)  # Binary target variable

# Splitting the dataset into training and testing sets
x_train_dummy, x_test_dummy, y_train_dummy, y_test_dummy = train_test_split(x_dummy, y_dummy, test_size=0.2, random_state=42)

print(f"Training set size: {x_train_dummy.shape[0]} samples")
print(f"Testing set size: {x_test_dummy.shape[0]} samples")


Here's how to apply it to the Iris dataset:

In [None]:
from sklearn.model_selection import train_test_split

# Assuming x_iris and y_iris are already defined (Iris dataset features and target)
x_train, x_test, y_train, y_test = train_test_split(x_iris, y_iris, test_size=0.2, random_state=42)

# This splits the dataset into 80% training data and 20% testing data
print(f"{x_train.shape = }")
print(f"{x_test.shape = }")
print(f"{y_train.shape = }")
print(f"{y_test.shape = }")


In this example, `20%` of the data is reserved for testing, and the `random_state` parameter ensures that the split is reproducible; using the same seed will produce the same split in future runs.
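
To illustrate the effect of `random_state`, this sketch splits the Iris data three times: twice with the same seed and once with a different one. The variable names and the second seed (`0`) are arbitrary choices for the demonstration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_iris, y_iris = iris.data, iris.target

# Each call returns (x_train, x_test, y_train, y_test)
split_a = train_test_split(x_iris, y_iris, test_size=0.2, random_state=42)
split_b = train_test_split(x_iris, y_iris, test_size=0.2, random_state=42)
split_c = train_test_split(x_iris, y_iris, test_size=0.2, random_state=0)

# Compare the test features (index 1) of the different splits
same_seed_equal = np.array_equal(split_a[1], split_b[1])
different_seed_equal = np.array_equal(split_a[1], split_c[1])

print(f"{same_seed_equal = }")
print(f"{different_seed_equal = }")
```

Two calls with the same seed return identical splits, whereas a different seed shuffles the samples differently.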

#### Stratified Splitting to Preserve Class Proportions

In some cases, especially in datasets where some classes are underrepresented, it's important to use a stratified split. This method ensures that the proportion of classes in both the training and testing sets reflects that of the original dataset.

In [None]:
# Let's check how many samples of each class ended up in the training and testing sets
print(f"{np.unique(y_train, return_counts=True)}")
print(f"{np.unique(y_test, return_counts=True)}")

Scikit-learn's `train_test_split` function allows for *stratified* splitting by using the `stratify` parameter:

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_iris, y_iris, test_size=0.2, stratify=y_iris, random_state=41)

# This again splits the dataset into 80% training data and 20% testing data, this time preserving the class ratio of y
# Let's check how many samples of each class ended up in the training and testing sets
print(f"{np.unique(y_train, return_counts=True)}")
print(f"{np.unique(y_test, return_counts=True)}")

By setting `stratify=y_iris`, we ensure that the distribution of classes in both the training and testing sets matches the original dataset as closely as possible.

#### Cross-Validation: An Alternative to Train/Test Split

While splitting your data into training and testing sets is good practice, it's also useful to employ **cross-validation** as an additional step.

<img src=https://www.researchgate.net/publication/340567535/figure/fig2/AS:880966289588226@1587050139118/Train-test-cross-validation-split-methodology-used-in-this-paper-The-first-operation.jpg height= 400>

Cross-validation involves splitting the dataset into multiple chunks (each called a *fold*) and training multiple models, using a different fold as the test set each time and the remaining folds for training. This process helps in assessing the model's performance more robustly.

Scikit-learn offers various [cross-validation strategies](https://scikit-learn.org/stable/modules/cross_validation.html), but a simple and commonly used method is **K-fold cross-validation**, which can be performed as follows:

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Initialize a simple classifier (we will cover classifiers in detail in chapter 3.1)
classifier_dummy = RandomForestClassifier(random_state=42)

# Performing 5-fold cross-validation
scores_dummy = cross_val_score(classifier_dummy, x_dummy, y_dummy, cv=5)

print("Cross-validation accuracy scores:", scores_dummy) # (We will cover scoring in chapter 3.3)


Here is how to apply cross validation to our iris data.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Initialize the model
classifier = RandomForestClassifier(random_state=42)

# Perform 5-fold cross-validation
scores = cross_val_score(classifier, x_iris, y_iris, cv=5)

print("Accuracy scores for each fold:", scores)
print("Mean accuracy:", scores.mean())


In this example, the dataset is split into five parts; the model is trained five times, each time using a different part as the test set. The `cross_val_score` function then returns the accuracy of the model for each fold, providing insight into how the model performs across different subsets of the data.
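
The loop that `cross_val_score` runs internally can be written out by hand. The sketch below is an approximation of that procedure, not scikit-learn's actual implementation: it uses `StratifiedKFold`, which is what scikit-learn selects for a classifier when `cv` is an integer, and trains a fresh copy of the model on every fold. The names `fold_scores` and `fold_sizes` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

iris = datasets.load_iris()
x_iris, y_iris = iris.data, iris.target

model = RandomForestClassifier(random_state=42)
folds = StratifiedKFold(n_splits=5)

fold_scores = []
fold_sizes = []
for train_idx, test_idx in folds.split(x_iris, y_iris):
    # Train an untrained copy of the model on the four training folds
    fold_model = clone(model)
    fold_model.fit(x_iris[train_idx], y_iris[train_idx])

    # Evaluate accuracy on the held-out fold
    fold_scores.append(fold_model.score(x_iris[test_idx], y_iris[test_idx]))
    fold_sizes.append(len(test_idx))

fold_scores = np.array(fold_scores)
print("Test fold sizes:", fold_sizes)
print("Accuracy per fold:", fold_scores)
print("Mean accuracy:", fold_scores.mean())
```

Every sample ends up in a test fold exactly once, so all of the data contributes to the performance estimate.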

By combining train/test splits with cross-validation, you get a more reliable estimate of how well your models generalize before moving on to the final steps of training and evaluation.

---
# --> 🚀💻💥 *Coding Challenge ([Step 0-3](#34-coding-challenge))*
---