# **Chapter 6. Machine Learning**

## **6.1. Introduction to Machine Learning**

**Machine Learning (ML)** is a subset of artificial intelligence (AI) focused on building systems that learn from data and make predictions or decisions based on it. ML algorithms typically improve their performance as the amount and quality of data available for learning increases.

Machine learning has emerged as a transformative tool in chemistry, offering new approaches to solving complex chemical problems. By leveraging patterns in data, ML algorithms can predict molecular properties, optimize chemical reactions, design new materials, and more, often with greater efficiency than traditional methods.

**Types of Machine Learning:**

- **Supervised Learning:** Models are trained on labeled data, learning to predict outputs from inputs.
- **Unsupervised Learning:** Models identify patterns in data without any labels.
- **Reinforcement Learning:** Models learn to make sequences of decisions by receiving feedback on their actions.

**Applications of Machine Learning in Chemistry:**

- **Predictive Modeling:** Using ML to predict properties of molecules such as solubility, toxicity, reactivity, or the outcomes of chemical reactions such as yield, chemoselectivity and regioselectivity.
- **Reaction Optimization:** Finding optimal conditions for chemical reactions.
- **Quantitative Analysis:** Using ML to perform quantitative structure-activity relationships (QSAR) and quantitative structure-property relationships (QSPR).
- **Drug Discovery:** Virtual screening, ADMET prediction.
- **Molecular Dynamics and Simulations:** Applying ML to improve the efficiency of molecular dynamics simulations.
- **Material Design:** Discovering new materials with desired properties.

**Challenges and Limitations:**
- **Generalization and Overfitting**
    - The balance between model complexity and predictive power.
    - Methods to prevent overfitting such as cross-validation.
- **Interpretability**
    - The "black box" nature of some ML models and the importance of model interpretability in chemistry.
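
As a rough illustration of the cross-validation idea mentioned above, the following sketch estimates model accuracy with 5-fold cross-validation. The dataset (`load_iris`) and the model (`LogisticRegression`) are assumptions chosen only to keep the example self-contained; they are not prescribed by this chapter.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative dataset and model (assumptions, not taken from the chapter)
x, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: train on 4 folds, score on the held-out fold, repeat
scores = cross_val_score(model, x, y, cv=5)

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

A large gap between the training accuracy and the cross-validated accuracy is one common sign of overfitting.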

## **6.2. Data Preprocessing for Machine Learning**

Preprocessing is crucial in transforming raw data into a format suitable for machine learning. It involves cleaning data, handling missing values, normalization, feature extraction, and data transformation. Preprocessed data leads to more efficient training of machine learning models, can enhance model accuracy, and helps in achieving more reliable predictions.

Preprocessing steps include:
- **Cleaning Data:** Removing duplicates, handling missing values, or filtering out irrelevant data.
- **Feature Extraction:** Transforming chemical compounds into a suitable format for machine learning, such as SMILES (Simplified Molecular Input Line Entry System) strings, molecular fingerprints, or graph representations.
- **Normalization/Standardization:** Scaling data to a specific range or distribution. This is important for methods sensitive to the scale of input data.
- **Encoding Categorical Data:** Transforming categorical data into numeric data.
- **Data Transformation:** Converting data into formats that can be efficiently processed by ML algorithms, such as one-hot encoding for categorical data.
- **Data Augmentation:** Generating additional data points from existing data.
- **Dimensionality Reduction:** High-dimensional data, often encountered in chemistry, can lead to issues like overfitting and long training times. Dimensionality reduction techniques, such as Principal Component Analysis (PCA) or t-Distributed Stochastic Neighbor Embedding (t-SNE), reduce the number of features while retaining most of the important information. This step is crucial for simplifying models, improving interpretability, and sometimes enhancing model performance.
- **Splitting Data:** Splitting the data into separate subsets for training, validation, and testing.

### **6.2.1. Normalization and Standardization**

In machine learning, normalization and standardization are essential preprocessing techniques for algorithms that are sensitive to the scale of the input data, such as distance-based or gradient-based methods. They improve the training process by transforming features so that they have a similar range or distribution.

#### **6.2.1.1. Normalization**

Normalization usually refers to scaling the values of a feature to a specific range, typically [0, 1]. This ensures that features with large numerical ranges do not disproportionately influence the results of the model. The method used for normalization is often called Min-Max Scaling.

The Min-Max scaling formula is given by:

$$
X' = \frac{X - X_{min}}{X_{max} - X_{min}}
$$

where $X$ is the original value, $X_{min}$ is the minimum value of the feature, and $X_{max}$ is the maximum value of the feature.

**Example:** Min-Max Scaler using scikit-learn

Here’s how to apply Min-Max normalization using the `MinMaxScaler` from scikit-learn:

In [None]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Sample data
data = np.array([[1, 2],
                 [3, 4],
                 [5, 6]])

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Fit and transform the data
normalized_data = scaler.fit_transform(data)

print("Normalized Data:\n", normalized_data)
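
To connect the formula to the scaler output, the sketch below applies the Min-Max equation by hand with NumPy and compares the result with `MinMaxScaler`. The variable names (`x_min`, `x_max`, `manual`) are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[1, 2],
                 [3, 4],
                 [5, 6]], dtype=float)

# Column-wise minimum and maximum (one value per feature)
x_min = data.min(axis=0)
x_max = data.max(axis=0)

# Apply X' = (X - X_min) / (X_max - X_min) to every column
manual = (data - x_min) / (x_max - x_min)

# The scaler should give the same result
sklearn_result = MinMaxScaler().fit_transform(data)

print(manual)
print("Matches MinMaxScaler:", np.allclose(manual, sklearn_result))
```

Note that scaling is done per feature (per column), not over the whole array.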

#### **6.2.1.2. Standardization**

Standardization, on the other hand, refers to scaling the features such that they have a mean of 0 and a standard deviation of 1. This is particularly useful for algorithms that assume features are centered around zero with comparable variance. Note that standardization changes the location and scale of a feature, not the shape of its distribution. The standard score $Z$ is calculated using the formula:

$$
Z = \frac{X - \mu}{\sigma}
$$

where $X$ is the original value, $\mu$ is the mean of the feature, and $\sigma$ is the standard deviation.

**Example:** Standard Scaler using scikit-learn

Here’s how to apply standardization using the `StandardScaler` from scikit-learn:

In [None]:
import numpy as np
from sklearn.preprocessing import StandardScaler

# Sample data
data = np.array([[1, 2],
                 [3, 4],
                 [5, 6]])

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the data
standardized_data = scaler.fit_transform(data)

print("Standardized Data:\n", standardized_data)
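
The same check can be done for standardization. This sketch computes the standard score by hand, assuming the population standard deviation (`ddof=0`, the NumPy default), which is the convention `StandardScaler` follows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1, 2],
                 [3, 4],
                 [5, 6]], dtype=float)

# Column-wise mean and population standard deviation (ddof=0)
mu = data.mean(axis=0)
sigma = data.std(axis=0)

# Apply Z = (X - mu) / sigma to every column
manual = (data - mu) / sigma

sklearn_result = StandardScaler().fit_transform(data)

print(manual)
print("Matches StandardScaler:", np.allclose(manual, sklearn_result))
print("Column means:", manual.mean(axis=0))
print("Column standard deviations:", manual.std(axis=0))
```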

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 1</b></p>

1. **Generate Sample Data:**
    - Create a 1D NumPy array consisting of 10 random values between 1 and 100.

2. **Perform Min-Max Scaling:**
    - Use `MinMaxScaler` from `sklearn.preprocessing` to normalize this data to the range [0, 1].
    - Print the original data and the normalized data.

3. **Transform New Data and Inverse Transform:**
    - Create a new 1D NumPy array consisting of 5 random values between 1 and 100. Transform the data using the `MinMaxScaler`, then inverse transform the result.

4. **Perform Standardization:**
    - Use `StandardScaler` from `sklearn.preprocessing` to standardize the original data.
    - Print the original data and the standardized data.

5. **Unscale the Standardized Data:**
    - Create a new 1D NumPy array consisting of 5 random values between 1 and 100. Transform the data using the fitted `StandardScaler`, then inverse transform the result.

### **6.2.2. Encoding Categorical Data**

In machine learning, it is common to encounter categorical variables that need to be encoded into numerical values so that algorithms can process them effectively. Encoding categorical data is essential for ensuring that models can interpret the variables correctly and make predictions.

#### ***6.2.2.1. Label Encoding***

Label encoding converts a categorical variable into numerical form by assigning an integer value to each unique category in the feature. It is simpler than one-hot encoding (Section 6.2.2.2), but the integers imply an ordinal relationship between categories, which may not be appropriate for nominal data.

- **Process:** Assign an integer value to each unique category in the categorical variable.

- **Application:** Label encoding is ideal for ordinal variables where there is a clear ordering among the categories (e.g., 'Low', 'Medium', 'High').

**Example: Using Label Encoding with scikit-learn**

In [None]:
from sklearn.preprocessing import LabelEncoder

# Sample data
data = ['Red', 'Blue', 'Green', 'Blue', 'Red']
label_encoder = LabelEncoder()

# Applying Label Encoding
encoded_labels = label_encoder.fit_transform(data)

print("Original Data:\n", data)
print("\nLabel Encoded Data:\n", encoded_labels)
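
One detail worth checking when encoding ordinal variables: `LabelEncoder` assigns integers in sorted (alphabetical) order, which does not necessarily match the natural ordering of the categories. The sketch below contrasts it with `OrdinalEncoder`, which accepts an explicit category order. The sample values `'Low'`, `'Medium'`, `'High'` are taken from the text above; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

levels = ['Low', 'High', 'Medium', 'Low']

# LabelEncoder sorts categories alphabetically: High=0, Low=1, Medium=2
alphabetical = LabelEncoder().fit_transform(levels)
print("LabelEncoder:", alphabetical)

# OrdinalEncoder accepts an explicit category order: Low=0, Medium=1, High=2
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
ordered = encoder.fit_transform([[value] for value in levels])
print("OrdinalEncoder:", ordered.ravel())
```

In scikit-learn, `LabelEncoder` is intended for target labels, while `OrdinalEncoder` is the counterpart for input features and expects a 2D array.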

#### ***6.2.2.2. One-Hot Encoding***

One-hot encoding is a popular method for converting categorical variables into a binary (0 or 1) format. Each category in the variable is represented as a separate binary feature. This technique avoids implying ordinality between categories, which is crucial when the categorical variable does not have a natural ordering.

- **Process:**
  1. Identify all unique categories in the categorical variable.
  2. Create a new binary feature for each category.
  3. Assign a 1 or 0 in the new features based on the original category.

- **Application:** One-hot encoding is particularly useful for nominal variables, where there is no ordinal relationship between the categories.

**Example: Using One-Hot Encoding with scikit-learn**

In [None]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Sample data
data = np.array([['Red'],
                 ['Blue'],
                 ['Green'],
                 ['Blue'],
                 ['Red']])

# Initialize OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)

# Applying One-Hot Encoding
one_hot_encoded_data = encoder.fit_transform(data)

print("Original Data:\n", data)
print("\nOne-Hot Encoded Data:\n", one_hot_encoded_data)
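
To see which output column corresponds to which category, the fitted encoder can be inspected. This is a small sketch using the default (sparse) output converted with `.toarray()`; it also shows that the encoding can be reversed with `inverse_transform`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data = np.array([['Red'], ['Blue'], ['Green'], ['Blue'], ['Red']])

encoder = OneHotEncoder()
# The default output is a SciPy sparse matrix; .toarray() makes it dense
encoded = encoder.fit_transform(data).toarray()

# Column order follows the sorted categories: Blue, Green, Red
print("Categories:", encoder.categories_[0])
print(encoded)

# inverse_transform maps the binary rows back to the original labels
decoded = encoder.inverse_transform(encoded)
print("Decoded:", decoded.ravel())
```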

### **6.2.3. Splitting Data**

To evaluate a model, we split the data into two sets: a training set and a test set. The training set is used to fit the model, and the test set is used to evaluate it on data the model has not seen.

![Train-test split](images/train_test_split.png)

When training models over multiple iterations (e.g., ANN models), the dataset is usually split into three sets: a training set, a validation set, and a test set. The validation set is used to monitor the training process in order to detect underfitting and overfitting.

![Train-val-test split](images/train_val_test_split.png)
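
scikit-learn has no single function for a three-way split, so one common approach is to call `train_test_split` twice. The sketch below assumes a 60/20/20 split and uses randomly generated placeholder data; both choices are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples, 4 features, binary labels (assumed for illustration)
rng = np.random.default_rng(0)
x = rng.random((100, 4))
y = rng.integers(0, 2, size=100)

# First split off the test set (20% of all samples)
x_temp, x_test, y_temp, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)

# Then split the remainder into training and validation sets.
# 0.25 of the remaining 80% equals 20% of the full dataset.
x_train, x_val, y_train, y_val = train_test_split(
    x_temp, y_temp, test_size=0.25, random_state=42)

print(len(x_train), len(x_val), len(x_test))
```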

**Example: Splitting data with scikit-learn**

In [None]:
import pandas as pd

# Load the Dataset
data = pd.read_csv('./datasets/IrisFlower.csv')

# Explore the Data
print(data.shape)
print(data.head())

from sklearn.model_selection import train_test_split

# Get the inputs and output
x = data.drop('Species', axis=1)
y = data['Species']

# Split the Data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
print(f'Shape of x_train: {x_train.shape}')
print(f'Shape of x_test: {x_test.shape}')
print(f'Shape of y_train: {y_train.shape}')
print(f'Shape of y_test: {y_test.shape}')
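
When scaling is combined with splitting, the scaler is usually fitted on the training set only and then reused on the test set, so that no information from the test data leaks into preprocessing. A minimal sketch, using `load_iris` from scikit-learn as a stand-in for the CSV file above (an assumption made to keep the example self-contained):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)

scaler = MinMaxScaler()
x_train_scaled = scaler.fit_transform(x_train)  # learn min/max from training data only
x_test_scaled = scaler.transform(x_test)        # reuse the training min/max

print("Train range:", x_train_scaled.min(), x_train_scaled.max())
print("Test range:", x_test_scaled.min(), x_test_scaled.max())
```

Because the test set is scaled with the training minimum and maximum, its values can fall slightly outside [0, 1]. This is expected.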

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 2</b></p>

1. **Load the Dataset:** Start by loading the breast cancer dataset from the `BreastCancer.csv` file using pandas.

2. **Get the Input and Output Columns:** Use `diagnosis` as output and all other columns as inputs.

3. **Scale the Data:** Apply min-max scaling for all the input features.

4. **Encode the Output:** Encode the output column: `0` for `B` (benign), `1` for `M` (malignant) using `LabelEncoder` and `OneHotEncoder`.

### **6.2.4. Dimensionality Reduction**

Dimensionality reduction is a crucial preprocessing step in machine learning that aims to reduce the number of features in a dataset while retaining its essential characteristics and information. This reduction is particularly important in high-dimensional data, which can lead to issues such as overfitting, increased computation time, and difficulty in visualizing data.

**Why Use Dimensionality Reduction?**

1. **Improved Performance:** By reducing the number of features, models can train faster and run more efficiently, often resulting in improved performance metrics.
2. **Visualization:** Lower-dimensional data can be visualized easily, enabling better understanding and insights.
3. **Mitigation of Overfitting:** Fewer features can help prevent models from becoming overly complex and fitting noise in the training data.
4. **Noise Reduction:** Removing less informative features can help reduce noise and improve model generalization.

**Common Dimensionality Reduction Techniques**

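- **1. Principal Component Analysis (PCA)**
- **2. t-Distributed Stochastic Neighbor Embedding (t-SNE)**
- **3. Linear Discriminant Analysis (LDA)**
- **4. Non-negative Matrix Factorization (NMF)**
- **5. Independent Component Analysis (ICA)**
- **6. Variance Thresholding**
- **7. Filter Methods**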

#### ***6.2.4.1. Principal Component Analysis (PCA)***

Principal Component Analysis (PCA) is one of the most widely used dimensionality reduction techniques. It transforms the data into a new coordinate system where the greatest variance by any projection lies on the first coordinate (the first principal component), the second greatest variance on the second coordinate, and so on.

- **Process:**
  1. Standardize the data.
  2. Compute the covariance matrix.
  3. Calculate the eigenvalues and eigenvectors of the covariance matrix.
  4. Sort the eigenvalues and their corresponding eigenvectors.
  5. Select a subset of the eigenvectors as principal components.
  6. Transform the original data into the new feature subspace.

- **Application:** PCA is commonly used for exploratory data analysis, image compression, and as a preprocessing step before feeding data into machine learning models.

**Example: Using PCA with scikit-learn**

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

# Sample data
data = np.random.rand(100, 5)  # 100 samples and 5 features
print("Data Shape:", data.shape)

# Standardizing the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# Applying PCA
pca = PCA(n_components=2)  # Reduce to 2 dimensions
reduced_data = pca.fit_transform(scaled_data)

print("Reduced Data Shape:", reduced_data.shape)
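
The six steps listed in the process above can be sketched directly in NumPy and compared with scikit-learn's `PCA`. This is an illustrative reconstruction under the assumption of a seeded random dataset, not how scikit-learn implements PCA internally (it relies on a singular value decomposition).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
data = rng.random((100, 5))

# 1. Standardize the data
scaled = StandardScaler().fit_transform(data)

# 2. Covariance matrix of the features (columns)
cov = np.cov(scaled, rowvar=False)

# 3. Eigenvalues and eigenvectors (eigh is meant for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort from largest to smallest eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 5. Keep the first two eigenvectors as principal components
components = eigenvectors[:, :2]

# 6. Project the data onto the new subspace
manual = scaled @ components

# Compare with scikit-learn (the sign of each component is arbitrary)
pca = PCA(n_components=2)
sklearn_result = pca.fit_transform(scaled)
print("Same up to sign:", np.allclose(np.abs(manual), np.abs(sklearn_result)))
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

The `explained_variance_ratio_` attribute reports the fraction of the total variance retained by each component, which is a practical guide for choosing `n_components`.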

#### ***6.2.4.2. t-Distributed Stochastic Neighbor Embedding (t-SNE)***

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique particularly popular for visualizing high-dimensional datasets in lower dimensions (typically 2D or 3D).

- **Characteristics:**
  - t-SNE converts pairwise similarities between data points into probabilities.
  - It seeks to minimize the divergence between two probability distributions: one for the high-dimensional space and one for the low-dimensional space.

- **Application:** t-SNE is widely used for visualizing complex data structures, such as gene expression profiles or high-dimensional feature spaces in image data.

**Example: Using t-SNE with scikit-learn**

In [None]:
import numpy as np
from sklearn.manifold import TSNE

# Sample data
data = np.random.rand(100, 50)  # 100 samples and 50 features
print("Data Shape:", data.shape)

# Applying t-SNE
tsne = TSNE(n_components=2)
tsne_reduced_data = tsne.fit_transform(data)

print("t-SNE Reduced Data Shape:", tsne_reduced_data.shape)
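
t-SNE results depend on the `perplexity` parameter and on random initialization. The sketch below fixes both and reads the final value of the divergence that the algorithm minimizes. The chosen values (`perplexity=20`, `random_state=0`) are assumptions for illustration, not recommendations.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
data = rng.random((100, 50))  # 100 samples and 50 features

# perplexity should be smaller than the number of samples;
# random_state makes the embedding reproducible
tsne = TSNE(n_components=2, perplexity=20, random_state=0)
embedding = tsne.fit_transform(data)

print("Embedding shape:", embedding.shape)
# Final value of the Kullback-Leibler divergence that t-SNE minimizes
print("KL divergence:", tsne.kl_divergence_)
```

Since the embedding changes with perplexity and seed, it is worth trying several settings before interpreting the apparent clusters in a t-SNE plot.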

#### ***6.2.4.3. Linear Discriminant Analysis (LDA)***

Linear Discriminant Analysis (LDA) is a supervised dimensionality reduction technique that is used when the data is labeled. It seeks to find a linear combination of features that characterizes or separates two or more classes.

- **Characteristics:**
  - LDA aims to maximize the distance between different classes and minimize the distance within the same class.

- **Application:** LDA is often used in pattern recognition and machine learning as a way to reduce dimension while preserving as much class discriminatory information as possible.

**Example: Using LDA with scikit-learn**

In [None]:
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Sample data with labels
x = np.random.rand(100, 5)  # 100 samples and 5 features
y = np.random.choice([0, 1], size=100)  # Binary labels
print("Data Shape:", x.shape)

# Applying LDA
lda = LinearDiscriminantAnalysis(n_components=1)
lda_reduced_data = lda.fit_transform(x, y)

print("LDA Reduced Data Shape:", lda_reduced_data.shape)
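
In scikit-learn, the number of LDA components cannot exceed the smaller of `n_classes - 1` and the number of features, which is why the binary example above uses a single component. A sketch with a three-class dataset (`load_iris`, chosen here as an illustrative assumption) allows two components:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

x, y = load_iris(return_X_y=True)  # 4 features, 3 classes

# With 3 classes, at most 3 - 1 = 2 discriminant components are available
lda = LinearDiscriminantAnalysis(n_components=2)
x_lda = lda.fit_transform(x, y)

print("Reduced shape:", x_lda.shape)
print("Explained variance ratio:", lda.explained_variance_ratio_)
```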

#### ***6.2.4.4. Non-negative Matrix Factorization (NMF)***

Non-negative Matrix Factorization (NMF) is a group of algorithms in multivariate statistics and linear algebra where a matrix is factorized into two non-negative matrices. This approach is particularly useful when dealing with data that is inherently non-negative, like document-term matrices.

- **Characteristics:**
  - NMF results in a parts-based representation, making it suitable for tasks like image processing and topic modeling.

- **Application:** NMF is commonly used for text mining, recommendation systems, and in various applications where interpretability is crucial (e.g., topic extraction from documents).

**Example: Using NMF with scikit-learn**

In [None]:
import numpy as np
from sklearn.decomposition import NMF

# Sample data
data = np.random.rand(100, 50)  # 100 samples and 50 features
print("Data Shape:", data.shape)

# Applying NMF
nmf = NMF(n_components=10, init='random', random_state=0, max_iter=1000)
nmf_reduced_data = nmf.fit_transform(data)

print("NMF Reduced Data Shape:", nmf_reduced_data.shape)
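
The factorization itself can be inspected after fitting. In the sketch below, `W` (the transformed data) and `H` (the `components_` attribute) are the two non-negative factors, and their product approximates the original matrix. The names `W` and `H` follow common NMF notation and are not defined in the text above.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
data = rng.random((100, 50))  # non-negative input is required

nmf = NMF(n_components=10, init='random', random_state=0, max_iter=1000)
W = nmf.fit_transform(data)   # shape (100, 10): sample weights
H = nmf.components_           # shape (10, 50): non-negative "parts"

print("All entries non-negative:", bool((W >= 0).all() and (H >= 0).all()))

# The product W @ H approximates the original data
reconstruction = W @ H
error = np.linalg.norm(data - reconstruction)
print("Frobenius reconstruction error:", error)
```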

#### ***6.2.4.5. Independent Component Analysis (ICA)***

Independent Component Analysis (ICA) is a computational technique used to separate a multivariate signal into additive, independent non-Gaussian components. It is widely used in signal processing.

- **Characteristics:**
  - ICA is effective in identifying hidden factors that underlie sets of random variables or observed data.

- **Application:** ICA is particularly useful in applications such as image processing (e.g., separating mixed images), biomedical signal analysis (e.g., separating EEG signals), and financial data analysis.

**Example: Using ICA with scikit-learn**

In [None]:
import numpy as np
from sklearn.decomposition import FastICA

# Sample data
data = np.random.rand(100, 5)  # 100 samples and 5 features
print("Data Shape:", data.shape)

# Applying ICA
ica = FastICA(n_components=3)
ica_reduced_data = ica.fit_transform(data)

print("ICA Reduced Data Shape:", ica_reduced_data.shape)
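
Random uniform data has no hidden sources to recover, so the separation property of ICA is easier to see on mixed signals. The following sketch mixes two synthetic sources (a sine wave and a square wave, both assumed here for illustration) and checks how well `FastICA` recovers them.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two independent source signals (assumed for illustration)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)             # sine wave
s2 = np.sign(np.sin(3 * t))    # square wave
sources = np.column_stack([s1, s2])

# Mix the sources linearly, as two "sensors" would record them
mixing = np.array([[1.0, 1.0],
                   [0.5, 2.0]])
observed = sources @ mixing.T

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(observed)

# Correlate recovered and true signals; order and sign are arbitrary
corr = np.abs(np.corrcoef(recovered.T, sources.T)[:2, 2:])
print(np.round(corr, 2))
```

Each column of the printed matrix corresponds to one true source. A value close to 1 in a column means that one of the recovered components matches that source up to sign and scale.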

#### ***6.2.4.6. Variance Thresholding***

Variance Thresholding is a simple feature selection method that will remove all features whose variance does not meet a certain threshold. This technique is particularly useful for removing features with low variance which are unlikely to contribute meaningful information to the model.

- **Characteristics:**
  - It requires setting a threshold value for variance; features with variance below this threshold are removed.
  
- **Application:** Variance Thresholding is commonly used in preprocessing steps to reduce the dimensionality of datasets by eliminating uninformative features.

**Example: Using VarianceThreshold with scikit-learn**

In [None]:
from sklearn.feature_selection import VarianceThreshold
import numpy as np

# Sample data
data = np.array([[0, 0, 11, 9],
                 [1, 0, 8, 4],
                 [1, 0, 3, 4],
                 [1, 0, 8, 9]])
print("Data Shape:", data.shape)

# Applying Variance Threshold
threshold = 0.1
selector = VarianceThreshold(threshold=threshold)
reduced_data = selector.fit_transform(data)

print("Original Data Shape:", data.shape)
print("Reduced Data Shape:", reduced_data.shape)
print(reduced_data)
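
To see why a feature was removed, the fitted selector exposes the per-feature variances and a boolean mask of the retained columns. A short sketch on the same sample data:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

data = np.array([[0, 0, 11, 9],
                 [1, 0, 8, 4],
                 [1, 0, 3, 4],
                 [1, 0, 8, 9]])

selector = VarianceThreshold(threshold=0.1)
selector.fit(data)

# Per-feature variances computed during fit
print("Variances:", selector.variances_)
# Boolean mask: True for features that are kept
print("Kept:", selector.get_support())
```

Here only the second column, which is constant (variance 0), falls below the threshold of 0.1 and is dropped.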

#### ***6.2.4.7. Filter Methods***

Filter methods are a type of feature selection technique that relies on the intrinsic properties of the data to select the dimensions or features. These methods evaluate the relevance of features independently from the learning algorithm.

- **Characteristics:**
  - Filter methods usually use statistical measures to select the best features (e.g., correlation coefficient, chi-square test, mutual information).

- **Application:** They are often used as a preprocessing step to improve the performance of supervised learning algorithms or to reduce computational costs.

**Example: Using SelectKBest with scikit-learn**

In [None]:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Load the dataset
data = load_iris()
x, y = data.data, data.target
print("Data Shape:", x.shape)

# Applying Filter Method
selector = SelectKBest(score_func=f_classif, k=2)
x_reduced = selector.fit_transform(x, y)

print("Filtered Data Shape:", x_reduced.shape)
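
The scores behind the selection can also be inspected. The sketch below prints the ANOVA F-score of each Iris feature and the names of the two features that `SelectKBest` keeps. The variable names (`mask`, `selected`) are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
x, y = iris.data, iris.target

selector = SelectKBest(score_func=f_classif, k=2).fit(x, y)

# A higher F-score means the feature separates the classes better
for name, score in zip(iris.feature_names, selector.scores_):
    print(f"{name}: F = {score:.1f}")

# Names of the k features with the highest scores
mask = selector.get_support()
selected = [name for name, keep in zip(iris.feature_names, mask) if keep]
print("Selected features:", selected)
```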

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 3</b></p>

1. **Load the Dataset:** Start by loading the breast cancer dataset from the `BreastCancer.csv` file using pandas.

2. **Scale the Data:** Apply min-max scaling for all the input features.

3. **Choose and Apply a Dimensionality Reduction Technique:** Select one method to reduce the number of input features. Transform your feature set using the selected technique, ensuring you keep a manageable number of components or features suitable for regression analysis.

4. **Prepare the Target Variable:** Identify and separate the target variable from the feature set. If the dataset includes a class label (e.g., benign or malignant), use appropriate numerical coding or transformation.

5. **Split the Data:** Use a part of your dataset for training and another part for testing (e.g., 80% training, 20% testing).

6. **Build a Linear Regression Model:** Using `sklearn.linear_model.LinearRegression`, fit a linear regression model on the reduced feature set.

7. **Evaluate the Model:** Assess the model's performance using appropriate metrics.