# **Chapter 6. Machine Learning**

## **6.3. Regression Models**

In the following sections, we will evaluate different regression models to predict the solubility of chemical compounds:

***a. Import required libraries***

In [None]:
import math
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
from tqdm import tqdm

***b. Load dataset***

In [None]:
# Load the solubility dataset
data_file_path = './datasets/Solubility.csv'
df = pd.read_csv(data_file_path)
df.head()

***c. Get the input and output columns***

In [None]:
x = df[['AMW', 'num_rings', 'fraction_CSP3', 'num_hba', 'num_hbd', 'num_het_atoms', 'logP', 'TPSA']].to_numpy()
y = df['solubility'].to_numpy()
print(f'Shape of inputs: {x.shape}')
print(f'Shape of output: {y.shape}')

***d. Data preprocessing***

In [None]:
# Set the random seed
random_seed = 0
np.random.seed(random_seed)

# Define input and output scalers
input_scaler = MinMaxScaler(feature_range=(0, 1))
output_scaler = MinMaxScaler(feature_range=(0, 1))

# Scale the data
x_scaled = input_scaler.fit_transform(x)
y_scaled = output_scaler.fit_transform(y.reshape(-1, 1)).reshape(-1)

# Split the data into training and testing sets
x_train_scaled, x_test_scaled, y_train_scaled, y_test_scaled = train_test_split(x_scaled, y_scaled, test_size=0.3, random_state=random_seed)
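
To make the scaling step concrete, here is a minimal sketch of how `MinMaxScaler` maps values to the range [0, 1] and how `inverse_transform` recovers them. The values in `y_toy` are hypothetical and are not taken from the solubility dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical target values, used only for illustration
y_toy = np.array([-3.0, -1.5, 0.0, 1.0])

toy_scaler = MinMaxScaler(feature_range=(0, 1))

# MinMaxScaler expects a 2D array, hence reshape(-1, 1)
y_toy_scaled = toy_scaler.fit_transform(y_toy.reshape(-1, 1)).reshape(-1)
print(y_toy_scaled)  # the smallest value maps to 0, the largest to 1

# inverse_transform maps the scaled values back to the original scale
y_toy_restored = toy_scaler.inverse_transform(y_toy_scaled.reshape(-1, 1)).reshape(-1)
print(y_toy_restored)
```

This round trip is the reason the evaluation code later calls `output_scaler.inverse_transform` before computing metrics.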

***e. Model training and evaluation***

### **6.3.1. Linear Regression**

Linear regression is a fundamental approach that models the linear relationship between a dependent variable and one or more independent variables.

In [None]:
from sklearn.linear_model import LinearRegression

# Initialize the Linear Regression model
model = LinearRegression()

# Train the model on the training data
model.fit(x_train_scaled, y_train_scaled)

In [None]:
# Define a function for evaluating the model
def evaluate_regression_model(model, x_test_scaled, y_test_scaled):
    # Predict on the test set
    y_pred_scaled = model.predict(x_test_scaled)
    
    # Transform predictions back to original scale
    y_pred = output_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1)).reshape(-1)
    
    # Transform the test set back to original scale
    y_test = output_scaler.inverse_transform(y_test_scaled.reshape(-1, 1)).reshape(-1)
    
    # Evaluate the model
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = math.sqrt(mse)
    r2 = r2_score(y_test, y_pred)

    # Print regression metrics
    print(f"Mean Absolute Error (MAE): {mae:.4f}")
    print(f"Mean Squared Error (MSE): {mse:.4f}")
    print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
    print(f"R^2 Score: {r2:.4f}")
    
    # Plotting the results
    plt.figure(figsize=(8, 6))
    plt.scatter(y_test, y_pred, color='blue', edgecolor='k', alpha=0.7, s=40)
    plt.plot(y_test, y_test, color='red', linewidth=2)  # Ideal line for perfect predictions
    plt.xlabel('Actual Solubility')
    plt.ylabel('Predicted Solubility')
    plt.title('Predictions vs Actual')
    plt.grid(True)
    plt.show()

In [None]:
# Evaluate model
evaluate_regression_model(model, x_test_scaled, y_test_scaled)
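
The metrics printed by `evaluate_regression_model` can be reproduced by hand. The sketch below uses hypothetical actual and predicted values (not results from the solubility model) and compares the manual formulas with the scikit-learn functions.

```python
import math
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted values, for illustration only
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_toy = np.array([1.5, 2.0, 2.5, 4.0])

errors = y_true_toy - y_pred_toy

mae_manual = np.mean(np.abs(errors))   # mean absolute error
mse_manual = np.mean(errors ** 2)      # mean squared error
rmse_manual = math.sqrt(mse_manual)    # root mean squared error

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y_true_toy, y_pred_toy))  # 0.25 in both cases
print(mse_manual, mean_squared_error(y_true_toy, y_pred_toy))   # 0.125 in both cases
print(r2_manual, r2_score(y_true_toy, y_pred_toy))              # 0.9 in both cases
```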

After training, we can use the model to make predictions, for example:

In [None]:
features = np.array([40.065, 0, 0.333333, 0, 0, 0, 0.63950, 0.00])

# Scale features
features_scaled = input_scaler.transform(features.reshape(1, -1))

# Make prediction
output_scaled = model.predict(features_scaled)

# Transform predictions back to original scale
output = output_scaler.inverse_transform(output_scaled.reshape(-1, 1)).reshape(-1)

# Print out the prediction
print(output[0])
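
A fitted linear model can also be inspected through its `coef_` and `intercept_` attributes. The following sketch uses synthetic, noise-free data (an assumption made for illustration, unrelated to the solubility dataset) so that the recovered coefficients can be checked against the known ones.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (illustration only):
# y = 2*x0 - 1*x1 + 0*x2 + 0.5
rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(100, 3))
y_toy = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + 0.5

linreg_toy = LinearRegression().fit(X_toy, y_toy)

# One coefficient per input column, plus a single intercept
print(linreg_toy.coef_)
print(linreg_toy.intercept_)
```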

### **6.3.2. Ridge Regression**

Ridge regression extends linear regression by adding an L2 regularization term (a penalty on the squared magnitude of the coefficients), which reduces model complexity and helps prevent overfitting.

In [None]:
from sklearn.linear_model import Ridge

# Initialize the Ridge Regression model
# You can adjust the alpha parameter to control the amount of regularization
model = Ridge(alpha=1.0)

# Train the model on the training data
model.fit(x_train_scaled, y_train_scaled)

In [None]:
# Evaluate model
evaluate_regression_model(model, x_test_scaled, y_test_scaled)
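
To see what the `alpha` parameter does, the sketch below fits ridge models with increasing `alpha` on synthetic data (the data and the chosen `alpha` values are assumptions for illustration) and records the size of the coefficient vector, which shrinks as regularization gets stronger.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with known coefficients plus a little noise (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 5))
y_toy = X_toy @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.1, size=100)

alphas_toy = [0.01, 1.0, 100.0]
coef_norms = []
for a in alphas_toy:
    ridge_toy = Ridge(alpha=a).fit(X_toy, y_toy)
    # Euclidean norm of the coefficient vector
    coef_norms.append(np.linalg.norm(ridge_toy.coef_))

for a, norm in zip(alphas_toy, coef_norms):
    print(f"alpha={a}: coefficient norm = {norm:.4f}")
```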

### **6.3.3. Lasso Regression**

Lasso regression, like ridge regression, adds a regularization term, but it uses an L1 penalty (on the absolute values of the coefficients) that can shrink the weights of some features to exactly zero, thus performing feature selection.

In [None]:
from sklearn.linear_model import Lasso

# Initialize the Lasso Regression model
# You can adjust the alpha parameter to control the amount of regularization
model = Lasso(alpha=0.001)

# Train the model on the training data
model.fit(x_train_scaled, y_train_scaled)

In [None]:
# Evaluate model
evaluate_regression_model(model, x_test_scaled, y_test_scaled)
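
The feature-selection effect can be illustrated on synthetic data in which only two of eight inputs influence the target. The data and the value `alpha=0.1` are assumptions chosen for the illustration, not settings used for the solubility model.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first two of eight features matter (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 8))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

lasso_toy = Lasso(alpha=0.1).fit(X_toy, y_toy)

# Coefficients of the irrelevant features are driven to exactly zero
print(lasso_toy.coef_)
print("Selected features:", np.flatnonzero(lasso_toy.coef_))
```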

### **6.3.4. Elastic Net**

Elastic net combines ridge and lasso regression, using a mix of L1 and L2 regularization to improve model robustness.

In [None]:
from sklearn.linear_model import ElasticNet

# Initialize the Elastic Net model
# You can adjust the alpha and l1_ratio parameters to control the amount of regularization
# alpha controls the overall strength, while l1_ratio controls the balance between L1 and L2 regularization
model = ElasticNet(alpha=0.001, l1_ratio=0.5)

# Train the model on the training data
model.fit(x_train_scaled, y_train_scaled)

In [None]:
# Evaluate model
evaluate_regression_model(model, x_test_scaled, y_test_scaled)
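
Rather than adjusting `alpha` and `l1_ratio` by hand, one option is a cross-validated grid search. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data; the parameter grid and the data are assumptions for illustration, not tuned settings for the solubility task.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic data in [0, 1], mimicking min-max scaled inputs (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(150, 8))
y_toy = X_toy @ rng.normal(size=8) + rng.normal(scale=0.05, size=150)

# Hypothetical grid of candidate values
param_grid_toy = {
    'alpha': [0.0001, 0.001, 0.01, 0.1],
    'l1_ratio': [0.2, 0.5, 0.8],
}

# 5-fold cross-validation, scored by (negative) mean squared error
search_toy = GridSearchCV(
    ElasticNet(max_iter=10000),
    param_grid_toy,
    cv=5,
    scoring='neg_mean_squared_error',
)
search_toy.fit(X_toy, y_toy)

print(search_toy.best_params_)
```

The same pattern applies to the other models in this section by swapping the estimator and the parameter grid.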

### **6.3.5. K-Nearest Neighbors Regression**

KNN regression predicts the output based on the K nearest neighbors in the feature space, averaging their values to determine the final prediction.

In [None]:
from sklearn.neighbors import KNeighborsRegressor

# Initialize the KNN regression model
# You can adjust the number of neighbors (n_neighbors)
model = KNeighborsRegressor(n_neighbors=5)

# Train the model on the training data
model.fit(x_train_scaled, y_train_scaled)

In [None]:
# Evaluate model
evaluate_regression_model(model, x_test_scaled, y_test_scaled)
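
The averaging step can be checked by hand on a small example. The sketch below uses hypothetical one-dimensional data and compares the model's prediction with the mean target of the three nearest training points.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical 1D training data
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([0.0, 1.0, 4.0, 9.0, 100.0])

knn_toy = KNeighborsRegressor(n_neighbors=3).fit(X_toy, y_toy)
query_toy = np.array([[1.2]])

# Manual computation: find the 3 closest training points and average their targets
distances = np.abs(X_toy[:, 0] - query_toy[0, 0])
nearest_idx = np.argsort(distances)[:3]   # the points at x = 1, 2 and 0
knn_manual = y_toy[nearest_idx].mean()    # (1 + 4 + 0) / 3

print(knn_toy.predict(query_toy)[0], knn_manual)
```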

### **6.3.6. Decision Tree Regression**

Decision tree regression models make predictions by splitting data into subsets based on feature values, building a tree-like model of decisions.

In [None]:
from sklearn.tree import DecisionTreeRegressor

# Initialize the Decision Tree Regression model
# You can adjust various parameters like max_depth, min_samples_split, etc.
model = DecisionTreeRegressor(max_depth=5)

# Train the model on the training data
model.fit(x_train_scaled, y_train_scaled)

In [None]:
# Evaluate model
evaluate_regression_model(model, x_test_scaled, y_test_scaled)
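
Because every sample ends up in exactly one leaf, a regression tree produces piecewise-constant predictions, and `max_depth` bounds the number of leaves by 2 to the power of `max_depth`. The sketch below checks this on synthetic data (an assumption for illustration).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(200, 2))
y_toy = np.sin(4 * X_toy[:, 0]) + X_toy[:, 1]

depth_toy = 3
tree_toy = DecisionTreeRegressor(max_depth=depth_toy, random_state=0).fit(X_toy, y_toy)

n_leaves_toy = tree_toy.get_n_leaves()
distinct_preds = np.unique(tree_toy.predict(X_toy))

# At most 2**depth leaves, and at most one distinct prediction per leaf
print(f"Leaves: {n_leaves_toy}, distinct predicted values: {len(distinct_preds)}")
```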

### **6.3.7. Random Forest Regression**

Random forest regression improves upon decision tree regression by creating an ensemble of decision trees and averaging their predictions to reduce overfitting.

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Initialize the Random Forest Regression model
# You can adjust parameters like n_estimators (number of trees), max_depth, etc.
model = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=random_seed)

# Train the model on the training data
model.fit(x_train_scaled, y_train_scaled)

In [None]:
# Evaluate model
evaluate_regression_model(model, x_test_scaled, y_test_scaled)
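
The averaging of tree predictions can be verified directly, since a fitted `RandomForestRegressor` exposes its individual trees through the `estimators_` attribute. The data below are synthetic and serve only as an illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(200, 4))
y_toy = np.sin(4 * X_toy[:, 0]) + X_toy[:, 1] ** 2

forest_toy = RandomForestRegressor(n_estimators=20, max_depth=3, random_state=0)
forest_toy.fit(X_toy, y_toy)

# Predictions of each individual tree for the first five samples
tree_preds = np.stack([t.predict(X_toy[:5]) for t in forest_toy.estimators_])

# The forest prediction is the mean over the trees
forest_manual = tree_preds.mean(axis=0)

print(forest_toy.predict(X_toy[:5]))
print(forest_manual)
```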

### **6.3.8. Gaussian Process Regression**

Gaussian process regression is a probabilistic model that uses kernel functions to make predictions, providing not only point predictions but also uncertainty estimates.

In [None]:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel

# Initialize the Gaussian Process Regressor model
# You can adjust the kernel and other parameters as needed
kernel = DotProduct() + WhiteKernel()
model = GaussianProcessRegressor(kernel=kernel, random_state=random_seed)

# Train the model on the training data
model.fit(x_train_scaled, y_train_scaled)

In [None]:
# Evaluate model
evaluate_regression_model(model, x_test_scaled, y_test_scaled)
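
In scikit-learn, the uncertainty estimate is available by passing `return_std=True` to `predict`. The sketch below uses the same kernel as above on synthetic data (an assumption for illustration); the standard deviations refer to whatever scale the target was trained on.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel

# Synthetic, roughly linear data (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(50, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.05, size=50)

gpr_toy = GaussianProcessRegressor(kernel=DotProduct() + WhiteKernel(), random_state=0)
gpr_toy.fit(X_toy, y_toy)

# Predictive mean and standard deviation for the first five samples
mean_toy, std_toy = gpr_toy.predict(X_toy[:5], return_std=True)

for m, s in zip(mean_toy, std_toy):
    print(f"prediction = {m:.3f} +/- {s:.3f}")
```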

### **6.3.9. Support Vector Machine (SVM) Regression**

SVM regression, or Support Vector Regression (SVR), uses the SVM technique to model complex relationships between features and target variables, including both linear and non-linear interactions.

In [None]:
from sklearn.svm import SVR

# Initialize the SVM Regression model with a Gaussian (RBF) kernel
# You can adjust parameters like C (regularization parameter) and gamma (kernel coefficient)
model = SVR(kernel='rbf', C=1.0, gamma='scale', epsilon=0.1)

# Train the model on the training data
model.fit(x_train_scaled, y_train_scaled)

In [None]:
# Evaluate model
evaluate_regression_model(model, x_test_scaled, y_test_scaled)
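
The `epsilon` parameter defines a tube around the fitted function inside which errors are ignored; only points on or outside the tube become support vectors. The sketch below compares two `epsilon` values on a synthetic noisy sine curve (data and values are assumptions for illustration).

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic noisy sine curve (illustration only)
rng = np.random.default_rng(0)
X_toy = np.sort(rng.uniform(0, 6, size=(100, 1)), axis=0)
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=100)

n_support_toy = {}
for eps in [0.01, 0.5]:
    svr_toy = SVR(kernel='rbf', C=1.0, gamma='scale', epsilon=eps).fit(X_toy, y_toy)
    # support_ holds the indices of the training points used as support vectors
    n_support_toy[eps] = len(svr_toy.support_)

print(n_support_toy)
```

On this example, the wider tube leaves fewer points outside it, so the model relies on fewer support vectors.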

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 1</b></p>

1. **Load Data:** Load the lipophilicity dataset from the file `Lipophilicity.csv`. The output column is `lipophilicity`; the other 512 columns are the atom-pair fingerprints of the molecules, which are used to predict lipophilicity.

2. **Data Preprocessing:**
   - Reduce the number of inputs using method(s) of your choice; the number of reduced input columns should not be greater than 32.
   - Split the data into train and test sets (70:30)

3. **Model Training and Evaluation:** Train different ML models and evaluate their performance.

4. **Analysis:** Analyze the results and choose the best model based on its performance.

## **6.4. Classification Models**

In the following sections, we will evaluate different classification models to predict whether a compound can penetrate the blood-brain barrier:

***a. Import required libraries***

In [None]:
import math
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

***b. Load dataset***

In [None]:
# Load the BBBP (blood-brain barrier penetration) dataset
data_file_path = './datasets/BBBP.csv'
df = pd.read_csv(data_file_path)
df.head()

***c. Get the input and output columns***

In [None]:
x = df.drop('p_np', axis=1).to_numpy()
y = df['p_np'].to_numpy()
print(f'Shape of inputs: {x.shape}')
print(f'Shape of output: {y.shape}')

***d. Data preprocessing***

In [None]:
# Set the random seed
random_seed = 0
np.random.seed(random_seed)

# Reduce number of inputs with variance threshold and PCA
selector = VarianceThreshold(threshold=0.1)
x_reduced = selector.fit_transform(x)

pca = PCA(n_components=32)  # Reduce to 32 dimensions
x_reduced = pca.fit_transform(x_reduced)

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x_reduced, y, test_size=0.3, random_state=random_seed)
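
One way to judge whether a given number of principal components is enough is to look at the cumulative explained variance. The sketch below uses synthetic data built from three latent factors (an assumption for illustration, unrelated to the BBBP fingerprints).

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 20 features generated from 3 latent factors plus small noise
rng = np.random.default_rng(0)
latent_toy = rng.normal(size=(200, 3))
X_toy = latent_toy @ rng.normal(size=(3, 20)) + rng.normal(scale=0.01, size=(200, 20))

pca_toy = PCA(n_components=5).fit(X_toy)

# Fraction of the total variance captured by the first k components
cumulative_toy = np.cumsum(pca_toy.explained_variance_ratio_)
print(cumulative_toy)
```

Here the first three components capture almost all of the variance, which matches the way the data were generated.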

***e. Model training and evaluation***

### **6.4.1. Logistic Regression**

Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although there are extensions to handle multi-class problems.

In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
# You can adjust the 'C' parameter to control regularization strength
model = LogisticRegression(C=1.0, random_state=random_seed)

# Train the model on the training data
model.fit(x_train, y_train)
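
For a binary problem, the predicted probability of the positive class is the logistic (sigmoid) function applied to a linear combination of the inputs. The sketch below checks this against `predict_proba` on synthetic data (an assumption for illustration).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

logreg_toy = LogisticRegression(C=1.0, random_state=0).fit(X_toy, y_toy)

# Linear score z = w . x + b for the first five samples
z_toy = X_toy[:5] @ logreg_toy.coef_[0] + logreg_toy.intercept_[0]

# The logistic function maps the score to a probability in (0, 1)
proba_manual = 1.0 / (1.0 + np.exp(-z_toy))

print(proba_manual)
print(logreg_toy.predict_proba(X_toy[:5])[:, 1])
```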

In [None]:
# Define a function for evaluating the model
def evaluate_classification_model(model, x_test, y_test):
    # Predict on the test set
    y_pred = model.predict(x_test)
    y_pred_proba = model.predict_proba(x_test)[:, 1]  # Probability estimates for the positive class
    
    # Evaluate the model
    cm = confusion_matrix(y_test, y_pred)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
    roc_auc = auc(fpr, tpr)
    
    # Print classification metrics
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")
    
    # Display confusion matrix
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
    disp.plot(cmap="Blues")

    # Plot ROC Curve
    plt.figure()
    plt.plot(fpr, tpr, color='blue', lw=2, label='ROC Curve (AUC = {:.2f})'.format(roc_auc))
    plt.plot([0, 1], [0, 1], color='red', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend(loc='lower right')
    plt.show()

In [None]:
# Evaluate the model
evaluate_classification_model(model, x_test, y_test)
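
The classification metrics printed above all derive from the four entries of the confusion matrix. The sketch below uses hypothetical labels (not BBBP predictions) to compute them by hand and compare them with scikit-learn.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical true and predicted labels
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred_toy = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])

# For binary labels, ravel() returns the entries in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
precision_manual = tp / (tp + fp)   # how many predicted positives are correct
recall_manual = tp / (tp + fn)      # how many actual positives are found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(accuracy_manual, accuracy_score(y_true_toy, y_pred_toy))
print(precision_manual, precision_score(y_true_toy, y_pred_toy))
print(recall_manual, recall_score(y_true_toy, y_pred_toy))
print(f1_manual, f1_score(y_true_toy, y_pred_toy))
```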

### **6.4.2. Naive Bayes**

Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions between the features.

In [None]:
from sklearn.naive_bayes import GaussianNB

# Initialize the Gaussian Naive Bayes model
model = GaussianNB()

# Train the model on the training data
model.fit(x_train, y_train)

In [None]:
# Evaluate the model
evaluate_classification_model(model, x_test, y_test)
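
Under the independence assumption, `GaussianNB` only needs a mean and a variance per feature and per class, plus the class priors. The sketch below fits it on two synthetic Gaussian clusters (an assumption for illustration) and checks the stored means against the per-class sample means.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two synthetic Gaussian clusters, 50 samples each (illustration only)
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 2)),   # class 0
    rng.normal(3.0, 1.0, size=(50, 2)),   # class 1
])
y_toy = np.array([0] * 50 + [1] * 50)

nb_toy = GaussianNB().fit(X_toy, y_toy)

# theta_ holds the per-class feature means, class_prior_ the class frequencies
print(nb_toy.theta_)
print(nb_toy.class_prior_)
```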

### **6.4.3. K-Nearest Neighbors (KNN)**

KNN classification predicts the class of a data point based on the majority class among its k nearest neighbors. It's a simple, distance-based algorithm often used for its ease of interpretation.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Initialize the KNN classifier
# You can adjust the number of neighbors (n_neighbors)
model = KNeighborsClassifier(n_neighbors=5)

# Train the model on the training data
model.fit(x_train, y_train)

In [None]:
# Evaluate the model
evaluate_classification_model(model, x_test, y_test)

### **6.4.4. Decision Tree**

Decision tree classifiers make decisions by splitting data based on feature values, creating a tree-like model of decisions. They are intuitive and easy to interpret but can be prone to overfitting.

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Initialize the Decision Tree Classifier
# You can adjust parameters like max_depth, min_samples_split, etc.
model = DecisionTreeClassifier(max_depth=5, random_state=random_seed)

# Train the model on the training data
model.fit(x_train, y_train)

In [None]:
# Evaluate the model
evaluate_classification_model(model, x_test, y_test)
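
The interpretability of a decision tree can be seen by printing its learned rules with `export_text`. The sketch below uses synthetic data in which the class depends on a single feature; the feature names `f0`, `f1` and `f2` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data: the class depends only on the first feature (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(200, 3))
y_toy = (X_toy[:, 0] > 0.5).astype(int)

tree_clf_toy = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# Text representation of the decision rules
rules_toy = export_text(tree_clf_toy, feature_names=['f0', 'f1', 'f2'])
print(rules_toy)
```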

### **6.4.5. Random Forest**

Random forest classifiers improve upon decision trees by creating an ensemble of decision trees and aggregating their predictions to reduce overfitting and improve prediction accuracy.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest Classifier
# You can adjust parameters like n_estimators (number of trees), max_depth, etc.
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=random_seed)

# Train the model on the training data
model.fit(x_train, y_train)

In [None]:
# Evaluate the model
evaluate_classification_model(model, x_test, y_test)
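
A fitted random forest also reports how much each input contributed to its splits through `feature_importances_`. The sketch below uses synthetic data in which only the first of four features is informative (an assumption for illustration). When the inputs are PCA components, as in this section, the importances refer to components rather than to the original columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only the first feature determines the class (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(300, 4))
y_toy = (X_toy[:, 0] > 0.5).astype(int)

forest_clf_toy = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
forest_clf_toy.fit(X_toy, y_toy)

# Impurity-based importances, normalized to sum to 1
importances_toy = forest_clf_toy.feature_importances_
print(importances_toy)
```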

### **6.4.6. Gaussian Process Classifier**

Gaussian Process classifiers extend Gaussian processes to classification tasks, using kernel functions and Bayesian inference to predict categorical outcomes, often with uncertainty estimates.

In [None]:
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Initialize the Gaussian Process Classifier
# The choice of kernel can be important; RBF is a common choice
kernel = 1.0 * RBF(1.0)
model = GaussianProcessClassifier(kernel=kernel, random_state=random_seed)

# Train the model on the training data
model.fit(x_train, y_train)

In [None]:
# Evaluate the model
evaluate_classification_model(model, x_test, y_test)

### **6.4.7. Support Vector Machine (SVM)**

SVM classifiers construct hyperplanes in a multidimensional space to separate different classes with as wide a margin as possible. SVMs are effective in high-dimensional spaces and versatile with various kernel functions.

In [None]:
from sklearn.svm import SVC

# Initialize the SVM classifier with a Gaussian (RBF) kernel
# You can adjust parameters like C (regularization parameter) and gamma (kernel coefficient)
model = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=random_seed, probability=True)

# Train the model on the training data
model.fit(x_train, y_train)

In [None]:
# Evaluate the model
evaluate_classification_model(model, x_test, y_test)
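
The `probability=True` setting is needed here because `evaluate_classification_model` calls `predict_proba`. If only the ROC AUC is required, an alternative is to rank samples by the signed distance returned by `decision_function`, which works with the default `probability=False`. The sketch below shows this on synthetic data (an assumption for illustration).

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data with a simple linear class boundary (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

# probability is left at its default (False)
svc_toy = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X_tr, y_tr)

# Signed distance to the decision boundary; larger means more likely class 1
scores_toy = svc_toy.decision_function(X_te)
auc_toy = roc_auc_score(y_te, scores_toy)

print(f"ROC AUC from decision_function: {auc_toy:.4f}")
```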

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 2</b></p>

1. **Load Data:** Load the breast cancer dataset from the file `BreastCancer.csv`. The output column is `diagnosis`; the other columns are used as inputs.

2. **Data Preprocessing:**
   - Scale the input columns using a min-max scaler (the output column is a class label and does not need scaling)
   - Reduce the number of inputs using method(s) of your choice; the number of reduced input columns should not be greater than 16.
   - Split the data into train and test sets (70:30)

3. **Model Training and Evaluation:** Train different ML models and evaluate their performance.

4. **Analysis:** Analyze the results and choose the best model based on its performance.