# Lecture 05 notebook

**Date:** Jan 23, 2024

## Learning objectives

What you should be able to do after today's lecture.

1.  🧮 Define linear regression, its limitations, and objective function.
2.  🧮 Describe the purpose of loss functions in regression.
3.  🐍 Understand the conversion of data from a DataFrame to NumPy arrays.
4.  🐍 Develop hands-on programming skills for implementing regression in Python.
5.  🧮 Interpret the coefficients obtained through optimization and evaluate the model's performance.
6.  🐍 Visualize linear regression models and their fit to data.
7.  🧮 Discuss practical considerations for model interpretation, assumptions, and limitations.

## Readings

Relevant content for today's lecture.

-   [Plotting](../../../modules/intro/plotting/)
-   [Regression](../../../modules/intro/regression/)

## Imports

First, let's get all of our imports out of the way.

In [1]:
import sys

IN_COLAB = "google.colab" in sys.modules
if IN_COLAB:
    !pip install rdkit-pypi
    !pip install py3dmol

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import py3Dmol  # used by show_mol below
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from sklearn.linear_model import LinearRegression


def show_mol(smi, style="stick"):
    """Render an interactive 3D view of the molecule described by a SMILES string."""
    mol = Chem.MolFromSmiles(smi)
    mol = Chem.AddHs(mol)
    AllChem.EmbedMolecule(mol)
    AllChem.MMFFOptimizeMolecule(mol, maxIters=200)
    mblock = Chem.MolToMolBlock(mol)

    view = py3Dmol.view(width=500, height=500)
    view.addModel(mblock, "mol")
    view.setStyle({style: {}})
    view.zoomTo()
    view.show()

## Muddy points

The cell block below is for clarifying any muddy points we have today.

## Motivation

Today, we delve into the application of regression analysis in chemistry and data science.
Our dataset comprises pKa values and corresponding molecular descriptors, offering a quantitative approach to understanding molecular properties.


## pKa

The pKa measures how acidic a substance is.
It is the negative logarithm (base 10) of the acid dissociation constant (Ka) of that substance.
The pKa value quantifies the strength of an acid in solution.

The expression for the acid dissociation constant (Ka), from which pKa is derived, is given by the following chemical equilibrium equation for a generic acid (HA) in water:

$$
\text{HA} \rightleftharpoons \text{H}^+ + \text{A}^-
$$

The equilibrium constant (Ka) for this reaction is defined as the product of the concentrations of the dissociated ions ($\text{H}^+$ and $\text{A}^-$) divided by the concentration of the undissociated acid ($\text{HA}$):

$$
\text{Ka} = \frac{[\text{H}^+][\text{A}^-]}{[\text{HA}]}
$$

Taking the negative logarithm (base 10) of both sides of the equation gives the expression for pKa:

$$
\text{pKa} = -\log_{10}(\text{Ka})
$$

Because of the negative sign, a larger Ka gives a smaller pKa, and a difference of one pKa unit corresponds to a tenfold change in Ka.

In simpler terms:

-   A lower pKa indicates a stronger acid because it means the acid is more likely to donate a proton (H+) in a chemical reaction.
-   A higher pKa indicates a weaker acid as it is less likely to donate a proton.
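
As a quick numerical check of the definition above, here is a minimal sketch that converts Ka to pKa. The helper name `pka_from_ka` is made up for illustration, the acetic acid Ka is an approximate literature value, and the second acid is hypothetical.

```python
import math


def pka_from_ka(ka: float) -> float:
    """Return pKa = -log10(Ka) for a positive dissociation constant."""
    if ka <= 0:
        raise ValueError("Ka must be positive")
    return -math.log10(ka)


# Approximate literature Ka for acetic acid at 25 °C.
ka_acetic = 1.8e-5
# Hypothetical acid that dissociates 100 times more readily.
ka_stronger = 1.8e-3

pka_acetic = pka_from_ka(ka_acetic)
pka_stronger = pka_from_ka(ka_stronger)

print(f"acetic acid:   pKa = {pka_acetic:.2f}")
print(f"stronger acid: pKa = {pka_stronger:.2f}")
```

The hypothetical acid has a Ka that is 100 times larger, so its pKa comes out two units lower.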

The pKa is a crucial parameter in understanding the behavior of acids and bases in various chemical reactions.
It is commonly used in fields such as medicinal chemistry, biochemistry, and environmental science to describe and predict the behavior of molecules in solution.

## Molecular descriptors

Molecular descriptors are quantitative representations of chemical compounds that capture various structural, electronic, and physicochemical properties.
These descriptors are numerical values or sets of values that encode information about the characteristics of a molecule.
Molecular descriptors provide a structured and standardized way to quantify molecular features, facilitating the analysis and comparison of different molecules in chemical and computational studies.

Molecular descriptors can include a wide range of information, such as:

-   Structural Descriptors: These describe the geometry and connectivity of atoms within a molecule. Examples include molecular weight, size, and shape descriptors.
-   Topological Descriptors: These capture information about the connectivity of atoms in a molecular structure, often expressed as graphs or matrices.
-   Electronic Descriptors: These reflect the electronic properties of a molecule, including features related to electron distribution, charge, and orbital energies.
-   Physicochemical Descriptors: These encompass properties such as solubility, partition coefficients, and melting points, providing insights into the physical behavior of the molecule.
-   Quantum Chemical Descriptors: These are derived from quantum mechanical calculations and provide detailed information about a molecule's electronic structure and energetics.

Molecular descriptors are crucial in quantitative structure-activity relationship (QSAR) studies, computational chemistry, and drug design.
By converting complex molecular structures into numerical values, researchers can apply statistical and computational techniques to analyze and model the relationships between molecular features and various properties or activities.


## Exploring the dataset

Now, let's apply this background to real data.

I found [this dataset](https://github.com/IUPAC/Dissociation-Constants) of high-quality experimental pKa measurements.
The raw data contains extra information that makes regression a bit of a nightmare, so I cleaned it and computed some molecular features (i.e., descriptors) that we can use.
Before fitting any model, we should look at the data: how the pKa values are distributed and how they relate to the descriptors.

### Loading

Using the Pandas library, read the CSV file into a DataFrame (`df`).
Use the `CSV_PATH` variable defined in the cell below.

In [2]:
CSV_PATH = "https://gitlab.com/oasci/courses/pitt/biosc1540-2024s/-/raw/main/biosc1540/files/csv/pka/pka_desc_selected.csv"

### Take a look

Print the first few rows of the DataFrame to check if the data has been successfully loaded.
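
Loading and inspecting the data might look like the sketch below. So that it runs without network access, it reads a tiny in-memory CSV whose rows are made-up placeholders (not real measurements). In the notebook you would pass `CSV_PATH` to `pd.read_csv` instead. Only the column names come from the course dataset.

```python
import io

import pandas as pd

# Made-up placeholder rows; in the notebook use pd.read_csv(CSV_PATH) instead.
csv_text = """pka_value,MaxEStateIndex,SPS
4.8,9.0,8.0
9.9,8.6,9.3
10.6,4.9,11.0
"""

df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default (here there are only three).
print(df.head())
```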

### Histograms

Use Matplotlib's `plt.subplots()` function to create a single subplot.
Assign the returned figure and axis objects to variables (`fig` and `ax`, respectively).

Utilize the `ax.hist()` function to plot a histogram.
Use the `"pka_value"` column from the DataFrame (`df`) as the data, and set the number of bins to `100`.
Choose a `color` for the bars.

Set labels for the x-axis and y-axis using `ax.set_xlabel()` and `ax.set_ylabel()` functions.

Finally, use `plt.show()` to display the histogram.
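
The steps above can be sketched as follows. The DataFrame here is a synthetic stand-in filled with random numbers so the example runs on its own. The color and axis labels are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the course data: random values, not real pKa measurements.
rng = np.random.default_rng(0)
df = pd.DataFrame({"pka_value": rng.normal(7.0, 3.0, size=500)})

fig, ax = plt.subplots()

# ax.hist returns the bin counts, the bin edges, and the drawn bar patches.
counts, bin_edges, _ = ax.hist(df["pka_value"], bins=100, color="steelblue")

ax.set_xlabel("pKa")
ax.set_ylabel("Count")
plt.show()
```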

### Scatter

Use Matplotlib's `plt.subplots()` function to create a single subplot.
Assign the returned figure and axis objects to variables (`fig` and `ax`, respectively).
Utilize the `ax.scatter()` function to plot a scatter plot.
Specify the x-axis data as `"MaxEStateIndex"` and the y-axis data as `"pka_value"` from your DataFrame (`df`).
Choose a color for the scatter points.

`MaxEStateIndex` comes from the electrotopological state (E-state) method, which assigns a numerical value (an E-state index) to each atom in a molecule, representing the atom's electronic state.
The E-state indices are calculated based on the atom's atomic number, hybridization, and the types of neighboring atoms.
The `MaxEStateIndex` is then the maximum E-state index among all atoms in the molecule.

Use Matplotlib's `plt.subplots()` function to create a single subplot.
Assign the returned figure and axis objects to variables (`fig` and `ax`, respectively).
Utilize the `ax.scatter()` function to plot a scatter plot.
Specify the x-axis data as `"SPS"` and the y-axis data as `"pka_value"` from your DataFrame (`df`).
Choose a color for the scatter points.

Finally, compare the two descriptors with each other.
Create another single subplot and use the `ax.scatter()` function with `"SPS"` as the x-axis data and `"MaxEStateIndex"` as the y-axis data from your DataFrame (`df`).

The spacial score (SPS) is an empirical score that expresses the spatial complexity of a compound in a uniform manner and on a highly granular scale, which allows ranking and comparison between molecules.
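
The three scatter plots described in this section can be sketched with one small helper. The helper name `scatter_columns` is made up for this example, and the DataFrame is a synthetic stand-in filled with random numbers, so the plots will show no real trend.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the course DataFrame: random values, not real measurements.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "MaxEStateIndex": rng.uniform(2.0, 14.0, size=200),
        "SPS": rng.uniform(5.0, 40.0, size=200),
        "pka_value": rng.normal(7.0, 3.0, size=200),
    }
)


def scatter_columns(data, x_col, y_col, color="tab:blue"):
    """Scatter one DataFrame column against another on a single subplot."""
    fig, ax = plt.subplots()
    ax.scatter(data[x_col], data[y_col], color=color, s=10)
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    return fig, ax


# The same (x, y) column pairs as in the instructions above.
pairs = [
    ("MaxEStateIndex", "pka_value"),
    ("SPS", "pka_value"),
    ("SPS", "MaxEStateIndex"),
]
axes = [scatter_columns(df, x_col, y_col)[1] for x_col, y_col in pairs]
plt.show()
```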

## Prediction

Our objective is to predict pKa values using a machine learning model.
In order to do this, we need to prepare our data for training and testing.
Identify the features and target variable.
In this case, the features are the molecular descriptors, and the target variable is the pKa value.
Create separate dataframes for `df_features` and `df_targets`.

The target variable (`df_targets`) and features (`df_features`) need to be converted into NumPy arrays for compatibility with machine learning models.
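
One way to split and convert the data is sketched below, again on a synthetic stand-in DataFrame. It assumes every column other than `"pka_value"` is a feature, and it selects the target as a single column so that the resulting array is one-dimensional.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the course DataFrame: random values, not real measurements.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "MaxEStateIndex": rng.uniform(2.0, 14.0, size=50),
        "SPS": rng.uniform(5.0, 40.0, size=50),
        "pka_value": rng.normal(7.0, 3.0, size=50),
    }
)

# Features: every column except the target. Target: the pKa column.
df_features = df.drop(columns=["pka_value"])
df_targets = df["pka_value"]

# to_numpy() returns the underlying values as NumPy arrays.
features = df_features.to_numpy()  # shape (n_samples, n_features)
targets = df_targets.to_numpy()    # shape (n_samples,)

print(features.shape, targets.shape)
```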

### Linear model

Create an instance of the `LinearRegression` model from scikit-learn.
Use the `fit` method to train the linear regression model with your features and target variable.

Use the `score` method to calculate and print the coefficient of determination ($R^2$) of the linear regression model.
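
A minimal sketch of fitting and scoring is shown below. The `features` and `targets` arrays are synthetic: the targets are built from a known linear rule plus a little noise, so a near-perfect $R^2$ is expected here and says nothing about the real pKa data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following targets = 7.0 + 1.5*x1 - 0.5*x2 + noise.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.1, size=200)
targets = 7.0 + 1.5 * features[:, 0] - 0.5 * features[:, 1] + noise

model = LinearRegression()
model.fit(features, targets)

# score() returns R^2 computed on the data passed to it.
r2 = model.score(features, targets)
print(f"R^2 = {r2:.3f}")
```

Note that scoring on the same data used for fitting measures how well the model fits that data, not how well it predicts unseen molecules.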

Use the `predict` method to generate predictions (`predictions`) from the trained linear regression model and the input `features`.
Create a scatter plot where the x-axis represents the actual pKa values (`targets`) and the y-axis represents the predicted pKa values (`predictions`).

Points close to the diagonal line indicate accurate predictions, while deviations suggest discrepancies between actual and predicted values.
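
The plot of predicted versus actual values can be sketched as follows, using the same kind of synthetic data as a stand-in for the real arrays. The dashed diagonal is an addition for reference and marks where predicted equals actual.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: a known linear rule plus noise, not real measurements.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
targets = 7.0 + 1.5 * features[:, 0] - 0.5 * features[:, 1] + noise

model = LinearRegression().fit(features, targets)
predictions = model.predict(features)

fig, ax = plt.subplots()
ax.scatter(targets, predictions, s=10, color="tab:blue")

# Reference diagonal spanning the combined range of both axes.
low = min(targets.min(), predictions.min())
high = max(targets.max(), predictions.max())
ax.plot([low, high], [low, high], linestyle="--", color="gray")

ax.set_xlabel("Actual pKa")
ax.set_ylabel("Predicted pKa")
plt.show()
```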

Extract the names of the features from the columns of `df_features`.
Obtain the fitted coefficients from the trained linear regression model (its `coef_` attribute).

Use `ax.barh()` to create a horizontal bar graph. Feature names are on the y-axis, and corresponding coefficients are on the x-axis.
Add a vertical dashed line (`ax.axvline()`) at zero for reference.
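
A sketch of the coefficient plot is given below. The data is synthetic, so the coefficient values are a property of the made-up rule used to generate it and not of the real descriptors.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in features; the column names match the course dataset.
rng = np.random.default_rng(0)
df_features = pd.DataFrame(
    {
        "MaxEStateIndex": rng.normal(size=200),
        "SPS": rng.normal(size=200),
    }
)
features = df_features.to_numpy()
noise = rng.normal(scale=0.1, size=200)
targets = 7.0 + 1.5 * features[:, 0] - 0.5 * features[:, 1] + noise

model = LinearRegression().fit(features, targets)

feature_names = list(df_features.columns)
coefficients = model.coef_  # one coefficient per feature, in column order

fig, ax = plt.subplots()
ax.barh(feature_names, coefficients, color="tab:blue")
ax.axvline(0, color="black", linestyle="--")
ax.set_xlabel("Coefficient")
plt.show()
```

Coefficient magnitudes are only directly comparable when the features are on similar scales, because each coefficient is expressed in units of pKa per unit of its feature.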