# Assignment 2

## Pre-Questions

*In accordance with the class policy, if any AI tools are used in completion of any part of this assignment they must be acknowledged. Failure to acknowledge usage constitutes academic dishonesty.*

1. What are the key steps needed to analyze and process your data before applying any machine learning algorithm?

2. Explain what is meant by a model (a) overfitting and (b) underfitting the data. 

3. Explain the difference between a linear regression model and a multiple linear (multi-linear) regression model. Which do you expect to be more accurate, and why or under what circumstances?

## Machine Learning Model

In this part of the assignment you will load and explore a new data set and then build a machine learning model for it. The data come from high-quality quantum chemistry calculations on chemical reactions. The data set consists of molecular identifiers for the reactant and product molecules, the activation energy barrier (dE0, in kcal/mol), and the reaction enthalpy (dHrxn298, in kcal/mol). The original published data can be found here: [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.13328872.svg)](https://doi.org/10.5281/zenodo.13328872)

All of the commands needed for completing this activity can be found in the in-class activities. 

Start by installing potentially useful libraries:


In [None]:
pip install sweetviz

In [None]:
pip install pycaret

In [None]:
## Restart kernel to load libraries properly
from IPython import get_ipython

if get_ipython():
    get_ipython().kernel.do_shutdown(restart=True)

In [None]:
## load libraries and functions...not all need to be used...you can also use others as desired
import pandas as pd                 # for data manipulation
import seaborn as sns               # for data visualization
import matplotlib.pyplot as plt     # for data visualization
import numpy as np                  # for numerical operations
import sweetviz as sv               # for fast exploratory data analysis (eda)

from pycaret.regression import *    # for comparing ML models

from rdkit import Chem              # for calculating cheminformatics properties of molecules
from rdkit.Chem import Descriptors  # for determining chemical descriptors
from rdkit.Chem import Crippen      # for calculating logP (cLogP)
from rdkit.Chem import PandasTools  # for displaying molecules
PandasTools.RenderImagesInAllDataFrames(images=True) # Ensures molecules are rendered in the notebook

from sklearn.preprocessing import StandardScaler            # for scaling the data
from sklearn.model_selection import train_test_split        # for splitting the data into training and testing sets
from sklearn.model_selection import cross_val_score, KFold  # for K-fold cross-validation
from sklearn.linear_model import LinearRegression           # for creating a linear regression model
from sklearn.ensemble import RandomForestRegressor          # for creating a random forest regression model
from sklearn.metrics import mean_squared_error, r2_score    # for evaluating the model
from sklearn.pipeline import make_pipeline                  # for building operational pipelines

1) Load the ccsdtf12_dz.csv file into a pandas dataframe

2) Check for missing values and perform an exploratory data analysis of the numerical parameters

3) Add columns containing the structures of the reactant and product molecules, based on the reactant (rsmi) and product (psmi) SMILES strings, to the original dataframe. Remove any columns with incomplete data.

4) Add molecular descriptors from rdkit that may be useful for predicting the reaction barriers or reaction enthalpies. Note that you have both reactant and product molecules.

5) Create a simplified dataframe containing only the numerical data.

6) Split the numerical data into a training and testing set

7) Scale features or perform other feature engineering as appropriate

8) Select a machine learning model (explain your choice)

9) Train your machine learning model

10) How does your model perform? Comment on the R2 and MSE/RMSE values. Suggest a method for improving the model

11) Try at least one thing to improve your model or choose a different model. Comment on whether it actually led to an improvement or not.
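
As a reminder of the general pattern behind steps 6 through 10, the sketch below runs a split, scale, train, and evaluate cycle on synthetic data. It is not a solution to the assignment: the feature matrix `X`, the target `y`, the 80/20 split, and the choice of `LinearRegression` are all assumptions made for illustration, and your own descriptors, target column, and model will differ. Placing `StandardScaler` inside a pipeline is one way to make sure the scaling parameters are learned from the training set only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the assignment data): 200 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Hold out 20% of the samples for testing (assumed split ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training data only, then the model.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# Evaluate on the held-out test set.
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE has the same units as the target
r2 = r2_score(y_test, y_pred)

print(f"R2 = {r2:.3f}, MSE = {mse:.4f}, RMSE = {rmse:.4f}")
```

For the real data, RMSE is reported in the units of the target (kcal/mol for dE0 or dHrxn298), which makes it easier to interpret than MSE. Comparing the same metrics on the training and test sets can also hint at overfitting or underfitting.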