Data and context can be found in this paper: https://pubs.acs.org/doi/epdf/10.1021/acs.chemmater.8b01425?ref=article_openPDF
This project explores the data and predicts the methane uptake capacity of covalent organic frameworks (COFs) using a regression model.
First, I visualised the data to understand its trends and outliers. This included:
- a report of the min and max values and of all categorical variables
- boxplots of continuous values
- histograms of discrete values
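
The exploratory step above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the toy rows, the column names other than the ideas they stand for, and the bond type labels are all hypothetical placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd


def summarise(df: pd.DataFrame) -> dict:
    """Report min/max of numeric columns and the levels of categorical columns."""
    numeric = df.select_dtypes(include="number")
    categorical = df.select_dtypes(exclude="number")
    return {
        "min": numeric.min().to_dict(),
        "max": numeric.max().to_dict(),
        "categories": {c: sorted(categorical[c].unique()) for c in categorical.columns},
    }


# Hypothetical toy rows standing in for the COF dataset.
toy = pd.DataFrame({
    "supercell_volume": [1200.0, 3400.5, 2210.3, 980.7],  # continuous
    "n_atoms": [120, 340, 220, 96],                       # discrete
    "bond_type": ["imine", "boronate", "imine", "amide"], # categorical
})

report = summarise(toy)
print(report)

# Boxplot for a continuous column, histogram for a discrete one.
fig, (ax_box, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))
toy.boxplot(column="supercell_volume", ax=ax_box)
toy["n_atoms"].plot.hist(ax=ax_hist, bins=4)
```

On the real data the same function would be called on the full DataFrame, and the plotting lines would loop over the continuous and discrete columns.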
 
The data was then plotted with sns.relplot() to show the relationship of each predictor to the response (y = AbsMU_high_P_[molec/unit_cell]), color-coded by bond type.
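
One way such a plot might be built is sketched below. It assumes the data is first melted into long format so that each predictor gets its own panel; the predictor names, values, and bond type labels are hypothetical, and only the response column name comes from the text above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import pandas as pd
import seaborn as sns

TARGET = "AbsMU_high_P_[molec/unit_cell]"

# Hypothetical toy rows; predictor names and values are placeholders.
df = pd.DataFrame({
    "supercell_volume": [1200.0, 3400.5, 2210.3, 980.7],
    "density": [0.45, 0.21, 0.33, 0.52],
    "bond_type": ["imine", "boronate", "imine", "amide"],
    TARGET: [35.0, 88.0, 61.0, 29.0],
})

# Long format: one row per (sample, predictor) pair.
long_df = df.melt(
    id_vars=[TARGET, "bond_type"],
    value_vars=["supercell_volume", "density"],
    var_name="predictor",
    value_name="value",
)

# One scatter panel per predictor, points colored by bond type.
grid = sns.relplot(
    data=long_df,
    x="value",
    y=TARGET,
    hue="bond_type",
    col="predictor",
    facet_kws={"sharex": False},  # predictors are on different scales
)
```

Sharing the y axis while freeing the x axis keeps the response comparable across panels even though the predictors have different units.
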
The data was then split into X and y, and a random forest was used to rank feature importance based on mean decrease in impurity. This was done to reduce the dimensionality from p = 1116. A threshold of 0.001 was used to choose the important features, with supercell volume being the most important.
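
A minimal sketch of this selection step is shown below, using synthetic data from make_regression in place of the real descriptor matrix. The sample count, feature count, and random seeds are assumptions for illustration; only the 0.001 threshold comes from the text.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the real data (the project has p = 1116 features).
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=0.1, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# feature_importances_ holds the impurity-based importances; they sum to 1.
importances = forest.feature_importances_

THRESHOLD = 0.001  # value quoted in the text
keep = np.flatnonzero(importances >= THRESHOLD)
X_reduced = X[:, keep]

# Feature indices sorted from most to least important.
ranked = np.argsort(importances)[::-1]
print(f"kept {keep.size} of {X.shape[1]} features; top feature index: {ranked[0]}")
```

Because the importances sum to 1, a fixed threshold removes a larger share of features as p grows, so 0.001 is far more selective at p = 1116 than in this 50-feature toy example.
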
Several algorithms were then compared. The following algorithms (from sklearn unless noted) were trialled with default parameters using RepeatedKFold cross-validation (n_splits = 5, n_repeats = 10):
- Linear Regression
- Decision Tree
- ensemble methods:
  - Random Forest
  - AdaBoost
  - Bagging
  - GradientBoosting
  - XGBoost (from the XGBoost library)
- SVR
- KNeighbors
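
The comparison loop could look roughly like the sketch below. It covers only a subset of the listed algorithms and runs on synthetic data; the dataset, the random seeds, and the choice of RMSE as the printed score are assumptions, while the RepeatedKFold settings follow the text.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the reduced feature matrix.
X, y = make_regression(n_samples=120, n_features=10, noise=5.0, random_state=0)

# Default parameters apart from fixed seeds for reproducibility.
models = {
    "LinearRegression": LinearRegression(),
    "DecisionTree": DecisionTreeRegressor(random_state=0),
    "RandomForest": RandomForestRegressor(random_state=0),
    "KNeighbors": KNeighborsRegressor(),
}

# 5 folds repeated 10 times gives 50 train/validation splits per model.
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)

results = {}
for name, model in models.items():
    scores = cross_validate(
        model, X, y, cv=cv,
        scoring="neg_root_mean_squared_error",
        return_train_score=True,
    )
    # sklearn returns negated errors, so flip the sign back.
    results[name] = {
        "train_rmse": -scores["train_score"].mean(),
        "val_rmse": -scores["test_score"].mean(),
    }

for name, r in results.items():
    print(f"{name:>16}: train RMSE {r['train_rmse']:.2f}, "
          f"validation RMSE {r['val_rmse']:.2f}")
```

Reusing the same seeded cv object for every model means all algorithms are scored on identical splits, which makes the averages directly comparable.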
 
Evaluation of each algorithm includes:
- Metrics (average train and validation scores are printed):
  - mean_absolute_error
  - mean_squared_error
  - root_mean_squared_error
- Plots:
  - Learning curves (scoring = RMSE)
  - Prediction plots of simulated vs. predicted data
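
For reference, the three metrics can be written out directly with NumPy, as in the sketch below. The helper name regression_metrics and the example values are illustrative and not part of the project code.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE and RMSE for two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true
    mse = np.mean(errors ** 2)
    return {
        "mae": float(np.mean(np.abs(errors))),  # mean absolute error
        "mse": float(mse),                      # mean squared error
        "rmse": float(np.sqrt(mse)),            # root mean squared error
    }


# Errors are -1, 0 and +2, so MAE = 1 and MSE = 5/3.
metrics = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(metrics)
```

RMSE is in the same units as the response, which makes it the easiest of the three to read against the uptake values themselves.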
 
 
Random Forest had the best performance, so this algorithm was selected. Its hyperparameters were tuned using Optuna, and the tuned model was evaluated on a held-out test set.
- Do a more in-depth search with classification:
  - multinomial classification of qualitative values
    - bond_type (K=5)
    - parent network (K=309)
- evaluation of 2D and 3D COF
- unsupervised learning
  - clustering
 
- Curate a large dataset
- Train an ML algorithm to predict the target property
- Select the optimal algorithm for the material representation
- Validate the algorithm
- Develop an assessment protocol informed by the construction of the model
 
