# MolSim 2020: ML for Gas Adsorption

In this exercise, we will build a model that can predict the CO$_2$ uptake of MOFs.

## Import the packages we will need

In [10]:
# basics 
import os 
import numpy as np 

# data
import pandas as pd 
import pandas_profiling

# machine learning 
# scaling of data
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# train/test split
from sklearn.model_selection import train_test_split
# model selection 
from sklearn.model_selection import GridSearchCV
# model
from sklearn.kernel_ridge import KernelRidge
# pipeline 
from sklearn.pipeline import Pipeline
# PCA
from sklearn.decomposition import PCA

# plotting 
import matplotlib.pyplot as plt 
%matplotlib inline 
import seaborn as sns

# for interactive plots, you can try to use holoviews
import holoviews as hv
hv.extension('bokeh')

RANDOM_SEED = 4242424242

In [None]:
TARGET

In [None]:
FEATURES

## Import the data

In [None]:
df = pd.read_csv()

Take a quick look at it to make sure that everything looks fine.

In [None]:
df.info()

### Split the data

Before doing any analysis or transformation on the data, we split it into two disjoint sets. 
For more on why this is important, see [chapter 7.10.2 of Elements of Statistical Learning](https://web.stanford.edu/~hastie/ElemStatLearn//printings/ESLII_print10.pdf). In a nutshell, we want to avoid *any* data leakage, including leakage through feature selection.
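
To make the leakage point concrete, here is a minimal sketch on synthetic data. The array `X`, its shape, and the seed `0` are made up for illustration and are not part of the exercise data. The scaler is fitted on the training split only and then applied unchanged to the test split, so no statistics of the test set enter the preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature matrix (assumption, not the MOF data)
rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Split first, before any statistics are computed
X_train, X_test = train_test_split(X, train_size=0.7, test_size=0.3, random_state=0)

# Fit the scaler on the training data only ...
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
# ... and reuse the training mean and standard deviation for the test data
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # close to zero by construction
print(X_test_scaled.mean(axis=0))   # generally not exactly zero
```

The same pattern applies to any fitted transformation, such as feature selection or PCA: fit on the training set, then only transform the test set.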

In [None]:
df_train, df_test = train_test_split(df, train_size=0.7, 
                                         test_size=0.3, 
                                         random_state=RANDOM_SEED) 

#### Split with stratification

If there are imbalanced classes, we want to make sure that random sampling does not distort the class distributions. 
For example, if we had only very few good materials in our dataset, an unlucky random split could leave nearly none of them in the test set. 

Stratification ensures that the class distribution (the ratio of good to bad materials) is the same in the training and test sets.

Later, we will explore the effect of this stratification in more detail.

For stratification to work, we need categories. Currently, our target is continuous, i.e., we have to binarize it. For this, we create a column that uses 0 and 1 to encode the category to which a material belongs (hint: you can use [pd.cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html), list comprehension, the [binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html#sklearn.preprocessing.Binarizer) ...) 
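
As an illustration of the hint, the following sketch binarizes a toy column in two equivalent ways. The column name `uptake`, the values, and the threshold of `2.0` are hypothetical; you will have to choose a sensible threshold for the actual target yourself.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): a continuous target in arbitrary units
toy = pd.DataFrame({"uptake": [0.5, 1.2, 3.4, 0.8, 2.9, 4.1]})
threshold = 2.0  # hypothetical cut-off between "bad" (0) and "good" (1)

# Option 1: a plain comparison, cast to integers
toy["binned_compare"] = (toy["uptake"] > threshold).astype(int)

# Option 2: pd.cut with two bins, (-inf, threshold] -> 0 and (threshold, inf) -> 1
toy["binned_cut"] = pd.cut(
    toy["uptake"], bins=[-np.inf, threshold, np.inf], labels=[0, 1]
).astype(int)

print(toy)
```

Both columns contain the same labels; `pd.cut` becomes more convenient once more than two bins are needed.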

In [None]:
df['target_binned'] = # add your code

In [None]:
# stratify expects the array of class labels, not the name of the column
df_train_stratified, df_test_stratified = train_test_split(df, train_size=0.7, 
                                                                test_size=0.3, 
                                                                random_state=RANDOM_SEED, 
                                                                stratify=df['target_binned']) 
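
To see what stratification does, the sketch below uses a toy frame with 10 % "good" materials. The data and the seed `0` are made up for illustration. After the stratified split, the fraction of positives is the same in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (assumption): 10 "good" (1) and 90 "bad" (0) materials
toy = pd.DataFrame({
    "feature": np.arange(100),
    "target_binned": [1] * 10 + [0] * 90,
})

train, test = train_test_split(
    toy,
    train_size=0.7,
    test_size=0.3,
    random_state=0,
    stratify=toy["target_binned"],  # labels used to preserve the class ratio
)

# Fraction of "good" materials in each split
print(train["target_binned"].mean(), test["target_binned"].mean())
```

Without the `stratify` argument, the number of positives in the test set would fluctuate from seed to seed.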

## Exploratory data analysis 

Now that we have put aside data that we won't touch, we can take a closer look at the training set.

### Distribution of target property

Let's plot the distribution of the target property. What do you observe?

[seaborn's distplot](https://seaborn.pydata.org/generated/seaborn.distplot.html), [HoloViews' Distribution](http://holoviews.org/reference/elements/matplotlib/Distribution.html), or [matplotlib's hist](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.hist.html) can all do this; try them out if you want!

In [None]:
# use only the training set, so that the test set stays untouched
hv.Distribution(df_train[TARGET])

### Correlations

- Plot some features against the target properties and calculate the Pearson and Spearman correlation coefficients
- What are the strongest correlations? 
- Are they different for the different targets, and does this correspond to what you would expect?
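
As a reminder of how the two coefficients differ, here is a small sketch on toy data. The column names and values are assumptions for illustration. Pearson measures linear association, while Spearman is the Pearson correlation of the ranks and therefore captures any monotonic relationship.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): a monotonic but non-linear relationship
toy = pd.DataFrame({"feature": np.arange(1, 11, dtype=float)})
toy["target"] = toy["feature"] ** 3

# Pearson: linear correlation of the raw values
pearson = toy["feature"].corr(toy["target"])

# Spearman: Pearson correlation of the ranks
spearman = toy["feature"].rank().corr(toy["target"].rank())

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

For a whole table, `DataFrame.corr(method="pearson")` and `DataFrame.corr(method="spearman")` compute all pairwise coefficients at once.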

### Visualization

The output of `df_train.shape` will show you that our data is high-dimensional. 
Such data is hard to visualize.