# Usage of the data4cat module

For convenience, for example for use in lectures, datasets from the central NFDI4Cat repository (Dataverse) were wrapped into modules. The convenience functions are meant to give a smooth start to working with published remote data. Datasets included so far:

* The BasCat DinoRun dataset on the syngas-to-ethanol reaction

## Installation of the data4cat module

For the installation you can clone or download the repository:
```
git clone https://github.com/nfdi4cat/data4cat.git
```
Change into the directory and install data4cat:

```
pip install .
```
Or you can directly install the module from the remote source:
```
python -m pip install git+https://github.com/nfdi4cat/data4cat.git@main
```
To uninstall, run:
```
pip uninstall data4cat
```
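
To confirm that the installation worked, a small check along the following lines may help. It is only a sketch using the standard library, and it assumes that the installed distribution is named `data4cat`, as the `pip uninstall` command above suggests. The helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "data4cat" is assumed to be the distribution name used by pip.
    found = installed_version("data4cat")
    print(f"data4cat version: {found}" if found else "data4cat is not installed")
```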

With the package installed you first need to import the module:

In [1]:
from data4cat import dino_run

And create an instance:

In [2]:
dinodat = dino_run.dino_offline()

Instance created


Both steps above are required every time you use the module.

## The dino_run dataset from the NFDI4Cat Dataverse instance

The dino_run dataset is the BasCat performance dataset on the syngas-to-ethanol reaction.

### Download the dino_run dataset 

If no offline version of the dataset is available (e.g. after a fresh install), a copy of the dataset can be downloaded like this:

In [3]:
dinodat.one_shot_dumb()

Dictionary created
Dictionary read
Dataset downloaded
Stored dataset
Done
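
If the download should only run when no local copy exists, a guard like the one below could be used. This is a generic sketch: the helper `ensure_local_copy` is hypothetical and not part of data4cat, and the location where data4cat stores its offline copy is not stated here, so the path has to be supplied by the caller.

```python
from pathlib import Path
from typing import Callable


def ensure_local_copy(path: Path, download: Callable[[], None]) -> bool:
    """Call `download` only if `path` does not exist yet.

    Returns True if a download was triggered, False if a copy was already there.
    """
    if path.exists():
        return False
    download()
    return True


# Possible usage (the path is a placeholder, not the real storage location):
# ensure_local_copy(Path("path/to/offline/copy"), dinodat.one_shot_dumb)
```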


### Create a dataset from the offline data

You can get the data either as a pandas DataFrame or as a Bunch object in the style of scikit-learn datasets. The original data can be retrieved as follows:

In [4]:
original = dinodat.original_data()

In [5]:
original.head()

| | Clock | Experiment | xCO | xH2 | Temperature | Vflow | Pressure | TOS | Reactor | X_CO | ... | S_Butanal | S_Acetic acid | S_Methyl acetate | S_Ethyl acetate | S_Unknown | S_CO2 | S_C2_oxy | S_C2_p_oxy | S_C2_p_HCs | C-balance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01-07-2020 20:11:45,85 | 1 | 0.2 | 0.6 | 260 | 41.7 | 54 | 4 | 1 | 12.8 | ... | 0.65 | 11.63 | 0.22 | 0.55 | 0 | 0.34 | 22.05 | 22.82 | 15.66 | 94.821057 |
| 1 | 01-07-2020 23:40:47,85 | 1 | 0.2 | 0.6 | 260 | 41.7 | 54 | 8 | 1 | 10.34 | ... | 0.62 | 12.9 | 0.3 | 0.63 | 0 | 0.49 | 24.33 | 25.26 | 14.85 | 93.196889 |
| 2 | 02-07-2020 03:09:42,85 | 1 | 0.2 | 0.6 | 260 | 41.7 | 54 | 11 | 1 | 9.42 | ... | 0.6 | 14.08 | 0.37 | 0.69 | 0 | 0.58 | 26.24 | 27.3 | 14.13 | 96.114929 |
| 3 | 02-07-2020 06:38:32,85 | 1 | 0.2 | 0.6 | 260 | 41.7 | 54 | 15 | 1 | 8.39 | ... | 0.58 | 14.61 | 0.41 | 0.77 | 0 | 0.64 | 27.25 | 28.43 | 13.65 | 93.683447 |
| 4 | 02-07-2020 10:07:29,85 | 1 | 0.2 | 0.6 | 260 | 41.7 | 54 | 18 | 1 | 8.1 | ... | 0.57 | 14.81 | 0.46 | 0.78 | 0 | 0.69 | 27.95 | 29.19 | 13.6 | 97.105808 |
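
The `Clock` values use a comma as decimal separator for the seconds, so they may need converting before time arithmetic. The sketch below uses only the standard library and assumes a day-month-year order, which is inferred from the displayed rows (the first field advances when the time passes midnight). The helper name `parse_clock` is made up for this example.

```python
from datetime import datetime


def parse_clock(value: str) -> datetime:
    """Parse a Clock string such as '01-07-2020 20:11:45,85'.

    Assumes day-month-year order and a comma before the fractional seconds.
    """
    return datetime.strptime(value.replace(",", "."), "%d-%m-%Y %H:%M:%S.%f")


# The first two Clock values from the rows displayed above.
first = parse_clock("01-07-2020 20:11:45,85")
second = parse_clock("01-07-2020 23:40:47,85")

# Time between the two measurements, in hours.
elapsed_hours = (second - first).total_seconds() / 3600
print(f"{elapsed_hours:.2f} h between the first two rows")
```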


### Create a subset of the offline data for the startup phase

A subset for the startup phase (time on stream, TOS < 85) is available, again both as a pandas DataFrame and as a Bunch object.

In [6]:
startup = dinodat.startup_data()

In [7]:
startup.head()

| | TOS | X_CO | Reactor |
|---|---|---|---|
| 0 | 4 | 12.8 | 1 |
| 1 | 8 | 10.34 | 1 |
| 2 | 11 | 9.42 | 1 |
| 3 | 15 | 8.39 | 1 |
| 4 | 18 | 8.1 | 1 |
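
The TOS threshold can be illustrated with plain pandas. The sketch below does not call data4cat. It rebuilds a small frame from `TOS` and `X_CO` values displayed in this document (early rows from the original data, late rows from the reaction-condition subset further down) and assumes that the startup subset corresponds to a simple `TOS < 85` row filter.

```python
import pandas as pd

# TOS and X_CO values copied from rows displayed in this document.
# The early and late rows come from different reactors and are combined
# here only to illustrate the threshold.
sample = pd.DataFrame(
    {
        "TOS": [4, 8, 11, 15, 18, 181, 185, 188, 192, 195],
        "X_CO": [12.8, 10.34, 9.42, 8.39, 8.1, 8.0, 7.98, 7.98, 7.78, 7.7],
    }
)

# Assumed equivalent of the startup subset: keep rows with TOS below 85.
startup_sample = sample.loc[sample["TOS"] < 85].reset_index(drop=True)
print(startup_sample)
```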


### Create a subset of the offline data for the selectivity

Especially for unsupervised learning tasks, a subset is prepared that contains only the selectivity data. Requesting this subset also returns the reactor labels, which are stored here in a clusters object.

In [8]:
selectivity, clusters = dinodat.selectivity()

In [9]:
selectivity.head()

| | S_Methane | S_Ethane | S_Propane | S_n-Butane | S_nC5 | S_nC6 | S_Ethylene | S_Propylene | S_n-Butene | S_C5-1 | ... | S_Acetaldehyde | S_Propanal | S_Butanal | S_Acetic acid | S_Methyl acetate | S_Ethyl acetate | S_Unknown | S_CO2 | S_C2_oxy | S_C2_p_oxy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59.56 | 4.97 | 6.46 | 2.02 | 0.81 | 0.37 | 0.0 | 0.88 | 0.15 | 0 | ... | 7.54 | 0.71 | 0.65 | 11.63 | 0.22 | 0.55 | 0 | 0.34 | 22.05 | 22.82 |
| 1 | 57.86 | 4.74 | 5.92 | 1.84 | 0.76 | 0.34 | 0.0 | 1.1 | 0.15 | 0 | ... | 8.11 | 0.71 | 0.62 | 12.9 | 0.3 | 0.63 | 0 | 0.49 | 24.33 | 25.26 |
| 2 | 56.43 | 4.61 | 5.6 | 1.73 | 0.73 | 0.34 | 0.0 | 0.98 | 0.14 | 0 | ... | 8.52 | 0.71 | 0.6 | 14.08 | 0.37 | 0.69 | 0 | 0.58 | 26.24 | 27.3 |
| 3 | 55.75 | 4.49 | 5.36 | 1.66 | 0.71 | 0.26 | 0.0 | 1.02 | 0.15 | 0 | ... | 8.78 | 0.7 | 0.58 | 14.61 | 0.41 | 0.77 | 0 | 0.64 | 27.25 | 28.43 |
| 4 | 55.0 | 4.43 | 5.23 | 1.61 | 0.69 | 0.29 | 0.0 | 1.2 | 0.15 | 0 | ... | 9.06 | 0.69 | 0.57 | 14.81 | 0.46 | 0.78 | 0 | 0.69 | 27.95 | 29.19 |


In [10]:
clusters.head()

0    1
1    1
2    1
3    1
4    1
Name: Reactor, dtype: int64
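
Because the feature frame and the reactor labels share the same index, the labels can be used directly to group or colour the features. The following sketch does not call data4cat. It copies two selectivity columns from the five rows shown above (all from reactor 1) and computes the mean selectivity per reactor.

```python
import pandas as pd

# Two selectivity columns from the five rows shown above; all belong to reactor 1.
selectivity_sample = pd.DataFrame(
    {
        "S_Methane": [59.56, 57.86, 56.43, 55.75, 55.0],
        "S_Ethane": [4.97, 4.74, 4.61, 4.49, 4.43],
    }
)
clusters_sample = pd.Series([1, 1, 1, 1, 1], name="Reactor")

# Group the feature rows by their reactor label and average each column.
mean_by_reactor = selectivity_sample.groupby(clusters_sample).mean()
print(mean_by_reactor)
```

With the full dataset the same call would presumably yield one row per reactor, which can serve as a quick reference when judging the result of a clustering run.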

### Create a subset of the offline data for the selectivity without reactor 5

If needed, setting the r5 argument to False excludes the empty reactor 5.

In [11]:
selectivity_wo5, clusters = dinodat.selectivity(r5=False)

In [12]:
selectivity_wo5.head()

| | S_Methane | S_Ethane | S_Propane | S_n-Butane | S_nC5 | S_nC6 | S_Ethylene | S_Propylene | S_n-Butene | S_C5-1 | ... | S_Acetaldehyde | S_Propanal | S_Butanal | S_Acetic acid | S_Methyl acetate | S_Ethyl acetate | S_Unknown | S_CO2 | S_C2_oxy | S_C2_p_oxy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59.56 | 4.97 | 6.46 | 2.02 | 0.81 | 0.37 | 0.0 | 0.88 | 0.15 | 0 | ... | 7.54 | 0.71 | 0.65 | 11.63 | 0.22 | 0.55 | 0 | 0.34 | 22.05 | 22.82 |
| 1 | 57.86 | 4.74 | 5.92 | 1.84 | 0.76 | 0.34 | 0.0 | 1.1 | 0.15 | 0 | ... | 8.11 | 0.71 | 0.62 | 12.9 | 0.3 | 0.63 | 0 | 0.49 | 24.33 | 25.26 |
| 2 | 56.43 | 4.61 | 5.6 | 1.73 | 0.73 | 0.34 | 0.0 | 0.98 | 0.14 | 0 | ... | 8.52 | 0.71 | 0.6 | 14.08 | 0.37 | 0.69 | 0 | 0.58 | 26.24 | 27.3 |
| 3 | 55.75 | 4.49 | 5.36 | 1.66 | 0.71 | 0.26 | 0.0 | 1.02 | 0.15 | 0 | ... | 8.78 | 0.7 | 0.58 | 14.61 | 0.41 | 0.77 | 0 | 0.64 | 27.25 | 28.43 |
| 4 | 55.0 | 4.43 | 5.23 | 1.61 | 0.69 | 0.29 | 0.0 | 1.2 | 0.15 | 0 | ... | 9.06 | 0.69 | 0.57 | 14.81 | 0.46 | 0.78 | 0 | 0.69 | 27.95 | 29.19 |


In [13]:
clusters.head()

0    1
1    1
2    1
3    1
4    1
Name: Reactor, dtype: int64
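
How `r5=False` is implemented inside data4cat is not shown here. Conceptually, excluding one reactor amounts to applying the same boolean mask to the features and to the labels so that both stay aligned. The sketch below illustrates that idea with toy values that are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Toy values, NOT taken from the dataset: rows from reactors 1, 5 and 2.
features = pd.DataFrame({"S_Methane": [59.0, 58.0, 0.0, 0.0, 40.0]})
reactors = pd.Series([1, 1, 5, 5, 2], name="Reactor")

# Build one boolean mask and apply it to both objects so they stay aligned.
keep = reactors != 5
features_wo5 = features.loc[keep]
reactors_wo5 = reactors.loc[keep]

# The original row labels are preserved, so gaps appear where reactor 5 was.
print(features_wo5.index.tolist())
```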

### Create a subset of the offline data for the reaction conditions

For supervised tasks, a subset of the data is provided that contains the reaction conditions as features and the selectivity to ethanol as target.

In [14]:
react_cond, selectivity_EtOH = dinodat.react_cond()

In [15]:
react_cond.head()

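| | xCO | xH2 | Temperature | Vflow | Pressure | TOS | X_CO | X_H2 | Rh | Mn | Fe |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.2 | 0.6 | 260 | 41.7 | 54 | 4 | 12.8 | 25.6 | 2.12 | 0.0 | 0.0 |
| 1 | 0.2 | 0.6 | 260 | 41.7 | 54 | 8 | 10.34 | 29.08 | 2.12 | 0.0 | 0.0 |
| 2 | 0.2 | 0.6 | 260 | 41.7 | 54 | 11 | 9.42 | 23.78 | 2.12 | 0.0 | 0.0 |
| 3 | 0.2 | 0.6 | 260 | 41.7 | 54 | 15 | 8.39 | 27.11 | 2.12 | 0.0 | 0.0 |
| 4 | 0.2 | 0.6 | 260 | 41.7 | 54 | 18 | 8.1 | 19.38 | 2.12 | 0.0 | 0.0 |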


In [16]:
selectivity_EtOH.head()

0    2.88
1    3.32
2    3.64
3    3.86
4    4.08
Name: S_Ethanol, dtype: float64
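
As a rough illustration of how the features and the target fit together in a supervised setting, the sketch below fits a straight line of `S_Ethanol` against `TOS` by ordinary least squares. It does not call data4cat and uses only the five rows shown above, so the fitted numbers carry no scientific meaning. A real analysis would use the full feature set and a proper model.

```python
import numpy as np
import pandas as pd

# Feature and target values copied from the five rows shown above.
react_cond_sample = pd.DataFrame({"TOS": [4, 8, 11, 15, 18]})
selectivity_etoh_sample = pd.Series([2.88, 3.32, 3.64, 3.86, 4.08], name="S_Ethanol")

# Fit S_Ethanol = slope * TOS + intercept by ordinary least squares.
slope, intercept = np.polyfit(react_cond_sample["TOS"], selectivity_etoh_sample, deg=1)

# Predictions of the fitted line for the same TOS values.
predicted = slope * react_cond_sample["TOS"] + intercept
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```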

### Create a subset of the offline data for the reaction conditions without reactor 5

As before, the empty reactor 5 can be excluded by setting the r5 argument to False.

In [17]:
react_cond, selectivity_EtOH = dinodat.react_cond(r5=False)

In [18]:
react_cond.tail()

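| | xCO | xH2 | Temperature | Vflow | Pressure | TOS | X_CO | X_H2 | Rh | Mn | Fe |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 207 | 0.2 | 0.6 | 260 | 41.6 | 54 | 181 | 8.0 | 15.51 | 2.46 | 1.46 | 1.46 |
| 208 | 0.2 | 0.6 | 260 | 41.7 | 54 | 185 | 7.98 | 16.24 | 2.46 | 1.46 | 1.46 |
| 209 | 0.2 | 0.6 | 260 | 41.6 | 54 | 188 | 7.98 | 22.39 | 2.46 | 1.46 | 1.46 |
| 210 | 0.2 | 0.6 | 260 | 41.7 | 54 | 192 | 7.78 | 23.34 | 2.46 | 1.46 | 1.46 |
| 211 | 0.2 | 0.6 | 260 | 41.7 | 54 | 195 | 7.7 | 22.23 | 2.46 | 1.46 | 1.46 |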


In [19]:
selectivity_EtOH.tail()

207    23.26
208    22.94
209    22.78
210    22.68
211    22.58
Name: S_Ethanol, dtype: float64