# Deep4Chem Data Extraction Notebook

## Purpose and Context

This notebook extracts the data from the Deep4Chem CSV, corrects any erroneous epsilon values that were found, and stores the result in Parquet format.

See http://deep4chem.korea.ac.kr/ for more information.

## Setup

Import libraries.

In [None]:
import pandas as pd
import numpy as np

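# Import display from IPython.display; importing it from IPython.core.display is deprecated
from IPython.display import display, HTML
# Stretch the notebook container to the full browser width
display(HTML("<style>.container { width:100% !important; }</style>"))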

import utils

## Load Data

In [None]:
data = pd.read_csv('../rawData/Deep4Chem/DB for chromophore_Sci_Data_rev02.csv')

# Load the manually verified epsilon values used to correct the issues we found
temp = pd.read_csv('../rawData/Deep4Chem/DoubleCheck-High Extinction.csv')[['Tag', 'Should be']]

# Convert the corrected epsilon values to log10, treating 'x' as missing
temp['log(Epsilon)'] = np.log10(temp['Should be'].replace('x', np.nan).astype('float'))

data = data.merge(temp, on = 'Tag', how = 'left')
data['log(Epsilon)'] = data['log(Epsilon)'].fillna(data['log(e/mol-1 dm3 cm-1)'])

data.drop(['Tag', 'Reference', 'Should be', 'log(e/mol-1 dm3 cm-1)'], axis = 'columns', inplace = True)

# Rows without a log(Epsilon) value are removed in a later cell
print(f'Total Count: {len(data)}')
data.head(1)
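
The correction step above is a left merge followed by a `fillna`: a corrected value wins where one exists, and the published value is kept everywhere else. The following is a minimal sketch of that pattern on made-up toy data; the tags and numbers are invented for illustration, and `pd.to_numeric(..., errors='coerce')` is used here as an assumed shortcut for the `'x'` handling, not the notebook's exact code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the main table (values are invented for illustration)
data = pd.DataFrame({
    'Tag': [1, 2, 3],
    'log(e/mol-1 dm3 cm-1)': [4.5, 9.0, 8.7],
})

# Toy stand-in for the double-check file: a number is a corrected epsilon,
# 'x' is a non-numeric marker
checked = pd.DataFrame({
    'Tag': [2, 3],
    'Should be': ['100000', 'x'],
})

# Non-numeric markers become NaN; numeric entries are converted to log10
checked['log(Epsilon)'] = np.log10(pd.to_numeric(checked['Should be'], errors='coerce'))

# Left merge keeps every row of the main table; unmatched rows get NaN
merged = data.merge(checked[['Tag', 'log(Epsilon)']], on='Tag', how='left')

# Fall back to the published value wherever no corrected value is available
merged['log(Epsilon)'] = merged['log(Epsilon)'].fillna(merged['log(e/mol-1 dm3 cm-1)'])
print(merged)
```

Note that with this pattern a row marked `'x'` ends up with the published value, exactly like a row that was never checked.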

Clean up the data by converting string columns to int/float data types and compressing integer columns.

Remove entries with no log(Epsilon) value.

In [None]:
data = data[data['log(Epsilon)'].notnull()].copy().reset_index(drop=True)

In [None]:
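# Normalize column names: underscores become spaces and each word is title-cased
# (so 'log(Epsilon)' becomes 'Log(Epsilon)')
data.columns = data.columns.str.replace('_', ' ').str.title()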
utils.DropAllNullColumns(data)
utils.ConvertStringColumnsToInt(data)
utils.ConvertFloatColumnsToIntegerIfNoDataLoss(data)
utils.CompressIntegerColumns(data)
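
The `utils` helpers are not shown in this notebook, so their implementation is unknown. As a rough illustration of what the last two steps could look like in plain pandas, here is a sketch with two hypothetical functions, `floats_to_int_if_lossless` and `compress_integer_columns`; both names and their behavior are assumptions, not the actual `utils` code.

```python
import numpy as np
import pandas as pd

def floats_to_int_if_lossless(df):
    """Hypothetical helper: cast float columns to int64 when every value is finite and whole."""
    for col in df.select_dtypes(include=np.floating).columns:
        values = df[col]
        # Skip columns with NaN/inf or fractional parts, since casting would lose data
        if np.isfinite(values).all() and (values == values.round()).all():
            df[col] = values.astype('int64')
    return df

def compress_integer_columns(df):
    """Hypothetical helper: downcast integer columns to the smallest signed integer dtype that fits."""
    for col in df.select_dtypes(include=np.integer).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    return df

# Toy data, invented for illustration
toy = pd.DataFrame({
    'Count': [1.0, 2.0, 300.0],   # whole numbers stored as floats
    'Ratio': [0.5, 1.5, 2.0],     # genuinely fractional, must stay float
    'Year': np.array([1999, 2005, 2020], dtype='int64'),
})

toy = compress_integer_columns(floats_to_int_if_lossless(toy))
print(toy.dtypes)
```

In this toy case `Count` and `Year` end up in a smaller integer dtype while `Ratio` stays a float column, which reduces memory use without changing any values.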

## Basic Analysis

In [None]:
data.info()

In [None]:
utils.InspectColumnValues(data)

In [None]:
data.describe()

In [None]:
utils.ShowHistogramCharts(data)

## Saving data for use later

In [None]:
utils.SaveDataToOutput(data, 'extraction-deep4Chem')
utils.LoadDataFromOutput('extraction-deep4Chem')