# PhotochemCAD3 data extraction notebook

## Purpose and Context

This notebook extracts the data from the PhotochemCAD compound database, combines it with the SMILES data, and stores the result in Parquet format.

See http://photochemcad.com/ for more information.

## Setup

Import libraries.

In [None]:
import pandas as pd
import numpy as np

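# Widen the notebook container to the full browser width.
# display and HTML are imported from the public IPython.display module;
# importing display from IPython.core.display is deprecated.
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))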

import utils

## Load Data

In [None]:
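# Load the tab-separated compound database.
# Note: Python's 'oem' codec is only available on Windows.
data = pd.read_csv('../rawData/PhotochemCAD3/PCAD3 Compd Database 2018/2018_03 PCAD3.db', sep = '\t', encoding='oem')
# Drop the bookkeeping columns that are not carried into the output
data.drop(['#', 'Instrument', 'Date', 'Reference', 'Inv', 'Instrument.1', 'Date.1', 'Reference.1', 'Inv.1', 'Unnamed: 21'], axis = 'columns', inplace = True)
data.head(1)

# Load the SMILES table; use 'Correct Smiles' where present, otherwise 'Generated Smiles'
temp = pd.read_csv('../rawData/PhotochemCAD3/SmilesData.csv')
temp['Smiles'] = temp['Correct Smiles'].fillna(temp['Generated Smiles'])
temp.head(1)

# Inner join on 'Structure' (the pandas default): rows without a SMILES match are dropped
data = data.merge(temp[['Structure', 'Smiles']], on = 'Structure')
print('Total Count: ' + str(len(data)))
data.head(1)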
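
The merge above relies on the pandas default inner join, so a compound whose `Structure` value has no counterpart in the SMILES file would be dropped without any warning. The following is a minimal sketch of one way to check for that, using small made-up frames. The `Structure` codes, names, and SMILES strings are illustrative assumptions, not values from the PhotochemCAD database.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables; values are illustrative only.
compounds = pd.DataFrame({
    "Structure": ["A01", "A02", "A03"],
    "Name": ["benzene", "naphthalene", "anthracene"],
})
smiles = pd.DataFrame({
    "Structure": ["A01", "A02"],
    "Correct Smiles": [None, "c1ccc2ccccc2c1"],
    "Generated Smiles": ["c1ccccc1", "C1=CC=C2C=CC=CC2=C1"],
})

# Prefer the curated SMILES, falling back to the generated one.
smiles["Smiles"] = smiles["Correct Smiles"].fillna(smiles["Generated Smiles"])

# A left join with indicator=True keeps every compound and records
# in the "_merge" column whether a match was found in the SMILES table.
checked = compounds.merge(
    smiles[["Structure", "Smiles"]], on="Structure", how="left", indicator=True
)
unmatched = checked.loc[checked["_merge"] == "left_only", "Structure"].tolist()
print("Compounds without SMILES:", unmatched)
```

If the list of unmatched compounds is empty, the inner join and the left join return the same rows, and the simpler merge used in the notebook loses nothing.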

Clean up the data: normalize the column names, drop columns that are entirely null, convert string columns to int/float data types, and compress integer columns.

In [None]:
data.columns = data.columns.str.replace('_', ' ').str.title()
utils.DropAllNullColumns(data)
utils.ConvertStringColumnsToInt(data)
utils.ConvertFloatColumnsToIntegerIfNoDataLoss(data)
utils.CompressIntegerColumns(data)
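
The `utils` module is not included in this notebook, so the exact behavior of its helpers is not confirmed here. As a rough sketch of what the last two steps might amount to in plain pandas, the code below converts float columns to integers only when no information is lost and then downcasts integer columns. The function names and the example column names are hypothetical and are not the ones in `utils`.

```python
import numpy as np
import pandas as pd

def floats_to_int_if_lossless(df: pd.DataFrame) -> pd.DataFrame:
    """Convert float columns to int64 when every value is a finite whole number."""
    out = df.copy()
    for col in out.select_dtypes(include="float").columns:
        values = out[col]
        # np.isfinite is False for NaN and infinities, which plain
        # integer dtypes cannot represent, so such columns are skipped.
        if np.isfinite(values).all() and (values == np.floor(values)).all():
            out[col] = values.astype("int64")
    return out

def downcast_integers(df: pd.DataFrame) -> pd.DataFrame:
    """Shrink integer columns to the smallest integer dtype that fits their values."""
    out = df.copy()
    for col in out.select_dtypes(include="integer").columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    return out

# Illustrative data only; column names are assumptions for this example.
example = pd.DataFrame({
    "Wavelength": [250.0, 300.0, 355.0],
    "Quantum Yield": [0.05, 0.3, 0.92],
})
compact = downcast_integers(floats_to_int_if_lossless(example))
print(compact.dtypes)
```

Both functions return a copy rather than modifying the frame in place, which differs from how the `utils` helpers are called above. That choice only keeps the sketch easy to test.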

## Basic Analysis

In [None]:
data.info()

In [None]:
utils.InspectColumnValues(data)

In [None]:
data.describe()

In [None]:
utils.ShowHistogramCharts(data)

## Saving data for use later

In [None]:
utils.SaveDataToOutput(data, 'extraction-photoChemCAD3')
utils.LoadDataFromOutput('extraction-photoChemCAD3')