# PubChem Data Extraction notebook

## Purpose and Context

This notebook extracts the data from PubChem's Compound TOC: UV Spectra list and stores it in Parquet format.

Go to https://pubchem.ncbi.nlm.nih.gov/classification/ to download the latest version, and save it as "/PredictEpsilon-Publish/rawData/PubChem/PubChem Compound TOC UV Spectra.csv".

## Setup

Import libraries.

In [None]:
import pandas as pd
import numpy as np

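# display and HTML are part of the public IPython.display module;
# importing them from IPython.core.display is deprecated.
from IPython.display import display, HTML
# Widen the notebook container to the full browser width.
display(HTML("<style>.container { width:100% !important; }</style>"))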

import utils

## Load Data

In [None]:
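import os

# Read every CSV export in the raw-data directory and combine them,
# dropping rows that appear in more than one download.
dataDirectory = '../rawData/PubChem/'
temp = []
for file in sorted(os.listdir(dataDirectory)):
    if file.lower().endswith('.csv'):  # skip stray non-CSV files
        temp.append(pd.read_csv(os.path.join(dataDirectory, file)))

data = pd.concat(temp).drop_duplicates(ignore_index=True)
data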

Clean up the data: tidy the column names, drop all-null columns, convert string columns to int/float data types, and compress integers.

In [None]:
data.columns = data.columns.str.replace('_', ' ').str.title()
utils.DropAllNullColumns(data)
utils.ConvertStringColumnsToInt(data)
utils.ConvertFloatColumnsToIntegerIfNoDataLoss(data)
utils.CompressIntegerColumns(data)
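
The `utils` helpers are not shown in this notebook, so the following is only a rough sketch of what three of the cleaning steps might look like in plain pandas. The function names (`drop_all_null_columns`, `floats_to_int_if_lossless`, `compress_integer_columns`) and the demo values are hypothetical stand-ins, not the actual `utils` implementation.

```python
import numpy as np
import pandas as pd


def drop_all_null_columns(df: pd.DataFrame) -> None:
    """Drop, in place, every column that contains only missing values."""
    empty = [col for col in df.columns if df[col].isna().all()]
    df.drop(columns=empty, inplace=True)


def floats_to_int_if_lossless(df: pd.DataFrame) -> None:
    """Convert float columns to int64 when no value is missing or fractional."""
    for col in list(df.columns):
        values = df[col]
        if not pd.api.types.is_float_dtype(values):
            continue
        if values.notna().all() and (values == values.round()).all():
            df[col] = values.astype("int64")


def compress_integer_columns(df: pd.DataFrame) -> None:
    """Downcast integer columns to the smallest signed integer dtype that fits."""
    for col in list(df.columns):
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="integer")


# Hypothetical demo frame; the column names and values are made up.
demo = pd.DataFrame({
    "Cid": [241.0, 996.0, 7410.0],                 # whole numbers stored as float
    "Molecular Weight": [78.11, 94.11, 120.15],    # genuinely fractional
    "Unused": [np.nan, np.nan, np.nan],            # entirely empty
})
drop_all_null_columns(demo)
floats_to_int_if_lossless(demo)
compress_integer_columns(demo)
print(demo.dtypes)
```

Checking for missing and fractional values before casting is what keeps the float-to-integer step lossless; a column with any `NaN` stays as float in this sketch.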

## Basic Analysis

In [None]:
data.info()

In [None]:
utils.InspectColumnValues(data)

In [None]:
data.describe()

In [None]:
utils.ShowHistogramCharts(data)

## Saving data for use later

In [None]:
utils.SaveDataToOutput(data, 'extraction-pubChem')
utils.LoadDataFromOutput('extraction-pubChem')
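
For readers without access to `utils`, a minimal sketch of a Parquet save/load pair is given here. It assumes the files are written to a `../output` directory and that a Parquet engine such as `pyarrow` or `fastparquet` is installed; the function names, the directory, and the file naming are assumptions, not the actual behavior of `utils.SaveDataToOutput` or `utils.LoadDataFromOutput`.

```python
from pathlib import Path

import pandas as pd


def save_to_output(df: pd.DataFrame, name: str, directory: str = "../output") -> Path:
    """Write df to <directory>/<name>.parquet and return the file path."""
    path = Path(directory) / f"{name}.parquet"
    path.parent.mkdir(parents=True, exist_ok=True)
    # pandas delegates to an installed Parquet engine (pyarrow or fastparquet).
    df.to_parquet(path, index=False)
    return path


def load_from_output(name: str, directory: str = "../output") -> pd.DataFrame:
    """Read <directory>/<name>.parquet back into a DataFrame."""
    return pd.read_parquet(Path(directory) / f"{name}.parquet")
```

Calling `save_to_output(data, 'extraction-pubChem')` followed by `load_from_output('extraction-pubChem')` would mirror the cell above and serve as a quick check that the file can be read back.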