# 01 - QC of the data

This notebook explores the supplementary materials from the ACS Nano Paper:
>Labouta HI, Asgarian N, Rinker K, Cramb DT. Meta-Analysis of Nanoparticle Cytotoxicity via Data-Mining the Literature. ACS Nano. 2019 Jan 31; doi:10.1021/acsnano.8b07562 (Scholia)

ACS seems to block scrapers, so the supplementary data needs to be downloaded manually from the [supporting information link](https://pubs.acs.org/doi/suppl/10.1021/acsnano.8b07562/suppl_file/nn8b07562_si_001.xlsx) on the [ACS Nano page](https://pubs.acs.org/doi/full/10.1021/acsnano.8b07562), and then stored under [../data](../data).

## 1. Imports and functions used throughout the notebook

In [29]:
import pandas as pd
import numpy as np
import math
import os
import sys
from IPython.display import Markdown, display
from code import interact
import re

## 2. Loading data
The dataset is an overview of nanoparticle cytotoxicity assays reported in the literature. The authors harmonized the units and used the features in this table to run decision tree analyses.

In [30]:
file = "../data/nn8b07562_si_001.xlsx"
df = pd.read_excel(file)
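
Because the spreadsheet has to be downloaded by hand, a missing file is the most likely failure at this point. The sketch below shows one way to fail early with a readable message. The helper name `require_file` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def require_file(path):
    """Return `path` as a Path, or raise a descriptive error if it is missing.

    Hypothetical helper, not part of the original notebook.
    """
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(
            f"{p} not found; download the supplementary spreadsheet manually "
            "and place it under ../data first."
        )
    return p
```

With this in place, the loading cell could read `df = pd.read_excel(require_file(file))`.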

The next step is to verify the data type of each column:

In [31]:
df_dtypes = pd.DataFrame(df.dtypes, columns=["Dtype"])
cols = list(df.columns)
display(df_dtypes)
display(Markdown("Data shape is {}.".format(df.shape)))

Unnamed: 0,Dtype
Nanoparticle,object
Type: Organic (O)/inorganic (I),object
coat,object
Diameter (nm),float64
Concentration μM,float64
Zeta potential (mV),float64
Cells,object
Cell line (L)/primary cells (P),object
Human(H)/Animal(A) cells,object
Animal?,object


Data shape is (2896, 24).

Converting all integer columns to floats and storing `Particle ID` with the `object` dtype, so that it is treated as a label rather than a number:

In [32]:
int_cols = list(df_dtypes.loc[df_dtypes['Dtype'] == int].index)
df[int_cols] = df[int_cols].astype(float)
df["Particle ID"] = df["Particle ID"].astype(object)
qual_cols = list(df_dtypes.loc[df_dtypes['Dtype'] == object].index)
display(pd.DataFrame(df[int_cols].dtypes, columns=["Dtype"]))

Unnamed: 0,Dtype
Exposure time (h),float64
Publication year,float64
Particle ID,object
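
Comparing a dtype against the built-in `int` relies on the platform's default integer width. A minimal sketch of a width-independent alternative, using `pd.api.types.is_integer_dtype` on a toy frame (the toy values are illustrative only and not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for the real data (illustrative values only).
toy = pd.DataFrame({
    "Exposure time (h)": [24, 48],
    "Particle ID": [1, 2],
    "Diameter (nm)": [20.0, 49.2],
})

# Pick integer columns by kind, whatever their exact width (int32, int64, ...).
int_cols = [c for c in toy.columns if pd.api.types.is_integer_dtype(toy[c])]

# Same conversions as in the cell above: integers to float, ID to object.
toy[int_cols] = toy[int_cols].astype(float)
toy["Particle ID"] = toy["Particle ID"].astype(object)
```

Note that `Particle ID` is cast to float before it becomes `object`, which is why its values show up as `19.0` rather than `19` in later summaries.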


## 3. Describing the data features
### Qualitative variables

The table below describes the qualitative variables in the data (`dtype=object`).

In [33]:
display(df.describe(include="object"))

Unnamed: 0,Nanoparticle,Type: Organic (O)/inorganic (I),coat,Cells,Cell line (L)/primary cells (P),Human(H)/Animal(A) cells,Animal?,Cell morphology,"Cell age: embryonic (E), Adult (A)",Cell-organ/tissue source,Test,Test indicator,Biochemical metric,Interference checked (Y/N),Colloidal stability checked (Y/N),Positive control (Y/N),Particle ID,Reference DOI
count,2896,2896,1052,2896,2896,2896,651,2896,2896,2896,2896,2896,2896,2896,2896,2896,2896.0,2896
unique,33,2,46,81,2,2,8,15,2,30,23,17,6,2,2,2,118.0,89
top,Iron oxide,I,PEI,A549,L,H,Mouse,Epithelial,A,Blood,MTT,tetrazolium salt,cell metabolic activity,N,N,N,19.0,10.1186/1556-276X-7-77
freq,490,2274,123,298,2356,2231,411,1456,2757,536,872,1302,1678,2348,2309,2395,225.0,225


The table below shows the percentage of missing values in each qualitative column.

In [34]:
df_null = pd.DataFrame(df[qual_cols].isnull().sum()/len(df)*100, columns = ["%na"]).transpose()
display(df_null)

Unnamed: 0,Nanoparticle,Type: Organic (O)/inorganic (I),coat,Cells,Cell line (L)/primary cells (P),Human(H)/Animal(A) cells,Animal?,Cell morphology,"Cell age: embryonic (E), Adult (A)",Cell-organ/tissue source,Test,Test indicator,Biochemical metric,Interference checked (Y/N),Colloidal stability checked (Y/N),Positive control (Y/N),Reference DOI
%na,0.0,0.0,63.674033,0.0,0.0,0.0,77.520718,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
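
The same per-column percentage can be written more compactly, since the mean of a boolean mask is the fraction of `True` values. A small sketch on a toy frame (illustrative values, not rows from the dataset):

```python
import pandas as pd

# Toy frame with two missing entries in "coat" (illustrative values only).
toy = pd.DataFrame({
    "coat": ["PEI", None, None, "PEG"],
    "Cells": ["A549", "HeLa", "A549", "HepG2"],
})

# isna() gives a boolean mask; its column-wise mean is the missing fraction.
pct_na = toy.isna().mean() * 100

# Equivalent to the sum/len formulation used in the cell above.
pct_na_explicit = toy.isnull().sum() / len(toy) * 100
```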


Missing values in the qualitative columns are replaced with empty strings:

In [35]:
df[qual_cols] = df[qual_cols].fillna('')

### Describing the quantitative features

In [36]:
describe = df.drop(["Particle ID", "Publication year"], axis=1).describe()
cols_d = [i for i in describe.columns]
nas = [df[col].isna().sum()/len(df[col])*100 for col in cols_d]
describe.loc["%na"] = nas
display(describe)
na_overall = str(np.round(df.isna().sum().sum() / df.size * 100, 3))
display(Markdown("Missing values account for {}% of all cells in the data frame.".format(na_overall)))

Unnamed: 0,Diameter (nm),Concentration μM,Zeta potential (mV),Exposure time (h),% Cell viability
count,2896.0,2896.0,1261.0,2896.0,2896.0
mean,125.082465,85.74635,-1.963933,35.515539,75.208409
std,171.931194,797.9487,28.925259,27.950149,34.267026
min,1.0,1.660539e-20,-48.0,1.0,-58.89764
25%,20.0,2.5e-06,-27.0,24.0,54.219643
50%,49.2,0.0005,-8.0,24.0,86.965674
75%,165.0,0.01054755,17.7,48.0,97.65237
max,957.0,15000.0,87.0,336.0,404.8117
%na,0.0,0.0,56.457182,0.0,0.0


Missing values account for 2.352% of all cells in the data frame.
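
The two percentages reported above use different denominators. The per-column `%na` divides by the number of rows, while the overall figure divides by every cell in the data frame (`df.size`, rows times columns). The arithmetic can be checked from the numbers shown in the outputs, assuming that zeta potential is the only column with missing values left at this point:

```python
# Figures taken from the outputs above.
n_rows, n_cols = 2896, 24
zeta_count = 1261  # non-missing zeta potential values

# Assumption: zeta potential is the only column with remaining missing values.
zeta_missing = n_rows - zeta_count

pct_zeta = zeta_missing / n_rows * 100                   # per-column share
pct_all_cells = zeta_missing / (n_rows * n_cols) * 100   # share of all cells

print(round(pct_zeta, 3), round(pct_all_cells, 3))  # 56.457 2.352
```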

## 4. Final transformations and export
The relevant columns of the cleaned data frame are exported as a pickle file before the RDFication step.

In [37]:
# Rows with an empty "Animal?" field are treated as human cells.
df["Organism"] = [val if val != "" else "Human" for val in df["Animal?"]]
# Entries containing "(" carry a URL instead of a DOI; split the two cases.
df["DOI"] = ["" if "(" in val else "https://doi.org/" + val for val in df["Reference DOI"]]
df["Reference"] = [val.replace("not provided (", "https://").replace(")", "") if "(" in val else "" for val in df["Reference DOI"]]
df["Type"] = ["organic" if val == "O" else "inorganic" for val in df["Type: Organic (O)/inorganic (I)"]]
# Units are constant across the dataset.
df["Diameter units"] = "nm"
df["Concentration units"] = "μM"
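
The `DOI` and `Reference` columns are two halves of one rule applied to `Reference DOI`. The sketch below restates that rule as a single function to make the behavior easier to check. The name `split_reference` is hypothetical, and the second example input is invented for illustration. Its format is inferred from the string replacements in the cell above.

```python
def split_reference(value):
    """Return (doi_url, reference_url) for one `Reference DOI` entry.

    Hypothetical helper mirroring the two list comprehensions above.
    """
    if "(" in value:
        # Assumed shape: "not provided (<url without scheme>)".
        url = value.replace("not provided (", "https://").replace(")", "")
        return "", url
    return "https://doi.org/" + value, ""


# A DOI that appears in the dataset summary above.
doi_case = split_reference("10.1186/1556-276X-7-77")

# Invented entry, used only to illustrate the non-DOI branch.
url_case = split_reference("not provided (example.org/paper)")
```

One caveat of testing for `"("`: DOIs may legitimately contain parentheses, and such an entry would be routed to the `Reference` branch. Whether the dataset contains any is not checked here.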

In [38]:
df = df.drop(columns=["Type: Organic (O)/inorganic (I)", "Animal?", "Reference DOI"])


In [39]:
df.to_pickle("../data/nn8b07562_si_001.pkl")
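
A quick way to confirm the export is to read the pickle back and compare it with the in-memory frame. The sketch below does this on a toy frame in a temporary directory, so it does not touch `../data`. The file name `toy.pkl` and the values are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the cleaned data (illustrative values only).
toy = pd.DataFrame({
    "Diameter (nm)": [20.0, 49.2],
    "Type": ["inorganic", "organic"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.pkl"
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares values and dtypes, treating NaNs in the same place as equal.
roundtrip_ok = restored.equals(toy)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.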