---
title: 01 - QC of the data for nn8b07562_si_001.xlsx

author: Javier Millan Acosta

---

# Introduction and motivation
## Source

This notebook explores the supplementary materials from the ACS Nano paper:
>Labouta HI, Asgarian N, Rinker K, Cramb DT. Meta-Analysis of Nanoparticle Cytotoxicity via Data-Mining the Literature. ACS Nano. 2019 Jan 31; doi:10.1021/acsnano.8b07562 (Scholia)

ACS seems to block scrapers, so the supplementary data must be downloaded manually from the [supporting information link](https://pubs.acs.org/doi/suppl/10.1021/acsnano.8b07562/suppl_file/nn8b07562_si_001.xlsx) on the [ACS Nano page](https://pubs.acs.org/doi/full/10.1021/acsnano.8b07562) and then stored under [../data](../data).
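
Because the download is manual, a small guard at the top of the notebook can catch a missing file early. The following is a minimal sketch, not part of the original notebook: `require_file` is a hypothetical helper, and the error message simply restates the download step described above.

```python
from pathlib import Path


def require_file(path):
    """Return `path` as a Path if the file exists, else raise a helpful error."""
    p = Path(path)
    if not p.is_file():
        # The supplementary file cannot be fetched programmatically,
        # so point the user back to the manual download step.
        raise FileNotFoundError(
            f"{p} not found: download it manually from the ACS Nano "
            "supporting information page and store it under ../data"
        )
    return p
```

It could be called as `file = require_file("../data/nn8b07562_si_001.xlsx")` before the data is loaded.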

## Summary
The steps in this notebook prepare the input data for the [RML-based RDFication](https://rml.io/specs/rml/) of cytotoxicity data ([CSV mapping](https://rml.io/specs/rml/#example-CSV)). Specifically, the goals are:

    I) Describe the data set
    II) Identify inconsistencies and clean the data
    III) Assist with the selection of eNanoMapper terms for the mapping
    IV) Help detect missing relevant classes in the eNanoMapper ontology
    V) Provide the foundation for the RML that will be used for the RDFication
    VI) Serve as a (working) mockup for an eNanoMapper ontology alignment tool


# Imports

In [1]:
import pandas as pd
import numpy as np
import math
import os
import sys
from IPython.display import Markdown, display
import re
import rdflib
import requests
from ipywidgets import interactive_output, interact_manual, Layout, widgets, interact, Dropdown, Select, Text, Button, Textarea
import ipysheet as ip

# Loading data
The dataset is an overview of nanoparticle cytotoxicity assays reported in the literature. The authors harmonized the units and used the features in the table to run decision tree analyses.

In [2]:
file = "../data/nn8b07562_si_001.xlsx"
df = pd.read_excel(file)

The next step is to verify the data type of each column:

In [3]:
df_dtypes = pd.DataFrame(df.dtypes, columns=["Dtype"])
cols = [i for i in df.columns]
display(df_dtypes.transpose())
display(Markdown("Data shape of {} is {}.".format(file, df.shape)))

| | Nanoparticle | Type: Organic (O)/inorganic (I) | coat | Diameter (nm) | Concentration μM | Zeta potential (mV) | Cells | Cell line (L)/primary cells (P) | Human(H)/Animal(A) cells | Animal? | ... | Test | Test indicator | Biochemical metric | % Cell viability | Interference checked (Y/N) | Colloidal stability checked (Y/N) | Positive control (Y/N) | Publication year | Particle ID | Reference DOI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dtype | object | object | object | float64 | float64 | float64 | object | object | object | object | ... | object | object | object | float64 | object | object | object | int64 | int64 | object |


Data shape of ../data/nn8b07562_si_001.xlsx is (2896, 24).

Converting all integer columns to floats, and `Particle ID` to `object` so that it is treated as an identifier rather than a number:

In [4]:
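# Detect integer columns in a platform-independent way and cast them to float
int_cols = [col for col in df.columns if pd.api.types.is_integer_dtype(df[col])]
df[int_cols] = df[int_cols].astype(float)
# `Particle ID` is an identifier, not a measurement
df["Particle ID"] = df["Particle ID"].astype(object)
# `df_dtypes` still holds the original dtypes, so `Particle ID` is not listed as qualitative
qual_cols = list(df_dtypes.loc[df_dtypes['Dtype'] == object].index)
display(pd.DataFrame(df[int_cols].dtypes, columns=["Dtype"]))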

| | Dtype |
|---|---|
| Exposure time (h) | float64 |
| Publication year | float64 |
| Particle ID | object |


# Describing the data features
## Qualitative features

The table below describes the qualitative variables in the data (`dtype=object`).

In [5]:
df.describe(include="object")

| | Nanoparticle | Type: Organic (O)/inorganic (I) | coat | Cells | Cell line (L)/primary cells (P) | Human(H)/Animal(A) cells | Animal? | Cell morphology | Cell age: embryonic (E), Adult (A) | Cell-organ/tissue source | Test | Test indicator | Biochemical metric | Interference checked (Y/N) | Colloidal stability checked (Y/N) | Positive control (Y/N) | Particle ID | Reference DOI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2896 | 2896 | 1052 | 2896 | 2896 | 2896 | 651 | 2896 | 2896 | 2896 | 2896 | 2896 | 2896 | 2896 | 2896 | 2896 | 2896.0 | 2896 |
| unique | 33 | 2 | 46 | 81 | 2 | 2 | 8 | 15 | 2 | 30 | 23 | 17 | 6 | 2 | 2 | 2 | 118.0 | 89 |
| top | Iron oxide | I | PEI | A549 | L | H | Mouse | Epithelial | A | Blood | MTT | tetrazolium salt | cell metabolic activity | N | N | N | 19.0 | 10.1186/1556-276X-7-77 |
| freq | 490 | 2274 | 123 | 298 | 2356 | 2231 | 411 | 1456 | 2757 | 536 | 872 | 1302 | 1678 | 2348 | 2309 | 2395 | 225.0 | 225 |


The table below shows the percentage of missing values in each qualitative column.

In [6]:
df_null = pd.DataFrame(df[qual_cols].isnull().sum()/len(df)*100, columns = ["%na"]).transpose()
df_null

| | Nanoparticle | Type: Organic (O)/inorganic (I) | coat | Cells | Cell line (L)/primary cells (P) | Human(H)/Animal(A) cells | Animal? | Cell morphology | Cell age: embryonic (E), Adult (A) | Cell-organ/tissue source | Test | Test indicator | Biochemical metric | Interference checked (Y/N) | Colloidal stability checked (Y/N) | Positive control (Y/N) | Reference DOI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| %na | 0.0 | 0.0 | 63.674033 | 0.0 | 0.0 | 0.0 | 77.520718 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


As described in the paper, there are many missing values for `coat`. The missing values in `Animal?` are not relevant, because this column will be replaced with an `Organism` column.

In [7]:
df[qual_cols] = df[qual_cols].fillna('')

## Quantitative features

In [8]:
describe = df.drop(["Particle ID", "Publication year"], axis=1).describe()
cols_d = [i for i in describe.columns]
# Percentage of missing values per quantitative column
nas = [df[col].isna().sum()/len(df[col])*100 for col in cols_d]
describe.loc["%na"] = nas
display(describe)
# Qualitative NaNs were replaced with '' above, so every remaining NaN is quantitative;
# the denominator is the size of the whole table, not only the quantitative columns
na_overall = str(np.round(df.isna().sum().sum() / df.size * 100, 3))
display(Markdown("Missing quantitative values make up {}% of all entries in the table.".format(na_overall)))

| | Diameter (nm) | Concentration μM | Zeta potential (mV) | Exposure time (h) | % Cell viability |
|---|---|---|---|---|---|
| count | 2896.0 | 2896.0 | 1261.0 | 2896.0 | 2896.0 |
| mean | 125.082465 | 85.74635 | -1.963933 | 35.515539 | 75.208409 |
| std | 171.931194 | 797.9487 | 28.925259 | 27.950149 | 34.267026 |
| min | 1.0 | 1.660539e-20 | -48.0 | 1.0 | -58.89764 |
| 25% | 20.0 | 2.5e-06 | -27.0 | 24.0 | 54.219643 |
| 50% | 49.2 | 0.0005 | -8.0 | 24.0 | 86.965674 |
| 75% | 165.0 | 0.01054755 | 17.7 | 48.0 | 97.65237 |
| max | 957.0 | 15000.0 | 87.0 | 336.0 | 404.8117 |
| %na | 0.0 | 0.0 | 56.457182 | 0.0 | 0.0 |


Missing quantitative values make up 2.352% of all entries in the table.

As described in the paper, more than half of the rows (56.5%) are missing a `Zeta potential (mV)` measurement.

# Cleaning data

(TBD)

In [9]:
# Empty `Animal?` values are treated as human cells
df["Organism"] = [val if val != "" else "Human" for val in df["Animal?"]]
# Values with a parenthesis hold a link ("not provided (...)") instead of a DOI
df["DOI"] = ["" if "(" in val else "https://doi.org/" + val for val in df["Reference DOI"]]
df["Reference"] = [val.replace("not provided (", "https://").replace(")", "") if "(" in val else "" for val in df["Reference DOI"]]
df["Type"] = ["organic" if val == "O" else "inorganic" for val in df["Type: Organic (O)/inorganic (I)"]]
# Units are constant across the dataset, so a scalar is broadcast to every row
df["Diameter units"] = "nm"
df["Concentration units"] = "μM"
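
To make the `Reference DOI` handling above concrete, here is a small pure-Python sketch of the same rule applied to single values. `split_reference` is a hypothetical helper, and the second input is an invented example of the `not provided (...)` pattern, not a value taken from the dataset.

```python
def split_reference(value):
    """Split a `Reference DOI` entry into (DOI URL, fallback reference URL)."""
    if "(" in value:
        # No DOI available: the entry holds a link inside parentheses
        url = value.replace("not provided (", "https://").replace(")", "")
        return "", url
    # Plain DOI: prefix it with the resolver
    return "https://doi.org/" + value, ""


print(split_reference("10.1186/1556-276X-7-77"))
# ('https://doi.org/10.1186/1556-276X-7-77', '')
print(split_reference("not provided (www.example.org/article)"))
# ('', 'https://www.example.org/article')
```

Exactly one of the two returned strings is non-empty, which mirrors how the `DOI` and `Reference` columns complement each other.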

In [10]:
df = df.drop(columns=["Type: Organic (O)/inorganic (I)", "Animal?", "Reference DOI"])

# Mapping terms with the eNanoMapper ontology
The [search API](https://www.ebi.ac.uk/ols/docs/api) of the [Ontology Lookup Service](https://www.ebi.ac.uk/ols/index) (OLS) is used to retrieve the IRIs and labels of matching terms. These can be used as input in a workflow that creates the `RML` model.
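
Search terms such as `% Cell viability` contain characters that are reserved in URLs, so they need percent-encoding before being sent. The sketch below uses only the standard library and assumes the same endpoint and query parameters as the lookup function in this notebook; `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

OLS_SEARCH = "https://www.ebi.ac.uk/ols/api/search"


def build_search_url(term, ontology="enm"):
    """Build an OLS search URL with the term safely percent-encoded."""
    query = urlencode({"q": term, "groupField": "iri", "start": 0, "ontology": ontology})
    return f"{OLS_SEARCH}?{query}"


print(build_search_url("% Cell viability"))
# https://www.ebi.ac.uk/ols/api/search?q=%25+Cell+viability&groupField=iri&start=0&ontology=enm
```

The `ontology` parameter restricts the search to the eNanoMapper ontology (`enm`).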

## Column names
Some axioms will use the column names as predicates (e.g., measured values such as Concentration). The widget below shows, for reference, the best matches for these column names retrieved from the [Ontology Lookup Service](https://www.ebi.ac.uk/ols/docs/api#Search).

Defining a function that looks up column names in the OLS and retrieves all the matches:

In [11]:
def ols_lookup(var_list):
    """Query the OLS search API (eNanoMapper ontology only) for each term.

    Returns a dict mapping each term to a {label: IRI} dict of its matches.
    """
    url = "https://www.ebi.ac.uk/ols/api/search"
    allign = dict()
    for var in var_list:
        # Passing the term through `params` URL-encodes spaces, "%", "/", etc.
        params = {"q": var, "groupField": "iri", "start": 0, "ontology": "enm"}
        r = requests.get(url, params=params)
        docs = r.json()["response"]["docs"]
        # Matches that share a label overwrite each other; the last IRI is kept
        allign[var] = {doc["label"]: doc["iri"] for doc in docs}
    return allign
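
The parsing step can be checked without network access by running it on a hand-written payload. The sketch below assumes the response shape that the function relies on (a `response` object holding a `docs` list, each entry with `label` and `iri`). `extract_matches` is a hypothetical helper, and the IRIs are placeholders, not real eNanoMapper identifiers.

```python
def extract_matches(payload):
    """Return a {label: IRI} dict from an OLS-style search payload."""
    docs = payload.get("response", {}).get("docs", [])
    # Skip entries without both fields; a repeated label keeps the last IRI seen
    return {d["label"]: d["iri"] for d in docs if "label" in d and "iri" in d}


# Hand-written payload, for illustration only
sample = {
    "response": {
        "docs": [
            {"label": "nanoparticle", "iri": "http://example.org/TERM_1"},
            {"label": "coated nanoparticle", "iri": "http://example.org/TERM_2"},
            {"iri": "http://example.org/TERM_3"},  # no label: ignored
        ]
    }
}
print(extract_matches(sample))
# {'nanoparticle': 'http://example.org/TERM_1', 'coated nanoparticle': 'http://example.org/TERM_2'}
```

Unlike the notebook function, this variant tolerates entries with missing fields and payloads without a `response` key, which is a design choice of the sketch.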

In [12]:
allign = ols_lookup(list(df.columns))

The lookup results for each column can then be browsed with a dropdown:

In [13]:
mapping_cols = pd.DataFrame(["a" for i in list(allign.keys())], list(allign.keys()), columns=["Mapping"])
select_var = Select(options = allign.keys())
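# Passing the dict to `interact` builds a dropdown of column names;
# the selected value is that column's {label: IRI} matches
@interact(select = allign)
def show_matches(select):
    display(Markdown("Below are the matches returned by the OLS."))
    display(pd.DataFrame([select]).transpose())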


## Cell values
The unique values of each column are analyzed individually, using an approach similar to the one above. Only the qualitative columns are analyzed.


In [14]:
qual_cols = [i for i in qual_cols if i not in ["Type: Organic (O)/inorganic (I)", "Cell age: embryonic (E), Adult (A)", 
                                               "Reference DOI", "Positive control (Y/N)", "Colloidal stability checked (Y/N)", 
                                               "Interference checked (Y/N)", "Animal?"]]
str_qual = "- " + "\n- ".join(qual_cols)
display(Markdown(str_qual))

- Nanoparticle
- coat
- Cells
- Cell line (L)/primary cells (P)
- Human(H)/Animal(A) cells
- Cell morphology
- Cell-organ/tissue source
- Test
- Test indicator
- Biochemical metric

In [15]:
allign = {col : ols_lookup(np.unique(df[col])) for col in qual_cols}

In [24]:
select_col = Select(options = qual_cols,)
select_var = Select(options = list(allign[select_col.value].keys()))
def update_var(*args):
    # Refresh the list of values whenever a different column is selected
    select_var.options = list(allign[select_col.value].keys())
select_col.observe(update_var, names="value")

# Flat list of every unique value across the analyzed columns
all_keys = [key for col in allign for key in allign[col]]

input_iris = Textarea(layout = widgets.Layout(width='800px'))
map_button = Button(description = "Assign IRIs")
matches = {key:'' for key in all_keys}
out = widgets.Output(layout={'border': '0px solid black'})


def map_click(*args):
    # One IRI per line; both http and https IRIs are accepted
    if "http://" in input_iris.value or "https://" in input_iris.value:
        matches[select_var.value] = list(np.unique(input_iris.value.split("\n")))
        with out:
            out.clear_output()
            display(Markdown("{} was assigned IRI: {}".format(select_var.value, matches[select_var.value])))

# Register the click handler once, not on every widget refresh
map_button.on_click(map_click)

def display_mapping(*args):
    display(pd.DataFrame([allign[select_col.value][select_var.value]]).transpose())


@interact(select_var = select_var, input_iris = input_iris, select_col = select_col)
def show_matches(select_col, select_var, input_iris):
    display(out)
    display(map_button)
    display(Markdown("#### Click the button to assign IRIs in `input_iris` to the highlighted term {}.".format(select_var)))
    display(Markdown("------"))
    display(Markdown("Below are the **eNanoMapper ontology matches** (label, IRI) returned by the OLS for **{} ({})**".format(select_var, select_col)))
    display_mapping()


In general, the data need more cleaning before terms can be mapped successfully with the OLS. The following steps clean the data before the lookup is repeated, and a final round of manual tuning will be attempted after that. Terms that are still missing will then be looked up against all ontologies in the OLS and [added to the eNanoMapper ontology](tbd).
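
One way to decide where cleaning is needed is to list the values for which the OLS returned nothing. The sketch below assumes that `allign` has the nested shape built by the per-column lookup (`{column: {term: {label: IRI}}}`). `unmatched_terms` is a hypothetical helper, and the sample dictionary is made up for illustration.

```python
def unmatched_terms(allign):
    """Return {column: [terms with no ontology match]} from a nested lookup dict."""
    missing = {}
    for column, terms in allign.items():
        empty = sorted(term for term, matches in terms.items() if not matches)
        if empty:
            missing[column] = empty
    return missing


# Made-up lookup results, shaped like the output of the per-column lookup
sample = {
    "Test": {"MTT": {"some label": "http://example.org/TERM_1"}, "unknown assay": {}},
    "Cells": {"cell line X": {}},
    "coat": {"PEI": {"another label": "http://example.org/TERM_2"}},
}
print(unmatched_terms(sample))
# {'Test': ['unknown assay'], 'Cells': ['cell line X']}
```

Columns in which every term has at least one match are left out of the result, so an empty dictionary means nothing is missing.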

#### Nanoparticle

#### Coat

#### Cells

#### Cell morphology

#### Cell age
How to approach this one?

#### Cell-organ/tissue source
The eNanoMapper ontology lacks tissue and organ classes.

In [18]:
#df.to_pickle("../data/nn8b07562_si_001.pkl")