A data science project that makes the GC-MS based metabolite profiles of rose scent, published in Nature Genetics in June 2018, compliant with the FAIR principles. It does so through the use of several data standards, namely: the Frictionless Tabular Data Package as the syntax for the data matrix; semantic markup of measurements with the STATO ontology and of chemicals with the ChEBI ontology and InChI annotations; and an emphasis on the need to clarify the semantics of data matrices.
To re-enact the FAIRification process we performed on this dataset, two options are available: either run the Make commands, or run the Jupyter notebooks which are provided.
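As a concrete illustration of the Frictionless Tabular Data Package syntax mentioned above, here is a minimal, hypothetical descriptor sketch; the package name, resource path, and field names are invented for illustration and are not the actual ones produced by this project:

```python
import json

# Minimal, illustrative Tabular Data Package descriptor.
# Name, resource path, and field names are hypothetical examples only.
descriptor = {
    "name": "rose-scent-gcms",
    "profile": "tabular-data-package",
    "resources": [{
        "name": "metabolite-profiles",
        "path": "data/processed/denovo/metabolite-profiles.csv",
        "profile": "tabular-data-resource",
        "schema": {
            "fields": [
                {"name": "chemical_name", "type": "string"},
                # an InChI string identifying the chemical
                {"name": "inchi", "type": "string"},
                {"name": "peak_area", "type": "number"},
            ]
        },
    }],
}

# Serialize to datapackage.json-style text
text = json.dumps(descriptor, indent=2)
print(descriptor["profile"])
```

The descriptor travels alongside the CSV it describes, so any Frictionless-aware tool can validate types and interpret the columns without guesswork.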
├── LICENSE
├── README.md <- The top-level README for developers using this project.
├── data
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
│
├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering)
│ and a short `-`-delimited description, e.g. `1-initial-data-exploration`.
│
├── references <- Data dictionaries, manuals, and all other explanatory materials.
│
├── figures <- Generated graphics and figures to be used in reporting
│
├── requirements.txt <- The requirements file for reproducing the analysis environment, e.g.
│ generated with `pip freeze > requirements.txt`
│
├── setup.py <- makes project pip installable (pip install -e .) so src can be imported
├── src <- Source code for use in this project.
│
└── tox.ini <- tox file with settings for running tox; see tox.testrun.org
- ensure that Python 3.6 or higher is installed or available in a virtual environment.
- NOTE: On macOS, whether running a virtualenv or not, you may need to create the following file:
touch ~/.matplotlib/matplotlibrc
- open the file and add the following to it:
backend: TkAgg
- save and close.
- from the root folder of the project, invoke the following commands:
make data
to convert the Excel and PDF legacy data into the Frictionless Data Package (the output of the command will be stored in the 'denovo' folder under './data/processed/').
make figure
to generate the figures (the reference output is stored under the './figures' directory; the output of the command will be stored in the 'denovo' folder under './figures').
- NOTE: the script
src/rose-plotting-from-rdf.py
runs a SPARQL query, which may take time to execute. One may wish to bypass this and comment out line 41 in the make file.
make clean
to restore the project to its initial state (this will remove all "denovo" created data).
Exhaustive documentation, in the form of a series of Jupyter notebooks, is provided; it describes the various steps of the FAIRification process. Four notebooks are available and can be run locally or using the Binder infrastructure. Note that launching a notebook with Binder may take several minutes (10-15 minutes) the first time around, while the installation process completes and depending on the load on the infrastructure. Once done, and as long as the build remains on the Binder infrastructure, starting and running the notebooks is very quick. However, bear in mind that the lifespan of these notebook instances on the virtual infrastructure is by nature limited.
- Start Binder environment from the main repository:
Once again, if running with myBinder.org, be aware that building the environment takes some time the first time around; let the process complete.
Important: you may see the message Failed to connect to event stream on the Binder page while it builds. If this occurs, simply refresh the page and let the process run until the notebook is launched.
- Start Binder environment from individual notebooks:
- Converting the original Excel spreadsheet released as associated material to a Frictionless Tabular Data package.
A Jupyter+Python notebook: 0-rose-metabolites-Python-data-handling.ipynb
A data exploration and graphical recapitulation of the dataset is performed in Python using the graphic grammar library plotnine, from either the Tabular Data Package or the RDF/Linked Data graph, to demonstrate the equivalency of the two representations. A visual exploration also shows how two datasets treated with the same protocol can be readily mobilized for a data integration exercise.
(Note: known issue: when using the myBinder infrastructure, calls to the libChEBI API may time out. This is an issue with the infrastructure, not the code being run; running the code locally avoids it (see below for instructions).)
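The notebooks use plotnine for the graphics themselves; as a rough, hypothetical stand-in, the reshaping step behind such a comparison can be sketched with pandas. The cultivar and compound names below are invented for illustration, not taken from the published data:

```python
import pandas as pd

# Hypothetical tidy table of GC-MS peak areas; cultivar and compound
# names are invented for illustration only.
df = pd.DataFrame({
    "cultivar": ["A", "A", "B", "B"],
    "compound": ["geraniol", "citronellol", "geraniol", "citronellol"],
    "peak_area": [10.0, 4.0, 2.0, 8.0],
})

# Pivot to a cultivar x compound matrix: the shape a heatmap or a
# faceted bar chart (e.g. with plotnine) would consume.
matrix = df.pivot(index="cultivar", columns="compound", values="peak_area")
print(matrix.loc["A", "geraniol"])  # 10.0
```

The same pivot works regardless of whether the tidy table was read from the Tabular Data Package CSV or reconstructed from SPARQL query results, which is what makes the two representations interchangeable for analysis.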
- Analysing the metabolite profiles using python and the plotnine library from the Frictionless Data Package:
A Jupyter+Python notebook: 1-rose-metabolites-Python-analysis.ipynb
- Recapitulating the analysis done previously, but from an RDF graph and SPARQL queries. A full Linked Data (LD) representation is provided, and a visualization of the metabolite profiles in 6 rose strains and 3 plant parts is presented.
(Note: known issue: the 3rd SPARQL query can take time to execute; be patient.)
A Jupyter+Python+Sparql notebook: 2-rose-metabolites-Python-RDF-querying-analysis.ipynb
- A fourth notebook performs the visual exploration of the dataset using R, to showcase how easily the Tabular Data Package can be consumed in a different environment.
A Jupyter+R notebook: 3-rose-metabolites-R-analysis.ipynb
If the Binder route does not work, simply run the notebooks locally, following the instructions below.
You will need *Python 3.6 or higher*, jupyter, and virtualenv (if you want to use virtual environments).
- Clone this repository:
git clone https://github.com/proccaserra/rose2018ng-notebook
- Get into the repository:
cd rose2018ng-notebook
If you do not want to use a virtual environment, skip the first two steps below:
- Create a virtual environment:
pyenv virtualenv 3.6.5 venv365
- Activate the virtual environment
pyenv activate venv365
- Install all the requirements:
pip install -r requirements.txt
- Make sure the ipython kernel is available:
ipython kernel install --user
- Run jupyter notebook:
jupyter notebook
A Python SPARQL kernel is needed to be able to run SPARQL queries. It is installed by default with the current setup configuration (based on https://github.com/paulovn/sparql-kernel).
The installation process requires two steps:
1. Install the Python package:
```pip install sparqlkernel```
2. Install the kernel into Jupyter:
```jupyter sparqlkernel install [--user] [--logdir]```
The `--user` option will install the kernel in the current user's personal config, while the generic command will install it as a global kernel (but this needs write permissions in the system directories).
To install R, please see the R project's installation instructions. To install the IPython R kernel, refer to IRkernel:
1. Start R from a terminal
(Important: *do not* run R from the R app when installing the kernel; it must be run from a terminal.)
2. From the R terminal, issue ```IRkernel::installspec(user=FALSE)```.
Note: if you run into problems, try running R with ```sudo -i R``` first, and then run the kernel installation.
If you are still stuck, have a look at [this tutorial](https://mpacer.org/maths/r-kernel-for-ipython-notebook)