# Data Mining of Human Skin Microbiome from EBI-Metagenomics Portal

_Matin Nuhamunada_<sup>1*</sup>, _Gregorius Altius Pratama_<sup>1</sup>, _Setianing Wikanthi_<sup>2</sup>, and _Mohamad Khoirul Anam_<sup>1</sup>

<sup>1</sup>Department of Tropical Biology, Universitas Gadjah Mada;   
Jl. Teknika Selatan, Sekip Utara, Bulaksumur, Yogyakarta, Indonesia, 55281;   

<sup>2</sup>Department of Agricultural Microbiology, Universitas Gadjah Mada;  

*Correspondence: [matin_nuhamunada@ugm.ac.id](mailto:matin_nuhamunada@mail.ugm.ac.id)  
[mohamad.khoirul.anam@mail.ugm.ac.id](mailto:mohamad.khoirul.anam@mail.ugm.ac.id)  
[gregorius.altius.p@mail.ugm.ac.id](mailto:gregorius.altius.p@mail.ugm.ac.id)  
[setianingwikanthi@mail.ugm.ac.id](mailto:setianingwikanthi@mail.ugm.ac.id)

---
## Abstract
The human skin microbiome is unique to each individual and is shaped by many factors, including behaviour, environment, and possibly genes. To better understand the distribution of the human skin microbiome across the globe, we compare several skin microbiome studies available in the EBI Metagenomics portal. Study data were acquired using the EBI Metagenomics API, and sample data were selected based on sex, location, and body site. The biological observation matrices (BIOM) from the analysis results of the selected samples were compared using MEGAN.

### Keywords
Human Skin, Microbiome, EBI-Metagenome


## Import Python Modules
We use a Python 3 script with ``pandas``, ``jsonapi_client``, and ``pycurl`` to mine data from the EBI Metagenomics portal [[1]](#ref1).

In [2]:
from pandas import DataFrame
import pandas as pd

try:
    from urllib import urlencode
except ImportError:
    from urllib.parse import urlencode

In [3]:
from jsonapi_client import Session, Filter

API_BASE = 'https://www.ebi.ac.uk/metagenomics/api/latest/'

In [4]:
import pycurl
import os, sys

## Load Functions

In [4]:
def get_metadata(metadata, key):
    import html
    for m in metadata:
        if m['key'].lower() == key.lower():
            value = m['value']
            unit = html.unescape(m['unit']) if m['unit'] else ""
            return "{value} {unit}".format(value=value, unit=unit)
    return None
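
As a quick check of how `get_metadata` behaves, the sketch below calls it on a small hand-written metadata list. The entries are hypothetical stand-ins shaped like the key/value/unit records the function expects, not real portal data. Note that a record without a unit yields a value with a trailing space, because the format string always inserts one between value and unit.

```python
import html

def get_metadata(metadata, key):
    # Same logic as the notebook function above
    for m in metadata:
        if m['key'].lower() == key.lower():
            value = m['value']
            unit = html.unescape(m['unit']) if m['unit'] else ""
            return "{value} {unit}".format(value=value, unit=unit)
    return None

# Hypothetical records (assumption: plain dicts with key/value/unit fields)
example_metadata = [
    {'key': 'sex', 'value': 'male', 'unit': None},
    {'key': 'body site', 'value': 'back', 'unit': None},
    {'key': 'temperature', 'value': '37', 'unit': '&deg;C'},
]

sex = get_metadata(example_metadata, 'Sex')                  # key match is case-insensitive
temperature = get_metadata(example_metadata, 'temperature')  # HTML entity in the unit is decoded
missing = get_metadata(example_metadata, 'age')              # absent key gives None

print(repr(sex), repr(temperature), repr(missing))
```

Any later filter on these values has to account for that trailing space, or strip it first.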

## Get Study
We searched the EBI Metagenomics database for human skin microbiome studies in the host-associated human biome, using 'skin' as the search term. The study list can be found at this link: https://www.ebi.ac.uk/metagenomics/projects/doExportDetails?searchTerm=skin&includingChildren=true&biomeLineage=root%3AHost-associated%3AHuman&search=Search
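
The export link can also be assembled from its query parameters rather than pasted as one long string, which makes it easier to change the search term or biome. The following is a minimal sketch using `urlencode` from the standard library; the names `EXPORT_ENDPOINT`, `params`, and `export_url` are introduced here for illustration only.

```python
from urllib.parse import urlencode

EXPORT_ENDPOINT = 'https://www.ebi.ac.uk/metagenomics/projects/doExportDetails'

# Query parameters taken from the link above
params = {
    'searchTerm': 'skin',
    'includingChildren': 'true',
    'biomeLineage': 'root:Host-associated:Human',
    'search': 'Search',
}

# urlencode percent-encodes the colons in the biome lineage (':' becomes '%3A')
export_url = EXPORT_ENDPOINT + '?' + urlencode(params)
print(export_url)
```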

In [6]:
filename = '01_Study_Skin+Host-Associated+Human_raw.csv'
print(filename)
if not os.path.isfile(filename):
    with open(filename, 'wb') as f:
        c = pycurl.Curl()
        c.setopt(c.URL, 'https://www.ebi.ac.uk/metagenomics/projects/doExportDetails?searchTerm=skin&includingChildren=true&biomeLineage=root%3AHost-associated%3AHuman&search=Search')
        c.setopt(c.WRITEDATA, f)
        c.perform()
        c.close()

01_Study_Skin+Host-Associated+Human_raw.csv


In [7]:
#Load Study Data
df1 = pd.read_csv(filename)
#df1

In [8]:
#Select relevant information
df_study = df1[["Study ID","Study Name","Number Of Samples", "Study Abstract"]]
#study_data

In [10]:
#Add information on biome
df2 = DataFrame(columns=("Biome","Lineage"))
df2.index.name = 'No'

#Open one API session and reuse it for every study
with Session(API_BASE) as s:
    for i in range(len(df_study)):
        study = s.get('studies', df_study.loc[i, "Study ID"]).resource
        for biome in study.biomes:
            #If a study lists several biomes, the last one is kept
            df2.loc[i] = [biome.biome_name, biome.lineage]
#df2

In [14]:
#Merge & Filter Table
df3 = pd.concat([df_study, df2], axis=1)
df4 = df3.set_index(["Study ID"])
df_study_biome = df4.query('Biome == ["Human", "Skin"] and Lineage == ["root:Host-associated:Human:Skin"]')

In [15]:
df_study_biome

Study ID,Study Name,Number Of Samples,Study Abstract,Biome,Lineage
SRP002480,Gene-Environment Interactions at the Skin Surface,2560,16S rRNA gene sequences amplified from subject...,Skin,root:Host-associated:Human:Skin
ERP018577,Human skin bacterial and fungal microbiotas an...,96,Using high-throughput 16S rDNA and ITS1 sequen...,Skin,root:Host-associated:Human:Skin
ERP022958,Impact of the Mk VI SkinSuit on skin microbiot...,204,Microgravity induces physiological decondition...,Skin,root:Host-associated:Human:Skin
ERP019566,Longitudinal study of the diabetic skin and wo...,258,Background: Type II diabetes is a chronic heal...,Skin,root:Host-associated:Human:Skin
ERP016629,Microbiome samples derived from Buruli ulcer w...,14,Background: Buruli ulcer (BU) is an infectious...,Skin,root:Host-associated:Human:Skin
SRP056364,Skin microbiome in human volunteers inoculated...,191,The aim of this project was to investigate the...,Skin,root:Host-associated:Human:Skin


In [None]:
#Export study data
df_study_biome.to_csv('02_Study_Skin_Host-Associated_Human_filtered.csv') #'|' is not allowed in Windows filenames

## Choose relevant study
From the table above, we can decide which study to use for the comparative microbiome analysis of human skin samples. We choose study `SRP002480` ("Gene-Environment Interactions at the Skin Surface", 2560 samples) as our data.

In [None]:
#Selected study
study = 'SRP002480'

### List samples with biomes for the given study

Get study: https://www.ebi.ac.uk/metagenomics/api/latest/studies/SRP002480

List samples: https://www.ebi.ac.uk/metagenomics/api/latest/studies/SRP002480/samples


Fetch samples for the given study accession: https://www.ebi.ac.uk/metagenomics/api/latest/samples?study_accession=SRP002480
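
The three API links above follow one pattern, so they can be derived from the study accession. The sketch below shows one way to do that; `study_endpoints` is a hypothetical helper introduced for illustration and is not part of `jsonapi_client` or of the notebook.

```python
from urllib.parse import urlencode

API_BASE = 'https://www.ebi.ac.uk/metagenomics/api/latest/'

def study_endpoints(accession):
    """Return the three API URLs listed above for one study accession."""
    return {
        'study': API_BASE + 'studies/' + accession,
        'study_samples': API_BASE + 'studies/' + accession + '/samples',
        'samples_by_study': API_BASE + 'samples?' + urlencode({'study_accession': accession}),
    }

endpoints = study_endpoints('SRP002480')
for name, url in endpoints.items():
    print(name, url)
```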


In [7]:
filename_sample = '03_sample_'+study+'_raw.csv'
print(filename_sample)
if not os.path.isfile(filename_sample):
    #Write to the sample file, not the study file downloaded earlier
    with open(filename_sample, 'wb') as f:
        c = pycurl.Curl()
        c.setopt(c.URL, 'https://www.ebi.ac.uk/metagenomics/projects/'+study+'/overview/doExport')
        c.setopt(c.WRITEDATA, f)
        c.perform()
        c.close()

In [6]:
#This script extracts the table data from the CSV file
df_sample = pd.read_csv(filename_sample)
df_sample_refine = df_sample[["Sample ID","Run ID","Release version"]]

In [5]:
#One column per metadata field fetched below
df_meta = DataFrame(columns=("Sex", "Body site"))
df_meta.index.name = 'No'

In [None]:
#Fetch metadata for each sample
from time import sleep
from tqdm.notebook import tqdm as tqdm_notebook

pbar = tqdm_notebook(range(len(df_sample_refine))) #progress bar

for i in pbar:
    if i not in df_meta.index: #skip rows that were already fetched
        with Session(API_BASE) as s:
            s_meta = s.get('samples', df_sample_refine.loc[i, "Sample ID"]).resource
            df_meta.loc[i] = [
                get_metadata(s_meta.sample_metadata, 'sex'),
                get_metadata(s_meta.sample_metadata, 'body site')
            ]
        pbar.set_description('processed: %d' % (i))
        #Iterating over pbar already advances the bar, so no manual update is needed
        sleep(1) #pause between requests
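
The check against `df_meta.index` in the loop above is what makes the fetch resumable: if the cell is interrupted and run again, rows that already exist are skipped. The sketch below isolates that idea with a plain dictionary and a fake fetch function. `fetch_missing` and `fake_fetch` are hypothetical names, and no network request is made.

```python
def fetch_missing(sample_ids, fetch, cache):
    """Call fetch() only for positions not yet present in cache."""
    calls = 0
    for i, sample_id in enumerate(sample_ids):
        if i not in cache:  # same resume check as the test against df_meta.index
            cache[i] = fetch(sample_id)
            calls += 1
    return calls

def fake_fetch(sample_id):
    # Stand-in for the API request; returns placeholder values, not real metadata
    return ('sex of ' + sample_id, 'body site of ' + sample_id)

sample_ids = ['SRS451417', 'SRS451418', 'SRS451419']
cache = {0: ('already', 'fetched')}  # pretend row 0 was fetched in an earlier run

first_run = fetch_missing(sample_ids, fake_fetch, cache)   # fetches rows 1 and 2
second_run = fetch_missing(sample_ids, fake_fetch, cache)  # nothing left to fetch
print(first_run, second_run)
```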

In [None]:
#Inspect a single fetched row
df_meta.loc[3100]

In [None]:
result = pd.concat([df_sample_refine, df_meta], axis=1)
result

In [None]:
result.to_csv('output.csv')

In [22]:
import pandas as pd
df_result_raw = pd.read_csv('output.csv', index_col=0)
df_result_raw

,Sample ID,Run ID,Release version,Sex,Body site
0,SRS451417,SRR919527,2.0,male,antecubital crease
1,SRS451417,SRR919587,2.0,male,antecubital crease
2,SRS451418,SRR919528,2.0,male,back
3,SRS451418,SRR919588,2.0,male,back
4,SRS451419,SRR919529,2.0,male,external auditory canal
5,SRS451419,SRR919589,2.0,male,external auditory canal
6,SRS451420,SRR919530,2.0,male,hypothenar palm
7,SRS451420,SRR919590,2.0,male,hypothenar palm
8,SRS451421,SRR919531,2.0,male,retroauricular crease
9,SRS451421,SRR919591,2.0,male,retroauricular crease


In [51]:
df_result_raw.columns = df_result_raw.columns.str.replace(' ', '_')
#df_result_male = df_result_raw.query('Sample_ID == ["SRS451431"] and Run_ID == ["SRR919601"]')
#Values carry a trailing space because get_metadata formats them as "{value} {unit}" with an empty unit
df_result_male = df_result_raw.query('Sex == ["male "] and Body_site == ["back "]')
df_result_male

,Sample_ID,Run_ID,Release_version,Sex,Body_site
2,SRS451418,SRR919528,2.0,male,back
3,SRS451418,SRR919588,2.0,male,back
20,SRS451427,SRR919536,2.0,male,back
21,SRS451427,SRR919596,2.0,male,back
42,SRS451438,SRR919548,2.0,male,back
43,SRS451438,SRR919608,2.0,male,back
190,SRS451613,SRR919884,2.0,male,back
191,SRS451613,SRR919934,2.0,male,back
226,SRS451712,SRR920114,2.0,male,back
227,SRS451712,SRR920161,2.0,male,back
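
The query above matches `"male "` and `"back "` with a trailing space, which is consistent with `get_metadata` formatting every value as `"{value} {unit}"` even when the unit is empty. One alternative, sketched below with the standard `csv` module, is to strip whitespace from every field before filtering so the comparison can use the plain values. The three rows are copied from the result table, with the trailing spaces assumed as just described.

```python
import csv
import io

# A few rows from the result table, with the assumed trailing spaces
raw = io.StringIO(
    "Sample_ID,Run_ID,Sex,Body_site\n"
    "SRS451417,SRR919527,male ,antecubital crease \n"
    "SRS451418,SRR919528,male ,back \n"
    "SRS451418,SRR919588,male ,back \n"
)

# Strip surrounding whitespace from every field before comparing
rows = [{k: v.strip() for k, v in row.items()} for row in csv.DictReader(raw)]

male_back_runs = [r['Run_ID'] for r in rows
                  if r['Sex'] == 'male' and r['Body_site'] == 'back']
print(male_back_runs)
```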


## Sampling BIOM Data from EBI
https://www.ebi.ac.uk/metagenomics//projects/SRP002480/samples/SRS451457/runs/SRR919567/results/versions/2.0/taxonomy/OTU-table-HDF5-BIOM
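
The download link above is built from the study, sample, and run accessions plus a pipeline version and a file format. A minimal sketch of that pattern follows. `biom_url` and `PORTAL_BASE` are hypothetical names introduced here, and the base string is copied verbatim from the link, including its double slash.

```python
# Base copied verbatim from the link above, including its double slash
PORTAL_BASE = 'https://www.ebi.ac.uk/metagenomics//projects/'

def biom_url(study, sample, run, version='2.0', fmt='OTU-table-HDF5-BIOM'):
    """Assemble the taxonomy download link for one run (hypothetical helper)."""
    return (PORTAL_BASE + study + '/samples/' + sample + '/runs/' + run
            + '/results/versions/' + version + '/taxonomy/' + fmt)

url = biom_url('SRP002480', 'SRS451457', 'SRR919567')
json_url = biom_url('SRP002480', 'SRS451457', 'SRR919567', fmt='OTU-table-JSON-BIOM')
print(url)
print(json_url)
```

Passing `fmt='OTU-table-JSON-BIOM'` gives the JSON variant of the same table.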

In [None]:
import pandas as pd
df4 = pd.read_csv("skin.csv")
df5 = df4.loc[:,"Sample ID":"Run ID"]
print(df5)

In [None]:
data_sampel = df5.loc[0:25,'Sample ID']
print(data_sampel)

In [None]:
data_run = df5.loc[0:25,'Run ID']
print(data_run)

In [None]:
#Create output folder
import os, sys

cwd = os.getcwd()
output_folder = "output" #name of the output folder

#os.path.join builds the path with the separator of the current operating system
new_dir = os.path.join(cwd, output_folder)

if not os.path.isdir(new_dir):
    os.mkdir(new_dir)

print(cwd)

In [None]:
import pycurl

#Fetch data from EBI
for i in range(25):
    os.chdir(new_dir) #move to the output folder
    filename = data_sampel[i] + '_' + data_run[i] + '.biom'
    print(filename)
    if not os.path.isfile(filename):
        with open(filename, 'wb') as f:
            c = pycurl.Curl()
            c.setopt(c.URL, 'https://www.ebi.ac.uk/metagenomics//projects/SRP002480/samples/'+ data_sampel[i] + '/runs/' + data_run[i] +'/results/versions/2.0/taxonomy/OTU-table-JSON-BIOM')
            c.setopt(c.WRITEDATA, f)
            c.perform()
            c.close()
    os.chdir(cwd) #return to the original folder
print('done')
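
The loop above switches the working directory before each download and back afterwards. An alternative, sketched here under the assumption that only the file naming and the skip-if-present check matter, is to build full output paths with `pathlib` and leave the working directory alone. `pending_files` is a hypothetical helper, and the example uses a temporary directory instead of performing any download.

```python
import tempfile
from pathlib import Path

def pending_files(pairs, out_dir):
    """Return output paths for (sample, run) pairs that are not on disk yet."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = [out_dir / (sample + '_' + run + '.biom') for sample, run in pairs]
    return [p for p in paths if not p.is_file()]

pairs = [('SRS451417', 'SRR919527'), ('SRS451418', 'SRR919528')]

with tempfile.TemporaryDirectory() as tmp:
    # Pretend the first file was downloaded in an earlier run
    (Path(tmp) / 'SRS451417_SRR919527.biom').write_bytes(b'')
    todo = [p.name for p in pending_files(pairs, tmp)]

print(todo)
```

Each returned path could then be opened with `open(path, 'wb')` and passed to `pycurl` as in the loop above.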

### References
---
<a id='ref1'></a>
1. Alex L Mitchell, Maxim Scheremetjew, Hubert Denise, Simon Potter, Aleksandra Tarkowska, Matloob Qureshi, Gustavo A Salazar, Sebastien Pesseat, Miguel A Boland, Fiona M I Hunter, Petra ten Hoopen, Blaise Alako, Clara Amid, Darren J Wilkinson, Thomas P Curtis, Guy Cochrane, Robert D Finn; EBI Metagenomics in 2017: enriching the analysis of microbial communities, from sequence reads to assemblies, Nucleic Acids Research, Volume 46, Issue D1, 4 January 2018, Pages D726–D735, https://doi.org/10.1093/nar/gkx967