# Data Mining of Human Skin Microbiome from EBI-Metagenomics Portal

_Matin Nuhamunada_<sup>1*</sup>, _Gregorius Altius Pratama_<sup>1</sup>, _Setianing Wikanthi_<sup>2</sup>, and _Mohamad Khoirul Anam_<sup>1</sup>

<sup>1</sup>Department of Tropical Biology, Universitas Gadjah Mada;   
Jl. Teknika Selatan, Sekip Utara, Bulaksumur, Yogyakarta, Indonesia, 55281;   

<sup>2</sup>Department of Agricultural Microbiology, Universitas Gadjah Mada;  

*Correspondence: [matin_nuhamunada@mail.ugm.ac.id](mailto:matin_nuhamunada@mail.ugm.ac.id)  
[mohamad.khoirul.anam@mail.ugm.ac.id](mailto:mohamad.khoirul.anam@mail.ugm.ac.id)  
[gregorius.altius.p@mail.ugm.ac.id](mailto:gregorius.altius.p@mail.ugm.ac.id)  
[setianingwikanthi@mail.ugm.ac.id](mailto:setianingwikanthi@mail.ugm.ac.id)

---
## Abstract
The human skin microbiome is unique to each individual and is shaped by many factors, including behaviour, environment, and possibly genetics. To better understand the distribution of the human skin microbiome across the globe, we compare several skin microbiome studies available in the EBI Metagenomics portal. Study data were acquired using the EBI Metagenomics API, and samples were selected based on sex, location, and body site. The biological observation matrices (BIOM) from the analysis results of the selected samples were compared using MEGAN.

### Keywords
Human Skin, Microbiome, EBI Metagenomics


## Import Python Modules
We use a Python 3 script with ``pandas``, ``jsonapi_client``, and ``pycurl`` to mine data from the EBI Metagenomics portal [[1]](#ref1).

In [2]:
from pandas import DataFrame
import pandas as pd

try:
    from urllib import urlencode
except ImportError:
    from urllib.parse import urlencode

In [3]:
from jsonapi_client import Session, Filter

API_BASE = 'https://www.ebi.ac.uk/metagenomics/api/latest/'

In [4]:
import pycurl
import os, sys

## Get Study
We search the EBI Metagenomics database for human skin microbiome studies, restricting the search to the host-associated human biome and using 'skin' as the search term. The study list can be downloaded from this link: https://www.ebi.ac.uk/metagenomics/projects/doExportDetails?searchTerm=skin&includingChildren=true&biomeLineage=root%3AHost-associated%3AHuman&search=Search

In [5]:
filename = 'data.csv'
print(filename)
if not os.path.isfile(filename):
    with open(filename, 'wb') as f:
        c = pycurl.Curl()
        c.setopt(c.URL, 'https://www.ebi.ac.uk/metagenomics/projects/doExportDetails?searchTerm=skin&includingChildren=true&biomeLineage=root%3AHost-associated%3AHuman&search=Search')
        c.setopt(c.WRITEDATA, f)
        c.perform()
        c.close()

data.csv


In [6]:
# This script extracts the table data from the CSV file
df1 = pd.read_csv("data.csv")
print(df1)

     Study ID                                         Study Name  \
0   ERP104068  EMG produced TPA metagenomics assembly of the ...   
1   SRP002480  Gene-Environment Interactions at the Skin Surface   
2   ERP018577  Human skin bacterial and fungal microbiotas an...   
3   ERP022958  Impact of the Mk VI SkinSuit on skin microbiot...   
4   ERP019566  Longitudinal study of the diabetic skin and wo...   
5   ERP016629  Microbiome samples derived from Buruli ulcer w...   
6   ERP021525  Micromes on salmon skin and surrounding sea water   
7   SRP056364  Skin microbiome in human volunteers inoculated...   
8   ERP104518                  skin microbiota in infected frogs   
9   ERP104520                 Skin microbiota of Scinax alcatraz   
10  ERP104516  Variations on the diversity of amphibian skin ...   

    Number Of Samples Submitted Date  Analysis NCBI Project ID  \
0                  45     2017-11-15  Finished      PRJEB22388   
1                2560     2016-02-03  Finished     

In [7]:
df2 = df1.set_index("Study ID", drop = False)
#print(df2["Study Abstract"])
df3 = df2["Study ID"]
#print(df3)

In [8]:
# Open one session and reuse it for every study; iterating over the values
# avoids positional indexing on a Series that is indexed by study ID.
with Session(API_BASE) as s:
    for study_id in df3:
        study = s.get('studies', study_id).resource
        print('Study id:', study.id)
        print('Study name:', study.study_name)
        print('Study abstract:', study.study_abstract)
        for biome in study.biomes:
            print('Biome:', biome.biome_name, biome.lineage)
        print('_____________________________________________________________')

Study id: ERP104068
Study name: EMG produced TPA metagenomics assembly of the Raw reads of the microbiota of premature infant mouth, skin, and gut (human gut metagenome) data set
Study abstract: The human gut metagenome Third Party Annotation (TPA) assembly was derived from the primary whole genome shotgun (WGS) data set PRJNA327106. This project includes samples from the following biomes : Human gut.
Biome: Human root:Host-associated:Human
_____________________________________________________________
Study id: SRP002480
Study name: Gene-Environment Interactions at the Skin Surface
Study abstract: 16S rRNA gene sequences amplified from subjects with eczema and age-matched healthy controls.  Microbes living in and on humans are ten times more numerous than human cells. Culture-based methods have been the primary techniques used to study microbes inhabiting humans; however, many species are not successfully grown in culture. The NIH Roadmap for Medical Research Human Microbiome Project (

In [9]:
study1 = 'SRP002480'

### List samples with biomes for the given study

Get study: https://www.ebi.ac.uk/metagenomics/api/latest/studies/SRP002480

List samples: https://www.ebi.ac.uk/metagenomics/api/latest/studies/SRP002480/samples


Fetch samples for the given study accession: https://www.ebi.ac.uk/metagenomics/api/latest/samples?study_accession=SRP002480
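
The three endpoints above follow a regular pattern, so they can be derived from the base URL and a study accession. The following is a minimal sketch using only the standard library; the helper names `study_url`, `study_samples_url`, and `samples_query_url` are hypothetical and are not part of `jsonapi_client` or of the EBI API itself.

```python
from urllib.parse import urlencode, urljoin

# Base URL as defined earlier in this notebook.
API_BASE = 'https://www.ebi.ac.uk/metagenomics/api/latest/'


def study_url(accession):
    """URL of a single study resource (hypothetical helper)."""
    return urljoin(API_BASE, 'studies/' + accession)


def study_samples_url(accession):
    """URL listing the samples nested under a study (hypothetical helper)."""
    return study_url(accession) + '/samples'


def samples_query_url(accession):
    """URL of the samples collection filtered by study accession (hypothetical helper)."""
    query = urlencode({'study_accession': accession})
    return urljoin(API_BASE, 'samples') + '?' + query


print(study_url('SRP002480'))
print(study_samples_url('SRP002480'))
print(samples_query_url('SRP002480'))
```

The last form is the one the notebook uses through `Filter(urlencode(params))`, where extra parameters such as `page_size` are added to the same query string.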


In [10]:
df = DataFrame(columns=('sample name', 'lineage', 'sex', 'body site', 'description'))
df.index.name = 'accession'

with Session(API_BASE) as s:
    params = {
        'study_accession': study1,
        'page_size': 100,
    }
    f = Filter(urlencode(params))
    for sample in s.iterate('samples', f):
        # For this study, sample_metadata[0] holds 'sex' and
        # sample_metadata[1] holds 'body site'; the order is assumed
        # to be the same for every sample.
        df.loc[sample.accession] = [
            sample.sample_name,
            sample.biome.id,
            sample.sample_metadata[0]["value"],
            sample.sample_metadata[1]["value"],
            sample.sample_desc
        ]
df

KeyboardInterrupt: 

In [11]:
#print(sample.accession)
#print(sample.analysis_completed)
#print(sample.as_resource_identifier_dict)
#print(sample.attributes)
#print(sample.biome)
#print(sample.biosample)
#print(sample.collection_date)
#print(sample.commit)
#print(sample.create_map)
#print(sample.delete)
#print(sample.dirty_fields)
#print(sample.environment_biome)
#print(sample.environment_feature)
#print(sample.environment_material)
#print(sample.fields)
#print(sample.geo_loc_name)
#print(sample.host_tax_id)
#print(sample.id)
#print(sample.is_dirty)
print(sample.sample_metadata[1])
print(sample.sample_metadata[0])

{'key': 'body site', 'value': 'retroauricular crease', 'unit': None}
{'key': 'sex', 'value': 'female', 'unit': None}
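
As the output above shows, each metadata entry is a dictionary with `key`, `value`, and `unit` fields. Reading fields by position (`[0]`, `[1]`) only works while every sample lists them in the same order, so a lookup by key may be more robust. The sketch below illustrates the idea with the two entries printed above; `metadata_to_dict` is a hypothetical helper, not part of the API client.

```python
def metadata_to_dict(metadata):
    """Map each metadata key (lower-cased) to its value (hypothetical helper)."""
    return {m['key'].lower(): m['value'] for m in metadata}


# The two entries printed above, written out as a plain list of dicts.
example_metadata = [
    {'key': 'sex', 'value': 'female', 'unit': None},
    {'key': 'body site', 'value': 'retroauricular crease', 'unit': None},
]

lookup = metadata_to_dict(example_metadata)
print(lookup.get('sex'))
print(lookup.get('body site'))
# A key that is absent yields None instead of raising an IndexError.
print(lookup.get('temperature'))
```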


In [None]:
df.to_csv('List Sample'+study1+'.csv',index=True,header=True)

In [None]:
def get_metadata(metadata, key):
    import html
    for m in metadata:
        if m['key'].lower() == key.lower():
            value = m['value']
            unit = html.unescape(m['unit']) if m['unit'] else ""
            return "{value} {unit}".format(value=value, unit=unit)
    return None

depth_label = 'geographic location (depth)'
temp_label = 'temperature'
df = DataFrame(columns=('sample name', 'biome', 'temperature', 'depth', 'longitude', 'latitude'))
df.index.name = 'accession'

with Session(API_BASE) as s:
    params = {
        'study_accession': study1,
        'include': 'biome',
        'page_size': 100,
    }
    f = Filter(urlencode(params))
    for sample in s.iterate('samples', f):
        df.loc[sample.accession] = [
            sample.sample_name, sample.biome.id,
            get_metadata(sample.sample_metadata, temp_label),
            get_metadata(sample.sample_metadata, depth_label),
            sample.longitude, sample.latitude
        ]
df

## Sampling BIOM Data from EBI
https://www.ebi.ac.uk/metagenomics//projects/SRP002480/samples/SRS451457/runs/SRR919567/results/versions/2.0/taxonomy/OTU-table-HDF5-BIOM
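
The download URL above is made of a study, a sample, and a run accession plus a pipeline version and a file type. A small sketch of how such URLs could be assembled is shown here; `biom_url` and `BIOM_URL` are hypothetical names, and the path pattern is simply copied from the example URL, not taken from official documentation.

```python
# Path pattern copied from the example URL above (an assumption, not a documented API).
BIOM_URL = ('https://www.ebi.ac.uk/metagenomics//projects/{study}/samples/{sample}'
            '/runs/{run}/results/versions/{version}/taxonomy/{fmt}')


def biom_url(study, sample, run, version='2.0', fmt='OTU-table-HDF5-BIOM'):
    """Build the download URL for one run's OTU table (hypothetical helper)."""
    return BIOM_URL.format(study=study, sample=sample, run=run,
                           version=version, fmt=fmt)


# Reproduces the example URL above.
print(biom_url('SRP002480', 'SRS451457', 'SRR919567'))
# The JSON variant, as requested by the download loop later in this notebook.
print(biom_url('SRP002480', 'SRS451417', 'SRR919527', fmt='OTU-table-JSON-BIOM'))
```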

In [12]:
import pandas as pd
df4 = pd.read_csv("skin.csv")
df5 = df4.loc[:,"Sample ID":"Run ID"]
print(df5)

      Sample ID      Run ID
0     SRS451417   SRR919527
1     SRS451417   SRR919587
2     SRS451418   SRR919528
3     SRS451418   SRR919588
4     SRS451419   SRR919529
5     SRS451419   SRR919589
6     SRS451420   SRR919530
7     SRS451420   SRR919590
8     SRS451421   SRR919531
9     SRS451421   SRR919591
10    SRS451422   SRR919532
11    SRS451422   SRR919592
12    SRS451423   SRR919533
13    SRS451423   SRR919593
14    SRS451424   SRR919534
15    SRS451424   SRR919594
16    SRS451425   SRR919535
17    SRS451425   SRR919595
18    SRS451426   SRR919537
19    SRS451426   SRR919597
20    SRS451427   SRR919536
21    SRS451427   SRR919596
22    SRS451428   SRR919538
23    SRS451428   SRR919598
24    SRS451429   SRR919539
25    SRS451429   SRR919599
26    SRS451430   SRR919540
27    SRS451430   SRR919600
28    SRS451431   SRR919541
29    SRS451431   SRR919601
...         ...         ...
4368  SRS732139  SRR1633154
4369  SRS732139  SRR1633155
4370  SRS732139  SRR1633156
4371  SRS732139  SRR

In [19]:
data_sampel = df5.loc[0:25,'Sample ID']
print(data_sampel)

0     SRS451417
1     SRS451417
2     SRS451418
3     SRS451418
4     SRS451419
5     SRS451419
6     SRS451420
7     SRS451420
8     SRS451421
9     SRS451421
10    SRS451422
11    SRS451422
12    SRS451423
13    SRS451423
14    SRS451424
15    SRS451424
16    SRS451425
17    SRS451425
18    SRS451426
19    SRS451426
20    SRS451427
21    SRS451427
22    SRS451428
23    SRS451428
24    SRS451429
25    SRS451429
Name: Sample ID, dtype: object


In [18]:
data_run = df5.loc[0:25,'Run ID']
print(data_run)

0     SRR919527
1     SRR919587
2     SRR919528
3     SRR919588
4     SRR919529
5     SRR919589
6     SRR919530
7     SRR919590
8     SRR919531
9     SRR919591
10    SRR919532
11    SRR919592
12    SRR919533
13    SRR919593
14    SRR919534
15    SRR919594
16    SRR919535
17    SRR919595
18    SRR919537
19    SRR919597
20    SRR919536
21    SRR919596
22    SRR919538
23    SRR919598
24    SRR919539
25    SRR919599
Name: Run ID, dtype: object


In [15]:
# Create the output folder
import os

cwd = os.getcwd()
output_folder = "output"  # name of the output folder

# os.path.join builds the path portably instead of hard-coding a backslash
new_dir = os.path.join(cwd, output_folder)
if not os.path.isdir(new_dir):
    os.mkdir(new_dir)

print(cwd)

E:\Jupyter_Lab\KetiakProject\src


In [21]:
import pycurl

# Fetch the data from EBI (rows 0 to 24 of the selection above)
for i in range(25):
    os.chdir(new_dir)  # move to the output folder
    filename = data_sampel[i] + '_' + data_run[i] + '.biom'
    print(filename)
    if not os.path.isfile(filename):
        with open(filename, 'wb') as f:
            c = pycurl.Curl()
            c.setopt(c.URL, 'https://www.ebi.ac.uk/metagenomics//projects/SRP002480/samples/'+ data_sampel[i] + '/runs/' + data_run[i] +'/results/versions/2.0/taxonomy/OTU-table-JSON-BIOM')
            c.setopt(c.WRITEDATA, f)
            c.perform()
            c.close()
    os.chdir(cwd)  # return to the original folder
print('done')

SRS451417_SRR919527.biom
SRS451417_SRR919587.biom
SRS451418_SRR919528.biom
SRS451418_SRR919588.biom
SRS451419_SRR919529.biom
SRS451419_SRR919589.biom
SRS451420_SRR919530.biom
SRS451420_SRR919590.biom
SRS451421_SRR919531.biom
SRS451421_SRR919591.biom
SRS451422_SRR919532.biom
SRS451422_SRR919592.biom
SRS451423_SRR919533.biom
SRS451423_SRR919593.biom
SRS451424_SRR919534.biom
SRS451424_SRR919594.biom
SRS451425_SRR919535.biom
SRS451425_SRR919595.biom
SRS451426_SRR919537.biom
SRS451426_SRR919597.biom
SRS451427_SRR919536.biom
SRS451427_SRR919596.biom
SRS451428_SRR919538.biom
SRS451428_SRR919598.biom
SRS451429_SRR919539.biom
done
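
Before handing the downloaded files to MEGAN, it can be useful to check that each one is a readable table. The sketch below assumes the files are BIOM 1.0 JSON documents (with `rows`, `columns`, `matrix_type`, and `data` fields) and sums the counts per observation. The function name `observation_totals` is hypothetical, and the miniature table is made up for illustration; it is not data from the study.

```python
import json


def observation_totals(biom):
    """Sum counts per observation (row) of a BIOM 1.0 table given as a dict."""
    row_ids = [row['id'] for row in biom['rows']]
    totals = dict.fromkeys(row_ids, 0)
    if biom['matrix_type'] == 'sparse':
        # Sparse data is a list of [row_index, column_index, value] triples.
        for row_idx, _col_idx, value in biom['data']:
            totals[row_ids[row_idx]] += value
    else:
        # Dense data is a list of rows, each a list of values.
        for row_id, values in zip(row_ids, biom['data']):
            totals[row_id] = sum(values)
    return totals


# Made-up miniature table for illustration only.
example = json.loads('''
{
  "rows": [{"id": "OTU_1", "metadata": null}, {"id": "OTU_2", "metadata": null}],
  "columns": [{"id": "RUN_A", "metadata": null}],
  "matrix_type": "sparse",
  "shape": [2, 1],
  "data": [[0, 0, 12], [1, 0, 3]]
}
''')

print(observation_totals(example))
```

For a file on disk, the same function could be applied to the result of `json.load` on the opened `.biom` file, provided the download returned a BIOM document rather than an error page.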


### References
---
<a id='ref1'></a>
1. Alex L Mitchell, Maxim Scheremetjew, Hubert Denise, Simon Potter, Aleksandra Tarkowska, Matloob Qureshi, Gustavo A Salazar, Sebastien Pesseat, Miguel A Boland, Fiona M I Hunter, Petra ten Hoopen, Blaise Alako, Clara Amid, Darren J Wilkinson, Thomas P Curtis, Guy Cochrane, Robert D Finn; EBI Metagenomics in 2017: enriching the analysis of microbial communities, from sequence reads to assemblies, Nucleic Acids Research, Volume 46, Issue D1, 4 January 2018, Pages D726–D735, https://doi.org/10.1093/nar/gkx967