# Data Mining of Human Skin Microbiome from EBI-Metagenomics Portal

_Matin Nuhamunada_<sup>1*</sup>, _Gregorius Altius Pratama_<sup>1</sup>, _Setianing Wikanthi_<sup>2</sup>, and _Mohamad Khoirul Anam_<sup>1</sup>

<sup>1</sup>Department of Tropical Biology, Universitas Gadjah Mada;   
Jl. Teknika Selatan, Sekip Utara, Bulaksumur, Yogyakarta, Indonesia, 55281;   

<sup>2</sup>Department of Agricultural Microbiology, Universitas Gadjah Mada;  

*Correspondence: [matin_nuhamunada@ugm.ac.id](mailto:matin_nuhamunada@mail.ugm.ac.id)  
[mohamad.khoirul.anam@mail.ugm.ac.id](mailto:mohamad.khoirul.anam@mail.ugm.ac.id)  
[gregorius.altius.p@mail.ugm.ac.id](mailto:gregorius.altius.p@mail.ugm.ac.id)  
[setianingwikanthi@mail.ugm.ac.id](mailto:setianingwikanthi@mail.ugm.ac.id)

---
## Abstract
The human skin microbiome is unique to each individual and is shaped by many factors, including behaviour, environment, and possibly genetics. To understand more about the distribution of the human skin microbiome across the globe, we compare several skin microbiome studies available in the EBI Metagenomics portal. Study data were acquired using the EBI Metagenomics API, and sample data were selected based on sex, location, and body site. The biological observation matrices (BIOM files) from the analysis results of the selected samples were compared using MEGAN.

### Keywords
Human Skin, Microbiome, EBI-Metagenome


## Import Python Modules
We use a Python 3 script with ``pandas``, ``jsonapi_client``, and ``pycurl`` to mine data from the EBI Metagenomics portal [[1]](#ref1).

In [1]:
from pandas import DataFrame
import pandas as pd

try:
    from urllib import urlencode
except ImportError:
    from urllib.parse import urlencode

In [2]:
from jsonapi_client import Session, Filter

API_BASE = 'https://www.ebi.ac.uk/metagenomics/api/latest/'

In [3]:
import pycurl
import os, sys
from tqdm import tqdm_notebook
import numpy as np
from time import sleep

## Load Functions

In [None]:
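def get_metadata(metadata, key):
    """Return 'value unit' for the first entry whose key matches (case-insensitive), else None."""
    import html
    for m in metadata:
        if m['key'].lower() == key.lower():
            value = m['value']
            unit = html.unescape(m['unit']) if m['unit'] else ""
            #Note: with an empty unit the result keeps a trailing space, e.g. 'male '
            return "{value} {unit}".format(value=value, unit=unit)
    return None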

def get_study(term, lineage, biome, filename):
    if not os.path.isfile(filename):
        with open(filename, 'wb') as f:
            c = pycurl.Curl()
            c.setopt(c.URL, 'https://www.ebi.ac.uk/metagenomics/projects/doExportDetails?searchTerm='+term+'&includingChildren=true&biomeLineage=root%3A'+lineage+'%3A'+biome+'&search=Search')
            c.setopt(c.WRITEDATA, f)
            c.perform()
            c.close()
    return filename

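def get_analysis_result(run, extension):
    """Return the download URL of the first analysis file whose format name contains `extension`."""
    API_BASE_RUN = 'https://www.ebi.ac.uk/metagenomics/api/latest/runs'
    with Session(API_BASE_RUN) as s:
        study = s.get(run,'analysis').resource
        for i in study.downloads:
            if extension in i.file_format['name']:
                return i.url #stop at the first matching file format
    return None #no download matches the requested format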
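
To make the output format of `get_metadata` concrete, here is a small offline sketch. The metadata records are hypothetical and only mimic the `key`/`value`/`unit` entries that the helper expects; they are not taken from the portal. The point to notice is that an empty unit leaves a trailing space in the returned string.

```python
import html

def get_metadata(metadata, key):
    """Same logic as the helper above: return 'value unit' for the first matching key."""
    for m in metadata:
        if m['key'].lower() == key.lower():
            unit = html.unescape(m['unit']) if m['unit'] else ""
            return "{value} {unit}".format(value=m['value'], unit=unit)
    return None

# Hypothetical metadata records (assumed shape, not real portal output)
example_metadata = [
    {'key': 'Sex', 'value': 'male', 'unit': None},
    {'key': 'Body site', 'value': 'back', 'unit': None},
    {'key': 'Temperature', 'value': '37', 'unit': '&deg;C'},
]

sex = get_metadata(example_metadata, 'sex')
temperature = get_metadata(example_metadata, 'temperature')
missing = get_metadata(example_metadata, 'age')

print(repr(sex))          # the empty unit leaves a trailing space
print(repr(temperature))  # the HTML entity in the unit is unescaped
print(repr(missing))      # no entry with this key
```

Any later filter on these values has to match the string exactly, including the trailing space, unless the values are stripped first.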

## Get Study
We search the EBI Metagenomics database for human skin microbiome studies in the host-associated biome, using 'skin' as the search term. The study list can be found at this link: https://www.ebi.ac.uk/metagenomics/projects/doExportDetails?searchTerm=skin&includingChildren=true&biomeLineage=root%3AHost-associated%3AHuman&search=Search
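
As an illustration of how the export link is put together, the sketch below rebuilds it from the search term and the biome lineage with `urlencode` from the standard library. The function name `build_study_export_url` and the constant `EXPORT_URL` are ours, introduced only for this example.

```python
from urllib.parse import urlencode

# Base of the export endpoint shown in the link above
EXPORT_URL = 'https://www.ebi.ac.uk/metagenomics/projects/doExportDetails'

def build_study_export_url(term, lineage, biome):
    """Return the study-export URL for a search term within a biome lineage."""
    params = [
        ('searchTerm', term),
        ('includingChildren', 'true'),
        ('biomeLineage', 'root:' + lineage + ':' + biome),
        ('search', 'Search'),
    ]
    # urlencode percent-encodes reserved characters such as ':' (as %3A)
    return EXPORT_URL + '?' + urlencode(params)

url = build_study_export_url('skin', 'Host-associated', 'Human')
print(url)
```

Letting `urlencode` handle the escaping avoids hand-written `%3A` sequences and keeps the order of lineage and biome explicit.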

In [None]:
#Search Study
term = 'skin'
lineage = 'Host-associated'
biome = 'Human'
file_study = '01_Study_'+term+'+'+biome+'+'+lineage+'_raw.csv'

In [None]:
#Download study information (argument order must match get_study: term, lineage, biome)
get_study(term, lineage, biome, file_study)

In [None]:
#Load Study information
df1 = pd.read_csv(file_study)
#df1

In [None]:
#Select relevant information
df_study = df1[["Study ID","Study Name","Number Of Samples", "Submitted Date", "Experimental Factor", "Study Abstract"]]
#df_study

In [None]:
#Add information on biome
df2 = DataFrame(columns=("Biome","Lineage","Publication"))
#df2.index.name = 'No'

for i in tqdm_notebook(range(len(df_study))):
    with Session(API_BASE) as s:
        study_info = s.get('studies', df_study.loc[i, "Study ID"]).resource
        for b in study_info.biomes: #if a study lists several biomes, the last one is kept
            df2.loc[i] = [b.biome_name,
                          b.lineage,
                          study_info.publications]
#df2

In [None]:
#Merge & Filter Table
df3 = pd.concat([df_study, df2], axis=1)
df3 = df3.set_index("Study ID") #set_index returns a new DataFrame, so keep the result
df_study_biome = df3.query('Biome == ["Human", "Skin"] and Lineage == ["root:Host-associated:Human:Skin"]')
df_study_biome

In [None]:
#Export study data
df_study_biome.to_csv('02_Study_Skin+Host-Associated+Human_filtered.csv')

## Choose a relevant study
From the table above, we can identify which studies are suitable for a comparative microbiome analysis of human skin samples. We chose the study with ID 'SRP002480' as our data set.

In [4]:
#Selected study
study = 'SRP002480'

### Get the list of samples for a given study and their metadata

Get study: https://www.ebi.ac.uk/metagenomics/api/latest/studies/SRP002480
List samples: https://www.ebi.ac.uk/metagenomics/api/latest/studies/SRP002480/samples
Fetch samples for the given study accession: https://www.ebi.ac.uk/metagenomics/api/latest/samples?study_accession=SRP002480


In [None]:
#Fetch a list of sample data from a given study
filename_sample = '03_sample_'+study+'_raw.csv'
print(filename_sample)
if not os.path.isfile(filename_sample):
    with open(filename_sample, 'wb') as f:
        c = pycurl.Curl()
        c.setopt(c.URL, 'https://www.ebi.ac.uk/metagenomics/projects/'+study+'/overview/doExport')
        c.setopt(c.WRITEDATA, f)
        c.perform()
        c.close()

In [None]:
#Filter relevant information from the list
df_sample = pd.read_csv(filename_sample)
df_sample_refine = df_sample[["Sample ID","Run ID","Release version"]]

In [None]:
#Create Container
if not os.path.isfile('04_raw_meta_'+study+'.csv'):
    df_meta = DataFrame(columns=('Sex',"Body site"))
    df_meta.index.name = 'No'
else:
    df_meta = pd.read_csv('04_raw_meta_'+study+'.csv', index_col=0)

#Fetch metadata for each sample
pbar = tqdm_notebook(range(len(df_sample_refine))) #to make progressbar
for i in pbar: #iterating over pbar already advances the bar, so no manual update is needed
    if i not in df_meta.index: #skip rows already stored in the container
        with Session(API_BASE) as s:
            s_meta = s.get('samples', df_sample_refine.loc[i, "Sample ID"]).resource
            df_meta.loc[i] = [
                get_metadata(s_meta.sample_metadata, 'sex'),
                get_metadata(s_meta.sample_metadata, 'body site')
            ]
        pbar.set_description('processed: %d' % (i))
        sleep(1) #pause between requests

In [None]:
#Write to container
df_meta.to_csv('04_raw_meta_'+study+'.csv')
#df_meta
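
The loop above keeps the fetched metadata in `df_meta` and writes the CSV only after the loop has finished, so an interruption loses everything fetched in that run. One possible variation, sketched here with the standard `csv` module, appends each row to the checkpoint file as soon as it arrives. `fetch_with_checkpoint` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for the real API call so the sketch runs offline.

```python
import csv
import os
import tempfile

FIELDS = ['Sample ID', 'Sex', 'Body site']

def fetch_with_checkpoint(sample_ids, fetch_one, checkpoint_path):
    """Fetch metadata per sample, appending each row to a CSV immediately."""
    done = {}
    file_exists = os.path.isfile(checkpoint_path)
    if file_exists:
        # Load rows saved by an earlier (possibly interrupted) run
        with open(checkpoint_path, newline='') as f:
            for row in csv.DictReader(f):
                done[row['Sample ID']] = row
    with open(checkpoint_path, 'a', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if not file_exists:
            writer.writeheader()
        for sample_id in sample_ids:
            if sample_id in done:
                continue  # already fetched, no new request needed
            sex, body_site = fetch_one(sample_id)
            row = {'Sample ID': sample_id, 'Sex': sex, 'Body site': body_site}
            writer.writerow(row)
            f.flush()  # push the row to disk before the next request
            done[sample_id] = row
    return done

# Stand-in for the API request (assumption: returns a (sex, body site) pair)
calls = []
def fake_fetch(sample_id):
    calls.append(sample_id)
    return 'male', 'back'

path = os.path.join(tempfile.mkdtemp(), 'meta_checkpoint.csv')
fetch_with_checkpoint(['SRS451418', 'SRS451427'], fake_fetch, path)
# A second run only requests the sample that is not in the checkpoint yet
result = fetch_with_checkpoint(['SRS451418', 'SRS451427', 'SRS451438'], fake_fetch, path)
print(calls)
```

Keying the checkpoint by sample ID also avoids a second request when one sample appears in several runs.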

In [None]:
#s_meta.sample_metadata

In [None]:
#Merge metadata with the raw sample list
result = pd.concat([df_sample_refine, df_meta], axis=1)
result.to_csv('05_sample_'+study+'_meta.csv')

### Query samples based on their metadata

In [6]:
#Load sample list
df_result_raw = pd.read_csv('05_sample_'+study+'_meta.csv', index_col = 0)
#df_result_raw

In [7]:
df_result_raw.columns = df_result_raw.columns.str.replace(' ', '_') #Refine the data to make it easier for filtering 

#Query back samples from male subjects (the stored values end with a space, hence "male " and "back ")
df_result_m_back = df_result_raw.query('Sex == ["male "] and Body_site == ["back "]')
df_result_m_back

| | Sample_ID | Run_ID | Release_version | Sex | Body_site |
|---|---|---|---|---|---|
| 2 | SRS451418 | SRR919528 | 2.0 | male | back |
| 3 | SRS451418 | SRR919588 | 2.0 | male | back |
| 20 | SRS451427 | SRR919536 | 2.0 | male | back |
| 21 | SRS451427 | SRR919596 | 2.0 | male | back |
| 42 | SRS451438 | SRR919548 | 2.0 | male | back |
| 43 | SRS451438 | SRR919608 | 2.0 | male | back |
| 190 | SRS451613 | SRR919884 | 2.0 | male | back |
| 191 | SRS451613 | SRR919934 | 2.0 | male | back |
| 226 | SRS451712 | SRR920114 | 2.0 | male | back |
| 227 | SRS451712 | SRR920161 | 2.0 | male | back |


In [None]:
dataframe = df_result_m_back
Sex = 'male '
Body_site = 'back '

def sort_by_meta(dataframe, Sex, Body_site):
    #Filter rows by sex and body site; '@' makes query() read the function arguments
    return dataframe.query('Sex == @Sex and Body_site == @Body_site')

sort_by_meta(dataframe, Sex, Body_site)

## Random sampling

In [12]:
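def random_sampling(dataframe, amount):
    #Draw rows at random until `amount` distinct samples are collected (one run per Sample_ID)
    if amount > dataframe['Sample_ID'].nunique():
        #Without this check the loop below would never terminate
        raise ValueError('amount exceeds the number of distinct Sample_ID values')
    df_random = DataFrame(columns=('Sample_ID','Run_ID','Release_version','Sex','Body_site'))
    df_random.index.name = 'No'
    a = 0
    while a < amount:
        i = np.random.choice(dataframe.index.values, 1)
        container = df_random.loc[:, 'Sample_ID']
        #Accept the row only if its sample has not been picked yet
        if not container.isin([dataframe.loc[i[0], 'Sample_ID']]).any():
            df_random.loc[i[0]] = [dataframe.loc[i[0], 'Sample_ID'],
                                   dataframe.loc[i[0], 'Run_ID'],
                                   dataframe.loc[i[0], 'Release_version'],
                                   dataframe.loc[i[0], 'Sex'],
                                   dataframe.loc[i[0], 'Body_site']
                                  ]
            a = a + 1
    return df_random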

In [15]:
dataframe = df_result_m_back
amount = 5
df_random_sample = random_sampling(dataframe, amount)

In [16]:
df_random_sample.to_csv('06_sampled_biom_'+study+'.csv')
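
The same selection idea (a fixed number of distinct samples, one run each) can also be written without a retry loop. The sketch below uses only the standard `random` module on plain `(sample_id, run_id)` pairs taken from the table of male back samples above. `sample_one_run_per_sample` is a hypothetical helper, not part of the notebook. Unlike `random_sampling`, it picks samples uniformly, regardless of how many runs each sample has.

```python
import random

def sample_one_run_per_sample(rows, amount, seed=None):
    """Pick `amount` distinct sample IDs and one random run for each.

    rows: iterable of (sample_id, run_id) pairs.
    """
    rng = random.Random(seed)
    runs_by_sample = {}
    for sample_id, run_id in rows:
        runs_by_sample.setdefault(sample_id, []).append(run_id)
    if amount > len(runs_by_sample):
        raise ValueError('amount exceeds the number of distinct samples')
    # Sampling without replacement guarantees distinct samples and termination
    chosen = rng.sample(sorted(runs_by_sample), amount)
    return [(s, rng.choice(runs_by_sample[s])) for s in chosen]

rows = [
    ('SRS451418', 'SRR919528'), ('SRS451418', 'SRR919588'),
    ('SRS451427', 'SRR919536'), ('SRS451427', 'SRR919596'),
    ('SRS451438', 'SRR919548'), ('SRS451438', 'SRR919608'),
]
picked = sample_one_run_per_sample(rows, 2, seed=0)
print(picked)
```

Passing a `seed` makes the selection reproducible, which helps when the sampled list is exported and reused later.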

### Get analysis result of a given sample in a study
https://www.ebi.ac.uk/metagenomics//projects/SRP002480/samples/SRS451457/runs/SRR919567/results/versions/2.0/taxonomy/OTU-table-HDF5-BIOM

In [None]:
#Create output folder
cwd = os.getcwd() #get current working directory
new_dir = os.path.join(cwd, 'output_'+study) #output folder for the given study; os.path.join works on any operating system
if not os.path.isdir(new_dir):
    os.mkdir(new_dir)
new_dir

In [None]:
extension = 'JSON Biom'

#Fetch data from EBI
for i in tqdm_notebook(df_result_m_back.index):
    run = df_result_m_back.loc[i, "Run_ID"] #use the run of this row, not one fixed run
    #Strip the trailing spaces of the metadata values so the file name stays clean
    filename = df_result_m_back.loc[i, "Sample_ID"]+\
    '_'+run+\
    '_'+df_result_m_back.loc[i, "Sex"].strip()+\
    '_'+df_result_m_back.loc[i, "Body_site"].strip()+\
    '.biom'
    filepath = os.path.join(new_dir, filename) #write into the output folder
    if not os.path.isfile(filepath):
        url = get_analysis_result(run, extension)
        if url is None:
            continue #no file in the requested format for this run
        with open(filepath, 'wb') as f:
            c = pycurl.Curl()
            c.setopt(c.URL, url)
            c.setopt(c.WRITEDATA, f)
            c.perform()
            c.close()
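
Because the `Sex` and `Body_site` values in this notebook end with a space, file names built by plain concatenation inherit that whitespace. The sketch below shows one way to build the output path with `os.path.join` and trimmed values. `biom_path` is a hypothetical helper, and the underscore-separated naming scheme is our assumption rather than a portal convention.

```python
import os

def biom_path(output_dir, sample_id, run_id, sex, body_site):
    """Build the output path for one run, trimming stray whitespace from metadata values."""
    parts = [sample_id, run_id, sex, body_site]
    # Strip outer whitespace and replace inner spaces so the name is shell-friendly
    name = '_'.join(p.strip().replace(' ', '-') for p in parts) + '.biom'
    return os.path.join(output_dir, name)

path = biom_path('output_SRP002480', 'SRS451418', 'SRR919528', 'male ', 'back ')
print(path)
```

Building the full path this way also removes the need to change the working directory before and after each download.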

### References
---
<a id='ref1'></a>
1. Alex L Mitchell, Maxim Scheremetjew, Hubert Denise, Simon Potter, Aleksandra Tarkowska, Matloob Qureshi, Gustavo A Salazar, Sebastien Pesseat, Miguel A Boland, Fiona M I Hunter, Petra ten Hoopen, Blaise Alako, Clara Amid, Darren J Wilkinson, Thomas P Curtis, Guy Cochrane, Robert D Finn; EBI Metagenomics in 2017: enriching the analysis of microbial communities, from sequence reads to assemblies, Nucleic Acids Research, Volume 46, Issue D1, 4 January 2018, Pages D726–D735, https://doi.org/10.1093/nar/gkx967