# Data Mining of Human Skin Microbiome from EBI-Metagenomics Portal

_Matin Nuhamunada_<sup>1*</sup>, _Gregorius Altius Pratama_<sup>1</sup>, _Setianing Wikanthi_<sup>2</sup>, and _Mohamad Khoirul Anam_<sup>1</sup>

<sup>1</sup>Department of Tropical Biology, Universitas Gadjah Mada;   
Jl. Teknika Selatan, Sekip Utara, Bulaksumur, Yogyakarta, Indonesia, 55281;   

<sup>2</sup>Department of Agricultural Microbiology, Universitas Gadjah Mada;  
Jl. Flora No. 1 Bulaksumur Yogyakarta, Indonesia, 55281

*Correspondence: [matin_nuhamunada@ugm.ac.id](mailto:matin_nuhamunada@mail.ugm.ac.id)  

---
## Abstract
The human skin microbiome is unique to each individual and is shaped by many factors, including behaviour, environment, and possibly genetics. To better understand the distribution of the human skin microbiome across the globe, we compare several skin microbiome studies available in the EBI Metagenomics portal. Study data were acquired using the EBI Metagenomics API, and samples were selected based on sex, location, and body site. The biological observation matrices (BIOM) from the analysis results of the selected samples were compared using MEGAN.

### Keywords
Human Skin, Microbiome, EBI-Metagenome


## Import Python Modules
We use Python 3 in a Jupyter Notebook, with ``pandas``, ``jsonapi_client``, and ``pycurl``, to mine data from the EBI Metagenomics portal through its API [[1]](#ref1).

In [1]:
from pandas import DataFrame
import pandas as pd
import numpy as np

In [2]:
from jsonapi_client import Session, Filter
import pycurl
import html

API_BASE = 'https://www.ebi.ac.uk/metagenomics/api/latest/'

In [3]:
import os, sys
from tqdm import tqdm_notebook
import ipywidgets
from time import sleep
import glob

## Load Functions

In [4]:
def get_metadata(metadata, key):
    for m in metadata:
        if m['key'].lower() == key.lower():
            value = m['value']
            unit = html.unescape(m['unit']) if m['unit'] else ""
            return "{value} {unit}".format(value=value, unit=unit)
    return None

def get_study(term, lineage, biome, filename):
    if not os.path.isfile(filename):
        with open(filename, 'wb') as f:
            c = pycurl.Curl()
            c.setopt(c.URL, 'https://www.ebi.ac.uk/metagenomics/projects/doExportDetails?searchTerm='+term+'&includingChildren=true&biomeLineage=root%3A'+lineage+'%3A'+biome+'&search=Search')
            c.setopt(c.WRITEDATA, f)
            c.perform()
            c.close()
    return filename

def get_analysis_result(run, extension):
    API_BASE_RUN = 'https://www.ebi.ac.uk/metagenomics/api/latest/runs'
    link = None  #stays None if no download matches the requested format
    with Session(API_BASE_RUN) as s:
        study = s.get(run,'analysis').resource
        for i in study.downloads:
            if extension in i.file_format['name']:
                link = i.url
    return link

def random_sampling(dataframe, amount):
    df_random = DataFrame(columns=('Sample_ID','Run_ID','Release_version','Sex','Body_site', 'Description'))
    df_random.index.name = 'No'
    a = 0
    while a < amount:
        i = np.random.choice(dataframe.index.values, 1)
        container = df_random.loc[:, 'Sample_ID']
        if not container.isin([dataframe.loc[i[0], 'Sample_ID']]).any():
            df_random.loc[i[0]] = [dataframe.loc[i[0], 'Sample_ID'], \
                                   dataframe.loc[i[0], 'Run_ID'], \
                                   dataframe.loc[i[0], 'Release_version'], \
                                   dataframe.loc[i[0], 'Sex'], \
                                   dataframe.loc[i[0], 'Body_site'], \
                                   dataframe.loc[i[0], 'Description']\
                                  ]
            a = a + 1
    return df_random
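
One caveat of `random_sampling` is that its `while` loop cannot finish when the input contains fewer distinct `Sample_ID` values than `amount`. As a hedged alternative, the sketch below uses `DataFrame.sample`, which draws rows without replacement. `sample_unique` is a hypothetical name, and capping the draw at the number of available samples is a design choice of this sketch, not the notebook's behaviour.

```python
import pandas as pd

def sample_unique(dataframe, amount, seed=None):
    """Draw up to `amount` rows, at most one per Sample_ID."""
    # Keep the first run of each sample so that no Sample_ID can be drawn twice.
    unique_rows = dataframe.drop_duplicates(subset='Sample_ID')
    # Never ask for more rows than exist; this avoids an endless loop or an error.
    n = min(amount, len(unique_rows))
    return unique_rows.sample(n=n, random_state=seed)

# Illustrative data only: sample S1 has two runs.
toy = pd.DataFrame({
    'Sample_ID': ['S1', 'S1', 'S2', 'S3'],
    'Run_ID': ['R1', 'R2', 'R3', 'R4'],
})
picked = sample_unique(toy, amount=5, seed=0)
print(picked)
```

Unlike the loop above, this variant always keeps the first run of a sample rather than a randomly chosen one, and it preserves whatever columns the input frame has.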

## 1. Get List of Studies
___
We searched the EBI Metagenomics database for human skin microbiome studies in the host-associated biome, using 'skin' as the search term. The study list is available at this link: https://www.ebi.ac.uk/metagenomics/projects/doExportDetails?searchTerm=skin&includingChildren=true&biomeLineage=root%3AHost-associated%3AHuman&search=Search
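
The export URL above is the query that `get_study` assembles by string concatenation. As a minimal sketch, the same URL can be built with the standard library's `urllib.parse.urlencode`, which percent-encodes the colons in the biome lineage. `build_export_url` is a hypothetical helper name, not part of the notebook or of the EBI API.

```python
from urllib.parse import urlencode

EXPORT_ENDPOINT = 'https://www.ebi.ac.uk/metagenomics/projects/doExportDetails'

def build_export_url(term, lineage, biome):
    """Hypothetical helper: compose the study-export URL used in this notebook."""
    params = {
        'searchTerm': term,
        'includingChildren': 'true',
        'biomeLineage': 'root:{}:{}'.format(lineage, biome),
        'search': 'Search',
    }
    # urlencode escapes ':' as %3A and joins the pairs with '&'.
    return EXPORT_ENDPOINT + '?' + urlencode(params)

url = build_export_url('skin', 'Host-associated', 'Human')
print(url)
```

Building the query from named parameters makes the order of `lineage` and `biome` explicit, which is easy to mix up with positional string concatenation.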

In [11]:
#Search Study
term = 'skin'
lineage = 'Host-associated'
biome = 'Human'
file_study = '01_Study_'+term+'+'+biome+'+'+lineage+'_raw.csv'

In [12]:
#Download study information
get_study(term, lineage, biome, file_study)

'01_Study_skin+Human+Host-associated_raw.csv'

In [1]:
#Load Study information
df1 = pd.read_csv(file_study)
#df1


In [14]:
#Select relevant information
df_study = df1[["Study ID","Study Name","Number Of Samples", "Submitted Date"]]
#df_study

In [17]:
#Add information on biome
df2 = DataFrame(columns=("Biome","Lineage","Latitude","Longitude","Publication"))
#df2.index.name = 'No'

for i in tqdm_notebook(range(len(df_study))):
    with Session(API_BASE) as s:
        std = s.get('studies', df_study.loc[i, "Study ID"]).resource
        for a in std.biomes:
            df2.loc[i, "Biome"] = a.biome_name
            df2.loc[i, "Lineage"] = a.lineage
        for g in std.geocoordinates:
            df2.loc[i, "Latitude"] = g.latitude
            df2.loc[i, "Longitude"] = g.longitude
        for p in std.publications:
            df2.loc[i, "Publication"] = p.doi



In [51]:
#Merge & Filter Table
df3 = pd.concat([df_study, df2], axis=1)
df3.set_index(["Study ID"])
df_study_biome = df3.query('Biome == ["Human", "Skin"] and Lineage == ["root:Host-associated:Human:Skin"]')
df_study_biome

| | Study ID | Study Name | Number Of Samples | Submitted Date | Biome | Lineage | Latitude | Longitude | Publication |
|---|---|---|---|---|---|---|---|---|---|
| 1 | SRP002480 | Gene-Environment Interactions at the Skin Surface | 2560 | 2016-02-03 | Skin | root:Host-associated:Human:Skin | | | 10.1101/gr.131029.111 |
| 2 | ERP018577 | Human skin bacterial and fungal microbiotas an... | 96 | 2016-11-03 | Skin | root:Host-associated:Human:Skin | -9.9667 | -55.2502 | |
| 3 | ERP022958 | Impact of the Mk VI SkinSuit on skin microbiot... | 204 | 2017-06-16 | Skin | root:Host-associated:Human:Skin | | | |
| 4 | ERP019566 | Longitudinal study of the diabetic skin and wo... | 258 | 2017-11-27 | Skin | root:Host-associated:Human:Skin | 33.8832 | 151.201 | 10.7717/peerj.3543 |
| 5 | ERP016629 | Microbiome samples derived from Buruli ulcer w... | 14 | 2016-07-29 | Skin | root:Host-associated:Human:Skin | 6.6645 | 2.1598 | 10.1371/journal.pone.0181994 |
| 7 | SRP056364 | Skin microbiome in human volunteers inoculated... | 191 | 2016-02-04 | Skin | root:Host-associated:Human:Skin | | | |


In [52]:
#Export study data
df_study_biome.to_csv('02_Study_'+term+'+'+biome+'+'+lineage+'_filtered.csv')

## 2. Choose relevant studies
___
Our search using the term "skin" in the human host-associated biome returned 7 studies. From these studies, we selected comparable human skin microbiome samples in order to compare microbiome profiles across studies.

### 2.1. SRP002480: Gene-Environment Interactions at the Skin Surface
#### 2.1.1 Get the list of samples for the given study and their metadata

Get study: https://www.ebi.ac.uk/metagenomics/api/latest/studies/SRP002480  
List samples: https://www.ebi.ac.uk/metagenomics/api/latest/studies/SRP002480/samples  
Fetch samples for the given study accession: https://www.ebi.ac.uk/metagenomics/api/latest/samples?study_accession=SRP002480  

In [33]:
#Selected study
study = 'SRP002480'

In [34]:
#Fetch a list of sample data from a given study
filename_sample = '03_sample_'+study+'_raw.csv'
print(filename_sample)
if not os.path.isfile(filename_sample):
    with open(filename_sample, 'wb') as f:
        c = pycurl.Curl()
        c.setopt(c.URL, 'https://www.ebi.ac.uk/metagenomics/projects/'+study+'/overview/doExport')
        c.setopt(c.WRITEDATA, f)
        c.perform()
        c.close()

03_sample_SRP002480_raw.csv


In [35]:
#Filter relevant information from the list
df_sample = pd.read_csv(filename_sample)
df_sample_refine = df_sample[["Sample ID","Run ID","Release version"]]

In [45]:
#check available metadata key
with Session(API_BASE) as s:
    s_meta = s.get('samples', df_sample_refine.loc[0, "Sample ID"]).resource
s_meta.sample_metadata

[{'key': 'sex', 'unit': None, 'value': 'male'},
 {'key': 'body site', 'unit': None, 'value': 'antecubital crease'},
 {'key': 'NCBI sample classification', 'unit': None, 'value': '646099'},
 {'key': 'instrument model', 'unit': None, 'value': '454 GS FLX Titanium'}]
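
To illustrate how `get_metadata` reads a list like the one above, the sketch below repeats the function on a hard-coded copy of part of that output, so no API call is needed. Note that the returned string keeps a trailing space when the unit is empty, which appears to be why later cells compare against `'#N/A '` and call `.str.strip()`.

```python
import html

def get_metadata(metadata, key):
    # Same logic as the notebook's helper: case-insensitive key lookup.
    for m in metadata:
        if m['key'].lower() == key.lower():
            value = m['value']
            unit = html.unescape(m['unit']) if m['unit'] else ""
            return "{value} {unit}".format(value=value, unit=unit)
    return None

# Hard-coded copy of two entries from the output above.
sample_metadata = [
    {'key': 'sex', 'unit': None, 'value': 'male'},
    {'key': 'body site', 'unit': None, 'value': 'antecubital crease'},
]

sex = get_metadata(sample_metadata, 'Sex')
site = get_metadata(sample_metadata, 'body site')
missing = get_metadata(sample_metadata, 'host disease status')
print(repr(sex), repr(site), missing)
```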

In [36]:
#Create Container
if not os.path.isfile('04_raw_meta_'+study+'.csv'):
    df_meta = DataFrame(columns=('Sex',"Body site", "Description"))
    df_meta.index.name = 'No'
else:
    df_meta = pd.read_csv('04_raw_meta_'+study+'.csv', index_col=0)

#Fetch metadata for given sample
pbar = tqdm_notebook(range(len(df_sample_refine))) #to make progressbar
for i in pbar:
    if not i in df_meta.index:
        with Session(API_BASE) as s:
            s_meta = s.get('samples', df_sample_refine.loc[i, "Sample ID"]).resource
            df_meta.loc[i] = [
                get_metadata(s_meta.sample_metadata, 'sex'),
                get_metadata(s_meta.sample_metadata, 'body site'),
                s_meta.sample_desc
            ]
        pbar.set_description('processed: %d' % (i))
        sleep(1) #pbar advances automatically when iterated, so no manual update is needed


In [37]:
#Write to container
df_meta.to_csv('04_raw_meta_'+study+'.csv')

In [38]:
#Merge metadata with the raw sample list
result = pd.concat([df_sample_refine, df_meta], axis=1)
result.to_csv('05_sample_'+study+'_meta.csv')

#### 2.1.2. Query samples based on its metadata

In [128]:
#Load sample list
df_result_raw = pd.read_csv('05_sample_'+study+'_meta.csv', index_col = 0)

In [129]:
#Refine the column name to make it easier for filtering 
df_result_raw.columns = df_result_raw.columns.str.replace(' ', '_') 
#df_result_raw

In [132]:
#Delete empty data
df = df_result_raw
df_result_refine = df[df_result_raw != '#N/A ']
df_result_refine = df_result_refine.dropna()
df_result_refine['Sex'] = df_result_refine['Sex'].str.strip()
df_result_refine['Body_site'] = df_result_refine['Body_site'].str.strip()
df_result_refine.to_csv('05_sample_'+study+'_meta_refine.csv')
df_result_refine

| | Sample_ID | Run_ID | Release_version | Sex | Body_site |
|---|---|---|---|---|---|
| 0 | SRS451417 | SRR919527 | 2.0 | male | antecubital crease |
| 1 | SRS451417 | SRR919587 | 2.0 | male | antecubital crease |
| 2 | SRS451418 | SRR919528 | 2.0 | male | back |
| 3 | SRS451418 | SRR919588 | 2.0 | male | back |
| 4 | SRS451419 | SRR919529 | 2.0 | male | external auditory canal |
| 5 | SRS451419 | SRR919589 | 2.0 | male | external auditory canal |
| 6 | SRS451420 | SRR919530 | 2.0 | male | hypothenar palm |
| 7 | SRS451420 | SRR919590 | 2.0 | male | hypothenar palm |
| 8 | SRS451421 | SRR919531 | 2.0 | male | retroauricular crease |
| 9 | SRS451421 | SRR919591 | 2.0 | male | retroauricular crease |
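
The cleaning step above relies on boolean masking: `df[df != '#N/A ']` keeps the frame's shape but replaces every matching cell with `NaN`, and `dropna()` then removes the affected rows. A minimal sketch on made-up rows (the sample IDs here are illustrative, not taken from the study):

```python
import pandas as pd

toy = pd.DataFrame({
    'Sample_ID': ['S1', 'S2', 'S3'],
    'Sex': ['male ', '#N/A ', 'female '],
    'Body_site': ['back ', 'occiput ', 'nare '],
})

masked = toy[toy != '#N/A ']      # '#N/A ' cells become NaN
cleaned = masked.dropna().copy()  # drop rows that contain any NaN
cleaned['Sex'] = cleaned['Sex'].str.strip()
cleaned['Body_site'] = cleaned['Body_site'].str.strip()
print(cleaned)
```

Taking an explicit `.copy()` before assigning the stripped columns is a choice of this sketch; it avoids modifying a view of the masked frame.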


In [43]:
#Create a list of values in body site metadata
meta_list = df_result_refine["Body_site"].tolist()
meta_list_refine = list(set(meta_list))
for i in range(len(meta_list_refine)):
    meta_list_refine[i] = meta_list_refine[i].strip()

meta_list2 = df_result_refine["Sex"].tolist()
meta_list2_refine = list(set(meta_list2))
for i in range(len(meta_list2_refine)):
    meta_list2_refine[i] = meta_list2_refine[i].strip()

print(meta_list_refine)
print(meta_list2_refine)

['back', 'hypothenar palm', 'volar forearm', 'vagina', 'manubrium', 'toenail', 'antecubital crease', 'popliteal fossa', 'glabella', 'external auditory canal', 'toeweb', 'inguinal crease', 'occiput', 'antecubital fossa', 'nare', 'plantar heel', 'retroauricular crease']
['female', 'male']
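
As an alternative to the list-and-set loops above, pandas can report the distinct values directly, and `value_counts()` additionally shows how many runs fall into each category. A small sketch on illustrative data:

```python
import pandas as pd

# Illustrative rows only; one value carries a trailing space on purpose.
toy = pd.DataFrame({
    'Body_site': ['back', 'occiput ', 'back', 'plantar heel'],
    'Sex': ['male', 'female', 'female', 'male'],
})

stripped = toy['Body_site'].str.strip()
body_sites = sorted(stripped.unique())  # distinct body sites
counts = stripped.value_counts()        # number of rows per body site
print(body_sites)
print(counts)
```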


### 2.2. ERP018577: Human skin bacterial and fungal microbiotas an...
#### 2.2.1 Get the list of samples for the given study and their metadata

Get study: https://www.ebi.ac.uk/metagenomics/api/latest/studies/ERP018577  
List samples: https://www.ebi.ac.uk/metagenomics/api/latest/studies/ERP018577/samples  
Fetch samples for the given study accession: https://www.ebi.ac.uk/metagenomics/api/latest/samples?study_accession=ERP018577  

In [5]:
#Selected study
study2 = 'ERP018577'

In [6]:
#Fetch a list of sample data from a given study
filename_sample = '03_sample_'+study2+'_raw.csv'
print(filename_sample)
if not os.path.isfile(filename_sample):
    with open(filename_sample, 'wb') as f:
        c = pycurl.Curl()
        c.setopt(c.URL, 'https://www.ebi.ac.uk/metagenomics/projects/'+study2+'/overview/doExport')
        c.setopt(c.WRITEDATA, f)
        c.perform()
        c.close()

03_sample_ERP018577_raw.csv


In [7]:
#Filter relevant information from the list
df_sample = pd.read_csv(filename_sample)
df_sample_refine = df_sample[["Sample ID","Run ID","Release version"]]

In [13]:
#check available metadata key
with Session(API_BASE) as s:
    s_meta = s.get('samples', df_sample_refine.loc[0, "Sample ID"]).resource
s_meta.sample_metadata

[{'key': 'investigation type', 'unit': None, 'value': 'metagenome'},
 {'key': 'project name', 'unit': None, 'value': 'Dandruff'},
 {'key': 'collection date', 'unit': None, 'value': '2014'},
 {'key': 'sequencing method',
  'unit': None,
  'value': 'Illumina Miseq Sequencing'},
 {'key': 'NCBI sample classification', 'unit': None, 'value': '646099'},
 {'key': 'instrument model', 'unit': None, 'value': 'Illumina MiSeq'},
 {'key': 'ENA checklist',
  'unit': None,
  'value': 'GSC MIxS human associated (ERC000014)'},
 {'key': 'host body site', 'unit': None, 'value': 'Scalp'},
 {'key': 'host disease status', 'unit': None, 'value': 'Dandruff'}]

In [18]:
#Create Container
if not os.path.isfile('04_raw_meta_'+study2+'.csv'):
    df_meta = DataFrame(columns=('Sex',"Body site", "Description"))
    df_meta.index.name = 'No'
else:
    df_meta = pd.read_csv('04_raw_meta_'+study2+'.csv', index_col=0)

#Fetch metadata for given sample
pbar = tqdm_notebook(range(len(df_sample_refine))) #to make progressbar
for i in pbar:
    if not i in df_meta.index:
        with Session(API_BASE) as s:
            s_meta = s.get('samples', df_sample_refine.loc[i, "Sample ID"]).resource
            df_meta.loc[i] = [
                get_metadata(s_meta.sample_metadata, 'sex'),
                get_metadata(s_meta.sample_metadata, 'host body site'),
                get_metadata(s_meta.sample_metadata, 'host disease status')
            ]
        pbar.set_description('processed: %d' % (i))
        sleep(1) #pbar advances automatically when iterated, so no manual update is needed


In [19]:
#Write to container
df_meta.to_csv('04_raw_meta_'+study2+'.csv')

In [20]:
#Merge metadata with the raw sample list
result = pd.concat([df_sample_refine, df_meta], axis=1)
result.to_csv('05_sample_'+study2+'_meta.csv')

#### 2.2.2. Query samples based on its metadata

In [176]:
#Load sample list
df_result_raw = pd.read_csv('05_sample_'+study2+'_meta.csv', index_col = 0)

In [177]:
#Refine the column name to make it easier for filtering 
df_result_raw.columns = df_result_raw.columns.str.replace(' ', '_') 
#df_result_raw

In [178]:
#Delete empty data
df = df_result_raw
df_result_refine = df[df_result_raw != '#N/A ']
df_result_refine['Body_site'] = df_result_refine['Body_site'].str.strip()
df_result_refine['Disease'] = df_result_refine['Disease'].str.strip()
df_result_refine.to_csv('05_sample_'+study2+'_meta_refine.csv')
df_result_refine

| | Sample_ID | Run_ID | Release_version | Sex | Body_site | Disease |
|---|---|---|---|---|---|---|
| 0 | ERS1421303 | ERR1701206 | 3.0 | | Scalp | Dandruff |
| 1 | ERS1421304 | ERR1701207 | 3.0 | | Forehead | Dandruff |
| 2 | ERS1421305 | ERR1701208 | 3.0 | | Scalp | Dandruff |
| 3 | ERS1421306 | ERR1701209 | 3.0 | | Forehead | Dandruff |
| 4 | ERS1421307 | ERR1701210 | 3.0 | | Scalp | Dandruff |
| 5 | ERS1421308 | ERR1701211 | 3.0 | | Forehead | Dandruff |
| 6 | ERS1421309 | ERR1701212 | 3.0 | | Scalp | Dandruff |
| 7 | ERS1421310 | ERR1701213 | 3.0 | | Forehead | Dandruff |
| 8 | ERS1421311 | ERR1701214 | 3.0 | | Scalp | Dandruff |
| 9 | ERS1421312 | ERR1701215 | 3.0 | | Forehead | Dandruff |


In [30]:
#Create a list of values in body site metadata
meta_list = df_result_refine["Body_site"].tolist()
meta_list_refine = list(set(meta_list))
for i in range(len(meta_list_refine)):
    meta_list_refine[i] = meta_list_refine[i].strip()

meta_list2 = df_result_refine["Disease"].tolist()
meta_list2_refine = list(set(meta_list2))
for i in range(len(meta_list2_refine)):
    meta_list2_refine[i] = meta_list2_refine[i].strip()

print(meta_list_refine)
print(meta_list2_refine)

['Scalp', 'Forehead']
['Health', 'Dandruff']


### 2.3. ERP019566: Longitudinal study of the diabetic skin and wo...
#### 2.3.1 Get the list of samples for the given study and their metadata

Get study: https://www.ebi.ac.uk/metagenomics/api/latest/studies/ERP019566  
List samples: https://www.ebi.ac.uk/metagenomics/api/latest/studies/ERP019566/samples  
Fetch samples for the given study accession: https://www.ebi.ac.uk/metagenomics/api/latest/samples?study_accession=ERP019566 

In [58]:
#Selected study
study3 = 'ERP019566'

In [59]:
#Fetch a list of sample data from a given study
filename_sample = '03_sample_'+study3+'_raw.csv'
print(filename_sample)
if not os.path.isfile(filename_sample):
    with open(filename_sample, 'wb') as f:
        c = pycurl.Curl()
        c.setopt(c.URL, 'https://www.ebi.ac.uk/metagenomics/projects/'+study3+'/overview/doExport')
        c.setopt(c.WRITEDATA, f)
        c.perform()
        c.close()

03_sample_ERP019566_raw.csv


In [60]:
#Filter relevant information from the list
df_sample = pd.read_csv(filename_sample)
df_sample_refine = df_sample[["Sample ID","Run ID","Release version"]]

In [75]:
#check available metadata key
with Session(API_BASE) as s:
    s_meta = s.get('samples', df_sample_refine.loc[6, "Sample ID"]).resource
s_meta.sample_metadata


[{'key': 'investigation type', 'unit': None, 'value': 'metagenome'},
 {'key': 'geographic location (longitude)', 'unit': None, 'value': '151.2005'},
 {'key': 'geographic location (country and/or sea,region)',
  'unit': None,
  'value': 'Australia'},
 {'key': 'collection date', 'unit': None, 'value': '2014-01-01/2014-12-31'},
 {'key': 'environment (biome)',
  'unit': None,
  'value': 'human-associated habitat'},
 {'key': 'environment (feature)',
  'unit': None,
  'value': 'foot plantar aspect'},
 {'key': 'environment (material)', 'unit': None, 'value': 'skin'},
 {'key': 'environmental package', 'unit': None, 'value': 'human-skin'},
 {'key': 'sequencing method',
  'unit': None,
  'value': '16S amplicon sequencing Illumina MiSeq'},
 {'key': 'geographic location (latitude)', 'unit': None, 'value': '33.8832'},
 {'key': 'instrument model', 'unit': None, 'value': 'Illumina MiSeq'}]

In [74]:
s_meta.sample_name

'Control patient 1 left foot time 3'

In [79]:
#Create Container
if not os.path.isfile('04_raw_meta_'+study3+'.csv'):
    df_meta = DataFrame(columns=('Sex',"Body site", "Description"))
    df_meta.index.name = 'No'
else:
    df_meta = pd.read_csv('04_raw_meta_'+study3+'.csv', index_col=0)

#Fetch metadata for given sample
pbar = tqdm_notebook(range(len(df_sample_refine))) #to make progressbar
for i in pbar:
    if not i in df_meta.index:
        with Session(API_BASE) as s:
            s_meta = s.get('samples', df_sample_refine.loc[i, "Sample ID"]).resource
            df_meta.loc[i] = [
                get_metadata(s_meta.sample_metadata, 'sex'),
                get_metadata(s_meta.sample_metadata, 'environment (feature)'),
                s_meta.sample_desc
            ]
        pbar.set_description('processed: %d' % (i))
        sleep(1) #pbar advances automatically when iterated, so no manual update is needed


In [83]:
#Write to container
df_meta.to_csv('04_raw_meta_'+study3+'.csv')

In [84]:
#Merge metadata with the raw sample list
result = pd.concat([df_sample_refine, df_meta], axis=1)
result.to_csv('05_sample_'+study3+'_meta.csv')

#### 2.3.2. Query samples based on its metadata

In [210]:
#Load sample list
df_result_raw = pd.read_csv('05_sample_'+study3+'_meta.csv', index_col = 0)

In [211]:
#Refine the column name to make it easier for filtering 
df_result_raw.columns = df_result_raw.columns.str.replace(' ', '_') 
#df_result_raw

In [212]:
#Delete empty data
df = df_result_raw
df_result_refine = df[df_result_raw != '#N/A ']
df_result_refine['Body_site'] = df_result_refine['Body_site'].str.strip()
df_result_refine['Description'] = df_result_refine['Description'].str.strip()
df_result_refine.to_csv('05_sample_'+study3+'_meta_refine.csv')
#df_result_refine

In [89]:
#Create a list of values in body site metadata
meta_list = df_result_refine["Body_site"].tolist()
meta_list_refine = list(set(meta_list))
for i in range(len(meta_list_refine)):
    meta_list_refine[i] = meta_list_refine[i].strip()

meta_list2 = df_result_refine["Description"].tolist()
meta_list2_refine = list(set(meta_list2))
for i in range(len(meta_list2_refine)):
    meta_list2_refine[i] = meta_list2_refine[i].strip()

print(meta_list_refine)
print(meta_list2_refine)

['foot plantar aspect']
['control_skin_right', 'Positive_3', 'Positive_4', 'diabetic_skin_contra', 'No_DNA_4', 'diabetic_skin_adj', 'control_skin_left', 'Blank_3', 'wound_deb', 'wound_swab', 'Blank_4', 'No_DNA3']


## 3. Random sampling from each study to do comparison
From the data above, we can compare some samples according to the body sites:
1. Foot plantar aspect from control samples of study ERP019566 (Australia) vs plantar heel samples from study SRP002480 (USA)
2. Healthy scalp samples from study ERP018577 (Brazil) vs occiput samples from study SRP002480 (USA)

### 3.1 Random sampling of occiput and plantar heel samples from study SRP002480 (male & female)

In [147]:
df = pd.read_csv('05_sample_'+study+'_meta_refine.csv', index_col = 0)       
sex_cat = ['female', 'male']
bs = ['plantar heel', 'occiput']

In [170]:
df_merge = pd.DataFrame(columns=('Sample_ID', 'Run_ID', 'Release_version', 'Sex', 'Body_site'))

for i in range(len(sex_cat)):
    sex_cat_ = sex_cat[i]
    for a in range(len(bs)):
        bs_ = bs[a]
        df_sort = df.query('Sex==@sex_cat_ and Body_site==@bs_')
        dataframe = df_sort
        amount = 5
        df_random_sample = random_sampling(dataframe, amount)
        df_merge = pd.concat([df_merge, df_random_sample], ignore_index=True)      

df_study_id = []

for i in range(len(df_merge)):
    df_study_id.append(study)
df_study_id = pd.DataFrame(df_study_id, columns=["Study_ID"])

df_merge = pd.concat([df_study_id, df_merge], axis=1)
df_merge

| | Study_ID | Sample_ID | Run_ID | Release_version | Sex | Body_site |
|---|---|---|---|---|---|---|
| 0 | SRP002480 | SRS451776 | SRR920257 | 2.0 | female | plantar heel |
| 1 | SRP002480 | SRS451759 | SRR920208 | 2.0 | female | plantar heel |
| 2 | SRP002480 | SRS451470 | SRR919640 | 2.0 | female | plantar heel |
| 3 | SRP002480 | SRS451469 | SRR919639 | 2.0 | female | plantar heel |
| 4 | SRP002480 | SRS451588 | SRR919858 | 2.0 | female | plantar heel |
| 5 | SRP002480 | SRS451593 | SRR919914 | 2.0 | female | occiput |
| 6 | SRP002480 | SRS451777 | SRR920225 | 2.0 | female | occiput |
| 7 | SRP002480 | SRS452030 | SRR920903 | 2.0 | female | occiput |
| 8 | SRP002480 | SRS452122 | SRR921121 | 2.0 | female | occiput |
| 9 | SRP002480 | SRS451904 | SRR920702 | 2.0 | female | occiput |


In [171]:
df_merge.to_csv('06_sampled_biom_'+study+'.csv')

### 3.2 Random sampling of Healthy Scalp samples from study ERP018577 (Brazil)

In [197]:
df = pd.read_csv('05_sample_'+study2+'_meta_refine.csv', index_col = 0)       
bs = 'Scalp'
des = 'Health'

In [200]:
df_merge = pd.DataFrame(columns=('Sample_ID', 'Run_ID', 'Release_version', 'Sex', 'Body_site', 'Disease'))

df_sort = df.query('Body_site==@bs and Disease==@des')
dataframe = df_sort
amount = 5
df_random_sample = random_sampling(dataframe, amount)
df_merge = pd.concat([df_merge, df_random_sample], ignore_index=True)      

df_study_id = []

for i in range(len(df_merge)):
    df_study_id.append(study2)
df_study_id = pd.DataFrame(df_study_id, columns=["Study_ID"])

df_merge = pd.concat([df_study_id, df_merge], axis=1)
df_merge

| | Study_ID | Sample_ID | Run_ID | Release_version | Sex | Body_site | Disease |
|---|---|---|---|---|---|---|---|
| 0 | ERP018577 | ERS1421377 | ERR1701280 | 3.0 | | Scalp | Health |
| 1 | ERP018577 | ERS1421389 | ERR1701292 | 3.0 | | Scalp | Health |
| 2 | ERP018577 | ERS1421383 | ERR1701286 | 3.0 | | Scalp | Health |
| 3 | ERP018577 | ERS1421335 | ERR1701238 | 3.0 | | Scalp | Health |
| 4 | ERP018577 | ERS1421381 | ERR1701284 | 3.0 | | Scalp | Health |


In [201]:
df_merge.to_csv('06_sampled_biom_'+study2+'.csv')

### 3.3 Random sampling of Foot plantar aspect from control samples of study ERP019566 (Australia)

In [213]:
df = pd.read_csv('05_sample_'+study3+'_meta_refine.csv', index_col = 0)       
bs = 'foot plantar aspect'
des = ['control_skin_right', 'control_skin_left']
df.loc[0, "Body_site"]

'foot plantar aspect'

In [220]:
df_merge = pd.DataFrame(columns=('Sample_ID', 'Run_ID', 'Release_version', 'Sex', 'Body_site', 'Description'))

for i in range(len(des)):
    des_ = des[i]
    df_sort = df.query('Body_site==@bs and Description==@des_')
    dataframe = df_sort
    amount = 5
    df_random_sample = random_sampling(dataframe, amount)
    df_merge = pd.concat([df_merge, df_random_sample], ignore_index=True)      

df_study_id = []

for i in range(len(df_merge)):
    df_study_id.append(study3)
df_study_id = pd.DataFrame(df_study_id, columns=["Study_ID"])

df_merge = pd.concat([df_study_id, df_merge], axis=1)
df_merge

| | Study_ID | Sample_ID | Run_ID | Release_version | Sex | Body_site | Description |
|---|---|---|---|---|---|---|---|
| 0 | ERP019566 | ERS1474791 | ERR1760038 | 4.0 | | foot plantar aspect | control_skin_right |
| 1 | ERP019566 | ERS1474780 | ERR1760027 | 4.0 | | foot plantar aspect | control_skin_right |
| 2 | ERP019566 | ERS1474511 | ERR1759893 | 4.0 | | foot plantar aspect | control_skin_right |
| 3 | ERP019566 | ERS1474797 | ERR1760044 | 4.0 | | foot plantar aspect | control_skin_right |
| 4 | ERP019566 | ERS1474793 | ERR1760040 | 4.0 | | foot plantar aspect | control_skin_right |
| 5 | ERP019566 | ERS1474566 | ERR1759918 | 4.0 | | foot plantar aspect | control_skin_left |
| 6 | ERP019566 | ERS1474502 | ERR1759884 | 4.0 | | foot plantar aspect | control_skin_left |
| 7 | ERP019566 | ERS1474571 | ERR1759923 | 4.0 | | foot plantar aspect | control_skin_left |
| 8 | ERP019566 | ERS1474573 | ERR1759925 | 4.0 | | foot plantar aspect | control_skin_left |
| 9 | ERP019566 | ERS1474796 | ERR1760043 | 4.0 | | foot plantar aspect | control_skin_left |


In [221]:
df_merge.to_csv('06_sampled_biom_'+study3+'.csv')

## 4. Get analysis result of the given sample in the studies
___
Example download link for the analysis result (OTU table, HDF5 BIOM) of a single run: https://www.ebi.ac.uk/metagenomics/projects/SRP002480/samples/SRS451457/runs/SRR919567/results/versions/2.0/taxonomy/OTU-table-HDF5-BIOM

In [5]:
def download_biom(df_BIOM, extension, study):
    #Create output folder (os.path.join keeps the path portable across operating systems)
    cwd = os.getcwd() #get current working directory
    new_dir = os.path.join(cwd, "output_"+study) #output folder for the given study
    if not os.path.isdir(new_dir):
        os.mkdir(new_dir)

    #Fetch data from EBI
    for i in tqdm_notebook(df_BIOM.index):
        #write directly into the output folder instead of changing the working directory
        filename = os.path.join(new_dir, df_BIOM.loc[i, "Sample_ID"]+'.biom')
        if not os.path.isfile(filename):
            link = get_analysis_result(df_BIOM.loc[i, "Run_ID"], extension)
            with open(filename, 'wb') as f:
                c = pycurl.Curl()
                c.setopt(c.URL, link)
                c.setopt(c.WRITEDATA, f)
                c.perform()
                c.close()
    return

In [14]:
study = 'SRP002480'
extension = 'JSON Biom'
df_BIOM = pd.read_csv('06_sampled_biom_'+study+'.csv')
download_biom(df_BIOM, extension, study)


In [15]:
study = 'ERP018577'
extension = 'JSON Biom'
df_BIOM = pd.read_csv('06_sampled_biom_'+study+'.csv')
download_biom(df_BIOM, extension, study)


In [16]:
study = 'ERP019566'
extension = 'JSON Biom'
df_BIOM = pd.read_csv('06_sampled_biom_'+study+'.csv')
download_biom(df_BIOM, extension, study)

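
Once downloaded, a JSON BIOM file can be inspected with the standard `json` module. The sketch below assumes the files follow the BIOM 1.0 sparse layout (`rows`, `columns`, and `data` as `[row, column, value]` triples) and uses a tiny hand-written table rather than a real download; the observation IDs and counts are made up.

```python
import json

# Hand-written stand-in for the text of a downloaded .biom file (assumed BIOM 1.0, sparse).
biom_text = json.dumps({
    'matrix_type': 'sparse',
    'shape': [3, 1],
    'rows': [{'id': 'OTU_1', 'metadata': None},
             {'id': 'OTU_2', 'metadata': None},
             {'id': 'OTU_3', 'metadata': None}],
    'columns': [{'id': 'SRS451776', 'metadata': None}],
    'data': [[0, 0, 12], [2, 0, 5]],
})

table = json.loads(biom_text)
row_ids = [row['id'] for row in table['rows']]

# Sum counts per observation (row) from the sparse triples.
totals = {row_id: 0 for row_id in row_ids}
for row_index, _col_index, value in table['data']:
    totals[row_ids[row_index]] += value
print(totals)
```

For a real file, `biom_text` would be replaced by the contents read from disk. Whether a given download is sparse or dense should be checked through its `matrix_type` field before applying this loop.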

## 5. Create accompanying metadata for downloaded analysis
___

In [17]:
def download_metadata(df):
    df_meta_random = DataFrame(columns=('#SampleID','BarcodeSequence','LinkerPrimerSequence', 'StudyID', 'RunID', 'Sex', 'BodySite', 'Description'))
    pbar = tqdm_notebook(range(len(df))) #to make progressbar
    for i in pbar:
        df_meta_random.loc[i] = [df.loc[i, 'Sample_ID'], \
                                 '_', \
                                 '_', \
                                 df.loc[i, 'Study_ID'], \
                                 df.loc[i, 'Run_ID'], \
                                 df.loc[i, 'Sex'], \
                                 df.loc[i, 'Body_site'], \
                                 df.loc[i, 'Description']
                                ]
    return df_meta_random
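
The function above fills the mapping table one row at a time. As a hedged alternative, the same layout can be produced column-wise by renaming the existing columns and adding placeholders. `build_mapping` is a hypothetical name, and filling columns that are absent from the input with `'_'` is an assumption of this sketch rather than the notebook's behaviour.

```python
import pandas as pd

MAPPING_COLUMNS = ['#SampleID', 'BarcodeSequence', 'LinkerPrimerSequence',
                   'StudyID', 'RunID', 'Sex', 'BodySite', 'Description']

def build_mapping(df):
    """Hypothetical column-wise variant of download_metadata."""
    mapping = df.rename(columns={'Sample_ID': '#SampleID', 'Study_ID': 'StudyID',
                                 'Run_ID': 'RunID', 'Body_site': 'BodySite'})
    # Placeholder columns, plus any column the input lacks, are filled with '_'.
    for column in MAPPING_COLUMNS:
        if column not in mapping.columns:
            mapping[column] = '_'
    return mapping[MAPPING_COLUMNS]

# One row copied from the sampled table of study SRP002480.
toy = pd.DataFrame({
    'Study_ID': ['SRP002480'],
    'Sample_ID': ['SRS451776'],
    'Run_ID': ['SRR920257'],
    'Sex': ['female'],
    'Body_site': ['plantar heel'],
})
mapping = build_mapping(toy)
print(mapping)
```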

In [18]:
study = 'SRP002480'
df_biom = pd.read_csv('06_sampled_biom_'+study+'.csv')
df_meta_random1 = download_metadata(df_biom)
df_meta_random1.to_csv('07_metadata_'+study+'.txt', sep="\t", index = False)


In [19]:
study = 'ERP018577'
df_biom = pd.read_csv('06_sampled_biom_'+study+'.csv')
df_meta_random2 = download_metadata(df_biom)
df_meta_random2.to_csv('07_metadata_'+study+'.txt', sep="\t", index = False)


In [22]:
study= 'ERP019566'
df_biom = pd.read_csv('06_sampled_biom_'+study+'.csv')
df_meta_random3 = download_metadata(df_biom)
df_meta_random3.to_csv('07_metadata_'+study+'.txt', sep="\t", index = False)


In [23]:
df_meta_random_merged = pd.concat([df_meta_random1, df_meta_random2, df_meta_random3])
df_meta_random_merged = df_meta_random_merged.fillna('_')
df_meta_random_merged.to_csv('07_metadata_merged.txt', sep="\t", index = False)
df_meta_random_merged.to_csv('07_metadata_merged.csv')
df_meta_random_merged

| | #SampleID | BarcodeSequence | LinkerPrimerSequence | StudyID | RunID | Sex | BodySite | Description |
|---|---|---|---|---|---|---|---|---|
| 0 | SRS451776 | _ | _ | SRP002480 | SRR920257 | female | plantar heel | _ |
| 1 | SRS451759 | _ | _ | SRP002480 | SRR920208 | female | plantar heel | _ |
| 2 | SRS451470 | _ | _ | SRP002480 | SRR919640 | female | plantar heel | _ |
| 3 | SRS451469 | _ | _ | SRP002480 | SRR919639 | female | plantar heel | _ |
| 4 | SRS451588 | _ | _ | SRP002480 | SRR919858 | female | plantar heel | _ |
| 5 | SRS451593 | _ | _ | SRP002480 | SRR919914 | female | occiput | _ |
| 6 | SRS451777 | _ | _ | SRP002480 | SRR920225 | female | occiput | _ |
| 7 | SRS452030 | _ | _ | SRP002480 | SRR920903 | female | occiput | _ |
| 8 | SRS452122 | _ | _ | SRP002480 | SRR921121 | female | occiput | _ |
| 9 | SRS451904 | _ | _ | SRP002480 | SRR920702 | female | occiput | _ |


### References
---
<a id='ref1'></a>
1. Alex L Mitchell, Maxim Scheremetjew, Hubert Denise, Simon Potter, Aleksandra Tarkowska, Matloob Qureshi, Gustavo A Salazar, Sebastien Pesseat, Miguel A Boland, Fiona M I Hunter, Petra ten Hoopen, Blaise Alako, Clara Amid, Darren J Wilkinson, Thomas P Curtis, Guy Cochrane, Robert D Finn; EBI Metagenomics in 2017: enriching the analysis of microbial communities, from sequence reads to assemblies, Nucleic Acids Research, Volume 46, Issue D1, 4 January 2018, Pages D726–D735, https://doi.org/10.1093/nar/gkx967