# EBI Scraping
In this notebook, we download metadata and the results of functional analyses from several studies related to the human skin microbiome.
The notebook uses the [MGnify API toolkit](https://github.com/EBI-Metagenomics/emg-toolkit).

## Table of contents
1. [Study dataset](#introduction)
2. [Example of downloading one study set](#example1)
3. [Example of downloading multiple study sets](#example2)

In [1]:
import pandas as pd
from mg_toolkit.metadata import OriginalMetadata
from mg_toolkit.bulk_download import BulkDownloader
import os, sys
from requests import get
from pathlib import Path

In [2]:
def fetch_metadata(study_accession):
    '''
    Retrieve the original metadata of a given study accession and save it as a csv file.
    '''
    outdir = f'../tables/{study_accession}'
    outfile = f'{outdir}/{study_accession}_sample.csv'
    if os.path.exists(outfile):
        # metadata already downloaded, nothing to do
        return
    # create the study folder (and ../tables itself) if it is missing
    os.makedirs(outdir, exist_ok=True)
    try:
        metadata = pd.DataFrame(OriginalMetadata(study_accession).fetch_metadata()).T
        metadata.to_csv(outfile)
    except ConnectionError:
        print("Unexpected error:", sys.exc_info()[0])
    return


In [3]:
def stats_downloader(study_accession, pipeline, result_group='statistics'):
    '''
    Retrieve the list of downloadable items for a given study and save it as a tsv file.
    '''
    outdir = Path('../tables')
    outfile = outdir / study_accession / f'v{pipeline}_{study_accession}_metadata.tsv'
    if os.path.exists(outfile):
        pass
    else:
        try:
            os.mkdir(outdir / study_accession)
        except FileExistsError:
            pass
        data = BulkDownloader(study_accession, outdir, pipeline, result_group)
        data.run()
        try:
            os.rename(outdir / study_accession / f'{study_accession}_metadata.tsv', outfile)
        except FileNotFoundError:
            pass
    return 

In [4]:
def read_statistics(study_accession, pipeline):
    '''
    Helper function to read a large tsv file into a pandas dataframe, chunk by chunk.
    '''
    indir = Path('../tables')
    infile = indir / study_accession / f'v{pipeline}_{study_accession}_metadata.tsv'

    # https://stackoverflow.com/questions/41303246/error-tokenizing-data-c-error-out-of-memory-pandas-python-large-file-csv/41303449
    try:
        mylist = []

        for chunk in pd.read_csv(infile, sep='\t', chunksize=20000):
            mylist.append(chunk)

        big_data = pd.concat(mylist, axis= 0)
        del mylist
    except FileNotFoundError:
        big_data = "Pipeline version " + pipeline + ' is not available' #+ str(sys.exc_info()[0])
    return big_data
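
The chunked read used in `read_statistics` can be tried on a small in-memory table. The following is a minimal sketch, assuming a toy tab-separated string (`toy_tsv`, illustrative values only) in place of the real metadata file and a chunk size of 2 instead of 20000.

```python
import io
import pandas as pd

# Toy tab-separated table standing in for a metadata tsv file (illustrative values).
toy_tsv = "analysis_id\tdescription\n" + "".join(
    f"TOY{n:03d}\tGO slim annotation\n" for n in range(5)
)

# Read two rows at a time; each chunk is an ordinary DataFrame.
chunks = list(pd.read_csv(io.StringIO(toy_tsv), sep="\t", chunksize=2))

# Concatenate the chunks back into a single table.
toy_table = pd.concat(chunks, axis=0)

print(len(chunks))      # number of chunks that were read
print(toy_table.shape)  # shape of the reassembled table
```

Reading in chunks limits how much text the parser handles at once, but the concatenated result still has to fit in memory.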

In [5]:
def item_downloader(accession, df):
    '''
    Download items listed in the dataframe from the stats_downloader function. It is recommended to filter the dataframe for files of interest. 
    '''
    for i in df.index:
        outdir = Path('../data') / accession / str(df.loc[i, 'pipeline_version']) / str(df.loc[i, 'group_type']).replace(' ', '_').lower()
        outfile = outdir / df.loc[i, 'name']
        #print(outfile)
        if os.path.exists(outfile):
            pass
        else:
            try:
                os.makedirs(outdir)
            except FileExistsError:
                pass
            url = df.loc[i, 'download_url']
            # follow redirects and write the response body to disk
            r = get(url, allow_redirects=True)
            with open(outfile, 'wb') as handle:
                handle.write(r.content)
            print(outfile)
    return
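
The folder layout that `item_downloader` produces can be checked without touching the network. This is a minimal sketch, assuming a hypothetical helper `build_output_path` (not part of the notebook or of the toolkit) that repeats the same path logic for a single row.

```python
from pathlib import Path


def build_output_path(accession, pipeline_version, group_type, name, root="../data"):
    """Hypothetical helper mirroring the path logic of item_downloader."""
    # 'Functional analysis' becomes 'functional_analysis'
    group_dir = str(group_type).replace(" ", "_").lower()
    return Path(root) / accession / str(pipeline_version) / group_dir / name


example = build_output_path(
    "PRJEB26427", 4.1, "Functional analysis", "ERR2538349_FASTQ_GO_slim.csv"
)
print(example.as_posix())
```

Files therefore end up grouped by study, then by pipeline version, then by result group.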

## Study dataset <a name="introduction"></a>
The list of studies was downloaded as a csv file from https://www.ebi.ac.uk/metagenomics/search, using host-associated:human and skin as keywords. The csv file is available at [../tables/search_download.csv](../tables/search_download.csv).

Here, the dataset is filtered to exclude third party annotation (TPA) studies and datasets with known access problems (to be defined).
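
The two exclusion steps can be sketched on a toy table with boolean masks. Everything in this sketch is an assumption made for illustration: the accessions, the descriptions, and the `description` column name (the notebook cell itself checks the `creation_date` column of the downloaded file).

```python
import pandas as pd

# Toy stand-in for the search result table; all values are illustrative only.
toy = pd.DataFrame({
    "ENA_PROJECT": ["PRJ_A", "PRJ_B", "PRJ_C", "PRJ_D"],
    "description": [
        "Skin metagenomes",
        "TPA assembly of skin data",
        "Forehead samples",
        "Hand swabs",
    ],
})

excluded = ["PRJ_D"]  # hypothetical accession with access problems

# Keep rows that do not mention TPA and are not in the exclusion list.
is_tpa = toy["description"].str.contains("TPA", na=False)
is_excluded = toy["ENA_PROJECT"].isin(excluded)
filtered = toy[~is_tpa & ~is_excluded]

print(filtered["ENA_PROJECT"].tolist())
```

A boolean mask stays aligned with the index of the table, so it keeps working after earlier rows have been dropped.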

In [6]:
df_studies = pd.read_csv('../tables/search_download.csv')

# filter out third party annotation (TPA) studies
mask = [num for num, i in enumerate(df_studies.creation_date) if 'TPA' not in i]
df_studies = df_studies.loc[mask, :]

# exclude hard-to-access studies
df_studies = df_studies[~df_studies.loc[:, 'ENA_PROJECT'].isin(['PRJNA46333', 'PRJNA554499', 'PRJNA395539', 'PRJEB16723'])]
df_studies

Unnamed: 0,ENA_PROJECT,METAGENOMICS_ANALYSES,METAGENOMICS_SAMPLES,biome_name,centre_name,creation_date,description,domain_source,id,last_modification_date,name,releaseDate_date
0,PRJEB26427,MGYA00381378,ERS2431659,Skin,P&G Singapore Innovation Center,Metagenomics samples of multiple skin sites (u...,metagenomics_projects,MGYS00005102,Understanding the microbial basis of body odor...,,,
1,PRJNA314604,MGYA00497609,SRS1333647,Skin,NYU Langone Medical Center,To characterize the diversity of cutaneous mic...,metagenomics_projects,MGYS00005212,Body site is a more determinant factor than hu...,,,
2,PRJNA281366,MGYA00381322,SRS927195,Skin,University of Trento,Skin shotgun metagenomes from psoriasis patients,metagenomics_projects,MGYS00005101,Skin metagenomes,,,
4,PRJNA277905,MGYA00376956,SRS892108,Skin,Genome Institute of Singapore,Human Skin Microbiome,metagenomics_projects,MGYS00005037,Human Skin Microbiome Metagenome,,,
6,PRJEB10133,MGYA00010238,ERS805761,Skin,GSTT BRC Bioinformatics,These samples are selections from a larger coh...,metagenomics_projects,MGYS00000518,These samples are selections from a larger coh...,,,
7,PRJEB10295,MGYA00010281,ERS809858,Skin,LEIDEN UNIVERSITY MEDICAL CENTER,Whole genome sequencing of metagenomes extract...,metagenomics_projects,MGYS00000520,Whole genome sequencing of metagenomes extract...,,,
10,PRJEB14627,MGYA00556025,ERS1225501,Skin,UNIVERSITY OF HELSINKI,The skin protects from outer threats and this ...,metagenomics_projects,MGYS00005533,Patterns in the skin microbiota differ in chil...,,,
14,PRJEB5728,MGYA00415113,ERS414654,Skin,CCME-COLORADO,Forehead skin samples from Flores_SMP for subm...,metagenomics_projects,MGYS00005156,Flores_forehead_EBI,,,
16,PRJEB5758,MGYA00440312,ERS418163,Skin,CCME-COLORADO,Skin samples from human 3D metabolic map,metagenomics_projects,MGYS00005232,Dorrestein_3D_metabolic_map,,,
18,PRJNA269787,MGYA00421513,SRS786731,Skin,Fudan University,The variability in skin microbial communities ...,metagenomics_projects,MGYS00005238,Human skin microbiota Metagenome,,,


## Example of downloading one study set <a name="example1"></a>
Below is an example of downloading one study set.

In [7]:
study_accession = 'PRJEB26427'
pipeline_version = '4.1'
fetch_metadata(study_accession)  
stats_downloader(study_accession, pipeline_version)
PRJEB26427_downloadables = read_statistics(study_accession, pipeline_version)

In [8]:
pd.read_csv('../tables/PRJEB26427/PRJEB26427_sample.csv', index_col=0)

Unnamed: 0,collection date,environment (biome),environment (feature),environment (material),geographic location (country and/or sea),geographic location (latitude),geographic location (longitude),human skin environmental package,investigation type,project name,sequencing method,ENA-CHECKLIST,ENA-SPOT-COUNT,ENA-BASE-COUNT,ENA-FIRST-PUBLIC,ENA-LAST-UPDATE,Sample,Read depth
ERR2538349,2015-08,Human-associated habitat,Malodor aspect,skin,Philippines,14.5995DD,120.9842DD,human-skin,metagenome,Understanding the microbial basis of body odor...,Illumina HiSeq,ERC000017,31417827,3136503308,2018-05-07,2018-04-26,ERS2431609,
ERR2538350,2015-08,Human-associated habitat,Malodor aspect,skin,Philippines,14.5995DD,120.9842DD,human-skin,metagenome,Understanding the microbial basis of body odor...,Illumina HiSeq,ERC000017,27827945,2778965519,2018-05-07,2018-04-26,ERS2431610,
ERR2538351,2015-08,Human-associated habitat,Malodor aspect,skin,Philippines,14.5995DD,120.9842DD,human-skin,metagenome,Understanding the microbial basis of body odor...,Illumina HiSeq,ERC000017,38533917,3848100187,2018-05-07,2018-04-26,ERS2431611,
ERR2538352,2015-08,Human-associated habitat,Malodor aspect,skin,Philippines,14.5995DD,120.9842DD,human-skin,metagenome,Understanding the microbial basis of body odor...,Illumina HiSeq,ERC000017,35959808,3590778644,2018-05-07,2018-04-26,ERS2431612,
ERR2538353,2015-08,Human-associated habitat,Malodor aspect,skin,Philippines,14.5995DD,120.9842DD,human-skin,metagenome,Understanding the microbial basis of body odor...,Illumina HiSeq,ERC000017,26637196,2660198624,2018-05-07,2018-04-26,ERS2431613,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ERR2538523,2015-08,Human-associated habitat,Malodor aspect,skin,Philippines,14.5995DD,120.9842DD,human-skin,metagenome,Understanding the microbial basis of body odor...,Illumina HiSeq,ERC000017,27391255,2732856825,2018-05-07,2018-04-26,ERS2431783,
ERR2538524,2015-08,Human-associated habitat,Malodor aspect,skin,Philippines,14.5995DD,120.9842DD,human-skin,metagenome,Understanding the microbial basis of body odor...,Illumina HiSeq,ERC000017,11858320,1181401371,2018-05-07,2018-04-26,ERS2431784,
ERR2538525,2015-08,Human-associated habitat,Malodor aspect,skin,Philippines,14.5995DD,120.9842DD,human-skin,metagenome,Understanding the microbial basis of body odor...,Illumina HiSeq,ERC000017,12816015,1277202640,2018-05-07,2018-04-26,ERS2431785,
ERR2538526,2015-08,Human-associated habitat,Malodor aspect,skin,Philippines,14.5995DD,120.9842DD,human-skin,metagenome,Understanding the microbial basis of body odor...,Illumina HiSeq,ERC000017,14769354,1473118680,2018-05-07,2018-04-26,ERS2431786,


In [9]:
# the emg-toolkit BulkDownloader can only download a whole result group of a given study at once
#stats_downloader(study_accession, '4.1', 'functional_analysis') # FASTQ_InterPro.tsv.gz is approx. 500 MB

In [10]:
# filter for GO slim annotation
PRJEB26427_downloadables = read_statistics(study_accession, pipeline_version)
PRJEB26427_downloadables[PRJEB26427_downloadables.description == 'GO slim annotation']

Unnamed: 0,analysis_id,name,group_type,description,download_url,pipeline_version,experiment_type
16,MGYA00381356,ERR2538349_FASTQ_GO_slim.csv,Functional analysis,GO slim annotation,https://www.ebi.ac.uk/metagenomics/api/v1/anal...,4.1,metagenomic
56,MGYA00381357,ERR2538350_FASTQ_GO_slim.csv,Functional analysis,GO slim annotation,https://www.ebi.ac.uk/metagenomics/api/v1/anal...,4.1,metagenomic
98,MGYA00381358,ERR2538351_FASTQ_GO_slim.csv,Functional analysis,GO slim annotation,https://www.ebi.ac.uk/metagenomics/api/v1/anal...,4.1,metagenomic
140,MGYA00381359,ERR2538352_FASTQ_GO_slim.csv,Functional analysis,GO slim annotation,https://www.ebi.ac.uk/metagenomics/api/v1/anal...,4.1,metagenomic
181,MGYA00381360,ERR2538353_FASTQ_GO_slim.csv,Functional analysis,GO slim annotation,https://www.ebi.ac.uk/metagenomics/api/v1/anal...,4.1,metagenomic
...,...,...,...,...,...,...,...
6682,MGYA00381530,ERR2538523_FASTQ_GO_slim.csv,Functional analysis,GO slim annotation,https://www.ebi.ac.uk/metagenomics/api/v1/anal...,4.1,metagenomic
6715,MGYA00381531,ERR2538524_FASTQ_GO_slim.csv,Functional analysis,GO slim annotation,https://www.ebi.ac.uk/metagenomics/api/v1/anal...,4.1,metagenomic
6747,MGYA00381532,ERR2538525_FASTQ_GO_slim.csv,Functional analysis,GO slim annotation,https://www.ebi.ac.uk/metagenomics/api/v1/anal...,4.1,metagenomic
6782,MGYA00381533,ERR2538526_FASTQ_GO_slim.csv,Functional analysis,GO slim annotation,https://www.ebi.ac.uk/metagenomics/api/v1/anal...,4.1,metagenomic


In [11]:
# download the filtered items
PRJEB26427_goslim = PRJEB26427_downloadables[PRJEB26427_downloadables.description == 'GO slim annotation']
item_downloader(study_accession, PRJEB26427_goslim)

## Example of downloading multiple study sets <a name="example2"></a>
Below is an example of downloading the metadata and the list of downloadable files for the studies with index labels up to 10 in `df_studies` (seven studies after filtering).
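
The selection in the next cell uses `.loc[:10]`, which slices by index label and includes the end label. After filtering, the index has gaps, so this is not the same as taking a fixed number of rows. A minimal sketch, assuming a toy series with placeholder values and an index with similar gaps:

```python
import pandas as pd

# Toy series whose index has gaps, as happens after rows are filtered out.
accessions = pd.Series(
    ["A", "B", "C", "D", "E", "F", "G", "H"],
    index=[0, 1, 2, 4, 6, 7, 10, 14],
)

by_label = accessions.loc[:10]     # every label up to and including 10
by_position = accessions.iloc[:5]  # the first five rows, whatever their labels

print(len(by_label), len(by_position))
```

To take a fixed number of studies regardless of the labels, `.iloc[:n]` or `.head(n)` is the positional alternative.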

In [12]:
study_accession = df_studies.loc[:10, 'ENA_PROJECT']
pipeline_version = '4.1'
for i in study_accession:
    print(i)
    fetch_metadata(i)  
    stats_downloader(i, pipeline_version, 'statistics')
    #_downloadables = read_statistics(i)

PRJEB26427
PRJNA314604
PRJNA281366
PRJNA277905
PRJEB10133


0it [00:00, ?it/s]
Study Id: PRJEB10133
Pipeline version: 4.1



 Download complete!
PRJEB10295


0it [00:00, ?it/s]
Study Id: PRJEB10295
Pipeline version: 4.1



 Download complete!
PRJEB14627


In [13]:
#dict_ = {}
for i in study_accession:
    _downloadables = read_statistics(i, pipeline_version)
    try:
        _downloadables_goslim = _downloadables[_downloadables.description == 'GO slim annotation']
        if _downloadables_goslim.experiment_type.isin(['metagenomic']).any():
            print(i, ': all downloaded')
            #dict_[i] = _downloadables_goslim
            item_downloader(i, _downloadables_goslim)
        else:
            print('Error', i, 'available analysis types :', _downloadables.experiment_type.unique())
    except (AttributeError, TypeError):  # read_statistics returned a message string, not a dataframe
        print('Error', i, _downloadables)

PRJEB26427 : all downloaded
Error PRJNA314604 available analysis types : ['amplicon']
PRJNA281366 : all downloaded
PRJNA277905 : all downloaded
Error PRJEB10133 Pipeline version 4.1 is not available
Error PRJEB10295 Pipeline version 4.1 is not available
Error PRJEB14627 available analysis types : ['amplicon']
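
The loop above relies on exceptions to detect that `read_statistics` returned a message string rather than a dataframe. An explicit type check is one possible alternative. This is a minimal sketch, assuming a hypothetical helper `select_go_slim` that is not part of the notebook, shown on a toy table:

```python
import pandas as pd


def select_go_slim(table):
    """Hypothetical helper: return metagenomic GO slim rows, or None if unavailable."""
    if not isinstance(table, pd.DataFrame):
        # read_statistics returned its "not available" message
        return None
    rows = table[table["description"] == "GO slim annotation"]
    if not rows["experiment_type"].isin(["metagenomic"]).any():
        # no metagenomic analyses, for example amplicon-only studies
        return None
    return rows


# Toy table with illustrative values only.
toy = pd.DataFrame({
    "description": ["GO slim annotation", "Other annotation"],
    "experiment_type": ["metagenomic", "metagenomic"],
})

print(select_go_slim(toy).shape)
print(select_go_slim("Pipeline version 4.1 is not available"))
```

A caller can then skip a study whenever the helper returns `None`, without a `try`/`except` block.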
