<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Find-data-from-each-study" data-toc-modified-id="Find-data-from-each-study-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Find data from each study</a></span></li><li><span><a href="#Access-Literature_Sources-in-Google-Drive" data-toc-modified-id="Access-Literature_Sources-in-Google-Drive-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Access Literature_Sources in Google Drive</a></span></li><li><span><a href="#Download-data-from-those-study-ids" data-toc-modified-id="Download-data-from-those-study-ids-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Download data from those study ids</a></span></li></ul></div>

# Find data from each study

First, I searched for each `Study ID` in [Qiita](https://qiita.ucsd.edu/study/list/) and clicked on the study title, e.g., Study ID [10143](https://qiita.ucsd.edu/study/description/10143).

Initially, I downloaded all the QIIME maps and BIOMs from each study and noticed a problem: depending on the study, many analyses may have been run (e.g., with different trimming lengths), and each analysis produces many files, including a BIOM file. It was not clear to me which BIOM file was the "final" one, meaning the BIOM file used for the analysis in each paper.

My solution was to look at the provenance of each Study ID and select the final BIOM file from an analysis in which the reads were trimmed to 90 nt. I chose 90 nt because this length was common to the 6 studies I am currently working with.

**How to get the final BIOM?**  

**1.** On the Study ID page, click on 16S under Data Types. The prep information will appear; in this case, there are two preps. Click on one of the preps, and the provenance graph will appear on the right side.
![pic1](images/picture1.png)


**2.** In the provenance graph, click on a Trimming job (green circle). The job parameters will appear underneath the graph. Make sure the length is 90. In addition, follow the line and arrow to the next gray triangle, which represents the resulting trimmed sequences. Confirm the mean length in the feature summary.
![pic2](images/picture2.png)
![pic3](images/picture3.png)
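
Once a table has been downloaded, the trim length can also be double-checked offline. The sketch below assumes that the feature IDs of a Deblur table are the nucleotide sequences themselves; `check_trim_length` and the toy IDs are hypothetical, and reading the IDs out of a real BIOM file is left out.

```python
from collections import Counter


def check_trim_length(feature_ids, expected=90):
    """Return a Counter of sequence lengths and whether all match `expected`.

    Assumes each feature ID is the nucleotide sequence itself.
    """
    lengths = Counter(len(seq) for seq in feature_ids)
    all_match = set(lengths) == {expected}
    return lengths, all_match


# Toy feature IDs standing in for real ones (hypothetical data):
# two sequences of 90 nt and one of 80 nt
toy_ids = ["ACGT" * 22 + "AC", "TTGA" * 22 + "TT", "ACGT" * 20]
lengths, ok = check_trim_length(toy_ids)
print(lengths, ok)
```

A table from a 90 nt trimming job should report a single length, 90; any other length suggests the wrong artifact was picked.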


**3.** Then, follow the arrow to the next green circle, which should be Deblur. Ignore the black-filled triangles. Look for the deblur reference hit table. According to the Qiita [documentation](https://qiita.ucsd.edu/static/doc/html/processingdata/index.html#deblurring), the deblur reference hit table contains only 16S deblurred sequences, whereas the deblur final table contains all the sequences. Write the ID number in the literature sources spreadsheet (column `ID Final table`).
![pic4](images/picture4.png)

**Repeat** this process for each prep in each Study ID.

# Access Literature_Sources in Google Drive

1. Access the spreadsheet
2. Get all Qiita Study ids
3. Create directories and download biom and sample information data

In [1]:
import os
import wget
import glob
import subprocess
import pandas as pd
from natsort import natsorted

In [2]:
# --------------------------------------------------
def go_into_dir(s):
    """Make dir if it does not exist, and change into it"""

    os.makedirs(s, exist_ok=True)
    os.chdir(s)


# --------------------------------------------------
def download_from_qiita(s):
    """Download i) biom + mapping files, and
        ii) sample information from Qiita"""

    biom_file = f"biom_{s}.zip"

    sample_info = f"sample_information_{s}.zip"

    # public data via a single endpoint
    endpoint = "https://qiita.ucsd.edu/public_download/?data="

    to_download = [(biom_file, 'biom&study_id='),
                   (sample_info, 'sample_information&study_id=')]

    for out_file, query in to_download:
        print(f'Downloading {out_file}...')
        # Pass an argument list rather than a shell string to avoid quoting issues
        subprocess.call(['wget', '-O', out_file, f'{endpoint}{query}{s}'])
        print('Done')
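
The download URLs in the cell above are assembled by string concatenation. As a minimal sketch, the same URLs can be built with the standard library's `urlencode`, which keeps the query parameters explicit. The endpoint and the parameter names `data` and `study_id` come from the cell above; `build_download_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Endpoint taken from the cell above
QIITA_PUBLIC_DOWNLOAD = "https://qiita.ucsd.edu/public_download/"


def build_download_url(study_id, data_type):
    """Build the public download URL for one study and one data type."""
    query = urlencode({"data": data_type, "study_id": study_id})
    return f"{QIITA_PUBLIC_DOWNLOAD}?{query}"


print(build_download_url("10143", "biom"))
print(build_download_url("10143", "sample_information"))
```

This only constructs the URL; the download itself still goes through `wget` as before.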

In [3]:
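# Remove copies from previous runs first: wget.download() may save a
# renamed copy instead of overwriting an existing file
for s in glob.glob("Literature_Sources-Sheet1*"):
    os.remove(s)

# source_file.txt holds the URL of the spreadsheet
with open('source_file.txt') as fh:
    lit_source = fh.read().rstrip()

filename = "Literature_Sources-Sheet1.csv"
data_sources = wget.download(lit_source, out=filename)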

# Download data from those study ids

In [4]:
df = pd.read_csv(filename)
# Keep rows that have a study ID and a recorded final-table ID
df = df[df['StudyID'].notna() & (df['ID Final table'] != '-')].copy()
df['StudyID'] = df['StudyID'].astype(int).astype(str)

# Unique study IDs in natural (numeric) order
study_ids = natsorted(set(df['StudyID']))

study_ids

['714', '1609', '10141', '10143', '10321', '13450']
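
The cell above relies on each `StudyID` cell holding a single number, since `astype(int)` would fail on a cell such as `"10141, 10143"`. If the spreadsheet ever lists several IDs in one cell, a more tolerant parser could look like the sketch below. It uses only the standard library; `parse_study_ids` and the sample values are hypothetical.

```python
import re


def parse_study_ids(cells):
    """Collect unique study IDs from cells that may hold one or more IDs."""
    ids = set()
    for cell in cells:
        # Accept ints, floats such as 714.0, and strings such as "10141, 10143"
        for token in re.split(r"[,\s]+", str(cell).strip()):
            # Skip empty tokens, the "-" placeholder, and missing values
            if not token or token in {"-", "nan"}:
                continue
            ids.add(str(int(float(token))))
    # Sort numerically rather than alphabetically
    return sorted(ids, key=int)


print(parse_study_ids([714.0, "10141, 10143", "1609", "10143"]))
```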

In [5]:
for study in study_ids:
    go_into_dir(study)
    download_from_qiita(study)
    os.chdir("../")
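
The loop above changes into each study directory and back with `os.chdir("../")`. If a download raised an exception, the notebook would stay inside the study directory. A minimal sketch of one way to guard against that is a small context manager that always restores the previous directory; `working_directory` is a hypothetical helper, not part of the original workflow.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Create `path` if needed, change into it, and always change back."""
    previous = os.getcwd()
    os.makedirs(path, exist_ok=True)
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)


# Small self-check in a throwaway directory (no downloads involved)
start = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    with working_directory(os.path.join(tmp, "714")):
        inside = os.getcwd()
back = os.getcwd()
print(inside.endswith("714"), back == start)
```

With this helper, the loop body would become `with working_directory(study): download_from_qiita(study)`, with no explicit `os.chdir("../")`.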