# Data Acquisition of Microbiome Data from Qiita

Microbiome data was acquired from Qiita, an open-source microbiome study database hosted at UCSD.

Documentation can be found [here](https://qiita.ucsd.edu/static/doc/html/index.html).

## Steps:
- Make an account
- Find a study of interest to you
    - Our data came from a [UCSD Geolocation study](https://qiita.ucsd.edu/study/description/10423)
- Download **ALL QIIME Maps and BIOMs**
- Unzip the downloaded files
- Convert the .BIOM files to .tsv
- Match the converted table to the sample (mapping) file
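
The unzip step can also be scripted. The following is a minimal sketch, assuming the downloads arrive as `.zip` archives in a single folder; the helper name `unzip_all` and the archive name `example_download.zip` are hypothetical, and the demo builds a throwaway archive so that it runs without a real download.

```python
import tempfile
import zipfile
from pathlib import Path


def unzip_all(src_dir, dest_dir):
    """Extract every .zip archive in src_dir into its own folder under dest_dir."""
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    extracted = []
    # Materialize the archive list first so new folders do not affect the loop
    for archive in sorted(src_dir.glob("*.zip")):
        target = dest_dir / archive.stem
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(target)
    return extracted


# Demo on a placeholder archive (hypothetical file names, not real Qiita output)
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    with zipfile.ZipFile(tmp / "example_download.zip", "w") as zf:
        zf.writestr("BIOM/60899/all.biom", "placeholder")
    folders = unzip_all(tmp, tmp / "data")
    unpacked = sorted(
        p.relative_to(tmp).as_posix() for p in (tmp / "data").rglob("*") if p.is_file()
    )
    print(unpacked)
```

Each archive is unpacked into a folder named after the archive, which keeps files from different downloads from overwriting one another.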

### Convert .BIOM Files to .tsv

In [None]:
! pip install biom-format

! biom convert -i data/BIOM/60899/all.biom -o data/BIOM/60899/feature_table.tsv --to-tsv

## Match to sample file
A 'map' (mapping) file is a tab-separated .txt file that holds the metadata for each sample. Its #SampleID column matches the sample IDs in the converted .tsv feature table.

In [1]:
import pandas as pd
sample_table = pd.read_csv('data/mapping_files/60899_mapping_file.txt', sep = '\t')
sample_table.head(1)

Unnamed: 0,#SampleID,BarcodeSequence,LinkerPrimerSequence,center_name,experiment_design_description,instrument_model,its16s,library_construction_protocol,linker,pcr_primers,...,replicate,row,sample_type,sampletype,scientific_name,time,timeofcollection,title,weekday,Description
0,10423.10CFSSA714MNF,ATCTAGTGGCaA,CGGCTGCGTTCTTCaTCGATGC,ANL,samples from office materials in 3 cities (San...,Illumina MiSeq,ITS,Illumina MiSeq ITS1/2,CG,FWD:CTTGCTCATTTAGAGGAAGTAA; REV:GCTGCGTTCTTCAT...,...,no,2,dust,office,indoor metagenome,11:43:00,not applicable,Geography and location are the primary drivers...,Monday,T1C.2.Ca.078


## Formatting Microbiome table
Our microbiome table, converted from the .BIOM file, needs to be reformatted into a dataset amenable to machine learning. The final table will be a high-dimensional, sparse matrix with one row per sample and one column per OTU, holding the count of each OTU in each sample.

In [2]:
%%capture
microbiome_table_raw = pd.read_csv('data/BIOM/60899/feature_table.tsv', sep = '\t')
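
As an alternative way to reach the same samples-by-features layout, the sketch below skips the leading comment line while reading and then transposes. It assumes the TSV starts with a single `# Constructed from biom file` line followed by a `#OTU ID` header row; the toy sequences, sample names, and counts are made up for illustration and are not taken from the study.

```python
import io

import pandas as pd

# Toy stand-in for the TSV written by `biom convert --to-tsv` (made-up values)
toy_tsv = (
    "# Constructed from biom file\n"
    "#OTU ID\tsample.A\tsample.B\n"
    "ACGT\t3.0\t0.0\n"
    "TTGA\t0.0\t5.0\n"
)

# Skip the comment line and index the rows by feature identifier
features_by_sample = pd.read_csv(
    io.StringIO(toy_tsv), sep="\t", skiprows=1, index_col=0
)

# Transpose so that samples become rows and features become columns
samples_by_feature = features_by_sample.T
samples_by_feature.index.name = "#SampleID"
samples_by_feature.columns.name = None
print(samples_by_feature)
```

Reading the file this way keeps the counts numeric, because no header text ends up mixed into the data rows.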

In [6]:
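def convert_tsv(df):
    """
    Helper function that takes the .tsv output of the biom-format package (converted from a .biom file)
    and returns a feature table with features as columns and samples as rows
    """
    # Move the index into regular columns, then transpose so that samples become rows
    df = df.reset_index().T

    # The first row now holds '#OTU ID' followed by the feature identifiers; use it as the header
    # (the resulting table could also be stored in a sparse encoding to take less space)
    new_header = df.iloc[0]
    df = df[1:]
    df.columns = new_header

    # Drop the leftover positional index and index the rows by the sample IDs in '#OTU ID'
    df = df.reset_index().drop('index', axis = 1).set_index('#OTU ID')
    return df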

In [7]:
microbiome_table = convert_tsv(microbiome_table_raw)

In [8]:
microbiome_table.head(1)

level_0,GTAGGTGAACCTGCGGAAGGATCATTAATGAATTATGAAAGGGGTTGTCGCTGGCTGTTAGCAGCATGTGCACGCTCTGATCATTATCCATCTTACACACCTGTGCACACACTGTAAGTCGGCTTTTGATGCAAAGTAAGGGTCTTCATT,GTTGGTGAACCAGCGGAGGGATCATTACCGAGTTTTTTTTACAACTCCCAACCCTTGCGAACTATACCCAACTTCTGTTCTCGTTGCTTTTGGCGGGCGGACGAGGAAGCATATCTATTTGATAAGCTTCTCTCGCCCCCGCCGGCAGCT,GTAGGTGAACCTGCGGAAGGATCATTACCGAGTGAGGGCCCTCTGGGTCCAACCTCCCACCCGTGTTTATTTTACCTTGTTGCTTCGGCGGGCCCGCCTTAACTGGCCGCCGGGGGGCTTACGCCCCCGGGCCCGCGCCCGCCGAAGACA,GTTGGTGAACCAGCGGAGGGATCATTACCGAGTTTTTTTACAACTCCCAACCCTTGCGAACTATACCCAACTTTTGTTCTCGCTGCTTTTGGCGGGCGGACGAGGAAGCATATCTATTTGATAAGCTTCTCTCGCCCCCGCCGGCAGCTG,GTTGGTGAACCAGCGGAGGGATCATTACCGAGTTTTTTTACAACTCCCAACCCTTGCGAACTATACCCAACTTCTGTTCTCGTTGCTTTTGGCGGGCGGACGAGGAAGCATATCTATTTGATAAGCTTCTCTCGCCCCCACCGGCAGCTG,GTAGGTGAACCTGCGGAAGGATCATTAAAATAATTTATTCACACTCTTAGGAACAAACTCTAAATCTTAAATCTCAACAAAGTTTAAAAAAAAACTTTCAACAACGGATCTCTTGGTTCTCGCATCGATGAAGAACGCAGCCGCAAGGTT,GTAGGTGAACCTGCGGAAGGATCATTAAAGATTGACCGAAAGGTCTTATCTCTATATCCCTCACCTCTGTGAACTGTGGACCTCCGGGTCTGTCTTAACAAACATCAGTGTAATGAACGTATATATCATTAAACAAAACAAAACTTTCAA,GTAGGTGAACCTGCGGAAGGATCATTAAAAAGAATTATACACTTTGCATTTGCGAACAAAAAAATAAATTTTTTTATTCGAATCATTTAAATCAAAACTTTCAACAACGGATCTCTTGGTTCTCGCATCGATGAAGAACGCAGCCGCAAG,GTAGGTGAACCTGCGGAAGGATCATTACCTAGAGTTGTAGGCTTTGCCTGCTATCTCTTACCCATGTCTTTTGAGTACTTACGTTTCCTCGGCGGGTCCGCCCGCCGATTGGACAATTTAAACCATTTGCAGTTGCAATCAGCGTCTGAA,GTAGGTGAACCTGCGGAAGGATCATTAATGAATTTTAGGACGTTCTTTTTAGAAGTCCGACCCTTTCATTTTCTTACACTGTGCACACACTTCTTTTTTACACACACTTTTAACACCTTAGTATAAGAATGTAATAGTCTCTTTATTGAG,...,GTAGGTGAACCTGCGGAAGGACCATTGCTGATTTTCATGAGGGAGAGGGCGACCTCTCCCCGACCGACACCTCCGTGCACTCTGGGGGGGAGACTCTCCGTCTCCCCTTTTTTTATAACGAACGCCTGTATTCTCGCGCGTATACGACGA,GTAGGTGAACCTGTGGAGGGATCATTACAAGTTGACCCCGGCCCCCGGGCCGGGATGTTCACAACCCTTTGTTGTCCGACTCTGTTGCCTCCGGGGCGACCCTGCCTCCGGGCGGGGGCCCCGGGTGGACACTTCAAAACTCTTGCGTAA,GTAGGTGAACCTGCGGAAGGATCATTATCGAGTTTTGAAGTGGGCTTGATGCTGGCGTCTTCACGACGCATGTGCTCAGCCCCGCTCTCAAATCCACTCTACACCTGTGCACTCTCAAAGTTTGTAGTCCTCCGTAATGGGAGCCGCAAA,GTAGGTGAACCTGCGGAAGGATCATTATCGAGTTTTTTGGACGGGTTGTCGCTGGCCTCTCACGAGGCATGTGCACGCCGGCTCATCCACTCTCAACCTCTGTGCACTTTATGTGAGTAACGGTGGACTTATTGGCTCTTTGAGTCGTTA,GTAGGTGAACCTGCGGAGGGATCATTACAAGTGACCCCGGTCTAACCACCGGGATGTTCATAACCCTTTTTTGTCCGACTCTGTTGCCTCCGGGGCGACCCTGCCTTCGGGCGGGGGCTCCGGGTGGACACTTCAAACTCTTGCGTAACT,GTAGGTGAACCTGCGGAAGGATCATTACTGAGACTGGGTGCTTCGGCGCCCGACCTCCAACCCCCTGTCTACCTTACCACTGTTGCCTCGGCGTTTCCACCCGCCCCCCCCCTCTCGCAGGGGGTCGCTGGGCGGTGCGTCGGCGGCCAA,GTAGGTGAACCTGCGGAAGGATCATTACTGAGACTGGGTGCTTCGGCGCCCGACCTCCAACCCCCTGTCTACCTTACCACTGTTGCCTCGGCGTTTCCACCCGCCCCCCCCTCTCGCAGGGGGTCGCTGGGCGGTGCGTCGGCGGCCAAA,GTAGGTGAACCTGCGGAAGGATCATTACTGAGACTGGGTGCTTCGGCGCCCGACCTCCAACCCCCTGTCTACCTTACCACTGTTGCCTCGGCGTTTCCACCCGCCCCCCCCCCTCTCGCAGGGGGTCGCTGGGCGGTGCGTCGGCGGCCA,GTAGGTGAACCTGCGGAAGGATCATTACCGAGTGCGGGCTGCCTCCGGGCGCCAACCTCCCACCCGTGACTACCTAACACTGTTGCTTCGGCGGGGAGCCCTCTCGGGGGCGAGCCGACGGGGACTACTGAACTTCATGCCTGAGAGAGA,GTAGGTGAACCTGCAGAAGGATCATTAGTGAAGATTTGGGCAGGCCATACGGACGCCAAAAAGTGTCCCTGGCCGCCTACACCCACTATACATCCACAAACCCGTGTGCACTGTCTTGGAGAAAGGCTTCTTGAGAAGTTATGTGACCTC
#OTU ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10423.FM2P2T4553XR,2306.0,1586.0,1471.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
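
Because most entries in a table like this are zero, a sparse encoding can reduce memory use. The sketch below uses pandas' built-in `SparseDtype` on a toy samples-by-features table; the sample and OTU names are hypothetical, and the real table may need to be cast to a numeric dtype before this conversion.

```python
import pandas as pd

# Toy samples-by-features counts, mostly zeros like a real OTU table
dense = pd.DataFrame(
    [[2306.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    index=["sample.A", "sample.B", "sample.C"],
    columns=["otu_1", "otu_2", "otu_3", "otu_4"],
)

# Store only the non-zero entries; zeros become the implicit fill value
sparse = dense.astype(pd.SparseDtype("float64", fill_value=0.0))

# Fraction of entries that are stored explicitly (non-zero)
print(sparse.sparse.density)

# The dense form can be recovered when a library needs it
restored = sparse.sparse.to_dense()
```

Whether the sparse form is worth using depends on the downstream library, since some estimators expect dense arrays or SciPy sparse matrices instead.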


With the microbiome table and the sample metadata in hand, we can start investigating the microbiome composition of the communities.
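
The matching step described earlier can be sketched as a join on the sample ID. This is a minimal illustration on toy data, assuming the metadata carries a `#SampleID` column and the feature table is indexed by the same IDs; the IDs, the `otu_*` columns, and the values are hypothetical.

```python
import pandas as pd

# Toy metadata keyed by '#SampleID' (hypothetical IDs and values)
sample_table = pd.DataFrame(
    {"#SampleID": ["s1", "s2", "s3"], "sample_type": ["dust", "dust", "soil"]}
)

# Toy feature table with sample IDs as the row index
microbiome_table = pd.DataFrame(
    {"otu_1": [5.0, 0.0], "otu_2": [0.0, 2.0]}, index=["s1", "s2"]
)

# Inner join keeps only the samples present in both tables
merged = sample_table.set_index("#SampleID").join(microbiome_table, how="inner")

# Samples that have metadata but no feature counts
unmatched = sorted(set(sample_table["#SampleID"]) - set(microbiome_table.index))

print(merged)
print(unmatched)
```

Checking the unmatched IDs before modelling helps catch samples that were dropped during sequencing or processing.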