
DSE-American-Gut-Project

Repository Info

Biom Data Ingestion

The 'biom_data_ingest' directory contains notebooks which pull the American Gut Project biom OTU data of interest through the Redbiom Python API, remove bloom OTUs from the biom table data, run rarefaction on the data, and finally export it as a sparse dataframe (in .pkl format).

Two different ingestion notebooks exist within the directory: one pulls from the 'greengenes' context, the other from the 'deblur' context. See the notebooks for more detailed information on the ingestion process.

You can also look at 'Notebooks/redbiom/AG_example.ipynb' for a more detailed walkthrough of the biom ingestion process.
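The final export step described above saves the OTU table as a pickled sparse pandas DataFrame. A minimal sketch of that round trip, with made-up sample and OTU names (not the notebooks' actual data or file names):

```python
import os
import tempfile

import pandas as pd

# Toy OTU count table: rows are samples, columns are OTU IDs.
otu_counts = pd.DataFrame(
    {"OTU_1": [5, 0, 0], "OTU_2": [0, 3, 0]},
    index=["sample_a", "sample_b", "sample_c"],
).astype(pd.SparseDtype("int", 0))  # sparse storage: zeros are implicit

path = os.path.join(tempfile.mkdtemp(), "otu_table.pkl")
otu_counts.to_pickle(path)          # export, as the ingestion notebooks do
reloaded = pd.read_pickle(path)     # downstream steps read the .pkl back

# density = fraction of values actually stored (non-fill values)
print(reloaded.sparse.density)
```

Sparse storage matters here because real OTU tables are overwhelmingly zeros, so the .pkl files stay far smaller than a dense export.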

Alpha Diversity

The 'biom_data_ingest' directory also contains 'greengenes_alpha_diversity.ipynb', which calculates the phylogenetic alpha diversity of the biom data samples using the greengenes 97 tree.
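The phylogenetic alpha diversity metric here (Faith's PD) is the total branch length of the tree needed to span the OTUs observed in a sample. A toy standard-library illustration on a hypothetical three-leaf tree (not greengenes, and not the notebook's actual code):

```python
# Hypothetical tree: otu_1 and otu_2 hang off an internal node_a,
# otu_3 attaches directly to the root. Branch lengths are made up.
parent = {"otu_1": "node_a", "otu_2": "node_a", "otu_3": "root", "node_a": "root"}
branch_len = {"otu_1": 0.5, "otu_2": 0.5, "otu_3": 2.0, "node_a": 1.0}

def faith_pd(observed_otus):
    """Sum the branch lengths on the union of root-to-leaf paths."""
    covered = set()
    for leaf in observed_otus:
        node = leaf
        while node in parent:      # walk up until we reach the root
            covered.add(node)
            node = parent[node]
    return sum(branch_len[n] for n in covered)

pd_sample = faith_pd({"otu_1", "otu_2"})  # 0.5 + 0.5 + 1.0 = 2.0
```

Two samples with the same OTU count can therefore have very different PD: close relatives share most of their branches, while distant OTUs contribute long independent branches.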

Beta Diversity / PCoA

The 'beta_diversity_pcOa' directory contains notebooks which calculate phylogenetic beta diversity between samples using the greengenes 97 tree, and then run principal coordinates analysis (PCoA) on the resulting beta diversity matrix.
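Classical PCoA embeds samples into coordinates whose Euclidean distances approximate the beta diversity distances. A sketch of the standard eigendecomposition approach on a made-up 3x3 distance matrix (the notebooks use phylogenetic distances; this toy matrix is illustrative only):

```python
import numpy as np

# Toy symmetric beta-diversity distance matrix for three samples.
D = np.array([[0.0, 0.3, 0.7],
              [0.3, 0.0, 0.6],
              [0.7, 0.6, 0.0]])

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
B = -0.5 * J @ (D ** 2) @ J           # double-centered (Gower) matrix

eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1]     # largest eigenvalues first
# Scale eigenvectors by sqrt(eigenvalue); clip tiny negatives to zero.
coords = eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))
```

The first few columns of coords are the principal coordinates typically plotted; eigenvalues indicate how much of the distance structure each axis captures.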

Data Cleaning / Integration

The 'data_join_workflow' directory contains a Jupyter notebook that pulls in data from all 3 data sources (the raw metadata .txt file, the biom .pkl file, and the drug questionnaire .csv). The notebook cleans the data, including the extraction/creation of a consistent 'sample_id' that is joinable across the 3 datasets, and outputs cleaned, joinable versions of the different data sources, ready for analysis.
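The join itself reduces to merging the three cleaned tables on the shared key. A minimal sketch with hypothetical miniature versions of the sources (column names other than sample_id are invented for illustration):

```python
import pandas as pd

# Tiny stand-ins for the three cleaned data sources, each carrying
# the consistent 'sample_id' key the notebook constructs.
metadata = pd.DataFrame({"sample_id": ["s1", "s2"], "age": [34, 51]})
biom = pd.DataFrame({"sample_id": ["s1", "s2"], "OTU_1": [5, 0]})
drugs = pd.DataFrame({"sample_id": ["s1", "s2"], "takes_antibiotics": [False, True]})

# Inner-join all three sources on the shared key.
joined = metadata.merge(biom, on="sample_id").merge(drugs, on="sample_id")
```

An inner join keeps only samples present in all three sources; an outer join (how="outer") would instead keep every sample and fill gaps with NaN.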

Please see the README.md in the 'data_join_workflow' directory for more information.

The cleaned output data sources are used as the input data to our pipeline, which supports multiple analytical workflows including preprocessing, hyperbolic and word2vec embeddings, hyperparameter tuning, and supervised machine learning classification (see the Pipeline section below).

Pipeline

The pipeline lives under the american_gut_project_pipeline folder and is installed as a library.

Installation

  1. Acquire AWS credentials to pull files from S3, and make sure they are saved in the ~/.aws/credentials file
  2. From the root of the git repo, run pip install -e . to install the required packages along with the newly built library
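The two steps above, sketched as shell commands; the credential values are placeholders, not real keys (no test is attached, since these commands modify the local environment):

```shell
# 1. Save AWS credentials so the pipeline can pull files from S3.
#    The key values below are placeholders.
mkdir -p ~/.aws
cat >> ~/.aws/credentials <<'EOF'
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
EOF

# 2. From the root of the git repo, install the dependencies and the
#    library itself in editable mode.
pip install -e .
```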

Executing

The entire pipeline can be executed by running agp-pipeline in the terminal. If the AWS credentials are not in the [default] block within the credentials file, the command can point to another profile like this: agp-pipeline -a <aws_profile_name>

Alternatively, all the output and intermediate data artifacts are stored in S3. Running agp-pull (optionally with -a) syncs the local file system with the data artifacts on S3. There is also an agp-push command that uploads local files to S3.
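Typical invocations of the three commands, assuming the library is installed; the profile name is a placeholder (no test is attached, since these require the installed package and live AWS credentials):

```shell
agp-pipeline                      # run the entire pipeline with the [default] profile
agp-pipeline -a my_aws_profile    # ...or with a named AWS profile
agp-pull -a my_aws_profile        # sync local data artifacts down from S3
agp-push -a my_aws_profile        # upload local data artifacts to S3
```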

The data artifacts from the pipeline are stored under the data directory.

It is also possible to run a single transformation (or task) by editing the main function in an individual file and executing it from the command line or through an IDE. Note that since luigi handles the dependency management of the artifacts, if a task's output artifacts already exist in the data directory, luigi won't recompute the task unless the files are moved or deleted.
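The skip-if-output-exists behavior is worth internalizing before rerunning tasks. A standard-library sketch of the check luigi performs (this mimics the idea; it is not the project's actual task code):

```python
import tempfile
from pathlib import Path

def should_run(output_path: Path) -> bool:
    """A task is 'complete' (and skipped) when its output file exists."""
    return not output_path.exists()

data_dir = Path(tempfile.mkdtemp())
artifact = data_dir / "embedding.pkl"   # hypothetical artifact name

first = should_run(artifact)        # True: artifact absent, task would run
artifact.write_bytes(b"cached")     # pretend the task produced its output
second = should_run(artifact)       # False: luigi-style skip
artifact.unlink()                   # delete the artifact...
third = should_run(artifact)        # ...and the task is eligible again
```

So to force a recompute of one stage, delete (or move aside) that stage's files under the data directory and rerun.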

Interactive Dashboard

A full interactive dashboard demonstrating American Gut Project embedding results can be found here: https://github.com/mas-dse-ringhilt/american_gut_capstone_dashboard

Pure Metadata Survey Analysis

The VioAndMetadata_Cleaning directory contains analysis of the metadata survey information alone, with no microbiome data, to investigate what can already be learned from the surveys with data science. Please read the README and explore the notebooks for more information.

Imputation techniques using SoftImpute, KNN, and IterativeImputer, along with tests, are located in this directory.
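As one concrete example of the KNN approach, scikit-learn's KNNImputer fills a missing answer with the average of the nearest complete rows. The values below are made up, not drawn from the AGP metadata:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Survey-style matrix with one missing answer (np.nan) in row 0.
X = np.array([[1.0, 2.0, np.nan],
              [1.0, 2.0, 3.0],
              [1.0, 2.0, 3.0],
              [9.0, 8.0, 7.0]])

# Impute using the 2 nearest rows (nan-aware Euclidean distance);
# row 0's neighbors are rows 1 and 2, so the gap is filled with 3.0.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

SoftImpute instead fills gaps via low-rank matrix completion, and IterativeImputer models each column as a function of the others; the notebooks in this directory compare these options.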

References

Consult the QIIME 2 and Qiita documentation online: https://qiime2.org/ and https://qiita.ucsd.edu/ (AGP study: https://qiita.ucsd.edu/study/description/10317)

Metagenomics Info: http://www.metagenomics.wiki/pdf/definition

BIOM format: http://biom-format.org/

AGP notebooks: https://github.com/knightlab-analyses/american-gut-analyses/tree/master/ipynb

Papers

American Gut: https://msystems.asm.org/content/3/3/e00031-18

Conducting a Microbiome Study: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5074386/

A Guide to Enterotypes across the Human Body: Meta-Analysis of Microbial Community Structures in Human Microbiome Datasets: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3542080/

Supervised classification of human microbiota: https://www.ncbi.nlm.nih.gov/pubmed/21039646
