GitHub - KyleLevi/BAM_Scripts: Tools for (1) retrieving data from the Sequence Read Archive, (2) using Read Mapping for analysis, and (3) performing many common tasks when working with BAM/SAM files.

BAM_Scripts is a collection of tools grouped into two main functions.

Getting publicly available data from the Sequence Read Archive (SRA) and using read mapping (Bowtie2) to search for matching reads.
Analyzing and visualizing data stored in .BAM or .SAM files, and performing many common tasks such as removing matches below a certain length, splitting files by organism, and retrieving XML metadata from SRA runs.

Getting Started

1. Finding SRA data to work with

The SRA provides an interactive search for finding data sets. Use it to find the runs you are interested in. When you are done, download the run list as a file called "SraRunAcc.txt" and place it in the "Input" folder of BAM_Scripts.

To download the "SraRunAcc.txt" file, click the "Send to" drop-down menu, select "File", and choose "Accession List" as the format.
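Before starting a long download, it can help to check that the accession list contains what you expect. The following is a minimal sketch, not part of BAM_Scripts: the function name `parse_run_accessions` is hypothetical, and it assumes that run accessions start with SRR, ERR, or DRR followed by digits. Any other line, such as a header or a blank line, is ignored.

```python
import re

# Run accessions are assumed to look like SRR123456, ERR123456, or DRR123456.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def parse_run_accessions(lines):
    """Return the unique run accessions found in `lines`, in original order.

    Lines that are not run accessions (blank lines, a header, stray text)
    are skipped rather than treated as errors.
    """
    accessions = []
    seen = set()
    for line in lines:
        acc = line.strip()
        if RUN_ACCESSION.match(acc) and acc not in seen:
            seen.add(acc)
            accessions.append(acc)
    return accessions
```

To use it on the file described above, open "Input/SraRunAcc.txt" and pass the file object to `parse_run_accessions`; the length of the returned list is the number of runs that would be downloaded.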

2. Add FASTA genomes to the Input/Genomes folder

Whether you are looking for a bacterium, a phage, or even just a region of DNA, add the FASTA-format file to the Input/Genomes folder. When you use the Makefile, these files are automatically combined and a Bowtie2 index is built from them.
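To illustrate what "combined" means here, the sketch below concatenates every FASTA file in a folder into one file and reports the sequence names it found. It is an illustration only, not the Makefile's actual recipe: the function name `combine_fasta`, the output path, and the set of recognized file extensions are all assumptions.

```python
from pathlib import Path

# Assumed FASTA file extensions; adjust to match your own files.
FASTA_SUFFIXES = {".fa", ".fasta", ".fna"}


def combine_fasta(genome_dir, out_path):
    """Concatenate the FASTA files in `genome_dir` into `out_path`.

    Returns the sequence names (first word after ">") in the order written.
    """
    out_path = Path(out_path)
    # Collect the inputs first so the output file is never read as an input.
    sources = sorted(
        p for p in Path(genome_dir).iterdir()
        if p.suffix.lower() in FASTA_SUFFIXES
        and p.resolve() != out_path.resolve()
    )
    names = []
    with out_path.open("w") as out:
        for path in sources:
            text = path.read_text()
            for line in text.splitlines():
                if line.startswith(">"):
                    parts = line[1:].split()
                    if parts:
                        names.append(parts[0])
            # Guarantee a trailing newline so records from different
            # files never run together.
            out.write(text if text.endswith("\n") else text + "\n")
    return names
```

The returned names are worth a quick look: reads are later grouped by the reference they matched, so each genome should have a name you can recognize.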

3. Generating BAM files

Now that you have chosen your datasets and target DNA, it is time to start downloading and scanning SRA runs. From a terminal in the main directory (../BAM_Scripts/), type:

make split_BAM_files

This will:

Download 100,000 reads for each SRA run (in the FASTQ format).
Create a Bowtie2 index of all FASTA files in the Input/Genomes/ folder.
Scan each of the downloaded data sets with Bowtie 2, creating SAM files in the Input/SAM_files/ folder.
Convert the SAM files to BAM files and index them.
Split the BAM files in Input/raw_BAM_files/ by organism into new folders in Output/split_BAM_files/<organism_name_here>/.
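The first four steps above can be sketched as the command lines they roughly correspond to. This is an assumption about what such a pipeline looks like, not a copy of the Makefile: the function name `pipeline_commands`, the `Input/fastq` folder, the index prefix, and the intermediate file names are hypothetical. The sketch treats reads as unpaired, adds a sort step because `samtools index` needs a coordinate-sorted BAM file, and uses `fastq-dump -X`, which limits the download by spot ID rather than by an exact read count. The split-by-organism step is left out.

```python
def pipeline_commands(run, genome_fastas,
                      index="Input/Genomes/index", reads=100000):
    """Build (but do not run) the commands for one SRA run.

    `genome_fastas` is a list of FASTA paths used to build the index.
    All output locations other than Input/SAM_files/ and
    Input/raw_BAM_files/ are hypothetical.
    """
    fastq = f"Input/fastq/{run}.fastq"
    sam = f"Input/SAM_files/{run}.sam"
    unsorted_bam = f"Input/raw_BAM_files/{run}.unsorted.bam"
    bam = f"Input/raw_BAM_files/{run}.bam"
    return [
        # Download the first `reads` spots of the run as FASTQ.
        ["fastq-dump", "-X", str(reads), "-O", "Input/fastq", run],
        # Build one Bowtie 2 index from all genomes (comma-separated list).
        ["bowtie2-build", ",".join(genome_fastas), index],
        # Map the reads against the index, writing SAM.
        ["bowtie2", "-x", index, "-U", fastq, "-S", sam],
        # Convert SAM to BAM, sort by coordinate, then index.
        ["samtools", "view", "-b", "-o", unsorted_bam, sam],
        ["samtools", "sort", "-o", bam, unsorted_bam],
        ["samtools", "index", bam],
    ]
```

Each entry can be passed to `subprocess.run(cmd, check=True)` in order. In a real run the index only needs to be built once, not once per SRA run.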

4. Take a peek inside those BAM files with bam_stats.py

Since BAM files aren't human-readable, try running bam_stats.py if you want to see what was found:

python bin/bam_stats.py -i Output/split_BAM_files/<organism_name_here>/ -o Output/bam_stats.csv

This will output a CSV document (bam_stats.csv) that is viewable in any spreadsheet program.
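If you would rather inspect the results from Python than from a spreadsheet, the sketch below reads the header and the first few rows with the standard `csv` module. It makes no assumption about which columns bam_stats.py writes; the function name `preview_csv` is hypothetical.

```python
import csv
from itertools import islice


def preview_csv(lines, max_rows=5):
    """Return (header, rows) for the first `max_rows` data rows.

    `lines` can be an open file or any iterable of CSV text lines.
    """
    reader = csv.reader(lines)
    header = next(reader, [])  # empty list if the input has no lines
    rows = list(islice(reader, max_rows))
    return header, rows
```

For example, open "Output/bam_stats.csv" with `newline=""` and pass the file object to `preview_csv` to see the column names before deciding how to analyze the rest.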

Requirements

This project requires the following programs:

The SRA toolkit to download datasets.
Samtools/HTSlib to convert SAM to BAM files and index them.
Bowtie 2 for read mapping.
Python with the Pysam module installed, and optionally Beautiful Soup 4 if you plan to work with XML metadata files.

Note: if you are using the Makefile, these programs must also be available on your PATH. This can be checked by running:

make test
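A similar check can be done from Python without the Makefile. The sketch below is an assumption about what to look for, not what `make test` actually runs: the executable names (`fastq-dump` for the SRA toolkit in particular) and the function name `missing_requirements` are hypothetical choices.

```python
import importlib.util
import shutil

# Assumed executable names for the SRA toolkit, Bowtie 2, and Samtools.
REQUIRED_PROGRAMS = ["fastq-dump", "bowtie2-build", "bowtie2", "samtools"]
# Import names of the Python modules (Beautiful Soup 4 would be "bs4").
REQUIRED_MODULES = ["pysam"]


def missing_requirements(programs=REQUIRED_PROGRAMS,
                         modules=REQUIRED_MODULES):
    """Return the programs not found on PATH and modules not importable."""
    missing = [p for p in programs if shutil.which(p) is None]
    missing += [m for m in modules
                if importlib.util.find_spec(m) is None]
    return missing
```

An empty list means everything was found. Anything listed needs to be installed or added to PATH before the Makefile targets will work.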

Additionally, a demo is available by running:

make demo

Lastly, the project can be reset with:

make clean
