dna-stats

Final project for Coursera's Python for Genomics course, offered through Johns Hopkins. Reads a multi-FASTA file and performs operations on extracted data.

Requirements

Python 3
seqstats.py module

Usage

sample_usage.py [-h] <filename>

-h              print this message

<filename>      the file has to be in FASTA format

Summary

This program was developed as a final project for Coursera's Python for Genomics course, offered through Johns Hopkins. I chose to wrap all statistics about the given multi-FASTA file into a class, contained in the module seqstats.py.

SeqStats Class Basics

To create a new instance:

foo = SeqStats(seq_dict)

Note: The variable seq_dict must be formatted such that each key represents a sequence ID, and each value is a list of sequences matching that sequence ID

Upon creation, the SeqStats class automatically calculates some statistics for the given seq_dict, which can be summarized by calling print_seq_stats():

foo.print_seq_stats()

This prints:

Number of records
A dictionary of sequence lengths
The longest sequence length
The total number of sequences of the longest sequence length
A list of all sequences with the longest sequence length
The shortest sequence length
The total number of sequence of the shortest sequence length
A list of all sequences with the shortest sequence length

To calculate ORF statistics, invoke the find_orfs() method:

foo.find_orfs(1)

Note: find_orfs() expects an integer of either 1, 2, or 3, to specify the reading frame.

To print ORF statistics based on the current reading frame, use print_orf_stats():

foo.print_orf_stats()

This prints:

Length of the longest ORF in the file
Longest ORF sequence in the file
Sequence ID of longest ORF

To calculate ORF statistics for a particular sequence ID only, enter that sequence ID as an optional second argument:

foo.get_longest_orf('bar')

To print ORF statistics based on the current reading frame for a particular sequence ID only, enter that sequence ID as an optional second argument:

foo.print_orf_stats('bar')

This prints:

Length of the longest ORF in the given sequence
Longest ORF in the given sequence
Starting position in sequence of the longest ORF

To find all repeats of length n in the file, use get_repeats():

foo.get_repeats(n)

Note: get_repeats() expects a positive integer value

To print repeat statistics of length n, use print_repeat_stats(): foo.print_repeat_stats()

This prints:

A dictionary of all repeats and their frequencies
A dictionary of the highest repeat frequency and all sequences with this frequency

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
sample_usage.py		sample_usage.py
seqstats.py		seqstats.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

dna-stats

Requirements

Usage

Summary

SeqStats Class Basics

About

Releases

Packages

Contributors 2

Languages

License

krzamiela/dna-stats

Folders and files

Latest commit

History

Repository files navigation

dna-stats

Requirements

Usage

Summary

SeqStats Class Basics

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages