This project solves common bioinformatics challenges using Python.
Many of the functions and scripts here are inspired by challenges on Rosalind.
bioinfo_toolbox.py contains core functions for bioinformatics. The module is standalone and is useful for elementary bioinformatics tasks.
Many of the lower-level Rosalind challenges can be solved using this module alone.
The table below shows the available functions with descriptions:
Function | Description | Arguments |
---|---|---|
convert_phred() | Takes a single Phred-encoded quality character and returns the corresponding Q score | letter, val=33 |
qual_score() | Takes a string of Phred-encoded quality characters and returns the average quality score | phred_score |
validate_base_seq() | Takes a sequence and returns a bool indicating whether it is valid DNA or RNA | seq, RNAflag=False |
gc_content() | Takes a DNA or RNA sequence and returns proportion of sequence that is 'G' or 'C' | seq |
calc_median() | Takes a sorted numerical list and returns the median value from that list | sortedlist |
oneline_fasta() | Takes a FASTA file and outputs a FASTA file where each sequence is only one line | filer, filew='oneline.fa' |
rev_compliment() | Takes a sequence of DNA or RNA and returns the reverse complement sequence | seq |
dna_to_aa() | Takes a sequence of DNA and returns a list of peptides that are encoded from that DNA sequence | seq |
permutation_calc() | Takes n and r and returns number of permutations and an optional list of numeric permutations | n, r, perm_out=True |
transition_transversion() | Takes two DNA sequences and returns the transition to transversion ratio R(s1, s2) | seq1, seq2 |
kmerize() | Takes a sequence and splits it into k-mers of length k | seq, k |
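
To make the table more concrete, here is a minimal sketch of how a few of these functions could behave. It follows the names and arguments listed above, but the bodies are assumptions written for illustration and are not copied from bioinfo_toolbox.py. In particular, the sketch assumes that `val` is the ASCII offset of the Phred encoding and that `kmerize()` returns a list of overlapping k-mers.

```python
def convert_phred(letter, val=33):
    """Convert one quality character to its Q score (ASCII code minus offset)."""
    return ord(letter) - val


def qual_score(phred_score):
    """Return the mean Q score of a string of quality characters."""
    if not phred_score:
        raise ValueError("quality string is empty")
    return sum(convert_phred(ch) for ch in phred_score) / len(phred_score)


def gc_content(seq):
    """Return the proportion of bases in seq that are G or C."""
    if not seq:
        raise ValueError("sequence is empty")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmerize(seq, k):
    """Return every overlapping substring of length k, in order (assumed output)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    print(convert_phred("I"))    # 40
    print(qual_score("!I"))      # 20.0
    print(gc_content("GCAT"))    # 0.5
    print(kmerize("ATGC", 2))    # ['AT', 'TG', 'GC']
```

The real module may differ in return types and error handling, so treat the outputs above as a description of this sketch only.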
Rosalind contains a large set of bioinformatic challenges that are free to attempt and learn from.
The rosalind folder contains code that solves some of Rosalind's challenges.
Script | Description | Problem Title |
---|---|---|
point_mutations.py | Calculates the hamming distance between two sequences | Counting Point Mutations |
open_reading_f.py | Finds all possible polypeptides from a DNA sequence | Open Reading Frames |
restriction_sites.py | Locates restriction sites by finding reverse palindromes in DNA | Locating Restriction Sites |
mRNA_poss.py | Calculates number of mRNA sequences a protein sequence could have been derived from | Inferring mRNA from Protein |
shared_motif.py | Finds the longest shared motif in a DNA FASTA file | Finding a Shared Motif |
rna_splicing.py | Takes a DNA sequence and its intron sequences, splices out the introns, and returns the resulting protein string | RNA Splicing |
transition_transversion.py | Takes a FASTA file with two sequences and returns the transition to transversion ratio between the two | Transitions and Transversions |
overlap_graphs.py | Takes a FASTA file and compares all sequences to find overlapping suffixes and prefixes | Overlap Graphs |
fibonacci_recurrence.py | Given n (generations) and k (offspring per generation), returns the number of breeding pairs after n generations | Rabbits and Recurrence Relations |
enumerating_kmers.py | From a string of letters, returns all combinations of length r and sorts lexicographically | Enumerating k-mers Lexicographically |
kmer_composition.py | Takes a single sequence in FASTA format and returns the 4-mer composition of the sequence in lexicographic order | k-Mer Composition |
consensus_seq.py | Takes a FASTA file of sequences of all the same length and returns the consensus sequence and nucleotide make up per position | Consensus and Profile |
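
As an illustration of the kind of calculation these scripts perform, the sketch below computes a Hamming distance and a transition to transversion ratio for two equal-length DNA strings. It is not the code from point_mutations.py or transition_transversion.py: the function names and error handling are assumptions, FASTA parsing is omitted, and the input is assumed to contain only A, C, G, and T.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def hamming_distance(s1, s2):
    """Count the positions at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s1, s2))


def transition_transversion(seq1, seq2):
    """Return transitions / transversions between two equal-length DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    transitions = 0
    transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b:
            continue
        # A transition swaps purine for purine or pyrimidine for pyrimidine;
        # any other mismatch is counted as a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("ratio is undefined when there are no transversions")
    return transitions / transversions


if __name__ == "__main__":
    print(hamming_distance("GAGCCTACTAACGGGAT", "CATCGTAATGACGGCCT"))  # 7
    print(transition_transversion("AAGC", "GATT"))                     # 2.0
```

Raising an error when there are no transversions is a design choice made for this sketch; a script could just as reasonably report the counts separately.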
To use bioinfo_toolbox.py or the Rosalind scripts, follow this brief guide.
The following modules, all part of the Python standard library, are listed as dependencies:
- sys
- math
- random
NOTE: All Rosalind scripts must be executed from the rosalind directory
- First clone the repository:
git clone https://github.com/ivango17/Bioinformatics_Challenges.git
- Navigate to the debug directory:
cd Bioinformatics_Challenges/debug
- Run the test script to ensure that bioinfo_toolbox.py is working properly:
./debug.py
or
python debug.py
- To use this repository to solve Rosalind problems, navigate to the rosalind directory:
cd ../rosalind
- Each script has a help option that explains how to supply input data from Rosalind:
./<script.py> -h
or
python <script.py> -h
- Finish bioinfo_toolbox.py
- Add more Rosalind code
- Counting Point Mutations
- Open Reading Frames
- Locating Restriction Sites
- Inferring mRNA from Protein
- Finding a Shared Motif
- RNA Splicing
- Transitions and Transversions
- Overlap Graphs
- Rabbits and Recurrence Relations
- Enumerating k-mers Lexicographically
- k-Mer Composition
- Consensus and Profile
- Add to README.md