ivango17/Bioinformatics_Challenges

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Bioinformatics Challenges

This project is to solve common bioinformatic challenges using Python.

Many of the functions and scripts here are inspired by challenges on Rosalind.

The bioinfo_toolbox.py script contains core bioinformatics functions. The module is standalone and can be used for elementary bioinformatic processes.

Many of the lower-level Rosalind challenges can be solved using this simple module.

The table below shows the available functions with descriptions:

| Function | Description | Arguments |
| --- | --- | --- |
| convert_phred() | Takes a Phred quality character and returns the corresponding Q score | letter, val=33 |
| qual_score() | Takes a string of quality scores and returns the average quality score | phred_score |
| validate_base_seq() | Takes a sequence and returns a bool indicating whether it is valid DNA or RNA | seq, RNAflag=False |
| gc_content() | Takes a DNA or RNA sequence and returns the proportion of the sequence that is 'G' or 'C' | seq |
| calc_median() | Takes a sorted numerical list and returns the median value of that list | sortedlist |
| oneline_fasta() | Takes a FASTA file and outputs a FASTA file in which each sequence occupies a single line | filer, filew='oneline.fa' |
| rev_compliment() | Takes a DNA or RNA sequence and returns the reverse complement sequence | seq |
| dna_to_aa() | Takes a DNA sequence and returns a list of the peptides encoded by that sequence | seq |
| permutation_calc() | Takes n and r and returns the number of permutations and, optionally, a list of the numeric permutations | n, r, perm_out=True |
| transition_transversion() | Takes two DNA sequences and returns the transition to transversion ratio R(s1, s2) | seq1, seq2 |
| kmerize() | Takes a sequence and splits it into k-mers of length k | seq, k |
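
To give a feel for what a few of these functions do, here is a minimal sketch. The function names and arguments mirror the table above, but the bodies are assumptions written for illustration and are not copied from bioinfo_toolbox.py; the real implementations may differ in return types and edge-case handling. The sketch assumes non-empty input, and its rev_compliment() handles DNA only.

```python
def convert_phred(letter, val=33):
    """Convert one Phred-encoded character to its quality score.

    Assumes `val` is the ASCII offset of the encoding (33 for Phred+33).
    """
    return ord(letter) - val


def qual_score(phred_score):
    """Return the mean quality score of a non-empty quality string."""
    return sum(convert_phred(ch) for ch in phred_score) / len(phred_score)


def gc_content(seq):
    """Return the fraction of bases in a non-empty sequence that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def rev_compliment(seq):
    """Return the reverse complement of a DNA sequence (DNA only in this sketch)."""
    pairs = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(pairs)[::-1]


def kmerize(seq, k):
    """Return every overlapping k-mer of the sequence, in order (assumed to be a list)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


print(convert_phred("I"))       # 40
print(gc_content("GATTACA"))    # 2 of 7 bases are G or C
print(rev_compliment("ATGC"))   # GCAT
print(kmerize("ATGC", 2))       # ['AT', 'TG', 'GC']
```

Only built-in string methods are used here, which is in keeping with the module being standalone.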

Rosalind Challenges

Rosalind contains a large set of bioinformatic challenges that are free to attempt and learn from.

The rosalind folder contains code that solves some of Rosalind's challenges.

| Script | Description | Problem Title |
| --- | --- | --- |
| point_mutations.py | Calculates the Hamming distance between two sequences | Counting Point Mutations |
| open_reading_f.py | Finds all possible polypeptides from a DNA sequence | Open Reading Frames |
| restriction_sites.py | Locates restriction sites by finding reverse palindromes in DNA | Locating Restriction Sites |
| mRNA_poss.py | Calculates the number of mRNA sequences a protein sequence could have been derived from | Inferring mRNA from Protein |
| shared_motif.py | Finds the longest shared motif in a DNA FASTA file | Finding a Shared Motif |
| rna_splicing.py | Takes a DNA sequence and its intron sequences, splices out the introns, and returns the resulting protein string | RNA Splicing |
| transition_transversion.py | Takes a FASTA file with two sequences and returns the transition to transversion ratio between the two | Transitions and Transversions |
| overlap_graphs.py | Takes a FASTA file and compares all sequences to find overlapping suffixes and prefixes | Overlap Graphs |
| fibonacci_recurrence.py | Given n (generations) and k (offspring per generation), returns the number of breeding pairs after n generations | Rabbits and Recurrence Relations |
| enumerating_kmers.py | From a string of letters, returns all combinations of length r, sorted lexicographically | Enumerating k-mers Lexicographically |
| kmer_composition.py | Takes a single sequence in FASTA format and returns the 4-mer composition of the sequence in lexicographic order | k-Mer Composition |
| consensus_seq.py | Takes a FASTA file of sequences of equal length and returns the consensus sequence and the nucleotide makeup per position | Consensus and Profile |
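
As an illustration of the kind of logic behind three of these problems, the sketch below computes a Hamming distance, a transition to transversion ratio, and the rabbit recurrence. The function names are hypothetical and the code is not taken from the scripts in the rosalind folder; it assumes uppercase DNA strings of equal length, at least one transversion for the ratio, and k read as offspring pairs produced by each mature pair.

```python
PURINES = {"A", "G"}  # the remaining bases, C and T, are pyrimidines


def hamming_distance(s1, s2):
    """Count the positions at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(s1, s2))


def transition_transversion_ratio(s1, s2):
    """Return transitions / transversions between two equal-length DNA strings.

    A transition swaps purine for purine or pyrimidine for pyrimidine;
    any other substitution is a transversion.
    """
    transitions = transversions = 0
    for a, b in zip(s1, s2):
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    return transitions / transversions  # assumes at least one transversion


def rabbit_pairs(n, k):
    """Return F(n) for the recurrence F(n) = F(n-1) + k * F(n-2), F(1) = F(2) = 1."""
    prev, curr = 1, 1
    for _ in range(n - 2):
        prev, curr = curr, curr + k * prev
    return curr


print(hamming_distance("GAGCCTACTAACGGGAT", "CATCGTAATGACGGCCT"))  # 7
print(transition_transversion_ratio("AGCT", "GATA"))               # 3.0
print(rabbit_pairs(5, 3))                                          # 19
```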

Usage

To use bioinfo_toolbox.py or the Rosalind scripts, follow this brief guide.

Required Packages

  • sys
  • math
  • random

NOTE: All Rosalind scripts must be executed from the rosalind directory

  1. First, clone the repository:
git clone https://github.com/ivango17/Bioinformatics_Challenges.git
  2. Navigate to the debug directory:
cd Bioinformatics_Challenges/debug
  3. Run the test script to ensure that bioinfo_toolbox.py is working properly:
./debug.py

or

python debug.py
  4. To use this repository to solve Rosalind problems, navigate to the rosalind directory:
cd ../rosalind
  5. Each script has a help option that explains how to input data from Rosalind:
./<script.py> -h

or

python <script.py> -h
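
For readers curious how a help option like this can be handled with only the sys module listed under Required Packages, here is a hypothetical sketch. The usage text, the function names, and the single input-file argument are assumptions for illustration; the actual scripts may parse their arguments differently.

```python
import sys

# Hypothetical usage message; the real scripts define their own help text.
USAGE = "usage: python example_script.py <input_file>   (-h shows this message)"


def parse_args(argv):
    """Return the input path, or None when help is requested or no path is given."""
    args = argv[1:]  # argv[0] is the script name
    if not args or args[0] in ("-h", "--help"):
        return None
    return args[0]


def main(argv):
    """Print the usage message or report which file would be processed."""
    path = parse_args(argv)
    if path is None:
        print(USAGE)
    else:
        print(f"would read Rosalind input from {path}")
    return path


main(["example_script.py", "-h"])                   # prints the usage message
main(["example_script.py", "rosalind_sample.txt"])  # reports the input path
```

A real script would call main(sys.argv) so that the arguments typed on the command line are used.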

To Do

  • Finish bioinfo_toolbox.py
  • Add more Rosalind code
    • Counting Point Mutations
    • Open Reading Frames
    • Locating Restriction Sites
    • Inferring mRNA from Protein
    • Finding a Shared Motif
    • RNA Splicing
    • Transitions and Transversions
    • Overlap Graphs
    • Rabbits and Recurrence Relations
    • Enumerating k-mers Lexicographically
    • k-Mer Composition
    • Consensus and Profile
  • Add to README.md
