gkarthik/dna-potrait

Inspired by ARAGÓN ARTACHO, Francisco J., et al. "Walking on real numbers". The Mathematical Intelligencer. Vol. 35, Issue 1 (March 2013). ISSN 0343-6993, pp. 42-60

Walking on nucleotides

Aragón Artacho et al. present new ways to visualize large mathematical datasets. In particular, the paper explores how a visualization can help tell whether a real number is "normal".

"A real constant α is b-normal if, given the positive integer b ≥ 2, every m-long string of base-b digits appears in the base-b expansion of α with precisely the expected limiting frequency 1/b^m."
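As a rough illustration of this definition, the sketch below counts the overlapping m-long strings in a finite run of base-b digits and reports the largest gap between an observed frequency and 1/b^m. The function name `max_freq_deviation` and its array-based input are assumptions made for this example, not something taken from the paper. A finite prefix can only hint at normality; it can never prove it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: largest |observed - 1/b^m| over all m-long
 * strings (overlapping windows) in digits[0..n-1], each digit in [0, b).
 * Returns -1.0 if memory allocation fails. */
double max_freq_deviation(const int *digits, size_t n, int b, int m)
{
    assert(digits != NULL && b >= 2 && m >= 1 && n >= (size_t)m);

    size_t patterns = 1;
    for (int i = 0; i < m; i++)
        patterns *= (size_t)b;              /* b^m possible strings */

    size_t *counts = calloc(patterns, sizeof *counts);
    if (counts == NULL)
        return -1.0;

    size_t windows = n - (size_t)m + 1;
    for (size_t start = 0; start < windows; start++) {
        size_t index = 0;                   /* window read as a base-b number */
        for (int k = 0; k < m; k++) {
            assert(digits[start + k] >= 0 && digits[start + k] < b);
            index = index * (size_t)b + (size_t)digits[start + k];
        }
        counts[index]++;
    }

    double expected = 1.0 / (double)patterns;
    double worst = 0.0;
    for (size_t p = 0; p < patterns; p++) {
        double diff = (double)counts[p] / (double)windows - expected;
        if (diff < 0.0)
            diff = -diff;
        if (diff > worst)
            worst = diff;
    }
    free(counts);
    return worst;
}
```

A value that shrinks toward zero as more digits are fed in is consistent with normality for that m; a value that stays large is evidence against it.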

The paper shows visualizations of many numbers written in base 4. The authors also suggest applying the technique to genome comparison.

Visualization

A walk for a base 4 number consists of steps: at each step the path moves one unit east, north, west or south, depending on whether the digit at that position is 0, 1, 2 or 3.
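The rule above can be sketched in C as follows. This is a minimal illustration, not code from the paper or from this repository; the `WalkPoint` type and the `walk_base4` function are names made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical point type for this sketch. */
typedef struct { long x, y; } WalkPoint;

/* Follow base-4 digits from the origin: 0 = east, 1 = north,
 * 2 = west, 3 = south. path must hold n + 1 points (origin included). */
void walk_base4(const int *digits, size_t n, WalkPoint *path)
{
    static const int dx[4] = { 1, 0, -1, 0 };
    static const int dy[4] = { 0, 1, 0, -1 };

    assert(digits != NULL && path != NULL);
    path[0].x = 0;
    path[0].y = 0;
    for (size_t i = 0; i < n; i++) {
        assert(digits[i] >= 0 && digits[i] <= 3);
        path[i + 1].x = path[i].x + dx[digits[i]];
        path[i + 1].y = path[i].y + dy[digits[i]];
    }
}
```

For the digits 0, 1, 2, 3 the walk traces a unit square and ends back at the origin.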

I've applied this visualization technique to DNA. DNA consists of four nucleotides, denoted A, T, G and C. I assigned each nucleotide a step direction: A is north, T is east, G is south and C is west. The paper uses an HSV color scheme that produces a rainbow of colors; I've instead used a fixed color for each nucleotide, which gives an idea of the nucleotide content of the sequence, something of value in biology.
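A minimal sketch of the nucleotide walk, using the mapping described here. The function names are hypothetical, and the handling of lowercase letters and of symbols other than A, T, G and C (they are accepted and skipped, respectively) is an assumption of this example; the repository's dna_potrait program may behave differently.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Apply one nucleotide step to (*x, *y): A = north, T = east,
 * G = south, C = west. Returns 1 if a step was taken, 0 otherwise.
 * Assumption: lowercase is accepted and any other symbol (for example
 * N) is skipped. */
int step_nucleotide(char base, long *x, long *y)
{
    assert(x != NULL && y != NULL);
    switch (toupper((unsigned char)base)) {
    case 'A': (*y)++; return 1;
    case 'T': (*x)++; return 1;
    case 'G': (*y)--; return 1;
    case 'C': (*x)--; return 1;
    default:  return 0;
    }
}

/* Walk a NUL-terminated sequence from the origin and return the number
 * of steps taken; the end point is written to *x and *y. */
size_t walk_sequence(const char *seq, long *x, long *y)
{
    assert(seq != NULL && x != NULL && y != NULL);
    size_t steps = 0;
    *x = 0;
    *y = 0;
    for (const char *p = seq; *p != '\0'; p++)
        steps += (size_t)step_nucleotide(*p, x, y);
    return steps;
}
```

With this mapping the sequence AATC ends two units north of the origin: two steps north, one east, one west.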

Color Scheme

  • A - #D9042B
  • T - #03588C
  • G - #F29F05
  • C - #F24B0F

I took the color scheme from Adobe Color: https://color.adobe.com/Madrid---Carabanchel-color-theme-8586545/
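One way to express the fixed color scheme in C is a simple lookup, sketched below. The function names are made up for this example, and returning NULL for symbols outside A, T, G and C is an assumption; how the actual program stores or applies the colors is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Fixed hex color for a nucleotide, using the color scheme listed above.
 * Returns NULL for any other symbol (an assumption of this sketch). */
const char *nucleotide_color(char base)
{
    switch (base) {
    case 'A': return "#D9042B";
    case 'T': return "#03588C";
    case 'G': return "#F29F05";
    case 'C': return "#F24B0F";
    default:  return NULL;
    }
}

/* Fill colors[i] with the color of seq[i] for the first n symbols. */
void color_steps(const char *seq, size_t n, const char **colors)
{
    assert(seq != NULL && colors != NULL);
    for (size_t i = 0; i < n; i++)
        colors[i] = nucleotide_color(seq[i]);
}
```

Because each step carries the color of its nucleotide, a stretch of the walk dominated by one color points to a region rich in that nucleotide.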

Dependencies

Compile

cd dna-potrait/
gcc -g -o dna_potrait dna_potrait.c

Usage

./draw_potrait.sh <path to fasta sequence file>

By default, I've assumed this folder structure:

.
├── coords
├── data
└── potraits

  • potraits contains the final portraits.
  • data contains the FASTA files.
  • coords contains the intermediate coordinate files that the C program dna_potrait generates.
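For readers unfamiliar with the input format: in a FASTA file, lines beginning with `>` are headers and the remaining lines hold the sequence, usually wrapped across several lines. The sketch below pulls the sequence out of FASTA text that is already in memory. It is an illustration only; `fasta_sequence` is a hypothetical name, and this is not necessarily how dna_potrait reads its input.

```c
#include <assert.h>
#include <stddef.h>

/* Copy the sequence characters of FASTA text into out, skipping header
 * lines (those starting with '>') and whitespace. Writes at most cap - 1
 * characters plus a terminating NUL; returns the number written. */
size_t fasta_sequence(const char *text, char *out, size_t cap)
{
    assert(text != NULL && out != NULL && cap > 0);
    size_t len = 0;
    int line_start = 1;   /* at the first character of a line? */
    int in_header = 0;    /* inside a '>' header line? */

    for (const char *p = text; *p != '\0'; p++) {
        if (*p == '\n') {
            line_start = 1;
            in_header = 0;
            continue;
        }
        if (line_start && *p == '>')
            in_header = 1;
        line_start = 0;
        if (in_header || *p == '\r' || *p == ' ' || *p == '\t')
            continue;
        if (len + 1 < cap)
            out[len++] = *p;
    }
    out[len] = '\0';
    return len;
}
```

Sequence characters that do not fit in the output buffer are silently dropped, so a caller working with real genomes would want to size the buffer from the file length first.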

Benchmarks

The script takes less than 0.5 s to generate portraits of sequences 10,000 to 20,000 nucleotides long. (For scale, the human genome is roughly 3 billion nucleotides long.) More robust benchmarks are coming soon.