Skip to content

gkarthik/dna-potrait

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Inspired by ARAGÓN ARTACHO, Francisco J., et al. "Walking on real numbers". The Mathematical Intelligencer. Vol. 35, Issue 1 (March 2013). ISSN 0343-6993, pp. 42-60

Walking on nucleotides

ARAGÓN ARTACHO, Francisco J., et al. present new ways to visualize large mathematical datasets. In particular the paper goes into trying to visualize and tell if a real number is "Normal".

”A real constant α is b-normal if, given the positive integer b ≥ 2, every m-long string of base-b digits appears in the base-b expansion of α with precisely the expected limiting frequency 1/b^m"

The paper shows visualization of many base 4 numbers. The authors also go on to suggest applying this visualization technique to genome comparisons.

Visualization

A walk for a base 4 number consists of steps, where at each step the path moves one unit east, north, west or south, depending on the whether the number at that position is 0, 1, 2 or 3.

I've applied this visualization technique to DNA. DNA consists of 4 nucleotides denoted by A, T, G and C. For each of these nucleotides, I assigned a direction to the step - North, East, South and West respectively. Instead of the HSV color scheme used in the paper, that generates a rainbow color, I've used fixed colors for each nucleotide to get an idea of the nucleotide content in the sequence which is of value in Biology.

Color Scheme

  • A - #D9042B
  • T - #03588C
  • G - #F29F05
  • C - #F24B0F

I picked up the color scheme from Adobe Color - https://color.adobe.com/Madrid---Carabanchel-color-theme-8586545/

#Dependencies

#Compile

cd dna-potrait/
gcc -g -o dna_potrait dna_potraint.c

Usage

./draw_potrait.sh <path to fasta sequence file>

By default I've assumed this folder structure,

.
├── coords
├── data
└── potraits
  • potraits contains the final potraits.
  • data contains the fasta files.
  • coords contains intermediate coordinate files that the C script dna_potrait generates.

Benchmarks

The script takes less than 0.5s to generate potraits of lengths ranging from 10000 to 20000 nucleotides(The human genome is roughly 3 billion nucleotides in length). More robust benchmarks coming soon...

Releases

No releases published

Packages

No packages published