
In [1]:
!pip install biopython==1.76




## GC content in DNA

*   GC content is the percentage of nitrogenous bases in a DNA or RNA molecule that are either guanine (G) or cytosine (C).

An A=T base pair forms 2 hydrogen bonds.
A G≡C base pair forms 3 hydrogen bonds and is therefore more stable.

--> A higher GC content indicates a relatively higher melting temperature.



*   AT content in DNA
--> An A-T base pair yields only 2 hydrogen bonds.
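
As a rough illustration of the definitions above, the following sketch computes the GC percentage and counts hydrogen bonds in plain Python, assuming 2 bonds per A-T pair and 3 per G-C pair. The function names `gc_percent` and `hydrogen_bonds` are made up for this example and are not part of Biopython.

```python
def gc_percent(seq):
    """Percentage of bases that are G or C (case-insensitive)."""
    seq = seq.upper()
    if not seq:  # avoid dividing by zero on an empty string
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq) * 100


def hydrogen_bonds(seq):
    """Hydrogen bonds in the duplex formed by seq and its complement.

    Assumes 2 bonds for every A or T and 3 bonds for every G or C.
    """
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 3 * gc


print(gc_percent("ATGATCTCGTAA"))
print(hydrogen_bonds("ATGATCTCGTAA"))  # 8 A/T bases * 2 + 4 G/C bases * 3 = 28
```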

In [3]:
import Bio

In [4]:
from Bio.Seq import Seq

In [28]:
from Bio.SeqUtils import GC


In [5]:
dna = Seq('ATGATCTCGTAA')

In [6]:
dna

Seq('ATGATCTCGTAA')

In [7]:
GC(dna)

33.333333333333336

In [10]:
# Method 2: custom function to get the GC content
# Start by counting a single base (G)
dna.count('G')

2

In [16]:
def gc_content(seq):
  result = float(seq.count('G') + seq.count('C'))/len(seq) * 100
  return result

In [17]:
gc_content(dna)

33.33333333333333

In [19]:
# Method 3: collect the G and C bases with a list comprehension (case-sensitive)
def gc_content2(seq):
  gc = [ B for B in seq if B in 'GC']
  result = float(len(gc))/len(seq) * 100
  return result

In [20]:
gc_content2(dna)

33.33333333333333

In [23]:
dna.lower()

Seq('atgatctcgtaa')

In [24]:
gc_content2(dna.lower())

0.0

In [25]:
GC(dna.lower())

33.333333333333336
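
The 0.0 above comes from `gc_content2` comparing lowercase letters against the uppercase string `'GC'`, whereas `GC` still returns 33.3 for the lowercase sequence. One possible fix, sketched here under the hypothetical name `gc_content_ci`, is to normalise the case before counting. It works on plain strings, and on `Seq` objects via `str()`.

```python
def gc_content_ci(seq):
    """GC percentage that treats 'g'/'c' the same as 'G'/'C'."""
    s = str(seq).upper()
    if not s:  # avoid ZeroDivisionError on an empty sequence
        return 0.0
    gc = sum(1 for base in s if base in "GC")
    return gc / len(s) * 100


print(gc_content_ci("ATGATCTCGTAA"))
print(gc_content_ci("atgatctcgtaa"))  # same value as the uppercase call
```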

In [26]:
#### AT content

In [35]:
def at_content(seq):
  at = [b for b in seq.upper() if b in 'AT']
  result = float(len(at))/len(seq)*100
  return result

In [36]:
at_content(dna)

66.66666666666666

In [37]:
#### melting point of DNA
# --> Tm_Wallace: 'rule of thumb'
# --> Tm_GC: empirical formulas based on GC content. Salt and mismatch corrections can be included.
# --> Tm_NN: calculation based on nearest-neighbor thermodynamics. Several tables for DNA/DNA, DNA/RNA and RNA/RNA hybridizations are included.
# Corrections for mismatches, dangling ends, salt concentration and other additives are available.

In [38]:
import Bio.SeqUtils

In [39]:
dir(Bio.SeqUtils)

['Alphabet',
 'GC',
 'GC123',
 'GC_skew',
 'IUPACData',
 'MutableSeq',
 'Seq',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 'cos',
 'molecular_weight',
 'nt_search',
 'pi',
 'print_function',
 're',
 'seq1',
 'seq3',
 'sin',
 'six_frame_translations',
 'xGC_skew']

In [40]:
from Bio.SeqUtils import MeltingTemp as mt

In [41]:
dna 

Seq('ATGATCTCGTAA')

In [42]:
GC(dna)

33.333333333333336

In [43]:
#check for melting point using wallace
mt.Tm_Wallace(dna)

32.0
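
The Wallace rule of thumb is usually stated as Tm = 2 °C × (A + T) + 4 °C × (G + C). A minimal sketch of that formula in plain Python reproduces the 32.0 above for this sequence. The name `wallace_tm` is hypothetical, and unlike `mt.Tm_Wallace` this version does no input checking and ignores ambiguous bases.

```python
def wallace_tm(seq):
    """Wallace rule: 2 degrees per A/T base plus 4 degrees per G/C base."""
    s = str(seq).upper()
    at = s.count("A") + s.count("T")
    gc = s.count("G") + s.count("C")
    return 2.0 * at + 4.0 * gc


print(wallace_tm("ATGATCTCGTAA"))  # 2*8 + 4*4 = 32.0
```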

In [44]:
#check for the melting point using gc content
mt.Tm_GC(dna)

23.32155893208184

In [48]:
seq1 = Seq('ATGCATGGTGCGCGA')
seq2 = Seq('ATTTGTGCTCCTGGA')

In [49]:
def get_metrics(seq):
  gc = GC(seq)
  at = at_content(seq)
  melting_point = mt.Tm_GC(seq)
  result = "GC:{}, AT: {},Temp:{}".format(gc,at,melting_point)
  return result

In [50]:
get_metrics(seq1)

'GC:60.0, AT: 40.0,Temp:44.254892265415165'

In [51]:
get_metrics(seq2)

'GC:46.666666666666664, AT: 53.333333333333336,Temp:38.7882255987485'

## GC skew

GC skew shows where the nucleotides G and C are over- or under-abundant in a particular region of a DNA or RNA sequence.

*   helps to indicate the leading or lagging strand of DNA
*   positive GC skew = leading strand
*   negative GC skew = lagging strand

In [53]:
from Bio.SeqUtils import GC123,GC_skew,xGC_skew

In [55]:
# GC content overall and at the 1st, 2nd and 3rd codon positions
dna

Seq('ATGATCTCGTAA')

In [57]:
GC(dna)

33.333333333333336

In [56]:
GC123(dna)

(33.333333333333336, 0.0, 25.0, 75.0)
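
To see where the last three numbers come from, here is a sketch that recomputes the per-position values in plain Python. It assumes the reading frame starts at index 0 and that only A, C, G and T are counted. The name `gc_by_codon_position` is hypothetical, and this is not how Biopython implements `GC123`.

```python
def gc_by_codon_position(seq):
    """GC percentage at codon positions 1, 2 and 3 (frame starts at index 0)."""
    s = str(seq).upper()
    result = []
    for offset in range(3):
        # every third base, starting at this codon position
        bases = [b for b in s[offset::3] if b in "ACGT"]
        gc = sum(1 for b in bases if b in "GC")
        result.append(gc / len(bases) * 100 if bases else 0.0)
    return tuple(result)


# Codons: ATG ATC TCG TAA
print(gc_by_codon_position("ATGATCTCGTAA"))  # (0.0, 25.0, 75.0)
```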

In [59]:
# GC skew per window (the default window is 100 bases, so this short sequence is a single window)
GC_skew(dna)

[0.0]
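
GC skew for a window is commonly defined as (G - C) / (G + C). The following sketch applies that formula to consecutive windows in plain Python. The name `gc_skew_windows` is hypothetical, and returning 0.0 for a window with no G or C is an assumption of this example. With the default window it reproduces the `[0.0]` above, because the sequence has two G and two C.

```python
def gc_skew_windows(seq, window=100):
    """(G - C) / (G + C) for each consecutive window of the sequence."""
    s = str(seq).upper()
    skews = []
    for start in range(0, len(s), window):
        chunk = s[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # a window with no G or C gets a skew of 0.0
        skews.append((g - c) / (g + c) if (g + c) else 0.0)
    return skews


print(gc_skew_windows("ATGATCTCGTAA"))            # one window: [0.0]
print(gc_skew_windows("ATGATCTCGTAA", window=4))  # [1.0, -1.0, 1.0]
```

A smaller window makes local G or C excess visible, which a single whole-sequence value hides.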

## Subsequences

--> Search for a DNA subsequence in a sequence; returns a list of the form [subseq, position, ...], with 0-based positions.

In [61]:
from Bio.SeqUtils import nt_search

In [64]:
main_seq = Seq('ACTATT')
subseq = Seq('ACT')

In [65]:
nt_search(str(main_seq),str(subseq))

['ACT', 0]

In [66]:
main_seq = Seq('TGCACTATT')
subseq = Seq('ACT')

In [67]:
nt_search(str(main_seq),str(subseq))

['ACT', 3]
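
For exact matches, the same kind of result can be sketched in plain Python with `str.find`, including overlapping hits. The name `find_all` is hypothetical. Unlike `nt_search`, this sketch does not expand IUPAC ambiguity codes such as `N` or `R`.

```python
def find_all(main_seq, subseq):
    """Return [subseq, pos1, pos2, ...] for every exact (possibly overlapping) match."""
    main_seq, subseq = str(main_seq), str(subseq)
    positions = []
    start = main_seq.find(subseq)
    while start != -1:
        positions.append(start)
        # move one base forward so overlapping matches are found too
        start = main_seq.find(subseq, start + 1)
    return [subseq] + positions


print(find_all("TGCACTATT", "ACT"))  # ['ACT', 3]
print(find_all("AAAA", "AA"))        # ['AA', 0, 1, 2]
```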