# Codon Usage

### Abstract:
DNA codon usage frequencies for a large sample of diverse organisms from different taxa.

### Source:
Bohdan Khomtchouk, Ph.D. University of Chicago, Department of Medicine, Section of Computational Biomedicine and Biomedical Data Science.

#### Date Donated: 2020-10-03

### Attribute Information:

- Column 1: Kingdom 
- Column 2: DNAtype 
- Column 3: SpeciesID 
- Column 4: Ncodons 
- Column 5: SpeciesName 
- Columns 6-69: codons (header: the three nucleotide bases of the codon; entries: frequency of usage, a floating-point number with five decimal places)

### Nucleotide bases

- A : Adenine
- C : Cytosine
- G : Guanine
- U : Uracil
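
Every codon is a triplet of the four bases above, which gives 4^3 = 64 combinations and accounts for the 64 codon columns (6-69) in the attribute list. A minimal sketch of that count, using only the standard library; note that `itertools.product` enumerates the codons in its own order, which is not necessarily the column order of the dataset:

```python
from itertools import product

BASES = "UCAG"  # RNA alphabet used in the column headers

# All three-letter combinations of the four bases.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

n_metadata = 5  # Kingdom, DNAtype, SpeciesID, Ncodons, SpeciesName
print(len(codons))               # 64
print(n_metadata + len(codons))  # 69, matching "Columns 6-69"
```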

## Problem Statement

### Classify species into their kingdom, using codon usage frequencies as input features, for the genomic DNA type.

#### Class labels: 
1. Virus 
2. Bacteria 
3. Plants
4. Vertebrates
5. Invertebrates
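
The `Kingdom` column of the processed file stores three-letter codes rather than the full names listed above (`vrl`, `bct`, `pln`, `vrt`, and `inv` appear in the cell outputs further down). The mapping below is an assumption inferred from those abbreviations, not something the dataset file states; `KINGDOM_NAMES` and `kingdom_name` are hypothetical helpers for readable reporting:

```python
# Assumed mapping from kingdom codes to class names (inferred from the abbreviations).
KINGDOM_NAMES = {
    "vrl": "Virus",
    "bct": "Bacteria",
    "pln": "Plants",
    "vrt": "Vertebrates",
    "inv": "Invertebrates",
}


def kingdom_name(code: str) -> str:
    """Return the readable class name for a kingdom code; raise on unknown codes."""
    try:
        return KINGDOM_NAMES[code]
    except KeyError:
        raise ValueError(f"Unknown kingdom code: {code!r}") from None


print(kingdom_name("vrl"))  # Virus
```

With pandas, the same dictionary could be passed to `Series.map`, for example `df['Kingdom'].map(KINGDOM_NAMES)`.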

## Importing libraries

In [6]:
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings(action='ignore')

## Reading dataset

In [7]:
data_dir = Path('../data')
data_path = data_dir / 'codon_usage_dataset_processed.csv'

df = pd.read_csv(data_path)

In [8]:
df.head()

,Kingdom,SpeciesID,Ncodons,SpeciesName,UUU,UUC,UUA,UUG,CUU,CUC,...,CGG,AGA,AGG,GAU,GAC,GAA,GAG,UAA,UAG,UGA
0,vrl,100217,1995,Epizootic haematopoietic necrosis virus,0.01654,0.01203,0.0005,0.00351,0.01203,0.03208,...,0.00451,0.01303,0.03559,0.01003,0.04612,0.01203,0.04361,0.00251,0.0005,0.0
1,vrl,100220,1474,Bohle iridovirus,0.02714,0.01357,0.00068,0.00678,0.00407,0.02849,...,0.00136,0.01696,0.03596,0.01221,0.04545,0.0156,0.0441,0.00271,0.00068,0.0
2,vrl,100755,4862,Sweet potato leaf curl virus,0.01974,0.0218,0.01357,0.01543,0.00782,0.01111,...,0.00596,0.01974,0.02489,0.03126,0.02036,0.02242,0.02468,0.00391,0.0,0.00144
3,vrl,100880,1915,Northern cereal mosaic virus,0.01775,0.02245,0.01619,0.00992,0.01567,0.01358,...,0.00366,0.0141,0.01671,0.0376,0.01932,0.03029,0.03446,0.00261,0.00157,0.0
4,vrl,100887,22831,Soil-borne cereal mosaic virus,0.02816,0.01371,0.00767,0.03679,0.0138,0.00548,...,0.00604,0.01494,0.01734,0.04148,0.02483,0.03359,0.03679,0.0,0.00044,0.00131


In [9]:
df.shape

(8657, 68)

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8657 entries, 0 to 8656
Data columns (total 68 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Kingdom      8657 non-null   object 
 1   SpeciesID    8657 non-null   int64  
 2   Ncodons      8657 non-null   int64  
 3   SpeciesName  8657 non-null   object 
 4   UUU          8657 non-null   float64
 5   UUC          8657 non-null   float64
 6   UUA          8657 non-null   float64
 7   UUG          8657 non-null   float64
 8   CUU          8657 non-null   float64
 9   CUC          8657 non-null   float64
 10  CUA          8657 non-null   float64
 11  CUG          8657 non-null   float64
 12  AUU          8657 non-null   float64
 13  AUC          8657 non-null   float64
 14  AUA          8657 non-null   float64
 15  AUG          8657 non-null   float64
 16  GUU          8657 non-null   float64
 17  GUC          8657 non-null   float64
 18  GUA          8657 non-null   float64
 19  GUG   

### Dividing the dataset into feature and label DataFrames

In [11]:
labels = df[['Kingdom']].copy()

In [12]:
labels.head()

,Kingdom
0,vrl
1,vrl
2,vrl
3,vrl
4,vrl


In [13]:
features = df.drop(columns=['Kingdom','SpeciesID','SpeciesName'])

In [14]:
features.head()

,Ncodons,UUU,UUC,UUA,UUG,CUU,CUC,CUA,CUG,AUU,...,CGG,AGA,AGG,GAU,GAC,GAA,GAG,UAA,UAG,UGA
0,1995,0.01654,0.01203,0.0005,0.00351,0.01203,0.03208,0.001,0.0401,0.00551,...,0.00451,0.01303,0.03559,0.01003,0.04612,0.01203,0.04361,0.00251,0.0005,0.0
1,1474,0.02714,0.01357,0.00068,0.00678,0.00407,0.02849,0.00204,0.0441,0.01153,...,0.00136,0.01696,0.03596,0.01221,0.04545,0.0156,0.0441,0.00271,0.00068,0.0
2,4862,0.01974,0.0218,0.01357,0.01543,0.00782,0.01111,0.01028,0.01193,0.02283,...,0.00596,0.01974,0.02489,0.03126,0.02036,0.02242,0.02468,0.00391,0.0,0.00144
3,1915,0.01775,0.02245,0.01619,0.00992,0.01567,0.01358,0.0094,0.01723,0.02402,...,0.00366,0.0141,0.01671,0.0376,0.01932,0.03029,0.03446,0.00261,0.00157,0.0
4,22831,0.02816,0.01371,0.00767,0.03679,0.0138,0.00548,0.00473,0.02076,0.02716,...,0.00604,0.01494,0.01734,0.04148,0.02483,0.03359,0.03679,0.0,0.00044,0.00131


### Splitting the dataset into training and test sets with stratified sampling (70-30 split)

In [15]:
train_features, test_features, train_labels, test_labels = train_test_split(features, 
                                                                            labels, 
                                                                            test_size=0.3,
                                                                            stratify=labels,
                                                                            random_state=0)
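
With 8657 rows (from `df.shape`) and `test_size=0.3`, the sizes of the two sets can be worked out by hand. The sketch below assumes that scikit-learn rounds a fractional test size up to the next whole sample and assigns the remainder to the training set:

```python
import math

n_samples = 8657  # rows in the dataset (from df.shape)
test_size = 0.3

# Assumption: the fractional test size is rounded up; the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 6059 2598
```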

In [16]:
print('====================')
print(' Train-Test Split')
print('====================')
print('')
# value_counts(normalize=True) returns each class's share of the rows
print('Overall class ratio:')
print(labels.value_counts(normalize=True))
print(' ')
print('Train set class ratio:')
print(train_labels.value_counts(normalize=True))
print(' ')
print('Test set class ratio:')
print(test_labels.value_counts(normalize=True))

 Train-Test Split

Overall class ratio:
Kingdom
bct        0.336953
vrl        0.327019
pln        0.175927
inv        0.106503
vrt        0.053598
dtype: float64
 
Train set class ratio:
Kingdom
bct        0.337019
vrl        0.326952
pln        0.175937
inv        0.106453
vrt        0.053639
dtype: float64
 
Test set class ratio:
Kingdom
bct        0.336798
vrl        0.327175
pln        0.175905
inv        0.106620
vrt        0.053503
dtype: float64
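
The overall ratios above can be turned back into approximate per-class counts by multiplying by the number of rows (8657, from `df.shape`). This is a small arithmetic sketch; the counts are reconstructed by rounding and are not read from the dataset itself:

```python
n_samples = 8657  # rows in the dataset (from df.shape)

# Overall class ratios as printed above.
overall_ratio = {
    "bct": 0.336953,
    "vrl": 0.327019,
    "pln": 0.175927,
    "inv": 0.106503,
    "vrt": 0.053598,
}

# Recover approximate counts by rounding ratio * n_samples.
counts = {code: round(ratio * n_samples) for code, ratio in overall_ratio.items()}

print(counts)                # {'bct': 2917, 'vrl': 2831, 'pln': 1523, 'inv': 922, 'vrt': 464}
print(sum(counts.values()))  # 8657
```

The smallest class (`vrt`) makes up only about 5% of the rows, so a stratified split is what keeps its share nearly identical in the training and test sets.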
