# Codon Usage

### Abstract: 
DNA codon usage frequencies for a large sample of diverse biological organisms from different taxa.

### Source:
Bohdan Khomtchouk, Ph.D. University of Chicago, Department of Medicine, Section of Computational Biomedicine and Biomedical Data Science.

#### Date Donated: 2020-10-03

### Attribute Information:

- Column 1: Kingdom 
- Column 2: DNAtype 
- Column 3: SpeciesID 
- Column 4: Ncodons 
- Column 5: SpeciesName 
- Columns 6-69: codon (header: nucleotide bases; entries: frequency of usage (5 digit floating point number)) 
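
As a quick sanity check on the layout above: columns 6 through 69 span 64 columns, which matches the 4^3 = 64 possible three-base codons over U, C, A and G. The sketch below only verifies that count; the names `BASES` and `all_codons` are illustrative, and the enumeration order is not claimed to match the column order in the file.

```python
from itertools import product

BASES = "UCAG"  # RNA bases that appear in the codon column headers

# Enumerate every three-base combination (order here is arbitrary)
all_codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

# Columns 6..69 inclusive, as listed in the attribute information
n_codon_columns = 69 - 6 + 1

print(len(all_codons), n_codon_columns)
```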

### Importing libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

import warnings
warnings.filterwarnings(action='ignore')

### Reading dataset

In [2]:
data_dir = Path('./Dataset')
data_path = data_dir / 'codon_usage.csv'

In [3]:
df = pd.read_csv(data_path)
df.head()

Unnamed: 0,Kingdom,DNAtype,SpeciesID,Ncodons,SpeciesName,UUU,UUC,UUA,UUG,CUU,...,CGG,AGA,AGG,GAU,GAC,GAA,GAG,UAA,UAG,UGA
0,vrl,0,100217,1995,Epizootic haematopoietic necrosis virus,0.01654,0.01203,0.0005,0.00351,0.01203,...,0.00451,0.01303,0.03559,0.01003,0.04612,0.01203,0.04361,0.00251,0.0005,0.0
1,vrl,0,100220,1474,Bohle iridovirus,0.02714,0.01357,0.00068,0.00678,0.00407,...,0.00136,0.01696,0.03596,0.01221,0.04545,0.0156,0.0441,0.00271,0.00068,0.0
2,vrl,0,100755,4862,Sweet potato leaf curl virus,0.01974,0.0218,0.01357,0.01543,0.00782,...,0.00596,0.01974,0.02489,0.03126,0.02036,0.02242,0.02468,0.00391,0.0,0.00144
3,vrl,0,100880,1915,Northern cereal mosaic virus,0.01775,0.02245,0.01619,0.00992,0.01567,...,0.00366,0.0141,0.01671,0.0376,0.01932,0.03029,0.03446,0.00261,0.00157,0.0
4,vrl,0,100887,22831,Soil-borne cereal mosaic virus,0.02816,0.01371,0.00767,0.03679,0.0138,...,0.00604,0.01494,0.01734,0.04148,0.02483,0.03359,0.03679,0.0,0.00044,0.00131


In [4]:
df.shape

(13028, 69)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13028 entries, 0 to 13027
Data columns (total 69 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Kingdom      13028 non-null  object 
 1   DNAtype      13028 non-null  int64  
 2   SpeciesID    13028 non-null  int64  
 3   Ncodons      13028 non-null  int64  
 4   SpeciesName  13028 non-null  object 
 5   UUU          13028 non-null  object 
 6   UUC          13028 non-null  object 
 7   UUA          13028 non-null  float64
 8   UUG          13028 non-null  float64
 9   CUU          13028 non-null  float64
 10  CUC          13028 non-null  float64
 11  CUA          13028 non-null  float64
 12  CUG          13028 non-null  float64
 13  AUU          13028 non-null  float64
 14  AUC          13028 non-null  float64
 15  AUA          13028 non-null  float64
 16  AUG          13028 non-null  float64
 17  GUU          13028 non-null  float64
 18  GUC          13028 non-null  float64
 19  GUA 

### Data Processing

#### Notice that columns 'UUU' and 'UUC' have dtype 'object', which means they contain some non-numeric values that we need to remove.

In [6]:
for i,j in zip(df['UUU'], df['UUC']):
    print('{}    {}'.format(i,j))

0.01654    0.01203
0.02714    0.01357
0.01974    0.0218
0.01775    0.02245
0.02816    0.01371
...

#### We found that 'UUU' contains the string 'non-B hepatitis virus' and 'UUC' contains '-'. Neither is a valid codon frequency, so we drop the rows that contain them.
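
Scanning a long printed listing by eye is error-prone. One possible way to surface such entries directly is to coerce each column to numeric and look at what fails to parse. The sketch below is an assumption-laden illustration: `toy` is a small made-up frame rather than rows from the dataset, and `bad_values` is a hypothetical helper name. Note that this approach would also flag genuinely missing values.

```python
import pandas as pd

# Small made-up frame that mimics the problem (not actual dataset rows)
toy = pd.DataFrame({
    "UUU": ["0.01654", "non-B hepatitis virus", "0.01974"],
    "UUC": ["0.01203", "0.01357", "-"],
})

bad_values = {}
for col in ["UUU", "UUC"]:
    # Entries that cannot be parsed as numbers become NaN
    parsed = pd.to_numeric(toy[col], errors="coerce")
    # Keep the original strings at the positions that failed to parse
    bad_values[col] = toy.loc[parsed.isna(), col].tolist()

print(bad_values)
```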

In [7]:
df.drop(df.index[df['UUU'] == 'non-B hepatitis virus'], inplace = True)

In [8]:
df.drop(df.index[df['UUC']== '-'], inplace=True)

In [9]:
df['UUU'] = df['UUU'].astype('float')
df['UUC'] = df['UUC'].astype('float')

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13026 entries, 0 to 13027
Data columns (total 69 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Kingdom      13026 non-null  object 
 1   DNAtype      13026 non-null  int64  
 2   SpeciesID    13026 non-null  int64  
 3   Ncodons      13026 non-null  int64  
 4   SpeciesName  13026 non-null  object 
 5   UUU          13026 non-null  float64
 6   UUC          13026 non-null  float64
 7   UUA          13026 non-null  float64
 8   UUG          13026 non-null  float64
 9   CUU          13026 non-null  float64
 10  CUC          13026 non-null  float64
 11  CUA          13026 non-null  float64
 12  CUG          13026 non-null  float64
 13  AUU          13026 non-null  float64
 14  AUC          13026 non-null  float64
 15  AUA          13026 non-null  float64
 16  AUG          13026 non-null  float64
 17  GUU          13026 non-null  float64
 18  GUC          13026 non-null  float64
 19  GUA 

#### Now the data types are consistent among the columns and all columns are non-null

#### Let's take a look at 'Kingdom' and 'DNAtype' columns

In [11]:
df['Kingdom'].value_counts()

bct    2919
vrl    2831
pln    2523
vrt    2077
inv    1345
mam     572
phg     220
rod     215
pri     180
arc     126
plm      18
Name: Kingdom, dtype: int64

#### The 'Kingdom' label column is imbalanced, so we keep only the five most frequent labels for our analysis

In [12]:
df['DNAtype'].value_counts()

0     9265
1     2899
2      816
4       31
12       5
3        2
9        2
5        2
11       2
6        1
7        1
Name: DNAtype, dtype: int64

#### The 'DNAtype' column is also imbalanced across its categories. To narrow our topic of interest, we will use only DNAtype 0 (genomic).
#### Hence, our problem statement is: given genomic DNA (DNAtype 0), predict the Kingdom from the codon frequencies.
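
To put a number on the imbalance, the counts printed by `value_counts()` can be turned into shares. This is a minimal sketch that hard-codes those printed counts into a Series named `dnatype_counts` (an illustrative name); on the real dataframe, `df['DNAtype'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the DNAtype value_counts() output
dnatype_counts = pd.Series(
    {0: 9265, 1: 2899, 2: 816, 4: 31, 12: 5, 3: 2, 9: 2, 5: 2, 11: 2, 6: 1, 7: 1}
)

# Fraction of rows in each DNAtype category
shares = dnatype_counts / dnatype_counts.sum()

print(shares.round(4).head(3))
```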

Let's make a copy of the dataframe to make these changes and filter the data for our problem statement

In [13]:
df2 = df.copy()

In [14]:
df2 = df2[df2.DNAtype == 0]

Now our dataframe contains only genomic (DNAtype 0) rows

In [15]:
df2

Unnamed: 0,Kingdom,DNAtype,SpeciesID,Ncodons,SpeciesName,UUU,UUC,UUA,UUG,CUU,...,CGG,AGA,AGG,GAU,GAC,GAA,GAG,UAA,UAG,UGA
0,vrl,0,100217,1995,Epizootic haematopoietic necrosis virus,0.01654,0.01203,0.00050,0.00351,0.01203,...,0.00451,0.01303,0.03559,0.01003,0.04612,0.01203,0.04361,0.00251,0.00050,0.00000
1,vrl,0,100220,1474,Bohle iridovirus,0.02714,0.01357,0.00068,0.00678,0.00407,...,0.00136,0.01696,0.03596,0.01221,0.04545,0.01560,0.04410,0.00271,0.00068,0.00000
2,vrl,0,100755,4862,Sweet potato leaf curl virus,0.01974,0.02180,0.01357,0.01543,0.00782,...,0.00596,0.01974,0.02489,0.03126,0.02036,0.02242,0.02468,0.00391,0.00000,0.00144
3,vrl,0,100880,1915,Northern cereal mosaic virus,0.01775,0.02245,0.01619,0.00992,0.01567,...,0.00366,0.01410,0.01671,0.03760,0.01932,0.03029,0.03446,0.00261,0.00157,0.00000
4,vrl,0,100887,22831,Soil-borne cereal mosaic virus,0.02816,0.01371,0.00767,0.03679,0.01380,...,0.00604,0.01494,0.01734,0.04148,0.02483,0.03359,0.03679,0.00000,0.00044,0.00131
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13017,pri,0,9597,41117,Pan paniscus,0.02342,0.02846,0.00987,0.01527,0.01542,...,0.00795,0.01459,0.01318,0.01384,0.01824,0.02196,0.03478,0.00046,0.00092,0.00158
13019,pri,0,9598,328023,Pan troglodytes,0.01380,0.02200,0.00569,0.01047,0.01147,...,0.01214,0.01339,0.01351,0.01751,0.02667,0.02191,0.04283,0.00065,0.00060,0.00165
13021,pri,0,9600,96254,Pongo pygmaeus,0.01739,0.02236,0.00887,0.01221,0.01307,...,0.00885,0.01497,0.01232,0.01886,0.02383,0.02546,0.03319,0.00149,0.00068,0.00150
13023,pri,0,9601,1097,Pongo pygmaeus abelii,0.02552,0.03555,0.00547,0.01367,0.01276,...,0.00820,0.01367,0.01094,0.01367,0.02279,0.02005,0.04102,0.00091,0.00091,0.00638


In [16]:
df2.Kingdom.value_counts()

bct    2917
vrl    2831
pln    1523
inv     922
vrt     464
phg     220
arc     126
mam     102
pri      83
rod      59
plm      18
Name: Kingdom, dtype: int64

In [17]:
df2 = df2[df2['Kingdom'].isin(['bct', 'vrl', 'pln', 'inv', 'vrt'])]

Now the dataset contains only the five Kingdoms with the most data points
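
The five labels kept here can also be derived from the counts rather than typed by hand. The sketch below hard-codes the genomic Kingdom counts printed earlier into `kingdom_counts` (an illustrative name) and picks the five largest with `Series.nlargest`; on the real dataframe, `df2['Kingdom'].value_counts().nlargest(5).index` would play the same role before filtering with `isin`.

```python
import pandas as pd

# Genomic rows per Kingdom, copied from the value_counts() output
kingdom_counts = pd.Series({
    "bct": 2917, "vrl": 2831, "pln": 1523, "inv": 922, "vrt": 464,
    "phg": 220, "arc": 126, "mam": 102, "pri": 83, "rod": 59, "plm": 18,
})

# Labels of the five most frequent Kingdoms and the rows they cover
top5 = kingdom_counts.nlargest(5).index.tolist()
rows_kept = int(kingdom_counts.nlargest(5).sum())

print(top5, rows_kept)
```

The number of rows kept agrees with the row count reported by `describe()` and `shape` after filtering.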

In [18]:
df2.describe()

Unnamed: 0,DNAtype,SpeciesID,Ncodons,UUU,UUC,UUA,UUG,CUU,CUC,CUA,...,CGG,AGA,AGG,GAU,GAC,GAA,GAG,UAA,UAG,UGA
count,8657.0,8657.0,8657.0,8657.0,8657.0,8657.0,8657.0,8657.0,8657.0,8657.0,...,8657.0,8657.0,8657.0,8657.0,8657.0,8657.0,8657.0,8657.0,8657.0,8657.0
mean,0.0,130460.084671,103179.2,0.019945,0.020836,0.013814,0.015684,0.014446,0.015746,0.0078,...,0.006725,0.011355,0.008121,0.028814,0.025561,0.030758,0.027695,0.001377,0.000598,0.001013
std,0.0,127428.106795,704530.5,0.011201,0.009112,0.013628,0.007658,0.007225,0.010625,0.005028,...,0.007331,0.008183,0.006202,0.011944,0.012523,0.014711,0.01283,0.001719,0.000811,0.00125
min,0.0,7.0,1000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,28104.0,1799.0,0.01201,0.01442,0.00327,0.01035,0.00945,0.0082,0.00388,...,0.00202,0.00433,0.00314,0.0209,0.01679,0.02078,0.01864,0.00048,0.0,0.00023
50%,0.0,76885.0,3583.0,0.01925,0.02044,0.01016,0.01516,0.01385,0.01366,0.00758,...,0.00462,0.01086,0.00671,0.02934,0.023,0.02916,0.02602,0.00106,0.00045,0.00076
75%,0.0,223926.0,12987.0,0.02703,0.02637,0.0193,0.0202,0.01878,0.02118,0.01089,...,0.00847,0.01649,0.0121,0.03675,0.03211,0.03892,0.03494,0.00191,0.00082,0.00145
max,0.0,465364.0,34132280.0,0.0866,0.07424,0.09044,0.06186,0.05168,0.09626,0.03518,...,0.05554,0.09883,0.05843,0.18566,0.11384,0.14489,0.15855,0.0452,0.0142,0.03315


In [19]:
df2

Unnamed: 0,Kingdom,DNAtype,SpeciesID,Ncodons,SpeciesName,UUU,UUC,UUA,UUG,CUU,...,CGG,AGA,AGG,GAU,GAC,GAA,GAG,UAA,UAG,UGA
0,vrl,0,100217,1995,Epizootic haematopoietic necrosis virus,0.01654,0.01203,0.00050,0.00351,0.01203,...,0.00451,0.01303,0.03559,0.01003,0.04612,0.01203,0.04361,0.00251,0.00050,0.00000
1,vrl,0,100220,1474,Bohle iridovirus,0.02714,0.01357,0.00068,0.00678,0.00407,...,0.00136,0.01696,0.03596,0.01221,0.04545,0.01560,0.04410,0.00271,0.00068,0.00000
2,vrl,0,100755,4862,Sweet potato leaf curl virus,0.01974,0.02180,0.01357,0.01543,0.00782,...,0.00596,0.01974,0.02489,0.03126,0.02036,0.02242,0.02468,0.00391,0.00000,0.00144
3,vrl,0,100880,1915,Northern cereal mosaic virus,0.01775,0.02245,0.01619,0.00992,0.01567,...,0.00366,0.01410,0.01671,0.03760,0.01932,0.03029,0.03446,0.00261,0.00157,0.00000
4,vrl,0,100887,22831,Soil-borne cereal mosaic virus,0.02816,0.01371,0.00767,0.03679,0.01380,...,0.00604,0.01494,0.01734,0.04148,0.02483,0.03359,0.03679,0.00000,0.00044,0.00131
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12043,vrt,0,96903,1190,Trichogaster trichopterus,0.01681,0.03025,0.00336,0.01176,0.01176,...,0.00756,0.01261,0.01176,0.01345,0.03697,0.01597,0.03613,0.00084,0.00168,0.00168
12049,vrt,0,98921,2665,Takifugu pardalis,0.02176,0.04353,0.00450,0.01163,0.00901,...,0.00563,0.01013,0.01088,0.01914,0.03452,0.02477,0.05216,0.00038,0.00000,0.00075
12050,vrt,0,98923,5800,Verasper moseri,0.01138,0.02483,0.00241,0.01310,0.01655,...,0.00328,0.01241,0.01155,0.01259,0.02086,0.01569,0.03914,0.00052,0.00034,0.00155
12057,vrt,0,99586,8062,Echis ocellatus,0.02270,0.01079,0.01191,0.01662,0.01240,...,0.00360,0.01898,0.00558,0.04478,0.01526,0.04317,0.02444,0.00136,0.00012,0.00112


Since every remaining row has DNAtype 0, the 'DNAtype' column is now redundant and we can drop it

In [20]:
df2 = df2.drop(columns=['DNAtype'])
df2.head()

Unnamed: 0,Kingdom,SpeciesID,Ncodons,SpeciesName,UUU,UUC,UUA,UUG,CUU,CUC,...,CGG,AGA,AGG,GAU,GAC,GAA,GAG,UAA,UAG,UGA
0,vrl,100217,1995,Epizootic haematopoietic necrosis virus,0.01654,0.01203,0.0005,0.00351,0.01203,0.03208,...,0.00451,0.01303,0.03559,0.01003,0.04612,0.01203,0.04361,0.00251,0.0005,0.0
1,vrl,100220,1474,Bohle iridovirus,0.02714,0.01357,0.00068,0.00678,0.00407,0.02849,...,0.00136,0.01696,0.03596,0.01221,0.04545,0.0156,0.0441,0.00271,0.00068,0.0
2,vrl,100755,4862,Sweet potato leaf curl virus,0.01974,0.0218,0.01357,0.01543,0.00782,0.01111,...,0.00596,0.01974,0.02489,0.03126,0.02036,0.02242,0.02468,0.00391,0.0,0.00144
3,vrl,100880,1915,Northern cereal mosaic virus,0.01775,0.02245,0.01619,0.00992,0.01567,0.01358,...,0.00366,0.0141,0.01671,0.0376,0.01932,0.03029,0.03446,0.00261,0.00157,0.0
4,vrl,100887,22831,Soil-borne cereal mosaic virus,0.02816,0.01371,0.00767,0.03679,0.0138,0.00548,...,0.00604,0.01494,0.01734,0.04148,0.02483,0.03359,0.03679,0.0,0.00044,0.00131


In [21]:
df2.shape

(8657, 68)

#### We have processed our dataset for analysis. Now, we can save this dataset into a .csv file

In [22]:
df2.to_csv('Dataset/codon_usage_dataset_processed.csv', index=False)
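
The `index=False` argument matters when the file is read back later. A small round-trip sketch on a made-up frame (`toy` and the buffer names are illustrative) shows the difference: writing the index produces an extra unnamed column on reload, while `index=False` keeps the columns unchanged.

```python
import io
import pandas as pd

# Made-up frame with a non-contiguous index, as left behind by row filtering
toy = pd.DataFrame({"Kingdom": ["vrl", "bct"], "UUU": [0.01654, 0.02714]}, index=[0, 3])

# Default behaviour: the index is written as an extra, unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```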

### Points about the Codon dataset:
1. The processed Codon Usage dataset consists of 8657 rows and 68 columns, where 'Kingdom' is the class label.
2. 'SpeciesID' is the identifier of each entry, and 'SpeciesName' holds the corresponding species name.
3. The feature values are the codon frequencies in columns 5-68, plus 'Ncodons', which is the algebraic sum of the numbers listed for the different codons in an entry of CUTG.
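
The layout described in these points suggests a simple split into labels and features. The sketch below is a hedged illustration on a made-up two-codon frame: `toy`, `meta_cols`, `X` and `y` are hypothetical names, and treating only the codon columns as features (leaving 'Ncodons' out) is one possible choice, not the notebook's stated method. It also assumes that codon frequencies in a row sum to roughly 1, which is worth checking on the real data.

```python
import pandas as pd

# Made-up frame with the same leading columns and only two codon columns
toy = pd.DataFrame({
    "Kingdom": ["vrl", "bct"],
    "SpeciesID": [1, 2],
    "Ncodons": [2000, 5000],
    "SpeciesName": ["species A", "species B"],
    "UUU": [0.6, 0.25],
    "UUC": [0.4, 0.75],
})

# Everything that is not metadata is treated as a codon frequency column
meta_cols = ["Kingdom", "SpeciesID", "Ncodons", "SpeciesName"]
codon_cols = [c for c in toy.columns if c not in meta_cols]

X = toy[codon_cols]   # feature matrix of codon frequencies
y = toy["Kingdom"]    # class label

# Per-row total of the codon frequencies (assumed to be close to 1)
row_sums = X.sum(axis=1)
print(codon_cols, row_sums.tolist())
```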

### Loading dataset

In [23]:
data_dir = Path('./Dataset')
data_path_processed = data_dir / 'codon_usage_dataset_processed.csv'

In [24]:
codon_dataset = pd.read_csv(data_path_processed)

In [25]:
codon_dataset.head()

Unnamed: 0,Kingdom,SpeciesID,Ncodons,SpeciesName,UUU,UUC,UUA,UUG,CUU,CUC,...,CGG,AGA,AGG,GAU,GAC,GAA,GAG,UAA,UAG,UGA
0,vrl,100217,1995,Epizootic haematopoietic necrosis virus,0.01654,0.01203,0.0005,0.00351,0.01203,0.03208,...,0.00451,0.01303,0.03559,0.01003,0.04612,0.01203,0.04361,0.00251,0.0005,0.0
1,vrl,100220,1474,Bohle iridovirus,0.02714,0.01357,0.00068,0.00678,0.00407,0.02849,...,0.00136,0.01696,0.03596,0.01221,0.04545,0.0156,0.0441,0.00271,0.00068,0.0
2,vrl,100755,4862,Sweet potato leaf curl virus,0.01974,0.0218,0.01357,0.01543,0.00782,0.01111,...,0.00596,0.01974,0.02489,0.03126,0.02036,0.02242,0.02468,0.00391,0.0,0.00144
3,vrl,100880,1915,Northern cereal mosaic virus,0.01775,0.02245,0.01619,0.00992,0.01567,0.01358,...,0.00366,0.0141,0.01671,0.0376,0.01932,0.03029,0.03446,0.00261,0.00157,0.0
4,vrl,100887,22831,Soil-borne cereal mosaic virus,0.02816,0.01371,0.00767,0.03679,0.0138,0.00548,...,0.00604,0.01494,0.01734,0.04148,0.02483,0.03359,0.03679,0.0,0.00044,0.00131


In [26]:
codon_dataset.shape

(8657, 68)