# A visual exploration of the size and GC content of sequenced genomes available on [NCBI](https://www.ncbi.nlm.nih.gov/)

Genome size is commonly reported in megabase pairs (Mb; 1 Mb = 1,000,000 base pairs of DNA). Genome size is extremely variable across species and is influenced by a large number of factors (if you're curious, the human genome is ~3,000 Mb, or 3 billion base pairs). The NCBI sequenced genome dataset includes genomes from three groups: eukaryotes (organisms whose cells have a nucleus, like you, ducks and ferns), prokaryotes (single-celled organisms such as bacteria) and viruses (not a Linnaean domain... [more a weird grey area](https://www.scientificamerican.com/article/are-viruses-alive-2004/)). Here I plot the size of the genomes in these three groups to compare them, and also look at the GC content of the genomes across these groups. %GC content is the percentage of DNA base pairs composed of G and C (guanine and cytosine). A high %GC is often associated with gene-rich, protein-coding regions, while a low %GC can indicate a large amount of non-coding DNA in the genome. GC content is known to be extremely variable, both within a genome and between genomes.
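
The %GC definition above can be sketched in a few lines of base R. This is only an illustration: `gc_percent` is a hypothetical helper and `toy_sequence` is a made-up string, not a sequence from the NCBI data. The sketch assumes that ambiguous bases such as `N` should be left out of the denominator.

```r
# Hypothetical helper: percentage of G and C among the called bases of a DNA string.
gc_percent <- function(dna) {
  bases <- strsplit(toupper(dna), "")[[1]]
  # Keep only unambiguous bases so that N (unknown) does not dilute the result
  called <- bases[bases %in% c("A", "C", "G", "T")]
  if (length(called) == 0) return(NA_real_)
  100 * sum(called %in% c("G", "C")) / length(called)
}

toy_sequence <- "ATGCGCGTTA"  # made-up sequence: 5 of its 10 bases are G or C
gc_percent(toy_sequence)
```

The dataset already reports %GC per genome, so a helper like this would only be needed when working from the raw sequences.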

These data are derived from [NCBI's repository of sequenced genomes](https://www.ncbi.nlm.nih.gov/home/genomes/). Through the links provided in the Kaggle dataset, one can access the entire sequence of any of these genomes, as they are publicly available (for free! thanks science!).

## Loading the data, assessment and cleanup


In [None]:
library('tidyverse')
library('ggthemes')

In [None]:
raw_eukaryote = read_csv('../input/eukaryotes.csv')
head(raw_eukaryote)

**My Own Analysis:**
**My hypothesis:** Insect genomes contain more DNA per haploid genome than Land Plant genomes.



In [None]:
# Keep only assemblies at the Chromosome or Scaffold level.
# %in% compares element-wise (|| is meant for single values) and treats NA as no match.
eukaryote = raw_eukaryote[raw_eukaryote$Level %in% c("Chromosome", "Scaffold"),]
head(eukaryote)
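
Keeping rows whose `Level` matches one of several values needs an element-wise test that yields one TRUE/FALSE per row. A minimal sketch on a made-up data frame (the `toy` table and its values are illustrative assumptions, not rows from the NCBI file), using `%in%`:

```r
# Made-up table standing in for the eukaryote data
toy <- data.frame(
  Name  = c("a", "b", "c", "d"),
  Level = c("Chromosome", "Contig", "Scaffold", NA),
  stringsAsFactors = FALSE
)

# %in% returns one logical value per row; a missing Level counts as FALSE
keep <- toy$Level %in% c("Chromosome", "Scaffold")
kept <- toy[keep, ]
kept
```

An equivalent element-wise form is `toy$Level == "Chromosome" | toy$Level == "Scaffold"`, although with `==` a missing `Level` produces `NA` rather than `FALSE`, which is one reason to prefer `%in%` here.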


*Now I keep only the Land Plants data.*

In [None]:
eukaryote_LandPlant = eukaryote[eukaryote$"Organism Groups" == "Eukaryota;Plants;Land Plants",]
head(eukaryote_LandPlant)
count(eukaryote_LandPlant)

*I do not know the typical size (Mb) of Land Plant genomes, so I will plot a preliminary histogram before the final one.*

In [None]:
# Keep sizes above 5 Mb to discard zero or spurious entries
eukaryote_LandPlant = eukaryote_LandPlant[eukaryote_LandPlant$"Size(Mb)" > 5,]


In [None]:
# Preliminary histogram before filtering further
qplot(eukaryote_LandPlant$"Size(Mb)") +
    labs(title = "Histogram of Land Plant genome size (Mb), Chromosome and Scaffold level", x = "Size (Mb)", y = "Frequency") +
    theme_fivethirtyeight()

**Because most of the genomes are smaller than 600 Mb**

I keep only the genomes smaller than 500 Mb.

In [None]:
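# Keep sizes strictly between 5 and 500 Mb (the > 5 filter repeats the earlier cell and is harmless)
eukaryote_LandPlant = eukaryote_LandPlant[eukaryote_LandPlant$"Size(Mb)" > 5,]
eukaryote_LandPlant = eukaryote_LandPlant[eukaryote_LandPlant$"Size(Mb)" < 500,]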

In [None]:
# This is the final graph for the Land Plant data:
qplot(eukaryote_LandPlant$"Size(Mb)") +
    labs(title = "Histogram of Land Plant genome size (Mb), 5 to 500 Mb, Chromosome and Scaffold level", x = "Size (Mb)", y = "Frequency") +
    theme_fivethirtyeight()

**Read the Insect data**:

In [None]:
eukaryote_Insects = eukaryote[eukaryote$"Organism Groups" == "Eukaryota;Animals;Insects",]
head(eukaryote_Insects)
count(eukaryote_Insects)

In [None]:
# Keep sizes above 5 Mb to discard zero or spurious entries
eukaryote_Insects = eukaryote_Insects[eukaryote_Insects$"Size(Mb)" > 5,]

In [None]:
# Preliminary histogram of the Insect data before filtering further
qplot(eukaryote_Insects$"Size(Mb)") +
    labs(title = "Histogram of Insect genome size (Mb), Chromosome and Scaffold level", x = "Size (Mb)", y = "Frequency") +
    theme_fivethirtyeight()

In [None]:
# Clean the data before plotting: keep sizes above 5 Mb to discard zero or spurious entries, and below 500 Mb
eukaryote_Insects = eukaryote_Insects[eukaryote_Insects$"Size(Mb)" > 5,]
eukaryote_Insects = eukaryote_Insects[eukaryote_Insects$"Size(Mb)" < 500,]
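
# The same two-step size filter is applied to both the Land Plant and the Insect
# tables, so it could be wrapped in one small function. What follows is a sketch,
# not part of the original analysis: `filter_size_range` and the `toy` table are
# hypothetical, and the default bounds simply mirror the 5 and 500 Mb cut-offs
# used in this notebook.

```r
# Hypothetical helper: keep rows whose size column lies strictly between
# `lower` and `upper`; rows with a missing size are dropped.
filter_size_range <- function(df, lower = 5, upper = 500, column = "Size(Mb)") {
  size <- df[[column]]
  df[!is.na(size) & size > lower & size < upper, , drop = FALSE]
}

# Made-up table with the same column name as the dataset
toy <- data.frame(Name = c("a", "b", "c", "d", "e"), stringsAsFactors = FALSE)
toy[["Size(Mb)"]] <- c(0, 12.5, 480, 2300, NA)

filtered <- filter_size_range(toy)
filtered
```

# The explicit `!is.na(size)` matters because indexing a data frame with an `NA`
# condition otherwise returns rows filled with `NA` instead of dropping them.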

In [None]:
# This is the final graph for the Insect data:
qplot(eukaryote_Insects$"Size(Mb)") +
    labs(title = "Histogram of Insect genome size (Mb), 5 to 500 Mb, Chromosome and Scaffold level", x = "Size (Mb)", y = "Frequency") +
    theme_fivethirtyeight()

Combine the Land Plant and Insect data

In [None]:
both_genomes = data.frame(domain = "eukaryote_Insects",
                          size_mb = eukaryote_Insects$"Size(Mb)")
both_genomes = rbind(both_genomes,
                     data.frame(domain = "eukaryote_LandPlant",
                                size_mb = eukaryote_LandPlant$"Size(Mb)"))

In [None]:
# This is what the combined data looks like:
both_genomes

In [None]:
boxplot(size_mb ~ domain, data = both_genomes,
        main = "Genome size (Mb) of Land Plants and Insects")

According to the boxplot, the hypothesis that Insects contain more DNA per haploid genome than Land Plants is not supported, at least among the genomes between 5 and 500 Mb that were kept.
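
A boxplot gives a visual comparison only. One possible way to put numbers on it is to compare group medians and run a one-sided Wilcoxon rank-sum test. The sketch below uses made-up sizes (`toy_insects` and `toy_plants` are illustrative values, not NCBI data), so its output says nothing about the real genomes. On the real data, the two vectors would be the filtered `Size(Mb)` columns.

```r
# Made-up sizes in Mb, for illustration only (not NCBI values)
toy_insects <- c(120, 180, 240, 310, 150, 200)
toy_plants  <- c(210, 350, 400, 290, 450, 380)

# Group medians give a quick numeric companion to the boxplot
group_medians <- c(insects = median(toy_insects), plants = median(toy_plants))

# One-sided Wilcoxon rank-sum test: are insect sizes shifted above plant sizes?
test_result <- wilcox.test(toy_insects, toy_plants, alternative = "greater")

group_medians
test_result$p.value
```

A large p-value here would mean the data give no evidence that insect genomes are larger. Because both groups were truncated to the 5 to 500 Mb range, any such test only describes genomes within that range.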