# **Data cleaning and sorting**
This notebook gives a summary and detailed analysis of the contents of the tab-separated text (.tsv) country files generated in [01.Data_downloading_and_transformation](./01.Data_downloading_and_transformation.ipnb).
## **Goals**
1. To understand the overall contents of the .tsv files
2. To analyse each sequence: its length and content

### **Checking the installation status of packages, installing the tidyverse package and loading the dplyr and magrittr packages**

In [1]:
getwd()

In [2]:
# Install tidyverse if it is missing, then load the packages this notebook uses.
# install.packages("tidyverse") installs tidyverse.
cat("\nChecking if needed packages are installed... 'tidyverse','dplyr' and 'magrittr'")
if (!("tidyverse" %in% rownames(installed.packages()))) {
        install.packages("tidyverse")
} else {
        cat("\nExcellent tidyverse already installed.\nproceeding with clean_up and sorting\n")
}
## load dplyr, magrittr and tools in both cases, so a fresh install is usable too
suppressMessages(library(dplyr))
suppressMessages(library(magrittr))
suppressMessages(library(tools))


Checking if needed packages are installed... 'tidyverse','dplyr' and 'magrittr'
Excellent tidyverse already installed.
proceeding with clean_up and sorting
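
The install-or-load check above can also be wrapped in a small function. The sketch below is only an illustration: `ensure_package` and its `install_missing` argument are hypothetical names, not part of the original notebook, and installation is switched off by default so the example has no side effects.

```r
# Hypothetical helper (not from the original notebook): report whether a
# package can be loaded, optionally installing it first when it is missing.
ensure_package <- function(pkg, install_missing = FALSE) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
        if (!install_missing) {
            return(FALSE)
        }
        install.packages(pkg)
    }
    requireNamespace(pkg, quietly = TRUE)
}

# "tools" and "stats" ship with R, so these calls never trigger an install
status <- vapply(c("tools", "stats"), ensure_package, logical(1))
print(status)
```

Using `requireNamespace()` avoids building the full `installed.packages()` table, which can be slow on large libraries.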


In [3]:
suppressMessages(library(dplyr));suppressMessages(library(magrittr))

### **Loading the .tsv data file to R**

In [4]:
bold_data = read.delim("/home/kibet/bioinformatics/github/co1_metaanalysis/data/input/bold_data/diptera_13012021/diptera_13012021.tsv", 
                       stringsAsFactors = FALSE, header = TRUE, na.strings = "") 
# seems to work ok. bold2.tsv does not contain any '\r' characters
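
To see what `na.strings = ""` does in the call above, here is a minimal sketch on a toy file. The file contents and the `TOY...` process IDs are made up for illustration; only the column names mimic the real BOLD export.

```r
# Toy stand-in for a BOLD .tsv file; the values are invented
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("processid\tcountry\tgenus_name",
             "TOY001-21\tKenya\tGlossina",
             "TOY002-21\tUganda\t",
             "TOY003-21\t\tAnopheles"), tsv_path)

toy <- read.delim(tsv_path, stringsAsFactors = FALSE, header = TRUE,
                  na.strings = "")

# Empty fields arrive as NA rather than "", so is.na() finds them
print(colSums(is.na(toy)))
```

Reading empty fields as `NA` is what lets later filters such as `!is.na(nucleotides)` work as intended.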

### **The Overall look of the data**

In [5]:
str(bold_data)

'data.frame':	81302 obs. of  80 variables:
 $ processid                 : chr  "GBDP32377-19" "GBDP32805-19" "GBDP32808-19" "GBDP32817-19" ...
 $ sampleid                  : chr  "MF446534" "KX853282" "KX853251" "KX853243" ...
 $ recordID                  : int  9905930 9906358 9906361 9906370 9906371 9906373 9906469 9906470 9906472 9906578 ...
 $ catalognum                : chr  NA NA NA NA ...
 $ fieldnum                  : chr  "ZFMK_D190" NA NA NA ...
 $ institution_storing       : chr  "Mined from GenBank, NCBI" "Mined from GenBank, NCBI" "Mined from GenBank, NCBI" "Mined from GenBank, NCBI" ...
 $ collection_code           : logi  NA NA NA NA NA NA ...
 $ bin_uri                   : chr  "BOLD:AAZ6524" NA NA "BOLD:AED2981" ...
 $ phylum_taxID              : int  20 20 20 20 20 20 20 20 20 20 ...
 $ phylum_name               : chr  "Arthropoda" "Arthropoda" "Arthropoda" "Arthropoda" ...
 $ class_taxID               : int  82 82 82 82 82 82 82 82 82 82 ...
 $ class_name            

In [6]:
print(names(bold_data))

 [1] "processid"                  "sampleid"                  
 [3] "recordID"                   "catalognum"                
 [5] "fieldnum"                   "institution_storing"       
 [7] "collection_code"            "bin_uri"                   
 [9] "phylum_taxID"               "phylum_name"               
[11] "class_taxID"                "class_name"                
[13] "order_taxID"                "order_name"                
[15] "family_taxID"               "family_name"               
[17] "subfamily_taxID"            "subfamily_name"            
[19] "genus_taxID"                "genus_name"                
[21] "species_taxID"              "species_name"              
[23] "subspecies_taxID"           "subspecies_name"           
[25] "identification_provided_by" "identification_method"     
[27] "identification_reference"   "tax_note"                  
[29] "voucher_status"             "tissue_type"               
[31] "collection_event_id"        "collectors"         

### **Countries represented**
Note: Countries include Atlantic Ocean (137), Costa Rica (1291), Indian Ocean (3), India (1), Israel (27) and United States (2), giving 1461 in total.

In [7]:
cat(length(unique(bold_data$country)), "countries are represented in bold_data: ")
cat(c(unique(bold_data$country)), sep=";")
as.data.frame(table(c(bold_data$country)))

52 countries are represented in bold_data: Algeria;Angola;Benin;Botswana;Burkina Faso;Burundi;Cameroon;Cape Verde;Central African Republic;Comoros;Cote d'Ivoire;Democratic Republic of the Congo;Djibouti;Egypt;Equatorial Guinea;Ethiopia;Gabon;Gambia;Ghana;Guinea;Guinea-Bissau;Kenya;Lesotho;Costa Rica;Liberia;Libya;Madagascar;Malawi;Mali;Mauritania;Mauritius;Morocco;Mozambique;Namibia;Niger;Nigeria;Republic of the Congo;Reunion;Rwanda;Sao Tome and Principe;Senegal;Seychelles;Sierra Leone;South Africa;Sudan;Swaziland;Tanzania;Togo;Tunisia;Uganda;Zambia;Zimbabwe

Var1,Freq
<fct>,<int>
Algeria,245
Angola,9
Benin,232
Botswana,37
Burkina Faso,316
Burundi,105
Cameroon,230
Cape Verde,20
Central African Republic,48
Comoros,18
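
The frequency table above is ordered alphabetically. A minimal base R sketch for ranking countries by record count instead, using a made-up `country` vector in place of `bold_data$country`:

```r
# Toy country column; in the notebook this would be bold_data$country
country <- c("Kenya", "Kenya", "Benin", NA, "Kenya", "Benin", "Angola")

# table() drops NA by default; useNA = "ifany" keeps missing countries visible
counts <- as.data.frame(table(country, useNA = "ifany"),
                        stringsAsFactors = FALSE)

# Largest counts first; ties keep their alphabetical order
counts <- counts[order(-counts$Freq), ]
print(counts)
```

Keeping the `NA` row makes it easy to check that the counts add up to the number of records.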


## **Loading sample data exclusively from East Africa:**
1. Kenya
2. Tanzania
3. Uganda
4. Rwanda
5. Burundi
6. South Sudan
7. Ethiopia
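
The notebook reads the East African records from a separate file. As a hedged alternative, the same kind of subset could be derived from a larger table by matching on the country list above; the `toy` data frame and its IDs below are invented for the example.

```r
# Countries named in the list above
east_africa <- c("Kenya", "Tanzania", "Uganda", "Rwanda", "Burundi",
                 "South Sudan", "Ethiopia")

# Toy records; in the notebook this would be the full bold_data frame
toy <- data.frame(processid = paste0("TOY", 1:5),
                  country = c("Kenya", "Ghana", "Uganda", NA, "Ethiopia"),
                  stringsAsFactors = FALSE)

# %in% returns FALSE (never NA) for a missing country, so such rows are dropped
ea_data <- toy[toy$country %in% east_africa, ]
print(ea_data)
```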

### **Looking at the summary of the sample_dataframe.**

In [9]:
sample_data = read.delim("../data/input/test_data/bold.tsv", 
                         stringsAsFactors = FALSE, header = TRUE, na.strings = "") 
str(sample_data)

'data.frame':	41238 obs. of  80 variables:
 $ processid                 : chr  "ACRJP031-09" "ACRJP194-09" "ACRJP221-09" "ACRJP419-10" ...
 $ sampleid                  : chr  "BC-MNHNJP0313" "BC-MNHNJP0536" "BC-MNHNJP0563" "BC-MNHNJP0856" ...
 $ recordID                  : int  1134352 1134515 1134542 1608687 1608835 2496776 2506330 2506373 2506395 2508075 ...
 $ catalognum                : chr  NA NA NA NA ...
 $ fieldnum                  : chr  "BC-MNHNJP0313" "BC-MNHNJP0536" "BC-MNHNJP0563" "BC-MNHNJP0856" ...
 $ institution_storing       : chr  "Research Collection of Dominique Bernaud" "Museum National d'Histoire Naturelle, Paris" "Museum National d'Histoire Naturelle, Paris" "Research Collection of Dominique Bernaud" ...
 $ collection_code           : logi  NA NA NA NA NA NA ...
 $ bin_uri                   : chr  "BOLD:AAC9562" "BOLD:AAE8435" "BOLD:AAD8963" "BOLD:AAE0506" ...
 $ phylum_taxID              : int  20 20 20 20 20 20 20 20 20 20 ...
 $ phylum_name               : chr

### Copeland's data

In [8]:
# Distinct spellings of collector names that mention "Copeland"
cat(unique(bold_data$collectors[grep("Copeland", bold_data$collectors)]), sep = " | ")
as.data.frame(table(bold_data$collectors[grep("Copeland", bold_data$collectors)]))
as.data.frame(table(bold_data$collectors))
# Keep every record whose collectors field mentions "Copeland"
cop_data <- bold_data[grepl("Copeland", bold_data$collectors), ]
#str(cop_data)
nrow(cop_data)
as.data.frame(table(c(cop_data$marker_codes)))
as.data.frame(table(c(cop_data$order_name)))
#as.data.frame(table(c(cop_data$lat)))
nrow(subset(cop_data, !is.na(lat) & marker_codes == "COI-5P" & !is.na(genus_name)))
as.data.frame(table(c((subset(cop_data, !is.na(lat) & marker_codes == "COI-5P" & !is.na(genus_name)))$order_name)))

R.Copeland | RS Copeland | R. Copeland | R.S. Copeland | R Copeland | R.S.Copeland | R. S. Copeland | Robert Copeland | J. Bukhebi & RS Copeland | R S Copeland | J.Bukhebi & R S Copeland | R S.Copeland | RS  Copeland | R.S Copeland | J.Bukhebi & R.S Copeland | J.Bukhebi & RS Copeland | Bob Copeland

Var1,Freq
<fct>,<int>
Bob Copeland,10
J. Bukhebi & RS Copeland,895
J.Bukhebi & R S Copeland,41
J.Bukhebi & R.S Copeland,16
J.Bukhebi & RS Copeland,83
R Copeland,35
R S Copeland,196
R S.Copeland,19
R. Copeland,526
R. S. Copeland,144


Var1,Freq
<fct>,<int>
. Onbekend,9
0,1
0G 1984GR00402,1
2016 Allendale Class,41
2rd year UKZN students,12
989 m,1
A Barbet,1
A Bok,1
A Eicker,1
A Fotie,4


Var1,Freq
<fct>,<int>
COI-5P,6795


Var1,Freq
<fct>,<int>
Araneae,2
Blattodea,39
Coleoptera,1034
Diptera,2154
Hemiptera,524
Hymenoptera,1558
Lepidoptera,1612
Mantodea,2
Neuroptera,16
Orthoptera,18


Var1,Freq
<fct>,<int>
Diptera,21
Hymenoptera,27
Lepidoptera,670
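
The collector field lists many spellings that appear to refer to one person. The sketch below shows one way such variants could be collapsed under a single label. Treating all of these spellings as the same collector is an assumption, and the canonical label `"R.S. Copeland"` is chosen here purely for illustration.

```r
# A few of the spellings listed above, plus an unrelated collector
collectors <- c("R.Copeland", "RS Copeland", "R. S. Copeland",
                "Bob Copeland", "J. Bukhebi & RS Copeland", "A Fotie", NA)

# grepl() returns FALSE for NA, so missing collectors never match
is_copeland <- grepl("Copeland", collectors, fixed = TRUE)

# Hypothetical rule: relabel solo entries only, leave joint collections as-is
has_cocollector <- grepl("&", collectors, fixed = TRUE)
label <- ifelse(is_copeland & !has_cocollector, "R.S. Copeland", collectors)
print(table(label, useNA = "ifany"))
```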


In [8]:
cat(length(unique(sample_data$markercode)),
    "markers are represented in the East African data set")
#unique(sample_data$markercode) # filtering rows corresponding to COI-5P markers
as.data.frame(table(c(sample_data$markercode)))

ERROR: Error in unique(sample_data$markercode): object 'sample_data' not found


### **Phyla represented in the bold data**
Despite our focus on the phylum Arthropoda, additional records from 38 phyla were downloaded.

In [9]:
cat(length(unique(bold_data$phylum_name)), "phyla are represented in bold_data: ")
as.data.frame(table(c(bold_data$phylum_name)))
cat(length(unique(sample_data$phylum_name)), "phyla are represented in East African data: ")
as.data.frame(table(c(sample_data$phylum_name)))

1 phyla are represented in bold_data: 

Var1,Freq
<fct>,<int>
Arthropoda,81302


ERROR: Error in unique(sample_data$phylum_name): object 'sample_data' not found


### **Focusing on the phylum Arthropoda**
**1. Number of records from African Countries**

In [10]:
arthropoda_data1 = subset(bold_data, phylum_name == "Arthropoda")
cat("bold_data have", nrow(arthropoda_data1), "arthropoda records out of", 
    nrow(bold_data), "records in the raw bold_data")
arthropoda_data = subset(bold_data, phylum_name == "Arthropoda" )#& country != "United States")
cat("\nbold_data have", nrow(arthropoda_data), "arthropoda records out of", 
    nrow(bold_data), "records in the raw bold data")
arthropoda_data -> bold_dataframe
sample_dataframe = subset(sample_data, phylum_name == "Arthropoda")

bold_data have 81302 arthropoda records out of 81302 records in the raw bold_data
bold_data have 81302 arthropoda records out of 81302 records in the raw bold data

ERROR: Error in subset(sample_data, phylum_name == "Arthropoda"): object 'sample_data' not found


**2. Here are the African countries represented in the arthropod data**

In [11]:
cat(length(unique(bold_dataframe$country)), "countries are represented in bold_dataframe: ")
#cat(c(unique(bold_dataframe$country)), sep=";")
as.data.frame(table(c(bold_dataframe$country)))

52 countries are represented in bold_dataframe: 

Var1,Freq
<fct>,<int>
Algeria,245
Angola,9
Benin,232
Botswana,37
Burkina Faso,316
Burundi,105
Cameroon,230
Cape Verde,20
Central African Republic,48
Comoros,18


### **What looks interesting?**
#### **1. Taxonomy**
**Taking a deeper look at the taxa variables: phylum, class, order and family**

In [12]:
unique(bold_dataframe$phylum_name)

### **Classes represented in the bold_dataframe**

In [13]:
cat("African arthropod data is distributed in taxa classes as follows:")
as.data.frame(table(c(bold_dataframe$class_name)))
cat("East African arthropod data is distributed in taxa classes as follows:")
as.data.frame(table(c(sample_dataframe$class_name)))

African arthropod data is distributed in taxa classes as follows:

Var1,Freq
<fct>,<int>
Insecta,81302


East African arthropod data is distributed in taxa classes as follows:

ERROR: Error in table(c(sample_dataframe$class_name)): object 'sample_dataframe' not found


#### **Orders represented**

In [14]:
cat(length(unique(bold_dataframe$order_name)), 
    "orders are indicated in African arthropod data: ")
cat(unique(bold_dataframe$order_name),sep=";","\n\n")
cat(length(unique(sample_dataframe$order_name)), 
    "orders are indicated in East African arthropod data: ")
cat(unique(sample_dataframe$order_name),sep=";")

1 orders are indicated in African arthropod data: Diptera;



ERROR: Error in unique(sample_dataframe$order_name): object 'sample_dataframe' not found


#### **Families represented**

In [15]:
cat(length(unique(bold_dataframe$family_name)), 
    "families are indicated in the African Data: ")
cat(unique(bold_dataframe$family_name),sep=";","\n\n")
cat(length(unique(sample_dataframe$family_name)), 
    "families are indicated in the East African Data: ")
cat(unique(sample_dataframe$family_name),sep=";")

87 families are indicated in the African Data: Syrphidae;Ceratopogonidae;Piophilidae;Polleniidae;Psychodidae;Culicidae;Tephritidae;Conopidae;Muscidae;NA;Calliphoridae;Drosophilidae;Simuliidae;Stratiomyidae;Mydidae;Asilidae;Anthomyiidae;Glossinidae;Sarcophagidae;Mycetophilidae;Phoridae;Diopsidae;Tabanidae;Hippoboscidae;Hybotidae;Platystomatidae;Micropezidae;Sphaeroceridae;Chironomidae;Cecidomyiidae;Sciaridae;Chloropidae;Agromyzidae;Tachinidae;Sepsidae;Milichiidae;Ephydridae;Dolichopodidae;Scatopsidae;Chyromyidae;Bombyliidae;Asteiidae;Chamaemyiidae;Carnidae;Therevidae;Pipunculidae;Scenopinidae;Heleomyzidae;Limoniidae;Ulidiidae;Fanniidae;Canacidae;Oestridae;Empididae;Rhiniidae;Rhinophoridae;Lonchaeidae;Scathophagidae;Odiniidae;Keroplatidae;Cryptochetidae;Curtonotidae;Lauxaniidae;Periscelididae;Corethrellidae;Pyrgotidae;Platypezidae;Chaoboridae;Vermileonidae;Nemestrinidae;Mesembrinellidae;Neriidae;Bibionidae;Tipulidae;Marginidae;Clusiidae;Psilidae;Lygistorrhinidae;Mythicomyiidae;Thaumaleid

ERROR: Error in unique(sample_dataframe$family_name): object 'sample_dataframe' not found
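
Note that `unique()` treats `NA` as a value of its own, so the family list above includes `NA` and the count includes it too. A minimal sketch of the difference, on a made-up `family_name` vector:

```r
# Toy family column with missing values, mimicking bold_dataframe$family_name
family_name <- c("Syrphidae", "Culicidae", NA, "Syrphidae", NA, "Muscidae")

n_with_na    <- length(unique(family_name))           # NA counted as a value
n_without_na <- length(unique(na.omit(family_name)))  # only named families

cat(n_with_na, "unique values, of which", n_without_na, "are family names\n")
```

The same applies to the genus, species, copyright and GenBank accession counts in the cells that follow.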


### **Genera and species names featured**

In [16]:
#genera
cat(length(unique(bold_dataframe$genus_name)), "genus_names")
#species
cat(" and ", length(unique(bold_dataframe$species_name)), 
    "species_names are featured in the African arthropod data set\n\n")
#genera
cat(length(unique(sample_dataframe$genus_name)), "genus_names")
#species
cat(" and ", length(unique(sample_dataframe$species_name)), 
    "species_names are featured in the East African arthropod data set")

476 genus_names and  1283 species_names are featured in the African arthropod data set



ERROR: Error in unique(sample_dataframe$genus_name): object 'sample_dataframe' not found


### **Identify the container projects from which the data sets come. Try using the copyright\* fields**

In [17]:
cat(length(unique(bold_dataframe$copyright_institution)),
    "copyright institutions are featured in the African arthropod data set\n")
#output a list with so many missing values, **NOT IDEAL** for use.
cat(length(unique(sample_dataframe$copyright_institution)),
    "copyright institutions are featured in the East African arthropod data set ")

21 copyright institutions are featured in the African arthropod data set


ERROR: Error in unique(sample_dataframe$copyright_institution): object 'sample_dataframe' not found


In [18]:
cat(length(unique(bold_dataframe$copyright_holders)),
    "copyright holders are featured in the African arthropod data set\n")
#output with so many missing values "NA"
cat(length(unique(sample_dataframe$copyright_holders)),
    "copyright holders are featured in the East African arthropod data set")

25 copyright holders are featured in the African arthropod data set


ERROR: Error in unique(sample_dataframe$copyright_holders): object 'sample_dataframe' not found


### **Taking a look at the markercode field.**

In [19]:
cat(length(unique(bold_dataframe$markercode)),
    "markers are represented in the African data set")
as.data.frame(table(c(bold_dataframe$markercode)))
#unique(bold_dataframe$markercode) # filtering rows corresponding to COI-5P markers
cat(length(unique(sample_dataframe$markercode)),
    "markers are represented in the East African data set")
as.data.frame(table(c(sample_dataframe$markercode)))
#unique(sample_dataframe$markercode) # filtering rows corresponding to COI-5P markers

16 markers are represented in the African data set

Var1,Freq
<fct>,<int>
12S,3
16S,3
18S,2
28S,3
AATS,3
CAD,5
CAD4,1
COI-3P,892
COI-5P,77892
COI-PSEUDO,21


ERROR: Error in unique(sample_dataframe$markercode): object 'sample_dataframe' not found


#### **GenBank accession numbers**

In [20]:
cat("There are ",length(unique(bold_dataframe$genbank_accession)), 
    "genbank accession numbers in the African arthropod records out of ",
    nrow(bold_dataframe),"records.\n")
cat("There are ",length(unique(sample_dataframe$genbank_accession)), 
    "genbank accession numbers in the East African arthropod records out of ",
    nrow(sample_dataframe),"records.")

There are  8027 genbank accession numbers in the African arthropod records out of  81302 records.


ERROR: Error in unique(sample_dataframe$genbank_accession): object 'sample_dataframe' not found


#### **Cleaning up the dataset to retain only COI-5P sequences**

**1. Removing records from classes other than Insecta and markers other than COI-5P**

In [25]:
COI_Insect_Afrodata = subset(
    bold_dataframe, class_name == "Insecta" & markercode == "COI-5P" & !is.na(nucleotides) )
cat("African data set has", nrow(COI_Insect_Afrodata),"Insecta records out of", 
    nrow(bold_dataframe), "African arthropod records\n\n")
COI_Insect_EAfrodata = subset(
    sample_dataframe, markercode == "COI-5P" & !is.na(nucleotides) & class_name == "Insecta")
cat("Insect COI-5P marker sequences are ",nrow(COI_Insect_EAfrodata)," out of ", 
    nrow(sample_dataframe), "sequences in the East African bold data")
as.data.frame(table(c((COI_Insect_Afrodata)$class_name)))
as.data.frame(table(c((COI_Insect_Afrodata)$order_name)))

African data set has 77892 Insecta records out of 81302 African arthropod records



ERROR: Error in subset(sample_dataframe, markercode == "COI-5P" & !is.na(nucleotides) & : object 'sample_dataframe' not found
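
The filter in this step combines three conditions. The sketch below applies the same conditions to an invented `toy` data frame and tallies how many rows fail each one, which can help explain where records are lost.

```r
toy <- data.frame(
    class_name  = c("Insecta", "Insecta", "Arachnida", "Insecta", NA),
    markercode  = c("COI-5P", "COI-3P", "COI-5P", "COI-5P", "COI-5P"),
    nucleotides = c("ACGT", "ACGT", "ACGT", NA, "ACGT"),
    stringsAsFactors = FALSE)

# subset() keeps rows where the condition is TRUE and silently drops rows
# where it evaluates to NA (here the record with a missing class_name)
kept <- subset(toy, class_name == "Insecta" & markercode == "COI-5P" &
                    !is.na(nucleotides))

# Tally the reasons for exclusion; a row can fail more than one test
dropped <- c(other_class   = sum(toy$class_name != "Insecta", na.rm = TRUE),
             class_missing = sum(is.na(toy$class_name)),
             other_marker  = sum(toy$markercode != "COI-5P"),
             no_sequence   = sum(is.na(toy$nucleotides)))
print(nrow(kept))
print(dropped)
```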


In [26]:
as.data.frame(table(c((subset(COI_Insect_Afrodata, order_name == "Diptera"))$family_name)))

Var1,Freq
<fct>,<int>
Acroceridae,4
Agromyzidae,759
Anisopodidae,2
Anthomyiidae,255
Asilidae,246
Asteiidae,31
Athericidae,1
Bibionidae,104
Blephariceridae,27
Bombyliidae,85


In [27]:
as.data.frame(table(c((subset(COI_Insect_Afrodata, order_name == "Lepidoptera"))$family_name)))

Freq
<int>


In [28]:
nonInsecta_data = subset(bold_dataframe, class_name != "Insecta")
cat (nrow(nonInsecta_data), "records in non-insecta classes \nclasses: ",
     unique(nonInsecta_data$class_name))
as.data.frame(table(c((nonInsecta_data)$class_name)))
cat("The order taxa represented in Arachnida class are:")
as.data.frame(table(c((subset(bold_dataframe, class_name == "Arachnida"))$order_name)))
cat("\n",nrow(subset(bold_dataframe, class_name == "Malacostraca")), "Malacostraca: orders;",
    unique((subset(bold_dataframe, class_name == "Malacostraca")$order_name)))
cat("\n",nrow(subset(bold_dataframe, class_name == "Diplopoda")), "Diplopoda: orders;",
    unique((subset(bold_dataframe, class_name == "Diplopoda")$order_name)))
cat("\n",nrow(subset(bold_dataframe, class_name == "Branchiopoda")), "Branchiopoda: orders;",
    unique((subset(bold_dataframe, class_name == "Branchiopoda")$order_name)))
cat("\n",nrow(subset(bold_dataframe, class_name == "Ostracoda")), "Ostracoda: orders;"
    ,unique((subset(bold_dataframe, class_name == "Ostracoda")$order_name)))

0 records in non-insecta classes 
classes:  

Freq
<int>


The order taxa represented in Arachnida class are:

Freq
<int>



 0 Malacostraca: orders; 
 0 Diplopoda: orders; 
 0 Branchiopoda: orders; 
 0 Ostracoda: orders; 

## **Focusing on our sample data (East African data set)**
#### **Analysing nucleotide sequences (nucleotides)**

In [29]:
typeof(COI_Insect_EAfrodata$nucleotides)

ERROR: Error in typeof(COI_Insect_EAfrodata$nucleotides): object 'COI_Insect_EAfrodata' not found


**1. Introducing a field "seqlen1" that holds the length of each COI-5P nucleotide sequence**

In [29]:
COI_Insect_EAfrodata %>% mutate(seqlen1 = nchar(nucleotides)) -> resulting_dataframe1
## "%>%" passes the left-hand result to the next function, much like the pipe "|" in bash

ERROR: Error in eval(lhs, parent, parent): object 'COI_Insect_EAfrodata' not found


**2. List all characters present in the nucleotide sequences**

In [30]:
unique(unlist(strsplit(resulting_dataframe1$nucleotides, "", fixed = TRUE)), incomparables = FALSE)
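
Listing the characters is more informative with counts attached. A minimal sketch on invented sequences, which also flags everything that is not one of the four plain bases:

```r
# Toy sequences; in the notebook these would come from the nucleotides column
nucleotides <- c("ACGT--ACGT", "acgtNNacgt", "ACGTRYACGT")

# Split every sequence into single characters and tabulate them
chars <- unlist(strsplit(toupper(nucleotides), "", fixed = TRUE))
char_counts <- table(chars)
print(char_counts)

# Anything outside A, C, G, T: gaps, N, or IUPAC ambiguity codes
non_acgt <- setdiff(names(char_counts), c("A", "C", "G", "T"))
print(non_acgt)
```

Converting to upper case first is a choice made for this sketch, so that `a` and `A` are counted together.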


**3. Number of nucleotide sequences with '-' characters in them**

In [31]:
length(grep( '-',resulting_dataframe1$nucleotides, value= TRUE))

ERROR: Error in grep("-", resulting_dataframe1$nucleotides, value = TRUE): object 'resulting_dataframe1' not found


**4. Removing '-' characters from nucleotide sequences and creating a field of unaligned nucleotide sequences (unaligned_nucleotides)**

In [32]:
# Inside mutate() the column can be named directly; fixed = TRUE treats '-' literally
resulting_dataframe1 %>% mutate(
    unaligned_nucleotides = gsub('-', '', nucleotides, fixed = TRUE)) -> resulting_dataframe2

ERROR: Error in eval(lhs, parent, parent): object 'resulting_dataframe1' not found


In [33]:
# Number of nucleotide sequences with '-' characters in them after removal
length(grep( '-',resulting_dataframe2$unaligned_nucleotides, value= TRUE))

ERROR: Error in grep("-", resulting_dataframe2$unaligned_nucleotides, value = TRUE): object 'resulting_dataframe2' not found


**5. Introducing a field seqlen2 with the number of nucleotides in the unaligned_nucleotides field**

In [34]:
resulting_dataframe2 %>% mutate(seqlen2 = nchar(unaligned_nucleotides)) -> resulting_dataframe3

ERROR: Error in eval(lhs, parent, parent): object 'resulting_dataframe2' not found


In [35]:
str(resulting_dataframe3)

ERROR: Error in str(resulting_dataframe3): object 'resulting_dataframe3' not found
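
Steps 1 to 5 can be traced on a small example. The sketch below uses base R so that it runs without dplyr, and the `toy` data frame and the extra `gaps_removed` column are invented for illustration.

```r
toy <- data.frame(processid = c("TOY1", "TOY2", "TOY3"),
                  nucleotides = c("ACGT--ACGT", "ACGTACGT", "--AC-GT--"),
                  stringsAsFactors = FALSE)

toy$seqlen1 <- nchar(toy$nucleotides)              # length including gaps
toy$unaligned_nucleotides <- gsub("-", "", toy$nucleotides, fixed = TRUE)
toy$seqlen2 <- nchar(toy$unaligned_nucleotides)    # length without gaps

# The difference is the number of gap characters removed from each sequence
toy$gaps_removed <- toy$seqlen1 - toy$seqlen2
print(toy[, c("processid", "seqlen1", "seqlen2", "gaps_removed")])
```

Sequences without gaps keep the same length in both fields, so any difference between `seqlen1` and `seqlen2` comes from alignment gaps alone.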


#### **Understanding the distribution of seqlen1 versus seqlen2**

In [36]:
boxplot(resulting_dataframe1$seqlen1, resulting_dataframe3$seqlen2)

ERROR: Error in boxplot(resulting_dataframe1$seqlen1, resulting_dataframe3$seqlen2): object 'resulting_dataframe1' not found


In [37]:
hist(resulting_dataframe3$seqlen1); hist(resulting_dataframe3$seqlen2)

ERROR: Error in hist(resulting_dataframe3$seqlen1): object 'resulting_dataframe3' not found


ERROR: Error in hist(resulting_dataframe3$seqlen2): object 'resulting_dataframe3' not found
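
The two length fields can also be compared numerically rather than only by plot. A minimal sketch with invented lengths; `fivenum()` returns the minimum, lower hinge, median, upper hinge and maximum, and `boxplot()` builds its box from the same hinges and median.

```r
# Toy lengths standing in for seqlen1 (with gaps) and seqlen2 (gaps removed)
seqlen1 <- c(658, 658, 660, 700, 1500, 420)
seqlen2 <- c(658, 650, 655, 690, 1490, 400)

length_summary <- rbind(seqlen1 = fivenum(seqlen1), seqlen2 = fivenum(seqlen2))
colnames(length_summary) <- c("min", "lower_hinge", "median",
                              "upper_hinge", "max")
print(length_summary)
```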


### **Sorting the data based on the nucleotide length of sequences**

1. **Generating a file with all 'COI-5P' sequences**

In [38]:
resulting_dataframe3 -> COI_all_data
cat(length(COI_all_data$unaligned_nucleotides),"sequences have 'COI-5P' marker")

ERROR: Error in eval(expr, envir, enclos): object 'resulting_dataframe3' not found


2. **Introducing a filter to remove sequences with fewer than 500 nucleotides**

In [39]:
COI_all_data %>% filter(seqlen2 >= 500 ) -> COI_Over499_data
cat(length(COI_Over499_data$unaligned_nucleotides),"sequences have 500 or more bases")

ERROR: Error in eval(lhs, parent, parent): object 'COI_all_data' not found


3. **Introducing a filter to remove any sequence with fewer than 500 or more than 700 nucleotides**

In [40]:
COI_all_data %>% filter(seqlen2 >= 500 & seqlen2 <= 700) -> COI_500to700_data
cat(length(COI_500to700_data$unaligned_nucleotides),"sequences have from 500 to 700 bases")

ERROR: Error in eval(lhs, parent, parent): object 'COI_all_data' not found


In [41]:
hist(COI_500to700_data$seqlen2)

ERROR: Error in hist(COI_500to700_data$seqlen2): object 'COI_500to700_data' not found


4. **Introducing a filter to remove any sequence with fewer than 650 or more than 660 nucleotides**

In [42]:
COI_all_data %>% filter(seqlen2 >= 650 & seqlen2 <= 660) -> COI_650to660_data
cat(length(COI_650to660_data$unaligned_nucleotides),"sequences have from 650 to 660 bases")

ERROR: Error in eval(lhs, parent, parent): object 'COI_all_data' not found


5. **Introducing a filter to remove any sequence with 500 or more nucleotides**

In [43]:
COI_all_data %>% filter(seqlen2 < 500) -> COI_Under500_data
cat(length(COI_Under500_data$unaligned_nucleotides),"sequences have less than 500 bases")

ERROR: Error in eval(lhs, parent, parent): object 'COI_all_data' not found


6. **Introducing a filter to remove any sequence with 700 or fewer nucleotides**

In [44]:
COI_all_data %>% filter(seqlen2 > 700) -> COI_Over700_data
cat(length(COI_Over700_data$unaligned_nucleotides),"sequences have more than 700 bases")

ERROR: Error in eval(lhs, parent, parent): object 'COI_all_data' not found
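
The under-500, 500-to-700 and over-700 filters partition the data, so they can also be expressed as a single classification. The sketch below uses `cut()` on invented lengths; the class labels are hypothetical, and the 650-to-660 subset is left out because it is nested inside the 500-to-700 class.

```r
# Toy ungapped lengths; in the notebook this would be COI_all_data$seqlen2
seqlen2 <- c(120, 499, 500, 655, 700, 701, 1500)

# right = TRUE closes each interval on the right: (499,700] holds 500..700
# for whole-number lengths, matching seqlen2 >= 500 & seqlen2 <= 700
length_class <- cut(seqlen2, breaks = c(0, 499, 700, Inf),
                    labels = c("under500", "500to700", "over700"),
                    right = TRUE)
print(table(length_class))

# split() then yields one vector (or data frame) per class
by_class <- split(seqlen2, length_class)
```

A single classification makes it easy to confirm that every sequence falls in exactly one class.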


### **Randomly sampling 100 sequences from data sets for use in testing the pipeline**

**1. Sampling from all Insecta COI-5P data regardless of sequence length**

In [45]:
COI_testa00_data <- COI_all_data[sample(nrow(COI_all_data), 100), ]

ERROR: Error in eval(expr, envir, enclos): object 'COI_all_data' not found


In [46]:
boxplot(COI_testa00_data$seqlen2)

ERROR: Error in boxplot(COI_testa00_data$seqlen2): object 'COI_testa00_data' not found


**2. Sampling from Insecta COI-5P data with 500 to 700 nucleotide sequence length**

In [47]:
COI_testb01_data <- COI_500to700_data[sample(nrow(COI_500to700_data), 100), ]

ERROR: Error in eval(expr, envir, enclos): object 'COI_500to700_data' not found


In [48]:
boxplot(COI_testb01_data$seqlen2)

ERROR: Error in boxplot(COI_testb01_data$seqlen2): object 'COI_testb01_data' not found


In [49]:
COI_testb02_data <- COI_500to700_data[sample(nrow(COI_500to700_data), 100), ]

ERROR: Error in eval(expr, envir, enclos): object 'COI_500to700_data' not found


In [50]:
boxplot(COI_testb02_data$seqlen2)

ERROR: Error in boxplot(COI_testb02_data$seqlen2): object 'COI_testb02_data' not found


In [51]:
COI_testb03_data <- COI_500to700_data[sample(nrow(COI_500to700_data), 100), ]

ERROR: Error in eval(expr, envir, enclos): object 'COI_500to700_data' not found


In [52]:
boxplot(COI_testb03_data$seqlen2)

ERROR: Error in boxplot(COI_testb03_data$seqlen2): object 'COI_testb03_data' not found


**3. Sampling from Insecta COI-5P data with 650 to 660 nucleotide sequence length**

In [53]:
COI_testc04_data <- COI_650to660_data[sample(nrow(COI_650to660_data), 100), ]

ERROR: Error in eval(expr, envir, enclos): object 'COI_650to660_data' not found


In [54]:
boxplot(COI_testc04_data$seqlen2)

ERROR: Error in boxplot(COI_testc04_data$seqlen2): object 'COI_testc04_data' not found


In [55]:
COI_testc05_data <- COI_650to660_data[sample(nrow(COI_650to660_data), 100), ]

ERROR: Error in eval(expr, envir, enclos): object 'COI_650to660_data' not found


In [56]:
boxplot(COI_testc05_data$seqlen2)

ERROR: Error in boxplot(COI_testc05_data$seqlen2): object 'COI_testc05_data' not found


**4. Sampling from Insecta COI-5P data with under 500 nucleotide sequence length**

In [57]:
COI_testd06_data <- COI_Under500_data[sample(nrow(COI_Under500_data), 100), ]

ERROR: Error in eval(expr, envir, enclos): object 'COI_Under500_data' not found


In [58]:
boxplot(COI_testd06_data$seqlen2)

ERROR: Error in boxplot(COI_testd06_data$seqlen2): object 'COI_testd06_data' not found


**5. Sampling from Insecta COI-5P data with over 700 nucleotide sequence length**

In [59]:
COI_teste07_data <- COI_Over700_data[sample(nrow(COI_Over700_data), 100), ]

ERROR: Error in eval(expr, envir, enclos): object 'COI_Over700_data' not found


In [60]:
boxplot(COI_teste07_data$seqlen2)

ERROR: Error in boxplot(COI_teste07_data$seqlen2): object 'COI_teste07_data' not found
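
The sampling cells above draw a different set of rows on every run, and `sample()` fails when a subset has fewer than 100 rows. The sketch below shows one possible way to handle both; `sample_rows` and its `seed` argument are hypothetical names, not part of the original notebook.

```r
# Hypothetical helper: draw up to n rows without replacement, reproducibly
sample_rows <- function(df, n = 100, seed = 1) {
    set.seed(seed)            # same seed, same rows on every run
    n <- min(n, nrow(df))     # sample() errors if n exceeds nrow(df)
    df[sample(nrow(df), n), , drop = FALSE]
}

toy <- data.frame(id = 1:250, seqlen2 = 400 + (1:250))
test_a <- sample_rows(toy, 100, seed = 42)
test_b <- sample_rows(toy, 100, seed = 42)
small  <- sample_rows(toy[1:30, ], 100)

cat(nrow(test_a), "rows sampled;", nrow(small), "rows from the small set\n")
```

Note that `set.seed()` resets the session's random number generator, so different seeds would be needed to obtain distinct replicate samples such as the three 500-to-700 test sets.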


In [61]:
ls()

In [62]:
getwd()

## **Saving final tidy copies of the data to** ***'/co1_metaanalysis/Data/input'*** **directory**

In [494]:
### Printing copies of the final tidy files
datalist = lapply(c("COI_all_data", "COI_Over499_data", "COI_500to700_data",
                    "COI_650to660_data", "COI_Over700_data", "COI_Under500_data",
                    "COI_testa00_data", "COI_testb01_data", "COI_testb02_data",
                    "COI_testb03_data", "COI_testc04_data", "COI_testc05_data",
                    "COI_testd06_data", "COI_teste07_data"), get)
names(datalist) <- (c("../data/input/test_data/COI_all_data",
                      "../data/input/test_data/COI_Over499_data",
                      "../data/input/test_data/COI_500to700_data",
                      "../data/input/test_data/COI_650to660_data",
                      "../data/input/test_data/COI_Over700_data",
                      "../data/input/test_data/COI_Under500_data",
                      "../data/input/test_data/COI_testa00_data",
                      "../data/input/test_data/COI_testb01_data",
                      "../data/input/test_data/COI_testb02_data",
                      "../data/input/test_data/COI_testb03_data",
                      "../data/input/test_data/COI_testc04_data",
                      "../data/input/test_data/COI_testc05_data",
                      "../data/input/test_data/COI_testd06_data",
                      "../data/input/test_data/COI_teste07_data"))
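for (i in seq_along(datalist)) {
    # datalist[[i]] extracts the data frame itself; datalist[i] would pass a
    # one-element list, and the list name would be prefixed to every column name
    write.table(datalist[[i]], file = paste0(names(datalist)[i], ".tsv"),
                row.names = FALSE, col.names = TRUE, sep = "\t", quote = FALSE)
}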
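
A compact variant of the export loop, sketched on invented data frames and writing to a temporary directory. The object names `COI_toy_a` and `COI_toy_b` are hypothetical, and `tempdir()` stands in for the real output directory.

```r
# Toy data frames standing in for COI_all_data, COI_Over499_data, ...
COI_toy_a <- data.frame(processid = c("TOY1", "TOY2"), seqlen2 = c(658, 640))
COI_toy_b <- data.frame(processid = "TOY3", seqlen2 = 701)

# mget() returns a list named after the objects, so names and data stay paired
datalist <- mget(c("COI_toy_a", "COI_toy_b"), envir = environment())

out_dir <- tempdir()  # assumption: use "../data/input/test_data" for real runs
for (name in names(datalist)) {
    write.table(datalist[[name]],
                file = file.path(out_dir, paste0(name, ".tsv")),
                row.names = FALSE, col.names = TRUE, sep = "\t", quote = FALSE)
}

# Read one file back to confirm the header survived unchanged
check <- read.delim(file.path(out_dir, "COI_toy_a.tsv"),
                    stringsAsFactors = FALSE)
print(names(check))
```

Deriving the file names from the object names avoids keeping two parallel lists in sync.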

In [43]:
length(which(is.na(COI_500to700_data$order_name)))