# **Data Downloading and Transformation**

## **Goals**
1. Retrieve data from BOLD Systems.
2. Transform the XML files to text files (.tsv).
3. Build FASTA-format sequences from the text files.
4. Retrieve unpublished data from BOLD Systems and reformat the headers.

### **1. Retrieving data from BOLD Systems**
The BOLD Systems v4 [PUBLIC DATA API](http://www.boldsystems.org/index.php/resources/api?type=webservices) is used to export **Full Data Retrieval (Specimen + Sequence)** records for a list of countries. The downloads are stored in files named by country in a default destination directory "co1_metaanalysis/data/input/input_data/bold_africa/"
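
As a minimal sketch of how such a request URL is put together, the hypothetical helper below (`bold_combined_url` is not part of the project's scripts) replaces spaces with `%20`, as the retrieval function in this section does, and prints the URL of the `combined` endpoint used there. It only builds the string; nothing is downloaded.

```shell
# Hypothetical helper (not in process_all_input_files.sh).
# $1 = taxon name, $2 = country name; spaces are encoded as %20.
bold_combined_url() {
        url_taxon=$(printf '%s' "$1" | sed 's/ /%20/g')
        url_geo=$(printf '%s' "$2" | sed 's/ /%20/g')
        printf 'http://www.boldsystems.org/index.php/API_Public/combined?geo=%s&taxon=%s&format=xml\n' "$url_geo" "$url_taxon"
}

bold_combined_url "Insecta" "South Africa"
# prints: http://www.boldsystems.org/index.php/API_Public/combined?geo=South%20Africa&taxon=Insecta&format=xml
```

Printing the URL first makes it easy to check the spelling of the taxon and country before starting a long download.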

In [None]:
%%bash
bolddata_retrival() { # This function retrieves data for a given list of country names. Input can be a file containing the names of selected countries or individual country names
        if [[ ( $# -eq 0 ) || ! ( `echo $1` =~ -.*$ ) ]]
        then
                echo "Input error..."
                echo "function usage: ${FUNCNAME[0]} [-a] [-c <name of country>] [-f <a file with list of countries and named *countries*>]"
                return 1
        fi

        local OPTIND=1
        countries=()

        while getopts 'ac:f:' key
        do
                case "${key}" in
                        f)
                                if [ ! -f $OPTARG ]
                                then
                                        echo "input error: file $OPTARG is non-existent!"
                                elif [[ ( -f $OPTARG ) && ( `basename $OPTARG` =~ ^.*countries.*$ ) ]]
                                then
                                        while IFS= read -r line || [[ -n "$line" ]]; do countries+=("${line// /%20}"); done < "$OPTARG" # one country per line, spaces encoded as %20
                                else
                                        echo "input file error in `basename $OPTARG`: input file should be named '.*countries.*'"
                                fi
                                ;;
                        c)
                                countries+=(`echo $OPTARG | sed 's/ /%20/g'`)
                                ;;
                        a)
                                countries=("all")
                                ;;
                        ?)
                                echo "Input error..."
                                echo "function usage: ${FUNCNAME[0]} [-a] [-c <name of country>] [-f <a file with list of countries>]"
                                return 1
                                ;;
                esac
        done

        echo -e "\n\tDownloading data of countries named in ${countries[@]} from www.boldsystems.org V4"
        unset taxon_nam
        regexp='^[a-zA-Z0-9/_ -]+$' # letters, digits, '/', '_', space and '-' (the hyphen comes last so that it is taken literally)

        until [[ "$taxon_nam" =~ $regexp ]]
        do
                read -p "Please enter taxon name to be searched, ensure the spelling is right otherwise you get everything downloaded. To ensure that you are downloading the right dataset first go to 'http://v4.boldsystems.org/index.php/Public_BINSearch?searchtype=records' and search the tax(on|a) of choice as explained:: " taxon_nam
        done

        taxon_name=`echo $taxon_nam | sed 's/ /%20/g'`

        wgetoutput_dir=${inputdata_path}bold_data/${taxon_name}
        until [[ -d ${wgetoutput_dir} ]]
        do
                echo "Creating output directory '${wgetoutput_dir}'"
                mkdir -p "${wgetoutput_dir}"
        done

        IFS=$'\n'
        if [[ ! ( `echo ${countries[0]}` =~ "all" ) ]]
        then
                for i in ${countries[@]}
                do
                        wget --show-progress --progress=bar:noscroll --retry-connrefused -t inf -O ${wgetoutput_dir}/"${i}"_summary.xml -a ${wgetoutput_dir}/${taxon_nam}_wget_log "http://www.boldsystems.org/index.php/API_Public/stats?geo=${i}&taxon=${taxon_name}&format=xml"
                        #wget --show-progress --progress=bar:noscroll --retry-connrefused -t inf -O ${wgetoutput_dir}/"${i}"_specimen.xml -a ${wgetoutput_dir}/${taxon_nam}_wget_log "http://www.boldsystems.org/index.php/API_Public/specimen?geo=${i}&taxon=${taxon_name}&format=xml"
                        wget --show-progress --progress=bar:noscroll --retry-connrefused -t inf -O ${wgetoutput_dir}/"${i}".xml -a ${wgetoutput_dir}/${taxon_nam}_wget_log "http://www.boldsystems.org/index.php/API_Public/combined?geo=${i}&taxon=${taxon_name}&format=xml"
                done
        elif [[ ( `echo ${countries[0]}` =~ "all" ) ]]
        then
                wget --show-progress --progress=bar:noscroll --retry-connrefused -t inf -O ${wgetoutput_dir}/"${taxon_nam}"_summary.xml -a ${wgetoutput_dir}/${taxon_nam}_wget_log "http://www.boldsystems.org/index.php/API_Public/stats?taxon=${taxon_name}&format=xml"
                #wget --show-progress --progress=bar:noscroll --retry-connrefused -t inf -O ${wgetoutput_dir}/"${taxon_nam}"_specimen.xml -a ${wgetoutput_dir}/${taxon_nam}_wget_log "http://www.boldsystems.org/index.php/API_Public/specimen?taxon=${taxon_name}&format=xml"
                wget --show-progress --progress=bar:noscroll --retry-connrefused -t inf -O ${wgetoutput_dir}/"${taxon_nam}".xml -a ${wgetoutput_dir}/${taxon_nam}_wget_log "http://www.boldsystems.org/index.php/API_Public/combined?taxon=${taxon_name}&format=xml"
        fi
}

**Running:**

In [1]:
%%bash
cd /home/kibet/bioinformatics/github/co1_metaanalysis/code/
source ./process_all_input_files.sh
bolddata_retrival #-c country  #Uncomment the word "country" to download from a list of countries in the file country (canada)

Input error...
function usage: bolddata_retrival [-a] [-c <name of country>] [-f <a file with list of countries and named *countries*>]


CalledProcessError: Command 'b'cd /home/kibet/bioinformatics/github/co1_metaanalysis/code/\nsource ./process_all_input_files.sh\nbolddata_retrival #-c country  #Uncomment the word "country" to download from a list of countries in the file country (canada)\n'' returned non-zero exit status 1.

### **2. Transformation of the XML files to tsv**
Here we use the python3 packages **BeautifulSoup4** and **pandas**.  
(**N.B.:** An R-based approach ([01.02.R_xml_to_tsv.ipynb](./01.02.R_xml_to_tsv.ipynb)) was tried, but it did not work well.)  
For more on the logic behind the extraction script, see the Jupyter notebook [01.01.xml_to_tsv.ipynb](./01.01.xml_to_tsv.ipynb).  
The country-specific XML files are converted to text (.tsv) files.

In [None]:
%%bash
boldxml2tsv() { #This function generates .tsv files from .xml files using a python script with the BeautifulSoup4 and pandas packages

        usage $@

        TAB=$(printf '\t')

        echo "generating .tsv files from .xml downloads"

        for i in "$@"
        do
                if [ ! -f $i ]
                then
                        echo "input error: file '$i' is non-existent!"
                elif [[ ( -f $i ) && ( `basename -- "$i"` =~ .*\.(xml) ) ]]
                then
                        rename
                        echo -e "\nLet us proceed with file '${input_filename}'..."
                        sed 's/class/Class/g' "$i" | sed "s/$TAB/,/g" > ${inputdata_path}bold_africa/input.xml
                        ${PYTHON_EXEC} ${xml_to_tsv} ${inputdata_path}bold_africa/input.xml && mv output.tsv ${inputdata_path}bold_africa/${output_filename}.tsv
                else
                        echo "input file error in $(basename -- "$i"): input file should be a .xml file format"
                        continue
                fi
        done
}

**Running:**

In [3]:
%%bash
cd /home/kibet/bioinformatics/github/co1_metaanalysis/code/
source ./process_all_input_files.sh
boldxml2tsv #../data/input/input_data/bold_africa/kenya.xml  #Uncomment the path to execute the function

Input error...
Usage: build_tsv file1.*[file2.* file3.* ...]
generating .tsv files from .xml downloads


Below is the outcome of converting the XML files to text (.tsv) files, shown as the number of lines in each country's file:

In [3]:
%%bash
cd /home/kibet/bioinformatics/github/co1_metaanalysis/data/input/input_data/bold_africa/
wc -l *.tsv

     1213 Algeria.tsv
     1329 Angola.tsv
      941 Benin.tsv
      669 Botswana.tsv
      301 Burkina_Faso.tsv
      257 Burundi.tsv
     7268 Cameroon.tsv
      692 Cape_Verde.tsv
     3487 Central_African_Republic.tsv
       46 Chad.tsv
     1400 Comoros.tsv
      746 Cote_d_Ivoire.tsv
     7580 Democratic_republic_of_the_Congo.tsv
      472 Djibouti.tsv
    20984 Egypt.tsv
      479 Equatorial_Guinea.tsv
       31 Eritrea.tsv
     3792 Ethiopia.tsv
    16898 Gabon.tsv
      138 Gambia.tsv
     4178 Ghana.tsv
      100 Guinea-Bissau.tsv
     1035 Guinea.tsv
    29480 Kenya.tsv
       68 Lesotho.tsv
     2393 Liberia.tsv
       93 Libya.tsv
    50290 Madagascar.tsv
     1899 Malawi.tsv
      374 Mali.tsv
      225 Mauritania.tsv
     1736 Mauritius.tsv
     5219 Morocco.tsv
     2812 Mozambique.tsv
     2449 Namibia.tsv
     2876 Nigeria.tsv
       75 Niger.tsv
     2130 Republic_of_the_Congo.tsv
     2038 Reunion.tsv
      791 Rwanda.tsv
      260 Sao_Tome_and_Principe.tsv
     148

### **3. Build FASTA sequences from the .tsv text files**
The FASTA files are not built directly from the country-specific text (.tsv) files.
They are built after some cleaning and sorting.
1. First, only the records with an Insecta genus-name tag are extracted, and all records of non-COI-5P markers are removed  
2. Then ALL the records are re-grouped into subsets based on sequence length  
3. Then fourteen 100-record samples are randomly drawn from these groups, to be used in developing and testing the bioinformatics analysis pipelines  
4. Finally the re-grouped subsets and the samples are converted to FASTA format sequences  
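
The length-based grouping in step 2 can be sketched as follows. The group names are taken from the output file names used later in this notebook (`Under500`, `Over499`, `500to700`, `650to660`, `Over700`). The inclusive boundaries are an assumption, since the actual grouping is done in the R scripts, and `length_groups` is a hypothetical helper used only for illustration.

```shell
# Hypothetical helper: list the length-based groups that a sequence of
# length $1 would fall into. The boundaries are assumed to be inclusive.
length_groups() {
        awk -v len="$1" 'BEGIN {
                out = "all"
                if (len < 500)                out = out " Under500"
                if (len >= 500)               out = out " Over499"
                if (len >= 500 && len <= 700) out = out " 500to700"
                if (len >= 650 && len <= 660) out = out " 650to660"
                if (len > 700)                out = out " Over700"
                print out
        }'
}

length_groups 658
# prints: all Over499 500to700 650to660
```

Note that the groups overlap: a 658-nucleotide sequence belongs to the `650to660`, `500to700` and `Over499` subsets at the same time.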

There are two R scripts:  
1. [data_cleanup_n_sampling.R](./data_cleanup_n_sampling.R): Meant for cleaning, sorting and sampling the test data (East African data: Kenya, Tanzania, Uganda, Rwanda, Burundi, Ethiopia and South Sudan).  
See [02.00.Data_cleanup](./02.00.Data_cleanup.ipynb) for more information on steps 1 to 3. 
2. [data_cleanup.R](./data_cleanup.R): Meant for cleaning and sorting all country-specific records

**To sort the data for all country specific records into the groups defined by sequence length do as follows:**

In [None]:
%%bash
append_tsvfile() { # This function tests whether the .tsv file has content and, if it does, appends it to a cumulative file of all input records. It is called by the function below: clean_sort_tsv()

        if [ `grep -v "X..bioinformatics" ${input} | wc -l` -ge 1 ]
        then
                awk 'FNR==1 { while (/^X..bioinformatics/) getline; }   1 {print}' ${input} >> ${output}
        else
                echo -e "\n `basename -- $input` from `basename -- $i` has no content besides the header!!!"
        fi
}


clean_sort_tsv() { #This function cleans the .tsv files, sorts the records into different files based on sequence length, and finally appends these files to cumulative files covering the different input files

        usage $@

        echo "cleaningup and sorting .tsv files "

        output_files_africa=("${inputdata_path}clean_africa/afroCOI_500to700_data.tsv" "${inputdata_path}clean_africa/afroCOI_650to660_data.tsv" "${inputdata_path}clean_africa/afroCOI_all_data.tsv" "${inputdata_path}clean_africa/afroCOI_Over499_data.tsv" "${inputdata_path}clean_africa/afroCOI_Over700_data.tsv" "${inputdata_path}clean_africa/afroCOI_Under500_data.tsv")

        output_files_eafrica=("${inputdata_path}clean_eafrica/eafroCOI_500to700_data.tsv" "${inputdata_path}clean_eafrica/eafroCOI_650to660_data.tsv" "${inputdata_path}clean_eafrica/eafroCOI_all_data.tsv" "${inputdata_path}clean_eafrica/eafroCOI_Over499_data.tsv" "${inputdata_path}clean_eafrica/eafroCOI_Over700_data.tsv" "${inputdata_path}clean_eafrica/eafroCOI_Under500_data.tsv")


        for i in ${output_files_africa[@]}
        do
                grep "processid" $1 > $i && echo -e "\nInput file $i is set"
        done

        for i in ${output_files_eafrica[@]}
        do
                grep "processid" $1 > $i && echo -e "\nInput file $i is set"
        done


        for i in "$@"
        do
                if [ ! -f $i ]
                then
                        echo "input error: file '$i' is non-existent!"
                elif [[ ( -f $i ) && ( `basename -- "$i"` =~ .*\.(tsv) ) ]]
                then
                        rename
                        echo -e "\nLet us proceed with file '${input_filename}'..."
                        ${RSCRIPT_EXEC} --vanilla ${data_cleanup} $i

                        case $output_filename in
                                Algeria|Madagascar|Angola|Malawi|Benin|Mali|Botswana|Mauritania|Burkina_Faso|Mauritius|Morocco|Cameroon|Mozambique|Cape_Verde|Namibia|Central_African_Republic|Nigeria|Chad|Niger|Comoros|Republic_of_the_Congo|Cote_d_Ivoire|Reunion|Democratic_republic_of_the_Congo|Djibouti|Sao_Tome_and_Principe|Egypt|Senegal|Equatorial_Guinea|Seychelles|Eritrea|Sierra_Leone|Somalia|Gabon|South_Africa|Gambia|Ghana|Sudan|Guinea-Bissau|Swaziland|Guinea|Togo|Tunisia|Lesotho|Liberia|Zambia|Libya|Zimbabwe)
                                        input=${inputdata_path}clean_africa/COI_500to700_data.tsv
                                        output=${output_files_africa[0]}
                                        append_tsvfile

                                        input=${inputdata_path}clean_africa/COI_650to660_data.tsv
                                        output=${output_files_africa[1]}
                                        append_tsvfile

                                        input=${inputdata_path}clean_africa/COI_all_data.tsv
                                        output=${output_files_africa[2]}
                                        append_tsvfile

                                        input=${inputdata_path}clean_africa/COI_Over499_data.tsv
                                        output=${output_files_africa[3]}
                                        append_tsvfile

                                        input=${inputdata_path}clean_africa/COI_Over700_data.tsv
                                        output=${output_files_africa[4]}
                                        append_tsvfile

                                        input=${inputdata_path}clean_africa/COI_Under500_data.tsv
                                        output=${output_files_africa[5]}
                                        append_tsvfile
                                        ;;
                                Kenya|Tanzania|Uganda|Rwanda|Burundi|South_Sudan|Ethiopia)
                                        input=${inputdata_path}clean_africa/COI_500to700_data.tsv
                                        output=${output_files_eafrica[0]}
                                        append_tsvfile

                                        input=${inputdata_path}clean_africa/COI_650to660_data.tsv
                                        output=${output_files_eafrica[1]}
                                        append_tsvfile

                                        input=${inputdata_path}clean_africa/COI_all_data.tsv
                                        output=${output_files_eafrica[2]}
                                        append_tsvfile

                                        input=${inputdata_path}clean_africa/COI_Over499_data.tsv
                                        output=${output_files_eafrica[3]}
                                        append_tsvfile

                                        input=${inputdata_path}clean_africa/COI_Over700_data.tsv
                                        output=${output_files_eafrica[4]}
                                        append_tsvfile

                                        input=${inputdata_path}clean_africa/COI_Under500_data.tsv
                                        output=${output_files_eafrica[5]}
                                        append_tsvfile

                                        ;;
                                *)
                                        echo -e "The file $output_filename \b.tsv is not in the list of African countries or is not in the right format."
                                        ;;
                        esac
                 else
                        echo "input file error in `basename -- $i`: input file should be a .tsv file format"
                        continue
                fi
        done
}


**Running:**

In [6]:
%%bash
cd /home/kibet/bioinformatics/github/co1_metaanalysis/code/
source ./process_all_input_files.sh
clean_sort_tsv #../data/input/input_data/bold_africa/*.tsv

Input error...
Usage: clean_sort_tsv file1.*[file2.* file3.* ...]
cleaningup and sorting .tsv files 


The sorting R script separates the East African data from the rest of Africa and stores them in two separate directories: "co1_metaanalysis/data/input/input_data/clean_eafrica" and "co1_metaanalysis/data/input/input_data/clean_africa"  

**To convert the .tsv files to FASTA format files do as follows**

**Point to Note**  
Errors may occur where a field value in the original .xml file contains an end-of-line character: a line feed `\n` or a carriage return plus line feed `\r\n`. In such cases the original .tsv files need extra editing to substitute these line breaks with a white space. This can only be done after a FASTA-format sequence file has been generated: each problematic line break shows up there as a double `\n\n`, which is then traced back to its source and corrected in the .tsv file.
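
One way to spot such broken records before building the FASTA file is to compare the number of tab-separated fields on every line with that of the header row. The sketch below assumes the first line of the .tsv file is the header; `check_tsv_fields` is a hypothetical helper, and the column names in the demonstration (other than `processid`) are made up.

```shell
# Hypothetical helper: report every line whose number of tab-separated
# fields differs from that of the header row (line 1).
check_tsv_fields() {
        awk -F'\t' 'NR == 1 { n = NF; next }
                    NF != n { printf "line %d: %d fields (expected %d)\n", NR, NF, n }' "$1"
}

# Demonstration on a small file in which the second record was split by a stray line break
demo_tsv=$(mktemp)
printf 'processid\tgenus\tnucleotides\n' > "$demo_tsv"
printf 'A1\tLobesia\tACGT\n' >> "$demo_tsv"
printf 'A2\tLob\nesia\tACGT\n' >> "$demo_tsv"
check_tsv_fields "$demo_tsv"
# prints: line 3: 2 fields (expected 3)
#         line 4: 2 fields (expected 3)
rm -f "$demo_tsv"
```

A stray line break normally turns one record into two short lines, so the reported lines tend to come in consecutive pairs.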

In [None]:
%%bash
boldtsv2fasta() { #This function generates .fasta files from .tsv files using an awk script

        usage $@

        echo "generating .fasta files from .tsv metadata files"

        for i in "$@"
        do
                if [ ! -f $i ]
                then
                        echo "input error: file '$i' is non-existent!"
                elif [[ ( -f $i ) && ( `basename -- "$i"` =~ .*\.(tsv) ) ]]
                then
                        input_src=`dirname "$( realpath "${i}" )"`
                        rename
                        echo -e "\nLet us proceed with file '${input_filename}'..."
                        ${AWK_EXEC} -f ${AWK_SCRIPT} "$i" > ${input_src}/${output_filename}.fasta
                else
                        echo "input file error in `basename -- $i`: input file should be a .tsv file format"
                        continue
                fi
        done
}

**Running:**

In [12]:
%%bash
cd /home/kibet/bioinformatics/github/co1_metaanalysis/data/input/input_data/
source ../../../code/process_all_input_files.sh
build_fasta #../data/input/test_data/*.tsv # For test_data(East African data sets including their samples)
build_fasta #../data/input/input_data/clean_africa/*.tsv # For re-grouped African data sets
build_fasta #../data/input/input_data/clean_eafrica/*.tsv # For re-grouped East African data sets
echo -e "\nEast African data sets in 'clean_eafrica/':"
wc -l clean_eafrica/eafroCOI*
echo -e "\nAfrican data sets in 'clean_africa/':"
wc -l clean_africa/afroCOI*

Input error...
Usage: build_fasta file1.*[file2.* file3.* ...]
generating .fasta files from .tsv metadata files
Input error...
Usage: build_fasta file1.*[file2.* file3.* ...]
generating .fasta files from .tsv metadata files
Input error...
Usage: build_fasta file1.*[file2.* file3.* ...]
generating .fasta files from .tsv metadata files

East African data sets in 'clean_eafrica/':
    74378 clean_eafrica/eafroCOI_500to700_data.fasta
    36742 clean_eafrica/eafroCOI_500to700_data.tsv
     1019 clean_eafrica/eafroCOI_500to700_data_undesired.fasta
    48950 clean_eafrica/eafroCOI_650to660_data.fasta
    23560 clean_eafrica/eafroCOI_650to660_data.tsv
     1018 clean_eafrica/eafroCOI_650to660_data_undesired.fasta
    76842 clean_eafrica/eafroCOI_all_data.fasta
    38096 clean_eafrica/eafroCOI_all_data.tsv
     1019 clean_eafrica/eafroCOI_all_data_undesired.fasta
    75546 clean_eafrica/eafroCOI_Over499_data.fasta
    37326 clean_eafrica/eafroCOI_Over499_data.tsv
     1019 clean_eafrica/eafroCO

As noted above, the East African data and the data from the rest of Africa are stored in two separate directories. The code below concatenates the two streams into one, stored in "co1_metaanalysis/data/input/input_data/clean_africa"  

In [11]:
%%bash
cd /home/kibet/bioinformatics/github/co1_metaanalysis/data/input/input_data/
#cat ./clean_africa/afroCOI_500to700_data.fasta ./clean_eafrica/eafroCOI_500to700_data.fasta > ./clean_africa/enafroCOI_500to700_data.fasta
#cat ./clean_africa/afroCOI_650to660_data.fasta ./clean_eafrica/eafroCOI_650to660_data.fasta > ./clean_africa/enafroCOI_650to660_data.fasta
#cat ./clean_africa/afroCOI_all_data.fasta ./clean_eafrica/eafroCOI_all_data.fasta > ./clean_africa/enafroCOI_all_data.fasta
#cat ./clean_africa/afroCOI_Over499_data.fasta ./clean_eafrica/eafroCOI_Over499_data.fasta > ./clean_africa/enafroCOI_Over499_data.fasta
#cat ./clean_africa/afroCOI_Over700_data.fasta ./clean_eafrica/eafroCOI_Over700_data.fasta > ./clean_africa/enafroCOI_Over700_data.fasta
#cat ./clean_africa/afroCOI_Under500_data.fasta ./clean_eafrica/eafroCOI_Under500_data.fasta > ./clean_africa/enafroCOI_Under500_data.fasta
wc -l ./clean_africa/*

    295356 ./clean_africa/afroCOI_500to700_data.fasta
    147676 ./clean_africa/afroCOI_500to700_data.tsv
    150458 ./clean_africa/afroCOI_650to660_data.fasta
     75230 ./clean_africa/afroCOI_650to660_data.tsv
    309530 ./clean_africa/afroCOI_all_data.fasta
    154764 ./clean_africa/afroCOI_all_data.tsv
    297474 ./clean_africa/afroCOI_Over499_data.fasta
    148699 ./clean_africa/afroCOI_Over499_data.tsv
      2046 ./clean_africa/afroCOI_Over700_data.fasta
      1024 ./clean_africa/afroCOI_Over700_data.tsv
     12130 ./clean_africa/afroCOI_Under500_data.fasta
      6066 ./clean_africa/afroCOI_Under500_data.tsv
    170314 ./clean_africa/enafroCOI_500to700_data-650to660.fasta
    369710 ./clean_africa/enafroCOI_500to700_data.fasta
    199396 ./clean_africa/enafroCOI_650to660_data.fasta
    386352 ./clean_africa/enafroCOI_all_data.fasta
    385706 ./clean_africa/enafroCOI_all_data_raw.fasta
    279189 ./clean_africa/enafroCOI_all_data_raw.tsv
    372916 ./clean_africa/enafroCOI_Over49

The file **"enafroCOI_500to700_data-650to660.fasta"** holds the sequences that are 500 to 700 nucleotides long, excluding those with 650 to 660 nucleotides, which are already represented in enafroCOI_650to660_data.fasta.  
It is generated with functions in the bash script "process_all_input_files.sh", which do the necessary text processing.  
See the function below:

In [None]:
%%bash
delete_repeats() { #This function takes a fasta_format_sequences file and deletes repeats of sequences based on identical headers.
        #in multiple files at once: awk -F'[|]' 'FNR%2{f=seen[$1]++} !f' *
        #in each file: awk -F'[|]' 'FNR==1{delete seen} FNR%2{f=seen[$1]++} !f' *
        if [ $# -eq 0 ]
        then
                echo "Input error..."
                echo "Usage: ${FUNCNAME[0]} file1.*[file2.* file3.* ...]"
                return 1
        fi

        for i in "$@"
        do
                rename
                input_src=`dirname "$( realpath "${i}" )"`
                unset duplicate_headers
                duplicate_headers=`grep ">" $i | $AWK_EXEC 'BEGIN { FS="|"; }; {print $1; }' | sort | uniq -d`
                if [ ! -z "$duplicate_headers" ]
                then
                        echo -e "\t`echo -e "$duplicate_headers" | wc -l` records are repeated in $i,\n\twould you like to proceed and delete all repeats?"
                        read -p "Please enter [Yes] or [No] to proceed: " choice
                else
                        choice="No"
                fi
                case $choice in
                        YES|Yes|yes|Y|y)
                                concatenate_fasta_seqs $i
                                $AWK_EXEC -F'[>|]' 'FNR==1{delete seen} FNR%2{f=seen[$2]++} !f' $i > ${input_src}/${output_filename}_cleaned && mv ${input_src}/${output_filename}_cleaned $( realpath "${i}" )
                                echo -e "\tDuplicate records deleted\n"
                                ;;
                        No|NO|no|N|n)
                                if [ ! -z "$duplicate_headers" ]
                                then
                                        echo -e "\tWould you like to save a list of the duplicates?"
                                        read -p "Please enter [Yes] or [No] to proceed: " option
                                        case $option in
                                                YES|Yes|yes|Y|y)
                                                        echo -e "\tCancelling....\nThe list of repeated sequences is in file called '_duplicates'\n"
                                                        echo -e "$duplicate_headers" > ${input_src}/${output_filename}_duplicates
                                                        ;;
                                                No|NO|no|N|n)
                                                        echo -e "\tCancelling...\n"
                                                        ;;
                                                *)
                                                        echo "ERROR!!! Invalid selection"
                                                        ;;
                                        esac
                                else
                                        echo -e "\tNo duplicate records in $i\n"
                                fi
                                ;;
                        *)
                                echo "ERROR!!! Invalid selection"
                                ;;
                esac
        done

}
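
The core of `delete_repeats` is the awk one-liner that keeps only the first record for each identifier. A minimal sketch of it on a toy input is given below; `dedupe_two_line_fasta` is a hypothetical wrapper, and it assumes every record occupies exactly two lines (a header followed by a single sequence line), which is why the function above first calls `concatenate_fasta_seqs`.

```shell
# Hypothetical wrapper around the awk one-liner used in delete_repeats.
# On header lines (odd line numbers) f becomes the number of times the ID,
# i.e. the text between '>' and the first '|', has been seen before.
# Lines are printed only while f is 0, so the sequence line that follows a
# repeated header is dropped together with it.
dedupe_two_line_fasta() {
        awk -F'[>|]' 'FNR == 1 { delete seen } FNR % 2 { f = seen[$2]++ } !f' "$@"
}

printf '>A1|x\nACGT\n>B2|y\nGGCC\n>A1|z\nACGT\n' | dedupe_two_line_fasta
# prints: >A1|x
#         ACGT
#         >B2|y
#         GGCC
```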

**Actual run:**

In [None]:
%%bash
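cd /home/kibet/bioinformatics/github/co1_metaanalysis/data/input/input_data/
source ../../../code/process_all_input_files.sh
# Concatenate with the 650to660 records first, so that their copies are the ones kept when repeats are deleted
cat clean_africa/enafroCOI_650to660_data.fasta clean_africa/enafroCOI_500to700_data.fasta > input
delete_repeats input
# Number of lines taken up by the 650to660 records (assumes two-line records with no internal repeats)
x=$(wc -l < clean_africa/enafroCOI_650to660_data.fasta)
# Skip the first x lines and keep the rest: the 500to700 records that are not in the 650to660 set
awk -v x="$x" 'FNR > x' ./input > clean_africa/enafroCOI_500to700_data-650to660.fasta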
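
The same subtraction can also be sketched without counting lines, by filtering on the record identifiers directly. This is an alternative, not the method used in this notebook; `fasta_difference` is a hypothetical helper, and it assumes that the identifier is the text between `>` and the first `|` of each header.

```shell
# Hypothetical helper: print the records of FASTA file $2 whose ID
# does not occur in FASTA file $1.
fasta_difference() {
        awk -F'[>|]' '
                NR == FNR { if (/^>/) seen[$2] = 1; next }  # first file: remember every ID
                /^>/      { keep = !($2 in seen) }          # second file: decide at each header
                keep                                        # print the lines of kept records
        ' "$1" "$2"
}

# Demonstration with two toy files
demo_dir=$(mktemp -d)
printf '>A1|x\nACGT\n>B2|y\nGGCC\n' > "$demo_dir/short.fasta"
printf '>A1|x\nACGT\n>C3|z\nTTAA\n>B2|y\nGGCC\n' > "$demo_dir/long.fasta"
fasta_difference "$demo_dir/short.fasta" "$demo_dir/long.fasta"
# prints: >C3|z
#         TTAA
rm -r "$demo_dir"
```

Because the decision is made per header, this version does not depend on the order of the records or on each sequence occupying a single line.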

### **4. Retrieving unpublished data from BOLD Systems and reformatting the headers**
#### **4.1 Retrieving unpublished data from [BOLD Systems](http://www.boldsystems.org/index.php/MAS_Management_UserConsole)**
First create a [BOLD systems account](http://www.boldsystems.org/index.php/MAS_Management_NewUserApp), [log in](http://www.boldsystems.org/index.php/Login/page?destination=MAS_Management_UserConsole) and request data managers to share their data sets.  
My list of shared data sets is:
1. [DS-KENFRUIT](http://www.boldsystems.org/index.php/MAS_Management_DataConsole?codes=DS-KENFRUIT): managed by Dr Scott E. Miller, has 1,427 records  
2. [DS-MPALALEP](http://www.boldsystems.org/index.php/MAS_Management_DataConsole?codes=DS-MPALALEP): managed by Dr Scott E. Miller, has 2,472 records  
3. [DS-TBILE](http://www.boldsystems.org/index.php/MAS_Management_DataConsole?codes=DS-TBILE) (now publicly released): managed by Dr Scott E. Miller, has 90 records  

My list of container projects, each of which contains multiple data sets:
4. [IDRCK](http://www.boldsystems.org/index.php/MAS_Management_DataConsole?codes=IDRCK): Has a number of subprojects: IDRC, HIVE, KBIR, KALG, KFISH, KPLA, ARAK and KINS. It has 2,110 sequences (COI-5P=1,704, matK=139, rbcLa=267) out of 6,016 specimens and is managed by Dr. Daniel Masiga.  
5. [GMTAH](http://www.boldsystems.org/index.php/MAS_Management_DataConsole?codes=GMTAH), [GMTAI](http://www.boldsystems.org/index.php/MAS_Management_DataConsole?codes=GMTAI) and [GMTAJ](http://www.boldsystems.org/index.php/MAS_Management_DataConsole?codes=GMTAJ) projects. All three fall under the Global Malaise Program and have a combined total of 60 Projects and 49,246 Specimens.  
>1. GMTAH: Has 26 projects titled "Kenya Malaise Mpala 2014" with 25,514 specimens, 170 species and 21,742 sequences (COI-5P=21,737, 28S=4 and EF1-alpha=1)  
>2. GMTAI: Has 26 projects titled "Kenya Malaise Kinondo 2014" with 13,656 specimens, 57 species and 11,805 sequences (COI-5P=11,801, 28S=3 and EF1-alpha=1)  
>3. GMTAJ: Has 5 projects titled "Kenya Malaise Turkana 2014" with 10,076 specimens, 63 species and 5,175 sequences (COI-5P=5,175)  

To retrieve these data, I logged into the [BOLD Systems MAS management interface](http://www.boldsystems.org/index.php/MAS_Management_UserConsole) through the Chromium web browser and, for each project named above (DS-KENFRUIT, DS-MPALALEP, DS-TBILE, IDRCK and GMTAH-GMTAI-GMTAJ (Mpala_Kinondo_Turkana_Malaise_traps)), downloaded the spreadsheet and the sequence files to the "/co1_metaanalysis/data/input/input_data/unpublished" directory.


In [2]:
%%bash
cd /home/kibet/bioinformatics/github/co1_metaanalysis/data/input/input_data/unpublished/
ls

DS-KENFRUIT.fasta
DS-KENFRUIT_headers
DS-KENFRUIT.xlsx
DS-MPALALEP.fasta
DS-MPALALEP_headers
DS-MPALALEP.xlsx
headers_edit.fasta
idrck.fasta
idrck_headers
idrck_headers_all
idrck_headers.csv
idrck_orders
idrck.xls
Mpala_Kinondo_Turkana_Malaise_traps.fasta
Mpala_Kinondo_Turkana_Malaise_traps.fasta2
Mpala_Kinondo_Turkana_Malaise_traps_unedited.fasta
Mpala_Kinondo_Turkana_Malaise_traps.xlsx
Mpala_Kinondo_Turkana_tagged.xlsx


#### **4.2 Changing the headers to match the format of the other headers**  
The current headers look like:
>\>PMANL5032-15|Sarrothripini|Arthropoda|Insecta|Lepidoptera|Nolidae|Chloephorinae|Sarrothripini||||Kenya|Muhaka Forest|-4.325|39.525|50.0  
>\>PMANL5022-15|Lobesia vanillana|Arthropoda|Insecta|Lepidoptera|Tortricidae|Olethreutinae||Lobesia|Lobesia vanillana||Kenya|Muhaka Forest|-4.325|39.525|50.0  

They are changed to edited headers that look like:
>\>PMANL5032-15|Arthropoda|Insecta|Lepidoptera|fam-Nolidae|subfam-Chloephorinae|tri-Sarrothripini|gs-NA|sp-NA|subsp-NA|country-Kenya|exactsite-Muhaka_Forest|lat_-4.325|lon_39.525|elev-50.0  
>\>PMANL5022-15|Arthropoda|Insecta|Lepidoptera|fam-Tortricidae|subfam-Olethreutinae|tri-NA|gs-Lobesia|sp-Lobesia_vanillana|subsp-NA|country-Kenya|exactsite-Muhaka_Forest|lat_-4.325|lon_39.525|elev-50.0  

This standardizes the headers to a common format that is useful in the downstream analysis.  
To do this, the headers of a given sequence file are first copied into a file, headers_edit.fasta, in which they are edited to the right format, i.e.:  
1. Deleting the default taxon (e.g. Sarrothripini in the first example) that BOLD Systems automatically assigns during the download process, which is usually the lowest taxon defined in the taxonomy of that record. It appears in the header right after the unique identifier/process ID. 
2. Labelling the various fields of the headers by adding prefixes: fam-"family", subfam-"subfamily", tri-"tribe", gs-"genus", sp-"species", subsp-"subspecies", country-"country", exactsite-"exact site", lat_"latitude", lon_"longitude" and elev-"elevation".  
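
The two editing steps above can be sketched in awk. The field order (process ID, default taxon, phylum, class, order, family, subfamily, tribe, genus, species, subspecies, country, exact site, latitude, longitude, elevation) is inferred from the two example headers and is therefore an assumption, as is filling empty fields with `NA` and replacing spaces with underscores. `reformat_bold_headers` is a hypothetical helper, not the procedure that was actually used to edit headers_edit.fasta.

```shell
# Hypothetical helper: drop the default taxon (field 2) and label fields
# 6 to 16 of each '|'-separated FASTA header; other lines pass through.
reformat_bold_headers() {
        awk 'BEGIN {
                FS = OFS = "|"
                n = split("fam-|subfam-|tri-|gs-|sp-|subsp-|country-|exactsite-|lat_|lon_|elev-", prefix, "|")
        }
        /^>/ {
                out = $1 OFS $3 OFS $4 OFS $5          # ID, phylum, class, order
                for (k = 1; k <= n; k++) {
                        value = $(k + 5)
                        if (value == "") value = "NA"  # empty fields become NA
                        gsub(/ /, "_", value)          # spaces become underscores
                        out = out OFS prefix[k] value
                }
                print out
                next
        }
        { print }' "$@"
}

printf '%s\n' '>PMANL5032-15|Sarrothripini|Arthropoda|Insecta|Lepidoptera|Nolidae|Chloephorinae|Sarrothripini||||Kenya|Muhaka Forest|-4.325|39.525|50.0' 'ACGT' | reformat_bold_headers
# prints: >PMANL5032-15|Arthropoda|Insecta|Lepidoptera|fam-Nolidae|subfam-Chloephorinae|tri-Sarrothripini|gs-NA|sp-NA|subsp-NA|country-Kenya|exactsite-Muhaka_Forest|lat_-4.325|lon_39.525|elev-50.0
#         ACGT
```

Headers with a different number of fields would be labelled wrongly by this sketch, so the field count should be checked first.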

Then the formatted headers are substituted into the actual sequence.fasta file using a function, see below:

In [2]:
%%bash
replacing_headers() { #This function takes an input file of edited FASTA-format headers, searches through a FASTA-format sequence file and substitutes its headers where their unique IDs match
        if [ $# -eq 0 ]
        then
                echo "Input error..."
                echo "Usage: ${FUNCNAME[0]} seq.fasta [seq2.fasta seq3.fasta ...]"
                return
        fi

        unset headers
        until [[ ( -f "$headers" ) && ( `basename -- "$headers"` =~ .*_(fasta|fa|afa) ) ]]
        do
                echo -e "\nFor the headers_[aln|fasta|fa|afa] input provide the full path to the file, the filename included."
                read -p "Please enter the file to be used as the FASTA headers source: " headers
                #Clean the headers file only once an existing file has been given. Characters handled: carriage returns, spaces, &, /, ', [ and ]
                [ -f "$headers" ] && sed -i "s/\r$//g; s/ /_/g; s/\&/_n_/g; s/\//+/g; s/'//g; s/\[//g; s/\]//g" "$headers"
        done

        echo -e "\n\tStarting operation....\n\tPlease wait, this may take a while...."
        for i in "$@"
        do
                unset records
                number_of_replacements=0
                records=$( grep ">" $i | wc -l )
                unset x
                unset y
                unset z
                echo -e "\nProceeding with `basename -- $i`..."
                for line in `cat ${headers}`
                do
                        #x=$( head -10 idrck_headers | tail -1 | awk 'BEGIN { FS="|"; }{print $1;}') && echo $x
                        x=`echo "$line" | ${AWK_EXEC} 'BEGIN { RS="\n"; FS="|"; }{ x = $1; print x; }'`
                        y=`echo "$line" | ${AWK_EXEC} 'BEGIN { RS="\n"; FS="|"; }{ y = $0; print y; }'`
                        #echo -e "\n $x \n $y"

                        #Characters to replace in the headers because they affect the performance of sed: carriage returns (^M), white spaces ( ), forward slashes (/) and ampersands (&); they greatly hamper the next step of header substitution.
                        sed -i "s/\r$//g; s/ /_/g; s/\&/_n_/g; s/\//+/g; s/'//g; s/\[//g; s/\]//g" "$i"

                        z=`grep "$x" $i`
                        #echo "$z"
                        for one_z in `echo -e "${z}"`
                        do
                                if [ "$one_z" == "$y" ]
                                then
                                        echo -e "Change for ${x} already in place..."
                                        continue
                                else
                                        echo -e "Substituting header for ${x}..."
                                        sed -i "s/${one_z}/${y}/g" $i
                                        #sed -i "s/^.*\b${x}\b.*$/${y}/g" $i
                                fi
                                number_of_replacements=$( expr $number_of_replacements + 1 )
                        done
                done
                echo -e "\nDONE. $number_of_replacements replacements done in `basename -- $i` out of $records records it has"
        done
        echo -e "\n\tCongratulations...Operation done."
}
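
For large files, running one `sed -i` per header can be slow. A possible single-pass alternative is sketched below. It is not part of process_all_input_files.sh: the name `replace_headers_awk` is hypothetical, it matches on the exact process ID in the first `|`-separated field rather than on a `grep` substring, and it writes to standard output instead of editing the file in place.

```bash
# Hypothetical alternative to replacing_headers: one awk pass per FASTA file.
# Usage: replace_headers_awk edited_headers.fasta seq.fasta > seq_new.fasta
replace_headers_awk() {
    awk 'BEGIN { FS = "|" }
    NR == FNR { if (/^>/) new[$1] = $0; next }     # first file: index edited headers by ">processID"
    /^>/ && ($1 in new) { print new[$1]; next }     # second file: swap in the edited header
    { print }                                       # sequences and unmatched headers pass through
    ' "$1" "$2"
}
```

Because the edited headers are held in an array keyed by process ID, each sequence file is read only once, whatever the number of headers.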


#### **4.2.1 Working on idrck_headers.csv**
A single spreadsheet file (.xls) was retrieved from [BOLDSystems Version 3](http://v3.boldsystems.org/) ([Version 4](http://www.boldsystems.org/) proved unworkable) for this dataset. It contains all the record information from which the headers were built.  
The necessary columns ("Process ID", "Phylum", "Class", "Order", "Family",  "Subfamily", "Tribe", "Genus", "Species", "Subspecies", "Country/Ocean", "Exact Site", "Lat", "Lon", "Elev") were copied into a single spreadsheet, all "," were replaced with "--" using find and replace, and the result was saved in text format, i.e. CSV (comma-separated values).
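
Replacing the embedded commas matters because the later awk step splits each row on ",". A quick way to confirm that every row still has the expected 15 fields is sketched below. The helper name `check_csv_fields` is hypothetical and the check is an optional extra, not a step from the original workflow.

```bash
# Hypothetical check: report rows whose field count differs from the expected number.
# Usage: check_csv_fields file.csv [expected_fields]   (default: 15)
check_csv_fields() {
    awk -v n="${2:-15}" 'BEGIN { FS = "," }
    NF != n { printf "line %d has %d fields\n", NR, NF; bad = 1 }   # flag offending rows
    END { exit bad + 0 }                                            # non-zero exit if any row was flagged
    ' "$1"
}
```

A call such as `check_csv_fields idrck_headers.csv` prints nothing and returns 0 when every row splits into 15 fields.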

In [20]:
%%bash
cd /home/kibet/bioinformatics/github/co1_metaanalysis/data/input/input_data/unpublished/
ls *.csv
head -10 idrck_headers.csv

idrck_headers.csv
"Process ID","Phylum","Class","Order","Family","Subfamily","Tribe","Genus","Species","Subspecies","Country/Ocean","Exact Site","Lat","Lon","Elev"
"KALG100-10","Rhodophyta","Florideophyceae","Ceramiales","Rhodomelaceae",,,"Acanthophora","Acanthophora spicifera",,"Kenya",,-4.26,39.599,
"KALG099-10","Rhodophyta","Florideophyceae","Ceramiales","Rhodomelaceae",,,"Acanthophora","Acanthophora spicifera",,"Kenya",,-4.26,39.599,
"KALG098-10","Rhodophyta","Florideophyceae","Ceramiales","Rhodomelaceae",,,"Acanthophora","Acanthophora spicifera",,"Kenya",,-4.26,39.599,
"KALG097-10","Rhodophyta","Florideophyceae","Ceramiales","Rhodomelaceae",,,"Acanthophora","Acanthophora spicifera",,"Kenya",,-4.26,39.599,
"KALG096-10","Rhodophyta","Florideophyceae","Ceramiales","Rhodomelaceae",,,"Acanthophora","Acanthophora spicifera",,"Kenya",,-4.26,39.599,
"KALG141-10","Rhodophyta","Florideophyceae","Ceramiales","Rhodomelaceae",,,"Bostrychia","Bostrychia radicans",,"Kenya",,-3.944,39.774,
"KALG1

**Generating Headers**  
All strings are saved with a '"' delimiter in the .csv format. This delimiter is removed with the command `sed -i 's/"//g' idrck_headers.csv`, as shown below. The FASTA format headers are then generated using the command `awk 'BEGIN{FS=","; OFS="|"}; NR == 1 { next }; { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i = "NA" }; {print ">" $1, $2, $3, $4, "fam-"$5, "subfam-"$6, "tri-"$7, "gs-"$8, "sp-"$9, "subsp-"$10, "country-"$11, "exactsite-"$12, "lat_"$13, "lon_"$14, "elev-"$15}' idrck_headers.csv > idrck_headers.fasta`:

In [23]:
%%bash
cd /home/kibet/bioinformatics/github/co1_metaanalysis/data/input/input_data/unpublished/
sed -i 's/"//g' idrck_headers.csv
awk 'BEGIN{FS=","; OFS="|"}; NR == 1 { next }; { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i = "NA" }; {print ">" $1, $2, $3, $4, "fam-"$5, "subfam-"$6, "tri-"$7, "gs-"$8, "sp-"$9, "subsp-"$10, "country-"$11, "exactsite-"$12, "lat_"$13, "lon_"$14, "elev-"$15}' idrck_headers.csv > idrck_headers.fasta
ls idrck*
head -10 idrck_headers.fasta

idrck.fasta
idrck_headers
idrck_headers1.csv
idrck_headers_all
idrck_headers.csv
idrck_headers.fasta
idrck_headers.ods
idrck_orders
idrck.xls
>KALG100-10|Rhodophyta|Florideophyceae|Ceramiales|fam-Rhodomelaceae|subfam-NA|tri-NA|gs-Acanthophora|sp-Acanthophora spicifera|subsp-NA|country-Kenya|exactsite-NA|lat_-4.26|lon_39.599|elev-NA
>KALG099-10|Rhodophyta|Florideophyceae|Ceramiales|fam-Rhodomelaceae|subfam-NA|tri-NA|gs-Acanthophora|sp-Acanthophora spicifera|subsp-NA|country-Kenya|exactsite-NA|lat_-4.26|lon_39.599|elev-NA
>KALG098-10|Rhodophyta|Florideophyceae|Ceramiales|fam-Rhodomelaceae|subfam-NA|tri-NA|gs-Acanthophora|sp-Acanthophora spicifera|subsp-NA|country-Kenya|exactsite-NA|lat_-4.26|lon_39.599|elev-NA
>KALG097-10|Rhodophyta|Florideophyceae|Ceramiales|fam-Rhodomelaceae|subfam-NA|tri-NA|gs-Acanthophora|sp-Acanthophora spicifera|subsp-NA|country-Kenya|exactsite-NA|lat_-4.26|lon_39.599|elev-NA
>KALG096-10|Rhodophyta|Florideophyceae|Ceramiales|fam-Rhodomelaceae|subfam-NA|tri-NA|gs-Ac

**Replacing/Substituting Headers**

In [24]:
%%bash
cd /home/kibet/bioinformatics/github/co1_metaanalysis/data/input/input_data/unpublished/
head -5 idrck.fasta # Before headers are replaced
. ../../../../code/process_all_input_files.sh
replacing_headers idrck.fasta << EOF
./idrck_headers.fasta
EOF
head -5 idrck.fasta # After headers are replaced

>KALG100-10|Rhodophyta|Florideophyceae|Ceramiales|fam-Rhodomelaceae|subfam-NA|tri-NA|gs-Acanthophora|sp-Acanthophora_spicifera|subsp-NA|country-Kenya|exactsite-NA|lat_-4.26|lon_39.599|elev-NA
TACTTTATACTTAATTTTTGGAGCTTTTTCTGGAATATTAGGAGGTTGTATGTCAATGTTAATTCGTATGGAATTGGCTCAGCCTGGTAATCAATTACTTTTAGGTAATCATCAAGTTTACAATGTTCTTATCACAGCCCACGCATTTTTAATGATATTTTTTATGGTTATGCCAGTGATGATCGGAGGTTTTGGTAATTGATTTGTACCTATTATGATAGGTAGTCCTGATATGGCATTCCCTCGATTAAATAATATTTCCTTTTGATTATTACCACCTTCATTATGTCTGTTATTATTATCATCCGTAGTAGAAGTAGGTACAGGTACAGGTTGAACTGTTTATCCTCCATTAAGTTCTATACAAAGTCATTCAGGAGCTTCTGTTGATTTAGCAATATTTAGTTTACATTTATCAGGAGCTTCCTCTATTCTAGGTGCAATTAATTTTATTTCTACAATATTAAATATGCGTAATCCTGGACAAACATTTTATAGAATTCCGTTATTTGTTTGGGCAATTTTTGTTACAGCATTTTTATTATTATTAGCAGTTCCAGTATTAGCAGGAGCGATAACAATGTTATTAACTGATAGGAATTTTAATACCTCTTTTTTTGATCCAGCAGGAGGTGGAGATCCTATTCTTTACCAACATTTATTT
>KALG099-10|Rhodophyta|Florideophyceae|Ceramiales|fam-Rhodomelaceae|subfam-NA|tri-NA|gs-Acanthophora|sp-Acanthophora_spicifera|subsp-NA|country

#### **4.2.2 Working with:**  
1. **GMTAH,GMTAI and GMTAJ:** (merged into Mpala_Kinondo_Turkana_Malaise_traps.fa) 
2. **DS-KENFRUIT**
3. **DS-MPALALEP** and
4. **DS-TBILE**  

Through my BOLDSystems account I have access to the above projects. Of these, only DS-TBILE is publicly available; the merged dataset GMTAH-GMTAI-GMTAJ, DS-KENFRUIT and DS-MPALALEP are not.  
These unpublished datasets were downloaded with permission from their respective data managers. Each comes as two files: a spreadsheet with all metadata (`*.xlsx`) and a sequence file (`*.fa`) whose headers were further edited as shown below.  
The .fa files are the unedited FASTA format sequence files as downloaded.

In [26]:
%%bash
cd /home/kibet/bioinformatics/github/co1_metaanalysis/data/input/input_data/unpublished/
ls

DS-KENFRUIT_edit.fa
DS-KENFRUIT.fa
DS-KENFRUIT.fasta
DS-KENFRUIT_headers
DS-KENFRUIT.xlsx
DS-MPALALEP_edit.fa
DS-MPALALEP.fa
DS-MPALALEP.fasta
DS-MPALALEP_headers
DS-MPALALEP.xlsx
headers_edit.fasta
idrck.fasta
idrck_headers
idrck_headers_all
idrck_headers.csv
idrck_headers.fasta
idrck_orders
idrck.xls
Mpala_Kinondo_Turkana_Malaise_traps_edit.fa
Mpala_Kinondo_Turkana_Malaise_traps.fa
Mpala_Kinondo_Turkana_Malaise_traps.fasta
Mpala_Kinondo_Turkana_Malaise_traps_insecta.fasta
Mpala_Kinondo_Turkana_Malaise_traps_unedited.fasta
Mpala_Kinondo_Turkana_Malaise_traps.xlsx
Mpala_Kinondo_Turkana_tagged.xlsx


#### **Editing the headers**  
The headers in the \*.fa files are not well formatted by default. They look as shown below:  
>\>GMKMV173-15|Cicadellidae|Arthropoda|Insecta|Hemiptera|Cicadellidae||||||Kenya|Mpala Research Centre|0.293|36.899|1650.0
>\>PMANL5032-15|Sarrothripini|Arthropoda|Insecta|Lepidoptera|Nolidae|Chloephorinae|Sarrothripini||||Kenya|Muhaka Forest|-4.325|39.525|50.0 

Editing these headers to the standard used for the earlier datasets deletes the default taxon and introduces prefixes to the header fields so that they are easy to interpret:
1. "fam-" for family taxon name
2. "subfam-" for subfamily taxon name
3. "tri-" for tribe taxon name
4. "gs-" for genus taxon name
5. "sp-" for species taxon name
6. "subsp-" for subspecies taxon name
7. "country-" for country of origin name
8. "exactsite-" for exact site of origin name
9. "lat_" for latitude co-ordinate
10. "lon_" for longitude co-ordinate
11. "elev-" for elevation
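
Once headers have been edited, the prefix scheme listed above can be checked mechanically. The sketch below is an optional sanity check and not part of the original workflow. The function name `check_headers` is hypothetical, and it assumes the 15-field layout, i.e. headers without any extra fields appended later.

```bash
# Hypothetical check: list headers that do not follow the standard 15-field layout.
# Usage: check_headers seqs.fasta   (prints nothing if every header conforms)
check_headers() {
    awk 'BEGIN {
        FS = "|"
        # expected prefixes of fields 5..15, in order
        split("fam- subfam- tri- gs- sp- subsp- country- exactsite- lat_ lon_ elev-", p, " ")
    }
    /^>/ {
        ok = (NF == 15)
        for (i = 1; ok && i <= 11; i++)
            if (index($(i + 4), p[i]) != 1) ok = 0      # field must start with its prefix
        if (!ok) print FILENAME ":" FNR ": " $1          # report file, line number and process ID
    }' "$1"
}
```

Each reported line gives the file name, the line number and the process ID of a header that still needs editing.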

In [None]:
%%bash
cd /home/kibet/bioinformatics/github/co1_metaanalysis/data/input/input_data/unpublished/
#Generating copies of the seq. files
cp DS-KENFRUIT.fa DS-KENFRUIT_edit.fa
cp DS-MPALALEP.fa DS-MPALALEP_edit.fa
cp Mpala_Kinondo_Turkana_Malaise_traps.fa Mpala_Kinondo_Turkana_Malaise_traps_edit.fa

source ../../../../code/process_all_input_files.sh
for i in $(ls Mpala_Kinondo_Turkana_Malaise_traps_edit.fa); do grep ">" $i | awk 'BEGIN {FS="|"; OFS="|"}; { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i = "NA" }; {print $1, $3, $4, $5, "fam-"$6, "subfam-"$7, "tri-"$8, "gs-"$9, "sp-"$10, "subsp-"$11, "country-"$12, "exactsite-"$13, "lat_"$14, "lon_"$15, "elev-"$16 }' > headers_edit.fasta; replacing_headers $i << EOF
./headers_edit.fasta
EOF
done
cp DS-KENFRUIT.fa DS-KENFRUIT.fasta
cp DS-MPALALEP.fa DS-MPALALEP.fasta
#cp Mpala_Kinondo_Turkana_Malaise_traps_edit.fa Mpala_Kinondo_Turkana_Malaise_traps.fasta

##### **Mpala_Kinondo_Turkana_Malaise_traps.fa**  
This particular dataset has 37,856 records with varying sequence lengths. Such a big dataset will not align well unless it is first subset by sequence length; the subsets are then aligned, cleaned and ultimately merged into a single good alignment.  
This was done as follows:

In [None]:
%%bash
cd /home/kibet/bioinformatics/github/co1_metaanalysis/data/input/input_data/unpublished/

#creating a working copy
cp Mpala_Kinondo_Turkana_Malaise_traps.fa input.fasta
#removing gaps. The output is "input_dgpd.fasta", without gaps, "-".
source ../../../../code/process_all_input_files.sh
remove_gaps input.fasta
#Introducing a field "l-xxx" that has the length of the sequence in the header
awk '/^>/{hdr=$0; next}
    { seq=$0 } match(seq,/^.*$/) { LEN=RLENGTH }
    { print hdr"|l-"LEN; print seq }' input_dgpd.fasta > input_dgpd_edited.fasta

mv input_dgpd_edited.fasta Mpala_Kinondo_Turkana_Malaise_traps.fasta

#Understanding the taxonomic representation of the various orders
awk 'BEGIN{FS="|"}; /^>/{print $4}' Mpala_Kinondo_Turkana_Malaise_traps_all.fasta | sort | uniq -c | less
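
The length-tagging awk command above measures one line at a time, so it assumes every sequence sits on a single line. If a file has sequences wrapped over several lines, a variant along the following lines could be used instead. This is a sketch under that assumption, and the name `add_length_field` is hypothetical.

```bash
# Hypothetical helper: append "|l-<length>" to each header, joining wrapped sequence lines.
# Usage: add_length_field in.fasta > out.fasta
add_length_field() {
    awk 'function emit() { if (hdr != "") { print hdr "|l-" length(seq); print seq } }
    /^>/ { emit(); hdr = $0; seq = ""; next }     # new record: write out the previous one
    { gsub(/[ \t\r]/, ""); seq = seq $0 }         # accumulate sequence lines, ignoring whitespace
    END { emit() }                                # write out the last record
    ' "$1"
}
```

The output has exactly one sequence line per record, which also suits the line-based tools used in the rest of this section.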

**Understanding the taxonomic distribution of the many orders within Mpala_Kinondo_Turkana_Malaise_traps_all.fasta**

In [5]:
%%bash
cd /home/kibet/bioinformatics/github/co1_metaanalysis/data/input/input_data/unpublished/
echo -e "Phyla represented: "
awk 'BEGIN{FS="|"}; /^>/{print $2}' Mpala_Kinondo_Turkana_Malaise_traps_all.fasta | sort | uniq -c | less
echo -e "\nClasses represented in Arthropoda phylum: "
grep "Arthropoda"  Mpala_Kinondo_Turkana_Malaise_traps_all.fasta | awk 'BEGIN{FS="|"}; /^>/{print $3}' | sort | uniq -c | less
echo -e "\nOrders represented in Insecta class: "
grep "Insecta"  Mpala_Kinondo_Turkana_Malaise_traps_all.fasta | awk 'BEGIN{FS="|"}; /^>/{print $4}' | sort | uniq -c | less

Phyla represented: 
  37846 Arthropoda
     10 Mollusca

Classes represented in Arthropoda phylum: 
    820 Arachnida
    833 Collembola
     52 Diplopoda
  36136 Insecta
      5 Malacostraca

Orders represented in Insecta class: 
      1 Archaeognatha
    232 Blattodea
   2121 Coleoptera
  17827 Diptera
      1 Embioptera
   3912 Hemiptera
   8037 Hymenoptera
   3301 Lepidoptera
     25 Mantodea
     10 Mecoptera
     59 Neuroptera
    175 Orthoptera
    307 Psocodea
    121 Thysanoptera
      6 Trichoptera
      1 Zygentoma


**Splitting Mpala_Kinondo_Turkana_Malaise_traps.fasta into nucleotide length dependent files**

In [None]:
%%bash
cd /home/kibet/bioinformatics/github/co1_metaanalysis/data/input/input_data/unpublished/
source ../../../../code/process_all_input_files.sh

#Extracting records from Insecta class
delete_unwanted Mpala_Kinondo_Turkana_Malaise_traps.fasta << EOF
1
Insecta
2
EOF

#Splitting
subset_seqs Mpala_Kinondo_Turkana_Malaise_traps_Insecta.fasta << EOF
N
EOF
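
`delete_unwanted` and `subset_seqs` are defined in process_all_input_files.sh and are not shown here. Purely as an illustration of the idea of length-dependent splitting, the sketch below routes each record to one of two files using the "l-xxx" tag added to the headers earlier. The function name `split_by_length`, the output file suffixes and the default cutoff of 500 are all hypothetical and are not the values used by `subset_seqs`.

```bash
# Hypothetical sketch: route each record to a file chosen by its "l-<length>" tag.
# Usage: split_by_length in.fasta out_prefix [cutoff]   (cutoff defaults to an arbitrary 500)
split_by_length() {
    awk -v prefix="$2" -v cutoff="${3:-500}" '
    /^>/ {
        n = $0; sub(/^.*[|]l-/, "", n)            # keep only the text after the last "|l-"
        # headers without a tag evaluate to 0 and therefore go to the "short" file
        out = prefix ((n + 0 < cutoff) ? "_short.fasta" : "_long.fasta")
    }
    out != "" { print > out }                     # header and its sequence lines share a file
    ' "$1"
}
```

A call such as `split_by_length Mpala_Kinondo_Turkana_Malaise_traps.fasta subset 500` would then write `subset_short.fasta` and `subset_long.fasta`.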