# **Data downloading and transformation**

## **Goals**
1. Retrieve data from BOLD Systems.
2. Transform the XML files to text files (.tsv).
3. Build FASTA format sequences from the text files.
4. Retrieve unpublished data from BOLD Systems and reformat the headers.

### **1. Retrieving data from BOLD Systems**
The BOLD Systems v4 [PUBLIC DATA API](http://www.boldsystems.org/index.php/resources/api?type=webservices) is used to export **Full Data Retrieval (Specimen + Sequence)** records for each country named in an input file. Each download is saved in a file named after its country in the default destination directory "co1_metaanalysis/data/input/input_data/bold_africa/".

In [None]:
%%bash
bolddata_retrival() { # This function retrieves the data of every country named in the input. The input is one or more files listing the names of the selected countries, one name per line

        usage $@
        echo -e "\n\tDownloading data of countries named in $@ from www.boldsystems.org"

        IFS=$'\n'

        for i in `cat $@`
        do
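                # The URL is quoted so that the shell does not treat '&' as a control operator; XML is requested because step 2 parses XML files
                wget --show-progress --progress=bar:noscroll --retry-connrefused -t inf -O "${inputdata_path}bold_africa/${i}.xml" -a "${inputdata_path}wget_log" "http://www.boldsystems.org/index.php/API_Public/combined?geo=${i}&taxon=arthropoda&format=xml"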
        done
}
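
The query URL that the function above sends to the API can be sketched on its own. This is a minimal illustration, not part of `process_all_input_files.sh`: the helper names `bold_query_url` and `list_bold_urls` are hypothetical, only spaces are percent-encoded, and `format=xml` is assumed because the next step parses XML.

```bash
# Hypothetical helper: build the BOLD "combined" query URL for one country.
bold_query_url() {
    local country=$1
    local base="http://www.boldsystems.org/index.php/API_Public/combined"
    # Only spaces are percent-encoded in this sketch (assumption: country
    # names contain no other characters that need encoding).
    printf '%s?geo=%s&taxon=arthropoda&format=xml\n' "$base" "${country// /%20}"
}

# Hypothetical helper: print one URL per non-empty line of a country list file.
list_bold_urls() {
    local country
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while IFS= read -r country || [ -n "$country" ]; do
        if [ -n "$country" ]; then
            bold_query_url "$country"
        fi
    done < "$1"
}
```

Calling `list_bold_urls country` prints the URLs without downloading anything, which makes it easy to check the queries before starting a long `wget` run.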

**Running:**

In [2]:
%%bash
cd /home/kibet/bioinformatics/github/co1_metaanalysis/code/
source ./process_all_input_files.sh
bolddata_retrival #country  #Uncomment the word "country" to download from a list of countries in the file country (canada)

Input error...
Usage: bolddata_retrival file1.*[file2.* file3.* ...]

	Downloading data of countries named in  from www.boldsystems.org


### **2. Transformation of the XML files to tsv**
Here we use the Python 3 packages **BeautifulSoup4** and **pandas**.  
(**N/B:** An attempt in R ([01.02.R_xml_to_tsv.ipynb](./01.02.R_xml_to_tsv.ipynb)) did not work well.)  
For more on the logic behind the extraction script, see the Jupyter notebook [01.01.xml_to_tsv.ipynb](./01.01.xml_to_tsv.ipynb).  
The country-specific XML files are converted to text (.tsv) files.

In [None]:
%%bash
build_tsv() { #This function generates .tsv files from .xml files using a Python script with the BeautifulSoup4 and pandas packages

        usage $@

        TAB=$(printf '\t')

        echo "generating .tsv files from .xml downloads"

        for i in "$@"
        do
                if [ ! -f $i ]
                then
                        echo "input error: file '$i' is non-existent!"
                elif [[ ( -f $i ) && ( `basename -- "$i"` =~ .*\.(xml) ) ]]
                then
                        rename
                        echo -e "\nLet us proceed with file '${input_filename}'..."
                        sed 's/class/Class/g' "$i" | sed "s/$TAB/,/g" > ${inputdata_path}bold_africa/input.xml
                        ${PYTHON_EXEC} ${xml_to_tsv} ${inputdata_path}bold_africa/input.xml && mv output.tsv ${inputdata_path}bold_africa/${output_filename}.tsv
                else
                        echo "input file error in $(basename -- "$i"): input file should be a .xml file format"
                        continue
                fi
        done
}
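
Before the Python script runs, `build_tsv` passes each XML file through two `sed` substitutions. The sketch below isolates that step so its effect can be seen on a single line. The helper name `sanitize_xml` is hypothetical, and the reasons are assumptions: renaming `class` presumably avoids a clash with the Python keyword during parsing, and replacing tabs keeps stray tabs out of the tab-separated output.

```bash
# Hypothetical helper mirroring the two sed steps in build_tsv.
# Reads XML on stdin and writes the sanitized text to stdout.
sanitize_xml() {
    local tab
    tab=$(printf '\t')
    # 1) "class" -> "Class" everywhere, including inside longer words
    # 2) every literal tab -> comma
    sed -e 's/class/Class/g' -e "s/${tab}/,/g"
}

# Example:
#   printf '<class>Insecta</class>\t<notes>a\tb</notes>\n' | sanitize_xml
```

Note that the first substitution is not restricted to tag names, so a value such as `subclass` also becomes `subClass`.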

**Running:**

In [3]:
%%bash
cd /home/kibet/bioinformatics/github/co1_metaanalysis/code/
source ./process_all_input_files.sh
build_tsv #../data/input/input_data/bold_africa/kenya.xml  #Uncomment the path to execute the function

Input error...
Usage: build_tsv file1.*[file2.* file3.* ...]
generating .tsv files from .xml downloads


### **3. Build FASTA sequences from the .tsv text files**
FASTA files are not built directly from the country-specific text (.tsv) files; the records are cleaned and sorted first.
1. First, only the records with an Insecta genus-name tag are extracted, and all records of non-COI-5P markers are removed  
2. Then ALL the records are regrouped into subsets based on sequence length  
3. Then fourteen 100-record samples are drawn at random from these groups, to be used in the development and testing of the bioinformatics analysis pipelines  
4. Finally, the regrouped subsets and the samples are converted to FASTA format sequences  

There are two R scripts:  
1. [data_cleanup_n_sampling.R](./data_cleanup_n_sampling.R): cleans, sorts and samples the test data (East African data: Kenya, Tanzania, Uganda, Rwanda, Burundi, Ethiopia and South Sudan).  
See [02.00.Data_cleanup](./02.00.Data_cleanup.ipynb) for more information on steps 1 to 3.  
2. [data_cleanup.R](./data_cleanup.R): cleans and sorts all country-specific records
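
The grouping by sequence length is done inside the R scripts, which are not reproduced here. As a rough illustration of how one record maps onto the subsets named in the output files, the following sketch assumes inclusive boundaries (for example, 500 and 700 both fall in `500to700`); the helper name `seq_length_groups` is hypothetical and the actual cut-offs in the R scripts may differ.

```bash
# Hypothetical helper: print the length-based subsets a sequence of the given
# length would fall into. Boundaries are assumed to be inclusive.
seq_length_groups() {
    local len=$1 groups="all"
    if [ "$len" -lt 500 ]; then
        groups+=" Under500"
    else
        groups+=" Over499"
    fi
    if [ "$len" -ge 500 ] && [ "$len" -le 700 ]; then groups+=" 500to700"; fi
    if [ "$len" -ge 650 ] && [ "$len" -le 660 ]; then groups+=" 650to660"; fi
    if [ "$len" -gt 700 ]; then groups+=" Over700"; fi
    echo "$groups"
}
```

The subsets overlap by design: a 655-nucleotide record belongs to `all`, `Over499`, `500to700` and `650to660` at the same time.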

**To sort the data for all country specific records into the groups defined by sequence length do as follows:**

In [None]:
%%bash
clean_sort_tsv() { #This function cleans the .tsv files, sorts the records into different files based on sequence length and finally appends these files to the cumulative files for the different input files

        usage $@

        echo "cleaningup and sorting .tsv files "

        output_files_africa=("${inputdata_path}clean_africa/afroCOI_500to700_data.tsv" "${inputdata_path}clean_africa/afroCOI_650to660_data.tsv" "${inputdata_path}clean_africa/afroCOI_all_data.tsv" "${inputdata_path}clean_africa/afroCOI_Over499_data.tsv" "${inputdata_path}clean_africa/afroCOI_Over700_data.tsv" "${inputdata_path}clean_africa/afroCOI_Under500_data.tsv")

        output_files_eafrica=("${inputdata_path}clean_eafrica/eafroCOI_500to700_data.tsv" "${inputdata_path}clean_eafrica/eafroCOI_650to660_data.tsv" "${inputdata_path}clean_eafrica/eafroCOI_all_data.tsv" "${inputdata_path}clean_eafrica/eafroCOI_Over499_data.tsv" "${inputdata_path}clean_eafrica/eafroCOI_Over700_data.tsv" "${inputdata_path}clean_eafrica/eafroCOI_Under500_data.tsv")


        for i in ${output_files_africa[@]}
        do
                grep "processid" $1 > $i && echo -e "\nInput file $i is set"
        done

        for i in ${output_files_eafrica[@]}
        do
                grep "processid" $1 > $i && echo -e "\nInput file $i is set"
        done


        for i in "$@"
        do
                if [ ! -f $i ]
                then
                        echo "input error: file '$i' is non-existent!"
                elif [[ ( -f $i ) && ( `basename -- "$i"` =~ .*\.(tsv) ) ]]
                then
                        rename
                        echo -e "\nLet us proceed with file '${input_filename}'..."
                        ${RSCRIPT_EXEC} --vanilla ${data_cleanup} $i
                        case $output_filename in
                                Algeria|Madagascar|Angola|Malawi|Benin|Mali|Botswana|Mauritania|Burkina_Faso|Mauritius|Morocco|Cameroon|Mozambique|Cape_Verde|Namibia|Central_African_Republic|Nigeria|Chad|Niger|Comoros|Republic_of_the_Congo|Cote_d_Ivoire|Reunion|Democratic_republic_of_the_Congo|Djibouti|Sao_Tome_and_Principe|Egypt|Senegal|Equatorial_Guinea|Seychelles|Eritrea|Sierra_Leone|Somalia|Gabon|South_Africa|Gambia|Ghana|Sudan|Guinea-Bissau|Swaziland|Guinea|Togo|Tunisia|Lesotho|Liberia|Zambia|Libya|Zimbabwe)
                                        input=${inputdata_path}clean_africa/COI_500to700_data.tsv
                                        output=${output_files_africa[0]}
                                        append_tsvfile

                                        input=${inputdata_path}clean_africa/COI_650to660_data.tsv
                                        output=${output_files_africa[1]}
                                        append_tsvfile

                                        input=${inputdata_path}clean_africa/COI_all_data.tsv
                                        output=${output_files_africa[2]}
                                        append_tsvfile

                                        input=${inputdata_path}clean_africa/COI_Over499_data.tsv
                                        output=${output_files_africa[3]}
                                        append_tsvfile

                                        input=${inputdata_path}clean_africa/COI_Over700_data.tsv
                                        output=${output_files_africa[4]}
                                        append_tsvfile

                                        input=${inputdata_path}clean_africa/COI_Under500_data.tsv
                                        output=${output_files_africa[5]}
                                        append_tsvfile
                                        ;;
                                *)
                                        input=${inputdata_path}clean_africa/COI_500to700_data.tsv
                                        output=${output_files_eafrica[0]}
                                        append_tsvfile

                                        input=${inputdata_path}clean_africa/COI_650to660_data.tsv
                                        output=${output_files_eafrica[1]}
                                        append_tsvfile

                                        input=${inputdata_path}clean_africa/COI_all_data.tsv
                                        output=${output_files_eafrica[2]}
                                        append_tsvfile

                                        input=${inputdata_path}clean_africa/COI_Over499_data.tsv
                                        output=${output_files_eafrica[3]}
                                        append_tsvfile

                                        input=${inputdata_path}clean_africa/COI_Over700_data.tsv
                                        output=${output_files_eafrica[4]}
                                        append_tsvfile

                                        input=${inputdata_path}clean_africa/COI_Under500_data.tsv
                                        output=${output_files_eafrica[5]}
                                        append_tsvfile
                                        ;;
                        esac
                else
                        echo "input file error in $(basename -- "$i"): input file should be a .tsv file format"
                        continue
                fi
        done
}
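
Both branches of the `case` statement above repeat the same input/output/append pattern six times. A loop over the group names expresses the same idea more compactly. This is only a sketch: `append_all_groups` is a hypothetical name, and the stand-in `append_tsvfile` below assumes that the real function appends the records of `$input`, without its header line, to `$output`.

```bash
# Stand-in for the script's append_tsvfile (assumed behavior: append the
# records of $input, minus the header line, to $output).
append_tsvfile() {
    tail -n +2 "$input" >> "$output"
}

# Hypothetical helper: append every per-country subset found in src_dir to
# the cumulative file with the given prefix (e.g. afro, eafro) in dest_dir.
append_all_groups() {
    local src_dir=$1 dest_dir=$2 prefix=$3 group
    for group in 500to700 650to660 all Over499 Over700 Under500; do
        # input and output stay global, as in the original script
        input="${src_dir}/COI_${group}_data.tsv"
        output="${dest_dir}/${prefix}COI_${group}_data.tsv"
        if [ -f "$input" ]; then
            append_tsvfile
        fi
    done
}
```

With such a helper, each `case` branch would reduce to a single call that differs only in the destination directory and prefix.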


**Running:**

In [6]:
%%bash
cd /home/kibet/bioinformatics/github/co1_metaanalysis/code/
source ./process_all_input_files.sh
clean_sort_tsv #../data/input/input_data/bold_africa/*.tsv

Input error...
Usage: clean_sort_tsv file1.*[file2.* file3.* ...]
cleaningup and sorting .tsv files 


**To convert the .tsv files to FASTA format files do as follows**

The sorting R script separates the East African data from the rest of Africa and stores them in two separate directories: "co1_metaanalysis/data/input/input_data/clean_eafrica" and "co1_metaanalysis/data/input/input_data/clean_africa"  

Below is the code to concatenate the two streams into one, stored in "co1_metaanalysis/data/input/input_data/clean_africa". Uncomment the redirections to write the combined files; without them `cat` prints every sequence to the notebook output.  

In [8]:
%%bash
cd /home/kibet/bioinformatics/github/co1_metaanalysis/data/input/input_data/
cat ./clean_africa/afroCOI_500to700_data.fasta ./clean_eafrica/eafroCOI_500to700_data.fasta #> ./clean_africa/enafroCOI_500to700_data.fasta
cat ./clean_africa/afroCOI_650to660_data.fasta ./clean_eafrica/eafroCOI_650to660_data.fasta #> ./clean_africa/enafroCOI_650to660_data.fasta
cat ./clean_africa/afroCOI_all_data.fasta ./clean_eafrica/eafroCOI_all_data.fasta #> ./clean_africa/enafroCOI_all_data.fasta
cat ./clean_africa/afroCOI_Over499_data.fasta ./clean_eafrica/eafroCOI_Over499_data.fasta #> ./clean_africa/enafroCOI_Over499_data.fasta
cat ./clean_africa/afroCOI_Over700_data.fasta ./clean_eafrica/eafroCOI_Over700_data.fasta #> ./clean_africa/enafroCOI_Over700_data.fasta
cat ./clean_africa/afroCOI_Under500_data.fasta ./clean_eafrica/eafroCOI_Under500_data.fasta #> ./clean_africa/enafroCOI_Under500_data.fasta





Next, a file called **"enafroCOI_500to700_data-650to660.fasta"** is generated. It holds the sequences of 500 to 700 nucleotides, excluding those of 650 to 660 nucleotides, which are already represented in enafroCOI_650to660_data.fasta.  
This uses functions in the bash script "process_all_input_files.sh" that do the necessary text processing.  
See below:

In [None]:
%%bash
        # [...] searched for the patterns
        # To get the list of orders in description_taxon_names and their frequencies, from  which to select the undesired patterns (names), do: 
        #grep ">" seqs.fasta | awk 'BEGIN {FS="|"; OFS="|" ; }; {print $2}' |sort | uniq -c > seqs_orders && less seqs_orders

        if [ $# -eq 0 ]
        then
                echo "Input error..."
                echo "Usage: ${FUNCNAME[0]} file1.*[file2.* file3.* ...]"
                return 1

        fi

        echo -e "To delete sequences with specific words in the headers please choose [Yes] to proceed or [No] to cancel"
        PS3='Select option YES to delete, [1] or NO to exit, [2]: '
        select option in YES NO
        do
                unset pattern_name

                regexp='^[a-zA-Z0-9/_ -]+$'

                case $option in
                        YES)
                                until [[ "$pattern_name" =~ $regexp ]]
                                do
                                        read -p "Please enter string pattern to be searched:: " pattern_name
                                done

                                echo -e "\n\tDeleting all records with description '$pattern_name'..."

                                for i in "$@"
                                do
                                        echo -e "\n\tProceeding with `basename -- $i`..."
                                        rename
                                        input_src=`dirname "$( realpath "${i}" )"`

                                        #awk -v name="$input_r" 'BEGIN {RS="\n>"; ORF="\n>"}; $0 ~ name {print ">"$0}' test_all.fasta | less

                                        concatenate_fasta_seqs $i
                                        $AWK_EXEC -v pattern="$pattern_name" 'BEGIN { RS="\n>"; ORS="\n"; FS="\n"; OFS="\n" }; $1 ~ pattern {print ">"$0;}' $i >> ${input_src}/${output_filename}_undesired.fasta
                                        sed -i "/$pattern_name/,+1 d" $i
                                done
                                echo -e "\n\tDONE. All deleted records have been stored in '${output_filename}_undesired.fasta'"
                                ;;
                        NO)
                                echo -e "Exiting deletion of unwanted sequences..."
                                break
                esac
        done
}

In [None]:
%%bash
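cd /home/kibet/bioinformatics/github/co1_metaanalysis/data/input/input_data/clean_africa/
source ../../../../code/process_all_input_files.sh
# The 650to660 records go first, so that delete_repeats drops their duplicates from the 500to700 set
cat enafroCOI_650to660_data.fasta enafroCOI_500to700_data.fasta > input
delete_repeats input
# Number of lines contributed by the 650to660 file
x=$(wc -l < enafroCOI_650to660_data.fasta)
# Skip those leading lines and keep the remaining records
awk -v x="$x" '{ if (NR <= x) { next } else { print $0 } }' ./input > enafroCOI_500to700_data-650to660.fasta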
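
The same exclusion can also be expressed without relying on line counts. The sketch below is an alternative illustration, not the method used in "process_all_input_files.sh": it keeps the records of one FASTA file whose header line does not occur in another, and works for multi-line sequences. The helper name `fasta_exclude` is hypothetical, and matching on the complete header line is an assumption about how duplicates are identified.

```bash
# Hypothetical helper: print the records of FASTA file $2 whose header line
# does not occur in FASTA file $1. Assumes $1 is not empty.
fasta_exclude() {
    awk '
        NR == FNR { if (/^>/) seen[$0] = 1; next }  # first file: remember headers
        /^>/      { keep = !($0 in seen) }          # second file: decide per record
        keep                                        # print lines of kept records
    ' "$1" "$2"
}

# Example (file names as used in this notebook):
#   fasta_exclude enafroCOI_650to660_data.fasta enafroCOI_500to700_data.fasta \
#       > enafroCOI_500to700_data-650to660.fasta
```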

**Converting .tsv files into FASTA files**

In [None]:
%%bash
build_fasta() { #This function generates .fasta files from .tsv files using an awk script

        usage $@

        echo "generating .fasta files from .tsv metadata files"

        for i in "$@"
        do
                if [ ! -f $i ]
                then
                        echo "input error: file '$i' is non-existent!"
                elif [[ ( -f $i ) && ( `basename -- "$i"` =~ .*\.(tsv) ) ]]
                then
                        input_src=`dirname "$( realpath "${i}" )"`
                        rename
                        echo -e "\nLet us proceed with file '${input_filename}'..."
                        ${AWK_EXEC} -f ${AWK_SCRIPT} "$i" > ${input_src}/${output_filename}.fasta
                else
                        echo "input file error in `basename -- $i`: input file should be a .tsv file format"
                        continue
                fi
        done
}
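
The awk script referenced by `${AWK_SCRIPT}` is not shown in this notebook. To illustrate the kind of conversion it performs, here is a deliberately simplified sketch that looks columns up by name and writes one FASTA record per row. The helper name `tsv_to_fasta` is hypothetical, the column names `processid` and `nucleotides` are assumptions about the .tsv layout, and the header here contains only the process ID, whereas the real headers carry more fields.

```bash
# Hypothetical helper: convert a tab-separated file with a header row into
# FASTA, using the (assumed) columns "processid" and "nucleotides".
tsv_to_fasta() {
    awk -F '\t' '
        NR == 1 {
            # Map column names to their positions
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("processid" in col) || !("nucleotides" in col)) {
                print "tsv_to_fasta: required column missing" > "/dev/stderr"
                exit 1
            }
            next
        }
        # Skip records without a sequence
        $(col["nucleotides"]) != "" {
            print ">" $(col["processid"])
            print $(col["nucleotides"])
        }
    ' "$1"
}
```

Looking columns up by name rather than by fixed position keeps the sketch working if the column order of the .tsv files changes.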

**Running:**

In [7]:
%%bash
cd /home/kibet/bioinformatics/github/co1_metaanalysis/code/
source ./process_all_input_files.sh
build_fasta #../data/input/test_data/*.tsv # For test_data(East African data sets including their samples)
build_fasta #../data/input/input_data/clean_africa/*.tsv # For All re-grouped data sets

Input error...
Usage: build_fasta file1.*[file2.* file3.* ...]
generating .fasta files from .tsv metadata files
Input error...
Usage: build_fasta file1.*[file2.* file3.* ...]
generating .fasta files from .tsv metadata files


### **4. Retrieving unpublished data from BOLD Systems and reformatting the headers**
#### **4.1 Retrieving unpublished data from [BOLD Systems](http://www.boldsystems.org/index.php/MAS_Management_UserConsole)**
First create a [BOLD Systems account](http://www.boldsystems.org/index.php/MAS_Management_NewUserApp), [log in](http://www.boldsystems.org/index.php/Login/page?destination=MAS_Management_UserConsole) and ask the data managers to share their data sets.  
The data sets shared with me are:
1. [DS-KENFRUIT](http://www.boldsystems.org/index.php/MAS_Management_DataConsole?codes=DS-KENFRUIT): managed by Dr Scott E. Miller, has 1,427 records  
2. [DS-MPALALEP](http://www.boldsystems.org/index.php/MAS_Management_DataConsole?codes=DS-MPALALEP): managed by Dr Scott E. Miller, has 2,472 records  
3. [DS-TBILE](http://www.boldsystems.org/index.php/MAS_Management_DataConsole?codes=DS-TBILE) (now publicly released): managed by Dr Scott E. Miller, has 90 records  

My container projects, each of which holds multiple data sets:
4. [IDRCK](http://www.boldsystems.org/index.php/MAS_Management_DataConsole?codes=IDRCK): Has a number of subprojects: IDRC, HIVE, KBIR, KALG, KFISH, KPLA, ARAK and KINS. Has 2,110 sequences (COI-5P=1,704, matK=139, rbcLa=267) out of 6,016 specimens and is managed by Dr. Daniel Masiga.  
5. [GMTAH](http://www.boldsystems.org/index.php/MAS_Management_DataConsole?codes=GMTAH), [GMTAI](http://www.boldsystems.org/index.php/MAS_Management_DataConsole?codes=GMTAI) and [GMTAJ](http://www.boldsystems.org/index.php/MAS_Management_DataConsole?codes=GMTAJ) projects. All three fall under the Global Malaise Program and have a combined total of 60 projects and 49,246 specimens.  
>1. GMTAH: Has 26 projects titled "Kenya Malaise Mpala 2014" with 25,514 specimens, 170 species and 21,742 sequences (COI-5P=21,737, 28S=4 and EF1-alpha=1)  
>2. GMTAI: Has 26 projects titled "Kenya Malaise Kinondo 2014" with 13,656 specimens, 57 species and 11,805 sequences (COI-5P=11,801, 28S=3 and EF1-alpha=1)  
>3. GMTAJ: Has 5 projects titled "Kenya Malaise Turkana 2014" with 10,076 specimens, 63 species and 5,175 sequences (COI-5P=5,175)  

To retrieve this data, I logged into the [BOLD Systems MAS management interface](http://www.boldsystems.org/index.php/MAS_Management_UserConsole) using the Chromium web browser and, for each project named above (DS-KENFRUIT, DS-MPALALEP, DS-TBILE, IDRCK and GMTAH-GMTAI-GMTAJ (Mpala_Kinondo_Turkana_Malaise_traps)), downloaded the spreadsheet and the sequence files to the "/co1_metaanalysis/data/input/input_data/unpublished" directory.


In [2]:
%%bash
cd /home/kibet/bioinformatics/github/co1_metaanalysis/data/input/input_data/unpublished/
ls *.fasta

DS-KENFRUIT.fasta
DS-MPALALEP.fasta
idrck.fasta
Mpala_Kinondo_Turkana_Malaise_traps.fasta


#### **4.2 Reformatting the headers to match the other headers**  
The current headers look like:
>\>GMKMV173-15|Cicadellidae|Hemiptera||||Kenya|Mpala Research Centre|0.293|36.899|1650.0  
>\>GMKMW1843-15|Dichomeris tenextrema|Lepidoptera|Dichomeris|Dichomeris tenextrema||Kenya|Mpala Research Centre|0.293|36.899|1650.0  

The edited headers look like:
>\>GMKMV173-15|Hemiptera|gs-NA|sp-NA|subsp-NA|country-Kenya|exactsite-Mpala_Research_Centre|lat_0.293|lon_36.899|elev-1650.0  
>\>GMKMW1843-15|Lepidoptera|gs-Dichomeris|sp-Dichomeris_tenextrema|subsp-NA|country-Kenya|exactsite-Mpala_Research_Centre|lat_0.293|lon_36.899|elev-1650.0  

This standardizes the headers to a common format that is useful in the downstream analysis.  
To do this, the headers of a given sequence file are first copied into a file, headers_edit.fasta, where they are edited into the right format, i.e.:  
1. Deleting the default taxon that BOLD Systems assigns automatically during the download, which is usually the lowest taxon defined in the taxonomy of that record.
2. Labelling the fields of the header by adding prefixes: gs-"genus", sp-"species", subsp-"subspecies", country-"country", exactsite-"exact site", lat_"latitude", lon_"longitude" and elev-"elevation".  
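
The two editing steps can be tried on a single header with the short sketch below. It follows the `sed` and `awk` commands used in the actual run at the end of this section, wrapped in a hypothetical helper named `reformat_header` that reads headers on stdin; the substitution of `&` and the escaping of `/` from the actual run are left out for brevity.

```bash
# Hypothetical helper: reformat BOLD FASTA headers read from stdin.
reformat_header() {
    # Fill empty fields with NA (the substitution runs twice so that runs
    # such as "||||" are fully covered) and replace spaces by underscores,
    # then drop field 2 (the default taxon) and label the remaining fields.
    sed 's/||/|NA|/g; s/||/|NA|/g; s/ /_/g' |
    awk 'BEGIN { FS = "|"; OFS = "|" }
         { print $1, $3, "gs-" $4, "sp-" $5, "subsp-" $6, "country-" $7,
                 "exactsite-" $8, "lat_" $9, "lon_" $10, "elev-" $11 }'
}

# Example:
#   echo '>GMKMV173-15|Cicadellidae|Hemiptera||||Kenya|Mpala Research Centre|0.293|36.899|1650.0' | reformat_header
```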

Then the formatted headers are substituted into the actual sequence.fasta file using a function, see below:

In [None]:
%%bash
replacing_headers() { #This function takes an input file of edited FASTA-format headers, searches through a FASTA-format sequence file and substitutes the headers whose unique IDs match
        if [ $# -eq 0 ]
        then
                echo "Input error..."
                echo "Usage: ${FUNCNAME[0]} seq.fasta [seq2.fasta seq3.fasta ...]"
                return
        fi

        unset headers
        until [[ ( -f "$headers" ) && ( `basename -- "$headers"` =~ .*\.(fasta|fa|afa) ) ]]
        do
                echo -e "\nFor the headers.aln|fasta|fa|afa input provide the full path to the file, the filename included."
                read -p "Please enter the file to be used as the FASTA headers source: " headers
        done

        echo -e "\n\tStarting operation....\n\tPlease wait, this may take a while...."
        for i in "$@"
        do
                unset x
                unset y
                unset z
                echo -e "\nProceeding with `basename -- $i`..."
                for line in `cat ${headers}`
                do
                        #x=$( head -10 idrck_headers | tail -1 | awk 'BEGIN { FS="|"; }{print $1;}') && echo $x
                        x=`echo "$line" | ${AWK_EXEC} 'BEGIN { RS="\n"; FS="|"; }{ x = $1; print x; }'`
                        y=`echo "$line" | ${AWK_EXEC} 'BEGIN { RS="\n"; FS="|"; }{ y = $0; print y; }'`
                        #echo -e "\n $x \n $y"

                        z=`grep "$x" $i`
                        #echo "$z"
                        for one_z in `echo -e "${z}"`
                        do
                                if [ $one_z == $y ]
                                then
                                        echo -e "Change for ${x} already in place..."
                                        continue
                                else
                                        echo -e "\nSubstituting header for ${x}..."
                                        sed -i "s/${one_z}/${y}/g" $i
                                        #sed -i "s/^.*\b${x}\b.*$/${y}/g" $i
                                fi
                        done
                done
                echo -e "\nDONE replacing headers in `basename -- $i`"
        done
        echo -e "\n\tCongratulations...Operation done."
}
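
The function above calls `grep` and `sed -i` once per header, which can be slow for files with tens of thousands of records. A single-pass alternative is sketched below. It is not the function from "process_all_input_files.sh": the name `replace_headers_by_id` is hypothetical, it writes to standard output instead of editing the file in place, and it assumes that the first `|`-separated field of each header (the ID) is unique and that the headers file is not empty.

```bash
# Hypothetical helper: print FASTA file $2 with every header replaced by the
# header from $1 that has the same first "|"-separated field (the ID).
replace_headers_by_id() {
    awk -F '|' '
        NR == FNR           { new[$1] = $0; next }      # headers file: index by ID
        /^>/ && ($1 in new) { print new[$1]; next }     # matching header: replace
                            { print }                   # everything else: unchanged
    ' "$1" "$2"
}

# Example (file names as used in this notebook):
#   replace_headers_by_id headers_edit.fasta Mpala_Kinondo_Turkana_Malaise_traps.fasta2 > renamed.fasta
```

Because the lookup matches the whole ID field rather than a substring, an ID such as `GMKMV173-15` cannot accidentally match `GMKMV1731-15`.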

**Actual run:**

In [None]:
%%bash
cd /home/kibet/bioinformatics/github/co1_metaanalysis/data/input/input_data/unpublished/
source ../../../../code/process_all_input_files.sh
for i in $(ls *.fasta2); do grep ">" $i | sed s/\|\|/\|NA\|/g | sed s/\|\|/\|NA\|/g | awk 'BEGIN {FS="|"; OFS="|"} {print $1,$3,"gs-"$4,"sp-"$5,"subsp-"$6,"country-"$7,"exactsite-"$8,"lat_"$9,"lon_"$10,"elev-"$11}' > headers_edit.fasta; done
sed -i 's/\r$//g; s/ /_/g; s/\&/_n_/g; s/\//\\&/g' headers_edit.fasta
replacing_headers Mpala_Kinondo_Turkana_Malaise_traps.fasta2 << EOF
./headers_edit.fasta
EOF