Load libraries

In [1]:
using LightXML
using DataArrays
using DataFrames

Read in data; these were obtained from [NCBI](http://www.ncbi.nlm.nih.gov/nuccore?LinkName=bioproject_nuccore&from_uid=257197) based on the link from the [Gire et al.](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4431643/) paper.

In [2]:
xdoc = parse_file("ebola-sle-2014.gbc.xml");

I start by identifying the root element, which is an ```INSDSet```.

In [3]:
xroot = root(xdoc)
println(name(xroot));

INSDSet


I extract all the sequences and accession numbers as lists, the latter using a comprehension.

In [4]:
sequences = get_elements_by_tagname(xroot, "INSDSeq")
accessions = [content(find_element(s,"INSDSeq_primary-accession")) for s in sequences];

In [5]:
numseq=length(sequences)

249

This is many more sequences than we have annotations for.

Let's look at the first entry.

In [6]:
sequences[1]

<INSDSeq>
  <INSDSeq_locus>KM034549</INSDSeq_locus>
  <INSDSeq_length>18835</INSDSeq_length>
  <INSDSeq_strandedness>single</INSDSeq_strandedness>
  <INSDSeq_moltype>cRNA</INSDSeq_moltype>
  <INSDSeq_topology>linear</INSDSeq_topology>
  <INSDSeq_division>VRL</INSDSeq_division>
  <INSDSeq_update-date>15-DEC-2014</INSDSeq_update-date>
  <INSDSeq_create-date>30-JUN-2014</INSDSeq_create-date>
  <INSDSeq_definition>Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-EM095B, complete genome</INSDSeq_definition>
  <INSDSeq_primary-accession>KM034549</INSDSeq_primary-accession>
  <INSDSeq_accession-version>KM034549.1</INSDSeq_accession-version>
  <INSDSeq_other-seqids>
    <INSDSeqid>gb|KM034549.1|</INSDSeqid>
    <INSDSeqid>gi|661348595</INSDSeqid>
  </INSDSeq_other-seqids>
  <INSDSeq_project>PRJNA257197</INSDSeq_project>
  <INSDSeq_source>Zaire ebolavirus</INSDSeq_source>
  <INSDSeq_organism>Zaire ebolavirus</INSDSeq_organism>
  <INSDSeq_taxonomy>Viruses; ssRNA viruses; ssRNA n

To extract all the information about organism, host, sampling time, etc., that is held in the list of ```INSDQualifier```s, I loop through all the sequences and generate a dictionary with accession as the key and a dictionary of qualifiers as the value.

I start by initialising an empty dictionary, with accession strings as keys and string-to-string dictionaries of qualifiers as values.

In [7]:
seq_dict=Dict{ASCIIString,Dict{ASCIIString,ASCIIString}}()

Dict{ASCIIString,Dict{ASCIIString,ASCIIString}} with 0 entries

Extracting the information uses ```find_element``` and ```get_elements_by_tagname``` to locate the right elements, and ```content``` to extract the contents of the qualifiers. Only the first ```INSDFeature``` of each record is read, and qualifiers without a value are skipped.

In [8]:
for i in 1:numseq
    s=sequences[i]
    accession=content(find_element(s, "INSDSeq_primary-accession"))
    feature_table=find_element(s,"INSDSeq_feature-table")
    features=get_elements_by_tagname(feature_table,"INSDFeature")
    feature_quals=get_elements_by_tagname(features[1], "INSDFeature_quals")
    qualifiers=get_elements_by_tagname(feature_quals[1], "INSDQualifier")
    qualifier_dict=Dict{ASCIIString,ASCIIString}()
    for q in qualifiers
        n=find_element(q,"INSDQualifier_name")
        v=find_element(q,"INSDQualifier_value")
        if v!=nothing
            qualifier_dict[content(n)]=content(v)
        end
    end
    seq_dict[accession]=qualifier_dict
end;
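
The shape of the resulting nested dictionary can be sketched without the XML file. This is a minimal illustration rather than the notebook's own code: it assumes a current Julia (1.x) and uses only Base, so `String` stands in for `ASCIIString`, and the `records` list with its accessions and values is made up.

```julia
# Toy stand-in for the parsed XML: (accession, [(qualifier name, value or nothing)]).
# All data below are invented for illustration only.
records = [
    ("ACC001", [("host", "Homo sapiens"), ("country", "Sierra Leone"), ("note", nothing)]),
    ("ACC002", [("host", "Homo sapiens")]),
]

# Outer dictionary: accession => inner dictionary of qualifier name => value.
seq_dict = Dict{String,Dict{String,String}}()
for (accession, qualifiers) in records
    qualifier_dict = Dict{String,String}()
    for (name, value) in qualifiers
        # Skip qualifiers that carry no value, mirroring the loop above.
        if value !== nothing
            qualifier_dict[name] = value
        end
    end
    seq_dict[accession] = qualifier_dict
end

println(seq_dict["ACC001"]["country"])        # prints Sierra Leone
println(haskey(seq_dict["ACC001"], "note"))   # prints false
```

Keeping the inner dictionaries sparse (no entry for a valueless qualifier) is what later makes it necessary to fill in missing values when flattening.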

Here is an example of the features for the first accession.

In [9]:
seq_dict[accessions[1]]

Dict{ASCIIString,ASCIIString} with 8 entries:
  "organism"         => "Zaire ebolavirus"
  "isolation_source" => "serum"
  "host"             => "Homo sapiens"
  "mol_type"         => "viral cRNA"
  "collection_date"  => "25-May-2014"
  "isolate"          => "Ebola virus/H.sapiens-wt/SLE/2014/Makona-EM095B"
  "db_xref"          => "taxon:186538"
  "country"          => "Sierra Leone"

To flatten the dictionary, I first make a dictionary of all feature names, with the number of times the field is found.

In [10]:
fn_dict=Dict{ASCIIString,Int64}()
for acc in keys(seq_dict)
    features=seq_dict[acc]
    for k in keys(features)
        current_count=get(fn_dict,k,0)
        fn_dict[k]=current_count+1
    end
end
fn_dict

Dict{ASCIIString,Int64} with 9 entries:
  "organism"         => 249
  "isolation_source" => 165
  "host"             => 249
  "collected_by"     => 150
  "mol_type"         => 249
  "collection_date"  => 249
  "isolate"          => 249
  "db_xref"          => 249
  "country"          => 249

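The counting idiom can be tried on a toy dictionary. This is a minimal sketch, assuming a current Julia (1.x) with only Base; the accessions and values are made up. A count lower than the number of accessions (as for `isolation_source` and `collected_by` above) means the qualifier is absent from some records.

```julia
# Made-up qualifier dictionaries for three accessions (illustration only).
seq_dict = Dict(
    "ACC001" => Dict("host" => "Homo sapiens", "isolation_source" => "serum"),
    "ACC002" => Dict("host" => "Homo sapiens"),
    "ACC003" => Dict("host" => "Homo sapiens", "collected_by" => "someone"),
)

# Count in how many accessions each qualifier name occurs.
fn_dict = Dict{String,Int}()
for features in values(seq_dict)
    for k in keys(features)
        fn_dict[k] = get(fn_dict, k, 0) + 1
    end
end

# Qualifiers seen in fewer records than there are accessions are missing somewhere.
incomplete = sort([k for (k, n) in fn_dict if n < length(seq_dict)])
println(fn_dict["host"])   # prints 3
println(incomplete)        # prints ["collected_by", "isolation_source"]
```
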
I extract the names of the qualifiers as a list, which will be used below to construct a ```DataFrame```.

In [11]:
feature_names=collect(keys(fn_dict))

9-element Array{ASCIIString,1}:
 "organism"        
 "isolation_source"
 "host"            
 "collected_by"    
 "mol_type"        
 "collection_date" 
 "isolate"         
 "db_xref"         
 "country"         

I then loop through the feature names. For each one, I look up its value in every sequence, using ```NA``` where the feature is absent, and collect the values in a ```DataArray```, which is then added to the ```DataFrame``` as a column.

In [12]:
df=DataFrame(accession=accessions)
numfeatures=length(feature_names)
for i in 1:numfeatures
    key=feature_names[i]
    dv=DataArray(ASCIIString[],Bool[])
    for j in 1:numseq
        acc=accessions[j]
        f=seq_dict[acc]
        val=get(f,key,NA) # NA is the default
        push!(dv,val)
    end
    df[symbol(key)]=dv
end;
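
The flattening step amounts to looking up each qualifier for each accession and filling in a missing-value marker when it is absent. The following is a minimal sketch under stated assumptions: current Julia (1.x), only Base, made-up data, plain vectors instead of a `DataArray`, and `missing` playing the role of `NA`.

```julia
# Made-up inputs (illustration only).
accessions = ["ACC001", "ACC002"]
seq_dict = Dict(
    "ACC001" => Dict("host" => "Homo sapiens", "isolation_source" => "serum"),
    "ACC002" => Dict("host" => "Homo sapiens"),
)
feature_names = ["host", "isolation_source"]

# One column per qualifier, in the order given by `accessions`;
# `missing` marks a qualifier that a record does not have.
columns = Dict{String,Vector{Union{Missing,String}}}()
for key in feature_names
    columns[key] = Union{Missing,String}[get(seq_dict[acc], key, missing) for acc in accessions]
end

println(columns["host"][2])                          # prints Homo sapiens
println(ismissing(columns["isolation_source"][2]))   # prints true
```

Iterating over `accessions` rather than over the dictionary keys keeps every column aligned with the accession column, since dictionary iteration order is not guaranteed.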

I now have a ```DataFrame``` that has the features in a flat format.

In [13]:
head(df)

| | accession | organism | isolation_source | host | collected_by | mol_type | collection_date | isolate | db_xref | country |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | KM034549 | Zaire ebolavirus | serum | Homo sapiens | | viral cRNA | 25-May-2014 | Ebola virus/H.sapiens-wt/SLE/2014/Makona-EM095B | taxon:186538 | Sierra Leone |
| 2 | KM034550 | Zaire ebolavirus | serum | Homo sapiens | | viral cRNA | 25-May-2014 | Ebola virus/H.sapiens-wt/SLE/2014/Makona-EM095 | taxon:186538 | Sierra Leone |
| 3 | KM034551 | Zaire ebolavirus | serum | Homo sapiens | | viral cRNA | 26-May-2014 | Ebola virus/H.sapiens-wt/SLE/2014/Makona-EM096 | taxon:186538 | Sierra Leone |
| 4 | KM034552 | Zaire ebolavirus | serum | Homo sapiens | | viral cRNA | 26-May-2014 | Ebola virus/H.sapiens-wt/SLE/2014/Makona-EM098 | taxon:186538 | Sierra Leone |
| 5 | KM034553 | Zaire ebolavirus | serum | Homo sapiens | | viral cRNA | 27-May-2014 | Ebola virus/H.sapiens-wt/SLE/2014/Makona-G3670.1 | taxon:186538 | Sierra Leone |
| 6 | KM034554 | Zaire ebolavirus | serum | Homo sapiens | | viral cRNA | 27-May-2014 | Ebola virus/H.sapiens-wt/SLE/2014/Makona-G3676.1 | taxon:186538 | Sierra Leone |


Extract the patient ID from the isolate name: take the part after the last hyphen, then drop anything after a period.

In [14]:
ids = [x |> # select
    (x)->split(x,"-") |> # split on hyphen
    last |>  # take last
    (x)->split(x,".") |> # split on period
    first for x in df[:isolate]]
df[:ids] = ids;
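
To see what this rule does to individual isolate names, the same two steps can be written as nested calls. This is a minimal sketch assuming a current Julia (1.x) with only Base; `extract_id` is a hypothetical helper name, not something defined in this notebook.

```julia
# Take the part after the last hyphen, then the part before the first period.
extract_id(isolate) = first(split(last(split(isolate, "-")), "."))

println(extract_id("Ebola virus/H.sapiens-wt/SLE/2014/Makona-G3670.1"))  # prints G3670
println(extract_id("Ebola virus/H.sapiens-wt/SLE/2014/Makona-EM095B"))   # prints EM095B
println(extract_id("Ebola virus/H.sapiens-wt/SLE/2014/Makona-EM124.3"))  # prints EM124
```

Because the numeric suffix after the period is dropped, repeat samples such as `EM124.1` to `EM124.4` all map to the same patient ID.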

Load in annotations, obtained from the Sabeti/Garry labs and available [here](https://fathom.info/mirador/ebola/datarelease/), and select on the basis of IDs.

In [15]:
annot = readtable("ebola-data.csv")

Unnamed: 0,Patient_ID,Diagnosis,Age,Gender,Village,Chiefdom,District,Outcome,Date_of_Outcome,Admitted_at_report,Pre_admission_date,Date_of_admission,Date_of_discharge,Temperature,Systolic_pressure,Diastolic_pressure,Hearth_rate,Respiratory_rate,Days_since_onset,Oxygen_saturation,Bleeding_gums,Bleeding_nose,Blood_in_stool,Blood_in_vomit,Bleeding_injection,Bleeding_hematoma,Blood_in_sputum,Blood_in_urine,Vaginal_bleeding,No_bleeding,Abdominal_pain,Joint_pain,Muscle_pain,Back_pain,Side_pain,Retrosternal_pain,Other_pain,No_pain,Fever,Conjunctivitis,Edema,Inflammation,Rash,Headache,Sore_throat,Vomit,Cough,Diarrhea,Weakness,Dizziness,Hearing,Convulsions,Confusion,Jaundice,Other_symptoms,No_symptoms,Antimalarials,Ceftriaxone,Paracetamol,Metronidazole,Artemisinin_Combination_Therapy,Ciprofloxacin,Ampicillin,Omeprazole,Date_of_metabolic_panel_1,Alanine_Aminotransferase_U_L_day_1,Albumin_g_L_day_1,Alkaline_Phosphatase_U_L_day_1,Aspartate_Aminotransferase_U_L_day_1,Calcium_mmol_L_day_1,Chloride_mmol_L_day_1,Creatinine_umol_L_day_1,Glucose_mmol_L_day_1,Potassium_mmol_L_day_1,Sodium_mmol_L_day_1,Total_Bilirubin_umol_L_day_1,Total_Carbon_Dioxide_mmol_L_day_1,Total_Protein_g_L_day_1,Blood_Urea_Nitrogen_mmol_urea_L_day_1,Date_of_metabolic_panel_2,Alanine_Aminotransferase_U_L_day_2,Albumin_g_L_day_2,Alkaline_Phosphatase_U_L_day_2,Aspartate_Aminotransferase_U_L_day_2,Calcium_mmol_L_day_2,Chloride_mmol_L_day_2,Creatinine_umol_L_day_2,Glucose_mmol_L_day_2,Potassium_mmol_L_day_2,Sodium_mmol_L_day_2,Total_Bilirubin_umol_L_day_2,Total_Carbon_Dioxide_mmol_L_day_2,Total_Protein_g_L_day_2,Blood_Urea_Nitrogen_mmol_urea_L_day_2,Date_of_metabolic_panel_3,Alanine_Aminotransferase_U_L_day_3,Albumin_g_L_day_3,Alkaline_Phosphatase_U_L_day_3,Aspartate_Aminotransferase_U_L_day_3,Calcium_mmol_L_day_3,Chloride_mmol_L_day_3,Creatinine_umol_L_day_3,Glucose_mmol_L_day_3,Potassium_mmol_L_day_3,Sodium_mmol_L_day_3,Total_Bilirubin_umol_L_day_3,Total_Carbon_Dioxide_mmol_L_day_3,Total_Protein_g_L_day_3,Blood_Urea_Nitrogen_mmol_urea_L_day_3,Date_of_metabolic_panel_4,Alanine_Aminotransferase_U_L_day_4,Albumin_g_L_day_4,Alkaline_Phosphatase_U_L_day_4,Aspartate_Aminotransferase_U_L_day_4,Calcium_mmol_L_day_4,Chloride_mmol_L_day_4,Creatinine_umol_L_day_4,Glucose_mmol_L_day_4,Potassium_mmol_L_day_4,Sodium_mmol_L_day_4,Total_Bilirubin_umol_L_day_4,Total_Carbon_Dioxide_mmol_L_day_4,Total_Protein_g_L_day_4,Blood_Urea_Nitrogen_mmol_urea_L_day_4,Date_of_metabolic_panel_5,Alanine_Aminotransferase_U_L_day_5,Albumin_g_L_day_5,Alkaline_Phosphatase_U_L_day_5,Aspartate_Aminotransferase_U_L_day_5,Calcium_mmol_L_day_5,Chloride_mmol_L_day_5,Creatinine_umol_L_day_5,Glucose_mmol_L_day_5,Potassium_mmol_L_day_5,Sodium_mmol_L_day_5,Total_Bilirubin_umol_L_day_5,Total_Carbon_Dioxide_mmol_L_day_5,Total_Protein_g_L_day_5,Blood_Urea_Nitrogen_mmol_urea_L_day_5,First_measured_viral_load_log_units_,Maximum_measured_viral_load_log_units_,Minimum_measured_viral_load_log_units_,Averaged_viral_load_log_units_,Date_of_qPCR_1,EBOV_copies_mL_plasma_log_units_day_1,Date_of_qPCR_2,EBOV_copies_mL_plasma_log_units_day_2,Date_of_qPCR_3,EBOV_copies_mL_plasma_log_units_day_3,Date_of_qPCR_4,EBOV_copies_mL_plasma_log_units_day_4,Date_of_qPCR_5,EBOV_copies_mL_plasma_log_units_day_5,Date_of_qPCR_6,EBOV_copies_mL_plasma_log_units_day_6,SNP_572,SNP_800,SNP_1024,SNP_1288,SNP_1492,SNP_1849,SNP_2124,SNP_2185,SNP_2341,SNP_2364,SNP_2497,SNP_2931,SNP_3116,SNP_3388,SNP_3638,SNP_4340,SNP_4505,SNP_4709,SNP_4759,SNP_4976,SNP_5461,SNP_6175,SNP_6283,SNP_6909,SNP_8280,SNP_8928,SNP_9390,SNP_9536,SNP_9923,SNP_10005,SNP_10218,SNP_10252,SNP_10268,SNP_10509,SNP_10743,SNP_10801,SNP_11142,SNP_11811,SNP_11943,SNP_12878,SNP_12885,SNP_13856,SNP_13923,SNP_14019,SNP_14232,SNP_15599,SNP_15660,SNP_15963,SNP_16054,SNP_16455,SNP_16750,SNP_17142,SNP_17985,SNP_18412,SNP_18895,Allele_Frequency_10218,Cluster,_mutations_from_cluster,Sub_cluster,_mutations_from_sub_cluster
1,EM-095,Positive,42.0,Female,Koindu,Kissi Teng,Kailahun,,,Yes,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5.17520455919,5.17520455919,5.17520455919,5.17520455919,2014-05-27,5.17520455919,,,,,,,,,,,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Yes,No,No,No,No,No,No,No,Yes,No,No,No,No,No,No,No,No,No,,Cluster 1,0,,
2,EM-95B,Positive,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,6.12450740145,6.12450740145,6.12450740145,6.12450740145,,6.12450740145,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,EM-099,Negative,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2014-05-27,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,EM-100,Negative,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2014-06-02,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
5,EM-101,Negative,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2014-06-02,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
6,EM-102,Negative,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2014-06-02,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
7,EM-103,Negative,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2014-06-02,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
8,EM-105,Negative,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2014-06-02,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
9,EM-108,Negative,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2014-06-02,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
10,EM-109,Negative,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2014-06-02,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [16]:
annot[:ids] = [x |> (x)->replace(x,"-","") for x in annot[:Patient_ID]]

213-element Array{Any,1}:
 "EM095"
 "EM95B"
 "EM099"
 "EM100"
 "EM101"
 "EM102"
 "EM103"
 "EM105"
 "EM108"
 "EM109"
 "EM112"
 "EM114"
 "EM117"
 ⋮      
 "G3832"
 "G3833"
 "G3842"
 "G3852"
 "G3853"
 "G3854"
 "G3844"
 "G3849"
 "G3858"
 "G3859"
 "G3835"
 "G3836"

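The normalisation in this cell only removes the hyphen from each `Patient_ID`. A minimal sketch of the same step, assuming a current Julia (1.x) with only Base, where the substitution is written as a pair (`"-" => ""`) rather than as the two separate arguments used in the cell above:

```julia
# Patient IDs as they appear in the annotation table above.
patient_ids = ["EM-095", "EM-95B", "G-3670"]

# Remove the hyphen so the IDs have the same form as those derived from isolate names.
ids = [replace(x, "-" => "") for x in patient_ids]
println(ids)  # prints ["EM095", "EM95B", "G3670"]
```

Note that `EM-95B` becomes `EM95B`, whereas the isolate name `Makona-EM095B` shown earlier yields `EM095B`, so an exact match on these IDs would presumably not pair those two records.
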
Merge sequences and annotations

In [17]:
bigdf = join(annot,df,on=:ids,kind=:inner)

Unnamed: 0,Patient_ID,Diagnosis,Age,Gender,Village,Chiefdom,District,Outcome,Date_of_Outcome,Admitted_at_report,Pre_admission_date,Date_of_admission,Date_of_discharge,Temperature,Systolic_pressure,Diastolic_pressure,Hearth_rate,Respiratory_rate,Days_since_onset,Oxygen_saturation,Bleeding_gums,Bleeding_nose,Blood_in_stool,Blood_in_vomit,Bleeding_injection,Bleeding_hematoma,Blood_in_sputum,Blood_in_urine,Vaginal_bleeding,No_bleeding,Abdominal_pain,Joint_pain,Muscle_pain,Back_pain,Side_pain,Retrosternal_pain,Other_pain,No_pain,Fever,Conjunctivitis,Edema,Inflammation,Rash,Headache,Sore_throat,Vomit,Cough,Diarrhea,Weakness,Dizziness,Hearing,Convulsions,Confusion,Jaundice,Other_symptoms,No_symptoms,Antimalarials,Ceftriaxone,Paracetamol,Metronidazole,Artemisinin_Combination_Therapy,Ciprofloxacin,Ampicillin,Omeprazole,Date_of_metabolic_panel_1,Alanine_Aminotransferase_U_L_day_1,Albumin_g_L_day_1,Alkaline_Phosphatase_U_L_day_1,Aspartate_Aminotransferase_U_L_day_1,Calcium_mmol_L_day_1,Chloride_mmol_L_day_1,Creatinine_umol_L_day_1,Glucose_mmol_L_day_1,Potassium_mmol_L_day_1,Sodium_mmol_L_day_1,Total_Bilirubin_umol_L_day_1,Total_Carbon_Dioxide_mmol_L_day_1,Total_Protein_g_L_day_1,Blood_Urea_Nitrogen_mmol_urea_L_day_1,Date_of_metabolic_panel_2,Alanine_Aminotransferase_U_L_day_2,Albumin_g_L_day_2,Alkaline_Phosphatase_U_L_day_2,Aspartate_Aminotransferase_U_L_day_2,Calcium_mmol_L_day_2,Chloride_mmol_L_day_2,Creatinine_umol_L_day_2,Glucose_mmol_L_day_2,Potassium_mmol_L_day_2,Sodium_mmol_L_day_2,Total_Bilirubin_umol_L_day_2,Total_Carbon_Dioxide_mmol_L_day_2,Total_Protein_g_L_day_2,Blood_Urea_Nitrogen_mmol_urea_L_day_2,Date_of_metabolic_panel_3,Alanine_Aminotransferase_U_L_day_3,Albumin_g_L_day_3,Alkaline_Phosphatase_U_L_day_3,Aspartate_Aminotransferase_U_L_day_3,Calcium_mmol_L_day_3,Chloride_mmol_L_day_3,Creatinine_umol_L_day_3,Glucose_mmol_L_day_3,Potassium_mmol_L_day_3,Sodium_mmol_L_day_3,Total_Bilirubin_umol_L_day_3,Total_Carbon_Dioxide_mmol_L_day_3,Total_Protein_g_L_day_3,Blood_Urea_Nitrogen_mmol_urea_L_day_3,Date_of_metabolic_panel_4,Alanine_Aminotransferase_U_L_day_4,Albumin_g_L_day_4,Alkaline_Phosphatase_U_L_day_4,Aspartate_Aminotransferase_U_L_day_4,Calcium_mmol_L_day_4,Chloride_mmol_L_day_4,Creatinine_umol_L_day_4,Glucose_mmol_L_day_4,Potassium_mmol_L_day_4,Sodium_mmol_L_day_4,Total_Bilirubin_umol_L_day_4,Total_Carbon_Dioxide_mmol_L_day_4,Total_Protein_g_L_day_4,Blood_Urea_Nitrogen_mmol_urea_L_day_4,Date_of_metabolic_panel_5,Alanine_Aminotransferase_U_L_day_5,Albumin_g_L_day_5,Alkaline_Phosphatase_U_L_day_5,Aspartate_Aminotransferase_U_L_day_5,Calcium_mmol_L_day_5,Chloride_mmol_L_day_5,Creatinine_umol_L_day_5,Glucose_mmol_L_day_5,Potassium_mmol_L_day_5,Sodium_mmol_L_day_5,Total_Bilirubin_umol_L_day_5,Total_Carbon_Dioxide_mmol_L_day_5,Total_Protein_g_L_day_5,Blood_Urea_Nitrogen_mmol_urea_L_day_5,First_measured_viral_load_log_units_,Maximum_measured_viral_load_log_units_,Minimum_measured_viral_load_log_units_,Averaged_viral_load_log_units_,Date_of_qPCR_1,EBOV_copies_mL_plasma_log_units_day_1,Date_of_qPCR_2,EBOV_copies_mL_plasma_log_units_day_2,Date_of_qPCR_3,EBOV_copies_mL_plasma_log_units_day_3,Date_of_qPCR_4,EBOV_copies_mL_plasma_log_units_day_4,Date_of_qPCR_5,EBOV_copies_mL_plasma_log_units_day_5,Date_of_qPCR_6,EBOV_copies_mL_plasma_log_units_day_6,SNP_572,SNP_800,SNP_1024,SNP_1288,SNP_1492,SNP_1849,SNP_2124,SNP_2185,SNP_2341,SNP_2364,SNP_2497,SNP_2931,SNP_3116,SNP_3388,SNP_3638,SNP_4340,SNP_4505,SNP_4709,SNP_4759,SNP_4976,SNP_5461,SNP_6175,SNP_6283,SNP_6909,SNP_8280,SNP_8928,SNP_9390,SNP_9536,SNP_9923,SNP_10005,SNP_10218,SNP_10252,SNP_10268,SNP_10509,SNP_10743,SNP_10801,SNP_11142,SNP_11811,SNP_11943,SNP_12878,SNP_12885,SNP_13856,SNP_13923,SNP_14019,SNP_14232,SNP_15599,SNP_15660,SNP_15963,SNP_16054,SNP_16455,SNP_16750,SNP_17142,SNP_17985,SNP_18412,SNP_18895,Allele_Frequency_10218,Cluster,_mutations_from_cluster,Sub_cluster,_mutations_from_sub_cluster,ids,accession,organism,isolation_source,host,collected_by,mol_type,collection_date,isolate,db_xref,country
1,EM-095,Positive,42.0,Female,Koindu,Kissi Teng,Kailahun,,,Yes,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5.17520455919,5.17520455919,5.17520455919,5.17520455919,2014-05-27,5.17520455919,,,,,,,,,,,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Yes,No,No,No,No,No,No,No,Yes,No,No,No,No,No,No,No,No,No,,Cluster 1,0,,,EM095,KM034550,Zaire ebolavirus,serum,Homo sapiens,,viral cRNA,25-May-2014,Ebola virus/H.sapiens-wt/SLE/2014/Makona-EM095,taxon:186538,Sierra Leone
2,EM-112,Positive,65.0,Female,Njala,Jawie,Kailahun,Died,2014-06-03,No,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,7.74822576799,7.74822576799,7.74822576799,7.74822576799,2014-06-03,7.74822576799,,,,,,,,,,,No,Yes,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Yes,No,No,No,No,Yes,No,No,No,No,No,No,Yes,No,No,No,No,No,No,No,Yes,No,Yes,No,No,No,Yes,No,No,No,0.750850993741,Cluster 3,0,,,EM112,KM233039,Zaire ebolavirus,,Homo sapiens,,viral cRNA,03-Jun-2014,Ebola virus/H.sapiens-wt/SLE/2014/Makona-EM112,taxon:186538,Sierra Leone
3,EM-121,Positive,44.0,Male,Foindu,Kissi Kama,Kailahun,Died,2014-06-06,Yes,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,7.10069330882,7.10069330882,7.10069330882,7.10069330882,2014-06-04,7.10069330882,,,,,,,,,,,No,Yes,No,No,Yes,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Yes,No,No,No,No,No,No,No,No,No,No,No,Yes,No,No,No,No,No,No,No,Yes,No,Yes,No,No,No,Yes,No,No,No,0.0,Cluster 2,1,,,EM121,KM233044,Zaire ebolavirus,,Homo sapiens,,viral cRNA,04-Jun-2014,Ebola virus/H.sapiens-wt/SLE/2014/Makona-EM121,taxon:186538,Sierra Leone
4,EM-124,Positive,35.0,Female,Daru,Jawie,Kailahun,Died,2014-06-22,Yes,2014-06-05,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2014-06-10,263.0,23.0,335.0,638.0,1.62,95.0,720.0,5.7,,124.0,12,18.0,68,27.0,2014-06-11,187.0,21,217,351.0,1.75,105,736,5.8,2.0,127,14,16,60,33.7,2014-06-12,137,22,164,212.0,1.83,102,754.0,4.3,2.1,127,20,20,60,36.3,2014-06-13,103,22,144,117.0,1.84,99,826,3.7,2.0,125,16,23,65,41.3,2014-06-14,89,22,117,93,1.77,100,748,2.8,2.3,125,12,22,64,40.3,5.16284676401,5.16284676401,2.57931013071,3.74939867228,2014-06-06,5.16284676401,,4.31662430974,,3.87158337583,,2.57931013071,,2.81662878109,2014-06-18,,No,Yes,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Yes,No,No,No,No,Yes,No,No,No,No,No,No,Yes,No,No,No,No,No,No,No,Yes,No,Yes,No,No,No,Yes,No,No,No,0.868852459016,Cluster 3,0,,,EM124,KM233045,Zaire ebolavirus,,Homo sapiens,,viral cRNA,04-Jun-2014,Ebola virus/H.sapiens-wt/SLE/2014/Makona-EM124.1,taxon:186538,Sierra Leone
5,EM-124,Positive,35.0,Female,Daru,Jawie,Kailahun,Died,2014-06-22,Yes,2014-06-05,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2014-06-10,263.0,23.0,335.0,638.0,1.62,95.0,720.0,5.7,,124.0,12,18.0,68,27.0,2014-06-11,187.0,21,217,351.0,1.75,105,736,5.8,2.0,127,14,16,60,33.7,2014-06-12,137,22,164,212.0,1.83,102,754.0,4.3,2.1,127,20,20,60,36.3,2014-06-13,103,22,144,117.0,1.84,99,826,3.7,2.0,125,16,23,65,41.3,2014-06-14,89,22,117,93,1.77,100,748,2.8,2.3,125,12,22,64,40.3,5.16284676401,5.16284676401,2.57931013071,3.74939867228,2014-06-06,5.16284676401,,4.31662430974,,3.87158337583,,2.57931013071,,2.81662878109,2014-06-18,,No,Yes,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Yes,No,No,No,No,Yes,No,No,No,No,No,No,Yes,No,No,No,No,No,No,No,Yes,No,Yes,No,No,No,Yes,No,No,No,0.868852459016,Cluster 3,0,,,EM124,KM233046,Zaire ebolavirus,,Homo sapiens,,viral cRNA,06-Jun-2014,Ebola virus/H.sapiens-wt/SLE/2014/Makona-EM124.2,taxon:186538,Sierra Leone
6,EM-124,Positive,35.0,Female,Daru,Jawie,Kailahun,Died,2014-06-22,Yes,2014-06-05,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2014-06-10,263.0,23.0,335.0,638.0,1.62,95.0,720.0,5.7,,124.0,12,18.0,68,27.0,2014-06-11,187.0,21,217,351.0,1.75,105,736,5.8,2.0,127,14,16,60,33.7,2014-06-12,137,22,164,212.0,1.83,102,754.0,4.3,2.1,127,20,20,60,36.3,2014-06-13,103,22,144,117.0,1.84,99,826,3.7,2.0,125,16,23,65,41.3,2014-06-14,89,22,117,93,1.77,100,748,2.8,2.3,125,12,22,64,40.3,5.16284676401,5.16284676401,2.57931013071,3.74939867228,2014-06-06,5.16284676401,,4.31662430974,,3.87158337583,,2.57931013071,,2.81662878109,2014-06-18,,No,Yes,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Yes,No,No,No,No,Yes,No,No,No,No,No,No,Yes,No,No,No,No,No,No,No,Yes,No,Yes,No,No,No,Yes,No,No,No,0.868852459016,Cluster 3,0,,,EM124,KM233047,Zaire ebolavirus,,Homo sapiens,,viral cRNA,08-Jun-2014,Ebola virus/H.sapiens-wt/SLE/2014/Makona-EM124.3,taxon:186538,Sierra Leone
7,EM-124,Positive,35.0,Female,Daru,Jawie,Kailahun,Died,2014-06-22,Yes,2014-06-05,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2014-06-10,263.0,23.0,335.0,638.0,1.62,95.0,720.0,5.7,,124.0,12,18.0,68,27.0,2014-06-11,187.0,21,217,351.0,1.75,105,736,5.8,2.0,127,14,16,60,33.7,2014-06-12,137,22,164,212.0,1.83,102,754.0,4.3,2.1,127,20,20,60,36.3,2014-06-13,103,22,144,117.0,1.84,99,826,3.7,2.0,125,16,23,65,41.3,2014-06-14,89,22,117,93,1.77,100,748,2.8,2.3,125,12,22,64,40.3,5.16284676401,5.16284676401,2.57931013071,3.74939867228,2014-06-06,5.16284676401,,4.31662430974,,3.87158337583,,2.57931013071,,2.81662878109,2014-06-18,,No,Yes,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Yes,No,No,No,No,Yes,No,No,No,No,No,No,Yes,No,No,No,No,No,No,No,Yes,No,Yes,No,No,No,Yes,No,No,No,0.868852459016,Cluster 3,0,,,EM124,KM233048,Zaire ebolavirus,,Homo sapiens,,viral cRNA,09-Jun-2014,Ebola virus/H.sapiens-wt/SLE/2014/Makona-EM124.4,taxon:186538,Sierra Leone
8,G-3670,Positive,20.0,Female,Koindu,Kissi Teng,Kailahun,Discharged,2014-07-08,Yes,2014-05-26,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,7.69901470087,7.69901470087,7.69901470087,7.69901470087,2014-05-27,7.69901470087,2014-06-06,,,,,,,,,,No,Yes,No,No,No,No,No,No,No,No,No,No,No,Yes,No,No,No,No,No,No,No,No,No,No,No,Yes,No,No,No,Yes,No,No,No,No,No,No,No,Yes,No,No,No,No,No,No,No,Yes,No,Yes,No,No,No,Yes,No,No,No,0.0,Cluster 2,2,,,G3670,KM034553,Zaire ebolavirus,serum,Homo sapiens,,viral cRNA,27-May-2014,Ebola virus/H.sapiens-wt/SLE/2014/Makona-G3670.1,taxon:186538,Sierra Leone
9,G-3676,Positive,45.0,Female,Buedu,Kissi Teng,Kailahun,Died,2014-05-30,Yes,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.17133670769,8.80761074826,8.17133670769,8.48947372797,2014-05-27,8.17133670769,,8.80761074826,,,,,,,,,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Yes,No,No,No,No,No,No,No,Yes,No,No,No,No,No,No,No,No,No,0.0,Cluster 1,0,,,G3676,KM034554,Zaire ebolavirus,serum,Homo sapiens,,viral cRNA,27-May-2014,Ebola virus/H.sapiens-wt/SLE/2014/Makona-G3676.1,taxon:186538,Sierra Leone
10,G-3676,Positive,45.0,Female,Buedu,Kissi Teng,Kailahun,Died,2014-05-30,Yes,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.17133670769,8.80761074826,8.17133670769,8.48947372797,2014-05-27,8.17133670769,,8.80761074826,,,,,,,,,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Yes,No,No,No,No,No,No,No,Yes,No,No,No,No,No,No,No,No,No,0.0,Cluster 1,0,,,G3676,KM034555,Zaire ebolavirus,serum,Homo sapiens,,viral cRNA,06-Jun-2014,Ebola virus/H.sapiens-wt/SLE/2014/Makona-G3676.2,taxon:186538,Sierra Leone


In [18]:
size(bigdf)

(92,226)
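
The merged table repeats an annotation row once for every matching sequence, which is why patient `EM-124` appears four times above. A minimal sketch of that inner-join behaviour, assuming a current Julia (1.x) with only Base, named tuples in place of `DataFrame` rows, and made-up IDs:

```julia
# Made-up rows: one annotation per patient, possibly several sequences per patient.
annot = [
    (ids = "P1", outcome = "Died"),
    (ids = "P2", outcome = "Discharged"),
    (ids = "P3", outcome = "Died"),
]
seqs = [
    (ids = "P1", accession = "ACC001"),
    (ids = "P1", accession = "ACC002"),
    (ids = "P3", accession = "ACC003"),
    (ids = "P9", accession = "ACC004"),
]

# Inner join: keep every (annotation, sequence) pair whose ids agree.
# P2 (no sequence) and P9 (no annotation) drop out; P1 appears twice.
joined = [merge(a, s) for a in annot for s in seqs if a.ids == s.ids]

println(length(joined))                       # prints 3
println([row.accession for row in joined])    # prints ["ACC001", "ACC002", "ACC003"]
```

This also explains how the row count of the merged table can differ from both inputs: unmatched rows on either side are dropped, while patients with several sequences are duplicated.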

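The flattened sequence metadata in ```df``` can now be written to file as a tab-separated table. Note that this is the table for all 249 sequences, not the merged ```bigdf```.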

In [19]:
writetable("ebola-sle-2014.txt", df, separator = '\t', header = true)

I make a dictionary of the sequences by accession...

In [20]:
seqstrings=[content(find_element(s,"INSDSeq_sequence")) for s in sequences];
seqdict = Dict{ASCIIString,ASCIIString}()
for i in 1:numseq
    seqdict[accessions[i]]=seqstrings[i]
end

...then I write them out to a FASTA file.

In [21]:
open("ebola-sle-2014.fasta","w") do f
    # one FASTA record per row of the merged table
    for i in 1:size(bigdf,1)
        acc = bigdf[:accession][i]
        @printf(f,">%s\n%s\n",acc,seqdict[acc])
    end
end
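
The FASTA layout written here is simply a `>` header line carrying the accession, followed by the sequence on the next line. A minimal sketch that writes to an in-memory buffer so the result can be inspected, assuming a current Julia (1.x) with only Base; `write_fasta` is a hypothetical helper and the sequences are made up:

```julia
# Made-up accessions and sequences (illustration only).
seqdict = Dict("ACC001" => "ACGT", "ACC002" => "GGCA")
selected = ["ACC002", "ACC001"]

# Write one FASTA record per accession: a ">" header line, then the sequence.
function write_fasta(io::IO, accs, seqs)
    for acc in accs
        print(io, ">", acc, "\n", seqs[acc], "\n")
    end
end

buf = IOBuffer()
write_fasta(buf, selected, seqdict)
fasta_text = String(take!(buf))
print(fasta_text)

# To write to disk instead, pass a file handle, for example:
# open(f -> write_fasta(f, selected, seqdict), "toy.fasta", "w")
```

Records are written in the order of `selected`, not in dictionary order, which mirrors how the loop above follows the row order of the merged table.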