## Geneious analysis for individual samples from raw Geneious output, "Annotation.csv" 

 ### Required packages
 - No specific package required 
 
 ### Inputs
 - Geneious SNP analysis of _k13_, _crt_, _mdr1_, _dhfr_, _dhps_, and _cytb_ for pooled samples
 - Documentation on the Geneious analysis can be found in Readme.md
 - The Geneious output "Annotation.csv" was modified to produce GuineaAnalysis_pooled.csv
 
 
 ### Data structure 
 - [Long-form](https://seaborn.pydata.org/tutorial/data_structure.html#long-form-vs-wide-form-data) 
     - Each variable is a column 

         - "Sample" = *AMD ID*, including associated meta-data for each sample
             - The AMD ID and bit code key are found under MS Teams > Domestic > Files > Sample Naming > Sample_naming_key.pptx

             - Key: **Year Country State/Site Treatment SampleID Genus SampleType GeneMarker-8bitcode SampleSeqCount**

                 - Example:
                     - Individual sequenced sample ID: 17GNDo00F0001PfF1290 = 2017 Guinea Dorota Day0 AS+AQ 0001 P.falciparum FilterBloodSpot k13-crt-mdr-dhfr-dhps-cytB-cpmp-pfs47 

                     - Pooled sequenced sample ID: 17GNDoxxx001P10F1290 = 2017 Guinea Dorota **xx x** 001 **Pooled SamplesInPool** P.falciparum FilterBloodSpot k13-crt-mdr-dhfr-dhps-cytB-cpmp-pfs47 

                         - NOTE: If information is not available (na), **x** is used. For pooled samples, DayofTreatment and Treatment are na, since each pool combines multiple samples that each carry that information.
                         - NOTE: For pooled samples, **Genus** is replaced with **Pooled** and **SampleType** with **SamplesInPool** to indicate that this is a pooled sequenced sample and to give the sample count in each pool.
         <p>&nbsp;</p>
         - "Pooled" = whether this sample is a pooled or an individual sample
         - "Year" = the year the study was conducted 
         - "Site" = the state or province 
         - "Gene" = drug resistance gene(s)
         - "Coverage" = the number of reads covering the SNP
         - "VAF" = variant allele frequency, calculated as the number of reads with the amino acid (AA) divided by the total reads at the locus
         - "SNP" = single nucleotide polymorphism in WildTypeAA or MutantAA annotation format
         - "Type" = whether the SNP is wild type or mutant
         - "Pooled_size" = describes how many individual samples were in the pool
         - "Pooled_group" = describes which group the pooled sample belongs to 
         

     - Each observation is a row for each sample ID (patient ID) 
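
 As a rough illustration of the long-form layout and the VAF column described above, the sketch below builds two rows and computes VAF as variant-supporting reads over total reads. The helper name `vaf`, its `variant_reads` argument, the SNP label, and every value in `rows` are assumptions made up for illustration; they are not taken from the Geneious output.

```python
def vaf(variant_reads, coverage):
    """Fraction of reads at a locus that carry the variant amino acid."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return variant_reads / coverage


# Long-form: one column per variable, one row per observation.
# All values below are invented for illustration only.
rows = [
    {"Sample": "17GNDoxxx001P10F1290", "Pooled": "pooled", "Year": 2017,
     "Site": "Dorota", "Gene": "crt", "Coverage": 200,
     "VAF": vaf(50, 200), "SNP": "K76T", "Type": "mutant",
     "Pooled_size": 10},
    {"Sample": "17GNDo00F0001PfF1290", "Pooled": "individual", "Year": 2017,
     "Site": "Dorota", "Gene": "crt", "Coverage": 150,
     "VAF": vaf(150, 150), "SNP": "K76T", "Type": "mutant",
     "Pooled_size": 1},
]

for row in rows:
    print(row["Sample"], row["Gene"], row["SNP"], row["VAF"])
```

 Because every variable is its own column, adding another gene or SNP only adds rows, never columns.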
 
 #### TODO
 
 #### Activity Name
 - [ ] Write a docstring at the beginning of the code
 - [ ] Write a detailed line-by-line description in comments
 - [ ] Make the code simpler and more accurate
 - [ ] Follow the Zen of Python
    
 #### Completed Activity ✓
 - [x] Created Markdown at the beginning of the file for the description

In [2]:
"""
Add the pooled size to each row of the Geneious analysis file.

The pooled size is needed at a later stage to compute weighted averages and
sample sizes; adding it here keeps the number of columns small for the other
files that add pooled information.
"""

# Open the previously created Geneious analysis file (pooled and individual samples)
with open("GuineaAnalysis.csv", "r") as r1:
    # Write a new file with the same rows plus a pooled-size column
    with open("GuineaAnalysis_ps.csv", "w") as w1:
        # Write the header for the new file first
        w1.write("Sample,Pooled,SITE,GENE,SNP,COVERAGE,VAF,VF,PooledSize\n")
        for lines in r1:
            a = lines.split(",")[0]  # Sample ID (first column)
            test1 = ""  # Set to "used" once a pooled size has been written for this row
            # Pooled info part 1: rescan the file for every analysis row
            with open("Pooled_Info_Part1.csv", "r") as r2:
                for lines2 in r2:
                    if a in lines2:  # Substring match: sample ID appears anywhere in the line
                        # Lines in this file end with extra empty column(s), so the pooled size
                        # is read from the 4th- or 5th-from-last character. Only a single digit
                        # is read, and both checks can fire for the same line.
                        if lines2[-4].isdigit():
                            w1.write(lines.strip("\n") + "," + lines2[-4] + "\n")
                            test1 = "used"
                        if lines2[-5].isdigit():
                            w1.write(lines.strip("\n") + "," + lines2[-5] + "\n")
                            test1 = "used"
            # Pooled info part 2: the pooled size is the last character of the line
            with open("Pooled_Info_Part2_fixed.csv", "r") as r3:
                for lines3 in r3:
                    if a in lines3:
                        w1.write(lines.strip("\n") + "," + lines3.strip("\n")[-1] + "\n")
                        test1 = "used"
            # No match in either file (and not the header row): write a pooled size of 1
            if test1 != "used" and not lines.startswith("Sample"):
                w1.write(lines.strip("\n") + ",1\n")
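
One possible simplification of the loop above, in the spirit of the TODO list, is to read the pooled info once into a dictionary and look each sample up by ID. This is only a sketch under assumptions that the source does not confirm: that the sample ID is the first column of the pooled info files and that the pooled size is the last non-empty field. The function names `load_pool_sizes` and `add_pool_size` and the in-memory sample data are hypothetical. Unlike the code above, it matches IDs exactly rather than by substring, and it reads multi-digit pooled sizes in full.

```python
import csv
import io


def load_pool_sizes(handle):
    """Map sample ID -> pooled size, taken from the last non-empty field of each row."""
    sizes = {}
    for row in csv.reader(handle):
        fields = [f.strip() for f in row if f.strip()]
        # Keep only rows that have an ID plus a trailing all-digit field
        if len(fields) >= 2 and fields[-1].isdigit():
            sizes[fields[0]] = fields[-1]
    return sizes


def add_pool_size(analysis, sizes, default="1"):
    """Yield each data row with its pooled size appended; the header row is skipped."""
    for row in csv.reader(analysis):
        if not row or row[0] == "Sample":
            continue
        yield row + [sizes.get(row[0], default)]


# Tiny in-memory stand-ins for the real files (all values invented)
pool_info = io.StringIO("17GNDoxxx001P10F1290,groupA,10,,\n")
analysis = io.StringIO(
    "Sample,Pooled,SITE,GENE,SNP,COVERAGE,VAF,VF\n"
    "17GNDoxxx001P10F1290,pooled,Dorota,crt,K76T,200,0.25,25\n"
    "17GNDo00F0001PfF1290,individual,Dorota,crt,K76T,150,1.0,100\n"
)

sizes = load_pool_sizes(pool_info)
result = list(add_pool_size(analysis, sizes))
for row in result:
    print(",".join(row))
```

With real files, the `io.StringIO` objects would be replaced by handles from `open(..., newline="")`, and the sizes from both pooled info files could be merged into one dictionary before the lookup.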
                