## Geneious analysis for individual samples from raw Geneious output, "Annotations.csv"

 ### Required packages
 - pandas (numpy is also imported, but the code below does not use it)
 
 ### Inputs
 - Geneious SNP analysis of _k13_, _crt_, _mdr1_, _dhfr_, _dhps_, and _cytb_
 - Documentation on the Geneious analysis can be found in Readme.md
 - The Geneious output "Annotations.csv" was reformatted into GuineaAnalysis_Individual.csv
 
 
 ### Data structure 
 - [Long-form](https://seaborn.pydata.org/tutorial/data_structure.html#long-form-vs-wide-form-data) 
     - Each variable is a column 

         - "Sample" = *AMD ID*, including associated meta-data for each sample
             - AMD ID and bit code key is found under MS Teams > Domestic > Files > Sample Naming > Sample_naming_key.pptx  

             - Key: **Year Country State/Site DayofTreatment Treatment SampleID Genus SampleType GeneMarker-8bitcode SampleSeqCount**

                 - Example:
                     - Individual sequenced sample ID: 17GNDo00F0001PfF1290 = 2017 Guinea Dorota Day0 AS+AQ 0001 P.falciparum FilterBloodSpot k13-crt-mdr-dhfr-dhps-cytB-cpmp-pfs47 

                     - Pooled sequenced sample ID: 17GNDoxxx001P10F1290 = 2017 Guinea Dorota **xx x** 001 **Pooled SamplesInPool** P.falciparum FilterBloodSpot k13-crt-mdr-dhfr-dhps-cytB-cpmp-pfs47 

                         - NOTE: If information is not available (na), **x** is used. For pooled samples, DayofTreatment and Treatment are na because each pool combines multiple samples, each with its own values for those fields.
                         - NOTE: For pooled samples, **Genus** is replaced with **Pooled** and **SampleType** with **SamplesInPool** to indicate that this is a pooled sequenced sample and to give the sample count in each pool.
         <p>&nbsp;</p>
         - "Year" = the year the study was conducted 
         - "Site" = the state or province 
         - "Day_of_treatment" = describes the day of treatment provided to the patient 
         - "Gene" = drug-resistance gene(s)
         - "G_annotation" = full SNP annotation in the following format: WildTypeAA-CodonPosition-MutantAA
         - "Coverage" = the number of reads covering the SNP
         - "VAF" = variant allele frequency, calculated as AA reads divided by the total reads at the locus
         - "SNP" = single nucleotide polymorphism in WildTypeAA or MutantAA annotation format
         - "Type" = whether the SNP is wild type or mutant

     - Each observation is a row for each sample ID (patient ID) 
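
As a rough illustration of the naming key above, the sketch below slices an individual AMD ID into its fields. The field widths (2-2-2-2-1-4-2-1-3-1 characters) are an assumption inferred from the single example `17GNDo00F0001PfF1290`; the function name `parse_amd_id` and the constant `AMD_ID_FIELDS` are hypothetical. Pooled IDs use a different layout and are not handled here, and Sample_naming_key.pptx remains the authoritative reference.

```python
from typing import Dict

# Field names follow the key above; the widths are inferred from one example
# ID and are an assumption, not taken from the naming-key document.
AMD_ID_FIELDS = [
    ("Year", 2), ("Country", 2), ("Site", 2), ("DayofTreatment", 2),
    ("Treatment", 1), ("SampleID", 4), ("Genus", 2), ("SampleType", 1),
    ("GeneMarker", 3), ("SampleSeqCount", 1),
]


def parse_amd_id(amd_id: str) -> Dict[str, str]:
    """Split an individual-sample AMD ID into named fields (hypothetical helper)."""
    expected_length = sum(width for _, width in AMD_ID_FIELDS)
    if len(amd_id) != expected_length:
        raise ValueError(f"expected {expected_length} characters, got {len(amd_id)}")
    fields = {}
    start = 0
    for name, width in AMD_ID_FIELDS:
        # Take the next `width` characters for this field.
        fields[name] = amd_id[start:start + width]
        start += width
    return fields


parsed = parse_amd_id("17GNDo00F0001PfF1290")
print(parsed["Year"], parsed["Country"], parsed["Site"], parsed["SampleID"])
# 17 GN Do 0001
```

Keeping the layout in one list of (name, width) pairs means a change to the key only has to be made in one place.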
 
 #### TODO
 
 #### Activity Name
 - [ ] Write a docstring at the beginning of the code
 - [ ] Add a descriptive comment for each line
 - [ ] Make the code simpler and more accurate
 - [ ] Follow the Zen of Python
    
 #### Completed Activity ✓
 - [x] Created a markdown description at the beginning of the file

In [43]:
import pandas as pd
import numpy as np
Geneious_DF=pd.read_csv("Annotations.csv")
#print(Geneious_DF["Coverage"])
#print(Geneious_DF["Sequence Name"])
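# Keep polymorphisms that change an amino acid; .copy() gives an independent
# frame so the TrackerSNP column can be added without a SettingWithCopyWarning.
Geneious_DF_N1=Geneious_DF[(Geneious_DF['Type']=='Polymorphism') & (Geneious_DF['Amino Acid Change'].notnull())].copy()

# Rows flagged "Coverage - High" are reported as wild type further down.
Geneious_DF_N2=Geneious_DF[Geneious_DF['Type']=='Coverage - High'].copy()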
#Coverage - High

#print(len(Geneious_DF_N1))
#print(len(Geneious_DF_N2))
#print(Geneious_DF_N1["CDS Codon Number"])
#print(Geneious_DF_N1["Amino Acid Change"])
Geneious_DF_N1["TrackerSNP"]=Geneious_DF_N1["Amino Acid Change"].astype(str).str[0]+Geneious_DF_N1["CDS Codon Number"].astype(int).astype(str)+Geneious_DF_N1["Amino Acid Change"].astype(str).str[-1]

#Geneious_DF_N_fix=Geneious_DF_N1.loc[Geneious_DF_N1['Document Name'].str.contains('M05039', na=False)]
#print(Geneious_DF_N1["TrackerSNP"])
#Geneious_DF_N1.to_csv("test.csv", sep=',')
#Geneious_DF_N1_test=pd.isna(Geneious_DF_N1["Amino Acid Change"])
#print(Geneious_DF_N1.iloc[1])
#print(Geneious_DF_N1["Amino Acid Change"])


Combine_Variant_Wildtpye = [Geneious_DF_N1, Geneious_DF_N2]
Combation_Vi_Wi = pd.concat(Combine_Variant_Wildtpye)
#print(len(Combation_Vi_Wi))
#print(len(Geneious_DF_N1))
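# Keep one row per sample and SNP; .copy() so the derived columns below are
# added to an independent frame rather than to a slice.
Combination_filtered=Combation_Vi_Wi.drop_duplicates(subset=["Document Name", "TrackerSNP"]).copy()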
#(Geneious_DF_N1[!Geneious_DF_N1$feature %in% Geneious_DF_N2$feature, ]
#"Sample,Pooled,Year,SITE,TreatmentDay,GENE,G_annotation,COVERAGE,VAF,VF,SNP,TYPE\n")
        
def site(row):
    """Map the two-letter site code (characters 4-5 of the sample ID) to the site name."""
    site_names = {"Ha": "Hamdalaye", "Do": "Dorota", "Ma": "Maferinyah",
                  "La": "Lay-Sare", "LS": "Lay-Sare"}
    # Unknown codes give None, as in the original if/elif chain.
    return site_names.get(row['Document Name'][4:6])

def TreatmentDay(row):
    """Return the day of treatment encoded in characters 6-7 of the sample ID."""
    code = row['Document Name'][6:8]
    if code == "00":
        return '0'
    if code == "1A":
        return '1'
    return code

def Pooled(row):
    """Label the sample as individual or pooled from characters 8-9 of the sample ID."""
    if row['Document Name'][8:10] == "xp":
        return 'individual'
    return 'pooled'

def year(row):
    """Return the two-digit study year at the start of the sample ID."""
    return row['Document Name'][0:2]

# Renamed from type() so the Python built-in of the same name is not shadowed.
def variant_type(row):
    """Translate the Geneious annotation type into 'mutation' or 'wildtype'."""
    if row['Type'] == 'Polymorphism':
        return "mutation"
    if row['Type'] == 'Coverage - High':
        return "wildtype"

def SNP(row):
    """Drop the wild-type AA for mutations, or the trailing AA for wild-type rows."""
    if row['Type'] == 'Polymorphism':
        return row['TrackerSNP'][1:]
    if row['Type'] == 'Coverage - High':
        return row['TrackerSNP'][:-1]

# Derive the report columns row by row from the sample ID and annotation type.
Combination_filtered["SITE"]=Combination_filtered.apply(site, axis=1)
Combination_filtered["TreatmentDay"]=Combination_filtered.apply(TreatmentDay, axis=1)
Combination_filtered["Pooled"]=Combination_filtered.apply(Pooled, axis=1)
Combination_filtered["Year"]=Combination_filtered.apply(year, axis=1)
Combination_filtered["TYPE"]=Combination_filtered.apply(variant_type, axis=1)
Combination_filtered["SNP"]=Combination_filtered.apply(SNP, axis=1)

#print(Combination_filtered["SITE"])
#print(Combination_filtered["TreatmentDay"])
#print(Combination_filtered["Pooled"])
#print(Combination_filtered["Year"])
#print(Combination_filtered["TYPE"])


#,"SITE","TreatmentDay","Pooled","Year"

Combination_report1=Combination_filtered[Combination_filtered['Type']=='Polymorphism']
Combination_report2=Combination_filtered[Combination_filtered['Type']=='Coverage - High']
final_report1=Combination_report1[["Document Name","Sequence Name","SITE","TreatmentDay","Pooled","Year","Coverage","Variant Frequency","Variant Raw Frequency","TrackerSNP","TYPE","SNP"]]
final_report2=Combination_report2[["Document Name","Sequence Name","SITE","TreatmentDay","Pooled","Year","Average Coverage","Variant Frequency","Variant Raw Frequency","TrackerSNP","TYPE","SNP"]]
final_report2_re=final_report2.rename(columns={'Average Coverage': 'Coverage'})
#print(Combination_report)
final_combine=[final_report1, final_report2_re]
final_combine_2=pd.concat(final_combine)
final_combine_2.to_csv("test.csv", sep=',', index=False)
#print((Combation_Vi_Wi))
#print(len(Geneious_DF_N2))




In [44]:
#print(final_combine_2["Document Name"])

pooled_part1=pd.read_csv("Pooled_Info_Part1.csv")
pooled_part2=pd.read_csv("Pooled_Info_Part2_fixed.csv")

Combine_pooled_parts = [pooled_part1[["Pool","SITE","YEAR","AMD_ID","Poolsize"]], pooled_part2[["Pool","SITE","YEAR","AMD_ID","Poolsize"]]]
Combation_pooled_concatenate = pd.concat(Combine_pooled_parts)

#print(Combation_pooled_concatenate)
Combation_pooled_concatenate.to_csv("test-pre1.csv", sep=',', index=False)

#print(pooled_part1["Poolsize "])
s = pd.Series(pooled_part1["AMD_ID"])
s2 = pd.Series(pooled_part1["Poolsize"])
s3 = pd.Series(pooled_part1["Pool"])

def name(row):
    return row['Document Name'].split("_")[0]

final_combine_2["Document Name"]=final_combine_2.apply(name, axis=1)

final_combine_2.rename(columns={'Document Name':'AMD_ID'}, inplace=True)


df_merged_poolsize = pd.merge(final_combine_2, Combation_pooled_concatenate, on=['AMD_ID'], how='left') 

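# Drop the duplicated pool-table columns by keyword; passing the axis
# positionally, as in drop('SITE_y', 1), is no longer accepted by pandas 2.x.
df_merged_poolsize = df_merged_poolsize.drop(columns=['SITE_y', 'YEAR'])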

#final_combine_2.loc[(final_combine_2['Document Name']==Combation_pooled_concatenate["AMD_ID"])] = Combation_pooled_concatenate["Poolsize"]

#print(final_combine_2["Document Name"])
#df[df.Name.isin(['Alice', 'Bob'])]

#def row_match(row):
#    return Combation_pooled_concatenate["AMD_ID"].isin([row['Document Name']])
      
#new_match=(final_combine_2.apply(row_match, axis=1))
#print(new_match)

#print(new_match.columns)

#print(new_match.iloc[0][new_match.iloc[0]==True].index.tolist())

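# Samples with no entry in the pool table get a pool size of 1. Assigning the
# result back is more reliable than an in-place fillna on a column accessor.
df_merged_poolsize["Poolsize"] = df_merged_poolsize["Poolsize"].fillna(1)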

df_merged_poolsize.to_csv("test2.csv", sep=',', index=False)
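
The left merge used above to attach pool sizes can be sketched as follows. Both small frames are made up for illustration (the IDs are placeholders, not real samples); only the column names `AMD_ID` and `Poolsize` follow the cell above. The `validate="many_to_one"` argument is an optional addition, not part of the original code: it asks pandas to raise an error if an `AMD_ID` appears more than once in the pool table, which would otherwise silently duplicate report rows.

```python
import pandas as pd

# Made-up report rows: one pooled sample with two SNP rows, one individual sample.
samples = pd.DataFrame({
    "AMD_ID": ["id_pool_1", "id_single_1", "id_pool_1"],
    "SNP": ["580Y", "N86", "K76"],
})

# Made-up pool table: only pooled samples are listed.
pools = pd.DataFrame({"AMD_ID": ["id_pool_1"], "Poolsize": [10]})

# A left merge keeps every report row; rows without a pool entry get NaN.
merged = pd.merge(samples, pools, on="AMD_ID", how="left", validate="many_to_one")

# Treat unmatched samples as pools of size 1 and restore an integer dtype.
merged["Poolsize"] = merged["Poolsize"].fillna(1).astype(int)

print(merged["Poolsize"].tolist())
# [10, 1, 10]
```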


