# 16S VALUES FIX

We detected a problem in the data for the Ardley samples: their internal quality measures were all over the place. One of the soil samples had no data for either 16S primer pair, which left too few samples to reach any statistical conclusion. It also made the remaining conclusions shaky: compared to the lake samples (in terms of both plastics and surrounding environment), we suddenly had a much higher number of sequences being detected, with a noticeably higher NCN (which can be caused by an unnaturally small Ct value in the 16S amps).

As such, we requested a rerun of the 3 Ardley resistome chips with 8 replicates of each 16S primer set, plus two additional 16S primer sets for QC. Rerunning the entire chip would have been ideal, but we didn't have enough genetic material to do so. So while this rerun may fix the missing data and the too-small Ct values, the high variability won't (can't) be fixed. This is something to keep in mind going forward.

In [1]:
import pandas as pd
import numpy as np

As always, the first step is to open the file containing the data of interest. To make the fix easier, I've pre-filtered the file by hand to remove data not of interest: the QC primer pair and the primer sets that failed to amplify.

In [2]:
fixed_16 = pd.read_csv("../data/resistome_data/metadata/corrected_16S_lakes.csv", sep = ";")
fixed_16

Unnamed: 0,Sample,Assay,N,Outliers,Rejected,Total reps,Ct,Ct SD,Tm,Tm SD,Efficiency,Conc,Flags
0,1,16s New 2,7,0,1,8,8.5,0.04,91.2,0.14,2.49,-1,
1,1,16s old 1,8,0,0,8,11.38,0.17,91.29,0.16,2.09,-1,
2,2,16s New 2,7,0,1,8,8.64,0.04,91.32,0.1,2.28,-1,
3,2,16s old 1,8,0,0,8,11.75,0.09,91.47,0.07,2.07,-1,
4,3,16s old 1,8,0,0,8,10.42,0.14,91.71,0.06,2.03,-1,
5,4,16s New 2,3,0,5,8,8.98,0.07,91.39,0.07,2.16,-1,MultipleMeltPeaks
6,4,16s old 1,2,0,6,8,12.31,0.02,91.01,0.07,2.04,-1,MultipleMeltPeaks
7,5,16s New 2,1,0,7,8,8.98,,90.11,,2.1,-1,MultipleMeltPeaks
8,5,16s old 1,8,0,0,8,12.35,0.19,91.42,0.09,2.05,-1,
9,6,16s New 2,3,0,5,8,9.08,0.05,91.52,0.02,2.07,-1,MultipleMeltPeaks


Now I will apply the standard filters to see which rows remain and are therefore appropriate for use.

In [3]:
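# Standard QC filters: efficiency strictly between 1.75 and 2.2, and more than 2 replicates kept (N)
fixed_16 = fixed_16.loc[(fixed_16["Efficiency"] < 2.2) & (fixed_16["Efficiency"] > 1.75)]
fixed_16 = fixed_16.loc[fixed_16["N"] > 2].copy()  # copy so a column can be added later without a SettingWithCopyWarning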
fixed_16

Unnamed: 0,Sample,Assay,N,Outliers,Rejected,Total reps,Ct,Ct SD,Tm,Tm SD,Efficiency,Conc,Flags
1,1,16s old 1,8,0,0,8,11.38,0.17,91.29,0.16,2.09,-1,
3,2,16s old 1,8,0,0,8,11.75,0.09,91.47,0.07,2.07,-1,
4,3,16s old 1,8,0,0,8,10.42,0.14,91.71,0.06,2.03,-1,
5,4,16s New 2,3,0,5,8,8.98,0.07,91.39,0.07,2.16,-1,MultipleMeltPeaks
8,5,16s old 1,8,0,0,8,12.35,0.19,91.42,0.09,2.05,-1,
9,6,16s New 2,3,0,5,8,9.08,0.05,91.52,0.02,2.07,-1,MultipleMeltPeaks
10,6,16s old 1,8,0,0,8,12.47,0.07,91.29,0.11,2.02,-1,
11,7,16s old 1,6,0,2,8,11.6,0.04,90.74,0.09,2.04,-1,MultipleMeltPeaks
13,8,16s old 1,3,1,4,8,11.57,0.08,90.72,0.05,1.96,-1,"ContainsOutlierReplicateCt, MultipleMeltPeaks"
14,9,16s old 1,6,1,1,8,11.54,0.04,90.75,0.05,2.02,-1,"ContainsOutlierReplicateCt, MultipleMeltPeaks"
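
The two filters above can be wrapped in a small helper so the thresholds are named in one place. This is only a sketch: the function name `apply_qc_filters` and its parameter names are made up for illustration, and the toy table reuses three rows from the first output.

```python
import pandas as pd


def apply_qc_filters(df, eff_min=1.75, eff_max=2.2, min_reps=3):
    """Keep rows whose efficiency is strictly inside (eff_min, eff_max)
    and that have at least `min_reps` replicates kept (column N)."""
    in_window = (df["Efficiency"] > eff_min) & (df["Efficiency"] < eff_max)
    enough_reps = df["N"] >= min_reps  # same as N > 2 for integer counts
    # .copy() so later column assignments do not act on a slice of df
    return df.loc[in_window & enough_reps].copy()


# Toy table: three rows taken from the rerun data (hypothetical subset)
toy = pd.DataFrame(
    {
        "Sample": [1, 1, 4],
        "Assay": ["16s New 2", "16s old 1", "16s old 1"],
        "N": [7, 8, 2],
        "Efficiency": [2.49, 2.09, 2.04],
    }
)

kept = apply_qc_filters(toy)
print(kept)
```

In this toy table the first row fails on efficiency and the third on replicate count, so only the sample 1 `16s old 1` row survives.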


In [4]:
fixed_16["rel_16"] = (10 ** ((27 - fixed_16["Ct"])/(10/3)))
fixed_16

Unnamed: 0,Sample,Assay,N,Outliers,Rejected,Total reps,Ct,Ct SD,Tm,Tm SD,Efficiency,Conc,Flags,rel_16
1,1,16s old 1,8,0,0,8,11.38,0.17,91.29,0.16,2.09,-1,,48528.850016
3,2,16s old 1,8,0,0,8,11.75,0.09,91.47,0.07,2.07,-1,,37583.740429
4,3,16s old 1,8,0,0,8,10.42,0.14,91.71,0.06,2.03,-1,,94188.959652
5,4,16s New 2,3,0,5,8,8.98,0.07,91.39,0.07,2.16,-1,MultipleMeltPeaks,254683.025259
8,5,16s old 1,8,0,0,8,12.35,0.19,91.42,0.09,2.05,-1,,24831.331053
9,6,16s New 2,3,0,5,8,9.08,0.05,91.52,0.02,2.07,-1,MultipleMeltPeaks,237684.028662
10,6,16s old 1,8,0,0,8,12.47,0.07,91.29,0.11,2.02,-1,,22855.988034
11,7,16s old 1,6,0,2,8,11.6,0.04,90.74,0.09,2.04,-1,MultipleMeltPeaks,41686.938347
13,8,16s old 1,3,1,4,8,11.57,0.08,90.72,0.05,1.96,-1,"ContainsOutlierReplicateCt, MultipleMeltPeaks",42559.841313
14,9,16s old 1,6,1,1,8,11.54,0.04,90.75,0.05,2.02,-1,"ContainsOutlierReplicateCt, MultipleMeltPeaks",43451.022417
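
The conversion in cell 4 turns a Ct into a relative copy number: a Ct of 27 maps to 1, and every 10/3 cycles below that multiplies the value by ten (close to the log2(10) ≈ 3.32 cycles a perfectly doubling reaction would need). A minimal sketch of the same arithmetic follows. The function name `ct_to_rel` and the constant names are mine, and reading 27 as a cutoff Ct is an assumption, not something stated in this notebook.

```python
CT_REFERENCE = 27            # Ct that maps to a relative copy number of 1 (assumed cutoff)
CYCLES_PER_TENFOLD = 10 / 3  # cycles per tenfold change, as used in cell 4


def ct_to_rel(ct):
    """Relative copy number for a Ct, same formula as the rel_16 column."""
    return 10 ** ((CT_REFERENCE - ct) / CYCLES_PER_TENFOLD)


# Sample 1, 16s old 1 (Ct = 11.38 in the table above)
print(round(ct_to_rel(11.38), 2))
# A Ct equal to the reference gives a relative copy number of 1
print(ct_to_rel(27))
```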


And that should be it. Note that all samples kept the 16S old 1 primer set, with the exception of sample 4, whose 16S old 1 row had N = 2 and so was dropped by the N > 2 filter. If the values of the two primer sets are compared, it can be seen that the 16S new 2 Ct values are generally much lower than the 16S old 1 ones. EcoGenO indicated when running this test that 16S new 2 generally works badly with environmental samples (= all of ours), so I'm going to make an exception and keep the 16S old 1 result for that sample.

In [5]:
fixed_16["Sample"] = fixed_16["Sample"] + 18
fixed_16.loc[fixed_16["Sample"] == 22, "Assay"] = "16s old 1" #this is kinda hacky but won't affect the final result
fixed_16.loc[fixed_16["Sample"] == 22, "rel_16"] = (10 ** ((27 - 12.31)/(10/3))) #I copied the value of interest by hand

fixed_16 = fixed_16.loc[fixed_16["Assay"] == "16s old 1"]
fixed_16 = fixed_16[["Sample", "rel_16"]] #this is why the hack earlier didn't really matter
#fixed_16 = fixed_16.to_dict("records")
fixed_16.rename(columns = {"Sample": "sample"}, inplace = True)
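
Copying 12.31 by hand works, but the same value can be looked up from the unfiltered table. The sketch below assumes the raw rerun table was kept under a separate name before the filters were applied (`raw_16` here, which does not exist in this notebook). The toy rows are the two sample 4 rows from the first output.

```python
import pandas as pd

# Hypothetical unfiltered copy of the rerun table (sample 4 rows only)
raw_16 = pd.DataFrame(
    {
        "Sample": [4, 4],
        "Assay": ["16s New 2", "16s old 1"],
        "N": [3, 2],
        "Ct": [8.98, 12.31],
    }
)

# Select the 16s old 1 row for sample 4 and read its Ct
mask = (raw_16["Sample"] == 4) & (raw_16["Assay"] == "16s old 1")
ct_old_1 = raw_16.loc[mask, "Ct"].iloc[0]

# Same conversion as cell 4
rel_16_sample_4 = 10 ** ((27 - ct_old_1) / (10 / 3))
print(ct_old_1, rel_16_sample_4)
```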

In [6]:
# Before moving on: the way I processed the data drops all sample == 21 entries because they have no associated 16S value.
# I made a provisional "fixed" version, to get an idea of what the statistical results would look like, in which I
# filled in the missing 16S value with the average of the 2 other replicates' values. So now I'm going to use that "fixed"
# version as the base for changing the values. That way I'll actually have values for sample 21.


all_data = pd.read_csv("../data/resistome_data/clean_data/ab_data_false.csv", index_col = 0)
all_data.loc[all_data["sample"] == 25]

Unnamed: 0,Assay,antib,sample,Ct,place,type_f,type_g,or_seq,rel_n,rel_16,rel_ab,log_n
0,aacC2,Aminoglycoside,25.0,23.78,ardley,PUR,plastic,arg,9.246982,63533.093185,0.000146,-3.837
1,aacA/aphD,Aminoglycoside,25.0,26.05,ardley,PUR,plastic,arg,1.927525,63533.093185,0.000030,-4.518
2,aac(6')-II,Aminoglycoside,25.0,18.80,ardley,PUR,plastic,arg,288.403150,63533.093185,0.004539,-2.343
3,aphA3,Aminoglycoside,25.0,0.00,ardley,PUR,plastic,arg,,,,
4,sat4,Aminoglycoside,25.0,0.00,ardley,PUR,plastic,arg,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
377,dfrAB4,Trimethoprim,25.0,23.37,ardley,PUR,plastic,arg,12.274392,63533.093185,0.000193,-3.714
378,dfrC,Trimethoprim,25.0,0.00,ardley,PUR,plastic,arg,,,,
379,dfrG,Trimethoprim,25.0,22.68,ardley,PUR,plastic,arg,19.769696,63533.093185,0.000311,-3.507
380,dfrK,Trimethoprim,25.0,20.26,ardley,PUR,plastic,arg,105.196187,63533.093185,0.001656,-2.781


In [7]:
all_data = all_data.merge(fixed_16, on = "sample", how = "left")

In [8]:
all_data.loc[all_data["sample"] < 19, "rel_16"] = all_data["rel_16_x"]
all_data.loc[all_data["sample"] > 18, "rel_16"] = all_data["rel_16_y"]
all_data.drop(columns = ["rel_16_x", "rel_16_y"], inplace = True)
all_data.tail()

Unnamed: 0,Assay,antib,sample,Ct,place,type_f,type_g,or_seq,rel_n,rel_ab,log_n,rel_16
10309,dfrAB4,Trimethoprim,14.0,0.0,ion,water,control,arg,,,,
10310,dfrC,Trimethoprim,14.0,0.0,ion,water,control,arg,,,,
10311,dfrG,Trimethoprim,14.0,0.0,ion,water,control,arg,,,,
10312,dfrK,Trimethoprim,14.0,0.0,ion,water,control,arg,,,,
10313,dfrBmulti,Trimethoprim,14.0,0.0,ion,water,control,arg,,,,
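
Because both tables carry a `rel_16` column, the merge in cell 7 produces `rel_16_x` (original values) and `rel_16_y` (rerun values), and cell 8 picks between them by sample number. A toy version of that pattern is sketched below with made-up values. `numpy.where` is swapped in for the two `.loc` assignments, which is my variation rather than what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy stand-ins with invented values: sample 14 has no rerun, sample 25 does
all_toy = pd.DataFrame({"sample": [14, 25, 25], "rel_16": [100.0, 200.0, 200.0]})
rerun_toy = pd.DataFrame({"sample": [25], "rel_16": [300.0]})

# A left merge keeps every row of all_toy; the clashing column is suffixed _x/_y
merged = all_toy.merge(rerun_toy, on="sample", how="left")

# Rerun value for samples above 18, original value otherwise
merged["rel_16"] = np.where(
    merged["sample"] > 18, merged["rel_16_y"], merged["rel_16_x"]
)
merged = merged.drop(columns=["rel_16_x", "rel_16_y"])
print(merged)
```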


In [9]:
all_data["rel_ab"] = all_data["rel_n"] / all_data["rel_16"]
all_data["log_n"] = np.log10(all_data["rel_ab"])
all_data

Unnamed: 0,Assay,antib,sample,Ct,place,type_f,type_g,or_seq,rel_n,rel_ab,log_n,rel_16
0,aacC2,Aminoglycoside,25.0,23.78,ardley,PUR,plastic,arg,9.246982,0.000222,-3.654,41686.938347
1,aacA/aphD,Aminoglycoside,25.0,26.05,ardley,PUR,plastic,arg,1.927525,0.000046,-4.335,41686.938347
2,aac(6')-II,Aminoglycoside,25.0,18.80,ardley,PUR,plastic,arg,288.403150,0.006918,-2.160,41686.938347
3,aphA3,Aminoglycoside,25.0,0.00,ardley,PUR,plastic,arg,,,,41686.938347
4,sat4,Aminoglycoside,25.0,0.00,ardley,PUR,plastic,arg,,,,41686.938347
...,...,...,...,...,...,...,...,...,...,...,...,...
10309,dfrAB4,Trimethoprim,14.0,0.00,ion,water,control,arg,,,,
10310,dfrC,Trimethoprim,14.0,0.00,ion,water,control,arg,,,,
10311,dfrG,Trimethoprim,14.0,0.00,ion,water,control,arg,,,,
10312,dfrK,Trimethoprim,14.0,0.00,ion,water,control,arg,,,,
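
Since `rel_n` and `rel_16` both appear to come from the same Ct conversion (the first row's `rel_n` of 9.246982 matches a Ct of 23.78 under the cell 4 formula), `log_n` reduces to a scaled Ct difference: log10(rel_n / rel_16) = (Ct_16S - Ct_gene) * 3/10. The check below works this through for aacC2 in sample 25. The function name is hypothetical, and the assumption that `rel_n` was computed with the same constants is inferred from the numbers, not stated in this notebook.

```python
import math


def ct_to_rel(ct, ct_ref=27, cycles_per_tenfold=10 / 3):
    """Same conversion as cell 4: Ct -> relative copy number."""
    return 10 ** ((ct_ref - ct) / cycles_per_tenfold)


ct_gene = 23.78  # aacC2, sample 25
ct_16s = 11.60   # 16s old 1, rerun sample 7 (= sample 25 after the +18 offset)

rel_ab = ct_to_rel(ct_gene) / ct_to_rel(ct_16s)
log_n = math.log10(rel_ab)

# The reference Ct cancels out, leaving a plain Ct difference
log_n_from_delta_ct = (ct_16s - ct_gene) * 3 / 10

print(round(log_n, 3), round(log_n_from_delta_ct, 3))
```

One consequence is that a 16S Ct that is too small pushes every `log_n` for that sample down by the same amount, regardless of the gene.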


And now I can save it in both formats (the full table, and a simple one with incomplete rows dropped) as .csv files to work with.

In [10]:
all_data.to_csv("../data/resistome_data/clean_data/ab_data_all_fixed.csv")
all_data.dropna(inplace = True)
all_data.to_csv("../data/resistome_data/clean_data/ab_data_simple_fixed.csv")