## Comparison of Raw Data Correction Methods
Methods compared:
1. Brianna's linear mixed model with fixed and random effects (single mutants and wild type only; her results cover sets grown on one flat, not four)

    For sets with many flats:
        
        formula = f'{col_name} ~ Genotype + (1|Column) + (1|Row) + (1|Flat)'
    
    For sets with one flat:
        
        formula = f'{col_name} ~ Genotype + (1|Column) + (1|Row)'

2. Estimation of marginal means for each genotype using lmer (single, double, and wild type)

    Per set, per flat:
    
        formula = TSC ~ Subline + (1|Column) + (1|Row)

3. Spatial Analysis with SpATS (single, double, and wild type)
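
The two formulas listed for method 1 differ only in the `(1|Flat)` term. As a minimal sketch, the choice could be expressed as a small helper that builds the formula string. The names `build_formula` and `n_flats` are hypothetical and not from Brianna's code; only `col_name` and the formula text come from the list above.

```python
def build_formula(col_name: str, n_flats: int) -> str:
    """Return the mixed-model formula string for one trait column.

    Flat is added as a random intercept only when the set spans more
    than one flat, mirroring the two formulas listed for method 1.
    (Hypothetical helper, not taken from the original analysis code.)
    """
    if n_flats < 1:
        raise ValueError("n_flats must be at least 1")
    random_terms = ["(1|Column)", "(1|Row)"]
    if n_flats > 1:
        random_terms.append("(1|Flat)")
    return f"{col_name} ~ Genotype + " + " + ".join(random_terms)


print(build_formula("TSC", n_flats=4))  # TSC ~ Genotype + (1|Column) + (1|Row) + (1|Flat)
print(build_formula("TSC", n_flats=1))  # TSC ~ Genotype + (1|Column) + (1|Row)
```

This only assembles the string; fitting the model still depends on whichever mixed-model library is used.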

In [1]:
import datatable as dt
import pandas as pd

### Read in the corrected raw datasets

In [2]:
# Results on single mutants (that were grown on only one flat) for the lmer model Brianna ran in python
og_bri = dt.fread('../data/brianna_comparemean_tolmer_df_withrelative.csv').to_pandas()
og_bri.head()

Unnamed: 0,Set,WT_avg,WT_fitlmer,MA_avg,MA_fitlmer,MB_avg,MB_fitlmer,MA,MB,MA/WT,MB/WT
0,845,30.79,30.28,41.66,42.3,31.21,31.23,AT1G06040,AT2G31380,1.396794,1.031056
1,845E,27.94,27.58,27.1,27.04,25.88,26.72,AT1G06040,AT2G31380,0.980661,0.968813
2,133,406.46,408.68,411.25,414.09,369.34,368.95,AT1G18620,AT1G74160,1.013246,0.902789
3,703,340.87,342.38,228.98,228.4,292.24,291.11,AT1G74160,AT1G18620,0.667103,0.850257
4,72,166.93,166.73,161.08,161.47,151.75,151.59,AT3G14020,AT1G54160,0.968407,0.909187


In [3]:
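# Set 1 was grown on four flats; such sets were excluded from her analysis.
# Note: str.contains('1') matches every Set label that contains the character '1', not only Set 1.
og_bri.loc[og_bri['Set'].str.contains('1'),:]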

Unnamed: 0,Set,WT_avg,WT_fitlmer,MA_avg,MA_fitlmer,MB_avg,MB_fitlmer,MA,MB,MA/WT,MB/WT
2,133,406.46,408.68,411.25,414.09,369.34,368.95,AT1G18620,AT1G74160,1.013246,0.902789
14,791,61.32,60.7,71.38,71.87,66.67,65.13,AT1G07180,AT2G29990,1.184029,1.073125
19,61,332.35,333.34,335.11,345.25,359.52,351.93,AT1G10450,AT1G59890,1.03573,1.055769
22,71,90.37,90.36,15.25,14.19,90.21,89.16,AT1G10650,AT1G60610,0.15708,0.986719
28,761,101.93,98.62,86.71,90.74,116.02,108.8,AT1G17540,AT1G72760,0.92014,1.103183
33,771,102.53,101.68,104.87,104.32,81.56,82.26,AT1G21380,AT1G76970,1.025938,0.80904
46,741,102.17,103.0,113.92,114.62,100.68,100.79,AT1G52190,AT3G16180,1.112831,0.978543
47,712,180.59,181.53,172.47,171.68,161.08,161.35,AT1G52420,AT3G15940,0.945766,0.88886
49,719,53.5,52.22,65.55,65.9,118.06,116.88,AT1G54130,AT3G14050,1.262093,2.238322
52,812,136.65,136.78,139.11,139.78,136.5,136.14,AT1G66180,AT5G37540,1.021922,0.995302
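
`str.contains` does substring (regex) matching, so the filter in this cell returns every set whose label contains the character `1` (133, 791, 61, and so on). None of the rows shown has `Set` equal to `1`, which is consistent with the four-flat sets having been excluded. The sketch below contrasts substring and exact matching on a made-up frame; `demo` and its labels are hypothetical stand-ins for `og_bri`.

```python
import pandas as pd

# Hypothetical stand-in for og_bri: only the Set labels matter here.
demo = pd.DataFrame({"Set": ["1", "133", "845", "845E", "71"]})

# Substring match: any label containing the character "1".
substring_hits = demo.loc[demo["Set"].str.contains("1"), "Set"].tolist()

# Exact match: only the set labelled "1".
exact_hits = demo.loc[demo["Set"] == "1", "Set"].tolist()

print(substring_hits)  # ['1', '133', '71']
print(exact_hits)      # ['1']
```

On the real data, `og_bri.loc[og_bri['Set'] == '1', :]` would be the exact-match form, assuming `Set` is stored as a string column as the `.str` accessor implies.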


In [4]:
# Results on single and double mutants for the lmer model I ran in R (should reproduce Brianna's results)
bri = dt.fread('../data/double_mutant_fitness_data_05312024_TSC_corrected_brianna.txt').to_pandas()
bri.head()

Unnamed: 0,Set,Flat,Column,Row,Number,Type,Genotype,Subline,MA,MB,...,WO,FN,SPF,TSC,SH,emmean,SE,df,lower.CL,upper.CL
0,1,1,4,1,4,BORDER,MB,001-MB-2,WT,MUT,...,1.0,2.0,21.666667,65.0,0.0,38.65701,3.52665,15.317155,31.153665,46.160354
1,1,1,6,1,6,BORDER,DM,001-DM-2,MUT,MUT,...,0.0,0.0,20.333333,61.0,0.0,40.079134,3.531284,15.320574,32.566074,47.592193
2,1,1,8,1,8,BORDER,MA,001-MA-2,MUT,WT,...,0.0,0.0,15.5,62.0,0.0,51.311661,3.67322,16.969088,43.560769,59.062553
3,1,1,10,1,10,BORDER,WT,001-WT-2,WT,WT,...,1.0,0.0,12.5,37.5,,54.898058,3.625398,15.993182,47.212292,62.583824
4,1,1,6,3,26,INSIDE,MB,001-MB-2,WT,MUT,...,0.0,0.0,16.333333,49.0,0.0,38.65701,3.52665,15.317155,31.153665,46.160354


In [5]:
# Results on single and double mutants for the lmer model run per set, per flat
lin = dt.fread('../data/double_mutant_fitness_data_05312024_TSC_corrected_linear.txt').to_pandas()
lin.head()

Unnamed: 0,Set,Flat,Column,Row,Number,Type,Genotype,Subline,MA,MB,...,WO,FN,SPF,TSC,SH,emmean,SE,df,lower.CL,upper.CL
0,1,1,4,1,4,BORDER,MB,001-MB-2,WT,MUT,...,1.0,2.0,21.666667,65.0,0.0,41.427617,5.251221,36.40158,30.781726,52.073509
1,1,1,6,1,6,BORDER,DM,001-DM-2,MUT,MUT,...,0.0,0.0,20.333333,61.0,0.0,38.897783,4.932536,34.149081,28.875275,48.920291
2,1,1,8,1,8,BORDER,MA,001-MA-2,MUT,WT,...,0.0,0.0,15.5,62.0,0.0,44.320512,5.53711,38.173321,33.112889,55.528134
3,1,1,10,1,10,BORDER,WT,001-WT-2,WT,WT,...,1.0,0.0,12.5,37.5,,44.88664,5.879645,37.370532,32.977333,56.795946
4,1,1,6,3,26,INSIDE,MB,001-MB-2,WT,MUT,...,0.0,0.0,16.333333,49.0,0.0,41.427617,5.251221,36.40158,30.781726,52.073509


In [6]:
# Results on single and double mutants for the spatial analysis model run per set, per flat
spa = dt.fread('../data/double_mutant_fitness_data_05312024_TSC_corrected_SpATS.txt').to_pandas()
spa.head()

Unnamed: 0,Set,Flat,Column,Row,Number,Type,Genotype,Subline,MA,MB,...,WO,FN,SPF,TSC,SH,R,C,geno,weights,fit.TSC$fitted
0,1,1,4,1,4,BORDER,MB,001-MB-2,WT,MUT,...,1.0,2.0,21.666667,65.0,0.0,1,4,MB,True,66.307385
1,1,1,6,1,6,BORDER,DM,001-DM-2,MUT,MUT,...,0.0,0.0,20.333333,61.0,0.0,1,6,DM,True,52.85026
2,1,1,8,1,8,BORDER,MA,001-MA-2,MUT,WT,...,0.0,0.0,15.5,62.0,0.0,1,8,MA,True,47.525405
3,1,1,10,1,10,BORDER,WT,001-WT-2,WT,WT,...,1.0,0.0,12.5,37.5,,1,10,WT,True,46.041855
4,1,1,6,3,26,INSIDE,MB,001-MB-2,WT,MUT,...,0.0,0.0,16.333333,49.0,0.0,3,6,MB,True,52.446085


In [7]:
spa[['Set', 'Genotype', 'TSC', 'fit.TSC$fitted']].groupby(['Set', 'Genotype']).mean()
# The mean of the SpATS fitted values matches the raw TSC mean exactly for every genotype,
# but not for every subline (see the next cell). I have not worked out why.

Set,Genotype,TSC,fit.TSC$fitted
1,DM,40.386243,40.386243
1,MA,51.669540,51.669540
1,MB,38.492063,38.492063
1,WT,54.880556,54.880556
11,DM,616.974368,616.974368
...,...,...,...
845,MB,32.216129,32.216129
845,WT,30.793651,30.793651
845E,MA,27.104478,27.104478
845E,MB,25.883333,25.883333


In [8]:
spa[['Set', 'Subline', 'TSC', 'fit.TSC$fitted']].groupby(['Set', 'Subline']).mean()

Set,Subline,TSC,fit.TSC$fitted
1,001-DM-1,47.428571,47.611536
1,001-DM-2,46.452381,40.963321
1,001-DM-3,28.615385,32.194339
1,001-DM-4,39.250000,42.200998
1,001-DM-5,38.700000,37.934696
...,...,...,...
845E,845-MB-3,20.714286,26.906139
845E,845-MB-4,32.941176,24.433929
845E,845-WT-1,28.272727,29.288220
845E,845-WT-2,21.250000,26.535178
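
One possible explanation for the pattern in the two tables above, offered as an assumption rather than something this notebook verifies: if genotype entered the SpATS model as a fixed effect, the residuals average to zero within each genotype, so genotype-level means of the fitted values reproduce the raw means, while nothing forces the same at the subline level. The toy example below shows this for the simplest case, a genotype-only model with no spatial terms. All values and labels are invented.

```python
from collections import defaultdict
from statistics import mean

# Made-up plants: (genotype, subline, raw value). Not real data.
plants = [
    ("WT", "WT-1", 50.0), ("WT", "WT-1", 54.0),
    ("WT", "WT-2", 60.0), ("WT", "WT-2", 64.0),
    ("MA", "MA-1", 40.0), ("MA", "MA-1", 44.0),
    ("MA", "MA-2", 30.0), ("MA", "MA-2", 38.0),
]


def group_means(rows, key_index, value_index):
    """Average the value at value_index within each group at key_index."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[value_index])
    return {key: mean(values) for key, values in groups.items()}


# Simplest possible "model": each plant's fitted value is its genotype mean.
geno_mean = group_means(plants, 0, 2)
fitted = [(g, s, y, geno_mean[g]) for g, s, y in plants]

raw_by_geno = group_means(fitted, 0, 2)
fit_by_geno = group_means(fitted, 0, 3)
raw_by_subline = group_means(fitted, 1, 2)
fit_by_subline = group_means(fitted, 1, 3)

print(raw_by_geno == fit_by_geno)        # True: genotype means coincide
print(raw_by_subline == fit_by_subline)  # False: subline means do not
```

If this reading is right, the per-genotype mean of `fit.TSC$fitted` carries the same information as the raw genotype mean, which would also explain why the two end up perfectly correlated later in the notebook.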


In [9]:
og_bri.shape, bri.shape, lin.shape, spa.shape

((119, 11), (25795, 26), (25795, 26), (25795, 26))

In [10]:
# Reshape Brianna's data
og_Bri = og_bri[['Set', 'WT_fitlmer', 'MA_fitlmer', 'MB_fitlmer']].melt(id_vars = 'Set', value_name='TSC_corrected', var_name='Genotype')
og_Bri.Genotype = og_Bri.Genotype.str.split('_').str.get(0)
og_Bri.head()

Unnamed: 0,Set,Genotype,TSC_corrected
0,845,WT,30.28
1,845E,WT,27.58
2,133,WT,408.68
3,703,WT,342.38
4,72,WT,166.73


In [11]:
bri_raw = og_bri[['Set', 'WT_avg', 'MA_avg', 'MB_avg']].melt(id_vars = 'Set', value_name='TSC_avg_raw', var_name='Genotype')
bri_raw.Genotype = bri_raw.Genotype.str.split('_').str.get(0)
bri_raw.head()

Unnamed: 0,Set,Genotype,TSC_avg_raw
0,845,WT,30.79
1,845E,WT,27.94
2,133,WT,406.46
3,703,WT,340.87
4,72,WT,166.93


In [12]:
# Merge corrected values with the mean of the raw data for single mutants
corrected = pd.merge(
    bri[['Set', 'Genotype', 'emmean']].groupby(['Set', 'Genotype']).mean(),  # Re-run of Brianna's model
    lin[['Set', 'Genotype', 'emmean']].groupby(['Set', 'Genotype']).mean(),  # Linear model results to compare with Brianna's
    left_index=True, right_index=True, how='left')

corrected = pd.merge(corrected, spa[['Set', 'Genotype', 'fit.TSC$fitted']].\
    groupby(['Set', 'Genotype']).mean(), 
    left_on=['Set', 'Genotype'], right_index=True, how='left') # Spatial analysis results

corrected = pd.merge(corrected, og_Bri, left_index=True,
    right_on=['Set', 'Genotype'], how='left') # Brianna's python results

corrected = pd.merge(bri_raw, corrected, on=['Set', 'Genotype'], how='left') # Brianna's raw mean data

corrected = pd.merge(corrected, lin[['Set', 'Genotype', 'TSC']].groupby(['Set', 'Genotype']).mean(),
    left_on=['Set', 'Genotype'], right_index=True, how='left') # Raw mean data (to compare with Brianna's)

corrected.columns = ['Set', 'Genotype', 'TSC_avg_raw_bri', 'Brianna_rerun', 'Linear',
                     'SpATS', 'Brianna_og', 'TSC_raw_avg']

corrected

Unnamed: 0,Set,Genotype,TSC_avg_raw_bri,Brianna_rerun,Linear,SpATS,Brianna_og,TSC_raw_avg
0,845,WT,30.79,30.338463,30.338463,30.793651,30.28,30.793651
1,845E,WT,27.94,27.575788,27.575788,27.937500,27.58,27.937500
2,133,WT,406.46,409.687943,409.687943,406.461695,408.68,406.461695
3,703,WT,340.87,342.450414,342.450414,340.866006,342.38,340.866006
4,72,WT,166.93,166.184350,166.184350,166.934028,166.73,166.934028
...,...,...,...,...,...,...,...,...
352,724,MB,350.20,350.827143,350.827143,350.197384,351.13,350.197384
353,739,MB,87.73,87.733615,87.733615,87.733615,87.68,87.733615
354,767,MB,26.60,25.789527,25.789527,26.600000,25.05,26.600000
355,754,MB,42.70,42.511234,42.511234,42.701389,42.62,42.701389


In [13]:
corrected.select_dtypes('float').corr(method='pearson')

Unnamed: 0,TSC_avg_raw_bri,Brianna_rerun,Linear,SpATS,Brianna_og,TSC_raw_avg
TSC_avg_raw_bri,1.0,0.460581,0.460581,0.462071,0.999771,0.462071
Brianna_rerun,0.460581,1.0,1.0,0.9998,0.466201,0.9998
Linear,0.460581,1.0,1.0,0.9998,0.466201,0.9998
SpATS,0.462071,0.9998,0.9998,1.0,0.467512,1.0
Brianna_og,0.999771,0.466201,0.466201,0.467512,1.0,0.467512
TSC_raw_avg,0.462071,0.9998,0.9998,1.0,0.467512,1.0


In [14]:
# Since Brianna only did the single mutant data, there are NAs when I combine her results with mine
corrected.select_dtypes('float').dropna().corr(method='pearson')

Unnamed: 0,TSC_avg_raw_bri,Brianna_rerun,Linear,SpATS,Brianna_og,TSC_raw_avg
TSC_avg_raw_bri,1.0,0.460581,0.460581,0.462071,0.999771,0.462071
Brianna_rerun,0.460581,1.0,1.0,0.9998,0.466201,0.9998
Linear,0.460581,1.0,1.0,0.9998,0.466201,0.9998
SpATS,0.462071,0.9998,0.9998,1.0,0.467512,1.0
Brianna_og,0.999771,0.466201,0.466201,0.467512,1.0,0.467512
TSC_raw_avg,0.462071,0.9998,0.9998,1.0,0.467512,1.0
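
The two correlation tables above are identical. A likely reason, assuming the float columns of `corrected` contain no missing values at this point: `DataFrame.corr` already excludes missing values pair by pair, so calling `dropna()` first only changes the result when NaNs are actually present. The sketch below illustrates the difference on an invented frame; `demo` and its columns are hypothetical.

```python
import pandas as pd

nan = float("nan")

# Invented data: column "c" has missing values, "a" and "b" do not.
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 1.0, 4.0, 3.0, 5.0],
    "c": [1.0, nan, 2.0, nan, 4.0],
})

# Pairwise deletion: each pair of columns uses all rows where both are present.
pairwise = demo.corr(method="pearson")

# Listwise deletion: rows with any NaN are dropped before correlating.
listwise = demo.dropna().corr(method="pearson")

print(pairwise.loc["a", "b"])  # uses all five rows
print(listwise.loc["a", "b"])  # uses only the three complete rows
```

A quick `corrected.isna().sum()` would confirm whether any NaNs are present in the real frame.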


Conclusion:

Of the three approaches I ran, the choice makes little difference: on the sets 
compared here, "Brianna_rerun", "Linear", and "SpATS" correlate with one another 
at 0.9998 or higher. I don't trust the spatial model, though, so I won't use it. 
Instead, I will use the "Linear" model built per set per flat rather than the 
re-run of Brianna's model, which is built per set with flat as a random effect 
for the sets grown on four flats.

Remaining mystery:

I believe my analysis is correct, so I don't know why it disagrees with what 
Brianna built in Python. I kept the WT genotype as the reference level in all 
three approaches. The models I ran agree with the average total seed count, 
which in turn matches Brianna's averages in the rows displayed above, yet the 
correlation of her results with "Brianna_rerun", "Linear", "SpATS", and 
"TSC_raw_avg" is only ~0.46. Brianna said she has to re-run her analysis and 
clean up her code, so in the meantime I will move forward.
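
A correlation near 0.46 alongside displayed rows that match closely suggests the disagreement is concentrated in a subset of sets. A minimal sketch for listing the rows that differ most between two columns follows. The helper `largest_disagreements` and the `demo` values are hypothetical; only the column names come from `corrected`.

```python
import pandas as pd


def largest_disagreements(df, col_a, col_b, n=10):
    """Return the n rows where col_a and col_b differ most in absolute value."""
    out = df.copy()
    out["abs_diff"] = (out[col_a] - out[col_b]).abs()
    return out.sort_values("abs_diff", ascending=False).head(n)


# Hypothetical miniature of `corrected`; values are invented for illustration.
demo = pd.DataFrame({
    "Set": ["A", "B", "C", "D"],
    "Genotype": ["WT", "WT", "MA", "MB"],
    "TSC_avg_raw_bri": [30.79, 406.46, 120.00, 42.70],
    "TSC_raw_avg": [30.79, 406.46, 310.00, 42.70],
})

worst = largest_disagreements(demo, "TSC_avg_raw_bri", "TSC_raw_avg", n=2)
print(worst[["Set", "Genotype", "abs_diff"]])
```

On the real frame, `largest_disagreements(corrected, 'TSC_avg_raw_bri', 'TSC_raw_avg')` would show which sets drive the low correlation.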

Mystery solved:

Brianna used an older version of the dataset, which explains the disagreement.
She re-ran the model with the updated raw data (the version I used), and at least 
for set 845 our numbers agree, so the correlation should rise to about 0.99 if I 
included her updated results here.

In [15]:
corrected.loc[corrected['Set'].str.contains('845'),:]

Unnamed: 0,Set,Genotype,TSC_avg_raw_bri,Brianna_rerun,Linear,SpATS,Brianna_og,TSC_raw_avg
0,845,WT,30.79,30.338463,30.338463,30.793651,30.28,30.793651
1,845E,WT,27.94,27.575788,27.575788,27.9375,27.58,27.9375
119,845,MA,41.66,42.406116,42.406116,41.664103,42.3,41.664103
120,845E,MA,27.1,27.042495,27.042495,27.104478,27.04,27.104478
238,845,MB,31.21,31.798327,31.798327,32.216129,31.23,32.216129
239,845E,MB,25.88,26.71577,26.71577,25.883333,26.72,25.883333
