evalzoo requires a config `yml` file for each batch of profiles. Using `match_rep_df.csv`, create these config files.

Each batch needs its own pair of config files (one for replicates, one for matches), and each config lists every plate in that batch.

1. Load `match_rep_df` to get access to batch names and plate names (`Assay_Plate_Barcode`)
2. Iterate through each batch and update both the replicate and match config files
   1. Set `experiment.data_path` to the batch's profile folder and `experiment.plates` to **all** plates within that batch
3. Save the two config files under the same folder structure for every batch
   1. `params/{batch}/{batch}_replicates_config.yaml` and `params/{batch}/{batch}_matches_config.yaml`
4. Run evalzoo


After evalzoo runs, the results are stored in folders named with a random hash (unsure why). However, we want each results folder to be named after the batch it was derived from.


experiment:
  data_path: "/input/Scope1_Nikon_10X"
  input_structure: "{data_path}/{{plate}}/{{plate}}_normalized_feature_select_negcon_batch.{extension}"
  extension: csv.gz
  plates:
    - BR00117060a10x

In [107]:
import yaml
import pandas as pd
import os
import shutil

In [5]:
match_rep_df = pd.read_csv("../checkpoints/match_rep_df.csv")

match_rep_df = match_rep_df[match_rep_df["sphering"] == True]

match_rep_df

Unnamed: 0,Vendor,Batch,Plate_Map_Name,Assay_Plate_Barcode,Modality,Images_per_well,Sites-SubSampled,Binning,Magnification,Number_of_channels,...,Size_MB_std,sphering,value_95_replicating,Percent_Replicating,channel_names,brightfield_z_plane_used,feature_channels_found,Percent_Matching,value_95_matching,cell_count
0,MolDev,Scope1_MolDev_10X,JUMP-MOA_compound_platemap,Plate2_PCO_6ch_4site_10XPA,Widefield,4,,1,10,6,...,0.000144,True,0.191908,60.000000,"Actin, DNA, ER, Golgi, Mito, RNA",,"Actin, DNA, ER, Golgi, Mito, RNA",23.255814,0.288099,2014937
2,MolDev,Scope1_MolDev_10X,JUMP-MOA_compound_platemap,Plate3_PCO_6ch_4site_10XPA_Crest,Confocal,4,,1,10,6,...,0.000183,True,0.269617,62.222222,"Actin, DNA, ER, Golgi, Mito, RNA",,"Actin, DNA, ER, Golgi, Mito, RNA",18.604651,0.398249,2413350
4,MolDev,Scope1_MolDev_10X_4siteZ,JUMP-MOA_compound_platemap,Plate3_PCO_6ch_4site_10XPA_Crestz,Confocal,4,,1,10,6,...,0.000142,True,0.205121,66.666667,"Actin, DNA, ER, Golgi, Mito, RNA",,"Actin, DNA, ER, Golgi, Mito, RNA",23.255814,0.363114,2381443
6,MolDev,Scope1_MolDev_20X_4site,JUMP-MOA_compound_platemap,Plate3_PCO_6ch_4site_20XPA_Crestz,Confocal,4,,1,20,6,...,0.000114,True,0.182630,57.777778,"Actin, DNA, ER, Golgi, Mito, RNA",,"Actin, DNA, ER, Golgi, Mito, RNA",18.604651,0.279178,527841
8,MolDev,Scope1_MolDev_20X_9site,JUMP-MOA_compound_platemap,Plate2_PCO_6ch_9site_20XPA,Widefield,9,,1,20,6,...,0.000153,True,0.184205,67.777778,"Actin, DNA, ER, Golgi, Mito, RNA",,"Actin, DNA, ER, Golgi, Mito, RNA",23.255814,0.291127,1101611
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
352,Yokogawa_US,4siteSubSample_Scope1_Yokogawa_US_20X_5Ch,JUMP-MOA_compound_platemap,BRO0117056_20x,Confocal,9,4.0,1,20,5,...,0.000044,True,0.174914,57.777778,"AGP, DNA, ER, Mito, RNA",,"AGP, DNA, ER, Mito, RNA",23.255814,0.244983,544244
354,Yokogawa_US,4siteSubSample_Scope1_Yokogawa_US_20X_5Ch_12Z,JUMP-MOA_compound_platemap,BRO0117056_20xb,Confocal,9,4.0,1,20,5,...,0.000044,True,0.157136,60.000000,"AGP, DNA, ER, Mito, RNA",,"AGP, DNA, ER, Mito, RNA",20.930233,0.227059,543826
356,Yokogawa_US,4siteSubSample_Scope1_Yokogawa_US_20X_6Ch_BRO0...,JUMP-MOA_compound_platemap,BRO0117059_20X,Confocal,9,4.0,1,20,6,...,0.000583,True,0.179268,58.888889,"AGP, BrightField, DNA, ER, Mito, RNA",Z08,"AGP, BrightField, DNA, ER, Mito, RNA",20.930233,0.253483,489099
358,Yokogawa_US,4siteSubSample_Scope1_Yokogawa_US_20X_6Ch_BRO0...,JUMP-MOA_compound_platemap,BRO01177034_20x,Confocal,9,4.0,1,20,6,...,0.000014,True,0.139090,56.666667,"AGP, BrightField, DNA, ER, Mito, RNA",Z17,"AGP, BrightField, DNA, ER, Mito, RNA",18.604651,0.193171,452567


In [105]:
import copy

# Load the two template configs once; every batch gets its own modified copy.
with open("params/within_matches.yml") as f:
    match_yaml = yaml.safe_load(f)

with open("params/within_replicates.yml") as f:
    rep_yaml = yaml.safe_load(f)

for batch, grouped_df in match_rep_df.groupby("Batch"):
    # All plates that belong to this batch
    plates = grouped_df["Assay_Plate_Barcode"].tolist()

    # Deep-copy so the loaded templates are never mutated between iterations
    out_rep_yaml = copy.deepcopy(rep_yaml)
    out_rep_yaml["experiment"]["data_path"] = f"/input/profiles/{batch}"
    out_rep_yaml["experiment"]["plates"] = plates

    out_match_yaml = copy.deepcopy(match_yaml)
    out_match_yaml["experiment"]["data_path"] = f"/input/profiles/{batch}"
    out_match_yaml["experiment"]["plates"] = plates

    # One folder per batch, holding both config files
    save_dir = os.path.join("params", batch)
    os.makedirs(save_dir, exist_ok=True)

    with open(os.path.join(save_dir, f"{batch}_replicates_config.yaml"), "w") as f:
        yaml.dump(out_rep_yaml, f, sort_keys=False)

    with open(os.path.join(save_dir, f"{batch}_matches_config.yaml"), "w") as f:
        yaml.dump(out_match_yaml, f, sort_keys=False)
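
One thing to keep in mind with this pattern: assigning a loaded YAML dict to a new name does not copy it, so edits made through the new name also change the template. The sketch below shows a copy-based alternative using only the standard library; `make_config`, the batch names, and the plate names are hypothetical and used for illustration only.

```python
import copy


def make_config(template, batch, plates):
    """Return a per-batch config without modifying the shared template."""
    config = copy.deepcopy(template)  # independent nested dicts and lists
    config["experiment"]["data_path"] = f"/input/profiles/{batch}"
    config["experiment"]["plates"] = list(plates)
    return config


# Stand-in for a dict returned by yaml.safe_load (hypothetical values).
template = {"experiment": {"data_path": None, "plates": [], "extension": "csv.gz"}}

config_a = make_config(template, "batch_a", ["plate_1", "plate_2"])
config_b = make_config(template, "batch_b", ["plate_3"])

# The template is untouched and the two configs do not share state.
print(template["experiment"]["data_path"])
print(config_a["experiment"]["plates"])
print(config_b["experiment"]["plates"])
```

Because each config is written to disk before the next iteration, aliasing happens to produce the right files here, but copying keeps the templates reusable later in the notebook.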


Following an evalzoo matric run, you are left with results folders that are each named with a random hash. However, each folder should be named after the batch from which it was derived. Each folder contains a copy of the config file that was used to generate it (`params.yaml`), so we can get the batch name from there. Each hashed folder is the output of a **single** config file, so every batch produces two folders: one for match and one for rep.

Each folder also contains a copy of the profiles (`profiles.parquet`) for some reason, which we can just delete. We can delete the `.html` plot files and their supporting `_files` folders too.

1. Rename the hash folder based on the `batch` and the `background_type` (i.e. `{batch}_{background_type}`)
2. Delete the `.html` and `profiles.parquet` files
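
Since the rename and delete steps are destructive, it can help to work out the new folder names first and check for collisions. The following is a minimal dry-run sketch: it takes already-parsed `params.yaml` contents as plain dicts and touches nothing on disk. The function names, the `hash_a`/`hash_b` keys, and the `rep` value for `background_type` are assumptions for illustration.

```python
import os


def target_folder_name(params):
    """Build '{batch}_{background_type}' from a parsed params.yaml dict."""
    experiment = params["experiment"]
    # The batch is the last component of data_path, e.g. /input/profiles/<batch>
    batch = os.path.basename(experiment["data_path"].rstrip("/"))
    return f"{batch}_{experiment['background_type']}"


def plan_renames(params_by_folder):
    """Map each hashed folder to its new name, refusing duplicate targets."""
    plan = {}
    for folder, params in params_by_folder.items():
        new_name = target_folder_name(params)
        if new_name in plan.values():
            raise ValueError(f"{folder} would overwrite another folder named {new_name}")
        plan[folder] = new_name
    return plan


# Hypothetical parsed configs, keyed by (made-up) hashed folder names.
params_by_folder = {
    "hash_a": {"experiment": {
        "data_path": "/input/profiles/1siteSubSample_Scope1_MolDev_10X",
        "background_type": "rep"}},
    "hash_b": {"experiment": {
        "data_path": "/input/profiles/1siteSubSample_Scope1_MolDev_10X",
        "background_type": "non_rep"}},
}

plan = plan_renames(params_by_folder)
for old, new in plan.items():
    print(f"{old} -> {new}")
```

A collision here would mean two result folders came from the same batch and background type, for example after re-running evalzoo, which is worth resolving by hand before anything is moved.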


In [133]:
results_dir = "results/results/"

for result_folder in os.listdir(results_dir):
    folder_path = os.path.join(results_dir, result_folder)
    print(folder_path)

    # Skip stray files such as .DS_Store; only result folders are processed
    if not os.path.isdir(folder_path):
        continue

    # Remove the duplicated profiles and the HTML plots with their asset folders
    os.remove(os.path.join(folder_path, "profiles.parquet"))
    os.remove(os.path.join(folder_path, "metrics_plot_level_1_pvalue.html"))
    os.remove(os.path.join(folder_path, "metrics_plot_level_1_qvalue.html"))
    shutil.rmtree(os.path.join(folder_path, "metrics_plot_level_1_pvalue_files"))
    shutil.rmtree(os.path.join(folder_path, "metrics_plot_level_1_qvalue_files"))

    # Recover the batch and background type from the config copy in the folder
    with open(os.path.join(folder_path, "params.yaml")) as f:
        param_file = yaml.safe_load(f)
    batch = param_file["experiment"]["data_path"].split("/")[-1]
    rep_or_non_rep = param_file["experiment"]["background_type"]

    # Rename the hashed folder to {batch}_{background_type}
    shutil.move(folder_path, os.path.join(results_dir, f"{batch}_{rep_or_non_rep}"))


results/results/7bdab1ed
results/results/4df8f4be
results/results/9fca2067
results/results/fe29b925
results/results/.DS_Store
results/results/bf606ee8
results/results/d74df4e6
results/results/5b046e93
results/results/7a32000a
results/results/4bc00381
results/results/8953051b
results/results/e475b0d9
results/results/8538ead5
results/results/8ab48388
results/results/d2f6ed35
results/results/a7ced1bf
results/results/d782cb1d


In [106]:
t = "/input/profiles/1siteSubSample_Scope1_MolDev_10X"

t.split("/")[-1]

'1siteSubSample_Scope1_MolDev_10X'

In [136]:
df = pd.read_parquet("/Users/ctromans/Desktop/input/results/results/404510be/metrics_level_1_non_rep.parquet")

df

Unnamed: 0,Metadata_moa_split_compact,Metadata_reference_or_other,Metadata_pert_iname,sim_scaled_mean_non_rep_i_mean_i,sim_scaled_mean_non_rep_i_median_i,sim_scaled_median_non_rep_i_mean_i,sim_scaled_median_non_rep_i_median_i,sim_ranked_relrank_mean_non_rep_i_mean_i,sim_ranked_relrank_mean_non_rep_i_median_i,sim_ranked_relrank_median_non_rep_i_mean_i,...,sim_stat_signal_n_non_rep_i_median_i,sim_stat_background_n_non_rep_i_mean_i,sim_stat_background_n_non_rep_i_median_i,sim_retrieval_average_precision_non_rep_i_mean_i,sim_retrieval_average_precision_non_rep_i_median_i,sim_retrieval_r_precision_non_rep_i_mean_i,sim_retrieval_r_precision_non_rep_i_median_i,sim_retrieval_average_precision_non_rep_i_nlog10pvalue_mean_i,sim_retrieval_average_precision_non_rep_i_nlog10pvalue_median_i,sim_retrieval_average_precision_non_rep_i_nlog10qvalue_mean_i
0,Aurora kinase inhibitor,pert,AMG900,2.794679,2.794679,2.794679,2.794679,0.022727,0.022727,0.022727,...,1.0,88.0,88.0,0.500000,0.500000,0.0,0.0,1.962617,1.962617,0.931208
1,Aurora kinase inhibitor,pert,MK-5108,2.794679,2.794679,2.794679,2.794679,0.022727,0.022727,0.022727,...,1.0,88.0,88.0,0.500000,0.500000,0.0,0.0,1.962617,1.962617,0.931208
2,BCL inhibitor,pert,ABT-737,1.317855,1.317855,1.317855,1.317855,0.107955,0.107955,0.107955,...,1.0,88.0,88.0,0.135714,0.135714,0.0,0.0,1.105334,1.105334,0.443720
3,BCL inhibitor,pert,Compound3,1.317855,1.317855,1.317855,1.317855,0.107955,0.107955,0.107955,...,1.0,88.0,88.0,0.135714,0.135714,0.0,0.0,1.105334,1.105334,0.443720
4,Bcr-Abl kinase inhibitor,pert,GNF-5,-0.784852,-0.784852,-0.784852,-0.784852,0.659091,0.659091,0.659091,...,1.0,88.0,88.0,0.017370,0.017370,0.0,0.0,0.208658,0.208658,0.063732
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
81,tricyclic antidepressant,pert,dosulepin,-1.010539,-1.010539,-1.010539,-1.010539,0.869318,0.869318,0.869318,...,1.0,88.0,88.0,0.013140,0.013140,0.0,0.0,0.087009,0.087009,0.055154
82,tumor necrosis factor production inhibitor,pert,apratastat,-1.011224,-1.011224,-1.011224,-1.011224,0.710227,0.710227,0.710227,...,1.0,88.0,88.0,0.018905,0.018905,0.0,0.0,0.212973,0.212973,0.063732
83,tumor necrosis factor production inhibitor,pert,pomalidomide,-1.011224,-1.011224,-1.011224,-1.011224,0.710227,0.710227,0.710227,...,1.0,88.0,88.0,0.018905,0.018905,0.0,0.0,0.212973,0.212973,0.063732
84,ubiquitin specific protease inhibitor,pert,ML-323,2.373177,2.373177,2.373177,2.373177,0.022727,0.022727,0.022727,...,1.0,88.0,88.0,0.666667,0.666667,0.5,0.5,2.821076,2.821076,1.488638
