# Plotting the results from experiment 1

One of the main claims of the original paper, reported in Table 1, is that the SELFIES-based GA outperforms other generative models:

![Screen Shot 2021-01-06 at 15.42.06.png](attachment:03a3a7fa-64f1-4b19-9121-a502e6fec6a8.png)

In [161]:
import wandb
import pandas as pd
import holoviews as hv
from holoviews import opts
import numpy as np 
hv.extension('bokeh')
api = wandb.Api()

In [30]:
runs = api.runs("kjappelbaum/ga_replication_study")
summary_list = [] 
config_list = [] 
name_list = [] 
tag_list = []
id_list = []
for run in runs: 
    # run.summary are the output key/values like accuracy.  We call ._json_dict to omit large files 
    summary_list.append(run.summary._json_dict) 

    # run.config holds the input settings (hyperparameters).  We drop special keys that start with _.
    config_list.append({k:v for k,v in run.config.items() if not k.startswith('_')}) 
    
    # run.name is the name of the run.
    name_list.append(run.name)       
    tag_list.append(run.tags) 
    id_list.append(run.id)
    
summary_df = pd.DataFrame.from_records(summary_list) 
config_df = pd.DataFrame.from_records(config_list) 
name_df = pd.DataFrame({'name': name_list}) 
tag_df = pd.DataFrame({'tags': tag_list})
id_df = pd.DataFrame({'id': id_list})
all_df = pd.concat([name_df, id_df, tag_df, config_df,summary_df], axis=1)

## Random baseline

In [31]:
all_df[[tag == ['baseline', 'experiment_1', 'final'] for tag in all_df['tags']]]

Unnamed: 0,name,id,tags,run,beta,disc_layers,disc_enc_type,generation_size,num_generations,max_molecules_len,...,gradients/graph_9hidden.8.weight,gradients/graph_9hidden.3.weight,gradients/graph_9hidden.1.bias,gradients/graph_9hidden.2.bias,gradients/graph_9hidden.1.weight,gradients/graph_9hidden.7.weight,gradients/graph_9predict.bias,J,run.1,smile
51,radiant-star-10,3u3a0pn8,"[baseline, experiment_1, final]",,,,,,,,...,,,,,,,,3.531633,,
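
The filter above compares each tag list against one exact, ordered list, so it would miss a run whose tags were stored in a different order. A minimal sketch of an order-independent check, assuming tags are plain lists of strings; the helper `has_exact_tags` and the sample records are hypothetical and not part of the original notebook:

```python
def has_exact_tags(run_tags, required_tags):
    """Return True if run_tags holds exactly the required tags, in any order."""
    return set(run_tags) == set(required_tags)


# Hypothetical records standing in for rows of all_df.
records = [
    {"id": "a", "tags": ["final", "baseline", "experiment_1"]},
    {"id": "b", "tags": ["experiment_1", "final", "ga"]},
    {"id": "c", "tags": ["baseline", "experiment_1"]},
]

wanted = ["baseline", "experiment_1", "final"]
baseline_ids = [r["id"] for r in records if has_exact_tags(r["tags"], wanted)]
print(baseline_ids)
```

With the DataFrame, the same helper could be applied through `all_df['tags'].apply(...)` to build a boolean mask.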


In [64]:
run_random_baseline = api.run("kjappelbaum/ga_replication_study/3u3a0pn8")

In [163]:
our_dist_random_baseline_dist = hv.Violin(run_random_baseline.history()['J'].values, 
                                          label='our results', vdims='J').opts(opts.Violin(inner='stick'))

In [79]:
original_paper_random_baseline = pd.read_csv(
    'https://raw.githubusercontent.com/aspuru-guzik-group/GA/paper_results/4.1/random_selfies/results.txt',
    sep=r'\s+',  # raw string avoids the invalid-escape warning for \s
    names=['smiles', '_', 'j'],
)

In [164]:
original_paper_random_baseline_dist = hv.Violin(original_paper_random_baseline['j'].values, 
                                                label='original GitHub', vdims='J').opts(opts.Violin(inner='stick'))

In [165]:
our_dist_random_baseline_dist * original_paper_random_baseline_dist
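
The overlay above is a visual comparison only. A minimal sketch of a numeric side-by-side summary, using only the standard library; the helper `summarize` is hypothetical and the two score lists are placeholder values, not the actual data:

```python
import statistics


def summarize(values):
    """Return basic summary statistics for a sequence of scores."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "max": max(values),
    }


# Placeholder scores for illustration only.
ours = [1.0, 2.0, 3.0, 4.0]
original = [2.0, 4.0, 6.0, 8.0]

for label, values in [("our results", ours), ("original GitHub", original)]:
    print(label, summarize(values))
```

In the notebook, the inputs would be `run_random_baseline.history()['J'].values` and `original_paper_random_baseline['j'].values`.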

Since our distribution does not match the original one, let's check whether our scoring functions are implemented correctly.

In [90]:
original_paper_random_baseline['smiles'][0].strip(',')

'N#Cc1ccccc1NN=NSOc1ccccc1NNc1ccccc1-c1ccccc1-c1ccccc1-c1ccccc1-c1ccccc1N=O'

In [91]:
from rdkit.Chem import Descriptors

In [97]:
import sys
sys.path.append('../../')

In [100]:
from sa_scorer.sascorer import calculate_score
from net import evolution_functions as evo

In [103]:
mol, _ , _ = evo.sanitize_smiles('N#Cc1ccccc1NN=NSOc1ccccc1NNc1ccccc1-c1ccccc1-c1ccccc1-c1ccccc1-c1ccccc1N=O')

In [109]:
(Descriptors.MolLogP(mol) - 2.4729421499641497) /  1.4157879815362406 - (calculate_score(mol) - 3.0470797085649894)/0.830643172314514

6.660424831806891

This is the same number they have in their table, so the scoring code we use seems fine.
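
The check above can be wrapped in a small function so that the normalization constants live in one place. This is a sketch that only mirrors the expression in the cell above (standardized logP minus standardized SA score); the constants are copied from that cell, and the name `standardized_score` is hypothetical:

```python
# Normalization constants copied from the cell above.
LOGP_MEAN = 2.4729421499641497
LOGP_STD = 1.4157879815362406
SA_MEAN = 3.0470797085649894
SA_STD = 0.830643172314514


def standardized_score(logp, sa_score):
    """Standardized logP minus standardized SA score, as in the expression above."""
    return (logp - LOGP_MEAN) / LOGP_STD - (sa_score - SA_MEAN) / SA_STD


# Sanity check: a molecule sitting at both means scores zero.
print(standardized_score(LOGP_MEAN, SA_MEAN))
```

In the notebook it would be called as `standardized_score(Descriptors.MolLogP(mol), calculate_score(mol))`.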

## GA (no discriminator)

In [182]:
is_ga_run = all_df['tags'].apply(lambda tags: tags == ['experiment_1', 'final', 'ga'])  # boolean Series, not a plain list
ga_runs_df = all_df[is_ga_run & (all_df['beta'] == 0)]

In [183]:
ga_run_ids = ga_runs_df['id']

In [184]:
j_gas = []

for ga_run_id in ga_run_ids:
    try:
        run = api.run(f"kjappelbaum/ga_replication_study/{ga_run_id}")
        j_gas.append(run.history()['fitness'].max())
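    except Exception as err:
        # Report runs that could not be loaded instead of skipping them silently.
        print(f"Skipping run {ga_run_id}: {err}")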

In [185]:
hv.Violin(j_gas, vdims='J').opts(opts.Violin(inner='stick'))

In [186]:
print(f"We found a maximum penalized logP of {np.mean(j_gas):.3f} +/- {np.std(j_gas):.3f} (averaged over {len(j_gas)} runs)")

We found a maximum penalized logP of 11.911 +/- 1.262 (averaged over 10 runs)


Within the error margin, this agrees with the results from the original paper.
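
Note that `np.std` defaults to the population standard deviation (`ddof=0`); with only a handful of runs, the sample standard deviation (`ddof=1`) is somewhat larger. A minimal sketch of the difference using the standard library, with made-up per-run maxima that are not the notebook's data:

```python
import statistics

# Hypothetical per-run maxima, for illustration only.
j_max = [10.0, 12.0, 11.0, 13.0, 14.0]

mean = statistics.mean(j_max)
population_std = statistics.pstdev(j_max)  # same convention as np.std (ddof=0)
sample_std = statistics.stdev(j_max)       # same convention as np.std(..., ddof=1)

print(f"mean = {mean:.3f}")
print(f"population std = {population_std:.3f}")
print(f"sample std = {sample_std:.3f}")
```

Which convention is appropriate depends on how the error margin in the comparison is defined, so it is worth stating it alongside the reported value.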