# Product correlations
An obvious question to ask in exploratory data analysis: is there a correlation between the different products?
E.g., do we get product C in the cases where we do not get A?

In [None]:
import pathlib
import sys

sys.path.append(str(pathlib.Path().resolve().parents[1]))

import pandas as pd
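import scipy.stats  # import the submodule explicitly; a bare `import scipy` does not expose scipy.stats in older SciPy versions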
import matplotlib.pyplot as plt

from src.definitions import DATA_DIR

In [None]:
# get the dataset
df = pd.read_csv(DATA_DIR / "curated_data" / "synferm_dataset_2024-04-18_38586records.csv")
df.head()

## Correlation between products

In [None]:
scipy.stats.pearsonr(df["binary_A"], df["binary_B"])

In [None]:
scipy.stats.pearsonr(df["binary_A"], df["binary_C"])

In [None]:
scipy.stats.pearsonr(df["binary_B"], df["binary_C"])

In [None]:
scipy.stats.pearsonr(df["binary_A"], df["binary_H"])

In [None]:
scipy.stats.pearsonr(df["scaled_A"], df["scaled_B"])

In [None]:
scipy.stats.pearsonr(df["scaled_A"], df["scaled_C"])

In [None]:
scipy.stats.pearsonr(df["scaled_B"], df["scaled_C"])
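
The pairwise calls above can be condensed into a single correlation matrix. A minimal sketch, assuming a small synthetic frame in place of the real dataset (the toy values are invented for illustration; only the column names are taken from the dataset). `DataFrame.corr` uses Pearson by default, and for 0/1 columns the Pearson coefficient is the same as the phi coefficient.

```python
import numpy as np
import pandas as pd
import scipy.stats

# Toy stand-in for the real dataset; values are invented for illustration only.
toy = pd.DataFrame({
    "binary_A": [1, 1, 1, 0, 0, 1, 0, 1],
    "binary_B": [1, 1, 0, 0, 0, 1, 0, 0],
    "binary_C": [0, 1, 0, 0, 1, 0, 0, 1],
    "binary_H": [0, 0, 1, 0, 0, 1, 0, 0],
})

# All pairwise Pearson correlations at once
corr_matrix = toy.corr(method="pearson")
print(corr_matrix.round(2))

# Each entry matches the corresponding scipy.stats.pearsonr call
r_ab, p_ab = scipy.stats.pearsonr(toy["binary_A"], toy["binary_B"])
assert np.isclose(corr_matrix.loc["binary_A", "binary_B"], r_ab)
```

Unlike `scipy.stats.pearsonr`, the matrix does not include p-values, so the individual calls remain useful when significance matters.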

We know that B can be converted to A by a variety of means (increasing reaction time or temperature, adding oxidant, ...).

How often do we get A or B vs. C?

In [None]:
# how often do we get A or B?
df["binary_AorB"] = df["binary_A"] | df["binary_B"]
print(f'A or B combined occur in {df["binary_AorB"].mean():.2%} of all reactions')

In [None]:
# if we get B, do we always get A too?
print(f'If B is present, A is also present in {df.loc[df["binary_B"] == 1, "binary_A"].mean():.2%} of cases')

In [None]:
# reverse question, if we get A, is B also present?
print(f'If A is present, B is also present in {df.loc[df["binary_A"] == 1, "binary_B"].mean():.2%} of cases')

In [None]:
# If we get C, are A and B also present?
print(f'If C is present, A is also present in {df.loc[df["binary_C"] == 1, "binary_A"].mean():.2%} of cases')
print(f'If C is present, B is also present in {df.loc[df["binary_C"] == 1, "binary_B"].mean():.2%} of cases')

In [None]:
# How often do we get C exclusively?
print(f'If C is present, A is not present in {1 - df.loc[df["binary_C"] == 1, "binary_A"].mean():.2%} of cases')
print(f'If C is present, neither A nor B is present in {1 - df.loc[df["binary_C"] == 1, "binary_AorB"].mean():.2%} of cases')
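
The conditional frequencies in this section all follow the same pattern: the mean of one binary column over the rows selected by another. A sketch of a hypothetical helper, `conditional_rate` (not part of the notebook or of `src`), shown on invented toy data:

```python
import pandas as pd


def conditional_rate(frame: pd.DataFrame, target: str, given: str, present: bool = True) -> float:
    """Fraction of rows with target == 1 among rows where `given` is 1 (or 0 if present=False)."""
    mask = frame[given] == (1 if present else 0)
    return frame.loc[mask, target].mean()


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "binary_A": [1, 1, 1, 0, 0, 1],
    "binary_B": [1, 1, 0, 0, 0, 0],
    "binary_C": [0, 1, 0, 0, 1, 1],
})

print(f'P(A | B) = {conditional_rate(toy, "binary_A", "binary_B"):.2%}')
print(f'P(A | C) = {conditional_rate(toy, "binary_A", "binary_C"):.2%}')
print(f'P(C | not A) = {conditional_rate(toy, "binary_C", "binary_A", present=False):.2%}')
```

Note that the helper returns NaN when no row satisfies the condition, which is the same behavior as the inline `df.loc[...].mean()` expressions.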


We think that product H may occur through elimination of the amide after forming product A.

How often do we get H vs. A?

In [None]:
# If we get H, are A, B, and C also present?
print(f'If H is present, A is also present in {df.loc[df["binary_H"] == 1, "binary_A"].mean():.2%} of cases')
print(f'If H is present, B is also present in {df.loc[df["binary_H"] == 1, "binary_B"].mean():.2%} of cases')
print(f'If H is present, C is also present in {df.loc[df["binary_H"] == 1, "binary_C"].mean():.2%} of cases')

# compare to background rate
print(f'On average A is present in {df["binary_A"].mean():.2%} of cases')
print(f'On average B is present in {df["binary_B"].mean():.2%} of cases')
print(f'On average C is present in {df["binary_C"].mean():.2%} of cases')

# If we don't get A, do we see H?
print(f'If A is not present, H is present in {df.loc[df["binary_A"] == 0, "binary_H"].mean():.2%} of cases')
# compare to background rate
print(f'On average H is present in {df["binary_H"].mean():.2%} of cases')

Occurrence of A is indeed enriched, conditional on the occurrence of H. This is not observed for either B or C, which is consistent with the conjecture that H forms by elimination from A.
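
Whether an enrichment of this kind exceeds what chance would produce can be checked on a 2x2 contingency table. A sketch, assuming Fisher's exact test is an acceptable choice here (the notebook itself runs no significance test) and using invented toy data in place of the real columns:

```python
import pandas as pd
import scipy.stats

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "binary_A": [1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
    "binary_H": [1, 1, 1, 0, 0, 0, 0, 1, 0, 0],
})

# 2x2 contingency table: rows = H absent/present, columns = A absent/present
table = pd.crosstab(toy["binary_H"], toy["binary_A"])
print(table)

# Lift: P(A | H) relative to the background rate P(A); values above 1 indicate enrichment
p_a_given_h = toy.loc[toy["binary_H"] == 1, "binary_A"].mean()
lift = p_a_given_h / toy["binary_A"].mean()
print(f"lift = {lift:.2f}")

# Fisher's exact test on the table (two-sided by default)
odds_ratio, p_value = scipy.stats.fisher_exact(table.to_numpy())
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
```

With tens of thousands of records, even small enrichments will come out as significant, so the lift or odds ratio is likely the more informative number.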

In [None]:
# do we ever see C for ABTs?
print(f'C occurs for ABTs in {df.loc[df["T_long"].str.startswith("TerABT"), "binary_C"].mean():.2%} of cases')
print(f'C occurs for THs in {df.loc[df["T_long"].str.startswith("TerTH"), "binary_C"].mean():.2%} of cases')

In [None]:
# formation of C by terminator
fig, ax = plt.subplots(figsize=(2.25, 2), dpi=300)
# draw on the prepared axes; otherwise pandas opens a new figure and the size/dpi settings are lost
df.groupby("T_long")["binary_C"].mean().plot.bar(ax=ax)
ax.set_xlabel("Terminator")
ax.set_ylabel("Occurrence of product C")
ax.set_ylim(0, 1)
ax.tick_params(axis="x", labelsize=8)
plt.show()

In [None]:
# formation of H by monomer
fig, ax = plt.subplots(figsize=(2.25, 2), dpi=300)
# draw on the prepared axes; otherwise pandas opens a new figure and the size/dpi settings are lost
df.groupby("M_long")["binary_H"].mean().plot.bar(ax=ax)
ax.set_xlabel("Monomer")
ax.set_ylabel("Occurrence of product H")
ax.set_ylim(0, 1)
ax.tick_params(axis="x", labelsize=5)
plt.show()

## Conclusion
It seems the formation of A and B is moderately correlated (which makes some sense, as B is an intermediate en route to A). In particular, whenever we see B, we also see A.

A and C are weakly correlated. This is probably the result of two opposing tendencies: formation of A depletes the shared intermediate B, leading to a negative correlation, but formation of both A and C is confounded by formation of B, leading to a positive correlation. In particular, we only see C if we also see A. This makes sense when considering the first point: we get (some) A whenever B forms. Intramolecular cyclization does not outcompete oxidative decarboxylation sufficiently to shut down formation of A.
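
One way to probe the confounding argument is a first-order partial correlation of A and C, controlling for B. The sketch below is an illustration only and not an analysis done in this notebook. It uses synthetic data in which A and C both depend on B, and a hypothetical helper `partial_corr`. A linear partial correlation on binary variables is at best a rough diagnostic.

```python
import numpy as np
import pandas as pd


def partial_corr(frame: pd.DataFrame, x: str, y: str, control: str) -> float:
    """First-order partial correlation of x and y, controlling for `control`."""
    r = frame[[x, y, control]].corr()
    r_xy = r.loc[x, y]
    r_xz = r.loc[x, control]
    r_yz = r.loc[y, control]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Synthetic data, invented for illustration: A and C both depend on B
rng = np.random.default_rng(0)
n = 500
b = rng.integers(0, 2, size=n)
a = ((b == 1) & (rng.random(n) < 0.9)) | (rng.random(n) < 0.2)
c = ((b == 1) & (rng.random(n) < 0.5)) | (rng.random(n) < 0.05)
toy = pd.DataFrame({"binary_A": a.astype(int), "binary_B": b, "binary_C": c.astype(int)})

print(f'raw r(A, C)       = {toy["binary_A"].corr(toy["binary_C"]):.3f}')
print(f'partial r(A, C|B) = {partial_corr(toy, "binary_A", "binary_C", "binary_B"):.3f}')
```

If the raw correlation is mostly due to B, the partial correlation should be noticeably closer to zero (or negative) than the raw one.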