# ClinTox exploratory analysis

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("../data/external/clintox.csv")

In [3]:
df.shape

(1484, 3)

In [4]:
df.head()

,smiles,FDA_APPROVED,CT_TOX
0,*C(=O)[C@H](CCCCNC(=O)OCCOC)NC(=O)OCCOC,1,0
1,[C@@H]1([C@@H]([C@@H]([C@H]([C@@H]([C@@H]1Cl)C...,1,0
2,[C@H]([C@@H]([C@@H](C(=O)[O-])O)O)([C@H](C(=O)...,1,0
3,[H]/[NH+]=C(/C1=CC(=O)/C(=C\C=c2ccc(=C([NH3+])...,1,0
4,[H]/[NH+]=C(\N)/c1ccc(cc1)OCCCCCOc2ccc(cc2)/C(...,1,0


The dataset has just two binary outcomes:
* FDA_APPROVED - whether the drug is FDA-approved
* CT_TOX - whether the drug failed clinical trials because of toxicity

## Missingness

In [5]:
# Fraction of missing values in each outcome column (smiles is excluded)
df_outcomes = df.drop(columns=["smiles"])
col_missing = df_outcomes.isna().mean()

# Summarise missingness across the outcome columns
median = col_missing.median()
Q1 = col_missing.quantile(0.25)
Q3 = col_missing.quantile(0.75)
IQR = Q3 - Q1
print(f'Median: {median}')
print(f'Interquartile range: {IQR}')

Median: 0.0
Interquartile range: 0.0


No missing values in either outcome column. The smiles column was excluded from this check.
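
The cell above drops `smiles` before measuring missingness. A minimal sketch of a per-column check that also covers the text column is shown here. The `toy` frame and its values are made up for illustration and are not ClinTox data; on the real data the same two expressions could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the ClinTox frame (values are invented).
toy = pd.DataFrame({
    "smiles": ["CCO", None, "c1ccccc1", ""],
    "FDA_APPROVED": [1, 1, 0, 1],
    "CT_TOX": [0, 0, 1, np.nan],
})

# Fraction of missing values per column, smiles included.
col_missing = toy.isna().mean()

# Empty strings are not NaN, so count blank or missing SMILES separately.
empty_smiles = int((toy["smiles"].fillna("").str.strip() == "").sum())

print(col_missing)
print(f"Blank or missing SMILES: {empty_smiles}")
```

Counting blank strings separately matters because `isna()` does not treat `""` as missing.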

## Pairs of outcomes

Are there drugs that were approved but were also judged to be toxic?

In [8]:
# Get unique pair counts
pair_counts = df.groupby(['FDA_APPROVED', 'CT_TOX']).size().reset_index(name='count')

print(pair_counts)

   FDA_APPROVED  CT_TOX  count
0             0       1     94
1             1       0   1372
2             1       1     18
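
`groupby(...).size()` lists only the combinations that occur, so the absent pair (FDA_APPROVED = 0, CT_TOX = 0) is easy to miss. As a rough sketch, the counts printed above can be rebuilt into a 2x2 table with an explicit zero. The frame below is typed in by hand from that output rather than read from the CSV, and the names `table` and `toxic_share_approved` are illustrative.

```python
import pandas as pd

# Pair counts copied from the output above.
pair_counts = pd.DataFrame({
    "FDA_APPROVED": [0, 1, 1],
    "CT_TOX": [1, 0, 1],
    "count": [94, 1372, 18],
})

# Pivot into a 2x2 table; combinations that never occur become 0.
table = (
    pair_counts
    .pivot(index="FDA_APPROVED", columns="CT_TOX", values="count")
    .fillna(0)
    .astype(int)
)

# Share of approved drugs that are also flagged as toxic.
toxic_share_approved = table.loc[1, 1] / table.loc[1].sum()

print(table)
print(f"Toxic share among approved drugs: {toxic_share_approved:.4f}")
```

The table total equals the 1484 rows reported by `df.shape`, so every drug falls into one of the three observed combinations.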


I want to check which drugs were approved and toxic.

In [9]:
# Return drugs both toxic and approved
approved_toxic = df[(df['FDA_APPROVED'] == 1) & (df['CT_TOX'] == 1)]

print(approved_toxic)

                                                 smiles  FDA_APPROVED  CT_TOX
178        C1=CC(=CC=C1C#N)C(C2=CC=C(C=C2)C#N)N3C=NC=N3             1       1
303                    C1=CC=C(C=C1)NC(=O)CCCCCCC(=O)NO             1       1
346   C1=CN(C(=O)N=C1N)[C@H]2[C@H]([C@@H]([C@H](O2)C...             1       1
347   C1=CN(C(=O)N=C1N)[C@H]2C([C@@H]([C@H](O2)CO)O)...             1       1
355                           C1CN(P(=O)(OC1)NCCCl)CCCl             1       1
364                           C1CNP(=O)(OC1)N(CCCl)CCCl             1       1
425   C[C@@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@@]4([C@...             1       1
454   C[C@]12C[C@@H]([C@H]3[C@H]([C@@H]1CC[C@@]2(C(=...             1       1
456   C[C@]12CC(=O)[C@H]3[C@H]([C@@H]1CC[C@@]2(C(=O)...             1       1
484   C[C@]12CCC(=O)C=C1CC[C@@H]3[C@@H]2[C@H](C[C@]4...             1       1
670   CC(C)(C)[C@@H](C(=O)N[C@@H](CC1=CC=CC=C1)[C@H]...             1       1
683   CC(C)(C)C(=O)OCOP(=O)(COCCN1C=NC2=C1N=CN=C2N)O...         

For example, C1=CC(=CC=C1C#N)C(C2=CC=C(C=C2)C#N)N3C=NC=N3 (row 178) is letrozole, an aromatase inhibitor used to treat breast cancer.
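
Several SMILES strings in the printed output are cut off with `...`, which makes the drugs hard to identify. One possible way around this, sketched on a made-up frame, is to lift the display limits temporarily with `pd.option_context`. The `toy` frame and its 60-carbon chain are hypothetical and serve only to trigger truncation.

```python
import pandas as pd

# Hypothetical frame with one SMILES longer than pandas' default column width.
toy = pd.DataFrame({
    "smiles": ["C" * 60, "CCO"],
    "FDA_APPROVED": [1, 1],
    "CT_TOX": [1, 1],
})

# Default rendering shortens long strings and appends "...".
truncated_text = repr(toy)

# Temporarily remove the column-width and row-count limits.
with pd.option_context("display.max_colwidth", None, "display.max_rows", None):
    full_text = repr(toy)

print(full_text)
```

Inside the `with` block, `print(approved_toxic)` would be rendered with the same relaxed limits, and the defaults are restored afterwards.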