# Task

Your task is focused on data processing, visualisation and basic statistical analysis. You will be asked to:
 1. [basic] download publicly available data of chemical compounds
 1. [basic] do basic cleaning/processing
 1. [basic] visualise and compute correlation between experimental and heuristic features
 1. [intermediate] use kernel techniques to visualise the data in 2D
 1. [advanced] make the approach more efficient so it can run in a few seconds

## General rules

Use only basic python libraries, i.e.: `matplotlib, sklearn, numpy, scipy, csv, pandas, seaborn`.

You can write your solution either directly in colab or in your favourite IDE. Once you are finished, copy it to colab and make sure it runs with the standard free kernel.

## [Task 1] [Basic] Gather data

Your first task is to navigate to the ChEMBL database (https://www.ebi.ac.uk/chembl/), one of the biggest databases of bioactive molecules, and find an **assay** from AstraZeneca entitled "*Octan-1-ol/water (pH7.4) distribution coefficent measured by a shake flask method*". It is the result of an experiment measuring said property (often called "logD") for 4200 compounds.

Download this data in .csv format (by clicking the "ChEMBL Activity Types for Assay" button in the Bioactivity section of the assay), as well as the data for "Associated Compounds for Assay", which you will find on the assay's page in the section called "Compound Summary". Let's call these files "astrazeneca.csv" and "compounds.csv" respectively.

If you are using colab, you can upload these files to your kernel by

```
from google.colab import files
files.upload()

with open("astrazeneca.csv") as fh:
  # process your file
  pass
```
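
Once the files are available, they can be read into pandas. The following is only a minimal sketch: `load_chembl_csv` is a hypothetical helper name, and the semicolon default is an assumption about how the export is delimited, so check the first line of your own file. A small in-memory sample stands in for the real export so the snippet runs without the downloaded files.

```python
import io

import pandas as pd


def load_chembl_csv(source, sep=";"):
    """Read an exported table into a DataFrame (hypothetical helper).

    `source` may be a path such as "astrazeneca.csv" or an open file object.
    The ";" default is an assumption; pass sep="," if your file uses commas.
    """
    return pd.read_csv(source, sep=sep)


# Tiny in-memory stand-in for a real export (made-up values).
sample = io.StringIO('"Smiles";"Standard Value"\n"CCO";"1.5"\n"c1ccccc1";"2.1"\n')
demo = load_chembl_csv(sample)
print(demo.head())
```

With the real files, the calls would look like `load_chembl_csv("astrazeneca.csv")` and `load_chembl_csv("compounds.csv")`.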

## [Task 2] [Basic] Data processing

Merge the two datasets, using 'Smiles' as a key.
Keep all the data from 'compounds.csv' apart from 'CX Basic pKa' and 'CX Acidic pKa', but only extract 'Standard Value' from each entry in 'astrazeneca.csv' and add it as "Experimental Value" in your resulting dataset.
If you encounter multiple entries with the same Smiles, average the corresponding experimental values. For every entry that can be converted to a float, convert it; keep the remaining entries as strings.

Investigate your dataset: are there any entries that are missing some features? Remove compounds that are missing any numerical values.

After this process you should have removed 11 molecules, and thus end up with a dataset of 4,187 compounds.

If you were to sort it alphabetically by the key and print the first element, you should see something along the lines of

```
{'ChEMBL ID': 'CHEMBL108667',
 'Name': '',
 'Synonyms': '',
 'Type': 'Small molecule',
 'Max Phase': 0.0,
 'Molecular Weight': 454.21,
 'Targets': 6.0,
 'Bioactivities': 7.0,
 'AlogP': 4.37,
 'Polar Surface Area': 24.5,
 'HBA': 3.0,
 'HBD': 1.0,
 '#RO5 Violations': 0.0,
 '#Rotatable Bonds': 6.0,
 'Passes Ro3': 'N',
 'QED Weighted': 0.7,
 'CX LogP': 4.65,
 'CX LogD': 2.83,
 'Aromatic Rings': 2.0,
 'Structure Type': 'MOL',
 'Inorganic Flag': -1.0,
 'Heavy Atoms': 24.0,
 'HBA (Lipinski)': 3.0,
 'HBD (Lipinski)': 1.0,
 '#RO5 Violations (Lipinski)': 0.0,
 'Molecular Weight (Monoisotopic)': 452.0099,
 'Molecular Species': 'BASE',
 'Molecular Formula': 'C19H22Br2N2O',
 'Smiles': 'Brc1cc(Br)cc(COC[C@H](c2ccccc2)N2CCNCC2)c1',
 'Experimental Value': 2.8}
 ```
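
The merging and cleaning steps of this task can be sketched with pandas on toy data. This is one possible reading of the instructions, not a reference solution: the two small DataFrames and their values are made up, and the float conversion is done per column (a column is converted only if all its non-missing entries parse as numbers), which is an assumption that simplifies the per-entry rule described above. Note also that pandas represents empty fields as `NaN` rather than as the empty strings shown in the sample record.

```python
import pandas as pd

# Toy stand-ins for the two exports; the real files have many more columns.
compounds = pd.DataFrame({
    "Smiles": ["CCO", "c1ccccc1", "CCN"],
    "AlogP": ["-0.03", "1.69", None],
    "CX Basic pKa": ["", "", "10.2"],
    "CX Acidic pKa": ["", "", ""],
    "Type": ["Small molecule"] * 3,
})
activities = pd.DataFrame({
    "Smiles": ["CCO", "CCO", "c1ccccc1", "CCN"],
    "Standard Value": [-0.2, -0.4, 2.0, 0.1],
})

# Average repeated measurements of the same molecule.
experimental = (
    activities.groupby("Smiles", as_index=False)["Standard Value"]
    .mean()
    .rename(columns={"Standard Value": "Experimental Value"})
)

merged = (
    compounds.drop(columns=["CX Basic pKa", "CX Acidic pKa"])
    .drop_duplicates(subset="Smiles")
    .merge(experimental, on="Smiles", how="inner")
)

# Convert a column to float only if every non-missing entry parses as a number.
for col in merged.columns:
    as_number = pd.to_numeric(merged[col], errors="coerce")
    if as_number.notna().any() and as_number.notna().sum() == merged[col].notna().sum():
        merged[col] = as_number

# Drop compounds that miss any numeric value, then sort by the key.
numeric_cols = merged.select_dtypes(include="number").columns
cleaned = merged.dropna(subset=numeric_cols).sort_values("Smiles").reset_index(drop=True)
print(cleaned.iloc[0].to_dict())
```

In this toy example the compound `CCN` is removed because its `AlogP` is missing, and the two measurements for `CCO` are averaged.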

## [Task 3] [Basic] Visualisation and correlation analysis

All the values coming from our compounds.csv file are either basic properties of the molecule, or come from some heuristic calculations. We want to analyse how related they are to the experimentally measured values.

Plot the pairwise correlations between every numeric property present in the data and the "Experimental Value". Additionally, compute the Pearson correlation coefficient.

**QUESTION: Which of the features has the strongest positive correlation? What about the strongest negative correlation?**
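
As a rough illustration of the mechanics only, the Pearson coefficients against one target column can be computed with `DataFrame.corrwith`. The data below is synthetic and the `feature_*` column names are hypothetical, so the output says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature tracks the target, one mirrors it, one is noise.
rng = np.random.default_rng(0)
n = 200
target = rng.normal(size=n)
toy = pd.DataFrame({
    "Experimental Value": target,
    "feature_up": target + 0.1 * rng.normal(size=n),
    "feature_down": -target + 0.1 * rng.normal(size=n),
    "feature_noise": rng.normal(size=n),
    "Type": ["Small molecule"] * n,  # non-numeric column, excluded below
})

numeric = toy.select_dtypes(include="number")
pearson = (
    numeric.drop(columns="Experimental Value")
    .corrwith(numeric["Experimental Value"])  # Pearson is the default method
    .sort_values()
)
print(pearson)
print("strongest positive:", pearson.idxmax())
print("strongest negative:", pearson.idxmin())
```

For the plots themselves, one scatter plot per numeric column against "Experimental Value" (for instance with `matplotlib` or `seaborn`) is one reasonable way to read "pairwise" here.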

## [Task 4] [Intermediate] 2D data visualisation with kernel methods

There are many ways in which we could try to visualise our dataset of molecules. We will focus on Kernel PCA, which we will apply directly to the string representation of the molecule (Smiles).

Let us define a kernel for the substrings of length $m$ as

$$
K_m(s_1, s_2) = \sum_{w: |w|=m} \left [ count(s_1, w) \cdot count(s_2, w) \right ]
$$

where $w$ is any word of length $m$, and $count(s, w)$ checks how many times word $w$ occurs in $s$ as a consecutive substring with overlaps (note that `s.count(w)` ignores overlaps!). For example, for

$$
s=\texttt{Brc1cc(Br)cc(COC[C@H](c2ccccc2)N2CCNCC2)c1}
$$

we have
$$
count(s, \texttt{C}) = 7\;\; count(s, \texttt{cc}) =6\;\;count(s, \texttt{ccc}) = 3
$$

As another example,

$$
K_1(\texttt{tree}, \texttt{apple}) = \sum_{w \in \{ \texttt{t,r,e}\} } count(\texttt{tree}, w) \cdot count( \texttt{apple}, w) = 1 \cdot 0 + 1 \cdot 0 + 2 \cdot 1 = 2
$$
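
The definitions above can be checked with a few lines of plain Python. This is a minimal sketch; the function names `count_overlapping`, `substring_counts` and `kernel` are illustrative choices, not part of the task.

```python
from collections import Counter


def count_overlapping(s, w):
    """Number of occurrences of w in s, counting overlapping matches."""
    m = len(w)
    return sum(1 for i in range(len(s) - m + 1) if s[i:i + m] == w)


def substring_counts(s, m):
    """Counts of every consecutive substring of length m in s."""
    return Counter(s[i:i + m] for i in range(len(s) - m + 1))


def kernel(s1, s2, m):
    """K_m(s1, s2): only words that occur in s1 can contribute to the sum."""
    c1, c2 = substring_counts(s1, m), substring_counts(s2, m)
    return sum(n * c2[w] for w, n in c1.items())


s = "Brc1cc(Br)cc(COC[C@H](c2ccccc2)N2CCNCC2)c1"
print(count_overlapping(s, "C"), count_overlapping(s, "cc"), count_overlapping(s, "ccc"))  # 7 6 3
print("aaa".count("aa"), count_overlapping("aaa", "aa"))  # 1 2
print(kernel("tree", "apple", 1))  # 2
```

Summing over the words that actually occur in $s_1$ gives the same value as summing over all words of length $m$, because every other term is zero.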

Using this definition of the kernel, compute the Kernel (Gram) matrix $G$, such that

$$
G^m_{ij} = K_m(s_i, s_j)
$$

where $s_i$ is $i$th Smiles in the lexicographic order, for $m=1$ and $m=2$.
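
A direct, unoptimised way to assemble such a matrix is sketched below on three toy strings. The helper names are hypothetical, and the sketch assumes that Python's built-in `sorted` ordering is the intended lexicographic order.

```python
from collections import Counter

import numpy as np


def substring_counts(s, m):
    """Counts of every consecutive substring of length m in s."""
    return Counter(s[i:i + m] for i in range(len(s) - m + 1))


def gram_matrix(strings, m):
    """Gram matrix G[i, j] = K_m(s_i, s_j) over the sorted input strings."""
    ordered = sorted(strings)
    counts = [substring_counts(s, m) for s in ordered]
    n = len(ordered)
    G = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            value = sum(k * counts[j][w] for w, k in counts[i].items())
            G[i, j] = G[j, i] = value  # the kernel is symmetric
    return G


toy = ["tree", "apple", "pear"]
G1 = gram_matrix(toy, 1)  # rows and columns follow sorted(toy): apple, pear, tree
print(G1)
```

The diagonal entries are the kernel of each string with itself, for example $K_1(\texttt{tree}, \texttt{tree}) = 1 + 1 + 4 = 6$.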

After doing so, you can use scikit-learn KernelPCA to create a 2D embedding

```
from sklearn.decomposition import KernelPCA
# n_components=2 keeps only the first two components, giving a 2D embedding
embedding = KernelPCA(n_components=2, kernel='precomputed').fit_transform(G)
```

Use this embedding to plot your dataset, and use each of the features to colour your points. **Which of the features can be seen to be highly correlated with this embedding space?**
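
One way to produce such a coloured plot is sketched below. `plot_embedding` is a hypothetical helper, and the Gram matrix here is built from random toy counts purely so the snippet runs on its own; with the real data you would pass your own $G$ and one feature column at a time.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import KernelPCA


def plot_embedding(G, values, label):
    """Embed a precomputed Gram matrix in 2D and colour points by `values`."""
    embedding = KernelPCA(n_components=2, kernel="precomputed").fit_transform(G)
    fig, ax = plt.subplots(figsize=(5, 4))
    points = ax.scatter(embedding[:, 0], embedding[:, 1], c=values, cmap="viridis", s=8)
    fig.colorbar(points, ax=ax, label=label)
    ax.set_xlabel("component 1")
    ax.set_ylabel("component 2")
    return embedding, fig


# Toy stand-in: random count features and their linear-kernel Gram matrix.
rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(50, 6)).astype(float)
G = X @ X.T
embedding, fig = plot_embedding(G, X.sum(axis=1), "toy feature")
```

Calling `plt.show()` afterwards displays the figure in a notebook; repeating the call for each numeric feature makes it easy to compare the colourings side by side.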

## [Task 5] [Advanced] Scaling up

In total, there are 33 symbols used in Smiles strings. Consequently there are $33^m$ words of length $m$.
We have around 4,000 compounds and thus around 16,000,000 entries in the kernel matrix to be computed.
Even if we were to assume that the count() function is constant time, a naive algorithm of the form

    kernel = np.zeros((len(smiles), len(smiles)))
    for i1, s1 in enumerate(smiles):
      for i2, s2 in enumerate(smiles):
        score = 0
        for w in words_generator(m):
          score += count(s1, w) * count(s2, w)
        kernel[i1, i2] = score

will take a prohibitive amount of time as $m$ grows:
