## Data
- Let's look at the dataset, preprocess it, and save the preprocessed version

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers


In [None]:
import re
import pandas as pd

from transformers import AutoTokenizer
from sklearn.model_selection import train_test_split

In [None]:
# dataset comes from here: https://github.com/theochem/B3DB/blob/main/README.md

df = pd.read_csv("https://staicentreprod001.blob.core.windows.net/share/mlprague23/B3DB_classification.tsv", sep="\t")
df.head()

In [None]:
df.shape

In [None]:
# for readability
df = df.rename(columns={"BBB+/BBB-": "label", "compound_name": "name"})
# map the string classes to integers so the label column gets a numeric dtype
df["label"] = df["label"].map({"BBB+": 1, "BBB-": 0})

df.head(20)

### Do the molecule names need to be cleaned?

In [None]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")

In [None]:
mol_ids = tokenizer.encode('bbcpd11 (cimetidine analog) (y-g13)')
print('mol_ids:', mol_ids)

print('mol_tokens:', tokenizer.convert_ids_to_tokens(mol_ids))

- Notice that a subword piece that continues a word starts with "##" to indicate that it attaches to the preceding piece
- The [CLS] and [SEP] tokens are added automatically
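
To make the "##" convention concrete, here is a minimal sketch of greedy longest-match-first subword splitting, the matching strategy WordPiece tokenizers use. It is a simplification and not the library's implementation: the `wordpiece` function and `toy_vocab` are invented for this illustration, and the real tokenizer also normalizes text and splits on punctuation first.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this illustration (NOT the PubMedBERT vocabulary).
toy_vocab = {"morphine", "glucuronide", "cim", "##eti", "##dine"}

# Special tokens are added around the pieces, as tokenizer.encode does.
tokens = ["[CLS]"] + wordpiece("cimetidine", toy_vocab) + ["[SEP]"]
print(tokens)
```

A word that is in the vocabulary stays in one piece, a word that is not is broken into a first piece plus "##" continuations, and a word that cannot be covered at all falls back to `[UNK]`.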

In [None]:
mol_ids = tokenizer.encode('morphine-6-glucuronide')
print('mol_ids:', mol_ids)
print('mol_tokens:', tokenizer.convert_ids_to_tokens(mol_ids))

- The tokenizer has *morphine* and *glucuronide* in its vocabulary (each has a matching input id)
- But it doesn't have *bbcpd11* or *cimetidine*, so those are split into subword pieces

In [None]:
mol_ids = tokenizer.encode('33419-42-0')
print('mol_ids:', mol_ids)
print('mol_tokens:', tokenizer.convert_ids_to_tokens(mol_ids))

#### Regex to remove non-alphanumeric characters and convert to lowercase
- see it in action: https://regex101.com/
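
As a quick check of what this cleaning does, the sketch below applies the same pattern to the three names tokenized earlier. The helper name `clean_name` is only for illustration.

```python
import re


def clean_name(name):
    # Drop every run of characters outside A-Z, a-z, 0-9, then lowercase the result.
    return re.sub(r"[^A-Za-z0-9]+", "", str(name)).lower()


examples = [
    "bbcpd11 (cimetidine analog) (y-g13)",
    "morphine-6-glucuronide",
    "33419-42-0",
]
cleaned = [clean_name(n) for n in examples]
for raw, out in zip(examples, cleaned):
    print(f"{raw!r} -> {out!r}")
```

Note that spaces are removed too, so multi-word names collapse into one string, and an identifier such as `33419-42-0` becomes purely numeric.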

In [None]:
df["name"] = df["name"].apply(lambda x: re.sub("[^A-Za-z0-9]+", "", str(x)).lower())
df.head(20)

In [None]:
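# replace molecules whose names consist only of digits with "nan"
# (the trailing $ matters: without it, any name that merely starts with digits would be altered)
df["name"] = df["name"].apply(lambda x: re.sub(r"^[0-9]+$", "nan", str(x)))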
df.head(20)

In [None]:
df[df["name"] == "nan"]

In [None]:
num_nan = (df["name"] == "nan").sum()
print(f"number of molecules with nan name: {num_nan}")

In [None]:
print(f"df shape before nan molecule removal: {df.shape}")
df = df[df["name"] != "nan"]
print(f"df shape after nan molecule removal: {df.shape}")

In [None]:
df.head(20)

In [None]:
df = df.drop_duplicates(subset="name")
df.shape

In [None]:
# make train-test split of data using name column as X data and label as y data
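
One possible way to do this split is sketched below on a toy DataFrame so that it runs on its own. The `test_size`, `random_state`, and `stratify` settings are assumptions chosen for illustration, not values prescribed by this notebook, and the toy names and labels are invented.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned DataFrame; names and labels are invented.
df = pd.DataFrame({
    "name": [f"molecule{i}" for i in range(10)],
    "label": [1, 0] * 5,
})

# test_size, random_state and stratify are illustrative choices.
X_train, X_test, y_train, y_test = train_test_split(
    df["name"].values,
    df["label"].values,
    test_size=0.2,
    random_state=42,
    stratify=df["label"].values,  # keep the class balance similar in both splits
)

print(len(X_train), len(X_test))
```

Passing the labels to `stratify` is optional; it is useful when the two classes are imbalanced, because a purely random split can otherwise shift the class ratio between train and test.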

In [None]:
df.name.values

In [None]:
df_train = pd.DataFrame(data={"name": X_train, "label": y_train})

df_test = pd.DataFrame(data={"name": X_test, "label": y_test})

Mount your Google Drive in Colab so you can write the processed data there

In [None]:
from google.colab import drive

drive.mount('/content/drive', force_remount=True)

In [None]:
# save preprocessed data to your drive 
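
A minimal sketch of the saving step, assuming CSV output. The folder and file names are hypothetical, and the toy frames only stand in for `df_train` and `df_test`. On Colab the output folder would typically sit under the mounted drive (for example somewhere below `/content/drive/MyDrive`); a temporary directory is used here so the sketch runs anywhere.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Toy frames standing in for df_train and df_test; names and labels are invented.
df_train = pd.DataFrame({"name": ["moleculea", "moleculeb"], "label": [0, 1]})
df_test = pd.DataFrame({"name": ["moleculec"], "label": [0]})

# Hypothetical output folder; replace with a folder on your mounted drive.
out_dir = Path(tempfile.mkdtemp()) / "b3db_preprocessed"
out_dir.mkdir(parents=True, exist_ok=True)

# index=False keeps the row index out of the files.
df_train.to_csv(out_dir / "train_names.csv", index=False)
df_test.to_csv(out_dir / "test_names.csv", index=False)

# Read one file back to confirm the round trip.
reloaded = pd.read_csv(out_dir / "train_names.csv")
print(reloaded.shape)
```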


Split and save the dataset with SMILES separately

In [None]:
# repeat what was done above using name column but now for SMILES, and save the data :)
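
Since the same split-and-save steps apply to any text column, one option is a small helper parameterized by column name. This is only a sketch: `split_and_save` is a hypothetical helper, the column name `SMILES` is assumed to match the B3DB file, the split settings and file names are illustrative, and the toy labels are invented.

```python
from pathlib import Path
import tempfile

import pandas as pd
from sklearn.model_selection import train_test_split


def split_and_save(df, column, out_dir, test_size=0.2, seed=42):
    """Split df[column] against df['label'] and write train/test CSV files."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[column].values, df["label"].values, test_size=test_size, random_state=seed
    )
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    df_train = pd.DataFrame({column: X_train, "label": y_train})
    df_test = pd.DataFrame({column: X_test, "label": y_test})
    df_train.to_csv(out_dir / f"train_{column}.csv", index=False)
    df_test.to_csv(out_dir / f"test_{column}.csv", index=False)
    return df_train, df_test


# Toy data: simple SMILES strings with invented labels.
toy = pd.DataFrame({
    "SMILES": ["C", "CC", "CCC", "CCO", "CCCC", "CCCO", "CCN", "CO", "CN", "CCCN"],
    "label": [1, 0] * 5,
})
smiles_dir = Path(tempfile.mkdtemp())  # stand-in for a folder on your drive
smiles_train, smiles_test = split_and_save(toy, "SMILES", smiles_dir)
print(smiles_train.shape, smiles_test.shape)
```

Unlike the names, SMILES strings should probably not go through the lowercasing and character-stripping regex, because case and symbols such as `=`, `(`, and `#` carry chemical meaning in SMILES.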