Merging 5-1 & 5-2 notebooks - to be updated

The following is a short demonstration of how to use the [tokenizers](https://pypi.org/project/tokenizers/) package to split a very small set of ADR (adverse drug reaction) terms (or words) into tokens, encode the ADR terms as token IDs, and finally decode these token IDs back into the corresponding ADR terms.

Below is a short demonstration of using AutoTokenizer from the [transformers](https://pypi.org/project/transformers/) package to tokenise/encode and decode one line of text. A separate notebook (5-1_Tokenizers_to_tokenise_texts.ipynb) has been prepared to show how to use the tokenizers package to tokenise/encode and decode text data or ADR terms.

Reference links: 
* https://huggingface.co/learn/llm-course/chapter2/4?fw=pt#tokenization
* https://huggingface.co/docs/datasets/use_dataset#tokenize-text

Plan
* the initial small goal is to try tokenizer.decode(), which requires building a tokenizer first
* tokenisation appears to be workable through two different packages: tokenizers or transformers
* the idea is that this text tokenisation part can be added to a larger DNN model to decode the ADR outputs later (subject to further changes... may try a small NER classification model first, see 6_NER_tk_inhibitors.ipynb)

* trying HuggingFace's transformers package:
1. set up a tokenizer that will tokenise the ADRs/words
2. apply tokenizer.decode() to each tensor row/sequence (using a list comprehension)
3. use the sample code snippet below to decode the tensors: 
e.g. decoded = [tokenizer.decode(x) for x in adrs_ts]
- the code iterates through each row/sequence of the tensor and applies the decode() method, 
which transforms the numerical IDs back into human-readable text/words
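
The decoding step in the plan can be sketched without any library. This is only an illustration under stated assumptions: the `id_to_token` table and the `decode` helper are hypothetical stand-ins for a trained tokenizer, and plain lists stand in for the tensor rows in `adrs_ts`.

```python
# Hypothetical vocabulary: in practice this comes from a trained tokenizer.
id_to_token = {0: "[PAD]", 1: "headache", 2: "vertigo", 3: "fever"}

def decode(ids, skip=("[PAD]",)):
    """Map a row of token IDs back to a space-separated string, dropping padding."""
    return " ".join(id_to_token[i] for i in ids if id_to_token[i] not in skip)

# Two rows of token IDs standing in for tensor rows (the second row is padded).
adrs_rows = [[1, 2, 3], [3, 1, 0]]

# Same pattern as in the plan: apply decode() to each row via a list comprehension.
decoded = [decode(row) for row in adrs_rows]
print(decoded)
```

With a real tokenizer the helper would be replaced by tokenizer.decode(), and each tensor row may need converting to a list of integers first.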

In [None]:
#from tokenizers.models import WordLevel
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers
import sys, datetime
print(f"Python version used: {sys.version} at {datetime.datetime.now()}")

Python version used: 3.12.7 (main, Oct 16 2024, 09:10:10) [Clang 18.1.8 ] at 2025-05-15 16:24:27.156234


In [2]:
## Sample normalizers code to "normalise" texts
# the first attempt returned the text unchanged: StripAccents only removes combining marks,
# so the text needs to be decomposed with NFD first (base letters separated from their accents)

# from tokenizers.models import BPE, WordLevel, WordPiece
# from tokenizers import Tokenizer, normalizers
# from tokenizers.normalizers import NFD, StripAccents, Sequence, Replace

# BPE - byte pair encoding
# bpe_tokenizer = Tokenizer(BPE())
# print(bpe_tokenizer.normalizer)
# bpe_tokenizer.normalizer = normalizers.Sequence([NFD(), StripAccents()])
# print(bpe_tokenizer.normalizer)

# sentences = ['abdominal_pain', 'Höw aRę ŸõŪ dÔįñg?']

# normalized_sentences = [bpe_tokenizer.normalizer.normalize_str(s) for s in sentences]
# normalized_sentences
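
A likely reason for "same text in, same text out" is that accent stripping works on combining marks, which only exist as separate characters after the text has been decomposed (NFD). The sketch below illustrates that idea with the standard-library `unicodedata` module rather than the tokenizers package; `strip_accents` is a hypothetical helper, and its behaviour is assumed to be close to, not identical with, an NFD + StripAccents normalizer sequence.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose with NFD, then drop non-spacing combining marks (category 'Mn')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

sentences = ["abdominal_pain", "Höw aRę ŸõŪ dÔįñg?"]
normalized_sentences = [strip_accents(s) for s in sentences]
print(normalized_sentences)
```

The ADR term passes through unchanged because it contains no accented characters, so a normalizer of this kind has nothing to do on the ADR data itself.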

In [3]:
# example text data from one of the CYP3A4 substrates - bosentan's ADRs 
# since the ADR data are already more preprocessed than typical raw text, went straight to creating a tokenizer
data = ["abnormal_LFT^^, headache^^, RTI^^, hemoglobin_decreased^^, sperm_count_decreased^^, edema^^, hepatic_cirrhosis(pm), liver_failure(pm), jaundice(pm), syncope^, sinusitis^, nasal_congestion^, sinus_congestion^, rhinitis^, oropharyngeal_pain^, epistaxis^, nasopharyngitis^, idiopathic_pulmonary_fibrosis^, anemia^, hematocrit_decreased^, thrombocytopenia(pm), neutropenia(pm), leukopenia(pm), flushing^, hypotension^, palpitation^, orthostatic_hypotension^, unstable_angina^, hot_flush^, gastroesophageal_reflux_disease^, diarrhea^, pruritus^, erythema^, angioedema(pm), DRESS(pm), rash(pm), dermatitis(pm), arthralgia^, joint_swelling^, blurred_vision^, chest_pain^, peripheral_edema^, influenza_like_illness^, vertigo^, fever^, chest_pain^, hypersensitivity_reaction^, anaphylaxis(pm)"]

#UNK_TOKEN = '[UNK]'
PAD_TOKEN = '[PAD]'

# unknown words and padding are not handled yet (no unk_token is passed to WordLevel)
tokenizer = Tokenizer(models.WordLevel())

# the link below explains how to add special tokens (e.g. an unknown token) to cover different scenarios
# https://huggingface.co/learn/llm-course/chapter6/8?fw=pt#building-a-wordpiece-tokenizer-from-scratch
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordLevelTrainer(vocab_size=100000, special_tokens=special_tokens)

# train the tokenizer
# train_from_iterator() takes an iterator over the data (here map() applies split() to each string,
# giving a list of words per string) and the trainer
tokenizer.train_from_iterator(map(lambda x: x.split(), data), trainer=trainer)

tokenizer.get_vocab()
# returns the vocabulary as a dictionary mapping each token to its ID

{'hot_flush^,': 26,
 'gastroesophageal_reflux_disease^,': 21,
 'fever^,': 19,
 'joint_swelling^,': 32,
 'anemia^,': 10,
 '[CLS]': 2,
 '[MASK]': 4,
 'anaphylaxis(pm)': 9,
 'oropharyngeal_pain^,': 38,
 'rhinitis^,': 44,
 'RTI^^,': 7,
 'hepatic_cirrhosis(pm),': 25,
 'neutropenia(pm),': 37,
 'peripheral_edema^,': 41,
 'headache^^,': 22,
 'angioedema(pm),': 11,
 'thrombocytopenia(pm),': 49,
 'pruritus^,': 42,
 'hypersensitivity_reaction^,': 27,
 'nasopharyngitis^,': 36,
 'diarrhea^,': 15,
 '[PAD]': 1,
 'hypotension^,': 28,
 'orthostatic_hypotension^,': 39,
 'idiopathic_pulmonary_fibrosis^,': 29,
 'dermatitis(pm),': 14,
 'hemoglobin_decreased^^,': 24,
 'influenza_like_illness^,': 30,
 'flushing^,': 20,
 '[UNK]': 0,
 'hematocrit_decreased^,': 23,
 'erythema^,': 18,
 'abnormal_LFT^^,': 8,
 'DRESS(pm),': 6,
 'liver_failure(pm),': 34,
 'chest_pain^,': 5,
 'sinusitis^,': 46,
 'unstable_angina^,': 50,
 'nasal_congestion^,': 35,
 'leukopenia(pm),': 33,
 'syncope^,': 48,
 'vertigo^,': 51,
 'sinus_co

In [4]:
# str.split() only splits on whitespace, so punctuation such as commas stays attached to each term
for t in data:
    print(t.split())

['abnormal_LFT^^,', 'headache^^,', 'RTI^^,', 'hemoglobin_decreased^^,', 'sperm_count_decreased^^,', 'edema^^,', 'hepatic_cirrhosis(pm),', 'liver_failure(pm),', 'jaundice(pm),', 'syncope^,', 'sinusitis^,', 'nasal_congestion^,', 'sinus_congestion^,', 'rhinitis^,', 'oropharyngeal_pain^,', 'epistaxis^,', 'nasopharyngitis^,', 'idiopathic_pulmonary_fibrosis^,', 'anemia^,', 'hematocrit_decreased^,', 'thrombocytopenia(pm),', 'neutropenia(pm),', 'leukopenia(pm),', 'flushing^,', 'hypotension^,', 'palpitation^,', 'orthostatic_hypotension^,', 'unstable_angina^,', 'hot_flush^,', 'gastroesophageal_reflux_disease^,', 'diarrhea^,', 'pruritus^,', 'erythema^,', 'angioedema(pm),', 'DRESS(pm),', 'rash(pm),', 'dermatitis(pm),', 'arthralgia^,', 'joint_swelling^,', 'blurred_vision^,', 'chest_pain^,', 'peripheral_edema^,', 'influenza_like_illness^,', 'vertigo^,', 'fever^,', 'chest_pain^,', 'hypersensitivity_reaction^,', 'anaphylaxis(pm)']
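
Since whitespace splitting keeps the commas and trailing markers attached, one option is to split on commas and separate the markers before building a vocabulary. The sketch below is a minimal illustration using only the standard library; the shortened `raw` string and the idea of keeping the marker as a separate label are assumptions, not part of the notebook's pipeline.

```python
import re

# A shortened sample in the same format as the data above.
raw = "abnormal_LFT^^, headache^^, hepatic_cirrhosis(pm), syncope^, anaphylaxis(pm)"

# Trailing markers as they appear in the data: ^^, ^ or (pm).
marker = re.compile(r"(\^+|\(pm\))$")

terms = []
for item in raw.split(","):
    item = item.strip()                       # drop the space left after each comma
    match = marker.search(item)
    label = match.group(1) if match else ""   # keep the marker as a separate label
    terms.append((marker.sub("", item), label))

print(terms)
```

Because the pattern is anchored to the end of each item, parentheses inside a term are left alone and only a trailing marker is removed.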


In [5]:
# the Whitespace pre-tokenizer splits on whitespace and separates punctuation from words (it does not remove it),
# so each word and each run of punctuation becomes its own piece, returned with its character offsets
pre_tokenizer = pre_tokenizers.Whitespace()
split_data = [pre_tokenizer.pre_tokenize_str(t) for t in data]
split_data

[[('abnormal_LFT', (0, 12)),
  ('^^,', (12, 15)),
  ('headache', (16, 24)),
  ('^^,', (24, 27)),
  ('RTI', (28, 31)),
  ('^^,', (31, 34)),
  ('hemoglobin_decreased', (35, 55)),
  ('^^,', (55, 58)),
  ('sperm_count_decreased', (59, 80)),
  ('^^,', (80, 83)),
  ('edema', (84, 89)),
  ('^^,', (89, 92)),
  ('hepatic_cirrhosis', (93, 110)),
  ('(', (110, 111)),
  ('pm', (111, 113)),
  ('),', (113, 115)),
  ('liver_failure', (116, 129)),
  ('(', (129, 130)),
  ('pm', (130, 132)),
  ('),', (132, 134)),
  ('jaundice', (135, 143)),
  ('(', (143, 144)),
  ('pm', (144, 146)),
  ('),', (146, 148)),
  ('syncope', (149, 156)),
  ('^,', (156, 158)),
  ('sinusitis', (159, 168)),
  ('^,', (168, 170)),
  ('nasal_congestion', (171, 187)),
  ('^,', (187, 189)),
  ('sinus_congestion', (190, 206)),
  ('^,', (206, 208)),
  ('rhinitis', (209, 217)),
  ('^,', (217, 219)),
  ('oropharyngeal_pain', (220, 238)),
  ('^,', (238, 240)),
  ('epistaxis', (241, 250)),
  ('^,', (250, 252)),
  ('nasopharyngitis', (253, 2

In [6]:
for i in range(10):
    print(f'ID: {i}, token: {tokenizer.id_to_token(i)}')

ID: 0, token: [UNK]
ID: 1, token: [PAD]
ID: 2, token: [CLS]
ID: 3, token: [SEP]
ID: 4, token: [MASK]
ID: 5, token: chest_pain^,
ID: 6, token: DRESS(pm),
ID: 7, token: RTI^^,
ID: 8, token: abnormal_LFT^^,
ID: 9, token: anaphylaxis(pm)
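
The IDs printed above look consistent with the special tokens taking the first IDs in the order given, followed by the words sorted by descending frequency (`chest_pain^,` occurs twice in the data) and then by plain string order. This is an inference from the output, not a documented guarantee of WordLevelTrainer. A minimal sketch of that assumed ordering, on a made-up shortened sample:

```python
from collections import Counter

special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
sample = "headache^^, chest_pain^, vertigo^, chest_pain^, DRESS(pm)"

counts = Counter(sample.split())
# Assumed ordering: most frequent first, ties broken by plain string order
# (uppercase letters sort before lowercase ones).
ordered = sorted(counts, key=lambda w: (-counts[w], w))

# Special tokens get the first IDs, then the ordered words.
vocab = {tok: i for i, tok in enumerate(special_tokens + ordered)}
for tok, idx in vocab.items():
    print(idx, tok)
```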


In [7]:
# vocabulary size: number of unique tokens (words), including the 5 special tokens
tokenizer.get_vocab_size()

52

In [8]:
# Enable padding
# pad_id defaults to 0, which is [UNK] in this vocabulary, so pass the ID of [PAD] explicitly
tokenizer.enable_padding(pad_id=tokenizer.token_to_id(PAD_TOKEN), pad_token=PAD_TOKEN)

In [9]:
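# the second positional argument of encode() is treated as the paired sequence, so the two terms are encoded together
output = tokenizer.encode('vertigo^,', 'chest_pain^,')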
print(output.ids)

[51, 5]


In [10]:
tokenizer.decode([51, 5])

'vertigo^, chest_pain^,'

In [None]:
# import pandas as pd
# import torch
# import torch.nn as nn
# import torch.nn.functional as F
# from torch.nn.functional import one_hot
# from torch.utils.data import TensorDataset, DataLoader
# import numpy as np
# import datamol as dm
# import rdkit
# from rdkit import Chem
# from rdkit.Chem import rdFingerprintGenerator
# import useful_rdkit_utils as uru
# from matplotlib import pyplot as plt

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import sys, datetime

# print(f"Pandas version used: {pd.__version__}")
# print(f"PyTorch version used: {torch.__version__}")
# print(f"NumPy version used: {np.__version__}")
#print(f"RDKit version used: {rdkit.__version__}")
print(f"Python version used: {sys.version} at {datetime.datetime.now()}")

Python version used: 3.12.7 (main, Oct 16 2024, 09:10:10) [Clang 18.1.8 ] at 2025-05-20 14:24:19.085140


In [None]:
# PyTorch example re. saving & reloading tensors
# t = torch.tensor([1., 2.])
# torch.save(t, 'tensor.pt')
# ts = torch.load('tensor.pt')
# ts


# Load adrs tensors from 2_ADR_regressor.ipynb after it's saved (from 2_ADR_regressor_save_tensors.ipynb)
# adrs_ts = torch.load("adr_train_tensors.pt")
# adrs_ts

In [None]:
# note: some pre-trained models are freely available, while others may be gated 
# (possibly still free, but may require signing up for a HF account)
# the BERT base model (cased, i.e. case-sensitive) is used here - https://huggingface.co/google-bert/bert-base-cased
# "uncased" version - https://huggingface.co/google-bert/bert-base-uncased

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "abnormal_LFT^^, headache^^, RTI^^, hemoglobin_decreased^^, sperm_count_decreased^^, edema^^, hepatic_cirrhosis(pm), " \
"liver_failure(pm), jaundice(pm), syncope^, sinusitis^, nasal_congestion^, sinus_congestion^, rhinitis^, oropharyngeal_pain^, " \
"epistaxis^, nasopharyngitis^, idiopathic_pulmonary_fibrosis^, anemia^, hematocrit_decreased^, thrombocytopenia(pm), neutropenia(pm), " \
"leukopenia(pm), flushing^, hypotension^, palpitation^, orthostatic_hypotension^, unstable_angina^, hot_flush^, " \
"gastroesophageal_reflux_disease^, diarrhea^, pruritus^, erythema^, angioedema(pm), DRESS(pm), rash(pm), dermatitis(pm), " \
"arthralgia^, joint_swelling^, blurred_vision^, chest_pain^, peripheral_edema^, influenza_like_illness^, vertigo^, fever^, " \
"chest_pain^, hypersensitivity_reaction^, anaphylaxis(pm)"

tokens = tokenizer.tokenize(sequence)
print(tokens)

['abnormal', '_', 'L', '##FT', '^', '^', ',', 'headache', '^', '^', ',', 'R', '##TI', '^', '^', ',', 'hem', '##og', '##lo', '##bin', '_', 'decreased', '^', '^', ',', 'sperm', '_', 'count', '_', 'decreased', '^', '^', ',', 'ed', '##ema', '^', '^', ',', 'he', '##pa', '##tic', '_', 'c', '##ir', '##r', '##hos', '##is', '(', 'pm', ')', ',', 'liver', '_', 'failure', '(', 'pm', ')', ',', 'j', '##au', '##ndi', '##ce', '(', 'pm', ')', ',', 's', '##ync', '##ope', '^', ',', 'sin', '##us', '##itis', '^', ',', 'nasal', '_', 'congestion', '^', ',', 'sin', '##us', '_', 'congestion', '^', ',', 'r', '##hin', '##itis', '^', ',', 'or', '##op', '##har', '##yn', '##ge', '##al', '_', 'pain', '^', ',', 'e', '##pis', '##ta', '##xi', '##s', '^', ',', 'na', '##so', '##pha', '##ryn', '##git', '##is', '^', ',', 'id', '##io', '##pathic', '_', 'pulmonary', '_', 'fi', '##bro', '##sis', '^', ',', 'an', '##emia', '^', ',', 'hem', '##ato', '##c', '##rit', '_', 'decreased', '^', ',', 'th', '##rom', '##bo', '##cy', '##to

In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[22832, 168, 149, 26321, 167, 167, 117, 16320, 167, 167, 117, 155, 21669, 167, 167, 117, 23123, 8032, 2858, 7939, 168, 10558, 167, 167, 117, 20479, 168, 5099, 168, 10558, 167, 167, 117, 5048, 14494, 167, 167, 117, 1119, 4163, 2941, 168, 172, 3161, 1197, 15342, 1548, 113, 9852, 114, 117, 11911, 168, 4290, 113, 9852, 114, 117, 179, 3984, 12090, 2093, 113, 9852, 114, 117, 188, 27250, 15622, 167, 117, 11850, 1361, 10721, 167, 117, 21447, 168, 22860, 167, 117, 11850, 1361, 168, 22860, 167, 117, 187, 8265, 10721, 167, 117, 1137, 4184, 7111, 5730, 2176, 1348, 168, 2489, 167, 117, 174, 19093, 1777, 8745, 1116, 167, 117, 9468, 7301, 20695, 15023, 24632, 1548, 167, 117, 25021, 2660, 21745, 168, 26600, 168, 20497, 12725, 4863, 167, 117, 1126, 20504, 167, 117, 23123, 10024, 1665, 7729, 168, 10558, 167, 117, 24438, 16071, 4043, 3457, 9870, 23179, 113, 9852, 114, 117, 24928, 3818, 12736, 23179, 113, 9852, 114, 117, 5837, 7563, 15622, 5813, 113, 9852, 114, 117, 14991, 1158, 167, 117, 177, 1183, 11439

In [None]:
# convert_tokens_to_string() - merges sub-word tokens into complete words
adrs_words = tokenizer.convert_tokens_to_string(tokens)
adrs_words

'abnormal _ LFT ^ ^, headache ^ ^, RTI ^ ^, hemoglobin _ decreased ^ ^, sperm _ count _ decreased ^ ^, edema ^ ^, hepatic _ cirrhosis ( pm ), liver _ failure ( pm ), jaundice ( pm ), syncope ^, sinusitis ^, nasal _ congestion ^, sinus _ congestion ^, rhinitis ^, oropharyngeal _ pain ^, epistaxis ^, nasopharyngitis ^, idiopathic _ pulmonary _ fibrosis ^, anemia ^, hematocrit _ decreased ^, thrombocytopenia ( pm ), neutropenia ( pm ), leukopenia ( pm ), flushing ^, hypotension ^, palpitation ^, orthostatic _ hypotension ^, unstable _ angina ^, hot _ flush ^, gastroesophageal _ reflux _ disease ^, diarrhea ^, pruritus ^, erythema ^, angioedema ( pm ), DRESS ( pm ), rash ( pm ), dermatitis ( pm ), arthralgia ^, joint _ swelling ^, blurred _ vision ^, chest _ pain ^, peripheral _ edema ^, influenza _ like _ illness ^, vertigo ^, fever ^, chest _ pain ^, hypersensitivity _ reaction ^, anaphylaxis ( pm )'
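
The merging of sub-word pieces seen above can be approximated in a few lines: pieces starting with `##` continue the previous piece, so they are attached without a space. This is a simplified sketch, not the library's implementation; `merge_wordpieces` is a hypothetical helper and it skips the spacing clean-up around punctuation that the real tokenizer output shows.

```python
def merge_wordpieces(tokens):
    """Join tokens with spaces, attaching '##' continuation pieces to the previous token."""
    return " ".join(tokens).replace(" ##", "")

pieces = ["abnormal", "_", "L", "##FT", "^", "^", ",", "hem", "##og", "##lo", "##bin"]
merged = merge_wordpieces(pieces)
print(merged)
```

Note that the underscores, which BERT's tokenizer treats as separate punctuation tokens, are not re-attached, so `abnormal_LFT` comes back as `abnormal _ LFT` rather than the original term.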

In [None]:
# convert_ids_to_tokens() - converts numerical IDs back into the corresponding token strings
token_words = tokenizer.convert_ids_to_tokens(ids)
print(token_words)

['abnormal', '_', 'L', '##FT', '^', '^', ',', 'headache', '^', '^', ',', 'R', '##TI', '^', '^', ',', 'hem', '##og', '##lo', '##bin', '_', 'decreased', '^', '^', ',', 'sperm', '_', 'count', '_', 'decreased', '^', '^', ',', 'ed', '##ema', '^', '^', ',', 'he', '##pa', '##tic', '_', 'c', '##ir', '##r', '##hos', '##is', '(', 'pm', ')', ',', 'liver', '_', 'failure', '(', 'pm', ')', ',', 'j', '##au', '##ndi', '##ce', '(', 'pm', ')', ',', 's', '##ync', '##ope', '^', ',', 'sin', '##us', '##itis', '^', ',', 'nasal', '_', 'congestion', '^', ',', 'sin', '##us', '_', 'congestion', '^', ',', 'r', '##hin', '##itis', '^', ',', 'or', '##op', '##har', '##yn', '##ge', '##al', '_', 'pain', '^', ',', 'e', '##pis', '##ta', '##xi', '##s', '^', ',', 'na', '##so', '##pha', '##ryn', '##git', '##is', '^', ',', 'id', '##io', '##pathic', '_', 'pulmonary', '_', 'fi', '##bro', '##sis', '^', ',', 'an', '##emia', '^', ',', 'hem', '##ato', '##c', '##rit', '_', 'decreased', '^', ',', 'th', '##rom', '##bo', '##cy', '##to

In [None]:
# example to obtain ADR terms from vocabulary indices
adrs_terms = tokenizer.decode([22832, 168, 149, 26321, 167, 167, 117, 16320, 167, 167])
print(adrs_terms)

abnormal _ LFT ^ ^, headache ^ ^


In [None]:
# Try converting the token IDs into torch tensors so they can be used in a PyTorch model later
# transformers models expect batched input (a 2-D tensor: batch size x sequence length), so a batch dimension and/or padding is likely needed later 
# this applies whether there is a single string sequence or several 

In [None]:
# API for PreTrainedTokenizerBase class re. parameter on return_tensors 
# https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.__call__.return_tensors
tokenised_inputs = tokenizer(sequence, return_tensors="pt")
tokenised_inputs
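# output contains "input_ids" tensors, "token_type_ids" tensors & "attention_mask" tensors
# input_ids now starts with 101 ([CLS]) because calling the tokenizer directly adds the special tokens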

{'input_ids': tensor([[  101, 22832,   168,   149, 26321,   167,   167,   117, 16320,   167,
           167,   117,   155, 21669,   167,   167,   117, 23123,  8032,  2858,
          7939,   168, 10558,   167,   167,   117, 20479,   168,  5099,   168,
         10558,   167,   167,   117,  5048, 14494,   167,   167,   117,  1119,
          4163,  2941,   168,   172,  3161,  1197, 15342,  1548,   113,  9852,
           114,   117, 11911,   168,  4290,   113,  9852,   114,   117,   179,
          3984, 12090,  2093,   113,  9852,   114,   117,   188, 27250, 15622,
           167,   117, 11850,  1361, 10721,   167,   117, 21447,   168, 22860,
           167,   117, 11850,  1361,   168, 22860,   167,   117,   187,  8265,
         10721,   167,   117,  1137,  4184,  7111,  5730,  2176,  1348,   168,
          2489,   167,   117,   174, 19093,  1777,  8745,  1116,   167,   117,
          9468,  7301, 20695, 15023, 24632,  1548,   167,   117, 25021,  2660,
         21745,   168, 26600,   168, 2

In [None]:
# printing out only the "input_ids" tensors
print(tokenised_inputs["input_ids"])

tensor([[  101, 22832,   168,   149, 26321,   167,   167,   117, 16320,   167,
           167,   117,   155, 21669,   167,   167,   117, 23123,  8032,  2858,
          7939,   168, 10558,   167,   167,   117, 20479,   168,  5099,   168,
         10558,   167,   167,   117,  5048, 14494,   167,   167,   117,  1119,
          4163,  2941,   168,   172,  3161,  1197, 15342,  1548,   113,  9852,
           114,   117, 11911,   168,  4290,   113,  9852,   114,   117,   179,
          3984, 12090,  2093,   113,  9852,   114,   117,   188, 27250, 15622,
           167,   117, 11850,  1361, 10721,   167,   117, 21447,   168, 22860,
           167,   117, 11850,  1361,   168, 22860,   167,   117,   187,  8265,
         10721,   167,   117,  1137,  4184,  7111,  5730,  2176,  1348,   168,
          2489,   167,   117,   174, 19093,  1777,  8745,  1116,   167,   117,
          9468,  7301, 20695, 15023, 24632,  1548,   167,   117, 25021,  2660,
         21745,   168, 26600,   168, 20497, 12725,  

In [None]:
# using pytorch directly to create tensors from token IDs
import torch
torch.tensor(ids)

tensor([22832,   168,   149, 26321,   167,   167,   117, 16320,   167,   167,
          117,   155, 21669,   167,   167,   117, 23123,  8032,  2858,  7939,
          168, 10558,   167,   167,   117, 20479,   168,  5099,   168, 10558,
          167,   167,   117,  5048, 14494,   167,   167,   117,  1119,  4163,
         2941,   168,   172,  3161,  1197, 15342,  1548,   113,  9852,   114,
          117, 11911,   168,  4290,   113,  9852,   114,   117,   179,  3984,
        12090,  2093,   113,  9852,   114,   117,   188, 27250, 15622,   167,
          117, 11850,  1361, 10721,   167,   117, 21447,   168, 22860,   167,
          117, 11850,  1361,   168, 22860,   167,   117,   187,  8265, 10721,
          167,   117,  1137,  4184,  7111,  5730,  2176,  1348,   168,  2489,
          167,   117,   174, 19093,  1777,  8745,  1116,   167,   117,  9468,
         7301, 20695, 15023, 24632,  1548,   167,   117, 25021,  2660, 21745,
          168, 26600,   168, 20497, 12725,  4863,   167,   117, 

In [None]:
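# Adding a sample checkpoint & model with the tokenizer
# (this checkpoint is a DistilBERT sentiment classifier fine-tuned on SST-2, used here only to show the mechanics)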
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Sample multiple sequence data using ADRs of bosentan and carbamazepine
sequence = ["abnormal_LFT^^, headache^^, RTI^^, hemoglobin_decreased^^, sperm_count_decreased^^, edema^^, hepatic_cirrhosis(pm), " \
"liver_failure(pm), jaundice(pm), syncope^, sinusitis^, nasal_congestion^, sinus_congestion^, rhinitis^, oropharyngeal_pain^, " \
"epistaxis^, nasopharyngitis^, idiopathic_pulmonary_fibrosis^, anemia^, hematocrit_decreased^, thrombocytopenia(pm), " \
"neutropenia(pm), leukopenia(pm), flushing^, hypotension^, palpitation^, orthostatic_hypotension^, unstable_angina^, " \
"hot_flush^, gastroesophageal_reflux_disease^, diarrhea^, pruritus^, erythema^, angioedema(pm), DRESS(pm), rash(pm), " \
"dermatitis(pm), arthralgia^, joint_swelling^, blurred_vision^, chest_pain^, peripheral_edema^, influenza_like_illness^, " \
"vertigo^, fever^, chest_pain^, hypersensitivity_reaction^, anaphylaxis(pm)", "constipation^^, leucopenia^^, dizziness^^, " \
"sedation^^, ataxia^^, elevated_GGT^^, allergic_skin_reactions^^, eosinophilia^, thrombocytopenia^, neutropenia^, headache^, " \
"tremor^, elevated_ALP^, pruritus^, paresthesia^, diplopia^, blurred_vision^, hyponatremia^, fluid_retention^, oedema^, "
"weight_gain^, reduced_plasma_osmolarity_(ADH_like_effect)^, vertigo^"]

tokens = tokenizer(sequence, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)

# tokens = tokenizer.tokenize(sequence)
# ids = tokenizer.convert_tokens_to_ids(tokens)
# input_ids = torch.tensor(ids)
# print("Input IDs:", input_ids)

In [None]:
output

SequenceClassifierOutput(loss=None, logits=tensor([[ 3.2043, -2.7013],
        [ 2.1903, -1.8928]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
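
The logits above are raw scores, one pair per input sequence. A softmax turns each pair into probabilities, as in the minimal standard-library sketch below (the logit values are copied from the output above). Since the checkpoint was fine-tuned for sentiment classification, the two columns correspond to its sentiment classes and say nothing meaningful about ADRs; the index-to-label mapping is best read from model.config.id2label rather than assumed.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the output above, one row per input sequence.
logits = [[3.2043, -2.7013], [2.1903, -1.8928]]
probs = [softmax(row) for row in logits]
for row in probs:
    print([round(p, 4) for p in row])
```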

In [None]:
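# Making a sample batch of token IDs by using the same sequence twice
# ids must come from the current (DistilBERT) tokenizer, not the earlier bert-base-cased one
ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sequence[0]))
batched_ids = [ids, ids]
batched_ids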

[[19470,
  1035,
  1048,
  6199,
  1034,
  1034,
  1010,
  14978,
  1034,
  1034,
  1010,
  19387,
  2072,
  1034,
  1034,
  1010,
  19610,
  8649,
  4135,
  8428,
  1035,
  10548,
  1034,
  1034,
  1010,
  18047,
  1035,
  4175,
  1035,
  10548,
  1034,
  1034,
  1010,
  3968,
  14545,
  1034,
  1034,
  1010,
  2002,
  24952,
  2278,
  1035,
  25022,
  12171,
  25229,
  1006,
  7610,
  1007,
  1010,
  11290,
  1035,
  4945,
  1006,
  7610,
  1007,
  1010,
  14855,
  8630,
  6610,
  1006,
  7610,
  1007,
  1010,
  26351,
  17635,
  1034,
  1010,
  8254,
  2271,
  13706,
  1034,
  1010,
  19077,
  1035,
  20176,
  1034,
  1010,
  8254,
  2271,
  1035,
  20176,
  1034,
  1010,
  1054,
  20535,
  7315,
  1034,
  1010,
  20298,
  21890,
  18143,
  3351,
  2389,
  1035,
  3255,
  1034,
  1010,
  4958,
  11921,
  9048,
  2015,
  1034,
  1010,
  17235,
  7361,
  8167,
  6038,
  23806,
  2483,
  1034,
  1010,
  8909,
  3695,
  25940,
  1035,
  21908,
  1035,
  10882,
  12618,
  6190,
  1034,
 

In [None]:
input_batched_ids = torch.tensor(batched_ids)
output_batched = model(input_batched_ids)
print("Logits:", output_batched.logits)

Logits: tensor([[ 2.1536, -1.9253],
        [ 2.1536, -1.9253]], grad_fn=<AddmmBackward0>)


In [None]:
# attention masks are used to tell the attention layers (which contextualise each token) in transformer models 
# to ignore the padding tokens when multiple sequences are of different lengths
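
The relationship between padding and the attention mask can be sketched in plain Python. The token IDs and `PAD_ID` below are made up for illustration, and `pad_batch` is a hypothetical helper; a real tokenizer called with padding=True produces the same two structures.

```python
PAD_ID = 0  # hypothetical padding ID; a real tokenizer defines its own

def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every sequence to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding that the attention layers should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask

# Two sequences of different lengths (made-up token IDs).
input_ids, attention_mask = pad_batch([[101, 7, 8, 9, 102], [101, 7, 102]])
print(input_ids)
print(attention_mask)
```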

##### **Some initial thoughts after trying out tokenization**

My current understanding is that building a language model (whether large or small) involves: 

* preparing a corpus of training data
* using a tokenizer to encode or decode the text data
* using a pre-trained model of choice as a base model and training it on the data provided
* producing training output

The pre-trained model can be further adjusted or fine-tuned by training it on smaller, high-quality datasets for other, more specific NLP tasks.
    
This means my initial small goal of converting the tensor outputs back into words will actually be the step that follows obtaining the training output from a language translation/summarisation/classification model. I'll therefore have to test the trained model on a separate set of test data to see whether the conversion from tensors to strings makes sense (this leads to the latest plan to try NER on the ADRs of tyrosine kinase inhibitors). 

* a possible training workflow for a small part of an early ADR prediction model: 

    input training ADR strings (the "drug" part may be added later) -> encode into token IDs for training -> tensors -> token IDs to be decoded -> ADR strings

    code example for the tensors -> token IDs -> strings part (the attribute holding the token IDs depends on the model used): 
    
    tokenizer.batch_decode(outputs.context_input_ids, skip_special_tokens=True)

* a possible workflow for testing the trained (fine-tuned) ADR decoder model on new data:

    input testing drug-ADRs -> token IDs -> tensors -> token IDs -> ADR strings


A useful and informative reference paper to learn about NLP in drug discovery is by Withers et al. - https://doi.org/10.1080/17460441.2025.2490835