REFERENCE: https://github.com/maese005/GLBIO-2024
Session 1

### Terms:
#### PREPROCESSING
1. Vectorization: converts documents into a matrix. TF-IDF (term frequency-inverse document frequency) highlights the most distinctive terms in each document.
   Fit the vectorizer on the training data only, then transform both the training and test data into TF-IDF matrices using that same vocabulary.
2. Tokenization: splits text into tokens (individual words or punctuation marks), the building blocks a model treats as individual elements of the text
   Word embedding
   * Libraries: nltk, spacy
   * nltk.word_tokenize
   * EOS: end-of-sequence token that marks where a sequence concludes. The model learns from it when to stop generating output
   * PADDING TOKEN: Sequences are processed in batches, so shorter sequences are padded to match the length of the longest sequence. Padding tokens carry no meaningful information, but padding is an important preprocessing step.
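
A minimal sketch of how EOS and padding tokens fit together when batching. The token ids `PAD_ID` and `EOS_ID` and the helper `pad_batch` are hypothetical names chosen for this illustration, not taken from nltk or spacy.

```python
# Hypothetical special-token ids, chosen only for this illustration.
PAD_ID = 0
EOS_ID = 1

def pad_batch(sequences, pad_id=PAD_ID, eos_id=EOS_ID):
    """Append EOS to each sequence, then right-pad to the longest length."""
    with_eos = [seq + [eos_id] for seq in sequences]
    max_len = max(len(seq) for seq in with_eos)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in with_eos]
    # Attention mask: 1 for real tokens (including EOS), 0 for padding.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in with_eos]
    return padded, mask

batch = [[5, 6, 7], [8, 9]]
padded, mask = pad_batch(batch)
print(padded)  # [[5, 6, 7, 1], [8, 9, 1, 0]]
print(mask)    # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The mask is what lets a model skip the padded positions, since they carry no information.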
  
#### TRANSFORMERS and KEY COMPONENTS
1. A type of neural network
2. Differ from RNNs and LSTMs because they use self-attention instead of recurrence
3. How it works:
   * Self-attention: different parts of the input get different weights, and all tokens in the sequence are processed at the same time, so the full context is captured in one pass
   * Parallel processing
   * Encoder and decoder: the encoder builds a representation of the input, and the decoder decodes it into an interpretable output.
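
The self-attention step can be sketched in plain Python on a toy input. This is a simplified illustration, assuming identity projections (queries, keys and values are all the raw token vectors); real transformers learn separate projection matrices. The names `softmax` and `self_attention` are illustrative.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (an assumption)."""
    d = len(x[0])
    # Every token is scored against every other token in one pass.
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x] for q in x]
    weights = [softmax(row) for row in scores]
    # Each output vector is a weighted average of all token vectors.
    output = [[sum(w * v[j] for w, v in zip(row, x)) for j in range(d)] for row in weights]
    return weights, output

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, output = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and shows how much one token attends to every token in the sequence, which is the "different parts, different weights" idea from the list above.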

In [None]:
# Transformers.py

In [26]:
# Import modules
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D, Layer, Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

In [None]:
# Data generation parameters
num_patients = 1000
max_seq_length = 10
vocab_size = 50  # Number of possible symptoms

np.random.seed(0)
patient_data = np.random.randint(1, vocab_size, size=(num_patients, max_seq_length))
print(patient_data)
"""
[[45 48  1 ... 20 22 37]
 [24  7 25 ... 40 24 47]
 [25 18 38 ... 21 17  6]
 ...
 [26 46  4 ... 11  8  5]
 [ 7  5 41 ... 36 16 23]
 [47 11 37 ...  5 41 45]]
"""
patient_labels = np.random.randint(2, size=(num_patients, 1))
print(patient_labels)
"""
 [1]
 [0]
 [1]
 [1]
 [1]
 [0]...etc.
"""

In [28]:
# Train-test split
split = int(num_patients * 0.8)
train_data, test_data = patient_data[:split], patient_data[split:]
train_labels, test_labels = patient_labels[:split], patient_labels[split:]

In [30]:
# Define the positional encoding function
# Positional encodings give the model information about the absolute or relative position of each token
# They are added to the input embeddings so the encoder knows the token order
def get_positional_encoding(max_seq_length, d_model):
    positional_enc = np.zeros((max_seq_length, d_model))
    position = np.arange(0, max_seq_length, dtype=np.float32).reshape(-1, 1)
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))

    positional_enc[:, 0::2] = np.sin(position * div_term)
    positional_enc[:, 1::2] = np.cos(position * div_term)

    positional_enc = tf.cast(positional_enc, dtype=tf.float32)
    return positional_enc

# Define the scaled dot-product attention function
# CORE COMPONENT: computes actual attention mechanism
def scaled_dot_product_attention(q, k, v, mask):
    matmul_qk = tf.matmul(q, k, transpose_b=True)
    depth = tf.cast(tf.shape(k)[-1], tf.float32)
    logits = matmul_qk / tf.math.sqrt(depth)
    if mask is not None:
        logits += (mask * -1e9)
    attention_weights = tf.nn.softmax(logits, axis=-1)
    output = tf.matmul(attention_weights, v)
    return output
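
The `mask` argument above is multiplied by -1e9 and added to the logits, so a 1 marks a position to block and a 0 a position to keep. The helpers below are a hypothetical sketch of two masks that follow this convention. `pad_id=0` is an assumption; it happens to be unused by the generated patient data, which draws ids from 1 to 49.

```python
def look_ahead_mask(size):
    """1 marks a position to block (a future token), 0 a position to keep."""
    return [[1 if j > i else 0 for j in range(size)] for i in range(size)]

def padding_mask(token_ids, pad_id=0):
    """1 marks a padding position to block, 0 a real token to keep."""
    return [1 if t == pad_id else 0 for t in token_ids]

print(look_ahead_mask(3))           # [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
print(padding_mask([4, 9, 0, 0]))   # [0, 0, 1, 1]
```

To use a padding mask with the multi-head code, it would need to be converted to a float tensor and reshaped to (batch, 1, 1, seq_len) so that it broadcasts over heads and query positions.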

In [33]:
# implementation of multi head attention mechanism
# model gets info from different parts of input
class MultiHeadAttention(Layer):
    def __init__(self, d_model, num_heads): # declare what is needed inside with init
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        assert d_model % self.num_heads == 0
        self.depth = d_model // self.num_heads
        self.wq = Dense(d_model)
        self.wk = Dense(d_model)
        self.wv = Dense(d_model)
        self.dense = Dense(d_model)

    def split_heads(self, x, batch_size): # reshape and transpose input data
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask=None): # forward pass: project q/k/v, split into heads, attend, then merge heads
        batch_size = tf.shape(q)[0]
        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)
        scaled_attention = scaled_dot_product_attention(q, k, v, mask)
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))
        output = self.dense(concat_attention)
        return output



In [32]:
class TransformerEncoder(Model): #define transformer encoder class
    def __init__(self, vocab_size, num_heads, d_model):
        super(TransformerEncoder, self).__init__()
        self.embed = Embedding(vocab_size, d_model)
        self.pos_encoding = get_positional_encoding(100, d_model)  # Adjust max sequence length if necessary
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.pooling = GlobalAveragePooling1D() # averages over the sequence dimension to get one vector per patient
        self.final = Dense(1, activation='sigmoid')

    def call(self, x):
        x = self.embed(x)
        x += self.pos_encoding[:tf.shape(x)[1], :]
        x = self.attention(x, x, x)  # Mask is omitted unless you specifically need it
        x = self.pooling(x)
        return self.final(x)

In [34]:
# Model setup parameters
num_heads = 4 # number of attention heads; each head attends over the full sequence in its own d_model // num_heads subspace

# Dimensionality of the embedding
d_model = 128  # size of the embedding; a larger d_model increases the model's ability to learn detailed features

# Initialize and compile the model
model = TransformerEncoder(vocab_size=vocab_size, num_heads=num_heads, d_model=d_model)
model.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])

# Train the model (the labels are random, so accuracy near 0.5 is expected)
model.fit(train_data, train_labels, batch_size=32, epochs=10, validation_data=(test_data, test_labels))

# Make predictions
predictions = model.predict(test_data)
print("Sample predictions:", predictions[:5])
"""
Sample predictions: [[0.59326303]
 [0.51481295]
 [0.8828845 ]
 [0.34554717]
 [0.35884818]]
"""



Epoch 1/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - accuracy: 0.4471 - loss: 0.8042 - val_accuracy: 0.5200 - val_loss: 0.6930
Epoch 2/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.5112 - loss: 0.6932 - val_accuracy: 0.5200 - val_loss: 0.6930
Epoch 3/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.4980 - loss: 0.6969 - val_accuracy: 0.4600 - val_loss: 0.6954
Epoch 4/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.5115 - loss: 0.6953 - val_accuracy: 0.4800 - val_loss: 0.7081
Epoch 5/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.5378 - loss: 0.6839 - val_accuracy: 0.5350 - val_loss: 0.6944
Epoch 6/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.5430 - loss: 0.6880 - val_accuracy: 0.4700 - val_loss: 0.7089
Epoch 7/10
25/25 ━━━━━━━━━━


#### KEYWORD SEARCH IN HEALTHCARE: APPLICATIONS
1. data documents for med data
2. researchers: efficient web scraper
3. Treatment personalization
   
* TF-IDF: very important for medical data
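
A minimal pure-Python sketch of the TF-IDF computation, to show why rarer terms score higher. It assumes the smoothed IDF formula that scikit-learn uses by default, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The whitespace tokenizer and the function name `tfidf` are simplifications for illustration and differ from the real vectorizer's tokenization and stop-word handling.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) where rows[i][j] is the TF-IDF of term j in doc i."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])  # L2-normalize each document
    return vocab, rows

docs = ["high blood sugar", "high blood pressure", "blurred vision"]
vocab, rows = tfidf(docs)
scores = dict(zip(vocab, rows[0]))
for term in ("sugar", "blood", "pressure"):
    print(term, round(scores[term], 3))
```

In the first document, "sugar" appears in only one document and so scores higher than "blood", which appears in two. "pressure" does not occur in that document and scores zero.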

In [None]:
# Disease Keyword Extraction.py

In [37]:
# Import modules
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


In [40]:
# Create a sample dataset of patient records with disease labels
data = {
    'Patient_ID': [1, 2, 3, 4, 5, 6],
    'Notes': [
        "Patient shows increased blood sugar and frequent urination.",
        "Elevated blood pressure and headaches reported repeatedly.",
        "Blood sugar tests indicate possible diabetic condition.",
        "High blood pressure observed, along with blurred vision.",
        "Urine tests confirm high sugar levels, suggesting diabetes.",
        "Patient complains of chronic headaches and high blood pressure."
    ],
    'Disease': ['Diabetes', 'Hypertension', 'Diabetes', 'Hypertension', 'Diabetes', 'Hypertension']
}
df = pd.DataFrame(data)
df

| | Patient_ID | Notes | Disease |
|---|---|---|---|
| 0 | 1 | Patient shows increased blood sugar and freque... | Diabetes |
| 1 | 2 | Elevated blood pressure and headaches reported... | Hypertension |
| 2 | 3 | Blood sugar tests indicate possible diabetic c... | Diabetes |
| 3 | 4 | High blood pressure observed, along with blurr... | Hypertension |
| 4 | 5 | Urine tests confirm high sugar levels, suggest... | Diabetes |
| 5 | 6 | Patient complains of chronic headaches and hig... | Hypertension |


In [39]:
# Define function to extract top 5 distinguishing keywords using TF-IDF score
def extract_distinguishing_keywords(notes):
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(notes)
    feature_names = vectorizer.get_feature_names_out()
    tfidf_scores = tfidf_matrix.sum(axis=0).A1 # get total score across all patients
    keywords_scores = sorted(zip(feature_names, tfidf_scores), key=lambda x: x[1], reverse=True)
    return keywords_scores[:5]  # Return the top 5 keywords for clarity


In [41]:
# Analyze by disease type
diabetes_df = df[df['Disease'] == 'Diabetes']
hypertension_df = df[df['Disease'] == 'Hypertension']

# Extract keywords for each disease type
diabetes_keywords = extract_distinguishing_keywords(diabetes_df['Notes'])
hypertension_keywords = extract_distinguishing_keywords(hypertension_df['Notes'])


In [42]:
# Print results for each disease
print("Top 5 Keywords for Diabetes:")
for keyword, score in diabetes_keywords:
    print(f"{keyword}: {score:.2f}")

print("\nTop 5 Keywords for Hypertension:")
for keyword, score in hypertension_keywords:
    print(f"{keyword}: {score:.2f}")

Top 5 Keywords for Diabetes:
sugar: 0.72
blood: 0.64
tests: 0.61
condition: 0.43
diabetic: 0.43

Top 5 Keywords for Hypertension:
blood: 0.84
pressure: 0.84
headaches: 0.71
high: 0.71
blurred: 0.48
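
Note that "blood" ranks high in both lists, because each disease group is scored on its own. One possible follow-up, sketched here as an assumption rather than part of the original analysis, is to drop terms that appear in the other group's list. The helper `distinctive_terms` is hypothetical and only compares the lists passed to it.

```python
def distinctive_terms(keywords_a, keywords_b):
    """Keep (term, score) pairs from keywords_a whose term is absent from keywords_b."""
    shared = {term for term, _ in keywords_b}
    return [(term, score) for term, score in keywords_a if term not in shared]

# Values copied from the printed output above.
diabetes_top = [("sugar", 0.72), ("blood", 0.64), ("tests", 0.61),
                ("condition", 0.43), ("diabetic", 0.43)]
hypertension_top = [("blood", 0.84), ("pressure", 0.84), ("headaches", 0.71),
                    ("high", 0.71), ("blurred", 0.48)]

print(distinctive_terms(diabetes_top, hypertension_top))
print(distinctive_terms(hypertension_top, diabetes_top))
```

Only "blood" is removed in both directions here. Comparing full vocabularies instead of top-5 lists would be stricter.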


#### APPLICATIONS OF GEN AI IN HEALTHCARE
CHARACTERISTICS:
1. Transfer knowledge from one data form to another: transfer learning
2. adaptable
3. logical reasoning

APPS:
1. Design new molecules: Google DeepMind
2. Personalized medicine: model how different genetic profiles respond to different treatment strategies
3. Generate synthetic data: train ML models without exposing personal information
4. Oversample underrepresented classes in a dataset.
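
For comparison with generative approaches, here is a sketch of the simplest form of oversampling: plain random oversampling, which duplicates existing minority-class samples rather than generating new ones. It is a baseline, not a generative method. The function name `random_oversample` and the toy data are illustrative.

```python
import random

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes are the same size."""
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

samples = ["a", "b", "c", "d", "e"]
labels = [0, 0, 0, 0, 1]
new_samples, new_labels = random_oversample(samples, labels)
print(new_labels.count(0), new_labels.count(1))  # 4 4
```

A generative model would replace the `rng.choice(group)` step with newly synthesized samples, which avoids exact duplicates.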

In [43]:
# Import libraries
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Activation

In [46]:
# Provide sample text to train the model
"""
The model will learn from this input text and try to 
generate text after learning.
"""
text = "hello world" # can change for text data of choice
chars = sorted(list(set(text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))

# Prepare dataset
"""
Convert characters to integers and 
create sequences that the model can learn from.
"""
seq_length = 3
dataX = []
dataY = []
for i in range(0, len(text) - seq_length, 1): # slide a window over the text; the target is the character that follows each window
    seq_in = text[i:i + seq_length]
    seq_out = text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
 
X = np.reshape(dataX, (len(dataX), seq_length, 1)) # reshape
X = X / float(len(chars)) # normalization (between 0-1: consistent scale)
y = tf.keras.utils.to_categorical(dataY) # make data into diff categories - 1 hot encoding

In [47]:
# Define a simple LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]))) # LSTM in seq prediction learns ordered dependencies of text
model.add(Dense(y.shape[1], activation='softmax'))

# automated chatbot responses for example
# but they fail in complex cases

# Compile and train the model on the prepared data
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(X, y, epochs=300, batch_size=64)

# Define function to generate text
"""
Use the model to generate text based on a seed input.
"""
def generate_text(model, seed_text, n_vocab, char_to_int, 
                  int_to_char, length=100):
    pattern = [char_to_int[char] for char in seed_text]
    text = seed_text
    for i in range(length):
        x = np.reshape(pattern, (1, len(pattern), 1))
        x = x / float(n_vocab)
        prediction = model.predict(x, verbose=0)
        index = np.argmax(prediction)
        result = int_to_char[index]
        text += result
        pattern.append(index)
        pattern = pattern[1:len(pattern)]
    return text

Epoch 1/300


  super().__init__(**kwargs)


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 781ms/step - loss: 2.0820
Epoch 2/300
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 18ms/step - loss: 2.0727
Epoch 3/300
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 18ms/step - loss: 2.0635
Epoch 4/300
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 17ms/step - loss: 2.0542
Epoch 5/300
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 17ms/step - loss: 2.0449
Epoch 6/300
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 17ms/step - loss: 2.0354
Epoch 7/300
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 17ms/step - loss: 2.0257
Epoch 8/300
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 17ms/step - loss: 2.0158
Epoch 9/300
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 17ms/step - loss: 2.0055
Epoch 10/300
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 17ms/step - loss: 1.9948
Epoch 11/300
[1m1/1

In [48]:
# Generate some text
# but the output falls into a loop and is repetitive
int_to_char = dict((i, c) for i, c in enumerate(chars))
print(generate_text(model, "hel", len(chars), char_to_int, int_to_char))

hello worldllr worldllr worldllr worldllr worldllr worldllr worldllr worldllr worldllr worldllr worldll
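
The loop in the output above comes partly from greedy decoding: `np.argmax` always picks the single most likely character, so the same 3-character window always leads to the same continuation. One common alternative is to sample from the predicted distribution with a temperature. The sketch below is a hypothetical helper, not part of the original notebook.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Sample an index from probs after rescaling the distribution by temperature."""
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    scaled = [e / total for e in exps]
    return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]

rng = random.Random(0)
probs = [0.6, 0.3, 0.1]
varied = [sample_with_temperature(probs, 1.0, rng) for _ in range(200)]
greedy_like = [sample_with_temperature(probs, 0.01, rng) for _ in range(200)]
print(sorted(set(varied)), sorted(set(greedy_like)))
```

In `generate_text`, `np.argmax(prediction)` could be swapped for `sample_with_temperature(prediction[0], 0.8)` to try this. With only 11 training characters the output will still be limited.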


#### APPLICATIONS of LLM IN MEDICAL CHATBOT
* Medical LLM: pretrained, then fine-tuned model
* Med chatbot - new avenue for patient care
* https://github.com/pallavisurana1/Llama2-Medical-Chatbot

In [None]:
import pandas as pd
from datasets import load_dataset
dataset = load_dataset("danielpark/MQuAD-v1") # get from github later

In [None]:
# Assuming 'dataset' is your DatasetDict and you're interested in the 'train' split
train_dataset = dataset['train']

# Extract 'question' and 'answer' fields
user_text = train_dataset['question']
bot_text = train_dataset['answer']

# Create a pandas DataFrame with 'user_text' and 'bot_text' columns
df = pd.DataFrame({
    'User': user_text,
    'Bot': bot_text
})

# Shuffle the DataFrame rows
df = df.sample(frac=1).reset_index(drop=True)

# Optionally, save the DataFrame to a CSV file
# df.to_csv('medical_conversation_data.csv', index=False)

In [None]:
# Display the first few rows of the DataFrame to verify its diversity
print(df.head())

# Assuming 'df' is your Pandas DataFrame with columns 'User' and 'Bot'
df['formatted'] = "User: " + df['User'] + " \nBot: " + df['Bot'] + "\n"

training_text = "\n".join(df['formatted'].tolist())


In [55]:
from transformers import GPT2Tokenizer
import torch
from transformers import GPT2LMHeadModel, Trainer, TrainingArguments

In [None]:
# Instantiate a tokenizer for gpt-2 model
"""
- GPT2Tokenizer is a class from the Transformers library.
It is designed to handle tokenization for the GPT-2 model.
- Tokenization is the process of converting raw text into a format that can
be input into a neural network.
It typically transforms the text into a sequence of integers called tokens.
- from_pretrained('gpt2') is a method used to load the pre-trained tokenizer.
The 'gpt2' argument specifies that it should load the tokenizer pre-configured
and trained to work with the gpt-2 model.
This includes all the specific settings like vocab size, special tokens, other configurations that are unique to gpt-2.
- tokenizer is the tokenizer object that is now ready to be used to tokenize text.
Ex: methods like tokenizer.encode('your text here') can be used to convert text into tokens.
Ex: methods like tokenizer.decode([list of tokens]) can be used to convert the tokens back into text.
"""
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
"""
The gpt-2 tokenizer defines an end-of-sequence (EOS) token but no padding token by default.
Set the padding token to the same value as the EOS token.
- End of sequence token: A special token used by models like gpt2 to indicate the end of a text sequence.
It is used by the model to determine when a sentence or a text has finished.
- Padding token: Used to fill up sequences to a uniform length.
This is necessary because most neural networks require inputs of the same size.
If you have sentences or inputs of different lengths you use the padding token to equalize their lengths by padding shorter sequences.
Normally the padding token is a special token reserved for padding purposes.
- When you set tokenizer.pad_token=tokenizer.eos_token, you tell the tokenizer to use the EOS token also as the padding token.
This is a common workaround for gpt-2, because it ships without a dedicated padding token.
Tokenizers that already define a padding token do not need this step.
- It causes every added padding token to look like an end-of-sequence marker.
This affects how the model interprets sequences, so the attention mask should be passed along so that padded positions are ignored.

"""
tokenizer.pad_token = tokenizer.eos_token  

# Tokenize the training text
"""
- Prepare the text data for input into a neural network/here we specifically prepare it for models like gpt-2.
- tokenizer(training_text): calls the tokenizer on training_text (could be a single string or a batch of texts/strings).
The tokenizer converts the text into tokens that the model can understand aka it converts it to numerical representations. 
- return_tensors='pt': Specifies the format of the returned data.
pt stands for PyTorch tensors.
Other options include tf/TensorFlow tensors, np/numpy arrays.
This ensures the output is compatible with PyTorch models as tensors.
- padding=True: Ensures that all sequences in the batch are padded to the length of the longest sequence
in the case that training_text contains multiple texts.
Uniform sequence length is required for batch processing in neural networks.
- truncation=True, max_length=512: Sets the max length of the sequences.
Any text longer than 512 tokens will be truncated.
Texts are not padded up to 512: padding=True pads only to the longest sequence in the batch
(padding='max_length' would pad to max_length instead).
512 is a common input limit for BERT-style models; gpt2 itself accepts up to 1024 tokens,
so 512 here is a choice that likely keeps memory use down.
- inputs: the output will be a batch of tokenized text data that is formatted as PyTorch tensors.
They are ready to be fed into a model for tasks like training.  
"""
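# training_text is one long string, so tokenizing it would yield a single sequence truncated to 512 tokens.
# Tokenize the list of formatted conversations instead, so each one becomes its own training example.
inputs = tokenizer(df['formatted'].tolist(), return_tensors="pt", padding=True, truncation=True, max_length=512)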

"""
- Define a custom dataset implementation designed to work with PyTorch.
- This class inherits from torch.utils.data.Dataset (the base class for all datasets in PyTorch).
- This prepares data for use in training a model that requires inputs and corresponding labels where labels are the inputs themselves.
- Helps us train the model to understand and generate text based on the context provided by the inputs.
"""
class ConversationDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings['input_ids'])

    def __getitem__(self, idx):
        # The encodings are already tensors, so copy them instead of re-wrapping with torch.tensor
        item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}
        item['labels'] = item['input_ids'].clone()  # Set labels to be the same as input_ids
        return item
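
Because the padding token is the EOS token and the labels are a plain copy of input_ids, padded positions also count toward the training loss. A common refinement, sketched here as an assumption and not part of the original class, is to set the labels at padded positions to -100, the index that PyTorch's cross-entropy loss ignores by default. The helper name `mask_padding_labels` is hypothetical and works on plain lists for clarity.

```python
IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss by default

def mask_padding_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids as labels, replacing padded positions with ignore_index."""
    return [tok if keep == 1 else ignore_index
            for tok, keep in zip(input_ids, attention_mask)]

# 50256 is the gpt2 EOS id; the first one is a real EOS, the second is padding.
input_ids = [10, 11, 50256, 50256]
attention_mask = [1, 1, 1, 0]
print(mask_padding_labels(input_ids, attention_mask))  # [10, 11, 50256, -100]
```

With tensors, the equivalent inside `__getitem__` would be `item['labels'][item['attention_mask'] == 0] = -100`.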


In [None]:
# Create the dataset
"""
Create an instance of the ConversationDataset class using the inputs variable as its parameter.
Prepares your data for neural network training within the PyTorch framework.
"""
dataset = ConversationDataset(inputs)

"""
Use the Hugging Face transformers library to load a pre-trained model.
- Here we load the pre-trained model gpt2.
"""
model = GPT2LMHeadModel.from_pretrained('gpt2')

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=2,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
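)

# Run the fine-tuning loop; without this call the Trainer is only configured
# and generation below uses the original pre-trained gpt2 weights
trainer.train()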

prompt = "What causes nausea?"
encoded_prompt = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")

In [None]:
# Generate a response
output_sequences = model.generate(
    input_ids=encoded_prompt,
    max_length=1000,
    temperature=0.3,
    top_k=50,  # Sample only from the 50 most likely tokens
    top_p=0.95,
    repetition_penalty=1.2,
    no_repeat_ngram_size=2,  # Prevent repeating n-grams
    num_beams=5,  # Beam search
    length_penalty=0.8,  # Adjust length of responses
    do_sample=True,
    num_return_sequences=1,
)

# Decode the generated sequence to text
generated_sequence = output_sequences[0].tolist()
text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True)

# Extract the text after the prompt
response_text = text[len(tokenizer.decode(encoded_prompt[0], clean_up_tokenization_spaces=True)):]

print(response_text)
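
To make the `top_k` and `top_p` arguments above concrete, here is a simplified pure-Python sketch of the two filters applied to a toy probability distribution. It illustrates the idea only and is not the transformers implementation; the function name `top_k_top_p_filter` is hypothetical.

```python
def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Return {index: renormalized probability} after top-k, then top-p filtering."""
    # Top-k: keep only the k most likely tokens.
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {index: p / total for index, p in kept}

probs = [0.5, 0.3, 0.15, 0.04, 0.01]
filtered = top_k_top_p_filter(probs, top_k=4, top_p=0.9)
print(sorted(filtered))  # [0, 1, 2]
```

Sampling then happens only among the surviving tokens, which removes the unlikely tail while keeping some variety in the response.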

* set up a list of predefined scenarios to check the reliability of the answer
* user surveys
* response time, error rates
* compliance and security (privacy laws - user trust and legal compliance)
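
The first and third checks can be sketched as a small test harness. Everything here is a hypothetical illustration: `stub_bot` is a placeholder standing in for the real chatbot, and the scenarios and required keywords are made-up examples, not validated medical content.

```python
import time

def evaluate_scenarios(respond, scenarios):
    """Run each (prompt, required_keywords) scenario and collect simple metrics."""
    failures, latencies = 0, []
    for prompt, required in scenarios:
        start = time.perf_counter()
        answer = respond(prompt)
        latencies.append(time.perf_counter() - start)
        # A scenario fails if any required keyword is missing from the answer.
        if not all(word.lower() in answer.lower() for word in required):
            failures += 1
    return {
        "error_rate": failures / len(scenarios),
        "mean_latency_s": sum(latencies) / len(latencies),
    }

def stub_bot(prompt):
    # Placeholder response; a real run would call the chatbot here.
    return "Placeholder answer mentioning infection."

scenarios = [
    ("What causes nausea?", ["infection"]),
    ("What causes headaches?", ["dehydration"]),
]
report = evaluate_scenarios(stub_bot, scenarios)
print(report["error_rate"])  # 0.5
```

Keyword matching is a crude proxy for correctness, so a check like this complements human review rather than replacing it.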

#### Qualitative (human judgement) and quantitative testing (A/B testing)

#### MEDICAL annotation generator
1. automated report generator
2. EHR
3. data retrieval system
4. Let's explore LLaVA-Med (biomedical visual assistant)
* https://github.com/microsoft/LLaVA-Med
  * Can answer open-ended research questions
  * Understands topics related to biomedical images
  * Additional context
  * Visual Q and A
  * LLaVA is more general, as it's not fine-tuned on biomedical data like LLaVA-Med

#### GENE GPT
1. Uses NCBI API documentation
2. Answers GeneTuring and GeneHop questions (benchmark datasets)
3. Can make use of information learned from the prompt
4. Model-agnostic augmentation method: does not need task-specific fine-tuning (it interacts with the API and incorporates information retrieved from the API calls)
5. Hence it's versatile, so no task-specific learning is needed
6. Evaluation:
   * Manual and automatic evaluations
   * Gene disease association : use recall
   * DNA seq alignment to many species - only exact match is considered as correct
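
The two automatic metrics named above can be sketched as follows. The gene symbols are placeholders, and stripping whitespace before the exact-match comparison is an assumption made for this illustration.

```python
def recall(predicted, gold):
    """Fraction of gold items that appear among the predictions."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0
    return len(gold_set & set(predicted)) / len(gold_set)

def exact_match(predicted, gold):
    """True only if the answer matches the reference exactly (after stripping whitespace)."""
    return predicted.strip() == gold.strip()

# Placeholder gene symbols for a gene-disease association question.
print(recall(["GENE_A", "GENE_C"], ["GENE_A", "GENE_B"]))  # 0.5
# Species answer for a sequence alignment question: no partial credit.
print(exact_match("human", "human"))   # True
print(exact_match("human", "mouse"))   # False
```

Recall rewards partially correct gene lists, while exact match gives no credit for a near miss, which suits tasks with a single correct answer.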

#### scGPT
#### single cell GPT
1. cell type annotation
2. genetic perturbation prediction
3. batch correction
4. multiomic integration
5. reference mapping
6. transfer learning
https://scgpt.readthedocs.io/en/latest/

A hybrid approach (manual + Gen AI) can be used for rare cell types