<h1 style = "text-align:center">
 Semantic Similarity using Sentence Transformer
</h1>
<p>
<b>Problem:</b>
 </p>
 <p>Predict if two sentences are similar.
</p>

<p>
<b>
Data
</b>
</p>

<p>
This dataset consists of 3048 similar and dissimilar medical question pairs hand-generated and labeled by Curai's doctors. Doctors with a list of 1524 patient-asked questions randomly sampled from the publicly available crawl of HealthTap
</p>

<p>
<b>
Solution:
</b>
</p>
<p>
Our approach is using fine tuning a Sentence Transformers. We compare our fine-tune model with more complex Sentence models.
</p>

In [54]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer, models, util, InputExample, losses
from torch.utils.data import DataLoader
import pickle

<h1 style = "text-align:center">
Load data
</h1>

In [56]:
medical_file = '/Users/abelcamachoguardian/Downloads/train-00000-of-00001.parquet'
medical_data = pd.read_parquet(medical_file)

<h1 style = "text-align:center">
Data Exploration
</h1>

In [57]:
medical_data.head(2)

Unnamed: 0,dr_id,question_1,question_2,label
0,1,After how many hour from drinking an antibioti...,I have a party tonight and I took my last dose...,1
1,1,After how many hour from drinking an antibioti...,I vomited this morning and I am not sure if it...,0


In [58]:
medical_data.reset_index(inplace=True)

In [59]:
medical_data['question_id'] = 0
question_id = 0

for i in range(medical_data.shape[0]):
    if i%2==0:
        question_id+=1
    medical_data.loc[i,'question_id'] = question_id

In [60]:
number_of_different_questions = len(medical_data.question_id.unique())

percentage_of_train_questions = 0.7
train_number_of_questions = int(np.floor(number_of_different_questions*0.7))

In [61]:
permutation_of_questions = np.random.permutation(number_of_different_questions)

train_questions_id = permutation_of_questions[0:train_number_of_questions]
test_questions_id = permutation_of_questions[train_number_of_questions::]

In [62]:
number_train_questions = int(train_questions_id.shape[0])

<h1 style = "text-align:center">
Data Augmentation
</h1>
<p>
We are considering two options for data augmentation.
The first one involves creating more triples using the existing dataset.
The second one involves adding noise to the dataset, such as introducing typo errors to the questions or replacing words with synonyms
</p>


In [63]:
# Creating more triples using the existing questions
for k in range(number_train_questions-1):
    temp_question_id = medical_data['question_id'] == train_questions_id[k]
    temp_label_1 = medical_data['label'] == 1
    temp_random_question_id = medical_data['question_id'] == train_questions_id[k+1]

    temp_question1= medical_data.loc[temp_question_id & temp_label_1,'question_1']
    temp_question2 = medical_data.loc[temp_question_id & temp_label_1,'question_2']
    question_id = number_of_different_questions+k+1
    
    temp_new_question = medical_data.loc[ temp_random_question_id & temp_label_1,'question_2']

    if k==0:
        temp_dataset = pd.DataFrame({'dr_id': [-1],
                         'question_1':list(temp_question1),
                         'question_2': list(temp_question2),
                        'label': [1],
                         'question_id': [question_id]})
        
        temp_dataset = pd.concat([ temp_dataset, pd.DataFrame({'dr_id': [-1],
                         'question_1':list(temp_question1),
                         'question_2': list(temp_new_question),
                        'label': [0],
                         'question_id': [question_id]})],ignore_index=True)
        
    else:
        temp_dataset = pd.concat([ temp_dataset, pd.DataFrame({'dr_id': [-1],
                         'question_1':list(temp_question1),
                         'question_2': list(temp_question2),
                        'label': [1],
                         'question_id': [question_id]})],ignore_index=True)
    
    
        temp_dataset = pd.concat([ temp_dataset, pd.DataFrame({'dr_id': [-1],
                         'question_1':list(temp_question1),
                         'question_2': list(temp_new_question),
                        'label': [0],
                         'question_id': [question_id]})],ignore_index=True)
        

In [64]:
train_data = pd.concat([medical_data.loc[train_questions_id], temp_dataset], ignore_index=True)

<h1 style = "text-align:center">
Data processing
</h1>

In [66]:
train_examples = []

The example is a triplet (anchor, positive, negative) without classes or labels for the sentences

In [67]:
for i in train_data.question_id.unique():

    example = train_data.loc[train_data.question_id == i]
    if example.shape[0] == 2:
        question = example.loc[example['label'] == 1].question_1.iloc[0]
        positive_example = example.loc[example['label'] == 1].question_2.iloc[0]
        negative_example = example.loc[example['label'] == 0].question_2.iloc[0]
        train_examples.append(InputExample(texts=[question, positive_example, negative_example]))

<h1 style = "text-align:center">
Fine tuning Models
</h1>

In [None]:
fine_tune_models = ['distilbert-base-uncased', 'bert-base-uncased']
num_epochs = 50

for data_type in ['actual_data', 'augmented_data', '2x_augmented_data']:
    for  model_id in fine_tune_models:
        model_raw = models.Transformer(model_id)

        ## Step 1: use an existing language model
        word_embedding_model = models.Transformer(model_id)
        ## Step 2: use a pool function over the token embeddings
        pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

        model_fine_tune = SentenceTransformer(modules=[word_embedding_model, pooling_model])
        #Convert the training examples to a Dataloader.
        train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=250)

        # TripletLoss: loss minimizes the distance between the anchor and the positive sentences while maximizing the          between the anchor and the negative sentences.
        train_loss = losses.TripletLoss(model=model_fine_tune)

        warmup_steps = int(len(train_dataloader) * num_epochs * 0.1) 

        model_fine_tune.fit(train_objectives=[(train_dataloader, train_loss)],
                    epochs=num_epochs,
                    warmup_steps=warmup_steps)
    
        # Save model
        pickle.dump(model_fine_tune, open('/Users/abelcamachoguardian/Downloads/data_augmentation_'+str(num_epochs)+'-'+model_id+'-fine_tune.pkl', 'wb'))
        print(model_id)

Epoch:   0%|          | 0/50 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

distilbert-base-uncased


Epoch:   0%|          | 0/50 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]