### Introduction 
This notebook is based on the research paper [**PubMed 200k RCT:
a Dataset for Sequential Sentence Classification in Medical Abstracts**](https://arxiv.org/pdf/1710.06071.pdf)<br>

In short, the paper labels each sentence of a medical research paper's **abstract** with one of five categories (background, objective, methods, results, conclusions).

### Getting the data
Conveniently, the data is publicly available on GitHub.

In [None]:
!git clone https://github.com/Franck-Dernoncourt/pubmed-rct.git

In [None]:
!ls pubmed-rct

In [None]:
!ls /kaggle/working/pubmed-rct/PubMed_20k_RCT

`dev.txt` is validation set <br>
`test.txt` is test set and<br>
`train.txt` is train set

In [None]:
data_dir = "pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/"

In [None]:
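import os
# Note: os.listdir returns entries in arbitrary order,
# so the position of a given file in this list is not guaranteed.
filenames = [data_dir + filename for filename in os.listdir(data_dir)]
filenames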

### Preprocessing data

In [None]:
def read_lines(filename):
    with open(filename) as file:
        return file.readlines()

In [None]:
filenames[0]

In [None]:
# Refer to the training file by name rather than by its position in `filenames`
train_file = read_lines(filename=data_dir + "train.txt")
train_file[:10]

In [None]:
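def preprocess_text(filename):
    """
    Reads a PubMed RCT file and returns a list of dictionaries, one per sentence,
    with the keys "target", "text", "line_number" and "total_lines".
    """
    input_lines = read_lines(filename)

    abstract_lines = ""  # sentences of the abstract currently being read
    abstracts = []       # one dictionary per sentence, across all abstracts

    for line in input_lines:
        if line.startswith("###"):
            # ID line: a new abstract starts, so reset the buffer
            abstract_lines = ""

        elif line.isspace():
            # Blank line: the abstract is complete, so split it into sentences
            abstract_line_split = abstract_lines.splitlines()

            for abstract_line_number, abstract_line in enumerate(abstract_line_split):
                line_data = {}
                line_split = abstract_line.split("\t")  # label and sentence are tab-separated
                line_data["target"] = line_split[0]
                line_data["text"] = line_split[1].lower()
                line_data["line_number"] = abstract_line_number + 1  # 1-based position in the abstract
                line_data["total_lines"] = len(abstract_line_split)
                abstracts.append(line_data)
        else:
            # Sentence line: add it to the current abstract
            abstract_lines += line

    return abstracts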

In [None]:
abstracts = preprocess_text(filename=filenames[0])
abstracts[:10]
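
To make the parsing logic concrete, the sketch below applies the same steps to a small in-memory sample instead of a file. The sample lines and the helper name `parse_lines` are made up for illustration; only the `###` ID line, the tab-separated label and sentence, and the blank-line separator are taken from how `preprocess_text` reads the data.

```python
# Illustrative only: sample_lines is invented data in the format preprocess_text expects.
sample_lines = [
    "###00000001\n",
    "OBJECTIVE\tTo test a hypothetical treatment .\n",
    "METHODS\tA total of @ patients were enrolled .\n",
    "\n",
]

def parse_lines(input_lines):
    """Same logic as preprocess_text, but takes a list of lines (hypothetical helper)."""
    abstracts = []
    abstract_lines = ""
    for line in input_lines:
        if line.startswith("###"):   # ID line: reset the buffer
            abstract_lines = ""
        elif line.isspace():         # blank line: the abstract is complete
            split = abstract_lines.splitlines()
            for number, abstract_line in enumerate(split, start=1):
                target, text = abstract_line.split("\t")
                abstracts.append({
                    "target": target,
                    "text": text.lower(),
                    "line_number": number,
                    "total_lines": len(split),
                })
        else:                        # sentence line: accumulate it
            abstract_lines += line
    return abstracts

parsed = parse_lines(sample_lines)
for row in parsed:
    print(row)
```

Each sentence becomes one dictionary, which is why the result can be passed straight to `pd.DataFrame`.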

In [None]:
filenames

### Visualizing Data

In [None]:
# Refer to each split by name; the order of `filenames` is not guaranteed
train_samples = preprocess_text(filename=data_dir + "train.txt")
dev_samples = preprocess_text(filename=data_dir + "dev.txt")
test_samples = preprocess_text(filename=data_dir + "test.txt")

In [None]:
import pandas as pd

In [None]:
train_df = pd.DataFrame(train_samples)
dev_df = pd.DataFrame(dev_samples)
test_df = pd.DataFrame(test_samples)

In [None]:
train_df

In [None]:
train_df["target"].value_counts()

In [None]:
train_df["total_lines"].plot.hist();

In [None]:
len(train_df["text"]), len(dev_df["text"]), len(test_df["text"])

### `OneHotEncoder`

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
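# `sparse_output` replaces the `sparse` argument in scikit-learn 1.2+; on older versions use sparse=False
one_hot_encoder = OneHotEncoder(sparse_output=False)
# Fit on the training labels only, then reuse the fitted encoder for the other splits
train_hot_encoder = one_hot_encoder.fit_transform(train_df["target"].to_numpy().reshape(-1,1))
dev_hot_encoder = one_hot_encoder.transform(dev_df["target"].to_numpy().reshape(-1,1))
test_hot_encoder = one_hot_encoder.transform(test_df["target"].to_numpy().reshape(-1,1))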

In [None]:
train_hot_encoder

### `LabelEncoder`

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
label_encoder = LabelEncoder()
# LabelEncoder expects a 1-D array, so no reshape is needed.
# Fit on the training labels only, then reuse the fitted encoder for the other splits.
train_label_encoder = label_encoder.fit_transform(train_df["target"].to_numpy())
dev_label_encoder = label_encoder.transform(dev_df["target"].to_numpy())
test_label_encoder = label_encoder.transform(test_df["target"].to_numpy())

In [None]:
len(label_encoder.classes_), label_encoder.classes_
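
As a rough picture of what the two encoders produce, here is a plain-Python sketch (not scikit-learn's implementation). It assumes the classes are ordered alphabetically, which matches how `label_encoder.classes_` is sorted; the label list is a made-up sample.

```python
# Invented sample of targets, in the same style as train_df["target"]
labels = ["METHODS", "RESULTS", "BACKGROUND", "METHODS"]

# Sorted unique classes, and a lookup from class name to integer index
classes = sorted(set(labels))
class_to_index = {name: index for index, name in enumerate(classes)}

# Label encoding: one integer per sample
label_encoded = [class_to_index[name] for name in labels]

# One-hot encoding: one row per sample with a single 1.0 in the class column
one_hot_encoded = [
    [1.0 if column == class_to_index[name] else 0.0 for column in range(len(classes))]
    for name in labels
]

print(classes)
print(label_encoded)
print(one_hot_encoded)
```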

### Splitting Data

In [None]:
train_sentences = train_df["text"]
# Note: despite its name, this variable holds the *training* labels
test_sentences_label_encoder = train_label_encoder
val_sentences = dev_df["text"]
val_sentences_label_encoder = label_encoder.transform(dev_df["target"].to_numpy())

In [None]:
train_sentences = train_sentences.to_numpy()
# Keep the labels 1-D, which is the shape scikit-learn estimators expect for `y`
test_sentences_label_encoder = test_sentences_label_encoder.reshape(-1)

In [None]:
test_sentences_label_encoder

### `model_0`: Baseline

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

In [None]:
model_0 = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("multinomialNB", MultinomialNB())
])

In [None]:
model_0.fit(train_sentences, test_sentences_label_encoder)

### Evaluation function

In [None]:
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

In [None]:
def calculate_results(y_true, y_pred):
    """
    Returns a dictionary with the accuracy score (as a percentage) and the
    weighted precision, recall and F1 score (each between 0 and 1).
    """
    precision, recall, f1_score, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
    acc_score = accuracy_score(y_true, y_pred) * 100
    
    results = {
        "accuracy_score": acc_score,
        "precision": precision,
        "recall": recall,
        "f1_score": f1_score
    }
    return results

In [None]:
baseline_results = calculate_results(val_sentences_label_encoder, model_0.predict(val_sentences))
baseline_results
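
With `average="weighted"`, each metric is computed per class and then averaged, with every class weighted by its number of true samples (its support). The following is a minimal plain-Python sketch of that idea for precision only; the helper `weighted_precision` and the label lists are invented for illustration and this is not scikit-learn's implementation.

```python
from collections import Counter

def weighted_precision(y_true, y_pred):
    """Support-weighted average of per-class precision (hypothetical helper)."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for cls, count in support.items():
        predicted = sum(1 for p in y_pred if p == cls)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == cls)
        # A class that is never predicted contributes a precision of 0
        precision = correct / predicted if predicted else 0.0
        score += precision * count / total
    return score

# Invented labels: three classes with supports 2, 3 and 1
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]
print(weighted_precision(y_true, y_pred))
```

Here the per-class precisions are 1, 2/3 and 1/2, so the weighted result is (2·1 + 3·2/3 + 1·1/2) / 6 = 0.75.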

### Text vectorization (tokenization)

In [None]:
from tensorflow.keras.layers import TextVectorization

In [42]:
avg_sentence_length = round(sum([len(sentence.split()) for sentence in train_sentences]) / len(train_sentences))
avg_sentence_length

26

In [43]:
text_vectorization = TextVectorization(max_tokens=10000,
                                      output_sequence_length=avg_sentence_length)

In [44]:
text_vectorization.adapt(train_sentences)

In [45]:
import random

In [51]:
random_sentence = random.choice(train_sentences)
print(random_sentence)
print(f"After tokenization {text_vectorization(random_sentence)}")

patients were randomly divided into two and @ patients were allocated into each group .
After tokenization [ 12   9  92 471 143  51   3  12   9 379 143 122  13   0   0   0   0   0
   0   0   0   0   0   0   0   0]


In [52]:
print(f"5 most common words: {text_vectorization.get_vocabulary()[:5]}")
print(f"5 least common words: {text_vectorization.get_vocabulary()[-5:]}")

5 most common words: ['', '[UNK]', 'the', 'and', 'of']
5 least common words: ['ethnically', 'ethambutol', 'ert', 'epicardial', 'ephedrine']


### Embedding

In [53]:
from tensorflow.keras.layers import Embedding

In [60]:
embedding = Embedding(input_dim=len(text_vectorization.get_vocabulary()), # 10000 set earlier
                     output_dim=128,
                     input_length=avg_sentence_length) # 26 words

In [58]:
random_sentence = random.choice(train_sentences)
print(f"Sentence before embedding: {random_sentence}")
embedding(text_vectorization([random_sentence]))

Sentence before embedding: serum hbv dna , hbeag status , liver biochemistry and safety were monitored at baseline and week @ , @ , @ and @ .


<tf.Tensor: shape=(1, 26, 128), dtype=float32, numpy=
array([[[ 0.02919802,  0.03124345,  0.00171049, ..., -0.00471847,
         -0.01143587, -0.0305656 ],
        [ 0.01416546, -0.00996596,  0.04400421, ...,  0.03002257,
         -0.03582291, -0.02971776],
        [ 0.02935305, -0.03412697, -0.0179699 , ...,  0.02308455,
         -0.00593557, -0.0303174 ],
        ...,
        [ 0.03361862, -0.00581356,  0.01430945, ..., -0.02459285,
         -0.03040624, -0.0107484 ],
        [ 0.03361862, -0.00581356,  0.01430945, ..., -0.02459285,
         -0.03040624, -0.0107484 ],
        [ 0.03361862, -0.00581356,  0.01430945, ..., -0.02459285,
         -0.03040624, -0.0107484 ]]], dtype=float32)>

**Note:** The embedded sentence has shape (1, 26, 128): one sentence in the batch, 26 token positions (each sentence is padded or truncated to 26 tokens), and a 128-dimensional vector for each token.
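
The fixed number of token positions comes from `output_sequence_length`: shorter sentences are padded with `0` at the end and longer ones are cut off. A plain-Python sketch of that behaviour follows; the helper `pad_or_truncate` is hypothetical and not part of Keras.

```python
def pad_or_truncate(token_ids, length, pad_value=0):
    """Return exactly `length` token ids: cut extra tokens, pad the end with pad_value."""
    trimmed = token_ids[:length]
    return trimmed + [pad_value] * (length - len(trimmed))

# A short sequence is padded with zeros at the end
short_sequence = pad_or_truncate([12, 9, 92], length=5)

# A 30-token sequence is cut down to the first 26 tokens
long_sequence = pad_or_truncate(list(range(1, 31)), length=26)

print(short_sequence)
print(len(long_sequence))
```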