<a href="https://colab.research.google.com/github/hmblackwood/Medical_Abstract_Project/blob/main/medical_abstract_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🩺 Medical Abstract Project 🩺

# Summary:
Replicating the deep learning model behind the white paper PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts.
Created an RNN, multi-__ classifier model with ___.
Libraries and Project used: PyTorch, SkimLit

## 1. Problem
The number of Randomized Controlled Trials (RCTs) increases every year. Reading completely through even a small portion of them might not be the best use of a medical professional's time. I will create a model that will summarize the abstract, ___, ___ and ___ to help them skim through large bodies of information.

## 2. Data
The dataset I'll be working with is provided in PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts at https://arxiv.org/abs/1710.06071

## 3. Machine Learning Model
I will be replicating the deep learning model behind the paper, PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts. This paper presented a new dataset called PubMed 200k RCT which consists of ~200,000 labelled Randomized Controlled Trial (RCT) abstracts.
https://arxiv.org/abs/1710.06071

In this paper, the authors state that they use the machine learning model described in Neural Networks for Joint Sentence Classification in Medical Paper Abstracts. I will be replicating this model.
https://arxiv.org/abs/1612.05251

From the white paper:
"The fourth baseline (bi-ANN) is an ANN consisting of three components: a token embedding layer (bi-LSTM), a sentence label prediction layer (bi-LSTM), and a label sequence optimization layer (CRF). The architecture is described in (Dernoncourt et al., 2016) and has been demonstrated to yield state-of-the-art results for sequential sentence classification."


## 4. Evaluation

## 5. Features

I will perform the following steps:
1. Download the text dataset (PubMed 200K RCT)
2. Code a preprocessing function for my text data.
3. Set up multiple modeling experiments with different levels of token embeddings.
  - Make a baseline (TF-IDF classifier)
  - Deep models with various combinations of: token embeddings, character embeddings, pretrained embeddings, positional embeddings.
4. Build a multimodal model to take in different sources of data.
5. Find the most wrong prediction examples ---- purpose?
6. Make predictions on PubMed abstracts in the wild.
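
For step 5, one common reading of "most wrong" is the set of examples the model misclassified with the highest confidence, which tends to surface mislabelled data or systematic model errors. The sketch below assumes that reading. The `predictions` records, their field names, and the probabilities are made up for illustration and are not results from this project.

```python
# Hypothetical prediction records: field names and values are illustrative only.
predictions = [
    {"text": "sentence a", "target": "METHODS", "pred": "METHODS", "pred_prob": 0.91},
    {"text": "sentence b", "target": "RESULTS", "pred": "METHODS", "pred_prob": 0.88},
    {"text": "sentence c", "target": "BACKGROUND", "pred": "OBJECTIVE", "pred_prob": 0.97},
    {"text": "sentence d", "target": "CONCLUSIONS", "pred": "RESULTS", "pred_prob": 0.52},
]

def most_wrong(records, top_n=2):
    """Return the top_n misclassified records, most confident first."""
    wrong = [r for r in records if r["pred"] != r["target"]]
    return sorted(wrong, key=lambda r: r["pred_prob"], reverse=True)[:top_n]

for record in most_wrong(predictions):
    print(record["target"], "->", record["pred"], record["pred_prob"], record["text"])
```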


## Confirm Access to a GPU

In [None]:
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-669a86d3-5a12-6760-bc7b-68861e6cf552)


## Download the Dataset
 Dataset can be found on the author's GitHub:
 https://github.com/Franck-Dernoncourt/pubmed-rct

In [None]:
!git clone https://github.com/Franck-Dernoncourt/pubmed-rct
!ls pubmed-rct

Cloning into 'pubmed-rct'...
remote: Enumerating objects: 39, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 39 (delta 8), reused 5 (delta 5), pack-reused 25 (from 1)
Receiving objects: 100% (39/39), 177.08 MiB | 38.17 MiB/s, done.
Resolving deltas: 100% (15/15), done.
Updating files: 100% (13/13), done.
PubMed_200k_RCT				       PubMed_20k_RCT_numbers_replaced_with_at_sign
PubMed_200k_RCT_numbers_replaced_with_at_sign  README.md
PubMed_20k_RCT


In [None]:
# Look at what files are in the dataset
!ls pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/

dev.txt  test.txt  train.txt


✅ The dataset has a test and train set already. Note that the validation set is called "dev."

In [None]:
# I'll begin experiments using the 20k dataset with the numbers replaced by the @ sign.
data_dir = "/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/"

In [None]:
# Check all of the filenames in the target directory
import os
filenames = [data_dir + filename for filename in os.listdir(data_dir)]
filenames

['/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/test.txt',
 '/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/dev.txt',
 '/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/train.txt']

# Preprocess Data
I will write a function to read in all the lines of the target text file.

In [None]:
# Create function to read the lines of a document
def get_lines(filename):
  """
  Reads the filename (text filename) and returns the lines of text as a list.

  Args:
    filename: target file path in the form of a string.

  Returns:
    A list of strings with one string per line.
  """
  with open(filename, "r") as f:
    return f.readlines()
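
A quick way to check `get_lines` without the dataset is to run it on a small temporary file. This sketch repeats the function so it runs on its own. The sample content is made up and only mimics the dataset's layout.

```python
import os
import tempfile

def get_lines(filename):
  """Same behavior as the function above: one string per line, newlines kept."""
  with open(filename, "r") as f:
    return f.readlines()

# Made-up sample content in the dataset's layout (ID line, labelled lines, blank line).
sample = "###1\nOBJECTIVE\tFirst sentence .\nMETHODS\tSecond sentence .\n\n"

# Write the sample to a temporary file, read it back, then clean up.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
  tmp.write(sample)
  tmp_path = tmp.name

lines = get_lines(tmp_path)
os.remove(tmp_path)
print(lines)
```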


In [None]:
# Read in the training lines
train_lines = get_lines(data_dir+"train.txt") # Read the lines with the training file.
train_lines[:27]

['###24293578\n',
 'OBJECTIVE\tTo investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( OA ) .\n',
 'METHODS\tA total of @ patients with primary knee OA were randomized @:@ ; @ received @ mg/day of prednisolone and @ received placebo for @ weeks .\n',
 'METHODS\tOutcome measures included pain reduction and improvement in function scores and systemic inflammation markers .\n',
 'METHODS\tPain was assessed using the visual analog pain scale ( @-@ mm ) .\n',
 'METHODS\tSecondary outcome measures included the Western Ontario and McMaster Universities Osteoarthritis Index scores , patient global assessment ( PGA ) of the severity of knee OA , and @-min walk distance ( @MWD ) .\n',
 'METHODS\tSerum levels of interleukin @ ( IL-@ ) , IL-@ , tumor necrosis factor ( TNF ) - , and 

✅ \t means tab in the data.
I need a function to separate labels from the text and make it easier for our model to take in.
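
As a minimal illustration of that separation, a single line can be split on its first tab. The line below is a shortened version of one of the training lines shown above.

```python
# A line shaped like the training data (shortened from the output above).
line = "METHODS\tPain was assessed using the visual analog pain scale .\n"

# partition splits on the first tab only, so any later tabs would stay in the text.
label, _, text = line.rstrip("\n").partition("\t")
print(label)
print(text)
```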

In [None]:
len(train_lines)

210040
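
This count covers every line in the file, so it includes abstract ID lines and blank separator lines as well as labelled sentences. A rough sketch of how the count could be broken down, shown here on a few made-up lines rather than on `train_lines`:

```python
# Made-up lines in the dataset's layout; swap in train_lines to count the real file.
sample_lines = [
    "###1\n",
    "OBJECTIVE\tFirst sentence .\n",
    "METHODS\tSecond sentence .\n",
    "\n",
    "###2\n",
    "RESULTS\tThird sentence .\n",
    "\n",
]

n_ids = sum(line.startswith("###") for line in sample_lines)  # One ID line per abstract
n_blank = sum(line.isspace() for line in sample_lines)         # Blank lines that end an abstract
n_sentences = len(sample_lines) - n_ids - n_blank              # Labelled sentences
print(n_ids, n_blank, n_sentences)
```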

# Preprocess the Data
I have over 200,000 lines. Now I need to decide how I would like to format the data. I need to separate the results, conclusions etc. I'll write a function that creates a list of dictionaries to format the data. This will allow me to make functions later to take in the data structure and manipulate it in ways I want.

I'll write a function to:

1) Take the target file of abstract examples

2) Read the lines in the target file

3) For each line of the file:
  - If the line begins with ###, mark it as an abstract ID and the beginning of a new abstract.
  - If the line is a blank line (only \n), mark it as the end of the abstract example.
    - Count the total number of lines in the sample.
  - Otherwise, the line is a labelled sentence in the current abstract:
    - Record the text before the \t as the label of the line.
    - Record the text after the \t as the text of the line.
    - Record the position of the line within the sample.
  
4) Return all of the lines in the target text file as a list of dictionaries containing these key/value pairs:
    - "line_number" - the position of the line in the abstract (e.g., 3)
    - "target" - the role of the line in the abstract (e.g., OBJECTIVE)
    - "text" - the text of the line in the abstract
    - "total_lines" - the total of the lines in an abstract sample (e.g., 14)

5) Abstract IDs and new lines should be omitted from the returned, preprocessed data.



Example of a single sample (a single line from the text)

```
[{"line_number":0,
"target": "OBJECTIVE",
"text": "to investigate the efficacy of @ weeks of daily, low-dose...",
"total_lines":11},
...]
```

In [None]:
def preprocess_text_with_line_number(filename):
  """
  Returns a list of dictionaries of abstract line data.

  Takes in a filename, reads its contents and sorts through each line, extracting the target label, the text of the sentence, how many sentences are in the abstract and the sentence number.
  """
  input_lines = get_lines(filename)  # Get all lines from filename
  abstract_lines = ""  # Create an empty abstract
  abstract_samples = []  # Create an empty list to hold one dictionary per abstract line

  # Loop through each line in the target file.
  for line in input_lines:
    if line.startswith("###"):  # Check to see if this is an ID line (will give True/False output)
      abstract_id = line
      abstract_lines = ""  # Reset the abstract string if the line is an ID line. It will save everything up to the next ###, then resets to make room for the next batch.
    elif line.isspace():  # Check if line is a new line. If it is, split the abstract into separate lines.
      abstract_line_split = abstract_lines.splitlines()
      # Iterate through each line in a single abstract and count them at the same time. "abstract_line" is the same as a sentence.
      for abstract_line_number, abstract_line in enumerate(abstract_line_split):
        line_data = {}  # Create empty dictionary for each line.
        target_text_split = abstract_line.split("\t")  # Split the target label from the text, using "\t" as the divider.
        line_data["target"] = target_text_split[0]  # Get the target label.
        line_data["text"] = target_text_split[1].lower()  # Get the target text and lowercase it.
        line_data["line_number"] = abstract_line_number  # What number line is this line in the abstract?
        line_data["total_lines"] = len(abstract_line_split) - 1  # How many total lines are in the abstract? The -1 makes the count zero-based.
        abstract_samples.append(line_data)  # Add line data to the abstract sample list.
    # If it's not a new line and not an ID line (start with ###), the line contains a labelled sentence and is also part of the same abstract.
    else:
      abstract_lines += line

  return abstract_samples
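
To sanity-check the parsing rules without the dataset on disk, the sketch below applies the same logic to a list of lines held in memory. `preprocess_lines` is a hypothetical variant written for this check, not the function used in the rest of the notebook, and the sample lines are made up.

```python
def preprocess_lines(input_lines):
  """Hypothetical in-memory variant: same parsing rules, but takes a list of lines."""
  abstract_samples = []
  abstract_lines = []  # Labelled lines of the abstract currently being read
  for line in input_lines:
    if line.startswith("###"):  # ID line: start a new abstract
      abstract_lines = []
    elif line.isspace():  # Blank line: the abstract is complete
      for line_number, abstract_line in enumerate(abstract_lines):
        target, text = abstract_line.rstrip("\n").split("\t", 1)
        abstract_samples.append({
            "target": target,
            "text": text.lower(),
            "line_number": line_number,
            "total_lines": len(abstract_lines) - 1,  # Zero-based, as above
        })
    else:  # Labelled sentence belonging to the current abstract
      abstract_lines.append(line)
  return abstract_samples

# Made-up lines in the dataset's layout: two abstracts, three sentences in total.
sample_lines = [
    "###1\n",
    "OBJECTIVE\tFirst sentence .\n",
    "METHODS\tSecond sentence .\n",
    "\n",
    "###2\n",
    "RESULTS\tThird sentence .\n",
    "\n",
]

samples = preprocess_lines(sample_lines)
for sample in samples:
  print(sample)
```

ID lines and blank lines do not appear in the result, and `line_number` restarts at 0 for each abstract.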