<a href="https://colab.research.google.com/github/nicolasvazquez95/Aprendiendo_DeepLearning/blob/main/projects/02_SkimLit_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SkimLit

The purpose of this notebook is to build an NLP model to make reading medical abstracts easier.

The paper we're replicating (the source of the dataset that we'll be using) is available on [arXiv](https://arxiv.org/abs/1710.06071).

The model architecture the authors use is described here: https://arxiv.org/abs/1612.05251


## Confirm access to a GPU

In [2]:
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-d6f2ca91-df58-6eb8-91eb-8ed278a84869)


## Get data
Let's download the dataset they used. It's freely available on GitHub.

In [3]:
!git clone https://github.com/Franck-Dernoncourt/pubmed-rct

Cloning into 'pubmed-rct'...
remote: Enumerating objects: 33, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 33 (delta 5), reused 5 (delta 5), pack-reused 25
Unpacking objects: 100% (33/33), done.


In [5]:
!tail pubmed-rct/PubMed_20k_RCT/test.txt

METHODS	Both were assessed at baseline , 3 weeks , and 6 weeks .
METHODS	Treatment satisfaction was assessed at week 6 .
METHODS	Adverse effects were also monitored .
RESULTS	There was a statistically significant within-group improvement in VISA-A score for both groups ( standard , P = .03 ; do as tolerated , P < .001 ) and VAS pain for the do-as-tolerated group ( P = .001 ) at week 6 , based on the intention-to-treat analysis .
RESULTS	There was a statistically significant between-group difference in VISA-A scores at week 3 , based on both the intention-to-treat ( P = .004 ) and per-protocol analyses ( P = .007 ) , partly due to a within-group deterioration at week 3 in the standard group .
RESULTS	There were no statistically significant between-group differences for VISA-A and VAS pain scores at week 6 , the completion of the intervention .
RESULTS	There was no significant association between satisfaction and treatment groups at week 6 .
RESULTS	No adverse effects were reported .
CON

We'll start the experiments using the small dataset (20k) with the numbers replaced with `'@'`.
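
To give a feel for what the `'@'` variant looks like, here is a minimal sketch that replaces numeric tokens in a sentence with `'@'`. The helper name `replace_numbers` and the regular expression are assumptions made for illustration; the exact rule used by the dataset authors may differ.

```python
import re

# Hypothetical pattern: integers and decimals such as "6", "12.5" or ".03"
NUMBER_PATTERN = re.compile(r"\d*\.?\d+")

def replace_numbers(sentence, placeholder="@"):
  """Replace every numeric token in `sentence` with `placeholder`."""
  return NUMBER_PATTERN.sub(placeholder, sentence)

example = "Treatment satisfaction was assessed at week 6 ."
print(replace_numbers(example))
```

Collapsing every number into one token keeps the vocabulary small, at the cost of discarding the numeric values themselves.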

In [6]:
import os

data_dir = 'pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/'
filenames = [data_dir + filename for filename in os.listdir(data_dir)]

## Preprocess data

In [9]:
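def get_text(filename):
  """Read a text file and return its lines as a list of strings."""
  with open(filename, 'r') as f:
    return f.readlines()

# Load each split by name: os.listdir returns entries in arbitrary order,
# so picking the splits by position in `filenames` is fragile.
test = get_text(data_dir + 'test.txt')
train = get_text(data_dir + 'train.txt')
dev = get_text(data_dir + 'dev.txt')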

__How can we represent this data?__

We can use a dictionary format with the keys `n_line`, `target`, `text`, and `article_id`. Our samples will be stored as a list of dictionaries, one for each sentence line in the file.
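
As a small illustration of this representation, the sketch below parses a made-up abstract held in memory. The lines in `toy_lines` and the helper `parse_toy` are hypothetical and exist only for this example; the dictionary layout follows the samples built in this notebook.

```python
# Hypothetical raw lines: an ID header, tab-separated sentences, a blank separator
toy_lines = [
  "###00000001\n",
  "OBJECTIVE\tTo test a toy example .\n",
  "METHODS\tWe parsed @ lines .\n",
  "\n",
]

def parse_toy(lines):
  """Return one dictionary per sentence line."""
  samples, article_id = [], None
  for n_line, line in enumerate(lines):
    line = line.rstrip("\n")
    if line.startswith("###"):
      # Header line: remember the ID for the sentences that follow
      article_id = line[3:]
    elif line:
      # Sentence line: label and text are separated by the first tab
      target, text = line.split("\t", 1)
      samples.append({"n_line": n_line, "target": target,
                      "text": text, "article_id": article_id})
  return samples

samples = parse_toy(toy_lines)
for sample in samples:
  print(sample)
```

A list of flat dictionaries like this converts directly into tabular form, with one row per sentence.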

In [29]:
def get_list_samples(file):
  r"""Convert a file loaded in memory (i.e. a list of strings) into a list of samples.

  Each abstract starts with a header line '###<article id>', followed by one
  line per sentence in the format SECTION\tSENTENCE\n. Blank lines separate abstracts.
  Returns a list of dictionaries, one per sentence line.
  """
  list_of_samples = []
  article_id = None
  for n_line, line in enumerate(file):
    if line.startswith('###'):
      # Abstract header: keep the ID. Note that if the line ends with a single
      # '\n', the [3:-2] slice also drops the last character of the ID;
      # line[3:].strip() would keep the full ID.
      article_id = line[3:-2]
    elif not line.strip():
      continue  # blank line between abstracts
    else:
      # Split 'SECTION\tSENTENCE\n' at the first tab
      target, text = line.rstrip('\n').split('\t', 1)
      list_of_samples.append({'n_line': n_line, 'target': target,
                              'text': text, 'article_id': article_id})
  return list_of_samples

In [30]:
train_list = get_list_samples(train)
test_list = get_list_samples(test)
dev_list = get_list_samples(dev)

In [32]:
test_list[:2]

[{'article_id': '2484596',
  'n_line': 1,
  'target': 'BACKGROUND',
  'text': 'This study analyzed liver function abnormalities in heart failure patients admitted with severe acute decompensated heart failure ( ADHF ) .'},
 {'article_id': '2484596',
  'n_line': 2,
  'target': 'RESULTS',
  'text': 'A post hoc analysis was conducted with the use of data from the Evaluation Study of Congestive Heart Failure and Pulmonary Artery Catheterization Effectiveness ( ESCAPE ) .'}]

With the samples in this format, we can turn the data into a DataFrame.

In [48]:
import pandas as pd
train = pd.DataFrame(train_list)
test = pd.DataFrame(test_list)
dev = pd.DataFrame(dev_list)

In [35]:
train.head()

|   | n_line | target | text | article_id |
|---|---|---|---|---|
| 0 | 1 | OBJECTIVE | To investigate the efficacy of @ weeks of dail... | 2429357 |
| 1 | 2 | METHODS | A total of @ patients with primary knee OA wer... | 2429357 |
| 2 | 3 | METHODS | Outcome measures included pain reduction and i... | 2429357 |
| 3 | 4 | METHODS | Pain was assessed using the visual analog pain... | 2429357 |
| 4 | 5 | METHODS | Secondary outcome measures included the Wester... | 2429357 |


In [36]:
# Distribution of labels in training data
train.target.value_counts()

METHODS        59353
RESULTS        57953
CONCLUSIONS    27168
BACKGROUND     21727
OBJECTIVE      13839
Name: target, dtype: int64
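
The counts above show that the classes are imbalanced. The following minimal sketch computes the relative frequencies. It re-enters the printed counts by hand into a `pd.Series` (the name `label_counts` is only for this example) so that it runs on its own; on the real data, `train.target.value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Label counts copied from the output above
label_counts = pd.Series({
  "METHODS": 59353,
  "RESULTS": 57953,
  "CONCLUSIONS": 27168,
  "BACKGROUND": 21727,
  "OBJECTIVE": 13839,
})

total = int(label_counts.sum())
proportions = (label_counts / total).round(3)

print(total)
print(proportions)
```

METHODS and RESULTS together make up roughly two thirds of the training sentences, which is worth keeping in mind when reading accuracy figures later on.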

In [38]:
X_train = train['text'].tolist()
X_test = test['text'].tolist()
X_dev = dev['text'].tolist()
print(len(X_train),len(X_test),len(X_dev))

180040 30135 30212


In [54]:
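# One-hot encode the targets
from sklearn.preprocessing import OneHotEncoder
# `sparse_output` replaces the `sparse` argument in scikit-learn >= 1.2;
# on older versions use OneHotEncoder(sparse=False) instead.
OHE = OneHotEncoder(sparse_output=False)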
OHE.fit(train.target.to_numpy().reshape(-1,1))
y_train = OHE.transform(train.target.to_numpy().reshape(-1,1))
y_test = OHE.transform(test['target'].to_numpy().reshape(-1,1))
y_dev = OHE.transform(dev['target'].to_numpy().reshape(-1,1))
for target,number in zip(train.target.tolist()[5:8],y_train[5:8]):
  print(target,number)

METHODS [0. 0. 1. 0. 0.]
RESULTS [0. 0. 0. 0. 1.]
RESULTS [0. 0. 0. 0. 1.]
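
To read the one-hot vectors, it helps to know which column belongs to which label. The sketch below fits a separate encoder on a small hypothetical label array (not the notebook's data), prints the column order from `categories_`, and recovers both integer class indices and the original label strings. It calls `.toarray()` on the default sparse output so that it does not depend on how the dense-output argument is named in the installed scikit-learn version.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical labels covering the five section names
labels = np.array(["METHODS", "RESULTS", "BACKGROUND",
                   "OBJECTIVE", "CONCLUSIONS"]).reshape(-1, 1)

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(labels).toarray()

# Columns follow the sorted order of the categories
print(encoder.categories_[0])

# Integer class index per row, and the label strings recovered from the vectors
class_indices = one_hot.argmax(axis=1)
decoded = encoder.inverse_transform(one_hot).ravel()
print(class_indices)
print(decoded)
```

Because the categories are sorted alphabetically, METHODS maps to column 2 and RESULTS to column 4, which matches the vectors printed above.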
