# **Clinic Texts Classification  for AI4EU challenge**

Text classification is an important task in many problems related to Natural Language Processing (NLP). This task can be applied to classify for example whether a text belongs to a category x or y in the case of binary classification and can classify for n categories in the case of multi-class classification. In this notebook we work on the text classification of colon cancer using the codiEsp database for this challenge. We use fine turning distilBert for text classification. Indeed, the classification is based first on the original text and finally with the text summaries. 

### **Data information**

The text preprocessing  on the codiEsp database was performed in another notebook named "**Preprocessing data for Colon cancer classification using clinical text**". In this notebook we have studied the codiEsp database in detail and extracted the information necessary to perform our task. All the details about the database and the procedure to obtain the necessary information are included. 

# **1. The preliminaries**

First of all we will prepare the working environment, give access to the Drive, install some packages, and load the data to be used for our task.


### Mount the Drive to access Data

In [1]:
# Mount the Drive to access Data

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Drive Access 

In [2]:
 # Drive Access

%cd /content/drive/MyDrive/Evida_NLP_Project/

/content/drive/MyDrive/Evida_NLP_Project


### **Ktrain librairy** 

ktrain is a lightweight wrapper for the deep learning library TensorFlow Keras (and other libraries) to help build, train, and deploy neural networks and other machine learning models. Inspired by ML framework extensions like fastai and ludwig, ktrain is designed to make deep learning and AI more accessible and easier to apply for both newcomers and experienced practitioners. With only a few lines of code, ktrain allows  to easily and quickly:

employ fast, accurate, and easy-to-use pre-canned models for text, vision, graph, and tabular data:

for exemple on text data:

Text Classification: BERT, DistilBERT, NBSVM, fastText, and other models [example notebook]
Text Regression: BERT, DistilBERT, Embedding-based linear text regression, fastText, and other models [example notebook]
Sequence Labeling (NER): Bidirectional LSTM with optional CRF layer and various embedding schemes such as pretrained BERT and fasttext word embeddings and character embeddings [example notebook]
[<a href='https://github.com/amaiya/ktrain/blob/master/README.md'>Source</a>]

In [None]:
# Install Ktrain librairy
pip install ktrain

### prepare the device to use 

In [4]:
# Libraries 
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID";
os.environ["CUDA_VISIBLE_DEVICES"] = "0";

### Importing pandas, numpy , matplotlib and load the data

In the text preprocessing we have 12 colon cancer documents in the whole codiEsp database. To perform the classification task we have randomly selected 50 non-colon cancer documents among which we will choose 12 to add with the 12 colon cancer documents to perform the classification task

In [5]:
# Libraries 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the data 

'''
In this cell we have to load the data what we are going to use.
- data : the 12 colon cancer documents and their summaries
- fifity_documents : the fifity documents no colon cancer. 
'''

# load the data from the excel file and choose the sheet 'ColonCancer_documents'.
data = pd.read_excel('document_data.xlsx','ColonCancer_documents')

fifity_document = pd.read_excel('fifity_document.xlsx')

### Data processing 

In the following cells we prepare our data for the classification task. We work on both cases at the same time, preparing the original text and the summary at the same time.  Because we will apply these two approaches to see the performance of the model on the text summary and in the original text

In [None]:
'''
Because of a few nomber of data for the colon cancer documents, 
we have to select 12 documents no colon cancer for this task 
in total we would have 24 documents both colon cancer and no colon cancer. 
'''
data_sel = fifity_document.iloc[0:12,:]  # We select only 12 no colon cancer documents among 50 for our task
data_selected = data_sel[['text','label','label_name']] # select the orginal text in each document , the label and label name for the no colon cancer documents

data_selected_summary = data_sel[['summary','label','label_name']] # select the summaries  text in each document , the label and label name for the no colon cancer documents
data_selected['label']= 1 # Change the label value for the no colon cancer documents 
data_selected_summary['label']= 1 # Change the label value for the no colon cancer documents summaries 
data_selected['label_name']= 'no_colonCancer' # change the labal name for the no colon cancer documents
data_selected_summary['label_name']= 'no_colonCancer' # change the labal name for the no colon cancer   documents summaries
colon_cancer_doc = data[['text','label','label_name']] # Select the orginal text in each document , the label and label name for  the colon cancer documents
colon_cancer_doc_summary = data[['summary','label','label_name']] # Select the summaries text in each document , the label and label name for  the colon cancer documents

In [7]:
# check the shape of the data set.
print(f'fifity_document shape: {data_selected.shape}')
print(f'colon_cancer_doc shape: {colon_cancer_doc.shape}')
# check the shape of the summaries data set.
print(f'fifity_summaries_document shape: {data_selected_summary.shape}')
print(f'colon_cancer_summaries_doc shape: {colon_cancer_doc_summary.shape}')

fifity_document shape: (12, 3)
colon_cancer_doc shape: (12, 3)
fifity_summaries_document shape: (12, 3)
colon_cancer_summaries_doc shape: (12, 3)


In [8]:
# Concatenate the colon cancer dataFrame et no colon cancer dataFrame 
all_data = pd.concat([data_selected,colon_cancer_doc])
all_data_summary = pd.concat([data_selected_summary,colon_cancer_doc_summary])
print(f'All_data  shape: {all_data.shape}') 
print(f'all_data_summary  shape: {all_data_summary.shape}') 

All_data  shape: (24, 3)
all_data_summary  shape: (24, 3)


## Importing Distilbert model distilbert-base-uncased  used 

In this cell we telecopy and import the distilBert transform which is the model we will use in our case.

The DistilBERT model was proposed in the blog post Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT, and the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% less parameters than bert-base-uncased, runs 60% faster while preserving over 95% of BERT's performances as measured on the GLUE language understanding benchmark.

The abstract from the paper is the following:

As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pretraining, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.

Tips:

- DistilBERT doesn't have :obj:`token_type_ids`, you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token :obj:`tokenizer.sep_token` (or :obj:`[SEP]`).

- DistilBERT doesn't have options to select the input positions (:obj:`position_ids` input). This could be added if necessary though, just let us know if you need this option.

<a href='https://github.com/huggingface/transformers/blob/master/docs/source/model_doc/distilbert.rst'>[Source]</a>

In [9]:
import ktrain
from ktrain import text
from sklearn.utils import shuffle

from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

# Shuffle the data 
all_data = shuffle(all_data)
all_data_summary = shuffle(all_data_summary)
np.random.seed(25)

# Retrieve text content , label and associated label_name
contents,contents_summary = all_data['text'],all_data_summary['summary']
labels,label_summary = all_data['label'],all_data_summary['label']
label_names ,label_names_summary = all_data['label_name'],all_data_summary['label_name']

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

### Transform the text dataFrame to list 

The text that the model processes must first be tokenized and then data embedded. 
To begin with, we need to transform each text into a list as a type of data structure.
We will transform the data that are in dataFrame format to list format in this cell


In [None]:

dataset  = []       # List of the text contents of all documents
datasetlabel = []   # List of the labels 
label_name = []     # List of label names
for i in all_data['text']:
  dataset.append(i)
for j in all_data['label']:
  datasetlabel.append(j)
for t in all_data['label_name']:
  label_name.append(t)

### Split the data to train and text dataset

In [None]:
from sklearn.model_selection import train_test_split

train_texts, test_texts, train_labels, test_labels = train_test_split(dataset, datasetlabel, test_size=0.25)

In [None]:
print(f'train data size {len(train_texts)}')
print(f'test data size {len(test_texts)}')

train data size 18
test data size 6


### Save the text data to excel format 

In [None]:
pd.DataFrame(test_texts).to_excel("val_dataset.xlsx")
pd.DataFrame(test_labels).to_excel("val_labels.xlsx")

### Tokenize the text

Tokenization is a common task in Natural Language Processing (NLP). It’s a fundamental step in both traditional NLP methods like Count Vectorizer and Advanced Deep Learning-based architectures like transformer.

As tokens are the building blocks of Natural Language, the most common way of processing the raw text happens at the token level.

For example, Transformer based models – the State of The Art (SOTA) Deep Learning architectures in NLP – process the raw text at the token level. Similarly, the most popular deep learning architectures for NLP like RNN, GRU, and LSTM also process the raw text at the token level.

<img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2020/05/rnn.gif'/>

<a href='https://www.analyticsvidhya.com/blog/2020/05/what-is-tokenization-nlp/'>[Source]</a>



In [None]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

In [None]:
#only for fine tunning

import torch

class colonCancerData(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = colonCancerData(train_encodings, train_labels)
test_dataset = colonCancerData(test_encodings, test_labels)

# **2. Fine tunning DistilBert model.**

Since being first developed and released in the Attention Is All You Need paper Transformers have completely redefined the field of Natural Language Processing (NLP) setting the state-of-the-art on numerous tasks such as question answering, language generation, and named-entity recognition.

The main things to keep in mind conceptually about Transformers are that they are really good at dealing with sequential data (text, speech, etc.), they act as an encoder-decoder framework where data is mapped to some representational space by the encoder before then being mapped to the output by way of the decoder, and they scale incredibly well to parallel processing hardware (GPUs).

Transformers in the field of Natural Language Processing have been trained on massive amounts of text data which allow them to understand both the syntax and semantics of a language very well. For example, the original GPT model published in Improving Language Understanding by Generative Pre-Training was trained on BooksCorpus, over 7,000 unique unpublished books. Likewise, the famous BERT model released in the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding was trained on both BooksCorpus and English Wikipedia. For readers interested in diving into the neural network architecture of a Transformer, the original paper and The Illustrated Transformer are two great resources.

The main benefit behind Transformers, and what we will take a look at throughout the rest of this blog, is that once pre-trained Transformers can be quickly fine-tuned for numerous downstream tasks and often perform really well out of the box. This is primarily due to the fact that the Transformer already understands language which allows training to focus on learning how to do question answering, language generation, named-entity recognition, or whatever other goal someone has in mind for their model.

<img src='https://assets-global.website-files.com/5fbd459f3b05914cf70496d7/60cbd3ee2cff2abf2ae008b6_finetune.png'/>

<a href='https://www.assemblyai.com/blog/fine-tuning-transformers-for-nlp/'>[Source]</a>

## Using the orriginal text

In [None]:
training_args = TrainingArguments(
    output_dir='./results_colonCancer',   # output directory
    num_train_epochs=100,                 # total number of training epochs
    per_device_train_batch_size=16,       # batch size per device during training
    per_device_eval_batch_size=64,        # batch size for evaluation
    warmup_steps=500,                     # number of warmup steps for learning rate scheduler
    weight_decay=0.01,                    # strength of weight decay
    logging_dir='./logs',                 # directory for storing logs
    logging_steps=10,
)

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
trainer = Trainer(
    model=model,                            # the instantiated  transformers model to be trained
    args=training_args,                     # training arguments, defined above
    train_dataset=train_dataset,            # training dataset
    #eval_dataset=val_dataset               # evaluation dataset
)
trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
loading configuration file https://huggingface.co/distilbert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.91b885ab15d631bf9cee9dc9d25ece0afd932f2f5130eba28f2055b2220c0333
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,

Step,Training Loss
10,0.6902
20,0.687
30,0.6752
40,0.6613
50,0.6201
60,0.5746
70,0.4217
80,0.2811
90,0.1984
100,0.0848




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=200, training_loss=0.25106786109507084, metrics={'train_runtime': 166.8078, 'train_samples_per_second': 10.791, 'train_steps_per_second': 1.199, 'total_flos': 238441317580800.0, 'train_loss': 0.25106786109507084, 'epoch': 100.0})

In [None]:
import torch.nn.functional as F
# please comment the next line if you want to test the fine tunned model
#model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
def evalua(string):
    pt_batch = tokenizer(
    [string],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
    )

    #model_custom = pipeline('sentiment-analysis',model,None,tokenizer)
    pt_outputs = model.cpu()(**pt_batch)
    pt_predictions = F.softmax(pt_outputs.logits, dim=-1)
    
    return np.argmax(pt_predictions.detach().numpy()[0])

In [None]:

test_texts = pd.read_excel('val_dataset.xlsx')
test_labels = pd.read_excel('val_labels.xlsx')
test_labels = test_labels[0]
labeles = []
print(test_texts)
for i in test_texts[0]:
    results = evalua(i)
    labeles.append(results)

   Unnamed: 0                                                  0
0           0  A 36-year-old male, with no history of interes...
1           1  19-year-old patient, with no history of intere...
2           2  We report the case of a 73-year-old patient wh...
3           3  A 46-year-old female patient hysterectomized f...
4           4  We present an 85-year-old patient with a cecal...
5           5  Se presenta un caso con patología litiásica co...


## Metrics evaluation of the fine turning distilbert 

In [None]:
from sklearn.metrics import classification_report
print(classification_report(test_labels,labeles))

              precision    recall  f1-score   support

           0       0.50      1.00      0.67         2
           1       1.00      0.50      0.67         4

    accuracy                           0.67         6
   macro avg       0.75      0.75      0.67         6
weighted avg       0.83      0.67      0.67         6



## save the model

In [None]:
model.save_pretrained("distilBert_colon_cancer_model.h5")

Configuration saved in distilBert_colon_cancer_model.h5/config.json
Model weights saved in distilBert_colon_cancer_model.h5/pytorch_model.bin


## Using summaries data 

In this part we will use the same approach only the data we use changes. in this case we use text summaries 

In [10]:
'''
The text that the model processes must first be tokenized and then data embedded. 
To begin with, we need to transform each text into a list as a type of data structure.
We will transform the data that are in dataFrame format to list format in this cell
'''
dataset  = []       # List of the text contents of all documents
datasetlabel = []   # List of the labels 
label_name = []     # List of label names
for i in all_data_summary['summary']:
  dataset.append(i)
for j in all_data_summary['label']:
  datasetlabel.append(j)
for t in all_data_summary['label_name']:
  label_name.append(t)

In [11]:
from sklearn.model_selection import train_test_split

train_texts, test_texts, train_labels, test_labels = train_test_split(dataset, datasetlabel, test_size=0.25)

In [12]:
print(f'train data size {len(train_texts)}')
print(f'test data size {len(test_texts)}')

train data size 18
test data size 6


In [13]:
pd.DataFrame(test_texts).to_excel("val_summary_dataset.xlsx")
pd.DataFrame(test_labels).to_excel("val_summary_labels.xlsx")

In [14]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

In [15]:
#only for fine tunning

import torch

class colonCancerData(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = colonCancerData(train_encodings, train_labels)
test_dataset = colonCancerData(test_encodings, test_labels)

In [16]:
training_args = TrainingArguments(
    output_dir='./results_colonCancer',   # output directory
    num_train_epochs=100,                 # total number of training epochs
    per_device_train_batch_size=16,       # batch size per device during training
    per_device_eval_batch_size=64,        # batch size for evaluation
    warmup_steps=500,                     # number of warmup steps for learning rate scheduler
    weight_decay=0.01,                    # strength of weight decay
    logging_dir='./logs',                 # directory for storing logs
    logging_steps=10,
)

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
trainer = Trainer(
    model=model,                            # the instantiated  transformers model to be trained
    args=training_args,                     # training arguments, defined above
    train_dataset=train_dataset,            # training dataset
    #eval_dataset=val_dataset               # evaluation dataset
)
trainer.train()

Downloading:   0%|          | 0.00/268M [00:00<?, ?B/s]

***** Running training *****
  Num examples = 18
  Num Epochs = 100
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 200


{'loss': 0.711, 'learning_rate': 1.0000000000000002e-06, 'epoch': 5.0}
{'loss': 0.677, 'learning_rate': 2.0000000000000003e-06, 'epoch': 10.0}
{'loss': 0.6834, 'learning_rate': 3e-06, 'epoch': 15.0}
{'loss': 0.6763, 'learning_rate': 4.000000000000001e-06, 'epoch': 20.0}
{'loss': 0.6509, 'learning_rate': 5e-06, 'epoch': 25.0}
{'loss': 0.5997, 'learning_rate': 6e-06, 'epoch': 30.0}
{'loss': 0.5047, 'learning_rate': 7.000000000000001e-06, 'epoch': 35.0}
{'loss': 0.3548, 'learning_rate': 8.000000000000001e-06, 'epoch': 40.0}
{'loss': 0.1912, 'learning_rate': 9e-06, 'epoch': 45.0}
{'loss': 0.0924, 'learning_rate': 1e-05, 'epoch': 50.0}
{'loss': 0.0485, 'learning_rate': 1.1000000000000001e-05, 'epoch': 55.0}
{'loss': 0.0263, 'learning_rate': 1.2e-05, 'epoch': 60.0}
{'loss': 0.017, 'learning_rate': 1.3000000000000001e-05, 'epoch': 65.0}
{'loss': 0.014, 'learning_rate': 1.4000000000000001e-05, 'epoch': 70.0}
{'loss': 0.009, 'learning_rate': 1.5e-05, 'epoch': 75.0}
{'loss': 0.0072, 'learning_ra



Training completed. Do not forget to share your model on huggingface.co/models =)




{'loss': 0.0039, 'learning_rate': 2e-05, 'epoch': 100.0}
{'train_runtime': 95.8935, 'train_samples_per_second': 18.771, 'train_steps_per_second': 2.086, 'train_loss': 0.2641929206997156, 'epoch': 100.0}


TrainOutput(global_step=200, training_loss=0.2641929206997156, metrics={'train_runtime': 95.8935, 'train_samples_per_second': 18.771, 'train_steps_per_second': 2.086, 'train_loss': 0.2641929206997156, 'epoch': 100.0})

In [17]:
import torch.nn.functional as F
# please comment the next line if you want to test the fine tunned model
#model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
def evalua(string):
    pt_batch = tokenizer(
    [string],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
    )

    #model_custom = pipeline('sentiment-analysis',model,None,tokenizer)
    pt_outputs = model.cpu()(**pt_batch)
    pt_predictions = F.softmax(pt_outputs.logits, dim=-1)
    
    return np.argmax(pt_predictions.detach().numpy()[0])

In [18]:
test_texts = pd.read_excel('val_summary_dataset.xlsx')
test_labels = pd.read_excel('val_summary_labels.xlsx')
test_labels = test_labels[0]
labeles = []
print(test_texts)
for i in test_texts[0]:
    results = evalua(i)
    labeles.append(results)

   Unnamed: 0                                                  0
0           0  XY 65 A\nConsultation for infravesical obstruc...
1           1  XY 54A complex lithiasis pathology, use of the...
2           2  65 year-old patient with a history of alcoholi...
3           3  Patient with history of interest, resolved pul...
4           4  XX 69A, history of CKD unaffiliated, RA 20 yea...
5           5  49 year-old men, underwent extraction of a thi...


In [19]:
from sklearn.metrics import classification_report
print(classification_report(test_labels,labeles))

              precision    recall  f1-score   support

           0       1.00      0.67      0.80         3
           1       0.75      1.00      0.86         3

    accuracy                           0.83         6
   macro avg       0.88      0.83      0.83         6
weighted avg       0.88      0.83      0.83         6



In [20]:
model.save_pretrained("distilBert_colon_cancer_model_with_summaries.h5")

Configuration saved in distilBert_colon_cancer_model_with_summaries.h5/config.json
Model weights saved in distilBert_colon_cancer_model_with_summaries.h5/pytorch_model.bin
