# ADVANCED TEXT ANALYTICS 2024/2025

## Scope of the project
Starting from a pre-trained model, the goal of the project is to attach a trained [ner](https://spacy.io/api/entityrecognizer) component to the model so that it recognizes entity labels from the medical domain. The code is based on the spaCy Python library ([documentation here](https://spacy.io/api/doc)).

To mitigate the ["catastrophic forgetting" problem](https://en.wikipedia.org/wiki/Catastrophic_interference), the trained ner component is attached to the same pre-trained model that was used to train it, so that the output can contain labels assigned by either the original ner or the newly trained ner component. Another possible solution is ["rehearsal"](https://spacy.io/api/language#rehearse), but it is not explored in this project.

<a id='step0'></a>

### STEP 0: install required libraries and check for the GPU
Uncomment the commands in the following cells to install the libraries required for running this notebook.

In [None]:
# %pip install spacy

In [None]:
# %pip install tensorflow

In [None]:
# %pip install pandas

In [None]:
# %pip install spacy-transformers

In [None]:
# !python -m spacy download en_core_web_trf

# The Jupyter kernel probably needs to be restarted after this command

In [1]:
import tensorflow as tf

# Check version of tensorflow and if GPU is available
print(tf.__version__, tf.config.list_physical_devices('GPU'))

2.18.0 []


<a id='step1'></a>

### STEP 1: prepare training set and test set

The project uses two data folders: **Annotations**, which contains the entity annotations of the articles, and **Articles**, which contains the article texts.

Place the data in the following folders:
- Store the Annotations Train data inside the `./Annotations/Train` folder;
- Store the Annotations Dev data inside the `./Annotations/Dev` folder;
- Store the Articles Train data inside the `./Articles/Train` folder;
- Store the Articles Dev data inside the `./Articles/Dev` folder.
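
Before running the cells that read the data, it can help to verify that the files are where the notebook expects them. The following is a minimal sketch using only the standard library; the file names are the ones opened later in this notebook, while the helper name `missing_data_files` is introduced here for illustration.

```python
from pathlib import Path

# File names are the ones opened later in this notebook
EXPECTED_FILES = [
    Path("Articles/Train/articles_train_platinum.txt"),
    Path("Annotations/Train/train_platinum_entities.txt"),
    Path("Articles/Dev/articles_dev.txt"),
    Path("Annotations/Dev/dev_entities.txt"),
]

def missing_data_files(root="."):
    """Return the expected data files that are not present under `root`."""
    root = Path(root)
    return [p for p in EXPECTED_FILES if not (root / p).is_file()]

missing = missing_data_files()
if missing:
    print("Missing files:", ", ".join(str(p) for p in missing))
else:
    print("All data files found.")
```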

[DocBin](https://spacy.io/api/docbin) is used to store and serialize the Doc objects. The train DocBin will be saved in the `./TrainDocBin/train.spacy` file and the dev DocBin in the `./DevDocBin/dev.spacy` file.

**Note:** skip to [step 2](#step2.0) if the train and dev sets are already available in the correct format.

In [2]:
from pathlib import Path
import os

# Absolute path of the working directory; the data folders are resolved against it
path = str(Path.cwd())

# print(path)

def read_blocks(file_path, skip_header=False):
  # Read a file and split it into blocks separated by blank lines
  with open(file_path, 'r', encoding='UTF-8') as f:
    if skip_header:
      f.readline() # Skip the first line (assumed to be a header)
    blocks = f.read().split('\n\n')
  # Remove the last block if it is empty or contains only whitespace
  if blocks and blocks[-1].strip() == '':
    blocks = blocks[:-1]
  return blocks

# Open training set
articlesTrain = read_blocks(path + '/Articles/Train/articles_train_platinum.txt')
entitiesTrain = read_blocks(path + '/Annotations/Train/train_platinum_entities.txt', skip_header=True)

# Open dev set
articlesDev = read_blocks(path + '/Articles/Dev/articles_dev.txt')
entitiesDev = read_blocks(path + '/Annotations/Dev/dev_entities.txt', skip_header=True)

In [3]:
# Done for test purpose
train = articlesTrain
test = articlesDev

In [4]:
import re

def get_article(text):
  article = re.findall(r'a\|(.*)',text)
  return article[0]

def get_title(text):
  print(text)
  title = re.findall(r't\|(.*)',text)
  return title[0]

def get_pmid(text):
  pmid = text.split('|', 1)[0]
  pmid = re.sub('\n', '', pmid)
  return pmid

def calc_article_start(text):
  title = re.findall(r't\|(.*)',text)
  return len(title[0]) + 1 # +1 because of space char added between title and abstract

# Articles for train data
train_id = [get_pmid(x) for x in train]
train_articles = [get_title(x)+' '+get_article(x) for x in train]
train_articles_start_at = [calc_article_start(x) for x in train]

# Articles for test data
dev_id = [get_pmid(x) for x in test]
dev_articles = [get_title(x)+' '+get_article(x) for x in test]
dev_articles_start_at = [calc_article_start(x) for x in test]

38068763|t|Analysis of the Efficacy of Diet and Short-Term Probiotic Intervention on Depressive Symptoms in Patients after Bariatric Surgery: A Randomized Double-Blind Placebo Controlled Pilot Study.
38068763|w|Natalia Komorniak; Mariusz Kaczmarczyk; Igor Łoniewski; Alexandra Martynova-Van Kley; Armen Nalian; Michał Wroński; Krzysztof Kaseja; Bartosz Kowalewski; Marcin Folwarski; Ewa Stachowska
38068763|j|Nutrients
38068763|y|2023
38068763|a|(1) Background: studies have shown that some patients experience mental deterioration after bariatric surgery. (2) Methods: We examined whether the use of probiotics and improved eating habits can improve the mental health of people who suffered from mood disorders after bariatric surgery. We also analyzed patients' mental states, eating habits and microbiota. (3) Results: Depressive symptoms were observed in 45% of 200 bariatric patients. After 5 weeks, we noted an improvement in patients' mental functioning (reduction in BDI and HRSD), but it was
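
The printed record shows the input format: each line is `pmid|field|value`, where `t` marks the title and `a` the abstract. The following minimal sketch shows how the title and abstract are extracted and joined. It uses a shortened, made-up record; the PMID, journal, and text are illustrative and not taken from the dataset.

```python
import re

# Hypothetical record in the same pmid|field|value layout as the dataset
record = (
    "12345678|t|Gut microbiota and depression.\n"
    "12345678|j|Example Journal\n"
    "12345678|a|Probiotics may help patients."
)

pmid = record.split("|", 1)[0]
title = re.findall(r"t\|(.*)", record)[0]
abstract = re.findall(r"a\|(.*)", record)[0]

# Title and abstract are joined with one space, so the abstract
# starts len(title) + 1 characters into the combined text
text = title + " " + abstract
abstract_start = len(title) + 1

print(pmid)
print(text[abstract_start:])
```

The value `abstract_start` corresponds to the `articleStartAt` column built in the next cells.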

In [5]:
import pandas as pd

train_df = pd.DataFrame(columns=['article', 'articleStartAt'])
train_df['pmid'] = train_id
train_df['article'] = train_articles
train_df['articleStartAt'] = train_articles_start_at

dev_df = pd.DataFrame(columns=['article'])
dev_df['pmid'] = dev_id
dev_df['article'] = dev_articles
dev_df['articleStartAt'] = dev_articles_start_at

In [6]:
train_df.head()

| | article | articleStartAt | pmid |
|---|---|---|---|
| 0 | Analysis of the Efficacy of Diet and Short-Ter... | 189 | 38068763 |
| 1 | Systematic profiling of the chicken gut microb... | 193 | 35965349 |
| 2 | Compositional and functional alterations in th... | 138 | 34870091 |
| 3 | Potential Beneficial Effects of Probiotics on ... | 92 | 28158162 |
| 4 | Depletion of acetate-producing bacteria from t... | 154 | 34172092 |


In [7]:
dev_df.head()

| | article | pmid | articleStartAt |
|---|---|---|---|
| 0 | Hypothesis of a potential BrainBiota and its r... | 36532064 | 86 |
| 1 | IgA-Biome Profiles Correlate with Clinical Par... | 37212075 | 73 |
| 2 | The association between oral and gut microbiot... | 37577447 | 90 |
| 3 | Our Mental Health Is Determined by an Intrinsi... | 38203207 | 130 |
| 4 | Abnormal composition of gut microbiota is asso... | 31530799 | 123 |


In [8]:
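def get_labels(dataframe, text):
  text = text.strip() # Remove possible \n at the start/end of the text
  rows = [line.split('\t') for line in text.split('\n')]
  labels = []
  index = 0
  for row in rows:
    # Advance to the dataframe row whose pmid matches this annotation
    while dataframe.iloc[index]['pmid'] != row[0]:
      index += 1

    # row[2] and row[3] are the start and end character offsets (the end is
    # treated as inclusive, hence the +1), row[4] is the location
    # ('title' or 'abstract') and row[6] is the entity label
    start, end, label = int(row[2]), int(row[3]) + 1, row[6]
    if row[4] == 'title':
      labels.append((start, end, label))
    elif row[4] == 'abstract':
      # Add shift due to the title length
      shift = int(dataframe.iloc[index]['articleStartAt'])
      labels.append((start + shift, end + shift, label))
  return labels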

train_labels = [get_labels(train_df, x) for x in entitiesTrain]
dev_labels = [get_labels(dev_df, x) for x in entitiesDev]

print('total train labels: ', len(train_labels), ', total test labels: ' , len(dev_labels))

total train labels:  111 , total test labels:  40
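
The offset arithmetic in `get_labels` can be checked on a toy example. This sketch assumes, as the `+ 1` in the code suggests, that the annotated end offset is inclusive and that abstract offsets are relative to the start of the abstract. The title, abstract, and annotation are made up for illustration.

```python
title = "Gut microbiota and depression."
abstract = "Probiotics may help patients."
text = title + " " + abstract

# Hypothetical annotation: (location, start, inclusive end, label)
location, start, end, label = ("abstract", 0, 9, "dietary supplement")

# Abstract offsets are shifted by the title length plus the joining space
shift = len(title) + 1 if location == "abstract" else 0

# spaCy character spans use an exclusive end, hence the extra + 1
span = (start + shift, end + 1 + shift, label)

print(text[span[0]:span[1]])
```

If the shifted offsets are correct, slicing the combined text with them returns exactly the annotated mention.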


In [None]:
train_df['labels'] = train_labels
dev_df['labels'] = dev_labels

In [10]:
train_df.head()

| | article | articleStartAt | pmid | labels |
|---|---|---|---|---|
| 0 | Analysis of the Efficacy of Diet and Short-Ter... | 189 | 38068763 | [(74, 93, DDF), (97, 105, human), (234, 242, h... |
| 1 | Systematic profiling of the chicken gut microb... | 193 | 35965349 | [(28, 50, microbiome), (59, 82, dietary supple... |
| 2 | Compositional and functional alterations in th... | 138 | 34870091 | [(48, 71, microbiome), (75, 83, human), (89, 9... |
| 3 | Potential Beneficial Effects of Probiotics on ... | 92 | 28158162 | [(32, 42, dietary supplement), (46, 69, DDF), ... |
| 4 | Depletion of acetate-producing bacteria from t... | 154 | 34172092 | [(13, 39, bacteria), (49, 63, microbiome), (76... |


In [11]:
dev_df.head()

| | article | pmid | articleStartAt | labels |
|---|---|---|---|---|
| 0 | Hypothesis of a potential BrainBiota and its r... | 36532064 | 86 | [(26, 36, microbiome), (57, 84, DDF), (168, 18... |
| 1 | IgA-Biome Profiles Correlate with Clinical Par... | 37212075 | 73 | [(34, 71, DDF), (73, 92, DDF), (98, 138, DDF),... |
| 2 | The association between oral and gut microbiot... | 37577447 | 90 | [(24, 47, microbiome), (51, 64, human), (70, 8... |
| 3 | Our Mental Health Is Determined by an Intrinsi... | 38203207 | 130 | [(114, 128, microbiome), (130, 160, bacteria),... |
| 4 | Abnormal composition of gut microbiota is asso... | 31530799 | 123 | [(24, 38, microbiome), (194, 208, microbiome),... |


In [12]:
# Pair each article text with its list of (start, end, label) annotations
training_data = list(zip(train_articles, train_labels))

# print(training_data[0])

dev_data = list(zip(dev_articles, dev_labels))

# print(dev_data[0])

In [13]:
import spacy
import spacy.training

nlp = spacy.load("en_core_web_trf")



In [14]:
from spacy.tokens import DocBin

def save_to_disk(data, out_dir, filename):
  db = DocBin()
  for text, annotations in data:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations:
      # char_span returns None when the offsets do not align with token boundaries
      span = doc.char_span(int(start), int(end), label=label)
      if span is not None:
        ents.append(span)
    doc.ents = ents
    db.add(doc)

  # Create the output folder if it does not exist yet
  os.makedirs(out_dir, exist_ok=True)
  db.to_disk(os.path.join(out_dir, filename))

In [15]:
save_to_disk(training_data, os.path.join(path,'TrainDocBin'), "train.spacy")
save_to_disk(dev_data, os.path.join(path,'DevDocBin'), "dev.spacy")
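
As a sanity check, the saved files can be loaded back and the entity labels counted. The following is a minimal sketch, assuming spaCy is installed; the helper name `count_entity_labels` is introduced here for illustration, and a blank English pipeline is used only to provide a vocabulary for restoring the documents.

```python
from collections import Counter

def count_entity_labels(docbin_path):
    """Count entity labels stored in a .spacy file (illustrative helper)."""
    # Imported lazily so the function can be defined without spaCy installed
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank("en")  # only its vocab is needed to restore the docs
    doc_bin = DocBin().from_disk(docbin_path)
    counts = Counter()
    for doc in doc_bin.get_docs(nlp.vocab):
        counts.update(ent.label_ for ent in doc.ents)
    return counts

# Example call, using the paths from this notebook:
# print(count_entity_labels("./TrainDocBin/train.spacy"))
```

A label count far below the number of annotations may indicate that many spans were dropped because their offsets did not align with token boundaries.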

<a id='step2.0'></a>

### STEP 2.0: training
The second step of the project uses the training and dev data prepared above.

The documents were converted to DocBin objects and saved to disk so that they can be reused in the future.

We then train the `en_core_web_trf` model from spaCy on the training data.

<a id='step2.1'></a>

### STEP 2.1: prepare CUDA and PyTorch

If your PC is already set up correctly, then skip to [step 2.2](#step2.2).

#### Check if CUDA is available
The instruction *torch.cuda.is_available()* checks whether CUDA is available for running the training on the GPU.
If it returns false, then PyTorch, CUDA, or both are not installed correctly.
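
The check can be wrapped so that it also reports a missing PyTorch installation instead of failing on import. This is a minimal sketch; the helper name `cuda_status` is introduced here for illustration.

```python
def cuda_status():
    """Describe whether PyTorch can run on a CUDA GPU."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        # Name of the first visible CUDA device
        return "CUDA available: " + torch.cuda.get_device_name(0)
    return "PyTorch is installed, but CUDA is not available"

print(cuda_status())
```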

#### Install PyTorch
To install PyTorch, go to [this link](https://pytorch.org/get-started/locally/), select your preferences (in this case it is important to set a CUDA version as "Compute Platform" so that the code will run on the GPU) and then copy-paste the command into the following cell.

It might be necessary to restart the runtime.

After installing PyTorch, *torch.cuda.is_available()* should return true.

<a id='step2.2'></a>

### STEP 2.2: train the NER component

#### Generate config.cfg file
Generate the `base_config.cfg` configuration file, which includes all the settings and hyperparameters.
In this project the focus is on training only the ner component.
The training is optimized for accuracy over efficiency.
Then, save the filled configuration to the `config.cfg` file.
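
One way to generate the two files is with spaCy's `init config` and `init fill-config` commands. The sketch below only builds and prints the command lines. The flag values (English, `ner` only, accuracy, GPU) are assumptions based on the description above, not necessarily the exact settings used in this project, and the `run` helper is introduced here for illustration.

```python
import subprocess
import sys

# Assumed settings: English pipeline, NER only, optimized for accuracy, GPU setup
init_cmd = [
    sys.executable, "-m", "spacy", "init", "config", "base_config.cfg",
    "--lang", "en", "--pipeline", "ner", "--optimize", "accuracy", "--gpu",
]

# Fill in any remaining default values and write the final config
fill_cmd = [
    sys.executable, "-m", "spacy", "init", "fill-config",
    "base_config.cfg", "config.cfg",
]

def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)

run(init_cmd)
run(fill_cmd)
```

Calling `run(init_cmd, dry_run=False)` and then `run(fill_cmd, dry_run=False)` executes the commands. The resulting `config.cfg` can still be edited by hand, for example to change the batch size if the GPU runs out of memory.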

For this project, the training was done on a laptop with an NVIDIA GeForce RTX 4060 GPU (8 GB of VRAM).

#### Train en_core_web_trf NER component
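
Training can be started with spaCy's `train` command, pointing it at the config file and at the two DocBin files created in step 1. This is a minimal sketch: the output folder `./output` and the GPU id `0` are assumptions, and the helper name `build_train_command` is introduced here for illustration.

```python
import subprocess
import sys

def build_train_command(config="config.cfg", output_dir="./output", gpu_id=0):
    """Build the `spacy train` command line (paths follow this notebook's layout)."""
    return [
        sys.executable, "-m", "spacy", "train", config,
        "--output", output_dir,
        "--paths.train", "./TrainDocBin/train.spacy",
        "--paths.dev", "./DevDocBin/dev.spacy",
        "--gpu-id", str(gpu_id),
    ]

train_cmd = build_train_command()
print(" ".join(train_cmd))
# Uncomment to start training:
# subprocess.run(train_cmd, check=True)
```

spaCy writes the best and the last checkpoints to the `model-best` and `model-last` subfolders of the output folder.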

<a id='step3'></a>

### STEP 3: attach the trained NER component to the original model
In the following cell, the trained NER component will be integrated into the original en_core_web_trf model. This combination will allow the final model to label words using a set that includes both the original labels and the newly trained ones.
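
A possible way to combine the two components is to source the trained `ner` into the original pipeline under a different name. The sketch below is an assumption about how this can be done, not necessarily the code used in this project: the path `./output/model-best`, the component name `medical_ner`, and the helper `attach_trained_ner` are all hypothetical.

```python
def attach_trained_ner(base_model="en_core_web_trf",
                       trained_path="./output/model-best",
                       name="medical_ner"):
    """Return the base pipeline with the trained NER added (illustrative sketch)."""
    import spacy  # imported lazily; requires spaCy and both pipelines

    nlp = spacy.load(base_model)
    trained = spacy.load(trained_path)

    # If the trained NER listens to a transformer in its own pipeline,
    # give it a private copy so it keeps working after being moved
    if "transformer" in trained.pipe_names:
        trained.replace_listeners("transformer", "ner", ["model.tok2vec"])

    # Source the trained component under a new name, after the original "ner"
    nlp.add_pipe("ner", name=name, source=trained, after="ner")
    return nlp

# Example call:
# nlp = attach_trained_ner()
# doc = nlp("Probiotics improved depressive symptoms in bariatric patients.")
# print([(ent.text, ent.label_) for ent in doc.ents])
```

The order of the two components matters: by default spaCy's NER keeps entities that are already set on the document, so the component that runs first takes precedence where the two would overlap.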