# ADVANCED TEXT ANALYTICS 2024/2025

## Scope of the project
Starting from a pre-trained model, the goal of the project is to attach a trained [ner](https://spacy.io/api/entityrecognizer) component to the model such that it will recognize labels coming from the medical field. The code is based on the spaCy Python library ([documentation here](https://spacy.io/api/doc)).

To address the ["catastrophic forgetting" problem](https://en.wikipedia.org/wiki/Catastrophic_interference), the trained ner component is attached to the same pre-trained model that was used to train it, so that the output can contain labels assigned by either the original ner or the trained ner component. Another possible solution is [rehearsal](https://spacy.io/api/language#rehearse), which is not explored in this project.

<a id='step0'></a>

### STEP 0: install required libraries and check for the GPU
Uncomment the lines below to install the libraries required to run this notebook.

In [None]:
# %pip install spacy

In [None]:
# %pip install tensorflow

In [None]:
# %pip install pandas

In [None]:
# %pip install spacy-transformers

In [None]:
# !python -m spacy download en_core_web_trf

# Restarting the Jupyter kernel is probably required after this command

### Installations for NVIDIA GPU

In [None]:
# %pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

In [None]:
# %pip install cupy-wheel

In [None]:
import tensorflow as tf

# Check version of tensorflow and if GPU is available
print(tf.__version__, tf.config.list_physical_devices('GPU'))

### Download required models

The models used in this project are:
- `en_core_web_lg`: English pipeline optimized for CPU. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer. Docs: [en_core_web_lg](https://spacy.io/models/en#en_core_web_lg)
- `en_core_web_trf`: English transformer pipeline (`Transformer(name='roberta-base', piece_encoder='byte-bpe', stride=104, type='roberta', width=768, window=144, vocab_size=50265)`). Components: transformer, tagger, parser, ner, attribute_ruler, lemmatizer. Docs: [en_core_web_trf](https://spacy.io/models/en#en_core_web_trf)

In [None]:
# !python -m spacy download en_core_web_lg

In [None]:
# !python -m spacy download en_core_web_trf

<a id='step1'></a>

### STEP 1: prepare training set and dev set

This project uses two kinds of data: the **Annotations**, which contain the entity annotations of the articles, and the **Articles**, which contain the article texts.

Store the data as follows:
- Store the Annotations Train data inside the `./Annotations/Train` folder;
- Store the Annotations Dev data inside the `./Annotations/Dev` folder;
- Store the Articles Train data inside the `./Articles/Train` folder;
- Store the Articles Dev data inside the `./Articles/Dev` folder.

[DocBin](https://spacy.io/api/docbin) is used to store and serialize the Doc objects. The train DocBin will be saved in the `./TrainDocBin/train.spacy` file and the dev DocBin in the `./DevDocBin/dev.spacy` file.

**Note:** skip to [step 2](#step2) if your train and dev sets are already correctly formatted.
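As a minimal sketch of the folder layout described above, the helper below creates the two DocBin output folders and reports which input files are missing. The function name `check_layout` is hypothetical; the file names are the ones opened by the loading cell that follows. Creating the output folders up front is a precaution, on the assumption that saving a DocBin does not create missing parent folders.

```python
from pathlib import Path

# Input files read by the notebook, relative to the project root
INPUT_FILES = [
    "Articles/Train/articles_train.txt",
    "Annotations/Train/entities_train.txt",
    "Articles/Dev/articles_dev.txt",
    "Annotations/Dev/entities_dev.txt",
]
# Folders that will receive train.spacy and dev.spacy
OUTPUT_DIRS = ["TrainDocBin", "DevDocBin"]


def check_layout(root="."):
    """Create the output folders and return the list of missing input files."""
    root = Path(root)
    for name in OUTPUT_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return [f for f in INPUT_FILES if not (root / f).is_file()]
```

Calling `check_layout()` from the project root returns an empty list when all four input files are in place.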

In [5]:
from pathlib import Path
import os

path = str(Path(os.path.abspath(os.getcwd())).absolute())

# print(path)

# Open training set
with open(path + '/Articles/Train/articles_train.txt','r', encoding='UTF-8') as articlesTrainFile:
  articlesTrain = articlesTrainFile.read().split('\n\n')
  # Remove last empty line if present
  if articlesTrain[len(articlesTrain)-1] == '\n':
    articlesTrain = articlesTrain[:len(articlesTrain)-1]

with open(path + '/Annotations/Train/entities_train.txt','r', encoding='UTF-8') as entitiesTrainFile:
  entitiesTrainFile.readline()
  entitiesTrain = entitiesTrainFile.read().split('\n\n')
  # Remove last empty line if present
  if entitiesTrain[len(entitiesTrain)-1] == '\n':
    entitiesTrain = entitiesTrain[:len(entitiesTrain)-1]

with open(path + '/Articles/Dev/articles_dev.txt','r', encoding='UTF-8') as articlesDevFile:
  articlesDev = articlesDevFile.read().split('\n\n')
  if articlesDev[len(articlesDev)-1] == '\n':
    articlesDev = articlesDev[:len(articlesDev)-1]

with open(path + '/Annotations/Dev/entities_dev.txt','r', encoding='UTF-8') as entitiesDevFile:
  entitiesDevFile.readline()
  entitiesDev = entitiesDevFile.read().split('\n\n')
  if entitiesDev[len(entitiesDev)-1] == '\n':
    entitiesDev = entitiesDev[:len(entitiesDev)-1]

In [6]:
import re

def get_article(text):
  article = re.findall(r'a\|(.*)',text)
  return article[0]

def get_title(text):
  # print(text)
  title = re.findall(r't\|(.*)',text)
  return title[0]

def get_pmid(text):
  pmid = text.split('|', 1)[0]
  pmid = re.sub('\n', '', pmid)
  return pmid

def calc_article_start(text):
  title = re.findall(r't\|(.*)',text)
  return len(title[0]) + 1 # +1 because of space char added between title and abstract

# Articles for train data
train_id = [get_pmid(x) for x in articlesTrain]
train_articles = [get_title(x)+' '+get_article(x) for x in articlesTrain]
train_articles_start_at = [calc_article_start(x) for x in articlesTrain]

# Articles for dev data
dev_id = [get_pmid(x) for x in articlesDev]
dev_articles = [get_title(x)+' '+get_article(x) for x in articlesDev]
dev_articles_start_at = [calc_article_start(x) for x in articlesDev]
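The cell above assumes that each article block holds a title line and an abstract line of the form `pmid|t|title` and `pmid|a|abstract`. The toy record below is invented for illustration (it is not taken from the dataset) and shows how the title and abstract are joined and where the abstract starts in the joined text.

```python
import re

# Toy record in the assumed "pmid|t|title" / "pmid|a|abstract" layout
record = "12345|t|Gut microbiota and mood\n12345|a|Probiotics may help."

pmid = record.split("|", 1)[0]
title = re.findall(r"t\|(.*)", record)[0]
abstract = re.findall(r"a\|(.*)", record)[0]

# Title and abstract are joined with a single space, so the abstract
# starts one character after the end of the title
text = title + " " + abstract
abstract_start = len(title) + 1

print(pmid, abstract_start)
print(text[abstract_start:])
```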

In [7]:
import pandas as pd

train_df = pd.DataFrame(columns=['article', 'articleStartsAt'])
train_df['pmid'] = train_id
train_df['article'] = train_articles
train_df['articleStartsAt'] = train_articles_start_at

dev_df = pd.DataFrame(columns=['article'])
dev_df['pmid'] = dev_id
dev_df['article'] = dev_articles
dev_df['articleStartsAt'] = dev_articles_start_at

In [None]:
train_df.head()

In [None]:
dev_df.head()

In [8]:
def get_labels(dataframe, text):
  text = text.strip() # Remove possible \n at the start/end of the text
  l = text.split("\n")
  l = [x.split('\t') for x in l]
  labels = []
  index = 0
  for i in l:
    # print(i)
    # Advance to the dataframe row whose pmid matches this annotation
    while dataframe.iloc[index]['pmid'] != i[0]:
      index += 1

    if i[4] == 'title':
      labels.append((int(i[2]), int(i[3])+1, i[6]))
    elif i[4] == 'abstract':
      # Add shift due to the title length
      labels.append((int(i[2]) + int(dataframe.iloc[index]['articleStartsAt']), int(i[3]) + 1 + int(dataframe.iloc[index]['articleStartsAt']), i[6]))
  return labels

train_labels = [get_labels(train_df, x) for x in entitiesTrain]
dev_labels = [get_labels(dev_df, x) for x in entitiesDev]

print('total train labels: ', len(train_labels), ', total test labels: ' , len(dev_labels))

total train labels:  1566 , total test labels:  40
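The offset arithmetic in `get_labels` can be checked on a toy example. In the sketch below the title, abstract and annotation row are invented; only columns 0, 2, 3, 4 and 6 are read by `get_labels`, so the other columns are filled with placeholders. The end offset in the annotation is treated as inclusive, which is why 1 is added to obtain the exclusive end expected by `doc.char_span`.

```python
title = "Gut microbiota and mood"
abstract = "Probiotics may help."
text = title + " " + abstract
shift = len(title) + 1  # same value as the articleStartsAt column

# Hypothetical annotation row; "-" marks columns that get_labels ignores
row = "12345\t-\t0\t9\tabstract\t-\tdietary supplement".split("\t")

start, end_inclusive = int(row[2]), int(row[3])
location, label = row[4], row[6]

# Abstract offsets are relative to the abstract, so shift them by the title length
offset = shift if location == "abstract" else 0
span = (start + offset, end_inclusive + 1 + offset, label)

print(span, repr(text[span[0]:span[1]]))
```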


In [9]:
train_df['labels'] = train_labels
dev_df['labels'] = dev_labels

In [None]:
train_df.head()

In [None]:
dev_df.head()

In [10]:
training_data = []
for i, j in zip(train_articles, train_labels):
  training_data.append((i, j))

# print(training_data[0])

dev_data = []
for i, j in zip(dev_articles, dev_labels):
  dev_data.append((i, j))

# print(dev_data[0])

In [2]:
import spacy
import spacy.training

nlp = spacy.load("en_core_web_lg")



<a id='step1.1'></a>

### STEP 1.1: improve Tokenizer

Sometimes a word followed by a punctuation mark (or vice versa) is not split into two distinct tokens. To fix this, a custom tokenizer is implemented by adding a punctuation character class to the default prefix and suffix lists.

In [25]:
# Customizing the tokenizer
from spacy.tokenizer import Tokenizer
import re
import string


def use_custom_tokenizer(nlp):
  my_punct = f"[{re.escape(string.punctuation)}]"
  my_punct = my_punct.replace("<", "")
  my_punct = my_punct.replace(">", "")
  html_tag_pattern = re.compile(r'<[^>]+>')

  all_prefixes_re = spacy.util.compile_prefix_regex(tuple(list(nlp.Defaults.prefixes) + [html_tag_pattern.pattern] + [my_punct]))
  infix_re = spacy.util.compile_infix_regex(nlp.Defaults.infixes)
  suffix_re = spacy.util.compile_suffix_regex(tuple(list(nlp.Defaults.suffixes) + [html_tag_pattern.pattern] + [my_punct]))
  
  return Tokenizer(nlp.vocab,
                   nlp.Defaults.tokenizer_exceptions,
                   prefix_search = all_prefixes_re.search, 
                   infix_finditer = infix_re.finditer,
                   suffix_search = suffix_re.search,
                   token_match=None)

nlp.tokenizer = use_custom_tokenizer(nlp=nlp)
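To see what the extra punctuation pattern does, the sketch below rebuilds the same character class and peels punctuation off both ends of a chunk of text. This is a simplified illustration using only the standard library, not spaCy's tokenizer algorithm (which also handles exceptions and infixes); the `peel` function is a hypothetical name.

```python
import re
import string

# Same construction as in use_custom_tokenizer: all ASCII punctuation except < and >
punct_class = f"[{re.escape(string.punctuation)}]".replace("<", "").replace(">", "")

prefix_re = re.compile("^" + punct_class)  # punctuation at the start
suffix_re = re.compile(punct_class + "$")  # punctuation at the end


def peel(chunk):
    """Split leading and trailing punctuation off a whitespace-delimited chunk."""
    prefixes, suffixes = [], []
    while chunk and prefix_re.search(chunk):
        prefixes.append(chunk[0])
        chunk = chunk[1:]
    while chunk and suffix_re.search(chunk):
        suffixes.insert(0, chunk[-1])
        chunk = chunk[:-1]
    return prefixes + ([chunk] if chunk else []) + suffixes


print(peel("(microbiota),"))
print(peel("<i>"))
```

Because `<` and `>` are removed from the class, a chunk such as `<i>` is not broken apart by this pattern.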

In [26]:
from spacy.tokens import DocBin

def save_to_disk(data, dir, filename):
  db = DocBin()
  for text, annotations in data:
    doc = nlp(text)
    sentence_tokens = []
    for sent in doc.sents:
      sentence_tokens.append([token.text for token in sent])
    print(sentence_tokens)
    ents = []
    for start, end, label in annotations:
      span = doc.char_span(int(start), int(end), label=label)
      if span is not None:
        ents.append(span)
      else:
        print("none", int(start), int(end), label)
    # print(ents)
    doc.ents = ents
    db.add(doc)

  db.to_disk(os.path.join(dir, filename))

In [27]:
save_to_disk(training_data, os.path.join(path,'TrainDocBin'), "train.spacy")
save_to_disk(dev_data, os.path.join(path,'DevDocBin'), "dev.spacy")

[['Analysis', 'of', 'the', 'Efficacy', 'of', 'Diet', 'and', 'Short', '-', 'Term', 'Probiotic', 'Intervention', 'on', 'Depressive', 'Symptoms', 'in', 'Patients', 'after', 'Bariatric', 'Surgery', ':', 'A', 'Randomized', 'Double', '-', 'Blind', 'Placebo', 'Controlled', 'Pilot', 'Study', '.'], ['(', '1', ')', 'Background', ':', 'studies', 'have', 'shown', 'that', 'some', 'patients', 'experience', 'mental', 'deterioration', 'after', 'bariatric', 'surgery', '.'], ['(', '2', ')', 'Methods', ':', 'We', 'examined', 'whether', 'the', 'use', 'of', 'probiotics', 'and', 'improved', 'eating', 'habits', 'can', 'improve', 'the', 'mental', 'health', 'of', 'people', 'who', 'suffered', 'from', 'mood', 'disorders', 'after', 'bariatric', 'surgery', '.'], ['We', 'also', 'analyzed', 'patients', "'", 'mental', 'states', ',', 'eating', 'habits', 'and', 'microbiota', '.'], ['(', '3', ')', 'Results', ':'], ['Depressive', 'symptoms', 'were', 'observed', 'in', '45', '%', 'of', '200', 'bariatric', 'patients', '.'],

<a id='step2'></a>

### STEP 2: prepare CUDA and PyTorch

If your PC is already set up correctly, then skip to [step 3](#step3).

#### Check if CUDA is available
The call `torch.cuda.is_available()` checks whether CUDA is available for running the training on the GPU.
If it returns `False`, then either PyTorch or CUDA (or both) is not installed.

#### Install PyTorch
To install PyTorch, go to [this link](https://pytorch.org/get-started/locally/), select your preferences (in this case it is important to set a CUDA version as "Compute Platform" so that the code will run on the GPU) and then copy-paste the command into the following cell.

It might be necessary to restart the runtime.

After installing PyTorch, `torch.cuda.is_available()` should return `True`.

In [None]:
# Check if CUDA is available
import torch

print('is cuda available? ', torch.cuda.is_available())
print(torch.__version__)
!nvidia-smi
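The check above raises an `ImportError` when PyTorch is missing. As a hedged alternative, the sketch below reports one of three states without raising; the function name `cuda_status` and the returned strings are made up for this illustration.

```python
import importlib.util


def cuda_status():
    """Describe the PyTorch/CUDA setup without raising if torch is absent."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch
    if torch.cuda.is_available():
        return "cuda available"
    return "torch installed, cuda not available"


print(cuda_status())
```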

<a id='step3'></a>

### STEP 3: train the NER component

#### Generate config.cfg file
Generate the `base_config.cfg` configuration file, which includes all the settings and hyperparameters.
In this project only the ner component is trained, and the training is optimized for accuracy over efficiency.
Then save the complete config to the `config.cfg` file.

For this project the training is done on a laptop with an NVIDIA GeForce 4060 GPU and 8 GB of VRAM.
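The notebook does not show how `base_config.cfg` was produced. One possible way to generate a config with the settings described above (ner only, optimized for accuracy) is spaCy's `init config` command. The sketch below only builds and prints that command; the helper name `init_config_command` is hypothetical and the output path is the one used by the `fill-config` cell in this step.

```python
import shlex


def init_config_command(output_path, use_gpu=False):
    """Build the argument list for `spacy init config` with an ner-only pipeline."""
    cmd = [
        "python", "-m", "spacy", "init", "config", output_path,
        "--lang", "en",
        "--pipeline", "ner",
        "--optimize", "accuracy",
    ]
    if use_gpu:
        cmd.append("--gpu")  # selects a transformer-based config
    return cmd


print(shlex.join(init_config_command("./config/lg/base_config.cfg")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` or pasted into a notebook cell prefixed with `!`.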

#### Train en_core_web_trf NER component

In [None]:
@spacy.registry.tokenizers("whitespace_tokenizer")
def create_whitespace_tokenizer():
    def create_tokenizer(nlp):
        return WhitespaceTokenizer(nlp.vocab)

    return create_tokenizer
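The registration cell above returns `WhitespaceTokenizer(nlp.vocab)`, but that class is not defined anywhere in this notebook. A minimal sketch of such a class, modeled on the whitespace tokenizer example in the spaCy documentation, could look like the following. The helper `split_on_spaces` is a hypothetical name introduced here so that the splitting logic can be checked without spaCy installed.

```python
def split_on_spaces(text):
    """Return (words, spaces) so that the original text can be rebuilt exactly."""
    words = text.split(" ")
    spaces = [True] * len(words)
    # Consecutive spaces produce empty strings: turn them into " " tokens
    for i, word in enumerate(words):
        if word == "":
            words[i] = " "
            spaces[i] = False
    # The last token has no trailing space
    if words[-1] == " ":
        words, spaces = words[:-1], spaces[:-1]
    else:
        spaces[-1] = False
    return words, spaces


class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # Imported here so that split_on_spaces stays usable without spaCy
        from spacy.tokens import Doc
        words, spaces = split_on_spaces(text)
        return Doc(self.vocab, words=words, spaces=spaces)


print(split_on_spaces("Gut  microbiota"))
```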

In [None]:
!python -m spacy init fill-config ./config/lg/base_config.cfg ./config/lg/config.cfg

In [None]:
# import spacy
# import spacy.training

# # Load pre-trained model over which a new training for the ner component will be done
# nlp = spacy.load("en_core_web_trf")

In [28]:
!python -m spacy train ./config/lg/config.cfg --output ./output/lg-improve-tokenizer --paths.train ./TrainDocBin/train.spacy --paths.dev ./DevDocBin/dev.spacy

ℹ Saving to output directory: output\lg-improve-tokenizer
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00    137.43    0.00    0.00    0.00    0.00
  0     200        591.75   9946.65   44.73   65.97   33.84    0.45
  0     400        726.21   6437.89   51.99   60.52   45.57    0.52
  0     600       1171.14   5845.69   61.85   72.25   54.07    0.62
  0     800        221.00   5172.51   64.83   69.10   61.06    0.65
  0    1000        248.12   4839.52   65.30   63.71   66.97    0.65
  0    1200        351.98   5001.45   65.32   77.59   56.40    0.65
  0    1400        295.28   4500.93   69.70   78.78   62.49    0.70
  1    1600        429.41   466



In [None]:
# Collect the distinct entity labels that appear in the dev set
unique_labels = set()

for example in dev_data:
  entities = example[1]
  for entity in entities:
    entity_label = entity[2]
    unique_labels.add(entity_label)

unique_labels_list = list(unique_labels)

print('Entities to be recognised in the provided data: ')
print(unique_labels_list)

### STEP 3.1: evaluate best model

In [29]:
!spacy evaluate ./output/lg-improve-tokenizer/model-best/ ./DevDocBin/dev.spacy

ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0

TOK     99.74
NER P   78.29
NER R   74.57
NER F   76.39
SPEED   6035 


                            P       R       F
DDF                     84.51   82.06   83.27
microbiome              90.55   90.55   90.55
human                   87.23   95.35   91.11
anatomical location     87.50   73.68   80.00
chemical                56.55   72.52   63.55
biomedical technique    75.00   33.33   46.15
animal                  80.00   76.71   78.32
statistical technique   33.33   66.67   44.44
bacteria                67.27   68.52   67.89
gene                    60.00   15.38   24.49
dietary supplement      71.88   85.19   77.97
food                     0.00    0.00    0.00
drug                    70.37   63.33   66.67





<a id='step3'></a>

### STEP 3: attach the trained NER component to the original model
In the following cell, the trained NER component will be integrated into the original en_core_web_trf model. This combination will allow the final model to label words using a set that includes both the original labels and the newly trained ones.

In [None]:
# Load original model
nlp = spacy.load("en_core_web_trf")
# Load trained model
trained_nlp = spacy.load('./output/model-best')

trained_nlp.replace_listeners("transformer", "ner", ["model.tok2vec"])

nlp.add_pipe(
  "ner",
  name="trained_ner",
  source=trained_nlp,
  before="ner",
)

print(nlp.pipe_names)
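Because `trained_ner` is added before the original `ner`, its spans are set first. The plain-Python sketch below illustrates the intended outcome under the assumption that the later ner leaves spans set by an earlier component untouched and only adds entities that do not overlap them. It is a simplified illustration, not spaCy's implementation, and the spans and the `merge_entities` name are invented.

```python
def merge_entities(first, second):
    """Keep every span from `first`; add spans from `second` that do not overlap them."""
    merged = list(first)
    for start, end, label in second:
        overlaps = any(start < e and s < end for s, e, _ in merged)
        if not overlaps:
            merged.append((start, end, label))
    return sorted(merged)


# Hypothetical (start, end, label) character spans from the two components
trained = [(0, 10, "dietary supplement")]
original = [(0, 10, "PRODUCT"), (20, 26, "DATE")]

print(merge_entities(trained, original))
```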

In [None]:
import spacy.displacy


doc = nlp("Doc here")
spacy.displacy.render(doc, style="ent", jupyter=True)

### STEP 4: train using the en_core_web_lg model