# ADVANCED TEXT ANALYTICS 2024/2025

## Scope of the project
The goal of the project is to train a [ner](https://spacy.io/api/entityrecognizer) component and attach it to a pre-trained model, so that the model also recognizes entity labels from the medical field. The code is based on the spaCy Python library ([documentation here](https://spacy.io/api/doc)).

To address the ["catastrophic forgetting" problem](https://en.wikipedia.org/wiki/Catastrophic_interference), the trained ner component is attached to the same pre-trained model that was used for training it, instead of replacing the original ner. The output of the final model can therefore contain labels assigned either by the original ner or by the trained ner component. Another possible solution is to ["rehearse"](https://spacy.io/api/language#rehearse) the model, but it is not explored in this project.

<a id='step0'></a>

### STEP 0: install required packages and check the GPU
Remove the comments to install the packages required for running this notebook.

<a id='step1'></a>

### STEP 1: prepare training set and test set

Store the training and development data as files on disk to load them into spaCy's training process.

[DocBin](https://spacy.io/api/docbin) is used to store and serialize the Doc objects.

Save the training data in the trainset folder and the development data in the testset folder.
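As an illustration of this step, the sketch below converts a few annotated texts into a DocBin and writes it to disk. The sample sentence, the labels `DRUG` and `SYMPTOM`, the helper names and the file name `train.spacy` are assumptions made for the example, not the data used in this project.

```python
from pathlib import Path

import spacy
from spacy.tokens import DocBin


def build_docbin(nlp, annotated_texts):
    """Convert (text, [(start, end, label), ...]) pairs into a DocBin."""
    doc_bin = DocBin()
    for text, annotations in annotated_texts:
        doc = nlp.make_doc(text)
        entities = []
        for start, end, label in annotations:
            # char_span returns None when the offsets do not match token boundaries
            span = doc.char_span(start, end, label=label)
            if span is not None:
                entities.append(span)
        doc.ents = entities
        doc_bin.add(doc)
    return doc_bin


def save_docbin(doc_bin, path):
    """Create the target folder if needed and serialize the DocBin."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    doc_bin.to_disk(path)


# A blank English pipeline is enough here: only the tokenizer is needed.
nlp = spacy.blank("en")

# Hypothetical sample data with character offsets.
sample = [
    (
        "The patient was given aspirin for a headache.",
        [(22, 29, "DRUG"), (36, 44, "SYMPTOM")],
    )
]

train_docbin = build_docbin(nlp, sample)
save_docbin(train_docbin, "trainset/train.spacy")
```

The development data can be saved the same way, for example with `save_docbin(dev_docbin, "testset/dev.spacy")`. Annotations whose offsets do not align with token boundaries are silently skipped in this sketch; in a real dataset it is worth counting them.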

Go to [step 2](#step2.0) if you already have the training and test set well formatted.

<a id='step2.0'></a>

### STEP 2.0: training
The second step of the project is to set up the training and test data and run the training.

The documents are converted to DocBin objects and saved to disk in case they are needed in the future.

Then, the en_core_web_trf model from spaCy is trained on the training data.

<a id='step2.1'></a>

### STEP 2.1: prepare CUDA and PyTorch

If your PC is already set up correctly, then skip to [step 2.2](#step2.2).

#### Check if CUDA is available
The instruction *torch.cuda.is_available()* checks if CUDA is available for running the training on the GPU.
If it returns False, PyTorch was installed without CUDA support or CUDA is not set up correctly. If the import itself fails, PyTorch is not installed.
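A minimal sketch of this check is shown below. The helper name `describe_gpu_support` is made up for the example; it also handles the case in which PyTorch is not installed at all, so that the cell does not fail with an import error.

```python
import importlib.util


def describe_gpu_support():
    """Return a short message describing whether training can run on the GPU."""
    # Look for the package first, so a missing PyTorch does not raise an error.
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"

    import torch

    if not torch.cuda.is_available():
        return "PyTorch is installed, but CUDA is not available"
    return f"CUDA is available on: {torch.cuda.get_device_name(0)}"


print(describe_gpu_support())
```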

#### Install PyTorch
To install PyTorch, go to [this link](https://pytorch.org/get-started/locally/), select your preferences (in this case it is important to set a CUDA version as "Compute Platform" so that the code will run on the GPU) and then copy-paste the command into the following cell.

It might be necessary to restart the runtime.

After installing PyTorch, *torch.cuda.is_available()* should return True.

<a id='step2.2'></a>

### STEP 2.2: train the NER component

#### Generate config.cfg file
Generate the base_config.cfg configuration file, which includes all the settings and hyperparameters.
In this project the focus is on training only the ner component.
The training is optimized for accuracy over efficiency.
Then, save the filled-in config to the config.cfg file.
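One possible way to produce these two files is the spaCy command line, as sketched below. The commands are built as argument lists so that they can be inspected before being executed; the helper names and the `RUN_COMMANDS` switch are assumptions of this example, and the exact options used in the project may differ.

```python
import subprocess
import sys


def init_config_command(base_path="base_config.cfg"):
    """Command that creates a base config with only a ner component, optimized for accuracy on GPU."""
    return [
        sys.executable, "-m", "spacy", "init", "config", base_path,
        "--lang", "en",
        "--pipeline", "ner",
        "--optimize", "accuracy",
        "--gpu",
    ]


def fill_config_command(base_path="base_config.cfg", config_path="config.cfg"):
    """Command that fills the base config with default values and saves it as config.cfg."""
    return [sys.executable, "-m", "spacy", "init", "fill-config", base_path, config_path]


# Set to True to actually run the commands (requires spaCy to be installed).
RUN_COMMANDS = False

for command in (init_config_command(), fill_config_command()):
    print(" ".join(command))
    if RUN_COMMANDS:
        subprocess.run(command, check=True)
```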

For this project the training is done on a laptop with an NVIDIA GeForce RTX 4060 GPU with 8 GB of VRAM.

#### Train en_core_web_trf NER component
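The training can be started with the `spacy train` command. The sketch below builds that command; the file names `trainset/train.spacy` and `testset/dev.spacy`, the `output` folder and the helper name are assumptions of this example and should be adapted to the files created in the previous steps.

```python
import subprocess
import sys


def train_command(
    config_path="config.cfg",
    output_dir="output",
    train_path="trainset/train.spacy",
    dev_path="testset/dev.spacy",
    gpu_id=0,
):
    """Build the spaCy training command; the paths override the ones in the config."""
    return [
        sys.executable, "-m", "spacy", "train", config_path,
        "--output", output_dir,
        "--paths.train", train_path,
        "--paths.dev", dev_path,
        "--gpu-id", str(gpu_id),
    ]


# Set to True to actually start the training (requires spaCy, the config and the data).
RUN_TRAINING = False

command = train_command()
print(" ".join(command))
if RUN_TRAINING:
    subprocess.run(command, check=True)
```

With these settings spaCy writes the trained pipelines into the chosen output folder, in the `model-best` and `model-last` subfolders.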

<a id='step3'></a>

### STEP 3: attach the trained NER component to the original model
In the following cell, the trained NER component will be integrated into the original en_core_web_trf model. This combination will allow the final model to label words using a set that includes both the original labels and the newly trained ones.
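The idea can be sketched with the `source` argument of spaCy's `add_pipe`. To keep the example self-contained, two small blank pipelines stand in for en_core_web_trf and for the trained pipeline; the component name `medical_ner`, the label `DISEASE` and the helper name are assumptions of this example. In the project, the two pipelines would instead be loaded with `spacy.load`.

```python
import spacy


def attach_trained_ner(base_nlp, trained_nlp, name="medical_ner"):
    """Copy the ner component of trained_nlp into base_nlp under a new name."""
    # A different name is required because base_nlp already has a component called "ner".
    base_nlp.add_pipe("ner", source=trained_nlp, name=name)
    return base_nlp


def make_stand_in_pipeline(label):
    """Blank pipeline with an initialized ner component (stand-in for a real model)."""
    nlp = spacy.blank("en")
    ner = nlp.add_pipe("ner")
    ner.add_label(label)
    nlp.initialize()
    return nlp


base_nlp = make_stand_in_pipeline("PERSON")      # stand-in for en_core_web_trf
trained_nlp = make_stand_in_pipeline("DISEASE")  # stand-in for the trained pipeline

combined_nlp = attach_trained_ner(base_nlp, trained_nlp)
print(combined_nlp.pipe_names)
print(combined_nlp.get_pipe("medical_ner").labels)
```

Two caveats apply to real models. When both components run, the later one keeps the entities already set by the earlier one and only labels the remaining tokens, so the position of the new component in the pipeline matters. In a transformer pipeline the trained ner may share the transformer through a listener; in that case it may need its own copy of the transformer (see `replace_listeners` in the spaCy documentation) before being attached, which is not shown here.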