# Dependencies for training with GPU (transformer)

To train the transformer-based model on a GPU, install the following dependencies:

In [None]:
# Install CuPy (adjust based on your CUDA version)
%pip install cupy-cuda12x

In [None]:
# Install PyTorch, torchvision, and torchaudio (adjust CUDA version if needed)
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
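Before training, it can help to confirm that the GPU packages are importable from the current kernel. The following is a minimal sketch rather than part of the original workflow: the helper name `check_gpu_dependencies` is hypothetical, and the check only reports whether each package can be found and, if PyTorch is present, whether PyTorch sees a CUDA device.

```python
import importlib.util


def check_gpu_dependencies(packages=("cupy", "torch", "spacy")):
    """Return {package: bool} telling whether each package can be found for import."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


status = check_gpu_dependencies()
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")

# If PyTorch is present, ask it whether a CUDA device is visible.
if status["torch"]:
    import torch

    print("CUDA available:", torch.cuda.is_available())
```

If a package is reported as missing, re-run the matching `%pip install` cell and restart the kernel before continuing.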

# Generating the Configuration File

To generate the configuration file, execute the following command:

```bash
!python -m spacy init fill-config /path/to/base_config_tagger.cfg /path/to/config.cfg
```


This command uses `spacy init fill-config` to create a complete configuration file named config.cfg. It reads the base configuration at /path/to/base_config_tagger.cfg, which holds the initial settings for training, and fills in every setting the base file leaves out with its default value.

You can download a base configuration file from the spaCy website (https://spacy.io/usage/training). After downloading it, pass its path as the first argument of the command; the second argument is the path where the filled config is written.

In [None]:
!python -m spacy init fill-config /path/to/base_config_tagger.cfg /path/to/config.cfg

✔ Auto-filled config with all values
✔ Saved config
C:\Users\mikek\projects\Text-Normalization\src\text_norm_NER\config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
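Once config.cfg exists, a quick look at a few settings can catch mistakes before a long training run. The sketch below is illustrative only: `CONFIG_EXCERPT` is a hypothetical excerpt (a real filled config has many more sections), `read_setting` is a made-up helper, and it assumes the values are written as JSON literals, as they are in this excerpt. Values that reference other settings, such as `${paths.train}`, are not JSON and would need separate handling.

```python
import configparser
import json

# Hypothetical excerpt of a filled config.cfg; a real file contains many more sections.
CONFIG_EXCERPT = """
[paths]
train = null
dev = null

[nlp]
lang = "en"
pipeline = ["transformer","ner"]
"""


def read_setting(parser, section, key):
    """Decode one literal config value, assuming it is written as JSON."""
    return json.loads(parser[section][key])


# interpolation=None leaves references such as ${paths.train} as plain text.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(CONFIG_EXCERPT)  # for a real file: parser.read("/path/to/config.cfg")

pipeline = read_setting(parser, "nlp", "pipeline")
train_path = read_setting(parser, "paths", "train")
print("Pipeline components:", pipeline)
print("Training path set in config:", train_path is not None)
```

A `null` training path is expected here, because the paths are supplied on the command line with `--paths.train` and `--paths.dev`.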




# Training the Model

To train the model, execute the following command:

```bash
!python -m spacy train /path/to/config.cfg --output /path/to/model_output_directory --paths.train /path/to/train.spacy --paths.dev /path/to/test.spacy --gpu-id 0
```

This command starts training with spaCy's `train` command. It reads the training settings from the configuration file at /path/to/config.cfg, saves the trained pipeline to /path/to/model_output_directory, and loads the training data from /path/to/train.spacy. The file passed to --paths.dev (/path/to/test.spacy here) is the development set that spaCy scores during training and uses to pick the best model. The --gpu-id 0 flag runs training on the GPU with ID 0.
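When the same training run has to be repeated with different paths, assembling the command in Python avoids retyping it. This is a minimal sketch under stated assumptions: `build_train_command` is a hypothetical helper, not a spaCy API, and it only uses the flags shown in the command above.

```python
import shlex
import sys


def build_train_command(config, output_dir, train_path, dev_path, gpu_id=-1):
    """Return the argument list for `python -m spacy train` with the flags used above.

    gpu_id=-1 asks spaCy to train on the CPU.
    """
    return [
        sys.executable, "-m", "spacy", "train", str(config),
        "--output", str(output_dir),
        "--paths.train", str(train_path),
        "--paths.dev", str(dev_path),
        "--gpu-id", str(gpu_id),
    ]


command = build_train_command(
    "/path/to/config.cfg",
    "/path/to/model_output_directory",
    "/path/to/train.spacy",
    "/path/to/test.spacy",
    gpu_id=0,
)
print(shlex.join(command))
# To launch training from Python instead of a notebook shell escape:
# import subprocess; subprocess.run(command, check=True)
```

Using `sys.executable` runs spaCy with the same interpreter as the notebook kernel, which avoids picking up a different Python installation from the system path.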

In [None]:
!python -m spacy train /path/to/config.cfg --output /path/to/model_output_directory --paths.train /path/to/train.spacy --paths.dev /path/to/test.spacy --gpu-id 0

✔ Created output directory:
C:\Users\mikek\projects\Text-Normalization\src\text_norm_NER\model
ℹ Saving to output directory:
C:\Users\mikek\projects\Text-Normalization\src\text_norm_NER\model
ℹ Using GPU: 0
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

✔ Initialized pipeline

ℹ Pipeline: ['transformer', 'ner']
ℹ Initial learn rate: 0.0
E    #       LOSS TRANS...  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  -------------  --------  ------  ------  ------  ------
  0       0        1961.45    440.74   42.12   51.10   35.82    0.42
  4     200       23959.72  23403.50   98.30   97.76   98.85    0.98
  9     400         927.97   1842.55   99.52   99.52   99.52    1.00
 14     600         379.49    682.32   99.95   99.95   99.95    1.00
 19     800         205.83    351.65   99.95   99.99   99.91    1.00
 24    1000         106.27    185.48  100.00  100.00  100.00    1.00
 29    1200          80.38    111.30   99.97   99.99   99.95    1.00
 34    1400          69.37    109.36

# Evaluating the Model

To evaluate the model, execute the following command:

```bash
!python -m spacy evaluate /path/to/trained_model_directory/model-best /path/to/eval.spacy --output /path/to/evaluation_output.json --gpu-id 0
```

This command evaluates the trained model with spaCy's `evaluate` command. It takes the directory of the trained model (/path/to/trained_model_directory/model-best) and the path to the evaluation dataset (/path/to/eval.spacy).

The evaluation results are saved as JSON to /path/to/evaluation_output.json. The --gpu-id 0 flag runs the evaluation on the GPU with ID 0.

In [None]:
!python -m spacy evaluate /path/to/trained_model_directory/model-best /path/to/eval.spacy --output /path/to/evaluation_output.json --gpu-id 0

ℹ Using GPU: 0

TOK     100.00
NER P   96.73
NER R   97.43
NER F   97.08
SPEED   357

             P       R       F
PERSON   96.73   97.43   97.08

✔ Saved results to
C:\Users\mikek\projects\Text-Normalization\src\text_norm_NER\output.json
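The F column in the evaluation output is the harmonic mean of precision (P) and recall (R). As a quick sanity check, the short sketch below recomputes it from the PERSON scores reported above; the function name `f_score` is an illustrative choice, not a spaCy function.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1); returns 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall for PERSON, copied from the evaluation output above.
ner_p, ner_r = 96.73, 97.43
ner_f = f_score(ner_p, ner_r)
print(f"NER F = {ner_f:.2f}")  # prints "NER F = 97.08"
```

Because PERSON is the only entity type in this evaluation, the per-type row and the overall NER scores are identical.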




# Packaging the Model

To package the model, execute the following command:

```bash
!python -m spacy package "/path/to/trained_model_directory/model-best" "/path/to/output_directory"
```

This command packages the trained model with spaCy's `package` command. It takes the directory of the trained model (/path/to/trained_model_directory/model-best) and the directory in which the package is created (/path/to/output_directory).

The packaged model can be installed using pip with the following command:

```bash
%pip install "/path/to/output_directory/model_name-version"
```
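The `model_name-version` placeholder follows the pattern `<lang>_<name>-<version>`, built from the fields of the pipeline's meta.json. That matches the `en_pipeline-0.0.0` directory created in this notebook. The sketch below only illustrates the naming; `package_dir_name` is a hypothetical helper, and the `meta` values are assumed to mirror what the trained model's meta.json contains.

```python
def package_dir_name(meta):
    """Build '<lang>_<name>-<version>' from the fields of a pipeline's meta.json."""
    return f"{meta['lang']}_{meta['name']}-{meta['version']}"


# Assumed meta.json fields, chosen to match the package created in this notebook.
meta = {"lang": "en", "name": "pipeline", "version": "0.0.0"}
print(package_dir_name(meta))  # prints "en_pipeline-0.0.0"
```

To get a more descriptive package name, the `--name` and `--version` options of `spacy package` can override the values from meta.json.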

In [None]:
!python -m spacy package "/path/to/trained_model_directory/model-best" "/path/to/output_directory"

ℹ Building package artifacts: sdist
✔ Including 1 package requirement(s) from meta and config
spacy-transformers>=1.3.4,<1.4.0
✔ Loaded meta.json from file
C:\Users\mikek\projects\Text-Normalization\src\text_norm_NER\model\model-best\meta.json
✔ Generated README.md from meta.json
✔ Successfully created package directory 'en_pipeline-0.0.0'
C:\Users\mikek\projects\Text-Normalization\src\text_norm_NER\output\en_pipeline-0.0.0
* Creating isolated environment: venv+pip...
* Installing packages in isolated environment:
  - setuptools >= 40.8.0
* Getting build dependencies for sdist...
running egg_info
creating en_pipeline.egg-info
writing en_pipeline.egg-info\PKG-INFO
writing dependency_links to en_pipeline.egg-info\dependency_links.txt
writing entry points to en_pipeline.egg-info\entry_points.txt
writing requirements to en_pipeline.egg-info\requires.txt
writing top-level names to en_pipeline.egg-info\top_level.txt
writing man



In [None]:
%pip install "/path/to/output_directory/model_name-version"

Processing c:\users\mikek\projects\text-normalization\src\text_norm_ner\output\en_pipeline-0.0.0
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Building wheels for collected packages: en_pipeline
  Building wheel for en_pipeline (pyproject.toml): started
  Building wheel for en_pipeline (pyproject.toml): finished with status 'done'
  Created wheel for en_pipeline: filename=en_pipeline-0.0.0-py3-none-any.whl size=434078417 sha256=1f3b4ca1202977a24e7032d823a6c66f7bf3cc2b3798df3d07859541900baf08
  Stored in directory: c:\users\mikek\appdata\local\pip\cache\wheels\45\48\ef\5285815e886b9cce76bf94cfcd4c0e01bfc0b43eef7f0be145
Successfully built en_pipeline
Installing collected packages: en_pipeline
  Attempting uni




# Performing Inference with the Model


In [1]:
import spacy

# Load the packaged pipeline that was installed with pip
nlp = spacy.load("en_pipeline")

# Replace with the text to process; an empty string yields no entities
text = ""

doc = nlp(text)

# Print every recognized entity and collect the PERSON ones
entities = []
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")
    if ent.label_ == "PERSON":
        entities.append(ent.text)

# Join the PERSON entities with "/"
if entities:
    joined_entities = "/".join(entities)
    print(f"Joined PERSON entities: {joined_entities}")

