In [1]:
import pandas as pd
import torch
from training import train_model
from inference import infer

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/mariiakyrychok/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Training

The script uses the Hugging Face `transformers` library for tokenization and model training (`BertTokenizerFast`, `BertForTokenClassification`, and `Trainer`).
The model’s performance is evaluated using the `seqeval` metric.
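
`seqeval` scores predictions at the entity level: a predicted span counts as correct only if its boundaries and its label both match the reference exactly. The sketch below is a simplified, pure-Python illustration of that idea, not the library's implementation. The helper names `extract_spans` and `entity_prf` and the toy tag sequences are made up for this example.

```python
def extract_spans(tags):
    """Return the set of (start, end, label) spans in one BIO tag sequence."""
    spans = set()
    start, label = None, None
    # A trailing "O" sentinel closes a span that runs to the end of the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i - 1, label))
            start, label = None, None
        # "B-" always opens a span; an orphan "I-" is treated as an opening too.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def entity_prf(true_seqs, pred_seqs):
    """Entity-level precision, recall and F1 over a list of sentences."""
    tp = n_pred = n_true = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        true_spans = extract_spans(true_tags)
        pred_spans = extract_spans(pred_tags)
        tp += len(true_spans & pred_spans)  # exact boundary and label match
        n_pred += len(pred_spans)
        n_true += len(true_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: the first prediction cuts a two-token mountain name short.
y_true = [["O", "B-MOUNTAIN", "I-MOUNTAIN", "O"], ["B-MOUNTAIN", "O"]]
y_pred = [["O", "B-MOUNTAIN", "O", "O"], ["B-MOUNTAIN", "O"]]
scores = entity_prf(y_true, y_pred)
print(scores)
```

In this toy case five of the six tokens are tagged correctly, yet only one of the two entities matches exactly, which is why token accuracy can sit well above entity-level F1.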

In [None]:
df = pd.read_csv('data/labeled_sentences.csv')

model, tokenizer, result = train_model(df)

### Metrics

In [9]:
for key, value in result.items():
    print(f'{key}: {value}')

eval_loss: 0.06209941580891609
eval_precision: 0.8709677419354839
eval_recall: 0.7941176470588235
eval_f1: 0.8307692307692308
eval_accuracy: 0.9809782608695652
eval_runtime: 0.2957
eval_samples_per_second: 121.755
eval_steps_per_second: 16.91
epoch: 3.0


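The model performs well overall: precision is 0.871, recall is 0.794, and F1 is 0.831. Recall is the weaker of the two, so the model is more likely to miss a mountain name (a false negative) than to tag a non-mountain span as one (a false positive). The 98.1% accuracy is computed per token and is likely inflated by the many `O` tokens, so F1 is the more informative figure here.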
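
As a quick sanity check, F1 is the harmonic mean of precision and recall, so the reported `eval_f1` can be recomputed from the other two values. The snippet below copies the numbers from the output above; the variable names are only for illustration.

```python
# Values copied from the evaluation output above.
precision = 0.8709677419354839
recall = 0.7941176470588235

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.4f}")
```

The harmonic mean is pulled toward the lower of the two inputs, so raising recall would improve F1 more than an equal gain in precision.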

In [4]:
# Save only the learned parameters (state_dict), not the full model object
model_weights = model.state_dict()
torch.save(model_weights, 'model_weights.pth')
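
Because the file holds only a `state_dict`, reusing it means rebuilding the same architecture first and then loading the weights into it. The sketch below shows that round trip with a tiny `nn.Linear` layer standing in for the BERT model. The stand-in model, the `build_model` helper, and the temporary path are assumptions made so the example stays small and self-contained.

```python
import os
import tempfile

import torch
from torch import nn


def build_model():
    # Hypothetical stand-in; in the notebook this would be the same
    # token-classification architecture that was trained.
    return nn.Linear(4, 2)


original = build_model()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_weights.pth")
    torch.save(original.state_dict(), path)

    # Rebuild the architecture, then load the saved parameters into it.
    restored = build_model()
    state = torch.load(path, map_location="cpu")
    restored.load_state_dict(state)

restored.eval()  # switch to inference mode before predicting

weights_match = all(
    torch.equal(a, b)
    for a, b in zip(original.state_dict().values(), restored.state_dict().values())
)
print("Weights match:", weights_match)
```

`load_state_dict` raises an error if the parameter names or shapes differ, so the rebuilt model must use the same configuration (including the number of labels) as the trained one.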

Let's check how the model works on five sentences that contain mountain names.
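
The tokenizer splits rare names into WordPiece fragments (for example `gang`, `##kar`), and each fragment receives its own tag, so the tagged pieces have to be stitched back into whole names. The `infer` function is defined outside this notebook. The sketch below is only one way such a merge could be written, and `merge_mountains` is a hypothetical helper, not the actual implementation.

```python
def merge_mountains(tokens, tags):
    """Join WordPiece tokens tagged as MOUNTAIN into whole mountain names."""
    mountains = []
    current = []  # words of the entity currently being built

    for token, tag in zip(tokens, tags):
        if tag == "O":
            # Any non-entity token closes the entity in progress.
            if current:
                mountains.append(" ".join(current))
                current = []
            continue

        is_subword = token.startswith("##")
        piece = token[2:] if is_subword else token

        if is_subword and current:
            # A "##" fragment extends the previous word, whatever its tag.
            current[-1] += piece
        elif tag.startswith("B-") and current:
            # A "B-" tag on a full word starts a new entity.
            mountains.append(" ".join(current))
            current = [piece]
        else:
            current.append(piece)

    if current:
        mountains.append(" ".join(current))
    return mountains


tokens = ["climbing", "gang", "##kar", "pun", "##sum", "is"]
tags = ["O", "B-MOUNTAIN", "B-MOUNTAIN", "I-MOUNTAIN", "I-MOUNTAIN", "O"]
print(merge_mountains(tokens, tags))
```

Attaching `##` fragments to the previous word regardless of their tag keeps the result stable when the model assigns `B-MOUNTAIN` to several pieces of the same word.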

In [10]:
sentences = [
    'Climbing Gangkar Punsum is considered a challenging feat for even experienced mountaineers.',
    'Annapurna is the goddess of abundance and nourishment in Hindu mythology.',
    'Volcan Karisimbi is an active stratovolcano located in the Virunga Mountains.',
    'Many outdoor enthusiasts dream of summiting Mont Blanc one day.',
    'We finally reached the summit of Mount Everest, the tallest peak on Earth.'
]

for sentence in sentences:
    tokens, predicted_tags, mountains = infer(model, tokenizer, sentence)
    print('Sentence:', sentence)
    print('Tokens:', tokens)
    print('Predicted tags:', predicted_tags, '\n')
    print('Extracted mountains:', mountains)
    print('\n\n')

Sentence: Climbing Gangkar Punsum is considered a challenging feat for even experienced mountaineers.
Tokens: ['climbing', 'gang', '##kar', 'pun', '##sum', 'is', 'considered', 'a', 'challenging', 'feat', 'for', 'even', 'experienced', 'mountain', '##eers', '.']
Predicted tags: ['O', 'B-MOUNTAIN', 'B-MOUNTAIN', 'I-MOUNTAIN', 'I-MOUNTAIN', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'] 

Extracted mountains: gangkar punsum



Sentence: Annapurna is the goddess of abundance and nourishment in Hindu mythology.
Tokens: ['anna', '##pur', '##na', 'is', 'the', 'goddess', 'of', 'abundance', 'and', 'no', '##uri', '##sh', '##ment', 'in', 'hindu', 'mythology', '.']
Predicted tags: ['B-MOUNTAIN', 'B-MOUNTAIN', 'B-MOUNTAIN', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'] 

Extracted mountains: annapurna



Sentence: Volcan Karisimbi is an active stratovolcano located in the Virunga Mountains.
Tokens: ['vol', '##can', 'kari', '##si', '##mb', '##i', 'is', 'an', 'active', 