***
# <font color=red>Chapter 6: MedTALN inc.'s Case Study - Healthcare NER Training Dataset Creation from CoNLL File</font>
<p style="margin-left:10%; margin-right:10%;">by <font color=teal> John Doe (typica.ai) </font></p>

***


## Overview:

First and foremost, we need a dataset that is suitable for token classification. In this notebook, we will use the CoNLL file exported from OCI DLS (refer to Chapter 5) to create our training dataset (in the Hugging Face datasets format).

To create a Hugging Face dataset from the exported CoNLL file, we will follow these high-level steps:

1. Parse the CoNLL file: extract the tokens and their corresponding NER tags from the CoNLL file.
2. Create a Hugging Face dataset: arrange the data so that each instance has an `id`, `tokens`, and `ner_tags`, then use the datasets library to build a `DatasetDict` from it.
3. Create dataset splits: derive train, validation, and test splits.
4. Save the dataset: write the Hugging Face dataset to disk (the `training-datasets-bkt` mount).
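
To make step 1 concrete, here is a minimal sketch of the layout the parser in this notebook expects: one token per line with its tag in the last column, and blank lines or `-DOCSTART-` markers separating sentences. The sample text and the `B-VACCINE` tag name are invented for illustration; the actual OCI DLS export may use different tag names or carry extra columns.

```python
# Hypothetical CoNLL snippet: one "token tag" pair per line,
# blank line between sentences. Tag names are illustrative only.
sample = """-DOCSTART- -X- O

un O
vaccin B-VACCINE
vivant O

le O
patient O
"""

sentences, current = [], []
for line in sample.splitlines():
    if line.startswith("-DOCSTART-") or not line.strip():
        # Sentence boundary: flush the tokens collected so far
        if current:
            sentences.append(current)
            current = []
    else:
        parts = line.split()
        # First column is the token, last column is the tag
        current.append((parts[0], parts[-1]))
if current:
    # Flush the final sentence if the text does not end with a blank line
    sentences.append(current)

print(sentences)
```

Each sentence ends up as a list of `(token, tag)` pairs, which is the intermediate shape the helper functions in this notebook work with.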


The Hugging Face datasets library is already included in our installed conda environment (`pytorch21_p39_gpu_v1`).
If you need to reinstall it, run `!pip install datasets==2.16.1`.
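
Before reinstalling, it can help to check which version is already present. The sketch below uses only the standard library; the helper name `installed_version` is hypothetical and not part of the notebook.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# The notebook pins datasets==2.16.1; compare against what is installed
print("datasets:", installed_version("datasets"))
```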

Filter out warnings:

In [3]:
import warnings
warnings.filterwarnings('ignore')

## Declare helper functions



In [4]:
def split_token(token, tag):
    """
    Splits tokens if they end with specific punctuation characters (.,;!?) and assigns
    'O' to the punctuation, leaving other tokens intact.
    """
    # Define the punctuations to split
    punctuations = ".,;!?"
    # Check if the token ends with a punctuation that should be split
    if token[-1] in punctuations:
        # Return the token without the last character and the punctuation as separate tokens
        return [(token[:-1], tag), (token[-1], 'O')]
    else:
        # Return the token as is if it doesn't end with specified punctuation
        return [(token, tag)]

# This function reads your .conll file and extracts sentences and their NER tags.
def parse_conll_file(file_path):
    sentences = []
    current_sentence = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            if line.startswith("-DOCSTART-") or line.strip() == "":
                if current_sentence:
                    sentences.append(current_sentence)
                    current_sentence = []
            else:
                parts = line.strip().split()
                token = parts[0]
                tag = parts[-1] if len(parts) > 1 else 'O'  # Default to 'O' if no tag is present
                # Split token if it contains punctuation
                current_sentence.extend(split_token(token, tag))
        if current_sentence:  # Add the last sentence if it exists
            sentences.append(current_sentence)
    return sentences

# This function extracts unique NER tags ensuring 'O' is first, and prepares the data for the dataset creation.
def prepare_dataset(sentences):
    unique_tags = set()
    for sentence in sentences:
        for _, tag in sentence:
            unique_tags.add(tag)

    # Ensure 'O' is first, then sort the rest of the tags
    unique_tags.discard('O')  # Remove 'O' to avoid duplication
    unique_tags = ['O'] + sorted(unique_tags)  # Prepend 'O' and sort the rest

    tag_to_id = {tag: idx for idx, tag in enumerate(unique_tags)}

    # Prepare data for Hugging Face Dataset
    data = {'id': [], 'tokens': [], 'ner_tags': []}
    for i, sentence in enumerate(sentences):
        tokens, tags = zip(*sentence)
        data['id'].append(str(i))
        data['tokens'].append(list(tokens))
        data['ner_tags'].append([tag_to_id.get(tag, tag_to_id['O']) for tag in tags])

    return data, unique_tags

from datasets import Dataset, DatasetDict, Features, ClassLabel, Sequence, Value

# This function creates the dataset using the prepared data and unique NER tags.
def create_hf_dataset(data, unique_tags):
    features = Features({
        'id': Value(dtype='string', id=None),
        'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
        'ner_tags': Sequence(feature=ClassLabel(num_classes=len(unique_tags), names=unique_tags))
    })

    dataset = Dataset.from_dict(data, features=features)
    dataset_dict = DatasetDict({'train': dataset})
    return dataset_dict
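
To illustrate how `prepare_dataset` orders the labels, the sketch below replays the same logic on two toy sentences: `O` is pinned to index 0 and the remaining tags follow in alphabetical order. The tag names (`B-VACCINE`, `B-SYMPTOM`, `I-SYMPTOM`) are hypothetical and may not match the labels in the exported file.

```python
# Toy sentences as (token, tag) pairs; the tag names are hypothetical
toy_sentences = [
    [("un", "O"), ("vaccin", "B-VACCINE")],
    [("fièvre", "B-SYMPTOM"), ("élevée", "I-SYMPTOM")],
]

# Collect the distinct tags, then force 'O' to index 0
tags = {tag for sentence in toy_sentences for _, tag in sentence}
tags.discard("O")
label_names = ["O"] + sorted(tags)
tag_to_id = {tag: idx for idx, tag in enumerate(label_names)}

# Encode each sentence's tags as integer ids
encoded = [[tag_to_id[tag] for _, tag in s] for s in toy_sentences]

print(label_names)  # ['O', 'B-SYMPTOM', 'B-VACCINE', 'I-SYMPTOM']
print(encoded)      # [[0, 2], [1, 3]]
```

Keeping `O` at index 0 means the id assigned to the "no entity" class stays stable even when the set of entity tags changes.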


## Create HF Dataset from CoNLL file

Parse the .conll file, prepare the data, and create the dataset.



In [9]:
# Set the CoNLL file path
file_path = "/home/datascience/buckets/training-datasets-bkt/healthcare_ner_dataset_v1.1.0/healthcare_ner_dataset_v1.0.0_1724175778995.conll"

# Parse the .conll file
sentences = parse_conll_file(file_path)
# Prepare the dataset and extract unique NER tags
data, unique_tags = prepare_dataset(sentences)
# Create the Hugging Face dataset
dataset_dict = create_hf_dataset(data, unique_tags)

print("Dataset created successfully!")

dataset_dict


Dataset created successfully!


DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 9000
    })
})

Inspect a random row:

In [5]:
dataset_dict["train"].shuffle(seed=42)[0]

{'id': '2015',
 'tokens': ['un',
  'vaccin',
  'vivant',
  'atténué',
  'est',
  'maintenant',
  'disponible',
  '',
  '.'],
 'ner_tags': [0, 5, 0, 0, 0, 0, 0, 0, 0]}
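
The empty string before the final `'.'` in the row above is consistent with `split_token` receiving a token that consists only of a punctuation mark: the function strips the last character and keeps the empty remainder as a token of its own. One possible guard, shown here as a sketch rather than the notebook's actual code (the name `split_token_guarded` is hypothetical), is to split only when something is left after removing the punctuation.

```python
def split_token_guarded(token, tag, punctuations=".,;!?"):
    """Split trailing punctuation off `token` without emitting empty tokens."""
    # Require at least two characters so a lone "." is kept as-is
    if len(token) > 1 and token[-1] in punctuations:
        return [(token[:-1], tag), (token[-1], "O")]
    return [(token, tag)]

print(split_token_guarded("disponible.", "O"))  # [('disponible', 'O'), ('.', 'O')]
print(split_token_guarded(".", "O"))            # [('.', 'O')]
```

With this variant a standalone punctuation token keeps whatever tag the CoNLL file assigned to it, and an empty input no longer raises an `IndexError`.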

## Create splits for the HF Dataset

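In this step we create train, validation, and test splits. We first hold out 25% of the rows, then split that hold-out 75/25, which gives roughly 75% train, 18.75% validation, and 6.25% test.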




In [10]:
from datasets import DatasetDict

ds_train_devtest = dataset_dict['train'].train_test_split(test_size=0.25, seed=42)
ds_devtest = ds_train_devtest['test'].train_test_split(test_size=0.25, seed=42)


healthcare_ner_dataset = DatasetDict({
    'train': ds_train_devtest['train'],
    'validation': ds_devtest['train'],
    'test': ds_devtest['test']
})

healthcare_ner_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 6750
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 1687
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 563
    })
})
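
The row counts above follow from applying `test_size=0.25` twice. The sketch below reproduces the arithmetic; the round-up rule for the test share is an assumption inferred from the printed counts, and `split_sizes` is a hypothetical helper, not a datasets API.

```python
import math

def split_sizes(n_rows, test_fraction):
    """Return (train, test) sizes, rounding the test share up.

    Assumption: the test size is ceil(n_rows * test_fraction), which
    matches the row counts printed above.
    """
    n_test = math.ceil(n_rows * test_fraction)
    return n_rows - n_test, n_test

n_train, n_holdout = split_sizes(9000, 0.25)          # first split
n_validation, n_test = split_sizes(n_holdout, 0.25)   # second split on the hold-out

print(n_train, n_validation, n_test)  # 6750 1687 563
```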

## Save dataset

We save the Hugging Face dataset to the `training-datasets-bkt/healthcare_ner_dataset_v1.2.0` directory. The dataset is now ready for Named Entity Recognition (NER) training.

The `v1.2` part of the version denotes that the dataset has been fully processed and is in a format suitable for NER tasks, which makes this specific state easy to reference in future training or evaluation runs.


In [11]:
healthcare_ner_dataset.save_to_disk("/home/datascience/buckets/training-datasets-bkt/healthcare_ner_dataset_v1.2.0")
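
A later training notebook can read the saved splits back with `load_from_disk`. The sketch below wraps that call in a hypothetical helper, `reload_dataset`, which also recovers the label names from the `ner_tags` feature; it assumes the directory was written by the `save_to_disk` call above.

```python
def reload_dataset(path):
    """Load a DatasetDict saved with save_to_disk and return it with its label names."""
    # Imported lazily so the helper can be defined before the library is loaded
    from datasets import load_from_disk

    dataset = load_from_disk(path)
    # ner_tags is a Sequence of ClassLabel, so the names live on `.feature`
    label_names = dataset["train"].features["ner_tags"].feature.names
    return dataset, label_names

# Usage (path taken from the save cell above):
# dataset, label_names = reload_dataset(
#     "/home/datascience/buckets/training-datasets-bkt/healthcare_ner_dataset_v1.2.0"
# )
```

Because the label names travel with the dataset features, the id-to-tag mapping does not need to be stored separately.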
