DSeqMap for NLP

This tool handles annotation data for named-entity-recognition (NER) that support token-wise (discrete sequences) and character-wise (text) representations and can translate between these representations.

Install

Use pip install dseqmap4nlp to install the package.

Example

The following script demonstrates some transformation of annotation data.

from dseqmap4nlp import SpacySequenceMapper, LabelLoader, CharSequenceMapper

# Load JSONL
samples = [
    {"text": "Das ist gut.", "label": [
        (0,3, "a"),
        (2,7, "b"),
        (6,8, "b")
    ]}
]

for sample in samples:
    text = sample["text"]
    anns = sample["label"]
    # Mapper @ Chars <-> SpaCy Tokens
    mapper = SpacySequenceMapper(text, nlp="de")
    # (trivial) Mapper @ Chars <-> Chars
    charmapper = CharSequenceMapper(text=text)

    # Load annotation data
    # assuming the format: [ (start_idx, stop_idx, label_class), ...]
    # and merge it with to a certain (char <-> discrete sequence) mapper
    annotationset = LabelLoader.from_text_spans(anns, mapper)
    
    # Determine number fo overlaps
    print("Overlaps:", annotationset.countOverlaps())

    # Apply the following transformations to the annotation data:
    # - transform char-based labels onto discretized sequence items (e.g. tokens)
    #   -> Expand if a label's char bounds are not exactly at token bounds
    # - Remove shorter spans in case of overlapping spans
    #   -> Note: New overlaps could also be introduced by span expansion!
    filtered_spans = annotationset\
        .toDSeqSpans(strategy=["expand"])\
        .withoutOverlaps(strategy="prefer_longest", merge_same_classes=True)

    # Check overlaps again (No overlap should exist anymore!)
    print("Overlaps:", filtered_spans.countOverlaps())

    # Transform annotation data into IOB2-formatted sequence.
    print("Sequence:")
    print(filtered_spans.toFormattedSequence(schema="IOB2"))

    # Try to generate an IOB2 sequence with overlaps. (It should fail!)
    print("Previous sequence (should fail):")
    try:
        # Should raise an error...
        print(annotationset.toFormattedSequence(schema="IOB2"))
    except ValueError as e:
        print("Error raised: " + repr(e))

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.github/workflows		.github/workflows
src/dseqmap4nlp		src/dseqmap4nlp
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.github/workflows

.github/workflows

src/dseqmap4nlp

src/dseqmap4nlp

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

pyproject.toml

pyproject.toml

Repository files navigation

DSeqMap for NLP

Install

Example

About

Releases 2

Languages

License

j-frei/DSeqMap4NLP

Folders and files

Latest commit

History

Repository files navigation

DSeqMap for NLP

Install

Example

About

Topics

Resources

License

Stars

Watchers

Forks

Languages