Thinking about how to sample data for manual labelling in Argilla. We want to present passages which are likely to contain instances of each concept. It makes sense to use an existing classifier to identify these passages, which can then be manually labelled as positive or negative examples. From there, we can train more sophisticated models and lean on active learning to reduce the amount of manual labelling required.

In [None]:
import json
from pathlib import Path

from rich.progress import track

from src.concept import Concept
from src.document import Document

Load up all of our documents and concepts.

In [None]:
data_dir = Path("../data")

In [None]:
with open(data_dir / "raw" / "concepts.json") as f:
    concepts_data = json.load(f)
concepts = [Concept.from_dict(concept) for concept in track(concepts_data)]

In [None]:
documents_dir = data_dir / "processed" / "documents"
file_paths = list(documents_dir.glob("*.json"))
documents = [Document.load(file) for file in track(file_paths)]

In [None]:
document = documents[1]
concept_id = "b8aevvwa"

document.concept_spans

We can exploit the fact that the concept spans and sentence spans are both sorted by start index. We start by iterating through the sentence spans while holding the first concept span in the list as the one we're looking for. If the concept span is within the bounds of the sentence span, we can add the sentence to the list of sentences that contain the concept. We can then move on to the next concept span and repeat the process. This way, we only have to iterate through the sentence spans once.
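To see the single-pass idea in isolation, here is a minimal sketch on toy data. `Span` and `sentences_containing` are hypothetical stand-ins written for this example, not the classes in `src`, and the sketch assumes end indices are exclusive. It also makes one choice of its own: it keeps consuming concept spans until one starts beyond the current sentence, so a sentence can match several spans.

```python
from dataclasses import dataclass


@dataclass
class Span:
    # hypothetical stand-in for the span objects in src, not the real class
    start_index: int
    end_index: int
    identifier: str = ""


def sentences_containing(text, sentence_spans, concept_spans):
    """Single pass over two lists of spans, both sorted by start index."""
    matches = []
    concept_iter = iter(concept_spans)
    concept = next(concept_iter, None)
    for sentence in sentence_spans:
        # consume every concept span which starts before this sentence ends
        while concept is not None and concept.start_index < sentence.end_index:
            if (
                sentence.start_index <= concept.start_index
                and concept.end_index <= sentence.end_index
            ):
                passage = text[sentence.start_index : sentence.end_index]
                matches.append((concept.identifier, passage))
            concept = next(concept_iter, None)
        if concept is None:
            break
    return matches


# toy data for illustration only
text = "Floods are rising. Drought and floods hurt crops."
sentence_spans = [Span(0, 18), Span(19, 49)]
concept_spans = [Span(0, 6, "flood"), Span(19, 26, "drought"), Span(31, 37, "flood")]
matches = sentences_containing(text, sentence_spans, concept_spans)
```

Each list is walked once, so the cost is linear in the number of sentence spans plus concept spans, rather than their product.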

In [None]:
concept_passages = {concept.id: [] for concept in concepts}

for document in track(documents):
    if not document.concept_spans:
        continue
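    # set up an iterator for the concept spans, so that we can track the current concept
    concept_span_iterable = iter(document.concept_spans)
    concept_span = next(concept_span_iterable, None)
    for sentence_span in document.sentence_spans:
        # consume every concept span which starts before the end of this sentence,
        # so that one sentence can match several concept spans
        while (
            concept_span is not None
            and concept_span.start_index < sentence_span.end_index
        ):
            # if the concept is within the bounds of the sentence
            if (
                sentence_span.start_index <= concept_span.start_index
                and sentence_span.end_index >= concept_span.end_index
            ):
                sentence = document.text[
                    sentence_span.start_index : sentence_span.end_index
                ]
                concept_passages[concept_span.identifier].append(sentence)
            # spans which straddle a sentence boundary are dropped here, rather
            # than blocking every later match in the document
            concept_span = next(concept_span_iterable, None)
        if concept_span is None:
            break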

In [None]:
for key, value in concept_passages.items():
    print(key, len(value))
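
With passages grouped by concept, one possible next step is to draw a fixed-size sample per concept to send for labelling. The sketch below is an assumption about how that could look, not something already in this notebook: `sample_passages`, `n_per_concept` and `seed` are hypothetical names, and the defaults are arbitrary. It deduplicates first, since a sentence which mentions the same concept more than once may appear several times in its list.

```python
import random


def sample_passages(concept_passages, n_per_concept=20, seed=42):
    """Pick up to n_per_concept distinct passages for each concept."""
    rng = random.Random(seed)  # fixed seed, so the sample is reproducible
    samples = {}
    for concept_id, passages in concept_passages.items():
        # dict.fromkeys drops duplicates while keeping first-seen order
        unique_passages = list(dict.fromkeys(passages))
        k = min(n_per_concept, len(unique_passages))
        samples[concept_id] = rng.sample(unique_passages, k)
    return samples


# toy data for illustration only
toy_passages = {"a": ["s1", "s2", "s2", "s3"], "b": []}
toy_samples = sample_passages(toy_passages, n_per_concept=2)
```

Concepts with fewer passages than `n_per_concept` simply return everything they have, which makes under-represented concepts easy to spot before labelling starts.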