# LabelstudioToFonduer Example Notebook

In this example notebook, a Fonduer pipeline is set up from a Label Studio export. Along the way, the most important features of `LabelstudioToFonduer` are explained.

First, we load the Label Studio export in Section 1. Then we import the documents into Fonduer using a custom parser in Section 2. In Section 3, the Fonduer data model is set up. Finally, in Section 4, the Label Studio annotations are transferred to Fonduer with the gold function.

Before we can start, we need to connect to the PostgreSQL server, create a fresh database and instantiate a Fonduer session.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import sys
import logging
import sqlalchemy

In [3]:
PARALLEL = 8
ATTRIBUTE = "larger_test"
conn_string = 'postgresql://postgres:postgres@127.0.0.1:5432/'

In [4]:
engine = sqlalchemy.create_engine(conn_string)
conn = engine.connect()

In [5]:
print(engine.execute("SELECT datname FROM pg_database;").fetchall())

[('postgres',), ('template1',), ('template0',), ('larger_test',)]


In [6]:
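def wipe_db():
    """Drop and recreate the ATTRIBUTE database so the notebook starts clean."""
    engine = sqlalchemy.create_engine(conn_string)
    conn = engine.connect()

    # End the open transaction: DROP/CREATE DATABASE cannot run inside one.
    conn.execute("commit")
    # Disconnect all other sessions that are still attached to the database.
    conn.execute(
        f"""SELECT
        pg_terminate_backend(pid)
    FROM
        pg_stat_activity
    WHERE
        -- don't kill my own connection!
        pid <> pg_backend_pid()
        -- don't kill the connections to other databases
        AND datname = '{ATTRIBUTE}'
        ;"""
    )

    conn.execute("commit")
    conn.execute("drop database " + ATTRIBUTE)

    # The database should no longer be listed here.
    print(engine.execute("SELECT datname FROM pg_database;").fetchall())

    conn.execute("commit")
    conn.execute("create database " + ATTRIBUTE)
    conn.close()
    # The freshly created database should be listed again.
    print(engine.execute("SELECT datname FROM pg_database;").fetchall())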

wipe_db()

[('postgres',), ('template1',), ('template0',)]
[('postgres',), ('template1',), ('template0',), ('larger_test',)]


In [7]:
from fonduer import Meta, init_logging

# Configure logging for Fonduer
init_logging(log_dir="logs")

session = Meta.init(conn_string + ATTRIBUTE).Session()

[2022-07-21 06:14:49,431][INFO] fonduer.meta:49 - Setting logging directory to: logs/2022-07-21_06-14-49
[2022-07-21 06:14:49,439][INFO] fonduer.meta:135 - Connecting user:postgres to 127.0.0.1:5432/larger_test
[2022-07-21 06:14:54,666][INFO] fonduer.meta:162 - Initializing the storage schema


### 1. Load Export
Now we load the Label Studio export JSON file. This may take a while, since all HTML files are recreated from the export to retain exactly the same structure as in Label Studio.

The directory for the created HTML documents can be specified with the `base_dir` argument and defaults to `tmp`.

In [8]:
from LabelstudioToFonduer.export import Export

my_export = Export(export_path="data/export/export_larger_test.json", base_dir="tmp")

In [8]:
docs_path = my_export.base_dir
docs_path

'tmp'
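
To make the idea of recreating the HTML files more concrete, here is a minimal sketch. It is not the implementation used by `LabelstudioToFonduer`. It assumes that the export is a list of task dictionaries whose HTML source is stored under `task["data"]["text"]`; the function name `write_html_documents`, the `html_key` parameter and the file naming by task id are hypothetical.

```python
import tempfile
from pathlib import Path


def write_html_documents(tasks, base_dir, html_key="text"):
    """Write the HTML stored in each exported task to `base_dir`.

    Assumption: `tasks` is the parsed export, a list of task dicts whose
    HTML source sits under task["data"][html_key].
    """
    out_dir = Path(base_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for task in tasks:
        # Hypothetical naming scheme: one file per task, named after its id.
        path = out_dir / f"{task['id']}.html"
        path.write_text(task["data"][html_key], encoding="utf-8")
        written.append(path.name)
    return sorted(written)


# Tiny hand-made export with two tasks (not real data).
demo_tasks = [
    {"id": 1, "data": {"text": "<html><body><p>Senior Developer in Berlin</p></body></html>"}},
    {"id": 2, "data": {"text": "<html><body><p>Data Engineer in Munich</p></body></html>"}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_files = write_html_documents(demo_tasks, tmp_dir)

print(demo_files)  # ['1.html', '2.html']
```

Writing the documents from the export, rather than reusing the original files, keeps the markup that Fonduer parses identical to the markup the annotations refer to.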

### 2. Import Documents
After Fonduer is set up and the export is prepared for the import, we can start with the importing process.

In [9]:
from fonduer.parser.preprocessors import HTMLDocPreprocessor
from fonduer.parser import Parser

max_docs = 100
doc_preprocessor = HTMLDocPreprocessor(docs_path, max_docs=max_docs)

By default, the Fonduer `lingual_parser` sometimes also splits sentences on `:` characters. Since the annotated labels contain `:` characters as part of the job descriptions, this behavior is critical: Fonduer would split some labels across two sentences.

As a solution, a modified `lingual_parser` is provided. The `ModifiedSpacyParser` is closely based on the Fonduer `SpacyParser` but uses a regex sentencizer that splits sentences only on the `.` character. In addition, split exceptions can be provided to prevent splitting inside abbreviations and names such as `.NET` or `Sr.`.

In [10]:
from LabelstudioToFonduer.parser import ModifiedSpacyParser

exceptions = [".NET", "Sr.", ".WEB", ".de", "Jr.", "Inc.", "Senior."]
my_parser = ModifiedSpacyParser(lang="en", split_exceptions=exceptions)
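
The splitting rule can be illustrated with a small, self-contained sketch. It is only an approximation of the idea, not the code of `ModifiedSpacyParser`: the function `split_sentences` is hypothetical and works on plain strings, whereas the real parser operates on spaCy documents.

```python
import re


def split_sentences(text, split_exceptions=()):
    """Split `text` after every '.' unless the dot belongs to an exception."""
    # Collect the character positions covered by an exception such as ".NET".
    protected = set()
    for exception in split_exceptions:
        for match in re.finditer(re.escape(exception), text):
            protected.update(range(match.start(), match.end()))

    sentences, start = [], 0
    for match in re.finditer(r"\.", text):
        if match.start() in protected:
            continue  # the dot is part of an exception, so do not split here
        sentence = text[start:match.end()].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()

    # Keep any trailing text that does not end with a dot.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


demo = "Sr. Developer: .NET and Java. Located in Berlin."
print(split_sentences(demo, [".NET", "Sr."]))
# ['Sr. Developer: .NET and Java.', 'Located in Berlin.']
```

Note that the `:` never triggers a split, and that without the two exceptions the same text would be cut into four pieces.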

In [11]:
corpus_parser = Parser(session, lingual_parser=my_parser, structural=True, lingual=True)
corpus_parser.apply(doc_preprocessor, parallelism=PARALLEL)

[2022-07-19 14:30:45,330][INFO] fonduer.utils.udf:67 - Running UDF...


  0%|          | 0/100 [00:00<?, ?it/s]

In [12]:
from fonduer.parser.models import Document, Sentence

print(f"Documents: {session.query(Document).count()}")
print(f"Sentences: {session.query(Sentence).count()}")

docs = session.query(Document).order_by(Document.name).all()

Documents: 100
Sentences: 15866


### 3. Setup Fonduer Data Model
As we can see in the previous cell, we imported 100 documents into Fonduer. Now we need to prepare the data model according to our annotations. The two entities `Job` and `City` are created.

In [13]:
from fonduer.candidates.models import mention_subclass

Job = mention_subclass("Job")
City = mention_subclass("City")

Since the gold labels define the ngram size, the needed size can be calculated by finding the longest and shortest label. 

`LabelstudioToFonduer` conveniently determines the necessary ngram sizes for us. 

In [14]:
my_export.ngrams

{'Job': {'min': 2, 'max': 10}, 'City': {'min': 1, 'max': 2}}
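
As a rough illustration of how such sizes could be derived, the sketch below counts whitespace-separated tokens in each label text. This is an assumption for demonstration purposes: the function `ngram_sizes` is hypothetical, and the package may tokenize differently.

```python
def ngram_sizes(texts_by_label):
    """Return the shortest and longest label length, in tokens, per entity."""
    sizes = {}
    for label, texts in texts_by_label.items():
        # Assumption: a token is a whitespace-separated word.
        lengths = [len(text.split()) for text in texts]
        sizes[label] = {"min": min(lengths), "max": max(lengths)}
    return sizes


# Hand-made label texts (not taken from the real export).
demo_texts = {
    "Job": ["Software Engineer", "Senior Data Protection Officer"],
    "City": ["Berlin", "New York"],
}
print(ngram_sizes(demo_texts))
# {'Job': {'min': 2, 'max': 4}, 'City': {'min': 1, 'max': 2}}
```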

In [15]:
from fonduer.candidates import MentionNgrams

job_ngrams = MentionNgrams(n_max=my_export.ngrams["Job"]["max"], n_min=my_export.ngrams["Job"]["min"])
city_ngrams = MentionNgrams(n_max=my_export.ngrams["City"]["max"], n_min=my_export.ngrams["City"]["min"])

print("Job ngram size:", job_ngrams.n_max)
print("City ngram size:", city_ngrams.n_max)

Job ngram size: 10
City ngram size: 2


In [16]:
from fonduer.candidates.matchers import LambdaFunctionMatcher

city = my_export.texts["City"]
jobs = my_export.texts["Job"]

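def is_job(mention):
    # Accept a span only if its text was annotated as a job in Label Studio
    return mention.get_span() in jobs

def is_city(mention):
    # Accept a span only if its text was annotated as a city in Label Studio
    return mention.get_span() in city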
    
job_matcher = LambdaFunctionMatcher(func=is_job)
city_matcher = LambdaFunctionMatcher(func=is_city)
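
Both matcher functions follow the same pattern, so they could also be generated by a small factory. The sketch below is one possible variant and not part of `LabelstudioToFonduer`: `make_membership_matcher` and `FakeMention` are hypothetical names, and `FakeMention` only stands in for a Fonduer mention so that the example runs without a database.

```python
def make_membership_matcher(texts):
    """Build a function that accepts a mention if its span text is in `texts`."""
    lookup = set(texts)  # set membership is O(1) on average

    def matcher(mention):
        return mention.get_span() in lookup

    return matcher


class FakeMention:
    """Hypothetical stand-in for a Fonduer span mention, for demonstration only."""

    def __init__(self, span):
        self._span = span

    def get_span(self):
        return self._span


is_city_demo = make_membership_matcher(["Berlin", "New York"])
print(is_city_demo(FakeMention("Berlin")), is_city_demo(FakeMention("Paris")))
# True False
```

The returned function can be passed to `LambdaFunctionMatcher(func=...)` in the same way as the hand-written functions.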

In [17]:
from fonduer.candidates import MentionExtractor

mention_extractor = MentionExtractor(
    session,
    [Job, City],
    [job_ngrams, city_ngrams],
    [job_matcher, city_matcher],
)

In [18]:
from fonduer.candidates.models import Mention

mention_extractor.apply(docs)
num_jobs = session.query(Job).count()
num_cities = session.query(City).count()
print(
    f"Total Mentions: {session.query(Mention).count()} ({num_jobs} jobs, {num_cities} cities)"
)

[2022-07-19 14:32:51,881][INFO] fonduer.candidates.mentions:467 - Clearing table: job
[2022-07-19 14:32:51,897][INFO] fonduer.candidates.mentions:467 - Clearing table: city
[2022-07-19 14:32:51,898][INFO] fonduer.utils.udf:67 - Running UDF...


  0%|          | 0/100 [00:00<?, ?it/s]

Total Mentions: 1707 (369 jobs, 1338 cities)


In [19]:
from fonduer.candidates.models import candidate_subclass

JobCity = candidate_subclass(
    "JobCity", [Job, City]
)

In [20]:
from fonduer.candidates import CandidateExtractor

candidate_extractor = CandidateExtractor(session, [JobCity])
candidate_extractor.apply(docs)

[2022-07-19 14:35:11,676][INFO] fonduer.candidates.candidates:138 - Clearing table job_city (split 0)
[2022-07-19 14:35:11,691][INFO] fonduer.utils.udf:67 - Running UDF...


  0%|          | 0/100 [00:00<?, ?it/s]

In [21]:
train_cands = candidate_extractor.get_candidates()



### 4. Label Docs

To label the Fonduer candidates, we need to create a gold function. This function determines whether a candidate is a gold label by comparing the text, XPath, document and char offsets. To create positive examples, we provide the session to the export.

In [22]:
my_export.create_gold_function(session)
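
The comparison can be pictured with the following sketch. It is a simplified illustration under stated assumptions, not the gold function created by `my_export`: the `SpanKey` record, the `is_gold_pair` function and the example values are hypothetical, and a candidate is treated as gold only if each of its spans matches an annotation in all compared fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanKey:
    """Hypothetical record of the fields compared between annotation and mention."""
    document: str
    xpath: str
    char_start: int
    char_end: int
    text: str


def is_gold_pair(candidate_spans, gold_spans):
    """Return 1 if every span of the candidate matches a gold annotation, else 0."""
    return int(all(span in gold_spans for span in candidate_spans))


# Hand-made gold annotations (not real data).
gold = {
    SpanKey("doc_1", "/html/body/p[1]", 0, 16, "Software Engineer"),
    SpanKey("doc_1", "/html/body/p[2]", 4, 9, "Berlin"),
}

job = SpanKey("doc_1", "/html/body/p[1]", 0, 16, "Software Engineer")
city = SpanKey("doc_1", "/html/body/p[2]", 4, 9, "Berlin")
# Same text and offsets, but located in a different element of the document.
other_city = SpanKey("doc_1", "/html/body/p[3]", 4, 9, "Berlin")

print(is_gold_pair([job, city], gold), is_gold_pair([job, other_city], gold))
# 1 0
```

Comparing the XPath and offsets in addition to the text is what distinguishes the annotated occurrence of "Berlin" from other occurrences of the same word in a document.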



In [24]:
from fonduer.supervision.models import GoldLabel
from fonduer.supervision import Labeler

In [25]:
labeler = Labeler(session, [JobCity])

Finally, we can pass the `my_export.is_gold` function to the Fonduer labeler to transfer the labels from Label Studio to Fonduer.

In [26]:
%time labeler.apply(docs=docs, lfs=[[my_export.is_gold]], table=GoldLabel, train=True, parallelism=PARALLEL)

[2022-07-19 14:36:30,851][INFO] fonduer.supervision.labeler:330 - Clearing Labels (split ALL)
[2022-07-19 14:36:30,857][INFO] fonduer.utils.udf:67 - Running UDF...


  0%|          | 0/100 [00:00<?, ?it/s]

CPU times: user 375 ms, sys: 68.9 ms, total: 444 ms
Wall time: 3.13 s


To check if the pipeline worked as expected, we count the gold labels in Fonduer.

In [28]:
all_gold = labeler.get_gold_labels(train_cands)
all_gold[0].sum()

90

In [32]:
# Collect the candidates that carry a gold label
golds = []
for label, cand in zip(all_gold[0], train_cands[0]):
    if label:
        golds.append(cand)