# Extract electricity prices from VENRON data set

# Introduction

In this notebook we use `Fonduer` to extract relations from the `VENRON` dataset.  
This code is a modified version of the original `Fonduer` hardware [tutorial](https://github.com/HazyResearch/fonduer-tutorials/tree/master/hardware).  
The `Fonduer` pipeline (as outlined in the [paper](https://arxiv.org/abs/1703.05028)) follows an iterative knowledge base construction (KBC) process with four phases:

1. KBC Initialization
2. Candidate Generation and Multimodal Featurization
3. Probabilistic Relation Classification
4. Error Analysis and Iterative KBC


## Setup

First we import the relevant libraries and connect to the local database.  
Follow the README instructions to set up the connection to the Postgres DB correctly.

If the database has existing candidates with generated features, they will not be overwritten.  
To re-run the entire pipeline, including initialization, drop the database first.

In [None]:
! dropdb -h postgres --if-exists elec_price_vol
! createdb -h postgres elec_price_vol

In [None]:
# source .venv/bin/activate

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import sys
import logging


In [None]:
PARALLEL = 8  # number of parallel processes (use 4 on a quad-core machine)
ATTRIBUTE = "elec_price_vol"
DB_USERNAME = 'user'
DB_PASSWORD = 'venron'
conn_string = f'postgresql://{DB_USERNAME}:{DB_PASSWORD}@postgres:5432/{ATTRIBUTE}'

dataset = 'gold' # 'full'    
docs_path = f'data/{dataset}/html/'
pdf_path = 'data/pdf/'
gold_file = 'data/electricity_gold.csv'
max_docs = 10 # 114


## 1.1 Parsing and Transforming the Input Documents into Unified Data Models

We first initialize a `Meta` object, which manages the connection to the database automatically, and enables us to save intermediate results.

In [None]:
from fonduer import Meta, init_logging

# Configure logging for Fonduer
init_logging(log_dir="logs", level=logging.INFO) # DEBUG LOGGING

session = Meta.init(conn_string).Session()

In [None]:
from fonduer.parser.preprocessors import HTMLDocPreprocessor
from fonduer.parser.models import Document, Sentence
from fonduer.parser import Parser

has_documents = session.query(Document).count() > 0

corpus_parser = Parser(session, structural=True, lingual=True, visual=True, pdf_path=pdf_path)

if (not has_documents): 
    doc_preprocessor = HTMLDocPreprocessor(docs_path, max_docs=max_docs)
    %time corpus_parser.apply(doc_preprocessor, parallelism=PARALLEL)
    
print(f"Documents: {session.query(Document).count()}")
print(f"Sentences: {session.query(Sentence).count()}")

In [None]:
# Initialize NLP library for vector similarities
import sys

!{sys.executable} -m spacy download en_core_web_lg

## 1.2 Dividing the Corpus into Test and Train

We split the documents 80/10/10 into train/dev/test sets. The split is deterministic: documents are ordered by name rather than shuffled, so every run assigns each document to the same split. We refer to the train/dev/test splits as 0/1/2 respectively.
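
The positional split described above can be sketched in plain Python, independent of `Fonduer`. The function name `split_by_fraction` and the document names are hypothetical and only illustrate how the 0.8 and 0.9 boundaries map sorted documents to splits 0/1/2.

```python
def split_by_fraction(names, bounds=(0.8, 0.9)):
    """Assign sorted names to train/dev/test (splits 0/1/2) by position."""
    ordered = sorted(names)  # sorting makes the assignment reproducible
    n = len(ordered)
    result = {0: [], 1: [], 2: []}
    for i, name in enumerate(ordered):
        if i < bounds[0] * n:
            result[0].append(name)   # train
        elif i < bounds[1] * n:
            result[1].append(name)   # dev
        else:
            result[2].append(name)   # test
    return result


# Hypothetical document names, for illustration only.
doc_names = [f"doc_{i:02d}" for i in range(10)]
doc_splits = split_by_fraction(doc_names)
print({split: len(members) for split, members in doc_splits.items()})
```

With very small corpora (for example `max_docs = 10`), the dev and test splits contain only one document each, so scores on those splits should be read with caution.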

In [None]:
docs = session.query(Document).order_by(Document.name).all()
ld   = len(docs)

train_docs = set()
dev_docs   = set()
test_docs  = set()
splits = (0.8, 0.9)
data = [(doc.name, doc) for doc in docs]
data.sort(key=lambda x: x[0])
for i, (doc_name, doc) in enumerate(data):
    if i < splits[0] * ld:
        train_docs.add(doc)
    elif i < splits[1] * ld:
        dev_docs.add(doc)
    else:
        test_docs.add(doc)
from pprint import pprint
pprint([x.name for x in train_docs][0:5])
print(f"Number of documents split: {len(docs)}")

# Phase 2: Mention Extraction, Candidate Extraction and Multimodal Featurization

Given the unified data model from Phase 1, `Fonduer` extracts relation
candidates based on user-provided **matchers** and **throttlers**. Then,
`Fonduer` leverages the multimodality information captured in the unified data
model to provide multimodal features for each candidate.

## 2.1 Mention Extraction & Candidate Generation

1. Define mention classes
2. Use matcher functions to define the format of potential mentions
3. Define mention spaces (n-grams)
4. Run mention extraction over all possible n-grams in the document (see the `MentionExtractor` API on [ReadTheDocs](https://fonduer.readthedocs.io/en/stable/user/candidates.html#fonduer.candidates.MentionExtractor))
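
To make the interplay of mention spaces and matchers concrete, here is a minimal sketch in plain Python. It does not use the `Fonduer` API: the `ngrams` helper, the price pattern, and the sample tokens are all assumptions for illustration. The actual mention classes, spaces and matchers of this notebook live in `my_subclasses` and are not shown here.

```python
import re


def ngrams(tokens, n_max=3):
    """Yield every contiguous span of 1..n_max tokens as a string."""
    for n in range(1, n_max + 1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])


# Hypothetical price pattern: up to four digits followed by two decimals.
PRICE_RGX = re.compile(r"\d{1,4}\.\d{2}")


def match_prices(tokens):
    """Keep only the unigram spans that fully match the price pattern."""
    return [span for span in ngrams(tokens, n_max=1) if PRICE_RGX.fullmatch(span)]


# Hypothetical tokens from one table row.
cell_tokens = ["NP", "15", "on", "peak", "42.50", "MWh"]
price_mentions = match_prices(cell_tokens)
station_mentions = [s for s in ngrams(cell_tokens, n_max=2) if s.lower() == "np 15"]
print(price_mentions, station_mentions)
```

The mention space decides which spans are considered at all (here unigrams for prices, up to bigrams for stations), and the matcher decides which of those spans become mentions.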

In [None]:
from fonduer.candidates import MentionExtractor 
from fonduer.candidates.models import Mention
from my_subclasses import mention_classes, mention_spaces, matchers
hasMentions = session.query(Mention).count() > 0

if (not hasMentions):
    # 4.) Mention extraction
    mention_extractor = MentionExtractor(
        session, mention_classes, mention_spaces, matchers
    )
    docs = session.query(Document).order_by(Document.name).all()
    mention_extractor.apply(docs, parallelism=PARALLEL)


mentions = session.query(Mention).all()
print(f"Total Mentions: {len(mentions)}")

In [None]:
from fonduer_utils import prune_duplicate_mentions

Station = mention_classes[0]
# Performance increase (reduce quadratic candidates combination by deleting duplicate mentions)
mentions = prune_duplicate_mentions(session, mentions, Station)

In [None]:
# DEBUG: Test if at least one station mention is for meadmktplace types
list([x for x in mentions if x.document.name.upper() == "11_NP 15 PAGES" and isinstance(x, Station)])

## 2.2 Candidate Extraction

1. Define Candidate Class
2. Define throttlers to reduce the number of possible candidates
3. Extract candidates (View the API for the `CandidateExtractor` on [ReadTheDocs](https://fonduer.readthedocs.io/en/stable/user/candidates.html#fonduer.candidates.CandidateExtractor).)

During extraction we assign each set of `Candidates` to a split through the `split` argument; for example, `split=0` marks the training set. Recall that we refer to train/dev/test as splits 0/1/2.
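
A throttler is essentially a predicate over a candidate that returns `True` when the candidate should be kept. The sketch below shows the idea in plain Python with dictionaries standing in for mentions. The name `same_table_throttler` and the `table` field are hypothetical; the throttlers actually used in this notebook are imported from `my_subclasses`.

```python
from itertools import product


def same_table_throttler(candidate):
    """Keep a (station, price) pair only if both mentions share a table."""
    station, price = candidate
    return station["table"] is not None and station["table"] == price["table"]


# Hypothetical mentions; "table" is the index of the table they appear in.
stations = [{"text": "NP 15", "table": 0}, {"text": "Palo Verde", "table": 1}]
prices = [
    {"text": "42.50", "table": 0},
    {"text": "38.00", "table": 1},
    {"text": "1.99", "table": None},  # not inside any table
]

all_pairs = list(product(stations, prices))  # full cross product
kept = [pair for pair in all_pairs if same_table_throttler(pair)]
print(f"{len(kept)} of {len(all_pairs)} candidate pairs kept")
```

Because candidates are formed from the cross product of mentions, pruning pairs early keeps the number of candidates (and the later featurization cost) manageable.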

In [None]:
import re
from my_subclasses import candidate_classes, throttlers
from fonduer.candidates import CandidateExtractor
from fonduer.utils.visualizer import Visualizer


# 1.) Define Candidate class
StationPrice = candidate_classes[0]
has_candidates = session.query(StationPrice).filter(StationPrice.split == 0).count() > 0

# 2.) Candidate extraction
# NOTE: Without nested_relations flag DocumentMentions and FigureMentions are filtered out. 
#       Otherwise they would require a rewrite of the featurizers, due to preprocessing we duplicate the img-url and doc-name
candidate_extractor = CandidateExtractor(session, [StationPrice], throttlers=throttlers) # , nested_relations=True)

for i, docs in enumerate([train_docs, dev_docs, test_docs]):
    if (not has_candidates):
        candidate_extractor.apply(docs, split=i, parallelism=PARALLEL)
    print(f"Number of Candidates in split={i}: {session.query(StationPrice).filter(StationPrice.split == i).count()}")

train_cands = candidate_extractor.get_candidates(split = 0)
dev_cands = candidate_extractor.get_candidates(split = 1)
test_cands = candidate_extractor.get_candidates(split = 2)
cands = [train_cands, dev_cands, test_cands]

# 3.) Visualize some candidate for error analysis
# pprint(train_cands[0][0])
# vis = Visualizer(pdf_path)

# Display a candidate
# vis.display_candidates([train_cands[0][0]])

## 2.3 Multimodal Featurization
Unlike systems for plain unstructured text, `Fonduer` deals with richly formatted data and therefore featurizes each candidate with a baseline library of multimodal features. 

### Featurize with `Fonduer`'s optimized Postgres Featurizer
We now annotate the candidates in our training, dev, and test sets with features. The `Featurizer` provided by `Fonduer` allows this to be done in parallel to improve performance.

View the API provided by the `Featurizer` on [ReadTheDocs](https://fonduer.readthedocs.io/en/stable/user/features.html#fonduer.features.Featurizer).

At the end of this phase, `Fonduer` has generated the set of candidates and the feature matrix. Note that Phases 1 and 2 are relatively static and are typically executed only once during the KBC process.

In [None]:
from fonduer.features import Featurizer
from fonduer.features.models import Feature
from fonduer.features.feature_extractors import FeatureExtractor

featurizer = Featurizer(
    session, 
    [StationPrice], 
    feature_extractors=FeatureExtractor(["textual", "structural", "tabular", "visual"])
)
has_features = session.query(Feature).count() > 0

if (not has_features):
    # Training set
    %time featurizer.apply(split=0, train=True, parallelism=PARALLEL)
    %time F_train = featurizer.get_feature_matrices(train_cands)
    print(F_train[0].shape)

    # Dev set
    %time featurizer.apply(split=1, parallelism=PARALLEL)
    %time F_dev = featurizer.get_feature_matrices(dev_cands)
    print(F_dev[0].shape)

    # Test set
    %time featurizer.apply(split=2, parallelism=PARALLEL)
    %time F_test = featurizer.get_feature_matrices(test_cands)
    print(F_test[0].shape)
else:
    %time F_train = featurizer.get_feature_matrices(train_cands)
    %time F_dev = featurizer.get_feature_matrices(dev_cands)
    %time F_test = featurizer.get_feature_matrices(test_cands)
    
F = [F_train, F_dev, F_test]
    

# Phase 3: Probabilistic Relation Classification
In this phase, `Fonduer` applies user-defined **labeling functions** to each of the candidates. Labeling functions express various heuristics, patterns, and [weak supervision](http://hazyresearch.github.io/snorkel/blog/weak_supervision.html) strategies to label our data. Their outputs form a label matrix that is used by our data programming engine.

1. Load Gold Data

--- 

Iterate the following steps

2. Create labeling functions
3. Apply labeling functions and measure accuracy of each LF (based on gold data).
4. Build a generative model by combining the labeling functions
5. Iterate on the labeling functions based on the model's score

---

6. Finally, build a discriminative model and test it on the test set

### 3.1) Loading Gold LF

In [None]:
from fonduer.supervision.models import GoldLabel
from electricity_utils import get_gold_func
from fonduer.supervision import Labeler
from my_subclasses import stations_mapping_dict

# 1.) Load the gold data
gold = get_gold_func(gold_file, attribute=ATTRIBUTE, stations_mapping_dict=stations_mapping_dict)
docs = corpus_parser.get_documents()
labeler = Labeler(session, [StationPrice])
%time labeler.apply(docs=docs, lfs=[[gold]], table=GoldLabel, train=True, parallelism=PARALLEL)

### 3.2) Creating Labeling Functions

A labeling function (LF) can return one of three states: `ABSTAIN`, `FALSE` or `TRUE`.

A library of data model utilities that can be used to write labeling functions is outlined on [Read the Docs](http://fonduer.readthedocs.io/en/stable/user/data_model_utils.html). 
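
As a minimal sketch of the three-state convention, the following plain-Python functions mimic two of the heuristics used later in this notebook. They take simple strings instead of `Fonduer` candidates, and the function names and inputs are assumptions for illustration.

```python
ABSTAIN, FALSE, TRUE = -1, 0, 1


def lf_off_peak_header(aligned_ngrams):
    """Vote FALSE when the column header says 'off peak', otherwise abstain."""
    return FALSE if "off peak" in aligned_ngrams else ABSTAIN


def lf_price_range(span):
    """Vote on whether the span is a plausible price; abstain if not numeric."""
    try:
        price = float(span)
    except ValueError:
        return ABSTAIN
    return TRUE if 0 < price < 1000 else FALSE


print(lf_off_peak_header(["off peak", "price"]))  # negative evidence
print(lf_price_range("42.50"))                    # plausible price
print(lf_price_range("n/a"))                      # not a number
```

Returning `ABSTAIN` instead of guessing is what lets many narrow heuristics be combined later: each function only votes where it has evidence.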

### 3.3) Applying the Labeling Functions

Next, we need to actually run the LFs over all of our training candidates, producing a set of `Labels` and `LabelKeys` (just the names of the LFs) in the database. Note that this will delete any existing `Labels` and `LabelKeys` for this candidate set.

View the API provided by the `Labeler` on [ReadTheDocs](https://fonduer.readthedocs.io/en/stable/user/supervision.html#fonduer.supervision.Labeler).

We can also view statistics about the resulting label matrix.
* **Coverage** is the fraction of candidates for which the labeling function emits a non-abstain label.
* **Overlap** is the fraction of candidates for which the labeling function emits a non-abstain label and at least one other labeling function also emits a non-abstain label.
* **Conflict** is the fraction of candidates for which the labeling function emits a non-abstain label and at least one other labeling function emits a different non-abstain label.
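
The three statistics can be computed by hand on a small label matrix. This is a sketch in plain Python that assumes `ABSTAIN = -1` as in this notebook; the helper `lf_stats` and the matrix values are made up for illustration and are not Snorkel's implementation.

```python
ABSTAIN = -1


def lf_stats(L, j):
    """Return (coverage, overlap, conflict) for column j of label matrix L."""
    n = len(L)
    covered = overlap = conflict = 0
    for row in L:
        if row[j] == ABSTAIN:
            continue  # this LF did not vote on the candidate
        covered += 1
        # Votes of all other LFs that did not abstain on this candidate.
        others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
        if others:
            overlap += 1
        if any(v != row[j] for v in others):
            conflict += 1
    return covered / n, overlap / n, conflict / n


# Rows are candidates, columns are labeling functions (hypothetical values).
label_matrix = [
    [1, 1, -1],
    [1, 0, -1],
    [-1, 0, 0],
    [1, -1, -1],
]
for j in range(3):
    print(j, lf_stats(label_matrix, j))
```

A labeling function with high coverage but also high conflict is a good first candidate for inspection during error analysis.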

In addition, because we have already loaded the gold labels, we can view the empirical accuracy of these labeling functions against the gold labels using `LFAnalysis` from [Snorkel](https://github.com/snorkel-team/snorkel).

### 3.4) Build Generative Model

Now, we'll train a model of the LFs to estimate their accuracies. Once the model is trained, we can combine the outputs of the LFs into a single, noise-aware training label set for our extractor. Intuitively, we'll model the LFs by observing how they overlap and conflict with each other. To do so, we use [Snorkel](https://github.com/snorkel-team/snorkel)'s single-task label model.

We then print out the marginal probabilities for each training candidate.
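
To build intuition for what "combining the outputs of the LFs" produces, the sketch below uses an unweighted majority vote. This is a deliberately simpler stand-in, not Snorkel's label model, which additionally learns an accuracy for each LF. The function name and matrix values are assumptions for illustration.

```python
ABSTAIN, FALSE, TRUE = -1, 0, 1


def majority_vote_marginals(L):
    """Turn each row of LF votes into [P(FALSE), P(TRUE)] by vote share."""
    marginals = []
    for row in L:
        votes = [v for v in row if v != ABSTAIN]
        if not votes:
            marginals.append([0.5, 0.5])  # no evidence: stay uniform
            continue
        p_true = sum(1 for v in votes if v == TRUE) / len(votes)
        marginals.append([1.0 - p_true, p_true])
    return marginals


# Rows are candidates, columns are labeling functions (hypothetical values).
votes_matrix = [
    [1, 1, -1],
    [1, 0, -1],
    [-1, -1, -1],
    [0, 0, 1],
]
marginals = majority_vote_marginals(votes_matrix)

# Rows whose two probabilities are (almost) equal carry no training signal.
informative = [i for i, m in enumerate(marginals) if max(m) - min(m) > 1e-6]
print(marginals)
print(informative)
```

The final filter mirrors the step in `train_model` further down, which drops candidates whose marginals differ by at most `1e-6` before training the discriminative model.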

In [None]:
from fonduer.utils.data_model_utils import *
from electricity_utils import eval_LFs
from snorkel.labeling import labeling_function

from snorkel.labeling import LFAnalysis
from snorkel.labeling.model import LabelModel

from fonduer_utils import get_applied_lfs, get_neighbor_cell_ngrams_own, _min_range_diff, min_row_diff, min_col_diff

import matplotlib.pyplot as plt
import re


def run_labeling_functions():
    ABSTAIN = -1
    FALSE = 0
    TRUE = 1

    @labeling_function()
    def LF_other_station_table(c):
        station_span = c.station.context.get_span().lower()
        neighbour_cells = get_neighbor_cell_ngrams_own(c.price, dist=100, directions=True, n_max = 4, absolute = True)
        up_cells = [x for x in neighbour_cells if len(x) > 1 and x[1] == 'DOWN' and x[0] in stations_list]
        # No station name in upper cells
        if (len(up_cells) == 0):
            return ABSTAIN
        # Check if the next upper aligned station-span corresponds to the candidate span (or equivalents)
        closest_header = up_cells[len(up_cells)-1]
        return TRUE if closest_header[0] in stations_mapping_dict[station_span] else FALSE

    @labeling_function()
    def LF_station_non_meta_tag(c):
        html_tags = get_ancestor_tag_names(c.station)
        return FALSE if ('head' in html_tags and 'title' in html_tags) else ABSTAIN

    # Basic constraint for the positive price LFs: none of the negative LFs
    # (wrong station, off-peak column, purchases column) may have voted FALSE.
    def base(c):
        return (
            LF_station_non_meta_tag(c) != FALSE and
            LF_other_station_table(c) != FALSE and
            LF_off_peak_head(c) != FALSE and
            LF_purchases(c) != FALSE
        )

    # 2.) Create labeling functions 
    @labeling_function()
    def LF_on_peak_head(c):
        return TRUE if 'on peak' in get_aligned_ngrams(c.price, n_min=2, n_max=2) and base(c) else ABSTAIN

    @labeling_function()
    def LF_off_peak_head(c):
        return FALSE if 'off peak' in get_aligned_ngrams(c.price, n_min=2, n_max=2) else ABSTAIN

    @labeling_function()
    def LF_price_range(c):
        price = float(c.price.context.get_span())
        return TRUE if price > 0 and price < 1000 and base(c) else FALSE

    @labeling_function()
    def LF_price_head(c):
        return TRUE if 'price' in get_aligned_ngrams(c.price) and base(c) else ABSTAIN

    @labeling_function()
    def LF_firm_head(c):
        return TRUE if 'firm' in get_aligned_ngrams(c.price) and base(c) else ABSTAIN

    @labeling_function()
    def LF_dollar_to_left(c):
        return TRUE if '$' in get_left_ngrams(c.price, window=2) and base(c) else ABSTAIN

    @labeling_function()
    def LF_purchases(c):
        return FALSE if 'purchases' in get_aligned_ngrams(c.price, n_min=1) else ABSTAIN

    station_price_lfs = [
        LF_other_station_table,
        LF_station_non_meta_tag,

        # indicator
        LF_price_range,

        # negative indicators
        LF_off_peak_head,
        LF_purchases,

        # positive indicators
        LF_on_peak_head,    
        LF_price_head,
        LF_firm_head,
        LF_dollar_to_left,
    ]

    # 3.) Apply the LFs on the training set
    labeler = Labeler(session, [StationPrice])
    labeler.apply(split=0, lfs=[station_price_lfs], train=True, clear=True, parallelism=PARALLEL)
    L_train = labeler.get_label_matrices(train_cands)

    # Check that LFs are all applied (avoid crash)
    applied_lfs = L_train[0].shape[1]
    has_non_applied = applied_lfs != len(station_price_lfs)
    print(f"Labeling functions on train_cands not ABSTAIN: {applied_lfs} (/{len(station_price_lfs)})")

    if (has_non_applied):
        applied_lfs = get_applied_lfs(session)
        non_applied_lfs = [l.name for l in station_price_lfs if l.name not in applied_lfs]
        print(f"Labeling functions {non_applied_lfs} are not applied.")
        station_price_lfs = [l for l in station_price_lfs if l.name in applied_lfs]

    # 4.) Evaluate their accuracy
    L_gold_train = labeler.get_gold_labels(train_cands, annotator='gold')
    # Sort LFs for LFAnalysis because LFAnalysis does not sort LFs,
    # while columns of L_train are sorted alphabetically already.
    sorted_lfs = sorted(station_price_lfs, key=lambda lf: lf.name)
    LFAnalysis(L=L_train[0], lfs=sorted_lfs).lf_summary(Y=L_gold_train[0].reshape(-1))

    # 5.) Build generative model
    gen_model = LabelModel(cardinality=2)
    gen_model.fit(L_train[0], n_epochs=500, log_freq=100)

    train_marginals_lfs = gen_model.predict_proba(L_train[0])
    plt.hist(train_marginals_lfs[:, TRUE], bins=20)
    plt.show()

    # Apply on dev-set
    labeler.apply(split=1, lfs=[station_price_lfs], clear=True, parallelism=PARALLEL)
    L_dev = labeler.get_label_matrices(dev_cands)

    L_gold_dev = labeler.get_gold_labels(dev_cands, annotator='gold')
    LFAnalysis(L=L_dev[0], lfs=sorted_lfs).lf_summary(Y=L_gold_dev[0].reshape(-1))
    return (gen_model, train_marginals_lfs)

if (dataset == 'full'):
    (gen_model, train_marginals_lfs) = run_labeling_functions()
    eval_LFs(train_marginals_lfs, train_cands, gold)

In [None]:
# # Query for analysis
# labels = session.query(Label).all()
# gold_labels = session.query(GoldLabel).all()
# gold_labels_map = { gold_label.candidate_id: gold_label for gold_label in gold_labels }

In [None]:
# from fonduer.candidates.models import Candidate

# DB_FALSE = FALSE +1
# DB_ABSTAIN = ABSTAIN +1
# DB_TRUE = TRUE +1

# def get_incorrect_instances(lf):
#     def is_wrong_label(label):
#         if (lf.name not in label.keys):
#             return False # Abstain
#         assigned_label = label.values[label.keys.index(lf.name)]
#         gold_label = gold_labels_map[label.candidate_id] # [x for x in gold_labels if x.candidate_id == label.candidate_id][0]
#         return gold_label.values[0] != assigned_label
#     return [x.candidate for x in labels if is_wrong_label(x)]

In [None]:
# lf = station_price_lfs[5]
# wrong_cands = get_incorrect_instances(lf)

In [None]:
# pprint(f"Labeling Function: {lf.name} has wrongly labelled the candidate(1/{len(wrong_cands)}):")
# if (len(wrong_cands) > 0):
#     wrong_cand = wrong_cands[100]
#     pprint(wrong_cand)
#     pprint('LF is True' if lf(wrong_cand) == 1 else 'LF is False')
#     vis = Visualizer(pdf_path)

#     # Display a candidate
#     vis.display_candidates([wrong_cand])
# else:
#     print("There are no wrong candidates for this labeling function")

## Training the Discriminative Model 

Fonduer uses the machine learning framework [Emmental](https://github.com/SenWu/emmental) to support all model training.

In [None]:
import emmental
import numpy as np

from emmental.modules.embedding_module import EmbeddingModule
from emmental.data import EmmentalDataLoader
from emmental.model import EmmentalModel
from emmental.learner import EmmentalLearner
from fonduer.learning.utils import collect_word_counter
from fonduer.learning.dataset import FonduerDataset
from fonduer.learning.task import create_task

ABSTAIN = -1
FALSE = 0
TRUE = 1

def train_model(cands, F, train_marginals, model_type="LogisticRegression"):
    # Extract candidates and features
    train_cands = cands[0]
    F_train = F[0]
    
    # 1.) Setup training config
    config = {
        "meta_config": {"verbose": True},
        "model_config": {"model_path": None, "device": 0, "dataparallel": False},
        "learner_config": {
            "n_epochs": 50,
            "optimizer_config": {"lr": 0.001, "l2": 0.0},
            "task_scheduler": "round_robin",
        },
        "logging_config": {
            "evaluation_freq": 1,
            "counter_unit": "epoch",
            "checkpointing": False,
            "checkpointer_config": {
                "checkpoint_metric": {f"{ATTRIBUTE}/{ATTRIBUTE}/train/loss": "min"},
                "checkpoint_freq": 1,
                "checkpoint_runway": 2,
                "clear_intermediate_checkpoints": True,
                "clear_all_checkpoints": True,
            },
        },
    }

    emmental.init(Meta.log_path)
    emmental.Meta.update_config(config=config)
    
    # 2.) Collect word counter from training data
    word_counter = collect_word_counter(train_cands)
    
    # 3.) Generate word embedding module for LSTM model
    # (for Logistic Regression we generate it as well, since the Fonduer dataset requires the word2id dict)
    # Generate special tokens
    arity = 2
    specials = []
    for i in range(arity):
        specials += [f"~~[[{i}", f"{i}]]~~"]

    emb_layer = EmbeddingModule(
        word_counter=word_counter, word_dim=300, specials=specials
    )
    
    # 4.) Generate dataloader for training set
    # Filter out noise samples
    diffs = train_marginals.max(axis=1) - train_marginals.min(axis=1)
    train_idxs = np.where(diffs > 1e-6)[0]

    train_dataloader = EmmentalDataLoader(
        task_to_label_dict={ATTRIBUTE: "labels"},
        dataset=FonduerDataset(
            ATTRIBUTE,
            train_cands[0],
            F_train[0],
            emb_layer.word2id,
            train_marginals,
            train_idxs,
        ),
        split="train",
        batch_size=100,
        shuffle=True,
    )
    
    # 5.) Training 
    tasks = create_task(
        ATTRIBUTE, 2, F_train[0].shape[1], 2, emb_layer, model=model_type # "LSTM" 
    )

    model = EmmentalModel(name=f"{ATTRIBUTE}_task")

    for task in tasks:
        model.add_task(task)

    emmental_learner = EmmentalLearner()
    emmental_learner.learn(model, [train_dataloader])
    
    return (model, emb_layer)

In [None]:
from electricity_utils import entity_level_f1
from fonduer_utils import schema_match_filter

price_col_keywords = ["price", "weighted avg."]  
DEBUG = False

def eval_model(model, emb_layer, cands, F, schema_filter=False):
    # Extract candidates and features 
    train_cands = cands[0]
    dev_cands = cands[1]
    test_cands = cands[2] 
    F_train = F[0]
    F_dev = F[1]
    F_test = F[2]
    
    # apply schema filter
    def apply(cands):
        return schema_match_filter(
            cands, 
            "station", 
            "price", 
            price_col_keywords, 
            stations_mapping_dict, 
            0.05,
            DEBUG,
        )  
    
    # Generate dataloader for test data
    test_dataloader = EmmentalDataLoader(
        task_to_label_dict={ATTRIBUTE: "labels"},
        dataset=FonduerDataset(
            ATTRIBUTE, test_cands[0], F_test[0], emb_layer.word2id, 2
        ),
        split="test",
        batch_size=100,
        shuffle=False,
    )
    
    test_preds = model.predict(test_dataloader, return_preds=True)
    positive = np.where(np.array(test_preds["probs"][ATTRIBUTE])[:, TRUE] > 0.6)
    true_pred = [test_cands[0][_] for _ in positive[0]]
    true_pred = apply(true_pred) if schema_filter else true_pred        
    test_results = entity_level_f1(true_pred, gold_file, ATTRIBUTE, test_docs, stations_mapping_dict=stations_mapping_dict)

    # Run on dev and train set for validation
    # We run the predictions also on our training and dev set, to validate that everything seems to work smoothly
    
    # Generate dataloader for dev data
    dev_dataloader = EmmentalDataLoader(
        task_to_label_dict={ATTRIBUTE: "labels"},
        dataset=FonduerDataset(
            ATTRIBUTE, dev_cands[0], F_dev[0], emb_layer.word2id, 2
        ),
        split="test",
        batch_size=100,
        shuffle=False,
    )

    dev_preds = model.predict(dev_dataloader, return_preds=True)
    positive_dev = np.where(np.array(dev_preds["probs"][ATTRIBUTE])[:, TRUE] > 0.6)
    true_dev_pred = [dev_cands[0][_] for _ in positive_dev[0]]
    true_dev_pred = apply(true_dev_pred) if schema_filter else true_dev_pred        
    dev_results = entity_level_f1(true_dev_pred, gold_file, ATTRIBUTE, dev_docs, stations_mapping_dict=stations_mapping_dict)

    # Generate dataloader for train data
    train_dataloader = EmmentalDataLoader(
        task_to_label_dict={ATTRIBUTE: "labels"},
        dataset=FonduerDataset(
            ATTRIBUTE, train_cands[0], F_train[0], emb_layer.word2id, 2
        ),
        split="test",
        batch_size=100,
        shuffle=False,
    )

    train_preds = model.predict(train_dataloader, return_preds=True)
    positive_train = np.where(np.array(train_preds["probs"][ATTRIBUTE])[:, TRUE] > 0.6)
    true_train_pred = [train_cands[0][_] for _ in positive_train[0]]
    true_train_pred = apply(true_train_pred) if schema_filter else true_train_pred        
    train_results = entity_level_f1(true_train_pred, gold_file, ATTRIBUTE, train_docs, stations_mapping_dict=stations_mapping_dict)
 
    return [train_results, dev_results, test_results]

## Evaluating on the Test Set 

In [None]:
# Based on gold labels or labeling functions (gold/full data set)
train_marginals_gold = np.array([[0,1] if gold(x) else [1,0] for x in train_cands[0]])
train_marginals = train_marginals_gold if dataset == 'gold' else train_marginals_lfs

In [None]:
from electricity_utils import summarize_results

# Build model and evaluate for Logistic Regression
(lr_model, lr_emb_layer) = train_model(cands, F, train_marginals, "LogisticRegression" )

print("Evaluate Logistic Regression method")
lr_results = eval_model(lr_model, lr_emb_layer, cands, F)
(prec_total, rec_total, f1_total) = summarize_results(lr_results)
print(f"TOTAL DOCS PAIRWISE (LogisticRegression): Precision={prec_total}, Recall={rec_total}, F1={f1_total}")

print("Evaluate Logistic Regression method with schema matching")
lr_results = eval_model(lr_model, lr_emb_layer, cands, F, True)
(prec_total, rec_total, f1_total) = summarize_results(lr_results)
print(f"TOTAL DOCS PAIRWISE (LogisticRegression): Precision={prec_total}, Recall={rec_total}, F1={f1_total}")

In [None]:
# Build model and evaluate for LSTM
(lstm_model, lstm_emb_layer) = train_model(cands, F, train_marginals, "LSTM" )

print("Evaluate LSTM method")
lstm_results = eval_model(lstm_model, lstm_emb_layer, cands, F)
(prec_total, rec_total, f1_total) = summarize_results(lstm_results)
print(f"TOTAL DOCS PAIRWISE (LSTM): Precision={prec_total}, Recall={rec_total}, F1={f1_total}")


print("Evaluate LSTM method with schema matching")
lstm_results = eval_model(lstm_model, lstm_emb_layer, cands, F, True)
(prec_total, rec_total, f1_total) = summarize_results(lstm_results)
print(f"TOTAL DOCS PAIRWISE (LSTM): Precision={prec_total}, Recall={rec_total}, F1={f1_total}")

# Phase 4: Error Analysis & Iterative KBC 

- Analyze the false positive (FP) and false negative (FN) candidates
- Use the visualization tool to better understand which labeling functions might be responsible
- Test the labeling functions on these candidates to verify they work as expected
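
The FP and FN sets used in this phase follow from comparing predicted entities with gold entities. The sketch below shows that comparison with Python sets. It illustrates the general idea only: the helper `prf1` and the `(document, station, price)` tuples are hypothetical, and the `entity_level_f1` helper used above may differ in its details.

```python
def prf1(predicted, gold):
    """Compare two sets of entities and return TP/FP/FN sets plus scores."""
    tp = predicted & gold   # extracted and correct
    fp = predicted - gold   # extracted but not in the gold data
    fn = gold - predicted   # in the gold data but missed
    precision = len(tp) / len(predicted) if predicted else 0.0
    recall = len(tp) / len(gold) if gold else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return tp, fp, fn, precision, recall, f1


# Hypothetical (document, station, price) entities.
gold_entities = {
    ("DOC1", "NP 15", "42.50"),
    ("DOC1", "PALO VERDE", "38.00"),
    ("DOC2", "NP 15", "40.00"),
}
predicted_entities = {
    ("DOC1", "NP 15", "42.50"),
    ("DOC1", "NP 15", "38.00"),   # price assigned to the wrong station
    ("DOC2", "NP 15", "40.00"),
}

TP, FP, FN, precision, recall, f1 = prf1(predicted_entities, gold_entities)
print(f"FP={sorted(FP)} FN={sorted(FN)}")
print(f"Precision={precision:.2f}, Recall={recall:.2f}, F1={f1:.2f}")
```

In this made-up example a single misassigned price yields one FP and one FN at the same time, which is a typical pattern to look for when a price sits in the wrong station's column.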

In [None]:
# from electricity_utils import entity_to_candidates

# def display_cand(cand_nr):
#     # Get a list of candidates that match the FN[10] entity
#     fp_cands = entity_to_candidates(FP[cand_nr], test_cands[0])

#     # Display a candidate
#     fp_cand = fp_cands[0]
#     print(fp_cand)
#     print(f"Number of FP: {cand_nr}/{len(FP)}")
#     vis.display_candidates([fp_cand])
#     return fp_cand
    
# maximum = len(FP)-1

In [None]:
# from ipywidgets import widgets
# from functools import partial
# from IPython.display import display, clear_output

# class Counter:
#     def __init__(self, initial=0, maximum=0, minimum=0):
#         self.value = initial
#         self.maximum = maximum
#         self.minimum = 0
#         self.cand = display_cand(initial)

#     def increment(self, amount=1):
#         if (self.value+amount > self.maximum):
#             return self.value
#         self.value += amount
#         return self.value
    
#     def decrement(self, amount=1):
#         if (self.value-amount < 0):
#             return self.value
#         self.value -= amount
#         return self.value

#     def __iter__(self, sentinal=False):
#         return iter(self.increment, sentinal)
    
# def display_all(cand_nr):
#     # Clear previous
#     clear_output(wait=True)
#     # Redraw
#     display(minus)
#     display(plus)
#     return display_cand(cand_nr)

# def btn_inc(counter, w):
#     counter.increment()  
#     counter.cand = display_all(counter.value)

# def btn_dec(counter, w):
#     counter.decrement()
#     counter.cand = display_all(counter.value)

# counter = Counter(40, maximum)
# minus = widgets.Button(description='<')
# minus.on_click(partial(btn_dec, counter))

# plus = widgets.Button(description='>')
# plus.on_click(partial(btn_inc, counter))

# display(minus)
# display(plus)

In [None]:
# # Get a list of candidates that match the FN[10] entity
# tp_cands = entity_to_candidates(TP[40], test_cands[0])


# # Display a candidate
# print(f"Number of TP: {len(TP)}")
# print(tp_cands[0])
# vis.display_candidates([tp_cands[0]])

In [None]:
# # Get a list of candidates that match the FN[10] entity
# fn_cands = entity_to_candidates(FN[2], test_cands[0])


# # Display a candidate
# print(f"Number of FN: {len(FN)}")
# print(fn_cands)
# vis.display_candidates([fn_cands[0]])


In [None]:
# result = re.compile(station_rgx, flags=re.I).search(mentions[len(mentions)-7].document.name)
# result

In [None]:
# import spacy 
# from itertools import chain, tee, groupby, product
# from fonduer.utils.data_model_utils.tabular import _get_aligned_sentences
# from itertools import groupby
# import operator

# def get_col(m):
#     s = m.context.sentence
#     if (not s.is_tabular()):
#         return -1
#     if (s.cell.col_start != s.cell.col_end):
#         return -1
#     return s.cell.col_start

# def get_headers(mentions_col):
#     m_sentences = [m.context.sentence for m in mentions_col]
#     min_row = min([x.cell.row_start for x in m_sentences])
#     s = m_sentences[0]
#     aligned = [x.text for x in _get_aligned_sentences(s, axis=1) if x not in m_sentences and x.cell.row_end < min_row]
#     # TODO: HEADER cell-annotation condition
#     return aligned

# def get_sim(mentions_col_it, fid, pos_keyw, id_dict):
#     headers = " , ".join(get_headers(list(mentions_col_it)))
#     pos_keyw_vec = nlp(" , ".join(pos_keyw + id_dict[fid.context.get_span().lower()]))
#     headers_vec = nlp(headers)

#     # vectorize with word2vec and measure the similarity to positive/negative schema column keywords
#     return pos_keyw_vec.similarity(headers_vec)


# def schema_match_filter(cands, id_field, filter_field, pos_keyw = [], id_dict = {}, variance=0.05, DEBUG=False):
#     filtered_cands = []
    
#     # group them by document, itertools requires sorting    
#     cands.sort(key=lambda c: c.document.name)
#     for doc, doc_it in groupby(cands, lambda c: c.document.name):
        
#         # group them by the candidate id field (e.g. all prices for one station-id)
#         doc_cands = list(doc_it)
#         doc_cands.sort(key=lambda c: getattr(c, id_field))
#         for fid, doc_cand_it in groupby(doc_cands, lambda c: getattr(c, id_field)):
        
#             it1, it2, it3 = tee(doc_cand_it, 3)
#             # group by col
#             doc_ms = [getattr(c, filter_field) for c in iter(it1)]
#             doc_ms.sort(key=lambda m: get_col(m))
#             ms_by_cols = { col:list(it) for col, it in groupby(doc_ms, lambda m: get_col(m)) }

#             # ignore non tabular or multi-col/row 
#             if (-1 in ms_by_cols.keys()):
#                 filtered_cands += [c for c in iter(it2) if getattr(c, filter_field) in ms_by_cols[-1]]

#             # Compare headers of each column based on semantic similarity (word vectors)
#             similarities = { col:get_sim(it, fid, pos_keyw, id_dict) for col, it in ms_by_cols.items() if col != -1 }
#             sim_sorted = [(col, sim) for col, sim in sorted(similarities.items(), key=lambda i: i[1], reverse=True)]
#             maximum = sim_sorted[0]
            
#             # If there is a conflict (multiple assigned columns)
#             # only take the maximum similarity as true for this candidate match
#             if (len(sim_sorted) > 1 and DEBUG):
#                 print("#####################################")
#                 print(f"Similarity for {fid.context.get_span()} in doc {doc}")
#                 print(similarities)
#                 print(f"The maximum similarity is for entries in column {maximum}")
#                 print()
#                 for col, it in ms_by_cols.items():
#                     print(f"Col {col} with {len(list(it))} entries and headers:")
#                     pprint(get_headers(list(it)))
#                 print()
             
#             # Filter only the k maximal similar column candidates based on variance
#             for i in sim_sorted:
#                 if (i[1] >= maximum[1]-variance):
#                     if (len(sim_sorted) > 1 and DEBUG):
#                         print("KEEP", i)
#                     filtered_cands += [c for c in iter(it3) if getattr(c, filter_field) in ms_by_cols[i[0]]]
            
            
#           # only max column
# #         counts = { col:len(list(it)) for col, it in ms_by_cols.items() if col != -1 }
# #         maximum = max(counts.items(), key=operator.itemgetter(1))[0]
# #         if (len(counts) > 1):
# #             print("max and all", doc, maximum, counts, get_header(ms_by_cols[maximum][0]))
# #             pprint(ms_by_cols)
# #             print()
            
#     return filtered_cands

# nlp = spacy.load("en_core_web_lg")
# price_pos_keywords = ["price", "firm", "on peak", "weighted avg."]      
# result = schema_match_filter(train_cands[0], "station", "price", price_pos_keywords, stations_mapping_dict)  


In [None]:
# print(len(result), "vs", len(train_cands[0]))