# Extracting Maximum Emitter Base Voltage from Digikey Datasheets

This Jupyter Notebook begins extracting additional relations for transistors (`eb_v_max`, `c_current_max`, `dev_dissipation`, `dc_gain_min`).

We start with `max_emitter_base_voltage`, as shown below.

# Phase 1: KBC Initialization

A new database named `eb_v_max` was created to store the extracted maximum emitter-base voltage ratings.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import sys
import logging

# Configure logging for Fonduer
logging.basicConfig(stream=sys.stdout, format='[%(levelname)s] %(name)s - %(message)s')
log = logging.getLogger('fonduer')
log.setLevel(logging.INFO)

PARALLEL = 4 # assuming a quad-core machine
ATTRIBUTE = "eb_v_max"
conn_string = 'postgresql://localhost:5432/' + ATTRIBUTE

## 1.1 Parsing and Transforming the Input Documents into Unified Data Models

We first initialize a `Meta` object, which manages the connection to the database automatically, and enables us to save intermediate results.

In [2]:
from fonduer import Meta

session = Meta.init(conn_string).Session()

[INFO] fonduer.meta - Connecting user:None to localhost:5432/eb_v_max
[INFO] fonduer.meta - Initializing the storage schema


### Configuring an `HTMLDocPreprocessor`
We start by setting the paths to where our documents are stored, and defining an `HTMLDocPreprocessor` to read in the documents found in the specified paths. `max_docs` specifies the number of documents to parse.

In the `transistor_dataset` that was downloaded as per Luke's instruction, there are 123 HTML files in `/dev/html`, 76 HTML files in `/test/html`, and 2745 HTML files in `/train_digikey/html`.

However, to stay consistent with the previously defined gold labels and to reuse them, this test run parses the hardware tutorial's 100-PDF dataset instead.

In [3]:
from fonduer.parser.preprocessors import HTMLDocPreprocessor
from fonduer.parser import Parser

docs_path = 'data/html/'
pdf_path = 'data/pdf/'

max_docs = 100
doc_preprocessor = HTMLDocPreprocessor(docs_path, max_docs=max_docs)

### Configuring a `Parser`
Next, we configure a `Parser`, which serves as our `CorpusParser` for PDF documents. We use [spaCy](https://spacy.io/) as a preprocessing tool to split our documents into sentences and tokens, and to provide annotations such as part-of-speech tags and dependency parse structures for these sentences. In addition, we can specify which modality information to include in the unified data model for each document. Below, we enable all modality information.

In [4]:
corpus_parser = Parser(session, structural=True, lingual=True, visual=True, pdf_path=pdf_path)
%time corpus_parser.apply(doc_preprocessor, parallelism=PARALLEL)

[INFO] fonduer.utils.udf - Running UDF...




CPU times: user 7.13 s, sys: 409 ms, total: 7.54 s
Wall time: 4min 40s


Checking to ensure consistency in document numbers:

In [5]:
from fonduer.parser.models import Document, Sentence

print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())

Documents: 100
Sentences: 43803


## 1.2 Dividing the Corpus into Test and Train

We'll split the documents 80/10/10 into train/dev/test splits. Note that here we do this in a non-random order to preserve the consistency in the tutorial, and we reference the splits by 0/1/2 respectively.

In [6]:
docs = session.query(Document).order_by(Document.name).all()
ld   = len(docs)

train_docs = set()
dev_docs   = set()
test_docs  = set()
splits = (0.8, 0.9)
data = [(doc.name, doc) for doc in docs]
data.sort(key=lambda x: x[0])
for i, (doc_name, doc) in enumerate(data):
    if i < splits[0] * ld:
        train_docs.add(doc)
    elif i < splits[1] * ld:
        dev_docs.add(doc)
    else:
        test_docs.add(doc)
from pprint import pprint
pprint([x.name for x in train_docs])

['BC547',
 'DISES00490-1',
 'CSEMS05383-1',
 'BC337-D',
 'LTSCS02920-1',
 'MCCCS08818-1',
 'ONSemiconductor_MMBT6521LT1',
 'DISES00242-1',
 'BC337',
 'CentralSemiconductorCorp_CMPT5401ETR',
 'MCCCS08610-1',
 'DISES00645-1',
 'PHGLS19500-1',
 'CentralSemiconductorCorp_CENU45',
 'BC546',
 'DISES00616-1',
 'LITES00690-1',
 'MCCCS08984-1',
 'ONSMS04099-1',
 'CSEMS02742-1',
 '2N6427',
 'FairchildSemiconductor_KSC2310YTA',
 'LITES00689-1',
 'MOTOS04676-1',
 'CentralSemiconductorCorp_CXT4033TR',
 '112823',
 '2N6426-D',
 'LTSCS02912-1',
 'MOTOS03189-1',
 'NXPUSAInc_PBSS5360PASX',
 'CSEMS05382-1',
 'LTSCS02910-1',
 'AUKCS04635-1',
 'DISES00189-1',
 'CSEMS03485-1',
 'DiodesIncorporated_2DD26527',
 'MOTOS04796-1',
 'BC182-D',
 'DISES00023-1',
 'JCSTS01155-1',
 'MMMCS17742-1',
 'CentralSemiconductorCorp_2N4013',
 'BC182',
 'DISES00192-1',
 'INFNS19372-1',
 'MMBT3904',
 'BournsInc_TIP152S',
 '2N4123-D',
 'DIODS00215-1',
 'KECCS05435-1',
 'MOTOS03160-1',
 '2N3906-D',
 'DiodesIncorporated_ZXT690BKTC'

# Phase 2: Mention Extraction, Candidate Extraction, and Multimodal Featurization

Given the unified data model from Phase 1, `Fonduer` extracts relation
candidates based on user-provided **matchers** and **throttlers**. Then,
`Fonduer` leverages the multimodality information captured in the unified data
model to provide multimodal features for each candidate.

## 2.1 Mention Extraction

We start by defining and naming our two mention classes:

In [7]:
from fonduer.candidates.models import mention_subclass

Part = mention_subclass("Part")
EB_Voltage = mention_subclass("EB_Voltage")

### Transistor Part Number Matchers

We reuse the transistor part number matchers defined in the `maximum_storage_tempature.ipynb` tutorial.

In [8]:
from fonduer.candidates.matchers import RegexMatchSpan, DictionaryMatch, LambdaFunctionMatcher, Intersect, Union

### Transistor Naming Conventions as Regular Expressions ###
eeca_rgx = r'([ABC][A-Z][WXYZ]?[0-9]{3,5}(?:[A-Z]){0,5}[0-9]?[A-Z]?(?:-[A-Z0-9]{1,7})?(?:[-][A-Z0-9]{1,2})?(?:\/DG)?)'
jedec_rgx = r'(2N\d{3,4}[A-Z]{0,5}[0-9]?[A-Z]?)'
jis_rgx = r'(2S[ABCDEFGHJKMQRSTVZ]{1}[\d]{2,4})'
others_rgx = r'((?:NSVBC|SMBT|MJ|MJE|MPS|MRF|RCA|TIP|ZTX|ZT|ZXT|TIS|TIPL|DTC|MMBT|SMMBT|PZT|FZT|STD|BUV|PBSS|KSC|CXT|FCX|CMPT){1}[\d]{2,4}[A-Z]{0,5}(?:-[A-Z0-9]{0,6})?(?:[-][A-Z0-9]{0,1})?)'

part_rgx = '|'.join([eeca_rgx, jedec_rgx, jis_rgx, others_rgx])
part_rgx_matcher = RegexMatchSpan(rgx=part_rgx, longest_match_only=True)
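
To sanity-check these patterns outside of Fonduer, a minimal sketch using Python's `re` module is given here. It assumes whole-string, case-sensitive matching via `re.fullmatch`, which only approximates how `RegexMatchSpan` applies a pattern, and the sample strings and the `looks_like_part` helper are illustrative rather than part of the tutorial.

```python
import re

# Same naming-convention patterns as in the cell above.
eeca_rgx = r'([ABC][A-Z][WXYZ]?[0-9]{3,5}(?:[A-Z]){0,5}[0-9]?[A-Z]?(?:-[A-Z0-9]{1,7})?(?:[-][A-Z0-9]{1,2})?(?:\/DG)?)'
jedec_rgx = r'(2N\d{3,4}[A-Z]{0,5}[0-9]?[A-Z]?)'
jis_rgx = r'(2S[ABCDEFGHJKMQRSTVZ]{1}[\d]{2,4})'
others_rgx = r'((?:NSVBC|SMBT|MJ|MJE|MPS|MRF|RCA|TIP|ZTX|ZT|ZXT|TIS|TIPL|DTC|MMBT|SMMBT|PZT|FZT|STD|BUV|PBSS|KSC|CXT|FCX|CMPT){1}[\d]{2,4}[A-Z]{0,5}(?:-[A-Z0-9]{0,6})?(?:[-][A-Z0-9]{0,1})?)'
part_rgx = '|'.join([eeca_rgx, jedec_rgx, jis_rgx, others_rgx])


def looks_like_part(text):
    """Hypothetical helper: True if the whole string matches a part pattern."""
    return re.fullmatch(part_rgx, text) is not None


# Illustrative inputs: four part-number-like strings and one ordinary word.
samples = ['BC547', '2N3906', '2SC1815', 'MMBT3904', 'VOLTAGE']
results = {s: looks_like_part(s) for s in samples}
print(results)
```

Each of the four part-number-like samples is accepted by one of the naming conventions, while the ordinary word is rejected.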

Next, we can create a matcher from a dictionary of known part numbers:

In [9]:
import csv

def get_digikey_parts_set(path):
    """
    Reads in the Digikey part dictionary and returns the set of part numbers.
    """
    all_parts = set()
    with open(path, "r") as csvinput:
        reader = csv.reader(csvinput)
        for line in reader:
            (part, url) = line
            all_parts.add(part)
    return all_parts

### Dictionary of known transistor parts ###
dict_path = 'data/digikey_part_dictionary.csv'
part_dict_matcher = DictionaryMatch(d=get_digikey_parts_set(dict_path))

We can also use user-defined functions to further improve our matchers. For example, here we use patterns in the document filenames as a signal for whether a span of text in a document is a valid transistor part number.

In [10]:
from builtins import range

def common_prefix_length_diff(str1, str2):
    for i in range(min(len(str1), len(str2))):
        if str1[i] != str2[i]:
            return min(len(str1), len(str2)) - i
    return 0

def part_file_name_conditions(attr):
    file_name = attr.sentence.document.name
    if len(file_name.split('_')) != 2: return False
    if attr.get_span()[0] == '-': return False
    name = attr.get_span().replace('-', '')
    return any(char.isdigit() for char in name) and any(char.isalpha() for char in name) and common_prefix_length_diff(file_name.split('_')[1], name) <= 2

add_rgx = r'^[A-Z0-9\-]{5,15}$'

part_file_name_lambda_matcher = LambdaFunctionMatcher(func=part_file_name_conditions)
part_file_name_matcher = Intersect(RegexMatchSpan(rgx=add_rgx, longest_match_only=True), part_file_name_lambda_matcher)
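
To make the filename heuristic more concrete, here is a small sketch of how `common_prefix_length_diff` behaves on a filename from this dataset. The function body mirrors the cell above; the candidate spans being compared are made up for illustration.

```python
def common_prefix_length_diff(str1, str2):
    """Number of characters left in the shorter string after the first mismatch."""
    shortest = min(len(str1), len(str2))
    for i in range(shortest):
        if str1[i] != str2[i]:
            return shortest - i
    return 0


file_name = 'CentralSemiconductorCorp_CMPT5401ETR'
file_part = file_name.split('_')[1]  # 'CMPT5401ETR'

# Illustrative spans that might appear in the document text.
diff_prefix = common_prefix_length_diff(file_part, 'CMPT5401')   # shared prefix
diff_similar = common_prefix_length_diff(file_part, 'CMPT5551')  # diverges late
diff_other = common_prefix_length_diff(file_part, 'BC547')       # diverges at once

print(diff_prefix, diff_similar, diff_other)  # prints: 0 3 5
```

With the `<= 2` threshold used in `part_file_name_conditions`, only the first span would pass this part of the check.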

Then, we can union all of these matchers together to form our final part matcher.

In [11]:
part_matcher = Union(part_rgx_matcher, part_dict_matcher, part_file_name_matcher)

### Emitter - Base Voltage Matchers

Our emitter-base voltage matcher can be a very simple regular expression,
since we know that we are looking for floats and, by inspecting a portion of
our corpus, we see that emitter-base voltages fall within a fairly
narrow range (e.g. `4.0`, `5.0`, `6.0`, `7.0`, up to `10`). Note that the
matcher used below is narrower than that range: it only accepts `5`, `6`, or
`12`, optionally preceded by `-` and followed by `.0`.

The matcher used below is from https://github.com/fonduer-apps/snorkel/blob/semi-structured/tutorials/tables/deprecated/eb_v_max.ipynb

In [12]:
#NOTE: This is super specific. Came from previously defined snorkel matchers at https://github.com/fonduer-apps/snorkel/blob/semi-structured/tutorials/tables/deprecated/eb_v_max.ipynb
eb_voltage_matcher = RegexMatchSpan(rgx=r'\-?([56]|12)(\.0)?', longest_match_only=True)
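
As a quick check of what this pattern accepts, the sketch below runs the same regular expression through Python's `re.fullmatch`. This assumes whole-string matching, which only approximates how `RegexMatchSpan` applies the pattern to a span, and the sample values are illustrative.

```python
import re

# Same pattern as passed to RegexMatchSpan above.
eb_voltage_rgx = r'\-?([56]|12)(\.0)?'

# Illustrative values that could appear in a ratings table.
samples = ['5', '6.0', '-5', '12', '4.0', '7', '10', '60']

accepted = [s for s in samples if re.fullmatch(eb_voltage_rgx, s)]
rejected = [s for s in samples if s not in accepted]

print("accepted:", accepted)  # accepted: ['5', '6.0', '-5', '12']
print("rejected:", rejected)  # rejected: ['4.0', '7', '10', '60']
```

Values such as `4.0`, `7`, and `10` are rejected, so datasheets with those ratings would produce no `EB_Voltage` mention unless the pattern is widened.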

### Define a Mention's `MentionSpace`

Next, in order to define the "space" of all mentions that are even considered
from the document, we need to define a `MentionSpace` for each component of the
relation we wish to extract. Fonduer provides a default `MentionSpace` for you
to use, but you can also extend the default `MentionSpace` depending on your
needs.

In the case of transistor part numbers, the `MentionSpace` can be quite complex
due to the need to handle implicit part numbers that are implied in text like
"BC546A/B/C...BC548A/B/C", which refers to 9 unique part numbers. To handle
these, we consider all n-grams up to 3 words long.

In contrast, the `MentionSpace` for emitter-base voltage values is simpler: no
special handling of part-number ranges is needed, so plain n-grams up to 3
words long are sufficient.

When no special preprocessing is needed, we can use the default
`MentionNgrams` class provided by `fonduer`. For example, if we were looking
to match polarities, which only take the form of "NPN" or "PNP", we could've
used `ngrams = MentionNgrams(n_max=1)`.
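
To illustrate what "all n-grams up to `n_max` words" means, here is a plain-Python sketch. It is not Fonduer's `MentionNgrams` implementation; the `word_ngrams` helper and the example sentence are hypothetical.

```python
def word_ngrams(words, n_max):
    """Return every contiguous span of 1..n_max words, joined by spaces."""
    spans = []
    for start in range(len(words)):
        for length in range(1, n_max + 1):
            if start + length <= len(words):
                spans.append(' '.join(words[start:start + length]))
    return spans


# Illustrative tokenized table row.
sentence = ['Emitter', 'Base', 'Voltage', '6.0', 'V']

unigrams = word_ngrams(sentence, n_max=1)
trigrams = word_ngrams(sentence, n_max=3)

print(len(unigrams), len(trigrams))  # prints: 5 12
print(trigrams[:3])  # prints: ['Emitter', 'Emitter Base', 'Emitter Base Voltage']
```

A larger `n_max` grows the mention space quickly, which is one reason to keep it as small as the target values allow.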

In [13]:
from hardware_spaces import MentionNgramsPart, MentionNgrams
    
part_ngrams = MentionNgramsPart(parts_by_doc=None, n_max=3)
eb_voltage_ngrams = MentionNgrams(n_max=3)

### Running Mention Extraction 

Next, we create a `MentionExtractor` to extract the mentions from all of
our documents based on the `MentionSpace` and matchers we defined above.

View the API for the MentionExtractor on [ReadTheDocs](https://fonduer.readthedocs.io/en/latest/user/candidates.html#fonduer.candidates.MentionExtractor).

In [14]:
from fonduer.candidates import MentionExtractor 

mention_extractor = MentionExtractor(
    session, [Part, EB_Voltage], [part_ngrams, eb_voltage_ngrams], [part_matcher, eb_voltage_matcher]
)

Then, we run the extractor on all of our documents.

In [15]:
from fonduer.candidates.models import Mention

mention_extractor.apply(docs, parallelism=PARALLEL)

print("Total Mentions: {}".format(session.query(Mention).count()))

[INFO] fonduer.candidates.mentions - Clearing table: part
[INFO] fonduer.candidates.mentions - Clearing table: eb__voltage
[INFO] fonduer.utils.udf - Running UDF...




Total Mentions: 3970


## 2.2 Candidate Extraction

Now that we have both defined and extracted the Mentions that can be used to compose Candidates, we are ready to move on to extracting Candidates. As we did with the Mentions, we first define what each candidate schema looks like. In this example, we create a candidate that is composed of a `Part` and an `EB_Voltage` mention as we defined above. We name this candidate "PartEBVoltage".

In [16]:
from fonduer.candidates.models import candidate_subclass

PartEBVoltage = candidate_subclass("PartEBVoltage", [Part, EB_Voltage])

### Defining candidate `Throttlers`

Here, we create a throttler that discards candidates if they are in the same table, but the part and max emitter base voltage are not vertically or horizontally aligned.

In [17]:
from fonduer.utils.data_model_utils import *
import re

def eb_voltage_filter(c):
    (part, attr) = c
    if same_table((part, attr)):
        return (is_horz_aligned((part, attr)) or is_vert_aligned((part, attr)))
    return True

eb_voltage_throttler = eb_voltage_filter
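
The effect of such a throttler can be sketched without Fonduer. In the toy example below, mentions are reduced to a hypothetical `ToyMention` tuple with table, row, and column indices, and sharing a row or a column is used as a stand-in for alignment; Fonduer's own helpers (`same_table`, `is_horz_aligned`, `is_vert_aligned`) work on the parsed document instead.

```python
from collections import namedtuple
from itertools import product

# Hypothetical stand-in for a mention: its text plus its table position.
ToyMention = namedtuple('ToyMention', ['text', 'table', 'row', 'col'])


def toy_throttler(pair):
    """Keep a pair unless both mentions share a table but neither row nor column."""
    part, voltage = pair
    if part.table is not None and part.table == voltage.table:
        return part.row == voltage.row or part.col == voltage.col
    return True


parts = [ToyMention('BC547', 0, 0, 1), ToyMention('BC548', 0, 0, 2)]
voltages = [
    ToyMention('6.0', 0, 3, 1),  # same table, same column as BC547
    ToyMention('5.0', 0, 3, 2),  # same table, same column as BC548
    ToyMention('12', 1, 0, 0),   # different table
]

# Candidates start as the cross product of mentions; the throttler prunes them.
kept = [(p.text, v.text) for p, v in product(parts, voltages) if toy_throttler((p, v))]
print(kept)
```

Of the six possible pairs, the two misaligned same-table pairs are dropped, and pairs that span different tables are kept.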

### Running the `CandidateExtractor`

Now we have all the components necessary to perform candidate extraction. We have defined the Mentions that compose each candidate and a throttler to prune away excess candidates. We can now define the `CandidateExtractor` with the candidate subclass and the corresponding throttler to use.

View the API for the CandidateExtractor on [ReadTheDocs](https://fonduer.readthedocs.io/en/docstrings/user/candidates.html#fonduer.candidates.CandidateExtractor).

In [18]:
from fonduer.candidates import CandidateExtractor


candidate_extractor = CandidateExtractor(session, [PartEBVoltage], throttlers=[eb_voltage_throttler])

In [19]:
# Use a separate loop variable so the full `docs` list is not overwritten.
for i, split_docs in enumerate([train_docs, dev_docs, test_docs]):
    candidate_extractor.apply(split_docs, split=i, parallelism=PARALLEL)
    print("Number of Candidates in split={}: {}".format(i, session.query(PartEBVoltage).filter(PartEBVoltage.split == i).count()))

train_cands = candidate_extractor.get_candidates(split = 0)
dev_cands = candidate_extractor.get_candidates(split = 1)
test_cands = candidate_extractor.get_candidates(split = 2)

[INFO] fonduer.candidates.candidates - Clearing table part_eb_voltage (split 0)
[INFO] fonduer.utils.udf - Running UDF...




Number of Candidates in split=0: 70957
[INFO] fonduer.candidates.candidates - Clearing table part_eb_voltage (split 1)
[INFO] fonduer.utils.udf - Running UDF...




Number of Candidates in split=1: 1912
[INFO] fonduer.candidates.candidates - Clearing table part_eb_voltage (split 2)
[INFO] fonduer.utils.udf - Running UDF...




Number of Candidates in split=2: 3027


In [20]:
from fonduer.candidates.models import Candidate

print("Total Candidates: {}".format(session.query(Candidate).count()))

Total Candidates: 75896


## 2.3 Multimodal Featurization
Unlike dealing with plain unstructured text, `Fonduer` deals with richly formatted data, and consequently featurizes each candidate with a baseline library of multimodal features. 

### Featurize with `Fonduer`'s optimized Postgres Featurizer
We now annotate the candidates in our training, dev, and test sets with features. The `Featurizer` provided by `Fonduer` allows this to be done in parallel to improve performance.

View the API provided by the `Featurizer` on [ReadTheDocs](https://fonduer.readthedocs.io/en/stable/user/features.html#fonduer.features.Featurizer).

In [None]:
from fonduer.features import Featurizer

featurizer = Featurizer(session, [PartEBVoltage])
%time featurizer.apply(split=0, train=True, parallelism=PARALLEL)
%time F_train = featurizer.get_feature_matrices(train_cands)

[INFO] fonduer.features.featurizer - Clearing Features (split 0)
[INFO] fonduer.utils.udf - Running UDF...



In [None]:
print(F_train[0].shape)
%time featurizer.apply(split=1, parallelism=PARALLEL)
%time F_dev = featurizer.get_feature_matrices(dev_cands)
print(F_dev[0].shape)

In [None]:
%time featurizer.apply(split=2, parallelism=PARALLEL)
%time F_test = featurizer.get_feature_matrices(test_cands)
print(F_test[0].shape)

At the end of this phase, `Fonduer` has generated the set of candidates and the feature matrix. Note that Phases 1 and 2 are relatively static and are typically executed only once during the KBC process.

# Phase 3: Probabilistic Relation Classification
In this phase, `Fonduer` applies user-defined **labeling functions**, which express various heuristics, patterns, and [weak supervision](http://hazyresearch.github.io/snorkel/blog/weak_supervision.html) strategies to label our data, to each of the candidates to create a label matrix that is used by our data programming engine.

In the wild, hand-labeled training data is rare and expensive. A common scenario is to have access to tons of unlabeled training data, and have some idea of how to label it programmatically. For example:
* We may be able to think of text patterns that would indicate a part and an emitter-base voltage mention are related, for example the word "emitter" appearing in the same row as the voltage.
* We may have access to an external knowledge base that lists some pairs of parts and emitter-base voltages, and can use these to noisily label some of our mention pairs.

Our labeling functions will capture these types of strategies. We know that these labeling functions will not be perfect, and some may be quite low-quality, so we will model their accuracies with a generative model, which `Fonduer` will help us easily apply.
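
As a rough illustration of the idea, the sketch below applies two toy labeling functions to hand-written sets of row words. Real labeling functions receive Fonduer candidate objects; the function names, keywords, and rows here are hypothetical and only show the `1` / `0` / `-1` voting convention.

```python
def lf_vebo_in_row(row_words):
    """Vote positive when the row mentions the emitter-base voltage symbol."""
    return 1 if 'vebo' in row_words else 0


def lf_current_in_row(row_words):
    """Vote negative when the row is about a current rating."""
    return -1 if 'current' in row_words else 0


# Hypothetical lower-cased words found in the table row of each candidate.
rows = [
    {'emitter', 'base', 'voltage', 'vebo', 'v'},
    {'collector', 'current', 'ic', 'ma'},
    {'storage', 'temperature'},
]

toy_lfs = [lf_vebo_in_row, lf_current_in_row]

# One row per candidate, one column per labeling function.
label_matrix = [[lf(row) for lf in toy_lfs] for row in rows]
print(label_matrix)  # prints: [[1, 0], [0, -1], [0, 0]]
```

A `0` means the labeling function abstains, which is why the third candidate receives no vote at all.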

Using data programming, we can then train machine learning models to learn which features are the most important in classifying candidates.

### Loading Gold Data
For convenience in error analysis and evaluation, we have already annotated the dev and test set for this tutorial, and we'll now load it using an externally-defined helper function. If you're interested in the example implementation details, please see the script we now load:

In [None]:
from hardware_utils import load_hardware_labels

gold_file = 'data/hardware_tutorial_gold.csv'
load_hardware_labels(session, PartEBVoltage, gold_file, ATTRIBUTE, annotator_name='gold')

## Labeling Functions

Below is a list of patterns that I've noticed in various datasheets:
- The emitter-base voltage symbol (found in the same row as `EB_Voltage`) is usually the following: V<sub>EBO</sub>
- The `EB_Voltage` is also often found in the same row as a V (indicating voltage).

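The labeling functions defined below also come from an old Snorkel [repository](https://github.com/fonduer-apps/snorkel/blob/semi-structured/tutorials/tables/deprecated/eb_v_max.ipynb). They appear to be written against the older Snorkel API (for example, `c.voltage.parent`), and several keyword sets still target collector-emitter voltage (e.g. `vceo`) while listing `ebo` and `vebo` as negative keywords, so they likely need adapting before they are applied to `PartEBVoltage` candidates: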

In [28]:
# Helpers
def set_all_in_set(a, b):
    '''return true if all of a is in b'''
    return b.issuperset(a)

def set_none_in_set(a, b):
    '''return true if none of a is in b'''
    return (b.difference(a) == b)

def set_any_in_set(a, b):
    '''return true if any of a is in b'''
    return len(b.intersection(a)) > 0

LFs = []

###################################################################
# POSITIVE
###################################################################

def LF_voltage_inside_table(c):
    return 1 if c.voltage.parent.row is not None else 0
LFs.append(LF_voltage_inside_table)

# def LF_part_is_aligned(c):
#     return 1 if (c.part.parent.table == c.voltage.parent.table and
#                 (c.part.parent.row_num == c.voltage.parent.row_num or
#                  c.part.parent.col_num == c.voltage.parent.col_num)) else 0
# LFs.append(LF_part_is_aligned)
    
def LF_ce_keywords(c):
    individuals = set(['collector', 'emitter', 'voltage'])
    together = set(['collector-emitter', 'voltage'])
    row_ngrams = set(x.replace(' ', '') for x in get_row_ngrams(c.voltage, infer=True))
    if set_all_in_set(individuals, row_ngrams):
        return 1
    if set_all_in_set(together, row_ngrams):
        return 1
    return 0
LFs.append(LF_ce_keywords)

def LF_pos_keywords_in_row(c):
    pos_keys = set(['v ceo', 'ceo', 'vceo', 'value', 'rating'])
    ngrams = set(get_row_ngrams(c.voltage, infer=True))
    if set_any_in_set(pos_keys, ngrams):
        return 1
    else:
        return 0
LFs.append(LF_pos_keywords_in_row)

def LF_low_table_num(c):
    if c.voltage.parent.table <= 2:
        return 1
    else:
        return -1
LFs.append(LF_low_table_num)

def LF_whole_phrase_in_row(c):
    row_ngrams = set(get_row_ngrams(c.voltage))
    if 'collector-emitter voltage' in row_ngrams:
        return 1
    else:
        return 0
LFs.append(LF_whole_phrase_in_row)


###################################################################
# NEGATIVE
###################################################################

def LF_specific_neg_row_keywords(c):
    left_ngrams = set(get_row_ngrams(c.voltage, infer=True))
    neg_keys = set(['continuous', 'dc', 'cut-off'])
    if set_any_in_set(neg_keys, left_ngrams):
        return -1
    else:
        return 0
LFs.append(LF_specific_neg_row_keywords)

def LF_equals_in_row(c):
    row_ngrams = set(get_row_ngrams(c.voltage))
    if '=' in row_ngrams:
        return -1
    else:
        return 0
LFs.append(LF_equals_in_row)

def LF_i_in_row(c):
    row_ngrams = set(get_row_ngrams(c.voltage))
    if 'i' in row_ngrams:
        return -1
    else:
        return 0
LFs.append(LF_i_in_row)

def LF_first_row(c):
    if c.voltage.parent.row_num == 0:
        return -1
    else:
        return 0
LFs.append(LF_first_row)
    
def LF_not_ce_relevant(c):
    ce_keywords = set(['collector', 'emitter', 'collector-emitter'])
    ngrams = set(get_aligned_ngrams(c.voltage))
    if not set_any_in_set(ce_keywords, ngrams):
        return -1
    else:
        return 1
LFs.append(LF_not_ce_relevant)

def LF_too_many_numbers_row(c):
    num_numbers = list(get_row_ngrams(c.voltage, attrib="ner_tags")).count('number')
    return -1 if num_numbers >= 4 else 0
LFs.append(LF_too_many_numbers_row)

def LF_negative_keywords(c):
    row_neg_keys = set(['ambient',
                    'small-signal',
                    'cut-off',
                    'na',
                    'ma',
                    'cex',
                    'resistance',
                    'power',
                    'junction',
                    'dissipation', 
                    'breakdown',
                    'current',
                    'cbo',
                    'vcbo',
                    'peak',
                    '=',
                    'f',
                    'p',
                    'base',
                    'mw',
                    'ebo',
                    'vebo',
                    'i c',
                    'total',
                    'device',
                    'c',
                    'mhz',
                    'temperature',
                    'saturation',
                    'operating',
                    'storage',
                    'bandwidth',
                    'derate',
                    'above',
                    'product',
                    'figure',
                    'conditions',
                    'current gain',
                    'saturation',
                    'min',
                    'min.',
                    'typ',
                    'typ.',
                    'max',
                    'max.',
                    'gain',
                    'p',
                    'thermal',
                    'test'])
    row_ngrams = set(get_row_ngrams(c.voltage))
    col_ngrams = set(get_col_ngrams(c.voltage))
    col_neg_keys = set(['conditions', 
                        'condition', 
                        'parameter', 
                        'min',
                        'min.',
                        'typ',
                        'typ.',
                        'max',
                        'max.',
                        'test'])
    if set_any_in_set(row_neg_keys, row_ngrams):
        return -1
    if set_any_in_set(col_neg_keys, col_ngrams):
        return -1
    
    return 0

LFs.append(LF_negative_keywords)

### Applying the Labeling Functions

Next, we need to actually run the LFs over all of our training candidates, producing a set of `Labels` and `LabelKeys` (just the names of the LFs) in the database. Note that this will delete any existing `Labels` and `LabelKeys` for this candidate set.

View the API provided by the `Labeler` on [ReadTheDocs](https://fonduer.readthedocs.io/en/stable/user/supervision.html#fonduer.supervision.Labeler).

In [None]:
from fonduer.supervision import Labeler

labeler = Labeler(session, [PartEBVoltage])
%time labeler.apply(split=0, lfs=[LFs], train=True, parallelism=PARALLEL)
%time L_train = labeler.get_label_matrices(train_cands)

We can also view statistics about the resulting label matrix.
* **Coverage** is the fraction of candidates that the labeling function emits a non-zero label for.
* **Overlap** is the fraction of candidates that the labeling function emits a non-zero label for and that another labeling function emits a non-zero label for.
* **Conflict** is the fraction of candidates that the labeling function emits a non-zero label for and that another labeling function emits a conflicting non-zero label for.
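
These three definitions can be worked through on a tiny, made-up label matrix. The sketch below is plain Python and is not how MeTaL computes its summary; the `lf_stats` helper and the matrix `L_toy` are hypothetical.

```python
def lf_stats(L):
    """Per-LF coverage, overlap, and conflict for a dense label matrix.

    L is a list of rows (one per candidate); each row holds one label per
    labeling function, with 0 meaning "abstain".
    """
    n_rows = len(L)
    n_lfs = len(L[0])
    stats = []
    for j in range(n_lfs):
        covered = overlapped = conflicted = 0
        for row in L:
            if row[j] == 0:
                continue  # this LF abstained on this candidate
            covered += 1
            others = [row[k] for k in range(n_lfs) if k != j and row[k] != 0]
            if others:
                overlapped += 1
            if any(label != row[j] for label in others):
                conflicted += 1
        stats.append({
            'coverage': covered / n_rows,
            'overlap': overlapped / n_rows,
            'conflict': conflicted / n_rows,
        })
    return stats


# Four candidates (rows) labeled by three labeling functions (columns).
L_toy = [
    [1, 0, -1],
    [1, 1, 0],
    [0, 0, 0],
    [-1, 0, -1],
]

toy_stats = lf_stats(L_toy)
for j, s in enumerate(toy_stats):
    print(j, s)
```

Here the first labeling function covers three of four candidates (0.75), always overlaps with another one (0.75), and disagrees with another one only on the first candidate (0.25).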

In addition, because we have already loaded the gold labels, we can view the empirical accuracy of these labeling functions when compared to our gold labels using the `analysis` module of [MeTaL](https://github.com/HazyResearch/metal).

In [None]:
from fonduer.supervision import get_gold_labels
L_gold_train = get_gold_labels(session, train_cands, annotator_name='gold')

In [None]:
from metal import analysis

analysis.lf_summary(L_train[0], lf_names=labeler.get_keys(), Y=L_gold_train[0].todense().reshape(-1,).tolist()[0])

### Fitting the Generative Model

Now, we'll train a model of the LFs to estimate their accuracies. Once the model is trained, we can combine the outputs of the LFs into a single, noise-aware training label set for our extractor. Intuitively, we'll model the LFs by observing how they overlap and conflict with each other. To do so, we use [MeTaL](https://github.com/HazyResearch/metal)'s single-task label model.

In [None]:
from metal.label_model import LabelModel

gen_model = LabelModel(k=2)
%time gen_model.train_model(L_train[0], n_epochs=500, print_every=100)
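
For intuition about what the label model improves on, a simple unweighted majority vote is sketched below. This is a different and much cruder technique than MeTaL's `LabelModel`, which estimates a separate accuracy for each labeling function; the `majority_vote` helper and the matrix `L_toy` are hypothetical.

```python
def majority_vote(row):
    """Combine one candidate's LF votes, treating every LF as equally reliable."""
    total = sum(row)
    if total > 0:
        return 1
    if total < 0:
        return -1
    return 0  # tie, or every labeling function abstained


# Four candidates (rows) labeled by three labeling functions (columns).
L_toy = [
    [1, 1, -1],
    [0, -1, -1],
    [1, -1, 0],
    [0, 0, 0],
]

votes = [majority_vote(row) for row in L_toy]
print(votes)  # prints: [1, -1, 0, 0]
```

The third candidate ends in a tie because both voters count equally; a model that has learned one of them is more accurate could break that tie, which is the motivation for fitting the generative model.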

We now apply the generative model to the training candidates to get the noise-aware training label set. We'll refer to these as the training marginals:

In [None]:
train_marginals = gen_model.predict_proba(L_train[0])[:, 1]

We'll look at the distribution of the training marginals:

In [None]:
import matplotlib.pyplot as plt
plt.hist(train_marginals, bins=20)
plt.show()