# Extracting DC Gain Min from Digikey Datasheets

This Jupyter Notebook begins extracting additional relations from transistor datasheets (`eb_v_max`, `c_current_max`, `dev_dissipation`, `dc_gain_min`).

We start with `dc_gain_min`, as shown below.

## KBC Initialization

We created a new database named `dc_gain_min` for extracting the minimum DC current gain.

In [39]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import sys
import logging

# Configure logging for Fonduer
logging.basicConfig(stream=sys.stdout, format='[%(levelname)s] %(name)s - %(message)s')
log = logging.getLogger('fonduer')
log.setLevel(logging.INFO)

PARALLEL = 4 # changed to 12 for watchog.stanford.edu
ATTRIBUTE = "dc_gain_min"
conn_string = 'postgresql://nchiang:postgres@localhost:5432/' + ATTRIBUTE

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 1.1 Parsing and Transforming the Input Documents into Unified Data Models

We first initialize a `Meta` object, which manages the connection to the database automatically, and enables us to save intermediate results.

In [4]:
from fonduer import Meta

session = Meta.init(conn_string).Session()

[INFO] fonduer.meta - Connecting user:nchiang to localhost:5432/dc_gain_min
[INFO] fonduer.meta - Initializing the storage schema


### Configuring an `HTMLDocPreprocessor`
We start by setting the paths to where our documents are stored, and defining an `HTMLDocPreprocessor` to read in the documents found in the specified paths. `max_docs` specifies the number of documents to parse.

In the `transistor_dataset` that was downloaded as per Luke's instruction, there are 123 HTML files in `/dev/html`, 76 HTML files in `/test/html`, 2745 HTML files in `/train_digikey/html`. 

However, to stay consistent with the previously defined gold labels and to reuse them, this test run parses the hardware tutorial's 100-PDF dataset instead.

In [5]:
from fonduer.parser.preprocessors import HTMLDocPreprocessor
from fonduer.parser import Parser

docs_path = 'data/html/'
pdf_path = 'data/pdf/'

max_docs = 100
doc_preprocessor = HTMLDocPreprocessor(docs_path, max_docs=max_docs)

### Configuring a `Parser`
Next, we configure a `Parser`, which serves as our `CorpusParser` for PDF documents. We use [spaCy](https://spacy.io/) as a preprocessing tool to split our documents into sentences and tokens, and to provide annotations such as part-of-speech tags and dependency parse structures for these sentences. In addition, we can specify which modality information to include in the unified data model for each document. Below, we enable all modality information.

In [6]:
corpus_parser = Parser(session, parallelism=PARALLEL, structural=True, lingual=True, visual=True, pdf_path=pdf_path)
%time corpus_parser.apply(doc_preprocessor, pdf_path=pdf_path, parallelism=PARALLEL)

[INFO] fonduer.utils.udf - Running UDF...


HBox(children=(IntProgress(value=0), HTML(value='')))


    Linking successful
    /home/nchiang/digikey-transistors/.venv/lib/python3.6/site-packages/en_core_web_sm
    -->
    /home/nchiang/digikey-transistors/.venv/lib/python3.6/site-packages/spacy/data/en

    You can now load the model via spacy.load('en')


CPU times: user 8.75 s, sys: 939 ms, total: 9.69 s
Wall time: 6min 36s


Checking to ensure consistency in document numbers:

In [6]:
from fonduer.parser.models import Document, Sentence

print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())

Documents: 100
Sentences: 43803


## 1.2 Dividing the Corpus into Test and Train
We'll split the documents 80/10/10 into train/dev/test splits. Note that here we do this in a non-random order to preserve the consistency in the tutorial, and we reference the splits by 0/1/2 respectively.
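
As a minimal sketch of the positional split described above, the following applies the same thresholds to plain strings instead of `Document` objects. The function name `split_by_position` and the document names are hypothetical and exist only for illustration.

```python
def split_by_position(names, splits=(0.8, 0.9)):
    """Assign sorted names to train/dev/test by position, without shuffling."""
    ordered = sorted(names)
    n = len(ordered)
    train, dev, test = [], [], []
    for i, name in enumerate(ordered):
        if i < splits[0] * n:
            train.append(name)
        elif i < splits[1] * n:
            dev.append(name)
        else:
            test.append(name)
    return train, dev, test

# Hypothetical document names, used only for illustration.
names = [f"doc_{k:03d}" for k in range(100)]
train, dev, test = split_by_position(names)
print(len(train), len(dev), len(test))
```

Because the names are sorted first, rerunning the split always yields the same assignment, which is what keeps the gold labels aligned with the dev and test sets.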



In [7]:
docs = session.query(Document).order_by(Document.name).all()
ld   = len(docs)

train_docs = set()
dev_docs   = set()
test_docs  = set()
splits = (0.8, 0.9)
data = [(doc.name, doc) for doc in docs]
data.sort(key=lambda x: x[0])
for i, (doc_name, doc) in enumerate(data):
    if i < splits[0] * ld:
        train_docs.add(doc)
    elif i < splits[1] * ld:
        dev_docs.add(doc)
    else:
        test_docs.add(doc)
from pprint import pprint
pprint([x.name for x in train_docs])

['BC546A_Series_B14-521026',
 'DiodesIncorporated_ZXT690BKTC',
 'CentralSemiconductorCorp_CMPT5401ETR',
 'FAIRS19194-1',
 'LTSCS02910-1',
 'MMBT3904',
 'BC546-BC548C(TO-92)',
 'DIODS00215-1',
 'CentralSemiconductorCorp_CXT4033TR',
 'FAIRS25065-1',
 'LTSCS02912-1',
 'MMMCS17742-1',
 'DIODS13249-1',
 'LTSCS02920-1',
 'CSEMS02742-1',
 'Infineon-BC817KSERIES_BC818KSERIES-DS-v01_01-en',
 'MOTOS03160-1',
 '112823',
 'DISES00023-1',
 'MCCCS08610-1',
 'CSEMS03485-1',
 'Infineon-BC857SERIES_BC858SERIES_BC859SERIES_BC860SERIES-DS-v01_01-en',
 'MOTOS03189-1',
 '2N3906',
 'BC547',
 'DISES00189-1',
 '2N3906-D',
 'CSEMS05382-1',
 'INFNS19372-1',
 'MCCCS08818-1',
 'MOTOS04676-1',
 '2N4123-D',
 'DISES00192-1',
 'MCCCS08984-1',
 'CSEMS05383-1',
 'JCSTS01155-1',
 'MOTOS04796-1',
 '2N4124',
 'BC818',
 'DISES00242-1',
 '2N6426-D',
 'DiodesIncorporated_2DD26527',
 'KECCS03676-1',
 'MCCCS09540-1',
 'MINDS00015-1',
 'LITES00690-1',
 'NXPUSAInc_PBSS5360PASX',
 '2N6427',
 'BC818-40LT1-D',
 'DISES00490-1',
 'Di

# Phase 2: Mention Extraction, Candidate Extraction, and Multimodal Featurization

Given the unified data model from Phase 1, `Fonduer` extracts relation
candidates based on user-provided **matchers** and **throttlers**. Then,
`Fonduer` leverages the multimodality information captured in the unified data
model to provide multimodal features for each candidate.

## 2.1 Mention Extraction

We first start by defining and naming our two `mention`s:

In [8]:
from fonduer.candidates.models import mention_subclass

Part = mention_subclass("Part")
DCGain = mention_subclass("DCGain")

### Transistor Part Number Matchers

We reuse the transistor part number matchers previously defined in the `maximum_storage_tempature.ipynb` tutorial.

In [9]:
from fonduer.candidates.matchers import RegexMatchSpan, DictionaryMatch, LambdaFunctionMatcher, Intersect, Union

### Transistor Naming Conventions as Regular Expressions ###
eeca_rgx = r'([ABC][A-Z][WXYZ]?[0-9]{3,5}(?:[A-Z]){0,5}[0-9]?[A-Z]?(?:-[A-Z0-9]{1,7})?(?:[-][A-Z0-9]{1,2})?(?:\/DG)?)'
jedec_rgx = r'(2N\d{3,4}[A-Z]{0,5}[0-9]?[A-Z]?)'
jis_rgx = r'(2S[ABCDEFGHJKMQRSTVZ]{1}[\d]{2,4})'
others_rgx = r'((?:NSVBC|SMBT|MJ|MJE|MPS|MRF|RCA|TIP|ZTX|ZT|ZXT|TIS|TIPL|DTC|MMBT|SMMBT|PZT|FZT|STD|BUV|PBSS|KSC|CXT|FCX|CMPT){1}[\d]{2,4}[A-Z]{0,5}(?:-[A-Z0-9]{0,6})?(?:[-][A-Z0-9]{0,1})?)'

part_rgx = '|'.join([eeca_rgx, jedec_rgx, jis_rgx, others_rgx])
part_rgx_matcher = RegexMatchSpan(rgx=part_rgx, longest_match_only=True)
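
To get a feel for what the combined pattern accepts, here is a rough check with Python's `re` module. It assumes that requiring a full-string match approximates how `RegexMatchSpan` treats a span; the matcher may apply its own flags, so treat this as a sketch. `2SC1815` is a hypothetical JIS-style sample that does not come from the dataset.

```python
import re

# Patterns copied from the cell above.
eeca_rgx = r'([ABC][A-Z][WXYZ]?[0-9]{3,5}(?:[A-Z]){0,5}[0-9]?[A-Z]?(?:-[A-Z0-9]{1,7})?(?:[-][A-Z0-9]{1,2})?(?:\/DG)?)'
jedec_rgx = r'(2N\d{3,4}[A-Z]{0,5}[0-9]?[A-Z]?)'
jis_rgx = r'(2S[ABCDEFGHJKMQRSTVZ]{1}[\d]{2,4})'
others_rgx = r'((?:NSVBC|SMBT|MJ|MJE|MPS|MRF|RCA|TIP|ZTX|ZT|ZXT|TIS|TIPL|DTC|MMBT|SMMBT|PZT|FZT|STD|BUV|PBSS|KSC|CXT|FCX|CMPT){1}[\d]{2,4}[A-Z]{0,5}(?:-[A-Z0-9]{0,6})?(?:[-][A-Z0-9]{0,1})?)'

pattern = re.compile('|'.join([eeca_rgx, jedec_rgx, jis_rgx, others_rgx]))

def looks_like_part(text):
    """Return True when the whole string matches one of the naming conventions."""
    return pattern.fullmatch(text) is not None

for sample in ["BC547", "2N3906", "2SC1815", "MMBT3904", "NPN", "100"]:
    print(sample, looks_like_part(sample))
```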

Next, we can create a matcher from a dictionary of known part numbers:

In [10]:
import csv

def get_digikey_parts_set(path):
    """
    Reads in the Digi-Key part dictionary and returns the set of part numbers.
    """
    all_parts = set()
    with open(path, "r") as csvinput:
        reader = csv.reader(csvinput)
        for line in reader:
            (part, url) = line
            all_parts.add(part)
    return all_parts

### Dictionary of known transistor parts ###
dict_path = 'data/digikey_part_dictionary.csv'
part_dict_matcher = DictionaryMatch(d=get_digikey_parts_set(dict_path))

We can also use user-defined functions to further improve our matchers. For example, here we use patterns in the document filenames as a signal for whether a span of text in a document is a valid transistor part number.

In [11]:
from builtins import range

def common_prefix_length_diff(str1, str2):
    for i in range(min(len(str1), len(str2))):
        if str1[i] != str2[i]:
            return min(len(str1), len(str2)) - i
    return 0

def part_file_name_conditions(attr):
    file_name = attr.sentence.document.name
    if len(file_name.split('_')) != 2: return False
    if attr.get_span()[0] == '-': return False
    name = attr.get_span().replace('-', '')
    return any(char.isdigit() for char in name) and any(char.isalpha() for char in name) and common_prefix_length_diff(file_name.split('_')[1], name) <= 2

add_rgx = r'^[A-Z0-9\-]{5,15}$'

part_file_name_lambda_matcher = LambdaFunctionMatcher(func=part_file_name_conditions)
part_file_name_matcher = Intersect(RegexMatchSpan(rgx=add_rgx, longest_match_only=True), part_file_name_lambda_matcher)
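
The filename heuristic is easier to follow with concrete values. The sketch below copies `common_prefix_length_diff` from the cell above and applies it to one document name from the corpus; the candidate spans are illustrative and were not taken from the parsed documents.

```python
def common_prefix_length_diff(str1, str2):
    """Copied from the cell above: characters left after the first mismatch."""
    for i in range(min(len(str1), len(str2))):
        if str1[i] != str2[i]:
            return min(len(str1), len(str2)) - i
    return 0

doc_name = "CentralSemiconductorCorp_CMPT5401ETR"
file_part = doc_name.split("_")[1]  # the part number embedded in the filename

# Illustrative spans; a span passes the filename test when the result is <= 2.
for span in ["CMPT5401", "CMPT5401E", "CMPT4033", "BC547"]:
    print(span, common_prefix_length_diff(file_part, span))
```

A span that is a prefix of the filename's part number scores 0, while a span that diverges early scores high and is rejected by the `<= 2` condition.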

Then, we can union all of these matchers together to form our final part matcher.

In [12]:
part_matcher = Union(part_rgx_matcher, part_dict_matcher, part_file_name_matcher)

### DC Gain - Min Matchers

The matcher defined below was taken from an old snorkel [notebook](https://github.com/fonduer-apps/snorkel/blob/semi-structured/tutorials/tables/deprecated/dc_gain_min.ipynb).

In [13]:
# This was taken from an old snorkel repo (see above)
dc_gain_matcher = RegexMatchSpan(rgx=r'\d+[05]', longest_match_only=False)
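
The pattern `\d+[05]` is deliberately broad: it accepts any integer of two or more digits that ends in 0 or 5. The sketch below tests it with `re.fullmatch`, assuming the matcher requires the entire span to match; the tokens are made up for illustration.

```python
import re

dc_gain_rgx = re.compile(r'\d+[05]')

# Made-up tokens of the kind that appear in datasheet tables.
tokens = ["100", "75", "110", "5", "42", "0.5", "hFE"]
matches = [t for t in tokens if dc_gain_rgx.fullmatch(t)]
print(matches)
```

A single digit such as `5` is rejected because `\d+` needs at least one digit before the final `0` or `5`. Many unrelated table values still pass, which is why the throttler and labeling functions later in the notebook carry most of the filtering.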

### Define a Mention's `MentionSpace`

Next, in order to define the "space" of all mentions that are even considered
from the document, we need to define a `MentionSpace` for each component of the
relation we wish to extract. Fonduer provides a default `MentionSpace` for you
to use, but you can also extend the default `MentionSpace` depending on your
needs.

In the case of transistor part numbers, the `MentionSpace` can be quite complex
due to the need to handle implicit part numbers that are implied in text like
"BC546A/B/C...BC548A/B/C", which refers to 9 unique part numbers. To handle
these, we consider all n-grams up to 3 words long.

In contrast, the `MentionSpace` for DC gain values is simpler: the matcher
targets a single numeric token, so we only need to consider one word at a
time.

When no special preprocessing like this is needed, we could have used the
default `Ngrams` class provided by `fonduer`. For example, if we were looking
to match polarities, which only take the form of "NPN" or "PNP", we could've
used `ngrams = MentionNgrams(n_max=1)`.

In [14]:
from hardware_spaces import MentionNgramsPart, MentionNgrams
    
part_ngrams = MentionNgramsPart(parts_by_doc=None, n_max=3)
dc_gain_ngrams = MentionNgrams(n_max=1)
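
To illustrate what `n_max` controls, here is a minimal n-gram generator written in plain Python. It is a sketch of the general idea only: it does not reproduce how `MentionNgramsPart` expands implicit part numbers, and the helper name `ngrams` and the token list are hypothetical.

```python
def ngrams(tokens, n_max):
    """Yield every contiguous span of 1 to n_max tokens, joined by spaces."""
    for n in range(1, n_max + 1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])

# Hypothetical tokenization of an implicit part-number list.
tokens = ["BC546A", "/", "B", "/", "C"]

unigrams = list(ngrams(tokens, n_max=1))
trigram_space = list(ngrams(tokens, n_max=3))
print(unigrams)
print(len(trigram_space))
```

Raising `n_max` grows the mention space quickly, so a small value such as 1 keeps extraction cheap when the target is a single token.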

### Running Mention Extraction 

Next, we create a `MentionExtractor` to extract the mentions from all of
our documents based on the `MentionSpace` and matchers we defined above.

View the API for the MentionExtractor on [ReadTheDocs](https://fonduer.readthedocs.io/en/latest/user/candidates.html#fonduer.candidates.MentionExtractor).

In [15]:
from fonduer.candidates import MentionExtractor 

mention_extractor = MentionExtractor(
    session, [Part, DCGain], [part_ngrams, dc_gain_ngrams], [part_matcher, dc_gain_matcher]
)

Then, we run the extractor on all of our documents.

In [18]:
mention_extractor.apply(docs, parallelism=PARALLEL)

[INFO] fonduer.candidates.mentions - Clearing table: part
[INFO] fonduer.candidates.mentions - Clearing table: dc_gain
[INFO] fonduer.utils.udf - Running UDF...


HBox(children=(IntProgress(value=0), HTML(value='')))




In [16]:
from fonduer.candidates.models import Mention

print(f"Total Mentions: {session.query(Mention).count()}")

Total Mentions: 11199


## 2.2 Candidate Extraction

Now that we have both defined and extracted the Mentions that can be used to compose Candidates, we are ready to move on to extracting Candidates. Like we did with the Mentions, we first define what each candidate schema looks like. In this example, we create a candidate that is composed of a `Part` and a `DCGain` mention as we defined above. We name this candidate "PartDCGain".

In [17]:
from fonduer.candidates.models import candidate_subclass

PartDCGain = candidate_subclass("PartDCGain", [Part, DCGain])

## Note
Define mention and candidate class names without underscores, or the generated attribute names may be surprising. From an earlier notebook:
> The name of the second candidate attribute is not `eb_voltage`, but it is `eb__voltage` with **two** underscores. Why is that though? It was always defined with only one underscore...

A likely explanation is that Fonduer derives each attribute name by converting the CamelCase class name to snake_case, inserting an underscore before each capitalized word. A class name that already contains an underscore, such as `EB_Voltage`, would then end up with two.
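
The doubled underscore in the note above is consistent with a regex-based CamelCase to snake_case conversion. The sketch below is an assumption about the mechanism, not a confirmed copy of Fonduer's code; `camel_to_snake` is a hypothetical helper name.

```python
import re

def camel_to_snake(name):
    """Illustrative conversion; an assumption, not necessarily Fonduer's own code."""
    # Insert "_" before each capitalized word, e.g. "DCGain" -> "DC_Gain".
    partial = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    # Insert "_" between a lowercase letter or digit and a following capital.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", partial).lower()

for name in ["DCGain", "PartDCGain", "EB_Voltage", "EBVoltage"]:
    print(name, "->", camel_to_snake(name))
```

Under this conversion the existing underscore in `EB_Voltage` is itself followed by a capitalized word, so a second underscore is inserted. Dropping the underscore from the class name avoids the problem.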

In [18]:
PartDCGain.part

<sqlalchemy.orm.attributes.InstrumentedAttribute at 0x7f13013a6af0>

In [19]:
PartDCGain.dc_gain

<sqlalchemy.orm.attributes.InstrumentedAttribute at 0x7f13013a6ba0>

### Defining candidate `Throttlers`

Here, we create a throttler that discards candidates if they are in the same table, but the part and DC gain value are not vertically or horizontally aligned.

In [40]:
from fonduer.utils.data_model_utils import *
import re

def dc_gain_filter(c):
    (part, attr) = c
    if same_table((part, attr)):
        return (is_horz_aligned((part, attr)) or is_vert_aligned((part, attr)))
    return True

dc_gain_throttler = dc_gain_filter
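
The throttling rule can be sketched without Fonduer by standing in for each mention with its table, row, and column indices. This is a simplification: `is_horz_aligned` and `is_vert_aligned` work on Fonduer's data model rather than on bare indices, and the `Cell` type and `keep_candidate` function are hypothetical.

```python
from collections import namedtuple

# Hypothetical stand-in for a mention's position; table=None means "not in a table".
Cell = namedtuple("Cell", ["table", "row", "col"])

def keep_candidate(part, attr):
    """Keep pairs outside a shared table; inside one, require row or column alignment."""
    if part.table is not None and part.table == attr.table:
        return part.row == attr.row or part.col == attr.col
    return True

part = Cell(table=1, row=0, col=0)
same_row = Cell(table=1, row=0, col=3)
same_col = Cell(table=1, row=4, col=0)
unaligned = Cell(table=1, row=4, col=3)
other_table = Cell(table=2, row=4, col=3)

for attr in [same_row, same_col, unaligned, other_table]:
    print(attr, keep_candidate(part, attr))
```

Only the unaligned pair within the same table is discarded. Pairs that span different tables are kept, which is why a large number of candidates survive throttling.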

### Running the `CandidateExtractor`

Now, we have all the components necessary to perform candidate extraction. We have defined the Mentions that compose each candidate and a throttler to prune away excess candidates. We can now define the `CandidateExtractor` with the candidate subclass and corresponding throttler to use.

View the API for the CandidateExtractor on [ReadTheDocs](https://fonduer.readthedocs.io/en/docstrings/user/candidates.html#fonduer.candidates.CandidateExtractor).

In [41]:
from fonduer.candidates import CandidateExtractor


candidate_extractor = CandidateExtractor(session, [PartDCGain], throttlers=[dc_gain_throttler])

In [42]:
for i, docs in enumerate([train_docs, dev_docs, test_docs]):
    candidate_extractor.apply(docs, split=i, parallelism=PARALLEL)

[INFO] fonduer.candidates.candidates - Clearing table part_dc_gain (split 0)
[INFO] fonduer.utils.udf - Running UDF...


HBox(children=(IntProgress(value=0, max=80), HTML(value='')))

[INFO] fonduer.candidates.candidates - Clearing table part_dc_gain (split 1)
[INFO] fonduer.utils.udf - Running UDF...


HBox(children=(IntProgress(value=0, max=10), HTML(value='')))

[INFO] fonduer.candidates.candidates - Clearing table part_dc_gain (split 2)
[INFO] fonduer.utils.udf - Running UDF...


HBox(children=(IntProgress(value=0, max=10), HTML(value='')))

In [43]:
train_cands = candidate_extractor.get_candidates(split = 0)
dev_cands = candidate_extractor.get_candidates(split = 1)
test_cands = candidate_extractor.get_candidates(split = 2)

In [44]:
# `i` still holds the last index from the extraction loop above (2, the test split)
print(f"Number of Candidates in split={i}: {session.query(PartDCGain).filter(PartDCGain.split == i).count()}")

Number of Candidates in split=2: 17445


In [46]:
from fonduer.candidates.models import Candidate

print("Total Candidates: {}".format(session.query(Candidate).count()))

Total Candidates: 351847


## 2.3 Multimodal Featurization
Unlike dealing with plain unstructured text, `Fonduer` deals with richly formatted data, and consequently featurizes each candidate with a baseline library of multimodal features. 

### Featurize with `Fonduer`'s optimized Postgres Featurizer
We now annotate the candidates in our training, dev, and test sets with features. The `Featurizer` provided by `Fonduer` allows this to be done in parallel to improve performance.

View the API provided by the `Featurizer` on [ReadTheDocs](https://fonduer.readthedocs.io/en/stable/user/features.html#fonduer.features.Featurizer).

In [47]:
from fonduer.features import Featurizer

featurizer = Featurizer(session, [PartDCGain])
%time featurizer.apply(split=0, train=True, parallelism=PARALLEL)
%time F_train = featurizer.get_feature_matrices(train_cands)

[INFO] fonduer.features.featurizer - Clearing Features (split 0)
[INFO] fonduer.utils.udf - Running UDF...


HBox(children=(IntProgress(value=0, max=80), HTML(value='')))

CPU times: user 12.8 s, sys: 675 ms, total: 13.5 s
Wall time: 8min 57s
CPU times: user 7min 50s, sys: 25.9 s, total: 8min 16s
Wall time: 12min 33s


In [None]:
print(F_train[0].shape)
%time featurizer.apply(split=1, parallelism=PARALLEL)
%time F_dev = featurizer.get_feature_matrices(dev_cands)
print(F_dev[0].shape)

(319180, 46139)
[INFO] fonduer.features.featurizer - Clearing Features (split 1)
[INFO] fonduer.utils.udf - Running UDF...


HBox(children=(IntProgress(value=0, max=10), HTML(value='')))

In [None]:
%time featurizer.apply(split=2, parallelism=PARALLEL)
%time F_test = featurizer.get_feature_matrices(test_cands)
print(F_test[0].shape)

At the end of this phase, `Fonduer` has generated the set of candidates and the feature matrix. Note that Phase 1 and 2 are relatively static and typically are only executed once during the KBC process.

# Phase 3: Probabilistic Relation Classification
In this phase, `Fonduer` applies user-defined **labeling functions**, which express various heuristics, patterns, and [weak supervision](http://hazyresearch.github.io/snorkel/blog/weak_supervision.html) strategies to label our data, to each of the candidates to create a label matrix that is used by our data programming engine.

In the wild, hand-labeled training data is rare and expensive. A common scenario is to have access to tons of unlabeled training data, and have some idea of how to label them programmatically. For example:
* We may be able to think of text patterns that would indicate a part and DC gain mention are related, for example the phrase "DC current gain" appearing in the same row as the value.
* We may have access to an external knowledge base that lists some pairs of parts and DC gain values, and can use these to noisily label some of our mention pairs.

Our labeling functions will capture these types of strategies. We know that these labeling functions will not be perfect, and some may be quite low-quality, so we will model their accuracies with a generative model, which `Fonduer` will help us easily apply.

Using data programming, we can then train machine learning models to learn which features are the most important in classifying candidates.
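
As a toy illustration of how labeling functions produce a label matrix, the sketch below applies two hypothetical functions to plain strings rather than Fonduer candidates. It uses the 0/1/2 encoding for abstain/false/true that this notebook adopts; the row texts are invented.

```python
ABSTAIN, FALSE, TRUE = 0, 1, 2

def lf_gain_keyword(row_text):
    """Vote TRUE when the row mentions 'gain'; otherwise abstain."""
    return TRUE if "gain" in row_text.lower().split() else ABSTAIN

def lf_voltage_unit(row_text):
    """Vote FALSE when the row contains a standalone 'V' unit; otherwise abstain."""
    return FALSE if "v" in row_text.lower().split() else ABSTAIN

# Invented table rows, one per toy candidate.
rows = [
    "DC Current Gain hFE 100",
    "Emitter-Base Voltage VEBO 6 V",
    "Storage Temperature",
]
lfs = [lf_gain_keyword, lf_voltage_unit]

# One row per candidate, one column per labeling function.
label_matrix = [[lf(row) for lf in lfs] for row in rows]
for row, labels in zip(rows, label_matrix):
    print(labels, row)
```

Each function votes only where its heuristic applies and abstains elsewhere. The generative model then estimates how much to trust each column from the pattern of agreements and disagreements.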

### Loading Gold Data
For convenience in error analysis and evaluation, we have already annotated the dev and test set for this tutorial, and we'll now load it using an externally-defined helper function. If you're interested in the example implementation details, please see the script we now load:

In [None]:
from hardware_utils import load_hardware_labels

gold_file = 'data/hardware_tutorial_gold.csv'
load_hardware_labels(session, PartDCGain, gold_file, ATTRIBUTE, annotator_name='gold')

## Labeling Functions

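The DC current gain symbol (found in the same row as the DC gain value) is usually the following: h<sub>FE</sub>

The row typically also contains the phrase "DC current gain". Unlike voltage or current ratings, DC gain is a unitless ratio, so a unit such as V or mA in the same row suggests that the value is not a DC gain.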

## Helper Functions

In [None]:
# Define variables to make code more readable

ABSTAIN = 0
FALSE = 1
TRUE = 2

In [None]:
# Helpers
def set_all_in_set(a, b):
    '''return true if all of a is in b'''
    return b.issuperset(a)

def set_none_in_set(a, b):
    '''return true if none of a is in b'''
    return (b.difference(a) == b)

def set_any_in_set(a, b):
    '''return true if any of a is in b'''
    return len(b.intersection(a)) > 0
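
A quick check of the three helpers on a made-up set of row n-grams shows how they are meant to be used. The functions are copied from the cell above; the `row_ngrams` values are illustrative.

```python
def set_all_in_set(a, b):
    '''return true if all of a is in b'''
    return b.issuperset(a)

def set_none_in_set(a, b):
    '''return true if none of a is in b'''
    return (b.difference(a) == b)

def set_any_in_set(a, b):
    '''return true if any of a is in b'''
    return len(b.intersection(a)) > 0

# Made-up n-grams for a row such as "DC Current Gain hFE 100".
row_ngrams = {"dc", "current", "gain", "hfe", "100"}

print(set_all_in_set({"dc", "current", "gain"}, row_ngrams))
print(set_any_in_set({"hfe", "h fe"}, row_ngrams))
print(set_none_in_set({"v", "mv"}, row_ngrams))
print(set_all_in_set({"dc", "voltage"}, row_ngrams))
```

The first argument is always the keyword set and the second is the set of n-grams found in the document, so the helpers read as "are all/any/none of these keywords in that row".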

## Positive Labeling Functions

Some possible labeling function best practices:
- Start with broad assumptions (e.g., is the value in the expected modality?)


In [None]:
LFs = []

###################################################################
# POSITIVE
###################################################################

def LF_inside_table(c):
    # is_tabular() returns a bool, so test its truth value rather than `is not None`
    return TRUE if c.dc_gain.context.sentence.is_tabular() else ABSTAIN
LFs.append(LF_inside_table)

# def LF_part_is_aligned(c):
#     return TRUE if (c.part.parent.table == c.dc_gain.parent.table and
#                 (c.part.parent.row_num == c.dc_gain.parent.row_num or
#                  c.part.parent.col_num == c.dc_gain.parent.col_num)) else ABSTAIN
# LFs.append(LF_part_is_aligned)
    
def LF_dc_gain_keywords(c):
    keywords = set(['dc', 'current', 'gain'])
    if c.dc_gain.context.sentence.is_tabular():
        row_ngrams = set(x.replace(' ', '') for x in get_row_ngrams(c.dc_gain, lower=True) if x)
        if set_all_in_set(keywords, row_ngrams):
            return TRUE
    return ABSTAIN
LFs.append(LF_dc_gain_keywords)

def LF_dc_gain_symbols(c):
    pos_keys = set(['hfe', 'h fe'])
    ngrams = set(get_row_ngrams(c.dc_gain, lower=True))
    if set_any_in_set(pos_keys, ngrams):
        return TRUE
    else:
        return ABSTAIN
LFs.append(LF_dc_gain_symbols)

def LF_low_table_num(c):
    if c.dc_gain.context.sentence.table is not None:
        if c.dc_gain.context.sentence.table.position <= 2:
            return TRUE
        else:
            return FALSE
    return ABSTAIN
LFs.append(LF_low_table_num)

def LF_whole_phrase_in_row(c):
    row_ngrams = set(get_row_ngrams(c.dc_gain, lower=True))
    if 'dc current gain' in row_ngrams:
        return TRUE
    else:
        return ABSTAIN
LFs.append(LF_whole_phrase_in_row)

## Negative Labeling Functions

In [None]:
###################################################################
# NEGATIVE
###################################################################

"""
def LF_specific_neg_row_keywords(c):
    left_ngrams = set(get_row_ngrams(c.dc_gain, lower=True))
    neg_keys = set(['rating', 'v', 'vebo', 'v ebo', 'cut-off'])
    if set_any_in_set(neg_keys, left_ngrams):
        return FALSE
    else:
        return ABSTAIN
LFs.append(LF_specific_neg_row_keywords)
"""

def LF_equals_in_row(c):
    row_ngrams = set(get_row_ngrams(c.dc_gain))
    if '=' in row_ngrams:
        return FALSE
    else:
        return ABSTAIN
LFs.append(LF_equals_in_row)

def LF_v_in_row(c):
    row_ngrams = set(get_row_ngrams(c.dc_gain))
    if 'v' in row_ngrams:
        return FALSE
    else:
        return ABSTAIN
LFs.append(LF_v_in_row)

def LF_first_row(c):
    if c.dc_gain.context.sentence.row_start == 0:
        return FALSE
    else:
        return ABSTAIN
LFs.append(LF_first_row)
    
def LF_not_ccurrent_relevant(c):
    keywords = set(['dc', 'current', 'gain'])
    if c.dc_gain.context.sentence.is_tabular():
        row_ngrams = set(x.replace(' ', '') for x in get_row_ngrams(c.dc_gain, lower=True) if x)
        if not set_any_in_set(keywords, row_ngrams):
            return FALSE
    return ABSTAIN
LFs.append(LF_not_ccurrent_relevant)

def LF_too_many_numbers_row(c):
    num_numbers = list(get_row_ngrams(c.dc_gain, attrib="ner_tags")).count('number')
    return FALSE if num_numbers >= 4 else ABSTAIN
LFs.append(LF_too_many_numbers_row)

# dc_gain should not have any units specified
def LF_negative_units(c):
    row_ngrams = set(get_row_ngrams(c.dc_gain, lower=True))
    # n-grams are lowercased above, so the units must be lowercase too
    units = set(['v', 'mv', 'ma', 'mw', 'c', 'c/w'])
    if set_any_in_set(units, row_ngrams):
        return FALSE
    return ABSTAIN
LFs.append(LF_negative_units)

def LF_negative_keywords(c):
    row_neg_keys = set(['ambient',
                    'small-signal',
                    'cut-off',
                    'na',
                    'ma',
                    'cex',
                    'resistance',
                    'power',
                    'junction',
                    'dissipation', 
                    'breakdown',
                    'voltage',
                    'cbo',
                    'vcbo',
                    'peak',
                    '=',
                    'f',
                    'p',
                    'mw',
                    'ceo',
                    'vceo',
                    'vebo',
                    'v',
                    'ebo',
                    'total',
                    'device',
                    'mhz',
                    'temperature',
                    'saturation',
                    'operating',
                    'storage',
                    'bandwidth',
                    'derate',
                    'above',
                    'product',
                    'figure',
                    'conditions',
                    'collector',
                    'saturation',
                    'min',
                    'min.',
                    'typ',
                    'typ.',
                    'max',
                    'max.',
                    'gain',
                    'p',
                    'thermal',
                    'test'])
    row_ngrams = set(get_row_ngrams(c.dc_gain))
    col_ngrams = set(get_col_ngrams(c.dc_gain))
    col_neg_keys = set(['conditions', 
                        'condition', 
                        'parameter', 
                        'rating',
                        'ratings',
                        'typ',
                        'typ.',
                        'max',
                        'max.',
                        'test'])
    if set_any_in_set(row_neg_keys, row_ngrams):
        return FALSE
    if set_any_in_set(col_neg_keys, col_ngrams):
        return FALSE
    
    return ABSTAIN

LFs.append(LF_negative_keywords)

### Applying the Labeling Functions

Next, we need to actually run the LFs over all of our training candidates, producing a set of `Labels` and `LabelKeys` (just the names of the LFs) in the database. Note that this will delete any existing `Labels` and `LabelKeys` for this candidate set.

View the API provided by the `Labeler` on [ReadTheDocs](https://fonduer.readthedocs.io/en/stable/user/supervision.html#fonduer.supervision.Labeler).

In [None]:
from fonduer.supervision import Labeler

labeler = Labeler(session, [PartDCGain])
%time labeler.apply(split=0, train=True, lfs=[LFs], parallelism=PARALLEL, progress_bar=True)