# Intro. to Snorkel: Extracting Spouse Relations from the News

In this tutorial, we will walk through the process of using `Snorkel` to identify mentions of spouses in a corpus of news articles. The tutorial is broken up into 3 notebooks, each covering a step in the pipeline:
1. Preprocessing
2. Training
3. Evaluation

## Part I: Preprocessing

In this notebook, we preprocess several documents using `Snorkel` utilities, parsing them into a simple hierarchy of component parts of our input data, which we refer to as _contexts_. We'll also create _candidates_ out of these contexts. Candidates are the objects we want to classify: in this case, possible mentions of spouses. Finally, we'll load some gold labels for evaluation.

All of this preprocessed input data is saved to a database. (Connection strings can be specified by setting the `SNORKELDB` environment variable.) If no database is specified, Snorkel creates a SQLite database at `./snorkel.db` by default, so no setup is needed here!
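
As a rough illustration of this fallback, here is a minimal sketch in plain Python. The function `resolve_connection_string` and the exact form of the default SQLite URL are assumptions made for this example; they are not taken from Snorkel's source.

```python
import os

def resolve_connection_string(environ=None, default_file="snorkel.db"):
    """Return SNORKELDB if it is set, else a SQLite URL in the working directory."""
    environ = os.environ if environ is None else environ
    # Assumed default for this sketch: a SQLite file next to the notebook.
    default = "sqlite:///" + os.path.join(os.getcwd(), default_file)
    return environ.get("SNORKELDB", default)

# With the variable set, the explicit connection string wins.
print(resolve_connection_string({"SNORKELDB": "postgres:///snorkel-intro"}))
# Without it, we fall back to the local SQLite file.
print(resolve_connection_string({}))
```

Passing the environment in as a dictionary keeps the sketch easy to try without touching the real process environment.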

### Initializing a `SnorkelSession`

First, we initialize a `SnorkelSession`, which manages a connection to a database automatically for us, and will enable us to save intermediate results.  If we don't specify any particular database (see commented-out code below), then it will automatically create a SQLite database in the background for us:

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os

# TO USE A DATABASE OTHER THAN SQLITE, USE THIS LINE
# Note that this is necessary for parallel execution amongst other things...
# os.environ['SNORKELDB'] = 'postgres:///snorkel-intro'

from snorkel import SnorkelSession
session = SnorkelSession()

# Here, we just set how many documents we'll process for automatic testing; you can safely ignore this!
n_docs = 500 if 'CI' in os.environ else 2591

## Loading the Corpus

Next, we load and preprocess the corpus of documents.

In [2]:
import pandas as pd
data_in_pd = pd.read_csv("data/articles.tsv", sep="\t", header=None)
data_in_pd.head()

Unnamed: 0,0,1
0,9b28e780-ba48-4a53-8682-7c58c141a1b6,"NEW YORK -- Theatergoers who check out ""Beauti..."
1,0e1d09ed-6ba7-430b-be20-0a3744bfd7b0,CenseoHealth an industry leader of physician-p...
2,c4a1f668-b653-45d9-a95a-bb692c3e81cb,Britons unfairly charged IHT between 2011 and ...
3,4fb3b88d-4000-4a11-bb77-8008358fe6d7,A search is underway for a family of prospecto...
4,d611cfe5-1bb1-429f-8440-1c4553f94805,Source: News unlimited - 4 days ago Two years...


In [3]:
data_in_pd[1][0]

'NEW YORK -- Theatergoers who check out "Beautiful" on tour won\'t get to see Tony winner Jessie Mueller but they may get the next best thing -- someone with her DNA.   Mueller\'s older sister Abby has stepped into her sister\'s shoes to play Carole King in the Broadway musical about the celebrated "I Feel the Earth Move" songwriter.   Abby Mueller acknowledged she was "a little hesitant" to audition for a part that earned her sister a best-actress Tony Award last year.   "I\'m really glad my agent convinced me to and I\'m really glad they saw something in me. I\'ve just been having the best time doing this," said Abby Mueller. "As soon as I got the audition, I called Jessie. I was like, \'I\'ve got a funny story for you.\'"   Jessie and Abby Mueller are just the tip of the talented, Chicago-based Mueller family. Their brothers, Matt and Andrew, as well as their parents, Roger Mueller and Jill Shellabarger, are all actors.   "Beautiful" director Marc Bruni said Abby, who graduated from

### Configuring a `DocPreprocessor`

We'll start by creating a `TSVDocPreprocessor` to read in the documents, which are stored in a tab-separated value format as pairs of document names and text.

In [4]:
from snorkel.parser import TSVDocPreprocessor

doc_preprocessor = TSVDocPreprocessor('data/articles.tsv', max_docs=n_docs)
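
To make the input format concrete, the following sketch reads name/text pairs from a tab-separated stream using only the standard library. It is not how `TSVDocPreprocessor` is implemented; `iter_tsv_docs` and the sample rows are hypothetical and only mirror the two-column layout and the `max_docs` cap described above.

```python
import csv
import io

def iter_tsv_docs(handle, max_docs=None):
    """Yield (name, text) pairs from a tab-separated stream, stopping after max_docs."""
    # QUOTE_NONE keeps quotation marks inside article text untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for i, row in enumerate(reader):
        if max_docs is not None and i >= max_docs:
            break
        yield row[0], row[1]

# Hypothetical sample input: one document per line, name first, then text.
sample = io.StringIO(
    "doc-a\tFirst article text.\n"
    "doc-b\tSecond article text.\n"
    "doc-c\tThird article text.\n"
)
docs = list(iter_tsv_docs(sample, max_docs=2))
print(docs)
```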

### Running a `CorpusParser`

We'll use [Spacy](https://spacy.io/), an NLP preprocessing tool, to split our documents into sentences and tokens, and provide named entity annotations.

In [5]:
from snorkel.parser.spacy_parser import Spacy
from snorkel.parser import CorpusParser

corpus_parser = CorpusParser(parser=Spacy())
%time corpus_parser.apply(doc_preprocessor, count=n_docs)

Clearing existing...


  0%|          | 4/2591 [00:00<01:06, 39.18it/s]

Running UDF...


100%|██████████| 2591/2591 [01:48<00:00, 23.92it/s]

CPU times: user 1min 53s, sys: 4.1 s, total: 1min 57s
Wall time: 2min 51s





We can then use simple database queries (written in the syntax of [SQLAlchemy](http://www.sqlalchemy.org/), which Snorkel uses) to check how many documents and sentences were parsed:

In [6]:
from snorkel.models import Document, Sentence, Candidate, Feature

print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())
print("Candidates:", session.query(Candidate).count())
print("Features:", session.query(Feature).count())

Documents: 2591
Sentences: 67820
Candidates: 0
Features: 0


In [7]:
d = session.query(Document).get(2)

In [8]:
list(d.get_sentence_generator())

[Sentence(Document 0e1d09ed-6ba7-430b-be20-0a3744bfd7b0,0,b"CenseoHealth an industry leader of physician-performed health assessment solutions for Medicare Advantage plans, has partnered with Health Insurance Marketplace (HIX) and Commercial plans in 2015 to provide their members health assessments at their home, workplace, primary care physician's office or retail\xe2\x80\xa6   Published on September 4, 2015 at 3:40 AM \xc2\xb7 No Comments   CenseoHealth an industry leader of physician-performed health assessment solutions for Medicare Advantage plans, has partnered with Health Insurance Marketplace (HIX) and Commercial plans in 2015 to provide their members health assessments at their home, workplace, primary care physician's office or retail clinic.   "),
 Sentence(Document 0e1d09ed-6ba7-430b-be20-0a3744bfd7b0,1,b'CenseoHealth is proud to support this new market in their effort to provide quality care to this previously uninsured population.   '),
 Sentence(Document 0e1d09ed-6ba7-43

## Generating Candidates

The next step is to extract _candidates_ from our corpus. A `Candidate` in Snorkel is an object for which we want to make a prediction. In this case, the candidates are pairs of people mentioned in sentences, and our task is to predict which pairs are described as married in the associated text.

### Defining a `Candidate` schema
We now define the schema of the relation mention we want to extract (which is also the schema of the candidates).  This must be a subclass of `Candidate`, and we define it using a helper function. Here we'll define a binary _spouse relation mention_ which connects two `Span` objects of text.  Note that this function will create the table in the database backend if it does not exist:

In [9]:
from snorkel.models import candidate_subclass

Spouse = candidate_subclass('Spouse', ['person1', 'person2'])
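
For intuition about what such a helper does, here is a toy sketch that builds a subclass at runtime from a class name and a list of argument names. `ToyCandidate` and `toy_candidate_subclass` are invented for this example. Unlike the real helper, the sketch creates no database table and makes no claim about Snorkel's internals.

```python
class ToyCandidate:
    """Toy base class that stores one value per named argument."""
    __argnames__ = ()

    def __init__(self, **kwargs):
        missing = set(self.__argnames__) - set(kwargs)
        if missing:
            raise TypeError("missing arguments: " + ", ".join(sorted(missing)))
        for name in self.__argnames__:
            setattr(self, name, kwargs[name])

    def __repr__(self):
        args = ", ".join(repr(getattr(self, n)) for n in self.__argnames__)
        return type(self).__name__ + "(" + args + ")"

def toy_candidate_subclass(class_name, args):
    """Build a ToyCandidate subclass whose fields are the given argument names."""
    return type(class_name, (ToyCandidate,), {"__argnames__": tuple(args)})

ToySpouse = toy_candidate_subclass("ToySpouse", ["person1", "person2"])
pair = ToySpouse(person1="Roger Mueller", person2="Jill Shellabarger")
print(pair)
```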

### Writing a basic `CandidateExtractor`

Next, we'll write a basic function to extract **candidate spouse relation mentions** from the corpus.  The [Spacy](https://spacy.io/) parser we used performs _named entity recognition_ for us.

We will extract `Candidate` objects of the `Spouse` type by identifying, for each `Sentence`, all pairs of n-grams (up to 7-grams) that were tagged as people. (An n-gram is a span of text made up of n tokens.) We do this with three objects:

* A `ContextSpace` defines the "space" of all candidates we could potentially consider; in this case we use the `Ngrams` subclass and look for all n-grams up to 7 words long.

* A `Matcher` heuristically filters the candidates we use.  In this case, we just use a pre-defined matcher which looks for all n-grams tagged by Spacy as "PERSON". The keyword argument `longest_match_only` means that we'll skip n-grams contained in other n-grams.

* A `CandidateExtractor` combines this all together!

In [10]:
from snorkel.candidates import Ngrams, CandidateExtractor
from snorkel.matchers import PersonMatcher

ngrams         = Ngrams(n_max=7)
person_matcher = PersonMatcher(longest_match_only=True)
cand_extractor = CandidateExtractor(Spouse, [ngrams, ngrams], [person_matcher, person_matcher])
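
The three pieces above can be pictured in plain Python on a single toy sentence. This is a simplified sketch, not Snorkel's implementation: the helper names, the hand-written NER tags, and the choice to emit unordered pairs are all assumptions made for illustration.

```python
from itertools import combinations

def ngram_spans(n_tokens, n_max):
    """All inclusive (start, end) token spans of length 1..n_max."""
    return [(s, s + n - 1)
            for n in range(1, n_max + 1)
            for s in range(n_tokens - n + 1)]

def person_spans(ner_tags, n_max=7):
    """Keep spans tagged entirely as PERSON, dropping spans nested in a longer match."""
    matches = [(s, e) for s, e in ngram_spans(len(ner_tags), n_max)
               if all(tag == "PERSON" for tag in ner_tags[s:e + 1])]
    return sorted(
        (s, e) for s, e in matches
        if not any(s2 <= s and e <= e2 and (s2, e2) != (s, e) for s2, e2 in matches)
    )

# Toy sentence with hand-assigned tags (a real parser would supply these).
tokens   = ["Abby", "Mueller", "called", "her", "sister", "Jessie", "."]
ner_tags = ["PERSON", "PERSON", "O", "O", "O", "PERSON", "O"]

people = person_spans(ner_tags)          # longest PERSON spans only
pairs = list(combinations(people, 2))    # one candidate per pair of people
print([" ".join(tokens[s:e + 1]) for s, e in people])
print(pairs)
```

Here "Abby" and "Mueller" on their own are discarded because both sit inside the longer match "Abby Mueller", which is the effect `longest_match_only` is described as having.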

Next, we'll split the documents into train, development, and test sets and collect the associated sentences.

Note that we'll filter out a few sentences that mention more than five people. These lists are unlikely to contain spouses.

In [11]:
from snorkel.models import Document
from util import number_of_people

docs = session.query(Document).order_by(Document.name).all()

train_sents = set()
dev_sents   = set()
test_sents  = set()

for i, doc in enumerate(docs):
    for s in doc.sentences:
        if number_of_people(s) <= 5:
            if i % 10 == 8:
                dev_sents.add(s)
            elif i % 10 == 9:
                test_sents.add(s)
            else:
                train_sents.add(s)
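
To see what proportions this modulo rule produces, the sketch below applies the same rule to document indices alone, without touching the database. `split_for` is a hypothetical helper written for this example; 2591 is the document count used in this notebook.

```python
from collections import Counter

def split_for(doc_index):
    """Map a document's position in the sorted list to a split name."""
    remainder = doc_index % 10
    if remainder == 8:
        return "dev"
    if remainder == 9:
        return "test"
    return "train"

# Count how many of the 2591 documents land in each split.
counts = Counter(split_for(i) for i in range(2591))
print(counts)
```

Because whole documents are assigned to a split, the sentences of one article never straddle train and test. The split is roughly 80/10/10 by document; the sentence counts per split will differ somewhat, since articles vary in length.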

Finally, we'll apply the candidate extractor to the three sets of sentences. The results will be persisted in the database backend.

In [12]:
%%time
for i, sents in enumerate([train_sents, dev_sents, test_sents]):
    cand_extractor.apply(sents, split=i)
    print("Number of candidates:", session.query(Spouse).filter(Spouse.split == i).count())

  0%|          | 31/54023 [00:00<02:59, 300.20it/s]

Clearing existing...
Running UDF...


100%|██████████| 54023/54023 [03:31<00:00, 255.05it/s]
  0%|          | 24/6791 [00:00<00:28, 235.96it/s]

Number of candidates: 22254
Clearing existing...
Running UDF...


100%|██████████| 6791/6791 [00:27<00:00, 245.02it/s]
  0%|          | 11/6202 [00:00<00:56, 109.57it/s]

Number of candidates: 2811
Clearing existing...
Running UDF...


100%|██████████| 6202/6202 [00:26<00:00, 236.16it/s]

Number of candidates: 2701
CPU times: user 4min 17s, sys: 1.23 s, total: 4min 18s
Wall time: 4min 25s





In [13]:
print("Candidates:", session.query(Candidate).get(1))

Candidates: Spouse(Span("b'Dawn'", sentence=54508, chars=[16,19], words=[3,3]), Span("b'Leah\xe2\x80\x99s'", sentence=54508, chars=[63,68], words=[13,14]))


## Loading Gold Labels

Finally, we'll load gold labels for development and evaluation. Even though Snorkel is designed to create labels for data, we still use gold labels to evaluate the quality of our models. Fortunately, we need far less labeled data to _evaluate_ a model than to _train_ it.

In [14]:
from util import load_external_labels

%time missed = load_external_labels(session, Spouse, annotator_name='gold')

AnnotatorLabels created: 2695
AnnotatorLabels created: 2615
CPU times: user 1min 9s, sys: 428 ms, total: 1min 9s
Wall time: 1min 10s


In [29]:
labels_from_pd = pd.read_csv("data/gold_labels.tsv", sep="\t")
labels_from_pd.head()

Unnamed: 0,person1,person2,label
0,36c3703b-bd5b-4888-be46-2f45bcb37f8e::span:95:106,36c3703b-bd5b-4888-be46-2f45bcb37f8e::span:0:10,1
1,e16a971f-23ce-42e4-81df-b2386126f8b3::span:126...,e16a971f-23ce-42e4-81df-b2386126f8b3::span:140...,-1
2,bd5bd79c-6d36-45bc-a5d3-29d711098202::span:346...,bd5bd79c-6d36-45bc-a5d3-29d711098202::span:347...,-1
3,53ee44ac-8634-4449-afb8-7c69b108aafd::span:708...,53ee44ac-8634-4449-afb8-7c69b108aafd::span:977...,1
4,0466d5cb-256b-41a0-9cd2-cc41d8916ede::span:288...,0466d5cb-256b-41a0-9cd2-cc41d8916ede::span:308...,-1
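
The `person1` and `person2` columns identify spans by a string of the form `<document>::span:<start>:<end>`. The sketch below splits such an identifier into its parts. `SpanRef` and `parse_span_id` are hypothetical names, and reading the two numbers as character offsets within the document is our assumption based on the values shown above.

```python
from typing import NamedTuple

class SpanRef(NamedTuple):
    doc_name: str
    char_start: int   # assumed to be a character offset
    char_end: int     # assumed to be a character offset

def parse_span_id(stable_id):
    """Split '<doc>::span:<start>:<end>' into its parts."""
    doc_name, _, rest = stable_id.partition("::")
    kind, start, end = rest.split(":")
    if kind != "span":
        raise ValueError("unexpected context type: " + kind)
    return SpanRef(doc_name, int(start), int(end))

ref = parse_span_id("36c3703b-bd5b-4888-be46-2f45bcb37f8e::span:95:106")
print(ref)
```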


In [27]:
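# Sanity check: each Spouse row should be the same object when fetched as a generic Candidate.
# Primary keys start at 1 and need not be contiguous, so iterate over the actual ids.
for spou_ in session.query(Spouse).all():
    cand_ = session.query(Candidate).get(spou_.id)
    if cand_ != spou_:
        # Only report mismatches, separated by a divider line.
        print(spou_.id)
        print("\tcandidate", cand_)
        print("\tspouse", spou_)
        print("=====================")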

Next, in Part II, we will work towards building a model to predict these labels with high accuracy using data programming.