# Relation Extraction Dataset based on CONLL data

In this small tutorial we look at how to weakly annotate data for a relation extraction task using simple patterns.

For that, we use a CONLL dataset that already has gold labels, so that we can compare them with the weak labels we obtain.

The steps are the following:
- process the data in CONLL format (an example file, "dev.conll", is stored in the same folder)
- extract the information relevant to us (the sample and the relation label)
- create a small set of hand-crafted patterns in the format "$ARG1 <some words> $ARG2", where $ARG1 and $ARG2 are the entities between which the relation holds
- annotate the samples with the spaCy package in order to get the entities
- take the entities pairwise and extract the part of the sentence between them (e.g. from the sentence "Margaret Thatcher was born on 13 October 1925" the subsentence "$ARG1 was born on $ARG2" is extracted, given that "Margaret Thatcher" and "13 October 1925" are the only entities found by spaCy)
- check whether a pattern matches one of these subsentences. If so, the sample receives the label of the relation that the pattern supports and becomes part of our new weakly annotated training set.
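The steps above can be sketched end to end on the example sentence. This is a minimal illustration, not the code used below: the entity list is hard-coded in place of the spaCy annotation, and `toy_patterns` is a hypothetical one-pattern dictionary.

```python
import re

sentence = "Margaret Thatcher was born on 13 October 1925"

# Assumption: these two entities stand in for what an NER model would return.
entities = ["Margaret Thatcher", "13 October 1925"]
spans = [(sentence.index(e), sentence.index(e) + len(e)) for e in entities]

# Replace both entities with placeholders and keep the words in between.
(start1, end1), (start2, end2) = spans
subsentence = "$ARG1" + sentence[end1:start2] + "$ARG2"

# Hypothetical pattern set: relation label -> list of hand-written patterns.
toy_patterns = {"per:date_of_birth": ["$ARG1 was born on $ARG2"]}

# A matching pattern turns its relation into the weak label of the sentence.
weak_label = None
for relation, patterns in toy_patterns.items():
    if any(re.fullmatch(re.escape(p), subsentence) for p in patterns):
        weak_label = relation

print(subsentence, "->", weak_label)
```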

## Imports

In [192]:
import itertools
import re

import pandas as pd
import spacy
from tqdm import tqdm
tqdm.pandas()

# None disables column-width truncation (-1 is no longer accepted by recent pandas versions)
pd.set_option('display.max_colwidth', None)

In [193]:
ARG1 = "$ARG1"
ARG2 = "$ARG2"
FINAL_DF_COLUMNS = ['sample', 'extr_sample', 'pattern', 'weak_label', 'gold_label']

conll_file = "dev.conll"

We use the following function each time we print a DataFrame: it escapes the dollar symbols in the patterns, which the notebook would otherwise interpret as math delimiters.

In [194]:
def escape_dollar(strings):
    return [re.sub("\\$", "\\\\$", str(string)) for string in strings]

## Read and preprocess CONLL data

In [195]:
def process_data(path_to_data):
    samples, relations = [], []
    with open(path_to_data, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line.startswith("# id="):    # Instance starts
                sample = ""
                label = line.split(" ")[3][5:]
            elif line == "":  # Instance ends
                samples.append(sample)
                relations.append(label)
            elif line.startswith("#"):  # comment
                continue
            else:
                parts = line.split("\t")
                token = parts[1]
                if token == "-LRB-":
                    token = "("
                elif token == "-RRB-":
                    token = ")"
                sample += " " + token
    return pd.DataFrame.from_dict({"sample": samples, "label": relations})

samples = process_data(conll_file)
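To make the expected input format concrete, here is a small sketch that parses an in-memory snippet instead of "dev.conll". The snippet is made up: the header layout `# id=... docid=... reln=...` is inferred from the parsing code above, and real files carry more tab-separated columns than the two shown here. The function `parse_conll` is a hypothetical variant that also collects tokens in a list and flushes a last instance that is not followed by a blank line.

```python
# Made-up instance; only the column index of the token matters here.
toy_conll = (
    "# id=toy0001 docid=toy_doc reln=per:age\n"
    "1\tHe\n"
    "2\twas\n"
    "3\t68\n"
    "4\t.\n"
    "\n"
)

def parse_conll(lines):
    samples, labels = [], []
    tokens, label = [], None
    for line in lines:
        line = line.strip()
        if line.startswith("# id="):        # instance header: start a new sample
            tokens = []
            label = line.split(" ")[3][len("reln="):]
        elif line.startswith("#"):          # any other comment line
            continue
        elif line == "":                    # blank line: instance ends
            if label is not None:
                samples.append(" ".join(tokens))
                labels.append(label)
                label = None
        else:                               # token line: second column is the token
            tokens.append(line.split("\t")[1])
    if label is not None:                   # last instance without a trailing blank line
        samples.append(" ".join(tokens))
        labels.append(label)
    return samples, labels

toy_samples, toy_labels = parse_conll(toy_conll.splitlines())
print(toy_samples, toy_labels)
```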

Let's inspect all relation labels that we have in our dataset

In [196]:
print(set(samples["label"]))

{'org:website', 'org:alternate_names', 'per:other_family', 'org:number_of_employees/members', 'per:parents', 'per:country_of_death', 'per:cities_of_residence', 'per:city_of_death', 'org:top_members/employees', 'per:stateorprovince_of_birth', 'per:date_of_birth', 'per:stateorprovince_of_death', 'org:members', 'per:employee_of', 'org:city_of_headquarters', 'per:siblings', 'per:children', 'per:countries_of_residence', 'per:cause_of_death', 'no_relation', 'org:parents', 'per:schools_attended', 'per:country_of_birth', 'org:shareholders', 'org:subsidiaries', 'org:country_of_headquarters', 'per:title', 'per:religion', 'org:stateorprovince_of_headquarters', 'per:stateorprovinces_of_residence', 'org:founded', 'org:dissolved', 'org:member_of', 'per:charges', 'per:date_of_death', 'org:founded_by', 'org:political/religious_affiliation', 'per:age', 'per:origin', 'per:alternate_names', 'per:city_of_birth', 'per:spouse'}


To make the computation faster, let's keep only the samples that contain a relation (that is, samples not labelled "no_relation") and draw a random subset of 1000 of them.

In [259]:
selected_samples = samples[samples["label"]!='no_relation'].sample(n=1000, random_state=100)

In [260]:
selected_samples.head()

Unnamed: 0,sample,label
8105,"The body of Polish first lady Maria Kaczynska remains unidentified more than a day after she died in an air crash with President Lech Kaczynski and dozens of other officials , an aide said Sunday .",per:cause_of_death
14261,"Rana , who is a Canadian citizen , was arrested Oct 18 in his home .",per:origin
4089,He was 68 .,per:age
19544,SAN DIEGO : The American Association for the Advancement of Science ( AAAS ) hosts the world 's most important scientific conference .,org:alternate_names
13130,Shareholders will be voting on whether to split the company 's Class B shares in a move tied to Berkshire 's purchase of Burlington Northern Santa Fe Corp .,org:parents


## Defining patterns

In order to turn the data into a distantly supervised dataset, we write down a few simple patterns for each relation that help us find these relations in the training samples. To keep things simple, we reduce the number of relations we write patterns for and choose 3 relations from the TACRED relation labels: "org:alternate_names", "per:date_of_birth" and "org:top_members/employees".

In [261]:
relation_patterns_df = pd.DataFrame.from_dict({"org:alternate_names": 
                                               [["$ARG1 ( $ARG2 ",
                                                 "$ARG1 formerly known as $ARG2",
                                                 "$ARG1 aka $ARG2", 
                                                 "$ARG1 ( also known as $ARG2 )"]],
                                               "per:date_of_birth": 
                                               [["$ARG1 ( born $ARG2 )", 
                                                 "$ARG1 ( born $ARG2 in",
                                                 "$ARG1 ( $ARG2 -",
                                                 "$ARG1 was born in $ARG2"]],
                                               "org:top_members/employees":
                                                [["$ARG1 , executive director of $ARG2",
                                                  "$ARG1 , head of $ARG2",
                                                  "$ARG1 , who heads $ARG2",
                                                  "$ARG1 , chief executive of $ARG2"]]}, 
                                              orient='index', columns = ["raw pattern"])

In [262]:
relation_patterns_df.apply(lambda x: escape_dollar(x)).head()

Unnamed: 0,raw pattern
org:alternate_names,"['\$ARG1 ( \$ARG2 ', '\$ARG1 formerly known as \$ARG2', '\$ARG1 aka \$ARG2', '\$ARG1 ( also known as \$ARG2 )']"
per:date_of_birth,"['\$ARG1 ( born \$ARG2 )', '\$ARG1 ( born \$ARG2 in', '\$ARG1 ( \$ARG2 -', '\$ARG1 was born in \$ARG2']"
org:top_members/employees,"['\$ARG1 , executive director of \$ARG2', '\$ARG1 , head of \$ARG2', '\$ARG1 , who heads \$ARG2', '\$ARG1 , chief executive of \$ARG2']"


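Since we want to do a simple regex search, we convert the hand-written patterns into regexes: each pattern is escaped, and an optional article ("A", "a", "The" or "the") is allowed before each argument.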

In [263]:
def preprocess_patterns(patterns):
    regex_patterns = [re.sub("\\\\\\$ARG", "(A )?(a )?(The )?(the )?\\$ARG", re.escape(pattern)) for pattern in patterns]
    return regex_patterns

relation_patterns_df["regex pattern"] = relation_patterns_df["raw pattern"].apply(preprocess_patterns)
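The effect of this conversion can be checked on a single pattern. The sketch below is a standalone illustration: `to_regex` and `ARTICLES` are hypothetical names, and a replacement function is used instead of a replacement string so that no backslash handling in the template is involved.

```python
import re

ARTICLES = "(A )?(a )?(The )?(the )?"

def to_regex(pattern):
    # Escape regex metacharacters such as "$", "(" and ")" in the hand-written pattern.
    escaped = re.escape(pattern)
    # Allow an optional article in front of each escaped "$ARG" placeholder.
    return re.sub(r"\\\$ARG", lambda m: ARTICLES + m.group(0), escaped)

regex = to_regex("$ARG1 , head of $ARG2")

exact = re.match(regex, "$ARG1 , head of $ARG2") is not None
with_article = re.match(regex, "$ARG1 , head of the $ARG2") is not None
other_words = re.match(regex, "$ARG1 , chairman of $ARG2") is not None
print(exact, with_article, other_words)
```

Note that `re.match` only anchors the regex at the start of the string. Because the extracted subsentences end at the second placeholder, a pattern with text after `$ARG2`, such as "$ARG1 ( born $ARG2 )", cannot match them as written.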

In [264]:
relation_patterns_df.apply(lambda x: escape_dollar(x)).head()

Unnamed: 0,raw pattern,regex pattern
org:alternate_names,"['\$ARG1 ( \$ARG2 ', '\$ARG1 formerly known as \$ARG2', '\$ARG1 aka \$ARG2', '\$ARG1 ( also known as \$ARG2 )']","['(A )?(a )?(The )?(the )?\\\$ARG1\\ \\(\\ (A )?(a )?(The )?(the )?\\\$ARG2\\ ', '(A )?(a )?(The )?(the )?\\\$ARG1\\ formerly\\ known\\ as\\ (A )?(a )?(The )?(the )?\\\$ARG2', '(A )?(a )?(The )?(the )?\\\$ARG1\\ aka\\ (A )?(a )?(The )?(the )?\\\$ARG2', '(A )?(a )?(The )?(the )?\\\$ARG1\\ \\(\\ also\\ known\\ as\\ (A )?(a )?(The )?(the )?\\\$ARG2\\ \\)']"
per:date_of_birth,"['\$ARG1 ( born \$ARG2 )', '\$ARG1 ( born \$ARG2 in', '\$ARG1 ( \$ARG2 -', '\$ARG1 was born in \$ARG2']","['(A )?(a )?(The )?(the )?\\\$ARG1\\ \\(\\ born\\ (A )?(a )?(The )?(the )?\\\$ARG2\\ \\)', '(A )?(a )?(The )?(the )?\\\$ARG1\\ \\(\\ born\\ (A )?(a )?(The )?(the )?\\\$ARG2\\ in', '(A )?(a )?(The )?(the )?\\\$ARG1\\ \\(\\ (A )?(a )?(The )?(the )?\\\$ARG2\\ \\-', '(A )?(a )?(The )?(the )?\\\$ARG1\\ was\\ born\\ in\\ (A )?(a )?(The )?(the )?\\\$ARG2']"
org:top_members/employees,"['\$ARG1 , executive director of \$ARG2', '\$ARG1 , head of \$ARG2', '\$ARG1 , who heads \$ARG2', '\$ARG1 , chief executive of \$ARG2']","['(A )?(a )?(The )?(the )?\\\$ARG1\\ ,\\ executive\\ director\\ of\\ (A )?(a )?(The )?(the )?\\\$ARG2', '(A )?(a )?(The )?(the )?\\\$ARG1\\ ,\\ head\\ of\\ (A )?(a )?(The )?(the )?\\\$ARG2', '(A )?(a )?(The )?(the )?\\\$ARG1\\ ,\\ who\\ heads\\ (A )?(a )?(The )?(the )?\\\$ARG2', '(A )?(a )?(The )?(the )?\\\$ARG1\\ ,\\ chief\\ executive\\ of\\ (A )?(a )?(The )?(the )?\\\$ARG2']"


## Entity pairing

We get the sentence annotations with the spaCy package.

In [265]:
analyzer = spacy.load("en_core_web_sm")
selected_samples["spacy info"] = selected_samples["sample"].apply(lambda x: analyzer(x).to_json())

After that, we take the entities pairwise in each sentence and create subsentences that include:
- the first entity substituted with "$ARG1"
- the second entity substituted with "$ARG2"
- all the words in between

In [314]:
def get_extracted_sample(sample):
    return [(ARG1 + sample["text"][ent1["end"]:ent2["start"]] + ARG2) if ent1["end"] < ent2["end"] 
            else (ARG2 + sample["text"][ent2["end"]:ent1["start"]] + ARG1)
            for ent1, ent2 in itertools.permutations(sample["ents"],2)]

selected_samples["extr"] = selected_samples["spacy info"].apply(lambda x: get_extracted_sample(x))
selected_samples = selected_samples.loc[selected_samples["extr"].apply(lambda x: len(x) != 0)]
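To see what the pairing produces, the same logic can be run on one sentence from the dataset without loading a model. In this sketch the entity boundaries are written by hand (an assumption standing in for the "ents" list of the spaCy JSON), and `span` is a hypothetical helper.

```python
import itertools

ARG1, ARG2 = "$ARG1", "$ARG2"

text = "Rana , who is a Canadian citizen , was arrested Oct 18 in his home ."

def span(entity):
    # Character offsets of the entity, in the same form as the spaCy JSON entries.
    start = text.index(entity)
    return {"start": start, "end": start + len(entity)}

# Assumed entity boundaries; a real run takes them from the NER model.
ents = [span("Rana"), span("Canadian"), span("Oct 18")]

# permutations() yields every ordered pair, so each pair of entities appears twice:
# once with the placeholders in reading order and once with them swapped.
subsentences = [
    ARG1 + text[e1["end"]:e2["start"]] + ARG2 if e1["end"] < e2["end"]
    else ARG2 + text[e2["end"]:e1["start"]] + ARG1
    for e1, e2 in itertools.permutations(ents, 2)
]

for s in subsentences:
    print(s)
```

Three entities give six subsentences, which is why long sentences with many entities produce very long candidate lists.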

In [315]:
selected_samples[["sample", "label", "extr"]].apply(lambda x: escape_dollar(x)).head()

Unnamed: 0,sample,label,extr
8105,"The body of Polish first lady Maria Kaczynska remains unidentified more than a day after she died in an air crash with President Lech Kaczynski and dozens of other officials , an aide said Sunday .",per:cause_of_death,"['\$ARG1 first lady \$ARG2', '\$ARG1 first lady Maria Kaczynska remains unidentified \$ARG2', '\$ARG1 first lady Maria Kaczynska remains unidentified more than a day after she died in an air crash with President \$ARG2', '\$ARG1 first lady Maria Kaczynska remains unidentified more than a day after she died in an air crash with President Lech Kaczynski and \$ARG2', '\$ARG1 first lady Maria Kaczynska remains unidentified more than a day after she died in an air crash with President Lech Kaczynski and dozens of other officials , an aide said \$ARG2', '\$ARG2 first lady \$ARG1', '\$ARG1 remains unidentified \$ARG2', '\$ARG1 remains unidentified more than a day after she died in an air crash with President \$ARG2', '\$ARG1 remains unidentified more than a day after she died in an air crash with President Lech Kaczynski and \$ARG2', '\$ARG1 remains unidentified more than a day after she died in an air crash with President Lech Kaczynski and dozens of other officials , an aide said \$ARG2', '\$ARG2 first lady Maria Kaczynska remains unidentified \$ARG1', '\$ARG2 remains unidentified \$ARG1', '\$ARG1 after she died in an air crash with President \$ARG2', '\$ARG1 after she died in an air crash with President Lech Kaczynski and \$ARG2', '\$ARG1 after she died in an air crash with President Lech Kaczynski and dozens of other officials , an aide said \$ARG2', '\$ARG2 first lady Maria Kaczynska remains unidentified more than a day after she died in an air crash with President \$ARG1', '\$ARG2 remains unidentified more than a day after she died in an air crash with President \$ARG1', '\$ARG2 after she died in an air crash with President \$ARG1', '\$ARG1 and \$ARG2', '\$ARG1 and dozens of other officials , an aide said \$ARG2', '\$ARG2 first lady Maria 
Kaczynska remains unidentified more than a day after she died in an air crash with President Lech Kaczynski and \$ARG1', '\$ARG2 remains unidentified more than a day after she died in an air crash with President Lech Kaczynski and \$ARG1', '\$ARG2 after she died in an air crash with President Lech Kaczynski and \$ARG1', '\$ARG2 and \$ARG1', '\$ARG1 of other officials , an aide said \$ARG2', '\$ARG2 first lady Maria Kaczynska remains unidentified more than a day after she died in an air crash with President Lech Kaczynski and dozens of other officials , an aide said \$ARG1', '\$ARG2 remains unidentified more than a day after she died in an air crash with President Lech Kaczynski and dozens of other officials , an aide said \$ARG1', '\$ARG2 after she died in an air crash with President Lech Kaczynski and dozens of other officials , an aide said \$ARG1', '\$ARG2 and dozens of other officials , an aide said \$ARG1', '\$ARG2 of other officials , an aide said \$ARG1']"
14261,"Rana , who is a Canadian citizen , was arrested Oct 18 in his home .",per:origin,"['\$ARG1 , who is a \$ARG2', '\$ARG1 , who is a Canadian citizen , was arrested \$ARG2', '\$ARG2 , who is a \$ARG1', '\$ARG1 citizen , was arrested \$ARG2', '\$ARG2 , who is a Canadian citizen , was arrested \$ARG1', '\$ARG2 citizen , was arrested \$ARG1']"
13130,Shareholders will be voting on whether to split the company 's Class B shares in a move tied to Berkshire 's purchase of Burlington Northern Santa Fe Corp .,org:parents,"[""\$ARG1 's purchase of \$ARG2"", ""\$ARG2 's purchase of \$ARG1""]"
8370,"Leading Cuban political prisoner Orlando Zapata died in hospital Tuesday , 85 days into a hunger strike , medical officials said as `` indignant '' dissidents blamed the government for his death .",per:cause_of_death,"['\$ARG1 political prisoner \$ARG2', '\$ARG1 political prisoner Orlando Zapata died in hospital \$ARG2', '\$ARG1 political prisoner Orlando Zapata died in hospital Tuesday , \$ARG2', '\$ARG2 political prisoner \$ARG1', '\$ARG1 died in hospital \$ARG2', '\$ARG1 died in hospital Tuesday , \$ARG2', '\$ARG2 political prisoner Orlando Zapata died in hospital \$ARG1', '\$ARG2 died in hospital \$ARG1', '\$ARG1 , \$ARG2', '\$ARG2 political prisoner Orlando Zapata died in hospital Tuesday , \$ARG1', '\$ARG2 died in hospital Tuesday , \$ARG1', '\$ARG2 , \$ARG1']"
6984,"Colorado Rockies team president Keli McGregor , a former National Football League player , was found dead Tuesday in a hotel room , Salt Lake City police said .",per:city_of_death,"['\$ARG1 \$ARG2', '\$ARG1 Rockies team president \$ARG2', '\$ARG1 Rockies team president Keli McGregor , a former \$ARG2', '\$ARG1 Rockies team president Keli McGregor , a former National Football League player , was found dead \$ARG2', '\$ARG1 Rockies team president Keli McGregor , a former National Football League player , was found dead Tuesday in a hotel room , \$ARG2', '\$ARG2 \$ARG1', '\$ARG1 team president \$ARG2', '\$ARG1 team president Keli McGregor , a former \$ARG2', '\$ARG1 team president Keli McGregor , a former National Football League player , was found dead \$ARG2', '\$ARG1 team president Keli McGregor , a former National Football League player , was found dead Tuesday in a hotel room , \$ARG2', '\$ARG2 Rockies team president \$ARG1', '\$ARG2 team president \$ARG1', '\$ARG1 , a former \$ARG2', '\$ARG1 , a former National Football League player , was found dead \$ARG2', '\$ARG1 , a former National Football League player , was found dead Tuesday in a hotel room , \$ARG2', '\$ARG2 Rockies team president Keli McGregor , a former \$ARG1', '\$ARG2 team president Keli McGregor , a former \$ARG1', '\$ARG2 , a former \$ARG1', '\$ARG1 player , was found dead \$ARG2', '\$ARG1 player , was found dead Tuesday in a hotel room , \$ARG2', '\$ARG2 Rockies team president Keli McGregor , a former National Football League player , was found dead \$ARG1', '\$ARG2 team president Keli McGregor , a former National Football League player , was found dead \$ARG1', '\$ARG2 , a former National Football League player , was found dead \$ARG1', '\$ARG2 player , was found dead \$ARG1', '\$ARG1 in a hotel room , \$ARG2', '\$ARG2 Rockies team president Keli McGregor , a former National Football League player , was found dead Tuesday in a hotel room , \$ARG1', '\$ARG2 team president Keli McGregor , a 
former National Football League player , was found dead Tuesday in a hotel room , \$ARG1', '\$ARG2 , a former National Football League player , was found dead Tuesday in a hotel room , \$ARG1', '\$ARG2 player , was found dead Tuesday in a hotel room , \$ARG1', '\$ARG2 in a hotel room , \$ARG1']"


## Pattern search

Now we look for patterns in each of these subsentences. If there is a match, we add the corresponding sentence to the final DataFrame together with the pattern that matched and the weak label it received. The gold labels are kept for comparison.

In [318]:
def pattern_search(extr_sample, patterns, row):
    # Returns the matches of the first relation with a matching pattern, or None if nothing matches
    for relation, rel_patterns in patterns.iterrows():
        matches = [[row["sample"], extr_sample, pattern, relation, row["label"]] 
                   for pattern in rel_patterns["regex pattern"]
                   if re.match(pattern, extr_sample) is not None]
        if len(matches) > 0:
            return pd.DataFrame(matches, columns = FINAL_DF_COLUMNS)

In [319]:
all_matches = pd.DataFrame(columns = FINAL_DF_COLUMNS)
for _, row in selected_samples.iterrows():
    for cand_sample in row["extr"]:
        df_found = pattern_search(cand_sample, relation_patterns_df, row)
        if isinstance(df_found, pd.DataFrame) and not df_found.empty:
            all_matches = pd.concat([all_matches, df_found])
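The loop above grows the result with `pd.concat` on every match, which copies the accumulated rows each time. A possible alternative, sketched here on toy data, is to collect plain rows in a list and build the DataFrame once at the end. The `patterns` dictionary and the `rows` list are hypothetical stand-ins for `relation_patterns_df` and `selected_samples`, and unlike `pattern_search` this version does not stop at the first matching relation.

```python
import re
import pandas as pd

FINAL_DF_COLUMNS = ['sample', 'extr_sample', 'pattern', 'weak_label', 'gold_label']

# Toy stand-ins (assumptions): relation -> compiled regexes, and two candidate rows.
patterns = {"per:date_of_birth": [re.compile(r"\$ARG1 was born in \$ARG2")]}
rows = [
    {"sample": "X was born in 1938 .", "label": "per:date_of_birth",
     "extr": ["$ARG1 was born in $ARG2"]},
    {"sample": "X , head of Y .", "label": "org:top_members/employees",
     "extr": ["$ARG1 , head of $ARG2"]},
]

# Collect one plain list per match instead of concatenating DataFrames in the loop.
records = []
for row in rows:
    for cand in row["extr"]:
        for relation, regexes in patterns.items():
            for regex in regexes:
                if regex.match(cand):
                    records.append([row["sample"], cand, regex.pattern, relation, row["label"]])

toy_matches = pd.DataFrame(records, columns=FINAL_DF_COLUMNS)
print(len(toy_matches))
```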

In [320]:
all_matches.apply(lambda x: escape_dollar(x)).head()

Unnamed: 0,sample,extr_sample,pattern,weak_label,gold_label
0,"A professor emeritus at Yale University , Mandelbrot was born in Poland but as a child moved with his family to France where he was educated .",\$ARG1 was born in \$ARG2,(A )?(a )?(The )?(the )?\\$ARG1\ was\ born\ in\ (A )?(a )?(The )?(the )?\\$ARG2,per:date_of_birth,per:countries_of_residence
0,"Gwathmey was born in 1938 , the only child of painter Robert Gwathmey and his wife , Rosalie , a photographer .",\$ARG1 was born in \$ARG2,(A )?(a )?(The )?(the )?\\$ARG1\ was\ born\ in\ (A )?(a )?(The )?(the )?\\$ARG2,per:date_of_birth,per:children
0,"CBS News pioneer Don Hewitt dies at 86 Donald Shepard Hewitt was born in New York on Dec 14 , 1922 , and grew up in the suburb of New Rochelle .",\$ARG1 was born in \$ARG2,(A )?(a )?(The )?(the )?\\$ARG1\ was\ born\ in\ (A )?(a )?(The )?(the )?\\$ARG2,per:date_of_birth,per:stateorprovince_of_birth
0,"`` It is difficult to put one country and its citizens in the same boat ... some people from Iran have very legitimate interests in Europe and are very trustworthy , '' said Pierre Mirabaud , head of the Swiss Bankers Association .","\$ARG1 , head of \$ARG2","(A )?(a )?(The )?(the )?\\$ARG1\ ,\ head\ of\ (A )?(a )?(The )?(the )?\\$ARG2",org:top_members/employees,org:top_members/employees
0,"A professor emeritus at Yale University , Mandelbrot was born in Poland but as a child moved with his family to France where he was educated .",\$ARG1 was born in \$ARG2,(A )?(a )?(The )?(the )?\\$ARG1\ was\ born\ in\ (A )?(a )?(The )?(the )?\\$ARG2,per:date_of_birth,per:country_of_birth


In [284]:
all_matches.size

120

But here we can observe some mislabelled sentences. For example, the sentence

"A professor emeritus at Yale University , Mandelbrot was born in Poland but as a child moved with his family to France where he was educated" 

was assigned the label "per:date_of_birth" by the pattern "$ARG1 was born in $ARG2", which is definitely wrong: "Poland" is a country, not a date. To avoid such mistakes, let's add constraints on the argument types.

In [285]:
relation_to_types = {"org:alternate_names": ['PERSON', 'PERSON'], 
                     "per:date_of_birth": ['PERSON', 'DATE'],
                     "org:top_members/employees": ['PERSON', 'ORG']}

So, when we look for these patterns in the samples, we should take into account the entity types required by the corresponding relation.

In [286]:
relation_patterns_df["type"] = pd.Series(relation_to_types)
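The following sketch shows how such a constraint separates the two "was born in" cases seen earlier. The entity types in `candidates` are assumptions about what an NER model would assign ("GPE" for the country, "DATE" for the year), and the sentences are shortened for readability.

```python
import re

# Required argument types for the relation, as in relation_to_types.
required_types = {"per:date_of_birth": ["PERSON", "DATE"]}
regex = re.compile(r"\$ARG1 was born in \$ARG2")

# (shortened sentence, extracted subsentence, assumed entity types of the two arguments)
candidates = [
    ("Mandelbrot was born in Poland", "$ARG1 was born in $ARG2", ["PERSON", "GPE"]),
    ("Gwathmey was born in 1938", "$ARG1 was born in $ARG2", ["PERSON", "DATE"]),
]

# Both subsentences match the pattern, but only one has the required argument types.
accepted = [
    sentence
    for sentence, subsentence, types in candidates
    if types == required_types["per:date_of_birth"] and regex.match(subsentence)
]
print(accepted)
```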

In [287]:
relation_patterns_df.apply(lambda x: escape_dollar(x)).head()

Unnamed: 0,raw pattern,regex pattern,type
org:alternate_names,"['\$ARG1 ( \$ARG2 ', '\$ARG1 formerly known as \$ARG2', '\$ARG1 aka \$ARG2', '\$ARG1 ( also known as \$ARG2 )']","['(A )?(a )?(The )?(the )?\\\$ARG1\\ \\(\\ (A )?(a )?(The )?(the )?\\\$ARG2\\ ', '(A )?(a )?(The )?(the )?\\\$ARG1\\ formerly\\ known\\ as\\ (A )?(a )?(The )?(the )?\\\$ARG2', '(A )?(a )?(The )?(the )?\\\$ARG1\\ aka\\ (A )?(a )?(The )?(the )?\\\$ARG2', '(A )?(a )?(The )?(the )?\\\$ARG1\\ \\(\\ also\\ known\\ as\\ (A )?(a )?(The )?(the )?\\\$ARG2\\ \\)']","['PERSON', 'PERSON']"
per:date_of_birth,"['\$ARG1 ( born \$ARG2 )', '\$ARG1 ( born \$ARG2 in', '\$ARG1 ( \$ARG2 -', '\$ARG1 was born in \$ARG2']","['(A )?(a )?(The )?(the )?\\\$ARG1\\ \\(\\ born\\ (A )?(a )?(The )?(the )?\\\$ARG2\\ \\)', '(A )?(a )?(The )?(the )?\\\$ARG1\\ \\(\\ born\\ (A )?(a )?(The )?(the )?\\\$ARG2\\ in', '(A )?(a )?(The )?(the )?\\\$ARG1\\ \\(\\ (A )?(a )?(The )?(the )?\\\$ARG2\\ \\-', '(A )?(a )?(The )?(the )?\\\$ARG1\\ was\\ born\\ in\\ (A )?(a )?(The )?(the )?\\\$ARG2']","['PERSON', 'DATE']"
org:top_members/employees,"['\$ARG1 , executive director of \$ARG2', '\$ARG1 , head of \$ARG2', '\$ARG1 , who heads \$ARG2', '\$ARG1 , chief executive of \$ARG2']","['(A )?(a )?(The )?(the )?\\\$ARG1\\ ,\\ executive\\ director\\ of\\ (A )?(a )?(The )?(the )?\\\$ARG2', '(A )?(a )?(The )?(the )?\\\$ARG1\\ ,\\ head\\ of\\ (A )?(a )?(The )?(the )?\\\$ARG2', '(A )?(a )?(The )?(the )?\\\$ARG1\\ ,\\ who\\ heads\\ (A )?(a )?(The )?(the )?\\\$ARG2', '(A )?(a )?(The )?(the )?\\\$ARG1\\ ,\\ chief\\ executive\\ of\\ (A )?(a )?(The )?(the )?\\\$ARG2']","['PERSON', 'ORG']"


We get the entity types from the spaCy annotations as well and build a new format for the extracted subsentences: each of them is now not a plain string but a dictionary of the form {subsentence: entity types}.

In [288]:
def get_arg_types(sample):
    return [{ARG1 + sample["text"][ent1["end"]:ent2["start"]] + ARG2 : [ent1["label"], ent2["label"]]}
            if ent1["end"] < ent2["end"]
            else {(ARG2 + sample["text"][ent2["end"]:ent1["start"]] + ARG1) : [ent2["label"], ent1["label"]]}
            for ent1, ent2 in itertools.permutations(sample["ents"],2)]

In [289]:
selected_samples["extr_with_types"] = selected_samples["spacy info"].apply(lambda x: get_arg_types(x))

In [290]:
# selected_samples.apply(lambda x: escape_dollar(x)).head()

We do a similar pattern search, but now taking into account entity types.

In [291]:
def pattern_search_filter_types(cand_sample_with_types, patterns, row):
    for relation, rel_patterns in patterns.iterrows():
        cand_sample_types = [item for type_pair in cand_sample_with_types.values() for item in type_pair]
        if rel_patterns["type"] == cand_sample_types:
            sample = list(cand_sample_with_types.keys())[0]
            matches = [[row["sample"], sample, pattern, relation, row["label"]] 
                       for pattern in rel_patterns["regex pattern"]
                       if re.match(pattern, sample) is not None]   # todo
            if len(matches) > 0:
                return pd.DataFrame(matches, columns = FINAL_DF_COLUMNS)

In [292]:
all_matches_filter_types = pd.DataFrame(columns = FINAL_DF_COLUMNS)
for _, row in selected_samples.iterrows():
    for cand_sample_with_types in row["extr_with_types"]:
        df_found = pattern_search_filter_types(cand_sample_with_types, relation_patterns_df, row)
        if isinstance(df_found, pd.DataFrame) and not df_found.empty:
            all_matches_filter_types = pd.concat([all_matches_filter_types, df_found])

In [293]:
all_matches_filter_types.apply(lambda x: escape_dollar(x))

Unnamed: 0,sample,extr_sample,pattern,weak_label,gold_label
0,"Gwathmey was born in 1938 , the only child of painter Robert Gwathmey and his wife , Rosalie , a photographer .",\$ARG1 was born in \$ARG2,(A )?(a )?(The )?(the )?\\$ARG1\ was\ born\ in\ (A )?(a )?(The )?(the )?\\$ARG2,per:date_of_birth,per:children
0,"`` It is difficult to put one country and its citizens in the same boat ... some people from Iran have very legitimate interests in Europe and are very trustworthy , '' said Pierre Mirabaud , head of the Swiss Bankers Association .","\$ARG1 , head of \$ARG2","(A )?(a )?(The )?(the )?\\$ARG1\ ,\ head\ of\ (A )?(a )?(The )?(the )?\\$ARG2",org:top_members/employees,org:top_members/employees
0,"Anders Berntell , head of the Stockholm International Water Institute , says that , although `` water is absolutely crucial for all sectors in society , '' water issues have played too small a role in climate talks .","\$ARG1 , head of \$ARG2","(A )?(a )?(The )?(the )?\\$ARG1\ ,\ head\ of\ (A )?(a )?(The )?(the )?\\$ARG2",org:top_members/employees,org:top_members/employees
0,"Gwathmey was born in 1938 , the only child of painter Robert Gwathmey and his wife , Rosalie , a photographer .",\$ARG1 was born in \$ARG2,(A )?(a )?(The )?(the )?\\$ARG1\ was\ born\ in\ (A )?(a )?(The )?(the )?\\$ARG2,per:date_of_birth,per:date_of_birth
0,"`` At the moment the water issue does n't get enough attention in the climate negotiations , '' Anders Berntell , head of the Stockholm International Water Institute , told The Associated Press .","\$ARG1 , head of \$ARG2","(A )?(a )?(The )?(the )?\\$ARG1\ ,\ head\ of\ (A )?(a )?(The )?(the )?\\$ARG2",org:top_members/employees,org:top_members/employees
0,"Water and its links to development , peace and conflict were key words in the annual sessions , Anders Berntell , executive director of Stockholm International Water Institute ( SIWI ) , said in his opening address .","\$ARG1 , executive director of \$ARG2","(A )?(a )?(The )?(the )?\\$ARG1\ ,\ executive\ director\ of\ (A )?(a )?(The )?(the )?\\$ARG2",org:top_members/employees,org:top_members/employees
0,"Water and its links to development , peace and conflict were key words in the annual sessions , Anders Berntell , executive director of Stockholm International Water Institute ( SIWI ) , said in his opening address .","\$ARG1 , executive director of \$ARG2","(A )?(a )?(The )?(the )?\\$ARG1\ ,\ executive\ director\ of\ (A )?(a )?(The )?(the )?\\$ARG2",org:top_members/employees,org:alternate_names
0,"STOCKHOLM 2009-08-21 11:50:54 UTC Anders Berntell , head of the Stockholm International Water Institute , says that , although `` water is absolutely crucial for all sectors in society , '' water issues have played too small a role in climate talks .","\$ARG1 , head of \$ARG2","(A )?(a )?(The )?(the )?\\$ARG1\ ,\ head\ of\ (A )?(a )?(The )?(the )?\\$ARG2",org:top_members/employees,org:top_members/employees
0,"Gwathmey was born in 1938 , the only child of painter Robert Gwathmey and his wife , Rosalie , a photographer .",\$ARG1 was born in \$ARG2,(A )?(a )?(The )?(the )?\\$ARG1\ was\ born\ in\ (A )?(a )?(The )?(the )?\\$ARG2,per:date_of_birth,per:children
0,"Of course , customer demand is the ultimate driver , said John Overstreet , executive director of the Indoor Tanning Association , which represents the \$ 5 billion industry .","\$ARG1 , executive director of \$ARG2","(A )?(a )?(The )?(the )?\\$ARG1\ ,\ executive\ director\ of\ (A )?(a )?(The )?(the )?\\$ARG2",org:top_members/employees,org:top_members/employees


In [321]:
all_matches_filter_types.size

60

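So now we have a smaller but much cleaner dataset. Note that `DataFrame.size` counts cells (rows times columns), so with the 5 columns of the final DataFrame the values 60 and 120 correspond to 12 matches with argument type filtering compared to 24 without it. We have completed our first weak annotation of data :)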
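Since the gold labels are kept next to the weak ones, their agreement is easy to measure. As a small sketch, the pairs below are copied from the ten rows displayed in the filtered table above (not from the full result), and the variable names are hypothetical.

```python
# (weak_label, gold_label) pairs of the ten displayed rows, in order.
displayed = [
    ("per:date_of_birth", "per:children"),
    ("org:top_members/employees", "org:top_members/employees"),
    ("org:top_members/employees", "org:top_members/employees"),
    ("per:date_of_birth", "per:date_of_birth"),
    ("org:top_members/employees", "org:top_members/employees"),
    ("org:top_members/employees", "org:top_members/employees"),
    ("org:top_members/employees", "org:alternate_names"),
    ("org:top_members/employees", "org:top_members/employees"),
    ("per:date_of_birth", "per:children"),
    ("org:top_members/employees", "org:top_members/employees"),
]

# Count the rows where the weak label equals the gold label.
agree = sum(weak == gold for weak, gold in displayed)
agreement = agree / len(displayed)
print(f"{agree}/{len(displayed)} displayed rows agree with the gold label")
```

On the full DataFrame the same number could be obtained with `(all_matches_filter_types["weak_label"] == all_matches_filter_types["gold_label"]).mean()`.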