# Relation Extraction Dataset based on CONLL data

In this short tutorial we look at how to weakly annotate data for a relation extraction task using simple patterns.

For that, we use a dataset in CONLL format that already has gold labels, so that we can compare them with the weak labels we produce. Example files can be downloaded from the following link: https://ucloud.univie.ac.at/index.php/s/cGQNRgbPHZU89mY

The steps are the following:
- process the data in CONLL format (an example can be downloaded from the link above)
- extract the information relevant to us (the sample and the relation label)
- create a sample set of hand-crafted patterns in the format _"\\$ARG1 \<some words\> \\$ARG2"_, where _\\$ARG1_ and _\\$ARG2_ are the entities between which the relation holds
- annotate the samples with the spaCy package in order to get the entities
- take the entities pairwise and extract the part of the sentence between them

(e.g. from the sentence _"Margaret Thatcher was born on 13 October 1925"_ the subsentence _"\\$ARG1 was born on \\$ARG2"_ is extracted, given that _"Margaret Thatcher"_ and _"13 October 1925"_ are the only entities found by spaCy)

- check whether one of the patterns matches the subsentence. If so, the sample receives the label of the relation that the pattern supports and becomes part of our new weakly annotated training set.

## Imports

In [78]:
import spacy
import pandas as pd
# import en_core_web_sm
import re
import itertools
from tqdm import tqdm
tqdm.pandas()

# None disables column truncation (the former value -1 is no longer accepted by recent pandas)
pd.set_option('display.max_colwidth', None)


In [79]:
# in order to avoid issues with en_core_web_lg, please execute the following command
!python -m spacy download en_core_web_sm

In [None]:
ARG1 = "$ARG1"
ARG2 = "$ARG2"
FINAL_DF_COLUMNS = ['sample', 'extr_sample', 'pattern', 'weak_label', 'gold_label']

We use this function each time we print a DataFrame, in order to escape the dollar symbol in the patterns.

In [80]:
def escape_dollar(strings):
    return [re.sub("\\$", "\\\\$", str(string)) for string in strings]
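
As a quick illustration of what the helper does (a minimal sketch; the input values are made up for this illustration), every dollar sign receives a leading backslash, presumably so that the notebook does not render the text between two dollar signs as a formula. Non-string values are converted to strings first.

```python
import re

def escape_dollar(strings):
    # same helper as above: put a backslash in front of every dollar sign
    return [re.sub("\\$", "\\\\$", str(string)) for string in strings]

# made-up example values: one pattern and one non-string value
escaped = escape_dollar(["$ARG1 aka $ARG2", 42])
print(escaped)
```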

## Read and preprocess CONLL data

First, we read the input .conll file. Please make sure that it is saved in the same directory as this tutorial notebook and that the path to it is set in the conll_file variable.

In [81]:
conll_file = "dev.conll"

In [82]:
def process_data(path_to_data):
    samples, relations = [], []
    with open(path_to_data, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line.startswith("# id="):    # Instance starts
                sample = ""
                # the relation label is the fourth space-separated field, minus its 5-character key prefix
                label = line.split(" ")[3][5:]
            elif line == "":  # Instance ends
                samples.append(sample)
                relations.append(label)
            elif line.startswith("#"):  # comment
                continue
            else:
                parts = line.split("\t")
                token = parts[1]
                if token == "-LRB-":
                    token = "("
                elif token == "-RRB-":
                    token = ")"
                sample += " " + token
    return pd.DataFrame.from_dict({"sample": samples, "label": relations})

samples = process_data(conll_file)
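
To make the expected input layout more concrete, here is a minimal sketch that applies the same parsing rules to a tiny hand-written example. The example lines, the key names `docid` and `reln`, and the helper name `parse_conll_lines` are assumptions made for illustration; the real files may carry more columns per token. Unlike `process_data`, the sketch works on a list of lines, joins the tokens without a leading space, and ignores repeated blank lines.

```python
# Hypothetical miniature input; real files may contain more columns per token.
EXAMPLE_LINES = [
    "# id=e1 docid=d1 reln=per:date_of_birth",
    "1\tMargaret",
    "2\tThatcher",
    "3\t-LRB-",
    "4\tborn",
    "5\t1925",
    "6\t-RRB-",
    "",
]

BRACKETS = {"-LRB-": "(", "-RRB-": ")"}

def parse_conll_lines(lines):
    """Return (sentence, label) pairs, following the same rules as process_data."""
    pairs, tokens, label = [], [], None
    for line in lines:
        line = line.strip()
        if line.startswith("# id="):      # instance starts
            tokens = []
            label = line.split(" ")[3][5:]
        elif line == "":                  # instance ends
            if label is not None:         # skip repeated blank lines
                pairs.append((" ".join(tokens), label))
                label = None
        elif line.startswith("#"):        # other comment lines are skipped
            continue
        else:
            token = line.split("\t")[1]   # the token is the second tab-separated column
            tokens.append(BRACKETS.get(token, token))
    return pairs

parsed = parse_conll_lines(EXAMPLE_LINES)
print(parsed)
```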

Let's inspect all relation labels that we have in our dataset

In [83]:
print(set(samples["label"]))

{'org:founded', 'org:country_of_headquarters', 'per:city_of_birth', 'per:origin', 'per:other_family', 'per:country_of_birth', 'org:dissolved', 'org:shareholders', 'per:parents', 'per:cause_of_death', 'per:spouse', 'per:stateorprovince_of_death', 'no_relation', 'org:members', 'org:stateorprovince_of_headquarters', 'org:number_of_employees/members', 'per:cities_of_residence', 'org:political/religious_affiliation', 'per:children', 'per:date_of_death', 'org:top_members/employees', 'org:city_of_headquarters', 'org:subsidiaries', 'per:city_of_death', 'per:countries_of_residence', 'per:schools_attended', 'per:stateorprovinces_of_residence', 'org:alternate_names', 'org:parents', 'per:religion', 'org:founded_by', 'per:age', 'per:employee_of', 'per:title', 'per:date_of_birth', 'org:member_of', 'per:alternate_names', 'per:country_of_death', 'per:stateorprovince_of_birth', 'per:charges', 'org:website', 'per:siblings'}


To speed up the computation, let's keep only the samples that contain a relation (that is, those not labelled "no_relation") and draw a random subset of 1000 of them.

In [84]:
selected_samples = samples[samples["label"]!='no_relation'].sample(n=1000, random_state=100)

In [85]:
selected_samples.head()

Unnamed: 0,sample,label
4777,"Or as one hotel executive , James Anhut , of InterContinental Hotels Group , put it , `` Travelers can stay in a cool hotel and earn their Priority Club frequency points , too . ''",org:top_members/employees
4068,"Prolific French film maker Claude Chabrol , who helped start the New Wave movement in the 1950s , died on Sunday , aged 80 , an aide to Paris mayor Bertrand Delanoe told AFP .",per:origin
5581,"Try the Public Library of Science ( http://www.plos.org/ ) , which does have some drawbacks , but also appears to have a clue .",org:website
21851,"His main Shiite rival , Abdul Aziz al-Hakim , who heads the Islamic Supreme Council of Iraq , an influential Shiite political party that is part of Maliki 's ruling coalition , has also denounced the plans .",per:religion
4237,"Popular Kabul lawmaker Ramazan Bashardost , who camps out in a tent near parliament and campaigned against corruption , attracted 359,214 votes and former World Bank economist Ashraf Ghani 62,536 votes , Najafi added .",per:cities_of_residence


## Defining patterns

In order to turn the data into a distantly supervised dataset, we write down a few simple patterns for each relation that help us find this relation in the training samples. To keep things simple, we restrict ourselves to 3 relations from the TACRED relation labels: "org:alternate_names", "per:date_of_birth" and "org:top_members/employees".

In [86]:
relation_patterns = {"org:alternate_names":
                     ["$ARG1 ( $ARG2 ", 
                       "$ARG1 formerly known as $ARG2",
                       "$ARG1 aka $ARG2", 
                       "$ARG1 ( also known as $ARG2 )"],
                     "per:date_of_birth": 
                     ["$ARG1 ( born $ARG2 )", 
                       "$ARG1 ( born $ARG2 in",
                       "$ARG1 ( $ARG2 -",
                       "$ARG1 was born in $ARG2"],
                     "org:top_members/employees":
                     ["$ARG1 , executive director of $ARG2",
                       "$ARG1 , head of $ARG2",
                       "$ARG1 , who heads $ARG2",
                       "$ARG1 , chief executive of $ARG2"]}

relation_patterns_df = pd.DataFrame.from_dict(relation_patterns, orient='index')
relation_patterns_df['raw pattern']=relation_patterns_df.apply(lambda row: row.dropna().tolist(), axis=1)
relation_patterns_df = relation_patterns_df[['raw pattern']]

In [87]:
relation_patterns_df.apply(lambda x: escape_dollar(x)).head()

Unnamed: 0,raw pattern
org:alternate_names,"['\$ARG1 ( \$ARG2 ', '\$ARG1 formerly known as \$ARG2', '\$ARG1 aka \$ARG2', '\$ARG1 ( also known as \$ARG2 )']"
per:date_of_birth,"['\$ARG1 ( born \$ARG2 )', '\$ARG1 ( born \$ARG2 in', '\$ARG1 ( \$ARG2 -', '\$ARG1 was born in \$ARG2']"
org:top_members/employees,"['\$ARG1 , executive director of \$ARG2', '\$ARG1 , head of \$ARG2', '\$ARG1 , who heads \$ARG2', '\$ARG1 , chief executive of \$ARG2']"


Since we want to do a simple regex search, we convert the hand-written patterns into regexes. In the regexes, each argument placeholder may additionally be preceded by an article ("a" or "the").

In [88]:
def preprocess_patterns(patterns):
    regex_patterns = [re.sub("\\\\\\$ARG", "(A )?(a )?(The )?(the )?\\$ARG", re.escape(pattern)) for pattern in patterns]
    return regex_patterns

relation_patterns_df["regex pattern"] = relation_patterns_df["raw pattern"].apply(preprocess_patterns)
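
The effect of this conversion can be checked on a single pattern. The following is a minimal sketch, not the exact code of the tutorial: the helper name `to_regex` is hypothetical, and it uses `str.replace` instead of `re.sub` to insert the optional articles, which should yield the same kind of regex.

```python
import re

ARTICLES = "(A )?(a )?(The )?(the )?"

def to_regex(raw_pattern):
    # escape regex metacharacters such as "$" and "(" ...
    escaped = re.escape(raw_pattern)
    # ... then allow an optional article in front of each argument placeholder
    return escaped.replace("\\$ARG", ARTICLES + "\\$ARG")

regex = to_regex("$ARG1 , head of $ARG2")

with_article = re.match(regex, "$ARG1 , head of the $ARG2") is not None
without_article = re.match(regex, "$ARG1 , head of $ARG2") is not None
other_text = re.match(regex, "$ARG1 , deputy head of $ARG2") is not None
print(with_article, without_article, other_text)
```

The optional article groups are what lets a subsentence such as "head of the $ARG2" be caught by the pattern "head of $ARG2".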

In [89]:
relation_patterns_df.apply(lambda x: escape_dollar(x)).head()

Unnamed: 0,raw pattern,regex pattern
org:alternate_names,"['\$ARG1 ( \$ARG2 ', '\$ARG1 formerly known as \$ARG2', '\$ARG1 aka \$ARG2', '\$ARG1 ( also known as \$ARG2 )']","['(A )?(a )?(The )?(the )?\\\$ARG1\\ \\(\\ (A )?(a )?(The )?(the )?\\\$ARG2\\ ', '(A )?(a )?(The )?(the )?\\\$ARG1\\ formerly\\ known\\ as\\ (A )?(a )?(The )?(the )?\\\$ARG2', '(A )?(a )?(The )?(the )?\\\$ARG1\\ aka\\ (A )?(a )?(The )?(the )?\\\$ARG2', '(A )?(a )?(The )?(the )?\\\$ARG1\\ \\(\\ also\\ known\\ as\\ (A )?(a )?(The )?(the )?\\\$ARG2\\ \\)']"
per:date_of_birth,"['\$ARG1 ( born \$ARG2 )', '\$ARG1 ( born \$ARG2 in', '\$ARG1 ( \$ARG2 -', '\$ARG1 was born in \$ARG2']","['(A )?(a )?(The )?(the )?\\\$ARG1\\ \\(\\ born\\ (A )?(a )?(The )?(the )?\\\$ARG2\\ \\)', '(A )?(a )?(The )?(the )?\\\$ARG1\\ \\(\\ born\\ (A )?(a )?(The )?(the )?\\\$ARG2\\ in', '(A )?(a )?(The )?(the )?\\\$ARG1\\ \\(\\ (A )?(a )?(The )?(the )?\\\$ARG2\\ \\-', '(A )?(a )?(The )?(the )?\\\$ARG1\\ was\\ born\\ in\\ (A )?(a )?(The )?(the )?\\\$ARG2']"
org:top_members/employees,"['\$ARG1 , executive director of \$ARG2', '\$ARG1 , head of \$ARG2', '\$ARG1 , who heads \$ARG2', '\$ARG1 , chief executive of \$ARG2']","['(A )?(a )?(The )?(the )?\\\$ARG1\\ ,\\ executive\\ director\\ of\\ (A )?(a )?(The )?(the )?\\\$ARG2', '(A )?(a )?(The )?(the )?\\\$ARG1\\ ,\\ head\\ of\\ (A )?(a )?(The )?(the )?\\\$ARG2', '(A )?(a )?(The )?(the )?\\\$ARG1\\ ,\\ who\\ heads\\ (A )?(a )?(The )?(the )?\\\$ARG2', '(A )?(a )?(The )?(the )?\\\$ARG1\\ ,\\ chief\\ executive\\ of\\ (A )?(a )?(The )?(the )?\\\$ARG2']"


## Entity pairing

We get the sentence annotations with the spaCy package.

In [90]:
analyzer = spacy.load("en_core_web_sm")
selected_samples["spacy info"] = selected_samples["sample"].apply(lambda x: analyzer(x).to_json())

After that, we take the entities pairwise in each sentence and create subsentences that include:
- the first entity substituted with "\\$ARG1"
- the second entity substituted with "\\$ARG2"
- all the words in between

In [91]:
def get_extracted_sample(sample):
    return [(ARG1 + sample["text"][ent1["end"]:ent2["start"]] + ARG2) if ent1["end"] < ent2["end"] 
            else (ARG2 + sample["text"][ent2["end"]:ent1["start"]] + ARG1)
            for ent1, ent2 in itertools.permutations(sample["ents"],2)]

selected_samples["extr"] = selected_samples["spacy info"].apply(lambda x: get_extracted_sample(x))
selected_samples = selected_samples.loc[selected_samples["extr"].apply(lambda x: len(x) != 0)]
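
To see what the function returns without running spaCy, here is a minimal sketch on the example sentence from the introduction. The `sample` dictionary is a hand-written stand-in for the output of `analyzer(text).to_json()`; the entity spans, the labels and the helper `span` are assumptions made for illustration.

```python
import itertools

ARG1, ARG2 = "$ARG1", "$ARG2"

def get_extracted_sample(sample):
    # same logic as above: keep the text between the two entities of every ordered pair
    return [(ARG1 + sample["text"][ent1["end"]:ent2["start"]] + ARG2) if ent1["end"] < ent2["end"]
            else (ARG2 + sample["text"][ent2["end"]:ent1["start"]] + ARG1)
            for ent1, ent2 in itertools.permutations(sample["ents"], 2)]

text = "Margaret Thatcher was born on 13 October 1925"

def span(surface, label):
    # character offsets of an entity, computed instead of counted by hand
    start = text.index(surface)
    return {"start": start, "end": start + len(surface), "label": label}

# hand-written stand-in for analyzer(text).to_json(); the labels are assumed
sample = {"text": text,
          "ents": [span("Margaret Thatcher", "PERSON"), span("13 October 1925", "DATE")]}

candidates = get_extracted_sample(sample)
print(candidates)
```

Because `itertools.permutations` yields every pair in both orders, each pair of entities produces two candidates: one with the roles in text order and one with the roles swapped.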

In [92]:
selected_samples[["sample", "label", "extr"]].apply(lambda x: escape_dollar(x)).head()

Unnamed: 0,sample,label,extr
4777,"Or as one hotel executive , James Anhut , of InterContinental Hotels Group , put it , `` Travelers can stay in a cool hotel and earn their Priority Club frequency points , too . ''",org:top_members/employees,"['\$ARG1 hotel executive , \$ARG2', '\$ARG1 hotel executive , James Anhut , of \$ARG2', '\$ARG1 hotel executive , James Anhut , of InterContinental Hotels Group , put it , `` \$ARG2', '\$ARG1 hotel executive , James Anhut , of InterContinental Hotels Group , put it , `` Travelers can stay in a cool hotel and earn their \$ARG2', '\$ARG2 hotel executive , \$ARG1', '\$ARG1 , of \$ARG2', '\$ARG1 , of InterContinental Hotels Group , put it , `` \$ARG2', '\$ARG1 , of InterContinental Hotels Group , put it , `` Travelers can stay in a cool hotel and earn their \$ARG2', '\$ARG2 hotel executive , James Anhut , of \$ARG1', '\$ARG2 , of \$ARG1', '\$ARG1 , put it , `` \$ARG2', '\$ARG1 , put it , `` Travelers can stay in a cool hotel and earn their \$ARG2', '\$ARG2 hotel executive , James Anhut , of InterContinental Hotels Group , put it , `` \$ARG1', '\$ARG2 , of InterContinental Hotels Group , put it , `` \$ARG1', '\$ARG2 , put it , `` \$ARG1', '\$ARG1 can stay in a cool hotel and earn their \$ARG2', '\$ARG2 hotel executive , James Anhut , of InterContinental Hotels Group , put it , `` Travelers can stay in a cool hotel and earn their \$ARG1', '\$ARG2 , of InterContinental Hotels Group , put it , `` Travelers can stay in a cool hotel and earn their \$ARG1', '\$ARG2 , put it , `` Travelers can stay in a cool hotel and earn their \$ARG1', '\$ARG2 can stay in a cool hotel and earn their \$ARG1']"
4068,"Prolific French film maker Claude Chabrol , who helped start the New Wave movement in the 1950s , died on Sunday , aged 80 , an aide to Paris mayor Bertrand Delanoe told AFP .",per:origin,"['\$ARG1 film maker \$ARG2', '\$ARG1 film maker Claude Chabrol , who helped start the \$ARG2', '\$ARG1 film maker Claude Chabrol , who helped start the New Wave movement in \$ARG2', '\$ARG1 film maker Claude Chabrol , who helped start the New Wave movement in the 1950s , died on \$ARG2', '\$ARG1 film maker Claude Chabrol , who helped start the New Wave movement in the 1950s , died on Sunday , aged \$ARG2', '\$ARG1 film maker Claude Chabrol , who helped start the New Wave movement in the 1950s , died on Sunday , aged 80 , an aide to \$ARG2', '\$ARG1 film maker Claude Chabrol , who helped start the New Wave movement in the 1950s , died on Sunday , aged 80 , an aide to Paris mayor \$ARG2', '\$ARG1 film maker Claude Chabrol , who helped start the New Wave movement in the 1950s , died on Sunday , aged 80 , an aide to Paris mayor Bertrand Delanoe told \$ARG2', '\$ARG2 film maker \$ARG1', '\$ARG1 , who helped start the \$ARG2', '\$ARG1 , who helped start the New Wave movement in \$ARG2', '\$ARG1 , who helped start the New Wave movement in the 1950s , died on \$ARG2', '\$ARG1 , who helped start the New Wave movement in the 1950s , died on Sunday , aged \$ARG2', '\$ARG1 , who helped start the New Wave movement in the 1950s , died on Sunday , aged 80 , an aide to \$ARG2', '\$ARG1 , who helped start the New Wave movement in the 1950s , died on Sunday , aged 80 , an aide to Paris mayor \$ARG2', '\$ARG1 , who helped start the New Wave movement in the 1950s , died on Sunday , aged 80 , an aide to Paris mayor Bertrand Delanoe told \$ARG2', '\$ARG2 film maker Claude Chabrol , who helped start the \$ARG1', '\$ARG2 , who helped start the \$ARG1', '\$ARG1 movement in \$ARG2', '\$ARG1 movement in the 1950s , died on \$ARG2', '\$ARG1 movement in the 1950s , died on Sunday , aged \$ARG2', 
'\$ARG1 movement in the 1950s , died on Sunday , aged 80 , an aide to \$ARG2', '\$ARG1 movement in the 1950s , died on Sunday , aged 80 , an aide to Paris mayor \$ARG2', '\$ARG1 movement in the 1950s , died on Sunday , aged 80 , an aide to Paris mayor Bertrand Delanoe told \$ARG2', '\$ARG2 film maker Claude Chabrol , who helped start the New Wave movement in \$ARG1', '\$ARG2 , who helped start the New Wave movement in \$ARG1', '\$ARG2 movement in \$ARG1', '\$ARG1 , died on \$ARG2', '\$ARG1 , died on Sunday , aged \$ARG2', '\$ARG1 , died on Sunday , aged 80 , an aide to \$ARG2', '\$ARG1 , died on Sunday , aged 80 , an aide to Paris mayor \$ARG2', '\$ARG1 , died on Sunday , aged 80 , an aide to Paris mayor Bertrand Delanoe told \$ARG2', '\$ARG2 film maker Claude Chabrol , who helped start the New Wave movement in the 1950s , died on \$ARG1', '\$ARG2 , who helped start the New Wave movement in the 1950s , died on \$ARG1', '\$ARG2 movement in the 1950s , died on \$ARG1', '\$ARG2 , died on \$ARG1', '\$ARG1 , aged \$ARG2', '\$ARG1 , aged 80 , an aide to \$ARG2', '\$ARG1 , aged 80 , an aide to Paris mayor \$ARG2', '\$ARG1 , aged 80 , an aide to Paris mayor Bertrand Delanoe told \$ARG2', '\$ARG2 film maker Claude Chabrol , who helped start the New Wave movement in the 1950s , died on Sunday , aged \$ARG1', '\$ARG2 , who helped start the New Wave movement in the 1950s , died on Sunday , aged \$ARG1', '\$ARG2 movement in the 1950s , died on Sunday , aged \$ARG1', '\$ARG2 , died on Sunday , aged \$ARG1', '\$ARG2 , aged \$ARG1', '\$ARG1 , an aide to \$ARG2', '\$ARG1 , an aide to Paris mayor \$ARG2', '\$ARG1 , an aide to Paris mayor Bertrand Delanoe told \$ARG2', '\$ARG2 film maker Claude Chabrol , who helped start the New Wave movement in the 1950s , died on Sunday , aged 80 , an aide to \$ARG1', '\$ARG2 , who helped start the New Wave movement in the 1950s , died on Sunday , aged 80 , an aide to \$ARG1', '\$ARG2 movement in the 1950s , died on Sunday , aged 80 , an aide to 
\$ARG1', '\$ARG2 , died on Sunday , aged 80 , an aide to \$ARG1', '\$ARG2 , aged 80 , an aide to \$ARG1', '\$ARG2 , an aide to \$ARG1', '\$ARG1 mayor \$ARG2', '\$ARG1 mayor Bertrand Delanoe told \$ARG2', '\$ARG2 film maker Claude Chabrol , who helped start the New Wave movement in the 1950s , died on Sunday , aged 80 , an aide to Paris mayor \$ARG1', '\$ARG2 , who helped start the New Wave movement in the 1950s , died on Sunday , aged 80 , an aide to Paris mayor \$ARG1', '\$ARG2 movement in the 1950s , died on Sunday , aged 80 , an aide to Paris mayor \$ARG1', '\$ARG2 , died on Sunday , aged 80 , an aide to Paris mayor \$ARG1', '\$ARG2 , aged 80 , an aide to Paris mayor \$ARG1', '\$ARG2 , an aide to Paris mayor \$ARG1', '\$ARG2 mayor \$ARG1', '\$ARG1 told \$ARG2', '\$ARG2 film maker Claude Chabrol , who helped start the New Wave movement in the 1950s , died on Sunday , aged 80 , an aide to Paris mayor Bertrand Delanoe told \$ARG1', '\$ARG2 , who helped start the New Wave movement in the 1950s , died on Sunday , aged 80 , an aide to Paris mayor Bertrand Delanoe told \$ARG1', '\$ARG2 movement in the 1950s , died on Sunday , aged 80 , an aide to Paris mayor Bertrand Delanoe told \$ARG1', '\$ARG2 , died on Sunday , aged 80 , an aide to Paris mayor Bertrand Delanoe told \$ARG1', '\$ARG2 , aged 80 , an aide to Paris mayor Bertrand Delanoe told \$ARG1', '\$ARG2 , an aide to Paris mayor Bertrand Delanoe told \$ARG1', '\$ARG2 mayor Bertrand Delanoe told \$ARG1', '\$ARG2 told \$ARG1']"
21851,"His main Shiite rival , Abdul Aziz al-Hakim , who heads the Islamic Supreme Council of Iraq , an influential Shiite political party that is part of Maliki 's ruling coalition , has also denounced the plans .",per:religion,"['\$ARG1 rival , \$ARG2', '\$ARG1 rival , Abdul Aziz al-Hakim , who heads \$ARG2', '\$ARG1 rival , Abdul Aziz al-Hakim , who heads the Islamic Supreme Council of Iraq , an influential \$ARG2', '\$ARG1 rival , Abdul Aziz al-Hakim , who heads the Islamic Supreme Council of Iraq , an influential Shiite political party that is part of \$ARG2', '\$ARG2 rival , \$ARG1', '\$ARG1 , who heads \$ARG2', '\$ARG1 , who heads the Islamic Supreme Council of Iraq , an influential \$ARG2', '\$ARG1 , who heads the Islamic Supreme Council of Iraq , an influential Shiite political party that is part of \$ARG2', '\$ARG2 rival , Abdul Aziz al-Hakim , who heads \$ARG1', '\$ARG2 , who heads \$ARG1', '\$ARG1 , an influential \$ARG2', '\$ARG1 , an influential Shiite political party that is part of \$ARG2', '\$ARG2 rival , Abdul Aziz al-Hakim , who heads the Islamic Supreme Council of Iraq , an influential \$ARG1', '\$ARG2 , who heads the Islamic Supreme Council of Iraq , an influential \$ARG1', '\$ARG2 , an influential \$ARG1', '\$ARG1 political party that is part of \$ARG2', '\$ARG2 rival , Abdul Aziz al-Hakim , who heads the Islamic Supreme Council of Iraq , an influential Shiite political party that is part of \$ARG1', '\$ARG2 , who heads the Islamic Supreme Council of Iraq , an influential Shiite political party that is part of \$ARG1', '\$ARG2 , an influential Shiite political party that is part of \$ARG1', '\$ARG2 political party that is part of \$ARG1']"
4237,"Popular Kabul lawmaker Ramazan Bashardost , who camps out in a tent near parliament and campaigned against corruption , attracted 359,214 votes and former World Bank economist Ashraf Ghani 62,536 votes , Najafi added .",per:cities_of_residence,"['\$ARG1 lawmaker \$ARG2', '\$ARG1 lawmaker Ramazan Bashardost , who camps out in a tent near parliament and campaigned against corruption , attracted \$ARG2', '\$ARG1 lawmaker Ramazan Bashardost , who camps out in a tent near parliament and campaigned against corruption , attracted 359,214 votes and former \$ARG2', '\$ARG1 lawmaker Ramazan Bashardost , who camps out in a tent near parliament and campaigned against corruption , attracted 359,214 votes and former World Bank economist \$ARG2', '\$ARG1 lawmaker Ramazan Bashardost , who camps out in a tent near parliament and campaigned against corruption , attracted 359,214 votes and former World Bank economist Ashraf Ghani \$ARG2', '\$ARG1 lawmaker Ramazan Bashardost , who camps out in a tent near parliament and campaigned against corruption , attracted 359,214 votes and former World Bank economist Ashraf Ghani 62,536 votes , \$ARG2', '\$ARG2 lawmaker \$ARG1', '\$ARG1 , who camps out in a tent near parliament and campaigned against corruption , attracted \$ARG2', '\$ARG1 , who camps out in a tent near parliament and campaigned against corruption , attracted 359,214 votes and former \$ARG2', '\$ARG1 , who camps out in a tent near parliament and campaigned against corruption , attracted 359,214 votes and former World Bank economist \$ARG2', '\$ARG1 , who camps out in a tent near parliament and campaigned against corruption , attracted 359,214 votes and former World Bank economist Ashraf Ghani \$ARG2', '\$ARG1 , who camps out in a tent near parliament and campaigned against corruption , attracted 359,214 votes and former World Bank economist Ashraf Ghani 62,536 votes , \$ARG2', '\$ARG2 lawmaker Ramazan Bashardost , who camps out in a tent near parliament and campaigned 
against corruption , attracted \$ARG1', '\$ARG2 , who camps out in a tent near parliament and campaigned against corruption , attracted \$ARG1', '\$ARG1 votes and former \$ARG2', '\$ARG1 votes and former World Bank economist \$ARG2', '\$ARG1 votes and former World Bank economist Ashraf Ghani \$ARG2', '\$ARG1 votes and former World Bank economist Ashraf Ghani 62,536 votes , \$ARG2', '\$ARG2 lawmaker Ramazan Bashardost , who camps out in a tent near parliament and campaigned against corruption , attracted 359,214 votes and former \$ARG1', '\$ARG2 , who camps out in a tent near parliament and campaigned against corruption , attracted 359,214 votes and former \$ARG1', '\$ARG2 votes and former \$ARG1', '\$ARG1 economist \$ARG2', '\$ARG1 economist Ashraf Ghani \$ARG2', '\$ARG1 economist Ashraf Ghani 62,536 votes , \$ARG2', '\$ARG2 lawmaker Ramazan Bashardost , who camps out in a tent near parliament and campaigned against corruption , attracted 359,214 votes and former World Bank economist \$ARG1', '\$ARG2 , who camps out in a tent near parliament and campaigned against corruption , attracted 359,214 votes and former World Bank economist \$ARG1', '\$ARG2 votes and former World Bank economist \$ARG1', '\$ARG2 economist \$ARG1', '\$ARG1 \$ARG2', '\$ARG1 62,536 votes , \$ARG2', '\$ARG2 lawmaker Ramazan Bashardost , who camps out in a tent near parliament and campaigned against corruption , attracted 359,214 votes and former World Bank economist Ashraf Ghani \$ARG1', '\$ARG2 , who camps out in a tent near parliament and campaigned against corruption , attracted 359,214 votes and former World Bank economist Ashraf Ghani \$ARG1', '\$ARG2 votes and former World Bank economist Ashraf Ghani \$ARG1', '\$ARG2 economist Ashraf Ghani \$ARG1', '\$ARG2 \$ARG1', '\$ARG1 votes , \$ARG2', '\$ARG2 lawmaker Ramazan Bashardost , who camps out in a tent near parliament and campaigned against corruption , attracted 359,214 votes and former World Bank economist Ashraf Ghani 62,536 votes , 
\$ARG1', '\$ARG2 , who camps out in a tent near parliament and campaigned against corruption , attracted 359,214 votes and former World Bank economist Ashraf Ghani 62,536 votes , \$ARG1', '\$ARG2 votes and former World Bank economist Ashraf Ghani 62,536 votes , \$ARG1', '\$ARG2 economist Ashraf Ghani 62,536 votes , \$ARG1', '\$ARG2 62,536 votes , \$ARG1', '\$ARG2 votes , \$ARG1']"
3311,"`` Our entire organization grieves at the death of Mike Coolbaugh , '' Rockies president Keli McGregor said .",per:title,"[""\$ARG1 , '' Rockies president \$ARG2"", ""\$ARG2 , '' Rockies president \$ARG1""]"


## Pattern search

Now we look for patterns in each of these subsentences. If there is a match, we add the corresponding sentence to the final DataFrame, together with the pattern that matched and the weak label it received. The gold labels are kept as well for comparison.

In [93]:
def pattern_search(extr_sample, patterns, row):
    for relation, rel_patterns in patterns.iterrows():
        matches = [[row["sample"], extr_sample, pattern, relation, row["label"]] 
                   for pattern in rel_patterns["regex pattern"]
                   if re.match(pattern, extr_sample) is not None]   # todo
        if len(matches) > 0:
            return pd.DataFrame(matches, columns = FINAL_DF_COLUMNS)

In [94]:
all_matches = pd.DataFrame(columns = FINAL_DF_COLUMNS)
for _, row in selected_samples.iterrows():
    for cand_sample in row["extr"]:
        df_found = pattern_search(cand_sample, relation_patterns_df, row)
        if isinstance(df_found, pd.DataFrame) and not df_found.empty:
            all_matches = pd.concat([all_matches, df_found])
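
The matching step can also be tried out in isolation. The following is a minimal sketch without pandas; the helper names `to_regex` and `label_candidate` are hypothetical, only a few of the patterns from above are used, and, unlike `pattern_search`, the sketch collects the matches of all relations instead of returning after the first relation that matches.

```python
import re

ARTICLES = "(A )?(a )?(The )?(the )?"

# a few of the raw patterns from above
PATTERNS = {
    "per:date_of_birth": ["$ARG1 was born in $ARG2", "$ARG1 ( born $ARG2 )"],
    "org:top_members/employees": ["$ARG1 , who heads $ARG2"],
}

def to_regex(raw_pattern):
    # escape the pattern and allow an optional article before each placeholder
    return re.escape(raw_pattern).replace("\\$ARG", ARTICLES + "\\$ARG")

def label_candidate(candidate):
    """Collect every (relation, raw pattern) whose regex matches at the start."""
    return [(relation, raw)
            for relation, raw_patterns in PATTERNS.items()
            for raw in raw_patterns
            if re.match(to_regex(raw), candidate) is not None]

hit = label_candidate("$ARG1 , who heads the $ARG2")
reversed_roles = label_candidate("$ARG2 , who heads $ARG1")
cut_off = label_candidate("$ARG1 ( born $ARG2")
print(hit, reversed_roles, cut_off)
```

Two things are worth noticing in this sketch. A candidate with swapped roles does not match, because all patterns start with $ARG1. And since the extracted subsentences end at the second argument, a pattern that expects text after $ARG2 (such as the closing bracket in "$ARG1 ( born $ARG2 )") finds nothing to match against in such a candidate.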

In [95]:
all_matches.apply(lambda x: escape_dollar(x)).head()

Unnamed: 0,sample,extr_sample,pattern,weak_label,gold_label
0,"His main Shiite rival , Abdul Aziz al-Hakim , who heads the Islamic Supreme Council of Iraq , an influential Shiite political party that is part of Maliki 's ruling coalition , has also denounced the plans .","\$ARG1 , who heads \$ARG2","(A )?(a )?(The )?(the )?\\$ARG1\ ,\ who\ heads\ (A )?(a )?(The )?(the )?\\$ARG2",org:top_members/employees,per:religion
0,"`` It 's an issue for everybody in the state because peanuts are a big part of our economy , '' said Don Koehler , executive director of the Georgia Peanut Commission .","\$ARG1 , executive director of \$ARG2","(A )?(a )?(The )?(the )?\\$ARG1\ ,\ executive\ director\ of\ (A )?(a )?(The )?(the )?\\$ARG2",org:top_members/employees,org:top_members/employees
0,"Water and its links to development , peace and conflict were key words in the annual sessions , Anders Berntell , executive director of Stockholm International Water Institute ( SIWI ) , said in his opening address .","\$ARG1 , executive director of \$ARG2","(A )?(a )?(The )?(the )?\\$ARG1\ ,\ executive\ director\ of\ (A )?(a )?(The )?(the )?\\$ARG2",org:top_members/employees,org:alternate_names
0,"Gwathmey was born in 1938 , the only child of painter Robert Gwathmey and his wife , Rosalie , a photographer .",\$ARG1 was born in \$ARG2,(A )?(a )?(The )?(the )?\\$ARG1\ was\ born\ in\ (A )?(a )?(The )?(the )?\\$ARG2,per:date_of_birth,per:children
0,"A professor emeritus at Yale University , Mandelbrot was born in Poland but as a child moved with his family to France where he was educated .",\$ARG1 was born in \$ARG2,(A )?(a )?(The )?(the )?\\$ARG1\ was\ born\ in\ (A )?(a )?(The )?(the )?\\$ARG2,per:date_of_birth,per:employee_of


In [96]:
all_matches.size

105

Note that `size` counts cells (rows times columns), so 105 corresponds to 21 matches with 5 columns each. Among them we can observe some misclassified sentences: for example, the sentence 

"A professor emeritus at Yale University , Mandelbrot was born in Poland but as a child moved with his family to France where he was educated" 

was assigned the label "per:date_of_birth" (by the pattern _"\\$ARG1 was born in \\$ARG2"_), which is clearly wrong. In order to avoid such mistakes, let's add constraints on the argument types.

In [97]:
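# NOTE: both arguments of org:alternate_names are organisations, so ['ORG', 'ORG'] may be
# the intended constraint; the outputs below were produced with ['PERSON', 'PERSON'].
relation_to_types = {"org:alternate_names": ['PERSON', 'PERSON'], 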
                     "per:date_of_birth": ['PERSON', 'DATE'],
                     "org:top_members/employees": ['PERSON', 'ORG']}

So, when we look for these patterns in the samples, we also take into account the entity types that the corresponding relation requires.

In [98]:
relation_patterns_df["type"] = pd.Series(relation_to_types)

In [99]:
relation_patterns_df.apply(lambda x: escape_dollar(x)).head()

Unnamed: 0,raw pattern,regex pattern,type
org:alternate_names,"['\$ARG1 ( \$ARG2 ', '\$ARG1 formerly known as \$ARG2', '\$ARG1 aka \$ARG2', '\$ARG1 ( also known as \$ARG2 )']","['(A )?(a )?(The )?(the )?\\\$ARG1\\ \\(\\ (A )?(a )?(The )?(the )?\\\$ARG2\\ ', '(A )?(a )?(The )?(the )?\\\$ARG1\\ formerly\\ known\\ as\\ (A )?(a )?(The )?(the )?\\\$ARG2', '(A )?(a )?(The )?(the )?\\\$ARG1\\ aka\\ (A )?(a )?(The )?(the )?\\\$ARG2', '(A )?(a )?(The )?(the )?\\\$ARG1\\ \\(\\ also\\ known\\ as\\ (A )?(a )?(The )?(the )?\\\$ARG2\\ \\)']","['PERSON', 'PERSON']"
per:date_of_birth,"['\$ARG1 ( born \$ARG2 )', '\$ARG1 ( born \$ARG2 in', '\$ARG1 ( \$ARG2 -', '\$ARG1 was born in \$ARG2']","['(A )?(a )?(The )?(the )?\\\$ARG1\\ \\(\\ born\\ (A )?(a )?(The )?(the )?\\\$ARG2\\ \\)', '(A )?(a )?(The )?(the )?\\\$ARG1\\ \\(\\ born\\ (A )?(a )?(The )?(the )?\\\$ARG2\\ in', '(A )?(a )?(The )?(the )?\\\$ARG1\\ \\(\\ (A )?(a )?(The )?(the )?\\\$ARG2\\ \\-', '(A )?(a )?(The )?(the )?\\\$ARG1\\ was\\ born\\ in\\ (A )?(a )?(The )?(the )?\\\$ARG2']","['PERSON', 'DATE']"
org:top_members/employees,"['\$ARG1 , executive director of \$ARG2', '\$ARG1 , head of \$ARG2', '\$ARG1 , who heads \$ARG2', '\$ARG1 , chief executive of \$ARG2']","['(A )?(a )?(The )?(the )?\\\$ARG1\\ ,\\ executive\\ director\\ of\\ (A )?(a )?(The )?(the )?\\\$ARG2', '(A )?(a )?(The )?(the )?\\\$ARG1\\ ,\\ head\\ of\\ (A )?(a )?(The )?(the )?\\\$ARG2', '(A )?(a )?(The )?(the )?\\\$ARG1\\ ,\\ who\\ heads\\ (A )?(a )?(The )?(the )?\\\$ARG2', '(A )?(a )?(The )?(the )?\\\$ARG1\\ ,\\ chief\\ executive\\ of\\ (A )?(a )?(The )?(the )?\\\$ARG2']","['PERSON', 'ORG']"


We get the entity types from the spaCy annotations as well and build a new format for the extracted subsentences: each one is now not a plain string but a dictionary in the format {subsentence: entity types}.

In [100]:
def get_arg_types(sample):
    return [{ARG1 + sample["text"][ent1["end"]:ent2["start"]] + ARG2 : [ent1["label"], ent2["label"]]}
            if ent1["end"] < ent2["end"]
            else {(ARG2 + sample["text"][ent2["end"]:ent1["start"]] + ARG1) : [ent2["label"], ent1["label"]]}
            for ent1, ent2 in itertools.permutations(sample["ents"],2)]

In [101]:
selected_samples["extr_with_types"] = selected_samples["spacy info"].apply(lambda x: get_arg_types(x))

We do a similar pattern search, but now taking into account entity types.

In [102]:
def pattern_search_filter_types(cand_sample_with_types, patterns, row):
    for relation, rel_patterns in patterns.iterrows():
        cand_sample_types = [item for type_pair in cand_sample_with_types.values() for item in type_pair]
        if rel_patterns["type"] == cand_sample_types:
            sample = list(cand_sample_with_types.keys())[0]
            matches = [[row["sample"], sample, pattern, relation, row["label"]] 
                       for pattern in rel_patterns["regex pattern"]
                       if re.match(pattern, sample) is not None]   # todo
            if len(matches) > 0:
                return pd.DataFrame(matches, columns = FINAL_DF_COLUMNS)
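
The effect of the type constraint can be illustrated on shortened versions of the two "was born in" sentences from above. This is a minimal sketch under several assumptions: the documents are hand-written stand-ins for `Doc.to_json()`, the entity labels (in particular GPE for "Poland") are assumed rather than produced by spaCy, the helper names are hypothetical, and entity pairs are taken in text order only instead of as permutations.

```python
import itertools
import re

ARG1, ARG2 = "$ARG1", "$ARG2"
EXPECTED_TYPES = {"per:date_of_birth": ["PERSON", "DATE"]}
BORN_IN = re.escape("$ARG1 was born in $ARG2")

def make_doc(text, entities):
    """Hand-written stand-in for the fields of Doc.to_json() used above."""
    ents = []
    for surface, label in entities:
        start = text.index(surface)
        ents.append({"start": start, "end": start + len(surface), "label": label})
    return {"text": text, "ents": ents}

def typed_candidates(doc):
    """Yield (subsentence, [type of $ARG1, type of $ARG2]) for entity pairs in text order."""
    ordered = sorted(doc["ents"], key=lambda ent: ent["start"])
    for first, second in itertools.combinations(ordered, 2):
        between = doc["text"][first["end"]:second["start"]]
        yield ARG1 + between + ARG2, [first["label"], second["label"]]

def weak_labels(doc):
    # a relation is assigned only if both the entity types and the pattern fit
    return [relation
            for candidate, types in typed_candidates(doc)
            for relation, expected in EXPECTED_TYPES.items()
            if types == expected and re.match(BORN_IN, candidate) is not None]

date_doc = make_doc("Gwathmey was born in 1938", [("Gwathmey", "PERSON"), ("1938", "DATE")])
place_doc = make_doc("Mandelbrot was born in Poland", [("Mandelbrot", "PERSON"), ("Poland", "GPE")])
print(weak_labels(date_doc), weak_labels(place_doc))
```

With the type check in place, the subsentence about a birth place no longer receives the date-of-birth label, while the one with a date still does.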

In [103]:
all_matches_filter_types = pd.DataFrame(columns = FINAL_DF_COLUMNS)
for _, row in selected_samples.iterrows():
    for cand_sample_with_types in row["extr_with_types"]:
        df_found = pattern_search_filter_types(cand_sample_with_types, relation_patterns_df, row)
        if isinstance(df_found, pd.DataFrame) and not df_found.empty:
            all_matches_filter_types = pd.concat([all_matches_filter_types, df_found])

In [104]:
all_matches_filter_types.apply(lambda x: escape_dollar(x))

Unnamed: 0,sample,extr_sample,pattern,weak_label,gold_label
0,"His main Shiite rival , Abdul Aziz al-Hakim , who heads the Islamic Supreme Council of Iraq , an influential Shiite political party that is part of Maliki 's ruling coalition , has also denounced the plans .","\$ARG1 , who heads \$ARG2","(A )?(a )?(The )?(the )?\\$ARG1\ ,\ who\ heads\ (A )?(a )?(The )?(the )?\\$ARG2",org:top_members/employees,per:religion
0,"`` It 's an issue for everybody in the state because peanuts are a big part of our economy , '' said Don Koehler , executive director of the Georgia Peanut Commission .","\$ARG1 , executive director of \$ARG2","(A )?(a )?(The )?(the )?\\$ARG1\ ,\ executive\ director\ of\ (A )?(a )?(The )?(the )?\\$ARG2",org:top_members/employees,org:top_members/employees
0,"Water and its links to development , peace and conflict were key words in the annual sessions , Anders Berntell , executive director of Stockholm International Water Institute ( SIWI ) , said in his opening address .","\$ARG1 , executive director of \$ARG2","(A )?(a )?(The )?(the )?\\$ARG1\ ,\ executive\ director\ of\ (A )?(a )?(The )?(the )?\\$ARG2",org:top_members/employees,org:alternate_names
0,"Gwathmey was born in 1938 , the only child of painter Robert Gwathmey and his wife , Rosalie , a photographer .",\$ARG1 was born in \$ARG2,(A )?(a )?(The )?(the )?\\$ARG1\ was\ born\ in\ (A )?(a )?(The )?(the )?\\$ARG2,per:date_of_birth,per:children
0,"Water and its links to development , peace and conflict were key words in the annual sessions , Anders Berntell , executive director of Stockholm International Water Institute ( SIWI ) , said in his opening address .","\$ARG1 , executive director of \$ARG2","(A )?(a )?(The )?(the )?\\$ARG1\ ,\ executive\ director\ of\ (A )?(a )?(The )?(the )?\\$ARG2",org:top_members/employees,org:top_members/employees
0,"`` At the moment the water issue does n't get enough attention in the climate negotiations , '' Anders Berntell , head of the Stockholm International Water Institute , told The Associated Press .","\$ARG1 , head of \$ARG2","(A )?(a )?(The )?(the )?\\$ARG1\ ,\ head\ of\ (A )?(a )?(The )?(the )?\\$ARG2",org:top_members/employees,org:top_members/employees
0,"BEIJING , Dec 23 ( Xinhua ) Net profit of the central SOEs totaled 80226 billion yuan ( 12073 billion US dollars ) in the January-to-November period , said Wang Yong , head of the State - owned Assets Supervision and Administration Commission ( SASAC ) .","\$ARG1 , head of the \$ARG2","(A )?(a )?(The )?(the )?\\$ARG1\ ,\ head\ of\ (A )?(a )?(The )?(the )?\\$ARG2",org:top_members/employees,org:top_members/employees
0,"Gwathmey was born in 1938 , the only child of painter Robert Gwathmey and his wife , Rosalie , a photographer .",\$ARG1 was born in \$ARG2,(A )?(a )?(The )?(the )?\\$ARG1\ was\ born\ in\ (A )?(a )?(The )?(the )?\\$ARG2,per:date_of_birth,per:date_of_birth
0,"`` It is difficult to put one country and its citizens in the same boat ... some people from Iran have very legitimate interests in Europe and are very trustworthy , '' said Pierre Mirabaud , head of the Swiss Bankers Association .","\$ARG1 , head of \$ARG2","(A )?(a )?(The )?(the )?\\$ARG1\ ,\ head\ of\ (A )?(a )?(The )?(the )?\\$ARG2",org:top_members/employees,org:top_members/employees
0,"`` The conference is like the science Olympics , '' said Peter Agre , head of the American Association for the Advancement of Sciece ( AAAS ) , which is hosting the meeting in California as the Winter Olympics were in full swing in Vancouver , Canada .","\$ARG1 , head of \$ARG2","(A )?(a )?(The )?(the )?\\$ARG1\ ,\ head\ of\ (A )?(a )?(The )?(the )?\\$ARG2",org:top_members/employees,org:top_members/employees


In [105]:
all_matches_filter_types.size

65

## Input matrices creation

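So now we have a smaller but much cleaner dataset. Note that `size` counts cells (rows x columns), so the 65 above corresponds to 13 matched rows of 5 columns each, compared to the 120 reported for the search without argument-type filtering. By the way: we have just done our first weakly supervised data preprocessing :)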

Some weak labels still differ from the gold labels, but mostly this happens because the two relations hold in different parts of the same sentence. For example, the sentence "Gwathmey was born in 1938 , the only child of painter Robert Gwathmey and his wife , Rosalie , a photographer." definitely contains both relations, "per:date_of_birth" and "per:children", so our weak label is still correct.
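
A quick way to quantify this is to compare weak and gold labels row by row. The sketch below uses only the ten (weak label, gold label) pairs displayed in the table above, typed in by hand; the variable names are illustrative.

```python
from collections import Counter

# (weak_label, gold_label) pairs copied from the ten rows displayed above.
pairs = [
    ("org:top_members/employees", "per:religion"),
    ("org:top_members/employees", "org:top_members/employees"),
    ("org:top_members/employees", "org:alternate_names"),
    ("per:date_of_birth", "per:children"),
    ("org:top_members/employees", "org:top_members/employees"),
    ("org:top_members/employees", "org:top_members/employees"),
    ("org:top_members/employees", "org:top_members/employees"),
    ("per:date_of_birth", "per:date_of_birth"),
    ("org:top_members/employees", "org:top_members/employees"),
    ("org:top_members/employees", "org:top_members/employees"),
]

# Share of rows where the weak label equals the gold label.
agreement = sum(weak == gold for weak, gold in pairs) / len(pairs)
# Which (weak, gold) combinations disagree, and how often.
disagreements = Counter((weak, gold) for weak, gold in pairs if weak != gold)

print(f"agreement: {agreement:.2f}")
for (weak, gold), count in disagreements.items():
    print(f"weak={weak} gold={gold}: {count}")
```

On the full DataFrame the same number could be obtained by comparing the `weak_label` and `gold_label` columns directly, for example with `(df["weak_label"] == df["gold_label"]).mean()`.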

The final step is to create the input matrices needed for the knodle framework, namely:

- X: input_features (sentence x features): containing information about the data samples
- Z: rule_matches (sentence x labelling functions): in our case the labelling functions are patterns, and this binary matrix records which patterns match each sentence
- T: mapping_rules_labels (labelling functions x classes): which pattern corresponds to which class
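
One way to see how Z and T fit together: multiplying them gives, for each sentence, the number of matching patterns that vote for each class. The following is a minimal pure-Python sketch with made-up toy matrices. It illustrates the shapes only and is not knodle's own code; the helper `matmul` is hypothetical.

```python
# Toy example: 3 sentences, 4 patterns (labelling functions), 2 classes.
# Z[i][j] == 1 if pattern j matched sentence i.
Z = [
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],  # no pattern matched this sentence
]
# T[j][k] == 1 if pattern j stands for class k.
T = [
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
]

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

votes = matmul(Z, T)  # shape: sentences x classes
print(votes)
```

A row of zeros in `votes` marks a sentence that no pattern covers, which is why only the matched sentences end up in our weakly annotated set.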

We will not discuss the encoding of the data samples or the creation of the X (input_features) matrix here, but let's construct the Z (rule_matches) and T (mapping_rules_labels) matrices.

In [62]:
def z_matrix_creation(all_matches_filter_types, relation_patterns_df):

    # create an empty matrix with one column per raw pattern, plus a leading "sample" column
    sentences_lf_df_columns = [item for relation in relation_patterns_df["raw pattern"] for item in relation]
    rule_matches = pd.DataFrame(columns=escape_dollar(sentences_lf_df_columns))
    rule_matches.insert(0, "sample", all_matches_filter_types["sample"])

    for _, row in all_matches_filter_types.iterrows():   # iterate over rows of found samples
        # access by column name: positional row[0] on a labelled Series is deprecated in recent pandas
        curr_sample, curr_pattern, curr_relation = row["sample"], row["pattern"], row["weak_label"]
        # find the index of the current regex pattern in the pattern list of its relation
        pattern_index = list(relation_patterns_df.loc[curr_relation, "regex pattern"]).index(curr_pattern)
        raw_pattern = relation_patterns_df.loc[curr_relation, "raw pattern"][pattern_index]   # take the corresponding raw pattern
        rule_matches.loc[(rule_matches["sample"] == curr_sample), escape_dollar([raw_pattern])] = 1   # put 1 at the intersection of sample and pattern in the Z matrix

    return rule_matches.fillna(0)

rule_matches = z_matrix_creation(all_matches_filter_types, relation_patterns_df)

In [63]:
rule_matches.head()

Unnamed: 0,sample,\$ARG1 ( \$ARG2,\$ARG1 formerly known as \$ARG2,\$ARG1 aka \$ARG2,\$ARG1 ( also known as \$ARG2 ),\$ARG1 ( born \$ARG2 ),\$ARG1 ( born \$ARG2 in,\$ARG1 ( \$ARG2 -,\$ARG1 was born in \$ARG2,"\$ARG1 , executive director of \$ARG2","\$ARG1 , head of \$ARG2","\$ARG1 , who heads \$ARG2","\$ARG1 , chief executive of \$ARG2"
0,"His main Shiite rival , Abdul Aziz al-Hakim , who heads the Islamic Supreme Council of Iraq , an influential Shiite political party that is part of Maliki 's ruling coalition , has also denounced the plans .",0,0,0,0,0,0,0,0,0,0,1,0
0,"`` It 's an issue for everybody in the state because peanuts are a big part of our economy , '' said Don Koehler , executive director of the Georgia Peanut Commission .",0,0,0,0,0,0,0,0,1,0,0,0
0,"Water and its links to development , peace and conflict were key words in the annual sessions , Anders Berntell , executive director of Stockholm International Water Institute ( SIWI ) , said in his opening address .",0,0,0,0,0,0,0,0,1,0,0,0
0,"Gwathmey was born in 1938 , the only child of painter Robert Gwathmey and his wife , Rosalie , a photographer .",0,0,0,0,0,0,0,1,0,0,0,0
0,"Water and its links to development , peace and conflict were key words in the annual sessions , Anders Berntell , executive director of Stockholm International Water Institute ( SIWI ) , said in his opening address .",0,0,0,0,0,0,0,0,1,0,0,0


In [76]:
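# one column per relation, holding its (escaped) raw patterns
mapping_rules_labels = pd.concat([pd.Series(escape_dollar(v), name=k).astype(str) for k, v in relation_patterns.items()], axis=1)
# one-hot encode the patterns and aggregate per relation;
# groupby(level=1).sum() replaces sum(level=1), which was removed in pandas 2.0
mapping_rules_labels = pd.get_dummies(mapping_rules_labels.stack()).groupby(level=1).sum().clip(upper=1)
# rows are relations (classes), columns are patterns; use .T for a patterns x classes layout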

In [77]:
mapping_rules_labels

Unnamed: 0,\$ARG1 ( \$ARG2,\$ARG1 ( \$ARG2 -,\$ARG1 ( also known as \$ARG2 ),\$ARG1 ( born \$ARG2 ),\$ARG1 ( born \$ARG2 in,"\$ARG1 , chief executive of \$ARG2","\$ARG1 , executive director of \$ARG2","\$ARG1 , head of \$ARG2","\$ARG1 , who heads \$ARG2",\$ARG1 aka \$ARG2,\$ARG1 formerly known as \$ARG2,\$ARG1 was born in \$ARG2
org:alternate_names,1,0,1,0,0,0,0,0,0,1,1,0
per:date_of_birth,0,1,0,1,1,0,0,0,0,0,0,1
org:top_members/employees,0,0,0,0,0,1,1,1,1,0,0,0
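
The same mapping can be built without pandas, which makes its structure easy to check. The sketch below takes the relation-to-pattern assignment read off the output above, builds T in the labelling functions x classes orientation, and verifies that each pattern belongs to exactly one class. The names `toy_relation_patterns` and `build_t` are hypothetical.

```python
# Relation -> raw patterns, as read off the table above.
toy_relation_patterns = {
    "org:alternate_names": [
        "$ARG1 ( $ARG2",
        "$ARG1 ( also known as $ARG2 )",
        "$ARG1 aka $ARG2",
        "$ARG1 formerly known as $ARG2",
    ],
    "per:date_of_birth": [
        "$ARG1 ( $ARG2 -",
        "$ARG1 ( born $ARG2 )",
        "$ARG1 ( born $ARG2 in",
        "$ARG1 was born in $ARG2",
    ],
    "org:top_members/employees": [
        "$ARG1 , chief executive of $ARG2",
        "$ARG1 , executive director of $ARG2",
        "$ARG1 , head of $ARG2",
        "$ARG1 , who heads $ARG2",
    ],
}

def build_t(relation_patterns):
    """Return (rules, classes, T) where T[i][j] == 1 if rule i maps to class j."""
    classes = list(relation_patterns)
    rules = sorted({p for pats in relation_patterns.values() for p in pats})
    t = [[int(rule in relation_patterns[cls]) for cls in classes] for rule in rules]
    return rules, classes, t

rules, classes, T = build_t(toy_relation_patterns)
print(len(rules), "rules x", len(classes), "classes")
# Every pattern should vote for exactly one relation.
print(all(sum(row) == 1 for row in T))
```

Keeping the row order of T identical to the column order of Z matters: the two matrices are only meaningful together if pattern `j` refers to the same pattern in both.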


## Finish

Now we have not only a weakly supervised dataset, but also appropriate input for the knodle framework. What remains is to choose text features, convert them to a matrix as well, and pass all three matrices to knodle for denoising and model training. Good luck! :)