# Week 3, Lesson 5, Activity 11: End-to-end IE application

&copy;2021, Ekaterina Kochmar \
(revised: Nadejda Roubtsova, June 2022)

Your task in this activity is to:

- Implement all the steps discussed in the lecture and apply your IE algorithm to the set of sentences and questions provided with this notebook.

## Step 1: Understand the task

In this task you will aim to extract actions and corresponding participants from raw text in order to get answers to queries such as "Who did Joe Biden meet with?" \
Let us start with the following simpler case of preprocessed data. Can you programmatically extract participant 2 (`p2`) from the data tuples in `meetings` given that participant 1 (`p1`) is Joe Biden?

In [None]:
meetings = [('Boris Johnson', 'meets with', 'the Queen'),
            ('Joe Biden', 'meets with', 'his cabinet'),
            ('administration', 'meets with', 'tech giants'),
            ('the Queen', 'meets with', 'the Prime Minister'),
            ('Joe Biden', 'meets with', 'Russian President')]
# query: Who does Joe Biden meet with?
answer = [# enter code here: p2 for tuples in meetings if p1=='Joe Biden'
         ]
print(answer)
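
One possible way to fill in the cell above is a list comprehension that unpacks each tuple and keeps `p2` whenever `p1` matches. This is a sketch of one solution rather than the only acceptable answer; the names `p1`, `action` and `p2` simply follow the wording of the task.

```python
meetings = [('Boris Johnson', 'meets with', 'the Queen'),
            ('Joe Biden', 'meets with', 'his cabinet'),
            ('administration', 'meets with', 'tech giants'),
            ('the Queen', 'meets with', 'the Prime Minister'),
            ('Joe Biden', 'meets with', 'Russian President')]

# query: Who does Joe Biden meet with?
# Unpack each (p1, action, p2) tuple and keep p2 when p1 is the person asked about.
answer = [p2 for (p1, action, p2) in meetings if p1 == 'Joe Biden']
print(answer)  # ['his cabinet', 'Russian President']
```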

Since "meeting" is a mutual action, a participant may appear on the right or on the left. Let's make sure both cases are covered. Try another query:

In [None]:
# query: Who did the Queen meet with?
answer   = [# enter code here: p2 for tuples in meetings if p1=='the Queen'
           ]
answer +=  [# enter code here: p1 for tuples in meetings if p2=='the Queen'
           ]
print(answer)
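
Because the same two-sided lookup is needed for every query about a mutual action, it may help to wrap it in a small function. The sketch below is one way to do that; `who_met` is a hypothetical helper name that is not part of the notebook, and it lists matches with the person on the left before matches with the person on the right, mirroring the cell above.

```python
def who_met(meetings, person):
    """Return everyone who appears opposite `person` in a (p1, action, p2) tuple."""
    # person on the left of the action: the partner is p2
    partners = [p2 for (p1, _, p2) in meetings if p1 == person]
    # person on the right of the action: the partner is p1
    partners += [p1 for (p1, _, p2) in meetings if p2 == person]
    return partners


meetings = [('Boris Johnson', 'meets with', 'the Queen'),
            ('Joe Biden', 'meets with', 'his cabinet'),
            ('administration', 'meets with', 'tech giants'),
            ('the Queen', 'meets with', 'the Prime Minister'),
            ('Joe Biden', 'meets with', 'Russian President')]

print(who_met(meetings, 'the Queen'))  # ['the Prime Minister', 'Boris Johnson']
```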

## Step 2: Get more practice with NLP using spaCy

Apply `spaCy`'s `nlp` pipeline to some input text. Do you agree with the output? For example, are all the PoS tags assigned to these words correct?

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Beware the Jabberwock, my son! The jaws that bite, the claws that catch! " +
          "Beware the Jubjub bird, and shun The frumious Bandersnatch!")

rows = []
rows.append(["Word", "Position", "Lowercase", "Lemma", "POS", "Alphanumeric", "Stopword"])
for token in doc:
    rows.append([token.text, str(token.i), token.lower_, token.lemma_, 
                 token.pos_, str(token.is_alpha), str(token.is_stop)])

columns = zip(*rows)
column_widths = [max(len(item) for item in col) for col in columns]
for row in rows:
    print(''.join(' {:{width}} '.format(row[i], width=column_widths[i]) 
                  for i in range(0, len(row))))

Let's identify all noun phrases in a sentence. Is this output correct?

In [None]:
doc = nlp("On Friday, board members meet with senior managers " +
          "to discuss future development of the company.")

for chunk in doc.noun_chunks:
    # here you return the noun phrase, its head (main) noun, the dependency relation linking
    # this noun to its own head in the parse tree, and the head of the noun in the parse tree
    print('\t'.join([chunk.text, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text]))

Let's look at all the dependencies for all words in a sentence:

In [None]:
for token in doc:
    print(# the token itself, the dependency relation for the token,
          # the head to which the token is linked, the head's PoS,
          # the list of all dependents of the token (token.children)
          )
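
One possible way to fill in the `print` call is to join the requested fields with tabs, as in the noun-chunk cell. The sketch below puts that into a hypothetical helper, `describe_token`, which is not part of the notebook and relies only on the `text`, `dep_`, `head`, `pos_` and `children` attributes of a spaCy token.

```python
def describe_token(token):
    """Format one token's dependency information as a tab-separated line."""
    # token.children is iterable; collect the text of each direct dependent
    children = [child.text for child in token.children]
    return '\t'.join([token.text,        # the token itself
                      token.dep_,        # the dependency relation for the token
                      token.head.text,   # the head to which the token is linked
                      token.head.pos_,   # the head's PoS
                      str(children)])    # the list of all dependents of the token
```

With the `doc` from the previous cell, `for token in doc: print(describe_token(token))` prints one line per token.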

## Step 3: Extract information

Here is how you can identify the participants of a particular action (e.g., when the action is expressed with "meet"):

In [None]:
# initialise the results first so that the prints below work
# even if the sentence contains no matching verb
action = ""
participant1 = ""
participant2 = ""
for token in doc:
    # the main verb (ROOT) of the sentence must be a form of "meet"
    if token.lemma_=="meet" and token.pos_=="VERB" and token.dep_=="ROOT":
        action = token.text
        children = [child for child in token.children]
        for child1 in children:
            # the subject of the verb is the first participant
            if child1.dep_=="nsubj":
                participant1 = " ".join([attr.text for 
                                         attr in child1.children]) + " " + child1.text
            # "with" attached to the verb introduces the second participant
            elif child1.text=="with":
                action += " " + child1.text
                child1_children = [child for child in child1.children]
                for child2 in child1_children:
                    if child2.pos_ == "NOUN":
                        participant2 = " ".join([attr.text for 
                                             attr in child2.children]) + " " + child2.text
print(f"Participant1 = {participant1}")
print(f"Action = {action}")
print(f"Participant2 = {participant2}")

Now let's improve this code so that it can deal with different formats of the expression, and apply it to various sentences:

In [None]:
sentences = ["On Friday, board members meet with senior managers " +
             "to discuss future development of the company.", 
             "Boris Johnson met with the Queen last week.",
             "Joe Biden meets the Queen at Buckingham Palace.",
             "The two leaders also posed for photographs and " +
             "the President talked to reporters."]

def extract_information(doc):
    action=""
    participant1 = ""
    participant2 = ""
    for token in doc:
        if # check that the token's lemma is "meet", its PoS is VERB and 
           # it's the ROOT of the whole sentence (i.e., the main verb)
            action = token.text
            children = [child for child in token.children]   
            for child1 in children:
                if # check that child1 is a subject (i.e., it's related to the verb with the 'nsubj' relation)
                    participant1 = " ".join([attr.text for 
                                             attr in child1.children]) + " " + child1.text
                elif child1.text=="with": # i.e., if the verb attaches "with" (as in "meet with")
                    action += " " + child1.text
                    child1_children = [child for child in child1.children]
                    for child2 in child1_children:
                        if # check that child2 is either a common noun (NOUN, e.g., "members") 
                        # or a proper noun (PROPN, e.g., "Joe")
                            participant2 = " ".join([attr.text for 
                                                 attr in child2.children]) + " " + child2.text
                elif # check if child1 is a direct object (the dependency relation is 'dobj')
                # and it is either a common or a proper noun
                    participant2 = " ".join([attr.text for 
                                             attr in child1.children]) + " " + child1.text
    print(f"Participant1 = {participant1}")
    print(f"Action = {action}")
    print(f"Participant2 = {participant2}")

for sent in sentences:
    print(f"\nSentence = {sent}")
    doc = nlp(sent)
    extract_information(doc)
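
For reference, here is one possible way to complete the skeleton; it is a sketch, not the only valid solution. It differs from the skeleton in two deliberate ways. First, it returns a `(participant1, action, participant2)` tuple instead of printing, so that results have the same shape as the `meetings` tuples from Step 1. Second, it uses a hypothetical helper, `phrase`, which avoids the leading space that `" ".join(...) + " " + text` produces when a noun has no dependents. The names `phrase` and `extract_meeting` are assumptions made for this example; the `nsubj` and `dobj` labels are the ones named in the skeleton's comments.

```python
def phrase(token):
    """Join a token's direct dependents and the token itself into one string."""
    words = [child.text for child in token.children] + [token.text]
    return " ".join(words)


def extract_meeting(doc):
    """Return (participant1, action, participant2) for a sentence whose main verb is 'meet'."""
    action, participant1, participant2 = "", "", ""
    for token in doc:
        # the main verb (ROOT) of the sentence must be a form of "meet"
        if token.lemma_ == "meet" and token.pos_ == "VERB" and token.dep_ == "ROOT":
            action = token.text
            for child1 in token.children:
                if child1.dep_ == "nsubj":
                    # the subject of the verb is the first participant
                    participant1 = phrase(child1)
                elif child1.text == "with":
                    # "meet with X": X is a noun attached to "with"
                    action += " " + child1.text
                    for child2 in child1.children:
                        if child2.pos_ in ("NOUN", "PROPN"):
                            participant2 = phrase(child2)
                elif child1.dep_ == "dobj" and child1.pos_ in ("NOUN", "PROPN"):
                    # "meet X": X is the direct object of the verb
                    participant2 = phrase(child1)
    return participant1, action, participant2
```

Collecting `extract_meeting(nlp(sent))` for each `sent` in `sentences` gives a list of tuples that the Step 1 queries can run over. A sentence without a matching verb yields three empty strings, so such tuples may need to be filtered out first.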