# Assignment 3 - CT5120/CT5146

### Instructions:
- Complete all the tasks below and upload your submission as a Python notebook on Blackboard with the filename “`StudentID_Lastname.ipynb`” before **23:59** on **December 31, 2021**. Please note that there will be no further extensions to this deadline and we highly encourage you to submit this assignment before Semester 1 exams.
- This is an individual assignment; you **must not** work with other students to complete it.
- The assignment is worth $100$ marks and constitutes 19% of the final grade. The breakdown of the marking scheme for each task is as follows:

|           | Task | Marks |
| :---      | :-----| -----:|
| Task 1    | Pre-processing |   15 |
| Task 2    | Named Entity Recognition |    10 |
| Task 3    | Information / Relation Extraction (I) | 30 |
| Task 4    | Information / Relation Extraction (II) | 15 |
| Task 5    | Combining information in the output   | 5 |
| Task 6    | Evaluation (I) | 15 |
| Task 7    | Evaluation (II) | 10 |



---

## Information Extraction and Relation Extraction

In the following tasks you will write code to perform **_information extraction_** and **_relation extraction_** across a collection of documents in `movies.zip`.

The zip archive contains 100 files: 50 are plaintext documents and the other 50 contain data structured as JSON.
Each plaintext document contains a text description of a movie taken from the English version of Wikipedia, while each JSON document contains *gold-standard* labels (also called *reference* labels) stored as key-value pairs for the entities and relations for each document.

You are only allowed to use the given documents and labels and **must not** use any other external sources of data for this assignment.

---

Download and unarchive `movies.zip` from Blackboard and place it in the same location as this notebook. Alternatively, run the code cell below to download the archive and extract it into a directory called `movies` in the same location as this notebook.

In [1]:
!if test -f "movies.zip"; then rm "movies.zip"; fi
!if test -d "movies/"; then rm -rf "movies/"; fi
!wget "https://drive.google.com/uc?export=download&id=1L6NcSGkubNJaL6xSnYEZZKSrlyXq1AbB" -O "movies.zip"
!unzip "movies.zip"


---

## Reading Data

Place the unzipped `movies` directory in the same location as this notebook and run the following code cell to read the plaintext and JSON documents.

In [2]:
######### DO NOT EDIT THIS CELL #########

import os
import json

documents = []   # store the text documents as a list of strings
labels = []      # store the gold-standard labels as a list of dictionaries

for idx in range(50):
  with open(os.path.join('movies', str(idx+1).zfill(2) + '.doc.txt')) as f:
    doc = f.read().strip()
  with open(os.path.join('movies', str(idx+1).zfill(2) + '.info.json')) as f:
    label = json.load(f)

  documents.append(doc)
  labels.append(label)

assert len(documents) == 50
assert len(labels) == 50

---

In [3]:
# Load the libraries which might be useful

import re
import nltk
nltk.download('all', quiet=True)

True

---

## Task 1: Document Pre-processing (15 Marks)
Write a function that takes a document and returns a list of sentences with part-of-speech tags.

The expected output is a list of tagged sentences where each tagged sentence is a list containing `(token, tag)` pairs.


In [4]:
def ie_preprocess(document):
  '''Return a list of sentences tagged with part-of-speech tags for the given document.'''

  tagged_sentences = []


  # Step 1: Sentence segmentation.
  sentences = nltk.sent_tokenize(document)
  
  # Step 2: Tokenize sentences into words.
  tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]
  
  # Step 3: POS tagging.
  tagged_sentences = [nltk.pos_tag(sent) for sent in tokenized_sentences]
  
    
  
  return tagged_sentences

Run the cell below to check if the output is formatted correctly.

Expected output: `[('It', 'PRP'), ('received', 'VBD'), ('ten', 'JJ'), ('Oscar', 'NNP'), ('nominations', 'NNS'), ('(', '('), ('including', 'VBG'), ('Best', 'NNP'), ('Picture', 'NN'), (')', ')'), (',', ','), ('winning', 'VBG'), ('seven', 'CD'), ('.', '.')]`

In [5]:
# check output for Task 1
ie_preprocess(documents[0])[-10]

[('It', 'PRP'),
 ('received', 'VBD'),
 ('ten', 'JJ'),
 ('Oscar', 'NNP'),
 ('nominations', 'NNS'),
 ('(', '('),
 ('including', 'VBG'),
 ('Best', 'NNP'),
 ('Picture', 'NN'),
 (')', ')'),
 (',', ','),
 ('winning', 'VBG'),
 ('seven', 'CD'),
 ('.', '.')]

## Task 2: Named Entity Recognition (10 Marks)

Write a function that returns a list of all the named entities in a given document. The document here is structured as a list of sentences and tagged with part-of-speech tags.

Hint: Set `binary = True` while calling the `ne_chunk` function.

In [6]:
def find_named_entities(tagged_document):
  '''Return a list of all the named entities in the given tagged document.'''
  
  named_entities = []

  for sent in tagged_document:
    tree = nltk.ne_chunk(sent, binary=True)
    for subtree in tree.subtrees():
      if subtree.label() == 'NE':
        # Join the tokens of the entity, dropping their POS tags.
        entity = " ".join(token for token, tag in subtree.leaves())
        named_entities.append(entity)

  return named_entities

Run the cell below to check if the output is formatted correctly.

The output values might not match exactly, but should look similar to: `['Star Wars', 'Star Wars', 'New Hope', 'American', 'George Lucas', 'Lucasfilm', ...]`

In [7]:
# check output for Task 2
tagged_document = ie_preprocess(documents[0]) # pre-process the first document
find_named_entities(tagged_document)[:10]     # display the first 10 named entities

['Star Wars',
 'Star Wars',
 'New Hope',
 'American',
 'George Lucas',
 'Lucasfilm',
 'Century Fox',
 'Mark Hamill',
 'Harrison Ford',
 'Carrie Fisher']


## Task 3: Information / Relation Extraction (I) (30 Marks)

Choose any **three** relations out of the following and write functions to extract them from a given document.

* **Title**
* **Language**
* **Starring**
* **Release date**
* **Cinematography**
* **Dialogue by**
* **Directed by**
* **Edited by**
* **Music by**
* **Narrated by**
* **Produced by**
* **Screenplay by**
* **Story by**
* **Written by**
* **Production companies**
* **Distribution companies**
* **Budget**
* **Box office**


The functions you define here must take as input a string called `document` and return the information/relation extracted as a list. You can explain your approach with comments along with your code.
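One possible way to meet this interface, sketched here for **Directed by**, is a plain regular expression over the raw text. The function name `extract_directed_by` and the pattern are illustrative assumptions, tuned only to phrasings such as "directed by X" and "directed, ..., by X"; other wordings will be missed.

```python
import re

# A name is a run of capitalised words or single initials such as "J."
# (an assumption that covers names like "J. J. Abrams").
NAME = r"(?:[A-Z]\.|[A-Z][\w'-]+)(?:\s+(?:[A-Z]\.|[A-Z][\w'-]+))*"

# "directed by X", or "directed, produced, and co-written by X" within one sentence.
DIRECTED_BY = re.compile(r"\bdirected\b(?:,[^.]*?)?\s+by\s+(" + NAME + r")")


def extract_directed_by(document):
    '''Return a list of director names found in the document (hypothetical helper).'''
    found = []
    for match in DIRECTED_BY.finditer(document):
        name = match.group(1)
        if name not in found:  # keep first-seen order, drop repeats
            found.append(name)
    return found
```

Because the pattern works on surface text, it returns names in the same form as they appear in the document, which makes a later comparison against reference labels more direct.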


In [94]:
# relation 1: "Produced by"
def relation1(doc):

  tokenized_sent = nltk.sent_tokenize(doc)
  tagged_sent = [nltk.word_tokenize(sent) for sent in tokenized_sent]
  tagged_sent = [nltk.pos_tag(sent) for sent in tagged_sent]

  output_prod = []

  for sent in tagged_sent:
    # Define X, Y, and \alpha
    subjclass = 'NE'
    objclass = 'NE'
    pattern = re.compile(r'.*Produce.*', re.IGNORECASE)

    # Group a chunk structure into a list of 'semi-relations'.

    chunked = nltk.ne_chunk(sent, binary=True) 
    pairs = nltk.sem.relextract.tree2semi_rel(chunked)

    # Convert 'semi-relations' into a dictionary which stores information 
    # about the subject and object NEs plus the filler between them.
    reldicts = nltk.sem.relextract.semi_rel2reldict(pairs + [[[]]])

    # Filter relevant relations by matching the regexp pattern.
    relfilter = lambda x: (x['subjclass'] == subjclass and
                              pattern.match(x['filler']) and
                              x['objclass'] == objclass)
    
    rels = list(filter(relfilter, reldicts))

    # Print the relations found in the text.
    for rel in rels:
      print(nltk.sem.relextract.rtuple(rel))

    if len(rels) > 0:
      output_prod.append(rels[0]['objsym'])
      output_prod.append(rels[0]['subjsym'])
  return output_prod

relation1(documents[0])

[NE: 'George/NNP Lucas/NNP'] ',/, produced/VBN by/IN' [NE: 'Lucasfilm/NNP']


['lucasfilm', 'george_lucas']
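The symbols returned above (`'lucasfilm'`, `'george_lucas'`) are lowercased and underscore-joined, whereas names elsewhere in this notebook are written as plain text (for example `'George Lucas'`). A minimal sketch of recovering the surface form from the `token/TAG` text shown in the printed relation follows. The helper name `untag` is hypothetical, and it assumes the tagged string is available, for example from the `subjtext` and `objtext` entries of the relation dictionary.

```python
def untag(tagged_text):
    '''Turn a tagged string such as 'George/NNP Lucas/NNP' into 'George Lucas'.'''
    tokens = []
    for item in tagged_text.split():
        # Split on the last slash only, so a token that is itself "/" survives.
        token, sep, _tag = item.rpartition("/")
        tokens.append(token if sep else item)
    return " ".join(tokens)
```

For instance, `untag("George/NNP Lucas/NNP")` gives `'George Lucas'`.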

In [51]:
documents[2]


'The Dark Knight is a 2008 superhero film directed, produced, and co-written by Christopher Nolan. Based on the DC Comics character Batman, the film is the second installment of Nolan\'s The Dark Knight Trilogy and a sequel to 2005\'s Batman Begins, starring Christian Bale and supported by Michael Caine, Heath Ledger, Gary Oldman, Aaron Eckhart, Maggie Gyllenhaal, and Morgan Freeman. In the film, Bruce Wayne / Batman (Bale), Police Lieutenant James Gordon (Oldman) and District Attorney Harvey Dent (Eckhart) form an alliance to dismantle organized crime in Gotham City, but are menaced by an anarchistic mastermind known as the Joker (Ledger), who seeks to undermine Batman\'s influence and throw the city into anarchy. Nolan\'s inspiration for the film was the Joker\'s comic book debut in 1940, the 1988 graphic novel The Killing Joke, and the 1996 series The Long Halloween, which retold Harvey Dent\'s origin. The "Dark Knight" nickname was first applied to Batman in Batman #1 (1940), in a 

In [108]:
# relation 2: "Directed by"
def relation2(doc):

  tokenized_sent = nltk.sent_tokenize(doc)
  tagged_sent = [nltk.word_tokenize(sent) for sent in tokenized_sent]
  tagged_sent = [nltk.pos_tag(sent) for sent in tagged_sent]

  output_prod = []

  for sent in tagged_sent:
    # Define X, Y, and \alpha
    subjclass = 'NE'
    objclass = 'NE'
    pattern = re.compile(r'.*Direct.*', re.IGNORECASE)

    # Group a chunk structure into a list of 'semi-relations'.

    chunked = nltk.ne_chunk(sent, binary=True) 
    pairs = nltk.sem.relextract.tree2semi_rel(chunked)

    # Convert 'semi-relations' into a dictionary which stores information 
    # about the subject and object NEs plus the filler between them.
    reldicts = nltk.sem.relextract.semi_rel2reldict(pairs + [[[]]])

    # Filter relevant relations by matching the regexp pattern.
    relfilter = lambda x: (x['subjclass'] == subjclass and
                              pattern.match(x['filler']) and
                              x['objclass'] == objclass)
    
    rels = list(filter(relfilter, reldicts))

    # Print the relations found in the text.
    for rel in rels:
      print(nltk.sem.relextract.rtuple(rel))

    if len(rels) > 0:
      output_prod.append(rels[0]['objsym'])
      output_prod.append(rels[0]['subjsym'])
  return output_prod

relation2(documents[10])

[NE: 'American/JJ'] 'epic/NN biographical/JJ black/JJ comedy/NN crime/NN film/NN directed/VBN by/IN' [NE: 'Martin/NNP Scorsese/NNP']


['martin_scorsese', 'american']

In [137]:
documents[1]

'Star Wars: The Rise of Skywalker (also known as Star Wars: Episode IX - The Rise of Skywalker) is a 2019 American epic space opera film produced, co-written, and directed by J. J. Abrams. Produced by Lucasfilm and Abrams\' production company Bad Robot Productions, and distributed by Walt Disney Studios Motion Pictures, it is the third installment of the Star Wars sequel trilogy, following The Force Awakens (2015) and The Last Jedi (2017), and the final episode of the nine-part "Skywalker saga". Its ensemble cast includes Carrie Fisher, Mark Hamill, Adam Driver, Daisy Ridley, John Boyega, Oscar Isaac, Anthony Daniels, Naomi Ackie, Domhnall Gleeson, Richard E. Grant, Lupita Nyong\'o, Keri Russell, Joonas Suotamo, Kelly Marie Tran, Ian McDiarmid, and Billy Dee Williams. The Rise of Skywalker follows Rey, Finn, and Poe Dameron as they lead the Resistance\'s final stand against Supreme Leader Kylo Ren and the First Order, who are aided by the return of the deceased Galactic Emperor, Palpat

In [110]:
# relation 3: "Written by"
def relation3(doc):

  tokenized_sent = nltk.sent_tokenize(doc)
  tagged_sent = [nltk.word_tokenize(sent) for sent in tokenized_sent]
  tagged_sent = [nltk.pos_tag(sent) for sent in tagged_sent]

  output_prod = []

  for sent in tagged_sent:
    # Define X, Y, and \alpha
    subjclass = 'NE'
    objclass = 'NE'
    pattern = re.compile(r'.*Written.*', re.IGNORECASE)

    # Group a chunk structure into a list of 'semi-relations'.

    chunked = nltk.ne_chunk(sent, binary=True) 
    pairs = nltk.sem.relextract.tree2semi_rel(chunked)

    # Convert 'semi-relations' into a dictionary which stores information 
    # about the subject and object NEs plus the filler between them.
    reldicts = nltk.sem.relextract.semi_rel2reldict(pairs + [[[]]])

    # Filter relevant relations by matching the regexp pattern.
    relfilter = lambda x: (x['subjclass'] == subjclass and
                              pattern.match(x['filler']) and
                              x['objclass'] == objclass)
    
    rels = list(filter(relfilter, reldicts))

    # Print the relations found in the text.
    for rel in rels:
      print(nltk.sem.relextract.rtuple(rel))

    if len(rels) > 0:
      output_prod.append(rels[0]['objsym'])
      output_prod.append(rels[0]['subjsym'])
  return output_prod

relation3(documents[10])

[NE: 'Martin/NNP Scorsese/NNP'] 'and/CC written/VBN by/IN' [NE: 'Terence/NNP Winter/NNP']


['terence_winter', 'martin_scorsese']

---

## Task 4: Information / Relation Extraction (II)  (15 Marks)

Identify one other relation of your choice, besides the ones mentioned in the previous task, and write a function to extract it. 

The function you define here must take as input a string called `document` and return the information/relations extracted as a list.

In [140]:
# relation 4: "episode" (the custom relation chosen for Task 4)
def relation(doc):

  tokenized_sent = nltk.sent_tokenize(doc)
  tagged_sent = [nltk.word_tokenize(sent) for sent in tokenized_sent]
  tagged_sent = [nltk.pos_tag(sent) for sent in tagged_sent]

  output_prod = []

  for sent in tagged_sent:
    # Define X, Y, and \alpha
    subjclass = 'NE'
    objclass = 'NE'
    pattern = re.compile(r'.*episode.*', re.IGNORECASE)

    # Group a chunk structure into a list of 'semi-relations'.

    chunked = nltk.ne_chunk(sent, binary=True) 
    pairs = nltk.sem.relextract.tree2semi_rel(chunked)

    # Convert 'semi-relations' into a dictionary which stores information 
    # about the subject and object NEs plus the filler between them.
    reldicts = nltk.sem.relextract.semi_rel2reldict(pairs + [[[]]])

    # Filter relevant relations by matching the regexp pattern.
    relfilter = lambda x: (x['subjclass'] == subjclass and
                              pattern.match(x['filler']) and
                              x['objclass'] == objclass)
    
    rels = list(filter(relfilter, reldicts))

    # Print the relations found in the text.
    for rel in rels:
      print(nltk.sem.relextract.rtuple(rel))

    if len(rels) > 0:
      output_prod.append(rels[0]['objsym'])
      output_prod.append(rels[0]['subjsym'])
  return output_prod

# for x in documents:
#   print(relation(x))
relation(documents[1])

[NE: 'Star/NN Wars/NNS'] ':/: Episode/NNP IX/NNP -/: The/DT Rise/NN of/IN' [NE: 'Skywalker/NNP']


['skywalker', 'star_wars']

---

## Task 5: Combining information in the output (5 Marks)

Edit the function below to return a Python dictionary with the outputs from the functions defined in Tasks 3 and 4.

In [141]:
def extract_info(document):
  '''Extract information and relations from a given document.'''

  # Edit the output dict below and assign the values to keys by 
  # calling the appropriate functions from Tasks 3 and 4.
  
  # You can delete the keys for which you do not perform extraction in Task 3.
  temp_1 = relation1(document)  # "Produced by" (Task 3, relation 1)
  temp_2 = relation2(document)  # "Directed by" (Task 3, relation 2)
  temp_3 = relation3(document)  # "Written by"  (Task 3, relation 3)
  temp_4 = relation(document)   # "episode"     (Task 4)
  output = {
    ##### EDIT BELOW THIS LINE #####
    
    # For the relations you extract in Task 3, 
    # save the output in the appropriate key and delete rest of the keys.

    # Each function already returns a list, so the values are stored
    # directly instead of being wrapped in another list.
    "Directed by": temp_2,
    "Produced by": temp_1,
    "Written by": temp_3,

    # save the output from Task 4 here
    "episode": temp_4,

    ##### EDIT ABOVE THIS LINE #####
  }

  return output


# check output for the second document (index 1)
extract_info(documents[1])

[NE: 'Trevorrow/NNP'] 'left/VBD the/DT project/NN following/VBG creative/JJ differences/NNS with/IN producer/NN' [NE: 'Kathleen/NNP Kennedy/NNP']
[NE: 'Star/NN Wars/NNS'] ':/: Episode/NNP IX/NNP -/: The/DT Rise/NN of/IN' [NE: 'Skywalker/NNP']
[NE: 'Star/NN Wars/NNS'] ':/: Episode/NNP IX/NNP -/: The/DT Rise/NN of/IN' [NE: 'Skywalker/NNP']


{'Directed by': [],
 'Produced by': ['kathleen_kennedy', 'trevorrow'],
 'Written by': ['skywalker', 'star_wars'],
 'episode': ['skywalker', 'star_wars']}

The output from the cell above should look something like the dictionary shown below. Overall values might be different, based on what four items you choose to extract in Tasks 3 and 4, but the structure should be similar.

For example, if you choose to extract **Starring**, **Release Date**, **Box office**, and **Directed by**, then the output should look something like this for the first document:

```javascript
{
  'Box office': ['$775 million'],
  'Directed by': ['George Lucas'],
  'Release date': ['May 25, 1977'],
  'Starring': ['Mark Hamill', 'Harrison Ford', 'Carrie Fisher', 
               'Peter Cushing', 'David Prowse', 'James Earl Jones', ],
}
```

---

## Task 6: Evaluation (I) (15 Marks)

Write a function to evaluate the performance of Task $3$ using **Precision**, **Recall** and **F1** scores. Use the gold-standard labels provided in the JSON files to calculate these values.

Please note that not every relation listed in Task $3$ has a gold-standard label for every movie, i.e., some JSON documents will have certain key-value pairs missing. For example, we have labels for *Budget* in 46 out of the 50 movies; in the remaining 4 documents, the key `Budget` is omitted from the JSON.
 
Also keep in mind that we will further run this evaluation on a hidden test set containing similar movie descriptions.
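As a rough illustration of how these scores could be computed, the sketch below micro-averages over all documents and fields by counting exact matches between predicted and gold values. It is not the required solution: the helper names `as_set` and `micro_prf` are hypothetical, and it assumes each value is either a string or a list of strings, that matching is case-insensitive, and that a field missing from the gold labels is skipped rather than penalised.

```python
def as_set(value):
    '''Normalise a string or a list of strings into a set of lowercase strings.'''
    if isinstance(value, str):
        value = [value]
    return {item.strip().lower() for item in value}


def micro_prf(labels, predictions, fields):
    '''Micro-averaged precision, recall and F1 over the given fields.'''
    tp = fp = fn = 0
    for gold_doc, pred_doc in zip(labels, predictions):
        for field in fields:
            if field not in gold_doc:
                # Assumption: no gold label means the field is not scored
                # for this document.
                continue
            gold = as_set(gold_doc[field])
            pred = as_set(pred_doc.get(field, []))
            tp += len(gold & pred)   # predicted and correct
            fp += len(pred - gold)   # predicted but not in the gold labels
            fn += len(gold - pred)   # in the gold labels but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {'precision': precision, 'recall': recall, 'f1': f1}
```

Exact string matching is strict: a prediction such as `'george_lucas'` would not match a gold value `'George Lucas'`, so predictions may need to be converted to surface text first.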

In [142]:
def evaluate(labels, predictions):
  '''
  Evaluate the performance of relation extraction 
  using Precision, Recall, and F1 scores.

  Args:
    labels: A list containing gold-standard labels
    predictions: A list containing information extracted from documents
  Returns:
    scores: A dictionary containing Precision, Recall and F1 scores 
            for the information/relations extracted in Task 3.
  '''

  assert len(predictions) == len(labels)

  scores = {
      'precision': 0.0, 'recall': 0.0, 'f1': 0.0
  }

  # calculate the precision, recall and f1 score over the information fields 
  # corresponding to Task 3 and store the result in the `scores` dict.

  # your code goes here
  # ...



  return scores

---
Run the cell below to calculate and display the evaluation scores for the 50 documents in `movies.zip`.

You can consider the following as a baseline score. Your aim should be to score higher or at least get as close as possible to these values.

| Precision | Recall | F1    |
| :---:     | :---:  | :---: |
| 0.5       | 0.25   | 0.333 |
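
As a quick check on the baseline row, F1 is the harmonic mean of precision $P$ and recall $R$, so the three values in the table are consistent with each other:

$$
F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.5 \times 0.25}{0.5 + 0.25} = \frac{0.25}{0.75} \approx 0.333
$$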

In [None]:
# !pip install pandas
import pandas as pd

# calculate evaluation score across all the 50 documents
extracted_infos = []
for document in documents:
  extracted_infos.append(extract_info(document))

scores = evaluate(labels, extracted_infos)

pd.DataFrame([scores])

---

## Task 7: Evaluation (II) (10 Marks)

Describe **two** challenges you encountered above or might encounter in the evaluation of *information extraction* or *relation extraction* tasks.


Edit this cell to write your answer below the line in no more than 100 words. No coding is required for this task.

---

> Delete this line and write your answer here.