# Assignment 3 - CT5120/CT5146

### Instructions:
- Complete all the tasks below and upload your submission as a Python notebook on Blackboard with the filename “`StudentID_Lastname.ipynb`” before **23:59** on **December 31, 2021**. Please note that there will be no further extensions to this deadline and we highly encourage you to submit this assignment before Semester 1 exams.
- This is an individual assignment, you **must not** work with other students to complete this assessment.
- The assignment is worth $100$ marks and constitutes 19% of the final grade. The breakdown of the marking scheme for each task is as follows:

|           | Task | Marks |
| :---      | :-----| -----:|
| Task 1    | Pre-processing |   15 |
| Task 2    | Named Entity Recognition |    10 |
| Task 3    | Information / Relation Extraction (I) | 30 |
| Task 4    | Information / Relation Extraction (II) | 15 |
| Task 5    | Combining information in the output   | 5 |
| Task 6    | Evaluation (I) | 15 |
| Task 7    | Evaluation (II) | 10 |



---

## Information Extraction and Relation Extraction

In the following tasks you will write code to perform **_information extraction_** and **_relation extraction_** across a collection of documents in `movies.zip`.

The zip archive contains 100 files, out of which 50 are plaintext documents and other 50 contain data structured as JSON.
Each plaintext document contains a text description of a movie taken from the English version of Wikipedia, while each JSON document contains *gold-standard* labels (also called *reference* labels) stored as key-value pairs for the entities and relations for each document.

You are only allowed to use the given documents and labels and **must not** use any other external sources of data for this assignment.

---

Download and unarchive `movies.zip` from Blackboard and place it in the same location as this notebook or uncomment the code cell below to get the data in a directory called `movies` and also place it automatically in the same location as this notebook.

In [65]:
!if test -f "movies.zip"; then rm "movies.zip"; fi
!if test -d "movies/"; then rm -rf "movies/"; fi
!wget "https://drive.google.com/uc?export=download&id=1L6NcSGkubNJaL6xSnYEZZKSrlyXq1AbB" -O "movies.zip"
!unzip "movies.zip"

--2022-01-02 22:23:10--  https://drive.google.com/uc?export=download&id=1L6NcSGkubNJaL6xSnYEZZKSrlyXq1AbB
Resolving drive.google.com (drive.google.com)... 74.125.195.100, 74.125.195.139, 74.125.195.138, ...
Connecting to drive.google.com (drive.google.com)|74.125.195.100|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-04-90-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/iot4l6711uig94ftv8qlv1cr7mtatqfh/1641162150000/04741348677416923358/*/1L6NcSGkubNJaL6xSnYEZZKSrlyXq1AbB?e=download [following]
--2022-01-02 22:23:11--  https://doc-04-90-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/iot4l6711uig94ftv8qlv1cr7mtatqfh/1641162150000/04741348677416923358/*/1L6NcSGkubNJaL6xSnYEZZKSrlyXq1AbB?e=download
Resolving doc-04-90-docs.googleusercontent.com (doc-04-90-docs.googleusercontent.com)... 74.125.20.132, 2607:f8b0:400e:c07::84
Connecting to doc-04-90-docs.googleusercontent.com (doc-04-

---

## Reading Data

Place the unzipped `movies` directory in the same location as this notebook and run the following code cell to read the plaintext and JSON documents.

In [66]:
######### DO NOT EDIT THIS CELL #########

import os
import json

documents = []   # store the text documents as a list of strings
labels = []      # store the gold-standard labels as a list of dictionaries

for idx in range(50):
  with open(os.path.join('movies', str(idx+1).zfill(2) + '.doc.txt')) as f:
    doc = f.read().strip()
  with open(os.path.join('movies', str(idx+1).zfill(2) + '.info.json')) as f:
    label = json.load(f)

  documents.append(doc)
  labels.append(label)

assert len(documents) == 50
assert len(labels) == 50

---

In [67]:
# Load the libraries which might be useful

import re
import nltk
nltk.download('all', quiet=True)

True

---

## Task 1: Document Pre-processing (15 Marks)
Write a function that takes a document and returns a list of sentences with part-of-speech tags.

The expected output is a list of tagged sentences where each tagged sentence is a list containing `(token, tag)` pairs.


In [68]:
def ie_preprocess(document):
  '''Return a list of sentences tagged with part-of-speech tags for the given document.'''
#Creating a list
  tagged_sentences = []


  #Sentence segmentation
  sentences = nltk.sent_tokenize(document)
  
  #Sentence is split into tokens
  #Token is a smallest unit of text or a sentence
  #Tokenization is the process of splitting the text to extract tokens.
  # Tokenizing the sentences into words using nltk.word_tokenize
  tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]
  
  # POS tagging
  
  # Parts of Speech tagging helps to infer knowledge on how a word is used in a sentence or text
  tagged_sentences = [nltk.pos_tag(sent) for sent in tokenized_sentences]
  
    
  
  return tagged_sentences
  #output
ie_preprocess(documents[0])[-10]



[('It', 'PRP'),
 ('received', 'VBD'),
 ('ten', 'JJ'),
 ('Oscar', 'NNP'),
 ('nominations', 'NNS'),
 ('(', '('),
 ('including', 'VBG'),
 ('Best', 'NNP'),
 ('Picture', 'NN'),
 (')', ')'),
 (',', ','),
 ('winning', 'VBG'),
 ('seven', 'CD'),
 ('.', '.')]

In [69]:
ie_preprocess(documents[0])[-10]

[('It', 'PRP'),
 ('received', 'VBD'),
 ('ten', 'JJ'),
 ('Oscar', 'NNP'),
 ('nominations', 'NNS'),
 ('(', '('),
 ('including', 'VBG'),
 ('Best', 'NNP'),
 ('Picture', 'NN'),
 (')', ')'),
 (',', ','),
 ('winning', 'VBG'),
 ('seven', 'CD'),
 ('.', '.')]

Run the cell below to check if the output is formatted correctly.

Expected output: `[('It', 'PRP'), ('received', 'VBD'), ('ten', 'JJ'), ('Oscar', 'NNP'), ('nominations', 'NNS'), ('(', '('), ('including', 'VBG'), ('Best', 'NNP'), ('Picture', 'NN'), (')', ')'), (',', ','), ('winning', 'VBG'), ('seven', 'CD'), ('.', '.')]`

In [70]:
# check output for Task 1
ie_preprocess(documents[0])[-7]

[('In', 'IN'),
 ('2004', 'CD'),
 (',', ','),
 ('its', 'PRP$'),
 ('soundtrack', 'NN'),
 ('was', 'VBD'),
 ('added', 'VBN'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('U.S.', 'NNP'),
 ('National', 'NNP'),
 ('Recording', 'NNP'),
 ('Registry', 'NNP'),
 (',', ','),
 ('and', 'CC'),
 ('was', 'VBD'),
 ('additionally', 'RB'),
 ('listed', 'VBN'),
 ('by', 'IN'),
 ('the', 'DT'),
 ('American', 'NNP'),
 ('Film', 'NNP'),
 ('Institute', 'NNP'),
 ('as', 'IN'),
 ('the', 'DT'),
 ('best', 'JJS'),
 ('movie', 'NN'),
 ('score', 'NN'),
 ('of', 'IN'),
 ('all', 'DT'),
 ('time', 'NN'),
 ('a', 'DT'),
 ('year', 'NN'),
 ('later', 'RB'),
 ('.', '.')]

## Task 2: Named Entity Recognition (10 Marks)

Write a function that returns a list of all the named entities in a given document. The document here is structured as a list of sentences and tagged with part-of-speech tags.

Hint: Set `binary = True` while calling the `ne_chunk` function.

In [None]:
#pre-porcess and POS tagged documents from the previous step is used in this function
def find_named_entities(tagged_document):
  '''Return a list of all the named entities in the given tagged document.'''

  named_entities = []
  #The each sentence present in the document is grouped in to trees
  #Leaf Node in a tree can be defined as a node without any children
  #Named Entity Recognition is a process of automatically extracting the Named Entities from a corpus of text 
  #Named Entities that are found in a leaf node are extracted from the trees are extracted and updated to a the "named_entities" list
  #Looping each sentence in the tagged_document
  #each sentence present in  the document is chunked into trees
  #named enties present in  the leaf node of individual tree are filtered and are appended into a list
  for sentu in tagged_document:
        
    #In unconstrained text, classes and interfaces are used to define linguistic groupings which are non-overlapping in nature. This process is called as "Chunk Parsing"
    #The groups obtained from the process of Chunk Parsing is called as "Chunks"
    tree = nltk.ne_chunk(sentu, binary=True)
    for subtree in tree.subtrees():
     #The label() is used to obtain a specific node by its label in a tree
    #Checking for a Named Entity
      if subtree.label() == 'NE':
        entity = ""
        for leaf in subtree.leaves():
          entity = entity + leaf[0] + " "
        named_entities.append(entity.strip())
             

  return named_entities

In [71]:
#pre-porcess and POS tagged documents from the previous step is used in this function
def find_named_entities(tagged_document):
  '''Return a list of all the named entities in the given tagged document.'''

  named_entities = []
  #each sentence present in  the document is chunked into trees
  #named enties present in  the leaf node of individual tree are filtered and are appended into a list called "named_entities"
  for sentu in tagged_document:
    tree = nltk.ne_chunk(sentu, binary=True)
    for subtree in tree.subtrees():
      if subtree.label() == 'NE':
        entity = ""
        for leaf in subtree.leaves():
          entity = entity + leaf[0] + " "
        named_entities.append(entity.strip())
             

  return named_entities

Run the cell below to check if the output is formatted correctly.

The output values might not match exactly, but should look similar to: `['Star Wars', 'Star Wars', 'New Hope', 'American', 'George Lucas', 'Lucasfilm', ...]`

In [72]:
# check output for Task 2
tagged_document = ie_preprocess(documents[0]) # pre-process the first document
find_named_entities(tagged_document)[:10]     # display the first 10 named entities

['Star Wars',
 'Star Wars',
 'New Hope',
 'American',
 'George Lucas',
 'Lucasfilm',
 'Century Fox',
 'Mark Hamill',
 'Harrison Ford',
 'Carrie Fisher']

## Task 3: Information / Relation Extraction (I) (30 Marks)

Choose any **three** relations out of the following and write functions to extract them from a given document.

* **Title**
* **Language**
* **Starring**
* **Release date**
* **Cinematography**
* **Dialogue by**
* **Directed by**
* **Edited by**
* **Music by**
* **Narrated by**
* **Produced by**
* **Screenplay by**
* **Story by**
* **Written by**
* **Production companies**
* **Distribution companies**
* **Budget**
* **Box office**


The functions you define here must take as input a string called `document` and return the information/relation extracted as a list. You can explain your approach with comments along with your code.


In [136]:
# relation 1 - your code goes here

def relations_1(doc):
 #sent_tokenize() is used to perform sentence-segmentation
  tokenized_sent = nltk.sent_tokenize(doc)
  #word_tokenize() is used to tokenize each sentence present in the document into words 
  tagged_sent = [nltk.word_tokenize(sent) for sent in tokenized_sent]
  # POS tagging is performed using pos_tag() 
  tagged_sent = [nltk.pos_tag(sent) for sent in tagged_sent]
#Creating a list to store the results of the "Produced_By" relation
  produced_res = []
  
#Defining X 
#Defining Y
#Defining /alpha
  for sent in tagged_sent:
    subjclass = 'NE'
    objclass = 'NE'
    
    pattern_to_find = re.compile(r'.*produce.*') #For the Produced By relation
     #".* " in regular expression means "0 or more of any character"
    #"." - a "dot" indicates any character
    #"*" - means "0 or more instances of the preceding regex token"
    

    

    chunked = nltk.ne_chunk(sent, binary=True) 
    pairs = nltk.sem.relextract.tree2semi_rel(chunked)
    #Converting the 'semi-relations' that were extracted into a "semi_rel2reldict" dictionary
    reldicts = nltk.sem.relextract.semi_rel2reldict(pairs + [[[]]])

    #
    # Extract the  relevant relations that match the regular expression pattern.
    relfilter = lambda x: (x['subjclass'] == subjclass and
                              pattern_to_find.match(x['filler']) and
                              x['objclass'] == objclass)
    
    rels_list = list(filter(relfilter, reldicts))
    for real in rels_list:
      print(nltk.sem.relextract.rtuple(real))
    # extracting subject and object text based on filtered relations 
    if len(rels_list) > 0:
      produced_res.append(rels_list[0]['objsym'])
      produced_res.append(rels_list[0]['subjsym'])
  return produced_res

relations_1(documents[0])

[NE: 'George/NNP Lucas/NNP'] ',/, produced/VBN by/IN' [NE: 'Lucasfilm/NNP']


['lucasfilm', 'george_lucas']

In [134]:
# relation 2 - your code goes here

def relations_2(doc):
  #sent_tokenize() is used to perform sentence-segmentation
  tokenized_sent = nltk.sent_tokenize(doc)
   #word_tokenize() is used to tokenize each sentence present in the document into words 
  tagged_sent = [nltk.word_tokenize(sent) for sent in tokenized_sent]
   # POS tagging is performed using pos_tag() 
  tagged_sent = [nltk.pos_tag(sent) for sent in tagged_sent]
#Creating a list to store the results of the "Directed_By" relation
  directed_res = []
  
  ##Defining X 
#Defining Y
#Defining /alpha
  for sent in tagged_sent:
    subjclass = 'NE'
    objclass = 'NE'

    pattern_to_find = re.compile(r'.*direct.*') #For the Directed By relation
    #".* " in regular expression means "0 or more of any character"
    #"." - a "dot" indicates any character
    #"*" - means "0 or more instances of the preceding regex token"
  

    chunked = nltk.ne_chunk(sent, binary=True) 
    pairs = nltk.sem.relextract.tree2semi_rel(chunked)
    #Converting the 'semi-relations' that were extracted into a "semi_rel2reldict" dictionary
    
    reldicts = nltk.sem.relextract.semi_rel2reldict(pairs + [[[]]])

    #
    # Extract the  relevant relations that match the regular expression pattern.
    relfilter = lambda x: (x['subjclass'] == subjclass and
                              pattern_to_find.match(x['filler']) and
                              x['objclass'] == objclass)
    
    rels_list = list(filter(relfilter, reldicts))
    for real in rels_list:
      print(nltk.sem.relextract.rtuple(real))
    # extracting subject and object text based on filtered relations 
    if len(rels_list) > 0:
      directed_res.append(rels_list[0]['objsym'])
      directed_res.append(rels_list[0]['subjsym'])
  return directed_res

relations_2(documents[0])

[NE: 'American/JJ'] 'epic/NN space-opera/NN film/NN written/VBN and/CC directed/VBN by/IN' [NE: 'George/NNP Lucas/NNP']


['george_lucas', 'american']

In [137]:
# relation 3 
def relations_3(doc):
  #sent_tokenize() is used to perform sentence-segmentation
  tokenized_sent = nltk.sent_tokenize(doc)
   #word_tokenize() is used to tokenize each sentence present in the document into words 
  tagged_sent = [nltk.word_tokenize(sent) for sent in tokenized_sent]
  # POS tagging is performed using pos_tag() 
  tagged_sent = [nltk.pos_tag(sent) for sent in tagged_sent]
#Creating a list to store the results of the "Edited_By" relation
  edited_res = []
  for sent in tagged_sent:
#Defining X 
#Defining Y
#Defining /alpha
    subjclass = 'NE'
    objclass = 'NE'
    pattern_to_find = re.compile(r'.*edited*') #For the Edited By relation
    #".* " in regular expression means "0 or more of any character"
    #"." - a "dot" indicates any character
    #"*" - means "0 or more instances of the preceding regex token"
    

    chunked = nltk.ne_chunk(sent, binary=True) 
    pairs = nltk.sem.relextract.tree2semi_rel(chunked)
    #Converting the 'semi-relations' that were extracted into a "semi_rel2reldict" dictionary
    reldicts = nltk.sem.relextract.semi_rel2reldict(pairs + [[[]]])

    
    # Extract the  relevant relations that match the regular expression pattern.
    relfilter = lambda x: (x['subjclass'] == subjclass and
                              pattern_to_find.match(x['filler']) and
                              x['objclass'] == objclass)
    
    rels_list = list(filter(relfilter, reldicts))
    for real in rels_list:
      print(nltk.sem.relextract.rtuple(real))

    if len(rels_list) > 0:
      edited_res.append(rels_list[0]['objsym'])
      edited_res.append(rels_list[0]['subjsym'])
  return edited_res

relations_3(documents[4])

[NE: 'Buena/NNP Vista/NNP International/NNP'] ',/, edited/VBN by/IN' [NE: 'Fabienne/NNP']


['fabienne', 'buena_vista_international']

---

## Task 4: Information / Relation Extraction (II)  (15 Marks)

Identify one other relation of your choice, besides the ones mentioned in the previous task, and write a function to extract it. 

The function you define here must take as input a string called `document` and return the information/relations extracted as a list.

In [130]:

# relation 4
def relations_4(doc):
  
  #sent_tokenize() is used to perform sentence-segmentation
  tokenized_sent = nltk.sent_tokenize(doc)
  #word_tokenize() is used to tokenize each sentence present in the document into words 
  tagged_sent = [nltk.word_tokenize(sent) for sent in tokenized_sent]
  # POS tagging is performed using pos_tag() 
  tagged_sent = [nltk.pos_tag(sent) for sent in tagged_sent]
  #Creating a list to store the results of the "Co-Written" relation
  cowritten_res = []
#Defining X 
#Defining Y
#Defining /alpha
  
  for sent in tagged_sent:
    subjclass = 'NE'
    objclass = 'NE'
    
    pattern_to_find = re.compile(r'.*co-written.*')  #For the Co-Written relation
    #".* " in regular expression means "0 or more of any character"
    #"." - a "dot" indicates any character
    #"*" - means "0 or more instances of the preceding regex token"
   

    chunked = nltk.ne_chunk(sent, binary=True) 
    pairs = nltk.sem.relextract.tree2semi_rel(chunked)
    #Converting the 'semi-relations' that were extracted into a "semi_rel2reldict" dictionary
   
    reldicts = nltk.sem.relextract.semi_rel2reldict(pairs + [[[]]])

    # Extract the  relevant relations that match the regular expression pattern.
    relfilter = lambda x: (x['subjclass'] == subjclass and
                              pattern_to_find.match(x['filler']) and
                              x['objclass'] == objclass)
    
    rels_list = list(filter(relfilter, reldicts))
    for real in rels_list:
      print(nltk.sem.relextract.rtuple(real))
    # extracting subject and object text based on filtered relations 
    if len(rels_list) > 0:
      cowritten_res.append(rels_list[0]['objsym'])
      cowritten_res.append(rels_list[0]['subjsym'])
  return  cowritten_res

relations_4(documents[2])

[NE: 'Dark/NNP Knight/NNP'] 'is/VBZ a/DT 2008/CD superhero/NN film/NN directed/VBD ,/, produced/VBN ,/, and/CC co-written/JJ by/IN' [NE: 'Christopher/NNP Nolan/NNP']


['christopher_nolan', 'dark_knight']

---

## Task 5: Combining information in the output (5 Marks)

Edit the function below to return a Python dictionary with the outputs from the functions defined in tasks $3 - 4$.

The output from the cell above should look something like the dictionary shown below. Overall values might be different, based on what four items you choose to extract in Tasks 3 and 4, but the structure should be similar.

For example, if you choose to extract **Starring**, **Release Date**, **Box office**, and **Directed by**, then the output should look something like this for the first document:

```javascript
{
  'Box office': ['$775 million'],
  'Directed by': ['George Lucas'],
  'Release date': ['May 25, 1977'],
  'Starring': ['Mark Hamill', 'Harrison Ford', 'Carrie Fisher', 
               'Peter Cushing', 'David Prowse', 'James Earl Jones', ],
}
```

In [82]:
def extract_info(document):
  '''Extract information and relations from a given document.'''

  # Edit the output dict below and assign the values to keys by 
  # calling the appropriate functions from Tasks 3 and 4.
  
  # You can delete the keys for which you do not perform extraction in Task 3.
  #Except "Produced by", "Directed by", "Edited by" all the other keys are deleted
  result_1 = relations_1(document)
  result_2 = relations_2(document)
  result_3 = relations_3(document)
  result_4 = relations_4(document)
  output = {
   
    
    # For the relations you extract in Task 3, 
    # save the output in the appropriate key and delete rest of the keys.
    
    "Produced by": [result_1[0] if len(result_1) > 0 else None  ],
    
    "Directed by": [result_2[0] if len(result_2) > 0 else None ],
    
    "Edited by": [result_3[0] if len(result_3) > 0 else None ],
   

    # save the output from Task 4 here
    "co-written": [result_4]

    
  }

  return output


# check output for the first document
extract_info(documents[2])

{'Directed by': ['christopher_nolan'],
 'Edited by': [None],
 'Produced by': ['christopher_nolan'],
 'co-written': [['christopher_nolan', 'dark_knight']]}

In [127]:
def evaluation(pred, label):
  
  output = []
  total_predictions = 0
  total_labels = 0
  counts_num = 0
  for val1, val2 in zip(pred, label):
    preds = []
    labes  = []
    for tempo1 in val1:
      if tempo1 != None:
        tempo1 = tempo1.replace("_", " ").strip().lower()
      else:
        tempo1 = None
        total_predictions += 1
      preds.append(tempo1)

    if val2 == None:
      val2 = [None]
    for tempvar2 in val2:
      if tempvar2 != None:
        tempo = tempvar2.strip().lower()
      else:
        tempo1 = None
        total_labels += 1
      labes.append(tempo1)
    output.append((*preds, labes))
  
  similar = 0
  for tres1, tres2 in output:
    if tres1 in tres2 and tres1 != None:
      similar += 1

  p_value = similar / (50 - total_predictions)
  r_value = similar / (50 - total_labels)
  f1_value = (2 * p_value * r_value) / (p_value + r_value)
  return p_value, r_value, f1_value


def evaluate(labels, predictions):
  '''
  Evaluate the performance of relation extraction 
  using Precision, Recall, and F1 scores.

  Args:
    labels: A list containing gold-standard labels
    predictions: A list containing information extracted from documents
  Returns:
    scores: A dictionary containing Precision, Recall and F1 scores 
            for the information/relations extracted in Task 3.
  '''

  assert len(predictions) == len(labels)
  #Defining lists to store the produced by, directed by , edited by entities along with their labels 
  
  produ_predictions,  produ_labels   = [], []
  dire_predictions, dire_labels      = [], []
  edited_predictions, edited_labels  = [], []

  for movie in predictions:
    produ_predictions.append(movie['Produced by'])
    dire_predictions.append(movie['Directed by'])
    edited_predictions.append(movie["Edited by"])


  for movie in labels:
    produ_labels.append(movie['Produced by'])
    dire_labels.append(movie['Directed by'])
    if 'Edited by' in list(movie.keys()):
      edited_labels.append(movie['Edited by'])
    else:
      edited_labels.append(None)
#The evaluation-function is called in order to compute metrices values for Precision,   Recall and F1 Score  for all the entities
  precision_produ_value, recall_produ_value, f1_produ_value    = evaluation(produ_predictions, produ_labels)
  precision_dire_value, recall_dire_value, f1_dire_value = evaluation(dire_predictions, dire_labels)
  precision_write_value, recall_write_value, f1_write_value = evaluation(edited_predictions, edited_labels)
  
#Computing the metric values for Precision,   Recall and F1 Score and computing their mean value
  scores = {
      'Precision': (precision_produ_value + precision_dire_value + precision_write_value)/3,
      'Recall': (recall_produ_value + recall_dire_value + recall_write_value)/3,
      'F1': (f1_produ_value + f1_dire_value + f1_write_value)/3
  }

  # calculate the precision, recall and f1 score over the information fields 
  # corresponding to Task 3 and store the result in the `scores` dict.


  return scores

---

## Task 6: Evaluation (I) (15 Marks)

Write a function to evaluate the performance of Task $3$ using **Precision**, **Recall** and **F1** scores. Use the gold-standard labels provided in the JSON files to calculate these values.

Please note that not all the information / relations mentioned in Task $3$ have associated labels for each and every movie in the JSON documents, i.e., some JSON documents will have certain keys-value pairs missing. For example, we have labels for *Budget* in 46 out of the 50 movies and in the remaining 4 documents, you will find that the key `Budget` is omitted from the JSON.
 
Also keep in mind that we will further run this evaluation on a hidden test set containing similar movie descriptions.

---
Run the cell below to calculate and display the evaluation scores for the 50 documents in `movies.zip`.

You can consider the following as a baseline score. Your aim should be to score higher or atleast get as close as possible to these values.

| Precision | Recall | F1    |
| :---:     | :---:  | :---: |
| 0.5       | 0.25   | 0.333 |

In [128]:
# !pip install pandas
import pandas as pd

# calculate evaluation score across all the 50 documents
extracted_infos = []
for document in documents:
  extracted_infos.append(extract_info(document))

scores = evaluate(labels, extracted_infos)

pd.DataFrame([scores])

Unnamed: 0,Precision,Recall,F1
0,1.0,0.500816,0.609041


---

## Task 7: Evaluation (II) (10 Marks)

Describe **two** challenges you encountered above or might encounter in the evaluation of *information extraction* or *relation extraction* tasks.




1.   The related words may not appear in all documents. This may differ from one document to the next, and it may have an impact on the rating matrices. Choosing the related words was also difficult because they had to be chosen based on their frequency in the majority of the documents.
2.   The majority of these related words had word cases that were mixed. This could potentially have an impact on the prediction process. As a result, the (upper or lower)cases can be  ignored using re.IGNORECASE and the procedure can be continued.
