# **Task #2 - Question answering with an extractive QA model**

This task consists of using a transformer-based extractive question-answering model to locate information in a text. You will use the HuggingFace library to carry out this task. Specifically, you are asked to use the *bert-large-uncased-whole-word-masking-finetuned-squad* model.

The goal is to locate 3 pieces of information in the textual descriptions: the location of the incident, its date, and a short passage of text stating what happened. An important part of your work is finding good question formulations to locate this information. The file *t2_qa_examples.json*, which contains 25 human-annotated examples, is available for your experiments.

The instructions for this task are:
- Notebook name: *t2_qa.ipynb* (this notebook)
- Tokenization and word embeddings: those of the model used.
- Normalization: none required (the tokenizer converts letters to lowercase).
- Model construction: use the pretrained version of the model without modification. No fine-tuning is required for this task.
- Evaluation: code is provided in the notebook to evaluate model performance with the *exact match* and *F1* metrics.
- Analysis: present and discuss the results you obtain for the 3 types of information to locate. Also discuss your choice of questions and the errors made by the QA model.

You may add to the notebook any cells you need for your code, your explanations, or the presentation of your results. You may also add subsections (e.g., subsections 1.1, 1.2, etc.) if this improves readability.

Notes:
- Avoid pieces of code that are too long or too complex. For example, 4-5 nested loops or conditions are hard to follow. If this happens, define helper functions to refactor and simplify your code.
- Briefly explain your approach.
- Explain your programming and modeling choices (when non-trivial).
- Analyze your results. State what you observe, whether it is good or not, whether it is surprising, etc.
- A quantitative and qualitative error analysis is valuable and helps to better understand a model's behavior.

## 1. Loading the data

Use the file ***data/t2_qa_examples.json*** for your experiments.

In [1]:
import json

def load_json_data(filename):
    with open(filename, 'r') as fp:
        data = json.load(fp)
    return data
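Because the model is extractive, it can only return a span copied from the context. A quick sanity check on the annotations is therefore to verify that each gold answer appears verbatim in its text. The sketch below assumes each example is a dict with the keys `text`, `WHEN`, `WHERE` and `EVENT`; the helper name `find_non_extractive` and the two sample records are hypothetical and serve only as an illustration.

```python
def find_non_extractive(examples, keys=("WHEN", "WHERE", "EVENT")):
    """Return (index, key) pairs whose annotated answer is not a verbatim span of the text."""
    problems = []
    for i, example in enumerate(examples):
        for key in keys:
            if example[key] not in example["text"]:
                problems.append((i, key))
    return problems


# Hypothetical in-memory examples, for illustration only (not taken from the data file)
sample = [
    {"text": "On May 3  2014  a worker fell from a ladder in a warehouse.",
     "WHEN": "May 3  2014", "WHERE": "warehouse",
     "EVENT": "a worker fell from a ladder"},
    {"text": "A worker was burned at a refinery.",
     "WHEN": "June 2015", "WHERE": "refinery",
     "EVENT": "A worker was burned"},
]

print(find_non_extractive(sample))  # [(1, 'WHEN')]
```

On the real data, the same call would be `find_non_extractive(load_json_data('data/t2_qa_examples.json'))`. Any pair it reports is an answer the model cannot match exactly, whatever the question.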

In [2]:
# Load the data and display a few examples
from pprint import pprint

data = load_json_data('data/t2_qa_examples.json')
print("Total number of examples:", len(data))

# Use pprint to display 5 examples
pprint(data[:5])

Total number of examples: 25
[{'EVENT': 'Employee #1  was struck and thrown',
  'WHEN': 'November 10  2013',
  'WHERE': 'railroad bridge overpass',
  'text': ' At around 10:00 p.m. on November 10  2013  Employee #1  with '
          'Villager  Construction Inc.  with a coworker  were using an asphalt '
          'milling machine  (Wirtgen; Model Number: W2100) to grind out '
          'existing asphalt from an  interstate at a railroad bridge overpass. '
          'Employee # 1 was standing on the  ground  checking the depth of the '
          'cut into the asphalt  using a handheld  pendant attached to the '
          'machine. The pedant could stretch out from ten to 15  ft. This '
          'allowed Employee #1 to walk back and forth  checking the cut. The  '
          'operator was on the top of the milling machine  controlling the '
          'operation of  the machine and ensuring that the milling machine and '
          'dump truck (driven by a  second coworker  who worked for an

## 2. Your questions

You may include several sets of candidate questions in the notebook. At a minimum, present the results for the best set of questions. You may also discuss this in the analysis section.

In [10]:
# List of candidate questions

questions = {
    "WHEN": [
        "When did the incident occur?",
        "What is the exact date and time when the incident occurred?",
        "What time did the incident occur?",
        "When did the incident take place?",
    ],
    "WHERE": [  
        "Where did the event occur?",
        "What is the exact location of the incident?",
        "Where did the incident take place?",
        "What is the location of the incident?",
    ],
    "EVENT": [
        "What unfolded during the incident?",
        "What happened during the incident?",
        "What is the incident about?",
        "Summarize in a few sentences what happened during the incident.",
    ],
}

# Print the questions
for key, qs in questions.items():
    print(f"\n{key} questions:")
    for q in qs:
        print(q)


WHEN questions:
When did the incident occur?
What is the exact date and time when the incident occurred?
What time did the incident occur?
When did the incident take place?

WHERE questions:
Where did the event occur?
What is the exact location of the incident?
Where did the incident take place?
What is the location of the incident?

EVENT questions:
What unfolded during the incident?
What happened during the incident?
What is the incident about?
Summarize in a few sentences what happened during the incident.


## 3. The extractive question-answering model

In [4]:
from transformers import pipeline

# Load the model
qa_model = pipeline('question-answering', model='bert-large-uncased-whole-word-masking-finetuned-squad')


Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
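The pipeline returns, for each call, a dict with the keys `answer`, `score`, `start` and `end`, where `start` and `end` are character offsets into the context. The sketch below shows one way to request several candidate answers, which can help when inspecting errors. The `top_k` and `max_answer_len` arguments follow the HuggingFace pipeline documentation and should be checked against the installed version; `top_answers` and `fake_qa` are hypothetical names, and `fake_qa` only stands in for `qa_model` so the example runs without downloading the model.

```python
def top_answers(qa, question, context, top_k=3, max_answer_len=15):
    """Return a list of candidate answers, best first, whatever shape the pipeline returns."""
    result = qa(question=question, context=context,
                top_k=top_k, max_answer_len=max_answer_len)
    # A single answer comes back as a dict, several answers as a list of dicts
    candidates = [result] if isinstance(result, dict) else list(result)
    return sorted(candidates, key=lambda c: c["score"], reverse=True)


def fake_qa(question, context, top_k=1, max_answer_len=15):
    """Hypothetical stand-in for the real pipeline: always answers with a fixed date."""
    answer = "November 10  2013"
    start = context.find(answer)
    best = {"score": 0.9, "start": start, "end": start + len(answer), "answer": answer}
    return best if top_k == 1 else [best]


context = "At around 10:00 p.m. on November 10  2013  Employee #1 was struck."
for candidate in top_answers(fake_qa, "When did the incident occur?", context):
    span = context[candidate["start"]:candidate["end"]]
    # The offsets point back into the context, so the span equals the answer
    print(candidate["answer"], candidate["score"], span == candidate["answer"])
```

With the real model, the call would be `top_answers(qa_model, question, context)`. Since `max_answer_len` limits the length of the returned span in tokens, raising it may matter for the longer EVENT answers; this is a hypothesis to test, not a measured result.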


## 4. Utility functions for evaluation

In [5]:
import string
import re
from collections import Counter

def remove_articles(text):
    return re.sub(r'\b(a|an|the)\b', ' ', text)

def white_space_fix(text):
    return ' '.join(text.split())

def remove_punc(text):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in text if ch not in exclude)

def lower(text):
    return text.lower()

def normalize_answer(s):
    """Lowercase the text, then remove punctuation, articles, and extra whitespace."""
    return white_space_fix(remove_articles(remove_punc(lower(s))))
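To see what the normalization does in practice, the sketch below applies the same four steps to two strings. It is a condensed, self-contained restatement of the functions above (the name `normalize_sketch` is hypothetical), not a replacement for them. Note that punctuation is deleted rather than replaced by a space, so a hyphenated word collapses into a single token.

```python
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize_sketch(text):
    """Apply the same steps as normalize_answer: lowercase, strip punctuation, articles, spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


print(normalize_sketch("The  Railroad-Bridge overpass."))      # railroadbridge overpass
print(normalize_sketch("Employee #1  was struck and thrown"))  # employee 1 was struck and thrown
```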

In [6]:
def evaluate_f1(ground_truth, prediction):
    """Normalize both texts, count the tokens they share, and compute F1 from precision and recall."""
    prediction_tokens = normalize_answer(prediction).split()
    ground_truth_tokens = normalize_answer(ground_truth).split()
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if len(ground_truth_tokens) == 0 or len(prediction_tokens) == 0:
        return int(ground_truth_tokens == prediction_tokens)
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1

def evaluate_exact_match(ground_truth, prediction):
    """Check whether the two texts are identical after normalization."""
    return (normalize_answer(prediction) == normalize_answer(ground_truth))
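The token-level F1 relies on a multiset intersection of the two token lists. The worked example below isolates that step on tokens that are assumed to be already normalized; `token_f1` is a hypothetical helper that mirrors the arithmetic of `evaluate_f1` without the normalization.

```python
from collections import Counter


def token_f1(gold_tokens, pred_tokens):
    """F1 between two token lists, counting each shared token at most as often as it appears in both."""
    if not gold_tokens or not pred_tokens:
        return float(gold_tokens == pred_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)  # multiset intersection
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


gold = ["november", "10", "2013"]
pred = ["around", "1000", "pm", "on", "november", "10", "2013"]
# 3 shared tokens: precision = 3/7, recall = 3/3, so F1 = 0.6
print(round(token_f1(gold, pred), 3))

# A repeated predicted token is only credited once: precision = 2/3, recall = 2/2, F1 = 0.8
print(round(token_f1(["10", "2013"], ["10", "10", "2013"]), 3))
```

An over-long prediction that contains the whole gold answer keeps a recall of 1 but loses precision, which is why such answers score 0 on exact match and still earn partial F1.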

## 5. Model evaluation and analysis

In [12]:
import numpy as np

# Compute the mean F1 score over all examples for each question in a list
def evaluate_questions(questions, data, key):
    scores = []
    for question in questions:
        question_scores = []
        for example in data:
            context = example['text']
            result = qa_model(question=question, context=context)
            score = evaluate_f1(example[key], result['answer'])
            question_scores.append(score)
        scores.append(np.mean(question_scores))
    return scores

# Evaluate the WHEN questions
when_scores = evaluate_questions(questions['WHEN'], data, 'WHEN')
best_when_question = questions['WHEN'][np.argmax(when_scores)]
print(f"Best WHEN question: {best_when_question} with a mean score of {max(when_scores):.3f}")

# Evaluate the WHERE questions
where_scores = evaluate_questions(questions['WHERE'], data, 'WHERE')
best_where_question = questions['WHERE'][np.argmax(where_scores)]
print(f"Best WHERE question: {best_where_question} with a mean score of {max(where_scores):.3f}")

# Evaluate the EVENT questions
event_scores = evaluate_questions(questions['EVENT'], data, 'EVENT')
best_event_question = questions['EVENT'][np.argmax(event_scores)]
print(f"Best EVENT question: {best_event_question} with a mean score of {max(event_scores):.3f}")


Best WHEN question: When did the incident occur? with a mean score of 0.920
Best WHERE question: Where did the event occur? with a mean score of 0.783
Best EVENT question: What unfolded during the incident? with a mean score of 0.567
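The cell above reports mean F1 only, while the instructions also mention exact match. One possible way to report several metrics without calling the model twice is sketched below. The function `score_questions`, the stand-in `fake_qa` and the two sample records are hypothetical; the metric functions are passed in as a dict so the sketch stays self-contained.

```python
from statistics import mean


def score_questions(qa, questions, examples, key, metrics):
    """Return {question: {metric name: mean score}}, calling the model once per (question, example)."""
    report = {}
    for question in questions:
        answers = [qa(question=question, context=ex["text"])["answer"] for ex in examples]
        report[question] = {
            name: mean(float(fn(ex[key], ans)) for ex, ans in zip(examples, answers))
            for name, fn in metrics.items()
        }
    return report


def fake_qa(question, context):
    """Hypothetical stand-in for the real pipeline: answers with the first three words."""
    return {"answer": " ".join(context.split()[:3])}


# Hypothetical examples, for illustration only
examples = [
    {"text": "May 3 2014 in a warehouse", "WHEN": "May 3 2014"},
    {"text": "In June 2015 at a refinery", "WHEN": "June 2015"},
]
strict_match = lambda gold, pred: gold == pred

report = score_questions(fake_qa, ["When?"], examples, "WHEN", {"EM": strict_match})
print(report)  # {'When?': {'EM': 0.5}}
```

In the notebook, the equivalent call would be `score_questions(qa_model, questions['WHEN'], data, 'WHEN', {'EM': evaluate_exact_match, 'F1': evaluate_f1})`, and likewise for WHERE and EVENT.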


In [11]:
for i in range(5):
    example = data[i]
    context = example['text']
    
    print(f"Example {i+1}:")
    print("Context:", context)
    
    # Question 1: Where did the event occur?
    where_question = "Where did the event occur?"
    where_result = qa_model(question=where_question, context=context)
    print("Where:", where_result['answer'])
    print("F1 score:", evaluate_f1(example['WHERE'], where_result['answer']))
    
    # Question 2: When did the incident occur?
    when_question = "When did the incident occur?"
    when_result = qa_model(question=when_question, context=context)
    print("When:", when_result['answer'])
    print("F1 score:", evaluate_f1(example['WHEN'], when_result['answer']))
    
    # Question 3: What unfolded during the incident?
    event_question = "What unfolded during the incident?"
    what_result = qa_model(question=event_question, context=context)
    print("Event:", what_result['answer'])
    print("F1 score:", evaluate_f1(example['EVENT'], what_result['answer']))
    
    print()

Example 1:
Where: railroad bridge overpass
F1 score: 1.0
When: November 10  2013
F1 score: 1.0
Event: Employee #1  was struck and thrown
F1 score: 1.0

Example 2:
Context:  On August 27  2012  Employee #1  a 19 year-old male laborer with Stomper  Company Inc.  arrived at 2:00 .am. at a site in Menlo Park California to  demolish the interiors of the building. They scraped the interiors of the  building and collected debris as they finished up the job. On August 28  2012   at approximately 10:00 a.m  the job assignment was done and every employee was  to put away all the rubble and gather all equipment in order to pack up and  leave the site. When the job assignment was finished  it is typical for all  employees to gather everything and put it away into the garbage bin or in  their trailers and bins. At the time  four coworkers were outside in the  parking lot working near the Number 5 700 Panther. Two coworkers were going to  load the number 5 700 Panther and Employee #1 stated that he 