# Google Natural Questions

## Introduction

The goal of this challenge is to build a statistical learning model that answers the questions asked by using a text corpus (i.e. Wikipedia). The *challenge* page is available [here](https://ai.google.com/research/NaturalQuestions).

## Data overview

This section presents the data provided. The training data is in `json` format. Google has also created a tabular view of the data, available at this [link](https://ai.google.com/research/NaturalQuestions/databrowser).

To view one observation from the simplified dataset, click [here](https://raw.githubusercontent.com/joelun37/Question-Answering/master/Documents/simplified-nq-train-for-content.json).

You can also view one observation from the original dataset by clicking [here] ()

We will use the data simplified by Google:

- **simplified-nq-train.jsonl**: 17.45 GB
- **simplified-nq-test.jsonl**: 18.8 MB

Because the training set is fairly large, we will use Google Cloud Platform for our computations.

The numbers of observations in the datasets provided by Google are listed below:

- **Training set:** 307,373 observations
- **Validation (or development) set:** 7,830 observations
- **Test set:** 7,842 observations

In [4]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark import SparkConf
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql.types import *
import time

# Run Spark locally, using all available cores
local = "local[*]"
appName = "QA"
configLocale = (
    SparkConf()
    .setAppName(appName)
    .setMaster(local)
    .set("spark.executor.memory", "6G")
    .set("spark.driver.memory", "6G")
    .set("spark.sql.catalogImplementation", "in-memory")
)

spark = SparkSession.builder.config(conf=configLocale).getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("ERROR")
spark

To inspect the format of the `json` file, we created a short version of it, which we load here.

In [20]:
for_content = spark.read.json("/Volumes/750GB-HDD/root/Question-Answering/pyData/tensorflow2-question-answering/simplified-nq-train-for-content.json")
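
A short file like the one loaded above can be produced by copying only the first few lines of the full JSON Lines file, without ever loading the 17.45 GB file into memory. The following is a minimal sketch of that idea, not necessarily the script that was actually used; the function name `write_head` and its parameters are hypothetical.

```python
from itertools import islice


def write_head(src_path, dst_path, n_lines=1):
    """Copy the first n_lines lines of a JSON Lines file to a new file.

    The source is read lazily, line by line, so memory use stays small
    even for a multi-gigabyte input. Returns the number of lines written.
    """
    written = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        # islice stops reading as soon as n_lines lines have been consumed
        for line in islice(src, n_lines):
            dst.write(line)
            written += 1
    return written
```

For instance, `write_head("simplified-nq-train.jsonl", "simplified-nq-train-for-content.json", n_lines=1)` would keep a single observation, assuming both paths are relative to the working directory.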

In [18]:
for_content.printSchema()

root
 |-- annotations: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotation_id: long (nullable = true)
 |    |    |-- long_answer: struct (nullable = true)
 |    |    |    |-- candidate_index: long (nullable = true)
 |    |    |    |-- end_token: long (nullable = true)
 |    |    |    |-- start_token: long (nullable = true)
 |    |    |-- short_answers: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- end_token: long (nullable = true)
 |    |    |    |    |-- start_token: long (nullable = true)
 |    |    |-- yes_no_answer: string (nullable = true)
 |-- document_text: string (nullable = true)
 |-- document_url: string (nullable = true)
 |-- example_id: long (nullable = true)
 |-- long_answer_candidates: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- end_token: long (nullable = true)
 |    |    |-- start_token: long (nullable = true)
 |    |    |

## Variables

### Annotations

This variable contains the **long** and **short** answers, when answers exist. The rules for this variable are as follows:

- Each question has at most one long answer. There can, however, be several short answers.
- Short answers are necessarily contained in the long answer. If the answer is of the **Yes/No** type, `yes_no_answer` takes the value `YES` or `NO`. By default, the value of this variable is `NONE`.
- Only 1% of the answers are of the **Yes/No** type.
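
These rules can be turned into a small helper that labels an annotation by the kind of answer it carries. This is a minimal sketch, assuming an annotation is a dictionary shaped like the schema printed earlier, that `yes_no_answer` uses the upper-case values `YES`, `NO` and `NONE`, and that a missing long answer is encoded with `candidate_index == -1`. The function name `answer_type` is hypothetical.

```python
def answer_type(annotation):
    """Return 'yes_no', 'short', 'long' or 'none' for one annotation dict."""
    if annotation.get("yes_no_answer", "NONE") in ("YES", "NO"):
        return "yes_no"
    if annotation.get("short_answers"):
        return "short"
    long_answer = annotation.get("long_answer") or {}
    # Assumption: candidate_index == -1 marks "no long answer"
    if long_answer.get("candidate_index", -1) >= 0:
        return "long"
    return "none"


# Annotation of the example inspected in this notebook
example = {
    "yes_no_answer": "NONE",
    "long_answer": {"start_token": 1952, "candidate_index": 54, "end_token": 2019},
    "short_answers": [{"start_token": 1960, "end_token": 1969}],
    "annotation_id": 593165450220027640,
}
print(answer_type(example))  # short
```

In this example the short answer span (1960 to 1969) lies inside the long answer span (1952 to 2019), as the second rule requires.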

### Document text
This variable contains the body of the Wikipedia page in *HTML* format.
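
The `start_token` and `end_token` fields of the answers index into this text. Below is a minimal sketch of span extraction, assuming (as the modeling code in this notebook does) that tokens are obtained by splitting `document_text` on whitespace and that `end_token` is exclusive. The helper name `span_text` and the toy document are made up for illustration.

```python
def span_text(document_text, start_token, end_token):
    """Return the whitespace tokens in [start_token, end_token) joined by spaces."""
    tokens = document_text.split()
    return " ".join(tokens[start_token:end_token])


# Toy document: HTML tags count as tokens, just like words and punctuation
doc = "<H1> Email marketing </H1> <P> A newsletter is a common example . </P>"
print(span_text(doc, 4, 13))  # <P> A newsletter is a common example . </P>
print(span_text(doc, 5, 12))  # A newsletter is a common example .
```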

### Document URL
This variable contains the URL of the Wikipedia page.

### Example ID
This is the identifier of the example.

### Long answer candidates
This variable contains the candidate answers.

- Sometimes a long answer is nested inside another one.
- To distinguish these two kinds of answers, the notion of **level** is used. An answer is contained in another one if its *top level* indicator is `False`.
- 95% of the long answers have *top level* set to `True`. As a first step, we could therefore focus on these answers only.
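
A minimal sketch of the filtering step suggested above, assuming each candidate is a dictionary with `start_token`, `end_token` and a boolean `top_level` field (the field read by the modeling code in this notebook). The helper names and the toy candidates are hypothetical.

```python
def top_level_candidates(candidates):
    """Return (index, candidate) pairs whose top_level flag is True."""
    return [(i, c) for i, c in enumerate(candidates) if c.get("top_level")]


def is_nested_in(inner, outer):
    """True if the token span of inner lies within the token span of outer."""
    return (
        inner is not outer
        and outer["start_token"] <= inner["start_token"]
        and inner["end_token"] <= outer["end_token"]
    )


# Toy candidates: the second one is nested inside the first
candidates = [
    {"start_token": 10, "end_token": 50, "top_level": True},
    {"start_token": 12, "end_token": 30, "top_level": False},
    {"start_token": 60, "end_token": 90, "top_level": True},
]
kept = top_level_candidates(candidates)
print([i for i, _ in kept])  # [0, 2]
print(is_nested_in(candidates[1], candidates[0]))  # True
```

The original indices are kept alongside the candidates because `candidate_index` in the annotations refers to a position in the full, unfiltered list.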

### Question text
This is the question asked.

In [29]:
for_content.annotations

Column<b'annotations'>

In [22]:
for_content_2 = spark.read.json("/Volumes/750GB-HDD/root/Question-Answering/pyData/tensorflow2-question-answering/v1.0-simplified_nq-dev-all-for-content.json")

In [24]:
for_content_2.printSchema()

root
 |-- annotations: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotation_id: decimal(20,0) (nullable = true)
 |    |    |-- long_answer: struct (nullable = true)
 |    |    |    |-- candidate_index: long (nullable = true)
 |    |    |    |-- end_byte: long (nullable = true)
 |    |    |    |-- end_token: long (nullable = true)
 |    |    |    |-- start_byte: long (nullable = true)
 |    |    |    |-- start_token: long (nullable = true)
 |    |    |-- short_answers: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- end_byte: long (nullable = true)
 |    |    |    |    |-- end_token: long (nullable = true)
 |    |    |    |    |-- start_byte: long (nullable = true)
 |    |    |    |    |-- start_token: long (nullable = true)
 |    |    |-- yes_no_answer: string (nullable = true)
 |-- document_html: string (nullable = true)
 |-- document_title: string (nullable = true)
 |-- document_to

### First modeling attempt

In [31]:
# Libraries
import re
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
import json

# Mock test
# We are going to get 2 long answer candidates with the real long answer
# and feed it to the BERT model

example_txt = "/Volumes/750GB-HDD/root/Question-Answering/pyData/tensorflow2-question-answering/simplified-nq-train-for-content.json"
dev_example = "/Volumes/750GB-HDD/root/Question-Answering/pyData/tensorflow2-question-answering/v1.0-simplified_nq-dev-all-for-content.json"

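def raw_NQ_data_dict(input_text_file):
    """Read a JSON Lines file and return its last example as a dict."""
    simplified_ex = None
    with open(input_text_file, 'r') as f:
        for line in f:
            # Each line holds one JSON-encoded example
            simplified_ex = json.loads(line)

    return simplified_ex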

test_dict = raw_NQ_data_dict(input_text_file=example_txt)

long_answer_candidates = test_dict["long_answer_candidates"]
document_text = test_dict["document_text"]
question_text = test_dict["question_text"]

# [{'yes_no_answer': 'NONE',
#   'long_answer': {'start_token': 1952,
#    'candidate_index': 54,
#    'end_token': 2019},
#   'short_answers': [{'start_token': 1960, 'end_token': 1969}],
#   'annotation_id': 593165450220027640}]
annotations = test_dict["annotations"]

# Split the page once; the *_token offsets index into this whitespace token list
tokens = document_text.split()

def tokens_to_text(span):
    """Concatenate the tokens of a span, each followed by a space."""
    return "".join(tok + " " for tok in tokens[span["start_token"]:span["end_token"]])

# Annotated long answer, rebuilt from the candidate it points to
candidate_index = annotations[0]["long_answer"]["candidate_index"]
long_answer = tokens_to_text(long_answer_candidates[candidate_index])

# First annotated short answer
short_answer = tokens_to_text(annotations[0]["short_answers"][0])

# First two top-level candidates, keyed by their index in the candidate list
candidate_dict = {}
for i, candidate in enumerate(long_answer_candidates):
    if len(candidate_dict) == 2:
        break
    if candidate["top_level"]:
        candidate_dict[i] = tokens_to_text(candidate)

def remove_html_tags(text):
    """Remove html tags from a string"""
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = ""
for key in candidate_dict.keys():
    text += remove_html_tags(candidate_dict[key])

inputs = tokenizer.encode_plus(question_text, text, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]
text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
# Index the output instead of unpacking it: this works both when the model
# returns a plain tuple and when it returns a ModelOutput object
outputs = model(**inputs)
answer_start_scores, answer_end_scores = outputs[0], outputs[1]
# Most likely start of the answer: argmax of the start scores
answer_start = torch.argmax(answer_start_scores)
# Most likely end of the answer: argmax of the end scores (+1 for an exclusive slice bound)
answer_end = torch.argmax(answer_end_scores) + 1
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

print(f"Question: {question_text}\n")
print(f"Answer found by the model: {answer}\n")
print(f"Long Answer: {remove_html_tags(long_answer)}\n")
print(f"Short Answer: {short_answer}")

Question: which is the most common use of opt-in e-mail marketing
Answer found by the model: referral marketing

Long Answer: <P> A common example of permission marketing is a newsletter sent to an advertising firm 's customers . Such newsletters inform customers of upcoming events or promotions , or new products . In this type of advertising , a company that wants to send a newsletter to their customers may ask them at the point of purchase if they would like to receive the newsletter . </P> 
Short Answer: a newsletter sent to an advertising firm 's customers 
