---
# 💾 Drive
---

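This notebook:
* Reads a number of documents from RACE (and from the Entrance Exams reading set).
* Extracts candidate answers from the text of each article according to some heuristics.
* Generates a question for each candidate answer and samples distractors from the remaining candidates.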

The text-to-text generation model is a version of [Google's T5](https://arxiv.org/pdf/1910.10683.pdf) fine-tuned for question generation.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

---
# 📚 Libraries
---

In [None]:
!pip install transformers
!pip install datasets
!pip install sentencepiece

In [None]:
import json
import random

import pandas as pd
import spacy

from datasets import load_dataset, Dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# T5 checkpoint fine-tuned for question generation
MODEL_NAME = "mrm8488/t5-base-finetuned-question-generation-ap"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# AutoModelWithLMHead is deprecated; AutoModelForSeq2SeqLM is the seq2seq class for T5
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
SAVE_PATH_RACE = "/content/drive/MyDrive/TFM/RACE_DATASET/race_extensions/first_poc/high/"

In [None]:
!python -m spacy download es_core_news_sm
!python -m spacy download en_core_web_sm

---
# 🔮 Models
---

https://huggingface.co/mrm8488/t5-base-finetuned-question-generation-ap

### Common

In [None]:
def get_question(answer, context, max_length=64):
    # Prompt format used with this checkpoint: "answer: ... context: ..."
    input_text = "answer: %s  context: %s </s>" % (answer, context)
    features = tokenizer([input_text], return_tensors='pt')

    output = model.generate(input_ids=features['input_ids'],
                            attention_mask=features['attention_mask'],
                            max_length=max_length)

    # Strip the special tokens and the "question: " prefix from the decoded text
    return tokenizer.decode(output[0]).replace("<pad> question: ", "").replace("</s>", "")

def generate_candidates(doc):
    # Heuristic for selection: every noun in the document is a candidate answer
    candidates = []
    for token in doc:
        if token.pos_ == "NOUN":
            candidates.append(token.text)
    return candidates

def generate_candidates_sent(doc):
    # Heuristic for selection: every sentence in the document is a candidate answer
    candidates = []
    for sent in doc.sents:
        candidates.append(str(sent))
    return candidates

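def generate_distractors(answer, candidates):
    # Distractors are three other candidates drawn at random.
    # random.sample raises ValueError when fewer than three remain.
    candidates = [c for c in candidates if c != answer]
    options = [answer, *random.sample(candidates, 3)]
    random.shuffle(options)
    return options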
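
The distractor heuristic above has two edge cases worth guarding: `random.sample` raises `ValueError` when fewer than three alternatives remain, and repeated nouns can yield duplicate options. A minimal sketch of a guarded variant follows. The name `safe_distractors` and the choice to return `None` when there are too few candidates are assumptions for illustration, not part of the notebook.

```python
import random

def safe_distractors(answer, candidates, n_distractors=3, rng=random):
    """Hypothetical guarded variant of generate_distractors.

    Returns a shuffled list holding the answer plus n_distractors distractors,
    or None when there are not enough distinct alternatives to sample from.
    """
    # dict.fromkeys removes duplicates while preserving first-seen order
    pool = [c for c in dict.fromkeys(candidates) if c != answer]
    if len(pool) < n_distractors:
        return None
    options = [answer, *rng.sample(pool, n_distractors)]
    rng.shuffle(options)
    return options

options = safe_distractors("clock", ["clock", "egg", "egg", "way", "wall", "boy"])
print(options)  # four distinct options, one of which is "clock"
print(safe_distractors("clock", ["clock", "egg", "egg"]))  # None: too few alternatives
```

A caller could skip any candidate for which the helper returns `None` instead of relying on a broad `try`/`except`.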


### RACE

In [None]:
def process_doc_race(row, doc, questions_per_doc=2):
  # Heuristic for answer generation
  candidates = generate_candidates(doc)
  random.shuffle(candidates)

  new_id = row["example_id"].replace(".txt", "").replace("high", "")
  new_id = "SYNT_" + new_id + ".txt"

  new_docs = []

  try:
    for answer in candidates[:questions_per_doc]:
      q = get_question(answer, doc)
      # Heuristic for distractor generation
      options = generate_distractors(answer, candidates)
      new_docs.append({
        'id': new_id,
        'article': row["article"],
        'answer': ["A", "B", "C", "D"][options.index(answer)],
        'question': q,
        'options': options
      })
  except Exception as exc:
    # Keep the questions generated so far, but report the failure instead of hiding it
    print(f"Stopped generating questions for {new_id}: {exc}")

  return new_docs


def generate_new_docs_race(example_ds, n_docs=5000, questions_per_doc=2, save=None):
  new_docs = []
  # Never ask for more documents than the dataset holds
  n_docs = min(n_docs, len(example_ds))
  # Parse only the articles that will be used
  docs = list(nlp.pipe(example_ds["article"][:n_docs]))
  for i in range(0, n_docs):
    print(f"Generating {questions_per_doc} questions for document {i + 1} of {n_docs}")
    row = example_ds[i]
    new_docs.append(process_doc_race(row, docs[i], questions_per_doc=questions_per_doc))

  # Flatten the per-document lists into a single list of records
  new_docs = [item for sublist in new_docs for item in sublist]
  print(f"Generated {len(new_docs)}")

  # save=None skips writing; otherwise save is the output JSON path
  if save is not None:
    try:
      with open(save, 'w') as f:
        f.write(json.dumps(new_docs))
    except Exception as exc:
      print(exc)

  return new_docs


In [None]:
example_ds.__getitem__((8+1) * 3)

### EE

In [None]:
exams_en_path = '/content/drive/MyDrive/TFM/EntranceExam/qa2015-exam-readingENGLISH.csv'
exams_es_path = '/content/drive/MyDrive/TFM/EntranceExam/qa2015-exam-readingSPANISH.csv'
nlp = spacy.load('en_core_web_sm')
nlp_es = spacy.load('es_core_news_sm')


In [None]:
def generate_new_doc_ee_en(text_en, id, questions_per_doc=3):
  doc = nlp(text_en)
  new_docs = []
  candidates = generate_candidates_sent(doc)
  for q_id, answer in enumerate(candidates[:questions_per_doc]):
    # Heuristic for distractor generation
    options = generate_distractors(answer, candidates)
    new_docs.append({
      'id': f"{id}_{q_id}",
      'article': str(doc),
      'answer': ["A", "B", "C", "D"][options.index(answer)], 
      'question': get_question(answer, doc), 
      'options': options
    })
  print(f"Generated {len(new_docs)} for id {id}")
  return new_docs

def generate_new_docs_ee_en(df, n_docs=None, questions_per_doc=3, save=None):
  new_docs = []
  if n_docs is None:
    n_docs = df.shape[0]

  print(f"Have to generate {n_docs*questions_per_doc}")
  for i in range(0, n_docs):
    row = df.iloc[i,:]
    new_docs.append(
        generate_new_doc_ee_en(row["doc/__text"], 
          i, questions_per_doc=questions_per_doc))
  new_docs = [item for sublist in new_docs for item in sublist]

  print(f"Generated {len(new_docs)}")
  if save is not None:
    try:
      with open(save, 'w') as f:
        f.write(json.dumps(new_docs))
    except Exception as exc:
      print(exc)
  return new_docs

---
# 💀 Execution
---

Extract `questions_per_doc` * `n_docs` tuples of the form (Q, A1, A2, A3, A4) and write them as JSON to the path passed as `save` (`SAVE_PATH_RACE` or `SAVE_PATH_EE`).

Answers and distractors are chosen by the heuristics defined above (`generate_candidates`, `generate_candidates_sent` and `generate_distractors`).

The resulting files can then be loaded by extending the data loaders of the base model class developed previously, or by merging them into one folder with the original data.
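
As a hedged illustration of the file layout, the sketch below writes one record in the shape produced by the generation functions above, reads it back, and resolves the answer letter to its option text. The helper name `answer_text`, the record `id` and the temporary file are assumptions for illustration; the question and options are taken from the Mike example in the Exploration section.

```python
import json
import os
import tempfile

LETTERS = ["A", "B", "C", "D"]

def answer_text(record):
    """Map the answer letter of a generated record to the option it denotes."""
    return record["options"][LETTERS.index(record["answer"])]

# One record in the shape written by the generation functions (id is a placeholder)
records = [{
    "id": "SYNT_example.txt",
    "article": "Please look at the clock on the wall.",
    "answer": "D",
    "question": "What is on the wall at the school?",
    "options": ["Mike", "way", "egg", "clock"],
}]

# Round-trip through a temporary JSON file, as the `save` argument would do
path = os.path.join(tempfile.mkdtemp(), "generated.json")
with open(path, "w") as f:
    f.write(json.dumps(records))

with open(path) as f:
    loaded = json.load(f)

for rec in loaded:
    print(rec["question"], rec["options"], "->", answer_text(rec))
```

Storing the answer as a letter keeps the record close to the RACE layout, so a loader only needs the letter-to-index lookup shown here to recover the answer text.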


## EE

In [None]:
exams_en_path

About 190 texts are generated in 15 minutes, which is still very slow.
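
One likely contributor to the slow throughput is that `get_question` calls `model.generate` once per prompt. A minimal sketch of batching several prompts per call follows. `chunked` and `get_questions_batched` are hypothetical helpers, the batch size of 8 and the truncation to 512 tokens are assumptions, and any speed-up depends on the hardware and has not been measured here.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def get_questions_batched(pairs, tokenizer, model, batch_size=8, max_length=64):
    """Generate one question per (answer, context) pair, several pairs per call.

    `tokenizer` and `model` are expected to behave like the Hugging Face
    objects loaded earlier in the notebook.
    """
    questions = []
    for batch in chunked(pairs, batch_size):
        # Same prompt format as get_question
        prompts = ["answer: %s  context: %s </s>" % (a, c) for a, c in batch]
        # Pad to the longest prompt in the batch; truncate very long articles (assumption)
        features = tokenizer(prompts, return_tensors="pt", padding=True,
                             truncation=True, max_length=512)
        output = model.generate(input_ids=features["input_ids"],
                                attention_mask=features["attention_mask"],
                                max_length=max_length)
        decoded = tokenizer.batch_decode(output, skip_special_tokens=True)
        # Drop the "question: " prefix that the checkpoint emits
        questions.extend(q.replace("question: ", "", 1).strip() for q in decoded)
    return questions

print(list(chunked(["a", "b", "c", "d", "e"], 2)))
```

With the notebook's objects this would be called as `get_questions_batched(pairs, tokenizer, model)`, where `pairs` holds `(answer, str(doc))` tuples.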

In [None]:
SAVE_PATH_RACE = '/content/drive/MyDrive/TFM/RACE_DATASET/race_extensions/experiment3-sent.json'
# example_ds must already be loaded (see the RACE cells below)
new_docs_race = generate_new_docs_race(example_ds, save=SAVE_PATH_RACE)

In [None]:
SAVE_PATH_EE = '/content/drive/MyDrive/TFM/EntranceExam/ee_cache_en/experiment3-sent.json'
new_docs = generate_new_docs_ee_en(pd.read_csv(exams_en_path), questions_per_doc=100, save=SAVE_PATH_EE)

In [None]:
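# generate_candidates_long is not defined in this notebook; the sentence-level heuristic defined above is used instead
generate_candidates_sent(nlp(pd.read_csv(exams_en_path)["doc/__text"][0]))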

In [None]:
new_docs

In [None]:
datadir = '/content/drive/MyDrive/TFM/EntranceExam/rc-test-english-2013.json'
pd.read_json(datadir)['data'].tolist()[0]

## RACE

In [None]:
SAVE_PATH_RACE = "/content/drive/MyDrive/TFM/RACE_DATASET/race_extensions/train_5k_words.json"
example_ds = Dataset.from_pandas(load_dataset('race', 'middle', split='train').to_pandas().head(5000)).sort('example_id')

In [None]:
new_docs = generate_new_docs_race(example_ds, n_docs=1000, questions_per_doc=5, save=SAVE_PATH_RACE)

In [None]:
%%timeit
docs = list(nlp.pipe(example_ds["article"][:100]))[1]

In [None]:
import pandas as pd
pd.read_json(open('/content/drive/MyDrive/TFM/EntranceExam/ee_cache_en/train_5k_sents.json'))["answer"][3]

In [None]:
# Example options: ['place', 'family', 'trouble', 'newspaper']
# Example generated question: What kind of shopping did I do in town?

---
# 🗺️ Exploration
---

In [None]:
df5krace = pd.read_json("/content/drive/MyDrive/TFM/RACE_DATASET/race_extensions/train_10k_sent.json")

In [None]:
txt = df5krace.iloc[4995, :]['article'].replace('--', '').replace('\n', ' ').replace("\'", '').replace('-', '')

In [None]:
df5krace['article'][0].replace('--', '').replace('\n', ' ').replace("\'", '').replace('-', '').replace('1', '').replace('2', '')

## 🏎️ RACE

In [None]:
example_ds = Dataset.from_pandas(load_dataset('race', 'middle', split='train').to_pandas().head(5000)).sort('example_id')

In [None]:
df = example_ds.to_pandas()
txt = df.iloc[4004, :]['article'].replace('--', '').replace('\n', ' ').replace("\'", "'").replace('-', '')

df.iloc[4004, :]['answer']

In [None]:
txt = df.iloc[4004, :]['article'].replace('--', '').replace('\n', ' ').replace("\'", "'").replace('-', '')
txt

### RACE Example #1

---

Life is like the four seasons. Now I am very old, but when I was young, it was the spring of my life. After I was born, I played a lot, and then I started school. I learned many new things. Like a flower, I grew bigger every day. There were happy days and _ days: some days the sun shone, and some days it didn't. In my twenties, I had a good job. I was strong and happy. Then I married and had a child. In those days, I didn't have much time to think. Every day I was busy and worked very hard. And so, I started to get some white hairs. The summer of my life passed quickly. Then the days got shorter. Leaves fell from the trees. My child was a university student, and then an engineer. My home was much quieter. I started walking more slowly. One day I stopped working. I had more time. I understood this was my autumn, a beautiful time when the trees change color and give us delicious fruits. But the days kept getting shorter and colder. Winter has come. I am older and weaker. I know I do not have many days left, but I will enjoy them to the end.

---

According to the passage, which of the following ages is during the summer of his life?
* 15
* 33
* 62
* 87

Answer: B

### RACE Example #2

---

There's always something deep in our soul that never dies. I moved to the small, busy town of Edison in New Jersey six years ago. It was during the second term of my fifth grade. My parents got new jobs and higher income, so they decided it was time to move from Woodbridge to a better, more educational town. In the US, it is unnecessary to take a test to get into a "good" middle or high school. You just attend the school close to where you live. So, many parents will think about the quality of the local school when they decide to buy a new house. My parents did the same. We finally chose Edison mainly because of the high quality of its school. In New Jersey, an area with a good school usually means Asian people. There are about 300 students in our school. 55% are Asians and just under half of that are Chinese. There are so many Chinese people nearby that we even have our own Chinese school. Edison is an old town, just like thousands of others in the United States. However, I have treated it as my hometown. That's where I spend much of my youth, and the memories there can't be moved at all

---

#### QA Set 1
Why did the writer's parents move to Edison?

* Because they were born there
* Because the writer began his fifth grade
* Because it was a better educational town
* Because the writer didn't need to take a test

Answer: C

### RACE QA Generation Examples

---
Mike gets up at half past seven. He has an egg and some milk for breakfast. Then he goes to school. When he is on his way to school, he is thinking, " I tell my teacher that my mother is ill on  Monday. I tell him my bike doesnt work on my way to school on Tuesday. What should I say  today? Mike thinks it over, but he doesnt have a good idea. "May I come in?" says Mike at the  door. "Oh, my boy," says Mr. Brown. "Please look at the clock on the wall. What time is it now?" "Its eight ten," says Mike. Mr. Brown is not happy and says, "You are late for class three times this week. If all the students are like you, the clock is no use, I think." " You are wrong, Mr. Brown," says Mike. "If I dont have the clock how do I know I am late for school?"

---
#### QA Set 1
* Candidate extracted: clock
* Question Generated: What is on the wall at the school?
* Options: Mike, way, egg, clock
* Answer: D

#### QA Set 2
* Candidate extracted: school
* Question Generated: Where does Mike go after breakfast?
* Options: school, boy, breakfast, wall
* Answer: A

#### QA Set 3
* Candidate extracted: idea
* Question Generated: What doesn't Mike have?
* Options: school, students, idea, boy
* Answer: C
---

In [None]:
df5krace['options'][1]

### RACE QA Generation Examples II

---
Pit-a-pat. Pit-a-pat. It's raining. "I want to go outside and play, Mum," Robbie says, "When can the rain stop?" His mum doesnt know what to say. She hopes the rain can stop, too. "You can watch TV with me," she says. "No, I just want to go outside." "Put on your raincoat." "Does it stop raining?" "No, but you can go outside and play in the rain. Do you like that?" "Yes, mum." He runs to his bedroom and puts on his red raincoat. "Here you go. Go outside and play." Mum opens the door and says. Robbie runs into the rain. Water goes here and there. Robbies mum watches her son. He is having so much fun. "Mum, come and play with me!" Robbie calls. The door opens and his mum walks out. She is in her yellow raincoat. Mother and son are out in the rain for a long time. They play all kinds of games in the rain.

---
#### QA Set 1
* Candidate extracted: He runs to his bedroom and puts on his red raincoat
* Question Generated: What does Robbie do before going outside?

* Options: 
  * Mum opens the door.
  * He runs to his bedroom and puts on his red raincoat.
  * You can go outside and play in the rain.
  * He is having so much fun.

* Answer: B
---
#### QA Set 2
* Candidate extracted: Here you go. Go outside and play.
* Question Generated: What does Robbie's mum say?
* Options: 
  * Pit-a-pat. Pit-a-pat.
  * Robbie runs into the rain.
  * Here you go. Go outside and play.
  * Mum, come and play with me!

* Answer: C
---

## 💯 EE

In [None]:
SAVE_PATH_EE = '/content/drive/MyDrive/TFM/EntranceExam/ee_cache_en/experiment2-sent.json'
eee = pd.read_json(SAVE_PATH_EE)

In [None]:
eee['options'][5]

In [None]:
eeee = pd.read_csv('/content/drive/MyDrive/TFM/EntranceExam/qa2015-exam-readingENGLISH.csv')
eeee.iloc[0,:]

In [None]:
dfff = eeee[[col for col in eeee.columns if col.startswith('question/0')]]

In [None]:
dfff['question/0/answer/0/_a_id'][1]

### EntranceExams Example

---

About fifteen hundred years ago the Japanese imported many aspects of Chinese culture: the writing system, political institutions, and perhaps most important, Buddhism. Buddhist priests were expected to eat only vegetables, and tofu, made from the soybean, was a very important food in their diet. When Buddhism was introduced from China, tofu was also brought to Japan. Tofu developed in different ways in China and Japan. While the Chinese often changed the taste of tofu by mixing it with strongly-flavored vegetables or meat, the Japanese preferred to eat it using only a simple sauce. Even now, traditional Japanese cooking preserves the original delicacy of tofu, though the way it is served may change from season to season. In summer, for example, it is simply served cold, while in winter it is often eaten as part of a hot dish. The soybean was introduced to the West in the eighteenth century, but little interest was taken in it; only scientists recognized its high food value. During the Second World War, when meat was in short supply, the U.S. government encouraged the American people to eat soybean products. However, they never became very popular and, after the war, interest in them dropped off as the supply of meat became plentiful again. In recent years, people in the West have become increasingly aware of the dangers of eating too much animal fat, and as a result, they have turned more and more to soybean products. This is mainly because the soybean provides almost the same food value as meat, and in addition is a lot more healthful. Much of the margarine, salad oil, and cooking oil in daily use is now produced from soybean oil. Tofu, a representative soybean product and originally one of the main foods in the diet of Chinese priests, is considered to be one of the healthiest foods available to man.

---

Tofu came to Japan together with Buddhism, because

* Buddhist priests ate tofu rather than vegetables.
* it was a very important food in the diet of Buddhist priests.
* the religion came to Japan together with political institutions.
* the religion was the most important aspect of Chinese culture.

Answer: B

### EntranceExams Question Generation Example

---

My husband hasnt stopped laughing about a funny thing that happened to me. Its funny now but it wasnt at the time. Last Friday, after doing all the family shopping in town, I wanted a rest before catching the train, so I bought a newspaper and some chocolate and went into the station coffee shop  that cheap, selfservice place with long tables to sit at. I put my heavy bag down on the floor, put the newspaper and chocolate on the table to keep a place, and went to get a cup of coffee. When I came back with the coffee, there was someone in the next seat. It was one of those wildlooking youngsters, with dark glasses and torn clothes, and hair colored bright red at the front. Not so unusual these days. What did surprise me was that hed started to eat my chocolate! Naturally, I was annoyed. However, to avoid trouble  and really I was rather uneasy about him  I just looked down at the front page of the newspaper, tasted my coffee, and took a bit of chocolate. The boy looked at me closely. Then he took a second piece of my chocolate. I could hardly believe it. Still I didnt dare to start an argument. When he took a third piece, I felt more angry than uneasy. I thought, "Well, I shall have the last piece," and I got it. The boy gave me a strange look, then stood up. As he left he shouted out, "This womans crazy!" Everyone stared. That was embarrassing enough, but it was worse when I finished my coffee and got ready to leave. My face went red  as red as his hair  when I realized Id made a mistake. It wasnt my chocolate that hed been taking. There was mine, unopened, just under my newspaper.

---

#### QA Example 1 - Strategy I
* Candidate extracted: husband
* Question Generated: Who laughed at the funny thing that happened to me?
* Options: coffee, chocolate, glasses, husband
* Answer: D

#### QA Example 2 - Strategy I
* Candidate extracted: thing
* Question Generated: What did my husband laugh about?
* Options: time, seat, thing, chocolate

* Answer: C

#### QA Example 3 - Strategy II
* Candidate extracted: As he left he shouted out, "This woman's crazy!"
* Question Generated: What happened to the person who was in the next seat?
* Options:
  * Not so unusual these days.
  * When I came back with the coffee, there was someone in the next seat.
  * I thought, "Well, I shall have the last piece," and I got it.
  * As he left he shouted out, "This woman's crazy!"
* Answer: D
