# 3. True/False Question

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Pseudo-Lab/Tutorial-Book/blob/master/book/chapters/NLP/Ch3-True-False-Question.ipynb)

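In the previous chapter, we downloaded the dataset used in this tutorial, visualized it, and preprocessed it. In this chapter, we use that dataset to build a model that generates True/False questions.

Section 3.1 briefly introduces GPT2 and BERT, the models used here, and Section 3.2 loads the data and takes another look at it. Section 3.3 then summarizes the text data using ㅇㅇㅇ. Finally, Section 3.4 filters the sentences and uses GPT2 and BERT to generate True/False questions.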


## 3.1 GPT2 & BERT

### GPT2
GPT2 is ... (description in progress)...
  
  

### BERT
BERT is... (description in progress)...

## 3.2 Package Setup and Data Loading

In [None]:
# !pip install tensorflow==1.14.0   
# !pip install torch==1.4.0  
# !pip install sentence-transformers==0.2.5.1  
# !pip install transformers==2.6.0  
# !pip install benepar==0.1.2  
# !pip install summa  
# !pip install nltk==3.4.5  
# !pip install spacy==2.1.0  
# !python3 -m spacy download en  
# !pip install scipy  
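
The install cell above pins specific versions, so it can help to check what is already installed before running the rest of the notebook. The following is a minimal sketch that uses the standard-library `importlib.metadata` (Python 3.8+). The helper name `installed_versions` and the `PINS` dictionary are illustrative and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install cell above; packages without a pin map to None.
PINS = {
    "tensorflow": "1.14.0",
    "torch": "1.4.0",
    "sentence-transformers": "0.2.5.1",
    "transformers": "2.6.0",
    "benepar": "0.1.2",
    "summa": None,
    "nltk": "3.4.5",
    "spacy": "2.1.0",
    "scipy": None,
}


def installed_versions(names):
    """Return {package name: installed version, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, have in installed_versions(PINS).items():
        want = PINS[name]
        if have is None:
            status = "missing"
        elif want is None or have == want:
            status = "ok"
        else:
            status = "differs"
        print(f"{name}: pinned={want} installed={have} [{status}]")
```

A "differs" status does not necessarily mean the notebook will fail; it only flags where the environment departs from the pinned versions.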

In [None]:
import requests
import json
from summa.summarizer import summarize
import benepar
import string
import nltk
from nltk import tokenize
from nltk.tokenize import sent_tokenize
import re
from random import shuffle
import spacy
import warnings
warnings.filterwarnings(action='ignore')
nlp = spacy.load('en')

#this package is required for the summa summarizer
nltk.download('punkt')
benepar.download('benepar_en2')
benepar_parser = benepar.Parser("benepar_en2")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package benepar_en2 to /root/nltk_data...
[nltk_data]   Package benepar_en2 is already up-to-date!


## 3.3 Summarization

#### Summarize the loaded content

In [None]:
text = 'There is a lot of volcanic activity at divergent plate boundaries in the oceans. For example, many undersea volcanoes are found along the Mid-Atlantic Ridge. This is a divergent plate boundary that runs north-south through the middle of the Atlantic Ocean. As tectonic plates pull away from each other at a divergent plate boundary, they create deep fissures, or cracks, in the crust. Molten rock, called magma, erupts through these cracks onto Earth’s surface. At the surface, the molten rock is called lava. It cools and hardens, forming rock. Divergent plate boundaries also occur in the continental crust. Volcanoes form at these boundaries, but less often than in ocean crust. That’s because continental crust is thicker than oceanic crust. This makes it more difficult for molten rock to push up through the crust. Many volcanoes form along convergent plate boundaries where one tectonic plate is pulled down beneath another at a subduction zone. The leading edge of the plate melts as it is pulled into the mantle, forming magma that erupts as volcanoes. When a line of volcanoes forms along a subduction zone, they make up a volcanic arc. The edges of the Pacific plate are long subduction zones lined with volcanoes. This is why the Pacific rim is called the “Pacific Ring of Fire.”'
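
The next cell passes this passage to summa's `summarize` with `ratio=0.3`, which keeps roughly the top 30% of sentences. summa implements TextRank; the sketch below deliberately swaps in a much simpler word-frequency score, only to illustrate what "extractive summarization with a ratio" means. The function name `toy_extractive_summary` and the sample text are made up for this example.

```python
import re
from collections import Counter


def toy_extractive_summary(text, ratio=0.3):
    """Keep the highest-scoring sentences, returned in their original order.

    Scoring is the average corpus frequency of a sentence's words.
    This is NOT TextRank; it is a stand-in to show the select-by-ratio idea.
    """
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    n_keep = max(1, round(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    # Restore document order for the selected sentences
    return [sentences[i] for i in sorted(ranked[:n_keep])]


if __name__ == "__main__":
    sample = ("Plates move. Plates collide and plates sink. "
              "Cats sleep. Volcanoes erupt near plates.")
    for sentence in toy_extractive_summary(sample, ratio=0.5):
        print(sentence)
```

Like `summarize`, the sketch only selects existing sentences and never rewrites them, which is why the candidates printed later in this section are verbatim sentences from the passage.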

In [None]:
from string import punctuation

def preprocess(sentences):
    output = []
    for sent in sentences:
        single_quotes_present = len(re.findall(r"['][\w\s.:;,!?\\-]+[']",sent))>0
        double_quotes_present = len(re.findall(r'["][\w\s.:;,!?\\-]+["]',sent))>0
        question_present = "?" in sent
        if single_quotes_present or double_quotes_present or question_present :
            continue
        else:
            output.append(sent.strip(punctuation))
    return output
        
        
def get_candidate_sents(resolved_text, ratio=0.3):
    candidate_sents = summarize(resolved_text, ratio=ratio)
    candidate_sents_list = tokenize.sent_tokenize(candidate_sents)
    candidate_sents_list = [re.split(r'[:;]+',x)[0] for x in candidate_sents_list ]
    # Remove very short sentences less than 30 characters and long sentences greater than 150 characters
    filtered_list_short_sentences = [sent for sent in candidate_sents_list if len(sent)>30 and len(sent)<150]
    return filtered_list_short_sentences

cand_sents = get_candidate_sents(text)
filter_quotes_and_questions = preprocess(cand_sents)
for each_sentence in filter_quotes_and_questions:
    print (each_sentence)
    print ("\n")

As tectonic plates pull away from each other at a divergent plate boundary, they create deep fissures, or cracks, in the crust


Divergent plate boundaries also occur in the continental crust


Volcanoes form at these boundaries, but less often than in ocean crust


Many volcanoes form along convergent plate boundaries where one tectonic plate is pulled down beneath another at a subduction zone
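
To see the two filters in isolation, the following stand-alone sketch reuses the same regular expressions and the 30 to 150 character window from the cell above on a few hand-written sentences. The helper name `keep_sentence` and the sample sentences are invented for illustration.

```python
import re

# Same patterns as in preprocess(): a straight-quoted span of word characters,
# whitespace, and basic punctuation.
SINGLE_QUOTED = re.compile(r"['][\w\s.:;,!?\\-]+[']")
DOUBLE_QUOTED = re.compile(r'["][\w\s.:;,!?\\-]+["]')


def keep_sentence(sent, min_len=30, max_len=150):
    """Return True if a sentence passes the length, question, and quote filters."""
    if not (min_len < len(sent) < max_len):
        return False
    if "?" in sent:
        return False
    if SINGLE_QUOTED.search(sent) or DOUBLE_QUOTED.search(sent):
        return False
    return True


if __name__ == "__main__":
    samples = [
        "Magma erupts through cracks in the crust.",
        "Too short.",
        "Why is the Pacific rim lined with volcanoes?",
        'The rim is called the "Ring of Fire" by geologists.',
        "This is why the Pacific rim is called the “Pacific Ring of Fire.”",
    ]
    for s in samples:
        print(keep_sentence(s), "|", s)
```

Note that the patterns only match straight quotes (`'` and `"`), so a sentence that uses typographic quotes, like the last sample, passes the quote filter.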




## 3.4 Generating True/False Questions with GPT2 & BERT

#### Split sentences at an appropriate place

In [None]:
from IPython.display import Markdown, display

def printmd(string):
    display(Markdown(string))
    
def get_flattened(t):
    sent_str_final = None
    if t is not None:
        sent_str = [" ".join(x.leaves()) for x in list(t)]
        sent_str_final = [" ".join(sent_str)]
        sent_str_final = sent_str_final[0]
    return sent_str_final
    

def get_termination_portion(main_string,sub_string):
    combined_sub_string = sub_string.replace(" ","")
    main_string_list = main_string.split()
    last_index = len(main_string_list)
    for i in range(last_index):
        check_string_list = main_string_list[i:]
        check_string = "".join(check_string_list)
        check_string = check_string.replace(" ","")
        if check_string == combined_sub_string:
            return " ".join(main_string_list[:i])
                     
    return None
    
def get_right_most_VP_or_NP(parse_tree,last_NP = None,last_VP = None):
    if len(parse_tree.leaves()) == 1:
        return get_flattened(last_NP),get_flattened(last_VP)
    last_subtree = parse_tree[-1]
    if last_subtree.label() == "NP":
        last_NP = last_subtree
    elif last_subtree.label() == "VP":
        last_VP = last_subtree
    
    return get_right_most_VP_or_NP(last_subtree,last_NP,last_VP)


def get_sentence_completions(key_sentences):
    sentence_completion_dict = {}
    # Iterate over the argument, not the global filter_quotes_and_questions
    for individual_sentence in key_sentences:
        sentence = individual_sentence.rstrip('?:!.,;')
        tree = benepar_parser.parse(sentence)
        last_nounphrase, last_verbphrase =  get_right_most_VP_or_NP(tree)
        phrases= []
        # get_termination_portion returns None when the phrase is not a suffix of
        # the sentence, and "" when it covers the whole sentence; skip both cases
        if last_verbphrase is not None:
            verbphrase_string = get_termination_portion(sentence,last_verbphrase)
            if verbphrase_string:
                phrases.append(verbphrase_string)
        if last_nounphrase is not None:
            nounphrase_string = get_termination_portion(sentence,last_nounphrase)
            if nounphrase_string:
                phrases.append(nounphrase_string)

        # Longest partial sentence first
        longest_phrase =  sorted(phrases, key=len,reverse= True)
        if len(longest_phrase) == 2:
            first_sent_len = len(longest_phrase[0].split())
            second_sentence_len = len(longest_phrase[1].split())
            # Drop the shorter prefix when it is more than 4 words shorter
            if (first_sent_len - second_sentence_len) > 4:
                del longest_phrase[1]
                
        if len(longest_phrase)>0:
            sentence_completion_dict[sentence]=longest_phrase
    return sentence_completion_dict



sent_completion_dict = get_sentence_completions(filter_quotes_and_questions)

print (sent_completion_dict)

{'As tectonic plates pull away from each other at a divergent plate boundary, they create deep fissures, or cracks, in the crust': ['As tectonic plates pull away from each other at a divergent plate boundary, they create deep fissures, or cracks, in'], 'Divergent plate boundaries also occur in the continental crust': ['Divergent plate boundaries also occur in', 'Divergent plate boundaries also'], 'Volcanoes form at these boundaries, but less often than in ocean crust': ['Volcanoes form at these boundaries, but less often than in'], 'Many volcanoes form along convergent plate boundaries where one tectonic plate is pulled down beneath another at a subduction zone': ['Many volcanoes form along convergent plate boundaries where one tectonic plate is pulled down beneath another at']}
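
The splitting logic can be traced by hand without benepar. The sketch below mirrors `get_right_most_VP_or_NP` and `get_termination_portion` on a parse tree written as nested tuples. The tree is hand-written for illustration and is an assumption (benepar's actual parse may differ), and the names `leaves`, `rightmost_np_vp`, and `strip_trailing_phrase` are hypothetical.

```python
# A parse tree as nested tuples: (label, [children]); a leaf is a plain string.
# Hand-written for illustration, not produced by benepar.
TREE = ("S", [
    ("NP", [("JJ", ["Divergent"]), ("NN", ["plate"]), ("NNS", ["boundaries"])]),
    ("ADVP", [("RB", ["also"])]),
    ("VP", [
        ("VBP", ["occur"]),
        ("PP", [
            ("IN", ["in"]),
            ("NP", [("DT", ["the"]), ("JJ", ["continental"]), ("NN", ["crust"])]),
        ]),
    ]),
])


def leaves(node):
    """Collect the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    _, children = node
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


def rightmost_np_vp(node, last_np=None, last_vp=None):
    """Follow the last child at each level, remembering the latest NP and VP seen."""
    if len(leaves(node)) == 1:
        return last_np, last_vp
    _, children = node
    last_child = children[-1]
    if last_child[0] == "NP":
        last_np = " ".join(leaves(last_child))
    elif last_child[0] == "VP":
        last_vp = " ".join(leaves(last_child))
    return rightmost_np_vp(last_child, last_np, last_vp)


def strip_trailing_phrase(sentence, phrase):
    """Return the sentence prefix left after removing a trailing phrase, else None."""
    target = phrase.replace(" ", "")
    words = sentence.split()
    for i in range(len(words)):
        if "".join(words[i:]) == target:
            return " ".join(words[:i])
    return None


if __name__ == "__main__":
    sentence = " ".join(leaves(TREE))
    noun_phrase, verb_phrase = rightmost_np_vp(TREE)
    print(strip_trailing_phrase(sentence, noun_phrase))
    print(strip_trailing_phrase(sentence, verb_phrase))
```

Under this assumed tree, the two prefixes are "Divergent plate boundaries also occur in" and "Divergent plate boundaries also", which matches the entry for the second sentence in the dictionary printed above.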


#### Load OpenAI GPT2 and Sentence BERT

In [None]:
# https://huggingface.co/transformers/main_classes/model.html?highlight=no_repeat_ngram_size

from transformers import GPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("gpt2",pad_token_id=tokenizer.eos_token_id)

from sentence_transformers import SentenceTransformer
# Load the BERT model. Various models trained on Natural Language Inference (NLI) https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/nli-models.md and 
# Semantic Textual Similarity are available https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/sts-models.md
model_BERT = SentenceTransformer('bert-base-nli-mean-tokens')
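
The Sentence-BERT model maps each sentence to a fixed-length vector, and the filtering step later in this section compares those vectors by cosine similarity, keeping generated sentences whose similarity to the original is below 0.9. A minimal pure-Python sketch of that comparison follows. The 3-dimensional vectors are toy values standing in for real embeddings, and `cosine_similarity` and `pick_dissimilar` are hypothetical helper names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pick_dissimilar(query_vec, candidates, threshold=0.9, limit=3):
    """Keep candidates below the similarity threshold, shortest sentences first.

    candidates is a list of (sentence, vector) pairs.
    """
    kept = [sentence for sentence, vec in candidates
            if cosine_similarity(query_vec, vec) < threshold]
    return sorted(kept, key=len)[:limit]


if __name__ == "__main__":
    # Toy vectors, NOT real Sentence-BERT embeddings
    query = [1.0, 0.0, 0.0]
    candidates = [
        ("near copy", [0.99, 0.05, 0.0]),
        ("a different claim", [0.5, 0.8, 0.1]),
        ("short", [0.0, 1.0, 0.0]),
    ]
    print(pick_dissimilar(query, candidates))
```

As in `sort_by_similarity`, the final ordering is by sentence length rather than by similarity score; the threshold only decides which candidates survive.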



#### Filter sentences and generate false sentences

In [None]:
from nltk import tokenize
import scipy
import torch
torch.manual_seed(2020)


def sort_by_similarity(original_sentence,generated_sentences_list):
    # Each sentence is encoded as a 1-D vector with 768 columns
    sentence_embeddings = model_BERT.encode(generated_sentences_list)

    queries = [original_sentence]
    query_embeddings = model_BERT.encode(queries)
    # Find the top sentences of the corpus for each query sentence based on cosine similarity
    number_top_matches = len(generated_sentences_list)

    dissimilar_sentences = []

    for query, query_embedding in zip(queries, query_embeddings):
        distances = scipy.spatial.distance.cdist([query_embedding], sentence_embeddings, "cosine")[0]

        results = zip(range(len(distances)), distances)
        results = sorted(results, key=lambda x: x[1])


        for idx, distance in reversed(results[0:number_top_matches]):
            score = 1-distance
            if score < 0.9:
                dissimilar_sentences.append(generated_sentences_list[idx].strip())
           
    sorted_dissimilar_sentences = sorted(dissimilar_sentences, key=len)
    
    return sorted_dissimilar_sentences[:3]
    

def generate_sentences(partial_sentence,full_sentence):
    input_ids = torch.tensor([tokenizer.encode(partial_sentence)])
    maximum_length = len(partial_sentence.split())+80

    # Activate sampling with top-k and top-p (nucleus) filtering: keep the 50 most
    # likely tokens, then the smallest subset covering 90% of the probability mass
    sample_outputs = model.generate(
        input_ids, 
        do_sample=True, 
        max_length=maximum_length, 
        top_p=0.90, # 0.85 
        top_k=50,   #0.30
        repetition_penalty  = 10.0,
        num_return_sequences=10
    )
    generated_sentences=[]
    for i, sample_output in enumerate(sample_outputs):
        decoded_sentences = tokenizer.decode(sample_output, skip_special_tokens=True)
        decoded_sentences_list = tokenize.sent_tokenize(decoded_sentences)
        generated_sentences.append(decoded_sentences_list[0])
        
    top_3_sentences = sort_by_similarity(full_sentence,generated_sentences)
    
    return top_3_sentences

index = 1
choice_list = ["a)","b)","c)","d)","e)","f)"]
for key_sentence in sent_completion_dict:
    partial_sentences = sent_completion_dict[key_sentence]
    false_sentences =[]
    print_string = "**%s) True Sentence (from the story) :**"%(str(index))
    printmd(print_string)
    print ("  ",key_sentence)
    for partial_sent in partial_sentences:
        false_sents = generate_sentences(partial_sent,key_sentence)
        false_sentences.extend(false_sents)
    printmd("  **False Sentences (GPT-2 Generated)**")
    for ind,false_sent in enumerate(false_sentences):
        print_string_choices = "**%s** %s"%(choice_list[ind],false_sent)
        printmd(print_string_choices)
    index = index+1
    
    print ("\n\n")
        

**1) True Sentence (from the story) :**

   As tectonic plates pull away from each other at a divergent plate boundary, they create deep fissures, or cracks, in the crust


  **False Sentences (GPT-2 Generated)**

**a)** As tectonic plates pull away from each other at a divergent plate boundary, they create deep fissures, or cracks, in the crust that provide access to oxygen-rich water.

**b)** As tectonic plates pull away from each other at a divergent plate boundary, they create deep fissures, or cracks, in the seafloor that are more sensitive to wind velocity and pressure than most continental surfaces.






**2) True Sentence (from the story) :**

   Divergent plate boundaries also occur in the continental crust


  **False Sentences (GPT-2 Generated)**

**a)** Divergent plate boundaries also occur in the low and high latitudes.

**b)** Divergent plate boundaries also occur in regions with more frequent rainfall.

**c)** Divergent plate boundaries also occur in the brain of mammals and vertebrates.

**d)** Divergent plate boundaries also have been proposed.

**e)** Divergent plate boundaries also may be used to map and reduce traffic congestion.

**f)** Divergent plate boundaries also had to be adjusted and the data collected from different cities was sent on a regular basis.






**3) True Sentence (from the story) :**

   Volcanoes form at these boundaries, but less often than in ocean crust


  **False Sentences (GPT-2 Generated)**

**a)** Volcanoes form at these boundaries, but less often than in any other country," he says.

**b)** Volcanoes form at these boundaries, but less often than in a large coastal country (such as the USA) they must be found along well-defined water bodies such that their location can become clear.






**4) True Sentence (from the story) :**

   Many volcanoes form along convergent plate boundaries where one tectonic plate is pulled down beneath another at a subduction zone


  **False Sentences (GPT-2 Generated)**

**a)** Many volcanoes form along convergent plate boundaries where one tectonic plate is pulled down beneath another at a rate of 1.1, 3–4 km/h (5).

**b)** Many volcanoes form along convergent plate boundaries where one tectonic plate is pulled down beneath another at about 25% of its original location.

**c)** Many volcanoes form along convergent plate boundaries where one tectonic plate is pulled down beneath another at a rate of about 100 million km/year.
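
The false sentences above were sampled with `do_sample=True`, `top_k=50`, and `top_p=0.90`. The following sketch shows roughly what these two filters do to a next-token distribution, using a toy vocabulary of four tokens. It is not the Hugging Face implementation: the order of operations (top-k first, then top-p on the renormalized remainder) is stated here as an assumption, and `top_k_top_p_filter` is a hypothetical name.

```python
def top_k_top_p_filter(probs, top_k, top_p):
    """Restrict a token distribution with top-k, then top-p, and renormalize.

    probs maps token -> probability. Returns a new dict that sums to 1.
    """
    # Top-k: keep only the k most likely tokens
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        share = p / total
        kept.append((token, share))
        cumulative += share
        if cumulative >= top_p:
            break

    norm = sum(p for _, p in kept)
    return {token: p / norm for token, p in kept}


if __name__ == "__main__":
    # Toy next-token distribution, invented for illustration
    next_token = {"crust": 0.5, "mantle": 0.3, "ocean": 0.15, "brain": 0.05}
    print(top_k_top_p_filter(next_token, top_k=3, top_p=0.9))
    print(top_k_top_p_filter(next_token, top_k=3, top_p=0.8))
```

Sampling then draws from the filtered distribution, so lower `top_k` or `top_p` values cut off unlikely continuations, while higher values allow more surprising (and more often false) completions.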






#### Reference
 - https://medium.com/swlh/practical-ai-automatically-generate-true-or-false-questions-from-any-content-with-openai-gpt2-9081ffe4d4c9