<h1> Generate QA from a large text file </h1>
<h3>Steps </h3>


1.   Install the requirements
2.   Read the text file
3.   Clean the text
4.   Pass the cleaned text to the led-base-book-summary model to generate summaries
5.   Store the summaries in a text file
6.   Pass the summaries from the text file to the question-generation-t5 model
7.   Store the questions in a text file
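
The steps above can be sketched as a single function that only wires the stages together. This is a minimal sketch, not the notebook's actual code: `run_pipeline` and its three callable parameters are hypothetical names, and the demo passes trivial stand-ins instead of the real cleaning, summarization, and question-generation stages.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def run_pipeline(
    source: Path,
    summary_path: Path,
    questions_path: Path,
    clean_text: Callable[[str], str],
    summarize: Callable[[str], List[str]],
    generate_questions: Callable[[str], str],
) -> None:
    """Read, clean, summarize, store summaries, then generate and store questions."""
    raw_text = source.read_text(errors="ignore")               # step 2
    cleaned = clean_text(raw_text)                             # step 3
    summaries = summarize(cleaned)                             # step 4
    summary_path.write_text("\n".join(summaries))              # step 5
    questions = generate_questions(summary_path.read_text())   # step 6
    questions_path.write_text(questions)                       # step 7


# Demo with trivial stand-ins (hypothetical) so the flow runs without any model.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "book.txt").write_text("Some   long text.")
    run_pipeline(
        source=tmp_dir / "book.txt",
        summary_path=tmp_dir / "summary.txt",
        questions_path=tmp_dir / "qs.txt",
        clean_text=lambda t: " ".join(t.split()),
        summarize=lambda t: [t.upper()],
        generate_questions=lambda s: f"What is meant by: {s}?",
    )
    demo_questions = (tmp_dir / "qs.txt").read_text()
```

Keeping the stages as parameters makes it possible to test the file handling separately from the models.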



In [1]:
!pip install -U datasets transformers ninja -q
!pip install -U sentencepiece -q
!pip install clean-text[gpl] -q

In [2]:
from transformers import pipeline
import torch
from cleantext import clean
from pathlib import Path

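# Device index in the convention used by transformers.pipeline: 0 = first GPU, -1 = CPU
_device = 0 if torch.cuda.is_available() else -1
hf_tag = "pszemraj/led-base-book-summary"  # summarization checkpoint
token_batch_length = 2048  # tokens per chunk; options: 16384, 8192, 4096, 3072, 2048
batch_stride = 20  # stride passed to the tokenizer when splitting into chunks
number_beams = 8  # beam width; options: 16, 12, 8, 4
min_length = 32  # minimum summary length in tokens
max_len_ratio = 4.75  # chunk length divided by maximum summary length; range 2-10
length_penalty = 0.5  # length penalty used by beam search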



In [3]:
from tqdm.auto import tqdm
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForSeq2SeqLM.from_pretrained(hf_tag).to(device)
tokenizer = AutoTokenizer.from_pretrained(hf_tag)
session_settings = {}
session_settings['token_batch_length'] = token_batch_length
session_settings['batch_stride'] = batch_stride

# Cap the beam width at 8 when chunks are longer than 8192 tokens.
if token_batch_length > 8192 and number_beams > 8:
    number_beams = 8

# Keyword arguments forwarded to model.generate().
settings = {
    'min_length': min_length,
    'max_length': int(token_batch_length // max_len_ratio),
    'no_repeat_ngram_size': 3,
    'encoder_no_repeat_ngram_size': 4,
    'repetition_penalty': 3.7,
    'num_beams': number_beams,
    'length_penalty': length_penalty,
    'early_stopping': True,
    'do_sample': False,
}

session_settings['num_beams'] = number_beams
session_settings['length_penalty'] = length_penalty
session_settings['max_len_ratio'] = max_len_ratio
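
The `max_length` setting is derived from the chunk size by floor division, so it can help to see the resulting summary budget for each chunk size. The following is a small sketch of that arithmetic only; `max_summary_tokens` is a hypothetical helper name, and the values mirror the options listed in the configuration cell.

```python
def max_summary_tokens(token_batch_length: int, max_len_ratio: float) -> int:
    """Maximum summary length (in tokens) allowed for one chunk."""
    # Floor division by a float returns a float, hence the int() conversion.
    return int(token_batch_length // max_len_ratio)


# Summary budget per chunk for each chunk size, at a ratio of 4.75.
budgets = {
    length: max_summary_tokens(length, 4.75)
    for length in (2048, 3072, 4096, 8192, 16384)
}
print(budgets[2048])  # 431
```

With the notebook's values (2048 tokens per chunk, ratio 4.75), each chunk summary is limited to 431 tokens.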


In [4]:
# Helper functions for chunked summarization

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i : i + n]



def generate_answer(batch, **kwargs):
    """Summarize batch["text"] in one pass and store the result in batch["summary"]."""
    inputs_dict = tokenizer(
        batch["text"],
        padding="max_length",
        max_length=16384,
        return_tensors="pt",
        truncation=True,
        add_special_tokens=False,
    )

    # Move the inputs to whichever device the model is on.
    input_ids = inputs_dict.input_ids.to(model.device)
    attention_mask = inputs_dict.attention_mask.to(model.device)

    # LED takes a global attention mask; put global attention on the first token only.
    global_attention_mask = torch.zeros_like(attention_mask)
    global_attention_mask[:, 0] = 1

    predicted_abstract_ids = model.generate(
        input_ids,
        attention_mask=attention_mask,
        global_attention_mask=global_attention_mask,
        **kwargs,
    )
    batch["summary"] = tokenizer.batch_decode(
        predicted_abstract_ids,
        skip_special_tokens=True,
    )
    return batch


def summarize_and_score(ids, mask, **kwargs):
    """Summarize one chunk of token ids and return (decoded summaries, beam score)."""
    # Add a batch dimension: (seq_len,) -> (1, seq_len).
    ids = ids[None, :]
    mask = mask[None, :]

    # Move the inputs to whichever device the model is on.
    input_ids = ids.to(model.device)
    attention_mask = mask.to(model.device)

    # LED takes a global attention mask; put global attention on the first token only.
    global_attention_mask = torch.zeros_like(attention_mask)
    global_attention_mask[:, 0] = 1

    summary_pred_ids = model.generate(
        input_ids,
        attention_mask=attention_mask,
        global_attention_mask=global_attention_mask,
        output_scores=True,
        return_dict_in_generate=True,
        **kwargs,
    )
    summary = tokenizer.batch_decode(
        summary_pred_ids.sequences,
        skip_special_tokens=True,
    )
    # sequences_scores is only populated when beam search is used (num_beams > 1).
    score = round(float(summary_pred_ids.sequences_scores.cpu().numpy()[0]), 4)

    return summary, score

    
def summarize_via_tokenbatches(
    input_text: str,
    batch_length=8192,
    batch_stride=16,
    **kwargs,
):
    """Split input_text into token chunks and summarize each chunk separately."""
    encoded_input = tokenizer(
        input_text,
        padding='max_length',
        truncation=True,
        max_length=batch_length,
        stride=batch_stride,
        return_overflowing_tokens=True,  # keep every chunk, not just the first
        add_special_tokens=False,
        return_tensors='pt',
    )

    in_id_arr, att_arr = encoded_input.input_ids, encoded_input.attention_mask
    gen_summaries = []

    # One iteration per chunk; extra keyword arguments go to model.generate().
    for _id, _mask in tqdm(zip(in_id_arr, att_arr), total=len(in_id_arr)):
        result, score = summarize_and_score(
            ids=_id,
            mask=_mask,
            **kwargs,
        )
        _sum = {
            "input_tokens": _id,
            "summary": result,
            "summary_score": score,
        }
        gen_summaries.append(_sum)
        print(f"\t{result[0]}\nScore:\t{score}")

    return gen_summaries
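
To make the role of `max_length` and `stride` in the chunking step more concrete, here is a simplified pure-Python model of how a long token sequence is split into overlapping windows. It is a sketch under the assumption that `stride` is the number of tokens shared by consecutive chunks; `overlapping_windows` is a hypothetical helper, and the real tokenizer may differ in edge cases (it also pads the last chunk when `padding='max_length'` is set).

```python
from typing import List, Sequence


def overlapping_windows(
    token_ids: Sequence[int], max_length: int, stride: int
) -> List[List[int]]:
    """Split token_ids into windows of at most max_length tokens,
    where consecutive windows share `stride` tokens."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")

    step = max_length - stride  # how far each new window advances
    windows = []
    for start in range(0, len(token_ids), step):
        end = min(start + max_length, len(token_ids))
        windows.append(list(token_ids[start:end]))
        if end == len(token_ids):  # the last window reached the end of the input
            break
    return windows


example = overlapping_windows(list(range(10)), max_length=4, stride=1)
print(example)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Under this model, the notebook's settings (2048 tokens per chunk, stride 20) advance 2028 tokens per chunk.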


In [5]:
# Read the whole input file into `raw_text`, ignoring bytes that cannot be decoded.
with open("RJ.txt", "r", errors="ignore") as f:
    raw_text = f.read()

# Clean the text with clean-text while keeping the original casing.
long_text = clean(raw_text, lower=False)

# Summarize the text chunk by chunk.
_summaries = summarize_via_tokenbatches(
    long_text,
    batch_length=token_batch_length,
    batch_stride=batch_stride,
    **settings,
)

# Write the chunk summaries in `_summaries` to a `.txt` file.
sum_text = [s["summary"][0] for s in _summaries]
full_summary = "\n\t".join(sum_text)
_outpath = "summary.txt"

with open(_outpath, "w") as fo:
    fo.write(full_summary)
    fo.write("\n" * 3)


	The first act of William Shakespeare's play, "Romeo and Juliet," is set in a public place. The action moves forward to the private rooms of the houses of Montague and Capulet, where the lovers fight over whom shall be pushed to the wall or thrown to the ground. They are joined by two servants, Balthasar and Abram, who are also involved in the quarrel. At this point, Benvolio enters with some of his kinsmen, and they engage in a heated debate about which side should draw the swords. In the end, both sides agree to draw their swords.
Score:	-10.873
	The scene shifts to the Capulet household, where Lady Capulet and Prince Escalus are engaged in a series of brawls. The first quarrel is brought to light when Prince Montague enters with his attendants. He accuses the conspirators of breaking into Verona's peaceful citizens' homes and causing them to turn against each other. Attempting to quell the conflict peacefully, Prince Cadwallader Tybalt attempts to cut off Romeo from conversation by 
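
The `Score` printed after each summary comes from beam search: it is the summed log-probability of the generated tokens, normalized by the sequence length raised to `length_penalty`. The sketch below shows only that normalization; `length_normalized_score` is a hypothetical helper, the input numbers are made up for illustration, and exactly which length the library uses can vary between transformers versions.

```python
def length_normalized_score(
    sum_logprobs: float, length: int, length_penalty: float
) -> float:
    """Beam-search style score: summed log-probability / length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)


# Illustrative numbers only: 100 tokens with a total log-probability of -100.
score_sqrt = length_normalized_score(-100.0, 100, 0.5)  # divides by sqrt(100) = 10
score_full = length_normalized_score(-100.0, 100, 1.0)  # divides by 100
print(score_sqrt, score_full)  # -10.0 -1.0
```

Scores are negative because they are built from log-probabilities, and values closer to zero indicate a more likely sequence. Since the numerator is negative, a larger `length_penalty` favors longer summaries.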

<h2> Generate QA pairs from Summaries </h2>


In [13]:
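import requests

# Read the summaries written in the previous step.
with open('summary.txt', 'r') as file:
    text = file.read()

# Send the summaries to the hosted question-generation-t5 Space.
response = requests.post(
    "https://pragnakalp-question-generation-t5.hf.space/run/predict",
    json={"data": [text, "20"]},
)
response.raise_for_status()  # stop on an HTTP error before parsing the body
data = response.json()["data"]

# Save the generated questions.
with open('qs.txt', 'w') as file:
    file.write(data[0])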
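
Because the request depends on a hosted Space, it can be useful to wrap it in a function that validates the response before anything is written to disk. This is a minimal sketch, assuming the response keeps the `{"data": [...]}` shape used above. `generate_questions`, `second_input`, and `timeout` are hypothetical names, and the meaning of the `"20"` value is not stated in this notebook, so it is passed through unchanged.

```python
from typing import Any, Callable

SPACE_URL = "https://pragnakalp-question-generation-t5.hf.space/run/predict"


def generate_questions(
    text: str,
    post: Callable[..., Any],
    second_input: str = "20",
    timeout: float = 120.0,
) -> str:
    """Send `text` to the Space and return the first item of the "data" field.

    `post` is any callable with the same interface as requests.post, which
    allows the function to be exercised without network access.
    """
    if not text.strip():
        raise ValueError("summary text is empty")

    response = post(SPACE_URL, json={"data": [text, second_input]}, timeout=timeout)
    response.raise_for_status()

    payload = response.json()
    data = payload.get("data")
    if not data:
        raise RuntimeError(f"unexpected response: {payload!r}")
    return data[0]
```

In the notebook this could be called as `generate_questions(text, requests.post)`, with the returned string written to `qs.txt`.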

<h1> References </h1>

https://huggingface.co/spaces/pszemraj/summarize-long-text

https://github.com/AMontgomerie/question_generator
