# Zero Scrolls
The goal of this notebook is to explore the ZeroSCROLLS dataset.  
There are two versions:
- the public version
- the LC team's version


In [1]:
import json
import os
import sys
from datetime import datetime
import random

import numpy as np
import torch
from datasets import load_dataset
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import set_seed as hf_set_seed

device = "cuda" if torch.cuda.is_available() else "cpu"
device


'cuda'

In [9]:
# !pip install sentencepiece accelerate



## ZeroScrolls public
Benchmark [homepage](https://www.zero.scrolls-benchmark.com/) and [GitHub repository](https://github.com/tau-nlp/zero_scrolls/tree/main).  


The cells below assume the benchmark files have been downloaded into the `datasets` folder.

In [2]:
zc_data_path='./datasets/zero-scrolls/public/'
zc_tasks = os.listdir(zc_data_path)
zc_tasks

['qmsum',
 'narrative_qa',
 'book_sum_sort',
 'space_digest',
 'gov_report',
 'qasper',
 'musique',
 'quality',
 'summ_screen_fd',
 'squality']
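
The listing above only shows folder names. As a quick sanity check, a minimal sketch like the following could report which split files each task folder actually contains. It assumes one folder per task holding `test.jsonl` and `validation.jsonl`, the layout implied by the local-file loading call in this notebook; the helper name `find_task_splits` is hypothetical.

```python
import os

def find_task_splits(root, splits=("validation", "test")):
    """Map each task folder under `root` to the split files found in it."""
    found = {}
    # sorted() gives a stable order; os.listdir alone returns arbitrary order
    for task in sorted(os.listdir(root)):
        task_dir = os.path.join(root, task)
        if not os.path.isdir(task_dir):
            continue  # skip stray files that are not task folders
        found[task] = [s for s in splits
                       if os.path.isfile(os.path.join(task_dir, f"{s}.jsonl"))]
    return found
```

Calling `find_task_splits(zc_data_path)` would return a dictionary keyed by task name; an empty list for a task signals that its files are missing.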

In [3]:
zc_task = zc_tasks[4]  # 'gov_report' in the listing above; os.listdir order can differ between machines
# Option 1: load from local files
# data = load_dataset('json', data_files={'test': os.path.join(zc_data_path, zc_task, 'test.jsonl'), 'validation': os.path.join(zc_data_path, zc_task, 'validation.jsonl')})
# Option 2: download from the Hugging Face Hub
data = load_dataset("tau/zero_scrolls", zc_task)
data.keys()

dict_keys(['validation', 'test'])
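
For a quick look at the local files without going through `datasets`, the JSON Lines format can also be read directly with the standard library. This is only a sketch; `read_jsonl` is a hypothetical helper, and it assumes each non-empty line of the file is one JSON object.

```python
import json

def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

For example, `read_jsonl(os.path.join(zc_data_path, zc_task, 'validation.jsonl'))` would give a plain list of records that can be inspected without any extra dependencies.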

In [4]:
print(len(data['test']))
print(len(data['validation']))

500
20


In [5]:
example=data['validation'][0]
example

{'id': 'crs_R45461',
 'pid': 'crs_R45461_0',
 'input': 'You are given a report by a government agency. Write a one-page summary of the report.\n\nReport:\nBackground\n\n\t\tWhat is the U.S. International Development Finance Corporation (IDFC)?\n\nThe IDFC is authorized by statute to be a "wholly owned Government corporation ... under the foreign policy guidance of the Secretary of State" in the executive branch. Its purpose is to "mobilize and facilitate the participation of private sector capital and skills in the economic development" of developing and transition countries, in order to complement U.S. development assistance objectives and foreign policy interests (§1412). In other words, the IDFC\'s mission is to promote private investment in support of both U.S. global development goals and U.S. economic interests. Not yet operational, the IDFC represents a potentially major overhaul of U.S. development finance efforts. \nThe IDFC\'s enabling legislation is the Better Utilization of

In [6]:
print(example['input'])
print(example['output'])

You are given a report by a government agency. Write a one-page summary of the report.

Report:
Background

		What is the U.S. International Development Finance Corporation (IDFC)?

The IDFC is authorized by statute to be a "wholly owned Government corporation ... under the foreign policy guidance of the Secretary of State" in the executive branch. Its purpose is to "mobilize and facilitate the participation of private sector capital and skills in the economic development" of developing and transition countries, in order to complement U.S. development assistance objectives and foreign policy interests (§1412). In other words, the IDFC's mission is to promote private investment in support of both U.S. global development goals and U.S. economic interests. Not yet operational, the IDFC represents a potentially major overhaul of U.S. development finance efforts. 
The IDFC's enabling legislation is the Better Utilization of Investments Leading to Development Act of 2018 (BUILD Act), which w

In [7]:
tst_example = data['test'][0]
tst_example

{'id': 'crs_R46330',
 'pid': 'crs_R46330_0',
 'input': 'You are given a report by a government agency. Write a one-page summary of the report.\n\nReport:\nIntroduction\n\nThroughout U.S. history, Congress has created advisory commissions to assist in the development of public policy. Among other contexts, commissions have been used following crisis situations, including the September 11, 2001, terrorist attacks and the 2008 financial crisis. In such situations, advisory commissions may potentially provide Congress with a high-visibility forum to assemble expertise that might not exist within the legislative environment; allow for the in-depth examination of complex, cross-cutting policy issues; and lend bipartisan credibility to a set of findings and recommendations.\nAs Congress considers its range of responses to the coronavirus pandemic, the creation of one or more congressional advisory commissions is an option that could provide a platform for evaluating various pandemic-related p

In [21]:
print(tst_example['input'])

You are given a report by a government agency. Write a one-page summary of the report.

Report:
Introduction

Throughout U.S. history, Congress has created advisory commissions to assist in the development of public policy. Among other contexts, commissions have been used following crisis situations, including the September 11, 2001, terrorist attacks and the 2008 financial crisis. In such situations, advisory commissions may potentially provide Congress with a high-visibility forum to assemble expertise that might not exist within the legislative environment; allow for the in-depth examination of complex, cross-cutting policy issues; and lend bipartisan credibility to a set of findings and recommendations.
As Congress considers its range of responses to the coronavirus pandemic, the creation of one or more congressional advisory commissions is an option that could provide a platform for evaluating various pandemic-related policy issues over time. Past congressional advisory commission

The test set contains 500 samples and the validation set contains 20.  
The difference between them is that the test set does not include the expected `output` (its value is `None`).
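
The missing references can be confirmed across a whole split rather than on a single example. The sketch below uses a hypothetical helper, `count_missing_outputs`, and assumes each record behaves like a dictionary with an optional `output` key.

```python
def count_missing_outputs(examples):
    """Count records whose `output` field is absent or None."""
    return sum(1 for ex in examples if ex.get("output") is None)

# Toy records standing in for dataset rows
toy = [{"id": "a", "output": "summary"}, {"id": "b", "output": None}, {"id": "c"}]
print(count_missing_outputs(toy))  # 2
```

If the description above holds, applying it to `data['test']` should give the full split size and applying it to `data['validation']` should give 0.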

Next, let's see how to generate a prediction for the example, based on the code provided by ZeroSCROLLS (`run_hf_model.py`).

In [8]:
model_to_max_input_tokens = {
    "google/flan-t5-xxl": 8192,
    "google/flan-t5-xl": 8192,
    "google/flan-t5-large": 8192,
    "google/flan-t5-base": 8192,
    "google/flan-t5-small": 8192,
    "google/flan-ul2": 8192,
    "bigscience/T0pp": 8192,
}

def trim_doc_keeping_suffix(tokenizer, tokenized_input_full, example, suffix_index, max_tokens, device):
    # Keep everything from `suffix_index` onward and trim the tokens before it to fit `max_tokens`.
    seperator_and_suffix = f"{example['truncation_seperator'].strip()}\n\n{example['input'][suffix_index:].strip()}\n"
    tokenized_seperator_and_suffix = tokenizer(seperator_and_suffix, return_tensors="pt").input_ids.to(device)
    tokenized_input_trimmed = tokenized_input_full[:, :max_tokens - tokenized_seperator_and_suffix.shape[1]]
    tokenized_input = torch.cat([tokenized_input_trimmed, tokenized_seperator_and_suffix], dim=1)
    return tokenized_input


def process_model_input(tokenizer, example, max_tokens, device):
    # Inputs that already fit are passed through unchanged.
    tokenized_input_full = tokenizer(example["input"], return_tensors="pt").input_ids.to(device)
    if tokenized_input_full.shape[1] <= max_tokens:
        return tokenized_input_full

    # Too long: tokenize the separator and query on their own so they are never cut.
    seperator_and_query_text = example['truncation_seperator'] + example["input"][example['query_start_index']:]
    tokenized_seperator_and_query = tokenizer(seperator_and_query_text, return_tensors="pt").input_ids.to(device)
    # Trim only the text before the query to the remaining token budget.
    input_without_query = example['input'][:example['query_start_index']]
    tokenized_input_without_query = tokenizer(input_without_query, return_tensors="pt").input_ids.to(device)
    tokenized_input_without_query = tokenized_input_without_query[:,
                                    :max_tokens - tokenized_seperator_and_query.shape[1]]

    # Final input: trimmed text followed by the separator and the query.
    tokenized_input = torch.cat([tokenized_input_without_query, tokenized_seperator_and_query], dim=1)
    return tokenized_input
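
The truncation idea in `process_model_input` can be illustrated without a tokenizer. The sketch below works on plain lists of integers standing in for token ids; `trim_keeping_query` is a hypothetical name, and the `max(0, ...)` guard is an addition that the original function does not have.

```python
def trim_keeping_query(doc_ids, query_ids, sep_ids, max_tokens):
    """Return token ids that fit in `max_tokens` without cutting the query."""
    if len(doc_ids) + len(query_ids) <= max_tokens:
        return doc_ids + query_ids  # fits as-is, no separator needed
    # Room left for the document once separator and query are reserved;
    # max(0, ...) guards against a negative slice when they alone exceed the limit.
    budget = max(0, max_tokens - len(sep_ids) - len(query_ids))
    return doc_ids[:budget] + sep_ids + query_ids

doc = list(range(100))   # stand-in for document token ids
query = [-2, -3]         # stand-in for query token ids
sep = [-1]               # stand-in for the truncation separator
trimmed = trim_keeping_query(doc, query, sep, max_tokens=10)
print(trimmed)  # [0, 1, 2, 3, 4, 5, 6, -1, -2, -3]
```

The document loses its tail while the separator and query survive intact, which is the property the benchmark code is after: the model always sees the full question even when the context is cut.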


In [15]:
# load a model
model_name='google/flan-t5-small'
tokenizer = T5Tokenizer.from_pretrained(model_name)
max_input_length = model_to_max_input_tokens[model_name]
model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto",
                                                    torch_dtype=torch.float32)
model = model.eval()



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [16]:
# process an example
model_input = process_model_input(tokenizer, example, max_input_length, device)
# Greedy decoding: with do_sample=False the sampling parameters
# (top_p, top_k, temperature) have no effect on the output.
prediction_token_ids = model.generate(model_input,
                                        max_length=1024,
                                        do_sample=False,
                                        top_p=0,
                                        top_k=0,
                                        temperature=1)
predicted_text = tokenizer.decode(prediction_token_ids[0], skip_special_tokens=True)



Token indices sequence length is longer than the specified maximum sequence length for this model (11348 > 512). Running this sequence through the model will result in indexing errors




In [22]:
model_input.shape

torch.Size([1, 8192])
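
The tokenizer warning and this shape tell the same story from two sides: the full example tokenizes to 11348 ids, and `process_model_input` cuts it down to the 8192-token limit. The 512 in the warning is most likely the tokenizer's default maximum length, not a limit enforced by this code. A back-of-the-envelope sketch of how much is lost, using only the two numbers printed above:

```python
full_len = 11348      # token count reported in the tokenizer warning
max_tokens = 8192     # limit from model_to_max_input_tokens

dropped = full_len - max_tokens
kept_fraction = max_tokens / full_len
print(dropped, round(kept_fraction, 3))  # 3156 0.722
```

These figures are approximate: the separator and query also occupy part of the 8192 tokens, so the report itself keeps slightly less than 72% of its content.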

In [19]:
prediction_token_ids.shape

torch.Size([1, 32])

In [20]:
predicted_text

'The IDFC must provide preferential consideration to projects sponsored by or involving private-sector entities that are "U.S. persons"'

Now let's try to load the Hyena checkpoint and see if we can feed the example into it:

In [24]:
from src.models.sequence.long_conv_lm import ConvLMHeadModel
from transformers import GPT2Tokenizer
import yaml 

In [25]:
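def load_hyena_model(model_cfg, ckpt_path):
    # Read the YAML config; the context manager closes the file handle.
    with open(model_cfg, 'r') as f:
        config = yaml.load(f, Loader=yaml.FullLoader)
    model = ConvLMHeadModel(**config['model_config'])
    # Load the weights on CPU first; the caller moves the model to the target device.
    state_dict = torch.load(ckpt_path, map_location='cpu')
    model.load_state_dict(state_dict)
    if config['tokenizer_name'] == 'gpt2':
        tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    else:
        tokenizer = None  # no other tokenizer is handled here
    return model, tokenizer, config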


In [None]:
model_cfg_file = 'configs/evals/hyena_small_150b.yaml'
ckpt_path='checkpoints/hyena_small_150b_tok.ckpt'



In [27]:
with open(model_cfg_file, 'r') as f:
    config = yaml.load(f, Loader=yaml.FullLoader)
config

{'model_name': 'hyena-small',
 'tokenizer_name': 'gpt2',
 'model_config': {'_name_': 'lm',
  'd_model': 864,
  'd_inner': 1728,
  'n_layer': 18,
  'vocab_size': 50257,
  'embed_dropout': 0.0,
  'layer': {'_name_': 'hyena',
   'emb_dim': 33,
   'filter_order': 64,
   'local_order': 3,
   'l_max': 2048,
   'modulate': False,
   'w': 14},
  'fused_mlp': True,
  'fused_dropout_add_ln': True,
  'residual_in_fp32': True,
  'pad_vocab_size_multiple': 8}}

In [28]:
hmodel = ConvLMHeadModel(**config['model_config'])

ImportError: dropout_add_layer_norm is not installed
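
The error appears to come from the `fused_dropout_add_ln: True` flag in the config, which presumably requires an optional compiled CUDA extension that is not installed in this environment. One possible workaround, sketched here as an assumption and not as a verified fix, is to switch the fused flags off before building the model. The helper name `disable_fused_kernels` is hypothetical, and whether the checkpoint still loads cleanly with the non-fused layers would need to be checked.

```python
import copy

def disable_fused_kernels(config):
    """Return a copy of `config` with the fused-kernel flags switched off."""
    patched = copy.deepcopy(config)  # leave the loaded config untouched
    for flag in ("fused_mlp", "fused_dropout_add_ln"):
        if flag in patched["model_config"]:
            patched["model_config"][flag] = False
    return patched
```

With this, `ConvLMHeadModel(**disable_fused_kernels(config)['model_config'])` could be tried in place of the failing call. Installing the missing extension is the alternative if the fused kernels are needed for speed.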

In [None]:
hmodel, htokenizer,config = load_hyena_model(model_cfg_file, ckpt_path)
max_input_length = config['model_config']['layer']['l_max']

hmodel = hmodel.to(device)
hmodel = hmodel.eval()  # inference mode: disables dropout and other training-only behavior
