# Installation

1. Download the LLaMA 2 tokenizer files from https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main
   and place them in a directory named `llama2_tokenizer` in the same
   directory as this notebook.

2. Install the Python packages below:

In [2]:
!pip install --quiet transformers pandas

In [2]:
import glob
from pathlib import Path
import unicodedata

import pandas as pd
from transformers import AutoTokenizer

In [3]:
TOKENIZER_DIR = 'llama2_tokenizer'
DATA_DIR = Path('data')
DATASET_DIR = Path('dataset')

In [None]:
!mkdir -p dataset

In [4]:
tk = AutoTokenizer.from_pretrained(TOKENIZER_DIR)

In [11]:
PROMPT_FMT = """\
<s>[INST] <<SYS>>
You are an assistant for question-answering tasks. Use the following pieces of \
retrieved context in the section demarcated by "```" to answer the question. \
If you don't know the answer just say that you don't know. Use three sentences \
maximum and keep the answer concise.
<</SYS>>

```
{context}
```

Question: {question}

[/INST]
Answer:
"""

In [6]:
len(tk.encode(PROMPT_FMT.format(context="", question="")))

98

In [78]:
wikimqa_data = pd.read_json(DATA_DIR / '2wikimqa_e.jsonl', lines=True)

In [19]:
wikimqa_data.iloc[0].keys()

Index(['input', 'context', 'answers', 'length', 'dataset', 'language',
       'all_classes', '_id'],
      dtype='object')

In [20]:
len(wikimqa_data)

300

In [21]:
wikimqa_data.length.describe()

count      300.000000
mean      6146.540000
std       3178.540665
min        987.000000
25%       3514.500000
50%       5244.000000
75%       9061.000000
max      12334.000000
Name: length, dtype: float64

In [22]:
wq0 = wikimqa_data.iloc[0]

In [28]:
wq0_len = wq0.length
input_len = len(tk.encode(wq0.input))
context_len = len(tk.encode(wq0.context))

print(f"input length: {input_len} context length: {context_len} total: {input_len + context_len}")
print("reported length: ", wq0_len)

input length: 23 context length: 4751 total: 4774
reported length:  2521
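
The reported `length` (2521) is far below the token total (4774). One possible explanation, stated here as an assumption, is that LongBench reports `length` as a word count for English samples rather than a token count. The sketch below shows that kind of count using plain whitespace splitting; the helper `word_count` is hypothetical and the exact counting method used by the dataset is not confirmed here.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words (an assumed stand-in for the
    dataset's `length` definition, not its confirmed method)."""
    return len(text.split())


# Sample question taken from the dataset output shown in this notebook.
sample_question = "Which city is under Jining, Kaiyuan, Liaoning or Yanzhou District?"
print(word_count(sample_question))  # 10
```

Word counts like this will generally be lower than tokenizer counts, since a tokenizer can split one word into several tokens.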


## Summary

* We only use the question-answering (`qa`) files, in English (`en`) where
  available. Fortunately, all `_e` files are in English.

* To create a dataset with defined prompt sizes, we need to count the tokens in
  the `input` question and the `context`, plus the tokens in the prompt template
  itself (`103`). We want to stay flexible, so we count tokens on the fully
  formatted prompt instead of hard-coding that number.

* We then partition the dataset by prompt length.
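
The partitioning step can be sketched as follows. This is a minimal illustration using only the standard library; the bucket limits in `BUCKET_LIMITS` and the helper names are hypothetical, and the rows are stand-ins for the records built later in this notebook.

```python
from bisect import bisect_left

# Hypothetical inclusive upper bounds for each bucket, in tokens.
BUCKET_LIMITS = [1000, 2000, 3997]


def bucket_for(prompt_len, limits=BUCKET_LIMITS):
    """Return the smallest limit that prompt_len fits under, or None if it exceeds all."""
    i = bisect_left(limits, prompt_len)
    return limits[i] if i < len(limits) else None


def partition_by_length(rows, limits=BUCKET_LIMITS):
    """Group rows (dicts with a 'prompt_len' key) by bucket; oversized rows are dropped."""
    buckets = {limit: [] for limit in limits}
    for row in rows:
        limit = bucket_for(row["prompt_len"], limits)
        if limit is not None:
            buckets[limit].append(row)
    return buckets


rows = [{"prompt_len": n} for n in (305, 3051, 3997, 84241)]
buckets = partition_by_length(rows)
print({limit: len(items) for limit, items in buckets.items()})
# {1000: 1, 2000: 0, 3997: 2}
```

With the data in a DataFrame, the same grouping could also be done with `pandas.cut` on the `prompt_len` column.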

In [86]:
files = glob.glob(str(DATA_DIR / '*qa_e.jsonl'))
files.extend(glob.glob(str(DATA_DIR / '*qa.jsonl')))

In [87]:
len(files)

7

In [88]:
# The files in LongBench contain nonstandard or irregular Unicode.
# For compatibility and safety we normalize them.

def normalize(text, form='NFC'):
    return unicodedata.normalize(form, text)

def process_item(item, prompt_fmt=PROMPT_FMT):
    question = normalize(item.input)
    context = normalize(item.context)
    prompt = prompt_fmt.format(question=question, context=context)
    prompt_len = len(tk.encode(prompt))
    return {
        "question": question,
        "context": context,
        "prompt": prompt,
        "prompt_len": prompt_len,
        "question_len": len(tk.encode(question)),
        "context_len": len(tk.encode(context)),
    }
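
To illustrate why normalization matters before counting tokens: the same visible text can be stored as different code point sequences, and those may tokenize differently. The small example below uses a decomposed spelling of "Kāiyuán" (a word that appears in the dataset); the variable names are illustrative only.

```python
import unicodedata

# 'a' followed by a combining macron, and 'a' followed by a combining acute accent.
decomposed = "Ka\u0304iyua\u0301n"
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))    # 9 7
print(composed == "K\u0101iyu\u00e1n")  # True
```

NFC merges each base letter and its combining mark into a single precomposed character, so both spellings end up identical after `normalize`.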

In [89]:
combined_rows = []
for file in files:
    df = pd.read_json(file, lines=True)
    combined_rows.extend(df.apply(process_item, axis=1))

In [91]:
combined_rows[0]

{'question': 'Which city is under Jining, Kaiyuan, Liaoning or Yanzhou District?',
 'context': 'Passage 1:\nKaiyuan, Liaoning\nKaiyuan (simplified Chinese: 开原; traditional Chinese: 開原; pinyin: Kāiyuán; lit. \'Open Plains\') is a county-level city in the northeast of Liaoning, People\'s Republic of China, bordering Jilin for a small section to the north. It is under the administration of Tieling City, the centre of which lies 33 kilometres (21 mi) to the southwest.\n\nAdministrative divisions\nThere are 3 subdistricts, 9 towns, and 9 townships under the city\'s administration.Subdistricts:\n\nXincheng Subdistrict (新城街道), Laocheng Subdistrict (老城街道), Xingkai Subdistrict (兴开街道)Towns:\n\nBabao (八宝镇), Qingyunbao (庆云堡镇), Kaoshan (靠山镇), Yemin (业民镇), Jingouzi (金沟子镇), Zhonggu (中固镇), Bakeshu (八棵树镇), Lianhua (莲花镇), Weiyuanbao (威远堡镇)Townships:\n\nChengdong Township (城东乡), Sanjiazi Township (三家子乡), Songshanbao Township (松山堡乡), Majiazhai Township (马家寨乡), Lijiatai Township (李家台乡), Shangbadi Manchu Et

In [107]:
dataset_all = pd.DataFrame(combined_rows)
dataset_all.to_csv(DATASET_DIR / 'dataset_all.csv', escapechar='"', index=False)

In [94]:
dataset_all.prompt_len.describe()

count     1700.000000
mean     14639.289412
std      12233.962324
min        305.000000
25%       7048.750000
50%      12262.500000
75%      18248.250000
max      84241.000000
Name: prompt_len, dtype: float64

In [98]:
dataset_all[dataset_all.prompt_len <= 3997].describe()

        prompt_len  question_len  context_len
count   105.000000    105.000000   105.000000
mean   2867.657143    334.085714  2438.571429
std     797.591239    438.178926   869.577637
min     305.000000     12.000000   191.000000
25%    2370.000000     20.000000  1814.000000
50%    3051.000000     32.000000  2492.000000
75%    3498.000000    622.000000  3118.000000
max    3997.000000   1647.000000  3883.000000


In [9]:
BASE_PARAMS = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.92,
    "top_k": 120,
}

In [10]:
def build_huggingface_json_request(item, max_new_tokens=100):
    params = BASE_PARAMS.copy()
    params["truncate"] = item.prompt_len
    params["max_new_tokens"] = max_new_tokens
    return {
        "inputs": item.prompt,
        "parameters": params,
    }
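
As a quick check of the payload shape, the sketch below repeats the function so it runs on its own and applies it to a hypothetical row (a `SimpleNamespace` standing in for one row of `dataset_all`). The `inputs`/`parameters` layout looks like it targets a Hugging Face text-generation endpoint, but that target is an assumption.

```python
import json
from types import SimpleNamespace

BASE_PARAMS = {"do_sample": True, "temperature": 0.7, "top_p": 0.92, "top_k": 120}


def build_huggingface_json_request(item, max_new_tokens=100):
    # Same logic as the cell above, repeated so this sketch is self-contained.
    params = BASE_PARAMS.copy()  # copy so the shared defaults are never mutated
    params["truncate"] = item.prompt_len
    params["max_new_tokens"] = max_new_tokens
    return {"inputs": item.prompt, "parameters": params}


# Hypothetical row; the prompt text is a placeholder, not real dataset content.
item = SimpleNamespace(prompt="<s>[INST] placeholder [/INST]\nAnswer:\n", prompt_len=305)
request = build_huggingface_json_request(item)
print(json.dumps(request["parameters"], indent=2))
```

Because `BASE_PARAMS.copy()` is used, the per-request `truncate` and `max_new_tokens` values do not leak into the shared defaults.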

## Initial Dataset

In the future we will recursively split the input documents into paragraph
chunks and trim the much longer contexts down to roughly 4000 tokens.

For now we have 105 samples with prompt lengths from 305 to 3997 tokens. We
chose 3997 as the maximum to leave room for `max_new_tokens` of 100.
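
The planned context reduction is not implemented in this notebook. As a rough idea of what it might look like, here is a simpler, non-recursive stand-in that keeps leading paragraphs until a token budget is used up. The function `trim_context`, the `"\n\n"` paragraph separator, and the whitespace-based default counter are all assumptions for illustration.

```python
def trim_context(context, budget, count_tokens=lambda s: len(s.split())):
    """Keep leading paragraphs of `context` while the running count fits `budget`.

    `count_tokens` defaults to a whitespace word count as a stand-in for a
    real tokenizer. Returns the trimmed text and the count actually used.
    """
    kept, used = [], 0
    for paragraph in context.split("\n\n"):
        cost = count_tokens(paragraph)
        if used + cost > budget:
            break  # the next paragraph would exceed the budget
        kept.append(paragraph)
        used += cost
    return "\n\n".join(kept), used


context = "one two three\n\nfour five\n\nsix seven eight nine"
trimmed, used = trim_context(context, budget=5)
print(repr(trimmed), used)  # 'one two three\n\nfour five' 5
```

A tokenizer-based counter such as `lambda s: len(tk.encode(s))` could be passed as `count_tokens` instead. Per-paragraph counts would then only be approximate, since the tokenizer may add special tokens on each call.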

In [12]:
if "dataset_all" not in locals():
    dataset_all = pd.read_csv(DATASET_DIR / "dataset_all.csv")

In [14]:
json_requests = dataset_all[dataset_all.prompt_len <= 3997].apply(build_huggingface_json_request, axis=1)

In [15]:
json_requests.iloc[0]

{'inputs': '<s>[INST] <<SYS>>\nYou are an assistant for question-answering tasks. Use the following pieces of retrieved context in the section demarcated by "```" to answer the question. If you don\'t know the answer just say that you don\'t know. Use three sentences maximum and keep the answer concise.\n<</SYS>>\n\n```\nPassage 1:\nHuernia\nThe genus Huernia (family Apocynaceae, subfamily Asclepiadoideae) consists of perennial, stem succulents from Eastern and Southern Africa and Arabia, first described as a genus in 1810.The flowers are five-lobed, usually somewhat more funnel- or bell-shaped than in the closely related genus Stapelia, and often striped vividly in contrasting colors or tones, some glossy, others matte and wrinkled depending on the species concerned. Frequently the flowers are colored a variation of red, yellow or brown. To pollinate, the flowers attract flies by emitting a scent similar to that of carrion. The genus is considered close to the genera Stapelia and Hood

In [17]:
json_requests.to_json(DATASET_DIR / "huggingface-requests-305-3997.jsonl", orient='records', lines=True)
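
To sanity-check a requests file like the one written above, it can be read back one JSON object per line. The sketch below uses only the standard library and a temporary stand-in file with two hypothetical requests; `read_jsonl` is an illustrative helper, not part of the notebook.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of the file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


# Two hypothetical requests standing in for the real file contents.
sample = [
    {"inputs": "prompt A", "parameters": {"truncate": 305, "max_new_tokens": 100}},
    {"inputs": "prompt B", "parameters": {"truncate": 3997, "max_new_tokens": 100}},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "requests.jsonl"
    path.write_text("\n".join(json.dumps(r) for r in sample) + "\n", encoding="utf-8")
    loaded = list(read_jsonl(path))

print(len(loaded), loaded[1]["parameters"]["truncate"])  # 2 3997
```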