Based on a notebook created by Yanli LIU

Using individual PDF or text files to create question-answer pairs


## Step 1 - Install the dependencies

In [4]:
!pip install -e git+https://github.com/BatsResearch/bonito#egg=bonito

Obtaining bonito from git+https://github.com/BatsResearch/bonito#egg=bonito
  Updating ./src/bonito clone
  Running command git fetch -q --tags
  Running command git reset --hard -q ddfe7077a36178e97a32523117cc3213dad06c23
  Preparing metadata (setup.py) ... done
Installing collected packages: bonito
  Attempting uninstall: bonito
    Found existing installation: bonito 0.0.2
    Uninstalling bonito-0.0.2:
      Successfully uninstalled bonito-0.0.2
  Running setup.py develop for bonito
Successfully installed bonito


In [5]:
!pip install datasets huggingface_hub



In [6]:
!pip install pymupdf spacy



In [7]:
!pip install -qU langchain-text-splitters

## Step 2: Processing the PDF document

### 2.1 Extract text

In [61]:

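import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    # Concatenate the plain text of every page in the PDF
    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text() for page in doc)

def extract_text_from_txt(txt_path):
    # Read the whole file; the explicit encoding avoids platform-dependent defaults
    with open(txt_path, 'r', encoding='utf-8') as txt:
        return txt.read()

# Choose one source: uncomment the PDF line to use the PDF instead of the text file
pdf_path = 'beginner-Randy-Fang.pdf'
#text = extract_text_from_pdf(pdf_path)

txt_path = 'Othello-B_and_B.txt'
text = extract_text_from_txt(txt_path)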
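
Instead of commenting lines in and out, the source could be chosen by file extension. The following is a minimal sketch under stated assumptions: `load_text` is a hypothetical helper that is not part of the original notebook, and text files are assumed to be UTF-8 encoded. PyMuPDF is imported only in the PDF branch, so plain-text inputs do not need it.

```python
from pathlib import Path

def load_text(path, encoding="utf-8"):
    """Return the text of a .pdf or .txt file, chosen by file extension."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        import fitz  # PyMuPDF, imported lazily so .txt inputs work without it
        with fitz.open(str(path)) as doc:
            return "".join(page.get_text() for page in doc)
    if suffix == ".txt":
        return path.read_text(encoding=encoding)
    raise ValueError(f"Unsupported file type: {suffix!r}")
```

With this helper, `text = load_text('Othello-B_and_B.txt')` would stand in for the two separate calls.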


### 2.2 Split Text into Sentences and chunks

In [62]:

import spacy

nlp = spacy.load("en_core_web_sm")  # Small English pipeline: tokenizer, tagger, parser, NER (no static word vectors)

def split_into_sentences(text):
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents]
    return sentences

sentences = split_into_sentences(text)




In [63]:
print(len(sentences))

1630


In [64]:
print(sentences[500])

White has a potential free move at h7 in Dia-
gram 29.
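
The sample sentence above contains a word hyphenated across a line break ("Dia-" / "gram"), carried over from the layout of the source text. If that matters for the generated questions, one option is to rejoin such words before sentence splitting. This is a rough sketch, not something the original notebook does; `join_hyphenated_line_breaks` is a hypothetical name, and the rule is deliberately simple, so it would also merge genuine compounds that happen to break at the hyphen.

```python
import re

def join_hyphenated_line_breaks(text):
    """Rejoin words that were split by a hyphen at the end of a line."""
    # A word character, a hyphen, a newline, then another word character
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)
```

It would be applied to `text` before calling `split_into_sentences(text)`.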


In [65]:
# Using the original text, create chunks of at most 1500 characters
# (length_function=len counts characters, not tokens)
# some intuition from https://dgallitelli95.medium.com/serving-fish-for-dinner-using-bonito-v1-on-amazon-sagemaker-to-generate-datasets-for-llm-d8340dee2e85

from langchain_text_splitters import CharacterTextSplitter

# Split the longer text into chunks of 1500 characters max, with a 100-character overlap
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1500,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.split_text(text)
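
Because the splitter is configured with `separator="\n"` and `length_function=len`, it works on whole lines and measures `chunk_size` in characters. The idea can be illustrated with a simplified greedy packer. This is only a sketch of the principle, not LangChain's implementation: `pack_lines` is a hypothetical function, it ignores `chunk_overlap`, and a single line longer than `chunk_size` is kept as one oversized chunk.

```python
def pack_lines(text, chunk_size=1500, separator="\n"):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters."""
    chunks = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Still fits, or the chunk is empty and must take the piece anyway
            current = candidate
        else:
            # Adding the piece would overflow: close the chunk and start a new one
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```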

In [66]:
print(len(texts))

125
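
Besides the chunk count, it can help to look at how long the chunks actually are, since the context passed to the model depends on it. A small sketch using only the standard library; `length_summary` is a hypothetical helper, lengths are in characters, and the list is assumed to be non-empty.

```python
from statistics import mean

def length_summary(chunks):
    """Return the count and the shortest, longest, and mean chunk length in characters."""
    lengths = [len(chunk) for chunk in chunks]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(mean(lengths), 1),
    }
```

Calling `length_summary(texts)` would summarize the chunks produced above.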


In [67]:
print(texts[2])

win the U.S. National Championship. I went on to the World Championship, where I
finished third. With the confidence from my tournament victories and the resources availa-
ble to me as Editor of OQ, I decided that now was an appropriate time to attempt a project
of long-standing interest: to write an introductory handbook on Othello strategy and tac-
tics.
It was clearly something that the USOA needed. It was becoming increasingly diffi-
cult (and unrealistic to expect of new members) to learn the fundamental principles of
Othello strategy by sorting through the ever increasing collection of largely unrelated arti-
cles appearing in the back issues of OQ. A concise, synthesized, easily accessible single
resource was the obvious alternative: a Beginner's Handbook, as it came to be called.
Another potential value of this project would be to serve as an antidote for all the misinfor-
mation about Othello that has been printed in various books over the years. For example, a
recent book des

### 2.3 Create a Hugging Face Dataset
Transform the sentences and chunks into `Dataset` objects from the Hugging Face `datasets` library.

In [68]:
# Controls whether sentences (True) or chunks (False) are used for question-answer generation
USE_SENTENCES = False

In [69]:
from datasets import Dataset


# Assuming sentences is a list of strings, where each string is a sentence
data = {"sentence": sentences}
dataset = Dataset.from_dict(data)
#dataset = Dataset.from_text('books.txt')

print(dataset)


Dataset({
    features: ['sentence'],
    num_rows: 1630
})
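
Sentence splitting of extracted text may leave very short fragments such as headings or diagram labels. If those turn out to be unhelpful as contexts, they could be filtered before building the dataset. A minimal sketch, not part of the original notebook: `build_columns` and the `min_words` threshold are hypothetical choices for illustration.

```python
def build_columns(sentences, min_words=4):
    """Return a column dict for Dataset.from_dict, keeping sentences with at least min_words words."""
    kept = [s for s in sentences if len(s.split()) >= min_words]
    return {"sentence": kept}
```

The result has the same shape as `data` above, so it could be passed to `Dataset.from_dict` in the same way.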


In [70]:
# second dataset for chunks
ch_data = {"sentence": texts}
ch_dataset = Dataset.from_dict(ch_data)

print(ch_dataset)

Dataset({
    features: ['sentence'],
    num_rows: 125
})


## Step 3: Generate a synthetic dataset using Bonito

In [18]:
from bonito import Bonito
from vllm import SamplingParams
from datasets import load_dataset

# Initialize the Bonito model
bonito = Bonito("BatsResearch/bonito-v1")





INFO 09-30 20:43:41 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='BatsResearch/bonito-v1', speculative_config=None, tokenizer='BatsResearch/bonito-v1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=BatsResearch/bonito-v1, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, e



INFO 09-30 20:43:43 model_runner.py:1014] Starting to load model BatsResearch/bonito-v1...
INFO 09-30 20:43:43 selector.py:240] Cannot use FlashAttention-2 backend due to sliding window.
INFO 09-30 20:43:43 selector.py:116] Using XFormers backend.
INFO 09-30 20:43:43 weight_utils.py:242] Using model weights format ['*.bin']


Loading pt checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]




INFO 09-30 20:43:57 model_runner.py:1025] Loading model weights took 13.4966 GB
INFO 09-30 20:44:00 gpu_executor.py:122] # GPU blocks: 9209, # CPU blocks: 2048
INFO 09-30 20:44:02 model_runner.py:1329] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 09-30 20:44:02 model_runner.py:1333] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 09-30 20:44:29 model_runner.py:1456] Graph capturing finished in 27 secs.


In [71]:
# setting dataset for chunks / sentences

if not USE_SENTENCES:
  dataset = ch_dataset

print(dataset)

Dataset({
    features: ['sentence'],
    num_rows: 125
})
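
The cell above relies on `dataset` already holding the sentence dataset from an earlier cell, so the result depends on execution order. A sketch that makes the choice explicit in one place; `choose_contexts` is a hypothetical helper, and its output is a plain column dict intended for `Dataset.from_dict`.

```python
def choose_contexts(sentences, chunks, use_sentences):
    """Return the column dict for either the sentences or the chunks."""
    selected = sentences if use_sentences else chunks
    return {"sentence": list(selected)}
```

For example, `Dataset.from_dict(choose_contexts(sentences, texts, USE_SENTENCES))` would build the dataset for whichever granularity the flag selects.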


In [72]:
# Supported task types [full name (short form)]:
#   extractive question answering (exqa)
#   multiple-choice question answering (mcqa)
#   question generation (qg)
#   question answering without choices (qa)
#   yes-no question answering (ynqa)
#   coreference resolution (coref)
#   paraphrase generation (paraphrase)
#   paraphrase identification (paraphrase_id)
#   sentence completion (sent_comp)
#   sentiment (sentiment)
#   summarization (summarization)
#   text generation (text_gen)
#   topic classification (topic_class)
#   word sense disambiguation (wsd)
#   textual entailment (te)
#   natural language inference (nli)

# Generate a synthetic instruction-tuning dataset from the unannotated text

sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)
synthetic_dataset = bonito.generate_tasks(
    dataset,
    context_col="sentence",
    task_type="qa",
    sampling_params=sampling_params
)

Map:   0%|          | 0/125 [00:00<?, ? examples/s]

Processed prompts: 100%|██████████| 125/125 [00:05<00:00, 22.57it/s, est. speed input: 10409.96 toks/s, output: 1022.70 toks/s]


Filter:   0%|          | 0/125 [00:00<?, ? examples/s]

Map:   0%|          | 0/125 [00:00<?, ? examples/s]
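
The task type is passed as a plain string, so a typo would only surface once generation is attempted. One way to catch it earlier is to check the name against the list from the cell above. This is a sketch: `TASK_TYPES` and `validate_task_type` are hypothetical names, the mapping is copied from the comment in the notebook, and only the short forms are checked.

```python
# Short-form task names and their full names, copied from the list in the notebook
TASK_TYPES = {
    "exqa": "extractive question answering",
    "mcqa": "multiple-choice question answering",
    "qg": "question generation",
    "qa": "question answering without choices",
    "ynqa": "yes-no question answering",
    "coref": "coreference resolution",
    "paraphrase": "paraphrase generation",
    "paraphrase_id": "paraphrase identification",
    "sent_comp": "sentence completion",
    "sentiment": "sentiment",
    "summarization": "summarization",
    "text_gen": "text generation",
    "topic_class": "topic classification",
    "wsd": "word sense disambiguation",
    "te": "textual entailment",
    "nli": "natural language inference",
}

def validate_task_type(task_type):
    """Return task_type unchanged if it is a known short form, else raise ValueError."""
    if task_type not in TASK_TYPES:
        known = ", ".join(sorted(TASK_TYPES))
        raise ValueError(f"Unknown task type {task_type!r}; expected one of: {known}")
    return task_type
```

In the generation call this could be used as `task_type=validate_task_type("qa")`.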

In [73]:
print(synthetic_dataset)

Dataset({
    features: ['input', 'output'],
    num_rows: 125
})


In [74]:
import pandas as pd

df = pd.DataFrame(synthetic_dataset)

df.head(10)


|   | input | output |
|---|-------|--------|
| 0 | Read the following article and answer the ques... | those who play Othello |
| 1 | Read the following article and answer the ques... | 1980 |
| 2 | win the U.S. National Championship. I went on ... | before he went to the World Championship |
| 3 | Read the following article and answer the ques... | To make it easier to find the topics. |
| 4 | Handbook has also benefited from the work of s... | Because it is considerably shorter than the mo... |
| 5 | Read the following context and answer the ques... | He is an expert at Othello |
| 6 | Read the following article and answer the ques... | It is well-received. |
| 7 | Read the following article and answer the ques... | To introduce the book "Othello: Brief & Basic". |
| 8 | Read the following article and answer the ques... | Othello, the popular board game. |
| 9 | Read the following article and answer the ques... | be placed in a way that it can outflank at lea... |
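
Before saving, the generated pairs could be given a basic sanity pass, for example dropping rows with an empty `input` or `output` and removing exact duplicates. The sketch below is not part of the original notebook: `clean_pairs` is a hypothetical helper that works on a list of row dictionaries, and it does not judge whether an answer is correct.

```python
def clean_pairs(rows):
    """Drop rows with an empty input or output and remove exact duplicate pairs."""
    seen = set()
    cleaned = []
    for row in rows:
        question = row["input"].strip()
        answer = row["output"].strip()
        if not question or not answer:
            continue  # skip incomplete pairs
        key = (question, answer)
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        cleaned.append({"input": question, "output": answer})
    return cleaned
```

With the DataFrame above, the rows could be obtained with `df.to_dict("records")`.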


## Step 4: Saving the generated dataset

### 4.1 Authenticate with Hugging Face

In [39]:
from huggingface_hub import notebook_login

notebook_login()


### 4.2 Push the dataset to the Hub

In [75]:
from huggingface_hub import create_repo

repo_name = "othello_b_b_0930"  # Choose a name for your dataset repository
# exist_ok=True lets the cell be re-run without failing if the repository already exists
repo_url = create_repo(repo_name, repo_type="dataset", exist_ok=True)
print("Repository URL:", repo_url)






Repository URL: https://huggingface.co/datasets/lacibacsi/othello_b_b_0930


In [76]:
synthetic_dataset.push_to_hub("lacibacsi/othello_b_b_0930")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

CommitInfo(commit_url='https://huggingface.co/datasets/lacibacsi/othello_b_b_0930/commit/ffa9f7654c0e16435841a5969e9d1dc7f2cb4471', commit_message='Upload dataset', commit_description='', oid='ffa9f7654c0e16435841a5969e9d1dc7f2cb4471', pr_url=None, pr_revision=None, pr_num=None)

In [77]:
df.to_csv('othello_b_b_0930.csv', index=False)
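
The `input` column holds multi-line passages and some answers contain quotation marks, so the CSV file will contain quoted fields that span several lines. It should therefore be read back with a real CSV parser rather than line by line. The sketch below illustrates this with the standard library only; `csv_round_trip` is a hypothetical helper, and the sample row in the usage note is shortened for illustration.

```python
import csv
import io

def csv_round_trip(rows, fieldnames=("input", "output")):
    """Write rows to CSV text in memory and parse them back with the csv module."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(fieldnames))
    writer.writeheader()
    writer.writerows(rows)
    buffer.seek(0)
    return list(csv.DictReader(buffer))
```

A row such as `{"input": "Read the following article.\nSecond line", "output": 'To introduce the book "Othello: Brief & Basic".'}` passes through `csv_round_trip` with its embedded newline and quotes preserved.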