This notebook is a first exploration of generating an instruction-tuning dataset.

References:
* https://towardsdatascience.com/how-to-generate-instruction-datasets-from-any-documents-for-llm-fine-tuning-abb319a05d91
* https://colab.research.google.com/drive/1XuDRVKpUUqdjrqg2-P2FIqkdAQBnqoNL?usp=sharing

The source document is a dark matter review paper: https://arxiv.org/pdf/2104.11488.pdf

In [1]:
!pip install -e git+https://github.com/BatsResearch/bonito#egg=bonito
!pip install datasets huggingface_hub
!pip install pymupdf spacy

Obtaining bonito from git+https://github.com/BatsResearch/bonito#egg=bonito
  Cloning https://github.com/BatsResearch/bonito to ./src/bonito
  Running command git clone --filter=blob:none --quiet https://github.com/BatsResearch/bonito /content/src/bonito
  Resolved https://github.com/BatsResearch/bonito to commit 176cfff5a19cfeea03382506afd869dd9a22afaf
  Preparing metadata (setup.py) ... done
Collecting datasets (from bonito)
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting vllm (from bonito)
  Downloading vllm-0.4.0-cp310-cp310-manylinux1_x86_64.whl (72.3 MB)
Collecting dill<0.3.9,>=0.3.0 (from datasets->bonito)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)

In [2]:
!git clone https://github.com/BatsResearch/bonito.git
!pip install -U bonito/

Cloning into 'bonito'...
remote: Enumerating objects: 91, done.
remote: Counting objects: 100% (46/46), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 91 (delta 36), reused 26 (delta 26), pack-reused 45
Receiving objects: 100% (91/91), 784.34 KiB | 18.24 MiB/s, done.
Resolving deltas: 100% (38/38), done.
Processing ./bonito
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: bonito
  Building wheel for bonito (setup.py) ... done
  Created wheel for bonito: filename=bonito-0.0.2-py3-none-any.whl size=4585 sha256=05fac2c3466ca1091f89f88d9f3d7b1d442a81ea39b4c31f1a95e62250e50134
  Stored in directory: /tmp/pip-ephem-wheel-cache-jue2boe5/wheels/c5/fe/de/e0c4849775dee927ba7352098bc3e060482a9ac937dde7f9a3
Successfully built bonito
Installing collected packages: bonito
  Attempting uninstall: bonito
    Found existing installation: bonito 0.0.2
    Uninstalling bonito-0.0.2:
      Successfully uninstalled bo

In [3]:
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    """Return the plain text of every page in the PDF, concatenated in page order."""
    with fitz.open(pdf_path) as doc:  # close the file handle when done
        return "".join(page.get_text() for page in doc)

pdf_path = '/content/dm_review.pdf'
text = extract_text_from_pdf(pdf_path)

In [4]:
text[:1000]

'Dark matter and the early Universe: a review\nA. Arbey and F. Mahmoudi\nUniv Lyon, Univ Claude Bernard Lyon 1, CNRS/IN2P3,\nInstitut de Physique des 2 Inﬁnis de Lyon, UMR 5822, 69622 Villeurbanne, France\nTheoretical Physics Department, CERN, CH-1211 Geneva 23, Switzerland\nInstitut Universitaire de France, 103 boulevard Saint-Michel, 75005 Paris, France\nAbstract\nDark matter represents currently an outstanding problem in both cosmology and\nparticle physics. In this review we discuss the possible explanations for dark matter\nand the experimental observables which can eventually lead to the discovery of dark\nmatter and its nature, and demonstrate the close interplay between the cosmological\nproperties of the early Universe and the observables used to constrain dark matter\nmodels in the context of new physics beyond the Standard Model.\n1\narXiv:2104.11488v1  [hep-ph]  23 Apr 2021\nContents\n1\nIntroduction\n3\n2\nStandard Cosmological Model\n3\n2.1\nFriedmann-Lemaˆ\nıtre-Robertso
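
The preview above still carries PDF extraction artifacts, such as the "ﬁ" ligature in "Inﬁnis" and hard line breaks inside sentences. A minimal cleanup sketch follows, assuming NFKC normalization is acceptable for this text (it also folds superscripts and other compatibility characters, which may not be wanted for formulas). `clean_extracted_text` is a hypothetical helper, not part of PyMuPDF.

```python
import re
import unicodedata

def clean_extracted_text(raw: str) -> str:
    """Hypothetical helper: tidy text extracted from a PDF before sentence splitting."""
    # NFKC folds compatibility characters, e.g. the single-glyph "ﬁ" ligature, into plain letters.
    text = unicodedata.normalize("NFKC", raw)
    # Re-join words that were hyphenated across a line break ("explana-\ntion").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn the remaining line breaks (and the whitespace around them) into single spaces.
    text = re.sub(r"\s*\n\s*", " ", text)
    return text.strip()
```

Calling `text = clean_extracted_text(text)` before sentence splitting would be one place to apply it. The hyphen rule is a heuristic: it would also merge genuinely hyphenated names that happen to break at a line end.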

In [5]:
# Split text into sentences

import spacy

nlp = spacy.load("en_core_web_sm")  # Small English pipeline: tokenizer, tagger, parser, NER (no static word vectors)

def split_into_sentences(text):
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents]
    return sentences

sentences = split_into_sentences(text)

In [6]:
len(sentences)

1322

In [7]:
sentences[200]

'In particular dark\nmatter is generally considered as collisionless contrary to baryonic matter, so that it is\npossible to distinguish the role of dark matter from the one of baryonic matter in the\nsimulations.'

In [8]:
# Create a Hugging Face Dataset (from the `datasets` library)

from datasets import Dataset

# Assuming sentences is a list of strings, where each string is a sentence
data = {"sentence": sentences}
dataset = Dataset.from_dict(data)

print(dataset)

Dataset({
    features: ['sentence'],
    num_rows: 1322
})


In [9]:
dataset.to_pandas()  # the split sentences from the paper

Unnamed: 0,sentence
0,Dark matter and the early Universe: a review\n...
1,In this review we discuss the possible explana...
2,1\narXiv:2104.11488v1
3,[hep-ph] 23 Apr 2021\nContents\n1\nIntroducti...
4,. . . . . . . . . . . . . . .
...,...
1317,"arXiv:0803.0741, doi:\n10.1016/j.physletb.2008..."
1318,"[164] T. Barreiro, E. J. Copeland, N. Nunes, Q..."
1319,"arXiv:astro-ph/9910214, doi:10.1103/\nPhysRevD..."
1320,"[165] A. Arbey, F. Mahmoudi, SUSY Constraints,..."
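
The table above shows that many of the "sentences" are not prose at all: table-of-contents dot leaders, arXiv stamps, and reference entries. One way to drop them before generation is a heuristic filter like the sketch below. The function name, the thresholds, and the patterns are all assumptions chosen for illustration, not values tuned on this paper.

```python
import re

def looks_like_prose(sentence: str, min_words: int = 6, min_alpha_ratio: float = 0.6) -> bool:
    """Hypothetical heuristic: keep sentences that look like running text."""
    words = sentence.split()
    if len(words) < min_words:
        return False  # too short: page numbers, headings, stray fragments
    if re.search(r"(\. ){3,}", sentence):
        return False  # dotted leaders from the table of contents
    if re.search(r"arXiv:|doi:", sentence):
        return False  # bibliography entries and the arXiv stamp
    # Share of alphabetic characters among all non-whitespace characters.
    letters = sum(ch.isalpha() for ch in sentence)
    non_space = sum(not ch.isspace() for ch in sentence)
    return letters / non_space >= min_alpha_ratio
```

With a `datasets.Dataset`, this could be applied as `dataset.filter(lambda row: looks_like_prose(row["sentence"]))`. A filter this blunt will also discard some legitimate sentences, for example ones that cite an arXiv identifier inline.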


In [10]:
# Generate synthetic dataset using Bonito

from bonito import Bonito
from vllm import SamplingParams

# Initialize the Bonito model
bonito = Bonito("BatsResearch/bonito-v1")



config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

INFO 04-01 06:54:44 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model='BatsResearch/bonito-v1', tokenizer='BatsResearch/bonito-v1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)


tokenizer_config.json:   0%|          | 0.00/953 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/42.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/145 [00:00<?, ?B/s]

INFO 04-01 06:54:46 selector.py:45] Cannot use FlashAttention because the package is not found. Please install it for better performance.
INFO 04-01 06:54:46 selector.py:21] Using XFormers backend.
INFO 04-01 06:54:48 weight_utils.py:177] Using model weights format ['*.bin']


pytorch_model-00002-of-00002.bin:   0%|          | 0.00/5.06G [00:00<?, ?B/s]

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

INFO 04-01 06:57:56 model_runner.py:104] Loading model weights took 13.4966 GB
INFO 04-01 06:57:59 gpu_executor.py:94] # GPU blocks: 9133, # CPU blocks: 2048
INFO 04-01 06:58:01 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-01 06:58:01 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 04-01 06:58:09 model_runner.py:867] Graph capturing finished in 8 secs.


In [11]:
# Generate a synthetic instruction-tuning dataset from the unannotated sentences.
#
# Supported task types [full name (short form)]:
#   extractive question answering (exqa),
#   multiple-choice question answering (mcqa),
#   question generation (qg),
#   question answering without choices (qa),
#   yes-no question answering (ynqa),
#   coreference resolution (coref),
#   paraphrase generation (paraphrase),
#   paraphrase identification (paraphrase_id),
#   sentence completion (sent_comp),
#   sentiment (sentiment),
#   summarization (summarization),
#   text generation (text_gen),
#   topic classification (topic_class),
#   word sense disambiguation (wsd),
#   textual entailment (te),
#   natural language inference (nli)

# Start with qa: question answering without choices

sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)

synthetic_dataset = bonito.generate_tasks(
    dataset,
    context_col="sentence",
    task_type="qa",
    sampling_params=sampling_params
)

Map:   0%|          | 0/1322 [00:00<?, ? examples/s]

Processed prompts: 100%|██████████| 1322/1322 [00:20<00:00, 63.27it/s] 


Filter:   0%|          | 0/1322 [00:00<?, ? examples/s]

Map:   0%|          | 0/1322 [00:00<?, ? examples/s]

In [12]:
print(synthetic_dataset)

Dataset({
    features: ['input', 'output'],
    num_rows: 1322
})


In [13]:
from pprint import pprint

pprint("----Generated Instructions----")
pprint(f'Input: {synthetic_dataset[0]["input"]}')
pprint(f'Output: {synthetic_dataset[0]["output"]}')

'----Generated Instructions----'
('Input: Read the following article and answer the question.\n'
 'Article: Dark matter and the early Universe: a review\n'
 'A. Arbey and F. Mahmoudi\n'
 'Univ Lyon, Univ Claude Bernard Lyon 1, CNRS/IN2P3,\n'
 'Institut de Physique des 2 Inﬁnis de Lyon, UMR 5822, 69622 Villeurbanne, '
 'France\n'
 'Theoretical Physics Department, CERN, CH-1211 Geneva 23, Switzerland\n'
 'Institut Universitaire de France, 103 boulevard Saint-Michel, 75005 Paris, '
 'France\n'
 'Abstract\n'
 'Dark matter represents currently an outstanding problem in both cosmology '
 'and\n'
 'particle physics.\n'
 'Question: What is the purpose of the passage?\n'
 'Answer:')
'Output: To introduce the topic of the paper.'


In [25]:
pprint("----Generated Instructions----")
pprint(f'Input: {synthetic_dataset[100]["input"]}')
pprint(f'Output: {synthetic_dataset[100]["output"]}')

'----Generated Instructions----'
('Input: Read the following context and answer the question.\n'
 'Context: The standard cosmological model oﬀers a simple framework to study '
 'the evolution of\n'
 'the Universe, but it does not describe the other phenomena which may have '
 'occurred in the\n'
 'early Universe, for which speciﬁc models are required.\n'
 'Question: What is the standard cosmological model?\n'
 'Answer:')
'Output: not enough information'


In [26]:
pprint("----Generated Instructions----")
pprint(f'Input: {synthetic_dataset[140]["input"]}')
pprint(f'Output: {synthetic_dataset[140]["output"]}')

'----Generated Instructions----'
('Input: Read the following context and answer the question.\n'
 'Context: There is an excellent agreement for hydrogen, deuterium and helium, '
 'whereas there is a\n'
 'large discrepancy for lithium-7: this is the well-known lithium problem\n'
 'Question: What is the reason for the large discrepancy in lithium-7?\n'
 'Answer:')
'Output: The agreement is not good for lithium-7'


In [27]:
pprint("----Generated Instructions----")
pprint(f'Input: {synthetic_dataset[200]["input"]}')
pprint(f'Output: {synthetic_dataset[200]["output"]}')

'----Generated Instructions----'
('Input: In particular dark\n'
 'matter is generally considered as collisionless contrary to baryonic matter, '
 'so that it is\n'
 'possible to distinguish the role of dark matter from the one of baryonic '
 'matter in the\n'
 'simulations.\n'
 '\n'
 'Q: What is a type of matter that is not collisionless?\n'
 '\n'
 'A:')
'Output: baryonic'


In [24]:
import pandas as pd

df = pd.DataFrame(synthetic_dataset)

In [20]:
df.iloc[8]['input'], df.iloc[8]['output']

('On a scale of 1-5 (with 1 being least favorable and 5 being most favorable), how would you rate this review? "8\n3\nDark matter(s)\n9\n3.1\nObservational evidences\n. . . . . . . . . . . . . . . . . . . . . . . . . . . . ."',
 '5')
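
Row 8 shows a rating-style prompt generated from table-of-contents debris, and row 100 earlier answers "not enough information". A post-generation filter could remove such pairs. The sketch below is a heuristic built only from the samples printed in this notebook: it assumes a usable QA prompt contains a line starting with `Question:` or `Q:`, which may not hold for every prompt template the model produces. `keep_pair` and `banned_outputs` are hypothetical names.

```python
import re

def keep_pair(example: dict, banned_outputs=("not enough information",)) -> bool:
    """Hypothetical filter: keep a generated pair only if it looks like a usable QA item."""
    answer = example["output"].strip().lower()
    if not answer or answer in banned_outputs:
        return False  # empty or explicitly uninformative answer
    # Assumption: QA prompts mark the question with a "Question:" or "Q:" line.
    return re.search(r"(?m)^(?:Question|Q):", example["input"]) is not None
```

Applied to the generated data, this would read `synthetic_dataset.filter(keep_pair)`. It is worth inspecting what gets dropped before trusting the rule.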

In [21]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [22]:
from huggingface_hub import create_repo

repo_name = "dark_matter_instruction_qa"  # Choose a name for your dataset repository
repo_url = create_repo(repo_name, repo_type="dataset")
print("Repository URL:", repo_url)

Repository URL: https://huggingface.co/datasets/delayedkarma/dark_matter_instruction_qa


In [23]:
synthetic_dataset.push_to_hub("delayedkarma/dark_matter_instruction_qa")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

CommitInfo(commit_url='https://huggingface.co/datasets/delayedkarma/dark_matter_instruction_qa/commit/e54f4a208cc9d7f13fb44cbb9062e788b15ef98b', commit_message='Upload dataset', commit_description='', oid='e54f4a208cc9d7f13fb44cbb9062e788b15ef98b', pr_url=None, pr_revision=None, pr_num=None)

Next, rerun the generation with a fixed random seed. The sampling parameters are the same as in the first run.

In [28]:
from transformers import set_seed
set_seed(42)

sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)

synthetic_dataset = bonito.generate_tasks(
    dataset,
    context_col="sentence",
    task_type="qa",
    sampling_params=sampling_params
)



Map:   0%|          | 0/1322 [00:00<?, ? examples/s]

Processed prompts: 100%|██████████| 1322/1322 [00:20<00:00, 63.12it/s] 


Filter:   0%|          | 0/1322 [00:00<?, ? examples/s]

Map:   0%|          | 0/1322 [00:00<?, ? examples/s]

In [29]:
synthetic_dataset.push_to_hub("delayedkarma/dark_matter_instruction_qa")  # after setting the seed

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/307 [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/datasets/delayedkarma/dark_matter_instruction_qa/commit/8244fc5d5683e4610d49f7cadc2de2c5b8ee2e29', commit_message='Upload dataset', commit_description='', oid='8244fc5d5683e4610d49f7cadc2de2c5b8ee2e29', pr_url=None, pr_revision=None, pr_num=None)
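
Whether setting the seed actually makes generation repeatable can be checked by comparing the outputs of two runs row by row. A small sketch, where `fraction_identical` is a hypothetical helper and the two inputs are the `output` columns of two generated datasets:

```python
def fraction_identical(outputs_a, outputs_b) -> float:
    """Return the share of rows whose generated outputs match exactly across two runs."""
    if len(outputs_a) != len(outputs_b):
        raise ValueError("Both runs must have the same number of rows")
    if len(outputs_a) == 0:
        return 1.0  # nothing to compare, so nothing differs
    same = sum(a == b for a, b in zip(outputs_a, outputs_b))
    return same / len(outputs_a)
```

For instance, keeping the first result in a separate variable and calling `fraction_identical(first_run["output"], second_run["output"])` would show how much the two runs differ; a value of 1.0 means every row matched.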