# Small Language Model with Reinforcement Learning
- The goal of this project is to demonstrate the importance of reinforcement learning and to evaluate how it performs on a small language model.

## Important tasks:
- Pretrain a small language model (10-100 million parameters)
- Build a reinforcement learning from human feedback (RLHF) pipeline

### What is a small language model (SLM)?
- An SLM uses the same transformer architecture as a large language model (LLM); the difference lies in its much smaller parameter count (here 10-100 million) and its narrower training dataset. Its sole purpose is to be pretrained for a specific task, whereas an LLM can handle many tasks. A good-quality dataset is therefore important for our SLM to capture the grammar and context of a sentence and the task.

### Dataset
- For simplicity, we use the TinyStories dataset introduced in this paper: https://arxiv.org/abs/2305.07759. The dataset is available on Hugging Face.

### Step 1: Import Dataset
TinyStories is a synthetic dataset of short stories, generated by GPT-3.5 and GPT-4, that only contain words a typical 3- to 4-year-old usually understands. We can get it from Hugging Face.

In [None]:
!pip install datasets

In [2]:
from datasets import load_dataset

In [None]:
ds = load_dataset("roneneldan/TinyStories")

- Peek into the dataset

In [None]:
print('Total number of training samples:', len(ds['train']))
print('Total number of validation samples:', len(ds['validation']))
print("\n============Training Sample==============\n")
print(ds['train'][0])
print("\n============Validation Sample============\n")
print(ds['validation'][0])

## Next Steps:

1.  **Tokenization**: Convert the text data into numerical tokens that the language model can understand.
2.  **Data Preparation**: Create data loaders or tensors for training and evaluation.
3.  **Model Definition**: Define the architecture of your small language model.
4.  **Training Setup**: Set up the training loop, including optimizer, loss function, and training parameters.
5.  **Training**: Train the model on the prepared dataset.
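Before defining the model, it can help to check that a candidate architecture lands in the 10-100 million parameter range targeted above. The sketch below assumes a GPT-2-style decoder (learned positional embeddings, biases on every linear layer, a 4x MLP expansion, and an output head tied to the token embedding). The function name `gpt_param_count` and the example configuration are hypothetical and not taken from this project.

```python
def gpt_param_count(vocab_size, block_size, n_layer, n_embd):
    """Approximate parameter count of a GPT-2-style decoder with a tied output head."""
    d = n_embd
    embeddings = vocab_size * d + block_size * d   # token + learned position embeddings
    attention = (3 * d * d + 3 * d) + (d * d + d)  # fused QKV projection + output projection
    mlp = (4 * d * d + 4 * d) + (4 * d * d + d)    # feed-forward: d -> 4d -> d
    layer_norms = 2 * (2 * d)                      # two LayerNorms per block (scale + bias)
    final_norm = 2 * d                             # LayerNorm before the output head
    return embeddings + n_layer * (attention + mlp + layer_norms) + final_norm

# Hypothetical small configuration using the GPT-2 vocabulary (50257 tokens).
small = gpt_param_count(vocab_size=50257, block_size=128, n_layer=6, n_embd=384)
print(f"{small / 1e6:.1f}M parameters")
```

Under these assumptions the same formula reproduces the commonly quoted 124M size of GPT-2 small (12 layers, width 768, context 1024). At this scale the token embedding table accounts for a large share of the total, so the vocabulary size matters as much as depth and width.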

### Tokenization
- Convert text to token IDs.
- Write all token IDs to a `.bin` file in batches.
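Storing token IDs in a flat binary file only works if the integer type is wide enough for the largest ID. The sketch below checks which unsigned dtype is sufficient, assuming the GPT-2 BPE vocabulary of 50257 tokens; the helper `smallest_uint_dtype` is a hypothetical name introduced for illustration.

```python
import numpy as np

# Assumption: GPT-2 BPE vocabulary, token ids 0..50256.
GPT2_VOCAB_SIZE = 50257

def smallest_uint_dtype(vocab_size):
    """Return the narrowest unsigned integer dtype that can hold every token id."""
    for dtype in (np.uint8, np.uint16, np.uint32, np.uint64):
        if vocab_size - 1 <= np.iinfo(dtype).max:
            return dtype
    raise ValueError("vocabulary too large for a 64-bit id")

dtype = smallest_uint_dtype(GPT2_VOCAB_SIZE)
bytes_per_token = np.dtype(dtype).itemsize
print(dtype.__name__, bytes_per_token)
```

Because 50256 fits below the 16-bit limit of 65535, each token costs two bytes on disk. A tokenizer with a vocabulary larger than 65536 entries would need a wider type.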

In [None]:
!pip install tiktoken
import tiktoken
import os
import numpy as np
from tqdm.auto import tqdm  # import the callable; `import tqdm.auto as tqdm` binds the module, which cannot be called

# Initialize the tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

def process_sample(sample):
    text = sample["text"]
    tokens = tokenizer.encode_ordinary(text) # Use encode_ordinary to avoid adding special tokens
    return {'ids': tokens, 'len': len(tokens)}

# write all tokens to the memory mapped file
if not os.path.exists('train.bin'):
    # Generate tokenized dataset
    tokenized_ds = ds.map(process_sample, remove_columns=['text'], num_proc=8)
    
    for split, dataset in tokenized_ds.items():
        
        token_len = np.sum(dataset['len'], dtype=np.uint64)
        print(f'Total token length in {split} data: {token_len}')
        
        # uint16 is enough because every GPT-2 token id is below 65536
        arr = np.memmap(f"{split}.bin", dtype=np.uint16, mode='w+', shape=(token_len,))
        total_batches = 1024
        idx = 0  # write position inside the memory-mapped array
        for batch_id in tqdm(range(total_batches), desc=f'Writing to {split}.bin'):
            # write the token ids to the memory map in batches
            batch = dataset.shard(num_shards=total_batches, index=batch_id, contiguous=True).with_format('numpy')
            batch_ids = np.concatenate(batch['ids'])
            arr[idx : idx+len(batch_ids)] = batch_ids
            idx += len(batch_ids)
        arr.flush()
        print(f'{split} data written to {split}.bin')
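Once the `.bin` files exist, the data preparation step can read them back with `np.memmap` and cut random training windows without loading the whole file into memory. The following is a minimal sketch under stated assumptions: it writes a toy file of consecutive ids to a temporary directory instead of using the real `train.bin`, and `get_batch`, `batch_size`, and `block_size` are hypothetical names, not part of the notebook.

```python
import os
import tempfile
import numpy as np

def get_batch(data, batch_size, block_size, rng):
    """Sample input windows x and next-token targets y (x shifted by one position)."""
    # Upper bound is exclusive, so every window plus its one-token shift fits in data.
    starts = rng.integers(0, len(data) - block_size, size=batch_size)
    x = np.stack([data[s : s + block_size] for s in starts]).astype(np.int64)
    y = np.stack([data[s + 1 : s + 1 + block_size] for s in starts]).astype(np.int64)
    return x, y

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.bin")
    # Toy stand-in for the real token file: ids 0..999 stored as uint16.
    np.arange(1000, dtype=np.uint16).tofile(path)

    # Memory-map the file read-only; slices are read from disk on demand.
    data = np.memmap(path, dtype=np.uint16, mode="r")
    x, y = get_batch(data, batch_size=4, block_size=8, rng=np.random.default_rng(0))
    del data  # release the mapping before the temporary directory is removed

print(x.shape, y.shape)
```

The targets are the inputs shifted by one token, which is the usual setup for next-token prediction. The arrays are cast to `int64` because deep learning frameworks typically expect that type for embedding lookups.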