# Conversational

This notebook covers:

    I. Presentation: the conversational task and the data structure
    II. Realization: fine-tuning a causal language model and running inference

## I. Presentation

### 1. Definition

The conversational task aims to generate text in response to human input. The conversation can take the form of:

* question answering: one solution to this task is indexing
* casual conversation
* functional conversation

In this context, we focus on generative question-answering conversation.

It belongs to the same family of tasks as text generation and text summarization. The model used belongs to the Causal Language Modeling (CLM) family.

A CLM model is usually a decoder that generates a single token at each iteration, conditioned on all previous tokens.
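
As a minimal sketch of this token-by-token loop, the example below uses a hypothetical toy scoring function (`toy_next_token_scores`) in place of a real decoder. It is only meant to illustrate the idea of appending the most likely token and feeding the whole sequence back in; it is not how any particular library implements generation.

```python
def toy_next_token_scores(context):
    """Hypothetical stand-in for a decoder: score each vocabulary token given the full context."""
    vocab = ["hello", "world", "!", "<eos>"]
    # Deterministic toy rule: prefer the token whose index equals the context length.
    target = min(len(context), len(vocab) - 1)
    return {tok: (1.0 if i == target else 0.0) for i, tok in enumerate(vocab)}


def greedy_generate(prompt_tokens, max_new_tokens=10, eos_token="<eos>"):
    """Generate one token per iteration, always conditioning on all previous tokens."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = toy_next_token_scores(tokens)
        next_token = max(scores, key=scores.get)  # greedy choice
        tokens.append(next_token)
        if next_token == eos_token:
            break
    return tokens


generated = greedy_generate(["hello"])
print(generated)
```

Generation stops either when the end-of-sequence token is produced or when `max_new_tokens` is reached.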

For conversational applications, the dialog can consist of independent human-ML pairs, or of recurrent pairs in which the current pair may depend on previous pairs.
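
To make the difference concrete, here is a small sketch of how the two kinds of dialog could be turned into a prompt. The `build_prompt` helper is hypothetical, and the way previous turns are joined is an assumption; only the `Human: ... Assistant: ` template for a single pair comes from this notebook.

```python
def build_prompt(history, user_message):
    """Join previous (human, assistant) pairs and the new message into one prompt.

    `history` is a list of (human, assistant) string pairs; an empty list gives
    an independent, single-turn prompt.
    """
    parts = []
    for human, assistant in history:
        parts.append(f"Human: {human}\n\nAssistant: {assistant}")
    parts.append(f"Human: {user_message}\n\nAssistant: ")
    return "\n\n".join(parts)


# Independent pair: the prompt contains only the current question.
single_turn = build_prompt([], "What is the capital of France?")

# Recurrent pairs: the previous exchange is part of the prompt.
multi_turn = build_prompt(
    [("What is the capital of France?", "Paris.")],
    "And of Italy?",
)
print(multi_turn)
```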


### 2. Data structure

In this notebook we focus on unrelated dialogs. Thus, the data structure is:

    _________________________________________________________________________
       prompt/input/instruction           |         output/response     |eos|
    -------------------------------------------------------------------------
                                            |   |   |   |   |   |   |   |  
                                            --  --  --  --  --  --  --  --
                                              |   |   |   |   |   |   |   |
    _________________________________________________________________________
                   -100                   |        output/response      |eos|
    -------------------------------------------------------------------------

So for each pair, we concatenate the input and the output into a single sequence, and the prompt tokens are masked with -100 so that they are excluded from the loss calculation.
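
The layout above can be sketched with plain lists. The token ids below are made up for illustration and `build_example` is a hypothetical helper; the real tokenization is done later in the notebook.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def build_example(prompt_ids, response_ids, eos_id):
    """Concatenate prompt and response, and mask the prompt in the labels."""
    input_ids = prompt_ids + response_ids + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids + [eos_id]
    return input_ids, labels


# Made-up ids: three prompt tokens, two response tokens, eos id 2.
input_ids, labels = build_example([11, 12, 13], [21, 22], eos_id=2)
print(input_ids)  # [11, 12, 13, 21, 22, 2]
print(labels)     # [-100, -100, -100, 21, 22, 2]
```

Both lists have the same length, so every input position has a label, but only the response and the eos token contribute to the loss.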


## II. Realization

In [1]:
# Select which GPU(s) to use.
# This machine has 2 GPUs and only one should be used, so this cell must run first,
# before any library initializes CUDA.
# Skip this cell if you don't need it.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # or "0,1" for multiple GPUs
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [19]:
## define repos for data and model

# data

ckp_data = "yahma/alpaca-cleaned"

# model

ckp = "bigscience/bloom-560m"

### 1. import

In [3]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForSeq2Seq, TrainingArguments, Trainer


### 2. load dataset

In [4]:
data = load_dataset(ckp_data, split="train[:1000]")
data

Dataset({
    features: ['output', 'instruction', 'input'],
    num_rows: 1000
})

In [10]:
data[0]

{'output': '1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.',
 'instruction': 'Give three tips for staying healthy.',
 'input': ''}

### 3. split data

In [11]:
split_data = data.train_test_split(test_size=0.2)
split_data

DatasetDict({
    train: Dataset({
        features: ['output', 'instruction', 'input'],
        num_rows: 800
    })
    test: Dataset({
        features: ['output', 'instruction', 'input'],
        num_rows: 200
    })
})

### 4. tokenization

In [12]:
tokenizer = AutoTokenizer.from_pretrained(ckp)
tokenizer

BloomTokenizerFast(name_or_path='bigscience/bloom', vocab_size=250680, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [13]:
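def process(sample):
    """Tokenize one sample into input_ids, attention_mask and labels."""

    MAX_LEN = 256  # maximum sequence length, in tokens

    # Prompt: instruction and optional input, wrapped in the Human/Assistant template.
    human = tokenizer("Human: " + "\n".join([sample["instruction"], sample["input"]]).strip() + "\n\nAssistant: ")
    # Response: the expected output followed by the end-of-sequence token.
    ml = tokenizer(sample["output"] + tokenizer.eos_token)

    input_ids = human["input_ids"] + ml["input_ids"]
    attention_mask = human["attention_mask"] + ml["attention_mask"]
    # Mask the prompt tokens with -100 so that they are ignored by the loss.
    labels = [-100] * len(human["input_ids"]) + ml["input_ids"]

    # Truncate long sequences (this may cut off the trailing eos token).
    if len(input_ids) > MAX_LEN:
        input_ids = input_ids[:MAX_LEN]
        attention_mask = attention_mask[:MAX_LEN]
        labels = labels[:MAX_LEN]

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels
    }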

In [16]:
tokenized_data = split_data.map(process, remove_columns=split_data["train"].column_names)

In [17]:
tokenizer.decode(tokenized_data["train"][0]["input_ids"])

'Human: Classify the relationship between John and Mary.\nJohn and Mary are siblings.\n\nAssistant: The relationship between John and Mary is that of siblings. They share a familial bond where they have common parents.</s>'

In [18]:
tokenizer.decode(list(filter(lambda x: x != -100, tokenized_data["train"][0]["labels"])))

'The relationship between John and Mary is that of siblings. They share a familial bond where they have common parents.</s>'
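
The check above shows that only the response survives in the labels. As a rough sketch of how such labels enter the loss, the pure-Python function below averages the negative log-likelihood over the non-masked positions, shifting by one so that the prediction at position t is scored against the label at position t + 1. The function name and the toy probabilities are illustrative assumptions; in practice the model computes this internally.

```python
import math

IGNORE_INDEX = -100


def causal_lm_loss(log_probs, labels):
    """Mean negative log-likelihood over supervised positions.

    log_probs[t] is a dict {token_id: log-probability} predicted at position t,
    i.e. the distribution for the token at position t + 1.
    """
    total, count = 0.0, 0
    for t in range(len(labels) - 1):
        target = labels[t + 1]  # shift by one position
        if target == IGNORE_INDEX:
            continue  # masked prompt token: no contribution to the loss
        total -= log_probs[t][target]
        count += 1
    return total / count


# Two masked prompt positions, then one response token (21) and eos (2).
labels = [-100, -100, 21, 2]
uniform = {21: math.log(0.5), 2: math.log(0.5)}  # toy prediction at every position
loss = causal_lm_loss([uniform] * len(labels), labels)
print(loss)
```

With a 50/50 toy prediction at every position, the loss equals log 2, and the two masked positions do not change it.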

### 5. load model

In [20]:
model = AutoModelForCausalLM.from_pretrained(ckp)
model

BloomForCausalLM(
  (transformer): BloomModel(
    (word_embeddings): Embedding(250880, 1024)
    (word_embeddings_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (h): ModuleList(
      (0-23): 24 x BloomBlock(
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): BloomAttention(
          (query_key_value): Linear(in_features=1024, out_features=3072, bias=True)
          (dense): Linear(in_features=1024, out_features=1024, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): BloomMLP(
          (dense_h_to_4h): Linear(in_features=1024, out_features=4096, bias=True)
          (gelu_impl): BloomGelu()
          (dense_4h_to_h): Linear(in_features=4096, out_features=1024, bias=True)
        )
      )
    )
    (ln_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (

### 6. metrics

### 7. train args

In [23]:
args = TrainingArguments(
    output_dir="../tmp/checkpoint",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,
    logging_steps=10,
    num_train_epochs=3
)

### 8. trainer

In [24]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_data["train"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True)
)
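
The collator is needed because the tokenized samples have different lengths. The sketch below imitates the padding step in plain Python, assuming left padding and pad token id 3 (both taken from the tokenizer printout above) and -100 as the label padding value. `pad_batch` is a hypothetical helper for illustration, not the library's implementation.

```python
def pad_batch(features, pad_token_id, label_pad_id=-100, padding_side="left"):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        fillers = {
            "input_ids": [pad_token_id] * n_pad,
            "attention_mask": [0] * n_pad,      # padded positions are not attended to
            "labels": [label_pad_id] * n_pad,   # padded positions are ignored by the loss
        }
        for key, filler in fillers.items():
            row = filler + f[key] if padding_side == "left" else f[key] + filler
            batch[key].append(row)
    return batch


# Made-up token ids for two samples of different lengths.
features = [
    {"input_ids": [11, 12, 21, 2], "attention_mask": [1, 1, 1, 1], "labels": [-100, -100, 21, 2]},
    {"input_ids": [13, 22, 2], "attention_mask": [1, 1, 1], "labels": [-100, 22, 2]},
]
batch = pad_batch(features, pad_token_id=3)
print(batch["input_ids"])
print(batch["labels"])
```

Padding the labels with -100 rather than the pad token id keeps the padded positions out of the loss, in the same way as the masked prompt.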

### 9. train

In [25]:
trainer.train()

Step,Training Loss
10,2.2059
20,1.6233
30,1.1851


TrainOutput(global_step=36, training_loss=1.5478614038891263, metrics={'train_runtime': 172.249, 'train_samples_per_second': 13.933, 'train_steps_per_second': 0.209, 'total_flos': 1047450880770048.0, 'train_loss': 1.5478614038891263, 'epoch': 2.88})
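
The reported `global_step=36` and `epoch` of 2.88 can be reconstructed from the training arguments. The arithmetic below assumes a single visible GPU and that incomplete accumulation groups at the end of an epoch are dropped from the step count; that is an assumption that happens to match the numbers above, and other library versions may count differently.

```python
import math

num_train_samples = 800
per_device_train_batch_size = 8
gradient_accumulation_steps = 8
num_train_epochs = 3

# Samples seen per optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Mini-batches per epoch, and optimizer updates per epoch (floored, by assumption).
batches_per_epoch = math.ceil(num_train_samples / per_device_train_batch_size)
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_updates = num_train_epochs * updates_per_epoch

# Fraction of the data actually consumed, expressed in epochs.
reported_epoch = total_updates * gradient_accumulation_steps / batches_per_epoch

print(effective_batch_size, batches_per_epoch, updates_per_epoch, total_updates, reported_epoch)
```

So each optimizer update sees 64 samples, and with `logging_steps=10` only three loss values (steps 10, 20 and 30) are logged before training ends at step 36.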

### 10. inference

In [26]:
from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
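
Before trying the generation parameters listed in the next cell, it can help to see what the sampling parameters do to a next-token distribution. The sketch below applies temperature, top-k and top-p to made-up logits in plain Python. The function names are hypothetical and this is a simplified illustration, not the library's implementation.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    """Zero out every index not in `keep` and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
sharp = softmax_with_temperature(logits, temperature=0.5)  # more peaked
flat = softmax_with_temperature(logits, temperature=2.0)   # more uniform
top2 = top_k_filter(softmax_with_temperature(logits), k=2)
nucleus = top_p_filter(softmax_with_temperature(logits), p=0.8)
print(sharp, flat, top2, nucleus, sep="\n")
```

A lower temperature concentrates probability on the best token, while top-k and top-p remove the unlikely tail before sampling.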

In [31]:
# Parameters that control the generated output:
# * length
#   - min_length / max_length: bounds on the total length (prompt + generated tokens);
#     max_length defaults to 20, so increase it to get longer results
#   - min_new_tokens / max_new_tokens: the same, but counting only the newly generated tokens
# * search type
#   - do_sample: sample from the distribution instead of decoding greedily, to get different results
#   - num_beams: the number of beams for beam search
# * sampling type
#   - temperature: defaults to 1.0; values above 1.0 flatten the distribution (more random output),
#     values below 1.0 sharpen it
#   - top_k: keep only the k most probable tokens
#   - top_p: keep the smallest set of most probable tokens whose cumulative probability reaches top_p
# * other
#   - repetition_penalty: penalize the probability of tokens that have already been generated
# Try different values of these parameters to find the best output.

# Note: the training prompts start with "Human: " (capital H), while this prompt uses "human: ".
human = "human: {}\n{}".format("List five steps for comparing two products.", "").strip() + "\n\nAssistant: "
pipe(human, max_new_tokens=256)

[{'generated_text': "human: List five steps for comparing two products.\n\nAssistant: 1. Compare the dimensions: The dimensions of the two products should be the same. This means that the product's dimensions should be the same as the one you are comparing it to. This can be done by measuring the length, width, and height of the two products, and then calculating the standard deviation.\n\n2. Compare the weight: The weight of the two products should be the same. This means that the product's weight should be the same as the one you are comparing it to. This can be done by measuring the total weight of the two products, and then calculating the standard deviation.\n\n3. Compare the color and appearance: The color and appearance of the two products should be the same. This means that the product's color and appearance should be the same as the one you are comparing it to. This can be done by comparing the color of the product to the one you are comparing it to, and then comparing the col