## Lmd Ukraine - synthetic data generation - run

### Objective

- We start with our roughly 500 human-annotated labels.
- Intermediate goal 1: generate synthetic examples with labels (we are here).
- Intermediate goal 2: fine-tune Mistral or a Mistral variant such as OpenHermes, or Llama 2.
- End goal: expand the dataset to several thousand examples, labelled either by the fine-tuned model used as a classifier or by a weighted average of SetFit classifier and fine-tuned model predictions.
- Final objective: train a BERT classifier rather than deploy the fine-tuned Mistral, since what matters in the end is performance and ease of deployment.
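
The weighted-average idea in the end-goal bullet can be sketched as follows. This is only an illustration, not code from this notebook: the function name `weighted_average_predictions`, the weight, and the probability values are all hypothetical.

```python
def weighted_average_predictions(probs_a, probs_b, weight_a=0.5):
    """Blend two per-class probability lists and return (label, blended)."""
    if len(probs_a) != len(probs_b):
        raise ValueError("both models must score the same classes")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be in [0, 1]")
    weight_b = 1.0 - weight_a
    blended = [weight_a * a + weight_b * b for a, b in zip(probs_a, probs_b)]
    # The predicted label is the index of the highest blended probability.
    label = max(range(len(blended)), key=blended.__getitem__)
    return label, blended


# Hypothetical probabilities for labels 0, 1, 2 from the two classifiers.
llm_probs = [0.6, 0.3, 0.1]
setfit_probs = [0.2, 0.5, 0.3]
label, blended = weighted_average_predictions(llm_probs, setfit_probs, weight_a=0.7)
print(label, [round(p, 2) for p in blended])
```

The weight would normally be tuned on the human-annotated examples rather than fixed by hand.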

**Resources**  
[MLabonne Repo](https://github.com/mlabonne/llm-course)  
[Dataset Gen - using gpt3.5](https://medium.com/@kshitiz.sahay26/how-i-created-an-instruction-dataset-using-gpt-3-5-to-fine-tune-llama-2-for-news-classification-ed02fe41c81f)  
[Kaggle Essay Gen](https://www.kaggle.com/code/phanisrikanth/generate-synthetic-essays-with-mistral-7b-instruct)  
[Dataset Gen - Mistral-7B nice prompt examples](https://hendrik.works/blog/leveraging-underrepresented-data)  
[Fine tune OpenHermes-2.5-Mistral-7B - including prompt template gen](https://towardsdatascience.com/fine-tune-a-mistral-7b-model-with-direct-preference-optimization-708042745aac)

### Installs

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
!pip install -q -U git+https://github.com/huggingface/transformers
!pip install -q flash-attn --no-build-isolation

In [3]:
# only needed when loading the model with load_in_8bit or a similar quantization option:
#!pip install -i https://pypi.org/simple/ bitsandbytes
#!pip install accelerate

### Libs

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import pickle
from tqdm import tqdm

from datasets import load_dataset

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM
)

### Load dataset

In [5]:
dataset_path = "gentilrenard/lmd_ukraine_comments"

In [6]:
data = load_dataset(dataset_path)
train = data["train"]
train_df = train.to_pandas()
train_df.head(1)

Downloading and preparing dataset parquet/gentilrenard--lmd_ukraine_comments to /root/.cache/huggingface/datasets/parquet/gentilrenard--lmd_ukraine_comments-4494a2a28d9c6379/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901...



Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/parquet/gentilrenard--lmd_ukraine_comments-4494a2a28d9c6379/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901. Subsequent calls will reuse this data.



| | text | label |
|---|---|---|
| 0 | Waouh on a failli avoir un article positif (pa... | 2 |


In [7]:
print(len(train_df))
print(train_df.label.value_counts())

323
label
2    126
0    115
1     82
Name: count, dtype: int64


### Load Mistral 7B (OpenHermes)

In [8]:
# model_path="/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1"
model_path="teknium/OpenHermes-2.5-Mistral-7B"

In [9]:
tokenizer=AutoTokenizer.from_pretrained(model_path)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype = torch.float16, # bfloat16 throws a 'cutlass' error
    device_map = "auto",
    trust_remote_code = True,
)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



### Split dataset according to label

In [10]:
# Split data according to the (3) labels
data_0 = train_df.query('label==0')
data_1 = train_df.query('label==1')
data_2 = train_df.query('label==2')

### Build prompt

In [11]:
def build_prompt(sample: list[dict]) -> str:
    """Build the generation prompt from the first sampled record, adding a label-specific hint."""

    example = sample[0]
    detected_label = example.get("label", 0)
    
    prompt_addon = {
        0:"Cet exemple est favorable à l'Ukraine / dénigre la Russie",
        1:"Cet exemple est favorable à la Russie / dénigre l'Ukraine",
        2:"Cet exemple ne parle pas directement du conflit, ou ne prend pas du tout position"
    }
    special_prompt = prompt_addon[detected_label]
    return f'''Contexte: l'invasion de l'Ukraine par la Russie, débutée le 24 février 2022 sur ordre de Vladimir Poutine, représente une intensification majeure du conflit russo-ukrainien entamé en 2014 avec l'annexion de la Crimée et la guerre du Donbass. Malgré de fortes résistances ukrainiennes, la Russie a occupé des parties de l'Ukraine, visant à en couper l'accès à la mer. L'invasion, condamnée internationalement, a entraîné des sanctions massives contre la Russie et un soutien occidental et partiellement global à l'Ukraine. Elle a également exacerbé les crises énergétique et alimentaire mondiales. Le conflit est marqué par de graves violations des droits humains et des crimes de guerre. Voici un exemple de commmentaire laissé par un abonné du journal lemonde.fr sous un article consacré à cette guerre. {special_prompt}. Exemple: 1.{{{example}}}. Génère un nouveau commentaire en français au format json, qui s'inspire de l'exemple et comporte le même label. Important, le format de réponse est un json: {{"text": "nouveau commentaire", "label": même label que l'exemple}}'''
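
One detail of the f-string above is worth illustrating: `{{` and `}}` are escaped literal braces, so `{{{example}}}` wraps the dict's own string form in an extra pair of braces, and that string form uses Python single quotes rather than JSON. The sketch below reproduces this behaviour and shows `json.dumps` as one possible alternative. The sample record is invented for illustration.

```python
import json

example = {"text": "Exemple de commentaire", "label": 0}  # hypothetical record

# Same pattern as in build_prompt: literal '{' + str(dict) + literal '}'.
as_in_prompt = f"Exemple: 1.{{{example}}}."

# Alternative: serialise the record as real JSON, keeping accents readable.
as_json = f"Exemple: 1.{json.dumps(example, ensure_ascii=False)}."

print(as_in_prompt)
print(as_json)
```

Whether the JSON form actually improves the share of well-formed replies would need to be checked on real generations.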

### Generate one comment

In [12]:
n_samples = 1
sample = data_0.sample(n=n_samples, replace=False)
sample_dict = sample.to_dict(orient='records')

In [13]:
prompt = build_prompt(sample_dict)
prompt

'Contexte: l\'invasion de l\'Ukraine par la Russie, débutée le 24 février 2022 sur ordre de Vladimir Poutine, représente une intensification majeure du conflit russo-ukrainien entamé en 2014 avec l\'annexion de la Crimée et la guerre du Donbass. Malgré de fortes résistances ukrainiennes, la Russie a occupé des parties de l\'Ukraine, visant à en couper l\'accès à la mer. L\'invasion, condamnée internationalement, a entraîné des sanctions massives contre la Russie et un soutien occidental et partiellement global à l\'Ukraine. Elle a également exacerbé les crises énergétique et alimentaire mondiales. Le conflit est marqué par de graves violations des droits humains et des crimes de guerre. Voici un exemple de commmentaire laissé par un abonné du journal lemonde.fr sous un article consacré à cette guerre. Cet exemple est favorable à l\'Ukraine / dénigre la Russie. Exemple: 1.{{\'text\': \'D accord avec vous. Il faut donner tous nos chars et se faire livrer des Abrams en attendant le futur 

In [14]:
%%time
def generate_comm(prompt:str):
    messages = [{
        "role":"user",
        "content": prompt
    }]
    model_inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors = "pt").to('cuda')
    
    # pad_token_id is set to eos_token_id, as is usual for open-ended generation.
    # A transformers pipeline or a GenerationConfig could be used instead of passing arguments here.
    with torch.no_grad():
        generated_ids = model.generate(
            model_inputs,
            max_new_tokens = 130,
            do_sample = True, # sampling approach, more randomness
            pad_token_id = tokenizer.eos_token_id
        )

    decoded = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    #text = decoded[0].split("[/INST]")[1] # for base mistral instruct
    text = decoded[0].split("\n assistant\n")[1] # for openHermes version, with add_gen=True
    return text

comm = generate_comm(prompt)
comm



CPU times: user 15.4 s, sys: 838 ms, total: 16.2 s
Wall time: 27.2 s


'{"text": "Je reste au même ton que le commentaire initial. La Russie compte prendre l\'Ukraine d\'assaut, mais nos forces unies devront résister pour la défendre. Les Abrams seraient un excellent renfort pour soutenir l\'Ukraine face à cette agression. Détentionnons l\'espoir pour un avenir paisible et sûr pour l\'Ukraine et l\'Europe.", "label": 0}'
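
Splitting the decoded text on `"\n assistant\n"` and taking element `[1]` raises an `IndexError` when the marker is missing, and picks the wrong segment if the prompt itself happens to contain it. A more defensive variant, sketched here on plain strings, keeps everything after the last marker and falls back to the full text. The helper name `extract_reply` is hypothetical; the marker is the one used in `generate_comm` above.

```python
def extract_reply(decoded: str, marker: str = "\n assistant\n") -> str:
    """Return the text after the last marker, or the whole text if absent."""
    _, found, tail = decoded.rpartition(marker)
    # rpartition returns an empty separator when the marker is not present.
    return tail.strip() if found else decoded.strip()


decoded = 'user\nGénère un commentaire\n assistant\n{"text": "ok", "label": 0}'
print(extract_reply(decoded))
print(extract_reply("no marker here"))
```

Another option, not shown here, is to decode only the newly generated token ids by slicing off the prompt length before calling `batch_decode`, which avoids relying on a text marker at all.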

### Simple "guidance" / response json parser

The model does not always return a single bare JSON object, so a simple regex parser extracts the first `{...}` span from the response.

In [15]:
import re

def parse_json(resp: str) -> str:
    """Return the first {...} span found in resp, or resp unchanged if there is none."""
    match = re.search(r'\{.*?\}', resp, re.DOTALL)
    return match.group(0) if match else resp

In [16]:
# example of wrong output format, parsed to json
text = "Voici un exemple de commentaire français au format JSON, inspiré de l'exemple donné et ayant le même label: ```json { \"text\": \"@Helga : Macron, jette ton tee-shirt blanche !\", \"label\": 2 }"
parsed_text = parse_json(text)
parsed_text

'{ "text": "@Helga : Macron, jette ton tee-shirt blanche !", "label": 2 }'
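
The non-greedy pattern `\{.*?\}` stops at the first closing brace, so it would cut short a comment whose text itself contains `}`, and it does not check that the result is valid JSON. A possible alternative, sketched under the assumption that the reply contains at most one top-level object, tries `json.JSONDecoder.raw_decode` at each `{`. The name `extract_first_json` is hypothetical.

```python
import json


def extract_first_json(resp: str):
    """Return the first decodable JSON object in resp as a dict, else None."""
    decoder = json.JSONDecoder()
    start = resp.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting exactly at `start`
            # and ignores whatever follows it.
            obj, _ = decoder.raw_decode(resp, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = resp.find("{", start + 1)
    return None


resp = 'Voici: ```json { "text": "accolade } dans le texte", "label": 2 }'
print(extract_first_json(resp))
```

Unlike `parse_json`, this returns a dict (or `None`), so malformed replies can be dropped at generation time instead of being stored as raw strings.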

### Generate several comments

In [17]:
# number of rows (example(s) inserted in prompt) to extract from data
n_samples = 1
# number of comm to generate, per sampled example
n_com = 2 # first run with 2, second run with 3 !, 3rd run with 4 !

In [18]:
comments = []

dataframes = [data_0, data_1, data_2]
for df in tqdm(dataframes, desc="Processing DataFrame"):
    # one sampled prompt per row of the class, plus one extra (116 iterations for 115 rows)
    n_iterations = len(df) + 1

    for _ in tqdm(range(n_iterations), desc="sample+generate"):
        # rows are drawn independently at each iteration, so an example can be picked more than once
        sample = df.sample(n=n_samples, replace=False)
        sample_dict = sample.to_dict(orient='records')
        prompt = build_prompt(sample_dict)

        # generate n_com comments from the same prompt
        for _ in range(n_com):
            comm = generate_comm(prompt)
            parsed_comm = parse_json(comm)
            comments.append(parsed_comm)

Processing DataFrame:   0%|          | 0/3 [00:00<?, ?it/s]
sample+generate:   0%|          | 0/116 [00:00<?, ?it/s][A
sample+generate:   1%|          | 1/116 [00:17<33:11, 17.31s/it][A
sample+generate:  11%|█         | 13/116 [03:31<28:51, 16.81s
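
As a rough sanity check on the loop above: each class is sampled `len(df) + 1` times and each sample yields `n_com` comments. Using the label counts printed earlier (115, 82, 126) and assuming about 17 s per outer iteration, as the progress output suggests for the first class, the totals work out as follows. The 17 s figure is an approximation read off the progress bar, not a measurement for the whole run.

```python
label_counts = {0: 115, 1: 82, 2: 126}  # from train_df.label.value_counts()
n_com = 2                # comments generated per sampled example
seconds_per_iter = 17    # assumed average, read off the progress bar

# One outer iteration per row of each class, plus one extra per class.
iterations = sum(n + 1 for n in label_counts.values())
total_comments = iterations * n_com
estimated_minutes = iterations * seconds_per_iter / 60

print(iterations, total_comments, round(estimated_minutes))
```

With a larger `n_com`, the time per iteration would grow roughly in proportion, since each iteration runs `n_com` generations.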

In [19]:
# save to disk
file_path = '/kaggle/working/my_list_4.pkl'

with open(file_path, 'wb') as file:
    pickle.dump(comments, file)

In [20]:
# read it back
with open(file_path, 'rb') as file:
    comments = pickle.load(file)

#print(comments[0:4])
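
Before the saved strings can feed the fine-tuning step, they need to be turned into records. Here is a minimal sketch of one way to do that, assuming each entry is expected to be a JSON object with a `text` string and a `label` in {0, 1, 2}. The helper name `to_records` and the sample strings are hypothetical.

```python
import json

VALID_LABELS = {0, 1, 2}


def to_records(raw_comments):
    """Keep unique, well-formed {'text', 'label'} records; skip the rest."""
    records, seen = [], set()
    for raw in raw_comments:
        try:
            item = json.loads(raw)
        except (json.JSONDecodeError, TypeError):
            continue  # not valid JSON, or not a string at all
        if not isinstance(item, dict):
            continue
        text, label = item.get("text"), item.get("label")
        if not isinstance(text, str) or type(label) is not int or label not in VALID_LABELS:
            continue
        if text in seen:
            continue  # exact duplicate of an earlier comment
        seen.add(text)
        records.append({"text": text, "label": label})
    return records


sample = [
    '{"text": "Premier commentaire", "label": 0}',
    '{"text": "Premier commentaire", "label": 0}',
    'not json',
    '{"text": "Label invalide", "label": 7}',
]
print(to_records(sample))
```

Exact-match deduplication is the simplest choice; near-duplicates produced from the same prompt would need a fuzzier comparison.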