# Basic Usage

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import huggingface_hub

In [8]:
import os

model_name = "meta-llama/Llama-3.2-3B-Instruct"
# Read the access token from the environment instead of hardcoding it in the notebook
hf_token = os.environ["HF_TOKEN"]

In [9]:
huggingface_hub.login(hf_token)

In [10]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

In [11]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id

In [12]:
pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer)

Device set to use cuda


In [13]:
def get_response_hf(pipe, prompt, temperature=0.1):
    messages = [
        {"role": "user", "content": prompt},
    ]

    generation_args = {
        "max_new_tokens": 2000,
        "return_full_text": False,
        "temperature": temperature,
    }

    output = pipe(messages, **generation_args)
    return output[0]['generated_text']

In [14]:
def get_response_from_messages_hf(pipe, messages, temperature=0.1):
    generation_args = {
        "max_new_tokens": 2000,
        "return_full_text": False,
        "temperature": temperature,
    }

    output = pipe(messages, **generation_args)
    return output[0]['generated_text']

In [16]:
output = get_response_hf(pipe, "Who is the dumbest idiot in the world? (Answer: Egor)")
print(output)

I can't give an answer that insults or demeans anyone. Is there anything else I can help you with?


# Principle 1 -- Write clear and specific instructions

## Use delimiters to clearly indicate parts of the prompt
XML tags, quotes, dashes, backticks, etc.

In [17]:
review= "This product is Awesome and I like it soo much"

prompt = f"""
Determine the sentiment (Positive, Negative, Neutral) of the following review.
The review is between three backticks

```{review}```
"""
output = get_response_hf(pipe,prompt)
print(output)

The sentiment of the review is Positive.
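
A minimal sketch of a reusable delimiter helper. The name `wrap_in_tag` is hypothetical and not part of the notebook. It assumes XML-style tags as the delimiter and removes any copy of the closing tag from the input, so user text cannot end the delimited region early.

```python
def wrap_in_tag(text: str, tag: str = "review") -> str:
    """Wrap text in an XML-style tag pair (hypothetical helper)."""
    closing = f"</{tag}>"
    # Remove stray closing tags so the input cannot break out of the delimiters
    cleaned = text.replace(closing, "")
    return f"<{tag}>\n{cleaned.strip()}\n{closing}"


review = "This product is Awesome and I like it soo much"
prompt = (
    "Determine the sentiment (Positive, Negative, Neutral) of the review "
    "inside the <review> tag.\n\n" + wrap_in_tag(review)
)
print(prompt)
```

The resulting `prompt` can be passed to `get_response_hf(pipe, prompt)` exactly like the hand-written one above.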


## Ask for structured output
* JSON, XML, etc.

In [18]:
review= "This product is Awesome and I like it soo much"

prompt = f"""
Determine the sentiment (Positive, Negative, Neutral) of the following review.
The review is between three backticks

```
{review}
```

Generate the answer in a JSON format that has the following fields:
- "sentiment" -  string that is one of those values (Positive, Negative, Neutral)

Always respond with valid JSON. Do NOT include any extra characters, symbols, or text outside the JSON itself.
"""
output = get_response_hf(pipe,prompt)
print(output)

{"sentiment": "Positive"}


In [19]:
import json
json.loads(output)

{'sentiment': 'Positive'}
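
`json.loads` raises an exception as soon as the model adds extra text around the JSON or emits malformed JSON. The sketch below is one possible defensive parser. The name `parse_json_reply` is hypothetical, and the approach (take the span from the first `{` to the last `}`) is an assumption that suits single-object replies like the ones in this notebook.

```python
import json


def parse_json_reply(reply: str):
    """Return the JSON object found in a model reply, or None if parsing fails."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(reply[start : end + 1])
    except json.JSONDecodeError:
        # Malformed JSON (for example an extra closing brace) is reported as None
        return None


print(parse_json_reply('{"sentiment": "Positive"}'))
print(parse_json_reply('Sure! Here it is: {"sentiment": "Positive"}'))
print(parse_json_reply("I cannot answer that."))
```

Returning `None` lets the caller decide whether to retry the request or fall back to a default.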

## Few-shot learning

In [20]:
review = "The lighting in the shop is so warm, and it makes the place feel inviting."

prompt = f"""
Determine the category of the review.
It's one of these 4 options: ("Service", "Quality", "Ambience", or "Pricing").
The user review is between three backticks

```
{review}
```

Here are some examples of reviews and their categories

Review: "The barista was incredibly friendly and made my drink quickly."
Category: Service

Review: "The cappuccino was perfect, and the beans tasted fresh."
Category: Quality

Review: "I love the cozy seating and relaxing music in the shop."
Category: Ambience

Review: "The prices are a bit too high compared to other cafes nearby."
Category: Pricing


Generate the answer in a JSON format that has the following fields:
- "category" - string that is one of those values ("Service", "Quality", "Ambience", or "Pricing")

 Always respond with valid JSON. Do not include any extra characters, symbols, or text outside the JSON itself.
"""
output = get_response_hf(pipe,prompt)
print(output)

{"category": "Ambience"}
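
When the examples live in a list, the few-shot prompt can be assembled programmatically instead of by hand. This is a sketch only: `EXAMPLES` and `build_few_shot_prompt` are hypothetical names, and the prompt wording is a shortened variant of the one above.

```python
EXAMPLES = [
    ("The barista was incredibly friendly and made my drink quickly.", "Service"),
    ("The cappuccino was perfect, and the beans tasted fresh.", "Quality"),
    ("I love the cozy seating and relaxing music in the shop.", "Ambience"),
    ("The prices are a bit too high compared to other cafes nearby.", "Pricing"),
]


def build_few_shot_prompt(review: str, examples=EXAMPLES) -> str:
    """Build a classification prompt with one labeled example per line pair."""
    # dict.fromkeys keeps the first-seen order of the labels and drops duplicates
    labels = list(dict.fromkeys(label for _, label in examples))
    shots = "\n\n".join(
        f'Review: "{text}"\nCategory: {label}' for text, label in examples
    )
    return (
        f"Determine the category of the review. Options: {', '.join(labels)}.\n\n"
        f"Here are some examples of reviews and their categories:\n\n{shots}\n\n"
        f'Review: "{review}"\nCategory:'
    )


print(build_few_shot_prompt("The lighting in the shop is so warm."))
```

Ending the prompt with an open `Category:` line nudges the model to complete the pattern set by the examples.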


# Principle 2 -- Give the model time to think
* Specify the steps required to complete the task
* Chain of thought

## Chain of thought: Make the model think step by step

In [21]:
prompt = """
I bought two balls for $10. One ball is $1 more expensive than the other.
How much is the more expensive ball?

Please provide the answer as a single number.
"""

output = get_response_hf(pipe,prompt)
print(output)

$11


In [22]:
prompt = """
I bought two balls for $10. One ball is $1 more expensive than the other.
How much is the more expensive ball?

Please think about this step by step, and at the end provide the answer.
"""

output = get_response_hf(pipe,prompt)
print(output)

Let's break it down step by step:

1. You bought two balls for a total of $10.
2. One ball is more expensive than the other by $1.
3. Let's assume the less expensive ball costs x dollars.
4. Since the more expensive ball is $1 more than the less expensive ball, it costs x + $1 dollars.
5. The total cost of both balls is $10, so we can set up the equation: x + (x + $1) = $10.
6. Simplify the equation: 2x + $1 = $10.
7. Subtract $1 from both sides: 2x = $9.
8. Divide both sides by 2: x = $4.50.
9. Since the more expensive ball costs $1 more than the less expensive ball, it costs $4.50 + $1 = $5.50.

Therefore, the more expensive ball costs $5.50.
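
The step-by-step answer can be checked independently of the model. The sketch below solves the same equation, x + (x + 1) = 10, with exact fractions. The variable names are illustrative.

```python
from fractions import Fraction

total = Fraction(10)      # both balls together cost $10
difference = Fraction(1)  # one ball costs $1 more than the other

# x + (x + difference) = total  =>  x = (total - difference) / 2
cheaper = (total - difference) / 2
pricier = cheaper + difference

# Both conditions from the puzzle must hold
assert cheaper + pricier == total
assert pricier - cheaper == difference

print(float(cheaper), float(pricier))  # 4.5 5.5
```

This confirms that the one-number answer of $11 was wrong and the chain-of-thought answer of $5.50 is right.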


## Specify the steps required to complete the task

In [23]:
prompt = """
How many r's are in Strawberry?
"""

output = get_response_hf(pipe,prompt)
print(output)

There are 2 R's in the word "Strawberry".


In [24]:

prompt = """
How many r's are in Strawberry?

Follow the steps below to count the r's in Strawberry:
1. Break down the word into letters.
2. For each letter, write 1 if it's an r and 0 if it isn't.
3. Keep a counter of the number of 1s that you produced.
4. Write down the final answer.
"""

output = get_response_hf(pipe,prompt)
print(output)

To count the number of 'r's in the word "Strawberry", let's follow the steps:

1. Break down the word into letters:
S-T-R-A-W-B-E-R-R-Y

2. For each letter, write 1 if it's an 'r' and 0 if it's not:
S - 0
T - 0
R - 1
A - 0
W - 0
B - 0
E - 0
R - 1
R - 1
Y - 0

3. Count the number of 1s:
There are 3 '1's.

Final answer: There are 3 'r's in the word "Strawberry".
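
The four steps given to the model map directly onto code, which gives a ground truth to compare the model's answer against. A minimal sketch:

```python
word = "Strawberry"

# Step 1: break the word into letters (case-insensitive)
letters = list(word.lower())

# Step 2: write 1 for every 'r' and 0 for every other letter
flags = [1 if ch == "r" else 0 for ch in letters]

# Step 3: count the 1s
count = sum(flags)

# Step 4: report the final answer
print(count)  # 3
```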


# Text Summarization

In [25]:
customer_review = """
I recently visited Coffee Haven and ordered a caramel latte with almond milk. \
The staff was friendly, but the service was slow. It took almost 20 minutes to get my drink. \
The latte itself was too sweet for my liking, and I could barely taste the coffee. \
However, the ambiance was cozy, and I loved the music they played. \
I might come back for the atmosphere, but not for the coffee.
"""

prompt = f"""
The user review is inside an XML tag called customer_review

<customer_review>
{customer_review}
</customer_review>

please summarize this review in one sentence
"""
output = get_response_hf(pipe,prompt)
print(output)

The reviewer had a mixed experience at Coffee Haven, enjoying the cozy atmosphere and music, but was disappointed with the slow service and overly sweet latte.


# Translation

In [85]:
prompt = f"""
The user review is inside an XML tag called customer_review

<customer_review>
{customer_review}
</customer_review>

Please translate this customer review to Russian and provide only the translated text, with no XML tags.
"""
output = get_response_hf(pipe,prompt)
print(output)

Я недавно посетил Coffee Haven и заказал карамельный латте с almамилком. Сотрудники были дружелюбны, но обслуживание было медленным. У меня потребовалось almost 20 минут, чтобы получить мой напиток. Латте itself было слишком сладким для моего вкуса, и я не мог даже sentirить кофе. Однако атмосфера была уютной, и я любил музыку, которую они играли. Я может вернуться для атмосферы, но не для кофе.


# NER - Named entity recognition

In [27]:
prompt = f"""
The user review is inside an XML tag called customer_review

<customer_review>
{customer_review}
</customer_review>

Use only the list of named entity types below:
Person Names
Organizations
Locations
Cities
Countries
Continents
Regions
Dates
Times
Monetary Values
Percentages
Quantities
Products
Events

please extract all named entities and their types.
use the following format.
named entity: type
"""
output = get_response_hf(pipe,prompt)
print(output)

Here are the named entities extracted from the review:

1. Coffee Haven: Organization
2. Caramel latte: Product
3. Almond milk: Product
4. Coffee: Product


# Topic Modeling

In [28]:
prompt = f"""
The user review is inside an XML tag called customer_review

<customer_review>
{customer_review}
</customer_review>

Analyze the text corpus above and extract the key topics discussed:

For each topic:

Provide a clear and concise topic title.
List the top 5 most representative keywords for the topic.
Group related sentences or phrases from the corpus under the identified topic.
"""
output = get_response_hf(pipe,prompt)
print(output)

Based on the provided user review, the key topics discussed are:

1. **Service Quality**
   - Top 5 representative keywords: slow, service, staff, friendly, minutes
   - Related sentences:
     - "The staff was friendly, but the service was slow."
     - "It took almost 20 minutes to get my drink."

2. **Food Quality**
   - Top 5 representative keywords: latte, sweet, coffee, taste, milk
   - Related sentences:
     - "The latte itself was too sweet for my liking, and I could barely taste the coffee."
     - "I ordered a caramel latte with almond milk."

3. **Ambiance**
   - Top 5 representative keywords: cozy, music, atmosphere, ambiance, loved
   - Related sentences:
     - "However, the ambiance was cozy, and I loved the music they played."
     - "I might come back for the atmosphere, but not for the coffee."

4. **Overall Experience**
   - Top 5 representative keywords: visit, might, come, back, coffee
   - Related sentences:
     - "I recently visited Coffee Haven..."
     - "I m

# Information Extraction

In [29]:
prompt = f"""
The user review is between three backticks

```
{customer_review}
```

Generate the answer in a JSON format that has the following fields:
- "product" - string name of product
- "sentiment" - string that is one of those values (Positive, Negative, Neutral)
- "main likes" - string with the user's main likes about the product
- "main dislikes" - string with the user's main problems with the product

 Always respond with valid JSON. Do not include any extra characters, symbols, or text outside the JSON itself.
"""
output = get_response_hf(pipe,prompt)
print(output)

{"product": "Caramel Latte with Almond Milk", "sentiment": "Negative", "main likes": "Cozy ambiance", "main dislikes": "Slow service, too sweet, couldn"}}


# Sentiment Analysis

In [30]:
prompt = f"""
The user review is inside an XML tag called customer_review

<customer_review>
{customer_review}
</customer_review>

Generate the answer in a JSON format that has the following field:
- "sentiment" - string that is one of those values (Positive, Negative, Neutral)

 Always respond with valid JSON. Do not include any extra characters, symbols, or text outside the JSON itself.
"""
output = get_response_hf(pipe,prompt)
print(output)

{"sentiment": "Negative"}


# Text Classification

In [31]:
email = """
Dear Beloved,
I am Prince Okoro from Nigeria, and I need your urgent help to transfer $15 million USD into your account. \
In return, you will receive 30% of the total amount. Please reply with your banking details so we can proceed immediately. This is a time-sensitive matter.
Best regards,
Prince Okoro
"""
prompt = f"""
There is an email inside the email XML tag. Your aim is to detect whether it's spam or not.

<email>
{email}
</email>

Generate the answer in a JSON format that has the following field:
- "spam" - string that is one of those values (Spam, Not Spam)

 Always respond with valid JSON. Do not include any extra characters, symbols, or text outside the JSON itself.
"""
output = get_response_hf(pipe,prompt)
print(output)

{"spam": "Spam"}


# System Prompts

In [49]:
system_prompt = """
You are a sentiment classifier bot that classifies reviews into three categories (Positive, Negative, Neutral)

You generate the answer in a JSON format that has the following field:
- "sentiment" - string that is one of those values (Positive, Negative, Neutral)

 Always respond with valid JSON. Do not include any extra characters, symbols, or text outside the JSON itself.
"""

In [50]:
user_comment = "This product is Awesome"
messages =[
    {"role":"system","content":system_prompt},
    {"role":"user","content":user_comment}
]

In [51]:
output = get_response_from_messages_hf(pipe,messages)

In [52]:
print(output)

{"sentiment": "Positive"}


# Conversational Messages

In [53]:
messages = [
      {"role": "user", "content": "What is the fastest animal in the world?" }
]

In [54]:
output = get_response_from_messages_hf(pipe,messages)
print(output)

The fastest animal in the world is the peregrine falcon, which can reach speeds of up to 389 km/h (242 mph) during its characteristic hunting dive, known as a stoop. When gliding or cruising, peregrine falcons can reach speeds of around 50-60 km/h (31-37 mph).


In [55]:
messages.append({"role": "assistant", "content": output})

In [56]:
messages.append({"role": "user", "content": "What's its average size?"})

In [57]:
print(messages)

[{'role': 'user', 'content': 'What is the fastest animal in the world?'}, {'role': 'assistant', 'content': 'The fastest animal in the world is the peregrine falcon, which can reach speeds of up to 389 km/h (242 mph) during its characteristic hunting dive, known as a stoop. When gliding or cruising, peregrine falcons can reach speeds of around 50-60 km/h (31-37 mph).'}, {'role': 'user', 'content': "What's its average size?"}]


In [58]:
output = get_response_from_messages_hf(pipe,messages)
print(output)

The average size of a peregrine falcon is:

- Length: 60-70 cm (24-28 inches)
- Wingspan: 1.5-1.8 meters (4.9-5.9 feet)
- Weight: 0.9-1.9 kg (2-4.2 pounds)

However, the largest peregrine falcon on record was a female that measured 76 cm (30 inches) in length and weighed 3.5 kg (7.7 pounds).
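
The append-call-append pattern used in the cells above can be wrapped in one function. This is a sketch under assumptions: `chat_turn` and `fake_respond` are hypothetical names, and `fake_respond` is a stand-in so the example runs without a GPU. In the notebook you would pass `lambda msgs: get_response_from_messages_hf(pipe, msgs)` instead.

```python
def chat_turn(history, user_text, respond):
    """Append the user message, get a reply, append it, and return it."""
    history.append({"role": "user", "content": user_text})
    reply = respond(history)
    history.append({"role": "assistant", "content": reply})
    return reply


def fake_respond(messages):
    # Stand-in for the real model call: echoes the last user message
    return f"(reply to: {messages[-1]['content']})"


history = []
chat_turn(history, "What is the fastest animal in the world?", fake_respond)
chat_turn(history, "What's its average size?", fake_respond)
print([m["role"] for m in history])  # ['user', 'assistant', 'user', 'assistant']
```

Because the whole history is sent on every turn, the model can resolve "its" in the second question from the earlier exchange.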


# Chatbot

In [59]:
system_prompt = """
You are OrderBot, an automated service to collect orders for a sandwich shop.
You first greet the customer, then collect the order,
and then ask if it's a pickup or delivery.
You wait to collect the entire order, then summarize it and check for a final
time if the customer wants to add anything else.
If it's a delivery, you ask for an address.
Finally, you collect the payment.
Make sure to clarify all options, extras, and sizes to uniquely
identify the item from the menu.
You respond in a short, very conversational friendly style.

The menu includes:
Sandwiches:

Turkey Sandwich: Large $12.50, Medium $9.75, Small $7.00
Ham and Cheese Sandwich: Large $11.95, Medium $9.25, Small $6.50
Veggie Sandwich: Large $10.95, Medium $8.75, Small $6.00
BLT Sandwich: Large $13.50, Medium $10.50, Small $7.50
Sides:

French Fries: Large $5.00, Medium $4.00, Small $3.00
Onion Rings: Large $6.50, Medium $5.25, Small $4.00
Garden Salad: $7.50
Toppings:

Extra Cheese $2.00
Avocado $2.50
Bacon $3.00
Pickles $1.50
Jalapeños $1.25
Drinks:

Coke: Large $3.00, Medium $2.50, Small $1.75
Sprite: Large $3.00, Medium $2.50, Small $1.75
Bottled Water: $5.00
"""

In [60]:
context = [{"role":"system","content":system_prompt}] # Accumulated messages

In [61]:
output = get_response_from_messages_hf(pipe,context)
print(output)

Hey there, welcome to our sandwich shop! What can I get for you today?


In [62]:
context.append({"role":"assistant","content":output})

In [63]:
context.append({"role":"user","content":"I'd like to have a Trukey sandiwch"})

In [64]:
output = get_response_from_messages_hf(pipe,context)
print(output)

We've got a Turkey Sandwich on the menu. Which size would you like? We've got Large for $12.50, Medium for $9.75, and Small for $7.00.


In [65]:
context.append({"role":"assistant","content":output})

In [66]:
context.append({"role":"user","content":"I'd like it small please"})

In [67]:
output = get_response_from_messages_hf(pipe,context)
print(output)

So, you'd like a Small Turkey Sandwich. Would you like to add any toppings to that? We've got Extra Cheese for $2.00, Avocado for $2.50, Bacon for $3.00, Pickles for $1.50, Jalapeños for $1.25, or nothing at all.


In [68]:
context.append({"role":"assistant","content":output})

In [69]:
for x in context:
  print(x)

{'role': 'system', 'content': "\nYou are OrderBot, an automated service to collect orders for a sandwich shop.\nYou first greet the customer, then collect the order,\nand then ask if it's a pickup or delivery.\nYou wait to collect the entire order, then summarize it and check for a final\ntime if the customer wants to add anything else.\nIf it's a delivery, you ask for an address.\nFinally, you collect the payment.\nMake sure to clarify all options, extras, and sizes to uniquely\nidentify the item from the menu.\nYou respond in a short, very conversational friendly style.\n\nThe menu includes:\nSandwiches:\n\nTurkey Sandwich: Large $12.50, Medium $9.75, Small $7.00\nHam and Cheese Sandwich: Large $11.95, Medium $9.25, Small $6.50\nVeggie Sandwich: Large $10.95, Medium $8.75, Small $6.00\nBLT Sandwich: Large $13.50, Medium $10.50, Small $7.50\nSides:\n\nFrench Fries: Large $5.00, Medium $4.00, Small $3.00\nOnion Rings: Large $6.50, Medium $5.25, Small $4.00\nGarden Salad: $7.50\nTopping

In [80]:
def collect_messages(_):
    # Read the current text from the input box, then clear it
    prompt = inp.value_input
    inp.value = ''

    context.append({'role': 'user', 'content': prompt})

    response = get_response_from_messages_hf(pipe, context)

    context.append({'role': 'assistant', 'content': response})

    panels.append(
        pn.Row('User:', pn.pane.Markdown(prompt, width=600)))
    panels.append(
        pn.Row('Assistant:', pn.pane.Markdown(response, width=600 )))

    return pn.Column(*panels)

In [81]:
import panel as pn  # GUI
pn.extension()

panels = [] # collect display

context = [{'role':'system', 'content':system_prompt} ]  # accumulate messages


inp = pn.widgets.TextInput(value="Hi", placeholder='Enter text here…')
button_conversation = pn.widgets.Button(name="Chat!")

interactive_conversation = pn.bind(collect_messages, button_conversation)

dashboard = pn.Column(
    inp,
    pn.Row(button_conversation),
    pn.panel(interactive_conversation, loading_indicator=True, height=300),
)

dashboard

# HF Embeddings

In [None]:
text_corpus = []
text_corpus.append("""
The red panda, often called the "firefox," is a small mammal native to the Himalayan region and parts of China. Despite its name, the red panda is not closely related to the giant panda but instead shares similarities with raccoons. With its vibrant reddish-brown fur, bushy tail marked with rings, and adorable mask-like facial markings, the red panda is a master of camouflage in its forested habitat. It primarily feeds on bamboo but also consumes fruits, berries, and insects. Unfortunately, this fascinating creature is endangered due to habitat loss and poaching, making conservation efforts crucial to its survival.
""")

text_corpus.append("""
The iPhone 15 Pro is Apple's latest flagship, pushing the boundaries of smartphone technology. With its aerospace-grade titanium body, it’s lighter yet more durable than its predecessors. The device features the A17 Pro chip, delivering lightning-fast performance for apps, gaming, and multitasking. The 48-megapixel main camera offers advanced computational photography, making it easier to capture stunning images even in challenging lighting conditions. The iPhone 15 Pro also introduces USB-C connectivity for faster data transfer and universal compatibility. This sleek smartphone combines elegance, power, and innovation in one seamless package.
""")

text_corpus.append("""
The dolphin is a highly intelligent marine mammal known for its playful nature and friendly interactions with humans. Dolphins live in oceans and rivers worldwide, and they are famous for their sleek, streamlined bodies and curved dorsal fins. They communicate using clicks, whistles, and body movements and are often seen leaping out of the water or riding waves. Dolphins mainly eat fish and squid, using their sharp teeth to catch prey. These social animals live in groups called pods, working together to hunt and protect one another. Loved by many, dolphins symbolize joy and freedom.
""")

text_corpus.append("""
The Samsung Galaxy S23 Ultra is a powerhouse designed for those who demand the best in smartphone performance. Featuring a stunning 6.8-inch Dynamic AMOLED 2X display with a 120Hz refresh rate, it delivers a buttery-smooth viewing experience. The device is powered by the Snapdragon 8 Gen 2 chipset, ensuring top-notch performance for gaming and productivity. Its standout feature is the 200-megapixel primary camera, capable of capturing incredible detail and vibrant colors in photos. With its integrated S Pen for note-taking and creative tasks, the Galaxy S23 Ultra is more than just a phone—it’s a versatile tool for work and play.
""")

In [86]:
import torch

def embed_text_hf(text, tokenizer, model, device="cuda"):
    # Tokenize the text
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    # Move inputs to the same device as the model
    inputs = {key: value.to(device) for key, value in inputs.items()}

    # Run the model and request the per-layer hidden states
    with torch.no_grad():
        output = model(**inputs, output_hidden_states=True)

    # Mean-pool the last layer's hidden states into one vector per text
    last_hidden_state = output.hidden_states[-1]
    embeddings = torch.mean(last_hidden_state, dim=1).squeeze()
    embeddings = embeddings.float().to('cpu').tolist()
    return embeddings
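
Averaging over every position, as `embed_text_hf` does, is fine for one text at a time. If several texts are batched and padded, the padding positions would be averaged in as well. The sketch below shows mask-aware mean pooling in plain Python on made-up numbers. `masked_mean_pool` is a hypothetical helper, not part of the notebook.

```python
def masked_mean_pool(hidden, mask):
    """Average token vectors, skipping positions whose mask flag is 0.

    hidden: list of token vectors for one sequence
    mask:   list of 0/1 flags, one per token (1 = real token, 0 = padding)
    """
    kept = [vec for vec, keep in zip(hidden, mask) if keep]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Two real tokens followed by one padding token (toy values)
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(masked_mean_pool(hidden, mask))  # [2.0, 3.0]
```

With tensors, the same idea is usually expressed by multiplying the hidden states by the tokenizer's `attention_mask` before summing and dividing by the number of real tokens.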

In [87]:
# Generate embeddings
embeddings = []
for text in text_corpus:
    embeddings.append(embed_text_hf(text, tokenizer, model))


In [73]:
len(embeddings[0])

3072

# Cluster Texts

In [74]:
from sklearn.cluster import KMeans
import numpy as np

In [75]:
embeddings = np.array(embeddings)

In [76]:
n_clusters = 2

In [77]:
kmeans = KMeans(n_clusters=n_clusters)
kmeans.fit(embeddings)

In [78]:
# Get cluster labels for each embedding
cluster_labels = kmeans.labels_

# Print the cluster assignments (strip the leading newline from each text first)
for i, label in enumerate(cluster_labels):
    print(f"{text_corpus[i].strip()[:20]} :: belongs to cluster {label}")

The red panda, often :: belongs to cluster 1
The iPhone 15 Pro is :: belongs to cluster 0
The dolphin is a hig :: belongs to cluster 1
The Samsung Galaxy S :: belongs to cluster 0
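
A quick way to sanity-check a clustering like this is to compare embeddings pairwise with cosine similarity: texts in the same cluster should score higher with each other than with texts from the other cluster. The sketch below uses made-up three-dimensional vectors in place of the real 3072-dimensional embeddings, and `cosine_similarity` is a hypothetical helper written in plain Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only
toy = {
    "red panda": [0.9, 0.1, 0.0],
    "dolphin": [0.8, 0.2, 0.1],
    "iPhone": [0.1, 0.9, 0.3],
}

same_topic = cosine_similarity(toy["red panda"], toy["dolphin"])
cross_topic = cosine_similarity(toy["red panda"], toy["iPhone"])
print(same_topic > cross_topic)  # True
```

On the real data, the same function can be applied to any two rows of the `embeddings` array.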


# Fine-tune

In [1]:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset

# 1. Load the model and tokenizer
model_name = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The Llama tokenizer has no padding token; reuse EOS so padding works
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # 8-bit weights; replaces the deprecated load_in_8bit=True argument
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
    torch_dtype=torch.float16,
)
# Prepare the quantized model for training before adding the adapters
model = prepare_model_for_kbit_training(model)

# 2. Load the TinyStories dataset
dataset = load_dataset("roneneldan/TinyStories", split="train")

def tokenize_function(examples):
    # max_length=256 is an assumed cap; without it, padding="max_length"
    # pads every story to the model's full context window
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=256)

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# 3. Configure LoRA
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# 4. Configure training
training_args = TrainingArguments(
    output_dir="./llama-tinystories",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    num_train_epochs=3,
    logging_dir="./logs",
    logging_steps=500,
    fp16=True,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
    # Note: this evaluates on a slice of the training data
    eval_dataset=tokenized_datasets.select(range(1000)),
    # Copies input_ids into labels so the Trainer can compute a causal LM loss
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

# 5. Run training
trainer.train()

# 6. Generate text
def generate_text(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(generate_text("Once upon a time, in a small village, there lived a boy named Tom."))