<a href="https://colab.research.google.com/github/hyesunyun/huggingface-lab/blob/main/huggingface_inference_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HuggingFace Lab

*Adapted from https://colab.research.google.com/github/huggingface/notebooks/blob/master/transformers_doc/quicktour.ipynb and https://huggingface.co/docs/transformers/en/conversations*

## What is HuggingFace?

HuggingFace is an open-source platform that provides tools for building, training, and deploying machine learning (ML) and natural language processing (NLP) models. It is similar to GitHub for AI and is a hub for AI developers.

HuggingFace hosts a large library of models and datasets. You can browse existing models, create your own, and share their weights (publicly or privately). You can also find over 30,000 datasets for training or evaluating AI models.

In this lab, we will do a quick tour of using HuggingFace's Transformers & Datasets libraries for different common NLP tasks with pretrained models.

Make sure your runtime is set to GPU. You can pick the T4 GPU.

Run the following cell to verify your GPU setup.
You should see information about the GPU available for your session.

In [None]:
! nvidia-smi

Thu Jan  9 04:22:09 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

### Install Packages

This is only needed for Google Colab users.

In [None]:
# Transformers installation
! pip install transformers[torch] datasets
# Install dependencies
! pip install torch

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 480.6/480.6 kB 21.9 MB/s eta 0:00:00
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.3/116.3 kB 11.7 MB/s eta 0:00:00
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl 

### Quick Tour

We will start by using the [`pipeline()`](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline) for rapid inference, and then load a pretrained model and tokenizer with an [AutoClass](https://huggingface.co/docs/transformers/main/en/model_doc/auto) to solve text tasks.

#### Pipeline

`pipeline()` is the easiest way to use a pretrained model for a given task. It supports many common tasks out-of-the-box:

- Sentiment analysis: classify the polarity of a given text.
- Text generation (in English): generate text from a given input.
- Named entity recognition (NER): label each word with the entity it represents (person, date, location, etc.).
- Question answering: extract the answer from the context, given some context and a question.
- Fill-mask: fill in the blank given a text with masked words.
- Summarization: generate a summary of a long sequence of text or document.
- Translation: translate text into another language.
- Feature extraction: create a tensor representation of the text.

In this example, we will use `pipeline()` for sentiment analysis.

Import and load the pipeline.
The pipeline downloads and caches a default pretrained model and tokenizer for sentiment analysis.

In [None]:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu" # checks if gpu is available
pipeline_device = 0 if device == "cuda" else -1 # for determining if we want to load model in GPU or CPU

##### Sentiment Analysis

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", device=pipeline_device)
# Alternatively, pass device_map="auto" instead of device (requires the accelerate package).
# This lets Hugging Face decide where to load the model weights.
# Here we set the device manually.

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Device set to use cuda:0


In [None]:
classifier("We are very happy to show you the HuggingFace's Transformers library.")

[{'label': 'POSITIVE', 'score': 0.9997667670249939}]

You can also pass a list of sentences to the `pipeline()`, which returns a list of dictionaries.

In [None]:
results = classifier(["We are very happy to show you the HuggingFace's Transformers library.", "We hope you don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309
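
The `score` values above are probabilities. As a rough sketch of where such a number comes from, assume the classifier produces one raw score (logit) per class and a softmax turns them into probabilities. The logit values and the `to_label_score` helper below are made up for illustration and are not part of Transformers.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def to_label_score(logits, labels=("NEGATIVE", "POSITIVE")):
    """Pick the most likely class and report its probability (hypothetical helper)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}

# Hypothetical logits: one confident prediction and one borderline prediction
print(to_label_score([-4.0, 4.3]))
print(to_label_score([0.1, 0.0]))
```

A score close to 0.5, like the one for the second sentence above, means the model is barely leaning toward the label it reports.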


The `pipeline()` can accommodate any model from the Model Hub, making it easy to adapt the `pipeline()` to other tasks.

In this example, the task is translation.

##### Translation

In [None]:
# Change `xx` to the language of the input and `yy` to the language of the desired output.
# Examples: "en" for English, "fr" for French, "de" for German, "es" for Spanish, "zh" for Chinese, etc; translation_en_to_fr translates English to French
# You can view all the lists of languages here - https://huggingface.co/languages

# Helsinki-NLP/opus-mt-en-es is the model used for translation from English to Spanish
model_name = "Helsinki-NLP/opus-mt-en-es"
translator = pipeline("translation_en_to_es", model=model_name, device=pipeline_device)

text = "Peanut butter is a food paste or spread made from ground, dry-roasted peanuts."
translator(text)

config.json:   0%|          | 0.00/1.47k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/312M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/44.0 [00:00<?, ?B/s]

source.spm:   0%|          | 0.00/802k [00:00<?, ?B/s]

target.spm:   0%|          | 0.00/826k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.59M [00:00<?, ?B/s]

Device set to use cuda:0


[{'translation_text': 'La mantequilla de maní es una pasta alimenticia o untar hecha de maní molido y tostado en seco.'}]

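Another way to load the pipeline:
Use an AutoClass such as AutoModelForSequenceClassification together with AutoTokenizer to load the pretrained model and its associated tokenizer (more on an AutoClass below), then pass both to `pipeline()`:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Placeholder model id: replace with a real model from the Hub.
# Pick the AutoModel class that matches your task.
model_name = "username/model_name"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Use a different variable name so the pipeline() function is not overwritten
pipe = pipeline("task name", model=model, tokenizer=tokenizer, device=pipeline_device)
pipe("text")
```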

We can also iterate over an entire dataset via HuggingFace's [Datasets](https://huggingface.co/docs/datasets/index) library. We will load [opus-100's en-es test split dataset](https://huggingface.co/datasets/Helsinki-NLP/opus-100/viewer/en-es/test).

In [None]:
from datasets import load_dataset

dataset = load_dataset("Helsinki-NLP/opus-100", name="en-es", split="test")

README.md:   0%|          | 0.00/65.4k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/237k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/99.6M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/238k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/2000 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/2000 [00:00<?, ? examples/s]

In [None]:
# select first 4 samples from the dataset and format
inputs = [sample["en"] for sample in dataset[:4]["translation"]]
result = translator(inputs)

for d in result:
  print(d["translation_text"])

Si su país produjo SAO con este fin, sírvase indicar la cantidad así producida en la columna 6 del formulario de datos 3.”
♪ reformar al gran hombre, ¿quién más podría ser sino yo?
El planeta se está acabando.
¿Nunca las chicas matan a sus madres?


For a larger dataset where the inputs are big (like in speech or vision), you will want to pass along a generator instead of a list that loads all the inputs in memory. See the pipeline documentation for more information.
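
A minimal sketch of the generator idea, with plain Python dictionaries standing in for dataset rows. `iter_texts` is a hypothetical helper written for this illustration, not a Transformers or Datasets function.

```python
def iter_texts(rows, lang="en"):
    """Yield one source sentence at a time instead of building a full list."""
    for row in rows:
        yield row["translation"][lang]

# Stand-in rows shaped like the opus-100 samples used above
rows = [
    {"translation": {"en": "Hello.", "es": "Hola."}},
    {"translation": {"en": "Thank you.", "es": "Gracias."}},
]

texts = iter_texts(rows)
print(next(texts))  # only the first row has been read so far
```

With the real objects, looping over `translator(iter_texts(dataset))` should stream results one at a time; the pipeline documentation describes the batching options.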

##### Question Answering

Let's practice with a question answering task, but with a model that can handle French text. Search the [Model Hub](https://huggingface.co/models) for a model that handles question answering in French. Tip: Use the tags `Question Answering` NLP task and `French` language.
Use the appropriate model to load `pipeline()` and use the dataset: [manu/fquad2_test](https://huggingface.co/datasets/manu/fquad2_test)

Dataset Details:
- split: test
- pre-processing: question-answering pipeline takes in question and context. `qa(question=questions, context=contexts)`

In [None]:
##### Add your code below #####

# load pipeline
qa = pipeline('question-answering', model='CATIE-AQ/QAmemberta', device=pipeline_device)

# load dataset
dataset = load_dataset("manu/fquad2_test", split="test")

# sample first 4 rows
samples = dataset[:4]

questions = samples["question"]
contexts = samples["context"]

# call qa pipeline with questions and contexts
results = qa(question=questions, context=contexts)

for result in results:
    print(result["score"])
    if result["score"] < 0.01:
        print("La réponse n'est pas dans le contexte fourni.")  # The answer is not in the context provided.
    else:
        print(result["answer"])
    print("----------")

config.json:   0%|          | 0.00/1.03k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/442M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.25k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/756k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/971 [00:00<?, ?B/s]

Device set to use cuda:0


README.md:   0%|          | 0.00/2.15k [00:00<?, ?B/s]

(…)-00000-of-00001-7afbf23107dc86df.parquet:   0%|          | 0.00/372k [00:00<?, ?B/s]

(…)-00000-of-00001-4ba6abaa8b2e4d33.parquet:   0%|          | 0.00/203k [00:00<?, ?B/s]

(…)-00000-of-00001-c57e7fc735be9d5f.parquet:   0%|          | 0.00/135k [00:00<?, ?B/s]

(…)-00000-of-00001-8dd5791b98b1f591.parquet:   0%|          | 0.00/75.9k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/800 [00:00<?, ? examples/s]

Generating test_hasAns split:   0%|          | 0/400 [00:00<?, ? examples/s]

Generating valid split:   0%|          | 0/200 [00:00<?, ? examples/s]

Generating valid_hasAns split:   0%|          | 0/100 [00:00<?, ? examples/s]

0.9423643350601196
mauvais état de santé
----------
0.5591740608215332
dix exemplaires
----------
0.7064977884292603
à la présomption et à l'infamie
----------
0.9823763370513916
Blanche-Marie
----------


#### AutoClass and AutoTokenizer

Under the hood, `pipeline()` is powered by AutoModels and AutoTokenizers. An [AutoClass](https://huggingface.co/docs/transformers/main/en/model_doc/auto) is a shortcut that automatically retrieves the architecture of a pretrained model from its name or path. You only need to select the appropriate AutoClass for your task and its associated tokenizer with [AutoTokenizer](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoTokenizer).

A tokenizer is responsible for preprocessing text into a format that is understandable to the model. First, the tokenizer splits the text into smaller units called tokens. Several rules govern the tokenization process, including how to split a word and at what level (the preprocessing tutorial linked below covers this in more detail). The most important thing to remember is that you need to instantiate the tokenizer with the same model name, so that you use the same tokenization rules the model was pretrained with.

##### Translation

Let's return to our translation example and see how you can use the AutoClass to replicate the results of the pipeline().

In [None]:
from transformers import AutoTokenizer

# Load tokenizer with the AutoTokenizer
model_name = "Helsinki-NLP/opus-mt-en-es"
tokenizer = AutoTokenizer.from_pretrained(model_name)

Next, the tokenizer converts the tokens into numbers in order to construct a tensor as input to the model. The mapping from tokens to numbers is known as the model's vocabulary.

In [None]:
# No return_tensors here, so the result holds plain Python lists and there is nothing to move to the GPU
encoding = tokenizer("Peanuts are a good source of protein.")
print(encoding)

{'input_ids': [2506, 423, 3601, 9, 53, 8, 387, 3032, 7, 15084, 3, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


The tokenizer will return a dictionary containing:

*   input_ids: numerical representations of your tokens.
*   attention_mask: indicates which tokens should be attended to.
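
To make the two fields concrete, here is a toy sketch with a hand-written vocabulary. The ids, the whitespace splitting rule, and the `encode` helper are invented for illustration; the real tokenizer for this model splits text into subword pieces and uses its own vocabulary.

```python
# Toy vocabulary: every known token maps to one integer id (values are made up)
vocab = {
    "<unk>": 0, "</s>": 1, "peanuts": 2, "are": 3, "a": 4,
    "good": 5, "source": 6, "of": 7, "protein": 8, ".": 9,
}

def encode(text):
    """Split on whitespace, look up ids, and append an end-of-sequence id."""
    tokens = text.lower().replace(".", " .").split()
    input_ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens] + [vocab["</s>"]]
    # Every position holds a real token here, so the mask is all ones
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

print(encode("Peanuts are a good source of protein."))
```

Words missing from the vocabulary fall back to the `<unk>` id in this sketch, which is one reason real tokenizers prefer subword pieces.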

Just like the pipeline(), the tokenizer will accept a list of inputs. In addition, the tokenizer can also pad and truncate the text to return a batch with uniform length:

In [None]:
batch = tokenizer(
    ["Is a taco a sandwich?", "I like cilantro with my tacos."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
).to(device)
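
What padding and truncation do to a batch can be sketched in plain Python. `pad_batch` is a hypothetical helper and the token ids are arbitrary; the real tokenizer chooses the pad id and padding side from its own configuration.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Truncate to max_length, then right-pad every sequence to the longest one."""
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch_demo = pad_batch([[5, 6, 7], [8, 9, 10, 11, 12]], pad_id=0)
print(batch_demo["input_ids"])
print(batch_demo["attention_mask"])
```

The zeros in the attention mask line up with the padded positions, which is how the model knows to skip them.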

Read the [preprocessing tutorial](https://huggingface.co/docs/transformers/main/en/preprocessing) for more details about tokenization.

Transformers provides a simple and unified way to load pretrained instances. This means you can load an AutoModel like you would load an AutoTokenizer. The only difference is selecting the correct AutoModel for the task. Since you are doing translation (a sequence-to-sequence task), load [AutoModelForSeq2SeqLM](https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoModelForSeq2SeqLM):

In [None]:
from transformers import AutoModelForSeq2SeqLM

model_name = "Helsinki-NLP/opus-mt-en-es"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

See the [task summary](https://huggingface.co/docs/transformers/main/en/task_summary) for which AutoModel class to use for which task.

Now you can pass your preprocessed batch of inputs directly to the model. You just have to unpack the dictionary by adding **:

In [None]:
outputs = model.generate(**batch)  # the generated ids are already on the model's device
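
If the `**` syntax is new to you, this small sketch shows what it does. The `generate` function here is a stand-in written for the illustration, not the model method.

```python
def generate(input_ids, attention_mask=None, max_new_tokens=20):
    """Stand-in for a method that takes keyword arguments."""
    return (
        f"got {len(input_ids)} sequences, "
        f"mask given: {attention_mask is not None}, "
        f"max_new_tokens={max_new_tokens}"
    )

toy_batch = {"input_ids": [[1, 2], [3, 4]], "attention_mask": [[1, 1], [1, 1]]}

# These two calls are equivalent: ** turns each dictionary key into a keyword argument
a = generate(**toy_batch)
b = generate(input_ids=toy_batch["input_ids"], attention_mask=toy_batch["attention_mask"])
print(a == b)
```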

The model outputs are tokenized. We need to decode the output to be able to view the output in natural language:

In [None]:
# decode the outputs
decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(decoded)

['¿Un taco es un sándwich?', 'Me gusta el cilantro con mis tacos.']


##### Open-ended Text Generation (Causal Language Modeling)

Let's practice with an open-ended text generation task using AutoModelForCausalLM and AutoTokenizer.

We will use the GPT-2 causal language model.

In [None]:
#### Add your code below ####

# import the AutoModelForCausalLM and AutoTokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define the input text(s)
text = "The sun is"

# Encode the input text(s)
encoding = tokenizer(text, return_tensors="pt").to(device)

# Generate the output(s)
output = model.generate(**encoding, max_new_tokens=50)

# Decode the output(s)
decoded = tokenizer.decode(output[0], skip_special_tokens=True)

print(decoded)

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The sun is shining on the horizon.

The sun is shining on the horizon.

The sun is shining on the horizon.

The sun is shining on the horizon.

The sun is shining on the horizon.

The sun is


##### Chatting

You can also chat with Transformers!

Chat models continue chats. This means that you pass them a conversation history, which can be as short as a single user message, and the model will continue the conversation by adding its response.

In [None]:
chat = [
    {"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."},
    {"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"}
]

Notice that in addition to the **user's message**, we added a **system message** at the start of the conversation. Not all chat models support system messages, but when they do, they represent high-level directives about how the model should behave in the conversation. You can use this to guide the model - whether you want short or long responses, lighthearted or serious ones, and so on. If you want the model to do useful work instead of practicing its improv routine, you can either omit the system message or try a terse one such as "You are a helpful and intelligent AI assistant who responds to user queries."

The quickest way to continue the chat is using [TextGenerationPipeline](https://huggingface.co/docs/transformers/v4.47.1/en/main_classes/pipelines#transformers.TextGenerationPipeline).

We will use a 1.7 billion parameter chat model named [SmolLM2](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct).

In [None]:
# Use device (not device_map) with the 0 / -1 value computed earlier, so the CPU fallback also works
pipe = pipeline("text-generation", "HuggingFaceTB/SmolLM2-1.7B-Instruct", torch_dtype=torch.bfloat16, device=pipeline_device)
response = pipe(chat, max_new_tokens=512)
print(response[0]['generated_text'][-1]['content'])

Device set to use cuda:0


Oh, absolutely! New York, the city that never sleeps, where the streets are paved with gold and the pizza is always hot. Here are a few things you might enjoy:

1. **Visit the Statue of Liberty**: This iconic symbol of freedom and democracy is a must-see. Just remember to book your tickets in advance, as they often sell out.

2. **Take a stroll through Central Park**: This 843-acre green oasis in the middle of Manhattan is a perfect place to relax and enjoy the city's natural beauty.

3. **Explore the Metropolitan Museum of Art**: Known as "The Met", it's one of the world's largest and finest art museums.

4. **Dance the night away at a Broadway show**: New York is the heart of the American theater, and you can't miss a show.

5. **Visit Times Square**: Known as the "Crossroads of the World", Times Square is a bustling hub of activity, filled with bright lights, giant billboards, and a lively atmosphere.

6. **Take a ride on the Staten Island Ferry**: It's a great way to see the Statue

You can continue the chat by appending your own response to it. The response object returned by the pipeline actually contains the entire chat so far, so we can simply append a message and pass it back:

In [None]:
chat = response[0]['generated_text']
chat.append(
    {"role": "user", "content": "Wait, why are crowds and long lines fun?"}
)
response = pipe(chat, max_new_tokens=512)
print(response[0]['generated_text'][-1]['content'])

Oh, absolutely! The crowds and long lines are part of the fun, my friend. It's all part of the New York experience. You see, in a city that never sleeps, there's always something going on. Whether it's a bustling street, a packed theater, or a crowded restaurant, there's always something happening. And that's what makes it so exciting!

Plus, it's all part of the "New York Experience" - a unique blend of energy, excitement, and unpredictability. It's like a big, bustling, never-ending party. And who doesn't love a good party, right?


There are so many different chat models available on [Hugging Face Hub](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending). Choosing a model can be very overwhelming.

There are two important considerations when choosing a model:


1.   The model's size, which will determine if you can fit it in memory and how quickly it will run.
2.   The quality of the model's chat output.

Without quantization, you should expect to need about 2 bytes of memory per parameter. This means that an “8B” model with 8 billion parameters will need about 16GB of memory just to fit the parameters, plus a little extra for other overhead. Note that it is very common to use quantization techniques to reduce the memory usage per parameter to 8 bits, 4 bits, or even less.
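
The arithmetic above is easy to reproduce. This sketch counts only the weights (no activations or other overhead) and uses 1 GB = 1e9 bytes; `weight_memory_gb` is a helper name chosen for this illustration.

```python
def weight_memory_gb(num_params, bits_per_param=16):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# An 8B-parameter model at 16-bit, 8-bit, and 4-bit precision
for bits in (16, 8, 4):
    print(f"8B parameters at {bits}-bit: ~{weight_memory_gb(8e9, bits):.0f} GB")
```

The same function gives about 3.4 GB for the 1.7B-parameter model used earlier at 16-bit precision.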

Leaderboards are a good way to check which chat models perform well.
[OpenLLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) and the [LMSys Chatbot Arena Leaderboard](https://chat.lmsys.org/?leaderboard) are two popular leaderboards.
There are also [domain specific leaderboards](https://huggingface.co/blog/leaderboard-medicalllm).

###### Chat Model Exercise

For this exercise, find a chat model that you would like to use on [Hugging Face Hub](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending). A T4 GPU has about 16GB of memory, so in theory the weights of an 8B-parameter model just fit in "bfloat16" (16-bit precision) using the `torch_dtype` argument, as in the example above. In practice that leaves almost no room for activations and other overhead, so a smaller model is a safer choice.

Load the model either with TextGenerationPipeline or AutoModelForCausalLM (and AutoTokenizer) and start playing around with it.

Make sure to set up the input in the chat format if you are using AutoModelForCausalLM:

```python
# Prepare the input as before
chat = [
    {"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."},
    {"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"}
]
# Apply the chat template
formatted_chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
# Tokenize the chat (This can be combined with the previous step using tokenize=True)
inputs = tokenizer(formatted_chat, return_tensors="pt", add_special_tokens=False)
# Move the tokenized inputs to the same device the model is on (GPU/CPU)
inputs = {key: tensor.to(model.device) for key, tensor in inputs.items()}
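# Generate text from the model (temperature only takes effect when sampling is enabled)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.1)
print("Generated tokens:\n", outputs)
# Decode only the newly generated tokens back into text
decoded_output = tokenizer.decode(outputs[0][inputs["input_ids"].size(1):], skip_special_tokens=True)
print("Decoded output:\n", decoded_output)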
```
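
To make concrete what applying a chat template does: it flattens the list of messages into a single prompt string with role markers. The toy sketch below assumes a ChatML-style layout with `<|im_start|>` and `<|im_end|>` markers. The real template ships with each model's tokenizer and may look different, so `toy_chat_template` is for illustration only.

```python
def toy_chat_template(messages, add_generation_prompt=True):
    """Render messages in a ChatML-like layout (an assumption, not any model's exact template)."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model writes the reply next
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

demo_chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi!"},
]
print(toy_chat_template(demo_chat))
```

This also shows why `add_generation_prompt=True` matters: without the trailing assistant marker, the model has no cue that it is its turn to speak.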

In [None]:
##### ADD YOUR CODE HERE #####

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Qwen/Qwen2.5-1.5B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map=device
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)


config.json:   0%|          | 0.00/660 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/242 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/7.30k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

A Large Language Model (LLM) is a type of artificial intelligence that can understand and generate human-like text based on vast amounts of data it has been trained on. LLMs use advanced algorithms to analyze patterns in the input data and produce outputs that mimic natural language.

Some key characteristics of LLMs include:

1. **Vast Knowledge Base**: They have access to an enormous amount of information from various sources such as books, articles, web pages, and more, allowing them to provide comprehensive answers to questions or generate coherent texts.

2. **Natural Language Understanding**: LLMs are capable of understanding context, grammar, syntax, and semantics to generate responses that flow naturally like those produced by humans.

3. **Training**: These models are typically trained using massive datasets, often with billions of parameters, which allows them to learn complex relationships between words and phrases.

4. **Generative Abilities**: Beyond just answering queries

### Appendix

If you would like to learn how to improve open-ended language generation with very little effort, learn about better decoding methods.

https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb
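
As a small taste of what those decoding methods change, here is a toy sketch in plain Python. Greedy decoding always takes the most likely next token, which is likely why the GPT-2 output earlier in this lab repeats itself, while sampling with a temperature draws from the distribution instead. The token scores below are invented for the illustration.

```python
import math
import random

def greedy(scores):
    """Always pick the highest-scoring token."""
    return max(scores, key=scores.get)

def sample(scores, temperature=1.0, rng=random):
    """Draw a token with probability proportional to exp(score / temperature)."""
    tokens = list(scores)
    weights = [math.exp(scores[t] / temperature) for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Invented next-token scores after the prompt "The sun is"
next_token_scores = {"shining": 2.0, "setting": 1.5, "bright": 1.2, "hot": 0.8}

print(greedy(next_token_scores))
rng = random.Random(0)  # seeded so repeated runs draw the same tokens
print([sample(next_token_scores, temperature=1.0, rng=rng) for _ in range(5)])
```

In Transformers, the corresponding switches are `generate` arguments such as `do_sample=True`, `temperature`, and `top_k`; the linked notebook walks through them.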