# Introduction: Understanding Large Language Models with HuggingFace

## Overview

Large Language Models (LLMs) are advanced deep learning models that have reshaped Natural Language Processing (NLP) by understanding and generating human-like text. In this tutorial, we'll explore the fundamentals of text generation with **Llama-2-Ko**, a base model, using the Hugging Face Transformers library.

### Key Points:

- **Definition:** LLMs leverage neural networks to understand and generate text, capturing language patterns, context, and semantics through vast training datasets.

- **Applications:** LLMs find applications in various NLP tasks, including text generation, summarization, translation, question answering, and sentiment analysis.

## Text Generation Models

Text generation models, the most common type of LLM, specialize in producing coherent and contextually relevant text. They use deep learning to capture the intricacies of language and generate human-like responses.

### GPT Models:

- **Architecture:** GPT models use a transformer architecture, capturing long-range dependencies and contextual information effectively through attention mechanisms, self-attention layers, and positional encoding.

- **Working Principle:** Trained in an **unsupervised manner** (more precisely, self-supervised: the text itself provides the labels) on massive text data, GPT models learn to **predict the next token** in a sequence, which leads to a rich understanding of language patterns.
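
To make the next-token objective concrete, here is a minimal sketch in plain Python. It assumes whitespace-split words as "tokens" (real models use subword tokens) and only shows how training examples are derived from raw text, not how a model is actually trained.

```python
# Toy illustration of the next-token prediction objective.
# Assumption: "tokens" are whitespace-split words; real models use subword tokens.
text = "the cat sat on the mat"
words = text.split()

# Inputs are every token except the last; labels are the same sequence shifted by one.
inputs = words[:-1]
labels = words[1:]

# Each training example asks: given the context so far, which token comes next?
training_pairs = [(words[: i + 1], words[i + 1]) for i in range(len(words) - 1)]

for context, target in training_pairs:
    print(f"{' '.join(context):<20} -> {target}")
```

No human annotation is needed: every position in the text supplies its own label, which is why this kind of training scales to very large corpora.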

In this tutorial, we'll focus on working with the Llama-2-Ko **base model**, which is not specifically tuned for instructions or chat interactions. Subsequent tutorials will explore **instruction-tuned and chat-tuned LLMs**. Let's dive into the practical aspects of generating text with this base model.


# Practical Guide: Text Generation with HuggingFace


In this section, we'll walk through the practical steps of using Hugging Face Transformers for text generation. We'll use a GPT-style causal language model, Llama-2-Ko, to produce coherent and contextually relevant text. Follow along with the code snippets to understand the implementation details.

## Step 1: Install HuggingFace Transformers Library


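To get started, install the Hugging Face Transformers library, a comprehensive toolkit for natural language processing tasks. The later steps also need PyTorch, since tensors are requested with `return_tensors="pt"`, so make sure it is available in your environment. Run the following command in a code cell to install the library: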

In [1]:
%%capture
!pip install transformers

## Step 2: Import Required Libraries

To use Hugging Face's powerful Transformers library, we need to import essential components. In particular, we will make use of two key classes: `AutoModelForCausalLM` and `AutoTokenizer`.

### AutoModelForCausalLM

`AutoModelForCausalLM` is a class that provides a simple interface for loading pre-trained causal language models. A causal language model generates text auto-regressively: it predicts each token from the tokens that precede it. The class selects the appropriate model architecture from the configuration of the checkpoint you name, making it easy to experiment with different models without changing your code.

In the context of our tutorial, we use it to load Llama-2-Ko, a pre-trained GPT-style (decoder-only transformer) model for text generation. The `from_pretrained` method loads the model weights, and the `AutoModelForCausalLM` class ensures compatibility with the specific architecture.
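
The word "causal" means that each position may only use information from itself and earlier positions. The sketch below illustrates that constraint as a visibility matrix in plain Python. The function name `causal_mask` is illustrative and not part of the Transformers API; real models apply an equivalent mask inside their attention layers.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix; mask[i][j] is True if position i may attend to position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

mask = causal_mask(4)

# Print one row per position: 1 = visible, 0 = hidden (a future token).
for row in mask:
    print(" ".join("1" if allowed else "0" for allowed in row))
```

The lower-triangular pattern is what lets the model be trained on whole sequences at once while still only ever predicting a token from its past.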

### AutoTokenizer

`AutoTokenizer` is another crucial component from Hugging Face's Transformers library. Tokenizers are essential for processing raw text into a format that can be fed into the model. The `AutoTokenizer` class automatically selects the appropriate tokenizer based on the provided model name.

In our tutorial, we use it to load the tokenizer that belongs to the model we selected. This tokenizer breaks input text into tokens, which are then processed by the model. The `from_pretrained` method loads the pre-trained tokenizer associated with the chosen model.

By using `AutoModelForCausalLM` and `AutoTokenizer`, we can easily work with various pre-trained language models without having to worry about the specifics of each model's architecture or tokenization process.
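
As a rough mental model, the Auto classes read the `model_type` field from a checkpoint's configuration and dispatch to a concrete class. The sketch below is a toy illustration of that idea, not the library's implementation: the `CAUSAL_LM_CLASSES` registry and the `resolve_causal_lm_class` helper are hypothetical, and the real mapping covers many more architectures.

```python
# Hypothetical, simplified registry; the real library maintains a much larger mapping.
CAUSAL_LM_CLASSES = {
    "llama": "LlamaForCausalLM",
    "gpt2": "GPT2LMHeadModel",
}

def resolve_causal_lm_class(config):
    """Pick a class name from the `model_type` field of a config dictionary."""
    model_type = config["model_type"]
    if model_type not in CAUSAL_LM_CLASSES:
        raise ValueError(f"Unsupported model_type: {model_type!r}")
    return CAUSAL_LM_CLASSES[model_type]

# A Llama-family checkpoint declares model_type "llama" in its configuration.
print(resolve_causal_lm_class({"model_type": "llama"}))
```

Because the dispatch depends only on the checkpoint's configuration, swapping in a different model usually means changing nothing but the model name string.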

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer


## Step 3: Load a Pre-trained Model

Now, let's load a pre-trained model for text generation. In this tutorial, we'll use the "beomi/llama-2-ko-7b" model, an iteration of Llama-2 tailored for Korean text generation. Here's a brief overview of the model:

### Llama-2-Ko Overview

- **Developer:** Junbum Lee (Beomi)
- **Variations:** Llama-2-Ko comes in different parameter sizes, including 7B (used in this tutorial), 13B, and 70B, as well as pretrained and fine-tuned variations.
- **Input and Output:** The model takes input text and generates text as output, making it suitable for various text generation tasks.

### Vocabulary Expansion

- **Original Llama-2:** Vocabulary size of 32,000 using SentencePiece BPE.
- **Expanded Llama-2-Ko:** Vocabulary size increased to 46,336, incorporating Korean vocabulary and merges.
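
A quick back-of-the-envelope calculation shows what the expansion costs in parameters. The vocabulary sizes come from the figures above; the hidden size of 4096 and the separate input and output embedding matrices are assumptions about the 7B Llama-2 architecture, so treat the result as an estimate.

```python
original_vocab = 32_000
expanded_vocab = 46_336
hidden_size = 4096  # assumption: hidden size of the 7B Llama-2 architecture

added_tokens = expanded_vocab - original_vocab

# Assumption: each new token adds one row to the input embedding matrix
# and one row to the (untied) output projection.
extra_params = added_tokens * hidden_size * 2

print("Added tokens:", added_tokens)
print("Extra embedding parameters:", extra_params)
```

Under these assumptions the added rows are small relative to a 7B-parameter model, while the larger vocabulary lets Korean text be encoded in fewer tokens.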

In [3]:
# Initialize Model Name from HuggingFace Hub
model_name = "beomi/llama-2-ko-7b"

# Load Model and Tokenizers
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Loading checkpoint shards: 100%|██████████| 15/15 [00:02<00:00,  5.45it/s]


## Step 4: Tokenize Inputs

Tokenization is a crucial step in natural language processing that involves breaking down a sequence of text into smaller units, known as tokens. These tokens serve as the fundamental building blocks for language models.

Let's walk through the tokenization example with the sentence "안녕하세요, 오늘은 날씨가 좋네요." using Llama-2-Ko:


In [4]:
# Tokenize the sentence
tokens = tokenizer.tokenize("안녕하세요, 오늘은 날씨가 좋네요.") # Hello, the weather is nice today.
print("Tokenize sentence:", tokens)

# Tokenize the sentence and return the input IDs as a PyTorch tensor.
tokens_pt = tokenizer("안녕하세요, 오늘은 날씨가 좋네요.", return_tensors="pt")
print("Encoded tokens", tokens_pt['input_ids'])

Tokenize sentence: ['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요', '.']
Encoded tokens tensor([[    1, 43116, 33055, 29892, 41636, 32502, 33876, 32864, 29889]])


In this code snippet:

- `tokenizer.tokenize("안녕하세요, 오늘은 날씨가 좋네요.")`: Splits the input sentence into a list of subword tokens. The resulting list, `tokens`, shows how the input is broken down into individual units; the `▁` character marks the start of a word and stands in for the preceding space.

- `tokenizer("안녕하세요, 오늘은 날씨가 좋네요.", return_tensors="pt")`: Tokenizes the input sentence and returns a dictionary-like object whose `input_ids` entry is a PyTorch tensor of token IDs. The tensor holds nine IDs for eight tokens because the tokenizer prepends the beginning-of-sequence token (ID 1). This numerical representation is the format the Llama-2-Ko model expects as input.
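
The relationship between the two outputs can be reproduced by hand. The sketch below is a toy lookup in plain Python: the token/ID pairs are copied from the notebook output above, paired by position, and the tiny `toy_vocab` dictionary stands in for the real 46,336-entry vocabulary.

```python
# Token/ID pairs copied from the notebook output above (paired by position).
BOS_ID = 1  # beginning-of-sequence ID, the first entry of the encoded tensor
toy_vocab = {
    "▁안녕": 43116, "하세요": 33055, ",": 29892, "▁오늘은": 41636,
    "▁날": 32502, "씨가": 33876, "▁좋네요": 32864, ".": 29889,
}
example_tokens = ["▁안녕", "하세요", ",", "▁오늘은", "▁날", "씨가", "▁좋네요", "."]

# Encoding: prepend BOS, then look up each token's ID.
input_ids = [BOS_ID] + [toy_vocab[t] for t in example_tokens]
print(input_ids)

# Decoding (approximate): reverse the lookup and turn "▁" back into a space.
id_to_token = {i: t for t, i in toy_vocab.items()}
decoded = "".join(id_to_token[i] for i in input_ids[1:]).replace("▁", " ").strip()
print(decoded)
```

With the real tokenizer, `tokenizer.convert_tokens_to_ids(tokens)` performs the same lookup against the full vocabulary.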

## Step 5: Generate Text (Next Token Prediction)

Now that we have tokenized our input, let's generate text using the Llama-2-Ko model. In this step, we'll perform next token prediction, where the model predicts the next token in the sequence given the input tokens.


In [5]:
# Generate text using next token prediction
output = model.generate(input_ids=tokens_pt['input_ids'], max_length=100)
print("Generated output (IDs):", output)

# Decode the generated output
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated text:", generated_text)

Generated output (IDs): tensor([[    1, 43116, 33055, 29892, 41636, 32502, 33876, 32864, 29889, 29871,
         30166, 35715, 31354, 34333, 35874, 32500, 31286, 34595, 31435, 35362,
         32349, 32869, 29889, 29871, 30166, 34120,   525, 33139, 32019, 30906,
         32140, 32921, 32305, 29915, 32295, 32500, 32214, 29889, 29871, 30166,
         30393, 32500, 31354, 34333, 29871, 29906, 29900, 29896, 29953, 31571,
         29871, 29896, 36648, 34156, 34537, 32500, 32214, 29889, 29871, 30166,
         31607, 33070, 31054, 35323, 29871, 29906, 29900, 30890, 34685, 43505,
         29892, 29871, 30166, 44897, 31299, 33787, 31286, 33889, 32504, 43143,
         33406, 33115, 32757, 32043, 29889, 29871, 30166, 31607, 33070, 31054,
         35323,   525, 35356, 32153, 32624, 39486, 32474, 17901, 32295, 34963]])
Generated text: 안녕하세요, 오늘은 날씨가 좋네요. ​오늘은 제가 좋아하는 책을 소개해드리려고 합니다. ​바로 '나는 나로 살기로 했다'라는 책입니다. ​이 책은 제가 2016년 1월에 읽었던 책입니다. ​그 당시에 저는 20대 후반이었고, ​직장생활을 하면서 많은 스트레스를 받고 있었습니다. ​그 당시에 저는 '내

In this code snippet:

- `output = model.generate(input_ids=tokens_pt['input_ids'], max_length=100)`: Uses the `generate` method of the Llama-2-Ko model to predict the next tokens in the sequence, one at a time. The `input_ids` parameter is set to the PyTorch tensor of input IDs obtained from tokenization (`tokens_pt['input_ids']`). The `max_length` parameter caps the total length of the returned sequence, prompt tokens included, which is why the output above holds exactly 100 IDs.

- `generated_text = tokenizer.decode(output[0], skip_special_tokens=True)`: After generating the output IDs, this line uses the `decode` method of the tokenizer to convert the numerical output into human-readable text. The `output[0]` corresponds to the first sequence in the generated output, and `skip_special_tokens=True` removes any special tokens from the final decoded text.
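
Conceptually, generation is a loop that repeatedly asks the model for the next token and appends it to the sequence. The sketch below shows the simplest strategy, greedy decoding, in plain Python. `toy_next_token_scores` is a hypothetical stand-in for a real model, and `generate` itself supports other strategies (such as sampling or beam search) depending on its arguments and the model's generation configuration, so this is an illustration rather than a description of the exact call above.

```python
def toy_next_token_scores(ids, vocab_size=5):
    """Hypothetical stand-in for a model: gives the top score to (last_id + 1) % vocab_size."""
    favourite = (ids[-1] + 1) % vocab_size
    return [1.0 if token_id == favourite else 0.0 for token_id in range(vocab_size)]

def greedy_generate(prompt_ids, max_length, eos_id=None):
    """Append the highest-scoring token until the sequence has max_length tokens or EOS appears."""
    ids = list(prompt_ids)
    while len(ids) < max_length:  # max_length counts the prompt tokens too
        scores = toy_next_token_scores(ids)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

print(greedy_generate([0, 1], max_length=6))
```

Note that the loop stops on total length, so a longer prompt leaves less room for new tokens under the same `max_length`.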

# Closing Note

In this tutorial, we've explored the fascinating world of **Large Language Models (LLMs)** with a focus on text generation using the **Llama-2-Ko base model**. It's crucial to understand that this **base model** is primarily trained to predict and complete text and is not explicitly tuned for instruction following or chat-based interactions.

Let's take a quick look at an example to illustrate the behavior of this base model:

In [6]:
# Illustrative example with the base model
prompt = "인도네시아의 도시들을 인구 규모별로 정렬된 목록으로 제시해주세요." # Give me a list of the cities in Indonesia sorted by population size.

# Generate text using the base model
output = model.generate(input_ids=tokenizer(prompt, return_tensors="pt")['input_ids'], max_length=100)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

print("Prompt:", prompt)
print("Generated text:", generated_text)

Prompt: 인도네시아의 도시들을 인구 규모별로 정렬된 목록으로 제시해주세요.
Generated text: 인도네시아의 도시들을 인구 규모별로 정렬된 목록으로 제시해주세요.⊙ 1960년대 1960년 4.19혁명으로 이승만 독재정권이 무너지고 장면 민주당 정권이 들어섰으나 1961년 5.16 군사정변으로 박정희 군사정권이 들어섰다. 1962년 1월 1일 시행된 '향토예비군 설치법'에 따라 19


**Note:** The above example demonstrates that the base model simply continues the prompt as ordinary text rather than following the instruction. Here it drifts into unrelated content about 1960s Korean history instead of listing Indonesian cities.

In our upcoming tutorial, we will delve into the specialized capabilities of instruction-tuned and chat-tuned LLMs, exploring their ability to follow instructions and engage in chat-like interactions. Stay tuned for an in-depth journey into the tailored functionalities of these advanced language models!