# **Base Model Part**
- **Model:** Llama-2-7b-hf(LLaMA2)
- base model version

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import login
import torch
import os

# Read the Hugging Face access token from the environment instead of hardcoding it.
# Set HF_TOKEN before running this cell; never commit a real token to a notebook.
login(os.environ["HF_TOKEN"])

torch_device = "cuda" if torch.cuda.is_available() else "cpu"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# add the EOS token as PAD token to avoid warnings
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    pad_token_id=tokenizer.eos_token_id,
    torch_dtype=torch.float16  # load the model in FP16
).to(torch_device)

**Test Sentences**

In [None]:
# Define 3 candidate sentences
candidate_sentences = [
    'Hanyang University is',
    'Difficulty of learning Korean compare to English is',
    'Alpine ski is'
]

**1. Greedy Search**

- The simplest decoding method.
- Selects the highest-probability word as the next word **at each timestep t**.


**Pro**
- Simple and fast


**Con**
- Tends to repeat itself
- Misses high-probability words hidden behind a low-probability word -> this can be alleviated with **Beam Search**
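
The second drawback can be seen on a tiny hand-made example. The sketch below is not model output: `TOY_LM` and its probabilities are made-up values chosen only to illustrate the point, and `greedy_decode` is a hypothetical helper.

```python
# Toy next-token table (made-up numbers, not model output).
# Keys are the previous token; values map each candidate next token to P(next | previous).
TOY_LM = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car": {"is": 0.5, "drives": 0.3, "turns": 0.2},
}


def greedy_decode(start, steps):
    """Pick the single most probable next token at every step."""
    tokens, prob = [start], 1.0
    for _ in range(steps):
        candidates = TOY_LM.get(tokens[-1])
        if not candidates:
            break  # no known continuation for this token
        best = max(candidates, key=candidates.get)
        prob *= candidates[best]
        tokens.append(best)
    return tokens, prob


greedy_tokens, greedy_prob = greedy_decode("The", steps=2)
print(greedy_tokens, round(greedy_prob, 2))  # ['The', 'nice', 'woman'] 0.2
```

In this toy table the sequence "The dog has" has probability 0.4 * 0.9 = 0.36, but greedy search never sees it because "dog" loses to "nice" at the first step.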


In [None]:
print("Output for Greedy Search:\n" + 100 * '-')

# generate max 30 new tokens
for sentence in candidate_sentences:
    # Encode the sentence
    model_inputs = tokenizer(sentence, return_tensors='pt').to(torch_device)

    # Generate up to 30 new tokens
    greedy_output = model.generate(**model_inputs, max_new_tokens=30)

    # Decode the generated text and format output
    generated_text = tokenizer.decode(greedy_output[0], skip_special_tokens=True)

    # Print the input sentence and the corresponding output
    print(f"Input Sentence: {sentence}")
    print(f"Generated Output: {generated_text}")
    print("-" * 100)

**2. Beam Search**

- At each decoder step, keep track of the k most probable partial sequences (hypotheses).
(k : beam size)

- Every hypothesis has a score (its log probability)

- Decode until **timestep T is reached** or **at least n hypotheses are completed**

**Pro**
- Reduces the risk of missing hidden high-probability word sequences
- Works well in tasks such as Machine Translation or Text Summarization


**Con**
- Does not guarantee finding the optimal solution (the most likely output)
- Repetition still happens -> it can be reduced with a **no-repeat n-gram** penalty
- Not well suited to open-ended generation such as dialogue and story generation
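
A minimal sketch of the bookkeeping described above, on a toy table rather than a real model. `TOY_LM` and its probabilities are made-up values, and `beam_search` is a hypothetical helper; it omits end-of-sequence handling and length normalization.

```python
import math

# Toy next-token table (made-up numbers, not model output).
TOY_LM = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car": {"is": 0.5, "drives": 0.3, "turns": 0.2},
}


def beam_search(start, steps, num_beams):
    """Keep the num_beams highest-scoring partial sequences at each step."""
    beams = [([start], 0.0)]  # (tokens, summed log-probability)
    for _ in range(steps):
        expanded = []
        for tokens, score in beams:
            candidates = TOY_LM.get(tokens[-1])
            if not candidates:
                expanded.append((tokens, score))  # nothing to extend; carry over
                continue
            for token, p in candidates.items():
                expanded.append((tokens + [token], score + math.log(p)))
        # Prune: keep only the best num_beams hypotheses
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:num_beams]
    return beams


best_tokens, best_score = beam_search("The", steps=2, num_beams=2)[0]
print(best_tokens, round(math.exp(best_score), 2))  # ['The', 'dog', 'has'] 0.36
```

With two beams, "The dog" survives the first step even though it is not the top choice, which lets the search reach the higher-probability sequence "The dog has".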



In [None]:
# activate beam search and early_stopping
print("Output for Beam Search with Early Stopping:\n" + 100 * '-')

# Generate max 30 new tokens using beam search
for sentence in candidate_sentences:
    # Encode the sentence
    model_inputs = tokenizer(sentence, return_tensors='pt').to(torch_device)

    # Generate output using beam search
    beam_output = model.generate(
        **model_inputs,
        max_new_tokens=30,
        num_beams=5,
        early_stopping=True
    )

    # Decode the generated text
    generated_text = tokenizer.decode(beam_output[0], skip_special_tokens=True)

    # Print the input sentence and the corresponding beam search output
    print(f"Input Sentence: {sentence}")
    print(f"Generated Output (Beam Search): {generated_text}")
    print("-" * 100)


**3. No Repeat N-Gram**

- A simple option for preventing repetition in Beam Search decoding
- Ensures that no n-gram appears twice by manually setting the probability of any next word that would create an already seen n-gram to 0
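
The idea behind the penalty can be sketched in plain Python. `banned_next_tokens` is a hypothetical helper written for illustration, not the library's implementation: it lists the tokens that would complete an n-gram already present in the sequence, which are the tokens whose probability would be set to 0.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if ngram_size < 1 or len(tokens) < ngram_size:
        return set()
    # The last (n - 1) tokens form the prefix of the n-gram about to be created
    prefix = tuple(tokens[len(tokens) - ngram_size + 1:])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


generated = ["I", "like", "to", "walk", "and", "I", "like"]
banned = banned_next_tokens(generated, ngram_size=2)
print(banned)  # {'to'}
```

Here the 2-gram "like to" has already been generated, so "to" is blocked as the next token after the second "like".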

In [None]:
# set no_repeat_ngram_size=2 so that no 2-gram appears twice in the output
print("Output for Beam Search with No-Repeat N-Gram (size 2):\n" + 100 * '-')

# Generate max 30 new tokens using beam search with the n-gram penalty
for sentence in candidate_sentences:
    # Encode the sentence
    model_inputs = tokenizer(sentence, return_tensors='pt').to(torch_device)

    # Generate output with repeated 2-grams blocked
    beam_outputs = model.generate(
        **model_inputs,
        max_new_tokens=30,
        num_beams=5,
        no_repeat_ngram_size=2,
        early_stopping=True
    )

    # Print each returned sequence with its index
    # (only one is returned unless num_return_sequences is set)
    print(f"Input Sentence: {sentence}")
    for i, beam_output in enumerate(beam_outputs):
        generated_text = tokenizer.decode(beam_output, skip_special_tokens=True)
        print(f"Sequence {i + 1}: {generated_text}")
    print("-" * 100)


**4. Vanilla Sampling**

- Randomly picks the next word w_t according to its conditional probability distribution


**Pro**
- Repetition occurs less often than with Beam Search or Greedy decoding

**Con**
- The model often generates incoherent gibberish because vanilla sampling makes every token in the vocabulary an option.
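
A minimal sketch of sampling from a next-token distribution using only the standard library. The distribution `next_token_probs` is made up for illustration, and `sample_next_token` is a hypothetical helper.

```python
import random

# Toy next-token distribution (made-up numbers, not model output)
next_token_probs = {"nice": 0.5, "dog": 0.4, "car": 0.1}


def sample_next_token(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_next_token(next_token_probs, rng) for _ in range(1000)]
counts = {t: samples.count(t) for t in next_token_probs}
print(counts)
```

Every token with nonzero probability can be drawn, including unlikely ones such as "car", which is what makes unfiltered sampling prone to incoherent output over a full vocabulary.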

In [None]:
# activate sampling and deactivate top_k filtering by setting top_k to 0
print("Output for Sampling with Top_k=0:\n" + 100 * '-')

# Generate max 30 new tokens using sampling with top_k set to 0
for sentence in candidate_sentences:
    # Encode the sentence
    model_inputs = tokenizer(sentence, return_tensors='pt').to(torch_device)

    # Generate output using sampling
    sample_output = model.generate(
        **model_inputs,
        max_new_tokens=30,
        do_sample=True,
        top_k=0
    )

    # Decode the generated text
    generated_text = tokenizer.decode(sample_output[0], skip_special_tokens=True)

    # Print the input sentence and the corresponding sampled output
    print(f"Input Sentence: {sentence}")
    print(f"Generated Output (Sampling): {generated_text}")
    print("-" * 100)


**5. Top - K Sampling**

- Sample from the top k tokens in the probability distribution.
- Increasing k -> Diverse, risky output
- Decreasing k -> Safe, generic output

**Pro**
- Much more human-like text compared to **Vanilla Sampling**

**Con**
- k is fixed, so it cannot adapt to how peaked or flat each step's distribution is
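
The filtering step can be sketched as follows. `top_k_filter` is a hypothetical helper and the distribution is made up; the sketch keeps the k most probable tokens and renormalizes them so they sum to 1 before sampling.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Toy next-token distribution (made-up numbers, not model output)
toy_probs = {"nice": 0.5, "dog": 0.25, "car": 0.125, "xylophone": 0.125}
filtered = top_k_filter(toy_probs, k=2)
print({t: round(p, 3) for t, p in filtered.items()})  # {'nice': 0.667, 'dog': 0.333}
```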


In [None]:
# set top_k to 50 for sampling
print("Output for Sampling with Top_k=50:\n" + 100 * '-')

# Generate max 30 new tokens using sampling with top_k set to 50
for sentence in candidate_sentences:
    # Encode the sentence
    model_inputs = tokenizer(sentence, return_tensors='pt').to(torch_device)

    # Generate output using sampling with top_k set to 50
    sample_output = model.generate(
        **model_inputs,
        max_new_tokens=30,
        do_sample=True,
        top_k=50
    )

    # Decode the generated text
    generated_text = tokenizer.decode(sample_output[0], skip_special_tokens=True)

    # Print the input sentence and the corresponding sampled output
    print(f"Input Sentence: {sentence}")
    print(f"Generated Output (Sampling, Top_k=50): {generated_text}")
    print("-" * 100)


**6. Top - p (Nucleus) Sampling**

- Sample from the smallest set of tokens whose cumulative probability mass reaches p

**Pro**
- The number of words in the set can dynamically increase and decrease
- Diversity increases

**Con**
- Sensitive to the value of p.
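
A minimal sketch of how the candidate set grows and shrinks with the shape of the distribution. `top_p_filter` is a hypothetical helper and both distributions are made up; the library's own implementation may differ in edge cases.

```python
def top_p_filter(probs, p):
    """Keep the most probable tokens until their cumulative mass reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return {token: prob / cumulative for token, prob in kept}


# Toy distributions (made-up numbers): one peaked, one flat
peaked = {"has": 0.9, "runs": 0.05, "and": 0.03, "is": 0.02}
flat = {"woman": 0.3, "house": 0.3, "guy": 0.2, "man": 0.2}
peaked_set = top_p_filter(peaked, 0.92)
flat_set = top_p_filter(flat, 0.92)
print(len(peaked_set), len(flat_set))  # 2 4
```

With the same p, the peaked distribution keeps only two candidates while the flat one keeps all four.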



In [None]:
# set top_k to 0 and top_p to 0.92 for sampling
print("Output for Sampling with Top_k=0 and Top_p=0.92:\n" + 100 * '-')

# Generate max 30 new tokens using sampling with top_p=0.92 and top_k=0
for sentence in candidate_sentences:
    # Encode the sentence
    model_inputs = tokenizer(sentence, return_tensors='pt').to(torch_device)

    # Generate output using sampling with top_p=0.92 and top_k=0
    sample_output = model.generate(
        **model_inputs,
        max_new_tokens=30,
        do_sample=True,
        top_p=0.92,
        top_k=0
    )

    # Decode the generated text
    generated_text = tokenizer.decode(sample_output[0], skip_special_tokens=True)

    # Print the input sentence and the corresponding sampled output
    print(f"Input Sentence: {sentence}")
    print(f"Generated Output (Sampling, Top_p=0.92, Top_k=0): {generated_text}")
    print("-" * 100)


**Test a complex question with the Top-p decoding algorithm**

- **Complex step-by-step question:** "Kim has 3 boxes of apples, with each box containing 10 apples.
Kim gives 1 box of apples to Jung and eats 3 apples from one of the remaining boxes.
How many apples does Kim have now?"
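
As a reference for judging the model's outputs, the expected answer can be worked out directly. The variable names below are illustrative only.

```python
boxes, apples_per_box = 3, 10

total = boxes * apples_per_box            # 3 * 10 = 30 apples to start
after_gift = total - 1 * apples_per_box   # gives away one full box: 30 - 10 = 20
expected_answer = after_gift - 3          # eats 3 apples: 20 - 3 = 17
print(expected_answer)  # 17
```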

In [None]:
# Test sentence
complex_questions = """
Kim has 3 boxes of apples, with each box containing 10 apples.
Kim gives 1 box of apples to Jung and eats 3 apples from one of the remaining boxes.
How many apples does Kim have now?
"""


# Generate max 100 new tokens using sampling with top_p=0.92 and top_k=0

# Encode the sentence
base_model_inputs = tokenizer(complex_questions, return_tensors='pt').to(torch_device)

# Generate output using sampling with top_p=0.92 and top_k=0
base_model_output = model.generate(
    **base_model_inputs,
    max_new_tokens=100,
    do_sample=True,
    top_p=0.92,
    top_k=0
)

# Decode the generated text
generated_text = tokenizer.decode(base_model_output[0], skip_special_tokens=True)
print(generated_text)

**Prompting**
- Few Shot
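
A few-shot prompt is a list of solved examples followed by one unanswered question. The helper below is a hypothetical sketch of building such a prompt programmatically; the `QuestionN:` / `AnswerN:` layout is an assumption modeled on the prompt used in this notebook.

```python
def build_few_shot_prompt(examples, question):
    """Format solved (question, answer) pairs followed by one unanswered question."""
    lines = []
    for i, (q, a) in enumerate(examples, start=1):
        lines.append(f"Question{i}: {q}")
        lines.append(f"Answer{i}: {a}")
        lines.append("")  # blank line between examples
    n = len(examples) + 1
    lines.append(f"Question{n}: {question}")
    lines.append(f"Answer{n}:")
    return "\n".join(lines)


examples = [
    ("What is 6 - 4 ?", "Answer is 2."),
    ("Alex has 5 pens, and he buys 3 more pens. How many pens does Alex have now?",
     "Alex now has 8 pens."),
]
prompt = build_few_shot_prompt(examples, "Kim has 10 apples and gives 4 apples to Jung. "
                                         "How many apples does Kim have now?")
print(prompt)
```

Ending the prompt on the bare `AnswerN:` label nudges the model to continue with an answer in the same style as the examples.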

In [None]:
# Few-Shot Prompt
few_shot_prompt = """
Question1: What is 6 - 4 ?
Answer1: Answer is 2.

Question2: Alex has 5 pens, and he buys 3 more pens. How many pens does Alex have now?
Answer2: Alex now has 8 pens.

Question3: Kim has 10 apples and gives 4 apples to Jung. How many apples does Kim have now?
Answer3: Kim has 6 apples left.

Question4: Kim has 3 boxes of apples, with each box containing 10 apples.
Kim gives 1 box of apples to Jung and eats 3 apples from one of the remaining boxes.
How many apples does Kim have now?

What is the answer of Question4?
Answer4:
"""

# Encode the prompt
base_model_few_shot_prompt = tokenizer(few_shot_prompt, return_tensors="pt").to(torch_device)

# Generate response
base_model_few_shot_output = model.generate(
    **base_model_few_shot_prompt,
    max_new_tokens=100,
    do_sample=True,
    top_p=0.92,
    top_k=0
)

# Decode and print the output
few_shot_generated_text = tokenizer.decode(base_model_few_shot_output[0], skip_special_tokens=True)
print(few_shot_generated_text)

**Prompting**
- Chain-of-Thought (CoT)

In [None]:
# COT Prompt
cot_prompt = """
Question1: Sarah has 100 dollars. She buys a book for 40 dollars and a pencil for 5 dollars.
How much money does Sarah have left?
Answer1: Let's go through this step-by-step.
Sarah starts with 100 dollars. After buying the book for 40 dollars, she has 100 - 40 = 60 dollars.
Then, after buying the pencil for 5 dollars, she has 60 - 5 = 55 dollars. So, Sarah has 55 dollars left.

Question2: John’s flight departs at 3:00 PM and arrives at 6:30 PM. How long is John’s flight?
Answer2: Let's go through this step-by-step.
John’s flight departs at 3:00 PM and arrives at 6:30 PM. From 3:00 PM to 6:00 PM is 3 hours.
From 6:00 PM to 6:30 PM is 30 minutes. So, the total flight duration is 3 hours and 30 minutes.

Question3: Kim has 3 boxes of apples, with each box containing 10 apples.
Kim gives 1 box of apples to Jung and eats 3 apples from one of the remaining boxes.
How many apples does Kim have now?

What is the answer of Question3?
Answer3:

"""

# Encode the prompt
base_model_cot_prompt = tokenizer(cot_prompt, return_tensors="pt").to(torch_device)

# Generate response
base_model_cot_output = model.generate(
    **base_model_cot_prompt,
    max_new_tokens=200,
    do_sample=True,
    top_p=0.92,
    top_k=0
)

# Decode and print the output
cot_generated_text = tokenizer.decode(base_model_cot_output[0], skip_special_tokens=True)  # decode the base model's CoT output
print(cot_generated_text)


# **Instruction fine-tuned part**

- **Model:** Llama-2-7b-chat-hf(LLaMA2)
- Instruction fine-tuned version

In [None]:
# Load tokenizer and model
fine_tuned_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
# add the EOS token as PAD token to avoid warnings
fine_tuned_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    pad_token_id=fine_tuned_tokenizer.eos_token_id,
    torch_dtype=torch.float16  # load the model in FP16
).to(torch_device)

**Test a complex question with the Top-p decoding algorithm**

- **Complex step-by-step question:** "Kim has 3 boxes of apples, with each box containing 10 apples.
Kim gives 1 box of apples to Jung and eats 3 apples from one of the remaining boxes.
How many apples does Kim have now?"

In [None]:
# Test sentence
complex_questions = """
Kim has 3 boxes of apples, with each box containing 10 apples.
Kim gives 1 box of apples to Jung and eats 3 apples from one of the remaining boxes.
How many apples does Kim have now?
"""


# Generate max 200 new tokens using sampling with top_p=0.92 and top_k=0

# Encode the sentence with the fine-tuned model's tokenizer
fine_tuned_model_inputs = fine_tuned_tokenizer(complex_questions, return_tensors='pt').to(torch_device)

# Generate output using sampling with top_p=0.92 and top_k=0
fine_tuned_model_output = fine_tuned_model.generate(
    **fine_tuned_model_inputs,
    max_new_tokens=200,
    do_sample=True,
    top_p=0.92,
    top_k=0
)

# Decode the generated text
generated_text = fine_tuned_tokenizer.decode(fine_tuned_model_output[0], skip_special_tokens=True)
print(generated_text)

**Prompting**
- Few-Shot Prompting

In [None]:
# Few-Shot Prompt
few_shot_prompt = """
Question1: What is 6 - 4 ?
Answer1: Answer is 2.

Question2: Alex has 5 pens, and he buys 3 more pens. How many pens does Alex have now?
Answer2: Alex now has 8 pens.

Question3: Kim has 10 apples and gives 4 apples to Jung. How many apples does Kim have now?
Answer3: Kim has 6 apples left.

Question4: Kim has 3 boxes of apples, with each box containing 10 apples.
Kim gives 1 box of apples to Jung and eats 3 apples from one of the remaining boxes.
How many apples does Kim have now?

What is the answer of Question4?
Answer4:
"""

# Encode the prompt
fine_tuned_model_few_shot_input = fine_tuned_tokenizer(few_shot_prompt, return_tensors="pt").to(torch_device)

# Generate response
fine_tuned_model_few_shot_output = fine_tuned_model.generate(
    **fine_tuned_model_few_shot_input,
    max_new_tokens=100,
    do_sample=True,
    top_p=0.92,
    top_k=0
)

# Decode and print the output
few_shot_generated_text = fine_tuned_tokenizer.decode(fine_tuned_model_few_shot_output[0], skip_special_tokens=True)
print(few_shot_generated_text)

**Prompting**
- CoT(Chain of Thought) Prompting

In [None]:
# COT Prompt
cot_prompt = """
Question1: Sarah has 100 dollars. She buys a book for 40 dollars and a pencil for 5 dollars.
How much money does Sarah have left?
Answer1: Let's go through this step-by-step.
Sarah starts with 100 dollars. After buying the book for 40 dollars, she has 100 - 40 = 60 dollars.
Then, after buying the pencil for 5 dollars, she has 60 - 5 = 55 dollars. So, Sarah has 55 dollars left.

Question2: John’s flight departs at 3:00 PM and arrives at 6:30 PM. How long is John’s flight?
Answer2: Let's go through this step-by-step.
John’s flight departs at 3:00 PM and arrives at 6:30 PM. From 3:00 PM to 6:00 PM is 3 hours.
From 6:00 PM to 6:30 PM is 30 minutes. So, the total flight duration is 3 hours and 30 minutes.

Question3: Kim has 3 boxes of apples, with each box containing 10 apples.
Kim gives 1 box of apples to Jung and eats 3 apples from one of the remaining boxes.
How many apples does Kim have now?

What is the answer of Question3?
Answer3:

"""

# Encode the prompt
fine_tuned_model_cot_input = fine_tuned_tokenizer(cot_prompt, return_tensors="pt").to(torch_device)

# Generate response
fine_tuned_model_cot_output = fine_tuned_model.generate(
    **fine_tuned_model_cot_input,
    max_new_tokens=150,
    do_sample=True,
    top_p=0.92,
    top_k=0
)

# Decode and print the output
cot_generated_text = fine_tuned_tokenizer.decode(fine_tuned_model_cot_output[0], skip_special_tokens=True)  # decode the fine-tuned model's CoT output
print(cot_generated_text)