<a href="https://colab.research.google.com/github/hanghae-plus-AI/AI-1-tolluset/blob/main/week5/5-basic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip -q install datasets

In [2]:
from huggingface_hub import login
from google.colab import userdata


login(userdata.get('HF_TOKEN'))

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM

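# NOTE: the recorded output below (Gemma activation warning, `<bos>` token, token ids above 200000)
# indicates a Gemma checkpoint was loaded when the outputs were captured, not the model id shown here.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", device_map="auto")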

`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.



✅ In this assignment, solve the fancyzhx/ag_news task from the previous week's assignment as zero-shot classification. Keep the points below in mind.

In [4]:
from datasets import load_dataset

ag_news = load_dataset("fancyzhx/ag_news")

def preprocess_function(examples):
    return tokenizer(examples["text"], max_length=300, truncation=True)

tokenized_ag_news = ag_news.map(preprocess_function, batched=True)

print(ag_news)
print(tokenized_ag_news)


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})
DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 7600
    })
})


In [5]:
ag_news_labels = ag_news['train'].features['label'].names

ag_news_labels

['World', 'Sports', 'Business', 'Sci/Tech']

In [6]:
input_text = "The new LLL model is published name gemma"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
print(outputs)

print(tokenizer.decode(outputs[0]))



tensor([[     2,    651,    888,    629,   1650,   2091,    603,   7018,   1503,
          21737,    534, 235265,    109,    651,    888,    629,   1650,   2091,
            603,   7018]], device='cuda:0')
<bos>The new LLL model is published name gemma.

The new LLL model is published


In [7]:
tokens = input_ids['input_ids']
print(tokens)

logits = model(**input_ids).logits
for i in range(tokens.shape[-1]):
    token = tokens[0, i].item()
    print(logits[0, i, token])

tensor([[    2,   651,   888,   629,  1650,  2091,   603,  7018,  1503, 21737,
           534]], device='cuda:0')
tensor(-18.2747, device='cuda:0', grad_fn=<SelectBackward0>)
tensor(-30.8340, device='cuda:0', grad_fn=<SelectBackward0>)
tensor(-24.9039, device='cuda:0', grad_fn=<SelectBackward0>)
tensor(6.9362, device='cuda:0', grad_fn=<SelectBackward0>)
tensor(-12.3923, device='cuda:0', grad_fn=<SelectBackward0>)
tensor(-24.0329, device='cuda:0', grad_fn=<SelectBackward0>)
tensor(-26.9775, device='cuda:0', grad_fn=<SelectBackward0>)
tensor(-27.1768, device='cuda:0', grad_fn=<SelectBackward0>)
tensor(-19.8710, device='cuda:0', grad_fn=<SelectBackward0>)
tensor(-12.1889, device='cuda:0', grad_fn=<SelectBackward0>)
tensor(-15.2993, device='cuda:0', grad_fn=<SelectBackward0>)
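
For a causal language model, the logits at position `i` score the token at position `i + 1`, so the lookup `logits[0, i, tokens[0, i]]` in the cell above reads the score for repeating the current token rather than for predicting it. The sketch below contrasts the two lookups in plain Python on made-up numbers (no model involved; `next_token_scores` and `same_position_scores` are hypothetical helper names):

```python
def same_position_scores(logits, tokens):
    """What the cell above prints: logits[i] indexed by tokens[i]."""
    return [logits[i][tokens[i]] for i in range(len(tokens))]

def next_token_scores(logits, tokens):
    """Shifted lookup: logits[i] indexed by the token that follows, tokens[i + 1].

    The first token has no score because nothing precedes it.
    """
    return [logits[i][tokens[i + 1]] for i in range(len(tokens) - 1)]

# Toy data: 3 positions, vocabulary of size 3 (made-up values).
toy_logits = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.4], [0.0, 0.0, 0.0]]
toy_tokens = [2, 1, 0]

print(same_position_scores(toy_logits, toy_tokens))  # [0.3, 0.2, 0.0]
print(next_token_scores(toy_logits, toy_tokens))     # [2.0, 1.5]
```

The shifted version returns one value fewer than the sequence length, which is a quick sanity check when porting the idea to tensors.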


In [8]:
import torch

def zero_shot_classification(text, task_description, labels):
    # text: the input to classify; task_description: the task prompt;
    # labels: the class names already converted to text.
    # First concatenate task_description and text, then tokenize.
    text_ids = tokenizer(task_description + text, return_tensors="pt").to("cuda")
    probs = []
    for label in labels:  # Tokenize each label text, append it to the input, and feed it to Gemma-2B.
        label_ids = tokenizer(label, return_tensors="pt").to("cuda")
        # Number of tokens in the label text; the -1 assumes the tokenizer prepends one BOS token.
        n_label_tokens = label_ids['input_ids'].shape[-1] - 1
        input_ids = {
            # Append the label tokens (without their first token) to the prompt tokens.
            'input_ids': torch.concatenate([text_ids['input_ids'], label_ids['input_ids'][:, 1:]], axis=-1),
            'attention_mask': torch.concatenate([text_ids['attention_mask'], label_ids['attention_mask'][:, 1:]], axis=-1)
        }

        with torch.no_grad():  # inference only, so no gradients are needed
            logits = model(**input_ids).logits  # Compute the logits.
        prob = 0  # despite the name, this accumulates raw logits, not probabilities
        n_total = input_ids['input_ids'].shape[-1]
        # A label usually spans several tokens; its score is defined as the sum of the logits of its tokens.
        # NOTE: logits[0, p] scores the token at position p + 1, while this loop reads position
        # n_total - i for label token i, so tokens and positions are not aligned as next-token predictions.
        for i in range(n_label_tokens, 0, -1):
            token = label_ids['input_ids'][0, i].item()
            prob += logits[0, n_total - i, token].item()
        probs.append(prob)

        # Free GPU memory promptly. If GPU memory is plentiful, removing these lines is faster.
        del input_ids
        del logits
        torch.cuda.empty_cache()

    return probs
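
In the scoring loop above, label token `i` is read from the logits at position `n_total - i`, which does not match the usual next-token alignment. As an alternative formulation (not the code that produced the recorded outputs), the sketch below scores a label by summing the log-probabilities of its tokens, each read from the logits one position earlier. It runs on plain nested lists with made-up numbers; `log_softmax`, `score_label`, and `pick_label` are hypothetical helper names.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def score_label(logits, input_ids, n_prompt):
    """Sum of log P(token | prefix) over the label span input_ids[n_prompt:].

    logits[p - 1] scores the token at position p, so each label token is
    looked up one row earlier. Requires n_prompt >= 1.
    """
    total = 0.0
    for p in range(n_prompt, len(input_ids)):
        total += log_softmax(logits[p - 1])[input_ids[p]]
    return total

def pick_label(scores):
    """Index of the highest-scoring label."""
    return max(range(len(scores)), key=lambda k: scores[k])

# Toy data: 2 prompt tokens followed by 2 label tokens, vocabulary of size 4.
toy_ids = [0, 1, 2, 3]
toy_logits = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],  # row 1 scores the token at position 2
    [0.0, 0.0, 0.0, 2.0],  # row 2 scores the token at position 3
    [0.0, 0.0, 0.0, 0.0],
]
toy_score = score_label(toy_logits, toy_ids, n_prompt=2)
print(toy_score)
print(pick_label([-3.0, -1.0, -2.0]))  # 1
```

Using log-probabilities instead of raw logits keeps scores comparable across positions; labels with different token counts may still need length normalization, which this sketch leaves out.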

✅ The labels must be converted to text correctly before they are passed in.
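
One way to read this requirement is to map each class name to the exact string the model should score. The sketch below is illustrative only: the `LABEL_TEXT` mapping, the spelled-out form of 'Sci/Tech', and the `verbalize` helper are assumptions, not part of the assignment, and whether they change accuracy is untested here.

```python
AG_NEWS_LABELS = ['World', 'Sports', 'Business', 'Sci/Tech']

# Hypothetical spelled-out forms; names without an entry are used unchanged.
LABEL_TEXT = {'Sci/Tech': 'Science and Technology'}

def verbalize(labels, template="{}."):
    """Turn class names into the label strings passed to the classifier."""
    return [template.format(LABEL_TEXT.get(name, name)) for name in labels]

label_texts = verbalize(AG_NEWS_LABELS)
print(label_texts)
# ['World.', 'Sports.', 'Business.', 'Science and Technology.']
```

The list order is preserved, so `np.argmax` over the returned scores still indexes into the original label names.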

In [13]:
import numpy as np

probs = zero_shot_classification(
    "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.",
    "What is the sentence category. 'World', 'Sports', 'Business', 'Sci/Tech': ",
    [f"{label}." for label in ag_news_labels]
  )

print(ag_news_labels[np.argmax(probs)])

Sci/Tech


In [12]:
probs = zero_shot_classification(
    """Never cast models and Playboy bunnies in your films! Bob Fosse's "Star 80" about Dorothy Stratten, of whom Bogdanovich was obsessed enough to have married her SISTER after her murder at the hands of her low-life husband, is a zillion times more interesting than Dorothy herself on the silver screen. Patty Hansen is no actress either..I expected to see some sort of lost masterpiece a la Orson Welles but instead got Audrey Hepburn cavorting in jeans and a god-awful "poodlesque" hair-do....Very disappointing...."Paper Moon" and "The Last Picture Show" I could watch again and again. This clunker I could barely sit through once. This movie was reputedly not released because of the brouhaha surrounding Ms. Stratten's tawdry death; I think the real reason was because it was so bad!""",
    "Is the sentence negative or positive?: ",
    [f"{label}." for label in ["negative", "positive"]]
  )

print(["negative", "positive"][np.argmax(probs)])

negative


In [17]:
from tqdm import tqdm

prompts, label_prompts = [
    "What is the category that the sentence is in the following 'World', 'Sports', 'Business', 'Sci/Tech': ",
    "Which category does the following sentence belong to? Choose from: 'World', 'Sports', 'Business', 'Sci/Tech'.",
    "Determine the category of the sentence from the options: 'World', 'Sports', 'Business', 'Sci/Tech'.",
    "Classify the sentence into one of these categories - 'World', 'Sports', 'Business', 'Sci/Tech'."
], [
    "Answer: ",
    "Answer: ",
    "Category is ",
    "This sentence is about ",
]

✅ The code that computes accuracy on 50 samples from the test split, together with its output, must be kept.

In [31]:
for prompt, label_prompt in zip(prompts, label_prompts):
    n_corrects = 0
    for i in tqdm(range(50)):  # first 50 examples of the test split
        text = tokenized_ag_news['test'][i]['text']
        label = tokenized_ag_news['test'][i]['label']
        probs = zero_shot_classification(
            text,
            prompt,
            # Each label_prompt already ends with a space, so this yields strings such as "Answer: : World."
            labels=[f"{label_prompt}: {name}." for name in ag_news_labels]
        )

        pred = np.argmax(np.array(probs))

        if pred == label:
            n_corrects += 1

    print(prompt, label_prompt)
    print("\n\n ✅✅✅", round(n_corrects / 50 * 100), "%")


100%|██████████| 50/50 [00:42<00:00,  1.17it/s]


What is the category that the sentence is in the following 'World', 'Sports', 'Business', 'Sci/Tech':  Answer: 


 ✅✅✅ 26 %


100%|██████████| 50/50 [00:43<00:00,  1.15it/s]


Which category does the following sentence belong to? Choose from: 'World', 'Sports', 'Business', 'Sci/Tech'. Answer: 


 ✅✅✅ 20 %


100%|██████████| 50/50 [00:43<00:00,  1.16it/s]


Determine the category of the sentence from the options: 'World', 'Sports', 'Business', 'Sci/Tech'. Category is 


 ✅✅✅ 24 %


100%|██████████| 50/50 [00:43<00:00,  1.16it/s]

Classify the sentence into one of these categories - 'World', 'Sports', 'Business', 'Sci/Tech'. This sentence is about 


 ✅✅✅ 22 %
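
When scores sit this close together, it can help to print the exact strings being scored. The sketch below rebuilds them with the same concatenation order as the evaluation loop (task prompt, then text, then the label string). It is a string-level approximation: the notebook tokenizes the prompt and the label separately, so token boundaries may differ. `build_candidates`, the shortened prompt, and the example sentence are hypothetical.

```python
def build_candidates(prompt, text, label_prompt, labels):
    """Approximate the full strings scored for each candidate label."""
    return [prompt + text + f"{label_prompt}: {name}." for name in labels]

labels = ['World', 'Sports', 'Business', 'Sci/Tech']
candidates = build_candidates(
    "Classify the sentence: ",    # shortened, hypothetical prompt
    "Stocks rallied on Friday.",  # made-up example text
    "Answer: ",
    labels,
)
for candidate in candidates:
    print(repr(candidate))
```

Printing with `repr` makes missing separators and doubled punctuation visible, for example the absence of a space between the input text and the label prompt.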





In [34]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B", device_map="auto")
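
`zero_shot_classification` drops the first token of every label with `[:, 1:]`, which is only correct when the tokenizer prepends a BOS token. GPT-Neo's tokenizer is assumed here not to do so by default, in which case that slice would remove the first real label token. A minimal guard on plain id lists, comparing against `tokenizer.bos_token_id` (`strip_leading_bos` is a hypothetical helper and the ids are made up):

```python
def strip_leading_bos(ids, bos_token_id):
    """Remove the first id only when it is the BOS token."""
    if bos_token_id is not None and len(ids) > 0 and ids[0] == bos_token_id:
        return list(ids[1:])
    return list(ids)

# Tokenizer that prepends BOS (id 2 in this toy case): the BOS is removed.
print(strip_leading_bos([2, 651, 888], bos_token_id=2))      # [651, 888]
# Tokenizer that does not prepend BOS: nothing is removed.
print(strip_leading_bos([464, 649], bos_token_id=50256))     # [464, 649]
```

The label token count would then be the length of the returned list rather than `shape[-1] - 1`.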



In [35]:
from datasets import load_dataset

ag_news = load_dataset("fancyzhx/ag_news")

def preprocess_function(examples):
    return tokenizer(examples["text"], max_length=300, truncation=True)

tokenized_ag_news = ag_news.map(preprocess_function, batched=True)


In [36]:
from tqdm import tqdm
import numpy as np


prompt, label_prompt = "What is the category that the sentence is in the following 'World', 'Sports', 'Business', 'Sci/Tech': ", "Answer: "

n_corrects = 0
for i in tqdm(range(50)):
    text = tokenized_ag_news['test'][i]['text']
    label = tokenized_ag_news['test'][i]['label']
    probs = zero_shot_classification(
        text,
        prompt,
        labels=[f"{label_prompt}: {label}." for label in ag_news_labels]
    )

    pred = np.argmax(np.array(probs))

    if pred == label:
        n_corrects += 1

print(prompt, label_prompt)
print("\n\n ✅✅✅", round(n_corrects / 50 * 100), "%")


100%|██████████| 50/50 [00:53<00:00,  1.07s/it]

What is the category that the sentence is in the following 'World', 'Sports', 'Business', 'Sci/Tech':  Answer: 


 ✅✅✅ 32 %





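Switching to a better-performing model looks more effective than rewording the prompt: on these 50 test samples the four prompt variants scored 20 to 26% with the first model, while GPT-Neo 2.7B reached 32% with the first prompt. All of these scores are close to the 25% expected from random guessing over four classes.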
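
With only 50 samples, differences of a few points are hard to separate from noise. As a rough check, the sketch below computes the binomial standard error of an accuracy estimate, assuming independent samples (`accuracy_standard_error` is a hypothetical helper name):

```python
import math

def accuracy_standard_error(accuracy, n):
    """Standard error of an accuracy measured on n independent samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)

# Lowest, first, and highest accuracies recorded above, each on 50 samples.
for acc in (0.20, 0.26, 0.32):
    se = accuracy_standard_error(acc, 50)
    print(f"accuracy {acc:.2f}: standard error {se:.3f}")
```

Each estimate carries a standard error of roughly 6 percentage points, so a larger test subset would be needed to confirm the gap between the two models.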