# Learning the Basics of Hugging Face Transformers

* https://huggingface.co/transformers/quicktour.html

## Setting the model cache directory

In [3]:
import os

# Set the cache location before importing transformers so the library picks it up.
os.environ['TRANSFORMERS_CACHE'] = '/data/huggingface/transformers'
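
The same idea can be sketched as a small helper. The name `configure_hf_cache` and the temporary directory are assumptions made for illustration. As far as I know, newer releases of the library prefer the `HF_HOME` variable, so the sketch sets both; it has to run before `transformers` is imported.

```python
import os
import tempfile


def configure_hf_cache(cache_root):
    """Point the Hugging Face caches at cache_root (hypothetical helper)."""
    transformers_dir = os.path.join(cache_root, "transformers")
    os.makedirs(transformers_dir, exist_ok=True)
    # Variable used in this notebook.
    os.environ["TRANSFORMERS_CACHE"] = transformers_dir
    # Assumption: newer library releases read HF_HOME instead.
    os.environ["HF_HOME"] = cache_root
    return transformers_dir


# A temporary directory stands in for '/data/huggingface' here.
cache_dir = configure_hf_cache(tempfile.mkdtemp())
print(cache_dir)
```

A per-call alternative is the `cache_dir` argument of `from_pretrained`, which leaves the environment untouched.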

## Loading a pretrained model for a specific task with pipeline and running predictions

In [4]:
from transformers import pipeline

classifier = pipeline('sentiment-analysis')
results = classifier(["We are very happy to show you the 🤗 Transformers library.",
                      "We hope you don't hate it."])

for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309


## Specifying a particular model

In [5]:
classifier = pipeline('sentiment-analysis', 
                      model="nlptown/bert-base-multilingual-uncased-sentiment")

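
The cell above only constructs the classifier. As far as I know, this checkpoint labels text from `1 star` to `5 stars` rather than POSITIVE/NEGATIVE. Under that assumption, a small helper (the name `stars_from_label` is hypothetical) can turn a result into a number. The sample dictionary is made up to show the shape of a pipeline result; it is not output from a real run.

```python
def stars_from_label(label):
    """Parse a label such as '4 stars' or '1 star' into an int."""
    count, _, unit = label.partition(" ")
    if unit not in ("star", "stars"):
        raise ValueError(f"unexpected label format: {label!r}")
    return int(count)


# Made-up result in the {'label': ..., 'score': ...} shape that pipeline returns.
example_result = {"label": "5 stars", "score": 0.5}
print(stars_from_label(example_result["label"]))
```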

## Under the hood: pretrained models

* Load the tokenizer and classifier for a specific task/model and use them directly.

In [12]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# Load the tokenizer and the TensorFlow model from the same checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_name)
tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name)

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_95']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some layers 

In [13]:
# using tokenizer

inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")
inputs

{'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
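
To make the ids above easier to read: in the BERT uncased vocabulary used by this checkpoint, `101` is `[CLS]`, `102` is `[SEP]`, `100` is `[UNK]` and `0` is `[PAD]`. The sketch below marks those ids in the output above; the helper name `mark_special` is hypothetical. The `100` at position 9 is the emoji, which is not in the vocabulary.

```python
# Special-token ids of the BERT uncased vocabulary.
SPECIAL_TOKENS = {0: "[PAD]", 100: "[UNK]", 101: "[CLS]", 102: "[SEP]"}


def mark_special(input_ids):
    """Replace special-token ids with their names; keep other ids as strings."""
    return [SPECIAL_TOKENS.get(token_id, str(token_id)) for token_id in input_ids]


input_ids = [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100,
             19081, 3075, 1012, 102]
marked = mark_special(input_ids)
print(marked)
```

With the real tokenizer, `tokenizer.convert_ids_to_tokens(input_ids)` returns the full token strings.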

## Getting a preprocessed TensorFlow batch

In [14]:
tf_batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="tf"
)

for key, value in tf_batch.items():
    print(f"{key}: {value.numpy().tolist()}")

input_ids: [[101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]
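
What `padding=True` does to this batch can be sketched in plain Python: the shorter sequence is right-padded with the `[PAD]` id (`0`) up to the longest sequence, and the attention mask marks real tokens with `1` and padding with `0`. The helper name `pad_batch` is hypothetical; this is an illustration of the behavior seen above, not the tokenizer's actual implementation.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad to the longest sequence and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token ids of the two sentences, taken from the output above without padding.
first = [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100,
         19081, 3075, 1012, 102]
second = [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102]

batch = pad_batch([first, second])
for key, value in batch.items():
    print(f"{key}: {value}")
```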


## Predicting on tf_batch with tf_model

In [20]:
import tensorflow as tf

# Passing labels makes the model also return a per-example loss.
tf_outputs = tf_model(tf_batch, labels=tf.constant([1, 0]))
print(tf_outputs)

TFSequenceClassifierOutput(loss=<tf.Tensor: shape=(2,), dtype=float32, numpy=array([2.2051287e-04, 6.3325751e-01], dtype=float32)>, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.0832963 ,  4.336415  ],
       [ 0.08181015, -0.04178578]], dtype=float32)>, hidden_states=None, attentions=None)
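
The `loss` values above are consistent with a per-example cross-entropy between the logits and the labels `[1, 0]`. A minimal plain-Python check, using the printed logits (the helper name `cross_entropy` is hypothetical):

```python
import math


def cross_entropy(logits, label):
    """Log-sum-exp of the logits minus the logit of the true class."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[label]


# Logits and labels copied from the cell above.
logits = [[-4.0832963, 4.336415], [0.08181015, -0.04178578]]
labels = [1, 0]

losses = [cross_entropy(row, label) for row, label in zip(logits, labels)]
print(losses)
```

The results agree with the recorded `2.2051287e-04` and `6.3325751e-01` up to float32 rounding of the printed logits.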


In [22]:
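# With labels passed, tf_outputs[0] is the loss, so select the logits by name.
tf_predictions = tf.nn.softmax(tf_outputs.logits, axis=-1)
tf_predictions

Applied to the logits printed above, this gives a tensor of shape (2, 2), approximately [[0.0002, 0.9998], [0.5309, 0.4691]], which matches the pipeline scores (POSITIVE 0.9998, NEGATIVE 0.5309).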
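
How the pipeline's label and score relate to the logits can be sketched without TensorFlow: take a softmax over each row of logits, pick the most probable class, and report its probability. The label order in `ID2LABEL` is an assumption about this SST-2 checkpoint, and the helper names are hypothetical.

```python
import math

# Assumed label order for distilbert-base-uncased-finetuned-sst-2-english.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(row):
    """Return (label, score) for the most probable class."""
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], round(probs[best], 4)


# Logits copied from the tf_outputs printed earlier.
logits = [[-4.0832963, 4.336415], [0.08181015, -0.04178578]]
predictions = [to_prediction(row) for row in logits]
print(predictions)
```

This reproduces the labels and scores printed by the `sentiment-analysis` pipeline at the start of the notebook.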

## Using DistilBERT

In [24]:
from transformers import DistilBertTokenizer, TFDistilBertForQuestionAnswering
import tensorflow as tf

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = TFDistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased')

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForQuestionAnswering: ['activation_13', 'vocab_transform', 'vocab_layer_norm', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs', 'dropout_135']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [31]:
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
input_dict = tokenizer(question, text, return_tensors='tf')
input_dict

{'input_ids': <tf.Tensor: shape=(1, 14), dtype=int32, numpy=
array([[  101,  2040,  2001,  3958, 27227,  1029,   102,  3958, 27227,
         2001,  1037,  3835, 13997,   102]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 14), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}

In [30]:
outputs = model(input_dict)
outputs

TFQuestionAnsweringModelOutput(loss=None, start_logits=<tf.Tensor: shape=(1, 14), dtype=float32, numpy=
array([[-0.09447779, -0.03184513, -0.06706214,  0.01800456,  0.09351401,
        -0.02026102, -0.27730948,  0.04466629,  0.12293912, -0.02050541,
         0.10064174,  0.0583155 , -0.19937657, -0.27738905]],
      dtype=float32)>, end_logits=<tf.Tensor: shape=(1, 14), dtype=float32, numpy=
array([[-0.06986502, -0.38322595, -0.16931126,  0.24886736,  0.15729505,
        -0.2968017 ,  0.18457553,  0.2881911 ,  0.19785365, -0.23467627,
        -0.12396736,  0.04690195,  0.07048073,  0.18457395]],
      dtype=float32)>, hidden_states=None, attentions=None)

In [33]:
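start_logits = outputs.start_logits
end_logits = outputs.end_logits
all_tokens = tokenizer.convert_ids_to_tokens(input_dict["input_ids"].numpy()[0])

# The QA head is newly initialized (see the warning above), so the span is not meaningful;
# it is empty whenever the most likely end position precedes the most likely start position.
start_index = int(tf.math.argmax(start_logits, 1)[0])
end_index = int(tf.math.argmax(end_logits, 1)[0])
answer = ' '.join(all_tokens[start_index : end_index + 1])
answer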

''
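
The empty string follows from taking the two argmax positions independently: with the logits above, the best start position is 8 and the best end position is 7, so the slice is empty. One common remedy, sketched here in plain Python, is to search for the highest-scoring pair with start <= end inside the context tokens. The helper names are hypothetical, and the context range 7..12 is read off the `input_ids` above (the `[SEP]` id `102` sits at positions 6 and 13).

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def best_span(start_logits, end_logits, first, last):
    """Best (start, end) with first <= start <= end <= last by summed logits."""
    best_pair, best_score = None, float("-inf")
    for start in range(first, last + 1):
        for end in range(start, last + 1):
            score = start_logits[start] + end_logits[end]
            if score > best_score:
                best_pair, best_score = (start, end), score
    return best_pair


# Logits copied from the model output above.
start_logits = [-0.09447779, -0.03184513, -0.06706214, 0.01800456, 0.09351401,
                -0.02026102, -0.27730948, 0.04466629, 0.12293912, -0.02050541,
                0.10064174, 0.0583155, -0.19937657, -0.27738905]
end_logits = [-0.06986502, -0.38322595, -0.16931126, 0.24886736, 0.15729505,
              -0.2968017, 0.18457553, 0.2881911, 0.19785365, -0.23467627,
              -0.12396736, 0.04690195, 0.07048073, 0.18457395]

naive = (argmax(start_logits), argmax(end_logits))
span = best_span(start_logits, end_logits, first=7, last=12)
print(naive, span)
```

The constrained search always returns a non-empty span, but the answer only becomes meaningful with a checkpoint fine-tuned for question answering, since the `qa_outputs` layer here is untrained.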