<a href="https://colab.research.google.com/github/meizee/transformers-course/blob/main/Chapter2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Behind The Pipeline**

In [1]:
!pip install transformers[sentencepiece]


In [2]:
import transformers

In [3]:
import sentencepiece

### Preprocessing with a tokenizer

In the pipeline, preprocessing is handled by a tokenizer. The tokenizer is responsible for:
*   Splitting the input into words, subwords, or symbols (such as punctuation), which are called tokens
*   Mapping each token to an integer
*   Adding extra inputs that may be useful to the model
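
The three steps above can be sketched with a toy tokenizer. This is only an illustration: `TOY_VOCAB`, `toy_tokenize`, and `toy_encode` are hypothetical names, the vocabulary is made up, and the real checkpoint uses its own much larger subword vocabulary rather than a simple regex split.

```python
import re

# Hypothetical toy vocabulary; a real checkpoint ships its own, far larger one.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "like": 5, "this": 6, "song": 7, "!": 8}


def toy_tokenize(text):
    """Lowercase the text, then split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def toy_encode(text):
    """Tokenize, add special tokens, and map every token to an integer."""
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    # Tokens missing from the vocabulary fall back to the [UNK] id.
    return [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]


print(toy_tokenize("I like this song!"))
print(toy_encode("I like this song!"))
```

The special tokens added here play the role of the "extra inputs" in the last bullet: they mark where a sequence starts and ends.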

In [4]:
from transformers import AutoTokenizer

In [5]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


In [6]:
raw_inputs = [
    "Happy new year! Cheers to more shared successes in the new year!",
    "I don't like this song",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="tf")
print(inputs)

{'input_ids': <tf.Tensor: shape=(2, 16), dtype=int32, numpy=
array([[  101,  3407,  2047,  2095,   999, 21250,  2000,  2062,  4207,
        14152,  1999,  1996,  2047,  2095,   999,   102],
       [  101,  1045,  2123,  1005,  1056,  2066,  2023,  2299,   102,
            0,     0,     0,     0,     0,     0,     0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 16), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>}



The output is a dictionary with two keys. The first, `input_ids`, contains two rows of integers (one row per input sentence); each integer is the unique identifier of a token in that sentence. The second, `attention_mask`, has the same shape and marks real tokens with 1 and padding positions with 0.
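
To make the effect of `padding=True` concrete, here is a minimal pure-Python sketch that pads a batch to its longest sequence and builds the matching mask. `pad_batch` is a hypothetical helper, not part of 🤗 Transformers, and `pad_id=0` is assumed from the zeros visible in the output above.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded token IDs copied from the tokenizer output above.
first = [101, 3407, 2047, 2095, 999, 21250, 2000, 2062, 4207,
         14152, 1999, 1996, 2047, 2095, 999, 102]
second = [101, 1045, 2123, 1005, 1056, 2066, 2023, 2299, 102]

batch = pad_batch([first, second])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The shorter sentence has 9 tokens, so it receives 7 padding IDs and 7 zeros in its mask to reach the length of the longer one (16).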

### Going through the model

In [7]:
from transformers import TFAutoModel

In [8]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = TFAutoModel.from_pretrained(checkpoint)

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertModel: ['classifier', 'dropout_19', 'pre_classifier']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


This architecture contains only the base Transformer module: given some inputs, it outputs features (in other words, it acts as a feature extractor). For each model input, we get a high-dimensional vector that represents the Transformer model's contextual understanding of that input.

**A high-dimensional vector?**

The vector output by the Transformer module is usually large. It generally has three dimensions:

1. Batch size: The number of sequences processed at a time (2 in our example).
2. Sequence length: The length of the numerical representation of the sequence (16 in our example).
3. Hidden size: The vector dimension of each model input (768 in our example).
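
As a rough illustration of how these three dimensions nest, the sketch below builds a stand-in for the model output out of plain Python lists. `toy_hidden_states` and `shape` are hypothetical helpers, and the values are random numbers rather than anything a real model would produce; only the layout matters here.

```python
import random


def toy_hidden_states(input_ids, hidden_size=768, seed=0):
    """Return one random vector of length hidden_size per token (stand-in for a model)."""
    rng = random.Random(seed)
    return [[[rng.random() for _ in range(hidden_size)] for _ in seq]
            for seq in input_ids]


def shape(nested):
    """Walk down the first element of each level to read off the dimensions."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


batch = [[0] * 16, [0] * 16]  # 2 sequences, each padded to 16 token IDs
states = toy_hidden_states(batch)
print(shape(states))  # (batch size, sequence length, hidden size)
```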

In [9]:
outputs = model(inputs)
print(outputs.last_hidden_state.shape)

(2, 16, 768)


### Model heads: Making sense out of numbers

Model heads take the high-dimensional vectors of hidden states as input and project them onto a different dimension. They are usually composed of one or a few linear layers:

![image.png](https://huggingface.co/course/static/chapter2/transformer_and_head.png)

In the diagram above, the model is represented by its embeddings layer and the subsequent layers. The embeddings layer converts each input ID in the tokenized input into a vector that represents the associated token. The subsequent layers manipulate those vectors using the attention mechanism to produce the final representation of the sentence.

Many different architectures are available in 🤗 Transformers, each designed to tackle a specific task.

This example uses a model with a sequence classification head (to classify each input as positive or negative), namely `TFAutoModelForSequenceClassification`:

In [10]:
from transformers import TFAutoModelForSequenceClassification

In [11]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(inputs)

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_38']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The model head takes the high-dimensional vectors from before as input and outputs vectors containing two values (one per label):

In [12]:
print(outputs.logits.shape)

(2, 2)


Because we passed in just two sentences and there are two labels (positive and negative), the output has shape 2 x 2.
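
To see where that shape comes from, here is a minimal sketch of a single linear layer acting as a classification head. It assumes one feature vector per sentence, and `linear_head`, the hidden size of 3, and all the numbers are made up for illustration. The real head of this checkpoint contains more than one layer, as the `pre_classifier` and `classifier` names in the loading message above suggest.

```python
def linear_head(features, weights, bias):
    """Project each feature vector (hidden_size) onto num_labels scores."""
    logits = []
    for vec in features:
        # One dot product per label, plus that label's bias.
        row = [sum(x * w for x, w in zip(vec, label_weights)) + b
               for label_weights, b in zip(weights, bias)]
        logits.append(row)
    return logits


# Hypothetical numbers: 2 sentences, hidden size 3, 2 labels.
features = [[1.0, 0.0, 2.0],
            [0.5, 1.0, 0.0]]
weights = [[0.1, 0.2, 0.3],    # weights for label 0
           [-0.1, 0.4, 0.0]]   # weights for label 1
bias = [0.0, 0.5]

logits = linear_head(features, weights, bias)
print(len(logits), len(logits[0]))  # number of sentences, number of labels
```

Whatever the hidden size is, the output always has one row per sentence and one column per label.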

### Postprocessing the output

In [13]:
print(outputs.logits)

tf.Tensor(
[[-4.3278794  4.719447 ]
 [ 2.359575  -1.9958891]], shape=(2, 2), dtype=float32)


In [14]:
import tensorflow as tf

In [15]:
predictions = tf.math.softmax(outputs.logits, axis=-1)
print(predictions)

tf.Tensor(
[[1.17691394e-04 9.99882340e-01]
 [9.87326205e-01 1.26737952e-02]], shape=(2, 2), dtype=float32)


In [16]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

The model's predictions are therefore:

**First sentence**: NEGATIVE: 0.0001, **POSITIVE: 0.9999**

**Second sentence**: **NEGATIVE: 0.9873**, POSITIVE: 0.0127
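
The postprocessing step can also be reproduced without TensorFlow. The sketch below is a plain-Python softmax applied to the logits printed earlier, followed by a lookup in the label mapping. The `softmax` function here is a hand-written stand-in for `tf.math.softmax`, not the library implementation.

```python
import math


def softmax(row):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(row)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits and label mapping copied from the cells above.
logits = [[-4.3278794, 4.719447],
          [2.359575, -1.9958891]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

for row in logits:
    probs = softmax(row)
    # Pick the index of the highest probability and translate it to a label.
    best = max(range(len(probs)), key=probs.__getitem__)
    print(id2label[best], round(probs[best], 4))
```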