# HuggingFace Crash Course
Pipelines, Pretrained Models, Fine-Tuning  
Prerequisite: `pip install transformers`, plus a backend (`tensorflow` or `torch`; the imports below use both)

In [1]:
from transformers import pipeline
from transformers import AutoTokenizer
from transformers import TFAutoModelForSequenceClassification # for PyTorch, drop the 'TF' prefix: AutoModelForSequenceClassification
import tensorflow as tf
import torch
import torch.nn.functional as F



## [Pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines)  
A pipeline is an easy way to run a model for inference on tasks such as sentiment analysis, image classification, object detection, and question answering.

In [2]:
classifier = pipeline('sentiment-analysis')

# feeding a sample text through pipeline
result = classifier('We are very happy to show you the HuggingFace Transformers Library.')
print(result)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.9997598528862}]


In [3]:
# feeding a list of texts through the pipeline
X_train = ["We are very happy to show you the HuggingFace Transformers Library.", "We hope you don't hate it."]
results = classifier(X_train)

for result in results:
    print(result)

{'label': 'POSITIVE', 'score': 0.9997598528862}
{'label': 'NEGATIVE', 'score': 0.5308609008789062}
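
The `score` values above are probabilities. As a rough sketch of where they come from, assuming the pipeline applies a softmax over the model's two output logits and reports the larger probability with its label, the calculation looks like this. The logits and the `id2label` mapping below are made-up values for illustration, not numbers taken from the model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical label mapping and logits, chosen for illustration only.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [-4.0, 4.3]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the top class
result = {"label": id2label[best], "score": probs[best]}
print(result)
```

Under that assumption, the 0.53 score for the second sentence means the two classes are nearly tied, so the `NEGATIVE` label there is a weak call.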


### Using a specific model in a pipeline
[AutoModels documentation](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html)  
- for PyTorch, `from transformers import AutoModelForSequenceClassification`
- for TensorFlow, `from transformers import TFAutoModelForSequenceClassification`  

[AutoTokenizer documentation](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#autotokenizer)

In [None]:
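# NOTE: 'bert-base-uncased' has no fine-tuned classification head, so the predictions
# from this pipeline are not meaningful until the model is fine-tuned.
model_name = 'bert-base-uncased'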

model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier1 = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

## Using Model and Tokenizer without pipeline

In [4]:
model_name = 'bert-base-uncased'

model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

token_ids = tokenizer("We are very happy to show you the HuggingFace Transformers Library.")
print(token_ids)
# 101 is [CLS] and 102 is [SEP]: BERT's special tokens marking the start and end of the input
print('Token ID 101 and 102 are the beginning and end of string tokens')

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 17662, 12172, 19081, 3075, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Token ID 101 and 102 are the beginning and end of string tokens
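
To make the structure of that dictionary concrete, here is a minimal sketch that rebuilds it by hand from the inner token IDs. It is not how the tokenizer is implemented. It only assumes a single-sentence input, where `[CLS]` (101) and `[SEP]` (102) wrap the sequence, every `token_type_ids` entry is 0, and every `attention_mask` entry is 1. The helper name `wrap_single_sentence` is hypothetical.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in the bert-base-uncased vocabulary

def wrap_single_sentence(word_piece_ids):
    """Hypothetical helper: add special tokens and build the companion lists."""
    input_ids = [CLS_ID] + list(word_piece_ids) + [SEP_ID]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # one segment, so all zeros
        "attention_mask": [1] * len(input_ids),  # no padding, so all ones
    }

# Inner IDs copied from the printed output above (everything between 101 and 102).
inner_ids = [2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996,
             17662, 12172, 19081, 3075, 1012]
encoded = wrap_single_sentence(inner_ids)
print(encoded)
```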


See the [PreTrainedTokenizer documentation](https://huggingface.co/transformers/v3.0.2/main_classes/tokenizer.html#transformers.PreTrainedTokenizer) for parameter details.  
For the `return_tensors` parameter specifically, see [its entry in the `__call__` reference](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.__call__.return_tensors).  

In [5]:
X_train = ["We are very happy to show you the HuggingFace Transformers Library.", "We hope you don't hate it."]

batch = tokenizer(X_train, padding=True, truncation=True, max_length=512, return_tensors='tf')
print(batch)

{'input_ids': <tf.Tensor: shape=(2, 15), dtype=int32, numpy=
array([[  101,  2057,  2024,  2200,  3407,  2000,  2265,  2017,  1996,
        17662, 12172, 19081,  3075,  1012,   102],
       [  101,  2057,  3246,  2017,  2123,  1005,  1056,  5223,  2009,
         1012,   102,     0,     0,     0,     0]])>, 'token_type_ids': <tf.Tensor: shape=(2, 15), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])>, 'attention_mask': <tf.Tensor: shape=(2, 15), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])>}
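
The effect of `padding=True` in the output above can be sketched in plain Python. This is an illustration, not the library's code. It assumes right-side padding with ID 0 (`[PAD]` for this vocabulary) up to the longest sequence in the batch, and a mask of 1 for real tokens and 0 for padding. The helper name `pad_batch` is hypothetical.

```python
PAD_ID = 0  # [PAD] in the bert-base-uncased vocabulary

def pad_batch(sequences, pad_id=PAD_ID):
    """Hypothetical helper: right-pad every sequence to the longest one."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Unpadded IDs for the two sentences, copied from the output above.
seq_a = [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996,
         17662, 12172, 19081, 3075, 1012, 102]
seq_b = [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102]

padded = pad_batch([seq_a, seq_b])
for row, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(row, mask)
```

The second row gains four trailing zeros and a matching run of zeros in its mask, which is the same pattern as the `(2, 15)` tensors printed by the tokenizer.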
