# HuggingFace Crash Course
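Pipelines, Pretrained Models, Fine-Tuning  
Inspired by this [YouTube tutorial](https://youtu.be/GSt00_-0ncQ), which uses PyTorch; most examples here use the TensorFlow model classes instead.  
Prerequisite: `pip install transformers tensorflow torch datasets` (the cells below import all four)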

In [1]:
from transformers import pipeline
from transformers import AutoTokenizer
from transformers import TFAutoModelForSequenceClassification # for PyTorch, use the same class name without the 'TF' prefix
import tensorflow as tf
import torch
import torch.nn.functional as F



## [Pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines)  
An easy way to use models for inference, e.g. sentiment analysis, image classification, object detection, and question answering.

In [2]:
classifier = pipeline('sentiment-analysis')

# feeding a sample text through pipeline
result = classifier('We are very happy to show you the HuggingFace Transformers Library.')
print(result)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.9997598528862}]


In [3]:
# feeding a list of text through pipeline
X_train = ["We are very happy to show you the HuggingFace Transformers Library.", "We hope you don't hate it."]
results = classifier(X_train)

for result in results:
    print(result)

{'label': 'POSITIVE', 'score': 0.9997598528862}
{'label': 'NEGATIVE', 'score': 0.5308609008789062}
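
The second score, about 0.53, is barely above 0.5, so the model is close to undecided on that sentence. As a minimal sketch, predictions like this can be flagged for review by thresholding the `score` field. The result dictionaries below are copied from the output above; the `flag_uncertain` helper and the 0.6 threshold are illustrative assumptions, not part of the transformers API.

```python
# Pipeline results copied from the output above.
results = [
    {'label': 'POSITIVE', 'score': 0.9997598528862},
    {'label': 'NEGATIVE', 'score': 0.5308609008789062},
]

def flag_uncertain(results, threshold=0.6):
    """Return the indices of predictions whose score is below the threshold."""
    return [i for i, r in enumerate(results) if r['score'] < threshold]

uncertain = flag_uncertain(results)
print(uncertain)  # [1]
```

The threshold is a judgment call; a stricter or looser value may suit other tasks better.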


### Using a specific model in pipeline
[AutoModels documentation](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html)  
- for PyTorch, `from transformers import AutoModelForSequenceClassification`
- for TensorFlow, `from transformers import TFAutoModelForSequenceClassification`  

[AutoTokenizer documentation](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#autotokenizer)

In [None]:
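# NOTE: 'bert-base-uncased' is a base checkpoint without a fine-tuned classification head,
# so its sentiment predictions are not meaningful until the model is fine-tuned.
model_name = 'bert-base-uncased'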

model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier1 = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
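
The cell above uses the TensorFlow class. For PyTorch, the equivalent setup would look roughly like the sketch below. `build_pytorch_classifier` is a hypothetical helper name, and the imports sit inside the function so that nothing is loaded or downloaded until it is called. It assumes both `transformers` and `torch` are installed.

```python
def build_pytorch_classifier(model_name):
    """Build a sentiment-analysis pipeline backed by a PyTorch model."""
    # Imported lazily so that defining the helper has no side effects.
    from transformers import (
        AutoModelForSequenceClassification,  # PyTorch class: no 'TF' prefix
        AutoTokenizer,
        pipeline,
    )

    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
```

Calling `build_pytorch_classifier('distilbert-base-uncased-finetuned-sst-2-english')` should return an object that is used the same way as `classifier` earlier in this notebook.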

## Using Model and Tokenizer without pipeline
Note: you'll get the same labels with or without a pipeline (the scores may differ slightly, as the outputs in this notebook show)!  
Using a pipeline is recommended unless you want to fine-tune the model.

In [17]:
model_name = 'distilbert-base-uncased-finetuned-sst-2-english' # 'bert-base-uncased'

model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

token_ids = tokenizer("We are very happy to show you the HuggingFace Transformers Library.")
print(token_ids)
# for this BERT-style vocabulary, 101 is the [CLS] token and 102 is the [SEP] token
print('In input_ids, tokens 101 and 102 are the beginning and end of string tokens')

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_57']
You should probably TRAIN this model on a down-stream task

{'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 17662, 12172, 19081, 3075, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
In input_ids, tokens 101 and 102 are the beginning and end of string tokens


See the [PreTrainedTokenizer documentation](https://huggingface.co/transformers/v3.0.2/main_classes/tokenizer.html#transformers.PreTrainedTokenizer) for parameter details.  
For the `return_tensors` parameter specifically, see the [`__call__` documentation](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.__call__.return_tensors).  

In [18]:
X_train = ["We are very happy to show you the HuggingFace Transformers Library.", "We hope you don't hate it."]

batch = tokenizer(X_train, padding=True, truncation=True, max_length=512, return_tensors='tf')
print(batch)

{'input_ids': <tf.Tensor: shape=(2, 15), dtype=int32, numpy=
array([[  101,  2057,  2024,  2200,  3407,  2000,  2265,  2017,  1996,
        17662, 12172, 19081,  3075,  1012,   102],
       [  101,  2057,  3246,  2017,  2123,  1005,  1056,  5223,  2009,
         1012,   102,     0,     0,     0,     0]])>, 'attention_mask': <tf.Tensor: shape=(2, 15), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])>}
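
Conceptually, `padding=True` pads every sequence to the length of the longest one in the batch, and the attention mask marks real tokens with 1 and padding with 0. Below is a rough pure-Python sketch of that idea, using the token ids printed above. `pad_batch` is a hypothetical helper for illustration, not the tokenizer's actual implementation, and the pad id 0 is read off the output above.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to a common length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Unpadded token ids for the two sentences, taken from the output above.
sequences = [
    [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 17662, 12172, 19081, 3075, 1012, 102],
    [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102],
]

input_ids, attention_mask = pad_batch(sequences)
print(input_ids[1])
print(attention_mask[1])
```

The shorter sentence gains four trailing pad ids, and its mask ends in four zeros so the model can ignore those positions.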


In [19]:
outputs = model(batch)
predictions = tf.nn.softmax(outputs.logits, axis=-1)
labels = tf.argmax(predictions, axis=-1)
labels = [model.config.id2label[label_id] for label_id in labels.numpy()] # convert from label_id to class name

print(outputs)
print(predictions)
print(labels)

TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.058453  ,  4.2757363 ],
       [ 0.07809059, -0.03831829]], dtype=float32)>, hidden_states=None, attentions=None)
tf.Tensor(
[[2.4010611e-04 9.9975985e-01]
 [5.2906942e-01 4.7093061e-01]], shape=(2, 2), dtype=float32)
['POSITIVE', 'NEGATIVE']
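
The softmax and argmax steps can be checked by hand. The sketch below recomputes the probabilities in plain Python from the logits printed above. The `id2label` mapping is an assumption inferred from that output (index 1 maps to POSITIVE, index 0 to NEGATIVE); in the notebook it comes from `model.config.id2label`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the model output above, one row per sentence.
logits = [[-4.058453, 4.2757363], [0.07809059, -0.03831829]]

# Assumed mapping, inferred from the printed labels above.
id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}

probs = [softmax(row) for row in logits]
labels = [id2label[row.index(max(row))] for row in probs]
print(probs)
print(labels)  # ['POSITIVE', 'NEGATIVE']
```

The second row shows why the second sentence scores so low: its two logits are almost equal, so the probabilities land near 0.5.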


## Saving and Reloading Saved Models and Tokenizers

In [None]:
# save model and tokenizer (esp. after finetuning)
save_directory = 'saved'
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

# reload the saved tokenizer and model by passing the directory instead of a model name
tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = TFAutoModelForSequenceClassification.from_pretrained(save_directory)

## [Model Hub](https://huggingface.co/models)
Use any model uploaded by the community.  
To use a model, copy its name from the model page (e.g. `oliverguhr/german-sentiment-bert`) and pass it to `from_pretrained`.

In [20]:
model_name = 'oliverguhr/german-sentiment-bert'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name)

X_train_german = ["Mit keinem guten Ergebnis", "Das war unfair", # negative
            "nicht so schlecht wie erwartet", "Das war gut!", # positive
            "Sie fahrt ein grunes Auto"] # neutral

batch = tokenizer(X_train_german, padding=True, truncation=True, max_length=512, return_tensors='tf')
outputs = model(batch)
predictions = tf.nn.softmax(outputs.logits, axis=-1)
labels = tf.argmax(predictions, axis=-1)
labels = [model.config.id2label[label_id] for label_id in labels.numpy()]

print(labels)
print('Correct answer should be:')
print('[negative, negative, positive, positive, neutral]')

Some layers from the model checkpoint at oliverguhr/german-sentiment-bert were not used when initializing TFBertForSequenceClassification: ['dropout_37']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSeque

['negative', 'negative', 'positive', 'positive', 'neutral']
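
Rather than comparing the two lists by eye, the check can be automated. This is a minimal sketch with both lists copied from the cell and output above; the variable names are illustrative.

```python
# Labels printed by the model above.
predicted = ['negative', 'negative', 'positive', 'positive', 'neutral']
# Expected labels from the comments in the cell above.
expected = ['negative', 'negative', 'positive', 'positive', 'neutral']

# Compare element by element and compute the fraction that match.
matches = [p == e for p, e in zip(predicted, expected)]
accuracy = sum(matches) / len(expected)
print(accuracy)  # 1.0
```

Five sentences are far too few to judge a model, so this only confirms that the example ran as intended.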


## Challenge: Implement the model J2s used for TIL NLP
`harshit345/xlsr-wav2vec-speech-emotion-recognition`

In [None]:
# NOTE: this starter snippet loads the facebook/wav2vec2-base-960h speech-recognition
# checkpoint, not the emotion-recognition model named above.
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# load dummy dataset and read soundfiles
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# convert the first audio sample into model input values
input_values = processor(ds[0]["audio"]["array"], return_tensors="pt", padding="longest").input_values  # Batch size 1

# retrieve logits
logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)

