# A Quick Tour of Hugging Face

The goal of this notebook is to act as a "quick start" guide to help everyone get up and running with BERT (and all the other open-source Transformer models) using Hugging Face's `transformers` library!

## What is this Library?

Hugging Face's `transformers` package is a comprehensive open-source (Apache 2.0 License) library that makes it easy to use over 30 different cutting-edge transformer-based models, such as BERT and GPT-2. The library supports both TensorFlow 2 and PyTorch, and has a simple, Keras-like interface. Retraining and fine-tuning models typically take only a few lines of code, and Hugging Face has created great [documentation](https://github.com/huggingface/transformers) that includes starter code and tutorials.

## Installation

To run the sample code in this notebook, you'll first need to install the `transformers` library. To do so, uncomment and run the cell below, or see Hugging Face's [full installation guide](https://github.com/huggingface/transformers#installation).

**_NOTE:_** We strongly recommend using a virtual environment! 

In [1]:
# !pip install transformers
# !pip install tensorflow_datasets

## Example 1: Sequence Classification

In the following cell, we:

**1.** Import everything from the `transformers` library, as well as `tensorflow` and the `tensorflow_datasets` library. 
<br>
<br>
**2.** Next, we load a BERT tokenizer. Remember, different models have different tokenizers! The model we'll be using is `'bert-base-cased'`.  For a full list of available models, see [this list](https://github.com/huggingface/transformers#quick-tour).
<br>
<br>
**3.** Similarly, we load the model. As with the tokenizer, we pass in a string that references the model we want.
<br>
<br>
**4.** Finally, we load the dataset. For this example, we'll be using `"glue/mrpc"` ([the Microsoft Research Paraphrase Corpus](https://www.microsoft.com/en-us/download/details.aspx?id=52398)), which we'll instantiate using the `tensorflow_datasets` package. 

In [2]:
import tensorflow as tf
from transformers import * 
import tensorflow_datasets
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased')
dataset = tensorflow_datasets.load('glue/mrpc')

INFO:absl:Overwrite dataset info from restored data version.
INFO:absl:Reusing dataset glue (/home/mikekane00/tensorflow_datasets/glue/mrpc/1.0.0)
INFO:absl:Constructing tf.data.Dataset for split None, from /home/mikekane00/tensorflow_datasets/glue/mrpc/1.0.0


Next, we use some preprocessing tools from the `transformers` package to prepare our dataset. Note that the tools work seamlessly with a TensorFlow `Dataset` object.  

In [3]:
train_dataset = glue_convert_examples_to_features(dataset['train'], tokenizer, max_length=128, task='mrpc')
valid_dataset = glue_convert_examples_to_features(dataset['validation'], tokenizer, max_length=128, task='mrpc')
train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)
valid_dataset = valid_dataset.batch(64)
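
For intuition about what the preprocessing step produces, the sketch below builds the three per-example inputs that BERT expects for a sentence pair: `input_ids`, `attention_mask`, and `token_type_ids`, padded to a fixed length. It is not the library's implementation: the whitespace tokenizer, the tiny vocabulary, and the helper name `encode_pair` are hypothetical stand-ins (the real tokenizer uses WordPiece and its own vocabulary ids).

```python
# Illustrative only: a toy version of BERT-style sentence-pair encoding.
# The vocabulary and whitespace tokenizer are hypothetical stand-ins.

def encode_pair(sentence_a, sentence_b, vocab, max_length):
    """Encode two sentences as [CLS] A [SEP] B [SEP], padded to max_length."""
    tokens_a = sentence_a.split()
    tokens_b = sentence_b.split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_length:
        raise ValueError("pair is longer than max_length; real tokenizers truncate instead")

    # Segment ids: 0 for [CLS], sentence A, and the first [SEP];
    # 1 for sentence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    attention_mask = [1] * len(tokens)

    # Pad every field to the same fixed length so examples can be batched.
    pad = max_length - len(tokens)
    input_ids += [vocab["[PAD]"]] * pad
    attention_mask += [0] * pad
    token_type_ids += [0] * pad
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }


vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "Mark": 4, "went": 5, "home": 6, "left": 7}
features = encode_pair("Mark went home", "Mark left", vocab, max_length=10)
print(features["input_ids"])       # [2, 4, 5, 6, 3, 4, 7, 3, 0, 0]
print(features["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
print(features["token_type_ids"])  # [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
```

The attention mask tells the model which positions are real tokens and which are padding, while the segment ids let it distinguish the first sentence from the second.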

This next part is pure TensorFlow. We compile and train our model using Keras syntax.

In [4]:
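# The model returns raw logits, so the loss needs from_logits=True; a small learning rate is typical for fine-tuning.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5), loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])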
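
The loss used here compares an integer class label with the model's prediction for each example. As a minimal sketch in plain Python (the function names are illustrative, not Keras internals), and assuming the classification head returns unnormalized logits rather than probabilities, sparse categorical cross-entropy amounts to a softmax followed by the negative log of the true class's probability:

```python
import math

def softmax(logits):
    """Turn unnormalized scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(label, logits):
    """Negative log-probability of the integer class `label`."""
    return -math.log(softmax(logits)[label])

# Two classes, as in MRPC: 0 = not a paraphrase, 1 = paraphrase.
undecided = sparse_categorical_crossentropy(1, [0.0, 0.0])
confident_right = sparse_categorical_crossentropy(1, [-2.0, 3.0])
confident_wrong = sparse_categorical_crossentropy(0, [-2.0, 3.0])

print(round(undecided, 4))                            # 0.6931, i.e. ln(2)
print(confident_right < undecided < confident_wrong)  # True
```

In Keras, this behavior corresponds to `tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)`; with the default `from_logits=False`, the loss expects inputs that are already probabilities.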

In [5]:
history = model.fit(train_dataset, epochs=2, steps_per_epoch=115, validation_data=valid_dataset, validation_steps=7)

Train for 115 steps, validate for 7 steps
Epoch 1/2
Epoch 2/2
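
The `steps_per_epoch=115` and `validation_steps=7` values follow from the dataset and batch sizes. Here is a quick check, assuming the commonly reported MRPC split sizes of 3,668 training and 408 validation examples (these numbers are not stated in this notebook) and that the final partial batch is kept:

```python
import math

# Assumed GLUE/MRPC split sizes; not stated in this notebook.
NUM_TRAIN_EXAMPLES = 3668
NUM_VALID_EXAMPLES = 408

def batches_per_pass(num_examples, batch_size):
    """Number of batches in one pass when the final partial batch is kept."""
    return math.ceil(num_examples / batch_size)

steps_per_epoch = batches_per_pass(NUM_TRAIN_EXAMPLES, batch_size=32)
validation_steps = batches_per_pass(NUM_VALID_EXAMPLES, batch_size=64)

# .repeat(2) makes the training dataset yield two passes' worth of batches,
# which is what two epochs of `steps_per_epoch` steps consume.
total_train_batches = 2 * steps_per_epoch

print(steps_per_epoch)      # 115
print(validation_steps)     # 7
print(total_train_batches)  # 230
```

If the batch size or the number of repeats changes, these step counts need to be recomputed to match.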


We can easily save our trained model with `save_pretrained`. The saved checkpoint is not tied to TensorFlow: passing `from_tf=True` to `from_pretrained` converts the weights when loading them into a PyTorch model. We'll reload our saved model as a PyTorch model and then run inference on a couple of sentence pairs to check that it works.

In [9]:
model.save_pretrained('./save/')
pytorch_model = BertForSequenceClassification.from_pretrained('./save/', from_tf=True)

The final cell is standard PyTorch, which we'll use for inference. Note that PyTorch models have two separate modes: training and evaluation. A newly constructed model is in training mode by default, but a model loaded with `from_pretrained` is put in evaluation mode.

In [12]:
sentence_0 = "Mark and his friends went to the movies."
sentence_1 = "Mark and Tim went to a movie."
sentence_2 = "His findings were not compatible with this research."
inputs_1 = tokenizer.encode_plus(sentence_0, sentence_1, add_special_tokens=True, return_tensors='pt')
inputs_2 = tokenizer.encode_plus(sentence_0, sentence_2, add_special_tokens=True, return_tensors='pt')

pred_1 = pytorch_model(inputs_1['input_ids'], token_type_ids=inputs_1['token_type_ids'])[0].argmax().item()
pred_2 = pytorch_model(inputs_2['input_ids'], token_type_ids=inputs_2['token_type_ids'])[0].argmax().item()

print("sentence_1 is", "a paraphrase" if pred_1 else "not a paraphrase", "of sentence_0")
print("sentence_2 is", "a paraphrase" if pred_2 else "not a paraphrase", "of sentence_0")

sentence_1 is not a paraphrase of sentence_0
sentence_2 is not a paraphrase of sentence_0
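
One caveat about the prediction code above: calling `.argmax()` with no arguments returns an index into the flattened tensor, which equals the class id only when a single example is scored at a time. The sketch below uses plain Python lists as a stand-in for a `(batch, num_classes)` logits tensor to show the difference; the helper names and the logit values are illustrative.

```python
def flat_argmax(rows):
    """Index of the largest value after flattening, like argmax with no dim."""
    flat = [value for row in rows for value in row]
    return max(range(len(flat)), key=flat.__getitem__)

def rowwise_argmax(rows):
    """Index of the largest value within each row, like argmax over the last dim."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]

single = [[-1.2, 0.7]]             # one sentence pair, two classes
batch = [[-1.2, 0.7], [0.3, 2.5]]  # two sentence pairs

print(flat_argmax(single))    # 1: a valid class id
print(flat_argmax(batch))     # 3: a flat index, not a class id
print(rowwise_argmax(batch))  # [1, 1]: one class id per pair
```

With a real tensor of batched logits, `logits.argmax(dim=-1)` gives the per-example form.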
