# Transfer learning using Transformers 

## Using transformers for feature extraction

Hugging Face is a company that provides a wide range of pre-trained models and tools for natural language processing (NLP). Its transformers library gives access to thousands of pre-trained models for text tasks such as classification, information extraction, summarization, translation, and text generation. The library aims to make state-of-the-art NLP accessible to everyone and is widely used in the community.

**Finding Other Hugging Face Models**:
You can find all pre-trained models provided by Hugging Face on their [model hub](https://huggingface.co/models). You can filter these models by task, language, and more. For each model, you will find usage details, including how to load the model using the transformers library.



1. **CLS and SEP Tokens**:
In BERT-style models (including DistilBert), special tokens provide boundary and classification information. The "[CLS]" token is added at the beginning of every input sequence, and its final hidden state is used as an aggregate representation for classification tasks. The "[SEP]" token marks the end of a segment; it separates the two sentences in tasks that take a sentence pair as input (like question answering or natural language inference).

2. **Sentence Representation**:
While simple averaging of word vectors to create sentence vectors is a common approach, it may not always be the best. For models like BERT and DistilBert, the embedding corresponding to the "[CLS]" token is often used as the sentence representation. This is especially true if the model has been fine-tuned on a task similar to the one you are working on, as the fine-tuning process will modify the model so that the "[CLS]" token captures relevant information about the sentence as a whole.

3. **DistilBert and Alternatives**:
DistilBert is a lighter and faster version of BERT. According to its authors, it retains 97% of BERT's language-understanding performance while being 40% smaller and 60% faster. It achieves this by halving the number of layers, removing the token-type embeddings and the pooler (used for next sentence prediction tasks), and training the smaller model to mimic BERT's behavior (knowledge distillation).

    Alternatives to DistilBert include:
    * BERT: The original model, which has multiple versions of varying sizes.
    * RoBERTa: A version of BERT that uses dynamic masking rather than static masking.
    * ALBERT: A lite BERT that reuses parameters across layers, resulting in significant reduction in size.
    * XLNet: A generalized autoregressive model that outperforms BERT on several benchmarks.
    * GPT-2 and GPT-3: Transformer models designed for language generation tasks. Of the two, only GPT-2 has published weights that can be loaded with the transformers library.

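    You can use these alternatives in much the same way as DistilBert: in the code below, replace 'distilbert-base-uncased' with the checkpoint name you want (such as 'bert-base-uncased', 'roberta-base', 'albert-base-v2', 'xlnet-base-cased', or 'gpt2') and use the tokenizer and model classes that match it (for example RobertaTokenizer and RobertaModel), or the generic AutoTokenizer and AutoModel classes.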
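
One way to make the swap described above less error-prone is to load checkpoints through the library's `AutoTokenizer` and `AutoModel` classes, which select the matching architecture from the checkpoint name. The sketch below is a minimal illustration, not the notebook's own code; the helper names `load_encoder` and `embed_text` are made up for this example. Calling the tokenizer directly also inserts the special tokens for BERT-style models ("[CLS]" and "[SEP]"), so they do not need to be added by hand.

```python
import torch
from transformers import AutoModel, AutoTokenizer


def load_encoder(model_name):
    """Load the tokenizer and bare encoder for a Hub checkpoint name.

    The checkpoint is downloaded on first use, e.g. 'distilbert-base-uncased'.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout so the extracted features are deterministic
    return tokenizer, model


def embed_text(text, tokenizer, model):
    """Return last-layer token embeddings, shape (1, num_tokens, hidden_size)."""
    # Calling the tokenizer adds any special tokens the model's input format uses.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    return outputs.last_hidden_state
```

For example, `tokenizer, model = load_encoder('roberta-base')` followed by `embed_text("Some text", tokenizer, model).mean(dim=1)` would give a mean-pooled sentence vector for that checkpoint.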

In [2]:
import torch
from transformers import DistilBertTokenizer, DistilBertModel, RobertaModel, RobertaTokenizer


In [3]:
# Load pre-trained model and tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
# For roberta model, for example:
# Load the pre-trained RoBERTa model
roberta_model = RobertaModel.from_pretrained('roberta-base')

# Load the tokenizer associated with the model
roberta_tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



To find the available checkpoint names for RoBERTa, or for any other architecture in the Transformers library, search the Hugging Face Model Hub at https://huggingface.co/models.

For example, some common RoBERTa models you might find are:

* roberta-base: The base RoBERTa model with 12 layers and about 125 million parameters.
* roberta-large: A larger RoBERTa model with 24 layers and 355 million parameters.
* roberta-large-mnli: The roberta-large model fine-tuned on the MultiNLI dataset.

These names can be passed to the from_pretrained method to load the corresponding RoBERTa models.
Remember to consult the documentation and the model's README file for more information on each model, including the input/output formats, fine-tuning tasks, and any specific instructions.

**I will use the DistilBert model for the next demonstration, but the concept is the same for the rest of the BERT family.**

In [8]:
%%time 
# Text to vectorize
text = "Here is an example paragraph that we will convert into an embedding."

# Add special tokens for BERT (start and end of sentence)
marked_text = "[CLS] " + text + " [SEP]"

# Tokenize our sentence
tokenized_text = tokenizer.tokenize(marked_text)

# Map tokens to their index in the tokenizer vocabulary
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

# Convert list to torch tensor
tokens_tensor = torch.tensor([indexed_tokens])

# Put everything on the GPU if available and run through the model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokens_tensor = tokens_tensor.to(device)
model = model.to(device)

with torch.no_grad():
    outputs = model(tokens_tensor)
    # The first element of outputs is the last hidden layer, with shape
    # (batch, tokens, hidden), which can be used as per-token embeddings.
    embeddings = outputs[0]

# Average over the token dimension (dim=1) to get one sentence vector
mean_embeddings = torch.mean(embeddings, dim=1).cpu().numpy()

print(mean_embeddings.flatten()[:50])

[-5.75909903e-03 -1.40274286e-01  6.08086586e-04 -1.64577499e-01
  2.39165109e-02 -3.49079072e-01  1.14989005e-01  3.25672895e-01
  3.41304064e-01 -2.07848027e-01 -3.44636321e-01 -1.31399527e-01
 -2.10214600e-01  1.57644272e-01 -2.64885396e-01  3.34598899e-01
 -3.45788933e-02  1.43658146e-01  5.99903241e-03 -1.34861842e-01
  1.15781657e-01  3.37876678e-01 -2.10142940e-01  2.18721047e-01
  6.12023413e-01 -2.36835033e-01  3.14411938e-01 -2.31933132e-01
 -3.67941141e-01  3.29555646e-02  2.74281979e-01  2.93108165e-01
 -2.37404536e-02 -2.98157543e-01  1.70525998e-01  6.03910461e-02
  5.33379316e-01 -2.53717005e-01 -4.15424742e-02  1.16764858e-01
 -4.99883920e-01 -1.38034701e-01  3.69493395e-01  9.91934240e-02
  5.66613339e-02 -3.94730270e-01  1.38495211e-02 -2.30717529e-02
 -7.10175037e-02 -7.83770531e-03]
CPU times: user 30.1 ms, sys: 14.8 ms, total: 44.8 ms
Wall time: 33.9 ms


In [7]:
tokenized_text

['[CLS]',
 'here',
 'is',
 'an',
 'example',
 'paragraph',
 'that',
 'we',
 'will',
 'convert',
 'into',
 'an',
 'em',
 '##bed',
 '##ding',
 '.',
 '[SEP]']

In [6]:
embeddings.shape

torch.Size([1, 17, 768])

In [14]:
print(tokenized_text)
print(len(tokenized_text))
print(embeddings.shape)

['[CLS]', 'here', 'is', 'an', 'example', 'paragraph', 'that', 'we', 'will', 'convert', 'into', 'an', 'em', '##bed', '##ding', '.', '[SEP]']
17
torch.Size([1, 17, 768])
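
Because "[CLS]" is the first entry of `tokenized_text`, the "[CLS]" representation discussed earlier is simply position 0 along the token dimension. The sketch below uses a random stand-in tensor with the same shape as `embeddings` so that it runs without loading a model; with the real `embeddings` tensor the indexing is identical.

```python
import torch

# Stand-in for the model output above: (batch, tokens, hidden) = (1, 17, 768).
# Random values are used only so the snippet runs on its own.
embeddings = torch.randn(1, 17, 768)

# Position 0 along the token dimension corresponds to the "[CLS]" token.
cls_embedding = embeddings[:, 0, :]

# Mean pooling, as in the cell above, averages over all 17 token positions.
mean_embedding = embeddings.mean(dim=1)

print(cls_embedding.shape)   # torch.Size([1, 768])
print(mean_embedding.shape)  # torch.Size([1, 768])
```

Both vectors have the same size, so either can be fed to a downstream classifier; which one works better depends on the model and the task.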


In [11]:
print(mean_embeddings.flatten().shape)

(768,)
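
The plain mean works here because a single sentence is encoded, so every position is a real token. When several sentences are batched and padded to a common length, a common refinement is to average only over real tokens using the attention mask. The following is a sketch under that assumption; `masked_mean` is a hypothetical helper name, and the tensors are toy values rather than model outputs.

```python
import torch


def masked_mean(token_embeddings, attention_mask):
    """Average token embeddings over real tokens only.

    token_embeddings: float tensor of shape (batch, tokens, hidden)
    attention_mask:   tensor of shape (batch, tokens), 1 = real token, 0 = padding
    """
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)  # (batch, tokens, 1)
    summed = (token_embeddings * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                         # avoid division by zero
    return summed / counts


# Toy batch: two sequences of length 3 with hidden size 2.
# The second sequence has one padding position at the end.
toy_embeddings = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]],
])
toy_mask = torch.tensor([[1, 1, 1], [1, 1, 0]])

print(masked_mean(toy_embeddings, toy_mask).tolist())
# [[3.0, 4.0], [2.0, 2.0]]
```

With Hugging Face tokenizers, an `attention_mask` of this form is returned alongside `input_ids` when a list of texts is tokenized with `padding=True` and `return_tensors='pt'`.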


### Using GPT models

In [20]:
import torch
from transformers import GPT2Tokenizer, GPT2Model

# Load pre-trained model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

# Text to vectorize
text = "Here is an example paragraph that we will convert into an embedding."

# Tokenize our sentence
input_ids = tokenizer.encode(text, return_tensors='pt')

# Run through the model without tracking gradients (inference only)
with torch.no_grad():
    outputs = model(input_ids)
# The first element of outputs is the last hidden layer, with shape
# (batch, tokens, hidden), which can be used as per-token embeddings.
embeddings = outputs[0]

# Average over the token dimension (dim=1) to get one sentence vector
mean_embeddings = torch.mean(embeddings, dim=1).cpu().numpy()

In [21]:
print(mean_embeddings.flatten()[:50])

[ 6.82176873e-02 -2.17835337e-01 -7.27493167e-01  2.93786347e-01
 -1.58162013e-01 -5.51667623e-02  2.28856421e+00  1.47566915e-01
  2.66588796e-02 -3.06096047e-01  1.96142383e-02  6.88604340e-02
 -2.73834676e-01 -3.17501999e-03 -6.38329834e-02  1.09410897e-01
 -2.67304450e-01  7.81244412e-02  3.33967358e-01 -4.50377464e-01
  6.72684684e-02  1.86494902e-01  8.82443041e-02 -1.19551606e-01
  5.64160869e-02 -1.86481867e-02 -5.05188107e-01 -3.79158318e-01
  2.73496896e-01  9.72704440e-02 -2.41519675e-01 -2.06735775e-01
  4.31770161e-02 -1.41417785e-02  1.27175167e-01  1.24921359e-01
  6.93927994e+01  4.60944235e-01 -2.53891051e-01  2.88569361e-01
  2.90439814e-01  4.95970100e-01  7.61573985e-02  5.72054163e-02
 -2.42845088e-01  1.03123374e-01 -2.68422186e-01 -5.21096289e-02
 -3.16923827e-01  1.00429356e+00]
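
GPT-2 is a causal (left-to-right) model, so each position only attends to the tokens before it, and the final token is the only one whose representation has seen the whole input. For that reason, the last token's hidden state is sometimes used instead of, or alongside, the mean. The snippet below sketches that alternative on toy tensors; `last_token_embedding` is a hypothetical helper name, and it assumes right-padded batches with at least one real token per sequence.

```python
import torch


def last_token_embedding(token_embeddings, attention_mask):
    """Pick the hidden state of the last real token in each sequence.

    Assumes right padding: real tokens first, then padding.
    token_embeddings: (batch, tokens, hidden)
    attention_mask:   (batch, tokens), 1 = real token, 0 = padding
    """
    last_index = attention_mask.sum(dim=1) - 1            # (batch,) index of last real token
    batch_index = torch.arange(token_embeddings.size(0))  # (batch,)
    return token_embeddings[batch_index, last_index]      # (batch, hidden)


# Toy batch: the second sequence has one padding position at the end.
toy_embeddings = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0], [9.0, 10.0], [0.0, 0.0]],
])
toy_mask = torch.tensor([[1, 1, 1], [1, 1, 0]])

print(last_token_embedding(toy_embeddings, toy_mask).tolist())
# [[5.0, 6.0], [9.0, 10.0]]
```

Note that the GPT-2 tokenizer defines no padding token by default, so batching several texts with padding requires assigning one first, for example `tokenizer.pad_token = tokenizer.eos_token`.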
