# Putting it all together (PyTorch)

This notebook accompanies chapter 2, section 6 of the Hugging Face course: [Putting it all together](https://huggingface.co/course/chapter2/6?fw=pt)

The original code for this notebook comes from the Hugging Face notebooks repository, linked here through SageMaker Studio Lab: [section6_pt.ipynb](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/master/course/en/chapter2/section6_pt.ipynb)

## Run conditions

This notebook has been tested in the following environment:
- Environment: Project created in [Paperspace Gradient](https://gradient.paperspace.com) with Python 3.9.13.
- Machine: P5000 (30 GiB RAM, 8 CPUs, 16 GiB GPU); more details on [Paperspace Machines](https://docs.paperspace.com/gradient/machines/).
- IDE: Visual Studio Code using remote Jupyter server.

## Install dependencies

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
# Install the libraries datasets v2.7.1, evaluate v0.3.0, and transformers v4.25.1 with quiet and upgrade flags.
%pip install -q datasets==2.7.1 evaluate==0.3.0 transformers==4.25.1 --upgrade

Note: you may need to restart the kernel to use updated packages.


## Code

In [2]:
# Import AutoTokenizer from Transformers.
from transformers import AutoTokenizer

# Create a checkpoint name for the tokenizer.
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
# Create a tokenizer object.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Create a sequence.
sequence = "I've been waiting for a HuggingFace course my whole life."
# Tokenize the sequence into model inputs (special tokens are added automatically).
inputs = tokenizer(sequence)
# Print the input IDs.
print(inputs["input_ids"])

[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]


In [3]:
# Split the sequence into tokens (no special tokens are added at this step).
tokens = tokenizer.tokenize(sequence)
# Convert the tokens to their vocabulary IDs.
token_ids = tokenizer.convert_tokens_to_ids(tokens)
# Print the token IDs.
print(token_ids)

[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]


In [4]:
# Decode the input IDs (includes the [CLS] and [SEP] special tokens).
print(tokenizer.decode(inputs["input_ids"]))
# Decode the token IDs (no special tokens).
print(tokenizer.decode(token_ids))

[CLS] i've been waiting for a huggingface course my whole life. [SEP]
i've been waiting for a huggingface course my whole life.
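
The three outputs above suggest that calling the tokenizer directly wraps the plain token IDs in two special tokens. The sketch below checks this using only the IDs printed in this notebook; reading `101` as `[CLS]` and `102` as `[SEP]` is inferred from the decoded output, and the names `CLS_ID` and `SEP_ID` are illustrative, not part of the Transformers API.

```python
# IDs copied from the outputs printed above.
input_ids = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
             2607, 2026, 2878, 2166, 1012, 102]
token_ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
             2607, 2026, 2878, 2166, 1012]

# Hypothetical names for the two special-token IDs seen in the decoded text.
CLS_ID = 101  # decoded as [CLS]
SEP_ID = 102  # decoded as [SEP]

# Rebuild the tokenizer's input IDs from the plain token IDs.
rebuilt = [CLS_ID] + token_ids + [SEP_ID]

# Compare the rebuilt list with the one produced by tokenizer(sequence).
print(rebuilt == input_ids)
# Count how many extra IDs the direct tokenizer call added.
print(len(input_ids) - len(token_ids))
```

Both checks use plain lists, so they run without loading the tokenizer.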


## Wrapping up: From tokenizer to model

In [5]:
# Import PyTorch.
import torch

# Import AutoTokenizer and AutoModelForSequenceClassification from Transformers.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Set the checkpoint name shared by the tokenizer and the model.
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
# Create a tokenizer object so this cell does not depend on earlier cells.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Create a model object.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
# Create a sequence.
sequence = "I've been waiting for a HuggingFace course my whole life."
# Tokenize with padding and truncation enabled, returning PyTorch tensors.
inputs = tokenizer(sequence, padding=True, truncation=True, return_tensors="pt")
# Run the model on the tokenized inputs.
outputs = model(**inputs)
# Print the outputs (raw logits, not probabilities).
print(outputs)

SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
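
The logits printed above are unnormalized scores, so a softmax is usually applied to read them as probabilities. The sketch below does this in plain Python on the two values copied from the output. The label order (`NEGATIVE` at index 0, `POSITIVE` at index 1) is an assumption about this checkpoint's `id2label` mapping and should be checked against `model.config.id2label`; the `softmax` helper is written here for illustration.

```python
import math

# Logits copied from the SequenceClassifierOutput printed above.
logits = [-1.5607, 1.6123]

# Assumed label order; verify against model.config.id2label.
labels = ["NEGATIVE", "POSITIVE"]


def softmax(values):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtract the maximum for numerical stability before exponentiating.
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax(logits)
# Pick the label with the highest probability.
predicted = labels[probabilities.index(max(probabilities))]

for label, probability in zip(labels, probabilities):
    print(f"{label}: {probability:.4f}")
print("Predicted:", predicted)
```

On the tensor itself, `torch.nn.functional.softmax(outputs.logits, dim=-1)` performs the same normalization.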
