# Task
Create a program that uses a pre-trained language model (such as BERT or GPT) to tokenize and encode incoming text.

## Import necessary libraries

### Subtask:
Import `transformers` for pre-trained models and tokenizers, and `torch` for tensor operations.


**Reasoning**:
Import the necessary libraries for the subtask.



In [None]:
import transformers
import torch

## Load the pre-trained tokenizer and model

### Subtask:
Load the pre-trained tokenizer and model.


**Reasoning**:
Load the pre-trained tokenizer and model as per the instructions.



In [None]:
model_name = 'bert-base-uncased'
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModel.from_pretrained(model_name)

## Define input text

### Subtask:
Define input text.


**Reasoning**:
The subtask is to define the input text. I will create a string variable and assign a sample sentence to it.



In [None]:
input_text = "This is a sample sentence for tokenization and encoding."

## Tokenize the input text

### Subtask:
Tokenize the input text.


**Reasoning**:
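Call the loaded tokenizer on the input text and store the result. With `return_tensors="pt"`, the result is a dictionary-like encoding that holds PyTorch tensors: `input_ids`, `token_type_ids`, and `attention_mask`.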



In [None]:
tokens = tokenizer(input_text, return_tensors="pt")
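
BERT's tokenizer splits words it does not know as a whole into smaller pieces using the WordPiece scheme. The printed encoding later in this notebook has 13 IDs for nine words and a period: two special tokens (`101` and `102`) plus one word that was split in two. The following is a minimal sketch of the greedy longest-match idea behind such a split, not the library's implementation. The function name `wordpiece` and the tiny vocabulary are hypothetical and exist only for illustration.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; the real bert-base-uncased vocabulary has about 30,000 entries.
toy_vocab = {"token", "##ization", "sample", "encoding"}

print(wordpiece("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece("sample", toy_vocab))        # ['sample']
print(wordpiece("xyz", toy_vocab))           # ['[UNK]']
```

With the real tokenizer, `tokenizer.convert_ids_to_tokens(...)` shows the actual pieces behind a list of IDs.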

## Encode the tokens

### Subtask:
The tokenizer call in the previous step already mapped each token to a numerical ID. Extract those input IDs from its output.


**Reasoning**:
Access the input IDs from the tokens object and store them in a new variable.



In [None]:
input_ids = tokens['input_ids']
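
The encoding also carries an `attention_mask`. For a single sentence it is all ones, but when several sentences of different lengths are batched, the shorter ones are padded and the mask marks which positions are real tokens. The sketch below shows that idea in plain Python. The helper `pad_batch` and the two ID lists are hypothetical, and the padding ID of `0` is an assumption that matches BERT's `[PAD]` token; the real tokenizer does this for you when called on a list of texts with `padding=True`.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad ID lists to equal length and build matching attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    padded, masks = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)        # fill the tail with the pad ID
        masks.append([1] * len(ids) + [0] * n_pad)   # 1 = real token, 0 = padding
    return padded, masks


# Two hypothetical ID sequences of different lengths.
batch = [[101, 2023, 2003, 102], [101, 7099, 102]]
padded, masks = pad_batch(batch)

print(padded)  # [[101, 2023, 2003, 102], [101, 7099, 102, 0]]
print(masks)   # [[1, 1, 1, 1], [1, 1, 1, 0]]
```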

## Perform model inference (optional)

### Subtask:
Pass the encoded input to the loaded model to get model outputs. A model loaded with `AutoModel` returns hidden states (and, for BERT, a pooled output) rather than task-specific logits.


**Reasoning**:
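Pass the full encoding (input IDs, token type IDs, and attention mask) to the loaded model. Gradients are not needed for inference, so the call runs under `torch.no_grad()`.



In [None]:
with torch.no_grad():
    outputs = model(**tokens)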

## Print the results

### Subtask:
Display the original text, tokens, encoded input IDs, and model outputs (if inference was performed).


**Reasoning**:
Display the original text, tokens, encoded input IDs, and model outputs as requested by the instructions.



In [None]:
print("Original Text:")
print(input_text)

print("\nTokens:")
print(tokens)

print("\nEncoded Input IDs:")
print(input_ids)

print("\nModel Outputs:")
print(outputs)

Original Text:
This is a sample sentence for tokenization and encoding.

Tokens:
{'input_ids': tensor([[  101,  2023,  2003,  1037,  7099,  6251,  2005, 19204,  3989,  1998,
         17181,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Encoded Input IDs:
tensor([[  101,  2023,  2003,  1037,  7099,  6251,  2005, 19204,  3989,  1998,
         17181,  1012,   102]])

Model Outputs:
BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.2581, -0.5900, -0.4180,  ..., -0.4544, -0.0746,  0.8369],
         [-0.4306, -0.7994, -0.4574,  ..., -0.1594,  0.8422,  0.4387],
         [-0.3340, -0.4471, -0.0218,  ..., -0.0096, -0.1303,  0.7933],
         ...,
         [ 0.6940, -0.3468, -0.2339,  ..., -0.2570, -0.6028,  0.1661],
         [ 0.7511, -0.0681, -0.7081,  ...,  0.3149, -0.5963, -0.3096],
         [ 0.3835, -0.1569, -0.4574,  ...,  0.6327, -0.8696, -0.0965]]],
   

## Summary:

### Key Findings

*   The necessary libraries, `transformers` and `torch`, were successfully imported.
*   The `bert-base-uncased` tokenizer and model were downloaded from the Hugging Face Hub and loaded with the `transformers` library.
*   The input text "This is a sample sentence for tokenization and encoding." was defined.
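*   The input text was tokenized with the loaded tokenizer, producing an encoding of three PyTorch tensors (`input_ids`, `token_type_ids`, and `attention_mask`), each covering 13 positions including the special tokens `101` (`[CLS]`) and `102` (`[SEP]`).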
*   The numerical representations (input IDs) were successfully extracted from the tokenized output.
*   The encoded input IDs were successfully passed to the loaded model to obtain model outputs, including the `last_hidden_state` and `pooler_output`.
*   The original text, tokens, encoded input IDs, and model outputs were successfully printed.

### Insights or Next Steps

*   The process demonstrates a standard workflow for using pre-trained language models for text processing tasks.
*   The `last_hidden_state` holds one vector per input token. Together with `pooler_output`, it can serve as input features for downstream tasks such as text classification, sentiment analysis, or question answering, typically after adding and fine-tuning a task-specific head.
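
One common way to turn the per-token vectors into a single fixed-size sentence vector is masked mean pooling: average the token vectors and skip padded positions. This notebook does not do this; the sketch below only illustrates the idea on plain lists. The function `mean_pool` and the toy numbers are hypothetical, and the vectors here have two dimensions instead of the model's real hidden size.

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors, skipping positions whose mask value is 0."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vector)]
            count += 1
    # Assumes at least one position is unmasked; otherwise this divides by zero.
    return [t / count for t in total]


# Three hypothetical token vectors; the last one is padding and must be ignored.
toy_hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]

print(mean_pool(toy_hidden, toy_mask))  # [2.0, 3.0]
```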


# New Project: Text Tokenization and Encoding with GPT

This section demonstrates how to use a pre-trained GPT model to tokenize and encode text.

## Import necessary libraries (already imported)

### Subtask:
Import `transformers` for pre-trained models and tokenizers, and `torch` for tensor operations.

**Reasoning**:
The necessary libraries are already imported in the previous section.

## Load the pre-trained GPT tokenizer and model

### Subtask:
Load the pre-trained GPT tokenizer and model.

**Reasoning**:
Load the pre-trained GPT tokenizer and model using the `transformers` library.

In [None]:
gpt_model_name = 'gpt2'
gpt_tokenizer = transformers.AutoTokenizer.from_pretrained(gpt_model_name)
gpt_model = transformers.AutoModel.from_pretrained(gpt_model_name)

## Define input text for GPT

### Subtask:
Define input text for the GPT project.

**Reasoning**:
Define a sample input text string specifically for the GPT project.

In [None]:
gpt_input_text = "This is another sample sentence for GPT tokenization and encoding."

## Tokenize the input text with GPT tokenizer

### Subtask:
Tokenize the input text using the GPT tokenizer.

**Reasoning**:
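Call the loaded GPT tokenizer on the input text and store the result. With `return_tensors="pt"`, the result is a dictionary-like encoding that holds PyTorch tensors: `input_ids` and `attention_mask`.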

In [None]:
gpt_tokens = gpt_tokenizer(gpt_input_text, return_tensors="pt")
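
GPT-2's tokenizer uses byte-pair encoding (BPE) rather than WordPiece: a word starts as single symbols, and a ranked list of merge rules repeatedly joins adjacent pairs. The printed encoding later in this notebook has 13 IDs for ten words and a period, so some words were split into more than one piece. The sketch below illustrates the merge loop only. The function `bpe` and the merge table are hypothetical, and the real tokenizer additionally works on bytes and encodes leading spaces, which is omitted here.

```python
def bpe(word, merges):
    """Apply ranked merge rules to a word, starting from single characters."""
    ranks = {pair: i for i, pair in enumerate(merges)}  # lower rank = applied earlier
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge applies to any adjacent pair
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])  # join the best pair
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table, listed from highest to lowest priority.
toy_merges = [("t", "o"), ("k", "e"), ("to", "ke"), ("toke", "n"), ("i", "z"), ("iz", "e")]

print(bpe("tokenize", toy_merges))  # ['token', 'ize']
```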

## Encode the tokens with GPT tokenizer

### Subtask:
The GPT tokenizer call in the previous step already mapped each token to a numerical ID. Extract those input IDs from its output.

**Reasoning**:
Access the input IDs from the GPT tokens object and store them in a new variable.

In [None]:
gpt_input_ids = gpt_tokens['input_ids']

## Perform GPT model inference (optional)

### Subtask:
Pass the encoded input to the loaded GPT model to get model outputs.

**Reasoning**:
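Pass the full encoding (input IDs and attention mask) to the loaded GPT model. Gradients are not needed for inference, so the call runs under `torch.no_grad()`.

In [None]:
with torch.no_grad():
    gpt_outputs = gpt_model(**gpt_tokens)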

## Print the GPT results

### Subtask:
Display the original GPT text, tokens, encoded input IDs, and model outputs.

**Reasoning**:
Display the original GPT text, tokens, encoded input IDs, and model outputs as requested.

In [None]:
print("Original GPT Text:")
print(gpt_input_text)

print("\nGPT Tokens:")
print(gpt_tokens)

print("\nEncoded GPT Input IDs:")
print(gpt_input_ids)

print("\nGPT Model Outputs:")
print(gpt_outputs)

Original GPT Text:
This is another sample sentence for GPT tokenization and encoding.

GPT Tokens:
{'input_ids': tensor([[ 1212,   318,  1194,  6291,  6827,   329,   402, 11571, 11241,  1634,
           290, 21004,    13]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Encoded GPT Input IDs:
tensor([[ 1212,   318,  1194,  6291,  6827,   329,   402, 11571, 11241,  1634,
           290, 21004,    13]])

GPT Model Outputs:
BaseModelOutputWithPastAndCrossAttentions(last_hidden_state=tensor([[[ 5.2970e-02, -1.3685e-02, -2.3932e-01,  ..., -1.2450e-01,
          -1.1159e-01,  2.2529e-02],
         [ 2.4703e-01,  2.2600e-01,  3.9671e-02,  ...,  2.4134e-01,
           4.3486e-01,  1.7679e-01],
         [ 4.6754e-01, -4.8599e-01, -2.6761e-01,  ...,  4.5541e-04,
           1.5176e-01,  2.0704e-01],
         ...,
         [ 1.0660e-01, -4.2693e-01, -8.1358e-01,  ..., -2.7512e-02,
           2.5860e-01, -3.8891e-01],
         [-2.5486e-01, -4.6663e-01, -1.4872e+00,  ...,  1

## Summary: GPT Project

### Key Findings

* The necessary libraries, `transformers` and `torch`, were already imported.
* The `gpt2` tokenizer and model were downloaded from the Hugging Face Hub and loaded with the `transformers` library.
* A new input text "This is another sample sentence for GPT tokenization and encoding." was defined for the GPT project.
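* The GPT input text was tokenized with the loaded GPT tokenizer, producing an encoding of two PyTorch tensors (`input_ids` and `attention_mask`), each covering 13 tokens. Unlike the BERT tokenizer, it returned no `token_type_ids` and added no special tokens.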
* The numerical representations (input IDs) were successfully extracted from the GPT tokenized output.
* The encoded GPT input IDs were successfully passed to the loaded GPT model to obtain model outputs.
* The original GPT text, tokens, encoded input IDs, and model outputs were successfully printed.

### Insights or Next Steps

* This section demonstrates the process of using a GPT model for text tokenization and encoding, similar to the BERT project but with a different model architecture.
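* `AutoModel` loads GPT-2 without a language-modeling head, so the outputs here are hidden states rather than next-token logits. Text generation, summarization, and other tasks that rely on the model's autoregressive nature need a model with that head, for example one loaded with `transformers.AutoModelForCausalLM`.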
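
To make "autoregressive" concrete: generation appends one token at a time, and each new token is chosen from scores computed over everything generated so far. The sketch below shows greedy decoding, the simplest such loop. It is an illustration under stated assumptions, not how this notebook or the library generates text. The names `greedy_decode` and `toy_scores` are hypothetical, and `toy_scores` is a stand-in for a real language model that returns one score per vocabulary entry.

```python
def greedy_decode(next_token_scores, prompt_ids, eos_id, max_new_tokens=10):
    """Repeatedly append the highest-scoring next token until EOS or the length limit."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)  # one score per vocabulary entry
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids


def toy_scores(ids):
    """Hypothetical stand-in for a model: always favors (last ID + 1) mod 5."""
    vocab_size = 5
    scores = [0.0] * vocab_size
    scores[(ids[-1] + 1) % vocab_size] = 1.0
    return scores


print(greedy_decode(toy_scores, prompt_ids=[0, 1], eos_id=4))  # [0, 1, 2, 3, 4]
```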