# Custom Tokenizer Notebook

## 1. Install Dependencies

In [1]:
%pip install transformers[torch]
%pip install boto3

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


## 2. Download Data

In [5]:
# Import the necessary libraries
from tokenizers import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing
import boto3

# Specify the name of your S3 bucket and the file in the bucket
bucket_name = "awsc.datascience.objecjstore"
file_name = "training_data/TokenTrainingText.txt"
local_file_name = "TokenTrainingText.txt"

# Initialize the S3 client
s3 = boto3.client('s3')

# Download the file from the S3 bucket to the local file system of the SageMaker instance
# local_file_name is the path (relative to the working directory) where the file is saved
s3.download_file(bucket_name, file_name, local_file_name)
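
Re-running the download cell fetches the file again each time. A minimal sketch of a guard that skips the download when the local copy already exists is shown below. The helper name `download_if_missing` is hypothetical (not part of boto3); it only assumes the client exposes `download_file(bucket, key, filename)`, as the boto3 S3 client does.

```python
import os


def download_if_missing(s3_client, bucket, key, local_path):
    """Download `key` from `bucket` unless `local_path` already exists.

    Returns True if a download was performed, False if it was skipped.
    `s3_client` is assumed to offer download_file(bucket, key, filename).
    """
    if os.path.exists(local_path):
        return False
    s3_client.download_file(bucket, key, local_path)
    return True
```

In this notebook the call would look like `download_if_missing(s3, bucket_name, file_name, local_file_name)`.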


## 3. Retrain with Domain Data

In [20]:
# Initialize a byte-level Byte Pair Encoding (BPE) tokenizer
tokenizer = ByteLevelBPETokenizer()

# Train the tokenizer on the downloaded file
# vocab_size=52_000 and min_frequency=2 are hyperparameters that can be adjusted according to the specific characteristics of your text
# special_tokens is a list of tokens that will be added to the tokenizer's vocabulary
tokenizer.train(files=local_file_name, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

# Save the trained tokenizer to disk
# It will create two files: 'awsblogs-vocab.json' and 'awsblogs-merges.txt'
tokenizer.save_model(".", "awsblogs")

['.\\awsblogs-vocab.json', '.\\awsblogs-merges.txt']
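
To make the `min_frequency` hyperparameter more concrete, here is a toy, character-level sketch of how BPE learns merge rules. It is an illustration only, not the algorithm as implemented in the `tokenizers` library: `learn_bpe_merges` and its `num_merges` parameter are hypothetical names, with `num_merges` standing in for the budget that `vocab_size` imposes, and ties are broken arbitrarily but deterministically.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges, min_frequency=2):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each distinct word becomes a tuple of symbols with its occurrence count.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Pick the most frequent pair; ties fall back to comparing the pairs.
        best_pair, best_count = max(
            pair_counts.items(), key=lambda item: (item[1], item[0])
        )
        # Pairs seen fewer than min_frequency times are never merged.
        if best_count < min_frequency:
            break
        merges.append(best_pair)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best_pair:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += count
        corpus = new_corpus
    return merges


print(learn_bpe_merges(["aws", "aws", "aws", "awesome"], num_merges=10))
# [('a', 'w'), ('aw', 's')]
```

With `min_frequency=2`, learning stops after two merges because every remaining pair occurs only once. This is why frequent domain terms such as "aws" can end up as single tokens while rare strings stay split.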

## 4. Load Model

In [21]:
from transformers import BertTokenizer
from tokenizers.implementations import ByteLevelBPETokenizer

# Define a test sentence
test_sentence = "aws is awesome."

# Load the pre-trained tokenizer
pretrained_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the test sentence with the pre-trained tokenizer
pretrained_output = pretrained_tokenizer.tokenize(test_sentence)

# Print the tokens
print("Pre-trained tokenizer tokens: ", pretrained_output)

# Load the custom trained tokenizer
custom_tokenizer = ByteLevelBPETokenizer(
    "./awsblogs-vocab.json",
    "./awsblogs-merges.txt",
)

# Tokenize the test sentence with the custom tokenizer
custom_output = custom_tokenizer.encode(test_sentence)

# Print the tokens
print("Custom tokenizer tokens: ", custom_output.tokens)


Pre-trained tokenizer tokens:  ['aw', '##s', 'is', 'awesome', '.']
Custom tokenizer tokens:  ['aws', 'Ġis', 'Ġawesome', '.']
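
The `Ġ` prefix in the custom tokenizer's output marks a leading space: byte-level BPE maps every byte to a printable character before merging, and the space byte lands on `Ġ` (U+0120). The sketch below rebuilds that table, assuming the GPT-2 style byte-to-character mapping that byte-level BPE implementations conventionally use. The helper names are illustrative and not part of the `tokenizers` API.

```python
def bytes_to_unicode():
    """Map each byte value (0-255) to a printable Unicode character (GPT-2 style)."""
    # Bytes that are already printable map to themselves.
    printable = (
        list(range(0x21, 0x7F))     # '!' .. '~'
        + list(range(0xA1, 0xAD))   # U+00A1 .. U+00AC
        + list(range(0xAE, 0x100))  # U+00AE .. U+00FF
    )
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    # Remaining bytes (control characters, space, ...) are shifted past U+00FF.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {char: byte for byte, char in BYTE_TO_CHAR.items()}


def tokens_to_text(tokens):
    """Invert the byte-level mapping to recover the original text."""
    raw = bytes(CHAR_TO_BYTE[ch] for token in tokens for ch in token)
    return raw.decode("utf-8", errors="replace")


print(BYTE_TO_CHAR[ord(" ")])                                # Ġ
print(tokens_to_text(["aws", "Ġis", "Ġawesome", "."]))       # aws is awesome.
```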


## 5. Run Vector

In [37]:
from transformers import RobertaModel, RobertaTokenizerFast
import torch
import os
print(os.getcwd())
# Load the trained tokenizer
tokenizer = RobertaTokenizerFast(
    vocab_file="./awsblogs-vocab.json", 
    merges_file="./awsblogs-merges.txt",
    bos_token="<s>",
    eos_token="</s>",
    sep_token="</s>",
    cls_token="<s>",
    unk_token="<unk>",
    pad_token="<pad>",
    mask_token="<mask>"
)

# Specify the model name
model_name = 'roberta-base'  # Architecture matches the tokenizer class (RoBERTa); note the checkpoint was pre-trained with its own vocabulary, not the custom one

# Load the model
model = RobertaModel.from_pretrained(model_name)



c:\Users\manumishra\source\repos\manu-mishra\awsconcepts\Workloads\AwsConceptsApp\DataScience\notebooks


Downloading:   0%|          | 0.00/481 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


Downloading:   0%|          | 0.00/501M [00:00<?, ?B/s]

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
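
One caveat when pairing a freshly trained tokenizer with a pre-trained checkpoint: the tokenizer above was trained with `vocab_size=52_000`, while the `roberta-base` config reports a vocabulary of 50,265 entries, so some custom token ids may not exist in the model's embedding table. The helper below is a hypothetical sanity check (not a `transformers` function) that lists such ids.

```python
def out_of_range_ids(token_ids, num_embeddings):
    """Return the sorted ids that fall outside an embedding table of the given size."""
    return sorted({i for i in token_ids if not 0 <= i < num_embeddings})


# Hypothetical ids for illustration; 50265 is the vocabulary size of roberta-base.
print(out_of_range_ids([0, 812, 51999, 2], num_embeddings=50265))  # [51999]
```

In this notebook, `token_ids` would come from `tokenizer(text)["input_ids"]` and `num_embeddings` from `model.config.vocab_size`. If ids are out of range, `model.resize_token_embeddings(len(tokenizer))` enlarges the table, but the new rows are randomly initialized. Even in-range ids point at embeddings learned for different tokens, so the resulting vectors are likely to be meaningful only after the model is trained or fine-tuned with the custom vocabulary.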


In [42]:
# Define a sample text
text = "Hello, this is a test."

# Encode the text to get the input tensors using your custom tokenizer
inputs = tokenizer(text, return_tensors='pt')

# Run the text through the model to get the embeddings
outputs = model(**inputs)

# Use the average of the last hidden state as the text's embedding
embeddings = outputs.last_hidden_state.mean(dim=1)

# Convert the tensor to a numpy array
vectors = embeddings.detach().numpy()

print(vectors[:, :50])


[[ 0.0621059   0.07121446  0.03583737  0.15002523 -0.01553448  0.06262513
   0.02541252  0.08205176  0.0336398  -0.03258981  0.04044851 -0.14505434
   0.0204825   0.09314897  0.0913348   0.05756848  0.03054859 -0.01449592
  -0.11808332  0.0449594  -0.06246282  0.10158909  0.01026067  0.06910552
   0.05934934  0.00413021 -0.07368245 -0.07226573 -0.0789161   0.00210865
  -0.03282139  0.04227065 -0.00086759 -0.02976679  0.074085   -0.0158735
   0.08096281  0.04571429 -0.05945798 -0.03628789  0.09128615 -0.1272096
   0.08767764  0.10240908  0.01987366  0.03109403 -0.00810728  0.22222985
   0.03203924  0.01365474]]
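
Averaging `last_hidden_state` over every position works for a single sentence, but in a padded batch the padding positions would be averaged in too. A minimal sketch of mask-aware mean pooling follows; `masked_mean_pool` is a hypothetical helper, and the tensors below are made-up values rather than model output.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring positions where the mask is 0.

    last_hidden_state: (batch, seq_len, hidden) float tensor
    attention_mask:    (batch, seq_len) tensor of 0/1 values
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Clamp so an all-padding row does not divide by zero.
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


# Toy example: the third position is padding and must not affect the mean.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
print(masked_mean_pool(hidden, mask))  # tensor([[2., 3.]])
```

With the notebook's variables this would be `masked_mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`, ideally with the forward pass wrapped in `torch.no_grad()` since no gradients are needed for inference.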
