<a href="https://colab.research.google.com/github/gstripling00/conferences/blob/main/04_02_adv_nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Next, we need to preprocess our data so that it matches the data BERT was trained on. This takes a few steps (but don't worry, the Python library handles them for us):


1. Lowercase our text (if we're using a BERT lowercase model)
2. Tokenize it (e.g. "sally says hi" -> ["sally", "says", "hi"])
3. Break words into WordPieces (e.g. "calling" -> ["call", "##ing"])
4. Map our words to indexes using a vocab file that BERT provides
5. Add special "CLS" and "SEP" tokens (see the [readme](https://github.com/google-research/bert))
6. Append "index" and "segment" tokens to each input (see the [BERT paper](https://arxiv.org/pdf/1810.04805.pdf))

Happily, we don't have to worry about most of these details.
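For intuition, the steps listed above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: `TOY_VOCAB`, `wordpiece`, and `preprocess` are hypothetical names, the vocabulary is a tiny stand-in for BERT's real vocab file, and splitting on whitespace ignores the punctuation handling a real tokenizer performs.

```python
# Toy vocabulary: a hypothetical stand-in for BERT's real vocab file.
TOY_VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]",
             "sally", "says", "hi", "call", "##ing"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(TOY_VOCAB)}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into WordPieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def preprocess(text, lowercase=True):
    if lowercase:                                   # step 1: lowercase
        text = text.lower()
    words = text.split()                            # step 2: simplified tokenization
    pieces = [p for w in words for p in wordpiece(w, TOKEN_TO_ID)]  # step 3
    tokens = ["[CLS]"] + pieces + ["[SEP]"]         # step 5: special tokens
    input_ids = [TOKEN_TO_ID[t] for t in tokens]    # step 4: vocabulary lookup
    segment_ids = [0] * len(tokens)                 # step 6: one segment only
    return tokens, input_ids, segment_ids


tokens, input_ids, segment_ids = preprocess("Sally says hi calling")
print(tokens)
print(input_ids)
```

The IDs printed here are positions in the toy vocabulary, so they will not match the IDs a real BERT tokenizer produces.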




To start, we'll need to load a vocabulary file and lowercasing information directly from the BERT tf hub module:

Great, we just learned that the BERT model we're using expects lowercase data (that's what is stored in tokenization_info["do_lower_case"]), and we also loaded BERT's vocab file. We also created a tokenizer, which breaks words into word pieces:

Using our tokenizer, we'll call `run_classifier.convert_examples_to_features` on our InputExamples to convert them into features BERT understands.

In [None]:
pred_sentences = [
  "That movie was absolutely awful",
  "The acting was a bit lacking",
  "The film was creative and surprising",
  "Absolutely fantastic!"
]

#NEW EXERCISE - TOKENIZING

To import the DistilBERT model in Python, you can use the Hugging Face transformers library, which provides a convenient interface to work with various pre-trained transformer models, including DistilBERT. First, install the library, for example with `pip install transformers`.

In the example below, DistilBertModel.from_pretrained loads the pre-trained DistilBERT model, and DistilBertTokenizer.from_pretrained loads the corresponding tokenizer. You can replace 'distilbert-base-uncased' with other DistilBERT variants or models fine-tuned for specific tasks, depending on your requirements.

After loading the model and tokenizer, you can use them for various natural language processing tasks such as text classification, sentiment analysis, or embedding extraction.

Note: Depending on your use case, you might want to choose a specific DistilBERT variant that suits your needs. The model variant in the example, 'distilbert-base-uncased', is a commonly used base version trained on English text in an uncased format. Other variants and fine-tuned models are available on the Hugging Face model hub.






In [1]:
from transformers import DistilBertModel, DistilBertTokenizer

In [2]:
# Load pre-trained DistilBERT model and tokenizer
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')


In [3]:
# Example: Tokenizing and encoding text
text = "Hello, how are you doing today?"
tokens = tokenizer(text, return_tensors='pt')
outputs = model(**tokens)

In [4]:
print(outputs)
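# Note: the text has six words, but the tokenizer produces 10 tokens
# ([CLS], [SEP], and punctuation included), so last_hidden_state holds 10 vectors.
# The printout below is abbreviated with "..." and shows only six of them.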

BaseModelOutput(last_hidden_state=tensor([[[-0.1287, -0.1419, -0.1345,  ..., -0.0144,  0.5856,  0.3436],
         [ 0.1180, -0.1199,  0.3374,  ...,  0.1191,  0.7489, -0.0225],
         [-0.5806,  0.1913,  0.5075,  ..., -0.1679,  0.4484,  0.1599],
         ...,
         [-0.6880, -0.7057, -0.5547,  ..., -0.1592,  0.1028, -0.4917],
         [-0.3022, -0.6085, -0.4973,  ..., -0.1851,  0.6125,  0.1633],
         [ 0.7487,  0.1659, -0.3929,  ...,  0.3324, -0.2238, -0.3643]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)


In [5]:
print(tokens)

{'input_ids': tensor([[ 101, 7592, 1010, 2129, 2024, 2017, 2725, 2651, 1029,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
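In the output above, 101 and 102 are the IDs of the [CLS] and [SEP] tokens in the uncased BERT vocabulary, and the attention mask is all ones because a single sentence needs no padding. When sentences of different lengths are batched together, the shorter ones are padded and the mask marks the padding with zeros. The sketch below illustrates the idea in plain Python; `pad_batch` is a hypothetical helper, not part of the transformers library (with the real tokenizer, passing `padding=True` should have the same effect).

```python
PAD_ID = 0  # ID of [PAD] in the uncased BERT vocabulary


def pad_batch(batch_ids, pad_id=PAD_ID):
    """Pad every ID list to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch_ids]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids))
                      for ids in batch_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 7592, 1010, 2129, 2024, 2017, 2725, 2651, 1029, 102],  # IDs printed above
    [101, 7592, 102],                                            # [CLS] hello [SEP]
])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```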


In [6]:
print(text)

Hello, how are you doing today?


In [25]:
tokenizer.tokenize("This here's an example of using the BERT tokenizer")

['this',
 'here',
 "'",
 's',
 'an',
 'example',
 'of',
 'using',
 'the',
 'bert',
 'token',
 '##izer']
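The `##` prefix in the output above marks a piece that continues the previous token, so 'token' and '##izer' together spell "tokenizer". A minimal sketch of how such pieces could be glued back together follows; `merge_wordpieces` is a hypothetical helper written for illustration (the tokenizer's own `convert_tokens_to_string` method serves a similar purpose).

```python
def merge_wordpieces(pieces):
    """Join WordPieces back into words by attaching '##' continuations."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]  # drop the ## marker and append to previous word
        else:
            words.append(piece)
    return words


pieces = ['this', 'here', "'", 's', 'an', 'example', 'of',
          'using', 'the', 'bert', 'token', '##izer']
print(merge_wordpieces(pieces))
```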

In [30]:
tokenizer.tokenize("That movie was absolutely awful")

['that', 'movie', 'was', 'absolutely', 'awful']

#NEW EXERCISE

Let's extend the previous example to perform sentiment analysis using a pre-trained DistilBERT model. For this, we'll use the pipeline function from the transformers library, which provides a convenient way to perform various NLP tasks, including sentiment analysis:

In [7]:
from transformers import pipeline, DistilBertModel, DistilBertTokenizer

# Load pre-trained DistilBERT model and tokenizer
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Example text for sentiment analysis
text = "I really enjoyed watching the movie. It was fantastic!"

# Tokenize and encode the text
tokens = tokenizer(text, return_tensors='pt')

# Get the model's output
outputs = model(**tokens)

# Access the output embeddings or other information from 'outputs'
# For example, you can extract the embeddings for the [CLS] token
cls_embedding = outputs.last_hidden_state[:, 0, :]

# Print the embeddings (you can use these for your custom sentiment analysis task)
print("Embeddings:", cls_embedding)

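# Perform sentiment analysis using the pipeline.
# Note: 'distilbert-base-uncased' is the base checkpoint, so its classification
# head is newly initialized and the prediction below is not meaningful.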
sentiment_analysis = pipeline('sentiment-analysis', model='distilbert-base-uncased')
result = sentiment_analysis(text)

# Print the sentiment analysis result
print("Sentiment Analysis Result:", result)


Embeddings: tensor([[ 2.0706e-01, -8.8669e-02, -4.8918e-02, -1.3041e-01,  4.7975e-02,
         -3.9025e-01,  1.2626e-01,  6.5851e-01, -3.6693e-02, -8.2247e-02,
          8.6010e-02, -1.0300e-01,  5.3220e-02,  6.7070e-01,  1.2688e-01,
          1.5185e-01, -8.8836e-02,  2.7484e-01,  7.9578e-02, -1.2691e-01,
         -1.5399e-02, -4.4519e-01, -1.8450e-02,  1.7430e-01,  6.2543e-02,
         -1.3739e-01,  2.3637e-02, -7.6291e-02,  5.0763e-01, -1.4827e-02,
          6.5497e-02,  3.6562e-02, -2.3892e-01, -2.3559e-01, -2.7979e-02,
         -1.1920e-01, -2.0597e-01, -1.3350e-01, -2.5894e-01,  7.3608e-02,
         -1.7778e-01, -4.5025e-03,  2.7264e-01, -1.3120e-01, -1.3571e-01,
         -2.6281e-01, -2.4850e+00, -2.9151e-02,  2.1403e-02, -2.4612e-01,
          4.5534e-01,  5.5840e-02,  7.8191e-02,  2.5053e-01,  3.8298e-01,
          3.6595e-01, -4.9592e-01,  3.6923e-01, -2.2526e-01, -6.0456e-02,
          3.1174e-01,  1.5549e-01, -1.9369e-01, -2.1970e-01,  3.8572e-02,
         -4.1085e-02,  2.4

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Sentiment Analysis Result: [{'label': 'LABEL_0', 'score': 0.5327839255332947}]
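The warning and the score near 0.5 in the output above fit together: a classification head that has not been trained tends to produce logits close to zero, and a softmax over two nearly equal logits yields probabilities close to 0.5 each. The sketch below illustrates this with plain Python. The logit values are arbitrary small random numbers chosen for illustration, not values taken from the model.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
# Assumption: an untrained two-class head outputs small, essentially random logits.
logits = [random.gauss(0.0, 0.05) for _ in range(2)]
probs = softmax(logits)
top = max(range(len(probs)), key=lambda i: probs[i])
print(f"LABEL_{top}", round(probs[top], 3))
```

The winning label in such a setup depends on the random initialization rather than on the input text.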


In the extended example above:

We tokenize and encode the input text using the DistilBERT tokenizer.

We pass the tokens through the DistilBERT model to obtain the model's output embeddings. In this example, we extract the embeddings for the [CLS] token, which is commonly used for sentence-level tasks.

We print the embeddings. You can use these embeddings for custom sentiment analysis or other downstream tasks.

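We use the pipeline function to perform sentiment analysis directly and print the resulting label and confidence score. Because 'distilbert-base-uncased' is the base checkpoint, its classification head is newly initialized (see the warning in the output), so the pipeline returns a generic label (LABEL_0) with a score near 0.5 rather than a meaningful sentiment. To get 'POSITIVE' or 'NEGATIVE' labels, use a checkpoint fine-tuned for sentiment, such as 'distilbert-base-uncased-finetuned-sst-2-english'.

Note: The pipeline function simplifies the process of using pre-trained models for specific tasks; here we use its 'sentiment-analysis' task. You can customize the example based on your specific use case and requirements.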
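To make the [CLS] extraction concrete, the sketch below mimics `outputs.last_hidden_state[:, 0, :]` with nested Python lists and compares two sentence vectors by cosine similarity, one simple way such embeddings might be used in a custom task. The numbers are toy values and `cosine_similarity` is a hypothetical helper, not model output or library code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Toy hidden states shaped [batch][seq_len][hidden_size] = [2][2][3].
last_hidden_state = [
    [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5]],  # sentence 0: [CLS] vector, then one token
    [[2.0, 0.0, 4.0], [0.1, 0.9, 0.3]],  # sentence 1
]

# Equivalent of last_hidden_state[:, 0, :]: the first token of every sentence.
cls_embeddings = [sentence[0] for sentence in last_hidden_state]

similarity = cosine_similarity(cls_embeddings[0], cls_embeddings[1])
print(cls_embeddings)
print(similarity)
```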







In [18]:
#Running again in a different way

from transformers import pipeline, DistilBertModel, DistilBertTokenizer

# Load pre-trained DistilBERT model and tokenizer
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Example text for sentiment analysis
#text = "I really enjoyed watching the movie. It was fantastic!"
text = "That movie was absolutely awful"

# Tokenize and encode the text
tokens = tokenizer(text, return_tensors='pt')

# Get the model's output
outputs = model(**tokens)

# Access the output embeddings or other information from 'outputs'
# For example, you can extract the embeddings for the [CLS] token
cls_embedding = outputs.last_hidden_state[:, 0, :]

# Print the embeddings (you can use these for your custom sentiment analysis task)
print("Embeddings:", cls_embedding)

# Perform sentiment analysis using the pipeline
# (base checkpoint: the classification head is untrained, see the warning in the output)
sentiment_analysis = pipeline('sentiment-analysis', model='distilbert-base-uncased')
result = sentiment_analysis(text)

# Extract the sentiment label and confidence score
sentiment_label = result[0]['label']

#labels = ['Negative','Positive'] #(0:negative, 1:positive)
confidence_score = result[0]['score']

# Print the sentiment analysis result
print("Sentiment Label:", sentiment_label)
print("Confidence Score:", confidence_score)


Embeddings: tensor([[ 7.0210e-02,  6.8084e-02,  7.6470e-02,  1.9690e-02,  2.8007e-02,
         -1.0710e-01,  1.2325e-01,  4.8434e-01, -1.9508e-02,  2.3242e-02,
          5.9509e-02,  3.7812e-03, -7.5885e-02,  4.8900e-01,  9.0280e-02,
          1.4583e-01, -1.8004e-01,  1.1228e-01,  9.8468e-02, -1.6458e-01,
         -2.2962e-01, -2.3786e-01, -3.1089e-02,  1.1482e-02, -1.1495e-01,
          2.2647e-02,  3.7191e-02, -9.5181e-02,  1.8810e-01,  7.0893e-02,
          1.1936e-01, -5.1023e-02, -9.5914e-02, -4.0946e-03, -1.4172e-02,
         -4.6111e-02, -5.8164e-02, -6.1541e-02, -9.2024e-02, -5.6487e-02,
         -1.1766e-01,  4.8398e-02,  1.3649e-01, -1.2492e-01, -1.7980e-02,
         -1.1209e-02, -1.8811e+00, -1.3138e-02,  4.5990e-02, -2.3842e-01,
          2.7772e-01, -3.2499e-02,  1.7736e-01,  2.7165e-01,  3.0734e-01,
          2.1610e-01, -1.3272e-01,  3.9919e-01, -6.1820e-02,  1.8868e-02,
          1.9426e-01, -2.3554e-02, -5.8178e-02, -5.3716e-02, -1.0974e-02,
          5.5273e-02,  2.6

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Sentiment Label: LABEL_0
Confidence Score: 0.5016996264457703
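The commented-out `labels` line in the cell above hints at turning generic names such as LABEL_0 into readable ones. A minimal sketch of that mapping follows. `ID2LABEL` and `readable_label` are hypothetical names, and the 0 = negative, 1 = positive convention is taken from that comment; it only carries meaning for a model whose classification head was actually trained with that convention.

```python
# Assumed convention from the notebook comment: 0 = negative, 1 = positive.
ID2LABEL = {0: "Negative", 1: "Positive"}


def readable_label(raw_label, id2label=ID2LABEL):
    """Map 'LABEL_<n>' to a readable name; leave any other label unchanged."""
    prefix = "LABEL_"
    suffix = raw_label[len(prefix):]
    if raw_label.startswith(prefix) and suffix.isdigit():
        return id2label.get(int(suffix), raw_label)
    return raw_label


# Same structure as the pipeline result printed above.
result = [{"label": "LABEL_0", "score": 0.5016996264457703}]
print(readable_label(result[0]["label"]))
```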


In [31]:
#TEST
# Sentences we would like to classify
#text = "That movie was absolutely awful"

pred_sentences = [
  "That movie was absolutely awful",
  "The acting was a bit lacking",
  "The film was creative and surprising",
  "Absolutely fantastic!"
]
# Note: pred_sentences is never passed to the pipeline in this cell, so the
# values printed below are left over from the most recent pipeline call.
print("Sentiment Label:", sentiment_label)
print("Confidence Score:", confidence_score)

Sentiment Label: LABEL_0
Confidence Score: 0.5293472409248352
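The cell above defines `pred_sentences` but never classifies them. Since a transformers pipeline accepts a list of strings and returns one result dictionary per string, the whole list could be processed along the lines of the sketch below. `classify_all` is a hypothetical helper, and `fake_classifier` is a keyword-based stand-in used only so the sketch runs without downloading a model; in the notebook you would pass the `sentiment_analysis` pipeline object instead.

```python
def classify_all(classifier, sentences):
    """Run a classifier over all sentences and pair each with its label and score."""
    results = classifier(sentences)
    return [(s, r["label"], r["score"]) for s, r in zip(sentences, results)]


def fake_classifier(sentences):
    """Hypothetical stand-in for a pipeline: flags two hard-coded negative words."""
    negative_words = ("awful", "lacking")
    return [
        {"label": "NEGATIVE" if any(w in s.lower() for w in negative_words)
                  else "POSITIVE",
         "score": 1.0}  # placeholder score, not a model confidence
        for s in sentences
    ]


pred_sentences = [
    "That movie was absolutely awful",
    "The acting was a bit lacking",
    "The film was creative and surprising",
    "Absolutely fantastic!",
]

for sentence, label, score in classify_all(fake_classifier, pred_sentences):
    print(f"{label:8s} {score:.2f}  {sentence}")
```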


In [24]:
#Running again in a different way

from transformers import pipeline, DistilBertModel, DistilBertTokenizer

# Load pre-trained DistilBERT model and tokenizer
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Example text for sentiment analysis
text = ["That movie was absolutely awful"]

# Tokenize and encode the text
tokens = tokenizer(text, return_tensors='pt')

# Get the model's output
outputs = model(**tokens)

# Access the output embeddings or other information from 'outputs'
# For example, you can extract the embeddings for the [CLS] token
cls_embedding = outputs.last_hidden_state[:, 0, :]

# Print the embeddings (you can use these for your custom sentiment analysis task)
print("Embeddings:", cls_embedding)

# Perform sentiment analysis using the pipeline
# (base checkpoint: the classification head is untrained, see the warning in the output)
sentiment_analysis = pipeline('sentiment-analysis', model='distilbert-base-uncased')
result = sentiment_analysis(text)

# Extract the sentiment label and confidence score
sentiment_label = result[0]['label']

#labels = ['Negative','Positive'] #(0:negative, 1:positive)
confidence_score = result[0]['score']

# Print the sentiment analysis result
print("Sentiment Label:", sentiment_label)
print("Confidence Score:", confidence_score)


Embeddings: tensor([[ 7.0210e-02,  6.8084e-02,  7.6470e-02,  1.9690e-02,  2.8007e-02,
         -1.0710e-01,  1.2325e-01,  4.8434e-01, -1.9508e-02,  2.3242e-02,
          5.9509e-02,  3.7812e-03, -7.5885e-02,  4.8900e-01,  9.0280e-02,
          1.4583e-01, -1.8004e-01,  1.1228e-01,  9.8468e-02, -1.6458e-01,
         -2.2962e-01, -2.3786e-01, -3.1089e-02,  1.1482e-02, -1.1495e-01,
          2.2647e-02,  3.7191e-02, -9.5181e-02,  1.8810e-01,  7.0893e-02,
          1.1936e-01, -5.1023e-02, -9.5914e-02, -4.0946e-03, -1.4172e-02,
         -4.6111e-02, -5.8164e-02, -6.1541e-02, -9.2024e-02, -5.6487e-02,
         -1.1766e-01,  4.8398e-02,  1.3649e-01, -1.2492e-01, -1.7980e-02,
         -1.1209e-02, -1.8811e+00, -1.3138e-02,  4.5990e-02, -2.3842e-01,
          2.7772e-01, -3.2499e-02,  1.7736e-01,  2.7165e-01,  3.0734e-01,
          2.1610e-01, -1.3272e-01,  3.9919e-01, -6.1820e-02,  1.8868e-02,
          1.9426e-01, -2.3554e-02, -5.8178e-02, -5.3716e-02, -1.0974e-02,
          5.5273e-02,  2.6

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Sentiment Label: LABEL_0
Confidence Score: 0.5293472409248352
