<a href="https://colab.research.google.com/github/gstripling00/conferences/blob/main/11.11.24/04_03_adv_nlp_end.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Next, we need to preprocess our data so that it matches the data BERT was trained on. This takes several steps, all of which the Python library handles for us:

1. Lowercase our text (if we're using a BERT lowercase model)
2. Tokenize it (e.g. "sally says hi" -> ["sally", "says", "hi"])
3. Break words into WordPieces (e.g. "calling" -> ["call", "##ing"])
4. Map our words to indexes using a vocab file that BERT provides
5. Add special "CLS" and "SEP" tokens (see the [readme](https://github.com/google-research/bert))
6. Append "index" and "segment" tokens to each input (see the [BERT paper](https://arxiv.org/pdf/1810.04805.pdf))

Happily, we don't have to worry about most of these details.
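For intuition only, the preprocessing steps above can be sketched in plain Python. This is a toy illustration, not the library's implementation: the vocabulary is a made-up nine-entry stand-in for BERT's real vocab file, and `wordpiece` and `encode` are hypothetical helper names.

```python
# Toy vocabulary: a hypothetical stand-in for BERT's real vocab file.
TOY_VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "sally", "says", "hi", "call", "##ing"]
TOKEN_TO_ID = {token: idx for idx, token in enumerate(TOY_VOCAB)}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into WordPieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece matched, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab=TOKEN_TO_ID):
    """Lowercase, split on whitespace, apply WordPiece, add special tokens, map to ids."""
    words = text.lower().split()
    pieces = [piece for word in words for piece in wordpiece(word, vocab)]
    tokens = ["[CLS]"] + pieces + ["[SEP]"]
    ids = [vocab[token] for token in tokens]
    return tokens, ids


tokens, ids = encode("Sally says hi calling")
print(tokens)
print(ids)
```

The real tokenizer also splits punctuation and works from a vocabulary of roughly thirty thousand entries, but the greedy longest-match idea is the same.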




To start, we'll need to load a vocabulary file and lowercasing information directly from the BERT TF Hub module:

Great, we just learned that the BERT model we're using expects lowercase data (that's what is stored in `tokenization_info["do_lower_case"]`), and we also loaded BERT's vocab file. We also created a tokenizer, which breaks words into word pieces:

Using our tokenizer, we'll call `run_classifier.convert_examples_to_features` on our InputExamples to convert them into features BERT understands.

In [None]:
pred_sentences = [
  "That movie was absolutely awful",
  "The acting was a bit lacking",
  "The film was creative and surprising",
  "Absolutely fantastic!"
]

#NEW EXERCISE - TOKENIZING

To import the DistilBERT model in Python, you can use the Hugging Face `transformers` library, which provides a convenient interface to various pre-trained transformer models, including DistilBERT. First, install the library with `pip install transformers`.

In the example below, `DistilBertModel.from_pretrained` loads the pre-trained DistilBERT model, and `DistilBertTokenizer.from_pretrained` loads the corresponding tokenizer. You can replace 'distilbert-base-uncased' with other DistilBERT variants or with models fine-tuned for specific tasks, depending on your requirements.

After loading the model and tokenizer, you can extract embeddings directly, or fine-tune the model for tasks such as text classification and sentiment analysis.

Note: 'distilbert-base-uncased' is a commonly used base version trained on lowercased English text. Other variants and fine-tuned models are available on the Hugging Face model hub.






In [None]:
from transformers import DistilBertModel, DistilBertTokenizer

In [None]:
# Load pre-trained DistilBERT model and tokenizer
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')


In [None]:
# Example: Tokenizing and encoding text
text = "Hello, how are you doing today?"
tokens = tokenizer(text, return_tensors='pt')
outputs = model(**tokens)

In [None]:
print(outputs)
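# The printout abbreviates the tensor to its first and last three rows;
# there is one row per input token (10 here, counting [CLS] and [SEP]).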

BaseModelOutput(last_hidden_state=tensor([[[-0.1287, -0.1419, -0.1345,  ..., -0.0144,  0.5856,  0.3436],
         [ 0.1180, -0.1199,  0.3374,  ...,  0.1191,  0.7489, -0.0225],
         [-0.5806,  0.1913,  0.5075,  ..., -0.1679,  0.4484,  0.1599],
         ...,
         [-0.6880, -0.7057, -0.5547,  ..., -0.1592,  0.1028, -0.4917],
         [-0.3022, -0.6085, -0.4973,  ..., -0.1851,  0.6125,  0.1633],
         [ 0.7487,  0.1659, -0.3929,  ...,  0.3324, -0.2238, -0.3643]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)


In [None]:
print(tokens)

{'input_ids': tensor([[ 101, 7592, 1010, 2129, 2024, 2017, 2725, 2651, 1029,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


In [None]:
print(text)

Hello, how are you doing today?


In [None]:
tokenizer.tokenize("This here's an example of using the BERT tokenizer")

['this',
 'here',
 "'",
 's',
 'an',
 'example',
 'of',
 'using',
 'the',
 'bert',
 'token',
 '##izer']

In [None]:
tokenizer.tokenize("That movie was absolutely awful")

['that', 'movie', 'was', 'absolutely', 'awful']

#NEW EXERCISE

Let's extend the previous example to perform sentiment analysis with DistilBERT. For this, we'll use the `pipeline` function from the transformers library, which provides a convenient way to perform various NLP tasks, including sentiment analysis:

In [None]:
from transformers import pipeline, DistilBertModel, DistilBertTokenizer

# Load pre-trained DistilBERT model and tokenizer
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Example text for sentiment analysis
text = "I really enjoyed watching the movie. It was fantastic!"

# Tokenize and encode the text
tokens = tokenizer(text, return_tensors='pt')

# Get the model's output
outputs = model(**tokens)

# Access the output embeddings or other information from 'outputs'
# For example, you can extract the embeddings for the [CLS] token
cls_embedding = outputs.last_hidden_state[:, 0, :]

# Print the embeddings (you can use these for your custom sentiment analysis task)
print("Embeddings:", cls_embedding)

# Perform sentiment analysis using the pipeline
sentiment_analysis = pipeline('sentiment-analysis', model='distilbert-base-uncased')
result = sentiment_analysis(text)

# Print the sentiment analysis result
print("Sentiment Analysis Result:", result)


Embeddings: tensor([[ 2.0706e-01, -8.8669e-02, -4.8918e-02, -1.3041e-01,  4.7975e-02,
         -3.9025e-01,  1.2626e-01,  6.5851e-01, -3.6693e-02, -8.2247e-02,
          8.6010e-02, -1.0300e-01,  5.3220e-02,  6.7070e-01,  1.2688e-01,
          1.5185e-01, -8.8836e-02,  2.7484e-01,  7.9578e-02, -1.2691e-01,
         -1.5399e-02, -4.4519e-01, -1.8450e-02,  1.7430e-01,  6.2543e-02,
         -1.3739e-01,  2.3637e-02, -7.6291e-02,  5.0763e-01, -1.4827e-02,
          6.5497e-02,  3.6562e-02, -2.3892e-01, -2.3559e-01, -2.7979e-02,
         -1.1920e-01, -2.0597e-01, -1.3350e-01, -2.5894e-01,  7.3608e-02,
         -1.7778e-01, -4.5025e-03,  2.7264e-01, -1.3120e-01, -1.3571e-01,
         -2.6281e-01, -2.4850e+00, -2.9151e-02,  2.1403e-02, -2.4612e-01,
          4.5534e-01,  5.5840e-02,  7.8191e-02,  2.5053e-01,  3.8298e-01,
          3.6595e-01, -4.9592e-01,  3.6923e-01, -2.2526e-01, -6.0456e-02,
          3.1174e-01,  1.5549e-01, -1.9369e-01, -2.1970e-01,  3.8572e-02,
         -4.1085e-02,  2.4

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Sentiment Analysis Result: [{'label': 'LABEL_0', 'score': 0.5327839255332947}]


In this extended example:

We tokenize and encode the input text using the DistilBERT tokenizer.

We pass the tokens through the DistilBERT model to obtain the model's output embeddings. In this example, we extract the embeddings for the [CLS] token, which is commonly used for sentence-level tasks.

We print the embeddings. You can use these embeddings for custom sentiment analysis or other downstream tasks.

We use the `pipeline` function to perform sentiment analysis directly and print the result: a label and a confidence score. Because 'distilbert-base-uncased' has no trained classification head (see the warning in the output), the label here is the generic `LABEL_0` and the score sits near 0.5, so the prediction is not meaningful. A checkpoint fine-tuned for sentiment, such as 'distilbert-base-uncased-finetuned-sst-2-english', returns 'POSITIVE' or 'NEGATIVE' labels instead.

Note: The `pipeline` function simplifies the process of using pre-trained models for specific tasks. In this case, we use the 'sentiment-analysis' pipeline, which is designed for sentiment analysis tasks. You can customize the example based on your specific use case and requirements.
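To see why an untrained classification head yields scores close to 0.5, recall that a two-class classifier typically turns its two output logits into probabilities with a softmax. The sketch below uses made-up logits (they are assumptions, not values read from the model) to contrast a freshly initialized head with a confident fine-tuned one.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits from a newly initialized two-class head: both close to zero.
untrained_logits = [0.05, -0.08]
# Hypothetical logits from a fine-tuned head that is confident in the second class.
trained_logits = [-3.2, 3.6]

untrained_probs = softmax(untrained_logits)
trained_probs = softmax(trained_logits)

print("Untrained head:", untrained_probs)
print("Fine-tuned head:", trained_probs)
```

When the two logits are nearly equal, the top score can only be slightly above 0.5, which matches the confidence scores seen in this notebook.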







In [None]:
#Running again in a different way

from transformers import pipeline, DistilBertModel, DistilBertTokenizer

# Load pre-trained DistilBERT model and tokenizer
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Example text for sentiment analysis
#text = "I really enjoyed watching the movie. It was fantastic!"
text = "That movie was absolutely awful"

# Tokenize and encode the text
tokens = tokenizer(text, return_tensors='pt')

# Get the model's output
outputs = model(**tokens)

# Access the output embeddings or other information from 'outputs'
# For example, you can extract the embeddings for the [CLS] token
cls_embedding = outputs.last_hidden_state[:, 0, :]

# Print the embeddings (you can use these for your custom sentiment analysis task)
print("Embeddings:", cls_embedding)

# Perform sentiment analysis using the pipeline
sentiment_analysis = pipeline('sentiment-analysis', model='distilbert-base-uncased')
result = sentiment_analysis(text)

# Extract the sentiment label and confidence score
sentiment_label = result[0]['label']

#labels = ['Negative','Positive'] #(0:negative, 1:positive)
confidence_score = result[0]['score']

# Print the sentiment analysis result
print("Sentiment Label:", sentiment_label)
print("Confidence Score:", confidence_score)


Embeddings: tensor([[ 7.0210e-02,  6.8084e-02,  7.6470e-02,  1.9690e-02,  2.8007e-02,
         -1.0710e-01,  1.2325e-01,  4.8434e-01, -1.9508e-02,  2.3242e-02,
          5.9509e-02,  3.7812e-03, -7.5885e-02,  4.8900e-01,  9.0280e-02,
          1.4583e-01, -1.8004e-01,  1.1228e-01,  9.8468e-02, -1.6458e-01,
         -2.2962e-01, -2.3786e-01, -3.1089e-02,  1.1482e-02, -1.1495e-01,
          2.2647e-02,  3.7191e-02, -9.5181e-02,  1.8810e-01,  7.0893e-02,
          1.1936e-01, -5.1023e-02, -9.5914e-02, -4.0946e-03, -1.4172e-02,
         -4.6111e-02, -5.8164e-02, -6.1541e-02, -9.2024e-02, -5.6487e-02,
         -1.1766e-01,  4.8398e-02,  1.3649e-01, -1.2492e-01, -1.7980e-02,
         -1.1209e-02, -1.8811e+00, -1.3138e-02,  4.5990e-02, -2.3842e-01,
          2.7772e-01, -3.2499e-02,  1.7736e-01,  2.7165e-01,  3.0734e-01,
          2.1610e-01, -1.3272e-01,  3.9919e-01, -6.1820e-02,  1.8868e-02,
          1.9426e-01, -2.3554e-02, -5.8178e-02, -5.3716e-02, -1.0974e-02,
          5.5273e-02,  2.6

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Sentiment Label: LABEL_0
Confidence Score: 0.5016996264457703


In [None]:
#TEST
# Example text for sentiment analysis
#text = "That movie was absolutely awful"

pred_sentences = [
  "That movie was absolutely awful",
  "The acting was a bit lacking",
  "The film was creative and surprising",
  "Absolutely fantastic!"
]
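# Note: this cell never runs the pipeline on pred_sentences, so the prints
# below only repeat the label and score left over from the most recently
# executed cell.
print("Sentiment Label:", sentiment_label)
print("Confidence Score:", confidence_score)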

Sentiment Label: LABEL_0
Confidence Score: 0.5293472409248352


In [None]:
#Running again in a different way

from transformers import pipeline, DistilBertModel, DistilBertTokenizer

# Load pre-trained DistilBERT model and tokenizer
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Example text for sentiment analysis
text = ["That movie was absolutely awful"]

# Tokenize and encode the text
tokens = tokenizer(text, return_tensors='pt')

# Get the model's output
outputs = model(**tokens)

# Access the output embeddings or other information from 'outputs'
# For example, you can extract the embeddings for the [CLS] token
cls_embedding = outputs.last_hidden_state[:, 0, :]

# Print the embeddings (you can use these for your custom sentiment analysis task)
print("Embeddings:", cls_embedding)

# Perform sentiment analysis using the pipeline
sentiment_analysis = pipeline('sentiment-analysis', model='distilbert-base-uncased')
result = sentiment_analysis(text)

# Extract the sentiment label and confidence score
sentiment_label = result[0]['label']

#labels = ['Negative','Positive'] #(0:negative, 1:positive)
confidence_score = result[0]['score']

# Print the sentiment analysis result
print("Sentiment Label:", sentiment_label)
print("Confidence Score:", confidence_score)


Embeddings: tensor([[ 7.0210e-02,  6.8084e-02,  7.6470e-02,  1.9690e-02,  2.8007e-02,
         -1.0710e-01,  1.2325e-01,  4.8434e-01, -1.9508e-02,  2.3242e-02,
          5.9509e-02,  3.7812e-03, -7.5885e-02,  4.8900e-01,  9.0280e-02,
          1.4583e-01, -1.8004e-01,  1.1228e-01,  9.8468e-02, -1.6458e-01,
         -2.2962e-01, -2.3786e-01, -3.1089e-02,  1.1482e-02, -1.1495e-01,
          2.2647e-02,  3.7191e-02, -9.5181e-02,  1.8810e-01,  7.0893e-02,
          1.1936e-01, -5.1023e-02, -9.5914e-02, -4.0946e-03, -1.4172e-02,
         -4.6111e-02, -5.8164e-02, -6.1541e-02, -9.2024e-02, -5.6487e-02,
         -1.1766e-01,  4.8398e-02,  1.3649e-01, -1.2492e-01, -1.7980e-02,
         -1.1209e-02, -1.8811e+00, -1.3138e-02,  4.5990e-02, -2.3842e-01,
          2.7772e-01, -3.2499e-02,  1.7736e-01,  2.7165e-01,  3.0734e-01,
          2.1610e-01, -1.3272e-01,  3.9919e-01, -6.1820e-02,  1.8868e-02,
          1.9426e-01, -2.3554e-02, -5.8178e-02, -5.3716e-02, -1.0974e-02,
          5.5273e-02,  2.6

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Sentiment Label: LABEL_0
Confidence Score: 0.5293472409248352


In [None]:
from transformers import pipeline, DistilBertModel, DistilBertTokenizer

# Load pre-trained DistilBERT model and tokenizer
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Example texts for sentiment analysis
texts = ["That movie was absolutely awful", "I loved the performance in the play", "The weather today is fantastic"]

# Tokenize and encode the texts
tokens = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)

# Get the model's output
outputs = model(**tokens)

# Access the output embeddings or other information from 'outputs'
# For example, you can extract the embeddings for the [CLS] token
cls_embeddings = outputs.last_hidden_state[:, 0, :]

# Perform sentiment analysis using the pipeline
sentiment_analysis = pipeline('sentiment-analysis', model='distilbert-base-uncased')
results = sentiment_analysis(texts)

# Extract sentiment labels and confidence scores for each text
for i, result in enumerate(results):
    sentiment_label = result['label']
    confidence_score = result['score']

    print(f"\nSentiment Analysis for Text {i + 1}:")
    print("Text:", texts[i])
    print("Sentiment Label:", sentiment_label)
    print("Confidence Score:", confidence_score)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Sentiment Analysis for Text 1:
Text: That movie was absolutely awful
Sentiment Label: LABEL_1
Confidence Score: 0.527485191822052

Sentiment Analysis for Text 2:
Text: I loved the performance in the play
Sentiment Label: LABEL_1
Confidence Score: 0.5284966230392456

Sentiment Analysis for Text 3:
Text: The weather today is fantastic
Sentiment Label: LABEL_1
Confidence Score: 0.5225977301597595


In [None]:
!pip install pandas




In [None]:
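# keras must be imported before use; without this import the cell raises
# NameError: name 'keras' is not defined
from tensorflow import keras

imdb = keras.datasets.imdb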

In [None]:
import pandas as pd
from transformers import pipeline, DistilBertModel, DistilBertTokenizer

# Load pre-trained DistilBERT model and tokenizer
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Load the dataset from a CSV file
dataset_path = 'sentiment_dataset.csv'
df = pd.read_csv(dataset_path)
#pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv'

# Extract the 'text' column from the dataset
texts = df['text'].tolist()

# Tokenize and encode the texts
tokens = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)

# Get the model's output
outputs = model(**tokens)

# Access the output embeddings or other information from 'outputs'
# For example, you can extract the embeddings for the [CLS] token
cls_embeddings = outputs.last_hidden_state[:, 0, :]

# Perform sentiment analysis using the pipeline
sentiment_analysis = pipeline('sentiment-analysis', model='distilbert-base-uncased')
results = sentiment_analysis(texts)

# Extract sentiment labels and confidence scores for each text
for i, result in enumerate(results):
    sentiment_label = result['label']
    confidence_score = result['score']

    print(f"\nSentiment Analysis for Text {i + 1}:")
    print("Text:", texts[i])
    print("Sentiment Label:", sentiment_label)
    print("Confidence Score:", confidence_score)


#NEW EXERCISE

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)

In [None]:
batch_1 = df[:1000]

In [None]:
batch_1[1].value_counts()

1    521
0    479
Name: 1, dtype: int64

In [None]:
# For DistilBERT:
from transformers import pipeline, DistilBertModel, DistilBertTokenizer



In [None]:
tokenized = batch_1[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

In [None]:
print(tokenized)

0      [101, 1037, 18385, 1010, 6057, 1998, 2633, 182...
1      [101, 4593, 2128, 27241, 23931, 2013, 1996, 62...
2      [101, 2027, 3653, 23545, 2037, 4378, 24185, 10...
3      [101, 2023, 2003, 1037, 17453, 14726, 19379, 1...
4      [101, 5655, 6262, 1005, 1055, 12075, 2571, 376...
                             ...                        
995    [101, 2612, 1997, 5599, 1996, 11680, 2272, 200...
996    [101, 2023, 10722, 19068, 2080, 2323, 2031, 20...
997    [101, 2412, 2156, 2028, 1997, 2216, 22092, 200...
998    [101, 1996, 2143, 2003, 8052, 2005, 1996, 1592...
999    [101, 2005, 2087, 1997, 2049, 8333, 1010, 1996...
Name: 0, Length: 1000, dtype: object


In [None]:
# Find the length of the longest tokenized sentence
max_len = max(len(i) for i in tokenized.values)

# Right-pad every sequence with 0 (the [PAD] id) up to max_len
padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

In [None]:
print(padded)

[[  101  1037 18385 ...     0     0     0]
 [  101  4593  2128 ...     0     0     0]
 [  101  2027  3653 ...     0     0     0]
 ...
 [  101  2412  2156 ...     0     0     0]
 [  101  1996  2143 ...     0     0     0]
 [  101  2005  2087 ...     0     0     0]]


In [None]:
np.array(padded).shape

(1000, 59)
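Once the sequences are padded, the model also needs to know which positions hold real tokens and which are padding. A common way to express this is an attention mask with 1 for real tokens and 0 for padding. The sketch below uses a tiny made-up batch rather than `batch_1`, and assumes the padding id is 0, as in the cell above.

```python
import numpy as np

# Small hypothetical batch of token-id lists with different lengths
toy_tokenized = [
    [101, 1037, 18385, 102],
    [101, 4593, 102],
    [101, 102],
]

# Pad to the longest sequence with 0, mirroring the padding step above
toy_max_len = max(len(ids) for ids in toy_tokenized)
toy_padded = np.array([ids + [0] * (toy_max_len - len(ids)) for ids in toy_tokenized])

# 1 where there is a real token, 0 where there is padding
toy_attention_mask = np.where(toy_padded != 0, 1, 0)

print(toy_padded)
print(toy_attention_mask)
```

The same `np.where` call applied to `padded` gives a mask of shape (1000, 59) that can be passed to the model as `attention_mask` alongside the input ids.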

#Question Answering

In [None]:
from transformers import pipeline, DistilBertTokenizer, DistilBertForQuestionAnswering

# Load pre-trained DistilBERT model and tokenizer
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased-distilled-squad')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased-distilled-squad')

# Example context and question
context = "DistilBERT is a lightweight version of BERT designed for speed and efficiency."
question = "What is DistilBERT designed for?"

# Tokenize and encode the context and question
inputs = tokenizer(context, question, return_tensors='pt')

# Get the model's output
outputs = model(**inputs)

# Extract start and end logits from the output
start_logits = outputs.start_logits
end_logits = outputs.end_logits

# Pick the highest-scoring start and end positions, then look up the tokens in that span
start_index = start_logits.argmax(dim=-1).item()
end_index = end_logits.argmax(dim=-1).item()
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][start_index:end_index + 1])

# Convert tokens back to text
answer = tokenizer.convert_tokens_to_string(tokens)

# Print the question-answering result
print("Question:", question)
print("Answer:", answer)


Question: What is DistilBERT designed for?
Answer: speed and efficiency


In this example:

We load a pre-trained DistilBERT model and tokenizer specifically fine-tuned for question-answering tasks (distilbert-base-uncased-distilled-squad).

We provide a context (a piece of text) and a question related to that context.

We tokenize and encode the context and question using the DistilBERT tokenizer.

We pass the encoded inputs through the DistilBERT model to obtain start and end logits. These are unnormalized scores, one per input token, for where the answer starts and where it ends; applying a softmax would turn them into probabilities.

We take the highest-scoring start and end positions, select the tokens in that span, and convert them back to text to form the final answer.

The question and the obtained answer are printed.

You can customize this example by providing your own context and question to perform question-answering on topics of interest to you.

#Named Entity Extraction

In [None]:
from transformers import pipeline, DistilBertTokenizer, DistilBertForTokenClassification

# Load pre-trained DistilBERT model and tokenizer for token classification
model = DistilBertForTokenClassification.from_pretrained('distilbert-base-uncased')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Example text for named entity recognition
text = "Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne."

# Tokenize and encode the text
tokens = tokenizer(text, return_tensors='pt')

# Get the model's output
outputs = model(**tokens)

# Extract the predicted label id for each token
predicted_label_ids = outputs.logits.argmax(dim=2).squeeze().tolist()

# Recover the token strings for the input ids
predicted_tokens = tokenizer.convert_ids_to_tokens(tokens['input_ids'][0].tolist())

# Map label ids to label names via the model config, not the tokenizer vocabulary.
# 'distilbert-base-uncased' has an untrained classification head, so the names are
# generic (LABEL_0, LABEL_1) and the predictions are not meaningful entity tags.
predicted_entities = [model.config.id2label[label_id] for label_id in predicted_label_ids]

# Pair each token with its predicted label, skipping special tokens such as [CLS] and [SEP]
named_entities = [(token, entity) for token, entity in zip(predicted_tokens, predicted_entities)
                  if token not in tokenizer.all_special_tokens]


# Print the token-level predictions
print("Named Entities:")
for token, entity in named_entities:
    print(f"{token}: {entity}")


Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

#NER

In [None]:
#pip install transformers


For reference, a model fine-tuned for NER would be expected to tag the example sentence along these lines (illustrative BIO tags, not output from a cell in this notebook):

Apple    B-ORG
Inc.     I-ORG
was      O
founded  O
by       O
Steve    B-PER
Jobs,    I-PER
Steve    B-PER
Wozniak, I-PER
and      O
Ronald   B-PER
Wayne.   I-PER
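The B-/I-/O tags listed above follow the BIO scheme: B- marks the first word of an entity, I- marks a continuation, and O marks words outside any entity. A small sketch of how such word-level tags can be merged into whole entities is shown below. The tagged list is copied from the listing above, and `group_entities` is a hypothetical helper written for this illustration.

```python
tagged_words = [
    ("Apple", "B-ORG"), ("Inc.", "I-ORG"), ("was", "O"), ("founded", "O"),
    ("by", "O"), ("Steve", "B-PER"), ("Jobs,", "I-PER"), ("Steve", "B-PER"),
    ("Wozniak,", "I-PER"), ("and", "O"), ("Ronald", "B-PER"), ("Wayne.", "I-PER"),
]


def group_entities(tagged):
    """Merge consecutive B-/I- tagged words into (text, entity_type) pairs."""
    entities = []
    current_words, current_type = [], None
    for word, tag in tagged:
        if tag.startswith("B-"):
            # A new entity begins: close the previous one, if any
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity currently being built
            current_words.append(word)
        else:
            # "O" or an inconsistent I- tag closes the current entity
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


grouped = group_entities(tagged_words)
for text, entity_type in grouped:
    print(f"{entity_type}: {text}")
```

Punctuation stays attached to the words here because the listing tags whitespace-separated words; a real tokenizer would split it off.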

#TEXT GENERATION



DistilBERT, as a distilled version of BERT, is primarily designed for tasks like question answering, text classification, and named entity recognition. It is not well suited to text generation, because its encoder-only architecture is optimized for understanding context and relationships in input text rather than generating new content.

For text generation, models like GPT (Generative Pre-trained Transformer) are more commonly used. Since DistilBERT doesn't have a generative architecture, the example below uses GPT-2 instead:

In [None]:
!pip install transformers

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained GPT-2 model and tokenizer for text generation
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Prompt for text generation
prompt = "Once upon a time in a land far, far away"

# Tokenize and encode the prompt
input_ids = tokenizer.encode(prompt, return_tensors='pt')

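# Generate text using the model.
# Note: do_sample is not set, so generate() decodes greedily and top_k, top_p,
# and temperature have no effect; pass do_sample=True to enable sampling.
output = model.generate(input_ids, max_length=100, num_return_sequences=1, no_repeat_ngram_size=2, top_k=50, top_p=0.95, temperature=0.7)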

# Decode and print the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated Text:")
print(generated_text)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Text:
Once upon a time in a land far, far away, the world was a place of great beauty and great danger. The world of the gods was the land of darkness and darkness. And the darkness of this world, which was far from the light of day, was not the place where the sun and the moon met. It was in the midst of all the worlds, and it was there that the stars and all that were in them met, that they were all in one place.




This example uses GPT-2 for text generation. You can change the `prompt` variable to set the initial context for the generated text, and experiment with the parameters of `model.generate` to adjust the length and other aspects of the output. The diversity settings (`top_k`, `top_p`, `temperature`) only take effect when `do_sample=True` is also passed; without it, generation is greedy and the same prompt always produces the same text.

GPT-2 and similar models are the better fit when you need text generation, whereas DistilBERT is better suited to tasks like classification, question answering, and named entity recognition.
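To make the sampling parameters more concrete, here is a minimal sketch of temperature scaling combined with top-k filtering over a toy next-token distribution. The vocabulary and logits are made up, and `sample_top_k` is a hypothetical helper written for this illustration; it is not the implementation inside `model.generate`.

```python
import math
import random


def sample_top_k(logits, k=2, temperature=0.7, rng=None):
    """Sample one index from the k highest logits after temperature scaling."""
    rng = rng or random.Random()
    # Keep only the indices of the k largest logits
    top_indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Lower temperature sharpens the distribution, higher temperature flattens it
    scaled = [logits[i] / temperature for i in top_indices]
    highest = max(scaled)
    weights = [math.exp(s - highest) for s in scaled]
    return rng.choices(top_indices, weights=weights, k=1)[0]


# Hypothetical next-token candidates and their logits
toy_vocab = ["castle", "dragon", "forest", "spreadsheet"]
toy_logits = [2.0, 1.5, 0.5, -3.0]

# Greedy decoding always picks the single highest-scoring token
greedy_index = max(range(len(toy_logits)), key=lambda i: toy_logits[i])

# Sampling picks among the top-k tokens, so repeated draws can differ
rng = random.Random(0)
sampled_indices = [sample_top_k(toy_logits, k=2, temperature=0.7, rng=rng) for _ in range(200)]

print("Greedy choice:", toy_vocab[greedy_index])
print("Sampled choices:", sorted({toy_vocab[i] for i in sampled_indices}))
```

With `k=2`, only the two most likely tokens can ever be drawn, which is how top-k keeps unlikely continuations out of the generated text.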

#MULTIPLE PROMPTS

In [None]:
# NEW EXERCISE
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained GPT-2 model and tokenizer for text generation
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# List of prompts for text generation
prompts = ["Once upon a time in a land far, far away",
           "In a galaxy not so distant, there was a hero",
           "The mysterious door creaked open revealing"]

# Generate text for each prompt
for prompt in prompts:
    # Tokenize and encode the prompt
    input_ids = tokenizer.encode(prompt, return_tensors='pt')

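    # Generate text using the model.
    # Note: do_sample is not set, so generate() decodes greedily and top_k, top_p,
    # and temperature have no effect; pass do_sample=True to enable sampling.
    output = model.generate(input_ids, max_length=100, num_return_sequences=1, no_repeat_ngram_size=2, top_k=50, top_p=0.95, temperature=0.7)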

    # Decode and print the generated text
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

    print(f"Prompt: {prompt}")
    print("Generated Text:")
    print(generated_text)
    print("\n" + "="*50 + "\n")  # Separator between generated texts


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: Once upon a time in a land far, far away
Generated Text:
Once upon a time in a land far, far away, the world was a place of great beauty and great danger. The world of the gods was the land of darkness and darkness. And the darkness of this world, which was far from the light of day, was not the place where the sun and the moon met. It was in the midst of all the worlds, and it was there that the stars and all that were in them met, that they were all in one place.






The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: In a galaxy not so distant, there was a hero
Generated Text:
In a galaxy not so distant, there was a hero who could save the galaxy.

The hero was the hero of the Galactic Empire. He was known as the Emperor. The Emperor was an Emperor who had been born in the past, and had become a Jedi. His name was Darth Vader. Vader was born on the planet of Coruscant, where he was raised by his father, Darth Sidious. After his birth, Vader became a Sith Lord, a master of Sith sorcery.


Prompt: The mysterious door creaked open revealing
Generated Text:
The mysterious door creaked open revealing a man in a black suit and a white shirt. He was wearing a red hooded sweatshirt and black pants.

"I'm a doctor," he said. "I've been in the hospital for a year. I've never seen anything like this."
...
, a former nurse who worked at the University of California, Berkeley, and was a member of the medical staff at UC Berkeley's medical center. She was killed in an




#Fine Tuning LLM on Domain Specific Data

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

# Load pre-trained GPT-2 model and tokenizer
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Load your domain-specific data
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path='/content/advertising.csv',
   # file_path='path/to/your/train.txt',
    block_size=128
)



# Define data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

# Define training arguments
training_args = TrainingArguments(
    output_dir='output_directory',
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    save_steps=10_000,
    save_total_limit=2,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

# Fine-tune the model
trainer.train()




ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`: Please run `pip install transformers[torch]` or `pip install accelerate -U`


In [None]:
!pip install "transformers[torch]"

Collecting accelerate>=0.20.3 (from transformers[torch])
  Downloading accelerate-0.26.0-py3-none-any.whl (270 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.26.0


In [None]:
!pip install accelerate -U
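
The `ImportError` above is about the installed `accelerate` version. A minimal sketch, using only the standard library, for checking whether a package meets a minimum version. The helper names `parse_version` and `meets_minimum` are hypothetical and only for illustration. In Colab, a freshly upgraded package is often only picked up after restarting the runtime, so the error may persist until then.

```python
from importlib import metadata


def parse_version(text):
    """Turn '0.26.0' into (0, 26, 0); non-numeric suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(package, minimum):
    """Return True or False, or None when the package is not installed."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return parse_version(installed) >= parse_version(minimum)


print("accelerate >= 0.20.1:", meets_minimum("accelerate", "0.20.1"))
```

A result of `None` means the package is missing from the current environment; `False` means an older version is installed.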



In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

# Load pre-trained GPT-2 model and tokenizer
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Load your domain-specific data
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path='/content/advertising.csv',
    # file_path='path/to/your/train.txt',
    block_size=128
)



# Define data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

# Define training arguments
training_args = TrainingArguments(
    output_dir='output_directory',
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    save_steps=10_000,
    save_total_limit=2,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

# Fine-tune the model
trainer.train()


ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`: Please run `pip install transformers[torch]` or `pip install accelerate -U`

In [None]:
!pip install tensorflow transformers




In [None]:
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer, GPT2Config
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import TFTrainer, TFTrainingArguments

# Load pre-trained GPT-2 model and tokenizer
model = TFGPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Load your domain-specific data
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path='/content/advertising.csv',
    block_size=128
)

# Define data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

# Define training arguments
training_args = TFTrainingArguments(
    output_dir='output_directory',
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    save_steps=10_000,
    save_total_limit=2,
)

# Initialize Trainer
trainer = TFTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

# Fine-tune the model
trainer.train()


All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


TypeError: TFTrainer.__init__() got an unexpected keyword argument 'data_collator'

In [None]:
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer, GPT2Config
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import TFTrainer, TFTrainingArguments

# Load pre-trained GPT-2 model and tokenizer
model = TFGPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Load your domain-specific data
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path='/content/advertising.csv',
    block_size=128
)

# Define data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

# Define training arguments
training_args = TFTrainingArguments(
    output_dir='output_directory',
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,  # Return only the loss during evaluation and prediction (unrelated to mlm=False)
)

# Initialize Trainer
trainer = TFTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)

# Fine-tune the model
trainer.train()


All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


TypeError: TFTrainer.__init__() got an unexpected keyword argument 'data_collator'
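
The `TypeError` above says the constructor does not take a `data_collator` argument. One way to catch this before calling is to inspect the signature. This is a sketch only: `DemoTrainer` is a hypothetical stand-in class (not part of transformers) and `unsupported_kwargs` is an illustrative helper.

```python
import inspect


class DemoTrainer:
    """Hypothetical stand-in for a trainer class; not part of transformers."""

    def __init__(self, model, args, train_dataset=None):
        self.model = model
        self.args = args
        self.train_dataset = train_dataset


def unsupported_kwargs(callable_obj, kwargs):
    """Return the keyword names that callable_obj would reject."""
    params = inspect.signature(callable_obj).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []  # accepts **kwargs, so no name is rejected up front
    return [name for name in kwargs if name not in params]


kwargs = {"model": "m", "args": "a", "train_dataset": [], "data_collator": object()}
print(unsupported_kwargs(DemoTrainer, kwargs))  # ['data_collator']
```

The same helper can be pointed at any class or function to list the keyword arguments it would not accept.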

In [None]:
import accelerate

In [None]:
!pip install "accelerate>=0.20.1"

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

# Load pre-trained GPT-2 model and tokenizer
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Load your domain-specific data
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path='/content/advertising.csv',
    block_size=128
)

# Define data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

# Define training arguments
training_args = TrainingArguments(
    output_dir='output_directory',
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    save_steps=10_000,
    save_total_limit=2,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

# Fine-tune the model
trainer.train()




ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`: Please run `pip install transformers[torch]` or `pip install accelerate -U`
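
For intuition about the data pipeline in the cells above: `TextDataset` with `block_size=128` tokenizes the file and cuts the token ids into fixed-length blocks, and with `mlm=False` the collator uses a copy of the inputs as labels. The following is a simplified sketch of that idea in plain Python, not the library code. `split_into_blocks` and `causal_lm_batch` are hypothetical names, and the integer list stands in for real token ids.

```python
def split_into_blocks(token_ids, block_size):
    """Cut a flat list of token ids into consecutive blocks; the short tail is dropped."""
    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids) - block_size + 1, block_size)
    ]


def causal_lm_batch(blocks):
    """For causal language modeling, the labels are a copy of the inputs."""
    return {
        "input_ids": [list(block) for block in blocks],
        "labels": [list(block) for block in blocks],
    }


token_ids = list(range(10))  # stand-in for real token ids
blocks = split_into_blocks(token_ids, block_size=4)
batch = causal_lm_batch(blocks)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

With ten ids and a block size of four, the last two ids do not fill a block and are left out.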

#Fine-Tune using Keras

In [None]:
import tensorflow as tf
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

# Load pre-trained GPT-2 model and tokenizer
model = TFGPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Load your domain-specific data
train_data_path = '/content/advertising.csv'
with open(train_data_path, 'r', encoding='utf-8') as file:
    train_texts = file.readlines()

# Tokenize and encode the texts
input_ids = tokenizer(train_texts, return_tensors='tf', padding=True, truncation=True)

# Create TensorFlow Dataset
train_dataset = tf.data.Dataset.from_tensor_slices((dict(input_ids), dict(input_ids)))

# Define training parameters
batch_size = 8
num_epochs = 3
learning_rate = 5e-5

# Compile the model
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# Train the model
model.fit(train_dataset.shuffle(1000).batch(batch_size),
          epochs=num_epochs)


All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
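
GPT-2 ships without a padding token, which is why the tokenizer refuses to pad. Reusing the end-of-text token as padding, as the error message suggests, amounts to the following. This is a plain-Python sketch of right-padding with an attention mask, with `pad_batch` as a hypothetical helper name; 50256 is the id of GPT-2's `<|endoftext|>` token.

```python
EOS_ID = 50256  # id of GPT-2's <|endoftext|> token, reused here as the pad id


def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the longest length and build an attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padding
    return input_ids, attention_mask


ids, mask = pad_batch([[10, 11, 12], [20]], pad_id=EOS_ID)
print(ids)   # [[10, 11, 12], [20, 50256, 50256]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

The attention mask is what lets the model (and a carefully built loss) tell real tokens from padding, since both can now share the same id.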

In [None]:
import tensorflow as tf
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

# Load pre-trained GPT-2 model and tokenizer
model = TFGPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Add the EOS token as the pad token
tokenizer.pad_token = tokenizer.eos_token

# Load your domain-specific data
train_data_path = '/content/advertising.csv'
with open(train_data_path, 'r', encoding='utf-8') as file:
    train_texts = file.readlines()

# Tokenize and encode the texts
input_ids = tokenizer(train_texts, return_tensors='tf', padding=True, truncation=True)

# Create TensorFlow Dataset
train_dataset = tf.data.Dataset.from_tensor_slices((dict(input_ids), dict(input_ids)))

# Define training parameters
batch_size = 8
num_epochs = 3
learning_rate = 5e-5

# Compile the model
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# Train the model
model.fit(train_dataset.shuffle(1000).batch(batch_size),
          epochs=num_epochs)


All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


Epoch 1/3


TypeError: in user code:

    File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 1401, in train_function  *
        return step_function(self, iterator)
    File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 1384, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 1373, in run_step  **
        outputs = model.train_step(data)
    File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_tf_utils.py", line 1677, in train_step
        self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
    File "/usr/local/lib/python3.10/dist-packages/keras/src/optimizers/optimizer.py", line 543, in minimize
        grads_and_vars = self.compute_gradients(loss, var_list, tape)
    File "/usr/local/lib/python3.10/dist-packages/keras/src/optimizers/optimizer.py", line 276, in compute_gradients
        grads = tape.gradient(loss, var_list)

    TypeError: Argument `target` should be a list or nested structure of Tensors, Variables or CompositeTensors to be differentiated, but received None.


In [None]:
import tensorflow as tf
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

# Load pre-trained GPT-2 model and tokenizer
model = TFGPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Add the EOS token as the pad token
tokenizer.pad_token = tokenizer.eos_token

# Load your domain-specific data
train_data_path = '/content/advertising.csv'
with open(train_data_path, 'r', encoding='utf-8') as file:
    train_texts = file.readlines()

# Tokenize and encode the texts (returns input_ids and attention_mask)
encodings = tokenizer(train_texts, return_tensors='tf', padding=True, truncation=True)

# Use the input ids as labels; the model shifts them by one position internally.
# Padding positions are set to -100 so the built-in loss ignores them.
labels = tf.where(encodings['attention_mask'] == 1, encodings['input_ids'], -100)

# Create TensorFlow Dataset with the labels inside the feature dict
features = dict(encodings)
features['labels'] = labels
train_dataset = tf.data.Dataset.from_tensor_slices(features)

# Define training parameters
batch_size = 8
num_epochs = 3
learning_rate = 5e-5

# Compile without a loss so the model's built-in language-modeling loss is used
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate))

# Train the model
model.fit(train_dataset.shuffle(1000).batch(batch_size),
          epochs=num_epochs)


All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.src.callbacks.History at 0x7b3bc1f3ee60>
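
Next-token training pairs each position with the token that follows it. When such targets are built by hand, a circular roll (which is what `tf.roll(ids, shift=-1, axis=-1)` performs) wraps the first token around to the last position, and padded positions still count as targets. The sketch below contrasts a roll with a plain shift in pure Python. `roll_left` and `shift_left` are hypothetical helpers, and -100 is the label value that Hugging Face losses conventionally skip.

```python
IGNORE = -100  # label value conventionally skipped by Hugging Face losses


def roll_left(ids):
    """Circular shift: the first id wraps around to the end."""
    return ids[1:] + ids[:1]


def shift_left(ids, mask):
    """Non-circular shift: the last position and any padded target are ignored."""
    shifted = ids[1:] + [IGNORE]
    shifted_mask = mask[1:] + [0]
    return [tok if keep == 1 else IGNORE for tok, keep in zip(shifted, shifted_mask)]


ids = [5, 6, 7, 50256]   # three real tokens followed by one pad (EOS id)
mask = [1, 1, 1, 0]

print(roll_left(ids))         # [6, 7, 50256, 5]
print(shift_left(ids, mask))  # [6, 7, -100, -100]
```

In the rolled version the model is asked to predict the pad token after `7` and the first token after the pad; the shifted version marks both positions as ignored.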