<a href="https://colab.research.google.com/github/purvesh1/ABSA-BERT-pair/blob/master/mcd_absa_tf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import warnings
warnings.filterwarnings("ignore")
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install transformers
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification




---


# Aspect-Based Sentiment Analysis using BERT


---

In this notebook, we explore Aspect-Based Sentiment Analysis (ABSA) using the BERT model. Our primary inspiration is the paper [Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence](https://arxiv.org/abs/1903.09588), which recasts ABSA as a sentence-pair classification task. ABSA is a fine-grained form of sentiment analysis that identifies the sentiment expressed towards specific aspects of a text. For instance, in the review "The burger was great, but the fries were too salty", the sentiment towards the "burger" aspect is positive, while the sentiment towards the "fries" aspect is negative.

In [3]:
#@title Choose a dataset and a task { run: "auto", display-mode: "form" }
base_dir = "/content/drive/MyDrive/ABSA/data" #@param {type:"string"}
dataset_type = "semeval2014" #@param ["sentihood", "semeval2014"]
task = "NLI_B" #@param ["QA_M", "NLI_M", "QA_B", "NLI_B"]

In [4]:
!pip install gcsfs



In [5]:
from google.colab import auth
auth.authenticate_user()

In [6]:
import pandas as pd
import random
import torch
import sys
'''if base_dir not in sys.path:
    sys.path.insert(0, f'{base_dir}/')'''
import numpy as np
from scipy.special import softmax
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


In [7]:

def get_dataset(path):
    original_sentences = []
    auxiliary_sentences = []
    labels = []
    data = pd.read_csv(path, header=0, sep="\t").values.tolist()
    for row in data:
        original_sentences.append(row[1])
        auxiliary_sentences.append(row[2])
        labels.append(row[3])
    return original_sentences, auxiliary_sentences, labels


BUCKET_ROOT = 'gs://absa-395317-mcd_bucket'
train_original_sentences, train_auxiliary_sentences, train_labels = get_dataset(f"{BUCKET_ROOT}/data/semeval14/train_{task}.csv")
test_original_sentences, test_auxiliary_sentences, test_labels = get_dataset(f"{BUCKET_ROOT}/data/semeval14/test_{task}.csv")



In this step, we load each split into three parallel lists: the original sentences, the auxiliary sentences (the aspect-sentiment "combinations"), and their labels.
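To make the pairing concrete, here is a minimal sketch of how rows in the `NLI_B` format could be built for one review. The helper `build_nli_b_rows` and the `gold` dictionary are hypothetical names used only for illustration, and the label convention (1 when the aspect-sentiment pair holds, 0 otherwise) is inferred from the sample rows this notebook prints, not taken from the dataset's documentation.

```python
ASPECTS = ["price", "anecdotes", "food", "ambience", "service"]
SENTIMENTS = ["positive", "neutral", "negative", "conflict", "none"]

def build_nli_b_rows(sentence, gold):
    """Return (sentence, auxiliary_sentence, label) triples for one review.

    `gold` maps an aspect to its annotated sentiment; aspects missing from
    `gold` are treated as "none" (an assumption of this sketch).
    """
    rows = []
    for aspect in ASPECTS:
        true_sentiment = gold.get(aspect, "none")
        for sentiment in SENTIMENTS:
            auxiliary = f"{aspect} - {sentiment}"
            label = 1 if sentiment == true_sentiment else 0
            rows.append((sentence, auxiliary, label))
    return rows

rows = build_nli_b_rows(
    "The food is good, especially their more basic dishes, and the drinks are delicious.",
    {"food": "positive"},
)
print(len(rows))    # 25 pairs: 5 aspects x 5 sentiments
print(rows[0][1:])  # ('price - positive', 0)
```

Each aspect contributes exactly one row labeled 1, which is what lets the binary classifier's outputs be regrouped into a per-aspect sentiment later on.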

In [8]:
print(f"Original sentence: {train_original_sentences[0]}. Auxiliary sentence: {train_auxiliary_sentences[0]}. Label: {train_labels[0]}.")
print(f"Original sentence: {test_original_sentences[0]}. Auxiliary sentence: {test_auxiliary_sentences[0]}. Label: {test_labels[0]}.")

Original sentence: The food is good, especially their more basic dishes, and the drinks are delicious.. Auxiliary sentence: price - positive. Label: 0.
Original sentence: Great food, great waitstaff, great atmosphere, and best of all GREAT beer!. Auxiliary sentence: price - positive. Label: 0.


### Tokenization using BERT Tokenizer

In this section, we use the `BertTokenizer` from the Hugging Face Transformers library. Loading it with `'bert-base-uncased'` gives it the WordPiece vocabulary of words and subwords that this BERT variant was pre-trained with. Because the variant is uncased, text is lowercased before tokenization (i.e., "apple" and "Apple" are treated the same).

Here's a breakdown of the steps:

1. **Importing the Tokenizer**:
    ```python
    from transformers import BertTokenizer
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    ```
    We first import the `BertTokenizer` class and then instantiate a tokenizer object using the `'bert-base-uncased'` pre-trained version.

2. **Tokenizing the Datasets**:
    We tokenize two sets of data: training and testing. Each dataset comprises original sentences and their corresponding auxiliary sentences.

    - **Training Data**:
        ```python
        train_encodings = tokenizer(train_original_sentences, train_auxiliary_sentences, truncation=True, padding=True)
        ```
        Here, `train_original_sentences` and `train_auxiliary_sentences` are tokenized together. The `truncation=True` argument ensures that inputs longer than what the model can handle are truncated, and `padding=True` ensures that all outputs are padded to the same length.

    - **Testing Data**:
        ```python
        test_encodings = tokenizer(test_original_sentences, test_auxiliary_sentences, truncation=True, padding=True)
        ```
        The test sentences are tokenized in the same way.

This process results in tokenized versions of our datasets, which are ready to be fed into the BERT model for training or evaluation.
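As a rough illustration of what the tokenizer produces for a sentence pair, the sketch below lays out the special tokens, segment ids, and padding by hand. It is a simplified stand-in, not the library's implementation: `encode_pair` is a hypothetical helper, and it splits on whitespace instead of applying WordPiece.

```python
def encode_pair(tokens_a, tokens_b, max_len):
    """Lay out a sentence pair the way BERT expects, then pad to max_len.

    Tokens are plain strings here; a real tokenizer maps them to vocabulary ids.
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)
    pad = max_len - len(tokens)
    if pad < 0:
        raise ValueError("pair is longer than max_len; truncate it first")
    return {
        "tokens": tokens + ["[PAD]"] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,
    }

encoded = encode_pair("great food".split(), "food - positive".split(), max_len=10)
print(encoded["tokens"])
print(encoded["token_type_ids"])
print(encoded["attention_mask"])
```

The attention mask is what tells the model to ignore the padded positions, which is why padding to a common length does not change the prediction for a shorter pair.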


In [9]:
train_encodings = tokenizer(train_original_sentences, train_auxiliary_sentences, truncation=True, padding=True, return_tensors="tf")
test_encodings = tokenizer(test_original_sentences, test_auxiliary_sentences, truncation=True, padding=True, return_tensors="tf")

In [10]:
'''class evaluation:
    @staticmethod
    def compute_semeval_PRF(test_labels, predicted_labels):
      num_total_intersection = 0
      num_total_test_aspects = 0
      num_total_predicted_aspects = 0
      num_examples = len(test_labels) // 5
      for i in range(num_examples):
          test_aspects = set()
          predicted_aspects = set()
          for j in range(5):
              if test_labels[i * 5 + j] != 4:
                  test_aspects.add(j)
              if predicted_labels[i * 5 + j] != 4:
                  predicted_aspects.add(j)
          if len(test_aspects) == 0:
              continue
          intersection = test_aspects.intersection(predicted_aspects)
          num_total_test_aspects += len(test_aspects)
          num_total_predicted_aspects += len(predicted_aspects)
          num_total_intersection += len(intersection)
      mi_P = num_total_intersection / num_total_predicted_aspects
      mi_R = num_total_intersection / num_total_test_aspects
      mi_F = (2 * mi_P * mi_R) / (mi_P + mi_R)
      return mi_P, mi_R, mi_F

    @staticmethod
    def compute_semeval_accuracy(test_labels, predicted_labels, scores, num_classes=4):
      count_considered_examples = 0
      count_correct_examples = 0
      if num_classes == 4:
          for i in range(len(test_labels)):
              if test_labels[i] == 4:
                  continue
              new_predicted_label = predicted_labels[i]
              if new_predicted_label == 4:
                  new_scores = scores[i].copy()
                  new_scores[4] = 0
                  new_predicted_label = np.argmax(new_scores)
              if test_labels[i] == new_predicted_label:
                  count_correct_examples += 1
              count_considered_examples += 1
          semeval_accuracy = count_correct_examples / count_considered_examples

      elif num_classes == 3:
          for i in range(len(test_labels)):
              if test_labels[i] >= 3:
                  continue
              new_predicted_label = predicted_labels[i]
              if new_predicted_label >= 3:
                  new_scores = scores[i].copy()
                  new_scores[3] = 0
                  new_scores[4] = 0
                  new_predicted_label = np.argmax(new_scores)
              if test_labels[i] == new_predicted_label:
                  count_correct_examples += 1
              count_considered_examples += 1
          semeval_accuracy = count_correct_examples / count_considered_examples
      elif num_classes == 2:
          for i in range(len(test_labels)):
              if test_labels[i] == 1 or test_labels[i] >= 3:
                  continue
              new_predicted_label = predicted_labels[i]
              if new_predicted_label == 1 or new_predicted_label >= 3:
                  new_scores = scores[i].copy()
                  new_scores[1] = 0
                  new_scores[3] = 0
                  new_scores[4] = 0
                  new_predicted_label = np.argmax(new_scores)
              if test_labels[i] == new_predicted_label:
                  count_correct_examples += 1
              count_considered_examples += 1
          semeval_accuracy = count_correct_examples / count_considered_examples
      else:
          raise ValueError("num_classes must be equal to 2, 3, or 4")
      return semeval_accuracy
'''


### Creating a Custom Dataset for ABSA with PyTorch

The `ABSA_Dataset` class is a custom dataset structure tailored to our Aspect-Based Sentiment Analysis (ABSA) task. It provides:

- **Batching**: Efficiently grouping tokenized sentences into mini-batches during training and evaluation.
- **Indexing**: Easily accessing tokenized data and corresponding labels using indices.
- **Integration**: Seamlessly working with PyTorch's DataLoader for optimized data loading and parallelization.

Note that the cell below currently runs the TensorFlow path: it builds `tf.data.Dataset` objects with `from_tensor_slices`, and the PyTorch `ABSA_Dataset` class is kept only as a commented-out reference.

In [11]:
## TF
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
))
## PT
'''from transformers import logging
class ABSA_Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = ABSA_Dataset(train_encodings, train_labels)
test_dataset = ABSA_Dataset(test_encodings, test_labels)'''

In [12]:
type(test_encodings)

transformers.tokenization_utils_base.BatchEncoding

### Post-Processing and Metrics Computation

This section defines helper functions to process the raw outputs of our model, map them to aspect-based sentiment predictions, and compute evaluation metrics:

1. **get_test_labels**: Extracts the true labels from the test dataset.
2. **get_predictions**: For aspect-level tasks (those ending with "B"), it normalizes the raw output scores and determines the predicted label.
3. **compute_metrics**:
    - Converts model predictions to softmax scores.
    - Maps these scores to aspect sentiments.
    - Compares predictions with true labels to compute:
        - Precision (`P`), Recall (`R`), and F1-Score (`F1`).
        - 4-way, 3-way, and binary accuracies, capturing different granularities of sentiment analysis.

These functions ensure that we can evaluate our model's performance meaningfully on the ABSA task.
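The grouping step in `get_predictions` can be sketched in plain Python as follows. This is an illustrative reimplementation under stated assumptions, not the notebook's code: `group_binary_scores` is a hypothetical name, the input length is assumed to be a multiple of 5, and the probabilities are made up.

```python
def group_binary_scores(yes_probabilities, group_size=5):
    """Turn per-row "yes" probabilities into one sentiment index per group.

    Every consecutive `group_size` rows belong to the same (sentence, aspect)
    pair, one row per candidate sentiment.
    """
    predicted_labels, scores = [], []
    for start in range(0, len(yes_probabilities), group_size):
        group = yes_probabilities[start:start + group_size]
        total = sum(group)
        normalized = [p / total for p in group]  # rescale so the group sums to 1
        scores.append(normalized)
        predicted_labels.append(max(range(len(normalized)), key=normalized.__getitem__))
    return predicted_labels, scores

# Two (sentence, aspect) groups with made-up probabilities for illustration.
probs = [0.70, 0.10, 0.05, 0.05, 0.10,
         0.02, 0.03, 0.05, 0.10, 0.80]
labels, scores = group_binary_scores(probs)
print(labels)  # [0, 4]
```

In the evaluation code, index 4 is the label that is skipped when collecting aspects, so a group whose highest score sits at index 4 counts as "aspect not mentioned".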

In [13]:
'''def get_test_labels():
    original_sentences = []
    auxiliary_sentences = []
    labels = []
    data = pd.read_csv(f"{BUCKET_ROOT}/data/semeval14/test_NLI_M.csv", header=0, sep="\t").values.tolist()
    for row in data:
        labels.append(row[3])
    return labels


def get_predictions(data):
    predicted_labels = []
    scores = []
    count_aspect_rows = 0
    current_aspect_scores = []
    for row in data:
        current_aspect_scores.append(row[2])
        count_aspect_rows += 1
        if count_aspect_rows % 5 == 0:
            sum_current_aspect_scores = np.sum(current_aspect_scores)
            current_aspect_scores = [score / sum_current_aspect_scores for score in current_aspect_scores]
            scores.append(current_aspect_scores)
            predicted_labels.append(np.argmax(current_aspect_scores))
            current_aspect_scores = []
    return predicted_labels, scores

def compute_metrics(predictions):
    scores = [softmax(prediction) for prediction in predictions[0]]
    predicted_labels = [np.argmax(x) for x in scores]
    data = np.insert(scores, 0, predicted_labels, axis=1)
    predicted_labels, scores = get_predictions(data)
    test_labels = get_test_labels()
    metrics = {}
    p, r, f1 = evaluation.compute_semeval_PRF(test_labels, predicted_labels)
    metrics["P"] = p
    metrics["R"] = r
    metrics["F1"] = f1
    metrics["4-way"] = evaluation.compute_semeval_accuracy(test_labels, predicted_labels, scores, 4)
    metrics["3-way"] = evaluation.compute_semeval_accuracy(test_labels, predicted_labels, scores, 3)
    metrics["binary"] = evaluation.compute_semeval_accuracy(test_labels, predicted_labels, scores, 2)
    return metrics'''


#### Paths to directories for us to save our models and evaluation results

In [16]:
save_path = BUCKET_ROOT
log_path = 'log_semeval14'

In [17]:
# `mkdir` only handles local paths, so it cannot create the `gs://` bucket prefix in save_path.
!mkdir -p {log_path}


In [18]:
epochs = 4
batch_size = 24
num_steps = len(train_dataset) * epochs // batch_size
warmup_steps = num_steps // 10  # 10% of the training steps
save_steps = num_steps // epochs    # Save a checkpoint at the end of each epoch
num_classes = 2
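The cell above computes `warmup_steps`, while the training cell further down uses Adam with a constant learning rate. For reference, here is a minimal sketch of how a warmup count is typically consumed, using a linear ramp followed by linear decay. The function name `linear_warmup_decay` and the 12,000-row dataset size are assumptions made for illustration; this is not a schedule the notebook actually applies.

```python
def linear_warmup_decay(step, num_steps, warmup_steps, peak_lr=2e-5):
    """Learning rate at `step`: linear ramp to peak_lr, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, num_steps - step)
    return peak_lr * remaining / max(1, num_steps - warmup_steps)

# Hypothetical size: 12,000 training rows; epochs and batch size as in the cell above.
num_rows, epochs, batch_size = 12_000, 4, 24
num_steps = num_rows * epochs // batch_size  # 2000
warmup_steps = num_steps // 10               # 200
for step in (0, 100, 200, 1100, 2000):
    print(step, linear_warmup_decay(step, num_steps, warmup_steps))
```

The rate rises from 0 to the peak over the first 10% of steps and then falls back to 0 by the final step.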


In [19]:
from transformers import TFBertForSequenceClassification, BertConfig, logging
logging.set_verbosity_debug()

config = BertConfig.from_pretrained(
    'bert-base-uncased',
    architectures = ['BertForSequenceClassification'],
    hidden_size = 768,
    num_hidden_layers = 12,
    num_attention_heads = 12,
    hidden_dropout_prob = 0.1,
    num_labels = 2
)
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', config=config)


loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-uncased/snapshots/1dbc166cf8765166998eff31ade2eb64c8a40076/config.json
Model config BertConfig {
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.31.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--bert-base-uncased/snapshots/1dbc166cf8765166998eff31ade2eb64c8a40076/model.safetensors
Loaded 109,482,24

In [20]:
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

history = model.fit(train_dataset.shuffle(1000).batch(batch_size), epochs=epochs, batch_size=batch_size)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


In [None]:
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)


In [None]:
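# No `Trainer` exists in the TensorFlow path, so evaluate with the compiled Keras model.
evaluation_result = model.evaluate(test_dataset.batch(batch_size), return_dict=True)
print(evaluation_result)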

In [None]:
import pandas as pd


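# Keras `predict` returns the model output; `.logits` holds one row of raw scores per example.
results = model.predict(test_dataset.batch(batch_size))

scores = [softmax(prediction) for prediction in results.logits]
predicted_labels = [np.argmax(x) for x in scores]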
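For readers who want to see what the softmax-then-argmax step does without SciPy or NumPy, here is a small plain-Python sketch. The helper name `softmax_row` and the logits are made up for illustration.

```python
import math

def softmax_row(logits):
    """Numerically stable softmax over one row of logits."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [[2.0, -1.0], [-0.5, 0.5]]  # made-up logits for two rows
example_scores = [softmax_row(row) for row in example_logits]
example_labels = [row.index(max(row)) for row in example_scores]
print(example_labels)  # [0, 1]
```

Softmax does not change which class wins; it only rescales each row into probabilities, which is what the later per-aspect normalization relies on.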

### Take the pretrained model home
* `model = BertForSequenceClassification.from_pretrained(model_dir)`, where `model_dir` is the directory that contains `pytorch_model.bin` and `config.json`.



In [None]:
!zip -r file.zip /kaggle/working

In [None]:
!ls

In [None]:
from IPython.display import FileLink
FileLink(r'file.zip')

In [None]:
#df.to_csv(f"{base_dir}/results.csv")



---


# McDonald's Reviews


---

Let's run the model over the McDonald's review dataset and generate aspect-based sentiment classifications.

In [None]:
McDf = pd.read_csv(f"/content/drive/MyDrive/ABSA/data/McDonald_s_Reviews.csv", encoding = 'latin-1')
print(McDf.shape)
McDf.head()
# we just know that 'latin-1' works for this particular dataset, having looked at notebooks using this dataset.

(33396, 10)


Unnamed: 0,reviewer_id,store_name,category,store_address,latitude,longitude,rating_count,review_time,review,rating
0,1,McDonald's,Fast food restaurant,"13749 US-183 Hwy, Austin, TX 78750, United States",30.460718,-97.792874,1240,3 months ago,Why does it look like someone spit on my food?...,1 star
1,2,McDonald's,Fast food restaurant,"13749 US-183 Hwy, Austin, TX 78750, United States",30.460718,-97.792874,1240,5 days ago,It'd McDonalds. It is what it is as far as the...,4 stars
2,3,McDonald's,Fast food restaurant,"13749 US-183 Hwy, Austin, TX 78750, United States",30.460718,-97.792874,1240,5 days ago,Made a mobile order got to the speaker and che...,1 star
3,4,McDonald's,Fast food restaurant,"13749 US-183 Hwy, Austin, TX 78750, United States",30.460718,-97.792874,1240,a month ago,My mc. Crispy chicken sandwich was ï¿½ï¿½ï¿½ï¿...,5 stars
4,5,McDonald's,Fast food restaurant,"13749 US-183 Hwy, Austin, TX 78750, United States",30.460718,-97.792874,1240,2 months ago,"I repeat my order 3 times in the drive thru, a...",1 star


### We're going to add aspect-sentiment pairs to each review, so 1 row explodes into (No. of aspects) * (No. of sentiments) rows

Update: Not anymore. Let's try tokenizing before mapping, to avoid tokenizing duplicates.

In [None]:
aspects = ['price', 'anecdotes', 'food', 'ambience', 'service']
sentiments = ['positive', 'neutral', 'negative', 'conflict', 'none']

combinations = [f"{aspect} - {sentiment}" for aspect in aspects for sentiment in sentiments]

token_sep = False

if not token_sep:
    # Step 1: Add a column with all combinations for each row
    McDf['combinations'] = [combinations] * len(McDf)

    # Step 2: Explode the combinations into separate rows
    McDf = McDf.explode('combinations')

    # Reset the index for cleanliness
    McDf.reset_index(drop=True, inplace=True)

    print(McDf.shape)
    McDf.head()

(834900, 11)
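The row count follows directly from the cross product. Here is a small plain-Python sketch of the same expansion, using two stand-in reviews (invented for illustration) instead of the full dataset:

```python
from itertools import product

aspects = ['price', 'anecdotes', 'food', 'ambience', 'service']
sentiments = ['positive', 'neutral', 'negative', 'conflict', 'none']
combinations = [f"{a} - {s}" for a, s in product(aspects, sentiments)]

# Two stand-in reviews instead of the full dataset.
reviews = ["Fries were cold.", "Fast and friendly service."]
exploded = [(review, combo) for review in reviews for combo in combinations]

print(len(combinations))          # 25
print(len(exploded))              # 50 = 2 reviews x 25 combinations
print(33396 * len(combinations))  # 834900, the row count reported above
```

Every review text is repeated 25 times in the exploded table, which is the redundancy the notes below refer to.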


*Thinking out loud: A relational DB might be a more efficient storage format for our data.*

Update: Yes, confirmed. Storing the redundant exploded rows made the notebook run out of resources.

### A few cells above we fine-tuned an ABSA BERT classifier. Now we use that model on this *unlabeled* set of reviews

*Thinking out loud: In a business setting, we might have human experts give feedback on the model's classifications and then use that data for further fine-tuning.*

In [None]:
pretrain_path = "/content/drive/MyDrive/ABSA/pretrained"

In [None]:
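from transformers import BertForSequenceClassification  # PyTorch class; only the TF variant was imported above

tokenizer = BertTokenizer.from_pretrained(pretrain_path)
model = BertForSequenceClassification.from_pretrained(pretrain_path)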

# Ensure model is in evaluation mode
model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

### Then we convert the reviews and aspect-sentiment pairs to lists, because that's what the tokenizer accepts

In [None]:
def batch_tokenize(reviews, combinations, batch_size):
    """Tokenize (review, combination) pairs in slices of `batch_size` rows."""
    total = len(reviews)
    batched_encodings = []

    for start in range(0, total, batch_size):
        end = start + batch_size
        batch_reviews = reviews[start:end]
        batch_combinations = combinations[start:end]
        # Padding is applied per batch, so each batch is padded to its own longest pair.
        encodings = tokenizer(batch_reviews, batch_combinations, truncation=True, padding=True, return_tensors="pt")
        batched_encodings.append(encodings)

    return batched_encodings

In [None]:
# Tokenize the input pairs separately
if token_sep:
    review_encodings = tokenizer(McDf['review'].tolist(), truncation=True, padding=True, return_tensors="pt")
    combinations_encodings = tokenizer(combinations, truncation=True, padding=True, return_tensors="pt")

else:
    # Tokenize each review together with its combination
    reviews_list = McDf['review'].tolist()
    combinations_list = McDf['combinations'].tolist()

    batch_size = 5000  # or whatever you find appropriate based on memory constraints
    batched_encodings_list = batch_tokenize(reviews_list, combinations_list, batch_size)


# You can then combine these batched_encodings when needed.
type(batched_encodings_list)

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.

list

### Here we define a modified version of the `ABSA_Dataset` class above, because the McDonald's review dataset is unlabeled

In [None]:
len(batched_encodings_list)

167
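The count of 167 is what slicing 834,900 rows into batches of 5,000 gives. A quick sketch of the slice boundaries (the helper `batch_bounds` is a hypothetical name used for illustration):

```python
import math

def batch_bounds(total, batch_size):
    """Yield (start, end) index pairs that cover range(total) in order."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)

bounds = list(batch_bounds(834_900, 5_000))
print(len(bounds))                 # 167
print(math.ceil(834_900 / 5_000))  # 167
print(bounds[-1])                  # (830000, 834900): the last batch holds 4,900 rows
```

The last batch is shorter than the others, so any code that rebuilds row indices from a batch number should clamp the end index the same way.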

In [None]:
import torch
from torch.utils.data import Dataset

class PredictDataset(Dataset):
    def __init__(self, review_encodings, combinations_encodings):
        self.review_encodings = review_encodings
        self.combinations_encodings = combinations_encodings
        self.num_combinations = len(combinations_encodings['input_ids'])

    def __len__(self):
        return len(self.review_encodings['input_ids']) * self.num_combinations

    def __getitem__(self, idx):
        review_idx = idx // self.num_combinations
        combination_idx = idx % self.num_combinations

        # Fetch encoded review and combination
        item = {key: val[review_idx] for key, val in self.review_encodings.items()}
        combination_item = {key: val[combination_idx] for key, val in self.combinations_encodings.items()}

        # Adjust the length of the review encodings to ensure combined length <= 512
        max_review_length = 512 - len(combination_item['input_ids'])
        for key in item:
            if len(item[key]) > max_review_length:
                item[key] = item[key][:max_review_length]

        # Concatenate review and combination encodings
        for key in item:
            item[key] = torch.cat([item[key], combination_item[key]])

        return item, review_idx, combination_idx


class ABSA_PredictDataset(torch.utils.data.Dataset):
    """Label-free dataset over one batch of tokenizer encodings."""
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        # The encodings are already tensors (return_tensors="pt"), so index them directly.
        item = {key: val[idx] for key, val in self.encodings.items()}
        return item

    def __len__(self):
        return len(self.encodings['input_ids'])


### For prediction, `Trainer` only needs the model
When it comes to making predictions, `Trainer` doesn't need training arguments or a training dataset, because it isn't running a training loop; it just performs forward passes on the model.

In [None]:
import gc
import os

In [None]:
output_dir = "/content/drive/MyDrive/ABSA/mcd_batches"

if not os.path.exists(output_dir):
    os.makedirs(output_dir)

### In this resource-scarce situation, let's be mindful and predict in chunks to avoid running the GPU out of memory

In [None]:
from transformers import Trainer

trainer = Trainer(model=model)
# Create test dataset
if token_sep:
    test_predict_dataset = PredictDataset(review_encodings, combinations_encodings)
    chunk_size = 50000  # Adjust based on your available memory
    all_predictions = []
    all_reviews = []
    all_combinations = []

    model.eval()  # Set the model to evaluation mode

    for i in range(0, len(test_predict_dataset), chunk_size):
        chunk_dataset = [test_predict_dataset[j] for j in range(i, min(i + chunk_size, len(test_predict_dataset)))]

        # Collect tokenized inputs, review, and combination indices from the chunk dataset
        tokenized_inputs, review_indices, combination_indices = zip(*chunk_dataset)

        # Convert the list of dictionaries to a single dictionary
        tokenized_inputs = {
            key: torch.stack([item[key] for item in tokenized_inputs])
            for key in tokenized_inputs[0].keys()
        }

        # Predict using the model
        with torch.no_grad():
            logits = model(**tokenized_inputs).logits
            predictions = logits.argmax(dim=-1).tolist()

        all_predictions.extend(predictions)
        all_reviews.extend(McDf['review'].iloc[list(review_indices)].tolist())
        all_combinations.extend([combinations[idx] for idx in combination_indices])

        # Clear memory
        del chunk_dataset, tokenized_inputs
        gc.collect()

else:
        all_scores = []
        all_predicted_labels = []

        for i, batch_encoding in enumerate(batched_encodings_list):
            # Hard-coded resume point: skip batches 0-161 and predict only the rest.
            if i < 162:
                continue
            test_predict_dataset = ABSA_PredictDataset(batch_encoding)
            results = trainer.predict(test_predict_dataset)

            scores_chunk = [softmax(prediction) for prediction in results.predictions]
            predicted_labels_chunk = [np.argmax(x) for x in scores_chunk]

            all_scores.extend(scores_chunk)
            all_predicted_labels.extend(predicted_labels_chunk)

            # Create a temporary dataframe for the current batch
            df_output_batch = pd.DataFrame({
                'reviews': reviews_list[i*batch_size:(i+1)*batch_size],
                'combinations': combinations_list[i*batch_size:(i+1)*batch_size],
                'predicted_labels': predicted_labels_chunk
            })


            # Save the current batch's dataframe to a CSV file in the specified Google Drive directory
            batch_output_path = os.path.join(output_dir, f"batch_{i}_results.csv")
            df_output_batch.to_csv(batch_output_path, index=False)


### This job takes a long time, and I lost 4 hours' worth of output to a brief disconnect. So now we save to CSV in batches as well.

*So from tokenizing in batches to predicting in batches to now saving in batches, we've learnt things the hard way.*
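One way to make such a job resumable without editing a hard-coded batch index is to skip any batch whose CSV already exists. The sketch below shows the idea with a toy predictor and a temporary directory. `process_in_batches` and `toy_predict` are hypothetical names, and only the `batch_{i}_results.csv` file naming is borrowed from the cell above.

```python
import csv
import os
import tempfile

def process_in_batches(items, batch_size, output_dir, predict_fn):
    """Write one CSV per batch and skip batches whose CSV already exists."""
    os.makedirs(output_dir, exist_ok=True)
    processed = []
    for i, start in enumerate(range(0, len(items), batch_size)):
        path = os.path.join(output_dir, f"batch_{i}_results.csv")
        if os.path.exists(path):
            continue  # finished in an earlier session
        batch = items[start:start + batch_size]
        labels = predict_fn(batch)
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["item", "predicted_label"])
            writer.writerows(zip(batch, labels))
        processed.append(i)
    return processed

# Stand-in predictor: labels an item 1 when it contains "good".
def toy_predict(batch):
    return [int("good" in text) for text in batch]

with tempfile.TemporaryDirectory() as tmp:
    data = ["good food", "slow service", "good price", "cold fries", "ok"]
    first = process_in_batches(data, 2, tmp, toy_predict)
    second = process_in_batches(data, 2, tmp, toy_predict)
    print(first)   # [0, 1, 2]
    print(second)  # []
```

A caveat of this simple check is that a file left half-written by a crash would also be skipped; writing to a temporary name and renaming it once complete would close that gap.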

In [None]:
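# `all_predicted_labels` only covers the batches predicted in this session, so it can be
# shorter than `reviews_list`; rebuild the full table from the per-batch CSV files instead.
import glob

batch_files = sorted(
    glob.glob(os.path.join(output_dir, "batch_*_results.csv")),
    key=lambda path: int(os.path.basename(path).split("_")[1]),  # numeric batch order
)
df_output = pd.concat((pd.read_csv(path) for path in batch_files), ignore_index=True)
df_output[df_output['predicted_labels']==1]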

In [None]:
df_output

In [None]:
if token_sep :
    # Extract predictions
    scores = [softmax(prediction) for prediction in results.predictions]
    predicted_labels = [np.argmax(x) for x in scores]
    reviews = McDf['review'].tolist()
    max_len = max(len(reviews), len(combinations), len(predicted_labels))

    # Pad the lists to make them of equal length
    reviews += [np.nan] * (max_len - len(McDf['review'].tolist()))
    combinations += [np.nan] * (max_len - len(combinations))
    predicted_labels += [np.nan] * (max_len - len(predicted_labels))

    # Create DataFrame
    # Note: You need to adjust the lengths of reviews_list and combinations_list
    # to match the length of predicted_labels, since you've tokenized and exploded them earlier.
    df_output = pd.DataFrame({
        'Review': reviews,
        'Combination': combinations,
        'Predicted_Label': predicted_labels
    })

# Save to CSV
output_path = f"{base_dir}/predictions.csv"
df_output.to_csv(output_path, index=False)

#print(f"Saved predictions to {output_path}")

In [None]:
print(df_output.shape)
print(McDf.shape)