# Prompt

```
You are a data scientist and will support automation for text classification

Use this link as an example https://huggingface.co/blog/sentiment-analysis-python

For the data, it will be different, DO NOT use the data they used in the tutorial, use this:

from datasets import load_dataset
ds = load_dataset("SetFit/tweet_sentiment_extraction")
The two columns we are interested in are 'text' and 'label'. The 'label' column is an integer in the set {0, 1, 2} (0: Negative, 1: Neutral, 2: Positive).




Use this cell structure for the Python notebook:

Cell 1:
Import the packages, and get the model (I want the same as the tutorial: DistilBert)

Cell 2:

Load the training data, and cap it to 5000 rows

Cell 3:
Load the tokenizer, and tokenise the text data
Train the model for 3 epochs
Skip the eval data

Cell 4:
Evaluate the model based on the 'test' data

Cell 5:

Write a dict for mapping: from the ints (0,1,2) to the text label

Cell 6:


Load a CSV file (Restaurant reviews.csv), and classify the 'Review' column, and make a new data frame with the restaurant reviews, and a new 'sentiment label' column, which is the text label (use the previous dict)
```



In [None]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl (

In [None]:
# Import necessary packages
import torch
from datasets import load_dataset
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the pre-trained DistilBERT base model with a new, untrained 3-class
# classification head; it is fine-tuned for sentiment in a later cell
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Load the dataset
ds = load_dataset("SetFit/tweet_sentiment_extraction")

# Cap the training data to 5000 rows
train_data = ds['train'].select(range(5000))

# Preview the dataset
train_data[:5]


README.md:   0%|          | 0.00/94.0 [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


train.jsonl:   0%|          | 0.00/3.93M [00:00<?, ?B/s]

test.jsonl:   0%|          | 0.00/503k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/27481 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3534 [00:00<?, ? examples/s]

{'textID': ['cb774db0d1',
  '549e992a42',
  '088c60f138',
  '9642c003ef',
  '358bd9e861'],
 'text': [' I`d have responded, if I were going',
  ' Sooo SAD I will miss you here in San Diego!!!',
  'my boss is bullying me...',
  ' what interview! leave me alone',
  ' Sons of ****, why couldn`t they put them on the releases we already bought'],
 'label': [1, 0, 0, 0, 0],
 'label_text': ['neutral', 'negative', 'negative', 'negative', 'negative']}
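The prompt caps training at the first 5,000 rows, and four of the five previewed rows above are negative, so a quick look at the class balance may be worthwhile. The following is a minimal sketch on the five previewed labels only; `label_names` and `label_distribution` are hypothetical helper names that do not appear in the original notebook.

```python
from collections import Counter

# Labels copied from the five-row preview above.
preview_labels = [1, 0, 0, 0, 0]

# Mapping given in the prompt (0: Negative, 1: Neutral, 2: Positive).
label_names = {0: "Negative", 1: "Neutral", 2: "Positive"}

def label_distribution(labels):
    """Return {label name: fraction of rows} for a sequence of integer labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label_names[k]: counts.get(k, 0) / total for k in sorted(label_names)}

distribution = label_distribution(preview_labels)
print(distribution)
```

On the real data the same function could presumably be called as `label_distribution(train_data['label'])`, assuming that column access yields a sequence of integers.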

In [None]:
# Load the tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Tokenize the text data
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128)

# Tokenize the training data
train_data = train_data.map(tokenize_function, batched=True)

# Validation split intentionally skipped (the prompt asks to skip the eval data)
# train_data, eval_data = train_test_split(train_data, test_size=0.1)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # number of training epochs
    per_device_train_batch_size=8,   # batch size for training
    # per_device_eval_batch_size=8,    # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
    report_to='none',
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    # eval_dataset=eval_data
)

# Train the model
trainer.train()


Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Step,Training Loss
10,1.109
20,1.1154
30,1.1126
40,1.1177
50,1.1201
60,1.0851
70,1.0928
80,1.0945
90,1.0771
100,1.081


TrainOutput(global_step=1875, training_loss=0.5199690247694652, metrics={'train_runtime': 220.8013, 'train_samples_per_second': 67.934, 'train_steps_per_second': 8.492, 'total_flos': 496761603840000.0, 'train_loss': 0.5199690247694652, 'epoch': 3.0})
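The `global_step=1875` reported above can be checked against the settings in the cell. The arithmetic below is a small sketch that assumes a single device and no gradient accumulation; neither setting is shown in the notebook, so both are assumptions.

```python
import math

num_rows = 5000       # training rows after the cap
batch_size = 8        # per_device_train_batch_size
num_epochs = 3        # num_train_epochs
warmup_steps = 500    # warmup_steps

# Assumption: one device and no gradient accumulation.
steps_per_epoch = math.ceil(num_rows / batch_size)
total_steps = steps_per_epoch * num_epochs

# Share of all optimizer steps spent in learning-rate warmup.
warmup_fraction = warmup_steps / total_steps

# Cross-entropy of a uniform guess over three classes, for comparison
# with the first logged training losses.
chance_loss = math.log(3)

print(steps_per_epoch, total_steps, round(warmup_fraction, 3), round(chance_loss, 4))
```

Under these assumptions warmup covers roughly a quarter of the run. The first logged losses (about 1.08 to 1.12) sit close to the chance-level value of ln 3, which may partly reflect the small learning rate early in warmup.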

In [None]:
# Evaluate the model
test_data = ds['test']
test_data = test_data.map(tokenize_function, batched=True)
test_results = trainer.evaluate(test_data)

# Display test results
print(test_results)


Map:   0%|          | 0/3534 [00:00<?, ? examples/s]

{'eval_loss': 0.931968629360199, 'eval_runtime': 12.9601, 'eval_samples_per_second': 272.683, 'eval_steps_per_second': 34.105, 'epoch': 3.0}
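The evaluation output above contains only `eval_loss` and timing figures, since no metric function was passed to the `Trainer`. Below is a minimal sketch of an accuracy function in the shape that the `compute_metrics` argument of `Trainer` expects (a callable that receives a predictions/labels pair and returns a dict). It is written in plain Python so it can be run without the model; the toy logits and labels are made up for illustration.

```python
def argmax(row):
    """Index of the largest value in one row of logits."""
    return max(range(len(row)), key=lambda i: row[i])

def compute_metrics(eval_pred):
    """Accuracy from a (logits, labels) pair, returned as a dict."""
    logits, labels = eval_pred
    predicted = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predicted, labels))
    return {"accuracy": correct / len(labels)}

# Toy logits for four examples and three classes (made-up values).
toy_logits = [
    [2.0, 0.1, -1.0],   # argmax is class 0
    [0.2, 1.5, 0.3],    # argmax is class 1
    [-0.5, 0.0, 0.9],   # argmax is class 2
    [1.1, 0.9, 0.2],    # argmax is class 0
]
toy_labels = [0, 1, 2, 1]

metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

Passing `compute_metrics=compute_metrics` when constructing the `Trainer` should add an `eval_accuracy` entry to the dict returned by `trainer.evaluate`, assuming the evaluation predictions unpack into logits and labels as they do here.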


In [None]:
# Free up GPU memory
torch.cuda.empty_cache()

# del test_data
# del train_data
# del encoded_reviews

In [None]:
# Mapping from integers to text labels
label_mapping = {0: 'Negative', 1: 'Neutral', 2: 'Positive'}

# Display the label mapping
print(label_mapping)


{0: 'Negative', 1: 'Neutral', 2: 'Positive'}
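The mapping above turns a predicted class index into a text label. If a confidence score is also wanted, one option is to apply a softmax to the logits before taking the argmax. The following is a sketch in plain Python with made-up logits; `softmax` and `label_with_confidence` are hypothetical helpers, not part of the original notebook.

```python
import math

label_mapping = {0: "Negative", 1: "Neutral", 2: "Positive"}

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_with_confidence(logits):
    """Return (text label, probability) for the most likely class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return label_mapping[best], probs[best]

# Made-up logits for a single review.
label, confidence = label_with_confidence([0.1, 0.3, 2.2])
print(label, round(confidence, 3))
```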


In [None]:
import pandas as pd
import torch

# Load the restaurant reviews CSV file
restaurant_reviews = pd.read_csv('Restaurant reviews.csv')

# Ensure all reviews are valid strings
reviews = restaurant_reviews['Review']
reviews = reviews.apply(lambda x: str(x) if isinstance(x, str) else "").fillna("")

# Define the model and tokenizer setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Set bounds from the data so every row is classified and no chunk is empty
chunk_size = 500
min_bound = 0
max_bound = len(reviews)

# Process in chunks
for start_idx in range(min_bound, max_bound, chunk_size):
    end_idx = min(start_idx + chunk_size, max_bound)
    chunk_reviews = reviews[start_idx:end_idx].tolist()

    # Tokenize the reviews
    encoded_reviews = tokenizer(chunk_reviews, padding=True, truncation=True, max_length=128, return_tensors='pt')
    encoded_reviews = {key: value.to(device) for key, value in encoded_reviews.items()}

    # Make predictions (eval mode disables dropout; no_grad skips gradient tracking)
    model.eval()
    with torch.no_grad():
        outputs = model(**encoded_reviews)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1).to('cpu')

    # Update the dataframe with predictions
    restaurant_reviews.loc[start_idx:end_idx - 1, 'sentiment_label'] = predictions.numpy()

# Map predictions to text labels
restaurant_reviews['sentiment_label'] = restaurant_reviews['sentiment_label'].map(label_mapping)

# Display the updated dataframe
# Show the first rows of the updated dataframe
# (ace_tools exists only in the ChatGPT sandbox, not in Colab or a local Jupyter kernel)
restaurant_reviews.head()


In [None]:
restaurant_reviews[['Review', 'sentiment_label']]
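The classification cell above walks through the reviews in fixed-size chunks. The chunk boundaries can be sketched in isolation as below, so that the last chunk is simply shorter and no chunk is ever empty. `iter_chunks` is a hypothetical helper name and the row counts are illustrative.

```python
def iter_chunks(n_rows, chunk_size):
    """Yield (start, end) index pairs covering range(n_rows) in order."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)

# Illustrative row count that is not a multiple of the chunk size.
chunks = list(iter_chunks(1200, 500))
print(chunks)
```

Each `(start, end)` pair can then be used as `reviews[start:end]`, which mirrors the slicing in the cell above.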
