In [2]:
%pip install transformers==4.44.2 datasets --quiet

Note: you may need to restart the kernel to use updated packages.


In [1]:
import tensorflow as tf
from transformers import DistilBertTokenizerFast, TFDistilBertForSequenceClassification
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np


In [12]:
# Load the four semicolon-separated files and stack them into one frame.
paths = [f'policy_comments_dataset_{i}.txt' for i in range(1, 5)]
df = pd.concat(
    [pd.read_csv(p, sep=';', header=None, names=['text', 'labels']) for p in paths],
    ignore_index=True
)
# Strip a leading item number such as "12. " from each comment.
df["text"] = df["text"].str.replace(r'^\d+\.\s*', '', regex=True)
# Remove stray whitespace from the labels.
df["labels"] = df["labels"].str.replace(r'\s+', '', regex=True)

df = df.sample(frac=1,random_state=42).reset_index(drop=True)
df.tail()

                                                   text    labels
1195  The dignity assault dehumanizes residents whil...  negative
1196  The bridge burning destroys connections while ...  negative
1197  The risk management approach incorporates stan...   neutral
1198  The proposed administrative updates appear to ...   neutral
1199  The funding distribution mechanism employs sta...   neutral
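
The two `str.replace` patterns used in the loading cell can be checked in isolation. The sketch below applies equivalent patterns with the standard `re` module to a couple of made-up rows; the sample strings and the helper names `clean_text` and `clean_label` are hypothetical and not taken from the dataset.

```python
import re

def clean_text(text: str) -> str:
    # Drop a leading item number such as "12. " from the comment.
    return re.sub(r'^\d+\.\s*', '', text)

def clean_label(label: str) -> str:
    # Remove every whitespace character from the label.
    return re.sub(r'\s+', '', label)

# Made-up rows for illustration only.
samples = [
    ("12. The policy helps families.", " positive"),
    ("The rollout was confusing.", "negative "),
]
cleaned = [(clean_text(t), clean_label(l)) for t, l in samples]
print(cleaned)
```

Rows without a leading number pass through `clean_text` unchanged, which is why the pattern is anchored with `^`.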


In [13]:
df['labels'].value_counts()

labels
neutral     400
negative    400
positive    400
Name: count, dtype: int64

In [14]:
df['labels'] = df["labels"].map({'negative': 0, 'neutral': 1, 'positive': 2})
df.tail()

                                                   text  labels
1195  The dignity assault dehumanizes residents whil...       0
1196  The bridge burning destroys connections while ...       0
1197  The risk management approach incorporates stan...       1
1198  The proposed administrative updates appear to ...       1
1199  The funding distribution mechanism employs sta...       1


In [16]:
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df["text"].tolist(),
    df["labels"].tolist(),
    test_size=0.2,
    stratify=df["labels"],
    random_state=42
)

In [17]:
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")


In [18]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=128)
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=128)

In [19]:
def to_dataset(encodings, labels):
    return tf.data.Dataset.from_tensor_slices((
        dict(encodings),
        labels
    ))

In [20]:
train_dataset = to_dataset(train_encodings, train_labels).shuffle(1000).batch(16)
val_dataset = to_dataset(val_encodings, val_labels).batch(16)

In [21]:
model = TFDistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3
)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

In [22]:
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy("accuracy")

In [23]:
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

In [24]:
history = model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=3
)

Epoch 1/3

Epoch 2/3
Epoch 3/3


In [25]:
# Use distinct names so the `loss` object defined earlier is not overwritten.
val_loss, val_acc = model.evaluate(val_dataset)
print(f"Validation Accuracy: {val_acc:.3f}")

Validation Accuracy: 0.996
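
A single accuracy figure does not show which classes are confused with each other. As a minimal sketch, a confusion matrix can be built from two lists of integer labels with the standard library alone. The `y_true` and `y_pred` values below are hypothetical, and `confusion_matrix` is an illustrative helper, not part of the notebook.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, num_classes=3):
    # Rows are true classes, columns are predicted classes.
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in range(num_classes)]
            for t in range(num_classes)]

# Hypothetical labels for illustration (0=negative, 1=neutral, 2=positive).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

matrix = confusion_matrix(y_true, y_pred)
for row in matrix:
    print(row)
```

In this notebook, `y_true` would presumably be `val_labels`, and `y_pred` could be obtained by taking the argmax of the model's logits over `val_dataset`, which is not shuffled and so should keep the original order.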


In [28]:
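def predict(text):
    """Return the predicted label and the class probabilities for one text."""
    inputs = tokenizer(text, return_tensors="tf", truncation=True, padding=True, max_length=128)
    logits = model(inputs).logits
    # Convert the raw logits to probabilities over the three classes.
    probs = tf.nn.softmax(logits, axis=-1)
    label_id = int(tf.argmax(probs, axis=-1).numpy()[0])
    # Must match the mapping used to encode the training labels.
    labels = {0: "Negative", 1: "Neutral", 2: "Positive"}
    return labels[label_id], probs.numpy()[0]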

In [29]:
print(predict("The new law is a great step forward for workers."))

('Positive', array([0.14597568, 0.03034432, 0.82368   ], dtype=float32))
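
The probabilities returned by `predict` come from applying softmax to the three logits. The sketch below reproduces that step in plain Python to show why the values sum to 1 and why the largest logit wins. The logit values are hypothetical, not the model's actual outputs.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in the order (negative, neutral, positive).
logits = [0.5, -1.0, 2.2]
probs = softmax(logits)

# Index of the most probable class.
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```

Because softmax preserves ordering, taking the argmax of the logits directly would select the same class; the softmax step only matters when the probabilities themselves are reported.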


In [30]:
model.save_pretrained("distilbert_sentiment_model")
tokenizer.save_pretrained("distilbert_sentiment_model")

('distilbert_sentiment_model\\tokenizer_config.json',
 'distilbert_sentiment_model\\special_tokens_map.json',
 'distilbert_sentiment_model\\vocab.txt',
 'distilbert_sentiment_model\\added_tokens.json',
 'distilbert_sentiment_model\\tokenizer.json')