
### Basic Binary Transformer Model Implementation

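This notebook fine-tunes a pre-trained Hugging Face Transformers model (DistilBERT, a distilled version of BERT) for binary sentiment classification on IMDb. The PyTorch path is the one that was run; the TensorFlow equivalent is kept commented out. It follows the Hugging Face sequence classification guide:
https://huggingface.co/docs/transformers/tasks/sequence_classification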

In [5]:
!pip install evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.0


In [6]:
import pandas as pd
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
import evaluate
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import create_optimizer
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification
from transformers.keras_callbacks import KerasMetricCallback
from transformers.keras_callbacks import PushToHubCallback
from transformers import pipeline
from tqdm import tqdm

In [7]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
# !pip install transformers datasets evaluate

In [None]:
# !pip3 install --upgrade tensorflow-gpu --user

In [3]:
imdb = load_dataset("imdb")




In [4]:
imdb

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [5]:
imdb["train"][0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

In [6]:
imdb["test"][0]

{'text': 'I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish as 

In [7]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [8]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

In [9]:
tokenized_imdb = imdb.map(preprocess_function, batched=True)






In [10]:
tokenized_imdb["test"][0]

{'text': 'I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish as 

In [16]:
# BELOW IS FOR PYTORCH

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
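
`DataCollatorWithPadding` pads each batch to the length of its longest sequence instead of padding the whole dataset up front. The following is a minimal pure-Python sketch of that idea, not the library's implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and a pad id of 0 is an assumption.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence of token ids in `batch` to the longest one.

    Hypothetical helper for illustration only. Returns padded ids plus an
    attention mask (1 = real token, 0 = padding).
    """
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up tokenized reviews of different lengths.
padded = pad_batch([[101, 2023, 102], [101, 2023, 2001, 2307, 102]])
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch keeps short reviews from being padded out to the length of the longest review in the dataset.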

In [None]:
# BELOW IS FOR TENSORFLOW

# data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

In [11]:
accuracy = evaluate.load("accuracy")

In [12]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
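
`compute_metrics` receives raw logits and integer labels, takes the argmax over the class axis, and scores the result. As a rough pure-Python illustration of that arithmetic (the logits are made up, and `accuracy_from_logits` is a hypothetical stand-in for `np.argmax` followed by `accuracy.compute`):

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for four reviews; columns are [class 0, class 1].
example_logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.5, 1.7]]
example_labels = [0, 1, 1, 1]
result = accuracy_from_logits(example_logits, example_labels)
print(result)
```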

In [13]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

In [14]:
# BELOW IS FOR PYTORCH

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'pre_classi

In [17]:
training_args = TrainingArguments(
    output_dir="transformer_1_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Trainer reports its own progress, so no tqdm wrapper is needed.
trainer.train()

Cloning https://huggingface.co/kenkliesner/transformer_1_model into local empty directory.
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.2312,0.193241,0.92612
2,0.1515,0.23467,0.9296
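
With `load_best_model_at_end=True` and no `metric_for_best_model` set, the documented `Trainer` default is to compare checkpoints on validation loss, where lower is better. Under that assumption the checkpoint reloaded at the end would be epoch 1, even though epoch 2 has the higher accuracy. A small sketch using the numbers from the training log:

```python
# Per-epoch results copied from the training log.
history = [
    {"epoch": 1, "train_loss": 0.2312, "eval_loss": 0.193241, "accuracy": 0.92612},
    {"epoch": 2, "train_loss": 0.1515, "eval_loss": 0.23467, "accuracy": 0.9296},
]

# Assumed selection rule: lowest validation loss wins.
best_by_loss = min(history, key=lambda row: row["eval_loss"])
# Alternative rule for comparison: highest accuracy wins.
best_by_accuracy = max(history, key=lambda row: row["accuracy"])

print(best_by_loss["epoch"], best_by_accuracy["epoch"])
```

If accuracy should decide instead, `TrainingArguments` accepts `metric_for_best_model="accuracy"`.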

In [18]:
trainer.push_to_hub()

# ABOVE IS FOR PYTORCH

Upload file runs/May23_01-56-59_b415bf03a8c0/events.out.tfevents.1684807029.b415bf03a8c0.4124.0:   0%|        …

To https://huggingface.co/kenkliesner/transformer_1_model
   3e3a823..c3b0858  main -> main


To https://huggingface.co/kenkliesner/transformer_1_model
   c3b0858..f6d26cc  main -> main




'https://huggingface.co/kenkliesner/transformer_1_model/commit/c3b08584b482364043a373120ec18f78ba362ec8'

In [None]:
# from transformers import create_optimizer
# import tensorflow as tf

In [None]:
# print(tf.__version__)

2.12.0


In [None]:
# !pip install tensorflow-gpu

# https://stackoverflow.com/questions/70624869/tfbertforsequenceclassification-requires-the-tensorflow-library-but-it-was-not-f

In [None]:
# BELOW IS FOR TENSORFLOW

# batch_size = 16
# num_epochs = 5
# batches_per_epoch = len(tokenized_imdb["train"]) // batch_size
# total_train_steps = int(batches_per_epoch * num_epochs)
# optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)

In [None]:
# model = TFAutoModelForSequenceClassification.from_pretrained(
#     "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
# )

In [None]:
# tf_train_set = model.prepare_tf_dataset(
#     tokenized_imdb["train"],
#     shuffle=True,
#     batch_size=16,
#     collate_fn=data_collator,
# )

# tf_validation_set = model.prepare_tf_dataset(
#     tokenized_imdb["test"],
#     shuffle=False,
#     batch_size=16,
#     collate_fn=data_collator,
# )

In [None]:
# model.compile(optimizer=optimizer)

In [None]:
# metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)

In [None]:
# push_to_hub_callback = PushToHubCallback(
#     output_dir="my_awesome_model",
#     tokenizer=tokenizer,
# )

In [None]:
# callbacks = [metric_callback, push_to_hub_callback]

In [None]:
# model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=callbacks)

# ABOVE IS FOR TENSORFLOW

In [25]:
text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."

In [26]:
classifier = pipeline("sentiment-analysis", model="kenkliesner/transformer_1_model")
classifier(text)


Xformers is not installed correctly. If you want to use memorry_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


[{'label': 'POSITIVE', 'score': 0.9932585954666138}]

In [28]:
classifier(text)[0]["label"]

'POSITIVE'

In [None]:
# BELOW IS FOR PYTORCH

tokenizer = AutoTokenizer.from_pretrained("kenkliesner/transformer_1_model")
inputs = tokenizer(text, return_tensors="pt")

In [None]:
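import torch  # needed for torch.no_grad(); not imported earlier in the notebook

model = AutoModelForSequenceClassification.from_pretrained("kenkliesner/transformer_1_model")
# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    logits = model(**inputs).logits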

In [None]:
predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]

# ABOVE IS FOR PYTORCH
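
The argmax over the logits gives the predicted class. The score that the pipeline reports is, for a two-label model like this one, normally the softmax probability of that class. A small sketch with made-up logits and a hypothetical `softmax` helper:

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}
example_logits = [-2.4, 2.6]  # made-up values, not real model output
probs = softmax(example_logits)
predicted_class_id = max(range(len(probs)), key=probs.__getitem__)
print(id2label[predicted_class_id], round(probs[predicted_class_id], 4))
```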

In [None]:
# BELOW IS FOR TENSORFLOW

# tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model")
# inputs = tokenizer(text, return_tensors="tf")

In [None]:
# model = TFAutoModelForSequenceClassification.from_pretrained("stevhliu/my_awesome_model")
# logits = model(**inputs).logits

In [None]:
# predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
# model.config.id2label[predicted_class_id]

# ABOVE IS FOR TENSORFLOW

In [10]:
from google.colab import drive
# drive.mount('/content/gdrive', force_remount=True)
drive.mount('/content/gdrive')
PATH = "gdrive/MyDrive/datasets/adv_ml_data/"

Mounted at /content/gdrive


In [11]:
# load test data
test_data = pd.read_csv(f"{PATH}test.csv")

In [12]:
test_data.head()

Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,== From RfC == \r\n\r\n The title is fine as i...
2,00013b17ad220c46,""" \r\n\r\n == Sources == \r\n\r\n * Zawe Ashto..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.


In [13]:
len(test_data)

153164

In [14]:
test_data["comment_text"]

0         Yo bitch Ja Rule is more succesful then you'll...
1         == From RfC == \r\n\r\n The title is fine as i...
2         " \r\n\r\n == Sources == \r\n\r\n * Zawe Ashto...
3         :If you have a look back at the source, the in...
4                 I don't anonymously edit articles at all.
                                ...                        
153159    . \r\n i totally agree, this stuff is nothing ...
153160    == Throw from out field to home plate. == \r\n...
153161    " \r\n\r\n == Okinotorishima categories == \r\...
153162    " \r\n\r\n == ""One of the founding nations of...
153163    " \r\n :::Stop already. Your bullshit is not w...
Name: comment_text, Length: 153164, dtype: object

In [15]:
# load labels data
test_labels = pd.read_csv(f"{PATH}test_labels.csv")

In [16]:
test_labels.head()

Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,-1,-1,-1,-1,-1,-1
1,0000247867823ef7,-1,-1,-1,-1,-1,-1
2,00013b17ad220c46,-1,-1,-1,-1,-1,-1
3,00017563c3f7919a,-1,-1,-1,-1,-1,-1
4,00017695ad8997eb,-1,-1,-1,-1,-1,-1


In [20]:
# bring in the comments
all_comments = pd.merge(test_data, test_labels, on="id", how="left")

In [21]:
len(all_comments)

153164

In [31]:
all_comments.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...,-1,-1,-1,-1,-1,-1
1,0000247867823ef7,== From RfC == \r\n\r\n The title is fine as i...,-1,-1,-1,-1,-1,-1
2,00013b17ad220c46,""" \r\n\r\n == Sources == \r\n\r\n * Zawe Ashto...",-1,-1,-1,-1,-1,-1
3,00017563c3f7919a,":If you have a look back at the source, the in...",-1,-1,-1,-1,-1,-1
4,00017695ad8997eb,I don't anonymously edit articles at all.,-1,-1,-1,-1,-1,-1


In [None]:
classifier = pipeline("sentiment-analysis", model="kenkliesner/transformer_1_model")
classifier(text)

In [33]:
# all_comments['label'] = all_comments["comment_text"].apply(lambda x: classifier(x)[0]["label"])

# Map each comment to its predicted (label, score).
# The classifier is called once per comment and the result reused.
bi_labels = {}
for comment in tqdm(test_data["comment_text"]):
    result = classifier(comment)[0]
    bi_labels[comment] = (result["label"], result["score"])

  0%|          | 38/153164 [00:14<15:57:49,  2.66it/s]


RuntimeError: ignored
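
The loop stops after 38 comments with a `RuntimeError` whose message is not shown. One plausible cause (an assumption, not confirmed by the log) is a comment longer than DistilBERT's 512-token limit; text-classification pipelines accept `truncation=True` at call time for that case. Whatever the cause, keying results by row id avoids collisions between duplicate comments, and recording failures lets the run continue. A sketch, where `label_comments` and the `classify` callable are hypothetical and a stub stands in for the real pipeline:

```python
def label_comments(rows, classify):
    """Label (row_id, text) pairs with a classifier callable.

    `classify` is any function mapping text to [{"label": ..., "score": ...}],
    for example `lambda t: classifier(t, truncation=True)` with the pipeline.
    Rows that raise RuntimeError are recorded in `failed` instead of
    aborting the whole run.
    """
    labels, failed = {}, {}
    for row_id, text in rows:
        try:
            result = classify(text)[0]  # one call per comment
        except RuntimeError as err:
            failed[row_id] = str(err)
            continue
        labels[row_id] = (result["label"], result["score"])
    return labels, failed


def stub_classify(text):
    """Stand-in for the real pipeline: rejects inputs over five words."""
    if len(text.split()) > 5:
        raise RuntimeError("input too long")
    label = "POSITIVE" if "thank" in text.lower() else "NEGATIVE"
    return [{"label": label, "score": 0.5}]


rows = [
    ("a1", "Thank you for understanding."),
    ("a2", "one two three four five six seven"),
    ("a3", "I disagree."),
]
labels, failed = label_comments(rows, stub_classify)
print(labels)
print(failed)
```

With the real data, `rows` could be built as `zip(test_data["id"], test_data["comment_text"])`.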

In [34]:
bi_labels

{"Yo bitch Ja Rule is more succesful then you'll ever be whats up with you and hating you sad mofuckas...i should bitch slap ur pethedic white faces and get you to kiss my ass you guys sicken me. Ja rule is about pride in da music man. dont diss that shit on him. and nothin is wrong bein like tupac he was a brother too...fuckin white boys get things right next time.,": ('NEGATIVE',
  0.8985871076583862),
 '== From RfC == \r\n\r\n The title is fine as it is, IMO.': ('NEGATIVE',
  0.6899991631507874),
 '" \r\n\r\n == Sources == \r\n\r\n * Zawe Ashton on Lapland —  /  "': ('NEGATIVE',
  0.6301909685134888),
 ":If you have a look back at the source, the information I updated was the correct form. I can only guess the source hadn't updated. I shall update the information once again but thank you for your message.": ('POSITIVE',
  0.7086068391799927),
 "I don't anonymously edit articles at all.": ('NEGATIVE', 0.8797335028648376),
 'Thank you for understanding. I think very highly of you and 