# BERT
___

This model is based on:

```Bibtex
@article{toledo-ronenMultilingualArgumentMining2020,
  title = {Multilingual Argument Mining: {{Datasets}} and Analysis},
  author = {Toledo-Ronen, Orith and Orbach, Matan and Bilu, Yonatan and Spector, Artem and Slonim, Noam},
  date = {2020},
  url = {https://arxiv.org/abs/2010.06432},
}
```

Features:
- Sentence

Parameter:

In [None]:
%%capture
! pip install --quiet transformers
! pip install wandb -qqq

In [3]:
MODEL_NAME = "distilbert-base-uncased"

In [4]:
import os
import pandas as pd
import numpy as np
import random
import timeit


from sklearn.model_selection import train_test_split

from transformers import DistilBertTokenizer
from transformers import TFDistilBertForSequenceClassification

import tensorflow as tf

import wandb
from wandb.keras import WandbCallback


In [5]:
from google.colab import drive

In [6]:
!wandb login

[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter: 
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [7]:
drive.mount('/content/drive')  # connect to google drive
base_dir = "drive/MyDrive/BA"

Mounted at /content/drive


### 0. Load data

In [21]:
dataset = "CE-ACL_full.csv"

In [22]:
train_data = pd.read_csv(os.path.join(base_dir, dataset))

In [23]:
label = train_data["Claim"].astype(int).to_list()  # convert bool labels to int
text = train_data["Sentence"].to_list()

In [24]:
train_text_split, test_text_split, train_labels_split, test_labels_split = train_test_split(text, label, test_size=.2, random_state=42) # train/test

In [25]:
print("Num train data:      ", str(len(train_text_split)))
print("Num test data:       ", str(len(test_text_split)))
print("Num claims in train: ", str(sum(train_labels_split)))
print("Num claims in test:  ", str(sum(test_labels_split)))


Num train data:       37186
Num test data:        9297
Num claims in train:  995
Num claims in test:   239
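
The counts above show that only 995 of the 37186 training sentences (about 2.7%) are claims. One common way to account for this kind of imbalance is inverse-frequency class weighting. The sketch below computes weights with the "balanced" heuristic `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` is hypothetical, and the weights are derived from the counts printed before the validation split, so they are only approximate for the final training set.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} (hypothetical helper)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Rebuild a label list with the same class counts as printed above
# (995 claims among 37186 training sentences).
example_labels = [1] * 995 + [0] * (37186 - 995)
class_weights = balanced_class_weights(example_labels)
print(class_weights)
```

A dictionary of this shape could be passed to Keras `model.fit` through its `class_weight` argument. Whether that improves claim detection on this dataset has not been tested here.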


In [26]:
train_texts, val_texts, train_labels, val_labels = train_test_split(train_text_split, train_labels_split, test_size=.2, random_state=42) # train/validation

### 2. Prepare dataset

In [27]:
tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME) # load tokenizer

In [28]:
def tokenize_dataset(dataset):
    """Tokenize a list of strings for the BERT model."""
    encoded = tokenizer(
        dataset,
        padding=True,
        truncation=True,
        return_tensors='np',
    )
    return encoded.data

In [29]:
encodet_train_text = tokenize_dataset(train_texts)
encodet_val_texts = tokenize_dataset(val_texts)
encodet_test_texts = tokenize_dataset(test_text_split)
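
With `padding=True` the tokenizer pads every sequence in a call to the length of the longest one, and the returned `attention_mask` marks real tokens with 1 and padding with 0. The toy sketch below illustrates that idea on made-up token IDs. `pad_batch` is a hypothetical helper, not the tokenizer's implementation, and the pad ID of 0 is an assumption for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-ID lists to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # pad on the right
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs for two sentences of different length.
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because `tokenize_dataset` encodes a whole split in one call, every sentence in that split is padded to the length of its longest sentence (capped by truncation), which can cost memory and training time.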

### 3. Create Model

In [30]:
model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)  # Load model


Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_transform', 'activation_13', 'vocab_layer_norm', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['dropout_19', 'classifier', 'pre_classifier']
You should probably TRAIN this model on a down-stream task to be able to use i

In [31]:
# Set hyperparameters
learning_rate = 5e-5
epochs = 3
batch_size = 16

In [32]:
model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)  # the model returns raw logits

optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_transform', 'activation_13', 'vocab_layer_norm', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['dropout_39', 'classifier', 'pre_classifier']
You should probably TRAIN this model on a down-stream task to be able to use i

In [33]:
model.compile(
    optimizer=optimizer, 
    loss=model.compute_loss,
    metrics=tf.metrics.SparseCategoricalAccuracy(),
    ) # can also use any keras loss fn
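
The classification head returns raw logits rather than probabilities, so a Keras loss applied to its output needs `from_logits=True`. The sketch below shows in plain Python what sparse categorical cross-entropy computes for a single example. The function names are hypothetical and the logits are made up.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_cross_entropy(logits, label):
    """Negative log-probability of the true class index."""
    return -math.log(softmax(logits)[label])


# Made-up logits for the two classes (index 0 = no claim, index 1 = claim).
logits = [2.0, -1.0]
probs = softmax(logits)
loss_if_no_claim = sparse_cross_entropy(logits, 0)
loss_if_claim = sparse_cross_entropy(logits, 1)
print(probs, loss_if_no_claim, loss_if_claim)
```

The loss is small when the true class has the larger logit and grows as the model becomes confidently wrong.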

### 4. Train model

In [37]:
wandb.init(project="claim_detect_en",
           config={
               "model": MODEL_NAME,
               "dataset": dataset,
               "train_data_size": len(train_texts),
               "validation_data_size": len(val_texts),
               "test_data_size": len(test_text_split),
               "batch_size": batch_size,
               "learning_rate": learning_rate,
               "epochs": epochs
           })

[34m[1mwandb[0m: Currently logged in as: [33mjueri[0m (use `wandb login --relogin` to force relogin)


In [38]:
start = timeit.default_timer()

model.fit(
      encodet_train_text,
      np.array(train_labels), 
      validation_data=(encodet_val_texts, np.array(val_labels)),
      epochs=epochs, 
      batch_size=batch_size,
      callbacks=[WandbCallback()])

stop = timeit.default_timer()

print('Time Elapsed: ', stop - start)
wandb.log({'time-elapsed': stop - start})

# wandb.finish()

Epoch 1/3


[34m[1mwandb[0m: [32m[41mERROR[0m Can't save model, h5py returned error: Saving the model to HDF5 format requires the model to be a Functional model or a Sequential model. It does not work for subclassed models, because such models are defined via the body of a Python method, which isn't safely serializable. Consider saving to the Tensorflow SavedModel format (by setting save_format="tf") or using `save_weights`.


Epoch 2/3
Epoch 3/3
Time Elapsed:  9875.218564791


### 5. Predict results

In [39]:
# wandb.init(project="jupyter-projo")

test_loss, test_accuracy = model.evaluate(encodet_test_texts, np.array(test_labels_split))
wandb.log({'test_accuracy': test_accuracy})

wandb.finish()




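Run history:

| Metric | Trend |
|---|---|
| epoch | ▁▅█ |
| loss | █▄▁ |
| sparse_categorical_accuracy | ▁▃█ |
| test_accuracy | ▁ |
| time-elapsed | ▁ |
| val_loss | ▁▁█ |
| val_sparse_categorical_accuracy | ▄█▁ |

Run summary:

| Metric | Value |
|---|---|
| best_epoch | 0.0 |
| best_val_loss | 0.0886 |
| epoch | 2.0 |
| loss | 0.03824 |
| sparse_categorical_accuracy | 0.98682 |
| test_accuracy | 0.97257 |
| time-elapsed | 9875.21856 |
| val_loss | 0.13838 |
| val_sparse_categorical_accuracy | 0.97244 |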
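
The reported `test_accuracy` of 0.97257 is hard to interpret on its own because claims are rare. A useful reference point is the accuracy of always predicting the majority class. The sketch below computes it from the test-split counts printed in section 0 (9297 sentences, 239 claims). `majority_baseline_accuracy` is a hypothetical helper.

```python
def majority_baseline_accuracy(n_total, n_positive):
    """Accuracy of always predicting the more frequent class."""
    n_negative = n_total - n_positive
    return max(n_positive, n_negative) / n_total


# Test-split counts printed in section 0: 9297 sentences, 239 claims.
baseline = majority_baseline_accuracy(9297, 239)
reported_test_accuracy = 0.97257  # value from the run summary
print(f"baseline={baseline:.5f} reported={reported_test_accuracy:.5f}")
```

With these counts the baseline comes out slightly above the reported accuracy, which suggests that accuracy alone says little about how well claims are detected.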


In [None]:
preds = model.predict(encodet_test_texts)

In [None]:
preds.logits.shape

(924, 2)

In [None]:
class_preds = np.argmax(preds.logits, axis=1)
print(preds.logits.shape, class_preds.shape)

(924, 2) (924,)
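
Given the class imbalance, precision, recall, and F1 for the claim class are likely more informative than accuracy. A minimal pure-Python sketch follows. `claim_prf` is a hypothetical helper and the labels below are made up; in this notebook it would be applied to `test_labels_split` and `class_preds`.

```python
def claim_prf(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive (claim) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels and predictions for illustration only.
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]
precision, recall, f1 = claim_prf(y_true, y_pred)
print(precision, recall, f1)
```

A model that never predicts a claim gets zero recall and zero F1 under this measure, even though its accuracy would be high on this dataset.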


### 6. Export model

In [40]:
model.save_pretrained(base_dir +"/fearful-poltergeist-1")
tokenizer.save_pretrained(base_dir +"/fearful-poltergeist-1")

('drive/MyDrive/BA/fearful-poltergeist-1/tokenizer_config.json',
 'drive/MyDrive/BA/fearful-poltergeist-1/special_tokens_map.json',
 'drive/MyDrive/BA/fearful-poltergeist-1/vocab.txt',
 'drive/MyDrive/BA/fearful-poltergeist-1/added_tokens.json')

In [None]:
from transformers import TFDistilBertForSequenceClassification

loaded_model = TFDistilBertForSequenceClassification.from_pretrained(base_dir +"/fearful-poltergeist-1")

Some layers from the model checkpoint at drive/MyDrive/BA/test_output were not used when initializing TFDistilBertForSequenceClassification: ['dropout_59']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at drive/MyDrive/BA/test_output and are newly initialized: ['dropout_116']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
preds_loaded = loaded_model.predict(encodet_test_texts)

In [None]:
class_preds = np.argmax(preds_loaded.logits, axis=1)
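
A quick way to check that the reloaded model reproduces the original one is to compare the two sets of class predictions. The sketch below uses a hypothetical helper, `agreement_rate`, on made-up predictions. Note that the cell above reuses the name `class_preds`, so the original predictions would have to be kept under a separate name for such a comparison.

```python
def agreement_rate(preds_a, preds_b):
    """Fraction of positions where two prediction sequences agree."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction sequences differ in length")
    matches = sum(1 for a, b in zip(preds_a, preds_b) if a == b)
    return matches / len(preds_a)


# Made-up predictions standing in for the original and reloaded model outputs.
original_preds = [0, 1, 0, 0, 1]
reloaded_preds = [0, 1, 0, 0, 1]
rate = agreement_rate(original_preds, reloaded_preds)
print(rate)
```

A rate noticeably below 1.0 would suggest that the weights were not restored as expected.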