# DistilBERT Model Training

DistilBERT was chosen as the base model for the Reddit sentiment classifier because of its relatively small size and strong performance.

In [4]:
from transformers import (
    AutoModelForSequenceClassification, 
    AutoTokenizer,
    DataCollatorWithPadding
)

import pandas as pd
from datasets import Dataset


In [16]:
# Load the DistilBERT model and tokenizer
checkpoint = "distilbert-base-cased"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=True)

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier.wei

## Load Data

In [11]:
# Load the CSV and map sentiment to a binary label (1 = positive, 0 = anything else)
df = pd.read_csv("data/crypto_sentiment_dataset.csv")
df['label'] = df['Sentiment'].apply(lambda x: 1 if x.lower() == 'positive' else 0)

# Create a Dataset for use in the Hugging Face training pipeline and hold out 20% for testing
dset = Dataset.from_pandas(df[['Comment Text', 'label']])
dset = dset.train_test_split(test_size=0.2)
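
The split sizes can be checked against the logs further down, which report 449 training examples and 113 test examples. The sketch below is only arithmetic on those two numbers; the rounding rule (test share rounded up, remainder to training) is an assumption about how the split is computed, not something the notebook states.

```python
import math

n_train, n_test = 449, 113   # "Num examples" in the training and prediction logs
n_total = n_train + n_test   # total rows implied by the two logs
test_size = 0.2

# Assumption: the test share is rounded up and the remainder goes to training
expected_test = math.ceil(test_size * n_total)
expected_train = n_total - expected_test

print(n_total, expected_train, expected_test)
```

Note that the split is shuffled randomly, so the exact rows in each partition can differ between runs unless a seed is fixed.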

## Tokenize Data

In [13]:
# Create a tokenize function that processes a batch of examples at a time
def tokenize_function(batch):
    return tokenizer(batch['Comment Text'], truncation=True)

# Tokenize the dataset in batches; the data collator pads each batch dynamically at training time
tokenized_datasets = dset.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
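
To illustrate what padding at batch time means, here is a simplified pure-Python sketch of dynamic padding. It is not the library's implementation: `pad_batch`, the pad ID of 0, and the token IDs are assumptions chosen for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token IDs for two comments of different lengths
batch = pad_batch([[101, 2023, 102], [101, 102]])
print(batch["input_ids"])       # [[101, 2023, 102], [101, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
```

Padding only to the longest sequence in each batch, rather than to a fixed global length, avoids spending compute on padding tokens when most comments are short.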

## Train Model

In [14]:
from transformers import TrainingArguments

# Define the training arguments, which hold the hyperparameters the Trainer uses for training and evaluation; the string is the output directory
training_args = TrainingArguments("distilbert_training_args")

In [18]:
from transformers import Trainer

# Build trainer object for model training
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [19]:
# Train model
trainer.train()

The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: Comment Text. If Comment Text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 449
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 171


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=171, training_loss=0.2990056980423063, metrics={'train_runtime': 1003.6909, 'train_samples_per_second': 1.342, 'train_steps_per_second': 0.17, 'total_flos': 61792678042980.0, 'train_loss': 0.2990056980423063, 'epoch': 3.0})
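
The numbers in the training log are consistent with each other, which can be verified with a little arithmetic. The sketch below uses only values printed in the log and in `TrainOutput`; the variable names are illustrative.

```python
import math

num_examples = 449        # "Num examples" from the training log
batch_size = 8            # "Total train batch size" from the training log
num_epochs = 3            # "Num Epochs" from the training log
train_runtime = 1003.6909 # seconds, from TrainOutput

# The last batch of each epoch is partial, so round up
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 57 171

# Throughput figures reported in TrainOutput
steps_per_second = total_steps / train_runtime
samples_per_second = num_examples * num_epochs / train_runtime
print(round(steps_per_second, 2), round(samples_per_second, 3))
```

At roughly 0.17 steps per second, the full run of 171 steps takes about 17 minutes.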

## Make Predictions

In [21]:
# Make predictions
predictions = trainer.predict(tokenized_datasets["test"])
print(predictions.predictions.shape, predictions.label_ids.shape)

The following columns in the test set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: Comment Text. If Comment Text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 113
  Batch size = 8


(113, 2) (113,)


In [22]:
# Take argmax of logits for prediction
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)
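
The argmax gives only the predicted class. If a confidence score is also wanted, the logits can be passed through a softmax first. The following is a minimal sketch under stated assumptions: the `softmax` helper and the example logits are made up for illustration, and the column order (index 0 = negative, index 1 = positive) follows the label mapping defined earlier.

```python
import numpy as np

def softmax(logits):
    """Convert rows of logits into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Made-up logits for three comments; columns are [negative, positive]
example_logits = np.array([[2.0, -1.0], [-0.5, 1.5], [0.1, 0.3]])

probs = softmax(example_logits)
example_preds = np.argmax(example_logits, axis=-1)  # same classes as argmax of probs
confidence = probs.max(axis=-1)                     # probability of the predicted class

print(example_preds)
print(confidence.round(3))
```

Because softmax preserves the ordering of the logits, taking the argmax before or after it yields the same predicted class.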

## Accuracy

Because we want to predict positive and negative comments equally well and the dataset is balanced, accuracy is a suitable metric. The target is to exceed 90% accuracy.
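
One way to see why class balance matters for accuracy is to compare against a baseline that always predicts the most common label. The sketch below is illustrative only: `majority_baseline` is a hypothetical helper and the label lists are made up, not taken from the dataset.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

balanced = [0, 1] * 50          # 50 negative, 50 positive
skewed = [0] * 90 + [1] * 10    # 90 negative, 10 positive

print(majority_baseline(balanced))  # 0.5
print(majority_baseline(skewed))    # 0.9
```

On a balanced set the trivial baseline scores only 50%, so a 90% target is meaningful; on a heavily skewed set the same target could be met without learning anything.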

In [24]:
# Get accuracy score
from sklearn.metrics import accuracy_score

print(f"Accuracy: {accuracy_score(predictions.label_ids, preds)}")

Accuracy: 0.9203539823008849
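
Accuracy is simply the fraction of predictions that match the true labels. The sketch below spells that out with a hypothetical `accuracy` helper and made-up toy labels, then converts the reported score back into counts using the 113 test examples from the prediction log.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy example: 4 of 5 predictions match
toy_true = [1, 0, 1, 1, 0]
toy_pred = [1, 0, 0, 1, 0]
print(accuracy(toy_true, toy_pred))

# Convert the reported score back into counts
reported = 0.9203539823008849
n_test = 113
n_correct = round(reported * n_test)
print(n_correct, n_test - n_correct)
```

The reported score corresponds to 104 correct and 9 incorrect predictions, so with a test set this small each misclassified comment moves accuracy by just under one percentage point.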


## Save Model and Tokenizer

In [None]:
# Save the model and tokenizer into the inference Lambda function's directory before creating the CloudFormation stack
model.save_pretrained("serverless/crypto-sentiment/inference/distilbert_reddit_model")
tokenizer.save_pretrained("serverless/crypto-sentiment/inference/distilbert_reddit_tokenizer")