<a href="https://www.kaggle.com/code/jmostol/class-competition?scriptVersionId=94866990" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
# Current Dockerfile does not include HuggingFace Datasets, which must be installed.
!pip install datasets

Collecting datasets
  Downloading datasets-2.1.0-py3-none-any.whl (325 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Installing collected packages: xxhash, responses, datasets
Successfully installed datasets-2.1.0 responses-0.18.0 xxhash-3.0.0

In [2]:
import torch
import pandas as pd
from transformers import set_seed
# For reproducibility:
set_seed(42) # Set seed for `random`,`numpy`,`torch`, etc. (https://huggingface.co/docs/transformers/main/en/internal/trainer_utils#transformers.set_seed)

In [3]:
from datasets import Dataset

train_csv = "../input/class-competition-data/uazhlt-ling-539-sp-2022-2/train.csv"
test_csv = "../input/class-competition-data/uazhlt-ling-539-sp-2022-2/test.csv"

df = pd.read_csv(train_csv).sample(n=600, random_state=42) # For random subset. Most recent run: 10000

dataset = Dataset.from_pandas(df) # Convert to HF Dataset
film_review_datasets = dataset.train_test_split(test_size=0.01) # Tiny held-out split: validation matters little when the goal is only a submission
film_review_datasets = film_review_datasets.map(lambda example: {"TEXT": str(example["TEXT"])}) # Coerce missing TEXT values (None) to strings so the tokenizer accepts them

  0%|          | 0/594 [00:00<?, ?ex/s]

  0%|          | 0/6 [00:00<?, ?ex/s]
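
The `str(...)` call in the cell above is there because a missing review text typically arrives as `None` (or as a float NaN straight from pandas), and the tokenizer only accepts strings. A minimal sketch of that coercion, using plain Python dictionaries as stand-ins for dataset rows; the `rows` list and the `coerce_text` helper are hypothetical and not part of the notebook:

```python
# Hypothetical rows standing in for dataset examples; not real competition data.
rows = [
    {"ID": 0, "TEXT": "A fine film."},
    {"ID": 1, "TEXT": None},          # missing review text
    {"ID": 2, "TEXT": float("nan")},  # how pandas represents an empty CSV field
]

def coerce_text(example):
    """Return a copy of the example with TEXT forced to a string."""
    return {**example, "TEXT": str(example["TEXT"])}

cleaned = [coerce_text(row) for row in rows]
texts = [row["TEXT"] for row in cleaned]
print(texts)
```

Note that a missing value becomes the literal text `"None"` or `"nan"`, which the model treats as an ordinary word. Substituting an empty string instead would be a reasonable alternative.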

In [4]:
checkpoint = "distilbert-base-uncased"

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["TEXT"], truncation=True)

tokenized_datasets = film_review_datasets.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.map(lambda examples: {'labels': examples['LABEL']}, batched=True)

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [6]:
from transformers import AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments, DataCollatorWithPadding
from datasets import load_metric
import numpy as np

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

training_args = TrainingArguments("test-trainer",
                                  num_train_epochs=1,
                                  per_device_train_batch_size=16,
                                  per_device_eval_batch_size=16,
                                  evaluation_strategy="epoch",
                                  seed=42,
                                  report_to="none")
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

def compute_metrics(eval_preds):
    """Return macro-averaged F1 for the evaluation set."""
    metric = load_metric("f1")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)  # Highest-scoring class per example
    return metric.compute(predictions=predictions, references=labels, average="macro")

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Downloading:   0%|          | 0.00/256M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi
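
To make the `compute_metrics` step concrete, here is a rough pure-Python sketch of what "argmax over the logits, then macro F1" amounts to. The `argmax` and `macro_f1` helpers and the toy logits are hypothetical illustrations, not the notebook's code. This version averages over a fixed `num_labels`, whereas the `f1` metric used above delegates to scikit-learn, which by default averages over the labels present in the data, so the two can differ when a class never appears.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])

def macro_f1(predictions, references, num_labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in range(num_labels):
        pairs = list(zip(predictions, references))
        tp = sum(p == label and r == label for p, r in pairs)
        fp = sum(p == label and r != label for p, r in pairs)
        fn = sum(p != label and r == label for p, r in pairs)
        denom = 2 * tp + fp + fn
        # A class with no predictions and no references scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_labels

# Toy logits for four examples and three classes (made-up values).
logits = [
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.1, 0.2, 0.9],
    [1.2, 0.3, 0.4],
]
labels = [0, 1, 2, 1]

predictions = [argmax(row) for row in logits]
score = macro_f1(predictions, labels, num_labels=3)
print(predictions, round(score, 4))
```

Macro averaging weights every class equally, so a rare class that is predicted poorly pulls the score down as much as a common one.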

In [7]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: LABEL, __index_level_0__, ID, TEXT. If LABEL, __index_level_0__, ID, TEXT are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 594
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 38


Epoch   Training Loss   Validation Loss   F1
1       No log          0.499948          0.822222


The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: LABEL, __index_level_0__, ID, TEXT. If LABEL, __index_level_0__, ID, TEXT are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 6
  Batch size = 16


Downloading builder script:   0%|          | 0.00/2.06k [00:00<?, ?B/s]



Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=38, training_loss=0.6170237189845035, metrics={'train_runtime': 17.3458, 'train_samples_per_second': 34.245, 'train_steps_per_second': 2.191, 'total_flos': 72745562282328.0, 'train_loss': 0.6170237189845035, 'epoch': 1.0})

In [8]:
test_dataset = Dataset.from_pandas(pd.read_csv(test_csv))
test_dataset_tokenized = test_dataset.map(lambda x: tokenizer(str(x["TEXT"]), truncation=True, max_length=512)) # `str(...)` guards against missing TEXT; padding is left to the Trainer's data collator, which pads each batch

  0%|          | 0/30078 [00:00<?, ?ex/s]

In [9]:
results = trainer.predict(test_dataset_tokenized)

The following columns in the test set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: ID, TEXT. If ID, TEXT are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 30078
  Batch size = 16


In [10]:
classes = np.argmax(results.predictions, axis=1)

In [11]:
final_preds = pd.DataFrame(list(zip(test_dataset["ID"], classes)), columns=["Id", "Predicted"])

In [12]:
final_preds.to_csv("submission.csv", index=False) # TODO: consider submitting through the Kaggle API instead
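
Before uploading, it can help to confirm that the submission has one row per test ID and the expected header. The sketch below rebuilds the same two-column layout with only the standard library and fails early on a length mismatch. The `build_submission` helper and the sample IDs are hypothetical; the column names `Id` and `Predicted` are taken from cell 11.

```python
import csv
import io

def build_submission(ids, classes):
    """Return CSV text with an Id,Predicted header and one row per test example."""
    if len(ids) != len(classes):
        raise ValueError("Each test ID needs exactly one predicted class.")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Id", "Predicted"])
    writer.writerows(zip(ids, classes))
    return buffer.getvalue()

# Made-up IDs and predictions, for illustration only.
submission_text = build_submission([101, 102, 103], [0, 2, 1])
print(submission_text)
```

Writing to an in-memory buffer keeps the check free of side effects; the same rows could be written to `submission.csv` once they look right.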