# Training the classifier

From: https://huggingface.co/transformers/training.html, https://huggingface.co/transformers/preprocessing.html, and https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files

## Setup

In [1]:
!gdown https://drive.google.com/uc?id=1-1ySq4Eqy5i0v7IO5lKsntYNowIbpBYC -O ./corrected_labels_data.csv

Downloading...
From: https://drive.google.com/uc?id=1-1ySq4Eqy5i0v7IO5lKsntYNowIbpBYC
To: /content/corrected_labels_data.csv
163MB [00:01, 148MB/s]


In [2]:
!pip3 install transformers datasets

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d5/43/cfe4ee779bbd6a678ac6a97c5a5cdeb03c35f9eaebbb9720b036680f9a2d/transformers-4.6.1-py3-none-any.whl (2.2MB)
Collecting datasets
  Downloading https://files.pythonhosted.org/packages/94/f8/ff7cd6e3b400b33dcbbfd31c6c1481678a2b2f669f521ad20053009a9aa3/datasets-1.7.0-py3-none-any.whl (234kB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/d4/e2/df3543e8ffdab68f5acc73f613de9c2b155ac47f162e725dcac87c521c11/tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_1

### Data

In [3]:
import pandas as pd
from datasets import load_dataset

In the source file, label 0 means suicide and 1 means non-suicide; we invert that so that 1 marks the suicide class. We also split the data into training and test sets with a 0.9 ratio (90% of the rows go to training).
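As a minimal sketch of these two steps on toy data, the label inversion and the ratio-based split can be written in plain Python. The helper names `invert_label` and `split_train_test` are hypothetical and not part of the notebook, which does the same thing with pandas.

```python
import random

def invert_label(label):
    """Swap the binary label: 1 becomes 0, anything else becomes 1."""
    return 0 if label == 1 else 1

def split_train_test(rows, ratio, seed=42):
    """Shuffle a copy of rows and cut it at int(len(rows) * ratio)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]

# Toy rows of (text, original_label); in the original file 0 means suicide.
rows = [(f"post {i}", i % 2) for i in range(10)]
rows = [(text, invert_label(label)) for text, label in rows]

train, test = split_train_test(rows, ratio=0.9)
print(len(train), len(test))
```

Shuffling before cutting matters because a file sorted by label would otherwise put a single class in the test set.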

In [4]:
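data = pd.read_csv('./corrected_labels_data.csv')
# drop incomplete rows first; otherwise a missing label would be mapped to 1 below
data.dropna(subset=['text', 'sentiment'], inplace=True)
# invert the labels: 1 (non-suicide) -> 0, 0 (suicide) -> 1
data['sentiment'] = data['sentiment'].apply(lambda s: 0 if s == 1 else 1)
data.rename(columns={'sentiment': 'label'}, inplace=True)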

train_test_ratio = 0.9

data = data.sample(len(data), random_state=42)
data_train = data.iloc[:int(len(data) * train_test_ratio)]
data_test = data.iloc[int(len(data) * train_test_ratio):]

data_train.to_csv('./corrected_labels_data_train.csv')
data_test.to_csv('./corrected_labels_data_test.csv')

In [25]:
datasets = load_dataset('csv',
                       data_files={'train': './corrected_labels_data_train.csv',
                                   'test': './corrected_labels_data_test.csv'})

Using custom data configuration default-6ea66d1be80a111b


Downloading and preparing dataset csv/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/csv/default-6ea66d1be80a111b/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-6ea66d1be80a111b/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0. Subsequent calls will reuse this data.


In [26]:
len(datasets['train'])

208834

In [27]:
len(datasets['test'])

23204
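As a quick sanity check, the two sizes reported above are consistent with the 0.9 split ratio. The snippet only redoes the arithmetic from the numbers printed by the notebook.

```python
n_train, n_test = 208834, 23204   # sizes reported by the two cells above
total = n_train + n_test
ratio = 0.9

expected_train = int(total * ratio)      # same cut as data.iloc[:int(len(data) * ratio)]
expected_test = total - expected_train

print(total, expected_train, expected_test)
```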

### Training setup

In [4]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

from transformers import TrainingArguments
from transformers import Trainer

In [11]:
# using bert cased due to some posts claiming it gives better results
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

From the data preprocessing and visualization step we gathered the following information:

- Suicide posts average 236.28 words per post
- Non-suicide posts average 70.85 words per post

Posts about heavy topics naturally tend to be much longer, and we do not want post length by itself to affect the inference result. We therefore truncate every post to 90 tokens (`max_length=90` counts tokens, special tokens included, so the limit is usually somewhat fewer than 90 words). This keeps some of the extra information provided by suicide posts while staying a lot closer to the average length of non-suicide posts.
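To make the effect of a fixed length limit concrete, here is a toy sketch of what truncating and padding to a maximum length does to a sequence of token ids. The function `pad_or_truncate` is a hypothetical illustration, not the tokenizer's implementation: it ignores special tokens, assumes a padding id of 0, and the token counts are made up.

```python
def pad_or_truncate(token_ids, max_length=90, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = list(token_ids[:max_length])          # truncation
    n_pad = max_length - len(kept)               # padding needed, 0 if truncated
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate(range(1, 71))    # a 70-token post
long_ids, long_mask = pad_or_truncate(range(1, 237))     # a 236-token post

print(len(short_ids), sum(short_mask))
print(len(long_ids), sum(long_mask))
```

Both posts end up with the same length, so the model sees at most 90 positions regardless of how long the original post was.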

In [28]:
def tokenize_function(examples):
    try:
        # pad or truncate every post to exactly 90 tokens
        result = tokenizer(examples["text"],
                           max_length=90,
                           truncation=True,
                           padding="max_length")
    except Exception:
        # show the offending batch before re-raising
        print(examples)
        raise
    return result

tokenized_datasets = datasets.map(tokenize_function, batched=True)

In [29]:
# same cased model as tokenizer
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [30]:
!ls /content/drive/MyDrive/Carrera/Decimo/nlp/hugging_face/

checkpoints  training_modelipynb.ipynb


In [31]:
training_args = TrainingArguments("/content/drive/MyDrive/Carrera/Decimo/nlp/hugging_face/checkpoints",
                                  evaluation_strategy="epoch")

We're cutting the dataset to a fifth of its size due to time constraints.

In [34]:
subset_train = (
    tokenized_datasets['train'].
    shuffle(seed=42).
    select(range(int(0.2 * len(tokenized_datasets['train']))))
)

subset_test = (
    tokenized_datasets['test'].
    shuffle(seed=42).
    select(range(int(0.2 * len(tokenized_datasets['test']))))
)

In [35]:
trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=subset_train,
                  eval_dataset=subset_test)
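With this setup the `Trainer` reports only the loss at each evaluation. As a hedged sketch, an accuracy metric could be supplied through the `compute_metrics` argument of `Trainer`. The function below is not part of the original notebook; it assumes the evaluation prediction unpacks into `(logits, labels)` and is written in plain Python so it also works on nested lists.

```python
def compute_metrics(eval_pred):
    """Compute accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    # index of the largest logit in each row is the predicted class
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}

# Toy check: the first row predicts class 1, the second predicts class 0.
print(compute_metrics(([[0.1, 0.9], [2.0, -1.0]], [1, 1])))
```

It would be passed as `compute_metrics=compute_metrics` when constructing the `Trainer`.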

In [36]:
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.3783        | 0.350627        |
| 2     | 0.3517        | 0.359975        |
| 3     | 0.3655        | 0.37434         |


TrainOutput(global_step=15663, training_loss=0.37731385074233836, metrics={'train_runtime': 6177.4292, 'train_samples_per_second': 2.536, 'total_flos': 76151867374800.0, 'epoch': 3.0, 'init_mem_cpu_alloc_delta': 0, 'init_mem_gpu_alloc_delta': 0, 'init_mem_cpu_peaked_delta': 0, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': -6004736, 'train_mem_gpu_alloc_delta': 867215872, 'train_mem_cpu_peaked_delta': 13586432, 'train_mem_gpu_peaked_delta': 591530496})
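The reported `global_step` can be reproduced from the subset size. This is a worked check that assumes the `TrainingArguments` defaults were in effect (a per-device batch size of 8 and 3 epochs) and that a single device was used; none of these values are set explicitly in the notebook.

```python
import math

train_size = int(0.2 * 208834)   # rows kept in subset_train
batch_size = 8                   # assumed: default per-device batch size, one device
epochs = 3                       # assumed: default number of epochs

steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * epochs

print(train_size, steps_per_epoch, total_steps)
```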

## Inference

In [5]:
import torch
from transformers import pipeline

In [42]:
custom_text_pipeline = pipeline('sentiment-analysis',
                                model=model,
                                tokenizer=tokenizer,
                                device=0 if torch.cuda.is_available() else -1)

In [49]:
custom_text_pipeline([
    "I'm feeling a bit more alone than usual",
    "Having a crazy time lately",
    "Fun times for the family",
    "No one in the family cares too much",
    "I hate my horrible life"
])

[{'label': 'LABEL_0', 'score': 0.8476553559303284},
 {'label': 'LABEL_1', 'score': 0.9731742739677429},
 {'label': 'LABEL_0', 'score': 0.847655177116394},
 {'label': 'LABEL_1', 'score': 0.973181426525116},
 {'label': 'LABEL_1', 'score': 0.9731813669204712}]
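Because the model was created without custom label names, the pipeline returns the generic `LABEL_0` and `LABEL_1`. Following the label inversion done in the data step, `LABEL_1` corresponds to suicide and `LABEL_0` to non-suicide. A small sketch for making the output readable follows; the names `LABEL_NAMES` and `readable` are hypothetical helpers, not part of the notebook.

```python
# Mapping follows the inversion in the data step: 1 = suicide, 0 = non-suicide.
LABEL_NAMES = {"LABEL_0": "non-suicide", "LABEL_1": "suicide"}

def readable(predictions):
    """Replace generic pipeline labels with class names and round the scores."""
    return [
        {"label": LABEL_NAMES[p["label"]], "score": round(p["score"], 4)}
        for p in predictions
    ]

# First two predictions copied from the output above.
sample = [
    {"label": "LABEL_0", "score": 0.8476553559303284},
    {"label": "LABEL_1", "score": 0.9731742739677429},
]
for row in readable(sample):
    print(row)
```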

## Saving and loading

In [9]:
# colab runtime died, reloading from checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    '/content/drive/MyDrive/Carrera/Decimo/nlp/hugging_face/checkpoints/checkpoint-15500'
)
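The checkpoint name is consistent with the training run: assuming the default checkpoint interval of 500 steps (the notebook does not set `save_steps`), the last checkpoint written before step 15663 is the one at step 15500. The reloaded model therefore lacks the final 163 training steps.

```python
total_steps = 15663   # global_step reported by trainer.train()
save_steps = 500      # assumed: default checkpoint interval

last_checkpoint = (total_steps // save_steps) * save_steps
missing_steps = total_steps - last_checkpoint

print(f"checkpoint-{last_checkpoint}", missing_steps)
```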

In [12]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

In [13]:
# saving model
model.save_pretrained('./bert_labels_corrected')

In [14]:
!zip -r bert_labels_corrected.zip ./bert_labels_corrected

  adding: bert_labels_corrected/ (stored 0%)
  adding: bert_labels_corrected/pytorch_model.bin (deflated 7%)
  adding: bert_labels_corrected/config.json (deflated 47%)


In [17]:
!mv ./bert_labels_corrected.zip /content/drive/MyDrive/Carrera/Decimo/nlp/final_project/data