# Learning to split
***
Is it possible to teach a model to split a rule of thumb (RoT) into an action and a judgment?\
Task:
* Load and parse the Social-Chem-101 dataset, which consists of ~300k rules of thumb that were manually split into action and judgment
* Train a seq2seq model to generate the sentence "judgment + action" from a given RoT
* Use the model to split the complete Moral Stories dataset, probably in another notebook

In [1]:
import pandas as pd
import numpy as np
import string

pd.set_option('display.max_colwidth', 400)

from ailignment.datasets import get_social_chem101

In [2]:
data = get_social_chem101()
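# basic cleaning: drop rows with missing values
data.dropna(inplace=True)

# make sure each rot-judgment ends with sentence-final punctuation
# (endswith also handles empty strings, where x[-1] would raise an IndexError)
data["rot-judgment"] = data["rot-judgment"].apply(lambda x: x if x.endswith((".", "!")) else x + ".")

# training target: the judgment sentence followed by the action
data["targets"] = data["rot-judgment"] + " " + data["action"]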

data[["rot-judgment", "rot", "action"]]

Unnamed: 0,rot-judgment,rot,action
0,it's bad.,It's bad to do something that causes other people to lose trust in you.,doing something that causes other people to lose trust in you.
1,expected.,People are expected to participate in the big events in their friends' lives if asked.,people participating in the big events in their friends' lives if asked.
2,Partners should.,Partners should listen to each other's issues.,Listening to each other's issues.
3,it's okay.,It is okay to need space from family.,needing space from family.
4,it's good.,It's good to keep things clean.,keeping things clean.
...,...,...,...
355897,Is economical.,Making dinner is economical as opposed to eating out.,Making dinner instead of eating out.
355902,it's economical.,Making dinner is economical as opposed to eating out.,Making dinner instead of eating out
355918,is economical.,Making dinner is economical as opposed to eating out.,Making dinner as opposed to eating out.
355919,it is.,Making dinner is economical as opposed to eating out.,Making dinner is economical as opposed to eating out.
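
The target construction above can be illustrated without pandas. The following is a minimal sketch, not the notebook's actual code path: the helper names `add_final_punctuation` and `make_target` are hypothetical, and the unpunctuated input `"it's good"` is an assumed raw value chosen to reproduce row 4 of the table.

```python
def add_final_punctuation(judgment):
    """Append a period unless the judgment already ends with '.' or '!'."""
    return judgment if judgment.endswith((".", "!")) else judgment + "."


def make_target(judgment, action):
    """Join the punctuated judgment and the action into one target string."""
    return add_final_punctuation(judgment) + " " + action


# Assumed raw judgment without a trailing period (hypothetical input).
target = make_target("it's good", "keeping things clean.")
print(target)  # it's good. keeping things clean.
```

The seq2seq model is then trained to map the full RoT ("It's good to keep things clean.") to this target string.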


In [3]:
# load the pretrained T5 seq2seq model and its tokenizer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

In [4]:
# build Hugging Face datasets from the dataframe
from datasets import Dataset
from sklearn.model_selection import train_test_split

def tok_inp(x):
    # tokenize the rule of thumb (model input);
    # padding is left to the data collator, which pads each batch dynamically
    return tokenizer(x["rot"], truncation=True)

def tok_out(x):
    # tokenize the "judgment + action" target string;
    # unpadded labels let the collator fill with -100, which the loss ignores
    with tokenizer.as_target_tokenizer():
        y = tokenizer(x["targets"], truncation=True)
    return y

def preprocess(dataframe):
    dataset = Dataset.from_pandas(dataframe)
    dataset = dataset.map(tok_inp, batched=True)
    labels = dataset.map(tok_out, batched=True)
    # the token ids of the targets become the training labels
    dataset = dataset.add_column("labels", labels["input_ids"])
    return dataset

train, test = train_test_split(data, test_size=0.1)

train_data = preprocess(train)
eval_data = preprocess(test)
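
One detail worth keeping in mind when preparing seq2seq labels: if the labels are padded to a fixed length at tokenization time (for example with `padding="max_length"`), the padding positions count towards the loss unless they are replaced by `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows that replacement in plain Python. The helper `mask_padding` is hypothetical, and the example ids are shortened for illustration; `0` is the pad token id of the T5 tokenizer.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding(label_ids, pad_token_id):
    """Replace padding ids in each label sequence with IGNORE_INDEX."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]


# Illustrative label sequence: four tokens, </s> (id 1), then three pad tokens (id 0).
padded = [[148, 225, 59, 5, 1, 0, 0, 0]]
masked = mask_padding(padded, pad_token_id=0)
print(masked)  # [[148, 225, 59, 5, 1, -100, -100, -100]]
```

When the labels are left unpadded, `DataCollatorForSeq2Seq` performs this padding with `-100` per batch, so no manual masking is needed.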




In [6]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq

import torch

data_collator = DataCollatorForSeq2Seq(tokenizer, model)

training_args = Seq2SeqTrainingArguments(
    output_dir="results/",
    num_train_epochs=10,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=50,
    evaluation_strategy="epoch",
    save_steps=10000,
)
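
To make `warmup_steps=500` more concrete, here is a rough sketch of the learning rate over training. It assumes the `Trainer` defaults that are not set explicitly above, namely a base learning rate of `5e-5` and the linear schedule with warmup; the function name `linear_warmup_lr` is hypothetical, and the total step count is the one reported in this notebook's training log.

```python
def linear_warmup_lr(step, total_steps, warmup_steps=500, base_lr=5e-5):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * (step / max(1, warmup_steps))
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / max(1, total_steps - warmup_steps))


TOTAL_STEPS = 522210  # "Total optimization steps" from the training log

for step in (0, 250, 500, TOTAL_STEPS):
    print(step, linear_warmup_lr(step, TOTAL_STEPS))
```

With these assumptions the learning rate peaks at step 500 and decays to zero at the final step.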

In [None]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()

The following columns in the training set  don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: rot-id, rot-char-targeting, action, action-legal, rot-worker-id, rot-moral-foundations, action-agency, m, action-moral-judgment, breakdown-worker-id, __index_level_0__, rot, action-char-involved, action-hypothetical, rot-bad, n-characters, rot-agree, targets, rot-categorization, situation, rot-judgment, action-agree, action-pressure, area, split, situation-short-id, characters.
***** Running training *****
  Num examples = 208883
  Num Epochs = 10
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 522210


Epoch,Training Loss,Validation Loss
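
The step count in the log follows directly from the dataset size, batch size, and number of epochs. A small sketch of the arithmetic, assuming a single device and no gradient accumulation as reported in the log (the function name is hypothetical):

```python
import math


def total_optimization_steps(num_examples, batch_size, epochs):
    """Batches per epoch (last partial batch included) times number of epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


steps = total_optimization_steps(num_examples=208883, batch_size=4, epochs=10)
print(steps)  # 522210
```

At a batch size of 4 each epoch takes 52,221 steps, which is why the run is long even for `t5-small`.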


In [3]:
path = "../data/models/learn_to_split"
model.save_pretrained(path)
tokenizer.save_pretrained(path)

('../data/models/learn_to_split\\tokenizer_config.json',
 '../data/models/learn_to_split\\special_tokens_map.json',
 '../data/models/learn_to_split\\tokenizer.json')

In [41]:
tokens = tokenizer("You should not. hurting your siblings.", add_special_tokens=True)["input_ids"]
print(tokens)
print(tokenizer.decode(tokens))

[148, 225, 59, 5, 4781, 53, 39, 19665, 5, 1]
You should not. hurting your siblings.</s>


In [19]:
# indices of samples considered hard to split
hard_samples = [291372, 291975, 270432, 216267, 287788]
sample = data.loc[hard_samples].copy()
# pad only to the longest RoT in the batch
x = tokenizer(sample["rot"].to_list(), padding=True, return_tensors="pt")
x = {k: v.cuda() for k, v in x.items()}
# beam search with sampling (top-p / top-k filtered)
y = model.generate(**x, do_sample=True, min_length=1, max_length=100, top_p=0.95, top_k=50, num_beams=10, temperature=1.0)
# note: this overwrites the dataset's own "split" column in the copied sample
sample["split"] = tokenizer.batch_decode(y, skip_special_tokens=True)
sample[["rot", "split", "action"]]

Unnamed: 0,rot,split,action
291372,"It is appropriate to contact child services if you know of a child being mentally, physically, or sexually abused.","It is appropriate. contacting child services if you know of a child being mentally, physically, or sexually abused.","contacting child services if you know of a child being mentally, physically, or sexually abused."
291975,It's expected that family members should stick together.,It's expected. family members sticking together.,sticking together with family members.
270432,"If you think you're sick, you should see a doctor.",should. seeing a doctor when you think you're sick,seeing a doctor when you think you're sick.
216267,It's better to get a job after high school and not head straight to college.,It's better. getting a job after high school and not head straight to college.,getting a job after high school and not heading straight to college.
287788,It's okay to be annoyed when it takes the repair shop a long time to get you taken care of.,It's okay. being annoyed when it takes the repair shop a long time to get you taken care of.,being annoyed when it takes the repair shop a long time to get you taken care of.
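
To use the generated strings downstream, the "judgment. action" output has to be cut back into its two parts. The following is a minimal sketch under the assumption that the judgment is the shortest prefix ending in `.` or `!` followed by whitespace; the helpers `split_prediction`, `normalize`, and `action_matches` are hypothetical and not part of the notebook. The example rows are taken from the table above.

```python
import re


def split_prediction(text):
    """Split 'judgment. action' at the first '.' or '!' followed by whitespace."""
    match = re.match(r"\s*(.*?[.!])\s+(.+)", text, flags=re.DOTALL)
    if match is None:
        # No judgment boundary found: treat the whole string as the action.
        return "", text.strip()
    return match.group(1), match.group(2).strip()


def normalize(text):
    """Lowercase and drop trailing sentence punctuation for a lenient comparison."""
    return text.strip().rstrip(".!").lower()


def action_matches(prediction, gold_action):
    """True if the predicted action equals the annotated one after normalization."""
    _, action = split_prediction(prediction)
    return normalize(action) == normalize(gold_action)


rows = [
    ("It's expected. family members sticking together.",
     "sticking together with family members."),
    ("should. seeing a doctor when you think you're sick",
     "seeing a doctor when you think you're sick."),
    ("It's okay. being annoyed when it takes the repair shop a long time to get you taken care of.",
     "being annoyed when it takes the repair shop a long time to get you taken care of."),
]
results = [action_matches(pred, gold) for pred, gold in rows]
print(results)  # [False, True, True]
```

Exact match is a strict criterion: the first row is counted as a miss even though "family members sticking together" is an acceptable paraphrase of the annotated action.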


In [10]:
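# reload the fine-tuned model from disk and move it to the GPU
# (per the execution counts, this cell was run before the generation cell above)
model = AutoModelForSeq2SeqLM.from_pretrained("../data/models/learn_to_split/").cuda()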