## TR question 2

Note: this notebook requires running `tr_q1` first.

The notebook is organized as follows:
- Split the data into train/val/test sets
- Define a baseline
- Train the model

In [68]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding
from sklearn.model_selection import train_test_split
from sklearn.metrics import multilabel_confusion_matrix, f1_score
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np
from typing import Dict
import pandas as pd
import os
from transformers import Trainer, TrainingArguments

from datasets import Dataset

seed = 42
MAX_LENGTH = 512

I take the `headtext` and the first characters of each section's paragraphs and concatenate them, so that each row in the dataset is represented by roughly `MAX_LENGTH` characters.

### Create a dataset

In [69]:
file_path = 'TRDataChallenge2023.zip'
extract_file_path = 'TRDataChallenge2023'
df = pd.read_json(os.path.join(extract_file_path, f"{extract_file_path}.txt"), lines=True).head(10)   # todo

mlb = MultiLabelBinarizer()
labels = pd.Series(np.array(mlb.fit_transform(df["postures"].values), dtype="float").tolist(), name="label_ids")
df = pd.concat([df, labels], axis=1)


In [70]:
df.head()

| | documentId | postures | sections | label_ids |
|---|---|---|---|---|
| 0 | Ib4e590e0a55f11e8a5d58a2c8dcb28b5 | [On Appeal] | [{'headtext': '', 'paragraphs': ['Plaintiff Dw... | [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] |
| 1 | Ib06ab4d056a011e98c7a8e995225dbf9 | [Appellate Review, Sentencing or Penalty Phase... | [{'headtext': '', 'paragraphs': ['After pleadi... | [1.0, 0.0, 0.0, 0.0, 0.0, 1.0] |
| 2 | Iaa3e3390b93111e9ba33b03ae9101fb2 | [Motion to Compel Arbitration, On Appeal] | [{'headtext': '', 'paragraphs': ['Frederick Gr... | [0.0, 1.0, 0.0, 1.0, 0.0, 0.0] |
| 3 | I0d4dffc381b711e280719c3f0e80bdd0 | [On Appeal, Review of Administrative Decision] | [{'headtext': '', 'paragraphs': ['Appeal from ... | [0.0, 0.0, 0.0, 1.0, 1.0, 0.0] |
| 4 | I82c7ef10d6d111e8aec5b23c3317c9c0 | [On Appeal] | [{'headtext': '', 'paragraphs': ['Order, Supre... | [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] |


Normally we `fit_transform` on the train set and `transform` on the test set. However, here I `fit_transform` on the whole dataset to cover all of the labels, because some of them have only one instance (see the first notebook).
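
To illustrate the trade-off: a `MultiLabelBinarizer` fitted only on the training labels ignores (with a warning) any label it first meets at `transform` time, so that label silently disappears from the targets. A minimal sketch with toy label lists (made up for illustration, not the challenge data):

```python
import warnings
from sklearn.preprocessing import MultiLabelBinarizer

# Toy label lists (illustrative only)
train = [["On Appeal"], ["Appellate Review", "On Appeal"]]
test = [["Motion to Dismiss", "On Appeal"]]

# Fitted on train only: "Motion to Dismiss" is unknown and gets dropped
mlb_train_only = MultiLabelBinarizer().fit(train)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide the "unknown class(es)" warning
    encoded_train_only = mlb_train_only.transform(test)

# Fitted on all labels: every label keeps its own column
mlb_all = MultiLabelBinarizer().fit(train + test)
encoded_all = mlb_all.transform(test)

print(mlb_train_only.classes_, encoded_train_only)
print(mlb_all.classes_, encoded_all)
```

Fitting on the full label set only fixes the column layout; it does not look at the text, which is why it is a tolerable shortcut here.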

In [71]:
def clean_up_strings(max_len, sections):
    """
    Strip non-ASCII characters (e.g. \\u escapes) from each section and
    truncate it, so that the joined text is roughly `max_len` characters long.
    """
    if not sections:
        return ""  # nothing to clean; also avoids dividing by zero below
    cleaned_sections = []
    chars_per_section = max_len // len(sections)  # equal character budget per section
    for section in sections:
        headtext = [section['headtext'].encode("ascii", "ignore").decode().strip()]
        cleaned_paragraph = [paragraph.encode("ascii", "ignore").decode().strip() for paragraph in section['paragraphs']]
        cleaned_text = ". ".join(headtext + cleaned_paragraph)

        if len(cleaned_text) < chars_per_section:
            cleaned_sections.append(cleaned_text)
        else:
            # Cut at the last space inside the budget to avoid splitting a word
            last_space_index = cleaned_text[:chars_per_section].rfind(' ')
            if last_space_index == -1:  # no space found: hard cut at the budget
                last_space_index = chars_per_section
            cleaned_sections.append(cleaned_text[:last_space_index])

    return '. '.join(cleaned_sections)

In [72]:
def test_clean_up_strings():
    # Test the function with a basic scenario
    max_len = 50
    sections = [
        {
            'headtext': "Sample Headline",
            'paragraphs': ["This is the first paragraph."]
        },
        {
            'headtext': "Sample Headline",
            'paragraphs': ["Second paragraph."]
        }
    ]
    cleaned_sections = clean_up_strings(max_len, sections)    
    expected_result = 'Sample Headline. This is. Sample Headline. Second'    
    assert cleaned_sections == expected_result

test_clean_up_strings()

In [73]:
df["cleaned_text"] = df.sections.map(lambda x: clean_up_strings(MAX_LENGTH, x))

In [74]:
df

| | documentId | postures | sections | label_ids | cleaned_text |
|---|---|---|---|---|---|
| 0 | Ib4e590e0a55f11e8a5d58a2c8dcb28b5 | [On Appeal] | [{'headtext': '', 'paragraphs': ['Plaintiff Dw... | [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] | . Plaintiff Dwight Watson (Husband) appeals fr... |
| 1 | Ib06ab4d056a011e98c7a8e995225dbf9 | [Appellate Review, Sentencing or Penalty Phase... | [{'headtext': '', 'paragraphs': ['After pleadi... | [1.0, 0.0, 0.0, 0.0, 0.0, 1.0] | . After pleading guilty, William Jerome Howard... |
| 2 | Iaa3e3390b93111e9ba33b03ae9101fb2 | [Motion to Compel Arbitration, On Appeal] | [{'headtext': '', 'paragraphs': ['Frederick Gr... | [0.0, 1.0, 0.0, 1.0, 0.0, 0.0] | . Frederick Greene, the plaintiff below, deriv... |
| 3 | I0d4dffc381b711e280719c3f0e80bdd0 | [On Appeal, Review of Administrative Decision] | [{'headtext': '', 'paragraphs': ['Appeal from ... | [0.0, 0.0, 0.0, 1.0, 1.0, 0.0] | . Appeal from an amended judgment of the Supre... |
| 4 | I82c7ef10d6d111e8aec5b23c3317c9c0 | [On Appeal] | [{'headtext': '', 'paragraphs': ['Order, Supre... | [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] | . Order, Supreme Court, New York County (Arthu... |
| 5 | Iafe9e30074ba11e88be5ff0f408d813f | [] | [{'headtext': 'OPINION & ORDER', 'paragraphs':... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0] | OPINION & ORDER. The Grievance Committee for t... |
| 6 | Icfcdb0e00bed11ea83e6f815c7cdf150 | [Appellate Review, Sentencing or Penalty Phase... | [{'headtext': 'OPINION', 'paragraphs': ['In 20... | [1.0, 0.0, 0.0, 0.0, 0.0, 1.0] | OPINION. In 2017, a jury convicted Jose Carlos... |
| 7 | I0c356d10cfce11e79fcefd9d4766cbba | [Motion to Dismiss] | [{'headtext': 'ORDER OF DISMISSAL', 'paragraph... | [0.0, 0.0, 1.0, 0.0, 0.0, 0.0] | ORDER OF DISMISSAL. BACKGROUND. Plaintiff U.S.... |
| 8 | I53552890fb1e11e790b3a4cf54beb9bd | [On Appeal] | [{'headtext': 'SUMMARY ORDER', 'paragraphs': [... | [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] | SUMMARY ORDER. Petitioner-appellant Chauncey M... |
| 9 | I916de920dad811e7929ecf6e705a87cd | [On Appeal] | [{'headtext': '', 'paragraphs': [' Plaintiffs ... | [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] | . Plaintiffs appeal a judgment dismissing a. F... |


### Create a tokenized dataset

In [75]:
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-large')
model = AutoModelForSequenceClassification.from_pretrained('poltextlab/xlm-roberta-large-english-legal-cap',
                                                           num_labels= len(mlb.classes_),
                                                           problem_type="multi_label_classification",
                                                           ignore_mismatched_sizes=True)

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at poltextlab/xlm-roberta-large-english-legal-cap and are newly initialized because the shapes did not match:
- classifier.out_proj.weight: found shape torch.Size([22, 1024]) in the checkpoint and torch.Size([6, 1024]) in the model instantiated
- classifier.out_proj.bias: found shape torch.Size([22]) in the checkpoint and torch.Size([6]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [76]:
def tokenize(examples):
    encoding = tokenizer(examples["cleaned_text"], padding="max_length", truncation=True, max_length=512)    
    
    return encoding

In [77]:

train_dataset = Dataset.from_pandas(df)
tokenized_datasets = train_dataset.map(tokenize)
tokenized_datasets = tokenized_datasets.train_test_split(test_size=0.2, seed=seed)


Map: 100%|██████████| 10/10 [00:00<00:00, 623.91 examples/s]


### Train

In [78]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")


In [79]:

training_args = TrainingArguments(
    per_device_train_batch_size=3,
    output_dir='./output', 
    save_total_limit=2,
    evaluation_strategy="steps",
    eval_steps=500,
    logging_dir='./logs',
    seed=seed
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)

trainer.train()

  0%|          | 0/9 [00:00<?, ?it/s]You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
100%|██████████| 9/9 [01:46<00:00, 11.78s/it]

{'train_runtime': 106.012, 'train_samples_per_second': 0.226, 'train_steps_per_second': 0.085, 'train_loss': 0.4859818352593316, 'epoch': 3.0}





TrainOutput(global_step=9, training_loss=0.4859818352593316, metrics={'train_runtime': 106.012, 'train_samples_per_second': 0.226, 'train_steps_per_second': 0.085, 'train_loss': 0.4859818352593316, 'epoch': 3.0})

### Evaluation

#### From model

In [80]:
results = trainer.predict(tokenized_datasets["test"])

100%|██████████| 1/1 [00:00<?, ?it/s]


In [81]:
values = np.argmax(results.predictions, axis=1)
n_values = len(mlb.classes_)
prediction = np.eye(n_values)[values]

In [82]:
f1_score(y_true=tokenized_datasets["test"]['label_ids'], y_pred=prediction, average='weighted')

  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


0.0
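
The model is configured with `problem_type="multi_label_classification"`, so each logit can be read as an independent per-label score, whereas the `argmax` step above always selects exactly one label per document. A minimal sketch of an alternative decoding, assuming a sigmoid followed by a 0.5 threshold (the helper name `logits_to_multilabel`, the threshold value, and the example logits are illustrative and not part of the original notebook):

```python
import numpy as np

def logits_to_multilabel(logits, threshold=0.5):
    """Turn raw logits into 0/1 indicators, one independent decision per label."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))  # sigmoid
    return (probs >= threshold).astype(float)

# Made-up logits for two documents and three labels
example_logits = np.array([[2.0, -1.0, 0.3],
                           [-3.0, -2.0, -0.5]])
decoded = logits_to_multilabel(example_logits)
print(decoded)
```

With this decoding a document can receive zero, one, or several labels, which matches the shape of `label_ids`. The threshold would normally be tuned on a validation split.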

#### From a Baseline
From the first question, we noticed:
- Most common class: Appellate Review
- Most common number of labels: 1

Therefore, the baseline predicts "Appellate Review" for every document.

In [83]:
baseline_pred = mlb.transform([['Appellate Review']] * len(tokenized_datasets["test"]))
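
The majority class is hardcoded above from the first notebook's findings. As a sketch, it could instead be derived from the training labels, so that the baseline follows whatever split is used. Here `majority_label` is a hypothetical helper and `train_postures` is a toy list made up for illustration:

```python
from collections import Counter
from sklearn.preprocessing import MultiLabelBinarizer

def majority_label(postures_per_doc):
    """Return the label that appears in the largest number of documents."""
    counts = Counter(label for postures in postures_per_doc for label in set(postures))
    return counts.most_common(1)[0][0]

# Toy training labels (illustrative only)
train_postures = [["Appellate Review"],
                  ["Appellate Review", "On Appeal"],
                  ["Motion to Dismiss"]]
toy_mlb = MultiLabelBinarizer().fit(train_postures)

top = majority_label(train_postures)
# Predict the same single label for two hypothetical test documents
toy_baseline = toy_mlb.transform([[top]] * 2)
print(top, toy_baseline)
```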

In [84]:
multilabel_confusion_matrix(y_true=tokenized_datasets["test"]['label_ids'], y_pred=baseline_pred)

array([[[0, 1],
        [0, 1]],

       [[2, 0],
        [0, 0]],

       [[2, 0],
        [0, 0]],

       [[2, 0],
        [0, 0]],

       [[2, 0],
        [0, 0]],

       [[1, 0],
        [1, 0]]], dtype=int64)
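
Each 2x2 block in the output above belongs to one label, in the order of `mlb.classes_`, and scikit-learn lays it out as `[[TN, FP], [FN, TP]]`. A small sketch with made-up arrays that unpacks the four counts per label:

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

# Made-up targets and predictions: 3 documents, 2 labels
y_true_toy = np.array([[1, 0], [0, 1], [1, 1]])
y_pred_toy = np.array([[1, 0], [1, 0], [1, 0]])

mcm = multilabel_confusion_matrix(y_true_toy, y_pred_toy)
for label_index, ((tn, fp), (fn, tp)) in enumerate(mcm):
    print(f"label {label_index}: TN={tn} FP={fp} FN={fn} TP={tp}")
```

Read this way, the first block of the baseline output shows one true positive and one false positive for "Appellate Review", and the last block shows one missed instance of the sixth label.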

In [85]:
f1_score(y_true=tokenized_datasets["test"]['label_ids'], y_pred=baseline_pred, average='weighted')

  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


0.3333333333333333
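
The weighted F1 averages the per-label F1 scores, using each label's support (its number of true instances) as the weight. The sketch below reproduces the baseline score by hand. The `y_true_assumed` rows are an assumption, chosen only to be consistent with the confusion matrices printed above, and `zero_division=0` is passed to silence the undefined-metric warning:

```python
import numpy as np
from sklearn.metrics import f1_score

# Assumed test labels, consistent with the confusion matrices above
y_true_assumed = np.array([[1, 0, 0, 0, 0, 1],
                           [0, 0, 0, 0, 0, 0]])
# Baseline: predict only label 0 ("Appellate Review") for every document
y_pred_baseline = np.array([[1, 0, 0, 0, 0, 0],
                            [1, 0, 0, 0, 0, 0]])

per_label = f1_score(y_true_assumed, y_pred_baseline, average=None, zero_division=0)
support = y_true_assumed.sum(axis=0)  # true instances per label
manual_weighted = float((per_label * support).sum() / support.sum())
sklearn_weighted = f1_score(y_true_assumed, y_pred_baseline,
                            average="weighted", zero_division=0)
print(per_label, manual_weighted, sklearn_weighted)
```

Only the two labels with non-zero support contribute: label 0 has F1 = 2/3 (precision 0.5, recall 1) and the sixth label has F1 = 0, giving (2/3 + 0) / 2 = 1/3.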