## TR question 2

Note: This notebook requires running tr_q1 first.

The notebook is organized as follows:
- Split the data into train, validation, and test sets
- Define a baseline
- Train the model

In [5]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding
from sklearn.model_selection import train_test_split
from sklearn.metrics import multilabel_confusion_matrix, f1_score
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np
from typing import Dict
import pandas as pd
import os
from transformers import Trainer, TrainingArguments

from datasets import Dataset

seed = 42

MAX_LENGTH = 512

For each section I join the headtext and the paragraphs, keep only the first characters, and concatenate the truncated sections, so that each row in the dataset is represented by roughly `MAX_LENGTH` characters.

### Create a dataset

In [8]:
file_path = 'TRDataChallenge2023.zip'
extract_file_path = 'TRDataChallenge2023'
df = pd.read_json(os.path.join(extract_file_path, f"{extract_file_path}.txt"), lines=True).head(10)   # todo

mlb = MultiLabelBinarizer()
labels = pd.Series(np.array(mlb.fit_transform(df["postures"].values), dtype="float").tolist(), name="label_ids")
df = pd.concat([df, labels], axis=1)


In [9]:
df.head()

Unnamed: 0,documentId,postures,sections,label_ids
0,Ib4e590e0a55f11e8a5d58a2c8dcb28b5,[On Appeal],"[{'headtext': '', 'paragraphs': ['Plaintiff Dw...","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0]"
1,Ib06ab4d056a011e98c7a8e995225dbf9,"[Appellate Review, Sentencing or Penalty Phase...","[{'headtext': '', 'paragraphs': ['After pleadi...","[1.0, 0.0, 0.0, 0.0, 0.0, 1.0]"
2,Iaa3e3390b93111e9ba33b03ae9101fb2,"[Motion to Compel Arbitration, On Appeal]","[{'headtext': '', 'paragraphs': ['Frederick Gr...","[0.0, 1.0, 0.0, 1.0, 0.0, 0.0]"
3,I0d4dffc381b711e280719c3f0e80bdd0,"[On Appeal, Review of Administrative Decision]","[{'headtext': '', 'paragraphs': ['Appeal from ...","[0.0, 0.0, 0.0, 1.0, 1.0, 0.0]"
4,I82c7ef10d6d111e8aec5b23c3317c9c0,[On Appeal],"[{'headtext': '', 'paragraphs': ['Order, Supre...","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0]"


Normally we `fit_transform` on the training set and only `transform` the test set. Here, however, I `fit_transform` on the whole dataset so that every label gets a column, because some labels have only one instance (see the first notebook).
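To illustrate why fitting only on the training split is risky here, the sketch below uses made-up posture lists (not rows from the dataset). A label that appears only at transform time has no column, so it is ignored with a warning and the row loses that label.

```python
import warnings

from sklearn.preprocessing import MultiLabelBinarizer

# Toy posture lists, invented for illustration only
train_postures = [["On Appeal"], ["Motion to Dismiss", "On Appeal"]]
test_postures = [["Joinder"]]  # label never seen during fit

toy_mlb = MultiLabelBinarizer()
toy_mlb.fit(train_postures)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    encoded = toy_mlb.transform(test_postures)

print(toy_mlb.classes_)  # classes learned from the training split only
print(encoded)           # the unseen label has no column, so the row is all zeros
print(len(caught) > 0)   # a warning about the unknown class was raised
```

With many single-example classes, a random split would regularly put such labels in the test set only, which is the situation fitting on the whole dataset avoids.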

In [10]:
def clean_up_strings(max_len, sections):
    """
    Strip non-ASCII characters and truncate each section to an equal share
    of `max_len` characters, cutting at the last space so words stay whole.
    """
    if not sections:
        return ""
    cleaned_sections = []
    chars_per_section = max_len // len(sections)
    for section in sections:
        headtext = [section['headtext'].encode("ascii", "ignore").decode().strip()]
        cleaned_paragraph = [paragraph.encode("ascii", "ignore").decode().strip() for paragraph in section['paragraphs']]
        cleaned_text = ". ".join(headtext + cleaned_paragraph)

        if len(cleaned_text) < chars_per_section:
            cleaned_sections.append(cleaned_text)
        else:
            # Cut at the last space inside the budget; fall back to a hard cut if there is none
            last_space_index = cleaned_text[:chars_per_section].rfind(' ')
            cut = last_space_index if last_space_index != -1 else chars_per_section
            cleaned_sections.append(cleaned_text[:cut])

    return '. '.join(cleaned_sections)

In [11]:
def test_clean_up_strings():
    # Test the function with a basic scenario
    max_len = 50
    sections = [
        {
            'headtext': "Sample Headline",
            'paragraphs': ["This is the first paragraph."]
        },
        {
            'headtext': "Sample Headline",
            'paragraphs': ["Second paragraph."]
        }
    ]
    cleaned_sections = clean_up_strings(max_len, sections)    
    expected_result = 'Sample Headline. This is. Sample Headline. Second'    
    assert cleaned_sections == expected_result

test_clean_up_strings()

In [12]:
df["cleaned_text"] = df.sections.map(lambda x: clean_up_strings(MAX_LENGTH, x))

In [13]:
df

Unnamed: 0,documentId,postures,sections,label_ids,cleaned_text
0,Ib4e590e0a55f11e8a5d58a2c8dcb28b5,[On Appeal],"[{'headtext': '', 'paragraphs': ['Plaintiff Dw...","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0]",. Plaintiff Dwight Watson (Husband) appeals fr...
1,Ib06ab4d056a011e98c7a8e995225dbf9,"[Appellate Review, Sentencing or Penalty Phase...","[{'headtext': '', 'paragraphs': ['After pleadi...","[1.0, 0.0, 0.0, 0.0, 0.0, 1.0]",". After pleading guilty, William Jerome Howard..."
2,Iaa3e3390b93111e9ba33b03ae9101fb2,"[Motion to Compel Arbitration, On Appeal]","[{'headtext': '', 'paragraphs': ['Frederick Gr...","[0.0, 1.0, 0.0, 1.0, 0.0, 0.0]",". Frederick Greene, the plaintiff below, deriv..."
3,I0d4dffc381b711e280719c3f0e80bdd0,"[On Appeal, Review of Administrative Decision]","[{'headtext': '', 'paragraphs': ['Appeal from ...","[0.0, 0.0, 0.0, 1.0, 1.0, 0.0]",. Appeal from an amended judgment of the Supre...
4,I82c7ef10d6d111e8aec5b23c3317c9c0,[On Appeal],"[{'headtext': '', 'paragraphs': ['Order, Supre...","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0]",". Order, Supreme Court, New York County (Arthu..."
5,Iafe9e30074ba11e88be5ff0f408d813f,[],"[{'headtext': 'OPINION & ORDER', 'paragraphs':...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",OPINION & ORDER. The Grievance Committee for t...
6,Icfcdb0e00bed11ea83e6f815c7cdf150,"[Appellate Review, Sentencing or Penalty Phase...","[{'headtext': 'OPINION', 'paragraphs': ['In 20...","[1.0, 0.0, 0.0, 0.0, 0.0, 1.0]","OPINION. In 2017, a jury convicted Jose Carlos..."
7,I0c356d10cfce11e79fcefd9d4766cbba,[Motion to Dismiss],"[{'headtext': 'ORDER OF DISMISSAL', 'paragraph...","[0.0, 0.0, 1.0, 0.0, 0.0, 0.0]",ORDER OF DISMISSAL. BACKGROUND. Plaintiff U.S....
8,I53552890fb1e11e790b3a4cf54beb9bd,[On Appeal],"[{'headtext': 'SUMMARY ORDER', 'paragraphs': [...","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0]",SUMMARY ORDER. Petitioner-appellant Chauncey M...
9,I916de920dad811e7929ecf6e705a87cd,[On Appeal],"[{'headtext': '', 'paragraphs': [' Plaintiffs ...","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0]",. Plaintiffs appeal a judgment dismissing a. F...


In [14]:
X_train, X_test, y_train, y_test = train_test_split(df[["documentId", "cleaned_text"]], df["label_ids"], test_size=0.2, random_state=seed)
y_train = np.stack(y_train, axis=0)
y_test = np.stack(y_test, axis=0)
y_pred = [['Appellate Review']] * len(y_test)

In [15]:
multilabel_confusion_matrix(y_true=y_test, y_pred=mlb.transform(y_pred))

array([[[0, 1],
        [0, 1]],

       [[2, 0],
        [0, 0]],

       [[2, 0],
        [0, 0]],

       [[1, 0],
        [1, 0]],

       [[2, 0],
        [0, 0]],

       [[1, 0],
        [1, 0]]], dtype=int64)

In [16]:
f1_score(y_true=y_test, y_pred=mlb.transform(y_pred), average='weighted')

  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


0.2222222222222222
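The weighted F1 of about 0.22 can be traced back to the confusion matrices. The sketch below uses one label assignment that is consistent with those matrices (reconstructed for illustration, not read from the dataset) and recomputes the score by hand: only class 0 ("Appellate Review") has a non-zero F1, and it accounts for one of the three positive labels in the test split.

```python
import numpy as np
from sklearn.metrics import f1_score

# One label assignment consistent with the confusion matrices above
# (reconstructed for illustration, not read from the dataset).
y_true = np.array([
    [1, 0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 0],
])
# The baseline predicts only class 0 for every document
y_base = np.array([
    [1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
])

# Per-class F1, then average weighted by the number of true instances per class
per_class = f1_score(y_true, y_base, average=None, zero_division=0)
support = y_true.sum(axis=0)
weighted = float((per_class * support).sum() / support.sum())

print(per_class)  # only class 0 scores above zero
print(weighted)
print(f1_score(y_true, y_base, average="weighted", zero_division=0))
```

Passing `zero_division=0` makes the treatment of classes without true or predicted samples explicit instead of relying on the warning.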

### Load the model and create a tokenized dataset

In [17]:
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-large')
model = AutoModelForSequenceClassification.from_pretrained('poltextlab/xlm-roberta-large-english-legal-cap',
                                                           num_labels= len(mlb.classes_),
                                                           problem_type="multi_label_classification",
                                                           ignore_mismatched_sizes=True)

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at poltextlab/xlm-roberta-large-english-legal-cap and are newly initialized because the shapes did not match:
- classifier.out_proj.weight: found shape torch.Size([22, 1024]) in the checkpoint and torch.Size([6, 1024]) in the model instantiated
- classifier.out_proj.bias: found shape torch.Size([22]) in the checkpoint and torch.Size([6]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [18]:
def tokenize(examples):
    encoding = tokenizer(examples["cleaned_text"], padding="max_length", truncation=True, max_length=512)    
    
    return encoding

In [19]:

train_dataset = Dataset.from_pandas(df)


In [20]:
tokenized_datasets = train_dataset.map(tokenize)


Map: 100%|██████████| 10/10 [00:00<00:00, 625.02 examples/s]


In [21]:
tokenized_datasets = tokenized_datasets.train_test_split(test_size=0.2, seed=seed)  # fixed seed for a reproducible split

In [22]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")

### Train

In [25]:

training_args = TrainingArguments(
    per_device_train_batch_size=3,
    output_dir='./output', 
    save_total_limit=2,
    evaluation_strategy="steps",
    eval_steps=500,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)

trainer.train()

  0%|          | 0/9 [00:00<?, ?it/s]You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
 22%|██▏       | 2/9 [00:25<01:30, 12.89s/it]

### Evaluation

In [None]:
results = trainer.predict(tokenized_datasets["test"])

100%|██████████| 1/1 [00:00<00:00, 999.60it/s]


In [None]:
np.argmax(results.predictions, axis=1)  # index of the highest-scoring label for each example


In [None]:
(results.predictions, results.label_ids)

(array([[-1.3941845 , -1.6327689 , -1.4802446 ,  0.1399697 , -1.4017375 ,
         -1.8464881 ],
        [-0.3776071 , -1.4899278 , -1.7351217 , -0.65194833, -1.6428034 ,
         -0.24245712]], dtype=float32),
 array([[0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 1., 0., 0.]], dtype=float32))
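Because this is a multi-label problem, the raw logits are usually turned into label sets by applying a sigmoid to each label independently and thresholding the result. The sketch below does this for the logits shown above; the helper name `logits_to_labels` and the 0.5 threshold are assumptions, not something fixed by the notebook.

```python
import numpy as np

# Logits copied from the prediction output above
logits = np.array([
    [-1.3941845, -1.6327689, -1.4802446, 0.1399697, -1.4017375, -1.8464881],
    [-0.3776071, -1.4899278, -1.7351217, -0.65194833, -1.6428034, -0.24245712],
])


def logits_to_labels(logits, threshold=0.5):
    """Apply a sigmoid per label and keep labels whose probability exceeds the threshold."""
    probabilities = 1.0 / (1.0 + np.exp(-logits))
    return (probabilities > threshold).astype(int)


predicted = logits_to_labels(logits)
print(predicted)
```

With a 0.5 threshold, a label is predicted exactly when its logit is positive, so the first document gets only the label at index 3 and the second gets no label at all. The resulting multi-hot rows should be usable with `mlb.inverse_transform` to recover the posture names.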

### Baseline
From question 1 we observed:
- Most common class: Appellate Review
- Most common number of labels: 1

Therefore, the baseline is to predict "Appellate Review" for every document.

### Question

- The representativeness of the data
  - Does the distribution of the labels reflect the real world?

- Many classes have only one example, for instance:
  ```
  Application for Bankruptcy Trustee Fees
  Declinatory Exception of Improper Venue
  Declinatory Exception of Insufficiency of Service of Process
  Declinatory Exception of Lack of Personal Jurisdiction
  Dilatory Exception of Unauthorized Use of Summary Proceeding
  Joinder
  Motion Authorizing and Approving Payment of Certain Prepetition Obligations
  Motion for Abandonment of Property
  Motion for Adequate Protection
  Motion for Appointment of an Expert
  Motion for Contempt for Violating Discharge Injunction or Order
  Objection to Disclosure Statement
  Peremptory Exception of Nonjoinder of a Party
  Petition for Legal Separation
  Petition for Special Action
  Petition to Prevent Relocation
  ```
  - It makes sense to collect more examples for these classes; just bear in mind to keep the label distribution of the training data in line with the real world.
  - There are 224 classes in this problem, and some classes are related to each other. For example, `'Declinatory Exception of Improper Venue', 'Declinatory Exception of Insufficiency of Service of Process', 'Declinatory Exception of Lack of Personal Jurisdiction'` are all declinatory exceptions.
    - Intuitively it makes sense to split those labels into a separate problem. However, a more end-to-end approach would model the label relations between them, as explored [in this paper](https://openaccess.thecvf.com/content_cvpr_2016/papers/Hu_Learning_Structured_Inference_CVPR_2016_paper.pdf) and [this library](http://scikit.ml/labelrelations.html).
    - This could alleviate the problem that many classes have only one example.
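As a rough illustration of how related postures could be grouped before modelling label relations, the sketch below derives a coarse parent label from a few of the postures listed above. The `coarse_label` helper and its prefix rule are a hypothetical heuristic, not a method used in this notebook or taken from the linked references.

```python
import re
from collections import defaultdict

# A few of the single-example postures listed above
postures = [
    "Declinatory Exception of Improper Venue",
    "Declinatory Exception of Insufficiency of Service of Process",
    "Declinatory Exception of Lack of Personal Jurisdiction",
    "Motion for Abandonment of Property",
    "Motion for Adequate Protection",
    "Petition for Legal Separation",
    "Petition to Prevent Relocation",
    "Joinder",
]


def coarse_label(posture):
    """Hypothetical heuristic: keep the text before the first 'of', 'for' or 'to'."""
    return re.split(r"\s+(?:of|for|to)\s+", posture, maxsplit=1)[0]


# Collect the fine-grained postures under their coarse parent
groups = defaultdict(list)
for posture in postures:
    groups[coarse_label(posture)].append(posture)

for parent, children in groups.items():
    print(f"{parent}: {len(children)}")
```

A coarse label with several children gives the model more than one example to learn from, even when each fine-grained posture appears only once. A domain expert would likely define the grouping more reliably than a string rule.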