## TR question 2

Note: Run the `tr_q1` notebook first; this notebook depends on it.

The notebook is organized as follows:
- Split the data into train, validation, and test sets
- Define a baseline
- Use off-the-shelf models
- Use model

In [41]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoModelForSequenceClassification
from sklearn.model_selection import train_test_split
from sklearn.metrics import multilabel_confusion_matrix
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np

import pandas as pd
import os

seed = 42

### Split into train, validation, and test sets

In [53]:
file_path = 'TRDataChallenge2023.zip'
extract_file_path = 'TRDataChallenge2023'
df = pd.read_json(os.path.join(extract_file_path, f"{extract_file_path}.txt"), lines=True)

mlb = MultiLabelBinarizer()
labels = pd.Series(list(mlb.fit_transform(df["postures"].values)), name="labels")

In [54]:
df = pd.concat([df, labels], axis=1)

In [55]:
df.head()

| | documentId | postures | sections | labels |
|---|---|---|---|---|
| 0 | Ib4e590e0a55f11e8a5d58a2c8dcb28b5 | [On Appeal] | [{'headtext': '', 'paragraphs': ['Plaintiff Dw... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 1 | Ib06ab4d056a011e98c7a8e995225dbf9 | [Appellate Review, Sentencing or Penalty Phase... | [{'headtext': '', 'paragraphs': ['After pleadi... | [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 2 | Iaa3e3390b93111e9ba33b03ae9101fb2 | [Motion to Compel Arbitration, On Appeal] | [{'headtext': '', 'paragraphs': ['Frederick Gr... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 3 | I0d4dffc381b711e280719c3f0e80bdd0 | [On Appeal, Review of Administrative Decision] | [{'headtext': '', 'paragraphs': ['Appeal from ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 4 | I82c7ef10d6d111e8aec5b23c3317c9c0 | [On Appeal] | [{'headtext': '', 'paragraphs': ['Order, Supre... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |


Normally we call `fit_transform` on the training set and `transform` on the test set. Here, however, I call `fit_transform` on the whole dataset so that the binarizer covers every label, because some labels have only one instance (see the first notebook).
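
To make the trade-off concrete, here is a small sketch on made-up label lists (the toy `train_labels` and `test_labels` are assumptions, not rows from the dataset). A binarizer fitted only on the training labels warns about and then ignores any label it has not seen, whereas one fitted on all labels keeps a column for it.

```python
import warnings

from sklearn.preprocessing import MultiLabelBinarizer

# Toy label lists (hypothetical); "Joinder" appears only in the test portion.
train_labels = [["On Appeal"], ["Appellate Review", "On Appeal"]]
test_labels = [["Joinder"], ["On Appeal"]]

# Fit on the training labels only: the unseen label is dropped with a warning.
mlb_train_only = MultiLabelBinarizer().fit(train_labels)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the "unknown class(es)" warning
    test_train_only = mlb_train_only.transform(test_labels)

# Fit on all labels: every label gets its own column.
mlb_all = MultiLabelBinarizer().fit(train_labels + test_labels)
test_all = mlb_all.transform(test_labels)

print(mlb_train_only.classes_, test_train_only.tolist())
print(mlb_all.classes_, test_all.tolist())
```

In the first case the "Joinder" document ends up with an all-zero label row, which is the situation that fitting on the whole dataset avoids.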

In [56]:
X_train, X_test, y_train, y_test = train_test_split(df[["documentId", "sections"]], df["labels"], test_size=0.2, random_state=seed)
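
The cell above produces only a train and a test split. If a validation set is also wanted, one option is to split the training portion a second time. The sketch below runs on a toy DataFrame; the 60/20/20 proportions and the names `X_val` and `y_val` are assumptions, not choices made in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

seed = 42

# Toy stand-in for the real DataFrame (hypothetical data).
toy = pd.DataFrame({"documentId": range(100), "labels": [[i % 2] for i in range(100)]})

# First split: hold out 20% of all rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    toy[["documentId"]], toy["labels"], test_size=0.2, random_state=seed
)

# Second split: 25% of the remaining 80% becomes validation (20% of the total).
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=seed
)

sizes = (len(X_train), len(X_val), len(X_test))
print(sizes)
```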

### Choosing a loss function
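This is a multi-label classification problem: a document can carry several postures at once. Each label therefore gets its own sigmoid output, trained with binary cross-entropy, instead of a softmax over all classes.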
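
As an illustration of that choice, the following NumPy sketch computes independent sigmoid probabilities and a numerically stable binary cross-entropy on logits. The helper names `sigmoid` and `bce_with_logits`, the toy logits, and the 0.5 decision threshold are assumptions made for the example, not part of the notebook.

```python
import numpy as np


def sigmoid(logits):
    """Element-wise sigmoid: each label gets an independent probability."""
    return 1.0 / (1.0 + np.exp(-logits))


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly on logits (stable form)."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    per_label = (
        np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    )
    return per_label.mean()


# One toy document with three candidate labels (hypothetical values).
logits = np.array([[2.0, -1.0, 0.0]])
targets = np.array([[1, 0, 1]])

probs = sigmoid(logits)
preds = (probs >= 0.5).astype(int)  # threshold each label independently
loss = bce_with_logits(logits, targets)
print(preds, loss)
```

Because every label is thresholded on its own, a document can be assigned zero, one, or several postures, which a softmax over classes would not allow.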




### Baseline
From the first question we know:
- Most common class: Appellate Review
- Most common number of labels: 1

Therefore, the baseline predicts "Appellate Review" for every document.

In [63]:
y_pred = [['Appellate Review']] * len(y_test)

In [65]:
mlb.transform(y_pred)

array([[1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]])

In [59]:
mlb.transform(y_pred)



array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [58]:
multilabel_confusion_matrix(y_true=y_test, y_pred=mlb.transform(y_pred))



ValueError: Classification metrics can't handle a mix of unknown and multilabel-indicator targets
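
The `ValueError` above most likely arises because `y_test` is a pandas Series whose elements are arrays, so scikit-learn sees an object-typed 1-D input ("unknown") rather than a 2-D indicator matrix. A minimal sketch of one possible fix, on toy data rather than the real split, is to stack the rows into a single 2-D array before calling the metric:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import multilabel_confusion_matrix
from sklearn.preprocessing import MultiLabelBinarizer

# Toy postures (hypothetical), binarized the same way as in the notebook.
toy_postures = [["Appellate Review"], ["On Appeal"], ["Appellate Review", "On Appeal"]]
mlb = MultiLabelBinarizer()
y_test = pd.Series(list(mlb.fit_transform(toy_postures)), name="labels")

# Baseline: predict the most common class for every document.
y_pred = [["Appellate Review"]] * len(y_test)

# Stack the per-row arrays into one (n_samples, n_labels) matrix.
y_true_2d = np.vstack(y_test.tolist())
cm = multilabel_confusion_matrix(y_true=y_true_2d, y_pred=mlb.transform(y_pred))
print(cm.shape)
```

The result holds one 2x2 matrix per label, laid out as `[[tn, fp], [fn, tp]]`.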

### Use off-the-shelf models

In [None]:
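# Use a sequence-classification head (a masked-LM head has no classifier), with one output per posture label
from transformers import AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("lexlms/legal-roberta-large")
model = AutoModelForSequenceClassification.from_pretrained("lexlms/legal-roberta-large", problem_type="multi_label_classification", num_labels=len(mlb.classes_))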

In [None]:
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")
# num_labels must match the width of the binarized label vectors
model = AutoModelForSequenceClassification.from_pretrained("nlpaueb/legal-bert-base-uncased", problem_type="multi_label_classification", num_labels=len(mlb.classes_))
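
Before either tokenizer can be applied, each document's `sections` field (a list of dicts with `headtext` and `paragraphs`, as shown in the `df.head()` output) has to be flattened into a single string. The helper below is a hypothetical sketch of one way to do that; the name `sections_to_text` and the choice of newline as separator are assumptions.

```python
def sections_to_text(sections):
    """Join the headings and paragraphs of one document into a single string."""
    parts = []
    for section in sections:
        head = section.get("headtext", "")
        if head:  # skip empty headings
            parts.append(head)
        parts.extend(section.get("paragraphs", []))
    return "\n".join(parts)


# Toy document in the same shape as the `sections` column (hypothetical text).
toy_sections = [
    {"headtext": "", "paragraphs": ["First paragraph.", "Second paragraph."]},
    {"headtext": "Background", "paragraphs": ["Third paragraph."]},
]
text = sections_to_text(toy_sections)
print(text)
```

The resulting string could then be passed to the tokenizer, for example with `truncation=True`, since these documents may be longer than the models' maximum input length.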

Questions
- How representative is the data?
  - Does the distribution of the labels reflect the real world?

- We note that the following classes have only one example:
  ```
  Application for Bankruptcy Trustee Fees
  Declinatory Exception of Improper Venue
  Declinatory Exception of Insufficiency of Service of Process
  Declinatory Exception of Lack of Personal Jurisdiction
  Dilatory Exception of Unauthorized Use of Summary Proceeding
  Joinder
  Motion Authorizing and Approving Payment of Certain Prepetition Obligations
  Motion for Abandonment of Property
  Motion for Adequate Protection
  Motion for Appointment of an Expert
  Motion for Contempt for Violating Discharge Injunction or Order
  Motion for Genetic Testing
  Motion for Leave to File Late or Untimely Notice of Appeal
  Motion for Maritime Attachment and Garnishment
  Motion for Qualified Domestic Relations Order (QDRO)
  Motion for Witness List or Production of Witnesses
  Motion to Admonish Jury
  Motion to Allow Late Filing of Proof of Claim
  Motion to Appoint Chapter 11 Trustee or Examiner
  Motion to Appoint Substitute Custodian of Vessel
  Motion to Approve Disclosure Statement
  Motion to Compel Abandonment
  Motion to Deny Class Certification
  Motion to Determine Tax Liability
  Motion to Enforce Child Custody Decree
  Motion to Extend Claims Bar Date
  Motion to Increase/Reduce Security
  Motion to Reinstate Visitation or Parenting Time
  Motion to Remove a Non-Suit
  Motion to Sell Property Free and Clear of Interests
  Motion to Serve Additional Discovery Requests
  Motion to Set Aside Default Judgment
  Motion to Surcharge Collateral
  Motion to Transfer Guardianship
  Motion to Vacate Arbitration Award
  Motion to Vacate Attachment
  Motion to Vacate Wardship
  Motion to Withdraw Reference
  Motion to Withdraw an Admission
  Objection to Disclosure Statement
  Peremptory Exception of Nonjoinder of a Party
  Petition for Legal Separation
  Petition for Special Action
  Petition to Prevent Relocation
  ```
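
A list like the one above can be produced by counting how often each label occurs. The sketch below uses a toy list of posture lists in place of `df["postures"]`; the data and the variable names are assumptions for illustration.

```python
from collections import Counter

# Toy stand-in for df["postures"] (hypothetical values).
toy_postures = [
    ["On Appeal"],
    ["Appellate Review", "On Appeal"],
    ["Joinder", "On Appeal"],
    ["Motion for Genetic Testing"],
]

# Count every label across all documents, then keep those seen exactly once.
counts = Counter(label for postures in toy_postures for label in postures)
singletons = sorted(label for label, n in counts.items() if n == 1)
print(singletons)
```

Labels with a single example can land in only one side of any split, so a model is either never trained on them or never evaluated on them.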