# Legal Text Classification

### Import Statements

In [1]:
# !pip install datasets

In [2]:
import pandas as pd
import numpy as np
import datasets
import torch
import transformers
import random

random.seed(10)

In [3]:
print(torch.__version__)

1.12.1+cu116


In [4]:
print(transformers.__version__)

4.21.3


In [5]:
# Load the dataset from a local CSV file
df = pd.read_csv("legal_text_classification.csv")

In [6]:
df.shape

(24985, 4)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24985 entries, 0 to 24984
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   case_id       24985 non-null  object
 1   case_outcome  24985 non-null  object
 2   case_title    24985 non-null  object
 3   case_text     24809 non-null  object
dtypes: object(4)
memory usage: 780.9+ KB


In [8]:
df.isna().sum()

case_id           0
case_outcome      0
case_title        0
case_text       176
dtype: int64

In [9]:
df.loc[df['case_text'].isna(), :].head(1)

Unnamed: 0,case_id,case_outcome,case_title,case_text
24,Case29,followed,Elderslie Finance Corp Ltd v Australian Securi...,


In [10]:
df.head()

Unnamed: 0,case_id,case_outcome,case_title,case_text
0,Case1,cited,Alpine Hardwood (Aust) Pty Ltd v Hardys Pty Lt...,Ordinarily that discretion will be exercised s...
1,Case2,cited,Black v Lipovac [1998] FCA 699 ; (1998) 217 AL...,The general principles governing the exercise ...
2,Case3,cited,Colgate Palmolive Co v Cussons Pty Ltd (1993) ...,Ordinarily that discretion will be exercised s...
3,Case4,cited,Dais Studio Pty Ltd v Bullett Creative Pty Ltd...,The general principles governing the exercise ...
4,Case5,cited,Dr Martens Australia Pty Ltd v Figgins Holding...,The preceding general principles inform the ex...


In [11]:
df.loc[1, 'case_text']

'The general principles governing the exercise of the discretion to award indemnity costs after rejection by an unsuccessful party of a so called Calderbank letter were set out in the judgment of the Full Court in Black v Lipovac [1998] FCA 699 ; (1998) 217 ALR 386. In summary those principles are: 1. Mere refusal of a "Calderbank offer" does not itself warrant an order for indemnity costs. In this connection it may be noted that Jessup J in Dais Studio Pty Ltd v Bullet Creative Pty Ltd [2008] FCA 42 said that (at [6]): if the rejection of such an offer is to ground a claim for indemnity costs, it must be by reason of some circumstance other than that the offer happened to comply with the Calderbank principle. 2. To obtain an order for indemnity costs the offeror must show that the refusal to accept it was unreasonable. 3. The reasonableness of the conduct of the offeree is to be viewed in the light of the circumstances that existed when the offer was rejected.'

Observations from data exploration: 176 rows have a missing case_text, but case_title is never missing. We can therefore merge case_title and case_text into a single text column, which we will pass to training. case_outcome is the label (target) column.

In [12]:
# Case id is not useful
df = df.drop(columns=['case_id'])

In [13]:
df['case_outcome'].value_counts()/df.shape[0]

cited            0.489053
referred to      0.175465
applied          0.097979
followed         0.090294
considered       0.068521
discussed        0.040985
distinguished    0.024335
related          0.004523
affirmed         0.004523
approved         0.004323
Name: case_outcome, dtype: float64

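Because some classes are very rare (three of them account for less than 0.5% of the rows each), we should do a stratified split on the label column, case_outcome, so that every class appears in both train and test.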
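To make the case for stratification concrete, here is a rough sketch of how many test examples each class would contribute to a 10% hold-out. The fractions are copied from the `value_counts()` output above; the helper name `expected_test_counts` is made up for illustration.

```python
# Class fractions copied from the value_counts() output above.
class_fractions = {
    "cited": 0.489053,
    "referred to": 0.175465,
    "applied": 0.097979,
    "followed": 0.090294,
    "considered": 0.068521,
    "discussed": 0.040985,
    "distinguished": 0.024335,
    "related": 0.004523,
    "affirmed": 0.004523,
    "approved": 0.004323,
}
n_rows = 24985   # from df.shape
test_size = 0.1  # 10% hold-out


def expected_test_counts(fractions, n_rows, test_size):
    """Approximate number of test examples per class under a stratified split."""
    return {
        label: round(frac * n_rows * test_size)
        for label, frac in fractions.items()
    }


counts = expected_test_counts(class_fractions, n_rows, test_size)
for label, count in counts.items():
    print(f"{label:15s} ~{count}")
```

The three rarest classes end up with only about 11 test examples each, so an unstratified random split could easily leave one of them badly under-represented.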

### Preprocessing

In [14]:
# Convert the pandas DataFrame to a Hugging Face dataset
df = df.rename(columns={'case_outcome': 'label'})
data = datasets.Dataset.from_pandas(df)
data = data.class_encode_column("label")

# Perform a stratified train-test split (90% train, 10% test); some classes are very rare, so stratify on the label
data = data.train_test_split(test_size=0.1, stratify_by_column='label', seed=10)


num_classes = data['train'].features['label'].num_classes
id2label = {i:data['train'].features['label'].int2str(i) for i in range(num_classes)}
label2id = {label:i for (i,label) in id2label.items()}

Casting to class labels:   0%|          | 0/25 [00:00<?, ?ba/s]

Casting the dataset:   0%|          | 0/3 [00:00<?, ?ba/s]

In [15]:
data

DatasetDict({
    train: Dataset({
        features: ['label', 'case_title', 'case_text'],
        num_rows: 22486
    })
    test: Dataset({
        features: ['label', 'case_title', 'case_text'],
        num_rows: 2499
    })
})

In [16]:
id2label

{0: 'affirmed',
 1: 'applied',
 2: 'approved',
 3: 'cited',
 4: 'considered',
 5: 'discussed',
 6: 'distinguished',
 7: 'followed',
 8: 'referred to',
 9: 'related'}

In [17]:
data['train'].features['label']

ClassLabel(num_classes=10, names=['affirmed', 'applied', 'approved', 'cited', 'considered', 'discussed', 'distinguished', 'followed', 'referred to', 'related'], id=None)

### Feature Engineering

In [18]:
# Merge case_title and case_text into one column, since both may contain useful textual information.

def merge_title_text(example):
    # case_text can be missing (None); in that case keep only the title
    text = "Case Title: " + example['case_title']
    if example['case_text'] is not None:
        text += " Case Text: " + example['case_text']
    example['text'] = text
    return example

In [19]:
data = data.map(merge_title_text)

  0%|          | 0/22486 [00:00<?, ?ex/s]

  0%|          | 0/2499 [00:00<?, ?ex/s]

In [20]:
data

DatasetDict({
    train: Dataset({
        features: ['label', 'case_title', 'case_text', 'text'],
        num_rows: 22486
    })
    test: Dataset({
        features: ['label', 'case_title', 'case_text', 'text'],
        num_rows: 2499
    })
})

In [21]:
# We no longer need case_title and case_text, so remove them
data = data.remove_columns(["case_title", "case_text"])

In [22]:
data

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 22486
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 2499
    })
})

In [23]:
print(data['train']['text'][0])

Case Title: Comandate Marine Corporation v The Ship "Boomerang I" [2006] FCAFC 106 ; (2006) 151 FCR 403 Case Text: course, there is an incongruity in this approach because it ignores the rights of a secured creditor (other than a holder of a maritime lien recognised in s 15) such as a mortgagee and instead prefers those of a co-owner. Thus, if a vessel is co-owned it would not be able to be arrested under s 19 if one co-owner were not a relevant person under s 19(a), but a mortgagee cannot escape the amenability of the vessel to arrest. But this is the consequence of the legislative choice of selecting, as the criterion for actuating the right defined in s 19(b), the "owner", and not extending this to secured creditors or demise charterers: cf Comandate Marine Corporation v The Ship "Boomerang I" [2006] FCAFC 106 ; (2006) 151 FCR 403. As Allsop J observed, the wide group of categories identified in s 19(a) is then "limited to the more narrow funnel in para (b) ...": " Boomerang I " 151

### Tokenization

In [24]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

In [25]:
# Truncate input text so it does not exceed DistilBERT's maximum input length (512 tokens)
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)
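Legal texts can be long, so it may be worth checking how much of the data is actually cut off at 512 tokens. The following is a minimal sketch, assuming you have a list of per-example token counts; the helper `truncation_share` and the sample lengths are hypothetical.

```python
def truncation_share(token_lengths, max_length=512):
    """Fraction of examples whose token count exceeds max_length."""
    if not token_lengths:
        return 0.0
    truncated = sum(1 for n in token_lengths if n > max_length)
    return truncated / len(token_lengths)


# Hypothetical token counts. In the notebook they could be obtained with
# [len(ids) for ids in tokenizer(texts)["input_ids"]], i.e. without truncation.
example_lengths = [120, 480, 512, 700, 1500]
print(truncation_share(example_lengths))  # 2 of the 5 lengths exceed 512
```

If a large share of examples is truncated, the model never sees the tail of those texts, which may matter for a task like this one.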

In [26]:
data = data.map(preprocess_function, batched=True)

  0%|          | 0/23 [00:00<?, ?ba/s]

  0%|          | 0/3 [00:00<?, ?ba/s]

In [27]:
data

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 22486
    })
    test: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 2499
    })
})

In [28]:
from transformers import DataCollatorWithPadding

# For padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
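The collator pads each batch to the length of its longest sequence instead of padding everything to 512 up front. The pure-Python sketch below only illustrates that idea; it is not the library's implementation, `pad_batch` is a made-up helper, and the token ids are illustrative (a pad id of 0 is assumed).

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad every sequence to the longest one in the batch and build attention masks."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding that the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two illustrative sequences of different lengths
batch = pad_batch([[101, 2553, 102], [101, 2553, 2516, 1024, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch keeps short batches small, which is why the tokenization step above does not pad at all.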

In [29]:
type(data['train']['label'])

list

In [30]:
set_outcome = list(set(data['train']['label']))

counts = [0]*len(set_outcome)

list(map(lambda x, y: {y: (x+data['train']['label'].count(y))/len(data['train']['label'])}, counts, set_outcome))

[{0: 0.004536155830294405},
 {1: 0.0979720715111625},
 {2: 0.004313795250378013},
 {3: 0.4890598594681135},
 {4: 0.06853153073023215},
 {5: 0.04100329093658276},
 {6: 0.024326247442853333},
 {7: 0.09027839544605533},
 {8: 0.17544249755403363},
 {9: 0.004536155830294405}]

In [31]:
list(map(lambda x, y: {y: (x+data['test']['label'].count(y))/len(data['test']['label'])}, counts, set_outcome))

[{0: 0.004401760704281713},
 {1: 0.09803921568627451},
 {2: 0.004401760704281713},
 {3: 0.4889955982392957},
 {4: 0.06842737094837935},
 {5: 0.04081632653061224},
 {6: 0.024409763905562223},
 {7: 0.09043617446978791},
 {8: 0.1756702681072429},
 {9: 0.004401760704281713}]

So train and test have nearly identical class proportions, which confirms that the stratified split worked. This is just a sanity check before we move on to fine-tuning.
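The same check can be written more compactly with `collections.Counter`, which counts all labels in one pass. This is a sketch with toy labels; `label_distribution` and `max_abs_gap` are hypothetical helper names, and in the notebook you would pass `data['train']['label']` and `data['test']['label']` instead.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: fraction of examples}, ordered by label id."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in sorted(counts)}


def max_abs_gap(dist_a, dist_b):
    """Largest absolute difference in class share between two distributions."""
    labels = set(dist_a) | set(dist_b)
    return max(abs(dist_a.get(l, 0.0) - dist_b.get(l, 0.0)) for l in labels)


# Toy labels for illustration only
train_labels = [0, 0, 0, 1, 1, 2, 0, 0]
test_labels = [0, 0, 1, 2]

train_dist = label_distribution(train_labels)
test_dist = label_distribution(test_labels)
print(train_dist)
print(test_dist)
print(max_abs_gap(train_dist, test_dist))
```

A single number such as the maximum gap makes it easy to assert that the split is balanced rather than comparing two lists by eye.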

### Finetuning

In [32]:
# !pip install evaluate

In [33]:
import evaluate

accuracy = evaluate.load("accuracy")
precision = evaluate.load("precision")
recall = evaluate.load("recall")
f1 = evaluate.load("f1")

# clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

In [34]:
import numpy as np

# Report accuracy together with weighted precision, recall, and F1
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # predictions = np.argmax(predictions, axis=1)
    predictions = predictions.argmax(axis=-1)
    return {'accuracy': accuracy.compute(predictions=predictions, references=labels)['accuracy'],
            'precision': precision.compute(predictions=predictions, references=labels, average="weighted")['precision'],
            'recall': recall.compute(predictions=predictions, references=labels, average="weighted")['recall'],
            'f1': f1.compute(predictions=predictions, references=labels, average="weighted")['f1']}
    # return clf_metrics.compute(predictions=predictions, references=labels, average='weighted')
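With `average="weighted"`, each class contributes in proportion to its support, so the dominant class largely determines the score (and weighted recall is equal to accuracy). The sketch below contrasts weighted and macro F1 on toy data in plain Python; the helper names are made up, and it is meant as an illustration of the averaging, not a replacement for the `evaluate` metrics.

```python
def per_class_f1(labels, predictions):
    """Return {class: (f1, support)} from true labels and predictions."""
    classes = sorted(set(labels) | set(predictions))
    scores = {}
    for c in classes:
        tp = sum(1 for y, p in zip(labels, predictions) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, predictions) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, predictions) if y == c and p != c)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        scores[c] = (f1, tp + fn)  # support = number of true examples of c
    return scores


def macro_f1(labels, predictions):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(labels, predictions)
    return sum(f1 for f1, _ in scores.values()) / len(scores)


def weighted_f1(labels, predictions):
    """Mean of per-class F1 weighted by class support."""
    scores = per_class_f1(labels, predictions)
    total = sum(support for _, support in scores.values())
    return sum(f1 * support for f1, support in scores.values()) / total


# Toy example: a model that always predicts the majority class 0
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
print("macro F1:   ", round(macro_f1(y_true, y_pred), 4))
print("weighted F1:", round(weighted_f1(y_true, y_pred), 4))
```

On this toy data the weighted score looks respectable although the minority class is never predicted, while the macro score exposes the problem. Given the class imbalance in this dataset, also reporting `average="macro"` could be informative.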

In [35]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# DistilBERT because it is small and fits easily in memory
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=10, id2label=id2label, label2id=label2id
)

Some weights of the model checkpoint at distilbert/distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.we

In [36]:
# !pip install accelerate

Fine-tuning

In [37]:
training_args = TrainingArguments(
    output_dir="finetuned_model",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=data["train"],
    eval_dataset=data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 22486
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 14055
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


wandb: Currently logged in as: hiteshsom. Use `wandb login --relogin` to force relogin


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,1.4823,1.432102,0.499,0.414843,0.499,0.384755
2,1.3188,1.342319,0.528211,0.481895,0.528211,0.446245
3,1.1717,1.3197,0.534614,0.495936,0.534614,0.50568
4,0.9786,1.362533,0.558623,0.512348,0.558623,0.52434
5,0.8278,1.398878,0.556222,0.522594,0.556222,0.53473


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 2499
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to finetuned_model/checkpoint-2811
Configuration saved in finetuned_model/checkpoint-2811/config.json
Model weights saved in finetuned_model/checkpoint-2811/pytorch_model.bin
tokenizer config file saved in finetuned_model/checkpoint-2811/tokenizer_config.json
Special tokens file saved in finetuned_model/checkpoint-2811/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  yo

TrainOutput(global_step=14055, training_loss=1.183310688606406, metrics={'train_runtime': 2454.3009, 'train_samples_per_second': 45.809, 'train_steps_per_second': 5.727, 'total_flos': 1.47874249850496e+16, 'train_loss': 1.183310688606406, 'epoch': 5.0})
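The step counts in the log can be reproduced by hand. The sketch below assumes a single device and no gradient accumulation, as reported in the log; `total_optimization_steps` is a hypothetical helper.

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Steps per epoch and in total, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be partial
    return steps_per_epoch, steps_per_epoch * num_epochs


per_epoch, total = total_optimization_steps(num_examples=22486, batch_size=8, num_epochs=5)
print(per_epoch, total)
```

This gives 2811 steps per epoch, matching the first checkpoint name `checkpoint-2811`, and 14055 steps in total, matching `global_step=14055`.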

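Earlier training for 1 epoch: Accuracy = 0.49, Precision = 0.33, Recall = 0.49, F1 = 0.34

Final training for 5 epochs: Accuracy = 0.56, Precision = 0.52, Recall = 0.56, F1 = 0.53. Validation loss is lowest at epoch 3 and rises afterwards while training loss keeps falling, which suggests the model starts to overfit.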
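Which epoch counts as "best" depends on the metric. The sketch below picks the best epoch per metric from the validation numbers in the training log above; the `history` structure and `best_epoch` helper are made up for illustration.

```python
# Per-epoch validation metrics copied from the training log above
history = [
    {"epoch": 1, "val_loss": 1.432102, "accuracy": 0.499, "f1": 0.384755},
    {"epoch": 2, "val_loss": 1.342319, "accuracy": 0.528211, "f1": 0.446245},
    {"epoch": 3, "val_loss": 1.3197, "accuracy": 0.534614, "f1": 0.50568},
    {"epoch": 4, "val_loss": 1.362533, "accuracy": 0.558623, "f1": 0.52434},
    {"epoch": 5, "val_loss": 1.398878, "accuracy": 0.556222, "f1": 0.53473},
]


def best_epoch(history, key, higher_is_better=True):
    """Return the epoch number with the best value for the given metric."""
    pick = max if higher_is_better else min
    return pick(history, key=lambda row: row[key])["epoch"]


print("lowest validation loss:", best_epoch(history, "val_loss", higher_is_better=False))
print("highest accuracy:      ", best_epoch(history, "accuracy"))
print("highest F1:            ", best_epoch(history, "f1"))
```

The three criteria point at epochs 3, 4, and 5 respectively. If you want the `Trainer` to keep the best checkpoint rather than the last one, `TrainingArguments` has `load_best_model_at_end` and `metric_for_best_model` options that may be worth trying.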

### Saving model for backup

In [38]:
model.save_pretrained('finetuned_model_backup')
tokenizer.save_pretrained('finetuned_model_backup')

Configuration saved in finetuned_model_backup/config.json
Model weights saved in finetuned_model_backup/pytorch_model.bin
tokenizer config file saved in finetuned_model_backup/tokenizer_config.json
Special tokens file saved in finetuned_model_backup/special_tokens_map.json


('finetuned_model_backup/tokenizer_config.json',
 'finetuned_model_backup/special_tokens_map.json',
 'finetuned_model_backup/vocab.txt',
 'finetuned_model_backup/added_tokens.json',
 'finetuned_model_backup/tokenizer.json')
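To use the saved model later, inputs should be formatted the same way as during training. The following is a minimal sketch, assuming the backup directory above exists; `format_input` and `load_classifier` are hypothetical helpers, and `format_input` mirrors `merge_title_text`.

```python
def format_input(case_title, case_text=None):
    """Build the model input in the same format used for training."""
    text = "Case Title: " + case_title
    if case_text is not None:
        text += " Case Text: " + case_text
    return text


def load_classifier(model_dir="finetuned_model_backup"):
    """Build a text-classification pipeline from the saved directory."""
    # Imported lazily so format_input works without transformers installed
    from transformers import pipeline
    return pipeline("text-classification", model=model_dir, tokenizer=model_dir)


print(format_input("Black v Lipovac [1998] FCA 699"))

# Usage (requires the saved model directory):
# classifier = load_classifier()
# classifier(format_input("Black v Lipovac [1998] FCA 699"), truncation=True)
```

Because `id2label` was passed when the model was created, the pipeline should return label names such as `cited` rather than raw class ids.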