# How to Generate Docket Narratives for NLP Experiments
> Do you have a better approach to creating a synthetic dataset for this?

- toc: true 
- badges: true
- comments: true
- image: images/social_logo.png
- author: Charles Dobson
- categories: [python, artificial intelligence, machine learning, natural language processing, nlp, law, litigation]

# Introduction

Lately I've been thinking about how natural language processing (NLP) techniques could be applied to the docket narratives lawyers draft for their timesheets. Categorizing dockets according to phase/task codes is an obvious possibility.

But I also wonder about other potential insights, such as a clearer picture of the type of work associates and partners are doing and how that work is distributed among timekeepers. Are associates getting enough opportunities on their feet in court or conducting examinations? Are senior associates doing too much research that is better suited to junior associates? To what extent are partners doing "associate" work and vice versa?

There is potentially a wealth of information in docket narratives that could be used to improve the operations of a litigation department. However, in most law firms, analyzing these narratives manually would not be practical. Hence my interest in applying NLP techniques to this type of data.

# Generating a Synthetic Dataset of Docket Narratives

Of course, dockets are confidential. For my experiments I need to develop a synthetic dataset. 

Below is a script I wrote to generate a synthetic dataset of docket narratives. It's very simple and, admittedly, only roughly approximates genuine docket narratives. I'm hoping it will be a sufficient place to start.

My first goal will simply be to see if I can train a language model to distinguish between narratives that involve drafting and those that do not. I plan to try a range of supervised learning techniques, approaching this as a classification problem.

This script generates a dataset with two columns of information. The rows in the first column contain the randomly generated docket narratives. The rows in the second column contain either a 0 or a 1. So-called drafting narratives are assigned a 1 and all other narratives are assigned a 0. These will be the labels required to train the language model.
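As a rough illustration of how those labels could feed a supervised classifier, here is a minimal sketch of a word-count Naive Bayes baseline written with only the standard library. It is not the model I ultimately plan to use. The seed, the 1600/400 train/test split, and the helper names (`make_dataset`, `train_naive_bayes`, `predict`) are assumptions made for the example, and the label is derived from the chosen action word rather than from the substring check in the script, which gives the same result for these word lists.

```python
import math
import random
from collections import Counter

# Component lists copied from the script in this post.
ACTIONS = ["writing", "drafting", "editing", "revising", "briefing",
           "reviewing", "analyzing", "preparing", "proofing", "researching"]
OBJECTS = ["notice of motion", "affidavit", "factum", "memorandum", "memo",
           "compendium", "book of authority", "motion record", "analysis", "order"]
SUBJECTS = ["summary judgment", "injunction", "enforcing foreign judgment",
            "motion to strike", "refusals", "disqualifying expert"]
DRAFTING = {"writing", "drafting", "editing", "revising", "briefing"}


def make_dataset(n, seed=0):
    """Return n (narrative, label) pairs; the seed is an assumption for repeatability."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        action = rng.choice(ACTIONS)
        narrative = f"{action} {rng.choice(OBJECTS)} re: {rng.choice(SUBJECTS)}"
        rows.append((narrative, 1 if action in DRAFTING else 0))
    return rows


def train_naive_bayes(rows):
    """Count word frequencies per label for a multinomial Naive Bayes model."""
    word_counts = {0: Counter(), 1: Counter()}
    label_counts = Counter()
    for narrative, label in rows:
        label_counts[label] += 1
        word_counts[label].update(narrative.split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, label_counts, vocab


def predict(model, narrative):
    """Return the label with the higher log-probability, using add-one smoothing."""
    word_counts, label_counts, vocab = model
    total_rows = sum(label_counts.values())
    scores = {}
    for label in (0, 1):
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_rows)
        for word in narrative.split():
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


rows = make_dataset(2000)
train, test = rows[:1600], rows[1600:]  # simple hold-out split (an assumption)
model = train_naive_bayes(train)
accuracy = sum(predict(model, text) == label for text, label in test) / len(test)
print(f"held-out accuracy: {accuracy:.3f}")
```

Because the label depends entirely on the first word of each narrative, I would expect even a baseline like this to score very highly, which may be a sign that the synthetic data is easier than genuine dockets would be.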

# Final Thoughts

Have you worked on this type of problem with similar data? Written anything on it? Is my dataset unsuited for this purpose? Please let me know!

# Script

In [5]:
import random
import pandas as pd

# Lists consisting of the components for the docket narratives
actions = ["writing", "drafting", "editing", "revising", "briefing", "reviewing", "analyzing", "preparing", "proofing", "researching"]

# Named "objects" rather than "object" to avoid shadowing the Python built-in
objects = ["notice of motion", "affidavit", "factum", "memorandum", "memo", "compendium", "book of authority", "motion record", "analysis", "order"]

subjects = ["summary judgment", "injunction", "enforcing foreign judgment", "motion to strike", "refusals", "disqualifying expert"]

dockets = []

# Function to randomly generate a docket narrative and add it to a list containing all of the narratives
def add_docket():
    docket = f"{random.choice(actions)} {random.choice(objects)} re: {random.choice(subjects)}"
    dockets.append(docket)

# Loop to generate the dataset. Edit NUM_DOCKETS to specify the number of docket narratives you desire.
NUM_DOCKETS = 10000
for _ in range(NUM_DOCKETS):
    add_docket()

#Convert list with narratives to DataFrame
df = pd.DataFrame(dockets, columns=['narrative'])

# List containing the drafting words
drafting = ["writing", "drafting", "editing", "revising", "briefing"]

# Function to identify drafting narratives: returns 1 if any drafting word appears in the narrative, otherwise 0
def is_drafting(row):
    for word in drafting:
        if word in row['narrative']:
            return 1
    return 0

# Review each narrative and label drafting narratives with a 1 and all other narratives with a 0
df['drafting'] = df.apply(is_drafting, axis=1)

#Convert DataFrame to CSV file
df.to_csv('dockets')
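
One detail worth knowing about the `to_csv` call above: by default pandas also writes the DataFrame index as a first, unnamed column, which comes back as `Unnamed: 0` if the file is later read with `read_csv`. The sketch below shows the difference using an in-memory buffer and a two-row stand-in DataFrame (both are assumptions for the example, not part of my script).

```python
import io

import pandas as pd

# Tiny stand-in for the full dataset (these two rows are illustrative only).
df = pd.DataFrame({
    "narrative": ["drafting memo re: summary judgment",
                  "reviewing analysis re: disqualifying expert"],
    "drafting": [1, 0],
})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)
print(cols_with_index)  # ['Unnamed: 0', 'narrative', 'drafting']

# With index=False only the two data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)
print(cols_without_index)  # ['narrative', 'drafting']
```

Passing `index=False` is probably the simpler choice if the CSV is only meant to hold the narratives and their labels.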


In [6]:
#Inspect the first 10 rows of the dataset
df.head(10)

| | narrative | drafting |
|---|---|---|
| 0 | analyzing notice of motion re: motion to strike | 0 |
| 1 | drafting memo re: summary judgment | 1 |
| 2 | writing notice of motion re: summary judgment | 1 |
| 3 | briefing motion record re: enforcing foreign j... | 1 |
| 4 | briefing affidavit re: motion to strike | 1 |
| 5 | drafting notice of motion re: summary judgment | 1 |
| 6 | editing memo re: refusals | 1 |
| 7 | reviewing analysis re: disqualifying expert | 0 |
| 8 | editing order re: disqualifying expert | 1 |
| 9 | researching order re: enforcing foreign judgment | 0 |


In [7]:
#Confirm the dataset contains the defined number of rows
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   narrative  10000 non-null  object
 1   drafting   10000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 156.4+ KB


In [8]:
#Inspect the distribution of drafting vs non-drafting dockets
df['drafting'].value_counts()

1    5026
0    4974
Name: drafting, dtype: int64
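
The near-even split is what I would expect: five of the ten action words are drafting words and each action is equally likely, so about half of the narratives should be labelled 1. A quick sanity check of the observed count against that expectation, treating each narrative as an independent coin flip (the binomial model is my assumption here):

```python
import math

n_actions = 10           # length of the actions list
n_drafting = 5           # length of the drafting list
n_rows = 10000           # number of generated narratives
observed_drafting = 5026 # count of 1s reported by value_counts()

p = n_drafting / n_actions                 # probability a narrative is a drafting one
expected = n_rows * p                      # expected number of drafting narratives
std_dev = math.sqrt(n_rows * p * (1 - p))  # binomial standard deviation
z = (observed_drafting - expected) / std_dev

print(expected, std_dev, round(z, 2))  # 5000.0 50.0 0.52
```

An observed count about half a standard deviation above the expected 5,000 is well within ordinary random variation.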