# Advanced Legal Analytics (2021/22)
# LAW 3027 Tutorial 6: Robot Judge

#### Intended Learning Outcomes:
In this notebook you will learn how to train your own robot judge (a machine learning classifier) to predict case outcomes from the Supreme Court judgments dataset introduced in [1] (see the reading below). It is recommended that you read Sections 1 and 2 of the paper.

By the end of this notebook you will know how to:
- Train and evaluate a (legal) prediction model using machine learning classifiers
- Utilize textual features for (legal) prediction of a target variable

#### Reading

Please read at least Sections 1 and 2 of [1]:

[1] Alali, M., Syed, S., Alsayed, M., Patel, S., & Bodala, H. (2021). JUSTICE: A Benchmark Dataset for Supreme Court's Judgment Prediction. arXiv preprint arXiv:2112.03414. The paper is available here: https://arxiv.org/abs/2112.03414

[2] Analyzing Documents with TF-IDF: https://programminghistorian.org/en/lessons/analyzing-documents-with-tfid


#### Libraries to be used:
You can activate the environment you used previously, though you will not need most of its packages. In this tutorial we will only use common Python libraries such as `pandas`, `numpy`, `matplotlib`, `scipy` and `seaborn`.

We will also use Scikit-learn, Python's machine learning library. You can install it with `pip`. See the instructions here: https://scikit-learn.org/stable/install.html

A few other libraries are needed as well; they are mentioned in the cells below where they are first used.

#### 1. Supreme Court Dataset

The dataset from [1] is available here: https://raw.githubusercontent.com/maastrichtlawtech/law3027-advanced-legal-analytics/main/data/justice.csv 

**Table 1**. Description of variables in the dataset

| Column             | Description                                                                                                       |
| ------------------ | ----------------------------------------------------------------------------------------------------------------- |
| ID                 | Unique Case Identifier.                                                                                           |
| Name               | The name of the case.                                                                                             |
| HREF               | The Oyez’s API URL for the case.                                                                                  |
| Docket ID          | A special identifier of the case used by the legal system.                                                        |
| Term               | The year when the Court received the case.                                                                        |
| First Party        | The name of the first party (petitioner).                                                                         |
| Second Party       | The name of the second party (respondent).                                                                        |
| Facts              | The absolute, neutral facts of the case written by the court clerk.                                               |
| Majority Vote      | The number of justices voting for the majority opinion.                                                           |
| Minority Vote      | The number of justices voting for the minority opinion.                                                           |
| Winning Party      | The name of the party that won the case.                                                                          |
| First Party Winner | True if the first party won the case, otherwise False and the second party won the case.                          |
| Decision Type      | The type of the decision decided by the court, e.g.: per curiam, equally divided, opinion of the court.           |
| Disposition        | The treatment the Supreme Court accorded the court whose decision it reviewed; e.g.: affirmed, reversed, vacated. |
| Issue Area         | The pre-defined legal issue category of the case; e.g.: Civil Rights, Criminal Procedure, Federal Taxation.       |

#### 1.1 Import the necessary libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier  # needed for the KNN classifier in 1.12

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import classification_report, confusion_matrix

import string
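import nltk

from nltk.corpus import stopwords

# download the stopword list used in 1.10 (nltk reports it as up to date if it is already installed)
nltk.download('stopwords')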

#### 1.2 Read the dataset into a dataframe `df`
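
A minimal sketch of this step, assuming `pandas.read_csv` can fetch the URL from Section 1 directly. The helper name `load_justice` is hypothetical, and the two-row in-memory CSV is made up so that the example also runs offline.

```python
import io

import pandas as pd

# URL of the dataset, copied from Section 1 above.
DATA_URL = (
    "https://raw.githubusercontent.com/maastrichtlawtech/"
    "law3027-advanced-legal-analytics/main/data/justice.csv"
)


def load_justice(source=DATA_URL):
    """Read the CSV at `source` (a URL, a path, or a file-like object) into a dataframe."""
    return pd.read_csv(source)


# Offline stand-in with two made-up rows, only to show the call pattern.
sample_csv = io.StringIO(
    "name,facts,first_party_winner\n"
    "A v. B,<p>Some facts.</p>,True\n"
    "C v. D,<p>Other facts.</p>,False\n"
)
sample_df = load_justice(sample_csv)
print(sample_df.shape)
print(sample_df.dtypes)
```

In the notebook itself, `df = load_justice()` would read the real file from the URL.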

#### 1.3 Inspect and explore the dataset. Remove any anomalies (missing values, outliers and invalid values) 
Hint: Refer to the tutorials [here](https://www.geeksforgeeks.org/count-nan-or-missing-values-in-pandas-dataframe/) and [here](https://www.machinelearningplus.com/pandas/pandas-dropna-how-to-drop-missing-values/) to count missing values and to drop the rows containing missing values, respectively.
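
One possible way to count and drop missing values is sketched below. The toy dataframe and its values are made up; on the real data you would apply the same calls to `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `df`; the values are made up.
toy = pd.DataFrame({
    "facts": ["Some facts.", None, "More facts.", "Other facts."],
    "first_party_winner": [True, False, np.nan, True],
})

# Count the missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value.
toy_clean = toy.dropna()
print(len(toy), "->", len(toy_clean))
```

`dropna()` returns a new dataframe, so assign the result (for example `df = df.dropna()`) if you want to keep it.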

#### 1.4 Plot the number of class labels: The count of `first_party_winner` for both False and True labels
Hint: You can use the countplot of seaborn or a bar plot from another library like matplotlib.
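
A small sketch of a class-count bar plot, using pandas' built-in matplotlib plotting on made-up labels. The series below is an assumption standing in for `df['first_party_winner']`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up labels standing in for df['first_party_winner'].
labels = pd.Series([True, True, False, True, False, True], name="first_party_winner")

# Count how often each label occurs and draw one bar per label.
counts = labels.value_counts()
ax = counts.plot(kind="bar")
ax.set_xlabel("first_party_winner")
ax.set_ylabel("number of cases")
print(counts)
```

With seaborn, `sns.countplot(x="first_party_winner", data=df)` should produce an equivalent plot straight from the dataframe.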

#### 1.5 Cleaning the text: Remove the HTML tags

Since the `facts` column contains some HTML tags, we will use the Beautiful Soup library to clean the text. Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree. See more here: https://pypi.org/project/beautifulsoup4/

Install the Beautiful Soup library inside your environment by using the command below (don't forget to activate your environment). 

`pip install beautifulsoup4`

In [10]:
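from bs4 import BeautifulSoup
# name the parser explicitly; otherwise Beautiful Soup picks one itself and emits a warning
df['facts_cleaned'] = [BeautifulSoup(text, 'html.parser').get_text() for text in df['facts']]
df['facts_cleaned'].values.tolist()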

['Joan Stanley had three children with Peter Stanley.  The Stanleys never married, but lived together off and on for 18 years.  When Joan died, the State of Illinois took the children.  Under Illinois law, unwed fathers were presumed unfit parents regardless of their actual fitness and their children became wards of the state.  Peter appealed the decision, arguing that the Illinois law violated the Equal Protection Clause of the Fourteenth Amendment because unwed mothers were not deprived of their children without a showing that they were actually unfit parents.  The Illinois Supreme Court rejected Stanley’s Equal Protection claim, holding that his actual fitness as a parent was irrelevant because he and the children’s mother were unmarried.\n',
 'John Giglio was convicted of passing forged money orders.  While his appeal to the U.S. Court of Appeals for the Second Circuit was pending, Giglio’s counsel discovered new evidence. The evidence indicated that the prosecution failed to discl

#### 1.6 Cleaning the text: Remove the punctuation from the cleaned facts column of the dataframe and convert text to lowercase

Refer to this short tutorial to learn about different ways to do that: https://datagy.io/python-remove-punctuation-from-string/
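
One of the approaches covered there, `str.translate` with a table built by `str.maketrans`, can be sketched as follows. The helper name `clean_text` and the example sentence are illustrative assumptions.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Remove ASCII punctuation from `text` and convert it to lowercase."""
    return text.translate(PUNCT_TABLE).lower()


example = "The jury convicted Schad of first-degree murder; the judge agreed."
print(clean_text(example))
# prints: the jury convicted schad of firstdegree murder the judge agreed
```

On the dataframe this could be applied as `df['facts_cleaned'] = df['facts_cleaned'].apply(clean_text)`. Note that `string.punctuation` only covers ASCII characters, so typographic apostrophes such as `’` are left in place.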


#### 1.7 Select the variables of interest

**Predictor variables (or features): one or more variables that are used to predict the target variable.**

**Target variable: the variable that needs to be predicted.**


The features in this case are the facts. The target variable is `first_party_winner`.

In [13]:
variable_selection = ['facts_cleaned','first_party_winner']
df_selection = df[variable_selection]
df_selection

|      | facts_cleaned                                      | first_party_winner |
| ---- | -------------------------------------------------- | ------------------ |
| 1    | joan stanley had three children with peter sta...  | True               |
| 2    | john giglio was convicted of passing forged mo...  | True               |
| 3    | the idaho probate code specified that males mu...  | True               |
| 4    | miller after conducting a mass mailing campaig...  | True               |
| 5    | ernest e mandel was a belgian professional jou...  | True               |
| ...  | ...                                                | ...                |
| 3297 | for over a century after the alaska purchase i...  | True               |
| 3298 | refugio palomarsantiago a mexican national was...  | True               |
| 3299 | tarahrick terry pleaded guilty to one count of...  | False              |
| 3300 | joshua james cooley was parked in his pickup t...  | True               |


#### 1.8 Encode target labels with value between 0 and n_classes-1.

False --> 0

True  --> 1

In [14]:
# work on an explicit copy so that pandas does not raise a SettingWithCopyWarning
df_selection = df_selection.copy()
# convert boolean (True and False) values to integers (1 and 0)
df_selection["first_party_winner"] = df_selection["first_party_winner"].astype(int)
df_selection


|      | facts_cleaned                                      | first_party_winner |
| ---- | -------------------------------------------------- | ------------------ |
| 1    | joan stanley had three children with peter sta...  | 1                  |
| 2    | john giglio was convicted of passing forged mo...  | 1                  |
| 3    | the idaho probate code specified that males mu...  | 1                  |
| 4    | miller after conducting a mass mailing campaig...  | 1                  |
| 5    | ernest e mandel was a belgian professional jou...  | 1                  |
| ...  | ...                                                | ...                |
| 3297 | for over a century after the alaska purchase i...  | 1                  |
| 3298 | refugio palomarsantiago a mexican national was...  | 1                  |
| 3299 | tarahrick terry pleaded guilty to one count of...  | 0                  |
| 3300 | joshua james cooley was parked in his pickup t...  | 1                  |


#### 1.9 Plot the number of class labels: The count of `first_party_winner` for both 0 and 1 labels

#### 1.10 Convert text (Facts) to TF-IDF
Read about TF-IDF here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html. It is also recommended that you play with the toy example provided there, using a few strings like the ones below:

```python

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out())
# TF-IDF weight of each vocabulary term in the third document
dict(zip(vectorizer.get_feature_names_out(), X.toarray()[2]))

```
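
To see what the vectorizer computes, the sketch below rebuilds the TF-IDF matrix by hand and compares it with scikit-learn's result. It assumes the default settings of `TfidfVectorizer` (raw term counts, `smooth_idf=True`, `norm='l2'`), under which the weight of a term is its count times `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` the number of documents containing the term; each row is then scaled to unit length. The corpus is the toy example with punctuation and capitals already removed, so that a plain `split()` yields the same tokens as the vectorizer.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
vocab = vectorizer.vocabulary_  # maps each term to its column index

# Raw term counts per document, using the vectorizer's column order.
n_docs = len(corpus)
counts = np.zeros((n_docs, len(vocab)))
for row, doc in enumerate(corpus):
    for token in doc.split():
        counts[row, vocab[token]] += 1

# Document frequency: in how many documents does each term occur?
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency, then L2-normalise every row.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
manual = counts * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

print(np.allclose(idf, vectorizer.idf_))
print(np.allclose(manual, X.toarray()))
```

Terms that occur in every document (such as "this") get the lowest IDF, while terms that occur in only one document (such as "second") get the highest.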


In [16]:
stopword_list = stopwords.words('english')
vectorizer = TfidfVectorizer(stop_words=stopword_list)
vectorizer.fit(df_selection['facts_cleaned'])  # learn the vocabulary and IDF weights from the documents
# prepare the features variable, X
X = vectorizer.transform(df_selection['facts_cleaned'])  # apply the learned vocabulary and IDF weights; returns the document-term matrix

In [17]:
#prepare the target variable, y
y = df_selection['first_party_winner'].values
y

array([1, 1, 1, ..., 0, 1, 1])

#### 1.11 Split the dataset into train and test set. Set the size of train set = 90%

#### Check the length of the train and test set. 
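
A minimal sketch of a 90/10 split with `train_test_split`. The arrays are toy stand-ins for the real `X` and `y`, and the fixed `random_state` and the `stratify` argument are assumptions added for reproducibility and balanced classes, not requirements of the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the TF-IDF matrix X and the label vector y.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0, 1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X_toy,
    y_toy,
    train_size=0.9,   # 90% of the rows go to the train set
    random_state=42,  # assumption: fixed seed so the split is reproducible
    stratify=y_toy,   # assumption: keep the class ratio equal in both sets
)

# Check the length of the train and test set.
print(X_train.shape[0], X_test.shape[0])
```

The same call accepts the sparse matrix returned by the vectorizer, so `X` and `y` from the cells above can be passed in unchanged.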

#### 1.12 Train the K-nearest neighbour (KNN) Classifier and print the evaluation metrics (confusion matrix, precision, recall, accuracy) 
Change the value of K and see whether the performance of the model improves. What do you observe?
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

Use different variable names to save each model, for instance, `knn` for KNN classifier and `DT` for Decision Tree.
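
The training and evaluation pattern could look like the sketch below. The mini corpus and its labels are made up, so the perfect score it produces says nothing about the real dataset; only the sequence of calls carries over.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Made-up mini corpus standing in for the cleaned facts and their labels.
texts = ["the court reversed the conviction", "the appeal was denied by the court"] * 20
labels = [1, 0] * 20

X_toy = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, labels, train_size=0.9, random_state=0, stratify=labels
)

# K is set through n_neighbors; 5 is scikit-learn's default.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print(confusion_matrix(y_test, y_pred))       # rows: actual class, columns: predicted class
print(classification_report(y_test, y_pred))  # precision, recall and F1 per class
knn_accuracy = accuracy_score(y_test, y_pred)
print("accuracy:", knn_accuracy)
```

To study the effect of K, you could repeat the fit inside a loop over several `n_neighbors` values and compare the resulting accuracies.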

#### 1.13 Train the Decision Tree and Random Forest Classifiers separately and print the evaluation metrics (confusion matrix, precision, recall, accuracy).

Is the performance of these models better or worse than KNN? How would you check?
Which model had the best performance?
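
One way to compare the two tree-based models is to fit them on the same split and collect a common metric, as sketched here. The corpus is made up and trivially separable, the `scores` dictionary is an illustrative helper, and the fixed seeds are an assumption for reproducibility.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up mini corpus standing in for the cleaned facts and their labels.
texts = ["the court reversed the conviction", "the appeal was denied by the court"] * 20
labels = [1, 0] * 20

X_toy = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, labels, train_size=0.9, random_state=0, stratify=labels
)

# Separate variable names per model, as suggested above.
DT = DecisionTreeClassifier(random_state=0)
RF = RandomForestClassifier(n_estimators=20, random_state=0)

scores = {}
for name, model in {"DT": DT, "RF": RF}.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = accuracy_score(y_test, y_pred)
    print(name)
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))

print(scores)
```

Evaluating every model on the same test set is what makes the numbers comparable. With imbalanced labels, precision and recall per class are usually more informative than accuracy alone.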


#### 1.14 Explaining the predictions of a Machine Learning Model
Local interpretable model-agnostic explanations (LIME) https://github.com/marcotcr/lime is a tool that researchers use to explain the predictions made by machine learning classifiers. We will use LIME to explain some of the KNN classifier's predictions of the `first_party_winner` label.

Install LIME in your environment: `pip install lime`

In [24]:
from lime import lime_text
#create a LimeTextExplainer object with class names passed to it.
explainer = lime_text.LimeTextExplainer(class_names=[0,  1])
explainer

<lime.lime_text.LimeTextExplainer at 0x7fac1da3d160>

In [25]:
from sklearn.pipeline import make_pipeline

idx = 700  # pick an index from the original dataframe df

print("Actual Text : ", df['facts_cleaned'][idx])  # print the actual text (cleaned facts)
print("\n\n")
print("Actual Label:", int(df['first_party_winner'][idx]))  # convert bool to int (True: 1, False: 0)
print("\n\n")
print("First party:", df['first_party'][idx])  # get the name of the first party from the dataframe

print("Second party:", df['second_party'][idx])  # get the name of the second party from the dataframe

# to make predictions on a piece of text, we first vectorize it and then apply the machine learning model
c = make_pipeline(vectorizer, knn)

# LIME explains the prediction using predict_proba, which returns the probability estimates for both classes
explanation = explainer.explain_instance(df['facts_cleaned'][idx], c.predict_proba)

explanation.show_in_notebook()

Actual Text :  an arizona prosecutor brought a charge of firstdegree murder against schad after he was found with a murder victims vehicle and other belongings in arizona firstdegree murder is murder committed with premeditation or murder committed in an attempt to rob schad maintained that circumstantial evidence established at most that he was a thief the jurys instructions addressed firstand seconddegree murder not theft the jury convicted schad of firstdegree murder the judge sentenced schad to death 



Actual Label: 0



First party: Schad
Second party: Arizona




#### 1.15 Explaining LIME predictions

Actual Label: 0 means `first_party_winner` = 0 = False, i.e. the second party, Arizona, is the winner. The algorithm predicted this correctly. LIME has highlighted the most important words for each class. After reading the facts, is it also clear to you why Arizona was the winner? Can you identify the key words and their weights that explain the model's prediction?


#### 1.16 Now let's take another example case and look at the explanation LIME gives for it. Can you interpret the explanation given by LIME below?

In [26]:
from sklearn.pipeline import make_pipeline

idx = 2800  # pick an index from the original dataframe df

print("Actual Text : ", df['facts_cleaned'][idx])  # print the actual text (cleaned facts)
print("\n\n")
print("Actual Label:", int(df['first_party_winner'][idx]))  # convert bool to int (True: 1, False: 0)
print("\n\n")
print("First party:", df['first_party'][idx])  # get the name of the first party from the dataframe

print("Second party:", df['second_party'][idx])  # get the name of the second party from the dataframe

# to make predictions on a piece of text, we first vectorize it and then apply the machine learning model
c = make_pipeline(vectorizer, knn)

# LIME explains the prediction using predict_proba, which returns the probability estimates for both classes
explanation = explainer.explain_instance(df['facts_cleaned'][idx], c.predict_proba)

explanation.show_in_notebook()

Actual Text :  marvin green began working for the united states postal service in 1973 in 2002 he became the postmaster at the englewood colorado post office in 2008 a postmaster position opened in boulder and green applied but did not receive the position he filed a formal equal employment opportunity eeo charge regarding the denial of his application and the charge was settled in 2009 green filed an informal eeo charge and alleged that his supervisor and supervisor’s replacement had been retaliating against him for his prior eeo activity throughout that year green was subject to internal postal service investigations including a threat of criminal prosecution he ultimately signed an agreement that he would immediately give up his position and either retire or accept a much lower paying position green chose to retire and filed subsequent charges with the eeo office which dismissed his claim green then sued in district court and alleged among other claims that he had been constructivel



#### 1.17 Try to take other indexes from the dataframe and see explanations of LIME. You might also come across some cases where the model made the wrong prediction. Try to identify the words which led the model to make the wrong prediction.

For more information on LIME, you can refer to this tutorial or to similar tutorials online: https://coderzcolumn.com/tutorials/machine-learning/how-to-use-lime-to-understand-sklearn-models-predictions



**HAVE FUN playing with this simple Robo Judge**. There are more sophisticated versions of Robo Judges online which use more advanced natural language processing features. For instance one is JURI SAYS: https://www.jurisays.com/



For our dataset, you can investigate other features such as n-grams, and also exploit other columns (features) in the dataset to see whether they improve the model's performance in predicting the **target variable** `first_party_winner`.
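
As a starting point for n-grams, the sketch below shows how the `ngram_range` parameter of `TfidfVectorizer` changes the vocabulary. The two short documents are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up documents.
docs = ["equal protection clause", "due process clause"]

# ngram_range=(1, 1) keeps single words only (the default);
# ngram_range=(1, 2) adds every pair of adjacent words as a feature.
unigram_vec = TfidfVectorizer(ngram_range=(1, 1)).fit(docs)
bigram_vec = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)

print(sorted(unigram_vec.vocabulary_))
print(sorted(bigram_vec.vocabulary_))
```

Bigrams let the model treat a phrase such as "equal protection" as a single feature, at the cost of a much larger vocabulary on real text.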

Good Luck. Hope you had fun in Advanced Legal Analytics. 