# Advanced Legal Analytics
# LAW 3027 Tutorial 4: Large Language Models for Legal Q&A and Information Retrieval


#### Intended Learning Outcomes:
This notebook provides an introduction to the use of large language models.
By the end of this notebook you will know how to:
- Use the OpenAI API to access large language models and generate (legal) text.
- Write effective prompts to accomplish tasks, such as legal Q&A.
- Iteratively optimize prompts to improve the quality of generated (legal) text.
- Use advanced techniques such as Retrieval Augmented Generation (RAG) to improve the factuality of the generated (legal) text.

#### Libraries to be used:
You can activate your previously used environment. For this notebook, you will additionally need the following packages:
- requests
- openai
- chromadb
- scikit-learn
- chevron
- pandas
- datasets

#### Reading Material:
- [OpenAI API](https://platform.openai.com/docs/api-reference/introduction)
- [OpenAI prompt engineering guidelines](https://platform.openai.com/docs/guides/prompt-engineering)
- [Lessons after a half-billion GPT tokens](https://kenkantzer.com/lessons-after-a-half-billion-gpt-tokens/)

## Installation

Run this command to install the required packages for this lesson:

```bash
pip install requests openai scipy scikit-learn pandas datasets chevron
```

If you are on Google Colab, you can do this by running the cell below.

In [None]:
!pip install requests openai scipy scikit-learn pandas datasets chevron

# Introduction

Large Language Models (LLMs) have taken the world by storm, and are today some of the most exciting and powerful tools to use in artificial intelligence. Today, we will explore how they can be used in the legal field. We will use the OpenAI API connected to our own custom server. You have received a code, which allows you to access two models, one for text completion and one for embedding:
- Text Generation: [Mixtral-8x7B-Instruct-v0.1](https://mistral.ai/news/mixtral-of-experts/)
- Text Embedding: [text-embedding-3-small](https://platform.openai.com/docs/guides/embeddings)

In [None]:
# Enter your key and initialize the client in this cell
import openai

MODEL_MIXTRAL = "Mixtral-8x7B-Instruct-v0.1"
MODEL_EMBEDDING = "text-embedding-3-small"

key = "sk-..." # your key here
base_url = "https://llm.wstrmnn.com"

client = openai.OpenAI(
    api_key=key,
    base_url=base_url,
)

In [None]:
# We have assigned you 0.1 USD to spend on the API, which should easily cover the exercises in this notebook. Run this function to check how much you have spent so far.
import requests

def check_spend():
    url = f"{base_url}/key/info"
    headers = {
        "Authorization": f"Bearer {key}",
    }
    response = requests.get(url, headers=headers)
    decoded = response.json()
    spent = round(decoded["info"]["spend"],5)
    max_budget = decoded["info"]["max_budget"]
    percent = round(spent / max_budget * 100,1)
    print(f"Spent: {spent} USD. Max Budget: {max_budget} USD. Percent used: {percent}%")
check_spend()

# Exercise 1 - Exploring Large Language Models

First, let us learn how to use the OpenAI API. We will use the Chat Completions API - read the documentation [here](https://platform.openai.com/docs/guides/text-generation/chat-completions-api). The API is called with a list of messages. Each message has one of the following roles:
- `system`: Instructions to the model, defining how it should act in general.
- `user`: A message from the user, like if you sent a message to ChatGPT. Can be used to ask questions or provide data.
- `assistant`: A previous response from the model. Can be used to continue a conversation. Not necessary for us.


The API call also takes a number of parameters, including `max_tokens` and `temperature`. `max_tokens` defines the maximum number of tokens the model should generate, and `temperature` defines how creative the model should be. A temperature of 0 will always return the most likely next token, while higher temperatures (such as 1) sample more randomly from the model's predicted distribution, giving more varied output.
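
To make the effect of `temperature` more concrete, here is a minimal sketch of temperature-scaled sampling probabilities over a three-token vocabulary. The tokens and scores are invented for illustration and do not come from Mixtral; the function name `softmax_with_temperature` is hypothetical, and an actual server may implement sampling differently.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all probability on the top-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token scores, invented for this example
tokens = ["court", "judge", "banana"]
logits = [2.0, 1.5, 0.1]

for t in (0, 0.2, 0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, dict(zip(tokens, [round(p, 3) for p in probs])))
```

As the temperature rises, probability mass moves from the top-scoring token towards the less likely ones, which is why the output becomes more varied.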

## Task 1.1 - Write a poem

Look at the code below, which uses the OpenAI client to call the API and generate a poem about the law. Can you get it to generate a poem about another topic? How does changing `max_tokens` and `temperature` affect the poem?

In [None]:
max_tokens = 500
temperature = 0.7

response = client.chat.completions.create(
  model=MODEL_MIXTRAL,
  messages=[
    {"role": "system", "content": "Write a beautiful poem about the subject I will send to you."},
    {"role": "user", "content": "The Law"},
  ],
  max_tokens=max_tokens,
  temperature=temperature,
)

print (response.choices[0].message.content)

## Task 1.2 - Ask a question

Write code that uses the OpenAI client to call the API and ask what notice period is required to quit a job in the Netherlands. Use a system prompt telling the bot to act as a lawyer, and a user prompt asking the question: What is the notice period to quit my job in the Netherlands?

Then, verify whether the answer is correct. For your consideration: [Think of language models like ChatGPT as a “calculator for words”](https://simonwillison.net/2023/Apr/2/calculator-for-words/)

In [None]:
response = client.chat.completions.create(
  model=MODEL_MIXTRAL,
  messages=[] # <- Add your messages here   
)

print (response.choices[0].message.content)

## Task 1.3 - Drafting via function

To integrate the OpenAI API into our workflow, it can be useful to write a function that does the API call for us. Write a function that acts as a lawyer drafting letters for their client. The function takes as its argument a short description of what the user wants to write, calls the API with a system prompt and a user prompt, and returns the response.

Watch this video to learn more about how to use functions in python: https://youtu.be/NSbOtYzIQI0 . You can also follow the tutorial here: https://www.geeksforgeeks.org/python-functions/ 

- Test the function with the sentences below.
- Change the prompt to see how it impacts the generated letter.

In [None]:
def write_letter(description):
    #Your code here
    return letter_text
    

In [None]:
print(write_letter("I want to write a letter to my boss, informing them that I want to quit on the 17th of April."))

In [None]:
print(write_letter("I want to write a notice to my landlord, informing them that there is an issue with the heating and I want it fixed."))

# Exercise 2 - Legal Q&A

Now that we have learned how to use the OpenAI API, let us use it to answer legal questions. We will use the Contract NLI dataset, which contains annotated contracts. We will analyze a series of Non-Disclosure Agreements to investigate whether the following is true: `Receiving Party shall not disclose the fact that Agreement was agreed or negotiated.`. More information about this dataset and task is available here: https://hazyresearch.stanford.edu/legalbench/tasks/contract_nli_confidentiality_of_agreement.html

Run the cell below to load the dataset.

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', 0)

import datasets
dataset = datasets.load_dataset("nguha/legalbench", "contract_nli_confidentiality_of_agreement")
train_df = dataset["train"].to_pandas()
test_df = dataset["test"].to_pandas()

#For train and test, only keep the columns "answer" and "text"
train_df = train_df[["text", "answer"]]
test_df = test_df[["text", "answer"]]


test_df = test_df.sample(frac=1, random_state=42).reset_index(drop=True)
test_df = test_df.head(30)

In [None]:
#Print the number of rows in the train and test set
print(f"Train set has {train_df.shape[0]} rows.")
print(f"Test set has {test_df.shape[0]} rows.")

#Is this surprising? How does it compare to the ratio (size of train and test) in more traditional machine learning text classification which you did in the last tutorial?

In [None]:
# In the dataset, what is the predictor variable and what is the target variable?
train_df.head(8)

## Task 2.1 - Zero-shot classification

Next, let us test whether the model can predict whether a clause prohibits the disclosure of the agreement itself, or of the fact that the parties are negotiating an agreement. Write a function that takes a clause as input, and returns `Yes` or `No`.

Zero-shot means that we will not give _any_ prior examples to the model - instead, we will explain the task to the model and let it figure out the answer by itself. How does this compare to traditional machine learning?

**_Note_: Since we evaluate this model on multiple clauses, it can use a lot of your tokens if you are not careful. Make sure to set the max_tokens parameter to a low value.**
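
Because the model returns free text, its reply may read `Yes.`, `yes`, or `No, because ...` rather than exactly `Yes` or `No`, and any such reply would count as a mismatch in the evaluation. The following is a minimal sketch of a helper that maps a raw reply onto one of the two labels. The name `normalize_answer` is hypothetical, and falling back to `No` for unrecognised replies is an assumption you may want to change.

```python
import re

def normalize_answer(raw_reply, default="No"):
    """Map a free-text model reply onto exactly 'Yes' or 'No'."""
    # Look for the first standalone 'yes' or 'no', ignoring case and punctuation.
    match = re.search(r"\b(yes|no)\b", raw_reply.strip().lower())
    if match is None:
        return default  # assumption: unrecognised replies fall back to the default label
    return "Yes" if match.group(1) == "yes" else "No"

# Invented example replies
print(normalize_answer(" Yes. The clause prohibits disclosure."))
print(normalize_answer("no"))
print(normalize_answer("I am not sure."))
```

Keeping the parsing separate from the API call also makes it easier to inspect the raw replies when a prediction looks wrong.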

In [None]:
# Write your system prompt below:
system_prompt = ''' Write the system prompt here.  '''

def predict_confidentiality(text):
    # Your code here
    return prediction #Note that the function should return a 'Yes' or 'No'

### Run the below cells to run the prediction on the clauses in the dataset

In [None]:
# Evaluate the function on the test set
# Store the predictions in a new column called 'predicted_answer'
test_df["predicted_answer"] = test_df["text"].apply(predict_confidentiality)
test_df

### Evaluation
Run the cell below to compare the `predicted_answer` with the actual `answer` and evaluate the performance of the model. Does the model perform well?

You will also see a list of clauses where the model failed. Can you identify why the model failed on these clauses? Can you improve the prompt to make the model perform better?
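
To help read the classification report, here is a minimal sketch of how precision, recall, and F1 for the `Yes` class can be derived from two lists of labels. The function name `binary_metrics` is hypothetical and the example labels are invented; they are not taken from the dataset.

```python
def binary_metrics(y_true, y_pred, positive="Yes"):
    """Compute precision, recall and F1 for one class from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Invented labels for illustration only
y_true = ["Yes", "Yes", "No", "No", "Yes"]
y_pred = ["Yes", "No", "No", "Yes", "Yes"]

precision, recall, f1 = binary_metrics(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision answers "of the clauses the model labelled `Yes`, how many were correct?", while recall answers "of the clauses that are truly `Yes`, how many did the model find?".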

In [None]:
#Use the scikit learn classification report to evaluate the model
from sklearn.metrics import classification_report

print(classification_report(test_df["answer"], test_df["predicted_answer"]))

#Print the rows where the answer is different from the predicted answer
test_df[test_df["answer"] != test_df["predicted_answer"]]

# Exercise 3 - Retrieval Augmented Generation
We have now seen how well these models can answer questions about legal texts. However, models are also prone to hallucination: if they are asked specific questions about legislation or cases, they may invent the answer. To mitigate this, we can use a technique called Retrieval Augmented Generation (RAG). This technique uses a retriever to find relevant pieces of text (such as articles from legislation) and then generates an answer based on this information. First, let us see what happens when we ask the model about a specific topic without retrieval.

You can see the slides about LLMs and RAG [here](https://canvas.maastrichtuniversity.nl/courses/14858/files/3886649?wrap=1).

## Task 3.1 - Unaugmented generation

Run the code below. What happens when you ask the model to generate text about a specific topic? Can you identify any factual errors? Does the model, for example, mention [Article V](https://www.unoosa.org/oosa/en/ourwork/spacelaw/treaties/outerspacetreaty.html#:~:text=Article%20V-,States%20Parties%20to%20the,of%20their%20space%20vehicle.,-In%20carrying%20on) of the Outer Space Treaty?

In [None]:
def generate_answer(question):
    response = client.chat.completions.create(
        model=MODEL_MIXTRAL,
        messages=[
            {"role": "system", "content": "You are a helpful legal chatbot. Answer the question of the user, referencing relevant legal sources."},
            {"role": "user", "content": f"{question}"},
        ],
        temperature=0.7,
        max_tokens=500,
        )
    value = response.choices[0].message.content
    return value

print(generate_answer("What does the Outer Space Treaty say about accidents in space?"))

## Task 3.2 - Retrieval augmented generation
Now, let us implement a pipeline to use RAG to generate text. First, let us load a library of articles from the Outer Space Treaty (OST.json). 

In [None]:
import pandas as pd

url = 'https://raw.githubusercontent.com/maastrichtlawtech/law3027-advanced-legal-analytics/main/data/OST.json'

# Read JSON data from the URL directly into a DataFrame
df = pd.read_json(url)

article_name_list = df["name"].tolist()
article_text_list = df["text"].tolist()

df.head()

### Subtask 3.2.1 - Embed the articles in a vector space
First, let us create an embedding of the articles of the Outer Space Treaty. For this, we will use the text-embedding-3-small model. You can read more about embeddings and how to create them [here](https://platform.openai.com/docs/guides/embeddings).

In [None]:
# A function that takes a list of text strings and returns a list of embeddings.
def get_embeddings(text_list):
    response = client.embeddings.create(
        model=MODEL_EMBEDDING,
        input=text_list
    )
    embeddings = [item.embedding for item in response.data]
    return embeddings

In [None]:
# Run the function to get the embeddings
embeddings = get_embeddings(article_text_list)
print (embeddings[0][:50])

### Subtask 3.2.2 - Create an index to find similar sentences
Next, let us create an index to find similar sentences. We will use the [BallTree](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.BallTree.html) algorithm to create an index of the embeddings, which can then be used to find the most similar sentences to a given query.
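
To show what such an index computes, here is a minimal brute-force sketch of nearest-neighbour search using Euclidean distance, which is BallTree's default metric. The two-dimensional vectors are invented toy data, and `nearest_neighbours` is a hypothetical helper; BallTree returns the same kind of result, but organises the points in a tree so it does not have to compare the query with every vector.

```python
import math

def nearest_neighbours(vectors, query, k=3):
    """Return (distances, indices) of the k vectors closest to the query."""
    # Compute the Euclidean distance from the query to every stored vector.
    scored = [(math.dist(vec, query), idx) for idx, vec in enumerate(vectors)]
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Invented toy "embeddings" with two dimensions
toy_embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
]

dist, ind = nearest_neighbours(toy_embeddings, toy_embeddings[0], k=3)
print(ind)  # [0, 1, 2]
```

Note that this helper returns flat lists for a single query, whereas `tree.query` accepts a list of queries and returns two-dimensional arrays with one row per query.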

In [None]:
# Build a BallTree from the embeddings
from sklearn.neighbors import BallTree

tree = BallTree(embeddings)


In [None]:
#This should return the indices of the closest 3 articles to the first article.
dist, ind = tree.query([embeddings[0]], k=3)
print(ind)

# Note that BallTree returns indices into the original list of articles, so Article I has index 0, Article II has index 1, and so on.
# The query article is closest (most similar) to itself, so the first index returned is 0.

### Subtask 3.2.2 - Retrieve relevant articles
Create a function that takes a query string, embeds the query, finds the closest N articles, and returns the articles and their content.
Return a list of dictionaries, where each dictionary has the keys "name" and "text".

You might need to use a for loop for this. Learn more about the for loop here: https://www.youtube.com/watch?v=OnDr4J2UXSA&ab_channel=CSDojo . You may also need to use the `append()` function as you need to return the articles and their content in a list of dictionaries. So you can watch this video to understand the `append()` function: https://youtu.be/5IEhquZghp0
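
As a reminder of the pattern, here is a minimal sketch of a for loop that uses `append()` to build a list of dictionaries from a list of indices. The case names, years, and indices are invented and unrelated to the treaty data.

```python
# Invented data: two parallel lists and a list of positions we want to pick out
case_names = ["Case A", "Case B", "Case C", "Case D"]
case_years = [1999, 2005, 2012, 2020]
selected_indices = [2, 0]

selected = []
for i in selected_indices:
    # Look up both lists at the same position and store the pair as a dictionary
    selected.append({"name": case_names[i], "year": case_years[i]})

print(selected)
```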



In [None]:
def retrieve_articles(query, n=3):
    query_embedding = get_embeddings([query])[0]
    dist, ind = tree.query([query_embedding], k=n)
    articles = []

    #your code here
    return articles

In [None]:
#Run this code to test your function. Does it return relevant articles for the questions?

questions = [
    "According to the Outer Space Treaty, what happens in case of an accident in space?",
    "Can I capture planets?",
    "Can I place weapons on the moon?"
]

for question in questions:
    articles = retrieve_articles(question, n=1)
    print ("==============================")
    print(f"Question: {question}")
    print ("==============================")
    for article in articles:
        print(f"Article: {article['name']}")
        print(f"Content: {article['text']}")
    print("\n")
    print ("==============================")

Let us evaluate the results you received above. Fill in the table below, indicating whether the articles retrieved are relevant to the question asked.

| Query     | Article Retrieved | Relevant? |
|-----------|-------------------|-----------|
| According to the Outer Space Treaty, what happens in case of an accident in space? |                   |           |
| Can I capture planets?   |                   |           |
| Can I place weapons on the moon?   |                   |           |

Let us use precision to evaluate the results. Precision is defined as the number of relevant articles retrieved divided by the total number of articles retrieved. Fill in the table with the precision of the articles retrieved.

We can evaluate the precision at different numbers of retrieved articles. What is the precision if we retrieve 1 article?
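
The definition above can be sketched as a small function. This is a minimal illustration: the function names are hypothetical, and the retrieved lists and relevance judgments are invented, with placeholder article names rather than results from the treaty.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant, for a single query."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(top_k)

def mean_precision_at_k(results, k):
    """Average precision@k over several (retrieved, relevant) pairs."""
    scores = [precision_at_k(retrieved, relevant, k) for retrieved, relevant in results]
    return sum(scores) / len(scores)

# Invented results: (ranked retrieved articles, set of articles judged relevant)
results = [
    (["Article X", "Article Y", "Article Z"], {"Article X"}),
    (["Article Y", "Article X", "Article Z"], {"Article X", "Article Z"}),
]

print(mean_precision_at_k(results, k=1))
print(mean_precision_at_k(results, k=3))
```

When only one article is retrieved per query, precision@1 reduces to the share of queries whose top result is relevant.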

In [None]:
#Calculate the precision@1 and precision@3
total_queries = 3
correct_queries_at_first_spot = #Fill in. How many times was the retrieved article correct?

precision_at_1 = correct_queries_at_first_spot / total_queries

print(f"Precision@1: {precision_at_1}")

# Are there any other metrics you could calculate to evaluate the performance of the system?

### Subtask 3.2.3 - Assemble the prompt
Write a function that takes a question and a list of articles as input, and builds prompt messages that instruct the model to answer the question based on the articles.

*Note*: In the example below, we use the `chevron` library to format the prompt. You can install it by running `pip install chevron`. Chevron uses "mustache" to format strings. If you are interested, read more [here](https://mustache.github.io/mustache.5.html).

In [None]:
import chevron

system_prompt = """ """ #<- Write your system prompt here. Tell the model to answer the question using the relevant articles that will be provided.

user_prompt_template = """Articles:
{{#articles}}
{{name}}: {{text}}
{{/articles}}

Question: {{question}}
"""

def build_user_prompt(question, articles):
    user_prompt = chevron.render(user_prompt_template, {
        "articles": articles,
        "question": question
    })
    return user_prompt

In [None]:
#Run this code to test your function. Does it generate a relevant prompt for the question and articles?
question = "According to the Outer Space Treaty, what happens in case of an accident in space?"

articles = retrieve_articles(question)
user_prompt = build_user_prompt(question, articles)

print ("This is what the LLM will see:")
print ("============SYSTEM===========")
print (system_prompt)
print ("============USER=============")
print(user_prompt)

### Subtask 3.2.4 - Putting it all together
Write a function that takes a query as input, retrieves the relevant articles, puts them into the prompt, sends the prompt to the model, and returns the response.

In [None]:
def generate_answer(question):
    #Your code here
    return answer

In [None]:
#Run this code to test your function. Does it generate a relevant answer for the question?
questions = [
    "According to the Outer Space Treaty, what happens in case of an accident in space?",
    "Can I capture planets?",
    "Can I place weapons on the moon?"
]

for question in questions:
    print ("==============================")
    print(f"Question: {question}")
    print ("==============================")
    print(generate_answer(question))
    print ("\n\n\n")