<a href="https://colab.research.google.com/github/leohpark/leohpark/blob/main/Refine_Chain_Prompts_Iantosca_SJ.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro
This notebook introduces non-coders to the LangChain framework by building a Refine Chain document analyzer. Specifically, I'm using it to create a structured data extractor for copyright claims.

This extractor is a proof of concept demonstrating that extracting legal claims, entities, and outcomes is viable using LLMs like gpt-3.5-turbo. My expectation is that the solution posed here is relatively brittle, and I have not tested it with any rigor. In particular, I have no idea if "negation findings" such as "Copyright Not Infringed" will work in cases where declaratory judgment is sought. If you switch out the source document with another Copyright SJ Order, I suspect the results will be more amusing than impressive.

The technology stack used here is a Colab notebook for the Python environment, plus the LangChain framework.

### Disclaimer
Unless required by applicable law or agreed to in writing, the code provided in this notebook is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

# Step 0 - Sign up for an OpenAI account
Link to sign up:

- OpenAI: https://platform.openai.com/signup?launch

Pages where your API keys are located

- OpenAI: https://platform.openai.com/account/api-keys

Request an API key and store it carefully. If someone has your key, they can request services from the provider and incur costs as if they were you. If your OpenAI key is misplaced or stolen, delete the key from your account and generate a new one.

## Create some variables to store your API key for later
This can be done by entering the following command:

```
OPEN_AI_KEY = "..."
```
where you copy/paste your key in place of the "...".
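
If you would rather not paste the key directly into the notebook, one option is to read it from an environment variable and fall back to a hidden prompt. This is a minimal sketch, not part of the original notebook: the helper name `load_api_key` and the environment variable name `OPENAI_API_KEY` are assumptions chosen for illustration.

```python
import os
from getpass import getpass


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the key from the environment, or prompt for it without echoing."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed text, so the key does not show up in the cell output.
        key = getpass(f"Enter your {env_var}: ")
    return key.strip()
```

With this helper defined, `OPEN_AI_KEY = load_api_key()` would replace the hard-coded assignment.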


In [None]:
#Input your OpenAI API Key here. We will need it later.
OPEN_AI_KEY = "..."

# Step 1 - Install LangChain framework and related packages
Here we install the tools we need to load and chunk our PDF and to connect to OpenAI via API.
Our Workflow Roadmap is:

	1. Get Frameworks and Packages installed.
	2. Find a copyright PDF to use as a data source. Use RecursiveCharacterTextSplitter to divide our source PDF into smaller chunks.
	3. Connect to our LLM
	4. Write a custom Refine Chain Prompt to extract data from this case, and run it using gpt-3.5-turbo.

## LangChain framework
LangChain is a framework full of useful tools and fairly friendly defaults that will let us quickly write some code to connect all of the pieces. The other two packages, `unstructured` and `pdf2image`, are required for processing PDF documents.

# Step 2 - Find a long PDF document as our Data Source. Use RecursiveCharacterTextSplitter to divide our source PDF into smaller chunks.
The document we are analyzing is a Summary Judgment order from a social media copyright case, Iantosca v. Tahari: https://heitnerlegal.com/wp-content/uploads/copyright-infringement.pdf

Chunking the PDF is necessary because most LLMs have relatively modest context limits, meaning a cap on how much text they can consider in one instruction. For gpt-3.5-turbo, the context limit is 4096 tokens, which corresponds to about 3000 words, or 9000 characters (including formatting). This document is 12 pages in length, which corresponds to about 6,000 words or 24,000 characters. Even though that is relatively short by legal standards, it is too long for gpt-3.5-turbo and many other available models to process at once while also considering a question and formulating an answer.

Chunk size corresponds roughly to the number of characters, although RecursiveCharacterTextSplitter will make some attempts to keep complete sentences or lines of text intact. It may be important/meaningful to calibrate your chunk size according to the type of information being ingested. 

## Refine Chain
For this Refine Chain, I'm using 5,000 character chunks, which correspond to about 1000 tokens. This should leave plenty of context for gpt-3.5-turbo to consider the question and produce its output.
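
The arithmetic behind that budget can be sketched as follows. The helper name `remaining_tokens` is hypothetical; the figures are the ones used in this notebook (a 4096-token context, roughly 1000-token chunks, and the `max_tokens=1400` passed to the model later on). Note that in a refine step the leftover budget also has to hold the previous answer, so treat the result as a rough upper bound rather than a guarantee.

```python
def remaining_tokens(context_limit, chunk_tokens, max_output_tokens):
    """Tokens left for the instructions, the query, and the previous answer."""
    return context_limit - chunk_tokens - max_output_tokens


# Figures from this notebook: 4096-token context, ~1000-token chunks,
# and 1400 tokens reserved for the model's reply.
budget = remaining_tokens(context_limit=4096, chunk_tokens=1000, max_output_tokens=1400)
print(budget)  # 1696
```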

## Document Chunks
We are using the `RecursiveCharacterTextSplitter`, which is included in the LangChain framework. It goes through the document, attempting to separate it at the preferred separators, in separator order, when it reaches the approximate chunk size. It will make variable chunk sizes based on the document formatting and on where it can find good natural divisions in the document.

The separators here are a double return, a single return, a space, and finally the empty string, which the splitter relies on only when it can't split the text anywhere else. You might think ". " would be a good separator for documents. For reasons I can't really explain, in my limited testing including that delimiter didn't appreciably improve the chunk separation I observed. The splitter may already be factoring that in.
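
To make the "try the preferred separator first, then fall back" idea concrete, here is a simplified pure-Python sketch. It is not LangChain's implementation: the function name `recursive_split` is hypothetical, it ignores `chunk_overlap`, and it only illustrates the fallback order of the separators.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into pieces of at most chunk_size, preferring earlier separators."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # No separators left, or the empty separator: cut at fixed widths.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with the next separator.
            chunks.extend(recursive_split(piece, rest, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb\n\ncccc dddd eeee"
print(recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=10))
# ['aaaa bbbb', 'cccc dddd', 'eeee']
```

The first chunk ends at the double return; the second paragraph is too long, so the sketch falls through to the space separator for that paragraph only.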

One other thing: `chunk_overlap` only comes into play when a clean chunk division can't be found. It makes sure that sentence fragments carry over to the next chunk when the previous chunk has a "non-clean" ending, e.g. when it could not end on a complete sentence. It DOES NOT automatically buffer some part of a chunk into the next chunk for "semantic overlap" between chunks, as some tutorials suggest. You can easily confirm this by chunking documents and looking at the entire `texts` output.
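
One way to run that check without reading every chunk by eye is to measure how much text consecutive chunks actually share. This is a small sketch; the helper name `shared_overlap` is hypothetical, and it only detects an exact match between the end of one chunk and the start of the next.

```python
def shared_overlap(previous, following):
    """Length of the longest suffix of `previous` that is also a prefix of `following`."""
    limit = min(len(previous), len(following))
    for size in range(limit, 0, -1):
        if previous[-size:] == following[:size]:
            return size
    return 0


print(shared_overlap("the quick brown", "brown fox"))  # 5
print(shared_overlap("abc", "xyz"))  # 0
```

Once `texts` exists, something like `[shared_overlap(a.page_content, b.page_content) for a, b in zip(texts, texts[1:])]` would list the overlap between each pair of neighboring chunks.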

After the chunking cell below has run, `texts[1]` shows the contents of text chunk #2.

In [None]:
#@title Install LangChain
pip install langchain unstructured pdf2image

In [None]:
#@title Import PDF chunking tools from LangChain
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


In [None]:
#@title Upload the PDF to Colab, then copy the pathname into the command below.
loader = UnstructuredPDFLoader("/content/Iantosca v. Tahari.pdf")


In [None]:
#@title Load the file
data = loader.load()

In [None]:
#@title The following will tell us a bit about the PDF we just uploaded. We can check to see it loaded.
print(f'You have {len(data)} document(s) in your data')
print(f'There are {len(data[0].page_content)} characters in your document')

In [None]:
#@title Run this to see all of the data.
data

In [None]:
#@title Make Text Chunks
# This notebook uses 5,000-character chunks; for Refine Chain queries you can also try 6000-8000
text_splitter = RecursiveCharacterTextSplitter(separators=['\n\n', '\n', ' ', ''],  chunk_size=5000, chunk_overlap=200)

texts = text_splitter.split_documents(data)

In [None]:
#@title Check to see that your text chunks contain text
texts[1]

# Step 3 - Connect to our LLM

We're going to install the `openai` and `tiktoken` packages; tiktoken is sometimes necessary to manage our text chunks.

After that, we'll import additional tools to run gpt-3.5-turbo, and to customize our Prompt Template. Customizing the prompt template is slightly advanced, but the default Refine Chain will not do what we are trying to accomplish here, which is structured text extraction.

In [None]:
pip install openai tiktoken

In [None]:
from langchain.llms import OpenAI
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain import PromptTemplate


# Step 4 - Write a custom refine chain and run it using gpt-3.5-turbo.

A "Refine Chain" prompt consists of three parts: a query, a question-prompt, and a refine-prompt. The query is the core prompt that is embedded in every step of the chain. The question-prompt is the first composed prompt, consisting of the first text chunk, the query, and the question-template. This produces the first Answer.

Then for each subsequent chunk, the refine-prompt combines the next text chunk, the query, the previous answer, and the refine-template to produce a new Answer. 

Here is the basic prompt chain logic:
1. Define a JSON-like structure for data extraction as our "query".
I'm using this structure:

```
query = """
"Case Number": "",
"District Court": "",
"Date": "",
"Plaintiffs": [""],
"Defendant": [""],
"Claims": [
	["Legal Claim": "",
	"Movant": "",
	"Non-Movant": "",
	"Claim Outcome": ""],
	]
"""
```
2. Perform the question-prompt, which contains additional definitions for the query, including a dictionary of claims, definitions for 'movant' and 'non-movant', and claim outcomes.

3. Perform each refine-prompt using all of the same definitions, but each time instructing the LLM to answer using the same structure as the original query.
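
The three steps above can be sketched in plain Python. This is an illustration of the control flow only, not LangChain's implementation: `run_refine` and `fake_llm` are hypothetical names, the two short templates are simplified placeholders, and the stand-in model just reports its prompt length so the sketch runs without an API key.

```python
def run_refine(chunks, query, llm, question_template, refine_template):
    """Feed chunks to `llm` one at a time, carrying the answer forward."""
    # First chunk: the question-prompt produces the initial answer.
    answer = llm(question_template.format(context_str=chunks[0], question=query))
    steps = [answer]
    # Remaining chunks: the refine-prompt sees the new chunk and the previous answer.
    for chunk in chunks[1:]:
        prompt = refine_template.format(
            context_str=chunk, question=query, existing_answer=answer
        )
        answer = llm(prompt)
        steps.append(answer)
    return answer, steps


def fake_llm(prompt):
    """Stand-in for a real model call."""
    return f"answer from a {len(prompt)}-character prompt"


final, steps = run_refine(
    chunks=["chunk one", "chunk two", "chunk three"],
    query='"Claims": []',
    llm=fake_llm,
    question_template="Context: {context_str}\n{question}\n",
    refine_template="Context: {context_str}\n{question}\nPrevious: {existing_answer}\n",
)
print(len(steps))  # 3
```

One model call is made per chunk, and only the most recent answer is passed along, which is why the query is repeated in every prompt to keep the output structure stable.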

I will try to do a separate write up of the thinking behind this prompt design soon. You can have the question-template and refine-template include different instructions, but that's not particularly helpful here. Using the 'query' prompt as an opportunity to enforce strict behavior over each iterative refine prompt greatly improves the uniformity of the structured text output. 
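
A quick way to spot-check that uniformity is to confirm that every field name from the query shows up in an answer. This is a coarse sketch under stated assumptions: `missing_fields` is a hypothetical helper, it does plain substring matching (so "Movant" is also satisfied by "Non-Movant"), and it says nothing about whether the extracted values are correct.

```python
EXPECTED_FIELDS = [
    "Case Number", "District Court", "Date", "Plaintiffs", "Defendant",
    "Claims", "Legal Claim", "Movant", "Non-Movant", "Claim Outcome",
]


def missing_fields(answer, expected=EXPECTED_FIELDS):
    """Return the expected field names that do not appear in the answer text."""
    return [name for name in expected if name not in answer]


# A made-up partial answer, used only to exercise the helper.
partial = '"Case Number": "", "Claims": []'
print(missing_fields(partial))
```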

While this text extractor performs well on this SJ Order, extractors like it tend to be fairly brittle (meaning they do not perform well once the environment changes), and in general trying to get a consistent, repeatable result from an LLM is fairly difficult. That's part of why I like building text extractors from gpt: I just enjoy impractical things. I'm also skeptical that certain concepts like "Copyright does not Infringe" and other Declaratory Judgment type claims will be identified correctly. I hope this provides an intro to Refine Chain Prompting and gives you some ideas as to how it can be used.


In [None]:
#@title Iantosca Prompt Templates for GPT3.5
# Each piece ends with "\n" so that adjacent string literals do not run together.
refine_template = (
    "Context: {context_str}\n"
    "Sample Answer: {question}\n"
    "Revised Answer: {existing_answer}\n"
    "Use the Context to collect as many Legal Claims as possible. Claims include ('Copyright Valid', 'Copyright Invalid', 'Copyright Infringed', 'Copyright does not Infringe', 'Fair Use Defense', 'Fair Use Defense Not Available').\n"
    "Movant is the name of the party bringing the claim. Non-Movant is the name of the party arguing against the claim.\n"
    "Claim Outcome is Granted, Granted-in-Part, or Not Granted.\n"
    "New Revised Answer that matches the structure of the Sample Answer."
)
refine_prompt = PromptTemplate(
    input_variables=["question", "existing_answer", "context_str"],
    template=refine_template,
)


# Each piece ends with "\n" so that adjacent string literals do not run together.
question_template = (
    "Context: {context_str}\n"
    "Use the Context to collect as many Legal Claims as possible. Claims include ('Copyright Valid', 'Copyright Invalid', 'Copyright Infringed', 'Copyright does not Infringe', 'Fair Use Defense', 'Fair Use Defense Not Available').\n"
    "Movant is the name of the party bringing the claim. Non-Movant is the name of the party arguing against the claim.\n"
    "Claim Outcome is Granted, Granted-in-Part, or Not Granted.\n"
    "{question}\n"
)
question_prompt = PromptTemplate(
    input_variables=["context_str", "question"], template=question_template
)

In [None]:
#@title Configure our LLM and Chain prompt for Iantosca SJ
from langchain.chat_models import ChatOpenAI

# gpt-3.5-turbo is a chat model, so use ChatOpenAI rather than the completion-style OpenAI class.
# OPEN_AI_KEY is the variable defined in Step 0.
llm = ChatOpenAI(temperature=0, openai_api_key=OPEN_AI_KEY, model_name="gpt-3.5-turbo", max_tokens=1400)
chain = load_qa_with_sources_chain(llm, chain_type="refine", verbose=True, return_intermediate_steps=True, question_prompt=question_prompt, refine_prompt=refine_prompt)


In [None]:
#@title Iantosca Run Chain
query = """
"Case Number": "",
"District Court": "",
"Date": "",
"Plaintiffs": [""],
"Defendant": [""],
"Claims": [
	["Legal Claim": "",
	"Movant": "",
	"Non-Movant": "",
	"Claim Outcome": ""],
	]
"""
chain({"input_documents": texts, "question": query})
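
To read the result more comfortably than in the raw cell output, you could capture the return value and walk through it. This is a sketch under an assumption: that the chain returns a dictionary with an `intermediate_steps` list and an `output_text` string, which is my understanding of what a LangChain refine chain returns when `return_intermediate_steps=True`. Check the keys on your installed version. The helper name `summarize_run` is hypothetical.

```python
def summarize_run(result):
    """Print each intermediate answer, then return the final one."""
    for number, step in enumerate(result.get("intermediate_steps", []), start=1):
        print(f"--- after chunk {number} ---")
        print(step)
    return result.get("output_text", "")


# A made-up result dictionary, used only to show the shape the helper expects.
example = {"intermediate_steps": ["first draft", "second draft"], "output_text": "second draft"}
final_answer = summarize_run(example)
```

In the notebook, that would look like `result = chain({"input_documents": texts, "question": query})` followed by `summarize_run(result)`. Watching how the answer changes after each chunk is a handy way to see where a claim was added or lost.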