# Analyzing Customer Agreements with LLMs

This notebook automates the analysis and evaluation of a customer agreement by comparing GPT-4's answers with manually prepared ones. We pose a set of questions about the GitHub Customer Agreement to GPT-4 and compare its answers with those prepared manually in a [preceding article](https://hdl.handle.net/11705/JCIS/2022/041).



![process](img/process.png)

In [None]:
! pip install langchain
! pip install openai
! pip install tiktoken
! pip install tqdm
! pip install chromadb
! pip install langchain_experimental
! pip install langchain_openai

In [2]:
import os
from typing import List, Union

import chromadb

from langchain.pydantic_v1 import Field, BaseModel
from langchain.chains.openai_functions import create_structured_output_chain
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

In [3]:
# add your api key

os.environ["OPENAI_API_KEY"] = ""

## Respond to analysis operations on the GitHub Customer Agreement

In this section, we will present GPT-4 with some questions about the GitHub Customer Agreement.

Why do we use GPT-4? A [technical report](https://arxiv.org/abs/2303.08774) published by OpenAI, which extensively evaluates the performance of GPT-4, concludes that the model performs at a human-like level on professional and academic benchmark tests.

*   Specifically, in two of the tests with a legal focus, GPT-4 was more than 85% accurate. These tests assess skills such as critical reading, logic and analysis in this area.
*   Other studies focused on assessing typical NLP tasks such as question answering.

Other studies provide further evidence of the model's performance. [Sifatkaur Dhingra et al.](https://doi.org/10.1016/j.tbench.2023.100139) use a specific dataset to evaluate the accuracy of the model on a natural language understanding (NLU) task. The results show that GPT-4 achieves more than 80% accuracy on commonsense and SuperGLUE, which partially cover tasks we are interested in, such as question answering and paraphrase detection.

### Customer Agreement relevant questions

The 26 questions below were collected in a [previous article](https://hdl.handle.net/11705/JCIS/2022/041).

In [76]:
questions = [
    "As a customer, can I share the product or service with third parties?",
    "As a customer, can I modify the source code of the product or service to my liking?",
    "Can the supplier make modifications to the products or services?",
    "Does the contract provide any protection for confidential customer information?",
    "Does the customer retain ownership of the data he provides to the supplier?",
    "Are the rights of use the supplier receives over the customer's data limited to what is strictly necessary?",
    "Does the provider commit to any security standards or practices regarding customer content?",
    "Are the licenses received by the supplier on the customer's intellectual property limited?",
    "Does the supplier have to delete personal information after the end of the contract?",
    "Does the supplier indemnify the customer for infringement of third party intellectual property?",
    "Are the customer's indemnification obligations limited in third party claims?",
    "Is the customer's ability to confront the supplier or any other party limited?",
    "Does the supplier have a liability limit of 12 months' quota or higher?",
    "What limits of liability against consequential damages does the supplier have?",
    "If the supplier indemnifies for infringement of third party intellectual property, is it exempt from the limit of liability?",
    "Is the customer's liability limited?",
    "Does the customer have any right to terminate the agreement?",
    "If the contract is self-renewing can the customer opt out at that time?",
    "Does the customer have any liability to pay taxes?",
    "What rights does the customer have regarding data migration?",
    "Is the contract renewal automatic or does it need to be initiated by the customer?",
    "What is the mandatory governing law?",
    "Is the mandatory headquarters located within the United States?",
    "Is the customer's ability to develop or procure similar products or services from other suppliers limited?",
    "Who is responsible for ensuring that the services function properly?",
    "Can the customer notify the supplier via email?"]

### Populate the Database using the Knowledge Base

To enhance the responses generated by GPT, the RAG (Retrieval Augmented Generation) technique has been chosen. This approach involves supplying the model with relevant documents as context to better address a question.

In this part of the code, the complete documents are divided by sections and stored in a database with their semantic vector representation (embedding) for later use, as can be seen in the image.

During the experiment, we retrieved the 10 most relevant chunks for each question. In total, the 2024 version of the Customer Agreement (CA) was divided into 44 chunks and the 2022 version into 52 chunks.

![rag](img/rag.png)
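The retrieval step can be illustrated with a minimal sketch that ranks chunks by cosine similarity between a query embedding and the chunk embeddings. The three-dimensional vectors and the helper names `cosine_similarity` and `top_n_chunks` are hypothetical and only serve the illustration; in the notebook, Chroma computes the embeddings and performs the search, and its default distance metric may differ from cosine similarity.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_chunks(query_vec, chunk_vecs, chunks, n=2) -> List[Tuple[float, str]]:
    """Return the n (score, chunk) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), chunk)
              for vec, chunk in zip(chunk_vecs, chunks)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


# Toy embeddings (hypothetical values; real embeddings have many more dimensions).
chunks = ["Section on termination", "Section on taxes", "Section on notices"]
chunk_vecs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.1, 0.0, 0.9]]
query_vec = [1.0, 0.0, 0.1]

best = top_n_chunks(query_vec, chunk_vecs, chunks, n=2)
print([chunk for _, chunk in best])
```

Only the selected chunks are passed to the model as context, which keeps the prompt short compared with sending the complete agreement.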


In [77]:
client = chromadb.Client(chromadb.Settings(allow_reset=True))
client.reset()

# create collection to store github general terms documents
collection = client.get_or_create_collection("github-customer-agreement")

In [78]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
semantic_chunker = SemanticChunker(OpenAIEmbeddings())

def chunk_markdown(document: str):
    """Split document based on its Markdown header structure"""
    # Reuse the module-level splitter instead of building a new one on every call
    return markdown_splitter.split_text(document)

def chunk_semanticly(document: str):
    """Split document by grouping sentences that are semantically close to each other"""
    return semantic_chunker.create_documents([document])

def read_and_chunk(path: str, chunking_function=chunk_semanticly):
    """Read a file and chunk it

    Returns: a list of documents.
    """
    # Explicit encoding avoids platform-dependent defaults
    with open(path, "r", encoding="utf-8") as f:
        document = f.read()
    return chunking_function(document)

def upsert_documents(document_id, documents):
    """Insert or update a list of documents"""
    collection.upsert(
        documents=[document.page_content for document in documents],
        metadatas=[document.metadata for document in documents],
        # One stable id per chunk: "<document_id>-<chunk index>"
        ids=[f"{document_id}-{i}" for i in range(len(documents))]
    )

In [79]:
knowledge_base = [
    'github-general-terms.md',
    'github-data-protection.md'
]

for document_path in knowledge_base:
    documents = read_and_chunk(document_path, chunking_function=chunk_markdown)
    upsert_documents(document_path, documents)

In [80]:
collection.count()

52
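The header-based chunking used above can be sketched in plain Python to show where the chunk metadata comes from. This is a simplified illustration, not LangChain's implementation: the function name `split_by_headers` is hypothetical, and details such as fenced code blocks inside the Markdown are ignored.

```python
from typing import Dict, List, Tuple


def split_by_headers(text: str, max_level: int = 3) -> List[Tuple[Dict[str, str], str]]:
    """Split Markdown into (metadata, body) pairs, one per header-delimited section."""
    sections: List[Tuple[Dict[str, str], str]] = []
    current_headers: Dict[str, str] = {}
    body: List[str] = []

    def flush() -> None:
        # Store the accumulated body together with the headers active so far
        content = "\n".join(body).strip()
        if content:
            sections.append((dict(current_headers), content))
        body.clear()

    for line in text.splitlines():
        stripped = line.strip()
        hashes = len(stripped) - len(stripped.lstrip("#"))
        if 1 <= hashes <= max_level and stripped[hashes:hashes + 1] == " ":
            flush()
            # A new header closes any header at the same or a deeper level
            for level in range(hashes, max_level + 1):
                current_headers.pop(f"Header {level}", None)
            current_headers[f"Header {hashes}"] = stripped[hashes:].strip()
        else:
            body.append(line)
    flush()
    return sections


sample = "# Terms\nIntro text.\n## 1. Licenses\nLicense text.\n## 2. Taxes\nTax text."
sections = split_by_headers(sample)
for metadata, content in sections:
    print(metadata, "->", content)
```

Each chunk carries the headers it sits under, which is what later allows the section titles to be shown next to the retrieved text.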

### Chain of Thought QA using Vector Database with Knowledge Base

To address this problem, two prompt engineering techniques were tested: One-Shot and Chain-of-Thought. Few-Shot was not considered because it is not well suited to this particular scenario (we do not have multiple examples to provide to the LLM in the prompt). While One-Shot prompting is less expensive, it also tends to produce more inconsistent answers and is more prone to hallucinations. The following list summarizes the most relevant differences between the two approaches:

* One-Shot:
    * More cost-efficient (fewer input and output tokens, which translates into lower cost).
    * Less complete answers.
    * Inconsistencies between queries.
    * More likely to produce hallucinations.
* [Chain-of-Thought](https://www.promptingguide.ai/techniques/cot):
    * More expensive.
    * Detailed and well-reasoned answers.
    * Consistent answers between queries.
    * Less likely to produce hallucinations.

The goal is to obtain a straightforward "Yes" or "No" response to quickly determine if the model's answers align with the manual ones. Following this, we seek an explanation of the answer and the section where the information is located. Lastly, to gauge the answer's accuracy, we ask the model to score its own answer from 1 to 10.
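As an illustration of why a plain "Yes" or "No" is useful, the sketch below extracts the leading verdict from a free-text answer and compares two answers on that basis. It is only a coarse first pass under the assumption that answers begin with their verdict; the helper names `leading_verdict` and `verdicts_agree` are hypothetical, and the notebook itself relies on GPT-4 for the actual comparison.

```python
import re
from typing import Optional


def leading_verdict(answer: str) -> Optional[str]:
    """Return 'Yes' or 'No' if the answer starts with one of them, else None."""
    match = re.match(r"\s*(yes|no)\b", answer, flags=re.IGNORECASE)
    return match.group(1).capitalize() if match else None


def verdicts_agree(manual_answer: str, model_simple_answer: str) -> Optional[bool]:
    """Compare leading verdicts; None means at least one answer is not Yes/No."""
    manual = leading_verdict(manual_answer)
    model = leading_verdict(model_simple_answer)
    if manual is None or model is None:
        return None
    return manual == model


print(verdicts_agree("Yes, the customer has tax liabilities (section 8.3 Taxes).", "Yes"))
print(verdicts_agree("No, as no related information appears in the CA.", "Yes"))
print(verdicts_agree("The customer can request data migration up to 90 days after the end of the agreement.", "Yes"))
```

Open questions whose manual answer is not a Yes/No statement return `None` here, which is one reason a model-assisted comparison of the full justifications is still needed.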


In [81]:
class Background(BaseModel):
    """Background explaining user question"""
    background: str = Field(..., description="Background explaining user question")

class Thought(BaseModel):
    """A thought about user question"""
    thought: str = Field(..., description="A thought about user question")
    flawed: bool = Field(..., description="Whether or not the thought is flawed or misleading")
    helpful: bool = Field(..., description="Whether or not the thought is helpful to solve user question")

class Answer(BaseModel):
    """The answer to user question"""
    simple_answer: str = Field(..., description="Respond to the question with 'Yes' or 'No'")
    section: str = Field(..., description="Part of the context where the information to answer the question was found.")
    answer: str = Field(..., description="Answer to the user's question and mention the section(s) from which the answer is obtained")
    # ge/le are the pydantic constraints for an inclusive numeric range
    score: int = Field(..., description="Score from 1 to 10 on how correct the answer is", ge=1, le=10)

class ChainOfThoughtAnalyzer(BaseModel):
    """A chain of thoughts to answer user question"""
    background: Background = Field(..., description="Background explaining user questions")
    thoughts: List[Thought] = Field(..., description="List of thoughts about user question")
    answer: Answer = Field(..., description="The answer to user question")

class ChainOfThoughtBuilderForQuestionAndAnswering():
    """Build a chain of thought (CoT) to answer a question using an LLM"""

    def __init__(self, llm, verbose=False):
        # Adjacent string literals are concatenated, so explicit newlines keep each rule on its own line
        self.prompt = ChatPromptTemplate.from_messages([
            ("system", "You are the most intelligent person in the world.\n"
                "\n"
                "You will receive a prize if you follow ALL these rules:\n"
                "- First, establish a detailed background useful to answer the user question.\n"
                "- Each thought must include whether it is relevant and whether it is helpful.\n"
                "- Continue to add thoughts until you can confidently answer the question.\n"
                "- The answer must be scored accurately and honestly."),
            ("human", 'Useful context: {context}'),
            ("human", 'User question: """{question}"""'),
        ])
        self.chain = create_structured_output_chain(ChainOfThoughtAnalyzer, llm, self.prompt, verbose=verbose)

    def build(self, question: str, context: str = None):
        return self.chain.run(context=context, question=question)

In [82]:
def cot_analysis_to_markdown(cot: ChainOfThoughtAnalyzer, question: str, top_documents: List[str], metadatas: List[dict]):
    """Convert a ChainOfThought to markdown"""
    md = "# Top documents\n"
    for i, (document, metadata) in enumerate(zip(top_documents, metadatas)):
        md += "## Document " + str(i) + ": " + " - ".join(metadata.values()) + "\n"
        md += document + "\n\n"

    md += "# " + question + "\n"
    md += "## Background\n"
    md += cot.background.background + "\n\n"

    md += "## Thoughts\n"
    for thought in cot.thoughts:
        md += "- " + thought.thought + "\n"
        md += "  - Flawed: " + str(thought.flawed) + "\n"
        md += "  - Helpful: " + str(thought.helpful) + "\n\n"

    md += "## Answer\n"
    md += "- " + cot.answer.simple_answer + ". Section(s): " + cot.answer.section + ". Justification: " + cot.answer.answer + "\n"
    md += "  - Score: " + str(cot.answer.score) + "\n\n"

    return md

def search_topn_documents(query, n_results=5):
    """Return the top N documents for a given query text"""
    return collection.query(
        query_texts=query,
        n_results=n_results
    )

### Main


In [83]:
llm = ChatOpenAI(model="gpt-4-1106-preview", temperature=0)
cot_builder_analysis = ChainOfThoughtBuilderForQuestionAndAnswering(llm, verbose=True)

def get_GPT4_analysis_response(question, number, strategy, n_rag_documents, cot_builder_analysis):
    """Return the answer of the model and the question"""
    top_documents = search_topn_documents(question, n_results=n_rag_documents)

    data = zip(top_documents['documents'][0], top_documents['metadatas'][0])

    context = "\n".join(["# " + " - ".join(metadata.values()) + "\n" + document for (document, metadata) in data])

    cot = cot_builder_analysis.build(question, context=context)
    markdown = cot_analysis_to_markdown(cot, question, top_documents['documents'][0], top_documents['metadatas'][0])

    directory_path = os.path.join(f"./{strategy}", "analysis")

    # Create the output directory if it does not exist yet
    os.makedirs(directory_path, exist_ok=True)

    file_path = os.path.join(directory_path, f"question_{number}_{strategy}.md")

    with open(file_path, "w") as f:
        f.write(markdown)

    return cot, question
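The indexing with `[0]` in the function above follows from the shape of a Chroma query result, which holds one inner list per query text. The sketch below reproduces the context-building step on a hand-written result; the ids, texts and metadata are made-up placeholders, and `build_context` is a hypothetical helper name.

```python
# Hand-written stand-in for the dictionary returned by collection.query(...)
# for a single query text (all values are placeholders).
query_result = {
    "ids": [["github-general-terms.md-3", "github-general-terms.md-7"]],
    "documents": [["Text of the first chunk.", "Text of the second chunk."]],
    "metadatas": [[{"Header 1": "Terms", "Header 2": "Notices"}, {"Header 1": "Terms"}]],
}


def build_context(result: dict) -> str:
    """Join the retrieved chunks into one context string, titled by their headers."""
    documents = result["documents"][0]  # results for the first (and only) query
    metadatas = result["metadatas"][0]
    blocks = []
    for document, metadata in zip(documents, metadatas):
        # Tolerate missing metadata and non-string values
        title = " - ".join(str(value) for value in (metadata or {}).values())
        blocks.append("# " + title + "\n" + document)
    return "\n".join(blocks)


print(build_context(query_result))
```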

----------

## Compare results using GPT-4 as an assistant to the manual evaluation

In this phase of the experiment, we present the results obtained from GPT-4 using various strategies together with those from the manual analysis. Our goal is to compare all responses as efficiently and correctly as possible, so we will perform a manual evaluation assisted by GPT-4, automating part of the evaluation and using it as a starting point.

Therefore, the initial step is to ask GPT-4 to determine whether the answers of each strategy match the manual one. Using this data, we will categorize GPT-4's responses based on different patterns identified in earlier iterations of the experiment. The aim is to derive more insightful conclusions from this study.

### Chain of Thought compare both model and manual answer

The goal is to obtain a clear "Match" or "Not Match" response, as this is the main task for which we can use GPT-4 as an assistant.

In [84]:
class Thought2(BaseModel):
    """A thought about user question"""
    thought: str = Field(..., description="A thought about user prompt")
    flawed: bool = Field(..., description="Whether or not the thought is flawed or misleading")
    helpful: bool = Field(..., description="Whether or not the thought is helpful to solve the task")

class Answer2(BaseModel):
    """The answer to user question"""
    simple_answer: str = Field(..., description="Indicate whether the answers match by responding with 'Match' or 'Not Match'")
    answer: str = Field(..., description="Provide a justification for your assessment.")
    # ge/le are the pydantic constraints for an inclusive numeric range
    score: int = Field(..., description="Score from 1 to 10 on how correct your evaluation is", ge=1, le=10)

class ChainOfThoughtComparator(BaseModel):
    """A chain of thoughts to answer user question"""
    thoughts: List[Thought2] = Field(..., description="List of thoughts about user prompt")
    answer: Answer2 = Field(..., description="The answer to user prompt")

class ChainOfThoughtBuilderForComparison():
    """Build a chain of thought (CoT) to compare sentences"""

    def __init__(self, llm, verbose=False):
        # Adjacent string literals are concatenated, so explicit newlines keep each part on its own line
        self.prompt = ChatPromptTemplate.from_messages([
            ("system",
                "You are the most intelligent person in the world.\n"
                "You will receive a prize if you follow ALL these rules:\n"
                "- You will be provided with one question and two answers to that question.\n"
                "- Compare the answers to determine if they convey the same meaning.\n"
                "- Each thought must include whether it is relevant and whether it is helpful.\n"
                "- Continue to add thoughts until you can confidently answer the comparison.\n"
                "- The answer must be scored accurately and honestly."),
            ("human", 'Useful context: {context}'),
            ("human",
             'User prompt: the question is """{question}"""\n'
             'Answer 1: """{manual}"""\n'
             'Answer 2: """{gpt4}"""'
             ),
        ])
        self.chain = create_structured_output_chain(ChainOfThoughtComparator, llm, self.prompt, verbose=verbose)

    def build(self, question: str, manual: str, gpt4: str, context: str = None):
        return self.chain.run(context=context, question=question, manual=manual, gpt4=gpt4)

In [85]:
def cot_evaluation_to_markdown(cot: ChainOfThoughtComparator, question: str, manual_answer: str, gpt4_answer: str, strategy: str):
    """Convert a ChainOfThought to markdown"""
    md = "# Evaluation of question " + question + "\n"
    md += "## Background\n"
    md += "- Manual analysis answer is: " + manual_answer + "\n\n"
    md += "- GPT-4 analysis answer with the strategy " + strategy + " is: " + gpt4_answer + "\n"

    md += "## Thoughts\n"
    for thought in cot.thoughts:
        md += "- " + thought.thought + "\n"
        md += "  - Flawed: " + str(thought.flawed) + "\n"
        md += "  - Helpful: " + str(thought.helpful) + "\n\n"

    md += "## Assessment\n"
    md += "- " + cot.answer.simple_answer + ". Justification: " + cot.answer.answer + "\n"
    md += "  - Score: " + str(cot.answer.score) + "\n\n"

    return md

### Main

In [86]:
llm = ChatOpenAI(model="gpt-4-1106-preview", temperature=0)
cot_builder_evaluation = ChainOfThoughtBuilderForComparison(llm, verbose=True)

def get_GPT4_evaluation_response(question, manual_answer, gpt4_answer, number, strategy, cot_builder_evaluation):
    """Return the assessment of the model"""

    cot = cot_builder_evaluation.build(question=question, manual=manual_answer, gpt4=gpt4_answer, context=None)
    markdown = cot_evaluation_to_markdown(cot, question, manual_answer, gpt4_answer, strategy)
    directory_path = os.path.join(f"./{strategy}", "evaluation")

    # Create the output directory if it does not exist yet
    os.makedirs(directory_path, exist_ok=True)

    file_path = os.path.join(directory_path, f"question_{number}_{strategy}.md")

    with open(file_path, "w") as f:
        f.write(markdown)

    return cot

## Execution of the experiment

In [87]:
strategies = ["gpt4_2024_complete", "gpt4_2024_rag", "gpt4_2022_complete", "gpt4_2022_rag"]
manual_answers = [
    "No, as the customer may neither distribute the product to third parties nor upgrade parts of the product respectively (section 1.12 Restrictions).",
    "No, as the customer may neither distribute the product to third parties nor upgrade parts of the product respectively (section 1.12 Restrictions).",
    "Yes, because GitHub has the right to make changes to products if they do not materially diminish the functionality of the product.",
    "Yes, because confidential information will only be used for the relationship between both parties (section 4.3 Protection on Confidential information).",
    "Yes, because the customer retains all rights, title and interest and GitHub may only use the customer's data for what is strictly necessary (Data Protection Document, Nature of Data Processing and Ownership and Processing for GitHub's legitimate business operations sections).",
    "Yes, because the customer retains all rights, title and interest and GitHub may only use the customer's data for what is strictly necessary (Data Protection Document, Nature of Data Processing and Ownership and Processing for GitHub's legitimate business operations sections).",
    "Yes, because GitHub complies with the security standards (data protection document, Data Security section).",
    "Yes, because the customer retains all rights, title and interest and GitHub may only use the customer's data for what is strictly necessary (Data Protection Document, Nature of Data Processing and Ownership and Processing for GitHub's legitimate business operations sections).",
    "No, because customer data is only destroyed at the customer's request and if lawful (data protection document, Data Retention and Deletion section).",
    "Yes, because if GitHub is unable to resolve a misappropriation claim, it offers customer compensation options (section 6 Third-party claims).",
    "No, since the parties mutually defend each other and will contribute the amount of any final judgment (section 6 Third party claims).",
    "No, as no related information appears in the CA.",
    "Yes, as GitHub will not exceed the amount the customer has paid for the product during the 12 months prior to the incident (section 7.1 (a) Products).",
    "In response to Q17 we can say that liability for indirect damages is excluded for both parties (section 7.1 (C) Exclusions).",
    "Yes, as infringement of IP rights is an exception to limitation of liability and liability is limited to direct damages respectively (section 7 Limitation of liability).",
    "Yes, as infringement of IP rights is an exception to limitation of liability and liability is limited to direct damages respectively (section 7 Limitation of liability).",
    "Yes, as a customer may terminate the agreement, but with 30 days' notice (section 9 Term and termination).",
    "Yes, as a customer may terminate the agreement, but with 30 days' notice (section 9 Term and termination).",
    "Yes, the customer has tax liabilities (section 8.3 Taxes).",
    "The customer can request data migration up to 90 days after the end of the agreement (section 9.4 Migration).",
    "Needs to be initiated, according to CA product licenses expire at the end of the subscription period, unless renewed (section 1.2 Duration of licenses).",
    "Depends on the principal office of the customer, as if it is located in the European Union, the European Economic Area or Switzerland then the agreement is governed by the laws of Ireland, otherwise by the laws of the State of California and the federal laws of the United States (section 10.11 Applicable law and venue).",
    "No, as there is insufficient information in the CA on the compulsory seat.",
    "No (section 10.1 Independent contracts).",
    "GitHub is responsible for the performance of the services (section 10.12 GitHub and contractors).",
    "Yes, the customer can notify the supplier via email (section 10.10 Notices)."
    ]
numbers = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]

In [88]:
def main(question, number, manual_answer, strategy, n_rag_documents, cot_builder_analysis, cot_builder_evaluation):
    """Run the analysis for one question, then evaluate it against the manual answer"""
    analysis_results = get_GPT4_analysis_response(question, number, strategy, n_rag_documents, cot_builder_analysis)
    gpt4_answer = analysis_results[0].answer.simple_answer + ". " + analysis_results[0].answer.answer
    evaluation_results = get_GPT4_evaluation_response(question, manual_answer, gpt4_answer, number, strategy, cot_builder_evaluation)
    return analysis_results, evaluation_results

In [None]:
result = main(questions[25], numbers[25], manual_answers[25], strategies[2], 10, cot_builder_analysis, cot_builder_evaluation)
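The cell above runs a single question. One possible way to cover all questions and summarize the verdicts is sketched below. The helper `run_all` and the stand-in `fake_run_one` are hypothetical and not part of the notebook; in practice `run_one` would be a small wrapper that calls `main(...)` for the chosen strategy and returns the evaluation's `simple_answer`.

```python
from collections import Counter
from typing import Callable, Dict, Sequence


def run_all(questions: Sequence[str], numbers: Sequence[int],
            manual_answers: Sequence[str],
            run_one: Callable[[str, int, str], str]) -> Dict[int, str]:
    """Run one evaluation per question and collect the verdicts by question number."""
    if not (len(questions) == len(numbers) == len(manual_answers)):
        raise ValueError("questions, numbers and manual_answers must have the same length")
    verdicts: Dict[int, str] = {}
    for question, number, manual_answer in zip(questions, numbers, manual_answers):
        try:
            verdicts[number] = run_one(question, number, manual_answer)
        except Exception as error:
            # Record the failure and continue with the remaining questions
            verdicts[number] = f"Error: {error}"
    return verdicts


def fake_run_one(question: str, number: int, manual_answer: str) -> str:
    """Stand-in for a real model call; returns a canned verdict."""
    return "Match" if manual_answer.startswith("Yes") else "Not Match"


demo = run_all(["Q-a", "Q-b", "Q-c"], [4, 5, 6],
               ["Yes, ...", "No, ...", "Yes, ..."], fake_run_one)
summary = Counter(demo.values())
print(demo)
print(summary)
```

The length check guards against the three lists drifting out of step, since questions, question numbers and manual answers are paired by position.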