# In this notebook we will learn how to pass different documents to GPT and have them analysed

## First, we install the required packages that will help us along the way
1. openai: This package lets us call OpenAI's chat completion method to generate results using GPT.
2. PyMuPDF: This package is used for easy PDF manipulation.
3. tiktoken: This package is used to count the tokens in a text.

In [1]:
# %%capture
!pip install PyMuPDF==1.24.2 PyMuPDFb==1.24.1 tqdm tiktoken
! pip install openai==1.55.3 httpx==0.27.2 --force-reinstall

Collecting PyMuPDF==1.24.2
  Downloading PyMuPDF-1.24.2-cp311-none-manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting PyMuPDFb==1.24.1
  Downloading PyMuPDFb-1.24.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.4 kB)
Collecting tiktoken
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading PyMuPDF-1.24.2-cp311-none-manylinux2014_x86_64.whl (3.5 MB)
Downloading PyMuPDFb-1.24.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (30.8 MB)
Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)

In [2]:
! pip install openai httpx



In [13]:
import openai
from openai import OpenAI
import fitz
from tqdm import tqdm
import os
import tiktoken

# Initialize the OpenAI client
client = OpenAI(api_key="Place Your OpenAi API Key Here")  # Replace with your actual API key


token_encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

Here we define a function that generates a response from OpenAI.

It uses the `client` initialised above, which holds the API key, to call the `client.chat.completions.create` method.

This method takes the system message and the user message as input and returns the generated response.

### The temperature parameter defines the randomness of the output

The higher the temperature, the more creative the LLM becomes with its answers, so higher temperatures are used for poems, jokes, etc.
A lower temperature gives more deterministic results; at a temperature of 0 the model returns the most probable next token.
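
The effect of temperature can be illustrated with a small, self-contained sketch. It is not how the OpenAI service is implemented internally; it only shows the usual idea of dividing the model's raw scores by the temperature before converting them to probabilities. The function name `softmax_with_temperature` and the three example scores are assumptions made for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # A temperature of 0 is usually treated as "always pick the top token";
        # this sketch only handles positive values.
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

With a low temperature almost all of the probability goes to the highest-scoring token, while a high temperature spreads it more evenly across the candidates.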

### Top P

Top P, also called nucleus sampling, is a sampling technique used alongside temperature to control how deterministic the model is. With Top P, only the tokens that make up the top_p probability mass are considered for the response. A low top_p value therefore restricts the model to its most confident tokens, which suits exact, factual answers. A high top_p value lets the model consider more possible words, including less likely ones, which leads to more diverse outputs.
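
As a rough illustration of nucleus sampling, the sketch below keeps the smallest set of most likely tokens whose combined probability reaches `top_p` and renormalises them. The function name `nucleus_filter` and the example probabilities are hypothetical; the real service applies this step internally and may differ in detail.

```python
def nucleus_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalise so the surviving probabilities sum to 1 again.
    return {token: p / cumulative for token, p in kept}

# Hypothetical next-token probabilities.
probs = {"court": 0.6, "law": 0.25, "tribunal": 0.1, "banana": 0.05}

print(nucleus_filter(probs, 0.5))  # only "court" survives
print(nucleus_filter(probs, 0.9))  # "court", "law" and "tribunal" survive
```

In this sketch a `top_p` of 0 keeps only the single most likely token, which is why a low value pushes the model toward its most confident answer.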

In [4]:
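def CallOpenAI(user, system):
  # Send the system and user messages to the chat completions endpoint.
  response = client.chat.completions.create(
      model="gpt-3.5-turbo",  # model name (or your deployment name)
      temperature=0,  # lowest randomness
      top_p=0,        # restrict sampling to the most probable tokens
      messages=[
          {"role": "system", "content": system},
          {"role": "user", "content": user},
      ],
  )
  # Return the full response object; the answer text is in
  # response.choices[0].message.content
  return response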

## Let's take a contract and try to analyse it without much instruction

First we load the PDF, extract its text, and count the tokens in that text.

In [5]:
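def extract_text(pdf_path):
  # Open the PDF and concatenate the text of every page.
  with fitz.open(pdf_path) as pdf:
    text = ''.join(page.get_text() for page in pdf)

  # Count tokens with the gpt-3.5-turbo encoding loaded earlier.
  num_tokens = len(token_encoding.encode(text))
  print("Number of tokens in the entire Document: ", num_tokens)
  return text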

As the output below shows, the document contains 11590 tokens, which is well within the 16,384-token context limit of the GPT-3.5 Turbo model.

**⚠️ Note:** **In the cell below, you need to upload a file named `AWS1.pdf`.**  
**You can download the file from the link below.**
[📥 Download AWS1.pdf](https://github.com/initmahesh/MLAI-community-labs/blob/main/Class-Labs/Lab-2(Understanding%20RAG)/Lab-2.1(Generating-Response-without-RAG)/AWS1.pdf)


In [6]:
from google.colab import files
uploaded = files.upload()

Saving AWS1.pdf to AWS1.pdf


In [7]:
short_document = extract_text("/content/AWS1.pdf")

Number of tokens in the entire Document:  11590


## We concatenate the text from the PDF with the question the user wants to ask GPT about it, forming the prompt that we pass to the `client.chat.completions.create` method to generate the response

In [8]:
Question = "What are the governing courts for Amazon Web Services South Africa Proprietary Limited?"

full_prompt_SD = "<Context>"+short_document+"</Context>" +"\n\n" +"<Question>"+Question+"</Question>"
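
The same concatenation can be wrapped in a small helper so that every prompt is built the same way. This is only a convenience sketch: `build_prompt` is a hypothetical name, not part of the notebook or of the OpenAI library, and the example context sentence is made up.

```python
def build_prompt(context, question):
    """Wrap the document text and the question in the tags used in this notebook."""
    return f"<Context>{context}</Context>\n\n<Question>{question}</Question>"

# Made-up context and question, for illustration only.
example = build_prompt(
    "This agreement is governed by the laws of Example Country.",
    "Which laws govern this agreement?",
)
print(example)
```

Keeping the context and the question in separate tags makes it easier for the model to tell the document apart from the request.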

In [9]:
response = CallOpenAI(full_prompt_SD,"You are a Professional lawyer who can analyse documents thoroughly")

### We can see that GPT was able to generate the answer by referring to the prompt and gave the correct result

In [10]:
print(response.choices[0].message.content)

The governing courts for Amazon Web Services South Africa Proprietary Limited are the South Gauteng High Court in Johannesburg, South Africa. This information is specified in the document provided under the "AWS Contracting Party" section in the definitions. It states that the governing laws for this entity are the laws of the Republic of South Africa, and the specific court mentioned for legal matters is the South Gauteng High Court in Johannesburg.


## Now let's load a document that has more than 16,384 tokens, which is the limit of GPT-3.5-Turbo

**⚠️ Note:** **In the cell below, you need to upload a file named `PROFRAC HOLDINGS, LLC credit agreement.pdf`.**  
**You can download the file from the link below.**
[📥 Download PROFRAC HOLDINGS, LLC credit agreement.pdf ](https://github.com/initmahesh/MLAI-community-labs/blob/main/Class-Labs/Lab-2(Understanding%20RAG)/Lab-2.1(Generating-Response-without-RAG)/PROFRAC%20HOLDINGS%2C%20LLC%20credit%20agreement.pdf)


In [12]:
from google.colab import files
uploaded = files.upload()

Saving PROFRAC HOLDINGS, LLC credit agreement.pdf to PROFRAC HOLDINGS, LLC credit agreement (1).pdf


In [None]:
long_document = extract_text("/content/PROFRAC HOLDINGS, LLC credit agreement.pdf")

Number of tokens in the entire Document:  163227


In [None]:
Question = "What is the Acknowledgement Regarding Any Supported QFCs?"

full_prompt_LD = "<Context>"+long_document+"</Context>" +"\n\n" +"<Question>"+Question+"</Question>"

## Here you can see that when the message length exceeds the context limit of the model, the API returns an error.
### The next lab addresses this problem by showing how Retrieval Augmented Generation (RAG) lets us analyse documents of any length.

In [None]:
try:
  response = CallOpenAI(full_prompt_LD,"You are a Professional lawyer who can analyse documents thoroughly")
except Exception as e:
  # The request is rejected because the prompt exceeds the model's context length.
  print(str(e))

Error code: 400 - {'error': {'message': "This model's maximum context length is 16384 tokens. However, your messages resulted in 163263 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
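
This failure can be anticipated before calling the API by comparing the token count with the limit. The sketch below uses only numbers that appear in this notebook: the 16384-token limit from the error message, and an overhead of 36 tokens, which is the difference between the 163263 tokens reported by the API and the 163227 tokens counted in the document. The name `fits_in_context` is hypothetical, the overhead varies with the question and system message, and the check leaves no room for the model's reply.

```python
CONTEXT_LIMIT = 16384  # limit quoted in the error message
PROMPT_OVERHEAD = 36   # 163263 - 163227: tags, question and system message in this run

def fits_in_context(document_tokens, overhead=PROMPT_OVERHEAD, limit=CONTEXT_LIMIT):
    """Return True when the document plus prompt overhead stays within the limit."""
    return document_tokens + overhead <= limit

for name, tokens in [("AWS1.pdf", 11590), ("PROFRAC credit agreement", 163227)]:
    verdict = "fits" if fits_in_context(tokens) else "too long"
    print(f"{name}: {tokens} tokens -> {verdict}")
```

A check like this lets the notebook decide up front whether a document can be sent as a single prompt or needs a different approach.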
