## Importing:

In [1]:
#Note: The openai-python library support for Azure OpenAI is in preview.
#Note: This code sample requires OpenAI Python library version 1.0.0 or higher.
from openai import AzureOpenAI
from PyPDF2 import PdfReader
import os

## Setting the OpenAI key we got from Azure:

In [2]:
os.environ["AZURE_OPENAI_KEY"] = ''

Libraries and API key explanation:

The code imports libraries for interacting with Azure OpenAI (openai), reading PDFs (PyPDF2), and accessing environment variables (os).
The API key is stored in the AZURE_OPENAI_KEY environment variable so that later cells can read it with os.getenv instead of embedding it in the client call. The value is blank in this notebook; supply your own key before running.
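
As a minimal sketch of the same idea, the key can be read from the environment and checked before any request is made, rather than assigned inside the notebook. The helper name `require_env` is hypothetical and not part of any library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message.

    `require_env` is a hypothetical helper for illustration only.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "export it in your shell before starting the notebook."
        )
    return value
```

With this in place, the client could be given `api_key=require_env("AZURE_OPENAI_KEY")`, so a missing key fails early with a readable error instead of an authentication error from the service.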

## Defining a function to extract PDF page contents, capped at a maximum of 20000 tokens to prevent exceeding the token limit of the LLM used:

In [3]:
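def extract_pdf_pages(pdf_file_path, max_tokens=20000):
    """Return per-page text and metadata, stopping once the word budget is reached."""
    pdf_pages = []
    tokens_count = 0  # running word count, used as a rough proxy for tokens
    with open(pdf_file_path, 'rb') as pdf_file:
        reader = PdfReader(pdf_file)
        for page_number in range(len(reader.pages)):
            pdf_page_info = {
                # fall back to '' if no text is returned for the page
                'information': reader.pages[page_number].extract_text() or '',
                'source_file_name': os.path.basename(pdf_file_path),
                'page_number': page_number + 1  # 1-based page numbers
            }
            tokens_count += len(pdf_page_info['information'].split())
            if tokens_count >= max_tokens:
                break  # the page that crosses the limit is dropped
            pdf_pages.append(pdf_page_info)
    return pdf_pages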

extract_pdf_pages function explanation:

- This function reads a PDF file (pdf_file_path) and extracts information from each page.
- It iterates through all pages, extracting text content, filename, and page number for each page.
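- It keeps a running total of whitespace-separated words (tokens_count), which serves as a rough proxy for the model's token count.
- It stops processing pages once the running total reaches the limit (max_tokens); the page that crosses the limit is not included.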
- The extracted information for each page is stored in a dictionary and appended to a list (pdf_pages).
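
The page budget logic can be isolated from PDF parsing, as in the sketch below. It assumes the common rule of thumb that one English word corresponds to roughly 4/3 tokens; the constant `TOKENS_PER_WORD` and both function names are assumptions made for illustration, and an exact count would require the model's actual tokenizer.

```python
TOKENS_PER_WORD = 4 / 3  # assumption: rough rule of thumb for English text


def estimate_tokens(text: str) -> int:
    """Estimate the token count of `text` from its whitespace-separated words."""
    return round(len(text.split()) * TOKENS_PER_WORD)


def pages_within_budget(page_texts: list[str], max_tokens: int = 20000) -> list[str]:
    """Keep leading pages until the running estimate reaches the budget."""
    kept = []
    used = 0
    for text in page_texts:
        used += estimate_tokens(text)
        if used >= max_tokens:
            break  # same stopping rule as extract_pdf_pages: drop the crossing page
        kept.append(text)
    return kept
```

Because a plain word count underestimates tokens, scaling it this way leaves a little more headroom under the model's context limit than counting words alone.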

## Connecting to Azure OpenAI:

In [4]:
client = AzureOpenAI(
  azure_endpoint = "https://rag-openai-aueast.openai.azure.com/", 
  api_key=os.getenv("AZURE_OPENAI_KEY"),  
  api_version="2024-02-15-preview"
)

Azure OpenAI Client:

An AzureOpenAI client is created, specifying the endpoint URL (azure_endpoint), API key (api_key), and API version (api_version).

## Creating the message variable (for summarization):

In [5]:
message_text = [{"role":"system","content":"You are an AI assistant that helps people summarize documents."},
                {"role":"user","content":str(extract_pdf_pages('telecom-development.pdf'))}]

Chat Message Construction:

- A list (message_text) is created to represent a chat conversation.
- The first element is a system message that instructs the model to act as an assistant that summarizes documents.
- The second element is a user message containing the content extracted from the PDF (telecom-development.pdf) by the extract_pdf_pages function. The function returns a list of dictionaries, which str() converts to a single string here.
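
Calling `str()` on a list of dictionaries sends Python syntax (brackets, quotes, escaped newlines) to the model along with the text. One possible alternative, sketched below, renders each page under a short header first. The function names `format_pages` and `build_messages` are hypothetical; the dictionary keys match those returned by extract_pdf_pages.

```python
def format_pages(pdf_pages: list[dict]) -> str:
    """Render page dictionaries as plain text, one labeled block per page."""
    blocks = []
    for page in pdf_pages:
        header = f"[{page['source_file_name']}, page {page['page_number']}]"
        blocks.append(f"{header}\n{page['information'].strip()}")
    return "\n\n".join(blocks)


def build_messages(system_prompt: str, pdf_pages: list[dict]) -> list[dict]:
    """Build the two-element chat message list used in this notebook."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": format_pages(pdf_pages)},
    ]
```

The page headers also give the model something to cite if the prompt later asks for page references.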

## Sending the message to the LLM (GPT-4):

In [6]:
completion = client.chat.completions.create(
  model="gpt4model", # model = "deployment_name"
  messages = message_text,
  temperature=0.7,
  max_tokens=800,
  top_p=0.95,
  frequency_penalty=0,
  presence_penalty=0,
  stop=None
)

Calling Azure OpenAI for Chat Completion:

- The chat.completions.create method is called on the Azure OpenAI client to generate a response based on the provided conversation (message_text).
- Several parameters are set to influence the generated response:
  - model: The name of the Azure OpenAI deployment to use (here, "gpt4model").
  - temperature: Controls the randomness of the generated text (0.7 allows some randomness).
  - max_tokens: Limits the number of tokens generated in the response (800 tokens, not words).
  - top_p: Restricts sampling to the most likely tokens whose cumulative probability reaches this value (0.95).
  - frequency_penalty and presence_penalty: Both set to 0, which disables them; positive values penalize repeated tokens and discourage reusing tokens that have already appeared.
  - stop: No stop sequence is provided, so the model generates until it finishes on its own or reaches the max_tokens limit.
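
Since the same call is repeated for each task in this notebook with only the system prompt changing, it could be wrapped in a small helper. The sketch below is one way to do that; the name `run_chat_task` is hypothetical, and the client is passed in so the function relies only on `client.chat.completions.create` and the `finish_reason` field of the returned choice. A `finish_reason` of "length" indicates the response was cut off at max_tokens.

```python
def run_chat_task(client, deployment: str, messages: list[dict], max_tokens: int = 800) -> str:
    """Send `messages` to a chat deployment and return the reply text.

    Hypothetical helper: `client` is expected to be an AzureOpenAI client.
    """
    completion = client.chat.completions.create(
        model=deployment,  # on Azure this is the deployment name
        messages=messages,
        temperature=0.7,
        max_tokens=max_tokens,
        top_p=0.95,
    )
    choice = completion.choices[0]
    if choice.finish_reason == "length":
        # The model hit max_tokens before finishing its answer.
        print("Warning: response was truncated at max_tokens.")
    return choice.message.content
```

A call such as `run_chat_task(client, "gpt4model", message_text)` would then replace the repeated `client.chat.completions.create(...)` cells.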

## Showing the Output:

In [7]:
completion.choices[0].message.content

'The document discusses the role of Project Finance in facilitating Telecommunication Infrastructure Development in Newly Industrializing Countries. It highlights that traditionally, infrastructure development has been facilitated through public sector financing. However, due to economic deficiencies and poor financial management of resources, many newly industrializing countries are seeking foreign capital as a primary source of funding for infrastructure development. \n\nThe document discusses various financing methods, including direct government expenditure, general-obligation bonds, subsidization, and concessions to state-owned enterprises. However, these methods may have certain limitations. For example, bonds require public approval, while revenue and debt limits may inhibit infrastructure investment by government agencies.\n\nThe document also discusses the role of the World Bank and other multilateral financial institutions in providing debt financing. However, it notes that t

### Tagging:

In [8]:
message_text = [{"role":"system","content":"You are an AI assistant that helps people tag documents, you only provide a list of tags with no other context or caption."},
                {"role":"user","content":str(extract_pdf_pages('telecom-development.pdf'))}]

completion = client.chat.completions.create(
  model="gpt4model", # model = "deployment_name"
  messages = message_text,
  temperature=0.7,
  max_tokens=800,
  top_p=0.95,
  frequency_penalty=0,
  presence_penalty=0,
  stop=None
)

completion.choices[0].message.content

'- Telecommunications\n- Infrastructure Development\n- Project Finance\n- Newly-Industrializing Countries\n- Law\n- Regulation\n- Concession Agreements\n- BOT Model\n- Private Sector Participation\n- Public Sector Financing\n- Foreign Investment\n- International Telecommunication Union\n- Federal Communications Commission\n- Limited Liability Company\n- Joint Venture\n- Construction Agreement\n- Limited Partnership\n- Technology Transfer\n- Risk Allocation\n- Contractual Assurances\n- National Regulatory Methods\n- Bilateral and Multilateral Treaties\n- Sovereign Immunity\n- Concessionary Arrangements\n- Limited Recourse Financing\n- Taxation\n- Equity Investment\n- Debt Financing\n- Technology Adoption\n- Limited Liability Company\n- Risk Management\n- Regulatory Framework\n- International Law\n- Economic Development\n- Telecommunications Industry\n- Privatization\n- State Ownership\n- State Monopolies\n- Telecommunications Services\n- Communications Act of 1934\n- International Marit

### Classifying:

In [9]:
message_text = [{"role":"system","content":"You are an AI assistant that helps people classify documents, you only provide a classification label with no other context or caption."},
                {"role":"user","content":str(extract_pdf_pages('telecom-development.pdf'))}]

completion = client.chat.completions.create(
  model="gpt4model", # model = "deployment_name"
  messages = message_text,
  temperature=0.7,
  max_tokens=800,
  top_p=0.95,
  frequency_penalty=0,
  presence_penalty=0,
  stop=None
)

completion.choices[0].message.content

'"Legal Document"'

## Future work:
The output is not always as clear or direct as expected. For the tagging task, we expected the model to return only a list of tags for the document; instead, it responded with a description of the document first and then started listing the tags. An issue like this is easy to solve by tuning the prompt, but we leave that for future work, since this was just an example of how to handle documents with Azure OpenAI.
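
Alongside prompt tuning, the raw response can also be cleaned up in code. The sketch below assumes the model returns one tag per line, optionally prefixed with "- " as in the tagging output shown earlier, and it drops empty lines and repeated tags (for example, "Limited Liability Company" appears twice in that output). The function name `parse_tags` is hypothetical.

```python
def parse_tags(raw_response: str) -> list[str]:
    """Turn a bulleted, newline-separated response into a deduplicated tag list."""
    tags = []
    seen = set()
    for line in raw_response.splitlines():
        # Remove surrounding whitespace and any leading bullet characters.
        tag = line.strip().lstrip("-*").strip()
        key = tag.lower()  # compare case-insensitively
        if tag and key not in seen:
            seen.add(key)
            tags.append(tag)
    return tags
```

This does not remove a leading description if the model adds one; that part still depends on the prompt.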