
Goal: Build an LLM-based file-reading assistant that can quickly extract, locate, and summarize information from documents.

### Data Preparation


#### Installing LLM libraries

In [29]:
# !pip -q install langchain faiss-cpu unstructured
# !pip -q install openai tiktoken
# !pip -q install pytesseract pypdf



#### Importing libraries

In [30]:
import numpy as np
import pandas as pd
from langchain.document_loaders.image import UnstructuredImageLoader
from langchain.document_loaders import UnstructuredFileLoader


#### Detecting Documents


In [31]:
from filetype import guess

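def detect_document_type(document_path):
    """Classify a file as "pdf", "image", or "unknown" based on its content."""

    guess_file = guess(document_path)
    image_types = ['jpg', 'jpeg', 'png', 'gif']

    # guess() returns None when the file signature is not recognized
    if guess_file is None:
        return "unknown"

    extension = guess_file.extension.lower()

    if extension == "pdf":
        file_type = "pdf"

    elif extension in image_types:
        file_type = "image"

    else:
        file_type = "unknown"

    return file_type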
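
Content-based detection of this kind works by comparing the first bytes of a file against known signatures. The sketch below illustrates the idea with the standard library only. The helper names `classify_header` and `sniff_document_type` are hypothetical and are not part of the `filetype` package, and the signature table covers only the formats used in this notebook.

```python
import os
import tempfile

# Leading byte signatures for the formats handled in this notebook.
MAGIC_SIGNATURES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "image",  # PNG
    b"\xff\xd8\xff": "image",       # JPEG
    b"GIF87a": "image",
    b"GIF89a": "image",
}


def classify_header(header):
    """Hypothetical helper: map the leading bytes of a file to a coarse type."""
    for signature, file_type in MAGIC_SIGNATURES.items():
        if header.startswith(signature):
            return file_type
    return "unknown"


def sniff_document_type(path):
    """Hypothetical helper: read the first bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return classify_header(handle.read(16))


# Small demonstration with a throwaway file that starts like a PDF.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.bin")
    with open(sample_path, "wb") as handle:
        handle.write(b"%PDF-1.7\n")
    sample_type = sniff_document_type(sample_path)

print(sample_type)
```

Because the check reads bytes rather than the file name, a PDF saved with a misleading extension is still classified as a PDF.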


In [32]:
pdf_cisco_annualreport_2022 = "/content/cisco-annual-report-subset-2022.pdf"
img_machinelearning = "/content/Advantages-and-disadvantages-of-machine-learning-methods.png"

print(f"Cisco Report - Document Type : {detect_document_type(pdf_cisco_annualreport_2022)}")
print(f"Machine Learning Information - Document Type : {detect_document_type(img_machinelearning)}")

Cisco Report - Document Type : pdf
Machine Learning Information - Document Type : image


In [33]:
from langchain.document_loaders.image import UnstructuredImageLoader
from langchain.document_loaders import UnstructuredFileLoader

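def extract_file_content(file_path):
    """Load a PDF or image file and return its text content as one string."""

    file_type = detect_document_type(file_path)

    if file_type == "pdf":
        loader = UnstructuredFileLoader(file_path)

    elif file_type == "image":
        loader = UnstructuredImageLoader(file_path)

    else:
        # Without this branch, `loader` would be undefined for other file types
        raise ValueError(f"Unsupported document type for file: {file_path}")

    documents = loader.load()
    documents_content = '\n'.join(doc.page_content for doc in documents)

    return documents_content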

In [34]:
# !pip install unstructured==0.7.12

In [35]:
!sudo apt install tesseract-ocr

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 15 not upgraded.


In [36]:
!pip install pytesseract



In [37]:
import pytesseract
import shutil
import os
import random

try:
  from PIL import Image
except ImportError:
  import Image

In [38]:
cisco_content = extract_file_content(pdf_cisco_annualreport_2022)
ml_content = extract_file_content(img_machinelearning)



In [39]:
nb_characters = 400

print(f"First {nb_characters} Characters of the Report : \n{cisco_content[:nb_characters]}...")
print(" "*25)
print("---"*25)
print(" "*25)
print(f"First {nb_characters} Characters of the ML image :\n {ml_content[:nb_characters]}...")

First 400 Characters of the Report : 
Our purpose

Powering an inclusive future for all

Cisco’s efforts to fulfill our Purpose to Power an Inclusive Future for All are organized into three ESG pillars. From the technology that helps securely power the world’s connectivity (Power), to driving fairness, inclusion, and equitable opportunity (Inclusive), and helping to ensure a sustainable and regenerative planet (Future).

Power For alm...
                         
---------------------------------------------------------------------------
                         
First 400 Characters of the ML image :
 Name

type

Advantages

Disadvantages

SVM

Supervised Learning,

Kemel methods



based

No probability Good for high dimensional data Less risk of over-fitting

Difficult to choose a proper kernel function Long training time

Difficult to understand and interpret the final model, variable weights and individual impact

does not perform very well, when the data set has more noise

Logic



### Chunking and Embedding

In [40]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator = "\n\n",
    chunk_size = 1000,
    chunk_overlap  = 200,
    length_function = len,
)
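
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a simplified sketch of overlap-based chunking in plain Python. It is an illustration only, not LangChain's implementation: the function name `split_with_overlap` is hypothetical, and a piece that is longer than `chunk_size` on its own is simply emitted as an oversized chunk.

```python
def split_with_overlap(text, separator="\n\n", chunk_size=1000, chunk_overlap=200):
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size characters, repeating trailing pieces (up to chunk_overlap
    characters) at the start of the next chunk."""
    pieces = [piece for piece in text.split(separator) if piece.strip()]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop leading pieces until the carried-over tail fits the overlap
            # budget and leaves room for the incoming piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current = current[1:]
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


# Four 40-character paragraphs, split with a small budget to show the overlap.
demo_text = "\n\n".join(["a" * 40, "b" * 40, "c" * 40, "d" * 40])
demo_chunks = split_with_overlap(demo_text, chunk_size=100, chunk_overlap=50)
print(len(demo_chunks))
```

In the demo, each chunk holds two paragraphs and every chunk after the first begins with the last paragraph of the previous one, which is what keeps context available across chunk boundaries.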

In [41]:
cisco_content_chunks = text_splitter.split_text(cisco_content)
ml_content_chunks = text_splitter.split_text(ml_content)

print(f"# Chunks in Cisco Report: {len(cisco_content_chunks)}")
print(f"# Chunks in ML Image: {len(ml_content_chunks)}")

# Chunks in Cisco Report: 10
# Chunks in ML Image: 2


In [42]:
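from langchain.embeddings.openai import OpenAIEmbeddings
from getpass import getpass
import os

# Never hardcode API keys in a notebook: read the key from the environment or prompt for it
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")

embeddings = OpenAIEmbeddings()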

In [43]:
from langchain.vectorstores import FAISS

def get_doc_search(text_chunks):
    # Embed each chunk and index the resulting vectors in an in-memory FAISS store
    return FAISS.from_texts(text_chunks, embeddings)
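
Conceptually, the vector store embeds every chunk and later returns the chunks whose vectors are closest to the query vector. The toy sketch below shows that retrieval step with the standard library. It rests on several assumptions: word counts stand in for OpenAI embeddings, cosine similarity stands in for whatever distance metric the FAISS index uses, and the names `embed`, `cosine_similarity`, and `top_k` as well as the sample chunks are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(vec_a[token] * vec_b[token] for token in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(value * value for value in vec_a.values()))
    norm_b = math.sqrt(sum(value * value for value in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=4):
    """Return the k chunks most similar to the query, best match first."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical sample chunks, loosely modelled on the two documents above.
sample_chunks = [
    "cisco powers an inclusive future for all",
    "svm works well on high dimensional data",
    "suppliers uphold standards for labor and safety",
]
best_match = top_k("what is the purpose of cisco", sample_chunks, k=1)
print(best_match)
```

Real embeddings capture meaning rather than exact word overlap, so the actual index can match a query to a chunk that shares no words with it. The ranking logic is the same in both cases.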

### Chat with Documents

In [44]:
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

# map_rerank answers the question once per retrieved chunk and scores each answer
chain = load_qa_chain(OpenAI(), chain_type = "map_rerank",
                      return_intermediate_steps=True)

def chat_with_file(file_path, query):

    # Extract the text and split it into overlapping chunks
    file_content = extract_file_content(file_path)
    file_chunks = text_splitter.split_text(file_content)

    # Index the chunks and retrieve those most similar to the query
    document_search = get_doc_search(file_chunks)
    documents = document_search.similarity_search(query)

    results = chain({
                        "input_documents":documents,
                        "question": query
                    },
                    return_only_outputs=True)

    # Keep the answer and score produced for the first (most similar) chunk
    results = results['intermediate_steps'][0]

    return results
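
Note that `chat_with_file` returns the step for the first retrieved chunk, which is not necessarily the step with the highest score. A minimal sketch of picking the best-scored step instead is shown below. It assumes each intermediate step is a dict with `"answer"` and `"score"` keys, as the cells that follow access them, and that the score may arrive as a string. The helper name `best_scored_step` and the sample steps are hypothetical.

```python
def best_scored_step(intermediate_steps):
    """Return the step with the highest numeric score.

    Steps whose score is missing or not parseable rank last.
    """
    def score_of(step):
        try:
            return float(step["score"])
        except (KeyError, TypeError, ValueError):
            return float("-inf")

    return max(intermediate_steps, key=score_of)


# Hypothetical steps shaped like the per-chunk results used in this notebook.
sample_steps = [
    {"answer": "Answer from the first chunk", "score": "80"},
    {"answer": "Answer from the second chunk", "score": "100"},
    {"answer": "Answer with an unparseable score", "score": ""},
]
best_step = best_scored_step(sample_steps)
print(best_step["answer"])
```

When several steps share the top score, `max` keeps the first of them, so the most similar chunk still wins ties.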

In [45]:
query = "What is the purpose of Cisco?"

results = chat_with_file(pdf_cisco_annualreport_2022, query)

answer = results["answer"]
confidence_score = results["score"]

print(f"Answer: {answer}\n\nConfidence Score: {confidence_score}")



Answer:  Cisco's purpose is to ensure that their products are made responsibly, consistent with Cisco's values, and that their suppliers uphold their standards for labor, health and safety, environment, and ethics.

Confidence Score: 100


In [48]:
query = "What is the document about?"

results = chat_with_file(img_machinelearning, query)

answer = results["answer"]
confidence_score = results["score"]

print(f"Answer: {answer}\n\nConfidence Score: {confidence_score}")



Answer:  This document is about Deep Learning, Bayesian Learning, Monte Carlo, and Reinforcement Teaming and their respective advantages and disadvantages.

Confidence Score: 100
