<a href="https://colab.research.google.com/github/kizombaciao/Working-with-MULTIPLE-PDF-Files-in-LangChain-ChatGPT-for-your-Data/blob/main/Multiple_PDF_files.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Installation

https://www.youtube.com/watch?v=s5LhRdh5fu4

The `unstructured` Python package is a free, open-source toolkit for preparing unstructured data such as PDFs, HTML pages, and Word documents for downstream data science tasks, including machine learning and natural language processing. It can parse documents, extract text, and clean data.

The package is available on PyPI and can be installed with pip. Once installed, it can be used to build a document-processing pipeline customized to your needs.



In [None]:
!pip install langchain
!pip install unstructured
!pip install openai
!pip install chromadb
!pip install Cython
!pip install tiktoken

### Load Required Packages

In [None]:
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.indexes import VectorstoreIndexCreator

### OpenAI API Key

In [None]:
# Get your API key from OpenAI; you will need to create an account.
# Here is the link to get the key: https://platform.openai.com/account/billing/overview
import os
# Replace the placeholder with your own key. Never commit or share a real key.
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
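
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. The following is a minimal sketch of one alternative, assuming the key is either already set in the environment or typed in interactively. The helper name `load_api_key` and its parameters are hypothetical; they are not part of the original notebook or of any library.

```python
import os
from getpass import getpass


def load_api_key(var_name="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key from the environment, prompting only if it is missing.

    `load_api_key` is a hypothetical helper written for illustration.
    """
    key = os.environ.get(var_name)
    if not key:
        # getpass hides the typed value, so the key is not echoed into the cell output.
        key = prompt_fn(f"Enter {var_name}: ")
        os.environ[var_name] = key
    return key
```

Calling `load_api_key()` in a cell would then stand in for the hard-coded assignment; the `prompt_fn` parameter exists only so the prompt can be swapped out, for example in a test.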

### Connect Google Drive

In [None]:
# connect your Google Drive
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"

Mounted at /content/gdrive


In [None]:
pdf_folder_path = os.path.join(root_dir, 'data')
os.listdir(pdf_folder_path)

['state_of_the_union_part1.pdf', 'state_of_the_union_part2.pdf']

### Load Multiple PDF files

In [None]:
# Create one loader per file in the PDF folder.
loaders = [UnstructuredPDFLoader(os.path.join(pdf_folder_path, fn)) for fn in os.listdir(pdf_folder_path)]
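
The list comprehension above builds a loader for every entry that `os.listdir` returns, so a stray non-PDF file in the folder would also be handed to the PDF loader. A small sketch of one way to filter first, assuming only files directly inside the folder matter; `list_pdf_paths` is a hypothetical helper, and the demo uses a temporary folder so that it runs without Google Drive.

```python
import os
import tempfile


def list_pdf_paths(folder):
    """Return sorted full paths of the .pdf files directly inside `folder`."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(".pdf")
        and os.path.isfile(os.path.join(folder, name))
    )


# Demo on a throwaway folder with two PDFs and two files that should be skipped.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["b.pdf", "a.PDF", "notes.txt", ".DS_Store"]:
        open(os.path.join(demo_dir, name), "w").close()
    demo_names = [os.path.basename(p) for p in list_pdf_paths(demo_dir)]

print(demo_names)
```

Each returned path could then be passed to `UnstructuredPDFLoader`, one loader per path, in the same way as the cell above.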

In [None]:
loaders

[<langchain.document_loaders.pdf.UnstructuredPDFLoader at 0x7f590c1e2df0>,
 <langchain.document_loaders.pdf.UnstructuredPDFLoader at 0x7f59043b66a0>]

### Vector Store
Chroma serves as the vector store that indexes and searches the embeddings.


Three main steps take place after the documents are loaded:

- Splitting documents into chunks

- Creating embeddings for each chunk

- Storing documents and embeddings in a vectorstore
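
The three steps can be illustrated with a toy, standard-library-only sketch. It is not what `VectorstoreIndexCreator` does internally: word-count vectors stand in for learned embeddings, a plain Python list stands in for Chroma, and every name below (`split_into_chunks`, `embed`, `ToyVectorStore`) is hypothetical.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, chunk_size=5, overlap=1):
    """Step 1: cut text into windows of `chunk_size` words that overlap slightly."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


def embed(text):
    """Step 2 (toy stand-in): a bag-of-words count vector, not a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Step 3 (toy stand-in): keep (chunk, vector) pairs in a list."""

    def __init__(self):
        self.rows = []

    def add(self, chunk):
        self.rows.append((chunk, embed(chunk)))

    def search(self, query, k=1):
        # Rank stored chunks by similarity to the query vector.
        q = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


text = "Chroma stores embeddings. Loaders read PDF files. Splitters cut text into chunks."
chunks = split_into_chunks(text)
store = ToyVectorStore()
for chunk in chunks:
    store.add(chunk)
best = store.search("read PDF files", k=1)[0]
print(best)
```

The overlap between neighbouring chunks is a common way to avoid losing context at chunk boundaries; a real pipeline would use a proper text splitter and an embedding model rather than word counts.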


In [None]:
index = VectorstoreIndexCreator().from_loaders(loaders)
index



VectorStoreIndexWrapper(vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x7f5906573910>)

In [None]:
index.query('What was the main topic of the address?')

' The main topic of the address was investing in America and rebuilding infrastructure.'

In [None]:
pdf_folder_path = '/content/gdrive/My Drive/data_2/'
os.listdir(pdf_folder_path)

['2008.10010.pdf', '2023_GPT4All_Technical_Report.pdf']

In [None]:
# Create one loader per file in the PDF folder, then index them.
loaders = [UnstructuredPDFLoader(os.path.join(pdf_folder_path, fn)) for fn in os.listdir(pdf_folder_path)]
index = VectorstoreIndexCreator().from_loaders(loaders)
index



VectorStoreIndexWrapper(vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x7f59043d1040>)

In [None]:
index.query('How was the GPT4all model trained?')

' The GPT4all model was trained with LoRA (Hu et al., 2021) on the 437,605 post-processed examples for four epochs.'

In [None]:
index.query_with_sources('How was the GPT4all model trained?')

{'question': 'How was the GPT4all model trained?',
 'answer': ' The GPT4all model was trained with LoRA on 437,605 post-processed examples for four epochs. The data was collected using the GPT-3.5-Turbo OpenAI API and was curated to ensure a diverse distribution of prompt topics and model responses.\n\n',
 'sources': '/content/gdrive/My Drive/data_2/2023_GPT4All_Technical_Report.pdf'}
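
The `sources` field is possible because each indexed chunk carries the path of the file it came from. A toy sketch of that bookkeeping, assuming simple word overlap in place of embedding search and stopping before any LLM call, so it returns the retrieved context rather than a written answer. The function `retrieve_with_sources` and the file names in the demo are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_with_sources(question, chunks, k=1):
    """Rank {"text", "source"} chunks by word overlap with the question.

    The returned dict is loosely shaped like the output above, minus the
    LLM-written answer.
    """
    q = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c["text"])), reverse=True)
    top = ranked[:k]
    return {
        "question": question,
        "context": [c["text"] for c in top],
        "sources": ", ".join(sorted({c["source"] for c in top})),
    }


# Hypothetical chunks; the file names are made up for this demo.
demo_chunks = [
    {"text": "The model was trained with LoRA for four epochs.", "source": "report_a.pdf"},
    {"text": "This chunk talks about lip sync in videos.", "source": "paper_b.pdf"},
]
result = retrieve_with_sources("How was the model trained?", demo_chunks)
print(result["sources"])
```

Because the source travels with the chunk, whichever chunks are retrieved determine which files are reported, regardless of how the answer text is later phrased.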

In [None]:
index.query_with_sources('Who wrote the lip sync paper? ')

{'question': 'Who wrote the lip sync paper? ',
 'answer': ' The lip sync paper was written by K R Prajwal, Vinay P. Namboodiri, Rudrabha Mukhopadhyay, and C V Jawahar.\n',
 'sources': '/content/gdrive/My Drive/data_2/2008.10010.pdf'}

In [None]:
index.query_with_sources('How was the GPT4all model trained?')

{'question': 'How was the GPT4all model trained?',
 'answer': ' The GPT4all model was trained with LoRA on 437,605 post-processed examples for four epochs. Detailed model hyper-parameters and training code can be found in the associated repository and model training log.\n',
 'sources': '/content/gdrive/My Drive/data_2/2023_GPT4All_Technical_Report.pdf'}

## Disclaimer:
Note: OpenAI provides a free API key for initial testing. Once you move to a paid subscription, calling the API in the way demonstrated in this example will incur monetary charges. Refer to OpenAI's pricing information for details.

Be aware that information sent to OpenAI's LLM, such as the contents of your files, can become public if used in the way this demo demonstrates. Refer to OpenAI's usage policy for details.

Do not use for actual tax filing purposes. This demo is for educational purposes only and for demonstrating machine learning methods. The author makes no claims that the outcomes shown here or any outcomes that could be produced by this method are accurate or reliable.

PyPI, or the Python Package Index, is a public repository for Python packages. It is the official third-party software repository for Python, and it is analogous to the CPAN repository for Perl and to the CRAN repository for R. PyPI is run by the Python Software Foundation, a charity.

PyPI primarily hosts Python packages in the form of archives called sdists (source distributions) or precompiled "wheels." PyPI as an index allows users to search for packages by keywords or by filters against their metadata, such as free software license or compatibility with POSIX. A single entry on PyPI is able to store, aside from just a package and its metadata, previous releases of the package, precompiled wheels (e.g. containing DLLs on Windows), as well as different forms for different operating systems and Python versions.