<a href="https://colab.research.google.com/github/matthiaskozubal/Chat_with_data/blob/main/code/demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chat with data
Dynamic, LLM-powered querying over PDFs, using either predefined PDFs or your own (from your Google Drive).

## How to use
1. **Run code**:
  - Click `Runtime` -> `Run all` or hit `Ctrl+F9`
2. **Provide OpenAI API key**:
  - When prompted by the first cell of the 'Main' section
  - Create a key at https://platform.openai.com/account/api-keys
3. **Choose PDFs source**:
  - When prompted by the next cell:
    - type **author's** or hit **Enter** (to use the predefined PDFs)
    - type **user's** (to use your own PDFs) and make sure that:
      - a folder named `PDFs` exists in your Google Drive with your PDFs inside

# Setup

## Libraries

In [None]:
!pip -q install chromadb
!pip -q install langchain
!pip -q install openai
!pip -q install pypdf
!pip -q install tiktoken

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m479.8/479.8 kB[0m [31m7.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.4/2.4 MB[0m [31m55.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m92.9/92.9 kB[0m [31m8.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m59.5/59.5 kB[0m [31m5.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.4/5.4 MB[0m [31m94.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.2/6.2 MB[0m [31m94.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m57.5/57.5 kB[0m [31m4.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m103.9/103.9 kB[0m [31m8.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━

In [None]:
import openai
import os
import requests

from google.colab import drive
from IPython.display import display, Markdown
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.text_splitter import CharacterTextSplitter

## Constants

In [None]:
# Verbose
VERBOSE = False

In [None]:
# Author's data (hosted on GitHub)
GITHUB_USERNAME = 'matthiaskozubal'
REPO = 'Chat_with_data'
BRANCH = 'main'
DATA_PATH = 'data'

In [None]:
# User's data (hosted on Google Drive)
# Absolute path: Drive is mounted at /content/drive and "My Drive" appears there as MyDrive
PREDEFINED_GOOGLE_DRIVE_FOLDER_PATH = '/content/drive/MyDrive/PDFs'

## Functions

In [None]:
def get_openai_api_key():
  # Ask for the key and expose it to both LangChain (env variable) and the openai client
  openai_api_key = input("Provide OpenAI API key: ").strip()
  os.environ['OPENAI_API_KEY'] = openai_api_key
  openai.api_key = openai_api_key

In [None]:
def get_user_choice(verbose=False):
  PDFs_source_choice = input("Choose PDF source (author's/user's) [default: author's]: ").strip().lower()
  # Fall back to the author's PDFs on empty or unrecognized input
  if PDFs_source_choice not in ["author's", "user's"]:
    PDFs_source_choice = "author's"

  if verbose:
    print(PDFs_source_choice)

  return PDFs_source_choice

In [None]:
def connect_to_google_drive():
  drive.mount('/content/drive')

In [None]:
def get_PDFs_urls_from_github(GITHUB_USERNAME, REPO, BRANCH, DATA_PATH, verbose=False):
  # construct url to github data folder
  data_folder_url = f"https://github.com/{GITHUB_USERNAME}/{REPO}/tree/{BRANCH}/{DATA_PATH}"

  # construct partial url to retrieve each file
  file_partial_url = f"https://raw.githubusercontent.com/{GITHUB_USERNAME}/{REPO}/{BRANCH}"

  # start with an empty list so the function still returns a value if the request fails
  file_urls = []

  # get files metadata (relies on GitHub serving the folder listing as JSON at this URL)
  response = requests.get(data_folder_url)
  if response.status_code == 200:
    response_json = response.json()
    files = response_json['payload']['tree']['items']
    file_paths = [file['path'] for file in files if file['contentType'] == 'file']
    # percent-encode spaces so the raw file urls are valid
    file_urls = ['/'.join([file_partial_url, file_path.replace(' ', '%20')]) for file_path in file_paths]
    if verbose:
      print(40*'#', ' ', 'File urls:' , 40*'#' , ' ')
      for file_url in file_urls:
        print(file_url)
  else:
    print(f"Failed to fetch the folder listing, status code: {response.status_code}")

  return file_urls

# _ = get_PDFs_urls_from_github(GITHUB_USERNAME, REPO, BRANCH, DATA_PATH, verbose=True)
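
The folder listing above depends on the JSON that github.com happens to return for a tree URL. As an alternative, GitHub's REST API has a documented contents endpoint (`GET /repos/{owner}/{repo}/contents/{path}`) that returns one JSON object per entry, including `type`, `name` and `download_url`. The following is a minimal sketch under that assumption; the helper names `build_contents_url` and `pdf_urls_from_contents` are hypothetical and not part of this notebook, and the sample listing is hand-written rather than real data.

```python
def build_contents_url(username, repo, branch, data_path):
    """Build the REST API URL that lists the entries of a repository folder."""
    return f"https://api.github.com/repos/{username}/{repo}/contents/{data_path}?ref={branch}"


def pdf_urls_from_contents(items):
    """Return the download URLs of the PDF files in a contents-API listing."""
    return [
        item["download_url"]
        for item in items
        if item.get("type") == "file" and item.get("name", "").lower().endswith(".pdf")
    ]


# Hand-written sample in the shape of a contents-API response (illustrative only)
sample_items = [
    {"type": "file", "name": "a.pdf", "download_url": "https://example.com/a.pdf"},
    {"type": "dir", "name": "sub", "download_url": None},
    {"type": "file", "name": "notes.txt", "download_url": "https://example.com/notes.txt"},
]
print(pdf_urls_from_contents(sample_items))
```

In the notebook, `items` would come from `requests.get(build_contents_url(...)).json()`. Unauthenticated API requests are rate-limited, so a failed request should still be handled as in the function above.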

In [None]:
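def get_PDFs_urls_from_user_google_drive(PREDEFINED_GOOGLE_DRIVE_FOLDER_PATH, verbose=False):
  # mount the user's Google Drive
  connect_to_google_drive()

  # collect full paths to the PDFs in the predefined folder (empty list if the folder is missing)
  file_urls = []
  if os.path.exists(PREDEFINED_GOOGLE_DRIVE_FOLDER_PATH):
    pdf_files = sorted(f for f in os.listdir(PREDEFINED_GOOGLE_DRIVE_FOLDER_PATH) if f.lower().endswith('.pdf'))
    file_urls = [os.path.join(PREDEFINED_GOOGLE_DRIVE_FOLDER_PATH, f) for f in pdf_files]
  else:
    print(f"Folder not found: {PREDEFINED_GOOGLE_DRIVE_FOLDER_PATH}")

  if verbose:
    print(f"Folder: {PREDEFINED_GOOGLE_DRIVE_FOLDER_PATH}")
    for file_url in file_urls:
      print(file_url)

  return file_urls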

In [None]:
def get_PDFs_urls(PDFs_source_choice, PREDEFINED_GOOGLE_DRIVE_FOLDER_PATH, GITHUB_USERNAME, REPO, BRANCH, DATA_PATH, verbose=False):
  if PDFs_source_choice == 'author\'s':
    file_urls = get_PDFs_urls_from_github(GITHUB_USERNAME, REPO, BRANCH, DATA_PATH, verbose=verbose)
  elif PDFs_source_choice == 'user\'s':
    file_urls = get_PDFs_urls_from_user_google_drive(PREDEFINED_GOOGLE_DRIVE_FOLDER_PATH, verbose=verbose)

  return file_urls

In [None]:
def load_split_embed_store_PDFs(file_urls):
  # https://learn.deeplearning.ai/langchain-chat-with-your-data/lesson/3/document-splitting

  # load (at most the first 3 PDFs; each page becomes one document)
  docs = []
  for file_url in file_urls[:3]:
    loader = PyPDFLoader(file_url)
    docs.extend(loader.load())

  # split all loaded pages (not just the last PDF) into overlapping chunks
  text_splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=200, length_function=len)
  splits = text_splitter.split_documents(docs)

  # embed
  embedding = OpenAIEmbeddings()

  # store the chunks, so retrieval works on the split text rather than on whole pages
  from langchain.vectorstores import Chroma
  persist_directory = 'docs/chroma/'
  vectordb = Chroma.from_documents(
      documents=splits,
      persist_directory=persist_directory,
      embedding=embedding
  )

  return vectordb
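
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-width sliding-window splitter. It is only an illustration of overlapping chunks: `CharacterTextSplitter` first splits on the separator and then merges the pieces, so its chunk boundaries will generally differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is the role the 200-character overlap plays for the 1000-character chunks above: text that straddles a boundary still appears whole in at least one chunk.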

In [None]:
def query_over_PDFs(vectordb, verbose=False):
  # query (user's or a default one)
  query = input("Ask a question to the PDFs: \n").strip()
  if query == '':
    query = 'What is the bias and variance in terms of machine learning?'

  # retrieval: fetch similar chunks, then let an LLM keep only the parts relevant to the query
  llm = OpenAI(temperature=0)
  compressor = LLMChainExtractor.from_llm(llm)
  compression_retriever = ContextualCompressionRetriever(
      base_compressor=compressor,
      base_retriever=vectordb.as_retriever()
  )
  compressed_docs = compression_retriever.get_relevant_documents(query)
  # show the retrieved (compressed) chunks; an empty list means no PDF context was found
  print(compressed_docs)

  # query an LLM, passing only the text of the retrieved chunks as context
  context = '\n\n'.join(doc.page_content for doc in compressed_docs)
  llm_chat = ChatOpenAI(temperature=0.0)
  response = llm_chat.call_as_llm(f"{context}\n\nQuestion: {query}")

  # response
  display(Markdown(response))
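
The recorded run at the end of this notebook prints `[]` for the retrieved chunks, so that answer was produced without any PDF context. A minimal sketch of building the prompt so that an empty retrieval is made explicit instead of being passed silently to the model. `FakeDoc` is a stand-in for LangChain's `Document` (only the `page_content` attribute is assumed), `build_prompt` is a hypothetical helper, and the instruction wording is an assumption rather than the prompt used above.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Stand-in for a retrieved document; only page_content is used."""
    page_content: str


def build_prompt(docs, query):
    """Return a context-grounded prompt, or None when nothing was retrieved."""
    if not docs:
        return None  # caller decides what to do when there is no PDF context
    context = "\n\n".join(doc.page_content.strip() for doc in docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query.strip()}"
    )


prompt = build_prompt([FakeDoc("first chunk"), FakeDoc("second chunk")], " example question ")
print(prompt)
print(build_prompt([], "what is mars"))  # None
```

With this shape, the calling code can tell the user that the PDFs contain nothing relevant when `build_prompt` returns `None`, rather than displaying an answer drawn from the model's general knowledge.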

# Main

In [None]:
get_openai_api_key()

In [None]:
PDFs_source_choice = get_user_choice(verbose=VERBOSE)

Choose PDF source (author's/user's) [default: author's]: 


In [None]:
file_urls = get_PDFs_urls(PDFs_source_choice, PREDEFINED_GOOGLE_DRIVE_FOLDER_PATH, GITHUB_USERNAME, REPO, BRANCH, DATA_PATH, verbose=VERBOSE)

In [None]:
vectordb = load_split_embed_store_PDFs(file_urls)

In [None]:
query_over_PDFs(vectordb)

Ask a question to the PDFs: 
what is mars




[]


Mars is the fourth planet from the Sun in our solar system. It is often referred to as the "Red Planet" due to its reddish appearance, caused by iron oxide (rust) on its surface. Mars is a terrestrial planet with a thin atmosphere, and it has surface features similar to both the Moon and Earth. It has polar ice caps, volcanoes, canyons, and a variety of other geological formations. Mars is also known for having the largest volcano and the deepest canyon in the solar system. It has been a subject of scientific interest and exploration, with several missions sent to study its surface and search for signs of past or present life.