<a href="https://colab.research.google.com/github/matthiaskozubal/Chat_with_data/blob/main/code/demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chat with data
Dynamic, LLM-powered querying over PDFs, using either the predefined PDFs or your own (from your Google Drive).

## How to use
1. **Run code**:
  - Click `Runtime` -> `Run all` or hit `Ctrl+F9`
2. **Choose PDFs source**:
  - When prompted in the 'User Input' section:
    - type **author's** or hit **Enter** to use the predefined PDFs
    - type **user's** to use your own PDFs; make sure that:
      - a folder named `PDFs` exists in your Google Drive and contains your PDFs

# Setup

## Libraries

In [80]:
!pip -q install chromadb
!pip -q install langchain
!pip -q install openai
!pip -q install pypdf
!pip -q install tiktoken

In [81]:
import openai
import os
import requests

from google.colab import drive
from IPython.display import display, Markdown
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.text_splitter import CharacterTextSplitter

## Constants

In [82]:
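# Read the key from the environment, or prompt for it; never hard-code a secret in the notebook
from getpass import getpass
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY') or getpass('OpenAI API key: ')
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
openai.api_key = os.environ['OPENAI_API_KEY']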

In [83]:
# Verbose
VERBOSE = False

In [84]:
# Author's data (hosted on GitHub)
GITHUB_USERNAME = 'matthiaskozubal'
REPO = 'Chat_with_data'
BRANCH = 'main'
DATA_PATH = 'data'

In [85]:
# User's data (hosted on Google Drive)
# Absolute path: the drive is mounted at '/content/drive'
PREDEFINED_GOOGLE_DRIVE_FOLDER_PATH = '/content/drive/My Drive/PDFs'

## Functions

In [86]:
def get_user_choice(verbose=False):
  PDFs_source_choice = input("Choose PDF source (author's/user's) [default: author's]: ")
  if PDFs_source_choice not in ["author's", "user's"]:
    PDFs_source_choice = "author's"

  if verbose:
    print(PDFs_source_choice)

  return PDFs_source_choice
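
The choice handling described in "How to use" can be made more forgiving of stray whitespace, capitalization, and typographic apostrophes. The following is a minimal sketch of that idea; `normalize_choice` and `VALID_CHOICES` are hypothetical names that do not appear elsewhere in this notebook.

```python
# Hypothetical helper, not part of the notebook's own functions.
VALID_CHOICES = ("author's", "user's")

def normalize_choice(raw: str, default: str = "author's") -> str:
    """Map raw user input to one of VALID_CHOICES, falling back to the default."""
    # Trim whitespace, lowercase, and turn a typographic apostrophe into a plain one
    cleaned = raw.strip().lower().replace("\u2019", "'")
    return cleaned if cleaned in VALID_CHOICES else default

print(normalize_choice("  User's "))
print(normalize_choice(""))
```

Anything that does not match a valid choice after cleaning, including an empty string from pressing Enter, falls back to the default.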

In [87]:
def connect_to_google_drive():
  drive.mount('/content/drive')

In [None]:
def get_PDFs_urls_from_github(GITHUB_USERNAME, REPO, BRANCH, DATA_PATH, verbose=False):
  # construct url to github data folder
  data_folder_url = f"https://github.com/{GITHUB_USERNAME}/{REPO}/tree/{BRANCH}/{DATA_PATH}"

  # construct partial url to retrieve each file
  file_partial_url = f"https://raw.githubusercontent.com/{GITHUB_USERNAME}/{REPO}/{BRANCH}"

  # get files metadata
  file_urls = []  # stays empty if the request fails
  response = requests.get(data_folder_url)
  if response.status_code == 200:
    response_json = response.json()
    files = response_json['payload']['tree']['items']
    file_paths = [item['path'] for item in files if item['contentType'] == 'file']
    file_urls = ['/'.join([file_partial_url, file_path.replace(' ', '%20')]) for file_path in file_paths]
    if verbose:
      print(40*'#', ' ', 'File urls:' , 40*'#' , ' ')
      for file_url in file_urls:
        print(file_url)
  else:
    print(f"Failed to fetch the file list, status code: {response.status_code}")

  return file_urls

# _ = get_PDFs_urls_from_github(GITHUB_USERNAME, REPO, BRANCH, DATA_PATH, verbose=True)
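
The function above escapes only spaces when it builds the raw file URLs. As a hedged alternative, the standard library's `urllib.parse.quote` percent-encodes spaces, commas, and non-ASCII characters in one step while leaving `/` intact. In this sketch `build_raw_url` is a hypothetical helper and the file name is made up for illustration.

```python
from urllib.parse import quote

def build_raw_url(username: str, repo: str, branch: str, file_path: str) -> str:
    """Build a raw.githubusercontent.com URL with a percent-encoded file path."""
    base = f"https://raw.githubusercontent.com/{username}/{repo}/{branch}"
    # quote() keeps '/' unescaped by default and encodes everything else that is unsafe
    return f"{base}/{quote(file_path)}"

# Hypothetical file name, chosen to include a space, a comma, and an accented letter
example_url = build_raw_url("matthiaskozubal", "Chat_with_data", "main", "data/My file, André.pdf")
print(example_url)
```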

In [96]:
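def get_PDFs_urls_from_user_google_drive(PREDEFINED_FOLDER_PATH, verbose=False):
  # mount the user's Google Drive
  drive.mount('/content/drive')

  # collect full paths of the PDFs in the folder (empty list if the folder is missing)
  file_urls = []
  if os.path.exists(PREDEFINED_FOLDER_PATH):
    pdf_files = [f for f in os.listdir(PREDEFINED_FOLDER_PATH) if f.lower().endswith('.pdf')]
    file_urls = [os.path.join(PREDEFINED_FOLDER_PATH, f) for f in pdf_files]
  else:
    print(f"Folder not found: {PREDEFINED_FOLDER_PATH}")

  if verbose:
    for file_url in file_urls:
      print(file_url)

  return file_urls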

In [100]:
def get_PDFs_urls(PDF_source_choice, PREDEFINED_FOLDER_PATH, GITHUB_USERNAME, REPO, BRANCH, DATA_PATH, verbose=False):
  if PDF_source_choice == 'author\'s':
    file_urls = get_PDFs_urls_from_github(GITHUB_USERNAME, REPO, BRANCH, DATA_PATH, verbose=verbose)
  elif PDF_source_choice == 'user\'s':
    file_urls = get_PDFs_urls_from_user_google_drive(PREDEFINED_FOLDER_PATH, verbose=verbose)

  return file_urls

In [90]:
def load_split_embed_store_PDFs(file_urls):
  # https://learn.deeplearning.ai/langchain-chat-with-your-data/lesson/3/document-splitting

  # load (at most the first three PDFs)
  docs = []
  for file_url in file_urls[:3]:
    loader = PyPDFLoader(file_url)
    docs.extend(loader.load())

  # split all loaded pages into overlapping chunks
  text_splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=200, length_function=len)
  splits = text_splitter.split_documents(docs)

  # embed
  embedding = OpenAIEmbeddings()

  # store the chunks (not the unsplit pages) in the vector database
  from langchain.vectorstores import Chroma
  persist_directory = 'docs/chroma/'
  vectordb = Chroma.from_documents(
      documents=splits,
      persist_directory=persist_directory,
      embedding=embedding
  )

  return vectordb
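
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified sliding-window splitter in plain Python. It is only an illustration of overlapping chunks, not LangChain's algorithm (`CharacterTextSplitter` splits on the separator first and then merges pieces); `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut text into fixed-size chunks where neighbours share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is the same role the 200-character overlap plays for the 1000-character chunks above: text near a chunk boundary appears in both neighbouring chunks.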

In [91]:
def query_over_PDFs(vectordb, verbose=False):
  # query (user's or a default one)
  query = input("Ask a question to the PDFs")
  if query.strip() == '':
    query = 'What are bias and variance in terms of machine learning?'

  # retrieval
  llm = OpenAI(temperature=0)
  compressor = LLMChainExtractor.from_llm(llm)
  compression_retriever = ContextualCompressionRetriever(
      base_compressor=compressor,
      base_retriever=vectordb.as_retriever()
  )
  compressed_docs = compression_retriever.get_relevant_documents(query)

  # query an LLM
  llm_chat = ChatOpenAI(temperature = 0.0)
  response = llm_chat.call_as_llm(f"{compressed_docs} Question: {query}")

  # response
  display(Markdown(response))
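
The function above interpolates the retrieved documents into the prompt as a Python list. One possible refinement, sketched here under assumptions, is to join only the text of each document into a labelled context block. `Doc` is a hypothetical stand-in for a retrieved document (only a `page_content` attribute is assumed), `build_prompt` is a hypothetical helper, and the sample texts are made up.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document; only page_content is used."""
    page_content: str

def build_prompt(docs, query: str) -> str:
    """Join the text of the retrieved documents and append the question."""
    context = "\n\n".join(doc.page_content.strip() for doc in docs)
    return f"Context:\n{context}\n\nQuestion: {query.strip()}"

# Made-up snippets, used only to show the resulting prompt layout
prompt = build_prompt(
    [Doc("Bias is the error from an overly simple model."),
     Doc("Variance is the sensitivity to the training set.")],
    "What is bias?",
)
print(prompt)
```

Passing plain text rather than the list representation keeps metadata and object formatting out of the prompt, so the model sees only the extracted passages and the question.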

# Main

In [92]:
PDFs_source_choice = get_user_choice(verbose=VERBOSE)

Choose PDF source (author's/user's) [default: author's]: 
author's


In [93]:
file_urls = get_PDFs_urls(PDFs_source_choice, PREDEFINED_GOOGLE_DRIVE_FOLDER_PATH, GITHUB_USERNAME, REPO, BRANCH, DATA_PATH, verbose=VERBOSE)

########################################   File urls: ########################################  
https://raw.githubusercontent.com/matthiaskozubal/Chat_with_data/main/data/A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone.pdf
https://raw.githubusercontent.com/matthiaskozubal/Chat_with_data/main/data/Algorithmic%20Aspects%20of%20Machine%20Learning%20-%20Ankur%20Moitra%20(2014).pdf
https://raw.githubusercontent.com/matthiaskozubal/Chat_with_data/main/data/An%20introduction%20to%20variable%20and%20feature%20selection%20-%20Isabelle%20Guyon,%20André%20Elisseeff.pdf
https://raw.githubusercontent.com/matthiaskozubal/Chat_with_data/main/data/Automated%20Machine%20Learning,%20Methods,%20Systems,%20Challenges%20-%20Frank%20Hutter,%20Lars%20Kotthoff,%20Joaquin%20Vanschoren.pdf
https://raw.githubusercontent.com/matthiaskozubal/Chat_with_data/main/data/CAUSALITY%20FOR%20MACHINE%20LEARNING%20-%20Bernhard%20Schölkopf.pdf
https://raw.githubusercontent.com/ma

In [94]:
vectordb = load_split_embed_store_PDFs(file_urls)

In [95]:
query_over_PDFs(vectordb)

Ask a question to the PDFs




Bias and variance are two important concepts in machine learning that relate to the generalization error of a model.

Bias refers to the error introduced by approximating a real-world problem with a simplified model. It represents the difference between the expected predictions of the model and the true values. A high bias indicates that the model is too simple and unable to capture the underlying patterns in the data, leading to underfitting.

Variance, on the other hand, refers to the variability of model predictions for different training sets. It measures the sensitivity of the model to the specific training data it was trained on. A high variance indicates that the model is too complex and overfits the training data, resulting in poor performance on unseen data.

In summary, bias and variance are two sources of error in machine learning models. Bias represents the error due to oversimplification, while variance represents the error due to overcomplication. The goal is to find a balance between bias and variance to achieve optimal model performance and generalization.