<a href="https://colab.research.google.com/github/matthiaskozubal/Chat_with_data/blob/main/code/demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chat with data
Dynamic, LLM-powered querying over PDFs, using either predefined PDFs or your own (from your Google Drive).

## How to use
1. **Run code**:
  - Click `Runtime` -> `Run all` or hit `Ctrl+F9`
2. **Choose PDFs source**:
  - When prompted in the 'User Input' section:
    - type **author's** or hit **Enter** to use the predefined PDFs
    - type **user's** to use your own PDFs, and make sure that:
      - a folder named `PDFs` exists in your Google Drive with your PDFs inside

# Setup

## Libraries

In [86]:
!pip install langchain
!pip install pypdf

Collecting langchain
  Downloading langchain-0.0.310-py3-none-any.whl (1.8 MB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.1-py3-none-any.whl (27 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting langsmith<0.1.0,>=0.0.40 (from langchain)
  Downloading langsmith-0.0.43-py3-none-any.whl (40 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain)
  Downloading marshmallow-3.20.1-py3-none-any.whl (49 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langch

In [87]:
import os
import requests
from google.colab import drive
from langchain.document_loaders import PyPDFLoader

## Constants

In [62]:
# Verbose
VERBOSE = True

In [143]:
# Author's data (hosted on GitHub)
GITHUB_USERNAME = 'matthiaskozubal'
REPO = 'Chat_with_data'
BRANCH = 'main'
DATA_PATH = 'data'

In [144]:
# User's data (hosted on Google Drive)
# Absolute path under the '/content/drive' mount point used by connect_to_google_drive()
PREDEFINED_GOOGLE_DRIVE_FOLDER_PATH = '/content/drive/MyDrive/PDFs'

## Functions

In [126]:
def get_user_choice(verbose=False):
  # strip stray whitespace; anything other than the two valid choices falls back to the default
  PDFs_source_choice = input("Choose PDF source (author's/user's) [default: author's]: ").strip()
  if PDFs_source_choice not in ["author's", "user's"]:
    PDFs_source_choice = "author's"

  if verbose:
    print(PDFs_source_choice)

  return PDFs_source_choice

# PDFs_source_choice = get_user_choice(verbose=True)

In [127]:
def connect_to_google_drive():
  drive.mount('/content/drive')

In [158]:
def get_PDFs_urls(GITHUB_USERNAME, REPO, BRANCH, DATA_PATH, verbose=False):
  # construct url to github data folder
  data_folder_url = f"https://github.com/{GITHUB_USERNAME}/{REPO}/tree/{BRANCH}/{DATA_PATH}"

  # construct partial url to retrieve each file
  file_partial_url = f"https://raw.githubusercontent.com/{GITHUB_USERNAME}/{REPO}/{BRANCH}"

  # get files metadata; file_urls stays empty if the request fails
  file_urls = []
  response = requests.get(data_folder_url)
  if response.status_code == 200:
    response_json = response.json()
    files = response_json['payload']['tree']['items']
    file_paths = [file['path'] for file in files if file['contentType'] == 'file']
    # percent-encode spaces so the raw file URLs are valid
    file_urls = ['/'.join([file_partial_url, file_path.replace(' ', '%20')]) for file_path in file_paths]
    if verbose:
      print(40*'#', ' ', 'File urls:' , 40*'#' , ' ')
      for file_url in file_urls:
        print(file_url)
  else:
    print(f"Failed to fetch the folder listing, status code: {response.status_code}")

  return file_urls

# _ = get_PDFs_urls(GITHUB_USERNAME, REPO, BRANCH, DATA_PATH, verbose=True)
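The function above reads the JSON that the GitHub folder page happened to return, which is not a documented interface. As a hedged alternative, the sketch below lists the same folder through GitHub's documented REST "contents" endpoint and builds the raw file URLs from it. The function names (`contents_api_url`, `raw_urls_from_listing`, `list_pdf_urls`) are hypothetical and not part of the notebook, and unauthenticated API calls are rate-limited.

```python
import json
import urllib.request
from urllib.parse import quote


def contents_api_url(owner, repo, branch, path):
    """Build the GitHub REST 'contents' endpoint URL for a folder."""
    return (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/contents/{quote(path)}?ref={quote(branch, safe='')}"
    )


def raw_urls_from_listing(items, owner, repo, branch):
    """Turn a contents listing into raw.githubusercontent.com URLs for PDFs.

    Each item is expected to carry 'path' and 'type' keys, as the contents
    endpoint returns for a directory. Paths are fully percent-encoded.
    """
    base = f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}"
    return [
        f"{base}/{quote(item['path'])}"
        for item in items
        if item["type"] == "file" and item["path"].lower().endswith(".pdf")
    ]


def list_pdf_urls(owner, repo, branch, path, timeout=10):
    """Fetch the folder listing (network call) and return raw PDF URLs."""
    request = urllib.request.Request(
        contents_api_url(owner, repo, branch, path),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        items = json.load(response)
    return raw_urls_from_listing(items, owner, repo, branch)
```

With the notebook's constants this would be called as `list_pdf_urls(GITHUB_USERNAME, REPO, BRANCH, DATA_PATH)`. Unlike the `replace(' ', '%20')` approach, `quote` also encodes non-ASCII characters such as the `é` in some of the file names.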

In [132]:
def get_PDFs(PDFs_source_choice, PREDEFINED_FOLDER_PATH, verbose=False):
  # returns a list of PDF locations: URLs for the author's data, file paths for the user's data
  pdf_sources = []
  if PDFs_source_choice == "author's":
    print("Using author's data")
    pdf_sources = get_PDFs_urls(GITHUB_USERNAME, REPO, BRANCH, DATA_PATH, verbose=verbose)
  elif PDFs_source_choice == "user's":
    print("Using user's data")
    # mount the user's Google Drive before listing the folder
    connect_to_google_drive()
    if os.path.exists(PREDEFINED_FOLDER_PATH):
      pdf_sources = [os.path.join(PREDEFINED_FOLDER_PATH, f) for f in os.listdir(PREDEFINED_FOLDER_PATH) if f.endswith('.pdf')]

    if verbose:
      print(f"PDF folder: {PREDEFINED_FOLDER_PATH}")

  return pdf_sources

# _ = get_PDFs(PDFs_source_choice, PREDEFINED_GOOGLE_DRIVE_FOLDER_PATH, verbose=True)

In [130]:
def get_user_question(verbose=False):
  user_question = input("Ask a question to the PDFs")
  if user_question == '':
    user_question = '''
      What is the bias and variance in terms of machine learning?
      '''
  if verbose:
    print(user_question)

  return user_question

# user_question = get_user_question(verbose=True)

In [134]:
def load_split_embed_store_PDFs(file_urls):
  # https://learn.deeplearning.ai/langchain-chat-with-your-data/lesson/3/document-splitting

  # load: PyPDFLoader returns one Document per PDF page
  pages = []
  for file_url in file_urls:
    pages.extend(PyPDFLoader(file_url).load())

  # split
  from langchain.text_splitter import CharacterTextSplitter
  text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
  )
  docs = text_splitter.split_documents(pages)

  # embed (not implemented yet)

  # store (not implemented yet)

  return docs


In [131]:
def query_over_PDFs(verbose=False):
    pass
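`query_over_PDFs` is still a stub, and the embed and store steps are not implemented in this notebook. To illustrate the retrieval step such a function would need, here is a minimal, self-contained sketch that ranks text chunks against a question. It deliberately swaps real embeddings for plain word-count vectors with cosine similarity, so it shows the shape of the step rather than the eventual implementation. All names (`term_counts`, `cosine_similarity`, `top_k_chunks`) are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lower-cased word counts; a crude stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(question, chunks, k=3):
    """Return the k chunks most similar to the question, best first."""
    question_vec = term_counts(question)
    scored = [
        (cosine_similarity(question_vec, term_counts(chunk)), chunk)
        for chunk in chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]
```

In this notebook the chunks would be the `page_content` strings of the split documents, for example `top_k_chunks(user_question, [d.page_content for d in docs])`. The selected chunks would then be passed to an LLM together with the question.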

# Main

In [152]:
PDFs_source_choice = get_user_choice(verbose=VERBOSE)

Choose PDF source (author's/user's) [default: author's]: 
author's


In [159]:
file_urls = get_PDFs_urls(GITHUB_USERNAME, REPO, BRANCH, DATA_PATH, verbose=VERBOSE)

########################################   File urls: ########################################  
https://raw.githubusercontent.com/matthiaskozubal/Chat_with_data/main/data/A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone.pdf
https://raw.githubusercontent.com/matthiaskozubal/Chat_with_data/main/data/Algorithmic%20Aspects%20of%20Machine%20Learning%20-%20Ankur%20Moitra%20(2014).pdf
https://raw.githubusercontent.com/matthiaskozubal/Chat_with_data/main/data/An%20introduction%20to%20variable%20and%20feature%20selection%20-%20Isabelle%20Guyon,%20André%20Elisseeff.pdf
https://raw.githubusercontent.com/matthiaskozubal/Chat_with_data/main/data/Automated%20Machine%20Learning,%20Methods,%20Systems,%20Challenges%20-%20Frank%20Hutter,%20Lars%20Kotthoff,%20Joaquin%20Vanschoren.pdf
https://raw.githubusercontent.com/matthiaskozubal/Chat_with_data/main/data/CAUSALITY%20FOR%20MACHINE%20LEARNING%20-%20Bernhard%20Schölkopf.pdf
https://raw.githubusercontent.com/ma

In [165]:
loader = PyPDFLoader(file_urls[2])
pages = loader.load()  # alternative: loader.load_and_split()
# pages[19]

In [166]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=150, length_function=len)
docs = text_splitter.split_documents(pages)

In [167]:
len(docs)

91

In [168]:
len(pages)

26
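The counts above (26 pages, 91 chunks) come out at about 3.5 chunks per page, because `split_documents` splits each page's text separately, so chunks never span pages. To make the roles of `separator`, `chunk_size`, and `chunk_overlap` concrete, here is a simplified pure-Python sketch of separator-based chunking with overlap. It illustrates the idea only and is not LangChain's exact implementation. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, separator="\n", chunk_size=1000, chunk_overlap=150):
    """Greedily pack separator-delimited pieces into chunks.

    Each chunk holds at most chunk_size characters (unless a single piece is
    longer), and trailing pieces totalling at most chunk_overlap characters
    are repeated at the start of the next chunk.
    """
    pieces = [piece for piece in text.split(separator) if piece]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until what remains fits the overlap
            # budget and still leaves room for the incoming piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks
```

A larger `chunk_overlap` repeats more context between neighbouring chunks, at the cost of producing more chunks for the same text.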

In [170]:
pages[9]

Document(page_content='Guyon andElisseeff\n−5 0 5 −2−1 012−505−2−1012\n−0.5 00.511.5 −0.5 00.511.5−0.500.511.5−0.500.511.5\n(a) (b)\nFigure3:Avariableuselessbyitselfcanbeusefultogether withothers. (a)One\nvariablehascompletely overlapping classconditional densities. Still,usingitjointlywith\ntheothervariableimprovesclassseparabilit ycompared tousingtheothervariablealone.\n(b)XOR-likeorchessboard-likeproblems. Theclassesconsistofdisjointclumpssuchthatin\nprojectionontheaxestheclassconditional densities overlapperfectly.Therefore, individual\nvariableshavenoseparation power.Still,takentogether, thevariablesprovidegoodclass\nseparabilit y.\n4Variable Subset Selection\nIntheprevious section,wepresentedexamples thatillustrate theusefulness ofselecting\nsubsetsofvariablesthattogether havegoodpredictiv epower,asopposedtorankingvari-\nablesaccording totheirindividual predictiv epower.Wenowturntothisproblem and\noutlinethemaindirections thathavebeentakentotackleit.Theyessentiallydivideinto\nwra

In [169]:
docs[9]

Document(page_content='rankingofstep5,construct asequence ofpredictors ofsamenatureusingincreasing\nsubsetsoffeatures. Canyoumatchorimproveperformance withasmallersubset?If\nyes,tryanon-linear predictor withthatsubset.\n9.Doyouhavenewideas,time,computational resources, andenoughex-\namples? Ifyes,compare severalfeatureselection methods,including yournewidea,\ncorrelation coeﬃcients,backwardselection andembeddedmethods(Section 4).Use\nlinearandnon-linear predictors. Selectthebestapproachwithmodelselection (Sec-\ntion6).\n10.Doyouwantastablesolution (toimproveperformance and/orunderstanding)?\nIfyes,sub-sample yourdataandredoyouranalysis forseveral“bootstraps” (Section\n7.1).\n2.Wecautionthereaderthatthischecklistisheuristic. Theonlyrecommendation thatisalmostsurely\nvalidistotrythesimplestthingsﬁrst.\n3.By“linearpredictor” wemeanlinearintheparameters. Featureconstruction mayrenderthepredictor\nnon-linear intheinputvariables.\n1159', metadata={'source': '/tmp/tmpypk4uyuh/tmp.pdf', 'page'

In [37]:
user_question = get_user_question(verbose=VERBOSE)

Ask a question to the PDFs

      What is the bias and variance in terms of machine learning.
      
