In [1]:
#@title Overview

#@markdown This notebook lets users easily query a large body of text relevant to their project. In the future, a visualization tool may also be added so that users can view their knowledge space in two dimensions and spot potential connections.

In [43]:
#@title User-Defined Variables
search_query = "How do you set the PID parameters in constant deflection OBD AFM scanning?" #@param
pdf_folder = "steven" #@param
chunk_size = 256 #@param
chunk_overlap = 0.1 #@param
number_results = 3 #@param
openai_api_key = "" #@param
save_name_suffix = "[with-embeddings]" #@param
COMPLETIONS_MODEL = "gpt-3.5-turbo" #@param
EMBEDDING_MODEL = "text-embedding-ada-002" #@param

#@markdown ---
#@markdown ### **Help**
#@markdown - **`search_query`** - *`str; The question or information you would like to search for that is related to the text contained in pdf_folder.`*
#@markdown - **`pdf_folder`** - *`str; The name of the folder that contains all the PDFs you want to analyze.`*
#@markdown - **`chunk_size`** - *`int; How many words you want each chunk of text to contain (approximately). Default is 256.`*
#@markdown - **`chunk_overlap`** - *`float; How much overlap you want each chunk of text to have with the next chunk. Default is 0.1.`*
#@markdown - **`number_results`** - *`int; How many search results you want to see in the plot. Default is 10.`*
#@markdown - **`openai_api_key`** - *`str; Your (free trial) OpenAI API Key`*
#@markdown - **`save_name_suffix`** - *`str; String appended to the end of the CSV file name when saving. Default is '[with-embeddings].csv'`*
#@markdown - **`COMPLETIONS_MODEL`** - *`str; The name of the model to use for answering prompts. Default is 'text-davinci-003' from OpenAI.`*
#@markdown - **`EMBEDDING_MODEL`** - *`str; The name of the model to use for obtaining text embeddings. Default is 'text-embedding-ada-002' from OpenAI.`*
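The interaction between `chunk_size` and `chunk_overlap` can be illustrated with a minimal sketch. It assumes that chunks are measured in whitespace-separated words and that the overlap is a fraction of `chunk_size`. The helper `chunk_words` is hypothetical and is not the implementation used by `collect_pdf_folder_data` in `utils.py`.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 256, chunk_overlap: float = 0.1) -> List[str]:
    """Split text into chunks of about `chunk_size` words (hypothetical helper).

    Consecutive chunks share roughly `chunk_size * chunk_overlap` words.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < 1:
        raise ValueError("chunk_overlap must be in [0, 1)")

    words = text.split()
    # Number of words to advance between chunk starts (at least 1).
    step = max(1, int(round(chunk_size * (1 - chunk_overlap))))
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# Toy input: 25 numbered words, chunks of 10 words with 20% overlap.
sample = " ".join(f"w{i}" for i in range(25))
demo_chunks = chunk_words(sample, chunk_size=10, chunk_overlap=0.2)
for chunk in demo_chunks:
    print(chunk)
```

With these toy settings each chunk starts 8 words after the previous one, so neighbouring chunks share 2 words. A larger overlap reduces the chance that a relevant sentence is split across a chunk boundary, at the cost of more chunks to embed.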

In [2]:
#@title Mount Drive
from google.colab import drive, files
from IPython.display import HTML
import os
import sys
import shutil
import subprocess
output = subprocess.run(["pip", "list"], capture_output=True)
default_packages = output.stdout.decode().strip().split("\n")
drive.mount('/content/drive')
!mkdir -p /content/drive/MyDrive/python-packages
%cd '/content/drive/MyDrive/python-packages'
sys.path.append('/content/drive/MyDrive/python-packages')

Mounted at /content/drive
/content/drive/MyDrive/python-packages


In [None]:
#@title Install Required Packages
!pip install "openai<1"  # openai.embeddings_utils (imported below) was removed in openai 1.0
!pip install pymupdf
!pip install tiktoken
!pip install fpdf
!pip install section_headers
!git clone https://github.com/malekinho8/chapter-mapper.git /content/chapter-mapper
!mv /content/chapter-mapper/utils.py /content/utils.py
!mv /content/chapter-mapper/section_headers.py /content/section_headers.py

In [7]:
#@title Import Dependencies
import numpy as np
import colorsys
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import os
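import fpdf
import openai  # imported explicitly rather than relying on `from utils import *`
from utils import *  # helper functions from the chapter-mapper repository
from sklearn.manifold import TSNE
from openai.embeddings_utils import get_embedding, cosine_similarity  # requires openai<1.0
from typing import List, Dict, Tuple
from tqdm import tqdm
from IPython.display import Markdown
import pickle
openai.api_key = openai_api_key  # set in the "User-Defined Variables" cell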

In [15]:
#@title Collect Data from all PDFs in `pdf_folder`
pdf_folder = find_folder_path(pdf_folder, "/content/drive/MyDrive")
df_init, file_prefix = collect_pdf_folder_data(pdf_folder, chunk_size, chunk_overlap)

0
32
84
192
218
244
283
294
317


In [16]:
#@title Create Embeddings for the Raw Data
df_init = get_batched_embeddings(df_init, pdf_folder, file_prefix, save_name_suffix, openai_api_key)

100%|██████████| 394/394 [00:02<00:00, 174.60it/s]
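Embeddings that are cached in a CSV file come back as strings when the file is reloaded, which is why a later cell reports "Converting string embeddings to a list of floats...". The sketch below shows one way such a round trip can work using only the standard library. The column names `text` and `embedding` and both helper functions are assumptions for illustration; the actual caching logic lives in `get_batched_embeddings` in `utils.py` and may differ.

```python
import csv
import io
import json
from typing import Dict, List


def rows_to_csv(rows: List[Dict[str, object]]) -> str:
    """Serialize rows to CSV text; each embedding is stored as a JSON string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["text", "embedding"])
    writer.writeheader()
    for row in rows:
        writer.writerow({"text": row["text"], "embedding": json.dumps(row["embedding"])})
    return buffer.getvalue()


def csv_to_rows(csv_text: str) -> List[Dict[str, object]]:
    """Read the CSV text back and convert each embedding string to floats."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {"text": row["text"], "embedding": [float(x) for x in json.loads(row["embedding"])]}
        for row in reader
    ]


# Toy 3-dimensional vectors stand in for real embeddings.
original = [
    {"text": "first chunk", "embedding": [0.1, -0.2, 0.3]},
    {"text": "second chunk", "embedding": [0.0, 0.5, -1.0]},
]
restored = csv_to_rows(rows_to_csv(original))
print(restored[0]["embedding"])
```

Caching this way avoids paying for the same embeddings twice when the notebook is rerun on an unchanged folder.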


In [17]:
#@title Obtain TSNE Matrix Data for 2D Map Visualization
df, dm, all_titles = get_tsne_plot_params(df_init, pdf_folder, file_prefix, save_name_suffix)

Evaluating TSNE on Dataset...
Extracting Embedding Feature Matrix...


In [35]:
#@title Plot Results and Answer Question based on Text Data
all_titles = [x.split(os.sep)[-1] for x in all_titles]  # keep only the file names
df["title"] = [x.split(os.sep)[-1] for x in df["title"]]
fig = plotmap(df, search_query, number_results, dm, all_titles, pdf_folder, EMBEDDING_MODEL)

Searching for relevant text...
(3, 13)
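The search for relevant text can be pictured as ranking every chunk by the cosine similarity between its embedding and the embedding of `search_query`, then keeping the top `number_results`. The sketch below is a plain-Python illustration of that idea with toy 2-dimensional vectors. The functions `cosine_sim` and `top_matches` are hypothetical and are not the code inside `plotmap`.

```python
import math
from typing import List, Sequence, Tuple


def cosine_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_matches(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    number_results: int = 3,
) -> List[Tuple[int, float]]:
    """Return (chunk index, similarity) pairs, most similar first."""
    scored = [(i, cosine_sim(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:number_results]


# Toy vectors stand in for real embeddings.
query_vec = [1.0, 0.0]
chunk_vecs = [[0.0, 1.0], [1.0, 0.1], [0.7, 0.7], [-1.0, 0.0]]
best = top_matches(query_vec, chunk_vecs, number_results=2)
for index, score in best:
    print(index, round(score, 3))
```

Cosine similarity depends only on the direction of the vectors, not their length, which makes it a common choice for comparing text embeddings.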


In [38]:
#@title (Optional) Ask another question...

search_query = "What exactly is the problem formulation for anomaly detection?" #@param

In [39]:
#@title Answer the Question Based on Relevant Text
if len(search_query) > 0:
  answer = answer_query_with_context(search_query, df, COMPLETIONS_MODEL=COMPLETIONS_MODEL, EMBEDDING_MODEL=EMBEDDING_MODEL)
else:
  answer = "No question was provided."  # avoids a NameError when search_query is empty

Markdown("<br>**ChatGPT Response**:<br>" + answer)

Converting string embeddings to a list of floats...
Selected 3 document sections:
p. 21, Yeung et al. - 2022 - Modular, general purpose, automated, anomalous dat.pdf
p. 8, Yeung et al. - 2022 - RoSAA Mechatronically Synthesized Dataset for Rot.pdf
p. 8, Yeung et al. - 2022 - RoSAA Mechatronically Synthesized Dataset for Rot.pdf


<br>**ChatGPT Response**:<br>The context provides multiple perspectives on anomaly detection, including its definition as outliers in a set of data that may indicate potential problems with a particular system, and its importance in providing feedback to operators to identify potential inefficiencies or points of failure in machines. Anomaly detection and health monitoring are core concerns for most engineering applications, and data-driven anomaly analysis is gaining popularity over traditional model-based analysis of real plants. [ref: Yeung et al. - 2022 - Modular, general purpose, automated, anomalous dat.pdf, p. 21]
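The output above suggests that the selected document sections are placed into the prompt together with their file name and page, so the model can cite them in the `[ref: ...]` form. The sketch below shows one way such a prompt might be assembled. The function `build_prompt`, the word budget `max_words`, the prompt wording, and the placeholder file names are all assumptions for illustration; the real logic is in `answer_query_with_context` in `utils.py`.

```python
from typing import Dict, List


def build_prompt(question: str, sections: List[Dict[str, object]], max_words: int = 500) -> str:
    """Assemble a context-grounded prompt from ranked sections (hypothetical helper)."""
    header = (
        "Answer the question using only the context below. "
        "Cite the source as [ref: <file name>, p. <page>].\n\nContext:\n"
    )
    context_parts = []
    used_words = 0
    for section in sections:  # sections are assumed to be sorted by relevance
        n_words = len(str(section["text"]).split())
        if used_words + n_words > max_words:
            break  # stop before exceeding the context budget
        context_parts.append(f"* ({section['title']}, p. {section['page']}) {section['text']}")
        used_words += n_words
    return header + "\n".join(context_parts) + f"\n\nQuestion: {question}\nAnswer:"


# Placeholder sections, not taken from the PDFs used in this notebook.
sections = [
    {"title": "example-a.pdf", "page": 2, "text": "Alpha beta gamma."},
    {"title": "example-b.pdf", "page": 7, "text": "Delta epsilon."},
    {"title": "example-c.pdf", "page": 1, "text": "Zeta eta theta iota."},
]
prompt = build_prompt("What is alpha?", sections, max_words=5)
print(prompt)
```

Capping the context by a word (or token) budget keeps the prompt within the model's input limit; here the third section is dropped because it would exceed the 5-word budget.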

In [42]:
#@title Save the Plot as an HTML File
pdf_folder_name = pdf_folder.split(os.sep)[-1]
file_out = os.path.join(pdf_folder, f"{pdf_folder_name}-{search_query}")
fig.write_html(f"{file_out}.html")  # save the interactive plot as an HTML file
files.download(f"{file_out}.html")  # download the saved file through the browser
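Because the output file name is built from `search_query`, characters such as `?`, `/`, or `:` end up in the path, which some file systems and download dialogs handle poorly. A minimal sketch of one way to sanitize the query first is shown below; `safe_file_stem` is a hypothetical helper and is not part of `utils.py`.

```python
import re


def safe_file_stem(text: str, max_length: int = 80) -> str:
    """Turn free text into a file-name-friendly stem (hypothetical helper)."""
    # Replace every run of characters other than letters, digits, '.', '_' or '-'.
    stem = re.sub(r"[^A-Za-z0-9._-]+", "-", text).strip("-.")
    # Truncate long queries and fall back to a fixed name if nothing is left.
    return stem[:max_length].rstrip("-.") or "query"


query = "What exactly is the problem formulation for anomaly detection?"
print(safe_file_stem(query))
```

The result could then replace `search_query` when building `file_out`, for example `f"{pdf_folder_name}-{safe_file_stem(search_query)}"`.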
