<a href="https://colab.research.google.com/github/pankajtandon/Gist/blob/main/gist.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook will show you a way to build your app iteratively in Colab.

To run this notebook, go to
https://colab.research.google.com
and open it via File | Open notebook, or simply click the badge above.


To prevent your API keys from being committed to source control, do the following:
- Create a directory in the root of your Google Drive and call it `colab_content`.
- Create a file in that directory called `api-keys.txt` and in that file add contents like:
```
OPENAI_API_KEY=<your key>
NGROK_AUTH_TOKEN=<your key>
```

For OPENAI_API_KEY, you will need to create an account at https://platform.openai.com. API usage is billed, but moderate usage usually costs pennies, and you can monitor it at https://platform.openai.com/account/usage
The NGROK_AUTH_TOKEN is free and can be obtained from https://ngrok.com/
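
As a rough illustration of how a file in this `KEY=VALUE` format can be read, here is a minimal sketch. The helper name `parse_key_lines` and the sample values are hypothetical, and skipping `#` comment lines is an assumption that goes beyond the format described above.

```python
def parse_key_lines(lines):
    """Parse KEY=VALUE lines into a dict (hypothetical helper).

    Blank lines, lines starting with '#', and lines without '=' are skipped.
    Only the first '=' splits the line, so a value may itself contain '='.
    """
    pairs = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        pairs[key.strip()] = value.strip()
    return pairs


# Placeholder values only; real keys belong in api-keys.txt on your Drive.
sample = ["OPENAI_API_KEY=sk-placeholder\n", "\n", "NGROK_AUTH_TOKEN=abc=123\n"]
keys = parse_key_lines(sample)
print(sorted(keys))
```

Stripping whitespace from each value matters here: a stray trailing space or newline would otherwise become part of the key.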


Then run the cells in this notebook in order, following the comment at the top of each cell.



In [2]:
# First mount a directory in Google Drive. This will help keep your API Keys out of source control.
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# This will need to be done every time your VM disconnects.

!pip install pyngrok
!pip install streamlit
!pip install openai
!pip install langchain
!pip install tiktoken
!pip install sentence_transformers
!pip install PyPDF2
!pip install faiss-cpu


In [4]:
# This writes the Streamlit app code to gist.py in the mounted Google Drive directory.

%%writefile /content/drive/MyDrive/colab_content/gist.py


# from scipy import spatial
# import ast  # for converting embeddings saved as strings back to arrays
# import openai  # for calling the OpenAI API
# import pandas as pd  # for storing text and embeddings data
# import tiktoken  # for counting tokens
import streamlit as st
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from langchain.callbacks import get_openai_callback

PAGE_CONFIG = {"page_title": "Hello baby!", "page_icon": "smiley", "layout": "centered"}
st.set_page_config(**PAGE_CONFIG)
st.title("Welcome to our world!")
st.subheader("We are head over heels!")
pdf = st.file_uploader("Upload your PDF", type="pdf")

if pdf is not None:
    pdf_reader = PdfReader(pdf)
    content = ""
    for page in pdf_reader.pages:
        content += page.extract_text()
    # st.write("====Content====")
    # st.write(content)

    # Chunk out the file
    text_splitter = CharacterTextSplitter(separator=" ", chunk_size= 160, chunk_overlap = 15, length_function= len)
    chunks = text_splitter.split_text(content)

    # st.write("====Chunks====")
    # st.write(chunks)

    #Ask the question
    question = st.text_input("Ask me something about the PDF that you just uploaded:")
    if question:
        embeddings = OpenAIEmbeddings()

        # These are the vectorized chunks:
        knowledge_base = FAISS.from_texts(chunks, embeddings)

        # Docs are the chunks whose embeddings are closest to the question's embedding.
        docs = knowledge_base.similarity_search(question)

        # similarity_search returns a list, so test for emptiness rather than None.
        if docs:
            st.write("These are the related chunks:")
            for doc in docs:
                st.write(doc)
            
            # Forward the related chunks to the LLM with the query as a prompt
            llm = OpenAI()
        
            chain = load_qa_chain(llm, chain_type = "stuff")
            with get_openai_callback() as cb:
                response = chain.run(question = question, input_documents = docs)
                st.write("Cost of query:")
                st.write(cb)

            st.write(response)
        else:
            st.write("No match on the chunks!")


# EMBEDDING_MODEL = "text-embedding-ada-002"
# GPT_MODEL = "gpt-3.5-turbo"


Overwriting /content/drive/MyDrive/colab_content/gist.py
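
To make the chunk-and-retrieve flow in the cell above more concrete, here is a dependency-free sketch. It is not LangChain's or FAISS's actual algorithm: `split_with_overlap` is a hypothetical fixed-width character splitter, and `word_overlap_score` is a crude word-set similarity standing in for embedding similarity. The value `k=2` is arbitrary.

```python
def split_with_overlap(text, chunk_size=160, chunk_overlap=15):
    """Cut text into fixed-width windows that share chunk_overlap characters.

    Hypothetical helper for illustration; it ignores separators entirely.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def word_overlap_score(a, b):
    """Jaccard similarity of the two word sets (stand-in for embeddings)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def top_k_chunks(question, chunks, k=2):
    """Return the k chunks that score highest against the question."""
    ranked = sorted(chunks, key=lambda c: word_overlap_score(question, c), reverse=True)
    return ranked[:k]


windows = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(windows)

toy_chunks = ["the cat sat", "dogs bark loudly", "a cat naps"]
best = top_k_chunks("where is the cat", toy_chunks)
print(best)
```

The overlap means text that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some characters twice.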


In [10]:
# Set up the tunnel to allow access to the running Streamlit instance.

from pyngrok import ngrok
import os


os.environ['STREAMLIT_SERVER_MAX_UPLOAD_SIZE']='201'
with open('/content/drive/MyDrive/colab_content/api-keys.txt', 'r') as f:
    api_key_list = f.readlines()
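for kv in api_key_list:
    # Skip blank or malformed lines; split only on the first '=' so values may contain '='.
    if '=' not in kv:
        continue
    k, v = kv.split('=', 1)
    os.environ[k.strip()] = v.strip()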
ngrok_token = os.getenv('NGROK_AUTH_TOKEN').strip()
!ngrok authtoken $ngrok_token
public_url = ngrok.connect(addr='8501') # This is the default Streamlit port
print('This is the URL that can be used to access the Streamlit app', public_url)

Authtoken saved to configuration file: /root/.ngrok2/ngrok.yml




This is the URL that can be used to access the Streamlit app NgrokTunnel: "https://fdb2-34-134-54-16.ngrok-free.app" -> "http://localhost:8501"


In [None]:
# Start the Streamlit app, leave it running, and then open it at the URL printed above.

!streamlit run /content/drive/MyDrive/colab_content/gist.py