The main objectives of this notebook are:

1. Loading webpages from a specific domain.
2. Embedding the webpage content using Cohere.
3. Storing the resulting vectors in the Qdrant vector database.
4. Retrieving documents similar to a given query.
5. Generating output for the query based on the retrieved documents using Cohere.

<br>

The basic workflow is shown below:
![workflow](images/website_qa.png)

<br>
Before running this notebook, make sure the Qdrant Docker container is running:

1. Make sure that the Docker daemon is installed and running:
    ```
    sudo docker info
    ```
2. Pull the image:
    ```
    docker pull qdrant/qdrant
    ```
3. Run the container:
    ```
    docker run -p 6333:6333 \
        -v $(pwd)/path/to/data:/qdrant/storage \
        qdrant/qdrant
    ```
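
To confirm from Python that something is listening on the published port before running the later cells, a minimal sketch like the following may help. The helper name `is_port_open` is hypothetical, and the check only verifies that a TCP connection to `localhost:6333` succeeds, not that the service behind it is actually Qdrant.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


# 6333 is the port published by the `docker run` command above.
if is_port_open("localhost", 6333):
    print("Port 6333 is reachable")
else:
    print("Nothing is listening on localhost:6333; is the container running?")
```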

In [19]:
# Uncomment the line below to install the dependencies
# %pip install -r requirements.txt
# Uncomment the line below to overwrite requirements.txt with the installed versions
# %pip freeze > requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [10]:
import sys
import os

from dotenv import load_dotenv

load_dotenv()

HUGGINGFACEHUB_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
COHERE_API_KEY = os.getenv("COHERE_API_KEY")

# Get the absolute path of the current working directory
current_dir = os.getcwd()
print(current_dir)

# Add the current directory to the Python path
sys.path.append(current_dir)


/Users/ojaskapre/projects/notebooks
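
`os.getenv` returns `None` when a variable is not set, so a missing key only surfaces later as an authentication error. The sketch below checks up front; `missing_env_vars` is a hypothetical helper, and it checks only `COHERE_API_KEY` because that is the only key the cells shown here pass to a client.

```python
import os


def missing_env_vars(names, environ=os.environ):
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not environ.get(name)]


# Report missing keys early instead of failing inside an API call later.
missing = missing_env_vars(["COHERE_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```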


In [11]:
import asyncio
import nest_asyncio
from pprint import pprint

nest_asyncio.apply()

In [12]:
from webpagelinkminer import WebPageLinkExtractor

# url = "https://python.langchain.com/en/latest/"
# url = "https://gpt-index.readthedocs.io/en/latest/"
# url = "https://docs.sqlalchemy.org/en/20/"
url = "https://next-auth.js.org/getting-started/introduction"
# url = "https://flask.palletsprojects.com/en/2.3.x/"
# url = "https://svelte.dev/docs"
# url = "https://firebase.google.com/docs"
# url = "https://www.mysqltutorial.org/"
# url = "https://nextjs.org/docs"

Extract links from the same domain using the WebPageLinkMiner library (https://github.com/ojasskapre/WebPageLinkMiner)

In [13]:
extractor = WebPageLinkExtractor(url, max_depth=1000, algorithm='dfs')
extracted_urls = asyncio.run(extractor.get_links_async())

print(f'Number of URLs extracted: {len(extracted_urls)}')
pprint(extracted_urls[:10])

Fetching links from https://next-auth.js.org/getting-started/introduction at depth 0
Fetching links from https://next-auth.js.org/ at depth 1
Fetching links from https://next-auth.js.org/tutorials at depth 1
Fetching links from https://next-auth.js.org/faq at depth 1
Fetching links from https://next-auth.js.org/security at depth 1
Fetching links from https://next-auth.js.org/v3/getting-started/introduction at depth 1
Fetching links from https://next-auth.js.org/getting-started/example at depth 1
Fetching links from https://next-auth.js.org/getting-started/client at depth 1
Fetching links from https://next-auth.js.org/getting-started/rest-api at depth 1
Fetching links from https://next-auth.js.org/getting-started/typescript at depth 1
Fetching links from https://next-auth.js.org/getting-started/upgrade-v4 at depth 1
Fetching links from https://next-auth.js.org/configuration/initialization at depth 1
Fetching links from https://next-auth.js.org/providers/ at depth 1
Fetching links from h
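
To illustrate what "links from the same domain" means, here is a rough standard-library sketch of the filtering step. It is not WebPageLinkMiner's implementation; `same_domain_links` is a hypothetical helper that resolves relative links, drops fragments, and keeps only http(s) URLs on the start page's host.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def same_domain_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep unique http(s) links on the same host."""
    base_host = urlparse(base_url).netloc.lower()
    seen = set()
    result = []
    for href in hrefs:
        # Make the link absolute and strip any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        if parsed.netloc.lower() != base_host:
            continue  # skip links that leave the domain
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


links = same_domain_links(
    "https://next-auth.js.org/getting-started/introduction",
    ["/faq", "/faq#top", "https://github.com/nextauthjs/next-auth", "example"],
)
print(links)
```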

Load all the extracted webpage links using the LangChain WebBaseLoader (https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/web_base.html)

In [14]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader(extracted_urls)
loader.requests_per_second = 1
docs = loader.aload()

print(f'Number of documents loaded: {len(docs)}')
pprint(docs[:10])

Fetching pages: 100%|##########| 212/212 [00:19<00:00, 11.13it/s]


Number of documents loaded: 212


Split the documents into chunks using the LangChain RecursiveCharacterTextSplitter (https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/recursive_text_splitter.html) with the tiktoken encoder, the tokenizer used for OpenAI models, so that chunk sizes are measured in tokens rather than characters.

In [15]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=1000, chunk_overlap=20)
texts = text_splitter.split_documents(docs)

print(f'Number of texts split: {len(texts)}')
pprint(texts[:10])

Number of texts split: 443
 Document(page_content='WorkOS | NextAuth.js', metadata={'source': 'https://next-auth.js.org/providers/workos', 'title': 'WorkOS | NextAuth.js', 'description': 'Documentation', 'language': 'en'}),
 Document(page_content='return (            <div key={provider.id}>              <input                type="email"                value={email}                placeholder="Email"                onChange={(event) => setEmail(event.target.value)}              />              <button                onClick={() =>                  signIn(provider.id, undefined, {                    domain: email.split("@")[1],                  })                }              >                Sign in with SSO              </button>            </div>          )        }        return (          <div key={provider.id}>            <button onClick={() => signIn(provider.id)}>              Sign in with {provider.name}            </button>          </div>        )      })}    </>  )}export a
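
The roles of `chunk_size` and `chunk_overlap` can be illustrated with a simplified sliding window over a token list. This is only a sketch: the real splitter also tries to break on separators such as paragraphs and lines, and the whitespace split below is a stand-in for tiktoken tokens. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens,
    where each window repeats the last chunk_overlap tokens of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


# Whitespace-separated words stand in for tokens here (an assumption for illustration).
words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_with_overlap(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```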

Initializing Cohere Embeddings (https://python.langchain.com/en/latest/modules/models/text_embedding/examples/cohere.html)

In [20]:
import cohere
from langchain.embeddings import CohereEmbeddings

embeddings = CohereEmbeddings(cohere_api_key=COHERE_API_KEY)

Create embeddings for the split text using a Cohere embedding model and store them in the Qdrant vector database (https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/qdrant.html)

In [22]:
from langchain.vectorstores import Qdrant

qdrant_url = "http://localhost:6333/"
qdrant_port = 6333

qdrant = Qdrant.from_documents(documents=texts,
                               embedding=embeddings, 
                               url=qdrant_url, 
                               collection_name="langchain_documents")

Retrieve the documents most likely to contain the answer to the query using Qdrant similarity search

In [23]:
import qdrant_client

query = "How to protect backend API route? Give me code for that."

found_docs = qdrant.similarity_search(query)
print(found_docs[0].page_content)
print(found_docs[0].metadata['source'])

Securing pages and API routes | NextAuth.js
https://next-auth.js.org/v3/tutorials/securing-pages-and-api-routes
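
Conceptually, similarity search embeds the query and returns the stored vectors closest to it. The brute-force sketch below shows that idea with cosine similarity on toy vectors. It is an illustration only: Qdrant uses an approximate index rather than a full scan, the distance metric of the collection is assumed here to be cosine, and `top_k` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _score, idx in scored[:k]]


# Toy 2-dimensional "embeddings"; real embeddings have many more dimensions.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.0], doc_vecs, k=2))
```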


Initializing the Cohere LLM (https://python.langchain.com/en/latest/modules/models/llms/integrations/cohere.html)

In [24]:
from langchain.llms import Cohere

llm = Cohere(cohere_api_key=COHERE_API_KEY)

Creating the question-answering-with-sources chain using LangChain (https://python.langchain.com/en/latest/modules/chains/index_examples/qa_with_sources.html) to generate the output for the given query

In [29]:
from langchain.chains.qa_with_sources import load_qa_with_sources_chain

chain = load_qa_with_sources_chain(llm=llm, chain_type="stuff")
results = chain.run(input_documents=found_docs, question=query)
print(results)


 I don't know.
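
The "stuff" chain type places all input documents into a single prompt and sends it to the model once. The sketch below shows roughly what that assembly looks like. The template wording is hypothetical, not LangChain's actual prompt, and `build_stuff_prompt` is an illustrative helper. Printing the assembled prompt is one way to inspect what the model actually received when the answer is "I don't know".

```python
def build_stuff_prompt(question, docs):
    """Concatenate every (source, text) pair into one prompt, as a 'stuff'-style chain does."""
    context = "\n\n".join(
        f"Content: {text}\nSource: {source}" for source, text in docs
    )
    # Hypothetical instructions; the real chain ships its own template.
    return (
        "Answer the question using only the extracts below and cite the sources.\n"
        "If the answer is not in the extracts, say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "How to protect backend API route?",
    [("https://next-auth.js.org/v3/tutorials/securing-pages-and-api-routes",
      "Securing pages and API routes | NextAuth.js")],
)
print(prompt)
```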




Create the retrieval QA with sources chain using LangChain (https://python.langchain.com/en/latest/modules/chains/index_examples/vector_db_qa_with_sources.html) to generate the output for the given query using the Cohere LLM initialized above

In [26]:
from langchain.chains import RetrievalQAWithSourcesChain

chain = RetrievalQAWithSourcesChain.from_chain_type(llm, chain_type="stuff", retriever=qdrant.as_retriever())
results = chain({"question": query}, return_only_outputs=True)
print(results)

{'answer': "\nI don't know.", 'sources': ''}


Create a summarization chain to summarize all the retrieved documents (https://python.langchain.com/en/latest/modules/chains/index_examples/summarize.html)

In [27]:
from langchain.chains.summarize import load_summarize_chain
from langchain.prompts import PromptTemplate


prompt_template = """Write a concise summary of the following content. 
If the content has a python code snippet then return the code along with the summary else mention 'No Code Found'

Content: {text}

Answer:
"""

PROMPT = PromptTemplate(template=prompt_template, input_variables=["text"])

chain = load_summarize_chain(llm=llm, chain_type="stuff", prompt=PROMPT)
# print(chain.prompt)

results = chain.run(input_documents=found_docs, return_only_outputs=False)
print(results)

The content talks about how to secure API routes using getSession() and getToken() methods. The getSession() method is used to protect API routes and getToken() is used to access the contents of the JWT without having to handle JWT decryption / verification yourself. The content also talks about how to read a JSON Web Token from an API route using the getToken() helper function and how to include all dashboard nested routes (sub pages like /dashboard/settings, /dashboard/profile) by passing matcher: "/dashboard/:path*" to config.
