The main objectives of this notebook are:

1. Loading webpages from a specific domain.
2. Embedding the webpage content using OpenAI models.
3. Storing the resulting vectors in the Qdrant vector database.
4. Retrieving documents similar to a given query.
5. Generating output for the query based on the retrieved documents using OpenAI.

<br>

The basic workflow is shown below:
![workflow](images/website_qa.png)

<br>
Before running this notebook, make sure a Qdrant Docker container is running:

1. Make sure that the Docker daemon is installed and running:
    ```
    sudo docker info
    ```
2. Pull the image:
    ```
    docker pull qdrant/qdrant
    ```
3. Run the container:
    ```
    docker run -p 6333:6333 \
        -v $(pwd)/path/to/data:/qdrant/storage \
        qdrant/qdrant
    ```
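
Once the container is up, a quick reachability check can save a confusing failure later in the notebook. The sketch below assumes that Qdrant's REST API is served on the port mapped in the `docker run` command and that a `GET /collections` request succeeds when the service is healthy. The helper name `qdrant_is_reachable` is hypothetical and not part of any library.

```python
import urllib.error
import urllib.request


def qdrant_is_reachable(base_url="http://localhost:6333", timeout=2.0):
    """Return True if an HTTP GET to <base_url>/collections succeeds.

    Assumption: the Qdrant REST API listens at base_url, i.e. the port
    mapped in the `docker run` command.
    """
    endpoint = base_url.rstrip("/") + "/collections"
    try:
        with urllib.request.urlopen(endpoint, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

Calling `qdrant_is_reachable()` before the indexing cell returns `False` instead of raising if the container has not started yet.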

In [1]:
# Uncomment the line below to install dependencies
# %pip install -r requirements.txt

In [14]:
import sys
import os

from dotenv import load_dotenv

load_dotenv()

HUGGINGFACEHUB_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Get the absolute path of the current working directory
current_dir = os.getcwd()
print(current_dir)

# Add the current directory to the Python path
sys.path.append(current_dir)


/Users/ojaskapre/projects/notebooks


In [15]:
import asyncio
import nest_asyncio
from pprint import pprint

nest_asyncio.apply()

In [16]:
from webpagelinkminer import WebPageLinkExtractor

# url = "https://python.langchain.com/en/latest/"
# url = "https://gpt-index.readthedocs.io/en/latest/"
# url = "https://docs.sqlalchemy.org/en/20/"
url = "https://next-auth.js.org/getting-started/introduction"
# url = "https://flask.palletsprojects.com/en/2.3.x/"
# url = "https://svelte.dev/docs"
# url = "https://firebase.google.com/docs"
# url = "https://www.mysqltutorial.org/"
# url = "https://nextjs.org/docs"

Extract links from the same domain using the WebPageLinkMiner library (https://github.com/ojasskapre/WebPageLinkMiner)

In [17]:
extractor = WebPageLinkExtractor(url, max_depth=1000, algorithm='dfs')
extracted_urls = asyncio.run(extractor.get_links_async())

print(f'Number of URLs extracted: {len(extracted_urls)}')
pprint(extracted_urls[:10])

Fetching links from https://next-auth.js.org/getting-started/introduction at depth 0
Fetching links from https://next-auth.js.org/ at depth 1
Fetching links from https://next-auth.js.org/tutorials at depth 1
Fetching links from https://next-auth.js.org/faq at depth 1
Fetching links from https://next-auth.js.org/security at depth 1
Fetching links from https://next-auth.js.org/v3/getting-started/introduction at depth 1
Fetching links from https://next-auth.js.org/getting-started/example at depth 1
Fetching links from https://next-auth.js.org/getting-started/client at depth 1
Fetching links from https://next-auth.js.org/getting-started/rest-api at depth 1
Fetching links from https://next-auth.js.org/getting-started/typescript at depth 1
Fetching links from https://next-auth.js.org/getting-started/upgrade-v4 at depth 1
Fetching links from https://next-auth.js.org/configuration/initialization at depth 1
Fetching links from https://next-auth.js.org/providers/ at depth 1
Fetching links from h
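
The same-domain restriction can be illustrated with the standard library alone. This is a minimal sketch of the idea, not WebPageLinkMiner's actual implementation: the function name `same_domain_links` and the choice to drop URL fragments are assumptions made for the example.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def same_domain_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep only same-host http(s) links."""
    base_host = urlparse(base_url).netloc.lower()
    seen = set()
    result = []
    for href in hrefs:
        # Make the link absolute and strip any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        if parsed.netloc.lower() != base_host:
            continue  # skip links that leave the starting domain
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


links = same_domain_links(
    "https://next-auth.js.org/getting-started/introduction",
    ["/faq", "/faq#top", "https://github.com/nextauthjs", "../tutorials"],
)
print(links)
```

A crawler would apply a filter like this to every fetched page and push the surviving links onto its queue (breadth-first) or stack (depth-first) until the depth limit is reached.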

Load the content of every extracted URL using the LangChain WebBaseLoader (https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/web_base.html)

In [18]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader(extracted_urls)
loader.requests_per_second = 1
docs = loader.aload()

print(f'Number of documents loaded: {len(docs)}')
pprint(docs[:10])

Fetching pages: 100%|##########| 212/212 [00:21<00:00,  9.90it/s]


Number of documents loaded: 212
 Document(page_content='\n\n\n\n\nUsage with class components | NextAuth.js\n\n\n\n\n\nSkip to main contentNextAuth.js is becoming Auth.js! 🎉 We\'re creating Authentication for the Web. Everyone included. You are looking at the NextAuth.js (v4) documentation. For the new documentation go to authjs.dev.NextAuth.jsDocumentationTutorialsFAQSecurityv4v4v3All ReleasesnpmGitHubSearchVersion: v4On this pageUsage with class componentsIf you want to use the useSession() hook in your class components you can do so with the help of a higher order component or with a render prop.Higher Order Component\u200bimport { useSession } from "next-auth/react"const withSession = (Component) => (props) => {  const session = useSession()  // if the component has a render property, we are good  if (Component.prototype.render) {    return <Component session={session} {...props} />  }  // if the passed component is a function component, there is no need for this wrapper  throw new

Split the documents into chunks using the LangChain RecursiveCharacterTextSplitter (https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/recursive_text_splitter.html), measuring chunk length in tokens with the tiktoken encoder used by OpenAI models

In [19]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=1000, chunk_overlap=20)
texts = text_splitter.split_documents(docs)

print(f'Number of texts split: {len(texts)}')
pprint(texts[:10])

Number of texts split: 443
 Document(page_content='MySQL | NextAuth.js', metadata={'source': 'https://next-auth.js.org/v3/adapters/typeorm/mysql', 'title': 'MySQL | NextAuth.js', 'description': 'Schema for a MySQL database.', 'language': 'en'}),
 Document(page_content='INT NOT NULL AUTO_INCREMENT,    name           VARCHAR(255),    email          VARCHAR(255),    email_verified TIMESTAMP(6),    image          VARCHAR(255),    created_at     TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6),    updated_at     TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6),    PRIMARY KEY (id)  );CREATE TABLE verification_requests  (    id         INT NOT NULL AUTO_INCREMENT,    identifier VARCHAR(255) NOT NULL,    token      VARCHAR(255) NOT NULL,    expires    TIMESTAMP(6) NOT NULL,    created_at TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6),    updated_at TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6),    PRIMARY KEY (id)  );CREATE UNIQUE INDEX compound_id  ON accounts(compound_id);CREAT
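
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch of fixed-size windowing with overlap. It is not how RecursiveCharacterTextSplitter works internally (that splitter breaks text on a hierarchy of separators and counts tokens with tiktoken); the function `chunk_tokens` is hypothetical and treats whitespace-separated words as stand-ins for tokens.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_tokens(words, chunk_size=4, chunk_overlap=1):
    print(" ".join(chunk))
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary is still seen whole by at least one embedding.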

Initializing OpenAI Embeddings

In [20]:
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

Create embeddings for the split texts using OpenAI embedding models and store them in the Qdrant vector database (https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/qdrant.html)

In [24]:
from langchain.vectorstores import Qdrant

# REST endpoint of the local Qdrant container (the port is part of the URL)
qdrant_url = "http://localhost:6333/"

qdrant = Qdrant.from_documents(documents=texts,
                               embedding=embeddings, 
                               url=qdrant_url, 
                               collection_name="langchain_documents")

Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised APIConnectionError: Error communicating with OpenAI: HTTPSConnectionPool(host='api.openai.com', port=443): Max retries exceeded with url: /v1/engines/text-embedding-ada-002/embeddings (Caused by SSLError(SSLError(1, '[SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2536)'))).


Retrieve the documents most likely to contain the answer to the query using Qdrant similarity search

In [25]:
import qdrant_client

query = "How to protect backend API route? Give me code for that."

found_docs = qdrant.similarity_search(query)
print(found_docs[0].page_content)
print(found_docs[0].metadata['source'])

Securing pages and API routes | NextAuth.js
https://next-auth.js.org/tutorials/securing-pages-and-api-routes
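
Conceptually, similarity search embeds the query and ranks the stored vectors by how close they are to it. The sketch below shows that ranking with cosine similarity over a brute-force scan of toy two-dimensional vectors. It is an illustration only: Qdrant uses an approximate index and a configurable distance metric, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=4):
    """Return the ids of the k stored vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _score, doc_id in scored[:k]]


# Toy "embeddings"; real ones have hundreds or thousands of dimensions.
stored_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k([1.0, 0.1], stored_vectors, k=2))
```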


Initializing the OpenAI LLM

In [26]:
from langchain.llms import OpenAI

llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)

Create a question-answering chain with sources using LangChain (https://python.langchain.com/en/latest/modules/chains/index_examples/qa_with_sources.html) to generate an answer to the query with OpenAI

In [12]:
from langchain.chains.qa_with_sources import load_qa_with_sources_chain

chain = load_qa_with_sources_chain(llm=llm, chain_type="stuff")
results = chain.run(input_documents=found_docs, question=query)
print(results)


 You can protect API routes using the getSession() method.
SOURCES: https://next-auth.js.org/tutorials/securing-pages-and-api-routes
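
The `"stuff"` chain type places every retrieved document into a single prompt and sends it to the model in one call. The sketch below shows that idea in plain Python. The prompt wording and the function `build_stuff_prompt` are hypothetical and do not reproduce LangChain's actual template; documents are represented as plain dictionaries for the example.

```python
def build_stuff_prompt(docs, question):
    """Concatenate all documents, with their sources, into one prompt string."""
    parts = []
    for doc in docs:
        parts.append(f"Content: {doc['page_content']}\nSource: {doc['source']}")
    context = "\n\n".join(parts)
    return (
        "Answer the question using only the extracts below, "
        "then list the sources you used after 'SOURCES:'.\n\n"
        f"{context}\n\n"
        f"QUESTION: {question}\n"
        "FINAL ANSWER:"
    )


example_docs = [
    {"page_content": "API routes can be secured with getSession().",
     "source": "https://example.com/securing"},
    {"page_content": "Client pages can use the useSession hook.",
     "source": "https://example.com/client"},
]
print(build_stuff_prompt(example_docs, "How to protect a backend API route?"))
```

Because everything goes into one prompt, this strategy only works while the combined documents fit in the model's context window, which is one reason the texts were split into bounded chunks earlier.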


Create a retrieval QA chain with sources using LangChain (https://python.langchain.com/en/latest/modules/chains/index_examples/vector_db_qa_with_sources.html), which retrieves the relevant documents from Qdrant and generates the answer with OpenAI in a single call

In [13]:
from langchain.chains import RetrievalQAWithSourcesChain

chain = RetrievalQAWithSourcesChain.from_chain_type(llm, chain_type="stuff", retriever=qdrant.as_retriever())
results = chain({"question": query}, return_only_outputs=True)
print(results)

{'answer': ' You can protect API routes using the getSession() method.\n', 'sources': 'https://next-auth.js.org/tutorials/securing-pages-and-api-routes'}


Create a summarization chain to summarize all the retrieved documents (https://python.langchain.com/en/latest/modules/chains/index_examples/summarize.html)

In [38]:
from langchain.chains.summarize import load_summarize_chain
from langchain.prompts import PromptTemplate


prompt_template = """Write a concise summary of the following content. 
If the content has a python code snippet then return the code along with the summary else mention 'No Code Found'

Content: {text}

Answer:
"""

PROMPT = PromptTemplate(template=prompt_template, input_variables=["text"])

chain = load_summarize_chain(llm=llm, chain_type="stuff", prompt=PROMPT)
# print(chain.prompt)

results = chain.run(input_documents=found_docs, return_only_outputs=False)
print(results)

NextAuth.js provides an easy way to secure client and server side rendered pages and API routes. Client side pages can be secured using the useSession React Hook, while server side pages can be secured using the getSession() method. API routes can be secured using the getSession() and getToken() methods. No Code Found.
