## Retrieval-Augmented Generation

## Scrape documents

In [1]:
import requests

response = requests.get("https://langchain.readthedocs.io/en/latest/")
response

<Response [200]>

In [2]:
response.text

'\n\n<!DOCTYPE html>\n\n\n<html lang="en" >\n\n  <head>\n    <meta charset="utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.17.1: http://docutils.sourceforge.net/" />\n\n    <title>Welcome to LangChain &#8212; 🦜🔗 LangChain 0.0.126</title>\n  \n  \n  \n  <script data-cfasync="false">\n    document.documentElement.dataset.mode = localStorage.getItem("mode") || "";\n    document.documentElement.dataset.theme = localStorage.getItem("theme") || "light";\n  </script>\n  \n  <!-- Loaded before other Sphinx assets -->\n  <link href="_static/styles/theme.css?digest=12da95d707ffb74b382d" rel="stylesheet" />\n<link href="_static/styles/bootstrap.css?digest=12da95d707ffb74b382d" rel="stylesheet" />\n<link href="_static/styles/pydata-sphinx-theme.css?digest=12da95d707ffb74b382d" rel="stylesheet" />\n\n  \n  <link href="_static/vendor/fontawesome/6.1.2/css/all.min.css?digest=12da95d707ffb74b382d" rel="stylesheet" />\n 

In [3]:
from bs4 import BeautifulSoup
import urllib.parse
import html
import re

domain = "https://langchain.readthedocs.io/"
domain_full = domain + "en/latest/"

response = requests.get(domain_full)


soup = BeautifulSoup(response.text, "html.parser")

local_links = []

# Prefixes that mark a link as belonging to the docs site
local_prefixes = (domain, "./", "/", "modules", "use_cases")

for link in soup.find_all("a", href=True):
    href = link["href"]
    # str.startswith accepts a tuple and matches any of its prefixes
    if href.startswith(local_prefixes):
        local_links.append(urllib.parse.urljoin(domain_full, href))

In [4]:
def scrape(url: str):
    response = requests.get(url)
    # Skip missing pages; the caller treats None as "nothing scraped"
    if response.status_code == 404:
        return None
    
    soup = BeautifulSoup(response.text, "html.parser")

    local_links = []

    for link in soup.find_all("a", href=True):
        href = link["href"]
        # Keep only links that point back into the docs site
        if href.startswith((domain, "./", "/", "modules", "use_cases")):
            local_links.append(urllib.parse.urljoin(domain_full, href))
        
    main_content = soup.select("body main")[0]
    main_content_text = main_content.get_text()
    main_content_text = re.sub(r"<[^>]+>", "", main_content_text)
    main_content_text = " ".join(main_content_text.split())
    main_content_text = html.unescape(main_content_text)

    return {"url": url, "text": main_content_text}, local_links
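
Links that differ only in their `#fragment` point at the same page, so a crawler that keeps them apart may fetch that page more than once. The sketch below shows one way to resolve and de-duplicate links before queueing them. `normalize_link` is a hypothetical helper, not part of the notebook, and the example paths are made up for illustration.

```python
from urllib.parse import urldefrag, urljoin


def normalize_link(base_url: str, href: str) -> str:
    """Resolve href against base_url and drop any #fragment."""
    absolute = urljoin(base_url, href)
    # urldefrag returns a (url, fragment) named tuple
    return urldefrag(absolute).url


base = "https://langchain.readthedocs.io/en/latest/"
# Both hrefs resolve to the same page once the fragment is removed
print(normalize_link(base, "./modules/chains.html#getting-started"))
print(normalize_link(base, "modules/chains.html"))
```

Storing the normalized URLs in a set then collapses such duplicates automatically.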

In [5]:
links = set(["https://langchain.readthedocs.io/en/latest/"])
scraped = set()
data = []
i = 0

while True:
    if len(links) == 0:
        print("Completed!")
        break
        
    if len(scraped) == 200:
        break

    url = list(links)[0]
    print(f"[{str(i+1).zfill(4)}] - {url}")
    result = scrape(url)
    scraped.add(url)

    # scrape() returns None for missing pages, so only unpack on success;
    # this avoids reusing content and links from the previous iteration
    if result is not None:
        content, local_links = result
        data.append(content)
        links.update(local_links)

    links -= scraped
    i += 1

[0001] - https://langchain.readthedocs.io/en/latest/
[0002] - https://langchain.readthedocs.io/en/latest/modules/agents/toolkits/examples/python.html
[0003] - https://langchain.readthedocs.io/en/latest/modules/indexes/document_loaders/examples/youtube.html
[0004] - https://langchain.readthedocs.io/en/latest/modules/memory/examples/agent_with_memory.html
[0005] - https://langchain.readthedocs.io/en/latest/modules/indexes/document_loaders/examples/bigquery.html
[0006] - https://langchain.readthedocs.io/en/latest/modules/chains/index_examples/vector_db_text_generation.html
[0007] - https://langchain.readthedocs.io/en/latest/modules/indexes/document_loaders/examples/gcs_file.html
[0008] - https://langchain.readthedocs.io/en/latest/modules/agents/tools/examples/human_tools.html
[0009] - https://langchain.readthedocs.io/en/latest/modules/indexes/text_splitters/examples/nltk.html
[0010] - https://langchain.readthedocs.io/en/latest/model_laboratory.html
[0011] - https://langchain.readthedocs.i

In [6]:
data[2]

{'url': 'https://langchain.readthedocs.io/en/latest/modules/indexes/document_loaders/examples/youtube.html',
 'text': '.ipynb .pdf YouTube Contents Add video info YouTube loader from Google Cloud Prerequisites 🧑 Instructions for ingesting your Google Docs data YouTube# How to load documents from YouTube transcripts. from langchain.document_loaders import YoutubeLoader # !pip install youtube-transcript-api loader = YoutubeLoader.from_youtube_url("https://www.youtube.com/watch?v=QsYGlZkevEg", add_video_info=True) loader.load() Add video info# # ! pip install pytube loader = YoutubeLoader.from_youtube_url("https://www.youtube.com/watch?v=QsYGlZkevEg", add_video_info=True) loader.load() YouTube loader from Google Cloud# Prerequisites# Create a Google Cloud project or use an existing project Enable the Youtube Api Authorize credentials for desktop app pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib youtube-transcript-api 🧑 Instructions for ingesting 

In [7]:
urls = [record["url"] for record in data]
urls

['https://langchain.readthedocs.io/en/latest/',
 'https://langchain.readthedocs.io/en/latest/modules/agents/toolkits/examples/python.html',
 'https://langchain.readthedocs.io/en/latest/modules/indexes/document_loaders/examples/youtube.html',
 'https://langchain.readthedocs.io/en/latest/modules/memory/examples/agent_with_memory.html',
 'https://langchain.readthedocs.io/en/latest/modules/indexes/document_loaders/examples/bigquery.html',
 'https://langchain.readthedocs.io/en/latest/modules/chains/index_examples/vector_db_text_generation.html',
 'https://langchain.readthedocs.io/en/latest/modules/indexes/document_loaders/examples/gcs_file.html',
 'https://langchain.readthedocs.io/en/latest/modules/agents/tools/examples/human_tools.html',
 'https://langchain.readthedocs.io/en/latest/modules/indexes/text_splitters/examples/nltk.html',
 'https://langchain.readthedocs.io/en/latest/model_laboratory.html',
 'https://langchain.readthedocs.io/en/latest/use_cases/evaluation/agent_benchmarking.html'

## Chunk documents

In [8]:
import tiktoken

tokenizer = tiktoken.get_encoding("p50k_base")

def tokenize_and_count(text):
    tokens = tokenizer.encode(text, disallowed_special=())

    return len(tokens)

In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,
    length_function=tokenize_and_count,
    separators=["\n\n", "\n", " ", ""]
)

In [10]:
from uuid import uuid4
from tqdm.auto import tqdm

chunks_and_metadata = []

for i, doc in enumerate(data):
    chunks = text_splitter.split_text(doc["text"])
    chunks_and_metadata.extend([{
        "id": str(uuid4()),
        "chunk_id": j,
        "text": chunk,
        "url": doc["url"]
    } for j, chunk in enumerate(chunks)])

In [11]:
print(chunks_and_metadata[0]["text"][-50:], "+", chunks_and_metadata[1]["text"][:50])

to get started, how-to guides, reference docs, and + get started, how-to guides, reference docs, and co
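
The output above shows the end of the first chunk repeated at the start of the second, which is the effect of `chunk_overlap=20`. Because `length_function` is `tokenize_and_count`, both `chunk_size` and `chunk_overlap` are measured in tokens. The following is a deliberately simplified sliding-window sketch of the overlap idea on a word list. It is not how `RecursiveCharacterTextSplitter` works internally, and `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(words, chunk_size, overlap):
    """Split words into windows of chunk_size that share `overlap` items."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once a window has reached the end of the input
        if start + chunk_size >= len(words):
            break
    return chunks


words = "a b c d e f g h i j".split()
chunks = sliding_chunks(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window starts `chunk_size - overlap` items after the previous one, so neighbouring chunks share their boundary item.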


In [12]:
len(chunks_and_metadata)

890

## Create Vectorstore

In [13]:
import openai
from getpass import getpass
import os

if os.getenv("OPENAI_API_KEY") is None:
    os.environ["OPENAI_API_KEY"] = getpass()
    openai.api_key = os.environ["OPENAI_API_KEY"]

embedding_model = "text-embedding-ada-002"

response = openai.Embedding.create(
    input=["this one example sentence.", "this is another example to be embedded."],
    engine=embedding_model
)

In [14]:
for example in response.data:
    print(len(example["embedding"]))

1536
1536


In [15]:
import pinecone

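if os.getenv("PINECONE_API_KEY") is None:
    os.environ["PINECONE_API_KEY"] = getpass()
# Prompt for the environment separately so a preset API key does not skip it
if os.getenv("PINECONE_ENVIRONMENT") is None:
    os.environ["PINECONE_ENVIRONMENT"] = input()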

In [16]:
pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment=os.environ["PINECONE_ENVIRONMENT"]
)

In [17]:
index_name = "gpt-4-langchain-docs"

if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=len(response.data[0]["embedding"]),
        metric="dotproduct"
    )

index = pinecone.GRPCIndex(index_name)
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 45}},
 'total_vector_count': 45}

In [18]:
batch_size = 100

for i in tqdm(range(0, len(chunks_and_metadata), batch_size)):
    batch = chunks_and_metadata[i:i+batch_size]

    batch_ids = [example["id"] for example in batch]
    batch_texts = [example["text"] for example in batch]

    response = openai.Embedding.create(input=batch_texts, engine=embedding_model)
    embeddings = [example["embedding"] for example in response.data]

    batch_cleaned = [{
        "text": example["text"],
        "chunk_id": example["chunk_id"],
        "url": example["url"]
    } for example in batch]

    to_upsert = list(zip(batch_ids, embeddings, batch_cleaned))

    index.upsert(vectors=to_upsert)

  0%|          | 0/9 [00:00<?, ?it/s]

In [19]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 935}},
 'total_vector_count': 935}
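
The index held 45 vectors before the upsert and 935 afterwards, that is, the 45 existing vectors plus the 890 new chunks. Since `uuid4()` produces a fresh ID on every run, re-running the notebook would add another full set of vectors rather than overwrite the old ones. One possible way around this, sketched below under the assumption that a URL plus chunk number identifies a chunk uniquely, is to derive the ID deterministically. `stable_chunk_id` is a hypothetical helper, not part of the notebook.

```python
from uuid import NAMESPACE_URL, uuid5


def stable_chunk_id(url: str, chunk_id: int) -> str:
    """Derive a repeatable ID from the page URL and the chunk position."""
    # uuid5 hashes its input, so the same (url, chunk_id) always gives the same ID
    return str(uuid5(NAMESPACE_URL, f"{url}#chunk-{chunk_id}"))


page = "https://langchain.readthedocs.io/en/latest/model_laboratory.html"
first = stable_chunk_id(page, 0)
again = stable_chunk_id(page, 0)
second = stable_chunk_id(page, 1)
print(first == again, first == second)
```

With repeatable IDs, upserting the same chunk again replaces the existing vector instead of creating a duplicate.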

## Retrieval

In [20]:
query = "How do I use the LLMChain in LangChain?"

response = openai.Embedding.create(
    input=[query],
    engine=embedding_model
)

query_embedding = response.data[0]["embedding"]
retrieved_info = index.query(query_embedding, top_k=5, include_metadata=True)
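
The index was created with `metric="dotproduct"`, so each match is ranked by the dot product between the query embedding and a stored vector. The brute-force sketch below illustrates that ranking on toy two-dimensional vectors. It is only an illustration: a vector database does not necessarily scan every vector like this, and `dot`, `top_k`, and the toy data are made up for the example.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec, vectors, k):
    """Return the k (score, id) pairs with the highest dot product."""
    scored = ((dot(query_vec, vec), vec_id) for vec_id, vec in vectors.items())
    return heapq.nlargest(k, scored)


toy_index = {"a": [1.0, 0.0], "b": [0.5, 0.5], "c": [0.0, 1.0]}
results = top_k([1.0, 0.2], toy_index, k=2)
print([vec_id for _, vec_id in results])
```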

In [21]:
retrieved_info

{'matches': [{'id': '75951fb9-582d-421a-9456-91761be3542e',
              'metadata': {'chunk_id': 0.0,
                           'text': '.rst .pdf Chains Chains# Note Conceptual '
                                   'Guide Using an LLM in isolation is fine '
                                   'for some simple applications, but many '
                                   'more complex ones require chaining LLMs - '
                                   'either with each other or with other '
                                   'experts. LangChain provides a standard '
                                   'interface for Chains, as well as some '
                                   'common implementations of chains for ease '
                                   'of use. The following sections of '
                                   'documentation are provided: Getting '
                                   'Started: A getting started guide for '
                                   'chains, to get yo

In [22]:
context = [match["metadata"]["text"] for match in retrieved_info["matches"]]
context

['.rst .pdf Chains Chains# Note Conceptual Guide Using an LLM in isolation is fine for some simple applications, but many more complex ones require chaining LLMs - either with each other or with other experts. LangChain provides a standard interface for Chains, as well as some common implementations of chains for ease of use. The following sections of documentation are provided: Getting Started: A getting started guide for chains, to get you up and running quickly. How-To Guides: A collection of how-to guides. These highlight how to use various types of chains. Reference: API reference documentation for all Chain classes. previous Redis Chat Message History next Getting Started By Harrison Chase © Copyright 2023, Harrison Chase. Last updated on Mar 29, 2023.',
 'about how to do cool things with Chains, check out the how-to guide for chains. previous Chains next How-To Guides Contents Why do we need chains? Query an LLM with the LLMChain Combine chains with the SequentialChain Create a 

In [24]:
augmented_query = "\n\n---\n\n".join(context) + "\n\n---\n\n" + query
print(augmented_query)

.rst .pdf Chains Chains# Note Conceptual Guide Using an LLM in isolation is fine for some simple applications, but many more complex ones require chaining LLMs - either with each other or with other experts. LangChain provides a standard interface for Chains, as well as some common implementations of chains for ease of use. The following sections of documentation are provided: Getting Started: A getting started guide for chains, to get you up and running quickly. How-To Guides: A collection of how-to guides. These highlight how to use various types of chains. Reference: API reference documentation for all Chain classes. previous Redis Chat Message History next Getting Started By Harrison Chase © Copyright 2023, Harrison Chase. Last updated on Mar 29, 2023.

---

about how to do cool things with Chains, check out the how-to guide for chains. previous Chains next How-To Guides Contents Why do we need chains? Query an LLM with the LLMChain Combine chains with the SequentialChain Create a 
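
Joining every retrieved passage works here, but with a larger `top_k` or longer chunks the prompt could outgrow the model's context window. A minimal sketch of one way to cap it follows, keeping passages in ranked order until a budget is used up. The `build_augmented_query` helper and its `max_chars` budget are assumptions for illustration; a character count is only a rough stand-in for a token count.

```python
def build_augmented_query(context, query, max_chars=8000, sep="\n\n---\n\n"):
    """Join ranked passages and the query, dropping passages past the budget."""
    kept = []
    used = len(query)
    for passage in context:
        cost = len(passage) + len(sep)
        # Passages are ranked, so stop at the first one that does not fit
        if used + cost > max_chars:
            break
        kept.append(passage)
        used += cost
    return sep.join(kept + [query])


passages = ["a" * 10, "b" * 10]
print(build_augmented_query(passages, "q", max_chars=100))
print(build_augmented_query(passages, "q", max_chars=20))
```

When every passage fits, the result has the same layout as the `augmented_query` built above.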

In [26]:
primer = """
You are Q&A bot. A highly intelligent system that answers
user questions based on the information provided by the user above
each question. If the information can not be found in the information
provided by the user you truthfully say "I don't know".
""".strip()

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": primer},
        {"role": "user", "content": augmented_query}
    ]
)

In [33]:
from IPython.display import Markdown

display(Markdown(response['choices'][0]['message']['content']))

To use the LLMChain in LangChain, follow these steps:

1. Import the necessary modules and classes.

```python
from langchain.llms import OpenAI
from langchain.chains import LLMChain
```

2. Initialize the LLM wrapper with the desired arguments. In this example, we'll use the OpenAI language model with a temperature of 0.9.

```python
llm = OpenAI(temperature=0.9)
```

3. Create an instance of the `LLMChain` class and provide the LLM wrapper object.

```python
llm_chain = LLMChain(llm)
```

4. Now, you can call the `llm_chain` with the user input text to get the LLM's response.

```python
text = "What would be a good company name for a company that makes colorful socks?"
response = llm_chain(text)
print(response)
```

The `LLMChain` takes in the user input, formats it with the given prompt template (if any), and returns the response from the LLM.

In [34]:
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": primer},
        {"role": "user", "content": query}
    ]
)

In [35]:
display(Markdown(response['choices'][0]['message']['content']))

I don't know.