# Langchain: LLM + Web Scraper

* https://github.com/leegonzales/LangChainExamples

## Setup

In [1]:
!pip install langchain requests openai transformers faiss-cpu

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting langchain
  Downloading langchain-0.0.92-py3-none-any.whl (288 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m288.8/288.8 KB[0m [31m5.0 MB/s[0m eta [36m0:00:00[0m
Collecting openai
  Downloading openai-0.26.5.tar.gz (55 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m55.5/55.5 KB[0m [31m2.9 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Installing backend dependencies ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.3/6.3 MB[0m [31m49.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting faiss-cpu
  Downloading faiss_cpu-1.7.3-cp38-cp38-manylinux_2_17_x86_

## Set OpenAI API Key

In [None]:
# Signup for an OpenAPI API Key at www.openai.com/api

In [2]:
import os
from getpass import getpass

OPENAI_API_KEY = getpass('Enter your OpenAI key: ')
# print(f'OPENAI_API_KEY is: {OPENAI_API_KEY}')
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY



Enter your OpenAI key: ··········


In [3]:
from langchain.llms import OpenAI
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores.faiss import FAISS
import requests

In [5]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse
from time import sleep
import pickle

def get_page_text(url, depth=2, visited_links=None, max_links=10, cache=None, timeout=3, user_agent=None):
    """
    Recursively follow links on a webpage and return a list of documents of subsequent found pages.
    :param url: The URL of the webpage to scrape
    :param depth: The number of levels deep to recursively follow links. Default is 2.
    :param visited_links: A dictionary or list of links that have already been visited to prevent revisiting links
    :param max_links: The maximum number of links to follow. Default is 50.
    :param cache: A cache of links and their corresponding documents to prevent unnecessary web requests
    :param timeout: Number of seconds to wait before timing out a request. Default is 5.
    :param user_agent: The User Agent string to use for requests. Default is None.
    """
    # Initialize the visited links set if not provided
    if visited_links is None:
        visited_links = {}
    if cache is None:
        cache = {}
    # Extract the root domain from the URL
    parsed_uri = urlparse(url)
    root_domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    # Check if the link has already been visited
    if url in visited_links:
        print("Hit in visited_links set: ", url)
        return None
    # Check if the link is in the cache
    if url in cache:
        print("Hit in cache set: ", url)
        return cache[url]
    # Check for relative paths, fragments, and mailto links
    if not parsed_uri.netloc:
        print("Invalid URL: ", url)
        return None
    visited_links[url] = True
    # Send a GET request to the URL and handle common errors
    try:
        headers = {'User-Agent': user_agent or 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36','Accept-Language': 'en-US,en;q=0.5'}
        print("Retrieving: ", url, headers)
        page = requests.get(url, headers=headers)
        page.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error retrieving the webpage {url}: {str(e)}")
        return None
    # parse the HTML and extract the text
    soup = BeautifulSoup(page.text, 'html.parser')
    text = soup.get_text()
    # Add the link and its corresponding document to the cache
    cache[url] = Document(text={text}, metadata={"source": url})
    with open('scrape_cache.pickle', 'wb') as handle:
        pickle.dump(cache, handle, protocol=pickle.HIGHEST_PROTOCOL)
    # Check if we have reached the maximum depth or maximum number of links to follow
    if depth <= 0 or max_links <= 0:
        return cache[url]
    # Follow links on the webpage
    links = []
    for link in soup.find_all('a'):
        href = link.get('href')
        # Only follow links that are on the same root domain
        if href and root_domain in href:
            links.append(href)
    # Follow the links recursively and space out the requests to avoid throttling
    for link in links:
        sleep(timeout)
        doc = get_page_text(link, depth-1, visited_links, max_links-1, cache, timeout, user_agent)
        if doc:
            cache[link] = doc
    return cache


## Set Website URLs

In [6]:
sources = [
    get_page_text("https://www.guildeducation.com/solutions/", depth=0),
    get_page_text("https://www.guildeducation.com/leadership/", depth=0),
    get_page_text("https://blog.guildeducation.com/", depth=1),
    get_page_text("https://www.guildeducation.com/terms/", depth=0),
]

Retrieving:  https://www.guildeducation.com/solutions/ {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36', 'Accept-Language': 'en-US,en;q=0.5'}


ValidationError: ignored

In [7]:
source_chunks = []
splitter = CharacterTextSplitter(separator=" ", chunk_size=1024, chunk_overlap=0)
for source in sources:
    for chunk in splitter.split_text(source.page_content):
        source_chunks.append(Document(page_content=chunk, metadata=source.metadata))

search_index = FAISS.from_documents(source_chunks, OpenAIEmbeddings())

NameError: ignored

In [None]:
chain = load_qa_with_sources_chain(OpenAI(temperature=0))

def print_answer(question):
    print(
        chain(
            {
                "input_documents": search_index.similarity_search(question, k=4),
                "question": question,
            },
            return_only_outputs=False,
        )["output_text"]
    )

In [None]:
print_answer("Who are all of the VPs at BetterUP?")

 The VPs at BetterUP are Tom Patterson (SVP, Corporate Development & Strategy), Brad McCracken (SVP, Worldwide Sales), Armen Berjikly (VP, Product), Preeya Voss (VP, Sales, North America), Duke Daehling (VP, Sales), Erik Darby (VP, Business Development), Shonna Waters, PhD (VP, Alliance Solutions), Karen Lai (VP, Field), Evelyn Kim (VP, Product Design), Dr. Christine Carter (VP, Learning Experience Design), Katie Coupe (VP, People), Allison Yost (VP, BetterUp Labs), Adam Lavezzo (VP, Revenue Operations), Cameran Hetrick (VP, Analytics), Meredith Speece (Director of Legal and Privacy), and Chanel Fanaberia (VP, Operations).
SOURCES: https://www.betterup.com/about-us/leadership-team?hsLang=en, https://www.betterup.com/en/about-us?hsLang=en


In [None]:
print_answer("How many VPs are men vs women? List the woman, list the men. Emit as a markdown table")

 There are 8 male VPs and 6 female VPs. The male VPs are Tom Patterson, Brad McCracken, Armen Berjikly, Duke Daehling, Erik Darby, Adam Lavezzo, Cameran Hetrick, and Alexi Robichaux. The female VPs are Preeya Voss, Shonna Waters, Karen Lai, Evelyn Kim, Dr. Christine Carter, Katie Coupe, Allison Yost, and Cindy Goodrich.

| Male VPs | Female VPs |
| -------- | ---------- |
| Tom Patterson | Preeya Voss |
| Brad McCracken | Shonna Waters |
| Armen Berjikly | Karen Lai |
| Duke Daehling | Evelyn Kim |
| Erik Darby | Dr. Christine Carter |
| Adam Lavezzo | Katie Coupe |
| Cameran Hetrick | Allison Yost |
| Alexi Robichaux | Cindy Goodrich |

SOURCES: https://www.betterup.com/about-us/leadership-team?hsLang=en


In [None]:
print_answer("What are all of the ways coaching can help people?")

 Coaching can help people by providing support, guidance, and accountability to help them reach their goals, build resilience, and develop skills to manage stress and anxiety.
SOURCES: https://www.betterup.com/blog/page/1


In [None]:
print_answer("What is Better UP?")

 BetterUp is a coaching and Care platform that helps organizations build a happier, healthier workforce that fuels business growth. It provides world-class coaching, AI technology, and behavioral science experts to deliver change at scale, improving individual resilience, adaptability, and effectiveness.
SOURCES: https://www.betterup.com/en/about-us?hsLang=en, https://www.betterup.com/about-us/careers
