<img src="https://fsdl.me/logo-720-dark-horizontal">

This notebook accompanies [this YouTube video](https://www.youtube.com/watch?v=zaYTXQFR0_s),
which walks through what LangChain is and includes an interview with its creator, Harrison Chase.

## Auth

In [None]:
%pip install -qqq langchain openai
%pip install -qqq python-dotenv requests nltk
%pip install -qqq beautifulsoup4
%pip install -qqq unstructured
%pip install -qqq tiktoken
%pip install -qqq faiss-cpu

In [1]:
import os
from dotenv import load_dotenv

# load OPENAI_API_KEY (and any other secrets) from a local .env file, if present
load_dotenv()

# set up for tracing
os.environ["LANGCHAIN_HANDLER"] = "langchain"

## LLMs without context are internet simulators, which aren't always useful

In [3]:
# string-in, string-out LLM wrapper (the cells below call llm(prompt) directly)
from langchain.llms import OpenAI

llm = OpenAI(model_name="gpt-4", temperature=0)

In [4]:
import textwrap

print("\n".join(textwrap.wrap(llm("What is FreeMoCap? Does it have real-time capability? Who made it?").strip())))

FreeMoCap is an open-source motion capture system that uses low-cost
hardware and computer vision algorithms to track human motion. It is
designed to be accessible and affordable for researchers, artists, and
hobbyists. The system can provide real-time motion capture data,
depending on the hardware and software setup used.  FreeMoCap was
created by a team of researchers and developers led by Dr. Talmo
Pereira, a postdoctoral researcher at Princeton Neuroscience
Institute. The project is hosted on GitHub and is continuously being
updated and improved by its contributors.


### Scrape the docs into text

In [8]:
import requests

toplevel = "https://freemocap.readthedocs.io/en/latest"

response = requests.get(toplevel)
response

<Response [200]>

In [9]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')

print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width,initial-scale=1" name="viewport"/>
  <link href="about_us/" rel="next"/>
  <link href="assets/skelly_freemocap_favicon.ico" rel="icon"/>
  <meta content="mkdocs-1.4.2, mkdocs-material-9.1.5" name="generator"/>
  <title>
   FreeMoCap Documentation
  </title>
  <link href="assets/stylesheets/main.7a7fce14.min.css" rel="stylesheet"/>
  <link href="assets/stylesheets/palette.a0c5b2b5.min.css" rel="stylesheet"/>
  <link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/>
  <link href="https://fonts.googleapis.com/css?family=Roboto:300,300i,400,400i,700,700i%7CRoboto+Mono:400,400i,700,700i&amp;display=fallback" rel="stylesheet"/>
  <style>
   :root{--md-text-font:"Roboto";--md-code-font:"Roboto Mono"}
  </style>
  <link href="stylesheets/extra.css" rel="stylesheet"/>
  <link href="/_/static/css/badge_only.css" rel="stylesheet"/>
  <link href="/_/static/css/readt

In [11]:
anchors_attrs = [anchor.attrs for anchor in soup.find_all('a')]
anchors_attrs

[{'href': '#welcome-skele-friend', 'class': ['md-skip']},
 {'href': '.',
  'title': 'FreeMoCap Documentation',
  'class': ['md-header__button', 'md-logo'],
  'aria-label': 'FreeMoCap Documentation',
  'data-md-component': 'logo'},
 {'href': '.', 'class': ['md-tabs__link', 'md-tabs__link--active']},
 {'href': 'about_us/', 'class': ['md-tabs__link']},
 {'href': 'how_to_guides/', 'class': ['md-tabs__link']},
 {'href': 'terminology/terminology/', 'class': ['md-tabs__link']},
 {'href': 'roadmap/roadmap/', 'class': ['md-tabs__link']},
 {'href': 'privacy_policy/', 'class': ['md-tabs__link']},
 {'href': '.',
  'title': 'FreeMoCap Documentation',
  'class': ['md-nav__button', 'md-logo'],
  'aria-label': 'FreeMoCap Documentation',
  'data-md-component': 'logo'},
 {'href': '.', 'class': ['md-nav__link', 'md-nav__link--active']},
 {'href': '#helpful-links', 'class': ['md-nav__link']},
 {'href': '#troubleshooting', 'class': ['md-nav__link']},
 {'href': 'about_us/', 'class': ['md-nav__link']},
 {'hr

In [12]:
paths = []

for anchor_attrs in anchors_attrs:
    try:
        classes = anchor_attrs["class"]
        link = anchor_attrs["href"]
        if "reference" in classes:
            if "internal" in classes:
                paths.append(link)
            elif "external" in classes:
                if link.startswith("./"):
                    paths.append(link[len("./"):])
                else:
                    pass # not a link to docs
            else:
                pass # i didn't understand that reference
        else:
            pass # not a reference
    except KeyError:
        print("no classes or no href:", anchor_attrs)

no classes or no href: {'href': 'https://diataxis.fr/'}
no classes or no href: {'href': 'how_to_guides/'}
no classes or no href: {'href': 'https://youtu.be/GxKmyKdnTy0'}
no classes or no href: {'href': 'https://freemocap.org'}
no classes or no href: {'href': 'https://github.com/freemocap/freemocap'}
no classes or no href: {'href': 'https://freemocap.org/about-us.html#donate'}
no classes or no href: {'href': 'https://github.com/freemocap/freemocap/issues'}
no classes or no href: {'href': 'https://github.com/freemocap/documentation/issues'}
no classes or no href: {'href': 'https://discord.gg/P2nyraRYjb'}
no classes or no href: {'href': 'https://squidfunk.github.io/mkdocs-material/', 'target': '_blank', 'rel': ['noopener']}


In [13]:
paths = ["index.html"] + paths
print(paths)

['index.html']
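
Only `index.html` was collected because the loop above looks for `reference`, `internal` and `external` classes, while the anchors printed earlier carry classes such as `md-nav__link` and `md-tabs__link` (the page's generator tag names mkdocs-material). A minimal sketch of a filter keyed on those classes follows. The function name `collect_doc_paths`, the `NAV_CLASSES` set, and the rule that relative, non-fragment hrefs are doc pages are all assumptions for illustration, not part of the original notebook.

```python
# Classes seen on navigation anchors in the output above (assumed to mark doc pages).
NAV_CLASSES = {"md-nav__link", "md-tabs__link"}


def collect_doc_paths(anchors_attrs):
    """Return relative doc paths from a list of anchor attribute dicts."""
    paths = []
    for attrs in anchors_attrs:
        href = attrs.get("href", "")
        classes = set(attrs.get("class", []))
        if not classes & NAV_CLASSES:
            continue  # not a navigation link
        if href in ("", ".") or href.startswith(("#", "http://", "https://")):
            continue  # same page, in-page fragment, or external site
        if href not in paths:
            paths.append(href)  # keep first occurrence, preserve order
    return paths


# A few entries copied from the anchors_attrs output above.
sample = [
    {"href": "#welcome-skele-friend", "class": ["md-skip"]},
    {"href": ".", "class": ["md-tabs__link", "md-tabs__link--active"]},
    {"href": "about_us/", "class": ["md-tabs__link"]},
    {"href": "how_to_guides/", "class": ["md-tabs__link"]},
    {"href": "#helpful-links", "class": ["md-nav__link"]},
    {"href": "about_us/", "class": ["md-nav__link"]},
]

print(collect_doc_paths(sample))
```

Using `.get` with defaults also avoids the `KeyError` branch for anchors that lack a `class` attribute.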


In [16]:
%%time
import requests

pages = []

for path in paths:
    url = "/".join([toplevel, path])
    try:
        resp = requests.get(url)
        resp.raise_for_status()
    except requests.RequestException:
        # report the failing URL and skip it rather than storing a bad response
        print(url)
    else:
        pages.append({"content": resp.content, "url": url})

CPU times: total: 0 ns
Wall time: 51.8 ms


In [18]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jonma\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\jonma\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [19]:
%%time
from unstructured.partition.html import partition_html

parsed_docs = [partition_html(text=page["content"]) for page in pages]

CPU times: total: 31.2 ms
Wall time: 70.3 ms


In [20]:
texts = []
for doc in parsed_docs:
    texts.append("\n\n".join(
        [str(el).strip() for el in doc]).strip().replace("\\n", ""))

In [22]:
print(*textwrap.wrap(texts[0]), sep="\n")

Welcome Skele-Friend! \xf0\x9f\x92\x80\xe2\x9c\xa8\xc2\xb6  This is
the official and most up-to-date place to find documentation for
FreeMoCap. We\'re slowly building a Knowledge Base that roughly
follows the \'diataxis framework\'. Our documentation is very much a
work in progress, so we appreciate your patience, support, and
engagement!  If you\'re looking for a quick start, head on over to our
"How to" Guides page!  We are very close to out v0.1.0 release, and
there will be a new round of tutorials/walk-throughs/etc released
around then.    In the mean time, check out this (rough) video which
provides a broad overview of some of the topics relevant to camera-
based markerless motion capture ( HINT - Look at the video chapters
for specific topics.)  Helpful Links\xc2\xb6  The FreeMoCap Website
https://freemocap.org  The FreeMoCap GitHub
https://github.com/freemocap/freemocap  Support FreeMoCap by donating
to our non-profit that supports our work!  Troubleshooting?\xc2\xb6
If you run 
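
The `\xf0\x9f\x92\x80` sequences and the `\'` escapes in this output look like the printed representation of a `bytes` object, which is what `resp.content` holds. The sketch below only illustrates the difference between stringifying bytes and decoding them. The sample string is a stand-in, and whether the page is UTF-8 is an assumption.

```python
# Stand-in for resp.content: the raw bytes of a UTF-8 encoded page fragment.
raw = "Welcome Skele-Friend! 💀✨".encode("utf-8")

as_repr = str(raw)             # bytes treated as text: escapes like \xf0\x9f...
as_text = raw.decode("utf-8")  # bytes decoded into a readable string

print(as_repr)
print(as_text)
```

In this notebook that would likely mean handing decoded text (for example `resp.text`, or `page["content"].decode("utf-8")`) to `partition_html`. That change is untested here and would alter the outputs recorded below.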

In [23]:
for page, text in zip(pages, texts):
    page["text"] = text

In [24]:
pages[0].keys()

dict_keys(['content', 'url', 'text'])

#### Low-key alert: this belongs in a DB
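
As a rough idea of what that could look like, here is a minimal sketch that stores page records in SQLite using only the standard library. The table name, schema and sample record are hypothetical, and an in-memory database stands in for a real file.

```python
import sqlite3

# Hypothetical stand-in for the notebook's `pages` list (url + extracted text).
sample_pages = [{"url": "https://example.com/index.html", "text": "Welcome!"}]

conn = sqlite3.connect(":memory:")  # use a file path to persist across runs
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, text TEXT NOT NULL)")

# Named placeholders pull values from each dict; re-inserting a URL overwrites it.
conn.executemany(
    "INSERT OR REPLACE INTO pages (url, text) VALUES (:url, :text)", sample_pages
)
conn.commit()

rows = conn.execute("SELECT url, length(text) FROM pages").fetchall()
print(rows)
```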

### Chunk the text for use inside LLM prompts

In [25]:
from langchain.text_splitter import CharacterTextSplitter

In [26]:
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1024, chunk_overlap=128, separator=" ")
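
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch. It counts whitespace-separated words rather than tiktoken tokens and is not LangChain's actual splitting algorithm. The function name `chunk_words` is made up for this illustration.

```python
def chunk_words(text, chunk_size, chunk_overlap):
    """Split text into word windows of chunk_size that share chunk_overlap words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split(" ")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


chunks = chunk_words("a b c d e f g h i j", chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk.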

In [27]:
documents = text_splitter.create_documents(
    [page["text"] for page in pages], metadatas=[{"source": page["url"]} for page in pages])

documents[0].dict()


{'page_content': 'Welcome Skele-Friend! \\xf0\\x9f\\x92\\x80\\xe2\\x9c\\xa8\\xc2\\xb6\n\nThis is the official and most up-to-date place to find documentation for FreeMoCap. We\\\'re slowly building a Knowledge Base that roughly follows the \\\'diataxis framework\\\'. Our documentation is very much a work in progress, so we appreciate your patience, support, and engagement!\n\nIf you\\\'re looking for a quick start, head on over to our "How to" Guides page!\n\nWe are very close to out v0.1.0 release, and there will be a new round of tutorials/walk-throughs/etc released around then.  \n\nIn the mean time, check out this (rough) video which  provides a broad overview of some of the topics relevant to camera-based markerless motion capture ( HINT - Look at the video chapters for specific topics.)\n\nHelpful Links\\xc2\\xb6\n\nThe FreeMoCap Website https://freemocap.org\n\nThe FreeMoCap GitHub https://github.com/freemocap/freemocap\n\nSupport FreeMoCap by donating to our non-profit that sup

In [28]:
print(documents[0].metadata["source"], *textwrap.wrap(documents[0].page_content), sep="\n")

https://freemocap.readthedocs.io/en/latest/index.html
Welcome Skele-Friend! \xf0\x9f\x92\x80\xe2\x9c\xa8\xc2\xb6  This is
the official and most up-to-date place to find documentation for
FreeMoCap. We\'re slowly building a Knowledge Base that roughly
follows the \'diataxis framework\'. Our documentation is very much a
work in progress, so we appreciate your patience, support, and
engagement!  If you\'re looking for a quick start, head on over to our
"How to" Guides page!  We are very close to out v0.1.0 release, and
there will be a new round of tutorials/walk-throughs/etc released
around then.    In the mean time, check out this (rough) video which
provides a broad overview of some of the topics relevant to camera-
based markerless motion capture ( HINT - Look at the video chapters
for specific topics.)  Helpful Links\xc2\xb6  The FreeMoCap Website
https://freemocap.org  The FreeMoCap GitHub
https://github.com/freemocap/freemocap  Support FreeMoCap by donating
to our non-profit that su

### Enable search over text chunks

#### Here, using embeddings and vector search

In [29]:
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

In [30]:
from langchain.vectorstores import FAISS

docsearch = FAISS.from_documents(documents, embeddings)
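
Conceptually, the index maps each chunk to a vector and returns the chunks whose vectors are closest to the query's vector. The toy sketch below swaps in word-count vectors and a brute-force cosine ranking in place of OpenAI embeddings and FAISS, so it only illustrates the idea. Every name in it is hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a sparse word-count vector (stand-in for a learned embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def similarity_search(query, chunks, k=2):
    """Rank chunks by similarity to the query and return the top k."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


toy_chunks = [
    "freemocap is a markerless motion capture system",
    "install the package with pip",
    "calibrate cameras with a charuco board",
]

top = similarity_search("what is a charuco board", toy_chunks, k=1)
print(top)
```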

### Ask questions and get answers

In [31]:
from langchain.chains.qa_with_sources import load_qa_with_sources_chain


chain = load_qa_with_sources_chain(llm, chain_type="stuff")
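
The `stuff` chain type places all retrieved documents into a single prompt. The sketch below shows that idea in plain Python. The prompt wording and the helper name `build_stuff_prompt` are illustrative assumptions, not LangChain's actual template.

```python
def build_stuff_prompt(question, docs):
    """Concatenate every document, tagged with its source, into one prompt."""
    extracts = [
        f"Content: {doc['page_content']}\nSource: {doc['source']}" for doc in docs
    ]
    context = "\n\n".join(extracts)
    return (
        "Answer the question using only the extracts below, "
        "then list the sources you used after 'SOURCES:'.\n\n"
        f"{context}\n\n"
        f"QUESTION: {question}\n"
        "FINAL ANSWER:"
    )


# Hypothetical documents standing in for the retrieved chunks.
toy_docs = [
    {"page_content": "FreeMoCap is a markerless motion capture system.",
     "source": "https://example.com/index.html"},
    {"page_content": "Calibration uses a printed board.",
     "source": "https://example.com/calibration.html"},
]

prompt = build_stuff_prompt("What is FreeMoCap?", toy_docs)
print(prompt)
```

Because everything lands in one prompt, the combined size of the retrieved chunks has to fit in the model's context window, which is one reason the chunk size chosen earlier matters.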

In [32]:
query = "What is FreeMoCap? Does it have realtime capacity? Who made it? What is a charuco board?"
# query = "What is LangChainHub?"
# query = "Does LangChain integrate with OpenAI? If so, how?"

docs = docsearch.similarity_search(query)
result = chain({"input_documents": docs, "question": query})

text = "\n".join(textwrap.wrap(result["output_text"]))
text = "\n\nSOURCES:\n".join(map(lambda s: s.strip(), text.split("SOURCES:")))

print(text)

FreeMoCap is an open-source markerless motion capture system. It does
not mention realtime capacity in the provided content. The creators
can be found on the FreeMoCap GitHub page
(https://github.com/freemocap/freemocap). A charuco board is not
mentioned in the provided content.

SOURCES:
https://freemocap.readthedocs.io/en/latest/index.html


In [None]:
print(*textwrap.wrap(result["input_documents"][0].page_content), sep="\n")