Now that we have the prompts loaded and the LLMs answering our questions, we dive into mixing the LLMs with our own content. This is useful because LLMs usually have a knowledge cutoff date, and because we don't want private information to leak into public LLMs. An important technique for this is called *RAG: Retrieval Augmented Generation*.
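
To make the idea concrete before we bring in any library, here is a minimal sketch of the two RAG steps: retrieve the most relevant piece of our own content, then add it to the prompt. The word-overlap scoring, the `DOCUMENTS` list and the helper names `tokenize`, `retrieve` and `build_prompt` are assumptions made for illustration; real setups typically use embeddings and a vector store instead.

```python
import re

# Hypothetical private notes standing in for "our own content".
DOCUMENTS = [
    "The first devopsdays was held in Ghent, Belgium in 2009.",
    "Each devopsdays event is run by volunteers from the local area.",
    "Topics often include automation, testing, security, and organizational culture.",
]


def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, k=1):
    """Return the k documents sharing the most words with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, documents, k=1):
    """Augment the question with retrieved context before it goes to an LLM."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


prompt = build_prompt("Where was the first devopsdays held?", DOCUMENTS)
print(prompt)
```

The resulting prompt is what we would send to the LLM, so the model can answer from our content rather than from its training data alone.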

LangChain has a set of document loaders to load and parse different formats.

In [1]:
%pip install langchain

Note: you may need to restart the kernel to use updated packages.


A first loader simply loads plain text. We will first load a Markdown file as text to show the difference. The generic text loader loads the file as is and does not try to interpret the content.

In [2]:
# Load a doc as text file
from langchain.document_loaders import TextLoader

loader = TextLoader("./data/history.md")
loader.load()

[Document(page_content='# A history lesson on Devops\n\n## Devopsdays\n\nDevopsdays is a worldwide series of technical conferences covering topics of software development, IT infrastructure operations, and the intersection between them. Each event is run by volunteers from the local area.\n\nMost devopsdays events feature a combination of curated talks (see open Calls for Proposals) and self organized open space content. Topics often include automation, testing, security, and organizational culture.\n\n### History\nThe first devopsdays was held in Ghent, Belgium in 2009. Since then, devopsdays events have multiplied, and if there isn’t one in your city, check out the information about organizing one yourself!\n\n### About the organization\nThe devopsdays global core team guides local organizers in hosting their own devopsdays events worldwide. Active core organizers onboard and guide events, answer questions, and maintain the website. Advisory core organizers are less involved day-to-d
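
The output above is a list of documents, each holding the raw text in `page_content`. As a rough sketch of what "loading as is" means, the following standard-library code mimics that shape. `SimpleDocument` and `load_text` are hypothetical names for this illustration, not LangChain classes, and the sample file is written to a temporary directory so the example runs on its own.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: raw text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path):
    """Read the file as-is, with no attempt to interpret its content."""
    text = Path(path).read_text(encoding="utf-8")
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Write a small Markdown file to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "history.md"
    sample.write_text("# A history lesson on Devops\n\n## Devopsdays\n", encoding="utf-8")
    docs = load_text(sample)

print(repr(docs[0].page_content))
```

Note that the `#` heading markers are still part of the text, just like in the `TextLoader` output.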

We can now do the same with a Markdown loader. For that we install a few extra packages.

In [3]:
%pip install unstructured markdown

Collecting unstructured
  Obtaining dependency information for unstructured from https://files.pythonhosted.org/packages/c2/18/aa070da1ba14a07f546bfe6f2a1e64144c344afd058a315069f46098fca6/unstructured-0.10.12-py3-none-any.whl.metadata
  Downloading unstructured-0.10.12-py3-none-any.whl.metadata (23 kB)
Collecting markdown
  Obtaining dependency information for markdown from https://files.pythonhosted.org/packages/1a/b5/228c1cdcfe138f1a8e01ab1b54284c8b83735476cb22b6ba251656ed13ad/Markdown-3.4.4-py3-none-any.whl.metadata
  Downloading Markdown-3.4.4-py3-none-any.whl.metadata (6.9 kB)
Collecting chardet (from unstructured)
  Obtaining dependency information for chardet from https://files.pythonhosted.org/packages/38/6f/f5fbc992a329ee4e0f288c1fe0e2ad9485ed064cac731ed2fe47dcc38cbf/chardet-5.2.0-py3-none-any.whl.metadata
  Downloading chardet-5.2.0-py3-none-any.whl.metadata (3.4 kB)
Collecting filetype (from unstructured)
  Downloading filetype-1.2.0-py2.py3-none-any.whl (19 kB)
Collecting p

The difference is that it 'understood' the file and stripped it of any Markdown-specific elements.

In [4]:
# Load the same file, this time parsing it as Markdown
from langchain.document_loaders import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader("./data/history.md")
loader.load()

[nltk_data] Downloading package punkt to /Volumes/home-
[nltk_data]     ssd/patrick/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Volumes/home-ssd/patrick/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[Document(page_content='A history lesson on Devops\n\nDevopsdays\n\nDevopsdays is a worldwide series of technical conferences covering topics of software development, IT infrastructure operations, and the intersection between them. Each event is run by volunteers from the local area.\n\nMost devopsdays events feature a combination of curated talks (see open Calls for Proposals) and self organized open space content. Topics often include automation, testing, security, and organizational culture.\n\nHistory\n\nThe first devopsdays was held in Ghent, Belgium in 2009. Since then, devopsdays events have multiplied, and if there isn’t one in your city, check out the information about organizing one yourself!\n\nAbout the organization\n\nThe devopsdays global core team guides local organizers in hosting their own devopsdays events worldwide. Active core organizers onboard and guide events, answer questions, and maintain the website. Advisory core organizers are less involved day-to-day but we
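
To get a feel for what stripping Markdown-specific elements involves, here is a rough sketch using a few regular expressions. It is not how the `unstructured` package works internally; the function `strip_markdown` and the handful of patterns it covers are assumptions for illustration, and it does not reproduce the loader's exact whitespace.

```python
import re


def strip_markdown(text):
    """Remove a few common Markdown markers, keeping the readable text."""
    lines = []
    for line in text.splitlines():
        line = re.sub(r"^\s{0,3}#{1,6}\s+", "", line)         # heading markers
        line = re.sub(r"^\s*[-*+]\s+", "", line)              # bullet markers
        line = re.sub(r"(\*\*|__)(.+?)\1", r"\2", line)       # bold text
        line = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", line)  # links -> link text
        lines.append(line)
    return "\n".join(lines)


raw = (
    "# A history lesson on Devops\n\n## Devopsdays\n\n"
    "### History\nThe first devopsdays was held in Ghent."
)
print(strip_markdown(raw))
```

The headings lose their `#` prefixes while the sentences stay untouched, which is the same kind of difference we see between the two loader outputs.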

We can similarly load whole directories:

In [5]:
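from langchain.document_loaders import DirectoryLoader

# Load every file below ./ that matches the glob pattern
loader = DirectoryLoader(
    "./",
    glob="data/**/*.*",
    use_multithreading=True,  # load files in parallel
    max_concurrency=4,  # use at most 4 worker threads
    show_progress=True,  # display a progress bar while loading
)
loader.load()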


  0%|          | 0/1 [00:00<?, ?it/s]

100%|██████████| 1/1 [00:00<00:00, 109.22it/s]


[Document(page_content='A history lesson on Devops\n\nDevopsdays\n\nDevopsdays is a worldwide series of technical conferences covering topics of software development, IT infrastructure operations, and the intersection between them. Each event is run by volunteers from the local area.\n\nMost devopsdays events feature a combination of curated talks (see open Calls for Proposals) and self organized open space content. Topics often include automation, testing, security, and organizational culture.\n\nHistory\n\nThe first devopsdays was held in Ghent, Belgium in 2009. Since then, devopsdays events have multiplied, and if there isn’t one in your city, check out the information about organizing one yourself!\n\nAbout the organization\n\nThe devopsdays global core team guides local organizers in hosting their own devopsdays events worldwide. Active core organizers onboard and guide events, answer questions, and maintain the website. Advisory core organizers are less involved day-to-day but we
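
If it is unclear which files a pattern such as `data/**/*.*` selects, we can try it out with the standard library. This sketch assumes the loader matches files the way `pathlib` globbing does; the helper `matching_files` and the directory tree are made up for the example.

```python
import tempfile
from pathlib import Path


def matching_files(root, pattern):
    """Return the files under root that match a pathlib-style glob pattern."""
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.glob(pattern)
        if p.is_file()
    )


# Build a small, hypothetical directory tree to try the pattern on.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["data/history.md", "data/talks/2009.txt", "data/README", "other/skip.md"]:
        path = Path(tmp) / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("placeholder", encoding="utf-8")
    found = matching_files(tmp, "data/**/*.*")

print(found)
```

Only files below `data` whose name contains a dot are selected, at any depth; `data/README` and everything under `other` are left out.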

But it does not stop with files: we can load web pages and many other documents from various sources, from Jira tickets to Datadog log files.

In [6]:
# https://python.langchain.com/docs/integrations/document_loaders/web_base
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://jedi.be")

data = loader.load()
print(data)


[Document(page_content='\n\n\n\n\n  Personal website of Patrick Debois – JEDI - Just Enough Documented Information Blog\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n            [Home]\n            [Blog]\n            [Contact]\n         -\n            [Talks]\n            [Bio]\n            [Customers]\n          \n\n\n\n\n\n\n\n\n\nSubscribe via email\n\n\n\n\n\nPersonal website of Patrick Debois\n\n\nA warm welcome to you !\n\nDuring 15 years of consultancy, I’ve assumed different roles within large enterprises ranging from developer, network specialist, system administrator, tester and project manager.  And because I’ve lived and experienced each role, I can talk to both manager, developer and IT people. Each in their own language.\n\nThis allows me to break past silo-based organizational boundaries resulting in smoother project delivery.\n\nI currently specialize in applying Agile techniques in infrastructure integration projects or what is sometimes called devops; agil
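
As the output shows, loading a web page boils down to fetching the HTML and keeping the visible text. The sketch below illustrates only that second step with the standard library's `html.parser`. It is not how `WebBaseLoader` is implemented; `TextExtractor`, `html_to_text` and the sample page are hypothetical, and the page is inlined so the example runs without network access.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Hypothetical page, so the example runs without network access.
page = """<html><head><title>Personal website</title>
<style>body { color: black; }</style></head>
<body><nav><a href="/">[Home]</a> <a href="/blog">[Blog]</a></nav>
<h1>A warm welcome to you !</h1><script>var x = 1;</script></body></html>"""
print(html_to_text(page))
```

Unlike the loader output above, this version drops the empty lines, which makes the text a little easier to read.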