# LangChain `DocumentLoaders`

[LangChain](https://python.langchain.com/docs/get_started/introduction.html) is a library that provides a kitchen sink of tools for LLMs, particularly for integrating LLMs with other tools.

One underrated feature of LangChain is [DocumentLoaders](https://integrations.langchain.com/), which let you acquire text data from a wide variety of sources. This is super useful even if you aren't using LLMs at all!  (You can also hijack these loaders to acquire data for fine-tuning!)

For example, if you are trying to get data from a website as text, here are some useful `DocumentLoaders`:

1. [RecursiveURLLoader](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/recursive_url_loader)
2. [SeleniumURLLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain.document_loaders.url_selenium.SeleniumURLLoader.html)
3. [SitemapLoader](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/sitemap): **this is explored below**.

I think it is useful to combine LangChain `DocumentLoaders` with [HuggingFace datasets](https://huggingface.co/docs/datasets/index), because doing so lets you save and version your data, and do other fun things like [semantic search of your data](https://huggingface.co/learn/nlp-course/chapter5/6?fw=tf) with FAISS.

As of this writing, there are [over 125 different kinds](https://integrations.langchain.com/) of `DocumentLoaders`.  So far, whenever I have needed to quickly acquire some data, there has been a loader for it.

## Sitemap Loader

[Sitemaps](https://en.wikipedia.org/wiki/Site_map) are a nice way to see a listing of all pages on a site.  This is useful for acquiring all of the text from a large site that might contain many pages.  Below, I use the [SitemapLoader](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/sitemap) to get all of the text from [https://quarto.org](https://quarto.org).
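
To make the idea concrete, here is a minimal, standard-library-only sketch of what a sitemap looks like and how its page URLs can be pulled out. This is not how `SitemapLoader` is implemented; the sample XML, the `example.com` URLs, and the `extract_urls` helper are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sitemap used only for illustration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs/guide.html</loc></url>
</urlset>
"""

# Sitemap files declare this XML namespace, so element lookups must include it.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml: str) -> list[str]:
    """Return the page URLs listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text]


urls = extract_urls(SAMPLE_SITEMAP)
print(urls)
```

A loader then only has to fetch each of these URLs and turn the HTML into text.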

:::{.callout-warning}
There is currently [a bug](https://github.com/hwchase17/langchain/issues/6521) in langchain, so I had to downgrade to a version from right before [this commit](https://github.com/hwchase17/langchain/pull/6107), which broke the `SitemapLoader`.  I installed `v0.0.202` via `pip install langchain==0.0.202`.
:::

In [None]:
import nest_asyncio
nest_asyncio.apply() # you don't need this line outside notebooks
from langchain.document_loaders.sitemap import SitemapLoader
sitemap_loader = SitemapLoader(web_path="https://quarto.org/sitemap.xml")
sitemap_loader.requests_per_second = 4
docs = sitemap_loader.load()

Fetching pages: 100%|####################################| 269/269 [00:16<00:00, 16.21it/s]


In [None]:
print(f'There are {len(docs)} pages')

There are 269 pages


Let's look at the content of one page:

In [None]:
example = docs[0]
example.dict()

{'page_content': '\n\n\n\n\nQuarto - About Quarto\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nOverview\n\n\n\nGet Started\n\n\n\nGuide\n\n\n\nExtensions\n\n\n\nReference\n\n\n\nGallery\n\n\n\nBlog\n\n\n\nHelp\n\n\n\n\n\nReport a Bug\n\n\n\n\nAsk a Question\n\n\n\n\nFAQ\n\n\n\n\n \n\n\n\n\n\n \n\n\n\n\n\n\n\n\nOn this page\n\nGoals\nProject\nContribute\n\nEdit this pageReport an issue\n\n\n\n\n\nAbout Quarto\nOpen source tools for scientific and technical publishing\n\n\n\n\n\nGoals\nThe overarching goal of Quarto is to make the process of creating and collaborating on scientific and technical documents dramatically better. We hope to do this in several dimensions:\n\nCreate a writing and publishing environment with great integrated tools for technical content. We want to make authoring with embedded code, equations, figures, complex diagrams, interactive widgets, citations, cross references, and the myriad other special req

## Clean the data

When we look at this page, we can see a bunch of unwanted text: the navbar and the sidenav are showing up.  We can pass a custom parsing function to the loader to strip them out:

In [None]:
from bs4 import BeautifulSoup


def remove_nav_and_header_elements(content: BeautifulSoup) -> str:
    # Drop navigation, header, footer, and <head> markup so only the page body remains.
    exclude = content.find_all(["nav", "footer", "header", "head"])
    for element in exclude:
        element.decompose()

    return str(content.get_text()).strip()

In [None]:
sitemap_loader = SitemapLoader(web_path="https://quarto.org/sitemap.xml",
                              parsing_function=remove_nav_and_header_elements)
sitemap_loader.requests_per_second = 4
docs = sitemap_loader.load()

Fetching pages: 100%|####################################| 269/269 [00:05<00:00, 52.00it/s]


In [None]:
example = docs[0]
example.dict()

{'page_content': 'Goals\nThe overarching goal of Quarto is to make the process of creating and collaborating on scientific and technical documents dramatically better. We hope to do this in several dimensions:\n\nCreate a writing and publishing environment with great integrated tools for technical content. We want to make authoring with embedded code, equations, figures, complex diagrams, interactive widgets, citations, cross references, and the myriad other special requirements of scientific discourse straightforward and productive for everyone.\nHelp authors take full advantage of the web as a connected, interactive platform for communications, while still providing the ability to create excellent printed output from the same document source. Researchers shouldn’t need to choose between LaTeX, MS Word, and HTML but rather be able to author documents that target all of them at the same time.\nMake reproducible research and publications the norm rather than the exception. Reproducibili
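
If you want to see the filtering idea in isolation, here is a rough sketch that uses only Python's built-in `html.parser` instead of BeautifulSoup. It is a swapped-in technique rather than what the loader does: the `MainTextExtractor` class and the sample HTML are hypothetical, and it simply ignores any text nested inside the same four tags.

```python
from html.parser import HTMLParser

# Tags whose contents are dropped, mirroring the list in the parsing function above.
SKIP_TAGS = {"nav", "footer", "header", "head"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any of SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        return "".join(self._chunks).strip()


# Hypothetical page, hand-written for illustration.
SAMPLE_HTML = (
    "<html><head><title>Site - About</title></head><body>"
    "<header><nav>Overview Guide Blog</nav></header>"
    "<main><h1>Goals</h1>\n<p>Make publishing better.</p></main>"
    "<footer>Edit this page</footer>"
    "</body></html>"
)

parser = MainTextExtractor()
parser.feed(SAMPLE_HTML)
parser.close()  # flush any buffered text
cleaned = parser.text()
print(cleaned)
```

The navbar, page title, and footer text are gone, and only the heading and paragraph from the main content remain.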

# Create a HF Dataset

We can use the `from_list` method to load the sitemap data into a HuggingFace `Dataset`.

In [None]:
from datasets import Dataset 
repo_name = 'hamel/quarto'

In [None]:
quarto_data = Dataset.from_list([d.dict() for d in docs])

In [None]:
quarto_data

Dataset({
    features: ['page_content', 'metadata'],
    num_rows: 269
})
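
The two features correspond to the keys of each document's dictionary. As a plain-Python illustration of that row-to-column pivot (my understanding of what `from_list` does conceptually, not its actual implementation), here is a sketch with hypothetical records and a hypothetical `rows_to_columns` helper:

```python
# Hypothetical records shaped like the output of `d.dict()` for each loaded page.
records = [
    {"page_content": "Goals ...", "metadata": {"source": "https://example.com/about.html"}},
    {"page_content": "Guide ...", "metadata": {"source": "https://example.com/guide.html"}},
]


def rows_to_columns(rows: list[dict]) -> dict[str, list]:
    """Pivot a list of row dicts into a dict of column lists, keyed by the first row's fields."""
    if not rows:
        return {}
    return {key: [row.get(key) for row in rows] for key in rows[0]}


columns = rows_to_columns(records)
print(list(columns))                 # column names
print(len(columns["page_content"]))  # number of rows
```

Each key becomes a column and each document becomes a row, which is why every page ends up with a `page_content` and a `metadata` entry.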

In [None]:
quarto_data.push_to_hub(repo_name)

Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

Updating downloaded metadata with the new split.


In [None]:
#|echo: false
from IPython.display import Markdown
Markdown(f'This data is available at [https://huggingface.co/datasets/{repo_name}](https://huggingface.co/datasets/{repo_name})')

This data is available at [https://huggingface.co/datasets/hamel/quarto](https://huggingface.co/datasets/hamel/quarto)

## Download the data

You can download the data from the HuggingFace Hub like this:

In [None]:
from datasets import load_dataset
remote_data = load_dataset(repo_name)

Downloading:   0%|          | 0.00/1.17k [00:00<?, ?B/s]

Using custom data configuration hamel--quarto-b88699e31e28f953


Downloading and preparing dataset None/None (download: Unknown size, generated: 1.81 MiB, post-processed: Unknown size, total: 1.81 MiB) to /Users/hamel/.cache/huggingface/datasets/hamel___parquet/hamel--quarto-b88699e31e28f953/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/735k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

Dataset parquet downloaded and prepared to /Users/hamel/.cache/huggingface/datasets/hamel___parquet/hamel--quarto-b88699e31e28f953/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
remote_data['train'][0]

{'page_content': 'Goals\nThe overarching goal of Quarto is to make the process of creating and collaborating on scientific and technical documents dramatically better. We hope to do this in several dimensions:\n\nCreate a writing and publishing environment with great integrated tools for technical content. We want to make authoring with embedded code, equations, figures, complex diagrams, interactive widgets, citations, cross references, and the myriad other special requirements of scientific discourse straightforward and productive for everyone.\nHelp authors take full advantage of the web as a connected, interactive platform for communications, while still providing the ability to create excellent printed output from the same document source. Researchers shouldn’t need to choose between LaTeX, MS Word, and HTML but rather be able to author documents that target all of them at the same time.\nMake reproducible research and publications the norm rather than the exception. Reproducibili

# GitHub Issues

We can use the [GitHubIssuesLoader](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/github) to get all of the issues from a GitHub repo.

In [None]:
from langchain.document_loaders import GitHubIssuesLoader

This assumes you have set `GITHUB_PERSONAL_ACCESS_TOKEN` as an environment variable.
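
If you want to fail fast with a readable message when the variable is missing, a small check like the following can help. This is only a sketch: `require_env` is a hypothetical helper, not part of LangChain, and the token value below is a stand-in so the example runs without real credentials.

```python
import os


def require_env(name: str, environ=os.environ) -> str:
    """Return an environment variable's value, or raise a readable error if it is unset or empty."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before loading issues.")
    return value


# Demonstrated with a stand-in mapping so the example runs without a real token.
fake_environ = {"GITHUB_PERSONAL_ACCESS_TOKEN": "ghp_example"}
token = require_env("GITHUB_PERSONAL_ACCESS_TOKEN", fake_environ)
```

In real use you would call `require_env("GITHUB_PERSONAL_ACCESS_TOKEN")` and let it read from `os.environ`.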

In [None]:
loader = GitHubIssuesLoader(
    repo="quarto-dev/quarto-cli",
    state='all',  # get both open and closed issues
    include_prs=False,
)
quarto_issues = loader.load()

In [None]:
len(quarto_issues)

2841

Wow, that's a lot of issues!  Let's take a look at one.

Note that the issue below doesn't include comments.  We would have to get those separately with further API calls, but this is a good start!

In [None]:
quarto_issues[0]
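
For the missing comments, one option is GitHub's REST endpoint for listing the comments on an issue. The sketch below is an assumption about how you might call it with the standard library; `comments_url` and `fetch_comments` are hypothetical helpers, not part of LangChain, and `fetch_comments` only reads the first page of results.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def comments_url(repo: str, issue_number: int) -> str:
    """Build the REST endpoint that lists comments for one issue."""
    return f"{API_ROOT}/repos/{repo}/issues/{issue_number}/comments"


def fetch_comments(repo: str, issue_number: int, token: str) -> list[str]:
    """Return comment bodies for one issue (first page only; makes a network call)."""
    request = urllib.request.Request(
        comments_url(repo, issue_number),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return [comment["body"] for comment in json.load(response)]


# Building the URL needs no network access; fetching does.
url = comments_url("quarto-dev/quarto-cli", 6113)
print(url)
```

Each loaded issue carries its `number` in `metadata`, so you could call `fetch_comments("quarto-dev/quarto-cli", issue.metadata["number"], token)` for the issues you care about, keeping GitHub's rate limits in mind.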



### Upload to the Hub

We can upload these issues to the Hub, where they will be available at [https://huggingface.co/datasets/hamel/quarto-issues](https://huggingface.co/datasets/hamel/quarto-issues), like so:

In [None]:
#|output: false
ds = Dataset.from_list([x.dict() for x in quarto_issues])
ds.push_to_hub('hamel/quarto-issues')

Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
from datasets import load_dataset
remote_data = load_dataset('hamel/quarto-issues')

Downloading:   0%|          | 0.00/2.58k [00:00<?, ?B/s]

Using custom data configuration hamel--quarto-issues-52921768ee5c97fb


Downloading and preparing dataset None/None (download: 1.99 MiB, generated: 4.78 MiB, post-processed: Unknown size, total: 6.77 MiB) to /Users/hamel/.cache/huggingface/datasets/hamel___parquet/hamel--quarto-issues-52921768ee5c97fb/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/2.08M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

Dataset parquet downloaded and prepared to /Users/hamel/.cache/huggingface/datasets/hamel___parquet/hamel--quarto-issues-52921768ee5c97fb/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
remote_data['train'][0]

 'metadata': {'assignee': None,
  'comments': 1,
  'created_at': '2023-07-05T21:11:31Z',
  'creator': 'joaoaugustofrei',
  'is_pull_request': False,
  'labels': ['bug'],
  'locked': False,
  'milestone': None,
  'number': 6113,
  'state': 'open',
  'title': 'Quarto not recognizing python packages',
  'url': 'https://github.com/quarto-dev/quarto-cli/issues/6113'}}