# Loading and preparing data

In the previous lesson, we learned how to construct a chain with LCEL. With those fundamentals in place, we'll continue building a "chat with data" application by going over some techniques for storing our own documents so they can be retrieved later and used to ground the LLM's generation in our own context. This is generally called Retrieval Augmented Generation (or RAG).

The basic flow is as follows:

1. Load documents from a source.
2. Split the docs into chunks small enough to fit into an LLM's context window and focused enough that unrelated text doesn't distract the model.
3. Embed the chunks in a vectorstore to allow for later retrieval based on input queries.
4. Retrieve the relevant previously split chunks.
5. Generate a final output with the retrieved chunks as context.

![](./static/images/rag_diagram.png)
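As a rough illustration of how the five steps fit together, here is a toy sketch in plain JavaScript. Everything in it is an assumption made for illustration: the sample strings, the `toWords`, `retrieve`, and `buildPrompt` helpers, and especially the use of shared-word counting in place of real embeddings and a vectorstore. It is not how LangChain implements any of these steps.

```javascript
// Toy walk-through of the five steps. Word overlap stands in for embeddings,
// and the "generation" step only builds the prompt a real app would send to an LLM.

// 1. Load: pretend these strings came from a document loader.
const rawDocs = [
  "LCEL composes chains from runnables. Chains can stream their output.",
  "Vector stores index embedded chunks. Retrievers fetch chunks for a query.",
];

// 2. Split: break each doc into sentence-sized chunks.
const chunks = rawDocs.flatMap((doc) =>
  doc.split(". ").map((sentence) => sentence.replace(/\.$/, ""))
);

// 3. "Embed": represent each chunk as a set of lowercase words (hypothetical stand-in).
const toWords = (text) => new Set(text.toLowerCase().match(/[a-z]+/g) ?? []);
const index = chunks.map((chunk) => ({ chunk, words: toWords(chunk) }));

// 4. Retrieve: rank chunks by how many query words they share, keep the top k.
function retrieve(query, k = 1) {
  const queryWords = [...toWords(query)];
  return index
    .map(({ chunk, words }) => ({
      chunk,
      score: queryWords.filter((word) => words.has(word)).length,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(({ chunk }) => chunk);
}

// 5. Generate: assemble the prompt that would be passed to the model.
function buildPrompt(question) {
  const context = retrieve(question).join("\n");
  return `Answer using only this context:\n${context}\n\nQuestion: ${question}`;
}
```

Calling `retrieve("How do retrievers fetch chunks?")` returns the one chunk that shares the most words with the question, and `buildPrompt` places that chunk ahead of the question.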

We'll cover the first two steps in this lesson.

## Loading

We'll need some source documents to start. LangChain includes document loaders for many sources of data.

For example, we can load code from a GitHub repo, in this case LangChain.js:

In [1]:
import "dotenv/config";

[Module: null prototype] { default: {} }

In [2]:
import { GithubRepoLoader } from "langchain/document_loaders/web/github";
// Peer dependency, used to support .gitignore syntax
import ignore from "ignore";

// Will not include anything under "ignorePaths"
const loader = new GithubRepoLoader(
  "https://github.com/langchain-ai/langchainjs",
  { recursive: false, ignorePaths: ["*.md", "yarn.lock"] }
);
const docs = await loader.load();
console.log({ docs });

{
  docs: [
    Document {
      pageContent: "# top-most EditorConfig file\n" +
        "root = true\n" +
        "\n" +
        "# Unix-style newlines with a newline ending every file\n" +
        "[*]"... 17 more characters,
      metadata: {
        source: ".editorconfig",
        repository: "https://github.com/langchain-ai/langchainjs",
        branch: "main"
      }
    },
    Document {
      pageContent: "* text=auto eol=lf",
      metadata: {
        source: ".gitattributes",
        repository: "https://github.com/langchain-ai/langchainjs",
        branch: "main"
      }
    },
    Document {
      pageContent: "node_modules/\n" +
        "dist/\n" +
        "dist-cjs/\n" +
        "lib/\n" +
        ".turbo\n" +
        ".eslintcache\n" +
        ".env\n" +
        ".env.local\n" +
        "yarn-error.log\n" +
        "docs/_dist/\n" +
        "\n" +
        "."... 340 more characters,
      metadata: {
        source: ".gitignore",
        repository: "https://github.com/

We can see that the resulting documents contain fields for `pageContent` and `metadata` corresponding to top level files in the repo, not including `.md` or `yarn.lock` files.
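Because each loaded document is just an object with those two fields, ordinary array methods are enough to inspect the result. The sketch below uses hand-written objects that mirror the printed shape; the abbreviated contents and the `contentFor` helper are assumptions for illustration, not part of LangChain.

```javascript
// Plain objects mirroring the `pageContent` / `metadata` shape shown in the output.
// Contents are abbreviated placeholders, not the full file text.
const exampleDocs = [
  { pageContent: "root = true", metadata: { source: ".editorconfig", branch: "main" } },
  { pageContent: "* text=auto eol=lf", metadata: { source: ".gitattributes", branch: "main" } },
  { pageContent: "node_modules/", metadata: { source: ".gitignore", branch: "main" } },
];

// List which file each document came from.
const sources = exampleDocs.map((doc) => doc.metadata.source);

// Look up one document's text by its source path (hypothetical helper).
function contentFor(docs, source) {
  const match = docs.find((doc) => doc.metadata.source === source);
  return match ? match.pageContent : undefined;
}
```

Here `sources` holds the three file names in load order, and `contentFor(exampleDocs, "README.md")` returns `undefined` because no document with that source exists.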

We can also load from a PDF, such as [this transcript](https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf) of Andrew Ng's famous CS229 course on machine learning.

In [3]:
// Peer dependency
import * as parse from "pdf-parse";
import { PDFLoader } from "langchain/document_loaders/fs/pdf";

const loader = new PDFLoader("./static/docs/MachineLearning-Lecture01.pdf");

const rawCS229Docs = await loader.load();

rawCS229Docs

[
  Document {
    pageContent: "MachineLearning-Lecture01  \n" +
      "Instructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machin"... 2999 more characters,
    metadata: {
      source: "./static/docs/MachineLearning-Lecture01.pdf",
      pdf: {
        version: "1.10.100",
        info: {
          PDFFormatVersion: "1.4",
          IsAcroFormPresent: false,
          IsXFAPresent: false,
          Title: "",
          Author: "",
          Creator: "PScript5.dll Version 5.2.2",
          Producer: "Acrobat Distiller 8.1.0 (Windows)",
          CreationDate: "D:20080711112523-07'00'",
          ModDate: "D:20080711112523-07'00'"
        },
        metadata: Metadata { _metadata: [Object: null prototype] },
        totalPages: 22
      },
      loc: { pageNumber: 1 }
    }
  },
  Document {
    pageContent:

LangChain supports a wide range of other document loaders as well.

## Splitting

Because chunks are passed to the LLM after retrieval, our goal with splitting is to try to keep semantically related ideas together in the same chunk so that the LLM gets an entire self-contained idea.

There are many different strategies for splitting different types of data. For example, we can use code-specific delimiters, which split input docs along JavaScript function and class boundaries for easier querying along those lines:

In [4]:
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const splitter = RecursiveCharacterTextSplitter.fromLanguage("js", {
  chunkSize: 32,
  chunkOverlap: 0,
});
const code = `function helloWorld() {
console.log("Hello, World!");
}
// Call the function
helloWorld();`;

await splitter.splitText(code);

[
  "function helloWorld() {",
  'console.log("Hello, World!");\n}',
  "// Call the function",
  "helloWorld();"
]

Above, the splitter divides the function and function call along sensible boundaries smaller than the chunk size. If we had just split naively, for example using spaces as a separator, we might get chunks containing half of a `console.log()` statement, which makes the LLM's job more difficult when generating from retrieved chunks.

In [5]:
import { CharacterTextSplitter } from "langchain/text_splitter";

const splitter = new CharacterTextSplitter({
  chunkSize: 32,
  chunkOverlap: 0,
  separator: " "
});
const code = `function helloWorld() {
console.log("Hello, World!");
}
// Call the function
helloWorld();`;

await splitter.splitText(code);

[
  "function helloWorld()",
  '{\nconsole.log("Hello,',
  'World!");\n}\n// Call the',
  "function\nhelloWorld();"
]

You'll notice I set two properties above: `chunkSize` and `chunkOverlap`. Chunk size is the maximum size of the generated chunks, while chunk overlap is the maximum amount that neighboring chunks can overlap with each other. Overlap can be useful for preserving continuity between fragments. Here's what happens if we tweak the above:

In [6]:
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const splitter = RecursiveCharacterTextSplitter.fromLanguage("js", {
  chunkSize: 64,
  chunkOverlap: 32,
});
const code = `function helloWorld() {
console.log("Hello, World!");
}
// Call the function
helloWorld();`;

await splitter.splitText(code);

[
  'function helloWorld() {\nconsole.log("Hello, World!");\n}',
  'console.log("Hello, World!");\n}\n// Call the function',
  "}\n// Call the function\nhelloWorld();"
]

You can see we now get three larger chunks, and neighboring chunks share some text at their edges.
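To make the two parameters concrete, here is a minimal sketch of a plain sliding character window. Note the assumption: `slidingWindow` is a hypothetical helper that cuts at fixed offsets, whereas the LangChain splitters above respect separators, so their chunks will not match this output. It only illustrates how size and overlap interact.

```javascript
// Fixed-size character windows: each new window starts
// (chunkSize - chunkOverlap) characters after the previous one.
function slidingWindow(text, chunkSize, chunkOverlap) {
  if (chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be smaller than chunkSize");
  }
  const step = chunkSize - chunkOverlap;
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

const withOverlap = slidingWindow("abcdefghij", 4, 2);
// → ["abcd", "cdef", "efgh", "ghij"]
const withoutOverlap = slidingWindow("abcdefghij", 4, 0);
// → ["abcd", "efgh", "ij"]
```

With an overlap of 2, every chunk repeats the last two characters of the one before it, which is the continuity the overlap setting is meant to buy at the cost of storing more chunks.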

LangChain includes several different options for different types of content, including Markdown and HTML. For generic written text, the `RecursiveCharacterTextSplitter` splits on paragraph boundaries where it can, which is a good starting point since paragraphs are a natural boundary for people to split up their thoughts and points.

In [7]:
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 512,
  chunkOverlap: 64,
});

This will produce chunks with a maximum size of 512 characters and up to 64 characters of overlap between neighboring chunks. There's no one-size-fits-all for these values - it will depend on your specific data source and the LLM you're using. Overlap can help keep semantically related ideas in the same chunk, but too much is inefficient because the same text gets stored and retrieved more than once.
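The "paragraphs first" idea can be sketched in a few lines. This is a simplified stand-in, not LangChain's implementation: the `recursiveSplit` helper is hypothetical, it ignores overlap entirely, and the separator list (blank line, newline, space, then raw characters) is an assumption chosen to show the fallback order.

```javascript
// Simplified recursive splitting: try the coarsest separator first and only
// fall back to finer ones for pieces that are still too long. No overlap handling.
function recursiveSplit(text, chunkSize, separators = ["\n\n", "\n", " ", ""]) {
  if (text.length <= chunkSize) return text.length > 0 ? [text] : [];
  const [separator, ...finer] = separators;
  // Last resort: cut the text into fixed-size slices.
  if (separator === "" || separator === undefined) {
    const slices = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      slices.push(text.slice(i, i + chunkSize));
    }
    return slices;
  }
  const chunks = [];
  let current = "";
  for (const piece of text.split(separator)) {
    if (piece.length === 0) continue; // skip empty pieces between repeated separators
    const candidate = current === "" ? piece : current + separator + piece;
    if (candidate.length <= chunkSize) {
      current = candidate; // keep merging small pieces into one chunk
      continue;
    }
    if (current !== "") chunks.push(current);
    current = "";
    if (piece.length <= chunkSize) {
      current = piece;
    } else {
      // This piece alone is too long, so split it on the next finer separator.
      chunks.push(...recursiveSplit(piece, chunkSize, finer));
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

const paragraphs = recursiveSplit(
  "First paragraph.\n\nSecond paragraph is a bit longer.\n\nThird.",
  40
);
// → ["First paragraph.", "Second paragraph is a bit longer.", "Third."]
```

Each paragraph fits within 40 characters, so the text is cut only at blank lines. A paragraph longer than the limit would be handed to the next separator in the list instead of being cut mid-word.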

Now, let's split our previously ingested raw docs from the CS229 lecture transcript into chunks:

In [8]:
const splitDocs = await splitter.splitDocuments(rawCS229Docs);

console.log(splitDocs);

[
  Document {
    pageContent: "MachineLearning-Lecture01  \n" +
      "Instructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machin"... 352 more characters,
    metadata: {
      source: "./static/docs/MachineLearning-Lecture01.pdf",
      pdf: {
        version: "1.10.100",
        info: {
          PDFFormatVersion: "1.4",
          IsAcroFormPresent: false,
          IsXFAPresent: false,
          Title: "",
          Author: "",
          Creator: "PScript5.dll Version 5.2.2",
          Producer: "Acrobat Distiller 8.1.0 (Windows)",
          CreationDate: "D:20080711112523-07'00'",
          ModDate: "D:20080711112523-07'00'"
        },
        metadata: Metadata { _metadata: [Object: null prototype] },
        totalPages: 22
      },
      loc: { pageNumber: 1, lines: { from: 1, to: 6 } }
    }
  },
  Document {
    pageContent: "I actually think that machine learning is the most exciting field of all the computer \n" +
      "sciences. So "... 333 more characters,


Looks good!

We won't dig in too deeply in this course, but documents can also have metadata, which may be automatically populated and can be useful for more advanced types of filtering and querying.
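As one small example of what that metadata makes possible, the sketch below keeps only the chunks from a given page range. The objects imitate the `metadata.loc` shape printed above, but their contents and line ranges are placeholders, and `chunksOnPages` is a hypothetical helper rather than a LangChain API.

```javascript
// Hypothetical chunks mirroring the `metadata.loc` shape from the split output.
const exampleChunks = [
  { pageContent: "chunk A", metadata: { loc: { pageNumber: 1, lines: { from: 1, to: 6 } } } },
  { pageContent: "chunk B", metadata: { loc: { pageNumber: 1, lines: { from: 6, to: 11 } } } },
  { pageContent: "chunk C", metadata: { loc: { pageNumber: 2, lines: { from: 1, to: 5 } } } },
];

// Keep only chunks whose page number falls inside an inclusive range.
function chunksOnPages(chunks, firstPage, lastPage) {
  return chunks.filter(({ metadata }) => {
    const page = metadata.loc.pageNumber;
    return page >= firstPage && page <= lastPage;
  });
}

const firstPageOnly = chunksOnPages(exampleChunks, 1, 1);
```

Here `firstPageOnly` contains "chunk A" and "chunk B". The same pattern works for any metadata field, such as `source`.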

In the next lesson, we will show how to embed and add these chunks to a vectorstore.