# Loading and preparing data

The basic flow for retrieval-augmented generation (RAG) is as follows:

1. Load documents from a source.
2. Split the documents into chunks that are small enough to fit into an LLM's context window and focused enough that unrelated text does not distract the model.
3. Embed the chunks and store them in a vector store so they can be retrieved later based on input queries.
4. Retrieve the previously split chunks that are most relevant to a query.
5. Generate a final output with the retrieved chunks as context.

![](./static/images/rag_diagram.png)
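
To make the five steps concrete, here is a toy sketch of the whole flow in plain TypeScript. Everything in it is a hypothetical stand-in: the documents are hard-coded, `embed` counts words instead of calling an embedding model, and `buildPrompt` only assembles the text that would be sent to an LLM. It is meant to show how the steps connect, not how LangChain implements them.

```typescript
// Toy RAG flow. All names and data here are hypothetical stand-ins.
type Chunk = { text: string; vector: Map<string, number> };

// Step 1: "load" documents (hard-coded here in place of a real loader).
function loadDocs(): string[] {
  return [
    "Machine learning is the study of algorithms that improve with data.",
    "The course has no exams. Grades come from homework and a final project.",
  ];
}

// Step 2: split each document into sentence-sized chunks.
function splitIntoChunks(doc: string): string[] {
  return (doc.match(/[^.]+\./g) ?? []).map((sentence) => sentence.trim());
}

// Step 3: "embed" text as word counts (a stand-in for a learned embedding).
function embed(text: string): Map<string, number> {
  const vector = new Map<string, number>();
  for (const word of text.toLowerCase().match(/[a-z]+/g) ?? []) {
    vector.set(word, (vector.get(word) ?? 0) + 1);
  }
  return vector;
}

function cosineSimilarity(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  a.forEach((count, word) => {
    dot += count * (b.get(word) ?? 0);
  });
  const norm = (v: Map<string, number>): number => {
    let sum = 0;
    v.forEach((count) => {
      sum += count * count;
    });
    return Math.sqrt(sum);
  };
  const denominator = norm(a) * norm(b);
  return denominator === 0 ? 0 : dot / denominator;
}

// Step 4: retrieve the k chunks most similar to the query.
function retrieve(store: Chunk[], query: string, k: number): Chunk[] {
  const queryVector = embed(query);
  return store
    .map((chunk) => ({ chunk, score: cosineSimilarity(chunk.vector, queryVector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}

// Step 5: build the prompt that would be passed to an LLM.
function buildPrompt(context: Chunk[], question: string): string {
  const contextText = context.map((chunk) => chunk.text).join("\n");
  return `Answer using only this context:\n${contextText}\n\nQuestion: ${question}`;
}

const store: Chunk[] = [];
for (const doc of loadDocs()) {
  for (const text of splitIntoChunks(doc)) {
    store.push({ text, vector: embed(text) });
  }
}

const question = "How are grades determined?";
const topChunks = retrieve(store, question, 1);
const ragPrompt = buildPrompt(topChunks, question);
console.log(ragPrompt);
```

Only the chunk about grades shares a word with the question, so it is the one that ends up in the prompt. The rest of this lesson covers steps 1 and 2 with real loaders and splitters.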

## Loading

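Let's start with an example that loads files from a GitHub repository. Each file comes back as a `Document` with `pageContent` and `metadata` fields: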

In [1]:
import "npm:dotenv/config";

[Module: null prototype] { default: {} }

In [4]:
import { GithubRepoLoader } from "npm:langchain@0.0.178/document_loaders/web/github";
// Peer dependency, used to support .gitignore syntax
import ignore from "npm:ignore";

const loader = new GithubRepoLoader(
  "https://github.com/langchain-ai/langchainjs",
  { recursive: false, ignorePaths: ["*.md", "yarn.lock"] }
);

const docs = await loader.load();
console.log({ docs });

{
  docs: [
    Document {
      pageContent: "# top-most EditorConfig file\n" +
        "root = true\n" +
        "\n" +
        "# Unix-style newlines with a newline ending every file\n" +
        "[*]"... 17 more characters,
      metadata: {
        source: ".editorconfig",
        repository: "https://github.com/langchain-ai/langchainjs",
        branch: "main"
      }
    },
    Document {
      pageContent: "* text=auto eol=lf",
      metadata: {
        source: ".gitattributes",
        repository: "https://github.com/langchain-ai/langchainjs",
        branch: "main"
      }
    },
    Document {
      pageContent: "node_modules/\n" +
        "dist/\n" +
        "dist-cjs/\n" +
        "lib/\n" +
        ".turbo\n" +
        ".eslintcache\n" +
        ".env\n" +
        ".env.local\n" +
        "yarn-error.log\n" +
        "docs/_dist/\n" +
        "\n" +
        "."... 340 more characters,
      metadata: {
        source: ".gitignore",
        repository: "https://github.com/
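
Each entry in the output above has the same shape: the file's text in `pageContent` and details about its origin in `metadata`. The sketch below mimics that shape with an in-memory set of files. The `RepoDoc` interface, the `toDocs` helper, the file contents, and the `isIgnored` matcher are all hypothetical; in particular, `isIgnored` handles only `*.ext` patterns and exact file names, whereas the real loader relies on the `ignore` package for full .gitignore syntax.

```typescript
// Hypothetical shape modelled on the loader output shown above.
interface RepoDoc {
  pageContent: string;
  metadata: { source: string; repository: string; branch: string };
}

// Simplified matcher: supports only "*.ext" and exact file names,
// not the full .gitignore syntax.
function isIgnored(path: string, patterns: string[]): boolean {
  const fileName = path.split("/").pop() ?? path;
  return patterns.some((pattern) =>
    pattern.startsWith("*.") ? fileName.endsWith(pattern.slice(1)) : fileName === pattern
  );
}

// Turn a map of path -> file text into documents, skipping ignored paths.
function toDocs(
  files: Record<string, string>,
  repository: string,
  branch: string,
  ignorePaths: string[],
): RepoDoc[] {
  return Object.keys(files)
    .filter((path) => !isIgnored(path, ignorePaths))
    .map((path) => ({
      pageContent: files[path],
      metadata: { source: path, repository, branch },
    }));
}

// Made-up file contents, for illustration only.
const files: Record<string, string> = {
  ".editorconfig": "root = true",
  "README.md": "# Example readme",
  "yarn.lock": "# example lockfile",
};

const repoDocs = toDocs(files, "https://github.com/langchain-ai/langchainjs", "main", [
  "*.md",
  "yarn.lock",
]);
console.log(repoDocs.map((doc) => doc.metadata.source)); // [ ".editorconfig" ]
```

Keeping the origin in `metadata` pays off later: a retrieved chunk can be traced back to the file it came from.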

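We can also load from a PDF. As the output below shows, `PDFLoader` returns one `Document` per page and records the page number in `metadata.loc.pageNumber`: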

In [1]:
// Peer dependency
import * as parse from "npm:pdf-parse";
import { PDFLoader } from "npm:langchain@0.0.178/document_loaders/fs/pdf";

In [3]:
const loader = new PDFLoader("./static/docs/MachineLearning-Lecture01.pdf");

In [4]:
const rawCS229Docs = await loader.load();

console.log({ docs: rawCS229Docs });

{
  docs: [
    Document {
      pageContent: "MachineLearning-Lecture01  \n" +
        "Instructor (Andrew Ng):\n" +
        " Okay. Good morning. Welcome to CS229, the machi"... 3042 more characters,
      metadata: {
        source: "./static/docs/MachineLearning-Lecture01.pdf",
        pdf: {
          version: "1.10.100",
          info: [Object],
          metadata: [Metadata],
          totalPages: 22
        },
        loc: { pageNumber: 1 }
      }
    },
    Document {
      pageContent: "many biologers are there here? Wow, just a \n" +
        "few, not many. I'm surprised. Anyone from \n" +
        "statistics? O"... 1024 more characters,
      metadata: {
        source: "./static/docs/MachineLearning-Lecture01.pdf",
        pdf: {
          version: "1.10.100",
          info: [Object],
          metadata: [Metadata],
          totalPages: 22
        },
        loc: { pageNumber: 2 }
      }
    },
    Document {
      pageContent: "So in this class, we've tried to convey

## Splitting

Goal: keep each chunk focused enough to minimize distraction, while keeping semantically related ideas together in the same chunk.
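
One way to pursue that goal is to split on the coarsest boundary that works and fall back to finer ones only when a piece is still too large. The sketch below illustrates that idea in plain TypeScript. It is a simplified approximation, not LangChain's implementation: the `recursiveSplit` name and the separator list are assumptions, and it ignores chunk overlap entirely.

```typescript
// Simplified recursive splitting: try coarse separators first, and only
// fall back to finer ones for pieces that are still larger than chunkSize.
function recursiveSplit(
  text: string,
  chunkSize: number,
  separators: string[] = ["\n\n", "\n", " ", ""],
): string[] {
  if (text.length <= chunkSize || separators.length === 0) return [text];

  const [separator, ...finerSeparators] = separators;
  const chunks: string[] = [];
  let current = "";

  for (const piece of text.split(separator)) {
    const candidate = current === "" ? piece : current + separator + piece;
    if (candidate.length <= chunkSize) {
      // Related pieces stay together while they still fit.
      current = candidate;
      continue;
    }
    if (current !== "") chunks.push(current);
    if (piece.length <= chunkSize) {
      current = piece;
    } else {
      // This piece alone is too big: split it on the next finer separator.
      chunks.push(...recursiveSplit(piece, chunkSize, finerSeparators));
      current = "";
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

const sampleText = "alpha beta\ngamma delta\n\nepsilon zeta eta";
const recursiveChunks = recursiveSplit(sampleText, 12);
console.log(recursiveChunks);
// [ "alpha beta", "gamma delta", "epsilon zeta", "eta" ]
```

Paragraph breaks are tried first, then line breaks, then spaces, so text that belongs together is separated only when the size limit forces it.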


In [12]:
import { RecursiveCharacterTextSplitter } from "npm:langchain@0.0.178/text_splitter";

const splitter = RecursiveCharacterTextSplitter.fromLanguage("js", {
  chunkSize: 64,
  chunkOverlap: 0,
});

const code = `function helloWorld() {
  console.log("Hello, World!");
  console.log("Hello, DLAI!");
  const someVar = 1 + 1;
  }
  // Call the function
  helloWorld();`;
  
await splitter.splitText(code);

[
  'function helloWorld() {\n  console.log("Hello, World!");',
  'console.log("Hello, DLAI!");\n  const someVar = 1 + 1;\n  }',
  "// Call the function\n  helloWorld();"
]

In [13]:
import { CharacterTextSplitter } from "npm:langchain@0.0.178/text_splitter";

const naiveSplitter = new CharacterTextSplitter({
  chunkSize: 10,
  chunkOverlap: 0,
});

const code = `function helloWorld() {
  console.log("Hello, World!");
  }
  // Call the function
  helloWorld();`;
  
await naiveSplitter.splitText(code);

[
  [32m"function helloWorld() {\n"[39m +
    [32m'  console.log("Hello, World!");\n'[39m +
    [32m"  }\n"[39m +
    [32m"  // Call the function\n"[39m +
    [32m"  helloWorld();"[39m
]
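
Notice that the naive splitter returned a single chunk far longer than the requested `chunkSize` of 10. The likely explanation is that `CharacterTextSplitter` splits on one separator only, which to our knowledge defaults to a blank line (`"\n\n"`), and the sample code contains none. The sketch below reproduces that behavior under those assumptions; `splitOnSeparator` is a hypothetical helper, not the library's code.

```typescript
// Split on a single separator and merge pieces up to chunkSize.
// A piece that is already too large is emitted as-is, because there is
// no finer separator to fall back to.
function splitOnSeparator(text: string, separator: string, chunkSize: number): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const piece of text.split(separator)) {
    const candidate = current === "" ? piece : current + separator + piece;
    if (current === "" || candidate.length <= chunkSize) {
      current = candidate;
    } else {
      chunks.push(current);
      current = piece;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

const code = [
  "function helloWorld() {",
  '  console.log("Hello, World!");',
  "  }",
  "  // Call the function",
  "  helloWorld();",
].join("\n");

// No blank line in the input, so nothing is split.
const onBlankLines = splitOnSeparator(code, "\n\n", 10);
console.log(onBlankLines.length); // 1

// Splitting on single newlines yields one chunk per line here.
const onNewlines = splitOnSeparator(code, "\n", 10);
console.log(onNewlines.length); // 5
```

This is why a splitter that understands the structure of its input, such as the language-aware one above, tends to produce more useful chunks for code.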

In [14]:
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 512,
  chunkOverlap: 64,
});
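
The `chunkOverlap` setting makes neighboring chunks share some text, so a sentence that straddles a boundary is less likely to lose its context. The sketch below shows the idea with a fixed-size sliding window over characters. It is a simplification: `slidingWindowChunks` is a hypothetical helper, and the real splitter builds its overlap from the pieces it has split on rather than from raw character windows.

```typescript
// Fixed-size sliding window: each chunk starts (chunkSize - chunkOverlap)
// characters after the previous one, so neighbors share chunkOverlap characters.
function slidingWindowChunks(text: string, chunkSize: number, chunkOverlap: number): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("need chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a chunk has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

const overlappingChunks = slidingWindowChunks("abcdefghijklmnopqrstuvwxyz", 10, 4);
console.log(overlappingChunks);
// [ "abcdefghij", "ghijklmnop", "mnopqrstuv", "stuvwxyz" ]
```

Each chunk repeats the last four characters of the one before it. With `chunkSize: 512` and `chunkOverlap: 64`, roughly an eighth of each chunk is shared with its neighbor, which trades some storage for better continuity.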

In [18]:
// Peer dependency
import * as parse from "npm:pdf-parse";
import { PDFLoader } from "npm:langchain@0.0.178/document_loaders/fs/pdf";

const loader = new PDFLoader("./static/docs/MachineLearning-Lecture01.pdf");

const rawCS229Docs = await loader.load();

const splitDocs = await splitter.splitDocuments(rawCS229Docs);

console.log({ splitDocs })

{
  splitDocs: [
    Document {
      pageContent: "MachineLearning-Lecture01  \n" +
        "Instructor (Andrew Ng):\n" +
        " Okay. Good morning. Welcome to CS229, the machi"... 404 more characters,
      metadata: {
        source: "./static/docs/MachineLearning-Lecture01.pdf",
        pdf: {
          version: "1.10.100",
          info: [Object],
          metadata: [Metadata],
          totalPages: 22
        },
        loc: { pageNumber: 1, lines: [Object] }
      }
    },
    Document {
      pageContent: "I actually think that machine learning is th\n" +
        "e most exciting field of all the computer \n" +
        "sciences. So"... 399 more characters,
      metadata: {
        source: "./static/docs/MachineLearning-Lecture01.pdf",
        pdf: {
          version: "1.10.100",
          info: [Object],
          metadata: [Metadata],
          totalPages: 22
        },
        loc: { pageNumber: 1, lines: [Object] }
      }
    },
    Document {
      pageContent: "re