# PDFLoader

```{=mdx}

:::tip Compatibility

Only available on Node.js.

:::

```

This notebook provides a quick overview for getting started with the `PDFLoader` [document loader](/docs/concepts/document_loaders). For detailed documentation of all `PDFLoader` features and configurations, head to the [API reference](https://api.js.langchain.com/classes/langchain_community_document_loaders_fs_pdf.PDFLoader.html).

## Overview

### Integration details

| Class | Package | Compatibility | Local | PY support | 
| :--- | :--- | :---: | :---: |  :---: |
| [PDFLoader](https://api.js.langchain.com/classes/langchain_community_document_loaders_fs_pdf.PDFLoader.html) | [@langchain/community](https://api.js.langchain.com/modules/langchain_community_document_loaders_fs_pdf.html) | Node-only | ✅ | 🟠 (See note below) |

## Setup

To access the `PDFLoader` document loader, you'll need to install the `@langchain/community` integration package, along with the `pdf-parse` package.

### Installation

The LangChain PDFLoader integration lives in the `@langchain/community` package:

```{=mdx}
import IntegrationInstallTooltip from "@mdx_components/integration_install_tooltip.mdx";
import Npm2Yarn from "@theme/Npm2Yarn";

<IntegrationInstallTooltip></IntegrationInstallTooltip>

<Npm2Yarn>
  @langchain/community @langchain/core pdf-parse
</Npm2Yarn>

```

## Instantiation

Now we can instantiate our loader object and load documents:

In [10]:
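import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";

// Path to the sample PDF, relative to this notebook
const nike10kPdfPath = "../../../../data/nke-10k-2023.pdf";

// By default, the loader creates one document per page
const loader = new PDFLoader(nike10kPdfPath);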

## Load

In [4]:
const docs = await loader.load()
docs[0]

Document {
  pageContent: 'Table of Contents\n' +
    'UNITED STATES\n' +
    'SECURITIES AND EXCHANGE COMMISSION\n' +
    'Washington, D.C. 20549\n' +
    'FORM 10-K\n' +
    '(Mark One)\n' +
    '☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934\n' +
    'FOR THE FISCAL YEAR ENDED MAY 31, 2023\n' +
    'OR\n' +
    '☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934\n' +
    'FOR THE TRANSITION PERIOD FROM                         TO                         .\n' +
    'Commission File No. 1-10635\n' +
    'NIKE, Inc.\n' +
    '(Exact name of Registrant as specified in its charter)\n' +
    'Oregon93-0584541\n' +
    '(State or other jurisdiction of incorporation)(IRS Employer Identification No.)\n' +
    'One Bowerman Drive, Beaverton, Oregon 97005-6453\n' +
    '(Address of principal executive offices and zip code)\n' +
    '(503) 671-6453\n' +
    "(Registrant's telephone number, including area code)\n" +
 

In [5]:
console.log(docs[0].metadata)

{
  source: '../../../../data/nke-10k-2023.pdf',
  pdf: {
    version: '1.10.100',
    info: {
      PDFFormatVersion: '1.4',
      IsAcroFormPresent: false,
      IsXFAPresent: false,
      Title: '0000320187-23-000039',
      Author: 'EDGAR Online, a division of Donnelley Financial Solutions',
      Subject: 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31',
      Keywords: '0000320187-23-000039; ; 10-K',
      Creator: 'EDGAR Filing HTML Converter',
      Producer: 'EDGRpdf Service w/ EO.Pdf 22.0.40.0',
      CreationDate: "D:20230720162200-04'00'",
      ModDate: "D:20230720162208-04'00'"
    },
    metadata: null,
    totalPages: 107
  },
  loc: { pageNumber: 1 }
}
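
The metadata above carries both the page number (`loc.pageNumber`) and the page count (`pdf.totalPages`), which is handy when citing where a passage came from. The following is a minimal sketch of that idea. The `PageDoc` type and the `describePage` helper are hypothetical names introduced for illustration, and the types only mirror the fields printed above rather than the library's full definitions.

```typescript
// Minimal local types mirroring the metadata fields printed above.
// They are illustrative only; the real loader types may carry more fields.
interface PdfPageMetadata {
  source: string;
  pdf?: { totalPages?: number };
  loc?: { pageNumber?: number };
}

interface PageDoc {
  pageContent: string;
  metadata: PdfPageMetadata;
}

// Hypothetical helper: build a short, human-readable citation for a page.
// Falls back to "?" when a field is missing.
function describePage(doc: PageDoc): string {
  const page = doc.metadata.loc?.pageNumber ?? "?";
  const total = doc.metadata.pdf?.totalPages ?? "?";
  return `${doc.metadata.source} (page ${page} of ${total})`;
}

// Sample object shaped like `docs[0]` from the cell above.
const sample: PageDoc = {
  pageContent: "Table of Contents",
  metadata: {
    source: "../../../../data/nke-10k-2023.pdf",
    pdf: { totalPages: 107 },
    loc: { pageNumber: 1 },
  },
};

console.log(describePage(sample));
// ../../../../data/nke-10k-2023.pdf (page 1 of 107)
```

With real loader output, the same helper could be mapped over `docs`, assuming each document exposes the fields shown in the printed metadata.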


## Usage, one document per file

In [8]:
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";

const singleDocPerFileLoader = new PDFLoader(nike10kPdfPath, {
  splitPages: false,
});

const singleDoc = await singleDocPerFileLoader.load();
console.log(singleDoc[0].pageContent.slice(0, 100))

Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K



## Usage, custom `pdfjs` build

By default, the loader uses the `pdfjs` build bundled with `pdf-parse`, which is compatible with most environments, including Node.js and modern browsers. To use a more recent version or a custom build of `pdfjs-dist`, provide a custom `pdfjs` function that returns a promise resolving to the `PDFJS` object.

In the following example, we use the "legacy" build of `pdfjs-dist` (see the [pdfjs docs](https://github.com/mozilla/pdf.js/wiki/Frequently-Asked-Questions#which-browsersenvironments-are-supported)), which includes several polyfills not included in the default build.

```{=mdx}
<Npm2Yarn>
  pdfjs-dist
</Npm2Yarn>

```


In [None]:
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";

const customBuildLoader = new PDFLoader(nike10kPdfPath, {
  // you may need to add `.then(m => m.default)` to the end of the import
  // @lc-ts-ignore
  pdfjs: () => import("pdfjs-dist/legacy/build/pdf.js"),
});
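
The comment in the cell above notes that some setups need `.then(m => m.default)` on the import. One way to avoid guessing is to unwrap the default export only when it exists. The sketch below shows that pattern with stand-in module objects instead of a real `pdfjs-dist` import. `unwrapDefault` and `makePdfjsFactory` are hypothetical helpers, not part of LangChain or `pdfjs-dist`.

```typescript
// Hypothetical helper: return the module's default export when one exists,
// otherwise return the module object itself.
function unwrapDefault(mod: unknown): unknown {
  if (typeof mod === "object" && mod !== null && "default" in mod) {
    return (mod as { default: unknown }).default;
  }
  return mod;
}

// Hypothetical helper: wrap any dynamic import so it resolves to the
// unwrapped module, whichever shape the bundler produced.
function makePdfjsFactory(
  importer: () => Promise<unknown>
): () => Promise<unknown> {
  return async () => unwrapDefault(await importer());
}

// Stand-ins for the two module shapes; these are not real pdfjs modules.
const esmShaped = { default: { getDocument: () => "stub" } };
const cjsShaped = { getDocument: () => "stub" };

makePdfjsFactory(() => Promise.resolve(esmShaped))().then((pdfjs) => {
  console.log(pdfjs === esmShaped.default); // true
});
makePdfjsFactory(() => Promise.resolve(cjsShaped))().then((pdfjs) => {
  console.log(pdfjs === cjsShaped); // true
});
```

In a real project the importer would be `() => import("pdfjs-dist/legacy/build/pdf.js")`, and the result would likely need a type cast to satisfy the type of the `pdfjs` option.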

## Eliminating extra spaces

PDFs come in many varieties, which makes reading them a challenge. By default, the loader parses individual text elements and joins them with a space. If you are seeing excessive spaces, you can override the separator with an empty string:


In [12]:
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";

const noExtraSpacesLoader = new PDFLoader(nike10kPdfPath, {
  parsedItemSeparator: "",
});

const noExtraSpacesDocs = await noExtraSpacesLoader.load();
console.log(noExtraSpacesDocs[0].pageContent.slice(100, 250))

(Mark One)
☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
FOR THE FISCAL YEAR ENDED MAY 31, 2023
OR
☐ TRANSITI
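
To see why the separator matters, consider a PDF that stores a word as several small text items. The sketch below is a simplified illustration of the joining step, not the loader's actual implementation. The `joinItems` function and the `fragments` array are made up for the example.

```typescript
// Simplified stand-in for joining a page's text items with a separator.
// This is not the loader's actual code.
function joinItems(items: string[], separator: string): string {
  return items.join(separator);
}

// Hypothetical text items, as a PDF that splits words into fragments might yield.
const fragments = ["ANN", "UAL", " ", "REP", "ORT"];

// Joining with a space breaks words apart and widens the existing gap.
console.log(JSON.stringify(joinItems(fragments, " "))); // "ANN UAL   REP ORT"

// Joining with an empty string keeps only the spacing present in the items.
console.log(JSON.stringify(joinItems(fragments, ""))); // "ANNUAL REPORT"
```

Whether an empty separator helps depends on how a given PDF encodes its text, so it is worth comparing both outputs on a sample page.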


## Loading directories

In [17]:
import { DirectoryLoader } from "langchain/document_loaders/fs/directory";
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const exampleDataPath = "../../../../../../examples/src/document_loaders/example_data/";

/* Load all PDFs within the specified directory */
const directoryLoader = new DirectoryLoader(
  exampleDataPath,
  {
    ".pdf": (path: string) => new PDFLoader(path),
  }
);

const directoryDocs = await directoryLoader.load();

console.log(directoryDocs[0]);

/* Additional step: split the text into chunks with any TextSplitter. You can then use the chunks as context or save them to memory. */
const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

const splitDocs = await textSplitter.splitDocuments(directoryDocs);
console.log(splitDocs[0]);


Unknown file type: Star_Wars_The_Clone_Wars_S06E07_Crisis_at_the_Heart.srt
Unknown file type: example.txt
Unknown file type: notion.md
Unknown file type: bad_frontmatter.md
Unknown file type: frontmatter.md
Unknown file type: no_frontmatter.md
Unknown file type: no_metadata.md
Unknown file type: tags_and_frontmatter.md
Unknown file type: test.mp3


Document {
  pageContent: 'Bitcoin: A Peer-to-Peer Electronic Cash System\n' +
    'Satoshi Nakamoto\n' +
    'satoshin@gmx.com\n' +
    'www.bitcoin.org\n' +
    'Abstract.   A  purely   peer-to-peer   version   of   electronic   cash   would   allow   online \n' +
    'payments   to   be   sent   directly   from   one   party   to   another   without   going   through   a \n' +
    'financial institution.   Digital signatures provide part of the solution, but the main \n' +
    'benefits are lost if a trusted third party is still required to prevent double-spending. \n' +
    'We propose a solution to the double-spending problem using a peer-to-peer network. \n' +
    'The   network   timestamps   transactions   by   hashing   them   into   an   ongoing   chain   of \n' +
    'hash-based proof-of-work, forming a record that cannot be changed without redoing \n' +
    'the proof-of-work.   The longest chain not only serves as proof of the sequence of \n' +
    'events witnessed, but p
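
After loading a directory, it can be useful to check how many documents each file contributed, using the `source` field shown in the metadata earlier. The following is a minimal sketch under that assumption. `SourcedDoc`, `countBySource`, and the sample data are hypothetical and stand in for real `directoryDocs` or `splitDocs`.

```typescript
// Minimal local type: only the fields this example needs.
interface SourcedDoc {
  pageContent: string;
  metadata: { source: string };
}

// Hypothetical helper: count how many documents came from each source file.
function countBySource(docs: SourcedDoc[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const doc of docs) {
    const source = doc.metadata.source;
    counts[source] = (counts[source] ?? 0) + 1;
  }
  return counts;
}

// Made-up documents standing in for `directoryDocs` or `splitDocs`.
const sampleDocs: SourcedDoc[] = [
  { pageContent: "page 1", metadata: { source: "a.pdf" } },
  { pageContent: "page 2", metadata: { source: "a.pdf" } },
  { pageContent: "page 1", metadata: { source: "b.pdf" } },
];

console.log(countBySource(sampleDocs)); // { 'a.pdf': 2, 'b.pdf': 1 }
```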

## API reference

For detailed documentation of all `PDFLoader` features and configurations, head to the [API reference](https://api.js.langchain.com/classes/langchain_community_document_loaders_fs_pdf.PDFLoader.html).