Allow passing a custom pdfjs build
nfcampos committed Apr 5, 2023
1 parent 022ef67 commit 432567b
Showing 3 changed files with 36 additions and 2 deletions.
@@ -33,3 +33,16 @@ const loader = new PDFLoader("src/document_loaders/example_data/example.pdf", {

const docs = await loader.load();
```

# Usage, legacy environments

In legacy environments, you can use the `pdfjs` option to pass a function that returns a promise resolving to the `PDFJS` object. This is useful if you want to use a custom build or a different version of `pdfjs-dist`. For example, here we use the legacy build of `pdfjs-dist`, which includes several polyfills that are not included in the default build.

```typescript
import { PDFLoader } from "langchain/document_loaders";

const loader = new PDFLoader("src/document_loaders/example_data/example.pdf", {
  pdfjs: () =>
    import("pdfjs-dist/legacy/build/pdf.js").then((mod) => mod.default),
});
```
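The `.then((mod) => mod.default)` step above is easy to get wrong with a custom build, because the factory must resolve to the object that exposes `getDocument` and `version` (the two names the loader destructures). The following is a minimal sketch of a guard that fails early when it does not. `PdfjsLike`, `PdfjsFactory`, `isPdfjsLike`, and `checkedPdfjs` are hypothetical names for illustration and are not part of langchain or `pdfjs-dist`.

```typescript
// Hypothetical helper types and functions; none of these names exist in
// langchain or pdfjs-dist.

// The two members the loader reads from the resolved object.
interface PdfjsLike {
  getDocument: (...args: any[]) => unknown;
  version: string;
}

type PdfjsFactory = () => Promise<PdfjsLike>;

// Type guard: true only if `value` exposes getDocument and version.
function isPdfjsLike(value: unknown): value is PdfjsLike {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return (
    typeof candidate.getDocument === "function" &&
    typeof candidate.version === "string"
  );
}

// Wraps a module-loading function so that resolving to the wrong export
// (for example, the module namespace instead of its default) throws a
// descriptive error instead of failing later inside the loader.
function checkedPdfjs(load: () => Promise<unknown>): PdfjsFactory {
  return async () => {
    const mod = await load();
    if (!isPdfjsLike(mod)) {
      throw new Error(
        "pdfjs factory did not resolve to an object with getDocument and version"
      );
    }
    return mod;
  };
}
```

The wrapped function could then be passed as the `pdfjs` option. Whether its return type lines up with `typeof PDFLoaderImports` without a cast has not been checked here.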
10 changes: 8 additions & 2 deletions langchain/src/document_loaders/pdf.ts
@@ -5,16 +5,22 @@ import { BufferLoader } from "./buffer.js";
 export class PDFLoader extends BufferLoader {
   private splitPages: boolean;

-  constructor(filePathOrBlob: string | Blob, { splitPages = true } = {}) {
+  private pdfjs: typeof PDFLoaderImports;
+
+  constructor(
+    filePathOrBlob: string | Blob,
+    { splitPages = true, pdfjs = PDFLoaderImports } = {}
+  ) {
     super(filePathOrBlob);
     this.splitPages = splitPages;
+    this.pdfjs = pdfjs;
   }

   public async parse(
     raw: Buffer,
     metadata: Document["metadata"]
   ): Promise<Document[]> {
-    const { getDocument, version } = await PDFLoaderImports();
+    const { getDocument, version } = await this.pdfjs();
     const pdf = await getDocument({
       data: new Uint8Array(raw.buffer),
       useWorkerFetch: false,
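The hunk above follows a common pattern: the factory that loads the dependency becomes a constructor option, defaults to the built-in one, and is only invoked when `parse` runs. Below is a stripped-down sketch of that pattern. `Engine`, `EngineFactory`, `defaultEngine`, and `SketchLoader` are hypothetical stand-ins and do not reproduce the real `PDFLoader`.

```typescript
// Hypothetical stand-ins that mirror the shape of the change; this is not
// the real PDFLoader or PDFLoaderImports.
type Engine = { version: string };
type EngineFactory = () => Promise<Engine>;

// Plays the role of PDFLoaderImports: the built-in lazy loader.
const defaultEngine: EngineFactory = async () => ({ version: "default" });

class SketchLoader {
  private engine: EngineFactory;

  // Destructuring default: when the option is omitted, the built-in
  // factory is stored instead.
  constructor({ engine = defaultEngine }: { engine?: EngineFactory } = {}) {
    this.engine = engine;
  }

  // The factory is called here, at parse time, not in the constructor,
  // so a custom build is only loaded when it is actually needed.
  async parse(): Promise<string> {
    const { version } = await this.engine();
    return `parsed with ${version}`;
  }
}
```

In this sketch, `new SketchLoader().parse()` resolves to `"parsed with default"`, while passing `{ engine: async () => ({ version: "legacy" }) }` makes it resolve to `"parsed with legacy"`. Keeping the option a function rather than an already-loaded module keeps the load lazy.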
15 changes: 15 additions & 0 deletions langchain/src/document_loaders/tests/pdf.test.ts
@@ -26,3 +26,18 @@ test("Test PDF loader from file to single document", async () => {
   expect(docs.length).toBe(1);
   expect(docs[0].pageContent).toContain("Attention Is All You Need");
 });
+
+test("Test PDF loader from file using custom pdfjs", async () => {
+  const filePath = path.resolve(
+    path.dirname(url.fileURLToPath(import.meta.url)),
+    "./example_data/1706.03762.pdf"
+  );
+  const loader = new PDFLoader(filePath, {
+    pdfjs: () =>
+      import("pdfjs-dist/legacy/build/pdf.js").then((mod) => mod.default),
+  });
+  const docs = await loader.load();
+
+  expect(docs.length).toBe(15);
+  expect(docs[0].pageContent).toContain("Attention Is All You Need");
+});
