Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Notion vectorStore integration breaks when importing a database with an empty row #2807

Closed
natalie-bb opened this issue Oct 6, 2023 · 10 comments · Fixed by #2837
Closed
Labels
auto:bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

@natalie-bb
Copy link

file:///Users/natalie/chatbot/node_modules/langchain/dist/embeddings/openai.js:139
const batches = chunkArray(this.stripNewLines ? texts.map((t) => t.replace(/\n/g, " ")) : texts, this.batchSize);
^
TypeError: Cannot read properties of undefined (reading 'replace')
at file:///Users/natalie/chatbot/node_modules/langchain/dist/embeddings/openai.js:139:76
at Array.map ()
at OpenAIEmbeddings.embedDocuments (file:///Users/natalie/chatbot/node_modules/langchain/dist/embeddings/openai.js:139:63)
at HNSWLib.addDocuments (file:///Users/natalie/chatbot/node_modules/langchain/dist/vectorstores/hnswlib.js:48:54)
at Function.fromDocuments (file:///Users/natalie/chatbot/node_modules/langchain/dist/vectorstores/hnswlib.js:258:24)
at file:///Users/natalie/chatbot/tools/vectorStores/restaurants.ts:27:46
at processTicksAndRejections (node:internal/process/task_queues:95:5)

@dosubot dosubot bot added the auto:bug Related to a bug, vulnerability, unexpected error with an existing feature label Oct 6, 2023
@dosubot
Copy link

dosubot bot commented Oct 6, 2023

🤖

Hello,

Thank you for bringing this issue to our attention. It seems like the error is occurring because the embedDocuments function in openai.js is trying to call the replace method on an undefined value. This is likely happening because there's an empty row in the database you're trying to import, which results in an undefined value in the texts array that's passed to embedDocuments.

Here's the relevant part of the code:

const batches = chunkArray(this.stripNewLines ? texts.map((t) => t.replace(/\n/g, " ")) : texts, this.batchSize);

In this line, texts.map((t) => t.replace(/\n/g, " ")) is trying to replace all newline characters in each string in the texts array. If there's an undefined value in the texts array (which would be the case if there's an empty row in the database), calling replace on it will throw a TypeError.

To fix this issue, we could add a check to ensure that t is defined before calling replace on it. Here's how you could modify the code:

const batches = chunkArray(this.stripNewLines ? texts.map((t) => t ? t.replace(/\n/g, " ") : "") : texts, this.batchSize);

In this modified code, t ? t.replace(/\n/g, " ") : "" will return an empty string if t is undefined, which should prevent the TypeError from being thrown.

Please try this modification and let us know if it resolves the issue. If it doesn't, or if you have any other questions, please don't hesitate to ask.

Best,
Dosu

Sources


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@jacoblee93
Copy link
Collaborator

@skarard can you have a look at this one?

@skarard
Copy link
Contributor

skarard commented Oct 7, 2023

Sure.

@skarard
Copy link
Contributor

skarard commented Oct 7, 2023

I was able to reproduce the bug and it opens a pretty solid question.

First the bug:

/**
 * Interface for interacting with a document.
 */
export class Document<
  // eslint-disable-next-line @typescript-eslint/no-explicit-any
  Metadata extends Record<string, any> = Record<string, any>
> implements DocumentInput
{
  pageContent: string;

  metadata: Metadata;

  constructor(fields: DocumentInput<Metadata>) {
    this.pageContent = fields.pageContent
      ? fields.pageContent.toString()
      : this.pageContent;
    this.metadata = fields.metadata ?? ({} as Metadata);
  }
}

The constructor for the Document class will set pageContent to undefined if it's falsy (e.g. an empty string).

A Notion page contains properties that the NotionAPILoader store as metadata. Often in a Notion database, a page will not have any contents and the pages properties are where the data is stored.

@natalie-bb a work around to get the script working for now is to replace the undefined pageContent with an empty string.

const workaroundDocs = docs.map(
  (doc) => ({ ...doc, pageContent: doc.pageContent ?? "" } as Document)
);

Your code may want to look something like this...

const loader = new NotionAPILoader({...});

const docs = await loader.load();

const workaroundDocs = docs.map(
  (doc) => ({ ...doc, pageContent: doc.pageContent ?? "" } as Document)
);

const vectorStore = await HNSWLib.fromDocuments(workaroundDocs, embeddings);

@jacoblee93 Should the Document class store an empty string if pageContent is an empty string?

@jacoblee93
Copy link
Collaborator

IMO no - it would never get retrieved

@skarard
Copy link
Contributor

skarard commented Oct 7, 2023

@natalie-bb I'm going to add a little more context which might not be obvious.

The database properties are not used when running the embeddings, this is to say the only searchable data is pageContent, the properties can be used to filter or do a database query (if your vector store supports that).

Something that may be closer to having similar searchable Notion properties is to add the properties as a header to pageContent. This is an example following front matter content management.

  const workaroundDocs = docs.map(
    (doc) =>
      ({
        ...doc,
        pageContent: `---\n${yaml.dump(doc.metadata.properties)}---\n${doc.pageContent ?? ""}`,
      } as Document)
  );

There is a yaml parsing/writing library already in Langchain, this is the import you'd need.

import yaml from "js-yaml";

@skarard
Copy link
Contributor

skarard commented Oct 7, 2023

IMO no - it would never get retrieved

@jacoblee93 Ok, if I understand you correctly, this would mean that HNSWLib will need to support pageContent as undefined?

@natalie-bb
Copy link
Author

As a workaround, I simply deleted the empty row.

The error was somewhat obtuse, and it wasn't immediately obvious to me that I needed to do that.

@jacoblee93
Copy link
Collaborator

IMO no - it would never get retrieved

@jacoblee93 Ok, if I understand you correctly, this would mean that HNSWLib will need to support pageContent as undefined?

What I mean is it shouldn't be created at all - pageContent is a required field:

https://github.com/langchain-ai/langchainjs/blob/main/langchain/src/document.ts#L18

@skarard
Copy link
Contributor

skarard commented Oct 8, 2023

IMO no - it would never get retrieved

@jacoblee93 Ok, if I understand you correctly, this would mean that HNSWLib will need to support pageContent as undefined?

What I mean is it shouldn't be created at all - pageContent is a required field:

https://github.com/langchain-ai/langchainjs/blob/main/langchain/src/document.ts#L18

Thank you for your clarification.

The typing for the Document class currently suggests that an empty string is a valid pageContents but sets this to undefined breaking the fields typing.

Therefore the correct behaviour is for the Document class constructor is to throw an error when page contents are string.length == 0. I'll throw in a new PR for this, you're welcome to close it if you hate the idea.

I'll also update the NotionAPILoader to check if pageContents is falsy, so that a page would not be created. I'm also going to throw in a new NotionAPILoaderOption to add the page properties as front matter to page contents (yaml). The default will be the current behaviour, to not prepend anything. I'll link the issue to this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto:bug Related to a bug, vulnerability, unexpected error with an existing feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants