Fail to ingest data #468

Open
fraction94 opened this issue May 20, 2024 · 5 comments
Labels
kind: bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

@fraction94

I followed the README instructions and created an index in Pinecone, but data ingestion still fails and I keep getting the error below. I don't know what I am doing wrong. I am fairly new to this and learning as I go. Any help is much appreciated.

`creating vector store...
error TypeError: Cannot read properties of undefined (reading 'text')
at C:\Python\gpt4-pdf\node_modules@pinecone-database\pinecone\dist\errors\utils.js:44:57
at step (C:\Python\gpt4-pdf\node_modules@pinecone-database\pinecone\dist\errors\utils.js:33:23)
at Object.next (C:\Python\gpt4-pdf\node_modules@pinecone-database\pinecone\dist\errors\utils.js:14:53)
at C:\Python\gpt4-pdf\node_modules@pinecone-database\pinecone\dist\errors\utils.js:8:71
at new Promise (<anonymous>)
at __awaiter (C:\Python\gpt4-pdf\node_modules@pinecone-database\pinecone\dist\errors\utils.js:4:12)
at extractMessage (C:\Python\gpt4-pdf\node_modules@pinecone-database\pinecone\dist\errors\utils.js:40:48)
at C:\Python\gpt4-pdf\node_modules@pinecone-database\pinecone\dist\errors\handling.js:66:70
at step (C:\Python\gpt4-pdf\node_modules@pinecone-database\pinecone\dist\errors\handling.js:33:23)
at Object.next (C:\Python\gpt4-pdf\node_modules@pinecone-database\pinecone\dist\errors\handling.js:14:53)

file:///C:/Python/gpt4-pdf/scripts/ingest-data.ts:46
throw new Error('Failed to ingest your data');
^
Error: Failed to ingest your data
at run (file:///C:/Python/gpt4-pdf/scripts/ingest-data.ts:46:11)
at processTicksAndRejections (node:internal/process/task_queues:95:5)
at file:///C:/Python/gpt4-pdf/scripts/ingest-data.ts:51:3
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
`


dosubot bot commented May 20, 2024

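The error "TypeError: Cannot read properties of undefined (reading 'text')" means that some code tried to read a text property from a value that was undefined. One possible cause is that the documents being processed are not in the expected shape, so it is worth inspecting them before they reach Pinecone. Note also that every frame in your stack trace is inside the Pinecone client's error-handling code, so the Pinecone configuration should be checked as well. Here's how you can troubleshoot this issue: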

Steps to Troubleshoot

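  1. Check Document Structure: Ensure that the documents loaded by directoryLoader.load() have a non-empty pageContent property. LangChain documents store their text in pageContent, not in a top-level text property.

  2. Verify Split Documents: Ensure that the documents returned by textSplitter.splitDocuments(rawDocs) also have a non-empty pageContent property.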

  3. Add Logging: Modify your code to log the structure of the documents at each step to inspect them.

Here’s the modified code with additional logging:

import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { PineconeStore } from 'langchain/vectorstores/pinecone';
import { pinecone } from '@/utils/pinecone-client';
import { PDFLoader } from 'langchain/document_loaders/fs/pdf';
import { PINECONE_INDEX_NAME, PINECONE_NAME_SPACE } from '@/config/pinecone';
import { DirectoryLoader } from 'langchain/document_loaders/fs/directory';

const filePath = 'docs';

export const run = async () => {
  try {
    const directoryLoader = new DirectoryLoader(filePath, {
      '.pdf': (path) => new PDFLoader(path),
    });

    const rawDocs = await directoryLoader.load();
    console.log('rawDocs', rawDocs); // Log rawDocs to inspect structure

    const textSplitter = new RecursiveCharacterTextSplitter({
      chunkSize: 1000,
      chunkOverlap: 200,
    });

    const docs = await textSplitter.splitDocuments(rawDocs);
    console.log('split docs', docs); // Log split docs to inspect structure

    console.log('creating vector store...');
    const embeddings = new OpenAIEmbeddings();
    const index = pinecone.Index(PINECONE_INDEX_NAME);

    await PineconeStore.fromDocuments(docs, embeddings, {
      pineconeIndex: index,
      namespace: PINECONE_NAME_SPACE,
      textKey: 'text',
    });
  } catch (error) {
    console.log('error', error);
    throw new Error('Failed to ingest your data');
  }
};

(async () => {
  await run();
  console.log('ingestion complete');
})();
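
Logging every document in full can flood the terminal for larger PDFs. As a rough sketch, a short summary is often enough to confirm the structure. The Doc interface and the helper names findEmptyDocs and summarizeDocs below are hypothetical; they only assume that each document carries its text in pageContent, as LangChain documents do.

```typescript
// Hypothetical minimal shape of a loaded document (text lives in pageContent).
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Returns the indexes of documents whose pageContent is missing or blank.
function findEmptyDocs(docs: Doc[]): number[] {
  const empty: number[] = [];
  docs.forEach((doc, i) => {
    if (typeof doc.pageContent !== 'string' || doc.pageContent.trim() === '') {
      empty.push(i);
    }
  });
  return empty;
}

// Builds a one-line summary instead of dumping every chunk to the console.
function summarizeDocs(label: string, docs: Doc[]): string {
  const totalChars = docs.reduce((n, d) => n + (d.pageContent?.length ?? 0), 0);
  const emptyCount = findEmptyDocs(docs).length;
  return `${label}: ${docs.length} docs, ${totalChars} chars, ${emptyCount} empty`;
}
```

In the script above, console.log(summarizeDocs('split docs', docs)) could stand in for the full dump once the structure has been confirmed.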

Additional Checks

  • Environment Variables: Ensure your .env file is correctly configured with the necessary API keys and index names:
OPENAI_API_KEY=
PINECONE_API_KEY=  
PINECONE_ENVIRONMENT=
PINECONE_INDEX_NAME=
  • Pinecone Index Configuration: Verify that PINECONE_INDEX_NAME matches the name of an existing index in your Pinecone project, and that PINECONE_ENVIRONMENT and PINECONE_API_KEY belong to that same project.
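
A missing or blank variable is easy to overlook, so one option is to fail early with a clear message before any API call is made. The sketch below is only an illustration: missingEnvVars and REQUIRED_VARS are hypothetical names, and the variable list is copied from the .env listing above.

```typescript
// Variable names copied from the .env listing in this comment.
const REQUIRED_VARS: readonly string[] = [
  'OPENAI_API_KEY',
  'PINECONE_API_KEY',
  'PINECONE_ENVIRONMENT',
  'PINECONE_INDEX_NAME',
];

type Env = Record<string, string | undefined>;

// Returns the names of required variables that are unset or contain only whitespace.
function missingEnvVars(env: Env, required: readonly string[] = REQUIRED_VARS): string[] {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === '';
  });
}
```

At the top of the ingest script, this could be called as missingEnvVars(process.env), throwing an error that names the missing variables if the returned array is not empty.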

By following these steps, you should be able to identify and resolve the issue. If you continue to encounter problems, please provide the output of the logged structures for further assistance.

About Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

@dosubot dosubot bot added the kind: bug Related to a bug, vulnerability, unexpected error with an existing feature label May 20, 2024
@fraction94

Apologies for being so dumb, but which part of the program should I use this code in?


dosubot bot commented May 21, 2024

You should place the provided code in the scripts/ingest-data.ts file, replacing its current contents. This file is responsible for ingesting data into Pinecone. The complete code for scripts/ingest-data.ts, including the added logging, is the listing in my previous comment.

Explanation of the Code

  1. DirectoryLoader: Loads all PDF files from the specified directory (docs).
  2. Logging rawDocs: Logs the raw documents loaded from the directory to inspect their structure.
  3. RecursiveCharacterTextSplitter: Splits the documents into smaller chunks.
  4. Logging docs: Logs the split documents to inspect their structure.
  5. Creating Vector Store: Creates a vector store using the Pinecone index and the embeddings generated from the documents.
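
To make the chunkSize and chunkOverlap settings concrete, here is a deliberately simplified sketch of fixed-size splitting with overlap. It is not the algorithm used by RecursiveCharacterTextSplitter, which also tries to break on separators such as paragraphs and sentences; splitWithOverlap is a hypothetical helper that only shows how consecutive chunks share characters.

```typescript
// Splits text into chunks of at most chunkSize characters, where each chunk
// starts (chunkSize - chunkOverlap) characters after the previous one.
function splitWithOverlap(text: string, chunkSize: number, chunkOverlap: number): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error('require chunkSize > 0 and 0 <= chunkOverlap < chunkSize');
  }
  const chunks: string[] = [];
  const step = chunkSize - chunkOverlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current chunk has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

With chunkSize 1000 and chunkOverlap 200, as in the script, each chunk in this simplified model would repeat the last 200 characters of the chunk before it.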

Running the Script

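To run this script, use the project's ingest command from the repository root. Running the .ts file directly with node will not work without a TypeScript runner, and your error log shows the script being started through yarn:

yarn run ingest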

Ensure you have all the necessary environment variables set up in your .env file, especially PINECONE_INDEX_NAME and PINECONE_API_KEY.

Troubleshooting

If you encounter the error TypeError: Cannot read properties of undefined (reading 'text'), first confirm that the documents still have the correct structure after being split by the RecursiveCharacterTextSplitter. LangChain documents hold their text in pageContent (there is no top-level text property), so the expected structure of the docs array is:

[
  {
    "pageContent": "chunk of text",
    "metadata": {
      "source": "source file path or identifier"
    }
  },
  ...
]

If the structure is correct and the error persists, double-check the configuration of the Pinecone index. The textKey option names the Pinecone metadata field in which each chunk's text is stored; it does not need to match a property on your documents.
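
The role of textKey can be illustrated with a small sketch. This is not LangChain's actual implementation, only a model of the idea under stated assumptions: SourceDoc, VectorRecord, and toVectorRecord are hypothetical names, and the record shape (id, values, metadata) is a simplified stand-in for what gets sent to the vector index.

```typescript
// Hypothetical document shape: text in pageContent, extra fields in metadata.
interface SourceDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical, simplified shape of a record stored in a vector index.
interface VectorRecord {
  id: string;
  values: number[];
  metadata: Record<string, unknown>;
}

// Copies the chunk text into the record's metadata under the key named by textKey.
function toVectorRecord(
  id: string,
  doc: SourceDoc,
  values: number[],
  textKey: string = 'text',
): VectorRecord {
  return {
    id,
    values,
    metadata: { ...doc.metadata, [textKey]: doc.pageContent },
  };
}
```

In this model the document itself never needs a text property; the key only appears in the stored metadata, which is why a document with pageContent and textKey: 'text' are compatible.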

@fraction94

fraction94 commented May 21, 2024

I have tried it, but it is still not working; I get the following error. My PDF is a very simple document of a few pages that I converted from Word to PDF, so it should have text content. It looks like OpenAI works fine but Pinecone does not. Any help is greatly appreciated.

Document {
pageContent: '8 \n' +
' \n' +
'diverse cultures, Rocky Mountains, Niagara Falls, hospitality and Canadian cities. The greatest \n' +
'Canadians that you should know include; Wayne Gretzky. Tommy Douglas, Dr. Roberta Bondar, \n' +
'Pierre Trudeau, and Terrance Stanley Fox. The five common Canadian musicians include \n' +
'Leonard Cohen, Celine Dion, The Tragically Hip (Gord Downie as lead singer), Joni Mitchell \n' +
'and Shania Twain. Canada has had great inventions which have been impacts to the world the \n' +
'inventors are Alexander Graham Bell (telephone), Mathew Evans and Henry Woodward (first \n' +
'electric bulb), Sir Sandford Fleming (standard time), James Naismith (basketball), and Arthur \n' +
'Sicard (snowblower).',
metadata: {
source: 'C:\Python\gpt4-pdf\docs\testcase.pdf',
pdf: [Object],
loc: [Object]
}
}
]
creating vector store...
error TypeError: Cannot read properties of undefined (reading 'text')
at C:\Python\gpt4-pdf\node_modules@pinecone-database\pinecone\dist\errors\utils.js:44:57
at step (C:\Python\gpt4-pdf\node_modules@pinecone-database\pinecone\dist\errors\utils.js:33:23)
at Object.next (C:\Python\gpt4-pdf\node_modules@pinecone-database\pinecone\dist\errors\utils.js:14:53)
at C:\Python\gpt4-pdf\node_modules@pinecone-database\pinecone\dist\errors\utils.js:8:71
at new Promise (<anonymous>)
at __awaiter (C:\Python\gpt4-pdf\node_modules@pinecone-database\pinecone\dist\errors\utils.js:4:12)
at extractMessage (C:\Python\gpt4-pdf\node_modules@pinecone-database\pinecone\dist\errors\utils.js:40:48)
at C:\Python\gpt4-pdf\node_modules@pinecone-database\pinecone\dist\errors\handling.js:66:70
at step (C:\Python\gpt4-pdf\node_modules@pinecone-database\pinecone\dist\errors\handling.js:33:23)
at Object.next (C:\Python\gpt4-pdf\node_modules@pinecone-database\pinecone\dist\errors\handling.js:14:53)

file:///C:/Python/gpt4-pdf/scripts/ingest-data.ts:39
throw new Error('Failed to ingest your data');
^
Error: Failed to ingest your data
at run (file:///C:/Python/gpt4-pdf/scripts/ingest-data.ts:39:11)
at processTicksAndRejections (node:internal/process/task_queues:95:5)
at file:///C:/Python/gpt4-pdf/scripts/ingest-data.ts:44:3
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.

Also, this is how I set up my .env file:

OPENAI_API_KEY=sk-proj-zxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
PINECONE_API_KEY=4d8dxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
PINECONE_ENVIRONMENT=us-east-1
PINECONE_INDEX_NAME=pdf


dosubot bot commented May 21, 2024

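The error "TypeError: Cannot read properties of undefined (reading 'text')" occurs when code tries to read the 'text' property of a value that is undefined. Your log shows that the split documents are well formed (each Document has pageContent and metadata), and every frame in the stack trace is inside the Pinecone client's error-handling code (errors/utils.js and errors/handling.js). This suggests that a request to Pinecone failed and the client could not read the error response, so the Pinecone configuration deserves as much attention as the document structure.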
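
One way a TypeError of this kind can arise is when error-handling code reads the body of a response that never arrived. The sketch below is a simplified, synchronous model of that failure mode and not the Pinecone client's code; FailedRequest, extractMessageUnsafe, and extractMessageSafe are hypothetical names used only for illustration.

```typescript
// Hypothetical shape of a failed request: response is absent when the
// request never reached the server (for example, a wrong host name).
interface FailedRequest {
  response?: { text(): string };
  message?: string;
}

// Unsafe: assumes a response always exists. The non-null assertion silences
// the compiler, so a missing response only fails at runtime with a TypeError.
function extractMessageUnsafe(error: FailedRequest): string {
  return error.response!.text();
}

// Safer: falls back to the error's own message when there is no response.
function extractMessageSafe(error: FailedRequest): string {
  if (error.response === undefined) {
    return error.message ?? 'request failed without a response';
  }
  return error.response.text();
}
```

If this model matches what happens here, the TypeError hides the original failure, and the useful question becomes why the Pinecone request itself did not get a readable response.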

Steps to Resolve the Issue

  1. Verify Document Structure:
    Ensure that each document has a non-empty pageContent property before passing it to Pinecone. LangChain documents store their text in pageContent, not in a top-level text property. Add the following logging to your ingest-data.ts script to inspect the structure of the documents after they are split:

    const docs = await textSplitter.splitDocuments(rawDocs);
    console.log('split docs', docs);
    
    // Check that every document has non-empty pageContent
    docs.forEach((doc, index) => {
      if (!doc.pageContent) {
        console.error(`Document at index ${index} has no pageContent`, doc);
      }
    });
  2. Ensure Correct Pinecone Configuration:
    Make sure your .env file is correctly set up with the necessary Pinecone configuration. Ensure your .env file includes the PINECONE_INDEX_NAME variable:

    PINECONE_INDEX_NAME=pdf
    
  3. Check CustomPDFLoader Implementation:
    If you load PDFs with a custom loader such as CustomPDFLoader instead of PDFLoader, verify that it sets the pageContent property of each Document to parsed.text, which should contain the text content of the PDF.

    export class CustomPDFLoader extends BufferLoader {
      public async parse(
        raw: Buffer,
        metadata: Document['metadata'],
      ): Promise<Document[]> {
        const { pdf } = await PDFLoaderImports();
        const parsed = await pdf(raw);
        return [
          new Document({
            pageContent: parsed.text,
            metadata: {
              ...metadata,
              pdf_numpages: parsed.numpages,
            },
          }),
        ];
      }
    }
  4. Debugging Pinecone Ingestion:
    Add more detailed logging around the Pinecone ingestion process to identify any issues. Note that PineconeStore.fromDocuments takes the embeddings as its second argument and an index object (not an index name) in its options, as in scripts/ingest-data.ts:

    try {
      const docs = await textSplitter.splitDocuments(rawDocs);
      console.log('split docs', docs);
    
      // Check that every document has non-empty pageContent
      docs.forEach((doc, index) => {
        if (!doc.pageContent) {
          console.error(`Document at index ${index} has no pageContent`, doc);
        }
      });
    
      // Ingest documents into Pinecone
      const embeddings = new OpenAIEmbeddings();
      const index = pinecone.Index(PINECONE_INDEX_NAME);
      await PineconeStore.fromDocuments(docs, embeddings, {
        pineconeIndex: index,
        namespace: PINECONE_NAME_SPACE,
        textKey: 'text', // metadata field in Pinecone that stores the chunk text
      });
      console.log('Successfully ingested documents into Pinecone');
    } catch (error) {
      console.error('Failed to ingest documents into Pinecone:', error);
      throw new Error('Failed to ingest your data');
    }

By following these steps, you should be able to identify and resolve the issue causing the "TypeError: Cannot read properties of undefined (reading 'text')" error.
