Error with UnstructuredLoader when attempting to load markdown files #738

Closed
LarchLiu opened this issue Apr 11, 2023 · 6 comments · Fixed by #742

Comments

@LarchLiu
Contributor

I have the unstructured-api Docker container running successfully, and I am using UnstructuredLoader to load Markdown files.

// Map each file extension to the loader that should handle it
const directoryLoader = new DirectoryLoader(filePath, {
  '.pdf': (path) => new PDFLoader(path, {
    pdfjs: () =>
      import('pdfjs-dist/legacy/build/pdf.js').then((mod) => mod.default)
  }),
  '.epub': (path) => new EPubLoader(path),
  '.docx': (path) => new DocxLoader(path),
  '.csv': (path) => new CSVLoader(path),
  '.json': (path) => new JSONLoader(path),
  '.txt': (path) => new TextLoader(path),
  // Markdown files are sent to the local unstructured-api container
  '.md': (path) => new UnstructuredLoader('http://localhost:5000/general/v0/general', path)
})

But it reports an error:

error Error: Failed to partition file /Users/xxx/back-end/resource/fc87122e-704d-4dc7-b37f-0b87b39c023e/1681209481/GfhQDfat0WP38nWf7JpYvItS.md with error 400 and message {"detail":"Unable to process blob: File type None is not supported."}

I tested it with Postman and everything works.


Maybe we should append the file name (with the .md extension) to the formData?

https://github.com/hwchase17/langchainjs/blob/47539dae010cd6a38c10ebcf3fb315339c348889/langchain/src/document_loaders/fs/unstructured.ts#L30-L36

@LarchLiu
Contributor Author

I tried appending the file name to the formData, and it works.

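The idea of attaching the file name to the form data can be sketched as follows. This is a minimal illustration, not the code merged in #742: `buildPartitionForm` is a hypothetical helper, the `files` field name is an assumption about what the unstructured-api endpoint expects, and the sketch relies on the global `FormData` and `Blob` available in Node 18+.

```typescript
// Hypothetical helper (not part of langchainjs): wraps file content in
// multipart form data and attaches the original file name to the part.
function buildPartitionForm(content: Blob, filePath: string): FormData {
  // Keep only the last path segment, e.g. "/tmp/docs/note.md" -> "note.md"
  const fileName = filePath.split(/[\\/]/).pop() ?? filePath;

  const formData = new FormData();
  // The third argument sets the filename of the multipart part, so the
  // receiving server can see the ".md" extension.
  // "files" is an assumed field name.
  formData.append("files", content, fileName);
  return formData;
}
```

For a file on disk, one would read the bytes, wrap them in a `Blob`, and POST the returned form to the endpoint with `fetch`. Without the third argument to `append`, the part is sent under the generic name `blob`, which gives the server no extension to work with.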

@arronKler
Contributor

arronKler commented Apr 11, 2023

It seems the file's MIME type can't be determined from the buffer read by the fs.readFile API.

LarchLiu added a commit to LarchLiu/langchainjs that referenced this issue Apr 11, 2023
@LarchLiu
Contributor Author

> It seems the file's MIME type can't be determined from the buffer read by the fs.readFile API.

Yes. As mentioned in this unstructured-api PR, the filename should be used to determine the file type.
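To illustrate why the name matters: a Markdown buffer has no signature bytes that identify it, whereas a file name's extension can be mapped to a type with a simple lookup. The sketch below is an illustration only and is not how unstructured-api implements detection. `MIME_BY_EXTENSION` and `guessMimeType` are hypothetical names, and the table is a small subset.

```typescript
// Hypothetical lookup table: extension -> MIME type (illustrative subset only)
const MIME_BY_EXTENSION: Record<string, string> = {
  ".md": "text/markdown",
  ".txt": "text/plain",
  ".csv": "text/csv",
  ".json": "application/json",
  ".pdf": "application/pdf",
};

// Guesses a MIME type from a bare file name (not a full path).
// Returns undefined when the extension is missing or not in the table.
function guessMimeType(fileName: string): string | undefined {
  const dot = fileName.lastIndexOf(".");
  if (dot <= 0) return undefined; // no extension, or a dotfile such as ".env"
  return MIME_BY_EXTENSION[fileName.slice(dot).toLowerCase()];
}
```

A name without an extension yields no type, which corresponds to the "File type None is not supported" situation reported above.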

nfcampos pushed a commit that referenced this issue Apr 11, 2023
@alextkd

alextkd commented Jul 6, 2023

It's still not detecting the file type when the file is a PDF.

@LarchLiu
Contributor Author

LarchLiu commented Jul 8, 2023

@alextkd You can use PDFLoader.

// Map each file extension to the loader that should handle it
const directoryLoader = new DirectoryLoader(filePath, {
  // PDFs are handled by PDFLoader rather than UnstructuredLoader
  '.pdf': (path) => new PDFLoader(path, {
    pdfjs: () =>
      import('pdfjs-dist/legacy/build/pdf.js').then((mod) => mod.default)
  }),
  '.epub': (path) => new EPubLoader(path),
  '.docx': (path) => new DocxLoader(path),
  '.csv': (path) => new CSVLoader(path),
  '.json': (path) => new JSONLoader(path),
  '.txt': (path) => new TextLoader(path),
  '.md': (path) => new UnstructuredLoader('http://localhost:5000/general/v0/general', path)
})

@alextkd

alextkd commented Jul 8, 2023

> @alextkd You can use PDFLoader.

It worked in the end; a typo in my PDF was preventing it from loading. Thanks for sharing, though. I need to handle files from an S3 bucket, so I'll have to try using blobs if I take this approach rather than the s3Loader, which uses unstructured. It should be fine now: it may use more resources, but it does the job. Cheers!
