KeyError: 'style' #72

Open
RaphSte opened this issue Jun 14, 2024 · 21 comments

@RaphSte

RaphSte commented Jun 14, 2024

When trying to run a pdf file through it I get the KeyError: 'style', with the following stacktrace:

error uploading file, stacktrace: Traceback (most recent call last):
  File "/app/nlm_ingestor/ingestion_daemon/__main__.py", line 48, in parse_document
    return_dict, _ = ingestor_api.ingest_document(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
    ingestor = pdf_ingestor.PDFIngestor(doc_location, parse_options)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
    blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
                                                                            ^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 176, in parse_blocks
    parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 117, in __init__
    self.parse(pages)
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 198, in parse
    p["style"], p.text, page_width
    ~^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bs4/element.py", line 1573, in __getitem__
    return self.attrs[key]
           ~~~~~~~~~~^^^^^
KeyError: 'style'

Steps to reproduce:

(tested on linux server)

  • docker pull ghcr.io/nlmatics/nlm-ingestor:latest
  • docker run -p 5010:5001 ghcr.io/nlmatics/nlm-ingestor
  • After that, from a client:
from llmsherpa.readers import LayoutPDFReader
llmsherpa_api_url = "https://my-url/api/parseDocument?renderFormat=all"

# both methods, local and online, produce the same error
pdf_url = "https://arxiv.org/pdf/1910.13461.pdf"
pdf_url = "./arxiv.org/pdf/1910.13461.pdf"


pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

@RaphSte
Author

RaphSte commented Jun 14, 2024

This seems to be the same error as in #24.
One comment there suggests that the Tika server is not running. How can I verify that?
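One rough way to check whether a server is listening is to attempt a TCP connection to its port. The sketch below does only that. It assumes Tika listens on 9998, which is Apache Tika server's default port; whether this image keeps that default is an assumption. Because the `docker run` command above publishes only port 5001, the check would presumably have to run inside the container rather than from the host.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: nothing is accepting connections.
        return False


if __name__ == "__main__":
    # 9998 is an assumption (Tika server's default), not confirmed for this image.
    print("Tika reachable:", is_port_open("localhost", 9998))
```

An open port only shows that something is listening there; it does not prove that the process is Tika or that it is healthy.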

@eburnette

v0.1.8 and v0.1.7 have this problem for me. v0.1.6 works fine.

@shshnk158

shshnk158 commented Jun 17, 2024

Hey @RaphSte, did the latest version work for you? If so, can you tell me how?

@aleksvercau

Same issue here with the latest version.

@RaphSte
Author

RaphSte commented Jun 17, 2024

Hey @RaphSte, did the latest version work for you? If so, can you tell me how?

Hey @shshnk158, no, it didn't work for me. I'll use v0.1.6 for now.

@shshnk158

Yes, v0.1.6 is working fine, but it comes with tika-server-standard-nlm-modified-2.4.1_v6.jar. I wanted to try the latest jar file, [tika-server-standard-nlm-modified-2.9.2_v1.jar](https://github.com/nlmatics/nlm-ingestor/blob/main/jars/tika-server-standard-nlm-modified-2.9.2_v1.jar). Any suggestions, @ansukla?

@thenicekat

yes, facing this issue on v0.1.7 and v0.1.8

@vitorhirota

The issue occurs because some paragraphs are missing metadata. PR #70 solves it.

Until it is merged, you can use it locally with git fetch origin pull/70/head:PR70 followed by git switch PR70.
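The traceback earlier in this thread shows bs4's `Tag.__getitem__` returning `self.attrs[key]`, so `p["style"]` raises `KeyError` for any paragraph without a `style` attribute. A minimal sketch of a defensive lookup follows. Plain dicts stand in for `Tag.attrs`, the helper name `style_of` is hypothetical, and this is not necessarily how PR #70 handles it.

```python
from typing import Mapping


def style_of(attrs: Mapping[str, str], default: str = "") -> str:
    """Return the 'style' attribute, or `default` when the tag has none."""
    return attrs.get("style", default)


# Hypothetical attribute dicts standing in for bs4 Tag.attrs.
with_style = {"style": "font-size:10px"}
without_style = {"class": "page"}  # a paragraph missing its style metadata

# Direct indexing reproduces the failure mode from the traceback.
try:
    without_style["style"]
except KeyError as exc:
    print("KeyError:", exc)

# The defensive lookup falls back to the default instead of raising.
print(repr(style_of(with_style)))
print(repr(style_of(without_style)))
```

Whether an empty style is a sensible fallback for the downstream parser is a separate question; the sketch only shows how to avoid the exception.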

@rednag

rednag commented Jun 27, 2024

I'm facing the same issue, trying now to build the container with PR70.

@rednag

rednag commented Jun 27, 2024

Container build fails with "Failed to build pandas"

@vitorhirota

@rednag PR #73 is related, but in my case I just updated requirements.txt to use pandas>=1.2.4 in place of the pinned line:

pandas==1.2.4
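For anyone scripting the build, the change from `==` to `>=` can be applied automatically. This is only a sketch: the helper name `relax_pin` is hypothetical, and editing requirements.txt by hand works just as well.

```python
import re


def relax_pin(requirements: str, package: str) -> str:
    """Turn 'package==X' into 'package>=X', leaving all other lines untouched."""
    pattern = re.compile(
        rf"^({re.escape(package)})\s*==\s*(\S+)",
        re.MULTILINE | re.IGNORECASE,
    )
    return pattern.sub(r"\1>=\2", requirements)


# Hypothetical file contents; only the pandas line is taken from this thread.
original = "numpy==1.0\npandas==1.2.4\n"
print(relax_pin(original, "pandas"))
```

Relaxing the pin lets pip choose a newer pandas release, which could behave differently from the pinned version.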

@heyalistair

heyalistair commented Jul 3, 2024

@vitorhirota Changing that to >= fixes the pandas error.

Now I'm hitting a problem with python -m nltk.downloader punkt.

➜  nlm-ingestor git:(PR70) ✗ docker build --platform=linux/x86_64 -t ohalo-nlm-ingestor .
[+] Building 2.6s (22/24)                                                                                                                                           docker:desktop-linux
 => [internal] load .dockerignore                                                                                                                                                   0.0s
 => => transferring context: 2B                                                                                                                                                     0.0s
 => [internal] load build definition from Dockerfile                                                                                                                                0.0s
 => => transferring dockerfile: 1.54kB                                                                                                                                              0.0s
 => resolve image config for docker.io/docker/dockerfile:experimental                                                                                                               0.4s
 => CACHED docker-image://docker.io/docker/dockerfile:experimental@sha256:600e5c62eedff338b3f7a0850beb7c05866e0ef27b2d2e8c02aa468e78496ff5                                          0.0s
 => [internal] load build definition from Dockerfile                                                                                                                                0.0s
 => [internal] load .dockerignore                                                                                                                                                   0.0s
 => [internal] load metadata for docker.io/library/python:3.11-bookworm                                                                                                             0.4s
 => [ 1/16] FROM docker.io/library/python:3.11-bookworm@sha256:4eee56938c2f48480acb90db616162cfa361f5987dd43e1371e5288ed3e5e95e                                                     0.0s
 => => resolve docker.io/library/python:3.11-bookworm@sha256:4eee56938c2f48480acb90db616162cfa361f5987dd43e1371e5288ed3e5e95e                                                       0.0s
 => [internal] load build context                                                                                                                                                   0.1s
 => => transferring context: 186.95kB                                                                                                                                               0.1s
 => CACHED [ 2/16] RUN apt-get update && apt-get -y --no-install-recommends install libgomp1                                                                                        0.0s
 => CACHED [ 3/16] RUN mkdir -p /usr/share/man/man1 &&   apt-get update -y &&   apt-get install -y openjdk-17-jre-headless                                                          0.0s
 => CACHED [ 4/16] RUN apt-get install -y   libxml2-dev libxslt-dev   build-essential libmagic-dev                                                                                  0.0s
 => CACHED [ 5/16] RUN apt-get install -y   tesseract-ocr   lsb-release   && echo "deb https://notesalexp.org/tesseract-ocr5/$(lsb_release -cs)/ $(lsb_release -cs) main" | tee /e  0.0s
 => CACHED [ 6/16] RUN apt-get install unzip -y &&   apt-get install git -y &&   apt-get autoremove -y                                                                              0.0s
 => CACHED [ 7/16] WORKDIR /app                                                                                                                                                     0.0s
 => CACHED [ 8/16] COPY . ./                                                                                                                                                        0.0s
 => CACHED [ 9/16] RUN pip install --upgrade pip setuptools                                                                                                                         0.0s
 => CACHED [10/16] RUN apt-get install -y libmagic1                                                                                                                                 0.0s
 => CACHED [11/16] RUN mkdir -p -m 0600 ~/.ssh && ssh-keyscan github.com >> ~/.ssh/known_hosts                                                                                      0.0s
 => CACHED [12/16] RUN pip install -r requirements.txt                                                                                                                              0.0s
 => CACHED [13/16] RUN python -m nltk.downloader stopwords                                                                                                                          0.0s
 => ERROR [14/16] RUN python -m nltk.downloader punkt                                                                                                                               1.6s
------                                                                                                                                                                                   
 > [14/16] RUN python -m nltk.downloader punkt:                                                                                                                                          
0.505 <frozen runpy>:128: RuntimeWarning: 'nltk.downloader' found in sys.modules after import of package 'nltk', but prior to execution of 'nltk.downloader'; this may result in unpredictable behaviour                                                                                                                                                                          
0.874 [nltk_data] Downloading package punkt to /root/nltk_data...
1.525 [nltk_data]   Unzipping tokenizers/punkt.zip.
1.526 [nltk_data] Error with downloaded zip file
1.526 Error installing package. Retry? [n/y/e]
1.529 Traceback (most recent call last):
1.529   File "<frozen runpy>", line 198, in _run_module_as_main
1.530   File "<frozen runpy>", line 88, in _run_code
1.530   File "/usr/local/lib/python3.11/site-packages/nltk/downloader.py", line 2537, in <module>
1.532     rv = downloader.download(
1.532          ^^^^^^^^^^^^^^^^^^^^
1.532   File "/usr/local/lib/python3.11/site-packages/nltk/downloader.py", line 790, in download
1.533     choice = input().strip()
1.533              ^^^^^^^
1.534 EOFError: EOF when reading a line
------
Dockerfile:34
--------------------
  32 |     RUN pip install -r requirements.txt
  33 |     RUN python -m nltk.downloader stopwords
  34 | >>> RUN python -m nltk.downloader punkt
  35 |     RUN python -c "import tiktoken; tiktoken.get_encoding(\"cl100k_base\")"
  36 |     RUN chmod +x run.sh
--------------------
ERROR: failed to solve: process "/bin/sh -c python -m nltk.downloader punkt" did not complete successfully: exit code: 1

EDIT: Never mind, turning on my VPN resolved this issue. I really need to switch ISPs... :)
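In the log above, the downloader prompts "Retry? [n/y/e]" and then fails with `EOFError` because a Docker build has no stdin to answer it. If the underlying failure is a transient network problem, retrying without a prompt may help. The sketch below is a generic retry helper; the name `retry` and its parameters are hypothetical, and it will not help when the network consistently blocks or corrupts the download.

```python
import time
from typing import Callable


def retry(action: Callable[[], bool], attempts: int = 3, delay: float = 2.0) -> bool:
    """Call `action` until it returns True or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            if action():
                return True
        except Exception as exc:
            # Any exception counts as a failed attempt.
            print(f"attempt {attempt} raised {exc!r}")
        if attempt < attempts:
            # Wait a little longer after each failure.
            time.sleep(delay * attempt)
    return False
```

With NLTK installed, this could be used as `retry(lambda: nltk.download("punkt", quiet=True))`, assuming `nltk.download` returns a truthy value on success.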

@heyalistair

I can confirm that building the Docker image from PR #70 with pandas>=1.2.4 works, and the container does not show the KeyError. Thanks!

@jamesvillarrubia
Collaborator

jamesvillarrubia commented Jul 9, 2024

Sorry, this one's on me. The first PR was a massive code refresh on top of the latest Tika, and I missed some key elements that my tests didn't cover. The second PR, with jar v2, should resolve it, but I'm waiting on @ansukla or someone else to merge it here.

Here's the docker image I'm using with everything baked in:
jamesmtc/nlm-ingestor

@ddose-inferyx
Copy link

v0.1.8 and v0.1.7

Are you talking about the nlm-ingestor version? I can't see v0.1.6 there.

@RaphSte
Author

RaphSte commented Jul 17, 2024

v0.1.8 and v0.1.7

Are you talking about the nlm-ingestor version? I can't see v0.1.6 there.

@ddose-inferyx Yes, this is about the nlm-ingestor version. You can either pull the image directly (see here) or build it yourself from the tag v0.1.6.
I didn't try building it myself, though. I just pulled the image directly and it worked for me.

@ddose-inferyx

v0.1.8 and v0.1.7 have this problem for me. v0.1.6 works fine.

Thanks, @RaphSte! It started working for me once I used http://localhost:5010/api/parseDocument?renderFormat=all&applyOcr=yes&useNewIndentParser=yes. Adding useNewIndentParser=yes also works with the latest version.
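The query string from that URL can be assembled in code rather than typed by hand. This is a small sketch: the helper name `parse_document_url` is hypothetical, while the base URL and parameter names are the ones quoted in this comment.

```python
from urllib.parse import urlencode


def parse_document_url(base: str, **params: str) -> str:
    """Build the parseDocument endpoint URL with the given query parameters."""
    return f"{base}/api/parseDocument?{urlencode(params)}"


url = parse_document_url(
    "http://localhost:5010",
    renderFormat="all",
    applyOcr="yes",
    useNewIndentParser="yes",
)
print(url)
```

The resulting string can be passed to `LayoutPDFReader` in place of the `llmsherpa_api_url` used in the original report.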

@irash03

irash03 commented Jul 24, 2024

Here's the docker image I'm using with everything baked in: jamesmtc/nlm-ingestor

docker pull ghcr.io/jamesmtc/nlm-ingestor

Error response from daemon: Head "https://ghcr.io/v2/jamesmtc/nlm-ingestor/manifests/latest": denied

@ddose-inferyx

ddose-inferyx commented Jul 25, 2024 via email

@AeRabelais

Here's the docker image I'm using with everything baked in: jamesmtc/nlm-ingestor

docker pull ghcr.io/jamesmtc/nlm-ingestor

Error response from daemon: Head "https://ghcr.io/v2/jamesmtc/nlm-ingestor/manifests/latest": denied

I was getting the same error and just needed to reset my authentication info for ghcr. I removed any preset ghcr configs, then followed the setup instructions here. After that, running docker pull jamesmtc/nlm-ingestor:latest worked fine.

@ansukla
Member

ansukla commented Jul 26, 2024

Merging changes from @jamesvillarrubia. Apologies for the delay. Thanks James for putting together the fix. Feel free to send me a note on LinkedIn if something needs attention.
