Llmware is not working in Sub linux system Ubuntu under Windows 11 #115

Closed
AlbelTec opened this issue Nov 21, 2023 · 4 comments

Comments

@AlbelTec

Hi

I ran some experiments with Parser() (first parsing to JSON, then into memory), but in both cases I end up with the error: Segmentation fault.
I presume this has already been reported, but it seems no fix has been released so far.

kr,

@JessBerl
Contributor

Hi @AlbelTec - please try the workaround described in #48.

@AlbelTec
Author

AlbelTec commented Nov 21, 2023

Hi @JessBerl
I did use: ulimit -s 32768000
but I am still getting the error:

> Parsing folder: data...
Segmentation fault
(llmware) albel@Thinkpad:~/llmware$ 

Here is my code:

from llmware.parsers import Parser

# Assumed value: the log above prints "Parsing folder: data..."
dataDir = "data"

def parsing_pdf():
    # Create a parser
    parser = Parser()

    # Parse a single PDF into memory
    print(f"\n > Parsing folder: {dataDir}...")

    pdf_parsed_output = parser.parse_one_pdf("/home/albel/llmware/data/", "Large Language Models.pdf")
    page_number = pdf_parsed_output[0]["master_index"]
    block_text = pdf_parsed_output[0]["text"]
    print(f"\nFirst block found on page {page_number}:\n{block_text}")

    # Parse to json
    # blocks = parser.ingest_to_json(dataDir)
    # print(f"Total Blocks: {len(parser.parser_output)}")
    # print("Files Parsed:")
    # for processed_file in blocks["processed_files"]:
    #     print(f"  - {processed_file}")

parsing_pdf()

With JSON the output is more verbose:

albel@Thinkpad:~/llmware$ source /home/albel/llmware/bin/activate
(llmware) albel@Thinkpad:~/llmware$ /home/albel/llmware/bin/python3.10 /home/albel/llmware/llmware_pdf.py

 > Parsing folder: data...
update: pdf_parser - START NEW PDF Processing - file path-/home/albel/llmware_data/tmp/parser_tmp/process_pdf_files/Large Language Models.pdf 
update: pdf_parser - build_obj_master_list - obj created - 3130 
update: pdf_parser - Catalog Dict - <<
/Type /Catalog
/Version /1.4
/Pages 2 0 R
/StructTreeRoot 3 0 R
/MarkInfo 4 0 R
/Lang (en-GB)
/ViewerPreferences 5 0 R
/Metadata 6 0 R
> 
update: pdf_parser - filelen - 5447062 
update: pdf_parser - created additional hidden objstm objects - 0 
update: pdf_parser - page count - 31- pages_found - 31 
update: pdf_parser - global font count- 40 
update: pdf_parser - PAGE PROCESSING-MAIN-LOOP -0-content entries-1 
Segmentation fault

@turnham
Contributor

turnham commented Nov 21, 2023

Hi @AlbelTec I just tried on WSL2 (Windows 10) and was able to get things working with:

ulimit -s 160000

(The higher 32768000 value seems to be only required when running Linux in a container on Mac)

However, in your case it looks like the ulimit setting might not be taking effect at all. You may be hitting this issue:

microsoft/WSL#633

Can you try the workaround suggested at the bottom of that issue:

sudo prlimit --stack=unlimited --pid $$; ulimit -s unlimited
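
To confirm whether the limit actually reached the Python process, it can be read from inside Python with the standard `resource` module. This is only a minimal sketch: the helper names `format_limit` and `describe_stack_limit` are made up for illustration and are not part of llmware.

```python
import resource


def format_limit(value: int) -> str:
    """Render an rlimit value in KB like `ulimit -s` does, or 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return str(value // 1024)  # rlimits are in bytes, ulimit -s prints KB


def describe_stack_limit() -> str:
    """Return the soft and hard stack limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return f"stack soft={format_limit(soft)} hard={format_limit(hard)}"


if __name__ == "__main__":
    print(describe_stack_limit())
```

If the soft value printed here does not match what was passed to `ulimit -s`, the setting did not take effect for this process.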

@AlbelTec
Author

AlbelTec commented Nov 21, 2023

@turnham Thanks! It finally worked. ulimit was stuck at 8192. Running prlimit with root privileges set the limit to unlimited, and the issue is gone. The only drawback is that it has to be run in every session. I can live with that for now, until the Windows version is released.
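
Since the limit has to be set again in every session, a script could check it up front and stop with a readable message instead of a segmentation fault. The sketch below assumes that the 160000 KB value reported for WSL2 in this thread is sufficient; `choose_soft_limit` and `ensure_stack` are hypothetical helper names, not llmware functions. An unprivileged process can only raise its soft limit up to the hard limit, so if the hard limit itself is stuck at a low value, the prlimit command run as root is still needed. Whether raising the limit from inside an already running process is enough for the parser has not been verified here.

```python
import resource

REQUIRED_KB = 160000  # assumption: value reported to work on WSL2 in this thread


def choose_soft_limit(soft: int, hard: int, wanted: int) -> int:
    """Pick a new soft limit in bytes: at least `wanted`, never above `hard`."""
    inf = resource.RLIM_INFINITY
    if soft == inf or soft >= wanted:
        return soft  # already large enough, leave it untouched
    if hard == inf or hard >= wanted:
        return wanted
    return hard  # best effort: cannot exceed the hard limit without root


def ensure_stack(required_kb: int = REQUIRED_KB) -> bool:
    """Try to raise the soft stack limit; return True if it is now sufficient."""
    wanted = required_kb * 1024
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    new_soft = choose_soft_limit(soft, hard, wanted)
    if new_soft != soft:
        resource.setrlimit(resource.RLIMIT_STACK, (new_soft, hard))
    return new_soft == resource.RLIM_INFINITY or new_soft >= wanted


if __name__ == "__main__":
    if not ensure_stack():
        raise SystemExit("Stack limit too low; run the prlimit command as root first.")
```

Calling `ensure_stack()` at the top of the parsing script, before `Parser()` is used, keeps the check in one place.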


3 participants