I'm not sure to have followed correctly the procedure for running paperai with pre-trained vectors #6

DavidRivasPhD · 2020-08-07T04:13:18Z

After successfully installing paperai in Linux (Ubuntu 20.04.1 LTS), I tried to run it by using the pre-trained vectors option to build the model, as follows:

(1) I downloaded the vectors from https://www.kaggle.com/davidmezzetti/cord19-fasttext-vectors#cord19-300d.magnitude
(2) The Downloads folder on my computer ended up with a Zip file containing the vectors.
(3) I created a directory ~/.cord19/vectors/ and moved the downloaded Zip file into this directory (see yellow folder in the figure below).
(4) I extracted the Zip file, which resulted in the grey folder shown below, which contained the file cord19-300d.magnitude
(5) I moved the cord19-300d.magnitude file outside of the grey folder and thus into the ~/.cord19/vectors/ directory (see figure below)

(6) I executed the following command to build the embeddings index with the above pre-trained vectors:

python -m paperai.index

Upon performing the above I got the following error message (see below)

Am I getting this error because the above steps are not the correct ones?
If so, what would be the correct steps?
Otherwise, what other things should I try to eliminate the issue?

davidmezzetti · 2020-08-07T16:07:48Z

Hi David,

Thank you for the detailed report. It looks like you don't have a database file to index (the default location is ~/.cord19/models/articles.sqlite).

The paperetl project is used to build the articles database. That project has instructions on indexing CORD-19 and/or custom PDF files.

If you want to test out CORD-19, you can use a pre-built articles.sqlite database found on Kaggle.
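Before running the indexer, it can help to confirm that both input files are where they are expected. The following is a minimal sketch, not part of paperai: the two paths are the ones quoted in this thread, and the helper name `missing_inputs` is hypothetical.

```python
from pathlib import Path

# Locations quoted in this thread, relative to ~/.cord19 (assumed defaults;
# adjust them if your setup differs).
EXPECTED = {
    "articles database": Path("models") / "articles.sqlite",
    "pre-trained vectors": Path("vectors") / "cord19-300d.magnitude",
}


def missing_inputs(base_dir):
    """Return the labels of expected input files not found under base_dir."""
    base = Path(base_dir).expanduser()
    return [label for label, rel in EXPECTED.items() if not (base / rel).is_file()]


if __name__ == "__main__":
    missing = missing_inputs("~/.cord19")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected input files are present.")
```

If the articles database is reported as missing, the indexer has nothing to read, which matches the diagnosis above.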

DavidRivasPhD · 2020-08-10T23:37:59Z

Hi David,
Thank you for your help above.
paperai is installed on my local Linux computer, so in order to test it out I proceeded as follows:

used pre-built articles.sqlite database from https://www.kaggle.com/davidmezzetti/cord-19-etl/output
and placed it at ~/.cord19/models
used pre-trained vectors from https://www.kaggle.com/davidmezzetti/cord19-fasttext-vectors#cord19-300d.magnitude and placed them at ~/.cord19/vectors/cord19-300d.magnitude
Then, to build the embeddings index, I entered the following command:
python -m paperai.index

Subsequently, the terminal displayed the following output sequence:

Building new model
streamed XXXXXX documents *
Iterated over 3377117 total rows
streamed XXXXXX documents **
Iterated over 3377117 total rows

* where XXXXXX increased progressively from 0 to 3377117 in a matter of 5 minutes and then the line disappeared from the terminal

** where XXXXXX increased progressively from 0 to 3377117 over the course of 6 hours and then the line disappeared from the terminal. After the line disappeared nothing else was displayed on the terminal but my computer’s CPU kept working at full capacity for 7+ more hours until I decided to unplug my computer to terminate the processing by turning it off.
Furthermore, no model or file was stored in ~/.cord19

Do you know what could be the problem?
(also, do the above pre-trained vectors correspond to the above pre-built articles.sqlite database? Was it OK to use both for this test?)
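To judge how much work an indexing run involves, one can count the rows in each table of the database beforehand. This is a rough sketch using only the standard `sqlite3` module. The function name `table_row_counts` is hypothetical, and the table names are read from the file itself rather than assumed.

```python
import sqlite3


def table_row_counts(db_path):
    """Return {table_name: row_count} for every user table in a SQLite file."""
    con = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in con.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
            )
        ]
        # Table names come from the database itself, so they are quoted
        # rather than passed as bound parameters.
        return {
            name: con.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        con.close()
```

For example, `table_row_counts(os.path.expanduser("~/.cord19/models/articles.sqlite"))` would report the size of each table. Note that `sqlite3.connect` creates an empty file if the path does not exist, so check the path first.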

davidmezzetti · 2020-08-11T23:39:01Z

Thank you for the continued attempts to get this installed. Sorry it's not going as smoothly as we would hope.

The first question I have is: what version of the code are you using? I notice you have a fork of paperai. Are you using that fork or going off the main codebase? paperai has had a number of major changes in the last few weeks, and I think the issues may stem from there.

DavidRivasPhD · 2020-08-12T01:30:26Z

Thank you for pointing out that detail. I currently have paperai 1.0.0 installed, and it was installed going off the main codebase (from your GitHub repo). I'll try your latest version, with just a small portion of your pre-built articles.sqlite database, for testing purposes.
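The installed version can also be confirmed from Python itself. This small sketch relies on the standard-library `importlib.metadata` module (Python 3.8 or later), and the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version of paperai found in the current environment, if any.
print("paperai:", installed_version("paperai"))
```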

davidmezzetti · 2020-08-12T01:38:37Z

For testing purposes, I have a very small version that might help with debugging - https://www.kaggle.com/davidmezzetti/cord-19-slim/output

DavidRivasPhD · 2020-08-13T00:44:34Z

For debugging purposes, I reduced the pre-built articles.sqlite database (https://www.kaggle.com/davidmezzetti/cord-19-etl/output) by retaining only the first 500,000 rows of its sections table while keeping the articles and other tables unchanged.
Subsequently, I installed paperai version 1.2.1 from PyPI and ran the following commands, which created the files and directories listed under each one:
$ python -m paperai.vectors
created in ~/.cord19/vectors: cord19-300d.magnitude and cord19-300d.txt
$ python -m paperai.index
created in ~/.cord19/models: config, embeddings, lsa, and scoring
~/.cord19: remained unchanged, with only 2 folders (models and vectors) and no files
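The database reduction described at the start of this comment could be done along the following lines. This is only a sketch under stated assumptions: the comment does not say how the rows were selected, so "first" is taken here to mean the lowest `rowid` values, the table is assumed to be an ordinary rowid table, and the function name `keep_first_rows` is hypothetical. It modifies the file in place, so it should be run on a copy.

```python
import sqlite3


def keep_first_rows(db_path, table, keep):
    """Delete all but the `keep` lowest-rowid rows of `table`; return the new count."""
    con = sqlite3.connect(db_path)
    try:
        # The `with` block commits the delete (or rolls back on error).
        with con:
            con.execute(
                f'DELETE FROM "{table}" WHERE rowid NOT IN '
                f'(SELECT rowid FROM "{table}" ORDER BY rowid LIMIT ?)',
                (keep,),
            )
        # VACUUM must run outside a transaction; it reclaims the freed space.
        con.execute("VACUUM")
        return con.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    finally:
        con.close()
```

For the case described here, the call would be `keep_first_rows(path_to_copy, "sections", 500000)`, leaving the articles table and the other tables untouched.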

As shown in the following screenshot, 2 different attempts to run queries resulted in the same error

Suggestions about solving this problem would be appreciated.

davidmezzetti · 2020-08-13T01:08:02Z

Looks like it's almost there. I will see if I can update the install scripts to do this automatically. In the meantime, so you can continue testing, run the following at the command prompt shown above:

import nltk
nltk.download("stopwords")


davidmezzetti · 2020-08-13T01:31:35Z

After reviewing the code that uses the stopwords methods, I found that this is no longer needed with the current code base. It has been removed for future versions (the change is now included in the master branch).

DavidRivasPhD · 2020-08-13T04:04:38Z

Thank you for fixing it so quickly. It works great! We'll keep working on your model.

davidmezzetti · 2020-08-13T13:17:37Z

Glad it worked, thank you for the diligence in getting this to work.

davidmezzetti added a commit that referenced this issue Aug 13, 2020

Removed nltk stopwords dependency as it's not needed for the highlight method. Helps with issues discussed in #6.

9bebe2f

DavidRivasPhD closed this as completed Aug 13, 2020
