
Add topk functionality to qa pipeline #480

Closed · wants to merge 3 commits
Conversation


@rcali21 rcali21 commented May 30, 2023

This is my first attempt. I'm having some trouble representing the data in a DataFrame as shown in #406. I also noticed that code was borrowed from https://github.com/neuml/txtai/blob/master/examples/20_Extractive_QA_to_build_structured_data.ipynb, so I assume this first attempt will need some work if that is the intended output.


rcali21 commented May 30, 2023

Additionally, as the issue in #406 discusses, the query needed to be changed to get Ibuprofen as a returned answer:

```python
context = ["The doctor administered a 7g dose of Acetarsol and a 16mg dose of Ibuprofen",
           "The patient took one Alosetron tablet"]

queries = ["What is the specific drug taken?", "What is the dose?"]
```

versus:

```python
context = ["The doctor administered a 7g dose of Acetarsol and a 16mg dose of Ibuprofen",
           "The patient took one Alosetron tablet"]

queries = ["What is the drug taken?", "What is the dose?"]
```

davidmezzetti (Member) commented

Thanks for taking on this issue! A couple of initial things:

  1. Did you run `make data && make coverage` locally? I think we'll need a new unit test to keep test coverage up.
  2. The coding style for txtai veers slightly from standard Python. For example, `top_k` would be `topk`, `answers_list` would be `answerslist`, and so forth.
  3. To be consistent with other pipelines, `topk` would be an optional argument to the `__call__` method instead of an `__init__` argument.
  4. Is it intended for this to work in a backwards compatible way? The hope is yes.
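The suggestion in points 3 and 4 above — `topk` as an optional `__call__` argument that preserves the existing single-answer behavior — can be sketched as follows. This is a minimal standalone sketch, not the actual txtai implementation: the stubbed candidate list stands in for the underlying QA model, and the return shapes are assumptions.

```python
# Minimal sketch: topk as an optional __call__ argument with a
# backwards compatible default. The stubbed candidates below stand in
# for the scored answers a real extractive QA model would produce.

class Questions:
    def __init__(self):
        # Candidate (answer, score) pairs per question, ordered best-first
        self.candidates = {
            "What is the drug taken?": [("Ibuprofen", 0.9), ("Acetarsol", 0.7)]
        }

    def __call__(self, questions, contexts, topk=1):
        results = []
        for question in questions:
            answers = self.candidates.get(question, [])[:topk]
            if topk == 1:
                # Backwards compatible: a single answer string per question
                results.append(answers[0][0] if answers else None)
            else:
                # New behavior: a list of (answer, score) pairs per question
                results.append(answers)
        return results

qa = Questions()
print(qa(["What is the drug taken?"], ["..."]))          # ['Ibuprofen']
print(qa(["What is the drug taken?"], ["..."], topk=2))  # [[('Ibuprofen', 0.9), ('Acetarsol', 0.7)]]
```

Because the default is `topk=1`, existing callers keep seeing the same single-answer output, which is what makes the change backwards compatible.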


rcali21 commented Jun 2, 2023

> Thanks for taking on this issue! A couple of initial things:
>
>   1. Did you run `make data && make coverage` locally? I think we'll need a new unit test to keep test coverage up.
>   2. The coding style for txtai veers slightly from standard Python. For example, `top_k` would be `topk`, `answers_list` would be `answerslist`, and so forth.
>   3. To be consistent with other pipelines, `topk` would be an optional argument to the `__call__` method instead of an `__init__` argument.
>   4. Is it intended for this to work in a backwards compatible way? The hope is yes.

Hi David,

I ran both of those a few weeks back when I initially cloned the repo. I just ran again to be certain, and `make data` returned:

```
(txtai_work) ➜ txtai git:(add_topk) ✗ make data
mkdir -p /tmp/txtai
wget -N https://github.com/neuml/txtai/releases/download/v3.5.0/tests.tar.gz -P /tmp
make: wget: No such file or directory
make: *** [data] Error 1
```

However, `make coverage` runs with some errors.

  1. Noted.
  2. Noted.
  3. Can you clarify in what way? Thanks.

davidmezzetti (Member) commented

Hi Ryan,

I just checked in a change that should show a better message when wget is not found. You'll need to install that package dependency at the OS level.

In terms of backwards compatibility, if the tests run without errors, then we should be good to go.


rcali21 commented Jun 7, 2023

> Hi Ryan,
>
> I just checked in a change that should show a better message when wget is not found. You'll need to install that package dependency at the OS level.
>
> In terms of backwards compatibility, if the tests run without errors, then we should be good to go.

I just ran `make data`, which now works, but `make coverage` now returns a segfault.

```
coverage run -m unittest discover -v -s ./test/python
testAnnoy (testann.TestANN)
Test Annoy backend ... ERROR
testAnnoyCustom (testann.TestANN)
Test Annoy backend with custom settings ... ERROR
testCustomBackend (testann.TestANN)
Test resolving a custom backend ...
make: *** [coverage] Segmentation fault: 11
```

Was this what you were referring to before, or was that a different case?

davidmezzetti (Member) commented

It appears you're running into this macOS issue: #377

I've long wanted to solve it, but I don't develop on macOS. From other discussions, it appears to be some sort of library conflict (see #300 (comment)).

Perhaps you could try reinstalling or reconfiguring the OpenMP library to see if that helps? What versions of Python and PyTorch are you using locally?

davidmezzetti (Member) commented

Hi Ryan,

With the summer coming to a close, I assume you didn't get time to come back around on this. Where do you think this stands? Is what you submitted complete, or should I close this?

davidmezzetti (Member) commented

Closing this due to inactivity. Feel free to re-open should there be future interest in completing this work.
