# DOCQA.. #3882
Hey @jyotikhetan, could you please provide more details about the code?
### generator_roberta.py

```python
from jina import Executor, Document, DocumentArray, requests
from transformers import (
    AutoTokenizer,
    AutoModelForQuestionAnswering,
    pipeline,
)


class Generator(Executor):
    answer_model_name = "deepset/bert-large-uncased-whole-word-masking-squad2"
    answer_model = AutoModelForQuestionAnswering.from_pretrained(answer_model_name)
    answer_tokenizer = AutoTokenizer.from_pretrained(answer_model_name)
    nlp = pipeline(
        "question-answering", model=answer_model, tokenizer=answer_tokenizer
    )

    @requests
    def generate(self, docs: DocumentArray, **kwargs) -> DocumentArray:
        for doc in docs.traverse_flat(('r',)):
            context = " ".join([match.text for match in doc.matches])
            # context = doc.matches.append(Document(text=answer))
            # context = "/home/jyoti/jina2/jina-star-wars-qa/data/StarWars_Descriptions.txt"
            qa_input = {"question": doc.text, "context": context}
            result = self.nlp(qa_input)
            result = DocumentArray(Document(result))
        return result
```

### my execution file

```python
import os
import sys
from typing import Iterator

import click
from jina import Flow, Document, DocumentArray
import logging

from pdf_segment import PDFSegmenter

MAX_DOCS = int(os.environ.get("JINA_MAX_DOCS", 0))
cur_dir = os.path.dirname(os.path.abspath(__file__))


def pdf_process():
    pdf_name = ".../1706.03762.pdf"
    segmentor = PDFSegmenter(pdf_name)
    segmentor.text_crafter(save_file_name="data")
    segmentor.image_crafter()


def config(dataset: str = "star-wars") -> None:
    if dataset == "star-wars":
        os.environ["JINA_DATA_FILE"] = os.environ.get("JINA_DATA_FILE", "...../jina-text/data.txt")
    os.environ.setdefault('JINA_WORKSPACE', os.path.join(cur_dir, 'workspace'))
    os.environ.setdefault(
        'JINA_WORKSPACE_MOUNT',
        f'{os.environ.get("JINA_WORKSPACE")}:/workspace/workspace')
    os.environ.setdefault('JINA_LOG_LEVEL', 'INFO')
    os.environ.setdefault('JINA_PORT', str(45678))


def input_generator(file_path: str, num_docs: int) -> Iterator[Document]:
    with open(file_path) as file:
        lines = file.readlines()
        num_lines = len(lines)
        if num_docs:
            for i in range(min(num_docs, num_lines)):
                yield Document(text=lines[i])
        else:
            for i in range(num_lines):
                yield Document(text=lines[i])


def index(num_docs: int) -> None:
    flow = Flow().load_config('flows/flow-index.yml')
    data_path = os.path.join(os.path.dirname(__file__), os.environ.get("JINA_DATA_FILE", None))
    with flow:
        flow.post(on="/index", inputs=input_generator(data_path, num_docs), show_progress=True)


def query(top_k: int) -> None:
    flow = Flow().load_config('flows/flow-query.yml')
    with flow:
        text = input('Please type a question: ')
        doc = Document(content=text)
        result = flow.post(on='/search', inputs=DocumentArray([doc]),
                           # parameters={'top_k': top_k},
                           line_format='text',
                           return_results=True,
                           )
        for doc in result[0].data.docs:
            print(f"\n\nAnswer: {doc.tags['answer']}")


@click.command()
@click.option(
    '--task',
    '-t',
    type=click.Choice(['index', 'query', 'pdf_process'], case_sensitive=False),
)
@click.option('--num_docs', '-n', default=MAX_DOCS)
@click.option('--top_k', '-k', default=5)
@click.option('--data_set', '-d', type=click.Choice(['star-wars']), default='star-wars')
def main(task: str, num_docs: int, top_k: int, data_set: str) -> None:
    config()
    workspace = os.environ['JINA_WORKSPACE']
    logger = logging.getLogger('star-wars-qa')
    if 'index' in task:
        if os.path.exists(workspace):
            logger.error(
                f'The directory {workspace} already exists. '
                'Please remove it before indexing again.'
            )
            sys.exit(1)
    if 'query' in task:
        if not os.path.exists(workspace):
            logger.info(f"The directory {workspace} does not exist. Running indexing...")
            index(num_docs)
    if 'pdf_process' in task:
        if not os.path.exists(workspace):
            logger.info(f"The directory {workspace} does not exist. Running indexing...")
            index(num_docs)
    if task == 'index':
        index(num_docs)
    elif task == 'query':
        query(top_k)
    elif task == "pdf_process":
        pdf_process()


if __name__ == '__main__':
    main()
```

The dataset I am trying this on is the research paper "Attention Is All You Need" (a PDF), which I parsed and saved as text.
### Query I asked

"On what kind of dataset model was trained..??"
May I ask what the expected outcome is, and what pipeline you follow when working without Jina? The first thing I would check is how many lines are being indexed. This seems like a non-optimal way of chunking your document.
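As a quick sanity check, a minimal sketch (not from the issue; `"data.txt"` is a placeholder for whatever file `JINA_DATA_FILE` points to) for counting how many non-empty lines would each become a Document when indexing line by line:

```python
def count_index_lines(path: str) -> int:
    """Count lines that would each become one indexed Document.

    Blank lines are excluded here to show how many *useful* Documents
    line-based chunking would actually produce.
    """
    with open(path) as f:
        return sum(1 for line in f if line.strip())
```

Comparing this count against the number of Documents the Flow reports as indexed is an easy way to spot chunking surprises.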
The query I asked was "On what kind of dataset model was trained..?" and the answer I expected was "WMT 2014 English-German". This is the plain pipeline I used:

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_name = "deepset/roberta-base-squad2"
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
QA_input = {
    'question': 'Why is model conversion important?',
    'context': 'The option to convert models between FARM and transformers gives freedom to the user and let people easily switch between frameworks.'
}
res = nlp(QA_input)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
So what was the result from this pipeline, and what is the result using Jina?
"WMT 2014 English-German" is the result I am getting using the plain pipeline; using Jina I am getting `Answer: models`. As you mentioned, I shall check the number of lines indexed!
I also checked my indexing: I indexed just two paragraphs and checked the answer, and still no luck! I only get a different wrong answer. For the question about training, the answer I am getting is `Answer: 8 NVIDIA P100 GPUs`.
It is very hard to understand. What is the granularity that you want to index? Shouldn't you try to break your text into smaller pieces so that the model works? How do you expect it to work exactly like your plain implementation if the data you provide to the model is not the same?
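To illustrate the chunking point, here is a minimal sketch (not part of the issue's code; the 500-character limit is an arbitrary assumption) that splits text on paragraph breaks and falls back to sentence boundaries for overlong paragraphs:

```python
import re


def chunk_text(text: str, max_chars: int = 500):
    """Split text into paragraph-sized chunks, each at most max_chars long.

    Paragraphs are separated by blank lines; paragraphs longer than
    max_chars are split further on naive sentence boundaries (., !, ?).
    """
    chunks = []
    for para in re.split(r"\n\s*\n", text):
        para = " ".join(para.split())  # normalize internal whitespace
        if not para:
            continue
        if len(para) <= max_chars:
            chunks.append(para)
            continue
        buf = ""
        for sent in re.split(r"(?<=[.!?])\s+", para):
            if buf and len(buf) + len(sent) + 1 > max_chars:
                chunks.append(buf)
                buf = sent
            else:
                buf = f"{buf} {sent}".strip()
        if buf:
            chunks.append(buf)
    return chunks
```

Indexing one such chunk per Document keeps each context self-contained, instead of splitting mid-sentence the way raw `readlines()` can.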
Also, how do you expect to have the same output as in #3882 (comment) if the sentence is not the same?
I was just giving the model reference that I tested on the Hugging Face UI! Then I provided Jina with the exact same dataset I provided to the FARM model via Haystack, where I am getting the right answer! That's why I am not able to figure out where I am going wrong here. I also checked the length of the input document. Please help...
What does it mean, providing the same input? When you do:

```python
def input_generator(file_path: str, num_docs: int) -> Iterator[Document]:
    with open(file_path) as file:
        lines = file.readlines()
        num_lines = len(lines)
        if num_docs:
            for i in range(min(num_docs, num_lines)):
                yield Document(text=lines[i])
        else:
            for i in range(num_lines):
                yield Document(text=lines[i])
```

what is the text added to each of the Documents? Can you please describe in detail what input you want to provide to the model, and what you expect to extract? Also, clarify what it means to you to provide the same input to the FARM model.
This is the paper I provided, as a PDF: "Attention Is All You Need", https://arxiv.org/pdf/1706.03762.pdf. Using the PDF segmenter I extract the text into a .txt file. As of now I am using a single document, so I extract one PDF and dump the data into one text file. I used https://haystack.deepset.ai/ as the search engine, provided the same PDF, used the same RoBERTa model, and got the right answer! I am not able to figure out where I am making a mistake in Jina.
Can you show the exact way you provided the same PDF?
```python
from haystack.document_store import InMemoryDocumentStore, SQLDocumentStore
from haystack.reader import FARMReader, TransformersReader
from haystack.retriever.sparse import TfidfRetriever
from haystack.preprocessor.utils import convert_files_to_dicts, fetch_archive_from_http


def tutorial3_basic_qa_pipeline_without_elasticsearch():
    document_store = InMemoryDocumentStore()

    from haystack.file_converter.pdf import PDFToTextConverter
    converter = PDFToTextConverter(remove_numeric_tables=True)
    dicts = converter.convert(file_path='/home/jyoti/hay_afterupadate/1706.03762/1706.03762.pdf', meta=None)
    dicts = convert_files_to_dicts(dir_path="/home/jyoti/hay_afterupadate/1706.03762/", split_paragraphs=True)
    document_store.write_documents(dicts)

    retriever = TfidfRetriever(document_store=document_store)
    reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

    from haystack.pipeline import ExtractiveQAPipeline
    pipe = ExtractiveQAPipeline(reader, retriever)
    prediction = pipe.run(
        query="on what dataset model was trained?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
    )
    print(prediction)


if __name__ == "__main__":
    tutorial3_basic_qa_pipeline_without_elasticsearch()
```

This is the pipeline where I used the same PDF.
Would it be okay to send the …?
I have sent the link; it's the research paper "Attention Is All You Need".
OK, now I understand what you are saying. I have seen two major things. Something happened when you load the … The sentence: … If you check in your … Another potential source of problems is that the Haystack pipeline is based on … So, summarizing: …
Thank you for clarifying it! As you mentioned, it was an indexing issue at my end. Solved the issue.
`add "--quiet-error" to suppress the exception details`

Please help. Thank you in advance.
Hey @jyotikhetan, could you set the environment variable …? Have you also made sure that the workspace is cleared before running the example?
Also, what do you mean by limiting the size of the document?
(Attached: log.txt.) Yes, I have deleted my previous workspace! Limiting the size means: I took one PDF, parsed it, saved the text in a .txt file, and indexed it; when I queried a question, it threw an error. Whereas when I reduced the size (i.e. I kept only two paragraphs in the .txt file), indexed it, and asked the query, it worked! Thank you.
What is the exact Jina version you are working with?
`jina --version` reports 2.1.5.
@jyotikhetan, maybe you can share here the resulting indexing file. You can zip the … As for the new version, it is a known problem that we are fixing in new Executor versions.
(Attached: workspace.zip.) Okay, so I have to chunk a small part of the document, then pass it to the query part, is it?
Hello @jyotikhetan, what I found is that there are elements in the index that do not have a valid embedding. (I see that you are trying to store a …) I would suggest that you add, at index time, an extra … Something like this:

```python
from jina import Executor, Document, DocumentArray, requests


class Filter(Executor):
    @requests
    def filter(self, docs: DocumentArray, **kwargs) -> DocumentArray:
        filtered_docs = DocumentArray([doc for doc in docs if doc.embedding is not None])
        return filtered_docs
```
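The invariant the suggested `Filter` enforces can be shown without Jina at all; a minimal stand-in (plain dicts instead of Documents, sample data invented):

```python
def drop_missing_embeddings(docs):
    """Keep only items with a non-None embedding, mirroring the Filter's logic."""
    return [d for d in docs if d.get("embedding") is not None]


docs = [
    {"text": "real content", "embedding": [0.1, 0.2, 0.3]},
    {"text": "", "embedding": None},  # e.g. a blank line that was never encoded
]
kept = drop_missing_embeddings(docs)
```

Running this at index time keeps un-encoded Documents (such as ones built from blank lines) out of the vector index entirely.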
Hey @jyotikhetan, was this useful to solve your problem?
Hey @JoanFM, yes it did! I was testing whether I can make it work on different PDFs by indexing them together...
Very happy to hear that, @jyotikhetan. Please do not hesitate to open an issue in case you need it. I am going to close this one!
I am trying to parse a PDF and apply the DOCQA model to it, RoBERTa as well as BERT! But I am not getting the right answer from it, whereas independently it works accurately!
This is my flow.yml
This is my query.yml