-
Notifications
You must be signed in to change notification settings - Fork 13.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feature/4493 Improve Evernote Document Loader #4577
feature/4493 Improve Evernote Document Loader #4577
Conversation
pyproject.toml
Outdated
lxml = "^4" | ||
html2text = "^2020.1.16" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Required to run the unit tests. html2text
is currently an optional dependency for langchain and have added lxml
as another optional dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still in the process of updating the testing workflow so we can handle optional dependencies, so nothing is document yet (sorry about that! :()
Adding the dependencies here, will make the tests pass, but won't actually work when the langchain package is installed for users (since it's installed without optional dependencies).
Could you move these to the extended_testing
group below?
extended_testing = ["pypdf", "pdfminer.six"]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhh yes, no trouble at all. That sounds great, this way we don't have duplicated dependency versions in pyproject.toml
either. I've made this change and added both lxml
and html2text
to extended_testing
.
3b8c59d
to
7612921
Compare
text = _parse_note_xml(self.file_path) | ||
metadata = {"source": self.file_path} | ||
return [Document(page_content=text, metadata=metadata)] | ||
"""Load documents from EverNote export file.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if any users were relying on getting a single document back as a feature, if so, this would break backwards compatibility.
What do you think about adding an extra named argument in the initializer that sets the mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah no trouble at all, the EverNoteLoader
now takes a load_single_document
argument in initialisation and this defaults to True
. I think in the future we should make the default False
, perhaps we can do this with a more significant version update.
return rsc_dict | ||
|
||
@classmethod | ||
def _parse_note(cls, note: List, prefix: Optional[str] = None) -> dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's much more common for classmethods to return instances of the class.
Maybe make it into a staticmethod or else detach from class as python is a land of verbs. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah you are spot on, not sure what I was thinking, I've moved these to static methods which they should have been in the first place.
add_prefix(key): value for key, value in note_dict.items() | ||
} | ||
|
||
@classmethod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about using standalone functions instead?
A developer will not have to worry about anything unexpected happening during inheritance (where identity of cls
changes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good point. These should be static methods instead of class methods. As they only exist to support the EverNoteLoader class I think they should exist inside the class rather than outside in the file. Are you happy with them as static methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your message above about using static methods instead so that's perfect. Have updated to be static methods.
pyproject.toml
Outdated
lxml = "^4" | ||
html2text = "^2020.1.16" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still in the process of updating the testing workflow so we can handle optional dependencies, so nothing is document yet (sorry about that! :()
Adding the dependencies here, will make the tests pass, but won't actually work when the langchain package is installed for users (since it's installed without optional dependencies).
Could you move these to the extended_testing
group below?
extended_testing = ["pypdf", "pdfminer.six"]
current_dir = pathlib.Path(__file__).parent | ||
return os.path.join(current_dir, "sample_documents", notebook_name) | ||
|
||
def test_evernoteloader_loadnotebook_eachnoteisindividualdocument(self) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MikeMcGarry you can use the following (still undocumented) custom marker to get the tests to run. They'll need to be registered in the extended_testing
extra
def test_evernoteloader_loadnotebook_eachnoteisindividualdocument(self) -> None: | |
@pytest.mark.requires("lxml", "html2txt") | |
def test_evernoteloader_loadnotebook_eachnoteisindividualdocument(self) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah sure thing, however when I add this decorator and call make test
(after calling poetry lock --no-update
and poetry install -E extended_testing
all the tests are skipped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh never mind, I see how it works, had a typo in html2txt
. Fixed this. I see that it will skip the tests if the dependency is not installed. Nice!
@MikeMcGarry Thank you for the PR. This is a major improvement to how ever note document loader works! @hwchase17 thoughts on backwards compatibility? |
713f4be
to
9019949
Compare
Hi @eyurtsev thanks for your detailed review, your guidance on setting up the dependencies appropriately was really helpful and you are on the money about those methods which should be static methods rather than class methods. I've addressed all your feedback including preserving the existing behaviour and making the new behaviour available by passing |
Thank you! as a heads up, will commandeer tomorrow to help resolve merge conflicts around poetry.lock files, there's a bunch of PRs that are bumping into merge conflicts, so will be doing the same in all to merge the code faster. |
Okay sounds good! I've run |
cbf452b
to
40d44df
Compare
I've also rebased onto master. |
poetry.lock
Outdated
all = ["O365", "aleph-alpha-client", "anthropic", "arxiv", "atlassian-python-api", "azure-cosmos", "azure-identity", "beautifulsoup4", "clickhouse-connect", "cohere", "deeplake", "docarray", "duckduckgo-search", "elasticsearch", "faiss-cpu", "google-api-python-client", "google-search-results", "gptcache", "hnswlib", "html2text", "huggingface_hub", "jina", "jinja2", "jq", "lancedb", "lark", "lxml", "manifest-ml", "networkx", "nlpcloud", "nltk", "nomic", "openai", "opensearch-py", "pdfminer-six", "pexpect", "pgvector", "pinecone-client", "pinecone-text", "protobuf", "psycopg2-binary", "pyowm", "pypdf", "pytesseract", "pyvespa", "qdrant-client", "redis", "sentence-transformers", "spacy", "steamship", "tensorflow-text", "tiktoken", "torch", "transformers", "weaviate-client", "wikipedia", "wolframalpha"] | ||
azure = ["azure-core", "azure-cosmos", "azure-identity", "openai"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I ran poetry lock --no-update
it put all the back in alphabetical order.
seems like one of the test files is 11k lines, any chance we could use a smaller one 😅 |
It's a base64 encoded image. I wanted to make sure images were removed from the note and didn't end up in the page context. Do we have some size constraints we want to consider with test data files? These aren't part of the package contents when published to PyPi right? |
they're not part of the package, it's still nice to keep the repo small though. could we just use a small image (shouldn't change quality of the test right?) |
…te instead of a single document for the entire export. Also ensured that all available metadata is added to each note
…optional dependencies and updated to use static methods instead of class methods
…ptional. Update Jupyter Notebook example
40d44df
to
0d98e56
Compare
Okay sure thing, I've scaled down the image it's now 18KB instead of 2.1MB. |
Code looks ready for merging. Waiting for tests to pass and then can merge in. Thank you @MikeMcGarry |
Improve Evernote Document Loader
When exporting from Evernote you may export more than one note. Currently the Evernote loader concatenates the content of all notes in the export into a single document and only attaches the name of the export file as metadata on the document.
This change ensures that each note is loaded as an independent document and all available metadata on the note e.g. author, title, created, updated are added as metadata on each document.
It also uses an existing optional dependency of
html2text
instead ofpypandoc
to remove the need to download the pandoc application viadownload_pandoc()
to be able to use thepypandoc
python bindings.Fixes #4493
Before submitting
Who can review?
@eyurtsev / @dev2049
Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested: