feature/4493 Improve Evernote Document Loader #4577

MikeMcGarry · 2023-05-12T12:52:45Z

Improve Evernote Document Loader

When exporting from Evernote you may export more than one note. Currently the Evernote loader concatenates the content of all notes in the export into a single document and only attaches the name of the export file as metadata on the document.

This change ensures that each note is loaded as an independent document and that all available metadata on the note (e.g. author, title, created, updated) is added as metadata on each document.

It also uses html2text, an existing optional dependency, instead of pypandoc. This removes the need to download the pandoc application via download_pandoc() in order to use the pypandoc Python bindings.
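As a rough illustration of the per-note behaviour described above, here is a minimal sketch that splits an export into one record per note and carries the note's fields along as metadata. It is not the loader's actual code: it uses the standard-library xml.etree.ElementTree rather than lxml and html2text, the sample export is heavily simplified (real note content is markup wrapped in CDATA), and the load_notes name, the plain-dict output and the "source" key on every note are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, List

# Simplified stand-in for an Evernote export containing two notes.
SAMPLE_ENEX = """<en-export>
  <note>
    <title>First note</title>
    <author>Alice</author>
    <created>20230101T000000Z</created>
    <content>Hello from note one</content>
  </note>
  <note>
    <title>Second note</title>
    <author>Bob</author>
    <created>20230102T000000Z</created>
    <content>Hello from note two</content>
  </note>
</en-export>"""


def load_notes(enex_xml: str, source: str) -> List[Dict[str, Any]]:
    """Return one record per <note>, with the other child tags as metadata."""
    root = ET.fromstring(enex_xml)
    documents: List[Dict[str, Any]] = []
    for note in root.iter("note"):
        # Assumption: each record also remembers which export it came from.
        metadata: Dict[str, str] = {"source": source}
        text = ""
        for child in note:
            value = (child.text or "").strip()
            if child.tag == "content":
                text = value
            else:
                metadata[child.tag] = value
        documents.append({"page_content": text, "metadata": metadata})
    return documents


docs = load_notes(SAMPLE_ENEX, source="sample.enex")
for doc in docs:
    print(doc["metadata"]["title"], "->", doc["page_content"])
```

Each note ends up as its own record, so downstream code can filter or cite by title, author or creation date instead of only by export file name.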

Fixes #4493

Before submitting

Who can review?

@eyurtsev / @dev2049

Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested:

MikeMcGarry · 2023-05-12T12:54:10Z

pyproject.toml

+lxml = "^4"
+html2text = "^2020.1.16"


Required to run the unit tests. html2text is currently an optional dependency for langchain, and I have added lxml as another optional dependency.

I'm still in the process of updating the testing workflow so we can handle optional dependencies, so nothing is documented yet (sorry about that! :()

Adding the dependencies here will make the tests pass, but won't actually work when the langchain package is installed for users (since it's installed without optional dependencies).

Could you move these to the extended_testing group below?

extended_testing = ["pypdf", "pdfminer.six"]

Ahhh yes, no trouble at all. That sounds great, this way we don't have duplicated dependency versions in pyproject.toml either. I've made this change and added both lxml and html2text to extended_testing.

eyurtsev · 2023-05-12T15:10:21Z

langchain/document_loaders/evernote.py

-        text = _parse_note_xml(self.file_path)
-        metadata = {"source": self.file_path}
-        return [Document(page_content=text, metadata=metadata)]
+        """Load documents from EverNote export file."""


I don't know if any users were relying on getting a single document back as a feature, if so, this would break backwards compatibility.

What do you think about adding an extra named argument in the initializer that sets the mode?

Yeah, no trouble at all. The EverNoteLoader now takes a load_single_document argument at initialisation, which defaults to True. I think in the future we should make the default False; perhaps we can do this with a more significant version update.
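To make the backwards-compatibility switch concrete, here is a small sketch of how a flag like load_single_document could select between the two behaviours. The build_documents function and the dict-based records are hypothetical and only illustrate the idea; the single-document branch mirrors the removed lines shown in the diff above (concatenated text with only a "source" entry in the metadata).

```python
from typing import Any, Dict, List


def build_documents(
    notes: List[Dict[str, Any]],
    source: str,
    load_single_document: bool = True,
) -> List[Dict[str, Any]]:
    """Combine parsed notes into one record, or keep one record per note."""
    if load_single_document:
        # Legacy behaviour: one document for the whole export.
        text = "".join(note["page_content"] for note in notes)
        return [{"page_content": text, "metadata": {"source": source}}]
    # New behaviour: one document per note, keeping each note's metadata.
    return [
        {
            "page_content": note["page_content"],
            "metadata": {**note["metadata"], "source": source},
        }
        for note in notes
    ]


parsed = [
    {"page_content": "alpha", "metadata": {"title": "A"}},
    {"page_content": "beta", "metadata": {"title": "B"}},
]
single = build_documents(parsed, "export.enex")
per_note = build_documents(parsed, "export.enex", load_single_document=False)
```

Defaulting the flag to True keeps existing callers working unchanged, while new callers opt in to per-note documents explicitly.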

eyurtsev · 2023-05-12T15:12:46Z

langchain/document_loaders/evernote.py

+        return rsc_dict
+
+    @classmethod
+    def _parse_note(cls, note: List, prefix: Optional[str] = None) -> dict:


It's much more common for classmethods to return instances of the class.

Maybe make it into a staticmethod or else detach from class as python is a land of verbs. :)

Yeah you are spot on, not sure what I was thinking, I've moved these to static methods which they should have been in the first place.

eyurtsev · 2023-05-12T15:14:30Z

langchain/document_loaders/evernote.py

+            add_prefix(key): value for key, value in note_dict.items()
+        }
+
+    @classmethod


What do you think about using standalone functions instead?

A developer will not have to worry about anything unexpected happening during inheritance (where the identity of cls changes).

Yes, good point. These should be static methods instead of class methods. As they only exist to support the EverNoteLoader class I think they should exist inside the class rather than outside in the file. Are you happy with them as static methods?

I see your message above about using static methods instead so that's perfect. Have updated to be static methods.
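For readers following the classmethod versus staticmethod discussion, the sketch below shows the shape of a helper that never touches cls and so behaves identically on subclasses. The class names and the add_prefix_to_keys helper are hypothetical; only the key-prefixing idea is taken from the diff fragment above, and the "prefix.key" format is an assumption.

```python
from typing import Dict, Optional


class NoteParser:
    """Hypothetical stand-in for a loader that groups its helpers."""

    @staticmethod
    def add_prefix_to_keys(
        note_dict: Dict[str, str], prefix: Optional[str] = None
    ) -> Dict[str, str]:
        # No cls or self is used, so subclassing cannot change the behaviour.
        if prefix is None:
            return dict(note_dict)
        return {f"{prefix}.{key}": value for key, value in note_dict.items()}


class CustomNoteParser(NoteParser):
    """A subclass inherits the helper unchanged."""


base_result = NoteParser.add_prefix_to_keys({"title": "Groceries"}, prefix="note")
sub_result = CustomNoteParser.add_prefix_to_keys({"title": "Groceries"}, prefix="note")
```

A classmethod would receive CustomNoteParser as cls in the second call, which is only useful when the method needs to construct or inspect the class; a pure data transformation like this does not.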

eyurtsev · 2023-05-12T15:18:15Z

pyproject.toml

+lxml = "^4"
+html2text = "^2020.1.16"


I'm still in the process of updating the testing workflow so we can handle optional dependencies, so nothing is documented yet (sorry about that! :()

Adding the dependencies here will make the tests pass, but won't actually work when the langchain package is installed for users (since it's installed without optional dependencies).

Could you move these to the extended_testing group below?

extended_testing = ["pypdf", "pdfminer.six"]

eyurtsev · 2023-05-12T15:21:38Z

tests/unit_tests/document_loader/test_evernote_loader.py

+        current_dir = pathlib.Path(__file__).parent
+        return os.path.join(current_dir, "sample_documents", notebook_name)
+
+    def test_evernoteloader_loadnotebook_eachnoteisindividualdocument(self) -> None:


@MikeMcGarry you can use the following (still undocumented) custom marker to get the tests to run. They'll need to be registered in the extended_testing extra

Suggested change

-    def test_evernoteloader_loadnotebook_eachnoteisindividualdocument(self) -> None:
+    @pytest.mark.requires("lxml", "html2txt")
+    def test_evernoteloader_loadnotebook_eachnoteisindividualdocument(self) -> None:

Yeah, sure thing. However, when I add this decorator and call make test (after calling poetry lock --no-update and poetry install -E extended_testing), all the tests are skipped?

Ahh never mind, I see how it works, had a typo in html2txt. Fixed this. I see that it will skip the tests if the dependency is not installed. Nice!
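The skip behaviour discussed in this thread can be approximated with a small import check. The sketch below is not the project's marker implementation (which is described as undocumented at this point); missing_packages and should_skip are hypothetical helpers that show why a misspelled module name leads to every marked test being skipped.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(required: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in required if importlib.util.find_spec(name) is None]


def should_skip(required: Iterable[str]) -> bool:
    """A marked test would be skipped if any required module is missing."""
    return bool(missing_packages(required))


# "json" ships with Python; a misspelled name is reported as missing
# whenever no module with that exact name is installed.
print(missing_packages(["json", "html2txt"]))
```

A conftest hook could call a helper like this for each test carrying the marker and attach a skip with the missing names as the reason, which would make a typo such as html2txt easier to spot in the test report.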

eyurtsev · 2023-05-12T15:23:26Z

@MikeMcGarry Thank you for the PR. This is a major improvement to how the Evernote document loader works!

@hwchase17 thoughts on backwards compatibility?

MikeMcGarry · 2023-05-13T11:54:35Z

Hi @eyurtsev thanks for your detailed review, your guidance on setting up the dependencies appropriately was really helpful and you are on the money about those methods which should be static methods rather than class methods. I've addressed all your feedback including preserving the existing behaviour and making the new behaviour available by passing load_single_document=False into the initialisation of the EverNoteLoader. I've also added some additional unit tests to improve our coverage.

eyurtsev · 2023-05-15T02:43:21Z

Thank you! As a heads up, I will commandeer this tomorrow to help resolve merge conflicts around poetry.lock files. There's a bunch of PRs that are bumping into merge conflicts, so I will be doing the same in all of them to merge the code faster.

MikeMcGarry · 2023-05-15T05:00:06Z

Okay, sounds good! I've run poetry run black . and committed the updated files, as I see one of the pipeline tests failed because the files were not formatted appropriately.

MikeMcGarry · 2023-05-15T05:06:57Z

I've also rebased onto master.

MikeMcGarry · 2023-05-15T05:08:43Z

poetry.lock

+all = ["O365", "aleph-alpha-client", "anthropic", "arxiv", "atlassian-python-api", "azure-cosmos", "azure-identity", "beautifulsoup4", "clickhouse-connect", "cohere", "deeplake", "docarray", "duckduckgo-search", "elasticsearch", "faiss-cpu", "google-api-python-client", "google-search-results", "gptcache", "hnswlib", "html2text", "huggingface_hub", "jina", "jinja2", "jq", "lancedb", "lark", "lxml", "manifest-ml", "networkx", "nlpcloud", "nltk", "nomic", "openai", "opensearch-py", "pdfminer-six", "pexpect", "pgvector", "pinecone-client", "pinecone-text", "protobuf", "psycopg2-binary", "pyowm", "pypdf", "pytesseract", "pyvespa", "qdrant-client", "redis", "sentence-transformers", "spacy", "steamship", "tensorflow-text", "tiktoken", "torch", "transformers", "weaviate-client", "wikipedia", "wolframalpha"]
+azure = ["azure-core", "azure-cosmos", "azure-identity", "openai"]


When I ran poetry lock --no-update it put them all back in alphabetical order.

dev2049 · 2023-05-17T00:34:04Z

seems like one of the test files is 11k lines, any chance we could use a smaller one 😅

MikeMcGarry · 2023-05-17T14:24:17Z

It's a base64 encoded image. I wanted to make sure images were removed from the note and didn't end up in the page context.

Do we have some size constraints we want to consider with test data files? These aren't part of the package contents when published to PyPi right?

dev2049 · 2023-05-17T17:53:50Z

It's a base64 encoded image. I wanted to make sure images were removed from the note and didn't end up in the page context.

Do we have some size constraints we want to consider with test data files? These aren't part of the package contents when published to PyPi right?

they're not part of the package, it's still nice to keep the repo small though. could we just use a small image (shouldn't change quality of the test right?)

MikeMcGarry · 2023-05-19T02:32:16Z

It's a base64 encoded image. I wanted to make sure images were removed from the note and didn't end up in the page context.
Do we have some size constraints we want to consider with test data files? These aren't part of the package contents when published to PyPi right?

they're not part of the package, it's still nice to keep the repo small though. could we just use a small image (shouldn't change quality of the test right?)

Okay, sure thing. I've scaled down the image; it's now 18KB instead of 2.1MB.
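The point that a small image does not weaken the test can be illustrated with a check that builds its own tiny payload. This is a sketch under assumptions, not the test from this PR: the export below is simplified, the note_text helper is hypothetical, and a few fake bytes stand in for a real image stored as a base64 resource next to the note content.

```python
import base64
import xml.etree.ElementTree as ET

# A handful of bytes is enough: the check only cares that the payload is excluded.
FAKE_IMAGE_B64 = base64.b64encode(b"\x89PNG fake image bytes").decode("ascii")

ENEX_WITH_MEDIA = f"""<en-export>
  <note>
    <title>Note with image</title>
    <content>Text before the image</content>
    <resource>
      <data encoding="base64">{FAKE_IMAGE_B64}</data>
      <mime>image/png</mime>
    </resource>
  </note>
</en-export>"""


def note_text(enex_xml: str) -> str:
    """Return only the text of the first note's <content> element."""
    root = ET.fromstring(enex_xml)
    note = root.find("note")
    if note is None:
        return ""
    content = note.find("content")
    if content is None:
        return ""
    return (content.text or "").strip()


text = note_text(ENEX_WITH_MEDIA)
assert FAKE_IMAGE_B64 not in text
```

The assertion holds for a payload of any size, which is why shrinking the fixture keeps the same coverage while keeping the repository small.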

MikeMcGarry · 2023-05-19T02:34:01Z

@dev2049 / @eyurtsev I've accommodated all the feedback and rebased onto master. Are there any outstanding points we need to address?

eyurtsev · 2023-05-19T14:20:19Z

Code looks ready for merging. Waiting for tests to pass and then can merge in. Thank you @MikeMcGarry

MikeMcGarry changed the title ~~feature/4493 Improved the Evernote loader to return a document per no…~~ feature/4493 Improved the Evernote loader May 12, 2023

MikeMcGarry changed the title ~~feature/4493 Improved the Evernote loader~~ feature/4493 Improve Evernote Document Loader May 12, 2023

MikeMcGarry commented May 12, 2023

MikeMcGarry force-pushed the feature/4493.ImproveEvernoteLoader branch from 3b8c59d to 7612921 Compare May 12, 2023 13:04

eyurtsev self-requested a review May 12, 2023 15:00

eyurtsev reviewed May 12, 2023

eyurtsev requested a review from hwchase17 May 12, 2023 15:23

eyurtsev added the 03 enhancement Enhancement of existing functionality label May 12, 2023

MikeMcGarry force-pushed the feature/4493.ImproveEvernoteLoader branch 3 times, most recently from 713f4be to 9019949 Compare May 13, 2023 11:51

MikeMcGarry requested a review from eyurtsev May 13, 2023 11:55

MikeMcGarry force-pushed the feature/4493.ImproveEvernoteLoader branch from cbf452b to 40d44df Compare May 15, 2023 05:06

MikeMcGarry commented May 15, 2023

Mike McGarry added 5 commits May 19, 2023 12:17

feature/4493 Improved the Evernote loader to return a document per note instead of a single document for the entire export. Also ensured that all available metadata is added to each note

9e89847

feature/4493 Added additional test coverage, fixed the management of optional dependencies and updated to use static methods instead of class methods

11a7c93

feature/4493 Preserve existing behaviour and make the new behaviour optional. Update Jupyter Notebook example

755498b

feature/4493 Run black across file

f43da0c

Use smaller image for media test and move to new document_loaders directory

0d98e56

MikeMcGarry force-pushed the feature/4493.ImproveEvernoteLoader branch from 40d44df to 0d98e56 Compare May 19, 2023 02:31

Remove whitespace changes

1db87b6

eyurtsev added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label May 19, 2023

dev2049 added 2 commits May 19, 2023 13:57

fmt

bf4b019

lint

7ed37ae

dev2049 merged commit ddd595f into langchain-ai:master May 19, 2023
13 checks passed

danielchalef mentioned this pull request Jun 5, 2023

Zep Hybrid Search #5742

Merged

This was referenced Jun 25, 2023

Zep Authentication #6725

Closed

Zep Authentication #6728

Merged
