feature/4493 Improve Evernote Document Loader #4577

Merged

MikeMcGarry
Contributor

@MikeMcGarry MikeMcGarry commented May 12, 2023

Improve Evernote Document Loader

When exporting from Evernote you may export more than one note. Currently the Evernote loader concatenates the content of all notes in the export into a single document and only attaches the name of the export file as metadata on the document.

This change ensures that each note is loaded as an independent document and all available metadata on the note e.g. author, title, created, updated are added as metadata on each document.
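
The per-note behaviour described above can be sketched roughly as follows. This is a minimal illustration and not the code merged in this PR: it uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml, and the helper name `parse_enex_notes` and the sample export are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical two-note export; real .enex files carry more fields.
SAMPLE_ENEX = """<en-export>
  <note>
    <title>First note</title>
    <created>20230512T010203Z</created>
    <note-attributes><author>Mike</author></note-attributes>
    <content><![CDATA[<en-note><div>Hello</div></en-note>]]></content>
  </note>
  <note>
    <title>Second note</title>
    <created>20230513T010203Z</created>
    <content><![CDATA[<en-note><div>World</div></en-note>]]></content>
  </note>
</en-export>"""


def parse_enex_notes(enex_xml: str, source: str) -> List[Dict[str, object]]:
    """Return one document-like dict per <note>, each with its own metadata."""
    documents: List[Dict[str, object]] = []
    for note in ET.fromstring(enex_xml).iter("note"):
        metadata: Dict[str, str] = {"source": source}
        for child in note:
            if child.tag in ("content", "resource"):
                continue  # body text is handled below; attachments are skipped
            if child.tag == "note-attributes":
                # Flatten nested attributes such as <author> into the metadata.
                for attr in child:
                    metadata[attr.tag] = (attr.text or "").strip()
            else:
                metadata[child.tag] = (child.text or "").strip()
        # The note body is an XHTML fragment stored as CDATA; keep only its text.
        content = note.findtext("content", default="")
        text = "".join(ET.fromstring(content).itertext()).strip() if content.strip() else ""
        documents.append({"page_content": text, "metadata": metadata})
    return documents


docs = parse_enex_notes(SAMPLE_ENEX, source="export.enex")
```

Each entry in `docs` carries its own title, created timestamp and (when present) author, instead of a single combined text with only the file name attached.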

It also replaces pypandoc with html2text, an existing optional dependency. This removes the need to download the pandoc application via download_pandoc() before the pypandoc Python bindings can be used.
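
As a rough sketch of treating html2text as an optional dependency: the wrapper `html_to_text` and the fallback parser below are hypothetical and are not part of this PR. The only html2text call used is `html2text.html2text(...)`.

```python
from html.parser import HTMLParser
from typing import List

try:
    import html2text  # optional third-party dependency
except ImportError:
    html2text = None


class _TextOnly(HTMLParser):
    """Very small fallback that keeps text nodes and drops all tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def html_to_text(html: str) -> str:
    """Convert an HTML fragment to text, preferring html2text when installed."""
    if html2text is not None:
        return html2text.html2text(html).strip()
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


converted = html_to_text("<div>Hello <b>world</b></div>")
```

With html2text installed the result is Markdown-flavoured text; the fallback returns plain text only, so the two outputs are not identical.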

Fixes #4493

Before submitting

Who can review?

@eyurtsev / @dev2049

Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested:

@MikeMcGarry MikeMcGarry changed the title feature/4493 Improved the Evernote loader to return a document per no… feature/4493 Improved the Evernote loader May 12, 2023
@MikeMcGarry MikeMcGarry changed the title feature/4493 Improved the Evernote loader feature/4493 Improve Evernote Document Loader May 12, 2023
pyproject.toml Outdated
Comment on lines 112 to 113
lxml = "^4"
html2text = "^2020.1.16"
Contributor Author

Required to run the unit tests. html2text is currently an optional dependency for langchain, and I have added lxml as another optional dependency.

Collaborator

I'm still in the process of updating the testing workflow so we can handle optional dependencies, so nothing is documented yet (sorry about that! :()

Adding the dependencies here will make the tests pass, but the code won't actually work when the langchain package is installed for users (since it's installed without optional dependencies).

Could you move these to the extended_testing group below?

extended_testing = ["pypdf", "pdfminer.six"]

Contributor Author

Ahhh yes, no trouble at all. That sounds great, this way we don't have duplicated dependency versions in pyproject.toml either. I've made this change and added both lxml and html2text to extended_testing.

@MikeMcGarry MikeMcGarry force-pushed the feature/4493.ImproveEvernoteLoader branch from 3b8c59d to 7612921 Compare May 12, 2023 13:04
@eyurtsev eyurtsev self-requested a review May 12, 2023 15:00
text = _parse_note_xml(self.file_path)
metadata = {"source": self.file_path}
return [Document(page_content=text, metadata=metadata)]
"""Load documents from EverNote export file."""
Collaborator

I don't know if any users were relying on getting a single document back as a feature, if so, this would break backwards compatibility.

What do you think about adding an extra named argument in the initializer that sets the mode?

Contributor Author

Yeah no trouble at all, the EverNoteLoader now takes a load_single_document argument in initialisation and this defaults to True. I think in the future we should make the default False, perhaps we can do this with a more significant version update.
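
A minimal sketch of how such a flag could preserve the old behaviour by default. `SketchNoteLoader` is a hypothetical stand-in that takes already-parsed notes, and joining the note texts without a separator is an assumption, not something confirmed by this thread.

```python
from typing import Dict, List

Doc = Dict[str, object]


class SketchNoteLoader:
    """Hypothetical stand-in for the loader, operating on pre-parsed notes."""

    def __init__(
        self, notes: List[Doc], source: str, load_single_document: bool = True
    ) -> None:
        self.notes = notes
        self.source = source
        # True keeps the previous behaviour: one combined document.
        self.load_single_document = load_single_document

    def load(self) -> List[Doc]:
        if not self.load_single_document:
            # New behaviour: one document per note, metadata untouched.
            return list(self.notes)
        # Old behaviour: concatenate all notes (separator is an assumption)
        # and attach only the export file as metadata.
        combined = "".join(str(note["page_content"]) for note in self.notes)
        return [{"page_content": combined, "metadata": {"source": self.source}}]


example_notes: List[Doc] = [
    {"page_content": "Hello", "metadata": {"title": "First note"}},
    {"page_content": "World", "metadata": {"title": "Second note"}},
]
single = SketchNoteLoader(example_notes, "export.enex").load()
per_note = SketchNoteLoader(example_notes, "export.enex", load_single_document=False).load()
```

Existing callers that pass no argument keep getting one document, while new callers opt in to per-note documents explicitly.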

return rsc_dict

@classmethod
def _parse_note(cls, note: List, prefix: Optional[str] = None) -> dict:
Collaborator

It's much more common for classmethods to return instances of the class.

Maybe make it into a staticmethod, or else detach it from the class, as Python is a land of verbs. :)

Contributor Author

Yeah you are spot on, not sure what I was thinking, I've moved these to static methods which they should have been in the first place.
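
For illustration, a helper of this kind written as a static method might look like the sketch below. The class name `PrefixDemo`, the method signature and the `.` separator are assumptions made for the example; only the idea of prefixing metadata keys comes from the diff context above.

```python
from typing import Dict, Optional


class PrefixDemo:
    """Hypothetical holder for a helper that needs neither `self` nor `cls`."""

    @staticmethod
    def add_prefix(values: Dict[str, str], prefix: Optional[str] = None) -> Dict[str, str]:
        """Return a copy of `values` with every key prefixed, if a prefix is given."""

        def _key(key: str) -> str:
            # The "." separator is an assumption for this sketch.
            return f"{prefix}.{key}" if prefix else key

        return {_key(key): value for key, value in values.items()}


prefixed = PrefixDemo.add_prefix({"author": "Mike"}, prefix="note-attributes")
unprefixed = PrefixDemo.add_prefix({"title": "First note"})
```

Because the helper never touches class state, `@staticmethod` documents that fact and the function could be moved to module level without any change to its body.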

add_prefix(key): value for key, value in note_dict.items()
}

@classmethod
Collaborator

What do you think about using standalone functions instead?

A developer will not have to worry about anything unexpected happening during inheritance (where the identity of cls changes).
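
A small, self-contained illustration of the inheritance concern. The class and attribute names are hypothetical and unrelated to the loader code.

```python
class Base:
    label = "base"

    @classmethod
    def via_classmethod(cls) -> str:
        # `cls` is whichever class the method was called on.
        return cls.label

    @staticmethod
    def via_staticmethod() -> str:
        # No implicit `cls`; the behaviour is the same for every subclass.
        return Base.label


class Child(Base):
    label = "child"


from_classmethod = Child.via_classmethod()
from_staticmethod = Child.via_staticmethod()
```

Called on `Child`, the class method sees the subclass's attribute while the static method does not, which is the kind of surprise a static method or standalone function rules out.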

Contributor Author

Yes, good point. These should be static methods instead of class methods. As they only exist to support the EverNoteLoader class I think they should exist inside the class rather than outside in the file. Are you happy with them as static methods?

Contributor Author

I see your message above about using static methods instead so that's perfect. Have updated to be static methods.

current_dir = pathlib.Path(__file__).parent
return os.path.join(current_dir, "sample_documents", notebook_name)

def test_evernoteloader_loadnotebook_eachnoteisindividualdocument(self) -> None:
Collaborator

@MikeMcGarry you can use the following (still undocumented) custom marker to get the tests to run. They'll need to be registered in the extended_testing extra

Suggested change
- def test_evernoteloader_loadnotebook_eachnoteisindividualdocument(self) -> None:
+ @pytest.mark.requires("lxml", "html2txt")
+ def test_evernoteloader_loadnotebook_eachnoteisindividualdocument(self) -> None:

Contributor Author

Yeah sure thing, however when I add this decorator and call make test (after calling poetry lock --no-update and poetry install -E extended_testing), all the tests are skipped?

Contributor Author

Ahh never mind, I see how it works, had a typo in html2txt. Fixed this. I see that it will skip the tests if the dependency is not installed. Nice!
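
The skipping behaviour discussed in this thread could be implemented along the lines of the sketch below. This is an assumption about how a `requires` marker might work in a `conftest.py`, not the project's actual implementation; the helper name `missing_packages` is hypothetical, while `pytest_configure`, `pytest_collection_modifyitems`, `iter_markers` and `add_marker` are standard pytest APIs.

```python
import importlib.util
from typing import List


def missing_packages(*names: str) -> List[str]:
    """Return the top-level package names that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def pytest_configure(config) -> None:
    """Register the custom marker so pytest does not warn about it."""
    config.addinivalue_line(
        "markers", "requires(*packages): skip the test if a package is missing"
    )


def pytest_collection_modifyitems(config, items) -> None:
    """Skip every collected test whose required packages are not installed."""
    import pytest  # imported lazily so missing_packages works without pytest

    for item in items:
        for marker in item.iter_markers(name="requires"):
            missing = missing_packages(*marker.args)
            if missing:
                reason = f"Requires packages that are not installed: {missing}"
                item.add_marker(pytest.mark.skip(reason=reason))


stdlib_only = missing_packages("json")
with_unknown = missing_packages("json", "no_such_pkg_for_this_sketch")
```

Under a scheme like this, a misspelled package name is simply reported as missing, so the marked test is skipped rather than failing, which matches the symptom described above.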

@eyurtsev
Collaborator

@MikeMcGarry Thank you for the PR. This is a major improvement to how the Evernote document loader works!

@hwchase17 thoughts on backwards compatibility?

@eyurtsev eyurtsev requested a review from hwchase17 May 12, 2023 15:23
@eyurtsev eyurtsev added the 03 enhancement Enhancement of existing functionality label May 12, 2023
@MikeMcGarry MikeMcGarry force-pushed the feature/4493.ImproveEvernoteLoader branch 3 times, most recently from 713f4be to 9019949 Compare May 13, 2023 11:51
@MikeMcGarry
Contributor Author

Hi @eyurtsev thanks for your detailed review, your guidance on setting up the dependencies appropriately was really helpful and you are on the money about those methods which should be static methods rather than class methods. I've addressed all your feedback including preserving the existing behaviour and making the new behaviour available by passing load_single_document=False into the initialisation of the EverNoteLoader. I've also added some additional unit tests to improve our coverage.

@MikeMcGarry MikeMcGarry requested a review from eyurtsev May 13, 2023 11:55
@eyurtsev
Collaborator

Thank you! As a heads up, I will commandeer the PR tomorrow to help resolve merge conflicts around poetry.lock files. There are a bunch of PRs bumping into merge conflicts, so I will be doing the same in all of them to merge the code faster.

@MikeMcGarry
Contributor Author

Okay sounds good! I've run poetry run black . and committed the updated files as I see one of the pipeline tests failed as the files were not formatted appropriately.

@MikeMcGarry MikeMcGarry force-pushed the feature/4493.ImproveEvernoteLoader branch from cbf452b to 40d44df Compare May 15, 2023 05:06
@MikeMcGarry
Contributor Author

I've also rebased onto master.

poetry.lock Outdated
Comment on lines 10090 to 10334
all = ["O365", "aleph-alpha-client", "anthropic", "arxiv", "atlassian-python-api", "azure-cosmos", "azure-identity", "beautifulsoup4", "clickhouse-connect", "cohere", "deeplake", "docarray", "duckduckgo-search", "elasticsearch", "faiss-cpu", "google-api-python-client", "google-search-results", "gptcache", "hnswlib", "html2text", "huggingface_hub", "jina", "jinja2", "jq", "lancedb", "lark", "lxml", "manifest-ml", "networkx", "nlpcloud", "nltk", "nomic", "openai", "opensearch-py", "pdfminer-six", "pexpect", "pgvector", "pinecone-client", "pinecone-text", "protobuf", "psycopg2-binary", "pyowm", "pypdf", "pytesseract", "pyvespa", "qdrant-client", "redis", "sentence-transformers", "spacy", "steamship", "tensorflow-text", "tiktoken", "torch", "transformers", "weaviate-client", "wikipedia", "wolframalpha"]
azure = ["azure-core", "azure-cosmos", "azure-identity", "openai"]
Contributor Author

When I ran poetry lock --no-update it put them all back in alphabetical order.

@dev2049
Contributor

dev2049 commented May 17, 2023

seems like one of the test files is 11k lines, any chance we could use a smaller one 😅

@MikeMcGarry
Contributor Author

MikeMcGarry commented May 17, 2023

It's a base64 encoded image. I wanted to make sure images were removed from the note and didn't end up in the page context.

Do we have some size constraints we want to consider with test data files? These aren't part of the package contents when published to PyPi right?
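
A compact sketch of the kind of check described here, using a few fake bytes instead of a real image. It assumes the attachment lives in a `<resource>` element next to the note's `<content>`, uses the standard library parser rather than lxml, and the helper name `note_text` is hypothetical.

```python
import base64
import xml.etree.ElementTree as ET

# A tiny fake payload keeps the fixture small; no real image is needed.
PAYLOAD = base64.b64encode(b"\x89PNG fake image bytes").decode("ascii")

NOTE_XML = f"""<note>
  <title>Note with image</title>
  <content><![CDATA[<en-note><div>Caption</div><en-media type="image/png" hash="abc"/></en-note>]]></content>
  <resource>
    <data encoding="base64">{PAYLOAD}</data>
    <mime>image/png</mime>
  </resource>
</note>"""


def note_text(note_xml: str) -> str:
    """Extract only the text of <content>, ignoring <resource> attachments."""
    note = ET.fromstring(note_xml)
    content = note.findtext("content", default="")
    return "".join(ET.fromstring(content).itertext()).strip()


text = note_text(NOTE_XML)
```

The assertion that matters is that the base64 payload never appears in the extracted text, and that holds regardless of how large the encoded image is.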

@dev2049
Contributor

dev2049 commented May 17, 2023

> It's a base64 encoded image. I wanted to make sure images were removed from the note and didn't end up in the page context.
>
> Do we have some size constraints we want to consider with test data files? These aren't part of the package contents when published to PyPi right?

they're not part of the package, it's still nice to keep the repo small though. could we just use a small image (shouldn't change quality of the test right?)

Mike McGarry added 5 commits May 19, 2023 12:17
…te instead of a single document for the entire export. Also ensured that all available metadata is added to each note
…optional dependencies and updated to use static methods instead of class methods
@MikeMcGarry MikeMcGarry force-pushed the feature/4493.ImproveEvernoteLoader branch from 40d44df to 0d98e56 Compare May 19, 2023 02:31
@MikeMcGarry
Contributor Author

> > It's a base64 encoded image. I wanted to make sure images were removed from the note and didn't end up in the page context.
> > Do we have some size constraints we want to consider with test data files? These aren't part of the package contents when published to PyPi right?
>
> they're not part of the package, it's still nice to keep the repo small though. could we just use a small image (shouldn't change quality of the test right?)

Okay sure thing, I've scaled down the image; it's now 18KB instead of 2.1MB.

@MikeMcGarry
Contributor Author

@dev2049 / @eyurtsev I've accommodated all the feedback and rebased onto master. Are there any outstanding points we need to address?

@eyurtsev eyurtsev added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label May 19, 2023
@eyurtsev
Collaborator

eyurtsev commented May 19, 2023

Code looks ready for merging. Waiting for tests to pass and then can merge in. Thank you @MikeMcGarry

@dev2049 dev2049 merged commit ddd595f into langchain-ai:master May 19, 2023
13 checks passed
@danielchalef danielchalef mentioned this pull request Jun 5, 2023
This was referenced Jun 25, 2023
Labels
03 enhancement Enhancement of existing functionality lgtm PR looks good. Use to confirm that a PR is ready for merging.
Development

Successfully merging this pull request may close these issues.

Evernote Document Loader Concatenates All Notes Together
3 participants