Add full Google Drive features. #5135

Closed
wants to merge 10 commits into from

Conversation

pprados
Contributor

@pprados pprados commented May 23, 2023

Reimplement the Google Drive features

Proposed additions:

  • langchain.docstore.GoogleDriveDocStore
  • langchain.document_loaders.GoogleDriveLoader
  • langchain.utilities.GoogleDriveAPIWrapper
  • langchain.tools.GoogleDriveSearchTool
  • langchain.utilities.GoogleDriveUtilities

Features:

  • Fully compatible with Google Drive API
  • Manage files in the trash
  • Manage shortcuts
  • Manage file descriptions
  • Paging of GDrive list() requests
  • Multiple kinds of templates for GDrive requests
  • Convert many MIME types (configurable). The list is adjusted according to the packages available
  • Convert GDoc, GSheet and GSlide with different modes
  • Can use only the description of files, without loading and converting the body
  • Fine-grained filtering with a lambda
  • Remove duplicate documents (in case of shortcuts)
  • Add URLs to documents (or to parts of documents, such as a specific slide)
  • Use an environment variable to reference the API tokens
  • Manage different kinds of strange states of Google files (absence of URL, etc.)
  • Use a fully lazy strategy to save memory
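
The paging, lambda filter, and duplicate-removal features listed above can be combined in a single lazy generator. The following is only a minimal sketch, not the code of this pull request: `iter_files`, `list_page`, and `keep` are hypothetical names, and `list_page` is assumed to return a dict shaped like a Drive `files().list()` response (a `files` list plus an optional `nextPageToken`).

```python
from typing import Any, Callable, Dict, Iterator, Optional, Set

FileInfo = Dict[str, Any]
Page = Dict[str, Any]


def iter_files(
    list_page: Callable[[Optional[str]], Page],
    keep: Callable[[FileInfo], bool] = lambda info: True,
) -> Iterator[FileInfo]:
    """Lazily yield files page by page, filtered and de-duplicated by id.

    `list_page` is a hypothetical callable standing in for one list request.
    """
    seen: Set[str] = set()
    token: Optional[str] = None
    while True:
        page = list_page(token)
        for info in page.get("files", []):
            # Skip files already yielded (e.g. reached again via a shortcut)
            # and files rejected by the caller's lambda filter.
            if info["id"] in seen or not keep(info):
                continue
            seen.add(info["id"])
            yield info
        token = page.get("nextPageToken")
        if not token:
            break


# Fake two-page listing used only to exercise the sketch.
_PAGES: Dict[Optional[str], Page] = {
    None: {
        "files": [
            {"id": "a", "name": "report.txt"},
            {"id": "b", "name": "old.tmp"},
        ],
        "nextPageToken": "p2",
    },
    "p2": {
        "files": [
            {"id": "a", "name": "report.txt"},
            {"id": "c", "name": "notes.txt"},
        ]
    },
}


def fake_list_page(token: Optional[str]) -> Page:
    return _PAGES[token]


names = [
    info["name"]
    for info in iter_files(fake_list_page, keep=lambda f: f["name"].endswith(".txt"))
]
```

Because the function is a generator, the second page is only requested once the caller has consumed the first one.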

Recognition

If you accept my pull-request, you can mention me @pprados. Thanks

Before submitting

Unit-test coverage is above 80% of the new code.

No integration tests, but some notebooks show how to use the new components:

  • docs/modules/agents/tools/examples/google_drive.ipynb
  • docs/modules/indexes/document_loaders/examples/google_drive.ipynb
  • docs/modules/indexes/retrievers/examples/google_drive.ipynb

Who can review?

Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested:

@eyurtsev @hwchase17 @vowelparrot might be interested

@pprados pprados force-pushed the pprados/google_drive branch 2 times, most recently from 139d499 to d44c2b0 Compare May 25, 2023 09:49
@pprados
Contributor Author

pprados commented May 25, 2023

@vowelparrot I have 3 workflows awaiting approval. Do you need to approve them to start these jobs?

@Yvelo

Yvelo commented May 25, 2023

I'd gladly collaborate on improving Google Drive support in LangChain. The above test failed since googleapiclient is a new requirement for poetry.lock.

@pprados pprados force-pushed the pprados/google_drive branch 5 times, most recently from 6327ab2 to 417957c Compare May 30, 2023 13:12
@pprados
Contributor Author

pprados commented May 30, 2023

I resolved this problem. I tested the GitHub Actions in my branch, and now all workflows pass.

@pprados
Contributor Author

pprados commented May 30, 2023

Sorry. I have fixed the format now. It's not possible to start the lint workflow by hand.

@pprados
Contributor Author

pprados commented May 31, 2023

@eyurtsev Can you run the workflows to validate this version?

@pprados
Contributor Author

pprados commented May 31, 2023

After this pull request, I will normalize the handling of the Google token, to use one scope for Google Drive, Gmail, etc.

@pprados
Contributor Author

pprados commented Jun 1, 2023

@eyurtsev, can you activate the workflows and, if everything is correct, review this code?

@pprados
Contributor Author

pprados commented Jun 2, 2023

@eyurtsev, sorry for the last error.
In my branch, the workflows run without error.

@pprados pprados force-pushed the pprados/google_drive branch 2 times, most recently from 6acba43 to fa2c3ac Compare June 2, 2023 08:08
@pprados
Contributor Author

pprados commented Jun 2, 2023

@eyurtsev, I have a question about my implementation.

In the method _lazy_load_file_from_file(), I added an optional run_manager.
I can use this manager when I catch an exception:

if isinstance(run_manager, CallbackManagerForToolRun):
    run_manager.on_tool_error(e)
elif isinstance(run_manager, CallbackManagerForChainRun):
    run_manager.on_chain_error(e)

But the code is not clean. Do you have a better idea?
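
One possible way to avoid the isinstance chain is duck typing: look up whichever error hook the manager exposes. This is only a sketch under assumptions, not an answer from the maintainers: `report_error` is a hypothetical helper, the two `Fake...` classes are stand-ins for the real callback managers, and it is assumed that each manager exposes only its own hook (`on_tool_error` or `on_chain_error`).

```python
from typing import Any, List, Optional


def report_error(run_manager: Optional[Any], error: BaseException) -> bool:
    """Forward `error` to the first error hook the manager exposes.

    Returns True if a hook was called, False otherwise.
    """
    if run_manager is None:
        return False
    for hook_name in ("on_tool_error", "on_chain_error"):
        hook = getattr(run_manager, hook_name, None)
        if callable(hook):
            hook(error)
            return True
    return False


# Stand-ins for the real callback managers, used only to exercise the sketch.
class FakeToolRunManager:
    def __init__(self) -> None:
        self.errors: List[BaseException] = []

    def on_tool_error(self, error: BaseException) -> None:
        self.errors.append(error)


class FakeChainRunManager:
    def __init__(self) -> None:
        self.errors: List[BaseException] = []

    def on_chain_error(self, error: BaseException) -> None:
        self.errors.append(error)


tool_manager = FakeToolRunManager()
chain_manager = FakeChainRunManager()
failure = ValueError("file has no URL")
handled_by_tool = report_error(tool_manager, failure)
handled_by_chain = report_error(chain_manager, failure)
handled_by_none = report_error(None, failure)
```

The trade-off is that the helper no longer depends on the concrete manager classes, but it silently does nothing for a manager that exposes neither hook.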

I also introduced the possibility of using a lazy approach in the retriever:

def lazy_get_relevant_documents(self, query: str) -> Iterator[Document]:
    ...

The default implementation transforms a classic list of documents into an iterator. Subclasses can choose to implement a lazy approach to reduce the memory footprint.
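
The relationship between the default and a lazy override might look like the sketch below. It is an illustration under assumptions, not the code of this pull request: `Document` and `BaseRetriever` are simplified stand-ins for the LangChain classes, and `EagerRetriever` and `LazyRetriever` are hypothetical subclasses.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Document:
    """Simplified stand-in for the LangChain Document class."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


class BaseRetriever:
    """Simplified stand-in for the retriever base class."""

    def get_relevant_documents(self, query: str) -> List[Document]:
        raise NotImplementedError

    def lazy_get_relevant_documents(self, query: str) -> Iterator[Document]:
        # Default: build the full list, then expose it as an iterator.
        return iter(self.get_relevant_documents(query))


class EagerRetriever(BaseRetriever):
    """Only implements the list version and inherits the default lazy one."""

    def get_relevant_documents(self, query: str) -> List[Document]:
        return [Document(f"result for {query}")]


class LazyRetriever(BaseRetriever):
    """Overrides the lazy version so documents are produced one at a time."""

    def __init__(self, texts: List[str]) -> None:
        self._texts = texts

    def lazy_get_relevant_documents(self, query: str) -> Iterator[Document]:
        for text in self._texts:
            if query in text:
                yield Document(text)

    def get_relevant_documents(self, query: str) -> List[Document]:
        # The eager version is derived from the lazy one.
        return list(self.lazy_get_relevant_documents(query))


retriever = LazyRetriever(["google drive file", "gmail message", "drive shortcut"])
first_hit = next(retriever.lazy_get_relevant_documents("drive"))
all_hits = [d.page_content for d in retriever.get_relevant_documents("drive")]
```

With the lazy override, a caller that only needs the first match never pays for building the remaining documents.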

In my Google Drive utilities, I use a lazy approach.

Later, I would like to update the link between the loaders and the vector DB to use a lazy approach where possible. Then a loader can return a large number of documents to import without memory problems.
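
A lazy link between a loader and a vector store could consume the document iterator in fixed-size batches, so that only one batch is held in memory at a time. This is a minimal sketch of that idea, assuming hypothetical helpers `batched` and `ingest` and a hypothetical `add_batch` callable standing in for the store's insertion method; none of these names come from LangChain.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def ingest(
    documents: Iterable[T],
    add_batch: Callable[[List[T]], None],
    batch_size: int = 2,
) -> int:
    """Send documents to the store batch by batch and return the total count."""
    total = 0
    for batch in batched(documents, batch_size):
        add_batch(batch)
        total += len(batch)
    return total


# A generator stands in for a lazy loader; a plain list records the batches.
stored_batches: List[List[str]] = []
count = ingest((f"doc{i}" for i in range(5)), stored_batches.append, batch_size=2)
```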

@pprados
Contributor Author

pprados commented Jun 5, 2023

The code changes every day, so I must rebase once again. @eyurtsev, if you start the workflows quickly, all will be correct ;-)

@pprados
Contributor Author

pprados commented Jun 6, 2023

Hello @hwchase17,
The master branch moves every day. So sometimes the 3 workflows are OK, and sometimes I have no conflicts with the base branch.
Otherwise, I must rebase every day.
Can you review this pull request or find someone to do it?

Thanks

@pprados pprados force-pushed the pprados/google_drive branch 2 times, most recently from e0f3248 to ac8f85e Compare June 9, 2023 06:59
@pprados
Contributor Author

pprados commented Jun 9, 2023

@hwchase17, @eyurtsev,
Once again, all checks have passed, but I have a conflicting file: poetry.lock.

I have rebased my code onto the latest version.
Can you review the code before the next release?
Please ;-)

@pprados
Contributor Author

pprados commented Jun 12, 2023

Hello @eyurtsev, @hwchase17,

I'm sure you've got plenty of pull requests to validate.
New version with a rebase from 12 June.
Maybe other people can review my code.
I have been waiting for 3 weeks now.
Thanks

@pprados
Contributor Author

pprados commented Aug 22, 2023

Hello @eyurtsev, @hwchase17, @vowelparrot, @baskaryan and @hinthornw,

Can someone contact me via Discord to organize a guided review, if that would make things easier?

@pprados pprados mentioned this pull request Aug 28, 2023
hwchase17 added a commit that referenced this pull request Sep 3, 2023
My other
[pull-request](#5135) is
too big to be acceptable.
I propose another 'lite' version.

I updated only the notebook to propose an integration with the external
project
[`langchain-googledrive`](https://github.com/pprados/langchain-googledrive).

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
@pprados
Contributor Author

pprados commented Sep 5, 2023

#9999

@pprados pprados closed this Sep 5, 2023
@pprados pprados deleted the pprados/google_drive branch October 3, 2023 14:15
Labels
Ɑ: doc loader Related to document loader module (not documentation) 🤖:enhancement A large net-new component, integration, or chain. Use sparingly. The largest features
3 participants