Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Parallelize the GithubRepoLoader #2224

Merged

Conversation

yroc92
Copy link
Contributor

@yroc92 yroc92 commented Aug 9, 2023

Instead of synchronously fetching file content and directory info, this change leverages promises and Promise.all to parallelize the file retrieval when recursive = true for GithubRepoLoader.

Fixes #2196

Twitter: https://twitter.com/yroc92

@vercel
Copy link

vercel bot commented Aug 9, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated (UTC)
langchainjs-docs ✅ Ready (Inspect) Visit Preview Aug 14, 2023 8:58pm

@dosubot dosubot bot added the auto:improvement Medium size change to existing code to handle new use-cases label Aug 9, 2023
@@ -0,0 +1,53 @@
import { jest, test } from "@jest/globals";
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is flagging a new HTTP request using the fetch function in the code, and it is recommended for maintainers to review this change.

@yroc92 yroc92 force-pushed the feature/parallelize-github-loader branch 3 times, most recently from edcda8c to 21d433c Compare August 9, 2023 23:42
@yroc92 yroc92 marked this pull request as ready for review August 9, 2023 23:45
@jacoblee93
Copy link
Collaborator

Looks nice! Just a bit worried about rate limits/the change in behavior - see comments

@jacoblee93 jacoblee93 self-assigned this Aug 10, 2023
@jacoblee93 jacoblee93 added question Further information is requested close PRs that need one or two touch-ups to be ready labels Aug 10, 2023
@jacoblee93 jacoblee93 added lgtm PRs that are ready to be merged as-is and removed question Further information is requested close PRs that need one or two touch-ups to be ready labels Aug 14, 2023
@jacoblee93
Copy link
Collaborator

Hey @yroc92, added max concurrency support with an internal utility we have. Thank you, will get this merged tonight!

@jacoblee93
Copy link
Collaborator

Thanks again!

@jacoblee93 jacoblee93 merged commit 7b18e21 into langchain-ai:main Aug 15, 2023
9 checks passed
@yroc92
Copy link
Contributor Author

yroc92 commented Aug 15, 2023

@jacoblee93 good to know about the concurrency limit support!

It’s definitely easy to hit rate limits if you don’t filter out enough files with gitignore. But parallelizing the ops won’t impact the rate limit much because the limits are calls per hour. Using an API key makes the limits much higher.

@jacoblee93
Copy link
Collaborator

Ah gotcha. Well, I defaulted to 2 but you can always set it to Infinity to get the other behavior

@yroc92 yroc92 deleted the feature/parallelize-github-loader branch August 18, 2023 05:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto:improvement Medium size change to existing code to handle new use-cases lgtm PRs that are ready to be merged as-is
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GithubRepoLoader: Fetch directories and files in parallel
2 participants