Add GitLoader #2851
langchain/document_loaders/git.py
Outdated
def is_text_content(content: bytes) -> bool:
    """Determines if the content is text based on the content bytes."""
    try:
        content.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
Maybe return the result of `content.decode("utf-8")` so that we don't have to decode twice.
fixed in 3a16d35
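For readers following the thread, here is a minimal sketch of what the suggestion could look like. It is not the code from the referenced commit; the helper name `decode_if_text` is hypothetical and only illustrates returning the decoded text so callers decode once.

```python
from typing import Optional


def decode_if_text(content: bytes) -> Optional[str]:
    """Return the UTF-8 decoded text, or None if the bytes are not valid UTF-8.

    Hypothetical helper (not from the PR): it folds the "is this text?" check
    and the decode into a single call, so the caller never decodes twice.
    """
    try:
        return content.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Usage: callers test for None instead of calling a separate predicate.
text = decode_if_text(b"print('hello')")
if text is not None:
    print(text)
```

Note that an empty file decodes to an empty string, so callers should compare against `None` rather than rely on truthiness.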
def load(self) -> List[Document]:
    try:
        from git import Blob, Repo
We should add this to the deps file instead of this try/except.
This is a common pattern in this project, which makes sense given the number of dependencies that are not required for every use case.
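As a rough illustration of that pattern, the sketch below imports an optional dependency lazily and fails with an actionable message if it is missing. The helper name `import_optional` and the error wording are assumptions for illustration, not the project's actual code.

```python
import importlib
from types import ModuleType


def import_optional(module_name: str, pip_name: str) -> ModuleType:
    """Import a module only when it is needed.

    Hypothetical helper: if the optional dependency is absent, raise an
    ImportError that tells the user which package to install.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"Could not import the `{module_name}` module. "
            f"Install it with `pip install {pip_name}`."
        ) from exc


# Usage: a loader would call this inside load(), so users who never touch
# git repositories do not need GitPython installed.
# git = import_optional("git", "GitPython")
```

The trade-off is that a missing package surfaces at call time rather than at install time, which is acceptable when most users need only a few of the optional integrations.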
Can you time how long it takes for large repos like https://github.com/openjdk/jdk? (67k files)
langchain/document_loaders/git.py
Outdated
repo = Repo(self.path)
repo.git.checkout(self.branch)
We should check if the repo exists already; if it does, we should do `git pull` instead of `git clone`.
done in f43370d
Two alternative ways to load repo:
loader = GitLoader(
    clone_url="https://github.com/hwchase17/langchain",
    repo_path="./example_data/test_repo2/",
    branch="master",
)

loader = GitLoader(repo_path="./example_data/test_repo1/", branch=branch)
We should check if the repo exists already; if it does, we should do `git pull` instead of `git checkout`.
Do we actually want the loader forcing a pull by default instead of using what's on disk? I don't think I would want the loader to cause a local repo to update from a remote unless I explicitly asked it to.
Yeah, agreed. I like having loading from the local repo be the default, without updating.
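The behavior agreed on in this thread (use what is on disk, clone only when nothing is there, never pull implicitly) could be sketched roughly as follows. The function names `plan_repo_action` and `open_repo` are hypothetical, and the GitPython calls are shown as an assumption about how the loader might be wired, not as the PR's implementation.

```python
import os
from typing import Optional


def plan_repo_action(repo_path: str, clone_url: Optional[str] = None) -> str:
    """Decide how to obtain the repo without ever updating an existing one.

    Hypothetical helper: returns "open" if a repo is already on disk,
    "clone" if it is missing and a clone URL was given.
    """
    if os.path.isdir(os.path.join(repo_path, ".git")):
        return "open"  # use the local checkout as-is, no pull
    if clone_url is not None:
        return "clone"
    raise ValueError(f"No git repo at {repo_path} and no clone_url given.")


def open_repo(repo_path: str, branch: str, clone_url: Optional[str] = None):
    """Open or clone the repo, then check out the branch (assumes GitPython)."""
    from git import Repo  # optional dependency, imported lazily

    if plan_repo_action(repo_path, clone_url) == "clone":
        repo = Repo.clone_from(clone_url, repo_path)
    else:
        repo = Repo(repo_path)
    repo.git.checkout(branch)
    return repo
```

Keeping the decision in a small pure function makes the "no implicit pull" default easy to test without network access; an explicit update step could be added later as an opt-in.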
~6 seconds on Apple M1 Max
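For anyone who wants to reproduce a measurement like this, a small wall-clock timing wrapper is enough. This is a generic sketch; the name `timed` is hypothetical, and the commented usage assumes a `loader` object like the ones shown earlier in the thread.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return result, elapsed


# Usage (assumes a GitLoader instance named `loader`):
# docs, seconds = timed(loader.load)
# print(f"Loaded {len(docs)} documents in {seconds:.1f}s")
```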
this is awesome - thanks!
Supplemental to #2851. Updates one notebook cell that I forgot to commit before.
Allows users to specify which files should be loaded instead of indiscriminately loading the entire repo. Extends #2851. NOTE: for reviewers, the `hide whitespace` option is recommended, since I changed the indentation of an if-block to use `continue` instead so it looks less like a Christmas tree :)
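A minimal sketch of the idea, assuming the filter is a callable that takes a file path and returns whether to load it. The names `select_paths` and `file_filter` are illustrative assumptions, not confirmed by this thread. It also shows the `continue` style mentioned in the note, which keeps the loop body flat.

```python
from typing import Callable, Iterable, List, Optional


def select_paths(
    paths: Iterable[str],
    file_filter: Optional[Callable[[str], bool]] = None,
) -> List[str]:
    """Return only the paths accepted by file_filter (all paths if None)."""
    selected: List[str] = []
    for path in paths:
        # Skip early with `continue` rather than nesting the rest in an if-block.
        if file_filter is not None and not file_filter(path):
            continue
        selected.append(path)
    return selected


# Usage: load only Python files.
python_files = select_paths(
    ["setup.py", "README.md", "langchain/document_loaders/git.py"],
    file_filter=lambda p: p.endswith(".py"),
)
print(python_files)
```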