GithubFileLoader API Error #17453

Open
4 tasks done
PrashantDixit0 opened this issue Feb 13, 2024 · 13 comments
Labels
🤖:bug (Related to a bug, vulnerability, unexpected error with an existing feature)
Ɑ: doc loader (Related to document loader module, not documentation)

Comments

@PrashantDixit0

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.

Example Code

import os

from langchain.document_loaders import GithubFileLoader

loader = GithubFileLoader(
    repo="langchain-ai/langchain",  # the repo name
    access_token=os.environ["GITHUB_ACCESS_TOKEN"],  # token redacted; read it from the environment
    github_api_url="https://api.github.com",
    file_filter=lambda file_path: file_path.endswith(".md"),  # load all markdown files
)
documents = loader.load()
print(documents)

Error Message and Stack Trace (if applicable)

No response

Description

raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://api.github.com/api/v3/repos/langchain-ai/langchain/git/trees/master?recursive=1

System Info

System Information

OS: Linux
OS Version: #17~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Jan 16 14:32:32 UTC 2
Python Version: 3.11.4 (main, Jul 5 2023, 13:45:01) [GCC 11.2.0]

Package Information

langchain_core: 0.1.22
langchain: 0.1.6
langchain_community: 0.0.19
langsmith: 0.0.87

Packages not installed (Not Necessarily a Problem)

The following packages were not found:

langgraph
langserve


dosubot bot commented Feb 13, 2024

🤖

Hey there @PrashantDixit0, fancy seeing you here again! Hope this message finds you coding well 😉

Based on the error message you provided, it seems like the URL being used to access the GitHub API is incorrect. The error message indicates a 404 error for the URL https://api.github.com/api/v3/repos/langchain-ai/langchain/git/trees/master?recursive=1.

The correct base URL for the GitHub API is https://api.github.com, not https://api.github.com/api/v3. The /api/v3 in the URL is redundant and likely causing the 404 error.

Please try updating the github_api_url parameter in your GithubFileLoader instance to https://api.github.com:

loader = GithubFileLoader(
    repo="langchain-ai/langchain",  # the repo name
    access_token=os.environ["GITHUB_ACCESS_TOKEN"],  # token redacted; requires `import os`
    github_api_url="https://api.github.com",
    file_filter=lambda file_path: file_path.endswith(".md"),  # load all markdown files
)

If you're still encountering issues after making this change, it would be helpful to see the full traceback of the error. This can provide more context about what's causing the error and where in the code it's occurring.

Also, please ensure that your GitHub personal access token is correct and has the necessary permissions to access the repository. If the token is incorrect or doesn't have the necessary permissions, this could also cause a 404 error.

Lastly, please verify that the repository langchain-ai/langchain exists and that you're using the correct branch name. If the repository or branch doesn't exist, this would also result in a 404 error.

For more information about the GithubFileLoader class and how it interacts with the GitHub API, you can refer to the GithubFileLoader source code and the unit tests for GithubFileLoader.


@dosubot dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Feb 13, 2024

PrashantDixit0 commented Feb 13, 2024

I am already setting github_api_url to https://api.github.com/, but I still get this error.


PrashantDixit0 commented Feb 13, 2024

@shufanhao @baskaryan I am following this documentation page as a reference and using the exact code snippet it gives, but I still get the same error.


brk21 commented Feb 26, 2024

loader = GithubFileLoader(
    repo="langchain-ai/langchain",  # the repo name
    access_token=os.environ["GITHUB_ACCESS_TOKEN"],
    github_api_url="https://api.github.com",
    file_filter=lambda file_path: file_path.endswith(
        ".md"
    ),  # load all markdowns files.
)

docs = loader.load()
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
Cell In[7], line 13
      2 g = Github(auth=auth)
      4 loader = GithubFileLoader(
      5     repo="langchain-ai/langchain",  # the repo name
      6     access_token=os.environ["GITHUB_ACCESS_TOKEN"],
   (...)
     10     ),  # load all markdowns files.
     11 )
---> 13 docs = loader.load()

File /opt/conda/envs/analyst-copilot/lib/python3.11/site-packages/langchain_community/document_loaders/github.py:246, in GithubFileLoader.load(self)
    243 def load(self) -> List[Document]:
    244     documents = []
--> 246     files = self.get_file_paths()
    247     for file in files:
    248         content = self.get_file_content_by_path(file["path"])

File /opt/conda/envs/analyst-copilot/lib/python3.11/site-packages/langchain_community/document_loaders/github.py:218, in GithubFileLoader.get_file_paths(self)
    213 base_url = (
    214     f"{self.github_api_url}/api/v3/repos/{self.repo}/git/trees/"
    215     f"{self.branch}?recursive=1"
    216 )
    217 response = requests.get(base_url, headers=self.headers)
--> 218 response.raise_for_status()
    219 all_files = response.json()["tree"]
    220 """ one element in all_files
    221 {
    222     'path': '.github', 
   (...)
    227 }
    228 """

File /opt/conda/envs/analyst-copilot/lib/python3.11/site-packages/requests/models.py:1021, in Response.raise_for_status(self)
   1016     http_error_msg = (
   1017         f"{self.status_code} Server Error: {reason} for url: {self.url}"
   1018     )
   1020 if http_error_msg:
-> 1021     raise HTTPError(http_error_msg, response=self)

HTTPError: 404 Client Error: Not Found for url: https://api.github.com/api/v3/repos/langchain-ai/langchain/git/trees/main?recursive=1


timkitch commented Mar 1, 2024

Same here. I see in the code for GithubFileLoader that it's incorrectly hardcoding the URL /api/v3 prefix as:

base_url = (
    f"{self.github_api_url}/api/v3/repos/{self.repo}/git/trees/"
    f"{self.branch}?recursive=1"
)

So, not possible to override that part. It's broken.
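To make the failure concrete, here is a minimal sketch of how the request URL comes out with and without the hardcoded prefix. The function name `build_tree_url` and its `legacy_prefix` flag are hypothetical and exist only for illustration; the f-string mirrors the snippet quoted above. As far as I know, `/api/v3` is the REST path used by GitHub Enterprise Server hosts, which would explain why it returns 404 on api.github.com.

```python
def build_tree_url(api_url: str, repo: str, branch: str, legacy_prefix: bool) -> str:
    """Assemble a Git Trees URL.

    `legacy_prefix` is a hypothetical switch (not a GithubFileLoader option)
    that reproduces the hardcoded "/api/v3" segment from the quoted code.
    """
    base = api_url.rstrip("/")  # tolerate a trailing slash in github_api_url
    prefix = "/api/v3" if legacy_prefix else ""
    return f"{base}{prefix}/repos/{repo}/git/trees/{branch}?recursive=1"


with_prefix = build_tree_url(
    "https://api.github.com", "langchain-ai/langchain", "master", legacy_prefix=True
)
without_prefix = build_tree_url(
    "https://api.github.com", "langchain-ai/langchain", "master", legacy_prefix=False
)

print(with_prefix)     # the URL from the 404 message in the issue description
print(without_prefix)  # the same endpoint without the extra segment
```

Because the prefix is appended inside the loader, no value of `github_api_url` passed by the caller can remove it, which matches what is reported in this thread.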

baskaryan pushed a commit that referenced this issue Mar 1, 2024
Description:
- Changed the GitHub endpoint, because the existing one was not working and returned a 404 Not Found error.
- The existing function also failed when file_filter was not passed. The tree API returns all paths, including directories, and when get_file_content iterated over these paths it failed on directories, because for a directory the API returns a list of the files inside it. Added a condition to skip any path that is a directory.
- Fixes this issue: #17453
Co-authored-by: Radhika Bansal <Radhika.Bansal@veritas.com>
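The directory problem described in this commit message can be sketched as follows. This is an illustration under assumptions, not the code from the commit: the helper name `select_file_paths` and the sample entries are made up, and the sketch relies on the Git Trees API marking each entry with a `type` field, where `"blob"` is a file and `"tree"` is a directory.

```python
from typing import Callable, Dict, List, Optional


def select_file_paths(
    tree: List[Dict[str, str]],
    file_filter: Optional[Callable[[str], bool]] = None,
) -> List[str]:
    """Return paths of file entries only, optionally narrowed by file_filter.

    Hypothetical helper: skips every entry that is not a blob, so that
    directories are never passed on to a content download step.
    """
    paths = []
    for entry in tree:
        if entry.get("type") != "blob":  # "tree" entries are directories
            continue
        path = entry["path"]
        if file_filter is None or file_filter(path):
            paths.append(path)
    return paths


# Made-up sample shaped like the "tree" list in a Git Trees API response.
sample_tree = [
    {"path": ".github", "type": "tree"},
    {"path": "README.md", "type": "blob"},
    {"path": "docs", "type": "tree"},
    {"path": "docs/index.md", "type": "blob"},
    {"path": "setup.py", "type": "blob"},
]

print(select_file_paths(sample_tree))
print(select_file_paths(sample_tree, lambda p: p.endswith(".md")))
```

Without the type check, a call with no `file_filter` would try to fetch `.github` and `docs` as if they were files, which is the failure the commit message describes.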
@shawnesquivel

Having the same issue.

@shufanhao
Contributor

> Same here. I see in the code for GithubFileLoader that it's incorrectly hardcoding the URL /api/v3 prefix as:
>
> base_url = (
>     f"{self.github_api_url}/api/v3/repos/{self.repo}/git/trees/"
>     f"{self.branch}?recursive=1"
> )
>
> So, not possible to override that part. It's broken.

Did you use the latest code?


timkitch commented Apr 6, 2024

>> Same here. I see in the code for GithubFileLoader that it's incorrectly hardcoding the URL /api/v3 prefix as:
>>
>> base_url = (
>>     f"{self.github_api_url}/api/v3/repos/{self.repo}/git/trees/"
>>     f"{self.branch}?recursive=1"
>> )
>>
>> So, not possible to override that part. It's broken.
>
> Did you use the latest code?

I haven't tried the latest code yet, but I just reviewed the recent updates to it and it looks to me like it should work now.

@shawnesquivel

I can confirm it's working with this requirements.txt file.

langchain==0.1.14
orjson==3.9.15 # needed for langchain > 0.1.7 https://github.com/langchain-ai/langchain/issues/19719
openai==1.14.2
python-dotenv==1.0.0
langchain-community==0.0.30
langchain-openai==0.0.4
langchain-text-splitters


Bluthunder commented Apr 8, 2024

Can someone confirm whether the above issue is resolved? I can still reproduce the same error with
langchain version 0.1.14
openai version 1.16.2

@shufanhao
Contributor

> Can someone confirm whether the above issue is resolved? I can still reproduce the same error with
> langchain version 0.1.14
> openai version 1.16.2

@Bluthunder make sure langchain-community is also updated, either to the latest release or to langchain-community==0.0.30.
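Since the loader lives in langchain-community rather than in langchain itself, it may help to check which version is actually installed. The sketch below is one way to do that with the standard library. The helper names are hypothetical, and the 0.0.30 threshold is simply the version reported as working in this thread, not a verified minimum.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# Version reported as working in this thread (an assumption, not a verified minimum).
SUGGESTED_COMMUNITY_VERSION = (0, 0, 30)


def version_tuple(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric part of a version string, e.g. '0.0.30'."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def community_is_recent_enough(installed: str) -> bool:
    """Compare an installed version string against the suggested version."""
    return version_tuple(installed) >= SUGGESTED_COMMUNITY_VERSION


def installed_community_version() -> Optional[str]:
    """Return the installed langchain-community version, or None if absent."""
    try:
        return metadata.version("langchain-community")
    except metadata.PackageNotFoundError:
        return None


print(community_is_recent_enough("0.0.19"))  # the version from the original report
print(community_is_recent_enough("0.0.30"))
```

Calling `installed_community_version()` in the affected environment shows whether upgrading langchain alone also pulled in a newer langchain-community.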


gjuoun commented Apr 15, 2024

The error still exists in my case, because GithubFileLoader does not handle HTTP 404 errors:
HTTPError: 404 Client Error: Not Found for url: ...
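One way to work around the unhandled 404 is to catch the exception in calling code and turn it into something actionable. The following is a rough sketch under assumptions: `explain_github_error` is a hypothetical helper, and its hints only restate the causes raised in this thread (the `/api/v3` prefix, a missing branch, a token without access).

```python
from typing import List


def explain_github_error(status_code: int, url: str) -> List[str]:
    """Map a failed GitHub request to likely causes discussed in this thread.

    Hypothetical helper for illustration; it is not part of GithubFileLoader.
    """
    hints: List[str] = []
    if status_code == 404:
        if url.startswith("https://api.github.com/api/v3/"):
            hints.append("URL contains the /api/v3 prefix; try upgrading langchain-community")
        hints.append("check that the branch exists and pass branch=... explicitly")
        hints.append("check that the access token can read the repository")
    elif status_code in (401, 403):
        hints.append("check that the access token is valid and has the required permissions")
    return hints


failing_url = (
    "https://api.github.com/api/v3/repos/langchain-ai/langchain/git/trees/main?recursive=1"
)
for hint in explain_github_error(404, failing_url):
    print("-", hint)
```

In practice this could be called from an `except requests.exceptions.HTTPError as err:` block around `loader.load()`, passing `err.response.status_code` and `err.response.url`.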

@Bluthunder

@shufanhao thanks.

My error was resolved by passing the branch name when instantiating the loader, as below:

loader = GithubFileLoader(
    repo=repo_name,
    access_token=ACCESS_TOKEN,
    github_api_url="https://api.github.com",
    branch=branch_name,
    file_filter=lambda file_path: file_path.startswith(directory),
)
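If the right branch name is not known in advance, it can be looked up before constructing the loader. The sketch below assumes the repository endpoint of the GitHub REST API, whose response includes a `default_branch` field. The function names are hypothetical, and the network call is defined but not executed here.

```python
import json
import urllib.request


def repo_url(api_url: str, repo: str) -> str:
    """Build the repository metadata URL, e.g. .../repos/owner/name."""
    return f"{api_url.rstrip('/')}/repos/{repo}"


def default_branch_from_payload(payload: dict) -> str:
    """Extract the default branch from a repository metadata response."""
    return payload["default_branch"]


def fetch_default_branch(api_url: str, repo: str, token: str) -> str:
    """Ask GitHub for the repository's default branch (performs a network call)."""
    request = urllib.request.Request(
        repo_url(api_url, repo),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(request) as response:
        return default_branch_from_payload(json.load(response))


# Offline demonstration with a made-up payload; no request is sent.
print(repo_url("https://api.github.com", "langchain-ai/langchain"))
print(default_branch_from_payload({"default_branch": "master"}))
```

The returned name can then be passed as `branch=` to GithubFileLoader, as in the snippet above, instead of relying on the loader's default.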
