Feature: Error on projects without a permissible license #7

Closed
2 tasks done
jpmcb opened this issue Jul 25, 2023 · 6 comments · Fixed by #9

@jpmcb
Member

jpmcb commented Jul 25, 2023

Type of feature

🍕 Feature

Current behavior

Currently, on the alpha branch, any and all projects will be indexed. This includes projects without permissible licenses (e.g., projects with strong copyleft licenses). There is no way to filter based on the project's license.

Suggested solution

We should find a way to dynamically check licenses of projects that are being queried / ingested:

  1. Get the zip for the project in question
  2. Parse the files looking for likely license files (LICENSE, license.txt, MIT-LICENSE, etc.). Ideally, we'd use a well-known crate for this. Maybe we can use what cargo deny uses internally to check for licenses.
  3. Only allow projects that have permissible licenses to be indexed into the vector db.
  4. If the license is not permissible, reject the request and return early with an error.
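
Steps 2 through 4 above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the project's implementation: `LicenseError`, `is_license_file`, and `check_license` are hypothetical names, the file-name heuristics are assumptions, and identifying the license from the file contents is left to a caller-supplied `identify` closure (a real version would likely delegate that to a license-detection crate).

```rust
use std::fmt;

/// Hypothetical error returned when a project must not be indexed.
#[derive(Debug, PartialEq)]
pub enum LicenseError {
    /// No path in the archive listing looks like a license file.
    Missing,
    /// A license file exists but its license is not in the allow list.
    NotPermissible(String),
}

impl fmt::Display for LicenseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LicenseError::Missing => write!(f, "no license file found"),
            LicenseError::NotPermissible(key) => {
                write!(f, "license `{key}` is not permissible")
            }
        }
    }
}

impl std::error::Error for LicenseError {}

/// Step 2: returns true when a path from the zip listing looks like a
/// license file (LICENSE, license.txt, MIT-LICENSE, ...).
/// Assumption: matching is on the file name only, at any directory depth.
pub fn is_license_file(path: &str) -> bool {
    let name = path.rsplit('/').next().unwrap_or(path).to_ascii_lowercase();
    // Drop everything after the first dot so "license.txt" becomes "license".
    let stem = name.split('.').next().unwrap_or(name.as_str());
    matches!(stem, "license" | "licence" | "copying" | "unlicense")
        || stem.ends_with("-license")
        || stem.starts_with("license-")
}

/// Steps 3 and 4: allow indexing only when a license file is present and
/// `identify` maps it to a key in `allowed`; otherwise return early.
pub fn check_license<F>(
    paths: &[&str],
    identify: F,
    allowed: &[&str],
) -> Result<String, LicenseError>
where
    F: Fn(&str) -> Option<String>,
{
    let license_path = paths
        .iter()
        .copied()
        .find(|p| is_license_file(p))
        .ok_or(LicenseError::Missing)?;

    // A license file that cannot be identified is rejected as "unknown".
    let key = identify(license_path)
        .ok_or_else(|| LicenseError::NotPermissible("unknown".to_string()))?;

    if allowed.iter().any(|a| *a == key.as_str()) {
        Ok(key)
    } else {
        Err(LicenseError::NotPermissible(key))
    }
}
```

A caller would run `check_license` on the archive's file listing before anything is written to the vector db and surface the `Err` value as the request's error.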

Additional context

See comments in #5 (comment)

Code of Conduct

  • I agree to follow this project's Code of Conduct

Contributing Docs

  • I agree to follow this project's Contribution Docs
@Anush008
Member

If a repository happens to have no licence specified, which means that it doesn't allow for modification and redistribution, do we proceed to index it?

@Anush008
Member

We can add a check using the https://api.github.com/repos/open-sauced/repo-query/license endpoint before downloading the repo zip.
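
A rough sketch of that ordering, assuming the license lookup and the zip download are supplied by the caller as closures so the example stays free of HTTP and JSON dependencies. `license_endpoint` and `fetch_if_permitted` are hypothetical names, and `lookup_key` is assumed to return `Ok(None)` when the endpoint reports no license for the repository.

```rust
/// Builds the license endpoint URL for a repository, following the
/// pattern of the URL quoted above.
pub fn license_endpoint(owner: &str, repo: &str) -> String {
    format!("https://api.github.com/repos/{owner}/{repo}/license")
}

/// Checks the license first and calls `download_zip` only when the
/// reported key is in `allowed`.
///
/// `lookup_key` receives the endpoint URL and is expected to perform the
/// request and return the license key from the response, if any.
pub fn fetch_if_permitted<L, D>(
    owner: &str,
    repo: &str,
    allowed: &[&str],
    lookup_key: L,
    download_zip: D,
) -> Result<Vec<u8>, String>
where
    L: FnOnce(&str) -> Result<Option<String>, String>,
    D: FnOnce() -> Result<Vec<u8>, String>,
{
    let url = license_endpoint(owner, repo);
    match lookup_key(&url)? {
        // Permissible license: proceed to the (potentially large) download.
        Some(key) if allowed.iter().any(|a| *a == key.as_str()) => download_zip(),
        // Known but non-permissible license: return early, no download.
        Some(key) => Err(format!(
            "license `{key}` is not permissible for {owner}/{repo}"
        )),
        // No license reported: return early, no download.
        None => Err(format!("{owner}/{repo} has no detectable license")),
    }
}
```

Doing the check before the download means a rejected repository costs one small API request instead of a full archive fetch.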

@Anush008
Member

The list of license keys is available here.
https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/licensing-a-repository#searching-github-by-license-type

@bdougie, I've derived some license keys from the above that I believe are permissive:
apache, mit, wtfpl, zlib, bsd, unlicensed. Will these be fine?

@jpmcb
Member Author

jpmcb commented Jul 26, 2023

If a repository happens to have no licence specified, which means that it doesn't allow for modification and redistribution, do we proceed to index it?

Right. For now, we should err on the side of caution and not proceed if there isn't a license.

@jpmcb
Member Author

jpmcb commented Jul 26, 2023

I've derived some license keys from the above that I believe are permissive:
apache, mit, wtfpl, zlib, bsd, unlicensed. Will these be fine?

Those all look good to me: most software licenses have specific clauses for distribution and as long as we aren't modifying the actual source when getting it from GitHub, there shouldn't be any problem.

I'd say go ahead with those as a starting point and we can explore from there!
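
As a starting point, the agreed list and the "no license means no indexing" rule could be expressed along these lines. This is only a sketch: `PERMISSIVE_FAMILIES` and `is_permissible` are hypothetical names, treating each entry as a family prefix (so that `bsd` also covers keys such as `bsd-3-clause`) is an assumption, and the code uses `unlicense` on the assumption that "unlicensed" in the thread refers to GitHub's `unlicense` keyword.

```rust
/// License families proposed in this thread.
/// Assumption: "unlicensed" in the thread means the `unlicense` key.
const PERMISSIVE_FAMILIES: [&str; 6] =
    ["apache", "mit", "wtfpl", "zlib", "bsd", "unlicense"];

/// Returns true only when a license key is present and belongs to one of
/// the permissive families. A missing license (`None`) is rejected, which
/// errs on the side of caution.
pub fn is_permissible(key: Option<&str>) -> bool {
    key.map_or(false, |raw| {
        let key = raw.trim().to_ascii_lowercase();
        PERMISSIVE_FAMILIES.iter().any(|family| {
            // Match the bare family name ("mit") or a versioned or
            // variant key that extends it ("apache-2.0", "bsd-3-clause").
            key == *family || key.starts_with(&format!("{family}-"))
        })
    })
}
```

Requiring the `-` separator after the family name keeps unrelated keys that merely share a prefix from slipping through.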

@Anush008
Member

Right. I'm working on it.
