
GSOC 2024 Project Ideas

Philippe Ombredanne edited this page Mar 14, 2024 · 8 revisions

See our page on applying for GSoC 2024: https://github.com/nexB/aboutcode/wiki/GSOC-2024


Here is a list of candidate project ideas for your consideration. Your own ideas are welcome too! Please chat with us about them to get early feedback!

Project Ideas Index

PURLdb: project ideas

Vulnerablecode: project ideas

scancode.io: project ideas

scancode-toolkit: project ideas

Archived Project Ideas: https://github.com/nexB/aboutcode/wiki/Archived-GSoC-Project-Ideas


Abbreviations and acronyms used here

We use these now and then:

  • SCIO: ScanCode.io
  • BOM: Bill of Materials, same as SBOM
  • SBOM: Software Bill of Materials
  • DJCD: DejaCode
  • SCTK: ScanCode-Toolkit
  • VCIO: VulnerableCode
  • NLP: Natural Language Processing
  • VDR: Vulnerability Disclosure Report
  • VEX: Vulnerability Exploitability Exchange

PURLdb project ideas


PURLdb: Add improved scan queue with multiple SCIO instances

Code Repositories:

Description:

This project consists mainly of the following:

  1. Improved and scalable SCIO instances:

PURLdb uses a queue for scanning packages/archives where they are submitted for scanning, and then scanned through a SCIO instance. We should add an improved queue which can send scans to multiple SCIO instances for scale.

  2. Improve scan queue handling

This consists of multiple quality-of-life improvements in the PURLdb scan queue, such as handling exceptions correctly on a SCIO crash/disconnect, actively looking for finished scans to index, updating the status of submitted scans properly, etc.
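
As a rough illustration of the first point, dispatching scans to several SCIO instances could start from a round-robin selection that skips instances known to be down. This is only a sketch under stated assumptions: `ScanDispatcher`, its methods, and the instance names are hypothetical and are not part of PURLdb or SCIO.

```python
from dataclasses import dataclass, field


@dataclass
class ScanDispatcher:
    """Pick the next available SCIO instance in round-robin order (hypothetical helper)."""

    instances: list[str]
    down: set[str] = field(default_factory=set)
    _next: int = 0

    def mark_down(self, url: str) -> None:
        # Called when an instance crashes or disconnects.
        self.down.add(url)

    def mark_up(self, url: str) -> None:
        # Called when an instance is reachable again.
        self.down.discard(url)

    def pick(self) -> str:
        # Try each instance at most once, starting from the round-robin cursor.
        for offset in range(len(self.instances)):
            index = (self._next + offset) % len(self.instances)
            url = self.instances[index]
            if url not in self.down:
                self._next = (index + 1) % len(self.instances)
                return url
        raise RuntimeError("no SCIO instance is available")
```

A real queue would also need to persist scan status and retry scans that were in flight on an instance when it went down; the sketch only covers instance selection.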

References:

Priority: Medium

Size: Large

Difficulty Level: Advanced

Tags:

  • Django
  • PostgreSQL
  • Web
  • Redis/RQ
  • Queue

Mentors:

  • @jyang
  • @tdruez
  • @pombredanne
  • @keshav-space

Related Issues:


PURLdb/ScanCode.io: Enrich an SBOM based on OSSF Security Score Card

Code Repositories:

Description:

We already have SBOM export (and import) options in scancode.io supporting SPDX and CycloneDX SBOMs, and we can enrich this data using the public https://github.com/ossf/scorecard#public-data or the REST API at: https://api.securityscorecards.dev/.

The specific tasks for this project are:

  • Research and figure out how best to consume this data
  • Add models to support external data sources/scores on packages
  • Store these as package data in PURLdb (or fetch this by package in SCIO?)
  • Add a pipeline in scancode.io to fetch and show this data in the UI
  • Map this data to SPDX/CycloneDX SBOM elements i.e. how it can be exported in a BOM
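
To make the first research task more concrete, here is a minimal sketch of consuming the REST API mentioned above. The `/projects/<platform>/<org>/<repo>` path and the response field names (`repo`, `date`, `score`, `checks`) are assumptions to verify against the Scorecard API documentation; the function names are hypothetical.

```python
import json
from urllib.request import urlopen

API_ROOT = "https://api.securityscorecards.dev"


def scorecard_url(repo: str) -> str:
    """Build the API URL for a repo given as 'github.com/<org>/<name>' (assumed path layout)."""
    return f"{API_ROOT}/projects/{repo.strip('/')}"


def fetch_scorecard(repo: str) -> dict:
    """Fetch the raw scorecard JSON. This performs a network call."""
    with urlopen(scorecard_url(repo), timeout=30) as response:
        return json.load(response)


def summarize(scorecard: dict) -> dict:
    """Keep only the fields we might store next to a package (assumed field names)."""
    return {
        "repo": scorecard.get("repo", {}).get("name"),
        "date": scorecard.get("date"),
        "score": scorecard.get("score"),
        "checks": {c["name"]: c.get("score") for c in scorecard.get("checks", [])},
    }
```

A summary of this shape could then be attached to a package record and mapped to SBOM elements; how best to model that is part of the project.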

References:

Priority: Medium

Size: Large

Difficulty Level: Intermediate

Tags:

  • Django
  • PostgreSQL
  • SBOM
  • Metadata
  • Security

Mentors:

  • @jyang
  • @tdruez
  • @pombredanne
  • @AyanSinhaMahapatra

Related Issues:


SCIO/PURLdb: Create devel-to-deploy analysis pipeline for Android

Code Repositories:

Description:

Create a pipeline for deployment analysis of Android apps, where the app source and .apk binary are provided as inputs, and we:

  1. Scan the source/binary for packages and send these to be scanned in PURLdb
  2. Make sure package assembly in SCTK and package indexing and scanning in PURLdb work for Android packages
  3. Map respective source files to their binary files
  4. Match deployed files to packages indexed in the PURLdb
  5. Handle Android-specific cases and test with some examples of FOSS Android apps

We already have deployment analysis pipelines for the Java/JavaScript ecosystem and that can be a really nice reference point to start this project.

References:

Priority: Medium

Size: Large

Difficulty Level: Advanced

Tags:

  • Django
  • PostgreSQL
  • BinaryAnalysis
  • Metadata
  • Packages

Mentors:

  • @AyanSinhaMahapatra
  • @keshav-space
  • @jyang
  • @pombredanne

SCIO/PURLdb: Create devel-to-deploy analysis pipeline for Python packages

Code Repositories:

Description:

Create a pipeline for deployment analysis of Python projects, where the source code and Python wheels are provided as inputs, and we:

  1. Scan the source/binary for packages and send these to be scanned in PURLdb
  2. Verify and fix package assembly issues in SCTK
  3. Map respective source files to their binary files
  4. Match deployed files to packages indexed in the PURLdb
  5. Handle Python-specific cases and test with some examples of FOSS Python projects

We already have deployment analysis pipelines for the Java/JavaScript ecosystem and that can be a really nice reference point to start this project.
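
Step 3 above (mapping source files to their deployed files) could begin with something as simple as matching on trailing path segments, since files inside a wheel often keep the same relative layout as in the source tree. This is a minimal sketch of that one heuristic; `map_by_path_suffix` is a hypothetical name, and the existing pipelines use richer strategies such as checksums.

```python
from pathlib import PurePosixPath


def map_by_path_suffix(deployed: list[str], sources: list[str]) -> dict[str, list[str]]:
    """Map each deployed path to the source paths that end with the same path segments."""
    mapping = {}
    for dep in deployed:
        dep_parts = PurePosixPath(dep).parts
        mapping[dep] = [
            src
            for src in sources
            # Compare the last len(dep_parts) segments of the source path.
            if PurePosixPath(src).parts[-len(dep_parts):] == dep_parts
        ]
    return mapping
```

For example, a deployed `pkg/util.py` would map to `src/pkg/util.py` but not to `tests/util.py`. Deployed files left with an empty list are the interesting ones to investigate further.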

References:

Priority: Medium

Size: Large

Difficulty Level: Advanced

Tags:

  • Django
  • PostgreSQL
  • BinaryAnalysis
  • Metadata
  • Packages

Mentors:

  • @AyanSinhaMahapatra
  • @keshav-space
  • @jyang
  • @pombredanne

VulnerableCode project ideas

There are two main categories of projects for VulnerableCode:

  • A. COLLECTION: this category is to mine and collect or infer more new and improved data. This includes collecting new data sources, inferring and improving existing data or collecting new primary data (such as finding a fix commit of a vulnerability)

  • B. USAGE: this category is about using and consuming the vulnerability database and includes the API proper, the GUI, the integrations, and data sharing, feedback and curation.


VulnerableCode: Process unstructured data sources for vulnerabilities (Category A)

Code Repositories:

Description:

The project would be to provide a way to effectively mine unstructured data sources for possible unreported vulnerabilities.

For a start this should be focused on a few prominent repos. This project could also find Fix Commits.

Some sources are:

  • mailing lists
  • changelogs
  • reflogs of commit
  • bug and issue trackers

This requires systems to "understand" vulnerability descriptions, as security advisories often do not provide structured information on which package and package versions are vulnerable. The end goal is creating a system which would infer vulnerable package name and version(s) by parsing the vulnerability description using specialized techniques and heuristics.
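
As an example of the kind of heuristic meant here, a first pass could look for common phrasings of version ranges in advisory text. The pattern below is only an illustrative sketch: the phrase list is an assumption, `extract_version_hints` is a hypothetical name, and real advisories will need many more cases plus human curation.

```python
import re

# A dotted version such as "4.2.5" or "2.0".
VERSION = r"\d+(?:\.\d+)+"

# Phrases such as "before 1.2.3", "prior to 2.0" or "through 4.1.0".
RANGE_PATTERN = re.compile(
    rf"\b(before|prior to|through|up to)\s+(?:version\s+)?({VERSION})",
    re.IGNORECASE,
)


def extract_version_hints(text: str) -> list[tuple[str, str]]:
    """Return (qualifier, version) pairs found in free-form advisory text."""
    return [(m.group(1).lower(), m.group(2)) for m in RANGE_PATTERN.finditer(text)]
```

Hints like these could feed a curation queue rather than being stored directly as vulnerable ranges.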

There is no need to train a model from scratch: we can use AI models pre-trained on code repositories (maybe https://github.com/bigcode-project/starcoder?) and then fine-tune them on some prepared datasets of CVEs in code.

We can either use NLP/machine learning and automate it all, potentially training data masking algorithms to find this specific data (this also involves creating a dataset), but that's going to be super difficult.

We could also start to craft a curation queue and parse as much as we can to make it easy to curate by humans and progressively also improve some mini NLP models and classification to help further automate the work.

References: https://github.com/nexB/vulnerablecode/issues/251

Priority: Medium

Size: Large

Difficulty Level: Advanced

Tags:

  • Python
  • Django
  • PostgreSQL
  • Security
  • Vulnerability
  • NLP
  • AI/ML

Mentors:

  • @pombredanne
  • @tg1999
  • @keshav-space
  • @Hritik14
  • @AyanSinhaMahapatra

Related Issues:


VulnerableCode: Add more data sources and mine the graph to find correlations between vulnerabilities (Category A)

Code Repositories:

Description:

See https://github.com/nexB/vulnerablecode#how for background info. We want to search for more vulnerability data sources and consume them.

There is a large number of pending tickets for data sources. See https://github.com/nexB/vulnerablecode/issues?q=is%3Aissue+is%3Aopen+label%3A"Data+collection"

Also see tutorials for adding new importers and improvers:

More reference documentation in improvers and importers:

Note that this is similar to this GSoC 2022 project (a continuation):

References: https://github.com/nexB/vulnerablecode/issues?q=is%3Aissue+is%3Aopen+label%3A"Data+collection"

Priority: High

Size: Medium/Large

Difficulty Level: Intermediate

Tags:

  • Django
  • PostgreSQL
  • Security
  • Vulnerability
  • API
  • Scraping

Mentors:

  • @pombredanne
  • @tg1999
  • @keshav-space
  • @Hritik14
  • @jmhoran

Related Issues:


VulnerableCode: On demand live evaluation of packages (Category A)

Code Repositories: https://github.com/nexB/vulnerablecode

Description:

Currently VulnerableCode runs importers in bulk where all the data from advisories are imported (and reimported) at once and stored to be displayed and queried.

The objective of this project is to have another endpoint and API where we can dynamically import available advisories for a single PURL at a time.

At a high level this would mean:

  • Support querying a specific package by PURL. This is not for an approximate search but only an exact PURL lookup.

  • Visit advisories/package ecosystem-specific vulnerability data sources and query for this specific package. For instance, for PyPI, the vulnerabilities may be available when querying the main API. An example is https://pypi.org/pypi/lxml/4.1.0/json that lists vulnerabilities. In some other cases, we may need to fetch larger datasets, like when doing this in batch.

  • This applies irrespective of whether data related to this package is already present in the database (i.e. both for new packages and for refreshing old packages).

  • A good test case would be to start with a completely empty database. Then we call the new API endpoint for one PURL, and the vulnerability data is fetched, imported/stored on the fly and the API results are returned live to the caller. After that API call, the database should now have vulnerability data for that one PURL.

  • This would likely require modifying or updating importers to support querying by PURL to get advisory data for a specific package. The actual low-level fetching should likely be done in FetchCode.

This is not straightforward, as many advisory data sources do not store data keyed by package: they are not package-first, but organized by security issue. See specific issues/discussions on these importers for more info. See also how things are done in vulntotal.
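
For the PyPI case mentioned above, an exact PURL lookup could be sketched as follows. The PURL parsing here is deliberately naive (a real implementation would use a proper PURL parser), the function names are hypothetical, and the `vulnerabilities` key with `id`, `aliases` and `fixed_in` entries is assumed from the PyPI JSON API response and should be checked against its documentation.

```python
def pypi_json_url(purl: str) -> str:
    """Build the PyPI JSON API URL for an exact 'pkg:pypi/<name>@<version>' PURL."""
    prefix = "pkg:pypi/"
    if not purl.startswith(prefix):
        raise ValueError(f"not a PyPI PURL: {purl!r}")
    # Drop qualifiers ("?...") and subpath ("#..."); this sketch ignores them.
    base = purl[len(prefix):].split("?")[0].split("#")[0]
    name, _, version = base.partition("@")
    if not name or not version:
        raise ValueError(f"an exact name and version are required: {purl!r}")
    return f"https://pypi.org/pypi/{name}/{version}/json"


def extract_advisories(pypi_response: dict) -> list[dict]:
    """Reduce the (assumed) 'vulnerabilities' entries to the fields an importer might need."""
    return [
        {
            "id": vuln.get("id"),
            "aliases": vuln.get("aliases", []),
            "fixed_in": vuln.get("fixed_in", []),
        }
        for vuln in pypi_response.get("vulnerabilities", [])
    ]
```

With this, `pypi_json_url("pkg:pypi/lxml@4.1.0")` yields the example URL given above; fetching it and storing the extracted advisories on the fly is the part the new endpoint would add.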

Priority: Medium

Size: Medium/Large

Difficulty Level: Intermediate

Tags:

  • Python
  • Django
  • PostgreSQL
  • Security
  • web
  • Vulnerability
  • API

Mentors:

  • @pombredanne
  • @tg1999
  • @keshav-space

Related Issues:


VulnerableCode/Vulntotal: Browser Extension (Category B)

Code Repositories: https://github.com/nexB/vulnerablecode

References: https://github.com/nexB/vulnerablecode/tree/main/vulntotal

Description:

Implement a Firefox/Chrome browser extension which would run vulntotal on the client side and query the vulnerability data sources to compare them. The input will be a PURL, as in vulntotal.

  • Research tools to run Python code in a browser (Brython/PyScript)
  • implement the browser extension to run vulntotal

Priority: Medium

Size: Medium

Difficulty Level: Intermediate

Tags:

  • Python
  • Security
  • Web
  • Vulnerability
  • BrowserExtension
  • UI

Mentors:

  • @keshav-space
  • @pombredanne
  • @tg1999

Related Issues:


ScanCode.io project ideas


ScanCode.io: Add ability to store/query downloaded packages

Code Repositories:

Description:

Packages that are downloaded and scanned in SCIO could optionally be stored and made accessible, so that there is a copy of the packages used for a specific product for reference and future use. These copies could also be used to meet source redistribution obligations.

The specific tasks would be:

  1. Store all packages/archives which are downloaded and scanned in SCIO
  2. Create an API and index by URL/checksum to get these packages on-demand
  3. Create models to store metadata/history and logs for these downloaded/stored packages
  4. Additionally support and design external storage/fetch options

There should be a configuration variable to turn these features on and to connect external databases/storage.
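
One possible starting point for tasks 1 and 2 is a content-addressed layout, where each archive is stored under its checksum so that it can be fetched on demand and is never stored twice. This is a minimal sketch, assuming local disk storage and SHA1 as the key; `store_archive` and `load_archive` are hypothetical names, not SCIO APIs.

```python
import hashlib
from pathlib import Path
from typing import Optional


def store_archive(storage_root: Path, content: bytes) -> Path:
    """Store an archive under its SHA1 checksum and return the stored path."""
    sha1 = hashlib.sha1(content).hexdigest()
    # Use the first two hex characters as a subdirectory to avoid huge directories.
    target = storage_root / sha1[:2] / sha1
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        target.write_bytes(content)
    return target


def load_archive(storage_root: Path, sha1: str) -> Optional[bytes]:
    """Return the stored archive for a checksum, or None if it is not stored."""
    target = storage_root / sha1[:2] / sha1
    return target.read_bytes() if target.exists() else None
```

The download URL, history and logs from task 3 would live in database models that reference this checksum, and an external storage backend could replace the local paths.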

Priority: Low

Size: Medium

Difficulty Level: Intermediate

Tags:

  • Python
  • Django
  • CI
  • Security
  • Vulnerability
  • SBOM

Mentors:

  • @tdruez
  • @keshav-space
  • @jyang
  • @pombredanne

Related Issues:


ScanCode.io: Update SCIO/SCTK for use in CI/CD:

Code Repositories:

Description:

Enhance SCIO/SCTK to be integrated into CI/CD pipelines such as GitHub Actions, Azure Pipelines, GitLab, Jenkins. We can start with any one CI/CD provider like GitHub Actions and later support others.

These should be enabled and configured as required by scancode configuration files to enable specific functions to be carried out in the pipeline.

There are several types of CI/CD pipelines to choose from potentially:

  1. Generate SBOM/VDRs with scan results:

    • Scan the repo to get all purls: packages, dependencies/requirements
    • Scan repository for package, license and copyrights
    • Query public.vulnerablecode.io for Vulnerabilities by PackageURL
    • Generate SPDX/CycloneDX SBOMs from them with scan and vulnerability data
  2. License/other Compliance CI/CD pipelines

    • Scan repo for licenses and check for detection accuracy
    • Scan repo for licenses and check for license clarity score
    • Scan repo for licenses and check compliance with specified license policy
    • The jobs should pass/fail based on the scan results of these specific cases, so we can have:
      • a special mode to fail with error codes
      • description of issues and failure reasons, and docs on how to fix these
      • ways to configure and set up for these cases with configuration files
  3. Dependency checkers/linters:

    • Download and scan all package dependencies, and get scan results/SBOMs
    • Check for vulnerable packages and do non-vulnerable dependency resolution
    • Check for test failures after dependency upgrades and add a PR only if the tests pass
  4. Jobs which check for and fix miscellaneous other errors:

    • Replace standard license notices with SPDX license declarations
    • Check for and add ABOUT files for vendored code
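
The "pass/fail based on the scan results" idea from the compliance pipelines above could be sketched as a small check that turns a scan into a process exit code. This is an illustration only: it assumes a scan JSON with a top-level `files` list whose entries carry `path` and `detected_license_expression` (to verify against the SCTK output format you target), it tokenizes license expressions naively, and the function names are hypothetical.

```python
OPERATORS = {"AND", "OR", "WITH"}


def find_policy_violations(scan: dict, denied: set[str]) -> list[tuple[str, str]]:
    """Return (path, license_key) pairs for files using a denied license key."""
    violations = []
    for resource in scan.get("files", []):
        expression = resource.get("detected_license_expression") or ""
        # Naive tokenization: split on spaces and parentheses, ignore operators.
        tokens = expression.replace("(", " ").replace(")", " ").split()
        keys = {token for token in tokens if token.upper() not in OPERATORS}
        for key in sorted(keys & denied):
            violations.append((resource.get("path"), key))
    return violations


def exit_code(violations: list[tuple[str, str]]) -> int:
    """Return 0 when the policy is met and 1 otherwise, for use as a process exit status."""
    return 1 if violations else 0
```

A CI job could print the violations with a short explanation of how to fix them and then call `sys.exit(exit_code(violations))`.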

References:

Priority: High

Size: Large

Difficulty Level: Intermediate

Tags:

  • Python
  • Django
  • CI
  • Security
  • License
  • SBOM
  • Compliance

Mentors:

  • @pombredanne
  • @tdruez
  • @keshav-space
  • @tg1999
  • @AyanSinhaMahapatra

Related Issues:


ScanCode Toolkit project ideas


Compute summary for all detected packages:

Code Repositories:

Description:

Today the summary and license clarity scores are computed for the whole scan. Instead we should compute them for EACH package (and its files). This is possible now that we are returning which files belong to a package.

  • Add license clarity scores to package models, so every package can have these
  • Store references to license detection objects for clarity score
  • compute summary and package attributes from their key-files or other files:
    • primary and other licenses
    • copyrights and notices
    • license clarity score (extra field in package model)
    • authors and other misc info
  • make sure the attributes are collected properly for all package ecosystems (like copyrights)

This would make sure all package attributes are properly computed and populated, from their respective package files, instead of only having a codebase level summary.
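
The grouping step behind a per-package summary could look like the sketch below. It assumes each file entry in the scan carries a `for_packages` list of package identifiers and a `detected_license_expression`; check these field names against the SCTK output version you work with. `licenses_by_package` is a hypothetical helper.

```python
from collections import defaultdict


def licenses_by_package(files: list[dict]) -> dict[str, set[str]]:
    """Collect the license expressions detected in the files of each package."""
    result = defaultdict(set)
    for resource in files:
        expression = resource.get("detected_license_expression")
        if not expression:
            continue
        # A file may belong to several packages, or to none at all.
        for package_uid in resource.get("for_packages", []):
            result[package_uid].add(expression)
    return dict(result)
```

The same grouping would apply to copyrights, authors and the inputs of the license clarity score, with key files weighted more heavily.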

References:

Priority: High

Size: Medium

Difficulty Level: Intermediate

Tags:

  • Python
  • Summary
  • Packages

Mentors:

  • @AyanSinhaMahapatra
  • @jyang
  • @pombredanne

Related Issues:


Have variable license sections in license rules:

Code Repositories:

Description:

There is a lot of variability in license notices and declarations in practice, and one example of modeling this is the SPDX matching guidelines. Note that this was also one of the major ways ScanCode used to detect licenses earlier.

  1. Support grammar for variability in license rules (brackets, number of words)
  2. Do a massive analysis of license rules and check for similarity and variable sections. This can be used to add variable sections (for copyright/names/companies) and reduce the number of rules.
  3. Support variability in license detection post-processing for extra-words case
  4. Add scripts to add variable sections to rules from detection issues (like bsd detections)
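
To illustrate what a variable-section grammar (task 1) might mean, here is a minimal sketch that compiles a rule text into a regular expression. The `[[N]]` marker, meaning "up to N arbitrary words", is a made-up syntax for this example only; it is neither the SPDX template syntax nor an existing SCTK feature, and actual license detection does not work by regex matching.

```python
import re

# Hypothetical marker: "[[4]]" stands for up to 4 arbitrary words.
VARIABLE = re.compile(r"\[\[(\d+)\]\]")


def rule_to_regex(rule_text: str) -> re.Pattern:
    """Compile a rule with [[N]] markers into a regex over space-normalized text."""
    pieces = []
    for token in rule_text.split():
        marker = VARIABLE.fullmatch(token)
        if marker:
            # Zero to N words, each followed by a single space.
            pieces.append(r"(?:\S+ ){0,%s}" % marker.group(1))
        else:
            pieces.append(re.escape(token) + " ")
    return re.compile("".join(pieces), re.IGNORECASE)


def matches(rule_text: str, text: str) -> bool:
    """Check whether the whole text matches the rule, ignoring whitespace differences."""
    # Every regex piece consumes one trailing space, so add one at the end.
    normalized = " ".join(text.split()) + " "
    return rule_to_regex(rule_text).fullmatch(normalized) is not None
```

For instance, the single rule `Copyright (c) [[4]] All rights reserved` would cover notices with different years and holder names, which is how variable sections could reduce the number of near-duplicate rules.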

Priority: Medium

Size: Medium

Difficulty Level: Intermediate

Tags:

  • Python
  • Licenses
  • LicenseDetection
  • SPDX
  • Matching

Mentors:

  • @AyanSinhaMahapatra
  • @pombredanne
  • @jyang

Related Issues:


Mark required phrases for rules automatically using NLP/AI:

Code Repositories:

Description:

Required phrases are marked in rules so that a rule is not matched to a text that lacks the required phrase, which would be a false-positive detection.

We are marking required phrases automatically based on what is present in other rules and license attributes, but this still leaves a lot of rules without them.

  • research and choose a model pre-trained on code (StarCoder?)
  • use the dataset of current SCTK rules to train a model
  • Mark required phrases in licenses automatically with the model
  • Test required phrase additions, improve and iterate
  • Bonus: Create a minimal UI to review rule updates massively
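
For context on what a marked required phrase does, the sketch below extracts phrases wrapped in double curly braces from a rule text and checks that a matched text contains all of them. It assumes the `{{...}}` marking used in SCTK rule files; the helper names are hypothetical and the comparison is a simplified, whitespace-normalized substring test rather than the toolkit's token-based matching.

```python
import re

REQUIRED = re.compile(r"\{\{(.+?)\}\}", re.DOTALL)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a simplified comparison."""
    return " ".join(text.split()).lower()


def required_phrases(rule_text: str) -> list[str]:
    """Return the normalized phrases marked with double curly braces in a rule."""
    return [normalize(phrase) for phrase in REQUIRED.findall(rule_text)]


def has_required_phrases(rule_text: str, matched_text: str) -> bool:
    """Check that every required phrase of the rule occurs in the matched text."""
    normalized = normalize(matched_text)
    return all(phrase in normalized for phrase in required_phrases(rule_text))
```

A model-assisted tool would propose where to put the braces in rules that lack them; a check like this one can then be used to test whether the proposed phrases reject known false positives.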

Priority: Low

Size: Medium

Difficulty Level: Advanced

Tags:

  • Python
  • ML/AI
  • Licenses

Mentors:

  • @AyanSinhaMahapatra
  • @tg1999
  • @pombredanne

Related Issues:


Our Project ideas

Here are some project-related attributes to keep in mind while looking into prospective project ideas; see also the guidance on finding the right project.

Project Priority

  1. The repositories/projects are not sorted in order of importance. Instead, each project idea has an explicit priority, which can be Low, Medium or High.

  2. This doesn't mean we will always consider a project proposal with a higher priority idea over a relatively lower priority one, no matter the merit of the proposal. This is only one metric of selection, mostly to prioritize important projects.

  3. You can also suggest your own project ideas/discuss changes/updates/enhancements based on the provided ideas, but you need to really know what you are doing here and have lots of discussions with the maintainers.

Project Length

There are three project lengths:

  1. Small (~90 hours)
  2. Medium (~175 hours)
  3. Large (~350 hours)

If you are proposing an idea from this ideas list, it should match what is listed here, and additionally please have a discussion with the mentors about your proposed length and timeline. Please also open a discussion about the same, if not already present, at https://github.com/nexB/aboutcode/discussions/categories/gsoc or discuss this in the respective issues.

We have marked our ideas with medium/large based on general estimates, but this could vary. In a few cases both are used to mark a project, as it can be either. We have made a conscious effort to make sure projects are not too large, have clear deliverables and can be finished successfully, but still note that these are complex projects and you're likely underestimating the complexity (and how much we'll bug you to make sure everything is up to our standards).

You must discuss your proposal and the size of project you are proposing with a mentor as otherwise we cannot consider your proposal fairly.

We will likely select only medium/large project ideas, as small projects are too small to get familiar with and contribute meaningfully to any of our projects.

Please also note that there is a difference in the stipend based on what you select, and it would not be fair if you're selecting and working on a large project, but getting paid for a medium one (or vice-versa).

Project Tags

Here are all the tags we use for specific projects, feel free to search this page using these if you only want to look into projects with specific technical background.

[Django], [PostgreSQL], [Web], [DataStructures], [Scanning], [Javascript], [UI], [LiveServer], [API], [Metadata], [PackageManagers], [SBOM], [Security], [BinaryAnalysis], [Scraping], [NLP], [Social], [Communication], [Review], [Decentralized/Distributed], [Curation]

Project Difficulty Level

We are generally using three levels of difficulty to characterize the projects:

  • Easy
  • Intermediate
  • Advanced

If a project is marked Advanced, significant domain knowledge is required to tackle it successfully, and you must have prior verifiable experience with it (in the form of open source contributions, either on the same topic in our repos, or elsewhere). You must also consult with mentors/maintainers early, ask a lot of domain-specific questions, and be ready to research and tackle greenfield projects in certain cases if you choose a project in this difficulty category.

Intermediate projects do not require this much domain knowledge, and what they do require can easily be acquired during proposal writing/contributing if you're familiar with the tech stack used in the project. But these are still not straightforward and require lots of feedback from the mentors. Most projects fall in this category.

There are also easy projects which only require honest time and effort from the participant, and decent knowledge about the tech stack/problem.

Questions?

Please feel free to chime in at https://github.com/nexB/aboutcode/discussions/133 or in our GSoC 2024 chatroom at https://matrix.to/#/#aboutcode-org_gsoc2024:gitter.im if you have any questions related to AboutCode's participation in GSoC or anything on this page.
