GSoD 2023 Proposal

Ayan Sinha Mahapatra edited this page Mar 25, 2023 · 10 revisions

Our list of ideas: https://github.com/nexB/aboutcode/wiki/GSOD-2023#our-project-ideas

Our tech writer selection process: https://github.com/nexB/aboutcode/wiki/GSoD-Technical-Writer-Selection-Process

Proposal Title

Add VulnerableCode How-To guides, PurlDB reference docs and a glossary

About our Organization

Organization Scope

AboutCode.org is a community of developers who focus on Software Composition Analysis (SCA) tools (command line tools, web-based and API servers and applications) and data for identifying and tracking software origin, licensing and security vulnerabilities. SCA tools and data are essential to enable everyone to safely produce and use free and open source software. With modern software reusing millions of free and open source software components available on the web, we think that making these essential SCA tools FOSS themselves would have a huge impact on how we reuse open source software freely, safely and responsibly. Our tools are not only open source, they help everyone use more open source software!

Our Projects

The focus for GSoD 2023 is on VulnerableCode and PurlDB.

VulnerableCode and PurlDB are new and upcoming projects from AboutCode! VulnerableCode is a free and open vulnerability data aggregation database, with information about vulnerabilities and the packages affected by them. VulnerableCode has data importers for all major vulnerability advisory publishers, tools for version range parsing and resolving, bots for data consistency checks and improvements, and comparison tools for other vulnerability databases (CVEs are supported). We also have an instance (and an open API endpoint) for public use at: public.vulnerablecode.io.

PURL is a leading effort to standardize software package identification. It is widely used by open source foundations and organizations, by SBOMs and other data formats for SCA tools, and by major tech organizations, including in Google's https://github.com/google/osv.dev. PurlDB is a database of packages keyed by PackageURL (PURL), with tools for scanning packages, matching and indexing for comparison, mining (fetching package metadata from package managers) and more.
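To make the PURL format concrete, here is a minimal sketch of how a PURL string of the form `pkg:type/namespace/name@version?qualifiers#subpath` breaks down into its components. This is a simplified illustration written with the Python standard library only; `SimplePurl` and `parse_purl` are hypothetical names, and the sketch skips most of the validation and normalization rules of the full PackageURL specification.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional
from urllib.parse import parse_qsl, unquote


@dataclass
class SimplePurl:
    """Hypothetical container for the components of a PURL."""
    type: str
    namespace: Optional[str]
    name: str
    version: Optional[str] = None
    qualifiers: Dict[str, str] = field(default_factory=dict)
    subpath: Optional[str] = None


def parse_purl(purl: str) -> SimplePurl:
    """Split a PURL string into components (simplified, not spec-complete)."""
    scheme, sep, rest = purl.partition(":")
    if not sep or scheme.lower() != "pkg":
        raise ValueError(f"not a PURL (missing 'pkg:' scheme): {purl!r}")

    rest = rest.lstrip("/")
    # Strip the optional subpath and qualifiers from the right-hand side.
    rest, _, subpath = rest.partition("#")
    rest, _, query = rest.partition("?")

    # The version follows the last '@'; a literal '@' in a namespace
    # or name is percent-encoded as %40.
    if "@" in rest:
        path, _, version = rest.rpartition("@")
    else:
        path, version = rest, ""

    pkg_type, _, remainder = path.partition("/")
    segments = [unquote(s) for s in remainder.split("/") if s]
    if not pkg_type or not segments:
        raise ValueError(f"a PURL needs at least a type and a name: {purl!r}")

    return SimplePurl(
        type=pkg_type.lower(),
        namespace="/".join(segments[:-1]) or None,
        name=segments[-1],
        version=unquote(version) or None,
        qualifiers={k.lower(): v for k, v in parse_qsl(query)},
        subpath=unquote(subpath).strip("/") or None,
    )
```

For example, `parse_purl("pkg:maven/org.apache.commons/io@1.3.4")` yields the type `maven`, the namespace `org.apache.commons`, the name `io` and the version `1.3.4`. Real tooling should rely on a maintained PURL library rather than this sketch.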

We at aboutcode have also created and maintained a lot of other open-source projects like:

  • scancode-toolkit: a widely used static code scanner for getting license, copyright, package and other data out of code.
  • scancode.io: a webapp to scan containers, VMs, docker images, packages or source archives with scriptable pipelines and reusable components to support SCA workflows.
  • python-inspector, nuget-inspector: dependency resolvers for dynamic code analysis.

Why select us?

Accepting our project for GSoD to improve the documentation for our tools will be extremely useful for attracting more people to use and contribute to our projects, and for making sure open access SCA and vulnerability data becomes the standard.

AboutCode.org was started by nexB Inc. in 2013. We have many contributors from a growing FOSS community, including students who have continued contributing to our projects after GSoC and GSoD program participation. As with many open source projects, we only know the identity of a subset of our users, but we know that our AboutCode software is used by (and receives contributions from) several open source foundations such as Eclipse, OW2 and the FSFE, and many projects such as ORT, REUSE and Tern. Our projects are also used by major tech companies, including Google, and by their open source program offices.

About our project

Project problem

VulnerableCode today has basic documentation: a Getting Started section, developer-focused tutorials on adding new data sources and data consistency checkers, comprehensive API documentation, and Reference docs with overviews of the important concepts. What is missing from the VulnerableCode documentation is How-To Guides for using the UI to get package/vulnerability data and for the complete vulnerability checking workflow, using PackageURLs obtained from code scanning tools like ScanCode-Toolkit or ScanCode.io. This is the major area of improvement where we want to focus, because it would allow users who are relatively new to software vulnerabilities (generally all members of the developer community) to use the VulnerableCode UI more effectively and to integrate VulnerableCode into their workflows.

PurlDB is a recently released AboutCode project that has only limited documentation. The goals will be to migrate the existing documentation from the code repository to a more friendly ReadTheDocs format, and to add detailed Reference documentation for PurlDB, which is a database of package data keyed by Package-URLs (the leading modern package identifier specification used by modern SCA tools), and for the tools that you can use to fetch, create, index and compare records in this database. The goal will also include upgrading the Getting Started documentation for both users and developers. This will give new users the necessary context and directions to start using PURLs and PurlDB in their SCA workflows.

There is also a growing need for a glossary with Reference documentation for the terminology used in AboutCode projects, such as scanning, matching, packages, SBOMs and vulnerabilities. As there are many competing standards and tools used in the SBOM and SCA tooling space, there is a need to clearly define the terminology used in AboutCode projects. Some examples are scanning versus matching, packages versus components, and copyright holders versus authors.

Project scope

Add new How-To guides for VulnerableCode:

  • Create a How-To Guide for using the new VulnerableCode UI to look for vulnerabilities or vulnerable packages
  • Create a Tutorial that walks through the workflow of scanning a package with scancode.io to get a list of PURLs, and then looking up vulnerabilities for those packages in the public VulnerableCode instance
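The lookup step of that workflow could be sketched as follows, using only the Python standard library. This is an illustration under stated assumptions, not the documented client: the endpoint path `/api/packages/`, the `purl` query parameter, the `Authorization: Token` header, and the response field names (`results`, `affected_by_vulnerabilities`, `vulnerability_id`) are assumptions that should be checked against the VulnerableCode API documentation. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed package lookup endpoint on the public instance.
BASE_URL = "https://public.vulnerablecode.io/api/packages/"


def build_lookup_url(purl, base_url=BASE_URL):
    """Return the lookup URL for one PURL, with the PURL percent-encoded."""
    return f"{base_url}?{urlencode({'purl': purl})}"


def fetch_packages(purl, api_key, timeout=30):
    """Query the instance for one PURL and return the decoded JSON payload.

    The token header format is an assumption; the API key is the one
    registered on the public instance.
    """
    request = Request(
        build_lookup_url(purl),
        headers={
            "Authorization": f"Token {api_key}",
            "Accept": "application/json",
        },
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


def vulnerability_ids(payload):
    """Collect unique vulnerability ids from a payload, in first-seen order.

    The field names used here are assumed, not confirmed.
    """
    ids = []
    for package in payload.get("results", []):
        for vuln in package.get("affected_by_vulnerabilities", []):
            vuln_id = vuln.get("vulnerability_id")
            if vuln_id and vuln_id not in ids:
                ids.append(vuln_id)
    return ids
```

In a tutorial, the list of PURLs produced by a scan would be fed through `fetch_packages` one at a time, and `vulnerability_ids` would summarize each response.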

Add Reference documentation for PurlDB

  • Convert and enhance the docs in https://github.com/nexB/purldb#readme into RTD pages and sections
  • Update local development installation and configuration documentation
  • Add Reference documentation on why to use PurlDB
  • Add Reference documentation on PurlDB and its components
    • PackageDB
    • MatchCode
    • MineCode
  • Add Reference documentation for how PurlDB works, such as:
    • What is package mining?
    • What is package matching?

Add a glossary of concepts/words used in VulnerableCode and PurlDB:

These are also more generally used in AboutCode projects.

  • Compile a complete list of terminology used in AboutCode projects and what the implied meanings and definitions are. Some of these are:
    • different flavors of packages: top-level packages, package_data, resolved_package, dependency
    • terms around vulnerabilities, packages and their relationships
  • Add detailed explanations and reference documentation for these

Contributors and Mentors

We already have 14 prospective candidates who have shown interest in our organization and shared their previous technical writing experience with us in our public GSoD chat.

According to our timeline, we have started receiving draft statements of interest, but we will start the process of tech writer selection only after we are selected as an organization for GSoD 2023.

We have 8 members of the AboutCode community, maintainers of and contributors to different projects, who have committed their time as mentors (and for org-admin responsibilities). They will help and support the tech writer we select with various aspects of their documentation writing, planning, and understanding our projects. They are @pombredanne, @mjherzog, @DennisClark, @JonoYang, @johnmhoran, @AyanSinhaMahapatra, @tg1999 and @keshav-space.

Measuring our project’s success

Add new How-To Guides for VulnerableCode:

Primary Metric:

We provide API keys for https://public.vulnerablecode.io/ and will track new user growth there. We will measure this monthly from the time technical writing begins (May) until 4 months after the VulnerableCode documentation is published (we have 2 months for this section and 3 months for the rest of the documentation, with another month before the case study is due). We would consider this project to be a success if we see 50% or more additional growth in the number of new API keys registered compared to the time before the documentation was available.
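As an illustration of how this metric could be evaluated, here is a minimal sketch that compares the average number of new API keys per month before and after the documentation is published. The function names and the monthly counts in the usage note are hypothetical, and averaging per month is one possible reading of "additional growth", not a fixed definition.

```python
from statistics import mean


def growth_percent(baseline_monthly, followup_monthly):
    """Percent change in average monthly new API keys.

    baseline_monthly: new keys per month before the docs were available.
    followup_monthly: new keys per month after the docs were published.
    """
    baseline = mean(baseline_monthly)
    if baseline == 0:
        raise ValueError("baseline average is zero; growth is undefined")
    return (mean(followup_monthly) - baseline) * 100 / baseline


def meets_target(baseline_monthly, followup_monthly, target_percent=50.0):
    """True when growth reaches the target (50% in this proposal)."""
    return growth_percent(baseline_monthly, followup_monthly) >= target_percent
```

With made-up counts of 10 new keys per month before and 15 per month after, `growth_percent([10, 10], [15, 15, 15, 15])` gives 50.0, which would just meet the target.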

Secondary Metric:

We also have email IDs associated with each API key, and we will design a feedback form for the users with 10 qualitative questions on the VulnerableCode documentation, each of which can be rated on a scale of 1-10. We will send this feedback form twice, each time to a subset of users (for example two-thirds of the users each time), so we can compare rating improvements both for users who received the form twice and for users who received it only once.

  • Before the documentation project starts
  • A month after the documentation writing ends and a month before the case study is due

We expect a net improvement of at least 30% in ratings between the start of the project and its completion, to consider this project a success.
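The comparison of the two survey rounds could be computed along these lines. This is a sketch under assumptions: each response is stored as a list of ten 1-10 ratings keyed by a user identifier, each response is reduced to its mean rating, and the function names and data layout are hypothetical.

```python
from statistics import mean


def percent_change(before, after):
    """Percent change from one average rating to another."""
    return (after - before) * 100 / before


def rating_improvement(first_round, second_round):
    """Compare two survey rounds.

    Each round maps a user id to that user's list of 1-10 ratings.
    Returns the improvement across all respondents, and separately for
    users who answered both rounds (when there are any).
    """
    def overall(responses):
        # Average of each respondent's mean rating.
        return mean(mean(ratings) for ratings in responses.values())

    result = {
        "all_users": percent_change(overall(first_round), overall(second_round))
    }

    repeat_users = sorted(first_round.keys() & second_round.keys())
    if repeat_users:
        before = mean(mean(first_round[u]) for u in repeat_users)
        after = mean(mean(second_round[u]) for u in repeat_users)
        result["repeat_users"] = percent_change(before, after)
    return result
```

Reporting the repeat-user figure alongside the overall figure helps separate a real improvement from a change in who answered the form.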

Add reference documentation for PurlDB:

Primary Metric:

PurlDB is a new AboutCode project with only minimal documentation, which makes it harder to define a success metric. We will track new contributions (new Issues and Pull Requests) measured monthly for the 5 months in the GSoD project timeline, and compare the number of new contributions/issues gained after the completion of the technical writing phase to the number gained before tech writing started and during the tech writing period. We will consider this project to be a success if we see 80% or more additional growth in the number of new contributions/issues compared to the time before the documentation was written.

Timeline

The project will take approximately 5 months to complete, including buffer time reserved for unforeseen circumstances/challenges. This excludes orientation and planning, which will be completed before the tech writing period begins. See our detailed timeline for GSoD here.

Here is our rough project timeline, but note that this is subject to change after discussion and planning starts with mentors and the tech writer.

| Project | Dates | Time taken (approx.) | Action items |
| --- | --- | --- | --- |
| General | May 4 - May 15 | 1.5 weeks | Community bonding and project planning |
| VulnerableCode | May 16 | | Technical writing begins |
| VulnerableCode | May 16 - May 22 | 1 week | Looking into VulnerableCode, researching and exploring |
| VulnerableCode | May 23 - June 5 | 3 weeks | How-To on VulnerableCode UI |
| VulnerableCode | June 6 - June 26 | 3 weeks | How-To on VulnerableCode integration and usage in SCA workflows |
| VulnerableCode | June 27 - July 7 | 2 weeks | Other docs on VulnerableCode and review/feedback on How-To guides |
| VulnerableCode | July 10 - July 14 | 1 week | Buffer time |
| VulnerableCode | May 16 - July 14 | 2 months | Total VulnerableCode |
| PurlDB | July 17 - July 28 | 2 weeks | Looking into PurlDB, researching and exploring |
| PurlDB | July 31 - August 18 | 3 weeks | Reference docs: PurlDB components and how it works: MineCode, MatchCode, PackageDB |
| PurlDB | August 21 - September 1 | 2 weeks | Update getting started guidelines |
| PurlDB | September 4 - September 11 | 1.5 weeks | Restructure existing docs into RTD sections, misc review feedback |
| PurlDB | September 12 - September 15 | 0.5 week | Reference docs: Why PurlDB |
| PurlDB | July 17 - September 15 | 2 months | Total PurlDB |
| AboutCode Glossary | September 18 - September 22 | 1 week | Looking into other AboutCode projects, and preparing an exhaustive list of glossary elements |
| AboutCode Glossary | September 25 - October 6 | 2 weeks | Preparing glossary docs |
| AboutCode Glossary | October 9 - October 13 | 1 week | Review/feedback on glossary and buffer |
| AboutCode Glossary | September 18 - October 13 | 1 month | Total AboutCode Glossary |

Budget

| Budget Item | Amount (US $) | Notes/Justifications |
| --- | --- | --- |
| Technical writer stipend to update, add, test, and publish new documentation of VulnerableCode and PurlDB for AboutCode | 15000.00 | We are allocating our budget completely to our tech writer as we do not have any other expenses |
| Total | US $15000.00 | |

Our budget for the project is $15,000 which will be allocated completely to the technical writer working on the project. We do not see any other expenses because:

  • We will be using open source software for all our documentation efforts so there would not be any licenses/other expenses for commercial software.
  • Mentors/Volunteers are already financially supported.

We expect that the technical writer will work on our project full time over a period of 5 months (May to October), with key deliverables set for each month. We will disburse funds to the technical writer once after hiring and thereafter monthly over the 5 months, based on completion of agreed-upon deliverables (paid from the first org payment), and the rest upon project completion, from the final payment to the organization.

We already have an open-collective account for aboutcode used actively to fund open source contributors, and for previous years of GSoD/GSoC.

Additional Information

Previous experience with technical writers or documentation:

We believe that documentation should be created, managed and tested like code. With this in mind we expect to include any technical writer directly into the corresponding project development team. This approach worked well for our former GSoD project participation in 2019 and we have adapted it for other contributors to our project documentation.

Our documentation builds are tested in CI/CD, along with linters and link checkers. Most of this documentation infrastructure was implemented based on work from our 2019 GSoD program.

We worked with an experienced technical writer in GSoD 2021 and learnt a lot about managing technical writers from this program, especially about setting expectations and discussing deliverables. We also learned that it is best to let the core maintainer of the project be the primary mentor for the technical writer, and have everyone else chime in with review and feedback on the documentation written.

Based on our recent experience mentoring both newcomers and experienced technical writers, we believe we have the process and tools in place for quickly on-boarding a new technical writer and letting them focus on new content structure, design and creation.

Our mentors also have significant experience working with technical writers from prior product development work at commercial software companies (with experienced technical writers) and mentoring open source communities where we have lots of contributors (mostly newcomers to tech writing) writing and maintaining our technical documentation.

Previous participation in Google Season of Docs, Google Summer of Code:

GSoC

All of our present mentors have participated in one or more Google Summer of Code programs since 2015 and we also have 3 org-admins here who have been org-admins in GSoCs since 2015. Additionally we also have 3 mentors who have participated in GSoC as students in 2020, 2021 and 2022 respectively. We are also pleased to be selected for GSoC 2023 this year!

GSoD

We have been selected for GSoD twice: once in the inaugural year in 2019, and once in 2021 in the current format of GSoD. We have 4 mentors and org-admins this year who were mentors and org-admins in these successful years of GSoD. We therefore have experience both mentoring beginners to open source and working with experienced technical writers to successfully write great documentation.

We also have as a mentor and org-admin our GSoD contributor from 2019, who has also participated in GSoC 2020, and been a mentor on both GSoC and GSoD programs thereafter.

See GSoD reports, case studies and documentation written in our previous GSoD years:

For GSoD 2019, our project focused on documentation for our ScanCode-Toolkit project. The first step was to move existing documentation from a GitHub wiki to ReadTheDocs (with Sphinx and other tools) in order to better link documentation with code as part of our overall CI process. The project then focused on adding Tutorials and improving How-to and (command-line) Reference documentation.

Based on experience from our 2019 GSoD project, we were able to confirm that the RTD/Sphinx tools were a good fit for our projects and we have since moved the primary documentation for our other projects to RTD. We also piloted our use of the documentation framework of Tutorials, HowTo Guides, Reference and Discussion (from Daniele Procida of Divio/Django) which we are applying to all of our projects as we improve their documentation.

For GSoD 2021, our project focused on our scancode.io project. The tasks were extending the HowTo Guides to cover Software Composition Analysis workflows, then upgrading the ScanCode.io Web UI documentation and creating an introductory video to show how the web UI is used. We also worked on updating and improving the existing Pipe libraries reference API documentation (which is generated from code documentation “docstrings”). Lastly, we synced the new documentation set with the code to support continuous integration.

Since we were working with an experienced tech writer, and the hiring and admin work was entirely on us, we learnt a lot through this process and the feedback is documented here.

Our years of experience mentoring successful GSoC projects, together with 2 years of extensive GSoD experience, put us in a comfortable and confident position to successfully mentor in GSoD 2023.
