Description
Hi! I wanted to bring this issue up for discussion. I'll try to describe the importance of package diffs, existing solutions, some possible architectures and some possible solutions that might fit hex.pm. By web-based diffs I mean highlighted outputs from git diff in a browser, with shareable links.
We first started seeing npm packages get hijacked, but recently RubyGems has been having issues (rest_client, strong_password, many others). By hijacking I mean the type of attack where someone gets access to the credentials of the author of a package and uploads malicious versions. Some examples of scenarios:
- You merge the new updates and deploy, and now you're running infected code in production. The attacker can then mine cryptocurrency or inject HTTP handlers that respond to certain payloads, even giving access to servers and databases.
- You do automatic bumps in CI. When you run the tests, any malicious dependency can read your code or env variables and send them home. Potential leaks: third-party secrets (like AWS credentials) and the entire source code of your application.
A better workflow for updating dependencies
If we can make it easy and painless to audit dependency updates, even if it just means scrolling through the diff, we can close this avenue of attack partially or completely.
Some services already exist for this, like one for npm and one for RubyGems, and I've created a POC for Hex at https://diff.jola.dev.
Architectural problems
I see two broad directions to take with implementing this:
- Create diffs on demand
- Create and store all diffs somewhere ahead of time, and generate new diffs whenever a package is updated
Both have issues and are non-trivial to implement, depending on how reliable we want it to be.
As for generating diffs, the most straightforward way seems to be doing what mix hex.package diff does: download the tarballs, unpack them, diff them, and then remove the leftover files. It might also be possible to generate diffs for every incremental change, eg 0.1.0+0.2.0, 0.2.0+0.3.0, etc, and then merge the diffs, but I don't know if that's really better.
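To make the unpack-and-diff step concrete, here is a minimal sketch in Python. It is not how mix hex.package diff is implemented. It assumes both tarballs are already downloaded and are plain tar archives of source files, and it swaps in one simplification: members are read in memory instead of being unpacked to disk, which avoids both the cleanup step and path traversal from untrusted archives. The names read_members and diff_tarballs are made up for illustration.

```python
import difflib
import tarfile


def read_members(archive_path):
    """Map each regular file in the archive to its lines (non-UTF-8 files are skipped)."""
    files = {}
    with tarfile.open(archive_path) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # ignore directories, symlinks, devices
            handle = tar.extractfile(member)
            if handle is None:
                continue
            try:
                text = handle.read().decode("utf-8")
            except UnicodeDecodeError:
                continue  # binary content is out of scope for this sketch
            files[member.name] = text.splitlines(keepends=True)
    return files


def diff_tarballs(old_path, new_path):
    """Return one unified diff covering every added, removed, or changed file."""
    old, new = read_members(old_path), read_members(new_path)
    chunks = []
    for name in sorted(old.keys() | new.keys()):
        chunks.extend(
            difflib.unified_diff(
                old.get(name, []),
                new.get(name, []),
                fromfile=f"a/{name}",
                tofile=f"b/{name}",
            )
        )
    return "".join(chunks)
```

Unchanged files produce no output, so two identical archives yield an empty string.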
General thoughts
The two directions I suggested are not mutually exclusive; we might want a combination of features.
Not all diffs are equal from an auditing viewpoint. Checking the last few versions is useful, but diffing phoenix 0.1.0 with 1.4.10 is not, so we probably want to limit the number of steps supported. A single step is not enough, though: attacks are getting more sophisticated, where one malicious release is made, quickly followed by 1-2 innocuous ones to hide it. It might also be a good idea to make it semver aware.
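A step limit like this could be checked roughly as follows. This is only a sketch: the function names and the default of 5 steps are assumptions, and the version parser only understands plain MAJOR.MINOR.PATCH strings, not pre-release or build tags.

```python
def parse_version(version):
    """Parse a plain MAJOR.MINOR.PATCH string into a sortable tuple."""
    major, minor, patch = version.split(".")
    return int(major), int(minor), int(patch)


def steps_between(versions, from_version, to_version):
    """Count how many releases separate the two versions, in version order."""
    ordered = sorted(versions, key=parse_version)
    return ordered.index(to_version) - ordered.index(from_version)


def is_allowed(versions, from_version, to_version, max_steps=5):
    """Allow only forward diffs that span at most max_steps releases."""
    steps = steps_between(versions, from_version, to_version)
    return 0 < steps <= max_steps
```

Sorting on the parsed tuple rather than the raw string matters here, since a string sort would place 0.10.0 before 0.9.0. A window wider than one step also lets an auditor diff across a malicious release that was followed by a couple of harmless ones.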
Depending on the direction we go, it might need to be hardened with rate limiting, size limits etc.
On-demand
This is how https://diff.jola.dev works: it generates diffs on request and caches them in ETS. This works surprisingly well, and I haven't really seen any load issues. It should probably be based on an LRU cache to limit memory use.
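The LRU idea can be sketched independently of ETS. The Python class below is illustrative only and is not how diff.jola.dev works; the name DiffCache, the (package, from, to) key shape, and the byte budget are all assumptions. It bounds memory by total stored bytes rather than by entry count, since diff sizes vary a lot.

```python
from collections import OrderedDict


class DiffCache:
    """LRU cache for diffs, bounded by the total number of bytes stored."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # oldest entry first

    def get(self, key):
        diff = self._entries.get(key)
        if diff is not None:
            self._entries.move_to_end(key)  # mark as most recently used
        return diff

    def put(self, key, diff):
        if key in self._entries:
            self.used_bytes -= len(self._entries.pop(key))
        self._entries[key] = diff
        self.used_bytes += len(diff)
        # Evict least recently used entries until we are back under budget.
        while self.used_bytes > self.max_bytes and self._entries:
            _, evicted = self._entries.popitem(last=False)
            self.used_bytes -= len(evicted)
```

A request handler would call get first and only generate the diff, then put it, on a miss. Note that a single diff larger than the budget is evicted immediately, which may or may not be the desired behaviour.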
Pre-generated
This would probably involve building a secondary service that generates diffs and stores them somewhere. Whenever a new version is released it would automatically generate diffs for it. Then the frontend can look up diffs and no work is done inline.
I'm experimenting with this approach, but assuming we want all possible diffs, the storage required would be huge. UPDATE: I wrote some code to generate all possible diffs for the registry, let it run for an hour or so. It generated something like 85K diffs for 1600 packages. Each diff was compressed, but it still used about 4GB of disk space.
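The growth is easy to see with a little arithmetic. Taking the rough figures above at face value, 85K diffs over 1600 packages is about 53 diffs per package, and 4GB over 85K diffs is on the order of 47 KB per compressed diff. The sketch below compares the number of diffs for one package when every pair of versions is diffed versus when only versions within a fixed window are diffed; the function names and the window of 5 are assumptions for illustration.

```python
def all_pairs(version_count):
    """Number of diffs if every version is compared with every earlier one."""
    return version_count * (version_count - 1) // 2


def windowed_pairs(version_count, window):
    """Number of diffs if each version is compared only with up to `window` predecessors."""
    return sum(min(i, window) for i in range(1, version_count))


# For a package with 100 releases:
print(all_pairs(100))          # 4950
print(windowed_pairs(100, 5))  # 485
```

The all-pairs count grows quadratically with the number of releases, while the windowed count grows linearly, so limiting the window matters most for packages with long release histories.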
Displaying diffs
There are ready-made solutions for making pretty HTML from a git diff, like https://diff2html.xyz/, that can run either from the shell or in the browser. It seems to handle reasonably large diffs even in the browser: there's no lag after it finishes rendering the generated HTML, but rendering it can take a second or two in extreme cases (on my machine). I expect it to work less well on a mobile phone. Arguably we shouldn't expend effort supporting mobile phones.
We could reduce the work in the browser by generating the HTML on the server side, but of course the generated HTML is much larger than the raw diff.
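As a rough illustration of server-side rendering and of why the output grows, here is a minimal Python stand-in. It is not diff2html and does no syntax highlighting; it only wraps each line of a unified diff in a span with a class that CSS could colour. The class names and the function name are made up for this sketch.

```python
import html

# Checked in order, so file headers win over plain additions and deletions.
# Limitation: a removed line whose content starts with "--" looks like a header.
LINE_CLASSES = (
    ("+++", "file"),
    ("---", "file"),
    ("@@", "hunk"),
    ("+", "add"),
    ("-", "del"),
)


def render_diff_html(diff_text):
    """Wrap every line of a unified diff in a classed, HTML-escaped span."""
    rows = []
    for line in diff_text.splitlines():
        css_class = "ctx"  # unchanged context line by default
        for prefix, name in LINE_CLASSES:
            if line.startswith(prefix):
                css_class = name
                break
        rows.append(f'<span class="{css_class}">{html.escape(line)}</span>')
    return '<pre class="diff">\n' + "\n".join(rows) + "\n</pre>"
```

Every line gains a fixed amount of markup plus any escaping, which is where the size increase over the raw diff comes from. Escaping is also what keeps package contents from being interpreted as HTML by the browser.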
API
Finally, there's the question of how we expose this. Here are some potential directions:
- API endpoint for diffs, GET /diff/name/from/to, returning a text blob
- Web endpoint for diffs
- A UI for selecting packages and versions, like https://diff.jola.dev
- Adding links to the last few diffs to a package page
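For the API endpoint, the path would need to be parsed and validated before any work is done. A minimal sketch, assuming package names are lowercase letters, digits and underscores and versions are plain MAJOR.MINOR.PATCH (both assumptions that a real endpoint would have to relax or confirm against the registry):

```python
import re

# Hypothetical route shape taken from GET /diff/name/from/to.
DIFF_ROUTE = re.compile(
    r"/diff/(?P<name>[a-z][a-z0-9_]*)"
    r"/(?P<from>\d+\.\d+\.\d+)"
    r"/(?P<to>\d+\.\d+\.\d+)"
)


def parse_diff_path(path):
    """Return (name, from_version, to_version), or None if the path is malformed."""
    match = DIFF_ROUTE.fullmatch(path)
    if match is None:
        return None
    return match["name"], match["from"], match["to"]
```

For example, parse_diff_path("/diff/phoenix/1.4.9/1.4.10") gives ("phoenix", "1.4.9", "1.4.10"), while anything that does not match the whole pattern is rejected up front, which is a cheap first layer of hardening.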
A suggestion for a first approach
This issue has a lot of description of context, and it might be hard to agree on a way forward, so I want to suggest a limited implementation that we can build on.
- Generate diffs on demand, with caching
- Create a web endpoint that displays HTML-formatted, syntax-highlighted diffs
- Add links to latest diffs in each package page (eg "see what changed since the previous version")
- Support all versions of a package, but only comparisons between versions that are within 5 releases of each other
With the last bullet I am referring to the issue of generating diffs for 0.1.0..1.4.10. For auditing purposes we generally only need to diff each version with the preceding one. I just suggest 5 versions to keep some flexibility, but we could do fewer.
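The package-page links from the list above could be derived from nothing more than the list of released versions. The sketch below is hypothetical: the /diff/name/from/to URL shape, the function name, and the default of three links are assumptions, and version parsing only handles plain MAJOR.MINOR.PATCH strings.

```python
def recent_diff_links(name, versions, count=3):
    """Build links for the most recent consecutive-version diffs, newest first."""
    if count <= 0:
        return []
    ordered = sorted(versions, key=lambda v: tuple(int(part) for part in v.split(".")))
    pairs = list(zip(ordered, ordered[1:]))  # each version paired with its successor
    return [f"/diff/{name}/{old}/{new}" for old, new in reversed(pairs[-count:])]
```

A package with a single release has no predecessor to compare against, so it simply gets no links.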
This seems to have a lot of potential and I'm very interested in feedback!
Last thoughts
This is not a trivial project and I expect some discussion here. It might be good to focus on the smallest possible implementation that would still be useful, to get a reasonable scope. It might also be that we don't feel this auditing functionality should be part of hex.pm, but rather could be left to a separate service, like the one I set up.
I also expect this issue to bring new ideas and creative solutions. This is all part of us as a community putting effort into finding better ways to work securely within the ecosystem.