J.C. Jones edited this page Nov 26, 2020 · 23 revisions

See the Technology Overview for details on the tools


How much does CRLite compress data?

CRLite promises substantial compression of the dataset. In our staging environment, the binary form of all unexpired certificate serial numbers occupies about 16 GB of memory in Redis, and the binary form of all enrolled and unexpired certificate serial numbers occupies about 1.7 GB on disk, while the resulting binary Bloom filter compresses to approximately 5 MB.

These artifacts are in the mlbf folder for a given run, available in the published data-sets. See "Where can I get the CRLite data that is used to make filters?"

Why is CRLite able to compress so much data?

Bloom filters are probabilistic data structures with an error rate due to data collisions. However, if you know the whole range of data that might be tested against the filter, you can compute all the false positives and build another layer to resolve those. Then you keep going until there are no more false positives. In practice, this happens in 25 to 30 layers, which results in substantial compression.
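As a rough illustration of the layering idea, here is a minimal sketch of a filter cascade. It is not the CRLite implementation: the class and function names are hypothetical, SHA-256 from hashlib stands in for MurmurHash3, and the 0.5 per-layer error rate is an arbitrary choice for the example.

```python
import hashlib
import math


class BloomFilter:
    """A tiny Bloom filter; SHA-256 is a stand-in for MurmurHash3 here."""

    def __init__(self, capacity, error_rate, salt):
        capacity = max(capacity, 1)
        # Standard sizing formulas for a target false-positive rate.
        self.size = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.hashes = max(1, round(self.size / capacity * math.log(2)))
        self.salt = salt  # a different salt per layer gives independent hashes
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(b"%d:%d:" % (self.salt, i) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def build_cascade(revoked, valid, error_rate=0.5):
    """Add layers until no false positives remain for the known universe."""
    layers = []
    include = set(revoked)
    exclude = set(valid) - include  # the two sets must be disjoint
    while include:
        layer = BloomFilter(len(include), error_rate, salt=len(layers))
        for item in include:
            layer.add(item)
        layers.append(layer)
        # Items from the other set that this layer wrongly accepts become
        # the set that the next layer has to encode.
        false_positives = {item for item in exclude if item in layer}
        include, exclude = false_positives, include
    return layers


def is_revoked(layers, item):
    """Only meaningful for items that were in the universe at build time."""
    for depth, layer in enumerate(layers):
        if item not in layer:
            # Rejected at an even depth: not revoked. At an odd depth: revoked.
            return depth % 2 == 1
    # Accepted by every layer: the parity of the layer count decides.
    return len(layers) % 2 == 1
```

Because every layer only has to encode the false positives of the previous one, the layers shrink quickly, and a lookup is exact for any item that was part of the universe when the cascade was built.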

Bloom filters have a false-positive rate; how can CRLite be relied upon?

The key innovation for CRLite is that Certificate Transparency (CT) data can be used as a stand-in for "all the certificates in the Web PKI". It's reasonably easy to tell if a certificate is in Certificate Transparency: Was it delivered with a Signed Certificate Timestamp (SCT) from a CT log? Similarly, it's reasonably easy to tell that a certificate was known to a CT log at the time that the CRLite filter was constructed: Was the SCT at least one Maximum Merge Delay older than the CRLite filter?

The remaining issue is whether the issuer is enrolled in the CRLite filter set, which is indicated by a flag provided along with the Firefox Intermediate Preloading data.
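The checks described above can be sketched as a small decision function. This is an illustration only, not Firefox's code: the function and parameter names are hypothetical, the 24-hour Maximum Merge Delay is an assumed value, and treating a single sufficiently old SCT as enough is also an assumption.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a 24-hour Maximum Merge Delay for the CT logs involved.
MAX_MERGE_DELAY = timedelta(hours=24)


def crlite_covers(sct_timestamps, filter_timestamp, issuer_enrolled):
    """Return True if the filter can be expected to know this certificate."""
    if not issuer_enrolled:
        # The issuer is not flagged as enrolled in the preloading data.
        return False
    # Assumption: one SCT that is at least one merge delay older than the
    # filter is sufficient. A certificate with no SCTs is never covered.
    return any(ts + MAX_MERGE_DELAY <= filter_timestamp for ts in sct_timestamps)
```

For example, with a filter built at `datetime(2020, 11, 26, tzinfo=timezone.utc)`, a certificate whose only SCT is one hour older than the filter would not be covered, while one with an SCT three days older would be.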

Why doesn't CRLite use SHA2/SHA3/etc?

We’re using MurmurHash3 because it’s fast and there’s no currently-known need for a cryptographically secure hash function. Even though Murmur is not designed to be cryptographically secure, the input data for Murmur includes a SHA256 hash of the issuer's Subject Public Key Information (SPKI) and the certificate's serial number.

The obvious threat model against the input data involves manipulating hashes by manipulating certificate serial numbers. The CABForum Baseline Requirements place certain requirements on serial numbers, which makes them a difficult vector of attack. Nevertheless, this is an area of active research.

Firefox clients need only a few hashes to check CRLite (one per level), so if in the future we need to move to a more secure hash function, the majority of the additional complexity will fall on the infrastructure side, which can more easily scale up.

How large are the delta updates for CRLite?

They tend to be between 20kB and 50kB, in a form we call "stashes". You can use the crlite_status tool to investigate the sizes of recent runs. Similarly, you can use moz_crlite_query to read and evaluate certificates against the filter+stash sets.

You can see an output of the crlite-status tool, which shows filter statistics by date, here:

How do you pick what CAs are included in CRLite?

All CAs that have fresh Certificate Revocation Lists (CRLs) encoded into their issued certificates get included in CRLite. Fresh means that the CRLs' signatures are valid and that they aren't past their NextUpdate time.

We initially thought we would hand-pick some issuing CAs, but automation was simpler.

Analysis of why issuers become unenrolled from CRLite is ongoing, but the usual culprit in the logs is that the next CRL simply can't be downloaded by the CRLite aggregate-crls tooling, which has limited retry and resume functionality. Audit data is available using the crlite-status tool with the --crl options to analyze when issuers are being enrolled or unenrolled in CRLite.

What happens if a certificate is too new?

Firefox will use OCSP (stapled or actively queried) if the certificate's Signed Certificate Timestamps are too new for the current filter.

What happens if an issuer is unknown?

CRLite won't be used. If the issuer is truly unknown, Firefox will give an unknown issuer warning as always; nothing there will change. If the issuer is not in the Mozilla Root Program, then it won't be eligible for CRLite.

How can you know if a given issuer has its data in CRLite?

CRLite will only run on issuers that are annotated as enrolled in CRLite in Firefox's Intermediate Preloading data. The list can be examined directly using your favorite JSON tooling at this URL:

For details on downloading the attached data file, see the Kinto Attachment plugin for Kinto, used by Firefox Remote Settings.

What happens if CRLite says a certificate is revoked but OCSP says it's valid?

In the short term, we're interested in gathering telemetry on these cases, though no such telemetry is currently defined. That said, at Internet-scale, this is likely a common occurrence: Certificate Authorities generally have lag in updating revocation information, and there's no requirement that CRLs and OCSP update together.

If CRLite proves robust enough, in this scenario we would expect that the CRLite revocation would take precedence, and OCSP would never be checked.

Where can I get CRLite data that Firefox uses?

The CRLite filters are published manually at Firefox Remote Settings. You can examine the data using JSON tooling at this URL:

For details on downloading the attached data file, see the Kinto Attachment plugin for Kinto, used by Firefox Remote Settings. But using jq and httpie, one can chain commands together to obtain the current filter by:

base_url=$(http | jq -r '.capabilities.attachments.base_url')
path=$(http | jq -r  '.data[0].attachment.location')
http --download --output filter.mlbf ${base_url}${path}

Where can I get the CRLite data that is used to make filters?

The production data is hosted in Google Cloud Storage in a bucket named crlite-filters-prod. The web interface for the files is accessible publicly here, though browsing it requires a Google login:

The staging environment, which contains only a fraction of the WebPKI, is here:

The Google gsutil tool is handy for downloading entire datasets (~7 GB each). These commands would download all the files:

mkdir crlite-dataset/
gsutil -m cp -r gs://crlite-filters-prod/20200101-0 crlite-dataset/

The known folder contains one JSON file per enrolled issuing CA, named for that CA, listing all of its unexpired DER-encoded serial numbers. The revoked folder uses the same per-issuer file naming, but its files contain the DER-encoded serial numbers of the revoked certificates. The serials in revoked are not guaranteed to be a subset of known, as many are likely expired, so set math is required to get the known revoked serials from the two directories.
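The set math mentioned for known and revoked amounts to a per-issuer intersection. The sketch below assumes the files have already been parsed into dictionaries mapping an issuer identifier to a collection of serial numbers; that in-memory shape and the function name are hypothetical, and the on-disk file format is not modeled here.

```python
def known_revoked(known, revoked):
    """Per issuer, keep only the revoked serials that are also known (unexpired)."""
    return {
        issuer: set(serials) & set(known.get(issuer, ()))
        for issuer, serials in revoked.items()
    }
```

Serials that appear only in revoked (for example, because the certificate has since expired) drop out of the result, and an issuer with revocations but no known serials maps to an empty set.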

The mlbf folder contains the filter and its metadata as-generated.

The log folder contains all the logs for the runs. As of this writing, many errors and warnings are still emitted that require bugfixing in one fashion or another. There are also many pointers to potential CRL problems with CAs, though few are compliance issues, and at least some are known to be innocent problems.

How can I access statistics about the available filters?

The crlite-status tool is probably what you're looking for. You can get it from PyPI:

pip3 install crlite-status
crlite_status 8

How can I produce my own CRLite filter?

You'll need the crlite repository downloaded locally, and to install the requirements.txt Python packages.

With a full dataset at hand from the above gsutil command:

python3 ~/git/crlite/create_filter_cascade/ -knownPath ./20200101-0/known/ -revokedPath ./20200101-0/revoked/ my_filter_identifier

With sufficient memory, you'll get the output filter; it should be deterministic.

How can I query my CRLite filter?

Firefox uses . There's a simple Python tool for this called moz_crlite_query which can be installed from PyPI as pip3 install moz_crlite_query. Keep in mind it requires Python 3.7+:

pip3 install moz_crlite_query
cat >/tmp/top4.txt <<EOF
# This is definitely half of my top 8 spaces

moz_crlite_query --hosts --hosts --hosts-file /tmp/top4.txt

How can I run the CRLite backend infrastructure myself?

See the main

Why don't you also scrape OCSP?

Doing so many OCSP queries is extremely inefficient. While the original paper's implementation did it, and so did casebenton/certificate-revocation-analysis (our initial proof-of-principle), downloading CRLs scales much better. If CRLite gains traction, OCSP bandwidth savings and speedups may prove to be reasons for CAs to issue CRLs.

What are the "stashes"?

They're binary-encoded flat lists of Issuer Subject Public Key Information hashes, followed by a list of serial numbers.

The script can read stash files.

What determines whether a new filter gets distributed, or a new stash distributed?

Currently CRLite uses a heuristic in which end-users collect stashes until the total size of the collected stashes would exceed the size of a new filter. At that point, the infrastructure will switch over to a new filter and clear all existing stashes.

The contract between CRLite clients and the infrastructure allows the infrastructure to adjust this heuristic at will. Most likely, this will be modified over time to optimize client-side searches, as searching the stashes is slower than searching the Bloom filter cascade, and purely choosing to update the filter on file-size does not account for those speed differences.
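The size-based heuristic described above can be sketched in a few lines. The function name, parameters, and return values are hypothetical and only illustrate the comparison; the real infrastructure may weigh other factors.

```python
def next_publication(existing_stash_sizes, new_stash_size, new_filter_size):
    """Decide whether to publish another stash or a fresh filter (sizes in bytes)."""
    total_stash_size = sum(existing_stash_sizes) + new_stash_size
    if total_stash_size > new_filter_size:
        # Clients would hold more stash data than a full filter, so
        # publish a new filter and clear the stashes.
        return "filter"
    return "stash"
```

A variant that accounts for lookup speed, as the paragraph above suggests, would replace the plain size comparison with a cost that also penalizes the number of stashes a client has to search.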

What CT logs are monitored?

Mozilla monitors all the logs listed in Google's main list. There's a script in the repository, list_all_active_ct_logs, which can parse Google's list, but the actual production CRLite entry uses all logs in the list without filtering, and is periodically updated. Issue #144 covers the idea of loading Google's list during ct-fetch startup as an optional step.

What gets stored in Redis?

ct-fetch stores certificate serial numbers and CRL distribution points in the Redis database.

Serial numbers are stored as Redis sets with the keys being named in the form serials::<expiration date and hour>::<issuer>, with each key's expiration set to automatically expunge upon reaching the expiration day-and-hour.

CRL distribution points (DPs) are also stored as Redis sets, with keys in the form crls::<issuer>. CRL DPs do not expire: once they are discovered, CRLite assumes they will be updated until the retirement of the issuer.
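To make the key layout concrete, the sketch below builds the Redis commands that would record one certificate. It is not ct-fetch's code: the function names are hypothetical, the YYYY-MM-DD-HH rendering of the date and hour is an assumption, and so is setting the key to expire at the start of that hour. SADD and EXPIREAT are standard Redis commands; they are returned as tuples here rather than sent to a server.

```python
from datetime import datetime, timezone


def serials_key(not_after, issuer):
    """Return the serials key and the hour bucket it belongs to."""
    # Truncate the certificate's expiration to the hour, in UTC.
    bucket = not_after.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    # Assumption: the date-and-hour part is formatted as YYYY-MM-DD-HH.
    return "serials::%s::%s" % (bucket.strftime("%Y-%m-%d-%H"), issuer), bucket


def commands_for_certificate(not_after, issuer, serial, crl_urls):
    """List the Redis commands for one certificate as (command, key, value) tuples."""
    key, bucket = serials_key(not_after, issuer)
    commands = [
        ("SADD", key, serial),
        # Assumption: the whole set is expunged when the bucket hour arrives.
        ("EXPIREAT", key, int(bucket.timestamp())),
    ]
    for url in crl_urls:
        # CRL distribution point sets get no expiration.
        commands.append(("SADD", "crls::%s" % issuer, url))
    return commands
```

Grouping serials by expiration hour means Redis can drop expired certificates a whole set at a time, without the tooling having to scan individual serial numbers.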