This is a proof of concept for implementing searchable encryption over Elasticsearch. It uses industry-standard AES encryption for strings, and pyope (an implementation of the order-preserving encryption scheme by Boldyreva et al.) for numbers, dates and times.
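Equality search over encrypted tokens works because the encryption is deterministic: equal plaintexts yield equal ciphertexts, which the server can compare without decrypting anything. A stdlib-only sketch of that idea, using HMAC-SHA256 as a stand-in for this project's AES scheme (the `encrypt_token` helper and the key below are made up for illustration):

```python
import hashlib
import hmac

def encrypt_token(key: bytes, token: str) -> str:
    # Deterministic: the same (key, token) pair always produces the same
    # ciphertext, so the server can match an encrypted query token against
    # indexed tokens without ever seeing the plaintext.
    return hmac.new(key, token.lower().encode(), hashlib.sha256).hexdigest()

key = b"key goes here!!!"
index = {encrypt_token(key, t) for t in "the quick brown fox".split()}
assert encrypt_token(key, "fox") in index        # equality search works
assert encrypt_token(key, "foxie") not in index  # unknown token: no match
```

The trade-off is the usual one for deterministic schemes: identical tokens are visible as identical ciphertexts, which leaks term frequencies to the server.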
There are two classes: a generic `SearchableCipher` and an Elasticsearch-specific `SearchableCipherElasticsearch`. Aside from an indexing helper, the latter provides proper Elasticsearch tokenization.
Generic:
```python
import datetime

sc = SearchableCipher(b'key goes here!!!')
# Encrypts a list of tokens
sc.encrypt_tokens(["foo", "bar"])
# Splits text into tokens and encrypts them
sc.encrypt_tokenize_text("Foo BAR!")
# Encrypts one token as is
sc.encrypt_token("The quick brown fox jumps over the lazy dog!")
# Encrypts an integer
sc.encrypt_int(1234)
# Encrypts a datetime.date
sc.encrypt_date(datetime.date(2010, 11, 12))
# Encrypts a datetime.time
sc.encrypt_time(datetime.time(13, 14, 15))
```
Elasticsearch:
```python
sce = SearchableCipherElasticsearch(b'key goes here!!!')
doc = {
    "text": "The quick brown fox jumps over the lazy dog!",
    "value": 1234,
}
sce.index_doc(doc, [], ["text"], [], [], ["value"])
sce.es.indices.refresh(sce.index_name)

# Matches nothing
sce.search({"match": {"text": sce.encrypt_token("foxie")}})
# Matches "Quick"
sce.search({"match": {"text": sce.encrypt_token("qui")}})
# Matches nothing
sce.search({"range": {"value": {"gt": sce.encrypt_int(1235)}}})
# Matches 1234
sce.search({"range": {"value": {"gt": sce.encrypt_int(1233)}}})
```
To run the generic tests:

```
python tests/test_basic.py
```

To run the Elasticsearch tests you need to install and run Elasticsearch. Then:

```
python tests/test_es.py
```
- Elasticsearch is optional; it is used only for its excellent tokenization. Any other backend can be added easily.
- All encryption happens in the application, so insecure channels and untrusted Elasticsearch servers can be used.
- A separate encryption key can be used for each index.
- NGram-based indexes should consume more space.
- Each tokenized field adds a round trip to Elasticsearch, so indexing should be slower.
- OPE imposes additional restrictions and may lose some precision.
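The range queries above work because OPE is monotone: if x < y then Enc(x) < Enc(y), so Elasticsearch can compare ciphertexts directly without decrypting them. The following is not pyope's actual construction (which follows Boldyreva et al.), just a toy stdlib-only mapping (the `toy_ope_table` helper is made up) that illustrates why `gt`/`lt` comparisons still behave correctly on ciphertexts:

```python
import hashlib
import hmac

def toy_ope_table(key: bytes, domain_size: int) -> list:
    """Build a strictly increasing ciphertext table over [0, domain_size).

    Each gap between consecutive ciphertexts is a keyed pseudorandom
    positive step, so plaintext order is preserved but values are hidden.
    """
    ciphertexts = []
    acc = 0
    for i in range(domain_size):
        digest = hmac.new(key, i.to_bytes(4, "big"), hashlib.sha256).digest()
        acc += int.from_bytes(digest[:2], "big") + 1  # step >= 1 keeps it strictly increasing
        ciphertexts.append(acc)
    return ciphertexts

enc = toy_ope_table(b"key goes here!!!", 2000)
# Order is preserved, so a range query like {"gt": enc[1233]} matches
# exactly the ciphertexts of plaintexts greater than 1233.
assert enc[1233] < enc[1234] < enc[1235]
```

This also makes the leakage explicit: the server learns the relative order of all encrypted values, which is precisely the trade-off that makes range queries possible.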
To run the benchmarks:

```
python -m benchmarks.es
```
On my machine:
```
Not encrypted single 0:00:03.808277
Encrypted single     0:00:23.540614
Not encrypted bulk   0:00:00.356239
Encrypted bulk       0:00:18.666775
```
I.e. roughly a 6x slowdown when inserting single documents, and a much larger one for bulk indexing.
- Remove redundant ES assumptions (hosts, ports, credentials, index/analyzer names, etc.)
- Add a setup.py
- Allow providing an `iv` instead of randomizing it