This is a proof of concept for implementing searchable encryption over Elasticsearch. It uses industry-standard AES encryption for strings, and pyope (an implementation of the order-preserving encryption scheme by Boldyreva et al.) for numbers, dates and times.
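Equality search over encrypted tokens works because the encryption is deterministic: equal plaintexts yield equal ciphertexts, which the server can compare without decrypting anything. A stdlib-only sketch of that idea, using HMAC-SHA256 as a stand-in for this project's AES scheme (the `encrypt_token` helper and the key below are made up for illustration):

```python
import hashlib
import hmac

def encrypt_token(key: bytes, token: str) -> str:
    # Deterministic: the same (key, token) pair always produces the same
    # ciphertext, so the server can match an encrypted query token against
    # indexed tokens without ever seeing the plaintext.
    return hmac.new(key, token.lower().encode(), hashlib.sha256).hexdigest()

key = b"key goes here!!!"
index = {encrypt_token(key, t) for t in "the quick brown fox".split()}
assert encrypt_token(key, "fox") in index        # equality search works
assert encrypt_token(key, "foxie") not in index  # unknown token: no match
```

The trade-off is the usual one for deterministic schemes: identical tokens are visible as identical ciphertexts, which leaks term frequencies to the server.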
There are two classes: a generic `SearchableCipher` and an Elasticsearch-specific `SearchableCipherElasticsearch`. Aside from an indexing helper, the latter provides proper Elasticsearch tokenization.
Generic:
```python
import datetime

sc = SearchableCipher(b'key goes here!!!')
# Encrypts a list of tokens
sc.encrypt_tokens(["foo", "bar"])
# Splits text into tokens and encrypts them
sc.encrypt_tokenize_text("Foo BAR!")
# Encrypts one token as is
sc.encrypt_token("The quick brown fox jumps over the lazy dog!")
# Encrypts an integer
sc.encrypt_int(1234)
# Encrypts a datetime.date
sc.encrypt_date(datetime.date(2010, 11, 12))
# Encrypts a datetime.time
sc.encrypt_time(datetime.time(13, 14, 15))
```
Elasticsearch:
```python
sce = SearchableCipherElasticsearch(b'key goes here!!!')
doc = {
    "text": "The quick brown fox jumps over the lazy dog!",
    "value": 1234,
}
sce.index_doc(doc, [], ["text"], [], [], ["value"])
sce.es.indices.refresh(sce.index_name)

# Matches nothing
sce.search({"match": {"text": sce.encrypt_token("foxie")}})
# Matches "Quick"
sce.search({"match": {"text": sce.encrypt_token("qui")}})
# Matches nothing
sce.search({"range": {"value": {"gt": sce.encrypt_int(1235)}}})
# Matches 1234
sce.search({"range": {"value": {"gt": sce.encrypt_int(1233)}}})
```
To run the generic tests:

```
python tests/test_basic.py
```

To run the Elasticsearch tests you need to install and run Elasticsearch. Then:

```
python tests/test_es.py
```
- Elasticsearch is optional; it is used only for its excellent tokenization. Any other backend can be added easily.
- All encryption happens in the application, so insecure channels and untrusted Elasticsearch servers can be used.
- A separate encryption key can be used for each index.
- NGram-based indexes should consume more space.
- Each tokenized field adds a round trip to Elasticsearch, so indexing should be slower.
- OPE imposes additional restrictions and may lose some precision.
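The range queries above work because OPE is monotone: if x < y then Enc(x) < Enc(y), so Elasticsearch can compare ciphertexts directly without decrypting them. The following is not pyope's actual construction (which follows Boldyreva et al.), just a toy stdlib-only mapping (the `toy_ope_table` helper is made up) that illustrates why `gt`/`lt` comparisons still behave correctly on ciphertexts:

```python
import hashlib
import hmac

def toy_ope_table(key: bytes, domain_size: int) -> list:
    """Build a strictly increasing ciphertext table over [0, domain_size).

    Each gap between consecutive ciphertexts is a keyed pseudorandom
    positive step, so plaintext order is preserved but values are hidden.
    """
    ciphertexts = []
    acc = 0
    for i in range(domain_size):
        digest = hmac.new(key, i.to_bytes(4, "big"), hashlib.sha256).digest()
        acc += int.from_bytes(digest[:2], "big") + 1  # step >= 1 keeps it strictly increasing
        ciphertexts.append(acc)
    return ciphertexts

enc = toy_ope_table(b"key goes here!!!", 2000)
# Order is preserved, so a range query like {"gt": enc[1233]} matches
# exactly the ciphertexts of plaintexts greater than 1233.
assert enc[1233] < enc[1234] < enc[1235]
```

This also makes the leakage explicit: the server learns the relative order of all encrypted values, which is precisely the trade-off that makes range queries possible.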
To run the benchmarks:

```
python -m benchmarks.es
```
On my machine:
```
Not encrypted single 0:00:03.808277
Encrypted single     0:00:23.540614
Not encrypted bulk   0:00:00.356239
Encrypted bulk       0:00:18.666775
```
I.e. roughly a 6x slowdown when inserting single documents, and a much larger one for bulk indexing.
- Remove redundant ES assumptions (hosts, ports, credentials, index/analyzer names, etc.)
- Add a setup.py
- Allow providing an `iv` instead of randomizing it