Update to use OpenDP 0.8.0 #586

Merged · 2 commits · Dec 1, 2023
5 changes: 5 additions & 0 deletions sql/HISTORY.md
@@ -1,3 +1,8 @@
# SmartNoise SQL v1.0.3 Release Notes

* Upgrade to OpenDP v0.8.0
* Better type hints (thanks, @mhauru!)

# SmartNoise SQL v1.0.2 Release Notes

* Fix privacy bug in approx_bounds (thanks, @TedTed)
2 changes: 1 addition & 1 deletion sql/VERSION
@@ -1 +1 @@
1.0.2
1.0.3
2 changes: 1 addition & 1 deletion sql/docs/source/advanced.rst
@@ -112,7 +112,7 @@ You can override the default mechanisms used for differentially private summary
privacy = Privacy(epsilon=1.0)
print(f"We default to using {privacy.mechanisms.map[Stat.count]} for counts.")
print("Switching to use gaussian")
privacy.mechanisms.map[Stat.count] = Mechanism.discrete_gaussian
privacy.mechanisms.map[Stat.count] = Mechanism.gaussian

The list of statistics that can be mapped is in the ``Stat`` enumeration, and the available mechanisms are listed in the ``Mechanism`` enumeration. The ``AVG`` summary statistic is computed from a sum and a count, each of which can be overridden.
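The enum-driven override described above can be sketched without smartnoise-sql installed. The `Stat` and `Mechanism` members below are illustrative stand-ins for snsql's enums (the real enums carry more members), not the library's actual definitions:

```python
from enum import Enum

# Illustrative stand-ins for snsql's Stat and Mechanism enums.
class Stat(Enum):
    count = 1
    sum_float = 2

class Mechanism(Enum):
    laplace = 2
    gaussian = 5

# A Privacy-like object keeps a per-statistic mechanism map that
# callers can override before running queries.
mechanism_map = {Stat.count: Mechanism.laplace, Stat.sum_float: Mechanism.laplace}
mechanism_map[Stat.count] = Mechanism.gaussian  # switch counts to gaussian
```

The same lookup-then-override pattern is what `privacy.mechanisms.map[Stat.count] = Mechanism.gaussian` does in the documented API.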

4 changes: 2 additions & 2 deletions sql/pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "smartnoise-sql"
version = "1.0.2"
version = "1.0.3"
description = "Differentially Private SQL Queries"
authors = ["SmartNoise Team <smartnoise@opendp.org>"]
license = "MIT"
@@ -11,7 +11,7 @@ readme = "README.md"

[tool.poetry.dependencies]
python = ">=3.8,<3.12"
opendp = "^0.7.0"
opendp = "^0.8.0"
antlr4-python3-runtime = "4.9.3"
PyYAML = "^6.0.1"
graphviz = "^0.17"
4 changes: 2 additions & 2 deletions sql/setup.py
@@ -20,13 +20,13 @@
['PyYAML>=6.0.1,<7.0.0',
'antlr4-python3-runtime==4.9.3',
'graphviz>=0.17,<0.18',
'opendp>=0.7.0,<0.8.0',
'opendp>=0.8.0,<0.9.0',
'pandas>=2.0.1,<3.0.0',
'sqlalchemy>=2.0.0,<3.0.0']

setup_kwargs = {
'name': 'smartnoise-sql',
'version': '1.0.2',
'version': '1.0.3',
'description': 'Differentially Private SQL Queries',
'long_description': '[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Python](https://img.shields.io/badge/python-3.7%20%7C%203.8-blue)](https://www.python.org/)\n\n<a href="https://smartnoise.org"><img src="https://github.com/opendp/smartnoise-sdk/raw/main/images/SmartNoise/SVG/Logo%20Mark_grey.svg" align="left" height="65" vspace="8" hspace="18"></a>\n\n## SmartNoise SQL\n\nDifferentially private SQL queries. Tested with:\n* PostgreSQL\n* SQL Server\n* Spark\n* Pandas (SQLite)\n* PrestoDB\n* BigQuery\n\nSmartNoise is intended for scenarios where the analyst is trusted by the data owner. SmartNoise uses the [OpenDP](https://github.com/opendp/opendp) library of differential privacy algorithms.\n\n## Installation\n\n```\npip install smartnoise-sql\n```\n\n## Querying a Pandas DataFrame\n\nUse the `from_df` method to create a private reader that can issue queries against a pandas dataframe. Example below uses datasets\n`PUMS.csv` and `PUMS.yaml` can be found in the [datasets](../datasets/) folder in the root directory.\n\n\n```python\nimport snsql\nfrom snsql import Privacy\nimport pandas as pd\nprivacy = Privacy(epsilon=1.0, delta=0.01)\n\ncsv_path = \'PUMS.csv\'\nmeta_path = \'PUMS.yaml\'\n\npums = pd.read_csv(csv_path)\nreader = snsql.from_df(pums, privacy=privacy, metadata=meta_path)\n\nresult = reader.execute(\'SELECT sex, AVG(age) AS age FROM PUMS.PUMS GROUP BY sex\')\n```\n\n## Querying a SQL Database\n\nUse `from_connection` to wrap an existing database connection. 
\n\nThe connection must be to a database that supports the SQL standard, \nin this example the database must be configured with the name `PUMS`, have a schema called `PUMS` and a table called `PUMS`, and the data from `PUMS.csv` needs to be in that table.\n\n```python\nimport snsql\nfrom snsql import Privacy\nimport psycopg2\n\nprivacy = Privacy(epsilon=1.0, delta=0.01)\nmeta_path = \'PUMS.yaml\'\n\npumsdb = psycopg2.connect(user=\'postgres\', host=\'localhost\', database=\'PUMS\')\nreader = snsql.from_connection(pumsdb, privacy=privacy, metadata=meta_path)\n\nresult = reader.execute(\'SELECT sex, AVG(age) AS age FROM PUMS.PUMS GROUP BY sex\')\n```\n\n## Querying a Spark DataFrame\n\nUse `from_connection` to wrap a spark session.\n\n```python\nimport pyspark\nfrom pyspark.sql import SparkSession\nspark = SparkSession.builder.getOrCreate()\nfrom snsql import *\n\npums = spark.read.load(...) # load a Spark DataFrame\npums.createOrReplaceTempView("PUMS_large")\n\nmetadata = \'PUMS_large.yaml\'\n\nprivate_reader = from_connection(\n spark, \n metadata=metadata, \n privacy=Privacy(epsilon=3.0, delta=1/1_000_000)\n)\nprivate_reader.reader.compare.search_path = ["PUMS"]\n\n\nres = private_reader.execute(\'SELECT COUNT(*) FROM PUMS_large\')\nres.show()\n```\n\n## Privacy Cost\n\nThe privacy parameters epsilon and delta are passed in to the private connection at instantiation time, and apply to each computed column during the life of the session. 
Privacy cost accrues indefinitely as new queries are executed, with the total accumulated privacy cost being available via the `spent` property of the connection\'s `odometer`:\n\n```python\nprivacy = Privacy(epsilon=0.1, delta=10e-7)\n\nreader = from_connection(conn, metadata=metadata, privacy=privacy)\nprint(reader.odometer.spent) # (0.0, 0.0)\n\nresult = reader.execute(\'SELECT COUNT(*) FROM PUMS.PUMS\')\nprint(reader.odometer.spent) # approximately (0.1, 10e-7)\n```\n\nThe privacy cost increases with the number of columns:\n\n```python\nreader = from_connection(conn, metadata=metadata, privacy=privacy)\nprint(reader.odometer.spent) # (0.0, 0.0)\n\nresult = reader.execute(\'SELECT AVG(age), AVG(income) FROM PUMS.PUMS\')\nprint(reader.odometer.spent) # approximately (0.4, 10e-6)\n```\n\nThe odometer is advanced immediately before the differentially private query result is returned to the caller. If the caller wishes to estimate the privacy cost of a query without running it, `get_privacy_cost` can be used:\n\n```python\nreader = from_connection(conn, metadata=metadata, privacy=privacy)\nprint(reader.odometer.spent) # (0.0, 0.0)\n\ncost = reader.get_privacy_cost(\'SELECT AVG(age), AVG(income) FROM PUMS.PUMS\')\nprint(cost) # approximately (0.4, 10e-6)\n\nprint(reader.odometer.spent) # (0.0, 0.0)\n```\n\nNote that the total privacy cost of a session accrues at a slower rate than the sum of the individual query costs obtained by `get_privacy_cost`. 
The odometer accrues all invocations of mechanisms for the life of a session, and uses them to compute total spend.\n\n```python\nreader = from_connection(conn, metadata=metadata, privacy=privacy)\nquery = \'SELECT COUNT(*) FROM PUMS.PUMS\'\nepsilon_single, _ = reader.get_privacy_cost(query)\nprint(epsilon_single) # 0.1\n\n# no queries executed yet\nprint(reader.odometer.spent) # (0.0, 0.0)\n\nfor _ in range(100):\n reader.execute(query)\n\nepsilon_many, _ = reader.odometer.spent\nprint(f\'{epsilon_many} < {epsilon_single * 100}\')\n```\n\n## Histograms\n\nSQL `group by` queries represent histograms binned by grouping key. Queries over a grouping key with unbounded or non-public dimensions expose privacy risk. For example:\n\n```sql\nSELECT last_name, COUNT(*) FROM Sales GROUP BY last_name\n```\n\nIn the above query, if someone with a distinctive last name is included in the database, that person\'s record might accidentally be revealed, even if the noisy count returns 0 or negative. To prevent this from happening, the system will automatically censor dimensions which would violate differential privacy.\n\n## Private Synopsis\n\nA private synopsis is a pre-computed set of differentially private aggregates that can be filtered and aggregated in various ways to produce new reports. Because the private synopsis is differentially private, reports generated from the synopsis do not need to have additional privacy applied, and the synopsis can be distributed without risk of additional privacy loss. Reports over the synopsis can be generated with non-private SQL, within an Excel Pivot Table, or through other common reporting tools.\n\nYou can see a sample [notebook for creating private synopsis](samples/Synopsis.ipynb) suitable for consumption in Excel or SQL.\n\n## Limitations\n\nYou can think of the data access layer as simple middleware that allows composition of `opendp` computations using the SQL language. 
The SQL language provides a limited subset of what can be expressed through the full `opendp` library. For example, the SQL language does not provide a way to set per-field privacy budget.\n\nBecause we delegate the computation of exact aggregates to the underlying database engines, execution through the SQL layer can be considerably faster, particularly with database engines optimized for precomputed aggregates. However, this design choice means that analysis graphs composed with SQL language do not access data in the engine on a per-row basis. Therefore, SQL queries do not currently support algorithms that require per-row access, such as quantile algorithms that use underlying values. This is a limitation that future releases will relax for database engines that support row-based access, such as Spark.\n\nThe SQL processing layer has limited support for bounding contributions when individuals can appear more than once in the data. This includes ability to perform reservoir sampling to bound contributions of an individual, and to scale the sensitivity parameter. These parameters are important when querying reporting tables that might be produced from subqueries and joins, but require caution to use safely.\n\nFor this release, we recommend using the SQL functionality while bounding user contribution to 1 row. The platform defaults to this option by setting `max_contrib` to 1, and should only be overridden if you know what you are doing. 
Future releases will focus on making these options easier for non-experts to use safely.\n\n\n## Communication\n\n- You are encouraged to join us on [GitHub Discussions](https://github.com/opendp/opendp/discussions/categories/smartnoise)\n- Please use [GitHub Issues](https://github.com/opendp/smartnoise-sdk/issues) for bug reports and feature requests.\n- For other requests, including security issues, please contact us at [smartnoise@opendp.org](mailto:smartnoise@opendp.org).\n\n## Releases and Contributing\n\nPlease let us know if you encounter a bug by [creating an issue](https://github.com/opendp/smartnoise-sdk/issues).\n\nWe appreciate all contributions. Please review the [contributors guide](../contributing.rst). We welcome pull requests with bug-fixes without prior discussion.\n\nIf you plan to contribute new features, utility functions or extensions, please first open an issue and discuss the feature with us.\n',
'author': 'SmartNoise Team',
5 changes: 2 additions & 3 deletions sql/snsql/sql/_mechanisms/__init__.py
@@ -1,6 +1,5 @@
from .laplace import Laplace
from .discrete_laplace import DiscreteLaplace
from .discrete_gaussian import DiscreteGaussian
from .gaussian import Gaussian
from .base import Mechanism, Unbounded

__all__ = ["Laplace", "DiscreteLaplace", "DiscreteGaussian", "Mechanism", "Unbounded",]
__all__ = ["Laplace", "Gaussian", "Mechanism", "Unbounded",]
8 changes: 6 additions & 2 deletions sql/snsql/sql/_mechanisms/approx_bounds.py
@@ -1,6 +1,7 @@
import numpy as np
from opendp.mod import enable_features
from opendp.measurements import make_base_laplace
from opendp.measurements import make_laplace
import opendp.prelude as dp

def approx_bounds(vals, epsilon):
"""Estimate the minimium and maximum values of a list of values.
@@ -56,7 +57,10 @@ def edges(idx):
enable_features('floating-point', 'contrib')
discovered_scale = 1.0 / epsilon

meas = make_base_laplace(discovered_scale)
input_domain = dp.atom_domain(T=float)
input_metric = dp.absolute_distance(T=float)

meas = make_laplace(input_domain, input_metric, discovered_scale)
hist = [meas(v) for v in hist]
n_bins = len(hist)

3 changes: 1 addition & 2 deletions sql/snsql/sql/_mechanisms/base.py
@@ -5,8 +5,7 @@ class Mechanism(Enum):
# gaussian = 1
laplace = 2
geometric = 3 # discrete laplace
# analytic_gaussian = 4
discrete_gaussian = 5
gaussian = 5
discrete_laplace = 6

class AdditiveNoiseMechanism:
70 changes: 0 additions & 70 deletions sql/snsql/sql/_mechanisms/discrete_laplace.py

This file was deleted.

@@ -1,20 +1,19 @@
import math
from opendp.transformations import make_bounded_sum, make_clamp
from opendp.mod import binary_search_param, enable_features
from opendp.measurements import make_base_discrete_gaussian, make_base_gaussian
from opendp.mod import enable_features
from opendp.measurements import make_gaussian
from opendp.accuracy import gaussian_scale_to_accuracy
from opendp.combinators import make_zCDP_to_approxDP, make_fix_delta
from opendp.typing import set_default_int_type
from .base import AdditiveNoiseMechanism, Mechanism
from .normal import _normal_dist_inv_cdf
import opendp.prelude as dp

class DiscreteGaussian(AdditiveNoiseMechanism):
class Gaussian(AdditiveNoiseMechanism):
def __init__(
self, epsilon, *ignore, delta, sensitivity=None, max_contrib=1, upper=None, lower=None, **kwargs
):
super().__init__(
epsilon,
mechanism=Mechanism.discrete_gaussian,
mechanism=Mechanism.gaussian,
delta=delta,
sensitivity=sensitivity,
max_contrib=max_contrib,
@@ -38,17 +37,21 @@ def _compute_noise_scale(self):
if rough_scale > 10_000_000:
raise ValueError(f"Noise scale is too large using epsilon={self.epsilon} and bounds ({lower}, {upper}) with {self.mechanism}. Try preprocessing to reduce sensitivity, or try different privacy parameters.")
enable_features('floating-point', 'contrib')
bounded_sum = (
make_clamp(bounds=bounds) >>
make_bounded_sum(bounds=bounds)
)

input_domain = dp.vector_domain(dp.atom_domain(T=float))
input_metric = dp.symmetric_distance()

bounded_sum = (input_domain, input_metric) >> dp.t.then_clamp(bounds=bounds) >> dp.t.then_sum()

try:
def make_dp_sum(scale):
adp = make_zCDP_to_approxDP(make_base_gaussian(scale))
return bounded_sum >> make_fix_delta(adp, delta=self.delta)
discovered_scale = binary_search_param(
lambda s: make_dp_sum(scale=s),
d_in=1,
def make_adp_sum(scale):
dp_sum = bounded_sum >> dp.m.then_gaussian(scale)
adp_sum = dp.c.make_zCDP_to_approxDP(dp_sum)
return dp.c.make_fix_delta(adp_sum, delta=self.delta)

discovered_scale = dp.binary_search_param(
lambda s: make_adp_sum(scale=s),
d_in=max_contrib,
d_out=(self.epsilon, self.delta))
except Exception as e:
raise ValueError(f"Unable to find an appropriate noise scale for {self.mechanism} with epsilon={self.epsilon} and bounds ({lower}, {upper}). Try preprocessing to reduce sensitivity, or try different privacy parameters.\n{e}")
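`dp.binary_search_param` above hunts for the smallest noise scale whose (epsilon, delta) cost fits the budget. The search itself is ordinary bisection over a monotone predicate; a minimal stand-alone sketch, where the predicate, bounds, and iteration count are illustrative rather than OpenDP's internals:

```python
def binary_search_scale(fits_budget, lo=1e-8, hi=1e8, iters=100):
    """Approximate the smallest scale for which fits_budget(scale) holds,
    assuming the predicate is monotone: more noise never costs more budget."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if fits_budget(mid):
            hi = mid   # budget satisfied; try a smaller (less noisy) scale
        else:
            lo = mid   # too little noise for the budget; grow the scale
    return hi
```

In the diff, `fits_budget` corresponds to checking whether `make_adp_sum(scale)` maps `d_in=max_contrib` to within `d_out=(epsilon, delta)`.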
@@ -65,8 +68,8 @@ def release(self, vals):
enable_features('contrib')
bit_depth = self.bit_depth
set_default_int_type(f"i{bit_depth}")
meas = make_base_discrete_gaussian(self.scale)
vals = [meas(int(round(v))) for v in vals]
meas = make_gaussian(dp.atom_domain(T=float), dp.absolute_distance(T=float), self.scale)
vals = [meas(float(v)) for v in vals]
return vals
def accuracy(self, alpha):
bit_depth = self.bit_depth
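`gaussian_scale_to_accuracy`, imported near the top of this file, converts a noise scale into an accuracy bound at significance level alpha. Under the usual two-sided reading — P(|noise| > accuracy) ≤ alpha — the closed form is a normal quantile. This stdlib sketch mirrors that semantics, which is an assumption about OpenDP's exact convention:

```python
from statistics import NormalDist

def gaussian_scale_to_accuracy(scale, alpha):
    # Accuracy a with P(|N(0, scale^2)| > a) = alpha, i.e. a = scale * z_{1 - alpha/2}.
    return scale * NormalDist().inv_cdf(1.0 - alpha / 2.0)
```

For alpha = 0.05 this recovers the familiar 1.96-sigma interval.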
22 changes: 14 additions & 8 deletions sql/snsql/sql/_mechanisms/laplace.py
@@ -1,11 +1,13 @@
import math

from opendp.transformations import make_bounded_sum, make_clamp
from opendp.transformations import make_sum, make_clamp
from .base import AdditiveNoiseMechanism, Mechanism
from opendp.mod import binary_search_param, enable_features
from opendp.measurements import make_base_laplace
from opendp.measurements import make_laplace
from opendp.accuracy import laplacian_scale_to_accuracy

import opendp.prelude as dp

class Laplace(AdditiveNoiseMechanism):
def __init__(
self, epsilon, *ignore, delta=0.0, sensitivity=None, max_contrib=1, upper=None, lower=None, **kwargs
@@ -35,13 +37,15 @@ def _compute_noise_scale(self):
search_lower = rough_scale / 10E+6

enable_features('floating-point', 'contrib')
bounded_sum = (
make_clamp(bounds=bounds) >>
make_bounded_sum(bounds=bounds)
)

input_domain = dp.vector_domain(dp.atom_domain(T=float))
input_metric = dp.symmetric_distance()

bounded_sum = (input_domain, input_metric) >> dp.t.then_clamp(bounds=bounds) >> dp.t.then_sum()

try:
discovered_scale = binary_search_param(
lambda s: bounded_sum >> make_base_laplace(scale=s),
lambda s: bounded_sum >> dp.m.then_laplace(s),
bounds=(search_lower, search_upper),
d_in=max_contrib,
d_out=(self.epsilon))
@@ -60,7 +64,9 @@ def threshold(self):
return thresh
def release(self, vals):
enable_features('floating-point', 'contrib')
meas = make_base_laplace(self.scale)
input_domain = dp.atom_domain(T=float)
input_metric = dp.absolute_distance(T=float)
meas = make_laplace(input_domain, input_metric, self.scale)
vals = [meas(float(v)) for v in vals]
return vals
def accuracy(self, alpha):
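`laplacian_scale_to_accuracy`, used by this `accuracy` method, has an especially simple closed form: for Laplace noise, P(|noise| > a) = exp(-a / scale), so the accuracy at level alpha is scale · ln(1/alpha). A stdlib sketch under that two-sided-tail reading (an assumption about OpenDP's exact convention):

```python
import math

def laplacian_scale_to_accuracy(scale, alpha):
    # Solve exp(-a / scale) = alpha for a.
    return scale * math.log(1.0 / alpha)
```

Note the contrast with the Gaussian case, where the bound involves a normal quantile rather than a logarithm.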