### Differential Privacy with Spark using Tumult Analytics

For this notebook, you'll need to follow the [Tumult installation instructions](https://docs.tmlt.dev/analytics/latest/installation.html) in a separate virtual environment, as the library's requirements currently clash with those of some of the other libraries in these notebooks.

Once you have it installed, you're ready to try out your differential privacy knowledge with Spark. Let's go!

In [4]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from pyspark.sql import SparkSession
from tmlt.analytics.keyset import KeySet
from tmlt.analytics.privacy_budget import PureDPBudget
from tmlt.analytics.protected_change import AddOneRow
from tmlt.analytics.query_builder import QueryBuilder, ColumnType, BinningSpec
from tmlt.analytics.session import Session


spark = SparkSession.builder.getOrCreate()
members_df = spark.read.csv("data/members.csv", header=True, inferSchema=True)


In [5]:
session = Session.from_dataframe(
    privacy_budget=PureDPBudget(epsilon=1.1),
    source_id="members",
    dataframe=members_df,
    protected_change=AddOneRow(),
)

In [6]:
members_df.columns

['id',
 'name',
 'age',
 'gender',
 'education_level',
 'zip_code',
 'books_borrowed',
 'favorite_genres',
 'date_joined']

I'm curious if there is a correlation between education_level and books_borrowed. Let's take a look!

I first need to build a KeySet with the values I'd like to use... Normally I would use Spark to do this, but I need to get the list of values without looking at the data itself, as this would count towards my budget. Thankfully, we have well-documented data, so I was able to get the following list! :)

In [7]:
edu_levels = KeySet.from_dict({
    "education_level": [
        "up-to-high-school",
        "high-school-diploma",
        "bachelors-associate",
        "masters-degree",
        "doctorate-professional",
    ]
})
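To picture why the key list is fixed up front, here is a small plain-Python sketch (no Spark, no Tumult, no noise). It assumes the grouped result contains exactly the public keys, whether or not any row matches them. The helper `group_counts_with_public_keys` and the sample rows are hypothetical and only illustrate the idea; they are not how the library is implemented.

```python
from collections import Counter


def group_counts_with_public_keys(rows, column, public_keys):
    """Count rows per key, using only keys that were fixed in advance."""
    observed = Counter(row[column] for row in rows)
    # Every public key gets an output entry, even with zero matching rows;
    # values that are not in the public list are dropped.
    return {key: observed.get(key, 0) for key in public_keys}


# Hypothetical sample rows, not taken from members.csv.
rows = [
    {"education_level": "masters-degree"},
    {"education_level": "masters-degree"},
    {"education_level": "high-school-diploma"},
    {"education_level": "unlisted-value"},
]
public_keys = ["high-school-diploma", "masters-degree", "doctorate-professional"]

counts = group_counts_with_public_keys(rows, "education_level", public_keys)
print(counts)
```

Because the set of output keys never depends on the data, the presence or absence of a group cannot reveal anything about an individual row.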

Now I can use the QueryBuilder to group by education and then give an average. Here I am clamping the number of books borrowed to the range 0 to 100.
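To get a feel for what the `low` and `high` bounds do, here is a rough plain-Python sketch of one common way to build a differentially private average: clamp each value, add Laplace noise to the sum and to the count, then divide. This is an assumption for illustration only, not a description of Tumult's internal mechanism; the helper names and the sample values are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def clamp(value, low, high):
    """Force a value into the interval [low, high]."""
    return max(low, min(high, value))


def noisy_average(values, low, high, epsilon, rng):
    """Illustrative DP average: noisy clamped sum divided by noisy count."""
    clamped = [clamp(v, low, high) for v in values]
    half = epsilon / 2  # split the budget between the sum and the count
    # Adding or removing one row changes the clamped sum by at most this much.
    sum_sensitivity = max(abs(low), abs(high))
    noisy_sum = sum(clamped) + laplace_noise(sum_sensitivity / half, rng)
    # Adding or removing one row changes the count by exactly 1.
    noisy_count = len(clamped) + laplace_noise(1 / half, rng)
    return noisy_sum / max(noisy_count, 1.0)


rng = random.Random(0)
values = [3, 12, 250, 19, 7] * 200  # hypothetical borrow counts
print(noisy_average(values, low=0, high=100, epsilon=0.6, rng=rng))
```

In this sketch the noise on the sum grows with `high`, which is why an upper bound that is far larger than the typical value costs accuracy.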

In [8]:
edu_average_books_query = (
    QueryBuilder("members")
    .groupby(edu_levels)
    .average("books_borrowed", low=0, high=100)
)
edu_average_books = session.evaluate(
    edu_average_books_query,
    privacy_budget=PureDPBudget(0.6),
)
edu_average_books.sort("books_borrowed_average").show(truncate=False)


                                                                                

+----------------------+----------------------+
|education_level       |books_borrowed_average|
+----------------------+----------------------+
|masters-degree        |19.0265726681128      |
|doctorate-professional|19.13195435092725     |
|bachelors-associate   |19.177823348469314    |
|up-to-high-school     |19.37279031819418     |
|high-school-diploma   |19.603978997061514    |
+----------------------+----------------------+



There doesn't seem to be any correlation to find here! I wonder if age might be a better indicator, maybe even connected with an education level?

To take a look, I first want to create age groups by binning age into ranges.

In [9]:
age_binspec = BinningSpec([10*i for i in range(0, 11)])

age_bin_keys = KeySet.from_dict({
    "age_binned": age_binspec.bins()
})
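As a mental model for the bin labels that show up in the query output, the sketch below maps a value to a label in plain Python. It only mimics the labels printed in this notebook (first bin closed on both sides, later bins open on the left); `bin_label` is a hypothetical helper, and returning `None` for out-of-range values is an assumption of this sketch.

```python
from bisect import bisect_left


def bin_label(value, edges):
    """Return the label of the bin containing `value`, or None if out of range."""
    if value < edges[0] or value > edges[-1]:
        return None
    # The first bin includes its left edge; every later bin excludes it.
    i = max(bisect_left(edges, value), 1)
    left, right = edges[i - 1], edges[i]
    opener = "[" if i == 1 else "("
    return f"{opener}{left}, {right}]"


edges = [10 * i for i in range(0, 11)]
for age in (8, 10, 11, 100):
    print(age, bin_label(age, edges))
```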

Now I can filter and group by age... Here I am singling out those with a master's or doctorate, and I am using new bounds for books borrowed, as I think 100 was too high!

In [10]:
binned_age_with_filter_query = (
    QueryBuilder("members")
    .filter("education_level='masters-degree' or education_level='doctorate-professional'")
    .bin_column("age", age_binspec)
    .groupby(age_bin_keys)
    .average("books_borrowed", low=0, high=22)
)

session.evaluate(binned_age_with_filter_query, privacy_budget=PureDPBudget(0.4)).show(truncate=False)

                                                                                

+----------+----------------------+
|age_binned|books_borrowed_average|
+----------+----------------------+
|(10, 20]  |-6.0                  |
|(20, 30]  |10.846464646464646    |
|(30, 40]  |11.547257876312718    |
|(40, 50]  |11.070460704607045    |
|(50, 60]  |11.566094100074682    |
|(60, 70]  |11.075132275132274    |
|(70, 80]  |11.117088607594937    |
|(80, 90]  |10.222222222222221    |
|(90, 100] |11.0                  |
|[0, 10]   |10.0                  |
+----------+----------------------+



Oye! I can see that there is a lot of noise added to some of these rows. What did I do wrong? In this case, I filtered on education level before grouping by age and did not take into account that some of the age groups would be underrepresented after the filter. The likelihood that an 8-year-old has a master's degree is quite small...
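A back-of-the-envelope calculation shows why small groups suffer most. The sketch assumes Laplace noise with scale `high / epsilon` is added to the clamped sum and then divided by the group size, ignoring any noise on the count. Both that mechanism and the group sizes are assumptions for illustration, not values read from the dataset or from Tumult.

```python
def approx_average_noise(group_size, high, epsilon):
    """Rough size of the noise on a group's average.

    Assumes Laplace noise with scale high / epsilon on the clamped sum,
    divided by the group size (noise on the count is ignored).
    """
    return (high / epsilon) / group_size


for n in (2, 50, 5000):  # hypothetical group sizes
    print(n, round(approx_average_noise(n, high=22, epsilon=0.4), 3))
```

The same amount of noise is spread over far fewer rows in a tiny group, so its average can land well outside the clamping bounds.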

In the future, I might run a query like the following first! Getting an idea for books borrowed by age before filtering... :)

In [11]:
binned_age_query = (
    QueryBuilder("members")
    .bin_column("age", age_binspec)
    .groupby(age_bin_keys)
    .average("books_borrowed", low=0, high=22)
)

session.evaluate(binned_age_query, privacy_budget=PureDPBudget(0.1)).show(truncate=False)

                                                                                

+----------+----------------------+
|age_binned|books_borrowed_average|
+----------+----------------------+
|(10, 20]  |11.576746088557112    |
|(20, 30]  |11.46492337972726     |
|(30, 40]  |11.550365211482928    |
|(40, 50]  |11.257424458565685    |
|(50, 60]  |11.23477687403825     |
|(60, 70]  |11.349001351554287    |
|(70, 80]  |11.620332883490779    |
|(80, 90]  |10.83838383838384     |
|(90, 100] |243.0                 |
|[0, 10]   |11.138160325083119    |
+----------+----------------------+



Or even just look at a count...

Oh no! I ran out of budget!
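The arithmetic behind running out: the session started with an epsilon of 1.1, and the three queries above spent 0.6, 0.4 and 0.1. The toy ledger below tracks that under simple sequential composition. `BudgetLedger` is a hypothetical class for illustration, not part of Tumult Analytics; it uses `Fraction` so the decimal amounts add up exactly.

```python
from fractions import Fraction


class BudgetLedger:
    """Toy tracker for a pure-DP budget under sequential composition."""

    def __init__(self, total):
        self.remaining = Fraction(total)

    def spend(self, epsilon):
        """Deduct epsilon, or refuse if not enough budget is left."""
        epsilon = Fraction(epsilon)
        if epsilon > self.remaining:
            raise RuntimeError(f"needs {epsilon}, only {self.remaining} left")
        self.remaining -= epsilon


ledger = BudgetLedger("1.1")
for eps in ("0.6", "0.4", "0.1"):  # the three queries run above
    ledger.spend(eps)
print(float(ledger.remaining))
```

Any further `spend` call on this ledger raises, which mirrors what happens when a count query is attempted after the budget is gone.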

Good news: [Tumult Labs](https://www.tmlt.dev/) has a bunch of notebooks to try out with this dataset, and there is an option to set your budget to infinity as you play around and get to know the library. That said, when you are using Tumult or any differential privacy library in production, you'll first need to make real decisions about your queries and budget!

Take a look at their tutorials and happy privacying!

### Challenges

- Fix the query so that you get a better result for the books borrowed average.
- Use an unlimited privacy budget (`privacy_budget=PureDPBudget(epsilon=float('inf'))`), and investigate the correlations in the dataset further. If you find an interesting one, switch back to a finite budget and try to show it via matplotlib or seaborn.
- Go through the [Tumult Analytics Tutorial](https://docs.tmlt.dev/analytics/latest/tutorials/) to try out more features.