Privacy typically comes with a cost in terms of running time or accuracy. How significant is this cost? Train two naive Bayes classifiers on the spam detection dataset to get a sense of the differential cost of differential privacy. (In this case, there was no decrease in accuracy and no measurable increase in running time.)

In this notebook, we will build a spam detector using two different naive Bayes models to get a sense of the cost, in running time and accuracy, of using a differentially private model.

Import the libraries we'll be using.

In [None]:
from sklearn import tree
import graphviz 
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

Next we'll write a function that reads the vocabulary file into a dictionary, `word_dict`, mapping each word to a feature index. The running count of features, `lexiconsize`, is stored in the dictionary under the key `@size`.

In [None]:
# read in the vocabulary file
def readvocab(vocab_path="vocab.txt"):
    # keep track of the number of words
    lexiconsize = 0
    # initialize an empty dictionary
    word_dict = {}
    # create a feature for unknown words
    word_dict["@unk"] = lexiconsize
    lexiconsize += 1
    # read in the vocabulary file
    with open(vocab_path, "r") as f:
        data = f.readlines()
    # process the file a line at a time
    for line in data:
        # the count is the first 4 characters
        count = int(line[0:4])
        # the word is the rest of the line, minus the trailing newline
        token = line[5:-1]
        # create a feature if the word has appeared at least twice
        if count > 1:
            word_dict[token] = lexiconsize
            lexiconsize += 1
    # squirrel away the total size for later reference
    word_dict["@size"] = lexiconsize
    return word_dict
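
As a quick check of the mapping, the sketch below applies the same rule (keep words whose count exceeds 1) to a tiny hand-made vocabulary. The sample lines and the helper name `build_word_dict` are made up for illustration, and the line format (a right-aligned count, a space, then the word) is an assumption inferred from the slicing in `readvocab`, not something confirmed against `vocab.txt`.

```python
def build_word_dict(lines, min_count=2):
    """Map each sufficiently frequent word to a feature index."""
    word_dict = {"@unk": 0}               # index 0 is reserved for unknown words
    for line in lines:
        # split into the count and the rest of the line
        count_text, token = line.split(maxsplit=1)
        token = token.rstrip("\n")
        if int(count_text) >= min_count:  # same rule as `count > 1`
            word_dict[token] = len(word_dict)
    word_dict["@size"] = len(word_dict)   # number of features, including @unk
    return word_dict

# hypothetical vocabulary lines: "zyzzyva" appears once, so it gets no feature
sample = ["  12 free\n", "   1 zyzzyva\n", "   3 money\n"]
word_dict = build_word_dict(sample)
print(word_dict)
```

Words that do not get their own feature are later counted under `@unk`, so the feature vector stays the same length no matter what an email contains.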

We will download the vocabulary data, `vocab.txt`, from GitHub.

In [None]:
!wget https://github.com/mlittmancs/great_courses_ml/raw/master/vocab.txt

--2020-07-05 15:43:17--  https://github.com/mlittmancs/great_courses_ml/raw/master/vocab.txt
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/mlittmancs/great_courses_ml/master/vocab.txt [following]
--2020-07-05 15:43:18--  https://raw.githubusercontent.com/mlittmancs/great_courses_ml/master/vocab.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 83233 (81K) [text/plain]
Saving to: ‘vocab.txt.1’


2020-07-05 15:43:18 (2.70 MB/s) - ‘vocab.txt.1’ saved [83233/83233]



Next, we write a `tokenize` function that turns an email string into a list whose length is the number of features in the vocabulary. Each item in the list counts the number of times the corresponding word occurs in the email; words not in the vocabulary are counted under `@unk`.

In [None]:
# Turn the string email_string into a vector of word counts.
def tokenize(email_string, word_dict):
  # initially the vector is all zeros
  vec = [0 for i in range(word_dict["@size"])]
  # for each word
  for t in email_string.split(" "):
    # if the word has a feature, add one to the corresponding feature
    if t in word_dict: vec[word_dict[t]] += 1
    # otherwise, count it as an unk
    else: vec[word_dict["@unk"]] += 1
  return vec
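
To make the output concrete, here is a minimal sketch that runs the same counting logic on a toy dictionary. The dictionary `toy_dict` and the sample email are invented for illustration; the function is a compact restatement of `tokenize` above rather than a replacement for it.

```python
def tokenize(email_string, word_dict):
    # one slot per feature, all starting at zero
    vec = [0] * word_dict["@size"]
    for t in email_string.split(" "):
        # known words use their own index, everything else goes to @unk
        vec[word_dict.get(t, word_dict["@unk"])] += 1
    return vec

# hypothetical 3-feature vocabulary: @unk, "free", "money"
toy_dict = {"@unk": 0, "free": 1, "money": 2, "@size": 3}
vec = tokenize("free free money now", toy_dict)
print(vec)  # [1, 2, 1]
```

The word "now" has no feature of its own, so it adds one to the `@unk` slot at index 0.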

From here, we write a `getdat` function to convert the file we downloaded into two lists:

- `dat`: a list of word-count vectors, one per email
- `labs`: a list of integer labels marking each email as spam or not spam

In [None]:
# read in labeled examples and turn the strings into vectors
def getdat(filename, word_dict):
    with open(filename, "r") as f:
        data = f.readlines()
    dat = []
    labs = []
    for line in data:
        # the label is the first character of the line
        labs.append(int(line[0]))
        # the email text starts at the third character
        dat.append(tokenize(line[2:], word_dict))
    return dat, labs
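
The line format that `getdat` expects can be illustrated with a couple of in-memory lines. This is a sketch under assumptions: the helper `parse_line`, the toy dictionary, and the two sample lines are hypothetical, and the layout (a one-digit label, one separator character, then the email text) is inferred from the slices `line[0]` and `line[2:]`.

```python
def parse_line(line, word_dict):
    """Split one 'label text' line into (count_vector, label)."""
    label = int(line[0])   # first character is the label
    text = line[2:]        # the email text follows the separator
    vec = [0] * word_dict["@size"]
    for t in text.split(" "):
        vec[word_dict.get(t, word_dict["@unk"])] += 1
    return vec, label

# hypothetical vocabulary and labeled lines
toy_dict = {"@unk": 0, "free": 1, "money": 2, "@size": 3}
lines = ["1 free money free", "0 see you at lunch"]

dat, labs = [], []
for line in lines:
    vec, label = parse_line(line, toy_dict)
    dat.append(vec)
    labs.append(label)
print(dat, labs)
```

Each row of `dat` lines up with the label at the same position in `labs`, which is the layout that the `fit` and `score` calls later in the notebook rely on.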

Now we'll download the train and test data from GitHub.

In [None]:
!wget https://github.com/mlittmancs/great_courses_ml/raw/master/spam-test.csv
!wget https://github.com/mlittmancs/great_courses_ml/raw/master/spam-train.csv

--2020-07-05 15:43:27--  https://github.com/mlittmancs/great_courses_ml/raw/master/spam-test.csv
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/mlittmancs/great_courses_ml/master/spam-test.csv [following]
--2020-07-05 15:43:27--  https://raw.githubusercontent.com/mlittmancs/great_courses_ml/master/spam-test.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 166047 (162K) [text/plain]
Saving to: ‘spam-test.csv.1’


2020-07-05 15:43:28 (3.77 MB/s) - ‘spam-test.csv.1’ saved [166047/166047]

--2020-07-05 15:43:29--  https://github.com/mlittmancs/great_courses_ml/raw/master/spam-train.csv
Resolving gith

With these train and test datasets, we'll create the data and labels used to train and test our naive Bayes models.

In [None]:
word_dict = readvocab()
traindat, trainlabs = getdat("spam-train.csv", word_dict)
testdat, testlabs = getdat("spam-test.csv", word_dict)

Now we'll use IBM's `diffprivlib` to train differentially private models.

In [None]:
!pip install diffprivlib

Collecting diffprivlib
  Downloading https://files.pythonhosted.org/packages/fe/b8/852409057d6acc060f06cac8d0a45b73dfa54ee4fbd1577c9a7d755e9fb6/diffprivlib-0.3.0.tar.gz (70kB)
Building wheels for collected packages: diffprivlib
  Building wheel for diffprivlib (setup.py) ... done
  Created wheel for diffprivlib: filename=diffprivlib-0.3.0-cp36-none-any.whl size=138998 sha256=d36d89c673ce76aaffb5fe65ccc3481aea9240df0d8bb3e352e3d3a4e3caf216
  Stored in directory: /root/.cache/pip/wheels/64/68/62/617183f73d3fecea

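scikit-learn's naive Bayes baseline. The cell creates a `GaussianNB`, but `gnb` is reassigned to a `MultinomialNB` before fitting, so the score below is for the multinomial model.

In [None]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()

# NOTE: this reassignment replaces the GaussianNB created above,
# so the model that is fitted and scored is MultinomialNB.
gnb = MultinomialNB()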
gnb.fit(traindat, trainlabs)
score = gnb.score(testdat, testlabs)

print(score)

0.985


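`diffprivlib`'s `GaussianNB`. As in the previous cell, `gnb` is reassigned to scikit-learn's `MultinomialNB` before fitting, so the score below comes from that non-private model rather than from the differentially private `GaussianNB`; removing the reassignment is needed to evaluate the private model.

In [None]:
from diffprivlib.models import GaussianNB

gnb = GaussianNB()

# NOTE: this reassignment replaces the differentially private GaussianNB
# with scikit-learn's non-private MultinomialNB, which is what gets scored.
gnb = MultinomialNB()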
gnb.fit(traindat, trainlabs)
score = gnb.score(testdat, testlabs)

print(score)

0.985
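
The cells above report accuracy but do not measure running time. One way to record both is sketched below. The helper `evaluate` and the stand-in `MajorityClassifier` are hypothetical names introduced here; the only assumption about a model is that it offers `fit(X, y)` and `score(X, y)`, as the classifiers used in this notebook do.

```python
import time

def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit `model`, then return (accuracy, seconds spent fitting and scoring)."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    elapsed = time.perf_counter() - start
    return accuracy, elapsed

class MajorityClassifier:
    """Stand-in model (hypothetical): always predicts the most common training label."""
    def fit(self, X, y):
        self.label_ = max(set(y), key=y.count)
        return self

    def score(self, X, y):
        # fraction of test labels equal to the majority training label
        return sum(1 for label in y if label == self.label_) / len(y)

# toy data, only to show the call pattern
X_train, y_train = [[1, 0], [0, 1], [1, 1]], [0, 0, 1]
X_test, y_test = [[1, 0], [0, 0]], [0, 1]

accuracy, elapsed = evaluate(MajorityClassifier(), X_train, y_train, X_test, y_test)
print(f"accuracy={accuracy:.3f}, seconds={elapsed:.6f}")
```

In the notebook, the same call could be made once with a scikit-learn model and once with `diffprivlib.models.GaussianNB()`, passing `traindat`, `trainlabs`, `testdat`, and `testlabs`. A single timing is noisy, so repeating each run a few times gives a fairer comparison.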
