# RWN Experiments

**Author:** Noah Perry

**Overview:** This notebook contains empirical experiments illustrating the performance of RWN using the `pef` dataset from the `regtools` R package.

**Data description:** The `pef` data contains salary information and demographic characteristics of programmers and engineers in Silicon Valley taken from the 2000 Census.

Occupation categories: https://usa.ipums.org/usa/volii/occ2000.shtml

Link to documentation for `regtools` (version 1.7.0): https://cran.r-project.org/web/packages/regtools/regtools.pdf

**Method:** RWN is a statistical disclosure control (SDC) method created by Norm Matloff and refined in collaboration with Noah Perry.

arXiv preprint: https://arxiv.org/abs/2210.06687 (publication in progress)

**Perturbed data and RWN tuning parameters:** RWN has three tuning parameters ($\epsilon$, $k$, $q$). To illustrate how RWN works, we create several perturbed datasets, varying $k$ and holding the other two parameters fixed.
- $k$ is varied: 5, 10, 25, 50
- $\epsilon = 0$ and $q = 0.5$ for all perturbed datasets
- For each combination of tuning parameters, one perturbed dataset is created.

**Experiment:**
- Total correlation (also known as multiinformation), computed on the original data and on each perturbed dataset
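
For reference, the total correlation of discrete variables $X_1, \dots, X_n$ is $C(X_1, \dots, X_n) = \sum_{i=1}^{n} H(X_i) - H(X_1, \dots, X_n)$, where $H$ denotes entropy; it equals zero when the variables are independent. The base-R sketch below computes the empirical (plug-in) version on two toy data frames. The helper names `emp_entropy` and `total_correlation` are illustrative and are not part of `infotheo` or RWN, and natural logarithms are assumed.

```r
# Empirical (plug-in) entropy, in nats, of the joint distribution of all
# columns in df. Illustrative helper, not part of infotheo or RWN.
emp_entropy <- function(df) {
  # Build one label per row that identifies its joint cell
  cells <- do.call(paste, c(unname(as.list(df)), sep = "\r"))
  p <- as.numeric(table(cells)) / length(cells)
  -sum(p * log(p))
}

# Total correlation: sum of marginal entropies minus the joint entropy
total_correlation <- function(df) {
  marginal <- sum(vapply(seq_along(df),
                         function(j) emp_entropy(df[j]),
                         numeric(1)))
  marginal - emp_entropy(df)
}

# Toy data: b duplicates a (fully dependent) vs. a balanced 2x2 design
toy_dep <- data.frame(a = c(0, 0, 1, 1), b = c(0, 0, 1, 1))
toy_ind <- data.frame(a = c(0, 0, 1, 1), b = c(0, 1, 0, 1))

tc_dep <- total_correlation(toy_dep)  # log(2): one variable determines the other
tc_ind <- total_correlation(toy_ind)  # 0 up to rounding: empirically independent
```

A perturbation method that preserves the dependence structure of the data should leave this quantity roughly unchanged, which is the motivation for comparing it before and after perturbation.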

### Setup

In [1]:
# Package installation (for running notebook for first time)
#install.packages("gtools")
#install.packages("infotheo")
#install.packages("pdist")
#install.packages("regtools")

In [2]:
rm(list = ls())

# Packages
library(gtools)
library(infotheo)
library(regtools)
    # pdist package is loaded in RWN.R code

# Filepath (must end with "/" because paste0() below concatenates directly)
fp <- ""  # [UPDATE WITH FILEPATH TO MAIN RWN REPO]

# RWN code
source(paste0(fp, "R/RWN.R"))

# Set seed for reproducibility
set.seed(1)

# Define e() function
e <- function(x) {
  eval(parse(text = x))
}

Loading required package: FNN

Attaching package: 'FNN'

The following object is masked from 'package:infotheo':

    entropy

*********************

Latest version of regtools at GitHub.com/matloff

Type ?regtools to see function list by category

Attaching package: 'regtools'

The following object is masked from 'package:infotheo':

    discretize

### Import Data

In [3]:
data(pef)

In [4]:
str(pef)

'data.frame':	20090 obs. of  6 variables:
 $ age    : num  50.3 41.1 24.7 50.2 51.2 ...
 $ educ   : Factor w/ 3 levels "14","16","zzzOther": 3 3 3 3 3 3 3 3 1 3 ...
 $ occ    : Factor w/ 6 levels "100","101","102",..: 3 2 3 1 1 1 2 1 1 1 ...
 $ sex    : Factor w/ 2 levels "1","2": 2 1 2 1 2 1 2 1 2 1 ...
 $ wageinc: int  75000 12300 15400 0 160 0 0 32000 39000 20000 ...
 $ wkswrkd: int  52 20 52 52 1 0 0 52 48 52 ...


### Make Perturbed Datasets using RWN

In [5]:
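k_vec <- c(5, 10, 25, 50)

# Create one perturbed dataset per value of k (eps and q held fixed)
for (i in seq_along(k_vec)) {
   start_time <- Sys.time()
   assign(paste0("pef_pert_eps0_k", k_vec[i]), rwn1(pef, eps = 0, k = k_vec[i], q = 0.5))
   end_time <- Sys.time()
   # difftime picks its units automatically and paste0() drops them,
   # so the printed value carries no unit
   time_dif <- end_time - start_time
   print(paste0("k=", k_vec[i], ", time=", time_dif))
}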

[1] "k=5, time=4.02318626642227"
[1] "k=10, time=3.33308466672897"
[1] "k=25, time=3.41934419870377"
[1] "k=50, time=3.34310403267543"
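
An alternative layout keeps the perturbed datasets in a named list instead of creating separate variables with `assign()`, which can make later loops over $k$ simpler. The sketch below is illustrative only: `perturb_stub()` is a hypothetical stand-in that returns its input unchanged so that the example runs without `RWN.R`, and `toy` is made-up data. In this notebook the call inside `lapply()` would be `rwn1(pef, eps = 0, k = k, q = 0.5)`, as in the loop above.

```r
# Hypothetical stand-in for rwn1(): returns the data unchanged inside a
# list with a zperturb element, so this sketch runs without RWN.R.
perturb_stub <- function(data, eps, k, q) list(zperturb = data)

toy <- data.frame(x = 1:10, y = rnorm(10))
k_vec <- c(5, 10, 25, 50)

# One list element per k, named like the variables created by assign()
pert_list <- lapply(k_vec, function(k) {
  perturb_stub(toy, eps = 0, k = k, q = 0.5)
})
names(pert_list) <- paste0("pef_pert_eps0_k", k_vec)

# Access by name, e.g. the perturbed data for k = 25
head(pert_list[["pef_pert_eps0_k25"]]$zperturb)
```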


### Total Correlation / Multiinformation

In [6]:
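# Total correlation (multiinformation) of the six pef variables.
# disc_method is passed to infotheo::discretize (e.g. "equalwidth").
calc_multiinformation_pef <- function(data, disc_method){
    data_cat <- data[,c("educ", "occ", "sex")]
    data_num <- data[,c("age", "wageinc", "wkswrkd")]
    # Discretize numeric columns; infotheo:: is needed because regtools masks discretize
    data_numdisc <- infotheo::discretize(data_num, disc = disc_method)
    data_disc <- cbind(data_cat, data_numdisc)
    multiinfo <- multiinformation(data_disc)
    return(multiinfo)
}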
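
To make the discretization step concrete: equal-width binning splits the range of a numeric variable into bins of identical width and replaces each value with its bin index. The base-R sketch below illustrates the idea with `cut()`. `equal_width_bins` is an illustrative helper and the bin count of 3 is arbitrary; `infotheo::discretize` may differ in its default number of bins and in how it handles values on bin edges.

```r
# Illustrative equal-width binning in base R (not infotheo's implementation).
# Assumes x is numeric and not constant.
equal_width_bins <- function(x, nbins) {
  # nbins + 1 equally spaced break points spanning the range of x
  breaks <- seq(min(x), max(x), length.out = nbins + 1)
  # include.lowest = TRUE puts min(x) in the first bin
  as.integer(cut(x, breaks = breaks, include.lowest = TRUE))
}

ages <- c(20, 25, 30, 35, 40, 45, 50)
bins <- equal_width_bins(ages, nbins = 3)
bins  # bins are [20,30], (30,40], (40,50]: 1 1 1 2 2 3 3
```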

In [8]:
disc_method <- "equalwidth"

calc_multiinformation_pef(pef, disc_method = disc_method)
calc_multiinformation_pef(pef_pert_eps0_k5$zperturb, disc_method = disc_method)
calc_multiinformation_pef(pef_pert_eps0_k10$zperturb, disc_method = disc_method)
calc_multiinformation_pef(pef_pert_eps0_k25$zperturb, disc_method = disc_method)
calc_multiinformation_pef(pef_pert_eps0_k50$zperturb, disc_method = disc_method)

In [9]:
sessionInfo()

R version 4.1.3 (2022-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] pdist_1.2.1      regtools_1.7.0   FNN_1.1.3.2      infotheo_1.2.0.1
[5] gtools_3.9.4    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.10         pillar_1.9.0        compiler_4.1.3     
 [4] base64enc_0.1-3     iterators_1.0.14    tools_4.1.3        
 [7] digest_0.6.31       uuid_1.1-0          jsonlite_1.8.4     
[10] evaluate_0.21       lifecycle_1.0.3     lattice_0.21-8     
[13] rlang_1.1.0         Matrix_1.5-4        foreach_1.5.2      
[16] mlapi_0.1.1         IRdisplay