# Customer Data Safety Report - Otter River Software 


### Leandro Lopez  
### Data Governance and Privacy, 5300OM  
### Merrimack College School of Science and Engineering  

---

![An abstract work of art generated by the author using Stable Diffusion](image.jpeg) 
An abstract work of art generated by the author using Stable Diffusion

---

## Introduction

Data is a highly valued asset for Otter River Software. We believe that, with the right approach, we can respect the privacy of our clients while safely extracting value from the data we collect. This report details the process that its author, Leandro Lopez, followed to prepare customer data for safe and secure sale to our Telecom Partners. To protect the data, we leverage Differential Privacy techniques as defined by industry leaders (Dwork, 2016).

## Data Description

To start, we must read and describe the data. We use df.describe() to help us contextualize our data.

In [57]:
import pandas as pd

df = pd.read_csv('Customer_Survey.csv')

df.describe()


| | Region | Gender | Age | EducationYears | JobCategory | UnionMember | EmploymentLength | Retired | HouseholdIncome | DebtToIncomeRatio | ... | CallWait | CallForward | ThreeWayCalling | EBilling | TVWatchingHours | OwnsPC | OwnsMobileDevice | OwnsGameSystem | OwnsFax | NewsSubscriber |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | ... | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 3.0014 | 0.5036 | 47.0256 | 14.543 | 2.7528 | 0.1512 | 9.7304 | 0.1476 | 54.7596 | 9.95416 | ... | 0.479 | 0.4806 | 0.478 | 0.3486 | 19.645 | 0.6328 | 0.4792 | 0.4748 | 0.1788 | 0.4726 |
| std | 1.42176 | 0.500037 | 17.770338 | 3.281083 | 1.7379 | 0.35828 | 9.690929 | 0.354739 | 55.377511 | 6.399783 | ... | 0.499609 | 0.499673 | 0.499566 | 0.476575 | 5.165609 | 0.48209 | 0.499617 | 0.499415 | 0.383223 | 0.499299 |
| min | 1.0 | 0.0 | 18.0 | 6.0 | 1.0 | 0.0 | 0.0 | 0.0 | 9.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 0.0 | 31.0 | 12.0 | 1.0 | 0.0 | 2.0 | 0.0 | 24.0 | 5.1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 17.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 3.0 | 1.0 | 47.0 | 14.0 | 2.0 | 0.0 | 7.0 | 0.0 | 38.0 | 8.8 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 20.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 4.0 | 1.0 | 62.0 | 17.0 | 4.0 | 0.0 | 15.0 | 0.0 | 67.0 | 13.6 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 23.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| max | 5.0 | 1.0 | 79.0 | 23.0 | 6.0 | 1.0 | 52.0 | 1.0 | 1073.0 | 43.1 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 36.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In analyzing the data, we find an outlier in one of the columns, CreditDebt, where the largest value (109.07) is roughly 1.6 times the second largest (67.49).

In [58]:

df[['CreditDebt']].sort_values('CreditDebt', ascending=False)

| | CreditDebt |
|---|---|
| 1102 | 109.072596 |
| 2192 | 67.490850 |
| 4916 | 48.704524 |
| 4412 | 44.245560 |
| 1770 | 42.098500 |
| ... | ... |
| 4898 | 0.006344 |
| 4046 | 0.004940 |
| 288 | 0.003410 |
| 4921 | 0.001364 |
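
The gap between the two largest values can also be measured rather than read off the sorted output. The following is a minimal sketch, not the code used in the report: it rebuilds only the five largest CreditDebt values shown above, and `top_two_ratio` is a hypothetical helper name introduced for illustration.

```python
import pandas as pd

# Five largest CreditDebt values copied from the sorted output above;
# the rest of the column is omitted, so this is an illustration only.
credit_debt = pd.Series(
    [109.072596, 67.490850, 48.704524, 44.245560, 42.098500],
    index=[1102, 2192, 4916, 4412, 1770],
    name="CreditDebt",
)


def top_two_ratio(values: pd.Series) -> float:
    """Return the largest value divided by the second largest (hypothetical helper)."""
    top_two = values.nlargest(2)
    return float(top_two.iloc[0] / top_two.iloc[1])


ratio = top_two_ratio(credit_debt)
outlier_index = credit_debt.idxmax()  # index label of the largest value
print(f"largest / second largest = {ratio:.2f} at index {outlier_index}")
```

A ratio well above 1 only flags a candidate; whether the row is truly an outlier remains a judgment call.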


With the following code, we remove the outlier and reset the index. Removing the extreme value reduces re-identification risk, resetting the index hides the fact that a row was removed in the first place, and a contiguous index also lowers the risk of running into bugs later.

In [59]:
df = df.drop(1102) # index of the outlier
df = df.reset_index(drop=True)
df[['CreditDebt']].sort_values('CreditDebt', ascending=False)

| | CreditDebt |
|---|---|
| 2191 | 67.490850 |
| 4915 | 48.704524 |
| 4411 | 44.245560 |
| 1769 | 42.098500 |
| 3067 | 35.252100 |
| ... | ... |
| 4897 | 0.006344 |
| 4045 | 0.004940 |
| 288 | 0.003410 |
| 4920 | 0.001364 |
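
The cell above hard-codes the index label of the outlier. As a sketch of one alternative, the label could be looked up with `idxmax()` so the step still works if the row order changes. The `toy` frame below is invented for illustration and assumes there is exactly one row to remove.

```python
import pandas as pd

# Toy frame standing in for the survey data; only CreditDebt matters here.
toy = pd.DataFrame({"CreditDebt": [1.5, 109.07, 3.2, 67.49]})

# Locate the row holding the maximum instead of hard-coding its index label.
outlier_label = toy["CreditDebt"].idxmax()

# Drop that row and renumber, so no gap in the index hints at the removal.
cleaned = toy.drop(index=outlier_label).reset_index(drop=True)
print(cleaned)
```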


## Equivalence Classes

Let's create our Equivalence Classes. We are interested in the ways our clients use their data. This will allow us to meet the needs of our clients more effectively and to better focus our advertising efforts. To start, we will create Equivalence Classes defined by the following values: Age, Gender, Region, DataLastMonth, and DataOverTenure.

First, let's target adults in the following ranges:

In [54]:
# Create a function that maps age values to age ranges
def map_age_to_range(age):
    
    if 20 <= age <= 34:
        return '20-34'
    elif 35 <= age <= 44:
        return '35-44'
    elif 45 <= age <= 54:
        return '45-54'
    elif 55 <= age <= 64:
        return '55-64'
    else:
        return None


df['Age'] = df['Age'].apply(map_age_to_range)
# Ages outside 20-64 map to None and appear blank in the output below
df[['Age', 'EducationYears']].sort_values('Age', ascending=False)

| | Age | EducationYears |
|---|---|---|
| 3723 | 55-64 | 11 |
| 2816 | 55-64 | 17 |
| 1978 | 55-64 | 21 |
| 768 | 55-64 | 13 |
| 769 | 55-64 | 17 |
| ... | ... | ... |
| 4971 | | 12 |
| 4979 | | 10 |
| 4989 | | 17 |
| 4995 | | 10 |
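
The same age ranges could also be produced with `pd.cut`, which avoids the chain of `if`/`elif` branches. This is a sketch of an alternative rather than the method used in the report, and it assumes ages are whole numbers, since the bin edges below are right-closed. The sample `ages` series is invented for illustration.

```python
import pandas as pd

# Invented sample ages covering every range plus values outside 20-64.
ages = pd.Series([18, 20, 34, 35, 44, 45, 54, 55, 64, 65, 79], name="Age")

# Right-closed bins: (19, 34], (34, 44], (44, 54], (54, 64].
# Values outside every bin become NaN, mirroring the None returned above.
age_ranges = pd.cut(
    ages,
    bins=[19, 34, 44, 54, 64],
    labels=["20-34", "35-44", "45-54", "55-64"],
)
print(age_ranges.tolist())
```

With `pd.cut` the result is a categorical column, so the out-of-range rows can be detected with `isna()` in the same way as the `None` values produced by `map_age_to_range`.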


Next, we scrub the records whose age falls outside these ranges (the rows where Age is blank above). This removes people we are not targeting and limits the exposure of customers outside the target groups.

In [30]:
# Scrub rows whose age fell outside the target ranges (Age is None)
df = df.dropna(subset=['Age'])

region = df.groupby('Region')
gender = df.groupby('Gender')
age = df.groupby('Age')
data_last_month = df.groupby('DataLastMonth')
data_over_tenure = df.groupby('DataOverTenure')
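
The cell above groups by one column at a time. An equivalence class is usually understood as a distinct combination of all the chosen values, so it can help to group on several columns together and inspect how many records fall into each class. The sketch below uses invented toy records and only three of the five columns. The threshold `K` is a hypothetical minimum class size, not a value from this report. If DataLastMonth and DataOverTenure are continuous measurements, they would likely need to be binned before being added to the list.

```python
import pandas as pd

# Toy records; the column names match the report, the values are invented.
toy = pd.DataFrame(
    {
        "Region": [1, 1, 1, 2, 2, 3],
        "Gender": [0, 0, 0, 1, 1, 1],
        "Age": ["20-34", "20-34", "20-34", "35-44", "35-44", "55-64"],
    }
)

quasi_identifiers = ["Region", "Gender", "Age"]
K = 2  # hypothetical minimum class size, not a value from the report

# Size of each equivalence class, i.e. each distinct combination of values.
class_sizes = toy.groupby(quasi_identifiers).size()

# Classes smaller than K are the ones where an individual stands out.
small_classes = class_sizes[class_sizes < K]
print(class_sizes)
print(small_classes)
```

Classes that fall below the chosen threshold are candidates for further generalization or suppression before the data is saved.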


In [31]:
# Save the masked data
df.to_csv('masked_data.csv', index=False)