# Piithon - Scrubadub Exercise
In this Monday Morning Python Exercise, users will leverage several Python packages to anonymize and analyze 311 data. As a bonus, an interesting data profiling quirk is explored at the end.

## Scrubadub
This exercise is BYOD -- bring your own data. This code assumes you have a file called "piithon311data.csv" in a temporary location. At a minimum, the following columns should be included:
 - Id
 - Title
 - Description
 - AssignedTo
 - CreatedBy
 - ReportedBy
 - Address
 - DaysOpen

Due to the sensitive nature of this type of information, this exercise requires that you provide your own 311 dataset and update the code to point to that temporary file here:

In [None]:
piithon311filefullpath = "C:\\temp\\piithon311data.csv"
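
Before loading the file, it can help to confirm that the header row contains the columns listed above. The following is a minimal standard-library sketch; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names introduced here for illustration and are not part of the exercise. The demo reads an in-memory sample so it runs without a real 311 file.

```python
import csv
import io

# Column names taken from the list at the top of this exercise
REQUIRED_COLUMNS = ["Id", "Title", "Description", "AssignedTo",
                    "CreatedBy", "ReportedBy", "Address", "DaysOpen"]

def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]

# In-memory sample for illustration; this header lacks two required columns
sample = io.StringIO("Id,Title,Description,AssignedTo,CreatedBy,Address\n1,a,b,c,d,e\n")
print(missing_columns(sample))  # ['ReportedBy', 'DaysOpen']
```

With a real file, the same function could be called as `missing_columns(open(piithon311filefullpath, newline=""))`.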

Data analysts frequently encounter sensitive data, and the ability to analyze and present anonymous details is a must. Therefore, an important library in any Pythonista's toolbelt is [scrubadub](https://scrubadub.readthedocs.io/en/stable/). We'll use this library to anonymize our 311 data in a way that keeps analytic integrity intact. Try out scrubadub's various methods on your data before proceeding to analysis.
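
To make the idea concrete, here is a rough standard-library sketch of placeholder-based scrubbing: sensitive substrings are swapped for tokens such as `{{EMAIL}}`, the placeholder style scrubadub uses. This is a stand-in, not scrubadub itself. The two regular expressions and the `redact` helper are assumptions made for illustration, and they are deliberately simple, so they will miss many real-world formats that a dedicated library handles.

```python
import re

# Deliberately simple illustrative patterns (assumptions, not scrubadub's detectors)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")

def redact(text):
    """Replace email addresses and US-style phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("{{EMAIL}}", text)
    return PHONE_RE.sub("{{PHONE}}", text)

print(redact("Call 555-123-4567 or email jane.doe@example.com about the pothole."))
# Call {{PHONE}} or email {{EMAIL}} about the pothole.
```

Because the rest of the sentence is left untouched, free-text columns such as Description remain usable for analysis after scrubbing.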

In [None]:
import scrubadub as scr

# Also grab a few other useful packages
import pandas as pd
import seaborn as sns

In [None]:
rawdataframe = pd.read_csv(piithon311filefullpath)

In [None]:
rawdataframe.head()

### Remove all names from the dataset

In [None]:
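# `.transformed` was a placeholder, not a pandas attribute; replace this copy with your scrubadub name-scrubbing step
nameless_data = rawdataframe.copy()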

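### Remove all email addresses from the dataset

In [None]:
# `.transformed` was a placeholder, not a pandas attribute; replace this copy with your scrubadub email-scrubbing step
no_email_data = nameless_data.copy()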

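### Remove all phone numbers from the dataset

In [None]:
# `.transformed` was a placeholder, not a pandas attribute; replace this copy with your scrubadub phone-scrubbing step
no_phone_data = no_email_data.copy()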

### Remove all addresses from the dataset

In [None]:
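# `.transformed` was a placeholder, not a pandas attribute; replace this copy with your scrubadub address-scrubbing step
no_address_data = no_phone_data.copy()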

Now that the data is anonymized, it can be reshaped for analysis:

In [None]:
anonymized_data = no_address_data.set_index('Id')
anonymized_data.head()
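
Keeping analytic integrity intact matters most for the user columns (AssignedTo, CreatedBy, ReportedBy): grouping by user only works if the same person always maps to the same token. One possible approach, sketched below with the standard library, is a keyed hash. The `pseudonymize` helper, the `user_` prefix, and the salt value are all hypothetical choices made for this illustration; they are not something scrubadub provides.

```python
import hashlib
import hmac

def pseudonymize(name, salt):
    """Map a name to a stable token; the same name and salt always give the same token."""
    normalized = name.strip().lower()  # so "Jane Doe" and " jane doe " match
    digest = hmac.new(salt.encode("utf-8"), normalized.encode("utf-8"), hashlib.sha256)
    return "user_" + digest.hexdigest()[:8]

# Hypothetical salt; keep the real value secret and out of the notebook
SALT = "replace-with-a-secret-value"

names = ["Jane Doe", " jane doe ", "John Roe"]
tokens = [pseudonymize(name, SALT) for name in names]
print(tokens[0] == tokens[1], tokens[0] == tokens[2])
```

The salt is what keeps someone from rebuilding the mapping by hashing a list of known names, so it should be treated like a password.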

## Data Profiling
Now that we have anonymized data, we can launch into analysis! Let's perform some quick data profiling, including a linear regression of two features:

In [None]:
anonymized_data.describe()

In [None]:
# Count the rows for each (AssignedTo, CreatedBy, DaysOpen) combination
anonymized_user_stats = anonymized_data.groupby(['AssignedTo', 'CreatedBy', 'DaysOpen']).size().reset_index(name='UserComboCount')
anonymized_user_stats.head()

In [None]:
sns.regplot(x="UserComboCount", y="DaysOpen", data=anonymized_user_stats)
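
For readers who want to see what a fitted regression line consists of, the sketch below computes an ordinary least-squares slope and intercept by hand. It assumes a plain first-order linear fit, which is what `regplot` draws with its default settings as far as we know. The `least_squares` helper and the sample numbers are made up for illustration and are not real 311 data.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all identical; slope is undefined")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up values that lie exactly on y = 2x + 1
combo_counts = [1, 2, 3, 4, 5]
days_open = [3, 5, 7, 9, 11]
slope, intercept = least_squares(combo_counts, days_open)
print(slope, intercept)  # 2.0 1.0
```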

## Anscombe's Quartet
One important thing to keep in mind when moving from data profiling into deeper analysis is that visualization is key. Take [Anscombe's Quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet), for example: four datasets whose summary statistics (mean, variance, correlation) are nearly identical, but which look very different when plotted.

In [None]:
anscombe = sns.load_dataset("anscombe")

In [None]:
anscombe.groupby("dataset").describe()
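
The same comparison can be reproduced without pandas. The sketch below hard-codes the first two of Anscombe's four datasets (values as published in the quartet) and computes the mean, sample variance, and Pearson correlation for each; `pearson_r` is a small helper written for this illustration.

```python
from math import sqrt
from statistics import mean, variance

# Datasets I and II share the same x values
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_first = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_second = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

for label, ys in (("I", y_first), ("II", y_second)):
    print(label, round(mean(ys), 2), round(variance(ys), 2), round(pearson_r(x_shared, ys), 2))
```

The two printed rows are nearly identical, even though dataset I is a noisy line and dataset II is a smooth curve, which is why plotting matters.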

In [None]:
sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=anscombe,
           col_wrap=2, ci=None, palette="muted", height=4,
           scatter_kws={"s": 50, "alpha": 1})

## [DataSaurus](https://github.com/lockedata/datasauRus)

In [None]:
import emoji

It's nice to end on a fun note 🐍