# Assignment 1: Data Versioning and Differential Privacy Pt. 1
## ADSP 32021 IP01 Machine Learning Operations
#### Maria Clarissa Fionalita
Canvas Assignment Pages:

- [Assignment 1: Data Versioning and Differential Privacy](https://canvas.uchicago.edu/courses/52013/assignments/586652)
- [Hint for the assignment](https://edstem.org/us/courses/48613/discussion/3583746)
- [Intro to Exploratory data analysis (EDA) in Python](https://www.kaggle.com/code/imoore/intro-to-exploratory-data-analysis-eda-in-python)

References:

- [Versioning Data with DVC (Hands-On Tutorial!)](https://www.youtube.com/watch?v=kLKBcPonMYw)
- [DVC Cheatsheet](https://derekchia.com/dvc/)

#### Setting Up DVC and Google Drive as Remote Storage

In [107]:
!git init
!dvc init -f
!dvc remote add -d storage gdrive://1i8QDLTjDwMzeznRjHizajHFaEFA3O17U
!git commit .dvc/config -m "Configure remote storage"

hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint:
hint: 	git config --global init.defaultBranch <name>
hint:
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint:
hint: 	git branch -m <name>
Initialized empty Git repository in /mnt/d/UChicago/Q4/ADSP 32021 Machine Learning Operations/Assignment 1/.git/
Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) her

In [108]:
import pandas as pd
import numpy as np

In [110]:
# Clear the local data file and the DVC cache to force a fresh pull

# !rm -f data/athletes.csv
# !rm -rf .dvc/cache
# !dvc pull

!ls -lh data

total 81M
-rwxrwxrwx 1 mariafshan mariafshan 69M Mar  2  2023 athletes.csv
-rwxrwxrwx 1 mariafshan mariafshan 13M Oct 16 13:43 athletes.csv.zip


In [111]:
!dvc add data/athletes.csv
!git add data/.gitignore data/athletes.csv.dvc

100%|██████████|Adding data/athletes.csv to cache 1/1 [00:00<00:00,  8.58file/s]

In [112]:
# push the raw data

!git commit -m "add raw athletes.csv"
!cat data/athletes.csv.dvc
!dvc push

[master 18eacf2] add raw athletes.csv
 5 files changed, 13 insertions(+)
 create mode 100755 .dvc/.gitignore
 mode change 100644 => 100755 .dvc/config
 create mode 100755 .dvcignore
 create mode 100644 data/.gitignore
 create mode 100644 data/athletes.csv.dvc
outs:
- md5: ade8057a9ad4350dfade9180f021a96d
  size: 71546909
  isexec: true
  hash: md5
  path: athletes.csv
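The `.dvc` pointer file printed by `cat` records an `md5` hash and a `size` for the tracked file. As a minimal sketch, assuming the `hash: md5` field is a plain MD5 of the file's bytes, a local file could be checked against those values with the standard library alone. The helper name `file_md5` is hypothetical, and the demonstration uses a throwaway temporary file rather than the real dataset.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a temporary file (a stand-in for data/athletes.csv).
demo_bytes = b"region,age\nSouth Central,21\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(demo_bytes)
    demo_path = tmp.name

demo_hash = file_md5(demo_path)
demo_size = os.path.getsize(demo_path)
os.remove(demo_path)
print(demo_hash, demo_size)
```

For the real file, one would compare `file_md5("data/athletes.csv")` with the `md5` field and `os.path.getsize("data/athletes.csv")` with the `size` field of `data/athletes.csv.dvc`.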

# 1. Dataset Version 1

In [113]:
data = pd.read_csv("data/athletes.csv")
data.shape

(423006, 27)

## 1.1 Remove Irrelevant Columns and Missing Values

In [114]:
# Drop rows with a missing value in any field the analysis depends on.
data = data.dropna(subset=['region', 'age', 'weight', 'height', 'howlong', 'gender', 'eat',
                           'train', 'background', 'experience', 'schedule',
                           'deadlift', 'candj', 'snatch', 'backsq'])

# Drop identifiers, benchmark workouts, and survey columns not used downstream.
data = data.drop(columns=['affiliate', 'team', 'name', 'athlete_id', 'fran', 'helen', 'grace',
                          'filthy50', 'fgonebad', 'run400', 'run5k', 'pullups', 'train',
                          'eat', 'background', 'experience', 'schedule', 'howlong'])

In [116]:
data.head(1)

Unnamed: 0,region,gender,age,height,weight,candj,snatch,deadlift,backsq
6,South Central,Male,21.0,72.0,175.0,0.0,0.0,0.0,0.0


## 1.2 Encode Categorical Variables

In [117]:
data["is_male"] = (data["gender"] == "Male") * 1

In [118]:
data = pd.get_dummies(data, prefix = ["region"], columns = ["region"], dummy_na = True)

In [119]:
data = data.drop(columns=['gender'])

data.head(1)

Unnamed: 0,age,height,weight,candj,snatch,deadlift,backsq,is_male,region_Africa,region_Asia,...,region_Mid Atlantic,region_North Central,region_North East,region_North West,region_Northern California,region_South Central,region_South East,region_South West,region_Southern California,region_nan
6,21.0,72.0,175.0,0.0,0.0,0.0,0.0,1,False,False,...,False,False,False,False,False,True,False,False,False,False
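The encoding step can be illustrated on a toy frame. This is a sketch with made-up rows (the `toy` data is hypothetical), showing how `dummy_na=True` adds a `region_nan` indicator column alongside the per-region columns.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names match the notebook.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "region": ["Africa", "Asia", np.nan],
})

# Binary indicator for gender, then one-hot encoding for region.
toy["is_male"] = (toy["gender"] == "Male").astype(int)
encoded = pd.get_dummies(toy, prefix=["region"], columns=["region"], dummy_na=True)
encoded = encoded.drop(columns=["gender"])

print(encoded.columns.tolist())
```

In the notebook, rows with a missing `region` were already dropped in section 1.1, so `region_nan` should be `False` for every row there; the toy frame keeps one missing value only to make the extra column visible.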


#### Push Data V1 to Google Drive

In [120]:
data.to_csv("data/athletes.csv", index=False)

!ls -lh data

total 17M
-rwxrwxrwx 1 mariafshan mariafshan 4.6M Oct 16 19:10 athletes.csv
-rwxrwxrwx 1 mariafshan mariafshan  111 Oct 16 19:08 athletes.csv.dvc
-rwxrwxrwx 1 mariafshan mariafshan  13M Oct 16 13:43 athletes.csv.zip


In [121]:
# push updated file
!dvc add data/athletes.csv
!git add data/athletes.csv.dvc
!git commit -m "Data V1: removed irrelevant columns and encoded categorical variables"
!dvc push

In [123]:
!git log --oneline

581879d (HEAD -> master) Data V1: removed irrelevant columns and encoded categorical variables
18eacf2 add raw athletes.csv
e890107 Configure remote storage


# 2. Dataset Version 2

## 2.1. Remove Outliers

In [129]:
data = data[data['weight'] < 1500]
# data = data[data['gender'] != '--']
data = data[data['age'] >= 18]
data = data[(data['height'] < 96) & (data['height'] > 48)]

In [130]:
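# `&` binds tighter than `|`, so this keeps rows with 0 < deadlift <= 1105
# OR rows where is_male == 0 and deadlift <= 636; the grouping is written out explicitly.
data = data[((data['deadlift'] > 0) & (data['deadlift'] <= 1105))
            | ((data['is_male'] == 0) & (data['deadlift'] <= 636))]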
data = data[(data['candj'] > 0) & (data['candj'] <= 395)]
data = data[(data['snatch'] > 0) & (data['snatch'] <= 496)]
data = data[(data['backsq'] > 0) & (data['backsq'] <= 1069)]
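Because the deadlift filter mixes `&` and `|`, it helps to check which rows it keeps. The sketch below uses a hypothetical four-row frame to compare the grouping Python applies to the notebook's expression with a stricter alternative; the stricter form is only an assumption about what might have been intended, not the filter used in this notebook.

```python
import pandas as pd

# Hypothetical rows chosen to exercise each branch of the mask.
toy = pd.DataFrame({
    "is_male":  [1, 1, 0, 0],
    "deadlift": [0.0, 400.0, 0.0, 700.0],
})

in_range = (toy["deadlift"] > 0) & (toy["deadlift"] <= 1105)
female_cap = (toy["is_male"] == 0) & (toy["deadlift"] <= 636)

# Same grouping Python applies to `a & b | c & d`: (a & b) | (c & d).
kept_or = toy[in_range | female_cap]

# Stricter alternative (assumed intent): positive lifts only,
# with the lower cap applied to rows where is_male == 0.
kept_strict = toy[in_range & ((toy["is_male"] == 1) | (toy["deadlift"] <= 636))]

print(kept_or.index.tolist())      # rows kept by the OR form
print(kept_strict.index.tolist())  # rows kept by the stricter form
```

On this toy frame the OR form keeps the row with `is_male == 0` and a deadlift of 0, and also the row with `is_male == 0` and a deadlift of 700, while the stricter form keeps only the 400 lb row.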

## 2.2. Clean Survey Responses

In [61]:
# decline_dict = {'Decline to answer|': np.nan}
# data = data.replace(decline_dict)
# data = data.dropna(subset=['background','experience','schedule','howlong','eat'])
# data.shape

In [131]:
data.head(1)

Unnamed: 0,age,height,weight,candj,snatch,deadlift,backsq,is_male,region_Africa,region_Asia,...,region_Mid Atlantic,region_North Central,region_North East,region_North West,region_Northern California,region_South Central,region_South East,region_South West,region_Southern California,region_nan
21,30.0,71.0,200.0,235.0,175.0,385.0,315.0,1,False,False,...,False,False,False,False,False,False,False,False,True,False


In [132]:
data.shape

(30848, 26)

### Save Data V.2 & Push to Cloud Storage

In [133]:
data.to_csv("data/athletes.csv", index=False)

In [134]:
# checking that the file is updated
!ls -lh data

total 17M
-rwxrwxrwx 1 mariafshan mariafshan 4.4M Oct 16 19:14 athletes.csv
-rwxrwxrwx 1 mariafshan mariafshan  110 Oct 16 19:11 athletes.csv.dvc
-rwxrwxrwx 1 mariafshan mariafshan  13M Oct 16 13:43 athletes.csv.zip


In [135]:
# # remove tracker to update file
# !git rm -r --cached 'data/athletes.csv'
# !git commit -m "stop tracking data/athletes.csv"

# push updated file
!dvc add data/athletes.csv
!git add data/athletes.csv.dvc
!git commit -m "Data V2: removed outliers"
!dvc push

# 3. For both versions calculate total_lift and divide dataset into train and test, keeping the same split ratio.

## 3.1 total_lift for V1

### 3.1.1 Load V1 data

In [136]:
!git log --oneline

3b2451b (HEAD -> master) Data V2: removed outliers
581879d Data V1: removed irrelevant columns and encoded categorical variables
18eacf2 add raw athletes.csv
e890107 Configure remote storage


In [137]:
# !git checkout HEAD^1 data/athletes.csv.dvc

!git checkout 581879d data/athletes.csv.dvc
!dvc checkout

Updated 1 path from aa2fa08
Building workspace index                              |2.00 [00:00, 71.4entry/s]
Comparing indexes                                     |3.00 [00:00,  997entry/s]
Applying changes                                      |1.00 [00:00,  8.60file/s]
M       data/athletes.csv

In [138]:
v1 = pd.read_csv("data/athletes.csv")
print("v1 dimension:", v1.shape)
print("v2 dimension:", data.shape)

v1 dimension: (32172, 26)
v2 dimension: (30848, 26)
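The two-step pattern used here (restore the `.dvc` pointer from a Git revision, then let DVC sync the data file to it) can be wrapped in a small helper. This is a sketch only: `checkout_data_version` and its `runner` parameter are hypothetical names, and the demonstration passes a stand-in runner that records the commands instead of executing them.

```python
import subprocess


def checkout_data_version(rev, pointer="data/athletes.csv.dvc", runner=subprocess.run):
    """Restore the .dvc pointer from `rev`, then sync the data file to match it."""
    commands = [
        ["git", "checkout", rev, "--", pointer],  # restore only the pointer file
        ["dvc", "checkout", pointer],             # materialize the matching data from the cache
    ]
    for cmd in commands:
        runner(cmd, check=True)
    return commands


# Dry run: record the commands rather than running git/dvc.
recorded = []
checkout_data_version("581879d", runner=lambda cmd, check: recorded.append(cmd))
print(recorded)
```

Calling `checkout_data_version("581879d")` without the `runner` argument would run the real commands in the current repository, after which `pd.read_csv("data/athletes.csv")` reads that version.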


### 3.1.2 Calculate Total Lift for Data V1

In [139]:
v1["total_lift"] = v1["deadlift"] + v1['candj'] + v1['snatch'] + v1['backsq']

In [140]:
v1["total_lift"].head()

0       0.0
1       0.0
2    1110.0
3     910.0
4    1335.0
Name: total_lift, dtype: float64
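The leading `0.0` totals appear because V1 still contains rows whose four lifts are all zero; those rows are only filtered out in V2. An equivalent way to write the sum, sketched on a hypothetical two-row frame, is a row-wise `sum` over a list of lift columns (the constant name `LIFT_COLUMNS` is an assumption).

```python
import pandas as pd

LIFT_COLUMNS = ["deadlift", "candj", "snatch", "backsq"]

# Hypothetical rows: one all-zero entry and one complete entry.
toy = pd.DataFrame({
    "deadlift": [0.0, 385.0],
    "candj":    [0.0, 235.0],
    "snatch":   [0.0, 175.0],
    "backsq":   [0.0, 315.0],
})

# Row-wise sum across the four lift columns.
toy["total_lift"] = toy[LIFT_COLUMNS].sum(axis=1)
print(toy["total_lift"].tolist())
```

Note that `sum(axis=1)` skips missing values while chained `+` propagates them; the two agree here because rows with missing lifts were dropped earlier.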

### 3.1.3 Save Data v1 with total lift & Push to Cloud Storage

In [141]:
v1.to_csv("data/athletes.csv", index=False)

In [142]:
# checking that the file is updated
!ls -lh data

total 18M
-rwxrwxrwx 1 mariafshan mariafshan 4.8M Oct 16 19:20 athletes.csv
-rwxrwxrwx 1 mariafshan mariafshan  110 Oct 16 19:15 athletes.csv.dvc
-rwxrwxrwx 1 mariafshan mariafshan  13M Oct 16 13:43 athletes.csv.zip


In [143]:
# push updated file
!dvc add data/athletes.csv
!git add data/athletes.csv.dvc
!git commit -m "Updated V1 Data with total lift"
!dvc push

### 3.1.4 Checking Different Versions of the Data

In [144]:
!git log --oneline

3631c64 (HEAD -> master) Updated V1 Data with total lift
3b2451b Data V2: removed outliers
581879d Data V1: removed irrelevant columns and encoded categorical variables
18eacf2 add raw athletes.csv
e890107 Configure remote storage


In [145]:
# raw data

!git checkout 18eacf2 data/athletes.csv.dvc
!dvc checkout
pd.read_csv("data/athletes.csv").shape

Updated 1 path from e3504f6
Building workspace index                              |2.00 [00:00, 81.8entry/s]
Comparing indexes                                     |3.00 [00:00,  828entry/s]
Applying changes                                      |1.00 [00:00,  1.74file/s]
M       data/athletes.csv

(423006, 27)

## 3.2 total_lift for V2

### 3.2.1 Load V2 data

In [146]:
del data

In [147]:
!git log --oneline

3631c64 (HEAD -> master) Updated V1 Data with total lift
3b2451b Data V2: removed outliers
581879d Data V1: removed irrelevant columns and encoded categorical variables
18eacf2 add raw athletes.csv
e890107 Configure remote storage


In [148]:
!git checkout 3b2451b data/athletes.csv.dvc
!dvc checkout
v2 = pd.read_csv("data/athletes.csv")
v2.shape

Updated 1 path from b60c4d3
Building workspace index                              |2.00 [00:00, 88.6entry/s]
Comparing indexes                                    |3.00 [00:00, 1.20kentry/s]
Applying changes                                      |1.00 [00:00,  11.3file/s]
M       data/athletes.csv

(30848, 26)

### 3.2.2 Calculate Total Lift for Data V2

In [149]:
v2["total_lift"] = v2["deadlift"] + v2['candj'] + v2['snatch'] + v2['backsq']

In [150]:
v2["total_lift"].head()

0    1110.0
1     910.0
2    1335.0
3    1354.0
4    1225.0
Name: total_lift, dtype: float64
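With `total_lift` available for both versions, the train/test split called for in this section can use one function so that both versions share the same ratio. The sketch below is one possible approach using only pandas; the helper name `split_train_test`, the 80/20 ratio, and the seed are assumptions, and the toy frames stand in for `v1` and `v2`.

```python
import pandas as pd


def split_train_test(df, test_frac=0.2, seed=42):
    """Randomly hold out `test_frac` of the rows; assumes a unique index."""
    test = df.sample(frac=test_frac, random_state=seed)
    train = df.drop(test.index)
    return train, test


# Hypothetical stand-ins for the two dataset versions.
toy_v1 = pd.DataFrame({"total_lift": range(100)})
toy_v2 = pd.DataFrame({"total_lift": range(50)})

train1, test1 = split_train_test(toy_v1)
train2, test2 = split_train_test(toy_v2)
print(len(train1), len(test1), len(train2), len(test2))
```

Passing the same `test_frac` for both frames keeps the split ratio identical even though the row counts differ.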

### 3.2.3 Save Data v2 with total lift & Push to Cloud Storage

In [151]:
v2.to_csv("data/athletes.csv", index=False)

In [152]:
# checking that the file is updated
!ls -lh data

total 17M
-rwxrwxrwx 1 mariafshan mariafshan 4.6M Oct 16 19:22 athletes.csv
-rwxrwxrwx 1 mariafshan mariafshan  110 Oct 16 19:21 athletes.csv.dvc
-rwxrwxrwx 1 mariafshan mariafshan  13M Oct 16 13:43 athletes.csv.zip


In [153]:
# push updated file
!dvc add data/athletes.csv
!git add data/athletes.csv.dvc
!git commit -m "Updated V2 Data with total lift"
!dvc push

In [154]:
# checking upload log
!git log --oneline

bfe075a (HEAD -> master) Updated V2 Data with total lift
3631c64 Updated V1 Data with total lift
3b2451b Data V2: removed outliers
581879d Data V1: removed irrelevant columns and encoded categorical variables
18eacf2 add raw athletes.csv
e890107 Configure remote storage


# 13. Data Privacy
Use the TensorFlow Privacy library with dataset V2 and calculate the metrics for the new differentially private (DP) model.

https://www.tensorflow.org/responsible_ai/privacy/tutorials/classification_privacy
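Before applying the library, it may help to see the mechanism that DP-SGD training relies on: each example's gradient is clipped to a maximum L2 norm, the clipped gradients are summed, Gaussian noise scaled to the clipping norm is added, and the result is averaged. The NumPy sketch below illustrates that idea only; it is not TensorFlow Privacy's implementation. The function `dp_sgd_step` is hypothetical, and the parameter names `l2_norm_clip` and `noise_multiplier` are borrowed from the hyperparameters used in the linked tutorial.

```python
import numpy as np


def dp_sgd_step(per_example_grads, l2_norm_clip, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # Scale factor is 1 for gradients already inside the clipping ball.
    scale = np.minimum(1.0, l2_norm_clip / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    # Noise standard deviation is proportional to the clipping norm.
    noise = rng.normal(0.0, noise_multiplier * l2_norm_clip, size=clipped.shape[1])
    return (clipped.sum(axis=0) + noise) / len(per_example_grads)


rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0], [0.3, 0.4]])  # L2 norms 5.0 and 0.5

# With noise_multiplier=0 the step reduces to the mean of the clipped gradients.
no_noise = dp_sgd_step(grads, l2_norm_clip=1.0, noise_multiplier=0.0, rng=rng)
noisy = dp_sgd_step(grads, l2_norm_clip=1.0, noise_multiplier=1.1, rng=rng)
print(no_noise, noisy)
```

A larger `noise_multiplier` gives stronger privacy at the cost of noisier updates, which is the trade-off the DP model's metrics are meant to expose.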

In [155]:
!git log --oneline

bfe075a (HEAD -> master) Updated V2 Data with total lift
3631c64 Updated V1 Data with total lift
3b2451b Data V2: removed outliers
581879d Data V1: removed irrelevant columns and encoded categorical variables
18eacf2 add raw athletes.csv
e890107 Configure remote storage


In [156]:
!git checkout bfe075a data/athletes.csv.dvc
!dvc checkout
data = pd.read_csv("data/athletes.csv")
data.shape

Updated 0 paths from 8357cb4
Building workspace index                              |2.00 [00:00,  152entry/s]
Comparing indexes                                    |3.00 [00:00, 1.09kentry/s]
Applying changes                                      |0.00 [00:00,     ?file/s]
M       data/athletes.csv

(30848, 27)