<a href="https://colab.research.google.com/github/pratyushlokhande/BigData-Privacy-Preservation/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Setting Up PySpark**

In [3]:
!sudo apt update

[33m0% [Working][0m            Get:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Hit:3 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:4 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:5 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Ign:6 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Get:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release [696 B]
Hit:8 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Get:9 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:10 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release.gpg [836 B]
Get:11 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ Packages [73.9 kB]
H

In [4]:
!apt-get install openjdk-8-jre -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
!tar xf spark-3.1.2-bin-hadoop3.2.tgz
!pip install -q findspark

In [5]:
# 2. Set up environment variables
import os
# JAVA_HOME should point at the JDK installed above (OpenJDK 8), not Java 11
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop3.2"

**Initialising Modules and PySpark**

In [10]:
!pip install multipledispatch
!pip install pyDes

Collecting multipledispatch
  Downloading multipledispatch-0.6.0-py3-none-any.whl (11 kB)
Installing collected packages: multipledispatch
Successfully installed multipledispatch-0.6.0


In [11]:
# importing modules 
import findspark
import pandas as pd
from multipledispatch import dispatch
import time
# binascii and pyDes are used for encrypting and decrypting the data
import binascii
from pyDes import des, CBC, PAD_PKCS5

**Starting Spark Session**

In [12]:
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName('Algorithms').getOrCreate()

**Importing PySpark Data Types**

In [13]:
from pyspark.sql.types import (StructField, StringType, StructType, IntegerType, FloatType)

**Generating Structure for Data Set Table**

In [14]:
# Column names and types for our dataset (the file has no header row)

data_schema = [
    StructField('age', IntegerType(), True),
    StructField('workclass', StringType(), True),
    StructField('fnlwgt', FloatType(), True),
    StructField('education', StringType(), True),
    StructField('education-num', FloatType(), True),
    StructField('marital-status', StringType(), True),
    StructField('occupation', StringType(), True),
    StructField('relationship', StringType(), True),
    StructField('race', StringType(), True),
    StructField('sex', StringType(), True),
    StructField('capital-gain', FloatType(), True),
    StructField('capital-loss', FloatType(), True),
    StructField('hours-per-week', FloatType(), True),
    StructField('native-country', StringType(), True),
    StructField('income', StringType(), True),
]

final_struct = StructType(fields=data_schema)

df = spark.read.csv("./drive/MyDrive/DATASET/sample-dataset/adult.all.txt", sep=",", header=None, schema=final_struct)

**Here's what the dataset looks like**

In [15]:
# View Data
df.show()
# View Schema
df.printSchema()

+---+-----------------+--------+-------------+-------------+--------------------+------------------+--------------+-------------------+-------+------------+------------+--------------+--------------+------+
|age|        workclass|  fnlwgt|    education|education-num|      marital-status|        occupation|  relationship|               race|    sex|capital-gain|capital-loss|hours-per-week|native-country|income|
+---+-----------------+--------+-------------+-------------+--------------------+------------------+--------------+-------------------+-------+------------+------------+--------------+--------------+------+
| 39|        State-gov| 77516.0|    Bachelors|         13.0|       Never-married|      Adm-clerical| Not-in-family|              White|   Male|      2174.0|         0.0|          40.0| United-States| <=50k|
| 50| Self-emp-not-inc| 83311.0|    Bachelors|         13.0|  Married-civ-spouse|   Exec-managerial|       Husband|              White|   Male|         0.0|         0.0|   

**Making the table easier to read and plot: converting to pandas**

In [16]:
# converting to Pandas
df = df.toPandas()

In [17]:
# Let's take a look at the dataset again
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516.0,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,United-States,<=50k
1,50,Self-emp-not-inc,83311.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,United-States,<=50k
2,38,Private,215646.0,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,United-States,<=50k
3,53,Private,234721.0,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,United-States,<=50k
4,28,Private,338409.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50k


**Calculating Weights**<br>The criterion for assigning weights is the <i>measure of uniqueness</i>: the more distinct values an attribute has, the higher its weight

In [18]:
# weight
weights = {}

# storing the count of unique values in each attribute
for col in df.columns:
  weights[col] = len(df[col].unique())

# now sorting the weights in decreasing order
sortedWt = dict(sorted(weights.items(), key=lambda item: item[1], reverse=True))

# assigning each attribute its rank (n for the most unique, 1 for the least) for easy calculations
n = len(sortedWt.keys())
for key in sortedWt.keys():
  sortedWt[key] = n
  n = n - 1


In [20]:
# Lets take a look at our generated Weights Object/Table
sortedWt

{'age': 11,
 'capital-gain': 14,
 'capital-loss': 13,
 'education': 9,
 'education-num': 8,
 'fnlwgt': 15,
 'hours-per-week': 12,
 'income': 1,
 'marital-status': 5,
 'native-country': 10,
 'occupation': 7,
 'race': 3,
 'relationship': 4,
 'sex': 2,
 'workclass': 6}
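
The ranking step can be reproduced on a toy table. The following is a minimal plain-Python sketch, not the notebook's own code: `rank_weights` and the `toy` columns are hypothetical names introduced for illustration, and a dict of lists stands in for the DataFrame.

```python
def rank_weights(table):
    """Rank columns by distinct-value count: n for the most unique, 1 for the least."""
    counts = {name: len(set(values)) for name, values in table.items()}
    # sorted() is stable, so tied columns keep their original order
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    n = len(ordered)
    return {name: n - position for position, (name, _) in enumerate(ordered)}

# Hypothetical toy data: 'id' has 4 distinct values, 'city' has 2, 'flag' has 1
toy = {
    "id": [1, 2, 3, 4],
    "city": ["A", "B", "A", "B"],
    "flag": ["y", "y", "y", "y"],
}
weights_toy = rank_weights(toy)
print(weights_toy)
```

Because only the rank is kept, an attribute with 28523 distinct values (`fnlwgt`) ends up just one step above an attribute with 123 (`capital-gain`).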

**Algorithmic functions for encrypting and decrypting the data**<br>We will use the <i>DES algorithm</i> to encrypt the dataset, with every value converted to a string first

In [21]:
# DES encryption -> takes an 8-character key and the string to be encrypted.
# Returns the hex-encoded ciphertext and the time taken in milliseconds.
def des_encrypt(secret_key, s):
    start = time.time()
    iv = secret_key  # the key is reused as the CBC initialisation vector
    k = des(secret_key, CBC, iv, pad=None, padmode=PAD_PKCS5)
    en = k.encrypt(s, padmode=PAD_PKCS5)
    t = (time.time() - start)*1000
    return { 'data': binascii.b2a_hex(en), 'time': t }

# DES decryption -> takes the key used during encryption and the hex string to be decrypted
def des_decrypt(secret_key, s):
    iv = secret_key  # must match the IV used in des_encrypt
    k = des(secret_key, CBC, iv, pad=None, padmode=PAD_PKCS5)
    de = k.decrypt(binascii.a2b_hex(s), padmode=PAD_PKCS5)
    return de
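
Both functions pass `padmode=PAD_PKCS5`. DES works on 8-byte blocks, so short values such as `"39"` are padded up to a full block before encryption, which is why each ciphertext in this notebook is 16 hex digits long. The sketch below illustrates the padding rule only, using the standard library; `pkcs5_pad` and `pkcs5_unpad` are hypothetical helper names and are not part of pyDes.

```python
import binascii

BLOCK_SIZE = 8  # DES block size in bytes

def pkcs5_pad(data: bytes) -> bytes:
    """Append N bytes of value N so the length becomes a multiple of 8."""
    pad_len = BLOCK_SIZE - len(data) % BLOCK_SIZE
    return data + bytes([pad_len]) * pad_len

def pkcs5_unpad(padded: bytes) -> bytes:
    """Read the last byte to learn how many padding bytes to drop."""
    return padded[:-padded[-1]]

padded = pkcs5_pad(b"39")
print(binascii.b2a_hex(padded))   # b'3339060606060606'
print(pkcs5_unpad(padded))        # b'39'
```

An input that is already exactly 8 bytes long gains a full extra block of padding, so its ciphertext is 32 hex digits instead of 16.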

**Finding the execution time for every data entry**<br>Here we pass each distinct data entry to the *des_encrypt()* function and store the encrypted data together with the time it took

In [22]:
# Finding Encryption Time
et = {}
key = "letsdoit"

# Initialisation
for col in df.columns:
  et[col] = {}

# calculating time and appending to et
for col in df.columns:
  for row in df[col]:
    if row not in et[col].keys():
      et[col][row] = des_encrypt(key, str(row))

In [23]:
# Let's take a look at the dataset obtained
print(et)

{'age': {39: {'data': b'f4d147fc2ec6786c', 'time': 0.9100437164306641}, 50: {'data': b'fb7d6438d20ec08c', 'time': 0.7765293121337891}, 38: {'data': b'ee48689b4efb2070', 'time': 0.7777214050292969}, 53: {'data': b'0f63db7f3b4ad980', 'time': 0.75531005859375}, 28: {'data': b'9916cf68a3834cc8', 'time': 0.7853507995605469}, 37: {'data': b'183228d7b4a53cf0', 'time': 1.9919872283935547}, 49: {'data': b'08b2bbef1138e43a', 'time': 1.3301372528076172}, 52: {'data': b'85d8875585416279', 'time': 1.8744468688964844}, 31: {'data': b'da01539f23a8f933', 'time': 1.3039112091064453}, 42: {'data': b'40c3a71fee1464e3', 'time': 2.034425735473633}, 30: {'data': b'a36ed599ead97ff6', 'time': 1.3949871063232422}, 23: {'data': b'459cc7fea1d43a6f', 'time': 0.8020401000976562}, 32: {'data': b'649e0aaeb1dc037d', 'time': 0.9262561798095703}, 40: {'data': b'9578617171dd6c03', 'time': 0.8580684661865234}, 34: {'data': b'c9b20ede1b3b8782', 'time': 0.8399486541748047}, 25: {'data': b'9d79980d40a2fe27', 'time': 0.76723
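
The cell above encrypts each distinct value only once and reuses the stored result for repeats. A minimal sketch of that caching pattern follows, assuming a stand-in `fake_encrypt` function (hypothetical, it only hex-encodes the text) in place of `des_encrypt`, and `time.perf_counter()` as the timer.

```python
import time

def time_distinct(values, encrypt):
    """Encrypt each distinct value once; store ciphertext and elapsed milliseconds."""
    cache = {}
    for value in values:
        if value not in cache:
            start = time.perf_counter()
            data = encrypt(str(value))
            elapsed_ms = (time.perf_counter() - start) * 1000
            cache[value] = {"data": data, "time": elapsed_ms}
    return cache

calls = []  # records every real call so the caching is visible

def fake_encrypt(text):
    # Stand-in for des_encrypt: NOT encryption, just hex encoding
    calls.append(text)
    return text.encode("ascii").hex()

cache = time_distinct([39, 50, 39, 38, 50], fake_encrypt)
print(calls)              # ['39', '50', '38']
print(cache[39]["data"])  # 3339
```

Five entries trigger only three calls, which is the same saving the notebook gets on columns with few distinct values such as `sex` or `income`.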

**Generating a Helper Table**<br>For every attribute, calculate and store its uniqueness count and the encryption time of its first data entry

In [24]:
# helper dataset
nt = {}
# calculating and storing time of very first data entry of each attribute(column)
for col in df.columns:
  nt[col] = { 'count': len(df[col].unique()), 'time': des_encrypt(key, str(df[col][0]))['time']}

In [25]:
# Let's take a look at the helper dataset
nt

{'age': {'count': 74, 'time': 1.468658447265625},
 'capital-gain': {'count': 123, 'time': 0.8349418640136719},
 'capital-loss': {'count': 99, 'time': 0.946044921875},
 'education': {'count': 16, 'time': 1.529693603515625},
 'education-num': {'count': 16, 'time': 0.8890628814697266},
 'fnlwgt': {'count': 28523, 'time': 0.9119510650634766},
 'hours-per-week': {'count': 96, 'time': 0.8187294006347656},
 'income': {'count': 2, 'time': 0.8904933929443359},
 'marital-status': {'count': 7, 'time': 1.5459060668945312},
 'native-country': {'count': 42, 'time': 1.5056133270263672},
 'occupation': {'count': 15, 'time': 1.6021728515625},
 'race': {'count': 5, 'time': 0.904083251953125},
 'relationship': {'count': 6, 'time': 1.5804767608642578},
 'sex': {'count': 2, 'time': 1.1544227600097656},
 'workclass': {'count': 9, 'time': 1.499176025390625}}

**Calculate and store the sort time for every attribute**<br>We time a sort operation on each column; these timings are used to generate the *S-Table*

In [26]:
# helper table of sort times (in milliseconds)
helperEt = {}

# Storing time required for sort operation on each attribute(column)
for col in df.columns:
  start = time.time()
  df[col].sort_values(ascending = False)
  helperEt[col] = (time.time() - start)*1000
   

In [27]:
# Let's take a look at the dataset obtained
helperEt

{'age': 15.15817642211914,
 'capital-gain': 3.4170150756835938,
 'capital-loss': 2.997159957885742,
 'education': 79.94508743286133,
 'education-num': 4.985809326171875,
 'fnlwgt': 7.434606552124023,
 'hours-per-week': 3.918886184692383,
 'income': 72.40629196166992,
 'marital-status': 185.80079078674316,
 'native-country': 80.62958717346191,
 'occupation': 87.84985542297363,
 'race': 73.13132286071777,
 'relationship': 72.49999046325684,
 'sex': 73.35114479064941,
 'workclass': 88.3629322052002}
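
Each column above is timed with a single sort, so the figures can vary from run to run. One possible refinement, shown here as a hedged plain-Python sketch rather than the notebook's method, is to repeat the sort and keep the fastest run; `best_sort_time_ms` and its `repeats` parameter are hypothetical names.

```python
import time

def best_sort_time_ms(values, repeats=5):
    """Return the fastest of `repeats` timed descending sorts, in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        data = list(values)  # fresh copy so every run sorts the same input
        start = time.perf_counter()
        data.sort(reverse=True)
        elapsed_ms = (time.perf_counter() - start) * 1000
        best = min(best, elapsed_ms)
    return best

# Hypothetical sample column, repeated to give the sort some work
numeric = [40.0, 13.0, 40.0, 36.0, 50.0] * 1000
sort_ms = best_sort_time_ms(numeric)
print(sort_ms)
```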

**S-Table Generation**

In [28]:
# STable
st = {}

# S-value = (weight of the attribute) / (time in ms taken to sort that attribute)
for col in df.columns:
  st[col] = sortedWt[col]/helperEt[col]

# Sorting the table obtained -> the S-Table order is the sequence in which columns will be encrypted
sortedSt = dict(sorted(st.items(), key=lambda item: item[1], reverse=True))

In [29]:
# Let's take a look at the S-Table obtained
print(sortedSt)

{'capital-loss': 4.337439503619442, 'capital-gain': 4.097143176109405, 'hours-per-week': 3.0620945427997808, 'fnlwgt': 2.017591636468589, 'education-num': 1.6045539403213467, 'age': 0.7256809588222342, 'native-country': 0.12402395138755415, 'education': 0.11257727383884955, 'occupation': 0.07968140603415755, 'workclass': 0.06790177566840519, 'relationship': 0.05517242105055461, 'race': 0.04102209398992616, 'sex': 0.027266104785524138, 'marital-status': 0.02691054208557625, 'income': 0.013810954447568935}
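
As a worked check of the ratio, the sketch below recomputes the S-value for four columns using the weights and sort times printed earlier in this notebook. The dictionary names are hypothetical; the numbers are copied from the outputs above.

```python
# Weights and sort times (ms) copied from the notebook outputs above
weights_sample = {"capital-loss": 13, "fnlwgt": 15, "marital-status": 5, "income": 1}
sort_ms_sample = {
    "capital-loss": 2.997159957885742,
    "fnlwgt": 7.434606552124023,
    "marital-status": 185.80079078674316,
    "income": 72.40629196166992,
}

# S-value = weight / sort time; a higher value means the column is encrypted earlier
s_values = {col: weights_sample[col] / sort_ms_sample[col] for col in weights_sample}
order = sorted(s_values, key=s_values.get, reverse=True)
print(order)  # ['capital-loss', 'fnlwgt', 'marital-status', 'income']
```

`fnlwgt` has the highest weight (15) but still ranks below `capital-loss` because its sort took more than twice as long.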


**Finally, implementing the DED Algorithm**

In [30]:
# DED

# time constraint in milliseconds (the timings above were multiplied by 1000)
tc = 100000

# 1. Sequence of execution from S-Table
# 2. using the et dataset to evaluate the time constraint and also replace the data with encrypted data

# the loop variable is named col so it does not overwrite the encryption key stored in `key`
for col in sortedSt.keys():
  for idx, el in enumerate(df[col]):
    if tc >= et[col][el]['time']:
      df.loc[idx, col] = et[col][el]['data']
      tc = tc - et[col][el]['time']

In [32]:
# Let's take a look at resulting dataset
df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516.0,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,b'8323049c46985b1f',b'0ed705ec7fac3627',b'aa94490952c42c39',United-States,<=50k
1,50,Self-emp-not-inc,83311.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,b'0ed705ec7fac3627',b'0ed705ec7fac3627',b'27ddc835a5acb313',United-States,<=50k
2,38,Private,215646.0,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,b'0ed705ec7fac3627',b'0ed705ec7fac3627',b'aa94490952c42c39',United-States,<=50k
3,53,Private,234721.0,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,b'0ed705ec7fac3627',b'0ed705ec7fac3627',b'aa94490952c42c39',United-States,<=50k
4,28,Private,338409.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,b'0ed705ec7fac3627',b'0ed705ec7fac3627',b'aa94490952c42c39',Cuba,<=50k
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,215419.0,Bachelors,13.0,Divorced,Prof-specialty,Not-in-family,White,Female,b'0ed705ec7fac3627',b'0ed705ec7fac3627',36,United-States,<=50k
48838,64,?,321403.0,HS-grad,9.0,Widowed,?,Other-relative,Black,Male,b'0ed705ec7fac3627',b'0ed705ec7fac3627',40,United-States,<=50k
48839,38,Private,374983.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Husband,White,Male,b'0ed705ec7fac3627',b'0ed705ec7fac3627',50,United-States,<=50k
48840,44,Private,83891.0,Bachelors,13.0,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,b'c20324bf42eb790e',b'0ed705ec7fac3627',40,United-States,<=50k
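
The output shows the effect of the time constraint: `capital-loss` and `capital-gain` are fully encrypted, while the last rows of `hours-per-week` are still plain values. The loop can be sketched on toy data as follows. This is a plain-Python illustration under stated assumptions: `apply_budget`, the toy table, and the cached costs are hypothetical, and a dict of lists stands in for the DataFrame.

```python
def apply_budget(order, table, cache, budget_ms):
    """Replace values with cached ciphertext while the remaining budget covers the cost."""
    remaining = budget_ms
    result = {col: list(values) for col, values in table.items()}
    for col in order:
        for i, value in enumerate(table[col]):
            entry = cache[col][value]
            if remaining >= entry["time"]:
                result[col][i] = entry["data"]
                remaining -= entry["time"]
            # no break: a cheaper entry later on may still fit the budget
    return result, remaining

# Hypothetical toy table and pre-computed encryption cache
table = {"a": [1, 2], "b": ["x", "y"]}
cache = {
    "a": {1: {"data": "E1", "time": 2.0}, 2: {"data": "E2", "time": 2.0}},
    "b": {"x": {"data": "EX", "time": 2.0}, "y": {"data": "EY", "time": 1.0}},
}
encrypted, left = apply_budget(["a", "b"], table, cache, budget_ms=5.0)
print(encrypted)  # {'a': ['E1', 'E2'], 'b': ['x', 'EY']}
print(left)       # 0.0
```

Column `a` comes first in the order and is fully encrypted. In column `b`, `"x"` costs more than the 1.0 ms left and stays in plain text, while the cheaper `"y"` still fits.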
