# Data Obfuscation

A common scenario encountered by data scientists is sharing data with others. A lot of data we collect today can easily be linked to an individual, household or entity. What should we do when we need to share these data with others?

A common solution is to remove these fields before sharing the data. However, analysis may rely on personal data.

# Cryptographic Hash

One method of anonymising personal data is hashing. A hash function maps arbitrary strings of data to a fixed-length bit array. The function is deterministic and public, but the mapping should appear random. Hash functions do not use a secret key. Since there are no secrets and the function itself is public, anyone can evaluate it.

Hash algorithms operate on bytes, so both alphanumeric and non-alphanumeric characters can be hashed. The resulting bit array can be returned as raw bytes or as a hexadecimal string.
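As a minimal sketch of the two output formats, the snippet below hashes a string containing non-alphanumeric characters. The sample text and the choice of SHA-256 are assumptions made only for illustration.

```python
import hashlib

# Text must be encoded to bytes before hashing; UTF-8 handles any character.
message = "héllo, wörld! ✓".encode("utf-8")

digest_bytes = hashlib.sha256(message).digest()     # raw bytes
digest_hex = hashlib.sha256(message).hexdigest()    # hexadecimal string

print(len(digest_bytes))                 # number of bytes in the digest
print(len(digest_hex))                   # two hex characters per byte
print(digest_bytes.hex() == digest_hex)  # same value, two representations
```

Both calls describe the same digest; the hexadecimal form is simply easier to print and store as text.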

## Properties of Cryptographic Hashes

- **Pre-Image Resistance**: For essentially all pre-specified outputs, it is computationally infeasible to find any input which hashes to that output. This means that a hash can be computed relatively easily for a given string(s), but inverting the output to find the original string(s) is difficult.


- **Second Pre-Image Resistance**: It is computationally infeasible to find any second input which has the same output as any specified input. This means given a certain string input, it should be difficult to find another input that produces the same hash. Also known as Weak Collision Resistance.


- **Collision Resistance**: It is computationally infeasible to find any two distinct inputs which hash to the same output. This means it should be difficult to find two different strings that create the same hash.

## Hashing with Python's `hashlib` module

[hashlib](https://docs.python.org/3/library/hashlib.html) is a convenient hashing module that ships with the Python interpreter. No additional installation is necessary.

Each hashing algorithm is exposed as its own function in `hashlib`.

In [1]:
import hashlib
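The following sketch shows three equivalent ways of calling `hashlib`: a named constructor, the generic `hashlib.new`, and incremental hashing with `update`. The input `b"hello"` is only an example value.

```python
import hashlib

data = b"hello"

# A named constructor and the generic constructor give the same result.
via_named = hashlib.sha256(data).hexdigest()
via_new = hashlib.new("sha256", data).hexdigest()
print(via_named == via_new)

# Data can also be fed in chunks, which is useful for large files.
hasher = hashlib.sha256()
hasher.update(b"hel")
hasher.update(b"lo")
via_chunks = hasher.hexdigest()
print(via_chunks == via_named)

# Algorithms that are available on every platform.
print(sorted(hashlib.algorithms_guaranteed))
```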

## MD5 (Message Digest 5)

MD5 is a very popular hashing algorithm that was created a long time ago. It takes a given input and produces a 128-bit value, usually written as 32 hexadecimal digits. The algorithm was developed in the 1990s and has since been broken. It should not be used for security purposes.

**Now creating an MD5 hash for the word hello**

In [2]:
md5_hash_hex = hashlib.md5(b"hello").hexdigest()
print(
    f'The MD5 hexadecimal hash value for "hello" is'
    f": {md5_hash_hex}\nThe length of the hash is"
    f": {len(md5_hash_hex)} characters"
)

The MD5 hexadecimal hash value for "hello" is: 5d41402abc4b2a76b9719d911017c592
The length of the hash is: 32 characters


The code below shows that the MD5 algorithm is broken because it does not provide `Collision Resistance`. A famous [Cryptography paper](http://merlot.usc.edu/csac-f06/papers/Wang05a.pdf) by Xiaoyun Wang and Hongbo Yu shows that they were able to break Collision Resistance for MD5 with the two strings below. Even though the strings are different, they produce the same hash. Over time, cryptographers were able to find more violations of Collision Resistance in the MD5 algorithm.

In the example below, we see that for the two different strings `string_1` and `string_2`, the MD5 hexadecimal hash values are the same, which violates Collision Resistance.

In [3]:
string_1 = "d131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f8955ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5bd8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70"
string_2 = "d131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4004583eb8fb7f8955ad340609f4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c5bd8823e3156348f5bae6dacd436c919c6dd53e23487da03fd02396306d248cda0e99f33420f577ee8ce54b67080280d1ec69821bcb6a8839396f965ab6ff72a70"

print(f"Check to see if the strings are the same: {string_1 == string_2}")

# convert the hexadecimal strings into bytes
string_1_hex = bytearray.fromhex(string_1)
string_2_hex = bytearray.fromhex(string_2)

# this is an example of collision where MD5 fails
print(
    f"Using the MD5 algorithm, we see that this is a Collision and that the algorithm fails : "
    f"{hashlib.md5(string_1_hex).hexdigest() == hashlib.md5(string_2_hex).hexdigest()}"
)

Check to see if the strings are the same: False
Using the MD5 algorithm, we see that this is a Collision and that the algorithm fails : True
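As a hedged follow-up, the sketch below feeds the same two inputs to both MD5 and SHA-256. The two hexadecimal strings are copied from the cell above; the variable names are new and used only here.

```python
import hashlib

# The two colliding inputs from the cell above, as hexadecimal text.
hex_1 = "d131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f8955ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5bd8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70"
hex_2 = "d131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4004583eb8fb7f8955ad340609f4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c5bd8823e3156348f5bae6dacd436c919c6dd53e23487da03fd02396306d248cda0e99f33420f577ee8ce54b67080280d1ec69821bcb6a8839396f965ab6ff72a70"

bytes_1 = bytes.fromhex(hex_1)
bytes_2 = bytes.fromhex(hex_2)

md5_collides = hashlib.md5(bytes_1).digest() == hashlib.md5(bytes_2).digest()
sha256_collides = (
    hashlib.sha256(bytes_1).digest() == hashlib.sha256(bytes_2).digest()
)

print(f"Inputs identical:  {bytes_1 == bytes_2}")
print(f"MD5 collision:     {md5_collides}")
print(f"SHA-256 collision: {sha256_collides}")
```

The pair was constructed specifically against MD5, so a different hash function is not expected to map the two inputs to the same value.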


## SHA hash functions 

Secure Hash Algorithm (SHA) is a family of cryptographic hash functions. These one-way algorithms take input of any size, mix it up and create fixed-size outputs. It is computationally infeasible to transform an output back into the original data. The family members SHA-1, SHA-224, SHA-256, SHA-384 and SHA-512 produce hashes from 160 bits to 512 bits (20 to 64 bytes). Among them, SHA-256 and SHA-512 are considered secure. SHA-1 is not: in 2017, Google and CWI Amsterdam [reported](https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html) the first ever hash collision for the SHA-1 algorithm. According to this report, two different inputs can generate the same hash, which violates collision resistance.
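The output size of each family member can be read directly from `hashlib`, as in this short sketch. The dictionary name `digest_bits` is introduced here for illustration only.

```python
import hashlib

# digest_size is reported in bytes; multiply by 8 to get bits.
digest_bits = {}
for name in ("sha1", "sha224", "sha256", "sha384", "sha512"):
    digest_bits[name] = hashlib.new(name).digest_size * 8
    print(f"{name}: {digest_bits[name]} bits")
```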


**Now creating a SHA-256 hash for the word hello**

In [4]:
sha_hash_hex = hashlib.sha256(b"hello").hexdigest()
print(
    f'The SHA-256 hexadecimal hash value for "hello" is'
    f": {sha_hash_hex}\nThe length of the hash is"
    f": {len(sha_hash_hex)} characters"
)

The SHA-256 hexadecimal hash value for "hello" is: 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
The length of the hash is: 64 characters


SHA-256 satisfies every condition of a secure hash function, so we cannot compute the original value from a SHA-256 hash directly. However, attackers have other ways of retrieving the original value from a given hash. One of them is known as a dictionary attack.

### Dictionary attack

A dictionary attack is a type of brute force attack where an attacker tries to access an account by iterating through a dictionary of common phrases and words. The size of the dictionaries can vary from hundreds of thousands of password variations to billions. A dictionary containing over 1 billion unique words takes up only 15 gigabytes.

An example of a dictionary attack is shown below, where the hacker uses a dictionary of common words and their hash values to get to the original value.

In [5]:
# dictionary that we will use to try to break a password
data_dict_attack = {
    "123456": hashlib.sha256(b"123456").hexdigest(),
    "password": hashlib.sha256(b"password").hexdigest(),
    "myname": hashlib.sha256(b"myname").hexdigest(),
    "testpass": hashlib.sha256(b"testpass").hexdigest(),
    "password25": hashlib.sha256(b"password25").hexdigest(),
    "asdf": hashlib.sha256(b"asdf").hexdigest(),
    "123456789": hashlib.sha256(b"123456789").hexdigest(),
    "iloveyou": hashlib.sha256(b"iloveyou").hexdigest(),
    "sunshine": hashlib.sha256(b"sunshine").hexdigest(),
    "basketball": hashlib.sha256(b"basketball").hexdigest(),
}

Now, if a hacker has got hold of the `hash value` and the original word is in the hacker's dictionary, the hacker can retrieve it by brute force, comparing the hash against every word in the dictionary.

In [6]:
hash_value = "a941a4c4fd0c01cddef61b8be963bf4c1e2b0811c037ce3f1835fddf6ef6c223"

for org_word, word_hash in data_dict_attack.items():
    if word_hash == hash_value:
        print(org_word)
    else:
        print("No word found")

No word found
No word found
No word found
No word found
No word found
No word found
No word found
No word found
sunshine
No word found


In the above example, `sunshine` is a common word that is present in the hacker's dictionary, so the hacker was able to identify the original word for the given hash value. This is what happens when a password is saved in a system as a plain hash value: if a hacker gets hold of the hash value, they can recover the password whenever it is in their dictionary.

One way to make a password or original word more secure is to add a salt value.
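In practice an attacker would not loop over the dictionary for every stolen hash. A minimal sketch of the more efficient approach, assuming a small hypothetical word list, is to precompute a reverse table from hash to word so that each lookup is a single dictionary access:

```python
import hashlib

# Hypothetical word list; real dictionaries are far larger.
common_words = ["123456", "password", "sunshine", "iloveyou"]

# Precompute hash -> word once.
reverse_lookup = {
    hashlib.sha256(word.encode("utf-8")).hexdigest(): word
    for word in common_words
}


def crack(hash_value):
    """Return the original word if its hash is in the table, else None."""
    return reverse_lookup.get(hash_value)


leaked_hash = hashlib.sha256(b"sunshine").hexdigest()
unknown_hash = hashlib.sha256(b"a long and unusual phrase").hexdigest()

print(crack(leaked_hash))
print(crack(unknown_hash))
```

A word that is not in the table simply yields `None`, which is why uncommon inputs are harder to recover this way.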

### What is Salt?

A [salt](https://en.wikipedia.org/wiki/Salt_(cryptography)) is a random character string that is added to the beginning or the end of a password/original word. This salt is unique to each user, and is stored in the database along with the username and salted-hashed word/password.

Salt addresses the dictionary-attack problem shown above. Even if the given word is in the hacker's dictionary, the hash of the salted word will not match any precomputed entry, so the hacker cannot trace the original word by a simple lookup. Salt increases the search space that attackers must cover to brute-force passwords, because the dictionary has to be recomputed for every salt.

In the example below, we demonstrate how the addition of a salt makes it harder for the hacker to trace the word. 

In [7]:
import string
import secrets

# the original word (in practice, the password chosen by the user)
org_word = "sunshine"

# characters that can be used for the salt value: ASCII letters and digits
char_list = string.ascii_letters + string.digits

# build a 10-character salt from a cryptographically secure random source
salt = "".join(secrets.choice(char_list) for _ in range(10))

# concatenate the original word and the salt
salted_word = org_word + salt

print(f"The salt is: {salt}\nThe new word is: {salted_word}")

# hash the salted word
salted_hash = hashlib.sha256(salted_word.encode("utf-8"))
print(salted_hash.hexdigest())

The salt is: F2sbncxCq2
The new word is: sunshineF2sbncxCq2
1d483c49ef898c03db2e727265342cf27150f196b5a9c6dbf47646851e65e1f2


Now we run the same dictionary attack against a salted hash to see if we can trace back the original word `sunshine`.

In [8]:
# Dictionary Attack on Salt Adjusted word

# no luck on cracking the hash adjusted with a salt value
intercepted_salt_hash = (
    "9df872a433b8a32a5d29760abd6a3504d3772b49cee6dcd6eb885e5b6b5c18f9"
)

for word, word_hash in data_dict_attack.items():
    if word_hash == intercepted_salt_hash:
        print(word)
    else:
        print("No word found")

No word found
No word found
No word found
No word found
No word found
No word found
No word found
No word found
No word found
No word found


We see that, because of the added random salt, the hacker could not trace the original word even though `sunshine` was in the sample dictionary.

An important application of salts is storing users' passwords in a database. The user enters a password, which is combined with the salt and passed through a hash function that maps it to a fixed-length string of seemingly random characters.
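A minimal sketch of that storage flow is given below. The function names `hash_password` and `verify_password` are hypothetical, the salt is prepended to the password, and plain salted SHA-256 is used only to keep the illustration short; it is not a recommendation for production systems.

```python
import hashlib
import hmac
import secrets


def hash_password(password, salt=None):
    """Return (salt, digest); a fresh random salt is generated if none is given."""
    if salt is None:
        salt = secrets.token_bytes(16)  # 128-bit random salt, unique per user
    digest = hashlib.sha256(salt + password.encode("utf-8")).hexdigest()
    return salt, digest


def verify_password(password, salt, stored_digest):
    """Recompute the salted hash and compare it in constant time."""
    _, candidate = hash_password(password, salt)
    return hmac.compare_digest(candidate, stored_digest)


# At registration: store the salt and the digest next to the username.
stored_salt, stored_digest = hash_password("sunshine")

# At login: recompute with the stored salt and compare.
print(verify_password("sunshine", stored_salt, stored_digest))
print(verify_password("sunshine1", stored_salt, stored_digest))
```

`hmac.compare_digest` is used for the comparison so that the check takes the same time whether or not the digests match.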

Some of the common mistakes with salts are **salt reuse** and **short salt**.

**Salt reuse** occurs when a developer uses the same salt for each password and basically makes the salt useless. Users with the same password will have the same salted-hashed password and if an attacker can guess the salt, they can brute-force the passwords with ease.

**Short salt** refers to a salt whose length is too short. The problem is that attackers can simply generate every combination of characters of that length and pair each one with common passwords to brute-force some of your users' passwords.
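Both mistakes can be illustrated with a short sketch. The helper `salted_hash`, the user names and the shared salt value are hypothetical; the final loop simply counts how many salts exist for a given length over a 62-character alphabet.

```python
import hashlib
import secrets
import string


def salted_hash(password, salt):
    """Hash the password with the salt appended."""
    return hashlib.sha256((password + salt).encode("utf-8")).hexdigest()


# Salt reuse: two users with the same password get the same stored hash.
shared_salt = "Xy12"
alice_reused = salted_hash("sunshine", shared_salt)
bob_reused = salted_hash("sunshine", shared_salt)
print("Same hash with a reused salt:", alice_reused == bob_reused)

# Unique salts: the same password produces unrelated hashes.
alice_unique = salted_hash("sunshine", secrets.token_hex(16))
bob_unique = salted_hash("sunshine", secrets.token_hex(16))
print("Same hash with unique salts:", alice_unique == bob_unique)

# Short salt: the number of possible salts grows exponentially with length.
alphabet_size = len(string.ascii_letters + string.digits)
for length in (2, 4, 10):
    print(f"{length} characters -> {alphabet_size ** length} possible salts")
```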

## PBKDF2

Traditionally, words, or more importantly passwords, were hashed using an algorithm like MD5, SHA-1 or SHA-256 and then stored in the database. These stored hashes would then be used as part of a password-based authentication system. For years, hashing has been the accepted standard for using and storing passwords, but it does have its problems. Hash functions like MD5, SHA-1 and SHA-256 are general-purpose hash functions designed to calculate a message digest of a large amount of data in the shortest time possible. This makes them excellent for ensuring the integrity of data but not very good for storing passwords. As CPUs and GPUs get faster, attackers can write code that recovers hashed passwords more easily and quickly. A salt does give some protection against these attacks, but we should not rely on it alone: a salted hash is still vulnerable to brute-force attacks in which the attacker tries out every possible key combination until they hit upon the correct one.

To add another layer of security, we can also define how many times the hashing function runs, or in other words how many iterations it performs. Adding more iterations makes the hashing function slower. Why in the world would we want to slow down our function intentionally? Because a slower function is more resistant to brute-force attacks: it takes longer to test each possible key combination, which makes such attacks much more difficult to pull off successfully. Hence, when it comes to password hashing, slower is, strangely enough, better.

To get around this problem we have the password-based key derivation function known as PBKDF2, also called a key stretching algorithm. Key stretching is the more general term for password-based key derivation functions that guard against brute-force attacks by increasing the time it takes to test each possible key. PBKDF2 is part of RSA Laboratories' Public-Key Cryptography Standards, or PKCS for short.

PBKDF2 is similar to a hash function. You provide it with the password to hash and a salt, but you do not pre-combine the password and the salt as before. You also provide a number-of-iterations parameter, which tells the algorithm how many times to execute before returning the hashed password. This parameter lets you algorithmically slow down key generation and helps guard against dictionary attacks.


PBKDF2 is a simple cryptographic key derivation function that is resistant to dictionary attacks and rainbow table attacks. It is based on iteratively computing an HMAC many times with some padding. PBKDF2 is described in the internet standard [RFC 2898](https://www.ietf.org/rfc/rfc2898.txt). It takes several input parameters and produces the derived key as output.

```
key = pbkdf2(password, salt, iterations-count, hash-function, derived-key-len)
```

Technically, the input data for PBKDF2 consists of :

- password : array of bytes/string, e.g. "sunshine!25" (8-10 chars minimum length is recommended)

- salt : securely generated random bytes, e.g. "df1f2d3f4d77ac66e9c5a6c3d8f921b6" (minimum 64 bits, 128 bits is recommended)

- iterations-count : e.g. 1024 iterations

- hash-function : e.g. SHA-256

- derived-key-len : for the output, e.g. 32 bytes (256 bits)

The output data is the derived key of requested length.

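In the next case, we write some demo code to derive a key from a password using the PBKDF2 algorithm. On Python 3.4 and later, `hashlib.pbkdf2_hmac` from the standard library takes the same arguments and needs no installation; the demo here uses the `backports.pbkdf2` package, which can be installed using the command: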
```pip install backports.pbkdf2```

In [9]:
import binascii
from backports.pbkdf2 import pbkdf2_hmac

salt = binascii.unhexlify("aaef2d3f4d77ac66e9c5a6c3d8f921d1")
passwd = "sunshine!25".encode("utf8")
key = pbkdf2_hmac("sha256", passwd, salt, 100, 32)
print("Derived key:", binascii.hexlify(key))

Derived key: b'6231dbad23daaa36b05ef3e0b93c5173b013ccf4f224fce3b43c7c49528ec3a6'


Changing the number of iterations changes the execution time.
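To see the effect, the sketch below times the standard library's `hashlib.pbkdf2_hmac` for a few iteration counts. The helper `timed_pbkdf2` and the chosen counts are assumptions made for illustration, and the measured times depend on the machine.

```python
import hashlib
import time

password = b"sunshine!25"
salt = bytes.fromhex("aaef2d3f4d77ac66e9c5a6c3d8f921d1")


def timed_pbkdf2(iterations):
    """Derive a 32-byte key and return it with the elapsed time in seconds."""
    start = time.perf_counter()
    key = hashlib.pbkdf2_hmac("sha256", password, salt, iterations, dklen=32)
    return key, time.perf_counter() - start


derived_keys = {}
for iterations in (100, 10_000, 100_000):
    key, elapsed = timed_pbkdf2(iterations)
    derived_keys[iterations] = key
    print(f"{iterations:>7} iterations: {elapsed:.4f} s, key = {key.hex()}")
```

The derived key changes with the iteration count, so the count must be stored alongside the salt in order to verify a password later.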

PBKDF2 lets you configure the number of iterations and thus the time required to derive the key.

- Slower key derivation means longer login time/slower decryption and higher resistance to password cracking attacks.


- Faster key derivation means shorter login time/faster decryption and lower resistance to password cracking attacks.


- PBKDF2 is not resistant to [GPU attacks](https://security.stackexchange.com/questions/118147/how-are-gpus-used-in-brute-force-attacks) (parallel password cracking using video cards) or to [ASIC attacks](https://en.wikipedia.org/wiki/Custom_hardware_attack) (specialized password cracking hardware). This is the main motivation behind more modern KDFs. Today PBKDF2 is considered old-fashioned and less secure than modern KDFs, so it is recommended to use Bcrypt, Scrypt or Argon2 instead.

## Bcrypt

[Bcrypt](https://en.wikipedia.org/wiki/Bcrypt) is another cryptographic KDF. It is older than Scrypt and less resistant to ASIC and GPU attacks. It provides a configurable iteration count but uses a constant amount of memory, which makes it easier to build hardware-accelerated password crackers against it.

In the demo below, we show how to hash a password using bcrypt.

In [10]:
import bcrypt

hashed_pw = bcrypt.hashpw(b"sunshine!25", bcrypt.gensalt())
print(hashed_pw)

b'$2b$12$NguolIeT7yM6TfQSPojAR.ScFRb7sDDlTH/4iIoMEMhwSKwkLmZGe'


`bcrypt.gensalt()` generates a random salt, which `hashpw` stores as part of the output string. If we were to hash passwords without salts, an attacker could use a dictionary attack to find the original word.
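The output string printed above packs several fields together. As a sketch that needs only the standard library, it can be split into the algorithm version, the cost factor, the 22-character encoded salt and the 31-character encoded hash. The variable names are introduced here for illustration.

```python
# The bcrypt output from the cell above.
hashed_pw = b"$2b$12$NguolIeT7yM6TfQSPojAR.ScFRb7sDDlTH/4iIoMEMhwSKwkLmZGe"

# Fields are separated by "$"; the first element is empty.
_, version, cost, salt_and_hash = hashed_pw.decode("ascii").split("$")

salt_encoded = salt_and_hash[:22]   # 128-bit salt in bcrypt's base64 variant
hash_encoded = salt_and_hash[22:]   # the password hash itself

print("Version:", version)
print("Cost factor:", cost, "->", 2 ** int(cost), "key-expansion rounds")
print("Salt:", salt_encoded)
print("Hash:", hash_encoded)
```

Because the salt and cost travel with the hash, a verifier needs only this one string and the candidate password.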

Bcrypt dates from 1999 and was designed to resist GPU and ASIC attacks: computing a bcrypt hash is not only CPU-intensive but also needs a small, fixed block of RAM that is accessed constantly.
However, times have changed. A sophisticated and well-funded attacker can use big and powerful FPGAs, and contemporary models have embedded RAM blocks that greatly speed up this job. So while Bcrypt does a good job of making life difficult for an ASIC attacker, it does little against an FPGA one.

## Scrypt

[Scrypt](https://en.wikipedia.org/wiki/Scrypt) ([RFC 7914](https://datatracker.ietf.org/doc/html/rfc7914.html)) is a strong cryptographic key derivation function (KDF). It is memory-intensive and designed to resist GPU, ASIC and FPGA attacks (highly efficient password cracking hardware).

The Scrypt algorithm takes several input parameters and produces the derived key as output:

```
key = Scrypt(password, salt, N , r, p, derived-key-len)
```

The Scrypt coding parameters are:

- N : iterations count (affects memory and CPU usage); it must be a power of 2 and greater than 1

- r : block size (affects memory and CPU usage), e.g. 8

- p : parallelism factor (threads to run in parallel; affects memory and CPU usage), usually 1

- password : input password (8-10 chars minimum length is recommended)

- salt : securely-generated random bytes (64 bits minimum, 128 bits recommended)

- derived-key-len : how many bytes to generate as output, e.g. 32 bytes (256 bits).


The memory in Scrypt is accessed in strongly dependent order at each step, so the memory access speed is the algorithm's bottleneck. The memory required to compute Scrypt key derivation is calculated as follows:

```Memory required = 128 * N * r * p bytes```
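As a worked example of the formula above, the sketch below evaluates it for a few parameter sets. The function name `scrypt_memory_bytes` and the parameter choices are assumptions for illustration.

```python
def scrypt_memory_bytes(n, r, p):
    """Memory estimate from the formula above: 128 * N * r * p bytes."""
    return 128 * n * r * p


MIB = 1024 * 1024

for n, r, p in [(2048, 8, 1), (16384, 8, 1), (2**20, 8, 1)]:
    mib = scrypt_memory_bytes(n, r, p) / MIB
    print(f"N={n}, r={r}, p={p}: {mib:.0f} MiB")
```

With N = 2048, r = 8 and p = 1 the estimate is 128 * 2048 * 8 = 2,097,152 bytes, i.e. 2 MiB; raising N to 2^20 pushes it to 1 GiB.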

In the next case, we write some demo code to derive a key from a password using the Scrypt algorithm. First, install the package `pyscrypt` using the command:
```pip install pyscrypt```

In [21]:
import pyscrypt

salt = b"aa1f2d3f4d23ac44e9c5a6c3d8f9ee8c"
passwd = b"sunshine!25"
key = pyscrypt.hash(passwd, salt, 2048, 8, 1, 32)
print("Derived key:", key.hex())

Derived key: 3d859f266f7c76d73431a115e071c5df41bf0ea4e80bda47d637ada76aa492e4


Try changing the number of iterations or the block size and see how they affect the execution time. When configured properly, Scrypt is considered a highly secure KDF, so you can use it as a general-purpose password-to-key derivation algorithm, e.g. when encrypting wallets, files or app passwords.
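One way to run that experiment without an extra package is the standard library's `hashlib.scrypt`. This is a sketch under the assumption that the Python build is linked against OpenSSL 1.1 or later, which `hashlib.scrypt` requires; the helper `derive` and the chosen values of N are illustrative, and timings vary by machine.

```python
import hashlib
import time

password = b"sunshine!25"
salt = b"aa1f2d3f4d23ac44e9c5a6c3d8f9ee8c"


def derive(n, r=8, p=1):
    """Derive a 32-byte key with scrypt and return it with the elapsed time."""
    start = time.perf_counter()
    key = hashlib.scrypt(password, salt=salt, n=n, r=r, p=p, dklen=32)
    return key, time.perf_counter() - start


scrypt_keys = {}
for n in (2**10, 2**12, 2**13):
    key, elapsed = derive(n)
    scrypt_keys[n] = key
    print(f"N={n:>5}: {elapsed:.4f} s, key = {key.hex()}")
```

Larger values of N need more memory, and `hashlib.scrypt` raises an error once its memory limit (`maxmem`) is exceeded.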

## Argon2

[Argon2](https://en.wikipedia.org/wiki/Argon2) is a modern ASIC-resistant and GPU-resistant secure key derivation function. It has better password cracking resistance (when configured correctly) than PBKDF2, Bcrypt, and Scrypt (for similar configuration parameters for CPU and RAM usage).

Argon2 has several variants:

- **Argon2d** : provides strong GPU resistance, but is open to potential side-channel attacks (possible in very special situations).

- **Argon2i** : provides less GPU resistance, but is resistant to side-channel attacks.

- **Argon2id** : recommended (combines Argon2d and Argon2i).

Argon2 has the following config parameters, which are very similar to those of Scrypt:

- password P: the password (or message) to be hashed

- salt S: random-generated salt (16 bytes recommended for password hashing)

- iterations t: number of iterations to perform

- memorySizeKB m: amount of memory (in kilobytes) to use

- parallelism p: degree of parallelism (i.e. number of threads)

- outputKeyLength T: desired number of returned bytes

In the next case, we are writing some demo code to derive the key from a password using the Argon2 algorithm. Firstly, install the package `argon2_cffi` using the command:
```pip install argon2_cffi```

In [12]:
import argon2, binascii

# derive a raw 32-byte key with a fixed salt (fixed only so the demo is repeatable)
raw_hash = argon2.hash_password_raw(
    time_cost=16,
    memory_cost=2**15,
    parallelism=2,
    hash_len=32,
    password=b"sunshine!25",
    salt=b"saltname",
    type=argon2.low_level.Type.ID,
)
print("Argon2 raw hash:", binascii.hexlify(raw_hash))

# PasswordHasher generates a random salt and returns an encoded hash string
argon2Hasher = argon2.PasswordHasher(
    time_cost=16, memory_cost=2**15, parallelism=2, hash_len=32, salt_len=16
)
encoded_hash = argon2Hasher.hash("sunshine!25")
print("Argon2 hash (random salt):", encoded_hash)

verifyValid = argon2Hasher.verify(encoded_hash, "sunshine!25")
print("Argon2 verify (correct password):", verifyValid)

# verify() raises VerifyMismatchError when the password does not match
try:
    argon2Hasher.verify(encoded_hash, "sunshine")
except argon2.exceptions.VerifyMismatchError:
    print("Argon2 verify (incorrect password):", False)

Argon2 raw hash: b'bffe5decc2c3a2343cae3b5b77d8068439e1c52486636d1582a18fc205866985'
Argon2 hash (random salt): $argon2id$v=19$m=32768,t=16,p=2$7fNUa5SnVarI3QKVFLp6EQ$3Gfyw0Pcw9ryqkjYRg6KsCpFVR4CZrtTMe5dzHtwXzg
Argon2 verify (correct password): True
Argon2 verify (incorrect password): False


The above code first derives a 'raw hash' (a 256-bit key) using the Argon2 key derivation function. It then derives an 'argon2 hash', an encoded string that holds the algorithm parameters along with the random salt and the derived key; this string is what is stored and used for password verification. Finally, the stored hash is verified against a correct and a wrong password.

The argon2 hash in the above output is written in a standardized format, which holds the Argon2 algorithm config parameters, the random salt and the derived key, separated by `$`:
```
argon2id

v=19

m=32768,t=16,p=2

7fNUa5SnVarI3QKVFLp6EQ

3Gfyw0Pcw9ryqkjYRg6KsCpFVR4CZrtTMe5dzHtwXzg
```
The first part is the algorithm name (argon2id), the second is the Argon2 version, and the third part is a list of algorithm parameters: memory cost (in KiB), time cost, and threads to be used (parallelism).

The fourth part is the random salt value, encoded in Base64. It is generated by `PasswordHasher.hash()` from a fresh random value on each execution, which is why the same input string gives a different hash output every time. The size of the salt here is 16 bytes.

The fifth and last part of the string contains the hash value, encoded in Base64. The hash size is 32 bytes.
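As a hedged sketch using only the standard library, the encoded string printed by the demo cell can be split into these parts and the Base64 fields decoded to check their sizes. The helper `b64decode_unpadded` is hypothetical; it restores the `=` padding that the encoded format omits.

```python
import base64

# Encoded hash printed by the Argon2 demo cell above.
encoded = (
    "$argon2id$v=19$m=32768,t=16,p=2"
    "$7fNUa5SnVarI3QKVFLp6EQ"
    "$3Gfyw0Pcw9ryqkjYRg6KsCpFVR4CZrtTMe5dzHtwXzg"
)


def b64decode_unpadded(text):
    """Decode Base64 text whose trailing '=' padding has been stripped."""
    return base64.b64decode(text + "=" * (-len(text) % 4))


# Fields are separated by "$"; the first element is empty.
_, algorithm, version, params, salt_b64, hash_b64 = encoded.split("$")

param_dict = dict(item.split("=") for item in params.split(","))
salt_bytes = b64decode_unpadded(salt_b64)
hash_bytes = b64decode_unpadded(hash_b64)

print("Algorithm:", algorithm)
print("Version:", version)
print("Parameters:", param_dict)
print("Salt length:", len(salt_bytes), "bytes")
print("Hash length:", len(hash_bytes), "bytes")
```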

When configured properly, Argon2 is considered a highly secure KDF, one of the best available in the industry, so you can use it as a general-purpose password-to-key derivation algorithm, e.g. when encrypting wallets, documents, files or app passwords. In the general case, Argon2 is recommended over Scrypt, Bcrypt and PBKDF2.

In the next case, we will obfuscate the first_name and last_name fields from the database using sha-256 and sha-512. We are not using a salt since we want the same name in different databases to produce the same hash value.
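The reason for leaving out the salt can be sketched with two small hypothetical name lists: because unsalted hashing is deterministic, the same name yields the same hash in both datasets, so records can still be matched after obfuscation. The helper `obfuscate` and the sample names are assumptions for illustration.

```python
import hashlib


def obfuscate(name):
    """Lower-case the name and return its SHA-256 hex digest."""
    return hashlib.sha256(name.lower().encode("utf-8")).hexdigest()


# Two hypothetical datasets that share some individuals.
dataset_a = ["Nathan", "Emma", "Kurt"]
dataset_b = ["emma", "Tracey", "Nathan"]

hashed_a = {obfuscate(name) for name in dataset_a}
hashed_b = {obfuscate(name) for name in dataset_b}

# Matching works on the hashes alone; the names are never compared directly.
shared = hashed_a & hashed_b
print("Individuals present in both datasets:", len(shared))
```

With a random salt per record, the two hashes of the same name would differ and this kind of join would no longer be possible.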

In [13]:
import pandas as pd
import hashlib

In [14]:
df = pd.read_csv("../data/raw/sample_data.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,First name,Last name,Email,Duration,Time joined,Time exited
0,0,Nathan,Burke,brittany44@brown-walton.com,0 days,02:25:50,21:26:44
1,1,Emma,Gonzalez,hotina@mueller-ford.com,0 days,00:38:31,17:25:35
2,2,Kurt,Henderson,chapmanjohn@smith.com,0 days,05:59:49,05:01:29
3,3,Tracey,Dunn,bzhang@wise.com,0 days,08:24:01,15:16:16
4,4,Kimberly,Barnes,unelson@hull.com,0 days,15:48:41,03:14:39


In [15]:
# Discarding first column from dataframe and renaming the remaining columns
df = df.iloc[:, 1:]
df.rename(
    columns={
        "First name": "First_name",
        "Last name": "Last_name",
        "Time joined": "Time_joined",
        "Time exited": "Time_exited",
    },
    inplace=True,
)

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   First_name   500 non-null    object
 1   Last_name    500 non-null    object
 2   Email        500 non-null    object
 3   Duration     500 non-null    object
 4   Time_joined  500 non-null    object
 5   Time_exited  500 non-null    object
dtypes: object(6)
memory usage: 23.6+ KB


In this case, our primary concern is to secure the names of the individuals in the dataset. The columns that need some kind of obfuscation are `First_name`, `Last_name` and `Email`. We cannot discard these columns since they take part in the analysis.

Here we merge the first name and last name into one single string and then apply a double hash to it: we first hash the name with sha-256 and then hash that result with sha-512. One important point is that hashing is highly sensitive to the case of the string: changing a single letter from upper case to lower case changes the hash value completely. Hence, we convert the string to lower case before hashing. For the `Email` column, the analysis requires only the domain name, so we first apply the split method to keep just the domain of each email and then merge the name columns into one string.
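The steps described above can be sketched on a single record without pandas. The helper `double_hash` is a hypothetical name; the sample name and email are taken from the first row of the dataset shown earlier.

```python
import hashlib


def double_hash(name):
    """Lower-case the name, hash it with SHA-256, then hash that hex digest with SHA-512."""
    inner = hashlib.sha256(name.lower().encode("utf-8")).hexdigest()
    return hashlib.sha512(inner.encode("utf-8")).hexdigest()


# Hashing is case-sensitive: one changed letter gives a different digest.
case_matters = (
    hashlib.sha256(b"Nathan").hexdigest() != hashlib.sha256(b"nathan").hexdigest()
)
print("Different hash for different case:", case_matters)

# Lower-casing first makes both spellings map to the same value.
same_after_lowering = double_hash("NathanBurke") == double_hash("nathanburke")
print("Same hash after lower-casing:", same_after_lowering)
print("Length of the final hex digest:", len(double_hash("nathanburke")))

# Keep only the domain part of an email address.
email_domain = "brittany44@brown-walton.com".split("@")[1]
print("Email domain:", email_domain)
```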

In [17]:
df["Email"] = df["Email"].apply(lambda x: x.split("@")[1])
df["Name"] = df["First_name"] + df["Last_name"]
df["Name"] = df["Name"].str.lower()
df.drop(columns=["First_name", "Last_name"], inplace=True)
df.head()

Unnamed: 0,Email,Duration,Time_joined,Time_exited,Name
0,brown-walton.com,0 days,02:25:50,21:26:44,nathanburke
1,mueller-ford.com,0 days,00:38:31,17:25:35,emmagonzalez
2,smith.com,0 days,05:59:49,05:01:29,kurthenderson
3,wise.com,0 days,08:24:01,15:16:16,traceydunn
4,hull.com,0 days,15:48:41,03:14:39,kimberlybarnes


After that we apply `hashlib` twice to replace each actual name with its corresponding hash. We hash `Name` with sha-256 and then hash the result with sha-512.

In [18]:
df["Name#"] = df["Name"].apply(
    lambda x: hashlib.sha512(
        (hashlib.sha256(x.encode()).hexdigest()).encode()
    ).hexdigest()
)
df.drop(columns=["Name"], inplace=True)

In [19]:
# Rearranging columns
df = df[["Name#", "Email", "Duration", "Time_joined", "Time_exited"]]

In [20]:
df.head()

Unnamed: 0,Name#,Email,Duration,Time_joined,Time_exited
0,278bc48acfc7659fea325fac0b7752d3736139626173ac...,brown-walton.com,0 days,02:25:50,21:26:44
1,fbcde97fc6248ab75be12f4f5959e7e032059ec135014b...,mueller-ford.com,0 days,00:38:31,17:25:35
2,e713b03c66b048bfdbe7fef9a72b0a0f21dc3729798e0c...,smith.com,0 days,05:59:49,05:01:29
3,5106113c025181bfb1e6e709984c4e2d2b949c502ee927...,wise.com,0 days,08:24:01,15:16:16
4,bb278af9f161ddb2b8993490553f0ebd18e16c5b99faae...,hull.com,0 days,15:48:41,03:14:39


# Conclusion

In this notebook, we explored some data obfuscation techniques used in industry, highlighting their weaknesses and strengths. SHA-256, although a very good hashing algorithm, is exposed to dictionary and brute-force attacks. These weaknesses are addressed by adding a salt and by using more powerful key derivation functions like PBKDF2, Bcrypt, Scrypt and Argon2, which we discussed in this notebook.

We then applied the double hashing technique with sha-256 (32-byte digest) and sha-512 (64-byte digest) to the dataframe we are analysing.