# Prepare the Greyc dataset
In this notebook, we prepare the Greyc-Web dataset, which is available online [here](http://www.labri.fr/perso/rgiot/ressources/GREYC-WebDataset.html), for machine learning.

## Imports & Setup

In [97]:
import os
import re
import glob
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


from dataprep import *

%load_ext autoreload
%autoreload 2
%matplotlib inline

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [74]:
DATA_DIR="/tf/data/datasets/greyc_web/output_numpy"
OUTPUT_DIR="/tf/data/preped/greyc_web/"
!mkdir -p {OUTPUT_DIR}

## Extracting the data
In this section, we load the Greyc Web dataset from disk.

### Users data
#### Loading the users
According to the website that accompanied the dataset:
```
The ‘user’ directory contains one file per user named user/user_xxx.txt, with xxx the id of the user. Each user file contains the following information (one information per line):

    the user id
    the login of the user
    the name of the user
    the gender of the user
    the age of the user
```
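
As a minimal sketch of the layout quoted above, the following parses the five lines of one user file into a typed record. The sample contents and the helper name `parse_user_lines` are hypothetical and used only for illustration; the notebook itself strips and converts the values in separate cells further down.

```python
def parse_user_lines(lines):
    """Parse the five lines of a user file into a dict (hypothetical helper)."""
    # strip the trailing newline that readlines() leaves on every line
    userid, login, name, gender, age = (line.strip() for line in lines)
    return {
        "userid": int(userid),
        "login": login,
        "name": name,
        "gender": gender,
        "age": int(age),
    }

# made-up sample contents, following the five-line layout quoted above
sample_lines = ["7\n", "jdoe\n", "example_name\n", "F\n", "25\n"]
user = parse_user_lines(sample_lines)
print(user)
```

Unpacking into exactly five names also acts as a cheap format check: a file with a missing or extra line raises a `ValueError` instead of silently producing a misaligned record.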


In [75]:
%%time

user_paths = glob.glob(f"{DATA_DIR}/users/*")

# load user data
def load_user(path):
    with open(path, "r") as f:
        userid, login, name, gender, age = f.readlines()
    
    user = {
        "userid": userid,
        "login": login, 
        "name": name,
        "gender": gender,
        "age": age
    }
    
    return user

users = [ load_user(p) for p in user_paths ]
users_df = pd.DataFrame(users)

CPU times: user 7.35 ms, sys: 615 µs, total: 7.96 ms
Wall time: 57.2 ms


In [76]:
users_df.sample(10)

| | userid | login | name | gender | age |
|---|---|---|---|---|---|
| 66 | 114\n | wchaisantikulwat\n | jaimepaslechoco\n | F\n | 31\n |
| 63 | 48\n | germinou\n | ggtruc_55\n | M\n | 20\n |
| 21 | 30\n | rosenberger\n | testaromain\n | M\n | 37\n |
| 99 | 63\n | pardigon\n | Groundation\n | M\n | 22\n |
| 92 | 112\n | kiki23\n | autruche\n | F\n | 19\n |
| 10 | 16\n | sauvageot\n | mot2passe\n | F\n | 22\n |
| 114 | 103\n | hatin\n | ow6d0\|)+\n | M\n | 22\n |
| 5 | 20\n | elabed\n | elabedpassword\n | M\n | 28\n |
| 8 | 117\n | zins\n | ostralopitek\n | M\n | 20\n |
| 34 | 40\n | bendaci\n | Heidar\n | M\n | 20\n |


In [77]:
users_df.describe(include="all").T

| | count | unique | top | freq |
|---|---|---|---|---|
| userid | 118 | 118 | 117\n | 1 |
| login | 118 | 118 | ajoly\n | 1 |
| name | 118 | 118 | motsecret\n | 1 |
| gender | 118 | 2 | M\n | 98 |
| age | 118 | 18 | 20\n | 31 |


#### Cleaning the users
The user data requires cleaning: every value ends with an unwanted `\n`, because `readlines()` keeps the line terminator.

In [78]:
users_df = users_df.applymap((lambda x: x.strip()))

In [79]:
users_df.head()

| | userid | login | name | gender | age |
|---|---|---|---|---|---|
| 0 | 45 | dassonville | yurgen24 | F | 20 |
| 1 | 19 | waubry | phiwil14 | M | 36 |
| 2 | 113 | kikilautruche | embuscade | F | 21 |
| 3 | 35 | bgodin | chapapa | M | 19 |
| 4 | 21 | marnier | g>5079@7113<m | M | 21 |


#### Configuring datatypes
We configure datatypes for the dataframe as follows:

In [80]:
users_df.userid = users_df["userid"].astype("int32")
users_df.gender = users_df["gender"].astype("category")
users_df.age = users_df["age"].astype("int32")

In [81]:
users_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118 entries, 0 to 117
Data columns (total 5 columns):
userid    118 non-null int32
login     118 non-null object
name      118 non-null object
gender    118 non-null category
age       118 non-null int32
dtypes: category(1), int32(2), object(2)
memory usage: 3.1+ KB


### Keystroke Dynamics
According to the website, the keystroke dynamics data is split into two parts:
1. `passphrases/` - where all users type the imposed username and password
2. `passwords/` - where users type their own username and password

#### Finding the Passphrase files
Each entry is stored in the directory structure `passphrases/user_<user_id>/<timestamp>`.
- Under each user folder there is a file called `captures.txt` that lists the entries for that user.
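
To make the layout concrete, here is a small sketch that builds a made-up directory tree of the same shape in a temporary folder and collects the entry paths from each `captures.txt`. The user ids, timestamps, and the helper name `list_entry_paths` are hypothetical.

```python
import os
import tempfile

def list_entry_paths(root):
    """Return the entry directories listed in each user's captures.txt."""
    paths = []
    for user_dir in sorted(os.listdir(root)):
        captures = os.path.join(root, user_dir, "captures.txt")
        if not os.path.isfile(captures):
            continue
        with open(captures, "r") as f:
            # one entry name per line; skip blank lines
            names = [line.strip() for line in f if line.strip()]
        paths.extend(os.path.join(root, user_dir, n) for n in names)
    return paths

# build a tiny, made-up tree that mimics user_<user_id>/<timestamp>
with tempfile.TemporaryDirectory() as root:
    for user_id, stamps in {"1": ["1000", "2000"], "2": ["3000"]}.items():
        user_dir = os.path.join(root, f"user_{user_id}")
        for stamp in stamps:
            os.makedirs(os.path.join(user_dir, stamp))
        with open(os.path.join(user_dir, "captures.txt"), "w") as f:
            f.write("\n".join(stamps) + "\n")
    entry_paths_demo = [os.path.relpath(p, root) for p in list_entry_paths(root)]

print(entry_paths_demo)
```

Reading `captures.txt` rather than globbing the timestamp folders means only the entries the dataset itself declares are loaded.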

In [82]:
# compute a list of entry paths
entry_paths = []
for user_dir in glob.glob(f"{DATA_DIR}/passphrases/user_*"):
    with open(f"{user_dir}/captures.txt", "r") as f:
        entries = f.readlines()
        entry_paths.extend([f"{user_dir}/{e.strip()}" for e in entries])

len(entry_paths)

10656

#### Loading Passphrase data
We load the whitelisted feature files for each entry in the dataset.

In [83]:
%%time 

features_whitelist = [
    "userAgent",
    "userid",
    "date",
    "genuine",
    "login",
    "password",
    "l_raw_press",
    "l_raw_release",
    "p_raw_press",
    "p_raw_release"
]

# load the entry at the given path as a dict of feature name -> raw file contents
def load_entry(path):
    feature_files = glob.glob(f"{path}/*")
    entry = {}
    for feature_file in feature_files:
        # skip features that have not been whitelisted
        feature_name = os.path.basename(feature_file).replace(".txt", "")
        if feature_name not in features_whitelist:
            continue

        # load the raw contents of the feature file
        with open(feature_file, "r") as f:
            entry[feature_name] = f.read()

    # extract the target user from the path (once per entry, not once per file)
    match = re.match(r".*user_([0-9]+).*", path)
    entry["target_userid"] = match.group(1)

    return entry

entries = [ load_entry(p) for p in entry_paths]
passphrase_df = pd.DataFrame(entries)

CPU times: user 3.7 s, sys: 2.24 s, total: 5.94 s
Wall time: 6.89 s


In [84]:
passphrase_df.userAgent.describe(include="all")

count                                                 10656
unique                                                  347
top       Mozilla/5.0 (X11; U; Linux i686; fr; rv:1.9.2....
freq                                                    492
Name: userAgent, dtype: object

#### Cleaning the dataframe
Configure the datatypes for the passphrase dataframe:
- the `date` feature should have the `datetime` datatype
- the `userid` feature should have the `int` datatype
- the `target_userid` feature should have the `int` datatype
- the `userAgent` feature should have the `category` datatype
- the `genuine` feature should have the `bool` datatype
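
One caveat about the boolean conversion: the features are read from text files, so the flag is a string at this point, and in Python every non-empty string is truthy, including `"0"`. A cast that follows truthiness, which is what `astype("bool")` does on an object-dtype column, would therefore mark every non-empty value as `True`. The sketch below shows the pitfall and an explicit parse. It assumes the flag is stored as the text `0` or `1`, which should be checked against the actual files; `parse_flag` is a hypothetical helper.

```python
def parse_flag(raw):
    """Parse a textual 0/1 flag into a bool (hypothetical helper)."""
    value = raw.strip()
    if value not in ("0", "1"):
        # fail loudly if the assumed 0/1 encoding does not hold
        raise ValueError(f"unexpected flag value: {raw!r}")
    return value == "1"

raw_flags = ["1\n", "0\n", "1"]  # made-up sample values

truthiness = [bool(r) for r in raw_flags]    # every non-empty string is True
parsed = [parse_flag(r) for r in raw_flags]  # explicit comparison

print(truthiness)
print(parsed)
```

If the files use a different encoding, only the accepted values in `parse_flag` need to change; the `ValueError` makes a wrong assumption visible immediately.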

In [85]:
passphrase_df["date"] = pd.to_datetime(passphrase_df["date"])
passphrase_df["userid"] = passphrase_df["userid"].astype("int")
passphrase_df["target_userid"] = passphrase_df["target_userid"].astype("int")
passphrase_df["userAgent"] = passphrase_df["userAgent"].astype("category")
passphrase_df["genuine"] = passphrase_df["genuine"].astype("bool")

#### Finding the password files
Password data is stored in the directory structure:
- for genuine samples - `passwords/user_<user_id>/genuine`
- for impostor samples - `passwords/user_<user_id>/impostor`

In each sub-directory there is a file `captures.txt` that lists the available entries.

In [86]:
# compute a list of entry paths
entry_paths = []
for sector_dir in glob.glob(f"{DATA_DIR}/passwords/user_*/*"):
    with open(f"{sector_dir}/captures.txt", "r") as f:
        entries = f.readlines()
        entry_paths.extend([f"{sector_dir}/{e.strip()}" for e in entries])

len(entry_paths)

19587

#### Loading the Password data
Reusing the `load_entry()` function defined earlier, we load the password data.

In [87]:
%%time 
entries = [ load_entry(p) for p in entry_paths]
password_df = pd.DataFrame(entries)

CPU times: user 6.84 s, sys: 3.99 s, total: 10.8 s
Wall time: 12.4 s


#### Cleaning the dataframe
Configure the datatypes for the password dataframe:
- the `date` feature should have the `datetime` datatype
- the `userid` feature should have the `int` datatype
- the `target_userid` feature should have the `int` datatype
- the `userAgent` feature should have the `category` datatype
- the `genuine` feature should have the `bool` datatype

In [88]:
password_df["date"] = pd.to_datetime(password_df["date"])
password_df["userid"] = password_df["userid"].astype("int")
password_df["target_userid"] = password_df["target_userid"].astype("int")
password_df["userAgent"] = password_df["userAgent"].astype("category")
password_df["genuine"] = password_df["genuine"].astype("bool")

### Merge the dataset
Merge the dataframes into one single dataframe.

In [89]:
combined_df = pd.concat([passphrase_df, password_df])

## Preprocessing the dataset
In this section, we apply transformations to the dataset to prepare it for machine learning.

### Keystroke Dynamics data
The keystroke dynamics data is what we use as input to our model.

1. Parse the raw keystroke data into lists

In [90]:
%%time 
keystroke_feat_names = [
    "l_raw_press",
    "l_raw_release",
    "p_raw_press",
    "p_raw_release",
]

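def process_entry(entry):
    # parse raw text into one list of whitespace-separated fields per line;
    # non-string values (e.g. missing features) are returned unchanged
    if isinstance(entry, str):
        entry = [r.split() for r in entry.split("\n") if r]
    return entry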

combined_df[keystroke_feat_names] = combined_df[keystroke_feat_names].applymap(process_entry)

CPU times: user 1.28 s, sys: 200 ms, total: 1.48 s
Wall time: 1.48 s
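
To illustrate what this parsing step produces, the sketch below applies the same splitting logic to a made-up raw string. The two-field layout of each line is an assumption for illustration only, since the actual field layout of the raw files is not shown in this notebook; `split_records` is a hypothetical stand-in for `process_entry`.

```python
def split_records(raw):
    """Split raw text into one list of whitespace-separated fields per line."""
    return [line.split() for line in raw.split("\n") if line]

# made-up raw text; two fields per line is an assumption for illustration
raw = "65 1000\n66 1180\n"
records = split_records(raw)
print(records)
```

Note that the fields are still strings after this step, so any numeric conversion has to happen later in the pipeline.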


2. Combine the keystroke dynamics features of the login and the password to facilitate further processing.
> Here we assume that the keystroke dynamics features of the login and the password are similar.

In [91]:
combined_df["raw_press"] = \
    combined_df.l_raw_press + combined_df.p_raw_press
combined_df["raw_release"] = \
    combined_df.l_raw_release + combined_df.p_raw_release
features_df = combined_df[["raw_press",  "raw_release"]]

3. Perform feature extraction on the keystroke features, transforming the raw keystrokes into the following features:
- keycode
- relative press timestamp
- relative release timestamp
- press to press timings
- release to release timings
- press to release timings
- release to press timings
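
The timing features listed above can be sketched in plain Python as follows. The definitions used here are common conventions (press to release as hold time, release to press as the gap before the next key) and are assumptions: `KeystrokeFeatureExtractor` from `dataprep` may define, scale, or order them differently. The function name `timing_features` and the timestamps are made up.

```python
def timing_features(press, release):
    """Compute timing features from press/release timestamps.

    press and release are non-empty, equal-length lists with one timestamp
    pair per key, ordered by press time.
    """
    n = len(press)
    t0 = press[0]  # timestamps are made relative to the first key press
    return {
        "rel_press": [p - t0 for p in press],
        "rel_release": [r - t0 for r in release],
        "press_to_press": [press[i + 1] - press[i] for i in range(n - 1)],
        "release_to_release": [release[i + 1] - release[i] for i in range(n - 1)],
        "press_to_release": [release[i] - press[i] for i in range(n)],
        "release_to_press": [press[i + 1] - release[i] for i in range(n - 1)],
    }

# made-up timestamps for three keys
press = [1000, 1180, 1300]
release = [1090, 1250, 1420]
feats = timing_features(press, release)
for name, values in feats.items():
    print(name, values)
```

Features that relate consecutive keys have one value fewer than the number of keys, while the per-key features have exactly one value per key.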

In [99]:
%%time
extractor = KeystrokeFeatureExtractor()
keystroke_features = extractor.fit_transform(features_df.values)

CPU times: user 4.39 s, sys: 72.8 ms, total: 4.46 s
Wall time: 4.51 s


### Metadata
We extract the following columns as metadata for our keystroke data:

In [100]:
meta_features = [
    "genuine",
    "userid",
    "target_userid",
    "userAgent"
]

meta_df = combined_df[meta_features]

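## Commit the dataset
Commit the metadata dataframe to disk as a Feather file. The index is reset first because `to_feather` requires a default index, and concatenating the two dataframes left duplicate index values.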

In [101]:
meta_df = meta_df.reset_index()
meta_df.to_feather(f"{OUTPUT_DIR}/meta.feather")

Commit the NumPy array as an `.npz` file.

In [102]:
with open(f"{OUTPUT_DIR}/keystroke.npz", "wb") as f:
    np.savez(f, keystroke=keystroke_features)