# Prepare the Greyc dataset
In this notebook, we prepare the Greyc-Web dataset, available online [here](http://www.labri.fr/perso/rgiot/ressources/GREYC-WebDataset.html), for machine learning.

## Imports & Setup

In [1]:
import os
import glob
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import dask.dataframe as dd
import dask.array as da
from dask import delayed

%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
DATA_DIR="/tf/data/datasets/greyc_web/output_numpy"
OUTPUT_DIR="/tf/data/preprocessed"

## Preparing the data
In this section we load the Greyc Web dataset.

### Users data
#### Loading the users
According to the website that accompanied the dataset:
```
The ‘user’ directory contains one file per user named user/user_xxx.txt, with xxx the id of the user. Each user file contains the following information (one information per line):

    the user id
    the login of the user
    the name of the user
    the gender of the user
    the age of the user
```
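
The five-line layout described above can also be parsed with the newlines removed at read time. The following is a minimal sketch, not the loader used in this notebook: `parse_user_file`, `USER_FIELDS`, the sample values and the UTF-8 encoding are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Field order as listed in the dataset description.
USER_FIELDS = ("userid", "login", "name", "gender", "age")

def parse_user_file(path):
    """Read one user file and return its five fields without newlines."""
    # Assumption: the files decode as UTF-8; the real encoding may differ.
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    if len(lines) != len(USER_FIELDS):
        raise ValueError(
            f"{path}: expected {len(USER_FIELDS)} lines, got {len(lines)}"
        )
    return dict(zip(USER_FIELDS, (line.strip() for line in lines)))

# Hypothetical file contents, written to a temporary file for illustration.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "user_001.txt"
    sample_path.write_text("1\nalice\nexample-name\nF\n30\n", encoding="utf-8")
    sample_user = parse_user_file(sample_path)

print(sample_user)
```

Checking the line count up front turns a malformed file into an explicit error instead of an unpacking failure.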


In [3]:
%%time

user_paths = glob.glob(f"{DATA_DIR}/users/*")

# load user data
def load_user(path):
    with open(path, "r") as f:
        userid, login, name, gender, age = f.readlines()
    
    user = {
        "userid": userid,
        "login": login, 
        "name": name,
        "gender": gender,
        "age": age
    }
    
    return user

users = [ load_user(p) for p in user_paths ]
users_df = pd.DataFrame(users)

CPU times: user 6.33 ms, sys: 0 ns, total: 6.33 ms
Wall time: 5.33 ms


In [4]:
users_df.sample(10)

| | userid | login | name | gender | age |
|---|---|---|---|---|---|
| 16 | 31\n | jrsanchez\n | figaro\n | M\n | 21\n |
| 7 | 82\n | marpaud\n | fuel4life\n | M\n | 20\n |
| 48 | 26\n | laval\n | ensicaen\n | M\n | 21\n |
| 85 | 13\n | griech\n | motsecret\n | M\n | 22\n |
| 66 | 114\n | wchaisantikulwat\n | jaimepaslechoco\n | F\n | 31\n |
| 65 | 86\n | clabaux\n | Paradoxe13\n | M\n | 22\n |
| 108 | 65\n | southapaseuth\n | gogolepower\n | F\n | 22\n |
| 26 | 95\n | marciau\n | knevoltage=baisse\n | M\n | 22\n |
| 18 | 80\n | jean\n | lolmdrxd\n | M\n | 19\n |
| 52 | 32\n | gaetan.javelle\n | bono1988\n | M\n | 22\n |


In [5]:
users_df.describe(include="all").T

| | count | unique | top | freq |
|---|---|---|---|---|
| userid | 118 | 118 | 57\n | 1 |
| login | 118 | 118 | dassonville\n | 1 |
| name | 118 | 118 | mata67\n | 1 |
| gender | 118 | 2 | M\n | 98 |
| age | 118 | 18 | 20\n | 31 |


#### Cleaning the users
The user data requires cleaning: every value ends with an unwanted `\n`.

In [6]:
users_df = users_df.applymap(lambda x: x.strip())

In [7]:
users_df.head()

| | userid | login | name | gender | age |
|---|---|---|---|---|---|
| 0 | 45 | dassonville | yurgen24 | F | 20 |
| 1 | 19 | waubry | phiwil14 | M | 36 |
| 2 | 113 | kikilautruche | embuscade | F | 21 |
| 3 | 35 | bgodin | chapapa | M | 19 |
| 4 | 21 | marnier | g>5079@7113<m | M | 21 |
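
In pandas 2.1 and later, `DataFrame.applymap` is deprecated in favour of `DataFrame.map`. A column-wise strip through the `.str` accessor avoids both names and should behave the same on a frame that holds only strings. This is a sketch on a small stand-in frame (`raw_df` and `clean_df` are illustrative names), not a change to the cell above.

```python
import pandas as pd

# Stand-in frame with the same trailing-newline problem as users_df.
raw_df = pd.DataFrame(
    {
        "userid": ["31\n", "82\n"],
        "login": ["jrsanchez\n", "marpaud\n"],
        "gender": ["M\n", "M\n"],
    }
)

# Strip leading/trailing whitespace one column at a time.
clean_df = raw_df.apply(lambda col: col.str.strip())

print(clean_df)
```

This only works when every column contains strings, which holds here because all fields were read as text.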


#### Configuring datatypes
We configure datatypes for the dataframe as follows:

In [8]:
users_df.userid = users_df["userid"].astype("int32")
users_df.gender = users_df["gender"].astype("category")
users_df.age = users_df["age"].astype("int32")

In [9]:
users_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118 entries, 0 to 117
Data columns (total 5 columns):
userid    118 non-null int32
login     118 non-null object
name      118 non-null object
gender    118 non-null category
age       118 non-null int32
dtypes: category(1), int32(2), object(2)
memory usage: 3.1+ KB
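
The same conversion can be written as a single `astype` call with a column-to-dtype mapping, which keeps the target schema in one place. A minimal sketch on a stand-in frame (`users_example` and `typed` are illustrative names):

```python
import pandas as pd

# Stand-in for the cleaned users frame: every value is still a string.
users_example = pd.DataFrame(
    {
        "userid": ["45", "19"],
        "login": ["dassonville", "waubry"],
        "gender": ["F", "M"],
        "age": ["20", "36"],
    }
)

# One mapping describes the dtype of every column that needs converting.
typed = users_example.astype(
    {"userid": "int32", "gender": "category", "age": "int32"}
)

print(typed.dtypes)
```

Columns that are not named in the mapping, such as `login`, keep their original dtype.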


### Keystroke Dynamics
According to the website, the keystroke dynamics data is split into two parts:
1. `passphrases/` - where all users type the imposed username and password
2. `passwords/` - where users type their own username and password

#### Finding the Passphrase files
Each entry is stored in the directory structure `passphrases/user_<user_id>/<timestamp>`
- each user folder contains a file called `captures.txt` that lists the entries for that user

In [10]:
# compute a list of entry paths
entry_paths = []
for user_dir in glob.glob(f"{DATA_DIR}/passphrases/user_*"):
    with open(f"{user_dir}/captures.txt", "r") as f:
        entries = f.readlines()
        entry_paths.extend([f"{user_dir}/{e.strip()}" for e in entries])

len(entry_paths)

10656

#### Loading Passphrase data
For each entry, the following features are of interest:
- userAgent.txt: The user agent string of the web browser used to type (can be used to analyse the browser habits of the user)
- userid.txt: The id of the user who has typed the text
- date.txt: The acquisition date of the sample
- genuine.txt: A file containing 1 for a sample typed by the user and 0 for a sample typed by an impostor
- login.txt: The string of the login
- password.txt: The string of the password
- l_raw_press.txt: The press events of the login. One event per line with: the code of the key, the timestamp of the event.
- l_raw_release.txt: The release events of the login. One event per line with: the code of the key, the timestamp of the event.
- p_raw_press.txt: The press events of the password. One event per line with: the code of the key, the timestamp of the event.
- p_raw_release.txt: The release events of the password. One event per line with: the code of the key, the timestamp of the event.
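
The raw press and release files hold one `keycode timestamp` pair per line, so they can be turned into numeric arrays before any feature extraction. The sketch below is an illustration rather than part of this notebook's pipeline: `parse_events` is a hypothetical helper, the two events are the complete ones visible in the merged dataframe preview further down, and pairing the i-th press with the i-th release is an assumption that breaks when key presses overlap.

```python
import numpy as np

def parse_events(raw):
    """Parse 'keycode timestamp' lines into an (n_events, 2) int64 array."""
    rows = [
        [int(token) for token in line.split()]
        for line in raw.splitlines()
        if line.strip()
    ]
    return np.array(rows, dtype=np.int64).reshape(-1, 2)

press = parse_events("76 1287509923363\n65 1287509923460\n")
release = parse_events("76 1287509923448\n65 1287509923577\n")

# Delay between consecutive key presses, in the unit of the timestamps
# (they appear to be milliseconds, which is an assumption).
press_latency = np.diff(press[:, 1])

# Hold time per key, ASSUMING the i-th release belongs to the i-th press.
hold_time = release[:, 1] - press[:, 1]

print(press_latency)  # [97]
print(hold_time)      # [ 85 117]
```

The timestamps need 64-bit integers; they overflow `int32`.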

In [11]:
%%time 

features_whitelist = [
    "userAgent",
    "userid",
    "date",
    "genuine",
    "login",
    "password",
    "l_raw_press",
    "l_raw_release",
    "p_raw_press",
    "p_raw_release"
]

# load the entry at the given path as a dict 
def load_entry(path):
    feature_files = glob.glob(f"{path}/*")
    entry = {}
    for feature_file in feature_files:
        # skip files whose feature has not been whitelisted
        feature_name = os.path.basename(feature_file).replace(".txt", "")
        if feature_name not in features_whitelist:
            continue

        # read the raw contents of the feature file as a string
        with open(feature_file, "r") as f:
            entry[feature_name] = f.read()
        
    return entry

entries = [ load_entry(p) for p in entry_paths]
passphrase_df = pd.DataFrame(entries)

CPU times: user 2.88 s, sys: 964 ms, total: 3.85 s
Wall time: 3.86 s


In [12]:
passphrase_df.userAgent.describe(include="all")

count                                                 10656
unique                                                  347
top       Mozilla/5.0 (X11; U; Linux i686; fr; rv:1.9.2....
freq                                                    492
Name: userAgent, dtype: object

#### Cleaning the dataframe
Configure the datatypes for passphrase:
- the `date` feature should have the `datetime` datatype
- the `userAgent` feature should have the `category` datatype
- the `genuine` feature should have the `category` datatype

In [13]:
passphrase_df["date"] = pd.to_datetime(passphrase_df["date"])
passphrase_df["userAgent"] = passphrase_df["userAgent"].astype("category")
passphrase_df["genuine"] = passphrase_df["genuine"].astype("category")
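
Because each feature is read verbatim from its file, the values may still carry surrounding whitespace when they are converted. The sketch below strips them first and parses the date with an explicit format. The on-disk date format (`%Y-%m-%d %H:%M:%S`), the trailing newlines and the frame `raw_features` are assumptions made for illustration.

```python
import pandas as pd

# Stand-in for two loaded entries; values are hypothetical raw file contents.
raw_features = pd.DataFrame(
    {
        "date": ["2010-10-19 19:35:20\n", "2010-10-19 19:35:30\n"],
        "genuine": ["1\n", "0\n"],
    }
)

typed_features = pd.DataFrame(
    {
        # Assumption: dates are stored as 'YYYY-MM-DD HH:MM:SS'.
        "date": pd.to_datetime(
            raw_features["date"].str.strip(), format="%Y-%m-%d %H:%M:%S"
        ),
        # Stripping first keeps the categories as "0"/"1" rather than "0\n"/"1\n".
        "genuine": raw_features["genuine"].str.strip().astype("category"),
    }
)

print(typed_features.dtypes)
```

An explicit `format` makes parsing fail loudly on an unexpected layout instead of guessing.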

#### Finding the password files
Password data is stored in the directory structure: 
- for genuine samples - `passwords/user_<user_id>/genuine`
- for impostor samples - `passwords/user_<user_id>/impostor`

In each sub-directory there is a file `captures.txt` that lists the entries available.

In [14]:
# compute a list of entry paths
entry_paths = []
for sector_dir in glob.glob(f"{DATA_DIR}/passwords/user_*/*"):
    with open(f"{sector_dir}/captures.txt", "r") as f:
        entries = f.readlines()
        entry_paths.extend([f"{sector_dir}/{e.strip()}" for e in entries])

len(entry_paths)

19587
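
The passphrase and password listings follow the same pattern: find directories with a glob, read `captures.txt`, and join each listed name onto its directory. A small helper could cover both layouts. This is a sketch: `collect_entry_paths` is a hypothetical name, the temporary tree only imitates the layout described above, and unlike the loops in this notebook it skips directories without a `captures.txt` and ignores blank lines.

```python
import glob
import os
import tempfile

def collect_entry_paths(pattern):
    """Return the entry paths listed in captures.txt under each matching directory."""
    paths = []
    for directory in sorted(glob.glob(pattern)):
        captures_file = os.path.join(directory, "captures.txt")
        if not os.path.isfile(captures_file):
            continue  # design choice of this sketch: skip instead of raising
        with open(captures_file, "r") as f:
            names = [line.strip() for line in f if line.strip()]
        paths.extend(os.path.join(directory, name) for name in names)
    return paths

# Build a tiny fake tree that imitates the passwords/ layout.
with tempfile.TemporaryDirectory() as root:
    layout = {
        os.path.join("passwords", "user_001", "genuine"): ["a", "b"],
        os.path.join("passwords", "user_001", "impostor"): ["c"],
    }
    for sub_dir, names in layout.items():
        full_dir = os.path.join(root, sub_dir)
        os.makedirs(full_dir)
        with open(os.path.join(full_dir, "captures.txt"), "w") as f:
            f.write("\n".join(names) + "\n")

    found = collect_entry_paths(os.path.join(root, "passwords", "user_*", "*"))
    relative = [os.path.relpath(p, root) for p in found]

print(relative)
```

With such a helper, the passphrase listing would use the pattern `passphrases/user_*` and the password listing `passwords/user_*/*`.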

#### Loading the Password data
Reusing the `load_entry()` function defined earlier, we load the password data.

In [15]:
%%time 
entries = [ load_entry(p) for p in entry_paths]
password_df = pd.DataFrame(entries)

CPU times: user 5.3 s, sys: 1.86 s, total: 7.17 s
Wall time: 7.18 s


#### Cleaning the dataframe
Configure the datatypes for password:
- the `date` feature should have the `datetime` datatype
- the `userAgent` feature should have the `category` datatype
- the `genuine` feature should have the `category` datatype

In [16]:
password_df["date"] = pd.to_datetime(password_df["date"])
password_df["userAgent"] = password_df["userAgent"].astype("category")
password_df["genuine"] = password_df["genuine"].astype("category")

### Merge the dataset
Merge the dataframes into one single dataframe.

In [17]:
keystroke_df = pd.concat([passphrase_df, password_df])

In [18]:
keystroke_df.head()

| | l_raw_press | l_raw_release | p_raw_release | date | p_raw_press | password | genuine | userid | userAgent | login |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 76 1287509923363\n65 1287509923460\n66 1287509... | 76 1287509923448\n65 1287509923577\n66 1287509... | 83 1287509928009\n50 1287509928125\n83 1287509... | 2010-10-19 19:35:20 | 83 1287509927808\n50 1287509927951\n83 1287509... | sésame | 1 | 13 | Mozilla/5.0 (Windows; U; Windows NT 6.0; fr; r... | laboratoire greyc |
| 1 | 76 1287509934960\n65 1287509935008\n66 1287509... | 76 1287509935057\n65 1287509935131\n66 1287509... | 83 1287509939604\n50 1287509939686\n83 1287509... | 2010-10-19 19:35:30 | 83 1287509939434\n50 1287509939567\n83 1287509... | sésame | 1 | 13 | Mozilla/5.0 (Windows; U; Windows NT 6.0; fr; r... | laboratoire greyc |
| 2 | 76 1287509943165\n65 1287509943220\n66 1287509... | 76 1287509943249\n65 1287509943321\n66 1287509... | 83 1287509947932\n50 1287509948019\n83 1287509... | 2010-10-19 19:35:39 | 83 1287509947714\n50 1287509947898\n83 1287509... | sésame | 1 | 13 | Mozilla/5.0 (Windows; U; Windows NT 6.0; fr; r... | laboratoire greyc |
| 3 | 76 1287509951226\n65 1287509951281\n66 1287509... | 76 1287509951306\n65 1287509951401\n66 1287509... | 83 1287509955198\n50 1287509955275\n83 1287509... | 2010-10-19 19:35:46 | 83 1287509955029\n50 1287509955139\n83 1287509... | sésame | 1 | 13 | Mozilla/5.0 (Windows; U; Windows NT 6.0; fr; r... | laboratoire greyc |
| 4 | 76 1287509957631\n65 1287509957667\n66 1287509... | 76 1287509957687\n65 1287509957756\n66 1287509... | 83 1287509962436\n50 1287509962493\n83 1287509... | 2010-10-19 19:35:53 | 83 1287509962206\n50 1287509962376\n83 1287509... | sésame | 1 | 13 | Mozilla/5.0 (Windows; U; Windows NT 6.0; fr; r... | laboratoire greyc |
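
One thing worth checking after the merge is the dtype of the categorical columns: when two `category` columns with different category sets are concatenated, pandas falls back to a non-categorical dtype (typically `object`). The sketch below shows the effect on two stand-in series and two ways to get a categorical result back. The agent labels are placeholders, not values from the dataset.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two categorical series whose category sets only partly overlap.
agents_a = pd.Series(["agent A", "agent B"], dtype="category")
agents_b = pd.Series(["agent B", "agent C"], dtype="category")

# Plain concatenation loses the categorical dtype.
combined = pd.concat([agents_a, agents_b], ignore_index=True)
print(combined.dtype)

# Option 1: cast the concatenated column back to category.
recast = combined.astype("category")

# Option 2: merge the category sets explicitly before building the series.
merged = pd.Series(union_categoricals([agents_a, agents_b]))

print(recast.dtype, merged.dtype)
```

If `userAgent` or `genuine` should stay categorical in `keystroke_df`, a re-cast after `pd.concat` along these lines may be needed.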


## Commit the dataset 
Commit the dataframe to disk as a Feather file.

In [19]:
keystroke_df = keystroke_df.reset_index()
keystroke_df.to_feather(f"{OUTPUT_DIR}/keystroke.feather")
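
The index is reset before writing because concatenation leaves repeated row labels, and Feather stores columns rather than an arbitrary index (older pandas releases, at least, reject a non-default index in `to_feather`). The sketch below, on two tiny stand-in frames, shows the difference between `reset_index()`, which keeps the old labels as an `index` column, and `reset_index(drop=True)`, which discards them.

```python
import pandas as pd

first = pd.DataFrame({"userid": [13, 13]})
second = pd.DataFrame({"userid": [45]})

stacked = pd.concat([first, second])
print(stacked.index.tolist())  # [0, 1, 0]: labels repeat after concat

# Old labels are kept as an extra "index" column.
kept = stacked.reset_index()
print(kept.columns.tolist())  # ['index', 'userid']

# Old labels are discarded; only a fresh 0..n-1 index remains.
dropped = stacked.reset_index(drop=True)
print(dropped.columns.tolist())  # ['userid']
```

Passing `ignore_index=True` to `pd.concat` gives the same result as `reset_index(drop=True)` and may be preferable if the per-source row numbers are not needed in the saved file.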