In [1]:
import pandas as pd
import ast
import json as JSON
#from ipynb.fs.full.InsertsDelComparisons import  map_, INDICES

## Our Dataset

In [2]:
df = pd.read_csv('data/keystrokes-recipes.csv')
df.head(10)

Unnamed: 0,event_date,user_id,ks,recipe
0,2022-09-04 03:28:18.613319,55ae64defdf99b3f864653e7,"[{'time': 1662261900176, 'character': 'Shift'}...",Brown 1 pound of hamburger meat. Drain the gre...
1,2022-09-04 03:29:37.124556,55ae64defdf99b3f864653e7,"[{'time': 1662261900176, 'character': 'Shift'}...",1) Brown 1 pound of hamburger meat. Drain the ...
2,2022-09-04 03:29:47.816111,55ae64defdf99b3f864653e7,"[{'time': 1662261900176, 'character': 'Shift'}...",1) Brown 1 pound of hamburger meat. Drain the ...
3,2022-09-04 03:33:03.555075,55ae64defdf99b3f864653e7,"[{'time': 1662262224600, 'character': '1'}, {'...","1) Cook chicken as desired (boiled, pan seared..."
4,2022-09-04 03:33:30.062465,55ae64defdf99b3f864653e7,"[{'time': 1662262224600, 'character': '1'}, {'...","1) Cook chicken as desired (boiled, pan seared..."
5,2022-09-04 03:34:08.666681,55ae64defdf99b3f864653e7,"[{'time': 1662262224600, 'character': '1'}, {'...","1) Cook chicken as desired (boiled, pan seared..."
6,2022-09-04 03:35:33.869167,55ae64defdf99b3f864653e7,"[{'time': 1662262452515, 'character': '2'}, {'...",28 oz or so of potatoes cubed\r\n1 8oz of crea...
7,2022-09-04 03:36:08.454643,55ae64defdf99b3f864653e7,"[{'time': 1662262452515, 'character': '2'}, {'...",28 oz or so of potatoes cubed\r\n1 8oz of crea...
8,2022-09-04 03:36:43.142187,55ae64defdf99b3f864653e7,"[{'time': 1662262452515, 'character': '2'}, {'...",28 oz or so of potatoes cubed\r\n1 8oz of crea...
9,2022-09-04 14:01:31.746981,55d22025cc2b18000c0b9d9c,"[{'time': 1662298866744, 'character': 'I'}, {'...",To serve 4 people (your family!)\r\n\r\n\r\n-Y...


#### Description

The users wrote their recipes on __RELEX__, an ML-based platform that reviews the recipes users write and that was designed to study user behavior. The users were asked to write three recipes each and were divided into 5 groups.



| Group 1 | Group 2 |
|:---: | :---: |
| Without Adaptive Feedback | Without Adaptive Feedback |

Note that some users wrote fewer than 3 recipes and others wrote more than 3 (at least from what I've seen in the data). We will filter out samples that belong to a 4th or 5th recipe.
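A minimal sketch of that filtering step, assuming a hypothetical mapping `starts_by_user` from each user id to the sorted row indices at which their recipes begin (similar in spirit to the `map_` series built later in this notebook). The names and toy values are illustrative only.

```python
# Hypothetical mapping: user id -> sorted row indices where each recipe starts.
starts_by_user = {
    "user_a": [0, 3, 6],          # exactly three recipes
    "user_b": [9, 11],            # fewer than three: kept as is
    "user_c": [12, 15, 17, 20],   # a fourth recipe starts at row 20
}

MAX_RECIPES = 3  # assumption: only the first three recipes are analysed


def keep_first_recipes(starts, max_recipes=MAX_RECIPES):
    """Split start indices into those to keep and those to discard."""
    return starts[:max_recipes], starts[max_recipes:]


kept = {}
dropped = {}
for user, starts in starts_by_user.items():
    kept[user], dropped[user] = keep_first_recipes(starts)
```

Every row from a dropped start index onwards (up to that user's last row) would then be excluded from the analysis.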


We have 73 users (each represented by a ```user_id```) and 450 sets of keystrokes.
Each set of keystrokes is either a user's revision of what they wrote previously or the start of a new recipe. A typical scenario on the platform is:
- User x writes a first version of a recipe and clicks the finished button (registered as a set of keystrokes in our dataset)
- Depending on the user's group, the platform may or may not present some suggestions
- The user modifies their text and submits again (registered as a second set of keystrokes in our dataset)
- If the user is done, they start the second recipe; otherwise they revise again, and so on.

In the cell below we define ```KEYWORDS``` as the characters in the dataset that correspond to a keyboard action. We also define ```noisy_punct```, a list of noisy punctuation characters that we want to remove from some of our analyses. Finally, we copy the original data to a new file that we will modify later.

In [3]:
KEYWORDS = ['Alt', 'ArrowDown', 'ArrowLeft', 'ArrowRight', 'ArrowUp', 'Backspace', 'CapsLock', 'Control', 'Delete', 'End', 'Enter', 'Home', 'Meta', 'PageUp', 'PageDown', 'PrintScreen','Shift', 'Tab']
noisy_punct = [',', '.', '-', ':', '(', ')']
#create a copy of the dataset to another csv file
csv_filename = 'data/keystrokes-recipes-modified.csv'
df.to_csv(csv_filename, index=False)

- ```keystrokes-recipes.csv``` is the original data, which we keep in case we want to refer back to it
- ```keystrokes-recipes-modified.csv``` is the modified data


## Data cleaning and sorting

Our data consists of a csv file with event dates, user ids, keystrokes and the recipes they wrote.
We clean the data by working through the keystrokes first.

* We group the characters into the words written and keep them separate from the important keywords typed, such as backspace, shift, enter, etc. The sequence ['shift', 'p', 'e', 'r'] becomes ['shift', 'per']
* We sort the data by user id, then by event date, to get a better idea of every recipe every student has written and the time they took.
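The grouping described in the first bullet can be sketched as follows. This is a simplified illustration, not the notebook's implementation: `group_chars` and `TOY_KEYWORDS` are hypothetical names, and timestamps are ignored here.

```python
# Subset of keyboard-action names, for illustration only (assumption).
TOY_KEYWORDS = {"Shift", "Backspace", "Enter"}


def group_chars(chars):
    """Merge consecutive plain characters into words; keep keywords separate."""
    out, current = [], ""
    for c in chars:
        if c in TOY_KEYWORDS or c.isspace():
            # A keyword or a space closes the word being built, if any.
            if current:
                out.append(current)
                current = ""
            if c in TOY_KEYWORDS:
                out.append(c)
        else:
            current += c
    if current:
        out.append(current)
    return out


grouped = group_chars(["Shift", "p", "e", "r"])
```

Here `grouped` holds the keyword followed by the single word built from the three letters.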

### Processing the data

The first step is to collect all the keystrokes in the dataset. Pandas reads `df['ks']` as strings, so we use the `ast` library to convert each string into a Python list of dictionaries.

The next step is to group characters into words, keeping them separate from keywords; we work on the characters between consecutive whitespace characters.
 
So for example this entry: 
```{'time': 1662252404346, 'character': 'Shift'}, {'time': 1662252404376, 'character': 'f'}, {'time': 1662252404505, 'character': 'i'}, {'time': 16622524046700, 'character': ' '}``` 

gives the following output: 
```{'time': 1662252404346, 'word': 'Shift'}, {'time': 1662252404505, 'word': 'fi'}```
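As a small illustration of the string-to-list conversion, the sketch below parses a shortened `ks` cell. It also shows why `ast.literal_eval` is used rather than a JSON parser: the cells use single-quoted Python literal syntax, which is not valid JSON. The variable names are illustrative.

```python
import ast
import json

# A shortened `ks` cell as pandas reads it: a plain string, not a list.
raw = "[{'time': 1662252404346, 'character': 'Shift'}, {'time': 1662252404376, 'character': 'f'}]"

# Single-quoted keys are valid Python literals but not valid JSON.
try:
    json.loads(raw)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# literal_eval only evaluates literals, so it does not run arbitrary code.
events = ast.literal_eval(raw)
characters = [event["character"] for event in events]
```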


In [4]:
keystrokes = df['ks'].values.tolist()
keystrokes = list(map(lambda j: ast.literal_eval(j), keystrokes))

We add a whitespace character at the end of each set of keystrokes so that the grouping step, which emits a word only when it reaches a whitespace, also emits the last word.
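The effect of this padding can be seen on a toy example. The function below is a simplified, hypothetical stand-in for the grouping step (it ignores keywords) and only assumes that a word is emitted when a whitespace character is reached.

```python
def words_between_spaces(events):
    """Emit one word per whitespace, stamped with the whitespace's time."""
    words, start = [], 0
    chars = [e["character"] for e in events]
    for i, e in enumerate(events):
        if e["character"].isspace():
            words.append({"time": e["time"], "word": "".join(chars[start:i])})
            start = i + 1
    return words


# Toy keystrokes with made-up timestamps; the text does not end with a space.
events = [
    {"time": 1, "character": "h"},
    {"time": 2, "character": "i"},
    {"time": 3, "character": " "},
    {"time": 4, "character": "y"},
    {"time": 5, "character": "o"},
]

without_padding = words_between_spaces(events)

# Append a space that reuses the last timestamp, as the cell below does.
padded = events + [{"time": events[-1]["time"], "character": " "}]
with_padding = words_between_spaces(padded)
```

Without the padding, the trailing word never reaches a whitespace and is silently lost.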

In [5]:
ks = []
for s in keystrokes:
    # drop missing events, then append a trailing space that reuses the last timestamp
    s = [event for event in s if event is not None]
    s.append({"time": s[-1]["time"], "character": " "})
    ks.append(s)
ks = pd.DataFrame(ks)

In [6]:
def find_seq(chars):
    return "".join(list(filter(lambda _ : _ not in KEYWORDS, chars)))

def separate_entry(json_values):
    new_data = []
    last_whitespace = 0
    characters = [arr[1] for arr in json_values]
    for i, (time, character) in enumerate(json_values):
        if character.isspace():
            word = characters[last_whitespace: i]
            if not any(i in word for i in KEYWORDS):
                new_data.append({'time': time, 'word': "".join(word)})
            else: 
                new_data.append({'time': time, 'word': find_seq(word)})
            last_whitespace = i+1
        elif character in KEYWORDS:
            new_data.append({'time': time, 'word': character})
    
    return new_data

arr = []
for jsonf in ks.values:
    sub_arr = []
    for d in jsonf:
        if d is not None:
            sub_arr.append([d["time"], d["character"]])
    arr.append(sub_arr)

result = []
for jsonf in arr:
    result.append(separate_entry(jsonf))
with open("data/new_data.json", "w") as f:
    JSON.dump(result, f)

We first put the data in the format expected by the ```separate_entry``` function; once everything is computed, the result is dumped into a new JSON file, ```new_data.json```, in the ```data``` directory.

```separate_entry``` builds the words typed between consecutive whitespace characters while keeping them separate from keywords. It uses the function ```find_seq``` to strip the keywords out of a sequence of characters, which isolates the word typed between two whitespace characters.

### Modifying the CSV file

We replace the keystroke data of each row of the original data in ```keystrokes-recipes.csv``` and write the result to ```keystrokes-recipes-modified.csv```.
__The original dataset has 5 groups, but we focus on 2 of them, so we only keep the data from those two groups.__
To do so, we use a secondary dataset that was provided, which maps users to their groups. We focus on users from groups 2 and 4.
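The group filter can also be expressed with vectorized pandas operations. The sketch below uses toy stand-ins for the two tables; `submissions`, `group_matching` and `GROUPS_OF_INTEREST` are illustrative names, and the column layout of the real group file is assumed to be `user_id` and `group`.

```python
import pandas as pd

# Toy stand-ins for the keystroke table and the user-to-group table.
submissions = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u4"],
    "recipe": ["r1", "r1 revised", "r2", "r3", "r4"],
})
group_matching = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],   # u4 has no group on record
    "group": [2, 3, 4],
})

GROUPS_OF_INTEREST = [2, 4]

# Look up each row's group; users missing from the mapping get NaN.
user_to_group = dict(zip(group_matching["user_id"], group_matching["group"]))
groups = submissions["user_id"].map(user_to_group)

# Keep rows whose group is 2 or 4; NaN never matches, so u4 is dropped too.
selected = submissions[groups.isin(GROUPS_OF_INTEREST)].reset_index(drop=True)
```

This has the same effect as dropping users who raise a `KeyError` in the lookup: anyone without a known group is excluded.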

In [7]:
jsons = pd.read_json('data/new_data.json').values.tolist()
users_to_groups = dict(pd.read_csv('data/groupmatching.csv').filter(['user_id', 'group']).values)

for i, json in enumerate(jsons):
    jsons[i]= list(filter(lambda _ : _ is not None, json))

dframe = df.copy()

to_drop = []
for i in range(len(jsons)):
    user_id = dframe.iloc[i]['user_id']
    # users missing from the group mapping are dropped as well
    if users_to_groups.get(user_id) in (2, 4):
        # .at writes into dframe itself; chained iloc[i]['ks'] assigns to a temporary copy
        dframe.at[dframe.index[i], 'ks'] = jsons[i]
    else:
        to_drop.append(i)

dframe.drop(to_drop , inplace=True)
dframe.to_csv(csv_filename, index=False)    

## Sorting by `user_id` and `event_date`

We first sort by user id, which makes it easier to compare behaviour between people, and then by event date.
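A minimal sketch of this two-key sort on a toy frame follows. The `event_date` values are copied from the table at the top of the notebook; the user ids are shortened placeholders.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": ["b", "a", "a"],
    "event_date": [
        "2022-09-04 03:29:37.124556",
        "2022-09-04 03:33:03.555075",
        "2022-09-04 03:28:18.613319",
    ],
})

# Parsing to datetimes makes the ordering explicit; ISO-formatted strings
# like these also happen to sort chronologically as plain text.
toy["event_date"] = pd.to_datetime(toy["event_date"])

# Rows are ordered by user first, then chronologically within each user.
toy_sorted = toy.sort_values(by=["user_id", "event_date"]).reset_index(drop=True)
```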

In [8]:
#sort by user id, then by event date
pd.read_csv(csv_filename).sort_values(by=['user_id', 'event_date'], ascending=True).to_csv(csv_filename, index=False)
#reload the dataframe we work with
df = pd.read_csv(csv_filename)

In [9]:
data = [ast.literal_eval(df['ks'].values[i]) for i in range(len(df))]
with open("data/new_data.json", "w") as f:
        JSON.dump(data, f)

## Separating writing sessions

We use a distance metric between the recipes written and keep the indices that separate the different recipes. This replaces most of the work of the previous idea and is safer than writing my own algorithm. The previous idea was to count the inserts between revisions and look for a spike, and also to check the time between submissions, treating a recipe as new if the gap was large.

We download a GloVe model (similar to word2vec) that is already trained on Wikipedia, in which each word is represented as a 50-dimensional vector.

We will separate every recipe written into sessions so that we can look at what happens in each revision session of every recipe.

The algorithm works recursively. Starting from a recipe, it computes the cosine similarity ($1 - \text{distance}$) with each of the following submissions until it finds one for which $1 - \text{distance} < 0.995$. That submission is treated as the start of a new recipe: its index is added to the accumulator, and the whole process restarts from that index.
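The same idea can be written as a loop. The sketch below is an iterative reformulation on toy 2-D vectors, not the notebook's code: `recipe_start_indices` and `toy_vectors` are hypothetical, and real inputs would be the summed GloVe vectors of each submission.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine similarity, i.e. 1 minus the cosine distance."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def recipe_start_indices(vectors, threshold=0.995):
    """Return the indices at which a new recipe is assumed to begin."""
    starts = [0]
    anchor = 0  # index of the first version of the current recipe
    for i in range(1, len(vectors)):
        # A submission too dissimilar from the anchor starts a new recipe.
        if cosine_similarity(vectors[anchor], vectors[i]) < threshold:
            starts.append(i)
            anchor = i
    return starts


# Toy "embeddings": rows 0-2 are near-identical revisions, rows 3-4 differ.
toy_vectors = np.array([
    [1.0, 0.0],
    [1.0, 0.01],
    [1.0, 0.02],
    [0.0, 1.0],
    [0.01, 1.0],
])

starts = recipe_start_indices(toy_vectors)
```

Each submission is compared with the first version of the current recipe rather than with its immediate predecessor, so a long series of small revisions can eventually drift below the threshold.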

In [10]:
#Load the model
import gensim.downloader as api
model = api.load("glove-wiki-gigaword-50")

In [14]:
from scipy import spatial
import numpy as np

def preprocess(s):
    res = ""
    for i, char in enumerate(list(s)):
        if char not in noisy_punct:
            res += char    
    res = [i.lower() for i in res.split()]
    res = list(filter(lambda _ : _ not in noisy_punct, res))
    return res

def get_vector(s):
    """
    Get the vector representation of a sentence from the model

    Args:
        s (str): text

    """
    arr = []
    for i in preprocess(s):
        try:
            arr.append(model[i])
        except KeyError:
            # skip words that are not in the model's vocabulary
            continue

    arr = np.array(arr)
    return np.sum(arr, axis=0)

recipes = df['recipe'].values



def compute_recipe_indices(start_index, acc):
    """
    Computes the list of indices where each recipe in the dataset begins.
    For example, if user 0 starts writing at t = 0, revises once at t = 1,
    a second time at t = 2 and starts a new recipe at t = 3,
    then this function will return [0, 3].

    Args:
        start_index (int): index to compare with the other recipes
        acc (list(int)): list to return

    Returns:
        list(int) : list of indices of beginning of each recipe
    """
    if start_index >= len(recipes) - 1:
        return acc
    vec = get_vector(recipes[start_index])
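    for i in range(start_index, len(recipes)):
        # despite its name, dist is the cosine similarity (1 - cosine distance)
        dist = 1 - spatial.distance.cosine(vec, get_vector(recipes[i]))
        if dist < .995:
            acc.append(i)
            return compute_recipe_indices(i, acc)
    # no later submission differs enough: every recipe start has been found
    return acc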


recipes_indices = compute_recipe_indices(0, [0])

# Out of 450 samples, only 14 were misclassified as new recipes, so the algorithm works pretty effectively.
# The two lists below are manual corrections: indices to remove from, and to add to, the detected recipe starts.
to_remove = [13, 116, 134, 156, 168, 188, 249, 255, 256, 403, 88, 90, 128, 209, 376, 379, 381, 390, 391, 393, 394,395,  444]
add = [121, 204, 254, 336, 97, 360, 362, 392]
for i in to_remove:
    recipes_indices.remove(i)

for i in add:
    recipes_indices.append(i) 

recipes_indices = sorted(recipes_indices)

rec = [df['recipe'][i] for i in recipes_indices]
users = [df['user_id'][i] for i in recipes_indices]
dframe = pd.DataFrame([recipes_indices, rec, users]).transpose()
dframe.columns =['recipe index in data', 'recipe', 'user id']

We now have an array of the indices at which a new recipe starts.
Next, we have to map the recipes to the users.
The idea is to transform the indices: ```[0,3,6,9,11,12,...]``` $\rightarrow$ ```[(0, [0,3,6]), (1, [9,11,12]), ... ]```

However, since not everyone has 3 recipes, we can't simply group every 3 recipes together, as that would map some recipes to users who haven't written them.
Instead, we use pandas' ```groupby```, which handles the grouping for us.
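To illustrate the transformation above, here is a small sketch on a toy table with one row per detected recipe start. The column names mirror those of `dframe`; the values and the names `toy_starts` and `toy_map` are made up.

```python
import pandas as pd

# Toy table: one row per detected recipe start (values are illustrative).
toy_starts = pd.DataFrame({
    "recipe index in data": [0, 3, 6, 9, 11],
    "user id": ["u0", "u0", "u0", "u1", "u1"],
})

# Collect, for every user, the list of indices where their recipes start.
toy_map = toy_starts.groupby("user id")["recipe index in data"].apply(list)
```

Because the grouping key is the user id, a user with only two recipes simply gets a shorter list, and no recipe is attributed to the wrong user.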

In [15]:
#map each user to the index of each recipe they wrote
map_ = dframe.groupby('user id')["recipe index in data"].apply(list)

#Collect the first indices in which each user starts writing their recipes
#if user 0 writes at indices [0,1,2,3,4] and user 1 at indices [5,6,7,8,9] then we have [0,5]
indices_of_first_attempts_per_user = df.groupby('user_id').head(1).index

Storing useful variables for other notebooks

In [16]:
%store df
%store KEYWORDS
%store noisy_punct
%store ks
%store map_
%store recipes_indices
%store users_to_groups
%store indices_of_first_attempts_per_user

Stored 'df' (DataFrame)
Stored 'KEYWORDS' (list)
Stored 'noisy_punct' (list)
Stored 'ks' (DataFrame)
Stored 'map_' (Series)
Stored 'recipes_indices' (list)
Stored 'users_to_groups' (dict)
Stored 'indices_of_first_attempts_per_user' (Int64Index)
