In [76]:
import pandas as pd
import ast
import json as JSON

## Our Dataset

In [77]:
df = pd.read_csv('keystrokes-recipes.csv')
df.head(10)

Unnamed: 0,event_date,user_id,ks,recipe
0,2022-09-04 00:53:12.086991,5e68d82dd39ce517eaccd0c2,"[{'time': 1662252404346, 'character': 'Shift'}...","Firstly, cut up some chicken breasts into cube..."
1,2022-09-04 00:55:35.237120,5e87376eb4921a37c33affb4,"[{'time': 1662252644345, 'character': '1'}, {'...","First dice two onions, cut 2 tomatoes and 3 ch..."
2,2022-09-04 00:56:24.986568,5e68d82dd39ce517eaccd0c2,"[{'time': 1662252808442, 'character': 'Enter'}...",Title: Coconut and Tomato Curry\n\n\n\nIngredi...
3,2022-09-04 00:58:04.594495,5e87376eb4921a37c33affb4,"[{'time': 1662252958637, 'character': '1'}, {'...",Ingredients\n- 2 medium sized Onions\n- 2 medi...
4,2022-09-04 01:03:46.795075,6303814cf442fb34eaa1d118,"[{'time': 1662252871554, 'character': 'I'}, {'...",Ingredients\n- potato\n- carrots\n- parsnips\n...
5,2022-09-04 01:06:06.441679,5e87376eb4921a37c33affb4,"[{'time': 1662253173760, 'character': 'CapsLoc...",Victoria sponge cake\nIngredients\n- 250 grams...
6,2022-09-04 01:06:26.244137,6303814cf442fb34eaa1d118,"[{'time': 1662253570675, 'character': 'Enter'}...",Lamb Leg and Vegetable\n\nIngredients\n- potat...
7,2022-09-04 01:06:30.884313,5e87376eb4921a37c33affb4,"[{'time': 1662253586831, 'character': ' '}, {'...",Victoria sponge cake\nIngredients\n- 250 grams...
8,2022-09-04 01:06:37.199980,5e68d82dd39ce517eaccd0c2,"[{'time': 1662253176308, 'character': 'I'}, {'...",Title: Chicken Risotto\n\n\n\nIngredients:\n- ...
9,2022-09-04 01:08:05.510372,5e68d82dd39ce517eaccd0c2,"[{'time': 1662253611905, 'character': 'Enter'}...",Title: Chicken Risotto\n\n\n\nIngredients:\n- ...


#### Description

The recipes were written on __RELEX__, a platform that gives ML-based reviews of the recipes users write and that was designed to study user behavior. The users were asked to write three recipes each and were divided into 4 groups.
| Group 1 | Group 2 | Group 3 | Group 4 |
| :----: | :---: | :----: | :---: |
| Without Reflective Prompts | With Reflective Prompts | Without Reflective Prompts | With Reflective Prompts |
| Without Adaptive Feedback | Without Adaptive Feedback | With Adaptive Feedback | With Adaptive Feedback |

It's important to note that some users wrote fewer than 3 recipes and others wrote more than 3 (at least from what I have seen in the data).


We have 187 users (each represented by a ```user_id```) and 1091 sets of keystrokes.
Each set of keystrokes is either a user's revision of what they wrote previously or the start of a new recipe. The scenario I imagine when using the platform is:
- User x writes a first version of a recipe and clicks the finished button (this registers as a set of keystrokes in our dataset).
- Depending on the user's group, the platform may or may not present some suggestions.
- The user modifies their text and submits again (this registers as a second set of keystrokes in our dataset).
- If the user is done, they start the second recipe; otherwise they revise again, and so on.
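
The user and submission counts quoted above can be checked with a couple of pandas calls. The sketch below runs on a small made-up frame (`toy` and its values are assumptions for illustration, not the real data); on the real dataset the same calls would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the real CSV: one row per submitted set of keystrokes.
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "recipe": ["v1", "v1 revised", "second recipe", "v1", "v1 revised"],
})

n_users = toy["user_id"].nunique()             # number of distinct users
n_sets = len(toy)                              # number of keystroke sets (rows)
sets_per_user = toy.groupby("user_id").size()  # submissions per user

print(n_users, n_sets)
print(sets_per_user)
```

Note that a row is a submission, not a recipe, so the per-user count includes revisions.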


In the cell below we define ```KEYWORDS``` as the values appearing in the dataset that correspond to a keyboard action rather than a typed character. We also define ```noisy_punct```, a list of noisy punctuation characters that we want to remove from some of the analyses we will make. Finally, we copy the original data to a new file that we will modify later.

In [78]:
KEYWORDS = ['ArrowDown', 'ArrowLeft', 'ArrowRight', 'ArrowUp', 'Backspace', 'CapsLock', 'Control', 'Delete', 'End', 'Enter', 'Home', 'Shift']
noisy_punct = [',', '.', '-', ':', '(', ')']
#create a copy of the dataset to another csv file
csv_filename = 'keystrokes-recipes-modified.csv'
df.to_csv(csv_filename, index=False)

- ```keystrokes-recipes.csv``` is the original data; we keep it unchanged in case we need to refer back to it
- ```keystrokes-recipes-modified.csv``` is the modified data


## Data cleaning and sorting

Our data consists of a CSV file with event dates, user ids, keystrokes and the recipes the users wrote.
We clean the data by working through the keystrokes first.

* We group the characters into the words written and keep them separate from the important keywords typed, such as Backspace, Shift, Enter, etc. The sequence ['shift', 'p', 'e', 'r'] becomes ['shift', 'per'].
* We sort the data by user id and then by event date to get a better idea of the recipes each student has written and the time they took.

### Processing the data

The first thing we did was isolate the keystrokes in a new ```json``` file saved as ```data/ks.json```.

The next step is to group characters into words and separate them from keywords, working from one whitespace to the next.
 
So for example this entry: 
```{'time': 1662252404346, 'character': 'Shift'}, {'time': 1662252404376, 'character': 'f'}, {'time': 1662252404505, 'character': 'i'}, {'time': 16622524046700, 'character': ' '}``` 

gives the following output: 
```{'time': 1662252404346, 'word': 'Shift'}, {'time': 1662252404505, 'word': 'fi'}```
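
The grouping described above can be sketched in a few lines. This is a simplified illustration, not the notebook's own implementation (which follows in the cells below and differs in detail): `group_keystrokes` is a hypothetical helper, and it assumes that a word is stamped with the time of its last character, as in the example output.

```python
KEYWORDS = {'ArrowDown', 'ArrowLeft', 'ArrowRight', 'ArrowUp', 'Backspace',
            'CapsLock', 'Control', 'Delete', 'End', 'Enter', 'Home', 'Shift'}

def group_keystrokes(events):
    """Hypothetical helper: merge consecutive characters into words."""
    grouped = []
    chars = []         # characters of the word currently being built
    last_time = None   # timestamp of the most recent character in `chars`

    for event in events:
        ch = event['character']
        if ch in KEYWORDS or ch.isspace():
            # A keyword or a whitespace closes the word in progress.
            if chars:
                grouped.append({'time': last_time, 'word': ''.join(chars)})
                chars = []
            # Keywords are kept as their own entries; whitespace is dropped.
            if ch in KEYWORDS:
                grouped.append({'time': event['time'], 'word': ch})
        else:
            chars.append(ch)
            last_time = event['time']

    if chars:  # flush a word that is not followed by whitespace
        grouped.append({'time': last_time, 'word': ''.join(chars)})
    return grouped

# The example entry from the text, copied as-is.
entry = [
    {'time': 1662252404346, 'character': 'Shift'},
    {'time': 1662252404376, 'character': 'f'},
    {'time': 1662252404505, 'character': 'i'},
    {'time': 16622524046700, 'character': ' '},
]
grouped = group_keystrokes(entry)
print(grouped)
```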


In [79]:
# parse each stringified keystroke list into a list of dicts, one row per entry
new_df = pd.DataFrame(map(ast.literal_eval, df['ks']))
new_df.head(30)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4327,4328,4329,4330,4331,4332,4333,4334,4335,4336
0,"{'time': 1662252404346, 'character': 'Shift'}","{'time': 1662252404376, 'character': 'f'}","{'time': 1662252404505, 'character': 'i'}","{'time': 1662252404595, 'character': 'r'}","{'time': 1662252404716, 'character': 's'}","{'time': 1662252404755, 'character': 't'}","{'time': 1662252404815, 'character': 'l'}","{'time': 1662252404935, 'character': 'y'}","{'time': 1662252405005, 'character': ','}","{'time': 1662252405115, 'character': ' '}",...,,,,,,,,,,
1,"{'time': 1662252644345, 'character': '1'}","{'time': 1662252644494, 'character': '.'}","{'time': 1662252644745, 'character': 'Backspace'}","{'time': 1662252644934, 'character': 'Backspace'}","{'time': 1662252645235, 'character': 'CapsLock'}","{'time': 1662252645334, 'character': 'F'}","{'time': 1662252645475, 'character': 'CapsLock'}","{'time': 1662252645612, 'character': 'i'}","{'time': 1662252645705, 'character': 'r'}","{'time': 1662252645955, 'character': 's'}",...,,,,,,,,,,
2,"{'time': 1662252808442, 'character': 'Enter'}","{'time': 1662252808691, 'character': 'Enter'}","{'time': 1662252808961, 'character': 'ArrowUp'}","{'time': 1662252809132, 'character': 'ArrowUp'}","{'time': 1662252810261, 'character': 'Shift'}","{'time': 1662252810262, 'character': 't'}","{'time': 1662252810352, 'character': 'i'}","{'time': 1662252810441, 'character': 't'}","{'time': 1662252810582, 'character': 'l'}","{'time': 1662252810642, 'character': 'e'}",...,,,,,,,,,,
3,"{'time': 1662252958637, 'character': '1'}","{'time': 1662252958807, 'character': '.'}","{'time': 1662252959026, 'character': ' '}","{'time': 1662252959746, 'character': 'Enter'}","{'time': 1662252959916, 'character': 'Enter'}","{'time': 1662252960127, 'character': 'Enter'}","{'time': 1662252960296, 'character': 'Enter'}","{'time': 1662252960446, 'character': 'Enter'}","{'time': 1662252961138, 'character': 'CapsLock'}","{'time': 1662252961466, 'character': 'I'}",...,,,,,,,,,,
4,"{'time': 1662252871554, 'character': 'I'}","{'time': 1662252871610, 'character': 'CapsLock'}","{'time': 1662252872097, 'character': 'n'}","{'time': 1662252874930, 'character': 'g'}","{'time': 1662252875349, 'character': 'r'}","{'time': 1662252875553, 'character': 'e'}","{'time': 1662252875816, 'character': 'd'}","{'time': 1662252875933, 'character': 'i'}","{'time': 1662252876731, 'character': 'e'}","{'time': 1662252876909, 'character': 'n'}",...,,,,,,,,,,
5,"{'time': 1662253173760, 'character': 'CapsLock'}","{'time': 1662253174078, 'character': 'I'}","{'time': 1662253174238, 'character': 'CapsLock'}","{'time': 1662253174308, 'character': 'n'}","{'time': 1662253174508, 'character': 'g'}","{'time': 1662253174579, 'character': 'r'}","{'time': 1662253174748, 'character': 'e'}","{'time': 1662253174969, 'character': 'd'}","{'time': 1662253175079, 'character': 'i'}","{'time': 1662253175209, 'character': 'e'}",...,,,,,,,,,,
6,"{'time': 1662253570675, 'character': 'Enter'}","{'time': 1662253572758, 'character': 'L'}","{'time': 1662253572792, 'character': 'CapsLock'}","{'time': 1662253573401, 'character': 'a'}","{'time': 1662253573586, 'character': 'm'}","{'time': 1662253574059, 'character': 'b'}","{'time': 1662253574263, 'character': ' '}","{'time': 1662253574628, 'character': 'CapsLock'}","{'time': 1662253575343, 'character': 'L'}","{'time': 1662253575360, 'character': 'CapsLock'}",...,,,,,,,,,,
7,"{'time': 1662253586831, 'character': ' '}","{'time': 1662253586950, 'character': 's'}","{'time': 1662253587031, 'character': 'm'}","{'time': 1662253587200, 'character': 'a'}","{'time': 1662253587271, 'character': 'l'}","{'time': 1662253587431, 'character': 'l'}","{'time': 1662253587591, 'character': ' '}",,,,...,,,,,,,,,,
8,"{'time': 1662253176308, 'character': 'I'}","{'time': 1662253176329, 'character': 'Shift'}","{'time': 1662253176508, 'character': 'n'}","{'time': 1662253176598, 'character': 'g'}","{'time': 1662253176868, 'character': 'r'}","{'time': 1662253177038, 'character': 'e'}","{'time': 1662253177128, 'character': 'd'}","{'time': 1662253177208, 'character': 'i'}","{'time': 1662253177358, 'character': 'e'}","{'time': 1662253177498, 'character': 'n'}",...,,,,,,,,,,
9,"{'time': 1662253611905, 'character': 'Enter'}","{'time': 1662253612225, 'character': 'Shift'}","{'time': 1662253612285, 'character': 's'}","{'time': 1662253612475, 'character': 't'}","{'time': 1662253612534, 'character': 'e'}","{'time': 1662253612655, 'character': 'p'}","{'time': 1662253612775, 'character': 'S'}","{'time': 1662253612945, 'character': ':'}","{'time': 1662253612955, 'character': 'Shift'}","{'time': 1662253632365, 'character': 'Backspace'}",...,,,,,,,,,,


In [80]:
ks = pd.read_json('data/ks.json')
rows = []
for row in ks.values:
    # drop the None padding of rows shorter than the longest one
    row = [entry for entry in row if entry is not None]
    # append a trailing space so that the last word of the entry is closed
    row.append({"time": row[-1]["time"], "character": " "})
    rows.append(row)
ks = pd.DataFrame(rows)

In [81]:
def find_seq(chars):
    """Join the characters of a word, skipping any keyword."""
    return "".join(c for c in chars if c not in KEYWORDS)

def separate_entry(json_values):
    """Group [time, character] pairs into words delimited by whitespace."""
    new_data = []
    last_whitespace = 0
    characters = [character for _, character in json_values]
    for i, (time, character) in enumerate(json_values):
        if character.isspace():
            # close the current word; it gets the timestamp of the whitespace
            word = characters[last_whitespace:i]
            new_data.append({'time': time, 'word': find_seq(word)})
            last_whitespace = i + 1
        elif character in KEYWORDS:
            # keywords are kept as separate entries
            new_data.append({'time': time, 'word': character})
    return new_data

# keep typed characters plus Backspace and Delete; drop the other keywords
kept_keywords = ('Backspace', 'Delete')
arr = []
for row in ks.values:
    sub_arr = []
    for d in row:
        if d is None:
            continue
        if d['character'] not in KEYWORDS or d['character'] in kept_keywords:
            sub_arr.append([d["time"], d["character"]])
    arr.append(sub_arr)

result = [separate_entry(row) for row in arr]
with open("data/new_data.json", "w") as f:
    JSON.dump(result, f)

In short, the loop above formats the data for the ```separate_entry``` function and, once everything is computed, the result is dumped to a new json file, ```new_data.json```, in the ```data``` directory. Of the keywords, only Backspace and Delete are passed on to ```separate_entry```; the others are dropped at this stage.

```separate_entry``` builds the words found between consecutive whitespace characters while keeping them separate from keywords. It relies on ```find_seq``` to strip the keywords out of a word, which makes it possible to isolate the word between two whitespaces.

### Modifying the CSV file

We replace the keystroke data of each row with the processed version. The original ```keystrokes-recipes.csv``` is left untouched; the change is applied to ```keystrokes-recipes-modified.csv```.

In [82]:
jsons = pd.read_json('data/new_data.json').values.tolist()

# drop the None padding of rows shorter than the longest one
jsons = [[entry for entry in row if entry is not None] for row in jsons]

def write_to_csv_file(filename, recipes_len):
    dframe = pd.read_csv(filename)
    for i in range(recipes_len):
        # .at writes into the dataframe itself; chained indexing
        # (dframe.iloc[i]['ks'] = ...) may only modify a temporary copy
        dframe.at[i, 'ks'] = jsons[i]
    dframe.to_csv(filename, index=False)

write_to_csv_file(csv_filename, len(jsons))

## Sorting by `user_id` and `event_date`

We sort first by user id, which makes it easier to compare behaviour between different people, and then by event date.

In [83]:
# sort by user, then by event date, and write the result back to the CSV
pd.read_csv(csv_filename).sort_values(by=['user_id', 'event_date'], ascending=True).to_csv(csv_filename, index=False)
# reload the dataframe we work with
df = pd.read_csv(csv_filename)
df.head(10)

Unnamed: 0,event_date,user_id,ks,recipe
0,2022-09-04 03:28:18.613319,55ae64defdf99b3f864653e7,"[{'time': 1662261900755, 'word': 'brown'}, {'t...",Brown 1 pound of hamburger meat. Drain the gre...
1,2022-09-04 03:29:37.124556,55ae64defdf99b3f864653e7,"[{'time': 1662261900755, 'word': 'brown'}, {'t...",1) Brown 1 pound of hamburger meat. Drain the ...
2,2022-09-04 03:29:47.816111,55ae64defdf99b3f864653e7,"[{'time': 1662261900755, 'word': 'brown'}, {'t...",1) Brown 1 pound of hamburger meat. Drain the ...
3,2022-09-04 03:33:03.555075,55ae64defdf99b3f864653e7,"[{'time': 1662262225339, 'word': 'Backspace'},...","1) Cook chicken as desired (boiled, pan seared..."
4,2022-09-04 03:33:30.062465,55ae64defdf99b3f864653e7,"[{'time': 1662262225339, 'word': 'Backspace'},...","1) Cook chicken as desired (boiled, pan seared..."
5,2022-09-04 03:34:08.666681,55ae64defdf99b3f864653e7,"[{'time': 1662262225339, 'word': 'Backspace'},...","1) Cook chicken as desired (boiled, pan seared..."
6,2022-09-04 03:35:33.869167,55ae64defdf99b3f864653e7,"[{'time': 1662262452660, 'word': '28'}, {'time...",28 oz or so of potatoes cubed\n1 8oz of cream ...
7,2022-09-04 03:36:08.454643,55ae64defdf99b3f864653e7,"[{'time': 1662262452660, 'word': '28'}, {'time...",28 oz or so of potatoes cubed\n1 8oz of cream ...
8,2022-09-04 03:36:43.142187,55ae64defdf99b3f864653e7,"[{'time': 1662262452660, 'word': '28'}, {'time...",28 oz or so of potatoes cubed\n1 8oz of cream ...
9,2022-09-04 14:01:31.746981,55d22025cc2b18000c0b9d9c,"[{'time': 1662298870230, 'word': 'Ingredients'...",To serve 4 people (your family!)\n\n\n-You wil...
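
The two-level sort can be illustrated on a small made-up frame. In this sketch `toy` and its values are assumptions for illustration. The timestamps are parsed with `pd.to_datetime` before sorting; the notebook sorts the raw strings instead, which gives the same order as long as every `event_date` uses the same zero-padded format shown in the output above.

```python
import pandas as pd

# Toy data: two users, with rows deliberately out of order.
toy = pd.DataFrame({
    "user_id": ["b", "a", "a"],
    "event_date": [
        "2022-09-04 01:00:00.000000",
        "2022-09-04 03:00:00.000000",
        "2022-09-04 02:00:00.000000",
    ],
})

# Parse the strings so that the sort is chronological by construction.
toy["event_date"] = pd.to_datetime(toy["event_date"])

# Primary key: user_id. Secondary key: event_date within each user.
sorted_toy = (
    toy.sort_values(by=["user_id", "event_date"], ascending=True)
       .reset_index(drop=True)
)
print(sorted_toy)
```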


We rewrite the file ```new_data.json``` so that its keystrokes follow the same order as the CSV, sorted by user and event date. This will prevent future issues. We also created the file ```keystrokes_sorted_by_user.json```.

In [84]:
# the CSV stores each keystroke list as a string, so parse it back into Python objects
data = [ast.literal_eval(ks) for ks in df['ks']]
with open("data/new_data.json", "w") as f:
    JSON.dump(data, f)
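
The reason `ast.literal_eval` is needed here is that a list stored in a CSV cell comes back as its string representation. The round trip below is a minimal sketch on made-up data (`toy` and the in-memory `buffer` are assumptions for illustration; the notebook works with real files).

```python
import ast
import io

import pandas as pd

# A toy frame whose `ks` cell holds a list of dicts, like the processed data.
toy = pd.DataFrame({
    "user_id": ["u1"],
    "ks": [[{"time": 1, "word": "brown"}]],
})

# Write to an in-memory CSV and read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

raw = loaded["ks"].iloc[0]       # a plain string after the round trip
parsed = ast.literal_eval(raw)   # back to a list of dicts

print(type(raw).__name__, type(parsed).__name__)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of parsing.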