# User behavior as a predictor for input accuracy
Johann Miller, University of Maryland Makeability Lab

## Introduction
[Project Sidewalk](http://sidewalk.umiacs.umd.edu) is an online platform for identifying accessibility problems on sidewalks. Users navigate streets using Google Street View and place labels on issues such as crosswalks without curb ramps, uneven pavement, and obstacles blocking the path. To ensure some level of accuracy in the data, Project Sidewalk can use a couple of tools. First, each user has to complete a tutorial before they can begin to report problems. Second, there is ground truth seeding, where users place labels in a region that already has established answers. If a user's labels don't match the ground truth, then all of the data they entered can be flagged.

Here, we will investigate another possible option: using the user's interactions with the tool to predict their accuracy. Consider a user who is inactive for long periods and barely uses any of the tool's features. This user probably gives worse input than a user who works consistently and employs all of the tool's features. Aspects of interaction include mouse movements, keypresses, and other events that we can collect while users place labels. If accurate labels correspond to a certain type of usage, then these features could predict the accuracy of a user even in regions without ground truth.

To see if this is possible, we'll use data from ground truth regions to train and test a classifier.

## Set up the Python notebook

In [11]:
import pandas as pd
import psycopg2 as pg

# Connect to the volunteer database
vol_con = pg.connect(database="sidewalk", user="sidewalk", password="sidewalk", host="localhost", port="5432")

# Connect to the turker database
turk_con = pg.connect(database="sidewalk_turker", user="sidewalk", password="sidewalk", host="localhost", port="5432")

In [12]:
%%html
<style>
img {margin-left: 0}
</style>

## Collecting user events
Project Sidewalk logs a variety of user events. The events range from low-level (mouse movements, clicks, etc.) to high-level (zooming in or out, changing label mode, etc.). If we query the interaction table, we can see all the event types.

In [13]:
event_types = pd.read_sql(
'''
SELECT DISTINCT action
FROM audit_task_interaction
''', vol_con)

print("Number of event types:", len(event_types))
event_types.head()

Number of event types: 109


|   | action |
|---|--------|
| 0 | ModeSwitch_Walk |
| 1 | KeyboardShortcut_ModeSwitch_CurbRamp |
| 2 | ViewControl_DoubleClick |
| 3 | KeyboardShortcut_ClickOk |
| 4 | KeyUp |


For each user session, we have a collection of the events that were triggered. To compare two sessions, we can look at the total number of each type of event. We also look at the mean and standard deviation of the number of events per Google Street View panorama. Totals alone grow with the size of the region, because a larger region contains more panoramas. The per-panorama statistics let a session in a large region be compared fairly with a session in a smaller one.
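The aggregation described above can be sketched in plain Python. This is only an illustration under stated assumptions: the `(panorama_id, action)` input layout and the `session_features` helper are hypothetical, not the actual `audit_task_interaction` schema or the script that produced the feature file. The sketch averages over panoramas where an action occurred at least once and uses the sample standard deviation; the original features may have been computed differently.

```python
from collections import Counter, defaultdict
from statistics import mean, stdev


def session_features(events):
    """Summarize one session as totals and per-panorama statistics.

    `events` is an iterable of (panorama_id, action) pairs. This layout is
    an assumption made for the example, not the real database schema.
    """
    # For each action, count how many times it fired in each panorama.
    per_pan = defaultdict(Counter)
    for pano_id, action in events:
        per_pan[action][pano_id] += 1

    features = {}
    for action, counts in per_pan.items():
        values = list(counts.values())
        features[f"{action}_total"] = sum(values)
        features[f"{action}_per_pan_mean"] = mean(values)
        # The sample standard deviation needs at least two panoramas.
        features[f"{action}_per_pan_std"] = stdev(values) if len(values) > 1 else 0.0
    return features


# Made-up events: two panoramas, two event types.
example_events = [
    ("pano_a", "ViewControl_MouseDown"),
    ("pano_a", "ViewControl_MouseDown"),
    ("pano_a", "KeyUp"),
    ("pano_b", "ViewControl_MouseDown"),
    ("pano_b", "ViewControl_MouseDown"),
    ("pano_b", "ViewControl_MouseDown"),
    ("pano_b", "ViewControl_MouseDown"),
]
example = session_features(example_events)
print(example["ViewControl_MouseDown_total"])  # 6 events over 2 panoramas
```

The feature names follow the `<action>_total`, `<action>_per_pan_mean`, and `<action>_per_pan_std` pattern seen in the columns of the feature table.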

We can load in these event counts from `features.csv`. This file was created by **TODO**.

In [14]:
features = pd.read_csv('features.csv', index_col=0)
features.sample(n=5)

|   | Click_LabelDelete_per_pan_mean | Click_LabelDelete_per_pan_std | Click_LabelDelete_total | Click_ModeSwitch_CurbRamp_per_pan_mean | Click_ModeSwitch_CurbRamp_per_pan_std | Click_ModeSwitch_CurbRamp_total | Click_ModeSwitch_NoCurbRamp_per_pan_mean | Click_ModeSwitch_NoCurbRamp_per_pan_std | Click_ModeSwitch_NoCurbRamp_total | Click_ModeSwitch_NoSidewalk_per_pan_mean | ... | ViewControl_DoubleClick_total | ViewControl_MouseDown_per_pan_mean | ViewControl_MouseDown_per_pan_std | ViewControl_MouseDown_total | ViewControl_MouseUp_per_pan_mean | ViewControl_MouseUp_per_pan_std | ViewControl_MouseUp_total | WalkTowards_per_pan_mean | WalkTowards_per_pan_std | WalkTowards_total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 143 | 0.0 | 0.0 | 0 | 1.714286 | 0.48795 | 12 | 1.0 | 0.0 | 2 | 0 | ... | 0 | 2.111111 | 1.707659 | 209 | 2.030303 | 1.554812 | 201 | 0 | 0 | 0 |
| 471 | 0.0 | 0.0 | 0 | 2.666667 | 1.230915 | 32 | 2.0 | 1.732051 | 6 | 0 | ... | 51 | 5.944444 | 4.839979 | 214 | 5.777778 | 4.573908 | 208 | 0 | 0 | 0 |
| 140 | 0.0 | 0.0 | 0 | 1.714286 | 0.48795 | 12 | 1.0 | 0.0 | 2 | 0 | ... | 0 | 2.111111 | 1.707659 | 209 | 2.030303 | 1.554812 | 201 | 0 | 0 | 0 |
| 232 | 1.608696 | 1.587998 | 37 | 2.428571 | 1.727959 | 272 | 1.516129 | 0.889605 | 47 | 0 | ... | 27 | 4.533898 | 7.961681 | 1605 | 4.272727 | 6.89395 | 1504 | 0 | 0 | 0 |
| 28 | 1.0 | 0.0 | 4 | 3.2 | 2.388004 | 128 | 1.666667 | 0.57735 | 5 | 0 | ... | 8 | 4.848684 | 6.557701 | 737 | 4.651316 | 6.087943 | 707 | 0 | 0 | 0 |


## Grading user accuracy
We need to rate each user session on its accuracy compared to the ground truth. We do so by counting the number of true positives, false positives, true negatives, and false negatives from the session. These are defined as follows:

#### True positive
The user placed a correct label. Here, the green icon is a label for a curb ramp. The user placed it correctly, so this is a true positive.

![true positive](images/true-pos.png)

#### False positive
The user placed an incorrect label. Here, the user placed a green icon to identify a curb ramp, but none are present.

![false positive](images/false-pos.png)

#### True negative
There was nothing to label, and the user didn't label anything.

![true negative](images/true-neg.png)

#### False negative
There was something to label, but the user missed it. Here, there is a curb ramp with no label.

![false negative](images/false-neg.png)
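The four outcomes above can be tallied with a few lines of Python. This is a minimal sketch, assuming each candidate location has already been reduced to a pair of booleans (whether the ground truth has a problem there, and whether the user placed a label there). How Project Sidewalk actually matches user labels to ground truth locations is not covered here, and `tally_outcomes` is a hypothetical helper.

```python
from collections import Counter

# Map (ground truth has a problem, user placed a label) to an outcome name.
OUTCOME_NAMES = {
    (True, True): "true_positive",
    (False, True): "false_positive",
    (False, False): "true_negative",
    (True, False): "false_negative",
}


def tally_outcomes(locations):
    """Count the four outcomes for an iterable of (has_problem, was_labeled) pairs."""
    # Start every outcome at zero so missing outcomes still appear in the result.
    counts = Counter({name: 0 for name in OUTCOME_NAMES.values()})
    counts.update(
        OUTCOME_NAMES[(bool(has_problem), bool(was_labeled))]
        for has_problem, was_labeled in locations
    )
    return dict(counts)


# Made-up session with five candidate locations.
example_locations = [
    (True, True),    # curb ramp present and labeled
    (False, True),   # label placed where nothing exists
    (False, False),  # nothing to label, nothing labeled
    (False, False),
    (True, False),   # curb ramp present but missed
]
outcome_counts = tally_outcomes(example_locations)
print(outcome_counts)
```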

From these four counts, we can compute several standard measures:

$\text{precision} = \frac{\text{true positives}}{\text{true positives + false positives}}$

$\text{recall} = \frac{\text{true positives}}{\text{true positives + false negatives}}$

$\text{specificity} = \frac{\text{true negatives}}{\text{true negatives + false positives}}$

$\text{accuracy} = \frac{\text{true positives + true negatives}}{\text{true positives + false positives + true negatives + false negatives}}$

$\text{F1 score} = \frac{2 \times \text{true positives}}{2 \times \text{true positives} + \text{false positives + false negatives}}$
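As a quick check on these definitions, the sketch below computes all five measures from the four counts. It uses the standard definition of specificity, true negatives divided by the sum of true negatives and false positives. The `session_metrics` helper and the example counts are illustrative assumptions, not taken from the Project Sidewalk data, and returning NaN for an empty denominator is one possible convention among several.

```python
def session_metrics(tp, fp, tn, fn):
    """Compute standard measures from confusion counts for one session."""

    def ratio(numerator, denominator):
        # An undefined ratio (empty denominator) is reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Illustrative counts: a user who finds every problem but over-labels.
metrics = session_metrics(tp=19, fp=32, tn=111, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

With these counts, recall is perfect because nothing was missed, while precision is low because most of the placed labels were incorrect. The F1 score balances the two.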

In [15]:
labels = pd.read_csv('labels.csv', index_col=0)
labels.sample(n=5)

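|     | precision | recall | specificity | f.measure |
|-----|-----------|--------|-------------|-----------|
| 22  | 0.372549 | 1.0 | 0.776224 | 0.542857 |
| 350 | 0.75 | 0.044118 | 0.994536 | 0.083333 |
| 380 | 0.142857 | 0.5 | 0.97 | 0.222222 |
| 475 | 0.4 | 0.25 | 0.984536 | 0.307692 |
| 85  | 0.5625 | 0.75 | 0.93578 | 0.642857 |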
