In [2]:
import pandas as pd
import numpy as np

In [3]:
data = pd.read_csv("../logs/all.csv",
                  names=["time", "source", "user", "session", "userAgent", "screenWidth", "screenHeight", 
                         "windowWidth", "windowHeight", "resolution", "graph", "position"])

## Cleanup

Remove the leading URL prefix from the source, which is unimportant.

In [4]:
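# regex=False treats the URL as a literal string, so the dots are not interpreted as wildcards
data["source"] = data["source"].str.replace("https://jonasoesch.ch/content/work/mortality/", "", regex=False)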

Split source into scenario and story

In [5]:
data["scenario"] = data["source"].str.split("/", expand=True)[0]

In [6]:
data["story"] = data["source"].str.split("/", expand=True)[1].str.split(".", expand=True)[0]
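
The two cells above split the same column several times. A minimal sketch of an equivalent that splits once per step is shown here on made-up source strings that follow the `scenario/story.html` pattern seen in this notebook. The `toy` frame and the `parts` name are assumptions for illustration, not part of the logged data.

```python
import pandas as pd

# Hypothetical sources in the "scenario/story.html" form used by the log.
toy = pd.DataFrame({"source": ["superposed-animated/causes.html",
                               "juxtaposed-static/demographics.html"]})

# Split at the first slash into scenario and file name.
parts = toy["source"].str.split("/", n=1, expand=True)
toy["scenario"] = parts[0]

# Keep everything before the first dot of the file name.
toy["story"] = parts[1].str.split(".", n=1).str[0]

print(toy[["scenario", "story"]])
```

Keeping the intermediate `parts` frame avoids repeating the split and makes it easier to inspect rows that do not follow the expected pattern.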

In [7]:
#data = data.drop(["source"], axis=1)

Remove entries where the graph has been undefined (not drawn yet)

In [8]:
data = data[data["graph"] != "undefined"]

## Plausibility checks

### When was the first recording by scenario

In [9]:
pd.to_datetime(data.groupby(["scenario"])["time"].min(), unit="ms") 

scenario
juxtaposed-animated   2019-02-04 10:03:03.144
juxtaposed-static     2019-02-04 19:00:43.956
superposed-animated   2019-02-04 18:59:35.878
superposed-static     2019-02-04 18:51:06.671
Name: time, dtype: datetime64[ns]

### When was the latest recording by scenario

In [10]:
pd.to_datetime(data.groupby(["scenario"])["time"].max(), unit="ms") 

scenario
juxtaposed-animated   2019-02-20 10:28:40.584
juxtaposed-static     2019-02-19 10:32:45.696
superposed-animated   2019-02-20 12:43:18.914
superposed-static     2019-02-20 05:39:13.109
Name: time, dtype: datetime64[ns]

### There should only be one user-agent string per user

In [11]:
uaPerUser = data.groupby(["user"])["userAgent"].agg(lambda ua: len(ua.unique()))
uaPerUser[uaPerUser > 1]

user
1549545871217-0.dgx1qfz08iv    2
1549545879946-0.616nkiecbnc    2
1549550007373-0.gy5gegu2zrc    2
Name: userAgent, dtype: int64

### Why are there two user agent strings?

In [12]:
data[data["user"].str.contains("1549545871217-0.dgx1qfz08iv")]["userAgent"].unique()

array(['Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.96 Safari/537.36',
       'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'],
      dtype=object)

In [13]:
data[data["user"].str.contains("1549545879946-0.616nkiecbnc")]["userAgent"].unique()

array(['Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.96 Safari/537.36',
       'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'],
      dtype=object)

In [14]:
data[data["user"].str.contains("1549550007373-0.gy5gegu2zrc")]["userAgent"].unique()

array(['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.96 Safari/537.36',
       'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'],
      dtype=object)

Answer: a Chrome update (from 72.0.3626.96 to 72.0.3626.109) on Windows and on Mac.
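
One way to confirm this without reading the strings by eye is to mask the Chrome version and count distinct values again. The sketch below uses the two Mac strings printed above; the placeholder text `Chrome/<version>` and the variable names are assumptions made for this example.

```python
import pandas as pd

ua = pd.Series([
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.96 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36",
])

# Replace the Chrome version number with a placeholder so that strings
# differing only in that version compare equal.
normalized = ua.str.replace(r"Chrome/[\d.]+", "Chrome/<version>", regex=True)

print(ua.nunique())          # distinct raw strings
print(normalized.nunique())  # distinct strings once the version is masked
```

If the second count were still greater than one for a user, the difference would lie somewhere other than the Chrome version.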

### There should be only one user per session

In [15]:
usersPerSession = data.groupby("session")["user"].agg(lambda user: len(user.unique()))
usersPerSession[usersPerSession > 1]

Series([], Name: user, dtype: int64)

### The position should always be between -1 and 1

In [16]:
len(data[data["position"] > 1])

0

In [17]:
len(data[data["position"] < -1])

0
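
The two range checks can also be combined into one. This is a small sketch on made-up positions, using `Series.between`, which is inclusive on both ends by default; the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical positions; only the last one lies outside [-1, 1].
toy = pd.DataFrame({"position": [-1.0, -0.25, 0.0, 0.5, 1.0, 1.2]})

# Rows whose position is not within the closed interval [-1, 1].
out_of_range = toy[~toy["position"].between(-1, 1)]
print(len(out_of_range))
```

Note that `between` returns `False` for missing values, so this variant would also flag rows with a missing position, which the two comparisons above do not.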

### Negative positions should not occur on a regular graph, and positive positions should not occur on an `@` graph

In [18]:
len(data[(data["position"] < 0) & ~data["graph"].str.contains("@")])

0

In [19]:
len(data[(data["position"] > 0) & data["graph"].str.contains("@")])

0

## The first entry of a session is always @init

In [20]:
firstEntry = data.copy()
firstEntry["entry"] = data.session.map(data.groupby("session")["graph"].first())
firstEntry[firstEntry.entry != "@init"].session.unique()

array(['1549274269236-0.yzc7jfhnfej'], dtype=object)

This was not the case only in the first recorded session. The `@init` entry was probably not implemented yet at that point.
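
The check above takes the first row of each session in file order. If the rows were not guaranteed to be chronological (an assumption this notebook does not verify), sorting by time first would make "first" mean "earliest". A sketch on made-up data:

```python
import pandas as pd

# Hypothetical log rows; session "a" is stored out of chronological order.
toy = pd.DataFrame({
    "session": ["a", "a", "b", "b"],
    "time":    [20, 10, 5, 7],
    "graph":   ["gender", "@init", "uptick", "@init"],
})

# Sort by time so that first() picks the earliest record of every session.
first_graph = toy.sort_values("time").groupby("session")["graph"].first()

# Sessions whose earliest record is not "@init".
offenders = first_graph[first_graph != "@init"].index.tolist()
print(offenders)
```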

## Exploration

### How many sessions per scenario?

In [21]:
data.groupby(["scenario"])["session"].agg(lambda session: len(session.unique()))

scenario
juxtaposed-animated    76
juxtaposed-static      15
superposed-animated    41
superposed-static      25
Name: session, dtype: int64

### How many sessions per story?

In [22]:
data.groupby(["story"])["session"].agg(lambda session: len(session.unique()))

story
absolute        38
causes          24
demographics    64
relative        31
Name: session, dtype: int64
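
The two counts above can be combined into one scenario-by-story table. The following is a sketch on made-up rows; `toy` and `sessions_table` are hypothetical names, and `nunique` is used in place of the `len(unique())` lambda.

```python
import pandas as pd

# Hypothetical log rows: session s1 appears twice and must be counted once.
toy = pd.DataFrame({
    "scenario": ["juxtaposed-static", "juxtaposed-static",
                 "superposed-static", "superposed-static"],
    "story":    ["causes", "causes", "causes", "relative"],
    "session":  ["s1", "s1", "s2", "s3"],
})

# Distinct sessions per (scenario, story), pivoted so stories become columns.
sessions_table = (
    toy.groupby(["scenario", "story"])["session"]
       .nunique()
       .unstack(fill_value=0)
)
print(sessions_table)
```

A table like this makes it visible when a particular scenario and story combination has very few sessions.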

## Sessions over time

In [23]:
sessions = data.groupby("session").agg({"session": "first", "time": "min", "scenario": "first"})
sessions["time"] = pd.to_datetime(sessions["time"], unit="ms")
sessions.to_csv("sessions_ts.csv")

See `timedistribution.vl`

## How long are session durations?

In [24]:
durations = data.groupby("session").agg({"session": "first", "time": ["min", "max"], "scenario": "first"})
durations["duration"] = (durations["time", "max"] - durations["time", "min"]) / 1000
durations.to_csv("session_durations.csv")

In [25]:
durations["duration"].min()

0.0

In [26]:
durations["duration"].max() / 60 / 60

47.87285027777777

## How many zero-duration sessions?

In [27]:
len(durations[durations["duration"] == 0])

41

## How many sessions are longer than 15 minutes?

In [28]:
len(durations[durations["duration"] > 60*15])

8

## How many sessions in between?

In [29]:
durations = durations[durations["duration"] > 0].copy()
durations = durations[durations["duration"] < 60*15].copy()
durations.to_csv("session_durations.csv")
len(durations)

108

So the longest session lasted almost 48 hours, which happens when a tab is left open indefinitely.
We find that there are many zero-duration sessions and that typical sessions are no longer than 3 minutes.

See `session_durations.vl`
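
The three counts above (zero-duration, longer than 15 minutes, in between) can also be obtained in one step by binning. This sketch uses `pd.cut` on made-up durations in seconds; the bucket labels are assumptions chosen for this example.

```python
import pandas as pd

# Hypothetical session durations in seconds.
toy = pd.Series([0.0, 12.5, 170.0, 899.0, 5000.0], name="duration")

# Bins are right-inclusive: (-inf, 0], (0, 900], (900, inf).
buckets = pd.cut(
    toy,
    bins=[-float("inf"), 0, 15 * 60, float("inf")],
    labels=["zero", "in between", "over 15 min"],
)
counts = buckets.value_counts()
print(counts)
```

One difference to keep in mind: a session of exactly 900 seconds would land in the middle bucket here, whereas the strict comparisons in the cells above exclude it from both groups.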

## Which charts have been drawn how often?

In [30]:
data.groupby(["scenario", "story", "graph"])["time"].count().to_csv("overview.csv")

### How many unique visitors did the experiment have?

In [31]:
len(data.groupby(['user']))

33

# Preparation

## Timedelta

In our analysis, the moment of each action should be expressed relative to the start of its session.

In [32]:
minima = data.groupby("session")["time"].min()
data["minTime"] = data['session'].map(minima)
data["timeDelta"] = (data["time"] - data["minTime"]) / 1000 # in seconds
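
The same offset can be computed without the intermediate `minima` lookup by using `transform`, which broadcasts each group's result back onto its rows. A sketch on made-up timestamps (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical records with epoch timestamps in milliseconds.
toy = pd.DataFrame({
    "session": ["a", "a", "b", "b"],
    "time":    [1000, 1500, 4000, 7000],
})

# transform("min") returns a Series aligned with toy, holding the
# minimum time of the session each row belongs to.
session_start = toy.groupby("session")["time"].transform("min")
toy["timeDelta"] = (toy["time"] - session_start) / 1000  # in seconds
print(toy)
```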

## Creating labels based on the timestamp

In [33]:
dateLabels = pd.to_datetime(data.groupby("session")["time"].min(), unit="ms")
data["sessionDate"] = data.session.map(dateLabels)

## Mapping the positions

Positions have been recorded relative to the graph displayed. These need to be remapped to a more sensible, global position value between 0 and 1

### Demographics

In [34]:
data.loc[data["story"].str.contains("demographics") & data["graph"].str.contains("^gender$"), "globalPosition"] = 0
data.loc[data["story"].str.contains("demographics") & data["graph"].str.contains("^demographics$"), "globalPosition"] = 1

#### Juxtaposed

In [35]:
data.loc[data["scenario"].str.contains("juxtaposed") & data["story"].str.contains("demographics") & data["graph"].str.contains("^gender-highlight$"), "globalPosition"] = data["position"] / 3
data.loc[data["scenario"].str.contains("juxtaposed") & data["story"].str.contains("demographics") & data["graph"].str.contains("^move-highlight$"), "globalPosition"] = 1/3 + data["position"] / 3
data.loc[data["scenario"].str.contains("juxtaposed") & data["story"].str.contains("demographics") & data["graph"].str.contains("^highlight-demographics$"), "globalPosition"] = 2/3 + data["position"] / 3

#### Superposed

In [36]:
data.loc[data["scenario"].str.contains("superposed") & data["story"].str.contains("demographics") & data["graph"].str.contains("^gender-highlight$"), "globalPosition"] = data["position"] / 2
data.loc[data["scenario"].str.contains("superposed") & data["story"].str.contains("demographics") & data["graph"].str.contains("^highlight-demographics$"), "globalPosition"] = 1/2 + data["position"] / 2

### Absolute

In [37]:
data.loc[data["story"].str.contains("absolute") & data["graph"].str.contains("^demographics$"), "globalPosition"] = 0
data.loc[data["story"].str.contains("absolute") & data["graph"].str.contains("^differences$"), "globalPosition"] = 1

In [38]:
data.loc[data["story"].str.contains("absolute") & data["graph"].str.contains("^demographics-differences$"), "globalPosition"] = data["position"]

### Relative

In [39]:
data.loc[data["story"].str.contains("relative") & data["graph"].str.contains("^differences$"), "globalPosition"] = 0
data.loc[data["story"].str.contains("relative") & data["graph"].str.contains("^uptick$"), "globalPosition"] = 1

In [40]:
data.loc[data["story"].str.contains("relative") & data["graph"].str.contains("^differences-uptick$"), "globalPosition"] = data["position"]

### Causes

In [41]:
data.loc[data["story"].str.contains("causes") & data["graph"].str.contains("^uptick$"), "globalPosition"] = 0
data.loc[data["story"].str.contains("causes") & data["graph"].str.contains("^aids$"), "globalPosition"] = 1

#### Juxtaposed

In [42]:
data.loc[data["scenario"].str.contains("juxtaposed") & data["story"].str.contains("causes") & data["graph"].str.contains("^move-lines$"), "globalPosition"] = data["position"] / 2
data.loc[data["scenario"].str.contains("juxtaposed") & data["story"].str.contains("causes") & data["graph"].str.contains("^uptick-aids$"), "globalPosition"] = 1/2 + data["position"] / 2

#### Superposed

In [43]:
data.loc[data["scenario"].str.contains("superposed") & data["story"].str.contains("causes") & data["graph"].str.contains("^highlight$"), "globalPosition"] = data["position"] / 2
data.loc[data["scenario"].str.contains("superposed") & data["story"].str.contains("causes") & data["graph"].str.contains("^uptick-aids$"), "globalPosition"] = 1/2 + data["position"] / 2
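
The per-story assignments above all follow one pattern: if a story has n transition graphs in a fixed order, graph number i (counting from 0) covers the interval from i/n to (i+1)/n, and the local position moves within it. The sketch below illustrates this for the three juxtaposed demographics graphs; the `order` dictionary and the `toy` frame are assumptions for this example, not a replacement for the cells above.

```python
import pandas as pd

# Order of the transition graphs in the juxtaposed demographics story,
# as used in the juxtaposed demographics cell earlier in this section.
order = {"gender-highlight": 0, "move-highlight": 1, "highlight-demographics": 2}

# Hypothetical records with a local position in [0, 1] per graph.
toy = pd.DataFrame({
    "graph":    ["gender-highlight", "move-highlight", "highlight-demographics"],
    "position": [0.5, 0.5, 1.0],
})

# (i + position) / n is the same as i/n + position/n.
toy["globalPosition"] = (toy["graph"].map(order) + toy["position"]) / len(order)
print(toy)
```

Halfway through the second of three graphs maps to 0.5 overall, and the end of the last graph maps to 1.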

We will now remove all records whose position has not been mapped. The graphs they belong to are only supporting and do not contain any information. First we check that we did not miss anything important.

In [44]:
data[data.globalPosition.isna()].graph.unique()

array(['@alive', '@init', 'empty-demographics', 'differences-empty',
       'aids-empty', 'uptick-empty'], dtype=object)

In [45]:
slim = data[~data.globalPosition.isna()].copy()
len(slim)/len(data)

0.745833421124713

## Ignore very long and zero-length sessions for now

In [46]:
maxDelta = slim.groupby("session").timeDelta.max()
slim["maxDelta"] = slim.session.map(maxDelta)
slimBrief = slim[(slim.maxDelta > 0.0) & (slim.maxDelta < 180)].copy()

len(slimBrief) / len(slim)

0.7173569128199333

In [47]:
slimBrief[slimBrief.scenario.str.contains("juxtaposed") & slimBrief.story.str.contains("demographics")].to_csv("juxtaposed-demographics.csv")

In [48]:
slimBrief[slimBrief.scenario.str.contains("juxtaposed") & slimBrief.story.str.contains("absolute")].to_csv("juxtaposed-absolute.csv")

In [49]:
slimBrief[slimBrief.scenario.str.contains("juxtaposed") & slimBrief.story.str.contains("relative")].to_csv("juxtaposed-relative.csv")

In [50]:
slimBrief[slimBrief.scenario.str.contains("juxtaposed") & slimBrief.story.str.contains("causes")].to_csv("juxtaposed-causes.csv")

In [51]:
slimBrief[slimBrief.scenario.str.contains("superposed") & slimBrief.story.str.contains("demographics")].to_csv("superposed-demographics.csv")

In [52]:
slimBrief[slimBrief.scenario.str.contains("superposed") & slimBrief.story.str.contains("absolute")].to_csv("superposed-absolute.csv")

In [53]:
slimBrief[slimBrief.scenario.str.contains("superposed") & slimBrief.story.str.contains("relative")].to_csv("superposed-relative.csv")

In [54]:
slimBrief[slimBrief.scenario.str.contains("superposed") & slimBrief.story.str.contains("causes")].to_csv("superposed-causes.csv")
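
The eight export cells above differ only in the layout and story names, so they could be generated in a loop. This is a sketch on a made-up frame; note that it swaps in `str.startswith` and an equality test where the cells above use `str.contains`, which is an assumption about how the names are structured.

```python
import itertools
import pandas as pd

# Hypothetical stand-in for slimBrief.
toy = pd.DataFrame({
    "scenario": ["juxtaposed-animated", "juxtaposed-static", "superposed-static"],
    "story":    ["causes", "causes", "relative"],
    "globalPosition": [0.1, 0.2, 0.3],
})

layouts = ["juxtaposed", "superposed"]
stories = ["demographics", "absolute", "relative", "causes"]

subsets = {}
for layout, story in itertools.product(layouts, stories):
    mask = toy["scenario"].str.startswith(layout) & (toy["story"] == story)
    subsets[f"{layout}-{story}.csv"] = toy[mask]
    # toy[mask].to_csv(f"{layout}-{story}.csv")  # uncomment to write the files

print({name: len(frame) for name, frame in subsets.items()})
```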

## Drawing performance

Assuming the drawing rate would never fall below 5 frames/second, what was the mean drawing performance by session? We only consider scenarios that contain animated transitions.

In [105]:
timeDifferences = data.groupby("session")["time"].diff()
data["timeDifference"] = timeDifferences
len(data[(data["timeDifference"] < 100) 
     & (data["timeDifference"] > 0) & (data.scenario.str.contains("animated"))]) / len(data[data.scenario.str.contains("animated")])

0.9636733193504443

Only about four percent of the records fall outside the range of 0 to 100 ms. We assume that most of these stem from users who were inactive for a short time and then started scrolling again. Also, from visual tests we conclude that drawing performance would not have dropped below 5 frames/second, as this would have been clearly visible. We move on to analyze the distribution within the above bounds.
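
To relate the millisecond bounds to frame rates, a gap of g milliseconds between consecutive records corresponds to 1000 / g frames per second. The sketch below applies the same bounds as the cell above to made-up gaps; the values in `gaps_ms` are hypothetical.

```python
import pandas as pd

# Hypothetical gaps between consecutive records, in milliseconds.
gaps_ms = pd.Series([16.0, 17.0, 33.0, 100.0, 250.0, 4000.0])

# Keep gaps strictly between 0 and 100 ms, as in the cell above.
within = gaps_ms[(gaps_ms > 0) & (gaps_ms < 100)]
share_within = len(within) / len(gaps_ms)

# Convert each remaining gap to an instantaneous frame rate.
fps = 1000 / within
print(share_within)
print(fps.round(1).tolist())
```

By this conversion, the 100 ms cutoff corresponds to 10 frames per second, while 5 frames per second would correspond to a gap of 200 ms.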

In [109]:
data[(data["timeDifference"] < 100) 
     & (data["timeDifference"] > 0)
     & (data.scenario.str.contains("animated"))].to_csv("performance.csv", 
                                             columns=["session", "timeDifference"])
#placeholder = data[data.timedifferences < 500].groupby("session").mean().timedifferences

Are there sessions with markedly lower drawing performance?

In [124]:
sessionPerformance = data[(data["timeDifference"] < 100) 
     & (data["timeDifference"] > 0) 
     & (data.scenario.str.contains("animated"))].groupby("session").timeDifference.mean()
sessionPerformance[(sessionPerformance > 10) & (sessionPerformance < 100)].mean()

23.99811680654141

In [120]:
sessionPerformance[sessionPerformance > 50]

session
1550247385573-0.itj7bgxptoo    60.222222
Name: timeDifference, dtype: float64

What is this session like?

In [111]:
len(data[data.session.str.contains("1550247385573-0.itj7bgxptoo")])

34

In [112]:
entriesPerSession = data.groupby("session").time.count()

In [113]:
entriesPerSession[entriesPerSession > 1].mean()

408.7931034482759

So with 34 entries, against a mean of about 409 for sessions with more than one entry, it was a rather short session.

In [116]:
data[data.session.str.contains("1550247385573-0.itj7bgxptoo")].userAgent.str.split(" ", expand=True).iloc[0]

0             Mozilla/5.0
1             (Macintosh;
2                   Intel
3                     Mac
4                      OS
5                       X
6                10_10_5)
7      AppleWebKit/537.36
8                 (KHTML,
9                    like
10                 Gecko)
11    Chrome/70.0.3538.77
12          Safari/537.36
Name: 21707, dtype: object

In [117]:
data[data.session.str.contains("1550247385573-0.itj7bgxptoo")].source.iloc[0]

'superposed-animated/causes.html'