# Introduction

[Instrumentation ticket](https://phabricator.wikimedia.org/T292586)   | [QA ticket](https://phabricator.wikimedia.org/T297851)

# Instrumentation note

The Web team has deployed instrumentation that tracks when users scroll back to the top of the page.  
The related events are stored in the `mediawiki_web_ui_scroll` schema.

QA on 2021-12-16

In [1]:
# Helper that silences startup messages, warnings, and messages while packages load
shhh <- function(expr) suppressPackageStartupMessages(suppressWarnings(suppressMessages(expr)))
shhh({
  library(tidyverse)
  library(lubridate)
  library(scales)
  library(magrittr)
  library(dplyr)
})

In [3]:
library(IRdisplay)

display_html(
'<script>  
code_show=true; 
function code_toggle() {
  if (code_show){
    $(\'div.input\').hide();
  } else {
    $(\'div.input\').show();
  }
  code_show = !code_show
}  
$( document ).ready(code_toggle);
</script>
  <form action="javascript:code_toggle()">
    <input type="submit" value="Click here to toggle on/off the raw code.">
 </form>'
)

In [2]:
options(repr.plot.width = 15, repr.plot.height = 10)

# Daily events

In [9]:
query <- 
"
select to_date(dt) AS date_time,substr(dt,1,10),year, month,day, COUNT(1) AS events,
COUNT(DISTINCT web_session_id) AS sessions,
COUNT(DISTINCT page_id) AS pages
FROM event.mediawiki_web_ui_scroll
WHERE year=2021
GROUP BY to_date(dt),substr(dt,1,10),year, month,day
"

In [10]:
df <-  wmfdata::query_hive(query)

Don't forget to authenticate with Kerberos using kinit



In [11]:
df

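| date_time | X_c1 | year | month | day | events | sessions | pages |
|---|---|---|---|---|---|---|---|
| `<chr>` | `<chr>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 2008-12-31 | 2008-12-31 | 2021 | 12 | 13 | 1 | 1 | 1 |
| 2020-12-07 | 2020-12-07 | 2021 | 12 | 7 | 1 | 1 | 1 |
| 2021-11-01 | 2021-11-01 | 2021 | 12 | 7 | 1 | 1 | 1 |
| 2021-11-02 | 2021-11-02 | 2021 | 12 | 5 | 3 | 1 | 1 |
| 2021-11-03 | 2021-11-03 | 2021 | 12 | 3 | 1 | 1 | 1 |
| 2021-11-03 | 2021-11-03 | 2021 | 12 | 6 | 1 | 1 | 1 |
| 2021-11-04 | 2021-11-04 | 2021 | 12 | 7 | 2 | 1 | 1 |
| 2021-11-05 | 2021-11-05 | 2021 | 12 | 5 | 11 | 2 | 5 |
| 2021-11-06 | 2021-11-06 | 2021 | 12 | 3 | 1 | 1 | 1 |
| 2021-11-06 | 2021-11-06 | 2021 | 12 | 6 | 2 | 2 | 2 |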


__Note:__  


The data in the `dt` field looks unreliable. It contains dates from 2008, when this schema was not yet enabled, and many dates in 2022, which was still in the future at the time of this QA. These values do not match the `year`, `month`, and `day` partition fields.
The issue appears on all wikis, not just on one particular wiki.

__Reason:__

In the modern event logging platform, `dt` now means the time according to the client. So if someone has the clock on their phone set to 2008 or 2022, `dt` reflects that. On the other hand, `meta.dt` (which is used to set the partition fields) is the time our server received the event, which would still be 2021. [ticket](https://phabricator.wikimedia.org/T292586#7581979)


Work-around: query the data using the partition fields or `meta.dt` instead of the `dt` field.
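The mismatch described above can be checked on a handful of rows. The following is a minimal sketch in base R, using four rows copied from the output above. The names `sample_rows` and `max_skew_days` are hypothetical, and the one-day tolerance is an assumption for illustration, not a rule from the ticket.

```r
# Rows copied from the query output: client-side date (from dt) next to
# the server-side partition fields.
sample_rows <- data.frame(
  client_date = c("2008-12-31", "2020-12-07", "2021-11-01", "2021-11-05"),
  year  = c(2021L, 2021L, 2021L, 2021L),
  month = c(12L, 12L, 12L, 12L),
  day   = c(13L, 7L, 7L, 5L),
  stringsAsFactors = FALSE
)

# Rebuild the server-side date from the partition fields.
sample_rows$partition_date <- as.Date(
  sprintf("%04d-%02d-%02d", sample_rows$year, sample_rows$month, sample_rows$day)
)

# Whole days between the client clock and the date the server received the event.
sample_rows$skew_days <- as.integer(
  sample_rows$partition_date - as.Date(sample_rows$client_date)
)

# Hypothetical tolerance: flag rows whose client date is more than one day off.
max_skew_days <- 1L
sample_rows$clock_skewed <- abs(sample_rows$skew_days) > max_skew_days

print(sample_rows)
```

A flag like this could be used to measure how common skewed client clocks are, while the partition fields or `meta.dt` remain the basis for daily counts.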

In [28]:
query <- 
"
select to_date(meta.dt) AS date_time,year, month,day, COUNT(1) AS events,
COUNT(DISTINCT web_session_id) AS sessions,
COUNT(DISTINCT page_id) AS pages
FROM event.mediawiki_web_ui_scroll
WHERE year=2021
GROUP BY to_date(meta.dt),year, month,day
"

In [29]:
df <-  wmfdata::query_hive(query)

In [30]:
df

| date_time | year | month | day | events | sessions | pages |
|---|---|---|---|---|---|---|
| `<chr>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 2021-12-02 | 2021 | 12 | 2 | 81487 | 48361 | 46558 |
| 2021-12-04 | 2021 | 12 | 4 | 357090 | 197734 | 162809 |
| 2021-12-06 | 2021 | 12 | 6 | 472058 | 275630 | 201335 |
| 2021-12-08 | 2021 | 12 | 8 | 456801 | 265741 | 196529 |
| 2021-12-10 | 2021 | 12 | 10 | 401310 | 230381 | 177447 |
| 2021-12-12 | 2021 | 12 | 12 | 403471 | 224700 | 175329 |
| 2021-12-14 | 2021 | 12 | 14 | 453431 | 263440 | 195867 |
| 2021-12-16 | 2021 | 12 | 16 | 407185 | 235108 | 179875 |
| 2021-12-18 | 2021 | 12 | 18 | 308216 | 168579 | 145481 |
| 2021-12-20 | 2021 | 12 | 20 | 386747 | 221645 | 174713 |


__Note:__   

The events are available since 2021-12-02.
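Because the query has no `ORDER BY`, the returned rows are not guaranteed to be in date order, so the first day is safer to compute than to read off the top row. A minimal sketch in base R, assuming a data frame shaped like the output above; `daily` is a hypothetical name and holds three rows copied from that output, deliberately shuffled.

```r
# Three rows copied from the output above, in shuffled order.
daily <- data.frame(
  date_time = c("2021-12-06", "2021-12-02", "2021-12-04"),
  events    = c(472058L, 81487L, 357090L),
  stringsAsFactors = FALSE
)

# Convert the character column to Date so that sorting and min() are chronological.
daily$date_time <- as.Date(daily$date_time)
daily <- daily[order(daily$date_time), ]

# Earliest day with any events in the sample.
first_day <- min(daily$date_time)
print(first_day)
```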

# By wiki

In [16]:
query <- 
"
select meta.domain, COUNT(1) AS events,
COUNT(DISTINCT web_session_id) AS sessions,
COUNT(DISTINCT page_id) AS pages
FROM event.mediawiki_web_ui_scroll
WHERE year=2021
GROUP BY meta.domain
ORDER BY meta.domain
LIMIT 10000
"

In [17]:
df <-  wmfdata::query_hive(query)

In [18]:
df

| domain | events | sessions | pages |
|---|---|---|---|
| `<chr>` | `<int>` | `<int>` | `<int>` |
| api.wikimedia.org | 2 | 1 | 2 |
| ar.wikipedia.org | 4300 | 148 | 1993 |
| ar.wikiquote.org | 1 | 1 | 1 |
| ar.wikisource.org | 541 | 13 | 170 |
| ar.wiktionary.org | 5 | 1 | 2 |
| ary.wikipedia.org | 2 | 1 | 2 |
| arz.wikipedia.org | 21 | 2 | 4 |
| ast.wiktionary.org | 1 | 1 | 1 |
| avk.wikipedia.org | 1 | 1 | 1 |
| az.wikipedia.org | 30 | 13 | 17 |


__Note:__

The instrumentation is enabled on all wiki projects.
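One way to see which project families are covered is to group the domains by project. The sketch below does this in base R for the ten rows shown above. The names `by_wiki` and `events_by_project` are hypothetical, and taking the second-to-last domain label as the project is an assumption that fits these sample domains.

```r
# The ten rows shown in the output above.
by_wiki <- data.frame(
  domain = c("api.wikimedia.org", "ar.wikipedia.org", "ar.wikiquote.org",
             "ar.wikisource.org", "ar.wiktionary.org", "ary.wikipedia.org",
             "arz.wikipedia.org", "ast.wiktionary.org", "avk.wikipedia.org",
             "az.wikipedia.org"),
  events = c(2L, 4300L, 1L, 541L, 5L, 2L, 21L, 1L, 1L, 30L),
  stringsAsFactors = FALSE
)

# Take the second-to-last label of each domain,
# e.g. "wikipedia" from "ar.wikipedia.org".
parts <- strsplit(by_wiki$domain, ".", fixed = TRUE)
by_wiki$project <- vapply(parts, function(p) p[length(p) - 1], character(1))

# Total events per project family.
events_by_project <- aggregate(events ~ project, data = by_wiki, FUN = sum)
print(events_by_project)
```

Run against the full result rather than this ten-row sample, the same grouping would list every project family that sent events.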

# By action

In [19]:
query <- 
"
select action, COUNT(1) AS events,
COUNT(DISTINCT web_session_id) AS sessions,
COUNT(DISTINCT page_id) AS pages
FROM event.mediawiki_web_ui_scroll
WHERE year=2021
GROUP BY action
"

In [20]:
df <-  wmfdata::query_hive(query)

In [21]:
df

| action | events | sessions | pages |
|---|---|---|---|
| `<chr>` | `<int>` | `<int>` | `<int>` |
| scroll-to-top | 7396229 | 4018155 | 1144575 |


__Note:__   

The schema contains only one type of action: `scroll-to-top`

# By access method

In [22]:
query <- 
"
select access_method, COUNT(1) AS events,
COUNT(DISTINCT web_session_id) AS sessions,
COUNT(DISTINCT page_id) AS pages
FROM event.mediawiki_web_ui_scroll
WHERE year=2021
GROUP BY access_method
"

In [23]:
df <- wmfdata::query_hive(query)

In [24]:
df

| access_method | events | sessions | pages |
|---|---|---|---|
| `<chr>` | `<int>` | `<int>` | `<int>` |
| desktop | 7396229 | 4018155 | 1144575 |


__Note:__

There is only one `access_method` value: `desktop`

# By anonymous user

In [25]:
query <- 
"
select is_anon, COUNT(1) AS events,
COUNT(DISTINCT web_session_id) AS sessions,
COUNT(DISTINCT page_id) AS pages
FROM event.mediawiki_web_ui_scroll
WHERE year=2021
GROUP BY is_anon
"

In [26]:
df <- wmfdata::query_hive(query)

In [27]:
df

| is_anon | events | sessions | pages |
|---|---|---|---|
| `<chr>` | `<int>` | `<int>` | `<int>` |
| False | 1451302 | 171245 | 412911 |
| True | 5944927 | 3850963 | 943348 |
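The split above can be turned into shares and per-session rates with a few lines of base R. This is a sketch on the two rows of the output; `by_anon` and the derived column names are hypothetical. The last line checks that the event counts add up to the 7396229 events reported in the by-action table.

```r
# The two rows of the output above (is_anon kept as character, as returned).
by_anon <- data.frame(
  is_anon  = c("False", "True"),
  events   = c(1451302, 5944927),
  sessions = c(171245, 3850963),
  stringsAsFactors = FALSE
)

# Share of all events contributed by each group.
by_anon$event_share <- by_anon$events / sum(by_anon$events)

# Average number of scroll-to-top events per session in each group.
by_anon$events_per_session <- by_anon$events / by_anon$sessions

print(by_anon)

# Consistency check against the total in the by-action table.
stopifnot(sum(by_anon$events) == 7396229)
```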


# By agent type

In [36]:
query <- "
SELECT CASE WHEN user_agent_map['device_family']='Spider' THEN 'Spider' ELSE 'User' END AS agent_type,
COUNT(1) AS events,
COUNT(DISTINCT web_session_id) AS sessions,
COUNT(DISTINCT page_id) AS pages
FROM event.mediawiki_web_ui_scroll
WHERE year=2021
GROUP BY CASE WHEN user_agent_map['device_family']='Spider' THEN 'Spider' ELSE 'User' END
"


In [37]:
df <- wmfdata::query_hive(query)

In [38]:
df

| agent_type | events | sessions | pages |
|---|---|---|---|
| `<chr>` | `<int>` | `<int>` | `<int>` |
| Spider | 25 | 15 | 15 |
| User | 11222169 | 5987066 | 1417869 |


__Note:__  

We can identify `Spider` traffic and exclude it from the analysis.
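A possible way to apply that exclusion is to add a filter to the `WHERE` clause of the earlier queries. The sketch below builds such a query string in R. The variable `non_spider_filter` is a hypothetical name, and wrapping the field in `COALESCE` is an assumption made so that rows with a missing `device_family` are kept, matching the `ELSE 'User'` branch of the `CASE` expression above.

```r
# Filter that keeps every row the CASE expression above labels 'User'.
# COALESCE guards against a NULL device_family, which a bare != would drop.
non_spider_filter <- "COALESCE(user_agent_map['device_family'], '') != 'Spider'"

# Same shape as the by-action query, with the spider filter added.
query <- paste(
  "SELECT action, COUNT(1) AS events,",
  "COUNT(DISTINCT web_session_id) AS sessions,",
  "COUNT(DISTINCT page_id) AS pages",
  "FROM event.mediawiki_web_ui_scroll",
  "WHERE year = 2021",
  paste("AND", non_spider_filter),
  "GROUP BY action",
  sep = "\n"
)

cat(query)

# Running it needs cluster access and a Kerberos ticket:
# df <- wmfdata::query_hive(query)
```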