In [1]:
import wmfdata as wmf
from wmfdata.utils import pd_display_all

In [None]:
wp_preview_hits = wmf.spark.run("""
    SELECT
        PARSE_URL(referer, 'HOST') as referring_website,
        CASE x_analytics_map['wprov']
            WHEN 'wppw1' THEN 'non-touch'
            WHEN 'wppw1t' THEN 'touch'
        END AS device_type,
        IF(is_pageview, 'pageview', 'preview') AS view_type,
        access_method
    FROM wmf.webrequest
    WHERE
        year = 2022
        AND month = 7
        AND day >= 11
        AND day < 18
        AND webrequest_source = 'text'
        AND x_analytics_map['wprov'] IN ('wppw1', 'wppw1t')
""", session_type="yarn-large")

In [50]:
wp_preview_hits

Unnamed: 0,referring_website,device_type,view_type,access_method
0,paynegap.info,non-touch,preview,desktop
1,sprokkels-en-brokkels.be,non-touch,preview,desktop
2,sprokkels-en-brokkels.be,non-touch,preview,desktop
3,www.goodnewsfromindonesia.id,non-touch,preview,desktop
4,www.lasnuevemusas.com,non-touch,preview,desktop
...,...,...,...,...
4199,afanporsaber.com,non-touch,preview,desktop
4200,gcpawards.com,non-touch,preview,desktop
4201,gcpawards.com,non-touch,preview,desktop
4202,gcpawards.com,non-touch,preview,desktop


We expect previews to be counted as desktop, and nearly all of them are. For pageviews, non-touch mostly corresponds to desktop and touch mostly corresponds to mobile web. This suggests our device classification is working as expected.

In [51]:
wp_preview_hits.groupby(["view_type", "device_type", "access_method"]).size()

view_type  device_type  access_method
pageview   non-touch    desktop            34
                        mobile web         15
           touch        mobile web        327
preview    non-touch    desktop          3559
           touch        desktop           256
                        mobile app          7
                        mobile web          6
dtype: int64

But here's the weird thing: touch devices are supposedly responsible for only a small fraction of previews (7%) but a large majority of pageviews (87%). That doesn't make much sense.

This could theoretically be accurate, but practically I think it's not possible. For one thing, we actually recorded _more_ pageviews than previews from touch devices; it's possible for a user to click the same link twice, but that can't be happening on such a large scale.

In [52]:
wp_preview_hits.groupby(["view_type", "device_type"]).size()

view_type  device_type
pageview   non-touch        49
           touch           327
preview    non-touch      3559
           touch           269
dtype: int64
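
The 7% and 87% figures quoted earlier follow directly from these counts. As a minimal sketch, the cell below rebuilds the table by hand (the four numbers are copied from the output above, and the variable names are only illustrative) and computes the touch share within each view type.

```python
import pandas as pd

# Counts copied from the groupby output above.
counts = pd.Series(
    [49, 327, 3559, 269],
    index=pd.MultiIndex.from_tuples(
        [
            ("pageview", "non-touch"),
            ("pageview", "touch"),
            ("preview", "non-touch"),
            ("preview", "touch"),
        ],
        names=["view_type", "device_type"],
    ),
)

# Divide each count by the total for its view type.
shares = counts / counts.groupby(level="view_type").transform("sum")

# Keep only the touch rows, leaving one share per view type.
touch_share = shares.xs("touch", level="device_type")
print(touch_share.round(3))
```

With the real data, the same two lines can be applied to the result of `wp_preview_hits.groupby(["view_type", "device_type"]).size()` instead of the hand-built series.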

This also isn't explained by differences between partner sites. wikimediafoundation.org and diff.wikimedia.org are two of the biggest, and individually each shows a similar pattern.

In [53]:
(
    wp_preview_hits
    .query("referring_website == 'wikimediafoundation.org'")
    .groupby(["view_type", "device_type"])
    .size()
)

view_type  device_type
pageview   non-touch        6
           touch          262
preview    non-touch      442
           touch          104
dtype: int64

In [54]:
(
    wp_preview_hits
    .query("referring_website == 'diff.wikimedia.org'")
    .groupby(["view_type", "device_type"])
    .size()
)

view_type  device_type
pageview   non-touch        8
           touch           30
preview    non-touch      451
           touch           57
dtype: int64

Keynerd.it is still using v1.4.0, so it's expected that they wouldn't record any touch devices.

In [55]:
(
    wp_preview_hits
    .query("referring_website == 'keynerd.it'")
    .groupby(["view_type", "device_type"])
    .size()
)

view_type  device_type
pageview   non-touch        2
preview    non-touch      508
dtype: int64

In [56]:
(
    wp_preview_hits
    .query("referring_website == 'stehn-online.de'")
    .groupby(["view_type", "device_type"])
    .size()
)

view_type  device_type
preview    non-touch      139
           touch            3
dtype: int64

Hmm, lumion.pl is running v1.4.4, so they should be recording touch where appropriate. Maybe they _just_ updated? Or maybe they literally have no touch traffic. That seems hard to believe, though.

In [57]:
(
    wp_preview_hits
    .query("referring_website == 'lumion.pl'")
    .groupby(["view_type", "device_type"])
    .size()
)

view_type  device_type
preview    non-touch      129
dtype: int64

No pageviews from lasnuevemusas.com either. This strategy doesn't seem to be helping; I can't find a non-Wikimedia website that shows _either_ the messed-up traffic pattern or a normal one.

In [58]:
(
    wp_preview_hits
    .query("referring_website == 'www.lasnuevemusas.com'")
    .groupby(["view_type", "device_type"])
    .size()
)

view_type  device_type
preview    non-touch      158
           touch            6
dtype: int64
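
Rather than querying one site at a time, the same breakdown can be computed for every referring site in one pass. This is only a sketch: it runs on a tiny made-up frame (the `*.example` hosts and their rows are hypothetical), and `site_breakdown` is an illustrative helper name, not part of `wmfdata`.

```python
import pandas as pd


def site_breakdown(hits: pd.DataFrame) -> pd.DataFrame:
    """Count hits per referring site, split by view type and device type."""
    return pd.crosstab(
        hits["referring_website"],
        [hits["view_type"], hits["device_type"]],
    )


# Hypothetical rows standing in for wp_preview_hits.
rows = (
    [("site-a.example", "preview", "non-touch")] * 3
    + [("site-a.example", "pageview", "touch")] * 2
    + [("site-b.example", "preview", "non-touch")] * 4
    + [("site-b.example", "preview", "touch")] * 1
)
toy_hits = pd.DataFrame(
    rows, columns=["referring_website", "view_type", "device_type"]
)

table = site_breakdown(toy_hits)

# Keep only sites that recorded at least one pageview.
with_pageviews = table[table["pageview"].sum(axis=1) > 0]
print(with_pageviews)
```

Filtering to sites with at least one pageview would show at a glance whether any non-Wikimedia site has enough pageviews to compare against.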

In [None]:
wp_preview_hits_private = wmf.spark.run("""
    SELECT
        PARSE_URL(referer, 'HOST') as referring_website,
        CASE x_analytics_map['wprov']
            WHEN 'wppw1' THEN 'non-touch'
            WHEN 'wppw1t' THEN 'touch'
        END AS device_type,
        IF(is_pageview, 'pageview', 'preview') AS view_type,
        access_method,
        uri_host,
        uri_path,
        uri_query,
        client_ip,
        user_agent,
        user_agent_map
    FROM wmf.webrequest
    WHERE
        year = 2022
        AND month = 7
        AND day >= 11
        AND day < 18
        AND webrequest_source = 'text'
        AND x_analytics_map['wprov'] IN ('wppw1', 'wppw1t')
""", session_type="yarn-large")

Random sample of 100 touch pageviews:
* Definite computers: #76, #1681, #290, #651, #3740, #792, #3375, #557
* Possible computer: #3008, #3407, #4036, #4160
* Neither: #2035 (Google Read Aloud), #3638 (GRA)

However, this misclassification still wouldn't explain the issue. A misclassified device gets the same label on its previews and on its pageviews, so both counts would shift together.

In [None]:
(
    wp_preview_hits_private
    [["client_ip", "referring_website", "device_type", "view_type", "uri_path", "user_agent_map", "user_agent"]]
    .query("device_type == 'touch' and view_type == 'pageview'")
    .sample(100)
    .pipe(pd_display_all)
)

Random sample of 100 non-touch previews:
* Definite phone: #1484, #276, #71, #1282, #2012, #848, #1892, #1664, #3626, #404, #3295, #1042, #39, #2812, #215
* Old version: #2859
* Bot: #852, #858

As before, this shows problems with our classification scheme but doesn't explain the main issue.

In [None]:
(
    wp_preview_hits_private
    [["client_ip", "referring_website", "device_type", "view_type", "uri_path", "user_agent_map", "user_agent"]]
    .query("device_type == 'non-touch' and view_type == 'preview'")
    .query("referring_website not in ('keynerd.it')")
    .sample(100, random_state=4138)
    .pipe(pd_display_all)
)
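
The manual review could be roughly cross-checked by comparing `device_type` against the parsed user agent. The sketch below makes two assumptions that are not confirmed in this notebook: that `user_agent_map` arrives in pandas as a dict with an `os_family` key, and that Android and iOS are a reasonable stand-in for "touch". The sample rows are made up.

```python
import pandas as pd

# Assumption: these OS families indicate a touch device.
TOUCH_OS_FAMILIES = {"Android", "iOS"}


def ua_label(user_agent_map: dict) -> str:
    """Label a parsed user agent as 'touch OS' or 'other OS'."""
    if user_agent_map.get("os_family") in TOUCH_OS_FAMILIES:
        return "touch OS"
    return "other OS"


# Hypothetical rows standing in for wp_preview_hits_private.
sample = pd.DataFrame(
    {
        "device_type": ["touch", "touch", "non-touch", "non-touch"],
        "user_agent_map": [
            {"os_family": "Android"},
            {"os_family": "Windows"},
            {"os_family": "iOS"},
            {"os_family": "Mac OS X"},
        ],
    }
)

sample["ua_label"] = sample["user_agent_map"].map(ua_label)

# Off-diagonal cells are the likely misclassifications.
mismatch = pd.crosstab(sample["device_type"], sample["ua_label"])
print(mismatch)
```

This would only give a rough count; touch-screen laptops and tablets with desktop user agents would still need a manual look.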

Ah ha! It looks like a lot of pageviews (both touch and non-touch) don't have corresponding previews from the same IP address. Moreover, in all these cases, the page at issue is outside the main namespace. This clearly seems like a bug in the library. It also makes sense that we might overlook this; normally, non-mainspace pages would be a tiny edge case, but since our traffic is currently dominated by two Wikimedia sites, it has a major effect.

In [None]:
pageview_ips = (
    wp_preview_hits_private
    .query("view_type == 'pageview'")
    ["client_ip"]
    .unique()
)

(
    wp_preview_hits_private
    .query("client_ip in @pageview_ips and device_type == 'non-touch'")
    .sort_values("client_ip")
    [["client_ip", "referring_website", "view_type", "uri_path"]]
    .pipe(pd_display_all)
)

In [None]:
(
    wp_preview_hits_private
    .query("client_ip in @pageview_ips and device_type == 'touch'")
    .sort_values("client_ip")
    [["client_ip", "referring_website", "view_type", "uri_path"]]
    .pipe(pd_display_all)
)
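
The pattern spotted by eye in these tables (pageviews with no preview from the same IP, all on non-mainspace pages) can also be counted. This is a sketch on made-up rows using documentation-range IPs. The namespace check is a crude heuristic that I'm assuming is good enough here: it treats any `/wiki/Prefix:Title` path as non-mainspace, which would also catch mainspace titles that happen to contain a colon.

```python
import pandas as pd

# Hypothetical rows standing in for wp_preview_hits_private.
hits = pd.DataFrame(
    {
        "client_ip": ["192.0.2.1", "192.0.2.1", "192.0.2.2", "192.0.2.3"],
        "view_type": ["preview", "pageview", "pageview", "pageview"],
        "uri_path": [
            "/api/rest_v1/page/summary/Wiki",
            "/wiki/Wiki",
            "/wiki/Wikipedia:Conflict_of_interest",
            "/wiki/Help:Contents",
        ],
    }
)

# IPs that recorded at least one preview.
preview_ips = set(hits.loc[hits["view_type"] == "preview", "client_ip"])

pageviews = hits[hits["view_type"] == "pageview"].copy()
pageviews["has_preview"] = pageviews["client_ip"].isin(preview_ips)

# Heuristic: text before the first colon is taken as a namespace prefix.
pageviews["namespace_prefix"] = pageviews["uri_path"].str.extract(
    r"^/wiki/([^:/]+):", expand=False
)
pageviews["likely_non_main"] = pageviews["namespace_prefix"].notna()

# Pageviews with no preview from the same IP.
orphans = pageviews[~pageviews["has_preview"]]
print(pd.crosstab(pageviews["has_preview"], pageviews["likely_non_main"]))
```

If the hypothesis holds, nearly all of the pageviews without a matching preview should land in the `likely_non_main` column.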

In [3]:
test_hit_range = wmf.spark.run("""
    SELECT
        client_ip,
        uri_path,
        uri_query
    FROM wmf.webrequest
    WHERE
        year = 2022
        AND month = 7
        AND day = 21
        AND hour = 0
        AND webrequest_source = 'text'
        AND uri_host = 'en.wikipedia.org'
        AND x_analytics_map['wprov'] IN ('wppw1', 'wppw1t')
""", session_type="yarn-large")

Row 4 is a test call I made which added our `wprov` parameter to the type of API call we use for non-mainspace pages. It looks like it got passed through with no problem, so we could fix the data by adding `wprov` to those calls in the library.

In [7]:
test_hit_range[["uri_path", "uri_query"]].pipe(pd_display_all)

Unnamed: 0,uri_path,uri_query
0,/api/rest_v1/page/summary/Accessibility_for_Ontarians_with_Disabilities_Act%2C_2005,?wprov=wppw1
1,/api/rest_v1/page/summary/Francis%20Bacon,?wprov=wppw1
2,/api/rest_v1/page/summary/Web_Content_Accessibility_Guidelines,?wprov=wppw1
3,/api/rest_v1/page/summary/Linked_data,?wprov=wppw1
4,/w/api.php,?format=json&formatversion=2&origin=*&action=query&prop=extracts|pageimages&exsentences=4&explaintext=1&exsectionformat=plain&piprop=thumbnail&pilimit=1&titles=Wikipedia%3AConflict_of_interest&wprov=wppw1
5,/api/rest_v1/page/summary/Americans_with_Disabilities_Act_of_1990,?wprov=wppw1
6,/api/rest_v1/page/summary/Steganography,?wprov=wppw1
7,/api/rest_v1/page/summary/Open_knowledge,?wprov=wppw1
8,/api/rest_v1/page/summary/Wiki,?wprov=wppw1
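
For reference, the test call in row 4 can be rebuilt as follows. This is a sketch, not the library's code: the parameter names and values are copied from the `uri_query` shown above, while `build_extract_url` and its defaults are hypothetical names for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_extract_url(
    title: str, wprov: str = "wppw1", host: str = "en.wikipedia.org"
) -> str:
    """Build an Action API extract request that carries the wprov tag."""
    params = {
        "format": "json",
        "formatversion": 2,
        "origin": "*",
        "action": "query",
        "prop": "extracts|pageimages",
        "exsentences": 4,
        "explaintext": 1,
        "exsectionformat": "plain",
        "piprop": "thumbnail",
        "pilimit": 1,
        "titles": title,
        # The tag that the webrequest query filters on.
        "wprov": wprov,
    }
    return f"https://{host}/w/api.php?{urlencode(params)}"


url = build_extract_url("Wikipedia:Conflict_of_interest")
query = parse_qs(urlsplit(url).query)
print(url)
```

A touch device would pass `wprov="wppw1t"` instead, matching the two values used in the `CASE` expression of the queries above.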
