In [140]:
import pandas as pd
import wmfdata as wmf
from wmfdata.utils import pd_display_all

## Non-main-namespace page views
In July 2022, we fixed a Wikipedia Preview bug ([T313596](https://phabricator.wikimedia.org/T313596)) that caused previews of non-main-namespace pages to go uncounted.

Now that the Wikimedia sites that were linking to a lot of such pages have been updated to the new version, does the data look more correct? We would expect to see a _decrease_ in the clickthrough rate.

Old data:


In [2]:
wmf.presto.run("""
    SELECT
        referer_host,
        device_type,
        SUM(previews) AS previews,
        SUM(pageviews) AS pageviews,
        CAST(SUM(pageviews) AS REAL) / CAST(SUM(previews) AS REAL) AS clickthrough_rate
    FROM wmf_product.wikipediapreview_stats
    WHERE
        year = 2022
        AND month = 6
        AND referer_host IN ('diff.wikimedia.org', 'wikimediafoundation.org')
    GROUP BY
        referer_host,
        device_type
    ORDER BY
        referer_host,
        device_type
""")

Unnamed: 0,referer_host,device_type,previews,pageviews,clickthrough_rate
0,diff.wikimedia.org,non-touch,2740,16,0.005839
1,diff.wikimedia.org,touch,493,298,0.604463
2,wikimediafoundation.org,non-touch,2958,27,0.009128
3,wikimediafoundation.org,touch,589,1136,1.928693


New data:

In [2]:
wmf.presto.run("""
    SELECT
        referer_host,
        device_type,
        SUM(previews) AS previews,
        SUM(pageviews) AS pageviews,
        CAST(SUM(pageviews) AS REAL) / CAST(SUM(previews) AS REAL) AS clickthrough_rate
    FROM wmf_product.wikipediapreview_stats
    WHERE
        year = 2023
        AND month = 2
        AND referer_host IN ('diff.wikimedia.org', 'wikimediafoundation.org')
    GROUP BY
        referer_host,
        device_type
    ORDER BY
        referer_host,
        device_type
""")

Unnamed: 0,referer_host,device_type,previews,pageviews,clickthrough_rate
0,diff.wikimedia.org,non-touch,4201,20,0.004761
1,diff.wikimedia.org,touch,416,234,0.5625
2,wikimediafoundation.org,non-touch,7976,50,0.006269
3,wikimediafoundation.org,touch,1812,995,0.549117


Well, unfortunately, neither the new data nor the old looks correct: the non-touch clickthrough rates are implausibly low. Most of that is because the most common pageview path on non-touch devices is not instrumented in the version these sites are using ([T317171](https://phabricator.wikimedia.org/T317171)). Still, we do see a substantial increase in previews at wikimediafoundation.org (from 3,547 to 9,788 across both device types), which has some particularly prominent non-main-namespace links, and its touch clickthrough rate fell from an implausible 1.93 to 0.55. So it seems like yes, we are seeing the effects of the fix.
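To make the before-and-after comparison concrete, here is a small sketch that recomputes the change for each site and device type. The figures are copied by hand from the two outputs above; the `compare` helper and the variable names are only illustrative.

```python
# (previews, pageviews) copied from the June 2022 and February 2023 outputs.
old = {
    ("diff.wikimedia.org", "non-touch"): (2740, 16),
    ("diff.wikimedia.org", "touch"): (493, 298),
    ("wikimediafoundation.org", "non-touch"): (2958, 27),
    ("wikimediafoundation.org", "touch"): (589, 1136),
}
new = {
    ("diff.wikimedia.org", "non-touch"): (4201, 20),
    ("diff.wikimedia.org", "touch"): (416, 234),
    ("wikimediafoundation.org", "non-touch"): (7976, 50),
    ("wikimediafoundation.org", "touch"): (1812, 995),
}


def compare(old, new):
    """Return preview growth and old/new clickthrough rate for each (site, device)."""
    rows = {}
    for key, (old_previews, old_pageviews) in old.items():
        new_previews, new_pageviews = new[key]
        rows[key] = {
            "preview_growth": new_previews / old_previews,
            "old_ctr": old_pageviews / old_previews,
            "new_ctr": new_pageviews / new_previews,
        }
    return rows


changes = compare(old, new)
for (site, device), row in changes.items():
    print(
        f"{site:25} {device:10} previews x{row['preview_growth']:.2f}  "
        f"CTR {row['old_ctr']:.3f} -> {row['new_ctr']:.3f}"
    )
```

All four rows show a lower clickthrough rate in the newer data, and previews at wikimediafoundation.org grew by roughly 2.7x (non-touch) and 3.1x (touch).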

## Decreased touch clickthrough rate with new instrumentation version

In early Feb 2023, we released a major improvement to the instrumentation of pageviews for non-touch devices ([T317171](https://phabricator.wikimedia.org/T317171)), which we marked with a new "instrumentation version".

Afterward, I noticed that, for touch devices, version 1 traffic had a _much_ higher clickthrough rate than version 2 ([T330743](https://phabricator.wikimedia.org/T330743)).

Here's the result I originally noticed:

In [11]:
wmf.presto.run("""
    SELECT
        instrumentation_version,
        device_type,
        SUM(pageviews) AS pageviews,
        SUM(previews) AS previews,
        CAST(SUM(pageviews) AS REAL) / CAST(SUM(previews) AS REAL) AS clickthrough_rate
    FROM wmf_product.wikipediapreview_stats
    WHERE
        year = 2023
        AND month = 2
        AND day >= 9
    GROUP BY
        device_type,
        instrumentation_version
    ORDER BY
        instrumentation_version,
        device_type
""")

Unnamed: 0,instrumentation_version,device_type,pageviews,previews,clickthrough_rate
0,1,non-touch,78,13588,0.00574
1,1,touch,1120,2141,0.52312
2,2,non-touch,763,13809,0.055254
3,2,touch,251,1836,0.13671


Ahh! If we omit the Wikimedia websites (which mostly haven't upgraded), we don't see any big disparity. So it looks like the disparity is just because Wikimedia sites account for a major chunk of the traffic, have a much higher clickthrough rate, and haven't upgraded.

In [16]:
wmf.presto.run("""
    SELECT
        instrumentation_version,
        device_type,
        SUM(pageviews) AS pageviews,
        SUM(previews) AS previews,
        CAST(SUM(pageviews) AS REAL) / CAST(SUM(previews) AS REAL) AS clickthrough_rate
    FROM wmf_product.wikipediapreview_stats
    WHERE
        year = 2023
        AND month = 2
        AND day >= 9
        AND referer_host NOT LIKE '%wikimedia.org'
        AND referer_host != 'wikimediafoundation.org'
    GROUP BY
        device_type,
        instrumentation_version
    ORDER BY
        instrumentation_version,
        device_type
""")

Unnamed: 0,instrumentation_version,device_type,pageviews,previews,clickthrough_rate
0,1,non-touch,29,4490,0.006459
1,1,touch,35,485,0.072165
2,2,non-touch,659,13713,0.048057
3,2,touch,233,1820,0.128022
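
As a rough check on this explanation, the Wikimedia-site touch figures can be recovered by subtracting the filtered output from the unfiltered one. This is only a sketch: the numbers are copied by hand from the two outputs above, "Wikimedia sites" means whatever the second query's filter excludes, and the helper name is made up for illustration.

```python
# (pageviews, previews) for touch devices, keyed by instrumentation version.
all_sites = {1: (1120, 2141), 2: (251, 1836)}
non_wikimedia = {1: (35, 485), 2: (233, 1820)}


def wikimedia_only(all_sites, non_wikimedia):
    """Subtract the non-Wikimedia rows from the totals to get the Wikimedia-site rows."""
    result = {}
    for version, (pageviews, previews) in all_sites.items():
        other_pageviews, other_previews = non_wikimedia[version]
        result[version] = (pageviews - other_pageviews, previews - other_previews)
    return result


wikimedia = wikimedia_only(all_sites, non_wikimedia)
for version, (pageviews, previews) in wikimedia.items():
    share = previews / all_sites[version][1]
    print(
        f"version {version}: Wikimedia sites have {previews} touch previews "
        f"({share:.0%} of the total) and {pageviews} pageviews"
    )
```

Under version 1, Wikimedia sites contribute 1,656 of the 2,141 touch previews with a clickthrough rate of about 0.66, while under version 2 they contribute only 16 previews, which is consistent with the mix of sites driving the gap.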


To further confirm this, let's look at sites that have upgraded and have a decent amount of traffic, so we can compare their touch clickthrough rates before and after. We'll also go back to the start of Jan, so we have enough baseline data.

In [None]:
stats_by_site = wmf.presto.run("""
    SELECT
        referer_host,
        instrumentation_version,
        SUM(previews) AS previews,
        CAST(SUM(pageviews) AS REAL) / CAST(SUM(previews) AS REAL) AS clickthrough_rate,
        SUM(pageviews) AS pageviews
    FROM wmf_product.wikipediapreview_stats
    WHERE
        year = 2023
        AND month >= 1
        AND device_type = 'touch'
        AND referer_host NOT IN ('0.0.0.0', 'testingpurpose.local')
    GROUP BY
        referer_host,
        instrumentation_version
    ORDER BY
        referer_host,
        instrumentation_version
""")

sites_to_check = (
    stats_by_site
    .groupby('referer_host')
    .aggregate(
        versions_present=("instrumentation_version", "size"),
        previews=("previews", "sum")
    )
    .query("versions_present == 2 and previews >= 100")
    .index
)

stats_by_site.query('referer_host in @sites_to_check').pipe(pd_display_all)

Overall, considering each rate in light of the number of underlying previews, all the sites show a reasonable before and after pattern. It definitely does not seem like the upgrade introduced an instrumentation bug.
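
One way to weigh a rate against its number of underlying previews is a confidence interval. The sketch below uses a Wilson score interval, which is my choice for illustration and not something the analysis above relied on. It assumes previews are independent and that each preview leads to at most one pageview; neither holds strictly here (some sites show rates above 1), so treat the result as a rough guide only.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denominator = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denominator
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denominator
    return center - margin, center + margin


# Touch figures for non-Wikimedia sites, copied from the earlier output.
for label, pageviews, previews in [("version 1", 35, 485), ("version 2", 233, 1820)]:
    low, high = wilson_interval(pageviews, previews)
    print(f"{label}: {pageviews / previews:.3f} (roughly {low:.3f} to {high:.3f})")
```

A site with only a few dozen previews gets a wide interval, so a large swing in its rate between versions says little on its own.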

## Low traffic from touch devices

Overall, mobile devices account for a bit over 50% of web use. However, only a small slice of Wikipedia Preview traffic comes from touch devices (smartphones and tablets, generally speaking). I will focus on previews here since we've only fixed the major undercounting of non-touch pageviews.

In [48]:
sites_filter = """
      referer_host NOT IN (
        '0.0.0.0',
        '127.0.0.1',
        'blog-wikimedia-org-develop.go-vip.net',
        'cdpn.io',
        'lumion.imaggo-work.pl',
        'localhost',
        'wikimediadiff.test',
        'wikimediafoundation-org-develop.go-vip.co',
        'wikimedia.github.io',
        'www.wixwikipediapreviewtest.com',
        '-'
      )
      AND referer_host NOT LIKE '%.local'
      AND referer_host NOT LIKE '%.ngrok.io' 
      AND referer_host NOT LIKE '192.%'
      AND referer_host NOT LIKE '%.wikipedia.org'
      AND referer_host NOT LIKE '%facebook.com'
      AND referer_host NOT LIKE '%google.com'
      AND referer_host NOT LIKE '%.test'
      AND referer_host IS NOT NULL
"""

Over the last 4 weeks, only 7.4% of our previews came from touch devices.

In [52]:
wmf.presto.run(f"""
    SELECT
        device_type,
        SUM(previews) AS previews
    FROM wmf_product.wikipediapreview_stats
    WHERE
        year = 2023
        AND (
            month = 2 AND day >= 17
            OR month = 3 AND day < 17
        )
        AND {sites_filter}
    GROUP BY device_type
""")

Unnamed: 0,device_type,previews
0,touch,3926
1,non-touch,48827
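
The 4-week window in these queries is written by hand, and it works without inner parentheses only because `AND` binds more tightly than `OR` in SQL. As a sketch, a helper like the hypothetical `partition_date_filter` below could generate an explicitly parenthesized predicate for the `year`, `month`, and `day` partition columns from a start date (inclusive) and an end date (exclusive).

```python
from datetime import date, timedelta


def partition_date_filter(start, end):
    """Build a SQL predicate on year/month/day covering start (inclusive) to end (exclusive)."""
    if start >= end:
        raise ValueError("start must be before end")
    last = end - timedelta(days=1)  # last day actually included
    clauses = []
    year, month = start.year, start.month
    while (year, month) <= (last.year, last.month):
        parts = [f"year = {year}", f"month = {month}"]
        if (year, month) == (start.year, start.month) and start.day > 1:
            parts.append(f"day >= {start.day}")
        if (year, month) == (last.year, last.month):
            parts.append(f"day <= {last.day}")
        clauses.append("(" + " AND ".join(parts) + ")")
        # Step to the next calendar month.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return "(" + " OR ".join(clauses) + ")"


window = partition_date_filter(date(2023, 2, 17), date(2023, 3, 17))
print(window)
print((date(2023, 3, 17) - date(2023, 2, 17)).days, "days")
```

The generated string could be interpolated into the query the same way `sites_filter` is.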


The daily touch preview share is pretty volatile, ranging from 3% to 19%, but is certainly never more than a small minority.

In [55]:
wmf.presto.run(f"""
    SELECT
        CONCAT(
            CAST(year AS VARCHAR), 
            '-',
            LPAD(CAST(month AS VARCHAR), 2, '0'), 
            '-', 
            LPAD(CAST(day AS VARCHAR), 2, '0')
        ) AS date,
        SUM(previews) AS previews,
        CAST(SUM(IF(device_type = 'touch', previews, 0)) AS REAL) / CAST(SUM(previews) AS REAL)
            AS touch_preview_share
    FROM wmf_product.wikipediapreview_stats
    WHERE
        year = 2023
        AND (
            month = 2 AND day >= 17
            OR month = 3 AND day < 17
        )
        AND {sites_filter}
    GROUP BY
        year,
        month,
        day
    ORDER BY date
""")

Unnamed: 0,date,previews,touch_preview_share
0,2023-02-17,1249,0.114492
1,2023-02-18,974,0.129363
2,2023-02-19,917,0.134133
3,2023-02-20,1395,0.116846
4,2023-02-21,1823,0.072957
5,2023-02-22,1405,0.113879
6,2023-02-23,1432,0.086592
7,2023-02-24,1138,0.108963
8,2023-02-25,1210,0.099174
9,2023-02-26,959,0.118874


In [56]:
touch_share_by_site = wmf.presto.run(f"""
    SELECT
        referer_host AS website,
        SUM(previews) AS previews,
        CAST(SUM(IF(device_type = 'touch', previews, 0)) AS REAL) / CAST(SUM(previews) AS REAL)
            AS touch_preview_share
    FROM wmf_product.wikipediapreview_stats
    WHERE
        year = 2023
        AND (
            month = 2 AND day >= 17
            OR month = 3 AND day < 17
        )
        AND {sites_filter}
    GROUP BY referer_host
    ORDER BY previews DESC
""")

Among the 25 sites with the most previews, the touch share varies from 0% to 22%. It's noteworthy that the site with the highest rate (orlandotravelham.com) uses the default link styling, which visibly distinguishes Preview links from other links. Like virtually all the sites I've seen, the next two highest sites (wikimediafoundation.org, at 19%, and sanremo-festival.de, at 12%) don't use any special styling on Preview links.

Five sites had a clean 0%, with no touch previews whatsoever in 4 weeks. For two (www.goodnewsfromindonesia.id and devel.bibleandplaces.com), this makes sense because they're using a version that predates touch tracking. But for three (blog.thecursedgroup.rf.gd, gorhambury.org, and moonpub.net), that isn't the case, so it seems quite strange that despite about 250 previews each, not a single one was from a touch device.
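
To put a rough number on "quite strange": if each preview independently had the overall 4-week touch share as its chance of coming from a touch device, how likely would 250 previews with zero touch be? This is a back-of-the-envelope sketch. The independence assumption is mine and is surely too strong, since previews cluster by visitor and audiences differ by site, so it overstates how surprising the result is.

```python
def prob_no_touch(previews, touch_share):
    """Probability that none of `previews` independent previews is from a touch device."""
    return (1 - touch_share) ** previews


# Overall touch share from the 4-week totals above.
overall_touch_share = 3926 / (3926 + 48827)

p = prob_no_touch(250, overall_touch_share)
print(f"touch share: {overall_touch_share:.3f}, P(no touch in 250 previews): {p:.1e}")
```

Even allowing for heavy clustering, a probability on the order of a few in a billion suggests these sites' audiences differ from the average, or that something is off in the instrumentation.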

In [163]:
(
    touch_share_by_site
    .head(25)
    .sort_values("touch_preview_share", ascending=False)
    [["website", "touch_preview_share"]]
)

Unnamed: 0,website,touch_preview_share
22,orlandotravelham.com,0.224359
1,wikimediafoundation.org,0.188336
7,sanremo-festival.de,0.11504
14,soundlogo.wikimedia.org,0.094077
2,diff.wikimedia.org,0.081619
19,genesibiblica.org,0.076531
3,www.scoopearth.com,0.058327
23,jrpgc.com,0.052288
12,www.lasnuevemusas.com,0.052036
5,xpressenglish.com,0.049637


To see if there's some instrumentation bug, at around 2023-03-18 02:30, I went to these three sites on my phone, found a preview link on each one, opened it, and clicked through to Wikipedia. (The articles were "Steganography" on gorhambury.org, "Roswell incident" on moonpub.net, and "Diamante (female wrestler)" on blog.thecursedgroup.rf.gd.)

Somewhat surprisingly, all of them show up in our Wikipedia Preview data. This is probably the clearest indication that our touch traffic is genuinely low.

In [62]:
wmf.presto.run(f"""
    SELECT *
    FROM wmf_product.wikipediapreview_stats
    WHERE
        year = 2023
        AND month = 3
        AND day = 18
        AND referer_host IN ('gorhambury.org', 'moonpub.net', 'blog.thecursedgroup.rf.gd')
        AND device_type = 'touch'
""")

Unnamed: 0,pageviews,previews,year,month,day,device_type,referer_host,continent,country_code,country,instrumentation_version
0,1,3,2023,3,18,touch,moonpub.net,North America,US,United States,1
1,1,1,2023,3,18,touch,blog.thecursedgroup.rf.gd,North America,US,United States,2
2,2,1,2023,3,18,touch,gorhambury.org,North America,US,United States,2


[Previously](https://github.com/nshahquinn/misc-wikimedia-analysis/blob/master/2022/2022-07_wikipedia-preview-mobile-desktop-mismatch.ipynb), I looked at the raw user agents of devices that fell into our touch and non-touch buckets and saw what looked like quite a few cases where smartphones were falling into the non-touch bucket and PCs were falling into the touch bucket.

Let me repeat that and see where things stand now.

In [None]:
# The value passed as extra_settings is missing here, so the argument is omitted.
wmf.spark.create_session(
    type="yarn-large"
)

wp_preview_hits_private = wmf.spark.run("""
    SELECT
        dt,
        PARSE_URL(referer, 'HOST') as referring_website,
        IF(
            REGEXP_EXTRACT(
                x_analytics_map['wprov'],
                '^wppw(\\\\d+)(t?)$',
                2
            ) = 't',
            'touch',
            'non-touch'
        ) AS device_type,
        IF(is_pageview, 'pageview', 'preview') AS view_type,
        access_method,
        uri_host,
        uri_path,
        uri_query,
        client_ip,
        user_agent,
        user_agent_map,
        agent_type,
        CAST(REGEXP_EXTRACT(
            x_analytics_map['wprov'],
            '^wppw(\\\\d+)(t?)$',
            1
        ) AS INT) AS instrumentation_version
    FROM wmf.webrequest
    WHERE
        year = 2023
        AND month = 3
        AND day = 15
        AND webrequest_source = 'text'
        AND x_analytics_map['wprov'] REGEXP '^wppw(\\\\d+)(t?)$'
""")
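
For reference, here is how the `wprov` pattern in the query above behaves, sketched with Python's `re` module. The sample values are illustrative strings that match the pattern, not values taken from the data, and `parse_wprov` is a hypothetical helper.

```python
import re

# Same pattern as in the Spark query: "wppw", a version number, then an optional "t" for touch.
WPROV_PATTERN = re.compile(r"^wppw(\d+)(t?)$")


def parse_wprov(wprov):
    """Return (instrumentation_version, device_type), or None for non-Preview values."""
    match = WPROV_PATTERN.match(wprov)
    if match is None:
        return None
    version, touch_flag = match.groups()
    return int(version), "touch" if touch_flag == "t" else "non-touch"


print(parse_wprov("wppw1"))   # (1, 'non-touch')
print(parse_wprov("wppw2t"))  # (2, 'touch')
print(parse_wprov("other"))   # None
```

Note that any value without the trailing `t` is labelled non-touch, including hits from versions that predate touch tracking.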

In [100]:
a = wp_preview_hits_private.drop("user_agent_map", axis="columns")
b = pd.DataFrame.from_records(wp_preview_hits_private["user_agent_map"])
wp_preview_hits_private = pd.concat([a, b], axis="columns")

As expected, this day of data shows very little touch traffic (4.4%).

In [119]:
wp_preview_hits_private.query("view_type == 'preview'").groupby('device_type').apply(len)

device_type
non-touch    2707
touch         126
dtype: int64

Well, here's one unexpected but important thing: there is a major amount of spider traffic here.

In [101]:
wp_preview_hits_private.groupby(["view_type", "agent_type"]).apply(len)

view_type  agent_type
pageview   spider           2
           user            79
preview    spider        1641
           user          1192
dtype: int64

Virtually all of this spider traffic is from Facebook, with the user agent they use when fetching a page that has been linked on Facebook.

In [148]:
(
    wp_preview_hits_private
    .query("agent_type == 'spider'")
    .groupby(["view_type", "user_agent"])
    .apply(len)
    .to_frame()
    .rename({0: "views"}, axis="columns")
)

view_type,user_agent,views
pageview,"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36 (compatible; Google-Read-Aloud; +https://support.google.com/webmasters/answer/1061943)",1
pageview,python-requests/2.20.0,1
preview,facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php),1641


Surprisingly, two websites (bjmoreshet.org and richardbevan.co.uk) absolutely dominate the spider traffic.

In [None]:
wp_preview_hits_private.query("agent_type == 'spider'").groupby("referring_website").apply(len)

And virtually all of these sites' Preview traffic is from spiders.

In [154]:
(
    wp_preview_hits_private
    .query("referring_website in ('bjmoreshet.org', 'richardbevan.co.uk') and view_type == 'preview'")
    .groupby("referring_website")
    .agg(
        spider_share=("agent_type", lambda s: (s == "spider").sum() / len(s))
    )
)

referring_website,spider_share
bjmoreshet.org,0.982085
richardbevan.co.uk,0.982323


Anyway, when we filter out these spiders, the fraction of touch traffic goes up to 10.6%.

In [156]:
user_previews = wp_preview_hits_private.query("agent_type == 'user' and view_type == 'preview'")
user_previews.groupby('device_type').apply(len)

device_type
non-touch    1066
touch         126
dtype: int64
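
The arithmetic behind the two touch shares, using the preview counts copied from the outputs above:

```python
# Preview counts for 2023-03-15.
touch = 126
non_touch_all = 2707    # users and spiders
non_touch_users = 1066  # users only

share_all = touch / (touch + non_touch_all)
share_users = touch / (touch + non_touch_users)
spider_non_touch = non_touch_all - non_touch_users

print(f"touch share, all agents: {share_all:.1%}")
print(f"touch share, users only: {share_users:.1%}")
print(f"non-touch previews removed by the filter: {spider_non_touch}")
```

The touch count is unchanged by the filter, and the number of non-touch previews removed equals the spider preview count shown earlier, so every spider preview was in the non-touch bucket.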

Most of the touch traffic comes from Android and iOS devices, as expected, but there are substantial numbers of Chrome OS, Linux, macOS, and Windows devices.

In [157]:
touch_previews = user_previews.query("device_type == 'touch'")
touch_previews.groupby("os_family").apply(len)

os_family
Android      70
Chrome OS     7
Linux        15
Mac OS X     11
Windows      10
iOS          13
dtype: int64

The Chrome OS user agents don't offer any device information, but there seem to be plenty of touchscreen Chrome OS devices, so that's probably what these are.

In [None]:
touch_previews.query("os_family == 'Chrome OS'")["user_agent"].pipe(pd_display_all)

The Windows touch devices have normal-looking user agents. Most likely, they are Windows laptops with touchscreens.

In [None]:
touch_previews.query("os_family == 'Windows'")["user_agent"].pipe(pd_display_all)

Most of these "macOS" touch devices also have normal-looking user agents. There are no touchscreen Mac models, so very likely these are iPads; apparently, their user agents are the same as desktop macOS Safari ones.

There are a couple that have the same type of user agent but with the ending "Version/X Safari/Y" replaced with "WikipediaApp". Perhaps that's the embedded browser in the iOS Wikipedia app on an iPad.

In [None]:
touch_previews.query("os_family == 'Mac OS X'")["user_agent"].pipe(pd_display_all)

These Linux touch devices could also be laptops with touchscreens; in this case, 14 were visiting wikimediafoundation.org and 1 was visiting diff.wikimedia.org, which are both fairly reasonable places to find an unusual concentration of Linux laptops.

In [None]:
touch_previews.query("os_family == 'Linux'")["user_agent"].pipe(pd_display_all)

So, all in all, touchscreen laptops are far from a negligible share of our touch traffic: the Chrome OS, Linux, and Windows devices account for 32 of the 126 touch previews, or about 25%.
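
A quick tally of the touch previews by likely device class, using the OS counts copied from the output above. The grouping is an assumption based on the reasoning in this section: "Mac OS X" touch devices are treated as iPads, and Chrome OS, Linux, and Windows as touchscreen laptops.

```python
touch_by_os = {
    "Android": 70,
    "Chrome OS": 7,
    "Linux": 15,
    "Mac OS X": 11,
    "Windows": 10,
    "iOS": 13,
}

# Assumed mapping: these OS families are counted as smartphones or tablets.
MOBILE_OS = {"Android", "iOS", "Mac OS X"}

mobile = sum(n for os_family, n in touch_by_os.items() if os_family in MOBILE_OS)
laptop = sum(n for os_family, n in touch_by_os.items() if os_family not in MOBILE_OS)
total = mobile + laptop

print(f"smartphones/tablets: {mobile} ({mobile / total:.0%})")
print(f"touchscreen laptops: {laptop} ({laptop / total:.0%})")
```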

Happily, there are no iOS or Android user agents among the non-touch.

In [131]:
nontouch_previews = user_previews.query("device_type == 'non-touch'")
nontouch_previews.groupby("os_family").apply(len)

os_family
Chrome OS     30
Fedora         1
FreeBSD        1
Linux         29
Mac OS X     232
Other          1
Ubuntu        13
Windows      759
dtype: int64

Since I saw a lot of mobile devices getting classified as non-touch the last time I looked at these, let me look at a sample just to be sure.

These all look like computers. Probably what I saw previously was devices that had visited websites running older versions of Wikipedia Preview, from before we started indicating the device type in the `wprov` parameter.

In [None]:
nontouch_previews["user_agent"].sample(50).pipe(pd_display_all)

So, filtering out the touchscreen laptops, it looks like about 7.9% of our previews come from smartphones or tablets. Previously, I had assumed that "touch" was essentially synonymous with "mobile", but that isn't the case. Our mobile traffic is actually lower than it appeared.

Ironically, even though I found two important data issues, I can still conclude that our strangely low level of touch traffic actually does reflect reality!

In [162]:
len(touch_previews.query("os_family in ('Android', 'iOS', 'Mac OS X')")) / len(user_previews)

0.07885906040268456