In [33]:
import wmfdata as wmf
from wmfdata.utils import pd_display_all

## Non-main-namespace page views
In July 2022, we fixed a Wikipedia Preview bug ([T313596](https://phabricator.wikimedia.org/T313596)) that caused previews of non-main-namespace pages to go uncounted.

Now that the Wikimedia sites that link to many such pages have been updated to the fixed version, does the data look more correct? Because the bug undercounted previews, which are the denominator of the clickthrough rate, we would expect to see a _decrease_ in the clickthrough rate.

Old data (June 2022):


In [2]:
wmf.presto.run("""
    SELECT
        referer_host,
        device_type,
        SUM(previews) AS previews,
        SUM(pageviews) AS pageviews,
        CAST(SUM(pageviews) AS REAL) / CAST(SUM(previews) AS REAL) AS clickthrough_rate
    FROM wmf_product.wikipediapreview_stats
    WHERE
        year = 2022
        AND month = 6
        AND referer_host IN ('diff.wikimedia.org', 'wikimediafoundation.org')
    GROUP BY
        referer_host,
        device_type
    ORDER BY
        referer_host,
        device_type
""")

| | referer_host | device_type | previews | pageviews | clickthrough_rate |
|---|---|---|---|---|---|
| 0 | diff.wikimedia.org | non-touch | 2740 | 16 | 0.005839 |
| 1 | diff.wikimedia.org | touch | 493 | 298 | 0.604463 |
| 2 | wikimediafoundation.org | non-touch | 2958 | 27 | 0.009128 |
| 3 | wikimediafoundation.org | touch | 589 | 1136 | 1.928693 |


New data (February 2023):

In [2]:
wmf.presto.run("""
    SELECT
        referer_host,
        device_type,
        SUM(previews) AS previews,
        SUM(pageviews) AS pageviews,
        CAST(SUM(pageviews) AS REAL) / CAST(SUM(previews) AS REAL) AS clickthrough_rate
    FROM wmf_product.wikipediapreview_stats
    WHERE
        year = 2023
        AND month = 2
        AND referer_host IN ('diff.wikimedia.org', 'wikimediafoundation.org')
    GROUP BY
        referer_host,
        device_type
    ORDER BY
        referer_host,
        device_type
""")

| | referer_host | device_type | previews | pageviews | clickthrough_rate |
|---|---|---|---|---|---|
| 0 | diff.wikimedia.org | non-touch | 4201 | 20 | 0.004761 |
| 1 | diff.wikimedia.org | touch | 416 | 234 | 0.5625 |
| 2 | wikimediafoundation.org | non-touch | 7976 | 50 | 0.006269 |
| 3 | wikimediafoundation.org | touch | 1812 | 995 | 0.549117 |


Unfortunately, neither the old data nor the new looks correct. Most of that is because the most common pageview path on non-touch devices is not instrumented in the version these sites are using ([T317171](https://phabricator.wikimedia.org/T317171)). Still, previews at wikimediafoundation.org, which has some particularly prominent non-main-namespace links, rose substantially (from 3,547 to 9,788 across both device types), and its touch clickthrough rate fell from an implausible 1.93 to 0.55. So yes, it seems we are seeing the effects of the fix.
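To make the before-and-after comparison explicit, here is a minimal sketch that puts the two result tables side by side. The figures are copied from the outputs above; the variable and column names (`comparison`, `preview_ratio`, `ctr_old`, `ctr_new`) are illustrative only. Note that the two periods differ in length (June has 30 days, February 28), so the raw preview counts are not normalized per day.

```python
import pandas as pd

hosts = ["diff.wikimedia.org", "diff.wikimedia.org",
         "wikimediafoundation.org", "wikimediafoundation.org"]
devices = ["non-touch", "touch", "non-touch", "touch"]

# Figures copied from the June 2022 and February 2023 outputs above
old = pd.DataFrame({
    "referer_host": hosts,
    "device_type": devices,
    "previews": [2740, 493, 2958, 589],
    "pageviews": [16, 298, 27, 1136],
})
new = pd.DataFrame({
    "referer_host": hosts,
    "device_type": devices,
    "previews": [4201, 416, 7976, 1812],
    "pageviews": [20, 234, 50, 995],
})

keys = ["referer_host", "device_type"]
comparison = old.merge(new, on=keys, suffixes=("_old", "_new"))

# How much did previews grow, and how did the clickthrough rate move?
comparison["preview_ratio"] = comparison["previews_new"] / comparison["previews_old"]
comparison["ctr_old"] = comparison["pageviews_old"] / comparison["previews_old"]
comparison["ctr_new"] = comparison["pageviews_new"] / comparison["previews_new"]

print(comparison[keys + ["preview_ratio", "ctr_old", "ctr_new"]].round(3))
```

For wikimediafoundation.org, previews roughly tripled on both device types, while diff.wikimedia.org changed much less.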

## Decreased touch clickthrough rate with new instrumentation version

In early February 2023, we released a major improvement to the instrumentation of pageviews for non-touch devices ([T317171](https://phabricator.wikimedia.org/T317171)), which we marked with a new "instrumentation version".

Afterward, I noticed that, for touch devices, version 1 traffic had a _much_ higher clickthrough rate than version 2 ([T330743](https://phabricator.wikimedia.org/T330743)).

Here's the result I originally noticed:

In [11]:
wmf.presto.run("""
    SELECT
        instrumentation_version,
        device_type,
        SUM(pageviews) AS pageviews,
        SUM(previews) AS previews,
        CAST(SUM(pageviews) AS REAL) / CAST(SUM(previews) AS REAL) AS clickthrough_rate
    FROM wmf_product.wikipediapreview_stats
    WHERE
        year = 2023
        AND month = 2
        AND day >= 9
    GROUP BY
        device_type,
        instrumentation_version
    ORDER BY
        instrumentation_version,
        device_type
""")

| | instrumentation_version | device_type | pageviews | previews | clickthrough_rate |
|---|---|---|---|---|---|
| 0 | 1 | non-touch | 78 | 13588 | 0.00574 |
| 1 | 1 | touch | 1120 | 2141 | 0.52312 |
| 2 | 2 | non-touch | 763 | 13809 | 0.055254 |
| 3 | 2 | touch | 251 | 1836 | 0.13671 |


Aha! If we omit the Wikimedia websites (which mostly haven't upgraded), the large disparity in favor of version 1 disappears. So it looks like the disparity arises simply because Wikimedia sites account for a major chunk of the traffic, have a much higher clickthrough rate, and haven't upgraded.

In [16]:
wmf.presto.run("""
    SELECT
        instrumentation_version,
        device_type,
        SUM(pageviews) AS pageviews,
        SUM(previews) AS previews,
        CAST(SUM(pageviews) AS REAL) / CAST(SUM(previews) AS REAL) AS clickthrough_rate
    FROM wmf_product.wikipediapreview_stats
    WHERE
        year = 2023
        AND month = 2
        AND day >= 9
        AND referer_host NOT LIKE '%wikimedia.org'
        AND referer_host != 'wikimediafoundation.org'
    GROUP BY
        device_type,
        instrumentation_version
    ORDER BY
        instrumentation_version,
        device_type
""")

| | instrumentation_version | device_type | pageviews | previews | clickthrough_rate |
|---|---|---|---|---|---|
| 0 | 1 | non-touch | 29 | 4490 | 0.006459 |
| 1 | 1 | touch | 35 | 485 | 0.072165 |
| 2 | 2 | non-touch | 659 | 13713 | 0.048057 |
| 3 | 2 | touch | 233 | 1820 | 0.128022 |
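As a rough check on this explanation, the Wikimedia-only figures can be recovered by subtracting the filtered result from the unfiltered one. This is a sketch rather than a new query: the numbers are copied from the two outputs above, "Wikimedia" here means exactly the hosts excluded by the `referer_host` filter, and the names `wikimedia` and `share_of_previews` are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(1, "non-touch"), (1, "touch"), (2, "non-touch"), (2, "touch")],
    names=["instrumentation_version", "device_type"],
)

# All traffic since 9 February 2023 (first output above)
all_traffic = pd.DataFrame(
    {"pageviews": [78, 1120, 763, 251], "previews": [13588, 2141, 13809, 1836]},
    index=index,
)
# Same period with the Wikimedia hosts filtered out (second output above)
non_wikimedia = pd.DataFrame(
    {"pageviews": [29, 35, 659, 233], "previews": [4490, 485, 13713, 1820]},
    index=index,
)

# Whatever the filter removed is the Wikimedia-hosted traffic
wikimedia = all_traffic - non_wikimedia
wikimedia["clickthrough_rate"] = wikimedia["pageviews"] / wikimedia["previews"]
wikimedia["share_of_previews"] = wikimedia["previews"] / all_traffic["previews"]

print(wikimedia.round(3))
```

In this window, the excluded hosts supply about 77% of version 1 touch previews (1,656 of 2,141) at a clickthrough rate of about 0.66, but under 1% of version 2 touch previews (16 of 1,836), which is consistent with the mix-of-sites explanation.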


To further confirm this, let's look at sites that have upgraded and have a decent amount of traffic, so we can compare their touch clickthrough rates before and after. We'll also go back to the start of January, so we have enough baseline data.

In [None]:
stats_by_site = wmf.presto.run("""
    SELECT
        referer_host,
        instrumentation_version,
        SUM(previews) AS previews,
        CAST(SUM(pageviews) AS REAL) / CAST(SUM(previews) AS REAL) AS clickthrough_rate,
        SUM(pageviews) AS pageviews
    FROM wmf_product.wikipediapreview_stats
    WHERE
        year = 2023
        AND month >= 1
        AND device_type = 'touch'
        AND referer_host NOT IN ('0.0.0.0', 'testingpurpose.local')
    GROUP BY
        referer_host,
        instrumentation_version
    ORDER BY
        referer_host,
        instrumentation_version
""")

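# Keep sites that have data for both instrumentation versions and at least
# 100 touch previews in total, so each rate has a usable sample behind it
sites_to_check = (
    stats_by_site
    .groupby("referer_host")
    .aggregate(
        versions_present=("instrumentation_version", "nunique"),
        previews=("previews", "sum")
    )
    .query("versions_present == 2 and previews >= 100")
    .index
)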

stats_by_site.query('referer_host in @sites_to_check').pipe(pd_display_all)

Overall, weighing each rate by the number of underlying previews, every site shows a plausible before-and-after pattern. The upgrade does not appear to have introduced an instrumentation bug.
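The long-format output can be hard to scan site by site. One possible way to line up the before and after figures is sketched below. The helper `compare_versions` and its output column names are hypothetical, and it assumes `instrumentation_version` holds the integers 1 and 2 and that the frame has the `referer_host`, `previews`, and `pageviews` columns returned by the query above.

```python
import pandas as pd


def compare_versions(stats: pd.DataFrame) -> pd.DataFrame:
    """Return one row per site with version 1 and version 2 figures side by side."""
    wide = stats.pivot(
        index="referer_host",
        columns="instrumentation_version",
        values=["previews", "pageviews"],
    )
    # Recompute rates from the counts rather than reusing rounded values
    rates = wide["pageviews"] / wide["previews"]
    result = pd.DataFrame({
        "previews_v1": wide["previews"][1],
        "previews_v2": wide["previews"][2],
        "rate_v1": rates[1],
        "rate_v2": rates[2],
    })
    result["rate_change"] = result["rate_v2"] - result["rate_v1"]
    # Sites with the most version 2 previews first, since their rates are least noisy
    return result.sort_values("previews_v2", ascending=False)


# Example call in this notebook (requires the frames defined above):
# compare_versions(stats_by_site.query("referer_host in @sites_to_check"))
```

Keeping the preview counts next to each rate makes it easier to discount large swings that rest on only a handful of previews.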