In [10]:
import wmfdata as wmf
from wmfdata.utils import pd_display_all

Here I'm going to update the Wikipedia Preview ETL job to extract instrumentation version ([T314829](https://phabricator.wikimedia.org/T314829)).

Kill the existing job:
```
$ oozie job -kill 0087801-220613130955581-oozie-oozi-C
```

Now, can I use a regex to extract an arbitrary instrumentation version from the `wprov` parameter, or will that increase processing time too much?

This is a baseline test of the current approach:

In [5]:
%%time
wmf.hive.run("""
SELECT
    COUNT(*) as requests
FROM wmf.webrequest
WHERE
    x_analytics_map["wprov"] IN ('wppw1', 'wppw1t')
    AND webrequest_source = 'text'
    AND year = 2023
    AND month = 2
    AND day = 1
    AND hour = 0
""")



CPU times: user 157 ms, sys: 7.94 ms, total: 165 ms
Wall time: 34.9 s


Unnamed: 0,requests
0,47


This tests using a regular expression instead:

In [9]:
%%time
wmf.hive.run("""
SELECT
    COUNT(*) as requests
FROM wmf.webrequest
WHERE
    x_analytics_map["wprov"] REGEXP '^wppw\\\\d+t?$'
    AND webrequest_source = 'text'
    AND year = 2023
    AND month = 2
    AND day = 1
    AND hour = 0
""")



CPU times: user 6.35 ms, sys: 9.58 ms, total: 15.9 ms
Wall time: 28.7 s


Unnamed: 0,requests
0,47


Well, okay! That was actually _faster_ (and pretty consistently so, since I had to run it several times while I worked on getting the regex right and the timing was about the same every time). So I can switch to a regex.
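
The quadruple backslash is the part that took a few tries, so here is a minimal sketch of the escaping layers. It assumes that Hive's string-literal parser consumes one level of backslash escaping, which is consistent with the query above working as written. The `hive_unescape` helper is hypothetical and models only that one step, and Python's `re` stands in for Hive's Java regex engine, which should treat a pattern this simple the same way.

```python
import re

# The pattern exactly as typed inside the (non-raw) Python string in the cell above.
python_literal = '^wppw\\\\d+t?$'

# Python's own escaping halves the backslashes, so this is the text Hive receives.
assert python_literal == r'^wppw\\d+t?$'


def hive_unescape(sql_literal):
    # Hypothetical model (an assumption, not Hive's actual code):
    # collapse each doubled backslash into a single one.
    return sql_literal.replace('\\\\', '\\')


# The regex that finally reaches the regex engine.
regex = hive_unescape(python_literal)
assert regex == r'^wppw\d+t?$'

pattern = re.compile(regex)
candidates = ['wppw1', 'wppw1t', 'wppw2', 'wppw12t', 'wppw', 'wppwt', 'xwppw1', 'wppw1tt']
matches = {wprov: bool(pattern.match(wprov)) for wprov in candidates}

for wprov, matched in matches.items():
    print(f"{wprov:10} {matched}")
```

Only the first four candidates match: the pattern requires at least one digit after `wppw` and allows at most one trailing `t`.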

Now to test fuller versions of the core query against each other.

First, the old:

In [31]:
%%time
old = wmf.hive.run("""
SELECT
    SUM(CAST(is_pageview AS INT)) AS pageviews,
    SUM(CAST(NOT is_pageview AS INT)) AS previews,
    year,
    month,
    day,
    IF(
        x_analytics_map['wprov'] IN ('wppw1t', 'wppw2t'),
        'touch',
        'non-touch'
    ) AS device_type,
    parse_url(referer, 'HOST') AS referer_host,
    geocoded_data['continent'] AS continent,
    geocoded_data['country_code'] AS country_code,
    geocoded_data['country'] AS country,
    IF(
        x_analytics_map['wprov'] IN ('wppw2', 'wppw2t'),
        2,
        1
    ) AS instrumentation_version
FROM wmf.webrequest
WHERE
    x_analytics_map['wprov'] IN ('wppw1', 'wppw1t', 'wppw2', 'wppw2t')
    AND webrequest_source = 'text'
    AND year = 2023
    AND month = 2
    AND day = 23
    AND hour = 0
GROUP BY
    year,
    month,
    day,
    IF(
        x_analytics_map['wprov'] IN ('wppw1t', 'wppw2t'),
        'touch',
        'non-touch'
    ),
    parse_url(referer, 'HOST'),
    geocoded_data['continent'],
    geocoded_data['country_code'],
    geocoded_data['country'],
    IF(
        x_analytics_map['wprov'] IN ('wppw2', 'wppw2t'),
        2,
        1
    )
""").sort_values(['referer_host', 'device_type', 'country_code'])



CPU times: user 15.1 ms, sys: 8.64 ms, total: 23.7 ms
Wall time: 46.1 s


Now, the new query:

In [34]:
%%time
new = wmf.hive.run("""
SELECT
    SUM(CAST(is_pageview AS INT)) AS pageviews,
    SUM(CAST(NOT is_pageview AS INT)) AS previews,
    year,
    month,
    day,
    IF(
        REGEXP_EXTRACT(
            x_analytics_map['wprov'],
            '^wppw(\\\\d+)(t?)$',
            2
        ) = 't',
        'touch',
        'non-touch'
    ) AS device_type,
    parse_url(referer, 'HOST') AS referer_host,
    geocoded_data['continent'] AS continent,
    geocoded_data['country_code'] AS country_code,
    geocoded_data['country'] AS country,
    CAST(REGEXP_EXTRACT(
        x_analytics_map['wprov'],
        '^wppw(\\\\d+)(t?)$',
        1
    ) AS INT) AS instrumentation_version     
FROM wmf.webrequest
WHERE
    x_analytics_map['wprov'] REGEXP '^wppw(\\\\d+)(t?)$'
    AND webrequest_source = 'text'
    AND year = 2023
    AND month = 2
    AND day = 23
    AND hour = 0
GROUP BY
    year,
    month,
    day,
    IF(
        REGEXP_EXTRACT(
            x_analytics_map['wprov'],
            '^wppw(\\\\d+)(t?)$',
            2
        ) = 't',
        'touch',
        'non-touch'
    ),
    parse_url(referer, 'HOST'),
    geocoded_data['continent'],
    geocoded_data['country_code'],
    geocoded_data['country'],
    CAST(REGEXP_EXTRACT(
        x_analytics_map['wprov'],
        '^wppw(\\\\d+)(t?)$',
        1
    ) AS INT)
""").sort_values(['referer_host', 'device_type', 'country_code'])



CPU times: user 20.9 ms, sys: 5.54 ms, total: 26.4 ms
Wall time: 1min 8s


In [38]:
new.equals(old)

True

The two approaches produce the same results! In this case, the new approach took about 50% longer than the old one (1 min 8 s versus 46.1 s), but at around a minute per hour of webrequest data, that's still perfectly acceptable.
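
To make the equivalence concrete, here is a small Python sketch of the two classification rules. The function names `classify_old` and `classify_new` are made up for illustration; they mirror the `IF(... IN ...)` expressions and the `REGEXP_EXTRACT` calls in the two queries, with Python's `re` standing in for Hive's regex engine.

```python
import re

WPROV_PATTERN = re.compile(r'^wppw(\d+)(t?)$')
KNOWN_VALUES = ('wppw1', 'wppw1t', 'wppw2', 'wppw2t')


def classify_old(wprov):
    """Mirror the hard-coded IN lists of the old query."""
    if wprov not in KNOWN_VALUES:
        return None  # the old WHERE clause filters this row out
    device_type = 'touch' if wprov in ('wppw1t', 'wppw2t') else 'non-touch'
    version = 2 if wprov in ('wppw2', 'wppw2t') else 1
    return device_type, version


def classify_new(wprov):
    """Mirror the regex-based extraction of the new query."""
    match = WPROV_PATTERN.match(wprov)
    if match is None:
        return None  # the new WHERE clause filters this row out
    device_type = 'touch' if match.group(2) == 't' else 'non-touch'
    return device_type, int(match.group(1))


# Both rules agree on every value the old query knew about...
assert all(classify_old(w) == classify_new(w) for w in KNOWN_VALUES)

# ...but only the new rule handles a future version without a code change.
print(classify_old('wppw3t'))  # None
print(classify_new('wppw3t'))  # ('touch', 3)
```

This is the practical benefit of the regex: a later instrumentation version would be picked up without another change to the job.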

I did notice that this test set didn't have any version 2 data from touch devices. Let me make sure we aren't losing data like that.

In [43]:
wmf.hive.run("""
SELECT
    SUM(IF(x_analytics_map['wprov'] = 'wppw2', 1, 0)) AS non_touch_events,
    SUM(IF(x_analytics_map['wprov'] = 'wppw2t', 1, 0)) AS touch_events
FROM wmf.webrequest
WHERE
    x_analytics_map['wprov'] REGEXP '^wppw(\\\\d+)(t?)$'
    AND webrequest_source = 'text'
    AND year = 2023
    AND month = 2
    AND day = 23
""")



Unnamed: 0,non_touch_events,touch_events
0,658,46


This looks good, and the skew towards non-touch events suggests that the underlying instrumentation fix has addressed [T317171](https://phabricator.wikimedia.org/T317171).

Now that I've prepared the [full change to the ETL job](https://gerrit.wikimedia.org/r/c/analytics/wmf-product/jobs/+/891866), I need to test again to ensure that the new output looks right. I'll do that for February 8, 9, and 10 to confirm that the new output for version 1 events matches the old output.

The data from the old job:

In [47]:
old_daily = wmf.presto.run("""
    SELECT
        year,
        month,
        day,
        SUM(previews) AS previews,
        SUM(pageviews) AS pageviews
    FROM wmf_product.wikipediapreview_stats
    WHERE
        year = 2023
        AND month = 2
        AND day IN (8, 9, 10)
    GROUP BY
        year,
        month,
        day
""")

old_daily = (
    old_daily
    .sort_values(["year", "month", "day"])
    .reset_index(drop=True)
)

In [48]:
old_daily

Unnamed: 0,year,month,day,previews,pageviews
0,2023,2,8,2904,147
1,2023,2,9,1963,57
2,2023,2,10,818,56


Run the new ETL job as a test:

In [56]:
!~/product_analytics_jobs/deploy-oozie-job wikipediapreview_stats --test

The HDFS job directory will be hdfs:///user/neilpquinn-wmf/jobs/wikipediapreview_stats
Removing old job files in the job directory...
Creating the job directory...
Putting new job files into the job directory...
Submitting the job...
job: 0148847-220913162928808-oozie-oozi-C


In [61]:
new_daily = wmf.presto.run("""
    SELECT
        year,
        month,
        day,
        instrumentation_version,
        SUM(previews) AS previews,
        SUM(pageviews) AS pageviews
    FROM nshahquinn.wikipediapreview_stats_test
    WHERE
        year = 2023
        AND month = 2
        AND day IN (8, 9, 10)
    GROUP BY
        year,
        month,
        day,
        instrumentation_version
""")

new_daily = (
    new_daily
    .sort_values(["instrumentation_version", "year", "month", "day"])
    .reset_index(drop=True)
)

In [62]:
new_daily

Unnamed: 0,year,month,day,instrumentation_version,previews,pageviews
0,2023,2,8,1,2904,147
1,2023,2,9,1,1963,57
2,2023,2,10,1,818,56
3,2023,2,8,2,36,2
4,2023,2,9,2,1244,69
5,2023,2,10,2,2123,102


The version 1 data matches perfectly, so that looks good.
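
Rather than comparing the two tables by eye, the check can be written out explicitly. This is a sketch using plain dictionaries, with the numbers copied from the two outputs above; the variable names are illustrative only.

```python
# Output of the old job: (year, month, day) -> (previews, pageviews)
old_rows = {
    (2023, 2, 8): (2904, 147),
    (2023, 2, 9): (1963, 57),
    (2023, 2, 10): (818, 56),
}

# Output of the test job:
# (year, month, day, instrumentation_version) -> (previews, pageviews)
new_rows = {
    (2023, 2, 8, 1): (2904, 147),
    (2023, 2, 9, 1): (1963, 57),
    (2023, 2, 10, 1): (818, 56),
    (2023, 2, 8, 2): (36, 2),
    (2023, 2, 9, 2): (1244, 69),
    (2023, 2, 10, 2): (2123, 102),
}

# Keep only version 1 rows and drop the version from the key.
new_v1 = {key[:3]: counts for key, counts in new_rows.items() if key[3] == 1}

# The old job only ever counted version 1 events, so these should be identical.
assert new_v1 == old_rows
print("version 1 rows match the old job's output")
```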

In [63]:
(
    new_daily
    .groupby(["year", "month", "day"])
    .sum()
    [["previews", "pageviews"]]
)

year,month,day,previews,pageviews
2023,2,8,2940,149
2023,2,9,3207,126
2023,2,10,2941,158


And when we combine the version 1 and version 2 events, the daily preview counts stay roughly steady (2,940, 3,207, and 2,941), so no events seem to be getting lost in the transition.

Also, the last three daily instances of the old ETL job averaged a runtime of 10 min, 41 s. The three daily instances of this test averaged 10 min, 30 s. So the changes don't seem to have increased the job's resource consumption.
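
The single-hour ad hoc test and the daily job runs point in different directions, so it helps to put both comparisons on the same footing. This is just arithmetic on the timings quoted in this notebook; the helper name is illustrative.

```python
def relative_change(old_seconds, new_seconds):
    """Fractional change in runtime going from the old to the new approach."""
    return (new_seconds - old_seconds) / old_seconds


# Ad hoc query over one hour of webrequest: 46.1 s (old) vs 1 min 8 s (new).
hourly_change = relative_change(46.1, 68.0)

# Average daily job runtime: 10 min 41 s (old) vs 10 min 30 s (new).
daily_change = relative_change(10 * 60 + 41, 10 * 60 + 30)

print(f"hourly ad hoc query: {hourly_change:+.1%}")  # +47.5%
print(f"daily ETL job:       {daily_change:+.1%}")   # -1.7%
```

Both comparisons rest on very few runs, so the difference between them is probably noise rather than a real effect.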

So, time to merge the change!

Of course, this also means it's time for the annoying "extra" work of massaging the existing data into the new form.

First, drop the tables I used for the last such migration.

In [65]:
wmf.presto.run("SHOW TABLES IN nshahquinn")

Unnamed: 0,Table
0,countries
1,country_test
2,elizabeth_ii_articles
3,test
4,wikipediapreview_stats_backup
5,wikipediapreview_stats_combined
6,wikipediapreview_stats_test
7,wikis
8,wmfdata_test_1
9,wmfdata_test_2


In [71]:
wmf.hive.run([
    "DROP TABLE nshahquinn.wikipediapreview_stats_backup",
    "DROP TABLE nshahquinn.wikipediapreview_stats_combined",
    "DROP TABLE nshahquinn.wikipediapreview_stats_test"
])



Now take a backup of the existing data.

In [72]:
wmf.spark.run([
    """
    CREATE TABLE nshahquinn.wikipediapreview_stats_backup
    LIKE wmf_product.wikipediapreview_stats
    """,
    """
    INSERT INTO nshahquinn.wikipediapreview_stats_backup
    SELECT *
    FROM wmf_product.wikipediapreview_stats
    """
])

In [73]:
backup_daily = wmf.presto.run("""
    SELECT
        year,
        month,
        day,
        SUM(previews) AS previews,
        SUM(pageviews) AS pageviews
    FROM nshahquinn.wikipediapreview_stats_backup
    WHERE
        year = 2023
        AND month = 2
        AND day IN (8, 9, 10)
    GROUP BY
        year,
        month,
        day
""")

backup_daily = (
    backup_daily
    .sort_values(["year", "month", "day"])
    .reset_index(drop=True)
)

Verify that the backup worked correctly.

In [75]:
backup_daily.equals(old_daily)

True

Now, take a slightly wider slice of the older data for verification purposes.

In [77]:
old_daily = wmf.presto.run("""
    SELECT
        year,
        month,
        day,
        SUM(previews) AS previews,
        SUM(pageviews) AS pageviews
    FROM nshahquinn.wikipediapreview_stats_backup
    WHERE
        year = 2023
        AND month = 2
        AND day < 11
    GROUP BY
        year,
        month,
        day
""")

old_daily = (
    old_daily
    .sort_values(["year", "month", "day"])
    .reset_index(drop=True)
)

In [78]:
old_daily

Unnamed: 0,year,month,day,previews,pageviews
0,2023,2,1,1486,52
1,2023,2,2,1800,74
2,2023,2,3,1870,55
3,2023,2,4,1212,74
4,2023,2,5,1243,59
5,2023,2,6,2717,72
6,2023,2,7,3636,175
7,2023,2,8,2904,147
8,2023,2,9,1963,57
9,2023,2,10,818,56


In [95]:
wmf.hive.run("DROP TABLE nshahquinn.wikipediapreview_stats_altered")



In [96]:
wmf.hive.run([
    """
    CREATE TABLE nshahquinn.wikipediapreview_stats_altered (
        `pageviews`      bigint  COMMENT 'Number of pageviews shown as a result of a clickthrough from a Wikipedia Preview preview',
        `previews`       bigint  COMMENT 'Number of API requests for article preview content made by Wikipedia Preview clients',
        `year`           int     COMMENT 'Unpadded year of request',
        `month`          int     COMMENT 'Unpadded month of request',
        `day`            int     COMMENT 'Unpadded day of request',
        `device_type`    string  COMMENT 'Type of device used by the client: touch or non-touch',
        `referer_host`   string  COMMENT 'Host from referer parsing',
        `continent`      string  COMMENT 'Continent of the accessing agents (maxmind GeoIP database)',
        `country_code`   string  COMMENT 'Country iso code of the accessing agents (maxmind GeoIP database)',
        `country`        string  COMMENT 'Country (text) of the accessing agents (maxmind GeoIP database)',
        `instrumentation_version` int COMMENT 'Version number incremented along with major instrumentation changes'
    )
    """,
    """
    INSERT INTO nshahquinn.wikipediapreview_stats_altered
    SELECT
        pageviews,
        previews,
        year,
        month,
        day,
        device_type,
        referer_host,
        continent,
        country_code,
        country,
        1 AS instrumentation_version
    FROM wmf_product.wikipediapreview_stats
    WHERE
        year < 2023
        OR (year = 2023 AND month < 2)
        OR (year = 2023 AND month = 2 AND day < 9)
    """
])
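
The three-part `WHERE` clause in the insert above is meant to keep everything strictly before 2023-02-09, the date from which the new job will be redeployed. Here is a sketch that checks the clause against a plain tuple comparison over two years of dates; the function names are illustrative only.

```python
from datetime import date, timedelta


def kept_by_where_clause(year, month, day):
    """Mirror the WHERE clause used to fill the altered table."""
    return (
        year < 2023
        or (year == 2023 and month < 2)
        or (year == 2023 and month == 2 and day < 9)
    )


def before_cutoff(year, month, day):
    """The intended rule: strictly before 9 February 2023."""
    return (year, month, day) < (2023, 2, 9)


# Every day from 2022-01-01 through 2023-12-31.
start = date(2022, 1, 1)
days = [start + timedelta(days=offset) for offset in range(730)]

assert all(
    kept_by_where_clause(d.year, d.month, d.day) == before_cutoff(d.year, d.month, d.day)
    for d in days
)

# The last day kept and the first day left for the new job.
print(kept_by_where_clause(2023, 2, 8))  # True
print(kept_by_where_clause(2023, 2, 9))  # False
```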



In [97]:
altered_daily = wmf.presto.run("""
    SELECT
        year,
        month,
        day,
        SUM(previews) AS previews,
        SUM(pageviews) AS pageviews
    FROM nshahquinn.wikipediapreview_stats_altered
    WHERE
        year = 2023
        AND month = 2
        AND day < 11
    GROUP BY
        year,
        month,
        day
""")

altered_daily = (
    altered_daily
    .sort_values(["year", "month", "day"])
    .reset_index(drop=True)
)

In [98]:
altered_daily

Unnamed: 0,year,month,day,previews,pageviews
0,2023,2,1,1486,52
1,2023,2,2,1800,74
2,2023,2,3,1870,55
3,2023,2,4,1212,74
4,2023,2,5,1243,59
5,2023,2,6,2717,72
6,2023,2,7,3636,175
7,2023,2,8,2904,147


In [99]:
wmf.hive.run("DROP TABLE wmf_product.wikipediapreview_stats")



In [100]:
wmf.hive.run("""
CREATE EXTERNAL TABLE wmf_product.wikipediapreview_stats (
    `pageviews`      bigint  COMMENT 'Number of pageviews shown as a result of a clickthrough from a Wikipedia Preview preview',
    `previews`       bigint  COMMENT 'Number of API requests for article preview content made by Wikipedia Preview clients',
    `year`           int     COMMENT 'Unpadded year of request',
    `month`          int     COMMENT 'Unpadded month of request',
    `day`            int     COMMENT 'Unpadded day of request',
    `device_type`    string  COMMENT 'Type of device used by the client: touch or non-touch',
    `referer_host`   string  COMMENT 'Host from referer parsing',
    `continent`      string  COMMENT 'Continent of the accessing agents (maxmind GeoIP database)',
    `country_code`   string  COMMENT 'Country iso code of the accessing agents (maxmind GeoIP database)',
    `country`        string  COMMENT 'Country (text) of the accessing agents (maxmind GeoIP database)',
    `instrumentation_version` int COMMENT 'Version number incremented along with major instrumentation changes'
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'hdfs://analytics-hadoop//user/analytics-product/wikipediapreview_stats/daily'
""")



In [101]:
wmf.hive.run([
    "SET hive.exec.compress.output=true",
    "SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec",
    """
    INSERT INTO wmf_product.wikipediapreview_stats
    SELECT *
    FROM nshahquinn.wikipediapreview_stats_altered
    ORDER BY
        year,
        month,
        day
    LIMIT 10000000
    """
])




In [104]:
inserted_daily = wmf.presto.run("""
    SELECT
        year,
        month,
        day,
        SUM(previews) AS previews,
        SUM(pageviews) AS pageviews
    FROM wmf_product.wikipediapreview_stats
    WHERE
        year = 2023
        AND month = 2
        AND day < 11
    GROUP BY
        year,
        month,
        day
""")

inserted_daily = (
    inserted_daily
    .sort_values(["year", "month", "day"])
    .reset_index(drop=True)
)

In [105]:
inserted_daily

Unnamed: 0,year,month,day,previews,pageviews
0,2023,2,1,2972,104
1,2023,2,2,3600,148
2,2023,2,3,3740,110
3,2023,2,4,2424,148
4,2023,2,5,2486,118
5,2023,2,6,5434,144
6,2023,2,7,7272,350
7,2023,2,8,5808,294
8,2023,2,9,1963,57
9,2023,2,10,818,56


Hmm, this is no good. The data from before 9 Feb is clearly duplicated. The old data file wasn't removed when I dropped the table because it's an external table.
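
A quick way to confirm that this is exact duplication rather than some other corruption is to compare the counts day by day. The sketch below uses the numbers from the backup output and from the output just above; the variable names are illustrative.

```python
# Day of February 2023 -> (previews, pageviews), from the backup table.
before = {
    1: (1486, 52), 2: (1800, 74), 3: (1870, 55), 4: (1212, 74), 5: (1243, 59),
    6: (2717, 72), 7: (3636, 175), 8: (2904, 147), 9: (1963, 57), 10: (818, 56),
}

# The same days as read back from the recreated table.
after = {
    1: (2972, 104), 2: (3600, 148), 3: (3740, 110), 4: (2424, 148), 5: (2486, 118),
    6: (5434, 144), 7: (7272, 350), 8: (5808, 294), 9: (1963, 57), 10: (818, 56),
}

# Days where both counts are exactly twice the original.
doubled = [
    day for day in before
    if after[day] == (2 * before[day][0], 2 * before[day][1])
]
# Days where nothing changed.
unchanged = [day for day in before if after[day] == before[day]]

print("doubled:  ", doubled)    # [1, 2, 3, 4, 5, 6, 7, 8]
print("unchanged:", unchanged)  # [9, 10]
```

That pattern is what two overlapping files in the same directory would produce: days 1 to 8 are present in both the old file and the newly inserted one, while days 9 and 10 exist only in the old file.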

In [106]:
!hdfs dfs -ls /user/analytics-product/wikipediapreview_stats/daily

Found 2 items
-rwxrwxr-x   3 neilpquinn-wmf    analytics-privatedata-users     530782 2023-02-27 19:45 /user/analytics-product/wikipediapreview_stats/daily/000000_0.gz
-rw-r--r--   3 analytics-product hdfs                            547128 2023-02-23 21:43 /user/analytics-product/wikipediapreview_stats/daily/data.gz


In [107]:
!hdfs dfs -rm /user/analytics-product/wikipediapreview_stats/daily/data.gz

23/02/27 19:49:02 INFO fs.TrashPolicyDefault: Moved: 'hdfs://analytics-hadoop/user/analytics-product/wikipediapreview_stats/daily/data.gz' to trash at: hdfs://analytics-hadoop/user/neilpquinn-wmf/.Trash/Current/user/analytics-product/wikipediapreview_stats/daily/data.gz


In [108]:
!hdfs dfs -mv /user/analytics-product/wikipediapreview_stats/daily/000000_0.gz /user/analytics-product/wikipediapreview_stats/daily/data.gz

In [114]:
!hdfs dfs -ls /user/analytics-product/wikipediapreview_stats/daily

Found 1 items
-rwxrwxr-x   3 neilpquinn-wmf analytics-privatedata-users     530782 2023-02-27 19:45 /user/analytics-product/wikipediapreview_stats/daily/data.gz


In [109]:
inserted_daily = wmf.presto.run("""
    SELECT
        year,
        month,
        day,
        SUM(previews) AS previews,
        SUM(pageviews) AS pageviews
    FROM wmf_product.wikipediapreview_stats
    WHERE
        year = 2023
        AND month = 2
        AND day < 11
    GROUP BY
        year,
        month,
        day
""")

inserted_daily = (
    inserted_daily
    .sort_values(["year", "month", "day"])
    .reset_index(drop=True)
)

In [110]:
inserted_daily

Unnamed: 0,year,month,day,previews,pageviews
0,2023,2,1,1486,52
1,2023,2,2,1800,74
2,2023,2,3,1870,55
3,2023,2,4,1212,74
4,2023,2,5,1243,59
5,2023,2,6,2717,72
6,2023,2,7,3636,175
7,2023,2,8,2904,147


Okay, that's correct now.

Finally, redeploy the job starting from 2023-02-09.

```
neilpquinn-wmf@stat1005:~/product_analytics_jobs$ ./deploy-oozie-job wikipediapreview_stats --production
The HDFS job directory will be hdfs:///user/analytics-product/jobs/wikipediapreview_stats
Removing old job files in the job directory...
Creating the job directory...
Putting new job files into the job directory...
Submitting the job...
job: 0149051-220913162928808-oozie-oozi-C
```

Oops, it seems to be failing.

`$ oozie job -kill 0149051-220913162928808-oozie-oozi-C`

Huh, the error is: `Line 27:5 Table not found 'webrequest'`. But the computed `source_table` property is `wmf.webrequest`, which is correct.

Oh, seems like the Data Engineering team was just now altering the structure of webrequest. Apparently I should try again.

```
neilpquinn-wmf@stat1005:~/product_analytics_jobs$ ./deploy-oozie-job wikipediapreview_stats --production
The HDFS job directory will be hdfs:///user/analytics-product/jobs/wikipediapreview_stats
Removing old job files in the job directory...
Creating the job directory...
Putting new job files into the job directory...
Submitting the job...
job: 0149075-220913162928808-oozie-oozi-C
```

In [141]:
new_daily = wmf.presto.run("""
    SELECT
        year,
        month,
        day,
        instrumentation_version,
        SUM(previews) AS previews,
        SUM(pageviews) AS pageviews
    FROM wmf_product.wikipediapreview_stats
    WHERE
        year = 2023
        AND month = 2
    GROUP BY
        year,
        month,
        day,
        instrumentation_version
""")

new_daily = (
    new_daily
    .sort_values(["instrumentation_version", "year", "month", "day"])
    .reset_index(drop=True)
)

In [142]:
new_daily

Unnamed: 0,year,month,day,instrumentation_version,previews,pageviews
0,2023,2,1,1,1486,52
1,2023,2,2,1,1800,74
2,2023,2,3,1,1870,55
3,2023,2,4,1,1212,74
4,2023,2,5,1,1243,59
5,2023,2,6,1,2717,72
6,2023,2,7,1,3636,175
7,2023,2,8,1,2904,147
8,2023,2,9,1,1963,57
9,2023,2,10,1,818,56


In [130]:
(
    new_daily
    .groupby(["year", "month", "day"])
    .sum()
    [["previews", "pageviews"]]
)

year,month,day,previews,pageviews
2023,2,1,1486,52
2023,2,2,1800,74
2023,2,3,1870,55
2023,2,4,1212,74
2023,2,5,1243,59
2023,2,6,2717,72
2023,2,7,3636,175
2023,2,8,2904,147
2023,2,9,3207,126
2023,2,10,2941,158


In [149]:
clickthrough = wmf.presto.run("""
    SELECT
        instrumentation_version,
        CAST(SUM(pageviews) AS REAL) / CAST(SUM(previews) AS REAL) AS clickthrough_rate
    FROM wmf_product.wikipediapreview_stats
    WHERE
        year = 2023
        AND month = 2
        AND day >= 9
        AND device_type = 'non-touch'
    GROUP BY
        instrumentation_version
""")

In [150]:
clickthrough

Unnamed: 0,instrumentation_version,clickthrough_rate
0,2,0.055087
1,1,0.005884


Everything seems to be working correctly with the new job. The big bump in previews from 6-10 Feb is almost certainly because the [2023 Sanremo Music Festival](https://en.wikipedia.org/wiki/Sanremo_Music_Festival_2023) was going on; one of our current top sites is devoted to the festival.

For non-touch devices, instrumentation v2 sees a clickthrough rate of about 5.5%, compared to about 0.6% for v1. So that confirms that our fix for [T317171](https://phabricator.wikimedia.org/T317171) was useful and effective!

In [162]:
wmf.presto.run("""
    SELECT
        instrumentation_version,
        device_type,
        SUM(pageviews) AS pageviews,
        SUM(previews) AS previews,
        CAST(SUM(pageviews) AS REAL) / CAST(SUM(previews) AS REAL) AS clickthrough_rate
    FROM wmf_product.wikipediapreview_stats
    WHERE
        year = 2023
        AND month = 2
        AND day >= 9
    GROUP BY
        device_type,
        instrumentation_version
    ORDER BY
        instrumentation_version,
        device_type
""")

Unnamed: 0,instrumentation_version,device_type,pageviews,previews,clickthrough_rate
0,1,non-touch,74,12576,0.005884
1,1,touch,1064,2044,0.520548
2,2,non-touch,731,13270,0.055087
3,2,touch,249,1813,0.137341


But, very weirdly, v2 also has a much lower clickthrough rate for touch devices. The new rate seems much more reasonable, but as far as I know, we didn't change anything related to touch devices in this release, so it's strange.

Oh! Maybe the [Sanremo Festival site](https://sanremo-festival.de/) is included in v1 since it's running an older version, and it looks like its traffic comes overwhelmingly from touch devices (which makes sense since it's probably mostly people looking things up while watching the event on TV). So perhaps that site dominated our v1 traffic this month, and had a much higher clickthrough rate than normal because its visitors were particularly interested in the content.

In [163]:
wmf.presto.run("""
    SELECT
        instrumentation_version,
        device_type,
        SUM(pageviews) AS pageviews,
        SUM(previews) AS previews,
        CAST(SUM(pageviews) AS REAL) / CAST(SUM(previews) AS REAL) AS clickthrough_rate
    FROM wmf_product.wikipediapreview_stats
    WHERE
        year = 2023
        AND month = 2
        AND day >= 9
        AND referer_host != 'sanremo-festival.de'
    GROUP BY
        device_type,
        instrumentation_version
    ORDER BY
        instrumentation_version,
        device_type
""")

Unnamed: 0,instrumentation_version,device_type,pageviews,previews,clickthrough_rate
0,1,non-touch,72,11657,0.006177
1,1,touch,900,1978,0.455005
2,2,non-touch,467,7995,0.058412
3,2,touch,65,750,0.086667


Hmm, wrong on two counts. Most importantly, the big drop in touch clickthrough is still there even when I exclude sanremo-festival.de. Also, it looks like they updated pretty early in the month; when I excluded it, the v2 counts dropped much more than the v1 counts.
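
Subtracting the second table from the first makes both points explicit. This is a sketch on the numbers above, with illustrative variable names. One caveat: in SQL, `referer_host != 'sanremo-festival.de'` also drops rows where `referer_host` is NULL, so the "removed" figures may include some traffic with no referer at all, not only the festival site.

```python
# (instrumentation_version, device_type) -> (pageviews, previews)
all_sites = {
    (1, 'non-touch'): (74, 12576),
    (1, 'touch'): (1064, 2044),
    (2, 'non-touch'): (731, 13270),
    (2, 'touch'): (249, 1813),
}
without_sanremo = {
    (1, 'non-touch'): (72, 11657),
    (1, 'touch'): (900, 1978),
    (2, 'non-touch'): (467, 7995),
    (2, 'touch'): (65, 750),
}

# Previews removed by the filter, and total previews, per version.
removed_previews = {}
total_previews = {}
for (version, _device), (_pageviews, previews) in all_sites.items():
    kept = without_sanremo[(version, _device)][1]
    removed_previews[version] = removed_previews.get(version, 0) + previews - kept
    total_previews[version] = total_previews.get(version, 0) + previews

for version in sorted(removed_previews):
    share = removed_previews[version] / total_previews[version]
    print(f"v{version}: {removed_previews[version]} previews removed ({share:.1%})")
# v1: 985 previews removed (6.7%)
# v2: 6338 previews removed (42.0%)

# Touch clickthrough rate with the site excluded, per version.
touch_rate = {
    version: without_sanremo[(version, 'touch')][0] / without_sanremo[(version, 'touch')][1]
    for version in (1, 2)
}
print(f"touch clickthrough: v1 {touch_rate[1]:.3f}, v2 {touch_rate[2]:.3f}")
```

So the filter removes a far larger share of v2 previews than v1 previews, and the touch clickthrough rate is still several times higher for v1 than for v2 afterwards.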

Anyway, I think investigating this is a separate task; the ETL upgrade is done.