We retain data in the `webrequest` stream for 90 days, but in practice, what range of data do we have at any given time?

In [None]:
from datetime import datetime, timedelta, date

import wmfdata as wmf

wmf.spark.create_session(type="yarn-large")

The most recent data available:

In [7]:
print(f"Query started at {datetime.now().isoformat(sep=' ', timespec='minutes')}.")

newest = wmf.spark.run("""
SELECT
    year,
    month,
    day,
    hour,
    COUNT(*) AS records
FROM wmf.webrequest
WHERE
    year = 2022
    AND month = 12
    AND day = 9
GROUP BY
    year,
    month,
    day,
    hour
ORDER BY
    year,
    month,
    day,
    hour
""")

Query started at 2022-12-09 23:44.


It seems `webrequest` has data up until about an hour before the query ran.

In [11]:
newest.style.format(formatter={"records": "{:,.0f}"})

index,year,month,day,hour,records
0,2022,12,9,0,389929313
1,2022,12,9,1,392787993
2,2022,12,9,2,395662013
3,2022,12,9,3,376748217
4,2022,12,9,4,357351573
5,2022,12,9,5,365745780
6,2022,12,9,6,381991376
7,2022,12,9,7,409908555
8,2022,12,9,8,439285275
9,2022,12,9,9,453832974


In [9]:
now = datetime.now()

print(f"Query started at {now.isoformat(sep=' ', timespec='minutes')}.")
print(f"90 days ago was {(now - timedelta(days=90)).isoformat(sep=' ', timespec='minutes')}.")

oldest = wmf.spark.run("""
SELECT
    year,
    month,
    day,
    hour,
    COUNT(*) AS records
FROM wmf.webrequest
WHERE
    year = 2022
    AND month = 9
    AND day IN (9, 10, 11)
GROUP BY
    year,
    month,
    day,
    hour
ORDER BY
    year,
    month,
    day,
    hour
""")

Query started at 2022-12-09 23:45.
90 days ago was 2022-09-10 23:45.


Meanwhile, our oldest data goes back slightly further than exactly 90 days: the first available hour is 2022-09-10 20:00, which is 3 hr, 45 min before the 90-day mark.

In [14]:
oldest.style.format(formatter={"records": "{:,.0f}"})

index,year,month,day,hour,records
0,2022,9,10,20,528095803
1,2022,9,10,21,471909319
2,2022,9,10,22,413512281
3,2022,9,10,23,370561948
4,2022,9,11,0,348533382
5,2022,9,11,1,360039623
6,2022,9,11,2,350263209
7,2022,9,11,3,353539440
8,2022,9,11,4,352283291
9,2022,9,11,5,355502095
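
The 3 hr, 45 min figure can be double-checked with a little `datetime` arithmetic. This is only a minimal sketch: it hard-codes the timestamps printed above, and the variable names are made up for illustration.

```python
from datetime import datetime, timedelta

# Timestamps copied from the output above (hard-coded for illustration)
query_started = datetime(2022, 12, 9, 23, 45)
oldest_hour = datetime(2022, 9, 10, 20)  # first hour present in `oldest`

# The nominal retention boundary: exactly 90 days before the query
cutoff = query_started - timedelta(days=90)

# How much further back than the boundary the data actually goes
extra = cutoff - oldest_hour

print(f"90-day boundary: {cutoff.isoformat(sep=' ', timespec='minutes')}")
print(f"Extra data beyond the boundary: {extra}")
```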


So what's the simplest way to run an analysis query across a predictable period that roughly matches what's available in `webrequest`?

Spark (our most powerful query engine) cannot handle a query across all of `webrequest`, so we have to split the work into separate queries over shorter sub-periods. Hourly queries are unnecessarily granular and would require extra logic to pick the starting and ending hours, so we want a range of whole calendar days, with one query per day.
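
One way the daily queries could be generated is sketched below. It only builds the query strings, so it runs without a Spark session. `DAILY_QUERY` and `daily_queries` are hypothetical names, the `COUNT(*)` stands in for whatever analysis is actually needed, and the dates are arbitrary examples.

```python
from datetime import date, timedelta

# Hypothetical template: COUNT(*) is a placeholder for the real analysis
DAILY_QUERY = """
SELECT
    COUNT(*) AS records
FROM wmf.webrequest
WHERE
    year = {year}
    AND month = {month}
    AND day = {day}
"""

def daily_queries(start, end):
    """Yield (day, query) for each calendar day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day, DAILY_QUERY.format(year=day.year, month=day.month, day=day.day)
        day += timedelta(days=1)

# Arbitrary three-day example range
queries = list(daily_queries(date(2022, 9, 11), date(2022, 9, 13)))

for day, query in queries:
    print(day.isoformat())
```

Each query string could then be passed to `wmf.spark.run`, as in the cells above, and the per-day results combined afterwards.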

Right now, `webrequest` contains data spanning from 2022-09-10 to 2022-12-09, which is *91* calendar days. So, if we cut off the two partial days at the beginning and end (90 days ago and today), we will have a period of 89 complete calendar days.
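
The day counts can be verified with plain `date` arithmetic. This is a small sketch that hard-codes the two boundary dates observed above; the variable names are illustrative only.

```python
from datetime import date

# Boundary dates observed in the query results (hard-coded for illustration)
first_day = date(2022, 9, 10)
last_day = date(2022, 12, 9)

# Both endpoints count as calendar days, hence the + 1
calendar_days = (last_day - first_day).days + 1

# Drop the partial day at each end
complete_days = calendar_days - 2

print(f"{calendar_days} calendar days, {complete_days} of them complete")
```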

For instance:

In [31]:
# We generated `now` earlier, when we actually ran the queries, so hard-code the matching date here
today = date(2022, 12, 9)

print(f"Start 89 days ago with {(today - timedelta(days=89)).isoformat()}.")
print(f"End yesterday with {(today - timedelta(days=1)).isoformat()}.")


Start 89 days ago with 2022-09-11.
End yesterday with 2022-12-08.
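
The same calculation could be wrapped in a small helper so that it works for any run date. This is a sketch only: `complete_days` is a hypothetical function, and it assumes that retention stays at 90 days and that the oldest and newest calendar days are always partial, as they were in the queries above.

```python
from datetime import date, timedelta

def complete_days(today, retention_days=90):
    """Return the complete calendar days expected to be available on `today`.

    Assumes the day `retention_days` ago and `today` itself are both partial.
    """
    start = today - timedelta(days=retention_days - 1)
    end = today - timedelta(days=1)
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]

days = complete_days(date(2022, 12, 9))

print(f"{len(days)} days, from {days[0].isoformat()} to {days[-1].isoformat()}")
```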
