This repository has been archived by the owner on Jan 4, 2023. It is now read-only.

Data from 2019-01-01 contains unknown crawls #156

Closed
Themanwithoutaplan opened this issue Jan 28, 2019 · 13 comments

Comments

@Themanwithoutaplan

About 40,000 results in the 2019-01-01 mobile datasets have crawlid 581, which isn't listed in the statistics table.

@rviscomi
Member

rviscomi commented Feb 4, 2019

Yes, when the crawl started, the mobile URL list had not been updated correctly, so we updated it on the fly and restarted the crawl. About 40k tests had already started, and those correspond to the aborted crawl.

@Themanwithoutaplan
Author

Thanks for the explanation, but in the interests of data consistency those tests should either be removed from the dump or a corresponding crawl ID should be added to the statistics table.

@rviscomi
Member

rviscomi commented Feb 6, 2019

Are you able to omit 581 from your end of the pipeline? We're not maintaining any of the legacy systems beyond recovering from data loss or breakages, and this doesn't seem to affect anything critical. I'm not aware of anyone else who depends on the raw legacy results, so if you could work around it for now we should be ok.
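(For reference, a minimal filter on the importing side could look something like the sketch below, assuming the CSV dump is first staged in a hypothetical pages_raw table before being loaded into a local pages table:)

-- skip rows from the aborted crawl while loading the staged dump
INSERT INTO pages
SELECT *
FROM pages_raw
WHERE crawlid <> 581;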

@Themanwithoutaplan
Author

I can work around this one fairly easily because it raises an exception directly when I import the data. A bigger pain is duplicate tests within the same crawl, because those don't show up until I create some reports. The right constraints on the database would stop this from happening, but I can appreciate you not wanting to touch the schema at this stage, particularly as fixing the duplicate tests requires window functions, which I'm not sure MySQL supports.
It would be great if you could give a heads-up at the end of any crawl where such anomalies are expected.
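(For illustration, a window-function clean-up in Postgres could look something like the sketch below; the pages table name and pageid surrogate key are assumptions about the local import schema:)

-- keep one row per (crawlid, url) and delete the rest
DELETE FROM pages
WHERE pageid IN (
  SELECT pageid
  FROM (
    SELECT pageid,
           ROW_NUMBER() OVER (PARTITION BY crawlid, url ORDER BY pageid DESC) AS rn
    FROM pages
  ) ranked
  WHERE rn > 1
);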

@rviscomi
Member

rviscomi commented Feb 7, 2019

Will do! Thanks for your understanding.

@Themanwithoutaplan
Author

FWIW, I'm still seeing around 120,000 duplicates in every run. I haven't done any further analysis, but I think you should be looking at what's causing this.

@Themanwithoutaplan
Author

In the desktop crawl for 2019-04-01 there are around 227,000 sites with duplicate tests, so we're getting close to 5%. This will start to affect any derived statistics and is also a waste of resources. Do we have any idea what's causing this? Are some crawls being allocated twice?

@rviscomi
Member

Could you share the query you're running to get that count?

summary_pages on BigQuery is created from the MySQL-based CSV dumps, and it's not showing any duplicates:

SELECT
  url,
  COUNT(0) AS n
FROM
  `httparchive.summary_pages.2019_04_01_desktop`
GROUP BY
  url
HAVING
  n > 1
ORDER BY
  n DESC

If the duplication is real, I agree it's worth investigating, at least for the sake of resource conservation.

@Themanwithoutaplan
Author

My query is always slightly different due to the way I import data into Postgres, but I can provide a list of what I think are duplicate sites.

@pmeenan
Member

pmeenan commented Apr 25, 2019

I'm not seeing any duplicates on the main server the dumps are created on either:

mysql> SELECT url, COUNT(0) AS n FROM pages WHERE crawlid = 587 GROUP BY url HAVING n > 1 ORDER BY n DESC;
Empty set (35.76 sec)

587 is the crawl ID for the 2019-04-01 desktop crawl.

Any chance you're not including the protocol (http:// or https://) part of the URL when collecting the site stats? It is expected that there will be entries for both when a site has traffic on both, since they are different origins.

Looking at domains that have both http:// and https:// results:

mysql> SELECT LEFT(RIGHT(`url` ,length(`url`) -(position('//' IN `url`) + 1)) ,position('/' IN RIGHT(`url` ,length(`url`) -(position('//' IN `url`) + 1))) - 1) AS domain, COUNT(0) AS n FROM pages WHERE crawlid = 587 GROUP BY domain HAVING n > 1 ORDER BY n DESC;

113500 rows in set (40.32 sec)

That's pretty close to half of your number, so, assuming you are counting both of them, my guess is that you are calculating stats at the domain level rather than the origin level. We test both the http and https versions of an origin if CrUX shows that both had traffic during the previous month.
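(A rough Postgres equivalent of that domain-level grouping, assuming a local pages table with url and crawlid columns, might look like:)

-- group by host only, so http:// and https:// variants of the same site collapse together
SELECT split_part(split_part(url, '//', 2), '/', 1) AS domain,
       COUNT(*) AS n
FROM pages
WHERE crawlid = 587
GROUP BY domain
HAVING COUNT(*) > 1
ORDER BY n DESC;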

@Themanwithoutaplan
Author

Pat, I think you've identified the problem: I do normalise on the domain, which is why these count as duplicates. I'm not sure there is any point in treating them separately just because both protocols are in the CrUX dataset: is it right to consider these distinct websites? But the important thing is that we've identified the source of the anomaly, and I can adjust my import script.

@pmeenan
Member

pmeenan commented Apr 26, 2019

If both protocols are in CrUX, that means both had a meaningful amount of traffic during the month. If it was redirecting from http -> https, then CrUX would only report the https version. The content can be completely different if, for example, a site is migrating to a new system and deploying https as part of the migration, and the performance characteristics are likely to be very different.

If you are going to de-dupe them into a single entry, I'd recommend favoring the https:// variant when there are duplicates. For the HTTPArchive it makes sense to just collect both of them so we have a clean and complete dataset that matches CrUX.
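(One way to express that preference in Postgres, again assuming a local pages table with url and crawlid columns, is a DISTINCT ON query that sorts the https:// variant first:)

-- per domain, keep a single row and prefer the https:// variant when both exist
SELECT DISTINCT ON (split_part(split_part(url, '//', 2), '/', 1)) *
FROM pages
WHERE crawlid = 587
ORDER BY split_part(split_part(url, '//', 2), '/', 1),
         (url LIKE 'https://%') DESC;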

@Themanwithoutaplan
Author

I appreciate exactly what you're saying about the protocols, but a cursory check suggests that these are duplicates and that the websites are just not configured to redirect. At some point, as we move towards http/2, the issue may resolve itself; in the meantime I guess it's an interesting effect in itself.

For my purposes, I'm doing just what you suggest and keeping only the https variants.
