This repository has been archived by the owner on Jan 4, 2023. It is now read-only.

Data from 2019-01-01 contains unknown crawls #156

Closed
Themanwithoutaplan opened this issue Jan 28, 2019 · 13 comments

Comments

@Themanwithoutaplan

About 40,000 results in the 2019-01-01 mobile datasets have crawlid 581, which isn't listed in the statistics table.

@rviscomi
Member

rviscomi commented Feb 4, 2019

Yes, when the crawl started, the mobile URL list had not been updated correctly, so we updated it on the fly and restarted the crawl. About 40k tests had already started, and those correspond to the aborted crawl.

@Themanwithoutaplan
Author

Thanks for the explanation, but in the interests of data consistency those tests should either be removed from the dump or a corresponding crawl ID should be added to the statistics table.

@rviscomi
Member

rviscomi commented Feb 6, 2019

Are you able to omit 581 from your end of the pipeline? We're not maintaining any of the legacy systems beyond recovering from data loss or breakages, and this doesn't seem to affect anything critical. I'm not aware of anyone else who depends on the raw legacy results, so if you could work around it for now we should be ok.
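(For reference, a minimal filter on the importing side could look something like the sketch below, assuming the CSV dump is first staged in a hypothetical pages_raw table before being loaded into a local pages table:)

-- skip rows from the aborted crawl while loading the staged dump
INSERT INTO pages
SELECT *
FROM pages_raw
WHERE crawlid <> 581;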

@Themanwithoutaplan
Author

I can work around this one fairly easily because it raises an exception directly when I import the data. A bigger pain is duplicate tests within the same crawl, because those don't show up until I create some reports. The right constraints on the database would stop this from happening, but I can appreciate you not wanting to touch the schema at this stage, particularly as fixing the duplicate tests requires window functions, which I'm not sure MySQL supports.
It would be great if you could give a heads-up at the end of any crawl where such anomalies are expected.
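(For illustration, a window-function clean-up in Postgres could look something like the sketch below; the pages table name and pageid surrogate key are assumptions about the local import schema:)

-- keep one row per (crawlid, url) and delete the rest
DELETE FROM pages
WHERE pageid IN (
  SELECT pageid
  FROM (
    SELECT pageid,
           ROW_NUMBER() OVER (PARTITION BY crawlid, url ORDER BY pageid DESC) AS rn
    FROM pages
  ) ranked
  WHERE rn > 1
);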

@rviscomi
Member

rviscomi commented Feb 7, 2019

Will do! Thanks for your understanding.

@Themanwithoutaplan
Author

FWIW, I'm still seeing around 120,000 duplicates in every run. I haven't done any further analysis, but I think you should be looking at what's causing this.

@Themanwithoutaplan
Author

In the desktop crawl for 2019-04-01 there are around 227,000 sites with duplicate tests, so we're getting close to 5%. This will start to affect any derived statistics and is also a waste of resources. Do we have any idea what's causing this? Are some crawls being allocated twice?

@rviscomi
Member

Could you share the query you're running to get that count?

summary_pages on BigQuery is created from the MySQL-based CSV dumps, and it's not showing any duplicates:

SELECT
  url,
  COUNT(0) AS n
FROM
  `httparchive.summary_pages.2019_04_01_desktop`
GROUP BY
  url
HAVING
  n > 1
ORDER BY
  n DESC

If the duplication is real, I agree it's worth investigating, at least for the sake of resource conservation.

@Themanwithoutaplan
Author

My query is always slightly different due to the way I import data into Postgres, but I can provide a list of what I think are duplicate sites.

@pmeenan
Member

pmeenan commented Apr 25, 2019

I'm not seeing any duplicates on the main server the dumps are created on either:

mysql> SELECT url, COUNT(0) AS n FROM pages WHERE crawlid = 587 GROUP BY url HAVING n > 1 ORDER BY n DESC;
Empty set (35.76 sec)

587 is the crawl ID for the 2019-04-01 desktop crawl.

Any chance you're not including the protocol (http:// or https://) part of the URL when collecting the site stats? It is expected that there will be entries for both when a site has traffic on both, since they are different origins.

Looking at domains that have both http:// and https:// results:

mysql> SELECT LEFT(RIGHT(`url` ,length(`url`) -(position('//' IN `url`) + 1)) ,position('/' IN RIGHT(`url` ,length(`url`) -(position('//' IN `url`) + 1))) - 1) AS domain, COUNT(0) AS n FROM pages WHERE crawlid = 587 GROUP BY domain HAVING n > 1 ORDER BY n DESC;

113500 rows in set (40.32 sec)

That's pretty close to half of your number, so, assuming you are counting both of them, my guess is that you are calculating stats at the domain level rather than the origin level. We test both the http and https versions of an origin if CrUX shows that both had traffic during the previous month.
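(A rough Postgres equivalent of that domain-level grouping, assuming a local pages table with url and crawlid columns, might look like:)

-- group by host only, so http:// and https:// variants of the same site collapse together
SELECT split_part(split_part(url, '//', 2), '/', 1) AS domain,
       COUNT(*) AS n
FROM pages
WHERE crawlid = 587
GROUP BY domain
HAVING COUNT(*) > 1
ORDER BY n DESC;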

@Themanwithoutaplan
Author

Pat, I think you've identified the problem: I do normalise on the domain, which is why these count as duplicates. I'm not sure there is any point in treating them separately just because both protocols are in the CrUX dataset: is it right to consider these distinct websites? But the important thing is that we've identified the source of the anomaly, and I can adjust my import script.

@pmeenan
Member

pmeenan commented Apr 26, 2019

If both protocols are in CrUX, that means both had a meaningful amount of traffic during the month. If it was redirecting from http -> https, then CrUX would only report the https version. The content can be completely different if, for example, a site is migrating to a new system and deploying https as part of the migration, and the performance characteristics are likely to be very different.

If you are going to de-dupe them into a single entry, I'd recommend favoring the https:// variant when there are duplicates. For the HTTPArchive it makes sense to just collect both of them so we have a clean and complete dataset that matches CrUX.
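(One way to express that preference in Postgres, again assuming a local pages table with url and crawlid columns, is a DISTINCT ON query that sorts the https:// variant first:)

-- per domain, keep a single row and prefer the https:// variant when both exist
SELECT DISTINCT ON (split_part(split_part(url, '//', 2), '/', 1)) *
FROM pages
WHERE crawlid = 587
ORDER BY split_part(split_part(url, '//', 2), '/', 1),
         (url LIKE 'https://%') DESC;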

@Themanwithoutaplan
Author

I appreciate exactly what you're saying about the protocols, but a cursory check suggests that these are duplicates and that the websites are just not configured to redirect. At some point, as we move towards http/2, the issue may resolve itself; in the meantime I guess it's an interesting effect in itself.

For my purposes, I'm doing just what you suggest and keeping only the https variants.
