
Perform June crawl #108

Closed · SebastianZimmeck opened this issue Apr 23, 2024 · 12 comments
Labels: crawl (Perform crawl or crawl feature-related)

Comments

@SebastianZimmeck (Member)

@franciscawijaya will perform the crawl (with possible help from @katehausladen).

@franciscawijaya (Member)

The crawler is in progress right now. Things for me to note:

  • Routinely check the crawler to make sure everything is going well and it does not stop halfway (if it does, restart it from the point where it stopped; see the sketch below)
  • Familiarize myself with the Google Colab
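
For the restart point, a minimal sketch of one way to resume a batch, assuming a hypothetical `crawled.log` progress file and `crawl_site()` helper (neither is part of the actual crawler):

```python
# Minimal sketch of resuming a crawl batch from where it stopped.
# The file names (sites.csv, crawled.log) and crawl_site() are
# hypothetical placeholders, not the crawler's actual mechanism.
import csv

def remaining_sites(site_list="sites.csv", progress_log="crawled.log"):
    with open(progress_log) as f:
        done = {line.strip() for line in f if line.strip()}
    with open(site_list, newline="") as f:
        sites = [row[0] for row in csv.reader(f) if row]
    return [s for s in sites if s not in done]

# for site in remaining_sites():
#     crawl_site(site)                      # hypothetical crawl call
#     with open("crawled.log", "a") as f:   # record progress for restarts
#         f.write(site + "\n")
```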

@franciscawijaya (Member)

As per our conversation on the call, I have compared the results for some sites between the June crawl pt. 1 and the April crawl pt. 1, and they look similar; the crawl is looking great so far! To track progress: I'm currently on the third set, and in the meantime I'm reformatting all the JSON files to be readable and reading through the Google Colab.
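
For the JSON reformatting, a small sketch of re-serializing each file with indentation; the `crawl-data` directory name is a placeholder for wherever the crawl output lives:

```python
# Sketch: make crawl JSON output human-readable by re-serializing
# each file in place with indentation.
import json
from pathlib import Path

for path in Path("crawl-data").glob("*.json"):  # hypothetical location
    with path.open() as f:
        data = json.load(f)
    with path.open("w") as f:
        json.dump(data, f, indent=2)  # pretty-print in place
```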

@SebastianZimmeck (Member, Author)

Nice, @franciscawijaya!

@SebastianZimmeck (Member, Author)

@franciscawijaya, can you add the following URL to the end of the crawl list and include it in your crawl going forward?

https://www.washingtonpost.com/

(cc'ing @AramZS)

@franciscawijaya (Member)

Added to the 8th batch!

@franciscawijaya (Member)

The last batch of the crawl is now done and everything looks great so far!

Next step: I will now begin parsing and analyzing the crawl data, which will be finished by our Thursday meeting at the latest.

@SebastianZimmeck (Member, Author)

Excellent! Great news!

@franciscawijaya (Member)

Update: I have transferred all of the crawl data to the Google Drive and, now that we have all the data, am starting to collate redo_sites.csv using the Google Colab. However, I'm currently facing an error when running one of the lines of code and have been struggling to figure out where it went wrong. I have reached out to labmates for input and will continue debugging.

@franciscawijaya (Member)

Solved! I'm now running the redo sites (the Google Colab collated 720 sites without subdomains to be crawled).
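
For reference, a rough sketch of the kind of subdomain filtering described, using the tldextract package; the file names and input format here are assumptions, since the actual collation lives in the Colab notebook:

```python
# Sketch: keep only redo candidates that are bare registered domains
# (no subdomain). Input/output file names are hypothetical.
import csv
import tldextract  # pip install tldextract

with open("redo_candidates.csv", newline="") as f:  # hypothetical input
    candidates = [row[0] for row in csv.reader(f) if row]

redo_sites = []
for url in candidates:
    parts = tldextract.extract(url)
    # An empty (or "www") subdomain means the URL is a bare registered domain.
    if parts.subdomain in ("", "www"):
        redo_sites.append(url)

with open("redo_sites.csv", "w", newline="") as f:
    csv.writer(f).writerows([s] for s in redo_sites)
```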

@franciscawijaya (Member)

An update: I have crawled the redo sites and also tried to run the well-known script. However, I ran into a problem where only well-known-data.csv was updated, not well-known-errors.csv, once the code fully ran. Nevertheless, I think I found the problem and am now re-running the script. Hopefully, I will have both well-known-data.csv and well-known-errors.csv by tomorrow morning and can then start parsing and analyzing to get all the figures.
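
For context, a hedged sketch of what a well-known check does: request each site's /.well-known/gpc.json (the resource defined by the GPC spec) and split successes and failures into the two CSVs named above. The sites list and column layout here are assumptions, not the script's actual implementation:

```python
# Sketch: probe each site's GPC well-known resource and write
# successes to well-known-data.csv, failures to well-known-errors.csv.
import csv
import requests

sites = ["https://example.com"]  # placeholder for the crawl list

with open("well-known-data.csv", "w", newline="") as data_f, \
     open("well-known-errors.csv", "w", newline="") as err_f:
    data_w, err_w = csv.writer(data_f), csv.writer(err_f)
    data_w.writerow(["site", "gpc_json"])
    err_w.writerow(["site", "error"])
    for site in sites:
        try:
            r = requests.get(site + "/.well-known/gpc.json", timeout=10)
            r.raise_for_status()
            data_w.writerow([site, r.text.strip()])
        except requests.RequestException as e:
            err_w.writerow([site, str(e)])  # failures go to the errors file
```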

@SebastianZimmeck (Member, Author) commented Jun 20, 2024

As discussed, @franciscawijaya if you can do the following:

  • Create/update crawl release
  • Update crawl documentation as necessary

Then, feel free to close this issue.

@franciscawijaya (Member)

As mentioned in the meeting, while I am done with the crawl, I am still working out the parsing and analysis of the data and the creation of the figures, which will be my task for this week. I will close this issue now since the crawl is done, and I will open a new issue in gpc-web-crawler-paper to post the figures and data once I finish the analysis and figures.
