
Perform June crawl #108

Closed · SebastianZimmeck opened this issue Apr 23, 2024 · 12 comments
Labels: crawl (Perform crawl or crawl feature-related)

Comments

@SebastianZimmeck (Member)

@franciscawijaya will perform the crawl (with possible help from @katehausladen).

@franciscawijaya (Member)

The crawler is in progress right now. Things for me to note:

  • Routinely check the crawler to make sure everything is going well and it does not stop halfway (if it does, restart it from the point where it stopped; see the sketch below)
  • Familiarize myself with the Google Colab
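
For the restart point, a minimal sketch of one way to resume a batch, assuming a hypothetical `crawled.log` progress file and `crawl_site()` helper (neither is part of the actual crawler):

```python
# Minimal sketch of resuming a crawl batch from where it stopped.
# The file names (sites.csv, crawled.log) and crawl_site() are
# hypothetical placeholders, not the crawler's actual mechanism.
import csv

def remaining_sites(site_list="sites.csv", progress_log="crawled.log"):
    with open(progress_log) as f:
        done = {line.strip() for line in f if line.strip()}
    with open(site_list, newline="") as f:
        sites = [row[0] for row in csv.reader(f) if row]
    return [s for s in sites if s not in done]

# for site in remaining_sites():
#     crawl_site(site)                      # hypothetical crawl call
#     with open("crawled.log", "a") as f:   # record progress for restarts
#         f.write(site + "\n")
```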

@franciscawijaya (Member)

As per our conversation on the call, I have compared the results for some sites between the June crawl pt. 1 and the April crawl pt. 1, and they look similar; the crawl is looking great so far! To track progress: I'm currently on the third set, and in the meantime I'm reformatting all the JSON files to be readable and reading through the Google Colab.
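
For the JSON reformatting, a small sketch of re-serializing each file with indentation; the `crawl-data` directory name is a placeholder for wherever the crawl output lives:

```python
# Sketch: make crawl JSON output human-readable by re-serializing
# each file in place with indentation.
import json
from pathlib import Path

for path in Path("crawl-data").glob("*.json"):  # hypothetical location
    with path.open() as f:
        data = json.load(f)
    with path.open("w") as f:
        json.dump(data, f, indent=2)  # pretty-print in place
```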

@SebastianZimmeck (Member, Author)

Nice, @franciscawijaya!

@SebastianZimmeck (Member, Author)

@franciscawijaya, can you add the following URL to the end of the crawl list and include it in your crawl going forward?

https://www.washingtonpost.com/

(cc'ing @AramZS)

@franciscawijaya (Member)

Added to the 8th batch!

@franciscawijaya (Member)

The last batch of the crawl is now done and everything looks great so far!

Next step: I will now begin parsing and analyzing the crawl data, which will be finished by our Thursday meeting at the latest.

@SebastianZimmeck (Member, Author)

Excellent! Great news!

@franciscawijaya (Member)

Update: I have transferred all of the crawl data to the Google Drive and, now that we have all the data, am starting to collate redo_sites.csv using the Google Colab. However, I'm currently facing an error when running one of the lines of code and have been struggling to figure out where it went wrong. I have reached out to labmates for input and will continue debugging.

@franciscawijaya (Member)

Solved! I'm now running the redo sites (the Google Colab collated 720 sites without subdomains to be crawled).
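
For reference, a rough sketch of the kind of subdomain filtering described, using the tldextract package; the file names and input format here are assumptions, since the actual collation lives in the Colab notebook:

```python
# Sketch: keep only redo candidates that are bare registered domains
# (no subdomain). Input/output file names are hypothetical.
import csv
import tldextract  # pip install tldextract

with open("redo_candidates.csv", newline="") as f:  # hypothetical input
    candidates = [row[0] for row in csv.reader(f) if row]

redo_sites = []
for url in candidates:
    parts = tldextract.extract(url)
    # An empty (or "www") subdomain means the URL is a bare registered domain.
    if parts.subdomain in ("", "www"):
        redo_sites.append(url)

with open("redo_sites.csv", "w", newline="") as f:
    csv.writer(f).writerows([s] for s in redo_sites)
```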

@franciscawijaya (Member)

An update: I have crawled the redo sites and also tried to run the well-known script. However, I ran into a problem where only well-known-data.csv was updated, not well-known-errors.csv, once the code fully ran. Nevertheless, I think I found the problem and am now re-running the script. Hopefully, I will have both well-known-data.csv and well-known-errors.csv by tomorrow morning and can then start parsing and analyzing to get all the figures.
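
For context, a hedged sketch of what a well-known check does: request each site's /.well-known/gpc.json (the resource defined by the GPC spec) and split successes and failures into the two CSVs named above. The sites list and column layout here are assumptions, not the script's actual implementation:

```python
# Sketch: probe each site's GPC well-known resource and write
# successes to well-known-data.csv, failures to well-known-errors.csv.
import csv
import requests

sites = ["https://example.com"]  # placeholder for the crawl list

with open("well-known-data.csv", "w", newline="") as data_f, \
     open("well-known-errors.csv", "w", newline="") as err_f:
    data_w, err_w = csv.writer(data_f), csv.writer(err_f)
    data_w.writerow(["site", "gpc_json"])
    err_w.writerow(["site", "error"])
    for site in sites:
        try:
            r = requests.get(site + "/.well-known/gpc.json", timeout=10)
            r.raise_for_status()
            data_w.writerow([site, r.text.strip()])
        except requests.RequestException as e:
            err_w.writerow([site, str(e)])  # failures go to the errors file
```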

@SebastianZimmeck (Member, Author) commented Jun 20, 2024

As discussed, @franciscawijaya if you can do the following:

  • Create/update crawl release
  • Update crawl documentation as necessary

Then, feel free to close this issue.

@franciscawijaya (Member)

As mentioned in the meeting, while I am done with the crawl, I am still working out the parsing and analysis of the data and the creation of the figures, which will be my task for this week. I will close this issue now since the crawl is done, and I will open a new issue in gpc-web-crawler-paper to post the figures and data once I finish the analysis and figures.
