CodingContestsCrawler

Scraper for popular coding contests and programming challenges. Currently it only scrapes active contests.

```python
[
  {
    'event_title': '',
    'event_info': '',               # optional
    'events': [
      {
        'duration': '',
        'name': '',
        'startDateTime': '',
        'registrationDeadline': '',
        'url': '',
        'prerequisite': '',         # optional
        'addtionalInfo': ''         # optional
      }
    ]
  }
]
```

List of coding contests scraped and Python crawler libraries used:

  • Code Forces: CodeForces presents all of its contest info on a static HTML page. We only need to parse the table that stores the contest info.
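
    Parsing a static contest table boils down to walking its rows. A minimal sketch, using the stdlib `xml.etree.ElementTree` on a hypothetical table snippet for illustration (the project itself uses requests + lxml, and CodeForces's real markup differs):

    ```python
    import xml.etree.ElementTree as ET

    # Hypothetical sample of the kind of static table CodeForces serves;
    # the real page uses different markup and column layout.
    SAMPLE_HTML = """
    <table>
      <tr><th>Name</th><th>Start</th></tr>
      <tr><td>Codeforces Round 501</td><td>Jul 18, 2018 17:35</td></tr>
      <tr><td>Codeforces Round 502</td><td>Jul 22, 2018 17:35</td></tr>
    </table>
    """

    def parse_contest_table(html):
        """Parse contest rows out of a static HTML contest table."""
        root = ET.fromstring(html)
        contests = []
        for row in root.findall("tr")[1:]:  # skip the header row
            name, start = (td.text for td in row.findall("td"))
            contests.append({"name": name, "startDateTime": start})
        return contests
    ```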

  • Google Code Jam: The Google Code Jam web page is rendered with dynamic JavaScript, meaning that requesting the contest schedule base URL directly returns the uncompiled JS script instead of the rendered HTML we want. Luckily, we can cheat by requesting the base URL with an additional data=1 token, which makes the server return their event info as JSON directly, which is exactly what we want.

    • type: coding contest
    • requests, re, XPath(lxml)
    • url: https://code.google.com/codejam/schedule, https://code.google.com/codejam/kickstart/schedule
    • However, Google Code Jam has more than one contest event. The entire Google Code Jam family consists of Google Code Jam, Distributed Code Jam 2018, Google Code Jam Kickstart, Google Code Jam I/O for Women, and other contests, and their schedules are likely stored on different webpages. With the URLs listed above, we were only able to scrape Google Code Jam, Distributed Code Jam, and Google Code Jam Kickstart. Also, the website is likely to change each year, so we should use a monitor to watch for possible changes.
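
    The data=1 trick can be sketched as a query-string tweak plus plain JSON parsing. The field names in the sample response below are hypothetical; the real endpoint's schema may differ:

    ```python
    import json
    from urllib.parse import urlencode

    GCJ_SCHEDULE_URLS = [
        "https://code.google.com/codejam/schedule",
        "https://code.google.com/codejam/kickstart/schedule",
    ]

    def schedule_json_url(base_url):
        """Append the data=1 token so the server answers with the event-info
        JSON instead of the pre-compiled JS shell."""
        return base_url + "?" + urlencode({"data": 1})

    # Hypothetical shape of the JSON the endpoint returns; the real field
    # names may differ.
    SAMPLE_RESPONSE = (
        '{"events": [{"name": "Qualification Round",'
        ' "start": "2018-04-06 23:00 UTC"}]}'
    )

    def parse_schedule(raw_json):
        """Turn the raw JSON payload into (name, start) tuples."""
        data = json.loads(raw_json)
        return [(event["name"], event["start"]) for event in data["events"]]
    ```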
  • LeetCode Weekly Contest: I tried to request their GraphQL API, but it seems that x-csrftoken (or other possibly incorrect information in the header) was causing trouble. So I'm temporarily using Selenium to fetch the data, although it's facing performance bottlenecks.

    • type: coding contest
    • Selenium, re
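
    A minimal sketch of the Selenium fallback: render the JS-heavy page in a real browser, then pull contest titles out of the text with a regex. The title pattern and the `Firefox` driver choice are assumptions, not the project's actual code:

    ```python
    import re

    # Hypothetical title pattern; adjust to whatever the contest list shows.
    CONTEST_RE = re.compile(r"Weekly Contest \d+")

    def extract_contest_titles(page_text):
        """Pull contest titles out of already-rendered page text."""
        return CONTEST_RE.findall(page_text)

    def fetch_rendered_text(url):
        """Render the JS-heavy page with Selenium; this is the slow part
        behind the performance bottleneck. Requires a browser driver."""
        from selenium import webdriver  # lazy import: heavy optional dependency
        driver = webdriver.Firefox()
        try:
            driver.get(url)
            return driver.find_element_by_tag_name("body").text
        finally:
            driver.quit()
    ```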
  • AtCoder: Contest info is stored in static HTML. Easy peasy.

    • ...and I lied. I experimented with lxml (a.k.a. XPath). It was torture, although on performance it beats all the other selectors (pyquery, bs4, etc.). So far I prefer pyquery the most.
    • type: coding contest
    • requests, XPATH(lxml)
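
    For a dependency-free illustration of the XPath approach, here is a sketch using the stdlib `xml.etree.ElementTree`, which supports a small subset of XPath (the project uses lxml, which supports full XPath 1.0; the table markup below is hypothetical, not AtCoder's real HTML):

    ```python
    import xml.etree.ElementTree as ET

    # Hypothetical snippet of a static contest table; the real markup differs.
    SAMPLE = """
    <div>
      <table id="contests">
        <tr><td><a href="/contests/abc101">AtCoder Beginner Contest 101</a></td></tr>
        <tr><td><a href="/contests/agc026">AtCoder Grand Contest 026</a></td></tr>
      </table>
    </div>
    """

    root = ET.fromstring(SAMPLE)
    # lxml would allow a full XPath 1.0 expression such as
    # tree.xpath('//table[@id="contests"]//a/text()'); ElementTree's subset
    # is still enough for simple descendant + attribute queries.
    table = root.find('.//table[@id="contests"]')
    contest_names = [a.text for a in table.findall(".//a")]
    ```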
  • HackerRank: All active contest info can be fetched from a single URL.

    • type: coding contest
    • pyquery, re, requests
  • CodeChef: Static HTML page as well. Pretty standard.

    • type: coding contest
    • pyquery, re, requests
  • Facebook Hacker Cup: All the info is stored in a Facebook group. It's hard to scrape directly, given that the info we need is buried in human-written posts. Probably easier to add manually.

    • type: coding contest
  • TopCoder: This website is written with JS and they do not have a data API we can request JSON from directly. Fortunately they provide an RSS feed, so we only need to request that URL and parse the returned XML.

    • type: programming challenges, coding contest (###TODO)
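
    Parsing an RSS feed is standard XML handling. A sketch with the stdlib `xml.etree.ElementTree`; the feed payload below is hypothetical, and TopCoder's actual item fields may differ:

    ```python
    import xml.etree.ElementTree as ET

    # Hypothetical RSS payload; the real feed's fields and URLs may differ.
    SAMPLE_RSS = """<?xml version="1.0"?>
    <rss version="2.0">
      <channel>
        <title>TopCoder Challenges</title>
        <item>
          <title>Algorithm SRM 735</title>
          <link>https://www.topcoder.com/challenges/1</link>
          <pubDate>Wed, 11 Jul 2018 12:00:00 GMT</pubDate>
        </item>
      </channel>
    </rss>
    """

    def parse_rss_items(xml_text):
        """Collect title/url pairs from every <item> in the feed."""
        root = ET.fromstring(xml_text)
        return [
            {"title": item.findtext("title"), "url": item.findtext("link")}
            for item in root.iter("item")
        ]
    ```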
  • HackerEarth: Found their backend API URL, but the long tokens seem suspicious and might change later.

    • type: programming challenges
  • Kaggle: Dynamic website with no data API. We should probably load the HTML with Selenium and scrape the tables afterwards.

    • type: programming challenges, data science
  • Bounty One ###TODO

Setup

```
pip install -r requirements.txt
```