A web scraper that collects all project submissions on DevPost.
devpost-scraper

Getting Started

Set Up

Create two new folders that will serve as output targets for the scrapers:

$ mkdir data && mkdir content

Projects

To start, we scrape high-level data about all projects with the following request:

curl -i -X GET "https://devpost.com/software/search?query=&page=0"

This simple request returns page 0, i.e. the first page, of projects with an empty search query (i.e. searching ALL projects). The results are sorted by date posted, descending.

DevPost inspects the client's User-Agent, recognizes that the request is NOT coming from a browser, and returns a JSON object listing all of the projects on the given page. We can iterate through the pages until the returned array is empty (i.e. the page doesn't exist).

An example:

{
  "software": [
    {
      "class_name": "Software",
      "name": "PostDev",
      "tagline": "Analyze DevPost - by hackers for hackers",
      "slug": "postdev-k286c0",
      "url": "https://devpost.com/software/postdev-k286c0",
      "members": [
        "jayrav13",
        "nsamarin",
        "otmichael",
        "snowiswhite"
      ],
      "tags": [
        "python",
        "flask",
        "react",
        "redux",
        "mdl",
        "markovify",
        "machine-learning",
        "ibm-watson",
        "amazon-ec2"
      ],
      "winner": false,
      "photo": null,
      "has_video": false,
      "like_count": 2,
      "comment_count": 0
    }
  ],
  "total_count": 1
}
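The pagination loop described above can be sketched as follows. This is an illustrative sketch, not the actual data.py: the function names are mine, and the custom User-Agent string is an arbitrary choice (any non-browser UA should trigger the JSON response).

```python
import json
import urllib.request

# Zero-indexed search pages with an empty query, as described above.
BASE = "https://devpost.com/software/search?query=&page={}"

def page_url(page):
    """Build the search URL for a given zero-indexed page."""
    return BASE.format(page)

def fetch_page(page):
    """Fetch one page of results; DevPost returns JSON for non-browser UAs."""
    req = urllib.request.Request(
        page_url(page),
        headers={"User-Agent": "devpost-scraper"},  # hypothetical UA string
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def fetch_all_pages():
    """Yield (page_number, payload) until the `software` array comes back empty."""
    page = 0
    while True:
        payload = fetch_page(page)
        if not payload["software"]:
            break
        yield page, payload
        page += 1
```

Each yielded payload has the shape shown in the example JSON, so writing one file per page (as data.py does) is just a matter of dumping each payload to `data/`.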

data.py

Executing data.py from the root of the project retrieves all of these pages for you and saves them in data/ (each page gets its own file).

content.py

Now we want to grab the hacker-written description of each of these projects. To do so, execute content.py. It collects every project listed in the files in data/, fetches the full HTML page for each project, and saves one file per project to content/ in the following format:

{
  "url": "https://devpost.com/software/trip-py-planner",
  "content": "..."
}

Here, the "content" key's value is the HTML of the project page at the time of the request. I stored the raw HTML so I could defer the lxml parsing until later and keep the page's metadata around for future use.

The biggest challenge was making the 65,000+ requests in an acceptable amount of time. My solution came from a StackOverflow response that provided direction on how to make a large number of concurrent requests in Python.

NOTE: Be wary of how many threads you make available to this program in content.py. I used 8 threads to successfully complete this in a few hours. I also learned the hard way that using 100 threads is a bad idea.
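One way the threaded approach might look, using a thread pool from the standard library. This is a sketch under my own assumptions, not the actual content.py: the helper names and the slug-to-filename scheme are illustrative, and the pool defaults to the 8 workers mentioned above.

```python
import json
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def output_path(slug, target_dir="content"):
    """Map a project slug to the JSON file it will be saved under."""
    return os.path.join(target_dir, slug + ".json")

def save_project(url, target_dir="content"):
    """Download one project page and store its raw HTML alongside its URL."""
    req = urllib.request.Request(url, headers={"User-Agent": "devpost-scraper"})
    with urllib.request.urlopen(req) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    slug = url.rstrip("/").rsplit("/", 1)[-1]
    with open(output_path(slug, target_dir), "w") as f:
        json.dump({"url": url, "content": html}, f)

def save_all(urls, workers=8):
    """Fetch pages concurrently; keep the pool small to avoid hammering the site."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and surfaces any exceptions from workers.
        list(pool.map(save_project, urls))
```

A small, fixed-size pool bounds both your memory use and the request rate — which is exactly why 8 workers finished comfortably while 100 caused trouble.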

consolidate.py

Executing the last file, consolidate.py, combines data from both data/ and content/ to generate a final file, data.json, at the root of the project. The JSON structure is the SAME as what we saw before, except with a new key: description.
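The consolidation step amounts to joining the two directories on the project URL. A minimal sketch of that join, assuming the file layouts described above (the real consolidate.py presumably extracts the description with lxml; the naive tag-stripping here is just a stand-in):

```python
import glob
import json
import re

def extract_description(page_html):
    """Rough stand-in: strip tags from the stored HTML.
    The real code likely uses lxml with a proper selector instead."""
    return re.sub(r"<[^>]+>", " ", page_html)

def merge_record(project, description):
    """Attach the hacker-written description to the high-level project record."""
    merged = dict(project)
    merged["description"] = description
    return merged

def consolidate(data_dir="data", content_dir="content", out_file="data.json"):
    """Join data/ and content/ on the project URL; write one data.json."""
    pages = {}
    for path in glob.glob(os.path.join(content_dir, "*.json")):
        with open(path) as f:
            rec = json.load(f)
        pages[rec["url"]] = rec["content"]

    merged = []
    for path in glob.glob(os.path.join(data_dir, "*.json")):
        with open(path) as f:
            payload = json.load(f)
        for project in payload["software"]:
            desc = extract_description(pages.get(project["url"], ""))
            merged.append(merge_record(project, desc))

    with open(out_file, "w") as f:
        json.dump(merged, f)

import os  # needed by consolidate() above
```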

Helper shell commands

Count the files in a directory (one entry per line, so the count is accurate):

ls -1 | wc -l

List directory size:

du -h -c -l .

...where . is the directory you're looking to calculate size for.