
Getting Started

Set Up

Create two new folders that will serve as targets for files generated by most of these scrapers:

$ mkdir data && mkdir content


To start, we scrape high-level data about all projects with the following request:

curl -i -X GET ""

This simple request returns page 0, or the first page, of projects with an open search query (i.e. searching ALL projects). This data is sorted by date posted, descending.

DevPost inspects the client's User-Agent and, recognizing that this request is NOT coming from a browser, returns a JSON object of all of the projects on that page. We can iterate through the pages until the returned array is empty (i.e. the page doesn't exist).

An example:

  {
    "software": [
      {
        "class_name": "Software",
        "name": "PostDev",
        "tagline": "Analyze DevPost - by hackers for hackers",
        "slug": "postdev-k286c0",
        "url": "",
        "members": [ ... ],
        "tags": [ ... ],
        "winner": false,
        "photo": null,
        "has_video": false,
        "like_count": 2,
        "comment_count": 0
      }
    ],
    "total_count": 1
  }
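Putting this together, the page-by-page collection loop can be sketched as follows. This is a minimal sketch, not the project's actual code: `fetch_page` stands in for the HTTP GET shown above (issued with a non-browser User-Agent), since the endpoint URL is omitted here.

```python
from typing import Callable, Dict, List

def collect_all_pages(fetch_page: Callable[[int], Dict]) -> List[Dict]:
    """Request page 0, 1, 2, ... until the "software" array comes back empty."""
    projects: List[Dict] = []
    page = 0
    while True:
        payload = fetch_page(page)
        batch = payload.get("software", [])
        if not batch:  # empty array => this page doesn't exist; stop
            break
        projects.extend(batch)
        page += 1
    return projects
```

In the real scraper, `fetch_page` would perform the request and parse the JSON response body before returning it.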

Executing from the root of the project will fetch all of these pages for you, saving them in data/ (each page gets its own file).

Now, we want to grab the hacker-written description of each of these projects. To do so, we can execute to begin the scraping process. This script collects all of the projects from the files in data/ and generates one file per project, containing the full HTML page of that project, saved to content/ in the following format:

  {
    "url": "",
    "content": "..."
  }

Here, the "content" key's value is the HTML page representing this project at the time of the request. I stored the raw HTML so I could deal with the lxml scraping later, while still keeping the HTML metadata around.
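A sketch of how each per-project file in content/ might be written. The `save_project_page` helper and the slug-based filename are assumptions for illustration, not the author's actual code:

```python
import json
from pathlib import Path

def save_project_page(slug: str, url: str, html: str, out_dir: str = "content") -> Path:
    """Write {"url": ..., "content": ...} to <out_dir>/<slug>.json and return the path."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)  # create content/ on first use
    path = out / f"{slug}.json"
    path.write_text(json.dumps({"url": url, "content": html}), encoding="utf-8")
    return path
```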

The biggest challenge: making the 65,000+ requests in an acceptable amount of time. My solution came from a StackOverflow response that provided direction on how to make a large number of requests in Python.

NOTE: Be wary of how many threads you make available to this program. I used 8 threads to successfully complete this in a few hours. I also learned the hard way that using 100 threads is a bad idea.
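One common way to fan the requests out over a bounded pool of threads is `concurrent.futures`. This is a sketch of the general technique, not the exact recipe from the StackOverflow answer; `fetch` stands in for whatever function performs a single request.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

def fetch_all(urls: Iterable[str], fetch: Callable[[str], str], max_workers: int = 8) -> List[str]:
    """Fetch every URL using a fixed pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves input order and blocks until all results are in
        return list(pool.map(fetch, urls))
```

Raising `max_workers` increases concurrency but also the load on the remote server, which is why a small pool (like 8) is the safer choice here.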

Execute the last file, which combines data from both data/ and content/ to generate a final file, data.json, at the root of the project. The JSON structure is the SAME as what we saw before, except with one new key: description.
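The merge step can be sketched like this, under the assumption that the descriptions have already been extracted from the saved HTML and keyed by project URL (the `merge_projects` helper is illustrative, not the original code):

```python
from typing import Dict, List

def merge_projects(projects: List[Dict], descriptions_by_url: Dict[str, str]) -> List[Dict]:
    """Attach a "description" key to each project record, matched by URL."""
    merged = []
    for project in projects:
        record = dict(project)  # copy so the input records stay untouched
        record["description"] = descriptions_by_url.get(project["url"], "")
        merged.append(record)
    return merged
```

Dumping the merged list with `json.dump` would then produce the final data.json at the project root.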

Helper shell commands

List number of files:

ls -1 | wc -l

List directory size:

du -h -c . -l

...where . is the directory you're looking to calculate size for.


A web scraper that collects all project submissions on DevPost.





