Signals Matter: Understanding Popularity and Impact of Users on Stack Overflow
- Arpit Merchant
- Daksh Shah
- Gurpreet Singh Bhatia
- Anurag Ghosh
- Ponnurangam Kumaraguru
This repository contains instructions for obtaining the data and a reference implementation of the experiments described in the paper "Signals Matter: Understanding Popularity and Impact of Users on Stack Overflow", accepted at WebConf (formerly WWW) 2019.
Situating our work in Digital Signaling Theory, we investigate the role of Stack Overflow's game elements, such as badges and reputation scores, in characterizing social qualities (specifically, popularity and impact) of its users. We present evidence that certain non-trivial badges, reputation scores, and the age of the user on the site positively correlate with popularity and impact. Further, we find that the presence of costly-to-earn and hard-to-observe signals qualitatively differentiates highly impactful users from highly popular users.
Our implementation works with Python>=3.5.2. Install other dependencies:
$ pip install -r requirements.txt
To install xgboost:
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost && mkdir build && cd build
cmake .. -DPLUGIN_UPDATER_GPU=ON && make -j4
cd ../python-package && python3 setup.py install
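After installation, it can help to confirm that the interpreter meets the stated minimum version and that xgboost is importable before starting a long run. The following is a minimal sketch using only the standard library; the helper names `python_ok` and `missing_packages` are illustrative and are not part of this repository.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 5, 2)  # minimum version stated in this README


def python_ok(version_info=sys.version_info):
    """Return True if the interpreter is at least MIN_PYTHON."""
    return tuple(version_info[:3]) >= MIN_PYTHON


def missing_packages(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    print("Missing packages:", missing_packages(["xgboost"]))
```

An empty list of missing packages suggests the build and install steps above completed; it does not verify that GPU support was compiled in.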
There are three different ways to access the Stack Overflow data we use in our experiments.
- Stack Exchange data dump at archive.org - Anonymized dump of all user-contributed content for all Stack Exchange sites.
- Stack Exchange Data Explorer - Official online database of all Stack Exchange sites.
- Google BigQuery public datasets - Public access to the Stack Overflow dataset with 1 TB per month worth of queries for free (terms and conditions applied).
This repository contains SQL queries that can be run on Google BigQuery to download the data needed for the experiments. The query load falls within the free monthly limit per user.
The fully processed data can also be downloaded from Precog Lab's website.
If you downloaded the data using Google BigQuery, use the methods in code/preprocessing.py to preprocess it. If you downloaded the processed data from Precog's website, place augmented_small_df.csv into the
To run the regression model:
python regression_model.py [predict_feature] [model_name] [num_runs] [num_threads]
- predict_feature - Choose between "popularity" and "impact".
- model_name - Choose between "base", "reputation" and "badges".
- num_runs - Number of runs of the experiment.
- num_threads - Number of threads for parallelization.
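The four positional arguments above can be validated before any work starts. The sketch below is an assumption about how such a command line could be parsed with `argparse`; it is not taken from regression_model.py, and the function name `build_parser` is hypothetical.

```python
import argparse


def build_parser():
    """Build a parser mirroring the four positional arguments described above."""
    parser = argparse.ArgumentParser(prog="regression_model.py")
    parser.add_argument("predict_feature", choices=["popularity", "impact"])
    parser.add_argument("model_name", choices=["base", "reputation", "badges"])
    parser.add_argument("num_runs", type=int)
    parser.add_argument("num_threads", type=int)
    return parser


if __name__ == "__main__":
    # Example values only; the run and thread counts here are arbitrary.
    args = build_parser().parse_args(["impact", "badges", "10", "4"])
    print(args.predict_feature, args.model_name, args.num_runs, args.num_threads)
```

With a parser like this, an unknown feature or model name is rejected with a usage message instead of failing partway through an experiment.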
We thank Kushagra Bhargava, Shubham Singh, and Shwetanshu Singh for their help in using and maintaining system infrastructures.