Skip to content
No description, website, or topics provided.
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
code
data
.gitignore
LICENSE.md
README.md
requirements.txt

README.md

Signals Matter: Understanding Popularity and Impact of Users on Stack Overflow


Authors:

  • Arpit Merchant
  • Daksh Shah
  • Gurpreet Singh Bhatia
  • Anurag Ghosh
  • Ponnurangam Kumaraguru

Overview:

This repository contains information on obtaining the data and a reference implementation of the experiments described in the paper Signals Matter: Understanding Popularity and Impact of Users on Stack Overflow accepted at WebConf (formerly WWW) 2019.

Situating our work in Digital Signaling Theory, we investigate the role of these game elements in characterizing social qualities (specifically, popularity and impact) of its users. We present evidence that certain non-trivial badges, reputation scores and age of the user on the site positively correlate with popularity and impact. Further, we find that the presence of costly to earn and hard to observe signals qualitatively differentiates highly impactful users from highly popular users.

Dependencies

Our implementation works with Python>=3.5.2. Install other dependencies: $ pip install -r requirements.txt

To install xgboost:

git clone --recursive https://github.com/dmlc/xgboost
cd xgboost && mkdir build && cd build
cmake .. -DPLUGIN_UPDATER_GPU=ON && make -j4
cd ../python-package && python3 setup.py install

Data

There are three different ways to access the Stack Overflow data we use in our experiments.

  1. Stack Exchange data dump at archive.org - Anonymized dump of all user-contributed content for all Stack Exchange sites.
  2. Stack Exchange Data Explorer - Official online database of all Stack Exchange sites.
  3. Google BigQuery public datasets - Public access to the Stack Overflow dataset with 1 TB per month worth of queries for free (terms and conditions applied).

This repository contains SQL queries that can be run on Google Big Query to download the data needed for the experiments. The query load falls within the free monthly user limit.

The fully processed data can also be downloaded from Precog Lab's website.

Usage

If you download the data using Google BigQuery, use the methods in code/preprocessing.py to preprocess it. If you downloaded the processed data from Precog's website, place augmented_combined_df.csv and augmented_small_df.csv into the data folder.

To run the regression model:

python regression_model.py [predict_feature] [model_name] [num_runs] [num_threads]

predict_feature - Choose between "popularity" or "impact". model_name - Choose between "base", "reputation" and "badges". num_runs - Number of runs of the experiment. num_threads - Number of threads for parallelization

Acknowledgment

We thank Kushagra Bhargava, Shubham Singh, and Shwetanshu Singh for their help in using and maintaining system infrastructures.

You can’t perform that action at this time.