Skip to content

janhuenermann/stackexchange-data-utils

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

70 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

stackexchange-dump
==================

Scripts to deal with gathering, cleaning and loading of stackexchange data

Using the scripts
./download.sh
python ./ingest.py stackexchange/ db.sqlite --ignore-meta
python ./tidy.py db.sqlite
python ./export.py db.sqlite chunks/

Stack Exchange dataset
- Overview of dataset https://archive.org/details/stackexchange
- Download links https://archive.org/download/stackexchange
- Documentation of dataset https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede
- Size of question/answer/user database 100G
- Size of question-answer pairs as plaintext 88G (includes non-English communities)
- License of dataset CC-BY-SA 4.0