RealisticTabularDataSets

Some realistic tabular datasets for testing (CSV)

The datasets are gzipped. On Linux and macOS, you can decompress them with the gunzip program; macOS users can also double-click the files. Windows users can use 7-Zip.
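The files can also be read without decompressing them first. The following is a minimal sketch using only the Python standard library. The helper name `read_rows` and the UTF-8 default encoding are assumptions made for illustration; some files may need a different encoding.

```python
import csv
import gzip
from itertools import islice


def read_rows(path, limit=None, encoding="utf-8"):
    """Yield rows (lists of strings) from a gzipped CSV file.

    `read_rows` is a hypothetical helper, not part of this repository.
    The default encoding is an assumption; pass another one if needed.
    """
    # "rt" opens the compressed stream in text mode;
    # newline="" is what the csv module expects from its input.
    with gzip.open(path, "rt", newline="", encoding=encoding) as handle:
        # islice(..., None) yields every row; a number stops early.
        yield from islice(csv.reader(handle), limit)


# Example (assumes the file was downloaded to the current directory):
#     for row in read_rows("census-income.data.gz", limit=5):
#         print(row)
```

Passing `limit` is handy for peeking at the first few records of the larger files.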

These data sets have been used in several academic papers.

Census-Income

File: census-income.data.gz 5.7MB

Census-Income is a relatively small data set: 100 MB and 199 523 records. However, it has 42 columns, and one column has a very high relative cardinality (99 800 distinct values, or roughly one distinct value for every two records).

We include a subset (census-income.data.d241850.csv.gz) made of 4 columns: age, wage per hour, dividends from stocks and a numerical value found in the 25th position of the original data set. The respective cardinalities are 91, 1 240, 1 478 and 99 800.
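Column cardinalities like these can be recomputed by counting the distinct values in each column. Here is a minimal sketch; the function name `column_cardinalities` is hypothetical, and values are compared as raw strings, so the counts may differ slightly from the published ones if the files need extra parsing (for example, whitespace trimming).

```python
def column_cardinalities(rows):
    """Return the number of distinct values in each column.

    `rows` is any iterable of sequences, such as the rows produced
    by csv.reader. Values are compared as-is (an assumption).
    """
    distinct = []  # one set of seen values per column
    for row in rows:
        # Grow the list of sets if this row is wider than earlier ones.
        while len(distinct) < len(row):
            distinct.append(set())
        for index, value in enumerate(row):
            distinct[index].add(value)
    return [len(values) for values in distinct]
```

The function makes a single pass over the rows, so it also works on a streaming reader, at the cost of keeping every distinct value in memory.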

Source:

Frank, A. and Asuncion, A. 2010. UCI machine learning repository. http://archive.ics.uci.edu/ml

Census 1881

File: census1881.csv.gz 33MB

Census 1881 comes from the Canadian census of 1881 and has over 4 million records. We converted a publicly available SPSS file (1881 sept2008 SPSS.rar) to a flat file. In the process, we replaced the special values “ditto” and “do.” by the repeated value, and we deleted all commas within values. The column cardinalities are 183, 2 127, 2 795, 8 837, 24 278, 152 365 and 152 882.
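The cleaning step described above might look like the sketch below. This is not the script that produced the file. It assumes that the "repeated value" is the value in the same column of the previous record, and that the markers are matched case-insensitively; `clean_rows` and `DITTO_MARKERS` are hypothetical names.

```python
DITTO_MARKERS = {"ditto", "do."}  # special values named in the text


def clean_rows(rows):
    """Yield rows with ditto markers resolved and commas removed.

    Assumption: a ditto marker repeats the value found in the same
    column of the previous (already cleaned) row.
    """
    previous = []
    for row in rows:
        cleaned = []
        for index, value in enumerate(row):
            is_ditto = value.strip().lower() in DITTO_MARKERS
            if is_ditto and index < len(previous):
                value = previous[index]
            # Delete all commas inside the value.
            cleaned.append(value.replace(",", ""))
        previous = cleaned
        yield cleaned
```

Because each row is resolved against the cleaned previous row, a run of consecutive ditto markers all inherit the last real value.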

Source:

Lemire D, Kaser O, Gutarra E. Reordering rows for better compression: Beyond the lexicographical order. ACM Transactions on Database Systems 2012; 37(3), doi:10.1145/2338626.2338633.

Weather

File: weather_sept_85.csv.gz 15MB

It consists of surface synoptic weather reports from land stations for September 1985.

Source:

Frank, A. and Asuncion, A. 2010. UCI machine learning repository. http://archive.ics.uci.edu/ml
Hahn, C., Warren, S., and London, J. 2004. Edited synoptic cloud reports from ships and land stations over the globe, 1982–1991. http://cdiac.ornl.gov/ftp/ndp026b/

Wikileaks

File: wikileaks-noquotes.csv.gz 5.9MB

The Wikileaks table was created from a public repository published by Google; it contains the non-classified metadata of leaked diplomatic cables. We extracted 4 columns: year, time, place and descriptive code. The table has 1 178 559 records, and its column cardinalities are 273, 1 440, 3 935 and 4 865.

Source:

Lemire D, Kaser O, Gutarra E. Reordering rows for better compression: Beyond the lexicographical order. ACM Transactions on Database Systems 2012; 37(3), doi:10.1145/2338626.2338633.

Sorted versions

File: census-income_srt.csv.gz

File: wikileaks-noquotes_srt.csv.gz

File: weather_sept_85_srt.csv.gz

File: census1881_srt.csv.gz

We sorted the tables lexicographically, with the smallest cardinality column being the primary sort key, the next-smallest cardinality column being the secondary sort key, and so forth.
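As an illustration of this ordering, the sketch below sorts an in-memory table with the columns ranked by ascending cardinality. It is not the code used to produce the sorted files. It assumes that values compare as strings, that all rows have the same width, and that ties between columns of equal cardinality are broken by column position; `sort_by_cardinality` is a hypothetical name.

```python
def sort_by_cardinality(rows):
    """Return the rows sorted lexicographically, using the column with
    the fewest distinct values as the primary key, the next one as the
    secondary key, and so on. Column positions are left unchanged."""
    rows = [list(row) for row in rows]
    if not rows:
        return rows
    width = len(rows[0])
    # Number of distinct values in each column.
    cardinality = [len({row[i] for row in rows}) for i in range(width)]
    # Column indexes from lowest to highest cardinality (stable on ties).
    key_order = sorted(range(width), key=lambda i: cardinality[i])
    return sorted(rows, key=lambda row: tuple(row[i] for i in key_order))
```

This version loads the whole table into memory, which is fine for the smaller files; the larger ones may call for an external sort.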

References:

Lemire, D., Kaser, O., Kurz, N., Deri, L., O'Hara, C., Saint‐Jacques, F. and Ssi‐Yan‐Kai, G., 2018. Roaring bitmaps: Implementation of an optimized software library. Software: Practice and Experience, 48(4).
Lemire, D., Ssi‐Yan‐Kai, G. and Kaser, O., 2016. Consistently faster and smaller compressed bitmaps with roaring. Software: Practice and Experience, 46(11), pp.1547-1569.
Lemire D, Kaser O, Aouiche K. Sorting improves word-aligned bitmap indexes. Data & Knowledge Engineering 2010; 69(1):3–28, doi:10.1016/j.datak.2009.08.006.
Lemire D, Kaser O, Gutarra E. Reordering rows for better compression: Beyond the lexicographical order. ACM Transactions on Database Systems 2012; 37(3), doi:10.1145/2338626.2338633.

More data

If you only need small tabular datasets for machine learning, there are good choices elsewhere, such as the Adult data set.

The Web Table Corpora is interesting.

See Big Data And AI: 30 Amazing (And Free) Public Data Sources For 2018.
