Skip to content

using timss data to look at data loading speed and file sizes

Notifications You must be signed in to change notification settings

hhsievertsen/data_benchmarking_anecdotal

Repository files navigation

this repository contains the stata and r scripts used for a comparison of file sizes and processing times of different file formats for data storage and processing.

The data used in this comparison is the international Database (grade 8) 2019 TIMSS data which can be downloaded from the TIMSS website here.

  • I first converted the sas-files to a csv file (using R). that led to the file called timss_8_2019_uncompressed.csv.
  • I then converted that file into Stat format using R. That led to the file called timss_8_2019_uncompressed.dta.
  • I then compressed the stata file in Stata with the compress command. That took more than 24 hours and lead to the file called timss_8_2019_compressed.dta. Which was then also saved as a csv file called timss_8_2019_compressed.csv.
  • The other file formats are all created with R based on the timss_8_2019_uncompressed.csv file using default settings.

A few notes:

  • This is "anecdotal" as it is only based on one example.
  • This loads the full datasets. I do not exploit capabilities of the file formats to only load parts of the data and computation speed.

Here is the result

Comparison Result

About

using timss data to look at data loading speed and file sizes

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published