Download all datasets contained in all R-packages #185

Open
giuseppec opened this Issue Mar 17, 2016 · 6 comments

giuseppec commented Mar 17, 2016

We can do something like this (ugly code) and then upload everything:

# install all packages
# pkg = available.packages()
# for(i in 1:nrow(pkg)) install.packages(pkgs = pkg[i,"Package"])

# get table of all available data sets from all available packages
rdat = data(package = .packages(all.available = TRUE))
rdat = as.data.frame(rdat$results, stringsAsFactors = FALSE)
rdat$Package.Version = sapply(rdat$Package, function(x) as.character(packageVersion(x)))
# here we remove strange data set names
rdat = rdat[!grepl("\\(|\\)", rdat$Item),]

rdat.unique = unique(rdat[,c("Package", "Package.Version")])
ret = setNames(vector("list", nrow(rdat.unique)), 
  paste0(rdat.unique$Package, "_", rdat.unique$Package.Version))

for (i in seq_len(nrow(rdat.unique))) {
  pkg = as.character(rdat.unique$Package[i])
  # get all data set names from package 'i'
  dat.names = as.character(rdat$Item[rdat$Package == pkg])
  # load into a fresh environment so data sets from different packages cannot clash;
  # try() keeps one broken package from aborting the whole loop
  env = new.env()
  try(data(list = dat.names, package = pkg, envir = env), silent = TRUE)

  # store each data set, or the condition object if it could not be retrieved
  ret[[i]] = setNames(lapply(dat.names, function(dn) {
    tryCatch(get(dn, envir = env, inherits = FALSE),
      error = function(e) e, warning = function(w) w)
  }), dat.names)

  # unload everything except the base packages
  loaded.pkg = setdiff(loadedNamespaces(),
    c("stats", "graphics", "grDevices", "utils", "datasets", "methods", "base", "tools"))
  invisible(lapply(loaded.pkg, function(x) try(unloadNamespace(x), silent = TRUE)))

  cat(pkg, ": all data sets loaded", fill = TRUE)
}

mllg self-assigned this Mar 17, 2016

mllg assigned giuseppec and unassigned mllg Mar 17, 2016

HeidiSeibold commented Apr 29, 2016

I asked on Twitter whether there are ways to do this without having to install the packages.

This is the best answer I got:
https://twitter.com/GaborCsardi/status/725776034910617600

Seems pretty promising 😃

jakobbossek commented Aug 31, 2016

Thanks Heidi!
Tried the approach proposed by Gabor. It is pretty easy to obtain a list of all rda files in the cran GitHub mirror with his gh package and GitHub's fantastic code search API:

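devtools::install_github("gaborcsardi/gh")
library(gh)
library(BBmisc)  # provides catf()
# search the cran GitHub organization for files with the extension rda
repos = gh("GET /search/code?q=user:cran+extension:rda")
# total_count is the number of matching files
catf("#Repos: %i", repos$total_count)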

This way we can download only the rda files, e.g., via repos$items[[i]]$html_url. However, there is no easy way to access the metadata for a data set, e.g., its description, default target feature, citations, etc. One possibility is to download the corresponding Rd docs as well and parse them. Uploading stuff to OpenML without at least a meaningful description seems useless to me.
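
A minimal sketch of how one search hit could be turned into a loaded data set. It assumes that html_url has the usual GitHub "blob" form and that the file is a regular rda file; to_raw_url and fetch_rda are hypothetical helper names, and the example URL is made up for illustration.

```r
# Turn a GitHub "blob" page URL into the matching raw-content URL.
# Assumption: html_url looks like https://github.com/<owner>/<repo>/blob/<ref>/<path>.
to_raw_url = function(html_url) {
  raw = sub("^https://github\\.com/", "https://raw.githubusercontent.com/", html_url)
  sub("/blob/", "/", raw, fixed = TRUE)
}

# Download one rda file and return the objects it contains as a named list.
fetch_rda = function(html_url) {
  tmp = tempfile(fileext = ".rda")
  on.exit(unlink(tmp))
  download.file(to_raw_url(html_url), tmp, mode = "wb", quiet = TRUE)
  env = new.env()
  load(tmp, envir = env)
  mget(ls(env), envir = env)
}

# Hypothetical URL, for illustration only (no request is made here).
example_url = "https://github.com/cran/somepkg/blob/abc123/data/toy.rda"
raw_url = to_raw_url(example_url)
```

With real search results, something like fetch_rda(repos$items[[i]]$html_url) would be the intended call; it needs network access, and a single rda file may contain several objects.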

Another point: why should we avoid having the crawler download all packages? Is it because of time and memory? We can simply download each package, extract the data sets, upload them to OpenML, and remove the package afterwards. The time aspect is unimportant; the crawler does not need to be fast.
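
A rough sketch of that download, extract, remove cycle for a single package. It assumes a source tarball and only looks at .rda/.RData files under data/ (data shipped as .R, .csv or .tab files is ignored, and objects with the same name in different files overwrite each other). extract_datasets is a hypothetical helper name, and "somepkg" in the usage comment is a placeholder.

```r
# Load every .rda/.RData file found in a data/ directory of a package
# source tarball. Returns a named list of the objects; the unpacked
# sources are deleted again when the function exits.
extract_datasets = function(tarball) {
  exdir = tempfile("pkgsrc")
  dir.create(exdir)
  on.exit(unlink(exdir, recursive = TRUE))
  untar(tarball, exdir = exdir)
  files = list.files(exdir, pattern = "\\.(rda|RData)$",
    recursive = TRUE, full.names = TRUE)
  files = files[basename(dirname(files)) == "data"]
  env = new.env()
  for (f in files) {
    # a corrupt file should not stop the remaining files from loading
    tryCatch(load(f, envir = env),
      error = function(e) warning(conditionMessage(e)))
  }
  mget(ls(env), envir = env)
}

# Usage sketch (needs network access and a configured CRAN mirror):
# dl = download.packages("somepkg", destdir = tempdir(), type = "source")
# datasets = extract_datasets(dl[1, 2])  # column 2 holds the file path
# unlink(dl[1, 2])                       # remove the tarball afterwards
```

Nothing is installed this way, so package dependencies never have to be resolved; the upload step to OpenML is left out here.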

jakobbossek commented Aug 31, 2016

Started to work on a crawler which operates on the GitHub cran repositories and reads 1) the data itself and 2) the metadata from the corresponding Rd file. Works well so far. Just need to parallelize it and handle potential errors.
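
The crawler code itself is not shown in this thread. As an illustration only, reading the title and description from an Rd file could look roughly like this with tools::parse_Rd; rd_section is a hypothetical helper, and the Rd content below is a made-up toy example.

```r
library(tools)

# Return the plain text of one top-level Rd section such as "\\title",
# or NA if the section is missing.
rd_section = function(rd, tag) {
  rd = unclass(rd)
  tags = vapply(rd, function(x) attr(x, "Rd_tag"), character(1))
  hit = rd[tags == tag]
  if (length(hit) == 0) return(NA_character_)
  # flatten nested markup into one string and squeeze whitespace
  txt = paste(unlist(hit[[1]]), collapse = "")
  trimws(gsub("\\s+", " ", txt))
}

# Toy Rd file, written to a temporary location for the example.
rd_file = tempfile(fileext = ".Rd")
writeLines(c(
  "\\name{toy}",
  "\\docType{data}",
  "\\title{A toy data set}",
  "\\description{Three rows of made-up numbers.}"
), rd_file)

rd = parse_Rd(rd_file)
meta = list(
  title = rd_section(rd, "\\title"),
  description = rd_section(rd, "\\description")
)
```

Fields such as a default target feature are not part of the Rd format, so they would still have to be guessed or added by hand.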

cvitolo commented May 25, 2017

@jakobbossek I'd love to see the results of your crawler/experiment. Did you publish it?

giuseppec commented Nov 15, 2018
