Dealing with larger than memory datasets? #387
Comments
|
Sorry it took me a while to get back to this. I've been thinking about how you would do this. It would be hackable with the current infrastructure under the following circumstances:
In this case, you could farm out appropriate-sized chunks of the data to different servers (with non-shared memory; Hadoop/Spark/etc.). Use the modular structure to set up Z, X, ... structures on each server. To evaluate the likelihood, ship a complete set of theta and beta parameters off to each server, evaluate the likelihood for the subset of the data, return it to the master, and add the log-likelihoods. If this makes sense to you and you want to tackle it, I could give you help if you get stuck. @dmbates , any Julia-oriented thoughts on this one? |
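The "ship parameters out, sum log-likelihoods back" scheme above can be sketched in a few lines. This is a deliberately simplified Python illustration using an i.i.d. Gaussian likelihood — not a mixed-model likelihood, which only factors across chunks if every level of the grouping factor stays within a single chunk:

```python
import math

def chunk_loglik(y_chunk, mu, sigma):
    # Gaussian log-likelihood of one chunk; on a cluster this would run
    # on the worker that holds the chunk, given the shipped parameters.
    n = len(y_chunk)
    const = -0.5 * n * math.log(2 * math.pi * sigma ** 2)
    return const - sum((y - mu) ** 2 for y in y_chunk) / (2 * sigma ** 2)

def total_loglik(chunks, mu, sigma):
    # Master node: send (mu, sigma) to every worker, add the returned values.
    return sum(chunk_loglik(c, mu, sigma) for c in chunks)

# Chunking does not change the total log-likelihood:
data = [0.1, -0.3, 0.5, 1.2, -0.7, 0.0]
full = chunk_loglik(data, 0.0, 1.0)
split = total_loglik([data[:2], data[2:5], data[5:]], 0.0, 1.0)
assert abs(full - split) < 1e-12
```

For a model like `BP ~ Age * Sex + smoke + (1 | ID)`, the data would have to be partitioned on `ID` so that each subject's observations land in the same chunk.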
skanskan commented Jul 11, 2016
|
Hello. But do you mean I could do it with a small hack of lme4, or with scripting? I'm interested in simple formulas (with a normal or binomial link) such as: The bigmemory folks told me that I could also try the function bam from the mgcv package. Regards |
|
It seems to me that it would be rather challenging to get enough data that you could not fit a model like (g)lmer( BP ~ Age * Sex + smoke + ( 1 | ID ) ) in memory. In the formulation in the MixedModels package for Julia the largest component of the model structure would be the model matrix for the fixed effects, which would be n by 3. You would need two copies of it (the original and, for a GLMM, a copy with case weights applied). How large would n be? An n in the tens or even hundreds of millions could be handled on a server without too much of a problem. Do you have a sample data set that we could use to test different methods? Failing that, could you suggest appropriate values of the parameters for simulating such data? |
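To make the sizes above concrete, here is a back-of-envelope calculation (pure arithmetic, not tied to any particular package; it ignores the smaller random-effects and response structures):

```python
def model_matrix_bytes(n_rows, n_cols, copies=2, bytes_per_double=8):
    # Dense double-precision storage for `copies` copies of an
    # n_rows-by-n_cols model matrix.
    return n_rows * n_cols * bytes_per_double * copies

# Two copies of an n-by-3 fixed-effects matrix at n = 100 million rows:
gib = model_matrix_bytes(100_000_000, 3) / 2 ** 30
print(f"{gib:.1f} GiB")  # about 4.5 GiB
```

So even n in the hundreds of millions stays within the RAM of a reasonably sized server, which is the point being made here.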
|
@skanskan I'm not sure I understand what you mean by "a large model (30 GB)". Where does the 30 GB come from? |
skanskan commented Jul 11, 2016
|
I mean that the csv file that contains the data is about 30GB (and my PC has 12GB of RAM) |
|
@skanskan Converting a .csv file to a data frame is a different issue from fitting a model. There are many ways to handle reading a large file of data. The "R Data Input/Output" manual is one place to start, as is the readr package. Can you share the data, or at least the structure of the data? How many rows/columns, what column types, ranges, etc.? |
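The idea of reading a large file without holding it all in memory can be sketched language-agnostically; this Python version uses only the stdlib csv module for illustration (readr, data.table, or a proper database are the practical options mentioned in the thread):

```python
import csv

def read_csv_in_chunks(path, chunk_size=100_000):
    # Yield successive lists of row dicts so the whole file never
    # has to sit in memory at once.
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk
```

Each chunk can be processed (e.g. cross-products accumulated, see below in the thread) and then discarded before the next one is read.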
|
A few quick comments: (1)
(Details will obviously depend on the data types, digits of precision, etc.) If you still want to proceed, I think you could do this without writing any C++ code, but it could be very slow. It wouldn't be too hard to compute the likelihood for individual chunks of data, but the issue is that you would have to repeatedly access the chunks; if you have to keep re-loading and clearing large memory chunks, the whole process will likely be excruciatingly slow (this is why I was suggesting a non-shared-memory solution). This is in contrast to something like linear models, where the solution can be computed in a single chunk-wise pass through the data ... |
|
@bbolker A linear mixed-effects model only needs to pass over the whole data set once. After constructing the "cross-product" matrices of the partitions (random-effects term 1, random-effects term 2, ... , fixed-effects, response) everything can be done with two copies of those cross-product blocks. If there is only one random-effects term the first block is diagonal or homogeneous block-diagonal and the storage overhead is not much more than fitting the fixed-effects only. For a GLMM you need to return to the full data set each time you reweight. |
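The one-pass idea can be illustrated with ordinary least squares, where the cross-products X'X and X'y are sums over rows and therefore accumulate additively over chunks; the mixed-model case generalizes this to the blocked cross-product matrices described above. A sketch, not the MixedModels implementation:

```python
def accumulate_crossproducts(chunks):
    # One pass over the data: X'X and X'y are sums of per-row
    # contributions, so they can be accumulated chunk by chunk.
    # Each chunk is a list of (x_row, y) pairs.
    p = len(chunks[0][0][0])
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for chunk in chunks:          # in practice: read chunk, update, discard
        for x, y in chunk:
            for i in range(p):
                xty[i] += x[i] * y
                for j in range(p):
                    xtx[i][j] += x[i] * x[j]
    return xtx, xty

# With y = 2x exactly, solving the normal equations recovers the slope:
chunks = [[([1.0], 2.0), ([2.0], 4.0)], [([3.0], 6.0)]]
xtx, xty = accumulate_crossproducts(chunks)
assert xty[0] / xtx[0][0] == 2.0
```

After this pass, everything is p-by-p and independent of n, which is why the LMM case needs the full data only once while the GLMM case must return to it at each reweighting.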
|
OK, thanks, I stand corrected. (Student project???) |
skanskan commented Jul 11, 2016
|
@dmbates I know that. The problem is that my data is too big to be loaded directly into a data frame. In fact I don't have just one model to show you now; I'm trying to find a tool for future work in general, as I'm starting my PhD in statistics. Last month I had to analyze a dataset of 300 variables and 325000 individuals, with measures repeated every month for several years. If the dataset is very big it's not easy to deal with: you need to save it in a database or use special tools. Getting more memory is expensive and doesn't solve the problem; it's only a patch that postpones the problem until a bigger one arrives. |
|
If you look at the LinearMixedModel type in the MixedModels package, only a small part of the structure is modified during the iterations. Thus you can evaluate the profiled log-likelihood for a given value of θ without needing to return to the full data frame. |
skanskan commented Jul 11, 2016
|
@bbolker Do you think it's going to be easier to get it done with a classic mixed-effects multichunk approach or with a bayesian approach? |
|
@skanskan It is highly unlikely that you will find an out-of-the-box solution for what you want to do. If you need to work with very large data sets you will need to learn the tools that can handle such data. You should realize that R hits the wall pretty quickly when dealing with large data sets and models of some complexity; this is a consequence of the design of the language. I switched to working in Julia because it offers more flexibility than R in data structures. Many others use Python. There are ways to attack the problem, but none of them are as straightforward as writing some R code. When you say "I had to analyze a dataset of 300 variables and 325000 individuals, with measures repeated every month for several years", it sounds to me as if these data are in the "wide" format (one row for each person, several columns representing measurements). If they are converted to the "long" format (columns for individual, measurement, occasion and any covariates like age, sex, ...), how many observations would there be? That is, how many measurements in total? It is probably best to perform the data manipulation outside of R, perhaps in a database, and then worry about fitting the model. If you want to try Julia and the MixedModels package, I could advise you on how to set up such a model. |
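The wide-to-long reshaping asked about above can be sketched as follows (the column names are hypothetical illustrations; in practice base R's reshape, the tidyr package, or SQL would do this):

```python
def wide_to_long(rows, id_col, occasion_cols):
    # One wide record per person becomes one long record per
    # person-occasion pair.
    long_rows = []
    for row in rows:
        for occ in occasion_cols:
            long_rows.append({"id": row[id_col],
                              "occasion": occ,
                              "value": row[occ]})
    return long_rows

wide = [{"id": 1, "m1": 120, "m2": 118},
        {"id": 2, "m1": 135, "m2": 133}]
long_data = wide_to_long(wide, "id", ["m1", "m2"])
assert len(long_data) == 4  # 2 people x 2 occasions
```

Note that the long table is (people × occasions) rows, which is the n that determines the model-matrix sizes discussed earlier in the thread.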
|
@skanskan If by "a Bayesian approach" you mean MCMC you would need to be prepared to wait a very long time to fit even a simple model to a large data set. |
skanskan commented Jul 11, 2016
|
Yes, I meant MCMC. I know it's slower for simple models.
skanskan commented Jul 11, 2016
|
@dmbates it would be around 325000 individuals × 10 years × 12 months = 39,000,000 observations. Sampling has one drawback: if something happens only rarely you can miss it. For example, if my model is logistic and y=0 occurs 1,000,000 times but y=1 occurs only 200 times in my data, a sample could contain only cases with y=0. One solution I have found so far is to increase the size of my pagefile (on Windows) to a fixed 200 GB on a fast SSD. It is quite slow, though, because it is not optimized for that. I know there are other tools able to work with very big datasets. One free tool is ROOT (made by CERN), but they don't have any package for mixed effects. |
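One standard mitigation for the rare-event worry raised here is stratified subsampling: keep every y=1 row and thin only the y=0 rows. A minimal sketch (the field names are hypothetical):

```python
import random

def stratified_sample(rows, label_key, rare_label, majority_frac, seed=0):
    # Keep every rare-class row; keep each majority-class row with
    # probability majority_frac.
    rng = random.Random(seed)
    return [r for r in rows
            if r[label_key] == rare_label or rng.random() < majority_frac]
```

A standard result for case-control style sampling in logistic regression is that the slope estimates remain consistent under this scheme; only the intercept is shifted by the log of the sampling-rate ratio, which can be corrected with an offset or case weights.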
|
Sounds like it will be a lot of work for you no matter what approach you take. Modeling very rare events is difficult. If all but a few cases have y = 0 then you will likely end up with complete separation for some combination of the covariates. You will see the intercept getting large and negative while a few coefficients become large and positive. Using Windows will also make things difficult. It doesn't scale well to large data sets. |
skanskan commented Jul 11, 2016
|
@dmbates I'm using Windows, but I don't mind moving to Linux when needed. |
|
skanskan commented Sep 30, 2016
|
Hello again. Anyway, I think the future depends on libraries such as BLAS or MKL being adapted to work with very large, possibly distributed, datasets. Or we will need to wait until some common database software includes mixed-effects models among its commands. |
skanskan commented Jul 5, 2016
Hello,
Any news about how to deal with larger-than-memory datasets?
I'd like to fit a large model (30GB) and lme4 is not able to deal with datasets that big.
I know that there are packages such as ff or bigmemory that can work with large datasets, but they don't have the ability to fit mixed-effects models, nor do they seem able to work alongside lme4, do they?
How can I transparently fit a model with lme4 using data from one of these sources without loading everything into memory?
Regards