Dealing with larger than memory datasets? #387

Open
skanskan opened this Issue Jul 5, 2016 · 20 comments


skanskan commented Jul 5, 2016 edited

Hello,

Any news about how to deal with larger-than-memory datasets?

I'd like to fit a large model (the data file is about 30GB) and lme4 is not able to deal with datasets that big.
I know there are packages such as ff or bigmemory that can work with large datasets, but they can't fit mixed-effect models, and they don't seem to work alongside lme4, do they?

How can I transparently fit a model with lme4 using data from one of these sources, without loading everything into memory?

Regards


Owner

bbolker commented Jul 11, 2016

Sorry it took me a while to get back to this.

I've been thinking about how you would do this. It would be hackable with the current infrastructure under the following circumstances:

  • fitting a GLMM without a scale parameter (e.g. binomial/Poisson)
  • all random effects strictly nested (not crossed or partially crossed)

In this case, you could farm out appropriate-sized chunks of the data to different servers (with non-shared memory; Hadoop/Spark/etc.). Use the modular structure to set up Z, X, ... structures on each server. To evaluate the likelihood, ship a complete set of theta and beta parameters off to each server, evaluate the likelihood for the subset of the data, return it to the master, and add the log-likelihoods.
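To make the bookkeeping concrete, here is a toy sketch in Python (hypothetical names; a plain Poisson log-likelihood stands in for the per-server GLMM evaluation, and the per-chunk values only add exactly when the chunks follow the nesting-group boundaries):

```python
import math

def chunk_loglik(beta, rows):
    # Poisson log-likelihood with a log link for one chunk of (x, y) rows.
    # In the real scheme, each server would evaluate its own GLMM
    # contribution for the current (theta, beta) instead.
    ll = 0.0
    for x, y in rows:
        eta = beta[0] + beta[1] * x
        ll += y * eta - math.exp(eta) - math.lgamma(y + 1)
    return ll

data = [(0.1, 1), (0.5, 2), (1.0, 3), (1.5, 2), (2.0, 5), (2.5, 4)]
chunks = [data[:3], data[3:]]   # split along group boundaries
beta = (0.2, 0.6)

# "Master" step: ship beta to each chunk, add the returned log-likelihoods.
total = sum(chunk_loglik(beta, c) for c in chunks)
```

Because the data are independent across nested groups, the chunked sum equals the full-data log-likelihood; the optimizer only ever needs the summed value.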

If this makes sense to you and you want to tackle it, I could give you help if you get stuck.

@dmbates , any Julia-oriented thoughts on this one?

skanskan commented Jul 11, 2016 edited

Hello.
Let's forget about distributed technologies for now.
What about a simple local database (SQLite or MonetDB, or even bigmemory or ff), loading only chunks into memory rather than everything, and combining the results at the end?

But do you mean I could do this with a small hack on top of lme4, or with scripting?
Or would I need to write complex C code?

I'm interested in simple formulas (with a normal or binomial link) such as:
lmer( BP ~ Age * Sex + smoke + ( 1 | ID ) )

The bigmemory folks told me I could also try the bam function from the mgcv package.

Regards

Owner

dmbates commented Jul 11, 2016

It seems to me that it would be rather challenging to get enough data that you could not fit a model like

(g)lmer( BP ~ Age * Sex + smoke + ( 1 | ID ) )

in memory. In the formulation used by the MixedModels package for Julia, the largest component of the model structure would be the model matrix for the fixed effects, which would be n by 3. You would need two copies of it (the original and, for a GLMM, a copy with case weights applied). How large would n be? An n in the tens or even hundreds of millions could be handled on a server without too much of a problem.
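As a back-of-envelope check (an illustrative helper, not part of MixedModels): an n-by-k matrix of 8-byte floats takes n × k × 8 bytes, so even two copies at n = 100 million fit comfortably on a modest server.

```python
def matrix_gib(n, k, bytes_per_entry=8):
    # Memory (in GiB) for a dense n-by-k matrix of 8-byte floats.
    return n * k * bytes_per_entry / 2**30

# Two copies of a 100-million-row, 3-column matrix: about 4.5 GiB.
needed = 2 * matrix_gib(100_000_000, 3)
```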

Do you have a sample data set that we could use to test different methods? Failing that, could you suggest appropriate parameter values for simulating such data?

Owner

dmbates commented Jul 11, 2016

@skanskan I'm not sure I understand what you mean by "a large model (30 GB)". Where does the 30 GB come from?

I mean that the CSV file containing the data is about 30GB (and my PC has 12GB of RAM).

Owner

dmbates commented Jul 11, 2016

@skanskan Converting a .csv file to a data frame is a different issue than fitting a model. There are many ways to handle reading a large file of data. The "R Data Import/Export" manual is one place to start, as is the readr package.

Can you share the data or at least the structure of the data? How many rows/columns, what column types, ranges, etc.?

Owner

bbolker commented Jul 11, 2016

A few quick comments: (1) bam does sound useful, although "... numerical methods are designed for datasets containing upwards of several tens of thousands of data" sounds a little scary (lme4 works out of the box with hundreds of thousands of observations ...), and it's not immediately obvious that it will help -- if you do try it, let us know how it goes! (2) as @dmbates says, size on disk and size in memory are different, e.g.

> n <- 1e5
> dd <- data.frame(BP=rnorm(n),Age=sample(20:60,size=n,replace=TRUE),
         smoke=sample(0:1,size=n,replace=TRUE),ID=1:n)
> object.size(dd)
2000984 bytes  ## 1.9 Mb
> write.csv(dd,file="tmp.csv")
> file.info("tmp.csv")
           size isdir mode               mtime               ctime
tmp.csv 3694242 FALSE  644 2016-07-11 13:45:46 2016-07-11 13:45:46
## 3.6 MB

(Details will obviously depend on the data types, digits of precision, etc.)
(3) Depending on your hourly pay rate and the level of bureaucracy at your institution, gaining access to a computer with more memory may be (by far) the most efficient way to solve this problem;
https://speakerdeck.com/szilard/size-of-datasets-for-analytics-and-implications-for-r-user-conference-stanford-university-june-2016
comments on the cost of 128GB of RAM (about US$700) and the relatively easy access to cloud nodes (e.g. AWS) with up to 2TB of RAM ...

If you still want to proceed, I think you could do this without writing any C++ code, but it could be very slow. It wouldn't be too hard to compute the likelihood for individual chunks of data, but the issue is that you would have to repeatedly access the chunks; if you have to keep re-loading and clearing large memory chunks, the whole process will likely be excruciatingly slow (this is why I was suggesting a non-shared-memory solution). This is in contrast to something like linear models, where the solution can be computed in a single chunk-wise pass through the data ...
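The single-pass idea for linear models can be sketched as follows (NumPy for illustration; in practice each chunk would be streamed from disk rather than sliced from an array already in memory):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10_000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

# One pass: accumulate X'X and X'y chunk by chunk; no chunk is revisited.
XtX = np.zeros((p, p))
Xty = np.zeros(p)
for start in range(0, n, 1000):
    Xc, yc = X[start:start + 1000], y[start:start + 1000]
    XtX += Xc.T @ Xc
    Xty += Xc.T @ yc

# Solving the accumulated normal equations gives the same coefficients
# as a full-data fit.
beta_hat = np.linalg.solve(XtX, Xty)
```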

Owner

dmbates commented Jul 11, 2016 edited

@bbolker A linear mixed-effects model only needs to pass over the whole data set once. After constructing the "cross-product" matrices of the partitions (random-effects term 1, random-effects term 2, ... , fixed-effects, response) everything can be done with two copies of those cross-product blocks. If there is only one random-effects term the first block is diagonal or homogeneous block-diagonal and the storage overhead is not much more than fitting the fixed-effects only.

For a GLMM you need to return to the full data set each time you reweight.

Owner

bbolker commented Jul 11, 2016

OK, thanks, I stand corrected. (Student project???)

skanskan commented Jul 11, 2016 edited

@dmbates I know that.

The problem is that my data are too big to be loaded directly into a data frame.
That's why I'm wondering whether there is a package that can fit a mixed-effects model without loading all the data into memory at once.

In fact I don't have a specific model to show you right now; I'm trying to find a tool for future work in general, as I'm starting my PhD in statistics.

Last month I had to analyze a dataset of 300 variables and 325,000 individuals, with measures repeated every month for several years.
Sometimes I can reduce the problem by preselecting variables or sampling cases.
Next month my boss is going to give me a much larger database, with metabolomics and genomic information.

If the dataset is very big it's not easy to deal with: you need to store it in a database or use special tools.
bigmemory and ff try to help but have one limitation: you can only use the predefined functions, such as simple regression, and they don't support random effects, repeated measures, or anything more complex. The same goes for Spark (though I don't know much about it).

Getting more memory is expensive and doesn't really solve the problem; it's only a patch until a bigger problem arrives, so it just postpones the issue.

Owner

dmbates commented Jul 11, 2016

If you look at the LinearMixedModel type in the MixedModels package

help?> LinearMixedModel
search: LinearMixedModel GeneralizedLinearMixedModel

  LinearMixedModel

  Linear mixed-effects model representation

  Members:

    •  formula: the formula for the model
    •  mf: the model frame, mostly used to get the terms component for labeling fixed effects
    •  wttrms: a length nt vector of weighted model matrices. The last two elements are X and y.
    •  trms: a vector of unweighted model matrices. If isempty(sqrtwts) the same object as wttrms
    •  Λ: a length nt - 2 vector of lower triangular matrices
    •  sqrtwts: the Diagonal matrix of the square roots of the case weights. Allowed to be size 0
    •  A: an nt × nt symmetric matrix of matrices representing hcat(Z,X,y)'hcat(Z,X,y)
    •  R: a nt × nt matrix of matrices - the upper Cholesky factor of Λ'AΛ+I
    •  opt: an OptSummary object

the only parts that are modified during the iterations are Λ and R. The Λ member is just the template lower-triangular matrices for each random-effects term. For a scalar random effects term the corresponding component of Λ is a 1 by 1 matrix. The R member is an upper triangular matrix of matrices. A is the original cross-products of the various terms in the model and R is the blocks of the upper Cholesky factor. I made a fortuitous discovery that if you append y as the last term then do the Cholesky factorization the square root of the penalized residual sum of squares is in the 1 by 1 block at the lower right. Initially I thought it would be the square root of the residual sum of squares but I was wrong. Once I knew the answer it was easy to prove that it should be so.

Thus you can evaluate the profiled log-likelihood for a given value of θ without needing to return to the full data frame.

The description of R as the upper Cholesky factor of Λ'AΛ+I is not exactly correct. You would need to imagine the full Λ matrix extended to a square matrix of size (q + p + 1), with the identity in the lower-right (p + 1) by (p + 1) block. Also, the I here is a (q + p + 1) by (q + p + 1) matrix of zeros with 1's in the first q positions of the diagonal.
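The "append y last" observation is easy to check numerically in the unpenalized ordinary-least-squares analogue (a sketch, with no Λ and no penalty): Cholesky-factor the cross-product of hcat(X, y), and the lower-right entry of the upper factor squares to the residual sum of squares.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

A = np.column_stack([X, y])          # append y as the last column
R = np.linalg.cholesky(A.T @ A).T    # upper Cholesky factor of A'A

# Compare R's lower-right entry against the OLS residual sum of squares.
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
rss = np.sum((y - X @ beta_hat) ** 2)
# R[-1, -1] ** 2 equals rss.
```

The block algebra is the same: with R'R = [X y]'[X y] partitioned, the (p+1, p+1) entry satisfies s² = y'y − y'X(X'X)⁻¹X'y, which is exactly the RSS.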

skanskan commented Jul 11, 2016 edited

@bbolker Do you think it will be easier to get this done with a classic mixed-effects multi-chunk approach or with a Bayesian approach?
I mean, Bayesian software is already able to work with large datasets because it's easier to parallelize chains, isn't it?

Owner

dmbates commented Jul 11, 2016 edited

@skanskan It is highly unlikely you will find an out-of-the-box solution to do what you want to do. If you need to work with very large data sets you will need to learn to use the tools that can handle such data.

You should realize that R hits the wall pretty quickly when dealing with large data sets and models of some complexity. This is a consequence of the design of the language. I switched to working in Julia because it offers more flexibility than R in data structures. Many others use Python.

There are ways to attack the problem but none of them are as straightforward as writing some R code. When you say, " I had to analyze a dataset of 300 variables and 325000 individuals, with measures repeated every month for several years." it sounds to me as if these data are in the "wide" format (one row for each person, several columns representing measurements). If they are converted to the "long" format (columns of individual, measurement, occasion and any covariates like age, sex, ...) how many observations would there be? That is, how many measurements in total?

It is probably best to perform the data manipulation outside of R, perhaps in a database, then worry about fitting the model. If you want to try using Julia and the MixedModels package I could advise you on how to set up such a model.

Owner

dmbates commented Jul 11, 2016

@skanskan If by "a Bayesian approach" you mean MCMC you would need to be prepared to wait a very long time to fit even a simple model to a large data set.

Yes, I meant MCMC. I know it's slower for simple models,
but as you said, "if you have to keep re-loading and clearing large memory chunks, the whole process will likely be excruciatingly slow", so maybe in this situation MCMC is faster?

skanskan commented Jul 11, 2016 edited

@dmbates it would be around 325,000 individuals × 10 years × 12 months =
40 million observations approximately, with 300 different variables. But the number of cases is going to be much bigger in a few months.

Sampling has one drawback: if something happens only rarely you can miss it. For example, if my model is logistic and y = 0 occurs 1,000,000 times but y = 1 occurs only 200 times in my data, a sample could easily contain only cases with y = 0.
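That risk is easy to quantify (back-of-envelope, treating draws as independent): with 200 positives among about 1,000,200 rows, a uniform subsample of size m contains no positives with probability roughly (1 − 200/1000200)^m.

```python
def prob_no_positives(m, n_pos=200, n_total=1_000_200):
    # Chance that a uniform sample of size m (with replacement, a good
    # approximation at these sizes) contains zero positive cases.
    return (1 - n_pos / n_total) ** m

# Roughly 0.135: even a 10,000-row sample misses all positives
# about one time in seven.
p = prob_no_positives(10_000)
```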

One solution I have found so far is to increase the size of my pagefile (on Windows) to a fixed 200GB on a fast SSD. It's still quite slow, though, because it isn't optimized for this.

I know there are other tools able to work with very big datasets. One free tool is ROOT (made by CERN), but it has no package for mixed effects.

Owner

dmbates commented Jul 11, 2016

Sounds like it will be a lot of work for you no matter what approach you take. Modeling very rare events is difficult. If all but a few cases have y = 0 then you will likely end up with complete separation for some combination of the covariates. You will see the intercept getting large and negative while a few coefficients become large and positive.

Using Windows will also make things difficult. It doesn't scale well to large data sets.

@dmbates I'm using Windows, but I don't mind moving to Linux at times.

Owner

bbolker commented Jul 11, 2016 edited

  • I agree with the previous comment of @dmbates: MCMC is unlikely to be practical (you could of course set up a distributed-memory MCMC-style analogue of what I've described above, but it wouldn't be faster). You can run the chains in parallel, but you still have to evaluate the (unscaled) posterior distribution at some point ... I've heard of variational Bayes approaches (which as I understand it are fast deterministic approximations to computing posterior (??modes??)), but know nothing about them.
  • Getting more memory is expensive, but my point is that it's usually relatively cheap compared to the price of expert human effort. It is also certainly true that it only postpones the problem - but if it lets you solve your current problem (and perhaps the next couple of years' worth of problems, depending on what you're doing) - then it may be worth it.
  • Again as @dmbates said, my statement about "re-loading and clearing large memory chunks" is only applicable to GLMMs, not LMMs
  • I was asked to review a paper recently (it dropped through the cracks) by Yanyan Zhao, Haitao Zheng, and Qing Cheng on algorithms for LMMs for "massive data" (keywords were "block sequential SVD"). You could try to find them ...
  • the bottom line is (again agreeing with @dmbates) that this is not an easy problem at present. You could see what options SAS, Genstat, Stata have available (again $$$); asking for advice on r-sig-mixed-models is also not a bad idea. Fundamentally, I think that constructing generally useful tools for this problem is itself on the scale of a good PhD problem in computational statistics ... one could get started during a Master's, but making real headway would probably require the greater focus and time of a PhD.

skanskan commented Sep 30, 2016 edited

Hello again.
Does anybody have experience with Spark?
The documentation of the Photon ML library (for Spark) mentions something called GLMix, which is supposed to fit mixed-effects models. But I don't know whether a user can fit a model with it directly, as we do with lme4, or whether it can only be used inside machine-learning algorithms.

Anyway, I think the future depends on libraries such as BLAS or MKL being adapted to work with very large, possibly distributed, datasets. Or we will need to wait until some common database software includes mixed-effects models among its commands.
