LDA implementation #51

Closed
wants to merge 31 commits into from


@hyu596 hyu596 commented Jul 3, 2018

This pull request contains all of the LDA progress so far. Specifically, the changes being pushed are:

  1. InputReader class: adds a few functions to support reading an LDA corpus
  2. lda_gamma function in gamma.h: a helper function that simplifies computing the LDA log-likelihood
  3. LDADataset class: holds an LDA dataset, which consists of two things: the corpus and the vocabulary
  4. LDAModel class:
    1. LDAModel can only be constructed from serialized memory
    2. From the serialized memory, LDAModel gets all the statistics needed for sampling
    3. Currently, LDAModel is not a subclass of CirrusModel (*)
  5. LDAStatistics class: holds the local variables (ndt, topic assignments, slice). A few things to note:
    1. Given an integer minibatch_size, it returns a partial LDAStatistics covering minibatch_size of the documents stored in the current LDAStatistics
    2. Given slice_size (a class variable; default 500; can be set to other values), it returns a partial LDAStatistics covering only slice_size words
  6. ModelGradient: LDAUpdate class:
    1. Contains the updates to the global variables
    2. The vocabulary slice is also stored, since it is needed to determine which part of the global variables to update
    3. Currently, it is not a subclass of ModelGradient (*)
  7. PSInterface: now supports sending an LDAUpdate and getting a partial LDA model; to get the partial model, only the vocabulary slice needs to be sent
  8. S3SparseIterator: adds push_samples_lda to support reading the serialized version of LDAStatistics from S3
  9. Tasks:
    1. LoadingSparseTaskS3: reads the input files, counts all the needed statistics, and stores them in S3
    2. LDAModelTaskS3: samples LDAStatistics from S3; during sampling, only the partial LDAStatistics corresponding to ONE vocabulary slice is considered
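
To illustrate the two ways of taking a partial LDAStatistics described in item 5, here is a minimal sketch. It is not the code from this PR: `LocalStats`, `take_minibatch`, and `restrict_to_slice` are hypothetical names, and the layout (one list of word id and topic pairs per document) is an assumption made only for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stand-in for LDAStatistics (assumed layout, not the PR's class).
struct LocalStats {
  // One entry per document; each token is a (word id, assigned topic) pair.
  std::vector<std::vector<std::pair<int, int>>> docs;
};

// Return the statistics for up to `minibatch_size` documents,
// starting at document index `first_doc`.
LocalStats take_minibatch(const LocalStats& all, std::size_t first_doc,
                          std::size_t minibatch_size) {
  assert(first_doc <= all.docs.size());
  const std::size_t last =
      std::min(all.docs.size(), first_doc + minibatch_size);
  LocalStats out;
  out.docs.assign(all.docs.begin() + static_cast<std::ptrdiff_t>(first_doc),
                  all.docs.begin() + static_cast<std::ptrdiff_t>(last));
  return out;
}

// Keep, in every document, only the tokens whose word id belongs to the
// given vocabulary slice. Documents with no matching token stay in place
// as empty lists so that document indices are preserved.
LocalStats restrict_to_slice(const LocalStats& all,
                             const std::vector<int>& slice) {
  const std::unordered_set<int> in_slice(slice.begin(), slice.end());
  LocalStats out;
  out.docs.reserve(all.docs.size());
  for (const auto& doc : all.docs) {
    std::vector<std::pair<int, int>> kept;
    for (const auto& token : doc) {
      if (in_slice.count(token.first) != 0) {
        kept.push_back(token);
      }
    }
    out.docs.push_back(std::move(kept));
  }
  return out;
}
```

In this sketch a worker would first call `take_minibatch` and then `restrict_to_slice` with the same word ids it sends to the parameter server, so the local tokens and the partial model it gets back cover the same vocabulary slice.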

@jcarreira changed the title from "Pushing all the current LDA progress" to "LDA implementation" on Jul 21, 2018
@jcarreira
Owner

I think we should keep the datasets used for testing outside of GitHub to avoid making this repo unnecessarily heavy.

@jcarreira
Owner

If this PR is no longer relevant, please close it, @hyu596.

@hyu596 hyu596 closed this Nov 4, 2018