LDA implementation #51

hyu596 · 2018-07-03T07:56:42Z

This pull request contains all the LDA progress so far. Specifically, the following are being pushed:

InputReader class: a few functions are added to support reading an LDA corpus
lda_gamma function in gamma.h: a helper function that simplifies the computation of the LDA log-likelihood
LDADataset class: holds an LDA dataset, which consists of two things: the corpus and the vocabulary
LDAModel class:
1. LDAModel can only be constructed from serialized memory
2. From serialized memory, LDAModel gets all the statistics needed for sampling
3. Currently LDAModel is not a subclass of CirrusModel (*)
LDAStatistics class: holds the local variables (ndt, topic assignments, slice). A few things to note:
1. Given an integer minibatch_size, it returns a partial LDAStatistics covering minibatch_size of the documents stored in the current LDAStatistics
2. Based on slice_size (a class variable; defaults to 500 and can be set to other values), it returns a partial LDAStatistics covering only slice_size words
ModelGradient: LDAUpdate class:
1. contains the updates to the global variables
2. also stores the vocabulary slice, since it is needed to determine which part of the global variables to update
3. currently it’s not a subclass of ModelGradient (*)
PSInterface: now supports sending an LDAUpdate and getting a partial LDA model; note that only the vocabulary slice needs to be sent in order to get the partial model
S3SparseIterator: adds push_samples_lda to support reading the serialized version of LDAStatistics from S3
Tasks:
1. LoadingSparseTaskS3: reads the input files, counts all the needed statistics, and stores them in S3
2. LDAModelTaskS3: runs sampling on the LDAStatistics read from S3; note that during sampling, only the partial LDAStatistics corresponding to ONE vocabulary slice is considered
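
As a rough illustration of the two ways LDAStatistics is partitioned above (a minibatch of documents, then one vocabulary slice), here is a minimal C++ sketch. The types `Token`, `Document`, and `LocalStats` and the functions `take_minibatch` and `restrict_to_slice` are hypothetical stand-ins made up for this example; they are not the classes or signatures in this PR, and the sketch assumes a slice is a contiguous range of word ids.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical per-token state: the word id and its current topic assignment.
struct Token {
  int word;
  int topic;
};
using Document = std::vector<Token>;

// Hypothetical stand-in for the local statistics held by one worker.
struct LocalStats {
  std::vector<Document> docs;
};

// Returns a copy of up to minibatch_size documents starting at first_doc.
// Out-of-range requests are clamped, so the result may be shorter or empty.
LocalStats take_minibatch(const LocalStats& stats, std::size_t first_doc,
                          std::size_t minibatch_size) {
  const std::size_t begin = std::min(first_doc, stats.docs.size());
  const std::size_t end = std::min(begin + minibatch_size, stats.docs.size());
  LocalStats batch;
  batch.docs.assign(stats.docs.begin() + static_cast<std::ptrdiff_t>(begin),
                    stats.docs.begin() + static_cast<std::ptrdiff_t>(end));
  return batch;
}

// Keeps only tokens whose word id lies in [slice_begin, slice_begin + slice_size).
// Documents that end up empty are kept so document indices still line up.
LocalStats restrict_to_slice(const LocalStats& stats, int slice_begin,
                             int slice_size) {
  assert(slice_size > 0);
  LocalStats out;
  out.docs.reserve(stats.docs.size());
  for (const Document& doc : stats.docs) {
    Document kept;
    for (const Token& t : doc) {
      if (t.word >= slice_begin && t.word < slice_begin + slice_size) {
        kept.push_back(t);
      }
    }
    out.docs.push_back(std::move(kept));
  }
  return out;
}
```

In this sketch a worker would first call `take_minibatch` and then `restrict_to_slice` on the result, so that only the word ids of one slice need to be matched against the partial model.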

jcarreira · 2018-08-02T04:50:15Z

I think we should put the datasets used for testing outside of github to avoid making this repo unnecessarily heavy.

jcarreira · 2018-10-31T20:30:47Z

If this PR is no longer relevant please close @hyu596

Pushing all the current LDA progress

8b8b04a

jcarreira mentioned this pull request Jul 15, 2018

Add support for LDA algorithm #21

Open

jcarreira assigned hyu596 Jul 16, 2018

hyu596 and others added 3 commits July 15, 2018 21:29

Update

13d77ef

Quick update; debugging

03eb381

Switching VM; quick push

a7bd664

jcarreira changed the title ~~Pushing all the current LDA progress~~ LDA implementation Jul 21, 2018

Ubuntu and others added 14 commits July 24, 2018 07:18

Update: LDA working with S3

ced887a

Clearing

c0a53c5

Add back the missing Makefile

749a58c

Fixing issue

23de928

Merge pull request #1 from jcarreira/master

8269305

Update master branch in the forked repo

Solved conflicts

a70a45d

Quick fix

353f09e

Fix formatting

1e09937

Finish travis test for lda

eb800e0

Fix formatting

b0ce881

Quick fix for formatting

99bc2b4

Quick fix the travis test & pushing the small dataset for travis test

3150227

Quick push

005163a

Fix Makefiles

7ddc346

Ubuntu and others added 8 commits August 2, 2018 05:01

Delete test data

bdd6191

fix conflict

d291248

Merge branch 'lda' of https://github.com/hyu596/cirrus-1 into lda

ac45905

Fix for lamda

9d48cdd

Quick fix

f64d2f1

Working on Python Interface; switch VM

b602cdc

Impoved the computation of ll

aa06b49

Writing ll to file

f1a8cd9

jcarreira added 5 commits August 5, 2018 04:52

Quick fix for ll

0c05f6b

Clean the code

5bab83b

Clean and fix the code

4b0c691

Add improvement of ll init, benchmark, few metrics and optimizations

8c45ed8

Optimization

9720543

hyu596 closed this Nov 4, 2018
