
Memory issues for medium-sized corpus #4

Open

aflueckiger opened this issue Oct 15, 2018 · 8 comments
@aflueckiger commented Oct 15, 2018

Thanks for this intriguing new method. A few toy models showed interesting and consistent patterns, so I would like to use it with a real corpus. When applying the method to a medium-sized corpus, however, I run into memory issues (20 GB RAM + 20 GB swap file). The properties of the corpus are as follows:

#docs (unfiltered): 7897
#tokens (unfiltered): 9273368
#types (unfiltered): 67485

According to your paper, the algorithm should be scalable. Yet, even when drastically reducing the size of the graph by filtering out nodes (docs + words) as well as edges (tokens), I still encounter the same issues.

What's the exact issue? Is there a way to make it more scalable?

Edit: I am using the graph-tool Docker container.

@martingerlach (Owner) commented

The complexity (and thus the memory requirement) of the algorithm scales with the number of nodes in the graph. For a corpus like yours, the number of nodes is dominated by the number of word types. One pragmatic way to reduce this number substantially is to impose a minimum number of occurrences for a word type to be included in the graph. This amounts to filtering out the large number of low-frequency words, which should not carry much information in any case.

I added an optional argument to make_graph (e.g. make_graph(n_min=10)) in order to filter out many of the word types (you can choose other values for n_min). This should allow you to analyze the corpus with 20 GB of memory.
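For reference, a minimal usage sketch of that option, assuming the sbmtm wrapper from this repository; the exact signature of make_graph (e.g. the documents keyword) may differ in the version you have:

from sbmtm import sbmtm

# texts: list of tokenized documents, titles: list of document labels
model = sbmtm()
model.make_graph(texts, documents=titles, n_min=10)  # drop word types occurring fewer than 10 times
model.fit()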

I am looking into other options to reduce memory usage in the future.

Best,
Martin

@aflueckiger (Author) commented Oct 16, 2018

Thanks for the quick answer and the code amendment. However, reducing the size within reason doesn't help unless I stick to toy models. More specifically, I had already filtered out words so drastically myself that the remaining corpus is no longer of any practical use. I set extreme thresholds for the term-document frequency (min_df=2000, max_df=0.7), which gives me the following numbers:

#docs (unfiltered): 7897
#tokens (unfiltered): 7953429
#types/unique words (unfiltered): 40891

#docs (filtered): 7897
#tokens (filtered): 3036422
#types/unique words (filtered): 530

I wonder how you processed the Twitter corpus listed in your paper (docs: 10,000, types: 12,258, words: 196,625). Something on my side doesn't seem to be in line with these results. Hence, the following questions:
Do the edges (i.e. tokens) really not matter for the complexity (they are an order of magnitude higher here)? Otherwise, I can only think of something not working properly in the current Docker build, unless you used a powerful server for these computations. Am I missing something?

@martingerlach (Owner) commented Oct 17, 2018

You are right, the memory also depends on the number of edges (word tokens). Therefore, for a very large corpus, you would probably need to run the code on a cluster.

However, the Twitter corpus we used in the paper takes less than ~2 GB of memory, so this should be no problem with 20 GB. I tried to construct a corpus similar in size to the one you described by taking 10,000 books from Project Gutenberg. Setting a minimum threshold of 10-100 occurrences for each word yields a manageable corpus that I can run on my laptop.

So I am not completely sure where your problem stems from, since I haven't used the Docker build. Perhaps you can find additional information in the graph-tool GitLab issues: https://git.skewed.de/count0/graph-tool/issues

Beyond all this, Tiago (graph-tool) suggested trying the following to reduce memory usage (not yet implemented in the topic-model wrapper):

The main reason why minimize_nested_blockmodel_dl() uses a lot of RAM is that it keeps several simultaneous copies of the global state while it is searching for the optimal number of groups. A much more economical approach is to use a single state the whole time. You do this by initializing a single NestedBlockState, where each node belongs to its own group at first. You then equilibrate an MCMC by calling repeatedly:

state.multiflip_mcmc_sweep(niter=n)

with n=10 or 100 or so. This attempts to merge and split groups, so it gives good results. In practice, minimize_nested_blockmodel_dl() still gives better results in many cases, but they are comparable, and with this approach the memory use is far lower (a factor of 3 to 4, or so). (Remember that using weighted graphs, via the 'eweight' parameter, will also help reduce memory use.)

@aflueckiger (Author) commented Oct 18, 2018

I really appreciate your support. I installed graph-tool locally and did some further experiments. The RAM consumption is still incredibly high and seems to differ from your observations. Do I understand correctly that you successfully processed 10,000 books on a laptop, just by filtering out words with a term frequency below 100? That must have resulted in a much bigger corpus than the one I am trying to model.

I managed to process an extremely filtered corpus (min_df=1000, max_df=0.6) successfully; however, it still takes 17 GB of RAM. Thus, something is still not in line.

#docs (filtered): 7897
#tokens (filtered): 3797133
#types/unique words (filtered): 1151

Currently, I don't have the expertise to readily extend the code along the lines of your additional suggestions, as I work primarily in computational linguistics and computational social science. I would be happy to test such an implementation, though.

Moreover, could you share the script (incl. the corpus loading, from nltk?) of your experiment, to make sure we are on the same page? :)

@martingerlach (Owner) commented

Here is a notebook in which I analyze 1,000 books from Project Gutenberg with a minimum threshold n_min=100 (so only words that occur at least 100 times are included in the corpus). This takes about 1 GB and runs for a few minutes. I guess this would easily work with 10,000 documents, but I didn't want to wait.

I am looking into other methods to further reduce memory usage (along the lines of what I wrote above) and hope to be able to include some of these features at some point.

Let me know if you have further questions.
dev_20181017_demo-for-PG-corpus.html.tar.gz

@aflueckiger (Author) commented Oct 21, 2018

The corpus you tested with is quite different from mine. In your example, only 1,000 documents are considered, with a maximum length of 1,000 tokens each, which results in a relatively small corpus. Moreover, a minimum term frequency of 100 is very restrictive given that the corpus has only 1,000 documents (a word needs to occur in every 10th document on average, so smaller "topics" disappear). For practical applications, these requirements are not appropriate.

Modeling bigger corpora may not be possible with the current implementation unless one has access to a cluster. The algorithm used has a complexity of O(V ln² V), which explains the memory issue with the bigger corpus: https://graph-tool.skewed.de/static/doc/inference.html#graph_tool.inference.minimize.minimize_nested_blockmodel_dl

I only see the following options:

  • linguistically informed filtering based on part-of-speech (POS) tags to keep only semantically relevant words (nouns, adjectives, verbs); see the sketch after this list
  • waiting for an implementation that reduces the required memory, as sketched out above
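For the first option, a minimal sketch of such a POS-based pre-filter, assuming spaCy with its small English model; the names filter_doc and raw_documents are made up for illustration:

import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
KEEP = {"NOUN", "PROPN", "ADJ", "VERB"}

def filter_doc(text):
    # keep only lemmas of semantically relevant word classes
    return [tok.lemma_.lower() for tok in nlp(text) if tok.pos_ in KEEP and tok.is_alpha]

texts = [filter_doc(raw) for raw in raw_documents]  # raw_documents: list of raw strings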

Sampling just a subset of words from each document would jeopardize the explanatory power, so this is not an option for my purposes.

Do you have other suggestions? Thank you.

@jnothman commented
I wonder whether there is scope to implement an alternative inference algorithm for the hSBM, such as the variational approach of Park and Bader (https://arxiv.org/pdf/1711.05150.pdf).

@count0 (Contributor) commented Oct 6, 2021

With a recent graph-tool version (2.43 or above), the memory requirements, as well as the speed, should be significantly improved. Please try again!
