This repository has been archived by the owner on Sep 25, 2022. It is now read-only.
Peter Monks edited this page Oct 8, 2018 · 37 revisions

Q. I gave the Alfresco process a bunch more memory / CPU, but my import didn't speed up. Shouldn't it have gotten a lot faster?

A. No. Bulk imports are usually I/O bound, so adding more CPU or memory capacity when neither is the bottleneck isn't going to help much, if at all. Instead I'd focus on the classical performance tuning process:

  1. Identify the performance objective (so you know when to stop)
  2. Measure the system
  3. Identify the bottleneck
  4. Fix the bottleneck
  5. Measure the system again. If the performance objective isn't met, go to step 3.
  6. ???
  7. PROFIT!!!1

Step #1 is critically important, otherwise this process becomes an infinite loop!


Q. Can I run imports on more than one server in an Alfresco cluster?

A. Yes, though it may not accomplish much if your bottleneck is in a shared component (database, contentstore, network, source filesystem - see previous question).

Related Q. I tried to run the Bulk Import Tool on multiple cluster nodes and got a JobLockService exception.

A. You're using the embedded fork, which is a cluster-singleton process. One of numerous reasons to avoid the embedded fork.


Q. Why are the instantaneous rates on the graphs so bursty?

A. To avoid double counting (e.g. during a transactional retry), the tool only "counts" the target data when a transaction is committed. This makes the various target counters appear to be a lot more bursty than they actually are. The best solution is to focus on the moving average, since it's a better indicator of overall throughput.
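One simple way to see why a moving average irons out commit bursts is an exponentially weighted moving average over the per-interval counts. This is a minimal illustrative sketch (the class name, the smoothing factor, and the smoothing scheme are my assumptions, not the tool's actual implementation):

```java
// Sketch: exponentially weighted moving average (EWMA) over bursty
// per-interval import rates. All names and values are illustrative.
public class MovingAverage {
    private final double alpha;   // smoothing factor, 0 < alpha <= 1
    private double average = 0.0;
    private boolean initialised = false;

    public MovingAverage(double alpha) {
        this.alpha = alpha;
    }

    public double update(double instantaneousRate) {
        if (!initialised) {
            average = instantaneousRate;
            initialised = true;
        } else {
            average = alpha * instantaneousRate + (1 - alpha) * average;
        }
        return average;
    }

    public static void main(String[] args) {
        MovingAverage ma = new MovingAverage(0.2);
        // Bursty input: long stretches of zero, then one large commit
        double[] rates = { 0, 0, 0, 1000, 0, 0, 0, 1000 };
        for (double r : rates) {
            System.out.printf("instantaneous=%6.0f  average=%7.1f%n", r, ma.update(r));
        }
    }
}
```

Even though the instantaneous values swing between 0 and 1000, the smoothed value moves gradually, which is why it's the better indicator of overall throughput.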


Q. After a little while I'm seeing long periods of zero instantaneous activity, followed by a solitary large burst. What's going on?

A. This is partly related to the previous question, and is something I've observed in my test environment too. While I'm not 100% sure I know the answer, what I think is happening is that transaction commits across the various worker threads end up falling into alignment. Initially I figured it was just because I was starting all of the worker threads at the same time, but after experimenting with staggered startup logic what I saw was that the "coherence pattern" would eventually re-emerge anyway (and that code has since been removed as a result). It's possible this is specific to the database I'm testing on (MySQL 5.6.xx) but regardless, I'd be very keen to hear from a database expert who might be able to explain the observed behaviour in more detail.


Q. At the start of an import, I see a high "nodes imported per second" reading, but "bytes imported per second" is stuck on zero. What's happening?

A. The tool imports the entire directory structure first, before importing any files. Directories count as nodes in the repository, but are (obviously) empty - they contain no data.


Q. At the start of an import, I see "Threads: 0 active of 0 total", but the import seems to be progressing. Why is this?

A. The tool imports the directory structure and the first couple of batches of content on a single thread, since:

  1. directories are "dependent" on each other (they can be nested), so it's not possible to reliably import the structure in a multi-threaded way - multi-threaded imports don't guarantee ordering, meaning a child folder could be imported before its parent (which will fail for obvious reasons)

  2. for small imports the cost of spinning up the multi-threaded import machinery outweighs the benefits, so the first couple of batches (couple of hundred files) are imported serially, and only once a certain threshold is reached (currently 3 batches, but this is an internal implementation detail that may change) does multi-threading kick in

During this single-threaded phase the worker threads haven't been created yet, and so the tool reports that zero threads are active (it's reporting on the size of the worker thread pool). Arguably it should report that 1 thread is active, even though that thread is not part of the worker thread pool - feel free to raise an issue if you think this is problematic.
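The hand-off described above can be sketched roughly like this. The threshold constant, class, and method names are all hypothetical (and, as noted, the real threshold is an internal implementation detail that may change):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of "serial until a threshold, then multi-threaded": the first
// few batches run on the calling thread, and the worker pool is only
// created once the threshold is crossed. Purely illustrative.
public class ThresholdedImporter {
    private static final int MULTI_THREADING_THRESHOLD = 3; // batches

    public static void importBatches(List<Runnable> batches, int workerThreads)
            throws InterruptedException {
        int processed = 0;

        // Phase 1: import the first few batches on the calling thread.
        // During this phase the worker pool doesn't exist yet, hence
        // "Threads: 0 active of 0 total".
        while (processed < batches.size() && processed < MULTI_THREADING_THRESHOLD) {
            batches.get(processed).run();
            processed++;
        }

        // Phase 2: spin up the worker pool for the remainder
        if (processed < batches.size()) {
            ExecutorService pool = Executors.newFixedThreadPool(workerThreads);
            for (Runnable batch : batches.subList(processed, batches.size())) {
                pool.submit(batch);
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }
}
```

For a small import that never crosses the threshold, phase 2 never runs and no worker threads are ever created.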


Q. What does batch "weight" mean?

A. Nothing. "Weight" is a unitless value that's simply used for comparing the approximate size of each imported node while constructing batches. It's intended to be proportional to the amount of work the database will have to do while importing that node, but the value itself is meaningless - it's not a count of nodes or database rows, or anything like that.
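To make the idea concrete, here's a hypothetical sketch of how a unitless weight might be used to close off batches of roughly equal database cost (the cap, names, and logic are my illustration, not the tool's actual batching code):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: accumulate per-node "weight" and close a batch once adding
// the next node would exceed a cap. The weight has no unit; it only
// needs to be roughly proportional to database work per node.
public class BatchBuilder {
    public static List<List<Long>> buildBatches(List<Long> nodeWeights, long maxBatchWeight) {
        List<List<Long>> batches = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentWeight = 0;

        for (long weight : nodeWeights) {
            if (!current.isEmpty() && currentWeight + weight > maxBatchWeight) {
                batches.add(current);        // this batch is "heavy" enough
                current = new ArrayList<>();
                currentWeight = 0;
            }
            current.add(weight);
            currentWeight += weight;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

Note that comparing weights across batches is meaningful (batch A cost roughly twice batch B), but a single weight value in isolation tells you nothing.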


Q. Why is the count of "folders scanned" so much higher than the actual number of folders on disk in the source content set?

A. The "Default" source (which imports from a filesystem) scans the source content set twice:

  1. to enumerate all of the folders in the source content set and submit them for import
  2. to enumerate all of the files in the source content set and submit them for import

While this may seem inefficient, it is preferable to scanning once and holding the entire set of filenames in memory - the memory usage with that approach would be O(N) on the number of folders + files in the source content set, and could easily exceed the total heap available to the tool in the presence of large (multi-million node) source content sets. In addition, on modern platforms, performing a recursive folder listing is reasonably fast - no file data is being read, just the index entries in the filesystem (inodes in Unixland), and most operating systems have caches for these data structures.
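A two-pass scan along these lines can be sketched with the standard `java.nio.file` API. Memory use is bounded by the depth of the tree (the visitor's traversal stack), not the total number of entries - this is the trade-off described above, not the tool's actual source code:

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.function.Consumer;

// Sketch: walk the source tree twice - pass 1 submits only folders,
// pass 2 only regular files. No collection of all entries is ever held.
public class TwoPassScanner {
    public static void scan(Path root,
                            Consumer<Path> submitFolder,
                            Consumer<Path> submitFile) throws IOException {
        // Pass 1: enumerate and submit folders
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) {
                submitFolder.accept(dir);
                return FileVisitResult.CONTINUE;
            }
        });

        // Pass 2: enumerate and submit files
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                submitFile.accept(file);
                return FileVisitResult.CONTINUE;
            }
        });
    }
}
```

This also explains the counter behaviour: every folder is visited in both passes, so "folders scanned" ends up roughly double the number of folders on disk.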

Some quick math, because I was intrigued. On the JVM on my Mac (java version "1.8.0_92" Java(TM) SE Runtime Environment (build 1.8.0_92-b14) Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)) a java.io.File object has a fixed overhead of approximately 70 bytes, and consumes a byte for every character in the fully qualified path of that file (it's probably more than that if the path contains Unicode characters). If we assume the average fully qualified path length is around 50 characters, that means each java.io.File object will consume, on average, 120 bytes. If we naively assume that the tool can consume 100MB of the Alfresco heap without causing instability in the system, that means we can store 104,857,600 ÷ 120, or approximately 870,000 File objects before we start running into trouble.
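The arithmetic from the paragraph above, written out (the 70-byte overhead and 50-character average path are the same assumptions stated there):

```java
// Back-of-envelope heap math: how many java.io.File objects fit in a
// 100MB budget, given ~70 bytes of object overhead plus ~1 byte per
// path character and an assumed 50-character average path.
public class HeapMath {
    public static void main(String[] args) {
        long perObjectOverheadBytes = 70;  // approximate java.io.File overhead
        long avgPathLengthChars     = 50;  // assumed average path length
        long bytesPerFileObject     = perObjectOverheadBytes + avgPathLengthChars; // 120

        long heapBudgetBytes = 100L * 1024 * 1024; // 100MB = 104,857,600 bytes
        long maxFileObjects  = heapBudgetBytes / bytesPerFileObject;

        System.out.println(maxFileObjects); // 873813, i.e. roughly 870,000
    }
}
```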

870,000 source files is not especially large for an import, particularly one that contains shadow metadata and version files (which each count towards the File total, but don't count towards the imported node count). Given that the tool is regularly used for imports containing millions to tens-of-millions of imported nodes (which translates to an even higher number of source files on-disk), it's clear that caching these file entries simply to save ourselves a second scan of the source content set isn't justifiable.


Q. What are these `2016-01-01 00:00:00,000 WARN [org.alfresco.repo.cache.TransactionalCache.org.alfresco.cache.node.aspectsTransactionalCache] [http-bio-8080-exec-13] Transactional update cache 'org.alfresco.cache.node.aspectsTransactionalCache' is full (65000).` warnings in the log? Should I be concerned??

A. Generally these can be ignored without serious consequences - they're an indication that one of Alfresco's caches is full and has had to be flushed. While there are performance impacts when this happens, your import should continue to function correctly (albeit more slowly than it otherwise might). That said, you may wish to allocate more RAM to the Alfresco JVM (if possible) so that these caches can be sized larger.

There used to be a good blog post by Luis Cabaceira that went into more detail on this, and other, Alfresco caching issues, but it looks like in migrating to their new "community" service, Alfresco lost that content and/or put it behind a sign-up wall. If you find it, please let me know so I can update this link!
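If you do want to try enlarging the caches, their sizes are controlled by per-cache properties that can be overridden in alfresco-global.properties. Treat the fragment below as a hypothetical example only - the exact property names and sensible values vary by Alfresco version, so verify them against your version's caching defaults before use:

```properties
# HYPOTHETICAL example overrides - verify property names and values
# against your Alfresco version's caching defaults before using them.
# The warning in the question above suggests the aspects transactional
# cache limit (65000 by default in some versions) is being hit.
cache.node.aspectsSharedCache.tx.maxItems=130000
```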


Back to wiki home.
