
Dynamic Item Counting

At each Mapper in Phase 1 we use the algorithm by Sergey Brin et al. (see http://www-db.stanford.edu/~sergey/dic.ps) to find the frequent itemsets. Dynamic Item Counting (DIC) reduces the number of passes made over the data while keeping the number of itemsets counted in each pass relatively low compared to other methods.

The intuition behind DIC is that it works like a train with stops at intervals of M transactions. (M is a parameter, which the authors showed to be optimally set at 1000.) When the train reaches the end of the input it has made a pass over the data, and it starts at the beginning again for the next pass. The “passengers” on the train are itemsets; while an itemset is on the train, its occurrences in the transactions are counted. In this model Apriori means: all itemsets get on at the start of a pass and get off at the end; 1-itemsets ride the first pass, 2-itemsets the second pass, and so on. In DIC itemsets are allowed to get on at any stop, as long as they get off at the same stop the next time the train comes around; this ensures that an itemset is counted over all transactions in the file. The benefit is that an itemset starts being counted as soon as it becomes apparent that it may be a candidate itemset.

Idea behind Counting Frequent Itemsets

  • Itemsets form a lattice with the empty itemset at the bottom and the set of all items at the top. In this lattice some itemsets are large (candidates) and the rest are small.
  • It is infeasible to count all itemsets (the number is exponential in the number of items). It is sufficient to count the minimal small itemsets (the small itemsets that do not include any other small itemset), since if an itemset is small all its supersets are small too.

The DIC algorithm marks itemsets in four ways (mirrored in the sketch after this list):

Solid square
confirmed large itemset - an itemset that has been counted over all transactions and whose count exceeds the support threshold.
Solid circle
confirmed small itemset - an itemset that has been counted over all transactions and whose count is below the support threshold.
Dashed square
suspected large itemset - an itemset that is still being counted, and whose count already exceeds the support threshold.
Dashed circle
suspected small itemset - an itemset that is still being counted, and whose count is still below the support threshold.
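
For reference, the four markings map naturally onto an enum. The type below is just an illustration, not a class from this project.

```java
// Illustrative only; not a class from this project.
public enum ItemsetState {
    SOLID_SQUARE,   // confirmed large: fully counted, count exceeds the support threshold
    SOLID_CIRCLE,   // confirmed small: fully counted, count below the support threshold
    DASHED_SQUARE,  // suspected large: still being counted, count already exceeds the threshold
    DASHED_CIRCLE   // suspected small: still being counted, count still below the threshold
}
```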

The Algorithm

  1. The empty itemset is marked as a Solid square. All 1-itemsets are marked as Dashed circles. Other itemsets are not yet considered.
  2. Read M transactions. For each transaction update the counts of Itemsets that are in a Dashed state.
  3. If a Dashed circle has a count that exceeds the support threshold, change its state to a Dashed square. Then introduce some of its supersets into the lattice structure: those supersets for which all subsets are Solid (square or circle). Set the counter for each such superset to 0 and its initial state to Dashed circle.
  4. If a Dashed itemset has been counted over all transactions, make it Solid; it will not be counted in subsequent passes.
  5. If we are at the end of the transaction file, rewind to the beginning of the file.
  6. If the lattice contains any Dashed itemsets, go to step 2 (the loop is sketched below).
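
To make the control flow concrete, here is a minimal sketch of the loop above in Java. The class name and the lattice-style hooks (dashedItemsets, countTransaction, introduceQualifyingSupersets, and so on) are hypothetical, not this project's API; only the step structure comes from the algorithm above.

```java
import java.util.*;

// A sketch of steps 1-6; the hooks below are hypothetical, not this project's API.
abstract class DicLoopSketch {
    enum State { SOLID_SQUARE, SOLID_CIRCLE, DASHED_SQUARE, DASHED_CIRCLE }

    static final int M = 1000;                                      // stop interval from the paper

    abstract Collection<int[]> dashedItemsets();                    // itemsets still being counted
    abstract State stateOf(int[] itemset);
    abstract void setState(int[] itemset, State s);
    abstract long countOf(int[] itemset);
    abstract void countTransaction(int[] sortedTransaction);        // bump counts of dashed itemsets
    abstract boolean countedOverAllTransactions(int[] itemset);
    abstract void introduceQualifyingSupersets(int[] itemset);      // new dashed circles, count 0

    void run(Iterable<int[]> transactions, long minSupport) {
        // Step 1 (empty itemset solid square, 1-itemsets dashed circles) is assumed
        // to have been done when the lattice was initialised.
        do {
            int read = 0;
            for (int[] txn : transactions) {
                countTransaction(txn);                              // step 2
                if (++read % M != 0) continue;                      // a "stop" every M transactions
                for (int[] is : new ArrayList<>(dashedItemsets())) {
                    if (stateOf(is) == State.DASHED_CIRCLE && countOf(is) >= minSupport) {
                        setState(is, State.DASHED_SQUARE);          // step 3
                        introduceQualifyingSupersets(is);
                    }
                    if (countedOverAllTransactions(is)) {           // step 4
                        setState(is, countOf(is) >= minSupport
                                ? State.SOLID_SQUARE : State.SOLID_CIRCLE);
                    }
                }
            }
            // Step 5: end of the transaction file; the outer loop rewinds by re-reading.
        } while (!dashedItemsets().isEmpty());                      // step 6
    }
}
```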

Lattice Data structure

The data structure must support the following operations:

  • add new Itemsets.
  • maintain a counter for each Itemset. Incrementing counts of itemsets from a transaction must be very fast, since this is the heart of the algorithm.
  • maintain itemset states: Dashed square, Dashed circle, Solid square and Solid circle. Detect when state transitions happen.
  • For itemsets that become large, determine what new itemsets should be added to the data structure as dashed circles (sketched as an interface below).
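
These operations can be summarised as a small interface. The names below are illustrative only; they are not the classes used in this project.

```java
import java.util.Iterator;

// Illustrative names only; not this project's actual classes.
interface ItemsetLattice {
    enum State { DASHED_SQUARE, DASHED_CIRCLE, SOLID_SQUARE, SOLID_CIRCLE }

    void add(int[] itemset);                            // new itemsets enter as dashed circles with count 0
    void countTransaction(int[] sortedTransaction);     // the hot path: increment counts of dashed itemsets
    long countOf(int[] itemset);
    State stateOf(int[] itemset);
    void setState(int[] itemset, State newState);       // record detected state transitions
    Iterator<int[]> candidateSupersets(int[] largeItemset); // new dashed circles to introduce
}
```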

The data structure used is like the hash tree used in Apriori, with a little extra information stored at each node. It is a trie with the following properties (a node sketch follows the list):

  • each itemset is sorted by its items.
  • every itemset being counted or already counted has a node in the trie, and so do all of its prefixes.
  • the empty itemset is the root of the trie.
  • the 1-itemsets are attached to the root node and are labelled by the item they represent.
  • all other itemsets are attached to the node for their prefix, the itemset containing all but their last item.
  • See Figure 7 in the paper.
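
A node with these properties might look roughly like the following; the class and field names are illustrative, not the project's actual data structure.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative node shape; not the project's actual data structure.
class ItemsetTrieNode {
    final int item;                                // label of the edge from the parent; -1 at the root
    final Map<Integer, ItemsetTrieNode> children = new HashMap<>(); // one child per next item
    long count;                                    // support count of the itemset this node represents
    boolean dashed = true;                         // still being counted?
    boolean square;                                // has the count exceeded the support threshold?

    ItemsetTrieNode(int item) { this.item = item; }

    // Each node represents the sorted itemset spelled by the item labels on the
    // path from the root (the empty itemset); attaching a child extends the
    // prefix by one item, so every prefix of a counted itemset has a node.
    ItemsetTrieNode child(int nextItem) {
        return children.computeIfAbsent(nextItem, ItemsetTrieNode::new);
    }
}
```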

Incrementing Counts for a Transaction

The algorithm sorts the items in each transaction. The order is based on how many times an item occurs in the transactions; the optimal order is to sort items in reverse order of their counts. Given this, incrementing the counts for the itemsets in a transaction is a simple recursive procedure Increment(Node, Itemset): increment the count at the Node, then for each branch whose label appears in the ItemSet, recursively call Increment on that branch with the suffix of the ItemSet that follows that item.
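
A minimal sketch of this recursive walk, assuming items are encoded as integers and each transaction is a sorted int array; the Node shape and names are illustrative, not the project's code.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the recursive Increment(Node, Itemset) walk; not the project's code.
class IncrementSketch {
    static class Node {
        Map<Integer, Node> children = new HashMap<>();
        long count;
        boolean dashed = true;      // only itemsets still being counted get incremented
    }

    // items is one transaction, sorted in the chosen item order; from is the
    // index of the first item of the suffix handled at this node.
    static void increment(Node node, int[] items, int from) {
        if (node.dashed) node.count++;              // count the itemset this node represents
        for (int i = from; i < items.length; i++) {
            Node child = node.children.get(items[i]);
            if (child != null) {
                increment(child, items, i + 1);     // recurse with the suffix after items[i]
            }
        }
    }
}
```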

Introducing new Itemsets to the lattice.

When a Node becomes SOLID, we need to check every superset S, and for each S check every subset. This requires two iterators and a way to quickly get the Node of the trie for a given Itemset.
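
A hypothetical sketch of that check; the two iterators and the node lookup are shown as abstract hooks, and none of these names are the project's actual API (the real iterator classes are listed in the sections below).

```java
import java.util.Iterator;

// Hypothetical hooks; not this project's API.
abstract class SupersetIntroductionSketch {
    abstract Iterator<int[]> immediateSupersets(int[] itemset);  // itemsets one item longer
    abstract Iterator<int[]> immediateSubsets(int[] itemset);    // itemsets one item shorter
    abstract boolean subsetQualifies(int[] itemset);             // fast trie lookup plus the subset condition described above
    abstract void addDashedCircle(int[] itemset);                // start counting the new itemset at 0

    void introduceSupersets(int[] itemset) {
        for (Iterator<int[]> sup = immediateSupersets(itemset); sup.hasNext(); ) {
            int[] s = sup.next();
            boolean allQualify = true;
            for (Iterator<int[]> sub = immediateSubsets(s); sub.hasNext(); ) {
                if (!subsetQualifies(sub.next())) { allQualify = false; break; }
            }
            if (allQualify) addDashedCircle(s);                  // every subset passes, so count this superset
        }
    }
}
```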

Our implementation

Our implementation varies from the algorithm in the paper in the following way:

  • The algorithm in the paper requires that when itemset [BC] becomes a SOLID we consider only the supersets that come after it in the sort order; so we consider only [BCD]. If [AB] subsequently becomes SOLID, then [ABC] will be introduced to the trie.
  • But if the order is reversed, i.e. [AB] becomes SOLID first and [BC] becomes SOLID at a later time, then by that rule [ABC] would never be introduced to the trie. So we changed the algorithm to look for new supersets to add at any Node that is SOLID.
  • With this change the data structure is simplified; most operations involve trie traversal.

Supersets of an ItemSet

  • Since we only look at itemsets one item longer, formed from items that come after the last item in the sort order, this is an easy iteration (see the sketch after this list).
  • See classes: ImmedSuperSetISet and ImmediateSuperSetsIterator
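
A hypothetical sketch of that iteration, assuming items are encoded as integers 0..numItems-1 in the sort order; this is not the project's ImmedSuperSetISet or ImmediateSuperSetsIterator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the project's ImmedSuperSetISet / ImmediateSuperSetsIterator.
class ImmediateSupersetsSketch {
    static List<int[]> immediateSupersets(int[] sortedItemset, int numItems) {
        int last = sortedItemset.length == 0 ? -1 : sortedItemset[sortedItemset.length - 1];
        List<int[]> result = new ArrayList<>();
        for (int item = last + 1; item < numItems; item++) {
            int[] superset = Arrays.copyOf(sortedItemset, sortedItemset.length + 1);
            superset[sortedItemset.length] = item;   // one item longer, still sorted by item order
            result.add(superset);
        }
        return result;
    }
}
```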

Subsets of an ItemSet

  • Since we only work with itemsets that are sorted by item, we can generate the subsets by removing any single item from the itemset (see the sketch after this list).
  • See classes: ImmedSubSetISet and ImmediateSubSetsIterator
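
A hypothetical sketch of that iteration on a sorted int array; this is not the project's ImmedSubSetISet or ImmediateSubSetsIterator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the project's ImmedSubSetISet / ImmediateSubSetsIterator.
class ImmediateSubsetsSketch {
    static List<int[]> immediateSubsets(int[] sortedItemset) {
        List<int[]> result = new ArrayList<>();
        for (int skip = 0; skip < sortedItemset.length; skip++) {
            int[] subset = new int[sortedItemset.length - 1];
            for (int i = 0, j = 0; i < sortedItemset.length; i++) {
                if (i != skip) subset[j++] = sortedItemset[i];   // drop one item, keep the rest in order
            }
            result.add(subset);
        }
        return result;
    }
}
```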
