frequentItemSetCount incorrect #2

nirehs · 2017-07-31T02:05:46Z

The JUnit test code is:

File inputFile = new File("c:/data1.txt");
int frequentItemSetCount=1;
Apriori<NamedItem> apriori = new Apriori.Builder<>(frequentItemSetCount)
    .supportDelta(0.1).maxSupport(1.0).minSupport(0.0).create();
Iterator<Transaction<NamedItem>> iterator = new DataIterator(inputFile);
Output<NamedItem> output = apriori.execute(iterator);
SortedSet<ItemSet<NamedItem>> frequentItemSets = output.getFrequentItemSets();
System.out.println("frequentItemSets.size():" + frequentItemSets.size());
Iterator<ItemSet<NamedItem>> iteratorItemSet = frequentItemSets.iterator();

while (iteratorItemSet.hasNext()) {
    ItemSet<NamedItem> itemSet = (ItemSet<NamedItem>) iteratorItemSet.next();
    System.out.println("result ............."+ itemSet.toString());
}

data1.txt content is:

# Test data for the Apriori algorithm
# One transaction per line, items are separated with whitespaces

bread   butter  sugar
coffee  milk    sugar
bread   coffee  milk    sugar
coffee  milk

The run result is:

frequentItemSets.size():4
result .............[coffee, milk]
result .............[sugar]
result .............[milk]
result .............[coffee]

frequentItemSetCount = 1, but frequentItemSets.size() = 4


michael-rapp · 2017-07-31T18:12:35Z

This is not a bug, but normal behavior. Specifying a frequentItemSetCount does not guarantee that exactly that many frequent item sets are found. It is just a way to avoid finding very few item sets when the minimum support threshold has been chosen too restrictively. Depending on the given data set, it might not be possible to find as many item sets as specified.

On very small data sets such as the one you used, it is very likely that more frequent item sets are returned than specified. This isn't a bug either. The algorithm successively decreases the minimum support (starting with 1.0 in your example) until enough item sets have been found. If the minimum support used in that last iteration is reached by more item sets than specified, all of them are returned. This is intentional, because the algorithm cannot decide which ones to include (they all reach the same minimum support and there are no criteria for measuring their quality besides that). Furthermore, if association rules are to be generated in a second step, all of the frequent item sets must be used; otherwise the learned rules will be incomplete.

If you only want to find a single item set in the given example, you must decide on your own which item set to keep (probably the first one, because it is the only one including two items).
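To make the behavior on data1.txt concrete, here is a rough, library-independent sketch of a search that lowers the threshold step by step. It is not the library's implementation: the class `SupportSearchSketch`, its method names, and the brute-force enumeration of candidates are all assumptions made for illustration, and plain sets of strings stand in for `ItemSet<NamedItem>`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only; hypothetical names, NOT the library's code.
class SupportSearchSketch {

    // The four transactions from data1.txt.
    static List<Set<String>> sampleTransactions() {
        List<Set<String>> transactions = new ArrayList<>();
        transactions.add(new TreeSet<>(Arrays.asList("bread", "butter", "sugar")));
        transactions.add(new TreeSet<>(Arrays.asList("coffee", "milk", "sugar")));
        transactions.add(new TreeSet<>(Arrays.asList("bread", "coffee", "milk", "sugar")));
        transactions.add(new TreeSet<>(Arrays.asList("coffee", "milk")));
        return transactions;
    }

    // Fraction of transactions that contain every item of the candidate.
    static double support(Set<String> candidate, List<Set<String>> transactions) {
        int hits = 0;
        for (Set<String> transaction : transactions) {
            if (transaction.containsAll(candidate)) {
                hits++;
            }
        }
        return (double) hits / transactions.size();
    }

    // Brute force: every non-empty subset of the items (fine for 5 items).
    static List<Set<String>> allCandidates(List<Set<String>> transactions) {
        Set<String> universe = new TreeSet<>();
        for (Set<String> transaction : transactions) {
            universe.addAll(transaction);
        }
        List<String> items = new ArrayList<>(universe);
        List<Set<String>> candidates = new ArrayList<>();
        for (int mask = 1; mask < (1 << items.size()); mask++) {
            Set<String> candidate = new TreeSet<>();
            for (int i = 0; i < items.size(); i++) {
                if ((mask & (1 << i)) != 0) {
                    candidate.add(items.get(i));
                }
            }
            candidates.add(candidate);
        }
        return candidates;
    }

    // Lowers the threshold from maxSupport by delta until at least
    // targetCount item sets reach it, then returns ALL sets that reach it.
    static List<Set<String>> findFrequentItemSets(List<Set<String>> transactions,
            int targetCount, double maxSupport, double minSupport, double delta) {
        if (delta <= 0) {
            throw new IllegalArgumentException("delta must be positive");
        }
        List<Set<String>> candidates = allCandidates(transactions);
        List<Set<String>> frequent = new ArrayList<>();
        for (int step = 0; maxSupport - step * delta >= minSupport; step++) {
            double threshold = maxSupport - step * delta;
            frequent.clear();
            for (Set<String> candidate : candidates) {
                if (support(candidate, transactions) >= threshold) {
                    frequent.add(candidate);
                }
            }
            if (frequent.size() >= targetCount) {
                break;
            }
        }
        return frequent;
    }

    public static void main(String[] args) {
        List<Set<String>> result =
                findFrequentItemSets(sampleTransactions(), 1, 1.0, 0.0, 0.1);
        System.out.println("size: " + result.size());
        result.forEach(System.out::println);
    }
}
```

With the values from the report (target 1, maximum 1.0, step 0.1), no item set reaches 1.0, 0.9 or 0.8. At roughly 0.7, the sets [coffee], [milk], [sugar] and [coffee, milk] all qualify at once, each with a support of 0.75, so four sets come back even though only one was requested.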

As a future improvement, it would be possible to return a custom implementation of the type SortedSet that provides sort and filter methods, as the class RuleSet does. This would make it easier to manually filter the returned item sets if too many are returned. Progress on that enhancement is tracked here: #4
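Until such methods exist, the filtering has to be done by hand. Below is a minimal sketch of one possible criterion, keeping the item sets with the most items. The class `ItemSetFilterSketch` and its methods are hypothetical, and sets of strings again stand in for `ItemSet<NamedItem>`; with the real types, the same comparator idea should carry over, but that is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only; hypothetical helper, not part of the library.
class ItemSetFilterSketch {

    // The four item sets returned in the report above.
    static List<Set<String>> reportedItemSets() {
        List<Set<String>> itemSets = new ArrayList<>();
        itemSets.add(new TreeSet<>(Arrays.asList("coffee", "milk")));
        itemSets.add(new TreeSet<>(Arrays.asList("sugar")));
        itemSets.add(new TreeSet<>(Arrays.asList("milk")));
        itemSets.add(new TreeSet<>(Arrays.asList("coffee")));
        return itemSets;
    }

    // Keeps at most `limit` item sets, preferring those with more items.
    // The sort is stable, so ties keep their original relative order.
    static List<Set<String>> keepLargest(Collection<Set<String>> itemSets, int limit) {
        List<Set<String>> sorted = new ArrayList<>(itemSets);
        sorted.sort(Comparator.comparingInt((Set<String> s) -> s.size()).reversed());
        int end = Math.max(0, Math.min(limit, sorted.size()));
        return new ArrayList<>(sorted.subList(0, end));
    }

    public static void main(String[] args) {
        System.out.println(keepLargest(reportedItemSets(), 1));
    }
}
```

Set size is only one possible criterion; since all four sets reach the same support here, any other choice is equally defensible, which is why the decision is left to the caller.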

Furthermore, I added additional information to the library's README to avoid future misunderstandings.

michael-rapp self-assigned this Jul 31, 2017

michael-rapp added the invalid label Jul 31, 2017

michael-rapp added a commit that referenced this issue Jul 31, 2017

Updated README to avoid misinterpretations such as in #2

8f4abad

michael-rapp mentioned this issue Jul 31, 2017

Add functionality to easily sort/filter frequent item sets #4

Closed

michael-rapp closed this as completed Jul 31, 2017
