frequentItemSetCount incorrect #2

Closed
nirehs opened this issue Jul 31, 2017 · 1 comment

nirehs commented Jul 31, 2017

The JUnit test code is:

// Read the transactions from the test data file
File inputFile = new File("c:/data1.txt");
int frequentItemSetCount = 1;
Apriori<NamedItem> apriori = new Apriori.Builder<>(frequentItemSetCount)
    .supportDelta(0.1).maxSupport(1.0).minSupport(0.0).create();
Iterator<Transaction<NamedItem>> iterator = new DataIterator(inputFile);
Output<NamedItem> output = apriori.execute(iterator);
SortedSet<ItemSet<NamedItem>> frequentItemSets = output.getFrequentItemSets();
System.out.println("frequentItemSets.size():" + frequentItemSets.size());

// Print every frequent item set that was found
for (ItemSet<NamedItem> itemSet : frequentItemSets) {
    System.out.println("result ............." + itemSet.toString());
}

data1.txt content is:

# Test data for the Apriori algorithm
# One transaction per line, items are separated with whitespaces

bread   butter  sugar
coffee  milk    sugar
bread   coffee  milk    sugar
coffee  milk

The run result is:

frequentItemSets.size():4
result .............[coffee, milk]
result .............[sugar]
result .............[milk]
result .............[coffee]

frequentItemSetCount = 1, but frequentItemSets.size() = 4

michael-rapp (Owner) commented Jul 31, 2017

This is not a bug, but normal behavior. Specifying a frequentItemSetCount does not guarantee that exactly that many frequent item sets are found. It is just a way to avoid finding very few item sets when the minimum support threshold has been chosen too restrictively. Depending on the given data set, it might not be possible to find as many item sets as specified. On very small data sets such as the one you used, it is very likely that more frequent item sets are returned than specified. This isn't a bug either.

The algorithm successively decreases the minimum support (starting with 1.0 in your example) until enough item sets have been found. If the minimum support used in that last iteration is reached by more item sets than specified, all of them are returned. This is intentional, because the algorithm cannot decide which ones to include: they all reach the same minimum support, and there is no other criterion for measuring their quality. Furthermore, if association rules are to be generated in a second step, all of the frequent item sets must be used; otherwise the learned rules will be incomplete. If you only want to find a single item set in the given example, you must decide on your own which item set to keep (probably the first one, because it is the only one that includes two items).
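To make the threshold-lowering behavior concrete, here is a minimal, self-contained sketch. It is not the library's code: the class `SupportLoweringSketch` and its methods `tx`, `support`, `candidates` and `mine` are hypothetical names, and the brute-force candidate enumeration stands in for the real Apriori candidate generation. It only assumes that the threshold starts at maxSupport and is lowered by supportDelta until at least the requested number of item sets is reached. On the data from this issue, sugar, milk, coffee and [coffee, milk] each occur in 3 of 4 transactions, so they all pass the threshold in the same iteration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, NOT the library's implementation.
class SupportLoweringSketch {

    // Convenience helper to build one transaction or item set.
    static Set<String> tx(String... items) {
        return new TreeSet<>(Arrays.asList(items));
    }

    // Fraction of transactions that contain every item of the candidate.
    static double support(List<Set<String>> transactions, Set<String> candidate) {
        long hits = transactions.stream().filter(t -> t.containsAll(candidate)).count();
        return (double) hits / transactions.size();
    }

    // Enumerates every non-empty subset of every transaction (tiny data sets only).
    static Set<Set<String>> candidates(List<Set<String>> transactions) {
        Set<Set<String>> result = new LinkedHashSet<>();
        for (Set<String> t : transactions) {
            List<String> items = new ArrayList<>(t);
            for (int mask = 1; mask < (1 << items.size()); mask++) {
                Set<String> subset = new TreeSet<>();
                for (int i = 0; i < items.size(); i++) {
                    if ((mask & (1 << i)) != 0) {
                        subset.add(items.get(i));
                    }
                }
                result.add(subset);
            }
        }
        return result;
    }

    // Lowers the threshold from maxSupport in steps of delta (assumed > 0)
    // until at least wantedCount item sets reach it, then returns ALL of them.
    static List<Set<String>> mine(List<Set<String>> transactions, int wantedCount,
                                  double maxSupport, double minSupport, double delta) {
        Set<Set<String>> all = candidates(transactions);
        List<Set<String>> frequent = new ArrayList<>();
        // Count steps with an integer to avoid accumulating floating-point error.
        int steps = (int) Math.round((maxSupport - minSupport) / delta);
        for (int step = 0; step <= steps; step++) {
            double threshold = maxSupport - step * delta;
            frequent.clear();
            for (Set<String> candidate : all) {
                if (support(transactions, candidate) >= threshold - 1e-9) {
                    frequent.add(candidate);
                }
            }
            if (frequent.size() >= wantedCount) {
                break; // enough item sets found; ties are all kept
            }
        }
        return frequent;
    }

    public static void main(String[] args) {
        List<Set<String>> data = Arrays.asList(
                tx("bread", "butter", "sugar"),
                tx("coffee", "milk", "sugar"),
                tx("bread", "coffee", "milk", "sugar"),
                tx("coffee", "milk"));
        List<Set<String>> found = mine(data, 1, 1.0, 0.0, 0.1);
        System.out.println(found.size()); // 4 for the data from this issue
        System.out.println(found);
    }
}
```

With a requested count of 1, the thresholds 1.0, 0.9 and 0.8 yield nothing, and at 0.7 four item sets with support 0.75 qualify at once, which matches the reported output.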

As a future improvement, it would be possible to return a custom implementation of the type SortedSet that provides sort and filter methods, as the class RuleSet does. This would make it easier to manually filter the returned item sets if too many are returned. Progress on that enhancement is tracked in #4.
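Until such sort and filter methods exist, the manual filtering can be done on the caller's side. The following is only a sketch under assumptions: it works on plain `Set<String>` values instead of the library's ItemSet type, and `ItemSetFilterSketch`, `of` and `keepLargest` are hypothetical names. It prefers larger item sets, which is one possible criterion and the one suggested for the example in this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical sketch of caller-side filtering, NOT part of the library.
class ItemSetFilterSketch {

    // Convenience helper to build one item set.
    static Set<String> of(String... items) {
        return new TreeSet<>(Arrays.asList(items));
    }

    // Keeps at most 'limit' item sets, preferring those with more items.
    // The sort is stable, so equally sized item sets keep their input order.
    static List<Set<String>> keepLargest(Collection<Set<String>> itemSets, int limit) {
        return itemSets.stream()
                .sorted(Comparator.comparingInt((Set<String> s) -> s.size()).reversed())
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Set<String>> frequentItemSets = new ArrayList<>();
        frequentItemSets.add(of("coffee", "milk"));
        frequentItemSets.add(of("sugar"));
        frequentItemSets.add(of("milk"));
        frequentItemSets.add(of("coffee"));
        System.out.println(keepLargest(frequentItemSets, 1)); // prints [[coffee, milk]]
    }
}
```

If association rules are generated afterwards, the unfiltered result should still be used for that step, for the reason given above.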

Furthermore, I added additional information to the library's README to avoid future misunderstandings.
