Documentation: Proper DataSource format and usage for K-Means Clustering #72

Closed · lincolnthree opened this issue Oct 22, 2020 · 32 comments · Label: enhancement (New feature or request)

lincolnthree commented Oct 22, 2020

Is your feature request related to a problem? Please describe.
Still a newbie to this library, so thanks for bearing with me.

Right now, the documentation shows how to run K-Means clustering on an auto-generated data set of Gaussian clusters. This is great, as it shows K-Means is possible, but (unless I'm missing something) it does not show the steps to input real data. (It mentions "You can also use any of the standard data loaders to pull in clustering data," but I don't see where that's documented.)

I've figured out how to load a CSV file of features and metadata (thanks to your new Columnar tutorial), but I can't seem to infer how to connect this data with KMeansTrainer, or whether that's even the right approach.

Describe the solution you'd like
A clear and concise description/example of how to load real-world (non-autogenerated) data into the K-Means algorithm.

Describe alternatives you've considered
Looking through JavaDocs, but having trouble knowing what to focus on.

Additional context
[screenshot attached]

lincolnthree added the enhancement label on Oct 22, 2020
lincolnthree (Author) commented:

I think I'm conflating Label with ClusterID, but I'm looking for where in the docs I can clarify that.

lincolnthree (Author) commented Oct 22, 2020

If it helps, essentially what I'm trying to do is associate named features to coordinates such that they can be classified/clustered using K-Means. There's probably a way to do this already, but I'm missing it.

I'd then like to use the trained output clusters as an input for classification.

Craigacp (Member) commented:

Do you have ground truth cluster ids in your csv file? If not, then you can use any response processor you like; make sure you pass it a ClusteringFactory rather than a LabelFactory, and your datasource and dataset generic type should become ClusterID, not Label. Label is only used for multiclass classification problems; ClusterID is the output type for clustering problems. Also make sure you pass the false flag to CSVDataSource so it knows it's ok for there not to be an output variable in the csv file.

If you do have ground truth cluster ids (and those ids are integers), then you should use FieldResponseProcessor as that will pass them through unaltered to the ClusteringFactory so they can be converted into ClusterID instances. If the ground truth clusters aren't integers, then at the moment you'll need to write a specific ResponseProcessor implementation that maps those values into the integers.

The overall architecture of Tribuo's output system is discussed here - https://tribuo.org/learn/4.0/docs/architecture.html#structure, but there's nothing specific on loading in ClusterID vs Label. The strong typing means that Tribuo behaves a little differently to other ML packages.

Once you've got the ClusterIDs out, you can get the int values from those ids and add them as features to other examples. Note you should do this before those examples are added to a Dataset, as otherwise the features won't be recorded in the domain and won't be visible to the downstream classifier.

If everything is in memory then you can construct the Example<Label>s and feed them to a ListDataSource<Label> before passing that to a MutableDataset<Label>, or you can directly pass the Iterable<Example<Label>> to the mutable dataset constructor with an appropriate SimpleDataSourceProvenance. Unfortunately this step will lose the provenance of the original clustering, as we don't have a chaining provenance you can supply to the dataset. We're working on the design for such a thing as it comes up in a few places.
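
For example, that chaining step might look roughly like this. This is an untested sketch: "cluster-id" and "next-label" are placeholder names, and kmeansModel/clusterDataset are assumed to already exist.

    // Untested sketch: add each example's cluster assignment as a feature on a new
    // classification example *before* it enters a Dataset. Placeholder names throughout.
    List<Example<Label>> downstream = new ArrayList<>();
    for (Example<ClusterID> clusterExample : clusterDataset.getData()) {
        int clusterId = kmeansModel.predict(clusterExample).getOutput().getID();
        var labelled = new ArrayExample<Label>(new Label("next-label"));
        labelled.add(new Feature("cluster-id", clusterId));
        // ... add whatever other features the downstream classifier needs ...
        downstream.add(labelled);
    }
    var provenance = new SimpleDataSourceProvenance("clustered source", new LabelFactory());
    var source = new ListDataSource<>(downstream, new LabelFactory(), provenance);
    var classificationData = new MutableDataset<>(source);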

lincolnthree (Author) commented Oct 22, 2020

Thank you so much. This is extremely helpful. I think I'm mostly following. Again, please bear with me: I'd describe myself as an experienced programmer who is new to machine learning.

I don't have ground truth cluster ids. What I have is essentially tabular data where each row has an id, a name, a name weight (the quality of the name), and N sparse feature ngrams, where N is roughly 10-60 populated features per row, but across the whole dataset there are upwards of 15,000 unique (and obviously very sparse) features.

Essentially, this:

"ID","NAME","WEIGHT","FEATURES"
"ROWID-XYZ","ROWNAME-XYZ",1,"C1:C1:C1:C1:C2:C2:C2:C3:C4:C4:C5:C10:C10:C14:C14"

I'm wondering if there will be issues with duplicate features? E.g. do I need to coerce "C1:C1:C1" -> "C1" or in integer form "1,1,1" -> "1"... etc?

My goal is to cluster this data based on the features, weighted by how many times that feature appears. (Not sure how to add multi-dimensional data?)

So I've currently ended up with this:

    private MutableDataset<ClusterID> getClusteringDataset(Path csvInputPath)
    {
        var fieldProcessors = new HashMap<String, FieldProcessor>();
        fieldProcessors.put("features",
                    new TextFieldProcessor("cards", new BasicPipeline(new SplitPatternTokenizer(","), 1)));
        fieldProcessors.put("format", new IdentityProcessor("format"));

        var responseProcessor = new FieldResponseProcessor<>("features", "-1", new ClusteringFactory());

        var metadataExtractors = new ArrayList<FieldExtractor<?>>();
        metadataExtractors.add(new IdentityExtractor("id"));
        metadataExtractors.add(new IdentityExtractor("name"));

        var weightExtractor = new FloatExtractor("weight"); // not sure I need this for clustering since name-weight/confidence doesn't matter - that will happen in another step

        var rowProcessor = new RowProcessor<ClusterID>(metadataExtractors, weightExtractor, responseProcessor, fieldProcessors, Collections.emptySet());

        var csvSource = new CSVDataSource<ClusterID>(csvInputPath, rowProcessor, false);

        return new MutableDataset<ClusterID>(csvSource);
    }

And as you mentioned, I need to convert the string feature values into integers such that the ResponseProcessor (or a custom ResponseProcessor) can handle them. I've done this before CSV generation by indexing them all as I generate the CSV, writing integers instead of the string IDs.

However, I'm still a bit unclear on how to tokenize and read those values via the TextFieldProcessor and FieldResponseProcessor:

java.lang.NumberFormatException: For input string: "0,0,1,2,2,2,2,3,3,3,4,4,5,6,6,6,6,7,7,8,8,9,9,9,9,10,10,10,10,11,11,11,11,12,12,12,12,13,13,13,13,13,14,14,14,14,15,15,15,16,16,16,16,17,17,18,18,18,19,19"
        at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:68)
        at java.base/java.lang.Integer.parseInt(Integer.java:652)
        at java.base/java.lang.Integer.parseInt(Integer.java:770)
        at org.tribuo.clustering.ClusteringFactory.generateOutput(ClusteringFactory.java:59)
        at org.tribuo.clustering.ClusteringFactory.generateOutput(ClusteringFactory.java:37)
        at org.tribuo.data.columnar.processors.response.FieldResponseProcessor.process(FieldResponseProcessor.java:76)
        at org.tribuo.data.columnar.RowProcessor.generateExample(RowProcessor.java:177)
        at org.tribuo.data.columnar.ColumnarDataSource$InnerIterator.hasNext(ColumnarDataSource.java:97)
        at org.tribuo.MutableDataset.<init>(MutableDataset.java:82)
        at org.tribuo.MutableDataset.<init>(MutableDataset.java:93)
        at GenerateClusters.getClusteringDataset(generate-clusters.java:189)
        at GenerateClusters.call(generate-clusters.java:95)
        at GenerateClusters.call(generate-clusters.java:1)
        at picocli.CommandLine.executeUserObject(CommandLine.java:1933)
        at picocli.CommandLine.access$1100(CommandLine.java:145)
        at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2332)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2326)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2291)
        at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2159)
        at picocli.CommandLine.execute(CommandLine.java:2058)
        at GenerateClusters.main(generate-clusters.java:82)

So I think I'm missing how to create a FieldProcessor that reads features and passes them through as integers from the CSV.

        fieldProcessors.put("features",
                    new TextFieldProcessor("cards", new BasicPipeline(new SplitPatternTokenizer(","), 1)));

Also, just as a thought/aside about the API:

I assume based on your response that there is no ResponseProcessor that automatically maps strings to integers (and back again)? That could be useful as domain models in applications today are starting to favor use of UUIDs/strings (not ints) for IDs.

But I'm guessing there may be issues with carrying that mapping through the pipeline which make it less simple?

Either way it shouldn't be too hard to do in my code if I can figure out how to read the feature IDs as ints.

lincolnthree (Author) commented Oct 22, 2020

Also a quick note. I've tried using DoubleFieldProcessor, but it doesn't look like that allows for splitting input values via a Pipeline.

lincolnthree (Author) commented Oct 22, 2020

Also trying to extend DoubleFieldProcessor, not sure if this is the right approach:



    class DoubleListFieldProcessor extends DoubleFieldProcessor
    {
        private String splitRegex;

        public DoubleListFieldProcessor(String fieldName, String splitRegex)
        {
            super(fieldName);
            this.splitRegex = splitRegex;
        }

        @Override
        public List<ColumnarFeature> process(String values)
        {
            var features = new ArrayList<ColumnarFeature>();
            for (String value : values.split(this.splitRegex)) {
                features.addAll(super.process(value));
            }
            return features;
        }
    }

Craigacp (Member) commented Oct 22, 2020

> I'm wondering if there will be issues with duplicate features? E.g. do I need to coerce "C1:C1:C1" -> "C1" or in integer form "1,1,1" -> "1"... etc?

To deal with extracted duplicate features you should use UniqueProcessor, which will let you aggregate tokens (though the BasicPipeline won't emit duplicate features; it uniques them first). You probably want to use TokenPipeline and turn on termCounting, which replaces multiple instances of a feature with a single feature whose value is the number of instances.

> So I think I'm missing how to create a FieldProcessor that reads features and passes them through as integers from the CSV.

Is it possible to paste a line or two of the data? I think I'm missing something about how the data is set up. Tribuo's feature space is both named and implicitly sparse, so I think it should map pretty easily to your task, but it looks like the row processor is mangling your inputs in some way.

In general it's not necessary to map feature names into id numbers, as Tribuo will do that for you because everything at the user level is named, not numbered. The only things that are actually numbered are cluster ids, and even then we could consider making them named to fit better with the rest of Tribuo (though that would be a breaking change and so would have to wait till the next major release).

Craigacp (Member) commented Oct 22, 2020

> Also trying to extend DoubleFieldProcessor, not sure if this is the right approach

That will emit multiple features where each one has the same name (as it comes from the field name) and different values based on the value that was extracted. I think you probably want to emit a feature with the name <field-name>:idx and the value extracted from that index, so you'd either need to rebuild the feature after it comes out of super.process, or not subclass DoubleFieldProcessor and instead bring in the emitting logic from that class, modified to include the index.
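
To make that concrete, here's an untested sketch of a standalone FieldProcessor along those lines. The class name is hypothetical, and I'm assuming the 4.0-era FieldProcessor/ColumnarFeature signatures (including the two-argument provenance constructor), so check the javadoc before relying on it:

    // Untested sketch: splits a delimited numeric field and emits one REAL feature
    // per position, embedding the index in the feature name so positions stay distinct.
    import java.util.ArrayList;
    import java.util.List;

    import com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance;
    import com.oracle.labs.mlrg.olcut.provenance.impl.ConfiguredObjectProvenanceImpl;
    import org.tribuo.data.columnar.ColumnarFeature;
    import org.tribuo.data.columnar.FieldProcessor;

    public class IndexedDoubleListProcessor implements FieldProcessor {
        private final String fieldName;
        private final String splitRegex;

        public IndexedDoubleListProcessor(String fieldName, String splitRegex) {
            this.fieldName = fieldName;
            this.splitRegex = splitRegex;
        }

        @Override
        public String getFieldName() {
            return fieldName;
        }

        @Override
        public GeneratedFeatureType getFeatureType() {
            return GeneratedFeatureType.REAL;
        }

        @Override
        public List<ColumnarFeature> process(String values) {
            List<ColumnarFeature> features = new ArrayList<>();
            String[] parts = values.split(splitRegex);
            for (int i = 0; i < parts.length; i++) {
                // The column entry carries the index, so each position gets its own name.
                features.add(new ColumnarFeature(fieldName, Integer.toString(i),
                        Double.parseDouble(parts[i].trim())));
            }
            return features;
        }

        @Override
        public ConfiguredObjectProvenance getProvenance() {
            return new ConfiguredObjectProvenanceImpl(this, "FieldProcessor");
        }
    }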

lincolnthree (Author) commented Oct 22, 2020

Interesting. Wouldn't adding the Feature IDs actually create potentially duplicate features? E.g. 1+3 = 4 and 2+2 = 4.

> Is it possible to paste a line or two of the data?

Absolutely!

ID,NAME,WEIGHT,TYPE,FEATURE_IDS
48dc6570-b4b5-4150-b4da-e5868c17f641,"Simic Ramp",1,standard,"67,67,67,67,358,358,358,358,523,523,363,363,363,363,360,360,360,360,11,11,11,11,11,364,364,364,364,453,453,453,453,13,13,13,13,13,13,13,13,13,13,492,492,361,361,361,361,76,76,125,125,330,330,330,330,425,365,365,365,500"
75e7c4f0-4db5-4991-9ecf-2bfe8bc83db0,"Mardu Knights",1,standard,"223,223,223,223,384,384,384,384,70,70,385,385,385,385,317,317,317,317,373,373,373,373,8,8,8,8,320,320,320,320,434,434,434,434,322,322,322,322,27,27,27,27,28,28,28,28,77,29,324,324,324,324,154,154,154,154,238,238,238,238"
5a098704-e350-4913-9cae-ac887ef1c93a,"Jund Food",1,standard,"441,441,441,441,403,403,403,403,503,503,70,70,329,329,329,329,368,368,368,368,110,110,110,8,444,444,444,444,498,515,515,515,211,211,211,211,13,13,13,13,13,14,14,14,14,445,445,445,445,371,371,371,371,77,324,324,324,324,55,55"
9ccce337-6973-4573-8118-b2709f492d61,"Rakdos Aristocrats",1,standard,"441,441,441,441,70,70,70,70,70,70,70,70,70,442,442,442,151,151,338,338,338,338,8,8,8,8,8,443,443,443,443,444,444,444,444,340,340,340,340,14,14,14,14,342,342,342,342,472,445,445,445,445,77,77,324,324,324,324,538,538"
43c38ed1-4027-435d-a0d7-a2676f22e9c4,"Rakdos Knights",1,standard,"384,384,384,384,70,70,70,70,70,70,70,70,70,385,385,385,385,373,373,373,373,110,110,8,8,8,8,8,8,320,320,320,320,340,340,340,340,27,27,27,27,28,28,28,28,437,437,437,437,77,19,19,324,324,324,324,154,154,154,154"
bb93bd9a-e415-4d34-b5b7-dc70c3b07956,"White Weenie",1,standard,"314,314,314,314,458,458,458,458,405,405,318,318,318,318,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,459,459,459,459,460,460,460,460,495,495,495,495,407,407,407,407,528,528,528,528,51,51,51,51,409,409,409,409,199"
229a5e57-403c-4af5-9f22-c6bbb0192f17,"Gruul Aggro",1,standard,"554,554,554,554,329,329,329,329,541,541,57,57,57,57,312,312,312,312,8,8,8,8,8,8,8,8,333,333,475,475,475,475,477,477,477,477,13,13,13,13,13,13,13,13,13,13,14,14,28,28,28,28,478,478,478,478,19,19,19,19"
9b285f98-9ce2-4d41-8f83-c715187a8f9f,"Golgari Food",1,standard,"441,441,441,441,402,402,402,402,375,375,403,403,403,403,70,70,70,70,70,70,362,362,368,368,368,368,110,110,110,443,443,443,443,498,498,498,211,211,211,211,271,13,13,13,13,13,13,13,13,13,13,13,14,14,14,445,445,445,445,77"
45065bf9-f6d2-49c1-8b11-7ee1402449c9,"Simic Flash",1,standard,"67,67,67,67,412,412,412,413,413,413,362,362,362,108,108,363,363,363,363,414,414,414,414,7,7,11,11,11,11,11,11,11,364,364,453,453,453,453,415,415,415,415,13,13,13,13,13,13,13,14,14,76,76,330,330,330,330,54,353,353"
e7d2b12f-d85a-40b1-81e1-3b8845ff71ff,"Jeskai Fires",1,standard,"158,158,158,183,183,183,446,446,446,446,325,325,325,325,326,326,326,326,447,447,447,317,317,5,8,8,32,32,448,448,11,11,327,327,327,327,449,449,449,449,14,14,14,352,352,352,352,450,450,450,451,451,451,451,76,76,76,19,19,19"

As you can see, each row has any number of features and there's no specific order.

lincolnthree (Author) commented:

Looking at the APIs you mentioned.

Craigacp (Member) commented Oct 22, 2020

> Wouldn't adding the Feature IDs actually create potentially duplicate features?

It's cool you're using Tribuo on MtG data.

Assuming those numbers are card ids, it might be simpler for you to process this in a different way, but given the format the data is in you can do this:

        var fieldProcessors = new HashMap<String, FieldProcessor>();
        fieldProcessors.put("FEATURE_IDS",
                    new TextFieldProcessor("cards", new TokenPipeline(new SplitPatternTokenizer(","), 1, true)));
        fieldProcessors.put("TYPE", new IdentityProcessor("format"));

        var responseProcessor = new FieldResponseProcessor<>("blank", null, new ClusteringFactory());

        var metadataExtractors = new ArrayList<FieldExtractor<?>>();
        metadataExtractors.add(new IdentityExtractor("ID"));
        metadataExtractors.add(new IdentityExtractor("NAME"));

        var rowProcessor = new RowProcessor<ClusterID>(metadataExtractors, null, responseProcessor, fieldProcessors, Collections.emptySet());

        var csvSource = new CSVDataSource<ClusterID>(csvInputPath, rowProcessor, false);

        return new MutableDataset<ClusterID>(csvSource);

Basically all I did was switch BasicPipeline for TokenPipeline, then point the response processor at a non-existent field and give it the default value null, which causes it to never emit a response, meaning the RowProcessor always emits the unknown cluster id for each example.

This is what the first example looks like:

ArrayExample(numFeatures=18,output=-1,weight=1.0,metadata={ID=48dc6570-b4b5-4150-b4da-e5868c17f641, NAME=Simic Ramp},features=[(cards@1-N=11, 5.0)(cards@1-N=125, 2.0), (cards@1-N=13, 10.0), (cards@1-N=330, 4.0), (cards@1-N=358, 4.0), (cards@1-N=360, 4.0), (cards@1-N=361, 4.0), (cards@1-N=363, 4.0), (cards@1-N=364, 4.0), (cards@1-N=365, 3.0), (cards@1-N=425, 1.0), (cards@1-N=453, 4.0), (cards@1-N=492, 2.0), (cards@1-N=500, 1.0), (cards@1-N=523, 2.0), (cards@1-N=67, 4.0), (cards@1-N=76, 2.0), (format@standard, 1.0), ])
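
From there, training is the same as in the clustering tutorial, e.g. with 5 centroids, 10 iterations, Euclidean distance, a single thread, and seed 1 (reusing your getClusteringDataset method):

    // Train K-Means on the columnar dataset built above (4.0 constructor signature).
    var dataset = getClusteringDataset(csvInputPath);
    var trainer = new KMeansTrainer(5, 10, KMeansTrainer.Distance.EUCLIDEAN, 1, 1);
    var model = trainer.train(dataset);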

lincolnthree (Author) commented Oct 22, 2020

You know Magic! (I guess I shouldn't be surprised) :) That's awesome. I'm now super excited. Do you actively play, or are you just familiar?

So... yes! I'm working on a new Metagame analysis for www.topdecked.com. I used to do R&D for Red Hat but this is my chosen career as of a year or so ago. (Got burned out.)

I think I see what you did there. It looks like I was close-ish. The main differences are that the FieldResponseProcessor points at a 'blank' non-existent field, as you mentioned, and the use of TokenPipeline.

I was trying to assign this to a field and that's where it was blowing up. Let me try this.

lincolnthree (Author) commented Oct 22, 2020

Out of curiosity, what's the purpose of the ResponseProcessor in clustering? I gather that it processes the designated response fields using the supplied OutputFactory to convert the field text into an Output instance.

But isn't that more for classification when you're either gathering possible outputs, or evaluating for a result output via an already trained dataset?

I guess this is potentially just a downstream effect of the shared nature of this (actually very nice) generic dataset pipeline?

lincolnthree (Author) commented:

Hot dog!

Number of examples = 500
Number of features = 556
Label domain = []
Example = ArrayExample(numFeatures=21,output=-1,weight=1.0,metadata={name=Four-Color Omnath, id=876c6326-a40d-438b-89c0-825e647370d0},features=[(cards@1-N=0, 2.0)(cards@1-N=1, 1.0), (cards@1-N=10, 4.0), (cards@1-N=11, 4.0), (cards@1-N=12, 4.0), (cards@1-N=13, 5.0), (cards@1-N=14, 4.0), (cards@1-N=15, 3.0), (cards@1-N=16, 4.0), (cards@1-N=17, 2.0), (cards@1-N=18, 3.0), (cards@1-N=19, 2.0), (cards@1-N=2, 4.0), (cards@1-N=3, 3.0), (cards@1-N=4, 2.0), (cards@1-N=5, 1.0), (cards@1-N=6, 4.0), (cards@1-N=7, 2.0), (cards@1-N=8, 2.0), (cards@1-N=9, 4.0), (format@standard, 1.0), ])

Craigacp (Member) commented Oct 22, 2020

> Do you actively play, or are you just familiar?

I play intermittently. I used to play a lot more back in the UK, but when I moved to the US it became much less frequent (plus my daughter isn't old enough to play it yet). Obviously this year I've been stuck playing Arena, though I did get to play a mystery booster draft at PAX East in February (though I did pretty badly).

> Out of curiosity, what's the purpose of the ResponseProcessor in clustering? [...] I guess this is potentially just a downstream effect of the shared nature of this (actually very nice) generic dataset pipeline?

It's partially an issue with a shared input pipeline, but sometimes there is a ground truth clustering that you want to measure the performance of the system against (e.g. when developing new clustering algorithms, or trying a clustering approach to some other supervised learning task), and so it is useful.

We should probably make it simpler to turn off in this kind of use case though, as both the clustering and anomaly detection tasks are likely to hit this issue. The way I turned it off above is pretty esoteric and requires knowledge of internal codepaths which aren't too well documented (but at least it's open source so you can read it).

lincolnthree (Author) commented Oct 22, 2020

> I play intermittently. [...] Actually a bunch of the Tribuo developers play MtG.

That's cool. I miss having co-workers. That's one downside of going solo on a project like this. Yeah, Arena just isn't the same as sitting down with some friends or going to a card shop. But it has its merits. I prefer paper myself as I've been playing since '97. I haven't played Arena in months, since I started kicking it into gear trying to get this project done and into the wild. Happy to give you (or anyone else you want) a beta account if you're at all interested in seeing this in action (but I digress.)

> We should probably make it simpler to turn off in this kind of use case though...

A NoOpResponseProcessor() or UnknownClusterResponseProcessor comes to mind, I'm sure you already thought of that, though. Could easily just be a wrapper for what you did above, using some esoteric key/field name nobody would possibly be able to use. A GUID comes to mind, but I'm sure there's a better internal solution than randomly naming a non-existent field.

lincolnthree (Author) commented Oct 22, 2020

Out of curiosity, do you guys have any packages in this library that will do "automatic" determination of K based on some kind of metrics? E.g. Elbow method or silhouette, etc?

EDIT: I see the centroids already have a number of metrics/analysis methods provided. I'll check those out and it shouldn't be too hard to implement something like this.

Craigacp (Member) commented Oct 22, 2020

> A NoOpResponseProcessor() or UnknownClusterResponseProcessor comes to mind [...] Could easily just be a wrapper for what you did above, using some esoteric key/field name nobody would possibly be able to use.

Yeah, I think we might want to modify the constructors to make it simpler, but ensuring that it interacts properly with the configuration system will require a bit of thought.

> Out of curiosity, do you guys have any packages in this library that will do "automatic" determination of K based on some kind of metrics? E.g. Elbow method or silhouette, etc?

We don't have any kind of hyperparameter optimization built in yet, but it's something we're interested in. Making it work across all the things Tribuo supports might be tricky, so it requires some thought.

The metrics are mostly about measuring against ground truth clusterings, so they're not applicable to your use case; we need to add some more which measure qualities of the clusters themselves.
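
That said, a manual elbow sweep over k isn't much code against the public API. A rough, untested sketch, assuming the dataset variable from earlier and that predicted cluster ids index into the centroid array:

    // Untested sketch of an elbow search: train at increasing k and track the
    // within-cluster sum of squared distances (WCSS).
    for (int k = 2; k <= 80; k += 2) {
        var trainer = new KMeansTrainer(k, 25, KMeansTrainer.Distance.EUCLIDEAN, 4, 1);
        var model = trainer.train(dataset);
        var centroids = model.getCentroidVectors();
        var fmap = model.getFeatureIDMap();
        double wcss = 0.0;
        for (Example<ClusterID> example : dataset) {
            int id = model.predict(example).getOutput().getID();
            var vector = SparseVector.createSparseVector(example, fmap, false);
            double dist = centroids[id].subtract(vector).twoNorm();
            wcss += dist * dist;
        }
        System.out.println("k = " + k + ", WCSS = " + wcss);
        // Plot WCSS against k and pick the "elbow" where the curve flattens.
    }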

lincolnthree (Author) commented Oct 22, 2020

Gotcha. This is where the fun begins :) Thank you for all of your help.

A simple metric based on whether each feature/card has a hypothetically valid quantity in a given cluster:

545: 0.26x Realm-Cloaked Giant // Cast Off
546: 0.25x Noxious Grasp
547: 0.63x Murder
548: 0.34x Kasmina, Enigmatic Mentor
549: 0.15x God-Eternal Kefnet
550: 0.87x Disinformation Campaign
555: 0.96x null
CENTROID 0 has 32/193 or 17% valid cards

Not a very good cluster ;) Time for tuning!

Related to optimization, it would be awesome if you could provide your own evaluation classes/functions to the ClusteringEvaluator, something like:

List<ClusterMetric> metrics = .... (your metrics here)
Map<ClusterID, ClusterEvaluationResult> results = ClusteringEvaluator.quality(model, metrics)

Where ClusterEvaluationResult has some overall aggregate quality score, but also the individual scores from the provided metrics, which could themselves be weighted.

Just spitballing here :) All of this can be achieved now, of course, by writing a little extra code, and this is an awesomely powerful project already.

Craigacp (Member) commented:

We thought about user defined metrics when building the evaluation system, and decided against it for the first public release. The underlying design should allow user metrics to drop in when we enable them, but yes at the moment you'll need to write additional code to aggregate your own metrics into your own Evaluation subclass.

lincolnthree (Author) commented Oct 22, 2020

@Craigacp Sorry to bug you again. One more question about all this. I'm having a bit of confusion interpreting the centroid results.

Now that the features are converted to integers, that part is working. Where do I find each feature (by ID?) in the resultant clusters? I thought I was correctly assuming that the centroid vectors are indexed by feature ID, and that each average feature value is accessible via centroid.get(id), but now I'm not 100% sure.

var centroids = model.getCentroidVectors();
for (var centroid : centroids) {
    System.out.println(centroid);
    centroid.get(1); // <-- What does this return? Feature ID 1's value for this centroid, or something else?
}

If this doesn't return the feature value/score, where can I get that?

Also, it seems like getTopFeatures() returns an empty map for the KMeansModel type, which seems intentional. Is there a recommended way to get that info? If not, which API should I start with?

I feel like I'm missing something basic in the docs. Is this all explained somewhere I could look at without bothering you here?

I've found this page: https://tribuo.org/learn/4.0/docs/packageoverview.html, the JavaDocs, and the tutorials.

Craigacp (Member) commented Oct 22, 2020

I suspect it's a gap in the docs. Unfortunately you've hit one of the places where Tribuo exposes its integer ids to the world. We should patch that to give a more user friendly view on it.

The centroid vectors are DenseVector instances. The index should be looked up in model.getFeatureIDMap().get(i) which will return a VariableIDInfo instance which contains the feature name. In your case those names will be of the form "cards@1-N=10" and that last number (i.e. 10) is the card id number. I think that if you wrote a custom FieldProcessor you could pass in "10 x Plains; 5 x Mountain; ..." and then have it emit features of the form (Plains,10),(Mountain,5)..., which would make the output easier to see.

The value of the centroid at a specific index is the point in that dimension (e.g. the quantity of that card).

> Also, it seems like getTopFeatures() returns an empty map for the KMeansModel type, which seems intentional. Is there a recommended way to get that info? If not, which API should I start with?

There is no notion of top features for a K-Means model which is why it gives the empty map. As the task isn't supervised it's hard to say what feature contributes most to a particular clustering. I guess we could compute the features which separate the data the most, but that would only work for the training data, not any future points you passed in to determine the clustering.
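
Putting the centroid lookup described above together, an untested sketch:

    // Untested sketch: print each centroid dimension as (featureName, value),
    // using the feature map to translate dense vector indices back to names.
    DenseVector[] centroids = model.getCentroidVectors();
    ImmutableFeatureMap fmap = model.getFeatureIDMap();
    for (int c = 0; c < centroids.length; c++) {
        System.out.println("CENTROID " + c);
        for (int i = 0; i < centroids[c].size(); i++) {
            double value = centroids[c].get(i);
            if (value != 0.0) {
                // fmap.get(i) returns the VariableIDInfo whose name is e.g. "cards@1-N=10"
                System.out.println("  " + fmap.get(i).getName() + " = " + value);
            }
        }
    }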

lincolnthree (Author) commented Oct 22, 2020

@Craigacp Okay, no problem. Thanks, it seems I'm on the right track. I appreciate the feedback!

> We should patch that to give a more user friendly view on it.

I think integer IDs are fine as long as the docs or javadocs say what they are. Of course I'd never complain if they were more strongly typed, as you suggest :)

> The centroid vectors are DenseVector instances. The index should be looked up in model.getFeatureIDMap().get(i)

You mean the index of each element in the DenseVector should be used as a lookup in model.getFeatureIDMap().get(indexOfDenseVectorElement), right? Just making sure.

> The value of the centroid at a specific index is the point in that dimension (e.g. the quantity of that card).

So, denseVector.get(indexOfElement).value === card quantity of the specified element, in this cluster?

> There is no notion of top features for a K-Means model

I guess that makes sense. I was expecting something like frequency of feature occurrence across all clusters, which I can certainly calculate myself.

Craigacp (Member) commented Oct 22, 2020

> I think integer IDs are fine as long as the docs or javadocs say what they are. Of course I'd never complain if they were more strongly typed, as you suggest :)

Tribuo's whole thing is that you should never need to know those ids (apart from when importing an external model, where it's unavoidable), so we should definitely have a method that returns something better. Probably a List<List<Feature>>, or something else with a little bit more of a type to it.

> You mean the index of each element in the DenseVector should be used as a lookup in model.getFeatureIDMap().get(indexOfDenseVectorElement), right? Just making sure.

Yep.

> So, denseVector.get(indexOfElement).value === card quantity of the specified element, in this cluster?

denseVector.get(indexOfElement) == card quantity, as the get method returns a double.

> I guess I was expecting something like frequency of feature occurrence across all clusters, which I can certainly calculate myself.

Yeah, though you'll need to be a little careful with the sparsity, as it's implicitly zero copies of a card which doesn't appear, so your stats might be a little skewed.

BTW, there is a K-Means++ initialisation in the main branch. It'll be in the next feature release (i.e. 4.1.0), but at the moment we don't have a timeline for when that will be. The code base in main should always be working though, so if you want to try it out, feel free.

lincolnthree (Author) commented Oct 23, 2020

@Craigacp Thanks! I'd definitely be interested in trying that. Looking for it now. Thank god for Maven. Built and installed locally in ~1min, no build issues whatsoever. Do you, perchance, deploy nightly snapshots anywhere?

Craigacp (Member) commented:

We don't deploy nightlies and are unlikely to, because our release processes aren't set up to move that fast.

Craigacp (Member) commented Oct 23, 2020

We're trying to make sure that Tribuo is always straightforward to build so hopefully people can just build main themselves if they want the latest bits.

lincolnthree (Author) commented Oct 23, 2020

@Craigacp Makes sense. I think you've achieved that goal :)

Unless you're using an organizational level staging repository with OSSRH at Sonatype (or privately hosted repos), I'll grant it's rather cumbersome to push staging artifacts and have to log in to do the manual close/release process.

I've been continuing to experiment with the KMeans++ algorithm, and it seems to be working a little better than KMeans. The centroids seem very slightly more accurate, but as you mentioned, I think I'm having an issue with sparsity in the data. There are too many features/cards and too few in each deck, and I don't think the centroids are far enough apart to be reliably distinguishable. There's a lot of noise in the clusters: there will be several features with reasonable values that make sense, then a bunch of others that I know to be incorrect.

I noticed there are some 'sparsify()' methods in the resultant cluster objects, but I'm assuming that only cleans up the clusters after they've been selected, and there's no built-in way to do trimming/weighting of sparse data during training?

Also, please stop me if this is too many questions, or if there's a different medium I should be using. I do appreciate your help, but I don't want to be a pain.

lincolnthree (Author) commented Oct 23, 2020

Update. Looks like things are working better than I thought. It turns out I was still confused about the int feature IDs.

The feature ID assigned to the actual Feature instance has nothing to do with the actual vectored feature int defined in the data file/CSV, which was the root of my secondary confusion.

After implementing a custom TextFieldProcessor to add the card name to the features, this became clear.

E.g.:

48dc6570-b4b5-4150-b4da-e5868c17f641,"Simic Ramp",1,standard,"67,67,67,67,358,358,358,358,523,523,363,363,363,363,360,360,360,360,11,11,11,11,11,364,364,364,364,453,453,453,453,13,13,13,13,13,13,13,13,13,13,492,492,361,361,361,361,76,76,125,125,330,330,330,330,425,365,365,365,500"

ID 67 in the input above may very well be feature 22 or feature 321 or whatever is assigned by Tribuo in the resulting feature objects. The net effect of this was me making improper card name mappings, which made the centroid feature values seem much more random than they actually were.

Once I sorted this out, it actually looks like the clusters are working quite accurately, and it appears that while sparseness may still be having an effect, it's not nearly as pronounced as I initially thought.

Now I'm getting results that are much more in line with previous K-Means algorithms I've tried, and seemingly fewer outliers, in a fraction of the time, with support for card quantities as feature values.

Note, I have implemented pruning of values below a quantity of 1 (or near one), since it does not make sense to have less than one copy of a card, and cards that appear with less than 1 copy imply that the feature is an outlier in the cluster.

A dataset of 500 randomly selected decks from the Legacy format resulted in an initial optimal K of 56, though I still need to do more tweaking to my accuracy/evaluation/quality metrics.

Example output clusters:

CLUSTER 0: **Bant SnowKo** ---- 4.0 Force of Will, 4.0 Brainstorm, 4.0 Misty Rainforest, 3.9 Flooded Strand, 3.9 Ponder, 3.8 Snow-Covered Island, 3.7 Swords to Plowshares, 3.7 Arcum's Astrolabe, 3.3 Ice-Fang Coatl, 2.7 Oko, Thief of Crowns, 2.2 Snapcaster Mage, 2.1 Terminus, 1.6 Jace, the Mind Sculptor, 1.3 Tundra, 1.1 Veil of Summer, 1.1 Teferi, Time Raveler, 1.1 Snow-Covered Forest, 1.1 Snow-Covered Plains, 1.0 Mystic Sanctuary, 1.0 Tropical Island, 0.9 Volcanic Island

CLUSTER 1: **Mono-Red Prison** ---- 11.0 Mountain, 4.0 Ancient Tomb, 4.0 Chandra, Torch of Defiance, 4.0 City of Traitors, 4.0 Goblin Rabblemaster, 4.0 Chrome Mox, 4.0 Karn, the Great Creator, 4.0 Blood Moon, 4.0 Chalice of the Void, 4.0 Simian Spirit Guide, 3.1 Trinisphere, 2.9 Magus of the Moon, 2.8 Ensnaring Bridge, 2.8 Bonecrusher Giant // Stomp, 1.2 Fiery Confluence

So I think things are on a good track :)

Craigacp (Member) commented:

> I noticed there are some 'sparsify()' methods in the resultant cluster objects, but I'm assuming that only cleans up the clusters after they've been selected, and there's no built-in way to do trimming/weighting of sparse data during training?

Well, the notion of distance when some elements are not present is hard to define, so we implicitly set missing elements to zero. Otherwise you can end up with degenerate cases where two decks don't have any card overlap and the distance between them is undefined.

> Also, please stop me if this is too many questions, or if there's a different medium I should be using. I do appreciate your help, but I don't want to be a pain.

At the moment we haven't got a chat or discussion platform set up, so GitHub issues or the mailing list are it, and the issue is fine.

I'm glad you've managed to get it working. Yes, Tribuo's feature ids are completely disconnected from the id numbers that you pass in, as Tribuo treats the id numbers you pass in as strings and renumbers the features itself (based on the lexicographic ordering of the feature strings). This is why we should add a method to KMeansModel that returns the feature names and their values: it's one of the few places where you need to know the internal ids to understand a user-facing output.

Craigacp (Member) commented:

KMeansModel now has a getCentroids method which returns the feature names and values for the centroids, so should make it easier to work with the centroids in downstream tasks. The clustering tutorial has been updated to use the new method, and a couple of minor issues with how K-Means++ was integrated have been fixed to preserve interface compatibility with 4.0.
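
Usage should look roughly like this (a sketch matching the List<List<Feature>> shape discussed above; the updated clustering tutorial has the canonical version):

    // Sketch: iterate the named centroids instead of translating dense vector indices.
    List<List<Feature>> centroids = model.getCentroids();
    for (int c = 0; c < centroids.size(); c++) {
        System.out.println("CENTROID " + c);
        for (Feature f : centroids.get(c)) {
            System.out.println("  " + f.getName() + " = " + f.getValue());
        }
    }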

Craigacp (Member) commented:

We've also merged in an empty response processor implementation for use when loading clustering, anomaly detection or other datasets where you don't expect there to be a ground truth output. I'm going to close this issue now as I think we've patched the usability issues you hit. Open a fresh one if you hit others, or re-open this if you think it's not quite covered by PRs #99 and #98.
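
With that merged, the non-existent-field workaround above should reduce to something like this (a sketch; check the merged class's exact name and signature in the 4.1 javadoc):

    // Hypothetical usage of the new empty response processor: it never emits a
    // response, so every row gets the unknown ClusterID.
    var responseProcessor = new EmptyResponseProcessor<>(new ClusteringFactory());
    var rowProcessor = new RowProcessor<ClusterID>(metadataExtractors, null,
            responseProcessor, fieldProcessors, Collections.emptySet());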
