Label and Cluster evaluation? #105

Closed
nikkisingh111333 opened this issue Dec 9, 2020 · 10 comments
Labels
question General question

Comments


nikkisingh111333 commented Dec 9, 2020

Hey, I'm playing around with k-means clustering, and for evaluation I was using our old friend, the Iris data. I'm new to unsupervised learning, so here's my question:
How do I evaluate clusters against a label column ("Species" in the case of Iris)?
Here's what I'm doing, just simple stuff to get familiar with clustering in Tribuo:

Map<String,FieldProcessor> fieldProcessors = new HashMap<String,FieldProcessor>();
// Field processors: pass the "Species" column through an IdentityProcessor
fieldProcessors.put("Species",new IdentityProcessor("Species"));
RowProcessor rowProcessor = new RowProcessor<Label>(new FieldResponseProcessor("Species","UNK",new LabelFactory()),fieldProcessors);
CSVDataSource irisData = new CSVDataSource<Label>(Paths.get("C:\\Users\\Nikki singh\\Downloads\\Iris.csv"),rowProcessor,true);
TrainTestSplitter splitIrisData = new TrainTestSplitter<>(irisData,
                      /* Train fraction */ 0.7,
                            /* RNG seed */ 0L);
MutableDataset train = new MutableDataset(splitIrisData.getTrain());
MutableDataset test = new MutableDataset(splitIrisData.getTest());
// K-means clustering trainer
KMeansTrainer trainer = new KMeansTrainer(5,100,Distance.EUCLIDEAN,5,1234);
Long startTime = System.currentTimeMillis();  
KMeansModel model = trainer.train(train);
System.out.println("Feature:"+model.getFeatureIDMap().get("B"));
Long endTime = System.currentTimeMillis();
// Print the k cluster centroids found by training (the trainer above allows up to 100 iterations)
DenseVector[] centroids = model.getCentroidVectors();
for (DenseVector centroid : centroids) {
    System.out.println("cLUSTERS:"+centroid);
}
ClusteringEvaluator eva=new ClusteringEvaluator();
ClusteringEvaluation c= eva.evaluate(model,train);
System.out.println(c.adjustedMI()+"-----"+c.normalizedMI());

And I'm getting this:

Exception in thread "main" java.lang.ClassCastException: org.tribuo.classification.Label cannot be cast to org.tribuo.clustering.ClusterID

How do I evaluate the model and see whether my clusters line up with the classes or not?
One more thing: I have heard about k-modes clustering. Does that exist in Tribuo yet?

nikkisingh111333 added the "question" (General question) label Dec 9, 2020
Craigacp (Member) commented Dec 9, 2020

You can't use a clustering evaluator on a classification dataset. You washed all the types off which removes the guarantees about the dataset, and the only reason it didn't throw a ClassCastException during model training is that the KMeansTrainer doesn't check the output type (which I guess it should do as this shouldn't have happened).
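
To illustrate how raw types defer a type error from compile time to run time, here is a minimal sketch in plain Java. The `Dataset`, `Label`, and `ClusterID` classes below are hypothetical stand-ins written for this example, not Tribuo's classes; only the language behaviour is the point.

```java
import java.util.ArrayList;
import java.util.List;

final class RawTypeDemo {
    // Hypothetical stand-ins for illustration only; these are NOT Tribuo classes.
    static final class Label {
        final String name;
        Label(String name) { this.name = name; }
    }

    static final class ClusterID {
        final int id;
        ClusterID(int id) { this.id = id; }
    }

    static final class Dataset<T> {
        private final List<T> outputs = new ArrayList<>();
        void add(T output) { outputs.add(output); }
        List<T> outputs() { return outputs; }
    }

    // With generics intact, the compiler only lets Dataset<ClusterID> in here.
    static int sumIds(Dataset<ClusterID> data) {
        int total = 0;
        for (ClusterID c : data.outputs()) { // the compiler inserts a cast to ClusterID here
            total += c.id;
        }
        return total;
    }

    // Returns true if passing a raw Dataset of Labels blows up only at run time.
    @SuppressWarnings({"rawtypes", "unchecked"})
    static boolean rawTypeFailsAtRuntime() {
        Dataset<Label> labels = new Dataset<>();
        labels.add(new Label("Iris-setosa"));
        Dataset raw = labels; // the type parameter is "washed off"
        try {
            sumIds(raw);      // compiles (unchecked warning), fails when the cast runs
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }
}
```

Passing `labels` directly to `sumIds` would be rejected by the compiler; only the raw reference gets through, and the failure then surfaces wherever the first cast happens.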

If you want to compare how the clusters line up against classification labels then you should write a new ResponseProcessor that converts the labels into numerical ids which can be consumed by the ClusteringFactory to create new ClusterIDs. That will give you a DataSource<ClusterID> and you can keep the generic types all through the computation. Then the ClusteringEvaluation will compute the mutual information between the predicted cluster ids and the ground truth "cluster ids" (i.e. the labels). If that mutual information is close to 1.0 then the clusters are similar to the labels, if it's close to 0.0 then the clusters and labels don't line up very well.
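
As a rough illustration of what a score near 1.0 or 0.0 means, the sketch below computes a normalized mutual information between two id arrays in plain Java. It is not Tribuo's implementation; the class name is made up for this example, and the normalization used here (dividing by the geometric mean of the two entropies) is an assumption, since several normalizations are in common use.

```java
import java.util.HashMap;
import java.util.Map;

final class ClusterLabelAgreement {

    /** Shannon entropy (in nats) of a sequence of integer ids. */
    static double entropy(int[] ids) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int id : ids) {
            counts.merge(id, 1, Integer::sum);
        }
        double h = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / ids.length;
            h -= p * Math.log(p);
        }
        return h;
    }

    /** Mutual information (in nats) between predicted cluster ids and ground-truth ids. */
    static double mutualInformation(int[] predicted, int[] truth) {
        if (predicted.length != truth.length) {
            throw new IllegalArgumentException("Arrays must have the same length");
        }
        int n = predicted.length;
        Map<Integer, Integer> predCounts = new HashMap<>();
        Map<Integer, Integer> trueCounts = new HashMap<>();
        Map<Integer, Map<Integer, Integer>> joint = new HashMap<>();
        for (int i = 0; i < n; i++) {
            predCounts.merge(predicted[i], 1, Integer::sum);
            trueCounts.merge(truth[i], 1, Integer::sum);
            joint.computeIfAbsent(predicted[i], k -> new HashMap<>())
                 .merge(truth[i], 1, Integer::sum);
        }
        double mi = 0.0;
        for (Map.Entry<Integer, Map<Integer, Integer>> row : joint.entrySet()) {
            double pPred = (double) predCounts.get(row.getKey()) / n;
            for (Map.Entry<Integer, Integer> cell : row.getValue().entrySet()) {
                double pTrue = (double) trueCounts.get(cell.getKey()) / n;
                double pJoint = (double) cell.getValue() / n;
                mi += pJoint * Math.log(pJoint / (pPred * pTrue));
            }
        }
        return mi;
    }

    /** MI divided by the geometric mean of the entropies (assumed normalization). */
    static double normalizedMI(int[] predicted, int[] truth) {
        double hPred = entropy(predicted);
        double hTrue = entropy(truth);
        if (hPred < 1e-12 || hTrue < 1e-12) {
            return 0.0; // a constant assignment carries no information
        }
        return mutualInformation(predicted, truth) / Math.sqrt(hPred * hTrue);
    }
}
```

Note that the score does not depend on which number each cluster happens to get: a clustering that matches the labels up to a renaming of the ids still scores 1.0.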

Looks like k-modes is for categorical data. We've not looked into that algorithm. We're considering adding k-medoids which guarantees that the cluster is a datapoint, rather than the mean of the cluster, but we've not implemented it yet.
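
To make the difference between a mean and a medoid concrete, here is a small plain-Java sketch. It is not taken from Tribuo (which, per the comment above, had no k-medoids at the time); the class and method names are invented for this example, and Euclidean distance is assumed.

```java
final class MeanVersusMedoid {

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Coordinate-wise mean: the k-means style centre, usually not a datapoint. */
    static double[] mean(double[][] points) {
        double[] centre = new double[points[0].length];
        for (double[] p : points) {
            for (int i = 0; i < centre.length; i++) {
                centre[i] += p[i];
            }
        }
        for (int i = 0; i < centre.length; i++) {
            centre[i] /= points.length;
        }
        return centre;
    }

    /** Index of the medoid: the datapoint with the smallest total distance to the others. */
    static int medoidIndex(double[][] points) {
        int best = -1;
        double bestTotal = Double.POSITIVE_INFINITY;
        for (int i = 0; i < points.length; i++) {
            double total = 0.0;
            for (double[] other : points) {
                total += distance(points[i], other);
            }
            if (total < bestTotal) {
                bestTotal = total;
                best = i;
            }
        }
        return best;
    }
}
```

For the points (0,0), (1,0), and (10,0), the mean lies between the second and third points and is not itself in the data, while the medoid is the second point.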

nikkisingh111333 (Author) commented

OK, so could you share some code snippets showing how to use a ResponseProcessor and the ClusteringFactory to get what I want? There are so many classes and interfaces that I'm a bit lost in the docs.

Craigacp (Member) commented Dec 9, 2020

Something like this should be sufficient. I've not tested it, and it's not going to be part of Tribuo.

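import java.util.Optional;

import com.oracle.labs.mlrg.olcut.config.Config;
import com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance;
import com.oracle.labs.mlrg.olcut.provenance.impl.ConfiguredObjectProvenanceImpl;
import org.tribuo.OutputFactory;
import org.tribuo.clustering.ClusterID;
import org.tribuo.clustering.ClusteringFactory;
import org.tribuo.data.columnar.ResponseProcessor;

// Maps the three Iris species names onto cluster ids "0", "1" and "2".
public class IrisClusterResponseProcessor implements ResponseProcessor<ClusterID> {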

    @Config(mandatory = true,description="The field name to read.")
    private String fieldName;

    private final OutputFactory<ClusterID> outputFactory = new ClusteringFactory();

    private IrisClusterResponseProcessor() {}

    public IrisClusterResponseProcessor(String fieldName) {
        this.fieldName = fieldName;
    }

    @Override
    public OutputFactory<ClusterID> getOutputFactory() {
        return outputFactory;
    }

    @Override
    public String getFieldName() {
        return fieldName;
    }

    @Deprecated
    @Override
    public void setFieldName(String fieldName) {
        this.fieldName = fieldName;
    }

    @Override
    public Optional<ClusterID> process(String value) {
        if ("Iris-setosa".equals(value)) {
            return Optional.of(outputFactory.generateOutput("0"));
        } else if ("Iris-versicolor".equals(value)) {
            return Optional.of(outputFactory.generateOutput("1"));
        } else if ("Iris-virginica".equals(value)) {
            return Optional.of(outputFactory.generateOutput("2"));
        } else {
            return Optional.empty();
        }
    }

    @Override
    public ConfiguredObjectProvenance getProvenance() {
        return new ConfiguredObjectProvenanceImpl(this,"ResponseProcessor");
    }
}
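
If the hard-coded if/else chain becomes tedious for other datasets, one option is a small lookup table from label strings to id strings. The sketch below is plain Java with no Tribuo dependency; `LabelIdTable` and `idFor` are hypothetical names invented for this example, and the idea is that the returned string could be handed to `outputFactory.generateOutput(...)` inside `process`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Tribuo: assigns ids "0", "1", ... in argument order.
final class LabelIdTable {
    private final Map<String, String> ids = new LinkedHashMap<>();

    LabelIdTable(String... labels) {
        for (int i = 0; i < labels.length; i++) {
            ids.put(labels[i], Integer.toString(i));
        }
    }

    /** Returns the id string for a known label, or empty for anything else. */
    Optional<String> idFor(String label) {
        return Optional.ofNullable(ids.get(label));
    }
}
```

With a table built from "Iris-setosa", "Iris-versicolor", and "Iris-virginica", the lookup mirrors the branches above, and unknown values still map to an empty Optional.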

nikkisingh111333 (Author) commented

OK, so every time I supply a dataset I need to do this. Thanks. Are you planning anything automatic for this kind of thing? I have just finished writing my own uniqueFeatureEncoder class by implementing FieldProcessor. It was easy, but it would be good if the API offered some parameter options for this, for example "binary" for a binarized feature or "real" for real-valued data. By the way, thanks for your help and the quick support.

Craigacp (Member) commented Dec 9, 2020

Well, in general you shouldn't try to feed classification data to a clustering task. Tribuo is set up to prevent users from confusing the prediction tasks like that, so it needs some tricks to make it work.

We add new implementations of FieldProcessor, ResponseProcessor etc. as we discover a need for them, either in the community or from our internal work. The columnar infrastructure is designed to be flexible, but it has to get the information about the type of each feature from somewhere. We don't want it to have to read the dataset twice just to understand it (nor do we want it to make guesses about the feature types, as those might be tricky to undo). If you've got a concrete feature request then make a new issue and write it up in some detail. I'm not clear exactly what you're asking for at the moment, as there are already field processors that emit binary features and real-valued features.

nikkisingh111333 (Author) commented

OK, one last thing before wrapping up: what should I choose to encode a categorical feature as unique values in just one feature, rather than one feature per class? Should I use REAL, CATEGORICAL, or something else?

Craigacp (Member) commented Dec 9, 2020

I don't understand the question, could you give an example?

nikkisingh111333 (Author) commented Dec 9, 2020

For example, I have a column shirt size = [tiny, medium, large] and I want to encode it numerically, not as one feature per class but as a single feature whose classes are labelled with numbers [1=tiny, 2=medium, 3=large]. Since I have written my own row processor, which should I use: GeneratedFeatureType.BINARISED_CATEGORICAL, GeneratedFeatureType.REAL, or GeneratedFeatureType.CATEGORICAL? There's also GeneratedFeatureType.TEXT. Which one should I choose?

Craigacp (Member) commented Dec 9, 2020

It's GeneratedFeatureType.CATEGORICAL. We're working on further changes to Tribuo's internal type system, as at the moment that enum only really interacts with the LIME explanation module, but in the future it will control how the feature statistics are computed.
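
For readers following the shirt-size example, the sketch below contrasts the single-feature encoding asked about (1=tiny, 2=medium, 3=large) with a one-feature-per-class encoding. It is plain Java and does not use Tribuo's FieldProcessor API; the class and method names are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not Tribuo code.
final class ShirtSizeEncoding {
    private static final List<String> SIZES = Arrays.asList("tiny", "medium", "large");

    private static int indexOf(String size) {
        int index = SIZES.indexOf(size);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown size: " + size);
        }
        return index;
    }

    /** Single-feature encoding: tiny -> 1.0, medium -> 2.0, large -> 3.0. */
    static double singleFeature(String size) {
        return indexOf(size) + 1.0;
    }

    /** One feature per class: a vector with a 1.0 in the position of the given size. */
    static double[] oneFeaturePerClass(String size) {
        double[] encoded = new double[SIZES.size()];
        encoded[indexOf(size)] = 1.0;
        return encoded;
    }
}
```

One design consideration: the single-feature form imposes an order and spacing on the categories, which suits sizes but would be arbitrary for unordered categories such as colours.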

nikkisingh111333 (Author) commented Dec 10, 2020 via email
