Problem with number of clusters to execute the XBOS #5
It is normal for this to happen when the cardinality of the data in a particular dimension is too low. I think a good solution is to remove the low-cardinality dimensions from the data in advance and then run XBOS.
I just checked the definition of low cardinality in databases, which in short means that the column contains a lot of repeated values across its range. But imagine I have the following excerpt of data. Does this data count as low-cardinality, meaning I can't use XBOS?
Cardinality is the number of unique values. `[0, 0, 0, 0, 0]` -> cardinality is 1. And because XBOS calculates (or uses) distances, you can't use text data.
https://tylerburleigh.com/blog/working-with-categorical-features-in-ml/ — "In the context of machine learning, 'cardinality' refers to the number of possible values that a feature can assume. For example, the variable 'US State' is one that has 50 possible values."
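To make the definition above concrete, here is a minimal sketch (with made-up columns, not the data from this issue) computing per-column cardinality with pandas:

```python
import pandas as pd

# Hypothetical dataframe; the column names are illustrative only.
df = pd.DataFrame({
    "state":  ["CA", "NY", "CA", "TX", "NY"],   # text: 3 unique values
    "flag":   [0, 0, 0, 0, 0],                  # constant: cardinality 1
    "amount": [10.5, 3.2, 7.7, 1.1, 9.9],       # numeric: cardinality 5
})

# Cardinality = number of unique values per column.
cardinality = df.nunique()
print(cardinality)
# "flag" has cardinality 1, so it carries no distance information;
# "state" is text and must be encoded before a distance-based model.
```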
For sure, before using ML-based models I converted categorical features/dimensions/columns into numerical ones with label encoding and dropped the categorical columns, due to the large data size (please see the last two columns after encoding). Here I didn't drop those categorical columns, just to showcase the encoding. Considering this issue,
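A minimal sketch of that label-encoding step, using pandas category codes on made-up columns (the user's actual data is not shown in this thread; sklearn's `LabelEncoder` would work equally well):

```python
import pandas as pd

# Illustrative frame; the column names are assumptions, not the user's data.
df = pd.DataFrame({
    "protocol": ["tcp", "udp", "tcp", "icmp"],
    "service":  ["http", "dns", "ssh", "dns"],
    "bytes":    [120, 44, 300, 80],
})

# Add an integer-coded version of each text column; after checking the
# result, the original categorical columns can be dropped.
for col in df.select_dtypes(include="object").columns:
    df[col + "_enc"] = df[col].astype("category").cat.codes

print(df[["protocol_enc", "service_enc", "bytes"]])
```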
You need to drop low-cardinality columns (cardinality lower than the cluster size) before using XBOS. The math of XBOS is pretty simple.
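The rule above can be sketched as a small preprocessing helper. This is a hedged sketch with made-up column names, not part of the XBOS package:

```python
import pandas as pd

def drop_low_cardinality(df: pd.DataFrame, n_clusters: int) -> pd.DataFrame:
    """Drop columns whose number of unique values is below the cluster count."""
    keep = [col for col in df.columns if df[col].nunique() >= n_clusters]
    return df[keep]

# Made-up example: the binary column cannot support 5 clusters.
df = pd.DataFrame({
    "binary": [0, 1, 0, 1, 0, 1],               # cardinality 2 -> dropped
    "signal": [0.1, 2.3, 4.5, 6.7, 8.9, 10.1],  # cardinality 6 -> kept
})
print(drop_low_cardinality(df, n_clusters=5).columns.tolist())  # ['signal']
```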
Now I see your point. You mean that if in my dataframe I have one column/dimension/feature with binary-style values, e.g. like
in my case, for the above frame, that is:
here I should drop
Thanks, I'll do my best! :D Have a nice Sunday. Sunday is Funday! :)
Yes. But I'm sorry, I will not implement that (automatically removing features) because of ...
By the way, if you have many low-cardinality features, I think Isolation Forest is a good choice for anomaly/outlier detection.
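For context, a minimal sketch of that alternative with scikit-learn's `IsolationForest` on synthetic data (the features, sizes, and parameters below are all made-up assumptions). Unlike XBOS, Isolation Forest does not rely on distances, so low-cardinality columns are not a problem:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for a dataset dominated by low-cardinality features:
# two discrete columns plus one continuous column.
X = np.column_stack([
    rng.integers(0, 2, size=200),    # binary feature (cardinality 2)
    rng.integers(0, 3, size=200),    # three-valued feature (cardinality 3)
    rng.normal(0.0, 1.0, size=200),  # continuous feature
])
X[:5, 2] = 10.0  # plant a few obvious outliers in the continuous column

clf = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
labels = clf.fit_predict(X)  # -1 marks outliers, 1 marks inliers
print("outliers flagged:", int((labels == -1).sum()))
```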
During my experiments I noticed that there is a limit on the number of clusters, beyond which I get the following error, both on my big data and on your provided sample.

The shape of my data is `(1516385, 8)`. When I run XBOS with the defaults, `xbos = XBOS()`, i.e. `n_clusters=15`, `effectiveness=500`, `max_iter=2`, I face `KeyError: 0` with the following message along with the error traceback. So in the end, XBOS on my data can be executed with only two clusters (`n_clusters=2`), which doesn't make sense. Even when I tested on the simple dataset you provided here, configuring it with `n_clusters=8` threw a similar `KeyError`, mostly `KeyError: 7`.

Please let me know whether I should change my configuration to execute XBOS successfully. Please feel free to check out this Colab notebook and comment next to the cells for quick debugging.
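The replies in this thread attribute the `KeyError` to columns whose cardinality is lower than `n_clusters`. Assuming that diagnosis, here is a minimal sketch (with made-up data) for finding the largest cluster count every column can support:

```python
import pandas as pd

def max_safe_clusters(df: pd.DataFrame) -> int:
    """Largest n_clusters every column supports: the minimum per-column
    cardinality (following the maintainer's explanation in this thread)."""
    return int(df.nunique().min())

# Made-up frame: one binary column caps the usable cluster count.
df = pd.DataFrame({
    "binary_flag": [0, 1, 0, 1, 0, 1, 0, 1],  # cardinality 2
    "reading":     [1, 2, 3, 4, 5, 6, 7, 8],  # cardinality 8
})
print(max_safe_clusters(df))  # -> 2; any larger n_clusters would hit the KeyError
```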