
Problem with number of clusters to execute the XBOS #5

Open
clevilll opened this issue Oct 13, 2021 · 9 comments
@clevilll

During my experiments I noticed that there is a limit on the number of clusters, which results in the following error both on my big data and on your provided sample:

The shape of my data is (1516385, 8). When I run XBOS with the defaults, xbos = XBOS() (i.e. n_clusters=15, effectiveness=500, max_iter=2), I face KeyError: 0 along with the following warning and error traceback:

/content/xbos.py:25: ConvergenceWarning: Number of distinct clusters (7) found smaller than n_clusters (8). Possibly due to duplicate points in X.
  kmeans.fit(data[column].values.reshape(-1,1))

So in the end, XBOS on my data can only be executed with two clusters (n_clusters=2), which doesn't make sense. Even when I tested on the simple dataset you provided here, configuring it with n_clusters=8 threw a similar KeyError, mostly KeyError: 7:

/content/xbos.py:25: ConvergenceWarning: Number of distinct clusters (7) found smaller than n_clusters (8). Possibly due to duplicate points in X.
  kmeans.fit(data[column].values.reshape(-1,1))
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2897             try:
-> 2898                 return self._engine.get_loc(casted_key)
   2899             except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 7

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
5 frames
<ipython-input-9-6c63ff2c8311> in <module>()
      6 
      7 xbos = XBOS(n_clusters=8, max_iter=1)
----> 8 result = xbos.fit_predict(dff)
      9 #for i in result:
     10 #    print(round(i,2))

/content/xbos.py in fit_predict(self, data)
     56 
     57     def fit_predict(self,data):
---> 58         self.fit(data)
     59         return self.predict(data)

/content/xbos.py in fit(self, data)
     35                     if i != k:
     36                         dist = abs(kmeans.cluster_centers_[i] - kmeans.cluster_centers_[k])/max_distance
---> 37                         effect = ratio[k]*(1/pow(self.effectiveness,dist))
     38                         cluster_score[i] = cluster_score[i]+effect
     39 

/usr/local/lib/python3.7/dist-packages/pandas/core/series.py in __getitem__(self, key)
    880 
    881         elif key_is_scalar:
--> 882             return self._get_value(key)
    883 
    884         if is_hashable(key):

/usr/local/lib/python3.7/dist-packages/pandas/core/series.py in _get_value(self, label, takeable)
    988 
    989         # Similar to Index.get_value, but we do not fall back to positional
--> 990         loc = self.index.get_loc(label)
    991         return self.index._get_values_for_loc(self, loc, label)
    992 

/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2898                 return self._engine.get_loc(casted_key)
   2899             except KeyError as err:
-> 2900                 raise KeyError(key) from err
   2901 
   2902         if tolerance is not None:

KeyError: 7

Please let me know if I should change my configuration to execute XBOS successfully. Also, please feel free to check out this Colab notebook and comment next to the cells for quick debugging.

@Kanatoko
Owner

Kanatoko commented Oct 16, 2021

It is normal for this to happen when the cardinality of the data in a particular dimension is too low. I think a good solution is to remove the low-cardinality dimensions from the data in advance and then run XBOS.
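That preprocessing can be sketched with a small pandas helper; the function name, threshold, and sample columns below are illustrative, not part of XBOS:

```python
import pandas as pd

def drop_low_cardinality(df, n_clusters):
    # XBOS runs k-means per column, so every column needs at least
    # n_clusters distinct values; keep only the columns that qualify.
    keep = [c for c in df.columns if df[c].nunique(dropna=False) >= n_clusters]
    return df[keep]

df = pd.DataFrame({
    "Length": [4014, 90, 172, 12, 16, 339, 10],  # 7 distinct values
    "Encoding_type": [0, 0, 0, 1, 0, 0, 0],      # only 2 distinct values
})
print(drop_low_cardinality(df, 3).columns.tolist())  # ['Length']
```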

@clevilll
Author

I just checked the definition of low cardinality in databases, which in short means that a column contains many repeated values in its data range. But imagine I have the following excerpt of data with shape (1516385, 5):

+---+-------------+------+------------+-------------+-----------------+
| id|         Type|Length|Token_number|Encoding_type|Character_feature|
+---+-------------+------+------------+-------------+-----------------+
|  0|     Sentence|  4014|         198|        false|              136| 
|  1|    contextid|    90|           2|        false|               15|
|  2|     Sentence|   172|          11|        false|              118| 
|  3|       String|    12|           0|         true|               11| 
|  4|version-style|    16|           0|        false|               13|   
|  5|     Sentence|   339|          42|        false|              110| 
|  6|version-style|    16|           0|        false|               13|  
|  7| url_variable|    10|           2|        false|                9| 
|  8| url_variable|    10|           2|        false|                9|
|  9|     Sentence|   172|          11|        false|              117| 
| 10|    contextid|    90|           2|        false|               15| 
| 11|     Sentence|   170|          11|        false|              114|
| 12|version-style|    16|           0|        false|               13|
| 13|     Sentence|    68|          10|        false|               59|
| 14|       String|    12|           0|         true|               11|
| 15|     Sentence|   173|          11|        false|              118|
| 16|       String|    12|           0|         true|               11|
| 17|     Sentence|   132|           8|        false|               96|
| 18|       String|    12|           0|         true|               11|
| 19|    contextid|    88|           2|        false|               15|
+---+-------------+------+------------+-------------+-----------------+

Does this data count as low-cardinality, meaning I can't use XBOS?

@Kanatoko
Owner

Cardinality is the number of unique values. Cardinality examples:

[ 0, 0, 0, 0, 0 ] -> cardinality is 1.
[ 0, 0, 0, 1, 1 ] -> cardinality is 2.
[ 0, 0, 1, 1, 2 ] -> cardinality is 3.
[ 0, 0, 1, 2, 3 ] -> cardinality is 4.

And because XBOS calculates (or uses) distances, you can't use text data.
In that case, you need to drop 'Type' and 'Encoding_type'.
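The same examples can be checked in code, since pandas' nunique() is exactly this cardinality count:

```python
import pandas as pd

for values in ([0, 0, 0, 0, 0],
               [0, 0, 0, 1, 1],
               [0, 0, 1, 1, 2],
               [0, 0, 1, 2, 3]):
    # nunique() counts the distinct values, i.e. the cardinality
    print(values, "-> cardinality is", pd.Series(values).nunique())
```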

@Kanatoko
Owner

Kanatoko commented Oct 16, 2021

https://tylerburleigh.com/blog/working-with-categorical-features-in-ml/

"In the context of machine learning, “cardinality” refers to the number of possible values that a feature can assume. For example, the variable “US State” is one that has 50 possible values. "

@clevilll
Author

clevilll commented Oct 16, 2021

Cardinality is the number of unique values. Cardinality examples:

[ 0, 0, 0, 0, 0 ] -> cardinality is 1.
[ 0, 0, 0, 1, 1 ] -> cardinality is 2.
[ 0, 0, 1, 1, 2 ] -> cardinality is 3.
[ 0, 0, 1, 2, 3 ] -> cardinality is 4.

And because XBOS calculates (or uses) distances, you can't use text data. In that case, you need to drop 'Type' and 'Encoding_type'.

For sure, before using ML-based models, I converted categorical features/dimensions/columns into numerical ones with label encoding (chosen due to the large data) and then dropped the categorical columns (please see the last two columns after encoding):

+---+---------------+------+------------+-------------+-----------------+--------------------+------------+---------------------+
| id|           Type|Length|Token_number|Encoding_type|Character_feature|                Freq|Type_Encoded|Encoding_Type_Encoded|         
+---+---------------+------+------------+-------------+-----------------+--------------------+------------+---------------------+
|  0|  sap-contextid|    90|           2|        false|               15|                 1.0|         2.0|                  0.0|
|  1|       Sentence|   169|          11|         true|              115|  0.0434355930699323|         0.0|                  1.0|
|  2|   url_variable|    12|           2|        false|               11|  0.3768681063417741|         1.0|                  0.0|
|  3|version-setting|    11|           2|         true|               10| 0.08895918484530539|         6.0|                  1.0|
|  4|       Sentence|   722|           5|        false|              117|0.004624917132551378|         0.0|                  0.0|
+---+---------------+------+------------+-------------+-----------------+--------------------+------------+---------------------+
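That label-encoding step can be sketched with pandas.factorize; the sample values below are illustrative and the resulting codes need not match the exact values in the table above:

```python
import pandas as pd

df = pd.DataFrame({
    "Type": ["Sentence", "contextid", "Sentence", "String"],
    "Encoding_type": ["false", "true", "false", "true"],
})
# factorize maps each distinct category to an integer code,
# in order of first appearance
df["Type_Encoded"], _ = pd.factorize(df["Type"])
df["Encoding_Type_Encoded"], _ = pd.factorize(df["Encoding_type"])
print(df["Type_Encoded"].tolist())            # [0, 1, 0, 2]
print(df["Encoding_Type_Encoded"].tolist())   # [0, 1, 0, 1]
```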

Here I didn't drop those categorical columns, just to showcase the encoding. Considering this issue, XBOS didn't work with n_clusters greater than 2, which raises the question: why? How can I force it to increase?
Another issue is the mathematical concept behind the implementation. I'm not interested in deep math, but it would be great if you could help me understand which math formula has been implemented in the 2nd & 3rd functions in xbos.py. Regarding this, I tried to reach you via your email; maybe you haven't checked your mail, or my email went to spam because I attached a picture of the assumed formula.

@Kanatoko
Owner

How can I force it to increase it?

You need to drop low-cardinality (lower than the cluster size) columns before using XBOS.
For example, you cannot apply K-means clustering with k=3 on this data:
[ 0,0,0,0,1,1,1,1 ]
because its cardinality is 2, and 2 is lower than 3.
We cannot get 3 clusters from this data.
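This can be reproduced directly with scikit-learn's KMeans, which emits the same ConvergenceWarning seen in the original traceback (a sketch; exact wording may vary across versions):

```python
import warnings
import numpy as np
from sklearn.cluster import KMeans

# cardinality 2, so 3 distinct clusters are impossible
X = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float).reshape(-1, 1)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Only 2 distinct labels come out, and sklearn warns about it
print(len(np.unique(km.labels_)))  # 2
print(any("distinct clusters" in str(w.message) for w in caught))
```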

the mathematical concept behind the implementation

The math of XBOS is pretty easy.
This blog might help. (Sorry, it is written in Japanese.)
https://www.scutum.jp/information/waf_tech_blog/2018/03/waf-blog-054.html

@clevilll
Author

clevilll commented Oct 17, 2021

How can I force it to increase it?

You need to drop low-cardinality (lower than the cluster size) columns before using XBOS. For example, you cannot apply K-means clustering with k=3 on this data: [ 0,0,0,0,1,1,1,1 ] because its cardinality is 2, and 2 is lower than 3. We cannot get 3 clusters from this data.

I see your point now. You mean that if my dataframe has a column/dimension/feature with binary values, e.g. Encoding_Type_Encoded in the frame above (encoded from its true/false info), it ruins this and should therefore be dropped. If that is the case, the algorithm could be extended to check the cardinality of the dimensions and drop the low-cardinality ones automatically, using:

print("No of unique values in each column :\n", df.nunique(axis=0, dropna=False)) 

in my case for the above frame is:

No of unique values in each column:
Type                  7
Length               12
Token_number          7
Character_feature    12
Encoding_type         2
Freq                  3

Here I should drop Encoding_type; there is no need to even encode it as Encoding_Type_Encoded. Then n_clusters<=7 should execute. Now it makes sense why I was facing KeyError: 7 due to the cardinality of my 1st column when I was setting n_clusters=8!! Do you confirm that?
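Under that reading, the largest usable n_clusters is simply the smallest column cardinality; a sketch with an illustrative helper name and sample values:

```python
import pandas as pd

def max_n_clusters(df):
    # XBOS clusters each column separately, so n_clusters is capped
    # by the column with the fewest distinct values.
    return int(df.nunique(dropna=False).min())

df = pd.DataFrame({
    "Length": [90, 169, 12, 11, 722, 10],  # 6 distinct values
    "Token_number": [2, 11, 2, 2, 5, 2],   # 3 distinct values
})
print(max_n_clusters(df))  # 3
```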

The math of XBOS is pretty easy. This blog might help. ( Sorry it is written in Japanese )

Thanks, I do my best! :D Have a nice Sunday. Sunday is Funday! :)

@Kanatoko
Owner

Do you confirm that?

Yes.

But I'm sorry, I will not implement that (automatically removing features) because:

  1. This XBOS Python code is just a proof of concept. It should be tiny.
  2. Anyone can extend it because it is open source.
  3. Dropping features automatically is not a good idea; someone might not notice it.

@Kanatoko
Owner

By the way, if you have many low-cardinality features, I think Isolation Forest is a good solution for anomaly/outlier detection.
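A minimal sketch of that alternative with scikit-learn's IsolationForest, on synthetic data (the binary feature and the injected outlier are just for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# One binary (low-cardinality) feature plus one continuous feature
X = np.column_stack([rng.integers(0, 2, 200).astype(float),
                     rng.normal(0.0, 1.0, 200)])
X[0] = [1.0, 8.0]  # inject an obvious outlier in the continuous feature

clf = IsolationForest(random_state=0).fit(X)
scores = clf.decision_function(X)  # lower score = more anomalous
print(int(np.argmin(scores)))      # index of the most anomalous point
```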
