Problem with max size of clusters #8

Arsalan-Vosough · 2020-12-23T18:23:41Z

Hi

Thank you for sharing your code! I used it to cluster my data into 10 clusters with size_min = 3 and size_max = 5. Unfortunately, it sometimes returns clusters with more than size_max elements; for example, I have seen a cluster with 7 elements.

joshlk · 2020-12-23T18:44:19Z

Hi, great to hear that you're using it 😀.

Can you please provide a minimal working example? Thanks, Josh

Arsalan-Vosough · 2020-12-24T07:57:35Z

Longitude Latitude
0 0.143799 0.549696
1 0.748523 0.666809
2 0.893091 0.485969
3 0.633522 0.273117
4 0.691772 0.763385
5 0.671481 0.112269
6 0.250957 0.781550
7 0.199018 0.798926
8 0.680017 0.201779
9 0.270592 0.461235
10 0.648789 0.140139
11 0.417517 0.114667
12 0.733276 0.254028
13 0.283617 0.515177
14 0.256486 0.788757
15 0.369168 0.380070
16 0.265186 0.596243
17 0.356121 0.442192
18 0.651694 0.876345
19 0.166674 0.829551
20 0.623306 0.034364
21 0.250798 0.911847
22 0.448605 0.517670
23 0.529576 0.000000
24 0.622372 0.215839
25 0.492679 0.621276
26 0.349826 0.242467
27 0.561980 0.855117
28 0.543573 1.000000
29 0.000000 0.572787
30 0.285501 0.358724
31 0.398475 0.106590
32 1.000000 0.452500
33 0.367203 0.419650
34 0.672594 0.257735
35 0.590781 0.022893
36 0.459228 0.146675
37 0.480092 0.666456
38 0.451271 0.225341
39 0.767639 0.395854
40 0.702797 0.589130

This is my data. I normalized it with MinMaxScaler and used this function for clustering:

def k_means_cons(k, minVal, maxVal, data):
    clf = KMeansConstrained(
        n_clusters=k,
        size_min=minVal,
        size_max=maxVal,
        random_state=0, max_iter=300)
    clf.fit(data)

    clf.cluster_centers_

    Label = clf.predict(data)
    return Label

Label = k_means_cons(10, 3, 5, normalized)

and it returns:

array([0, 9, 7, 1, 9, 8, 5, 5, 1, 4, 8, 3, 1, 4, 5, 4, 0, 4, 6, 5, 8, 5,
2, 8, 1, 2, 3, 6, 6, 0, 4, 3, 7, 4, 1, 8, 3, 2, 3, 7, 9])

As you can see, the cluster with label 4 has 6 elements, which is more than size_max = 5.
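A quick way to check a label array against the size limits is to count how often each label occurs. The following is a minimal sketch using only the standard library; the `labels` list is copied from the output above, and the name `violations` is just an illustrative choice.

```python
from collections import Counter

# Labels copied from the output reported above (41 points, 10 clusters).
labels = [0, 9, 7, 1, 9, 8, 5, 5, 1, 4, 8, 3, 1, 4, 5, 4, 0, 4, 6, 5, 8, 5,
          2, 8, 1, 2, 3, 6, 6, 0, 4, 3, 7, 4, 1, 8, 3, 2, 3, 7, 9]
size_min, size_max = 3, 5

# Count the members of each cluster and keep only those outside the limits.
sizes = Counter(labels)
violations = {c: n for c, n in sorted(sizes.items())
              if not size_min <= n <= size_max}

print(violations)  # {4: 6}
```

Every other cluster has between 3 and 5 members, so label 4 is the only one that breaks the limits in this run.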

joshlk · 2020-12-24T11:39:26Z

Thanks. What exact normalisation did you use?

Which sklearn and ortools versions are you using?

Arsalan-Vosough · 2020-12-24T13:17:36Z

minmax_scale = preprocessing.MinMaxScaler(feature_range=(0,1))
scaled_feature = minmax_scale.fit_transform(data)

sklearn version is 0.23.2 and ortools version is 8.1.8487

Arsalan-Vosough · 2020-12-24T13:48:50Z

I think I made it too complicated. In short, if you run the code below, it sometimes gives you a cluster with more than size_max elements:

import random

import pandas as pd
from k_means_constrained import KMeansConstrained

def generatedb(numberPatient):
    # Rejection-sample random points inside a region bounded by four lines.
    patient = []
    i = 0
    while len(patient) <= numberPatient:
        x = random.uniform(51.078418, 51.701563)
        y = random.uniform(35.514715, 35.901148)
        if y < -0.722386*x + 72.866:
            if y > -0.7184706*x + 72.576:
                if y > 0.692935*x + 0.0551:
                    if y < 0.549044*x + 7.5495:
                        patient.append((x, y))
                        i = i + 1
    dataWithcolName = pd.DataFrame(patient, columns=['Longitude', 'Latitude'])
    return dataWithcolName

def k_means_cons(k, minVal, maxVal, data):
    clf = KMeansConstrained(
        n_clusters=k,
        size_min=minVal,
        size_max=maxVal,
        random_state=0, max_iter=300)
    clf.fit(data)

    clf.cluster_centers_

    Label = clf.predict(data)
    return Label

data0 = generatedb(38)
# `normalized` is not defined in this snippet; presumably it is the generated
# data after the MinMaxScaler step shown in the previous comment.
Label = k_means_cons(10, 3, 5, normalized)
Label

array([7, 9, 6, 0, 1, 0, 1, 3, 0, 5, 2, 5, 7, 2, 4, 0, 3, 5, 4, 8, 7, 0,
6, 6, 7, 4, 9, 1, 9, 4, 1, 8, 3, 9, 5, 7, 2, 1, 2, 4, 2, 8, 5, 7])

joshlk · 2020-12-28T15:37:00Z

Hi,

I have worked out what the issue is, and it's my fault: the example on the front page of this project is wrong. So thank you for raising the issue.

You need to use the fit_predict method instead of fit followed by predict. This is because predict assigns each point to the nearest centre without obeying the min and max constraints, whereas fit_predict does obey them. You can also access the assigned labels through the labels_ attribute after a fit. As I said, the front page of this project uses fit and then predict, so I did not communicate this properly.

I would say the current predict function does not meet expectations, so I have changed it in the latest version so that it does obey the min and max constraints. If you update to the latest version (v0.5.0), which is on PyPI, it should now work.
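To illustrate why a plain nearest-centre assignment can break a size limit, here is a minimal standalone sketch. It does not use the library's code: the function `nearest_centre_labels`, the toy points and the centres are all made up for illustration, and it only mimics the unconstrained behaviour described above.

```python
import math
from collections import Counter

def nearest_centre_labels(points, centres):
    """Assign each point to its closest centre, ignoring any size limits."""
    return [min(range(len(centres)), key=lambda j: math.dist(p, centres[j]))
            for p in points]

# Toy data (hypothetical): four points sit near the first centre, one near the second.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (1.0, 1.0)]
centres = [(0.05, 0.05), (1.0, 1.0)]
size_max = 3

labels = nearest_centre_labels(points, centres)
oversized = {c: n for c, n in Counter(labels).items() if n > size_max}

print(labels)     # [0, 0, 0, 0, 1]
print(oversized)  # {0: 4}
```

Because each point is handled independently, nothing stops four points from landing in a cluster limited to three. A constrained assignment has to consider all points together, which is what fit_predict and labels_ give you.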

Thanks again for reporting this,
Josh

Arsalan-Vosough · 2020-12-29T14:37:13Z

Hi,

I used fit_predict and it worked.

Thanks for your quick response.

joshlk mentioned this issue Dec 28, 2020

Constrained predict #9

Merged

joshlk closed this as completed Dec 29, 2020
