Problem with max size of clusters #8

Closed
Arsalan-Vosough opened this issue Dec 23, 2020 · 7 comments

Comments

@Arsalan-Vosough

Hi

Thank you for sharing your code! I used it to cluster my data into 10 clusters with size_min = 3 and size_max = 5. Unfortunately, it returns some clusters with more than size_max elements; sometimes it gives me a cluster with 7 elements.

@joshlk
Owner

joshlk commented Dec 23, 2020

Hi, great to hear that you're using it 😀.

Can you please provide a minimal working example? Thanks, Josh

@Arsalan-Vosough
Author

Longitude Latitude
0 0.143799 0.549696
1 0.748523 0.666809
2 0.893091 0.485969
3 0.633522 0.273117
4 0.691772 0.763385
5 0.671481 0.112269
6 0.250957 0.781550
7 0.199018 0.798926
8 0.680017 0.201779
9 0.270592 0.461235
10 0.648789 0.140139
11 0.417517 0.114667
12 0.733276 0.254028
13 0.283617 0.515177
14 0.256486 0.788757
15 0.369168 0.380070
16 0.265186 0.596243
17 0.356121 0.442192
18 0.651694 0.876345
19 0.166674 0.829551
20 0.623306 0.034364
21 0.250798 0.911847
22 0.448605 0.517670
23 0.529576 0.000000
24 0.622372 0.215839
25 0.492679 0.621276
26 0.349826 0.242467
27 0.561980 0.855117
28 0.543573 1.000000
29 0.000000 0.572787
30 0.285501 0.358724
31 0.398475 0.106590
32 1.000000 0.452500
33 0.367203 0.419650
34 0.672594 0.257735
35 0.590781 0.022893
36 0.459228 0.146675
37 0.480092 0.666456
38 0.451271 0.225341
39 0.767639 0.395854
40 0.702797 0.589130

This is my data. I normalized it with a min-max scaler and used this function to cluster it:

def k_means_cons(k,minVal,maxVal,data):

    clf = KMeansConstrained(
        n_clusters=k,
        size_min=minVal,
        size_max=maxVal,
        random_state=0,max_iter = 300)
    clf.fit(data)

    clf.cluster_centers_

    Label = clf.predict(data)
    return Label

Label = k_means_cons(10,3,5,normalized)

and it returns:

array([0, 9, 7, 1, 9, 8, 5, 5, 1, 4, 8, 3, 1, 4, 5, 4, 0, 4, 6, 5, 8, 5,
2, 8, 1, 2, 3, 6, 6, 0, 4, 3, 7, 4, 1, 8, 3, 2, 3, 7, 9])

As you can see, there are 6 elements in the cluster with label 4, which is more than size_max = 5.
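A quick way to check this claim is to count the labels per cluster. The sketch below uses only the standard library; the helper names `cluster_sizes` and `violations` are made up for illustration and are not part of k_means_constrained.

```python
from collections import Counter

# Labels copied from the output above (41 points, k = 10).
labels = [0, 9, 7, 1, 9, 8, 5, 5, 1, 4, 8, 3, 1, 4, 5, 4, 0, 4, 6, 5, 8, 5,
          2, 8, 1, 2, 3, 6, 6, 0, 4, 3, 7, 4, 1, 8, 3, 2, 3, 7, 9]

SIZE_MIN, SIZE_MAX = 3, 5  # the constraints passed to KMeansConstrained


def cluster_sizes(labels):
    """Return {cluster label: number of points}, sorted by label."""
    return dict(sorted(Counter(labels).items()))


def violations(sizes, size_min, size_max):
    """Return the clusters whose size falls outside [size_min, size_max]."""
    return {c: n for c, n in sizes.items() if not size_min <= n <= size_max}


sizes = cluster_sizes(labels)
print(sizes)
print(violations(sizes, SIZE_MIN, SIZE_MAX))  # {4: 6}
```

Only the cluster with label 4 breaks the limit here; every other cluster has between 3 and 5 points.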

@joshlk
Owner

joshlk commented Dec 24, 2020

Thanks. What exact normalisation did you use?

Which sklearn and ortools versions are you using?

@Arsalan-Vosough
Author

minmax_scale = preprocessing.MinMaxScaler(feature_range=(0,1))
scaled_feature = minmax_scale.fit_transform(data)

My sklearn version is 0.23.2 and my ortools version is 8.1.8487.

@Arsalan-Vosough
Author

I think I made it too complex. In short, if you run the code below, it sometimes gives you a cluster with more than size_max elements:

import random

import pandas as pd
from k_means_constrained import KMeansConstrained

def generatedb(numberPatient):
    # Rejection sampling: keep random points that fall inside the region bounded by the four lines below.
    patient=[]
    while len(patient)<=numberPatient:
        x = random.uniform(51.078418,51.701563)
        y = random.uniform(35.514715,35.901148)
        if y < -0.722386*x+72.866:
            if y > -0.7184706*x+72.576:
                if y > 0.692935*x+0.0551:
                    if y<0.549044*x+7.5495:
                        patient.append((x,y))
    dataWithcolName =  pd.DataFrame(patient,columns=['Longitude', 'Latitude'])
    return(dataWithcolName)

def k_means_cons(k,minVal,maxVal,data):

    clf = KMeansConstrained(
        n_clusters=k,
        size_min=minVal,
        size_max=maxVal,
        random_state=0,max_iter = 300)
    clf.fit(data)

    clf.cluster_centers_

    Label = clf.predict(data)
    return Label

data0 = generatedb(38)
# `normalized` is not defined in this snippet; presumably it is the min-max scaled data0.
Label = k_means_cons(10,3,5,normalized)
Label

array([7, 9, 6, 0, 1, 0, 1, 3, 0, 5, 2, 5, 7, 2, 4, 0, 3, 5, 4, 8, 7, 0,
6, 6, 7, 4, 9, 1, 9, 4, 1, 8, 3, 9, 5, 7, 2, 1, 2, 4, 2, 8, 5, 7])

@joshlk
Owner

joshlk commented Dec 28, 2020

Hi,

I've worked out what the issue is, and it's my fault: the example on the front page of this project is wrong. Thank you for raising it.

You need to use the fit_predict method instead of calling fit and then predict. This is because predict assigns each point to the nearest centre without obeying the min and max constraints, whereas fit_predict does obey them. You can also access the assigned labels through the labels_ attribute after a fit. As I said, the front page of this project uses fit followed by predict, so I didn't communicate this properly.
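To illustrate why nearest-centre assignment can break the size limits, here is a minimal, library-free sketch. It is not the actual predict implementation of k_means_constrained: the function names, the toy points, and SIZE_MAX are all made up for illustration.

```python
import math


def nearest_centre(point, centres):
    """Index of the centre closest to `point` (plain Euclidean distance)."""
    return min(range(len(centres)), key=lambda i: math.dist(point, centres[i]))


def assign_nearest(points, centres):
    """Unconstrained assignment: every point goes to its nearest centre."""
    return [nearest_centre(p, centres) for p in points]


# Hypothetical toy data: two fixed centres, with most points near centre 0.
centres = [(0.0, 0.0), (1.0, 1.0)]
points = [(0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.2, 0.1), (0.9, 1.0), (1.0, 0.9)]

SIZE_MAX = 3  # illustrative limit; assign_nearest never looks at it

labels = assign_nearest(points, centres)
sizes = [labels.count(i) for i in range(len(centres))]
print(labels)  # [0, 0, 0, 0, 1, 1]
print(sizes)   # [4, 2] -> cluster 0 exceeds SIZE_MAX
```

In the reported code, the fix described above amounts to replacing `clf.fit(data)` followed by `clf.predict(data)` with `Label = clf.fit_predict(data)`, or reading `clf.labels_` after `clf.fit(data)`.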

As it stands, I would say the predict function does not meet expectations, so I have changed it in the latest version so that it obeys the min and max constraints. If you update to the latest version (v0.5.0), which is on PyPI, it should now work.

Thanks again for reporting this,
Josh

@Arsalan-Vosough
Author

Arsalan-Vosough commented Dec 29, 2020

Hi,

I used fit_predict and it worked.

Thanks for your quick response.

@joshlk joshlk closed this as completed Dec 29, 2020