# Data Mining

## K-Means

### After completing the materials in this notebook, you should be able to:

* Explain what k-means clusters are, how they are found and the benefits of using them.
* Recognize the necessary format for data in order to create k-means clusters.
* Develop a k-means cluster data mining model.
* Interpret the clusters generated by a k-means model and explain their significance, if any.

#### Organizational Understanding
    We are trying to find natural groups of individuals who are most at risk because of high weight and high cholesterol.

#### Data Understanding
* __Weight__: weight in pounds, recorded at the person’s most recent medical examination.
* __Cholesterol__: cholesterol level determined by blood work in the doctor’s lab.
* __Gender__: 0 indicates Female and 1 indicates Male.

#### Data Preparation

In [5]:
import pandas as pd
data = pd.read_csv('data.csv')
data

Unnamed: 0,Weight,Cholesterol,Gender
0,102,111,1
1,115,135,1
2,115,136,1
3,140,167,0
4,130,158,1
5,198,227,1
6,114,131,1
7,145,176,0
8,191,223,0
9,186,221,1


In [6]:
# Check for missing data
print(f'Is there any null value in dataset?? {data.isnull().values.any()}')
# Show any rows that contain at least one null value
data[data.isnull().any(axis=1)]

Is there any null value in dataset?? False


Unnamed: 0,Weight,Cholesterol,Gender


In [13]:
# Check for inconsistent values
for col in data.columns:
    print(col, data[col].unique())
data.describe()

Weight [102 115 140 130 198 114 145 191 186 104 188  96 156 125 178 109 168 152
 133 153 107 199  95 183 108 190 174 149 169 138 151 106 195 129 166 197
 148 117 193 170 134 128 105 110 164 157 124 113 150 100 139 101 187 137
 121 132 180 122 185 123 119 126 116 144 154  97 146 118 179 142 131 176
 103 120 143 203 192 112 173 141 175 181 111 200 167 196 171 135 184 161
 159 158 201 127 172 155 160 136 182 147  99  98 177 194 189 162]
Cholesterol [111 135 136 167 158 227 131 176 223 221 116 222 102 192 152 213 125 204
 189 163 122 228 168 218 123 208 183 188 126 225 105 155 203 177 139 224
 207 164 154 118 138 199 219 128 197 196 175 129 185 107 211 110 194 166
 143 191 216 106 146 220 147 141 130 172 148 120 157 149 115 230 108 140
 214 169 209 162 145 232 180 117 104 195 156 133 124 231 184 171 132 233
 142 212 161 144 201 179 193 114 202 170 165 173 127 112 151 235 119 134
 181 121 190 226 205 159 182 160 187 109 153 150 174 217 113 178 215 186
 198 234 200 210 137]
Gender [1 0]


Unnamed: 0,Weight,Cholesterol,Gender
count,547.0,547.0,547.0
mean,143.572212,170.433272,0.513711
std,30.837275,39.147189,0.500269
min,95.0,102.0,0.0
25%,116.0,136.0,0.0
50%,140.0,169.0,1.0
75%,171.0,208.0,1.0
max,203.0,235.0,1.0


#### Modeling
    The ‘k’ in k-means clustering stands for the number of groups, or clusters, to form. The method compares each observation’s attribute values with the means (averages) of candidate groups of observations and assigns it to the closest one, in order to find natural groups of observations that are similar to one another.
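
The assign-then-average loop described above can be sketched in a few lines of NumPy. This is a toy illustration only, not scikit-learn's implementation: the function name `lloyd_kmeans`, its parameters, and the six made-up (weight, cholesterol) rows are all assumptions introduced for this example.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Toy k-means loop (hypothetical helper, not scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: each observation joins its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving, so the labels are final
        centers = new_centers
    return labels, centers

# Made-up (weight, cholesterol) rows forming two obvious groups.
toy = np.array([[100, 110], [105, 118], [110, 125],
                [190, 220], [195, 228], [200, 231]], dtype=float)
labels, centers = lloyd_kmeans(toy, k=2)
print(labels)
print(centers)
```

On this toy input the first three rows end up in one cluster and the last three in the other; which of the two gets label 0 depends on the random starting points.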

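class sklearn.cluster.__KMeans__(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')

This is the signature in the scikit-learn release used for this notebook; scikit-learn 1.0 and later no longer accept precompute_distances or n_jobs.

__n_clusters__ : int, optional, default: 8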

    The number of clusters to form as well as the number of centroids to generate.

__init__ : {‘k-means++’, ‘random’ or an ndarray}

    Method for initialization, defaults to ‘k-means++’:

    ‘k-means++’ : selects initial cluster centers for k-means clustering in a smart way to speed up convergence.
    ‘random’: choose k observations (rows) at random from data for the initial centroids.

    If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

__n_init__ : int, default: 10

    Number of times the k-means algorithm will be run with different centroid seeds. The final result is the best output of the n_init consecutive runs in terms of inertia (the sum of squared distances to the nearest cluster center).
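
To make the inertia definition concrete, the sketch below recomputes it by hand and compares it with the fitted model's `inertia_` attribute. The data is a synthetic stand-in generated for this example (an assumption, not the notebook's dataset).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for two numeric columns (not the notebook's data).
X = np.vstack([rng.normal([110, 120], 5, size=(50, 2)),
               rng.normal([190, 220], 5, size=(50, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Inertia by hand: squared distance from each row to its assigned center, summed.
assigned_centers = model.cluster_centers_[model.labels_]
manual_inertia = ((X - assigned_centers) ** 2).sum()
print(manual_inertia, model.inertia_)
```

The two printed numbers should agree up to floating-point rounding; among the n_init runs, scikit-learn keeps the one with the smallest such value.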

__max_iter__ : int, default: 300

    Maximum number of iterations of the k-means algorithm for a single run.

__tol__ : float, default: 1e-4

    Relative tolerance with regard to inertia to declare convergence.

__precompute_distances__ : {‘auto’, True, False}

    Precompute distances (faster but takes more memory).

    ‘auto’ : do not precompute distances if n_samples * n_clusters > 12 million. This corresponds to about 100MB overhead per job using double precision.

    True : always precompute distances

    False : never precompute distances

__verbose__ : int, default 0

    Verbosity mode.

__random_state__ : int, RandomState instance or None, optional, default: None

    If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
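
As a small illustration of why an integer seed matters, the sketch below fits the same model twice with the same random_state and checks that the cluster assignments match. The four synthetic blobs are an assumption made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Four well-separated synthetic blobs (illustrative data only).
blob_means = np.array([[0, 0], [20, 0], [0, 20], [20, 20]])
X = np.vstack([rng.normal(m, 1.0, size=(50, 2)) for m in blob_means])

# Same data, same seed: the two fits should produce the same clustering.
first = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
second = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

same_labels = np.array_equal(first.labels_, second.labels_)
print(same_labels)
```

With random_state=None, as in the cell further down, the cluster numbering and counts may differ from one run of the notebook to the next.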

__copy_x__ : boolean, default True

    When pre-computing distances it is more numerically accurate to center the data first. If copy_x is True, then the original data is not modified. If False, the original data is modified, and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean.

__n_jobs__ : int

    The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel.

    If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.

__algorithm__ : “auto”, “full” or “elkan”, default=”auto”

    K-means algorithm to use. The classical EM-style algorithm is “full”. The “elkan” variation is more efficient by using the triangle inequality, but currently doesn’t support sparse data. “auto” chooses “elkan” for dense data and “full” for sparse data.


In [63]:
from sklearn.cluster import KMeans
cls = KMeans(n_clusters=4)
cls.fit(data.values)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=4, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [64]:
from collections import Counter
Counter(cls.labels_)

Counter({3: 140, 1: 135, 0: 154, 2: 118})

In [76]:
# Print one row per attribute and one column per cluster center
print('Attribute          cluster_0      cluster_1       cluster_2       cluster_3')
for col, center in zip(data.columns, cls.cluster_centers_.T):
    c = '|'.join(f'{i:^15f}' for i in center)
    print(f'{col:15s}|{c}|')

Attribute          cluster_0      cluster_1       cluster_2       cluster_3
Weight         |  184.318182   |  127.725926   |  152.093220   |  106.850000   |
Cholesterol    |  218.915584   |  154.385185   |  185.906780   |  119.535714   |
Gender         |   0.590909    |   0.459259    |   0.440678    |   0.542857    |


* Cluster 0 has the highest average weight and cholesterol, and about 59% of its members are men (Gender mean of 0.59). High cholesterol and high weight are two key indicators of heart disease risk.
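
One possible way to read the centers more easily is to place them in a DataFrame with one row per cluster. The sketch below uses a synthetic frame named `demo` (an assumption, not the notebook's data) and also checks that each center equals the column means of the rows assigned to it, which is why a 0/1 Gender center can be read as the share of members coded 1.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Synthetic stand-in with the same column names as the notebook's data.
demo = pd.DataFrame({
    "Weight": np.r_[rng.normal(110, 5, 60), rng.normal(185, 5, 60)],
    "Cholesterol": np.r_[rng.normal(125, 5, 60), rng.normal(220, 5, 60)],
    "Gender": rng.integers(0, 2, 120),
})
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(demo.values)

# One row per cluster, one column per attribute.
centers = pd.DataFrame(model.cluster_centers_, columns=demo.columns)
centers.index.name = "cluster"
print(centers.round(2))

# Each center is the mean of the rows assigned to that cluster.
by_label = demo.groupby(model.labels_).mean()
print(by_label.round(2))
```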

In [80]:
cluster_0 = data[cls.predict(data.values) == 0]
cluster_0

Unnamed: 0,Weight,Cholesterol,Gender
5,198,227,1
8,191,223,0
9,186,221,1
11,188,222,1
15,178,213,0
17,168,204,1
22,199,228,1
25,183,218,0
27,190,222,0
28,174,208,1


#### Deployment

In [86]:
cluster_0.describe()

Unnamed: 0,Weight,Cholesterol,Gender
count,154.0,154.0,154.0
mean,184.318182,218.915584,0.590909
std,9.809096,8.190502,0.49327
min,167.0,204.0,0.0
25%,176.25,212.25,0.0
50%,183.5,220.0,1.0
75%,191.0,225.0,1.0
max,203.0,235.0,1.0


* Some people in this cluster have a cholesterol level of 235, the highest value in the dataset, which is dangerously high.
* Using each cluster's membership, one can retrieve the name and address of each person in it to contact and warn them.
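
Because the cluster numbers assigned by KMeans are arbitrary and can change between runs when random_state is not fixed, it may be safer to pick the high-risk cluster by its center rather than by a hard-coded label. The following is a minimal sketch under stated assumptions: the `demo` frame is synthetic, and `risk_label` and `high_risk` are names introduced for this example only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
# Synthetic stand-in: three well-separated groups of 40 people each.
demo = pd.DataFrame({
    "Weight": np.r_[rng.normal(110, 5, 40), rng.normal(150, 5, 40),
                    rng.normal(190, 5, 40)],
    "Cholesterol": np.r_[rng.normal(125, 5, 40), rng.normal(180, 5, 40),
                         rng.normal(220, 5, 40)],
})
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(demo.values)

# Find the cluster whose center has the highest cholesterol value.
chol_idx = demo.columns.get_loc("Cholesterol")
risk_label = int(model.cluster_centers_[:, chol_idx].argmax())

# Rows that belong to that cluster, whatever number it happened to receive.
high_risk = demo[model.labels_ == risk_label]
print(high_risk.describe())
```

The index of `high_risk` can then be used to look up the matching records in whatever system holds the contact details.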