## Import
Import **numpy**, **pandas**, and **matplotlib**.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%load_ext autoreload
%autoreload 2

## Creating Clusters in the Synthetic Dataset using k-Means Algorithm
Open the `kmeans.py` file. Some of the functions in the `KMeans` class are not yet implemented. We will implement the missing parts of this class.

Import the `KMeans` class.

In [2]:
from kmeans import KMeans

## Dataset 8
For this notebook, we will work on dataset 8. The group decided to treat this as a clustering dataset, based on several observations. First, there is a class variable, which suggests that the observations are meant to be grouped in some way. Second, the values are continuous. Continuous values rule out item counts, which in turn makes it unlikely that this is a rule mining dataset. The granularity of the values, which go up to 5 decimal places, hints that they are not user ratings either. This is further supported by the presence of negative values, which rules out implicitly generated ratings.
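
The checks described above can also be run programmatically. The following is a minimal sketch on a small made-up frame; `sample_df` and its values are illustrative assumptions, not the contents of `Dataset8.csv`.

```python
import pandas as pd

# Toy sample; the values are illustrative, not the real Dataset8.csv contents.
sample_df = pd.DataFrame({
    'f1': [-5.45141, 1.93952, -13.60589],
    'f2': [24.12697, 8.61347, 16.10749],
    'class': [0, 0, 1],
})

features = sample_df.drop(columns='class')

# Continuous values: at least one entry has a fractional part.
has_fractional = bool(((features % 1) != 0).any().any())

# Negative values make raw item counts or implicit ratings implausible.
has_negative = bool((features < 0).any().any())

# Number of distinct labels in the class column.
num_classes = sample_df['class'].nunique()

print(has_fractional, has_negative, num_classes)
```

The same three checks could be applied to the real `DataFrame` once it has been loaded.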

If you view the `.csv` file in Excel, you can see that our dataset contains 900 **observations** (rows) across 10 feature **variables** (columns), plus an unnamed index column and a `class` column. The following are the feature variables in the dataset.

- **`f1`**
- **`f2`**
- **`f3`**
- **`f4`**
- **`f5`**
- **`f6`**
- **`f7`**
- **`f8`**
- **`f9`**
- **`f10`**

For this dataset, we will assume that each row represents a medical record for one person. Each variable represents some kind of health metric, such as blood sugar or blood pressure. These people will be grouped into 3 classes based on their health records.

Let's read the dataset.

In [3]:
dataset_df = pd.read_csv('Dataset8.csv')

Let us call the [`info`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) function, which displays general information about the dataset.

In [4]:
dataset_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 900 entries, 0 to 899
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  900 non-null    int64  
 1   f1          900 non-null    float64
 2   f2          900 non-null    float64
 3   f3          900 non-null    float64
 4   f4          900 non-null    float64
 5   f5          900 non-null    float64
 6   f6          900 non-null    float64
 7   f7          900 non-null    float64
 8   f8          900 non-null    float64
 9   f9          900 non-null    float64
 10  f10         900 non-null    float64
 11  class       900 non-null    int64  
dtypes: float64(10), int64(2)
memory usage: 84.5 KB
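
The `Unnamed: 0` column in the output above is what pandas produces when a CSV file has an unlabeled first column, typically a saved row index. As a sketch of one way to handle it, `read_csv` accepts `index_col=0` to use that column as the index instead. The CSV text below is a made-up stand-in for the real file.

```python
import io
import pandas as pd

# Made-up CSV text with an unlabeled first column, standing in for the real file.
csv_text = ",f1,f2,class\n0,-5.45,24.13,0\n1,1.94,8.61,0\n"

# Default behaviour: the unlabeled column becomes 'Unnamed: 0'.
default_df = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column is used as the row index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))
print(list(indexed_df.columns))
```

This notebook keeps the default behaviour. Note that with `index_col=0` every column position shifts by one, so any code that selects features by position would presumably need adjusting.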


In [11]:
range_df = pd.DataFrame(data={'min': dataset_df.min(), 'max': dataset_df.max()})
range_df.drop(dataset_df.columns[0])

| | min | max |
|---|---|---|
| f1 | -32.088394 | 28.234197 |
| f2 | -19.685436 | 35.567507 |
| f3 | -8.540643 | 50.468531 |
| f4 | 2.715834 | 54.700314 |
| f5 | 16.785543 | 68.070939 |
| f6 | 25.87813 | 76.757192 |
| f7 | 38.117091 | 83.514346 |
| f8 | 44.639279 | 91.777657 |
| f9 | 54.410047 | 100.681336 |
| f10 | 68.148391 | 112.777374 |


Instantiate a `KMeans` object with `k` equal to `3`, `start_var` equal to `1`, `end_var` equal to `11`, `num_observations` equal to `900`, and `data` equal to the `DataFrame` object which represents our dataset.

In [8]:
kmeans = KMeans(3, 1, 11, 900, dataset_df)

Initialize the centroids.

In [9]:
kmeans.initialize_centroids(dataset_df)

| | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -5.71085 | 15.094523 | 10.923855 | 38.56511 | 33.292183 | 57.28237 | 54.949177 | 65.369141 | 84.635297 | 86.999744 |
| 1 | -4.267853 | -16.263877 | 33.946826 | 32.118864 | 60.628451 | 50.805646 | 67.131122 | 78.154488 | 65.519522 | 91.30092 |
| 2 | -0.846899 | 13.283446 | 29.694957 | 29.343965 | 63.831882 | 42.928793 | 81.862522 | 53.293976 | 96.324544 | 101.543847 |
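
The contents of `kmeans.py` are not shown in this notebook, so the following is only a sketch of one common way to initialize centroids: sampling `k` distinct observations and using their feature values as starting points. The helper name `sample_initial_centroids`, its `seed` parameter, and the toy data are assumptions made for illustration. The actual `initialize_centroids` method may work differently.

```python
import numpy as np
import pandas as pd

def sample_initial_centroids(data, k, start_var, end_var, seed=None):
    """Pick k distinct rows as starting centroids (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Choose k different row positions so no two centroids start identical.
    row_positions = rng.choice(len(data), size=k, replace=False)
    # Keep only the feature columns in the half-open range [start_var, end_var).
    centroids = data.iloc[row_positions, start_var:end_var]
    return centroids.reset_index(drop=True)

# Toy frame: an index-like column, two features, and a class column.
toy_df = pd.DataFrame({
    'id': range(6),
    'f1': [0.0, 0.5, 1.0, 10.0, 10.5, 11.0],
    'f2': [0.0, 1.0, 0.5, 10.0, 11.0, 10.5],
    'class': [0, 0, 0, 1, 1, 1],
})

toy_centroids = sample_initial_centroids(toy_df, k=3, start_var=1, end_var=3, seed=0)
print(toy_centroids)
```

Passing a fixed `seed` makes the starting centroids reproducible, which helps when comparing runs.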


Cluster the dataset.

In [10]:
groups = kmeans.train(dataset_df, 300)

Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Iteration 6
Iteration 7
Iteration 8
Iteration 9
Iteration 10
Iteration 11
Iteration 12
Iteration 13
Iteration 14
Iteration 15
Iteration 16
Iteration 17
Iteration 18
Iteration 19
Iteration 20
Iteration 21
Iteration 22
Iteration 23
Iteration 24
Iteration 25
Iteration 26
Iteration 27
Iteration 28
Iteration 29
Iteration 30
Iteration 31
Done clustering!
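
The log above stops at iteration 31, well before the limit of 300 passed to `train`, which suggests that the loop ends early once it converges. A minimal sketch of the standard k-means (Lloyd's) loop with that kind of stopping rule is shown below. The function `lloyd_kmeans` and the toy data are assumptions for illustration, and the stopping rule used in `kmeans.py` may differ.

```python
import numpy as np
import pandas as pd

def lloyd_kmeans(data, centroids, max_iterations):
    """Alternate assignment and update steps until labels stop changing."""
    points = data.to_numpy(dtype=float)
    centers = centroids.to_numpy(dtype=float).copy()
    labels = None
    for iteration in range(1, max_iterations + 1):
        # Assignment step: Euclidean distance from every point to every center.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop early when no point changed cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for j in range(len(centers)):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return pd.Series(labels, index=data.index), centers

# Two well-separated toy blobs.
toy_points = pd.DataFrame({
    'f1': [0.0, 0.0, 1.0, 10.0, 10.0, 11.0],
    'f2': [0.0, 1.0, 0.0, 10.0, 11.0, 10.0],
})
# Start from one observation in each blob.
start = toy_points.iloc[[0, 3]].reset_index(drop=True)

toy_labels, toy_centers = lloyd_kmeans(toy_points, start, max_iterations=300)
print(toy_labels.tolist())
```

Returning the labels as a `Series` that shares the index of the input makes boolean selections such as `toy_points.loc[toy_labels == 0]` straightforward.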


Check the number of data points per class in each cluster. Answer the questions below.

In [13]:
dataset_df

array([0, 1, 2], dtype=int64)
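
The per-class counts for each cluster can also be tabulated in a single call with `pd.crosstab`. The sketch below uses made-up labels. It assumes the cluster assignments are available as a `Series` aligned with the index of the data, and `toy_df` and `toy_groups` are hypothetical names.

```python
import pandas as pd

# Made-up class labels and cluster assignments for six observations.
toy_df = pd.DataFrame({'class': [0, 0, 1, 1, 2, 2]})
toy_groups = pd.Series([1, 1, 0, 0, 2, 0], index=toy_df.index)

# Rows are clusters, columns are classes, cells are observation counts.
counts = pd.crosstab(toy_groups, toy_df['class'],
                     rownames=['cluster'], colnames=['class'])
print(counts)
```

A cluster that lines up well with one class shows a single large count in its row and zeros elsewhere.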

In [21]:
cluster_0 = dataset_df.loc[groups == 0]
cluster_1 = dataset_df.loc[groups == 1]
cluster_2 = dataset_df.loc[groups == 2]

print(cluster_0.loc[cluster_0['class'] == 0])
print('Number of data points in each cluster:')
print('Cluster 0:')
print('Class 0:\t', cluster_0.loc[cluster_0['class'] == 0].shape[0])
print('Class 1:\t', cluster_0.loc[cluster_0['class'] == 1].shape[0])
print('Class 2:\t', cluster_0.loc[cluster_0['class'] == 2].shape[0])
print('Cluster 1:')
print('Class 0:\t', cluster_1.loc[cluster_1['class'] == 0].shape[0])
print('Class 1:\t', cluster_1.loc[cluster_1['class'] == 1].shape[0])
print('Class 2:\t', cluster_1.loc[cluster_1['class'] == 2].shape[0])
print('Cluster 2:')
print('Class 0:\t', cluster_2.loc[cluster_2['class'] == 0].shape[0])
print('Class 1:\t', cluster_2.loc[cluster_2['class'] == 1].shape[0])
print('Class 2:\t', cluster_2.loc[cluster_2['class'] == 2].shape[0])

     Unnamed: 0         f1         f2         f3         f4         f5  \
1             1  -5.451407  24.126971  21.398380  36.928816  28.886724   
2             2   1.939516   8.613466  15.468965  37.749287  39.346290   
3             3 -13.605885  16.107486  20.186186  33.162006  30.406046   
4             4   6.025143  17.815497  13.296819  38.290870  40.021019   
5             5 -12.277771  14.140850  13.656782  33.016452  39.331827   
..          ...        ...        ...        ...        ...        ...   
286         286  -9.251580  17.874114  14.933038  36.581572  41.827065   
291         291   1.145308   4.128104  18.479135  26.517957  32.049883   
294         294   4.856792   9.183635  18.474114  41.507072  29.664412   
296         296  -4.972477  19.713641  24.089985  36.790919  39.918319   
299         299 -20.839971 -15.240110  19.762379  39.668911  25.479082   

            f6         f7         f8         f9         f10  class  
1    39.590812  43.296796  62.154686  93.5