## Import
Import **numpy**, **pandas**, and **matplotlib**.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%load_ext autoreload
%autoreload 2

## Creating Clusters in Dataset 8 using k-Means Algorithm

Import the `KMeans_py` class from the local `kmeans` module.

In [None]:
from kmeans import KMeans_py

## Dataset 8
For this notebook, we will work on dataset 8. The group decided to treat this as a clustering dataset, based on a number of factors. First, there is a class variable, which suggests that the observations belong to distinct groups. Second, the values are continuous. Continuous values rule out the possibility that these are item counts, which in turn makes it unlikely to be a rule mining dataset. The granularity of the values, which goes up to 5 decimal places, hints that they are not user ratings either. This is further supported by the presence of negative values, which rules out the possibility of implicitly generated ratings.
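
The checks behind this reasoning can be sketched in pandas. The small `sample_df` holds made-up values standing in for the real file, and `describe_values` is a hypothetical helper that is not part of the notebook.

```python
import pandas as pd

# Made-up values standing in for a few rows of the real dataset
sample_df = pd.DataFrame({
    'f1': [0.12345, -1.5, 2.25],
    'f2': [-0.00321, 0.75, 1.0],
    'class': [0, 1, 2],
})

def describe_values(df, label_col='class'):
    """Summarize properties that hint at the kind of dataset (hypothetical helper)."""
    features = df.drop(columns=[label_col])
    return {
        # Float columns suggest continuous measurements rather than item counts
        'all_float': all(pd.api.types.is_float_dtype(t) for t in features.dtypes),
        # Negative values are unlikely for counts or ratings
        'has_negative': bool((features < 0).any().any()),
        # Number of distinct labels in the class column
        'n_classes': int(df[label_col].nunique()),
    }

summary = describe_values(sample_df)
print(summary)
```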


If you view the `.csv` file in Excel, you can see that our dataset contains 900 **observations** (rows) across 10 feature **variables** (columns), plus a `class` column. The following are our assumptions of what each variable in the dataset represents:

- **`f1`**: Hip Hop
- **`f2`**: R&B
- **`f3`**: Jazz
- **`f4`**: Rock
- **`f5`**: K-Pop
- **`f6`**: Country
- **`f7`**: Heavy metal
- **`f8`**: EDM
- **`f9`**: Blues
- **`f10`**: Pop
- **`class`**: `0` represents songs released in the 1990s, `1` for songs released in the 2000s, and `2` for songs released in the 2010s

For this dataset, we will assume that each observation represents a song. For the variables, we will assume that each one is a genre and that the values are system-generated scores that represent how "close" the song is to that specific genre (e.g. a higher value under `f4` means that the song has many of the features of a rock song).

### EDA

Let us read the dataset.

In [None]:
dataset_df = pd.read_csv('Dataset8.csv')

Let us display the general `info` of the dataset.

In [None]:
dataset_df.info()

Let us rename the columns to the genre names assumed above.

In [None]:
dataset_df = dataset_df.rename(columns={
    'f1': 'Hip Hop',
    'f2': 'R&B',
    'f3': 'Jazz',
    'f4': 'Rock',
    'f5': 'K-Pop',
    'f6': 'Country',
    'f7': 'Heavy Metal',
    'f8': 'EDM',
    'f9': 'Blues',
    'f10': 'Pop'
})
dataset_df.head()

### Question 1: Which genre's features are, on average, the most used in songs?

Let us take only the necessary columns for this question.

In [None]:
# Keep only the genre columns: drop the index column and the class label
ave_df = dataset_df.drop(columns=['Unnamed: 0', 'class'])
ave_df

Now we take the average of each column.

In [None]:
ave_df = ave_df.mean()
ave_df

Let us plot the data into a bar plot.

In [None]:
ave_df.plot.bar()
plt.xlabel('Genre')
plt.ylabel('Value')
plt.title('Average Genre Likeness Value')

It appears that the features from Pop songs are most common on average.
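
Rather than reading the answer off the plot, the highest average can also be looked up directly. This is a minimal sketch on made-up values; `toy_df`, `genre_means`, and `top_genre` are illustrative names and the numbers are not from the dataset.

```python
import pandas as pd

# Made-up likeness values for three genres
toy_df = pd.DataFrame({
    'Rock': [0.2, 0.4, 0.6],
    'Pop': [1.0, 0.8, 1.2],
    'Jazz': [-0.5, 0.1, 0.1],
})

# Column means, sorted from highest to lowest
genre_means = toy_df.mean().sort_values(ascending=False)

# Label of the column with the highest mean
top_genre = genre_means.idxmax()
print(genre_means)
print('Highest average:', top_genre)
```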

### Question 2: Which genres are correlated?

Let us drop the columns that are not genre features (the index column and `class`).

In [None]:
dropped = dataset_df.drop(columns=['Unnamed: 0','class'])

Let us get and visualize the correlation matrix.

In [None]:
corr = dropped.corr()
# Styler.set_precision was removed in pandas 2.0; format(precision=...) replaces it
corr.style.background_gradient(cmap="coolwarm", axis=None).format(precision=2)

It seems that none of the genres are correlated with each other.
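
A colored matrix is easy to misread, so it can help to list the pairs ranked by absolute correlation. The sketch below uses made-up columns `a`, `b`, and `c`. On the notebook's data, the same steps would start from the `corr` computed above.

```python
import numpy as np
import pandas as pd

# Made-up data: 'a' and 'b' move together, 'c' does not follow them
toy_df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.1, 3.9, 6.2, 8.0, 9.8],
    'c': [3.0, -1.0, 2.0, -2.0, 1.0],
})

toy_corr = toy_df.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)
pairs = toy_corr.where(mask).stack().dropna()

# Rank pairs by absolute correlation, strongest first
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```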

### Clustering

Let us proceed to finding the number of observations per class prior to clustering.

In [None]:
# Label each count with the actual value stored in the `class` column
print("Class 0 : " , dataset_df.loc[dataset_df['class'] == 0].count().loc['class'])
print("Class 1 : " , dataset_df.loc[dataset_df['class'] == 1].count().loc['class'])
print("Class 2 : " , dataset_df.loc[dataset_df['class'] == 2].count().loc['class'])
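
The same counts can be obtained in one call with `value_counts`. A minimal sketch on made-up labels; `toy_df` stands in for `dataset_df`.

```python
import pandas as pd

# Made-up labels standing in for the `class` column
toy_df = pd.DataFrame({'class': [0, 1, 1, 2, 2, 2]})

# Count observations per class, ordered by class label
class_counts = toy_df['class'].value_counts().sort_index()
print(class_counts)
```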

In [None]:
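# Import required packages
from sklearn.cluster import KMeans

# Fit on the genre features only; the index and class columns are not features
features_df = dataset_df.drop(columns=['Unnamed: 0', 'class'])

sse = []
list_k = range(1, 10)
for k in list_k:
    # Fixed seed and explicit n_init so the curve is reproducible
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km = km.fit(features_df)
    sse.append(km.inertia_)  # within-cluster sum of squared distances

plt.plot(list_k, sse, 'b*-')
plt.xlabel('k')
plt.ylabel('Sum of squared distances')
plt.title('Elbow Method For Optimal k')
plt.show()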


Based on the elbow plot, we take the optimal k to be 3, which matches the number of classes in the dataset.
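
The quantity on the y-axis, scikit-learn's `inertia_`, is the sum of squared distances from each point to its assigned centroid. The sketch below computes it by hand on made-up 2-D points to show why it drops sharply once k matches the number of groups. `sum_of_squared_distances` is an illustrative helper, not a library function.

```python
import numpy as np

def sum_of_squared_distances(points, labels, centroids):
    """SSE: squared Euclidean distance from each point to its assigned centroid."""
    diffs = points - centroids[labels]
    return float((diffs ** 2).sum())

# Made-up 2-D points forming two tight groups
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# k = 1: a single centroid at the overall mean
labels_k1 = np.zeros(4, dtype=int)
centroids_k1 = points.mean(axis=0, keepdims=True)
sse_k1 = sum_of_squared_distances(points, labels_k1, centroids_k1)

# k = 2: one centroid per group
labels_k2 = np.array([0, 0, 1, 1])
centroids_k2 = np.array([[0.0, 1.0], [10.0, 1.0]])
sse_k2 = sum_of_squared_distances(points, labels_k2, centroids_k2)

print(sse_k1, sse_k2)
```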

Instantiate a `KMeans_py` object with `k` equal to `3`, `start_var` equal to `1`, `end_var` equal to `11`, `num_observations` equal to `900`, and `data` equal to the `DataFrame` object which represents the dataset.

In [None]:
kmeans = KMeans_py(3, 1, 11, 900, dataset_df)

Initialize the centroids.

In [None]:
kmeans.initialize_centroids(dataset_df)

Cluster the dataset.

In [None]:
groups = kmeans.train(dataset_df, 300)
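
The source of `KMeans_py` is not shown in this notebook. For readers who want to see what a training loop of this kind typically does, here is a generic sketch of k-means (Lloyd's algorithm) in NumPy. It is an assumption about the general technique, not the module's actual code, and `kmeans_sketch` and its parameters are hypothetical names.

```python
import numpy as np

def kmeans_sketch(points, k, n_iters=300, seed=0):
    """Generic k-means (Lloyd's algorithm); not the code of `KMeans_py`."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct observations chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for every point
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Made-up data: two well-separated groups of three points each
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans_sketch(points, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary, so only the grouping should be compared across runs, not the label values.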

In [None]:
cluster_0 = dataset_df.loc[groups == 0]
cluster_1 = dataset_df.loc[groups == 1]
cluster_2 = dataset_df.loc[groups == 2]

# For every cluster, count how many observations fall in each class
print('Number of data points in each cluster:')
for i, cluster in enumerate([cluster_0, cluster_1, cluster_2]):
    print(f'Cluster {i}:')
    for label in range(3):
        print(f'Class {label}:\t', cluster.loc[cluster['class'] == label].shape[0])
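
The same cluster-by-class breakdown can be produced as a single table with `pd.crosstab`. The sketch uses made-up labels. On the notebook's data it would take `groups` and `dataset_df['class']`, assuming `groups` is a pandas `Series` aligned with `dataset_df`.

```python
import pandas as pd

# Made-up class labels and cluster assignments for six observations
classes = pd.Series([0, 0, 1, 1, 2, 2], name='class')
toy_groups = pd.Series([0, 0, 1, 2, 2, 2], name='cluster')

# Rows are clusters, columns are classes, cells are observation counts
table = pd.crosstab(toy_groups, classes)
print(table)
```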

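These appear to be the counts recorded from one run, in the order printed above (one line per cluster, listing class 0, class 1, and class 2):

- Cluster 0: 106, 87, 98
- Cluster 1: 114, 121, 83
- Cluster 2: 80, 92, 119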