## K-Means Clustering

This notebook trains an unsupervised model that separates mobile phones into a fixed number of groups,
e.g. k=5 clusters. We then compare these clusters with the hypotheses learned by the supervised
learners used elsewhere, e.g. decision trees and multiple-regression models.

We can also test whether the clusters resemble the pricing categories directly, i.e. by
feeding in the test examples and checking which cluster the model assigns to each one.
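One way to make this comparison concrete is to bin the continuous price into as many categories as there are clusters and cross-tabulate the two. The following is a minimal sketch on synthetic data, not the notebook's own method: the helper name `compare_clusters_to_price_bins`, the quantile binning, and the toy features are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score


def compare_clusters_to_price_bins(X, prices, k=5, random_state=0):
    """Cluster X with k-means and compare the clusters with k quantile price bins.

    Hypothetical helper: quantile binning is an assumption, not the notebook's
    definition of a pricing category.
    """
    # Cluster ids are arbitrary labels, so a plain accuracy score is not meaningful.
    clusters = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
    # Turn the continuous price into k ordered categories (0 = cheapest).
    price_bins = pd.qcut(np.asarray(prices, dtype=float), q=k, labels=False, duplicates="drop")
    # Rows: price bin, columns: cluster id.
    table = pd.crosstab(price_bins, clusters, rownames=["price_bin"], colnames=["cluster"])
    # The adjusted Rand index does not depend on how the cluster ids are numbered.
    ari = adjusted_rand_score(price_bins, clusters)
    return table, ari


# Toy data standing in for the cleaned phone features and prices.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
prices_toy = 100 + 50 * X_toy[:, 0] + rng.normal(scale=5, size=200)
table, ari = compare_clusters_to_price_bins(X_toy, prices_toy, k=5)
print(table)
print(f"Adjusted Rand index: {ari:.3f}")
```

A contingency table concentrated in one cell per row would suggest that the clusters track the price bins; a permutation-invariant score such as the adjusted Rand index summarises that agreement in a single number.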

Two datasets will be used: the one from GSMArena and another based purely on technical specifications.
To test the tech-spec-only dataset, we'll feed its inputs into an ensemble model to predict a price, then
pass the result to the k-means model and inspect which cluster it ends up in.
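The two-stage procedure for the tech-spec-only dataset could look roughly like the sketch below. It is one possible reading of the plan, under stated assumptions: `RandomForestRegressor` stands in for the unspecified ensemble model, the data are synthetic, and the predicted price is appended to the specifications before clustering.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins: 300 priced phones with 4 spec features each.
specs = rng.normal(size=(300, 4))
price = 200 + 80 * specs[:, 0] + 30 * specs[:, 1] + rng.normal(scale=10, size=300)

# Step 1: an ensemble regressor learns price from specs (assumed model choice).
regressor = RandomForestRegressor(n_estimators=50, random_state=0).fit(specs, price)

# Step 2: k-means is fitted on the specs plus the known price column.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(np.column_stack([specs, price]))

# Step 3: for spec-only phones, predict the price, append it, and look up the cluster.
new_specs = rng.normal(size=(10, 4))
predicted_price = regressor.predict(new_specs)
assigned = kmeans.predict(np.column_stack([new_specs, predicted_price]))
print(assigned)
```

Because the price column in step 3 is itself a prediction, any regression error carries over into the cluster assignment, which is worth keeping in mind when inspecting the results.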

In [4]:
from auxiliary.data_clean2 import clean_data
import pandas as pd
from sklearn.model_selection import train_test_split

# Load up dataset 1: gsmarena
data = pd.read_csv('dataset/GSMArena_dataset_2020.csv', index_col=0)

data_features = data[["oem", "launch_announced", "launch_status", "body_dimensions", "display_size", "comms_wlan", "comms_usb",
                "features_sensors", "platform_os", "platform_cpu", "platform_gpu", "memory_internal",
                "main_camera_single", "main_camera_video", "misc_price",
                "selfie_camera_video",
                "selfie_camera_single", "battery"]]

# Clean up the data into a trainable form.
df = clean_data(data_features)

# Separate the target (price) from the features.
# train_test_split is already imported at the top of this cell.
y = df["misc_price"]
X = df.drop(columns=["misc_price"])

# Split data into train, test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=0)


key_index
1        None
2        None
3        46.3
4        43.7
5        81.3
         ... 
10675    36.1
10676    26.1
10677    26.1
10678    26.1
10679    None
Name: scn_bdy_ratio, Length: 10679, dtype: object key_index
1        None
2         3.5
3         3.2
4         2.8
5         6.3
         ... 
10675     2.4
10676     2.0
10677     2.0
10678     2.0
10679    None
Name: screen_size, Length: 10679, dtype: object


#### SkLearn's K-Means Model

As always, the idea is to use an implementation from sklearn or another library as a baseline before writing
our own, more fine-tuned and streamlined version of the algorithm.
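For reference, a hand-written version can be quite short. The sketch below is a plain NumPy implementation of Lloyd's algorithm, the standard k-means iteration. It is only an illustration of what a custom version might start from, not the notebook's eventual implementation; the function name `lloyd_kmeans` and its parameters are hypothetical.

```python
import numpy as np


def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; hypothetical helper, not sklearn's implementation."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise the centres with k distinct random samples.
    centres = X[rng.choice(len(X), size=k, replace=False)]

    def assign(current_centres):
        # Index of the nearest centre (Euclidean distance) for each sample.
        distances = np.linalg.norm(X[:, None, :] - current_centres[None, :, :], axis=2)
        return distances.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centres)
        # Update step: mean of each cluster; keep the old centre if a cluster is empty.
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres

    # Recompute the labels so they match the returned centres.
    return centres, assign(centres)


# Two well-separated synthetic blobs as a quick sanity check.
rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(0, 0.5, size=(50, 2)), rng.normal(10, 0.5, size=(50, 2))])
centres, labels = lloyd_kmeans(blobs, k=2)
print(centres)
```

Unlike sklearn's `KMeans`, this sketch uses a single random initialisation and no k-means++ seeding, so results on real data can depend noticeably on the seed.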

In [9]:
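from sklearn.cluster import KMeans
# Note: accuracy_score expects discrete class labels. Passing the continuous
# misc_price target together with cluster ids is the likely cause of the
# ValueError at the end of this cell's output.
from sklearn.metrics import accuracy_score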

# Train the model
clf = KMeans(n_clusters=5, random_state=0).fit(X_train)

# Get the centres of each cluster
print(clf.cluster_centers_)

# # Utility for k-fold cross validation
# from sklearn.model_selection import cross_val_score

# # iterate 4 times
# scores = cross_val_score(clf, X, y, cv=4)



[[ 3.94304100e+03  6.67682709e+01  2.01739750e+03  9.72447411e+04
   4.42245989e+00  2.36541889e+00  3.52361856e+00  1.55155793e+03
   9.06096257e+02  1.28994652e+01  3.50743137e+03  2.05910695e+03
   5.80825314e+00  7.70231729e+01  5.13368984e+00  4.83048485e+03]
 [ 5.87400000e+03  5.95000000e+01  2.01450000e+03  1.67681506e+06
   1.00000000e+00  1.00000000e+00  1.49999994e+00  9.00000000e+02
   0.00000000e+00  0.00000000e+00  2.87150000e+03  1.59000000e+03
   1.78499994e+01  7.22999992e+01  0.00000000e+00  2.00000000e+03]
 [ 7.40350000e+03  7.21250000e+01  2.01262500e+03  3.82933449e+05
   2.12500000e+00  1.62500000e+00  3.75000015e+00  7.20000000e+02
   3.60000000e+02  1.34999999e+00  5.72887500e+03  1.41250000e+03
   1.01125003e+01  6.68124986e+01  0.00000000e+00  1.78200000e+03]
 [ 3.82634375e+03  3.94375000e+01  2.01578125e+03  2.11341216e+04
   2.46875000e+00  1.78125000e+00 -4.44089210e-16 -2.27373675e-13
  -2.27373675e-13  0.00000000e+00  3.28843750e+02  8.95625000e+02
   1.47

ValueError: Classification metrics can't handle a mix of continuous and multiclass targets