<a href="https://colab.research.google.com/github/gustahps-0712/MachineLearningProjects/blob/main/Kmeans_Agrupamentos_de_Clientes_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Business Problem
We need to build a predictive machine that groups consumers by similarity, based on their energy consumption data, in order to understand customer behavior and its relationship with energy consumption.



# 2. Exploratory Data Analysis

### Data Source
https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption


Measurements of electric power consumption in one household, sampled once per minute over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.




### Data Architecture / Data Dictionary:

1. **Date**: date in dd/mm/yyyy format
2. **Time**: time in hh:mm:ss format
3. **Global_active_power**: household global minute-averaged active power (in kilowatts)
4. **Global_reactive_power**: household global minute-averaged reactive power (in kilowatts)
5. **Voltage**: minute-averaged voltage (in volts)
6. **Global_intensity**: household global minute-averaged current intensity (in amperes)
7. **Sub_metering_1**: energy sub-metering No. 1 (in watt-hours of active energy). Corresponds to the **kitchen**, containing mainly a dishwasher, an oven and a microwave (the hot plates are gas powered, not electric).
8. **Sub_metering_2**: energy sub-metering No. 2 (in watt-hours of active energy). Corresponds to the **laundry room**, containing a washing machine, a tumble dryer, a refrigerator and a light.
9. **Sub_metering_3**: energy sub-metering No. 3 (in watt-hours of active energy). Corresponds to an **electric water heater and an air conditioner**.
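
The fields above use different units: active power is a per-minute average in kilowatts, while the sub-meterings are energy in watt-hours. As a rough illustration, and assuming the conversion given in the UCI dataset description (`Global_active_power * 1000 / 60` gives watt-hours consumed in that minute), the energy not covered by the three sub-meters can be estimated for the first record of the dataset:

```python
# Values taken from the first record of the dataset (16/12/2006 17:24:00)
global_active_power_kw = 4.216
sub_1, sub_2, sub_3 = 0.0, 1.0, 17.0

# Assumption (from the UCI description): kW averaged over one minute,
# times 1000 / 60, gives the watt-hours consumed during that minute
total_wh = global_active_power_kw * 1000 / 60

# Energy used by equipment that is not behind any of the three sub-meters
unmetered_wh = total_wh - sub_1 - sub_2 - sub_3

print(round(total_wh, 2), round(unmetered_wh, 2))
```
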

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import pylab
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from scipy.spatial.distance import cdist, pdist
from sklearn.metrics import silhouette_score
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

In [None]:
# loading the data
dataset = pd.read_csv('household_power_consumption.txt', delimiter = ';', low_memory = False)

In [None]:
# view first lines
dataset.head()

|   | Date | Time | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 16/12/2006 | 17:24:00 | 4.216 | 0.418 | 234.84 | 18.4 | 0.0 | 1.0 | 17.0 |
| 1 | 16/12/2006 | 17:25:00 | 5.36 | 0.436 | 233.63 | 23.0 | 0.0 | 1.0 | 16.0 |
| 2 | 16/12/2006 | 17:26:00 | 5.374 | 0.498 | 233.29 | 23.0 | 0.0 | 2.0 | 17.0 |
| 3 | 16/12/2006 | 17:27:00 | 5.388 | 0.502 | 233.74 | 23.0 | 0.0 | 1.0 | 17.0 |
| 4 | 16/12/2006 | 17:28:00 | 3.666 | 0.528 | 235.68 | 15.8 | 0.0 | 1.0 | 17.0 |


In [None]:
# Dataset dimensions in rows and columns respectively
dataset.shape

(2075259, 9)

In [None]:
# Check the type of fields

dataset.dtypes

Date                      object
Time                      object
Global_active_power       object
Global_reactive_power     object
Voltage                   object
Global_intensity          object
Sub_metering_1            object
Sub_metering_2            object
Sub_metering_3           float64
dtype: object

In [None]:
# General Dataset Information
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075259 entries, 0 to 2075258
Data columns (total 9 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   Date                   object 
 1   Time                   object 
 2   Global_active_power    object 
 3   Global_reactive_power  object 
 4   Voltage                object 
 5   Global_intensity       object 
 6   Sub_metering_1         object 
 7   Sub_metering_2         object 
 8   Sub_metering_3         float64
dtypes: float64(1), object(8)
memory usage: 142.5+ MB


In [None]:
# Checking for missing values
dataset.isnull().values.any()

True

In [None]:
# Checking which columns contain the missing values

dataset.isnull().sum()

Date                         0
Time                         0
Global_active_power          0
Global_reactive_power        0
Voltage                      0
Global_intensity             0
Sub_metering_1               0
Sub_metering_2               0
Sub_metering_3           25979
dtype: int64
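
The `object` dtypes reported earlier suggest that the numeric columns hold non-numeric placeholders, which `isnull()` does not count. A minimal sketch, assuming the raw file marks missing readings with `?` (an assumption about the file, not something visible in the outputs above): passing `na_values` to `pd.read_csv` lets pandas parse those columns as floats and expose the gaps in every column. The three-row sample and its reduced column set are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample in the same ';'-separated layout (subset of columns);
# the '?' markers are an assumption about how gaps are encoded.
sample = io.StringIO(
    "Date;Time;Global_active_power;Voltage;Sub_metering_3\n"
    "16/12/2006;17:24:00;4.216;234.840;17.0\n"
    "16/12/2006;17:25:00;?;?;\n"
    "16/12/2006;17:26:00;5.374;233.290;17.0\n"
)

df = pd.read_csv(sample, sep=";", na_values="?")

# With '?' treated as NaN, the measurement columns are parsed as floats
print(df.dtypes)
# Missing values now show up in every affected column
print(df.isnull().sum())
```
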

# 3. Data Pre-processing

In [None]:
# Removes records with NA values and removes the first two columns (not needed)

dataset = dataset.iloc[0:, 2:9].dropna()

In [None]:
#  Check first lines

dataset.head()

|   | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 |
|---|---|---|---|---|---|---|---|
| 0 | 4.216 | 0.418 | 234.84 | 18.4 | 0.0 | 1.0 | 17.0 |
| 1 | 5.36 | 0.436 | 233.63 | 23.0 | 0.0 | 1.0 | 16.0 |
| 2 | 5.374 | 0.498 | 233.29 | 23.0 | 0.0 | 2.0 | 17.0 |
| 3 | 5.388 | 0.502 | 233.74 | 23.0 | 0.0 | 1.0 | 17.0 |
| 4 | 3.666 | 0.528 | 235.68 | 15.8 | 0.0 | 1.0 | 17.0 |


In [None]:
# Checking if there are still missing values

dataset.isnull().values.any()

False

In [None]:
dataset.isnull().sum()

Global_active_power      0
Global_reactive_power    0
Voltage                  0
Global_intensity         0
Sub_metering_1           0
Sub_metering_2           0
Sub_metering_3           0
dtype: int64

In [None]:
# Gets the attribute values, i.e. the values of each variable as an array

dataset_atrib = dataset.values

In [None]:
# prints the array


dataset_atrib  # input variables

array([['4.216', '0.418', '234.840', ..., '0.000', '1.000', 17.0],
       ['5.360', '0.436', '233.630', ..., '0.000', '1.000', 16.0],
       ['5.374', '0.498', '233.290', ..., '0.000', '2.000', 17.0],
       ...,
       ['0.938', '0.000', '239.820', ..., '0.000', '0.000', 0.0],
       ['0.934', '0.000', '239.700', ..., '0.000', '0.000', 0.0],
       ['0.932', '0.000', '239.550', ..., '0.000', '0.000', 0.0]],
      dtype=object)

In [None]:
# Takes a 1% sample of the data so as not to exhaust the computer's memory

dataset, amostra2 = train_test_split(dataset_atrib, train_size = .01)

In [None]:
dataset.shape

(20492, 7)
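
Because no seed is set, the 1% sample differs on every run, and the sampled array still holds strings. A hedged sketch on synthetic data: `random_state` (the value 42 is an arbitrary choice) makes the split repeatable, and `astype(float)` converts the values before modelling. The array `values` is a made-up stand-in for `dataset_atrib`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for dataset_atrib: numeric strings in an object array
rng = np.random.default_rng(0)
values = rng.uniform(0, 10, size=(1000, 7)).round(3).astype(str).astype(object)

# A fixed random_state returns the same 1% sample on every run
sample_a, _ = train_test_split(values, train_size=0.01, random_state=42)
sample_b, _ = train_test_split(values, train_size=0.01, random_state=42)

# Convert the strings to floats so later steps work on numeric data
sample_a = sample_a.astype(float)
sample_b = sample_b.astype(float)

print(sample_a.shape)
print(np.array_equal(sample_a, sample_b))
```
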

# 4. Predictive Customer Segmentation Machine

In [None]:
# Applies dimensionality reduction to the array of variables
pca = PCA(n_components = 2).fit_transform(dataset)

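
To judge how much information the two components retain, the fitted PCA object can be inspected before transforming the data. A minimal sketch on synthetic data (the array `X` and the name `pca_model` are made up; in the notebook, `X` would be the sampled `dataset`):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the sampled data: 500 rows, 7 numeric features,
# with the first two features given a larger spread
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 7)) * np.array([10, 5, 1, 1, 1, 1, 1])

# Keep the fitted estimator so that its attributes remain available
pca_model = PCA(n_components=2).fit(X)
X_2d = pca_model.transform(X)

# Share of the total variance captured by each retained component
print(pca_model.explained_variance_ratio_)
print(X_2d.shape)
```
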

In [None]:
# Determining a Kmeans Hyperparameter "K" range
k_range = range(1,12)
k_range

In [None]:
# Applying the K-Means model to each value of K (this cell can take a long time to run)
k_means_var = [KMeans(n_clusters = k).fit(pca) for k in k_range]

### Elbow Curve

In [None]:
# Collect the cluster centroids of each fitted model
centroids = [X.cluster_centers_ for X in k_means_var]

# Euclidean distance from each data point to its nearest centroid
k_euclid = [cdist(pca, cent, 'euclidean') for cent in centroids]
dist = [np.min(ke, axis = 1) for ke in k_euclid]

# Within-cluster sum of squares for each value of K
sum_squares_intra_cluster = np.array([np.sum(d**2) for d in dist])

# Total sum of squares
sum_total = np.sum(pdist(pca)**2)/pca.shape[0]

# Sum of squares between clusters
sum_squares_inter_cluster = sum_total - sum_squares_intra_cluster

# Elbow Curve
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(k_range, sum_squares_inter_cluster/sum_total * 100, 'b*-')
ax.set_ylim((0,100))
plt.grid(True)
plt.xlabel('N° of Clusters')
plt.ylabel('% of Explained Variance')
plt.title('Variance Explained for each K-Value')

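
The within-cluster sum of squares computed by hand for the elbow curve is also exposed by scikit-learn as the `inertia_` attribute of a fitted `KMeans`. A sketch on synthetic blobs (the data, `n_init` and `random_state` are illustrative assumptions) checking that both values agree, and that the pairwise-distance form of the total sum of squares equals the sum of squared deviations from the mean:

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data standing in for the PCA output
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Manual within-cluster sum of squares: squared distance to the nearest centroid
nearest = cdist(X, km.cluster_centers_, "euclidean").min(axis=1)
wcss_manual = np.sum(nearest ** 2)

# Total sum of squares, computed in two equivalent ways
total_pairwise = np.sum(pdist(X) ** 2) / X.shape[0]
total_from_mean = np.sum((X - X.mean(axis=0)) ** 2)

print(np.isclose(wcss_manual, km.inertia_))
print(np.isclose(total_pairwise, total_from_mean))

# Percentage of variance explained by the clustering
print(round((total_pairwise - wcss_manual) / total_pairwise * 100, 1))
```

Reading `inertia_` avoids the `pdist` call, which needs memory proportional to the square of the number of rows.
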

In [None]:
# Creating a model with K = 8
model_v1 = KMeans(n_clusters = 8)
model_v1.fit(pca)

# 5. Predictive Machine Evaluation


In [None]:
# Machine V1
# Get the minimum and maximum values and organize the shape

# Pad the range by 1 on every side so that no sample falls outside the grid
x_min, x_max = pca[:, 0].min() - 1, pca[:, 0].max() + 1
y_min, y_max = pca[:, 1].min() - 1, pca[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))
Z = model_v1.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

In [None]:
# Plot of cluster areas

plt.figure(1)
plt.clf()
plt.imshow(Z, 
           interpolation = 'nearest',
           extent = (xx.min(), xx.max(), yy.min(), yy.max()),
           cmap = plt.cm.Paired,
           aspect = 'auto', 
           origin = 'lower')

In [None]:
# Evaluation metrics for Clustering
# The best value is 1 and the worst value is -1
?silhouette_score

In [None]:
# Silhouette Score
labels = model_v1.labels_
silhouette_score(pca, labels, metric = 'euclidean')

0.605442170975759

#### Evaluating Predictive Machine V2 with K=9

In [None]:
# Creating a model with K = 9
model_v2 = KMeans(n_clusters = 9)
model_v2.fit(pca)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=9, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

In [None]:
# Silhouette Score
labels = model_v2.labels_
silhouette_score(pca, labels, metric = 'euclidean')

0.6824612530173378

#### Evaluating Predictive Machine V3 with K=10

In [None]:
# Creating a model with K = 10 (stored as modelo_v2, which is distinct from model_v2 with K = 9)
modelo_v2 = KMeans(n_clusters = 10)
modelo_v2.fit(pca)

In [None]:
# Silhouette Score
labels = modelo_v2.labels_
silhouette_score(pca, labels, metric = 'euclidean')

0.6364236122740907

#### Evaluating Predictive Machine V4 with K=11

In [None]:
# Creating a model with K = 11
model_v3 = KMeans(n_clusters = 11)
model_v3.fit(pca)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=11, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

In [None]:
# Silhouette Score
labels = model_v3.labels_
silhouette_score(pca, labels, metric = 'euclidean')

0.6359833488484669
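
The four evaluations above repeat the same fit-and-score steps for K = 8 to 11. A minimal sketch of the same comparison written as a loop, using synthetic blobs and a fixed `random_state` (both are illustrative assumptions; the scores printed here are unrelated to the values reported above):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data standing in for the PCA output
X, _ = make_blobs(n_samples=500, centers=5, random_state=1)

# Fit one model per K and record its silhouette score
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels, metric="euclidean")

# The K with the highest silhouette score is the candidate to keep
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(k, round(s, 3))
print("best K:", best_k)
```

Fixing `random_state` keeps the scores comparable between runs, since K-Means starts from random centroids.
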

In [None]:
# List with column names
names = ['Global_active_power', 'Global_reactive_power', 'Voltage', 'Global_intensity', 'Sub_metering_1', 'Sub_metering_2', 'Sub_metering_3']

In [None]:
# Including the cluster number in the customer base
cluster_map = pd.DataFrame(dataset, columns = names)
cluster_map['Global_active_power'] = pd.to_numeric(cluster_map['Global_active_power'])
# model_v2 (K = 9) had the highest silhouette score among the models tested
cluster_map['cluster'] = model_v2.labels_

In [None]:
cluster_map

In [None]:
# Calculates average power consumption per cluster
cluster_map.groupby('cluster')['Global_active_power'].mean()

cluster
0    1.809054
1    1.548473
2    4.570247
3    2.663121
4    2.358853
5    3.317606
6    3.801114
7    0.363262
8    1.100518
Name: Global_active_power, dtype: float64
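
The per-cluster mean above covers a single variable. As a sketch of how the same idea extends to a fuller cluster profile, the example below converts several columns to numeric and reports the mean of each together with the number of rows per cluster. The rows, the cluster labels and the names `feature_cols` and `profile` are made up for illustration.

```python
import pandas as pd

# Made-up rows; values and cluster labels are illustrative only
df = pd.DataFrame({
    "Global_active_power": ["0.4", "0.6", "4.0", "5.0", "1.0"],
    "Sub_metering_3": [0.0, 1.0, 17.0, 18.0, 0.0],
    "cluster": [0, 0, 1, 1, 2],
})

# Convert every feature column to numeric, not just one
feature_cols = ["Global_active_power", "Sub_metering_3"]
df[feature_cols] = df[feature_cols].apply(pd.to_numeric)

# Mean of each feature per cluster, plus the cluster size
profile = df.groupby("cluster")[feature_cols].mean()
profile["n_rows"] = df.groupby("cluster").size()

print(profile)
```
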