# Clustering Analysis

- Problem: There are too many retailers to cluster manually, and the metrics to cluster on are not clearly defined.
- Business Goal: Cluster retailers according to 2 or 3 KPIs, in order to later run A/B tests on a specific control group or to improve promotion campaigns.
    - Create clusters of the retailers' points of sale and/or their salesmen, based on their performance.
    - Identify possibly fraudulent retailers.
    - Derive findings from the resulting clusters.
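
As a rough illustration of the goal stated above, the sketch below clusters a handful of retailers on two KPIs with `KMeans`. The KPI names (`avg_sales`, `success_rate`), the toy values and the choice of three clusters are all assumptions made for the example; none of them come from the dataset.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-retailer KPIs (toy values, not taken from the dataset)
kpis = pd.DataFrame(
    {
        "avg_sales": [0.1, 0.2, 0.15, 5.0, 5.5, 4.8, 20.0, 21.0, 19.5],
        "success_rate": [0.05, 0.10, 0.08, 0.50, 0.55, 0.45, 0.95, 0.90, 0.92],
    },
    index=[f"pos_{i}" for i in range(9)],
)

# Standardize so that no single KPI dominates the Euclidean distance
scaled = StandardScaler().fit_transform(kpis)

# k=3 is an arbitrary choice for illustration
model = KMeans(n_clusters=3, n_init=10, random_state=0)
kpis["cluster"] = model.fit_predict(scaled)

# Average KPI values per cluster
print(kpis.groupby("cluster").mean())
```

Scaling before clustering matters here because the two KPIs live on very different ranges; without it the cluster assignment would be driven almost entirely by `avg_sales`.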

## Import Modules & Libraries

In [9]:
# Base ----
import pandas as pd
# Dataviz -----
import matplotlib.pyplot as plt
%matplotlib inline
from plotnine import *
# Model ----
from sklearn.cluster import KMeans

## Load Dataset

In [316]:
pos_df = pd.read_excel("./merged_excel/ficheirolino_estancos.xlsx")
#salesman_df = pd.read_excel("./merged_excel/ficheirolino_vendedores.xlsx")

## Data Cleaning of the Dataset

Check for missing values and fill in the missing Reto and Puntos values.

In [317]:
pos_df.isna().sum()

Región                                    0
Zona                                      0
Territorio                                1
Estanco                                   0
Tipo reto                                 0
Reto                                      1
Puntos                                    1
Fecha inicio                              0
Fecha fin                                 0
Segmento                                  1
Codigo                                    2
Nombre profesional                        3
Email                                     2
Perfil                                    7
Activo                                   16
Resultado Ventas                        144
Resultado (Retos Logrado/No logrado)      0
dtype: int64

In [318]:
pos_df.loc[pos_df.Reto.isna(),["Reto","Puntos"]] = [8, 20]
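
The fill used above can be tried out on a toy frame. This is only a sketch of the boolean-mask assignment; the values 8 and 20 are the ones used in the notebook, and the toy rows are made up.

```python
import pandas as pd

# Toy frame with one row missing both Reto and Puntos
toy = pd.DataFrame({"Reto": [1, None, 2], "Puntos": [10, None, 20]})

# Select the rows where Reto is missing and set both columns at once
missing = toy["Reto"].isna()
toy.loc[missing, ["Reto", "Puntos"]] = [8, 20]

print(toy)
```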

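Count the unique values per column as a basic sanity check.
The output shows 13 distinct Región labels, 47 Zonas and 2791 PoS (Estanco). The 13 Región labels include inconsistent spellings of the same regions, which are harmonized further down.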

In [319]:
pos_df.nunique()

Región                                    13
Zona                                      47
Territorio                               122
Estanco                                 2791
Tipo reto                                 12
Reto                                      15
Puntos                                    14
Fecha inicio                             122
Fecha fin                                116
Segmento                                  16
Codigo                                  2935
Nombre profesional                      4192
Email                                   3768
Perfil                                    12
Activo                                     1
Resultado Ventas                          29
Resultado (Retos Logrado/No logrado)       6
dtype: int64
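
A check like this can also be automated. The helper below is a hypothetical sketch (the name `check_cardinality` and the toy data are assumptions) that compares the distinct count of selected columns against expected values and reports any mismatch.

```python
import pandas as pd

def check_cardinality(df, expected):
    """Return {column: (expected, actual)} for columns whose distinct count differs."""
    actual = df.nunique()
    return {
        col: (n, int(actual[col]))
        for col, n in expected.items()
        if actual[col] != n
    }

# Toy data: two spellings of the same region inflate the distinct count
toy = pd.DataFrame(
    {"Región": ["R001", "REGION 1", "R002"], "Zona": ["ZN1", "ZN1", "ZN2"]}
)
mismatches = check_cardinality(toy, {"Región": 2, "Zona": 2})
print(mismatches)  # {'Región': (2, 3)}
```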

In [355]:
pos_df["Region"] = pos_df.Región

In [321]:
pos_df.Region.unique()

array(['REGION 1', 'R001', 'REGION 2', 'REGION 3', 'R002', 'R003',
       'REGION 4', '10+A1106:L1106', 'REGION 6', 'R004', 'Rr004',
       'REGION 5', 'R005'], dtype=object)

In [340]:
pos_df.Region.isna().sum()

11730

In [366]:
# Harmonize the different spellings of a Region to the common format "R00<i>"
def RegionConverter(element, i):
    if element in ("Rr00" + str(i), "REGION " + str(i), "R00" + str(i)):
        return "R00" + str(i)
    elif element == "10+A1106:L1106":  # stray cell value, mapped to R004
        return "R004"
    else:
        return element

In [367]:
for n in range(1, 7):  # Region numbers 1 to 6 appear in the raw labels; looping avoids renaming each one by hand
    pos_df["Region"] = pos_df.Region.apply(RegionConverter, args=(n,))
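
The same harmonization could also be written without the loop, using a regular expression. This is an alternative sketch, not the notebook's method: the pattern is an assumption based on the label spellings listed above, and the toy Series only contains a few of them.

```python
import pandas as pd

labels = pd.Series(["REGION 1", "R001", "Rr004", "REGION 6", "10+A1106:L1106"])

# Capture the region number from "REGION n", "R00n" or "Rr00n"
number = labels.str.extract(r"^(?:REGION\s*|Rr?0*)(\d+)$", expand=False)

# Rebuild the label as "R" + zero-padded number; keep the original where nothing matched
harmonized = ("R" + number.str.zfill(3)).fillna(labels)

# The stray cell value is mapped to R004, as in the notebook
harmonized = harmonized.replace({"10+A1106:L1106": "R004"})

print(harmonized.tolist())  # ['R001', 'R001', 'R004', 'R006', 'R004']
```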

In [370]:
pos_df.Region.value_counts()

R004    2880
R003    2731
R005    2092
R001    2030
R002    1995
R006       2
Name: Region, dtype: int64

In [381]:
# There are only 5 Regions, but value_counts() shows 2 entries for R006.
# R006 is a typo (fat-finger error), so replace it with the correct Region.

In [389]:
pos_df.loc[pos_df.Region.str.contains("R006"), :]  # Inspection shows that the R006 rows belong to Zona ZN20
pos_df.loc[pos_df.Zona == "ZN20", :]  # Look up the Region used by the other ZN20 rows
pos_df.loc[pos_df.Region.str.contains("R006"), "Region"] = "R004"  # Replace with the Region found in the line above
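
Instead of looking up the replacement by eye, the correct Region could be derived from the Zona itself. The sketch below, on made-up rows, takes the most frequent valid Region of each Zona and uses it to overwrite invalid labels; the set of valid labels assumes the five Regions R001 to R005 mentioned above.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Zona": ["ZN20", "ZN20", "ZN20", "ZN3"],
        "Region": ["R004", "R004", "R006", "R001"],
    }
)

# Assumption: only these five Region labels are valid
valid = {"R001", "R002", "R003", "R004", "R005"}

# Most frequent valid Region for each Zona
zone_to_region = (
    toy[toy["Region"].isin(valid)]
    .groupby("Zona")["Region"]
    .agg(lambda s: s.mode().iloc[0])
)

# Overwrite invalid labels with the Region of their Zona
bad = ~toy["Region"].isin(valid)
toy.loc[bad, "Region"] = toy.loc[bad, "Zona"].map(zone_to_region)

print(toy["Region"].tolist())  # ['R004', 'R004', 'R004', 'R001']
```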

In [393]:
pos_df.drop(columns="Región", inplace=True)

In [412]:
# Reorder the columns: move the new Region column (currently last) to the front
cols = pos_df.columns.tolist()
cols = cols[-1:] + cols[0:-1] # create list with new order
pos_df = pos_df[cols]

Convert and check the data types

In [413]:
pos_df = pos_df.convert_dtypes()
pos_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11730 entries, 0 to 11729
Data columns (total 17 columns):
 #   Column                                Non-Null Count  Dtype 
---  ------                                --------------  ----- 
 0   Region                                11730 non-null  string
 1   Zona                                  11730 non-null  string
 2   Territorio                            11729 non-null  Int64 
 3   Estanco                               11730 non-null  string
 4   Tipo reto                             11730 non-null  string
 5   Reto                                  11730 non-null  Int64 
 6   Puntos                                11730 non-null  Int64 
 7   Fecha inicio                          11730 non-null  object
 8   Fecha fin                             11730 non-null  object
 9   Segmento                              11729 non-null  object
 10  Codigo                                11728 non-null  object
 11  Nombre profesional          
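
The `info()` output lists Fecha inicio and Fecha fin as `object` rather than as dates. A possible follow-up, sketched here on toy values, is to parse them with `pd.to_datetime`; the `duration_days` column is a hypothetical addition for illustration, and `errors="coerce"` is an assumption about how unparseable entries should be handled.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Fecha inicio": ["2021-02-09 00:00:00", "2021-05-01 00:00:00"],
        "Fecha fin": ["2021-02-14 00:00:00", "not a date"],
    }
)

# Parse both columns; entries that cannot be parsed become NaT
for col in ["Fecha inicio", "Fecha fin"]:
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

# Hypothetical derived feature: length of each challenge in days
toy["duration_days"] = (toy["Fecha fin"] - toy["Fecha inicio"]).dt.days

print(toy.dtypes)
```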

Check simple summary statistics

In [414]:
pos_df.describe()

| | Territorio | Reto | Puntos | Activo | Resultado Ventas |
|---|---|---|---|---|---|
| count | 11729.0 | 11730.0 | 11730.0 | 11714.0 | 11586.0 |
| mean | 2367.926336 | 1.822933 | 13.387383 | 1.0 | 1.213879 |
| std | 1395.262334 | 1.433472 | 8.559042 | 0.0 | 2.500307 |
| min | 101.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 25% | 815.0 | 1.0 | 10.0 | 1.0 | 0.0 |
| 50% | 2808.0 | 1.0 | 10.0 | 1.0 | 0.0 |
| 75% | 3301.0 | 2.0 | 10.0 | 1.0 | 1.0 |
| max | 5003.0 | 30.0 | 120.0 | 1.0 | 94.0 |
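
In the summary above, Resultado Ventas has a 75th percentile of 1 but a maximum of 94, so a few rows sit far from the bulk of the data. One common way to flag such rows, which may be useful when looking for possibly fraudulent retailers, is the 1.5 × IQR rule. The sketch below applies it to toy values; the rule and its threshold are assumptions, not something the notebook prescribes.

```python
import pandas as pd

# Toy sales results with one extreme value
sales = pd.Series([0, 0, 0, 1, 1, 0, 2, 1, 94], name="Resultado Ventas")

# Upper fence of the 1.5 * IQR rule
q1, q3 = sales.quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)

# Rows above the fence are candidates for manual inspection
outliers = sales[sales > upper]
print(outliers)
```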


In [415]:
pos_df.describe(include="string")

| | Region | Zona | Estanco | Tipo reto | Nombre profesional | Email | Perfil |
|---|---|---|---|---|---|---|---|
| count | 11730 | 11730 | 11730 | 11730 | 11727 | 11728 | 11723 |
| unique | 5 | 47 | 2791 | 12 | 4192 | 3768 | 12 |
| top | R004 | ZN16 | CALPE-003 | VENTA | Maria antonia Amengual berna | estancogranada36@gmail.com | Shop Owner 1 |
| freq | 2882 | 968 | 54 | 8978 | 19 | 18 | 7032 |


In [418]:
pos_df.describe(include="object")

| | Fecha inicio | Fecha fin | Segmento | Codigo | Resultado (Retos Logrado/No logrado) |
|---|---|---|---|---|---|
| count | 11730 | 11730 | 11729 | 11728 | 11730 |
| unique | 122 | 116 | 16 | 2935 | 6 |
| top | 2021-05-01 00:00:00 | 2021-04-30 00:00:00 | OFFICIAL RESELLER | 999000785 | No logrado |
| freq | 816 | 1876 | 3236 | 45 | 4925 |


In [416]:
pos_df.head()

| | Region | Zona | Territorio | Estanco | Tipo reto | Reto | Puntos | Fecha inicio | Fecha fin | Segmento | Codigo | Nombre profesional | Email | Perfil | Activo | Resultado Ventas | Resultado (Retos Logrado/No logrado) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | R001 | ZN3 | 2801 | ALCORCON-001 | VENTA | 2 | 20 | 2021-02-09 00:00:00 | 2021-02-14 00:00:00 | OFFICIAL RESELLER | 999009382 | Laura Garcia Moreno | l.garcia.m@hotmail.es | PROPIETARIO | 1 | 0 | No Logrado |
| 1 | R001 | ZN3 | 2801 | ALCORCON-007 | VENTA | 2 | 20 | 2021-02-09 00:00:00 | 2021-02-14 00:00:00 | OFFICIAL RESELLER PLUS | 999009388 | ALEJANDRO MARCOS RUIZ | estancolasretamas54@gmail.com | PROPIETARIO | 1 | 1 | No Logrado |
| 2 | R001 | ZN3 | 2801 | ALCORCON-009 | VENTA | 1 | 10 | 2021-02-09 00:00:00 | 2021-02-14 00:00:00 | RESELLER | 999009390 | NOELIA IGLESIAS ONTORIA | noelia.iglesiasontoria@gmail.com | PROPIETARIO | 1 | 1 | Logrado |
| 3 | R001 | ZN3 | 2801 | ALCORCON-019 | VENTA | 2 | 20 | 2021-02-09 00:00:00 | 2021-02-14 00:00:00 | OFFICIAL RESELLER | 2698 | Pedro Fernandez Garcia | estanco19alcorcon@gmail.com | PROPIETARIO | 1 | 1 | No Logrado |
| 4 | R001 | ZN3 | 2801 | EL ESCORIAL-001 | VENTA | 1 | 10 | 2021-02-09 00:00:00 | 2021-02-14 00:00:00 | RESELLER | 999009341 | Cristina Aguilar Partida | mc.aguilar@hotmail.es | PROPIETARIO | 1 | 1 | Logrado |


Since this is a clustering analysis, the exact point of sale is not relevant, only its origin, i.e. the town name that precedes the hyphen in Estanco.

In [446]:
pos_df["Estanco"] = pos_df.Estanco.str.split("-", n=1, expand=False).str[0].str.capitalize()
pos_df.Estanco

0               Alcorcon
1               Alcorcon
2               Alcorcon
3               Alcorcon
4            El escorial
              ...       
11725           Pamplona
11726           Pamplona
11727            Peralta
11728    Puente la reina
11729    Puente la reina
Name: Estanco, Length: 11730, dtype: object

In [475]:
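# Normalize zone labels to the form "ZN<number>": drop spaces, upper-case, then shorten "ZONA" to "ZN"
aux = pos_df.Zona.str.replace(" ", "", regex=False).str.upper().str.replace("ONA", "N", regex=False)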

In [489]:
aux

0         ZN3
1         ZN3
2         ZN3
3         ZN3
4         ZN3
         ... 
11725    ZN24
11726    ZN24
11727    ZN24
11728    ZN24
11729    ZN24
Name: Zona, Length: 11730, dtype: string

In [491]:
aux = aux.str[2:]  # Drop the "ZN" prefix and keep only the zone number

In [492]:
def zone_cleaner(element):
    # Left-pad single-digit zone numbers with a zero ("3" -> "03")
    if len(element) == 1:
        return "0" + element
    return element

In [493]:
aux.apply(zone_cleaner)

0        03
1        03
2        03
3        03
4        03
         ..
11725    24
11726    24
11727    24
11728    24
11729    24
Name: Zona, Length: 11730, dtype: object
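
The padding step can also be done with the vectorized `str.zfill` accessor instead of a custom function. A minimal sketch on toy zone labels, assuming every label has already been normalized to the "ZN<number>" form:

```python
import pandas as pd

zones = pd.Series(["ZN3", "ZN16", "ZN24"])

# Drop the "ZN" prefix, then left-pad the number to two digits
codes = zones.str[2:].str.zfill(2)

print(codes.tolist())  # ['03', '16', '24']
```

Note that, as in the cell above, the result has to be assigned back (for example to `aux`) if it is to be used in later steps.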

In [None]:
aux.str.split("", )