# Baridi Baridi Customers Spatial Clustering

We use a customer dataset with geolocation information to analyze spatial patterns and relationships that may affect the necessity and financial stability assessments. The geolocation data is used in this analysis as follows.

1. **Spatial Clustering:**
   - We apply a spatial clustering algorithm (DBSCAN with haversine distance) to group customers based on their geolocation. Clusters may indicate areas with higher demand for AC units due to population density or other latent factors.
2. **Density Analysis:**
   - We'll calculate the density of installations in different areas. Areas with a high density of AC installations might indicate regions with high demand for our service.
3. **Proximity Features:**
   - We'll calculate distances from each customer to significant points of interest (e.g., commercial centers, residential areas), which could affect AC usage patterns.
4. **Climate Analysis:**
   - We'll use geolocation data to determine the climate zone for each customer. Different zones will have different AC needs depending on the weather patterns.
5. **Regional Economic Indicators:**
   - We'll combine geolocation with additional economic data to assess financial stability. For example, customers in more affluent areas might be more financially stable.
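
As a rough illustration of the proximity features in item 3, the sketch below computes great-circle (haversine) distances from a few customers to a single point of interest. The function name `haversine_km` and the point-of-interest coordinates are hypothetical and chosen only for illustration; the customer coordinates are taken from the sample rows shown further down in this notebook.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometers between points given in degrees."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Hypothetical point of interest (illustrative coordinates, not from the dataset)
poi_lat, poi_lng = -6.8160, 39.2803

# Customer coordinates copied from the sample rows in this notebook
customer_lat = np.array([-6.7887042, -6.773721, -6.872975])
customer_lng = np.array([39.2553516, 39.224959, 39.189895])

# One distance per customer, usable as a feature column
distance_to_poi_km = haversine_km(customer_lat, customer_lng, poi_lat, poi_lng)
print(distance_to_poi_km)
```

The same function accepts scalars or NumPy arrays, so it could be applied to whole `lat`/`lng` columns at once.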


In [1]:
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

try:
    import folium
except ImportError as err:
    print('Failed to import folium. Installing it now.')
    !pip install folium
    import folium

In [2]:
df = pd.read_csv('customers_20240325.csv')

### Hyperparameter selection criteria (this is the Rosetta Stone of our problem)

A very small eps value (like 0.0003 in our case) indicates that points need to be very close to each other to be considered part of the same cluster. This can have several implications:

**High Resolution:** A smaller eps leads to a higher-resolution clustering, which can detect very tight clusters. This is useful in dense datasets where fine-grained clustering is necessary.

**Sensitivity to Noise:** Small eps values make the algorithm more sensitive to noise. Points that are not in very close proximity to a dense cluster will be considered outliers.

**Number of Clusters:** You might end up with a larger number of smaller clusters, as only points very close to each other are grouped together. To find a balance, we'll use the `min_samples` hyperparameter to limit the number of clusters.

**Geospatial Data:** For geospatial data, the physical meaning of eps depends on its unit. One degree of latitude is about 111 kilometers (69 miles), so an eps of 0.0003 degrees would correspond to approximately 33 meters (108 feet), a precision suited to very detailed local clustering. However, scikit-learn's haversine metric works on coordinates in radians and measures distances in radians, and the clustering cell in this notebook passes `np.radians(...)` coordinates. The eps of 0.0003 is therefore interpreted as 0.0003 radians, which is about 1.9 kilometers (0.0003 × 6,371 km Earth radius). That radius groups customers at the neighborhood scale; clustering at the 33-meter scale would need an eps of roughly 5.2e-6 radians.
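
The unit arithmetic behind eps can be checked with a short conversion sketch. It rests on one assumption worth stating plainly: scikit-learn's haversine metric operates on radians, so an eps used together with `np.radians(...)` inputs is read as radians, not degrees. The helper names below are illustrative and not part of any library.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters
METERS_PER_DEGREE_LAT = 2 * math.pi * EARTH_RADIUS_M / 360  # about 111 km


def degrees_to_meters(deg):
    """Approximate north-south distance covered by `deg` degrees of latitude."""
    return deg * METERS_PER_DEGREE_LAT


def radians_to_meters(rad):
    """Great-circle distance covered by an angle of `rad` radians."""
    return rad * EARTH_RADIUS_M


def meters_to_radians(meters):
    """Angle in radians that spans `meters` along the Earth's surface."""
    return meters / EARTH_RADIUS_M


eps = 0.0003
print(f"{eps} degrees is about {degrees_to_meters(eps):.1f} m")
print(f"{eps} radians is about {radians_to_meters(eps):.1f} m")
print(f"33 m is about {meters_to_radians(33):.2e} rad")
```

Picking eps from a target radius in meters with `meters_to_radians` keeps the hyperparameter tied to a physical distance rather than to a bare number.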

In [3]:
eps = 0.0003
min_samples = 20

In [4]:
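# Keep only subscription customers (this removes instalment customers).
# .copy() avoids pandas SettingWithCopyWarning on the later in-place edits.
df = df[df['contract_type'] == 'Subscription'].copy()
# Remove columns which do not provide any information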
df.drop(columns=[
    'id',
    'created_by',
    'updated_by',
    'first_name',
    'last_name',
    'acc_type',
    'contract_type',
    'c_type',
    'zcrm_id',
    'iso_country_code',
    'primary_phone_mcc',
    'primary_phone',
    '__ts_vector__',
    'last_deduction_date',
    'last_deduction_points',
    'tag',
    'zbooks_id',
    'next_filter_cleaning_schedule',
    'filter_cleaning_days',
    'current_pts',
    'wassha_token',
    'points',
    'payment_date',
    'number_of_assignments',
    'assignments_completed',
    'number_of_locations',
    'locations_completed',
    'created_at',
    'updated_at',
    'zserial_id',
    'primary_phone_mnc',
    'wassha_id',
    'parent_acc_id',
    'subscription_due_date',
    'filter_cleaning_status',
    'total_security_deposit',
    'original_current_pts'], inplace=True)

In [5]:
df.head()

| | status | initial_installation_date | subscription_status | business_type | location_name | geocoords | district_name |
|---|---|---|---|---|---|---|---|
| 9 | ACTIVE | 2022-03-02 00:00:00 | VALID | RESIDENTIAL | Mwananyamala | -6.7887042,39.2553516 | Kinondoni |
| 26 | ACTIVE | 2024-02-15 09:09:47.083 | VALID | CLOTHING SHOP | Sinza | -6.773721,39.224959 | Ubungo |
| 27 | ACTIVE | 2024-02-01 09:40:14.862 | VALID | COSMETICS SHOP | Sinza | -6.786635, 39.223505 | Ubungo |
| 28 | ACTIVE | 2024-02-08 09:41:15.31 | VALID | CLOTHING SHOP | Ukonga | -6.872975,39.189895 | Ilala |
| 30 | ACTIVE | 2024-02-06 13:48:55.572 | VALID | COSMETICS SHOP | Sinza | -6.788005, 39.229468 | Ubungo |


In [6]:
df.shape

(997, 7)

## Data Cleansing

Split the raw geolocation string into floating-point latitude and longitude values. This makes it easier to compute distances and perform other geodetic calculations.

In [7]:
# Some coordinates are comma-delimited, while others are space-delimited.
# We need to define a function to handle both cases.
def split_coordinates(coord):
    # Split on the delimiter, which could be a comma or whitespace
    parts = coord.split(',') if ',' in coord else coord.split()
    # Strip stray spaces such as the one in "-6.786635, 39.223505"
    return [part.strip() for part in parts[:2]]

In [8]:
df[['lat', 'lng']] = (
  df['geocoords'].apply(
    lambda x: pd.Series(split_coordinates(x) if pd.notnull(x) else [None, None]))
)

In [9]:
df['lat'] = pd.to_numeric(df['lat'], errors='coerce')
df['lng'] = pd.to_numeric(df['lng'], errors='coerce')
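
For comparison, the split-and-convert steps above could also be done in one vectorized pass. This is only a sketch of an alternative, not the approach the notebook uses: the regular expression and the sample strings are assumptions made for illustration, and rows that do not match simply come out as `NaN`.

```python
import pandas as pd

# Illustrative inputs covering the delimiter variants, a missing value and junk text
sample = pd.Series([
    "-6.7887042,39.2553516",  # comma-delimited
    "-6.786635, 39.223505",   # comma followed by a space
    "-6.80 39.28",            # space-delimited
    None,                     # missing value
    "bad",                    # unparseable text
])

# Two signed decimal numbers separated by a comma and/or whitespace
pattern = r"^\s*(?P<lat>-?\d+(?:\.\d+)?)\s*[,\s]\s*(?P<lng>-?\d+(?:\.\d+)?)"

# str.extract returns one column per named group; non-matching rows become NaN
coords = sample.str.extract(pattern).apply(pd.to_numeric, errors="coerce")
print(coords)
```

One trade-off of this variant is that malformed strings are rejected at the parsing step instead of at the numeric conversion step.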

Remove rows with missing geographic data, since they only add noise to the analysis:

* Any row missing latitude (`lat`) is considered faulty
* Any row missing longitude (`lng`) is considered faulty

In [10]:
df = df[~df['lat'].isnull() & ~df['lng'].isnull()]

In [11]:
df.shape

(840, 9)

In [12]:
df[['lat', 'lng']].describe()

| | lat | lng |
|---|---|---|
| count | 840.0 | 840.0 |
| mean | -83204.8 | 56139.74 |
| std | 2364697.0 | 1381081.0 |
| min | -68530340.0 | 39.09347 |
| 25% | -6.81772 | 39.22081 |
| 50% | -6.784457 | 39.24188 |
| 75% | -6.771946 | 39.26525 |
| max | 6.81059 | 39255950.0 |


That is a lot of errors for the Dar-es-Salaam region: the minimum latitude and the maximum longitude are off by several orders of magnitude, which looks like coordinates entered without a decimal point. We'll filter out erroneous data as follows:

* Retain rows where `lng` is less than 40, that is, west of the 40° East meridian.
* Retain rows where `lat` is greater than -7 but less than 0, i.e., between the equator and 7° South.

In [13]:
# .copy() lets us add the cluster column later without a SettingWithCopyWarning
df = df[(df['lng'] < 40) & (df['lat'] > -7) & (df['lat'] < 0)].copy()

In [14]:
df.shape

(833, 9)
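
Before discarding out-of-range rows, it can help to look at what is being rejected. The sketch below splits a frame into valid and rejected rows using the same thresholds as the filter above. The helper name `split_by_bbox` is hypothetical, and the sample rows are assembled from values in the summary statistics purely for illustration; the pairing of latitudes and longitudes is made up.

```python
import pandas as pd

# Bounding box mirroring the thresholds used in the filter above
LAT_MIN, LAT_MAX = -7.0, 0.0
LNG_MAX = 40.0


def split_by_bbox(frame):
    """Return (valid, rejected) rows according to the bounding box."""
    in_box = (
        frame['lat'].gt(LAT_MIN)
        & frame['lat'].lt(LAT_MAX)
        & frame['lng'].lt(LNG_MAX)
    )
    return frame[in_box].copy(), frame[~in_box].copy()


# Illustrative rows: one plausible point and three with obvious errors
sample = pd.DataFrame({
    'lat': [-6.81772, -68530340.0, 6.81059, -6.784457],
    'lng': [39.22081, 39.09347, 39.24188, 39255950.0],
})

valid, rejected = split_by_bbox(sample)
print(f"kept {len(valid)} rows, rejected {len(rejected)} rows")
print(rejected)
```

Keeping the rejected rows around makes it possible to check whether some of them are recoverable, for example values that appear to be missing only a decimal point or a sign.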

In [15]:
dbscan = DBSCAN(eps=eps, min_samples=min_samples, metric='haversine')
# The haversine metric expects [lat, lng] in radians, so eps is in radians too
df['location_cluster'] = dbscan.fit_predict(np.radians(df[['lat', 'lng']]))
print(f"eps: {eps}, labels: {df['location_cluster'].unique()}")

eps: 0.0003, labels: [ 0 -1  1]
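
The label list alone does not show how many customers fall into each cluster or how many were left as noise (DBSCAN marks noise with the label -1). A minimal sketch for summarizing the labels follows; the helper name `summarize_labels` and the example labels are made up for illustration, and in this notebook the input would be `df['location_cluster']`.

```python
import numpy as np


def summarize_labels(labels):
    """Return ({cluster_label: size}, noise_fraction) for DBSCAN labels."""
    labels = np.asarray(labels)
    values, counts = np.unique(labels, return_counts=True)
    # Label -1 is DBSCAN's noise marker, so it is not counted as a cluster
    sizes = {int(v): int(c) for v, c in zip(values, counts) if v != -1}
    noise_fraction = float(np.mean(labels == -1))
    return sizes, noise_fraction


# Made-up labels for illustration only
example_labels = [0, 0, 0, -1, 1, 1, -1, 0]
sizes, noise_fraction = summarize_labels(example_labels)
print(sizes, noise_fraction)
```

A large noise fraction or one dominant cluster would be a hint to revisit `eps` and `min_samples`.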


### Exploratory Data Analysis

In [20]:
# Setup the map
colors = sns.color_palette('hls', len(df['location_cluster'].unique()))

In [21]:
map_center = [df['lat'].mean(), df['lng'].mean()]
cluster_map = folium.Map(location=map_center, zoom_start=9)

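# Convert the palette to hex strings. Labels index this list directly,
# so the noise label -1 picks the last entry through negative indexing.
colors = ["#{:02x}{:02x}{:02x}".format(int(r), int(g), int(b)) for r, g, b in 255*np.array(colors)]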
for _, row in df.iterrows():
    folium.CircleMarker(
        location=[row['lat'], row['lng']],
        radius=5,
        color=colors[int(row['location_cluster'])],
        fill=True,
        fill_color=colors[int(row['location_cluster'])],
        fill_opacity=0.99
    ).add_to(cluster_map)

# Show the map
cluster_map
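
Indexing the color list by cluster label works here because the noise label -1 happens to select the last palette entry. A more explicit option is a label-to-color dictionary with a dedicated color for noise. The sketch below uses only the standard library; the function name `cluster_colors` and the gray noise color are assumptions chosen for illustration.

```python
import colorsys

NOISE_COLOR = "#808080"  # assumed choice: gray for DBSCAN noise (label -1)


def cluster_colors(labels):
    """Map each cluster label to a hex color; noise gets a fixed gray."""
    clusters = sorted({int(label) for label in labels if int(label) != -1})
    mapping = {-1: NOISE_COLOR}
    for i, label in enumerate(clusters):
        # Spread hues evenly around the color wheel (hue, lightness, saturation)
        r, g, b = colorsys.hls_to_rgb(i / max(len(clusters), 1), 0.5, 0.8)
        mapping[label] = "#{:02x}{:02x}{:02x}".format(
            int(r * 255), int(g * 255), int(b * 255)
        )
    return mapping


palette = cluster_colors([0, -1, 1])
print(palette)
```

With such a mapping, the marker color could be looked up as `palette[int(row['location_cluster'])]`, which keeps noise points visually distinct from real clusters.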

### Observations

1. With a minimum of 20 customers within the eps radius (0.0003 radians under the haversine metric, roughly 1.9 km), DBSCAN produced three labels: two clusters (0 and 1) plus the noise label (-1).
2. The first cluster is located around the city center.