## Exploring Customer Segmentation

In this activity, you are tasked with profiling customer groups for a large telecommunications company. The data provided contains information on customers' purchasing and usage behavior with the telecom products. Your goal is to use PCA and clustering to segment these customers into meaningful groups and report back your findings.

Because these results need to be interpretable, it is important to keep the number of clusters reasonable.  Think about how you might represent some of the non-numeric features so that they can be included in your segmentation models.  You are to report back your approach and findings to the class.  Be specific about what features were used and how you interpret the resulting clusters.
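
One possible way to represent non-numeric features is sketched below: a Yes/No column becomes a single 0/1 indicator, and a multi-level categorical becomes one indicator column per level. The toy frame, the `Contract` column, and all values are assumptions made for illustration, not taken from the dataset.

```python
import pandas as pd

# Toy frame; the column names and values are illustrative, not the real data.
toy = pd.DataFrame(
    {
        "Contract": ["Month-to-Month", "One Year", "Two Year", "One Year"],
        "Unlimited Data": ["Yes", "No", "Yes", "Yes"],
        "Tenure in Months": [3, 14, 40, 22],
    }
)

# Binary Yes/No columns map to a single 0/1 indicator.
toy["Unlimited Data"] = (toy["Unlimited Data"] == "Yes").astype(int)

# Multi-level categoricals become one indicator column per level.
encoded = pd.get_dummies(toy, columns=["Contract"], dtype=int)
print(encoded)
```

Indicator columns have a different scale and distribution than continuous features, so it is worth checking how much they dominate the principal components after scaling.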

## Imports

In [None]:
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import sklearn.cluster as cluster
from sklearn.decomposition import PCA
from sklearn import metrics
from sklearn.neighbors import NearestNeighbors
import itertools
import warnings

In [None]:
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", None)
mpl.rcParams.update({"axes.grid": True})

## Data Load and Initial Display

In [None]:
df_in = pd.read_csv("./data/telco_churn_data.csv")

In [None]:
df_in.head()

In [None]:
df_in.info()

In [None]:
df_in.describe()

## Cleanup

### Specify Columns to Toss

Columns that have many nulls, duplicate information in other columns, or are otherwise not needed for segmentation are dropped.

In [None]:
# Columns where more than 10% of the values are null
drop_columns = df_in.loc[:, df_in.isnull().mean() * 100.0 > 10.0].columns.to_list()

display(drop_columns)

# These columns are representations of other columns or otherwise unneeded
drop_columns += [
    "Under 30",
    "Senior Citizen",
    "Dependents",
    "City",
    "Latitude",
    "Longitude",
    "Population",
    "Customer ID",
]

display(drop_columns)

### Various Columns That Are Functions of Other Columns

Can we (should we) remove these multiplicative products?

In [None]:
# Show that total long distance charges equal average monthly long distance charges x tenure in months
if 0:
    display(
        pd.DataFrame(
            {
                "Actual": df_in["Total Long Distance Charges"],
                "Assertion": df_in["Avg Monthly Long Distance Charges"]
                * df_in["Tenure in Months"],
            }
        )
    )
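
Rather than eyeballing the two columns side by side, the relationship could be checked numerically. The sketch below uses made-up values in a toy frame and reports the fraction of rows where the product reproduces the total within a tolerance; the 0.01 tolerance is an assumption.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real columns (values are made up).
toy = pd.DataFrame(
    {
        "Tenure in Months": [1, 12, 30, 48],
        "Avg Monthly Long Distance Charges": [10.0, 5.5, 0.0, 20.25],
    }
)
toy["Total Long Distance Charges"] = (
    toy["Avg Monthly Long Distance Charges"] * toy["Tenure in Months"]
)

# Fraction of rows where average x tenure reproduces the total.
reconstructed = toy["Avg Monthly Long Distance Charges"] * toy["Tenure in Months"]
match_fraction = np.isclose(
    toy["Total Long Distance Charges"], reconstructed, atol=0.01
).mean()
print(match_fraction)
```

A fraction near 1.0 on the real data would support dropping one of the two columns.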

In [None]:
drop_columns.append("Avg Monthly Long Distance Charges")

In [None]:
# Show that monthly charge approximately equals average total regular charges
if 0:
    plt.scatter(
        df_in["Monthly Charge"],
        df_in["Total Regular Charges"] / df_in["Tenure in Months"],
        color="blue",
    )

In [None]:
drop_columns.append("Monthly Charge")

In [None]:
# Show that total extra data charges is proportional to Total GB Download when unlimited data is false
df_in["Total GB Download"] = (
    df_in["Avg Monthly GB Download"] * df_in["Tenure in Months"]
)

plt.scatter(
    df_in["Total GB Download"],
    df_in["Total Extra Data Charges"],
    c=df_in["Unlimited Data"] == "Yes",
)

In [None]:
drop_columns.append("Avg Monthly GB Download")

In [None]:
# How meaningful are the other total charges columns?

### Perform the Cleanup

In [None]:
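def cat_str_to_idx(series: pd.Series) -> pd.Series:
    # Integer-encode low-cardinality string columns. The trailing "and 0"
    # currently disables the conversion, so every column passes through unchanged.
    if series.dtype == "object" and series.nunique() <= 5 and 0:
        return pd.Series(np.unique(series, return_inverse=True)[1])
    return series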


df = df_in.drop(columns=drop_columns).apply(cat_str_to_idx)
assert np.all(df.isnull().sum() == 0), "Some Nulls Remain"
df.info()
df.head()

## PCA Prep

### Select Numeric Columns

Keep the columns whose data type is not object.

In [None]:
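# Copy so that adding a column does not write into a slice of df
df_numeric = df[df.columns[df.dtypes != "object"]].copy()
# Encode the Yes/No flag as 0/1 so it can be scaled with the other columns
df_numeric["Unlimited Data"] = (df_in["Unlimited Data"] == "Yes").astype(int)
# "Avg Monthly GB Download" is already in drop_columns, so ignore it if absent
df_numeric = df_numeric.drop(columns=["Avg Monthly GB Download"], errors="ignore")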
display(df_numeric.head())
display(df_numeric.shape)

### Scale

In [None]:
df_scaled = (df_numeric - df_numeric.mean()) / df_numeric.std()
display(df_scaled.head())
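
The manual z-score above is close to, but not identical to, what `StandardScaler` from scikit-learn produces: pandas `std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` uses the population standard deviation (`ddof=0`). The sketch below, on random toy data, shows that the two differ only by a constant factor.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(123)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])

# Manual z-score as in the cell above: pandas std() uses ddof=1.
manual = (toy - toy.mean()) / toy.std()

# StandardScaler divides by the population standard deviation (ddof=0).
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# The two results differ only by the constant factor sqrt((n - 1) / n).
n = len(toy)
ratio = (manual / scaled).to_numpy()
print(np.allclose(ratio, np.sqrt((n - 1) / n)))
```

Because the factor is the same for every column, it should not change the explained variance ratios from PCA.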

## EV vs. Chosen Columns Analysis

Look for column combinations with a high cumulative explained variance (EV) at 3 components.

### Define the Allowable Combinations

In [None]:
inds = list(range(df_numeric.shape[1]))
choosek = 5
combos = list(itertools.combinations(inds, choosek))
print(
    "Checking %d combinations of %d columns chosen from %d"
    % (len(combos), choosek, len(inds))
)
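
The number of combinations grows quickly with the column count, so it can help to compute it before materializing the list. A minimal sketch, assuming a hypothetical count of 12 numeric columns:

```python
import itertools
import math

n_columns = 12  # hypothetical column count, not the notebook's actual value
choosek = 5

# Number of ways to choose choosek columns, without building the list.
n_combos = math.comb(n_columns, choosek)

# itertools.combinations is lazy, so combinations can be pulled one at a time.
first = next(itertools.combinations(range(n_columns), choosek))
print(n_combos, first)
```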

### Cumulative EV vs. Num Components

In [None]:
cum_ev_3_combos = []
ncomp = 3

for m in range(len(combos)):
    combo = list(combos[m])
    cum_ev = (
        PCA(n_components=choosek, random_state=123)
        .fit(df_scaled.iloc[:, combo])
        .explained_variance_ratio_
        * 100.0
    ).cumsum()

    cum_ev_3_combos.append(cum_ev[ncomp - 1])
    if m % 1000 == 0:
        display([m, combo, cum_ev_3_combos[-1]])

### Plot Cumulative EV at 3 Components vs. Combination Index

In [None]:
sorted_inds = np.argsort(cum_ev_3_combos)[::-1]
plt.scatter(range(len(combos)), np.array(cum_ev_3_combos)[sorted_inds], color="blue")

for ind in range(5):
    m = sorted_inds[ind]
    combo = list(combos[m])
    combo_columns = df_scaled.columns[combo]
    display([m, cum_ev_3_combos[m], combo_columns])

### Choose the Top Combination

Keep the combination with the highest cumulative EV at 3 components.

In [None]:
df_scaled = df_scaled.iloc[:, list(combos[sorted_inds[0]])]

## PCA

### Plot Cumulative EV

In [None]:
ev = (
    PCA(n_components=df_scaled.shape[1], random_state=123)
    .fit(df_scaled)
    .explained_variance_ratio_
    * 100.0
)
cum_ev = ev.cumsum()

In [None]:
fig, ax1 = plt.subplots()

ax1_color = "black"
ax1.plot(
    np.arange(len(ev)) + 1,
    ev,
    linestyle="solid",
    marker="o",
    color=ax1_color,
)

ax1.set_xlabel("Number of Components")
ax1.set_ylabel("Explained Variance Ratio (%)", color=ax1_color)

ax2_color = "blue"
ax2 = ax1.twinx()
ax2.plot(
    np.arange(len(cum_ev)) + 1,
    cum_ev,
    linestyle="solid",
    marker="o",
    color=ax2_color,
)

ax2.set_ylabel("Cumulative Variance Explained (%)", color=ax2_color)
ax2.tick_params(axis="y", labelcolor=ax2_color)


def crosshairs_at(
    target_cev: float = 0.0, ncomp: int = None, color: str = "", linestyle: str = "--"
):
    if ncomp is None:
        ncomp = PCA(n_components=target_cev / 100.0).fit(df_scaled).n_components_

    label = "%2d Components -> %.2f%% Variance" % (ncomp, cum_ev[ncomp - 1])
    ax2.axhline(cum_ev[ncomp - 1], color=color, linestyle=linestyle)
    ax2.axvline(ncomp, label=label, color=color, linestyle=linestyle)


crosshairs_at(ncomp=2, color="red")
crosshairs_at(ncomp=3, color="cyan")
crosshairs_at(ncomp=4, color="magenta")
# crosshairs_at(target_cev=95.0, color="blue")

plt.setp(plt.legend(loc="center right", fancybox=True).texts, family="monospace")

### Fit

For simplicity, just go with 3 components for PCA.

In [None]:
n_components = 3
X = PCA(n_components=n_components, random_state=123).fit_transform(df_scaled)
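
To interpret the components, it may help to look at the loadings, which show how each input feature contributes to each component. The sketch below runs on random toy data with placeholder column names; on the real data the same idea would apply to a fitted PCA object kept in a variable instead of being discarded after `fit_transform`.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(123)
# Toy standardized features; names are placeholders for the selected columns.
toy = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
toy["b"] = toy["a"] + 0.1 * toy["b"]  # make two columns strongly correlated
toy = (toy - toy.mean()) / toy.std()

pca = PCA(n_components=3, random_state=123).fit(toy)

# Each row of components_ is a unit vector of weights over the input features.
loadings = pd.DataFrame(
    pca.components_.T,
    index=toy.columns,
    columns=["Component" + str(k + 1) for k in range(pca.n_components_)],
)
print(loadings.round(2))
```

Features with large absolute loadings on a component are the ones that component mostly summarizes.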

## Clustering with KMeans

### Parameter Search

#### Search Over Number of Clusters

In [None]:
n_clusters_list = np.arange(2, 8)
display(n_clusters_list)
inertia = []

for n_clusters in n_clusters_list:
    kmeans = cluster.KMeans(n_clusters=n_clusters, random_state=123).fit(X)
    (_, counts) = np.unique(kmeans.labels_, return_counts=True)
    display([n_clusters, kmeans.inertia_, counts])
    inertia.append(kmeans.inertia_)

#### Plot Inertia vs. Num Clusters

In [None]:
fig, ax1 = plt.subplots()

# Inertia vs. Num Clusters
ax1_color = "black"
ax1.plot(
    n_clusters_list,
    inertia,
    linestyle="solid",
    marker="o",
    color=ax1_color,
)

ax1.set_xlabel("Number of Clusters")
ax1.set_ylabel("Inertia")

# Differential Inertia vs. Num Clusters
ax2_color = "blue"
ax2 = ax1.twinx()
ax2.plot(
    n_clusters_list[:-1],
    np.diff(inertia),
    linestyle="solid",
    marker="o",
    color=ax2_color,
)

ax2.set_ylabel("Differential Inertia", color=ax2_color)
ax2.tick_params(axis="y", labelcolor=ax2_color)

#### Conclusion

Five clusters is good enough; beyond that, the improvement in inertia decelerates.
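
The elbow in the inertia curve is a visual judgment. One complementary check is the silhouette score from `sklearn.metrics`, sketched here on synthetic blobs rather than on the notebook's `X`; the blob centers, spread, and `n_init=10` are assumptions chosen for the illustration.

```python
import numpy as np
import sklearn.cluster as cluster
from sklearn import metrics
from sklearn.datasets import make_blobs

# Synthetic 3-D points with four well-separated groups (a stand-in for X).
X_toy, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0, 0], [10, 0, 0], [0, 10, 0], [0, 0, 10]],
    cluster_std=0.5,
    random_state=123,
)

scores = {}
for n_clusters in range(2, 8):
    labels = cluster.KMeans(
        n_clusters=n_clusters, n_init=10, random_state=123
    ).fit_predict(X_toy)
    # Mean silhouette over all points; higher means tighter, better-separated clusters.
    scores[n_clusters] = metrics.silhouette_score(X_toy, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real data the silhouette curve is usually much flatter than on clean blobs, so it is best read alongside the inertia plot rather than instead of it.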

### Cluster

In [None]:
kmeans = cluster.KMeans(n_clusters=5, random_state=123).fit(X)

(unique_labels, counts) = np.unique(kmeans.labels_, return_counts=True)
display([unique_labels, counts, kmeans.inertia_])

### Label Data

In [None]:
df_labeled = pd.DataFrame(
    X, columns=["Component" + str(k + 1) for k in range(X.shape[1])]
).join(pd.DataFrame({"kmeans": kmeans.labels_}))
display(df_labeled.head())
display(df_labeled.groupby("kmeans").count().reset_index())
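
Cluster labels in component space are hard to interpret directly. One option is to attach the labels to the original, unscaled features and summarize each cluster. The sketch below uses a toy frame with made-up values and hand-written labels in place of `kmeans.labels_`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: original (unscaled) features and one cluster label per row.
features = pd.DataFrame(
    {
        "Tenure in Months": [2, 3, 40, 45, 60, 58],
        "Total Regular Charges": [50.0, 80.0, 2000.0, 2400.0, 5000.0, 4800.0],
    }
)
labels = np.array([0, 0, 1, 1, 2, 2])

# Per-cluster size and mean of each feature, in original units.
grouped = features.assign(kmeans=labels).groupby("kmeans")
profile = grouped.mean()
profile.insert(0, "Count", grouped.size())
print(profile)
```

Reading the rows of such a table side by side is one way to give each cluster a plain-language description.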

### Scatter Plot Helper Function

In [None]:
def df_scatter_3d(
    data_frame: pd.DataFrame = None,
    color: str = None,
):
    px.scatter_3d(
        data_frame=data_frame,
        x=data_frame.columns[0],
        y=data_frame.columns[1],
        z=data_frame.columns[2],
        color=color,
    ).update_layout(autosize=False, width=600, height=600).show()

### Scatter Plot

In [None]:
df_scatter_3d(df_labeled, color="kmeans")

### KDE Plot

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=n_components)
fig.set_size_inches((14, 6))

for k in range(n_components):
    sns.kdeplot(
        df_labeled,
        x=df_labeled.columns[k],
        fill=True,
        hue="kmeans",
        ax=ax[k],
        palette="bright",
        alpha=0.5,
        linewidth=1,
    )

fig.suptitle("Density vs. Component Axis")

## Clustering with DBSCAN

### Parameter Search

Search over a range of eps values with min samples held fixed.

#### K-Distance Graph

In [None]:
min_samples = 75
distances = np.sort(
    np.mean(
        NearestNeighbors(n_neighbors=min_samples, algorithm="auto")
        .fit(X)
        .kneighbors(X)[0],
        axis=1,
    )
)

fig, ax = plt.subplots(nrows=1, ncols=2)
fig.set_size_inches((14, 8))


ax[0].plot(
    distances,
    linestyle="solid",
    marker="o",
    color="blue",
)

ax[0].set_xlabel("Sorted Index")
ax[0].set_ylabel("Average Distance")

sns.kdeplot(distances, fill=True, color="blue", ax=ax[1])
ax[1].set_xlabel("Average Distance")

fig.suptitle("K-Distance for %d Nearest Neighbors" % min_samples)
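
The cell above plots the mean distance to the nearest neighbors. A common variant of the k-distance graph uses the distance to the k-th neighbor instead, which lines up with how DBSCAN defines a core point (its `min_samples` count includes the point itself). The sketch below compares the two on synthetic blobs; the data and `k = 10` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X_toy, _ = make_blobs(n_samples=300, centers=3, n_features=3, random_state=123)

k = 10  # hypothetical neighbor count; the notebook uses min_samples = 75

# Querying with the training data returns each point as its own nearest
# neighbor (distance 0), so the last column is the distance to the k-th
# neighbor when the point itself is counted.
dist, _ = NearestNeighbors(n_neighbors=k).fit(X_toy).kneighbors(X_toy)

kth_distance = np.sort(dist[:, -1])
mean_distance = np.sort(dist.mean(axis=1))
print(kth_distance[-5:], mean_distance[-5:])
```

The k-th neighbor curve sits at or above the mean curve, so an eps read from the mean-distance plot will tend to be smaller than one read from the k-th distance plot.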

#### Perform the Search

In [None]:
eps_start = 0.5
eps_stop = 0.7
num_eps_points = 21
eps_list = np.linspace(eps_start, eps_stop, num_eps_points, endpoint=True)
display(eps_list)

n_clusters_list = []
n_noise_list = []

for eps in eps_list:
    dbscan = cluster.DBSCAN(eps=eps, min_samples=min_samples).fit(X)
    n_clusters_list.append(len(np.unique(dbscan.labels_[dbscan.labels_ != -1])))
    n_noise_list.append(np.sum(dbscan.labels_ == -1))

# for eps in eps_list:
#     for min_samples in min_samples_list:
#         dbscan = cluster.DBSCAN(eps=eps, min_samples=min_samples).fit(X)
#         (unique_labels, counts) = np.unique(dbscan.labels_, return_counts=True)
#         null_count = counts[unique_labels == -1][0] if -1 in unique_labels else 0
#         null_pct = null_count / len(X) * 100.0
#         non_null_pct = counts[unique_labels != -1] / len(X) * 100.0
#         num_labels = np.sum(unique_labels != -1)
#         if num_labels in [2, 3, 4, 5] and null_pct < 25.0 or 1:
#             score = 0.0  # metrics.silhouette_score(X, dbscan.labels_)
#             msg = (
#                 "eps = %.2f, min samples = %d, nulls = %d, %.2f%%, num labels = %d, score = %.2f, label distr = %s"
#                 % (
#                     eps,
#                     min_samples,
#                     null_count,
#                     null_pct,
#                     num_labels,
#                     score,
#                     str(np.round(non_null_pct, 1)),
#                 )
#             )
#             print(msg)

#### Plot Search Results

In [None]:
fig, ax1 = plt.subplots()

ax1_color = "black"
ax1.plot(
    eps_list,
    n_clusters_list,
    linestyle="solid",
    marker="o",
    color=ax1_color,
)

ax1.set_xlabel("Epsilon")
ax1.set_ylabel("Number of Clusters")

ax2_color = "blue"
ax2 = ax1.twinx()
ax2.plot(
    eps_list,
    n_noise_list,
    linestyle="solid",
    marker="o",
    color=ax2_color,
)

ax2.set_ylabel("Number of Noise Values", color=ax2_color)
ax2.tick_params(axis="y", labelcolor=ax2_color)

### Cluster

In [None]:
dbscan = cluster.DBSCAN(eps=0.54, min_samples=min_samples).fit(X)
(unique_labels, counts) = np.unique(dbscan.labels_, return_counts=True)
display([unique_labels, counts])

### Label Data

In [None]:
df_labeled["dbscan"] = dbscan.labels_
display(df_labeled.head())
display(df_labeled.groupby("dbscan").count().reset_index())

### Scatter Plot

In [None]:
df_scatter_3d(df_labeled.query("dbscan != -1"), color="dbscan")

## Clustering with OPTICS

In [None]:
# optics = cluster.OPTICS(min_samples=10).fit(X)
# (unique_labels, counts) = np.unique(optics.labels_, return_counts=True)
# display([unique_labels, counts])
# optics.get_params(deep=True)
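
The OPTICS cell above is left commented out. As a rough sketch of what running it involves, the example below fits `cluster.OPTICS` on synthetic blobs (not the notebook's `X`) and summarizes the labels the same way as the DBSCAN section, where -1 marks noise. The blob centers and spread are assumptions.

```python
import numpy as np
import sklearn.cluster as cluster
from sklearn.datasets import make_blobs

# Synthetic 3-D points in three groups (a stand-in for X).
X_toy, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0, 0], [10, 0, 0], [0, 10, 0]],
    cluster_std=0.5,
    random_state=123,
)

optics = cluster.OPTICS(min_samples=10).fit(X_toy)

# As with DBSCAN, the label -1 marks points treated as noise.
unique_labels, counts = np.unique(optics.labels_, return_counts=True)
n_clusters = int(np.sum(unique_labels != -1))
noise_fraction = float(np.mean(optics.labels_ == -1))
print(n_clusters, round(noise_fraction, 3))
```

Unlike DBSCAN, OPTICS does not need a single eps up front, but it can be noticeably slower on large datasets.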