# Exercise 12: Clustering to find Anomalies
You are working as a data analyst for the Office of the Inspector General in Washington, D.C. Your task is to identify **unusual government purchase card (P-Card) spending patterns** that could indicate potential waste or misuse. Each record in your dataset represents one government **cardholder**. The dataset summarizes their monthly spending behavior.

Use **DBSCAN** to identify natural clusters and potential outliers among cardholders.
1. **Import and Explore**  
    * Load the pcard_summary.csv dataset using pandas  
    * Examine the distribution of each numeric variable  
    * Plot pairwise scatterplots or a 3D scatterplot to get a sense of clustering  

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

p_card = pd.read_csv('pcard_summary.csv')

# numeric variables to explore
num_vars = ['avg_transaction_amount',
            'transactions_per_month',
            'pct_weekend_transactions']

# create pairplot
sns.pairplot(p_card[num_vars])
plt.show()

# 3D scatter plot
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111,projection='3d')
ax.scatter(p_card[num_vars[0]],
           p_card[num_vars[1]],
           p_card[num_vars[2]],
           s=30)

ax.set_xlabel(num_vars[0])
ax.set_ylabel(num_vars[1])
ax.set_zlabel(num_vars[2])

plt.show()
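
The task also asks for the distribution of each numeric variable. A pairplot shows this visually; a numeric summary can back it up. The sketch below is only an illustration: `demo` is a synthetic stand-in (an assumption, not the real `pcard_summary.csv`), and the distribution parameters are made up so that the block runs on its own.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for pcard_summary.csv (assumed shape, not the real data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "avg_transaction_amount": rng.lognormal(mean=5.0, sigma=1.0, size=500),
    "transactions_per_month": rng.poisson(lam=12, size=500).astype(float),
    "pct_weekend_transactions": rng.beta(a=1.0, b=15.0, size=500),
})

# Count, mean, spread, and quartiles for each variable (one row per variable)
summary = demo.describe().T

# Sample skewness: values well above 0 indicate a long right tail
summary["skew"] = demo.skew()

print(summary[["mean", "50%", "max", "skew"]])
```

With the real data, replacing `demo` with `p_card[num_vars]` should give the same kind of table; a mean well above the median is another sign of right skew.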

#### Distributions
* **avg_transaction_amount** is heavily right-skewed, with several very large purchases that are possible high-risk cases.  
* **transactions_per_month** shows two clear populations: light users vs. high-volume purchasers.  
* **pct_weekend_transactions** clusters strongly near 0 (most government purchasing occurs on weekdays), with a thin tail extending toward high weekend usage.  
------
#### Pairplots
* There appear to be natural groupings along both spending amount and transaction count.
* Some cardholders combine high transaction amounts with high weekend percentages, making them potential outliers.
* The dense bands separated by sparse regions suggest density-based clusters, which is the kind of structure DBSCAN is designed to find.
-----
#### 3D Plot
* The cardholders form two main dense clusters, along with several isolated points far from the core.
* High-spending outliers are clearly visible at the far right of the avg_transaction_amount axis.
* A small group with unusually high weekend usage also stands apart from the main population.

**2. Preprocess**  
*  Extract the three numeric columns
*  Scale them using StandardScaler from sklearn.preprocessing.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X = scaler.fit_transform(p_card[['avg_transaction_amount',
                                        'transactions_per_month',
                                        'pct_weekend_transactions']])
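
Scaling matters here because DBSCAN uses distances between points, and a dollar-valued column would otherwise swamp a fraction-valued one. The following sketch checks what `StandardScaler` does on synthetic data; the array `raw` and its distribution parameters are assumptions chosen only to mimic columns on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (dollars, counts, fractions)
rng = np.random.default_rng(1)
raw = np.column_stack([
    rng.lognormal(5.0, 1.0, 300),   # stand-in for avg_transaction_amount
    rng.poisson(12, 300),           # stand-in for transactions_per_month
    rng.beta(1.0, 15.0, 300),       # stand-in for pct_weekend_transactions
])

scaled = StandardScaler().fit_transform(raw)

# Before scaling, the dollar column dominates any Euclidean distance;
# afterwards every column has mean ~0 and standard deviation ~1.
print("raw std:    ", raw.std(axis=0).round(3))
print("scaled mean:", scaled.mean(axis=0).round(3))
print("scaled std: ", scaled.std(axis=0).round(3))
```

Because every scaled column has unit variance, a single `eps` value means roughly the same thing along each axis.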

**3. Parameter Exploration**  
* Use a **k-distance plot** to estimate a good value for eps. *(Hint: try min_samples=4 and plot the distance to the 4th nearest neighbor.)*
* Try multiple combinations of eps and min_samples until clusters start to emerge.

In [None]:
from sklearn.neighbors import NearestNeighbors
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

all_distances = pd.DataFrame()

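# One curve per k; each k corresponds to a candidate min_samples value
for n in range(4,11):
    neighbors = NearestNeighbors(n_neighbors=n)
    neighbors_fit = neighbors.fit(X)
    # kneighbors(X) counts each point as its own nearest neighbor (distance 0),
    # so column n-1 is the distance to the n-th point including the point itself
    distances, indices = neighbors_fit.kneighbors(X)

    distances = np.sort(distances[:,n-1])
    all_distances[n] = distances

all_distances.plot()
plt.title("K-distance graph (for choosing eps)")
plt.xlabel("Points sorted by distance")
plt.ylabel("k-th nearest neighbor distance (one line per k)")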
plt.show()

My choice of eps based on the k-distance plot is 0.30. This is where the elbow occurs, just before the curve turns steep.
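
Reading the elbow by eye is subjective, so a rough numeric cross-check can help. The sketch below uses one common heuristic (the point on the sorted k-distance curve farthest below the straight line joining its endpoints). It is not the method used above, and `X_demo` is synthetic data from `make_blobs`, assumed here only so the block is runnable.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled P-Card features
points, _ = make_blobs(n_samples=300, centers=3, n_features=3,
                       cluster_std=0.6, random_state=0)
X_demo = StandardScaler().fit_transform(points)

k = 4
# kneighbors(X_demo) counts each point as its own nearest neighbor,
# so the last column is the distance to the k-th point including itself.
dist, _ = NearestNeighbors(n_neighbors=k).fit(X_demo).kneighbors(X_demo)
k_dist = np.sort(dist[:, -1])

# Normalise both axes to [0, 1]; the chord between the endpoints is then y = x,
# and the elbow is taken as the point where the curve lies farthest below it.
x = np.linspace(0.0, 1.0, len(k_dist))
y = (k_dist - k_dist[0]) / (k_dist[-1] - k_dist[0])
elbow_idx = int(np.argmax(x - y))
eps_guess = float(k_dist[elbow_idx])

print(f"suggested eps ~ {eps_guess:.2f}")
```

The heuristic only gives a starting point; the value still needs to be checked against the clusters it produces.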

#### 4. Run DBSCAN
* Use sklearn.cluster.DBSCAN to fit the model.
* Create a new column for cluster_label in your dataframe

In [None]:
from sklearn.cluster import DBSCAN

min_samples = range(2,10)
eps_range = range(1,40)

dbscan_results = []

for n in min_samples:
    for eps in eps_range:
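        # eps_range holds integers 1..39, so eps/10 sweeps 0.1 through 3.9
        db = DBSCAN(eps=eps/10, min_samples=n)
        db_labels = db.fit_predict(X)

        unique_labels = set(db_labels)
        n_clusters = len([l for l in unique_labels if l != -1])
        n_noise = list(db_labels).count(-1)

        # cluster sizes only; the noise label (-1) is excluded so that
        # min_size and max_size describe actual clusters
        sizes = pd.Series(db_labels[db_labels != -1]).value_counts()
        min_size = sizes.min() if n_clusters else 0
        max_size = sizes.max() if n_clusters else 0
        ratio = min_size / max_size if n_clusters else float('nan')

        dbscan_results.append([n, eps/10, n_clusters, n_noise, min_size, max_size, ratio])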

dbscan = pd.DataFrame(dbscan_results, columns=['min_samples','eps','n_clusters','noise','min_size','max_size','min_max_ratio'])

dbscan

**Best parameters**  
* min_samples = 3
* eps = 0.4  

I went with min_samples = 3 and eps = 0.4 because this combination produces a meaningful structure in the data. Instead of collapsing into one giant cluster or creating dozens of tiny ones, it creates three distinct spending clusters along with a reasonable number of noise points. The three clusters closely match the pairplots, and this parameter set gives the clearest separation between normal behavior and unusual P-Card activity.
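
A results table with a few hundred rows is hard to scan by eye, so it can help to filter it down to plausible candidates first. The sketch below shows one way to do that with pandas. The data (`X_demo`), the small parameter grid, and the screening rule (2 to 5 clusters, at most 10% noise) are all illustrative assumptions rather than the criteria used above.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled P-Card features
points, _ = make_blobs(n_samples=300, centers=3, n_features=3,
                       cluster_std=0.6, random_state=0)
X_demo = StandardScaler().fit_transform(points)

# Small parameter grid: one row of results per (min_samples, eps) pair
rows = []
for min_samples in (3, 5, 8):
    for eps in (0.2, 0.3, 0.4, 0.5):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_demo)
        rows.append({
            "min_samples": min_samples,
            "eps": eps,
            "n_clusters": len(set(labels) - {-1}),
            "noise": int((labels == -1).sum()),
        })
grid = pd.DataFrame(rows)

# Hypothetical screening rule: a handful of clusters and at most 10% noise
max_noise = 0.10 * len(X_demo)
keep = grid["n_clusters"].between(2, 5) & (grid["noise"] <= max_noise)
candidates = grid[keep].sort_values(["noise", "eps"])

print(candidates)
```

The same filter could be applied to the `dbscan` dataframe, since it has `n_clusters` and `noise` columns with the same meaning.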

#### 5. Analyze Results
* Count how many clusters and noise points were found.
* Compute the mean values of each variable per cluster.
* Visualize the clusters with a scatterplot (color by cluster)

In [None]:
# Helper function to fit DBSCAN and plot the clusters in two dimensions
def dbscan_plot(X, eps, min_samples):

    db = DBSCAN(eps=eps, min_samples=min_samples)
    db_labels = db.fit_predict(X)

    unique_labels = set(db_labels)
    n_clusters = len([l for l in unique_labels if l != -1])
    n_noise = list(db_labels).count(-1)

    print(f"Clusters: {n_clusters}")
    print(f"Noise:    {n_noise}")

    plt.scatter(p_card["transactions_per_month"], p_card["pct_weekend_transactions"],
                c=db_labels, cmap="tab10", s=30)
    plt.xlabel("transactions_per_month")
    plt.ylabel("pct_weekend_transactions")
    plt.title("DBSCAN clustering")
    plt.show()

    return db_labels

# Use the parameters chosen from the table above (eps=0.4, min_samples=3)
labels = dbscan_plot(X, 0.4, 3)

In [None]:
from mpl_toolkits.mplot3d import Axes3D

def dbscan_plot_3d(X, eps, min_samples):
    db = DBSCAN(eps=eps, min_samples=min_samples)
    db_labels = db.fit_predict(X)

    # assign cluster labels to the dataframe
    p_card["cluster_label"] = db_labels

    # compute clusters and noise counts
    unique_labels = set(db_labels)
    n_clusters = len([l for l in unique_labels if l != -1])
    n_noise = list(db_labels).count(-1)

    print(f"Clusters found: {n_clusters}")
    print(f"Noise points:   {n_noise}\n")

    # compute cluster means
    print("Mean values per cluster:")
    display(p_card.groupby("cluster_label")[num_vars].mean())

    # create 3D scatter plot
    fig = plt.figure(figsize=(8,6))
    ax = fig.add_subplot(111, projection='3d')

    ax.scatter(
        p_card[num_vars[1]],
        p_card[num_vars[2]],
        p_card[num_vars[0]],
        c=db_labels,
        cmap="tab10",
        s=30
    )

    ax.set_xlabel(num_vars[1])
    ax.set_ylabel(num_vars[2])
    ax.set_zlabel(num_vars[0])

    plt.title(f"DBSCAN 3D (eps={eps}, min_samples={min_samples})")
    plt.show()

    return db_labels


# run the final model
labels = dbscan_plot_3d(X, 0.4, 3)
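
Once labels are assigned, the noise points (label -1) are the cardholders to review, and it is useful to see which variable makes each one unusual. The sketch below does this with z-scores on synthetic data. The dataframe `demo`, the injected extreme spender, and the `eps=1.0` setting are assumptions for illustration, not values from the P-Card dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

cols = ["avg_transaction_amount", "transactions_per_month",
        "pct_weekend_transactions"]

# Synthetic stand-in: 50 similar cardholders plus one extreme spender
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal([200.0, 12.0, 0.05], [10.0, 1.0, 0.01],
                               size=(50, 3)), columns=cols)
demo.loc[len(demo)] = [5000.0, 12.0, 0.05]

X_demo = StandardScaler().fit_transform(demo[cols])
demo["cluster_label"] = DBSCAN(eps=1.0, min_samples=3).fit_predict(X_demo)

# z-scores show which variable pushes each noise point away from the rest
z = pd.DataFrame(X_demo, columns=cols, index=demo.index)
noise = demo["cluster_label"] == -1
review = z[noise].assign(largest_dev=z[noise].abs().idxmax(axis=1))

print(demo["cluster_label"].value_counts())
print(review.round(2))
```

In this toy data the injected cardholder is flagged as noise with `avg_transaction_amount` as the largest deviation. On the real data, the same table would indicate whether a flagged cardholder stands out for amount, volume, or weekend activity.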

#### 6. Interpretation
* Which clusters represent "typical" spending?
* Which cardholders might be outliers (cluster=-1)?
* Discuss plausible explanations for outlier behavior (e.g., legitimate high-volume spending vs potential misuse)
----------------------------
Overall, DBSCAN was a useful way to split this dataset into meaningful groups and surface unusual cardholders. Most cardholders fall into one large “typical” spending cluster, while a couple of smaller clusters show alternate but still reasonable usage styles that probably reflect different job roles or purchasing needs. The real value comes from the noise points (cluster_label = -1): the model isolates the handful of cardholders whose spending is out of line with everyone else. These cases don’t automatically mean something is wrong, since legitimate high-volume spending can land there too, but they are the ones I’d review first to determine whether the behavior is justified or points to potential misuse.