<a href="https://colab.research.google.com/github/hellojohnkim/mmai869/blob/main/2024_869_JohnKim.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MMAI 869 2024: Individual Assignment

- Student Name: Kim, John
- Student Number: 20439250
- Section Number: MMAI 2024
- Favourite Book: Pachinko by Min Jin Lee
- Currently Reading: The Worlds I See by Fei Fei Li
- Submitted Date: 2024-01-07

In [None]:
import datetime
import pandas as pd
import numpy as np

In [None]:
print(datetime.datetime.now())

In [None]:
!which python

In [None]:
!python --version

In [None]:
!echo $PYTHONPATH

# Question 1: Uncle Steve's Diamonds

## Instructions

You work at a local jewelry store named *Uncle Steve's Diamonds*. You started as a janitor, but you’ve recently been promoted to senior data analyst. Congratulations!

Uncle Steve, the store's owner, needs to better understand the store's customers. In particular, he wants to know what kind of customers shop at the store. He wants to know the main types of *customer personas*. Once he knows these, he will contemplate ways to better market to each persona, better satisfy each persona, better cater to each persona, increase the loyalty of each persona, etc. But first, he must know the personas.

You want to help Uncle Steve. Using sneaky magic (and the help of Environics), you've collected four useful features for a subset of the customers: age, income, spending score (i.e., a score based on how much they’ve spent at the store in total), and savings (i.e., how much money they have in their bank account).

**Your tasks**

1. Pick a clustering algorithm (the [`sklearn.cluster`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster) module has many good choices, including [`KMeans`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans), [`DBSCAN`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN), and [`AgglomerativeClustering`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering) (aka Hierarchical)). (Note that another popular implementation of the hierarchical algorithm can be found in SciPy's [`scipy.cluster.hierarchy.linkage`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html).) Don't spend a lot of time thinking about which algorithm to choose - just pick one. Cluster the customers as best as you can, within reason. That is, try different feature preprocessing steps, hyperparameter values, and/or distance metrics. You don't need to try every possible combination, but try a few at least. Measure how good each model configuration is by calculating an internal validation metric (e.g., [`calinski_harabasz_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html) or [`silhouette_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score)).
2. You have some doubts - you're not sure if the algorithm you chose in part 1 is the best algorithm for this dataset/problem. Neither is Uncle Steve. So, choose a different algorithm (any!) and do it all again.
3. Which clustering algorithm is "better" in this case? Think about characteristics of the algorithm like quality of results, ease of use, speed, interpretability, etc. Choose a "winner" and justify to Uncle Steve.
4. Interpret the clusters of the winning model. That is, describe, in words, a *persona* that accurately depicts each cluster. Use statistics (e.g., cluster means/distributions), examples (e.g., exemplar instances from each cluster), and/or visualizations (e.g., relative importance plots, snakeplots) to get started. Human judgement and creativity will be necessary. This is where it all comes together. Be descriptive and *help Uncle Steve understand his customers better*. Please!

**Marking**

The coding parts (i.e., 1 and 2) will be marked based on:

- *Correctness*. Code clearly and fully performs the task specified.
- *Reproducibility*. Code is fully reproducible. I.e., you (and I) are able to run this Notebook again and again, from top to bottom, and get the same results each time.
- *Style*. Code is organized. All parts commented with clear reasoning and rationale. No old code laying around. Code easy to follow.


Parts 3 and 4 will be marked on:

- *Quality*. Response is well-justified and convincing. Response uses facts and data where possible.
- *Style*. Response uses proper grammar, spelling, and punctuation. Response is clear and professional. Response is complete, but not overly-verbose. Response follows length guidelines.


**Tips**

- Since clustering is an unsupervised ML technique, you don't need to split the data into training/validation/test or anything like that. Phew!
- On the flip side, since clustering is unsupervised, you will never know the "true" clusters, and so you will never know if a given algorithm is "correct." There really is no notion of "correctness" - only "usefulness."
- Many online clustering tutorials (including some from Uncle Steve) create flashy visualizations of the clusters by plotting the instances on a 2-D graph and coloring each point by the cluster ID. This is really nice and all, but it can only work if your dataset only has exactly two features - no more, no less. This dataset has more than two features, so you cannot use this technique. (But that's OK - you don't need to use this technique.)
- Must you use all four features in the clustering? Not necessarily, no. But "throwing away" quality data, for no reason, is unlikely to improve a model.
- Some people have success applying a dimensionality reduction technique (like [`sklearn.decomposition.PCA`](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)) to the features before clustering. You may do this if you wish, although it may not be as helpful in this case because there are only four features to begin with.
- If you apply a transformation (e.g., [`MinMaxScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) or [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler)) to the features before clustering, you may have difficulty interpreting the means of the clusters (e.g., what is a mean Age of 0.2234??). There are two options to fix this: first, you can always reverse a transformation with the `inverse_transform` method. Second, you can just use the original dataset (i.e., before any preprocessing) during the interpretation step.
- You cannot change the distance metric for K-Means. (This is for theoretical reasons: K-Means only works/makes sense with Euclidean distance.)
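
The `inverse_transform` tip can be illustrated with a minimal sketch. The toy array and its two columns (standing in for age and income) are assumptions made up for the example, not the assignment data. Because standardization is an affine map, a mean computed in scaled space maps back to the mean of the original values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: the two columns stand in for age and income.
raw = np.array([[25.0, 40_000.0],
                [40.0, 60_000.0],
                [55.0, 80_000.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(raw)  # each column now has mean 0 and std 1

# A "cluster mean" computed in scaled space is hard to read directly...
scaled_mean = scaled[:2].mean(axis=0)

# ...so map it back to the original units before interpreting it.
original_mean = scaler.inverse_transform(scaled_mean.reshape(1, -1))[0]
print(original_mean)
```

The same call works on `cluster_centers_` from a fitted `KMeans`, provided the scaler is the one that produced the data the model was fitted on.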


## 1.0: Load data

In [None]:
import matplotlib.pyplot as plt
import sklearn
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

In [None]:
# DO NOT MODIFY THIS CELL
df1 = pd.read_csv("https://drive.google.com/uc?export=download&id=1thHDCwQK3GijytoSSZNekAsItN_FGHtm")
df1.info()

In [None]:
df1.head(100)

In [None]:
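# Data exploration: shape, column names, first rows, and summary statistics
# (only the last expression in a cell is displayed, so the rest are printed explicitly)
print(df1.shape)
print(list(df1))
display(df1.head(n=20))
df1.describe().transpose()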

In [None]:
X = df1.copy()
X.head(10)

In [None]:
# Keep the column names; X stays a DataFrame because later cells index it by column label and with .iloc
col_names = df1.columns

In [None]:
# Pre-processing before clustering:
# standardize every feature (mean 0, std 1) so no single feature dominates the distances
scaler = StandardScaler()
features = df1.columns
X = pd.DataFrame(scaler.fit_transform(df1[features]), columns=features, index=df1.index)

In [None]:
# Inspect the standardized features (each column should have mean ~0 and std ~1)
print(X.shape)
display(X.head(10))
X.describe().transpose()

In [None]:
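# Quick look at the first two standardized features; axis labels are taken from the column names
plt.figure()
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c="black")
plt.title("Jewelry Customers Data")
plt.xlabel(f"{X.columns[0]} (standardized)")
plt.ylabel(f"{X.columns[1]} (standardized)")
plt.show()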

## 1.1: Clustering Algorithm #1

In [52]:
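from sklearn.cluster import KMeans

# First pass: 8 clusters (the scikit-learn default) with k-means++ initialization.
# n_init is set explicitly so the result does not depend on the installed version's default.
k_means = KMeans(n_clusters=8, init="k-means++", n_init=10, random_state=42)
k_means.fit(X)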


In [53]:
# Get the cluster labels for each data point
labels = k_means.labels_
labels

array([6, 6, 1, 1, 0, 3, 1, 0, 5, 0, 4, 2, 7, 7, 1, 4, 1, 5, 0, 4, 7, 4,
       6, 5, 4, 3, 3, 1, 0, 1, 1, 1, 0, 1, 2, 6, 0, 1, 2, 7, 1, 0, 3, 6,
       1, 3, 4, 3, 2, 6, 0, 2, 3, 5, 1, 0, 2, 0, 5, 5, 5, 0, 0, 6, 1, 6,
       0, 3, 3, 1, 1, 7, 5, 0, 7, 6, 2, 6, 1, 3, 1, 4, 6, 5, 5, 2, 7, 6,
       1, 6, 1, 6, 3, 0, 6, 3, 2, 6, 6, 2, 7, 3, 2, 3, 2, 0, 5, 7, 7, 6,
       5, 2, 6, 2, 7, 7, 6, 7, 3, 0, 2, 0, 5, 5, 2, 4, 1, 0, 6, 1, 2, 1,
       7, 6, 3, 0, 4, 3, 1, 2, 0, 6, 1, 3, 1, 3, 6, 4, 1, 5, 6, 7, 5, 6,
       5, 4, 7, 6, 0, 1, 2, 5, 6, 7, 0, 2, 5, 6, 2, 6, 0, 1, 7, 1, 0, 6,
       6, 0, 0, 0, 6, 6, 7, 3, 3, 5, 2, 1, 1, 6, 2, 0, 6, 1, 1, 0, 3, 7,
       5, 5, 0, 2, 6, 2, 6, 2, 1, 0, 3, 1, 0, 0, 5, 5, 1, 5, 7, 3, 7, 0,
       6, 6, 2, 0, 7, 3, 7, 6, 1, 0, 6, 6, 0, 6, 6, 6, 2, 1, 7, 2, 3, 1,
       6, 6, 4, 6, 2, 0, 3, 3, 3, 5, 6, 7, 0, 0, 7, 7, 6, 6, 4, 1, 6, 0,
       6, 5, 7, 3, 6, 7, 5, 1, 5, 0, 7, 6, 0, 5, 6, 6, 0, 2, 7, 4, 7, 5,
       0, 7, 4, 0, 5, 3, 6, 7, 2, 6, 2, 0, 5, 2, 1,

In [54]:
# Centroid values (in standardized units, because X was scaled)
k_means.cluster_centers_

array([[ 1.20746731, -1.33611726, -0.64538103,  1.14264671],
       [ 0.00639037, -0.11090221,  1.04534171, -0.824199  ],
       [-1.08606468,  0.77829813, -0.76260864,  0.79448196],
       [-1.44466803,  1.46050665,  1.51057952, -1.57249636],
       [ 1.11876151,  1.23565419, -1.68367042,  0.59588052],
       [ 1.1748626 , -1.31231335, -0.71943742,  0.76977876],
       [ 0.06420515, -0.0651299 ,  1.01300305, -1.14800405],
       [-1.09030742,  0.87816916, -0.74187521,  0.4542097 ]])

In [55]:
scaler.inverse_transform(k_means.cluster_centers_)

array([[8.81392405e+01, 2.74701646e+04, 3.37686022e-01, 1.75120744e+04],
       [5.91739130e+01, 7.15255507e+04, 7.76219861e-01, 7.78734682e+03],
       [3.28281250e+01, 1.03498766e+05, 3.07279927e-01, 1.57906341e+04],
       [2.41800000e+01, 1.28029120e+05, 8.96891640e-01, 4.08752031e+03],
       [8.60000000e+01, 1.19944040e+05, 6.83780993e-02, 1.48086838e+04],
       [8.73529412e+01, 2.83260882e+04, 3.18477530e-01, 1.56684935e+04],
       [6.05681818e+01, 7.31713977e+04, 7.67831971e-01, 6.18634890e+03],
       [3.27258065e+01, 1.07089855e+05, 3.12657694e-01, 1.41082170e+04]])

In [56]:
# Plot the first two features, colored by cluster label
# (X is a DataFrame, so positional indexing needs .iloc)
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=labels)
plt.show()

In [None]:
# Mean silhouette score for the KMeans model (closer to 1 is better)
print("Silhouette score:", silhouette_score(X, labels))

# Per-sample silhouette values, useful for spotting poorly assigned points
silhouette_samples(X, labels)

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score
import numpy as np

# Preprocessing: Standardizing the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df1)

# Function to apply K-Means and evaluate the model
def kmeans_clustering(data, n_clusters):
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    labels = kmeans.fit_predict(data)

    silhouette = silhouette_score(data, labels)
    calinski_harabasz = calinski_harabasz_score(data, labels)

    return kmeans, silhouette, calinski_harabasz

# Trying K-Means with different numbers of clusters
k_values = range(2, 11)
kmeans_results = {}

for k in k_values:
    kmeans, silhouette, calinski_harabasz = kmeans_clustering(scaled_data, k)
    kmeans_results[k] = (silhouette, calinski_harabasz)

kmeans_results_df = pd.DataFrame(kmeans_results, index=["Silhouette Score", "Calinski-Harabasz Score"]).T
kmeans_results_df.index.name = 'Number of Clusters'
kmeans_results_df


## 1.2: Clustering Algorithm #2

In [None]:
from scipy.cluster.hierarchy import linkage

#call the linkage function
aggl = linkage(X, method='ward', metric='euclidean')

In [None]:
import scipy
from scipy.cluster.hierarchy import dendrogram

# Plot the dendrogram of the Ward linkage
plt.figure(figsize=(16, 8))
plt.grid(False)
plt.title("Uncle Steve's Diamonds Dendrogram")
dend = dendrogram(aggl)

In [None]:
# Let's find K=5 clusters
K=5
labels = scipy.cluster.hierarchy.fcluster(aggl, K, criterion="maxclust")
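
The choice of K for the hierarchical model can be checked with the same internal metric used for K-Means. The following is a minimal sketch on hypothetical toy blobs (not the customer data): it cuts one Ward linkage at several values of K and scores each cut with `silhouette_score`. The names `toy`, `Z`, and `cut_labels` are illustrative only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

# Hypothetical toy data: three well-separated blobs in two dimensions.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Build the Ward linkage once, then cut the tree at different cluster counts.
Z = linkage(toy, method="ward", metric="euclidean")

scores = {}
for k in range(2, 7):
    cut_labels = fcluster(Z, k, criterion="maxclust")
    scores[k] = silhouette_score(toy, cut_labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Applied to the scaled customer features, the same loop would give a table of scores comparable to the K-Means results, which is one possible basis for the model comparison.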

In [None]:
df1['Cluster ID'] = labels
df1.head(100)
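
With cluster IDs attached to the original (unscaled) rows, a first step toward interpreting the clusters is a per-cluster profile. The sketch below uses a hypothetical toy frame; the column names and values are assumptions for illustration, not the real dataset. It computes cluster sizes, cluster means, and each mean's deviation from the overall mean.

```python
import pandas as pd

# Hypothetical toy rows; column names and values are made up for illustration.
toy = pd.DataFrame({
    "Age": [25, 30, 62, 70, 45],
    "Income": [30_000, 35_000, 90_000, 95_000, 60_000],
    "Cluster ID": [1, 1, 2, 2, 3],
})

# How many customers fall in each cluster.
sizes = toy["Cluster ID"].value_counts().sort_index()

# Mean of every feature within each cluster, in original units.
profile = toy.groupby("Cluster ID").mean()

# Relative deviation of each cluster mean from the overall mean
# (negative = below average, positive = above average).
overall = toy.drop(columns="Cluster ID").mean()
relative = profile / overall - 1

print(sizes, profile, relative, sep="\n\n")
```

A heatmap or bar chart of `relative` is one way to turn these numbers into the kind of relative importance plot mentioned in the task description.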

## 1.3: Model Comparison

## 1.4: Personas

TODO: Delete this text and insert your answer here.

# Question 2: Uncle Steve's Fine Foods

## 2.1: A rule that might have high support and high confidence.

Rule: {organic vegetables} -> {plant-based milk}

The pairing of organic vegetables and plant-based milk reflects a growing consumer trend towards health-conscious and environmentally sustainable choices. This rule's presence in a large number of transactions is indicative of a substantial customer base that prioritizes organic and plant-based products. For Uncle Steve, this insight is valuable for inventory management and marketing strategies. It underscores the importance of stocking a diverse range of organic vegetables and plant-based milk options to cater to this health-focused demographic. Additionally, it presents an opportunity for Uncle Steve to position his store as a destination for customers seeking healthier, eco-friendly food options. This could involve creating dedicated sections for organic and plant-based products, running promotional health-focused campaigns, and possibly hosting in-store events or workshops centered around healthy living and sustainability. Such initiatives could enhance customer loyalty and attract new customers who are increasingly making purchasing decisions based on health and environmental considerations.

## 2.2: A rule that might have reasonably high support but low confidence.

Rule: {energy drinks} -> {gaming accessories}

The association between energy drinks and gaming accessories, while not always consistent in every transaction, points to a specific lifestyle or customer hobby. Energy drinks are popular among gamers for their stimulating effects during long gaming sessions, but the purchase of gaming accessories is a less frequent occurrence. This rule's insight for Uncle Steve lies in the potential to tap into the gaming community by cross-promoting these products. Although the confidence in the rule is low, indicating that not all buyers of energy drinks are interested in gaming accessories, there is still an opportunity to increase sales through targeted marketing efforts. Uncle Steve could consider setting up a gaming section in his store, featuring energy drinks alongside gaming accessories, or even collaborate with local gaming events for mutual promotion. This approach not only caters to the existing customer base but also positions the store as a hub for the local gaming community, potentially attracting a younger demographic and creating a unique shopping experience.

## 2.3: A rule that might have low support and low confidence.

Rule: {gourmet mustard} -> {imported chocolates}

The combination of gourmet mustard and imported chocolates is unconventional, reflecting a unique, perhaps adventurous, consumer taste. Given their niche nature, these items are unlikely to be frequently purchased together, leading to low support and confidence in this rule. For Uncle Steve, this unusual pairing may not be particularly actionable due to its rarity. However, it can serve as a creative prompt to explore less obvious product combinations that might appeal to a small but potentially loyal customer segment. This insight could encourage experimenting with diverse, high-end products to attract customers seeking unique culinary experiences. While not a primary strategy, it could add an element of surprise and novelty to the store's product range, potentially distinguishing Uncle Steve's store from more conventional grocery outlets.


## 2.4: A rule that might have low support and high confidence.

Rule: {camping gear} -> {trail mix}

The connection between camping gear and trail mix, while occurring in a smaller number of transactions, shows a high likelihood of joint purchase when camping gear is bought. This pattern reflects a specific customer interest in outdoor activities. For Uncle Steve, this rule highlights an opportunity to cater to the outdoor enthusiast segment. Although the frequency of such purchases may be low (low support), the strong association (high confidence) suggests that those who are buying camping gear are very likely to be interested in trail mix as well. This insight could lead Uncle Steve to strategically position and market these items together, possibly creating an outdoor-themed section in his store. Additionally, it opens avenues for seasonal promotions, especially during peak camping seasons, and partnerships with local outdoor activity groups or events. Focusing on this niche could not only increase sales in these categories but also enhance the store's reputation as a community-focused retailer that understands and caters to the specific interests and needs of its customers. Such targeted efforts can create a loyal customer base and differentiate Uncle Steve's store from larger, more generic retailers.
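
The four rule types above differ only in their support and confidence, so a small worked example may help. The transactions below are hypothetical and chosen to make the arithmetic easy; `support` and `confidence` are illustrative helper names, not functions from a library.

```python
# Hypothetical toy transactions, each a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "eggs"},
    {"milk"},
    {"camping gear", "trail mix"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the whole rule divided by support of the antecedent."""
    antecedent_support = support(antecedent, transactions)
    if antecedent_support == 0:
        return 0.0  # the rule never fires, so report zero confidence
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / antecedent_support

# {bread} -> {milk}: appears in 2 of 5 baskets, and in 2 of the 3 bread baskets.
print(support({"bread", "milk"}, transactions))
print(confidence({"bread"}, {"milk"}, transactions))

# {camping gear} -> {trail mix}: rare (1 of 5 baskets) but always bought together.
print(support({"camping gear", "trail mix"}, transactions))
print(confidence({"camping gear"}, {"trail mix"}, transactions))
```

In this toy data the camping rule shows the low-support, high-confidence pattern from 2.4: the pair is rare overall, yet every basket with camping gear also contains trail mix.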

# Question 3: Uncle Steve's Credit Union

## 3.0: Load data and split

In [None]:
# DO NOT MODIFY THIS CELL

# First, we'll read the provided labeled training data
df3 = pd.read_csv("https://drive.google.com/uc?export=download&id=1wOhyCnvGeY4jplxI8lZ-bbYN3zLtickf")
df3.info()

from sklearn.model_selection import train_test_split

X = df3.drop('BadCredit', axis=1) #.select_dtypes(['number'])
y = df3['BadCredit']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
X.head()
X_train.shape, X_test.shape, y_train.shape, y_test.shape
X_train.select_dtypes(include=['object']).columns

## 3.1: Baseline model



In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder

# Identifying categorical columns
categorical_cols = X_train.select_dtypes(include=['object']).columns

# Creating a column transformer for one-hot encoding categorical variables
preprocessor = ColumnTransformer(transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)], remainder='passthrough')

# Creating a pipeline with the preprocessor and a random forest classifier
pipeline = Pipeline(steps=[('preprocessor', preprocessor),('classifier', RandomForestClassifier(random_state=42))])

# Performing 10-fold cross-validation to evaluate the baseline model
scores = cross_val_score(pipeline, X_train, y_train, cv=10, scoring='roc_auc')

# Displaying the mean score
print("Mean ROC AUC Score:", scores.mean())


## 3.2: Adding feature engineering

## 3.3: Adding feature selection

## 3.4: Adding hyperparameter tuning

## 3.5: Performance estimation on testing data

# Question 4: Uncle Steve's Wind Farm

# Answers
**Current Situation**
- 256 Failed Turbines
- Failure Repair Cost: $20,000 per turbine
- Maintenance Service Cost: $2,000 per turbine
- Inspection Cost: $500 per turbine

Without a predictive maintenance model, Uncle Steve currently pays $5.12 million in failure repair costs:

- number of failures * failure repair cost per turbine = 256 turbines * $20,000 = $5,120,000

The Random Forest model would save Uncle Steve $3,492,500 and costs less than the RNN model.

Cost and savings by approach:

|         | Cost           | Savings   |
| ------------- |:------------:|:------------:|
| **No Predictive Models**      | $5,120,000 |  -    |
| **Random Forest**   | **$1,627,500** | **$3,492,500** |
| **Recurrent Neural Network**   | $1,765,000 | $3,355,000 |


**Random Forest Cost Analysis**

Confusion matrix for the random forest:
|         | Predicted Fail           | Predicted No Fail  | |
| ------------- |:------------:| :-----:|:-----:|
| **Actual Fail**      | 201 | 55 |256|
| **Actual No Fail**   | 50 | 255,195 |255,245|
|                      | 251|255,250|255,501|

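Cost matrix for the random forest (a predicted failure is inspected for $500; a confirmed failure also receives the $2,000 maintenance service, for $2,500 in total; a missed failure costs the full $20,000 repair):
|         | Predicted Fail           | Predicted No Fail  |
| ------------- |:------------:| :-----:|
| **Actual Fail**      | $2,500 | $20,000 |
| **Actual No Fail**   | $500 | - |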

Total Cost for the random forest:
|         | Predicted Fail           | Predicted No Fail  |
| ------------- |:------------:| :-----:|
| **Actual Fail**      | $502,500 | $1,100,000 |
| **Actual No Fail**   | $25,000 | - |

Total Cost = $502,500 + $25,000 + $1,100,000 =  $1,627,500

**RNN Cost Analysis**

Confusion matrix for the RNN:
|         | Predicted Fail           | Predicted No Fail  | |
| ------------- |:------------:| :-----:|:-----:|
| **Actual Fail**      | 226 | 30 |256|
| **Actual No Fail**   | 1,200 | 254,045 | 255,245|
|                      | 1426 | 254,075| 255,501|

Cost matrix for the RNN (same unit costs as for the random forest):
|         | Predicted Fail           | Predicted No Fail  |
| ------------- |:------------:| :-----:|
| **Actual Fail**      | $2,500 | $20,000 |
| **Actual No Fail**   | $500 | - |

Total Cost for the RNN:
|         | Predicted Fail           | Predicted No Fail  |
| ------------- |:------------:| :-----:|
| **Actual Fail**      |  $565,000  | $600,000  |
| **Actual No Fail**   |  $600,000  | - |

Total Cost = $565,000 + $600,000 + $600,000 = $1,765,000
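
The totals above can be reproduced with a short calculation. This is a minimal sketch that takes the confusion-matrix counts and unit costs from the tables in this answer; the helper name `total_cost` is illustrative.

```python
# Unit costs from the cost matrices above (dollars per turbine).
COST_TRUE_POSITIVE = 2_500    # predicted fail, actually failed
COST_FALSE_NEGATIVE = 20_000  # predicted no fail, actually failed
COST_FALSE_POSITIVE = 500     # predicted fail, did not fail

def total_cost(tp, fn, fp):
    """Total cost of a model given its confusion-matrix counts."""
    return (tp * COST_TRUE_POSITIVE
            + fn * COST_FALSE_NEGATIVE
            + fp * COST_FALSE_POSITIVE)

# No predictive model: every one of the 256 failures is repaired at full cost.
baseline = 256 * COST_FALSE_NEGATIVE

rf_cost = total_cost(tp=201, fn=55, fp=50)
rnn_cost = total_cost(tp=226, fn=30, fp=1_200)

print(f"Baseline:      ${baseline:,}")
print(f"Random forest: ${rf_cost:,} (saves ${baseline - rf_cost:,})")
print(f"RNN:           ${rnn_cost:,} (saves ${baseline - rnn_cost:,})")
```

The RNN catches more failures (226 versus 201), but its 1,200 false alarms add $600,000 in inspections, which is why the random forest comes out cheaper overall.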