<a href="https://www.kaggle.com/code/sayem01k/clustering-with-multiple-features?scriptVersionId=161615230" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction
We are set to explore data derived from the [Survey of Consumer Finances](https://www.federalreserve.gov/econres/scfindex.htm) (SCF), a survey supported by the US Federal Reserve. This comprehensive survey systematically captures data on financial, demographic, and opinion-related aspects of families across the United States. The SCF serves as a valuable resource for gaining insights into the intricate dynamics of households and individuals, providing a nuanced understanding of their financial behaviors and attitudes.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import warnings 
warnings.filterwarnings('ignore') 
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from scipy.stats.mstats import trimmed_var
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/survey-of-consumer-finances/sub-cdbk.txt
/kaggle/input/survey-of-consumer-finances/sub-data.txt


Variable definitions can be looked up in the SCF codebook: https://sda.berkeley.edu/sdaweb/docs/scfcomb2019/DOC/hcbkx01.htm#13.HEADING

# Goals
**1. Selecting Features:**
Start by exploring the dataset and identifying features that you believe could be valuable for creating clusters. There are hundreds of features, so we want to consider those that are likely to be relevant to consumer finances. Features related to income, expenses, savings, investments, debts, and other financial aspects could be good candidates.

**2. Data Preprocessing:**
Before applying K-Means, make sure to preprocess your data. This involves handling missing values, scaling features if necessary, and converting categorical variables into numerical representations.

**3. Building K-Means Model:**
Once your data is ready, choose the appropriate number of clusters (you can use techniques like the elbow method or silhouette analysis) and apply K-Means clustering algorithm. Use all the selected features this time.

**4. Principal Component Analysis (PCA):**
PCA is a dimensionality reduction technique that can help you visualize the multi-dimensional clusters in a 2D scatter plot. It transforms the original features into a set of orthogonal components, and you can choose a subset of these components for visualization.

**5. Visualization:**
Create a 2D scatter plot using the selected components from PCA. Each point in the plot represents a respondent, and the color or shape of the points can indicate the cluster assignment. This will allow you to visually inspect the structure and separation of clusters in the reduced-dimensional space.

# Selecting Features
Create a function that returns a DataFrame of households from the 2016 survey that have a net worth below $2 million and have either been turned down for credit or feared being denied credit (the "TURNFEAR" variable).

In [2]:
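def wrangle(filepath):
    # Load the SCF extract
    df = pd.read_csv(filepath)
    # Keep credit-fearful households with a net worth under $2 million
    mask = (df['TURNFEAR'] == 1) & (df['NETWORTH'] < 2e6)
    df = df[mask]
    # Restrict to the 2016 survey year
    df = df[df['YEAR'] == 2016]

    return df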
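
To see how a combined boolean mask of this kind behaves, here is a minimal sketch on a hypothetical four-row DataFrame. The name `toy` and all of its values are invented for illustration and are not SCF data.

```python
import pandas as pd

# Hypothetical rows, not taken from the SCF
toy = pd.DataFrame({
    "YEAR": [2016, 2016, 2013, 2016],
    "TURNFEAR": [1, 0, 1, 1],
    "NETWORTH": [50_000, 10_000, 20_000, 3_000_000],
})

# All three conditions must hold for a row to be kept
mask = (toy["YEAR"] == 2016) & (toy["TURNFEAR"] == 1) & (toy["NETWORTH"] < 2e6)
subset = toy[mask]
print(subset)
```

Only the first row passes: the second fails the `TURNFEAR` test, the third is from another survey year, and the fourth exceeds the net worth cap.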

In [3]:
df = wrangle("/kaggle/input/survey-of-consumer-finances/sub-data.txt")

print("df type:", type(df))
print("df shape:", df.shape)
df.head()

df type: <class 'pandas.core.frame.DataFrame'>
df shape: (5532, 318)


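| | CASEID | WGT | YEAR | Y1 | YY1 | X1 | XX1 | CPI_DEFL | AGE | AGECL | ... | HSAVFIN | HSAVNFIN | EMERGPSTP | HPSTPPAY | HPSTPLN | HPSTPOTH | EMERGCUT | HCUTFOOD | HCUTENT | HCUTOTH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 207645 | 207646 | 6500.518974 | 2016 | 21 | 2 | | | 1.0 | 20 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 207646 | 207647 | 6463.122354 | 2016 | 22 | 2 | | | 1.0 | 20 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 207647 | 207648 | 6474.279341 | 2016 | 23 | 2 | | | 1.0 | 20 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 207648 | 207649 | 6481.578552 | 2016 | 24 | 2 | | | 1.0 | 20 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 207649 | 207650 | 6492.952368 | 2016 | 25 | 2 | | | 1.0 | 20 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |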


# Data Preprocessing 
##### One effective way to choose the best features for clustering is to identify which numerical features exhibit the highest variance.
Calculate the variance for all the features in the DataFrame `df` and create a Series named `top_ten_var` containing the 10 features with the largest variance.

In [4]:
df=df.select_dtypes(['int64', 'float64'])

In [5]:
# Calculate variance, get 10 largest features
top_ten_var = df.var().sort_values().tail(10)

print("top_ten_var type:", type(top_ten_var))
print("top_ten_var shape:", top_ten_var.shape)
top_ten_var

top_ten_var type: <class 'pandas.core.series.Series'>
top_ten_var shape: (10,)


RETQLIQ     9.076778e+09
PLOAN1      9.442372e+09
KGTOTAL     1.180267e+10
DEBT        1.353757e+10
FIN         1.475060e+10
NHNFIN      1.956246e+10
HOUSES      2.249120e+10
NFIN        5.409730e+10
NETWORTH    5.674371e+10
ASSET       9.330782e+10
dtype: float64

**Use plotly express to create a horizontal bar chart of `top_ten_var`**

In [6]:
# Create horizontal bar chart of `top_ten_var`
fig = px.bar(
    x=top_ten_var,
    y=top_ten_var.index,
    title="SCF: High Variance Features"
)
fig.update_layout(xaxis_title="Variance",yaxis_title="Features")


fig.show()

Many wealth indicators exhibit high skewness, primarily due to a few outlier households possessing substantial wealth. These outliers can affect our variance measure. Let's investigate whether this is the case for one of the features from `top_ten_var`.

### Outliers Detection
Use plotly express to create a horizontal boxplot of "NHNFIN" to determine if the values are skewed.

In [7]:
# Create a boxplot of `NHNFIN`
fig = px.box(
    data_frame=df,
    x='NHNFIN',
    title="Distribution of Non-home, Non-Financial Assets"
)
fig.update_layout(xaxis_title="Value [$]")
fig.show()

The distribution is massively right-skewed because of the huge outliers on the right side. Even though we already excluded households with a high net worth with our `wrangle` function, the variance is still being distorted by some extreme outliers.

The best way to deal with this is to look at the trimmed variance, where we remove extreme values before calculating variance. We can do this using the `trimmed_var` function from the SciPy library, which by default trims the lowest and highest 10% of observations.
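
To illustrate the idea, here is a simplified NumPy sketch of a trimmed variance. The helper `simple_trimmed_var` is hypothetical and written only for this example; it is not SciPy's implementation, and the sample values are made up.

```python
import numpy as np

def simple_trimmed_var(values, proportion=0.1):
    """Variance after dropping `proportion` of the points from each tail."""
    ordered = np.sort(np.asarray(values, dtype=float))
    k = int(proportion * ordered.size)  # number of points cut from each end
    kept = ordered[k:ordered.size - k]
    return kept.var()

# Hypothetical sample with one extreme outlier
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]

plain = np.var(values)
trimmed = simple_trimmed_var(values)
print("plain variance:  ", plain)
print("trimmed variance:", trimmed)
```

A single extreme value dominates the plain variance, while the trimmed version reflects the spread of the bulk of the data.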

In [8]:
# Calculate trimmed variance
top_ten_trim_var = df.apply(trimmed_var).sort_values().tail(10)

print("top_ten_trim_var type:", type(top_ten_trim_var))
print("top_ten_trim_var shape:", top_ten_trim_var.shape)
top_ten_trim_var

top_ten_trim_var type: <class 'pandas.core.series.Series'>
top_ten_trim_var shape: (10,)


WAGEINC     5.360562e+08
HOMEEQ      7.199005e+08
NH_MORT     9.547916e+08
MRTHEL      1.006453e+09
PLOAN1      1.158083e+09
DEBT        2.192043e+09
NETWORTH    3.568438e+09
HOUSES      3.951162e+09
NFIN        6.742461e+09
ASSET       1.039060e+10
dtype: float64

Use plotly express to create a horizontal bar chart of `top_ten_trim_var`.

In [9]:
# Create horizontal bar chart of `top_ten_trim_var`
fig = px.bar(
    x=top_ten_trim_var,
    y=top_ten_trim_var.index,
    title="SCF: High Variance Features"
)
fig.update_layout(xaxis_title="Trimmed Variance",yaxis_title="Features")
fig.show()

There are three notable observations in this plot. First, the variances have decreased significantly: in the previous chart the largest variance (`ASSET`) was about 93 billion, while here it is about 10 billion. Second, the top 10 features have changed: `RETQLIQ`, `KGTOTAL`, `FIN`, and `NHNFIN` have dropped out, replaced by `WAGEINC`, `HOMEEQ`, `NH_MORT`, and `MRTHEL`. Third, variance still differs widely among features. For instance, the trimmed variance for `WAGEINC` is around 0.5 billion, whereas for `ASSET` it is just over 10 billion. In other words, these features are on very different scales, and we need to address this before we can create meaningful clusters.

---------------------------------------------------------------------------------------------------------------
Generate a list high_var_cols with the column names of the five features with the highest trimmed variance.

In [10]:
high_var_cols = top_ten_trim_var.tail(5).index.to_list()
print("high_var_cols type:", type(high_var_cols))
print("high_var_cols len:", len(high_var_cols))
high_var_cols

high_var_cols type: <class 'list'>
high_var_cols len: 5


['DEBT', 'NETWORTH', 'HOUSES', 'NFIN', 'ASSET']

## Split
Now that we've gotten our data to a place where we can use it, we can follow the steps we've used before to build a model, starting with a feature matrix.

In [11]:
X = df[high_var_cols]

print("X type:", type(X))
print("X shape:", X.shape)
X.head()

X type: <class 'pandas.core.frame.DataFrame'>
X shape: (5532, 5)


| | DEBT | NETWORTH | HOUSES | NFIN | ASSET |
|---|---|---|---|---|---|
| 207645 | 0.0 | 10.0 | 0.0 | 0.0 | 10.0 |
| 207646 | 0.0 | 10.0 | 0.0 | 0.0 | 10.0 |
| 207647 | 0.0 | 5.0 | 0.0 | 0.0 | 5.0 |
| 207648 | 0.0 | 10.0 | 0.0 | 0.0 | 10.0 |
| 207649 | 0.0 | 10.0 | 0.0 | 0.0 | 10.0 |


# Building K-Means Model

We have a scale issue among our features, which can make clustering the data more challenging. To address this, we'll utilize standardization, a statistical method for placing all variables in a dataset on the same scale. Let's explore how this works here, and later, we'll incorporate it into our model pipeline.
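
As a quick worked example of what standardization does, the sketch below computes z-scores by hand with NumPy: subtract each column's mean and divide by its standard deviation. The array `X_toy` is hypothetical, chosen so that the two columns sit on very different scales.

```python
import numpy as np

# Hypothetical values: column 0 in tens of thousands, column 1 in tens
X_toy = np.array([
    [0.0, 10.0],
    [50_000.0, 20.0],
    [100_000.0, 30.0],
])

means = X_toy.mean(axis=0)
stds = X_toy.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

# z-score: how many standard deviations each value lies from its column mean
Z = (X_toy - means) / stds
print(Z)
```

After the transformation both columns have mean 0 and standard deviation 1, so neither dominates a distance calculation merely because of its units.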

In [12]:
X_summary = X.aggregate(['mean','std']).astype(int)

print("X_summary type:", type(X_summary))
print("X_summary shape:", X_summary.shape)
X_summary

X_summary type: <class 'pandas.core.frame.DataFrame'>
X_summary shape: (2, 5)


| | DEBT | NETWORTH | HOUSES | NFIN | ASSET |
|---|---|---|---|---|---|
| mean | 64173 | 90095 | 72489 | 116577 | 154268 |
| std | 116351 | 238209 | 149970 | 232588 | 305463 |


Create a StandardScaler transformer, use it to transform the data in X, and then put the transformed data into a DataFrame named X_scaled.

In [13]:
# Instantiate transformer
ss = StandardScaler()

# Transform `X`
X_scaled_data = ss.fit_transform(X)

# Put `X_scaled_data` into DataFrame
X_scaled =pd.DataFrame(X_scaled_data,columns=X.columns)

print("X_scaled type:", type(X_scaled))
print("X_scaled shape:", X_scaled.shape)
X_scaled.head()

X_scaled type: <class 'pandas.core.frame.DataFrame'>
X_scaled shape: (5532, 5)


| | DEBT | NETWORTH | HOUSES | NFIN | ASSET |
|---|---|---|---|---|---|
| 0 | -0.551597 | -0.378212 | -0.483402 | -0.501265 | -0.505044 |
| 1 | -0.551597 | -0.378212 | -0.483402 | -0.501265 | -0.505044 |
| 2 | -0.551597 | -0.378233 | -0.483402 | -0.501265 | -0.505061 |
| 3 | -0.551597 | -0.378212 | -0.483402 | -0.501265 | -0.505044 |
| 4 | -0.551597 | -0.378212 | -0.483402 | -0.501265 | -0.505044 |


Create a DataFrame `X_scaled_summary` with the mean and standard deviation for all the features in `X_scaled`.

In [14]:
X_scaled_summary = X_scaled.aggregate(['mean','std']).astype(int)

print("X_scaled_summary type:", type(X_scaled_summary))
print("X_scaled_summary shape:", X_scaled_summary.shape)
X_scaled_summary

X_scaled_summary type: <class 'pandas.core.frame.DataFrame'>
X_scaled_summary shape: (2, 5)


| | DEBT | NETWORTH | HOUSES | NFIN | ASSET |
|---|---|---|---|---|---|
| mean | 0 | 0 | 0 | 0 | 0 |
| std | 1 | 1 | 1 | 1 | 1 |


Utilize a for loop to construct and train a K-Means model, with the number of clusters (n_clusters) ranging from 2 to 10 (inclusive). Ensure that your model incorporates a StandardScaler. Upon training each model, calculate the **inertia** and append it to the list 'inertia_errors'. Also, compute the **silhouette score** and add it to the list 'silhouette_scores'.
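
For intuition about the first metric, inertia is the sum of squared distances from each point to the centroid of its assigned cluster. The sketch below computes it by hand on hypothetical points, labels, and centroids (all invented for illustration) rather than on the SCF data.

```python
import numpy as np

# Hypothetical 2-D points already assigned to two clusters
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels_toy = np.array([0, 0, 1, 1])

# Each centroid is the mean of the points assigned to that cluster
centroids = np.array([points[labels_toy == c].mean(axis=0) for c in (0, 1)])

# Inertia: total squared distance between each point and its own centroid
squared_dists = ((points - centroids[labels_toy]) ** 2).sum(axis=1)
inertia_toy = squared_dists.sum()
print("inertia:", inertia_toy)
```

Lower inertia means tighter clusters, but it always falls as clusters are added, which is why it is read through its rate of change rather than its minimum.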

In [15]:
n_clusters = range(2,11)
inertia_errors = []
silhouette_scores = []

# Add `for` loop to train model and calculate inertia, silhouette score.
for k in n_clusters:
    model=make_pipeline(StandardScaler(),KMeans(n_clusters=k, random_state=42) )
    model.fit(X)
    inertia_errors.append(model.named_steps["kmeans"].inertia_)
    silhouette_scores.append(silhouette_score(X,model.named_steps["kmeans"].labels_))


print("inertia_errors type:", type(inertia_errors))
print("inertia_errors len:", len(inertia_errors))
print("Inertia:", inertia_errors)
print()
print("silhouette_scores type:", type(silhouette_scores))
print("silhouette_scores len:", len(silhouette_scores))
print("Silhouette Scores:", silhouette_scores)

inertia_errors type: <class 'list'>
inertia_errors len: 9
Inertia: [12400.167986957067, 8273.6385675099, 6666.8981991436085, 5645.695575238505, 4967.67133351475, 4346.861015798331, 3990.3946616441663, 3673.495001686319, 3403.5702769382847]

silhouette_scores type: <class 'list'>
silhouette_scores len: 9
Silhouette Scores: [0.7947210807540656, 0.6968437837402767, 0.6476482058529203, 0.6427009916549471, 0.6201749712842529, 0.5868880896536391, 0.5825260775120301, 0.5947229871529234, 0.5708304521218499]
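
Note that the loop above passes the unscaled `X` to `silhouette_score`, while K-Means itself ran on standardized features. One possible alternative, sketched here on synthetic data, scores the labels in the same scaled space in which the clusters were found. The array `X_demo` is made up for illustration, and this is an assumption about what may be preferable, not a correction of the scores reported above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two well-separated blobs, with the second feature
# deliberately placed on a much larger scale than the first
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0, 1, size=(50, 2)),
    rng.normal(10, 1, size=(50, 2)),
])
X_demo[:, 1] *= 1000

demo_model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=42),
)
demo_model.fit(X_demo)
demo_labels = demo_model.named_steps["kmeans"].labels_

# Score in the scaled feature space that K-Means actually saw
X_demo_scaled = demo_model.named_steps["standardscaler"].transform(X_demo)
score_scaled = silhouette_score(X_demo_scaled, demo_labels)
print("silhouette (scaled space):", score_scaled)
```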


---------------------------------------------------------------------------------------------------------------
Create a line plot of `inertia_errors` vs `n_clusters`.

In [16]:
# Create line plot of `inertia_errors` vs `n_clusters`
fig = px.line(
    x=n_clusters,
    y=inertia_errors,
    title="K-Means Model: Inertia vs Number of Clusters"
    
)
fig.update_layout(xaxis_title="Number of Clusters",yaxis_title="Inertia")

fig.show()

You can see that the line starts to flatten out around 3 or 4 clusters.
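
One rough way to put numbers on the "elbow" is to look at the relative drop in inertia each time a cluster is added. The sketch below uses the inertia values printed earlier, rounded to two decimals; the choice of this particular diagnostic is an assumption for illustration, not a standard rule.

```python
# Inertia values reported earlier for k = 2..10, rounded to two decimals
ks = list(range(2, 11))
inertia = [12400.17, 8273.64, 6666.90, 5645.70, 4967.67,
           4346.86, 3990.39, 3673.50, 3403.57]

# Relative drop in inertia gained by moving from k-1 to k clusters
relative_drops = [
    (prev - curr) / prev
    for prev, curr in zip(inertia, inertia[1:])
]

for k, drop in zip(ks[1:], relative_drops):
    print(f"k={k}: inertia falls by {drop:.1%}")
```

The largest gain comes from moving to 3 clusters, and the improvements shrink after that, which matches the visual flattening of the curve.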

In [17]:
# Create a line plot of `silhouette_scores` vs `n_clusters`
fig = px.line(
    x=n_clusters,
    y=silhouette_scores,
    title="K-Means Model: Silhouette Score vs Number of Clusters"
    
)
fig.update_layout(xaxis_title="Number of Clusters",yaxis_title="Silhouette Score")

fig.show()

This one's a little less straightforward. The silhouette score is highest at 2 clusters (about 0.79) and falls as clusters are added, but the decline nearly stalls between 4 and 5 clusters (about 0.65 and 0.64).

Putting the information from this plot together with our inertia plot, which flattens around 3 or 4 clusters, we'll set `n_clusters` to 4.

**Build and train the final k-means model, named `model` in the code. Use the information you gained from the two plots above to set an appropriate value for the `n_clusters` argument.**

In [18]:
# Build model
model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4,random_state=42)
)

# Fit model to data
model.fit(X)


**Extract the labels from the final model**

In [19]:
labels = model.named_steps["kmeans"].labels_

print("labels type:", type(labels))
print("labels len:", len(labels))
print(labels[:5])

labels type: <class 'numpy.ndarray'>
labels len: 5532
[0 0 0 0 0]


Create a DataFrame `x_grp_by_label` that contains the mean values of the features in `X` for each of the clusters in the final model.
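
Grouping a DataFrame by an external array of labels works because pandas aligns the array with the rows by position. A minimal sketch on hypothetical data (the frame `toy_X` and the labels `toy_labels` are invented for this example):

```python
import numpy as np
import pandas as pd

# Hypothetical feature values and cluster assignments, one label per row
toy_X = pd.DataFrame({"DEBT": [0.0, 10.0, 100.0, 200.0]})
toy_labels = np.array([0, 0, 1, 1])

# Rows are grouped by the label at the same position, then averaged
toy_means = toy_X.groupby(toy_labels).mean()
print(toy_means)
```

The array must have exactly one entry per row of the DataFrame, which holds for the labels of a model fitted on that same DataFrame.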

In [20]:
x_grp_by_label = X.groupby(labels).mean()

print("xgb type:", type(x_grp_by_label))
print("xgb shape:", x_grp_by_label.shape)
x_grp_by_label

xgb type: <class 'pandas.core.frame.DataFrame'>
xgb shape: (4, 5)


| | DEBT | NETWORTH | HOUSES | NFIN | ASSET |
|---|---|---|---|---|---|
| 0 | 21130.616782 | 15637.72 | 10866.483385 | 24800.16 | 36768.34 |
| 1 | 378104.1 | 1384214.0 | 529090.0 | 1270946.0 | 1762318.0 |
| 2 | 141453.806519 | 157552.0 | 180731.861199 | 242917.0 | 299005.8 |
| 3 | 316386.483221 | 485712.2 | 438834.899329 | 614297.4 | 802098.7 |


Create a side-by-side bar chart from `x_grp_by_label` that shows the mean of the features in `X` for each of the clusters.

In [21]:
# Create side-by-side bar chart of `x_grp_by_label`
fig = px.bar(
    x_grp_by_label,
    barmode="group"
)
fig.update_layout(xaxis_title="Clusters", yaxis_title="Value [$]")

fig.show()

# Principal Component Analysis (PCA)
Create a PCA transformer, use it to reduce the dimensionality of the data in X to 2, and then put the transformed data into a DataFrame named X_pca. The columns of X_pca should be named "PC1" and "PC2".
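
To make the transformation concrete, here is a rough NumPy-only sketch of what a two-component PCA does: center the data, take a singular value decomposition, and project onto the first two right singular vectors. The matrix `X_toy` is synthetic, and this illustrates the general technique rather than scikit-learn's exact implementation (component signs, for example, may differ).

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: 100 rows, 5 features driven by 2 underlying factors plus small noise
base = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 5))
X_toy = base @ mixing + 0.01 * rng.normal(size=(100, 5))

# Step 1: center each feature at zero
X_centered = X_toy - X_toy.mean(axis=0)

# Step 2: SVD; the rows of Vt are the orthogonal principal directions
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:2]

# Step 3: project onto the first two directions to get PC1 and PC2
X_2d = X_centered @ components.T

# Share of total variance captured by each component
explained = S**2 / (S**2).sum()
print("shape:", X_2d.shape)
print("explained variance ratio (first two):", explained[:2])
```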

In [22]:
# Instantiate transformer
pca = PCA(n_components=2,random_state=42)

# Transform `X`
X_t = pca.fit_transform(X)

# Put `X_t` into DataFrame
X_pca = pd.DataFrame(X_t,columns=["PC1","PC2"])

print("X_pca type:", type(X_pca))
print("X_pca shape:", X_pca.shape)
X_pca.head()

X_pca type: <class 'pandas.core.frame.DataFrame'>
X_pca shape: (5532, 2)


| | PC1 | PC2 |
|---|---|---|
| 0 | -231786.601809 | -32585.593116 |
| 1 | -231786.601809 | -32585.593116 |
| 2 | -231792.296779 | -32582.107643 |
| 3 | -231786.601809 | -32585.593116 |
| 4 | -231786.601809 | -32585.593116 |


In [23]:
labels=labels.astype(str)

In [24]:
fig = px.scatter(
    data_frame=X_pca,
    x="PC1",
    y="PC2",
    color=labels,
    title="PCA Representation of Clusters"
)
fig.update_layout(xaxis_title="PC1",yaxis_title="PC2")
fig.show()