# Customer Segmentation in the US: Part 1
## Exploring the Data
___
This notebook uses data from the [Survey of Consumer Finances](https://www.federalreserve.gov/econres/scfindex.htm) (SCF), a survey sponsored by the US Federal Reserve. It tracks financial, demographic, and opinion information about families in the US. The survey is conducted every three years; an extract of the 2022 results is used here.

To get the details about the features in the dataset, use the 
[Code Book](https://sda.berkeley.edu/sdaweb/docs/scfcomb2022/DOC/hcbk.htm).


### Import
___

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv("SCFP2022.csv")
df.shape

(22975, 357)

As shown above, the dataset is large, with 22,975 observations and 357 features.

For the time being, we focus on households that have "been turned down for credit or feared being denied credit in the past 5 years", which can be identified through the "TURNFEAR" column.
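
Before filtering, it can help to check how the households split between the two groups. The sketch below uses a small made-up table (`toy` and its values are placeholders, not SCF data) and assumes that `TURNFEAR` is coded as 0 or 1, as the filter in this notebook implies.

```python
import pandas as pd

# Toy stand-in for the SCF extract (values are made up for illustration).
toy = pd.DataFrame({
    "TURNFEAR": [0, 1, 0, 0, 1, 0, 0, 1],
    "AGECL": [1, 2, 3, 1, 1, 5, 6, 2],
})

# Share of rows in each group (0 = not credit fearful, 1 = credit fearful).
group_share = toy["TURNFEAR"].value_counts(normalize=True).sort_index()
print(group_share)

# A boolean mask keeps only the credit-fearful rows.
toy_credit_fear = toy[toy["TURNFEAR"] == 1]
print("rows kept:", len(toy_credit_fear))
```

If one group is very small, any later comparison of proportions rests on few observations, so this check is worth running on the real data first.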

In [None]:
df_credit_fear = df[df["TURNFEAR"] ==1]
print("df_credit_fear shape:", df_credit_fear.shape)
print("df_credit_fear type:", type(df_credit_fear))
df_credit_fear.head()

### Explore
___
Among different features that can be used, let's examine demographic features such as age, race, education, and income. Finally, their relationship to assets and debt will be investigated.

Let's start with ***age***. One of the features of the dataset is the age group, `AGECL`. What values does this column contain?

In [None]:
df_credit_fear["AGECL"].unique()

According to the Code Book, this column represents categorical data; however, the values in the column are numeric.

![image.png](attachment:c1ea828d-a6ad-4565-8284-245ab55a687d.png)

These numeric values should be replaced with the actual labels. A bar chart is then plotted to show the distribution of the age groups.

In [None]:
age_dict = {
    1: "Under 35",
    2: "35 - 44",
    3: "45 - 54",
    4: "55 - 64",
    5: "65 - 74",
    6: "75 or Above"
}

# Use the credit-fearful subset so the counts match the "Credit Fearful" chart below
age = df_credit_fear["AGECL"].replace(age_dict)
age.head()

In [None]:
# Count households per age class within each TURNFEAR group
counts = df["AGECL"].groupby(df["TURNFEAR"]).value_counts().rename("count")
age_df = counts.to_frame().reset_index().set_index("AGECL")
age_df.sort_index()

In [None]:
import plotly.express as px
age_df = df["AGECL"].groupby(df["TURNFEAR"]).value_counts(normalize= True).rename("count").to_frame().reset_index()
fig = px.bar(
    x= age_df["AGECL"], y= age_df["count"], color= age_df["TURNFEAR"].astype(str), barmode= "group"
)
legends = {"0": "False", "1": "True"}
fig.for_each_trace(lambda t: t.update(name = legends[t.name],
                                      legendgroup = legends[t.name],
                                      hovertemplate = t.hovertemplate.replace(t.name, legends[t.name])
                                     )
                  )
fig.update_layout(xaxis = {"tickvals": list(age_dict.keys()), "ticktext": list(age_dict.values()), "title": "Age"},
                  yaxis_title= "Relative frequency",
                  legend_title= "Credit Fearful")
fig.show()

In [None]:
age.value_counts()[list(val for val in age_dict.values())]

In [None]:
age_counts= age.value_counts()[list(val for val in age_dict.values())]
age_counts.plot(kind= "bar", xlabel= "Age Classes", ylabel= "Count",
                        title= "Distribution of Age Groups: Credit Fearful",
                        );

The bar plot indicates that most households that fear being denied credit are in the age groups younger than 55.
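
One way to keep the age groups in their natural order without reindexing by hand is an ordered categorical. This is a minimal sketch on made-up codes; the names `age_labels`, `codes`, and `age_groups` are placeholders introduced for illustration.

```python
import pandas as pd

# Same code-to-label mapping as in the notebook.
age_labels = {
    1: "Under 35", 2: "35 - 44", 3: "45 - 54",
    4: "55 - 64", 5: "65 - 74", 6: "75 or Above",
}

# Made-up AGECL codes standing in for the real column.
codes = pd.Series([1, 1, 2, 3, 3, 3, 4, 5, 6])

# Map codes to labels and declare the category order explicitly.
age_groups = pd.Series(
    pd.Categorical(
        codes.map(age_labels),
        categories=list(age_labels.values()),
        ordered=True,
    )
)

# sort_index() on a categorical index follows the declared category order.
age_group_counts = age_groups.value_counts().sort_index()
print(age_group_counts)
```

Because the order lives in the data type, any later `value_counts()` or plot can reuse it instead of repeating the list of labels.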

In [None]:
df_credit_fear["AGE"].hist(bins= 10)
plt.xlabel("Age")
plt.ylabel("Count")
plt.title("Age Distribution: Credit Fearful");

The histogram above shows that the most concerned people seem to be between 30 and 45, which supports the bar chart.
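
A visual reading like this can be backed with a number. The sketch below computes the share of respondents in an age range on made-up ages (`ages` is a placeholder, not the SCF `AGE` column); the same two expressions could be applied to `df_credit_fear["AGE"]`.

```python
import pandas as pd

# Made-up ages standing in for the credit-fearful AGE column.
ages = pd.Series([24, 31, 33, 38, 41, 44, 47, 52, 58, 63])

# between() is inclusive on both ends by default; the mean of a boolean
# Series is the fraction of True values.
share_30_to_45 = ages.between(30, 45).mean()
share_under_55 = (ages < 55).mean()

print("share aged 30 to 45:", share_30_to_45)
print("share under 55:", share_under_55)
```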

What about ***race***?

Looking up "RACE" in the Code Book:

![image.png](attachment:915132cc-892c-478c-9987-7ed405ac9fad.png)

In the Code Book, 4 is not assigned to a specific race, so let's label it "Asian American / Pacific Islander".
This column holds numeric codes, so they are replaced with more readable labels.
A horizontal bar chart then shows the frequency of each race.

In [None]:
df_credit_fear["RACE"].value_counts(normalize= True)

In [None]:
race_dict = {
    1: "white non-Hispanic",
    2: "black / African American",
    3: "Hispanic",
    4: "Asian American / Pacific Islander",
    5: "Other"
}

race_cat = df_credit_fear["RACE"].replace(race_dict)
race_cat.value_counts(normalize= True).plot(
    kind= "barh",
    xlabel= "Frequency",
    ylabel= "Race",
    title= "Racial Groups: Credit Fearful"
)
plt.xlim((0, 1));

In [None]:
race_df = df["RACE"].groupby(df["TURNFEAR"]).value_counts(normalize= True).rename("freq").to_frame().reset_index()
fig = px.bar(
    x= race_df["RACE"], y= race_df["freq"], color= race_df["TURNFEAR"].astype(str), barmode= "group",
    title= "Distribution of Race"
)

# Show "False"/"True" in the legend instead of the raw 0/1 codes
legends = {"0": "False", "1": "True"}
fig.for_each_trace(lambda t: t.update(name = legends[t.name],
                                      legendgroup = legends[t.name],
                                      hovertemplate = t.hovertemplate.replace(t.name, legends[t.name])
                                     )
                  )
fig.update_layout(xaxis = {"tickvals": list(race_dict.keys()), "ticktext": list(race_dict.values()), "title": "Race",
                           "tickangle": 30},
                  yaxis_title= "Relative frequency",
                  legend_title= "Credit Fearful")

fig.show()

This suggests that White non-Hispanic and Black / African American households account for most of the credit-fearful group, but is that also the case in the overall population?

In [None]:
df["RACE"].replace(race_dict).value_counts(normalize= True)

In [None]:
df["RACE"].replace(race_dict).value_counts(normalize= True).plot(
    kind= "barh",
    xlabel= "Frequency",
    ylabel= "Race",
    title= "Racial Groups: Population"
)
plt.xlim((0, 1));

In [None]:
race_all = df["RACE"].replace(race_dict)
race_fear = df_credit_fear["RACE"].replace(race_dict)
race_data = race_all.value_counts(normalize=True)
race_credit_fear = race_fear.value_counts(normalize=True)
# Share of each race's households that are credit fearful
credit_fear_per_race = race_fear.value_counts() / race_all.value_counts()

df_credit_fear_per_race = pd.DataFrame(
    {"race_from_overall": race_data, "race_credit_fear": race_credit_fear, "credit_fear_per_race": credit_fear_per_race}
)
df_credit_fear_per_race

In [None]:
df_credit_fear_per_race["credit_fear_per_race"].plot.barh()
plt.xlabel("Proportion")
plt.ylabel("Race")
plt.title("Proportions of Credit Fearfuls per Race");

From the population bar chart, White non-Hispanic households account for about 62% of the whole dataset but only about 39% of the credit-fearful group. On the other hand, Black and Hispanic households represent 28% of the dataset but 53% of the credit-fearful group. This shows that Black and Hispanic households are more likely to be in the credit-fearful group.
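
The per-race share of credit-fearful households can also be computed directly as a group mean, since the mean of a 0/1 column is the proportion of ones. This is a minimal sketch on made-up rows (`toy` is a placeholder table), assuming `TURNFEAR` is coded 0/1.

```python
import pandas as pd

# Made-up rows standing in for the SCF extract.
toy = pd.DataFrame({
    "RACE":     [1, 1, 1, 1, 2, 2, 3, 3, 5, 5],
    "TURNFEAR": [0, 0, 0, 1, 1, 0, 1, 1, 0, 0],
})

# Mean of a 0/1 indicator per group = share of credit-fearful rows in that group.
fear_rate_by_race = toy.groupby("RACE")["TURNFEAR"].mean()
print(fear_rate_by_race)

# The ratio-of-counts approach gives the same numbers where both counts exist,
# but yields NaN for a group with no credit-fearful rows (RACE == 5 here).
ratio = toy.loc[toy["TURNFEAR"] == 1, "RACE"].value_counts() / toy["RACE"].value_counts()
print(ratio.sort_index())
```

The group mean reports 0.0 for a group with no credit-fearful rows, where the ratio of counts produces a missing value.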

***Income*** is another measure. Do people with lower incomes worry about being denied credit, or is that something people with more money worry about?

Using a single side-by-side bar chart, the credit-fearful and non-credit-fearful groups are compared across the income percentile groups given by the feature "INCCAT".

![image.png](attachment:183e3a9d-d18d-456f-8c79-704a70c3999f.png)

First, map the numeric codes to their labels. Then create a dataframe of income categories and their normalized frequencies for credit-fearful and non-credit-fearful households. Finally, plot the result as a grouped bar chart.

In [None]:
income_cat = {
    1: "0 - 20",
    2: "20 - 39.9",
    3: "40 - 59.9",
    4: "60 - 79.9",
    5: "80 - 89.9",
    6: "90 - 100"
}

inc_df = (
    df["INCCAT"]
    .groupby(df["TURNFEAR"])
    .value_counts(normalize= True)
    .rename("freq")
    .to_frame()
    .reset_index()
)

inc_df

In [None]:
fig = px.bar(
    x= inc_df["INCCAT"], y= inc_df["freq"], color= inc_df["TURNFEAR"].astype(str), barmode= "group",
    title= "Distribution of Income"
)

legends = {"0": "False", "1": "True"}
fig.for_each_trace(lambda t: t.update(name = legends[t.name],
                                      legendgroup = legends[t.name],
                                      hovertemplate = t.hovertemplate.replace(t.name, legends[t.name])
                                     )
                  )
fig.update_layout(
    xaxis = {"tickvals": list(income_cat.keys()), "ticktext": list(income_cat.values()), "title": "Income"},
                  yaxis_title= "Relative frequency",
                  legend_title= "Credit Fearful")

fig.show()

In [None]:
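sns.barplot(
    x = "INCCAT",
    y = "freq",
    hue = "TURNFEAR",
    # inc_df stores numeric INCCAT codes, so map them to labels before plotting
    data= inc_df.assign(INCCAT= inc_df["INCCAT"].map(income_cat)),
    order= list(income_cat.values())
)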
plt.xlabel("Income")
plt.ylabel("Frequency")
plt.title("Income Distribution: Credit Fearful vs. Non-Fearful");

From the above comparison, credit-fearful households are much more common in the lower income categories. In other words, credit-fearful households tend to have lower incomes.
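
The same comparison can be read as a table. The sketch below uses `pd.crosstab` with `normalize="index"` so that each row (one `TURNFEAR` group) sums to 1; the rows in `toy` are made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for the SCF extract.
toy = pd.DataFrame({
    "TURNFEAR": [0, 0, 0, 0, 1, 1, 1, 1],
    "INCCAT":   [1, 3, 5, 6, 1, 1, 2, 3],
})

# Rows: TURNFEAR group; columns: income category; values: share within the group.
income_share = pd.crosstab(toy["TURNFEAR"], toy["INCCAT"], normalize="index")
print(income_share.round(2))
```

Normalizing within each group matters here because the two groups differ in size, so raw counts would not be comparable.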

Based on all this, credit-fearful households are most often young and in the lower income categories.

***Assets***

Let's examine the correlation between assets and the value of houses, first for the whole dataset and then for the credit-fearful group.

In [None]:
asset_house_corr = df["ASSET"].corr(df["HOUSES"])
print("Dataset: asset-houses correlation:", round(asset_house_corr, 3))

In [None]:
asset_house_corr_fear = df_credit_fear["ASSET"].corr(df_credit_fear["HOUSES"])
print("Credit fearful: asset-house correlation:", round(asset_house_corr_fear, 3))

The whole dataset shows a moderate positive correlation, which suggests that the value of houses makes up a large part of household assets.

For the credit-fearful group, on the other hand, the correlation is weaker, which suggests that the value of houses contributes less to household assets.
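
A correlation only says that two columns move together, so a more direct check of how much houses contribute is the ratio of house value to total assets. This is a hedged sketch on made-up numbers; it assumes that `HOUSES` is one component of `ASSET`, which should be confirmed in the Code Book before applying it to the real data.

```python
import pandas as pd

# Made-up households (values are illustrative only).
toy = pd.DataFrame({
    "ASSET":  [100_000, 250_000, 400_000, 50_000, 0],
    "HOUSES": [60_000, 200_000, 100_000, 0, 0],
})

# Drop households with no assets to avoid dividing by zero.
has_assets = toy[toy["ASSET"] > 0]

# Share of each household's assets held as house value.
house_share = has_assets["HOUSES"] / has_assets["ASSET"]
median_house_share = house_share.median()
print("median share of assets in houses:", median_house_share)
```

Computing this median separately for the whole dataset and for the credit-fearful group would show whether the weaker correlation really goes with a smaller housing share.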

Let's build correlation matrices for both the whole dataset and the credit-fearful group using 5 features: assets, value of houses, income, debt, and education level.

In [None]:
features= ["ASSET", "HOUSES", "INCOME", "DEBT", "EDUC"]
corr = df[features].corr()
corr.style.background_gradient(axis= None)

In [None]:
corr = df_credit_fear[features].corr()
corr.style.background_gradient(axis= None)

Now we can see that the correlation between assets and income increases from 0.70 to 0.89, which indicates that for the credit-fearful group, assets track income much more closely. A similar trend appears for debt and houses: for the credit-fearful group, the value of houses has a stronger correlation with debt than in the overall dataset.

Hence, the assets of credit-fearful households appear to be tied to their income rather than to the value of their houses.
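
Comparing two correlation matrices by eye is error-prone, so one option is to subtract them. The sketch below does this on random placeholder data (generated with a fixed seed, unrelated to the SCF values), so only the mechanics carry over, not the numbers.

```python
import numpy as np
import pandas as pd

# Random placeholder data; the real notebook would use df and df_credit_fear.
rng = np.random.default_rng(0)
features = ["ASSET", "HOUSES", "INCOME"]
toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=features)
toy["TURNFEAR"] = rng.integers(0, 2, size=200)

corr_all = toy[features].corr()
corr_fear = toy.loc[toy["TURNFEAR"] == 1, features].corr()

# Positive entries: the pair is more strongly positively correlated in the subgroup.
corr_diff = corr_fear - corr_all
print(corr_diff.round(2))
```

The diagonal of the difference is always zero, so only the off-diagonal entries carry information.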

What about ***education***?

Let's compare credit fearful and non-credit fearful groups for education levels.

![image.png](attachment:a8db5483-53ba-4f1a-a68d-0e4ddea61d65.png)

First a dataframe containing the relative frequency of education levels grouped by credit fearful and non-credit fearful groups is created. Then a side-by-side bar chart is used to compare the two groups. 

In [None]:
educ_levels_freq = (
    df["EDUC"]
    .groupby(df["TURNFEAR"])
    .value_counts(normalize= True)
    .rename("frequency")
    .to_frame()
    .reset_index()
)

educ_levels_freq

In [None]:
sns.barplot(
    x = "EDUC",
    y = "frequency",
    hue = "TURNFEAR",
    data = educ_levels_freq
)
plt.xlabel("Education Levels")
plt.ylabel("Frequency")
plt.title("Educational Attainment: Credit Fearful vs. Non-Credit Fearful");

Most of the credit-fearful respondents are high school graduates. On the other hand, university degrees are more common among the non-credit fearful.
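
To read the gap between the two groups level by level, the long frequency table can be pivoted into a wide one. This is a minimal sketch; the `EDUC` codes and frequencies in `long_freq` are made-up placeholders with the same column layout as `educ_levels_freq`.

```python
import pandas as pd

# Made-up long-format frequencies (one row per group and education code).
long_freq = pd.DataFrame({
    "TURNFEAR": [0, 0, 0, 1, 1, 1],
    "EDUC": [8, 9, 12, 8, 9, 12],
    "frequency": [0.2, 0.3, 0.5, 0.5, 0.3, 0.2],
})

# One row per education code, one column per TURNFEAR group.
wide = long_freq.pivot(index="EDUC", columns="TURNFEAR", values="frequency")

# Positive values: the level is more common among the credit fearful.
wide["difference"] = wide[1] - wide[0]
print(wide)
```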
