# IST256 Project Deliverable 3 (P3)

## Phase 3: Data Story / Coding for Explanation

In this step, you submit the final version of your working code. You should be implementing the data story that you discussed in P2 (2.3.1). 

All code necessary to make the project run should be included in this notebook. This includes all imports, functions, setup code, and your interact. There should be no code that causes errors and no exploratory code here.

The expectation is that your instructor can open this notebook, run all cells, and then use your program.

The code you write should be clear, easy to understand and use the affordances learned in the course.

No changes to your code will be considered after this submission. It is important to take prior instructor feedback into consideration, as it factors into your evaluation.


### Step 1: Summarize Enhancements and Changes

If there were any enhancements or changes to your P3 from your P2 (including those you suggested), please explain them here. For example, you might have geocoded your dataset or extracted entities from the text.


In P2, I planned a few enhancements to make my data story stronger and more impactful, and I implemented them in P3 while also taking my TA's feedback into account. First, I converted the existing region codes, which were just numeric codes such as 1, 2, 3, 4, into a new variable called region_label_simple with readable region names like “Seoul,” “Kyeong-gi,” and “Jeolla & Jeju.” Then I compared both average income and average income per person across these regions using bar charts and a folium map. I also added average income over time by region, which suggests that income differences are related not only to personal factors but also to where people live.

I also extended my occupation analysis from P2 by comparing the gender composition within the Top 10 and Bottom 10 occupations by average income. This shows visually whether high-paying jobs are more male-dominated while lower-paying jobs have a slightly higher share of women.
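
A minimal sketch of the share calculation behind this comparison, assuming a table with one row per person. The `toy` data and its occupation names are made up for illustration and do not come from the dataset; `pd.crosstab` with `normalize="index"` is one possible way to get the percentages, not necessarily the approach used in the project code.

```python
import pandas as pd

# Hypothetical toy data: four managers and two cleaners (not from the real dataset).
toy = pd.DataFrame({
    "occupation_label": ["Manager", "Manager", "Manager", "Manager", "Cleaner", "Cleaner"],
    "gender_label_simple": ["Male", "Male", "Male", "Female", "Female", "Male"],
})

# normalize="index" divides each row by its row total, so each occupation sums to 1.
gender_share = pd.crosstab(
    toy["occupation_label"], toy["gender_label_simple"], normalize="index"
) * 100

print(gender_share.round(1))
```

Each row of `gender_share` adds up to 100, so occupations with very different numbers of people can still be compared side by side.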

I created a new variable called “income_per_person” by dividing income by the number of family members in the household. I used it in several parts of the analysis, especially when comparing age groups, regions, and non-working groups. I took this approach because people with the same income can have very different living situations depending on how many family members they support, so single-person households and large families should not be treated the same.
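
As a rough illustration of this idea, the sketch below computes income per person on a made-up three-row table. The numbers are invented, and the vectorized division is only one way to write it; a family size of 0 is treated as missing so the result is `NaN` rather than infinity.

```python
import numpy as np
import pandas as pd

# Hypothetical toy households: same income, different family sizes, plus one bad record.
toy = pd.DataFrame({
    "income": [6000.0, 6000.0, 3000.0],
    "family_member": [1, 4, 0],
})

# Treat a family size of 0 as missing so the division never produces inf.
family_size = toy["family_member"].replace(0, np.nan)
toy["income_per_person"] = toy["income"] / family_size

print(toy)
```

The first two rows have the same income, but the single-person household ends up with four times the income per person of the four-person household.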

I reorganized the detailed reason_none_worker values into a broader variable called reason_group by grouping similar reasons together: “house worker,” “caring kids at home,” and “nursing” became “Care and housework,” and “no capable” and “others” became “Health issue / other.” I then focused only on non-workers and compared both the counts and the average income per person across these groups. This makes it easier to see and discuss the different types of non-working reasons.
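
One way to sketch this grouping is a lookup dictionary applied with `Series.map`. The names `group_members`, `reason_to_group`, and `toy_reasons` are hypothetical, and only two of the broad groups are shown to keep the example short.

```python
import pandas as pd

# Broad group -> detailed reasons, using the labels described above.
group_members = {
    "Care and housework": ["house worker", "caring kids at home", "nursing"],
    "Health issue / other": ["no capable", "others"],
}

# Invert into a detailed reason -> broad group lookup.
reason_to_group = {
    reason: group
    for group, reasons in group_members.items()
    for reason in reasons
}

toy_reasons = pd.Series(["nursing", "others", "house worker", "Not a non-worker"])

# Labels without an entry become NaN and are filled with a fallback group.
toy_groups = toy_reasons.map(reason_to_group).fillna("Not in non-worker group")
print(toy_groups.tolist())
```

Keeping the grouping in one dictionary means a reason can be moved to a different group by editing a single line.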

In addition, I simplified several other variables by turning them into readable labels (gender_label_simple, education_label_simple, and age_group_simple), and, based on the feedback I got, I wrote functions that calculate average income by category to support my analysis. I used these functions together with interact and interact_manual so that the user can easily switch between gender, education, age group, region, and metric type. All of these changes turn my exploration of enhancements from P2 into a clearer, interactive data story in P3.
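
For reference, an average-by-category helper of this kind can also be sketched with `groupby`. The function name `avg_income_by` and the toy numbers are assumptions made for this example; it is an alternative formulation, not the function used in the project code.

```python
import pandas as pd

def avg_income_by(data, category_column, income_column="income"):
    """Return the mean of income_column for each value of category_column."""
    return (
        data.groupby(category_column)[income_column]
            .mean()
            .reset_index(name="average_income")
    )

# Hypothetical toy data (not from the real dataset).
toy = pd.DataFrame({
    "gender_label_simple": ["Male", "Female", "Male", "Female"],
    "income": [4000, 3000, 6000, 1000],
})

result = avg_income_by(toy, "gender_label_simple")
print(result)
```

Because the category column is a parameter, the same call works for education, age group, or region, which is what makes it easy to plug into a dropdown.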

### Step 2: Project Code

Include all project code below. This includes code that enhances the original dataset. Make sure to execute your code to ensure it runs properly before you turn it in. 

Add as many cells as you need here.


In [None]:
%pip install openpyxl

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from ipywidgets import interact

In [None]:
sns.set() # added seaborn setting

In [None]:
df = pd.read_csv("Korea_income_and_Welfare.csv")
job_codes = pd.read_excel("job_code_translated.xlsx")

print("Number of rows:", len(df))
print("Columns:", list(df.columns))

In [None]:
print("\nData types:")
print(df.dtypes)

print("\nFirst 5 rows:")
print(df.head())

print("\nMissing values by column:")
print(df.isna().sum())

In [None]:
df.head(10)

In [None]:
# I cleaned the employment-related columns by stripping whitespace from the strings and replacing blank values with clear "No ... info" labels.
# This keeps the dataset consistent so my analysis does not break because of empty strings or missing codes.
# After that, I merged the job code table so occupation codes become readable occupation titles.

In [None]:
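for col in ["occupation", "company_size", "reason_none_worker"]:
    df[col] = df[col].astype(str).str.strip()
    # astype(str) turns real missing values into the text "nan", so treat that as missing too
    df.loc[df[col].isin(["", "nan"]), col] = np.nan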

print("\nMissing values after cleaning strings:")
print(df[["occupation", "company_size", "reason_none_worker"]].isna().sum())

# created clean version with explicit labels for missing info.

df["occupation_clean"] = df["occupation"].fillna("No occupation info")
df["company_size_clean"] = df["company_size"].fillna("No company size info")
df["reason_none_worker_clean"] = df["reason_none_worker"].fillna("No reason / not a non-worker")

print("\nHead of cleaned employment-related columns:")
print(df[["occupation_clean", "company_size_clean", "reason_none_worker_clean"]].head(10))

In [None]:
job_codes["job_code"] = job_codes["job_code"].astype(str).str.strip()
job_codes["job_title"] = job_codes["job_title"].astype(str).str.strip()

df["occupation_clean"] = df["occupation_clean"].astype(str).str.strip()

df = pd.merge(df, job_codes[["job_code","job_title"]], left_on="occupation_clean", right_on="job_code", how="left")
df = df.rename(columns={"job_title":"occupation_label"})
df["occupation_label"] = df["occupation_label"].fillna("No occupation info")
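
A quick way to check a left merge like this one is the `indicator=True` option of `merge`, which records whether each row found a match. The sketch below uses made-up codes and titles (`toy_people` and `toy_codes` are hypothetical) purely to show the idea.

```python
import pandas as pd

# Hypothetical codes and titles, not the real job code table.
toy_people = pd.DataFrame({"occupation_clean": ["111", "222", "999"]})
toy_codes = pd.DataFrame({"job_code": ["111", "222"], "job_title": ["Title A", "Title B"]})

# indicator=True adds a "_merge" column: "both" for matched rows, "left_only" for unmatched ones.
checked = toy_people.merge(
    toy_codes, left_on="occupation_clean", right_on="job_code",
    how="left", indicator=True,
)

unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched["occupation_clean"].tolist())
```

Listing the unmatched codes makes it clear how many rows fall back to "No occupation info" because of a missing code rather than a missing occupation.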

In [None]:
print(df[["occupation_clean", "occupation_label"]].head(20))
print(df["occupation_label"].value_counts().head(10))

In [None]:
# set labels for reason for non working
reason_map = {
    "1": "no capable",
    "2": "in military service",
    "3": "studying in school",
    "4": "prepare for school",
    "5": "prepare to apply job",
    "6": "house worker",
    "7": "caring kids at home",
    "8": "nursing",
    "9": "giving-up economic activities",
    "10": "no intention to work",
    "11": "others",
}

In [None]:
df["reason_none_worker_clean"] = df["reason_none_worker_clean"].astype(str).str.strip()

def label_reason(code_str):
    code_str = str(code_str).strip()
    if code_str in reason_map:
        return reason_map[code_str]
    else:
        return "Not a non-worker"

labels = []
for x in df["reason_none_worker_clean"]:
    labels.append(label_reason(x))

df["reason_label"] = labels

In [None]:
print(df[["reason_none_worker_clean", "reason_label"]].head(15))
print(df["reason_label"].value_counts().head(10))

In [None]:
df.head()

In [None]:
# The dataset description says that income is yearly income in million KRW.
# However, the actual values do not match that unit, since they would be unrealistically large.
# Because my dataset also includes many older adults and non-workers,
# it was even harder to verify what the unit should realistically be.
# So for this project, because my main goal is to compare patterns across groups
# (like gender, education, age, occupation, and region), I'm treating income as the dataset's internal numeric scale.
# All visualizations and calculations use these values consistently and focus on relative differences
# rather than interpreting them as exact real-world incomes or salaries.

In [None]:
def avg_income_by_category_simple(df, category_column, income_column):
    categories = df[category_column].dropna().unique()
    result_categories = []
    result_incomes = []

    for c in categories:
        subset = df[df[category_column] == c]
        mean_income = subset[income_column].mean()
        result_categories.append(c)
        result_incomes.append(mean_income)

    result_df = pd.DataFrame({
        "category": result_categories,
        "average_income": result_incomes
    })
    return result_df

In [None]:
occ_income_simple = avg_income_by_category_simple(df, "occupation_label", "income")

occ_income_simple = occ_income_simple.dropna(subset=["average_income"])

occ_income_simple = occ_income_simple[
    occ_income_simple["category"] != "No occupation info"
]

occ_income_simple_top10 = (
    occ_income_simple
      .sort_values("average_income", ascending=False)
      .head(10)
)

sns.barplot(
    data=occ_income_simple_top10,
    x="average_income",
    y="category"
)
plt.xlabel("Average income")
plt.ylabel("Occupation")
plt.title("Top 10 occupations by average income")
plt.show()

In [None]:
occ_income_simple_bottom10 = (
    occ_income_simple
      [occ_income_simple["category"] != "No occupation info"]
      .sort_values("average_income", ascending=True)
      .head(10)
)

sns.barplot(
    data=occ_income_simple_bottom10,
    x="average_income",
    y="category"
)
plt.xlabel("Average income")
plt.ylabel("Occupation")
plt.title("Bottom 10 occupations by average income")
plt.show()

In [None]:
# I created simple labels for gender, education, age groups, and regions so the results are easy to interpret.
# I also created income_per_person by dividing income by family size/number of family members in the household
# to compare living standards more fairly.
# These enhancements help my story focus on meaningful patterns instead of confusing raw codes.

In [None]:
education_label_simple = []

for e in df["education_level"]:
    if e == 1:
        education_label_simple.append("No education (under 7)")
    elif e == 2:
        education_label_simple.append("No education (7+)")
    elif e == 3:
        education_label_simple.append("Elementary")
    elif e == 4:
        education_label_simple.append("Middle school")
    elif e == 5:
        education_label_simple.append("High school")
    elif e == 6:
        education_label_simple.append("College")
    elif e == 7:
        education_label_simple.append("University degree")
    elif e == 8:
        education_label_simple.append("MA")
    elif e == 9:
        education_label_simple.append("Doctoral degree")
    else:
        education_label_simple.append("Unknown")

df["education_label_simple"] = education_label_simple

print("\nEducation label (simple):")
print(df["education_label_simple"].value_counts())

In [None]:
df["age"] = df["year"] - df["year_born"]

print("\nAge summary:")
print(df["age"].describe())

age_group_simple = []

for age in df["age"]:
    if age < 25:
        age_group_simple.append("<25")
    elif age < 35:
        age_group_simple.append("25–34")
    elif age < 45:
        age_group_simple.append("35–44")
    elif age < 60:
        age_group_simple.append("45–59")
    else:
        age_group_simple.append("60+")

df["age_group_simple"] = age_group_simple

print("\nAge group (simple):")
print(df["age_group_simple"].value_counts())

df["age_group"] = df["age_group_simple"]
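
The same age bands can also be sketched with `pd.cut`, shown here on a few made-up ages. This is an alternative to the loop above rather than a drop-in replacement: missing ages would come out as `NaN` here instead of being placed in a group.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on and around the band edges.
toy_age = pd.Series([18, 25, 34, 45, 59, 60, 80])

# right=False makes each bin include its left edge and exclude its right edge,
# so 25 falls in "25–34" and 60 falls in "60+", matching the `age < 25` style checks.
bins = [-np.inf, 25, 35, 45, 60, np.inf]
labels = ["<25", "25–34", "35–44", "45–59", "60+"]

toy_groups = pd.cut(toy_age, bins=bins, labels=labels, right=False)
print(toy_groups.tolist())
```

The result is an ordered categorical, so sorting or plotting by it keeps the bands in age order instead of alphabetical order.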

In [None]:
income_per_person = []

for i in range(len(df)):
    fam = df.loc[i, "family_member"]
    inc = df.loc[i, "income"]

    if pd.notna(fam) and fam != 0:
        income_per_person.append(inc / fam)
    else:
        income_per_person.append(np.nan)

df["income_per_person"] = income_per_person

print("\nIncome per person summary:")
print(df["income_per_person"].describe())


In [None]:
gender_label_simple = []

for g in df["gender"]:
    if g == 1:
        gender_label_simple.append("Male")
    elif g == 2:
        gender_label_simple.append("Female")
    else:
        gender_label_simple.append("Unknown")

df["gender_label_simple"] = gender_label_simple

print("\nGender label (cleaned version):")
print(df["gender_label_simple"].value_counts())

In [None]:
gender_income_simple = avg_income_by_category_simple(
    df,
    "gender_label_simple",
    "income"
)
print("\nAverage income by gender (simple):")
print(gender_income_simple)

In [None]:
sns.barplot(
    data=gender_income_simple,
    x="category",
    y="average_income",
    hue="category",
    palette={"Male": "#1F77B4", "Female": "#FFB6B9"}  # hex color values so that male is blue and female is pink
)
plt.ylabel("Average income")
plt.title("Average income by gender (simple)")
plt.show()

In [None]:
top10_jobs = occ_income_simple_top10["category"].tolist()
bottom10_jobs = occ_income_simple_bottom10["category"].tolist()
print("Top 10 jobs:", top10_jobs)
print("Bottom 10 jobs:", bottom10_jobs)

In [None]:
def gender_share_for_job_list(data, job_list):
    rows = []

    for job in job_list:
        sub = data[data["occupation_label"] == job]
        total = len(sub)

        if total == 0:
            continue

        male = len(sub[sub["gender_label_simple"] == "Male"])
        female = len(sub[sub["gender_label_simple"] == "Female"])

        rows.append({"occupation": job, "gender": "Male", "pct": (male / total) * 100})
        rows.append({"occupation": job, "gender": "Female", "pct": (female / total) * 100})

    return pd.DataFrame(rows)

In [None]:
top10_gender = gender_share_for_job_list(df, top10_jobs)

plt.figure(figsize=(18, 6))

sns.barplot(
    data=top10_gender,
    x="occupation",
    y="pct",
    hue="gender"
)

plt.ylim(0, 100)
plt.xlabel("Occupation")
plt.ylabel("Percent within occupation (%)")
plt.title("Gender composition in Top 10 highest-paid occupations")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

In [None]:
bottom10_gender = gender_share_for_job_list(df, bottom10_jobs)

plt.figure(figsize=(18, 7))

sns.barplot(
    data=bottom10_gender,
    x="occupation",
    y="pct",
    hue="gender"
)

plt.ylim(0, 100)
plt.xlabel("Occupation")
plt.ylabel("Percent within occupation (%)")
plt.title("Gender composition in Bottom 10 lowest-paid occupations")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

In [None]:
df["education_level"].value_counts().sort_index()

In [None]:
edu_income_simple = avg_income_by_category_simple(
    df,
    "education_label_simple",
    "income"
)
print("\nAverage income by education (simple):")
print(edu_income_simple)

In [None]:
edu_order = [
    "No education (7+)",
    "Elementary",
    "Middle school",
    "High school",
    "College",
    "University degree",
    "MA",
    "Doctoral degree",
]

sns.barplot(
    data=edu_income_simple,
    x="category",
    y="average_income",
    hue="category",
    order=edu_order
)
plt.xticks(rotation=45, ha="right")
plt.ylabel("Average income")
plt.title("Average income by education (simple)")
plt.show()

In [None]:
age_income_simple = avg_income_by_category_simple(
    df,
    "age_group_simple",
    "income"
)
print("\nAverage income by age group (simple):")
print(age_income_simple)

In [None]:
sns.barplot(data=age_income_simple, x="category", y="average_income")
plt.ylabel("Average income")
plt.title("Average income by age group (simple)")
plt.show()

In [None]:
# My goal is to compare income patterns across groups like gender, education, age, occupation, region, and non-working reasons.
# Because the income unit is unclear, I treated it as an internal numeric scale and focused mainly on relative differences.
# The interact and the region map with folium let users explore these comparisons dynamically and see consistent evidence.

In [None]:
def show_average_income(choose):
    if choose == "Gender":
        print("Average income by gender:")
        print(gender_income_simple)
    elif choose == "Education":
        print("Average income by education:")
        print(edu_income_simple)
    elif choose == "Age group":
        print("Average income by age group:")
        print(age_income_simple)
    else:
        print("Unknown choice")

interact(show_average_income, choose=["Gender", "Education", "Age group"]);

In [None]:
def show_average_income_plot(choose):
    
    if choose == "Gender":
        sns.barplot(
            data=gender_income_simple,
            x="category",
            y="average_income",
            hue="category",
            palette="Set3"
        )
        plt.xlabel("Gender")
        plt.title("Average income by gender")

    elif choose == "Education":
        edu_order = [
            "No education (7+)",
            "Elementary",
            "Middle school",
            "High school",
            "College",
            "University degree",
            "MA",
            "Doctoral degree",
        ]
        sns.barplot(
            data=edu_income_simple,
            x="category",
            y="average_income",
            hue="category",
            order=edu_order
        )
        plt.xlabel("Education level")
        plt.xticks(rotation=45, ha="right")
        plt.title("Average income by education")

    elif choose == "Age group":
        age_order = ["<25", "25–34", "35–44", "45–59", "60+"]
        sns.barplot(
            data=age_income_simple,
            x="category",
            y="average_income",
            order=age_order
        )
        plt.xlabel("Age group")
        plt.title("Average income by age group")

    else:
        plt.text(0.5, 0.5, "Unknown choice", ha="center")  # plt.text needs x and y positions before the message
    
    plt.ylabel("Average income")
    plt.tight_layout()
    plt.show()

interact(show_average_income_plot, choose=["Gender", "Education", "Age group"]);

In [None]:
sns.set(style="whitegrid")

age_counts = (
    df["age_group"]
      .value_counts()
      .sort_index()
      .reset_index()
)
age_counts.columns = ["age_group", "count"]

sns.barplot(
    data=age_counts,
    x="age_group",
    y="count",
    hue="age_group",
    palette="Set1"
)
plt.ylabel("Count")
plt.title("Distribution of age groups")
plt.show()

In [None]:
# average income per person by age group (income_per_person = income / family_member)
age_income_pp_simple = avg_income_by_category_simple(
    df,
    "age_group",
    "income_per_person"
)

age_order = ["<25", "25–34", "35–44", "45–59", "60+"]

age_income_pp_simple_ordered = (
    age_income_pp_simple
      .set_index("category")
      .loc[age_order]
      .reset_index()
)

print("\nAverage income per person by age group (simple):")
print(age_income_pp_simple_ordered)

sns.barplot(
    data=age_income_pp_simple_ordered,
    x="category",
    y="average_income"
)
plt.ylabel("Average income per person")
plt.title("Average income per person by age group (simple)")
plt.show()


In [None]:
def avg_income_by_two_categories_simple(df, col1, col2, income_column):
    values1 = df[col1].dropna().unique()
    values2 = df[col2].dropna().unique()

    rows = []

    for v1 in sorted(values1):
        for v2 in sorted(values2):
            subset = df[(df[col1] == v1) & (df[col2] == v2)]
            if len(subset) == 0:
                continue
            mean_income = subset[income_column].mean()
            rows.append({
                col1: v1,
                col2: v2,
                "average_income": mean_income
            })

    result_df = pd.DataFrame(rows)
    return result_df

In [None]:
year_income_simple = avg_income_by_category_simple(df, "year", "income").sort_values("category")

sns.lineplot(data=year_income_simple, x="category", y="average_income", marker="o")
plt.xlabel("Year")
plt.ylabel("Average income")
plt.title("Average income over time (Overall)")
plt.show()

In [None]:
year_gender_income_simple = avg_income_by_two_categories_simple(df, "year", "gender_label_simple", "income")

year_gender_income_simple = year_gender_income_simple.rename(columns={"average_income": "income"})

print("\nAverage income over time by gender (simple):")
print(year_gender_income_simple.head())

sns.lineplot(
    data=year_gender_income_simple,
    x="year",
    y="income",
    hue="gender_label_simple",
    marker="o"
)
plt.ylabel("Average income")
plt.title("Average income over time by gender (simple)")
plt.show()

In [None]:
filtered_pp = df[(df["income_per_person"] > 0) & (df["income_per_person"] < 10000)]

sns.histplot(
    data=filtered_pp,
    x="income_per_person",
    hue="gender_label_simple",
    bins=40,
    element="step",
    stat="density",
    common_norm=False
)
plt.xlabel("Income per person")
plt.title("Income per person distribution by gender")
plt.show()

In [None]:
reason_group = []

for r in df["reason_label"]:
    if r in ["house worker", "caring kids at home", "nursing"]:
        reason_group.append("Care and housework")
    elif r in ["no capable", "others"]:
        reason_group.append("Health issue / other")
    elif r in ["studying in school", "prepare for school", "prepare to apply job", "in military service"]:
        reason_group.append("Studying / preparing")
    elif r in ["giving-up economic activities", "no intention to work"]:
        reason_group.append("No intention / discouraged")
    else:
        reason_group.append("Not in non-worker group")

df["reason_group"] = reason_group

print("\nReason group value_counts:")
print(df["reason_group"].value_counts())

In [None]:
nonworker_df = df[df["reason_group"] != "Not in non-worker group"]

reason_group_counts = (nonworker_df["reason_group"].value_counts().reset_index())
reason_group_counts.columns = ["reason_group", "count"]

print("\nNon-worker reason group counts:")
print(reason_group_counts)

sns.barplot(data=reason_group_counts, x="reason_group", y="count", hue="reason_group", palette="Set1")
plt.xticks(rotation=45, ha="right")
plt.ylabel("Count")
plt.title("Distribution of broad non-working reason groups")
plt.show()

In [None]:
reason_group_income_pp_simple = avg_income_by_category_simple(nonworker_df, "reason_group", "income_per_person")

print("\nAverage income per person by non-working reason group (simple):")
print(reason_group_income_pp_simple)

sns.barplot(data=reason_group_income_pp_simple, x="category", y="average_income")

plt.xticks(rotation=45, ha="right")
plt.ylabel("Average income per person")
plt.title("Average income per person by non-working reason group (simple)")
plt.show()

In [None]:
# I also wanted to explore geographic differences in income by creating simple region labels
# and comparing both average income and average income per person across regions.
region_label_simple = []

for r in df["region"]:
    if r == 1:
        region_label_simple.append("Seoul")
    elif r == 2:
        region_label_simple.append("Kyeong-gi")
    elif r == 3:
        region_label_simple.append("Kyoung-nam")
    elif r == 4:
        region_label_simple.append("Kyoung-buk")
    elif r == 5:
        region_label_simple.append("Chung-nam")
    elif r == 6:
        region_label_simple.append("Gang-won & Chung-buk")
    elif r == 7:
        region_label_simple.append("Jeolla & Jeju")
    else:
        region_label_simple.append("Unknown")

df["region_label_simple"] = region_label_simple

print("\nRegion label (simple):")
print(df["region_label_simple"].value_counts())

In [None]:
region_counts = (
    df["region_label_simple"]
      .value_counts()
      .reset_index()
)
region_counts.columns = ["region_label_simple", "count"]

print("\nRegion counts:")
print(region_counts)

sns.barplot(data=region_counts, x="region_label_simple", y="count", hue="region_label_simple")
plt.xticks(rotation=45, ha="right")
plt.ylabel("Count")
plt.title("Number of observations by region")
plt.show()

In [None]:
regions = sorted(df["region_label_simple"].dropna().unique())
counts = []
for reg in regions:
    subset = df[(df["region_label_simple"] == reg) & (df["education_label_simple"] == "University degree")]
    counts.append(len(subset))

univ_df = pd.DataFrame({"region_label_simple": regions, "count": counts})

sns.barplot(data=univ_df, x="region_label_simple", y="count", hue="region_label_simple", palette="Set1")
plt.xticks(rotation=45, ha="right")
plt.ylabel("Count")
plt.title("Number of people with university degrees by region")
plt.show()

In [None]:
%pip install folium

import folium
from IPython.display import display
from ipywidgets import interact_manual

In [None]:
region_coords = {
    "Seoul": (37.5665, 126.9780),
    "Kyeong-gi": (37.4138, 127.5183),
    "Kyoung-nam": (35.2383, 128.6924),
    "Kyoung-buk": (36.5760, 128.5056),
    "Chung-nam": (36.5184, 126.8000),
    "Gang-won & Chung-buk": (37.5, 128.0),
    "Jeolla & Jeju": (34.8, 126.9),
}
# I got these region coordinates from AI

In [None]:
region_df = df[df["region_label_simple"] != "Unknown"].copy()

def display_region_income(metric):
    if metric == "Average income":
        income_col = "income"
        x_label = "Average income"
    elif metric == "Average income per person":
        income_col = "income_per_person"
        x_label = "Average income per person"
    else:
        print("Unknown metric")
        return

    region_income = avg_income_by_category_simple(region_df, "region_label_simple", income_col)

    sns.barplot(data=region_income, x="average_income", y="category", hue="category")
    
    plt.xlabel(x_label)
    plt.ylabel("Region")
    plt.title(f"{metric} by region")
    plt.show()

    m = folium.Map(location=[36.5, 127.8], zoom_start=7)

    for _, row in region_income.iterrows():
        region = row["category"]
        value = row["average_income"]

        if region not in region_coords:
            continue

        lat, lon = region_coords[region]
        text = f"{region}<br>{metric} (income): {value:,.0f}"

        marker = folium.Marker(location=(lat, lon), popup=text)
        marker.add_to(m)

    display(m)

interact_manual(display_region_income, metric=["Average income", "Average income per person"]);

In [None]:
# Average income over time by region: next, I look at how average income changes
# over time in each region so that the regions can be compared.

year_region_income_simple = avg_income_by_two_categories_simple(df,"year","region_label_simple","income")

year_region_income_simple = year_region_income_simple.sort_values(["year", "region_label_simple"])

print("\nAverage income over time by region (simple):")
print(year_region_income_simple.head())

In [None]:
sns.lineplot(
    data=year_region_income_simple,
    x="year",
    y="average_income",
    hue="region_label_simple",
    marker="o"
)

plt.legend(
    title="Region",
    fontsize=7,
    title_fontsize=8
)

plt.xlabel("Year")
plt.ylabel("Average income")
plt.title("Average income over time by region (simple)")
plt.show()

### Prepare for your Pitch and Reflection (P4)

With the project code complete, it's time to prepare for the final deliverable: submitting your project demo pitch and reflection.


In [None]:
# run this code to turn in your work!
from casstools.assignment import Assignment
Assignment().submit()

✅ TIMESTAMP  : 2025-12-13 22:55
✅ COURSE     : ist256
✅ TERM       : fall2025
✅ USER       : jhan70@syr.edu
✅ STUDENT    : True
✅ PATH       : ist256/fall2025/lessons/project/P3.ipynb
✅ ASSIGNMENT : P3.ipynb
✅ POINTS     : 0
✅ DUE DATE   : 2025-12-14 23:59
✅ LATE       : False
✅ STATUS     : Re-Submission



