# Prepare real-world data and run EDA

Task:
- Prepare a dataset with the following columns: `county`, `gender`, `child`, `adult`, `senior`.
- `county` (string) holds the county name, formatted like `"Albany County"`.
- `gender` (category dtype) is a categorical variable with two possible values: `"F"` and `"M"`.
- `child` (int64) is the number of residents in each county under 18 years old.
- `adult` (int64) is the number of residents in each county aged 18 to 64.
- `senior` (int64) is the number of residents in each county aged 65 and over.

## Read and inspect data

In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

ny_age = pd.read_csv("data/ACSST5Y2023.S0101-2026-02-23T201031.csv")

ny_age.head()

Unnamed: 0,Label (Grouping),"Albany County, New York!!Total!!Estimate","Albany County, New York!!Percent!!Estimate","Albany County, New York!!Male!!Estimate","Albany County, New York!!Percent Male!!Estimate","Albany County, New York!!Female!!Estimate","Albany County, New York!!Percent Female!!Estimate","Allegany County, New York!!Total!!Estimate","Allegany County, New York!!Percent!!Estimate","Allegany County, New York!!Male!!Estimate",...,"Wyoming County, New York!!Male!!Estimate","Wyoming County, New York!!Percent Male!!Estimate","Wyoming County, New York!!Female!!Estimate","Wyoming County, New York!!Percent Female!!Estimate","Yates County, New York!!Total!!Estimate","Yates County, New York!!Percent!!Estimate","Yates County, New York!!Male!!Estimate","Yates County, New York!!Percent Male!!Estimate","Yates County, New York!!Female!!Estimate","Yates County, New York!!Percent Female!!Estimate"
0,Total population,315374.0,(X),153188.0,(X),162186.0,(X),47027.0,(X),23865.0,...,21429.0,(X),18551.0,(X),24637.0,(X),12084.0,(X),12553.0,(X)
1,AGE,,,,,,,,,,...,,,,,,,,,,
2,Under 5 years,14928.0,4.7%,7577.0,4.9%,7351.0,4.5%,2396.0,5.1%,1187.0,...,905.0,4.2%,985.0,5.3%,1557.0,6.3%,839.0,6.9%,718.0,5.7%
3,5 to 9 years,15626.0,5.0%,8058.0,5.3%,7568.0,4.7%,2425.0,5.2%,1369.0,...,1068.0,5.0%,1039.0,5.6%,1506.0,6.1%,760.0,6.3%,746.0,5.9%
4,10 to 14 years,16773.0,5.3%,8581.0,5.6%,8192.0,5.1%,2803.0,6.0%,1323.0,...,1002.0,4.7%,1083.0,5.8%,1657.0,6.7%,767.0,6.3%,890.0,7.1%


In [4]:
ny_age.shape

(42, 373)

In [5]:
ny_age.columns

Index(['Label (Grouping)', 'Albany County, New York!!Total!!Estimate',
       'Albany County, New York!!Percent!!Estimate',
       'Albany County, New York!!Male!!Estimate',
       'Albany County, New York!!Percent Male!!Estimate',
       'Albany County, New York!!Female!!Estimate',
       'Albany County, New York!!Percent Female!!Estimate',
       'Allegany County, New York!!Total!!Estimate',
       'Allegany County, New York!!Percent!!Estimate',
       'Allegany County, New York!!Male!!Estimate',
       ...
       'Wyoming County, New York!!Male!!Estimate',
       'Wyoming County, New York!!Percent Male!!Estimate',
       'Wyoming County, New York!!Female!!Estimate',
       'Wyoming County, New York!!Percent Female!!Estimate',
       'Yates County, New York!!Total!!Estimate',
       'Yates County, New York!!Percent!!Estimate',
       'Yates County, New York!!Male!!Estimate',
       'Yates County, New York!!Percent Male!!Estimate',
       'Yates County, New York!!Female!!Estimate',
  

In [7]:
ny_age.dtypes

Label (Grouping)                                    object
Albany County, New York!!Total!!Estimate            object
Albany County, New York!!Percent!!Estimate          object
Albany County, New York!!Male!!Estimate             object
Albany County, New York!!Percent Male!!Estimate     object
                                                     ...  
Yates County, New York!!Percent!!Estimate           object
Yates County, New York!!Male!!Estimate              object
Yates County, New York!!Percent Male!!Estimate      object
Yates County, New York!!Female!!Estimate            object
Yates County, New York!!Percent Female!!Estimate    object
Length: 373, dtype: object

## Select the columns and rows that may be helpful

- Remove all columns whose names contain "Percent" or "Total".
- Keep only the rows labeled "Total population", "Under 18 years", and "65 years and over".
- Transpose the dataframe so that each county/gender column becomes a row.

In [27]:
# Before selecting rows, strip leading and trailing whitespace from the row labels
ny_age['Label (Grouping)'] = ny_age['Label (Grouping)'].str.strip()

In [50]:
selected_df = ny_age[[col for col in ny_age.columns if "Percent" not in col and "Total" not in col]].copy()
selected_df = selected_df[selected_df['Label (Grouping)'].isin(["Total population", "Under 18 years", "65 years and over"])]

In [51]:
selected_df = selected_df.T.copy().reset_index()

Inspecting the transposed data shows that the first row holds the column names, so we promote it to the header and drop it from the data.

In [52]:
selected_df.columns = selected_df.iloc[0]
selected_df = selected_df.iloc[1:].copy()
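
The selection, transpose, and header steps above can be tried end to end on a tiny frame. The sketch below is only an illustration: the table `toy` and its numbers are made up, and only the column-label pattern follows the real file. It also takes a slightly different route from the cells above, setting the label column as the index before transposing so that no header row has to be promoted afterwards.

```python
import pandas as pd

# Made-up stand-in for the ACS table: one label column plus a few county columns.
toy = pd.DataFrame({
    "Label (Grouping)": ["Total population", "Under 5 years",
                         "Under 18 years", "65 years and over"],
    "Albany County, New York!!Total!!Estimate": ["100", "10", "20", "30"],
    "Albany County, New York!!Percent!!Estimate": ["(X)", "10.0%", "20.0%", "30.0%"],
    "Albany County, New York!!Male!!Estimate": ["48", "5", "11", "13"],
    "Albany County, New York!!Female!!Estimate": ["52", "5", "9", "17"],
})

# Same filters as in the notebook: drop "Percent"/"Total" columns, keep three rows.
keep_cols = [c for c in toy.columns if "Percent" not in c and "Total" not in c]
keep_rows = ["Total population", "Under 18 years", "65 years and over"]
small = toy.loc[toy["Label (Grouping)"].isin(keep_rows), keep_cols]

# With the labels as the index, transposing turns them into column names directly.
tidy = small.set_index("Label (Grouping)").T
tidy.columns.name = None                 # drop the leftover axis name
tidy.index.name = "Label (Grouping)"     # name the old column labels
tidy = tidy.reset_index()

print(tidy)
```

The result has one row per county/gender label and one column per selected age group, which is the same shape the notebook reaches with `iloc`.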

## Convert the numbers to numeric
- Remove the thousands separators first.
- Then convert the cleaned strings to numeric values.

In [54]:
for col in selected_df.columns[1:]:
    selected_df[col] = selected_df[col].str.replace(",", "")
    selected_df[col] = pd.to_numeric(selected_df[col], errors="coerce")

In [55]:
selected_df.isna().sum()

0
Label (Grouping)     0
Total population     0
Under 18 years       0
65 years and over    0
dtype: int64
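
The two-step cleaning above can be checked on a small Series. This is a minimal sketch with made-up input strings (the `"(X)"` placeholder mirrors the one visible in the raw table). It shows why the `isna()` check matters: `errors="coerce"` silently turns anything unparseable into `NaN`.

```python
import pandas as pd

# Made-up sample values: three counts with thousands separators and one placeholder.
raw = pd.Series(["315,374", "47,027", "(X)", "1,372"])

# Step 1: remove the thousands separators (literal match, not a regex).
cleaned = raw.str.replace(",", "", regex=False)

# Step 2: convert to numbers; values that cannot be parsed become NaN.
numbers = pd.to_numeric(cleaned, errors="coerce")

print(numbers.tolist())
print("missing values:", numbers.isna().sum())
```

Because one value is `NaN`, the resulting Series is `float64` rather than `int64`. A column comes out as `int64` only when every value parses, which is consistent with the zero counts reported by `isna().sum()` above.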

## Rename columns and create a new column called "adult"

In [57]:
selected_df.rename(columns={'Label (Grouping)': 'county',
                            'Under 18 years': 'child',
                            '65 years and over': 'senior'},
                   inplace=True)

In [58]:
selected_df['adult'] = selected_df['Total population'] - selected_df['child'] - selected_df['senior']

In [59]:
selected_df.drop(columns=['Total population'], inplace=True)

## Create a new column called "gender"

In [60]:
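# `county` still holds the full label here; the test is case-sensitive, so "Female" does not match "Male"
selected_df['gender'] = selected_df['county'].apply(lambda x: "M" if "Male" in x else "F")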

In [62]:
preferred_col_order = ['county', 'gender', 'child', 'adult','senior']
selected_df = selected_df[preferred_col_order]

## Fix county names

In [64]:
def extract_county_name(string: str) -> str:
    return string.split(",")[0]

selected_df['county'] = selected_df['county'].apply(extract_county_name)
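
As an alternative sketch, both the county name and the gender can be read from the `!!`-separated label with vectorized string methods. The Series `labels` below is a hand-written sample that follows the label pattern seen in the raw columns. It is an illustration, not the method used in this notebook.

```python
import pandas as pd

# Hand-written sample labels following the pattern of the raw column names.
labels = pd.Series([
    "Albany County, New York!!Male!!Estimate",
    "Albany County, New York!!Female!!Estimate",
    "Yates County, New York!!Male!!Estimate",
])

# Split "place!!group!!measure" into three columns (0, 1, 2).
parts = labels.str.split("!!", expand=True)

# County name is everything before the first comma of the place part.
county = parts[0].str.split(",").str[0]

# Map the group part explicitly; any unexpected value becomes NaN instead of "F".
gender = parts[1].map({"Male": "M", "Female": "F"})

print(pd.DataFrame({"county": county, "gender": gender}))
```

The explicit mapping is a design choice: unlike an `if "Male" in x else "F"` test, it does not label an unrecognized group as female by default.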

In [65]:
selected_df.dtypes

0
county    object
gender    object
child      int64
adult      int64
senior     int64
dtype: object
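
The task asks for `gender` to have the category dtype, but the output above shows it is still `object`. A minimal sketch of the conversion follows, using a made-up frame `demo` in place of `selected_df`. The same `pd.Categorical` call should apply to the real column.

```python
import pandas as pd

# Made-up stand-in for selected_df with the same column layout.
demo = pd.DataFrame({
    "county": ["A County", "A County", "B County", "B County"],
    "gender": ["M", "F", "M", "F"],
    "child": [10, 12, 3, 2],
    "adult": [40, 38, 10, 11],
    "senior": [8, 12, 4, 5],
})

# Fix the two allowed values explicitly so the categories are always F and M.
demo["gender"] = pd.Categorical(demo["gender"], categories=["F", "M"])

print(demo.dtypes)
print(demo["gender"].cat.categories.tolist())
```

Listing the categories explicitly means any value other than `"F"` or `"M"` would become `NaN`, which makes unexpected labels easy to spot.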

## EDA

In [66]:
selected_df.describe()

              child          adult        senior
count         124.0          124.0         124.0
mean   33139.330645   99208.516129  27912.790323
std    58248.648946  174275.977723  46091.296086
min           326.0         1372.0         824.0
25%         4792.75       14981.25        5319.0
50%          8698.5        26223.0        8190.5
75%         24418.5       72413.75       23513.5
max        304581.0       864572.0      230881.0
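
Each row of the prepared data is one county/gender pair, so `describe()` summarizes those 124 rows rather than whole counties. A possible next step is to aggregate by gender or by county. The sketch below uses a made-up frame `toy` with the same columns as `selected_df`; the numbers are invented for illustration only.

```python
import pandas as pd

# Made-up data with the same layout as the prepared dataset.
toy = pd.DataFrame({
    "county": ["A County", "A County", "B County", "B County"],
    "gender": ["M", "F", "M", "F"],
    "child": [10, 12, 3, 2],
    "adult": [40, 38, 10, 11],
    "senior": [8, 12, 4, 5],
})

age_cols = ["child", "adult", "senior"]

# Totals per gender across all counties.
by_gender = toy.groupby("gender")[age_cols].sum()

# Totals per county across both genders.
by_county = toy.groupby("county")[age_cols].sum()

# Share of residents aged 65 and over in each county.
senior_share = by_county["senior"] / by_county.sum(axis=1)

print(by_gender)
print(by_county)
print(senior_share.round(3))
```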
