## UCI Adult Income Dataset - Exploratory and Descriptive Analysis

In this notebook, we carry out an **exploratory and descriptive analysis** of the **UCI Adult Income Dataset**, a popular dataset often used for classification tasks that predict whether an individual earns more or less than $50,000 annually based on demographic and work-related attributes.

Good data preprocessing is crucial for reliable and interpretable results in machine learning and analytics workflows. The analysis below works on the cleaned file (`adult_cleaned.csv`), in which common data issues such as **missing values, duplicates, and inconsistent categorical labels** have been addressed and derived features have been added to support downstream analysis.

We start by importing the essential Python libraries.

- `pandas` for structured data operations.
- `numpy` for numerical operations.
- `os` for interacting with the operating system and directory structures.
- `plotly.express` for interactive charts.


In [43]:
import pandas as pd 
import numpy as np 
import os 
import plotly.express as px

## Define and Create Directory Paths

To ensure reproducibility and organized storage, we programmatically create directories for:

- **raw data**
- **processed data**
- **results**
- **documentation**

These directories will store intermediate and final outputs for reproducibility.





In [47]:
# Get working directory 
current_dir = os.getcwd()

# Go one directory up to the root directory 
project_root_dir = os.path.dirname(current_dir)

data_dir = os.path.join(project_root_dir, 'data')
raw_dir = os.path.join(data_dir,'raw')
processed_dir = os.path.join(data_dir,'processed')

# Define paths to results folder 
results_dir = os.path.join(project_root_dir,'results')

# Define paths to docs folder 
docs_dir = os.path.join(project_root_dir,'docs') 

# Create directories if they do not exist
# (creating raw_dir and processed_dir also creates data_dir)
os.makedirs(raw_dir, exist_ok=True)
os.makedirs(processed_dir, exist_ok=True)
os.makedirs(results_dir, exist_ok=True)
os.makedirs(docs_dir, exist_ok=True)
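The same folder layout can also be built with `pathlib` in a loop, which keeps the list of folders in one place. This is a minimal sketch, not the notebook's own code: the helper name `make_project_dirs` is hypothetical, and the demo runs against a temporary directory so it does not touch the real project.

```python
import tempfile
from pathlib import Path


def make_project_dirs(project_root):
    """Create the data/raw, data/processed, results and docs folders.

    Hypothetical helper: returns a dict mapping a short name to each path.
    """
    root = Path(project_root)
    dirs = {
        "raw": root / "data" / "raw",
        "processed": root / "data" / "processed",
        "results": root / "results",
        "docs": root / "docs",
    }
    for path in dirs.values():
        # parents=True also creates the intermediate data/ folder;
        # exist_ok=True makes the call safe to repeat.
        path.mkdir(parents=True, exist_ok=True)
    return dirs


# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    created = make_project_dirs(tmp)
    all_exist = all(p.is_dir() for p in created.values())
    print(all_exist)
```

Returning the paths in a dict means later cells can refer to `created["results"]` instead of rebuilding each path by hand.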

## Read in the data
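We load the cleaned **Adult Income dataset** (`adult_cleaned.csv`) from the processed data directory.

Key considerations here are:

- In the raw file, `?` marks missing values (`na_values='?'`) and extra spaces follow the delimiters (`skipinitialspace=True`), which is common in text-based datasets.
- The call below passes neither option; it reads the cleaned file, which contains no missing values (see the `info()` output further down).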

After loading, we inspect the first few rows.



In [50]:
adult_data_filename = os.path.join(processed_dir, 'adult_cleaned.csv')
adult_df = pd.read_csv(adult_data_filename)
adult_df.head(10)

| | age | workclass | fnlwgt | education_num | marital_status | relationship | race | sex | capital_gain | capital_loss | hours_per_week | income | education_level | occupation_grouped | native_region | age_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | government | 77516 | 13 | single | single | white | male | 2174 | 0 | 40 | <=50k | tertiary | white collar | north america | 36-45 |
| 1 | 50 | self-employed | 83311 | 13 | married | male spouse | white | male | 0 | 0 | 13 | <=50k | tertiary | white collar | north america | 46-60 |
| 2 | 38 | private | 215646 | 9 | divorced or separated | single | white | male | 0 | 0 | 40 | <=50k | highschool graduate | blue collar | north america | 36-45 |
| 3 | 53 | private | 234721 | 7 | married | male spouse | black | male | 0 | 0 | 40 | <=50k | secondary | blue collar | north america | 46-60 |
| 4 | 28 | private | 338409 | 13 | married | female spouse | black | female | 0 | 0 | 40 | <=50k | tertiary | white collar | central america | 26-35 |
| 5 | 37 | private | 284582 | 14 | married | female spouse | white | female | 0 | 0 | 40 | <=50k | tertiary | white collar | north america | 36-45 |
| 6 | 49 | private | 160187 | 5 | divorced or separated | single | black | female | 0 | 0 | 16 | <=50k | secondary | service | central america | 46-60 |
| 7 | 52 | self-employed | 209642 | 9 | married | male spouse | white | male | 0 | 0 | 45 | >50k | highschool graduate | white collar | north america | 46-60 |
| 8 | 31 | private | 45781 | 14 | single | single | white | female | 14084 | 0 | 50 | >50k | tertiary | white collar | north america | 26-35 |
| 9 | 42 | private | 159449 | 13 | married | male spouse | white | male | 5178 | 0 | 40 | >50k | tertiary | white collar | north america | 36-45 |

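For reference, the effect of `na_values='?'` and `skipinitialspace=True` on a raw-style file can be sketched on a small in-memory sample. The rows below are made up for illustration and only imitate the raw file's style; they are not taken from the dataset.

```python
import io

import pandas as pd

# Made-up rows in the style of the raw file: every value follows ", "
# and unknown entries are written as "?".
raw_text = (
    "age, workclass, income\n"
    "39, State-gov, <=50K\n"
    "50, ?, <=50K\n"
    "38, Private, >50K\n"
)

# Default parsing keeps the leading spaces, so " ?" stays an ordinary string
naive_df = pd.read_csv(io.StringIO(raw_text))

# With both options, the leading spaces are dropped and "?" is read as NaN
parsed_df = pd.read_csv(
    io.StringIO(raw_text), na_values="?", skipinitialspace=True
)

print(list(naive_df.columns))
print(list(parsed_df.columns))
print(parsed_df["workclass"].isna().sum())
```

Comparing the two column lists shows why both options usually go together on this kind of file: without `skipinitialspace`, the `?` marker is preceded by a space and no longer matches `na_values`.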

## Check the shape of the dataset and data types

In [53]:
adult_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32514 entries, 0 to 32513
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   age                 32514 non-null  int64 
 1   workclass           32514 non-null  object
 2   fnlwgt              32514 non-null  int64 
 3   education_num       32514 non-null  int64 
 4   marital_status      32514 non-null  object
 5   relationship        32514 non-null  object
 6   race                32514 non-null  object
 7   sex                 32514 non-null  object
 8   capital_gain        32514 non-null  int64 
 9   capital_loss        32514 non-null  int64 
 10  hours_per_week      32514 non-null  int64 
 11  income              32514 non-null  object
 12  education_level     32514 non-null  object
 13  occupation_grouped  32514 non-null  object
 14  native_region       32514 non-null  object
 15  age_group           32514 non-null  object
dtypes: int64(6), object(10)
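`info()` reports the shape and the dtypes together; both can also be read separately through the `shape` and `dtypes` attributes. A small sketch on a made-up three-row frame (`demo_df` is illustrative, not the real data):

```python
import pandas as pd

demo_df = pd.DataFrame(
    {
        "age": [39, 50, 38],
        "workclass": ["government", "self-employed", "private"],
        "income": ["<=50k", "<=50k", ">50k"],
    }
)

# shape is a (rows, columns) tuple
n_rows, n_cols = demo_df.shape

# Number of columns per dtype, similar to the last line of info()
dtype_counts = demo_df.dtypes.astype(str).value_counts()

print(n_rows, n_cols)
print(dtype_counts)
```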

## Summary statistics
### Numerical variables

In [None]:
adult_df.describe()

### Categorical variables

In [None]:
adult_df.describe(include='object')

In [None]:
adult_df['workclass'].value_counts()

In [None]:
adult_df['workclass'].value_counts(normalize=True)

In [None]:
adult_df['marital_status'].value_counts(normalize=True)

In [None]:
adult_df['relationship'].value_counts(normalize=True)

In [None]:
adult_df['race'].value_counts(normalize=True)
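The counts and the normalized shares can be shown side by side instead of in separate cells. The helper below is a sketch under that assumption; `category_summary` is a hypothetical name and the sample values are made up.

```python
import pandas as pd


def category_summary(series):
    """Return counts and percentage shares for one categorical column."""
    counts = series.value_counts()
    shares = series.value_counts(normalize=True) * 100
    # Both Series are indexed by category, so they align when combined
    return pd.DataFrame({"count": counts, "percent": shares.round(2)})


demo = pd.Series(
    ["private", "private", "government", "private"], name="workclass"
)
summary = category_summary(demo)
print(summary)
```

In the notebook this would be called as, for example, `category_summary(adult_df['workclass'])`.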

## Income distribution

In [None]:
adult_df_income = adult_df.groupby('income').size().reset_index(name='total')
adult_df_income

In [None]:
fig = px.pie(adult_df_income, names='income', values='total', title='Overall Income Distribution', color_discrete_sequence=['#008080', '#808080'])
fig.show()

### Income by age group

In [None]:
adult_df_age = adult_df.groupby(['age_group','income']).size().reset_index(name='total_by_age')
#sort_values(by='total_by_age', ascending=False)
adult_df_age

In [None]:
adult_df_income_age = adult_df.groupby(['age_group', 'income']).size().reset_index(name='total_by_age').sort_values(['age_group', 'income'])
adult_df_income_age

In [None]:
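# Note: size() counts the rows per age group (the number of income categories),
# not the number of individuals; the next cell sums 'total_by_age' instead
total_per_group = adult_df_income_age.groupby('age_group').size()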
total_per_group

In [None]:
total_per_group = adult_df_income_age.groupby('age_group')['total_by_age'].transform('sum')
total_per_group

In [None]:
total_per_group = adult_df_income_age.groupby('age_group')['total_by_age'].transform('sum')
adult_df_income_age['percentage'] = (adult_df_income_age['total_by_age']/total_per_group) * 100
adult_df_income_age
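The within-group percentages can be cross-checked with `pd.crosstab`, which normalizes each row by its own total when `normalize="index"` is passed. This is an alternative sketch on made-up rows, not the notebook's method; it produces a wide table (one column per income level) rather than the long table used for plotting.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {
        "age_group": ["26-35", "26-35", "26-35", "26-35", "36-45", "36-45"],
        "income": ["<=50k", "<=50k", "<=50k", ">50k", "<=50k", ">50k"],
    }
)

# normalize="index" divides each row by its row total, i.e. by the
# number of people in that age group
share_table = (
    pd.crosstab(demo_df["age_group"], demo_df["income"], normalize="index")
    * 100
)

print(share_table)
```

Each row of `share_table` sums to 100, which is a quick sanity check for the `transform('sum')` approach above.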

In [None]:
fig = px.bar(
    adult_df_income_age,
    x = 'age_group',
    y = 'percentage',
    color = 'income',
    title='Income Distribution by Age Group (%)',
    barmode='group',
    color_discrete_sequence=px.colors.sequential.RdBu,
    text='percentage'
)
fig.update_traces(texttemplate = '%{text:.2f}%')
fig.show()

In [None]:
themes = ["plotly", "plotly_white", "plotly_dark", "ggplot2", "seaborn", "simple_white", "presentation", "xgridoff", "ygridoff", "gridon", "none"]
for theme in themes:
    fig.update_layout(template=theme)
    fig.show()

In [None]:
#pip install -U kaleido

In [None]:
#pip install -U plotly

### Income by native region

In [None]:
adult_df_income_native_region = adult_df.groupby(['native_region', 'income']).size().reset_index(name='total_income_distr')
adult_df_income_native_region

In [None]:
total_per_region = adult_df_income_native_region.groupby('native_region')['total_income_distr'].transform('sum')
adult_df_income_native_region['percentage'] = (adult_df_income_native_region['total_income_distr']/total_per_region) * 100
adult_df_income_native_region

In [None]:
fig = px.bar(
    adult_df_income_native_region,
    x = 'native_region',
    y = 'percentage',
    color = 'income',
    title='Income Distribution across Native Region',
    barmode='group',
    color_discrete_sequence=px.colors.sequential.RdBu,
    text='percentage'
)
fig.update_traces(texttemplate = '%{text:.2f}%')
fig.show()
fig.write_image(os.path.join(results_dir, 'income_distribution_by_nativeRegion_bar_plot.jpg'))
fig.write_image(os.path.join(results_dir, 'income_distribution_by_nativeRegion_bar_plot.png'))
fig.write_html(os.path.join(results_dir, 'income_distribution_by_nativeRegion_bar_plot.html'))

### Income by race

In [None]:
adult_df_income_race = adult_df.groupby(['race', 'income']).size().reset_index(name='total_income_race')
adult_df_income_race

In [None]:
total_per_race= adult_df_income_race.groupby('race')['total_income_race'].transform('sum')
adult_df_income_race['percentage'] = (adult_df_income_race['total_income_race']/total_per_race) * 100
adult_df_income_race
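The count-then-percentage steps are repeated for age group, native region, and race, so they can be folded into one function. The sketch below assumes that refactoring; `income_share_by` and the generic column name `total` are hypothetical, and the demo rows are made up.

```python
import pandas as pd


def income_share_by(df, group_col, income_col="income"):
    """Count rows per (group, income) pair and add the within-group percentage."""
    out = df.groupby([group_col, income_col]).size().reset_index(name="total")
    # transform('sum') repeats each group's total on every row of that group
    group_totals = out.groupby(group_col)["total"].transform("sum")
    out["percentage"] = out["total"] / group_totals * 100
    return out


demo_df = pd.DataFrame(
    {
        "race": ["white", "white", "white", "white", "black", "black"],
        "income": ["<=50k", "<=50k", "<=50k", ">50k", "<=50k", ">50k"],
    }
)
race_share = income_share_by(demo_df, "race")
print(race_share)
```

With this in place, `income_share_by(adult_df, 'native_region')` and `income_share_by(adult_df, 'age_group')` would return tables of the same shape, ready to pass to `px.bar`.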

In [None]:
fig = px.bar(
    adult_df_income_race,
    x = 'race',
    y = 'percentage',
    color = 'income',
    title='Income Distribution Per Race',
    barmode='group',
    color_discrete_sequence=px.colors.sequential.RdBu,
    text='percentage'
)
fig.update_traces(texttemplate = '%{text:.2f}%')
fig.show()
fig.write_image(os.path.join(results_dir, 'income_distribution_by_race_bar_plot.jpg'))
fig.write_image(os.path.join(results_dir, 'income_distribution_by_race_bar_plot.png'))
fig.write_html(os.path.join(results_dir, 'income_distribution_by_race_bar_plot.html'))
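Each chart is exported with the same three calls, so the export can be wrapped in a small helper. This is a sketch: `save_figure` is a hypothetical name, and it only assumes the figure offers `write_image(path)` and `write_html(path)`, as Plotly figures do. The stand-in class exists only so the example runs without Plotly or kaleido installed.

```python
import os
import tempfile


def save_figure(fig, out_dir, stem, image_formats=("png", "jpg")):
    """Write fig as static images plus an interactive HTML file.

    Returns the list of paths passed to the figure's export methods.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for ext in image_formats:
        path = os.path.join(out_dir, f"{stem}.{ext}")
        fig.write_image(path)  # static export; Plotly needs kaleido for this
        paths.append(path)
    html_path = os.path.join(out_dir, f"{stem}.html")
    fig.write_html(html_path)
    paths.append(html_path)
    return paths


class StandInFigure:
    """Records export calls; used only so this sketch runs without Plotly."""

    def __init__(self):
        self.written = []

    def write_image(self, path):
        self.written.append(path)

    def write_html(self, path):
        self.written.append(path)


with tempfile.TemporaryDirectory() as tmp:
    saved = save_figure(StandInFigure(), tmp, "income_distribution_by_race_bar_plot")
    saved_names = [os.path.basename(p) for p in saved]

print(saved_names)
```

In the notebook, the call would take the real figure and folder, for example `save_figure(fig, results_dir, 'income_distribution_by_race_bar_plot')`.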

### Income by education level and occupation group

In [None]:
adult_df_income_edu_occ = (adult_df.groupby(['education_level', 'occupation_grouped', 'income'])
                          .size().reset_index(name='total').sort_values('total', ascending = False))
adult_df_income_edu_occ

In [None]:
adult_df_income_edu_occ['edu_occ'] = (adult_df_income_edu_occ['education_level'] + " | "
                                     + adult_df_income_edu_occ['occupation_grouped'])
adult_df_income_edu_occ

In [None]:
num = 15
adult_df_combos = adult_df_income_edu_occ.head(num)

fig = px.bar(
    adult_df_combos,
    x='total',
    y='edu_occ',
    color='income',
    orientation='h',
    title=f'Top {num} Education and Occupation Groups Combinations by Income Group',
    height=500,
    width=1100,
    color_discrete_sequence=px.colors.sequential.RdBu,
    text='total'
)

fig.update_layout(
    template="presentation",
    xaxis_title='Number of Individuals',
    yaxis_title='Education | Occupation Group',
    legend_title=dict(text='Income Level'),
    margin=dict(l=450, r=50, t=50, b=50)
)

# Place the count labels inside the bars
fig.update_traces(textposition='inside')

fig.show()
fig.write_image(os.path.join(results_dir, 'income_distribution_by_education_occupation_bar_plot.jpg'))
fig.write_image(os.path.join(results_dir, 'income_distribution_by_education_occupation_bar_plot.png'))
fig.write_html(os.path.join(results_dir, 'income_distribution_by_education_occupation_bar_plot.html'))
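Taking `head(num)` of the sorted table keeps the largest individual (education, occupation, income) rows, so a combination can appear with only one of its income groups. If the aim is instead to rank combinations by their overall size and keep both income groups for each, one possible approach is sketched below. The helper `top_combos` is hypothetical and the counts are made up.

```python
import pandas as pd


def top_combos(df, label_col, value_col, n):
    """Keep every row belonging to the n labels with the largest summed value."""
    totals = df.groupby(label_col)[value_col].sum()
    top_labels = totals.nlargest(n).index
    return df[df[label_col].isin(top_labels)]


demo = pd.DataFrame(
    {
        "edu_occ": ["tertiary | white collar"] * 2
        + ["secondary | blue collar"] * 2
        + ["secondary | service"],
        "income": ["<=50k", ">50k", "<=50k", ">50k", "<=50k"],
        "total": [40, 35, 50, 5, 30],  # made-up counts
    }
)

# The two largest combinations overall, each with both income rows
top2 = top_combos(demo, "edu_occ", "total", n=2)
print(top2)
```

In this toy data, the small `>50k` row of "secondary | blue collar" is kept because the combination as a whole ranks second, whereas a plain `head(3)` on the sorted rows would drop it.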

