
In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 1. Data Structure and Information


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

adult_file_path = '/content/drive/MyDrive/Colab Notebooks/Project/adultIncomeEDA/adult.csv'

df_adult = pd.read_csv(adult_file_path)
df_adult.columns = df_adult.columns.str.strip()

print(df_adult.head())
print(df_adult.info())
print(df_adult.describe())

In [None]:
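# Check for missing values count and percentage.
# Missing entries in this dataset are often written as '?', which isnull() does not
# detect, so the placeholder is converted to NaN for the count (df_adult is unchanged).
missing_data_adult = df_adult.replace('?', np.nan).isnull().sum()
missing_data_adult = missing_data_adult[missing_data_adult > 0]
missing_percentage_adult = (missing_data_adult / len(df_adult)) * 100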

missing_df_adult = pd.DataFrame({
    'Missing Count': missing_data_adult,
    'Missing Percentage': missing_percentage_adult
}).sort_values(by='Missing Percentage', ascending=False)

print(missing_df_adult)


df_adult.head() shows the first rows, which mix numerical and categorical columns. df_adult.info() reports the number of entries and the 15 columns, with object dtype for categorical data and int64/float64 for numerical data. The full UCI Adult dataset has 48,842 records, although some distributions of adult.csv contain only the 32,561-row training split, so the row count is worth checking in the info() output. df_adult.describe() provides summary statistics for numerical columns such as age and hours.per.week. Missing values in this dataset (mainly in workclass, occupation, and native.country) are often represented as '?'. Because isnull() counts only true NaN values, these placeholders are reported as missing only once they have been converted to NaN.
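As a minimal sketch of the placeholder issue described above, the snippet below counts missing values before and after converting '?' to NaN. The frame `toy` and its values are hypothetical stand-ins for df_adult, not data from the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_adult; values are illustrative only.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "?"],
    "occupation": ["Sales", "?", "Adm-clerical", "Tech-support"],
    "age": [39, 50, 38, 53],
})

# '?' is an ordinary string, so isnull() reports nothing as missing yet.
nan_before = int(toy.isnull().sum().sum())

# Convert the placeholder to NaN, then count per column.
cleaned = toy.replace("?", np.nan)
missing_count = cleaned.isnull().sum()
missing_pct = missing_count / len(cleaned) * 100

print("NaN cells before conversion:", nan_before)
print(missing_count[missing_count > 0])
print(missing_pct[missing_pct > 0])
```

An alternative, assuming the file really uses a bare '?' as its marker, is to pass na_values='?' to pd.read_csv so the conversion happens at load time.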

## 2. Examining the Target Variable


In [None]:
print(df_adult['income'].value_counts())

plt.figure(figsize=(8, 5))
sns.countplot(data=df_adult, x='income', order=df_adult['income'].value_counts().index)
plt.title('Distribution of Income Levels')
plt.xlabel('Income Level')
plt.ylabel('Count')
plt.show()

df_adult['income'].value_counts() shows a significant class imbalance, with the <=50K category having a much higher count than >50K. The bar chart visually confirms this imbalance, illustrating that the majority of individuals in the dataset earn less than or equal to $50,000 annually.
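The imbalance can also be expressed as percentages and as a majority-to-minority ratio. The sketch below uses a hypothetical four-row label series; on the real data the same calls would be applied to df_adult['income'].

```python
import pandas as pd

# Hypothetical labels for illustration; real values come from df_adult['income'].
income = pd.Series(["<=50K", "<=50K", "<=50K", ">50K"], name="income")

counts = income.value_counts()
shares = income.value_counts(normalize=True) * 100  # percentage per class
imbalance_ratio = counts.max() / counts.min()       # majority-to-minority ratio

print(shares.round(1))
print(f"Imbalance ratio: {imbalance_ratio:.1f}:1")
```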

## 3. Numerical Variables



In [None]:
# 'fnlwgt' is a statistical weight and not a direct feature for modeling
numerical_cols = df_adult.select_dtypes(include=np.number).columns.tolist()
if 'fnlwgt' in numerical_cols:
    numerical_cols.remove('fnlwgt')

# Select a few key numerical features for individual visualization
selected_numerical_for_plot = [
    'age',
    'education.num',
    'hours.per.week',
    'capital.gain', # Income from capital gains
    'capital.loss'  # Losses from capital investments
]

# Plot Histogram with KDE and Boxplot for each selected numerical column
for col in selected_numerical_for_plot:
    print(f"\n--- Processing column: {col} ---")

    # Plot Histogram with KDE
    plt.figure(figsize=(10, 5))
    sns.histplot(df_adult[col], kde=True, bins=30)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.grid(axis='y', alpha=0.75)
    plt.show()

    # Plot Boxplot
    plt.figure(figsize=(8, 4))
    sns.boxplot(y=df_adult[col])
    plt.title(f'Boxplot of {col}')
    plt.ylabel(col)
    plt.grid(axis='y', alpha=0.75)
    plt.show()

Histograms and boxplots for age, education.num, and hours.per.week reveal their distributions. age is typically right-skewed, while education.num may show multiple peaks. capital.gain and capital.loss are extremely right-skewed: most values are zero, so the non-zero values appear as outliers in their boxplots.
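The visual impressions of skew and zero inflation can be backed with numbers. The sketch below computes sample skewness, the share of zeros, and outliers under the 1.5 × IQR rule that boxplot whiskers use by default. The frame `toy` is hypothetical and only mimics the shape of a zero-heavy column.

```python
import pandas as pd

# Hypothetical values; capital.gain is mostly zero with one large entry.
toy = pd.DataFrame({
    "age": [25, 32, 38, 41, 47, 60],
    "capital.gain": [0, 0, 0, 0, 0, 15000],
})

skewness = toy.skew()                 # > 0 indicates a longer right tail
zero_share = (toy == 0).mean() * 100  # percentage of rows equal to zero

# 1.5 * IQR rule: points beyond the whiskers are flagged as outliers.
q1, q3 = toy["capital.gain"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["capital.gain"] < q1 - 1.5 * iqr) | (toy["capital.gain"] > q3 + 1.5 * iqr)
outliers = toy[is_outlier]

print(skewness)
print(zero_share.round(1))
print(outliers)
```

When more than three quarters of a column is zero, the IQR collapses to zero and every non-zero value is flagged, which is why the boxplot looks like a flat line with a column of points above it.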

## 4. Categorical Variables


In [None]:
categorical_cols = df_adult.select_dtypes(include='object').columns.tolist()
if 'income' in categorical_cols:
    categorical_cols.remove('income')


selected_categorical_for_plot = [
    'workclass',        # Type of employer
    'education',        # Highest level of education achieved
    'marital.status',   # Marital status
    'occupation',       # Occupation category
    'sex'               # Gender
]

# Plot Bar Charts for frequencies of each selected categorical column
for col in selected_categorical_for_plot:
    print(f"\n--- Processing column: {col} ---")

    print(df_adult[col].value_counts())

    # Plot Bar Chart for frequencies using Seaborn.
    plt.figure(figsize=(10, 6))
    sns.countplot(data=df_adult, x=col, order=df_adult[col].value_counts().index)
    plt.title(f'Frequency of {col}')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()


df_adult[col].value_counts() outputs frequency tables for the selected categorical columns, such as workclass and education. The bar charts show these frequencies in descending order, making the most common categories easy to spot (e.g., 'Private' for workclass, 'HS-grad' for education). This helps identify dominant groups and potential data quality issues, such as '?' values, which appear as a category of their own unless they are handled first.

## 5. Correlation Between Numerical Variables


In [None]:
numerical_cols_for_corr = df_adult.select_dtypes(include=np.number).columns.tolist()
if 'fnlwgt' in numerical_cols_for_corr:
    numerical_cols_for_corr.remove('fnlwgt')

correlation_matrix = df_adult[numerical_cols_for_corr].corr()

# Plot a heatmap of the correlation matrix for better visualization
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix of Numerical Variables (Adult Income)')
plt.show()

# Scatterplot of two numerical features to inspect their pairwise relationship
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df_adult, x='age', y='hours.per.week')
plt.title('Age vs. Hours.per.Week')
plt.xlabel('Age')
plt.ylabel('Hours.per.Week')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

# jointplot creates its own figure (sized by height=), so no plt.figure() call is needed
sns.jointplot(data=df_adult, x='age', y='hours.per.week', kind='hex', height=7, cmap='viridis')
plt.suptitle('Hexagonal Binning: Age vs. Hours.per.Week', y=1.02)
plt.show()


The heatmap of the correlation matrix for numerical features (e.g., age, education.num, hours.per.week) shows their pairwise linear (Pearson) relationships. No strong correlations are typically observed among these features; education.num and age may show only a weak positive correlation. The scatterplot and hexbin plot of age vs. hours.per.week show the joint distribution of the two features and reveal where the data are densest.
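To read the strongest relationships off a correlation matrix without scanning the heatmap, the pairs can be ranked by absolute correlation. This is a minimal sketch on a hypothetical frame `toy`; the values are invented and do not reflect the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame for illustration only.
toy = pd.DataFrame({
    "age": [25, 32, 38, 41, 47, 60],
    "education.num": [9, 10, 13, 9, 14, 13],
    "hours.per.week": [40, 45, 40, 50, 60, 20],
})

corr = toy.corr()  # Pearson correlation by default

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=np.abs, ascending=False)
)

print(pairs)
```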

## 6. Relationship Between Categorical Variables and Income


In [None]:
categorical_cols_for_rel = df_adult.select_dtypes(include='object').columns.tolist()
if 'income' in categorical_cols_for_rel:
    categorical_cols_for_rel.remove('income')

selected_categorical_for_plot_vs_target = [
    'workclass',        # Type of employer
    'education',        # Highest level of education achieved
    'marital.status',   # Marital status
    'occupation',       # Occupation category
    'sex',              # Gender
    'race'              # Race
]

for col in selected_categorical_for_plot_vs_target:
    print(f"\n--- Processing relationship for: {col} vs Income ---")

    # Create a cross-tabulation (contingency table) of the categorical feature and income.
    ct = pd.crosstab(df_adult[col], df_adult['income'], normalize='index') * 100
    print(ct)

    # Plotting the stacked bar chart
    plt.figure(figsize=(12, 7))
    ct.plot(kind='bar', stacked=True, ax=plt.gca(), cmap='viridis')
    plt.title(f'Income Distribution by {col}')
    plt.xlabel(col)
    plt.ylabel('Percentage')
    plt.xticks(rotation=45, ha='right')
    plt.legend(title='Income Level')
    plt.tight_layout()
    plt.show()


Stacked bar charts, generated from pd.crosstab(df_adult[col], df_adult['income'], normalize='index'), show the proportion of each income level within every category of features such as education, occupation, and sex. Because normalize='index' scales each row to 100%, categories of very different sizes can be compared directly. For instance, these plots show that individuals with higher education levels or certain occupation types (e.g., 'Exec-managerial', 'Prof-specialty') have a noticeably higher percentage of >50K income. The sex plot typically shows a higher proportion of >50K income for males than for females.
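The effect of normalize='index' is easiest to see on a small example. The sketch below builds a hypothetical six-row frame `toy` and compares the row-normalized percentages with the raw counts; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical rows; the real analysis uses df_adult[col] and df_adult['income'].
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "income": [">50K", "<=50K", "<=50K", ">50K", "<=50K", "<=50K"],
})

# Row-normalized percentages: each row sums to 100.
ct = pd.crosstab(toy["sex"], toy["income"], normalize="index") * 100

# Raw counts, useful because percentages hide how large each group is.
counts = pd.crosstab(toy["sex"], toy["income"])

print(ct)
print(counts)
```

Printing the raw counts next to the percentages is a small safeguard: a category with only a handful of rows can show an extreme percentage that would not hold up with more data.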