# Exploratory Data Analysis

**Project:** Income Prediction: What Determines Who Earns More? (2.4)

**Team:** Anastasia Sidorova and Paola Cancino

**Date:** 2/9/2026

## Table of Contents
1. Setup & Load Data
2. Data Quality Check
3. Target Variable Analysis
4. Feature Distributions
5. Correlation Analysis
6. Key Findings Summary



## 1. Setup & Load Data

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Settings
plt.style.use('seaborn-v0_8')
pd.set_option('display.max_columns', None)

print("✓ Libraries loaded!")

In [None]:
%pip install ucimlrepo

In [None]:
from ucimlrepo import fetch_ucirepo 

# fetch dataset 
adult = fetch_ucirepo(id=2) 
  
# data (as pandas dataframes) 
X = adult.data.features 
y = adult.data.targets 
  
# metadata 
# print(adult.metadata) 
# variable information 
# print(adult.variables) 

df = pd.concat([X, y], axis=1)
df

In [None]:
df.info()
print(f'Data loaded: {df.shape[0]:,} rows x {df.shape[1]:,} columns')
df.head()


## 2. Data Quality Check

**Questions to answer:**
- What are the data types?
  - <span style="color: green;">There are 6 integers and 9 strings.</span>
- Are there missing values?
  - <span style="color: green;">There are 2203 missing values.</span>
- Are there duplicate rows?
  - <span style="color: green;">There are 29 duplicate rows.</span>

In [None]:
df.info()

In [None]:
df.dtypes

In [None]:
print(f'Missing values: {df.isnull().sum().sum()}')
print(f'Duplicate rows: {df.duplicated().sum()}')

In [None]:
df.describe()

### Data Quality Observations


1. **Data types:** There are 6 integers and 9 strings.
2. **Missing values:** There are 2203 missing values.
3. **Duplicates:** There are 29 duplicate rows.
4. **Potential issues:** Duplicate rows will need to be dropped. Because there is a large number of missing values, those rows will most likely need to be dropped as well.
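
The checks above can be broken down per column before anything is dropped. The sketch below uses a small made-up frame (not the real data) and assumes that unknown values might also be written as the string `'?'`, which `isnull()` would not count; whether that applies to our data still needs to be checked.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "age": [39, 50, 38, 50, 28],
    "workclass": ["Private", "?", np.nan, "?", "Private"],
    "income": ["<=50K", ">50K", "<=50K", ">50K", "<=50K"],
})

# Assumption: '?' marks an unknown value, so turn it into a real NaN first
cleaned = toy.replace("?", np.nan)

# Missing values per column, rather than one grand total
missing_per_column = cleaned.isnull().sum()
print(missing_per_column)

# Drop exact duplicate rows, then rows with any missing value
deduped = cleaned.drop_duplicates()
complete = deduped.dropna()
print(f"{len(toy)} rows -> {len(deduped)} after dedup -> {len(complete)} complete")
```

Looking at the per-column counts first shows whether the missing values are concentrated in a few columns, which helps decide between dropping rows and dropping or imputing a column.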



## 3. Target Variable Analysis

**Your target variable:** `income`

Our target will be income, since that is what we are trying to predict. 

In [None]:
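# Map the income labels to two numeric levels.
# Some rows may carry a trailing '.' on the label (e.g. '>50K.'); strip it first,
# otherwise those rows would map to NaN.
income_clean = df['income'].str.strip().str.rstrip('.')
df['income_numeric'] = income_clean.map({'<=50K': 50000, '>50K': 100000})

max_value = df['income_numeric'].max()
high_earners = (df['income_numeric'] >= 100000).sum()

pct = high_earners / len(df) * 100

print(f'Maximum value: ${max_value:,.0f}')
print(f'Rows in the >50K class: {high_earners:,} ({pct:.1f}%)')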


In [None]:
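# Re-create the numeric target; strip a possible trailing '.' so no label maps to NaN
df['income_numeric'] = df['income'].str.strip().str.rstrip('.').map({'<=50K': 50000, '>50K': 100000})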

plt.figure(figsize=(6, 4))
plt.hist(df['income_numeric'], bins=2, edgecolor='black', alpha=0.7)  # two bins for <=50K and >50K
plt.axvline(df['income_numeric'].mean(), color='red', linestyle='--', 
            label=f"Mean: ${df['income_numeric'].mean():,.0f}")
plt.axvline(df['income_numeric'].median(), color='green', linestyle='--', 
            label=f"Median: ${df['income_numeric'].median():,.0f}")
plt.xticks([50000, 100000], ['<=50K', '>50K'])
plt.xlabel('Income Category')
plt.ylabel('Frequency')
plt.title('Distribution of Income')
plt.legend()
plt.show()

### Target Variable Observations


1. **Distribution shape:** The target is imbalanced: more individuals earn <=50K than >50K. Because the target is a two-class categorical variable, this is class imbalance rather than skew in the numeric sense.
2. **Outliers:** We don't know if there are any outliers yet. 
3. **Potential issues:** The imbalance may bias predictive models toward predicting the majority class, which is <=50K.
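
To put a number on the imbalance, the class shares can be computed directly from the labels. This is a minimal sketch on made-up labels; the trailing `'.'` on some of them is an assumption about how the raw labels might look, and the share of the largest class is the accuracy a model would get by always predicting that class.

```python
import pandas as pd

# Toy labels (made up); the trailing '.' variants are an assumed data quirk
labels = pd.Series(
    ["<=50K", "<=50K", ">50K", "<=50K.", ">50K.", "<=50K", "<=50K", "<=50K."]
)

# Normalise the labels so '<=50K.' and '<=50K' count as the same class
normalised = labels.str.strip().str.rstrip(".")

# Share of each class in the data
class_share = normalised.value_counts(normalize=True)

# Always guessing the majority class gives this accuracy for free
majority_class = class_share.idxmax()
baseline_accuracy = class_share.max()

print(class_share)
print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.2f}")
```

Any model we build later should beat this baseline accuracy to be considered useful.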



## 4. Feature Distributions

In [None]:
features = df.columns
features

In [None]:
numeric_features = df.select_dtypes(include='number').columns
fig, axes = plt.subplots(3, 3, figsize=(15, 12))
axes = axes.flatten()

for i, col in enumerate(numeric_features):
    axes[i].hist(df[col], bins=50, edgecolor='black', alpha=0.7)
    axes[i].set_title(col)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('Frequency')

for j in range(i+1, len(axes)):
    axes[j].axis('off')

plt.suptitle('Numeric Feature Distributions', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

### Feature Distribution Observations
*TODO: Write your observations here*


## 5. Correlation Analysis

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()
sns.pairplot(df.select_dtypes(include=[np.number]))
plt.show()

In [None]:
scatter_features = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.flatten()

for i, feature in enumerate(scatter_features):
    axes[i].scatter(df[feature], df['income_numeric'], s=5, alpha=0.7)
    axes[i].set_title(f'{feature} vs Income')
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('Income')

plt.tight_layout()
plt.show()

In [None]:
# Calculate correlation matrix
correlation_matrix = df.corr(numeric_only=True)
correlation_matrix

In [None]:
# Rank the numeric features by their correlation with the target
correlation_matrix = df.corr(numeric_only=True)
correlation_matrix['income_numeric'].sort_values(ascending=True)

In [None]:
plt.figure(figsize=(10,8))
plt.scatter(df['education-num'],df['income_numeric'],s=5,alpha=0.7)
plt.xlabel('Education-num')
plt.ylabel('Income')
plt.show()

In [None]:
def count_outlier(column):
    """Count the rows of df[column] that fall outside the 1.5*IQR bounds."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return len(outliers)

In [None]:
numeric_cols = df.select_dtypes(include='number').columns
for col in numeric_cols:
    n_outliers = count_outlier(col)
    percent = (n_outliers / len(df)) * 100
    print(f'{col}: {n_outliers} outliers ({percent:.2f}%)')
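
One caveat with the 1.5×IQR rule used above: when most values in a column are identical, Q1 and Q3 are equal, the IQR is 0, and every other value is flagged. The sketch below shows this on two made-up columns (a zero-heavy one and a binary one); `iqr_outlier_count` is a hypothetical standalone version of `count_outlier` that takes a Series instead of a column name.

```python
import pandas as pd

def iqr_outlier_count(series: pd.Series) -> int:
    """Count values outside the 1.5*IQR bounds of a numeric Series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((series < lower) | (series > upper)).sum())

# Made-up zero-heavy column: nine zeros and one large value
zero_heavy = pd.Series([0] * 9 + [5000])

# Made-up binary column: eight 0s and two 1s
binary = pd.Series([0] * 8 + [1] * 2)

# In both cases Q1 == Q3 == 0, so the bounds collapse to [0, 0]
zero_heavy_flagged = iqr_outlier_count(zero_heavy)
binary_flagged = iqr_outlier_count(binary)

print(f"zero-heavy column: {zero_heavy_flagged} flagged")
print(f"binary column: {binary_flagged} flagged")
```

For columns like these, the count is closer to "number of values that differ from the most common one" than to a count of true outliers, so it should be read with care.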

### Correlation Observations
1. **Strongest predictor:**
    * `education-num` has the strongest correlation with `income_numeric` (0.34).
2. **Other important features:**
    * `age` (0.23)
    * `capital-gain` (0.22)
    * `hours-per-week` (0.23)
3. **Multicollinearity concerns:**
    * The data does not show any multicollinearity concerns.
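
The multicollinearity check can also be done without reading the heatmap by eye, by listing every pair of features whose absolute correlation is above a cutoff. This is a sketch on made-up data; `high_correlation_pairs` is a hypothetical helper and the 0.8 cutoff is only an illustrative assumption, not a fixed rule.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation is >= threshold."""
    corr = frame.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs >= threshold].sort_values(ascending=False)

# Made-up data: 'b' is almost a multiple of 'a', 'c' is unrelated to both
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],
    "c": [5, 1, 4, 2, 6, 3],
})

flagged_pairs = high_correlation_pairs(toy)
print(flagged_pairs)
```

An empty result on the real features would support the observation that there are no multicollinearity concerns. The target column should be left out of the frame before running the check.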


## 6. Key Findings Summary

#### In this first part of the project we have discovered:
* We have 48,842 rows and 15 columns of data.
* Of that data, there are 6 integer and 9 string data types.
* There are 29 duplicate rows and 2203 missing values, which will need to be dropped.
* Our target will be income, since that is what we are trying to predict.
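* The target is imbalanced: more individuals earn <=50K than >50K. Because the target is a two-class categorical variable, this is class imbalance rather than skew in the numeric sense.
* The target is binary, so outliers in the usual sense do not apply to it; the outlier counts listed below come from the 1.5×IQR rule applied to each numeric column.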
* The imbalance may bias predictive models toward predicting the majority class, which is <=50K.
* There are 216 outliers in `age` (0.44% of the data).
* There are 1453 outliers in `fnlwgt` (2.97% of the data).
* There are 1794 outliers in `education-num` (3.67% of the data).
* There are 4035 outliers in `capital-gain` (8.26% of the data).
* There are 2282 outliers in `capital-loss` (4.67% of the data).
* There are 13496 outliers in `hours-per-week` (27.63% of the data).
* There are 7841 values flagged in `income_numeric` (16.05% of the data); because this column takes only two values, the IQR rule flags one entire class rather than true outliers.
* Our strongest predictor is `education-num`, with a correlation of 0.34.
* Other important features include `age` (0.23), `capital-gain` (0.22), and `hours-per-week` (0.23).
* The data does not show any multicollinearity concerns.

## EDA Checklist
Before moving to modeling, ensure you've completed:
- [x] Loaded and examined the data
- [x] Checked data types
- [x] Identified and documented missing values
- [x] Analyzed target variable distribution
- [x] Examined feature distributions
- [x] Created correlation analysis
- [x] Documented key findings
- [x] Identified potential data quality issues