In [7]:
import pandas as pd

print("pandas version:", pd.__version__)



pandas version: 2.3.3


In [8]:
import pandas as pd

# 1) Read the raw data file from the data folder
df = pd.read_csv("C:/Users/priya/OneDrive/Desktop/adult-income-project/data/adult/adult.data", header=None)

# 2) Show first 5 rows to confirm it worked
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [9]:
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"
]

df.columns = columns
df.head()


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [10]:
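# Missing values are encoded as " ?" (question mark with a leading space)
df = df.replace(" ?", pd.NA)
# Drop incomplete rows, then renumber the index from 0
df = df.dropna()
df = df.reset_index(drop=True)
df.head()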


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [11]:
# Strip surrounding whitespace, then encode the target: >50K -> 1, <=50K -> 0
df["income"] = df["income"].str.strip()
df["income"] = df["income"].map({">50K": 1, "<=50K": 0})
df["income"].value_counts()



income
0    22654
1     7508
Name: count, dtype: int64

In [12]:
df.to_csv("C:/Users/priya/OneDrive/Desktop/adult-income-project/data/adult/cleaned_adult.csv", index=False)

# 01 — Data Preparation & Cleaning

This notebook prepares the raw Adult Census Income dataset for analysis and modeling. The raw file (`adult.data`) has no header row, marks missing values with `?`, and carries extra whitespace around its string values. The goal of this notebook is to convert the raw data into a clean, structured format that the later analysis and modeling notebooks can use directly.

---

## 1. Import Libraries and Load Raw Data

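The dataset was loaded using `pd.read_csv()` with `header=None` because the raw file does not include column names; without this option pandas would treat the first record as the header. The file was read from an absolute path, so the path must be adjusted when the notebook runs on another machine.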
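As a minimal sketch of the loading step, the example below reads a two-line sample with `header=None` from a folder given as a `pathlib.Path`. The helper `load_raw`, the temporary folder, and the sample text are illustrative assumptions, not part of the original notebook; the sample assumes the raw file separates fields with a comma followed by a space, which is consistent with the `" ?"` marker handled later.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Two sample records in the assumed raw layout (no header row, ", " separators).
SAMPLE = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)


def load_raw(data_dir: Path) -> pd.DataFrame:
    """Read adult.data from data_dir; header=None keeps the first record as data."""
    return pd.read_csv(data_dir / "adult.data", header=None)


# Hypothetical stand-in for the project's data folder.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "adult.data").write_text(SAMPLE)
    raw = load_raw(data_dir)

print(raw.shape)          # both records are kept as data rows
print(list(raw.columns))  # integer labels 0..14 until names are assigned
```

Passing the data folder as an argument keeps the machine-specific part of the path in one place.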

---

## 2. Assign Official Column Names

The Adult Income dataset contains 15 columns as defined by the UCI Machine Learning Repository. These column names were assigned manually to ensure clarity and consistency throughout the analysis.

Column list:
- age  
- workclass  
- fnlwgt  
- education  
- education-num  
- marital-status  
- occupation  
- relationship  
- race  
- sex  
- capital-gain  
- capital-loss  
- hours-per-week  
- native-country  
- income  

Assigning these names helps with readability and prepares the dataset for downstream processing.
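The same names can also be supplied at load time through the `names` parameter of `pd.read_csv()`, which avoids a separate assignment step. The sketch below uses an in-memory sample record (an assumption standing in for the real file) and checks that the list has the expected 15 entries.

```python
import io

import pandas as pd

COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income",
]

# One sample record in the assumed raw layout (illustrative only).
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
)

# names= labels the columns while reading; header=None keeps row 0 as data.
named = pd.read_csv(sample, header=None, names=COLUMNS)

# Guard against a typo that silently drops or duplicates a name.
assert len(COLUMNS) == 15 and len(set(COLUMNS)) == 15
print(named.columns.tolist())
```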

---

## 3. Handle Missing Values

Missing values in the raw dataset appear as `" ?"`, a question mark preceded by a space. These were replaced with `pd.NA` using the `replace()` function. Afterward, rows containing any missing value were removed using `dropna()`, and the index was reset so that row labels run consecutively from 0. This leaves 30,162 complete observations in the dataset.
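The sketch below reproduces this step on a four-row, four-column sample (made up for illustration) and counts the missing entries per column before dropping rows. It also shows a possible alternative, not used in the original notebook: letting `pd.read_csv()` strip the leading space with `skipinitialspace=True` and recognize `?` through `na_values`. Both parameters exist in pandas; the assumption is only that the raw file uses comma-plus-space separators.

```python
import io

import pandas as pd

# Illustrative sample: two complete rows and two rows with "?" entries.
RAW = (
    "39, State-gov, Adm-clerical, United-States\n"
    "54, ?, ?, United-States\n"
    "28, Private, Prof-specialty, ?\n"
    "38, Private, Handlers-cleaners, United-States\n"
)
cols = ["age", "workclass", "occupation", "native-country"]

# Approach described above: replace the " ?" marker, then drop incomplete rows.
df = pd.read_csv(io.StringIO(RAW), header=None, names=cols)
df = df.replace(" ?", pd.NA)
missing_per_column = df.isna().sum()
cleaned = df.dropna().reset_index(drop=True)

# Alternative sketch: handle whitespace and "?" while reading.
alt = pd.read_csv(
    io.StringIO(RAW), header=None, names=cols,
    skipinitialspace=True, na_values="?",
)
alt_cleaned = alt.dropna().reset_index(drop=True)

print(missing_per_column)
print(len(cleaned), len(alt_cleaned))
```

One difference worth noting: with `replace(" ?", ...)` the remaining string values keep their leading space (for example `" Private"`), whereas the `skipinitialspace=True` variant yields `"Private"`.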

---

## 4. Clean and Transform the Target Column

The `income` column contains two string categories:
- `<=50K`
- `>50K`

These labels were cleaned using `str.strip()` to remove surrounding whitespace. The values were then mapped to numerical labels:
- `0` → income ≤ 50K  
- `1` → income > 50K

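This binary encoding gives the classification models a numeric target. After mapping, the cleaned data contains 22,654 rows labeled `0` and 7,508 rows labeled `1`, so roughly one in four observations belongs to the positive class.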
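Because `Series.map()` turns any label missing from the dictionary into `NaN` without raising, it can help to verify the result. The sketch below wraps the two steps in a hypothetical helper, `encode_income`, that fails loudly on unexpected labels; the helper and the sample values are assumptions for illustration.

```python
import pandas as pd

LABELS = {"<=50K": 0, ">50K": 1}


def encode_income(income: pd.Series) -> pd.Series:
    """Strip whitespace, map labels to 0/1, and raise on anything unmapped."""
    stripped = income.str.strip()
    encoded = stripped.map(LABELS)
    unknown = stripped[encoded.isna()].unique().tolist()
    if unknown:
        raise ValueError(f"unmapped income labels: {unknown}")
    return encoded.astype("int64")


# Sample labels with the kind of padding found in the raw file.
sample_income = pd.Series([" <=50K", " >50K", " <=50K", " >50K "], name="income")
encoded_income = encode_income(sample_income)
print(encoded_income.tolist())  # [0, 1, 0, 1]
```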

---

## 5. Save the Cleaned Dataset

The final cleaned dataset was saved as:

`cleaned_adult.csv`

This file will be used for Exploratory Data Analysis (EDA) and Model Training in subsequent notebooks.
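A quick round-trip check can confirm that the saved file reloads as expected. The sketch below writes a small stand-in frame to a temporary folder (both are assumptions for illustration) with `index=False`, as in the notebook, and reads it back; writing without the index is what keeps an extra `Unnamed: 0` column from appearing on reload.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Small stand-in for the cleaned DataFrame.
frame = pd.DataFrame(
    {"age": [39, 50], "workclass": ["State-gov", "Private"], "income": [0, 1]}
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "cleaned_adult.csv"
    # index=False writes only the data columns.
    frame.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

round_trip_ok = reloaded.equals(frame)
print(round_trip_ok, reloaded.columns.tolist())
```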

---

## ✔️ Final Output

A clean, structured dataset that is ready for analysis and modeling, with:
- No missing values
- Properly labeled columns
- Binary target variable
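These properties can be checked programmatically before moving on to the next notebook. The function `check_cleaned` below is a hypothetical helper sketched for this purpose, shown here on a single made-up row rather than the real data.

```python
import pandas as pd

EXPECTED_COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income",
]


def check_cleaned(df: pd.DataFrame) -> None:
    """Raise AssertionError if df violates one of the properties listed above."""
    assert list(df.columns) == EXPECTED_COLUMNS, "unexpected column names"
    assert not df.isna().any().any(), "missing values remain"
    assert set(df["income"].unique()) <= {0, 1}, "target is not binary"


# One illustrative row that satisfies all three checks.
row = [39, "State-gov", 77516, "Bachelors", 13, "Never-married", "Adm-clerical",
       "Not-in-family", "White", "Male", 2174, 0, 40, "United-States", 0]
sample_clean = pd.DataFrame([row], columns=EXPECTED_COLUMNS)
check_cleaned(sample_clean)  # passes silently
```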
