## Project Goal

The goal of this project is to apply data mining techniques to the **Adult Income dataset**.  
Specifically, the objective is to perform **classification** (to predict whether an individual's income is `<=50K` or `>50K`) and **clustering** (to group individuals with similar characteristics).  

Income prediction matters because businesses and social researchers use it to inform decision-making and policy planning.


## Dataset Source

The dataset was obtained from Kaggle:  
[Adult Income Dataset](https://www.kaggle.com/datasets/wenruliu/adult-income-dataset)  


## Dataset Description

The Adult Income dataset contains **48,842 instances (rows)** and **15 attributes (columns)**.  
These attributes describe demographic and work-related information about individuals.  

- **age** (numeric)  
- **workclass** (categorical)  
- **fnlwgt** (numeric)  
- **education** (categorical)  
- **educational-num** (numeric)  
- **marital-status** (categorical)  
- **occupation** (categorical)  
- **relationship** (categorical)  
- **race** (categorical)  
- **gender** (categorical)  
- **capital-gain** (numeric)  
- **capital-loss** (numeric)  
- **hours-per-week** (numeric)  
- **native-country** (categorical)  
- **income** (target class: `<=50K`, `>50K`)  

The **target attribute** is `income`, which has two classes:  
- `<=50K`  
- `>50K`  

The dataset is imbalanced: 37,155 of the 48,842 records (about 76%) belong to the `<=50K` class, while 11,687 (about 24%) belong to `>50K`.  
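
The numeric/categorical split listed above can also be derived from the column dtypes instead of being typed by hand. The following is a minimal sketch, not the project's actual preprocessing: `sample` is a hypothetical three-row stand-in for `adult.csv` with made-up values, and `TARGET`, `numeric_cols`, and `categorical_cols` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for adult.csv (values are made up for illustration).
sample = pd.DataFrame({
    "age": [25, 38, 28],
    "workclass": ["Private", "Private", "Local-gov"],
    "hours-per-week": [40, 50, 40],
    "income": ["<=50K", "<=50K", ">50K"],
})

TARGET = "income"

# Keep the target out of the feature lists.
features = sample.drop(columns=[TARGET])

# Numeric columns are those with a number dtype; everything else is treated as categorical.
numeric_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

Applied to the full DataFrame loaded from `adult.csv`, the same two `select_dtypes` calls should reproduce the six numeric and eight categorical feature columns described in this section, assuming the file is read with default dtypes.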



In [11]:
import pandas as pd
data = pd.read_csv("adult.csv")


In [12]:
print("Number of records:", data.shape[0])
print("Number of attributes:", data.shape[1])


Number of records: 48842
Number of attributes: 15


In [13]:
data.info()  # info() prints its summary directly and returns None, so no print() is needed


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        48842 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational-num  48842 non-null  int64 
 5   marital-status   48842 non-null  object
 6   occupation       48842 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital-gain     48842 non-null  int64 
 11  capital-loss     48842 non-null  int64 
 12  hours-per-week   48842 non-null  int64 
 13  native-country   48842 non-null  object
 14  income           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [14]:
print(data["income"].value_counts())


income
<=50K    37155
>50K     11687
Name: count, dtype: int64
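
The counts above can be turned into class shares to quantify the imbalance. This is a small sketch in plain Python that hard-codes the two counts printed by `value_counts()`; the helper `class_shares` and the variable names are illustrative assumptions, not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
class_counts = {"<=50K": 37155, ">50K": 11687}


def class_shares(counts):
    """Return each class's fraction of the total number of records."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


shares = class_shares(class_counts)

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(shares.values())

# How many majority-class records exist per minority-class record.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

The majority-class share is a useful reference point: a classifier that always predicts `<=50K` already reaches roughly 76% accuracy on this data, so accuracy alone can overstate how well a model separates the two classes.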


  
Sample of the dataset (first 5 rows):


In [15]:
print("Sample of the dataset:")
print(data.head())


Sample of the dataset:
   age  workclass  fnlwgt     education  educational-num      marital-status  \
0   25    Private  226802          11th                7       Never-married   
1   38    Private   89814       HS-grad                9  Married-civ-spouse   
2   28  Local-gov  336951    Assoc-acdm               12  Married-civ-spouse   
3   44    Private  160323  Some-college               10  Married-civ-spouse   
4   18          ?  103497  Some-college               10       Never-married   

          occupation relationship   race  gender  capital-gain  capital-loss  \
0  Machine-op-inspct    Own-child  Black    Male             0             0   
1    Farming-fishing      Husband  White    Male             0             0   
2    Protective-serv      Husband  White    Male             0             0   
3  Machine-op-inspct      Husband  Black    Male          7688             0   
4                  ?    Own-child  White  Female             0             0   

   hours-per-we