# Adult dataset
## URL: https://archive.ics.uci.edu/dataset/2/adult
### ***Introduction***
| This data was extracted from the census bureau database found at http://www.census.gov/ftp/pub/DES/www/welcome.html  
| Donor: Ronny Kohavi and Barry Becker  
| 48842 instances, mix of continuous and discrete    (train=32561, test=16281)  
| 45222 if instances with unknown values are removed (train=30162, test=15060)  
| Duplicate or conflicting instances : 6  
| Class probabilities for adult.all file  
| Probability for the label '>50K'  : 23.93% / 24.78% (without unknowns)  
| Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)  
|  
| Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))  
|  
| Prediction task is to determine whether a person makes over 50K a year.  
|    
| Conversion of original data as follows:   
| 1. Discretized agrossincome into two ranges with threshold 50,000.   
| 2. Convert U.S. to US to avoid periods.   
| 3. Convert Unknown to "?"   
| 4. Run MLC++ GenCVFiles to generate data,test.   
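
The extraction conditions and conversion step 1 can be sketched as a row filter followed by a threshold split. This is only an illustration, not the original extraction code: the records below are made up, and the column name `agrossincome` is taken from step 1 on the assumption that it is an ordinary numeric column.

```python
import pandas as pd

# Hypothetical raw records (values invented for illustration).
raw = pd.DataFrame({
    "AAGE": [25, 16, 40, 52, 45],
    "AGI": [30000, 500, 80, 72000, 90000],
    "AFNLWGT": [2000, 1500, 1800, 1, 3000],
    "HRSWK": [40, 10, 35, 45, 50],
    "agrossincome": [30000, 500, 80, 72000, 90000],
})

# (AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)
mask = (
    (raw["AAGE"] > 16)
    & (raw["AGI"] > 100)
    & (raw["AFNLWGT"] > 1)
    & (raw["HRSWK"] > 0)
)
clean = raw[mask].copy()

# Step 1: two ranges with threshold 50,000 (exactly 50,000 falls in "<=50K").
clean["income"] = clean["agrossincome"].gt(50000).map({True: ">50K", False: "<=50K"})

print(clean[["AAGE", "agrossincome", "income"]])
```

Only the first and last rows pass all four conditions; the others fail on age, AGI, and final weight respectively.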
### ***Our work***   
We download the Adult dataset from the UCI Machine Learning Repository and preprocess it, for example by imputing missing values, removing duplicate features, and correcting erroneous values, to obtain a relatively clean census dataset. The cleaned data is used to predict income level, to assist the generation of financial data, and to address privacy protection problems.      
### ***Target:***  
+ **income**: >50K, <=50K, >50K., <=50K.  
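
The four raw target values describe only two classes, since two of them differ from the others by a trailing period. A minimal sketch of collapsing them into a binary label, using a made-up `raw_labels` series rather than the real column:

```python
import pandas as pd

# The four label spellings listed above (illustrative series, not the dataset).
raw_labels = pd.Series([">50K", "<=50K", ">50K.", "<=50K."])

# Strip the trailing period so both spellings of each class agree.
normalized = raw_labels.str.rstrip(".")

# Encode ">50K" as 1 and "<=50K" as 0.
binary = (normalized == ">50K").astype(int)

print(pd.DataFrame({"raw": raw_labels, "normalized": normalized, "binary": binary}))
```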
### ***Features:***   
+ **age**: continuous.   
+ **workclass**: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.   
+ **fnlwgt**: continuous.   
+ **education**: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.    
+ **education-num**: continuous.   
+ **marital-status**: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.   
+ **occupation**: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.   
+ **relationship**: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.  
+ **race**: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.   
+ **sex**: Female, Male.  
+ **capital-gain**: continuous.   
+ **capital-loss**: continuous.   
+ **hours-per-week**: continuous.   
+ **native-country**: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.    
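
Because the processing step removes duplicate features, it can help to check whether two columns carry the same information before dropping one of them. The sketch below tests for a one-to-one mapping between `education` and `education-num` on a small toy frame; the helper `is_one_to_one` is hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy rows for illustration; the real check would run on the full DataFrame.
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Bachelors", "Masters", "HS-grad"],
    "education-num": [13, 9, 13, 14, 9],
})

def is_one_to_one(frame, a, b):
    """Return True if every value of column a pairs with exactly one value of b, and vice versa."""
    pairs = frame[[a, b]].drop_duplicates()
    return bool(pairs[a].is_unique and pairs[b].is_unique)

redundant = is_one_to_one(toy, "education", "education-num")
print(redundant)
```

If the check holds on the full data, one of the two columns can be dropped without losing information.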
### ***Reference:***
[1] B. Becker and R. Kohavi (1996). Adult [Dataset]. UCI Machine Learning Repository. Available: https://archive.ics.uci.edu/dataset/2/adult     

***Time: 2024/11/23 12:50***  
***Author: Chuang Liu***  
***Email: LIUC0316@126.COM***  
***File: Adult_Processing.ipynb***  
***Notebook: Jupyter***   

In [1]:
from ucimlrepo import fetch_ucirepo
# fetch dataset
adult = fetch_ucirepo(id=2)
# metadata
print(adult.metadata)
# variable information
print(adult.variables)

# data (as pandas dataframes)
X = adult.data.features
y = adult.data.targets

{'uci_id': 2, 'name': 'Adult', 'repository_url': 'https://archive.ics.uci.edu/dataset/2/adult', 'data_url': 'https://archive.ics.uci.edu/static/public/2/data.csv', 'abstract': 'Predict whether annual income of an individual exceeds $50K/yr based on census data. Also known as "Census Income" dataset. ', 'area': 'Social Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 48842, 'num_features': 14, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Age', 'Income', 'Education Level', 'Other', 'Race', 'Sex'], 'target_col': ['income'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 1996, 'last_updated': 'Tue Sep 24 2024', 'dataset_doi': '10.24432/C5XW20', 'creators': ['Barry Becker', 'Ronny Kohavi'], 'intro_paper': None, 'additional_info': {'summary': "Extraction was done by Barry Becker from the 1994 Census database.  A set of reasonably clean records was extracted using the fol

In [2]:
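import pandas as pd
# combine features and target into one DataFrame
df = pd.concat([X, y], axis=1)
# drop education-num, which is treated here as a duplicate of education
df.drop(['education-num'], axis=1, inplace=True)
df.head()
df.info()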

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       47879 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   marital-status  48842 non-null  object
 5   occupation      47876 non-null  object
 6   relationship    48842 non-null  object
 7   race            48842 non-null  object
 8   sex             48842 non-null  object
 9   capital-gain    48842 non-null  int64 
 10  capital-loss    48842 non-null  int64 
 11  hours-per-week  48842 non-null  int64 
 12  native-country  48568 non-null  object
 13  income          48842 non-null  object
dtypes: int64(5), object(9)
memory usage: 5.2+ MB
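
The non-null counts in the summary above translate directly into missing-value counts. The arithmetic below uses only the numbers printed by `df.info()`; note that it covers NaN entries only, so any literal "?" placeholders still count as non-null at this stage.

```python
# Numbers copied from the df.info() output above.
total_rows = 48842
non_null = {"workclass": 47879, "occupation": 47876, "native-country": 48568}

# Missing entries per column, and their share of all rows in percent.
missing = {col: total_rows - n for col, n in non_null.items()}
missing_share = {col: round(100 * m / total_rows, 2) for col, m in missing.items()}

for col in non_null:
    print(f"{col}: {missing[col]} missing ({missing_share[col]}%)")
```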


In [3]:
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
# mark "?" placeholders as missing (np.nan; pd.NaT is meant for datetimes)
df.replace("?", np.nan, inplace=True)
# map the income labels to "1" (>50K, >50K.) and "0" (<=50K, <=50K.)
df["income"] = df["income"].replace({r"^>.+": "1", r"^<=.+": "0"}, regex=True)
# fill missing categorical values with the column mode
trans = {col: df[col].mode()[0] for col in ["workclass", "occupation", "native-country"]}
df.fillna(trans, inplace=True)

# categorical (object-dtype) columns
object_cols = [col for col in df.columns if df[col].dtype == "object"]

# apply ordinal encoder
my_ordinalEncoder = OrdinalEncoder()
df[object_cols] = my_ordinalEncoder.fit_transform(df[object_cols])
df.head()
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             48842 non-null  int64  
 1   workclass       48842 non-null  float64
 2   fnlwgt          48842 non-null  int64  
 3   education       48842 non-null  float64
 4   marital-status  48842 non-null  float64
 5   occupation      48842 non-null  float64
 6   relationship    48842 non-null  float64
 7   race            48842 non-null  float64
 8   sex             48842 non-null  float64
 9   capital-gain    48842 non-null  int64  
 10  capital-loss    48842 non-null  int64  
 11  hours-per-week  48842 non-null  int64  
 12  native-country  48842 non-null  float64
 13  income          48842 non-null  float64
dtypes: float64(9), int64(5)
memory usage: 5.2 MB


| statistic | age | workclass | fnlwgt | education | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 48842.0 | 48842.0 | 48842.0 | 48842.0 | 48842.0 | 48842.0 | 48842.0 | 48842.0 | 48842.0 | 48842.0 | 48842.0 | 48842.0 | 48842.0 | 48842.0 |
| mean | 38.643585 | 3.099668 | 189664.1 | 10.28842 | 2.61875 | 6.152819 | 1.443287 | 3.668052 | 0.668482 | 1079.067626 | 87.502314 | 40.422382 | 36.433664 | 0.239282 |
| std | 13.71051 | 1.11081 | 105604.0 | 3.874492 | 1.507703 | 3.968837 | 1.602151 | 0.845986 | 0.470764 | 7452.019058 | 403.004552 | 12.391444 | 6.031536 | 0.426649 |
| min | 17.0 | 0.0 | 12285.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 25% | 28.0 | 3.0 | 117550.5 | 9.0 | 2.0 | 3.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 40.0 | 38.0 | 0.0 |
| 50% | 37.0 | 3.0 | 178144.5 | 11.0 | 2.0 | 7.0 | 1.0 | 4.0 | 1.0 | 0.0 | 0.0 | 40.0 | 38.0 | 0.0 |
| 75% | 48.0 | 3.0 | 237642.0 | 12.0 | 4.0 | 9.0 | 3.0 | 4.0 | 1.0 | 0.0 | 0.0 | 45.0 | 38.0 | 0.0 |
| max | 90.0 | 7.0 | 1490400.0 | 15.0 | 6.0 | 13.0 | 5.0 | 4.0 | 1.0 | 99999.0 | 4356.0 | 99.0 | 40.0 | 1.0 |
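
After ordinal encoding, the category codes in the summary above are only meaningful together with the fitted encoder. The sketch below repeats the mode imputation and encoding on a made-up two-column frame and shows how `categories_` and `inverse_transform` recover the original labels; the toy values are assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with one "?" placeholder and one NaN (illustrative values only).
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "Private", np.nan],
    "sex": ["Male", "Female", "Female", "Male", "Male"],
})

# Treat "?" as missing, then fill with the most frequent value (the mode).
toy = toy.replace("?", np.nan)
toy = toy.fillna({"workclass": toy["workclass"].mode()[0]})

# Fit the encoder; each column's categories are sorted, and the code is the index.
encoder = OrdinalEncoder()
encoded = encoder.fit_transform(toy[["workclass", "sex"]])

# Map the numeric codes back to the original labels.
decoded = encoder.inverse_transform(encoded)

print(encoder.categories_)
print(encoded)
```

Keeping the fitted encoder (or at least its `categories_`) alongside the saved data makes the codes interpretable later. As a consistency check, the `income` mean of 0.239282 in the summary matches the 23.93% share of '>50K' quoted in the introduction.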


In [4]:
# save the data as csv file
# df.to_csv("Adult.csv", index=False)
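
The commented-out export uses `index=False`, which keeps the row index out of the file so that reloading does not create an extra unnamed column. A small round-trip sketch on an in-memory buffer, with a made-up two-row frame standing in for `df`:

```python
import io
import pandas as pd

# Stand-in for the processed DataFrame (illustrative values only).
toy = pd.DataFrame({"age": [39, 50], "income": [0.0, 1.0]})

# Write without the index, then read the same text back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.columns.tolist())
print(restored.equals(toy))
```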