# Data Quality Analysis on Adult Dataset
#### Using YData Profiling

# Note : Displaying the report at the end to avoid interrupting the notebook flow.

##### **Importing necessary lib and dataset**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import os
import gc
from IPython.core.display import display, HTML
warnings.filterwarnings( "ignore",    message=".*should_run_async.*",category=DeprecationWarning)
warnings.filterwarnings("ignore")
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/adult/adult.data
/kaggle/input/adult/adult.names
/kaggle/input/adult/Index
/kaggle/input/adult/old.adult.names
/kaggle/input/adult/adult.test
/kaggle/input/profiling-report-html/profiling_report.html


In [2]:
from ydata_profiling import ProfileReport

## Data loading and basic data cleaning like data type changing 

In [3]:
column_names = [
    'age', 'workclass', "fnlwgt",'education', 'education-num', 'marital-status', 
    'occupation', 'relationship', 'race', 'sex', 'capital-gain', 
    'capital-loss', 'hours-per-week', 'native-country', 'income'
]

adult_data = pd.read_csv("/kaggle/input/adult/adult.data", names=column_names)
adult_test = pd.read_csv("/kaggle/input/adult/adult.test", names=column_names)
adult_test.drop(index=0,inplace=True)

adult_data_full = pd.concat([adult_data, adult_test], axis=0)

adult_data_full.index = adult_data_full.index.astype(str)

int_columns = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
for col in int_columns:
    adult_data_full[col] = adult_data_full[col].astype(int)

In [4]:
adult_data_full.sample(5)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
14297,44,Private,36271,Bachelors,13,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,<=50K
8766,52,Private,161482,Some-college,10,Divorced,Adm-clerical,Not-in-family,White,Female,0,0,40,United-States,<=50K.
991,41,Private,194636,Assoc-voc,11,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,<=50K
4356,51,Private,126528,HS-grad,9,Separated,Craft-repair,Not-in-family,White,Male,0,0,60,United-States,<=50K.
32246,32,Self-emp-not-inc,116508,Some-college,10,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,>50K


In [5]:
adult_data_full.info()

<class 'pandas.core.frame.DataFrame'>
Index: 48842 entries, 0 to 16281
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       48842 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      48842 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48842 non-null  object
 14  income          48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 6.0+ MB


In [6]:
string_columns = adult_data_full.select_dtypes(include=['object']).columns

for col in string_columns:
    adult_data[col] = adult_data[col].str.strip()

In [7]:
adult_data_full.shape

(48842, 15)

In [8]:
profile_adult_data = ProfileReport(adult_data_full, title="Profiling Report")

## Data Profiling Using YData Profiling

In [9]:
# profile_adult_data

In [10]:
profile_adult_data.to_file("profiling_report.html")

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]

### Reading it after saving due to some time notebook get freeze

In [11]:
# with open("/kaggle/working/profiling_report.html", "r", encoding="utf-8") as f:
#     report = f.read()

# display(HTML(report))

**column-wise summary** based on Y Data profiling report. 
---

### 🧾 **Column-Wise Summary**

| Column Name       | Data Type | Missing (%) | Unique Values | Notes / Issues | Possible Insights |
|-------------------|-----------|-------------|----------------|----------------|----------------|
| **age**           | float64   | 0%       | 74       |NO  | Can be Fillid with mean/median if needed |
| **sex**        | object    | 0%          | 2              | Categorical (Male/Female) | Encode using LabelEncoder or one-hot |
| **workclass**          | object    | 0%      | 9         | dominated by private also question mark(?) is there | ? can be replaced from NA |
| **education**        | object   | 0%         | 16     | Dominated by HS-grad |  |
| **education-num**         | int64     | 0%          | 16  | Good distribution peak at 9 |  |
| **marital-status**   | object    | 0%          | 7   | Looks good|  |
| **occupation**     | Categorical      | 0%          | 15  | ? is there as Catagory | |
| **relationship**      | relationship    | 0%        | 6 || |
| **race**      | Categorical    | 0%          | 6     | White Dominated     |  |
| **capital-gain** | float64   | 0%        | 123         | 44807 zeroes are there| Impute or consider dropping if not informative |
| **capital-loss**        | float64    | 0%      |99  | 46560 zeros | Fill  or drop if not relevant |
| **hours-per-week**  | float64       | 0%          | 96            | distribution between 1 to 99 |  |
| **native-country**  | Categorical       | 0%          | 42            | Dominated by US |  |
| **income**  | Categorical       | 0%          | 4            | looks like some typing mistake should only be 2 catagires but here is <=50K	<=50K.	>50K	>50K.| Make them 2 |

---



# Improving the Data quality

In [12]:
import pandas as pd
import numpy as np

In [13]:
# 1. Normalize income labels (fix variations like '>50K.', '<=50K ', etc.)
adult_data_full['income'] = adult_data_full['income'].str.strip().replace({'>50K.': '>50K', '<=50K.': '<=50K'})

# 2. Replace '?' with NaN
adult_data_full.replace('?', np.nan, inplace=True)

# 3. Remove duplicate rows
adult_data_full = adult_data_full.drop_duplicates()

In [14]:
adult_data_full.shape

(48790, 15)

In [15]:
adult_data_full['income'].unique()

# Count of duplicate rows
num_duplicates = adult_data_full.duplicated().sum()
print(f"Number of duplicate rows: {num_duplicates}")

# Number of NaN (missing) values per column
nan_counts = adult_data_full.isna().sum()
print("\nNumber of NaN values per column:")
print(nan_counts)

Number of duplicate rows: 0

Number of NaN values per column:
age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64


# Showing the report and the end due to the report is taking part in the notebook

In [16]:
with open("/kaggle/working/profiling_report.html", "r", encoding="utf-8") as f:
    report = f.read()

display(HTML(report))

0,1
Number of variables,15
Number of observations,48842
Missing cells,0
Missing cells (%),0.0%
Duplicate rows,28
Duplicate rows (%),0.1%
Total size in memory,7.0 MiB
Average record size in memory,149.6 B

0,1
Numeric,6
Categorical,9

0,1
Dataset has 28 (0.1%) duplicate rows,Duplicates
race is highly imbalanced (65.8%),Imbalance
native-country is highly imbalanced (82.7%),Imbalance
capital-gain has 44807 (91.7%) zeros,Zeros
capital-loss has 46560 (95.3%) zeros,Zeros

0,1
Analysis started,2025-04-15 05:10:33.881554
Analysis finished,2025-04-15 05:10:43.610363
Duration,9.73 seconds
Software version,ydata-profiling vv4.12.2
Download configuration,config.json

0,1
Distinct,74
Distinct (%),0.2%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,38.643585

0,1
Minimum,17
Maximum,90
Zeros,0
Zeros (%),0.0%
Negative,0
Negative (%),0.0%
Memory size,1.8 MiB

0,1
Minimum,17
5-th percentile,19
Q1,28
median,37
Q3,48
95-th percentile,63
Maximum,90
Range,73
Interquartile range (IQR),20

0,1
Standard deviation,13.71051
Coefficient of variation (CV),0.35479394
Kurtosis,-0.18426874
Mean,38.643585
Median Absolute Deviation (MAD),10
Skewness,0.55758032
Sum,1887430
Variance,187.97808
Monotonicity,Not monotonic

Value,Count,Frequency (%)
36,1348,2.8%
35,1337,2.7%
33,1335,2.7%
23,1329,2.7%
31,1325,2.7%
34,1303,2.7%
28,1280,2.6%
37,1280,2.6%
30,1278,2.6%
38,1264,2.6%

Value,Count,Frequency (%)
17,595,1.2%
18,862,1.8%
19,1053,2.2%
20,1113,2.3%
21,1096,2.2%
22,1178,2.4%
23,1329,2.7%
24,1206,2.5%
25,1195,2.4%
26,1153,2.4%

Value,Count,Frequency (%)
90,55,0.1%
89,2,< 0.1%
88,6,< 0.1%
87,3,< 0.1%
86,1,< 0.1%
85,5,< 0.1%
84,13,< 0.1%
83,11,< 0.1%
82,15,< 0.1%
81,37,0.1%

0,1
Distinct,9
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Memory size,1.8 MiB

0,1
Private,33906
Self-emp-not-inc,3862
Local-gov,3136
?,2799
State-gov,1981
Other values (4),3158

0,1
Max length,17.0
Median length,8.0
Mean length,8.8708693
Min length,2.0

0,1
Total characters,433271
Distinct characters,29
Distinct categories,1 ?
Distinct scripts,1 ?
Distinct blocks,1 ?

0,1
Unique,0 ?
Unique (%),0.0%

0,1
1st row,State-gov
2nd row,Self-emp-not-inc
3rd row,Private
4th row,Private
5th row,Private

Value,Count,Frequency (%)
Private,33906,69.4%
Self-emp-not-inc,3862,7.9%
Local-gov,3136,6.4%
?,2799,5.7%
State-gov,1981,4.1%
Self-emp-inc,1695,3.5%
Federal-gov,1432,2.9%
Without-pay,21,< 0.1%
Never-worked,10,< 0.1%

Value,Count,Frequency (%)
private,33906,69.4%
self-emp-not-inc,3862,7.9%
local-gov,3136,6.4%
,2799,5.7%
state-gov,1981,4.1%
self-emp-inc,1695,3.5%
federal-gov,1432,2.9%
without-pay,21,< 0.1%
never-worked,10,< 0.1%

Value,Count,Frequency (%)
e,49895,11.5%
,48842,11.3%
t,41772,9.6%
a,40476,9.3%
v,40465,9.3%
i,39484,9.1%
r,35358,8.2%
P,33906,7.8%
-,21556,5.0%
o,13578,3.1%

Value,Count,Frequency (%)
(unknown),433271,100.0%

Value,Count,Frequency (%)
e,49895,11.5%
,48842,11.3%
t,41772,9.6%
a,40476,9.3%
v,40465,9.3%
i,39484,9.1%
r,35358,8.2%
P,33906,7.8%
-,21556,5.0%
o,13578,3.1%

Value,Count,Frequency (%)
(unknown),433271,100.0%

Value,Count,Frequency (%)
e,49895,11.5%
,48842,11.3%
t,41772,9.6%
a,40476,9.3%
v,40465,9.3%
i,39484,9.1%
r,35358,8.2%
P,33906,7.8%
-,21556,5.0%
o,13578,3.1%

Value,Count,Frequency (%)
(unknown),433271,100.0%

Value,Count,Frequency (%)
e,49895,11.5%
,48842,11.3%
t,41772,9.6%
a,40476,9.3%
v,40465,9.3%
i,39484,9.1%
r,35358,8.2%
P,33906,7.8%
-,21556,5.0%
o,13578,3.1%

0,1
Distinct,28523
Distinct (%),58.4%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,189664.13

0,1
Minimum,12285
Maximum,1490400
Zeros,0
Zeros (%),0.0%
Negative,0
Negative (%),0.0%
Memory size,1.8 MiB

0,1
Minimum,12285.0
5-th percentile,39615.4
Q1,117550.5
median,178144.5
Q3,237642.0
95-th percentile,379481.65
Maximum,1490400.0
Range,1478115.0
Interquartile range (IQR),120091.5

0,1
Standard deviation,105604.03
Coefficient of variation (CV),0.55679491
Kurtosis,6.0578482
Mean,189664.13
Median Absolute Deviation (MAD),60295.5
Skewness,1.4388919
Sum,9.2635757 × 109
Variance,1.115221 × 1010
Monotonicity,Not monotonic

Value,Count,Frequency (%)
203488,21,< 0.1%
120277,19,< 0.1%
190290,19,< 0.1%
125892,18,< 0.1%
126569,18,< 0.1%
126675,17,< 0.1%
113364,17,< 0.1%
99185,17,< 0.1%
186934,16,< 0.1%
111567,16,< 0.1%

Value,Count,Frequency (%)
12285,1,< 0.1%
13492,1,< 0.1%
13769,3,< 0.1%
13862,1,< 0.1%
14878,1,< 0.1%
18827,1,< 0.1%
19214,1,< 0.1%
19302,6,< 0.1%
19395,2,< 0.1%
19410,2,< 0.1%

Value,Count,Frequency (%)
1490400,1,< 0.1%
1484705,1,< 0.1%
1455435,1,< 0.1%
1366120,1,< 0.1%
1268339,1,< 0.1%
1226583,1,< 0.1%
1210504,1,< 0.1%
1184622,1,< 0.1%
1161363,1,< 0.1%
1125613,1,< 0.1%

0,1
Distinct,16
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Memory size,1.8 MiB

0,1
HS-grad,15784
Some-college,10878
Bachelors,8025
Masters,2657
Assoc-voc,2061
Other values (11),9437

0,1
Max length,13.0
Median length,12.0
Mean length,9.4220753
Min length,4.0

0,1
Total characters,460193
Distinct characters,32
Distinct categories,1 ?
Distinct scripts,1 ?
Distinct blocks,1 ?

0,1
Unique,0 ?
Unique (%),0.0%

0,1
1st row,Bachelors
2nd row,Bachelors
3rd row,HS-grad
4th row,11th
5th row,Bachelors

Value,Count,Frequency (%)
HS-grad,15784,32.3%
Some-college,10878,22.3%
Bachelors,8025,16.4%
Masters,2657,5.4%
Assoc-voc,2061,4.2%
11th,1812,3.7%
Assoc-acdm,1601,3.3%
10th,1389,2.8%
7th-8th,955,2.0%
Prof-school,834,1.7%

Value,Count,Frequency (%)
hs-grad,15784,32.3%
some-college,10878,22.3%
bachelors,8025,16.4%
masters,2657,5.4%
assoc-voc,2061,4.2%
11th,1812,3.7%
assoc-acdm,1601,3.3%
10th,1389,2.8%
7th-8th,955,2.0%
prof-school,834,1.7%

Value,Count,Frequency (%)
,48842,10.6%
e,43993,9.6%
o,39360,8.6%
-,32869,7.1%
l,30698,6.7%
a,28661,6.2%
r,27977,6.1%
c,27738,6.0%
S,26662,5.8%
g,26662,5.8%

Value,Count,Frequency (%)
(unknown),460193,100.0%

Value,Count,Frequency (%)
,48842,10.6%
e,43993,9.6%
o,39360,8.6%
-,32869,7.1%
l,30698,6.7%
a,28661,6.2%
r,27977,6.1%
c,27738,6.0%
S,26662,5.8%
g,26662,5.8%

Value,Count,Frequency (%)
(unknown),460193,100.0%

Value,Count,Frequency (%)
,48842,10.6%
e,43993,9.6%
o,39360,8.6%
-,32869,7.1%
l,30698,6.7%
a,28661,6.2%
r,27977,6.1%
c,27738,6.0%
S,26662,5.8%
g,26662,5.8%

Value,Count,Frequency (%)
(unknown),460193,100.0%

Value,Count,Frequency (%)
,48842,10.6%
e,43993,9.6%
o,39360,8.6%
-,32869,7.1%
l,30698,6.7%
a,28661,6.2%
r,27977,6.1%
c,27738,6.0%
S,26662,5.8%
g,26662,5.8%

0,1
Distinct,16
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,10.078089

0,1
Minimum,1
Maximum,16
Zeros,0
Zeros (%),0.0%
Negative,0
Negative (%),0.0%
Memory size,1.8 MiB

0,1
Minimum,1
5-th percentile,5
Q1,9
median,10
Q3,12
95-th percentile,14
Maximum,16
Range,15
Interquartile range (IQR),3

0,1
Standard deviation,2.5709728
Coefficient of variation (CV),0.2551052
Kurtosis,0.62574527
Mean,10.078089
Median Absolute Deviation (MAD),1
Skewness,-0.31652486
Sum,492234
Variance,6.6099009
Monotonicity,Not monotonic

Value,Count,Frequency (%)
9,15784,32.3%
10,10878,22.3%
13,8025,16.4%
14,2657,5.4%
11,2061,4.2%
7,1812,3.7%
12,1601,3.3%
6,1389,2.8%
4,955,2.0%
15,834,1.7%

Value,Count,Frequency (%)
1,83,0.2%
2,247,0.5%
3,509,1.0%
4,955,2.0%
5,756,1.5%
6,1389,2.8%
7,1812,3.7%
8,657,1.3%
9,15784,32.3%
10,10878,22.3%

Value,Count,Frequency (%)
16,594,1.2%
15,834,1.7%
14,2657,5.4%
13,8025,16.4%
12,1601,3.3%
11,2061,4.2%
10,10878,22.3%
9,15784,32.3%
8,657,1.3%
7,1812,3.7%

0,1
Distinct,7
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Memory size,1.8 MiB

0,1
Married-civ-spouse,22379
Never-married,16117
Divorced,6633
Separated,1530
Widowed,1518
Other values (2),665

0,1
Max length,22.0
Median length,19.0
Mean length,15.406044
Min length,8.0

0,1
Total characters,752462
Distinct characters,25
Distinct categories,1 ?
Distinct scripts,1 ?
Distinct blocks,1 ?

0,1
Unique,0 ?
Unique (%),0.0%

0,1
1st row,Never-married
2nd row,Married-civ-spouse
3rd row,Divorced
4th row,Married-civ-spouse
5th row,Married-civ-spouse

Value,Count,Frequency (%)
Married-civ-spouse,22379,45.8%
Never-married,16117,33.0%
Divorced,6633,13.6%
Separated,1530,3.1%
Widowed,1518,3.1%
Married-spouse-absent,628,1.3%
Married-AF-spouse,37,0.1%

Value,Count,Frequency (%)
married-civ-spouse,22379,45.8%
never-married,16117,33.0%
divorced,6633,13.6%
separated,1530,3.1%
widowed,1518,3.1%
married-spouse-absent,628,1.3%
married-af-spouse,37,0.1%

Value,Count,Frequency (%)
e,106278,14.1%
r,102602,13.6%
i,69691,9.3%
-,62205,8.3%
d,50360,6.7%
,48842,6.5%
s,46716,6.2%
v,45129,6.0%
a,42849,5.7%
o,31195,4.1%

Value,Count,Frequency (%)
(unknown),752462,100.0%

Value,Count,Frequency (%)
e,106278,14.1%
r,102602,13.6%
i,69691,9.3%
-,62205,8.3%
d,50360,6.7%
,48842,6.5%
s,46716,6.2%
v,45129,6.0%
a,42849,5.7%
o,31195,4.1%

Value,Count,Frequency (%)
(unknown),752462,100.0%

Value,Count,Frequency (%)
e,106278,14.1%
r,102602,13.6%
i,69691,9.3%
-,62205,8.3%
d,50360,6.7%
,48842,6.5%
s,46716,6.2%
v,45129,6.0%
a,42849,5.7%
o,31195,4.1%

Value,Count,Frequency (%)
(unknown),752462,100.0%

Value,Count,Frequency (%)
e,106278,14.1%
r,102602,13.6%
i,69691,9.3%
-,62205,8.3%
d,50360,6.7%
,48842,6.5%
s,46716,6.2%
v,45129,6.0%
a,42849,5.7%
o,31195,4.1%

0,1
Distinct,15
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Memory size,1.8 MiB

0,1
Prof-specialty,6172
Craft-repair,6112
Exec-managerial,6086
Adm-clerical,5611
Sales,5504
Other values (10),19357

0,1
Max length,18.0
Median length,16.0
Mean length,13.186991
Min length,2.0

0,1
Total characters,644079
Distinct characters,33
Distinct categories,1 ?
Distinct scripts,1 ?
Distinct blocks,1 ?

0,1
Unique,0 ?
Unique (%),0.0%

0,1
1st row,Adm-clerical
2nd row,Exec-managerial
3rd row,Handlers-cleaners
4th row,Handlers-cleaners
5th row,Prof-specialty

Value,Count,Frequency (%)
Prof-specialty,6172,12.6%
Craft-repair,6112,12.5%
Exec-managerial,6086,12.5%
Adm-clerical,5611,11.5%
Sales,5504,11.3%
Other-service,4923,10.1%
Machine-op-inspct,3022,6.2%
?,2809,5.8%
Transport-moving,2355,4.8%
Handlers-cleaners,2072,4.2%

Value,Count,Frequency (%)
prof-specialty,6172,12.6%
craft-repair,6112,12.5%
exec-managerial,6086,12.5%
adm-clerical,5611,11.5%
sales,5504,11.3%
other-service,4923,10.1%
machine-op-inspct,3022,6.2%
,2809,5.8%
transport-moving,2355,4.8%
handlers-cleaners,2072,4.2%

Value,Count,Frequency (%)
e,64487,10.0%
r,60321,9.4%
a,58780,9.1%
,48842,7.6%
-,43793,6.8%
i,42998,6.7%
c,38963,6.0%
l,33128,5.1%
s,30538,4.7%
t,25996,4.0%

Value,Count,Frequency (%)
(unknown),644079,100.0%

Value,Count,Frequency (%)
e,64487,10.0%
r,60321,9.4%
a,58780,9.1%
,48842,7.6%
-,43793,6.8%
i,42998,6.7%
c,38963,6.0%
l,33128,5.1%
s,30538,4.7%
t,25996,4.0%

Value,Count,Frequency (%)
(unknown),644079,100.0%

Value,Count,Frequency (%)
e,64487,10.0%
r,60321,9.4%
a,58780,9.1%
,48842,7.6%
-,43793,6.8%
i,42998,6.7%
c,38963,6.0%
l,33128,5.1%
s,30538,4.7%
t,25996,4.0%

Value,Count,Frequency (%)
(unknown),644079,100.0%

Value,Count,Frequency (%)
e,64487,10.0%
r,60321,9.4%
a,58780,9.1%
,48842,7.6%
-,43793,6.8%
i,42998,6.7%
c,38963,6.0%
l,33128,5.1%
s,30538,4.7%
t,25996,4.0%

0,1
Distinct,6
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Memory size,1.8 MiB

0,1
Husband,19716
Not-in-family,12583
Own-child,7581
Unmarried,5125
Wife,2331

0,1
Max length,15.0
Median length,14.0
Mean length,10.138713
Min length,5.0

0,1
Total characters,495195
Distinct characters,26
Distinct categories,1 ?
Distinct scripts,1 ?
Distinct blocks,1 ?

0,1
Unique,0 ?
Unique (%),0.0%

0,1
1st row,Not-in-family
2nd row,Husband
3rd row,Not-in-family
4th row,Husband
5th row,Wife

Value,Count,Frequency (%)
Husband,19716,40.4%
Not-in-family,12583,25.8%
Own-child,7581,15.5%
Unmarried,5125,10.5%
Wife,2331,4.8%
Other-relative,1506,3.1%

Value,Count,Frequency (%)
husband,19716,40.4%
not-in-family,12583,25.8%
own-child,7581,15.5%
unmarried,5125,10.5%
wife,2331,4.8%
other-relative,1506,3.1%

Value,Count,Frequency (%)
,48842,9.9%
n,45005,9.1%
i,41709,8.4%
a,38930,7.9%
-,34253,6.9%
d,32422,6.5%
l,21670,4.4%
H,19716,4.0%
u,19716,4.0%
s,19716,4.0%

Value,Count,Frequency (%)
(unknown),495195,100.0%

Value,Count,Frequency (%)
,48842,9.9%
n,45005,9.1%
i,41709,8.4%
a,38930,7.9%
-,34253,6.9%
d,32422,6.5%
l,21670,4.4%
H,19716,4.0%
u,19716,4.0%
s,19716,4.0%

Value,Count,Frequency (%)
(unknown),495195,100.0%

Value,Count,Frequency (%)
,48842,9.9%
n,45005,9.1%
i,41709,8.4%
a,38930,7.9%
-,34253,6.9%
d,32422,6.5%
l,21670,4.4%
H,19716,4.0%
u,19716,4.0%
s,19716,4.0%

Value,Count,Frequency (%)
(unknown),495195,100.0%

Value,Count,Frequency (%)
,48842,9.9%
n,45005,9.1%
i,41709,8.4%
a,38930,7.9%
-,34253,6.9%
d,32422,6.5%
l,21670,4.4%
H,19716,4.0%
u,19716,4.0%
s,19716,4.0%

0,1
Distinct,5
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Memory size,1.8 MiB

0,1
White,41762
Black,4685
Asian-Pac-Islander,1519
Amer-Indian-Eskimo,470
Other,406

0,1
Max length,19.0
Median length,6.0
Mean length,6.5294009
Min length,6.0

0,1
Total characters,318909
Distinct characters,23
Distinct categories,1 ?
Distinct scripts,1 ?
Distinct blocks,1 ?

0,1
Unique,0 ?
Unique (%),0.0%

0,1
1st row,White
2nd row,White
3rd row,White
4th row,Black
5th row,Black

Value,Count,Frequency (%)
White,41762,85.5%
Black,4685,9.6%
Asian-Pac-Islander,1519,3.1%
Amer-Indian-Eskimo,470,1.0%
Other,406,0.8%

Value,Count,Frequency (%)
white,41762,85.5%
black,4685,9.6%
asian-pac-islander,1519,3.1%
amer-indian-eskimo,470,1.0%
other,406,0.8%

Value,Count,Frequency (%)
,48842,15.3%
i,44221,13.9%
e,44157,13.8%
h,42168,13.2%
t,42168,13.2%
W,41762,13.1%
a,9712,3.0%
l,6204,1.9%
c,6204,1.9%
k,5155,1.6%

Value,Count,Frequency (%)
(unknown),318909,100.0%

Value,Count,Frequency (%)
,48842,15.3%
i,44221,13.9%
e,44157,13.8%
h,42168,13.2%
t,42168,13.2%
W,41762,13.1%
a,9712,3.0%
l,6204,1.9%
c,6204,1.9%
k,5155,1.6%

Value,Count,Frequency (%)
(unknown),318909,100.0%

Value,Count,Frequency (%)
,48842,15.3%
i,44221,13.9%
e,44157,13.8%
h,42168,13.2%
t,42168,13.2%
W,41762,13.1%
a,9712,3.0%
l,6204,1.9%
c,6204,1.9%
k,5155,1.6%

Value,Count,Frequency (%)
(unknown),318909,100.0%

Value,Count,Frequency (%)
,48842,15.3%
i,44221,13.9%
e,44157,13.8%
h,42168,13.2%
t,42168,13.2%
W,41762,13.1%
a,9712,3.0%
l,6204,1.9%
c,6204,1.9%
k,5155,1.6%

0,1
Distinct,2
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Memory size,1.8 MiB

0,1
Male,32650
Female,16192

0,1
Max length,7.0
Median length,5.0
Mean length,5.6630359
Min length,5.0

0,1
Total characters,276594
Distinct characters,7
Distinct categories,1 ?
Distinct scripts,1 ?
Distinct blocks,1 ?

0,1
Unique,0 ?
Unique (%),0.0%

0,1
1st row,Male
2nd row,Male
3rd row,Male
4th row,Male
5th row,Female

Value,Count,Frequency (%)
Male,32650,66.8%
Female,16192,33.2%

Value,Count,Frequency (%)
male,32650,66.8%
female,16192,33.2%

Value,Count,Frequency (%)
e,65034,23.5%
,48842,17.7%
a,48842,17.7%
l,48842,17.7%
M,32650,11.8%
F,16192,5.9%
m,16192,5.9%

Value,Count,Frequency (%)
(unknown),276594,100.0%

Value,Count,Frequency (%)
e,65034,23.5%
,48842,17.7%
a,48842,17.7%
l,48842,17.7%
M,32650,11.8%
F,16192,5.9%
m,16192,5.9%

Value,Count,Frequency (%)
(unknown),276594,100.0%

Value,Count,Frequency (%)
e,65034,23.5%
,48842,17.7%
a,48842,17.7%
l,48842,17.7%
M,32650,11.8%
F,16192,5.9%
m,16192,5.9%

Value,Count,Frequency (%)
(unknown),276594,100.0%

Value,Count,Frequency (%)
e,65034,23.5%
,48842,17.7%
a,48842,17.7%
l,48842,17.7%
M,32650,11.8%
F,16192,5.9%
m,16192,5.9%

0,1
Distinct,123
Distinct (%),0.3%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,1079.0676

0,1
Minimum,0
Maximum,99999
Zeros,44807
Zeros (%),91.7%
Negative,0
Negative (%),0.0%
Memory size,1.8 MiB

0,1
Minimum,0
5-th percentile,0
Q1,0
median,0
Q3,0
95-th percentile,5013
Maximum,99999
Range,99999
Interquartile range (IQR),0

0,1
Standard deviation,7452.0191
Coefficient of variation (CV),6.9059796
Kurtosis,152.6931
Mean,1079.0676
Median Absolute Deviation (MAD),0
Skewness,11.894659
Sum,52703821
Variance,55532588
Monotonicity,Not monotonic

Value,Count,Frequency (%)
0,44807,91.7%
15024,513,1.1%
7688,410,0.8%
7298,364,0.7%
99999,244,0.5%
3103,152,0.3%
5178,146,0.3%
5013,117,0.2%
4386,108,0.2%
8614,82,0.2%

Value,Count,Frequency (%)
0,44807,91.7%
114,8,< 0.1%
401,5,< 0.1%
594,52,0.1%
914,10,< 0.1%
991,6,< 0.1%
1055,37,0.1%
1086,8,< 0.1%
1111,1,< 0.1%
1151,13,< 0.1%

Value,Count,Frequency (%)
99999,244,0.5%
41310,3,< 0.1%
34095,6,< 0.1%
27828,58,0.1%
25236,14,< 0.1%
25124,6,< 0.1%
22040,1,< 0.1%
20051,49,0.1%
18481,2,< 0.1%
15831,8,< 0.1%

0,1
Distinct,99
Distinct (%),0.2%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,87.502314

0,1
Minimum,0
Maximum,4356
Zeros,46560
Zeros (%),95.3%
Negative,0
Negative (%),0.0%
Memory size,1.8 MiB

0,1
Minimum,0
5-th percentile,0
Q1,0
median,0
Q3,0
95-th percentile,0
Maximum,4356
Range,4356
Interquartile range (IQR),0

0,1
Standard deviation,403.00455
Coefficient of variation (CV),4.6056445
Kurtosis,20.014346
Mean,87.502314
Median Absolute Deviation (MAD),0
Skewness,4.5698089
Sum,4273788
Variance,162412.67
Monotonicity,Not monotonic

Value,Count,Frequency (%)
0,46560,95.3%
1902,304,0.6%
1977,253,0.5%
1887,233,0.5%
2415,72,0.1%
1485,71,0.1%
1848,67,0.1%
1590,62,0.1%
1602,62,0.1%
1876,59,0.1%

Value,Count,Frequency (%)
0,46560,95.3%
155,1,< 0.1%
213,5,< 0.1%
323,5,< 0.1%
419,3,< 0.1%
625,17,< 0.1%
653,4,< 0.1%
810,2,< 0.1%
880,6,< 0.1%
974,2,< 0.1%

Value,Count,Frequency (%)
4356,3,< 0.1%
3900,2,< 0.1%
3770,4,< 0.1%
3683,2,< 0.1%
3175,2,< 0.1%
3004,5,< 0.1%
2824,14,< 0.1%
2754,2,< 0.1%
2603,7,< 0.1%
2559,17,< 0.1%

0,1
Distinct,96
Distinct (%),0.2%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,40.422382

0,1
Minimum,1
Maximum,99
Zeros,0
Zeros (%),0.0%
Negative,0
Negative (%),0.0%
Memory size,1.8 MiB

0,1
Minimum,1.0
5-th percentile,17.05
Q1,40.0
median,40.0
Q3,45.0
95-th percentile,60.0
Maximum,99.0
Range,98.0
Interquartile range (IQR),5.0

0,1
Standard deviation,12.391444
Coefficient of variation (CV),0.30654908
Kurtosis,2.9510591
Mean,40.422382
Median Absolute Deviation (MAD),3
Skewness,0.23874966
Sum,1974310
Variance,153.54789
Monotonicity,Not monotonic

Value,Count,Frequency (%)
40,22803,46.7%
50,4246,8.7%
45,2717,5.6%
60,2177,4.5%
35,1937,4.0%
20,1862,3.8%
30,1700,3.5%
55,1051,2.2%
25,958,2.0%
48,770,1.6%

Value,Count,Frequency (%)
1,27,0.1%
2,53,0.1%
3,59,0.1%
4,84,0.2%
5,95,0.2%
6,92,0.2%
7,45,0.1%
8,218,0.4%
9,27,0.1%
10,425,0.9%

Value,Count,Frequency (%)
99,137,0.3%
98,14,< 0.1%
97,2,< 0.1%
96,9,< 0.1%
95,2,< 0.1%
94,1,< 0.1%
92,3,< 0.1%
91,3,< 0.1%
90,42,0.1%
89,3,< 0.1%

0,1
Distinct,42
Distinct (%),0.1%
Missing,0
Missing (%),0.0%
Memory size,1.8 MiB

0,1
United-States,43832
Mexico,951
?,857
Philippines,295
Germany,206
Other values (37),2701

0,1
Max length,27.0
Median length,14.0
Mean length,13.306847
Min length,2.0

0,1
Total characters,649933
Distinct characters,46
Distinct categories,1 ?
Distinct scripts,1 ?
Distinct blocks,1 ?

0,1
Unique,1 ?
Unique (%),< 0.1%

0,1
1st row,United-States
2nd row,United-States
3rd row,United-States
4th row,United-States
5th row,Cuba

Value,Count,Frequency (%)
United-States,43832,89.7%
Mexico,951,1.9%
?,857,1.8%
Philippines,295,0.6%
Germany,206,0.4%
Puerto-Rico,184,0.4%
Canada,182,0.4%
El-Salvador,155,0.3%
India,151,0.3%
Cuba,138,0.3%

Value,Count,Frequency (%)
united-states,43832,89.7%
mexico,951,1.9%
,857,1.8%
philippines,295,0.6%
germany,206,0.4%
puerto-rico,184,0.4%
canada,182,0.4%
el-salvador,155,0.3%
india,151,0.3%
cuba,138,0.3%

Value,Count,Frequency (%)
t,132284,20.4%
e,89870,13.8%
,48842,7.5%
a,47613,7.3%
i,47106,7.2%
n,45884,7.1%
d,44771,6.9%
-,44344,6.8%
s,44194,6.8%
S,44169,6.8%

Value,Count,Frequency (%)
(unknown),649933,100.0%

Value,Count,Frequency (%)
t,132284,20.4%
e,89870,13.8%
,48842,7.5%
a,47613,7.3%
i,47106,7.2%
n,45884,7.1%
d,44771,6.9%
-,44344,6.8%
s,44194,6.8%
S,44169,6.8%

Value,Count,Frequency (%)
(unknown),649933,100.0%

Value,Count,Frequency (%)
t,132284,20.4%
e,89870,13.8%
,48842,7.5%
a,47613,7.3%
i,47106,7.2%
n,45884,7.1%
d,44771,6.9%
-,44344,6.8%
s,44194,6.8%
S,44169,6.8%

Value,Count,Frequency (%)
(unknown),649933,100.0%

Value,Count,Frequency (%)
t,132284,20.4%
e,89870,13.8%
,48842,7.5%
a,47613,7.3%
i,47106,7.2%
n,45884,7.1%
d,44771,6.9%
-,44344,6.8%
s,44194,6.8%
S,44169,6.8%

0,1
Distinct,4
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Memory size,1.8 MiB

0,1
<=50K,24720
<=50K.,12435
>50K,7841
>50K.,3846

0,1
Max length,7.0
Median length,6.0
Mean length,6.0940584
Min length,5.0

0,1
Total characters,297646
Distinct characters,8
Distinct categories,1 ?
Distinct scripts,1 ?
Distinct blocks,1 ?

0,1
Unique,0 ?
Unique (%),0.0%

0,1
1st row,<=50K
2nd row,<=50K
3rd row,<=50K
4th row,<=50K
5th row,<=50K

Value,Count,Frequency (%)
<=50K,24720,50.6%
<=50K.,12435,25.5%
>50K,7841,16.1%
>50K.,3846,7.9%

Value,Count,Frequency (%)
50k,48842,100.0%

Value,Count,Frequency (%)
,48842,16.4%
5,48842,16.4%
0,48842,16.4%
K,48842,16.4%
<,37155,12.5%
=,37155,12.5%
.,16281,5.5%
>,11687,3.9%

Value,Count,Frequency (%)
(unknown),297646,100.0%

Value,Count,Frequency (%)
,48842,16.4%
5,48842,16.4%
0,48842,16.4%
K,48842,16.4%
<,37155,12.5%
=,37155,12.5%
.,16281,5.5%
>,11687,3.9%

Value,Count,Frequency (%)
(unknown),297646,100.0%

Value,Count,Frequency (%)
,48842,16.4%
5,48842,16.4%
0,48842,16.4%
K,48842,16.4%
<,37155,12.5%
=,37155,12.5%
.,16281,5.5%
>,11687,3.9%

Value,Count,Frequency (%)
(unknown),297646,100.0%

Value,Count,Frequency (%)
,48842,16.4%
5,48842,16.4%
0,48842,16.4%
K,48842,16.4%
<,37155,12.5%
=,37155,12.5%
.,16281,5.5%
>,11687,3.9%

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
16272,61,Private,89686,HS-grad,9,Married-civ-spouse,Sales,Husband,White,Male,0,0,48,United-States,<=50K.
16273,31,Private,440129,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,<=50K.
16274,25,Private,350977,HS-grad,9,Never-married,Other-service,Own-child,White,Female,0,0,40,United-States,<=50K.
16275,48,Local-gov,349230,Masters,14,Divorced,Other-service,Not-in-family,White,Male,0,0,40,United-States,<=50K.
16276,33,Private,245211,Bachelors,13,Never-married,Prof-specialty,Own-child,White,Male,0,0,40,United-States,<=50K.
16277,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K.
16278,64,?,321403,HS-grad,9,Widowed,?,Other-relative,Black,Male,0,0,40,United-States,<=50K.
16279,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K.
16280,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K.
16281,35,Self-emp-inc,182148,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,60,United-States,>50K.

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,# duplicates
10,25,Private,195994,1st-4th,2,Never-married,Priv-house-serv,Not-in-family,White,Female,0,0,40,Guatemala,<=50K,3
0,18,Self-emp-inc,378036,12th,8,Never-married,Farming-fishing,Own-child,White,Male,0,0,10,United-States,<=50K.,2
1,19,Private,97261,HS-grad,9,Never-married,Farming-fishing,Not-in-family,White,Male,0,0,40,United-States,<=50K,2
2,19,Private,138153,Some-college,10,Never-married,Adm-clerical,Own-child,White,Female,0,0,10,United-States,<=50K,2
3,19,Private,146679,Some-college,10,Never-married,Exec-managerial,Own-child,Black,Male,0,0,30,United-States,<=50K,2
4,19,Private,251579,Some-college,10,Never-married,Other-service,Own-child,White,Male,0,0,14,United-States,<=50K,2
5,20,Private,107658,Some-college,10,Never-married,Tech-support,Not-in-family,White,Female,0,0,10,United-States,<=50K,2
6,21,Private,243368,Preschool,1,Never-married,Farming-fishing,Not-in-family,White,Male,0,0,50,Mexico,<=50K,2
7,21,Private,250051,Some-college,10,Never-married,Prof-specialty,Own-child,White,Female,0,0,10,United-States,<=50K,2
8,23,Private,240137,5th-6th,3,Never-married,Handlers-cleaners,Not-in-family,White,Male,0,0,55,Mexico,<=50K,2
