# Data screening of the IPUMS USA census data sample from 2016

### The purpose of this notebook is to document and explore the data that will be used for the research project

INCWAGE is the sensitive attribute that needs to be protected. The other columns are assumed to act as quasi-identifiers (QID), in various combinations.

In [1]:
import pandas as pd

In [2]:
# read data into pandas dataframe
ipums = pd.read_csv('../sample-data/ipums_usa/usa_00001.csv')
ipums.head()

Unnamed: 0,YEAR,DATANUM,SERIAL,HHWT,GQ,PERNUM,PERWT,SEX,AGE,MARST,RACE,RACED,BPL,BPLD,EDUC,EDUCD,OCC,INCWAGE
0,2016,1,1,97,1,1,98,2,84,1,1,100,1,100,7,71,0,0
1,2016,1,1,97,1,2,89,1,84,1,1,100,1,100,10,101,0,0
2,2016,1,2,95,1,1,95,2,78,4,1,100,13,1300,7,71,5700,27300
3,2016,1,3,159,1,1,160,1,46,1,1,100,27,2700,8,81,1550,65000
4,2016,1,3,159,1,2,154,2,52,1,1,100,1,100,6,65,0,0


Remove the columns which are irrelevant for the analysis: the system columns and the detailed variants (RACED, BPLD, EDUCD) of RACE, BPL and EDUC:

In [3]:
# drop system columns and the detailed variants of RACE, BPL and EDUC
ipums = ipums.drop(columns=['YEAR', 'DATANUM', 'SERIAL', 'HHWT', 'GQ', 'PERNUM', 'PERWT', 'RACED', 'BPLD', 'EDUCD'])
ipums.head()

Unnamed: 0,SEX,AGE,MARST,RACE,BPL,EDUC,OCC,INCWAGE
0,2,84,1,1,1,7,0,0
1,1,84,1,1,1,10,0,0
2,2,78,4,1,13,7,5700,27300
3,1,46,1,1,27,8,1550,65000
4,2,52,1,1,1,6,0,0
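
As an alternative to dropping columns after loading, the selection could be made while reading the file, which avoids holding the unused columns in memory. The following is only a minimal sketch of that idea: it uses the `usecols` parameter of `pd.read_csv` on a small in-memory CSV whose values are made up for illustration, not the actual extract.

```python
import io
import pandas as pd

# Toy stand-in for the real extract; the rows are invented for illustration.
csv_text = """YEAR,SERIAL,SEX,AGE,INCWAGE
2016,1,2,84,0
2016,2,1,46,65000
"""

# Columns to keep (a shortened list, assumed here for brevity).
keep = ['SEX', 'AGE', 'INCWAGE']

# usecols restricts parsing to the listed columns.
sample = pd.read_csv(io.StringIO(csv_text), usecols=keep)
print(sample.columns.tolist())
```

For the real file, the same call would take the path used in cell 2 and the full list of retained columns.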


Number of observations:

In [4]:
n_obs = ipums.shape[0]
print(n_obs)

3156487


Number of distinct QID combinations (observations compared without the INCWAGE column):

In [5]:
print(ipums.groupby(['SEX', 'AGE', 'MARST', 'RACE', 'BPL', 'EDUC', 'OCC']).size().shape[0])

1656824
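
The group sizes behind this count can also be inspected directly, for example to see how many QID combinations occur exactly once. The sketch below illustrates this on a toy dataframe with a shortened QID list. Both the data and the variable names (`toy`, `class_sizes`, and so on) are assumptions made for the example and are not part of the original analysis.

```python
import pandas as pd

# Shortened QID list and invented rows, for illustration only.
qid = ['SEX', 'AGE', 'MARST']
toy = pd.DataFrame({
    'SEX':     [1, 1, 2, 2, 2],
    'AGE':     [30, 30, 30, 41, 41],
    'MARST':   [1, 1, 6, 1, 1],
    'INCWAGE': [100, 200, 300, 400, 500],
})

# Size of each group of rows sharing the same QID values.
class_sizes = toy.groupby(qid).size()

n_distinct = class_sizes.shape[0]                  # distinct QID combinations
n_unique_records = int((class_sizes == 1).sum())   # combinations held by a single row
smallest_class = int(class_sizes.min())            # size of the smallest group

print(n_distinct, n_unique_records, smallest_class)
```

On the full data, replacing `toy` with `ipums` and `qid` with the seven QID columns would give the same kind of summary.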


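Check how many distinct values each column contains and summarise INCWAGE. The summary is computed over the raw values, including the special code 999999 (N/A) that appears as the maximum, so the mean and standard deviation are not directly interpretable as wages: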

In [6]:
print(ipums.nunique())
print(ipums['INCWAGE'].describe().apply(lambda x: format(x, 'f')))
print("INCWAGE Specific Variable Codes \n 999999 = N/A \n 999998 = Missing")

SEX          2
AGE         97
MARST        6
RACE         9
BPL        124
EDUC        11
OCC        480
INCWAGE    933
dtype: int64
count    3156487.000000
mean      204626.372750
std       377112.658096
min            0.000000
25%            0.000000
50%        21000.000000
75%        84000.000000
max       999999.000000
Name: INCWAGE, dtype: object
INCWAGE Specific Variable Codes 
 999999 = N/A 
 999998 = Missing
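
Because the special codes are stored as ordinary numbers, they could be filtered out before computing wage statistics. The following is a minimal sketch of that step on a made-up series. The names `NA_CODES`, `wages` and `valid` are assumptions for the example, and whether such rows should be excluded or handled differently depends on the research question.

```python
import pandas as pd

# Special INCWAGE codes listed above: 999999 = N/A, 999998 = Missing.
NA_CODES = [999999, 999998]

# Invented values for illustration only.
wages = pd.Series([0, 21000, 84000, 999999, 999998], name='INCWAGE')

# Keep only rows whose value is not one of the special codes.
valid = wages[~wages.isin(NA_CODES)]

print(valid.describe())
```

Applied to the notebook's dataframe, `wages` would be `ipums['INCWAGE']`.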
