In [2]:
import pandas as pd
import numpy as np

In [16]:
df = pd.read_csv('usa_00002.csv')

In [29]:
df.columns

Index(['YEAR', 'DATANUM', 'SERIAL', 'CBSERIAL', 'HHWT', 'CPI99', 'GQ',
       'PERNUM', 'PERWT', 'SEX', 'AGE', 'MARST', 'RACE', 'RACED', 'HISPAN',
       'HISPAND', 'SCHOOL', 'EDUC', 'EDUCD', 'EMPSTAT', 'EMPSTATD', 'OCC1950',
       'OCC1990', 'OCC2010', 'IND1950', 'IND1990', 'WKSWORK2', 'UHRSWORK',
       'INCWAGE'],
      dtype='object')

## Age
AGE reports the person's age in years as of the last birthday.

In [19]:
df['AGE'].describe()

count    3.190040e+06
mean     4.128723e+01
std      2.363224e+01
min      0.000000e+00
25%      2.100000e+01
50%      4.200000e+01
75%      6.000000e+01
max      9.600000e+01
Name: AGE, dtype: float64

***
## Household weight
HHWT is a 6-digit numeric variable that indicates how many households in the U.S. population are represented by a given household in an IPUMS sample. In the raw fixed-width data it has **two implied decimals**: a HHWT value of 010461 should be interpreted as 104.61. The values in this CSV extract (206, 45, 136, ...) appear to have the decimals already applied, so no rescaling is done here. Variable-specific codes for missing, edited, or unidentified observations, observations not applicable (N/A), observations not in universe (NIU), and top and bottom value coding are listed, where applicable, on the IPUMS page for HHWT by Census year (and data sample if specified).

**User Note:** Users should also be sure to select one person (e.g., PERNUM = 1) to represent the entire household when using HHWT.
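
The user note above can be sketched as follows. This is a minimal illustration on a toy frame built from the first five PERNUM and HHWT values printed in this notebook; the names `sample`, `householders`, and `estimated_households` are made up for the example.

```python
import pandas as pd

# Toy frame: the first five PERNUM and HHWT values shown in this notebook.
sample = pd.DataFrame({
    "PERNUM": [1, 1, 1, 2, 3],
    "HHWT": [206, 45, 136, 136, 136],
})

# Keep one record per household (the person with PERNUM == 1) so that
# each household's weight is counted once, not once per member.
householders = sample[sample["PERNUM"] == 1]

# Summing HHWT over those rows estimates the number of households represented.
estimated_households = householders["HHWT"].sum()
print(estimated_households)  # 387
```

Summing HHWT over all five rows instead would count the three-person household three times.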

In [20]:
df['HHWT'].head()

0    206
1     45
2    136
3    136
4    136
Name: HHWT, dtype: int64

***
## Group Quarters Status
GQ classifies all housing units as falling into one of three main categories: households, group quarters, or vacant units. It also identifies fragmentary sample units for 1850-1930 (see below). In all years, the data available about a person and their co-residents depend on whether the person lives in a household or in group quarters. Households are sampled as units, meaning that everyone in the household is included in the sample, and most household-level variables are available. People living in group quarters are generally sampled as individuals; other people in their unit may or may not be included in the sample, and there is no way of linking co-residents' records to one another. If, however, a sampled person in group quarters was living with relatives, the related group was sampled for 1850-1930. Most household-level variables are not available for group quarters or for vacant units.

Group quarters are largely institutions and other group living arrangements, such as rooming houses and military barracks. The definitions vary from year to year, but the pre-1940 samples have generally used a definition of group quarters that includes units with 10 or more individuals unrelated to the householder. See the comparability discussion below and "Sample Designs" for more details about changing definitions of group quarters. Group-quarters types are identified in further detail by GQTYPE and GQFUNDS.

Codes: https://usa.ipums.org/usa-action/variables/GQ#codes_section
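
Because most household-level variables are not available for group quarters, an analysis may want to keep household records only. The sketch below assumes the GQ code meanings given in the comments, which should be checked against the codes page linked above; the GQ values themselves are hypothetical and not taken from the extract.

```python
import pandas as pd

# Hypothetical GQ values for illustration; not taken from the extract.
sample = pd.DataFrame({"GQ": [1, 1, 3, 4, 2, 5]})

# Assumed code meanings (verify on the GQ codes page):
#   1, 2, 5 = households under the 1970, 1990, and 2000 definitions
#   3 = institutions, 4 = other group quarters
HOUSEHOLD_CODES = [1, 2, 5]

# Boolean mask of records that live in households, then the filtered frame.
in_household = sample["GQ"].isin(HOUSEHOLD_CODES)
households_only = sample[in_household]
print(len(households_only))  # 4
```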

In [21]:
df['GQ'].head()

0    1
1    1
2    1
3    1
4    1
Name: GQ, dtype: int64

***
## Person number in sample unit
PERNUM numbers all persons within each household consecutively, in the order in which they appear on the original census or survey form.

In [22]:
df['PERNUM'].head()

0    1
1    1
2    1
3    2
4    3
Name: PERNUM, dtype: int64

***
## Person weight
PERWT indicates how many persons in the U.S. population are represented by a given person in an IPUMS sample.

It is generally a good idea to use PERWT when conducting a person-level analysis of any IPUMS sample. The use of PERWT is optional when analyzing one of the "flat" or unweighted IPUMS samples. Flat IPUMS samples include the 1% samples from 1850-1930, all samples from 1960, 1970, and 1980, the 1% unweighted samples from 1990 and 2000, the 10% 2010 sample, and any of the full count 100% census datasets. PERWT must be used to obtain nationally representative statistics for person-level analyses of any sample other than those.
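
As a rough sketch of what using PERWT means in practice, the example below compares a weighted and an unweighted share. It uses the first five PERWT and SCHOOL values printed in this notebook (SCHOOL code 2 means "in school"); the variable names are made up for the example, and five rows are of course far too few for a real estimate.

```python
import numpy as np
import pandas as pd

# Toy frame: the first five PERWT and SCHOOL values shown in this notebook.
sample = pd.DataFrame({
    "PERWT": [206, 45, 136, 121, 111],
    "SCHOOL": [1, 1, 1, 1, 2],
})

# Each record stands in for PERWT people, so the weights sum to the
# number of persons represented by these rows.
persons_represented = sample["PERWT"].sum()

# Indicator for being in school (SCHOOL == 2), as 0.0 / 1.0.
in_school = (sample["SCHOOL"] == 2).astype(float)

# Weighted share uses PERWT; the unweighted share treats every row equally.
weighted_share = np.average(in_school, weights=sample["PERWT"])
unweighted_share = in_school.mean()

print(persons_represented)         # 619
print(round(weighted_share, 4))    # 0.1793
print(round(unweighted_share, 4))  # 0.2
```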

In [23]:
df['PERWT'].head()

0    206
1     45
2    136
3    121
4    111
Name: PERWT, dtype: int64

***
## Sex
SEX reports whether the person was male or female.

1 = Male

2 = Female

Codes: https://usa.ipums.org/usa-action/variables/SEX#codes_section
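
Since SEX is a categorical code rather than a quantity, a labelled frequency table is often easier to read than summary statistics. A minimal sketch, using hypothetical SEX values (not taken from the extract) and the two labels listed above:

```python
import pandas as pd

# Hypothetical SEX codes for illustration; not taken from the extract.
sex = pd.Series([1, 2, 2, 1, 2], name="SEX")

# Labels as listed in this section.
SEX_LABELS = {1: "Male", 2: "Female"}

# Replace each numeric code with its label, then count occurrences.
sex_label = sex.map(SEX_LABELS)
counts = sex_label.value_counts()
print(counts)
```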

In [24]:
df['SEX'].describe()

count    3.190040e+06
mean     1.510606e+00
std      4.998876e-01
min      1.000000e+00
25%      1.000000e+00
50%      2.000000e+00
75%      2.000000e+00
max      2.000000e+00
Name: SEX, dtype: float64

***
## Marital status
MARST gives each person's current marital status.

Codes: https://usa.ipums.org/usa-action/variables/MARST#codes_section

In [25]:
df['MARST'].head()

0    3
1    6
2    1
3    1
4    6
Name: MARST, dtype: int64

***
## Race
RACE provides the full detail given by the respondent and/or released by the Census Bureau; it is not always historically compatible.

Codes: https://usa.ipums.org/usa-action/variables/RACE#codes_section

In [26]:
df['RACE'].head()

0    2
1    1
2    1
3    1
4    1
Name: RACE, dtype: int64

***
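## Race: Detailed version
RACED is the detailed version of RACE. In the rows shown below, the detailed codes 200 and 100 correspond to the general RACE codes 2 and 1. This extract does not include [RACESING](https://usa.ipums.org/usa-action/variables/RACESING#codes_section), a separate variable that assigns a single race to multiple-race people. Each multiple-race person is assigned to the single race response category deemed most likely, depending on the individual's age, sex, Hispanic origin, region and urbanization level of residence, and the racial diversity of their local area.

Codes: https://usa.ipums.org/usa-action/variables/RACE#codes_section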

In [31]:
df['RACED'].head()

0    200
1    100
2    100
3    100
4    100
Name: RACED, dtype: int64

***
## Hispanic origin
HISPAN identifies persons of Hispanic/Spanish/Latino origin and classifies them according to their country of origin when possible. Origin is defined by the Census Bureau as ancestry, lineage, heritage, nationality group, or country of birth. People of Hispanic origin may be of any race; see RACE for a discussion of coding issues involved. Users should note that race questions were not asked in the Puerto Rican censuses of 1970, 1980 and 1990. They were asked in the 1910 and 1920 Puerto Rican censuses, and in the 2000 and 2010 Puerto Rican censuses and the PRCS. However, questions assessing Spanish/Hispanic origin were not asked in the Puerto Rican censuses prior to 2000.

The HISPAN general code covers country-of-origin classifications common to all years; the detailed code distinguishes additional groups and subgroups. See [HISPRULE](https://usa.ipums.org/usa-action/variables/HISPRULE#description_section) for details on how country of origin information was assigned prior to 1980.

Codes: https://usa.ipums.org/usa-action/variables/HISPAN#codes_section

In [32]:
df['HISPAN'].head()

0    0
1    0
2    1
3    0
4    1
Name: HISPAN, dtype: int64

In [33]:
df['HISPAND'].head()

0      0
1      0
2    100
3      0
4    100
Name: HISPAND, dtype: int64

***
## School Attendance
SCHOOL indicates whether the respondent attended school during a specified period.

0 =	N/A

1 =	No, not in school

2 =	Yes, in school

9 =	Missing

Codes: https://usa.ipums.org/usa-action/variables/SCHOOL#codes_section
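
Codes 0 (N/A) and 9 (Missing) are not answers to the attendance question, so one option is to mask them before computing shares. A sketch under that assumption: the first five values are the ones printed in this notebook, and the trailing 0 and 9 are hypothetical additions to show the masking.

```python
import pandas as pd

# First five values are shown in this notebook; the last two (0 and 9)
# are hypothetical, added only to demonstrate the masking.
school = pd.Series([1, 1, 1, 1, 2, 0, 9], name="SCHOOL")

# Keep values that are neither N/A (0) nor Missing (9); others become NaN.
valid = school.where(~school.isin([0, 9]))

# Share in school among valid responses only.
share_in_school = (valid.dropna() == 2).mean()
print(valid.isna().sum())         # 2
print(round(share_in_school, 2))  # 0.2
```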

In [34]:
df['SCHOOL'].head()

0    1
1    1
2    1
3    1
4    2
Name: SCHOOL, dtype: int64

***
## Educational attainment
EDUC indicates respondents' educational attainment, as measured by the highest year of school or degree completed. Note that completion differs from the highest year of school attendance; for example, respondents who attended 10th grade but did not finish were classified in EDUC as having completed 9th grade. For additional detail on grade attendance, see [GRADEATT](https://usa.ipums.org/usa-action/variables/GRADEATT#codes_section) as well as the detailed version of [HIGRADE](https://usa.ipums.org/usa-action/variables/HIGRADE#description_section).

Codes: https://usa.ipums.org/usa-action/variables/EDUC#codes_section

In [35]:
df['EDUC'].head()

0     2
1    10
2     6
3     6
4     4
Name: EDUC, dtype: int64

***
## Employment status

EMPSTAT indicates whether the respondent was a part of the labor force -- working or seeking work -- and, if so, whether the person was currently unemployed. The second digit preserves additional related information available for some years but not others. See [LABFORCE](https://usa.ipums.org/usa-action/variables/LABFORCE#description_section) for a dichotomous variable that identifies whether a person participated in the labor force or not and is available for all years in the IPUMS.

Codes: https://usa.ipums.org/usa-action/variables/EMPSTAT#codes_section
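
In the rows printed in this notebook, the first digit of the detailed code EMPSTATD equals the general code EMPSTAT (30 and 3, 10 and 1). Assuming that pattern holds for the whole extract, the general code can be recovered from the detailed one by integer division, which also makes a quick consistency check:

```python
import pandas as pd

# Toy frame: the first five EMPSTAT and EMPSTATD values shown in this notebook.
sample = pd.DataFrame({
    "EMPSTAT": [3, 1, 1, 3, 3],
    "EMPSTATD": [30, 10, 10, 30, 30],
})

# Drop the second digit of the detailed code to get the general code
# (assumes two-digit detailed codes, as in the rows above).
general_from_detailed = sample["EMPSTATD"] // 10

# True if every row agrees with the EMPSTAT column.
consistent = (general_from_detailed == sample["EMPSTAT"]).all()
print(consistent)  # True
```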

In [36]:
df['EMPSTAT'].head()

0    3
1    1
2    1
3    3
4    3
Name: EMPSTAT, dtype: int64

In [37]:
df['EMPSTATD'].head()

0    30
1    10
2    10
3    30
4    30
Name: EMPSTATD, dtype: int64

***
## Occupation, 2010 basis
OCC2010 is a harmonized occupation coding scheme based on the Census Bureau's 2010 ACS occupation classification scheme. Similar variables are offered for the 1950 (OCC1950) and 1990 (OCC1990) classifications. OCC2010 offers researchers a consistent, long-term classification of occupations.

Codes: https://usa.ipums.org/usa-action/variables/OCC2010#codes_section

In [38]:
df['OCC2010'].head()

0    9920
1     350
2    6260
3    9920
4    9920
Name: OCC2010, dtype: int64

***
## Industry, 1990 basis
IND1990 classifies industries from all years since 1950 into the 1990 Census Bureau industrial classification scheme. Like IND1950, IND1990 offers researchers a consistent long-term classification of industries.

Codes: https://usa.ipums.org/usa-action/variables/IND1990#codes_section

In [39]:
df['IND1990'].head()

0      0
1    840
2     60
3      0
4      0
Name: IND1990, dtype: int64

***
## Weeks worked last year, intervalled
WKSWORK2, like WKSWORK1, reports the number of weeks that the respondent worked for profit, pay, or as an unpaid family worker during the previous year. For the census, the reference period is the previous calendar year; for the ACS, the reference period is the previous 12 months. WKSWORK2 differs from WKSWORK1 in that responses are given in intervals (1-13 weeks, 14-26 weeks, and so on), instead of the precise number of weeks. This is because the 1960 and 1970 samples recorded weeks worked only in intervals. For the other years contained in WKSWORK2 (the 1940-1950 and 1980-2000 censuses, the ACS, and the PRCS), the exact number of weeks worked is recorded in WKSWORK1.


Codes: https://usa.ipums.org/usa-action/variables/WKSWORK2#codes_section
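
To make the interval codes readable, they can be mapped to labels. The mapping below is an assumption based on the intervals mentioned above (1-13 weeks, 14-26 weeks, and so on) and should be verified against the codes page before use; the input values are the first five printed in this notebook.

```python
import pandas as pd

# The first five WKSWORK2 values shown in this notebook.
wkswork2 = pd.Series([0, 6, 6, 0, 0], name="WKSWORK2")

# Assumed interval labels; verify against the WKSWORK2 codes page.
WKSWORK2_LABELS = {
    0: "N/A",
    1: "1-13 weeks",
    2: "14-26 weeks",
    3: "27-39 weeks",
    4: "40-47 weeks",
    5: "48-49 weeks",
    6: "50-52 weeks",
}

# Replace each interval code with its label.
weeks_label = wkswork2.map(WKSWORK2_LABELS)
print(weeks_label.tolist())
```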

In [40]:
df['WKSWORK2'].head()

0    0
1    6
2    6
3    0
4    0
Name: WKSWORK2, dtype: int64

***
## Usual hours worked per week
UHRSWORK reports the number of hours per week that the respondent usually worked, if the person worked during the previous year. The census inquiry relates to the previous calendar year, while the ACS and the PRCS use the previous 12 months as the reference period.

Codes: https://usa.ipums.org/usa-action/variables/UHRSWORK#codes_section

00 = N/A

99 = 99 hours (top code)
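
Because 0 means N/A rather than zero hours, leaving it in pulls averages down. A small sketch on the first five UHRSWORK values printed in this notebook, masking the N/A code and flagging top-coded records; the variable names are made up for the example.

```python
import pandas as pd

# The first five UHRSWORK values shown in this notebook.
uhrswork = pd.Series([0, 42, 42, 0, 0], name="UHRSWORK")

# Treat code 0 (N/A) as missing rather than as zero hours worked.
hours = uhrswork.where(uhrswork != 0)

# Mean over people with reported hours vs. the naive mean over all rows.
mean_hours = hours.mean()
naive_mean = uhrswork.mean()

# Flag top-coded records (99 stands for 99 hours, the top code).
top_coded = uhrswork == 99

print(mean_hours)  # 42.0
print(round(naive_mean, 1))  # 16.8
```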

In [41]:
df['UHRSWORK'].head()

0     0
1    42
2    42
3     0
4     0
Name: UHRSWORK, dtype: int64

***
## Wage and salary income
INCWAGE is a 7-digit numeric code reporting each respondent's total pre-tax wage and salary income - that is, money received as an employee - for the previous year. INCWAGE specific variable codes for missing, edited, or unidentified observations, observations not applicable (N/A), observations not in universe (NIU), top and bottom value coding, etc. are provided below by Census year (and data sample if specified).

Codes: https://usa.ipums.org/usa-action/variables/INCWAGE#codes_section

In [42]:
df['INCWAGE'].head()

0        0
1    38500
2    72000
3        0
4        0
Name: INCWAGE, dtype: int64

***
## CPI-U adjustment factor to 1999 dollars

CPI99 provides the CPI-U multiplier available from the Bureau of Labor Statistics to convert dollar figures to constant 1999 dollars. This corresponds to the dollar amounts in the 2000 census, which inquired about income in 1999. Multiplying dollar amounts by CPI99 (which is constant within years) will render them comparable across time and thus suitable for multivariate analysis.

In the raw fixed-width data, CPI99 is a 5-digit numeric variable with **three implied decimals**. For example, a CPI99 value of 15423 should be interpreted as 15.423. The values in this CSV extract (0.679) already include the decimal point. See the [IPUMS inflation adjustment page](https://usa.ipums.org/usa/cpi99.shtml) for more information on how to use CPI99. 

No Specific Variable Code: https://usa.ipums.org/usa-action/variables/CPI99#codes_section
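
The adjustment described above is a row-wise multiplication. A minimal sketch using the first five INCWAGE and CPI99 values printed in this notebook; the column name `INCWAGE_1999` and the variable `raw_cpi99` are made up for the example.

```python
import pandas as pd

# Toy frame: the first five INCWAGE and CPI99 values shown in this notebook.
sample = pd.DataFrame({
    "INCWAGE": [0, 38500, 72000, 0, 0],
    "CPI99": [0.679] * 5,
})

# Convert nominal wage income to constant 1999 dollars.
sample["INCWAGE_1999"] = sample["INCWAGE"] * sample["CPI99"]
print(sample["INCWAGE_1999"].round(1).tolist())

# If CPI99 arrives as a raw 5-digit integer, apply the three implied
# decimals first (example value from the text above).
raw_cpi99 = 15423
scaled_cpi99 = raw_cpi99 / 1000
print(scaled_cpi99)  # 15.423
```

Any special INCWAGE codes (N/A, missing) listed on the codes page would need to be masked before this multiplication.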

In [43]:
df['CPI99'].head()

0    0.679
1    0.679
2    0.679
3    0.679
4    0.679
Name: CPI99, dtype: float64