## About the ACSIncome Dataset
  
ACSIncome is one of several datasets created by [Ding et al.](https://proceedings.neurips.cc/paper_files/paper/2021/file/32e54441e6382a7fbacbbbaf3c450059-Paper.pdf) as an alternative to [UCI Adult](https://archive.ics.uci.edu/dataset/2/adult). A few key details about ACSIncome:
*   The dataset contains 1,664,500 datapoints pulled from the 2018 United States–wide [American Community Survey](https://www.census.gov/programs-surveys/acs) (ACS) [Public Use Microdata Sample](https://www.census.gov/programs-surveys/acs/microdata.html) (PUMS).
*   All fifty US states and Puerto Rico are represented in this dataset.
*   Each row represents a person described by various features, including age, race, and sex, which correspond to protected categories in different domains under US anti-discrimination laws.
*   The dataset includes only individuals older than 16 who worked at least 1 hour per week in the past year and had an income of at least $100.
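
The inclusion criteria in the last bullet can be expressed as a boolean mask. The following is a minimal sketch, not the authors' filtering code: the function name `acs_income_filter_mask` and the toy rows are hypothetical, and the exact comparison operators (strict versus inclusive) are assumptions based on the wording above. The column names `AGEP`, `WKHP`, and `PINCP` are the ones used in this dataset.

```python
import pandas as pd


def acs_income_filter_mask(df: pd.DataFrame) -> pd.Series:
    """Return True for rows that meet the three criteria described above."""
    older_than_16 = df["AGEP"] > 16        # assumption: strictly above 16
    worked = df["WKHP"] >= 1               # at least 1 hour per week
    min_income = df["PINCP"] >= 100        # assumption: "at least" is inclusive
    return older_than_16 & worked & min_income


# Hypothetical toy rows for illustration; not taken from the real dataset.
toy = pd.DataFrame({
    "AGEP": [15, 30, 45, 17],
    "WKHP": [20, 40, 0, 10],
    "PINCP": [5000, 32000, 60000, 50],
})

mask = acs_income_filter_mask(toy)
print(toy[mask])
```

Running the same mask against the loaded DataFrame and checking that every row passes is one way to confirm that a local copy of the data matches this description.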

For more information on the dataset and how it was created to reconstruct UCI Adult, check out the following citations:

> Ding, Frances, Moritz Hardt, John Miller, and Ludwig Schmidt. "[Retiring Adult: New Datasets for Fair Machine Learning.](https://proceedings.neurips.cc/paper_files/paper/2021/hash/32e54441e6382a7fbacbbbaf3c450059-Abstract.html)" Advances in Neural Information Processing Systems 34 (2021): 6478-6490.

> Sarah Flood, Miriam King, Renae Rodgers, Steven Ruggles, and J. Robert Warren (2020). Integrated Public Use Microdata Series, Current Population Survey: Version 8.0 [dataset]. Minneapolis, MN: IPUMS. https://doi.org/10.18128/D030.V8.0


## Features

After the dataset is imported, calling `acs_df.sample(5)` displays five random rows in a table in the output cell. Each row represents an individual, and each column represents an attribute of that individual, such as age, occupation, or place of birth.

The following table describes each feature column:

| Feature    | Description |
| -------- | ------- |
| AGEP | Age |
| COW | Class of worker (government employee, self-employed, private employee) |
| SCHL | Educational attainment (high school diploma, bachelor's degree, doctorate degree) |
| MAR  | Marital status |
| OCCP | Occupation |
| POBP | Place of birth |
| RELP | Relationship to householder (husband or wife, housemate or roommate, nursing home, group home, etc.)  |
| WKHP | Usual hours worked per week in the past 12 months |
| SEX | Male or female |
| RAC1P | Recoded detailed race code |
| ST | US state code that represents the individual's location |
| PINCP | Person's total yearly income |

All of these features are represented numerically, though some of them correspond to a coded value. For example, for the `COW` (Class of worker) feature, `1.0` represents *an employee of a private for-profit company or business, or of an individual, for wages, salary, or commissions* and `2.0` represents *an employee of a private not-for-profit, tax-exempt, or charitable organization*. See [the supplemental section](https://proceedings.neurips.cc/paper_files/paper/2021/file/32e54441e6382a7fbacbbbaf3c450059-Supplemental.pdf) of [Ding et al.](https://proceedings.neurips.cc/paper_files/paper/2021/file/32e54441e6382a7fbacbbbaf3c450059-Paper.pdf) and the [ACS PUMS 2018 Data Dictionary](https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2018.pdf) for the full mapping of codes.
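
As a rough illustration of working with coded values, the sketch below maps `COW` codes to readable labels. Only the two codes quoted above are filled in; the shortened label text, the `COW_LABELS` dictionary, and the `decode_cow` helper are hypothetical names introduced here, and the remaining codes would need to be taken from the data dictionary.

```python
import pandas as pd

# Only the two codes described in the text are included here.
# The remaining entries should come from the PUMS data dictionary.
COW_LABELS = {
    1.0: "Private for-profit employee",
    2.0: "Private not-for-profit employee",
}


def decode_cow(codes: pd.Series) -> pd.Series:
    """Map COW codes to labels; codes missing from COW_LABELS get a placeholder."""
    return codes.map(COW_LABELS).fillna("See data dictionary")


# Hypothetical toy values for illustration.
toy_cow = pd.Series([1.0, 2.0, 3.0], name="COW")
decoded = decode_cow(toy_cow)
print(decoded.tolist())
```

Keeping the mapping in a plain dictionary makes it easy to extend as more codes are looked up, and the placeholder makes unmapped codes visible instead of silently turning them into missing values.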

In [None]:
import pandas as pd

# Import the dataset
acs_df = pd.read_csv("data/acsincome_raw_2018.csv")

print(acs_df.shape)

# info() prints its summary directly and returns None, so wrapping it in print() would also print "None".
acs_df.info()

# Print five random rows of the pandas DataFrame.
# acs_df.sample(5)

In [14]:
acs_df.columns

Index(['AGEP', 'COW', 'SCHL', 'MAR', 'OCCP', 'POBP', 'RELP', 'WKHP', 'SEX',
       'RAC1P', 'ST', 'PINCP'],
      dtype='object')

In [None]:
print(acs_df.describe())

In [13]:
acs_df.sample(5)

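|  | AGEP | COW | SCHL | MAR | OCCP | POBP | RELP | WKHP | SEX | RAC1P | ST | PINCP |
| -------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- |
| 1366684 | 36.0 | 2.0 | 21.0 | 1.0 | 2634.0 | 47.0 | 1.0 | 40.0 | 2.0 | 1.0 | 47.0 | 32000.0 |
| 115676 | 46.0 | 2.0 | 21.0 | 1.0 | 3255.0 | 233.0 | 0.0 | 36.0 | 1.0 | 6.0 | 6.0 | 95000.0 |
| 39018 | 51.0 | 1.0 | 22.0 | 3.0 | 440.0 | 36.0 | 13.0 | 45.0 | 1.0 | 2.0 | 4.0 | 65000.0 |
| 1173243 | 58.0 | 3.0 | 19.0 | 1.0 | 4600.0 | 39.0 | 0.0 | 38.0 | 2.0 | 1.0 | 39.0 | 20000.0 |
| 834111 | 60.0 | 1.0 | 19.0 | 1.0 | 4435.0 | 28.0 | 0.0 | 40.0 | 2.0 | 2.0 | 28.0 | 38000.0 |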


In [None]:
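# Alternative column labels, listed in the same order as acs_df.columns.
# COLUMNS_DE replaces AGEP, RAC1P, ST, and PINCP with English labels.
COLUMNS_DE = ['AGE', 'COW', 'SCHL', 'MAR', 'OCCP', 'POBP', 'RELP', 'WKHP', 'SEX', 'RACE', 'STATE', 'INCOME']
# COLUMNS_NAMES_DE uses mostly German labels: ALTER (age), BILDUNG (education),
# F_STAND (presumably "Familienstand", marital status), GENRE (gender),
# RASSE (race), EINKOMMEN (income).
COLUMNS_NAMES_DE = ['ALTER', 'COW', 'BILDUNG', 'F_STAND', 'OCCP', 'POBP', 'RELP', 'WKHP', 'GENRE', 'RASSE', 'US-STATE', 'EINKOMMEN']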
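
The notebook does not show how these lists are used. One plausible use, sketched here purely as an assumption, is renaming the DataFrame columns by position. `RAW_COLUMNS` is a hypothetical name for the original column order shown in the `acs_df.columns` output, and the single data row is copied from the sample output above.

```python
import pandas as pd

# Original column order, as printed by acs_df.columns.
RAW_COLUMNS = ['AGEP', 'COW', 'SCHL', 'MAR', 'OCCP', 'POBP', 'RELP', 'WKHP',
               'SEX', 'RAC1P', 'ST', 'PINCP']
COLUMNS_NAMES_DE = ['ALTER', 'COW', 'BILDUNG', 'F_STAND', 'OCCP', 'POBP', 'RELP',
                    'WKHP', 'GENRE', 'RASSE', 'US-STATE', 'EINKOMMEN']

# Pair old and new names by position; both lists have twelve entries.
rename_map = dict(zip(RAW_COLUMNS, COLUMNS_NAMES_DE))

# One row from the sample output, used as a stand-in for acs_df.
toy = pd.DataFrame(
    [[36.0, 2.0, 21.0, 1.0, 2634.0, 47.0, 1.0, 40.0, 2.0, 1.0, 47.0, 32000.0]],
    columns=RAW_COLUMNS,
)

# rename() returns a new DataFrame and leaves the original untouched.
renamed = toy.rename(columns=rename_map)
print(renamed.columns.tolist())
```

Building an explicit mapping instead of assigning to `columns` directly keeps the rename correct even if the column order of the loaded file changes.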