# Income Prediction: Download adult income data to a local file

This notebook downloads the raw Adult income data into a pandas DataFrame and saves it to a local file for later analysis.

Data source: [Adult dataset](https://archive.ics.uci.edu/dataset/2/adult) in the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/).

In [1]:
CSV_FILE_TYPE = 'csv'
PARQUET_FILE_TYPE = 'parquet'
PARQUET_ENGINE = 'pyarrow'
OUTPUT_FILE_TYPE = PARQUET_FILE_TYPE  # or CSV_FILE_TYPE
OUTPUT_DATA_PATH = f'../data/adult_income_raw.{OUTPUT_FILE_TYPE}'

In [2]:
import pandas as pd
import pyarrow  # Parquet engine passed to DataFrame.to_parquet below; not referenced directly
from ucimlrepo import fetch_ucirepo

## Read data from the public online repository

In [3]:
adult_income = fetch_ucirepo(name='Adult').data.original
adult_income.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       47879 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      47876 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48568 non-null  object
 14  income          48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB
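
The `info()` output above shows 48842 rows, but `workclass` (47879 non-null), `occupation` (47876 non-null), and `native-country` (48568 non-null) fall short of that, which means 963, 966, and 274 missing values respectively. A minimal sketch of how such counts could be computed directly follows. The helper name `count_missing` and the tiny `sample` frame are hypothetical stand-ins so the snippet runs without a download; in the notebook the same call would presumably be applied to `adult_income`.

```python
import pandas as pd


def count_missing(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column, omitting complete columns."""
    missing = df.isna().sum()
    return missing[missing > 0]


# Tiny stand-in frame (hypothetical values) so the sketch runs without a download.
sample = pd.DataFrame(
    {
        "age": [39, 50, 38],
        "workclass": ["State-gov", None, "Private"],
        "native-country": [None, None, "Cuba"],
    }
)
missing_counts = count_missing(sample)
print(missing_counts)
```

Filtering to columns with at least one missing value keeps the summary short when most columns are complete, as is the case here.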


## Write the data to a local file

In [4]:
# Save the raw DataFrame to a local file for later analysis.
if OUTPUT_FILE_TYPE == PARQUET_FILE_TYPE:
    adult_income.to_parquet(
        path=OUTPUT_DATA_PATH,
        engine=PARQUET_ENGINE,
        index=False,
    )
elif OUTPUT_FILE_TYPE == CSV_FILE_TYPE:
    adult_income.to_csv(
        path_or_buf=OUTPUT_DATA_PATH,
        index=False,
    )
else:
    raise ValueError(f"Unexpected {OUTPUT_FILE_TYPE=}. Use one of ['{PARQUET_FILE_TYPE}', '{CSV_FILE_TYPE}'].")
print(f"Data saved to: {OUTPUT_DATA_PATH}")

Data saved to: ../data/adult_income_raw.parquet
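
A later analysis notebook would need to load the saved file again. The sketch below shows one way to do that, choosing the pandas reader from the file extension so it works for either output type. The helper name `read_raw_data` is hypothetical and not part of this notebook, and the demo uses a small stand-in frame written to a temporary CSV file rather than the real output path.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_raw_data(path: str) -> pd.DataFrame:
    """Load a saved file, picking the pandas reader from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".parquet":
        return pd.read_parquet(path)  # requires a Parquet engine such as pyarrow
    if suffix == ".csv":
        return pd.read_csv(path)
    raise ValueError(f"Unsupported file extension: {suffix!r}")


# Round-trip demo on a tiny stand-in frame (hypothetical values), using CSV
# so it runs even without a Parquet engine installed.
sample = pd.DataFrame({"age": [39, 50], "income": ["<=50K", ">50K"]})
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = str(Path(tmp_dir) / "sample.csv")
    sample.to_csv(sample_path, index=False)
    restored = read_raw_data(sample_path)

print(restored)
```

One practical difference between the two formats: Parquet stores column types alongside the data, whereas CSV types are inferred again on each read, so a Parquet round trip is less likely to change dtypes.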
