---
title: "Data Cleaning"
format:
    html: 
        code-fold: false
---


# Code 


In [239]:
# Load in necessary packages
import pandas as pd
import numpy as np

I first handled the survey data from **Pew Research Center's American Trends Panel Wave 111**. The sample is reasonably large: the file has 6,034 respondents (rows) and 139 columns.

In [240]:
# Read in .sav file
W111_df = pd.read_spss("../../data/raw-data/ATP_W111.sav")
#print(W111_df.head())

# Display data frame shape and column titles
print(W111_df.shape)
print(W111_df.columns)

(6034, 139)
Index(['QKEY', 'INTERVIEW_START_W111', 'INTERVIEW_END_W111',
       'DEVICE_TYPE_W111', 'LANG_W111', 'XTABLET_W111', 'SHOP18_W111',
       'SHOP19_W111', 'METOO1_W111', 'METOOSUPOE_M1_W111',
       ...
       'F_PARTYLN_FINAL', 'F_PARTYSUM_FINAL', 'F_PARTYSUMIDEO_FINAL',
       'F_INC_SDT1', 'F_REG', 'F_IDEO', 'F_INTFREQ', 'F_VOLSUM', 'F_INC_TIER2',
       'WEIGHT_W111'],
      dtype='object', length=139)


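First, I clean up the column names by stripping whitespace and removing the `_W111` wave suffix and the `F_` prefix, then strip whitespace from the string values.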

In [241]:
# Clean and filter

# Remove whitespace from column names
W111_df.columns = W111_df.columns.str.strip()

# Remove the "_W111" wave suffix and the "F_" prefix from column names
W111_df.columns = (
    W111_df.columns
    .str.replace(r"_W111$", "", regex=True)  # drop the trailing wave suffix
    .str.replace(r"^F_", "", regex=True)     # drop the leading "F_"
)


# Remove whitespace from each row in each column if column data type is string
for col in W111_df.columns:
    if W111_df[col].dtype == "object":
        W111_df[col] = W111_df[col].str.strip()
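
As a quick check of the renaming logic, the sketch below applies the same suffix and prefix stripping, written with regular expressions, to a small made-up DataFrame. The column names and values are illustrative assumptions, not rows from the survey file.

```python
import pandas as pd

# Toy frame with made-up columns that mimic the survey's naming pattern
toy_df = pd.DataFrame({
    " SHOP4_W111 ": ["Yes ", " No"],
    "F_GENDER": ["A man", "A woman"],
    "QKEY": [1, 2],
})

# Strip whitespace, then the "_W111" suffix, then the "F_" prefix
toy_df.columns = (
    toy_df.columns.str.strip()
    .str.replace(r"_W111$", "", regex=True)
    .str.replace(r"^F_", "", regex=True)
)

# Strip whitespace from string-valued columns only
for col in toy_df.columns:
    if pd.api.types.is_string_dtype(toy_df[col]):
        toy_df[col] = toy_df[col].str.strip()

print(list(toy_df.columns))
```

Anchoring the patterns with `$` and `^` keeps the replacement from touching the same characters in the middle of a column name.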



After consulting the survey's questionnaire document (included ...) to see what each feature (column) refers to, I selected the following columns to examine.

In [242]:
W111_columns_keep = ["ONLSHOP1_a", "ONLSHOP1_b", "ONLSHOP1_c", "SHOP4", "SNSUSE", "ONLSHOP5", "MARITAL", "USR_SELFID", "AGECAT", 
                     "GENDER", "EDUCCAT", "RACECMB", "INC_SDT1"]
W111_df = W111_df[W111_columns_keep]

# View column data types
print(W111_df.dtypes)

ONLSHOP1_a    category
ONLSHOP1_b    category
ONLSHOP1_c    category
SHOP4         category
SNSUSE        category
ONLSHOP5      category
MARITAL       category
USR_SELFID    category
AGECAT        category
GENDER        category
EDUCCAT       category
RACECMB       category
INC_SDT1      category
dtype: object


In [243]:
# Note: W111_df has no 'age' column at this point, so this rename leaves the columns unchanged
W111_df.rename(columns={'age': 'age_years'}, inplace=True)

In [244]:
# Check for null values per column
null_counts = W111_df.isnull().sum()
print(null_counts)

ONLSHOP1_a     142
ONLSHOP1_b     142
ONLSHOP1_c     142
SHOP4          142
SNSUSE         142
ONLSHOP5      1406
MARITAL          0
USR_SELFID       0
AGECAT           0
GENDER           0
EDUCCAT          0
RACECMB          2
INC_SDT1         0
dtype: int64


In [245]:
print(W111_df.columns)

Index(['ONLSHOP1_a', 'ONLSHOP1_b', 'ONLSHOP1_c', 'SHOP4', 'SNSUSE', 'ONLSHOP5',
       'MARITAL', 'USR_SELFID', 'AGECAT', 'GENDER', 'EDUCCAT', 'RACECMB',
       'INC_SDT1'],
      dtype='object')


Then I moved on to the data from the **Consumer Expenditure Survey**. For each dataset used from this survey, I went through the corresponding data dictionary (stored in an Excel file) to select potentially relevant features.
We begin with the income data.

In [246]:
#  Import data for income
income_1_df = pd.read_csv("../../data/raw-data/itii232.csv")
income_2_df = pd.read_csv("../../data/raw-data/itii233.csv")
income_3_df = pd.read_csv("../../data/raw-data/itii234.csv")
income_4_df = pd.read_csv("../../data/raw-data/itii241.csv")
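
The four file names differ only in their numeric suffix (232, 233, 234, 241), so the paths can also be built in a loop. This is a minimal sketch: `RAW_DIR`, `quarters`, and `build_paths` are hypothetical names introduced for illustration, and the reading step is left as a comment because it depends on the local files.

```python
from pathlib import Path

RAW_DIR = Path("../../data/raw-data")    # same relative folder as used above
quarters = ["232", "233", "234", "241"]  # suffixes taken from the file names

def build_paths(prefix, quarters, raw_dir=RAW_DIR):
    """Return one CSV path per suffix for a given file prefix."""
    return [raw_dir / f"{prefix}{q}.csv" for q in quarters]

income_paths = build_paths("itii", quarters)
print([p.name for p in income_paths])

# Reading would then be one comprehension (requires the files locally):
# income_dfs = [pd.read_csv(p) for p in income_paths]
```

The same helper could be reused for the other file families by changing the prefix.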

In [247]:
# Examine one of the dataframes
print(income_1_df)

          NEWID  REFMO  REFYR     UCC  PUBFLAG VALUE_  IMPNUM        VALUE
0       5090604      1   2023  900030        2    NaN       1  3169.833300
1       5090604      1   2023  900030        2    NaN       2  3169.833300
2       5090604      1   2023  900030        2    NaN       3  3169.833300
3       5090604      1   2023  900030        2    NaN       4  3169.833300
4       5090604      1   2023  900030        2    NaN       5  3169.833300
...         ...    ...    ...     ...      ...    ...     ...          ...
330445  5366911      5   2023  980071        2    NaN       1   820.250000
330446  5366911      5   2023  980071        2    NaN       2   250.000000
330447  5366911      5   2023  980071        2    NaN       3   100.000000
330448  5366911      5   2023  980071        2    NaN       4   294.666667
330449  5366911      5   2023  980071        2    NaN       5   160.250000

[330450 rows x 8 columns]


Now I filter the income dataframes for the relevant columns. From the data collection stage, we already know that each of the dataframes has 8 columns. We want the following three:
- "NEWID" is the unique identifier for the survey participant.
- "UCC" stands for Universal Classification Code; each code corresponds to a good, service, or other item that can be bought or sold. Here, the "UCC" values correspond to items that increase or decrease the individual's net worth.
- "VALUE" is the absolute value of the change in net worth.

The other 5 variables only hold data relevant to the survey process, so we subset the dataframes to these 3 columns.

In [248]:
income_columns_keep = ['NEWID', 'UCC', 'VALUE']

income_1_df = income_1_df[income_columns_keep]
print(income_1_df.shape)

income_2_df = income_2_df[income_columns_keep]
print(income_2_df.shape)

income_3_df = income_3_df[income_columns_keep]
print(income_3_df.shape)

income_4_df = income_4_df[income_columns_keep]
print(income_4_df.shape)

(330450, 3)
(330840, 3)
(322320, 3)
(325200, 3)


Next, we want to find the unique "UCC" values to see whether we have to deal with decreases in net worth.

In [249]:

# Initialize list that stores all unique values of the 'UCC' column
all_UCC_unique = []

# Function that appends the unique values of a column to the running list,
# skipping values that are already in the list
def find_unique_UCC_values(df, column_name, unique_UCC_values):

    unique_values = df[column_name].unique()
    for value in unique_values:
        # Check against the list passed in, not the global variable
        if value not in unique_UCC_values:
            unique_UCC_values.append(value)
        
  
find_unique_UCC_values(income_1_df, 'UCC', all_UCC_unique)
find_unique_UCC_values(income_2_df, 'UCC', all_UCC_unique)
find_unique_UCC_values(income_3_df, 'UCC', all_UCC_unique)
find_unique_UCC_values(income_4_df, 'UCC', all_UCC_unique)

print(all_UCC_unique)

[900030, 900170, 900180, 980000, 980071, 800940, 900000, 900160, 900150, 900090, 900190, 900200, 900210, 900120, 900140]
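
The same list of codes can be collected without a helper function by concatenating the column across files and calling `unique()`. This is a sketch on made-up stand-ins for two quarterly files; `q1` and `q2` are hypothetical names.

```python
import pandas as pd

# Toy stand-ins for two quarterly income files (values are made up)
q1 = pd.DataFrame({"UCC": [900030, 900170, 900030]})
q2 = pd.DataFrame({"UCC": [900170, 980000]})

# Concatenate the column across files; unique() keeps first-seen order
all_codes = pd.concat([q1["UCC"], q2["UCC"]], ignore_index=True).unique().tolist()
print(all_codes)
```

Because `unique()` preserves the order of first appearance, the result matches what the loop-based approach would collect.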


By referring to the data dictionary, I found that the "UCC" values are mostly associated with increases in net worth, except for 800940, which represents deductions for Social Security.
Some of the codes overlap; for example, 980071 represents income after taxes. For simplicity, I focus only on pre-tax income, so we filter for the following:
- 900030: Social Security and railroad retirement income
- 900170: Retirement, survivors, disability income
- 900180: Interest and dividends
- 980000: Income before taxes
- 800940: Deductions for Social Security
- 900150: Food stamps

The following codes correspond to income that is already included in 980000 (Income before taxes), so they are not kept separately:
- 900160: Self-employment income
- 900000: Wages and salaries 
- 900090: Supplemental security income
- 900190: Net room/rental income
- 900200: Royalty, estate, trust income
- 900210: Other regular income
- 900140: Other income

In [250]:
income_df_UCC_keep = [900030, 900170, 900180, 980000, 800940, 900150]

negation_UCC_value = 800940

# Function to filter for the 'UCC' values we want and negate if UCC = 800940
def filter_and_negate(df, negation_ucc):

  # Filter the DataFrame based on the UCC list; copy so the edit below
  # does not act on a view of the original DataFrame
  filtered_df = df[df['UCC'].isin(income_df_UCC_keep)].copy()

  # Negate the 'VALUE' column for the specific UCC
  filtered_df.loc[filtered_df['UCC'] == negation_ucc, 'VALUE'] *= -1

  return filtered_df

# Apply the function to the data frames and check the shape 
income_1_df = filter_and_negate(income_1_df, negation_UCC_value)
print(income_1_df.shape)

income_2_df = filter_and_negate(income_2_df, negation_UCC_value)
print(income_2_df.shape)

income_3_df = filter_and_negate(income_3_df, negation_UCC_value)
print(income_3_df.shape)

income_4_df = filter_and_negate(income_4_df, negation_UCC_value)
print(income_4_df.shape)


(182790, 3)
(182475, 3)
(178470, 3)
(179925, 3)
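
To illustrate the filter-and-negate step on a small scale, here is a sketch with made-up rows. The frame `toy_income` and its values are assumptions for illustration; only the UCC codes come from the list above.

```python
import pandas as pd

# Made-up rows: participant 1 has income, a deduction, and a code we drop
toy_income = pd.DataFrame({
    "NEWID": [1, 1, 1, 2],
    "UCC": [980000, 800940, 900000, 980000],
    "VALUE": [1000.0, 50.0, 900.0, 2000.0],
})
keep_codes = [980000, 800940]
deduction_code = 800940

# .copy() makes the filtered frame independent of toy_income,
# so the in-place negation cannot touch the original data
kept = toy_income[toy_income["UCC"].isin(keep_codes)].copy()
kept.loc[kept["UCC"] == deduction_code, "VALUE"] *= -1

# Summing now subtracts the deduction from the participant's total
net = kept.groupby("NEWID")["VALUE"].sum()
print(net)
```

Storing deductions as negative values means a plain `sum()` yields the net amount, with no special handling in the aggregation step.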


With the cleaned income dataframes, we aggregate each one to find the total income per year per individual, then concatenate the results to get the master income dataframe.

In [251]:
# Function sums income sources based on participant ID
def calculate_total_income(df):

  # Use reset_index to turn the 'NEWID' group index back into a regular column
  total_income_df = df.groupby('NEWID')['VALUE'].sum().reset_index()

  # Rename columns
  total_income_df.columns = ['id', 'total_income']
  return total_income_df


# Calculate total income for each DataFrame
total_income_df1 = calculate_total_income(income_1_df)
total_income_df2 = calculate_total_income(income_2_df)
total_income_df3 = calculate_total_income(income_3_df)
total_income_df4 = calculate_total_income(income_4_df)

# Concatenate dataframes to get total income per survey participant
total_income_df = pd.concat([total_income_df1, total_income_df2, total_income_df3, total_income_df4], axis = 0)

#DF of income over a year
print(total_income_df.shape)

(18829, 2)
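
The aggregate-then-concatenate pattern can be sketched on toy data as follows. The helper `total_by_id` and the frames `q1` and `q2` are hypothetical; the final check for repeated ids is an optional sanity step, not something the original workflow performs.

```python
import pandas as pd

def total_by_id(df):
    """Sum VALUE per NEWID and return a two-column frame."""
    out = df.groupby("NEWID", as_index=False)["VALUE"].sum()
    return out.rename(columns={"NEWID": "id", "VALUE": "total_income"})

# Made-up quarterly frames
q1 = pd.DataFrame({"NEWID": [10, 10, 20], "VALUE": [100.0, -5.0, 40.0]})
q2 = pd.DataFrame({"NEWID": [30, 30], "VALUE": [7.0, 3.0]})

# ignore_index=True gives the stacked frame a fresh 0..n-1 index
totals = pd.concat([total_by_id(q1), total_by_id(q2)], ignore_index=True)
print(totals)

# Optional check: does any id appear in more than one file?
has_repeats = totals["id"].duplicated().any()
print(has_repeats)
```

Passing `ignore_index=True` avoids the repeated index labels that appear when each per-file result keeps its own 0-based index.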


In [252]:
print(total_income_df)

           id  total_income
0     5090604   101692.5000
1     5090624    34467.5010
2     5090634   155839.9995
3     5090664    72695.0001
4     5090674    43196.2500
...       ...           ...
4675  5607961   130770.0000
4676  5607981   364462.6290
4677  5608001    84775.0005
4678  5608051   486724.8864
4679  5608061    65240.0010

[18829 rows x 2 columns]


Now we handle the expenditures data. 

In [253]:
#  Import data for expenses
expense_1_df = pd.read_csv("../../data/raw-data/mtbi232.csv")
expense_2_df = pd.read_csv("../../data/raw-data/mtbi233.csv")
expense_3_df = pd.read_csv("../../data/raw-data/mtbi234.csv")
expense_4_df = pd.read_csv("../../data/raw-data/mtbi241.csv")

# Subset expense Data Frames for the relevant columns
expense_columns_keep = ['NEWID', 'SEQNO', 'UCC', 'COST']
expense_1_df = expense_1_df[expense_columns_keep]
expense_2_df = expense_2_df[expense_columns_keep]
expense_3_df = expense_3_df[expense_columns_keep]
expense_4_df = expense_4_df[expense_columns_keep]

expense_df = pd.concat([expense_1_df, expense_2_df, expense_3_df, expense_4_df], axis = 0)


By consulting the data dictionary, I located the specific files that listed the online purchases of tangible goods. 

In [254]:
specific_expense_df1 = pd.read_csv("../../data/raw-data/apb23.csv")
specific_expense_df2 = pd.read_csv("../../data/raw-data/eqb23.csv")
specific_expense_df3 = pd.read_csv("../../data/raw-data/mis23.csv")
specific_expense_df4 = pd.read_csv("../../data/raw-data/ovb23.csv")





In [255]:
print(specific_expense_df1.columns)
print(specific_expense_df2.columns)
print(specific_expense_df3.columns)
print(specific_expense_df4.columns)

Index(['QYEAR', 'NEWID', 'SEQNO', 'ALCNO', 'REC_ORIG', 'MINAPPLY', 'MINA_PLY',
       'GFTCMIN', 'GFTCMIN_', 'MIN_MO', 'MIN_MO_', 'MINPURX', 'MINPURX_',
       'MINRENTX', 'MINR_NTX', 'MNAPPL1', 'MNAPPL1_', 'MNAPPL2', 'MNAPPL2_',
       'MNAPPL3', 'MNAPPL3_', 'MNAPPL4', 'MNAPPL4_', 'MNAPPL5', 'MNAPPL5_',
       'MNAPPL6', 'MNAPPL6_', 'MNAPPL7', 'MNAPPL7_', 'MNAPPL8', 'MNAPPL8_',
       'MNAPPL9', 'MNAPPL9_', 'INSTLSCR', 'INST_SCR', 'INSTLLEX', 'INST_LEX',
       'APBPURCH'],
      dtype='object')
Index(['QYEAR', 'NEWID', 'SEQNO', 'ALCNO', 'REC_ORIG', 'APPRPRYB', 'APPR_RYB',
       'SRVCMOB', 'SRVCMOB_', 'REPAIRX', 'REPAIRX_', 'APPRPB1', 'APPRPB1_',
       'APPRPB2', 'APPRPB2_', 'APPRPB3', 'APPRPB3_', 'APPRPB4', 'APPRPB4_',
       'APPRPB5', 'APPRPB5_', 'APPRPB6', 'APPRPB6_', 'APPRPB7', 'APPRPB7_',
       'APPRPB8', 'APPRPB8_', 'APPRPB9', 'EQBPURCH', 'APPRPB9_'],
      dtype='object')
Index(['QYEAR', 'NEWID', 'SEQNO', 'ALCNO', 'REC_ORIG', 'MISCCODE', 'MISC_ODE',
       'MISCMO', 'MISCMO

We filter each purchase-specific dataframe for the relevant columns, giving four dataframes of online expenses.
- 'SEQNO' is the identifier variable for the purchases and can be used to merge with expense data frames. 
- 'APBPURCH' tells us if this item was purchased online or in-person.

In [256]:
apb_columns_keep = ['NEWID', 'SEQNO', 'APBPURCH']
online_expense_df1 = specific_expense_df1[apb_columns_keep]
online_expense_df1_subset = online_expense_df1.loc[online_expense_df1['APBPURCH'] == 1]


- 'EQBPURCH' tells us if this item was bought online or in-person.

In [257]:
eqb_columns_keep = ['NEWID', 'SEQNO', 'EQBPURCH']
online_expense_df2 = specific_expense_df2[eqb_columns_keep]
online_expense_df2_subset = online_expense_df2.loc[online_expense_df2['EQBPURCH'] == 1]

- 'MISPURCH' tells us if this item was bought online or in-person.

In [258]:
mis_columns_keep = ['NEWID', 'SEQNO', 'MISPURCH']
online_expense_df3 = specific_expense_df3[mis_columns_keep]
online_expense_df3_subset = online_expense_df3.loc[online_expense_df3['MISPURCH'] == 1]

- 'OVBPURCH' tells us if this item was bought online or in-person.

In [259]:
ovb_columns_keep = ['NEWID', 'SEQNO', 'OVBPURCH']
online_expense_df4 = specific_expense_df4[ovb_columns_keep]
online_expense_df4_subset = online_expense_df4.loc[online_expense_df4['OVBPURCH'] == 1]

Now we combine the mini-dataframes to get online_expense_df, which we merge with expense_df to see the dollar amount of each online purchase. Then we aggregate on the 'NEWID' key to calculate each person's total expense, total online expense, and online spending percentage.

In [260]:
#Concatenate all online expense df
online_expense_df = pd.concat([online_expense_df1_subset, online_expense_df2_subset, online_expense_df3_subset, online_expense_df4_subset], axis = 0)

# Filter for relevant columns (copy so that adding a column below does not modify a slice)
online_expense_df = online_expense_df[['NEWID', 'SEQNO']].copy()

# Add a column with imputed constant values of 1 for later merging so we know the expenses are online 
online_expense_df['Is_Online'] = 1
print("Online Expense Columns: ",online_expense_df.columns)

# Merge the two DataFrames based on 'NEWID' and 'SEQNO'
merged_expense_df = expense_df.merge(online_expense_df, on=['NEWID', 'SEQNO'], how='left')
print("Merged Expense Columns: ", merged_expense_df.columns)


# Calculate total expenses, online expenses, and online percentage for each individual
# Keep the cost only for rows flagged as online purchases (0 otherwise)
merged_expense_df['Online_Cost'] = merged_expense_df['COST'].where(merged_expense_df['Is_Online'] == 1, 0)

# Group by 'NEWID' column and aggregate
total_expense_df = merged_expense_df.groupby('NEWID').agg(

    id=('NEWID', 'first'),

    # Sum of 'COST' over all purchases
    Total_Expense=('COST', 'sum'),

    # Sum of 'COST' over online purchases only
    Online_Expense=('Online_Cost', 'sum'),
)

# Online purchases as a % of total expenses
total_expense_df['Online_Percentage'] = total_expense_df['Online_Expense'] / total_expense_df['Total_Expense'] * 100

print("Total Expense Columns: ", total_expense_df.columns)
print(total_expense_df.shape)

Online Expense Columns:  Index(['NEWID', 'SEQNO', 'Is_Online'], dtype='object')
Merged Expense Columns:  Index(['NEWID', 'SEQNO', 'UCC', 'COST', 'Is_Online'], dtype='object')
Total Expense Columns:  Index(['id', 'Total_Expense', 'Online_Expense', 'Online_Percentage'], dtype='object')
(18871, 4)
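
The merge-and-aggregate logic can be traced on a small example. This sketch uses made-up purchases; `expenses`, `online`, and `Online_Cost` are hypothetical names, and the computation is a vectorized variant of the per-person percentage, not a copy of the cell above.

```python
import pandas as pd

# Made-up purchases: two per participant
expenses = pd.DataFrame({
    "NEWID": [1, 1, 2, 2],
    "SEQNO": [1, 2, 1, 2],
    "COST": [100.0, 50.0, 80.0, 20.0],
})
# Purchases flagged as online
online = pd.DataFrame({"NEWID": [1, 2], "SEQNO": [2, 1], "Is_Online": [1, 1]})

# Left merge keeps every expense; unmatched rows get NaN in Is_Online
merged = expenses.merge(online, on=["NEWID", "SEQNO"], how="left")
merged["Is_Online"] = merged["Is_Online"].fillna(0).astype(int)

# Cost counts toward the online total only when the flag is 1
merged["Online_Cost"] = merged["COST"] * merged["Is_Online"]

summary = merged.groupby("NEWID").agg(
    Total_Expense=("COST", "sum"),
    Online_Expense=("Online_Cost", "sum"),
)
summary["Online_Percentage"] = summary["Online_Expense"] / summary["Total_Expense"] * 100
print(summary)
```

Computing the online cost as a column before grouping avoids indexing each group with a mask built from the full dataframe, which is slower on large inputs.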


Now we move on to handling the characteristics data.

In [261]:
# Import data for individual characteristics
characteristics_df_1 = pd.read_csv("../../data/raw-data/memi232.csv")
characteristics_df_2 = pd.read_csv("../../data/raw-data/memi233.csv")
characteristics_df_3 = pd.read_csv("../../data/raw-data/memi234.csv")
characteristics_df_4 = pd.read_csv("../../data/raw-data/memi241.csv")

# Filter for  relevant columns
characteristics_columns_keep = ["AGE", "ARM_FORC", "EARNTYPE", "EDUCA", "INCWEEKQ", 
                                "INDRETX", "JSSDEDX", "JSSDEDXM", "MARITAL", "MEMBRACE", "RC_ASIAN", "RC_BLACK", "RC_DK", "RC_NATAM",
                                "RC_OTHER", "RC_PACIL", "RC_WHITE", "SEX", "SOCRRX"]

characteristics_df_1 = characteristics_df_1[characteristics_columns_keep]
characteristics_df_2 = characteristics_df_2[characteristics_columns_keep]
characteristics_df_3 = characteristics_df_3[characteristics_columns_keep]
characteristics_df_4 = characteristics_df_4[characteristics_columns_keep]

# Combine all of them
characteristics_df = pd.concat([characteristics_df_1, characteristics_df_2, characteristics_df_3, characteristics_df_4], axis = 0)




In [262]:
# Rename columns
characteristics_df.rename(columns={"AGE": "age", "ARM_FORC": "is_military",  "EARNTYPE": "earning_type", "EDUCA": "highest_ed_completed", 
                                   "INCWEEKQ": "num_weeks_worked_in_last_yr", "INDRETX": "deposited_money_in_retirement_this_yr", 
                                   "JSSDEDX": "income_into_SS_this_yr", "JSSDEDXM": "SS_payments_received_this_yr", "MARITAL": "marital_status", 
                                   "MEMBRACE": "race", "RC_ASIAN": "is_asian", "RC_BLACK": "is_black", "RC_DK": "race_unknown", 
                                   "RC_NATAM": "is_native_american", "RC_OTHER": "is_other_race", "RC_PACIL": "is_pacific_islander", 
                                   "RC_WHITE": "is_white", "SEX": "sex", "SOCRRX": "SS_and_railroad_retirement_income_received"}, inplace=True)


We need to do some data cleaning for the categorical variables and one-hot encode them.

In [263]:
# Recode is_military so that 2 becomes 0, giving a binary 1/0 column
characteristics_df["is_military"] = characteristics_df['is_military'].replace(2, 0)


# One-hot encode the 'earning_type' column
characteristics_df = pd.get_dummies(characteristics_df, columns=['earning_type'], prefix='', prefix_sep='', dtype = int)

# Rename the earning_type columns
characteristics_df.rename(columns={'1.0': 'full_time_1_yr', 
                   '2.0': 'part_time_1_yr', 
                   '3.0': 'full_time_part_yr', 
                   '4.0': 'part_time_part_yr'}, inplace=True)

# One-hot encode the 'highest_ed_completed' column
characteristics_df = pd.get_dummies(characteristics_df, columns=['highest_ed_completed'], prefix='', prefix_sep='', dtype = int)

# Rename the highest_ed_completed columns
characteristics_df.rename(columns={'1.0': 'no_school_completed', 
                   '2.0': 'grades_1-8_completed', 
                   '3.0': 'high_school_no_degree', 
                   '4.0': 'high_school_grad',
                   '5.0': 'some_college_no_degree',
                   '6.0': 'associates_degree',
                   '7.0': 'bachelors_degree',
                   '8.0': 'graduate_degree'}, inplace=True)

# One-hot encode the 'marital_status' column
characteristics_df = pd.get_dummies(characteristics_df, columns=['marital_status'], prefix='', prefix_sep='', dtype = int)

# Rename the marital_status columns
characteristics_df.rename(columns={'1': 'is_married', 
                   '2': 'is_widowed', 
                   '3': 'is_divorced', 
                   '4': 'is_separated',
                   '5': 'is_single'}, inplace=True)

# One-hot encode the 'sex' column
characteristics_df = pd.get_dummies(characteristics_df, columns=['sex'], prefix='', prefix_sep='', dtype = int)

# Rename the sex columns
characteristics_df.rename(columns={'1': 'is_male', 
                   '2': 'is_female'}, inplace=True)
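
An alternative way to arrive at labelled indicator columns is to map the numeric codes to labels first and one-hot encode afterwards. The sketch below uses made-up values; the code-to-label mapping is copied from the marital status rename above, while `toy` and `labels` are hypothetical names.

```python
import pandas as pd

# Made-up marital status codes
toy = pd.DataFrame({"marital_status": [1, 5, 3, 1]})

# Code-to-label mapping copied from the rename step above
labels = {
    1: "is_married",
    2: "is_widowed",
    3: "is_divorced",
    4: "is_separated",
    5: "is_single",
}

# Map codes to labels, then one-hot encode the labelled column
mapped = toy["marital_status"].map(labels)
dummies = pd.get_dummies(mapped, dtype=int)

# Reindex so every category gets a column, even if absent from the data
dummies = dummies.reindex(columns=list(labels.values()), fill_value=0)
print(dummies)
```

Mapping before encoding means the intermediate columns are never named `'1'` or `'2'`, so two encoded variables that share the same codes cannot collide.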

In [264]:
# View values per column about race
# Legend: White - 1, Black - 2, Native American - 3, Asian - 4, Pacific Islander - 5, Other - 6, Unknown - 7
race_columns = ["is_asian", "is_black", "race_unknown", "is_native_american", "is_other_race", "is_pacific_islander", "is_white"]
print(characteristics_df["race"].value_counts())
for race in race_columns:
    print(characteristics_df[race].value_counts())

# Convert the race indicator columns to 0/1 flags
# Replace each column's numeric code with 1 (is_white is already coded as 1)
characteristics_df["is_asian"] = characteristics_df['is_asian'].replace(4, 1)
characteristics_df["is_black"] = characteristics_df['is_black'].replace(2, 1)
characteristics_df["race_unknown"] = characteristics_df['race_unknown'].replace(7, 1)
characteristics_df["is_native_american"] = characteristics_df['is_native_american'].replace(3, 1)
characteristics_df["is_other_race"] = characteristics_df['is_other_race'].replace(6, 1)
characteristics_df["is_pacific_islander"] = characteristics_df['is_pacific_islander'].replace(5, 1)

# Replace NaN values with 0
characteristics_df[race_columns] = characteristics_df[race_columns].fillna(0)


race
1    34631
2     4560
4     3405
6     1040
5      259
3      235
Name: count, dtype: int64
is_asian
4.0    3714
Name: count, dtype: int64
is_black
2.0    4784
Name: count, dtype: int64
race_unknown
7.0    297
Name: count, dtype: int64
is_native_american
3.0    432
Name: count, dtype: int64
is_other_race
6.0    2595
Name: count, dtype: int64
is_pacific_islander
5.0    415
Name: count, dtype: int64
is_white
1.0    32913
Name: count, dtype: int64
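
Since each race column holds either a single code or a missing value, the recoding can also be written in one step. This is a sketch on made-up data, assuming (as the `fillna(0)` step above does) that a missing value means the category was not selected.

```python
import numpy as np
import pandas as pd

# Made-up rows: each column holds its own code or NaN
toy = pd.DataFrame({
    "is_asian": [4.0, np.nan, np.nan],
    "is_black": [np.nan, 2.0, np.nan],
    "is_white": [1.0, np.nan, 1.0],
})
flag_cols = ["is_asian", "is_black", "is_white"]

# Any non-missing code becomes 1; missing becomes 0
toy[flag_cols] = toy[flag_cols].notna().astype(int)
print(toy)
```

This avoids listing each column's code by hand, at the cost of treating any non-missing value as a selection.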


In [265]:
print(characteristics_df.columns)

Index(['age', 'is_military', 'num_weeks_worked_in_last_yr',
       'deposited_money_in_retirement_this_yr', 'income_into_SS_this_yr',
       'SS_payments_received_this_yr', 'race', 'is_asian', 'is_black',
       'race_unknown', 'is_native_american', 'is_other_race',
       'is_pacific_islander', 'is_white',
       'SS_and_railroad_retirement_income_received', 'full_time_1_yr',
       'part_time_1_yr', 'full_time_part_yr', 'part_time_part_yr',
       'no_school_completed', 'grades_1-8_completed', 'high_school_no_degree',
       'high_school_grad', 'some_college_no_degree', 'associates_degree',
       'bachelors_degree', 'graduate_degree', 'is_married', 'is_widowed',
       'is_divorced', 'is_separated', 'is_single', 'is_male', 'is_female'],
      dtype='object')
