---
title: "Data Cleaning"
format:
    html: 
        code-fold: false
---

<!-- After digesting the instructions, you can delete this cell, these are assignment instructions and do not need to be included in your final submission.  -->

{{< include instructions.qmd >}} 

# Code 

Provide the source code used for this section of the project here.

If you're using a package for code organization, you can import it at this point. However, make sure that the **actual workflow steps**—including data processing, analysis, and other key tasks—are conducted and clearly demonstrated on this page. The goal is to show the technical flow of your project, highlighting how the code is executed to achieve your results.

If relevant, link to additional documentation or external references that explain any complex components. This section should give readers a clear view of how the project is implemented from a technical perspective.

Remember, this page is a technical narrative, not just a notebook with a collection of code cells. Include in-line prose to describe what is going on.

In [8]:
# Load in necessary packages
import pandas as pd
import numpy as np

I first handled the survey data from **Pew Research Center's American Trends Panel Wave 111**. The file contains 6,034 rows and 139 columns.

In [9]:
# Read in .sav file
W111_df = pd.read_spss("../../data/raw-data/ATP_W111.sav")
#print(W111_df.head())

# Display data frame shape and column titles
print(W111_df.shape)
print(W111_df.columns)

(6034, 139)
Index(['QKEY', 'INTERVIEW_START_W111', 'INTERVIEW_END_W111',
       'DEVICE_TYPE_W111', 'LANG_W111', 'XTABLET_W111', 'SHOP18_W111',
       'SHOP19_W111', 'METOO1_W111', 'METOOSUPOE_M1_W111',
       ...
       'F_PARTYLN_FINAL', 'F_PARTYSUM_FINAL', 'F_PARTYSUMIDEO_FINAL',
       'F_INC_SDT1', 'F_REG', 'F_IDEO', 'F_INTFREQ', 'F_VOLSUM', 'F_INC_TIER2',
       'WEIGHT_W111'],
      dtype='object', length=139)


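First, I clean up the column names and strip stray whitespace. The "_W111" suffix and the "F_" prefix are removed from the column names, and leading or trailing whitespace is removed from string values.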

In [None]:
# Clean and filter

# Remove whitespace from column names
W111_df.columns = W111_df.columns.str.strip()

# Drop the "_W111" suffix or, failing that, the "F_" prefix from a column name
def clean_column_name(col):
    if col.endswith("_W111"):
        return col[:-len("_W111")]
    if col.startswith("F_"):
        return col[len("F_"):]
    return col

# Rename all columns in one pass instead of once per column
W111_df = W111_df.rename(columns=clean_column_name)

# Remove whitespace from each value if the column data type is string (object)
for col in W111_df.columns:
    if W111_df[col].dtype == "object":
        W111_df[col] = W111_df[col].str.strip()
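
As a quick check of the renaming rule, the sketch below applies the same suffix and prefix removal to a small made-up frame, using vectorized string methods on the column index. Only the column names mirror the survey file; `toy_df` and its values are invented for illustration. Unlike the code above, this version would strip both the prefix and the suffix from a name that carries both.

```python
import pandas as pd

# Hypothetical toy frame; only the column names mirror the survey file
toy_df = pd.DataFrame(
    [[1, "Yes", "Registered", 0.9]],
    columns=["QKEY", "SHOP18_W111", " F_REG", "WEIGHT_W111 "],
)

# Strip whitespace, then drop a trailing "_W111" and a leading "F_"
cleaned = (
    toy_df.columns.str.strip()
    .str.replace(r"_W111$", "", regex=True)
    .str.replace(r"^F_", "", regex=True)
)
toy_df.columns = cleaned
print(list(toy_df.columns))
```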



After referring to the survey's questionnaire document (included ...) to see what each feature (column) refers to, I selected the following columns to look into.

In [18]:
W111_columns_keep = ["ONLSHOP1_a", "ONLSHOP1_b", "ONLSHOP1_c", "SHOP4", "SNSUSE", "ONLSHOP5", "MARITAL", "USR_SELFID", "AGECAT", 
                     "GENDER", "EDUCCAT", "RACECMB", "INC_SDT1"]
# Copy the selection so later edits do not act on a view of the original frame
W111_df = W111_df[W111_columns_keep].copy()

# View column data types
print(W111_df.dtypes)

ONLSHOP1_a    category
ONLSHOP1_b    category
ONLSHOP1_c    category
SHOP4         category
SNSUSE        category
ONLSHOP5      category
MARITAL       category
USR_SELFID    category
AGECAT        category
GENDER        category
EDUCCAT       category
RACECMB       category
INC_SDT1      category
dtype: object


In [None]:
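# Note: none of the selected columns is named 'age' (the age variable is "AGECAT"),
# so this rename leaves the data frame unchanged
W111_df.rename(columns={'age': 'age_years'}, inplace=True)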

In [None]:
# Check for null values per column
null_counts = W111_df.isnull().sum()
print(null_counts)

ONLSHOP1_a     142
ONLSHOP1_b     142
ONLSHOP1_c     142
SHOP4          142
SNSUSE         142
ONLSHOP5      1406
MARITAL          0
USR_SELFID       0
AGECAT           0
GENDER           0
EDUCCAT          0
RACECMB          2
INC_SDT1         0
dtype: int64
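
Raw counts are easier to judge as a share of the sample. The sketch below converts a few of the counts printed above into percentages of the 6,034 rows. The Series is typed in by hand from that output rather than recomputed from the survey file, and the names `reported_nulls` and `null_share_pct` are illustrative.

```python
import pandas as pd

# Null counts copied from the printed output above (a subset of the columns)
reported_nulls = pd.Series(
    {"ONLSHOP1_a": 142, "ONLSHOP5": 1406, "RACECMB": 2, "MARITAL": 0}
)
n_rows = 6034  # number of rows reported by W111_df.shape

# Share of missing values per column, in percent
null_share_pct = (reported_nulls / n_rows * 100).round(2)
print(null_share_pct)
```

On the real data frame, the same quantity can be computed directly with `W111_df.isnull().mean() * 100`.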


In [15]:
print(W111_df.columns)

Index(['ONLSHOP1_a', 'ONLSHOP1_b', 'ONLSHOP1_c', 'SHOP4', 'SNSUSE', 'ONLSHOP5',
       'MARITAL', 'USR_SELFID', 'AGECAT', 'GENDER', 'EDUCCAT', 'RACECMB',
       'INC_SDT1'],
      dtype='object')


Next, I handled the data from the Consumer Expenditure Survey, beginning with the income data.

In [62]:
#  Import data for income
income_1_df = pd.read_csv("../../data/raw-data/itii232.csv")
income_2_df = pd.read_csv("../../data/raw-data/itii233.csv")
income_3_df = pd.read_csv("../../data/raw-data/itii234.csv")
income_4_df = pd.read_csv("../../data/raw-data/itii241.csv")

In [63]:
# Preview the first income data frame
print(income_1_df)

          NEWID  REFMO  REFYR     UCC  PUBFLAG VALUE_  IMPNUM        VALUE
0       5090604      1   2023  900030        2    NaN       1  3169.833300
1       5090604      1   2023  900030        2    NaN       2  3169.833300
2       5090604      1   2023  900030        2    NaN       3  3169.833300
3       5090604      1   2023  900030        2    NaN       4  3169.833300
4       5090604      1   2023  900030        2    NaN       5  3169.833300
...         ...    ...    ...     ...      ...    ...     ...          ...
330445  5366911      5   2023  980071        2    NaN       1   820.250000
330446  5366911      5   2023  980071        2    NaN       2   250.000000
330447  5366911      5   2023  980071        2    NaN       3   100.000000
330448  5366911      5   2023  980071        2    NaN       4   294.666667
330449  5366911      5   2023  980071        2    NaN       5   160.250000

[330450 rows x 8 columns]


Now I filter for the relevant columns in the income data frames. From the data collection stage, we already know that each data frame has 8 columns. The variable "NEWID" is the unique identifier for the survey participant. The values of "UCC" are codes that indicate particular increases or decreases to the individual's net worth. The variable "VALUE" gives the absolute value of that change in net worth. The other 5 variables only carry data relevant to the survey process, so we subset the data frames to those 3 columns.

In [64]:
income_columns_keep = ['NEWID', 'UCC', 'VALUE']

income_1_df = income_1_df[income_columns_keep]
print(income_1_df.shape)

income_2_df = income_2_df[income_columns_keep]
print(income_2_df.shape)

income_3_df = income_3_df[income_columns_keep]
print(income_3_df.shape)

income_4_df = income_4_df[income_columns_keep]
print(income_4_df.shape)

(330450, 3)
(330840, 3)
(322320, 3)
(325200, 3)
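
The four subsetting steps follow the same pattern, so they can also be written once over a dictionary of frames. This is a sketch on made-up stand-ins for the income files, not the code used above. The helper `make_toy_frame`, the dictionary `toy_frames`, and its keys are hypothetical names.

```python
import pandas as pd

income_columns_keep = ["NEWID", "UCC", "VALUE"]

# Hypothetical stand-in for an income file, with the same 8 column names
def make_toy_frame(newid):
    return pd.DataFrame({
        "NEWID": [newid, newid],
        "REFMO": [1, 1],
        "REFYR": [2023, 2023],
        "UCC": [900030, 980000],
        "PUBFLAG": [2, 2],
        "VALUE_": [None, None],
        "IMPNUM": [1, 1],
        "VALUE": [100.0, 500.0],
    })

toy_frames = {"232": make_toy_frame(1), "233": make_toy_frame(2)}

# Subset every frame to the three columns of interest in one pass
toy_frames = {name: df[income_columns_keep] for name, df in toy_frames.items()}

for name, df in toy_frames.items():
    print(name, df.shape)
```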


Next, we want to find the unique "UCC" values to see whether we have to deal with decreases in net worth.

In [65]:

# Initialize list that stores all unique values of 'UCC' column
all_UCC_unique = []

# Function that appends any not-yet-seen values of a column to all_UCC_unique
# (it modifies the list in place and returns nothing)
def find_unique_UCC_values(df, column_name):

  unique_values = df[column_name].unique()
  for value in unique_values:
    if value not in all_UCC_unique:
        all_UCC_unique.append(value)
        
  
find_unique_UCC_values(income_1_df, 'UCC')
find_unique_UCC_values(income_2_df, 'UCC')
find_unique_UCC_values(income_3_df, 'UCC')
find_unique_UCC_values(income_4_df, 'UCC')

print(all_UCC_unique)

[900030, 900170, 900180, 980000, 980071, 800940, 900000, 900160, 900150, 900090, 900190, 900200, 900210, 900120, 900140]
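
An equivalent way to collect the codes is to stack the "UCC" columns and call `unique()` once, which keeps values in order of first appearance. The two frames below are small invented examples, not the survey files.

```python
import pandas as pd

# Hypothetical toy frames standing in for two of the income files
toy_1 = pd.DataFrame({"UCC": [900030, 900170, 900030]})
toy_2 = pd.DataFrame({"UCC": [900170, 980000, 800940]})

# Stack the columns, then de-duplicate while preserving first-seen order
all_codes = pd.concat([toy_1["UCC"], toy_2["UCC"]]).unique().tolist()
print(all_codes)
```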


By referring to the data dictionary, I found that the "UCC" values are mostly associated with increases, except for 800940, which represents deductions for Social Security. 
There is some overlap between the codes. For example, 980071 represents income after taxes. For simplicity, I focus only on pre-tax income, so we filter for the following:

- 900030: Social Security and railroad retirement income
- 900170: Retirement, survivors, disability income
- 900180: Interest and dividends
- 980000: Income before taxes
- 800940: Deductions for Social Security
- 900150: Food stamps

The following codes correspond to income that is lumped into 980000 (Income before taxes):

- 900160: Self-employment income
- 900000: Wages and salaries 
- 900090: Supplemental security income
- 900190: Net room/rental income
- 900200: Royalty, estate, trust income
- 900210: Other regular income
- 900140: Other income
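
To keep the codes readable during later checks, the six codes kept for the analysis can be stored as a lookup table. This is a sketch: the dictionary name `ucc_labels` and the toy frame are assumptions, while the codes and labels are copied from the lists above.

```python
import pandas as pd

# Codes kept for the analysis, with labels copied from the list above
ucc_labels = {
    900030: "Social Security and railroad retirement income",
    900170: "Retirement, survivors, disability income",
    900180: "Interest and dividends",
    980000: "Income before taxes",
    800940: "Deductions for Social Security",
    900150: "Food stamps",
}

# Hypothetical toy rows for illustration
toy_df = pd.DataFrame({"UCC": [980000, 800940, 900160], "VALUE": [500.0, 40.0, 75.0]})

# Codes missing from the lookup (here 900160) map to NaN
toy_df["label"] = toy_df["UCC"].map(ucc_labels)
print(toy_df)
```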

In [None]:
income_df_UCC_keep = [900030, 900170, 900180, 980000, 800940, 900150]

negation_UCC_value = 800940

# Function to filter for the 'UCC' values we want and negate 'VALUE' where UCC = 800940
def filter_and_negate(df, negation_ucc):

  # Filter the DataFrame based on the UCC list; copy so the edit below
  # does not act on a view of the original (avoids SettingWithCopyWarning)
  filtered_df = df[df['UCC'].isin(income_df_UCC_keep)].copy()

  # Negate the 'VALUE' column for the specific UCC
  filtered_df.loc[filtered_df['UCC'] == negation_ucc, 'VALUE'] *= -1

  return filtered_df

# Apply the function to the data frames and check the shape 
income_1_df = filter_and_negate(income_1_df, negation_UCC_value)
print(income_1_df.shape)

income_2_df = filter_and_negate(income_2_df, negation_UCC_value)
print(income_2_df.shape)

income_3_df = filter_and_negate(income_3_df, negation_UCC_value)
print(income_3_df.shape)

income_4_df = filter_and_negate(income_4_df, negation_UCC_value)
print(income_4_df.shape)


(182790, 3)
(182475, 3)
(178470, 3)
(179925, 3)
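
A toy example makes the effect of the filter-and-negate step concrete. The sketch below uses an invented three-row frame and `Series.where` instead of in-place assignment. It is an alternative formulation for illustration, not the function used above.

```python
import pandas as pd

keep_codes = [900030, 900170, 900180, 980000, 800940, 900150]
negation_code = 800940

# Hypothetical rows: one kept income code, the deduction code, one dropped code
toy_df = pd.DataFrame({
    "NEWID": [1, 1, 1],
    "UCC": [980000, 800940, 980071],
    "VALUE": [500.0, 40.0, 450.0],
})

# Keep only the wanted codes; copy so the assignment acts on an independent frame
filtered = toy_df[toy_df["UCC"].isin(keep_codes)].copy()

# Keep VALUE where the code is not the deduction code; otherwise use its negative
filtered["VALUE"] = filtered["VALUE"].where(
    filtered["UCC"] != negation_code, -filtered["VALUE"]
)
print(filtered)
```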


In [None]:
# Function sums income sources based on participant ID
def calculate_total_income(df):

  # Use reset_index to turn the 'NEWID' group index back into a regular column
  total_income_df = df.groupby('NEWID')['VALUE'].sum().reset_index()

  # Rename the columns
  total_income_df.columns = ['id', 'total_income']
  return total_income_df


# Calculate total income for each DataFrame
total_income_df1 = calculate_total_income(income_1_df)
total_income_df2 = calculate_total_income(income_2_df)
total_income_df3 = calculate_total_income(income_3_df)
total_income_df4 = calculate_total_income(income_4_df)

# Concatenate dataframes to get total income per survey participant
total_income_df = pd.concat([total_income_df1, total_income_df2, total_income_df3, total_income_df4], axis = 0)

# Shape of the combined data frame of income over a year
print(total_income_df.shape)

(18829, 2)


In [70]:
# Preview the combined income data frame
print(total_income_df)

           id  total_income
0     5090604   101692.5000
1     5090624    34467.5010
2     5090634   155839.9995
3     5090664    72695.0001
4     5090674    43196.2500
...       ...           ...
4675  5607961   130770.0000
4676  5607981   364462.6290
4677  5608001    84775.0005
4678  5608051   486724.8864
4679  5608061    65240.0010

[18829 rows x 2 columns]
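
Because the yearly table is built by stacking per-file totals, an "id" that appeared in more than one file would show up as more than one row. The sketch below, on invented data, shows the group-and-sum step and a quick duplicate check. Whether "NEWID" values actually repeat across the real files is not established here, and `total_by_id`, `stacked`, and `per_id` are illustrative names.

```python
import pandas as pd

# Hypothetical per-file income rows; id 2 appears in both toy files
toy_q1 = pd.DataFrame({"NEWID": [1, 1, 2], "VALUE": [500.0, -40.0, 300.0]})
toy_q2 = pd.DataFrame({"NEWID": [2, 3], "VALUE": [100.0, 250.0]})

def total_by_id(df):
    # Sum VALUE per NEWID and return a two-column frame
    out = df.groupby("NEWID", as_index=False)["VALUE"].sum()
    out.columns = ["id", "total_income"]
    return out

stacked = pd.concat([total_by_id(toy_q1), total_by_id(toy_q2)], ignore_index=True)
print(stacked)

# True only if every id appears exactly once in the stacked table
print(stacked["id"].is_unique)

# If ids do repeat, a second groupby collapses them to one row per id
per_id = stacked.groupby("id", as_index=False)["total_income"].sum()
print(per_id)
```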


Now we handle the expenditures data. 

In [None]:
#  Import data for expenses
expense_1_df = pd.read_csv("../../data/raw-data/mtbi232.csv")
expense_2_df = pd.read_csv("../../data/raw-data/mtbi233.csv")
expense_3_df = pd.read_csv("../../data/raw-data/mtbi234.csv")
expense_4_df = pd.read_csv("../../data/raw-data/mtbi241.csv")

