# 1. Lab 2 - Divorce Survey Lab
## Selection, Filtering, and Transformation

We’ll start with **quick concrete examples** of three core Pandas skills:  
1. **Selection** → Choosing rows & columns  
2. **Filtering** → Keeping rows that match conditions  
3. **Transformation** → Creating new columns from existing ones  

After this “warm-up,” we’ll keep iterating with more cleaning and advanced transformations.


In [1]:
import pandas as pd

# Load the divorce dataset
df = pd.read_csv("divorce_survey.csv")

# explore the columns, useful to know for later stages
df.columns

Index(['Unique_ID', 'Name', 'Gender', 'Age_When_First_Married',
       'Age_When_First_Divorced', 'Times_Divorced', 'Divorce_Reason',
       'Had_Children_Together', 'Children_Ages', 'Kept_Contact_After_Divorce',
       'Custody_Arrangement', 'Happiness_Before_Marriage',
       'Happiness_After_Marriage', 'Felt_Supported_By_Friends_Family',
       'Advice_For_Others', 'Biggest_Regret', 'Would_Marry_Again_Text',
       'Zip_Code', 'Years_Married', 'Education_Level', 'Annual_Income',
       'State_Of_Residence', 'Urban_or_Rural', 'Met_Spouse_How',
       'Marriage_Ceremony_Size', 'Conflict_Frequency',
       'Shared_Interests_Score', 'Reported_Stress_Level', 'Therapy_Attendance',
       'Time_Until_Remarried', 'Moved_To_New_City',
       'Change_In_Income_After_Divorce'],
      dtype='object')

# Section 1: Selection

In this section we practice selecting parts of a DataFrame.

### Instructor Demo
We’ll look at how to:
- Select a single column
- Select multiple columns
- Select rows by **position** (`.iloc`)
- Select rows/columns by **label** (`.loc`)

### Student Exercise
Select the **5th through 10th survey participants** and show their:
- Name  
- Gender  
- Age when they first got married  



In [2]:
# Single column
df['Name'].head()

0      Fayez
1      Jason
2    Michael
3     Hassan
4    Desiree
Name: Name, dtype: object

In [3]:
# Multiple columns
df[['Name','Gender','Age_When_First_Married']].head()

Unnamed: 0,Name,Gender,Age_When_First_Married
0,Fayez,Nonbinary,30
1,Jason,male,30
2,Michael,Male,30
3,Hassan,Female,19
4,Desiree,Nonbinary,18


In [4]:
# Using .iloc (position-based)
df.iloc[0:5, 0:3]   # first 5 rows, first 3 columns

Unnamed: 0,Unique_ID,Name,Gender
0,P0082,Fayez,Nonbinary
1,P0005,Jason,male
2,P0041,Michael,Male
3,P0070,Hassan,Female
4,P0011,Desiree,Nonbinary


In [5]:
# Using .loc (label-based); unlike .iloc, the end label (5) is included
df.loc[0:5, ['Name','Gender','Age_When_First_Married']]

Unnamed: 0,Name,Gender,Age_When_First_Married
0,Fayez,Nonbinary,30
1,Jason,male,30
2,Michael,Male,30
3,Hassan,Female,19
4,Desiree,Nonbinary,18
5,Ian,Nonbinary,26
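
One detail worth pausing on: `.iloc` slices stop *before* the end position, while `.loc` slices *include* the end label. Here is a minimal sketch on a made-up frame (the `toy` data is invented for illustration and is not the survey file):

```python
import pandas as pd

# Toy frame with the default 0..7 integer index (made-up data, not the survey)
toy = pd.DataFrame({
    "Name": list("ABCDEFGH"),
    "Gender": ["F", "M"] * 4,
    "Age": range(20, 28),
})

by_position = toy.iloc[0:5]   # positions 0,1,2,3,4 -> end position is excluded
by_label = toy.loc[0:5]       # labels 0,1,2,3,4,5 -> end label is included

print(len(by_position), len(by_label))
```

With the default integer index the two look alike, which is why the extra row from `.loc` is easy to miss.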


In [None]:
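# Exercise:
# Rows at positions 5 through 10 (the .iloc end, 11, is excluded);
# columns 1-3 are Name, Gender, Age_When_First_Married.
# (Counting from 1 instead, the 5th-10th participants would be df.iloc[4:10, 1:4].)

df.iloc[5:11, 1:4]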

Unnamed: 0,Name,Gender,Age_When_First_Married
5,Ian,Nonbinary,26
6,Harshitha,Nonbinary,18
7,Selam,Male,28
8,Thara,Male,24
9,Kashvi,male,28
10,Adron,female,20


# Section 2: Filtering

Now we’ll practice filtering rows in our dataset.  
Filtering means finding survey participants who meet **certain conditions**.

### Instructor Demo
We’ll look at:
- Boolean masks (e.g., `df[df['Age_When_First_Married'] < 21]`)
- Checking membership with `.isin()`
- Combining conditions with `&` (and), `|` (or)

### Student Exercises

1. **Single condition filter**  
   Find all participants who **married before age 21**.  
   Show only their:
   - Name  
   - Gender  
   - Age when first married  

2. **Multi-condition filter**  
   Find all participants who **have been divorced at least once AND had children together**.  
   Show only their:
   - Name  
   - Times Divorced  
   - Had Children Together  


In [None]:
# Check the original number of rows so we can see how many each filter removes
df.shape

(40, 32)

In [None]:
# Boolean mask
df[df['Age_When_First_Married'] < 21]

Unnamed: 0,Unique_ID,Name,Gender,Age_When_First_Married,Age_When_First_Divorced,Times_Divorced,Divorce_Reason,Had_Children_Together,Children_Ages,Kept_Contact_After_Divorce,...,Urban_or_Rural,Met_Spouse_How,Marriage_Ceremony_Size,Conflict_Frequency,Shared_Interests_Score,Reported_Stress_Level,Therapy_Attendance,Time_Until_Remarried,Moved_To_New_City,Change_In_Income_After_Divorce
3,P0070,Hassan,Female,19,21,3,irreconcilable,yes,"5, 8",no,...,Urban,online,large,often,8,7,once,,yes,down 8k
4,P0011,Desiree,Nonbinary,18,26,2,irreconcilable,no,none,no,...,Suburban,online,small,weekly,8,7,yes,1,no,down 3k
6,P0071,Harshitha,Nonbinary,18,33,3,irreconcilable,no,none,maybe,...,Rural,college,small,2 times/month,6,2,no,2,yes,up 8k
10,P0101,Adron,female,20,23,3,other,no,none,sometimes,...,Rural,church,large,1,2,9,yes,4,yes,up 6k
13,P0114,Arlyna,Nonbinary,20,33,2,infidelity,no,none,maybe,...,Suburban,church,189,1,3,9,no,,yes,down 6k
14,P0019,Addison,Nonbinary,20,21,3,other,yes,17,maybe,...,Suburban,online,279 people,1,3,1,once,never,no,down 4k
15,P0001,Vanessa,female,18,23,3,abuse,no,none,maybe,...,Suburban,online,medium,2 times/month,5,7,yes,9,no,down 4k
30,P0079,Nicholas,Male,20,24,1,abuse,yes,7,sometimes,...,Rural,family friend,97,weekly,7,5,sometimes,never,no,7%
31,P0016,Jordan,female,20,25,3,irreconcilable,no,none,no,...,Rural,family friend,medium,weekly,5,4,no,,yes,up 8k
37,P0066,Hengmeng,female,18,21,3,abuse,yes,5 11 14,yes,...,Urban,vacation,medium,1,10,7,sometimes,2,no,up 1k
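
It can help to look at the mask on its own before using it to filter. Below is a small sketch on made-up ages (the `toy` frame is an assumption for illustration, not the survey data):

```python
import pandas as pd

toy = pd.DataFrame({"Age_When_First_Married": [30, 19, 18, 26, 20]})  # made-up ages

# The comparison returns a boolean Series: one True/False per row
mask = toy["Age_When_First_Married"] < 21

print(mask.tolist())
print(mask.sum(), "of", len(toy), "rows match")   # True counts as 1

filtered = toy[mask]   # keeps only the rows where the mask is True
```

Comparing `mask.sum()` with `df.shape[0]` is a quick way to check how many rows a filter keeps.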


In [71]:
# Membership test with .isin()
df[df['Education_Level'].isin(['HS', 'Vocational'])]

Unnamed: 0,Unique_ID,Name,Gender,Age_When_First_Married,Age_When_First_Divorced,Times_Divorced,Divorce_Reason,Had_Children_Together,Children_Ages,Kept_Contact_After_Divorce,...,Therapy_Attendance,Time_Until_Remarried,Moved_To_New_City,Change_In_Income_After_Divorce,Annual_Income_Clean,Income_Change_Amount,New_Annual_Income,Change_in_Happiness,New_Education_Level,Participant_Label
3,P0070,Hassan,Female,19,21,3,irreconcilable,yes,"5, 8",no,...,once,,yes,down 8k,20072.0,-8000.0,12072.0,-2,High School,Hassan (High School)
7,P0067,Selam,Male,28,42,3,incompatibility,yes,4,yes,...,yes,,yes,down 8k,54176.0,-8000.0,46176.0,0,High School,Selam (High School)
12,P0085,Jason,Male,31,35,1,finances,yes,4 13 9,sometimes,...,no,never,no,-11%,144299.0,-0.11,128426.11,-5,High School,Jason (High School)
13,P0114,Arlyna,Nonbinary,20,33,2,infidelity,no,none,maybe,...,no,,yes,down 6k,106833.0,-6000.0,100833.0,-3,High School,Arlyna (High School)
16,P0073,Joshua,Nonbinary,32,42,3,other,yes,"5, 8",yes,...,no,,no,up 7k,27805.0,7000.0,34805.0,0,High School,Joshua (High School)
18,P0106,Semoon,Male,33,41,1,finances,no,none,sometimes,...,no,4,no,down 9k,134838.0,-9000.0,125838.0,3,Vocational,Semoon (Vocational)
24,P0068,Olaoluwa,female,28,38,3,incompatibility,no,none,no,...,no,never,yes,up 2k,118333.0,2000.0,120333.0,-4,High School,Olaoluwa (High School)
26,P0110,Sriram,Female,30,32,3,abuse,no,none,maybe,...,no,,no,-17%,33087.0,-0.17,27462.21,-5,High School,Sriram (High School)
28,P0045,Aditya,Female,26,38,2,infidelity,no,none,maybe,...,once,,yes,up 6k,50678.0,6000.0,56678.0,2,Vocational,Aditya (Vocational)
31,P0016,Jordan,female,20,25,3,irreconcilable,no,none,no,...,no,,yes,up 8k,130251.0,8000.0,138251.0,4,Vocational,Jordan (Vocational)


In [None]:
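# Multiple conditions: divorced more than once OR attended therapy
# Caution: == is case-sensitive, and this column appears to store lowercase values
# such as 'yes', so the 'Yes' test below may add no rows;
# df['Therapy_Attendance'].str.lower() == 'yes' would match them.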
df[(df['Times_Divorced'] > 1) |
         (df['Therapy_Attendance'] == 'Yes')][['Name','Times_Divorced','Therapy_Attendance']]

Unnamed: 0,Name,Times_Divorced,Therapy_Attendance
0,Fayez,2,no
1,Jason,2,yes
3,Hassan,3,once
4,Desiree,2,yes
6,Harshitha,3,no
7,Selam,3,yes
10,Adron,3,yes
11,Matthew,2,no
13,Arlyna,2,no
14,Addison,3,once
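
String comparisons are case- and whitespace-sensitive, so it is usually safer to normalize a text column before testing it. Here is a hedged sketch on made-up rows (the `toy` frame and its spellings are assumptions that mimic the survey's messy style):

```python
import pandas as pd

# Made-up rows that mimic mixed spellings
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Times_Divorced": [1, 2, 1, 3, 1, 2],
    "Therapy_Attendance": ["yes", "no", "Yes ", "once", "no", "YES"],
})

# A raw comparison misses every variant that is not exactly 'Yes'
raw_matches = (toy["Therapy_Attendance"] == "Yes").sum()

# Normalize first: strip spaces, lower-case, then compare
attended = toy["Therapy_Attendance"].str.strip().str.lower() == "yes"
many_divorces = toy["Times_Divorced"] > 1

# Each condition is its own boolean Series; combine with | (or) and & (and)
either = toy[many_divorces | attended]
both = toy[many_divorces & attended]

print(raw_matches)
print(either["Name"].tolist())
print(both["Name"].tolist())
```

When conditions are written inline, each one needs its own parentheses because `&` and `|` bind more tightly than `>` and `==`.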


In [None]:
# Exercise:

# 1. Married before age 21
cols_1 = ['Name', 'Gender', 'Age_When_First_Married']
print(df.loc[df['Age_When_First_Married'] < 21, cols_1])

print("\n")

# 2. Divorced at least once AND had children together
cols_2 = ['Name', 'Times_Divorced', 'Had_Children_Together']
print(df.loc[(df['Times_Divorced'] >= 1) & (df['Had_Children_Together'] == 'yes'), cols_2])


         Name     Gender  Age_When_First_Married
3      Hassan     Female                      19
4     Desiree  Nonbinary                      18
6   Harshitha  Nonbinary                      18
10      Adron     female                      20
13     Arlyna  Nonbinary                      20
14    Addison  Nonbinary                      20
15    Vanessa     female                      18
30   Nicholas       Male                      20
31     Jordan     female                      20
37   Hengmeng     female                      18


          Name  Times_Divorced Had_Children_Together
0        Fayez               2                   yes
1        Jason               2                   yes
2      Michael               1                   yes
3       Hassan               3                   yes
7        Selam               3                   yes
8        Thara               1                   yes
9       Kashvi               1                   yes
11     Matthew               2     

# Section 3: Transformation

Now let’s practice creating new information from existing columns.

### Instructor Demo
We’ll look at two kinds of transformations:
- **Numeric transformation** (e.g., turning years into months)
- **Text transformation** (combining multiple text columns into something useful)


In [None]:
# Numeric Demo: Years married → months married
df['Marriage_Length_Months'] = df['Years_Married'] * 12
df[['Name','Years_Married','Marriage_Length_Months']]

Unnamed: 0,Name,Years_Married,Marriage_Length_Months
0,Fayez,15 years,15 years15 years15 years15 years15 years15 yea...
1,Jason,~14,~14~14~14~14~14~14~14~14~14~14~14~14
2,Michael,15 years,15 years15 years15 years15 years15 years15 yea...
3,Hassan,1 years,1 years1 years1 years1 years1 years1 years1 ye...
4,Desiree,4 years,4 years4 years4 years4 years4 years4 years4 ye...
5,Ian,3 years,3 years3 years3 years3 years3 years3 years3 ye...
6,Harshitha,~5,~5~5~5~5~5~5~5~5~5~5~5~5
7,Selam,15,151515151515151515151515
8,Thara,6 years,6 years6 years6 years6 years6 years6 years6 ye...
9,Kashvi,14,141414141414141414141414


In [None]:
# Text Demo: Combine urban/rural + state into a "Residence_Type" column
df['Residence_Type'] = df['Urban_or_Rural'] + " - " + df['State_Of_Residence']
df[['Name','Residence_Type']]

Unnamed: 0,Name,Residence_Type
0,Fayez,Urban - Penn
1,Jason,Suburban - NY
2,Michael,Urban - MD
3,Hassan,Urban - Virginia
4,Desiree,Suburban - D.C.
5,Ian,Urban - NY
6,Harshitha,Rural - Meryland
7,Selam,Urban - NY
8,Thara,Suburban - Penn
9,Kashvi,Suburban - Penn


### What do we notice is wrong with these columns?

Your Answer:

Both new columns inherit formatting problems from their source columns. `Years_Married` is stored as text (`15 years`, `~14`, `15`), so multiplying by 12 repeats each string twelve times instead of computing months. `State_Of_Residence` spells the same state several ways (`NY`, `Penn`, `MD`, `Meryland`, `Virginia`), so `Residence_Type` mixes abbreviations, full names, and typos: `Rural - Meryland` and `Urban - MD` refer to the same state but would land in different groups. If we tried to analyze or query this data as-is, those groups would not line up, so both source columns need to be cleaned before we transform them.  
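
To see why the months column came out as repeated text, here is a minimal sketch on made-up values in the same style (the `years` Series is invented, and `dtype=object` is set explicitly so the values are plain Python strings):

```python
import pandas as pd

# Made-up values in the survey's style, stored as Python strings
years = pd.Series(["15 years", "~14", "6"], dtype=object)

# string * int repeats the string; it does not do arithmetic
repeated = years * 3
print(repeated.tolist())

# Pull out the first run of digits, convert to numbers, then multiply
numeric = pd.to_numeric(years.str.extract(r"(\d+)")[0])
months = numeric * 12
print(months.tolist())
```

Checking `df['Years_Married'].dtype` before doing math is the quickest way to catch this.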

## Section 3 - Fixing Broken Transformations

When we rushed into transforming without checking, we got nonsense results.  
Now let’s redo it properly, step by step:

1. **Check the datatype** of `Years_Married` and convert to numeric.  
2. **Clean up the text columns** before combining them.  


In [None]:
# dtype is object: pandas is storing generic Python objects (here, strings), not numbers
print(df['Years_Married'].dtype)

# Clean Years_Married into integers
df['Years_Married_clean'] = (
    df['Years_Married']
      .astype(str)                               # ensure every value is a string
      .str.replace(r'[^0-9]', '', regex=True)    # keep only digits (would also drop a decimal point)
      .replace('', float('nan'))                 # empty strings -> NaN
      .astype(float)                             # convert to float first
      .astype('Int64')                           # then nullable integer
)

# Create months married column
df['Marriage_Length_Months'] = df['Years_Married_clean'] * 12

df[['Name','Years_Married','Years_Married_clean','Marriage_Length_Months']].head(12)



object


Unnamed: 0,Name,Years_Married,Years_Married_clean,Marriage_Length_Months
0,Fayez,15 years,15,180
1,Jason,~14,14,168
2,Michael,15 years,15,180
3,Hassan,1 years,1,12
4,Desiree,4 years,4,48
5,Ian,3 years,3,36
6,Harshitha,~5,5,60
7,Selam,15,15,180
8,Thara,6 years,6,72
9,Kashvi,14,14,168


In [None]:
# Define a mapping dictionary (expand as needed)
state_mapping = {
    "NY": "New York",
    "Penn": "Pennsylvania",
    "Meryland": "Maryland",
    "MD": "Maryland",
    "D.C.": "District of Columbia",
    "Virginia": "Virginia",
    "New York": "New York",
    "Pennsylvania": "Pennsylvania"
}

# Normalize state names
df['State_Of_Residence_clean'] = (
    df['State_Of_Residence']
      .astype(str)
      .str.strip()
      .replace(state_mapping)
)

df[['Name','State_Of_Residence','State_Of_Residence_clean']].head(12)


Unnamed: 0,Name,State_Of_Residence,State_Of_Residence_clean
0,Fayez,Penn,Pennsylvania
1,Jason,NY,New York
2,Michael,MD,Maryland
3,Hassan,Virginia,Virginia
4,Desiree,D.C.,District of Columbia
5,Ian,NY,New York
6,Harshitha,Meryland,Maryland
7,Selam,NY,New York
8,Thara,Penn,Pennsylvania
9,Kashvi,Penn,Pennsylvania


In [None]:
df['Residence_Type'] = df['Urban_or_Rural'] + " - " + df['State_Of_Residence_clean']
df[['Name','Residence_Type']].head(12)


Unnamed: 0,Name,Residence_Type
0,Fayez,Urban - Pennsylvania
1,Jason,Suburban - New York
2,Michael,Urban - Maryland
3,Hassan,Urban - Virginia
4,Desiree,Suburban - District of Columbia
5,Ian,Urban - New York
6,Harshitha,Rural - Maryland
7,Selam,Urban - New York
8,Thara,Suburban - Pennsylvania
9,Kashvi,Suburban - Pennsylvania


## Techniques we just used to normalize the data

- **`.astype(str)`**  
  Force everything into the same type (string) so operations work consistently.  
  Example: numbers, `NaN`, and `~14` all become strings.

- **`.str.replace(regex)`**  
  Use a regular expression to remove or substitute characters.  
  Example: `r'[^0-9]'` means “anything that’s *not* a digit” → strip all letters like `years` or symbols like `~`.

- **`.replace(mapping)`**  
  Use a dictionary to standardize known values.  
  Example: `"NY"` and `"New York"` both → `"New York"`.

- **`.str.strip()`**  
  Remove leading/trailing spaces that can mess up comparisons.

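- **`.astype(float)` → `.astype('Int64')`**  
  Convert cleaned strings into numbers.  
  Going through `float` first lets missing values pass as `NaN`; `Int64` is a *nullable* integer type in pandas, so those missing values are kept (shown as `<NA>`) instead of breaking the conversion.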
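
One thing to watch with `.replace(mapping)`: values that are not in the dictionary pass through unchanged, so gaps in the mapping are easy to miss. A possible check, sketched on a made-up sample (the `states` Series and the `Ohio` entry are assumptions for illustration), is to compare against `.map(mapping)`, which turns unmapped values into `NaN`:

```python
import pandas as pd

states = pd.Series([" NY", "Penn", "Meryland", "Ohio"])   # made-up sample
mapping = {"NY": "New York", "Penn": "Pennsylvania", "Meryland": "Maryland"}

cleaned = states.str.strip()          # remove stray spaces first

replaced = cleaned.replace(mapping)   # unmapped values pass through unchanged
mapped = cleaned.map(mapping)         # unmapped values become NaN

# Values the mapping does not cover yet
unmapped = cleaned[mapped.isna()].unique().tolist()

print(replaced.tolist())
print(unmapped)
```

Anything listed in `unmapped` is a candidate for a new dictionary entry.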



### Student Exercises
1. **Numeric**  
   Create a column showing the **change in happiness after Marriage**.  
   - Use the columns `Happiness_Before_Marriage` and `Happiness_After_Marriage`.  


2. **Text**  
   Create a column with a short participant label in the format:  
   `"Name (Education_Level)"`.

    *hint*: map the spelling variants in `Education_Level` to one label each if you want clean, normalized results.

In [69]:
# Exercises:

# Numeric 1. 

df["Change_in_Happiness"] = df["Happiness_After_Marriage"] - df["Happiness_Before_Marriage"]

print(df[["Change_in_Happiness", "Happiness_After_Marriage", "Happiness_Before_Marriage"]].head(10))

# Text 2. 

df["Education_Level"].unique()

education_mapping = {"Bachlor": "Bachelor",
                     "Bachelors": "Bachelor",
                     "BS": "Bachelor",
                     "BA": "Bachelor",
                     "HS": "High School",
                     "PhD": "PhD",
                     "MA": "Master",
                     "Vocational": "Vocational"
                     }

df["New_Education_Level"] = df["Education_Level"].replace(education_mapping)

df["Participant_Label"] = df["Name"] + ' (' + df['New_Education_Level'] + ')'

df[['Name', 'New_Education_Level', 'Participant_Label']].head(10)



   Change_in_Happiness  Happiness_After_Marriage  Happiness_Before_Marriage
0                   -3                         3                          6
1                    5                         7                          2
2                    7                        10                          3
3                   -2                         3                          5
4                    6                         8                          2
5                   -2                         3                          5
6                    7                         8                          1
7                    0                         7                          7
8                    3                         6                          3
9                    5                         7                          2


Unnamed: 0,Name,New_Education_Level,Participant_Label
0,Fayez,Bachelor,Fayez (Bachelor)
1,Jason,Bachelor,Jason (Bachelor)
2,Michael,Bachelor,Michael (Bachelor)
3,Hassan,High School,Hassan (High School)
4,Desiree,Bachelor,Desiree (Bachelor)
5,Ian,PhD,Ian (PhD)
6,Harshitha,Bachelor,Harshitha (Bachelor)
7,Selam,High School,Selam (High School)
8,Thara,Bachelor,Thara (Bachelor)
9,Kashvi,PhD,Kashvi (PhD)


## 6. Lambda Expressions and Apply - Advanced Transformations

**What if I need more complex logic for transformations?**  
Lambda expressions and `.apply()` give us flexibility.


In [None]:
# 🔹 Lambda on a Single Column
# Bucket age at first marriage into decades (19 -> 10, 30 -> 30); bucketed ages are often easier to group and compare.
df['Decade_Age'] = df['Age_When_First_Married'].apply(lambda x: (x // 10) * 10)
df[['Name','Age_When_First_Married','Decade_Age']].head()


Unnamed: 0,Name,Age_When_First_Married,Decade_Age
0,Fayez,30,30
1,Jason,30,30
2,Michael,30,30
3,Hassan,19,10
4,Desiree,18,10


Here, `lambda x: (x // 10) * 10` is just a quick anonymous function.  
Under the hood, pandas applies this to every value in the column.
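
For simple arithmetic like this, the same result can be written without `.apply()` by operating on the whole column at once. A minimal sketch on made-up ages (the `ages` Series is an assumption for illustration):

```python
import pandas as pd

ages = pd.Series([30, 19, 18, 26, 41])   # made-up ages

with_apply = ages.apply(lambda x: (x // 10) * 10)   # calls the lambda once per value
vectorized = (ages // 10) * 10                      # same arithmetic on the whole column

print(vectorized.tolist())
```

Both give the same values; `.apply()` earns its keep when the logic cannot be expressed as column arithmetic.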

In [None]:
# 🔹 Categorize with Lambda
# Create categories from years married; categories are handy for grouping and counting.
# Note: a missing value (pd.NA) would make the `x < 5` test ambiguous, so this assumes no missing years.
df['Marriage_Category'] = df['Years_Married_clean'].apply(
    lambda x: 'Short' if x < 5 else 'Medium' if x < 15 else 'Long'
)
df[['Name','Years_Married_clean','Marriage_Category']].head()


Unnamed: 0,Name,Years_Married_clean,Marriage_Category
0,Fayez,15,Long
1,Jason,14,Medium
2,Michael,15,Long
3,Hassan,1,Short
4,Desiree,4,Short
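
As an alternative to chained `if/else` inside a lambda, `pd.cut` can assign the same buckets and leaves missing values as missing. This is a sketch on made-up values, with bin edges chosen to mirror the lambda above (under 5 is Short, under 15 is Medium, otherwise Long):

```python
import pandas as pd

# Made-up values with one missing entry, using the nullable Int64 dtype
years = pd.Series([15, 14, 1, 4, pd.NA], dtype="Int64")

category = pd.cut(
    years.astype("float64"),                    # missing values become NaN
    bins=[float("-inf"), 5, 15, float("inf")],
    right=False,                                # intervals are [low, high)
    labels=["Short", "Medium", "Long"],
)

print(category.tolist())
```

The result is a categorical column, which also keeps the Short/Medium/Long order for sorting.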


In [None]:
# 🔹 Row-Wise Apply (multiple columns)
# Use both years married and number of divorces.
df.apply(
    lambda row: f"{row['Name']} was married for {row['Years_Married_clean']} years and divorced {row['Times_Divorced']} times.",
    axis=1
).head()


Unnamed: 0,0
0,Fayez was married for 15 years and divorced 2 ...
1,Jason was married for 14 years and divorced 2 ...
2,Michael was married for 15 years and divorced ...
3,Hassan was married for 1 years and divorced 3 ...
4,Desiree was married for 4 years and divorced 2...


### What if the logic gets really complex?  
Sometimes `lambda` becomes too messy. In that case, just write a proper function.


In [None]:
# 🔹 Complex Function Example
def divorce_experience(row):
    if row['Times_Divorced'] == 0:
        status = "Never Divorced"
    elif row['Times_Divorced'] == 1:
        status = "Divorced Once"
    else:
        status = f"Divorced {row['Times_Divorced']} times"

    if pd.isna(row['Years_Married_clean']):
        return f"{row['Name']} - {status} (no data on marriage length)"

    return f"{row['Name']} - {status}, marriage lasted {row['Years_Married_clean']} years"

df['Experience'] = df.apply(divorce_experience, axis=1)
df[['Name','Times_Divorced','Years_Married_clean','Experience']].head()


Unnamed: 0,Name,Times_Divorced,Years_Married_clean,Experience
0,Fayez,2,15,"Fayez - Divorced 2 times, marriage lasted 15 y..."
1,Jason,2,14,"Jason - Divorced 2 times, marriage lasted 14 y..."
2,Michael,1,15,"Michael - Divorced Once, marriage lasted 15 years"
3,Hassan,3,1,"Hassan - Divorced 3 times, marriage lasted 1 y..."
4,Desiree,2,4,"Desiree - Divorced 2 times, marriage lasted 4 ..."


##  Final Challenge Question

1. **Numeric Transformation**  
   Create a column showing each participant's **new annual income after divorce**.  
   - Use the column `Change_In_Income_After_Divorce`.  
   - Combine it with `Annual_Income` to calculate the *new income after divorce*.  

*hint* : this one is atrocious 🤣, I'll cover it with you, after you give it a shot ofc.

In [68]:
import numpy as np
import pandas as pd
import re

# getting good view of data 

print("Initial data shape:", df.shape)
print("\nFirst 5 annual income values:", df['Annual_Income'].head(5).tolist())
print("First 5 change in annual income values (After divorce):", df['Change_In_Income_After_Divorce'].head(5).tolist())

# normalize Annual_Income into strictly numbers

def clean_income(value):
    if pd.isna(value):
        return np.nan
    
    # convert to string before cleaning (values may be ints or strings)

    value_str = str(value).strip()
    
    # removing dollar sign, commas, whitespace, and USD phrasing in column
    value_str = value_str.replace('$', '').replace(',', '').replace('USD', '').replace(' ', '')
    
    # a "k" suffix means thousands: strip it and multiply by 1000
    if 'k' in value_str.lower():
        value_str = value_str.lower().replace('k', '')
        try:
            return float(value_str) * 1000
        except ValueError:
            return np.nan
    
    # otherwise convert the plain number to a float
    try:
        return float(value_str)
    except ValueError:
        return np.nan

df['Annual_Income_Clean'] = df['Annual_Income'].apply(clean_income)

# parsing Change_In_Income_After_Divorce 
def parse_income_change(value):
    if pd.isna(value):
        return np.nan
    
    value_str = str(value).strip().lower()
    
    # handling percentages like -18% or 12%
    if '%' in value_str:
        try:
            # remove the % sign and convert to float
            percent = float(value_str.replace('%', '').replace(' ', ''))
            return percent / 100  # return the percentage as a decimal fraction
        except ValueError:
            return np.nan
    
    # handling dollar changes written as "up 6k" / "down 8k"
    if any(x in value_str for x in ['up', 'down']):
        try:
            # strip the direction keyword and whitespace, leaving e.g. "8k"
            num_str = value_str.replace('up', '').replace('down', '').replace(' ', '')
            # only scale by 1000 when a "k" suffix is present
            scale = 1000 if 'k' in num_str else 1
            amount = float(num_str.replace('k', '')) * scale
            
            # apply the sign that matches the direction of the change
            if 'down' in value_str:
                return -amount
            else:
                return amount
        except ValueError:
            return np.nan
    
    # fallback for anything not covered above
    try:
        # keep only digits, minus signs, and decimal points
        clean_str = ''.join([c for c in value_str if c.isdigit() or c == '-' or c == '.'])
        if clean_str and clean_str != '-':  # make sure something numeric is left
            return float(clean_str)
        else:
            return np.nan
    except ValueError:
        return np.nan

df['Income_Change_Amount'] = df['Change_In_Income_After_Divorce'].apply(parse_income_change)

# computing New_Annual_Income

def calculate_new_income(row):
    if pd.isna(row['Annual_Income_Clean']) or pd.isna(row['Income_Change_Amount']):
        return np.nan
    
    change = row['Income_Change_Amount']
    
    # Heuristic: treat any magnitude of at most 100 as a percentage. Values in [-1, 1]
    # are already fractions; values in (1, 100] are assumed to be unconverted percents.
    # Caveat: a genuine dollar change of $100 or less would be misread as a percent.
    if abs(change) <= 100:
        if abs(change) > 1:
            change = change / 100
        return row['Annual_Income_Clean'] * (1 + change)
    
    # otherwise the change is a dollar amount
    else:
        return row['Annual_Income_Clean'] + change

df['New_Annual_Income'] = df.apply(calculate_new_income, axis=1) # axis=1 passes each row to the function


# checking data parsing/conversion success 
income_success = df['Annual_Income_Clean'].notna().sum()
change_success = df['Income_Change_Amount'].notna().sum()
new_income_success = df['New_Annual_Income'].notna().sum()

print("\n")

print(f"Annual_Income conversion success: {income_success}/{len(df)}")
print(f"Income change conversion success: {change_success}/{len(df)}")
print(f"New income calculation success: {new_income_success}/{len(df)}")

# displaying final dataframe with new columns
print("\nFinal dataframe (Income after divorce):\n")
new_cols = ['Annual_Income_Clean', 'Income_Change_Amount', 'New_Annual_Income']
print(df[new_cols].head(10))
    

Initial data shape: (40, 35)

First 5 annual income values: ['41k', '146536', '86k', '$20,072 ', '78k']
First 5 change in annual income values (After divorce): ['-18%', '-6%', 'down 9k', 'down 8k', 'down 3k']


Annual_Income conversion success: 40/40
Income change conversion success: 40/40
New income calculation success: 40/40

Final dataframe (Income after divorce):

   Annual_Income_Clean  Income_Change_Amount  New_Annual_Income
0              41000.0                 -0.18           33620.00
1             146536.0                 -0.06          137743.84
2              86000.0              -9000.00           77000.00
3              20072.0              -8000.00           12072.00
4              78000.0              -3000.00           75000.00
5             121734.0              -6000.00          115734.00
6              70000.0               8000.00           78000.00
7              54176.0              -8000.00           46176.00
8             119624.0               1000.00         

*double hint* :  We’ll solve the problem step by step:  

1. Normalize `Annual_Income` into numbers.  
2. Parse `Change_In_Income_After_Divorce` into usable values.  
3. Convert percent changes into dollar amounts.  
4. Compute `New_Annual_Income`.  
5. Run diagnostics to see what worked and what broke.  
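
Steps 2 to 4 can be sketched in a compact form that records whether each change is a percent or a dollar amount, rather than guessing from its size later. This is only one possible approach: `parse_change` and `new_income` are hypothetical helpers, the `toy` rows are made up, and the sketch assumes every value looks like `-18%`, `up 6k`, or `down 9k` with nothing missing.

```python
import pandas as pd

def parse_change(text):
    """Return (kind, value) where kind is 'percent' or 'dollars'."""
    s = str(text).strip().lower()
    if s.endswith("%"):
        return "percent", float(s.rstrip("%")) / 100
    sign = -1.0 if s.startswith("down") else 1.0
    number = s.replace("up", "").replace("down", "").strip()
    scale = 1000.0 if number.endswith("k") else 1.0   # "9k" -> 9000
    return "dollars", sign * float(number.rstrip("k")) * scale

def new_income(income, change_text):
    kind, value = parse_change(change_text)
    # Step 3: turn a percent change into a dollar amount
    delta = income * value if kind == "percent" else value
    return income + delta

# Made-up rows in the survey's style
toy = pd.DataFrame({
    "Annual_Income_Clean": [41000.0, 86000.0, 1000.0],
    "Change_In_Income_After_Divorce": ["-18%", "down 9k", "up 50"],
})
toy["New_Annual_Income"] = [
    new_income(income, change)
    for income, change in zip(toy["Annual_Income_Clean"],
                              toy["Change_In_Income_After_Divorce"])
]
print(toy)
```

Keeping the unit alongside the number means a small dollar change such as `up 50` is never mistaken for a percentage.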