# Lecture 2.5: Creating Derived Variables in Pandas

## 🎯 Learning Objectives
By the end of this lecture, you will:
- Understand how to create new columns based on existing data
- Use `.apply()` and `lambda` functions for element-wise and row-wise operations
- Compute sum scores and derived epidemiological variables (e.g., BMI)


## 🧠 Why Derived Variables Matter

Derived variables simplify analysis and make raw data easier to interpret. They let you:
- Create new risk scores
- Summarize symptom presence
- Transform raw values into meaningful metrics (e.g., BMI from weight and height)


In [1]:
import pandas as pd

# Sample data
df = pd.DataFrame({
    'height_cm': [170, 160, 180],
    'weight_kg': [70, 60, 90],
    'symptom1': [1, 0, 1],
    'symptom2': [1, 1, 0],
    'symptom3': [0, 0, 1]
})

df

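|   | height_cm | weight_kg | symptom1 | symptom2 | symptom3 |
|---|-----------|-----------|----------|----------|----------|
| 0 | 170 | 70 | 1 | 1 | 0 |
| 1 | 160 | 60 | 0 | 1 | 0 |
| 2 | 180 | 90 | 1 | 0 | 1 |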


## 🧮 Creating a Simple Derived Variable

In [2]:
# Convert height to meters and calculate BMI
df['height_m'] = df['height_cm'] / 100
df['BMI'] = df['weight_kg'] / (df['height_m'] ** 2)

df[['height_cm', 'weight_kg', 'height_m', 'BMI']]

|   | height_cm | weight_kg | height_m | BMI |
|---|-----------|-----------|----------|-----|
| 0 | 170 | 70 | 1.7 | 24.221453 |
| 1 | 160 | 60 | 1.6 | 23.4375 |
| 2 | 180 | 90 | 1.8 | 27.777778 |
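
The calculation above follows the usual definition of BMI as weight in kilograms divided by height in meters squared. As a rough sketch, the same two steps can be wrapped in a small helper so the unit conversion lives in one place. The function name `add_bmi` and its parameters are hypothetical, chosen only for illustration.

```python
import pandas as pd

def add_bmi(data, height_col='height_cm', weight_col='weight_kg'):
    """Return a copy of `data` with height in meters and BMI added."""
    out = data.copy()  # leave the caller's DataFrame untouched
    out['height_m'] = out[height_col] / 100
    out['BMI'] = out[weight_col] / out['height_m'] ** 2
    return out

# Same heights and weights as the lecture's sample data
sample = pd.DataFrame({'height_cm': [170, 160, 180], 'weight_kg': [70, 60, 90]})
result = add_bmi(sample)
print(result.round(2))
```

Returning a copy keeps the original table unchanged, which can make it easier to rerun notebook cells in any order.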


## ⚙️ Using `.apply()` and `lambda` for Element-wise Calculations

In [3]:
# Apply a lambda to each BMI value to flag overweight (BMI > 25)
df['overweight_flag'] = df['BMI'].apply(lambda x: 1 if x > 25 else 0)
df[['BMI', 'overweight_flag']]

|   | BMI | overweight_flag |
|---|-----|-----------------|
| 0 | 24.221453 | 0 |
| 1 | 23.4375 | 0 |
| 2 | 27.777778 | 1 |
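
Calling `.apply()` on a single column runs the function once per value. It only becomes row-wise when called on a DataFrame with `axis=1`. The sketch below rebuilds a small version of the sample data and compares three ways of producing the same flag. The column names `flag_vectorized` and `flag_rowwise` are illustrative and not part of the lecture.

```python
import pandas as pd

# Rebuild a small version of the lecture's sample data
df = pd.DataFrame({
    'height_cm': [170, 160, 180],
    'weight_kg': [70, 60, 90],
})
df['BMI'] = df['weight_kg'] / (df['height_cm'] / 100) ** 2

# Element-wise: Series.apply calls the lambda once per BMI value
df['overweight_flag'] = df['BMI'].apply(lambda x: 1 if x > 25 else 0)

# Vectorized alternative: compare the whole column, then turn True/False into 1/0
df['flag_vectorized'] = (df['BMI'] > 25).astype(int)

# Row-wise: DataFrame.apply with axis=1 passes each row, so several columns are available
df['flag_rowwise'] = df.apply(
    lambda row: 1 if row['weight_kg'] / (row['height_cm'] / 100) ** 2 > 25 else 0,
    axis=1,
)

print(df[['BMI', 'overweight_flag', 'flag_vectorized', 'flag_rowwise']])
```

For a simple threshold like this, the vectorized comparison is usually the shortest and fastest option; `.apply()` is most useful when the logic does not reduce to a column expression.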


## ➕ Creating Sum Scores

In [4]:
# Sum the three symptom indicators across columns (axis=1) for each row
df['symptom_score'] = df[['symptom1', 'symptom2', 'symptom3']].sum(axis=1)
df[['symptom1', 'symptom2', 'symptom3', 'symptom_score']]

|   | symptom1 | symptom2 | symptom3 | symptom_score |
|---|----------|----------|----------|---------------|
| 0 | 1 | 1 | 0 | 2 |
| 1 | 0 | 1 | 0 | 1 |
| 2 | 1 | 0 | 1 | 2 |
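
One thing to watch with sum scores is missing data. By default `.sum(axis=1)` skips missing values, so a row with a missing symptom is scored as if that symptom were absent. The sketch below uses a hypothetical three-row table with one missing value (not the lecture data) to compare the default with the `min_count` argument, which returns a missing score when fewer than the required number of items are observed.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second person has a missing value for symptom2
toy = pd.DataFrame({
    'symptom1': [1, 0, 1],
    'symptom2': [1, np.nan, 0],
    'symptom3': [0, 0, 1],
})
symptom_cols = ['symptom1', 'symptom2', 'symptom3']

# Default: missing values are skipped, so row 1 sums only the observed items
toy['score_default'] = toy[symptom_cols].sum(axis=1)

# min_count=3: the score is NaN unless all three items are present
toy['score_complete'] = toy[symptom_cols].sum(axis=1, min_count=3)

print(toy)
```

Which behavior is appropriate depends on the scoring rules of the instrument, so it is worth deciding explicitly rather than relying on the default.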


## Applying This to the Oregon Health Insurance Experiment (OHIE) Dataset

In [5]:
OHIE = pd.read_csv('../Data/OHIE_12m.csv')
OHIE.head()

|   | person_id | household_id | treatment | draw_treat | draw_lottery | applied_app | approved_app | dt_notify_lottery | dt_retro_coverage | birthyear_list | ... | live_partner_12m | live_parents_12m | live_friends_12m | live_relatives_12m | live_other_12m | hhsize_12m | PHQ2_1 | PHQ2_2 | PHQ2_sum | PHQ2_cutoff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 64350 | 164350 | Not selected |  | Lottery Draw 6 |  |  | 2008-07-14 | 2008-08-08 | 1974 | ... | No | Yes | No | No | No | 2.0 | 3.0 | 3.0 | 6.0 | True |
| 1 | 55655 | 155655 | Not selected |  | Lottery Draw 7 |  |  | 2008-08-12 | 2008-09-08 | 1987 | ... | Yes | No | No | No | No | 2.0 | 1.0 | 1.0 | 2.0 | False |
| 2 | 20087 | 128134 | Selected | Draw 6: selected in lottery 07/01/2008 | Lottery Draw 6 | Submitted an Application to OHP | No | 2008-07-14 | 2008-08-08 | 1963 | ... | No | No | No | Yes | No | 7.0 | 0.0 | 1.0 | 1.0 | False |
| 3 | 70998 | 170998 | Not selected |  | Lottery Draw 7 |  |  | 2008-08-12 | 2008-09-08 | 1954 | ... | Yes | No | No | No | No | 2.0 | 3.0 | 2.0 | 5.0 | True |
| 4 | 8839 | 108839 | Selected | Draw 8: selected in lottery 09/02/2008 | Lottery Draw 8 | Did NOT submit an application to OHP | No | 2008-09-11 | 2008-10-08 | 1964 | ... | No | No | Yes | No | No | 4.0 | 2.0 | 2.0 | 4.0 | True |


In [6]:
OHIE[['dep_interest_12m','dep_sad_12m']].head()

|   | dep_interest_12m | dep_sad_12m |
|---|------------------|-------------|
| 0 | nearly every day | nearly every day |
| 1 | several days | several days |
| 2 | several days | not at all |
| 3 | more than half the days | nearly every day |
| 4 | more than half the days | more than half the days |


In [7]:
phq_scoring = {'not at all': 0,
               'several days': 1,
               'more than half the days': 2,
               'nearly every day': 3}
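
A dictionary like this can be applied to a text column with `.map()`, which looks up each value and returns a missing value for anything not found in the dictionary. The sketch below uses a small made-up Series (`responses` is a hypothetical name, not an OHIE column) to show the lookup and the conversion to pandas' nullable integer type.

```python
import pandas as pd

phq_scoring = {'not at all': 0,
               'several days': 1,
               'more than half the days': 2,
               'nearly every day': 3}

# Made-up responses, including one missing answer
responses = pd.Series(['nearly every day', 'several days', None, 'not at all'])

# Look up each response; the missing answer stays missing
scores = responses.map(phq_scoring).astype('Int64')

print(scores)
```

The capitalized `'Int64'` dtype matters here: a plain NumPy integer column cannot hold missing values, so the scores would otherwise be stored as floats.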

In [14]:
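# Map text responses to item scores; 'Int64' is pandas' nullable integer type,
# so missing responses stay missing
OHIE['PHQ2_1'] = OHIE['dep_interest_12m'].map(phq_scoring).astype('Int64')
OHIE['PHQ2_2'] = OHIE['dep_sad_12m'].map(phq_scoring).astype('Int64')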



In [18]:
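# Sum the two items; .sum(axis=1) skips missing values by default
OHIE['PHQ2_sum'] = OHIE[['PHQ2_1', 'PHQ2_2']].sum(axis=1)
# Alternative that gives <NA> if either item is missing:
# OHIE['PHQ2_sum'] = OHIE['PHQ2_1'] + OHIE['PHQ2_2']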

In [19]:
# Flag a positive screen: total score above 2 (that is, 3 or higher)
OHIE['PHQ2_cutoff'] = OHIE['PHQ2_sum'] > 2
OHIE[['PHQ2_sum','PHQ2_cutoff']].head()

|   | PHQ2_sum | PHQ2_cutoff |
|---|----------|-------------|
| 0 | 6 | True |
| 1 | 2 | False |
| 2 | 1 | False |
| 3 | 5 | True |
| 4 | 4 | True |


In [17]:
# Percentage of respondents in each category (4000 is a hard-coded denominator)
OHIE['PHQ2_cutoff'].value_counts() / 4000 * 100

PHQ2_cutoff
False     65.35
True     32.025
Name: count, dtype: Float64
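
Because the denominator in the cell above is hard-coded, the percentages are only right if the table has exactly that many rows. A minimal sketch of an alternative, using a made-up five-row table rather than the OHIE data, lets pandas compute the shares itself. Here the two items are added with `+` so that a missing answer carries through to the cutoff, and `dropna=False` keeps that missing group visible as its own category.

```python
import pandas as pd

# Made-up item scores; the last respondent skipped the second question
toy = pd.DataFrame({
    'PHQ2_1': pd.array([3, 1, 0, 2, 1], dtype='Int64'),
    'PHQ2_2': pd.array([3, 1, 1, 2, None], dtype='Int64'),
})

# Adding the columns propagates missing values into the total
toy['PHQ2_sum'] = toy['PHQ2_1'] + toy['PHQ2_2']
toy['PHQ2_cutoff'] = toy['PHQ2_sum'] > 2

# Percentages over all rows, with missing values shown as their own category
percentages = toy['PHQ2_cutoff'].value_counts(normalize=True, dropna=False) * 100
print(percentages)
```

With `dropna=False` the percentages add up to 100, which makes it easy to see how much of the sample has no usable score.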

## ✅ Summary
- Use arithmetic to compute new variables
- Use `.apply()` with `lambda` for conditional logic
- Use `.sum(axis=1)` for sum scores across columns

Derived variables are a key step in preparing data for analysis and visualization.
