# Preparing Data: Adding Columns  v.ekc-c

Real-world datasets rarely come with everything you need.  
This class covers how to **engineer new features** — creating columns from scratch or from existing ones.

| Section | Topic |
|---------|-------|
| 1 | Setup |
| 2 | Adding Independent Columns (3 methods) |
| 3 | Adding Columns Based on Other Columns |
| 4 | 🔬 Titanic Feature Engineering Activity |
| Appendix | Quick Reference |


---
## 1. Setup

In [None]:
import re
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

sns.set_style("darkgrid")
warnings.filterwarnings('ignore')  # hide library warnings to keep notebook output readable


In [None]:
# Toy dataset for demonstrations
animals_dct = {
    'Animal': ['cow', 'kitten', 'penguin', 'Puppy'],
    'Sound':  ['moo', 'purr', 'chirp', 'bark'],
}
animals = pd.DataFrame(animals_dct)
animals


---
## 2. Adding Independent Columns

### 📋 Board Reference — 3 Methods

| Method | When to use |
|--------|------------|
| `df['col'] = value` | Add to the end: broadcast a single value, or supply one value per row as a list |
| `df.insert(pos, 'col', value)` | Insert at a **specific position** |
| `df.assign(col1=..., col2=...)` | Add **multiple columns at once**, returns new df |

**Key difference:** `df['col'] =` and `df.insert(...)` modify the DataFrame in place; `df.assign(...)` returns a new DataFrame and leaves the original unchanged.
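
The difference can be checked directly. The sketch below uses a small hypothetical DataFrame (`demo`, not part of the lesson data) to show that bracket assignment and `.insert()` change the existing object, while `.assign()` leaves it untouched unless you reassign the result.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
demo = pd.DataFrame({'Animal': ['cow', 'kitten']})

# Bracket assignment and .insert() both modify `demo` itself
demo['Sound'] = ['moo', 'purr']
demo.insert(0, 'Id', [1, 2])

# .assign() returns a new DataFrame; `demo` is unchanged
with_pet = demo.assign(Pet=[False, True])

print(list(demo.columns))      # ['Id', 'Animal', 'Sound']
print(list(with_pet.columns))  # ['Id', 'Animal', 'Sound', 'Pet']
```

This is why the `.assign()` examples in this lesson store the result in a variable.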


In [None]:
# Method 1a: broadcast a constant
animals_m1 = animals.copy()
animals_m1['Cute score'] = 10
animals_m1


In [None]:
# Method 1b: assign a list
animals_m1b = animals.copy()
animals_m1b['Cute score'] = [10, 9.9, 10, 10]
animals_m1b


In [None]:
# Method 2: insert at a specific position
animals_m2 = animals.copy()
animals_m2.insert(1, 'Cute score', [10, 9.9, 10, 10])
animals_m2


In [None]:
# Method 3: assign multiple columns at once
animals_m3 = animals.assign(
    Adjective=['adorable', 'playful', 'tough', 'cuddly'],
    Pet=[False, True, False, True]
)
animals_m3


---
### 🔬 Explore 1 — Adding Independent Columns

Use this dictionary to create a DataFrame called `planets`:

```python
planets_dct = {
    'Planet': ['Mercury', 'Venus', 'Earth', 'Mars'],
    'Diameter_km': [4879, 12104, 12742, 6779],
}
```

1. Add a column `Has_Moons` with values `[False, False, True, True]` using method 1 (`df['col'] = `).
2. Insert a column `Position` (values 1–4) at position 0 using `.insert()`.
3. Add **two columns at once** using `.assign()`:  
   - `'Type'` = `['Rocky', 'Rocky', 'Rocky', 'Rocky']`  
   - `'In_Habitable_Zone'` = `[False, False, True, False]`
4. **Bonus**: What happens if you try to assign a list of the wrong length? Try it and explain the error.


In [None]:
# Create planets DataFrame


In [None]:
# 1. Add Has_Moons column


In [None]:
# 2. Insert Position at column 0


In [None]:
# 3. Assign Type and In_Habitable_Zone


In [None]:
# 4. Bonus: wrong-length list — what happens?


<details>
<summary>💡 One approach — click to peek</summary>
<br>

*Pandas requires the list to be the same length as the DataFrame — otherwise it raises a ValueError.*

```python
planets_dct = {
    'Planet': ['Mercury', 'Venus', 'Earth', 'Mars'],
    'Diameter_km': [4879, 12104, 12742, 6779],
}
planets = pd.DataFrame(planets_dct)

# 1. Method 1
planets['Has_Moons'] = [False, False, True, True]
print(planets)

# 2. Method 2 — insert at position 0
planets.insert(0, 'Position', [1, 2, 3, 4])
print(planets)

# 3. Method 3 — assign multiple
planets = planets.assign(
    Type=['Rocky', 'Rocky', 'Rocky', 'Rocky'],
    In_Habitable_Zone=[False, False, True, False]
)
print(planets)

# 4. Bonus — wrong length raises ValueError
try:
    planets['Oops'] = [1, 2, 3]   # only 3 values for 4 rows
except ValueError as e:
    print(f"Error: {e}")
```

</details>

---
## 3. Adding Columns Based on Other Columns

### 📋 Board Reference

| Technique | Example | When to use |
|-----------|---------|-------------|
| Boolean condition | `df['col'] > 5` | Create True/False flag |
| `.map({})` | `df['col'].map({True:'yes', False:'no'})` | Recode / relabel categories |
| `.apply(func)` | `df['col'].apply(round)` | Apply any function element-wise |
| Arithmetic | `df['a'] + df['b']` | Combine numeric columns |
| String methods | `df['col'].str[0]` · `.str.upper()` | Extract or transform text |
| List comprehension | `[f(x) for x in df['col']]` | Complex conditional logic |
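
One detail worth knowing about `.map({})`: any value missing from the dictionary becomes `NaN` in the result. A minimal sketch on a hypothetical Series (`sizes` and its labels are made up for illustration):

```python
import pandas as pd

# Hypothetical data: 'XL' has no entry in the mapping below
sizes = pd.Series(['S', 'M', 'XL'])
labels = sizes.map({'S': 'small', 'M': 'medium'})

print(labels.isna().tolist())   # [False, False, True]

# One way to handle the gap: supply a default label afterwards
labels_filled = labels.fillna('other')
print(labels_filled.tolist())   # ['small', 'medium', 'other']
```

Checking `.isna().sum()` after a `.map()` is a quick way to spot categories you forgot to include.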


In [None]:
animals = pd.DataFrame({'Animal':['cow','kitten','penguin','Puppy'],
                        'Sound':['moo','purr','chirp','bark'],
                        'Cute score':[10, 9.9, 10, 10],
                        'Pet':[False, True, False, True]})

# Boolean condition
animals['Is Cute'] = animals['Cute score'] > 5
animals


In [None]:
# Map — recode a column
animals['Can Own'] = animals['Pet'].map({True: 'yes', False: 'no'})
animals


In [None]:
# Apply — element-wise function
animals['Rounded Score'] = animals['Cute score'].apply(round)
animals


In [None]:
# Arithmetic between columns
animals['Cute & Pet'] = animals['Cute score'] + animals['Pet']
animals


In [None]:
# String methods
animals['First letter'] = animals['Animal'].str[0]
animals['Uppercase'] = animals['Animal'].str.upper()
animals


In [None]:
# List comprehension: conditional logic (re.match only tests the start of the string)
animals['Starts with P'] = ['yes' if re.match('[Pp]', a) else 'no' for a in animals['Animal']]
animals
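
When testing how a string begins, keep in mind that `re.search` looks for the pattern anywhere in the string, whereas `re.match` only tests the beginning. The sketch below uses hypothetical animal names to show the difference, along with an equivalent check built from pandas string methods:

```python
import re
import pandas as pd

# Hypothetical names: 'sheep' contains a 'p' but does not start with one
names = pd.Series(['Puppy', 'penguin', 'sheep'])

anywhere = [bool(re.search('[Pp]', n)) for n in names]  # pattern found anywhere
at_start = [bool(re.match('[Pp]', n)) for n in names]   # pattern at position 0 only

print(anywhere)  # [True, True, True]
print(at_start)  # [True, True, False]

# Same start-of-string check without regular expressions
vectorized = names.str.lower().str.startswith('p')
print(vectorized.tolist())  # [True, True, False]
```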


---
### 🔬 Explore 2 — Derived Columns

Use the `planets` DataFrame you built in Explore 1.

1. Add a column `Large` = `True` if `Diameter_km` > 10000, else `False`.
2. Add a column `Size_Category` by mapping `Large` to `'Big'` / `'Small'`.
3. Add a column `Diameter_miles` = `Diameter_km * 0.621371` (rounded to 0 decimals).
4. Add a column `Planet_Upper` = planet name in all caps using `.str.upper()`.
5. **Bonus**: Use a list comprehension to add a column `Name_Length` = number of characters in each planet name.


In [None]:
# 1. Large column — boolean condition


In [None]:
# 2. Size_Category — map


In [None]:
# 3. Diameter_miles — arithmetic


In [None]:
# 4. Planet_Upper — string method


In [None]:
# 5. Bonus: Name_Length — list comprehension


<details>
<summary>💡 One approach — click to peek</summary>
<br>

```python
# 1. Boolean condition
planets['Large'] = planets['Diameter_km'] > 10000

# 2. Map to category labels
planets['Size_Category'] = planets['Large'].map({True: 'Big', False: 'Small'})

# 3. Arithmetic conversion
planets['Diameter_miles'] = (planets['Diameter_km'] * 0.621371).round(0)

# 4. String method
planets['Planet_Upper'] = planets['Planet'].str.upper()

# 5. List comprehension
planets['Name_Length'] = [len(name) for name in planets['Planet']]

print(planets)
```

</details>

---
## 4. 🔬 Titanic Feature Engineering Activity

The Seaborn `titanic` dataset has several **engineered columns** added on top of the raw Kaggle data.  
Your job: recreate those columns from scratch!


In [None]:
titanic_seaborn = sns.load_dataset('titanic')
titanic_kaggle  = pd.read_csv('titanic.csv')

print("Seaborn columns:", list(titanic_seaborn.columns))
print("Kaggle columns: ", list(titanic_kaggle.columns))


The Seaborn version added: `alive`, `alone`, `who`, `adult_male`, `embark_town`, `class`.  
You'll recreate three of these.


### Activity 1 — `alive`

In `titanic_kaggle`, create a column `alive`:  
- `"yes"` when `Survived == 1`  
- `"no"` when `Survived == 0`

*(Hint: use `.map()`)*


In [None]:
# Create 'alive' column


In [None]:
# Run to verify
(titanic_kaggle['alive'] == titanic_seaborn['alive']).all()


<details>
<summary>💡 One approach — click to peek</summary>
<br>

```python
titanic_kaggle['alive'] = titanic_kaggle['Survived'].map({1: 'yes', 0: 'no'})
print(titanic_kaggle[['Survived', 'alive']].head())
print("Match:", (titanic_kaggle['alive'] == titanic_seaborn['alive']).all())
```

</details>

### Activity 2 — `alone`

Create a column `alone`:  
- `True` if the passenger had **no** family members aboard (`SibSp == 0` AND `Parch == 0`)  
- `False` otherwise

*(Hint: `SibSp + Parch` gives the total number of family members aboard; compare that sum to 0 to get a boolean column)*


In [None]:
# Create 'alone' column


In [None]:
# Run to verify
(titanic_kaggle['alone'] == titanic_seaborn['alone']).all()


<details>
<summary>💡 One approach — click to peek</summary>
<br>

```python
titanic_kaggle['alone'] = (titanic_kaggle['SibSp'] + titanic_kaggle['Parch']) == 0
print(titanic_kaggle[['SibSp', 'Parch', 'alone']].head(10))
print("Match:", (titanic_kaggle['alone'] == titanic_seaborn['alone']).all())
```
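
The same flag can also be written as two explicit conditions joined with `&`, which mirrors the wording of the task. The two forms agree here because both counts are non-negative. A small sketch on hypothetical rows (`toy` is made-up data, not the Titanic file):

```python
import pandas as pd

# Hypothetical passengers, for illustration only
toy = pd.DataFrame({'SibSp': [0, 1, 0], 'Parch': [0, 0, 2]})

by_sum = (toy['SibSp'] + toy['Parch']) == 0
# Each comparison needs its own parentheses because & binds tighter than ==
by_and = (toy['SibSp'] == 0) & (toy['Parch'] == 0)

print(by_sum.tolist())        # [True, False, False]
print(by_sum.equals(by_and))  # True
```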

</details>

### Activity 3 — `who`

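Create a column `who` with three labels:
- `"child"` if `Age < 16`
- `"man"` if `Sex == "male"` and `Age` is 16 or older, or missing
- `"woman"` if `Sex == "female"` and `Age` is 16 or older, or missing

*(Hint: use a list comprehension with `zip(df['Sex'], df['Age'])` or a row-wise `.apply(func, axis=1)`. A passenger with a NaN age is never a `"child"` and is labeled by `Sex` alone.)*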


In [None]:
# Create 'who' column


In [None]:
# Run to verify
(titanic_kaggle['who'] == titanic_seaborn['who']).all()


<details>
<summary>💡 One approach — click to peek</summary>
<br>

*`apply(func, axis=1)` calls your function once per row, passing each row as a Series. Use it when the result depends on several columns at once. A list comprehension over `zip(...)`, as suggested in the hint, works too.*

```python
def classify_who(row):
    if pd.isna(row['Age']):
        return 'man' if row['Sex'] == 'male' else 'woman'
    if row['Age'] < 16:
        return 'child'
    return 'man' if row['Sex'] == 'male' else 'woman'

titanic_kaggle['who'] = titanic_kaggle.apply(classify_who, axis=1)
print(titanic_kaggle[['Sex', 'Age', 'who']].head(10))
print("Match:", (titanic_kaggle['who'] == titanic_seaborn['who']).all())
```
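
For comparison, here is a sketch of the list-comprehension route mentioned in the hint. It runs on hypothetical rows (`toy` is made-up data) and relies on the fact that `NaN < 16` evaluates to `False`, so missing ages fall through to the `Sex`-based label without an explicit check:

```python
import numpy as np
import pandas as pd

# Hypothetical rows, including one missing age
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'male', 'female'],
    'Age': [40.0, 8.0, np.nan, 30.0],
})

# NaN < 16 is False, so a missing age is labeled by Sex
toy['who'] = [
    'child' if age < 16 else ('man' if sex == 'male' else 'woman')
    for sex, age in zip(toy['Sex'], toy['Age'])
]
print(toy['who'].tolist())  # ['man', 'child', 'man', 'woman']
```

The explicit `pd.isna` check in `classify_who` is more readable; the comprehension is shorter and avoids building a Series for every row.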

</details>

### Activity 4 — Survival Analysis

Now use your new columns to do a quick EDA.

1. Count the passengers who survived and who did not in each `Pclass`.
2. Create a plot showing survival counts by `Pclass`, faceted by `Sex`.
3. Create a visualization comparing age distribution of survivors vs non-survivors.


In [None]:
# 1. Survival counts by Pclass


In [None]:
# 2. Plot: survival by Pclass, faceted by Sex


In [None]:
# 3. Age distribution by survival


<details>
<summary>💡 One approach — click to peek</summary>
<br>

```python
# 1. Survival counts by Pclass
print(titanic_kaggle.groupby(['Pclass', 'Survived']).size().unstack())

# 2. Faceted count plot (catplot draws the hue levels side by side instead of overlapping them)
g = sns.catplot(data=titanic_kaggle, x='Pclass', hue='alive', col='Sex',
                kind='count', palette={'no': 'salmon', 'yes': 'steelblue'})
g.set_axis_labels('Passenger Class', 'Count')
plt.show()

# 3. Age distributions
sns.kdeplot(
    data=titanic_kaggle.dropna(subset=['Age']),
    x='Age', hue='alive', fill=True
)
plt.title('Age Distribution by Survival')
plt.show()
```
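
As an alternative to `groupby(...).size().unstack()` for the first task, `pd.crosstab` builds the same kind of count table in one call and fills empty combinations with 0 rather than `NaN`. A sketch on hypothetical rows standing in for the Titanic data:

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    'Pclass':   [1, 1, 3, 3, 3],
    'Survived': [1, 0, 0, 0, 1],
})

# Rows are Pclass values, columns are Survived values, cells are counts
counts = pd.crosstab(toy['Pclass'], toy['Survived'])
print(counts)
```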

</details>

---
## Appendix — Adding Columns Quick Reference

```python
# Add to end
df['new'] = value                          # constant or list
df['new'] = df['a'] + df['b']             # arithmetic
df['new'] = df['col'] > 5                 # boolean condition
df['new'] = df['col'].map({v1:'a', v2:'b'})  # recode
df['new'] = df['col'].apply(func)         # element-wise function
df['new'] = df['col'].str[0]              # string method
df['new'] = [f(x) for x in df['col']]    # list comprehension

# Insert at specific position
df.insert(pos, 'new', value)

# Add multiple at once (returns new df)
df = df.assign(col1=values1, col2=values2)

# Row-wise function (use when you need multiple columns)
def my_func(row):
    return row['a'] + row['b']
df['new'] = df.apply(my_func, axis=1)
```

**Common datetime extraction:**
```python
df['col'] = pd.to_datetime(df['col'])
df['year']  = df['col'].dt.year
df['month'] = df['col'].dt.month
df['day']   = df['col'].dt.day
```
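
A runnable version of the same pattern, using a hypothetical `events` DataFrame with made-up dates. The `.dt` accessor is only available after the column has been converted with `pd.to_datetime`:

```python
import pandas as pd

# Hypothetical date strings
events = pd.DataFrame({'when': ['2021-03-15', '2022-11-02']})
events['when'] = pd.to_datetime(events['when'])  # strings -> datetime64

events['year'] = events['when'].dt.year
events['month'] = events['when'].dt.month
events['weekday'] = events['when'].dt.day_name()

print(events['year'].tolist())     # [2021, 2022]
print(events['month'].tolist())    # [3, 11]
print(events['weekday'].tolist())  # ['Monday', 'Wednesday']
```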

**Extract from strings:**
```python
df['year'] = df['date_str'].str.split('-').str[0]
df['first_word'] = df['text'].str.split().str[0]
```
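
Note that pieces extracted with `.str` are still strings. The sketch below, on a hypothetical `df_demo` DataFrame, converts the extracted year to an integer with `.astype(int)` so it can be used in arithmetic:

```python
import pandas as pd

# Hypothetical data
df_demo = pd.DataFrame({
    'date_str': ['2021-03-15', '2022-11-02'],
    'text': ['hello world', 'feature engineering rocks'],
})

year_as_text = df_demo['date_str'].str.split('-').str[0]  # '2021', '2022'
df_demo['year'] = year_as_text.astype(int)                # 2021, 2022
df_demo['first_word'] = df_demo['text'].str.split().str[0]

print(df_demo['year'].tolist())        # [2021, 2022]
print(df_demo['first_word'].tolist())  # ['hello', 'feature']
```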
