## Daniel Barella
## 9/14/25

# Day 13 – Feature Engineering & Automated Cleaning
Focus: Creating new features, encoding, scaling, and using automated tools.


**Goals for Day 13**
- Practice creating new features from existing data
- Bin continuous variables into categories
- Encode categorical variables (manual + automated)
- Use `SimpleImputer` for missing values
- Use `StandardScaler` for standardization (zero mean, unit variance)


In [1]:
#!pip install --upgrade scikit-learn


In [2]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


In [3]:
import sklearn
print(sklearn.__version__)


1.6.1


## Load Titanic (Cleaned)

In [4]:
df = pd.read_csv("titanic_clean.csv")
df.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,Male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,Female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,Female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,Female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,Male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


## Feature Creation

In [9]:
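# Create new column: family_size (siblings/spouses + parents/children + the passenger)
df['family_size'] = df['sibsp'] + df['parch'] + 1

# Bin ages into categories; intervals are right-inclusive by default: (0, 12], (12, 18], ...
df['age_group'] = pd.cut(df['age'], bins=[0,12,18,35,60,80],
                         labels=['Child','Teen','Young Adult','Adult','Senior'])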
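
To check how the binning above treats values that sit exactly on a bin edge, here is a small sketch on made-up ages (the toy `ages` series is an assumption for illustration, not taken from the Titanic data). With the default `right=True`, each interval includes its right edge and excludes its left edge, so an age of exactly 0 would fall outside every bin.

```python
import pandas as pd

# Toy ages chosen to sit on or near the bin edges used above (illustrative only).
ages = pd.Series([0.0, 0.5, 12.0, 12.5, 18.0, 35.0, 60.0, 80.0])

# Same bins and labels as the notebook cell. With right=True (the default) the
# intervals are (0, 12], (12, 18], (18, 35], (35, 60], (60, 80].
groups = pd.cut(ages, bins=[0, 12, 18, 35, 60, 80],
                labels=['Child', 'Teen', 'Young Adult', 'Adult', 'Senior'])

# Age 0.0 is not inside (0, 12], so it comes back as NaN.
print(pd.DataFrame({'age': ages, 'age_group': groups}))
```

If a dataset did contain an age of exactly 0, passing `include_lowest=True` to `pd.cut` would pull it into the first bin.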


## Encoding (Manual & Automated)

In [10]:
# Manual one-hot encoding with pandas; drop_first=True drops one level per column
df_encoded = pd.get_dummies(df, columns=['sex','class','embarked'], drop_first=True)

# Automated encoding with scikit-learn; the fitted encoder can be reused on new data
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded = encoder.fit_transform(df[['sex','class','embarked']])
encoded_df = pd.DataFrame(encoded,
                          columns=encoder.get_feature_names_out(['sex','class','embarked']),
                          index=df.index)


## Imputation with SimpleImputer

In [11]:
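# Example: if age had missing values, fill them with the column median
imputer = SimpleImputer(strategy='median')
# fit_transform returns a 2-D array of shape (n_rows, 1); ravel() flattens it to one column
df['age_imputed'] = imputer.fit_transform(df[['age']]).ravel()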
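
Since the cleaned file may have no gaps left in `age`, the imputation step above is easier to see on data that does. This is a minimal sketch on a made-up column with two missing values (the `toy` frame is an assumption for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical ages with two missing entries.
toy = pd.DataFrame({'age': [22.0, np.nan, 26.0, 35.0, np.nan]})

imputer = SimpleImputer(strategy='median')
# fit_transform returns shape (n_rows, 1); ravel() makes it a flat column.
toy['age_imputed'] = imputer.fit_transform(toy[['age']]).ravel()

# statistics_ holds the value learned per column: the median of 22, 26, 35.
print(imputer.statistics_)
print(toy)
```

The learned median lives in `imputer.statistics_`, so the same fitted imputer can fill gaps in later data with the value computed here.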


## Scaling

In [12]:
scaler = StandardScaler()
df[['fare_scaled']] = scaler.fit_transform(df[['fare']])
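
As a sanity check on what the scaling step computes, the sketch below compares `StandardScaler` with the formula `(x - mean) / std` written by hand. It uses the five `fare` values shown in the `df.head()` output above. The `toy` frame and the `manual` variable are names introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# The five fare values from the df.head() output.
toy = pd.DataFrame({'fare': [7.25, 71.2833, 7.925, 53.1, 8.05]})

scaler = StandardScaler()
toy['fare_scaled'] = scaler.fit_transform(toy[['fare']]).ravel()

# StandardScaler uses the population standard deviation, i.e. ddof=0.
manual = (toy['fare'] - toy['fare'].mean()) / toy['fare'].std(ddof=0)

print(np.allclose(toy['fare_scaled'], manual))
print(toy['fare_scaled'].mean(), toy['fare_scaled'].std(ddof=0))
```

Note that pandas' `Series.std()` defaults to `ddof=1`, so a hand check without `ddof=0` would not match the scaler's output exactly.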


## Reflection

- Learned to create new features (family size, age groups)
- Practiced encoding (manual one-hot and sklearn `OneHotEncoder`)
- Used `SimpleImputer` to automatically handle missing values
- Scaled numerical data with `StandardScaler`

Next: Advanced Visualization (Day 14)
