## Feature Engineering


In this assignment, we'll evaluate how you think about preprocessing data for a tabular classification problem. The goal is to wrangle and normalise the dataset below so that we can train and evaluate a model. The fields are:

- **family**: who is covered? (Just Me; Me and my Spouse; Me and my kids; Me, Spouse, and Kids)
- **financial_risk_preference**: scale from (1) Prefer Savings to (5) Prefer Protection
- **exercises**: frequency of exercise (I exercise everyday, I exercise 3x a week, I don't exercise)
- **preexisting_conditions**: conditions that require frequent doctor visits (cancer, high blood pressure, etc)
- **qle**: qualifying life event that might incur costs (baby, medical procedure, married, moving)
- **savings**: if they had to pay $3000, how would they pay for this? (borrow money, have savings, HSA)
- **prescription_costs**: annual cost of prescriptions
- **pcp_costs**: cost of primary care
- **specialist_costs**: annual cost of specialty care


The output should be this dataset split into **X_train**, **X_test**, **y_train**, and **y_test**, which will be passed to RandomForestClassifier.

In [25]:
import pandas as pd
from random import shuffle

df = pd.read_csv("data/surveys.csv")
df.sample(5)

Unnamed: 0,idx,age,family,salary,household_salaries,financial_risk_preference,preexisting_conditions,prescription_costs,pcp_costs,specialist_costs,pcp_visits,qle,specialty_visits,exercises,savings,classification
73,73,32,Just Me,56434,56434.0,3,none,98,0,722,0,baby,2,I exercise everyday,borrow money,Cigna Copay Plan PPO
1,1,33,Me and my Spouse,117690,129459.0,3,none,51,155,0,2,none,0,I exercise 3x a week,have savings,Cigna Base HDHP
201,201,19,Just Me,133141,133141.0,3,high blood pressure,87,671,811,8,none,2,I exercise everyday,HSA,Cigna Choice HDHP
241,241,38,Me and my Spouse,129999,220998.3,3,high blood pressure,21,130,857,1,moving,2,I don't exercise,have savings,Cigna Choice HDHP
216,216,52,Me and my Spouse,83925,92317.5,4,none,84,302,3648,2,none,9,I exercise 3x a week,have savings,Cigna Copay Plan PPO


#### 1) Prepare dataset for RandomForest

In [31]:
features = [
    "age",
    "salary",
    "family",
    "household_salaries",
    "savings",
    "financial_risk_preference",
    "preexisting_conditions",
    "qle",
    "pcp_visits",
    "specialty_visits",
    "pcp_costs",
    "specialist_costs"
]
categorical_features = ["family", "preexisting_conditions", "qle", "savings", "exercises"]


### ANSWER
# Standardise the numeric columns (z-score); categorical columns are one-hot encoded further down.
numeric = df[[f for f in features if f not in categorical_features]]
feature_distributions = {
    "mean": numeric.mean().to_dict(),
    "std": numeric.std().to_dict(),
}
numeric = (numeric - pd.Series(feature_distributions["mean"])) / pd.Series(
    feature_distributions["std"]
)
one_hot = pd.get_dummies(df[categorical_features])
X = numeric.merge(one_hot, left_index=True, right_index=True)
X.index.name = "user-id"

y = df.classification.values

train_split = 0.8
test_split = 0.2

train_idx = {}
test_idx = {}
val_idx = {}

for plan in set(y):
    # Stratify by class: shuffle this plan's row positions, then slice them.
    indices = [i for i, label in enumerate(y) if label == plan]
    shuffle(indices)
    train_plan_idx = int(len(indices) * train_split)
    test_plan_idx = train_plan_idx + int(len(indices) * test_split)

    train_idx[plan] = indices[:train_plan_idx]
    test_idx[plan] = indices[train_plan_idx:test_plan_idx]
    # Any leftover rows go to validation. Slicing from test_plan_idx avoids
    # indices[-0:], which returns the whole list when nothing is left over.
    val_idx[plan] = indices[test_plan_idx:]

def flatten(t):
    return [item for sublist in t for item in sublist]

X_train = X.iloc[flatten([v for _, v in train_idx.items()]), :]
X_val = X.iloc[flatten([v for _, v in val_idx.items()]), :]
X_test = X.iloc[flatten([v for _, v in test_idx.items()]), :]

y_train = y[flatten([v for _, v in train_idx.items()])]
y_val = y[flatten([v for _, v in val_idx.items()])]
y_test = y[flatten([v for _, v in test_idx.items()])]

X_train.shape[0], X_test.shape[0]

(216, 54)
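
The manual stratified split above can also be expressed with scikit-learn's `train_test_split`. The sketch below is an alternative, not the method used in the answer. It runs on a hypothetical toy frame (`toy`, with made-up `age` and `salary` values and shortened plan names) rather than on `data/surveys.csv`, and it computes the scaling statistics from the training rows only, so no information from the test rows reaches the training features.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the survey data: 270 rows in three classes.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(18, 65, size=270),
    "salary": rng.integers(30_000, 150_000, size=270),
    "classification": ["Base"] * 60 + ["Choice"] * 130 + ["Copay"] * 80,
})

X_toy = toy[["age", "salary"]]
y_toy = toy["classification"]

# stratify keeps each plan's share roughly equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Fit the scaling statistics on the training rows only, then reuse them.
mean, std = X_tr.mean(), X_tr.std()
X_tr_scaled = (X_tr - mean) / std
X_te_scaled = (X_te - mean) / std
```

Tree-based models such as random forests split on thresholds, so they are largely insensitive to this kind of monotonic rescaling; the scaling step matters more for distance-based or gradient-based models.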

#### 2) Train RandomForest

In [32]:
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=1000, max_depth=50)
model = clf.fit(X_train, y_train)

y_train_predict = model.predict(X_train)
print("Train Metrics: ")
print(classification_report(y_train, y_train_predict))

y_test_predict = model.predict(X_test)
print("Test Metrics: ")
print(classification_report(y_test, y_test_predict))

Train Metrics: 
                      precision    recall  f1-score   support

     Cigna Base HDHP       1.00      1.00      1.00        48
   Cigna Choice HDHP       1.00      1.00      1.00       104
Cigna Copay Plan PPO       1.00      1.00      1.00        64

            accuracy                           1.00       216
           macro avg       1.00      1.00      1.00       216
        weighted avg       1.00      1.00      1.00       216

Test Metrics: 
                      precision    recall  f1-score   support

     Cigna Base HDHP       1.00      1.00      1.00        12
   Cigna Choice HDHP       0.80      0.92      0.86        26
Cigna Copay Plan PPO       0.83      0.62      0.71        16

            accuracy                           0.85        54
           macro avg       0.88      0.85      0.86        54
        weighted avg       0.85      0.85      0.85        54



### Discussion Questions

Feature Exploration and Selection

1) What techniques do you use to explore and visualize the distribution of features in the dataset?
2) How do you decide which features are relevant for the classification task? Can you discuss feature selection methods you're familiar with?
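
As one point of reference for the second question, a fitted random forest exposes impurity-based feature importances. The sketch below is a minimal illustration on hypothetical synthetic data (the column names `informative` and `noise` are made up), not an analysis of the survey dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: one column determines the label, the other is pure noise.
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "informative": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
labels = (toy["informative"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(toy, labels)

# Importances sum to 1; higher values mean the feature was used in more useful splits.
importances = pd.Series(
    forest.feature_importances_, index=toy.columns
).sort_values(ascending=False)
```

Impurity-based importances can favour high-cardinality features, so they are usually treated as a first look rather than a final ranking.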

Categorical Variables

1) How do you handle categorical variables in a tabular dataset? Are there specific encoding techniques you prefer for classification models?
2) Can you explain the concept of target encoding, and when might it be useful in a classification problem?
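
For reference, target encoding replaces each category with a statistic of the target computed over the rows in that category. The sketch below assumes a binary target and uses additive smoothing toward the global mean; the helper `target_encode`, the column `chose_ppo`, and the toy rows are hypothetical and do not come from the survey data.

```python
import pandas as pd

# Hypothetical toy data: a categorical feature and a binary target.
toy = pd.DataFrame({
    "qle": ["baby", "baby", "moving", "moving", "moving", "none"],
    "chose_ppo": [1, 1, 0, 1, 0, 0],
})

def target_encode(train, column, target, smoothing=2.0):
    """Return a Series mapping each category to a smoothed target mean."""
    prior = train[target].mean()
    stats = train.groupby(column)[target].agg(["mean", "count"])
    # Shrink each category mean toward the global mean; rare categories shrink most.
    return (stats["count"] * stats["mean"] + smoothing * prior) / (
        stats["count"] + smoothing
    )

encoding = target_encode(toy, "qle", "chose_ppo")
toy["qle_encoded"] = toy["qle"].map(encoding)
```

Because the encoding uses the target, it should be computed on training rows (or out-of-fold) only; computing it on the full dataset leaks label information into the features.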

Dealing with Imbalanced Data

1) In the context of imbalanced classes, what strategies do you employ during feature engineering to address potential issues?
2) How can feature engineering contribute to mitigating the impact of class imbalance in a classification model?
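
One common complement to feature engineering here is class weighting. The sketch below computes scikit-learn's "balanced" weights from the training-set class counts shown in the classification report earlier (48, 104, 64); the array `y_counts` is rebuilt from those counts and is only a stand-in for the real `y_train`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Training-set class counts taken from the classification report.
counts = {
    "Cigna Base HDHP": 48,
    "Cigna Choice HDHP": 104,
    "Cigna Copay Plan PPO": 64,
}
# Stand-in label array with the same class counts as y_train.
y_counts = np.repeat(list(counts.keys()), list(counts.values()))

classes = np.unique(y_counts)
# "balanced" weight = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_counts)
class_weight = dict(zip(classes, weights))
```

The resulting dictionary can be passed as `class_weight=` to `RandomForestClassifier`, or the string `"balanced"` can be passed directly to get the same weighting.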

Feature Scaling

1) Do you consider feature scaling in your feature engineering process? When is it necessary, and how does it impact different machine learning algorithms?
2) Can you explain the difference between normalization and standardization, and when might you choose one over the other?
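
To make the two terms concrete, the sketch below applies min-max normalisation and standardisation to the five `specialist_costs` values from the sampled rows shown earlier. It is a minimal illustration using scikit-learn's scalers, not part of the answer code.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# specialist_costs values from the five sampled rows (one column, five rows).
costs = np.array([[722.0], [0.0], [811.0], [857.0], [3648.0]])

# Normalisation: rescale to the [0, 1] range using the column min and max.
normalised = MinMaxScaler().fit_transform(costs)

# Standardisation: subtract the mean and divide by the standard deviation.
standardised = StandardScaler().fit_transform(costs)
```

Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), so the two give slightly different values on small datasets.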

Feature Transformation

1) How do you approach feature transformation, such as creating interaction terms or polynomial features, and when might these techniques be beneficial?
2) Can you discuss the use of log-transformations or Box-Cox transformations for certain types of features?
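
As a small illustration of the second question, the sketch below applies a log transform and a Box-Cox transform to values from the five sampled rows shown earlier. It assumes SciPy is available; with only five points the fitted Box-Cox parameter is not meaningful and is shown purely to demonstrate the call.

```python
import numpy as np
from scipy import stats

# salary and specialist_costs values from the five sampled rows.
salary = np.array([56434.0, 117690.0, 133141.0, 129999.0, 83925.0])
specialist_costs = np.array([722.0, 0.0, 811.0, 857.0, 3648.0])

# log1p computes log(1 + x), so zero costs stay defined and map to 0.
log_costs = np.log1p(specialist_costs)

# Box-Cox requires strictly positive input; with no lambda given,
# it estimates the power parameter by maximum likelihood and returns it.
salary_bc, fitted_lambda = stats.boxcox(salary)
```

Both transforms are monotonic, so they compress long right tails without changing the ordering of the values.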