## Feature Engineering


This assignment evaluates how you approach preprocessing data for a tabular classification problem. The goal is to wrangle and normalise the dataset below so that a model can be trained and evaluated on it. Some of the fields are described here:

- **family**: who is covered? (Just Me; Me and my Spouse; Me and my kids; Me, Spouse, and Kids)
- **financial_risk_preference**: scale from 1 (Prefer Savings) to 5 (Prefer Protection)
- **exercises**: frequency of exercise (I exercise everyday, I exercise 3x a week, I don't exercise)
- **preexisting_conditions**: conditions that require frequent doctor visits (cancer, high blood pressure, etc)
- **qle**: qualifying life event that might incur costs (baby, medical procedure, married, moving)
- **savings**: if they had to pay $3000, how would they pay for this? (borrow money, have savings, HSA)
- **prescription_costs**: annual cost of prescriptions
- **pcp_costs**: cost of primary care
- **specialist_costs**: annual cost of specialty care


The output should be this dataset split into **X_train**, **X_test**, **y_train**, and **y_test**, which will be passed to a RandomForestClassifier.

In [1]:
import pandas as pd

df = pd.read_csv("data/surveys.csv")
df.sample(5)

| | idx | age | family | salary | household_salaries | financial_risk_preference | preexisting_conditions | prescription_costs | pcp_costs | specialist_costs | pcp_visits | qle | specialty_visits | exercises | savings | classification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 78 | 78 | 38 | Me and my Spouse | 47295 | 61483.5 | 3 | none | 685 | 199 | 2019 | 2 | moving | 6 | I exercise 3x a week | borrow money | Cigna Choice HDHP |
| 196 | 196 | 26 | Me and my Spouse | 128538 | 167099.4 | 4 | high blood pressure | 422 | 0 | 2124 | 0 | none | 7 | I exercise everyday | have savings | Cigna Copay Plan PPO |
| 52 | 52 | 43 | Just Me | 36253 | 36253.0 | 4 | none | 27 | 310 | 836 | 2 | baby | 2 | I don't exercise | have savings | Cigna Copay Plan PPO |
| 77 | 77 | 38 | Me, Spouse, and Kids | 89751 | 107701.2 | 3 | none | 1092 | 0 | 653 | 0 | none | 2 | I exercise 3x a week | borrow money | Cigna Base HDHP |
| 67 | 67 | 42 | Me and my Spouse | 89214 | 133821.0 | 3 | high blood sugar | 89 | 80 | 1409 | 1 | none | 5 | I exercise everyday | borrow money | Cigna Copay Plan PPO |


In [2]:
df.columns

Index(['idx', 'age', 'family', 'salary', 'household_salaries',
       'financial_risk_preference', 'preexisting_conditions',
       'prescription_costs', 'pcp_costs', 'specialist_costs', 'pcp_visits',
       'qle', 'specialty_visits', 'exercises', 'savings', 'classification'],
      dtype='object')

In [3]:
features = [
    "age",
    "salary",
    "family",
    "household_salaries",
    "savings",
    "financial_risk_preference",
    "preexisting_conditions",
    "qle",
    "pcp_visits",
    "specialty_visits",
    "pcp_costs",
    "specialist_costs",
    "exercises",  # listed in categorical_features below, so it is included here as well
]
categorical_features = ["family", "preexisting_conditions", "qle", "savings", "exercises"]

#### 1) Separate Numeric and Categorical Features using pandas indexing (TODO)

In [None]:
numeric_features = []  # TODO: determine numeric feature names
numeric_df = df  # TODO: filter numeric features from df
categorical_df = df  # TODO: filter categorical features from df
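
One possible way to approach this step is sketched below on a small made-up frame, so it is not a reference solution. The names `toy`, `toy_features`, and `toy_categorical` are hypothetical stand-ins for `df`, `features`, and `categorical_features`; the sketch assumes that every selected feature not declared categorical should be treated as numeric.

```python
import pandas as pd

# Toy stand-in for the survey data (a few made-up rows, not the real file).
toy = pd.DataFrame({
    "age": [38, 26, 43],
    "salary": [47295, 128538, 36253],
    "family": ["Me and my Spouse", "Me and my Spouse", "Just Me"],
    "qle": ["moving", "none", "baby"],
})
toy_features = ["age", "salary", "family", "qle"]
toy_categorical = ["family", "qle"]

# Assumption: any selected feature that is not declared categorical is numeric.
toy_numeric = [name for name in toy_features if name not in toy_categorical]

# Indexing with a list of column names returns a DataFrame with only those columns.
toy_numeric_df = toy[toy_numeric]
toy_categorical_df = toy[toy_categorical]
```

An alternative would be `DataFrame.select_dtypes`, which picks columns by dtype, although an ordinal field stored as an integer (such as `financial_risk_preference`) would then be treated as numeric by default.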

#### 2) Normalise Features using pandas transformations (TODO)

In [None]:
numeric_df = numeric_df  # TODO: normalise numeric_df
categorical_df = categorical_df  # TODO: normalise categorical_df

X = numeric_df.merge(categorical_df, left_index=True, right_index=True)
y = df.classification.values
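
As a hedged illustration of what "normalising" could mean here, the sketch below applies min-max scaling to numeric columns and one-hot encoding to a categorical column, then merges the two on the index as the cell above does. The toy frames and their values are made up, and min-max plus one-hot is only one reasonable choice among several.

```python
import pandas as pd

# Made-up numeric and categorical frames sharing the same index.
toy_numeric_df = pd.DataFrame({"age": [20, 30, 40], "salary": [30000, 60000, 90000]})
toy_categorical_df = pd.DataFrame({"family": ["Just Me", "Me and my kids", "Just Me"]})

# Min-max scaling: map each numeric column onto [0, 1].
# Note: a constant column would give a zero range and needs separate handling.
col_min = toy_numeric_df.min()
col_range = toy_numeric_df.max() - col_min
scaled_df = (toy_numeric_df - col_min) / col_range

# One-hot encoding: one 0/1 indicator column per category value.
encoded_df = pd.get_dummies(toy_categorical_df, dtype=int)

# Same index-based merge as in the assignment cell.
toy_X = scaled_df.merge(encoded_df, left_index=True, right_index=True)
```

Two caveats apply. Tree-based models such as random forests are largely insensitive to monotonic rescaling of individual features, so the scaling matters more for other model families. And statistics such as the minimum and range are usually estimated on the training split only, to avoid leaking information from the test set.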

#### 3) Split data into train and test sets (TODO)

In [4]:
df.classification.value_counts()

classification
Cigna Choice HDHP       130
Cigna Copay Plan PPO     80
Cigna Base HDHP          61
Name: count, dtype: int64

In [None]:
from random import shuffle

train_split = 0.75

X_train = X  # TODO: training features
X_test = X  # TODO: testing features

y_train = y  # TODO: training labels
y_test = y  # TODO: testing labels
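
A minimal sketch of a shuffled 75/25 split follows, using made-up data. The names `toy_X`, `toy_y`, and `positions` are hypothetical, and the fixed seed is an assumption added only so the split is reproducible.

```python
import random

import pandas as pd

# Made-up features and labels: 8 rows, labels stored as a NumPy array like `y`.
toy_X = pd.DataFrame({"feature": range(8)})
toy_y = toy_X["feature"].to_numpy() % 2

train_split = 0.75

# Shuffle row positions so the split does not depend on the file order.
positions = list(range(len(toy_X)))
random.Random(0).shuffle(positions)  # seeded generator, an assumption for reproducibility

# The first 75% of shuffled positions go to training, the rest to testing.
cut = int(len(positions) * train_split)
train_pos, test_pos = positions[:cut], positions[cut:]

toy_X_train, toy_X_test = toy_X.iloc[train_pos], toy_X.iloc[test_pos]
toy_y_train, toy_y_test = toy_y[train_pos], toy_y[test_pos]
```

Because the class counts above are uneven, a purely random split can shift the class proportions between train and test. scikit-learn's `train_test_split` accepts a `stratify` argument that preserves those proportions, which may be worth considering as an alternative.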

#### 4) Train RandomForest

In [None]:
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10)
model = clf.fit(X_train, y_train)

y_train_predict = model.predict(X_train)
print("Train Metrics: ")
print(classification_report(y_train, y_train_predict))

y_test_predict = model.predict(X_test)
print("Test Metrics: ")
print(classification_report(y_test, y_test_predict))

### Discussion Questions

#### Feature Exploration and Selection

1) What techniques do you use to explore and visualize the distribution of features in the dataset?
2) How do you decide which features are relevant for the classification task? Can you discuss feature selection methods you're familiar with?

#### Categorical Variables

1) How do you handle categorical variables in a tabular dataset? Are there specific encoding techniques you prefer for classification models?
2) Can you explain the concept of target encoding, and when might it be useful in a classification problem?
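
For readers unfamiliar with the term in question 2, here is a minimal sketch of mean target encoding on made-up data. The binary column `chose_ppo` is a hypothetical target invented for this illustration; the real `classification` column has three classes, which would typically need one encoded column per class.

```python
import pandas as pd

# Made-up data: a categorical feature and a hypothetical binary target.
toy = pd.DataFrame({
    "qle": ["baby", "baby", "moving", "none", "none", "none"],
    "chose_ppo": [1, 0, 1, 0, 0, 1],
})

# Mean of the target within each category value.
category_means = toy.groupby("qle")["chose_ppo"].mean()

# Replace each category with the mean target observed for that category.
toy["qle_target_enc"] = toy["qle"].map(category_means)
```

The means here are computed on all rows purely for brevity. In practice they are usually estimated on the training data only (often with smoothing or out-of-fold estimates), since otherwise the encoding leaks label information into the features.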

#### Dealing with Imbalanced Data

1) In the context of imbalanced classes, what strategies do you employ during feature engineering to address potential issues?
2) How can feature engineering contribute to mitigating the impact of class imbalance in a classification model?

#### Feature Scaling

1) Do you consider feature scaling in your feature engineering process? When is it necessary, and how does it impact different machine learning algorithms?
2) Can you explain the difference between normalization and standardization, and when might you choose one over the other?
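
The two terms in question 2 are used loosely in practice. Under one common convention, "normalization" means min-max rescaling to [0, 1] and "standardization" means subtracting the mean and dividing by the standard deviation. The sketch below shows both on a made-up series, assuming that convention.

```python
import pandas as pd

# Made-up cost values.
costs = pd.Series([0.0, 100.0, 200.0, 300.0, 400.0])

# Min-max normalization: values end up in [0, 1].
min_max = (costs - costs.min()) / (costs.max() - costs.min())

# Standardization (z-score): result has mean 0 and unit standard deviation.
# ddof=0 uses the population standard deviation.
z_score = (costs - costs.mean()) / costs.std(ddof=0)
```

Min-max scaling is bounded but sensitive to extreme values, since a single outlier stretches the range; z-scores are unbounded but keep outliers visible as large magnitudes.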

#### Feature Transformation

1) How do you approach feature transformation, such as creating interaction terms or polynomial features, and when might these techniques be beneficial?
2) Can you discuss the use of log-transformations or Box-Cox transformations for certain types of features?
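
As a small illustration of the log-transformation mentioned in question 2, the sketch below applies `numpy.log1p` to the five `prescription_costs` values shown in the sample output earlier. Using `log1p` rather than a plain logarithm is an assumption made here because cost columns in this dataset can contain zeros.

```python
import numpy as np
import pandas as pd

# The prescription_costs values from the five sampled rows shown earlier.
costs = pd.Series([685, 422, 27, 1092, 89])

# log1p computes log(1 + x), so a cost of zero maps to zero instead of -inf.
log_costs = np.log1p(costs)

# expm1 is the inverse, which recovers the original scale when needed.
recovered = np.expm1(log_costs)
```

A Box-Cox transformation (available as `scipy.stats.boxcox`) requires strictly positive inputs, so columns containing zeros would need a shift or a different transform.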