## Feature Engineering


In this assignment, we'll evaluate how you think about preprocessing data for a tabular classification problem. The goal is to wrangle and normalise the dataset below so that we can train and evaluate a model. The fields are described below:

- **family**: who is covered? (Just Me; Me and my Spouse; Me and my kids; Me, Spouse, and Kids)
- **financial_risk_preference**: scale from (1) Prefer Savings to (5) Prefer Protection
- **exercises**: frequency of exercise (I exercise everyday, I exercise 3x a week, I don't exercise)
- **preexisting_conditions**: conditions that require frequent doctor visits (cancer, high blood pressure, etc)
- **qle**: qualifying life event that might incur costs (baby, medical procedure, married, moving)
- **savings**: if they had to pay $3000, how would they pay for this? (borrow money, have savings, HSA)
- **prescription_costs**: annual cost of prescriptions
- **pcp_costs**: cost of primary care
- **specialist_costs**: annual cost of specialty care


The output should be this dataset split into **X_train**, **X_val**, **X_test**, **y_train**, **y_val**, and **y_test**, which will be passed to RandomForestClassifier.

In [None]:
import pandas as pd
from random import shuffle

df = pd.read_csv("data/surveys.csv")
df.sample(5)

#### 1) Prepare dataset for RandomForest

In [None]:
features = [
    "age",
    "salary",
    "family",
    "household_salaries",
    "savings",
    "financial_risk_preference",
    "preexisting_conditions",
    "qle",
    "pcp_visits",
    "specialty_visits",
    "pcp_costs",
    "specialist_costs"
]
categorical_features = ["family", "preexisting_conditions", "qle", "savings", "exercises"]


### TODO: Normalise and split data into train, validation, and test sets


X_train = []
X_val = []
X_test = []

y_train = []
y_val = []
y_test = []
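
One possible shape for this step, shown on synthetic data rather than the real survey file, is sketched below. It is only an illustration: the `toy` frame, its values, and the `plan` label column are hypothetical (the notebook does not name the target column), and the 60/20/20 split ratio is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for data/surveys.csv; the "plan" label column is hypothetical.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "salary": rng.integers(30_000, 150_000, size=n),
    "family": rng.choice(["Just Me", "Me and my Spouse"], size=n),
    "savings": rng.choice(["borrow money", "have savings", "HSA"], size=n),
    "plan": rng.choice(["low", "high"], size=n),
})
toy_categorical = ["family", "savings"]

# One-hot encode the categorical columns; numeric columns pass through unchanged.
X = pd.get_dummies(toy.drop(columns=["plan"]), columns=toy_categorical)
y = toy["plan"]

# Hold out 20% for test, then split the remaining 80% into 75/25,
# which gives roughly 60/20/20 train/validation/test overall.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

print(X_train.shape, X_val.shape, X_test.shape)
```

Stratifying on the label keeps class proportions similar across the three splits, which matters if the real target turns out to be imbalanced.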


#### 2) Train RandomForest

In [None]:
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier

# Fit a random forest on the training split
clf = RandomForestClassifier(n_estimators=1000, max_depth=50)
model = clf.fit(X_train, y_train)

# Metrics on the training split (these are typically optimistic)
y_train_predict = model.predict(X_train)
print("Train Metrics: ")
print(classification_report(y_train, y_train_predict))

# Metrics on the held-out test split
y_test_predict = model.predict(X_test)
print("Test Metrics: ")
print(classification_report(y_test, y_test_predict))

### Discussion Questions

Feature Exploration and Selection

1) What techniques do you use to explore and visualize the distribution of features in the dataset?
2) How do you decide which features are relevant for the classification task? Can you discuss feature selection methods you're familiar with?

Categorical Variables

1) How do you handle categorical variables in a tabular dataset? Are there specific encoding techniques you prefer for classification models?
2) Can you explain the concept of target encoding, and when might it be useful in a classification problem?
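
As a reference point for the target encoding question, here is a minimal sketch of smoothed mean encoding. The toy data, the binary `label` column, the helper name `fit_target_encoding`, and the smoothing value are all assumptions made for illustration. The mapping is fitted on training rows only, so that target information does not leak from held-out rows.

```python
import pandas as pd

# Toy training data; column names and the binary "label" target are hypothetical.
train = pd.DataFrame({
    "qle": ["baby", "baby", "married", "moving", "moving", "moving"],
    "label": [1, 1, 0, 0, 1, 0],
})

def fit_target_encoding(frame, column, target, smoothing=2.0):
    """Return a category -> smoothed target mean mapping plus the global mean."""
    global_mean = frame[target].mean()
    stats = frame.groupby(column)[target].agg(["mean", "count"])
    # Shrink each category mean toward the global mean; rare categories shrink more.
    smoothed = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return smoothed.to_dict(), global_mean

mapping, fallback = fit_target_encoding(train, "qle", "label")

# Apply to new rows; categories unseen during fitting fall back to the global mean.
new_rows = pd.DataFrame({"qle": ["baby", "medical procedure"]})
new_rows["qle_encoded"] = new_rows["qle"].map(mapping).fillna(fallback)
print(new_rows)
```

With these numbers the global mean is 0.5, so "baby" (two rows, both positive) encodes to 0.75 rather than 1.0, and the unseen "medical procedure" encodes to 0.5.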

Dealing with Imbalanced Data

1) In the context of imbalanced classes, what strategies do you employ during feature engineering to address potential issues?
2) How can feature engineering contribute to mitigating the impact of class imbalance in a classification model?

Feature Scaling

1) Do you consider feature scaling in your feature engineering process? When is it necessary, and how does it impact different machine learning algorithms?
2) Can you explain the difference between normalization and standardization, and when might you choose one over the other?
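
To make the distinction in the second question concrete, the sketch below applies both rescalings to a small, made-up array of costs. Min-max normalization maps values into [0, 1], while standardization centres them at zero with unit variance. Tree-based models such as random forests split on thresholds, so they are generally insensitive to either rescaling, whereas distance-based and gradient-based models usually are not.

```python
import numpy as np

# Made-up annual costs with one large value.
costs = np.array([100.0, 200.0, 300.0, 400.0, 1000.0])

# Min-max normalization: rescale to the [0, 1] range.
normalized = (costs - costs.min()) / (costs.max() - costs.min())

# Standardization (z-score): subtract the mean, divide by the standard deviation.
standardized = (costs - costs.mean()) / costs.std()

print(normalized.round(3))
print(standardized.round(3))
```

In practice the minimum, maximum, mean, and standard deviation would be computed on the training split only and then reused for the validation and test splits.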

Feature Transformation

1) How do you approach feature transformation, such as creating interaction terms or polynomial features, and when might these techniques be beneficial?
2) Can you discuss the use of log-transformations or Box-Cox transformations for certain types of features?
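
For the second question, a small sketch of both transforms on made-up, right-skewed costs follows. The `box_cox` helper is a hypothetical hand-written version with a fixed lambda chosen for illustration; libraries such as SciPy provide `scipy.stats.boxcox`, which can estimate lambda from the data instead.

```python
import numpy as np

# Made-up, right-skewed cost values that include a zero.
costs = np.array([0.0, 50.0, 120.0, 300.0, 5000.0])

# log1p computes log(1 + x), so it is defined at zero; expm1 inverts it.
logged = np.log1p(costs)
restored = np.expm1(logged)

def box_cox(x, lam):
    """Box-Cox transform with a fixed lambda; requires strictly positive input."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    # lambda = 0 is defined as the natural log (the limit of the general formula).
    return np.log(x) if lam == 0 else (x ** lam - 1.0) / lam

# Shift by one so the zero cost becomes strictly positive before Box-Cox.
transformed = box_cox(costs + 1.0, lam=0.5)

print(logged.round(3))
print(transformed.round(3))
```

Both transforms compress the long right tail, which tends to help linear and distance-based models more than tree-based ones.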