# Feature selection
While preparing your data for modeling, make sure you have a set of helpful features for the model to base its predictions (here, diagnoses) on. To be helpful, features need to capture essential characteristics of the heart disease dataset without duplicating one another, that is, in a roughly orthogonal way; more data isn't always better!

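You can use the SelectFromModel class from sklearn.feature_selection to select useful features. SelectFromModel is not a brute-force search: it wraps an estimator (here a RandomForestClassifier), reads the feature importances from the fitted model, and keeps the features whose importance meets a threshold, which for a random forest defaults to the mean importance. The features that remain are the most salient ones for the task of heart disease diagnosis.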
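
To make the thresholding step concrete, here is a minimal sketch on synthetic data. The data from make_classification and the names X_toy, y_toy, forest, and toy_selector are illustrative assumptions, not part of the heart disease exercise; random_state is fixed only to make the sketch repeatable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (assumption): 10 features, only a few of them informative
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=3, random_state=0
)

forest = RandomForestClassifier(max_depth=5, class_weight="balanced", random_state=0)

# With threshold left unset, a random forest is thresholded at its mean importance
toy_selector = SelectFromModel(forest)
toy_selector.fit(X_toy, y_toy)  # fits an internal clone, exposed as estimator_

# Rebuild the selection by hand from the fitted forest's importances
importances = toy_selector.estimator_.feature_importances_
manual_mask = importances >= toy_selector.threshold_

print("Threshold (mean importance):", toy_selector.threshold_)
print("Kept feature indices:", np.flatnonzero(toy_selector.get_support()))
```

The manually built mask should match what get_support() reports, which shows that the selection is a single comparison against the threshold rather than a search over feature subsets.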

RandomForestClassifier has been imported and the heart disease data features and target have been imported as X_train and y_train, respectively.

In [None]:
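import matplotlib.pyplot as plt  # pyplot provides plt.figure, plt.barh, and plt.show
import pandas as pd  # needed later to build the feature importance DataFrame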

In [None]:
# Define a random forest classifier with n_jobs=-1, class_weight='balanced', and max_depth=5, then perform feature selection on the training data (X_train, y_train) using .fit().

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

In [None]:
# Define the random forest model
rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)

# Fit the random forest model to the training data
rf.fit(X_train, y_train)

# Create a SelectFromModel object
selector = SelectFromModel(rf)

# Fit the selector to the training data (without prefit=True this refits an internal clone of rf, available as selector.estimator_)
selector.fit(X_train, y_train)

In [None]:
# Define the feature selection object using the rf classifier you have created.

In [None]:
from sklearn.feature_selection import SelectFromModel

# Define the random forest model and fit to the training data
rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
rf.fit(X_train, y_train)

# Define the feature selection object
model = SelectFromModel(rf, prefit=True)

In [None]:
from sklearn.feature_selection import SelectFromModel

# Define the random forest model and fit to the training data
rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
rf.fit(X_train, y_train)

# Define the feature selection object
model = SelectFromModel(rf, prefit=True)

# Transform the training features
X_train_transformed = model.transform(X_train)

In [None]:
# Get the selected features from the feature selector object, filter the DataFrame based on the features and print the selected features out.

In [None]:
from sklearn.feature_selection import SelectFromModel

# Define the random forest model and fit to the training data
rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
rf.fit(X_train, y_train)

# Define the feature selection object
model = SelectFromModel(rf, prefit=True)

# Transform the training features
X_train_transformed = model.transform(X_train)

# Get the original feature names (X_train holds only the features, not the target)
original_features = X_train.columns.tolist()
print(f"Original features: {original_features}")

# Select the features deemed important by the SelectFromModel
features_bool = model.get_support()

selected_features = [feature for feature, is_selected in zip(original_features, features_bool) if is_selected]
print(f"\nSelected features: {selected_features}")

# Create a DataFrame to hold feature importance data
feature_importance = pd.DataFrame({
    "feature": selected_features,
    "importance": rf.feature_importances_[features_bool]
})

# Plot the feature importances
plt.figure(figsize=(10, 6))
plt.barh(feature_importance["feature"], feature_importance["importance"])
plt.xlabel('Importance')
plt.title('Feature Importances')
plt.show()

In [None]:
# Example output (the selected set can vary between runs because no random_state is set):
# Original features: ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'slope', 'ca', 'thal']
# Selected features: ['cp', 'thalach', 'ca', 'thal']
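
The exercise also asks you to filter the DataFrame down to the selected features. A minimal sketch of that step, assuming synthetic stand-in data with hypothetical column names f0 to f7 (X_demo, y_demo, rf_demo, and selector_demo are illustrative names, not objects from the exercise):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for X_train / y_train (assumption), with named columns
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# Same hyperparameters as the exercise; random_state is added only for repeatability
rf_demo = RandomForestClassifier(class_weight="balanced", max_depth=5, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Wrap the already fitted forest
selector_demo = SelectFromModel(rf_demo, prefit=True)

# Boolean mask -> column names -> filtered DataFrame that keeps its labels
selected_cols = X_demo.columns[selector_demo.get_support()]
X_demo_selected = X_demo[selected_cols]

print(list(selected_cols))
print(X_demo_selected.shape)
```

Indexing the DataFrame with the selected column names keeps the labels, whereas transform() may return a plain NumPy array, depending on the scikit-learn output configuration.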