# Feature Selection

Feature importance analysis measures how much each variable contributes to a model's prediction of the target variable. It serves several purposes: it identifies the key features, guides feature selection to reduce computational cost, reveals relationships between the features and the target variable, and helps explain the model's behavior.

Let's explore feature importance with a Random Forest Classifier. Random Forest computes impurity-based importance during training: each time a feature is used to split a node, the resulting decrease in impurity is credited to that feature, and the scores are averaged over all trees. Features that produce larger impurity decreases therefore receive higher importance scores.

In [None]:
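import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# Separate the predictors from the target column
X = t_tmp.drop(columns=[target])
y = t_tmp[target]

# Fix the seed so the importance ranking is reproducible across runs
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X, y)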

# Get feature importances
feature_importances = rf_classifier.feature_importances_
feature_importance_df = pd.DataFrame({'Feature': X.columns,
                                      'Importance': feature_importances})

# Sort the DataFrame by importance
feature_importance_df = feature_importance_df.sort_values(by='Importance',
                                                          ascending=False)

# Plot feature importances (most important feature at the top)
plt.figure(figsize=(10, 6))
plt.barh(feature_importance_df['Feature'],
         feature_importance_df['Importance'])
plt.gca().invert_yaxis()
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importance')
plt.show()

Consistent with the Correlation Analysis, the most significant features for determining the diagnosis are, in order of significance, *radius_worst*, *concave points* (in its mean and worst forms), *perimeter_worst* and *area_worst*.
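To make the impurity-based scores more concrete, the sketch below checks that a forest's `feature_importances_` is the average of the per-tree importances, and uses the spread across trees as a rough indicator of how stable each score is. It runs on synthetic data from `make_classification`, which is an assumption standing in for the notebook's `X` and `y`; the names `X_demo`, `y_demo` and `forest` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the notebook's X and y (assumption, illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted tree exposes its own normalized impurity-based importances
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

mean_importance = per_tree.mean(axis=0)  # should match forest.feature_importances_
std_importance = per_tree.std(axis=0)    # spread of the score across trees

# Print features from most to least important
for i in np.argsort(mean_importance)[::-1]:
    print(f"feature {i}: {mean_importance[i]:.3f} +/- {std_importance[i]:.3f}")
```

A large standard deviation relative to the mean suggests that the ranking of that feature may change from one run to another.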

Let's now evaluate feature importance using a technique called permutation feature importance. The model's performance metric (e.g., accuracy, mean squared error) is evaluated before and after shuffling the values of a single feature. The difference between the two scores indicates the importance of the feature: if shuffling leads to a significant drop in performance, the model relies on that feature. Here the already trained Random Forest Classifier serves as the model. Because the scores are computed on the same data the model was trained on, they describe what the fitted model relies on rather than how well each feature generalizes to unseen data.

In [None]:
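from sklearn.inspection import permutation_importance

# Shuffle each feature 10 times and record the drop in the model's
# default score (accuracy for a classifier)
perm_importance = permutation_importance(rf_classifier,
                                         X,
                                         y,
                                         n_repeats=10,
                                         random_state=42)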

feature_names = X.columns
# Ascending order, so the most important feature is drawn at the top
sorted_idx = perm_importance.importances_mean.argsort()

plt.figure(figsize=(10, 6))
plt.barh(feature_names[sorted_idx],
         perm_importance.importances_mean[sorted_idx])
plt.xlabel('Permutation Importance')
plt.ylabel('Feature')
plt.title('Permutation Feature Importance (Random Forest)')
plt.show()

For the Random Forest Classifier, the most significant variables according to permutation importance are *concave points_mean*, *area_se*, *texture_worst* and *concave points_worst*.
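The shuffle-and-rescore procedure behind `permutation_importance` can be sketched by hand. The function below is a simplified illustration, not scikit-learn's implementation: `manual_permutation_importance` is a hypothetical helper, it expects a NumPy array (a DataFrame such as the notebook's `X` would need `X.to_numpy()`), and the toy data is an assumption in which only the first column determines the label.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def manual_permutation_importance(model, X, y, n_repeats=10, seed=42):
    """Mean drop in model.score when each column of X is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffle one column to break its link with the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - model.score(X_perm, y))
        importances[j] = np.mean(drops)
    return importances

# Toy data (assumption): only column 0 determines the label
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

toy_model = RandomForestClassifier(n_estimators=50, random_state=0)
toy_model.fit(X_toy, y_toy)

toy_importances = manual_permutation_importance(toy_model, X_toy, y_toy)
print(toy_importances)
```

Shuffling the informative column should cost the model far more accuracy than shuffling either noise column, which is the signal the plots in this section rely on.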

A tree-based model like Random Forest may rank features differently from a linear model like Logistic Regression. Decision trees split the data on thresholds of individual features, so they can exploit non-linear effects and interactions, whereas a linear model estimates a single coefficient per feature based on its relationship with the target variable. Let's redo the analysis using a Logistic Regression classifier.

For Logistic Regression, the model coefficients serve as a measure of feature importance: features with larger absolute coefficients have more influence on the prediction. This comparison is only meaningful when the features are on a comparable scale (for example, after standardization), since the magnitude of a coefficient depends on the units of its feature.

In [None]:
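from sklearn.linear_model import LogisticRegression

logreg_model = LogisticRegression()
logreg_model.fit(X, y)

# For a binary target, coef_ has a single row with one coefficient per feature
feature_coeff = logreg_model.coef_[0]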

# Create a DataFrame with feature names and their coefficients
feature_coeff_df = pd.DataFrame({'Feature': X.columns,
                                 'Coefficient': feature_coeff})

# Sort the DataFrame by absolute coefficient values
feature_coeff_df['Abs_Coefficient'] = feature_coeff_df['Coefficient'].abs()
feature_coeff_df = feature_coeff_df.sort_values(by='Abs_Coefficient',
                                                ascending=False)

# Plot feature coefficients (largest absolute coefficient at the top)
plt.figure(figsize=(10, 6))
plt.barh(feature_coeff_df['Feature'],
         feature_coeff_df['Coefficient'])
plt.gca().invert_yaxis()
plt.xlabel('Coefficient')
plt.ylabel('Feature')
plt.title('Logistic Regression Feature Coefficients')
plt.show()

The variables with the largest absolute coefficients are *texture_worst*, *area_worst*, *area_se*, *concavity_worst*, *smoothness_worst*, *texture_mean*, *compactness_worst* and *symmetry_worst*.
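Coefficient magnitudes depend on the units of each feature, so a ranking by absolute coefficient is easier to interpret when the features share a scale. The sketch below illustrates this on toy data (an assumption: two equally informative features, one of them expressed in units 100 times larger). It compares a raw fit with a `StandardScaler` pipeline; whether the notebook's `X` is already scaled is not stated, so this is a check to consider rather than a correction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (assumption): two equally informative features on different scales
rng = np.random.default_rng(0)
n = 2000
small = rng.normal(size=n)            # unit scale
large = rng.normal(size=n) * 100.0    # same signal strength, 100x the scale
noise = 0.5 * rng.normal(size=n)
y_toy = (small + large / 100.0 + noise > 0).astype(int)
X_toy = np.column_stack([small, large])

# Raw fit: the coefficient of the large-scale feature shrinks with its units
raw_model = LogisticRegression(max_iter=5000).fit(X_toy, y_toy)
raw_coef = raw_model.coef_[0]

# Standardized fit: coefficients are per standard deviation, hence comparable
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
scaled_model.fit(X_toy, y_toy)
scaled_coef = scaled_model.named_steps["logisticregression"].coef_[0]

print("raw coefficients:         ", raw_coef)
print("standardized coefficients:", scaled_coef)
```

In the raw fit the second feature looks unimportant only because of its units; after standardization the two coefficients should be of similar size.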

Let's now use the permutation importance technique with our Logistic Regression Classifier. 

In [None]:
perm_importance = permutation_importance(logreg_model,
                                         X,
                                         y,
                                         n_repeats=10,
                                         random_state=42)

feature_names = X.columns
# Ascending order, so the most important feature is drawn at the top
sorted_idx = perm_importance.importances_mean.argsort()

plt.figure(figsize=(10, 6))
plt.barh(feature_names[sorted_idx],
         perm_importance.importances_mean[sorted_idx])
plt.xlabel('Permutation Importance')
plt.ylabel('Feature')
plt.title('Permutation Feature Importance (Logistic Regression)')
plt.show()

For this type of classifier, more variables contribute to the model's predictions. Among them, the most relevant ones are *area_se*, *area_worst*, *compactness_worst*, *compactness_se*, *texture_worst* and *texture_mean*.

From both analyses, the least significant variables, which **could be excluded** during feature selection, are: **all SE variables** except *area_se*, ***fractal_dimension_mean***, ***symmetry_mean***, ***compactness_mean***, ***smoothness_mean*** and ***fractal_dimension_worst***.
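One way to turn such a comparison into a repeatable step is to flag the features whose importance falls below a cutoff in both models. The sketch below is a hypothetical helper, not the procedure used above: `low_importance_in_both`, the feature names `f1` to `f4`, the scores and the `threshold` value are all made up for illustration.

```python
import pandas as pd

def low_importance_in_both(imp_a, imp_b, threshold):
    """Return features whose importance is below `threshold` in both Series."""
    both = pd.concat({"a": imp_a, "b": imp_b}, axis=1).dropna()
    mask = (both["a"] < threshold) & (both["b"] < threshold)
    return both.index[mask].tolist()

# Made-up scores for illustration only (not results from this notebook)
rf_scores = pd.Series({"f1": 0.20, "f2": 0.001, "f3": 0.000, "f4": 0.05})
lr_scores = pd.Series({"f1": 0.10, "f2": 0.002, "f3": 0.030, "f4": 0.00})

candidates = low_importance_in_both(rf_scores, lr_scores, threshold=0.005)
print(candidates)
```

With the notebook's objects, the two inputs could be built as `pd.Series(result.importances_mean, index=X.columns)`, provided the two permutation results are stored under separate names instead of both being assigned to `perm_importance`.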