# **Feature Selection Notebook**

## Objectives

* Answer business requirement 1
- The client wants to understand the patterns in the heart disease dataset, so that they can learn which variables are most strongly correlated with a positive heart disease prediction.

## Inputs

* outputs/datasets/collection/heart.csv

## Outputs

* Generate code that answers the first business requirement



---

# Set up the Working Directory

Define and confirm the working directory.

In [None]:
import os
current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
current_dir

# Load data

In [None]:
import pandas as pd
df_raw_path = "outputs/datasets/collection/heart.csv"
df = pd.read_csv(df_raw_path)
df.head(3)

---

# Explore the data

We want to get more familiar with the data, so we generate a ProfileReport to check each variable's type, missing-value level and distribution.

In [None]:
import pandas as pd
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

From the data we noticed that:
- All columns are numerical
- There are no missing values
- A few columns are imbalanced
- Target is balanced
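
The observations above can also be checked directly with pandas. The following is a minimal sketch, not part of the original notebook: `summarize_frame` is a hypothetical helper, and the small `toy` frame is made-up data standing in for `heart.csv`.

```python
import pandas as pd


def summarize_frame(df: pd.DataFrame, target: str) -> dict:
    """Return a few headline facts that the profile report also surfaces."""
    return {
        # True only if every column has a numeric dtype
        "all_numeric": all(pd.api.types.is_numeric_dtype(t) for t in df.dtypes),
        # Total number of missing cells across the whole frame
        "missing_total": int(df.isna().sum().sum()),
        # Share of each target class, to judge how balanced the target is
        "target_share": df[target].value_counts(normalize=True).round(3).to_dict(),
    }


# Hypothetical toy data, NOT the real heart disease dataset
toy = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 44],
    "chol": [233, 250, 204, 236, 354, 263],
    "target": [1, 1, 1, 0, 0, 0],
})

summary = summarize_frame(toy, target="target")
print(summary)
```

In the notebook, the same helper could be called as `summarize_frame(df, target="target")` on the loaded data.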

---

# Correlation study

Initial Pearson and Spearman study.

In [None]:
corr_spearman = df.corr(method='spearman')['target'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman

In [None]:
corr_pearson = df.corr(method='pearson')['target'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson

In [None]:
top_n = 5
set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list())

After studying the Spearman and Pearson correlations, we found that the variables most correlated with the target are the following:

- ca = -0.453329
- cp = 0.464894
- exang =  -0.438029
- oldpeak = -0.438441
- thalach = 0.429832

None of these correlations is strong: among the chosen variables, the largest in absolute value is cp (0.46) and the smallest is thalach (0.43). The developer will try to improve on these correlations.
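
The ranking step used in this section can be sketched as a small reusable function. This is an illustrative sketch under stated assumptions: `top_correlated` is a hypothetical helper, and `toy` is synthetic data rather than the heart disease dataset. Unlike the positional `[1:]` slice, it drops the target by name, so it does not depend on the target sorting first.

```python
import numpy as np
import pandas as pd


def top_correlated(df, target, method="spearman", n=5):
    """Rank features by absolute correlation with the target column."""
    # Drop the target's self-correlation by label instead of by position
    corr = df.corr(method=method)[target].drop(target)
    return corr.sort_values(key=abs, ascending=False).head(n)


# Synthetic stand-in data with a known structure (an assumption for the demo)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "strong_pos": x + rng.normal(scale=0.1, size=200),
    "weak_neg": -x + rng.normal(scale=3.0, size=200),
    "noise": rng.normal(size=200),
    "target": x,
})

top_pearson = top_correlated(toy, "target", method="pearson", n=2)
top_spearman = top_correlated(toy, "target", method="spearman", n=2)

# Union of the top features from both methods, as done in the cell above
selected = sorted(set(top_pearson.index) | set(top_spearman.index))
print(selected)
```

Sorting by absolute value matters here because a strongly negative correlation (such as the one for `ca`) is as informative as a strongly positive one.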

---

# Correlation matrix

Correlation matrix for Spearman

In [None]:
spearman_correlation_matrix = df.corr(method='spearman')

Correlation matrix for Pearson

In [None]:
pearson_correlation_matrix = df.corr(method='pearson')

In [None]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

# Spearman heatmap (monotonic relationships)
plt.figure(figsize=(10, 8))
sns.heatmap(spearman_correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Spearman Correlation Heatmap")
plt.show()

# Pearson heatmap (linear relationships); uses the Pearson matrix, not the Spearman one
plt.figure(figsize=(10, 8))
sns.heatmap(pearson_correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Pearson Correlation Heatmap")
plt.show()

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ppscore as pps

def heatmap_corr(df,threshold, figsize=(20,12), font_annot = 8):
  if len(df.columns) > 1:
    mask = np.zeros_like(df, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    mask[abs(df) < threshold] = True

    fig, axes = plt.subplots(figsize=figsize)
    sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                mask=mask, cmap='viridis', annot_kws={"size": font_annot}, ax=axes,
                linewidth=0.5
                     )
    axes.set_yticklabels(df.columns, rotation = 0)
    plt.ylim(len(df.columns),0)
    plt.show()


def heatmap_pps(df,threshold, figsize=(20,12), font_annot = 8):
    if len(df.columns) > 1:

      mask = np.zeros_like(df, dtype=bool)
      mask[abs(df) < threshold] = True

      fig, ax = plt.subplots(figsize=figsize)
      ax = sns.heatmap(df, annot=True, xticklabels=True,yticklabels=True,
                       mask=mask,cmap='rocket_r', annot_kws={"size": font_annot},
                       linewidth=0.05,linecolor='grey')
      
      plt.ylim(len(df.columns),0)
      plt.show()

def CalculateCorrAndPPS(df):
  df_corr_spearman = df.corr(method="spearman")
  df_corr_pearson = df.corr(method="pearson")

  pps_matrix_raw = pps.matrix(df)
  pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')

  pps_score_stats = pps_matrix_raw.query("ppscore < 1").filter(['ppscore']).describe().T
  print("PPS threshold - check PPS score IQR to decide threshold for heatmap \n")
  print(pps_score_stats.round(3))

  return df_corr_pearson, df_corr_spearman, pps_matrix


def DisplayCorrAndPPS(df_corr_pearson, df_corr_spearman, pps_matrix,CorrThreshold,PPS_Threshold,
                      figsize=(20,12), font_annot=8 ):

  print("\n")
  print("* Analyze how the target variable for your ML models is correlated with other variables (features and target)")
  print("* Analyze multicollinearity, that is, how the features are correlated among themselves")

  print("\n")
  print("*** Heatmap: Spearman Correlation ***")
  print("It evaluates monotonic relationship \n")
  heatmap_corr(df=df_corr_spearman, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

  print("\n")
  print("*** Heatmap: Pearson Correlation ***")
  print("It evaluates the linear relationship between two continuous variables \n")
  heatmap_corr(df=df_corr_pearson, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

  print("\n")
  print("*** Heatmap: Predictive power Score (PPS) ***")
  print(f"PPS detects linear or non-linear relationships between two columns.\n"
        f"The score ranges from 0 (no predictive power) to 1 (perfect predictive power) \n")
  heatmap_pps(df=pps_matrix,threshold=PPS_Threshold, figsize=figsize, font_annot=font_annot)

In [None]:
df_corr_pearson, df_corr_spearman, pps_matrix = CalculateCorrAndPPS(df)

In [None]:
DisplayCorrAndPPS(df_corr_pearson=df_corr_pearson,
                  df_corr_spearman=df_corr_spearman, 
                  pps_matrix=pps_matrix,
                  CorrThreshold=0.3, PPS_Threshold=0.15,
                  figsize=(5,5), font_annot=8)
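
The `CorrThreshold=0.3` argument above decides which cells of the correlation heatmaps stay visible. The sketch below isolates that masking logic so it can be inspected without plotting. It is an illustration only: `build_corr_mask` is a hypothetical helper mirroring the mask built inside `heatmap_corr`, and the 3x3 matrix is made-up.

```python
import numpy as np
import pandas as pd


def build_corr_mask(corr: pd.DataFrame, threshold: float) -> np.ndarray:
    """Return a boolean mask where True means 'hide this cell'."""
    mask = np.zeros(corr.shape, dtype=bool)
    # Hide the upper triangle and diagonal: they duplicate the lower triangle
    mask[np.triu_indices_from(mask)] = True
    # Hide weak correlations whose absolute value is below the threshold
    mask[np.abs(corr.to_numpy()) < threshold] = True
    return mask


# Made-up correlation matrix for demonstration
labels = ["a", "b", "c"]
corr = pd.DataFrame(
    [[1.0, 0.5, 0.1],
     [0.5, 1.0, -0.4],
     [0.1, -0.4, 1.0]],
    index=labels, columns=labels,
)

mask = build_corr_mask(corr, threshold=0.3)
# List the (row, column) pairs that would remain visible on the heatmap
visible = [(corr.index[i], corr.columns[j]) for i, j in zip(*np.where(~mask))]
print(visible)
```

With a threshold of 0.3, only the `b`/`a` pair (0.5) and the `c`/`b` pair (-0.4) remain visible; the `c`/`a` pair (0.1) is hidden as too weak.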

# Conclusions

From this notebook we learned:
* The dataset is completely numerical
* After correlation analysis with Pearson and Spearman, 5 important features were selected: **'ca', 'cp', 'exang', 'oldpeak', 'thalach'**
* After the PPS heatmap analysis, the developer will add another feature that could have important predictive power for the target: **'chol'**, with a predictive power score of 0.51.
* The next notebook will clean the dataset.

First business requirement: As a business client I can analyze the variables, so that I understand which features are more related to heart disease.
The first business requirement was answered by finding the features that are most related to the target.
A future notebook will clean and save the dataset.

---