1. Check the dimensions of the dataset (number of rows and columns).

In [None]:
import pandas as pd

# Load the heart dataset
heart_data = pd.read_csv("heart.csv")

# Check the dimensions of the dataset
dataset_dimensions = heart_data.shape

print("The dataset has {} rows and {} columns.".format(dataset_dimensions[0], dataset_dimensions[1]))


2. Peek at the first few rows to understand the data structure

In [None]:


# Adjust pandas display settings
pd.set_option('display.max_columns', 12)  # Show up to 12 columns before pandas truncates the output
pd.set_option('display.width', 1000)  # Widen the output so rows are not wrapped


# Display the first few rows of the dataset
print(heart_data.head())
# Display summary statistics for the numerical columns
print(heart_data.describe())

3. Examine the types of data present in each column (numerical, categorical, datetime,...). Verify that the data types assigned to each column align with the actual nature of the data. Convert data types if necessary.

In [None]:


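from IPython.display import display  # Built in to notebooks; imported so the cell also runs elsewhere

# Function to display data types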
def display_dtypes(dataframe):
    # Create a DataFrame from the dtypes
    dtypes_df = pd.DataFrame(dataframe.dtypes, columns=['Data Type'])
    # Reset index to get the column names as a separate column
    dtypes_df.reset_index(inplace=True)
    # Rename columns 
    dtypes_df.columns = ['Column Name', 'Data Type']
    # Display the DataFrame
    display(dtypes_df)  

# Display original data types
print("Original Data Types:")
display_dtypes(heart_data)

# Convert categorical variables to 'category' data type
categorical_columns = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
heart_data[categorical_columns] = heart_data[categorical_columns].astype('category')

# Display data types after conversion for comparison
print("\nData Types after Conversion:")
display_dtypes(heart_data)

4. Filter the data by removing variables that are not relevant for the analysis.

Given the context of predicting heart failure, all 12 variables have relevant associations with cardiovascular health and could contribute valuable insights when building a predictive model, so we keep them all.

5. Univariate analysis: Summary statistics for numerical variables and frequency distribution for categorical ones. Create visualizations (histograms, box plots, bar plots, etc).

In [None]:
# Import necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns


# Set the style of the plots
sns.set_style("whitegrid")

# Summary statistics for numerical variables
print("Summary Statistics for Numerical Variables:")
print(heart_data.describe())

# Histograms for numerical variables
numerical_columns = ['Age', 'RestingBP', 'Cholesterol', 'MaxHR', 'Oldpeak']
for col in numerical_columns:
    plt.figure(figsize=(8, 4))
    sns.histplot(heart_data[col], kde=True, bins=30)
    plt.title(f'Histogram of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.show()

# Frequency distribution and bar plots for categorical variables
categorical_columns = ['Sex', 'ChestPainType', 'FastingBS', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
for col in categorical_columns:
    print(f"Frequency distribution for {col}:")
    print(heart_data[col].value_counts())
    
    plt.figure(figsize=(8, 4))
    sns.countplot(x=col, data=heart_data)
    plt.title(f'Bar Plot of {col}')
    plt.ylabel('Count')
    plt.show()

6. Identify outliers and decide whether to remove, transform, or keep them.


RestingBP: 

- Normal blood pressure for adults ranges from 90/60 mm Hg to 120/80 mm Hg. High blood pressure is a critical risk factor for heart disease, so the high values are valuable for our prediction and we have kept them in our analysis.
- We have decided to remove the 0 values. They do not contribute to a meaningful analysis, because RestingBP = 0 is not possible.

Cholesterol: 
- High cholesterol levels are clinically relevant and can indicate a significant risk of cardiovascular disease, so we have kept the high values.
- The instances where cholesterol is recorded as 0 are clearly errors or missing data, as it is physiologically implausible for someone to have no cholesterol. We have therefore removed them.

MaxHR: 
- MaxHR can vary widely among individuals, especially considering factors like age, fitness level, and the presence of cardiovascular conditions. Given this variability and its clinical significance, we have decided to keep the two low outliers in MaxHR. 


Oldpeak: 
- "Oldpeak" refers to the ST depression induced by exercise relative to rest, measured in millimeters (mm) on an electrocardiogram (ECG). ST depression is an ECG finding that can indicate myocardial ischemia, a condition where part of the heart does not receive enough oxygen, often due to blockages in the coronary arteries. In the context of a stress test, an increase in the Oldpeak value (more negative or more pronounced depression) can suggest a higher likelihood of coronary artery disease.
- ST Elevation: An ST amplitude of ≥0.1 mV, ≥0.15 mV, and ≥0.2 mV is considered significant for ST elevation. ST elevation can indicate acute myocardial infarction (heart attack) and other conditions that lead to an elevated risk of heart disease. High positive Oldpeak values (e.g., 4.0, 5.6, 6.2) are well above the ≥0.2 mV elevation significance threshold, indicating substantial ST segment deviations. These are critical for identifying potential heart disease risk and should be retained for their clinical significance.
- ST Depression: Thresholds of ≤–0.05 mV or ≤–0.1 mV are used to denote clinically significant ST depression, which is indicative of myocardial ischemia. Assuming the negative sign indicates depression below the baseline, a negative Oldpeak value is therefore clinically relevant and should be included in analyses concerning heart disease risk.


In [None]:
# List of numerical variables to check for outliers
numerical_variables = ['Age', 'RestingBP', 'Cholesterol', 'MaxHR', 'Oldpeak']

# Function that finds the outliers based on the IQR method
# The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the data
# Outliers are typically considered data points that fall below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR)

def IQR_method(dataframe, column):
    Q1 = dataframe[column].quantile(0.25)
    Q3 = dataframe[column].quantile(0.75)

    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Return only the rows whose value lies outside the bounds
    return dataframe[(dataframe[column] < lower_bound) | (dataframe[column] > upper_bound)]

# Identify outliers in each numerical variable (Age, RestingBP, Cholesterol, MaxHR and Oldpeak)

def display_outliers(dataframe, variables):
    for variable in variables:
        outliers_df = IQR_method(dataframe, variable)
        print(f"Number of outliers in {variable}: {len(outliers_df)}")
        if len(outliers_df) > 0:
            display(outliers_df[[variable]])

print("Outliers before removing 0 values:")
display_outliers(heart_data, numerical_variables)

# Remove 0 values from RestingBP and Cholesterol
heart_data_cleaned = heart_data[(heart_data['RestingBP'] > 0) & (heart_data['Cholesterol'] > 0)]

print("\nOutliers after removing 0 values for RestingBP and Cholesterol:")
display_outliers(heart_data_cleaned, numerical_variables)
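
To make the IQR bounds concrete, here is a small worked example. The values are made up for illustration and are not taken from heart.csv; the variable names (`toy`, `toy_outliers`) are likewise only for this sketch.

```python
import pandas as pd

# Toy values, invented for illustration (not from heart.csv)
toy = pd.DataFrame({"MaxHR": [60, 120, 125, 130, 135, 140, 145, 150, 202]})

# Quartiles and interquartile range
q1 = toy["MaxHR"].quantile(0.25)
q3 = toy["MaxHR"].quantile(0.75)
iqr = q3 - q1

# Anything outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] is flagged as an outlier
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
toy_outliers = toy[(toy["MaxHR"] < lower_bound) | (toy["MaxHR"] > upper_bound)]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Bounds: [{lower_bound}, {upper_bound}]")
print(toy_outliers)
```

With these toy values the bounds are 95 and 175, so only 60 and 202 are flagged. Being flagged does not mean a value is wrong; as discussed above, the decision to keep or drop it is a clinical one.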


7. Identify variables that provide little to no information and remove them.

Should we remove this step, since we already did it in step 4?

8. Bivariate analysis: Examine correlations and dependencies between variables to identify potential relationships. Create visualizations to explore these relationships. A particularly significant case involves assessing the relationship of each variable with the class/target variable you want to predict.

In [None]:
# TODO: comparison with non-numeric values
 
import matplotlib.pyplot as plt
import pandas as pd

# Define variables of interest, they will be compared to the target variable HeartDisease
variables = ['Age', 'RestingBP', 'Cholesterol', 'MaxHR', 'Oldpeak']

# Calculate correlations and print them
for variable in variables:
    correlation = heart_data[[variable, 'HeartDisease']].corr()
    print(correlation)

# Calculate correlations
correlation_data = []
for variable in variables:
    correlation = heart_data[[variable, 'HeartDisease']].corr().iloc[0, 1]
    correlation_data.append(correlation)

# Create DataFrame for plotting
correlation_df = pd.DataFrame({
    'Variable': variables,
    'Correlation': correlation_data
})

# Plot
plt.figure(figsize=(10, 6))
plt.bar(correlation_df['Variable'], correlation_df['Correlation'], color='skyblue')
plt.title('Correlation with Heart Disease')
plt.xlabel('Variable')
plt.ylabel('Correlation Coefficient')
plt.ylim(-1, 1)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
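
For the non-numeric variables, one possible approach is to compare the share of heart disease cases within each category. The sketch below uses a made-up toy DataFrame rather than heart.csv, so the numbers mean nothing clinically; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy data, invented for illustration (not from heart.csv)
toy = pd.DataFrame({
    "Sex": ["M", "M", "F", "F", "M", "F"],
    "ExerciseAngina": ["Y", "N", "N", "N", "Y", "Y"],
    "HeartDisease": [1, 0, 0, 0, 1, 1],
})

# Because HeartDisease is 0/1, its mean per category is the share of positive cases
rates = {}
for col in ["Sex", "ExerciseAngina"]:
    rates[col] = toy.groupby(col)["HeartDisease"].mean()
    print(f"Share of HeartDisease = 1 per {col}:")
    print(rates[col])

# The same information as a row-normalised contingency table
shares = pd.crosstab(toy["ExerciseAngina"], toy["HeartDisease"], normalize="index")
print(shares)
```

On the real data, the same `groupby` could be run over the categorical columns from step 3, and the resulting shares plotted as bar charts next to the correlation plot above.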


9. Create new features from existing ones if necessary (e.g., extracting date components, ratios between variables, etc.).

In [None]:
# Create an age group column with the bins (20, 40], (40, 60] and (60, 100]
binned = pd.cut(heart_data['Age'], bins=[20, 40, 60, 100], labels=['Young Adult', 'Middle-aged', 'Elderly'])

# Add the binned values as a new categorical feature
heart_data["Age_group"] = binned
print(heart_data[['Age', 'Age_group']].head())
heart_data.groupby('Age_group', observed=False).size()



10. Divide the dataset into training and test sets.

In [None]:
# The model is trained on the training set and evaluated on the test set.

# Shuffle the dataframe using the sample function
# (random_state is fixed so the split is the same on every run)
shuffled_data = heart_data.sample(frac=1, random_state=42)
shuffled_data.head()

# Select ratio - how much of the data will be used for training and how much for testing
ratio = 0.75

# total number of rows in the shuffled dataset
total_rows = shuffled_data.shape[0]
# the number of rows to include in the training set
train_size = int(total_rows*ratio)
 
# Split data into test and train
train = shuffled_data[0:train_size]
test = shuffled_data[train_size:]

# print train set
print("Train dataframe")
print(train)
 
# print test set
print("Test dataframe")
print(test)
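
As a variation on the split above, the sketch below samples 75% of each class separately, so that the training and test sets keep the same proportion of heart disease cases. Stratifying is an addition that the steps above do not do, and the toy data is invented for illustration.

```python
import pandas as pd

# Toy data, invented for illustration: 12 rows per class
toy = pd.DataFrame({
    "Age": range(40, 64),
    "HeartDisease": [0, 1] * 12,
})

# Draw 75% of the rows within each class; random_state makes it reproducible
train = toy.groupby("HeartDisease").sample(frac=0.75, random_state=42)

# Everything that was not drawn for training goes to the test set
test = toy.drop(train.index)

print(train["HeartDisease"].value_counts())
print(test["HeartDisease"].value_counts())
```

Dropping the training rows by index guarantees that no row ends up in both sets.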

11. Scale numerical features to a similar range in order to improve model performance on some models.

In [None]:
# Select the numerical columns to scale (copy so heart_data is left unchanged)
df = heart_data[['Age', 'RestingBP', 'Cholesterol', 'MaxHR', 'Oldpeak']].copy()

# Min-max scaling
df_scaled = (df - df.min()) / (df.max() - df.min())

print(df_scaled.head())
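
The cell above scales the whole dataset at once. A common alternative, sketched here on made-up toy frames, is to compute the minimum and range on the training set only and reuse them for the test set, so that no information from the test set leaks into the scaling. The frames and names below are assumptions for illustration.

```python
import pandas as pd

# Toy train/test frames, invented for illustration
train = pd.DataFrame({"Age": [30, 40, 50], "MaxHR": [100, 150, 200]})
test = pd.DataFrame({"Age": [35, 60], "MaxHR": [120, 210]})

# Learn the scaling parameters from the training set only
train_min = train.min()
train_range = train.max() - train.min()

# Apply the same parameters to both sets
train_scaled = (train - train_min) / train_range
test_scaled = (test - train_min) / train_range

print(train_scaled)
print(test_scaled)
```

Note that test values can fall outside [0, 1] when they lie outside the range seen in training, as the second test row does here.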

12. Encode categorical variables.

In [None]:
# Define categorical columns
categorical_columns = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']

# Create new columns with encoded values
for column in categorical_columns:
    new_column_name = f"{column}_encoded"
    heart_data[new_column_name] = heart_data[column].cat.codes

# Print DataFrame to verify the new columns
print(heart_data[['Sex_encoded', 'ChestPainType_encoded', 'RestingECG_encoded', 'ExerciseAngina_encoded', 'ST_Slope_encoded']].head())
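
The integer codes above impose an order on the categories (for example, code 2 > code 1), which some models may interpret as meaningful. For variables without a natural order, one-hot encoding is a possible alternative. The sketch below uses `pd.get_dummies` on a toy column; the category labels are illustrative.

```python
import pandas as pd

# Toy column, invented for illustration
toy = pd.DataFrame({"ChestPainType": ["ATA", "NAP", "ASY", "TA", "ASY"]})

# One 0/1 indicator column per category
one_hot = pd.get_dummies(toy, columns=["ChestPainType"], dtype=int)

print(one_hot)
```

Each row has exactly one indicator set to 1, so no ordering between the categories is implied.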


13. Identify missing data and decide how to handle them (remove, impute).

In [None]:
# Identify missing data

# Column names with the number of non-null values in each column
# num_missing = heart_data.info()
# print(num_missing)

# summarize total number of missing values per column
num_missing1 = heart_data.isnull().sum()
print(num_missing1)
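
`isnull()` only finds values that are stored as missing. The 0 values in RestingBP and Cholesterol discussed in step 6 are stored as ordinary numbers, so they are not counted. The sketch below, on made-up toy data, shows one way to mark such zeros as missing and, as an alternative to removing the rows, impute them with the column median. Imputation is an option the steps above do not use.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration (not from heart.csv)
toy = pd.DataFrame({"RestingBP": [120, 0, 140], "Cholesterol": [200, 0, 0]})

# Only these columns are affected; a 0 is a valid value in other columns
zero_means_missing = ["RestingBP", "Cholesterol"]
as_missing = toy.copy()
as_missing[zero_means_missing] = toy[zero_means_missing].replace(0, np.nan)

# The zeros now show up in the missing-value count
print(as_missing.isnull().sum())

# Fill each missing value with the median of its column
imputed = as_missing.fillna(as_missing.median())
print(imputed)
```

Whether to remove or impute depends on how many rows are affected; imputing keeps the rows but adds values that were never measured.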