# Predicting Heart Disease Using a Support Vector Classifier

## 1. Introduction:

### Background:

The purpose

### Objective:

### Datasets:

### Tech Stack:

The following tools and libraries are used in this project:
- Python
- NumPy
- Pandas
- Matplotlib
- Seaborn
- SciPy
- Statsmodels
- scikit-learn

## 2. Setup and Imports:

### Library Imports:

In [1]:
# Standard library imports
from statistics import mean

# Third-party imports for data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scipy and Statsmodels imports for statistical analysis
from scipy.stats import pointbiserialr, chi2_contingency
import statsmodels.formula.api as smf

# Scikit-learn imports for machine learning models, metrics, and preprocessing
from sklearn.model_selection import (GridSearchCV, train_test_split, StratifiedKFold,
                                     cross_val_score, StratifiedShuffleSplit, cross_validate)
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import (accuracy_score, recall_score, precision_score, f1_score, 
                             confusion_matrix, classification_report, roc_curve, auc)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.decomposition import PCA

# IPython for HTML display (IPython.display is the public import path)
from IPython.display import HTML

### CSS Styling:

In [12]:
# Import custom CSS for styling; the context manager closes the file after reading
with open('style.css') as f:
    css = f.read()
HTML('<style>{}</style>'.format(css))

## 3. Data Processing & Exploration - Kaggle Dataset

### 3.1 Data Processing

#### Data Loading

1. Load the dataset from 'kaggle-heart.csv' into a pandas DataFrame, handling missing values.
2. Preview the first 2 rows to ensure the data has been loaded correctly.

In [3]:
# Load the kaggle-heart.csv dataset into a DataFrame called "df_kaggle"
# We treat " ", "?", and "NA" as missing values and replace them with NaN
df_kaggle = pd.read_csv('kaggle-heart.csv', na_values=[" ","?","NA"])

# Display the first two rows of the dataset for inspection
df_kaggle.head(2)

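| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |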
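
To illustrate what the `na_values` argument does during loading, here is a minimal sketch on a small in-memory CSV. The toy data and the names `csv_text` and `df_toy` are assumptions made for illustration; they are not part of `kaggle-heart.csv`.

```python
import io

import pandas as pd

# Hypothetical three-row CSV containing each of the placeholder tokens
# that the loading cell treats as missing: "?", "NA", and a single space.
csv_text = "age,chol,thal\n52,212,3\n53,?,NA\n70, ,2\n"

# Same na_values list as in the loading cell above
df_toy = pd.read_csv(io.StringIO(csv_text), na_values=[" ", "?", "NA"])

# Count the missing values per column after the placeholders became NaN
null_counts = df_toy.isnull().sum()
print(null_counts)
```

Because the placeholders are converted at load time, the affected columns stay numeric (as floats) rather than being read as strings.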


#### Data Dictionary

1. Extract column names and data types from the dataset.
2. Add descriptions for each column based on the dataset documentation.
3. Calculate the min and max values for each numerical column.
4. Combine all information into a single DataFrame.

In [4]:
# Creating a Data Dictionary for the dataset:
# We will collect the following:
# - Field names (column names)
# - Data types
# - Descriptions (based on Kaggle's dataset page)
# - Max and Min values for numerical columns

# Column names and data types
kaggle_field_list = df_kaggle.columns.tolist()  # List of column names
kaggle_dtype_list = df_kaggle.dtypes.astype(str).tolist()  # List of column data types as strings

# Description of each field based on Kaggle's dataset page
kaggle_description_list = [
    "age",
    "sex",
    "chest pain type (4 values)",
    "resting blood pressure",
    "serum cholestoral in mg/dl",
    "fasting blood sugar > 120 mg/dl",
    "resting electrocardiographic results (values 0,1,2)",
    "maximum heart rate achieved",
    "exercise induced angina",
    "oldpeak = ST depression induced by exercise relative to rest",
    "the slope of the peak exercise ST segment",
    "number of major vessels (0-3) colored by flourosopy",
    "thal: 0 = normal; 1 = fixed defect; 2 = reversable defect",
    "presence of heart disease. 0 = no disease and 1 = disease."
]

# Max and min values for each column
kaggle_max_list = df_kaggle.max().to_list()  # List of max values for each column
kaggle_min_list = df_kaggle.min().to_list()  # List of min values for each column

# Combine all lists into one DataFrame for easier reference
kaggle_concat_list = [
    kaggle_field_list,
    kaggle_dtype_list,
    kaggle_description_list,
    kaggle_min_list,
    kaggle_max_list
]

# Convert each list to a Series and concatenate column-wise (pd.concat already returns a DataFrame)
df_kaggle_data_dictionary = pd.concat([pd.Series(x) for x in kaggle_concat_list], axis=1)

# Set column names for the new DataFrame
df_kaggle_data_dictionary.columns = ["FieldName", "DataType", "Description", "Min", "Max"]

# Display the data dictionary rounded to 1 decimal point for readability
df_kaggle_data_dictionary.round(1)

| | FieldName | DataType | Description | Min | Max |
|---|---|---|---|---|---|
| 0 | age | int64 | age | 29.0 | 77.0 |
| 1 | sex | int64 | sex | 0.0 | 1.0 |
| 2 | cp | int64 | chest pain type (4 values) | 0.0 | 3.0 |
| 3 | trestbps | int64 | resting blood pressure | 94.0 | 200.0 |
| 4 | chol | int64 | serum cholestoral in mg/dl | 126.0 | 564.0 |
| 5 | fbs | int64 | fasting blood sugar > 120 mg/dl | 0.0 | 1.0 |
| 6 | restecg | int64 | resting electrocardiographic results (values 0... | 0.0 | 2.0 |
| 7 | thalach | int64 | maximum heart rate achieved | 71.0 | 202.0 |
| 8 | exang | int64 | exercise induced angina | 0.0 | 1.0 |
| 9 | oldpeak | float64 | oldpeak = ST depression induced by exercise re... | 0.0 | 6.2 |
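
The four steps above can also be sketched with a single `DataFrame` constructor. This is one possible alternative, not the notebook's method: the toy frame `df_toy` and the `descriptions` mapping are hypothetical stand-ins. Keying descriptions by column name means a missing description raises a `KeyError`, whereas concatenating Series of unequal length would silently fill the gap with `NaN`.

```python
import pandas as pd

# Hypothetical stand-in for df_kaggle with only two columns
df_toy = pd.DataFrame({"age": [52, 53, 70], "oldpeak": [1.0, 3.1, 2.6]})

# Hypothetical descriptions keyed by column name, so order cannot drift
descriptions = {
    "age": "age",
    "oldpeak": "ST depression induced by exercise relative to rest",
}

# Build the data dictionary in one step; each value has one entry per column
data_dictionary = pd.DataFrame({
    "FieldName": df_toy.columns.tolist(),
    "DataType": df_toy.dtypes.astype(str).tolist(),
    "Description": [descriptions[col] for col in df_toy.columns],
    "Min": df_toy.min().tolist(),
    "Max": df_toy.max().tolist(),
})

print(data_dictionary.round(1))
```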


#### Summary Statistics

Generate summary statistics for the numerical columns in the dataset, rounding the results to two decimal places.

In [9]:
# Generate summary statistics for numerical columns and round the results to two decimal places
df_kaggle.describe().round(2)

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 |
| mean | 54.43 | 0.7 | 0.94 | 131.61 | 246.0 | 0.15 | 0.53 | 149.11 | 0.34 | 1.07 | 1.39 | 0.75 | 2.32 | 0.51 |
| std | 9.07 | 0.46 | 1.03 | 17.52 | 51.59 | 0.36 | 0.53 | 23.01 | 0.47 | 1.18 | 0.62 | 1.03 | 0.62 | 0.5 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 48.0 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 132.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 56.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 152.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 275.0 | 0.0 | 1.0 | 166.0 | 1.0 | 1.8 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |


**Notes:**
- The dataset contains 1,025 rows, but we expected 303.
- This discrepancy may indicate data issues, such as extra rows or duplicate entries, that need to be investigated and cleaned.

#### Count Null Values per Column

This step counts the missing (null) values in each column to assess the completeness of the dataset and guide decisions on handling missing data.

In [10]:
# Count the number of missing (null) values per column in the dataset
df_kaggle.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

#### Count Duplicated Rows

Count the duplicated rows to see how much of the dataset consists of repeated records.

In [11]:
# Calculate the number of duplicated rows in the dataset
df_kaggle.duplicated().sum()

723

#### Count Unique Rows

Identify and count the unique rows in the dataset.

In [13]:
# Get the unique rows in the DataFrame (removing duplicates)
# Note: np.unique returns a sorted NumPy array, not a DataFrame
unique_rows = np.unique(df_kaggle, axis=0)

# Display the shape of the unique rows (number of unique records)
unique_rows.shape

(302, 14)
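
The two counts above are consistent with each other: 1025 total rows minus 723 duplicated rows leaves 302 unique rows. As a minimal sketch of the same check in pandas alone, the example below uses a hypothetical toy frame (`df_toy` is an assumption, not the Kaggle data) and `drop_duplicates`, which keeps the result as a DataFrame with its column names.

```python
import pandas as pd

# Hypothetical toy frame: the row (52, 0) appears three times
df_toy = pd.DataFrame({
    "age": [52, 53, 52, 52, 70],
    "target": [0, 0, 0, 0, 1],
})

n_rows = len(df_toy)

# duplicated() flags every repeat after the first occurrence of a row
n_duplicated = int(df_toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row
df_unique = df_toy.drop_duplicates()
n_unique = len(df_unique)

# Total rows minus flagged repeats equals the number of distinct rows
print(n_rows, n_duplicated, n_unique)
```

Applied to `df_kaggle`, the same pattern would give a de-duplicated DataFrame that can be compared against the expected row count.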

## 4. Data Processing & Exploration - UCI Dataset

### 4.1 Data Processing

#### Data Loading

#### Data Dictionary

#### Summary Statistics

### 4.2 Data Exploration