# **Data Exploration: Chronic Kidney Disease**

### Welcome to the section on Support Vector Machines (SVMs)! SVMs are one of the most versatile machine learning models and have wide utility across both regression and classification tasks. They are a true staple in any data scientist's toolkit and, to this day, are still used in very advanced applications.

### We will be making an SVM today for a classification task. Make sure you still remember the high-level difference between classification and regression, as this is a guiding light when thinking about how to solve problems.

### - `Classification`- whether or not something belongs to a certain class (car, truck, plane)

### - `Regression`- prediction of a continuous variable (heart rate given certain parameters, post operative systolic blood pressure)

### Here, we will be doing `binary classification`, and attempt to predict whether or not individuals will develop kidney disease based on various clinical observations. The data is open source and can be found here: https://www.kaggle.com/datasets/mahmoudlimam/preprocessed-chronic-kidney-disease-dataset.

### We won't go too in depth with describing the data here, but one thing of note is that this dataset is already heavily preprocessed, so a lot of work has been saved for us. Missing values were imputed using KNN (see the KNN notebook in this section for more information on this!) and the categorical variables are already one hot encoded. You can read more about the other preprocessing done here: https://www.kaggle.com/code/mahmoudlimam/chronic-kidney-disease-clustering-and-prediction/notebook.
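
### As a quick refresher on what "one hot encoded" means, here is a minimal sketch using pandas. The `appetite` column and its values are hypothetical stand-ins for illustration and are not taken from the original raw data.

```python
import pandas as pd

# Hypothetical raw categorical column (not the dataset's actual raw column).
raw = pd.DataFrame({"appetite": ["good", "poor", "good", "poor"]})

# One hot encoding gives each category its own 0/1 indicator column.
encoded = pd.get_dummies(raw, columns=["appetite"], dtype=int)

print(encoded)
```

### Each row now has a 1 in the column for its category and a 0 elsewhere, which is the form the categorical variables in our dataset already take.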

### As the point of this specific section is to delve into SVMs, we felt OK using this dataset, but keep in mind that real-world data is *never* this clean coming in and will always require you to do some form of preprocessing on it. Throughout all of Code Grand Rounds we show multiple different approaches to this preprocessing, so it will be lighter in this section, but remember never to skip this step! Alright, without further ado, let's dive in!

In [None]:
# Install and import all the necessary packages

%pip install seaborn

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np


### Import the data and take a look at it

In [None]:
# Import the dataset
df = pd.read_csv('CKD_Preprocessed.csv')

# Display the first few rows of the dataset
df.head()

### Get some basic summary statistics about the data. We will leave you to interpret these on your own.

In [None]:
# Get basic summary statistics for numerical columns
df.describe()

### Get some info about data types and probe number of missing (null) values

In [None]:

# Get info about data types and missing values
df.info()
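
### `df.info()` reports non-null counts, so missing values have to be inferred by comparing against the total row count. A more direct check is sketched below on a small stand-in frame (the column names and values are made up for illustration); in the notebook you would run the same calls on `df`.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with hypothetical columns and a few gaps.
demo = pd.DataFrame({
    "age": [48.0, np.nan, 62.0],
    "albumin": [1.0, 4.0, np.nan],
    "target": [1, 0, 1],
})

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# A single yes/no answer for the whole frame.
has_missing = bool(demo.isnull().values.any())
print("Any missing values:", has_missing)
```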

## **Initial Visualizations**

### Make histogram of all the data to see the distribution. Our data is all numerical (as seen by df.info) so a single histogram should suffice. For the binary data (1s or 0s) we want to see if there is a relatively even distribution of both 1s and 0s, particularly for our target (CKD status)

In [None]:
# Select only the columns with datatypes 'float64' and 'int64' from the DataFrame 'df'.
# For these selected columns, create histograms.
# figsize=(20,15) sets the width and height of the entire figure in inches.
# bins=20 means each individual histogram will have 20 bins.
df.select_dtypes(include=['float64', 'int64']).hist(figsize=(20,15), bins=20)

# Adjust the padding between and around the subplots (histograms) for a cleaner look.
plt.tight_layout()

# Display the figure containing the histograms.
plt.show()


### We will leave the interpretation of a majority of these data up to you, but we want to briefly discuss our target variable 'Chronic Kidney Disease: Yes'.

### We see from the histogram in the bottom right that there are 250 people with kidney disease and 150 without. In this case the majority class is people with kidney disease, and this is still about a 5:3 class imbalance. As discussed in the logistic regression material, imbalanced classes can be a big problem in machine learning (particularly in classification tasks) as they can accidentally bias the model towards the majority class. This can give you a misleadingly high accuracy when your model is actually just predicting the majority class most of the time and achieving what seems like a good result. This is why it is imperative to always check other metrics like precision, recall, AUROC, etc., and to think critically about the distribution of your data. We need to keep this potential imbalance in mind going forward!
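
### To make the risk concrete, here is a small sketch of what a "model" that always predicts the majority class would score. The labels are a stand-in built from the 250/150 counts read off the histogram, not loaded from the dataset itself.

```python
import numpy as np

# Stand-in labels matching the histogram counts:
# 250 with kidney disease (1) and 150 without (0).
y_true = np.array([1] * 250 + [0] * 150)

# A trivial "model" that always predicts the majority class.
y_pred = np.ones_like(y_true)

# Fraction of predictions that match the true label.
accuracy = np.mean(y_pred == y_true)

# Recall for the minority class (0): of the true 0s, how many were predicted 0?
minority_recall = np.mean(y_pred[y_true == 0] == 0)

print(f"Baseline accuracy: {accuracy:.3f}")
print(f"Minority class recall: {minority_recall:.3f}")
```

### The baseline reaches 62.5% accuracy without learning anything, while never identifying a single person without kidney disease. Any real model should be judged against this baseline, not against 50%.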

## More data visualization

### While this might not be strictly necessary (although it is always a good idea...), let's also make a correlation matrix to see if any of our features are highly correlated. This is the same code from the data exploration notebook in module 2!

In [None]:
corr_matrix = df.corr() # This line computes the correlation matrix of the DataFrame.
                 #  It calculates the Pearson correlation coefficient for each pair of numerical columns. 
                 # Post cleaning, all of our columns have some kind of numerical representation.

print (corr_matrix)

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr_matrix, dtype=bool)) #  Here, create a mask for the upper triangle of your correlation matrix.
                                               # This is done because the matrix is symmetric, i.e., the lower triangle is a mirror
                                               # image of the upper triangle. Thus, showing both would be redundant.
                                               # You don't technically need to do this, but it's a nice trick...

# Set up the matplotlib figure
fig, ax = plt.subplots(figsize=(18, 15))

# Generate a colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask
# Look at the seaborn documentation for details on all of the arguments.
sns.heatmap(corr_matrix, mask=mask, cmap=cmap, vmax=.3, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)

plt.title('Correlation Matrix Heatmap')
plt.show()
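
### If the masking step feels opaque, the sketch below applies the same `np.triu` call to a tiny made-up symmetric matrix so you can see exactly which cells get hidden. The values are illustrative only.

```python
import numpy as np

# Tiny symmetric stand-in for a correlation matrix.
corr = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])

# True marks cells to hide: the diagonal and everything above it.
mask = np.triu(np.ones_like(corr, dtype=bool))
print(mask)

# The cells that stay visible are the unique pairs below the diagonal.
visible = corr[~mask]
print(visible)
```

### Note that the diagonal is masked too, which is fine since every variable has a correlation of 1 with itself.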


### While we will not do a detailed assessment of this data, right away we can see that there are many features that seem to be highly correlated with our target variable. This suggests that we may be able to achieve good predictive power with this dataset. It is also worth noting that features which are highly correlated with one another (multicollinearity) can sometimes lead to overfitting. But overall, this looks good!
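
### Reading a large heatmap by eye can be tedious, so here is a hedged sketch of pulling the same information out of the correlation matrix programmatically. It uses synthetic data with hypothetical column names (`feature_a`, `feature_b`, `feature_c`, `target`), and the 0.9 cutoff is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; column names here are hypothetical.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
demo = pd.DataFrame({
    "feature_a": x1,
    "feature_b": x1 + rng.normal(scale=0.1, size=n),  # near-duplicate of feature_a
    "feature_c": rng.normal(size=n),                  # unrelated noise
})
demo["target"] = (x1 > 0).astype(int)

corr = demo.corr()

# Rank features by absolute correlation with the target.
target_corr = corr["target"].drop("target").abs().sort_values(ascending=False)
print(target_corr)

# Flag feature pairs that are strongly correlated with each other.
features = corr.drop(index="target", columns="target")
threshold = 0.9  # arbitrary cutoff chosen for illustration
pairs = [
    (a, b, round(features.loc[a, b], 3))
    for i, a in enumerate(features.columns)
    for b in features.columns[i + 1:]
    if abs(features.loc[a, b]) > threshold
]
print(pairs)
```

### On the real data you would replace `demo` with `df` and `"target"` with the name of the CKD status column.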

## **Moving on**

### That is really all we are going to do here. Because this dataset was pretty heavily preprocessed for us (as you know by now, this will definitely not always be the case...), we just wanted to take an initial look to 'get to know' the data before we run off and try to build models for it. Knowing the distribution is important because certain models require certain distributions to work properly, and it is always nice to know if things correlate and kind of 'make sense' within your dataset. Let's go make a model!