# CKD Risk Factors Prediction: Exploratory Data Analysis

This repository contains the Exploratory Data Analysis (EDA) for the Chronic Kidney Disease (CKD) dataset. The goal is to predict CKD risk factors using Machine Learning techniques.


## Introduction

The EDA involves data importing, data overview, data preprocessing, descriptive statistics, and data visualization.

### Data Importing

We started by importing the necessary libraries and loading the CKD dataset using the pandas read_csv function.

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import io

# Load the data
df = pd.read_csv('ckd-dataset-v2 (2).csv')

## Data Overview

We performed an initial exploration of the dataset by printing the first few rows and checking basic information such as the number of entries, the data type of each column, and the presence of missing values.

In [None]:
# Check the first few rows of the raw data
print(df.head())

# Print basic information about the dataset (df.info() prints directly and returns None)
df.info()

## Data Preprocessing for EDA

We found that two columns ('sg' and 'grf') contained a mixture of numeric ranges (such as 'a - b'), 'discrete' markers, and values prefixed with '≥'. We wrote a function that maps each of these cases to a single number (the midpoint of a range, the bound of a '≥' value, or NaN when no number can be recovered) and applied it to 'sg' and 'grf' to create the new columns 'avg_sg' and 'avg_grf'. We then dropped the original 'sg' and 'grf' columns and converted the 'class' column to binary format.

In [None]:
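# Define a function to process 'sg' and 'grf' columns
def process_column(col):
    # Numeric cells (including NaN) pass through as floats
    if isinstance(col, (int, float)):
        return float(col)
    if 'discrete' in col:
        return np.nan  # marker row, not a measurement
    if ' - ' in col:
        # Range such as 'a - b': use the midpoint
        return np.mean(list(map(float, col.split(' - '))))
    if '≥' in col:
        return float(col.replace('≥', ''))  # keep the bound itself
    try:
        return float(col)
    except ValueError:
        return np.nan  # anything else that cannot be parsed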

# Apply the function to 'sg' and 'grf' columns
df['avg_sg'] = df['sg'].apply(process_column)
df['avg_grf'] = df['grf'].apply(process_column)

# Drop the original 'sg' and 'grf' columns
df.drop(['sg', 'grf'], axis=1, inplace=True)

# Convert 'class' column to binary format
df['class'] = df['class'].map({'ckd': 1, 'notckd': 0})
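
The parsing rules described above can be checked on a few hand-written values. This is a minimal sketch, not the notebook's own code: `parse_range_value` is a hypothetical stand-in for `process_column`, and the sample strings are made up to match the formats described, not taken from the dataset.

```python
import numpy as np

def parse_range_value(value):
    """Hypothetical helper: map a raw cell to a float, or NaN if unparseable."""
    if isinstance(value, (int, float)):
        return float(value)              # already numeric (NaN stays NaN)
    text = str(value).strip()
    if "discrete" in text:
        return np.nan                    # marker row, not a measurement
    if " - " in text:
        low, high = (float(part) for part in text.split(" - "))
        return (low + high) / 2          # midpoint of the range
    if text.startswith("≥"):
        return float(text[1:])           # keep the bound itself
    try:
        return float(text)
    except ValueError:
        return np.nan

# Made-up sample values covering each case
samples = ["1.019 - 1.021", "≥ 1.023", "discrete", 1.5, np.nan, "abc"]
parsed = [parse_range_value(s) for s in samples]
print(parsed)
```

Ranges collapse to their midpoint, '≥' values keep their bound, and anything that cannot be read as a number becomes NaN so that it can be imputed later.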

## Descriptive Statistics

We used the describe function to obtain descriptive statistics for the numeric columns in the dataset.


In [None]:
# Use the describe function
df.describe()

## Data Visualization

We plotted histograms of the derived 'avg_sg' and 'avg_grf' columns, then visualized the distribution of CKD and non-CKD patients using a bar plot. The bar plot shows how balanced the target classes are in our dataset.

In [None]:
# Plot histograms for 'avg_sg' and 'avg_grf' columns
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['avg_sg'].dropna(), bins=30, kde=True)
plt.title('avg_sg Distribution')

plt.subplot(1, 2, 2)
sns.histplot(df['avg_grf'].dropna(), bins=30, kde=True)
plt.title('avg_grf Distribution')

plt.tight_layout()
plt.show()

sns.countplot(x='class', data=df)
plt.title('CKD vs Non-CKD Patients')
plt.xlabel('Groups')
plt.ylabel('Number of Patients')
plt.xticks([0, 1], ['Non-CKD', 'CKD'])
plt.show()

## Training and Test sets

We then split the data into training and test sets and visualized the distribution of CKD and non-CKD patients in both sets.

In [None]:
from sklearn.model_selection import train_test_split

# Assume 'class' is the target and rest of the columns are features
X = df.drop('class', axis=1)
y = df['class']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Convert y_train and y_test to DataFrames for easier plotting
y_train_df = pd.DataFrame(y_train, columns=['class'])
y_test_df = pd.DataFrame(y_test, columns=['class'])

# Create subplots
fig, axs = plt.subplots(1, 2, figsize=(10, 5))

# Count plot for 'class' column in the training set
sns.countplot(x='class', data=y_train_df, ax=axs[0])
axs[0].set_title('CKD vs Non-CKD Patients (Training Set)')
axs[0].set_xlabel('Groups')
axs[0].set_ylabel('Number of Patients')
axs[0].set_xticklabels(['Non-CKD', 'CKD'])

# Count plot for 'class' column in the test set
sns.countplot(x='class', data=y_test_df, ax=axs[1])
axs[1].set_title('CKD vs Non-CKD Patients (Test Set)')
axs[1].set_xlabel('Groups')
axs[1].set_ylabel('Number of Patients')
axs[1].set_xticklabels(['Non-CKD', 'CKD'])

plt.tight_layout()
plt.show()
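
As a numeric complement to the plots, the effect of `stratify=y` can be checked by comparing class proportions across the splits. The sketch below uses a synthetic target (125 positive and 75 negative labels, an assumption chosen only for illustration) rather than the CKD data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the target: 125 positive and 75 negative labels
y = pd.Series([1] * 125 + [0] * 75, name="class")
X = pd.DataFrame({"feature": range(len(y))})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Class proportions in the full data and in each split
proportions = pd.DataFrame({
    "full": y.value_counts(normalize=True),
    "train": y_train.value_counts(normalize=True),
    "test": y_test.value_counts(normalize=True),
})
print(proportions)
```

With stratification, the three columns should show the same class proportions; without it, the test set of a small dataset can drift noticeably from the overall balance.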

## Data Preparation and Decision Stump

In [None]:
from google.colab import files
import pandas as pd
import io

# Upload file
uploaded = files.upload()

Steps to take:

1. **Importing libraries:** The necessary libraries and modules are imported: numpy, pandas, train_test_split from sklearn.model_selection, DecisionTreeClassifier from sklearn.tree, confusion_matrix and accuracy_score from sklearn.metrics, SimpleImputer from sklearn.impute, and OneHotEncoder from sklearn.preprocessing.
2. **Defining a function:** The function process_column handles the preprocessing of individual values. It returns NaN if the value contains 'discrete', the average of the two numbers if the value contains a '-', and otherwise the value converted to float. If that conversion fails, it returns NaN.
3. **Loading the dataset:** The dataset is loaded from a CSV file using pd.read_csv.
4. **Data preprocessing:** The process_column function is applied to the necessary columns of the dataframe. The target column 'class' is converted to integer type, where 'ckd' is represented as 1 and 'notckd' as 0. Missing values are filled with the mean of the respective column. Categorical variables are then one-hot encoded, and the original categorical columns are dropped from the dataframe.
5. **Splitting the dataset:** The dataset is split into features (X) and target (y), and then into training and testing sets using the train_test_split function.
6. **Data imputation:** A SimpleImputer object fills any remaining missing values with the mean of the respective column. The imputer is fit on the training data and then used to transform both the training and testing data.
7. **Training the model:** A decision stump (DecisionTreeClassifier with max_depth=1) is trained on the imputed training data.
8. **Making predictions:** The model is used to make predictions on the test data.
9. **Evaluating the model:** The accuracy of the model is printed, and a confusion matrix is displayed to evaluate its performance.

Note: This workflow handles missing values, converts data types, one-hot encodes categorical variables, and splits the dataset into training and testing sets. The model is a decision stump, a decision tree with a single split, which serves as a simple baseline for binary classification.
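
The steps above can be sketched compactly with a scikit-learn `Pipeline`. This is an illustrative sketch under stated assumptions, not the notebook's own code: the data is synthetic, the column names `hemo` and `htn` are borrowed only as labels, and a depth-1 decision tree is used as the final estimator (any other classifier, such as LogisticRegression, could be swapped in).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic toy data (not the CKD dataset): one numeric and one categorical feature
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "hemo": rng.normal(12.0, 2.0, n),
    "htn": rng.choice(["yes", "no"], n),
})
toy["class"] = (toy["hemo"] < 12.0).astype(int)
toy.loc[rng.choice(n, 20, replace=False), "hemo"] = np.nan  # inject missing values

X = toy.drop(columns=["class"])
y = toy["class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Impute numeric columns and one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), ["hemo"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["htn"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("stump", DecisionTreeClassifier(max_depth=1, random_state=0)),
])

# Preprocessing statistics are learned from the training data only
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print(f"Confusion Matrix:\n{cm}")
```

Wrapping the imputer and encoder in the pipeline means they are fit on the training split only, so no information from the test set reaches the preprocessing step.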

**Hypertension (htn)**

In [None]:
# Import the necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np

# 'df' is the DataFrame loaded earlier; 'htn' is the feature column
# and 'class' is the binary target column

# Handle missing values in 'htn' column by filling with the mode
df['htn'] = df['htn'].fillna(df['htn'].mode()[0])

# Encode 'htn' column to numerical values if it's categorical
le = LabelEncoder()
df['htn'] = le.fit_transform(df['htn'])

X = df[['htn']]  # feature
y = df['class']  # target

# Split the dataset into 70% training data and 30% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Create an instance of DecisionTreeClassifier with max_depth = 1 (Decision Stump)
clf = DecisionTreeClassifier(max_depth=1, random_state=0)

# Train the model
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Print the accuracy of the model
print('Accuracy:', accuracy_score(y_test, y_pred))


**Diabetes Mellitus (dm)**

In [None]:
# Import the necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np

# 'df' is the DataFrame loaded earlier; 'dm' is the feature column
# and 'class' is the binary target column

# Handle missing values in 'dm' column by filling with the mode
df['dm'] = df['dm'].fillna(df['dm'].mode()[0])

# Encode 'dm' column to numerical values if it's categorical
le = LabelEncoder()
df['dm'] = le.fit_transform(df['dm'])

X = df[['dm']]  # feature
y = df['class']  # target

# Split the dataset into 70% training data and 30% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Create an instance of DecisionTreeClassifier with max_depth = 1 (Decision Stump)
clf = DecisionTreeClassifier(max_depth=1, random_state=0)

# Train the model
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Print the accuracy of the model
print('Accuracy:', accuracy_score(y_test, y_pred))


**Hemoglobin (hemo)**

In [None]:
# Necessary imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.impute import SimpleImputer

def process_column(col):
    if 'discrete' in str(col):
        return np.nan  # return NaN if 'discrete' is in column
    if '-' in str(col):
        low, high = map(float, str(col).split('-'))  # split on '-', convert to float
        return (low + high) / 2  # return the average
    else:
        try:
            return float(col)  # convert to float
        except ValueError:
            return np.nan  # if conversion to float fails, return NaN

# Load the dataset
df = pd.read_csv('ckd-dataset-v2 (2).csv')

# Apply process_column function to necessary columns
df['hemo'] = df['hemo'].apply(process_column)

# Convert 'class' to integer type
df['class'] = (df['class'] == 'ckd').astype(int)

# Missing 'hemo' values are imputed after the split (see SimpleImputer below),
# so that the test set does not influence the imputation statistics

# Split the dataset into features and target
X = df[['hemo']]  # Select only the 'hemo' column as the feature
y = df['class']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Use mean imputation
imputer = SimpleImputer(strategy='mean')

# Fit on the training data
imputer.fit(X_train)

# Transform both training and testing data
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

# Train the model using the imputed training data
model = DecisionTreeClassifier(max_depth=1)
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Print out the accuracy and confusion matrix
print(f"Accuracy: {round(accuracy_score(y_test, y_pred), 2)}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}")



# Logistic regression

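This section applies the Logistic Regression model from the sklearn.linear_model module.

A typical logistic regression workflow with feature scaling consists of the steps below. The single-feature cells that follow skip the scaling step.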

StandardScaler(): This creates an instance of the StandardScaler class, which standardizes the features by removing the mean and scaling to unit variance. Standardization helps models that are sensitive to feature scale, including regularized logistic regression.

scaler.fit_transform(X_train): This fits the scaler to the training data and then transforms the training data. "Fitting" the scaler means that it learns the parameters (mean and standard deviation for standardization) of the training data.

scaler.transform(X_test): This uses the scaler that was fitted to the training data to transform the test data. It's important to note that the same scaler is used to transform both the training and test data to ensure that they are scaled in the same way.

LogisticRegression(max_iter=1000): This creates an instance of the LogisticRegression class. The max_iter=1000 argument raises the maximum number of solver iterations above the default of 100, which helps when the solver does not converge within the default limit, for example on unscaled features.

model.fit(X_train, y_train): This fits the logistic regression model to the training data. "Fitting" the model means that it learns the relationship between the features (X_train) and the target (y_train).

model.predict(X_test): This uses the fitted model to make predictions on the test data.

accuracy_score(y_test, y_pred): This calculates the accuracy of the model by comparing the predicted values to the actual values.

So, in summary, this code is using logistic regression to make predictions on the test data and then calculating the accuracy of those predictions.
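
The steps described above can be put together as follows. This is a minimal sketch on synthetic data, not the CKD dataset: the two features and their scales are assumptions chosen only to show why scaling is applied before fitting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (not the CKD dataset): two features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(12.0, 2.0, 300),
    rng.normal(8000.0, 2500.0, 300),
])
y = (X[:, 0] < 12.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
```

Fitting the scaler on the training split and only transforming the test split keeps the test data out of the preprocessing statistics.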

**Hypertension (htn)**

In [None]:
# Import the necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np

# 'df' is the DataFrame loaded earlier; 'htn' is the feature column
# and 'class' is the binary target column

# Handle missing values in 'htn' column by filling with the mode
df['htn'] = df['htn'].fillna(df['htn'].mode()[0])

# Encode 'htn' column to numerical values if it's categorical
le = LabelEncoder()
df['htn'] = le.fit_transform(df['htn'])

X = df[['htn']]  # feature
y = df['class']  # target

# Split the dataset into 70% training data and 30% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Create an instance of LogisticRegression
lr = LogisticRegression()

# Train the model
lr.fit(X_train, y_train)

# Make predictions
y_pred = lr.predict(X_test)

# Print the accuracy of the model
print('Accuracy:', accuracy_score(y_test, y_pred))


**Hemoglobin (hemo)**

In [None]:
# Import the necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

# 'df' is the DataFrame loaded earlier; 'hemo' is the feature column
# and 'class' is the binary target column

# Handle missing values in 'hemo' column
df['hemo'] = df['hemo'].fillna(df['hemo'].mean())

X = df[['hemo']]  # feature
y = df['class']  # target

# Split the dataset into 70% training data and 30% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Create an instance of LogisticRegression
lr = LogisticRegression()

# Train the model
lr.fit(X_train, y_train)

# Make predictions
y_pred = lr.predict(X_test)

# Print the accuracy of the model
print('Accuracy:', accuracy_score(y_test, y_pred))

**Diabetes Mellitus (dm)**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

# 'dm' is the feature and 'class' is the target
X = df['dm'].values.reshape(-1,1)
y = df['class']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Creating the Logistic Regression model
classifier = LogisticRegression(random_state = 0)

# Training the model
classifier.fit(X_train, y_train)

# Predicting the test set results
y_pred = classifier.predict(X_test)

# Evaluating the model
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: \n", cm)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: ", accuracy)


# Random Forest

This section applies the Random Forest algorithm to the dataset for classification, using the RandomForestClassifier class from the sklearn.ensemble module. Here are the key steps:

RandomForestClassifier(n_estimators=100, random_state=42): This creates an instance of the RandomForestClassifier class with 100 trees in the forest (n_estimators=100) and a specified random state for reproducibility (random_state=42).

rf.fit(X_train, y_train): This fits the Random Forest model to the training data. The model learns the relationship between the features (X_train) and the target (y_train) based on an ensemble of decision trees.

rf.predict(X_test): This uses the fitted model to make predictions on the test data.

accuracy_score(y_test, y_pred_rf): This computes the accuracy of the model by comparing the predicted values to the actual values.

Random Forest is an ensemble learning method: it combines many models to obtain better predictive performance than any single one. It trains multiple decision trees, each on a bootstrap sample of the training data with a random subset of features considered at each split, and aggregates their predictions into a more accurate and stable result.
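
The key steps above, and the way the forest aggregates its trees, can be illustrated as follows. This is a sketch on synthetic data (an assumption, not the CKD dataset); it relies on scikit-learn's forest averaging the class probabilities predicted by its individual trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data (not the CKD dataset): label depends on the sum of two features
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred_rf)
print(f"Accuracy: {accuracy:.2f}")

# Reproduce the ensemble prediction by averaging per-tree class probabilities
avg_proba = np.mean([tree.predict_proba(X_test) for tree in rf.estimators_], axis=0)
manual_pred = rf.classes_[np.argmax(avg_proba, axis=1)]
print("Matches rf.predict:", np.array_equal(manual_pred, y_pred_rf))
```

The fitted trees are available in `rf.estimators_`, which makes it possible to inspect how much individual trees disagree before their outputs are combined.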

**Hemoglobin (hemo)**

In [None]:
# Import the necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

# 'df' is the DataFrame loaded earlier; 'hemo' is the feature column
# and 'class' is the binary target column

# Handle missing values in 'hemo' column by filling with the mean
df['hemo'] = df['hemo'].fillna(df['hemo'].mean())

X = df[['hemo']]  # feature
y = df['class']  # target

# Split the dataset into 70% training data and 30% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Create an instance of RandomForestClassifier
clf = RandomForestClassifier(random_state=0)

# Train the model
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Print the accuracy of the model
print('Accuracy:', accuracy_score(y_test, y_pred))


**Hypertension (htn)**

In [None]:
# Import the necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np

# 'df' is the DataFrame loaded earlier; 'htn' is the feature column
# and 'class' is the binary target column

# Handle missing values in 'htn' column by filling with the mode
df['htn'] = df['htn'].fillna(df['htn'].mode()[0])

# Encode 'htn' column to numerical values if it's categorical
le = LabelEncoder()
df['htn'] = le.fit_transform(df['htn'])

X = df[['htn']]  # feature
y = df['class']  # target

# Split the dataset into 70% training data and 30% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Create an instance of RandomForestClassifier
clf = RandomForestClassifier(random_state=0)

# Train the model
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Print the accuracy of the model
print('Accuracy:', accuracy_score(y_test, y_pred))


**Diabetes Mellitus (dm)**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# Handle missing values in 'dm' by filling with the mode, as in the other 'dm' cells
df['dm'] = df['dm'].fillna(df['dm'].mode()[0])

# Selecting 'dm' as the feature and 'class' as the target
X = df['dm'].values.reshape(-1,1)
y = df['class'].values

# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Initializing the Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Training the model
rf_clf.fit(X_train, y_train)

# Predicting the test set results
y_pred = rf_clf.predict(X_test)

# Printing the confusion matrix and accuracy score
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))


## CKD Data Prep


In [7]:
# Necessary imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier  # Import DecisionTreeClassifier instead of LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder


def process_column(col):
    if 'discrete' in str(col):
        return np.nan  # return NaN if 'discrete' is in column
    if '-' in str(col):
        low, high = map(float, str(col).split('-'))  # split on '-', convert to float
        return (low + high) / 2  # return the average
    else:
        try:
            return float(col)  # convert to float
        except ValueError:
            return np.nan  # if conversion to float fails, return NaN

# Load the dataset
df = pd.read_csv('ckd-dataset-v2 (2).csv')

# Added Affected - SR
# Apply process_column function to necessary columns
column_list = ['bp (Diastolic)', 'bp limit', 'sg', 'al', 'rbc', 'su', 'pc', 'pcc', 'ba', 'bgr', 'bu', 'sod', 'sc', 'pot', 'hemo', 'pcv', 'rbcc', 'wbcc', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane', 'grf', 'stage', 'affected', 'age']
for column_name in column_list:
    df[column_name] = df[column_name].apply(process_column)

# Convert 'class' to integer type
df['class'] = (df['class'] == 'ckd').astype(int)

# Fill missing values with the mean of the respective column
df = df.fillna(df.mean(numeric_only=True))

# One-hot encode categorical variables
enc = OneHotEncoder(drop='first')  # Create encoder object
df_encoded = pd.DataFrame(enc.fit_transform(df.select_dtypes(include=['object'])).toarray())  # Transform data

# Merge with the original df
df = df.join(df_encoded)
df = df.drop(df.select_dtypes(include=['object']).columns, axis=1)

# Split the dataset into features and target
X = df.drop(columns=['class'])
y = df['class']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert columns to string type to avoid issues with imputer
X_train.columns = X_train.columns.astype(str)
X_test.columns = X_test.columns.astype(str)

# Use mean imputation
imputer = SimpleImputer(strategy='mean')

# Fit on the training data
imputer.fit(X_train)

# Transform both training and testing data
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

# Train the model using the imputed training data
model = DecisionTreeClassifier(max_depth=1)  # Replace LogisticRegression with DecisionTreeClassifier(max_depth=1)
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Exclude rows by index values - SR
filtered_df = df.drop([0, 1])


# Print out the accuracy and confusion matrix
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}")


Accuracy: 1.0
Confusion Matrix:
[[13  0]
 [ 0 28]]


## Affected column change


In [8]:
import pandas as pd

# Exclude rows by index values; this removes the discrete, blank, and class entries
filtered_df = df.drop([0, 1])

# Preview the filtered data
print(filtered_df.head())

   bp (Diastolic)  bp limit     sg        al  class  rbc        su   pc  pcc  \
2             0.0       0.0  1.020  1.000000      1  0.0  2.724138  0.0  0.0   
3             0.0       0.0  1.010  2.028169      1  0.0  2.724138  0.0  0.0   
4             0.0       0.0  1.010  2.028169      1  1.0  2.724138  1.0  0.0   
5             1.0       1.0  1.010  3.000000      1  0.0  2.724138  0.0  0.0   
6             0.0       0.0  1.016  2.028169      1  0.0  2.724138  0.0  0.0   

    ba  ...  htn   dm  cad  appet   pe  ane         grf  stage  affected  \
2  0.0  ...  0.0  0.0  0.0    0.0  0.0  0.0   90.897524    NaN       1.0   
3  0.0  ...  0.0  0.0  0.0    0.0  0.0  0.0   90.897524    NaN       1.0   
4  1.0  ...  0.0  0.0  0.0    1.0  0.0  0.0  139.863500    NaN       1.0   
5  0.0  ...  0.0  0.0  0.0    0.0  0.0  0.0  139.863500    NaN       1.0   
6  0.0  ...  0.0  1.0  0.0    1.0  1.0  0.0  139.863500    NaN       1.0   

         age  
2  52.973118  
3  52.973118  
4  52.973118  
5 

## Linear Regression


In [10]:
import statsmodels.api as sm

# Select the independent variables (X) and the dependent variable (y) from the filtered data
X = filtered_df[['hemo','sg','grf']]
y = filtered_df[['affected']]

# Add a constant term to the independent variables
X = sm.add_constant(X)

# Create and fit the linear regression model
model = sm.OLS(y, X)
result = model.fit()

# Print the summary of the linear regression model
print(result.summary())


                            OLS Regression Results                            
Dep. Variable:               affected   R-squared:                       0.601
Model:                            OLS   Adj. R-squared:                  0.595
Method:                 Least Squares   F-statistic:                     98.35
Date:                Mon, 26 Jun 2023   Prob (F-statistic):           7.08e-39
Time:                        20:22:35   Log-Likelihood:                -45.148
No. Observations:                 200   AIC:                             98.30
Df Residuals:                     196   BIC:                             111.5
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         30.4514      6.044      5.039      0.0

## Conclusion

We hope this analysis will contribute to a deeper understanding of the CKD dataset and aid in the development of effective Machine Learning models for CKD prediction.