<a href="https://colab.research.google.com/github/TAlkam/-Probability-Stats-for-AI/blob/main/CKD_data_prepration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Data Preparation and Decision Stump

In [6]:
from google.colab import files
import pandas as pd
import io

# Upload file
uploaded = files.upload()

Saving ckd-dataset-v2 (2).csv to ckd-dataset-v2 (2).csv


Steps to take:

1. Importing Libraries: 
The necessary libraries and modules are imported. These include `numpy`, `pandas`, `train_test_split` from `sklearn.model_selection`, `DecisionTreeClassifier` from `sklearn.tree`, `confusion_matrix` and `accuracy_score` from `sklearn.metrics`, `SimpleImputer` from `sklearn.impute`, and `OneHotEncoder` from `sklearn.preprocessing`.

2. Defining a Function:
The function `process_column` converts a single cell value to a number. It returns NaN when the value contains 'discrete', the average of the two numbers when the value is a range written with a '-', and the float value when the value parses as a float. If none of these conditions are met, it returns NaN.

3. Loading the Dataset:
The dataset is loaded from a CSV file using `pd.read_csv`.

4. Data Preprocessing:
The `process_column` function is applied to the necessary columns of the dataframe. The target column 'class' is converted to integer type, where 'ckd' is represented as 1 and 'notckd' as 0. Missing values in the dataframe are filled with the mean of the respective column. Then, categorical variables are one-hot encoded, and the original categorical columns are dropped from the dataframe.

5. Splitting the Dataset:
The dataset is split into features (X) and target (y). It is then split into training and testing sets using the `train_test_split` function, with 20% of the rows held out for testing.

6. Data Imputation:
A `SimpleImputer` object is created to fill any remaining missing values in the dataset with the mean of the respective column. This imputer is fit on the training data and then used to transform both training and testing data.

7. Training the Model:
A decision stump (`DecisionTreeClassifier(max_depth=1)`) is trained using the imputed training data.

8. Making Predictions:
The model is used to make predictions on the test data.

9. Evaluating the Model:
The accuracy of the model is printed out, and a confusion matrix is displayed to evaluate the performance of the model.

Note: This code handles missing values, converts data types, one-hot encodes categorical variables, and splits the dataset into training and testing sets. The classifier is a decision stump, a decision tree limited to a single split, which serves as a simple baseline for binary classification problems.
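
As a small illustration of step 2, the sketch below applies the conversion rule described above to a few made-up cell values. The helper name `to_number` and the sample strings are assumptions for illustration only; they are not taken from the dataset.

```python
import numpy as np


def to_number(value):
    """Hypothetical helper: 'discrete' -> NaN, 'a - b' -> midpoint, else float or NaN."""
    text = str(value)
    if 'discrete' in text:
        return np.nan
    try:
        if '-' in text:
            low, high = map(float, text.split('-'))  # parse both ends of the range
            return (low + high) / 2                  # midpoint of the range
        return float(text)                           # plain numeric value
    except ValueError:
        return np.nan                                # anything unparseable becomes NaN


# Made-up sample values covering each branch of the rule
samples = ['discrete', '10 - 20', '70', '< 112']
converted = [to_number(s) for s in samples]
print(converted)
```

Values that are neither a range nor a plain number (such as the last sample) end up as NaN and are later filled by the mean imputation in steps 4 and 6.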

In [11]:
# Necessary imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier  # used with max_depth=1 as a decision stump
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder


def process_column(col):
    """Convert one cell value to a float; a range 'a - b' becomes its average."""
    text = str(col)
    if 'discrete' in text:
        return np.nan  # return NaN if 'discrete' is in the value
    try:
        if '-' in text:
            low, high = map(float, text.split('-'))  # split on '-', convert to float
            return (low + high) / 2  # return the average
        return float(text)  # convert to float
    except ValueError:
        return np.nan  # if parsing fails (including malformed ranges), return NaN

# Load the dataset
df = pd.read_csv('ckd-dataset-v2 (2).csv')

# Apply process_column function to necessary columns
column_list = ['bp (Diastolic)', 'bp limit', 'sg', 'al', 'rbc', 'su', 'pc', 'pcc', 'ba', 'bgr', 'bu', 'sod', 'sc', 'pot', 'hemo', 'pcv', 'rbcc', 'wbcc', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane', 'grf', 'stage', 'age']
for column_name in column_list:
    df[column_name] = df[column_name].apply(process_column)

# Convert 'class' to integer type
df['class'] = (df['class'] == 'ckd').astype(int)

# Fill missing values with the mean of the respective column
df = df.fillna(df.mean(numeric_only=True))

# One-hot encode categorical variables
enc = OneHotEncoder(drop='first')  # Create encoder object
df_encoded = pd.DataFrame(enc.fit_transform(df.select_dtypes(include=['object'])).toarray())  # Transform data

# Merge with the original df
df = df.join(df_encoded)
df = df.drop(df.select_dtypes(include=['object']).columns, axis=1)

# Split the dataset into features and target
X = df.drop(columns=['class'])
y = df['class']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert columns to string type to avoid issues with imputer
X_train.columns = X_train.columns.astype(str)
X_test.columns = X_test.columns.astype(str)

# Use mean imputation
imputer = SimpleImputer(strategy='mean')

# Fit on the training data
imputer.fit(X_train)

# Transform both training and testing data
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

# Train the model using the imputed training data
model = DecisionTreeClassifier(max_depth=1)  # decision stump: a tree with a single split
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Print out the accuracy and confusion matrix
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}")


Accuracy: 1.0
Confusion Matrix:
[[13  0]
 [ 0 28]]
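
A stump makes its whole decision from one threshold on one feature, so it can be informative to check which feature it picked. The sketch below does this on synthetic data; the arrays `X_demo` and `y_demo` are made up for illustration. On the notebook's data, the same attributes could be read from `model`, assuming it is the fitted stump from the cell above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: column 0 is noise, column 1 fully determines the label
rng = np.random.default_rng(0)
noise = rng.normal(size=200)
signal = rng.normal(size=200)
X_demo = np.column_stack([noise, signal])
y_demo = (signal > 0).astype(int)

stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_demo, y_demo)

# Node 0 is the root; its feature index and threshold define the single split
root_feature = int(stump.tree_.feature[0])
root_threshold = float(stump.tree_.threshold[0])
print(f"Split on column {root_feature} at threshold {root_threshold:.3f}")
print(f"Training accuracy: {stump.score(X_demo, y_demo)}")
```

If a single column separates the classes perfectly, as in this synthetic example, a stump alone reaches an accuracy of 1.0.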


# Logistic regression

This section fits the Logistic Regression model from the `sklearn.linear_model` module to the same training and test data.

Here's a step-by-step breakdown of what the code is doing:

1. `StandardScaler()`: This creates an instance of the StandardScaler class, which will be used to standardize the features by removing the mean and scaling to unit variance. This helps algorithms that are sensitive to feature scale, such as regularized logistic regression.

2. `scaler.fit_transform(X_train)`: This fits the scaler to the training data and then transforms the training data. "Fitting" the scaler means that it learns the parameters (mean and standard deviation for standardization) of the training data.

3. `scaler.transform(X_test)`: This uses the scaler that was fitted to the training data to transform the test data. It's important to note that the same scaler is used to transform both the training and test data to ensure that they are scaled in the same way.

4. `LogisticRegression(max_iter=1000)`: This creates an instance of the LogisticRegression class. The `max_iter=1000` argument raises the maximum number of solver iterations above the default of 100, which helps when the solver does not converge within the default limit.

5. `model.fit(X_train, y_train)`: This fits the logistic regression model to the training data. "Fitting" the model means that it learns the relationship between the features (X_train) and the target (y_train).

6. `model.predict(X_test)`: This uses the fitted model to make predictions on the test data.

7. `accuracy_score(y_test, y_pred)`: This calculates the accuracy of the model by comparing the predicted values to the actual values.

So, in summary, this code is using logistic regression to make predictions on the test data and then calculating the accuracy of those predictions.
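
To make steps 2 and 3 concrete, the sketch below fits a scaler on made-up training data and applies it to made-up test data. The arrays `X_tr` and `X_te` are assumptions for illustration and are not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test features drawn from the same distribution
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
X_te = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # learns mean/std from training data only
X_te_scaled = scaler_demo.transform(X_te)      # reuses the training mean/std

print("Train means after scaling:", X_tr_scaled.mean(axis=0).round(6))
print("Train stds after scaling: ", X_tr_scaled.std(axis=0).round(6))
print("Test means after scaling: ", X_te_scaled.mean(axis=0).round(3))
```

The scaled training columns have mean 0 and standard deviation 1 by construction. The test columns are only approximately centered, because they are scaled with statistics learned from the training data.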

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Scale the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Increase max_iter and fit the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")


Accuracy: 1.0


# Linear regression

This is a linear regression model using the `LinearRegression` class from the `sklearn.linear_model` module. However, a couple of things to note:

1. The target variable, 'class', is categorical in nature. Typically, for a linear regression model, the target variable would be continuous. Using a linear regression model for a binary categorical outcome could lead to predicted values outside the [0,1] range, which would not make sense for a binary classification problem. Logistic regression is generally more appropriate for binary classification problems.

2. The model is trained on a single feature 'bp (Diastolic)'. While linear regression can handle multiple features, in this case, it's being used for univariate linear regression (one feature to one target).

Aside from these points, the rest of the code is a standard implementation of a linear regression model:

1. `LinearRegression()`: This creates an instance of the LinearRegression class.

2. `linreg.fit(X_train, y_train)`: This fits the linear regression model to the training data. "Fitting" the model means that it learns the relationship between the features (X_train) and the target (y_train).

3. `linreg.predict(X_test)`: This uses the fitted model to make predictions on the test data.

4. `mean_squared_error(y_test, y_pred)`: This calculates the mean squared error of the model by comparing the predicted values to the actual values.

5. `linreg.coef_` and `linreg.intercept_`: These are the parameters of the fitted linear regression model. The coefficients are the weights assigned to the features, and the intercept is the predicted value when all features are 0.
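
The first caveat above, that a linear model can predict values outside [0, 1] for a binary target, can be seen on a tiny made-up example. The arrays `x_demo` and `y_demo` below are assumptions for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: one feature taking values 0..10, label is 1 when the feature exceeds 5
x_demo = np.arange(11, dtype=float).reshape(-1, 1)
y_demo = (x_demo.ravel() > 5).astype(int)

linreg_demo = LinearRegression().fit(x_demo, y_demo)
preds = linreg_demo.predict(x_demo)

# The fitted line is unbounded, so predictions at the extremes leave [0, 1]
print(f"Smallest prediction: {preds.min():.3f}")
print(f"Largest prediction:  {preds.max():.3f}")

# Thresholding at 0.5 turns the continuous output into class labels
labels = (preds >= 0.5).astype(int)
print("Thresholded labels:", labels)
```

Thresholding recovers usable class labels here, but the raw outputs cannot be read as probabilities, which is one reason logistic regression is usually preferred for this kind of target.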

In [25]:
# Split the dataset into features and target
X = df[['bp (Diastolic)']]  # feature matrix
y = df['class']  # target variable


In [30]:
# Convert non-numeric entries in the 'age' column (such as 'discrete') to NaN
df['age'] = pd.to_numeric(df['age'], errors='coerce')

# Fill NaN values with the mean of the column
df['age'] = df['age'].fillna(df['age'].mean())

# Split the data into features and target
X = df[['age']]  # feature matrix
y = df['class']  # target variable


In [34]:
# Necessary imports
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the dataset
df = pd.read_csv('ckd-dataset-v2 (2).csv')

# Coerce non-numeric 'bp (Diastolic)' values to NaN, then fill them with the median
df['bp (Diastolic)'] = pd.to_numeric(df['bp (Diastolic)'], errors='coerce')
df['bp (Diastolic)'] = df['bp (Diastolic)'].fillna(df['bp (Diastolic)'].median())

# Split the dataset into features and target
X = df[['bp (Diastolic)']]  # feature matrix
y = df['class']  # target variable

# Convert the categorical 'class' column into a binary indicator (1 for 'ckd', 0 otherwise)
y = (y == 'ckd').astype(int)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Linear Regression object
linreg = LinearRegression()

# Train the model using the training data
linreg.fit(X_train, y_train)

# Make predictions on the test data
y_pred = linreg.predict(X_test)

# Print out the mean squared error (MSE)
print(f"MSE: {mean_squared_error(y_test, y_pred)}")

# Print out the coefficients and the intercept
print(f"Coefficients: {linreg.coef_}")
print(f"Intercept: {linreg.intercept_}")


MSE: 0.217499939801131
Coefficients: [0.07987616]
Intercept: 0.5789473684210525


# Naive Bayes

This script applies the Naive Bayes algorithm to the dataset for classification, using the `GaussianNB` class from the `sklearn.naive_bayes` module. Here are the key steps:

1. `GaussianNB()`: This creates an instance of the GaussianNB (Naive Bayes) class.

2. `gnb.fit(X_train, y_train)`: This fits the Naive Bayes model to the training data. The model learns the relationship between the features (X_train) and the target (y_train) based on the assumption that the features are independent given the target.

3. `gnb.predict(X_test)`: This uses the fitted model to make predictions on the test data.

4. `accuracy_score(y_test, y_pred)`: This computes the accuracy of the model by comparing the predicted values to the actual values.

The script also includes data preprocessing steps such as filling missing values with the column mean, one-hot encoding categorical features, and checking for NaNs or non-numeric values.

Note that there are some redundant lines of code (e.g., fitting the model multiple times), which may have been used for debugging or illustrating different steps. These can be cleaned up in a final version of the script.
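
As a rough illustration of what fitting means for this model, the sketch below trains `GaussianNB` on two made-up clusters and prints the per-class feature means and class priors it stores. The data and the names `X0`, `X1`, and `gnb_demo` are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two made-up classes: one centered near 0, the other near 5, in two features
rng = np.random.default_rng(0)
X0 = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
X1 = rng.normal(loc=5.0, scale=1.0, size=(50, 2))
X_demo = np.vstack([X0, X1])
y_demo = np.array([0] * 50 + [1] * 50)

gnb_demo = GaussianNB().fit(X_demo, y_demo)

# theta_ holds one mean per (class, feature); each feature is modeled independently
print("Per-class feature means:\n", gnb_demo.theta_)
print("Class priors:", gnb_demo.class_prior_)
```

Because each feature is modeled separately within a class, the fitted parameters are just per-class, per-feature summaries of the training data.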

In [35]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score


In [36]:
# Load the dataset
df = pd.read_csv('ckd-dataset-v2 (2).csv')

# Split the dataset into features and target
X = df.drop(columns=['class'])  # feature matrix
y = df['class']  # target variable


In [37]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [39]:
# Fill missing values with the mean of the respective column
X_train = X_train.fillna(X_train.mean())
X_test = X_test.fillna(X_train.mean())  # Use mean from training data to fill NaNs in test data


In [41]:
# Fill missing values with the mean of the respective column
X_train = X_train.fillna(X_train.mean(numeric_only=True))
X_test = X_test.fillna(X_train.mean(numeric_only=True))  # Use mean from training data to fill NaNs in test data


In [44]:
# Fill NaNs in numeric columns with the column mean
for col in X_train.select_dtypes(include=[np.number]).columns:
    X_train[col] = X_train[col].fillna(X_train[col].mean())
    X_test[col] = X_test[col].fillna(X_train[col].mean())  # Use mean from training data to fill NaNs in test data

# Fill NaNs in non-numeric columns with a placeholder value
for col in X_train.select_dtypes(include=[object]).columns:
    X_train[col] = X_train[col].fillna('Unknown')
    X_test[col] = X_test[col].fillna('Unknown')


In [None]:
# One-hot encode categorical features in the training data
X_train = pd.get_dummies(X_train)

# One-hot encode categorical features in the test data, ensuring it has the same columns as the training data
X_test = pd.get_dummies(X_test)
missing_cols = set(X_train.columns) - set(X_test.columns)
for c in missing_cols:
    X_test[c] = 0
X_test = X_test[X_train.columns]


In [52]:
# Create a DataFrame of zeros with the same index as X_test and columns as missing_cols
missing_cols_df = pd.DataFrame(0, index=X_test.index, columns=list(missing_cols))
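
An alternative to adding the missing columns one at a time is to reindex the encoded test frame against the training columns in a single call. The sketch below shows this on two tiny made-up frames; the names `train_demo` and `test_demo` and their contents are assumptions for illustration.

```python
import pandas as pd

# Made-up frames: the test set is missing two of the categories seen in training
train_demo = pd.DataFrame({'color': ['red', 'green', 'blue'], 'size': [1.0, 2.0, 3.0]})
test_demo = pd.DataFrame({'color': ['red', 'red'], 'size': [4.0, 5.0]})

# dtype=int keeps the indicator columns as 0/1 integers
X_train_enc = pd.get_dummies(train_demo, dtype=int)
X_test_enc = pd.get_dummies(test_demo, dtype=int)

# Align the test columns to the training columns; absent categories become 0
X_test_enc = X_test_enc.reindex(columns=X_train_enc.columns, fill_value=0)

print(X_test_enc)
```

`reindex` also drops any column that appears only in the test frame, so the result always has exactly the training columns in the training order.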


In [None]:
print(X_train.isnull().sum())


In [None]:
print(X_train.dtypes)


In [None]:
print((X_train.values == np.inf).sum())
print((X_train.values == -np.inf).sum())


In [None]:
# Step 1: Check if y_train contains NaNs or non-numeric values.
print("Number of NaNs in y_train:", y_train.isnull().sum())
print("Data type of y_train:", y_train.dtypes)

# Step 2: Round the values in X_train to six decimal places.
X_train = X_train.round(decimals=6)

# Step 3: Fit on one column at a time to find columns that raise an error.
gnb = GaussianNB()
for col in X_train.columns:
    try:
        gnb.fit(X_train[[col]], y_train)
        print(f"No issue with column: {col}")
    except ValueError:
        print(f"Issue with column: {col}")


In [68]:
# Check the data types and NaN counts again
print("Number of NaNs in y_train:", np.isnan(y_train).sum())
print("Data type of y_train:", y_train.dtype)


Number of NaNs in y_train: 0
Data type of y_train: int64


In [None]:
# Create a Gaussian Naive Bayes object
gnb = GaussianNB()

# Train the model using the training data
gnb.fit(X_train, y_train)



In [73]:
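# Encode y_test with the label encoder `le` that was fitted on y_train
# (the cell that creates `le` is not included in this notebook export)
y_test = le.transform(y_test)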

# Train the model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions on the test data
y_pred = gnb.predict(X_test)

# Print out the accuracy
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")


Accuracy: 1.0


# Random Forest

This script applies the Random Forest algorithm to the dataset for classification, using the `RandomForestClassifier` class from the `sklearn.ensemble` module. Here are the key steps:

1. `RandomForestClassifier(n_estimators=100, random_state=42)`: This creates an instance of the RandomForestClassifier class with 100 trees in the forest (`n_estimators=100`) and a fixed random state for reproducibility (`random_state=42`).

2. `rf.fit(X_train, y_train)`: This fits the Random Forest model to the training data. The model learns the relationship between the features (X_train) and the target (y_train) based on an ensemble of decision trees.

3. `rf.predict(X_test)`: This uses the fitted model to make predictions on the test data.

4. `accuracy_score(y_test, y_pred_rf)`: This computes the accuracy of the model by comparing the predicted values to the actual values.

Random Forest is an ensemble learning method: it trains many decision trees, each on a bootstrap sample of the training data with a random subset of features considered at each split, and combines their predictions to get a more accurate and stable result than a single tree.
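
The ensemble structure described above can be inspected on a fitted model. The sketch below uses a synthetic dataset from `make_classification` (an assumption for illustration, not the notebook's data) and reads the individual trees and the aggregated feature importances.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data with 5 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# estimators_ is the list of fitted decision trees that make up the forest
n_trees = len(rf_demo.estimators_)
print(f"Number of trees: {n_trees}")

# feature_importances_ averages each feature's contribution across all trees
importances = rf_demo.feature_importances_
print("Feature importances:", importances.round(3))
```

On the notebook's data, the same `feature_importances_` attribute of `rf` would show which columns the forest relies on most.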

In [75]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the model
rf.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf.predict(X_test)

# Print out the accuracy
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf)}")


Accuracy: 1.0
