# ❤️‍🩹 Medical Condition Prediction Dataset:

## About Dataset:

This dataset describes three medical conditions (Cancer, Pneumonia, and Diabetic) in terms of demographic, lifestyle, and health-related features. The user records are randomly generated and contain many missing values, which makes the dataset suitable for practicing imbalanced classification and missing-data handling.

Features:

1. **id**: Unique identifier for each user.
2. **full_name**: Randomly generated user name.
3. **age**: Age of the user (ranging from 18 to 90 years), with some missing values.
4. **gender**: The gender of the user (categorized as Male, Female, or Non-Binary).
5. **smoking_status**: Indicates the smoking status of the user (Smoker, Non-Smoker, Former-Smoker).
6. **bmi**: Body Mass Index (BMI) of the user (ranging from 15 to 40), with some missing values.
7. **blood_pressure**: Blood pressure levels of the user (ranging from 90 to 180 mmHg), with some missing values.
8. **glucose_levels**: Blood glucose levels of the user (ranging from 70 to 200 mg/dL), with some missing values.
9. **condition**: The target label indicating the user's medical condition (Cancer, Pneumonia, or Diabetic), with an imbalanced distribution (15% Cancer, 25% Pneumonia, 60% Diabetic).

Goal:

The objective is to predict a user's medical condition (Cancer, Pneumonia, or Diabetic) from their demographic, lifestyle, and health-related features. The dataset can also be used to explore strategies for dealing with imbalanced classes and missing data in healthcare applications.
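
For readers without access to the Kaggle file, the schema described above can be imitated with a small random frame. This is only a sketch: `make_stand_in`, the 10% missing rate, and the uniform sampling are assumptions for illustration, not how the real dataset was generated.

```python
import numpy as np
import pandas as pd


def make_stand_in(n_rows=1000, missing_rate=0.1, seed=0):
    """Build a random frame with the columns and ranges listed above (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    frame = pd.DataFrame({
        "id": np.arange(1, n_rows + 1),
        "full_name": [f"User {i}" for i in range(1, n_rows + 1)],
        "age": rng.integers(18, 91, n_rows).astype(float),
        "gender": rng.choice(["Male", "Female", "Non-Binary"], n_rows),
        "smoking_status": rng.choice(["Smoker", "Non-Smoker", "Former-Smoker"], n_rows),
        "bmi": rng.uniform(15, 40, n_rows),
        "blood_pressure": rng.uniform(90, 180, n_rows),
        "glucose_levels": rng.uniform(70, 200, n_rows),
        "condition": rng.choice(
            ["Cancer", "Pneumonia", "Diabetic"], n_rows, p=[0.15, 0.25, 0.60]
        ),
    })
    # Blank out a random share of each numeric column to imitate the missing values
    for col in ["age", "bmi", "blood_pressure", "glucose_levels"]:
        frame.loc[rng.random(n_rows) < missing_rate, col] = np.nan
    return frame


stand_in = make_stand_in()
print(stand_in["condition"].value_counts(normalize=True).round(2))
print(stand_in.isna().mean().round(2))
```

The two printed summaries show the class proportions and the share of missing values per column, which are the two properties the rest of the notebook deals with.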

## Import Libraries:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [None]:
data=pd.read_csv("/kaggle/input/medical-condition-prediction-dataset/medical_conditions_dataset.csv")
data.head()

## EDA:

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data.isna().sum()

In [None]:
sns.pairplot(data)
plt.show()

None of the numeric feature distributions appears strongly skewed, so the mean is a reasonable value for imputing the missing entries.
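
The visual impression from the pair plot can be backed by a number. The sketch below uses a toy frame (not the real data) to show how `DataFrame.skew()` separates a roughly symmetric column from a right-skewed one. The 0.5 cut-off is a common rule of thumb and an assumption here, not a hard threshold.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy stand-in: one roughly symmetric column and one right-skewed column
toy = pd.DataFrame({
    "symmetric": rng.uniform(15, 40, 2000),
    "right_skewed": rng.exponential(10, 2000),
})

# Sample skewness per column; missing values are skipped by default
skewness = toy.skew()
print(skewness.round(2))

# Rule of thumb (assumption): |skew| < 0.5 is close enough to symmetric for mean imputation
use_mean = skewness.abs() < 0.5
print(use_mean)
```

On the notebook's data, the same check could be run as `data[["age", "bmi", "blood_pressure", "glucose_levels"]].skew()`.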

In [None]:
df = data.copy()
df['gender_smoking'] = df['gender'] + ' - ' + df['smoking_status']
df.head()

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(12, 10))

#PLOTS FOR blood_pressure:
sns.boxplot(data=data, x='blood_pressure', hue='smoking_status', ax=axes[0, 0])
axes[0, 0].set_title('blood_pressure by smoking_status')

sns.boxplot(data=df, x='blood_pressure', hue='gender', ax=axes[0, 1])
axes[0, 1].set_title('blood_pressure by gender')

sns.boxplot(data=df, x='blood_pressure', hue='gender_smoking', ax=axes[0, 2])
axes[0, 2].set_title('blood_pressure by gender_smoking')

#PLOTS FOR glucose_levels:
sns.boxplot(data=df, x='glucose_levels', hue='smoking_status', ax=axes[1, 0])
axes[1, 0].set_title('glucose_levels by smoking_status')

sns.boxplot(data=df, x='glucose_levels', hue='gender', ax=axes[1, 1])
axes[1, 1].set_title('glucose_levels by gender')

sns.boxplot(data=df, x='glucose_levels', hue='gender_smoking', ax=axes[1, 2])
axes[1, 2].set_title('glucose_levels by gender_smoking')

plt.tight_layout()
plt.show()

### Imputation:

In [None]:
from sklearn.impute import SimpleImputer

mean_imputer = SimpleImputer(strategy='mean')

df[['age', 'bmi']] = mean_imputer.fit_transform(df[['age', 'bmi']])
print(df[["age", "bmi"]].isna().sum())

df['blood_pressure'] = df.groupby("gender_smoking")['blood_pressure'].transform(lambda x: x.fillna(x.mean()))
print(df[['blood_pressure']].isna().sum())

df['glucose_levels'] = df.groupby("gender_smoking")['glucose_levels'].transform(lambda x: x.fillna(x.mean()))
print(df[['glucose_levels']].isna().sum())
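
The group-wise imputation above can be illustrated on a toy frame. The sketch also adds a fallback that the cell above does not have: if a group has no observed values at all, its mean is undefined, so the overall mean is used instead. The column names `group`, `value`, and `filled` are made up for the example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "C"],
    "value": [100.0, np.nan, 120.0, 150.0, np.nan, np.nan],
})

# Fill each missing value with the mean of its own group
group_mean = toy.groupby("group")["value"].transform("mean")
toy["filled"] = toy["value"].fillna(group_mean)

# Group C has no observed values, so its group mean is NaN; fall back to the overall mean
toy["filled"] = toy["filled"].fillna(toy["value"].mean())
print(toy)
```

The missing value in group A becomes 110 (the mean of 100 and 120), the one in group B becomes 150, and the one in group C receives the overall mean of the three observed values.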

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(12, 10))

#PLOTS FOR blood_pressure (after imputation, so plot df rather than the raw data):
sns.boxplot(data=df, x='blood_pressure', hue='smoking_status', ax=axes[0, 0])
axes[0, 0].set_title('blood_pressure by smoking_status')

sns.boxplot(data=df, x='blood_pressure', hue='gender', ax=axes[0, 1])
axes[0, 1].set_title('blood_pressure by gender')

sns.boxplot(data=df, x='blood_pressure', hue='gender_smoking', ax=axes[0, 2])
axes[0, 2].set_title('blood_pressure by gender_smoking')

#PLOTS FOR glucose_levels:
sns.boxplot(data=df, x='glucose_levels', hue='smoking_status', ax=axes[1, 0])
axes[1, 0].set_title('glucose_levels by smoking_status')

sns.boxplot(data=df, x='glucose_levels', hue='gender', ax=axes[1, 1])
axes[1, 1].set_title('glucose_levels by gender')

sns.boxplot(data=df, x='glucose_levels', hue='gender_smoking', ax=axes[1, 2])
axes[1, 2].set_title('glucose_levels by gender_smoking')

plt.tight_layout()
plt.show()

In [None]:
df.head()

In [None]:
df.isna().sum()

After imputation, no missing values remain.

In [None]:
sns.countplot(data=df, x='gender', hue='condition')
plt.show()

In [None]:
sns.countplot(data=df, x='smoking_status', hue='condition')
plt.show()

In [None]:
sns.countplot(data=df, x='gender_smoking', hue='condition')
plt.xticks(rotation=45)
plt.show()

In [None]:
sns.pairplot(df)
plt.show()

## Feature Engineering:

In [None]:
df.drop(columns=["id", "full_name", "gender_smoking"], inplace=True)
df.head()

### Feature Encoding:

In [None]:
from sklearn.preprocessing import LabelEncoder

categorical_columns = df.select_dtypes(include=['object']).columns.tolist()

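# Fit one encoder per column and keep it, so the integer codes can be decoded later
label_encoders = {}

for col in categorical_columns:
    label_encoders[col] = LabelEncoder()
    df[col] = label_encoders[col].fit_transform(df[col])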

df.head()
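
To read the classification reports later on, it helps to know which integer stands for which label. `LabelEncoder` sorts the labels it sees, so the code of a label is its position in `classes_`. The toy frame below is a sketch with made-up rows. It assumes the label strings are exactly those listed in the dataset description.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "gender": ["Male", "Female", "Non-Binary", "Female"],
    "condition": ["Diabetic", "Cancer", "Pneumonia", "Diabetic"],
})

# One encoder per column, kept in a dict so each mapping stays available
encoders = {}
for col in toy.columns:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# classes_ is sorted, and the position of each label is its integer code
mapping = {
    col: dict(zip(enc.classes_, range(len(enc.classes_))))
    for col, enc in encoders.items()
}
print(mapping["condition"])

# inverse_transform recovers the original labels from the codes
print(encoders["condition"].inverse_transform([0, 1, 2]))
```

Note that integer codes imply an order (for example Female < Male < Non-Binary) that tree models mostly ignore but linear models and SVMs do not; one-hot encoding is a common alternative for such features.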

In [None]:
data["condition"].value_counts()

The class counts are clearly imbalanced, in line with the distribution given in the dataset description.
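
A quick way to see why the imbalance matters is to compute the class proportions and the accuracy of a model that always predicts the most frequent class. The labels below are a toy series built from the proportions quoted in the dataset description, not the real column.

```python
import pandas as pd

# Toy labels using the 15% / 25% / 60% proportions from the dataset description
labels = pd.Series(
    ["Cancer"] * 15 + ["Pneumonia"] * 25 + ["Diabetic"] * 60, name="condition"
)

proportions = labels.value_counts(normalize=True)
print(proportions)

# Always predicting the most frequent class already reaches this accuracy,
# so accuracy alone says little about the minority classes
majority_class = proportions.idxmax()
baseline_accuracy = proportions.max()
print(majority_class, baseline_accuracy)
```

This is why the notebook relies on per-class precision and recall from `classification_report` rather than on accuracy alone.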

## Handling Imbalanced DATA:

### Splitting DATA:

In [None]:
X = df.drop(columns=["condition"])
y = df["condition"]

### Oversampling using SMOTE:

In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

smote = SMOTE(random_state=42)

X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
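
The split above does not pass `stratify`, so the class proportions in the train and test sets can drift from the original ones. A possible refinement, sketched here on toy labels rather than the real data, is to stratify on the target so that both splits keep the same proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with the 15% / 25% / 60% proportions from the dataset description
y_toy = np.array([0] * 150 + [1] * 250 + [2] * 600)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Both splits keep the original class proportions
print(np.bincount(y_tr) / len(y_tr))
print(np.bincount(y_te) / len(y_te))
```

Resampling with SMOTE still belongs after the split and on the training part only, as in the cell above, so that the test set contains real samples exclusively.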

## RandomForestClassifier:

### Without using resampled Data

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(class_weight='balanced', random_state=42)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))

### Using resampled Data

In [None]:
from sklearn.model_selection import cross_val_predict

model = RandomForestClassifier(random_state=42)
y_pred_cv  = cross_val_predict(model, X_train_resampled, y_train_resampled, cv=5)
print(classification_report(y_train_resampled, y_pred_cv))

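The cross-validated report on the resampled data shows improved detection of classes 0 and 1. The two reports are not directly comparable, though: the first is computed on the held-out test set, while the second is computed on the resampled training set. Because SMOTE was applied before cross-validation, synthetic points interpolated from a fold's training samples also appear in its validation split, which tends to inflate the scores.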

## Logistic Regression:

Logistic regression is sensitive to feature scale (it affects both the regularization penalty and solver convergence), so we standardize the numeric features first.
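
Standardization can also be wrapped in a pipeline so that the scaler is fitted inside each cross-validation fold rather than once on the whole training set. The following is a sketch on a toy frame that reuses the notebook's column names; the values and labels are random, so the predictions themselves are meaningless.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
# Toy frame with the notebook's column names (random values, not the real data)
X_toy = pd.DataFrame({
    "age": rng.uniform(18, 90, n),
    "bmi": rng.uniform(15, 40, n),
    "blood_pressure": rng.uniform(90, 180, n),
    "glucose_levels": rng.uniform(70, 200, n),
    "gender": rng.integers(0, 3, n),
    "smoking_status": rng.integers(0, 3, n),
})
y_toy = rng.integers(0, 3, n)

numeric_cols = ["age", "bmi", "blood_pressure", "glucose_levels"]
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), numeric_cols)],
    remainder="passthrough",  # encoded categorical columns pass through unscaled
)
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("logreg", LogisticRegression(solver="newton-cg")),
])

# The scaler is re-fitted on the training part of every fold
y_pred_toy = cross_val_predict(pipeline, X_toy, y_toy, cv=5)
print(y_pred_toy[:10])
```

Compared with scaling by hand, this avoids the index bookkeeping with `reset_index` and `pd.concat` and keeps validation rows out of the scaler's statistics.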

### Without using resampled Data

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_num = X_train.drop(["gender", "smoking_status"], axis=1)
std = StandardScaler()

X_scaled = std.fit_transform(X_num)

X_scaled = pd.DataFrame(X_scaled, columns=X_num.columns)

X_trsf = pd.concat([X_scaled.reset_index(drop=True), X_train[["gender", "smoking_status"]].reset_index(drop=True)], axis=1)

print(f"X_trsf shape: {X_trsf.shape}, y_train shape: {y_train.shape}")

model = LogisticRegression(multi_class='multinomial', solver="newton-cg", random_state=42)
# Use X_trsf so the encoded categorical columns are included alongside the scaled numeric ones
y_pred_cv = cross_val_predict(model, X_trsf, y_train, cv=5)
print(classification_report(y_train, y_pred_cv))

### Using resampled Data

In [None]:
from sklearn.preprocessing import StandardScaler

X_num = X_train_resampled.drop(["gender", "smoking_status"], axis=1)
std = StandardScaler()

X_scaled = std.fit_transform(X_num)
X_scaled = pd.DataFrame(X_scaled, columns=X_num.columns)
X_scaled = pd.concat([X_scaled.reset_index(drop=True), X_train_resampled[["gender", "smoking_status"]].reset_index(drop=True)], axis=1)

model = LogisticRegression(multi_class='multinomial', solver= "newton-cg", random_state=42)
y_pred_cv = cross_val_predict(model, X_scaled, y_train_resampled, cv=5)
print(classification_report(y_train_resampled, y_pred_cv))

## SVM:

SVMs rely on distances between samples, so features with large ranges would dominate the kernel; we therefore standardize the numeric features first.
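
The effect of scale on distances can be made concrete with two toy features drawn from the ranges given in the dataset description (the values are random, not the real data). For independent rows, the expected squared Euclidean distance is twice the sum of the per-feature variances, so a feature's share of the total variance is also its share of the distance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two toy features using the ranges quoted for bmi and glucose_levels
bmi = rng.uniform(15, 40, 500)
glucose = rng.uniform(70, 200, 500)
X_raw = np.column_stack([bmi, glucose])


def variance_share(X):
    """Share of the total variance (and of the expected squared distance) per feature."""
    variances = X.var(axis=0)
    return variances / variances.sum()


share_raw = variance_share(X_raw)
print("share before scaling [bmi, glucose]:", share_raw.round(3))

X_std = StandardScaler().fit_transform(X_raw)
share_std = variance_share(X_std)
print("share after scaling  [bmi, glucose]:", share_std.round(3))
```

Before scaling, glucose accounts for most of the distance simply because its range is wider; after scaling, both features contribute equally.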

### Without using resampled Data

In [None]:
from sklearn.svm import SVC

X_num = X_train.drop(["gender", "smoking_status"], axis=1)
std = StandardScaler()

X_scaled = std.fit_transform(X_num)
X_scaled = pd.DataFrame(X_scaled, columns=X_num.columns)
X_scaled = pd.concat([X_scaled.reset_index(drop=True), X_train[["gender", "smoking_status"]].reset_index(drop=True)], axis=1)

clf = SVC(C=1, gamma= 100, random_state=42)
y_pred_cv = cross_val_predict(clf, X_scaled, y_train, cv=5)
print(classification_report(y_train, y_pred_cv))

### Using resampled Data

In [None]:
X_num = X_train_resampled.drop(["gender", "smoking_status"], axis=1)
std = StandardScaler()

X_scaled = std.fit_transform(X_num)
X_scaled = pd.DataFrame(X_scaled, columns=X_num.columns)
X_scaled = pd.concat([X_scaled.reset_index(drop=True), X_train_resampled[["gender", "smoking_status"]].reset_index(drop=True)], axis=1)

clf = SVC(C=1, gamma= 100, random_state=42)
y_pred_cv = cross_val_predict(clf, X_scaled, y_train_resampled, cv=5)
print(classification_report(y_train_resampled, y_pred_cv))