# Disease Prediction from Medical Data

**Objective**: Predict the likelihood of heart disease from patient data using classification techniques.

**Dataset**: Sample data mimicking the UCI Heart Disease dataset (13 features, binary target).

**Features**: age, sex, cp (chest pain type), trestbps (resting blood pressure), chol (cholesterol), fbs (fasting blood sugar), restecg (ECG results), thalach (max heart rate), exang (exercise-induced angina), oldpeak (ST depression), slope, ca (number of major vessels), thal.

**Algorithms**: Logistic Regression, SVM, Random Forest, XGBoost.

**Steps**:
1. Load and preprocess the sample dataset.
2. Train and evaluate models.
3. Compare performance using accuracy, precision, recall, and F1-score.

In [3]:
%pip install xgboost

import pandas as pd
import io
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report

# Sample dataset (mimicking UCI Heart Disease dataset)
sample_data = """
age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
37,1,2,130,250,0,1,187,0,3.5,0,0,2,0
41,0,1,130,204,0,0,172,0,1.4,2,0,2,0
56,1,1,120,236,0,1,178,0,0.8,2,0,2,0
57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
57,1,0,140,192,0,1,148,0,0.4,1,0,1,0
56,0,1,140,294,0,0,153,0,1.3,1,0,2,0
44,1,1,120,263,0,1,173,0,0,2,0,3,0
"""

# Load sample data
data = pd.read_csv(io.StringIO(sample_data))

# Display first few rows
print("Sample Dataset:")
print(data.head())

# Check for missing values
print("\nMissing Values:")
print(data.isnull().sum())


Sample Dataset:
   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  \
0   63    1   3       145   233    1        0      150      0      2.3      0   
1   37    1   2       130   250    0        1      187      0      3.5      0   
2   41    0   1       130   204    0        0      172      0      1.4      2   
3   56    1   1       120   236    0        1      178      0      0.8      2   
4   57    0   0       120   354    0        1      163      1      0.6      2   

   ca  thal  target  
0   0     1       1  
1   0     2       0  
2   0     2       0  
3   0     2       0  
4   0     2       1  

Missing Values:
age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64


## Data Preprocessing

- Separate features and target.
- Scale numerical features using StandardScaler.
- Split data into training (80%) and testing (20%) sets.

In [4]:
# Separate features and target
X = data.drop('target', axis=1)
y = data['target']

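# Scale features
# Caveat: fitting the scaler on all rows before splitting lets test-set
# statistics leak into training. In a real evaluation, split first and
# fit the scaler on the training rows only.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)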

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)

Training set shape: (6, 13)
Testing set shape: (2, 13)
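
A leak-free variant of the preprocessing above can be sketched with a scikit-learn pipeline. This is a minimal illustration, not the notebook's own method: the synthetic data from `make_classification` is an assumed stand-in for the heart-disease table, and the names `X_demo`, `y_demo`, and `leak_free_model` are hypothetical. The split happens first, with `stratify` keeping the class ratio similar in both subsets, and the scaler is fitted only on the training rows.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for the heart-disease table
# (200 rows, 13 features, binary target).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=13, n_informative=6, random_state=42
)

# Split first; stratify keeps the class ratio similar in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The pipeline fits the scaler on the training rows only and reuses
# those statistics when it transforms the test rows.
leak_free_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
leak_free_model.fit(X_tr, y_tr)

print("Held-out accuracy:", round(leak_free_model.score(X_te, y_te), 2))
```

Wrapping the scaler and the classifier in one object also means the same preprocessing is applied automatically at prediction time.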


## Model Training and Evaluation

Train four models: Logistic Regression, SVM, Random Forest, and XGBoost.
Evaluate each using accuracy and classification report (precision, recall, F1-score).

In [5]:
# Define models
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'SVM': SVC(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42, n_estimators=100),
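    # Recent XGBoost releases ignore use_label_encoder (hence the warning
    # in the output); the argument can be dropped.
    'XGBoost': XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss')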
}

# Train and evaluate models
for name, model in models.items():
    print(f"\nTraining {name}...")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name} Accuracy: {accuracy_score(y_test, y_pred):.2f}")
    print(f"Classification Report for {name}:")
    print(classification_report(y_test, y_pred, zero_division=0))


Training Logistic Regression...
Logistic Regression Accuracy: 1.00
Classification Report for Logistic Regression:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         2

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2


Training SVM...
SVM Accuracy: 1.00
Classification Report for SVM:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         2

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2


Training Random Forest...
Random Forest Accuracy: 0.50
Classification Report for Random Forest:
              precision    recall  f1-score   support

           0       1.00      0.50      0.67         2
           1       0.00      0.00      0.00         0

    accurac

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
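
The Random Forest rows in the report above are consistent with both test samples belonging to class 0 and one of them being predicted as class 1. The notebook does not print the predictions, so the label lists below are an inferred assumption, and `per_class_metrics` is a hypothetical helper written only to show how precision, recall, and F1 follow from the counts.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    # A ratio with a zero denominator is reported as 0,
    # which mirrors classification_report(..., zero_division=0).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1, tp + fn


# Assumed labels: two class-0 test rows, one misclassified as class 1.
y_true_demo = [0, 0]
y_pred_demo = [0, 1]

for label in (0, 1):
    p, r, f, n = per_class_metrics(y_true_demo, y_pred_demo, label)
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={n}")
# class 0: precision=1.00 recall=0.50 f1=0.67 support=2
# class 1: precision=0.00 recall=0.00 f1=0.00 support=0
```

Class 1 has no true samples in the test set, so its recall is undefined and is shown as 0 only because of `zero_division=0`.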


## Notes

- The sample dataset is deliberately tiny and serves only as a demonstration. For real-world use, download the full UCI Heart Disease dataset from [UCI ML Repository](https://archive.ics.uci.edu/ml/datasets/Heart+Disease).
- With only eight rows, the test split holds two samples, both of class 0, so the reported accuracy, precision, recall, and F1-score say little about real-world performance.
- Consider hyperparameter tuning (e.g., GridSearchCV) for better performance.
- For deployment, save the best model together with its fitted scaler using `joblib` or `pickle`, since the model expects scaled inputs, and create an API (e.g., Flask).
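
The tuning and persistence notes can be sketched together as follows. This is an illustration under stated assumptions rather than a recommended configuration: the data is synthetic (`make_classification` standing in for the full dataset), the parameter grid is arbitrary, and the file name `heart_model.joblib` is hypothetical. Searching over a pipeline keeps the scaler inside each cross-validation fold and lets the scaler and classifier be saved as one object.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: synthetic stand-in for the full heart-disease dataset.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=13, n_informative=6, random_state=42
)

# Scaler and classifier in one pipeline, so each CV fold scales on its own training part.
pipe = Pipeline([("scaler", StandardScaler()), ("clf", SVC())])

# Illustrative grid; the values are arbitrary choices, not tuned recommendations.
param_grid = {"clf__C": [0.1, 1, 10], "clf__kernel": ["linear", "rbf"]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="f1")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 2))

# Persist the refitted best pipeline (scaler + classifier) and load it back.
model_path = os.path.join(tempfile.mkdtemp(), "heart_model.joblib")
joblib.dump(search.best_estimator_, model_path)
restored = joblib.load(model_path)

# The restored pipeline should reproduce the original predictions.
same = (restored.predict(X_demo) == search.best_estimator_.predict(X_demo)).all()
print("Restored model matches:", bool(same))
```

A web service such as a Flask app would then only need to call `joblib.load` once at startup and pass raw, unscaled feature rows to `restored.predict`.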