
DECISION TREE WITH CATEGORICAL AND NUMERICAL COLUMNS: PYTHON EXAMPLES FOR NHS DATA

Decision trees can be used with both categorical and numerical data. Here, I'll walk through a decision tree classification example in Python for a dataset that includes both types of variables. We'll use synthetic data for illustration, and you can adapt the example to your specific NHS dataset.

1. Decision Tree for Categorical and Numerical Data:

Suppose you have a dataset with a mix of categorical and numerical features, and you want to predict a target variable. In this example, we'll use synthetic data with a mix of both types:

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

In [2]:
# Generate synthetic data (replace with your NHS data)
np.random.seed(42)
n_samples = 1000

In [3]:
# Create a DataFrame with both categorical and numerical features
data = pd.DataFrame({
    'Gender': np.random.choice(['Male', 'Female'], n_samples),
    'Age': np.random.randint(18, 80, n_samples),
    'Smoker': np.random.choice(['Yes', 'No'], n_samples),
    'BMI': np.random.uniform(18, 35, n_samples),
    'Hypertension': np.random.choice(['Yes', 'No'], n_samples),
    'Diabetes': np.random.choice(['Yes', 'No'], n_samples),
    'HeartDisease': np.random.choice(['Yes', 'No'], n_samples),
    'Target': np.random.choice(['HighRisk', 'LowRisk'], n_samples)
})

In [6]:
data

Unnamed: 0,Gender,Age,Smoker,BMI,Hypertension,Diabetes,HeartDisease,Target
0,Male,71,Yes,29.007070,No,Yes,Yes,LowRisk
1,Female,34,No,33.266519,No,No,No,LowRisk
2,Male,26,Yes,22.139199,No,No,Yes,LowRisk
3,Male,50,No,33.759587,No,Yes,No,HighRisk
4,Male,70,No,19.024546,Yes,No,No,LowRisk
...,...,...,...,...,...,...,...,...
995,Male,49,Yes,33.018901,No,Yes,Yes,LowRisk
996,Male,27,Yes,25.621424,Yes,No,No,HighRisk
997,Female,33,Yes,26.677352,No,No,No,LowRisk
998,Female,24,Yes,28.652073,Yes,No,Yes,HighRisk


In [4]:
# One-hot encode the categorical columns (scikit-learn trees need numeric input)
data_encoded = pd.get_dummies(data, columns=['Gender', 'Smoker', 'Hypertension', 'Diabetes', 'HeartDisease'])


In [5]:
data_encoded

Unnamed: 0,Age,BMI,Target,Gender_Female,Gender_Male,Smoker_No,Smoker_Yes,Hypertension_No,Hypertension_Yes,Diabetes_No,Diabetes_Yes,HeartDisease_No,HeartDisease_Yes
0,71,29.007070,LowRisk,0,1,0,1,1,0,0,1,0,1
1,34,33.266519,LowRisk,1,0,1,0,1,0,1,0,1,0
2,26,22.139199,LowRisk,0,1,0,1,1,0,1,0,0,1
3,50,33.759587,HighRisk,0,1,1,0,1,0,0,1,1,0
4,70,19.024546,LowRisk,0,1,1,0,0,1,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,49,33.018901,LowRisk,0,1,0,1,1,0,0,1,0,1
996,27,25.621424,HighRisk,0,1,0,1,0,1,1,0,1,0
997,33,26.677352,LowRisk,1,0,0,1,1,0,1,0,1,0
998,24,28.652073,HighRisk,1,0,0,1,0,1,1,0,0,1
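
The encoded table above keeps two indicator columns for every Yes/No feature, and each pair carries the same information twice. As a minimal sketch, assuming a tiny hypothetical frame (`sample`) in place of the full data, `drop_first=True` keeps one indicator per binary column and `dtype=int` asks for 0/1 values, since newer pandas versions return True/False by default:

```python
import pandas as pd

# Tiny hypothetical frame standing in for the synthetic data above
sample = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male'],
    'Smoker': ['Yes', 'No', 'Yes'],
    'Age': [71, 34, 26],
})

# drop_first=True drops the first category of each encoded column,
# so a Yes/No feature becomes a single 0/1 indicator
compact = pd.get_dummies(
    sample,
    columns=['Gender', 'Smoker'],
    drop_first=True,
    dtype=int,
)

print(compact.columns.tolist())
```

For a tree this is mostly a matter of tidiness, since a split on `Smoker_Yes` and a split on `Smoker_No` separate the same rows.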


In [7]:
# Split the data into features (X) and target (y)
X = data_encoded.drop('Target', axis=1)
y = data_encoded['Target']

In [8]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [9]:
# Create a Decision Tree Classifier
tree_classifier = DecisionTreeClassifier(random_state=42)

In [10]:

# Fit the model on the training data
tree_classifier.fit(X_train, y_train)
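
Once a tree is fitted, it can be useful to look at the rules it learned. The sketch below is an illustration on separate demo data, not part of the notebook's run: `X_demo`, `y_demo` and the "over 50 and smoker" rule are assumptions made up so the tree has a pattern to find. It prints the splits with `export_text` and the feature importances:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Hypothetical demo features (already numeric / one-hot encoded)
X_demo = pd.DataFrame({
    'Age': rng.integers(18, 80, 200),
    'Smoker_Yes': rng.integers(0, 2, 200),
})

# Made-up rule so the tree has a real pattern to learn
y_demo = np.where(
    (X_demo['Age'] > 50) & (X_demo['Smoker_Yes'] == 1),
    'HighRisk',
    'LowRisk',
)

# A shallow tree keeps the printed rules readable
demo_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
demo_tree.fit(X_demo, y_demo)

# Text view of the learned splits
rules = export_text(demo_tree, feature_names=list(X_demo.columns))
print(rules)

# Relative contribution of each feature to the splits
importances = pd.Series(demo_tree.feature_importances_, index=X_demo.columns)
print(importances)
```

The same two calls work on `tree_classifier` with `feature_names=list(X_train.columns)`, although an unrestricted tree on 800 rows will print a very long rule list.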

In [11]:
# Make predictions on the test data
y_pred = tree_classifier.predict(X_test)

In [12]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

In [13]:
print(f'Accuracy: {accuracy}')
print(report)

Accuracy: 0.445
              precision    recall  f1-score   support

    HighRisk       0.40      0.48      0.44        89
     LowRisk       0.50      0.41      0.45       111

    accuracy                           0.45       200
   macro avg       0.45      0.45      0.44       200
weighted avg       0.45      0.45      0.45       200
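
To judge whether an accuracy like the one above means anything, it helps to compare against a trivial baseline. The following is a minimal sketch, assuming a reduced synthetic frame (`X_rand`, `y_rand`, both hypothetical names) whose labels are random in the same way as `Target`; it scores a `DummyClassifier` that always predicts the most frequent class next to a decision tree:

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 1000

# Hypothetical features; the labels are drawn independently of them
X_rand = pd.DataFrame({
    'Age': rng.integers(18, 80, n),
    'BMI': rng.uniform(18, 35, n),
})
y_rand = rng.choice(['HighRisk', 'LowRisk'], n)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_rand, y_rand, test_size=0.2, random_state=42
)

# Baseline: always predict the most frequent training class
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
tree_acc = tree.score(X_te, y_te)
print(f'Baseline accuracy: {baseline_acc:.3f}')
print(f'Tree accuracy:     {tree_acc:.3f}')
```

When the labels are random, both numbers should land near 0.5. On real data, a tree that cannot beat this baseline has not learned anything useful.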



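In this example, we generated synthetic data with both categorical and numerical features, one-hot encoded the categorical variables, and used a Decision Tree Classifier to predict a target variable ("HighRisk" or "LowRisk"). Because the synthetic Target is drawn at random, independently of every feature, there is nothing for the tree to learn. An accuracy of 0.445 on a two-class problem is therefore no better than guessing, and it says nothing about how the method will perform on real data.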

Replace the synthetic data with your actual NHS dataset and adjust the target variable and feature columns accordingly.
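
As a rough sketch of that swap, the snippet below reads a small in-memory CSV and prepares `X` and `y` from it. The CSV content, the column names and the `target_col` value are all assumptions made for illustration; a real extract would be loaded with `pd.read_csv` on your own file and will have its own columns:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real extract
csv_text = """Gender,Age,Smoker,BMI,Target
Male,71,Yes,29.0,LowRisk
Female,34,No,33.3,LowRisk
Male,50,No,33.8,HighRisk
"""

# For a real file, pass its path to pd.read_csv instead of the StringIO object
real_data = pd.read_csv(io.StringIO(csv_text))

target_col = 'Target'  # assumption: name of the column to predict
features = real_data.drop(columns=target_col)

# Treat every non-numeric feature column as categorical
categorical_cols = features.select_dtypes(exclude='number').columns.tolist()

# One-hot encode only those columns; numeric columns pass through unchanged
X_real = pd.get_dummies(features, columns=categorical_cols, dtype=int)
y_real = real_data[target_col]

print(categorical_cols)
print(X_real.columns.tolist())
```

Detecting categorical columns by dtype is a convenience, not a rule: numerically coded categories (for example a ward or diagnosis code stored as integers) would need to be listed by hand.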

Remember that preprocessing, feature selection, and hyperparameter tuning are essential steps when working with real-world datasets to improve model performance and interpretability.
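
One possible way to combine preprocessing and hyperparameter tuning is a scikit-learn `Pipeline` searched with `GridSearchCV`. This is a sketch under stated assumptions, not a recommended configuration: the frame `df`, the made-up "smoker with BMI above 28" rule and the grid values are hypothetical and chosen only so the example runs quickly:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 300

# Hypothetical data with one categorical and two numerical features
df = pd.DataFrame({
    'Smoker': rng.choice(['Yes', 'No'], n),
    'Age': rng.integers(18, 80, n),
    'BMI': rng.uniform(18, 35, n),
})
# Made-up rule so there is a pattern for the tree to recover
df['Target'] = np.where(
    (df['Smoker'] == 'Yes') & (df['BMI'] > 28), 'HighRisk', 'LowRisk'
)

# One-hot encode the categorical column, pass numeric columns through
preprocess = ColumnTransformer(
    [('cat', OneHotEncoder(handle_unknown='ignore'), ['Smoker'])],
    remainder='passthrough',
)

pipeline = Pipeline([
    ('prep', preprocess),
    ('tree', DecisionTreeClassifier(random_state=42)),
])

# Small illustrative grid; limiting depth and leaf size curbs overfitting
param_grid = {
    'tree__max_depth': [2, 3, 5],
    'tree__min_samples_leaf': [1, 5, 20],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(df.drop(columns='Target'), df['Target'])

print(search.best_params_)
print(round(search.best_score_, 3))
```

Keeping the encoder inside the pipeline means it is fitted on the training folds only during cross-validation, which avoids leaking information from the validation fold.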