You are a data scientist working for a healthcare company, and you have been tasked with creating a 
decision tree to help identify patients with diabetes based on a set of clinical variables. You have been 
given a dataset (diabetes.csv) with the following variables:

1. Pregnancies: Number of times pregnant (integer)

2. Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test (integer)

3. BloodPressure: Diastolic blood pressure (mm Hg) (integer)

4. SkinThickness: Triceps skin fold thickness (mm) (integer)

5. Insulin: 2-Hour serum insulin (mu U/ml) (integer)

6. BMI: Body mass index (weight in kg/(height in m)^2) (float)

7. DiabetesPedigreeFunction: Diabetes pedigree function (a function which scores likelihood of diabetes 
based on family history) (float)

8. Age: Age in years (integer)

9. Outcome: Class variable (0 if non-diabetic, 1 if diabetic) (integer)

Ans:- To create a decision tree for identifying patients with diabetes from the given clinical variables, we can follow the general steps below using Python with pandas and scikit-learn. The code assumes a CSV file named "diabetes.csv" containing the columns listed above, where "Outcome" is the target variable indicating whether a patient has diabetes (1 for positive, 0 for negative).

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.tree import export_text

# Load the dataset
file_path = 'path/to/diabetes.csv'  # Replace with the actual path to your CSV file
df = pd.read_csv(file_path)

# Display the first few rows of the dataset
print(df.head())

# Separate features (X) and target variable (y)
X = df.drop('Outcome', axis=1)
y = df['Outcome']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Decision Tree model
model = DecisionTreeClassifier(random_state=42)

# Train the model
model.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

# Display the evaluation metrics
print(f'Accuracy: {accuracy:.4f}')
print('\nConfusion Matrix:\n', conf_matrix)
print('\nClassification Report:\n', classification_rep)

# Display the decision tree rules
tree_rules = export_text(model, feature_names=list(X.columns))
print('\nDecision Tree Rules:\n', tree_rules)


Q1. Import the dataset and examine the variables. Use descriptive statistics and visualizations to 
understand the distribution and relationships between the variables.
Ans:- To import the dataset and examine the variables with descriptive statistics and visualizations, you can use Python with pandas, matplotlib, and seaborn. The example below prints summary statistics, then plots a correlation heatmap, histograms of every variable, and box plots of Glucose, BMI, and Age grouped by Outcome. Replace 'path/to/diabetes.csv' with the actual path to your dataset.

In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
file_path = 'path/to/diabetes.csv'  # Replace with the actual path to your CSV file
df = pd.read_csv(file_path)

# Display the first few rows of the dataset
print(df.head())

# Descriptive statistics
print("\nDescriptive Statistics:")
print(df.describe())

# Pairwise correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Pairwise Correlation Heatmap')
plt.show()

# Distribution of numerical variables
df.hist(figsize=(12, 10))
plt.suptitle('Distribution of Numerical Variables', y=0.92)
plt.show()

# Box plots for key variables
plt.figure(figsize=(14, 6))
plt.subplot(1, 3, 1)
sns.boxplot(x='Outcome', y='Glucose', data=df)
plt.title('Box Plot for Glucose')

plt.subplot(1, 3, 2)
sns.boxplot(x='Outcome', y='BMI', data=df)
plt.title('Box Plot for BMI')

plt.subplot(1, 3, 3)
sns.boxplot(x='Outcome', y='Age', data=df)
plt.title('Box Plot for Age')

plt.show()


Q2. Preprocess the data by cleaning missing values, removing outliers, and transforming categorical 
variables into dummy variables if necessary.

In [None]:
# Import necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

# Load the dataset
file_path = 'path/to/diabetes.csv'  # Replace with the actual path to your CSV file
df = pd.read_csv(file_path)

# Display missing values
print("Missing Values Before Preprocessing:")
print(df.isnull().sum())

# Visualize missing values
plt.figure(figsize=(8, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

# Handling missing values (replace with median for numerical features)
imputer = SimpleImputer(strategy='median')
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Outlier detection and removal using Isolation Forest
outlier_detector = IsolationForest(contamination=0.05, random_state=42)
outliers = outlier_detector.fit_predict(df_filled)
df_cleaned = df_filled.loc[outliers != -1]

# Transform categorical variables into dummy variables (if any)
# Example: df_cleaned = pd.get_dummies(df_cleaned, columns=['CategoricalVariable'])

# Display missing values after preprocessing
print("\nMissing Values After Preprocessing:")
print(df_cleaned.isnull().sum())

# Visualize missing values after preprocessing
plt.figure(figsize=(8, 6))
sns.heatmap(df_cleaned.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap After Preprocessing')
plt.show()

# Additional preprocessing steps (if needed), such as scaling numerical features
# Example: scaler = StandardScaler()
# df_scaled = pd.DataFrame(scaler.fit_transform(df_cleaned), columns=df_cleaned.columns)

# Split the data into features and target variable
X = df_cleaned.drop('Outcome', axis=1)
y = df_cleaned['Outcome']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Now, X_train, X_test, y_train, y_test can be used for further analysis or modeling
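
One caveat about the missing-value check above: if diabetes.csv is the widely used Pima Indians diabetes data (an assumption; the column names match), missing measurements are stored as 0 rather than as NaN, so df.isnull() reports nothing. The sketch below shows one possible way to handle that on a small made-up sample. The sample values and the choice of columns treated as "0 means not measured" are illustrative assumptions, not taken from the real file.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample standing in for diabetes.csv (values are illustrative only)
sample = pd.DataFrame({
    'Pregnancies': [6, 1, 0, 1],      # 0 is a valid value for this column
    'Glucose': [148, 0, 183, 89],
    'BloodPressure': [72, 66, 0, 66],
    'BMI': [33.6, 26.6, 23.3, 0.0],
    'Outcome': [1, 0, 1, 0],
})

# Columns where a value of 0 is assumed to mean "not measured"
zero_as_missing = ['Glucose', 'BloodPressure', 'BMI']

cleaned = sample.copy()
for col in zero_as_missing:
    # Turn the placeholder zeros into real missing values
    cleaned[col] = cleaned[col].replace(0, np.nan)

# Now the missing values are visible to isnull()
missing_counts = cleaned.isnull().sum()
print(missing_counts)

for col in zero_as_missing:
    # Median imputation, column by column
    cleaned[col] = cleaned[col].fillna(cleaned[col].median())

print(cleaned)
```

On the real data the same replacement would go before the SimpleImputer step, so that the median is computed from measured values only.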


Q3. Split the dataset into a training set and a test set. Use a random seed to ensure reproducibility.

In [None]:
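# Import necessary libraries
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split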

# Load the dataset
file_path = 'path/to/diabetes.csv'  # Replace with the actual path to your CSV file
df = pd.read_csv(file_path)

# Handling missing values (replace with median for numerical features)
imputer = SimpleImputer(strategy='median')
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Outlier detection and removal using Isolation Forest
outlier_detector = IsolationForest(contamination=0.05, random_state=42)
outliers = outlier_detector.fit_predict(df_filled)
df_cleaned = df_filled.loc[outliers != -1]

# Transform categorical variables into dummy variables (if any)
# Example: df_cleaned = pd.get_dummies(df_cleaned, columns=['CategoricalVariable'])

# Split the data into features and target variable
X = df_cleaned.drop('Outcome', axis=1)
y = df_cleaned['Outcome']

# Split the data into training and testing sets with a random seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Now, X_train, X_test, y_train, y_test can be used for further analysis or modeling
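
To illustrate what the random seed buys us, here is a small sketch on synthetic data (the arrays X_demo and y_demo are made up for this example). It checks that two calls with the same random_state give identical splits, and it also uses the optional stratify argument, which is not part of the code above, to keep the share of positive cases the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 35 positive cases (made up)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 35 + [0] * 65)

# Two splits with the same seed
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# True when every returned array matches between the two calls
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

# Share of positive cases in the training and test labels
y_train_demo, y_test_demo = split_a[2], split_a[3]
print('Identical splits:', same_split)
print('Positive share (train, test):', y_train_demo.mean(), y_test_demo.mean())
```

Stratifying is a design choice rather than a requirement of the question; it is mainly useful when one class is much rarer than the other.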


Q4. Use a decision tree algorithm, such as ID3 or C4.5, to train a decision tree model on the training set. Use 
cross-validation to optimize the hyperparameters and avoid overfitting.

In [None]:
# Import necessary libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score  # used for the test-set accuracy below

# Create a Decision Tree model
dt_model = DecisionTreeClassifier(random_state=42)

# Define hyperparameters to search over
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Perform GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(estimator=dt_model, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Print the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:")
print(best_params)

# GridSearchCV refits the best model on the entire training set by default (refit=True),
# so best_estimator_ is ready to use without another call to fit
best_dt_model = grid_search.best_estimator_

# Evaluate the model on the test set
y_pred_test = best_dt_model.predict(X_test)

# Print the accuracy on the test set
accuracy_test = accuracy_score(y_test, y_pred_test)
print(f'Accuracy on Test Set: {accuracy_test:.4f}')
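
A note on the algorithm: scikit-learn's DecisionTreeClassifier implements an optimized version of CART rather than ID3 or C4.5. Setting criterion='entropy' makes it choose splits by information gain, which is the measure ID3 is built on. The sketch below computes that quantity by hand so the idea is concrete. The helper names entropy and information_gain, and the glucose readings, are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting the rows on values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # the split puts every row on one side, so nothing is gained
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


# Made-up glucose readings and outcomes, for illustration only
glucose = [85, 90, 110, 130, 150, 170]
outcome = [0, 0, 0, 1, 1, 1]

gain_clean = information_gain(glucose, outcome, threshold=120)  # separates the classes
gain_poor = information_gain(glucose, outcome, threshold=85)    # isolates a single row
print(f'Gain at 120: {gain_clean:.3f}, gain at 85: {gain_poor:.3f}')
```

A tree built with the entropy criterion picks, at each node, the feature and threshold with the largest gain of this kind.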


Q5. Evaluate the performance of the decision tree model on the test set using metrics such as accuracy, 
precision, recall, and F1 score. Use confusion matrices and ROC curves to visualize the results.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, auc
import matplotlib.pyplot as plt

# Make predictions on the test set
y_pred_test = best_dt_model.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred_test)
precision = precision_score(y_test, y_pred_test)
recall = recall_score(y_test, y_pred_test)
f1 = f1_score(y_test, y_pred_test)

# Print evaluation metrics
print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')

# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred_test)
print('\nConfusion Matrix:\n', conf_matrix)

# ROC curve
fpr, tpr, _ = roc_curve(y_test, best_dt_model.predict_proba(X_test)[:, 1])
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (AUC = {:.2f})'.format(roc_auc))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.show()
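
To read the confusion matrix printed above, it helps to see how the four metrics follow from its cells. For a binary problem, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]] (rows are true classes, columns are predicted classes). The counts below are hypothetical and are not results from this dataset.

```python
# Hypothetical confusion-matrix counts (not results from diabetes.csv)
tn, fp, fn, tp = 85, 15, 20, 34

# Share of all predictions that are correct
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Of the patients predicted diabetic, the share who really are
precision = tp / (tp + fp)

# Of the patients who really are diabetic, the share the model found
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')
```

With a real matrix, the four counts can be unpacked with tn, fp, fn, tp = conf_matrix.ravel(). In a screening setting, recall usually deserves particular attention because a false negative is a missed diabetic patient.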


Q6. Interpret the decision tree by examining the splits, branches, and leaves. Identify the most important 
variables and their thresholds. Use domain knowledge and common sense to explain the patterns and 
trends.
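
A minimal sketch of how the interpretation could be approached, run here on synthetic data so that it is self-contained. The DataFrame demo and its labels are made up so that the outcome depends only on Glucose; with the real data, best_dt_model and X_train would take the place of tree and demo. It reads the feature importances, the root split and its threshold, and the full set of rules.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (made up): the label depends only on Glucose
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame({
    'Glucose': rng.normal(120, 30, n),
    'BMI': rng.normal(32, 7, n),
    'Age': rng.integers(21, 80, n),
})
labels = (demo['Glucose'] > 140).astype(int)

# A shallow tree is easier to read than a fully grown one
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(demo, labels)

# Impurity-based importance of each variable, largest first
importances = pd.Series(tree.feature_importances_, index=demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)

# The root node holds the single most informative split
root_feature = demo.columns[tree.tree_.feature[0]]
root_threshold = tree.tree_.threshold[0]
print(f'Root split: {root_feature} <= {root_threshold:.1f}')

# Every branch and leaf as nested if/else rules
print(export_text(tree, feature_names=list(demo.columns)))
```

On the real model, each important variable and threshold should be checked against domain knowledge. A top split on Glucose, for example, would be clinically plausible because plasma glucose after an oral glucose tolerance test is itself used in diagnosing diabetes. Thresholds that make no clinical sense, or very small leaves, are more likely signs of overfitting than real patterns.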