<a href="https://colab.research.google.com/github/lav7979/Python-basics/blob/main/Decision_Tree.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

  1  What is a Decision Tree, and how does it work in the context of
  classification?



            A Decision Tree mimics human decision-making. It's structured like a tree, where:

          Nodes represent tests on features (e.g., "Is Age > 30?")

          Branches represent the outcome of the test (Yes/No)

          Leaves represent the final decision or class label (e.g., "Buy" or "Don't Buy")
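
To make the node/branch/leaf vocabulary concrete, here is a minimal sketch of a hand-built tree stored as nested dictionaries. The structure and the `classify` helper are hypothetical illustrations, not how scikit-learn stores trees (scikit-learn uses binary splits on numeric thresholds).

```python
# A hand-built tree for a "buy a computer" decision (hypothetical structure).
# Internal nodes hold a feature to test; leaves hold a class label.
tree = {
    "feature": "Age",
    "branches": {
        "<=30": {"label": "No"},
        "31-40": {"label": "Yes"},
        ">40": {
            "feature": "Income",
            "branches": {
                "Medium": {"label": "Yes"},
                "Low": {"label": "No"},
            },
        },
    },
}


def classify(node, sample):
    """Follow branches from the root until a leaf is reached."""
    while "label" not in node:
        value = sample[node["feature"]]   # outcome of the test at this node
        node = node["branches"][value]    # move down the matching branch
    return node["label"]


prediction = classify(tree, {"Age": ">40", "Income": "Low"})
print(prediction)  # prints: No
```

Each pass through the loop is one test at a node; the loop ends when a leaf (a dictionary with a `label`) is reached.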

            How Does It Work (for Classification)?

          The tree starts at the root node.

          It splits the dataset based on a feature that provides the best separation (using metrics like Gini Impurity or Information Gain).

          The process is repeated recursively on each branch until:

          All data points in a node belong to the same class, or

          A stopping criterion is met (e.g., max depth, min samples per leaf).



 Example:


                  Sample Dataset:
                  Age	Income	BuyComputer
                  <=30	High	No
                  <=30	Medium	No
                  31–40	High	Yes
                  >40	Medium	Yes
                  >40	Low	No
                  31–40	Low	Yes



        from sklearn import tree
        import pandas as pd
        from sklearn.preprocessing import LabelEncoder

        # Sample data
        data = {
            'Age': ['<=30', '<=30', '31-40', '>40', '>40', '31-40'],
            'Income': ['High', 'Medium', 'High', 'Medium', 'Low', 'Low'],
            'BuyComputer': ['No', 'No', 'Yes', 'Yes', 'No', 'Yes']
        }

        df = pd.DataFrame(data)

        # Encoding categorical features
        le_age = LabelEncoder()
        le_income = LabelEncoder()
        le_label = LabelEncoder()

        df['Age_enc'] = le_age.fit_transform(df['Age'])
        df['Income_enc'] = le_income.fit_transform(df['Income'])
        df['BuyComputer_enc'] = le_label.fit_transform(df['BuyComputer'])

        # Features and Target
        X = df[['Age_enc', 'Income_enc']]
        y = df['BuyComputer_enc']

        # Train Decision Tree
        clf = tree.DecisionTreeClassifier(criterion='entropy')  # using Information Gain
        clf = clf.fit(X, y)

        # Predict for a new sample
        sample = [[le_age.transform(['31-40'])[0], le_income.transform(['High'])[0]]]
        prediction = clf.predict(sample)



        # Decode the prediction back to its original label
        predicted_label = le_label.inverse_transform(prediction)
        print("Prediction for Age=31-40 and Income=High:", predicted_label[0])



 Output:


          Prediction for Age=31-40 and Income=High: Yes




   2  Explain the concepts of Gini Impurity and Entropy as impurity measures.
     How do they impact the splits in a Decision Tree?



                Gini Impurity measures the probability that a randomly chosen element from the set would be misclassified if it were labeled at random according to the class distribution in that set.



            $$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{n} p_i^2$$

            Where:

            $p_i$ is the probability (proportion of samples) of class $i$ in $D$, and $n$ is the number of classes.

             Gini needs no logarithm, so it is slightly cheaper to compute, and it is the default criterion in scikit-learn.

            

            Entropy measures the amount of uncertainty or information content.

            Formula:

            $$\mathrm{Entropy}(D) = -\sum_{i=1}^{n} p_i \log_2(p_i)$$

            Lower entropy → more "pure".
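
A minimal sketch of both impurity measures, computed directly from a list of class labels with NumPy. The function names `gini` and `entropy` and the example label lists are illustrative choices, not part of scikit-learn.

```python
import numpy as np


def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))


def entropy(labels):
    """Entropy in bits: -sum(p_i * log2(p_i)) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # Adding 0.0 turns the -0.0 produced by a pure node into 0.0
    return float(-np.sum(p * np.log2(p))) + 0.0


mixed = ["Yes", "Yes", "Yes", "No", "No", "No"]   # 50/50 split: maximally impure
skewed = ["Yes", "Yes", "Yes", "No"]              # 75/25 split
pure = ["Yes", "Yes", "Yes", "Yes"]               # single class

for name, labels in [("mixed", mixed), ("skewed", skewed), ("pure", pure)]:
    print(name, "gini =", round(gini(labels), 4), "entropy =", round(entropy(labels), 4))
```

For two classes, Gini ranges from 0 to 0.5 and entropy from 0 to 1, and both reach 0 only when the node contains a single class.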



            The algorithm looks at each feature and computes the impurity (Gini or Entropy) after a split.

            It chooses the feature and threshold that lead to the greatest reduction in impurity.
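
The search for the best threshold on a single numeric feature can be sketched as follows. This is a simplified illustration under stated assumptions: the ages and labels are made-up values, `best_threshold` is a hypothetical helper, and candidate thresholds are taken as midpoints between consecutive distinct values.

```python
import numpy as np


def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def best_threshold(x, y):
    """Return the threshold with the largest Gini decrease, and that decrease."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    parent = gini(y)
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2   # midpoints between sorted values
    best_t, best_gain = None, 0.0
    for t in candidates:
        left, right = y[x <= t], y[x > t]
        # Impurity after the split is the size-weighted average of the children
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        gain = parent - weighted
        if gain > best_gain:
            best_t, best_gain = float(t), float(gain)
    return best_t, best_gain


# Hypothetical numeric ages with buy / don't-buy labels
ages = [22, 25, 35, 38, 45, 52]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

threshold, decrease = best_threshold(ages, labels)
print("Best threshold:", threshold, "impurity decrease:", round(decrease, 4))
```

A full tree builder repeats this search over every feature at every node and keeps the overall best (feature, threshold) pair.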

          

           

            from sklearn import tree
            import pandas as pd
            from sklearn.preprocessing import LabelEncoder


            data = {
                'Age': ['<=30', '<=30', '31-40', '>40', '>40', '31-40'],
                'Income': ['High', 'Medium', 'High', 'Medium', 'Low', 'Low'],
                'BuyComputer': ['No', 'No', 'Yes', 'Yes', 'No', 'Yes']
            }

            df = pd.DataFrame(data)

            # Encode categorical variables
            le_age = LabelEncoder()
            le_income = LabelEncoder()
            le_label = LabelEncoder()

            df['Age_enc'] = le_age.fit_transform(df['Age'])
            df['Income_enc'] = le_income.fit_transform(df['Income'])
            df['BuyComputer_enc'] = le_label.fit_transform(df['BuyComputer'])

            # Features and target
            X = df[['Age_enc', 'Income_enc']]
            y = df['BuyComputer_enc']

            # Gini Tree
            clf_gini = tree.DecisionTreeClassifier(criterion='gini')
            clf_gini = clf_gini.fit(X, y)

            # Entropy Tree
            clf_entropy = tree.DecisionTreeClassifier(criterion='entropy')
            clf_entropy = clf_entropy.fit(X, y)

            # Prediction input: Age = 31–40, Income = High
            sample = [[le_age.transform(['31-40'])[0], le_income.transform(['High'])[0]]]
            pred_gini = clf_gini.predict(sample)
            pred_entropy = clf_entropy.predict(sample)

            # Decode predictions
            label_gini = le_label.inverse_transform(pred_gini)
            label_entropy = le_label.inverse_transform(pred_entropy)

            print("Prediction using Gini     :", label_gini[0])
            print("Prediction using Entropy  :", label_entropy[0])




 Output:


        Prediction using Gini     : Yes
        Prediction using Entropy  : Yes


        In this case, both Gini and Entropy made the same prediction, but how they split the data internally may differ.
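
One way to check whether the two criteria actually built different trees is to print their rules with `sklearn.tree.export_text`. The sketch below rebuilds the same toy dataset; `random_state=0` is an arbitrary choice added for reproducibility.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({
    "Age": ["<=30", "<=30", "31-40", ">40", ">40", "31-40"],
    "Income": ["High", "Medium", "High", "Medium", "Low", "Low"],
    "BuyComputer": ["No", "No", "Yes", "Yes", "No", "Yes"],
})

# Encode each column to integers, as in the example above
X = pd.DataFrame({
    "Age_enc": LabelEncoder().fit_transform(df["Age"]),
    "Income_enc": LabelEncoder().fit_transform(df["Income"]),
})
y = LabelEncoder().fit_transform(df["BuyComputer"])

trees = {}
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0).fit(X, y)
    trees[criterion] = clf
    print(f"--- {criterion} ---")
    print(export_text(clf, feature_names=list(X.columns)))
```

If the printed rules are identical, the criteria agreed on every split for this dataset; on larger datasets the thresholds or the order of features can differ.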




3  What is the difference between Pre-Pruning and Post-Pruning in Decision
  Trees? Give one practical advantage of using each?


            Aspect         Pre-Pruning                                       Post-Pruning
            When           During tree construction                          After the full tree is built
            How            Stops tree growth early based on conditions       Grows the full tree, then trims unnecessary branches
            Also called    Early stopping                                    Backward pruning (e.g., Reduced Error Pruning, Cost Complexity Pruning)
            Goal           Avoid overfitting by limiting complexity early    Remove overfitting branches after the tree is built



            Pre-pruning stops splitting a node when any of these conditions is met:

            Maximum depth is reached (max_depth)

            Node has too few samples (min_samples_split, min_samples_leaf)

            Decrease in impurity is below a threshold (min_impurity_decrease)
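
A short sketch of how these stopping conditions change the size of the tree. The specific parameter values (2, 10, 0.02) are arbitrary illustrations, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each entry applies one pre-pruning condition; values are illustrative only
settings = {
    "unrestricted": {},
    "max_depth=2": {"max_depth": 2},
    "min_samples_leaf=10": {"min_samples_leaf": 10},
    "min_impurity_decrease=0.02": {"min_impurity_decrease": 0.02},
}

sizes = {}
for name, params in settings.items():
    clf = DecisionTreeClassifier(random_state=0, **params).fit(X, y)
    sizes[name] = (clf.get_depth(), clf.get_n_leaves())
    print(f"{name:28s} depth={sizes[name][0]}  leaves={sizes[name][1]}")
```

Comparing depth and leaf count against the unrestricted tree shows how strongly each condition limits growth.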

          

            Post-pruning works in the opposite direction:

            Start with a fully grown tree

            Remove branches that do not improve validation accuracy

            Technique in scikit-learn: Minimal Cost-Complexity Pruning, controlled by ccp_alpha
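
Rather than picking a pruning strength by hand, `ccp_alpha` can be chosen by cross-validation over the candidate values returned by `cost_complexity_pruning_path`. This is a sketch under assumptions: 5-fold cross-validation, a stratified split, and `random_state=42` are choices made here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Candidate alphas from the pruning path of the fully grown tree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
# Drop the last alpha (it prunes down to a single node) and clip float noise below zero
alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0, None))

# Mean cross-validated accuracy for each candidate alpha
scores = [
    cross_val_score(
        DecisionTreeClassifier(random_state=42, ccp_alpha=a), X_train, y_train, cv=5
    ).mean()
    for a in alphas
]
best_alpha = float(alphas[int(np.argmax(scores))])

full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha).fit(X_train, y_train)

print("Best ccp_alpha:", round(best_alpha, 4))
print("Leaves (full / pruned):", full.get_n_leaves(), "/", pruned.get_n_leaves())
print("Test accuracy (pruned):", round(pruned.score(X_test, y_test), 4))
```

The test set is only used once at the end, so the choice of alpha is not tuned on the data used to report accuracy.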

           
            Practical advantage of each:

            Pre-Pruning     Faster training, especially on large datasets, because the full tree is never built
            Post-Pruning    Often better accuracy, because the tree is grown fully before it is simplified


 Example:





          from sklearn.tree import DecisionTreeClassifier
          from sklearn.model_selection import train_test_split
          from sklearn.datasets import load_iris
          from sklearn.metrics import accuracy_score

          # Load sample data
          data = load_iris()
          X, y = data.data, data.target

          # Split data
          X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

          # Pre-Pruning: limit depth
          clf_pre = DecisionTreeClassifier(max_depth=2)
          clf_pre.fit(X_train, y_train)
          y_pred_pre = clf_pre.predict(X_test)
          acc_pre = accuracy_score(y_test, y_pred_pre)

          # Post-Pruning: grow full tree, then prune with ccp_alpha
          clf_full = DecisionTreeClassifier()
          clf_full.fit(X_train, y_train)

          # Get effective alphas and prune
          path = clf_full.cost_complexity_pruning_path(X_train, y_train)
          ccp_alphas = path.ccp_alphas

          # Prune with one candidate alpha (index 5, clamped in case the path has fewer entries)
          clf_post = DecisionTreeClassifier(ccp_alpha=ccp_alphas[min(5, len(ccp_alphas) - 1)])
          clf_post.fit(X_train, y_train)
          y_pred_post = clf_post.predict(X_test)
          acc_post = accuracy_score(y_test, y_pred_post)

          print("Accuracy with Pre-Pruning (max_depth=2):", acc_pre)
          print("Accuracy with Post-Pruning (ccp_alpha):", acc_post)



 Output:


          Accuracy with Pre-Pruning (max_depth=2): 0.9333
          Accuracy with Post-Pruning (ccp_alpha): 0.9555






4  What is Information Gain in Decision Trees, and why is it important for
choosing the best split?



          Information Gain (IG) is a metric used to measure the effectiveness of a feature in splitting the data. It is based on Entropy, which quantifies the impurity or uncertainty in a dataset.


          $$\text{Information Gain} = \mathrm{Entropy}(\text{Parent}) - \sum_{i} \frac{n_i}{n} \times \mathrm{Entropy}(\text{Child}_i)$$

          Where:

          $n_i$: number of samples in child node $i$

          $n$: total number of samples in the parent node

          

          It helps the decision tree choose the feature that gives the most "information" (i.e., best purity) when splitting.

          Higher Information Gain = Better Split (more reduction in impurity)


 Example:



          Dataset (Buy Computer)
          Age	BuyComputer
          <=30	No
          <=30	No
          31-40	Yes
          >40	Yes
          >40	No
          31-40	Yes

          We'll compute:

          Entropy of full dataset

          Entropy after splitting on Age

          Information Gain

          import pandas as pd
          import numpy as np

          # Sample data
          data = {
              'Age': ['<=30', '<=30', '31-40', '>40', '>40', '31-40'],
              'BuyComputer': ['No', 'No', 'Yes', 'Yes', 'No', 'Yes']
          }
          df = pd.DataFrame(data)

          # Function to calculate entropy
          def entropy(class_labels):
              values, counts = np.unique(class_labels, return_counts=True)
              probabilities = counts / counts.sum()
              return -np.sum(probabilities * np.log2(probabilities))

          # Entropy of the full dataset
          total_entropy = entropy(df['BuyComputer'])

          # Entropy after splitting on 'Age'
          splits = df.groupby('Age')
          weighted_entropy = 0

          for group, subset in splits:
              weight = len(subset) / len(df)
              e = entropy(subset['BuyComputer'])
              weighted_entropy += weight * e

          # Information Gain
          info_gain = total_entropy - weighted_entropy

          print("Entropy (Before Split):", round(total_entropy, 4))
          print("Entropy (After Split):", round(weighted_entropy, 4))
          print("Information Gain (by splitting on 'Age'):", round(info_gain, 4))



 Output;

          Entropy (Before Split): 1.0
          Entropy (After Split): 0.3333
          Information Gain (by splitting on 'Age'): 0.6667
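
To see why Information Gain matters for choosing a split, the same calculation can be repeated for every candidate feature and the results compared. This sketch adds the Income column from the earlier sample dataset; `information_gain` is an illustrative helper name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": ["<=30", "<=30", "31-40", ">40", ">40", "31-40"],
    "Income": ["High", "Medium", "High", "Medium", "Low", "Low"],
    "BuyComputer": ["No", "No", "Yes", "Yes", "No", "Yes"],
})


def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p))) + 0.0   # + 0.0 avoids printing -0.0


def information_gain(frame, feature, target):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    parent = entropy(frame[target])
    weighted = sum(
        len(group) / len(frame) * entropy(group[target])
        for _, group in frame.groupby(feature)
    )
    return parent - weighted


gains = {f: information_gain(df, f, "BuyComputer") for f in ["Age", "Income"]}
best = max(gains, key=gains.get)

for feature, gain in gains.items():
    print(f"{feature}: {round(gain, 4)}")
print("Best feature to split on:", best)
```

Here every Income group contains one "Yes" and one "No", so splitting on Income leaves the entropy unchanged (gain 0), while Age produces two pure groups and a gain of about 0.6667. The tree therefore splits on Age first.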




5  What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?



          1.  Credit Scoring & Risk Assessment

          Use: Banks use decision trees to decide whether to approve a loan or credit card.

          Features: Income, credit score, employment history, etc.

          Output: Approve or Reject loan

          2.  Medical Diagnosis

          Use: Assist doctors in diagnosing diseases based on symptoms and test results.

          Features: Blood pressure, age, test results

          Output: Disease diagnosis (e.g., Diabetes: Yes/No)

          3.  Customer Churn Prediction

          Use: Telecom companies predict if a customer will cancel their subscription.

          Features: Call usage, payment history, customer service calls

          Output: Churn or Stay

          4.  E-commerce Recommendation

          Use: Predict whether a user will buy a product or click an ad.

          Features: User behavior, product category, time spent on site

          Output: Click/Buy or Not



          5.  Agriculture

          Use: Determine if a crop needs fertilizer or irrigation.

          Features: Soil type, weather conditions, crop type

          Output: Action to take


  Advantages of Decision Trees

         Advantage                      Explanation
         Easy to Understand             Visual and intuitive, even for non-technical users
         Fast and Efficient             Quick to train and predict
         Handles Both Types             Works with numerical and categorical data (scikit-learn requires categorical features to be encoded first)
         No Need for Feature Scaling    Unlike SVM or k-NN, no normalization needed
         Built-in Feature Selection     Selects the most informative features automatically

  Limitations of Decision Trees

         Limitation                     Explanation
         Overfitting                    Can create very complex trees that memorize the training data
         Instability                    Small changes in data can lead to a completely different tree
         Biased with Imbalanced Data    Tends to favor classes with more samples
         Greedy Nature                  Chooses the best split at each node, which is not guaranteed to be globally optimal
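
The overfitting limitation can be illustrated on noisy data. This sketch uses a synthetic dataset from `make_classification` with 15% of labels randomly reassigned (`flip_y=0.15`); the dataset sizes and `max_depth=3` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise, so a perfect fit must memorize noise
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.15, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, clf in [("fully grown", full), ("max_depth=3", shallow)]:
    print(
        f"{name:12s} depth={clf.get_depth():2d} "
        f"train={clf.score(X_train, y_train):.3f} "
        f"test={clf.score(X_test, y_test):.3f}"
    )
```

A large gap between training and test accuracy for the fully grown tree is the usual sign that it has memorized the training data.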


 Example:

        Decision Tree classification workflow (Iris stands in for a real-world dataset such as customer churn)

        
              from sklearn.datasets import load_iris
              from sklearn.tree import DecisionTreeClassifier
              from sklearn.model_selection import train_test_split
              from sklearn.metrics import accuracy_score

              # Simulating a real-world classification (using iris dataset for simplicity)
              X, y = load_iris(return_X_y=True)
              X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

              clf = DecisionTreeClassifier(max_depth=3)
              clf.fit(X_train, y_train)
              predictions = clf.predict(X_test)
              accuracy = accuracy_score(y_test, predictions)

              print("Decision Tree Accuracy on Iris Dataset (simulated classification):", round(accuracy, 4))



  Output;


      Decision Tree Accuracy on Iris Dataset (simulated classification): 0.9556




6  Write a Python program to:

      Load the Iris Dataset
      Train a Decision Tree Classifier using the Gini criterion
      Print the model’s accuracy and feature importances?




            from sklearn.datasets import load_iris
            from sklearn.tree import DecisionTreeClassifier
            from sklearn.model_selection import train_test_split
            from sklearn.metrics import accuracy_score

            # 1. Load the Iris dataset
            iris = load_iris()
            X = iris.data
            y = iris.target
            feature_names = iris.feature_names

            # 2. Split into training and testing sets (70% train, 30% test)
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

            # 3. Train a Decision Tree using Gini index
            clf = DecisionTreeClassifier(criterion='gini', random_state=42)
            clf.fit(X_train, y_train)

            # 4. Make predictions
            y_pred = clf.predict(X_test)

            # 5. Evaluate model
            accuracy = accuracy_score(y_test, y_pred)
            importances = clf.feature_importances_


            # 6. Print results
            print("Decision Tree Classifier (Gini)")
            print("Accuracy on Test Set:", round(accuracy, 4))
            print("\n🔍 Feature Importances:")
            for feature, importance in zip(feature_names, importances):
                print(f"{feature}: {round(importance, 4)}")


 Output:

            Decision Tree Classifier (Gini)
            Accuracy on Test Set: 1.0

            🔍 Feature Importances:
            sepal length (cm): 0.0
            sepal width (cm): 0.0
            petal length (cm): 0.423
            petal width (cm): 0.577






7  Write a Python program to:

      Load the Iris Dataset
      Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
      a fully-grown tree?




              from sklearn.datasets import load_iris
              from sklearn.tree import DecisionTreeClassifier
              from sklearn.model_selection import train_test_split
              from sklearn.metrics import accuracy_score

              # 1. Load Iris dataset
              iris = load_iris()
              X, y = iris.data, iris.target

              # 2. Train/test split
              X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

              # 3. Train Decision Tree with max_depth = 3
              clf_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
              clf_limited.fit(X_train, y_train)
              y_pred_limited = clf_limited.predict(X_test)
              acc_limited = accuracy_score(y_test, y_pred_limited)

              # 4. Train fully-grown Decision Tree (no max depth)
              clf_full = DecisionTreeClassifier(random_state=42)
              clf_full.fit(X_train, y_train)
              y_pred_full = clf_full.predict(X_test)
              acc_full = accuracy_score(y_test, y_pred_full)

              # 5. Print results
              print(" Decision Tree Comparison")
              print("Accuracy (max_depth=3):", round(acc_limited, 4))
              print("Accuracy (fully-grown tree):", round(acc_full, 4))


 Output;


        Decision Tree Comparison
        Accuracy (max_depth=3): 1.0
        Accuracy (fully-grown tree): 1.0  
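
A single 45-sample test split can hide differences between the two trees. As a complementary check, this sketch compares them with 5-fold cross-validation on the whole dataset and reports each tree's size; the choice of 5 folds is an assumption made here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

results = {}
for name, depth in [("max_depth=3", 3), ("fully grown", None)]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(clf, X, y, cv=5)   # accuracy on 5 held-out folds
    clf.fit(X, y)                               # refit on all data to inspect tree size
    results[name] = (scores.mean(), clf.get_depth(), clf.get_n_leaves())
    print(
        f"{name:12s} cv_accuracy={results[name][0]:.4f} "
        f"depth={results[name][1]} leaves={results[name][2]}"
    )
```

If the cross-validated accuracies are close, the shallower tree is usually preferable because it is simpler to read and less likely to overfit.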





 8    Write a Python program to:

      Load the California Housing dataset from sklearn
      Train a Decision Tree Regressor
      Print the Mean Squared Error (MSE) and feature importances ?




          from sklearn.datasets import fetch_california_housing
          from sklearn.tree import DecisionTreeRegressor
          from sklearn.model_selection import train_test_split
          from sklearn.metrics import mean_squared_error
          import pandas as pd

          # 1. Load the California Housing dataset
          california = fetch_california_housing()
          X = pd.DataFrame(california.data, columns=california.feature_names)
          y = california.target

          # 2. Split into train and test sets
          X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

          # 3. Train a Decision Tree Regressor
          regressor = DecisionTreeRegressor(random_state=42)
          regressor.fit(X_train, y_train)

          # 4. Predict and calculate MSE
          y_pred = regressor.predict(X_test)
          mse = mean_squared_error(y_test, y_pred)

          # 5. Print results
          print(" Decision Tree Regressor Results")
          print("Mean Squared Error (MSE):", round(mse, 4))

          print("\n🔍 Feature Importances:")
          for feature, importance in zip(california.feature_names, regressor.feature_importances_):
              print(f"{feature}: {round(importance, 4)}")


 Output:


          Decision Tree Regressor Results
          Mean Squared Error (MSE): 0.4653

          Feature Importances:
          MedInc: 0.6074
          HouseAge: 0.0496
          AveRooms: 0.0843
          AveBedrms: 0.0337
          Population: 0.0296
          AveOccup: 0.0503
          Latitude: 0.0797
          Longitude: 0.0654





     9 Write a Python program to:

      Load the Iris Dataset
      Tune the Decision Tree’s max_depth and min_samples_split using
      GridSearchCV
      Print the best parameters and the resulting model accuracy?



              from sklearn.datasets import load_iris
              from sklearn.tree import DecisionTreeClassifier
              from sklearn.model_selection import GridSearchCV, train_test_split
              from sklearn.metrics import accuracy_score

              # 1. Load the Iris dataset
              iris = load_iris()
              X, y = iris.data, iris.target

              # 2. Split into training and testing sets
              X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

              # 3. Define the parameter grid
              param_grid = {
                  'max_depth': [2, 3, 4, 5],
                  'min_samples_split': [2, 3, 4, 5]
              }

              # 4. Initialize the Decision Tree classifier
              dt = DecisionTreeClassifier(random_state=42)

              # 5. Set up GridSearchCV
              grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, scoring='accuracy')
              grid_search.fit(X_train, y_train)

              # 6. Get the best model
              best_model = grid_search.best_estimator_
              y_pred = best_model.predict(X_test)
              accuracy = accuracy_score(y_test, y_pred)


              # 7. Print results
              print("Grid Search Completed")
              print("Best Parameters:", grid_search.best_params_)
              print("Accuracy of Best Model on Test Set:", round(accuracy, 4))

 Output:


        Grid Search Completed
        Best Parameters: {'max_depth': 3, 'min_samples_split': 2}
        Accuracy of Best Model on Test Set: 1.0  
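
Several parameter combinations often tie or come very close to the best one. A sketch of how to inspect the full search results through `cv_results_`; for brevity it fits on the whole Iris dataset rather than a training split.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

param_grid = {"max_depth": [2, 3, 4, 5], "min_samples_split": [2, 3, 4, 5]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring="accuracy"
)
search.fit(X, y)

# One row per parameter combination, sorted by cross-validated rank
results = pd.DataFrame(search.cv_results_)
columns = [
    "param_max_depth",
    "param_min_samples_split",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
top = results.sort_values("rank_test_score")[columns].head(5)

print(top.to_string(index=False))
print("Best CV accuracy:", round(search.best_score_, 4))
```

When many rows share rank 1, `best_params_` reports the first of them in grid order, so the simplest tied setting may be a reasonable alternative.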






      10 Imagine you’re working as a data scientist for a healthcare company that
      wants to predict whether a patient has a certain disease. You have a large dataset with
      mixed data types and some missing values.
      Explain the step-by-step process you would follow to:
      Handle the missing values
      Encode the categorical features
      Train a Decision Tree model
      Tune its hyperparameters
      Evaluate its performance
      And describe what business value this model could provide in the real-world
      setting?





              1 Handle Missing Values

              Identify missing data (e.g., NaNs, blanks).

              Imputation techniques:

              For numerical features: Use mean, median, or model-based imputation (e.g., KNN Imputer).

              For categorical features: Use mode (most frequent value) or create a separate category like "Missing".

              Optionally, add missing indicators (new binary features to flag missingness) if missingness might be informative.
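
A small sketch of median imputation combined with missing indicators, using `SimpleImputer(add_indicator=True)`. The two columns stand for hypothetical age and blood-pressure readings; the values are made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical readings: column 0 = age, column 1 = blood pressure
X = np.array([
    [25.0, 120.0],
    [np.nan, 135.0],
    [47.0, np.nan],
    [52.0, 128.0],
])

# Fill gaps with the column median and append one 0/1 flag per column that had gaps
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
# Columns: age, blood pressure, age_was_missing, blood_pressure_was_missing
```

The indicator columns let the tree split on "was this value missing?", which helps when missingness itself carries information (for example, a test that was never ordered).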

              2 Encode Categorical Features

              For nominal (unordered) categories: Use One-Hot Encoding to convert each category into a binary feature.

              For ordinal categories (with order): Use Ordinal Encoding with meaningful numeric mappings.

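              Keep dimensionality in check: one-hot encoding a high-cardinality feature creates many sparse columns. Multicollinearity among the encoded columns matters less for trees than for linear models.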

              3 Train a Decision Tree Model

              Split the dataset into training and testing sets (e.g., 70-30 split).

              Initialize a Decision Tree classifier (e.g., from sklearn.tree.DecisionTreeClassifier).

              Train the model on the processed training data.

              4 Tune Hyperparameters

              Use GridSearchCV or RandomizedSearchCV for hyperparameter tuning.

              Important hyperparameters to tune:

              max_depth: controls tree depth to avoid overfitting.

              min_samples_split: minimum samples required to split a node.

              min_samples_leaf: minimum samples in a leaf node.

              criterion: "gini" or "entropy" for impurity measure.

              Use cross-validation during tuning to select the best parameters.

              5 Evaluate Model Performance

              Use metrics appropriate for classification:

              Accuracy (general correctness)

              Precision and Recall (important for healthcare; e.g., catching disease cases)

              F1-Score (balance between precision and recall)

              ROC-AUC (measures trade-off between sensitivity and specificity)

              Evaluate on a hold-out test set to estimate real-world performance.

              Optionally, use confusion matrix to analyze false positives and false negatives.
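
A minimal sketch of how these metrics relate to the confusion matrix, using ten made-up predictions where 1 means "has the disease".

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical true labels and model predictions (1 = disease, 0 = no disease)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# For binary labels, ravel() yields the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)   # tp / (tp + fp)
recall = recall_score(y_true, y_pred)         # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In a screening setting, the false negative (a missed disease case) is usually the costlier error, which is why recall is often weighted more heavily than accuracy.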

              Example Python code outline (no real data; assumes a pandas DataFrame df with a binary target column 'Disease'):
              from sklearn.model_selection import train_test_split, GridSearchCV
              from sklearn.tree import DecisionTreeClassifier
              from sklearn.impute import SimpleImputer
              from sklearn.preprocessing import OneHotEncoder
              from sklearn.compose import ColumnTransformer
              from sklearn.pipeline import Pipeline
              from sklearn.metrics import classification_report, roc_auc_score

              # Assume df is your DataFrame, with target 'Disease'

              # Separate features and target
              X = df.drop('Disease', axis=1)
              y = df['Disease']

              # Identify categorical and numerical columns
              categorical_cols = X.select_dtypes(include=['object', 'category']).columns
              # 'number' also catches int32/float32 columns, which would otherwise be dropped
              numerical_cols = X.select_dtypes(include='number').columns

              # Preprocessing pipelines
              numerical_pipeline = SimpleImputer(strategy='median')
              categorical_pipeline = Pipeline([
                  ('imputer', SimpleImputer(strategy='most_frequent')),
                  ('encoder', OneHotEncoder(handle_unknown='ignore'))
              ])

              preprocessor = ColumnTransformer([
                  ('num', numerical_pipeline, numerical_cols),
                  ('cat', categorical_pipeline, categorical_cols)
              ])

              # Create full pipeline with classifier
              pipeline = Pipeline([
                  ('preprocessor', preprocessor),
                  ('classifier', DecisionTreeClassifier(random_state=42))
              ])

              # Split data
              X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

              # Hyperparameter grid for tuning
              param_grid = {
                  'classifier__max_depth': [3, 5, 10, None],
                  'classifier__min_samples_split': [2, 5, 10],
                  'classifier__criterion': ['gini', 'entropy']
              }

              grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc')
              grid_search.fit(X_train, y_train)

              # Best model evaluation
              best_model = grid_search.best_estimator_
              y_pred = best_model.predict(X_test)
              y_proba = best_model.predict_proba(X_test)[:, 1]

              print("Best hyperparameters:", grid_search.best_params_)
              print(classification_report(y_test, y_pred))
              print("ROC-AUC:", round(roc_auc_score(y_test, y_proba), 4))

              Business Value of the Model in Healthcare

              Early disease detection: Helps identify patients at risk sooner, improving outcomes via timely treatment.

              Resource optimization: Enables prioritizing patients who need urgent attention, saving costs.

              Decision support: Assists doctors with data-driven insights, reducing diagnostic errors.

              Personalized care: Tailors treatment plans based on predicted risk factors.

              Population health management: Identifies trends and risk groups to inform public health strategies.





