In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

In [2]:
# Load the dataset from a local CSV file
path = "healthcare-dataset-stroke-data.csv"
data = pd.read_csv(path)

In [3]:
print(data.head())

      id  gender   age  hypertension  heart_disease ever_married  \
0   9046    Male  67.0             0              1          Yes   
1  51676  Female  61.0             0              0          Yes   
2  31112    Male  80.0             0              1          Yes   
3  60182  Female  49.0             0              0          Yes   
4   1665  Female  79.0             1              0          Yes   

       work_type Residence_type  avg_glucose_level   bmi   smoking_status  \
0        Private          Urban             228.69  36.6  formerly smoked   
1  Self-employed          Rural             202.21   NaN     never smoked   
2        Private          Rural             105.92  32.5     never smoked   
3        Private          Urban             171.23  34.4           smokes   
4  Self-employed          Rural             174.12  24.0     never smoked   

   stroke  
0       1  
1       1  
2       1  
3       1  
4       1  


In [4]:
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB
None


In [5]:
# checking for the null values of each column

for each in data.columns:
    print(f'There are {data[each].isnull().sum()} null values in the {each} column')

There are 0 null values in the id column
There are 0 null values in the gender column
There are 0 null values in the age column
There are 0 null values in the hypertension column
There are 0 null values in the heart_disease column
There are 0 null values in the ever_married column
There are 0 null values in the work_type column
There are 0 null values in the Residence_type column
There are 0 null values in the avg_glucose_level column
There are 201 null values in the bmi column
There are 0 null values in the smoking_status column
There are 0 null values in the stroke column


In [6]:
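# Drop rows with missing values (only bmi has any), then one-hot encode
# the categorical columns; drop_first=True drops one category per column.
data.dropna(inplace=True)
data = pd.get_dummies(data, drop_first=True)

# Features are all columns except the target (note: this keeps the id column).
X = data.drop(columns=['stroke'])
y = data['stroke']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)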

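Loading and Cleaning Data
- We load the dataset and remove rows with missing values using dropna. Only bmi has missing values (201 rows), so 4,909 of the 5,110 rows remain.
- We use pd.get_dummies with drop_first=True to convert the categorical columns (gender, ever_married, work_type, Residence_type, smoking_status) into indicator columns, dropping one category per column to avoid redundancy.

Feature and Target Selection
- We define X as every column except stroke and y as the target (stroke). Note that X still includes the id column, which is an identifier rather than a predictive feature.

Data Splitting
- Using train_test_split, we create training and testing sets with an 80-20 split (test_size=0.2, random_state=42).

Feature Scaling
- We fit StandardScaler on the training set only and apply the same transformation to the test set, so no test-set statistics leak into training.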
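
To make the cleaning and encoding steps concrete, here is a minimal sketch on a toy frame. The values are made up for illustration and are not taken from the stroke dataset; `toy`, `cleaned`, and `encoded` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (made-up values, not the real dataset).
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "bmi": [36.6, np.nan, 32.5, 24.0],
    "stroke": [1, 1, 0, 0],
})

# dropna removes the one row whose bmi is missing.
cleaned = toy.dropna()

# get_dummies replaces each text column with indicator columns;
# drop_first=True drops the first category (here "Female") of each column.
encoded = pd.get_dummies(cleaned, drop_first=True)

print(len(toy), "->", len(cleaned), "rows")
print(list(encoded.columns))
```

The numeric columns pass through unchanged, and gender becomes a single gender_Male indicator, since one indicator is enough to encode two categories.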

In [7]:
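class DecisionTreeCustom:
    """A single-split learner (decision stump) despite the maxDepth parameter."""

    def __init__(self, maxDepth=3):
        self.maxDepth = maxDepth
        self.tree = []

    def fit(self, X, y, depth=0):
        # depth is never incremented and fit never recurses, so at most one
        # split is learned.
        if depth < self.maxDepth:
            best_mse = float('inf')
            n_samples, n_features = X.shape
            for feature in range(n_features):
                # Try every distinct value of the feature as a threshold.
                feature_vals = np.unique(X[:, feature])
                for threshold in feature_vals:
                    # Candidate prediction: y.mean() above the threshold,
                    # y.mean() - y.std() at or below it.
                    preds = np.where(X[:, feature] > threshold,
                                     np.full(y.shape, y.mean()),
                                     np.full(y.shape, y.mean() - y.std()))
                    mse = np.mean((y - preds) ** 2)
                    if mse < best_mse:
                        best_mse = mse
                        self.featureIndex = feature
                        self.threshold = threshold
            self.tree.append((self.featureIndex, self.threshold))

    def predict(self, X):
        # Returns 1 above the learned threshold and 0 otherwise.
        feature_vals = X[:, self.featureIndex]
        return np.where(feature_vals > self.threshold, 1, 0)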

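Decision Stump Attributes

- featureIndex: the index of the feature to split on.
- threshold: the value that decides the split.
- polarity: not implemented in this version; it would typically control the direction of the split.
- maxDepth: only checked once in fit. Because fit never recurses, the model is a single-split stump whatever the value of maxDepth (as long as it is positive).


Finding the Best Split
- We loop through each feature and each unique value of that feature as a candidate threshold.
- Each candidate is scored by mean squared error (MSE) against two fixed values: y.mean() for rows above the threshold and y.mean() - y.std() for the rest. These are not the per-side means a regression tree would normally use.
- We store the feature and threshold with the lowest MSE.

Prediction
- Given input data, predict checks whether each row's value for the chosen feature is above the threshold and returns 1 if so, 0 otherwise. The output is therefore always binary, even when the stump was fitted on real-valued residuals.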
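
For comparison, a stump meant to be fitted on residuals usually predicts the mean target on each side of the split. The following is a minimal sketch of that idea, not the notebook's implementation; the class name `RegressionStump`, its attributes, and the toy data are all hypothetical.

```python
import numpy as np


class RegressionStump:
    """Hypothetical one-split regressor: predicts the mean target on each side."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        best_sse = np.inf
        # Fallback when no valid split exists: predict the overall mean everywhere.
        self.feature_index, self.threshold = 0, np.inf
        self.left_value = self.right_value = y.mean()
        for feature in range(X.shape[1]):
            # Skip the largest value: nothing would fall on the right side.
            for threshold in np.unique(X[:, feature])[:-1]:
                right = X[:, feature] > threshold
                left_mean, right_mean = y[~right].mean(), y[right].mean()
                # Sum of squared errors when each side predicts its own mean.
                sse = (((y[~right] - left_mean) ** 2).sum()
                       + ((y[right] - right_mean) ** 2).sum())
                if sse < best_sse:
                    best_sse = sse
                    self.feature_index, self.threshold = feature, threshold
                    self.left_value, self.right_value = left_mean, right_mean
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        right = X[:, self.feature_index] > self.threshold
        return np.where(right, self.right_value, self.left_value)


X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
r_toy = np.array([-0.5, -0.5, 0.5, 0.5])  # made-up residuals
stump = RegressionStump().fit(X_toy, r_toy)
print(stump.threshold, stump.predict(X_toy))
```

On this toy input the stump splits at 2.0 and reproduces the residuals exactly, so its output can be added to the running predictions with either sign.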

In [8]:
class GradientBoostingClassifier:
    def __init__(self, numEstimators=50, learningRate=0.05, maxDepth=4):
        self.numEstimators = numEstimators
        self.learningRate = learningRate
        self.maxDepth = maxDepth
        self.trees = []

    def fit(self, X, y):
        # Baseline: every prediction starts at the mean of the target.
        preds = np.full(y.shape, y.mean())

        for i in range(self.numEstimators):
            # Residuals are what the next tree is asked to predict.
            residuals = y - preds
            print(f"Iteration {i + 1}")
            print("Residuals:\n", residuals[:5])

            tree = DecisionTreeCustom(maxDepth=self.maxDepth)
            tree.fit(X, residuals)
            self.trees.append(tree)

            # tree.predict returns 0 or 1, so each update adds 0 or learningRate.
            tree_preds = tree.predict(X)
            preds += self.learningRate * tree_preds

            print("Updated Predictions:", preds[:5])
            print("-" * 40)

    def predict(self, X):
        # Note: the sum starts at zero, not at the y.mean() baseline used in fit.
        y_pred = np.zeros(X.shape[0])
        for tree in self.trees:
            y_pred += self.learningRate * tree.predict(X)
        return np.where(y_pred > 0.5, 1, 0)


__init__
- Initializes the hyperparameters numEstimators, learningRate, and maxDepth, and an empty list of trees.

fit method
- Starts with every prediction set to the mean of y.
- In each iteration, calculates the residuals (actual values minus current predictions).
- Fits a DecisionTreeCustom stump on these residuals and appends it to self.trees.
- Updates the predictions by adding the stump's output, scaled by learningRate.

predict method

- Sums the output of each learned tree, weighted by learningRate. Unlike fit, the sum starts from zero rather than from the mean of y.
- Converts the result to a binary prediction (0 or 1) using a threshold of 0.5. Since each stump contributes at most learningRate, the total can exceed 0.5 only if numEstimators * learningRate > 0.5.
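
A common variant stores the baseline during fit and starts predict from it, with weak learners that return real-valued corrections. Below is a minimal sketch under those assumptions, on a one-feature toy problem; `fit_stump`, `predict_stump`, and `TinyBooster` are hypothetical names, not part of the notebook.

```python
import numpy as np


def fit_stump(x, r):
    """Hypothetical helper: best single split of 1-D x for squared error on r."""
    best = (np.inf, np.inf, r.mean(), r.mean())  # (sse, threshold, left, right)
    for t in np.unique(x)[:-1]:
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x > threshold, right_value, left_value)


class TinyBooster:
    def __init__(self, num_estimators=20, learning_rate=0.3):
        self.num_estimators = num_estimators
        self.learning_rate = learning_rate

    def fit(self, x, y):
        self.baseline = y.mean()  # remembered so predict can reuse it
        self.stumps = []
        preds = np.full(len(y), self.baseline)
        for _ in range(self.num_estimators):
            stump = fit_stump(x, y - preds)  # fit the current residuals
            self.stumps.append(stump)
            preds += self.learning_rate * predict_stump(stump, x)
        return self

    def predict_score(self, x):
        # Same starting point and same updates as in fit.
        score = np.full(len(x), self.baseline)
        for stump in self.stumps:
            score += self.learning_rate * predict_stump(stump, x)
        return score

    def predict(self, x):
        return (self.predict_score(x) > 0.5).astype(int)


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])  # made-up labels
model = TinyBooster().fit(x, y)
print(model.predict(x))
```

Because fit and predict_score apply the same baseline and the same updates, the scores seen at prediction time match the ones the residuals were computed from during training.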

In [10]:
gradient_boosting = GradientBoostingClassifier(numEstimators=5, learningRate=0.05, maxDepth=4)
gradient_boosting.fit(X_train, y_train)

# predict on the test set then evaluate
y_pred = gradient_boosting.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print(classification_report(y_test, y_pred, zero_division=1))

Iteration 1
Residuals:
 3565   -0.039725
898    -0.039725
2707   -0.039725
4198   -0.039725
2746   -0.039725
Name: stroke, dtype: float64
Updated Predictions: [0.08972498 0.08972498 0.08972498 0.08972498 0.08972498]
----------------------------------------
Iteration 2
Residuals:
 3565   -0.089725
898    -0.089725
2707   -0.089725
4198   -0.089725
2746   -0.089725
Name: stroke, dtype: float64
Updated Predictions: [0.13972498 0.13972498 0.13972498 0.13972498 0.13972498]
----------------------------------------
Iteration 3
Residuals:
 3565   -0.139725
898    -0.139725
2707   -0.139725
4198   -0.139725
2746   -0.139725
Name: stroke, dtype: float64
Updated Predictions: [0.18972498 0.18972498 0.18972498 0.18972498 0.18972498]
----------------------------------------
Iteration 4
Residuals:
 3565   -0.189725
898    -0.189725
2707   -0.189725
4198   -0.189725
2746   -0.189725
Name: stroke, dtype: float64
Updated Predictions: [0.23972498 0.23972498 0.23972498 0.23972498 0.23972498]
-------------

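gradient_boosting.fit(X_train, y_train)
- Trains the gradient boosting model on the training data by fitting five stumps in sequence, each on the residuals of the current predictions. In the log above, the residuals of the first five rows grow in magnitude from -0.0397 to -0.1897 over four iterations, so for these rows the updates are moving the predictions away from the targets rather than toward them.

y_pred = gradient_boosting.predict(X_test)
- Uses the trained model to predict labels for the test set. With numEstimators=5 and learningRate=0.05 the summed score cannot exceed 0.25, which is below the 0.5 threshold, so every test row is predicted as 0.

accuracy = accuracy_score(y_test, y_pred)
- Calculates accuracy as the fraction of correct predictions.

classification_report(y_test, y_pred, zero_division=1)
- Builds a text report of precision, recall, and F1-score per class, which gives a more complete view of model performance than accuracy alone, especially with imbalanced classes. zero_division=1 sets any undefined metric (for example, precision for a class that is never predicted) to 1 instead of raising a warning.


Reading the output

Accuracy: Shows the proportion of correctly predicted instances.
Precision, Recall, and F1-score: For each class (0 and 1), these metrics help assess how well the model detects strokes (1), which accuracy alone may not capture well.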
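
To see why accuracy alone can mislead on imbalanced classes, here is a small sketch in plain Python. The labels are made up (they are not the notebook's test set), and `binary_metrics` is a hypothetical helper rather than a scikit-learn function.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Report 0.0 when a ratio is undefined (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Made-up labels: 95 negatives, 5 positives; the "model" always predicts 0.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

This prints 0.95 0.0 0.0: the always-negative predictor is right 95% of the time but finds none of the positive cases. The sketch reports the undefined precision as 0.0; a report generated with zero_division=1 would show it as 1.0 instead.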

# DESCRIPTION

Gradient boosting is an ensemble technique that iteratively builds a sequence of small models, frequently decision stumps, each of which corrects the errors of the models before it. By adding new models that concentrate on the residual errors, it aims to reduce the discrepancy between predicted and actual values.

At each iteration, the algorithm does the following:

- Computes the residuals, the differences between the actual target values and the current predictions.
- Fits a weak learner (such as a decision stump) to these residuals.
- Adds the new learner's predictions to the existing ones, scaled by the learning rate, a hyperparameter that controls each learner's contribution.


This process is repeated for a predetermined number of iterations or until the residual errors are sufficiently small. The final model is a weighted sum of all the weak learners, which lets it capture complex relationships in the data.
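
The steps above can be written compactly. The notation here is an assumption introduced for illustration: $F_m$ is the model after $m$ iterations, $h_m$ the weak learner fitted at iteration $m$, $\eta$ the learning rate, and $M$ the number of estimators. For squared-error loss, the residual is exactly the negative gradient of the loss with respect to the current prediction, which is where the name "gradient boosting" comes from.

```latex
\begin{aligned}
F_0(x) &= \bar{y} \\
r_i^{(m)} &= y_i - F_{m-1}(x_i)
  = -\left.\frac{\partial}{\partial F}\,\tfrac{1}{2}\bigl(y_i - F\bigr)^2\right|_{F = F_{m-1}(x_i)} \\
F_m(x) &= F_{m-1}(x) + \eta\, h_m(x), \qquad m = 1, \dots, M
\end{aligned}
```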

# Pseudocode for Gradient Boosting



```
Initialize predictions to the mean of the target values (y_mean)

FOR each iteration up to the number of estimators:
    Calculate residuals = (target values - current predictions)
    
    Train a weak learner (e.g., decision stump) on the residuals
    
    Add the weak learner's predictions to the current predictions,
    scaled by the learning rate
    
END FOR

RETURN final predictions based on the sum of all weak learners' contributions

```



Initialization
- Begin with a simple baseline prediction, such as the mean of the target variable.

Residual Calculation
- In each iteration, compute the difference between the actual target values and the current predictions.

Weak Learner Training
- Train a simple model (the weak learner) to predict these residuals.

Update Predictions
- Update the predictions by adding the new learner's output, scaled by the learning rate.

Repeat
- Continue until every weak learner has been added.

# Differences Between Gradient Boosting and Random Forests

1. Model Combination

- In gradient boosting, models are built one after another, with each new model concentrating on the residual errors of the ones before it. This sequential process is called "boosting."
- In contrast, Random Forests build each tree independently, typically on a bootstrap sample of the data, and then combine their predictions. This parallel strategy is called "bagging."

2. Prediction Strategy

- Gradient Boosting generates predictions by summing the predictions of all the weak learners, each scaled by the learning rate.
- Random Forests use the majority vote (for classification) or the average (for regression) across all trees to generate predictions.

3. Complexity and Interpretability

- Because each weak learner concentrates on correcting specific errors from the previous stage, gradient boosting often produces a more complex final model. It needs careful tuning (for example, of the number of iterations and the learning rate), but it often performs better on harder problems.
- Because Random Forests average independent trees, they are arguably easier to reason about and generally more resilient to overfitting, so they can be applied to a wide range of problems without much fine-tuning.


In short, Random Forests are parallel and average the results of independent models, whereas Gradient Boosting is sequential and fits each model to the errors of the previous ones.
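
The contrast can be sketched with stumps on a made-up one-feature problem. This is a simplified illustration of the bagging and boosting ideas only, not a Random Forest (which also uses deeper trees and random feature subsets); `fit_stump` and `predict_stump` are hypothetical helpers.

```python
import numpy as np


def fit_stump(x, target):
    """Hypothetical helper: best single split of 1-D x for squared error."""
    best = (np.inf, np.inf, target.mean(), target.mean())
    for t in np.unique(x)[:-1]:
        left, right = target[x <= t], target[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]  # (threshold, left_value, right_value)


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x > threshold, right_value, left_value)


rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 40)
y = (x > 0.5).astype(float)  # made-up step-function target

# Bagging-style: each stump sees its own bootstrap sample of (x, y) and
# does not depend on any other stump; the predictions are averaged.
bagged = []
for _ in range(10):
    idx = rng.integers(0, len(x), size=len(x))
    bagged.append(fit_stump(x[idx], y[idx]))
bagging_pred = np.mean([predict_stump(s, x) for s in bagged], axis=0)

# Boosting-style: each stump is fitted to the residuals left by the previous
# ones; the predictions are summed, scaled by a learning rate.
learning_rate = 0.5
boosting_pred = np.full(len(x), y.mean())
for _ in range(10):
    stump = fit_stump(x, y - boosting_pred)
    boosting_pred += learning_rate * predict_stump(stump, x)

bagging_mse = np.mean((y - bagging_pred) ** 2)
boosting_mse = np.mean((y - boosting_pred) ** 2)
print(bagging_mse, boosting_mse)
```

The bagging loop could run in any order or in parallel, since no iteration reads another's result, while each boosting iteration needs the predictions produced by all the earlier ones.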