# Adaboost Lab

## 准备工作
### 环境准备
请确保完成以下依赖包的安装，并且通过下面代码来导入与验证。

In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


### 数据集准备
我们将使用以下数据集进行 Adaboost 的训练。

该数据集与决策树部分使用的数据集相同，包括 7 个特征以及一个标签“是否适合攻读博士”，涵盖了适合攻读博士的各种条件，如love doing research,I absolutely want to be a college professor等。

请执行下面的代码来加载数据集。


In [13]:
# read decision_tree_datasets.csv
train_data = pd.read_csv('train_phd_data.csv')
test_data = pd.read_csv('test_phd_data.csv')

# translate lables [0,1] to [-1,1]
# if 0 then -1, if 1 then 1
train_data.iloc[:, -1] = train_data.iloc[:, -1].map({0: -1, 1: 1})
test_data.iloc[:, -1] = test_data.iloc[:, -1].map({0: -1, 1: 1})

## Adaboost (15 pts)

在上一个lab中，你已经成功完成了 Decision Tree 的构建。在本部分，你可以继续沿用上一部分的代码，学习并完成 Adaboost 模型的训练。

在这个 Adaboost 模型中，我们选择了一层决策树作为弱学习器，并使用基尼系数作为分类标准。

请完成以下类的构建以及相应函数的实现：

1. **weakClassifier()**: 我们采用一层决策树，包括 `split()` 和 `predict()`。你可以参考上一次实验中的代码。

2. **Adaboost()**：包括弱学习器的集合，拟合过程 `fit()` 和预测过程 `predict()`。


In [17]:
class weakClassifier:
    def __init__(self):
        

        self.tree = None 
        self.alpha = None
    
    # here, we use the gini impurity to find the best feature and threshold
    # Note: you need consider sample_weight when computing the gini impurity
    def gini_impurity(self, y, sample_weight):
    # Calculate the weighted Gini impurity for a set of labels
        unique, counts = np.unique(y, return_counts=True)
        freqs = counts / counts.sum()
        weighted_freqs = freqs * sample_weight[:len(freqs)]
        return 1 - np.sum(weighted_freqs ** 2)

    def best_split(self, X, y, sample_weight):

        ''' 
            find the best feature and threshold to split the data based on the gini impurity

            Args:
                X: the features of the data
                y: the labels of the data
                sample_weight: the weight of each sample

            Returns:
                best_feature: the best feature to split the data
                best_Series: Series, the data set after splitting
        '''
        min_gini = float("inf")
        best_feature = None
        best_splits = None
        best_threshold = None

        for feature in X.columns:
            thresholds = X[feature].unique()
            for threshold in thresholds:
                left_mask = X[feature] <= threshold
                right_mask = ~left_mask

                # Calculate weighted Gini impurity for both splits
                left_gini = self.gini_impurity(y[left_mask], sample_weight[left_mask])
                right_gini = self.gini_impurity(y[right_mask], sample_weight[right_mask])
                total_gini = (left_gini * left_mask.sum() + right_gini * right_mask.sum()) / len(y)

                if total_gini < min_gini:
                    min_gini = total_gini
                    best_feature = feature
                    best_splits = (X[feature] <= threshold)
                    best_threshold = threshold

        return best_feature, best_threshold

        # TODO: implement the function to find the best feature and threshold to split the data based on the gini impurity
        pass
    
    def fit(self, X, y, sample_weight):
        '''  
            fit the data to the decision tree

            Args:
                X: the features of the data
                y: the labels of the data
                sample_weight: the weight of each sample

            Returns:
                None, but self.tree should be updated
        '''
        best_feature, best_splits = self.best_split(X, y, sample_weight)

        if best_feature is None:
            return 

        # TODO: Create the tree as a nested dictionary
        self.tree = {'feature': best_feature, 'split': best_splits}
        

    def predict(self,x):
        '''  
        predict the label of the data

        Args:
            x: the features of the data
        Return:
            predict_lables: the predict labels of the data
        '''

        # Store the results
        predict_labels = []

        # predict the label of each sample
        for i in range(len(x)):
            sample = x.iloc[i,:]

            # TODO: predict the label of the sample
            if sample[self.tree['feature']] <= self.tree['split']:
                predict_labels.append(-1)
            else:
                predict_labels.append(1)

        return predict_labels



In [18]:
class Adaboost:
    
    def __init__(self, n_estimators=10):

        # the number of weak classifier
        self.n_estimators = n_estimators
        # the list of weak classifier
        self.clfs = []
    
    # AdaBoost training process
    def fit(self, X, y):
        n_samples,m_features = X.shape
    
        # initialize weights
        w = np.ones(n_samples)/n_samples

        # for each weak classifier
        for _ in range(self.n_estimators):
            clf = weakClassifier()

            # 1. fit the weak classifier
            clf.fit(X,y,w)

            # TODO: 2. predict the label of the data using the weak classifier
            predictions = np.array(clf.predict(X))

            # TODO: 3. Calculate errors 
            miss = [int(x) for x in (predictions != y)]  # 1 for incorrect predictions, 0 for correct
            miss2 = [x if x == 1 else -1 for x in miss]  # 1 for incorrect, -1 for correct
            error = sum(w * miss) / sum(w)


            # TODO:4. Calculate alpha
            alpha = 0.5 * np.log((1 - error) / (error + 1e-10))
            
            # TODO: 5. Update weights
            w *= np.exp([alpha * i for i in miss2])
            
            # normalize to one
            w /= np.sum(w)


            # save classifier and weight
            clf.alpha = alpha
            self.clfs.append(clf)
            

    def predict(self, X):
        '''  
        predict the label of the data
        
        Args:
            X: the features of the data
        Return:
            y_pred: the predict labels of the data
        '''
        clf_preds = np.array([clf.alpha * np.array(clf.predict(X)) for clf in self.clfs])
        y_pred = np.sign(np.sum(clf_preds, axis=0))
        
        #TODO: 1. compute the predict labels of the data using all weak classifiers
        

        #TODO: 2. compute the weighted sum of the predict labels
        

        #TODO: 3. get the label of the data by sign function (if x>0 return 1, else return -1)
        
        return y_pred

In [19]:
adaboost_model = Adaboost(n_estimators=10)
# fit the model
adaboost_model.fit(train_data.iloc[:, :-1], train_data.iloc[:, -1])

# TODO: predict the test data
y_pred = adaboost_model.predict(test_data.iloc[:, :-1])

# TODO: calculate the accuracy of test data
true_labels = test_data.iloc[:, -1]
accuracy = np.mean(y_pred == true_labels)
print("The accuracy of Adaboost is: ", accuracy)

The accuracy of Adaboost is:  0.7142857142857143
