## Ensemble Code Review - 임지영

- Original code: https://www.kaggle.com/code/yassineghouzam/titanic-top-4-with-ensemble-modeling
- This code review focuses on the **ensemble modeling process** and summarizes it.
- Please note that the review is written in English, because the terminology is easier to use in English.

### Stratified K-fold 
- a cross-validation technique that is especially useful for imbalanced datasets
- the dataset is divided into k folds, each with a class distribution similar to that of the original dataset
- this makes each fold representative of the overall dataset

In [None]:
from sklearn.model_selection import StratifiedKFold
kfold = StratifiedKFold(n_splits=10) # Split the dataset into 10 folds that preserve the class ratio
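
To make the idea of stratification concrete, here is a minimal pure-Python sketch. It is not how `StratifiedKFold` is implemented; the helper `stratified_folds` and the toy labels are hypothetical and only illustrate dealing out each class evenly across folds.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits):
    """Assign each sample index to a fold, dealing out each class round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Toy labels (made up): 30 negatives and 20 positives, i.e. 40% positive.
labels = [0] * 30 + [1] * 20
folds = stratified_folds(labels, n_splits=5)

# Count the classes inside each fold to check that the ratio is preserved.
fold_counts = [Counter(labels[i] for i in fold) for fold in folds]
for k, counts in enumerate(fold_counts):
    print(f"fold {k}: {dict(counts)}")
```

Every fold ends up with 6 negatives and 4 positives, the same 40% positive rate as the full toy dataset.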

### Cross validate different models 
- compare the cross-validated accuracy of each classifier

In [None]:
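from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier)
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Modeling step: test different algorithms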
random_state = 2
classifiers = []
classifiers.append(SVC(random_state=random_state))
classifiers.append(DecisionTreeClassifier(random_state=random_state))
classifiers.append(AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state),random_state=random_state,learning_rate=0.1))
classifiers.append(RandomForestClassifier(random_state=random_state))
classifiers.append(ExtraTreesClassifier(random_state=random_state))
classifiers.append(GradientBoostingClassifier(random_state=random_state))
classifiers.append(MLPClassifier(random_state=random_state))
classifiers.append(KNeighborsClassifier())
classifiers.append(LogisticRegression(random_state = random_state))
classifiers.append(LinearDiscriminantAnalysis())

In [None]:
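import pandas as pd
import seaborn as sns
from sklearn.model_selection import cross_val_score

# Collect and visualize the cross-validation scores of each classifier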
cv_results = []
for classifier in classifiers :
    cv_results.append(cross_val_score(classifier, X_train, y = Y_train, scoring = "accuracy", cv = kfold, n_jobs=4))

cv_means = []
cv_std = []
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())

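cv_res = pd.DataFrame({"CrossValMeans":cv_means,"CrossValerrors": cv_std,"Algorithm":["SVC","DecisionTree","AdaBoost",
"RandomForest","ExtraTrees","GradientBoosting","MultipleLayerPerceptron","KNeighbors","LogisticRegression","LinearDiscriminantAnalysis"]})

# Horizontal bars of mean accuracy, with the standard deviation across folds as error bars.
# x and y are passed by keyword because recent seaborn releases no longer accept them positionally.
g = sns.barplot(x="CrossValMeans", y="Algorithm", data=cv_res, palette="Set3", orient="h", xerr=cv_std)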
g.set_xlabel("Mean Accuracy")
g = g.set_title("Cross validation scores")

### Grid Search Optimization

 : a technique for finding the **best combination of hyperparameters** for a model
 
 - define a grid containing every combination of the candidate hyperparameter values
 - evaluate the model with each combination, here by cross-validation
 - simple and exhaustive, but computationally expensive, because the number of fits multiplies with each added hyperparameter
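
To give a feel for the cost, the sketch below enumerates a grid by hand with `itertools.product`. It copies the AdaBoost grid that appears later in this review. The variable names are hypothetical, and this only counts the combinations; it does not reproduce what `GridSearchCV` does internally.

```python
from itertools import product

# Copy of the AdaBoost grid used later in this review.
param_grid = {
    "base_estimator__criterion": ["gini", "entropy"],
    "base_estimator__splitter": ["best", "random"],
    "algorithm": ["SAMME", "SAMME.R"],
    "n_estimators": [1, 2],
    "learning_rate": [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 1.5],
}

# Every combination is one candidate model configuration.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is fitted once per cross-validation fold.
n_folds = 10
total_fits = len(combinations) * n_folds
print(len(combinations), total_fits)
```

With 2 x 2 x 2 x 2 x 7 = 112 candidates and 10 folds, the search needs 1120 model fits, which is why even a small grid can take a while.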

### Adaboost
: a boosting algorithm that iteratively **combines weak models to create a strong model**

- Weak model: a simple model that performs only slightly better than random guessing
- primarily reduces bias, and often reduces variance as well
- prone to overfitting if the weak learners are too complex or the data is too noisy

**Process of AdaBoost**
1. Train a weak model on the training data, with every example weighted equally at first
2. Increase the weights of the misclassified examples
3. Train the next weak model on the re-weighted data so that it pays more attention to the previously misclassified examples, and combine all weak models by a weighted vote
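
The re-weighting in steps 2 and 3 can be sketched as below for the binary, discrete form of AdaBoost. This is a simplified illustration and not scikit-learn's code: the function `adaboost_round` and the toy weights are hypothetical, and the weighted error is assumed to lie strictly between 0 and 1.

```python
import math

def adaboost_round(weights, is_misclassified, learning_rate=1.0):
    """One re-weighting step of discrete AdaBoost for binary labels.

    Assumes the weights sum to 1 and the weighted error is strictly between 0 and 1.
    """
    # Weighted error of the current weak model.
    error = sum(w for w, miss in zip(weights, is_misclassified) if miss)
    # Vote strength of this weak model: larger when the error is smaller.
    alpha = learning_rate * math.log((1.0 - error) / error)
    # Boost the weights of the misclassified examples, then renormalize.
    updated = [w * math.exp(alpha) if miss else w
               for w, miss in zip(weights, is_misclassified)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Five equally weighted examples; the weak model gets only the last one wrong.
weights = [0.2] * 5
is_misclassified = [False, False, False, False, True]
alpha, new_weights = adaboost_round(weights, is_misclassified)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After one round the single misclassified example carries half of the total weight, so the next weak model is pushed to get it right.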

In [None]:
# AdaBoost
from sklearn.model_selection import GridSearchCV

DTC = DecisionTreeClassifier() # Use a decision tree as the weak model

adaDTC = AdaBoostClassifier(DTC, random_state=7)

# Note: the "base_estimator__" prefix and the "SAMME.R" option depend on the scikit-learn version;
# newer releases rename the prefix to "estimator__" and deprecate "SAMME.R".
ada_param_grid = {"base_estimator__criterion" : ["gini", "entropy"], # split quality measured by Gini impurity or entropy
              "base_estimator__splitter" :   ["best", "random"],
              "algorithm" : ["SAMME","SAMME.R"],
              "n_estimators" :[1,2],
              "learning_rate":  [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3,1.5]} # candidate learning rates

gsadaDTC = GridSearchCV(adaDTC,param_grid = ada_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsadaDTC.fit(X_train,Y_train)

ada_best = gsadaDTC.best_estimator_

### Extra Trees Classifier
: an ensemble learning method that combines multiple decision trees

- at each node, considers a random subset of features with a randomly drawn threshold for each, and splits on the best of these candidates
- can be used for both classification and regression tasks
- can handle high-dimensional data well
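
The random split rule described above can be sketched in pure Python as follows. This is an illustrative simplification, not the scikit-learn implementation: the helpers `gini_impurity` and `pick_random_split` and the toy data are hypothetical, and labels are assumed to be binary (0/1).

```python
import random

def gini_impurity(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    if not labels:
        return 0.0
    positive_rate = sum(labels) / len(labels)
    return 2.0 * positive_rate * (1.0 - positive_rate)

def pick_random_split(rows, labels, n_candidate_features, rng):
    """Return (weighted_impurity, feature_index, threshold) for the best random candidate."""
    best = None
    candidate_features = rng.sample(range(len(rows[0])), n_candidate_features)
    for feature in candidate_features:
        values = [row[feature] for row in rows]
        low, high = min(values), max(values)
        if low == high:
            continue  # a constant feature cannot split the node
        threshold = rng.uniform(low, high)  # drawn at random, not optimized
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        impurity = (len(left) * gini_impurity(left)
                    + len(right) * gini_impurity(right)) / len(labels)
        if best is None or impurity < best[0]:
            best = (impurity, feature, threshold)
    return best

# Toy node: 6 samples, 3 features, binary labels (all values are made up).
rows = [[1.0, 7.0, 0.2], [2.0, 3.0, 0.9], [3.0, 8.0, 0.4],
        [4.0, 1.0, 0.8], [5.0, 6.0, 0.1], [6.0, 2.0, 0.7]]
labels = [0, 0, 0, 1, 1, 1]
rng = random.Random(0)
impurity, feature, threshold = pick_random_split(rows, labels, n_candidate_features=2, rng=rng)
print(feature, round(threshold, 3), round(impurity, 3))
```

Only the choice among the random candidates is optimized; the thresholds themselves are never searched. That is the main difference from an ordinary decision tree split.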

In [None]:
#ExtraTrees 
ExtC = ExtraTreesClassifier()


## Search grid for optimal parameters
ex_param_grid = {"max_depth": [None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [False],
              "n_estimators" :[100,300],
              "criterion": ["gini"]}


gsExtC = GridSearchCV(ExtC,param_grid = ex_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsExtC.fit(X_train,Y_train)

ExtC_best = gsExtC.best_estimator_

# Best score
gsExtC.best_score_

### Random Forest Classifier 

: builds multiple decision trees and combines them to improve the model's accuracy and generalization ability
- Each decision tree is typically built on a bootstrap sample of the training data, and considers a random subset of the features at each split
- Random Forest is an extension of bagging: the trees are trained independently and the final prediction is made by majority voting
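
The two ingredients above, bootstrap sampling and majority voting, can be sketched in a few lines. The helpers `bootstrap_indices` and `majority_vote` and the toy predictions are hypothetical. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by a hard vote, and that the grid in this review sets `bootstrap` to `False`, so each tree there sees the full training set.

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)

# Each tree would be trained on its own bootstrap sample of the rows.
sample = bootstrap_indices(10, rng)
print(sorted(sample))

# Made-up predictions of five trees for a single passenger (1 = survived).
tree_predictions = [1, 0, 1, 1, 0]
final = majority_vote(tree_predictions)
print(final)
```

Three of the five trees predict class 1, so the forest predicts 1. Using an odd number of trees avoids ties in the binary case.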

In [None]:
# RFC parameter tuning
RFC = RandomForestClassifier()


## Search grid for optimal parameters
rf_param_grid = {"max_depth": [None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [False],
              "n_estimators" :[100,300],
              "criterion": ["gini"]}


gsRFC = GridSearchCV(RFC,param_grid = rf_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsRFC.fit(X_train,Y_train)

RFC_best = gsRFC.best_estimator_

# Best score
gsRFC.best_score_