## Homework 6: Ensemble methods

This week we have covered a variety of ensemble methods, summarized here:

https://chatgpt.com/share/671bac71-3aac-8002-a19d-047f0a97c619

In this notebook please complete the following task.

**Design your own ensemble method**

For this task you should create a markdown cell that concisely but completely describes your idea.
Following that, there should be a python implementation of your idea. 
You should run your algorithm on a real dataset and compare the performance to a random forest and some other well known ensemble approaches with default parameters (e.g. adaboost, gradient boost). 

*I want to see a nicely formatted table in which rows correspond to classifiers (yours included, at least 3 rows) and columns correspond to training and test accuracy of the models on the data.*

Then answer these questions:
* Is there reason to suppose that your approach would work better on certain types of data?
* Can you algorithm be parallelized for training or prediction?

Your algorithm should be fundamentally different from the algorithms we have seen, but could be a tweak or combination of those existing ideas. 

Motivating thoughts:

* Can you think of a novel way to enhance ensemble diversity of opinion?
* Can you think of a novel way for models to iteratively correct the mistakes of the previous models (as in boosting)?

As always this assignment is partly a test of your ability to **communicate**. Write the appropriate amount of prose (not too much, not too little) and strive for clarity. Provide tables and visualizations where appropriate. 



In [19]:
import pandas as pd
import numpy as np
import re
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

df = pd.read_csv('train.csv')
X = df.drop(["Survived", "Name"], axis=1)
y = df["Survived"]

categorical_fields = ["Pclass", "Sex", "Cabin", "Embarked", "Ticket"]
numerical_fields = [field for field in X.columns if field not in categorical_fields]

# Transform last name into numerial ordering of names in alphabetical order
# My guess is that people where chosen by last name in order leave on a life boat?
df["Last_Name"] = df["Name"].str.extract(r'^(.*?)(?=,)')
sorted_names = df['Last_Name'].sort_values().reset_index(drop=True)
last_name_order = {last_name: i + 1 for i, last_name in enumerate(sorted_names)}
X["Last_Name_Rank"] = df["Last_Name"].map(last_name_order)

# Setting N/A values of Age to the mean
X["Age"] = X["Age"].fillna(X["Age"].mean())

# Setting Unkown Cabins to a single category. Maybe people with unkown cabins were less documented and more likley to die?
X['Cabin'] = X['Cabin'].replace('', "Unknown")
X["Cabin"] = X["Cabin"].fillna("Unknown")


# Setting Embarked nan to Unkown
X["Embarked"] = X["Embarked"].fillna("Unkown")

# Our transformations on numerical and categorical fields
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_fields),
        ('cat', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), categorical_fields)
    ]
)

preprocessing_pipeline = Pipeline([
    ('preprocessor', preprocessor)
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

X_train = preprocessing_pipeline.fit_transform(X_train)
X_test = preprocessing_pipeline.fit_transform(X_test)

#My custom ensemble!

random_forest = RandomForestClassifier()
adaboost = AdaBoostClassifier()

def results(model, model_name):
    model.fit(X_train, y_train)
    print(f"Model name {model_name}")
    print(f"Model train score {model.score(X_train, y_train)}")
    print(f"Model test score {model.score(X_test, y_test)}")
    print("")

results(random_forest, "random forest")
results(adaboost, "adaboost")
#custom_ensemble = CustomEnsemble()

Model name random forest
Model train score 1.0
Model test score 0.5810055865921788





Model name adaboost
Model train score 0.8553370786516854
Model test score 0.7541899441340782



In [18]:
# Source Code
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

df = pd.read_csv('train.csv')

y = df['Survived']
# For some reason Sex colunm was needed to predict survivabillity. As such, I dropped it. PasasangerID also seems irrelevant
X = df.drop(columns=['Survived', 'PassengerId', "Sex"])

categorical_fields = ['Pclass', 'Embarked', 'Name', 'Cabin', 'Ticket']
numerical_fields = [col for col in X.columns if col not in categorical_fields]

# This transforms N/A values in the numerical fields to the mean of all the numerical field values
numerical_transformer = SimpleImputer(strategy='mean')

# This first makes all N/A or None fields into a single category, and then applies On Hot Encoding on them
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# This is used to transform the X values with the transformations above
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_fields),
        ('cat', categorical_transformer, categorical_fields)
    ]
)
X = preprocessor.fit_transform(X)

# Creating random forest
random_forest = Pipeline([
    ('classifier', RandomForestClassifier())
])

#Creating adaboost
adaboost = Pipeline([
    ('classifier', AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),
        n_estimators=100,
        random_state=42
    ))
])


#My Model!
class Almost_Blind_Trees:
    def __init__(self):
        self.almost_blind_trees = []
        self.weights = []
    
    def fit(self, X, y):
        for i, (X_val, y_val) in enumerate(zip(X, y)):
            X_instance = np.array([X_val]) #current instance of X
            y_instance = np.array([y_val]) #current instance of y 

            # 29 other Random instances with repition in X and y
            additional_indices = np.random.choice(y_train.shape[0], 4, replace=False)
            additional_samples = X_train[additional_indices]
            additional_targets = y_train.iloc[additional_indices]

            # Concatentating these instances to train tree on them both
            X_train_instance = np.vstack((X_instance, additional_samples))
            y_train_instance = np.concatenate((y_instance, additional_targets))

            tree = DecisionTreeClassifier()
            tree.fit(X_train_instance, y_train_instance)
            self.almost_blind_trees.append(tree)
            
            # Score our weak learner on the entire data set
            self.weights.append(self.score_transform(tree.score(X, y)))
            
        # Apply a weight based on how succesful our weak learner was on the entire data set
        total = sum(self.weights)
        self.weights = [score / total for score in self.weights]
        
    def score(self, X, y):
        predictions = np.empty(y.shape[0]) # Np arrray to hold our predictions that we will 
        for i, (x_instance, y_val) in enumerate(zip(X, y)):
            # Our prediction for each individual y instance
            prediction = sum([weight * tree.predict([x_instance])[0] for weight, tree in zip(self.weights, self.almost_blind_trees)])
            if prediction >= 0.5:
                predictions[i] = 1
            else:
                predictions[i] = 0
        # How many predictions are equal to the labels as a num from 0 - 1
        return (np.sum(predictions == y) / y.shape[0])
    
    # Edit the score which weights are averaged on. We can adjust this to determine how fairly weights are assigned to models that do well vs those that dont
    def score_transform(self, x):
        return x**4


almost_blind_trees = Almost_Blind_Trees()

# Defining the test train split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

def print_scores(model, model_name):
    model.fit(X_train, y_train)
    print(f"For model {model_name}")
    print(f"training score: {model.score(X_train, y_train)}")
    print(f"Testing score: {model.score(X_test, y_test)}")
    print("")
    

    
print_scores(almost_blind_trees, "almost_blind_trees")

For model almost_blind_trees


KeyboardInterrupt: 

|            | Training Score         | Testing Score        |
|------------|-------------------|------------------|
| Random Forest | 1 | 0.6190476190476191 |
| Adaboost Trees | 0.9417808219178082 | 0.6190476190476191 |
| Custom Ensemble Model | 0.636986301369863 | 0.6349206349206349 ||


    Is there reason to suppose that your approach would work better on certain types of data?
    Can you algorithm be parallelized for training or prediction?
    
In viewing the training and testing results of Random Forests and Adaboost trees, I noticed immediatly that they were overfitting the training data to an extreme. As such, I wanted to create a Ensemble method based on trees that would prevent this overfitting from ocurring. As such, I created A Cusrom Ensemble Model that I would like to call Almost Blind Trees. I created this model based on the assumption that the data which I used had a simpler relationship between fields and labels then what the trees were picking up. For this project I used the famouse titanic data set and I believed that the survivability of a passanger would not be a relationship found by comparing multiple data point, and that it was a simpler relationship which could be detected withing a couple of data points. As such, my model consisted of several max depth 2 trees that were trained on three data points at the same time. These data points would then be used to determine how my model would interact with  