## <p style="font-weight:bold; color:#00747b; font-size:140%; text-align:left;padding: 0; margin: 0; border-bottom: 3px solid #00747b"> Source Code and Utils Work </p>

<div style="border-radius:10px; border:#590d0d solid; padding: 15px; background-color: #d4ebea; font-size:100%; text-align:left">

This section defines some utility functions. Because the notebook is built as a monolith, they live in this cell for now. As the notebook
evolves, the utility functions will be moved to a utils/src directory as appropriate. </div>

In [3]:
import time
import pickle
import logging
import inspect
import sklearn
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
from sklearn import metrics
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer

In [4]:
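# "__file__" is passed as a string on purpose: notebooks do not define __file__,
# so this resolves to the current working directory.
notebook_folder = os.path.dirname(os.path.abspath("__file__"))
destination_path = os.path.join(notebook_folder, "../data")
results_path = os.path.join(notebook_folder, "../results")

# The log file handler fails if the results directory does not exist yet.
os.makedirs(results_path, exist_ok=True)
logging.basicConfig(level=logging.INFO, filename=os.path.join(results_path, "logs.txt"), force=True)
logger = logging.getLogger()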

In [5]:
"""
This cell defines some utility functions. Because the notebook is built as a monolith, they live in this cell for now. As the notebook
evolves, the utility functions will be moved to a utils/src directory as appropriate.
"""
from enum import Enum

class Models(Enum):
    """
    Enum class to help with typing and auto-completion in the IDE
    """
    # No trailing commas: a trailing comma would turn each value into a one-element tuple.
    GRADIENT_BOOSTING = 'Gradient Boosting'
    GAUSSIAN_NB = 'Gaussian Naive Bayes'
    LOG_REGRESSION = 'Logistic Regression'
    MLP_CLASSIFIER = 'MLP Classifier'
    KNN = 'KNN'
    RANDOM_FOREST = 'Random Forest'
    DECISION_TREE = 'Decision Tree'




In [6]:
class Experiment(object):
    """
    A class to help with running experiments. It can also be used with the "with" keyword. The functionality is deliberately basic, since the plan is to
    replace it with the mlflow runner.
    """

    def __init__(self):
        self.start_time = 0
        self.end_time = 0
        self.model = None
        self.results = None

    def __enter__(self):
        self.start_time = time.time()
        self.results = None
        # Return the instance so that `with Experiment() as exp:` binds it.
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Only complain about a missing model if the block itself did not fail,
        # otherwise this check would mask the original exception.
        if exc_type is None and self.model is None:
            raise Exception("You need to run run_experiment(<model/hyperparametertuner>). It is possible that there was an error in the "
                            "fitting of the model")
        self.end_time = time.time()
        self.model = None
        logger.info("Elapsed Time %0.3f", self.end_time - self.start_time)
        logger.info("####### End of Experiment ####### \n")


    def run_experiment(self, experiment_object, X_train, X_test, y_train, y_test):
        """
        Runs the experiment based on the type of the object
        :param experiment_object: Object to be fitted. Can be a model or a hyperparameter tuner
        :param X_train: Training Dataset
        :param X_test:  Test Dataset
        :param y_train: Training Target
        :param y_test: Testing Target
        :return: None
        """
        now = datetime.now()
        timestamp = now.strftime("%d%m%y_%H%M")
        logger.info(
            "\n####### Starting Experiment dated {} #######\n".format(
                timestamp
            )
        )
        model, metrics_dict = self._get_model_and_results(experiment_object, X_train, X_test, y_train, y_test)
        # Keep the metrics on the instance so they are not lost after the run.
        self.results = metrics_dict
        model_path = os.path.join(results_path, timestamp + type(model).__name__ + '.pkl')
        # The context manager closes the file handle once the model is pickled.
        with open(model_path, 'wb') as model_file:
            pickle.dump(model, model_file)


    def _get_model_and_results(self, experiment_object, X_train, X_test, y_train, y_test):
        """
        Getter method to help with fitting the different experiment objects.
        :param experiment_object: Object to be fitted. Can be a model or a hyperparameter tuner
        :param X_train: Training Dataset
        :param X_test:  Test Dataset
        :param y_train: Training Target
        :param y_test: Testing Target
        :return: Tuple of the fitted model and a dictionary with its metrics
        """
        ensembles = [x[1] for x in inspect.getmembers(sklearn.ensemble, inspect.isclass)]
        # Restrict to classes so that functions and submodules are not collected as "tuners".
        tuners = [x[1] for x in inspect.getmembers(sklearn.model_selection, inspect.isclass)]

        if type(experiment_object) in ensembles:
            logger.info("Model type: %s", type(experiment_object).__name__)
            metrics_dict = {}

            logger.debug("The experiment object is an ensemble")
            experiment_object.fit(X_train, y_train)
            y_pred = experiment_object.predict(X_test)

            metrics_dict['accuracy'] = metrics.accuracy_score(y_test, y_pred) * 100
            metrics_dict['confusion matrix'] = confusion_matrix(y_test, y_pred)
            metrics_dict['classification report'] = classification_report(y_test, y_pred)

            # Assign the model before logging it, otherwise the log line shows None.
            self.model = experiment_object
            logger.debug('The model is %s', self.model)
            logger.info(metrics_dict['classification report'])

            return self.model, metrics_dict

        elif type(experiment_object) in tuners:
            logger.info("Hyperparameter tuning experiment")

            metrics_dict = {}
            logger.debug("The experiment object is a tuner ")
            experiment_object.fit(X_train, y_train)

            logger.info("Model type: %s", type(experiment_object.best_estimator_).__name__)
            self.model = experiment_object.best_estimator_

            y_pred = self.model.predict(X_test)

            metrics_dict['accuracy'] = metrics.accuracy_score(y_test, y_pred) * 100
            metrics_dict['confusion matrix'] = confusion_matrix(y_test, y_pred)
            metrics_dict['classification report'] = classification_report(y_test, y_pred)

            metrics_dict['best params'] = experiment_object.best_params_
            logger.debug('The model is %s', self.model)
            logger.info("Best CV score: %0.3f", experiment_object.best_score_)
            logger.info("The best model had the following parameters: %s", experiment_object.best_params_)

            return self.model, metrics_dict
        else:
            raise NotImplementedError(
                "Unsupported experiment object: %s" % type(experiment_object).__name__
            )


In [7]:
exp = Experiment()
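
The `with` support mentioned in the class docstring relies on Python's context-manager protocol (`__enter__` / `__exit__`). The following is a minimal sketch of that protocol only. It assumes a hypothetical `TimedRun` class that is not part of this notebook and does nothing but time the block; it is not the `Experiment` class itself.

```python
import time


class TimedRun:
    """Hypothetical stand-in for illustration: it only times the `with` block."""

    def __init__(self):
        self.start_time = 0.0
        self.elapsed = None

    def __enter__(self):
        self.start_time = time.perf_counter()
        # Returning self is what lets `with TimedRun() as run:` bind the object.
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.elapsed = time.perf_counter() - self.start_time
        # Returning False lets any exception raised inside the block propagate.
        return False


with TimedRun() as run:
    total = sum(range(1000))

print(f"total={total}, elapsed={run.elapsed:.6f}s")
```

Because `__exit__` receives the exception details, a runner of this kind can tell a failed block apart from one that finished without fitting a model.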

## <p style="font-weight:bold; color:#00747b; font-size:140%; text-align:left;padding: 0; margin: 0; border-bottom: 3px solid #00747b"> Data Work</p>

####
<div style="border-radius:10px; border:#590d0d solid; padding: 15px; background-color: #d4ebea; font-size:100%; text-align:left">
    
<h2 align="center" style='text-decoration:underline;'> Features in the Dataset</h2>
    

The dataset contains 12 features plus the target variable we are trying to predict. We group the features by the forest attribute they describe.

__Features describing the topography of the forest area__ <br>
> *Elevation*: The elevation (“height”) of the forest area. <br>
*Aspect*: The orientation (e.g. north-facing) of the slope in degrees (0-360). <br>
*Slope*: How steep the area is, measured as the percent change in elevation over a certain distance. <br>

__Features describing distance to water__ <br>
> *Horizontal_Distance_To_Hydrology*: Horizontal distance to the nearest water source <br>
*Vertical_Distance_To_Hydrology*: Vertical distance to the nearest water source <br>

__Features describing the lighting conditions__ <br>
Note that these are related to elevation, slope and topography. <br>
> *Hillshade_9am*: Hillshade index at 9 am, measured as how bright the area is on a grayscale (1-255). <br>
*Hillshade_Noon*: Hillshade index at 12 noon, measured as how bright the area is on a grayscale (1-255). <br>
*Hillshade_3pm*: Hillshade index at 3 pm, measured as how bright the area is on a grayscale (1-255). <br>

__Features relevant for fire hazards/urban accessibility__ <br>
> *Horizontal_Distance_To_Roadways:* Distance to the nearest road. While this generally describes how accessible a forest is, it also plays a role in fighting wildfires. <br>
*Horizontal_Distance_To_Fire_Points:* Distance to the nearest ignition point, i.e. a point in the forest that is susceptible to fire. <br>

__Area Code__ <br>
> *Wilderness_Area:*  The area code to which the current forest area belongs. <br>

__Type of soil present in the described forest area__ <br>
> *Soil_Type:* The type of soil in that area of the forest. <br>

__Target variable__ <br>
> *Cover_Type:* The class to be predicted, i.e. the type of forest cover. <br>

In [8]:
df = pd.read_csv(os.path.join(destination_path, "raw/forest_data.csv"), index_col='Id')
target = 'Cover_Type'
print(len(df.columns))
df.hist(bins=50, figsize=(20, 15))


####
<div style="border-radius:10px; border:#590d0d solid; padding: 15px; background-color: #d4ebea; font-size:100%; text-align:left">
    
<h2 align="center" style='text-decoration:underline;'> Correlation Analysis</h2>

Looking at the correlation between Cover_Type (our target) and the other features in the dataset, we find no strong correlation. The highest is a positive correlation with Wilderness_Area, at only 0.2.

Some other interesting correlations are listed below (non-exhaustive), __but keep in mind that any conclusions drawn from correlations are only hypotheses that still need validation.__

- *Hillshade_9am and Hillshade_3pm:* High negative correlation of -0.78. This is plausible if you think of the sun's trajectory over a hill during the course of the day.
- *Hillshade_Noon and Hillshade_3pm:* Positive correlation of 0.66.
- *Aspect and Hillshade_9am / Aspect and Hillshade_3pm:* With the former at -0.59 and the latter at 0.63, Hillshade_9am and Hillshade_3pm have a very similar but inverse relationship to Aspect. Again, this makes sense given the sun’s trajectory over a hill and the fact that Aspect describes which direction the point is facing (e.g. east-facing/west-facing).
- *Soil_Type and Elevation:* The highest correlation, at 0.83. The reason is not clear yet, and this is where Subject Matter Experts are important! At this point you should consider contacting them to help explain why there is such a strong relationship.
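
A ranking of feature pairs like the one above can also be produced programmatically instead of being read off a heatmap. The following is a minimal sketch on made-up data; the helper name `top_correlated_pairs` and the toy columns `morning`, `afternoon` and `unrelated` are assumptions for illustration and do not come from the forest dataset.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame, n=3):
    """Return the n feature pairs with the largest absolute correlation."""
    corr = frame.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    # Sort by magnitude but keep the sign in the returned values.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Made-up data: "afternoon" is roughly the mirror image of "morning".
rng = np.random.default_rng(0)
morning = rng.normal(size=200)
demo = pd.DataFrame({
    "morning": morning,
    "afternoon": -morning + 0.1 * rng.normal(size=200),
    "unrelated": rng.normal(size=200),
})

top = top_correlated_pairs(demo)
print(top)
```

Sorting by absolute value keeps strong negative relationships, such as the one between the two hillshade features, at the top of the list.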

In [None]:
attributes = df.copy()
corr_matrix = attributes.corr()
corr_matrix['Elevation'].sort_values(ascending=False)

In [None]:
cmap = sns.diverging_palette(230, 20, as_cmap=True)

plt.figure(figsize=(15, 10))
sns.heatmap(attributes.corr(), annot=True, fmt='.1g', vmin=-1, vmax=1, center=0, cmap=cmap)
plt.title("Correlation Matrix", fontweight='bold', fontsize='large')

####
<div style="border-radius:10px; border:#590d0d solid; padding: 15px; background-color: #d4ebea; font-size:100%; text-align:left">
    
<h2 align="center" style='text-decoration:underline;'> Training Data split </h2>

We check whether the classes are balanced. With exactly 2160 samples per class, the dataset is perfectly balanced and no further work is required. We create a random split with a fixed seed so that the specific split can always be reproduced.
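
To illustrate what the fixed seed buys us, here is a minimal sketch on a made-up, perfectly balanced toy frame (3 classes with 20 rows each; the `feature` column and its values are assumptions for illustration). Splitting twice with the same `random_state` yields the same rows in the training set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the forest data: 3 classes with 20 rows each.
toy = pd.DataFrame({
    "feature": np.arange(60),
    "Cover_Type": np.repeat([1, 2, 3], 20),
})

# Balance check: every class should show the same count.
class_counts = toy["Cover_Type"].value_counts().sort_index()
print(class_counts)

X_toy = toy.drop("Cover_Type", axis=1)
y_toy = toy["Cover_Type"]

# Two splits with the same seed select exactly the same rows.
first = train_test_split(X_toy, y_toy, random_state=42)
second = train_test_split(X_toy, y_toy, random_state=42)
same_split = first[0].index.equals(second[0].index)
print("Identical training rows:", same_split)
```

With the default `test_size`, a quarter of the rows end up in the test set.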



In [None]:
df[target].value_counts().sort_index()

In [None]:
df.dropna(inplace = True)
y = df[target]
X = df.drop(target, axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42 )

###

<div style="border-radius:10px; border:#590d0d solid; padding: 15px; background-color: #d4ebea; font-size:100%; text-align:left">
    
<h2 align="center" style='text-decoration:underline;'> Feature Engineering </h2>

We apply basic feature engineering to the dataset: <br>

1. __Apply One Hot Encoding__: Although our categorical features ('Wilderness_Area' and 'Soil_Type') are already integer encoded, integer encoding might imply an ordinal relationship. That is not the case here, so we apply One Hot Encoding.
2. __Scale the numerical features__: As we saw when looking at our attributes, they have very different scales. Having all the data on the same scale is important to avoid scenarios where one feature has more impact only because its larger scale magnifies its effect.

Other ideas that could have been explored are __calculating the Euclidean distance to the hydrology__ or __assigning a fire hazard score based on the distance to the road and the distance to the ignition point__.
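
As an illustration of the first idea, the Euclidean distance to the nearest water source can be derived from the horizontal and vertical distances as $\sqrt{h^2 + v^2}$. The following is a minimal sketch on made-up values; the helper `add_hydrology_distance` and the new column name `Euclidean_Distance_To_Hydrology` are assumptions and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd


def add_hydrology_distance(frame):
    """Return a copy of the frame with a straight-line distance to water added."""
    out = frame.copy()
    # np.hypot computes sqrt(h**2 + v**2) element-wise; the sign of v does not matter.
    out["Euclidean_Distance_To_Hydrology"] = np.hypot(
        out["Horizontal_Distance_To_Hydrology"],
        out["Vertical_Distance_To_Hydrology"],
    )
    return out


# Made-up rows; a negative vertical distance means the water lies below the point.
toy = pd.DataFrame({
    "Horizontal_Distance_To_Hydrology": [3, 30, 0],
    "Vertical_Distance_To_Hydrology": [4, -40, 7],
})

result = add_hydrology_distance(toy)
print(result)
```

Working on a copy leaves the original frame untouched, which makes it easy to compare models trained with and without the extra feature.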

Finally, we create a separate training/test dataset with these changes so that we can compare the effects later.

In [None]:
cat_cols = ['Wilderness_Area', 'Soil_Type']
num_cols = ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology',
       'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways',
       'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
       'Horizontal_Distance_To_Fire_Points']

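col_transformer = make_column_transformer(
        # handle_unknown='ignore' avoids an error if the test split contains
        # a category that never appeared in the training split.
        (OneHotEncoder(handle_unknown='ignore'), cat_cols),
        # All remaining (numerical) columns are standardized.
        remainder=StandardScaler())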
X_train.columns

In [None]:
X_train_transformed = col_transformer.fit_transform(X_train)
X_test_transformed = col_transformer.transform(X_test)