# Decision Trees: Homework

## 1 - Problem description

Imagine you are a data scientist at a bank.
The bank has collected data on past loan applicants and whether or not they defaulted on their loans.
Your task is to use this data to predict whether a new applicant is likely to repay their loan or default on it.

## 2 - Dataset

The dataset is available in the file `loan_data.csv` and contains the following information:

| Applicant ID | Age | Income (USD 1000s) | Owns House (Yes/No) | Previous Default (Yes/No) | Approved (Yes/No) |
|:-----------:|:---:|:------------------:|:-------------------:|:-------------------------:|:-----------------:|
|      1      |  58 |         4          |          No         |             No            |         No        |
|      2      |  24 |         6          |          No         |             No            |         Yes       |
|      3      |  27 |         9          |          Yes        |             No            |         Yes       |
|     ...     |  ... |         ...          |          ...        |             ...            |         ...       |

Build a decision tree that decides whether to approve a loan based on the applicant's characteristics.

Steps:

Data preprocessing: Convert categorical features such as "Owns House" and "Previous Default" into numeric values (for example, Yes = 1, No = 0).

Train a decision tree: Fit a decision tree algorithm to the data. Use 80% of the data for training and hold out 20% for testing.

Visualize the decision tree.

Interpretation: Based on the tree, infer some of the rules the bank appears to use when approving loans. For example, if a node splits applicants by whether they own a house, and most owners are approved, you could infer that owning a house increases an applicant's chances of approval.
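The four steps above can be sketched with scikit-learn's `DecisionTreeClassifier`. This is a minimal illustration under stated assumptions, not a reference solution: the small DataFrame is made up (only its first three rows mirror the table above) so the block runs without `loan_data.csv`, and the column names, `criterion="entropy"`, and `max_depth=3` are choices made here, not requirements of the exercise.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: rows 1-3 mirror the table above, the rest are invented
df_demo = pd.DataFrame({
    "Age": [58, 24, 27, 45, 33, 51, 29, 38, 62, 41],
    "Income ($1000s)": [4, 6, 9, 3, 8, 5, 7, 10, 2, 6],
    "Owns House": ["No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"],
    "Previous Default": ["No", "No", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No"],
    "Approved": ["No", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "No", "Yes"],
})

# Step 1: encode the Yes/No columns as 1/0
for column in ["Owns House", "Previous Default", "Approved"]:
    df_demo[column] = df_demo[column].map({"Yes": 1, "No": 0})

X_demo = df_demo.drop(columns=["Approved"])
y_demo = df_demo["Approved"]

# Step 2: 80/20 split, then fit the tree on the training part
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
clf.fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)

# Steps 3 and 4: a text rendering of the tree, from which rules can be read off
rules = export_text(clf, feature_names=list(X_demo.columns))
print(rules)
print("Test accuracy:", accuracy)
```

`sklearn.tree.plot_tree` draws the same tree graphically if matplotlib is available. With only ten invented rows the printed rules and accuracy carry no meaning; the point is the shape of the workflow.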

# 5. Manual Decision Tree Solution

In [38]:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load the dataset
df = pd.read_csv('loan_data.csv')

# Preprocess the data
df['Owns House'] = df['Owns House'].map({'Yes': 1, 'No': 0})
df['Previous Default'] = df['Previous Default'].map({'Yes': 1, 'No': 0})
df['Approved'] = df['Approved'].map({'Yes': 1, 'No': 0})

# Keep only the binary features. Age and Income are dropped because the manual
# tree below only splits on 0/1 features (it has no threshold search).
X = df.drop(columns=['Applicant ID', 'Age', 'Income ($1000s)', 'Approved'])
y = df['Approved']

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
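One detail of this split matters for the manual code that follows: `train_test_split` shuffles the rows by default, so the index labels of `X_train` and `y_train` no longer run 0, 1, 2, and so on. The sketch below shows the difference between label-based and positional lookup on a made-up series (`y_shuffled` is a hypothetical name, not part of the notebook).

```python
import pandas as pd

# Hypothetical labels whose index is out of order, as after a shuffled split
y_shuffled = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = y_shuffled.loc[0]       # the row whose index label is 0
by_position = y_shuffled.iloc[0]   # the first row, whatever its label
as_array = y_shuffled.to_numpy()   # plain array: indexing is always positional

print(by_label, by_position, as_array[0])
```

Code that walks rows with `range(len(...))` is working with positions, so converting to NumPy arrays with `.to_numpy()` (or using `.iloc`) keeps indices and rows aligned.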

In [32]:
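def compute_entropy(y):
    """Return the binary entropy (in bits) of the 0/1 labels in y."""
    entropy = 0.

    if len(y) != 0:
        # Fraction of positive (approved) examples at this node
        p1 = len(y[y==1]) / len(y)
        p0 = 1-p1

        # A pure node (p1 == 0 or p1 == 1) has zero entropy
        if p1 != 0 and p1 != 1:
            entropy = -p1*np.log2(p1)-p0*np.log2(p0)

    return entropy

print("Root node entropy: ", compute_entropy(y_train))

Root node entropy:  0.9580420222262995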
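As a sanity check on that number, the same formula can be evaluated directly from a class proportion. The helper `binary_entropy` below is a hypothetical stand-alone version of the function above. The proportion 0.38 is inferred from the printed entropy, not read from the data: a value of about 0.958 is what a node with roughly 38% (or, by symmetry, 62%) positive labels would give.

```python
import math

def binary_entropy(p1):
    """Entropy in bits of a two-class node with positive fraction p1."""
    if p1 == 0.0 or p1 == 1.0:
        return 0.0  # a pure node carries no uncertainty
    p0 = 1.0 - p1
    return -p1 * math.log2(p1) - p0 * math.log2(p0)

balanced = binary_entropy(0.5)    # maximum uncertainty
inferred = binary_entropy(0.38)   # proportion inferred from the output above
print(balanced, round(inferred, 4))
```

Entropy peaks at 1 bit for a 50/50 node and falls to 0 for a pure one, so a root entropy near 0.96 indicates fairly balanced classes.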


In [33]:
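def split_dataset(X, node_indices, feature):
    """Split the rows in node_indices on a binary feature.

    Rows where the feature equals 1 go left; all other rows go right.
    node_indices are row positions, not pandas index labels.
    """
    # Accept a DataFrame as well as an array; X[i, feature] needs an array
    X = np.asarray(X)

    left_indices = []
    right_indices = []

    for i in node_indices:
        if X[i, feature] == 1:
            left_indices.append(i)
        else:
            right_indices.append(i)

    return left_indices, right_indices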
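The same split can also be written without an explicit loop by using a boolean mask. This is an alternative sketch, not the notebook's own function: `split_dataset_vectorized` is a hypothetical name, and it assumes `X` is a 2-D NumPy array of 0/1 features.

```python
import numpy as np

def split_dataset_vectorized(X, node_indices, feature):
    """Return (left, right) row positions for a binary feature."""
    node_indices = np.asarray(node_indices, dtype=int)
    goes_left = X[node_indices, feature] == 1   # True where the feature is 1
    return node_indices[goes_left].tolist(), node_indices[~goes_left].tolist()

# Toy feature matrix: 4 rows, 2 binary features
X_toy = np.array([[1, 0],
                  [0, 1],
                  [1, 1],
                  [0, 0]])

left, right = split_dataset_vectorized(X_toy, [0, 1, 2, 3], feature=0)
print(left, right)
```

For the small dataset in this exercise the loop version is just as good; the mask form mainly pays off on larger arrays.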

In [34]:
def compute_information_gain(X, y, node_indices, feature):
    """Return the information gain from splitting node_indices on feature."""
    # Accept pandas objects as well as arrays; the indexing below is positional
    X, y = np.asarray(X), np.asarray(y)

    # An empty node cannot be split (and would divide by zero below)
    if len(node_indices) == 0:
        return 0.

    # Split dataset
    left_indices, right_indices = split_dataset(X, node_indices, feature)

    # Labels at the node and at each child
    y_node = y[node_indices]
    y_left = y[left_indices]
    y_right = y[right_indices]

    entropy_node = compute_entropy(y_node)
    entropy_left = compute_entropy(y_left)
    entropy_right = compute_entropy(y_right)

    # Weight each child by its share of the node's examples
    w_left = len(left_indices) / len(node_indices)
    w_right = len(right_indices) / len(node_indices)

    information_gain = entropy_node - ((w_left * entropy_left) + (w_right * entropy_right))

    return information_gain
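A tiny worked example makes the formula `entropy_node - (w_left * entropy_left + w_right * entropy_right)` concrete. The helpers and toy arrays below are hypothetical and self-contained; they compare a feature that separates the classes perfectly with one that tells nothing about them.

```python
import numpy as np

def entropy(y):
    """Binary entropy in bits; 0 for an empty or pure label array."""
    if len(y) == 0:
        return 0.0
    p1 = float(np.mean(y == 1))
    if p1 == 0.0 or p1 == 1.0:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

def information_gain(y, goes_left):
    """Node entropy minus the size-weighted entropy of the two children."""
    y_left, y_right = y[goes_left], y[~goes_left]
    w_left = len(y_left) / len(y)
    w_right = len(y_right) / len(y)
    return entropy(y) - (w_left * entropy(y_left) + w_right * entropy(y_right))

y_toy = np.array([1, 1, 0, 0])                     # two approvals, two rejections
perfect = np.array([True, True, False, False])     # feature matches the label
useless = np.array([True, False, True, False])     # feature unrelated to the label

gain_perfect = information_gain(y_toy, perfect)
gain_useless = information_gain(y_toy, useless)
print(gain_perfect, gain_useless)
```

The perfect feature removes the full 1 bit of uncertainty at the node, while the useless one leaves both children as mixed as the parent, for a gain of 0.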

In [35]:
def get_best_split(X, y, node_indices):
    """Return the index of the feature with the highest information gain.

    If no feature yields a positive gain, feature 0 is returned.
    """
    num_features = X.shape[1]

    best_feature = 0
    max_info_gain = 0

    for feature in range(num_features):
        info_gain = compute_information_gain(X, y, node_indices, feature=feature)

        if info_gain > max_info_gain:
            max_info_gain = info_gain
            best_feature = feature

    return best_feature

In [36]:
tree = []

def build_tree_recursive(X, y, node_indices, branch_name, max_depth, current_depth):
    # Stop at maximum depth, or when the node is empty or pure (nothing left to split)
    labels = np.asarray(y)[node_indices]
    if current_depth == max_depth or len(set(labels.tolist())) <= 1:
        formatting = " "*current_depth + "-"*current_depth
        print(formatting, "%s leaf node with indices" % branch_name, node_indices)
        return

    # Otherwise, get the best feature to split on at this node
    best_feature = get_best_split(X, y, node_indices)

    # Split the dataset at the best feature
    left_indices, right_indices = split_dataset(X, node_indices, best_feature)

    # If one side is empty, no feature separates these examples any further
    if len(left_indices) == 0 or len(right_indices) == 0:
        formatting = " "*current_depth + "-"*current_depth
        print(formatting, "%s leaf node with indices" % branch_name, node_indices)
        return

    formatting = "-"*current_depth
    print("%s Depth %d, %s: Split on feature: %d" % (formatting, current_depth, branch_name, best_feature))
    tree.append((left_indices, right_indices, best_feature))

    # Continue splitting the left and the right child. Increment current depth
    build_tree_recursive(X, y, left_indices, "Left", max_depth, current_depth+1)
    build_tree_recursive(X, y, right_indices, "Right", max_depth, current_depth+1)
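The `tree` list above records each split but not a prediction for each leaf. One possible way to turn the same idea into something that can classify a new applicant is to store the tree as nested dictionaries with a majority-class label at each leaf. This is a sketch with hypothetical names (`build`, `predict`) and a toy array, not the notebook's own method.

```python
import numpy as np

def entropy(y):
    """Binary entropy in bits; 0 for an empty or pure label array."""
    if len(y) == 0:
        return 0.0
    p1 = float(np.mean(y == 1))
    if p1 == 0.0 or p1 == 1.0:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

def build(X, y, idx, depth=0, max_depth=2):
    """Return nested dicts: {'leaf': cls} or {'feature', 'left', 'right'}."""
    labels = y[idx]
    majority = int(labels.mean() >= 0.5) if len(labels) else 0
    if depth == max_depth or entropy(labels) == 0.0:
        return {"leaf": majority}
    best, best_gain = None, 1e-12  # ignore splits with no real gain
    for f in range(X.shape[1]):
        goes_left = X[idx, f] == 1
        left, right = idx[goes_left], idx[~goes_left]
        children = (len(left) * entropy(y[left])
                    + len(right) * entropy(y[right])) / len(idx)
        gain = entropy(labels) - children
        if gain > best_gain:
            best, best_gain = f, gain
    if best is None:
        return {"leaf": majority}
    goes_left = X[idx, best] == 1
    return {
        "feature": best,
        "left": build(X, y, idx[goes_left], depth + 1, max_depth),
        "right": build(X, y, idx[~goes_left], depth + 1, max_depth),
    }

def predict(node, x):
    """Walk from the root to a leaf: feature == 1 goes left, otherwise right."""
    while "leaf" not in node:
        node = node["left"] if x[node["feature"]] == 1 else node["right"]
    return node["leaf"]

# Toy data in which the label simply equals feature 0
X_toy = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
y_toy = np.array([1, 1, 0, 0])
toy_tree = build(X_toy, y_toy, np.arange(len(y_toy)))
pred_a = predict(toy_tree, [1, 0])
pred_b = predict(toy_tree, [0, 1])
print(toy_tree, pred_a, pred_b)
```

Keeping the leaf label inside the structure is what makes evaluation on the held-out 20% possible; the flat list of index tuples would need a second pass to recover it.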

In [37]:
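# The tree functions index rows by position, so pass NumPy arrays.
# Passing the DataFrame directly fails inside split_dataset with
#   InvalidIndexError: (slice(None, 1, None), 0)
# because a DataFrame does not support NumPy-style [rows, column] indexing.
X_train_np = X_train.to_numpy()
y_train_np = y_train.to_numpy()

root_indices = list(range(len(X_train_np)))

# Only two binary features remain, so two levels of splits are the most
# that can separate the data
build_tree_recursive(X_train_np, y_train_np, root_indices, "Root", max_depth=2, current_depth=0)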