# Decision Trees: Homework

## 1 - Problem description

Imagine you are a data scientist at a bank.
The bank has collected data on past loan applicants, including whether or not they defaulted on their loans.
Your task is to use this data to predict whether a new applicant is likely to repay their loan or default on it.

## 1 - Dataset

The dataset is available in the file `loan_data.csv` and contains the following information:

| Applicant ID | Age | Income (USD 1000s) | Owns House (Yes/No) | Previous Default (Yes/No) | Approved (Yes/No) |
|:-----------:|:---:|:------------------:|:-------------------:|:-------------------------:|:-----------------:|
|      1      |  58 |         4          |          No         |             No            |         No        |
|      2      |  24 |         6          |          No         |             No            |         Yes       |
|      3      |  27 |         9          |          Yes        |             No            |         Yes       |
|     ...     |  ... |         ...          |          ...        |             ...            |         ...       |

Build a decision tree that decides whether to approve a loan based on the applicant's characteristics.

Steps:

Data preprocessing: Convert categorical features such as "Owns House" and "Previous Default" to numeric values (for example, Yes = 1, No = 0).

Train a decision tree: Fit a decision tree algorithm to the data. Use 80% of the data for training and hold out 20% for testing.

Visualize the decision tree.

Interpretation: Based on the tree, infer some of the rules the bank appears to be using to approve loans. For example, if a node splits applicants by whether they own a house, and most of the homeowners are approved, you could infer that owning a house increases an applicant's chances of approval.
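To make the interpretation step more concrete, the sketch below shows one way a tree can score a candidate split: it computes the reduction in Gini impurity for two binary features. scikit-learn's `DecisionTreeClassifier` uses Gini impurity by default, but the helper names `gini` and `split_gain` and the toy rows are assumptions made up for illustration; none of them come from `loan_data.csv`.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 = rejected, 1 = approved)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_gain(rows, feature, target="Approved"):
    """Impurity reduction obtained by splitting rows on a binary (0/1) feature."""
    parent = [r[target] for r in rows]
    left = [r[target] for r in rows if r[feature] == 0]
    right = [r[target] for r in rows if r[feature] == 1]
    n = len(rows)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


# Hypothetical toy rows, invented for this example (not from loan_data.csv)
toy_rows = [
    {"Owns House": 1, "Previous Default": 0, "Approved": 1},
    {"Owns House": 1, "Previous Default": 0, "Approved": 1},
    {"Owns House": 1, "Previous Default": 1, "Approved": 0},
    {"Owns House": 0, "Previous Default": 0, "Approved": 1},
    {"Owns House": 0, "Previous Default": 1, "Approved": 0},
    {"Owns House": 0, "Previous Default": 1, "Approved": 0},
]

# The feature with the larger gain is the one a greedy tree would split on first
for feature in ("Owns House", "Previous Default"):
    print(feature, round(split_gain(toy_rows, feature), 3))
```

In these toy rows, "Previous Default" separates approvals from rejections perfectly, so it gets the larger gain; a rule read off such a node would be "applicants with a previous default are rejected".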

## 2 - XGBoost solution

In [40]:
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load the dataset
df = pd.read_csv('loan_data.csv')

# Preprocess the data
df['Owns House'] = df['Owns House'].map({'Yes': 1, 'No': 0})
df['Previous Default'] = df['Previous Default'].map({'Yes': 1, 'No': 0})
df['Approved'] = df['Approved'].map({'Yes': 1, 'No': 0})

# Splitting the dataset into training and test sets
X = df.drop(columns=['Applicant ID', 'Approved'])
y = df['Approved']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training the XGBoost model
model = xgb.XGBClassifier(objective="binary:logistic", random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
print("\nClassification Report:\n", classification_report(y_test, y_pred))



Accuracy: 92.50%

Classification Report:
               precision    recall  f1-score   support

           0       0.94      0.94      0.94       124
           1       0.91      0.89      0.90        76

    accuracy                           0.93       200
   macro avg       0.92      0.92      0.92       200
weighted avg       0.92      0.93      0.92       200



In [41]:
# ... [same code as before]

# Predictions
y_pred = model.predict(X_test)

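# Collect actual and predicted labels next to the test features.
# Work on a copy so X_test keeps only the model's input columns.
results = X_test.copy()
results['Actual Approved'] = y_test
results['Predicted Approved'] = y_pred

results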


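|     | Age | Income ($1000s) | Owns House | Previous Default | Actual Approved | Predicted Approved |
|:---:|:---:|:---------------:|:----------:|:----------------:|:---------------:|:------------------:|
| 521 |  46 |       97        |     1      |        0         |        1        |         1          |
| 737 |  25 |       96        |     1      |        0         |        1        |         1          |
| 740 |  23 |       35        |     1      |        1         |        0        |         0          |
| 660 |  51 |       96        |     1      |        0         |        1        |         1          |
| 411 |  60 |       75        |     0      |        0         |        1        |         1          |
| ... | ... |       ...       |    ...     |       ...        |       ...       |        ...         |
| 408 |  49 |       52        |     1      |        0         |        1        |         1          |
| 332 |  30 |       81        |     0      |        0         |        1        |         1          |
| 208 |  27 |       86        |     1      |        1         |        0        |         0          |
| 613 |  33 |       72        |     1      |        1         |        0        |         0          |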
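The precision and recall figures in the classification report can be reproduced from pairs of actual and predicted labels like the ones in the table above. The sketch below does this in plain Python; the helper names `confusion_counts` and `precision_recall` and the label lists are hypothetical and are not the notebook's real predictions.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn) for one class treated as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fp, fn, tn


def precision_recall(actual, predicted, positive=1):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn)."""
    tp, fp, fn, _ = confusion_counts(actual, predicted, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels, invented for this example (1 = approved, 0 = rejected)
actual = [1, 1, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_counts(actual, predicted))
print(precision_recall(actual, predicted))
```

With the notebook's own data, the same functions could be called on the "Actual Approved" and "Predicted Approved" columns converted to lists.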


## 3 - Random Forest solution

In [42]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load the dataset
df = pd.read_csv('loan_data.csv')

# Preprocess the data
df['Owns House'] = df['Owns House'].map({'Yes': 1, 'No': 0})
df['Previous Default'] = df['Previous Default'].map({'Yes': 1, 'No': 0})
df['Approved'] = df['Approved'].map({'Yes': 1, 'No': 0})

# Splitting the dataset into training and test sets
X = df.drop(columns=['Applicant ID', 'Approved'])
y = df['Approved']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Import the Random Forest classifier from scikit-learn
from sklearn.ensemble import RandomForestClassifier

# Train the Random Forest model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 91.50%

Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.93      0.93       124
           1       0.88      0.89      0.89        76

    accuracy                           0.92       200
   macro avg       0.91      0.91      0.91       200
weighted avg       0.92      0.92      0.92       200



## 4 - Decision Tree solution

In [43]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load the dataset
df = pd.read_csv('loan_data.csv')

# Preprocess the data
df['Owns House'] = df['Owns House'].map({'Yes': 1, 'No': 0})
df['Previous Default'] = df['Previous Default'].map({'Yes': 1, 'No': 0})
df['Approved'] = df['Approved'].map({'Yes': 1, 'No': 0})

# Splitting the dataset into training and test sets
X = df.drop(columns=['Applicant ID', 'Approved'])
y = df['Approved']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Import the Decision Tree classifier from scikit-learn
from sklearn.tree import DecisionTreeClassifier

# Train the Decision Tree model
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 90.00%

Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.93      0.92       124
           1       0.88      0.86      0.87        76

    accuracy                           0.90       200
   macro avg       0.90      0.89      0.89       200
weighted avg       0.90      0.90      0.90       200
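The task also asks for the tree to be visualized and interpreted, which the cell above does not do. One option, sketched here, is scikit-learn's `export_text`, which prints a fitted tree as nested threshold rules. Because `loan_data.csv` is not reproduced here, the example fits on a small made-up DataFrame (`toy`); its values, the `max_depth=3` setting, and the variable names are assumptions for illustration only.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data standing in for loan_data.csv (all values are made up)
toy = pd.DataFrame({
    "Age": [25, 40, 33, 58, 29, 46, 51, 37],
    "Income ($1000s)": [30, 85, 60, 90, 95, 70, 40, 55],
    "Owns House": [0, 1, 1, 0, 1, 0, 1, 0],
    "Previous Default": [1, 0, 0, 1, 0, 0, 1, 0],
    "Approved": [0, 1, 1, 0, 1, 1, 0, 1],
})
X = toy.drop(columns=["Approved"])
y = toy["Approved"]

# A shallow tree keeps the printed rules short enough to read
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)

# Render the fitted tree as indented text rules, one line per node
rules = export_text(tree, feature_names=list(X.columns))
print(rules)
```

With the homework data, the same call could be applied to the fitted `model`, for example `export_text(model, feature_names=list(X.columns))`. `sklearn.tree.plot_tree` gives a graphical view of the same tree if matplotlib is installed.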

