# Dataset
[Credit card Details Binary Classification Problem](https://www.kaggle.com/datasets/rohitudageri/credit-card-details)

## Source
The data comes from the Kaggle dataset "Credit card Details Binary Classification Problem" (https://www.kaggle.com/datasets/rohitudageri/credit-card-details), which contains information about credit card applications.

## Why this is a classification problem
This is a binary classification problem because the label takes the value 0 or 1, indicating whether a credit card application was approved or rejected.

## Dataset description


  | Column          | Type               | Description                                                           |
  |-----------------|--------------------|-----------------------------------------------------------------------|
  | Ind_ID          | ID                 | Unique applicant identifier                                           |
  | GENDER          | Categorical        | Gender (M/F)                                                          |
  | Car_Owner       | Categorical        | Owns a car? (Y/N)                                                     |
  | Propert_Owner   | Categorical        | Owns property? (Y/N)                                                  |
  | CHILDREN        | Numeric            | Number of children                                                    |
  | Annual_income   | Continuous numeric | Annual income                                                         |
  | Type_Income     | Categorical        | Income type (Pensioner, Commercial associate, Working, State servant) |
  | EDUCATION       | Categorical        | Education level                                                       |
  | Marital_status  | Categorical        | Marital status                                                        |
  | Housing_type    | Categorical        | Housing type                                                          |
  | Birthday_count  | Continuous numeric | Age, as a negative day count from the reference date                  |
  | Employed_days   | Continuous numeric | Days employed (negative = currently employed, 365243 = pensioner)     |
  | Mobile_phone    | Binary             | Has a mobile phone? (1/0)                                             |
  | Work_Phone      | Binary             | Has a work phone? (1/0)                                               |
  | Phone           | Binary             | Has a landline? (1/0)                                                 |
  | EMAIL_ID        | Binary             | Has an email address? (1/0)                                           |
  | Type_Occupation | Categorical        | Occupation                                                            |
  | Family_Members  | Numeric            | Number of family members                                              |
  | label           | Binary             | Application outcome: approved (0) or rejected (1)                     |
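
The two day-count columns in the table above are easier to read once converted. The following is a minimal sketch of one way to do that, assuming the encodings described in the table; the sample rows and the derived column names (`age_years`, `is_pensioner`, `years_employed`) are hypothetical and not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the values are illustrative, not taken from the dataset.
sample = pd.DataFrame({
    "Birthday_count": [-18772.0, -13557.0, np.nan],
    "Employed_days": [365243, -586, -2182],
})

SENTINEL_PENSIONER = 365243  # placeholder value described in the table above

# Birthday_count is a negative day count, so flip the sign to get an age.
sample["age_years"] = -sample["Birthday_count"] / 365.25

# Flag the sentinel first, then convert only genuine employment lengths.
sample["is_pensioner"] = sample["Employed_days"] == SENTINEL_PENSIONER
sample["years_employed"] = (
    (-sample["Employed_days"]).where(~sample["is_pensioner"]) / 365.25
)

print(sample.round(2))
```

Keeping the sentinel as a separate flag avoids treating 365243 as a real duration when computing means or correlations.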


## Data exploration

In [1]:
import pandas as pd
import numpy as np

# Load the data
df = pd.read_csv('datos/credit_card.csv')
labels = pd.read_csv('datos/credit_card_label.csv')

# Merge features and labels on the applicant ID
data = df.merge(labels, on='Ind_ID')

print("=== DATASET INFORMATION ===")
print(f"Dimensions: {data.shape}")
print(f"\nColumns ({len(data.columns)}):")
for col in data.columns:
    print(f"  - {col}")

print("\n=== DATA TYPES ===")
print(data.dtypes)

print("\n=== LABEL VALUE COUNTS ===")
print(data['label'].value_counts())

print("\n=== NUMERIC STATISTICS ===")
print(data.describe())

print("\n=== MISSING VALUES ===")
print(data.isnull().sum())

print("\n=== CORRELATIONS WITH LABEL ===")
# Numeric columns only
numeric_cols = data.select_dtypes(include=[np.number]).columns
corr_with_label = data[numeric_cols].corr()['label'].sort_values(ascending=False)
print(corr_with_label)

=== DATASET INFORMATION ===
Dimensions: (1548, 19)

Columns (19):
  - Ind_ID
  - GENDER
  - Car_Owner
  - Propert_Owner
  - CHILDREN
  - Annual_income
  - Type_Income
  - EDUCATION
  - Marital_status
  - Housing_type
  - Birthday_count
  - Employed_days
  - Mobile_phone
  - Work_Phone
  - Phone
  - EMAIL_ID
  - Type_Occupation
  - Family_Members
  - label

=== DATA TYPES ===
Ind_ID               int64
GENDER              object
Car_Owner           object
Propert_Owner       object
CHILDREN             int64
Annual_income      float64
Type_Income         object
EDUCATION           object
Marital_status      object
Housing_type        object
Birthday_count     float64
Employed_days        int64
Mobile_phone         int64
Work_Phone           int64
Phone                int64
EMAIL_ID             int64
Type_Occupation     object
Family_Members       int64
label                int64
dtype: object

=== LABEL VALUE COUNTS ===
label
0    1373
1     175
Name: count, dtype: int64
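
The counts above show a strongly imbalanced label. As a quick worked calculation, the sketch below derives the class shares from those two counts and the accuracy that a trivial "always predict 0" rule would reach on the full dataset; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 1373, 1: 175}

total = sum(counts.values())
minority_share = counts[1] / total
majority_share = counts[0] / total

print(f"Total rows: {total}")
print(f"Share of label 1: {minority_share:.3f}")
print(f"Accuracy of always predicting 0: {majority_share:.3f}")
```

Any accuracy figure reported later is worth comparing against this majority-class baseline of roughly 0.887.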


# Using the numeric attributes only

## Select the numeric attributes

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score
from sklearn.impute import SimpleImputer

# select_dtypes keeps every numeric column, including Ind_ID and the binary flags
numeric_cols = data.select_dtypes(include=[np.number]).columns

X = data[numeric_cols].drop(columns=['label'])
y = data['label']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=41)

# Impute missing values with the mean (fitted on the training set only)
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)
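
The split above is not stratified, so the share of label 1 can differ between the train and test sets. A possible variant, shown here as a sketch on synthetic stand-in data with the same class counts (`X_demo` and `y_demo` are hypothetical names), passes `stratify` to `train_test_split` to keep the class proportions aligned.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the dataset (1373 zeros, 175 ones).
y_demo = np.array([0] * 1373 + [1] * 175)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

# stratify=y_demo asks for the same class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=41, stratify=y_demo
)

print("Rows in train/test:", len(y_tr), len(y_te))
print("Positive share in train:", round(y_tr.mean(), 3))
print("Positive share in test: ", round(y_te.mean(), 3))
```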


## Logistic regression model for classification

In [3]:
from sklearn.linear_model import LogisticRegression

# Train the model
model = LogisticRegression()
model.fit(X_train_imputed, y_train)

# Evaluate on the held-out test split
y_pred = model.predict(X_test_imputed)
print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("Recall Score:", recall_score(y_test, y_pred))
print("Precision Score:", precision_score(y_test, y_pred))

Accuracy Score: 0.9139784946236559
Recall Score: 0.0
Precision Score: 0.0


UndefinedMetricWarning (truncated in this export): the model predicted no positive samples, so precision is ill-defined and reported as 0.0.
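
An accuracy of 0.914 with zero recall is what a model that never predicts class 1 would produce. The sketch below reproduces these metrics with made-up label arrays, assuming the 30% test split of 1548 rows is rounded up to 465 rows, of which 425 are negatives (425 / 465 matches the reported accuracy). These counts are inferred, not printed by the original notebook.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Assumed test-split composition: 425 negatives and 40 positives (inferred).
y_true = np.array([0] * 425 + [1] * 40)
y_all_negative = np.zeros_like(y_true)  # a model that never predicts class 1

acc = accuracy_score(y_true, y_all_negative)
rec = recall_score(y_true, y_all_negative, zero_division=0)
prec = precision_score(y_true, y_all_negative, zero_division=0)

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_all_negative))
print("Accuracy:", acc)
print("Recall:", rec)
print("Precision:", prec)
```

Passing `zero_division=0` states the fallback value explicitly and silences the warning.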


## Decision tree model

In [4]:
from sklearn.tree import DecisionTreeClassifier

# Train a shallow tree using the entropy criterion
model = DecisionTreeClassifier(criterion='entropy', max_depth=3)
model.fit(X_train_imputed, y_train)

# Evaluate on the held-out test split
y_pred = model.predict(X_test_imputed)
print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("Recall Score:", recall_score(y_test, y_pred))
print("Precision Score:", precision_score(y_test, y_pred))

Accuracy Score: 0.9010752688172043
Recall Score: 0.025
Precision Score: 0.125
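
The three scores above are enough to recover the underlying confusion counts. The following worked calculation is a sketch that assumes a test split of 465 rows with 40 positives, a composition inferred from the logistic regression accuracy reported earlier rather than stated in the notebook output.

```python
# Assumed test-split composition (inferred, not printed by the notebook).
n_test, n_pos = 465, 40
accuracy, recall, precision = 0.9010752688172043, 0.025, 0.125

tp = round(recall * n_pos)            # true positives
fn = n_pos - tp                       # positives the tree missed
fp = round(tp / precision) - tp       # negatives flagged as positive
tn = round(accuracy * n_test) - tp    # negatives classified correctly

f1 = 2 * precision * recall / (precision + recall)
print(f"TP={tp} FP={fp} FN={fn} TN={tn} F1={f1:.3f}")
```

Under these assumptions the tree finds a single positive case out of 40, so its slightly lower accuracy buys almost no recall.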


# Using both the numeric and the categorical attributes

## Select the data

In [5]:
X = data.drop(columns=['label'])
y = data['label']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=41)

# One-hot encode the categorical variables
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)

# Align the test columns to the training columns (missing dummies are filled with 0)
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)

# Impute missing values with the mean (fitted on the training set only)
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)
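
The encode, align and impute steps above can also be expressed as one scikit-learn transformer. This is an alternative sketch, not the approach used in this notebook: it relies on `ColumnTransformer` with `OneHotEncoder(handle_unknown="ignore")`, and the two small frames are made-up examples that reuse column names from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Tiny illustrative frames; the values are made up for the demo.
train = pd.DataFrame({
    "Annual_income": [180000.0, np.nan, 90000.0],
    "Type_Income": ["Working", "Pensioner", "Working"],
})
test = pd.DataFrame({
    "Annual_income": [120000.0],
    "Type_Income": ["State servant"],  # category never seen in train
})

preprocess = ColumnTransformer(
    [
        ("num", SimpleImputer(strategy="mean"), ["Annual_income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Type_Income"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on train only; the test set reuses the learned mean and categories.
train_matrix = preprocess.fit_transform(train)
test_matrix = preprocess.transform(test)

print(train_matrix.shape, test_matrix.shape)
print(test_matrix)
```

Categories that appear only in the test set are encoded as all zeros, which mirrors what `align(..., join='left', fill_value=0)` achieves.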

## Logistic regression model for classification

In [6]:
# Train the model
model = LogisticRegression()
model.fit(X_train_imputed, y_train)

# Evaluate on the held-out test split
y_pred = model.predict(X_test_imputed)
print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("Recall Score:", recall_score(y_test, y_pred))
print("Precision Score:", precision_score(y_test, y_pred))

Accuracy Score: 0.9139784946236559
Recall Score: 0.0
Precision Score: 0.0


UndefinedMetricWarning (truncated in this export): the model again predicted no positive samples, so precision is ill-defined and reported as 0.0.
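
Adding the categorical features leaves the scores unchanged, which suggests the issue lies in the class imbalance and the unscaled inputs rather than in the feature set. One common remedy, sketched here on synthetic data instead of the real features, is to standardize the inputs and pass `class_weight="balanced"` to `LogisticRegression`. The data, names and settings below are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (about 11% positives) standing in for the real features.
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=10, n_informative=4,
    weights=[0.89, 0.11], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Scaling helps the solver converge; class_weight="balanced" reweights the
# loss so that errors on the rare class count more.
weighted = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
weighted.fit(X_tr, y_tr)

recall = recall_score(y_te, weighted.predict(X_te))
print("Recall on the synthetic test split:", round(recall, 3))
```

Whether this helps on the credit card data would need to be checked by rerunning the notebook; the sketch only shows the mechanics.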


## Decision tree model with GridSearchCV for hyperparameter search

In [7]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold


tree = DecisionTreeClassifier(random_state=42)
param_grid = {
    'max_depth': [None, 5, 10]
}

# 5-fold stratified cross-validation, selecting max_depth by mean accuracy
grid_search = GridSearchCV(estimator=tree, param_grid=param_grid,
                           cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
                           scoring='accuracy', n_jobs=-1)

grid_search.fit(X_train_imputed, y_train)

print("Best hyperparameters found:", grid_search.best_params_)
print("Best mean accuracy (CV):", grid_search.best_score_)

Best hyperparameters found: {'max_depth': 5}
Best mean accuracy (CV): 0.8707415941286909
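
The best cross-validated accuracy above (about 0.871) is below the share of label 0 in the full dataset (1373 / 1548, about 0.887), so accuracy alone may not be the most informative selection criterion here. The sketch below shows one possible variant on synthetic data: it tracks several metrics and refits on F1. The data and the choice of `class_weight="balanced"` are assumptions, not part of the original experiment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the encoded training set.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=4,
    weights=[0.89, 0.11], random_state=0,
)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(class_weight="balanced", random_state=42),
    param_grid={"max_depth": [None, 5, 10]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring={"accuracy": "accuracy", "recall": "recall", "f1": "f1"},
    refit="f1",  # choose the depth with the best mean F1, not accuracy
)
search.fit(X_demo, y_demo)

print("Best params (by F1):", search.best_params_)
for metric in ("accuracy", "recall", "f1"):
    mean_score = search.cv_results_[f"mean_test_{metric}"][search.best_index_]
    print(f"Mean CV {metric}: {mean_score:.3f}")
```

Reporting recall and F1 next to accuracy makes it visible when a candidate wins only by predicting the majority class.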
