# 📊 Invoice Analysis - Chile

This notebook runs accounting validations on a dataset of Chilean invoices:
- ✅ IVA (VAT) validation at 19%
- ✅ Total validation: total = net + tax
- ✅ Period validation against the issue date
- ✅ Error flag generation
- ✅ Export of processed data
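
The checks listed above can be sketched for a single invoice in plain Python. This is a minimal illustration rather than the notebook's own implementation: the function name `validate_invoice` and the dict layout are assumptions, while the field names and error messages mirror the dataset columns and flags used in the cells that follow.

```python
from datetime import date

IVA_RATE = 0.19  # Chilean VAT rate stated in this notebook


def validate_invoice(invoice: dict) -> list[str]:
    """Return the list of failed checks for one invoice (hypothetical helper)."""
    errors = []
    # IVA check: tax should equal 19% of the net amount (sub-peso tolerance).
    if abs(invoice["tax"] - invoice["net_amount"] * IVA_RATE) >= 0.5:
        errors.append("IVA incorrecto")
    # Total check: total should equal net + tax.
    if invoice["total"] != invoice["net_amount"] + invoice["tax"]:
        errors.append("Total incorrecto")
    # Period check: period should be the issue date's year and month.
    if invoice["period"] != invoice["issue_date"].strftime("%Y-%m"):
        errors.append("Período no coincide")
    return errors


ok_invoice = {
    "issue_date": date(2024, 1, 5), "period": "2024-01",
    "net_amount": 100000, "tax": 19000, "total": 119000,
}
bad_invoice = {
    "issue_date": date(2024, 1, 15), "period": "2024-01",
    "net_amount": 150000, "tax": 30000, "total": 180000,
}
print(validate_invoice(ok_invoice))
print(validate_invoice(bad_invoice))
```

An empty list means the invoice passed every check; the second invoice fails only the IVA check because 19% of 150,000 is 28,500, not 30,000.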

## 1. Data Loading

In [1]:
import pandas as pd

df = pd.read_csv("../data/sample_invoices.csv")
df

Unnamed: 0,invoice_id,issue_date,provider,rut_provider,net_amount,tax,total,period
0,F001,2024-01-05,Proveedor Servicios Ltda,76.123.456-7,100000,19000,119000,2024-01
1,F002,2024-01-10,Comercial Andes SpA,77.987.654-3,200000,38000,238000,2024-01
2,F003,2024-01-15,Transportes del Sur Ltda,78.456.123-9,150000,30000,180000,2024-01
3,F004,2024-01-20,Suministros Patagonia SpA,76.999.888-1,120000,20000,140000,2024-01


In [2]:
df["issue_date"] = pd.to_datetime(df["issue_date"])
df[["net_amount", "tax", "total"]] = df[["net_amount", "tax", "total"]].astype(float)

df

Unnamed: 0,invoice_id,issue_date,provider,rut_provider,net_amount,tax,total,period
0,F001,2024-01-05,Proveedor Servicios Ltda,76.123.456-7,100000.0,19000.0,119000.0,2024-01
1,F002,2024-01-10,Comercial Andes SpA,77.987.654-3,200000.0,38000.0,238000.0,2024-01
2,F003,2024-01-15,Transportes del Sur Ltda,78.456.123-9,150000.0,30000.0,180000.0,2024-01
3,F004,2024-01-20,Suministros Patagonia SpA,76.999.888-1,120000.0,20000.0,140000.0,2024-01


## 2. Accounting Validations

In [3]:
# IVA validation (19%); amounts are floats, so compare with a sub-peso tolerance
df["expected_tax"] = df["net_amount"] * 0.19
df["tax_ok"] = (df["tax"] - df["expected_tax"]).abs() < 0.5

# Total validation: total = net + tax
df["expected_total"] = df["net_amount"] + df["tax"]
df["total_ok"] = df["total"] == df["expected_total"]

# Period validation against the issue date
df["expected_period"] = df["issue_date"].dt.strftime("%Y-%m")
df["period_ok"] = df["period"] == df["expected_period"]

df[["invoice_id", "tax_ok", "total_ok", "period_ok"]]

Unnamed: 0,invoice_id,tax_ok,total_ok,period_ok
0,F001,True,True,True
1,F002,True,True,True
2,F003,False,True,True
3,F004,False,True,True
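
To see why F003 and F004 fail the IVA check, it can help to look at the size of the gap rather than only the boolean. The sketch below rebuilds the four invoices inline (values copied from the table in section 1) and adds a `tax_diff` column; that column name is an assumption introduced here for illustration.

```python
import pandas as pd

# Net and tax amounts copied from the sample invoices shown in section 1.
invoices = pd.DataFrame({
    "invoice_id": ["F001", "F002", "F003", "F004"],
    "net_amount": [100000.0, 200000.0, 150000.0, 120000.0],
    "tax": [19000.0, 38000.0, 30000.0, 20000.0],
})

# Expected IVA at 19%, rounded to whole pesos to avoid float noise.
invoices["expected_tax"] = (invoices["net_amount"] * 0.19).round()
# Positive means the invoice declares too much tax, negative too little.
invoices["tax_diff"] = invoices["tax"] - invoices["expected_tax"]

print(invoices[["invoice_id", "expected_tax", "tax_diff"]])
```

With these values, F003 declares 1,500 pesos more tax than 19% of its net amount (30,000 vs. 28,500), and F004 declares 2,800 less (20,000 vs. 22,800).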


In [4]:
# Build readable error flags
def get_errors(row):
    errors = []
    if not row["tax_ok"]:
        errors.append("IVA incorrecto")
    if not row["total_ok"]:
        errors.append("Total incorrecto")
    if not row["period_ok"]:
        errors.append("Período no coincide")
    return ", ".join(errors) if errors else "OK"

df["error_type"] = df.apply(get_errors, axis=1)

df[["invoice_id", "provider", "tax_ok", "total_ok", "period_ok", "error_type"]]

Unnamed: 0,invoice_id,provider,tax_ok,total_ok,period_ok,error_type
0,F001,Proveedor Servicios Ltda,True,True,True,OK
1,F002,Comercial Andes SpA,True,True,True,OK
2,F003,Transportes del Sur Ltda,False,True,True,IVA incorrecto
3,F004,Suministros Patagonia SpA,False,True,True,IVA incorrecto


## 3. Executive Summary

In [17]:
summary = df["error_type"].value_counts().reset_index()
summary.columns = ["status", "count"]
summary

Unnamed: 0,status,count
0,OK,2
1,IVA incorrecto,2


## 4. Export Processed Data

In [18]:
df.to_csv("../data/processed_invoices.csv", index=False)
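
One thing to keep in mind when reloading the exported file: CSV stores no dtypes, so `issue_date` comes back as text unless it is parsed again. Below is a minimal sketch of the round trip, using an in-memory buffer in place of `../data/processed_invoices.csv` and a two-row stand-in for `df` (both are assumptions for illustration).

```python
import io

import pandas as pd

# Two-row stand-in for the processed DataFrame.
processed = pd.DataFrame({
    "invoice_id": ["F001", "F003"],
    "issue_date": pd.to_datetime(["2024-01-05", "2024-01-15"]),
    "tax_ok": [True, False],
    "error_type": ["OK", "IVA incorrecto"],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
processed.to_csv(buffer, index=False)

# Without parse_dates, issue_date is read back as plain strings.
buffer.seek(0)
as_text = pd.read_csv(buffer)

# parse_dates restores the datetime dtype on reload.
buffer.seek(0)
reloaded = pd.read_csv(buffer, parse_dates=["issue_date"])

print(as_text["issue_date"].dtype)
print(reloaded["issue_date"].dtype)
```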

## 5. Machine Learning - Error Classification

In [5]:
# Create binary target variable (0 = OK, 1 = Error)
df["final_status"] = df["error_type"].apply(lambda x: 0 if x == "OK" else 1)

# Select features (numeric columns only)
features = df[[
    "net_amount",
    "tax",
    "total"
]]

target = df["final_status"]

print("Features shape:", features.shape)
print("Target distribution:")
print(target.value_counts())

Features shape: (4, 3)
Target distribution:
final_status
0    2
1    2
Name: count, dtype: int64


In [6]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y = encoder.fit_transform(target)

encoder.classes_

array([0, 1])

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features,
    y,
    test_size=0.3,
    random_state=42
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

Training set: 2 samples
Test set: 2 samples
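
The 2/2 split may look surprising for `test_size=0.3`. Assuming scikit-learn rounds a fractional `test_size` up to a whole number of samples and assigns the remainder to training, the plain-Python sketch below reproduces that arithmetic for the four invoices. The helper name `split_sizes` is hypothetical and is not a scikit-learn function.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Approximate train/test counts for a fractional test_size (hypothetical helper)."""
    # Assumption: the test count is rounded up to a whole number of samples.
    n_test = math.ceil(test_size * n_samples)
    # The remaining samples form the training set.
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(n_samples=4, test_size=0.3)
print(f"Training set: {n_train} samples")
print(f"Test set: {n_test} samples")
```

With only four rows, half the data ends up in the test set, so the tree is fit on two invoices.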


In [8]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

print("✅ Model trained successfully!")
print(f"Tree depth: {model.get_depth()}")
print(f"Number of leaves: {model.get_n_leaves()}")

✅ Model trained successfully!
Tree depth: 1
Number of leaves: 2


In [9]:
from sklearn.metrics import classification_report

y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred, target_names=["OK", "Error"]))

              precision    recall  f1-score   support

          OK       0.00      0.00      0.00       1.0
       Error       0.00      0.00      0.00       1.0

    accuracy                           0.00       2.0
   macro avg       0.00      0.00      0.00       2.0
weighted avg       0.00      0.00      0.00       2.0
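
The zero scores are consistent with how little the model has to work with: two training rows and raw amounts only. The rule that defines an error here is relative (tax as a share of net), so one option worth trying is a derived feature that encodes it directly. The sketch below computes such a feature in plain Python; the name `iva_rate_gap` is an assumption, and whether it helps the classifier would need to be checked on a larger dataset.

```python
IVA_RATE = 0.19  # rate used by the validation step in section 2

# (invoice_id, net_amount, tax, label) with label 0 = OK, 1 = Error,
# copied from the sample invoices and flags shown earlier.
invoices = [
    ("F001", 100000, 19000, 0),
    ("F002", 200000, 38000, 0),
    ("F003", 150000, 30000, 1),
    ("F004", 120000, 20000, 1),
]

# Absolute gap between each invoice's effective tax rate and the 19% IVA rate.
iva_rate_gap = {
    invoice_id: abs(tax / net_amount - IVA_RATE)
    for invoice_id, net_amount, tax, _label in invoices
}

for invoice_id, _net, _tax, label in invoices:
    print(f"{invoice_id}: gap={iva_rate_gap[invoice_id]:.4f}, label={label}")
```

On these four rows the gap is zero for the OK invoices and at least 0.01 for the flagged ones, so a single threshold on this feature separates the two classes.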



In [10]:
# Test with a new invoice
new_invoice = [[150000, 30000, 180000]]
prediction = model.predict(new_invoice)

result = "OK" if prediction[0] == 0 else "Error"
print(f"🧾 Invoice: net={new_invoice[0][0]}, tax={new_invoice[0][1]}, total={new_invoice[0][2]}")
print(f"🤖 Model prediction: {result}")

🧾 Invoice: net=150000, tax=30000, total=180000
🤖 Model prediction: Error




## 6. Save Model

In [11]:
import joblib

# Save the trained model
joblib.dump(model, "../models/invoice_classifier.joblib")

print("✅ Model saved to models/invoice_classifier.joblib")

✅ Model saved to models/invoice_classifier.joblib
