## Query 7

Percentage of customers inactive for more than 180 days, by segment and country

- Use `last_login` from the customers dataset to identify inactive customers.
- A customer counts as inactive when their last login is more than 180 days before the reference date. "Today" is fixed at 2025-06-10 so that the result is reproducible.
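
As a quick check of the date arithmetic behind the rule above, the following minimal sketch derives the cutoff date from the fixed reference date using only the standard library. The helper name `is_inactive` is hypothetical and only illustrates the strict "more than 180 days" comparison.

```python
from datetime import date, timedelta

REF_DATE = date(2025, 6, 10)  # fixed reference date taken from the text
INACTIVE_DAYS = 180

# Any last_login strictly before this date is more than 180 days old.
cutoff = REF_DATE - timedelta(days=INACTIVE_DAYS)
print(cutoff)  # 2024-12-12


def is_inactive(last_login: date) -> bool:
    """Hypothetical helper: True when last_login is strictly before the cutoff."""
    return last_login < cutoff


print(is_inactive(date(2024, 12, 11)))  # True: 181 days before the reference date
print(is_inactive(date(2024, 12, 12)))  # False: exactly 180 days, not more
```

A login exactly 180 days before the reference date is therefore still counted as active.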

1. Load the customers dataset (data/raw/customers.csv).
2. Filter out customers with invalid data in customer_id, customer_segment, country, or last_login.
3. Before grouping, map each customer to 1 if inactive (last_login more than 180 days before the reference date) and 0 otherwise.
4. Group by (customer_segment, country) and sum the inactive and total counts.
5. Compute the percentage of inactive customers for each (customer_segment, country).
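
Steps 3 to 5 above can be illustrated without Spark. The sketch below runs the same map, reduce-by-key, and percentage logic in plain Python on a handful of made-up records. The sample customers are hypothetical, and the `defaultdict` loop only stands in for what a keyed reduction would do on a distributed dataset.

```python
from collections import defaultdict
from datetime import date

CUTOFF = date(2024, 12, 12)  # 2025-06-10 minus 180 days

# Hypothetical, already-validated records: (customer_segment, country, last_login)
customers = [
    ("PREMIUM", "BRAZIL", date(2024, 11, 1)),  # inactive
    ("PREMIUM", "BRAZIL", date(2025, 5, 20)),  # active
    ("BUDGET", "JAPAN", date(2024, 1, 15)),    # inactive
    ("BUDGET", "JAPAN", date(2024, 12, 1)),    # inactive
    ("BUDGET", "JAPAN", date(2025, 6, 1)),     # active
]

# Step 3: map each customer to ((segment, country), (inactive_flag, 1)).
pairs = [
    ((segment, country), (1 if last_login < CUTOFF else 0, 1))
    for segment, country, last_login in customers
]

# Step 4: sum the inactive flags and the totals per key.
totals = defaultdict(lambda: (0, 0))
for key, (inactive, count) in pairs:
    prev_inactive, prev_count = totals[key]
    totals[key] = (prev_inactive + inactive, prev_count + count)

# Step 5: percentage of inactive customers per (segment, country).
percentages = {
    key: round(inactive / count * 100.0, 2)
    for key, (inactive, count) in totals.items()
}
print(percentages)
# {('PREMIUM', 'BRAZIL'): 50.0, ('BUDGET', 'JAPAN'): 66.67}
```

Carrying the pair (inactive, total) through the reduction means a single pass yields both the numerator and the denominator of the percentage.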

In [64]:
from pyspark.sql import SparkSession
from datetime import datetime, timedelta


spark = SparkSession.builder.appName("InactivosPorSegmentoPais").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

In [None]:
customers_df = spark.read.csv(
    "../../data/raw/customers.csv", header=True, inferSchema=True
)
customers_rdd = customers_df.rdd

VALID_SEGMENTS = {"REGULAR", "PREMIUM", "BUDGET"}


def normalize_segment(seg):
    if seg is None:
        return None
    raw = str(seg).strip().upper()
    return raw if raw in VALID_SEGMENTS else None


def normalize_country(country):
    if country is None:
        return None
    return str(country).strip().upper()


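def normalize_date(date_str):
    # With inferSchema=True, last_login may arrive as a string, a date, or a
    # datetime, so convert to text before slicing out the YYYY-MM-DD prefix.
    if date_str is None:
        return None
    try:
        return datetime.strptime(str(date_str)[:10], "%Y-%m-%d").date()
    except ValueError:
        return None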


def date_is_before(date, threshold):
    if date is None:
        return False
    return date < threshold


# Fixed reference date so that the result is reproducible
REF_DATE = datetime(2025, 6, 10)
INACTIVE_DAYS = 180
INACTIVE_THRESHOLD_DATE = (REF_DATE - timedelta(days=INACTIVE_DAYS)).date()


customers_with_inactive = (
    customers_rdd.filter(
        lambda r: getattr(r, "customer_id", None) is not None
        and getattr(r, "customer_segment", None) is not None
        and getattr(r, "country", None) is not None
        and getattr(r, "last_login", None) is not None
    )
    .map(
        lambda r: (
            {
                "customer_segment": normalize_segment(
                    getattr(r, "customer_segment", None)
                ),
                "country": normalize_country(getattr(r, "country", None)),
                "last_login": normalize_date(getattr(r, "last_login", None)),
            }
        )
    )
    .filter(
        lambda r: r["customer_segment"] is not None
        and r["country"] is not None
        and r["last_login"] is not None
    )
    .map(
        lambda r: (
            (
                r["customer_segment"],
                r["country"],
            ),
            (
                (1 if date_is_before(r["last_login"], INACTIVE_THRESHOLD_DATE) else 0),
                1,
            ),
        )
    )
)

In [66]:
by_segment_country = customers_with_inactive.reduceByKey(
    lambda a, b: (a[0] + b[0], a[1] + b[1])
)

results = by_segment_country.map(
    lambda kv: {
        "segmento": kv[0][0],
        "pais": kv[0][1],
        "inactivos": kv[1][0],
        "total": kv[1][1],
        "porcentaje_inactivos": (
            round(kv[1][0] / kv[1][1] * 100.0, 2) if kv[1][1] > 0 else 0.0
        ),
    }
).collect()

In [67]:
import pandas as pd

df_results = pd.DataFrame(results)
if not df_results.empty:
    df_results = df_results.sort_values(by=["segmento", "pais"]).reset_index(drop=True)
    display(df_results)
else:
    print("No results")

| | segmento | pais | inactivos | total | porcentaje_inactivos |
|---|---|---|---|---|---|
| 0 | BUDGET | AUSTRALIA | 3604 | 7322 | 49.22 |
| 1 | BUDGET | BRAZIL | 3639 | 7359 | 49.45 |
| 2 | BUDGET | CANADA | 3557 | 7286 | 48.82 |
| 3 | BUDGET | FRANCE | 3491 | 7179 | 48.63 |
| 4 | BUDGET | GERMANY | 3452 | 7144 | 48.32 |
| 5 | BUDGET | INDIA | 3497 | 7289 | 47.98 |
| 6 | BUDGET | JAPAN | 3535 | 7222 | 48.95 |
| 7 | BUDGET | MEXICO | 3558 | 7319 | 48.61 |
| 8 | BUDGET | UK | 3464 | 7134 | 48.56 |
| 9 | BUDGET | UNDEFINED | 1255 | 2497 | 50.26 |
