# 📊 Notebook 1: Data Processing and Preparation

This notebook **explores, cleans, and prepares the raw data** for the proposed technical case, using **PySpark** as the main distributed processing engine.

---

## 🎯 Goals

- Load the raw data provided (`customers.json.gz`, `offers.json.gz`, `transactions.json.gz`)
- Run an initial exploratory data analysis (**EDA**) to understand the structure, quality, and distribution of the data
- Handle missing values, data types, and formats
- Prepare a unified, optimized dataset for later analysis and modeling
- Save the processed data as **Parquet**, a columnar format that is more compact and efficient to use with Spark

## 🔁 Steps performed in this notebook

1. Import libraries and configure the PySpark environment
2. Read the `.json.gz` files directly with Spark
3. Explore and validate schemas and data statistics
4. Handle missing, inconsistent, or invalid data
5. Convert to an optimized format (`.parquet`)
6. Export the processed data to `data/processed/`

---

## 🗂️ Expected data layout

- `data/raw/` → `.json.gz` files (compressed)
- `data/processed/` → processed and optimized `.parquet` files

---

## ⚙️ Technologies used

- Python 3.11
- PySpark
- JupyterLab
- Pandas (auxiliary support for exploratory analysis)


## 1. 📦 Library imports and PySpark setup

In this step, we will:

- Import the libraries needed to manipulate and analyze the data
- Initialize the PySpark session (`SparkSession`), which is used to read, transform, and write the data
- Configure basic execution parameters, such as the application name and the amount of memory (if needed)
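
The application name and memory settings mentioned above could be passed to the session builder roughly as follows. This is a minimal sketch: the helper name `build_session` and the option values (`2g`, `8` shuffle partitions) are illustrative assumptions, not settings used by this notebook.

```python
# Illustrative Spark options; the values are assumptions, tune them for your machine.
SPARK_OPTIONS = {
    "spark.driver.memory": "2g",          # JVM heap available to the driver
    "spark.sql.shuffle.partitions": "8",  # fewer shuffle partitions suit small datasets
}


def build_session(app_name, options=None):
    """Create (or reuse) a SparkSession with the given options applied."""
    # Imported lazily so this module can be loaded without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in (options or SPARK_OPTIONS).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling `build_session("iFood - Data Processing")` would return a session with these extra options applied. Note that `getOrCreate()` reuses a session that is already running, in which case the driver memory can no longer be changed.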

In [1]:
import sys
import os

# Resolve the absolute path of the parent directory (the project root that contains 'src')
src_path = os.path.abspath("../")
# Add it to sys.path so that `src.*` modules can be imported
if src_path not in sys.path:
    sys.path.append(src_path)

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, isnan, count
from pathlib import Path

In [2]:
sys.path

['/usr/local/lib/python311.zip',
 '/usr/local/lib/python3.11',
 '/usr/local/lib/python3.11/lib-dynload',
 '',
 '/usr/local/lib/python3.11/site-packages',
 '/app']

In [3]:
# Initialize the SparkSession
spark = SparkSession.builder \
    .appName("iFood - Data Processing") \
    .getOrCreate()

# Reduce log verbosity and confirm that the session started
spark.sparkContext.setLogLevel("WARN")
print("✅ SparkSession iniciada com sucesso!")


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/28 02:32:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


✅ SparkSession iniciada com sucesso!


## 2. Reading and understanding the raw data (.json.gz)

In this step, we will:

- Read the three datasets provided: `offers.json`, `customers.json`, and `transactions.json`
- Understand the structure and data types of each dataset
- Prepare the data for the next steps of the pipeline

### 🟦 `customers.json`

Contains attributes for roughly **17,000 registered customers**:

| Column              | Type     | Description                                        |
|---------------------|----------|----------------------------------------------------|
| `id`                | `string` | Unique customer ID                                 |
| `age`               | `int`    | Age when the account was created                   |
| `registered_on`     | `string` | Account creation date, stored as `YYYYMMDD` (e.g. `20170212`) |
| `gender`            | `string` | Customer gender (`M`, `F`, `O`, or `NULL`)         |
| `credit_card_limit` | `float`  | Credit limit registered on the account             |

---
### 🟩 `offers.json`

Contains the **offer IDs** and their **metadata**:

| Column           | Type           | Description                                        |
|------------------|----------------|----------------------------------------------------|
| `id`             | `string`       | Unique offer ID                                    |
| `offer_type`     | `string`       | Offer type: `bogo`, `discount`, or `informational` |
| `min_value`      | `int`          | Minimum spend required to activate the offer       |
| `duration`       | `int`          | Offer duration (in days)                           |
| `discount_value` | `int`          | Discount amount applied                            |
| `channels`       | `array<string>`| Delivery channels (e.g. `email`, `mobile`)         |

---

### 🟨 `transactions.json`

Contains about **300,000 events** recorded during the test period:

| Column              | Type      | Description                                        |
|---------------------|-----------|----------------------------------------------------|
| `event`             | `string`  | Event type (`transaction`, `offer received`, etc.) |
| `account_id`        | `string`  | ID of the customer associated with the event       |
| `time_since_test_start` | `float` | Time (in days, possibly fractional) since the start of the experiment |
| `value`             | `json`    | Value associated with the event (`offer_id`, `reward`, or `amount`) |

---

🔍 Let's now load these files with **PySpark**, making sure the data types are interpreted correctly and the data is ready for analysis.
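
Before loading with Spark, it can help to peek at a few raw records with the standard library. By default `spark.read.json` expects one JSON object per line and decompresses `.gz` files based on their extension, so the sketch below assumes the same JSON Lines layout. The helper name `peek_json_lines` is hypothetical and not part of the project.

```python
import gzip
import json
from itertools import islice


def peek_json_lines(path, n=3):
    """Return the first n records of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        # Each non-empty line is parsed as one standalone JSON object.
        return [json.loads(line) for line in islice(handle, n) if line.strip()]
```

For instance, `peek_json_lines("/app/data/raw/offers.json.gz")` would return the first three offers as plain dictionaries, which makes nested fields and unusual key names easy to spot before schema inference.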


In [4]:
# Read the gzip-compressed JSON files with PySpark (customer data is stored in profile.json.gz)
customers_df = spark.read.json("/app/data/raw/profile.json.gz")
offers_df = spark.read.json("/app/data/raw/offers.json.gz")
transactions_df = spark.read.json("/app/data/raw/transactions.json.gz")

                                                                                

In [5]:
# Show a preview of the data
print("👤 Customers:")
customers_df.show(5, truncate=False)

print("🏷️ Offers:")
offers_df.show(5, truncate=False)

print("💳 Transactions:")
transactions_df.show(5, truncate=False)

👤 Customers:
+---+-----------------+------+--------------------------------+-------------+
|age|credit_card_limit|gender|id                              |registered_on|
+---+-----------------+------+--------------------------------+-------------+
|118|NULL             |NULL  |68be06ca386d4c31939f3a4f0e3dd783|20170212     |
|55 |112000.0         |F     |0610b486422d4921ae7d2bf64640c50b|20170715     |
|118|NULL             |NULL  |38fe809add3b4fcf9315a9694bb96ff5|20180712     |
|75 |100000.0         |F     |78afa995795e4d85b5d9ceeca43f5fef|20170509     |
|118|NULL             |NULL  |a03223e636434f42ac4c3df47e8bac43|20170804     |
+---+-----------------+------+--------------------------------+-------------+
only showing top 5 rows

🏷️ Offers:
+----------------------------+--------------+--------+--------------------------------+---------+-------------+
|channels                    |discount_value|duration|id                              |min_value|offer_type   |
+------------------------

                                                                                

In [6]:
print(f"customers_df: {customers_df.count()} linhas")
print(f"offers_df: {offers_df.count()} linhas")
print(f"transactions_df: {transactions_df.count()} linhas")

customers_df: 17000 linhas
offers_df: 10 linhas
transactions_df: 306534 linhas


## 3. 🔍 Data exploration and validation (EDA)

In this step, we will:

- Explore the table schemas to check data types
- Look at basic descriptive statistics
- Count null values and unique values
- Identify possible quality problems (e.g. empty or inconsistent fields)
- Check the distributions of important columns
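
The null-counting step can be illustrated in plain Python. The notebook relies on `isna_sum` from the project's `src.eda` package, whose source is not shown here, so the sketch below only mirrors the idea (count and percentage of nulls per column) on a list of dictionaries. The name `null_report` and the sample rows are assumptions for illustration.

```python
def null_report(rows, columns):
    """Return {column: (null_count, null_percent)} for a list of dict records."""
    total = len(rows)
    report = {}
    for column in columns:
        # A missing key is treated the same as an explicit None.
        nulls = sum(1 for row in rows if row.get(column) is None)
        percent = round(100 * nulls / total, 2) if total else 0.0
        report[column] = (nulls, percent)
    return report


sample = [
    {"age": 118, "gender": None},
    {"age": 55, "gender": "F"},
    {"age": 118, "gender": None},
    {"age": 75, "gender": "F"},
]
print(null_report(sample, ["age", "gender"]))
```

The same formula applied to the real table gives 2175 / 17000, or 12.79%, for `gender` and `credit_card_limit`.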

In [7]:
from src.eda.data_diagnostics import isna_sum, value_counts
from src.eda.pipeline import clean_customers_data
from src.eda.pipeline import feature_engineering_customers_data

### 3.1 ``customers`` table

#### 3.1.1 Check NaN Values

In [8]:
isna_sum(customers_df, "customers")


📘 Schema de customers:
root
 |-- age: long (nullable = true)
 |-- credit_card_limit: double (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: string (nullable = true)
 |-- registered_on: string (nullable = true)


🔢 Total de linhas: 17000

📊 Nulos por coluna (valores e %):
– age: 0 nulos (0.00%)
– credit_card_limit: 2175 nulos (12.79%)
– gender: 2175 nulos (12.79%)
– id: 0 nulos (0.00%)
– registered_on: 0 nulos (0.00%)

🔎 Amostra de customers:
+---+-----------------+------+--------------------------------+-------------+
|age|credit_card_limit|gender|id                              |registered_on|
+---+-----------------+------+--------------------------------+-------------+
|118|NULL             |NULL  |68be06ca386d4c31939f3a4f0e3dd783|20170212     |
|55 |112000.0         |F     |0610b486422d4921ae7d2bf64640c50b|20170715     |
|118|NULL             |NULL  |38fe809add3b4fcf9315a9694bb96ff5|20180712     |
|75 |100000.0         |F     |78afa995795e4d85b5d9ceeca43f5fef|20170

#### 3.1.2 Data Analysis ``gender`` and ``credit_card_limit``

In [9]:
value_counts(customers_df, "gender")


📊 Distribuição da coluna: gender (total: 17000 registros)
+------+-----+------------------+
|gender|count|percent           |
+------+-----+------------------+
|M     |8484 |49.90588235294118 |
|F     |6129 |36.05294117647059 |
|NULL  |2175 |12.794117647058822|
|O     |212  |1.2470588235294118|
+------+-----+------------------+



In [10]:
value_counts(customers_df, "credit_card_limit")


📊 Distribuição da coluna: credit_card_limit (total: 17000 registros)
+-----------------+-----+------------------+
|credit_card_limit|count|percent           |
+-----------------+-----+------------------+
|NULL             |2175 |12.794117647058822|
|73000.0          |314  |1.8470588235294116|
|72000.0          |297  |1.7470588235294116|
|71000.0          |294  |1.7294117647058824|
|57000.0          |288  |1.6941176470588233|
|74000.0          |282  |1.6588235294117646|
|53000.0          |282  |1.6588235294117646|
|52000.0          |281  |1.6529411764705884|
|56000.0          |281  |1.6529411764705884|
|54000.0          |272  |1.6               |
|70000.0          |270  |1.588235294117647 |
|51000.0          |268  |1.576470588235294 |
|61000.0          |258  |1.5176470588235296|
|64000.0          |258  |1.5176470588235296|
|55000.0          |254  |1.4941176470588236|
|50000.0          |253  |1.4882352941176469|
|60000.0          |251  |1.4764705882352942|
|75000.0          |243  |1.429

#### 3.1.3 Replace NaN Values
1. `gender` column: replace `NULL` with a placeholder category (the run below reports the normalized values as `M`, `F`, and `unknown`)
2. `credit_card_limit` column: fill null values with the median
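
Median imputation is a common choice because the median is less sensitive to extreme values than the mean. A minimal plain-Python sketch of the idea follows. It is not the implementation of `clean_customers_data`, which is not shown, and `fill_with_median` is a hypothetical name.

```python
from statistics import median


def fill_with_median(values):
    """Replace None entries with the median of the non-null values."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)  # nothing to impute from
    fill = median(known)
    return [fill if v is None else v for v in values]


limits = [None, 112000.0, None, 100000.0, 63000.0]
print(fill_with_median(limits))
```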

In [11]:
customers_df = clean_customers_data(customers_df)

🔍 Cleaning 'gender' column...
✅ 'gender' cleaned: values normalized (M, F, unknown)

📈 Calculating statistics for 'credit_card_limit'...
📊 Mean credit limit: 65404.99
📏 Median credit limit: 63000.00
✅ 'credit_card_limit' nulls filled with median.


In [12]:
isna_sum(customers_df, "customers")


📘 Schema de customers:
root
 |-- age: long (nullable = true)
 |-- credit_card_limit: double (nullable = false)
 |-- gender: string (nullable = true)
 |-- id: string (nullable = true)
 |-- registered_on: string (nullable = true)


🔢 Total de linhas: 17000

📊 Nulos por coluna (valores e %):
– age: 0 nulos (0.00%)
– credit_card_limit: 0 nulos (0.00%)
– gender: 0 nulos (0.00%)
– id: 0 nulos (0.00%)
– registered_on: 0 nulos (0.00%)

🔎 Amostra de customers:
+---+-----------------+-------+--------------------------------+-------------+
|age|credit_card_limit|gender |id                              |registered_on|
+---+-----------------+-------+--------------------------------+-------------+
|118|63000.0          |unknown|68be06ca386d4c31939f3a4f0e3dd783|20170212     |
|55 |112000.0         |F      |0610b486422d4921ae7d2bf64640c50b|20170715     |
|118|63000.0          |unknown|38fe809add3b4fcf9315a9694bb96ff5|20180712     |
|75 |100000.0         |F      |78afa995795e4d85b5d9ceeca43f5fef|20170

In [13]:
credit_card_limit_quantiles = customers_df.approxQuantile('credit_card_limit', [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0], 0.01)
credit_card_limit_quantiles

[30000.0, 39000.0, 51000.0, 63000.0, 75000.0, 93000.0, 120000.0]

In [14]:
age_quantiles = customers_df.approxQuantile('age', [0.0, 0.25, 0.5, 0.75, 0.88, 1.0], 0.01)
age_quantiles

[18.0, 45.0, 58.0, 72.0, 118.0, 118.0]

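🔎 The customer base skews older: about 75% of users are older than 45, and roughly 12% of the base sits at age 118, the maximum value. That share matches the 12.79% of rows with null `gender` and `credit_card_limit`, and the sampled rows with age 118 are the same rows that had those nulls, which suggests that 118 is a placeholder for a missing age rather than a real value.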

#### 3.1.4 Feature Engineering ``Customers``
- Create new feature ``birth_year`` (birth year = registration year - age at registration)
- Create new feature ``age_group``
- Create new feature ``credit_limit_bucket``
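
The ``birth_year`` rule can be checked by hand. The sketch below assumes `registered_on` is a `YYYYMMDD` string, as in the samples shown earlier, and uses a hypothetical helper name. The thresholds behind ``age_group`` and ``credit_limit_bucket`` are not stated in this notebook, so they are not sketched.

```python
def birth_year(registered_on, age):
    """Derive the birth year from a 'YYYYMMDD' registration date and the age at sign-up."""
    registration_year = int(str(registered_on)[:4])
    return registration_year - age


for registered_on, age in [("20170212", 118), ("20170715", 55), ("20180712", 118)]:
    print(registered_on, age, birth_year(registered_on, age))
```

For these three rows the function yields 1899, 1962, and 1900.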

In [15]:
customers_df = feature_engineering_customers_data(customers_df)

In [16]:
customers_df.show(5)

+---+-----------------+-------+--------------------+-------------+----------+-------------+-------------------+
|age|credit_card_limit| gender|                  id|registered_on|birth_year|    age_group|credit_limit_bucket|
+---+-----------------+-------+--------------------+-------------+----------+-------------+-------------------+
|118|          63000.0|unknown|68be06ca386d4c319...|     20170212|      1899|Boomers (60+)|  Very High (> 60k)|
| 55|         112000.0|      F|0610b486422d4921a...|     20170715|      1962|Boomers (60+)|  Very High (> 60k)|
|118|          63000.0|unknown|38fe809add3b4fcf9...|     20180712|      1900|Boomers (60+)|  Very High (> 60k)|
| 75|         100000.0|      F|78afa995795e4d85b...|     20170509|      1942|Boomers (60+)|  Very High (> 60k)|
|118|          63000.0|unknown|a03223e636434f42a...|     20170804|      1899|Boomers (60+)|  Very High (> 60k)|
+---+-----------------+-------+--------------------+-------------+----------+-------------+-------------

### 3.2 ``offers`` table

In [17]:
offers_df.show(5, truncate=False)

+----------------------------+--------------+--------+--------------------------------+---------+-------------+
|channels                    |discount_value|duration|id                              |min_value|offer_type   |
+----------------------------+--------------+--------+--------------------------------+---------+-------------+
|[email, mobile, social]     |10            |7.0     |ae264e3637204a6fb9bb56bc8210ddfd|10       |bogo         |
|[web, email, mobile, social]|10            |5.0     |4d5c57ea9a6940dd891ad53e9dbe8da0|10       |bogo         |
|[web, email, mobile]        |0             |4.0     |3f207df678b143eea3cee63160fa8bed|0        |informational|
|[web, email, mobile]        |5             |7.0     |9b98b8c7a33c4b65b9aebfe6a799e6d9|5        |bogo         |
|[web, email]                |5             |10.0    |0b1e1539f2cc45b7b9fa7c272da2e1d7|20       |discount     |
+----------------------------+--------------+--------+--------------------------------+---------+-------

#### 3.2.1 One-hot encode the ``channels`` column

In [18]:
from src.eda.transform import explode_list_columns_to_ohe

In [19]:
offers_df = explode_list_columns_to_ohe(offers_df)

                                                                                

In [20]:
offers_df.printSchema()

root
 |-- discount_value: long (nullable = true)
 |-- duration: double (nullable = true)
 |-- id: string (nullable = true)
 |-- min_value: long (nullable = true)
 |-- offer_type: string (nullable = true)
 |-- channels_mobile: integer (nullable = false)
 |-- channels_email: integer (nullable = false)
 |-- channels_social: integer (nullable = false)
 |-- channels_web: integer (nullable = false)
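
The schema above shows one 0/1 indicator column per channel. The transformation can be illustrated in plain Python as below. This is only a sketch of the encoding idea, not the code of `explode_list_columns_to_ohe`, and the function name and vocabulary list are assumptions.

```python
def one_hot_channels(channels, vocabulary, prefix="channels"):
    """Map a list of channel names to {prefix_name: 0 or 1} indicator columns."""
    present = set(channels)
    return {f"{prefix}_{name}": int(name in present) for name in vocabulary}


vocabulary = ["mobile", "email", "social", "web"]
print(one_hot_channels(["email", "mobile", "social"], vocabulary))
```

For the first offer (`[email, mobile, social]`) every indicator is 1 except `channels_web`.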



In [21]:
offers_df.show(50, truncate=False)

+--------------+--------+--------------------------------+---------+-------------+---------------+--------------+---------------+------------+
|discount_value|duration|id                              |min_value|offer_type   |channels_mobile|channels_email|channels_social|channels_web|
+--------------+--------+--------------------------------+---------+-------------+---------------+--------------+---------------+------------+
|10            |7.0     |ae264e3637204a6fb9bb56bc8210ddfd|10       |bogo         |1              |1             |1              |0           |
|10            |5.0     |4d5c57ea9a6940dd891ad53e9dbe8da0|10       |bogo         |1              |1             |1              |1           |
|0             |4.0     |3f207df678b143eea3cee63160fa8bed|0        |informational|1              |1             |0              |1           |
|5             |7.0     |9b98b8c7a33c4b65b9aebfe6a799e6d9|5        |bogo         |1              |1             |0              |1           |

#### 3.2.2 Check NaN Values ``Offers``
**Note:** the offers table has no null values.

In [22]:
isna_sum(offers_df, "offers")


📘 Schema de offers:
root
 |-- discount_value: long (nullable = true)
 |-- duration: double (nullable = true)
 |-- id: string (nullable = true)
 |-- min_value: long (nullable = true)
 |-- offer_type: string (nullable = true)
 |-- channels_mobile: integer (nullable = false)
 |-- channels_email: integer (nullable = false)
 |-- channels_social: integer (nullable = false)
 |-- channels_web: integer (nullable = false)


🔢 Total de linhas: 10

📊 Nulos por coluna (valores e %):
– discount_value: 0 nulos (0.00%)
– duration: 0 nulos (0.00%)
– id: 0 nulos (0.00%)
– min_value: 0 nulos (0.00%)
– offer_type: 0 nulos (0.00%)
– channels_mobile: 0 nulos (0.00%)
– channels_email: 0 nulos (0.00%)
– channels_social: 0 nulos (0.00%)
– channels_web: 0 nulos (0.00%)

🔎 Amostra de offers:
+--------------+--------+--------------------------------+---------+-------------+---------------+--------------+---------------+------------+
|discount_value|duration|id                              |min_value|offer_type  

#### 3.2.3 EDA ``Offers``

In [23]:
value_counts(offers_df, column='min_value')


📊 Distribuição da coluna: min_value (total: 10 registros)
+---------+-----+-------+
|min_value|count|percent|
+---------+-----+-------+
|10       |4    |40.0   |
|0        |2    |20.0   |
|5        |2    |20.0   |
|7        |1    |10.0   |
|20       |1    |10.0   |
+---------+-----+-------+



In [24]:
value_counts(offers_df, column='offer_type')


📊 Distribuição da coluna: offer_type (total: 10 registros)
+-------------+-----+-------+
|offer_type   |count|percent|
+-------------+-----+-------+
|discount     |4    |40.0   |
|bogo         |4    |40.0   |
|informational|2    |20.0   |
+-------------+-----+-------+



### 3.3 ``transactions`` table

In [25]:
transactions_df.show(5, truncate=False)

+--------------------------------+--------------+---------------------+----------------------------------------------------+
|account_id                      |event         |time_since_test_start|value                                               |
+--------------------------------+--------------+---------------------+----------------------------------------------------+
|78afa995795e4d85b5d9ceeca43f5fef|offer received|0.0                  |{NULL, 9b98b8c7a33c4b65b9aebfe6a799e6d9, NULL, NULL}|
|a03223e636434f42ac4c3df47e8bac43|offer received|0.0                  |{NULL, 0b1e1539f2cc45b7b9fa7c272da2e1d7, NULL, NULL}|
|e2127556f4f64592b11af22de27a7932|offer received|0.0                  |{NULL, 2906b810c7d4411798c6938adc9daaa5, NULL, NULL}|
|8ec6ce2a7e7949b1bf142def7d0e0586|offer received|0.0                  |{NULL, fafdcd668e3743c1bb461111dcafc2a4, NULL, NULL}|
|68617ca6246f4fbc85e91a2a49552598|offer received|0.0                  |{NULL, 4d5c57ea9a6940dd891ad53e9dbe8da0, NULL, NULL}|


                                                                                

In [26]:
transactions_df.printSchema()

root
 |-- account_id: string (nullable = true)
 |-- event: string (nullable = true)
 |-- time_since_test_start: double (nullable = true)
 |-- value: struct (nullable = true)
 |    |-- amount: double (nullable = true)
 |    |-- offer id: string (nullable = true)
 |    |-- offer_id: string (nullable = true)
 |    |-- reward: double (nullable = true)



#### 3.3.1 Flatten the ``value`` struct column

In [27]:
transactions_df = explode_list_columns_to_ohe(transactions_df)
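
Here the helper turns the nested `value` struct into top-level columns named `value_<field>`. A plain-Python sketch of that flattening on a single record is shown below. The function name `flatten_struct` is hypothetical and the project helper may work differently.

```python
def flatten_struct(record, column, sep="_"):
    """Replace a nested dict column with one top-level key per nested field."""
    flat = {key: value for key, value in record.items() if key != column}
    for field, value in (record.get(column) or {}).items():
        flat[f"{column}{sep}{field}"] = value
    return flat


event = {
    "account_id": "78afa995795e4d85b5d9ceeca43f5fef",
    "event": "offer received",
    "value": {
        "amount": None,
        "offer id": "9b98b8c7a33c4b65b9aebfe6a799e6d9",
        "offer_id": None,
        "reward": None,
    },
}
print(flatten_struct(event, "value"))
```

Note that the raw data carries both an `offer id` and an `offer_id` key, which is why two offer-ID columns appear after flattening.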

In [28]:
transactions_df.show(5, truncate=False)

+--------------------------------+--------------+---------------------+------------+--------------------------------+--------------+------------+
|account_id                      |event         |time_since_test_start|value_amount|value_offer id                  |value_offer_id|value_reward|
+--------------------------------+--------------+---------------------+------------+--------------------------------+--------------+------------+
|78afa995795e4d85b5d9ceeca43f5fef|offer received|0.0                  |NULL        |9b98b8c7a33c4b65b9aebfe6a799e6d9|NULL          |NULL        |
|a03223e636434f42ac4c3df47e8bac43|offer received|0.0                  |NULL        |0b1e1539f2cc45b7b9fa7c272da2e1d7|NULL          |NULL        |
|e2127556f4f64592b11af22de27a7932|offer received|0.0                  |NULL        |2906b810c7d4411798c6938adc9daaa5|NULL          |NULL        |
|8ec6ce2a7e7949b1bf142def7d0e0586|offer received|0.0                  |NULL        |fafdcd668e3743c1bb461111dcafc2a4|NULL   

                                                                                

#### 3.3.2 Consolidate the two offer-ID columns (``value_offer_id`` and ``value_offer id``)

In [29]:
from src.eda.transform import consolidate_columns

In [30]:
# Rows where exactly one of the two flattened offer-ID columns is populated
inconsistent_df = transactions_df.filter(
    (col("value_offer_id").isNull() & col("`value_offer id`").isNotNull()) |
    (col("value_offer_id").isNotNull() & col("`value_offer id`").isNull())
)
inconsistent_df.show(5, truncate=False)

+--------------------------------+--------------+---------------------+------------+--------------------------------+--------------+------------+
|account_id                      |event         |time_since_test_start|value_amount|value_offer id                  |value_offer_id|value_reward|
+--------------------------------+--------------+---------------------+------------+--------------------------------+--------------+------------+
|78afa995795e4d85b5d9ceeca43f5fef|offer received|0.0                  |NULL        |9b98b8c7a33c4b65b9aebfe6a799e6d9|NULL          |NULL        |
|a03223e636434f42ac4c3df47e8bac43|offer received|0.0                  |NULL        |0b1e1539f2cc45b7b9fa7c272da2e1d7|NULL          |NULL        |
|e2127556f4f64592b11af22de27a7932|offer received|0.0                  |NULL        |2906b810c7d4411798c6938adc9daaa5|NULL          |NULL        |
|8ec6ce2a7e7949b1bf142def7d0e0586|offer received|0.0                  |NULL        |fafdcd668e3743c1bb461111dcafc2a4|NULL   

                                                                                

In [31]:
# Rows where both flattened offer-ID columns are populated (none are expected)
both_present_df = transactions_df.filter(
    col("value_offer_id").isNotNull() & col("`value_offer id`").isNotNull()
)
both_present_df.show(5, truncate=False)

+----------+-----+---------------------+------------+--------------+--------------+------------+
|account_id|event|time_since_test_start|value_amount|value_offer id|value_offer_id|value_reward|
+----------+-----+---------------------+------------+--------------+--------------+------------+
+----------+-----+---------------------+------------+--------------+--------------+------------+



                                                                                

In [32]:
transactions_df = consolidate_columns(
    transactions_df,
    output_col="offer_id",
    input_cols=["value_offer_id", "value_offer id"]
)
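
Since no row has both columns populated, taking the first non-null value is a safe way to merge them. The sketch below shows that rule on a single record in plain Python. In PySpark the same first-non-null semantics are provided by `pyspark.sql.functions.coalesce`, although whether `consolidate_columns` uses it is not shown here. The function name `consolidate` is an assumption.

```python
def consolidate(record, output_col, input_cols):
    """Add output_col holding the first non-null value among input_cols."""
    merged = dict(record)
    merged[output_col] = next(
        (record[c] for c in input_cols if record.get(c) is not None), None
    )
    return merged


row = {"value_offer_id": None, "value_offer id": "9b98b8c7a33c4b65b9aebfe6a799e6d9"}
print(consolidate(row, "offer_id", ["value_offer_id", "value_offer id"]))
```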

In [33]:
transactions_df.show(5, truncate=False)

+--------------------------------+--------------+---------------------+------------+--------------------------------+--------------+------------+--------------------------------+
|account_id                      |event         |time_since_test_start|value_amount|value_offer id                  |value_offer_id|value_reward|offer_id                        |
+--------------------------------+--------------+---------------------+------------+--------------------------------+--------------+------------+--------------------------------+
|78afa995795e4d85b5d9ceeca43f5fef|offer received|0.0                  |NULL        |9b98b8c7a33c4b65b9aebfe6a799e6d9|NULL          |NULL        |9b98b8c7a33c4b65b9aebfe6a799e6d9|
|a03223e636434f42ac4c3df47e8bac43|offer received|0.0                  |NULL        |0b1e1539f2cc45b7b9fa7c272da2e1d7|NULL          |NULL        |0b1e1539f2cc45b7b9fa7c272da2e1d7|
|e2127556f4f64592b11af22de27a7932|offer received|0.0                  |NULL        |2906b810c7d4411798c69

                                                                                

#### 3.3.3 Check NaN values

In [34]:
isna_sum(transactions_df, "transactions")


📘 Schema de transactions:
root
 |-- account_id: string (nullable = true)
 |-- event: string (nullable = true)
 |-- time_since_test_start: double (nullable = true)
 |-- value_amount: double (nullable = true)
 |-- value_offer id: string (nullable = true)
 |-- value_offer_id: string (nullable = true)
 |-- value_reward: double (nullable = true)
 |-- offer_id: string (nullable = true)


🔢 Total de linhas: 306534

📊 Nulos por coluna (valores e %):


                                                                                

– account_id: 0 nulos (0.00%)
– event: 0 nulos (0.00%)
– time_since_test_start: 0 nulos (0.00%)
– value_amount: 167581 nulos (54.67%)
– value_offer id: 172532 nulos (56.28%)
– value_offer_id: 272955 nulos (89.05%)
– value_reward: 272955 nulos (89.05%)
– offer_id: 138953 nulos (45.33%)

🔎 Amostra de transactions:
+--------------------------------+--------------+---------------------+------------+--------------------------------+--------------+------------+--------------------------------+
|account_id                      |event         |time_since_test_start|value_amount|value_offer id                  |value_offer_id|value_reward|offer_id                        |
+--------------------------------+--------------+---------------------+------------+--------------------------------+--------------+------------+--------------------------------+
|78afa995795e4d85b5d9ceeca43f5fef|offer received|0.0                  |NULL        |9b98b8c7a33c4b65b9aebfe6a799e6d9|NULL          |NULL        |9b98

                                                                                

#### 3.3.4 Drop Columns

In [35]:
from src.eda.transform import drop_columns

In [36]:
columns_to_remove = ["value_offer_id", "value_offer id"]
transactions_df = drop_columns(transactions_df, columns_to_remove)

In [37]:
transactions_df.show(5, truncate=False)

+--------------------------------+--------------+---------------------+------------+------------+--------------------------------+
|account_id                      |event         |time_since_test_start|value_amount|value_reward|offer_id                        |
+--------------------------------+--------------+---------------------+------------+------------+--------------------------------+
|78afa995795e4d85b5d9ceeca43f5fef|offer received|0.0                  |NULL        |NULL        |9b98b8c7a33c4b65b9aebfe6a799e6d9|
|a03223e636434f42ac4c3df47e8bac43|offer received|0.0                  |NULL        |NULL        |0b1e1539f2cc45b7b9fa7c272da2e1d7|
|e2127556f4f64592b11af22de27a7932|offer received|0.0                  |NULL        |NULL        |2906b810c7d4411798c6938adc9daaa5|
|8ec6ce2a7e7949b1bf142def7d0e0586|offer received|0.0                  |NULL        |NULL        |fafdcd668e3743c1bb461111dcafc2a4|
|68617ca6246f4fbc85e91a2a49552598|offer received|0.0                  |NULL        

                                                                                

#### 3.3.5 Data Analysis Table ``transactions``

In [38]:
value_counts(transactions_df, 'event', show_nulls=True)


📊 Distribuição da coluna: event (total: 306534 registros)
+---------------+------+------------------+
|event          |count |percent           |
+---------------+------+------------------+
|transaction    |138953|45.33037118231583 |
|offer received |76277 |24.88369968747349 |
|offer viewed   |57725 |18.831516242896384|
|offer completed|33579 |10.954412887314295|
+---------------+------+------------------+
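
The event counts can be cross-checked against the null counts reported in section 3.3.3 with simple arithmetic. The sketch below copies the figures from this notebook's outputs. The matching totals are consistent with `amount` being present only on `transaction` events, `offer id` on received and viewed events, and `offer_id` plus `reward` on completed events, although totals alone do not prove this row by row.

```python
# Counts copied from the outputs in this notebook.
total_rows = 306534
event_counts = {
    "transaction": 138953,
    "offer received": 76277,
    "offer viewed": 57725,
    "offer completed": 33579,
}
null_counts = {
    "value_amount": 167581,
    "value_offer id": 172532,
    "value_offer_id": 272955,
    "value_reward": 272955,
    "offer_id": 138953,
}

# Non-null rows per column = total rows minus reported nulls.
non_null = {name: total_rows - nulls for name, nulls in null_counts.items()}

assert sum(event_counts.values()) == total_rows
assert non_null["value_amount"] == event_counts["transaction"]
assert non_null["value_offer id"] == event_counts["offer received"] + event_counts["offer viewed"]
assert non_null["value_offer_id"] == event_counts["offer completed"]
assert non_null["value_reward"] == event_counts["offer completed"]
assert null_counts["offer_id"] == event_counts["transaction"]
print("all counts are consistent")
```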



In [39]:
value_counts(transactions_df, 'time_since_test_start', show_nulls=True)


📊 Distribuição da coluna: time_since_test_start (total: 306534 registros)
+---------------------+-----+------------------+
|time_since_test_start|count|percent           |
+---------------------+-----+------------------+
|17.0                 |17030|5.555664298250765 |
|24.0                 |17015|5.5507708769663395|
|21.0                 |16822|5.487808856440068 |
|14.0                 |16302|5.318170251913328 |
|7.0                  |16150|5.2685835828978185|
|0.0                  |15561|5.076435240462722 |
|17.25                |3583 |1.1688752308063706|
|21.25                |3514 |1.1463654928980147|
|24.25                |3484 |1.1365786503291642|
|24.5                 |3222 |1.051106891894537 |
|21.5                 |3153 |1.028597153986181 |
|17.5                 |3146 |1.0263135573867825|
|14.25                |3017 |0.9842301343407257|
|24.75                |2937 |0.9581318874904577|
|17.75                |2908 |0.9486712730072357|
|7.25                 |2823 |0.920941885728

In [40]:
value_counts(transactions_df, 'value_amount', show_nulls=True)


📊 Distribuição da coluna: value_amount (total: 306534 registros)



+------------+------+--------------------+
|value_amount|count |percent             |
+------------+------+--------------------+
|NULL        |167581|54.66962881768417   |
|0.05        |431   |0.14060430490581793 |
|0.66        |166   |0.05415386221430575 |
|1.18        |165   |0.05382763412867741 |
|1.01        |163   |0.05317517795742071 |
|1.23        |161   |0.05252272178616402 |
|0.9         |161   |0.05252272178616402 |
|1.19        |159   |0.051870265614907325|
|0.53        |159   |0.051870265614907325|
|0.5         |159   |0.051870265614907325|
|0.79        |157   |0.05121780944365062 |
|1.5         |156   |0.05089158135802227 |
|0.92        |156   |0.05089158135802227 |
|1.54        |155   |0.05056535327239393 |
|0.7         |154   |0.05023912518676558 |
|0.74        |154   |0.05023912518676558 |
|1.57        |154   |0.05023912518676558 |
|1.27        |153   |0.04991289710113723 |
|1.22        |153   |0.04991289710113723 |
|0.65        |152   |0.049586669015508886|
+----------

                                                                                

In [41]:
value_counts(transactions_df, 'value_reward', show_nulls=True)


📊 Distribuição da coluna: value_reward (total: 306534 registros)
+------------+------+------------------+
|value_reward|count |percent           |
+------------+------+------------------+
|NULL        |272955|89.04558711268571 |
|5.0         |12070 |3.9375729935341592|
|2.0         |9334  |3.0450129512549995|
|10.0        |7019  |2.289794933025374 |
|3.0         |5156  |1.6820320094997616|
+------------+------+------------------+



                                                                                

In [42]:
value_counts(transactions_df, 'offer_id', show_nulls=True)


📊 Distribuição da coluna: offer_id (total: 306534 registros)
+--------------------------------+------+------------------+
|offer_id                        |count |percent           |
+--------------------------------+------+------------------+
|NULL                            |138953|45.33037118231583 |
|fafdcd668e3743c1bb461111dcafc2a4|20241 |6.60318268120339  |
|2298d6c36e964ae4a3e7e9706d1fb8c2|20139 |6.569907416469299 |
|f19421c1d4aa40978ebb69ca19b0e20d|19131 |6.241069506155924 |
|4d5c57ea9a6940dd891ad53e9dbe8da0|18222 |5.944528176319755 |
|ae264e3637204a6fb9bb56bc8210ddfd|18062 |5.89233168261922  |
|9b98b8c7a33c4b65b9aebfe6a799e6d9|16202 |5.2855474433504925|
|2906b810c7d4411798c6938adc9daaa5|15767 |5.143638226102162 |
|5a8bc65990b245e5a138643cd4eb9837|14305 |4.666692764913517 |
|0b1e1539f2cc45b7b9fa7c272da2e1d7|13751 |4.485962405475412 |
|3f207df678b143eea3cee63160fa8bed|11761 |3.836768515075    |
+--------------------------------+------+------------------+




## 4. Data Integration

In [43]:
from src.eda.transform import integrate_all_dataframes

In [44]:
offers_df.show(2)

+--------------+--------+--------------------+---------+----------+---------------+--------------+---------------+------------+
|discount_value|duration|                  id|min_value|offer_type|channels_mobile|channels_email|channels_social|channels_web|
+--------------+--------+--------------------+---------+----------+---------------+--------------+---------------+------------+
|            10|     7.0|ae264e3637204a6fb...|       10|      bogo|              1|             1|              1|           0|
|            10|     5.0|4d5c57ea9a6940dd8...|       10|      bogo|              1|             1|              1|           1|
+--------------+--------+--------------------+---------+----------+---------------+--------------+---------------+------------+
only showing top 2 rows



In [45]:
transactions_df.show(2)

+--------------------+--------------+---------------------+------------+------------+--------------------+
|          account_id|         event|time_since_test_start|value_amount|value_reward|            offer_id|
+--------------------+--------------+---------------------+------------+------------+--------------------+
|78afa995795e4d85b...|offer received|                  0.0|        NULL|        NULL|9b98b8c7a33c4b65b...|
|a03223e636434f42a...|offer received|                  0.0|        NULL|        NULL|0b1e1539f2cc45b7b...|
+--------------------+--------------+---------------------+------------+------------+--------------------+
only showing top 2 rows




In [47]:
full_df = integrate_all_dataframes(transactions_df, customers_df, offers_df)
full_df.printSchema()

root
 |-- account_id: string (nullable = true)
 |-- event: string (nullable = true)
 |-- time_since_test_start: double (nullable = true)
 |-- value_amount: double (nullable = true)
 |-- value_reward: double (nullable = true)
 |-- offer_id: string (nullable = true)
 |-- age: long (nullable = true)
 |-- credit_card_limit: double (nullable = true)
 |-- gender: string (nullable = true)
 |-- registered_on: string (nullable = true)
 |-- birth_year: long (nullable = true)
 |-- age_group: string (nullable = true)
 |-- credit_limit_bucket: string (nullable = true)
 |-- discount_value: long (nullable = true)
 |-- duration: double (nullable = true)
 |-- min_value: long (nullable = true)
 |-- offer_type: string (nullable = true)
 |-- channels_mobile: integer (nullable = true)
 |-- channels_email: integer (nullable = true)
 |-- channels_social: integer (nullable = true)
 |-- channels_web: integer (nullable = true)
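
The schema above is consistent with each transaction row being enriched with customer and offer columns. The source of `integrate_all_dataframes` is not shown in this notebook, so the following is only a plain-Python sketch of left-join semantics, not the Spark implementation. It assumes that transactions are kept even when no offer or customer matches (as the NULL `offer_id` rows suggest). The sample rows, the `left_join` helper, and the column subsets are hypothetical.

```python
# Plain-Python illustration of left-join semantics (NOT the Spark implementation).
# All rows are made-up placeholders; only the column names come from the schema above.

transactions = [
    {"account_id": "c1", "event": "offer received", "offer_id": "o1"},
    {"account_id": "c2", "event": "transaction", "offer_id": None},  # no offer attached
]
customers = {"c1": {"gender": "F", "credit_card_limit": 50000.0}}  # "c2" missing on purpose
offers = {"o1": {"offer_type": "bogo", "duration": 7.0}}

CUSTOMER_COLS = ("gender", "credit_card_limit")
OFFER_COLS = ("offer_type", "duration")


def left_join(rows, lookup, key, columns):
    """Keep every input row; fill `columns` with None when `key` has no match."""
    joined = []
    for row in rows:
        match = lookup.get(row[key], {})
        joined.append({**row, **{c: match.get(c) for c in columns}})
    return joined


# Assumed order: customers first, then offers (hypothetical).
full_rows = left_join(transactions, customers, "account_id", CUSTOMER_COLS)
full_rows = left_join(full_rows, offers, "offer_id", OFFER_COLS)

for row in full_rows:
    print(row)
```

With a left join the row count of the transactions side is preserved, which is why offer and customer columns can legitimately be null in the integrated data.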



In [48]:
full_df.show(5, truncate=False)

+--------------------------------+--------------+---------------------+------------+------------+--------------------------------+---+-----------------+-------+-------------+----------+-------------+-------------------+--------------+--------+---------+----------+---------------+--------------+---------------+------------+
|account_id                      |event         |time_since_test_start|value_amount|value_reward|offer_id                        |age|credit_card_limit|gender |registered_on|birth_year|age_group    |credit_limit_bucket|discount_value|duration|min_value|offer_type|channels_mobile|channels_email|channels_social|channels_web|
+--------------------------------+--------------+---------------------+------------+------------+--------------------------------+---+-----------------+-------+-------------+----------+-------------+-------------------+--------------+--------+---------+----------+---------------+--------------+---------------+------------+
|78afa995795e4d85b5d9ceec


## 5. Exploratory Data Analysis

### Questions:
**a) What is the profile of customers who complete offers, considering the features age (birth year), gender, and credit limit?**

In [49]:
from pyspark.sql.functions import col, count, round as spark_round

In [51]:
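# Denominator: total number of "offer completed" events,
# so "percent" below is each group's share of all completed offers.
total = full_df.filter(col("event") == "offer completed").count()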

full_df.filter(col("event") == "offer completed") \
    .groupBy("age_group") \
    .agg(count("*").alias("count")) \
    .withColumn("percent", spark_round((col("count") / total) * 100, 2)) \
    .orderBy("count", ascending=False) \
    .show(5, truncate=False)

+-------------------+-----+-------+
|age_group          |count|percent|
+-------------------+-----+-------+
|Boomers (60+)      |21235|63.24  |
|Gen X (45–59)      |7577 |22.56  |
|Millennials (30–44)|3897 |11.61  |
|Gen Z (15–29)      |870  |2.59   |
+-------------------+-----+-------+




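🔍 Insight: completed offers are **dominated by customers aged 45 or older (Boomers and Gen X together account for more than 85% of completions)**. Because these are shares of completed offers rather than completion rates, this may reflect the historical composition of the customer base or the platform's main audience during this period.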
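
As a quick sanity check on the figure quoted in the insight, the shares can be recomputed from the counts in the table above. This is a minimal plain-Python sketch; the dictionary is copied by hand from the output and the variable names are illustrative.

```python
# Counts copied from the age_group table above (event == "offer completed").
completed_by_age = {
    "Boomers (60+)": 21235,
    "Gen X (45-59)": 7577,
    "Millennials (30-44)": 3897,
    "Gen Z (15-29)": 870,
}

total_completed = sum(completed_by_age.values())

# Share of all completed offers per age group, in percent.
shares = {
    group: round(n / total_completed * 100, 2)
    for group, n in completed_by_age.items()
}

# Combined share of the two oldest groups (45 years or older).
n_45_plus = completed_by_age["Boomers (60+)"] + completed_by_age["Gen X (45-59)"]
share_45_plus = round(n_45_plus / total_completed * 100, 2)

print(total_completed)
print(shares)
print(share_45_plus)
```

The combined share comes out just under 86%, which matches the "more than 85%" statement.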

In [52]:
total = full_df.filter(col("event") == "offer completed").count()

full_df.filter(col("event") == "offer completed") \
    .groupBy("gender") \
    .agg(count("*").alias("count")) \
    .withColumn("percent", spark_round((col("count") / total) * 100, 2)) \
    .orderBy("count", ascending=False) \
    .show(5, truncate=False)


+-------+-----+-------+
|gender |count|percent|
+-------+-----+-------+
|M      |16466|49.04  |
|F      |15477|46.09  |
|unknown|1636 |4.87   |
+-------+-----+-------+




🔍 Insight: **The gender distribution among completed offers is fairly balanced** (M 49.04%, F 46.09%, unknown 4.87%), which supports either gender-neutral or segmented communication strategies.

In [53]:
total = full_df.filter(col("event") == "offer completed").count()

full_df.filter(col("event") == "offer completed") \
    .groupBy("credit_limit_bucket") \
    .agg(count("*").alias("count")) \
    .withColumn("percent", spark_round((col("count") / total) * 100, 2)) \
    .orderBy("count", ascending=False) \
    .show(5, truncate=False)


+-------------------+-----+-------+
|credit_limit_bucket|count|percent|
+-------------------+-----+-------+
|Very High (> 60k)  |21483|63.98  |
|High (30k–60k)     |11976|35.67  |
|Medium (10k–30k)   |120  |0.36   |
+-------------------+-----+-------+




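🔍 Insight: completed offers are heavily concentrated among **customers with high purchasing power**: 63.98% fall in the Very High (> 60k) credit limit bucket and 35.67% in High (30k–60k). Coupon and offer strategies may therefore work better when oriented toward added value (e.g. gifts, experiences) than toward simple discounts.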

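**a) What is the profile of customers who complete offers, considering the features birth year, gender, and credit limit?** (same question; the next cell expresses completed offers by age group relative to the number of `transaction` events)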

In [56]:
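# Denominator: number of "transaction" events, NOT completed offers.
# "percent" is therefore completed offers per 100 transactions, by age group,
# and the column does not sum to 100.
total_transactions = full_df.filter(col("event") == "transaction").count()

full_df.filter(col("event") == "offer completed") \
    .groupBy("age_group") \
    .agg(count("*").alias("count")) \
    .withColumn("percent", spark_round((col("count") / total_transactions) * 100, 2)) \
    .orderBy("count", ascending=False) \
    .show(5, truncate=False)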


+-------------------+-----+-------+
|age_group          |count|percent|
+-------------------+-----+-------+
|Boomers (60+)      |21235|15.28  |
|Gen X (45–59)      |7577 |5.45   |
|Millennials (30–44)|3897 |2.8    |
|Gen Z (15–29)      |870  |0.63   |
+-------------------+-----+-------+
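
The two age-group tables use the same counts but different denominators, which is why the percentages differ. The plain-Python sketch below reproduces both columns. The transaction total of 138953 is an assumption: it is the number of rows with NULL `offer_id` shown earlier, and it is consistent with the percentages printed above, but it is not printed directly by this notebook.

```python
# Counts copied from the age_group tables above (event == "offer completed").
completed_by_age = {
    "Boomers (60+)": 21235,
    "Gen X (45-59)": 7577,
    "Millennials (30-44)": 3897,
    "Gen Z (15-29)": 870,
}

total_completed = sum(completed_by_age.values())
# ASSUMPTION: number of "transaction" events, inferred from the NULL offer_id count.
total_transactions = 138953

share_of_completions = {}
per_100_transactions = {}
for group, n in completed_by_age.items():
    # Share of all completed offers (sums to roughly 100 across groups).
    share_of_completions[group] = round(n / total_completed * 100, 2)
    # Completed offers per 100 purchase events (does not sum to 100).
    per_100_transactions[group] = round(n / total_transactions * 100, 2)

for group in completed_by_age:
    print(group, share_of_completions[group], per_100_transactions[group])
```

The first column answers "who completes offers?", while the second relates completions to purchase volume, so the two should not be compared directly.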




In [None]:
print(f"Full df: {full_df.count()} rows")

In [None]:
value_counts(full_df, "offer_type")

In [None]:
value_counts(full_df, "event")

In [None]:
isna_sum(full_df, 'full_df')