# PIX and Brazil Payments Trends – Exploratory Analysis

This notebook explores and cleans the Brazilian Payment Methods dataset (2016–2024) using pandas and numpy.  
The goal is to prepare the data for SQL aggregation and Tableau visualization, focusing on PIX and the decline of traditional payment methods (DOC, TED, Boleto, Checks).

## Import Libraries

In [None]:
import pandas as pd
import numpy as np

## Load Dataset

In [None]:
df = pd.read_csv('../data/raw/brazilian_payment_methods_raw.csv') # Load Dataset

df.head(20)  # Quick look at the data

## Dataset Overview
Check column types, basic statistics, and missing values.

In [None]:
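df.info()  # Column types and non-null counts
print(df.describe())  # Basic statistics; printed explicitly because a cell only displays its last expression
df.isna().sum()  # Missing values per column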

> **Note:**  
> Each number in the output is the count of missing values in that column.  
> A count of 0 means every row has a valid value for that column.  
> Here every count is 0, so the dataset is complete.
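
To make the reading of `isna().sum()` concrete, here is a minimal sketch on a hypothetical three-row frame. The values are made up for illustration and are not taken from the dataset; only the column names are borrowed from it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    'YearMonth': [202401, 202402, 202403],
    'quantityPix': [10.0, np.nan, 30.0],   # one deliberately missing value
    'valuePix': [100.0, 200.0, 300.0],
})

# One count per column: how many rows are missing in that column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# A single boolean that answers "is the whole frame complete?"
is_complete = not toy.isna().any().any()
print(is_complete)
```

In this toy frame `quantityPix` reports one missing value, so the frame is not complete. In the real dataset every count is 0.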

## Convert YearMonth to datetime

In [None]:
df['YearMonth'] = pd.to_datetime(df['YearMonth'], format='%Y%m') # Convert YearMonth to datetime
df['year'] = df['YearMonth'].dt.year # Extract year for trend analysis
df['month'] = df['YearMonth'].dt.month  # Extract month for seasonal analysis

df.head()

For example, a value stored as **202405** is now the timestamp **2024-05-01**; each month is represented by its first day.

> **Why?**  
> Converting `YearMonth` allows us to sort and plot data chronologically.  
> We also extract `year` and `month` for easier grouping later.
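
The effect on sorting and grouping can be sketched on a small hypothetical frame. The three `YearMonth` values below are made up, and casting to `str` before parsing is an extra precaution assumed here, not something the notebook itself does.

```python
import pandas as pd

# Hypothetical YYYYMM values, deliberately out of chronological order.
toy = pd.DataFrame({'YearMonth': [202405, 202312, 202401]})

# Cast to str first so the '%Y%m' format is applied to text like '202405'.
toy['YearMonth'] = pd.to_datetime(toy['YearMonth'].astype(str), format='%Y%m')

# Real timestamps can now be sorted chronologically.
toy = toy.sort_values('YearMonth').reset_index(drop=True)

# Helper columns for grouping by year or by calendar month.
toy['year'] = toy['YearMonth'].dt.year
toy['month'] = toy['YearMonth'].dt.month
print(toy)
```

After sorting, December 2023 comes first, followed by January and May 2024.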

## Ensure Numeric Columns
These columns hold transaction counts and values, so they must have a numeric dtype before they can be aggregated in SQL.

In [None]:
# Columns to convert to numeric (all except YearMonth)
numeric_cols = [
    'quantityPix', 'valuePix',
    'quantityTED', 'valueTED',
    'quantityTEC', 'valueTEC',
    'quantityBankCheck', 'valueBankCheck',
    'quantityBrazilianBoletoPayment', 'valueBrazilianBoletoPayment',
    'quantityDOC', 'valueDOC'
]

# Convert columns to numeric, coerce errors to NaN
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')

# Verify the conversion
df.info()
df.isna().sum()
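
As a small illustration of what `errors='coerce'` does, the sketch below converts a hypothetical series of strings. The inputs are invented; the point is that an unparseable entry becomes `NaN` instead of raising an error, which is why re-checking `isna().sum()` after the conversion is worthwhile.

```python
import pandas as pd

# Hypothetical raw strings; 'n/a' cannot be parsed as a number.
raw = pd.Series(['120', '3.5', 'n/a'])

# Unparseable entries are turned into NaN rather than raising.
converted = pd.to_numeric(raw, errors='coerce')
print(converted)

# Number of values that became NaN because of the conversion.
n_coerced = int(converted.isna().sum() - raw.isna().sum())
print(n_coerced)
```

If the same comparison on the real columns yields 0 new missing values, the conversion lost no data.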

## Create Metrics: Average Transaction Values
For each payment method (PIX, TED, TEC, DOC, Boleto, Check), the average value per transaction is calculated as:

- **avg_method = valueMethod / quantityMethod**

In [None]:
# Calculate average transaction values for each payment method
df['avg_pix'] = df['valuePix'] / df['quantityPix']
df['avg_ted'] = df['valueTED'] / df['quantityTED']
df['avg_tec'] = df['valueTEC'] / df['quantityTEC']
df['avg_doc'] = df['valueDOC'] / df['quantityDOC']
df['avg_check'] = df['valueBankCheck'] / df['quantityBankCheck']
df['avg_boleto'] = df['valueBrazilianBoletoPayment'] / df['quantityBrazilianBoletoPayment']

df[['YearMonth','avg_pix','avg_ted','avg_tec','avg_doc','avg_check','avg_boleto']].head(20)

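> **Note:** Months with zero transactions (value = 0 and quantity = 0) result in `NaN`, because 0 / 0 is undefined.  
> This is expected when a method was not in use that month. A nonzero value with a zero quantity would instead produce `inf`, which more likely indicates a data problem.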
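
The division behaviour can be checked on a hypothetical frame with made-up numbers. The `avg_pix_clean` column is an illustrative name, and replacing `inf` with `NaN` is one possible cleanup, assumed here rather than taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: no activity, inconsistent data, and a normal month.
toy = pd.DataFrame({
    'quantityPix': [0, 0, 4],
    'valuePix': [0.0, 50.0, 200.0],
})

# pandas does not raise on division by zero:
# 0 / 0 gives NaN, 50 / 0 gives inf, 200 / 4 gives 50.0.
toy['avg_pix'] = toy['valuePix'] / toy['quantityPix']

# Optional cleanup (an assumption): treat inf as missing as well.
toy['avg_pix_clean'] = toy['avg_pix'].replace([np.inf, -np.inf], np.nan)
print(toy)
```

Keeping `NaN` rather than filling with 0 avoids suggesting that the average transaction was worth nothing in months with no activity.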

## Save Cleaned Dataset
Saving the cleaned dataset so it can be imported into SQL for aggregation and analysis:

In [None]:
df.to_csv('../data/cleaned/brazilian_payment_methods_clean.csv', index=False) # Save cleaned data to a new CSV file
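
Before importing into SQL, it can help to see how the saved file represents dates and missing values. The sketch below writes a hypothetical two-row frame to a temporary directory and reads it back; the file name `toy_clean.csv` and the values are illustrative only.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical cleaned frame with one missing average.
toy = pd.DataFrame({
    'YearMonth': pd.to_datetime(['2024-04-01', '2024-05-01']),
    'avg_pix': [np.nan, 50.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_clean.csv')
    toy.to_csv(path, index=False)

    # Inspect the raw text: NaN is written as an empty field by default.
    with open(path) as fh:
        csv_text = fh.read()
    print(csv_text)

    # Reading back needs parse_dates to restore the datetime dtype.
    reloaded = pd.read_csv(path, parse_dates=['YearMonth'])

print(reloaded.dtypes)
```

Because `NaN` is saved as an empty field, the SQL import step should be set up to read empty fields as `NULL` rather than as 0 or an empty string.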