# Generate Users
---
The goal of this notebook is to create the user base. We use the `Faker` library to simulate the data.

We also fix the seed at `42` (for both `random` and `Faker`) so that the data generation is reproducible.
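
To illustrate why the seed matters, here is a minimal sketch using only the standard-library `random` module. The helper name `sample_countries` is hypothetical and not part of the notebook, and it uses a dedicated `random.Random` instance instead of the global seed the notebook sets. Re-seeding with the same value replays the same sequence of choices; `fake.seed_instance(42)` plays the analogous role for the Faker-generated fields.

```python
import random

# Same country values the notebook draws from (including the "dirty" ones)
COUNTRIES = ["BR", "br", "Brazil", "US", "usa", None]

def sample_countries(seed, n=5):
    """Return n pseudo-random country values from a dedicated, seeded generator."""
    rng = random.Random(seed)
    return [rng.choice(COUNTRIES) for _ in range(n)]

first_run = sample_countries(42)
second_run = sample_countries(42)
print(first_run == second_run)  # True: same seed, same sequence
```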

Fields of the **users** table:

* **user_id:** *User identifier*
* **email:** *User's email address*
* **country:** *Country associated with the user*
* **signup_date:** *Signup date*
* **created_at:** *Creation date*

In [0]:
%pip install faker

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
from faker import Faker
import random 

#### Data Generation

In [0]:
# Setup
random.seed(42)
fake = Faker()
fake.seed_instance(42)

# Data generation (with intentional errors)
users = []

for _ in range(5000):

    users.append({
        "user_id" : fake.uuid4() if random.random() > 0.05 else None,
        "email" : fake.email(),
        "country" : random.choice(["BR", "br", "Brazil", "US", "usa", None]),
        "signup_date" : fake.date_between("-2y", "today").strftime("%Y-%m-%d"),
        "created_at" : fake.iso8601()
    })

df_users = spark.createDataFrame(users)
display(df_users.limit(5))

country,created_at,email,signup_date,user_id
BR,2000-07-16T07:10:59.304001,john21@example.net,2025-06-23,bdd640fb-0667-4ad1-9c80-317fa3b1799d
br,2010-02-04T15:22:56.895920,robinsonwilliam@example.org,2025-02-12,07a0ca6e-0822-48f3-ac03-1199972a8469
,1978-12-10T11:42:31.943111,zlawrence@example.org,2025-07-06,386ecbe0-6b65-46a4-8b81-48f6b38a088c
,1990-02-07T03:37:28.052331,susanrogers@example.org,2024-10-02,27cd8130-4722-4389-971a-a8766c307511
BR,1986-06-02T07:56:45.634285,blairamanda@example.com,2024-02-25,ce9ff57f-43b7-43a6-9a8d-ca03580d7b71
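
The generated rows contain intentional defects: roughly 5% of the `user_id` values are missing, and `country` mixes several spellings plus nulls. As a rough sketch of how these could be profiled before any cleaning, the plain-Python helper below counts missing ids and tallies the raw country values. The name `profile_users` is hypothetical, and it operates on a list of dicts shaped like `users`, not on the Spark DataFrame.

```python
from collections import Counter

def profile_users(rows):
    """Count rows without a user_id and tally the raw country values."""
    missing_ids = sum(1 for row in rows if row.get("user_id") is None)
    country_counts = Counter(row.get("country") for row in rows)
    return missing_ids, country_counts

# Tiny hand-written sample in the same shape as the generated records
sample = [
    {"user_id": "a1", "country": "BR"},
    {"user_id": None, "country": "br"},
    {"user_id": "c3", "country": None},
    {"user_id": "d4", "country": "BR"},
]

missing_ids, country_counts = profile_users(sample)
print(missing_ids)           # 1
print(country_counts["BR"])  # 2
```

In the notebook itself, the same function could be called as `profile_users(users)` right after the generation loop.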


#### Writing to RAW

* Volume creation

In [0]:
%sql
CREATE CATALOG IF NOT EXISTS main;
CREATE SCHEMA IF NOT EXISTS main.lakehouse_marketing;

CREATE VOLUME IF NOT EXISTS main.lakehouse_marketing.raw;

* Writing to RAW

In [0]:
BASE_PATH = "/Volumes/main/lakehouse_marketing/raw"

df_users.write\
    .mode("overwrite")\
    .option("header", "true")\
    .csv(f"{BASE_PATH}/users")

* Validation

In [0]:

dbutils.fs.ls("/Volumes/main/lakehouse_marketing/raw")

[FileInfo(path='dbfs:/Volumes/main/lakehouse_marketing/raw/users/', name='users/', size=0, modificationTime=1767127496953)]