# Raw to Silver ETL - Movies Dataset

This notebook runs the ETL from the project's Raw layer to its Silver layer.

Steps:
- Extract: read the raw CSV
- Transform: cleaning, standardization, enrichment, and validation
- Load: write to the Silver layer (CSV, plus Parquet when an engine is installed) and optionally to PostgreSQL


## 1. Imports and configuration

In [19]:
import pandas as pd
import numpy as np
from pathlib import Path
import os
from datetime import datetime
import psycopg2
from psycopg2.extras import execute_values

In [20]:
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", 120)


### 1.1 Project paths

In [21]:
CWD = Path.cwd()
PROJECT_ROOT = None

for candidate in [CWD, *CWD.parents]:
    if (candidate / "Data Layer").exists() and (candidate / "Data Layer" / "raw" / "dados_brutos.csv").exists():
        PROJECT_ROOT = candidate
        break

if PROJECT_ROOT is None:
    raise FileNotFoundError(
        f"Could not find the project folder starting from {CWD}. "
        "Check that the notebook is inside the repository."
    )

RAW_FILE = PROJECT_ROOT / "Data Layer" / "raw" / "dados_brutos.csv"
SILVER_DIR = PROJECT_ROOT / "Data Layer" / "silver"
SILVER_FILE = SILVER_DIR / "movies_silver.csv"
SILVER_PARQUET = SILVER_DIR / "movies_silver.parquet"

print(f"Project: {PROJECT_ROOT}")
print(f"Raw: {RAW_FILE}")
print(f"Silver: {SILVER_FILE}")


Project: c:\Users\mathe\GitHub\Grupo-8-SBD2
Raw: c:\Users\mathe\GitHub\Grupo-8-SBD2\Data Layer\raw\dados_brutos.csv
Silver: c:\Users\mathe\GitHub\Grupo-8-SBD2\Data Layer\silver\movies_silver.csv
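
The upward search in the cell above can be wrapped in a small function, which makes it easier to test outside the notebook. This is a minimal sketch that assumes the same marker path (`Data Layer/raw/dados_brutos.csv`); the name `find_project_root` is hypothetical and not part of the project.

```python
from pathlib import Path
from typing import Optional

# Marker that identifies the repository root (same relative path the notebook checks).
MARKER = Path("Data Layer") / "raw" / "dados_brutos.csv"


def find_project_root(start: Path, marker: Path = MARKER) -> Optional[Path]:
    """Return the first directory at or above `start` that contains `marker`."""
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    return None
```

Checking for the file alone is enough here, since the file cannot exist unless its `Data Layer` parent folder does.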


### 1.2 Database connection

In [22]:
# Set to True to load the Silver data into PostgreSQL
LOAD_TO_DB = True

DB_HOST = os.getenv("DB_HOST", "localhost")
DB_PORT = os.getenv("DB_PORT", "5433")  # Port from docker-compose.yml
DB_NAME = os.getenv("DB_NAME", "grupo08")  # Database name from docker-compose.yml
DB_USER = os.getenv("DB_USER", "postgres")
DB_PASSWORD = os.getenv("DB_PASSWORD", "postgres")

conn = None

if LOAD_TO_DB:
    try:
        # psycopg2 connection
        conn = psycopg2.connect(
            host=DB_HOST,
            port=DB_PORT,
            dbname=DB_NAME,
            user=DB_USER,
            password=DB_PASSWORD
        )
        print("✓ PostgreSQL connection ready (psycopg2)")
        print(f"✓ Connected to database: {DB_NAME} on port {DB_PORT}")
    except Exception as e:
        print("✗ Error connecting to PostgreSQL")
        print(e)
        print("\nCheck that:")
        print("  1. Docker is running")
        print("  2. The container is up: docker ps")
        print("  3. PostgreSQL is reachable")
        LOAD_TO_DB = False
else:
    print("LOAD_TO_DB is disabled. The database will not be loaded.")


✗ Error connecting to PostgreSQL
connection to server at "localhost" (::1), port 5433 failed: Connection refused (0x0000274D/10061)
	Is the server running on that host and accepting TCP/IP connections?
connection to server at "localhost" (127.0.0.1), port 5433 failed: Connection refused (0x0000274D/10061)
	Is the server running on that host and accepting TCP/IP connections?


Check that:
  1. Docker is running
  2. The container is up: docker ps
  3. PostgreSQL is reachable
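
The environment lookups above could be gathered in one place, which also lets a malformed port fail before any connection attempt. This is a sketch under stated assumptions: the helper name `load_db_settings` is hypothetical, and the defaults are copied from the cell above.

```python
import os
from typing import Mapping

# Defaults mirror the ones in the connection cell (port 5433 comes from docker-compose.yml).
DEFAULTS = {
    "DB_HOST": "localhost",
    "DB_PORT": "5433",
    "DB_NAME": "grupo08",
    "DB_USER": "postgres",
    "DB_PASSWORD": "postgres",
}


def load_db_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Build keyword arguments for psycopg2.connect from environment variables."""
    raw = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    return {
        "host": raw["DB_HOST"],
        "port": int(raw["DB_PORT"]),  # raises ValueError early on a non-numeric port
        "dbname": raw["DB_NAME"],
        "user": raw["DB_USER"],
        "password": raw["DB_PASSWORD"],
    }
```

With this in place, the connection call would read `psycopg2.connect(**load_db_settings())`.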


## 2. Extract - reading the raw data

In [23]:
print("Loading raw CSV...")
df_raw = pd.read_csv(RAW_FILE, low_memory=False)

print(f"Rows: {len(df_raw):,}")
print(f"Columns: {len(df_raw.columns)}")
print(f"Memory: {df_raw.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

print("Sample:")
display(df_raw.head())

Loading raw CSV...


Rows: 1,351,251
Columns: 24
Memory: 1455.05 MB
Sample:


Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,adult,backdrop_path,budget,homepage,imdb_id,original_language,original_title,overview,popularity,poster_path,tagline,genres,production_companies,production_countries,spoken_languages,keywords
0,27205,Inception,8.364,34495,Released,2010-07-15,825532764,148,False,/8ZTVqvKDQ8emSGUEMjsS4yHAwrp.jpg,160000000,https://www.warnerbros.com/movies/inception,tt1375666,en,Inception,"Cobb, a skilled thief who commits corporate espionage by infiltrating the subconscious of his targets is offered a c...",83.952,/oYuLEt3zVCKq57qu2F8dT7NIa6f.jpg,Your mind is the scene of the crime.,"Action, Science Fiction, Adventure","Legendary Pictures, Syncopy, Warner Bros. Pictures","United Kingdom, United States of America","English, French, Japanese, Swahili","rescue, mission, dream, airplane, paris, france, virtual reality, kidnapping, philosophy, spy, allegory, manipulatio..."
1,157336,Interstellar,8.417,32571,Released,2014-11-05,701729206,169,False,/pbrkL804c8yAv3zBZR4QPEafpAR.jpg,165000000,http://www.interstellarmovie.net/,tt0816692,en,Interstellar,The adventures of a group of explorers who make use of a newly discovered wormhole to surpass the limitations on hum...,140.241,/gEU2QniE6E77NI6lCU6MxlNBvIx.jpg,Mankind was born on Earth. It was never meant to die here.,"Adventure, Drama, Science Fiction","Legendary Pictures, Syncopy, Lynda Obst Productions","United Kingdom, United States of America",English,"rescue, future, spacecraft, race against time, artificial intelligence (a.i.), nasa, time warp, dystopia, expedition..."
2,155,The Dark Knight,8.512,30619,Released,2008-07-16,1004558444,152,False,/nMKdUUepR0i5zn0y1T4CsSB5chy.jpg,185000000,https://www.warnerbros.com/movies/dark-knight/,tt0468569,en,The Dark Knight,"Batman raises the stakes in his war on crime. With the help of Lt. Jim Gordon and District Attorney Harvey Dent, Bat...",130.643,/qJ2tW6WMUDux911r6m7haRef0WH.jpg,Welcome to a world without rules.,"Drama, Action, Crime, Thriller","DC Comics, Legendary Pictures, Syncopy, Isobel Griffiths, Warner Bros. Pictures","United Kingdom, United States of America","English, Mandarin","joker, sadism, chaos, secret identity, crime fighter, superhero, anti hero, scarecrow, based on comic, vigilante, or..."
3,19995,Avatar,7.573,29815,Released,2009-12-15,2923706026,162,False,/vL5LR6WdxWPjLPFRLe133jXWsh5.jpg,237000000,https://www.avatar.com/movies/avatar,tt0499549,en,Avatar,"In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn bet...",79.932,/kyeqWdyUXW608qlYkRqosgbbJyK.jpg,Enter the world of Pandora.,"Action, Adventure, Fantasy, Science Fiction","Dune Entertainment, Lightstorm Entertainment, 20th Century Fox, Ingenious Media","United States of America, United Kingdom","English, Spanish","future, society, culture clash, space travel, space war, space colony, tribe, romance, alien, futuristic, space, ali..."
4,24428,The Avengers,7.71,29166,Released,2012-04-25,1518815515,143,False,/9BBTo63ANSmhC4e6r62OJFuK2GL.jpg,220000000,https://www.marvel.com/movies/the-avengers,tt0848228,en,The Avengers,"When an unexpected enemy emerges and threatens global safety and security, Nick Fury, director of the international ...",98.082,/RYMX2wcKCBAr24UyPD7xwmjaTn.jpg,Some assembly required.,"Science Fiction, Action, Adventure",Marvel Studios,United States of America,"English, Hindi, Russian","new york city, superhero, shield, based on comic, alien invasion, superhero team, aftercreditsstinger, duringcredits..."
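
The raw file takes about 1.4 GB once loaded, so it can help to inspect it without holding everything in memory. The sketch below streams a CSV with the `chunksize` parameter of `pd.read_csv`. The in-memory sample stands in for `dados_brutos.csv`, and the chunk size of 2 is only for illustration.

```python
import io

import pandas as pd

# Tiny stand-in for the raw CSV: five rows, two columns.
sample = io.StringIO("id,title\n" + "".join(f"{i},Movie {i}\n" for i in range(5)))

total_rows = 0
columns = None

# With chunksize set, read_csv yields DataFrames of at most that many rows.
for chunk in pd.read_csv(sample, chunksize=2):
    total_rows += len(chunk)
    if columns is None:
        columns = list(chunk.columns)

print(f"Rows: {total_rows:,}")
print(f"Columns: {columns}")
```

The transform step in this notebook still needs the full DataFrame (for example, for deduplication by `id`), so this pattern suits quick profiling rather than the ETL itself.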


## 3. Transform - cleaning and enrichment

In [24]:
def normalize_text(series: pd.Series) -> pd.Series:
    # Standardize strings and convert empty values to NA
    s = series.astype("string").str.strip()
    s = s.replace({"": pd.NA, "None": pd.NA, "nan": pd.NA, "NaN": pd.NA})
    return s


def clean_csv_list(value: str):
    # Clean comma-separated list fields (trim, drop empties, dedupe)
    if pd.isna(value):
        return pd.NA
    parts = [p.strip() for p in str(value).split(",")]
    parts = [p for p in parts if p and p.lower() not in {"nan", "none"}]
    if not parts:
        return pd.NA
    seen = set()
    dedup = []
    for p in parts:
        if p not in seen:
            dedup.append(p)
            seen.add(p)
    return ", ".join(dedup)


def normalize_imdb_id(value: str):
    if pd.isna(value):
        return pd.NA
    v = str(value).strip()
    if v == "":
        return pd.NA
    if not v.startswith("tt"):
        return pd.NA
    return v if len(v) <= 12 else v[:12]


def parse_bool(value):
    if pd.isna(value):
        return pd.NA
    v = str(value).strip().lower()
    if v in {"true", "t", "1", "yes"}:
        return True
    if v in {"false", "f", "0", "no"}:
        return False
    return pd.NA


def cinema_era(year):
    if pd.isna(year):
        return pd.NA
    y = int(year)
    if y < 1930:
        return "Cinema mudo"
    if y < 1960:
        return "Era dourada"
    if y < 1980:
        return "Nova Hollywood"
    if y < 2000:
        return "Blockbuster"
    if y < 2010:
        return "Digital"
    return "Streaming"
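
The deduplication loop in `clean_csv_list` can be expressed more compactly with `dict.fromkeys`, which keeps first-seen order. The following is a standalone sketch rather than a drop-in replacement: the name `dedupe_csv_list` is hypothetical, and it uses `None` where the notebook uses `pd.NA` so that it runs with the standard library alone.

```python
from typing import Optional


def dedupe_csv_list(value: Optional[str]) -> Optional[str]:
    """Trim items, drop empty/'nan'/'none' tokens, and remove repeats in first-seen order."""
    if value is None:
        return None
    parts = (p.strip() for p in value.split(","))
    kept = [p for p in parts if p and p.lower() not in {"nan", "none"}]
    # dict.fromkeys preserves insertion order, so it dedupes without a manual `seen` set.
    unique = list(dict.fromkeys(kept))
    return ", ".join(unique) if unique else None


print(dedupe_csv_list("Action,  Drama, Action, nan, "))
```

Both versions split on every comma, so an item whose own name contains a comma would be broken into separate entries.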


In [25]:
df = df_raw.copy()

text_cols = [
    "title",
    "status",
    "original_language",
    "original_title",
    "overview",
    "tagline",
    "genres",
    "production_companies",
    "production_countries",
    "spoken_languages",
    "keywords",
    "homepage",
    "imdb_id",
    "poster_path",
    "backdrop_path",
]

for col in text_cols:
    if col in df.columns:
        df[col] = normalize_text(df[col])

# Normalize imdb_id
if "imdb_id" in df.columns:
    df["imdb_id"] = df["imdb_id"].apply(normalize_imdb_id)

# Normalize language (lowercase, two-letter codes only)
if "original_language" in df.columns:
    df["original_language"] = df["original_language"].str.lower()
    df.loc[df["original_language"].str.len() != 2, "original_language"] = pd.NA

# Normalize the adult flag (nullable boolean keeps NA distinct from False)
if "adult" in df.columns:
    df["adult"] = df["adult"].apply(parse_bool).astype("boolean")

# Numeric type conversion
num_cols = ["id", "vote_average", "vote_count", "revenue", "budget", "runtime", "popularity"]
for col in num_cols:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")

if "id" in df.columns:
    df["id"] = df["id"].astype("Int64")
if "vote_count" in df.columns:
    df["vote_count"] = df["vote_count"].astype("Int64")
if "runtime" in df.columns:
    df["runtime"] = df["runtime"].astype("Int64")

# Null out invalid values
if "vote_average" in df.columns:
    df.loc[(df["vote_average"] < 0) | (df["vote_average"] > 10), "vote_average"] = pd.NA
if "vote_count" in df.columns:
    df.loc[df["vote_count"] < 0, "vote_count"] = pd.NA
if "revenue" in df.columns:
    df.loc[df["revenue"] < 0, "revenue"] = pd.NA
if "budget" in df.columns:
    df.loc[df["budget"] < 0, "budget"] = pd.NA
if "runtime" in df.columns:
    df.loc[(df["runtime"] <= 0) | (df["runtime"] > 600), "runtime"] = pd.NA
if "popularity" in df.columns:
    df.loc[df["popularity"] < 0, "popularity"] = pd.NA

# Dates
if "release_date" in df.columns:
    df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")
    df["year"] = df["release_date"].dt.year.astype("Int64")
    df["month"] = df["release_date"].dt.month.astype("Int64")
    df["day"] = df["release_date"].dt.day.astype("Int64")
    df["day_of_week"] = df["release_date"].dt.dayofweek.astype("Int64")
    df["quarter"] = df["release_date"].dt.quarter.astype("Int64")
    df["week_of_year"] = df["release_date"].dt.isocalendar().week.astype("Int64")
    df["decade"] = (df["year"] // 10 * 10).astype("Int64")
    df["is_weekend"] = df["day_of_week"].isin([5, 6])
    df["cinema_era"] = df["year"].apply(cinema_era).astype("string")

# Comma-separated list fields
list_cols = ["genres", "production_companies", "production_countries", "spoken_languages", "keywords"]
for col in list_cols:
    if col in df.columns:
        df[col] = df[col].apply(clean_csv_list).astype("string")

# Derived metrics
if "revenue" in df.columns and "budget" in df.columns:
    df["profit"] = df["revenue"] - df["budget"]
    df["roi"] = np.where(
        df["budget"] > 0,
        (df["revenue"] - df["budget"]) / df["budget"] * 100,
        np.nan,
    )

if "vote_average" in df.columns and "vote_count" in df.columns:
    df["engagement"] = df["vote_average"] * np.log1p(df["vote_count"])

if "revenue" in df.columns and "runtime" in df.columns:
    # Divide only where runtime is known and positive. The nullable Int64 runtime
    # is cast to float64 so the result is a plain float64 column (no dtype warning)
    mask = (df["runtime"] > 0).fillna(False).astype(bool)
    df["revenue_per_minute"] = (df["revenue"] / df["runtime"].astype("float64")).where(mask)

if "vote_average" in df.columns and "vote_count" in df.columns:
    quality_raw = (df["vote_average"].fillna(0) * np.log1p(df["vote_count"].fillna(0)))
    q_min = quality_raw.min()
    q_max = quality_raw.max()
    if pd.notna(q_min) and pd.notna(q_max) and q_max != q_min:
        df["quality_score"] = (quality_raw - q_min) / (q_max - q_min) * 100
    else:
        df["quality_score"] = pd.NA

# Ranges (bins)
if "revenue" in df.columns:
    df["revenue_range"] = pd.cut(
        df["revenue"],
        bins=[-1, 0, 1e6, 1e7, 5e7, 1e8, 5e8, np.inf],
        labels=["Zero", "<1M", "1-10M", "10-50M", "50-100M", "100-500M", ">500M"],
    ).astype("string")

if "budget" in df.columns:
    df["budget_range"] = pd.cut(
        df["budget"],
        bins=[-1, 0, 1e6, 1e7, 5e7, 1e8, 2.5e8, np.inf],
        labels=["Zero", "<1M", "1-10M", "10-50M", "50-100M", "100-250M", ">250M"],
    ).astype("string")

if "vote_average" in df.columns:
    df["rating_range"] = pd.cut(
        df["vote_average"],
        bins=[-0.1, 4, 6, 7, 8, 10],
        labels=["Ruim", "Regular", "Bom", "Muito bom", "Excelente"],
    ).astype("string")

if "runtime" in df.columns:
    df["runtime_range"] = pd.cut(
        df["runtime"],
        bins=[-1, 60, 90, 120, 150, np.inf],
        labels=["<60", "60-90", "90-120", "120-150", ">150"],
    ).astype("string")

if "popularity" in df.columns:
    df["popularity_range"] = pd.cut(
        df["popularity"],
        bins=[-1, 1, 5, 10, 20, 50, np.inf],
        labels=["<1", "1-5", "5-10", "10-20", "20-50", ">50"],
    ).astype("string")

# Deduplicate by id, keeping the row with the most votes, then revenue, then budget
if "id" in df.columns:
    df = df.dropna(subset=["id"])
    df = df.sort_values(by=["id", "vote_count", "revenue", "budget"], ascending=[True, False, False, False])
    df = df.drop_duplicates(subset=["id"], keep="first")

# Metadata
df["load_timestamp"] = pd.Timestamp.now()
df["source"] = RAW_FILE.name

print(f"Final rows: {len(df):,}")


Final rows: 1,350,096
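
To make the derived metrics concrete, the sketch below applies the same ROI and `quality_score` formulas to a three-row toy frame. The values are invented for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "revenue": [300.0, 50.0, 0.0],
        "budget": [100.0, 0.0, 40.0],
        "vote_average": [8.0, 6.0, np.nan],
        "vote_count": [99, 9, 0],
    }
)

# ROI in percent; rows with budget <= 0 get NaN instead of a division by zero.
toy["roi"] = np.where(
    toy["budget"] > 0,
    (toy["revenue"] - toy["budget"]) / toy["budget"] * 100,
    np.nan,
)

# Min-max scaling of vote_average * log(1 + vote_count) to the 0-100 range.
raw = toy["vote_average"].fillna(0) * np.log1p(toy["vote_count"].fillna(0))
toy["quality_score"] = (raw - raw.min()) / (raw.max() - raw.min()) * 100

print(toy[["roi", "quality_score"]])
```

Note that the third row, which has no rating, ends up with a `quality_score` of 0 rather than NA because of the `fillna(0)` calls. That matches the notebook's behavior, and downstream consumers may want to keep it in mind.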

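The range columns rely on `pd.cut`, whose intervals are closed on the right by default. The sketch below reuses the `revenue_range` bins from the transform cell on a few boundary values chosen for illustration.

```python
import numpy as np
import pandas as pd

# Boundary values: zero, just above zero, exactly 1M, just above 1M, and missing.
revenue = pd.Series([0, 1, 1_000_000, 1_000_001, np.nan])

ranges = pd.cut(
    revenue,
    bins=[-1, 0, 1e6, 1e7, 5e7, 1e8, 5e8, np.inf],
    labels=["Zero", "<1M", "1-10M", "10-50M", "50-100M", "100-500M", ">500M"],
).astype("string")

print(ranges.tolist())
```

Because each bin includes its upper edge, a revenue of exactly 1,000,000 is labeled `<1M`, and missing values stay missing. Passing `right=False` to `pd.cut` would flip the closed side, but the `Zero` bin would then need different edges.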

### 3.1 Column reordering (Silver standard)

In [26]:
silver_columns = [
    "id",
    "title",
    "original_title",
    "imdb_id",
    "overview",
    "tagline",
    "status",
    "adult",
    "original_language",
    "release_date",
    "year",
    "month",
    "day",
    "day_of_week",
    "quarter",
    "week_of_year",
    "decade",
    "cinema_era",
    "is_weekend",
    "vote_average",
    "vote_count",
    "popularity",
    "revenue",
    "budget",
    "profit",
    "roi",
    "runtime",
    "engagement",
    "revenue_per_minute",
    "quality_score",
    "revenue_range",
    "budget_range",
    "rating_range",
    "runtime_range",
    "popularity_range",
    "genres",
    "production_companies",
    "production_countries",
    "spoken_languages",
    "keywords",
    "homepage",
    "poster_path",
    "backdrop_path",
    "load_timestamp",
    "source",
]

available_cols = [c for c in silver_columns if c in df.columns]
df_silver = df[available_cols].copy()

print(f"Silver columns: {len(df_silver.columns)}")


Silver columns: 45


## 4. Load - writing to the Silver layer

In [None]:
SILVER_DIR.mkdir(parents=True, exist_ok=True)

print("Saving Silver layer CSV...")
df_silver.to_csv(SILVER_FILE, index=False, encoding="utf-8")
print(f"File saved: {SILVER_FILE}")

# Save Parquet (optional)
try:
    df_silver.to_parquet(SILVER_PARQUET, index=False)
    print(f"File saved: {SILVER_PARQUET}")
except Exception as e:
    # Usually a missing engine (pyarrow or fastparquet); print the cause either way
    print(f"Parquet not generated: {e}")


Saving Silver layer CSV...
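
CSV does not store dtypes, so readers of the Silver file need to declare them again. The sketch below round-trips a small frame through an in-memory buffer. The three columns are a subset of the Silver schema, and the values are illustrative.

```python
import io

import pandas as pd

silver = pd.DataFrame(
    {
        "id": pd.array([27205, 155], dtype="Int64"),
        "runtime": pd.array([148, pd.NA], dtype="Int64"),
        "release_date": pd.to_datetime(["2010-07-15", "2008-07-16"]),
    }
)

buffer = io.StringIO()
silver.to_csv(buffer, index=False)

# Default read: runtime comes back as float64 and release_date as plain text.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Declaring dtypes and date columns restores the intended schema.
buffer.seek(0)
typed = pd.read_csv(
    buffer,
    dtype={"id": "Int64", "runtime": "Int64"},
    parse_dates=["release_date"],
)

print(naive.dtypes)
print(typed.dtypes)
```

Parquet keeps this type information in the file itself, which is one reason to prefer it for downstream reads when an engine is installed.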


### 4.1 Loading into PostgreSQL (optional)

In [None]:
if LOAD_TO_DB and conn is not None:
    print("Loading data into PostgreSQL...")
    print(f"Records to load: {len(df_silver):,}")
    
    cur = conn.cursor()
    
    try:
        # Create the schema if it does not exist
        cur.execute("CREATE SCHEMA IF NOT EXISTS silver")
        conn.commit()
        
        # Prepare the data: convert the DataFrame to a list of tuples.
        # Cast to object first so NaN/NA/NaT can be replaced by None (SQL NULL)
        df_clean = df_silver.astype(object).where(df_silver.notna(), None)
        values = list(df_clean.itertuples(index=False, name=None))
        columns = list(df_silver.columns)
        
        # Recreate the table (DROP IF EXISTS first)
        cur.execute("DROP TABLE IF EXISTS silver.movies CASCADE")
        conn.commit()
        
        # Build the CREATE TABLE statement with matching column types.
        # pandas.api.types also recognizes the nullable Int64/Float64/boolean dtypes
        column_defs = []
        for col in columns:
            dtype = df_silver[col].dtype
            if pd.api.types.is_bool_dtype(dtype):
                pg_type = "BOOLEAN"
            elif pd.api.types.is_integer_dtype(dtype):
                pg_type = "BIGINT"
            elif pd.api.types.is_float_dtype(dtype):
                pg_type = "DOUBLE PRECISION"
            elif pd.api.types.is_datetime64_any_dtype(dtype):
                pg_type = "TIMESTAMP"
            else:
                pg_type = "TEXT"
            column_defs.append(f"    {col} {pg_type}")
        create_table_sql = "CREATE TABLE silver.movies (\n" + ",\n".join(column_defs) + "\n)"
        
        cur.execute(create_table_sql)
        conn.commit()
        
        # Insert the data in chunks using execute_values (more efficient)
        chunk_size = 2000
        total_chunks = (len(values) + chunk_size - 1) // chunk_size
        
        insert_sql = f"INSERT INTO silver.movies ({','.join(columns)}) VALUES %s"
        
        for i in range(0, len(values), chunk_size):
            chunk = values[i:i+chunk_size]
            execute_values(cur, insert_sql, chunk)
            conn.commit()
            chunk_num = (i // chunk_size) + 1
            if chunk_num % 10 == 0 or chunk_num == total_chunks:
                print(f"  Processed chunk {chunk_num}/{total_chunks} ({i+len(chunk):,} records)")
        
        print("✓ Load completed successfully!")
        print(f"✓ Table 'silver.movies' created with {len(df_silver):,} records")
        
    except Exception as e:
        conn.rollback()
        print(f"✗ Error loading data: {e}")
        raise
    finally:
        cur.close()
else:
    print("PostgreSQL load disabled or connection unavailable.")


PostgreSQL load disabled or connection unavailable.
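
Two details of the load step are easy to get wrong: matching dtypes by string misses pandas' nullable dtypes such as `Float64`, and missing values must reach the driver as `None`. The sketch below illustrates both on a toy frame. The function name `pg_type` is hypothetical, and nothing here touches a database.

```python
import pandas as pd
from pandas.api import types as pdt


def pg_type(dtype) -> str:
    """Map a pandas dtype to a PostgreSQL column type (TEXT as fallback)."""
    if pdt.is_bool_dtype(dtype):
        return "BOOLEAN"
    if pdt.is_integer_dtype(dtype):
        return "BIGINT"
    if pdt.is_float_dtype(dtype):
        return "DOUBLE PRECISION"
    if pdt.is_datetime64_any_dtype(dtype):
        return "TIMESTAMP"
    return "TEXT"


toy = pd.DataFrame(
    {
        "id": pd.array([1, 2], dtype="Int64"),
        "engagement": pd.array([3.5, pd.NA], dtype="Float64"),
        "release_date": pd.to_datetime(["2010-07-15", None]),
        "title": pd.array(["Inception", pd.NA], dtype="string"),
    }
)

ddl_types = {col: pg_type(toy[col].dtype) for col in toy.columns}

# Cast to object first so None survives; typed columns would coerce it back to NaN/NA.
rows = list(toy.astype(object).where(toy.notna(), None).itertuples(index=False, name=None))

print(ddl_types)
print(rows)
```

In the toy frame, every missing cell in the second row (`engagement`, `release_date`, `title`) becomes `None`, which psycopg2 sends as SQL `NULL`.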


### 4.2 Database verification

In [None]:
# Show database information
if LOAD_TO_DB and conn is not None:
    cur = conn.cursor()
    
    print("=" * 70)
    print("DATABASE INFORMATION")
    print("=" * 70)
    
    # List schemas
    print("\n📁 SCHEMAS:")
    cur.execute("""
        SELECT schema_name 
        FROM information_schema.schemata 
        WHERE schema_name NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
        ORDER BY schema_name;
    """)
    schemas = cur.fetchall()
    for schema in schemas:
        print(f"  • {schema[0]}")
    
    # List tables
    print("\n📊 TABLES:")
    cur.execute("""
        SELECT schemaname, tablename 
        FROM pg_tables 
        WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
        ORDER BY schemaname, tablename;
    """)
    tables = cur.fetchall()
    
    if tables:
        current_schema = None
        for schema, table in tables:
            if schema != current_schema:
                current_schema = schema
                print(f"\n[{schema}]")
            print(f"  • {table}")
    else:
        print("  No tables found")
    
    # Statistics for the movies table
    print("\n" + "=" * 70)
    print("STATISTICS FOR TABLE silver.movies")
    print("=" * 70)
    
    try:
        # Record count
        cur.execute('SELECT COUNT(*) FROM silver.movies;')
        count = cur.fetchone()[0]
        print(f"\n📈 Total records: {count:,}")
        
        # Column information
        cur.execute("""
            SELECT column_name, data_type, character_maximum_length
            FROM information_schema.columns 
            WHERE table_schema = 'silver' AND table_name = 'movies'
            ORDER BY ordinal_position;
        """)
        columns = cur.fetchall()
        
        print(f"\n📋 Columns ({len(columns)}):")
        for col_name, col_type, max_length in columns:
            length_info = f"({max_length})" if max_length else ""
            print(f"  • {col_name:30s} {col_type:20s} {length_info}")
        
        # Data sample
        print("\n" + "=" * 70)
        print("DATA SAMPLE (first 5 rows)")
        print("=" * 70)
        cur.execute("SELECT * FROM silver.movies LIMIT 5;")
        sample = cur.fetchall()
        col_names = [desc[0] for desc in cur.description]
        
        # Show only a few key columns to keep the output readable
        main_cols = ['id', 'title', 'year', 'vote_average', 'revenue', 'runtime']
        available_main_cols = [c for c in main_cols if c in col_names]
        
        if sample:
            print(f"\nKey columns: {', '.join(available_main_cols)}")
            for i, row in enumerate(sample, 1):
                print(f"\n  Record {i}:")
                for col in available_main_cols:
                    idx = col_names.index(col)
                    value = row[idx]
                    if value is not None:
                        print(f"    {col}: {value}")
        
        # Basic statistics
        print("\n" + "=" * 70)
        print("BASIC STATISTICS")
        print("=" * 70)
        
        stats_queries = [
            ("Movies per decade", """
                SELECT decade, COUNT(*) as total 
                FROM silver.movies 
                WHERE decade IS NOT NULL 
                GROUP BY decade 
                ORDER BY decade DESC 
                LIMIT 10;
            """),
            ("Top 10 movies by rating", """
                SELECT title, vote_average, vote_count 
                FROM silver.movies 
                WHERE vote_average IS NOT NULL 
                ORDER BY vote_average DESC, vote_count DESC 
                LIMIT 10;
            """),
            ("Movies per cinema era", """
                SELECT cinema_era, COUNT(*) as total 
                FROM silver.movies 
                WHERE cinema_era IS NOT NULL 
                GROUP BY cinema_era 
                ORDER BY total DESC;
            """)
        ]
        
        for stat_name, query in stats_queries:
            try:
                cur.execute(query)
                results = cur.fetchall()
                if results:
                    print(f"\n{stat_name}:")
                    for row in results:
                        print(f"  {row}")
            except Exception as e:
                print(f"\n{stat_name}: Error - {e}")
        
    except Exception as e:
        print(f"\nError fetching statistics: {e}")
    
    cur.close()
    print("\n" + "=" * 70)
    print("Verification complete!")
    print("=" * 70)
else:
    print("Database connection unavailable for verification.")

Database connection unavailable for verification.


## 5. Final summary

In [None]:
print("Raw to Silver ETL complete.")
print(f"Final records: {len(df_silver):,}")
print(f"File: {SILVER_FILE}")


Raw to Silver ETL complete.
Final records: 1,350,096
File: c:\Users\mathe\GitHub\Grupo-8-SBD2\Data Layer\silver\movies_silver.csv
