# Tech Challenge Phase 3 - IMDb Machine Learning

## Predicting Movie Ratings with IMDb Data

**Architecture:**
- **Data:** AWS S3 (Medallion layers: RAW → TRUSTED → REFINED)
- **Processing:** AWS Glue Jobs (Spark)
- **Catalog:** AWS Glue Catalog
- **Queries:** AWS Athena (SQL)
- **ML:** Jupyter + PyAthena + Scikit-learn

Based on `exercicio_cross_validation.ipynb`, adapted to work with real IMDb data.
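
Since this notebook builds on a cross-validation exercise, the sketch below illustrates what a shuffled k-fold split does. It is a standard-library illustration only, not the code from the original exercise; in the notebook itself, scikit-learn's `cross_val_score` handles the splitting. The helper name `k_fold_indices` and its parameters are hypothetical.

```python
import random


def k_fold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first (n_samples % n_splits) folds receive one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, test_idx
        start = stop


folds = list(k_fold_indices(n_samples=10, n_splits=3))
for train_idx, test_idx in folds:
    print(f"train={len(train_idx)} test={len(test_idx)}")
```

Every sample lands in exactly one test fold, so each row is used for evaluation once and for training in the remaining folds.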


In [None]:
# Install dependencies
%pip install PyAthena pandas numpy scikit-learn matplotlib seaborn boto3


In [None]:
import pandas as pd
import numpy as np
from pyathena import connect
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import warnings
warnings.filterwarnings('ignore')

# Plot configuration
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
%matplotlib inline


In [None]:
# Athena connection settings
ATHENA_CONFIG = {
    'region_name': 'us-east-1',
    's3_staging_dir': 's3://imdb-raw-data-718942601863/athena-results/',
    'work_group': 'imdb-workgroup-dev',
    'schema_name': 'imdb_database_dev'  # PyAthena names the default database parameter `schema_name`
}

# Create the Athena connection (no query runs until the first read)
conn = connect(**ATHENA_CONFIG)
print("Athena connection configured!")
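
A malformed staging path tends to surface only when the first query runs, so a quick local check of the settings can save a round trip. The sketch below is one possible way to do that. The helper `check_athena_config` is hypothetical, and treating `region_name` and `s3_staging_dir` as mandatory is an assumption that matches this notebook's setup rather than a PyAthena requirement.

```python
from urllib.parse import urlparse


def check_athena_config(config):
    """Return a list of problems found in an Athena settings dict.

    Hypothetical helper; which keys count as mandatory is an assumption.
    """
    problems = []
    for key in ("region_name", "s3_staging_dir"):
        if not config.get(key):
            problems.append(f"missing key: {key}")
    staging = config.get("s3_staging_dir")
    if staging:
        parsed = urlparse(staging)
        # Expect a URI of the form s3://bucket/prefix/
        if parsed.scheme != "s3" or not parsed.netloc:
            problems.append("s3_staging_dir must be an s3://bucket/prefix URI")
    return problems


# Placeholder values for illustration only.
example_config = {
    "region_name": "us-east-1",
    "s3_staging_dir": "s3://example-bucket/athena-results/",
}
print(check_athena_config(example_config))
```

An empty list means the settings passed these basic checks; it does not guarantee that the bucket exists or that the credentials can write to it.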


In [None]:
# Basic test: the 10 most-voted movies with an average rating of at least 8.5
query_test = """
SELECT 
    b.primarytitle,
    r.averagerating,
    r.numvotes,
    b.startyear
FROM imdb_database_dev.ratings r
JOIN imdb_database_dev.basics b ON r.tconst = b.tconst
WHERE 
    b.titletype = 'movie'
    AND r.numvotes >= 50000
    AND r.averagerating >= 8.5
ORDER BY r.numvotes DESC
LIMIT 10
"""

print("Running test query...")
top_movies = pd.read_sql(query_test, conn)
print("10 most-voted movies rated 8.5 or higher:")
display(top_movies)
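
To sanity-check the filter and ordering logic without touching AWS, the same query shape can be run against an in-memory SQLite database. This is only a sketch: the rows are made up for illustration (they are not IMDb data), and SQLite's SQL dialect differs from Athena's, so it validates the logic rather than the exact syntax.

```python
import sqlite3

# Made-up rows for illustration only; they are not real IMDb data.
conn_local = sqlite3.connect(":memory:")
conn_local.executescript("""
CREATE TABLE basics (tconst TEXT, titletype TEXT, primarytitle TEXT, startyear INTEGER);
CREATE TABLE ratings (tconst TEXT, averagerating REAL, numvotes INTEGER);
INSERT INTO basics VALUES
    ('tt1', 'movie', 'Film A', 2001),
    ('tt2', 'movie', 'Film B', 2010),
    ('tt3', 'tvSeries', 'Show C', 2015),
    ('tt4', 'movie', 'Film D', 1999);
INSERT INTO ratings VALUES
    ('tt1', 8.7, 60000),
    ('tt2', 9.0, 900000),
    ('tt3', 9.2, 700000),
    ('tt4', 8.9, 1000);
""")

# Same joins, filters, and ordering as the Athena test query.
query_local = """
SELECT b.primarytitle, r.averagerating, r.numvotes, b.startyear
FROM ratings r
JOIN basics b ON r.tconst = b.tconst
WHERE b.titletype = 'movie'
  AND r.numvotes >= 50000
  AND r.averagerating >= 8.5
ORDER BY r.numvotes DESC
LIMIT 10
"""
rows = conn_local.execute(query_local).fetchall()
for row in rows:
    print(row)
conn_local.close()
```

With these toy rows, the TV series and the low-vote movie are filtered out, and the remaining movies come back ordered by vote count rather than by rating.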
