# 📊 Exploratory Data Analysis (EDA)
## Predicting Oscar Nominations

**Goal**: Understand the data in depth before training any ML model.

### What this notebook covers:
1. Load and explore the main dataset
2. Analyze class balance (nominated vs. not nominated)
3. Identify missing data
4. Univariate analysis of numeric features
5. Bivariate analysis (nominated vs. not nominated)
6. Temporal analysis
7. Analysis of categorical features

## 1. Setup and Data Loading

In [None]:
# --- IMPORTS CELL (run this cell first!) ---
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # every plotting cell below depends on 'plt'
import seaborn as sns

# Visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

# Pandas settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.2f}'.format)

In [1]:
# Import the data loader
import sys
sys.path.append('..')

from src.data_loader import load_ml_dataset

# Load the main dataset from the database (no longer from CSV)
df = load_ml_dataset()

print(f"Dataset loaded: {df.shape[0]:,} movies × {df.shape[1]} features")
print(f"Period covered: {df['release_year'].min()} - {df['release_year'].max()}")
print(f"Memory used: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

📊 Loading ML dataset from database...
✅ Loaded 3,181 movies from database
   Features: 29
   Period: 1999 - 2025
Dataset loaded: 3,181 movies × 29 features
Period covered: 1999 - 2025
Memory used: 1.03 MB


## 2. Dataset Overview

In [3]:
# First rows
df.head(10)

| | imdb_id | original_title | release_year | imdb_rating | imdb_votes | runtime_minutes | metascore | budget | worldwide_gross | domestic_gross | ... | p90_score | min_score | max_score | num_genres | num_countries | num_languages | num_directors | num_writers | num_cast | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0120188 | Three Kings | 1999 | 7.1 | 185623 | 114 | 82.0 | 75000000.0 | 107752036.0 | 60652036.0 | ... | 100.0 | 40.0 | 100.0 | 3 | 2 | 2 | 1 | 2 | 5 | 0 |
| 1 | tt0120363 | Toy Story 2 | 1999 | 7.9 | 654604 | 92 | 88.0 | 90000000.0 | 497375404.0 | 245852179.0 | ... | 100.0 | 50.0 | 100.0 | 3 | 1 | 1 | 3 | 7 | 5 | 0 |
| 2 | tt0120601 | Being John Malkovich | 1999 | 7.7 | 365304 | 113 | 90.0 | 13000000.0 | 23106795.0 | 22863596.0 | ... | 100.0 | 63.0 | 100.0 | 3 | 1 | 1 | 1 | 1 | 5 | 0 |
| 3 | tt0120616 | The Mummy | 1999 | 7.1 | 489390 | 124 | 48.0 | 80000000.0 | 417643286.0 | 157095368.0 | ... | 73.5 | 20.0 | 100.0 | 3 | 1 | 6 | 1 | 6 | 5 | 0 |
| 4 | tt0120655 | Dogma | 1999 | 7.3 | 237034 | 130 | 62.0 | 10000000.0 | 33625964.0 | 32846695.0 | ... | 84.0 | 30.0 | 91.0 | 3 | 1 | 2 | 1 | 1 | 5 | 0 |
| 5 | tt0120657 | The 13th Warrior | 1999 | 6.6 | 137278 | 102 | 42.0 | 160000000.0 | 61698899.0 | 32698899.0 | ... | 72.0 | 20.0 | 91.0 | 3 | 1 | 6 | 1 | 3 | 5 | 0 |
| 6 | tt0120663 | Eyes Wide Shut | 1999 | 7.5 | 408915 | 159 | 69.0 | 65000000.0 | 162496398.0 | 55691208.0 | ... | 100.0 | 20.0 | 100.0 | 3 | 2 | 1 | 1 | 3 | 5 | 0 |
| 7 | tt0120689 | The Green Mile | 1999 | 8.6 | 1515373 | 189 | 61.0 | 60000000.0 | 286801374.0 | 136801374.0 | ... | 84.0 | 30.0 | 100.0 | 3 | 1 | 2 | 1 | 2 | 5 | 1 |
| 8 | tt0120784 | Payback | 1999 | 7.1 | 151431 | 100 | 46.0 | 90000000.0 | 161626121.0 | 81526121.0 | ... | 75.0 | 20.0 | 100.0 | 3 | 1 | 1 | 1 | 3 | 5 | 0 |
| 9 | tt0120855 | Tarzan | 1999 | 7.3 | 263224 | 88 | 80.0 | 130000000.0 | 448192603.0 | 171091819.0 | ... | 100.0 | 50.0 | 100.0 | 3 | 2 | 1 | 2 | 25 | 5 | 0 |


In [4]:
# General information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3181 entries, 0 to 3180
Data columns (total 29 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   imdb_id                    3181 non-null   object 
 1   original_title             3181 non-null   object 
 2   release_year               3181 non-null   int64  
 3   imdb_rating                3181 non-null   float64
 4   imdb_votes                 3181 non-null   int64  
 5   runtime_minutes            3181 non-null   int64  
 6   metascore                  3180 non-null   float64
 7   budget                     2872 non-null   float64
 8   worldwide_gross            3036 non-null   float64
 9   domestic_gross             2938 non-null   float64
 10  roi_worldwide              2812 non-null   float64
 11  us_market_share            2938 non-null   float64
 12  box_office_rank_in_year    3181 non-null   int64  
 13  votes_normalized_by_year   3181 non-null   float

In [5]:
# Descriptive statistics for the numeric features
df.describe()

| | release_year | imdb_rating | imdb_votes | runtime_minutes | metascore | budget | worldwide_gross | domestic_gross | roi_worldwide | us_market_share | ... | p90_score | min_score | max_score | num_genres | num_countries | num_languages | num_directors | num_writers | num_cast | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3181.0 | 3181.0 | 3181.0 | 3181.0 | 3180.0 | 2872.0 | 3036.0 | 2938.0 | 2812.0 | 2938.0 | ... | 3143.0 | 3143.0 | 3143.0 | 3181.0 | 3181.0 | 3181.0 | 3181.0 | 3181.0 | 3181.0 | 3181.0 |
| mean | 2011.618988 | 6.680132 | 213443.7 | 112.594467 | 58.483333 | 77049340.0 | 167229400.0 | 71641500.0 | 10.41242 | 0.467253 | ... | 79.022463 | 26.929049 | 88.164493 | 2.690978 | 2.033637 | 1.933669 | 1.101855 | 2.651367 | 4.825841 | 0.058472 |
| std | 6.964063 | 0.852699 | 245562.1 | 19.592763 | 17.098553 | 401308600.0 | 246588600.0 | 94144300.0 | 255.378715 | 0.213917 | ... | 13.729365 | 17.090626 | 11.915039 | 0.573314 | 1.306659 | 1.311656 | 0.549156 | 2.001915 | 0.592276 | 0.234671 |
| min | 1999.0 | 1.7 | 50006.0 | 75.0 | 9.0 | 7000.0 | 693.0 | 1305.0 | 0.000511 | 0.000405 | ... | 27.5 | 0.0 | 30.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 25% | 2006.0 | 6.2 | 77455.0 | 99.0 | 46.0 | 17000000.0 | 29609470.0 | 14685410.0 | 1.375783 | 0.327122 | ... | 70.2 | 16.0 | 80.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 5.0 | 0.0 |
| 50% | 2012.0 | 6.7 | 127396.0 | 109.0 | 59.0 | 38000000.0 | 81610560.0 | 40910900.0 | 2.608135 | 0.465948 | ... | 80.0 | 25.0 | 90.0 | 3.0 | 2.0 | 1.0 | 1.0 | 2.0 | 5.0 | 0.0 |
| 75% | 2017.0 | 7.3 | 244129.0 | 123.0 | 71.0 | 80000000.0 | 194149400.0 | 90357300.0 | 4.577479 | 0.598714 | ... | 89.4 | 40.0 | 100.0 | 3.0 | 3.0 | 2.0 | 1.0 | 3.0 | 5.0 | 0.0 |
| max | 2025.0 | 9.1 | 3082270.0 | 321.0 | 100.0 | 12215500000.0 | 2923711000.0 | 936662200.0 | 12890.395533 | 1.0 | ... | 100.0 | 80.0 | 100.0 | 3.0 | 10.0 | 10.0 | 21.0 | 29.0 | 5.0 | 1.0 |


## 3. Class Balance Analysis

**Key question**: How many films were nominated for an Oscar, and how many were not?

In [6]:
# Count classes
class_counts = df['label'].value_counts()
class_percentages = 100 * class_counts / len(df)

print("\n" + "="*60)
print("CLASS BALANCE")
print("="*60)
print(f"Not nominated (0): {class_counts[0]:,} ({class_percentages[0]:.2f}%)")
print(f"Nominated (1):     {class_counts[1]:,} ({class_percentages[1]:.2f}%)")
print(f"\nImbalance ratio: 1:{class_counts[0]/class_counts[1]:.1f}")
print("="*60)


CLASS BALANCE
Not nominated (0): 2,995 (94.15%)
Nominated (1):     186 (5.85%)

Imbalance ratio: 1:16.1


In [7]:
# Visualize the class balance
# Requires the imports cell from section 1; running this cell first fails with
# "NameError: name 'plt' is not defined".
fig, ax = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart
colors = ['#3498db', '#e74c3c']
ax[0].bar(['Not nominated', 'Nominated'], class_counts.values, color=colors, edgecolor='black')
ax[0].set_ylabel('Number of Films', fontsize=12)
ax[0].set_title('Class Distribution (Count)', fontsize=14, fontweight='bold')
ax[0].grid(axis='y', alpha=0.3)
for i, v in enumerate(class_counts.values):
    ax[0].text(i, v + 50, f'{v:,}', ha='center', fontweight='bold', fontsize=11)

# Pie chart
ax[1].pie(class_percentages, labels=['Not nominated', 'Nominated'], autopct='%1.1f%%',
          colors=colors, startangle=90, textprops={'fontsize': 11, 'fontweight': 'bold'})
ax[1].set_title('Class Distribution (Percentage)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig('../reports/figures/class_balance.png', dpi=300, bbox_inches='tight')
plt.show()

print("✅ Chart saved to reports/figures/class_balance.png")

### 💡 Insight: Class Imbalance

- **IMPORTANT**: The dataset is heavily imbalanced: only 5.85% of the films are nominees, roughly 1 nominee for every 16 non-nominees.
- This is expected (few films are nominated for an Oscar).
- Imbalance-handling techniques will be needed later:
  - `class_weight` in the models
  - SMOTE (oversampling)
  - Random undersampling
- Metrics: use **Precision**, **Recall**, and **F1-score** rather than accuracy alone, because a model that always predicts "not nominated" would already reach 94.15% accuracy.
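
To make the two points above concrete, here is a minimal sketch that uses only the class counts printed earlier (2,995 and 186). It computes "balanced" class weights with the formula `n_samples / (n_classes * count)`, which is the heuristic scikit-learn documents for `class_weight="balanced"`, and then scores a trivial baseline that always predicts "not nominated". Treat it as an illustration of the arithmetic, not as the weighting the project will necessarily use.

```python
# Class counts taken from the class-balance output in this notebook.
counts = {0: 2995, 1: 186}
n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c).
# Rare classes get proportionally larger weights.
class_weight = {c: n_samples / (n_classes * n) for c, n in counts.items()}

# Trivial baseline: always predict class 0 ("not nominated").
tp, fp = 0, 0            # no film is ever flagged as a nominee
fn = counts[1]           # every real nominee is missed
tn = counts[0]           # every non-nominee is classified correctly
accuracy = (tp + tn) / n_samples
recall = tp / (tp + fn)

print({c: round(w, 3) for c, w in class_weight.items()})
print(f"baseline accuracy = {accuracy:.4f}, recall = {recall:.4f}")
```

The baseline has high accuracy but zero recall on the class of interest, which is why accuracy alone is a misleading metric here.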

## 4. Missing Values Analysis

In [None]:
# Compute missing values per column
missing = df.isnull().sum()
missing_pct = 100 * missing / len(df)

missing_summary = pd.DataFrame({
    'Column': missing.index,
    'Missing_Count': missing.values,
    'Missing_Percentage': missing_pct.values
})

# Keep only columns that have gaps, worst first
missing_summary = missing_summary[missing_summary['Missing_Count'] > 0]
missing_summary = missing_summary.sort_values('Missing_Percentage', ascending=False)

if len(missing_summary) > 0:
    print("\n⚠️  Columns with missing values:")
    print(missing_summary.to_string(index=False))
else:
    print("\n✅ No missing values found!")

In [None]:
# Visualize missing values (if any)
if len(missing_summary) > 0:
    plt.figure(figsize=(10, 6))
    plt.barh(missing_summary['Column'], missing_summary['Missing_Percentage'], color='#e74c3c')
    plt.xlabel('Share of Missing Values (%)', fontsize=12)
    plt.title('Missing Data Analysis', fontsize=14, fontweight='bold')
    plt.grid(axis='x', alpha=0.3)
    plt.tight_layout()
    plt.savefig('../reports/figures/missing_values.png', dpi=300, bbox_inches='tight')
    plt.show()
    print("✅ Chart saved to reports/figures/missing_values.png")

## 5. Temporal Distribution of Films

In [None]:
# Films per year
movies_per_year = df.groupby('release_year').size()
nominated_per_year = (
    df[df['label'] == 1].groupby('release_year').size()
    .reindex(movies_per_year.index, fill_value=0)  # keep years without nominees as 0
)

plt.figure(figsize=(14, 6))
plt.plot(movies_per_year.index, movies_per_year.values, marker='o', linewidth=2, label='All Films')
plt.plot(nominated_per_year.index, nominated_per_year.values, marker='s', linewidth=2,
         color='#e74c3c', label='Oscar Nominees')
plt.xlabel('Release Year', fontsize=12)
plt.ylabel('Number of Films', fontsize=12)
plt.title('Temporal Distribution of Films in the Dataset', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('../reports/figures/temporal_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

print("✅ Chart saved to reports/figures/temporal_distribution.png")
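
Raw counts per year can hide whether the share of nominees drifts over time, which matters if a temporal train/test split is used later. The sketch below computes the nomination rate per release year in a single `groupby`. The `toy` frame is a made-up stand-in for `df` and only assumes the `release_year` and `label` columns seen above.

```python
import pandas as pd

# Made-up stand-in for df; only 'release_year' and 'label' are needed.
toy = pd.DataFrame({
    "release_year": [1999, 1999, 1999, 2000, 2000, 2001],
    "label":        [0,    1,    0,    0,    0,    1],
})

# Because label is 0/1, its sum is the nominee count and its mean is the nomination rate.
per_year = toy.groupby("release_year")["label"].agg(
    total="count",
    nominated="sum",
    nomination_rate="mean",
)
print(per_year)
```

Grouping the full frame, rather than filtering to `label == 1` first, keeps years that have no nominees in the result with a count of 0.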

## 6. Univariate Analysis - Numeric Features

This section looks at the distribution of each key numeric feature.

In [None]:
# Select the key numeric features
numeric_features = [
    'imdb_rating', 'imdb_votes', 'runtime_minutes', 'metascore',
    'budget', 'worldwide_gross', 'domestic_gross', 'roi_worldwide',
    'mean_score', 'median_score', 'n_samples'
]

# Create a grid of subplots for the histograms
fig, axes = plt.subplots(4, 3, figsize=(16, 14))
axes = axes.flatten()

for idx, feature in enumerate(numeric_features):
    if feature in df.columns:
        axes[idx].hist(df[feature].dropna(), bins=30, edgecolor='black', alpha=0.7, color='#3498db')
        axes[idx].set_title(f'{feature}', fontsize=11, fontweight='bold')
        axes[idx].set_xlabel('Value')
        axes[idx].set_ylabel('Frequency')
        axes[idx].grid(alpha=0.3)

# Remove unused axes
for idx in range(len(numeric_features), len(axes)):
    fig.delaxes(axes[idx])

plt.suptitle('Distribution of Numeric Features', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.savefig('../reports/figures/numeric_distributions.png', dpi=300, bbox_inches='tight')
plt.show()

print("✅ Chart saved to reports/figures/numeric_distributions.png")

## 7. Bivariate Analysis: Nominated vs. Not Nominated

**Question**: Do the feature distributions differ between nominated and not-nominated films?

In [None]:
# Select the main features to compare
key_features = ['imdb_rating', 'metascore', 'budget', 'worldwide_gross', 'runtime_minutes']

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.flatten()

for idx, feature in enumerate(key_features):
    if feature in df.columns:
        # Split the data by class
        not_nominated = df[df['label'] == 0][feature].dropna()
        nominated = df[df['label'] == 1][feature].dropna()

        # Violin plot, one violin per class
        data_to_plot = [not_nominated, nominated]
        axes[idx].violinplot(data_to_plot, positions=[0, 1], showmeans=True, showmedians=True)

        axes[idx].set_xticks([0, 1])
        axes[idx].set_xticklabels(['Not nominated', 'Nominated'])
        axes[idx].set_title(f'{feature}', fontsize=11, fontweight='bold')
        axes[idx].grid(alpha=0.3)

# Remove the unused sixth axis
fig.delaxes(axes[5])

plt.suptitle('Comparison: Nominated vs. Not Nominated', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.savefig('../reports/figures/nominated_vs_not_nominated.png', dpi=300, bbox_inches='tight')
plt.show()

print("✅ Chart saved to reports/figures/nominated_vs_not_nominated.png")
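
A numeric summary can complement the violin plots when the question is whether the two classes differ. The sketch below compares per-class medians, which are less sensitive than means to the extreme values seen in `budget` and `worldwide_gross`. The `toy` values are invented for illustration; only the column names come from this notebook.

```python
import pandas as pd

# Invented values; column names match the notebook's df.
toy = pd.DataFrame({
    "label":       [0, 0, 0, 0, 1, 1],
    "imdb_rating": [6.1, 6.5, 7.0, 5.8, 7.9, 8.3],
    "metascore":   [45.0, 58.0, None, 50.0, 81.0, 88.0],
})

# Median of each feature within each class (missing values are skipped).
summary = toy.groupby("label")[["imdb_rating", "metascore"]].median()

# Difference between nominated (1) and not nominated (0).
median_gap = summary.loc[1] - summary.loc[0]

print(summary)
print(median_gap)
```

A large gap relative to a feature's spread suggests the feature separates the classes; a formal test would be a separate step.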

## 8. Correlation Matrix

In [None]:
# Select numeric columns only
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Drop ID columns, if any. Match on the '_id' suffix: a plain substring test for 'id'
# would also drop 'worldwide_gross' and 'roi_worldwide'.
numeric_cols = [col for col in numeric_cols if not col.lower().endswith('_id')]

# Compute the correlation matrix
correlation_matrix = df[numeric_cols].corr()

# Plot the heatmap
plt.figure(figsize=(14, 12))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0,
            square=True, linewidths=0.5, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix - Numeric Features', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.savefig('../reports/figures/correlation_matrix.png', dpi=300, bbox_inches='tight')
plt.show()

print("✅ Chart saved to reports/figures/correlation_matrix.png")
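
A full heatmap over many columns is hard to scan for the one thing that matters most here: how each feature relates to `label`. A minimal sketch of ranking features by the absolute value of their correlation with the target follows. The `toy` numbers are invented, and Pearson correlation against a 0/1 label is only a rough first-pass signal.

```python
import pandas as pd

# Invented values; column names match the notebook's df.
toy = pd.DataFrame({
    "metascore":       [40, 55, 60, 72, 85, 90],
    "runtime_minutes": [95, 130, 100, 110, 125, 105],
    "label":           [0, 0, 0, 0, 1, 1],
})

# Correlation of every feature with the target, excluding the target itself.
corr_with_label = toy.corr()["label"].drop("label")

# Order by strength of association, keeping the original sign.
order = corr_with_label.abs().sort_values(ascending=False).index
ranked = corr_with_label.reindex(order)

print(ranked)
```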

## 9. Categorical Feature Analysis: Genres

In [None]:
# Load genre data from the database
from src.data_loader import load_genres_data

genres_df, movie_genres = load_genres_data()

# Count how often each genre appears
genre_counts = movie_genres['genre_name'].value_counts().head(15)

plt.figure(figsize=(12, 6))
plt.barh(genre_counts.index[::-1], genre_counts.values[::-1], color='#3498db', edgecolor='black')
plt.xlabel('Number of Films', fontsize=12)
plt.title('Top 15 Most Frequent Genres', fontsize=14, fontweight='bold')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.savefig('../reports/figures/top_genres.png', dpi=300, bbox_inches='tight')
plt.show()

print("✅ Chart saved to reports/figures/top_genres.png")

## 10. Summary and Next Steps

### 📋 What we found:
1. **Class balance**: imbalanced dataset (expected)
   - **Not-nominated films**: 2,995 (94.2%)
   - **Nominated films**: 186 (5.8%)
   - **Imbalance ratio**: 1:16 (highly imbalanced)
   - **Action needed**: apply imbalance-handling techniques (SMOTE, `class_weight`, or undersampling) in Phase 3.6

2. **Missing values**:
   - **`metascore`**: 1 missing (0.03%), practically complete ✅
   - **`budget`**: 309 missing (9.7%), needs imputation ⚠️
   - **`worldwide_gross`**: 145 missing (4.6%), acceptable
   - **`domestic_gross`**: 243 missing (7.6%), acceptable
   - **`us_market_share`**: 243 missing (7.6%), the same count as `domestic_gross`
   - **`roi_worldwide`**: 369 missing (11.6%); it is derived from budget and gross, so it inherits their gaps
   - **Rating-sample features**: 38 missing (1.2%), excellent
   - **Remaining features**: complete (0% missing) ✅

3. **Promising features**:
   - **`imdb_rating`**: mean 6.68 (range 1.7-9.1), good variability
   - **`imdb_votes`**: mean 213k, a proxy for popularity
   - **`metascore`**: mean 58.5 (range 9-100); critical reception is likely to matter
   - **`budget`**: mean $77M, with a maximum of $12.2B against a 75th percentile of $80M; that value is implausible and should be checked, and the feature needs normalization
   - **`worldwide_gross`**: mean $167M; does commercial success matter?
   - **`roi_worldwide`**: mean 10.4x but median 2.6x (max 12,890x), so the mean is driven by extreme outliers
   - **Metacritic features** (`mean_score`, `median_score`): well-behaved distributions

   - **`num_genres`**: mean 2.69 (most films have 2-3 genres)
   - **`num_countries`**: mean 2.03 (many are co-productions)
   - **`num_languages`**: mean 1.93
   - **`num_directors`**: mean 1.10 (most films have 1 director)
   - **`num_writers`**: mean 2.65
   - **`num_cast`**: mean 4.83 (capped at 5 lead actors; at least 75% of films list all 5)

4. **Other important observations**:
   - **Extreme outliers** in `budget`, `worldwide_gross`, `roi_worldwide` → apply a log transform
   - **`metascore` almost complete** (99.97%) → a very valuable feature
   - **Metacritic ratings** (`mean_score`, `median_score`) → high data quality
   - **People features** (`num_directors`, `num_writers`) → may be useful as a proxy for prestige
   - **Period covered**: 1999-2025 (a 26-year span) → good for a temporal split
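
The log transform suggested for the skewed monetary features can be sketched as follows. The five budget values are the minimum, quartiles, and maximum reported by `df.describe()` above. Choosing `np.log1p` (which computes log(1 + x)) is an assumption made here for illustration; the notebook does not say which variant will be used.

```python
import numpy as np

# min, 25%, 50%, 75% and max of 'budget' as reported by df.describe().
budget = np.array([7_000, 17_000_000, 38_000_000, 80_000_000, 12_215_500_000], dtype=float)

# log1p compresses the right tail while preserving the ordering of values.
log_budget = np.log1p(budget)

# How far the maximum sits from the median, before and after the transform.
raw_ratio = budget.max() / np.median(budget)
log_ratio = log_budget.max() / np.median(log_budget)

print(f"max/median on the raw scale: {raw_ratio:.1f}")
print(f"max/median on the log scale: {log_ratio:.2f}")

# The transform is invertible with expm1, so original units can be recovered.
assert np.allclose(np.expm1(log_budget), budget)
```

On the raw scale the maximum is hundreds of times the median; on the log scale it is of the same order, which keeps a single extreme value from dominating scale-sensitive models.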

### 🎯 Next Steps:
1. Feature engineering (create new features)
2. Handling missing data
3. Data preparation for ML
4. Model training
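
As a preview of steps 2 and 3, here is a hedged sketch of how a temporal split and median imputation could fit together. The cutoff year, the `budget_missing` indicator column, and the toy values are all hypothetical choices made for this example; the project may well settle on a different strategy.

```python
import pandas as pd

# Invented values; column names match the notebook's df.
toy = pd.DataFrame({
    "release_year": [2018, 2019, 2020, 2021, 2022, 2023],
    "budget":       [10e6, None, 30e6, 50e6, None, 200e6],
    "label":        [0, 0, 1, 0, 0, 1],
})

# Hypothetical cutoff: train on films released up to 2021, evaluate on later ones.
cutoff_year = 2021
train_df = toy[toy["release_year"] <= cutoff_year].copy()
test_df = toy[toy["release_year"] > cutoff_year].copy()

# Learn the fill value from the training rows only, so nothing leaks from the test period.
budget_median = train_df["budget"].median()

for part in (train_df, test_df):
    # Record which rows were imputed before overwriting the gaps.
    part["budget_missing"] = part["budget"].isna().astype(int)
    part["budget"] = part["budget"].fillna(budget_median)

print(train_df)
print(test_df)
```

Splitting by year mimics predicting future ceremonies from past ones, and the indicator column lets a model learn whether missingness itself is informative.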