
# 🚀 Startup Success Prediction: **Optimized Notebook**
**Author:** Mari (Inteli)  

**Created:** 2025-09-30 17:42  

**Objective:** Reach **accuracy ≥ 0.80** (ideally above 0.85) with `scikit-learn` models (Random Forest, Gradient Boosting, and HistGradientBoosting) using *pipelines* and *hyperparameter search* (Randomized, then Grid).

## Evaluation Criteria Checklist
- **Cleaning and Missing Values** ✔️
- **Categorical Encoding** ✔️
- **Exploration (EDA) & Visualization** ✔️ (light, variable-oriented)
- **Hypotheses** ✔️ (in Markdown)
- **Feature Selection** ✔️ (optional, via `SelectKBest` and feature importances)  
- **Modeling and Metrics** ✔️ (accuracy, precision, recall, f1, confusion matrix)
- **Hyperparameter Tuning** ✔️ (RandomizedSearchCV → GridSearchCV)
- **Submission Generation** ✔️ (in the format of `sample_submission.csv`)

> **Rules**: Only `numpy`, `pandas`, `scikit-learn` (models), and `matplotlib` for plots. No other external libraries.


In [None]:

# ========== Imports & Config ==========
import os, sys, json, warnings, math, gc, itertools
from pathlib import Path
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

# Visualization (matplotlib only, per the rules)
import matplotlib.pyplot as plt

# Sklearn - preprocessing and evaluation
from sklearn.model_selection import StratifiedKFold, train_test_split, cross_validate, RandomizedSearchCV, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

# Sklearn - feature selection (optional)
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Models
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

SEED = 42
np.random.seed(SEED)

DATA_DIR = Path('.')  # Kaggle: same directory as the notebook
print('Python:', sys.version)
print('Sklearn:', __import__('sklearn').__version__)
print('Files:', [p.name for p in DATA_DIR.iterdir() if p.is_file()][:15])


## 📥 Load data

In [None]:

# Try to locate the files under their standard names or close variants
candidates_train = ['train.csv', 'Train.csv', 'train_dataset.csv']
candidates_test  = ['test.csv', 'Test.csv', 'test_dataset.csv']
candidates_sub   = ['sample_submission.csv', 'sample_sub.csv', 'submission_sample.csv']

def find_first(paths):
    for name in paths:
        p = DATA_DIR / name
        if p.exists():
            return p
    return None

train_path = find_first(candidates_train)
test_path  = find_first(candidates_test)
sub_path   = find_first(candidates_sub)

if not train_path or not test_path:
    raise FileNotFoundError('Could not find train.csv and/or test.csv in the directory. '
                            'Place both in the same directory as the notebook.')

print('train_path:', train_path)
print('test_path :', test_path)
print('sub_path  :', sub_path)

df = pd.read_csv(train_path)
df_test = pd.read_csv(test_path)
print(df.shape, df_test.shape)
df.head(3)


## 🎯 Identify the target variable

In [None]:

# Automatic attempt to locate the binary target column.
# Common names, checked in order: 'target', 'success', 'label', 'is_success', 'Outcome', 'Status', 'Y', 'y'
common_targets = ['target', 'success', 'label', 'is_success', 'Outcome', 'Status', 'Y', 'y']

target = None
for c in common_targets:
    if c in df.columns:
        target = c
        break

# Fallback: if the last column is binary, use it
if target is None:
    last_col = df.columns[-1]
    if df[last_col].dropna().nunique() <= 2:
        target = last_col

if target is None:
    # Look for any plausible binary column
    for c in df.columns:
        if df[c].dropna().nunique() == 2 and df[c].dtype != 'float64':
            target = c
            break

if target is None:
    raise ValueError('Could not identify the target column automatically. '
                     'Set it manually: target = "column_name"')

print('Target:', target)
print(df[target].value_counts(normalize=True).round(3))
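
The class proportions printed above also give the accuracy of a model that always predicts the majority class, which is a useful floor when judging the 0.80 goal. The following is a minimal sketch, assuming a binary label array with no missing values; the helper name `majority_baseline_accuracy` and the toy labels are illustrative and not part of the notebook.

```python
import numpy as np

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    return counts.max() / counts.sum()

# Toy labels (hypothetical): 6 positives and 4 negatives
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

In the notebook, calling the helper on `df[target]` would likely serve the same purpose: a tuned model is only informative to the extent that it beats this number.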


## 🔎 Quick EDA (only what is needed for decisions)

In [None]:

print('\nInfo:')
print(df.info())

print('\nMissing values (fraction per column):')
null_pct = df.isna().mean().sort_values(ascending=False)
print(null_pct.head(15))

# Target distribution
fig = plt.figure(figsize=(5,3))
df[target].value_counts().sort_index().plot(kind='bar')
plt.title('Target Distribution')
plt.xlabel('Class')
plt.ylabel('Count')
plt.show()

# Numeric correlations (if any)
num_cols = df.select_dtypes(include=[np.number]).columns.drop([target], errors='ignore')
if len(num_cols) > 0:
    corr = df[num_cols].corr()
    # only a summary is displayed
    print('Number of numeric columns:', len(num_cols))
else:
    print('No numeric columns found (other than the target).')

# List categorical columns
cat_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
print('Categorical:', len(cat_cols))
print(cat_cols[:20])



## 🧪 Hypotheses
1. **Funding history**: a larger volume/frequency of fundraising tends to increase the probability of success.  
2. **Time in operation** (startup age): companies with 2 to 7 years of traction may perform better than very young or very old ones (a curvilinear effect).  
3. **Location/Sector**: hubs (e.g., large capital cities) and sectors with high investment (e.g., fintech, healthcare) raise the success rate.  
> These hypotheses direct attention to features related to **funding**, **time**, and **context** (sector/location).
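
One way to probe the curvilinear effect in hypothesis 2 is to bin the age feature and compare success rates per bin. The sketch below assumes a numeric age column and a 0/1 target; the column names `age_years` and `success`, the helper `success_rate_by_bin`, and the toy values are hypothetical and should be replaced with the dataset's real columns.

```python
import pandas as pd

def success_rate_by_bin(frame, feature, target, bins):
    """Mean of a binary target and row count within each bin of a numeric feature."""
    binned = pd.cut(frame[feature], bins=bins)
    return frame.groupby(binned, observed=False)[target].agg(rate="mean", n="size")

# Toy data (hypothetical): ages in years and a 0/1 success flag
toy = pd.DataFrame({
    "age_years": [0.5, 1.0, 1.5, 3.0, 4.0, 5.0, 6.0, 9.0, 12.0, 15.0],
    "success":   [0,   0,   1,   1,   1,   1,   0,   0,   1,    0],
})

# Bins chosen to isolate the 2 to 7 year window named in hypothesis 2
summary = success_rate_by_bin(toy, "age_years", "success", bins=[0, 2, 7, 20])
print(summary)
```

A middle bin with a clearly higher rate than its neighbors would be consistent with the hypothesis; the `n` column helps judge whether each bin has enough rows to trust the rate.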


## 🧹 Preprocessing and Split

In [None]:

# Split X/y
X = df.drop(columns=[target])
y = df[target].astype(int) if df[target].dropna().nunique() <= 2 else df[target]

# Detect columns by type
num_features = X.select_dtypes(include=[np.number]).columns.tolist()
cat_features = X.select_dtypes(include=['object', 'category']).columns.tolist()

# Pipelines per type
numeric_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    # Tree-based models do not need scaling; enable it for scale-sensitive models such as LogisticRegression:
    # ('scaler', StandardScaler())
])

categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocess = ColumnTransformer(
    transformers=[
        ('num', numeric_pipeline, num_features),
        ('cat', categorical_pipeline, cat_features)
    ],
    remainder='drop',
    verbose_feature_names_out=False
)

# Hold-out for the final report + internal CV for tuning
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=SEED, stratify=y
)

print('Shapes:', X_train.shape, X_valid.shape)
print('Positive rate (train):', y_train.mean().round(3))


## 🧩 (Optional) Feature Selection: `SelectKBest`

In [None]:

USE_FEATURE_SELECTION = False  # set to True to enable

# Note: k='all' keeps every feature; use a smaller integer k to actually filter
kbest_step = ('kbest', SelectKBest(score_func=mutual_info_classif, k='all')) if USE_FEATURE_SELECTION else None
kbest_step
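
With `k='all'`, as in the cell above, `SelectKBest` keeps every feature, so selection only takes effect once `k` is set to an integer smaller than the number of columns. The sketch below shows this on synthetic data (one informative column plus four noise columns, all invented for illustration); `functools.partial` is used only to fix the `random_state` of `mutual_info_classif` so the scores are repeatable.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(42)
n_rows = 300

# Synthetic data: column 0 tracks the label closely, columns 1 to 4 are pure noise
y_demo = rng.integers(0, 2, size=n_rows)
informative = y_demo + 0.1 * rng.normal(size=n_rows)
noise = rng.normal(size=(n_rows, 4))
X_demo = np.column_stack([informative, noise])

# Keep the 2 columns with the highest estimated mutual information
selector = SelectKBest(score_func=partial(mutual_info_classif, random_state=42), k=2)
X_sel = selector.fit_transform(X_demo, y_demo)

print("Shape after selection:", X_sel.shape)
print("Kept columns (mask):", selector.get_support())
```

In a pipeline, `k` could also be searched like any other hyperparameter (for example under a key such as `kbest__k`, assuming the step is named `kbest` as in this notebook).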


## 🤖 Models & Hyperparameter Search Spaces

In [None]:

# Base pipelines
def make_pipeline(model):
    steps = [('prep', preprocess)]
    if kbest_step:
        steps.append(kbest_step)
    steps.append(('model', model))
    return Pipeline(steps)

# Models
rf_model  = RandomForestClassifier(random_state=SEED, n_jobs=-1)
gb_model  = GradientBoostingClassifier(random_state=SEED)
hgb_model = HistGradientBoostingClassifier(random_state=SEED)

# Coarse spaces for RandomizedSearchCV
rf_dist = {
    'model__n_estimators': [200, 400, 600, 800, 1000],
    'model__max_depth': [None, 6, 8, 10, 12, 16],
    'model__min_samples_split': [2, 5, 10, 20],
    'model__min_samples_leaf': [1, 2, 4, 8],
    'model__max_features': ['sqrt', 'log2', None]
}

gb_dist = {
    'model__n_estimators': [100, 200, 400, 600],
    'model__learning_rate': [0.01, 0.05, 0.1, 0.2],
    'model__max_depth': [2, 3, 4, 5],
    'model__subsample': [0.6, 0.8, 1.0]
}

hgb_dist = {
    'model__max_depth': [None, 6, 8, 10],
    'model__learning_rate': [0.01, 0.05, 0.1],
    'model__max_leaf_nodes': [15, 31, 63, None],
    'model__min_samples_leaf': [10, 20, 30, 50],
    'model__l2_regularization': [0.0, 0.1, 0.5, 1.0]
}

search_spaces = {
    'RandomForest': (make_pipeline(rf_model), rf_dist),
    'GradientBoosting': (make_pipeline(gb_model), gb_dist),
    'HistGradientBoosting': (make_pipeline(hgb_model), hgb_dist)
}

scoring = {'acc': 'accuracy', 'prec': 'precision', 'rec': 'recall', 'f1': 'f1'}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)


## 🔍 RandomizedSearchCV (broad search)

In [None]:

coarse_results = {}

for name, (pipe, dist) in search_spaces.items():
    print(f'\n=== {name} - RandomizedSearchCV (coarse) ===')
    rand = RandomizedSearchCV(
        estimator=pipe,
        param_distributions=dist,
        n_iter=20,  # increase if time allows
        scoring='accuracy',
        n_jobs=-1,
        cv=cv,
        random_state=SEED,
        verbose=1
    )
    rand.fit(X_train, y_train)
    best_est = rand.best_estimator_
    best_acc = rand.best_score_
    print('Best CV acc:', round(best_acc, 4))
    print('Best params:', rand.best_params_)
    coarse_results[name] = {'best_estimator': best_est, 'best_score': best_acc, 'best_params': rand.best_params_}

coarse_results
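
To judge whether `n_iter=20` is enough, it may help to compare it with the number of combinations in each search space. The sketch below copies the candidate lists from the coarse spaces defined earlier and counts the combinations; the helper name `space_size` is illustrative.

```python
from math import prod

def space_size(param_lists):
    """Number of distinct combinations in a dict of discrete candidate lists."""
    return prod(len(values) for values in param_lists.values())

# Candidate lists copied from the notebook's coarse search spaces
spaces = {
    "RandomForest": {
        "model__n_estimators": [200, 400, 600, 800, 1000],
        "model__max_depth": [None, 6, 8, 10, 12, 16],
        "model__min_samples_split": [2, 5, 10, 20],
        "model__min_samples_leaf": [1, 2, 4, 8],
        "model__max_features": ["sqrt", "log2", None],
    },
    "GradientBoosting": {
        "model__n_estimators": [100, 200, 400, 600],
        "model__learning_rate": [0.01, 0.05, 0.1, 0.2],
        "model__max_depth": [2, 3, 4, 5],
        "model__subsample": [0.6, 0.8, 1.0],
    },
    "HistGradientBoosting": {
        "model__max_depth": [None, 6, 8, 10],
        "model__learning_rate": [0.01, 0.05, 0.1],
        "model__max_leaf_nodes": [15, 31, 63, None],
        "model__min_samples_leaf": [10, 20, 30, 50],
        "model__l2_regularization": [0.0, 0.1, 0.5, 1.0],
    },
}

n_iter = 20
sizes = {name: space_size(dist) for name, dist in spaces.items()}
for name, total in sizes.items():
    print(f"{name}: {total} combinations, n_iter={n_iter} covers {n_iter / total:.1%}")
```

For Random Forest, 20 draws sample well under 2% of the space, which is one reason a larger `n_iter` can pay off when time allows.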


## 🎯 GridSearchCV (local refinement around the best)

In [None]:

refined_results = {}

for name, res in coarse_results.items():
    best_params = res['best_params']
    # Build a small grid around each best value
    grid = {}
    for k, v in best_params.items():
        if isinstance(v, bool) or not isinstance(v, (int, float)) or v in (0, 1):
            # categorical, None, or boundary values (0 and 1): keep the value fixed
            grid[k] = [v]
        elif isinstance(v, int):
            # integers: 0.5x, 1x, 1.5x, rounded; min_samples_split must stay >= 2
            low = 2 if k.endswith('min_samples_split') else 1
            grid[k] = sorted({max(low, int(round(v * f))) for f in (0.5, 1.0, 1.5)})
        else:
            # floats: same factors; subsample must stay within (0, 1]
            high = 1.0 if k.endswith('subsample') else float('inf')
            grid[k] = sorted({min(high, v * f) for f in (0.5, 1.0, 1.5)})

    print(f'\n=== {name} - GridSearchCV (refine) ===')
    base_pipe = search_spaces[name][0].set_params(**best_params)
    gs = GridSearchCV(
        estimator=base_pipe,
        param_grid=grid,
        scoring='accuracy',
        n_jobs=-1,
        cv=cv,
        verbose=1
    )
    gs.fit(X_train, y_train)
    refined_results[name] = {
        'best_estimator': gs.best_estimator_,
        'best_score': gs.best_score_,
        'best_params': gs.best_params_
    }
    print('Refined CV acc:', round(gs.best_score_, 4))
    print('Params:', gs.best_params_)

refined_results


## 🏁 Comparison and Hold-out Validation

In [None]:

# Select the best model by CV score
best_name = max(refined_results, key=lambda k: refined_results[k]['best_score'])
best_est = refined_results[best_name]['best_estimator']
best_cv  = refined_results[best_name]['best_score']

print('Best model (CV):', best_name, '-', round(best_cv, 4))

# Evaluate on the hold-out
best_est.fit(X_train, y_train)
pred = best_est.predict(X_valid)
acc = accuracy_score(y_valid, pred)
prec = precision_score(y_valid, pred, zero_division=0)
rec = recall_score(y_valid, pred, zero_division=0)
f1 = f1_score(y_valid, pred, zero_division=0)

print('\nHold-out:')
print('accuracy:', round(acc, 4))
print('precision:', round(prec, 4))
print('recall   :', round(rec, 4))
print('f1       :', round(f1, 4))

cm = confusion_matrix(y_valid, pred)
fig = plt.figure(figsize=(4,3))
plt.imshow(cm, interpolation='nearest')
plt.title('Confusion Matrix (hold-out)')
plt.colorbar()
tick_marks = np.arange(len(np.unique(y)))
plt.xticks(tick_marks, tick_marks)
plt.yticks(tick_marks, tick_marks)
plt.xlabel('Predicted')
plt.ylabel('True')
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        plt.text(j, i, cm[i, j], ha='center', va='center')
plt.tight_layout()
plt.show()

print('\nClassification Report:')
print(classification_report(y_valid, pred, zero_division=0))

# Keep the best fully trained pipeline
final_model = best_est
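
The hold-out metrics printed above all derive from the four cells of the binary confusion matrix, which `confusion_matrix` lays out as `[[TN, FP], [FN, TP]]`. As a worked example with invented counts (the helper name `binary_metrics` is illustrative):

```python
import numpy as np

def binary_metrics(cm):
    """Accuracy, precision, recall and F1 from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    tn, fp, fn, tp = (int(v) for v in np.asarray(cm).ravel())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Invented counts: 50 true negatives, 10 false positives, 5 false negatives, 35 true positives
cm_demo = [[50, 10], [5, 35]]
metrics_demo = binary_metrics(cm_demo)
for metric_name, value in metrics_demo.items():
    print(f"{metric_name}: {value:.4f}")
```

Here accuracy is 85/100 = 0.85, precision is 35/45, recall is 35/40, and F1 is 70/85, which shows how a model can clear the accuracy goal while precision lags behind.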


## 🔬 Feature Importances (when available)

In [None]:

def get_feature_names(preprocessor, num_cols, cat_cols):
    # After fitting, the one-hot column names can be recovered
    # Note: depending on the sklearn version, the method is get_feature_names_out
    out = []
    if 'num' in dict(preprocessor.named_transformers_):
        out += num_cols
    if 'cat' in dict(preprocessor.named_transformers_):
        ohe = preprocessor.named_transformers_['cat'].named_steps.get('ohe', None)
        if ohe is not None:
            try:
                ohe_names = ohe.get_feature_names_out(cat_cols).tolist()
            except Exception:
                ohe_names = [f'ohe_{i}' for i in range(len(cat_cols))]
            out += ohe_names
    return out

try:
    # Extract names after the fit
    feature_names = get_feature_names(final_model.named_steps['prep'], 
                                      num_features, cat_features)
    model_step = final_model.named_steps['model']
    importances = None

    if hasattr(model_step, 'feature_importances_'):
        importances = model_step.feature_importances_
    elif hasattr(model_step, 'coef_'):
        coef = model_step.coef_
        importances = np.abs(coef[0]) if coef.ndim > 1 else np.abs(coef)

    if importances is not None and len(importances) == len(feature_names):
        imp_df = pd.DataFrame({'feature': feature_names, 'importance': importances})
        imp_df = imp_df.sort_values('importance', ascending=False).head(25)
        display(imp_df)
        fig = plt.figure(figsize=(7,6))
        plt.barh(imp_df['feature'][::-1], imp_df['importance'][::-1])
        plt.title('Top 25 Feature Importances')
        plt.tight_layout()
        plt.show()
    else:
        print('Model does not expose importances directly, or the lengths diverged.')
except Exception as e:
    print('Failed to compute importances:', e)


## 📦 Train on the full set & predict `test.csv`

In [None]:

# Retrain on the full training set before predicting on the test set
final_model.fit(X, y)

# df_test goes in with the same columns (the preprocess step selects them)
test_pred = final_model.predict(df_test)

# Build the submission from sample_submission when available
if sub_path and Path(sub_path).exists():
    sub = pd.read_csv(sub_path)
    # Try to infer the ID column and the target column of the submission
    sub_cols = sub.columns.tolist()
    # With 2 columns, assume [id, target]
    if len(sub_cols) == 2:
        id_col, target_col = sub_cols
        # Take IDs from df_test when present so each prediction stays aligned with its row;
        # otherwise fall back to a positional index
        ids = df_test[id_col].values if id_col in df_test.columns else np.arange(len(df_test))
        submission = pd.DataFrame({id_col: ids, target_col: test_pred})
    else:
        # generic fallback
        submission = pd.DataFrame({
            'id': np.arange(len(df_test)),
            'target': test_pred
        })
else:
    # fallback without a sample file
    submission = pd.DataFrame({
        'id': np.arange(len(df_test)),
        'target': test_pred
    })

out_path = Path('submission_startup_success.csv')  # written to the working directory
submission.to_csv(out_path, index=False)
out_path



## ✅ Next Steps and Notes
- If accuracy does **not** reach 0.80:
  - Increase `n_iter` in `RandomizedSearchCV` (e.g., 50 to 100) and densify the search space.
  - Set `class_weight='balanced'` (for RF) if the target is imbalanced.
  - Check for *leaks* and for consistency between `train` and `test` (same distribution of categories).
  - Enable `USE_FEATURE_SELECTION = True` and test values of `k`.
- If time allows, run 10-fold CV.
- Record the `scikit-learn` version and the *seed* in the report for reproducibility.
- Briefly describe why the winning model makes sense (e.g., trees/boosting handle interactions and nonlinearities well).
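
The train/test consistency check mentioned in the list can start with something as simple as listing the categories that appear in `test` but never in `train`. A minimal sketch, with a hypothetical helper `unseen_categories` and toy frames standing in for the real data:

```python
import pandas as pd

def unseen_categories(train, test, columns):
    """Map each column to the sorted test values that never occur in train."""
    report = {}
    for col in columns:
        seen = set(train[col].dropna().unique())
        new_values = set(test[col].dropna().unique()) - seen
        if new_values:
            report[col] = sorted(new_values)
    return report

# Toy frames (hypothetical values)
toy_train = pd.DataFrame({"sector": ["fintech", "health", "fintech"],
                          "city": ["SP", "RJ", "SP"]})
toy_test = pd.DataFrame({"sector": ["fintech", "edtech"],
                         "city": ["SP", "RJ"]})

report = unseen_categories(toy_train, toy_test, ["sector", "city"])
print(report)
```

In the notebook this could be called as `unseen_categories(df, df_test, cat_features)`; columns that show up in the report are the ones where one-hot encoding will produce all-zero rows at prediction time.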

**Good luck on Kaggle!** 💪
