# Consolidated notebook of the scripts

This notebook gathers the scripts found in `src/` (collection, verification, splitting, embedding generation, and models: KNN, KNN+PCA+SMOTE, Logistic Regression, and SVM).

Organization:
- Run the installation cell before calling any function.
- The code cells contain the scripts converted into reusable functions; run only the parts you need.
- The usage section gives quick instructions.

Note: paths are relative to the repository root (for example `data/raw/feiticos.csv`, `data/feiticos_embeddings.npz`).


In [None]:
# Install the dependencies
# Run this cell only once in a fresh environment (e.g. Colab or a venv)
!pip install -r requirements.txt


## How to use

1. (Optional) Run `scraper_collect()` to download the spells. Note: it is a web scraper and can take a long time.
2. Run `split_dataset()` to create the files in `data/separated/`.
3. Run `generate_and_save_embeddings()` to generate `data/feiticos_embeddings.npz`.
4. Pick a training routine: `knn_train()`, `knn_pca_smote_train()`, `logistic_train()`, or `svm_train()`.

Tips:
- On Colab, run the installation cell first.
- To test the models quickly, load the `data/feiticos_embeddings.npz` file already present in the repository.
- Adjust parameters such as the values of K, C, or gamma in the training calls.
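
The order of these steps can be checked against the files each one produces. The sketch below is a hypothetical helper (`next_pipeline_step` is not part of the notebook) that reports which step to run next, assuming each step writes to the default paths named in this notebook.

```python
import os

# Each pipeline step paired with a file it is expected to produce.
# The paths come from this notebook; the helper itself is hypothetical.
PIPELINE = [
    ("scraper_collect", "data/raw/feiticos.csv"),
    ("split_dataset", "data/separated/feiticos_train.csv"),
    ("generate_and_save_embeddings", "data/feiticos_embeddings.npz"),
]


def next_pipeline_step(exists=os.path.exists):
    """Return the next step to run, judged by the most downstream file present.

    `exists` is injectable so the logic can be tried without touching the disk.
    """
    # Walk backwards: the latest artifact on disk decides what is left to do.
    for index in range(len(PIPELINE) - 1, -1, -1):
        if exists(PIPELINE[index][1]):
            if index + 1 < len(PIPELINE):
                return PIPELINE[index + 1][0]
            return "train"  # embeddings exist, so any *_train() function can run
    return PIPELINE[0][0]


print(next_pipeline_step())
```

Walking backwards means a repository that already ships the embeddings file goes straight to training, even if the raw CSV was never downloaded.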


### 1) Scraper (separate cell)

The function below collects pages from `dndbeyond` and saves them to `data/raw/feiticos.csv`. Run it only if you know what you are doing: it issues one request per listing page plus one request per spell.

In [None]:
import requests
from bs4 import BeautifulSoup
import time
import os
import csv

def get_spell_info(spell_slug):
    """Fetch a spell's description by its slug.

    Returns (description, bonus_text); both are empty strings on failure.
    """
    url = f"https://www.dndbeyond.com/spells/{spell_slug}"
    try:
        # A timeout keeps one stalled request from hanging the whole run.
        page = requests.get(url, timeout=30)
        if page.status_code == 200:
            soup = BeautifulSoup(page.content, 'html.parser')
            description = ""
            bonus_text = ""
            # Candidate page layouts; the first selector that matches wins.
            desc_selectors = [
                'div.more-info-content p',
                'div.spell-description p',
                'div.description p',
                '.ddb-statblock-item-description p',
                '.spell-content p'
            ]
            for selector in desc_selectors:
                paragraphs = soup.select(selector)
                if paragraphs:
                    description = paragraphs[0].get_text(strip=True)
                    if len(paragraphs) > 1:
                        bonus_text = paragraphs[1].get_text(strip=True)
                    break
            return description, bonus_text
    except Exception:
        # Any failure is non-fatal: fall through to the empty result.
        pass
    return "", ""


def scraper_collect(start_page=1, end_page=46, output_path='data/raw/feiticos.csv', pause=0.5, resume=True):
    """Collect spell name, school, and description from the listing pages into a CSV."""
    existing_spells = set()
    file_exists = os.path.exists(output_path)

    if file_exists and resume:
        print("Existing file found. Loading already collected data...")
        with open(output_path, 'r', encoding='utf-8', newline='') as existing_file:
            csv_reader = csv.reader(existing_file)
            next(csv_reader, None)  # skip the header row
            for row in csv_reader:
                if row:
                    existing_spells.add(row[0])
        print(f"Found {len(existing_spells)} spells already collected.")

    # Append only when resuming; otherwise start a fresh file with a header.
    append = file_exists and resume
    os.makedirs(os.path.dirname(output_path) or '.', exist_ok=True)
    with open(output_path, 'a' if append else 'w', encoding='utf-8', newline='') as f:
        # csv.writer escapes commas and quotes that appear inside descriptions.
        writer = csv.writer(f, quoting=csv.QUOTE_ALL, lineterminator='\n')
        if not append:
            f.write('nome,escola,descricao\n')

        for i in range(start_page, end_page + 1):
            URL = f"https://www.dndbeyond.com/spells?page={i}"
            try:
                page = requests.get(URL, timeout=30)
                soup = BeautifulSoup(page.content, 'html.parser')

                spells = soup.find_all('div', class_='info', attrs={'data-type': 'spells'})

                for spell in spells:
                    spell_name = spell.get('data-slug', '')
                    if spell_name in existing_spells:
                        print(f"Skipping {spell_name} (already collected)")
                        continue

                    # The school is the extra CSS class that sits next to 'school'.
                    school_element = spell.find('div', class_='school')
                    spell_school = ''
                    if school_element and hasattr(school_element, 'get'):
                        classes = school_element.get('class', [])
                        for cls in classes:
                            if cls != 'school':
                                spell_school = cls
                                break

                    description, bonus_text = get_spell_info(spell_name)

                    writer.writerow([spell_name, spell_school, f"{description} {bonus_text}"])
                    f.flush()
                    print(f"Collected: {spell_name}")
                    time.sleep(pause)

                print(f"Page {i} finished")
            except Exception as e:
                print(f"Error on page {i}: {e}")
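
The resume behavior of `scraper_collect()` can be tried without any network access. The following is a minimal sketch: `load_collected_slugs` and `pending_slugs` are hypothetical helper names, and the CSV rows are placeholders held in memory rather than real scraped data.

```python
import csv
import io


def load_collected_slugs(csv_file):
    """Return the set of spell slugs (first column) found in a CSV file object."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row, if any
    return {row[0] for row in reader if row}


def pending_slugs(listing, collected):
    """Keep the listing order while dropping slugs that were already collected."""
    return [slug for slug in listing if slug not in collected]


# In-memory stand-in for data/raw/feiticos.csv (placeholder rows).
previous_run = io.StringIO(
    'nome,escola,descricao\n'
    '"fireball","evocation","placeholder description"\n'
    '"shield","abjuration","placeholder description"\n'
)
collected = load_collected_slugs(previous_run)
todo = pending_slugs(["fireball", "mage-armor", "shield", "sleep"], collected)
print(todo)  # ['mage-armor', 'sleep']
```

Keeping the slugs in a set makes each membership check constant time, which matters little here but keeps the skip logic simple.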


### 2) Verify the CSV (separate cell)

Runs a simple check to ensure the CSV has 3 columns in every row.

In [None]:
import csv

def check_csv_fields(file_path='data/raw/feiticos.csv'):
    with open(file_path, 'r', encoding='utf-8') as file:
        reader = csv.reader(file)
        for row_num, row in enumerate(reader, start=1):
            if len(row) != 3:
                print(f"Row {row_num} has {len(row)} fields instead of 3: {row}")
            else:
                print(f"Row {row_num}: OK ({len(row)} fields)")
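
A row usually ends up with the wrong number of fields when a description contains quotes or commas that were not escaped on write. The sketch below illustrates this with the standard `csv` module and in-memory text; the sample row is made up, and `parse_rows` is a hypothetical helper.

```python
import csv
import io


def parse_rows(csv_text):
    """Parse a CSV string into a list of rows (each row is a list of fields)."""
    return list(csv.reader(io.StringIO(csv_text)))


# A description holding both a comma and double quotes (made-up text).
row = ["example-spell", "evocation", 'Deals "fire" damage, then fades.']

# Hand-built line: the inner quotes are not escaped.
naive_line = '"{}","{}","{}"\n'.format(*row)

# csv.writer doubles the inner quotes, so the row can be read back intact.
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerow(row)
safe_line = buffer.getvalue()

print(parse_rows(naive_line) == [row])  # False: the hand-built line is misparsed
print(parse_rows(safe_line) == [row])   # True: the escaped line round-trips
```

If `check_csv_fields()` reports rows with a field count other than 3, unescaped quotes in the description column are a likely cause to look at first.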


### 3) Split the dataset (separate cell)

Splits `data/raw/feiticos.csv` into train/validation/test sets and saves them in `data/separated/`.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
import os

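def split_dataset(input_csv='data/raw/feiticos.csv', out_dir='data/separated'):
    """Split the raw CSV into train/validation/test sets of roughly 70/20/10."""
    raw = pd.read_csv(input_csv)
    # First split: 70% train, 30% held out.
    train_df, temp_ds = train_test_split(raw, test_size=0.30, random_state=42)
    # Second split: about one third of the held-out rows become the test set.
    val_df, test_df = train_test_split(temp_ds, test_size=0.3333, random_state=42)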

    os.makedirs(out_dir, exist_ok=True)
    train_df.to_csv(os.path.join(out_dir, 'feiticos_train.csv'), index=False)
    val_df.to_csv(os.path.join(out_dir, 'feiticos_val.csv'), index=False)
    test_df.to_csv(os.path.join(out_dir, 'feiticos_test.csv'), index=False)

    print(f"Train: {len(train_df)} rows")
    print(f"Validation: {len(val_df)} rows")
    print(f"Test: {len(test_df)} rows")

    return train_df, val_df, test_df

# Preview of the CSV (shows the first rows)
def preview_raw(n=5, file_path='data/raw/feiticos.csv'):
    df = pd.read_csv(file_path)
    display(df.head(n))
    return df.head(n)
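
The two chained calls to `train_test_split` give roughly a 70/20/10 split. The arithmetic can be sketched in plain Python, assuming the `test_size` side of each split is rounded up and the other side receives the remainder; `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math


def split_sizes(n_rows, holdout_fraction=0.30, test_fraction_of_holdout=0.3333):
    """Return (train, validation, test) row counts for the two-stage split.

    Assumes the held-out side of each split is rounded up and the other
    side receives whatever is left.
    """
    holdout = math.ceil(holdout_fraction * n_rows)
    train = n_rows - holdout
    test = math.ceil(test_fraction_of_holdout * holdout)
    validation = holdout - test
    return train, validation, test


print(split_sizes(1000))  # (700, 200, 100)
```

The second fraction applies to the held-out 30%, not to the full dataset, which is why 0.3333 yields about 10% of all rows for testing.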


### 4) Embeddings (separate cells)

Generates embeddings from the split CSVs and saves them to `data/feiticos_embeddings.npz`. The cell also includes helpers to inspect the stored shapes and to embed a single example text.

In [None]:
import os

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer


def generate_and_save_embeddings(model_name='sentence-transformers/all-MiniLM-L6-v2', data_dir='data/separated', out_path='data/feiticos_embeddings.npz'):
    train_df = pd.read_csv(os.path.join(data_dir, 'feiticos_train.csv'))
    val_df = pd.read_csv(os.path.join(data_dir, 'feiticos_val.csv'))
    test_df = pd.read_csv(os.path.join(data_dir, 'feiticos_test.csv'))

    train_texts = train_df['descricao'].tolist()
    validation_texts = val_df['descricao'].tolist()
    test_texts = test_df['descricao'].tolist()

    train_label = train_df['escola'].tolist()
    validation_label = val_df['escola'].tolist()
    test_label = test_df['escola'].tolist()

    model = SentenceTransformer(model_name)

    train_embeddings = model.encode(train_texts, convert_to_numpy=True)
    validation_embeddings = model.encode(validation_texts, convert_to_numpy=True)
    test_embeddings = model.encode(test_texts, convert_to_numpy=True)

    print(f"\nNumber of training texts: {len(train_texts)}")
    print(f"Shape of the training embeddings: {train_embeddings.shape}")

    np.savez_compressed(
        out_path,
        train_embeddings=train_embeddings,
        validation_embeddings=validation_embeddings,
        test_embeddings=test_embeddings,
        train_label=train_label,
        validation_label=validation_label,
        test_label=test_label
    )


def load_embeddings(path='data/feiticos_embeddings.npz'):
    data = np.load(path, allow_pickle=True)
    print({k: data[k].shape if hasattr(data[k], 'shape') else len(data[k]) for k in data.files})
    return data

# Quick example of embedding a single text
def example_text_embedding(text, model_name='sentence-transformers/all-MiniLM-L6-v2'):
    model = SentenceTransformer(model_name)
    emb = model.encode([text], convert_to_numpy=True)
    print('Embedding shape:', emb.shape)
    return emb[0]
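
The `.npz` round trip can be tried on toy data before running the sentence-transformer model. This is a small sketch with made-up arrays and a temporary file; only the key names (`train_embeddings`, `train_label`) mirror the ones used above.

```python
import os
import tempfile

import numpy as np

# Stand-in data: 4 "embeddings" of dimension 3 and their labels (made up).
embeddings = np.arange(12, dtype=np.float32).reshape(4, 3)
labels = ["evocation", "abjuration", "evocation", "illusion"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_embeddings.npz")
    # Python lists are converted to NumPy arrays when saved.
    np.savez_compressed(path, train_embeddings=embeddings, train_label=labels)

    # Numeric and fixed-width string arrays load with the default settings.
    with np.load(path) as data:
        loaded_embeddings = data["train_embeddings"]
        loaded_labels = data["train_label"]

print(loaded_embeddings.shape, loaded_labels.dtype.kind)  # (4, 3) U
```

Because the labels are stored as a fixed-width Unicode array rather than Python objects, this toy file loads without `allow_pickle=True`.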


### 5) KNN (separate cells)

Cells to load the embeddings, train KNN, and run a prediction example (text -> vector -> label).

In [None]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler


def knn_load(embeddings_path='data/feiticos_embeddings.npz'):
    data = np.load(embeddings_path, allow_pickle=True)
    train_embeddings = data['train_embeddings']
    validation_embeddings = data['validation_embeddings']
    test_embeddings = data['test_embeddings']

    train_label = data['train_label']
    validation_label = data['validation_label']
    test_label = data['test_label']

    scaler = StandardScaler()
    train_embeddings_scaled = scaler.fit_transform(train_embeddings)
    validation_embeddings_scaled = scaler.transform(validation_embeddings)
    test_embeddings_scaled = scaler.transform(test_embeddings)

    print('Shapes (train, val, test):', train_embeddings_scaled.shape, validation_embeddings_scaled.shape, test_embeddings_scaled.shape)
    return train_label, validation_label, test_label, train_embeddings_scaled, validation_embeddings_scaled, test_embeddings_scaled, scaler


def knn_train(k_values=list(range(1,11)), metrics=['cosine','euclidean']):
    best_k = 1
    best_model = None
    best_metric = metrics[0]
    best_accuracy = 0

    train_label, validation_label, test_label, train_embeddings_scaled, validation_embeddings_scaled, test_embeddings_scaled, scaler = knn_load()

    for k in k_values:
        for metric in metrics:
            try:
                knn = KNeighborsClassifier(n_neighbors=k, metric=metric, weights='distance')
                knn.fit(train_embeddings_scaled, train_label)
                val_predictions = knn.predict(validation_embeddings_scaled)
                val_accuracy = accuracy_score(validation_label, val_predictions)
                print(f"K: {k}, Metric: {metric}, Accuracy: {val_accuracy:.4f}")
                if val_accuracy > best_accuracy:
                    best_k = k
                    best_model = knn
                    best_metric = metric
                    best_accuracy = val_accuracy
            except Exception as e:
                print(f"Error with k={k}, metric={metric}: {e}")

    print('Best K found:', best_k, best_metric, best_accuracy)
    return best_k, best_model, best_metric, scaler

# Prediction example (uses the trained model + scaler)
def knn_predict_example(model, scaler, sentence, model_name='sentence-transformers/all-MiniLM-L6-v2'):
    from sentence_transformers import SentenceTransformer
    model_text = SentenceTransformer(model_name)
    emb = model_text.encode([sentence], convert_to_numpy=True)
    emb_scaled = scaler.transform(emb)
    pred = model.predict(emb_scaled)
    print('Predicted label:', pred[0])
    return pred[0]
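
The `weights='distance'` option used in `knn_train()` lets closer neighbors count more than distant ones. The sketch below is a pure-Python illustration of that idea with Euclidean distance and toy 2-D points; it is not scikit-learn's implementation, and `weighted_knn_predict` is a hypothetical name.

```python
import math
from collections import defaultdict


def weighted_knn_predict(train_points, train_labels, query, k=3):
    """Predict a label by letting each of the k nearest points vote with weight 1/distance."""
    neighbors = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )[:k]
    votes = defaultdict(float)
    for distance, label in neighbors:
        if distance == 0.0:
            return label  # an exact match decides the prediction outright
        votes[label] += 1.0 / distance
    return max(votes, key=votes.get)


# Toy data: one close point labeled 'a', two farther points labeled 'b'.
points = [(0.0, 0.0), (4.0, 0.0), (4.0, 1.0)]
labels = ["a", "b", "b"]
print(weighted_knn_predict(points, labels, query=(1.0, 0.0), k=3))  # a
```

A plain majority vote over the same three neighbors would pick `b`; weighting by inverse distance lets the single close neighbor win.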


### 6) KNN + PCA + SMOTE (separate cells)

Load, training, and prediction example (the model returns an encoded label; decode it with the LabelEncoder).
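
The encode/decode step can be pictured with a pure-Python imitation of a label encoder: distinct labels are sorted and numbered, and decoding looks the number back up. `fit_label_mapping` is a hypothetical helper and the school names are only sample values.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, numbering them in sorted order."""
    classes = sorted(set(labels))
    return classes, {label: index for index, label in enumerate(classes)}


schools = ["evocation", "abjuration", "evocation", "illusion"]
classes, to_index = fit_label_mapping(schools)

encoded = [to_index[school] for school in schools]
decoded = [classes[index] for index in encoded]

print(encoded)  # [1, 0, 1, 2]
print(decoded == schools)  # True
```

The mapping must be fitted once on the training labels and reused afterwards, otherwise the integers predicted by the model would decode to the wrong names.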

In [None]:
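import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler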


def knn_pca_smote_load(embeddings_path='data/feiticos_embeddings.npz'):
    data = np.load(embeddings_path, allow_pickle=True)
    train_embeddings = data['train_embeddings']
    validation_embeddings = data['validation_embeddings']
    test_embeddings = data['test_embeddings']

    train_label = data['train_label']
    validation_label = data['validation_label']
    test_label = data['test_label']

    label_encoder = LabelEncoder()
    train_label_encoded = label_encoder.fit_transform(train_label)
    validation_label_encoded = label_encoder.transform(validation_label)
    test_label_encoded = label_encoder.transform(test_label)

    scaler = StandardScaler()
    train_embeddings_scaled = scaler.fit_transform(train_embeddings)
    validation_embeddings_scaled = scaler.transform(validation_embeddings)
    test_embeddings_scaled = scaler.transform(test_embeddings)

    smote = SMOTE(random_state=42)
    train_embeddings_balanced, train_label_balanced = smote.fit_resample(train_embeddings_scaled, train_label_encoded)

    pca = PCA(n_components=0.95, random_state=42)
    train_embeddings_pca = pca.fit_transform(train_embeddings_balanced)
    validation_embeddings_pca = pca.transform(validation_embeddings_scaled)
    test_embeddings_pca = pca.transform(test_embeddings_scaled)

    print('After SMOTE and PCA shapes:', train_embeddings_pca.shape, validation_embeddings_pca.shape, test_embeddings_pca.shape)
    return train_label_balanced, validation_label_encoded, test_label_encoded, train_embeddings_pca, validation_embeddings_pca, test_embeddings_pca, scaler, pca, label_encoder


def knn_pca_smote_train(k_values=list(range(1,11)), metrics=['cosine','euclidean']):
    best_k = 1
    best_model = None
    best_metric = metrics[0]
    best_accuracy = 0

    train_label, validation_label, test_label, train_embeddings_scaled, validation_embeddings_scaled, test_embeddings_scaled, scaler, pca, label_encoder = knn_pca_smote_load()

    for k in k_values:
        for metric in metrics:
            try:
                knn = KNeighborsClassifier(n_neighbors=k, metric=metric, weights='distance')
                knn.fit(train_embeddings_scaled, train_label)
                val_predictions = knn.predict(validation_embeddings_scaled)
                val_accuracy = accuracy_score(validation_label, val_predictions)
                print(f"K: {k}, Metric: {metric}, Accuracy: {val_accuracy:.4f}")
                if val_accuracy > best_accuracy:
                    best_k = k
                    best_model = knn
                    best_metric = metric
                    best_accuracy = val_accuracy
            except Exception as e:
                print(f"Error with k={k}, metric={metric}: {e}")

    print('Best K found:', best_k, best_metric, best_accuracy)
    return best_k, best_model, best_metric, scaler, pca, label_encoder

# Prediction example with label decoding

def knn_pca_predict_example(model, scaler, pca, label_encoder, sentence, model_name='sentence-transformers/all-MiniLM-L6-v2'):
    from sentence_transformers import SentenceTransformer
    model_text = SentenceTransformer(model_name)
    emb = model_text.encode([sentence], convert_to_numpy=True)
    emb_scaled = scaler.transform(emb)
    emb_pca = pca.transform(emb_scaled)
    pred = model.predict(emb_pca)
    pred_decoded = label_encoder.inverse_transform(pred)
    print('Prediction:', pred_decoded[0])
    return pred_decoded[0]
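
Passing `n_components=0.95` asks PCA to keep enough leading components to cover 95% of the variance. The selection rule can be sketched in plain Python as below, assuming the cut is made at the first component where the cumulative ratio exceeds the target; the ratios are made up and `components_for_variance` is a hypothetical helper.

```python
from itertools import accumulate


def components_for_variance(explained_variance_ratio, target=0.95):
    """Return how many leading components are needed for the cumulative ratio to exceed target."""
    running_totals = accumulate(explained_variance_ratio)
    for count, cumulative in enumerate(running_totals, start=1):
        if cumulative > target:
            return count
    return len(explained_variance_ratio)


# Made-up explained-variance ratios, sorted from largest to smallest.
ratios = [0.60, 0.25, 0.08, 0.04, 0.03]
print(components_for_variance(ratios))  # 4
```

On real embeddings the count is available after fitting as `pca.n_components_`, and it also appears as the second dimension in the shapes printed by `knn_pca_smote_load()`.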


### 7) Logistic Regression (separate cells)

Load, training, and prediction example.

In [None]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def logistic_load(embeddings_path='data/feiticos_embeddings.npz'):
    data = np.load(embeddings_path, allow_pickle=True)
    train_embeddings = data['train_embeddings']
    validation_embeddings = data['validation_embeddings']
    test_embeddings = data['test_embeddings']

    train_label = data['train_label']
    validation_label = data['validation_label']
    test_label = data['test_label']

    scaler = StandardScaler()
    train_embeddings_scaled = scaler.fit_transform(train_embeddings)
    validation_embeddings_scaled = scaler.transform(validation_embeddings)
    test_embeddings_scaled = scaler.transform(test_embeddings)

    print('Shapes loaded for logistic:', train_embeddings_scaled.shape)
    return train_label, validation_label, test_label, train_embeddings_scaled, validation_embeddings_scaled, test_embeddings_scaled, scaler


def logistic_train(c_values=[0.01,0.1,1,10,100]):
    best_c = 0.01
    best_model = None
    best_accuracy = 0

    train_label, validation_label, test_label, train_embeddings_scaled, validation_embeddings_scaled, test_embeddings_scaled, scaler = logistic_load()

    for c in c_values:
        try:
            model = LogisticRegression(C=c, max_iter=2000)
            model.fit(train_embeddings_scaled, train_label)
            val_accuracy = model.score(validation_embeddings_scaled, validation_label)
            print(f"C: {c}, Accuracy: {val_accuracy:.4f}")
            if val_accuracy > best_accuracy:
                best_c = c
                best_model = model
                best_accuracy = val_accuracy
        except Exception as e:
            print(f"Error at C={c}: {e}")

    print('Best C found:', best_c, best_accuracy)
    return best_c, best_model, scaler


def logistic_predict_example(model, scaler, sentence, model_name='sentence-transformers/all-MiniLM-L6-v2'):
    from sentence_transformers import SentenceTransformer
    model_text = SentenceTransformer(model_name)
    emb = model_text.encode([sentence], convert_to_numpy=True)
    emb_scaled = scaler.transform(emb)
    pred = model.predict(emb_scaled)
    print('Prediction:', pred[0])
    return pred[0]
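
The training functions in this notebook share one selection rule: try each candidate, score it on the validation set, and keep it only if it is strictly better than the best so far. A minimal sketch of that rule follows; `select_best` is a hypothetical helper and the accuracies are made up for illustration.

```python
def select_best(candidates, validation_score):
    """Return (best_candidate, best_score), keeping the earliest candidate on ties."""
    best_candidate, best_score = None, 0.0
    for candidate in candidates:
        score = validation_score(candidate)
        if score > best_score:  # strict: a tie does not replace the current best
            best_candidate, best_score = candidate, score
    return best_candidate, best_score


# Made-up validation accuracies for five values of C.
fake_accuracy = {0.01: 0.52, 0.1: 0.61, 1: 0.64, 10: 0.64, 100: 0.60}
print(select_best([0.01, 0.1, 1, 10, 100], fake_accuracy.get))  # (1, 0.64)
```

Because the comparison is strict, ties go to the value tried first, so the order of `c_values` influences which model is returned.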

### 8) SVM (separate cells)

Load, training, and prediction example.

In [None]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def svm_load(embeddings_path='data/feiticos_embeddings.npz'):
    data = np.load(embeddings_path, allow_pickle=True)
    train_embeddings = data['train_embeddings']
    validation_embeddings = data['validation_embeddings']
    test_embeddings = data['test_embeddings']

    train_label = data['train_label']
    validation_label = data['validation_label']
    test_label = data['test_label']

    scaler = StandardScaler()
    train_embeddings_scaled = scaler.fit_transform(train_embeddings)
    validation_embeddings_scaled = scaler.transform(validation_embeddings)
    test_embeddings_scaled = scaler.transform(test_embeddings)

    print('Shapes loaded for svm:', train_embeddings_scaled.shape)
    return train_label, validation_label, test_label, train_embeddings_scaled, validation_embeddings_scaled, test_embeddings_scaled, scaler


def svm_train(c_values=[0.1,1,10,100], gamma_values=[0.001,0.01,0.1,1]):
    best_c = 0.1
    best_gamma = 0.001
    best_model = None
    best_accuracy = 0

    train_label, validation_label, test_label, train_embeddings_scaled, validation_embeddings_scaled, test_embeddings_scaled, scaler = svm_load()

    for c in c_values:
        for g in gamma_values:
            try:
                model = SVC(C=c, gamma=g, kernel='rbf')
                model.fit(train_embeddings_scaled, train_label)
                val_accuracy = model.score(validation_embeddings_scaled, validation_label)
                print(f"C: {c}, Gamma: {g}, Accuracy: {val_accuracy:.4f}")
                if val_accuracy > best_accuracy:
                    best_c = c
                    best_gamma = g
                    best_model = model
                    best_accuracy = val_accuracy
            except Exception as e:
                print(f"Error at C={c}, G={g}: {e}")

    print('Best C/G found:', best_c, best_gamma, best_accuracy)
    return best_c, best_gamma, best_model, scaler


def svm_predict_example(model, scaler, sentence, model_name='sentence-transformers/all-MiniLM-L6-v2'):
    from sentence_transformers import SentenceTransformer
    model_text = SentenceTransformer(model_name)
    emb = model_text.encode([sentence], convert_to_numpy=True)
    emb_scaled = scaler.transform(emb)
    pred = model.predict(emb_scaled)
    print('Prediction:', pred[0])
    return pred[0]
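
To get a feel for the `gamma` values searched in `svm_train()`, it helps to evaluate the RBF kernel by hand. The sketch below computes exp(-gamma * ||x - y||^2) for one pair of toy points; `rbf_kernel` here is a small standalone function written for illustration, not a call into scikit-learn.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


x, y = (0.0, 0.0), (1.0, 1.0)  # squared distance between them is 2
similarities = [rbf_kernel(x, y, gamma) for gamma in (0.001, 0.01, 0.1, 1)]
for gamma, value in zip((0.001, 0.01, 0.1, 1), similarities):
    print(f"gamma={gamma}: {value:.4f}")
```

Small gamma values make distant points look almost identical (similarity near 1), while large values make the similarity fall off quickly, so each training point influences only its close surroundings.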