# 📊 Exploratory Data Analysis - SVM Dataset

This notebook contains a complete exploratory analysis of the `svm_claude.csv` dataset.

**Objective:** Analyze the dataset's characteristics and identify patterns for classifying the `target` variable.

---

## 1. Importing Libraries

In [None]:
# Core libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Statistics
from scipy import stats

# Visualization settings
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

print("✅ Libraries imported successfully!")

## 2. Loading the Data

In [None]:
# Load the dataset
df = pd.read_csv('svm_claude.csv')

print("Dataset loaded successfully!")
print(f"Dimensions: {df.shape[0]} rows × {df.shape[1]} columns")

## 3. Data Overview

In [None]:
# Basic information
print("=" * 80)
print("BASIC DATASET INFORMATION")
print("=" * 80)
print(f"\nDimensions: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"\nMemory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("\nData types:")
print(df.dtypes.value_counts())

In [None]:
# First rows
print("\nFirst 5 rows of the dataset:")
df.head()

In [None]:
# Detailed information
df.info()

## 4. Data Quality

In [None]:
# Check for missing values
print("=" * 80)
print("MISSING VALUE ANALYSIS")
print("=" * 80)

missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Values': missing,
    'Percentage (%)': missing_pct
})

if missing.sum() == 0:
    print("\n✅ No missing values found!")
else:
    print("\n⚠️ Features with missing values:")
    print(missing_df[missing_df['Missing Values'] > 0].sort_values('Missing Values', ascending=False))

In [None]:
# Check for duplicates
duplicates = df.duplicated().sum()
print(f"\nDuplicated rows: {duplicates}")
if duplicates == 0:
    print("✅ No duplicated rows found!")

## 5. Target Variable Analysis

In [None]:
# Target distribution
print("=" * 80)
print("TARGET VARIABLE ANALYSIS")
print("=" * 80)

print("\nDistribution:")
print(df['target'].value_counts())
print("\nProportion:")
print(df['target'].value_counts(normalize=True))

# Compute the imbalance ratio (majority count / minority count)
counts = df['target'].value_counts()
desbalanceamento = counts.max() / counts.min()
print(f"\nImbalance ratio: {desbalanceamento:.1f}:1")

In [None]:
# Visualize the target distribution
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Bar chart
target_counts = df['target'].value_counts()
colors = ['#FF6B6B', '#4ECDC4']
axes[0].bar(target_counts.index, target_counts.values, color=colors, edgecolor='black', linewidth=2)
axes[0].set_title('Target Variable Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Class', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].grid(axis='y', alpha=0.3)
for i, v in enumerate(target_counts.values):
    axes[0].text(i, v + 5, str(v), ha='center', fontweight='bold', fontsize=12)

# Pie chart
axes[1].pie(target_counts.values, labels=target_counts.index, autopct='%1.1f%%',
            colors=colors, startangle=90, explode=[0.05, 0.05],
            textprops={'fontsize': 12, 'fontweight': 'bold'})
axes[1].set_title('Class Proportions', fontsize=14, fontweight='bold')

# Distribution of iteration by target
for target_class in df['target'].unique():
    data = df[df['target'] == target_class]['iteration']
    axes[2].hist(data, alpha=0.6, label=target_class, bins=20, edgecolor='black')
axes[2].set_title('Distribution of Iteration by Target', fontsize=14, fontweight='bold')
axes[2].set_xlabel('Iteration', fontsize=12)
axes[2].set_ylabel('Frequency', fontsize=12)
axes[2].legend()
axes[2].grid(alpha=0.3)

plt.tight_layout()
plt.show()

## 6. Descriptive Statistics

In [None]:
# Overall statistics
print("=" * 80)
print("DESCRIPTIVE STATISTICS")
print("=" * 80)

df.describe()

In [None]:
# Select the numeric features, excluding `iteration` and the encoded target
# (`target_encoded` only exists if the section 7 cell has already been run)
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
for excluded in ('iteration', 'target_encoded'):
    if excluded in numeric_cols:
        numeric_cols.remove(excluded)

print(f"\nTotal numeric features: {len(numeric_cols)}")

## 7. Correlation Analysis with Target

In [None]:
# Encode the target numerically (1 = 'interf', 0 = any other class)
df['target_encoded'] = (df['target'] == 'interf').astype(int)

# Absolute Pearson correlation of each feature with the encoded target
correlations = df[numeric_cols].corrwith(df['target_encoded']).abs().sort_values(ascending=False)

print("=" * 80)
print("TOP 20 FEATURES BY ABSOLUTE CORRELATION WITH TARGET")
print("=" * 80)
print("\n")
for i, (feature, corr) in enumerate(correlations.head(20).items(), 1):
    print(f"{i:2d}. {feature:50s} {corr:.4f}")

In [None]:
# Visualize the correlations
fig, axes = plt.subplots(1, 2, figsize=(18, 6))

# Top 15 correlations
top_corr = correlations.head(15)
colors_corr = ['#FF6B6B' if x > 0.3 else '#4ECDC4' if x > 0.15 else '#95E1D3' for x in top_corr.values]
axes[0].barh(range(len(top_corr)), top_corr.values, color=colors_corr, edgecolor='black')
axes[0].set_yticks(range(len(top_corr)))
axes[0].set_yticklabels([col.replace('mean_', '').replace('_', ' ')[:35] for col in top_corr.index], fontsize=9)
axes[0].set_xlabel('Absolute Correlation with Target', fontsize=12, fontweight='bold')
axes[0].set_title('Top 15 Features - Correlation with Target', fontsize=14, fontweight='bold')
axes[0].grid(axis='x', alpha=0.3)
axes[0].invert_yaxis()

# Distribution of the correlations (constant features yield NaN, which hist cannot bin)
valid_corr = correlations.dropna()
axes[1].hist(valid_corr.values, bins=30, color='#4ECDC4', edgecolor='black', alpha=0.7)
axes[1].axvline(valid_corr.mean(), color='red', linestyle='--', linewidth=2,
                label=f'Mean: {valid_corr.mean():.3f}')
axes[1].axvline(valid_corr.median(), color='orange', linestyle='--', linewidth=2,
                label=f'Median: {valid_corr.median():.3f}')
axes[1].set_xlabel('Absolute Correlation', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Frequency', fontsize=12, fontweight='bold')
axes[1].set_title('Distribution of Correlations with Target', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Correlation matrix of the top 10 features
top_10_features = correlations.head(10).index.tolist()
corr_matrix = df[top_10_features].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0,
            square=True, linewidths=0.5, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix - Top 10 Features', fontsize=14, fontweight='bold', pad=20)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

## 8. Distribution Analysis

In [None]:
# Distribution statistics
print("=" * 80)
print("FEATURE DISTRIBUTION ANALYSIS")
print("=" * 80)

stats_summary = pd.DataFrame({
    'mean': df[numeric_cols].mean(),
    'std': df[numeric_cols].std(),
    'min': df[numeric_cols].min(),
    'max': df[numeric_cols].max(),
    'skewness': df[numeric_cols].skew(),
    'kurtosis': df[numeric_cols].kurtosis(),
    'zeros_count': (df[numeric_cols] == 0).sum(),
    'zeros_pct': ((df[numeric_cols] == 0).sum() / len(df) * 100)
})

print("\nFeatures with more than 80% zeros:")
high_zeros = stats_summary[stats_summary['zeros_pct'] > 80].sort_values('zeros_pct', ascending=False)
print(f"Total: {len(high_zeros)} features")
print(high_zeros[['zeros_count', 'zeros_pct']].head(10))

In [None]:
# Most skewed features, ranked by absolute skewness so strong negative skew is not pushed to the bottom
print("\nMost skewed features (|skewness| > 5):")
high_skew = stats_summary[stats_summary['skewness'].abs() > 5].sort_values('skewness', key=np.abs, ascending=False)
print(f"Total: {len(high_skew)} features")
print(high_skew[['mean', 'std', 'skewness']].head(10))

In [None]:
# Visualize the distributions of the top features
top_features = correlations.head(6).index.tolist()

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

for idx, feature in enumerate(top_features):
    for target_class in df['target'].unique():
        data = df[df['target'] == target_class][feature]
        axes[idx].hist(data, alpha=0.6, label=target_class, bins=25, edgecolor='black', linewidth=0.5)
    
    axes[idx].set_title(f'{feature.replace("mean_", "").replace("_", " ")[:40]}', 
                        fontsize=11, fontweight='bold')
    axes[idx].set_xlabel('Value', fontsize=10)
    axes[idx].set_ylabel('Frequency', fontsize=10)
    axes[idx].legend(fontsize=9)
    axes[idx].grid(alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Comparative boxplots
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

for idx, feature in enumerate(top_features):
    data_to_plot = [df[df['target'] == target_class][feature].values 
                    for target_class in df['target'].unique()]
    
    # tick_labels requires Matplotlib >= 3.9; older releases use labels= instead
    bp = axes[idx].boxplot(data_to_plot, tick_labels=df['target'].unique(),
                           patch_artist=True, showmeans=True)
    
    colors = ['#FF6B6B', '#4ECDC4']
    for patch, color in zip(bp['boxes'], colors):
        patch.set_facecolor(color)
        patch.set_alpha(0.7)
    
    axes[idx].set_title(f'Boxplot: {feature.replace("mean_", "").replace("_", " ")[:35]}', 
                        fontsize=11, fontweight='bold')
    axes[idx].set_ylabel('Value', fontsize=10)
    axes[idx].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## 9. Analysis by Metric Category

In [None]:
# Group features by category
categories = {
    'OS CPU': [col for col in df.columns if col.startswith('mean_os_cpu')],
    'OS Disk': [col for col in df.columns if col.startswith('mean_os_disk')],
    'OS Memory': [col for col in df.columns if col.startswith('mean_os_mem')],
    'OS Network': [col for col in df.columns if col.startswith('mean_os_net')],
    'Process CPU': [col for col in df.columns if col.startswith('mean_process_cpu')],
    'Process Disk': [col for col in df.columns if col.startswith('mean_process_disk')],
    'Process Memory': [col for col in df.columns if col.startswith('mean_process_mem')],
    'Process Network': [col for col in df.columns if col.startswith('mean_process_net')],
    'Container CPU': [col for col in df.columns if col.startswith('mean_container_cpu')],
    'Container Disk': [col for col in df.columns if col.startswith('mean_container_disk')],
    'Container Memory': [col for col in df.columns if col.startswith('mean_container_mem')],
    'Container Network': [col for col in df.columns if col.startswith('mean_container_net')]
}

print("=" * 80)
print("ANALYSIS BY METRIC CATEGORY")
print("=" * 80)
print("\n")

for category, features in categories.items():
    print(f"{category:20s}: {len(features):3d} features")

In [None]:
# Top feature per category
print("\n" + "=" * 80)
print("TOP FEATURE PER CATEGORY (highest absolute correlation with target)")
print("=" * 80)
print("\n")

category_top_features = {}
for category, features in categories.items():
    if features:
        corrs = df[features].corrwith(df['target_encoded']).abs()
        if not corrs.empty and corrs.max() > 0:
            top_feature = corrs.idxmax()
            top_corr = corrs.max()
            category_top_features[category] = (top_feature, top_corr)
            print(f"{category:20s}: {top_feature:50s} (corr: {top_corr:.4f})")

In [None]:
# Visualize by category
fig, axes = plt.subplots(1, 2, figsize=(18, 6))

# Number of features per category
cat_names = list(categories.keys())
cat_counts = [len(categories[cat]) for cat in cat_names]
colors = plt.cm.Set3(range(len(cat_names)))
axes[0].barh(cat_names, cat_counts, color=colors, edgecolor='black')
axes[0].set_xlabel('Number of Features', fontsize=12, fontweight='bold')
axes[0].set_title('Features per Category', fontsize=14, fontweight='bold')
axes[0].grid(axis='x', alpha=0.3)
for i, v in enumerate(cat_counts):
    axes[0].text(v + 0.5, i, str(v), va='center', fontweight='bold')

# Mean absolute correlation per category
cat_mean_corrs = []
for category, features in categories.items():
    if features:
        mean_corr = df[features].corrwith(df['target_encoded']).abs().mean()
        cat_mean_corrs.append((category, mean_corr))

cat_mean_corrs.sort(key=lambda x: x[1], reverse=True)
cat_names_sorted = [x[0] for x in cat_mean_corrs]
cat_corr_values = [x[1] for x in cat_mean_corrs]

colors_grad = plt.cm.RdYlGn_r(np.linspace(0.2, 0.8, len(cat_names_sorted)))
axes[1].barh(cat_names_sorted, cat_corr_values, color=colors_grad, edgecolor='black')
axes[1].set_xlabel('Mean Correlation with Target', fontsize=12, fontweight='bold')
axes[1].set_title('Average Importance by Category', fontsize=14, fontweight='bold')
axes[1].grid(axis='x', alpha=0.3)
for i, v in enumerate(cat_corr_values):
    axes[1].text(v + 0.005, i, f'{v:.3f}', va='center', fontweight='bold', fontsize=9)

plt.tight_layout()
plt.show()

## 10. Outlier Analysis

In [None]:
# Detect outliers using the 1.5 * IQR rule
# (when IQR is 0, as in mostly-zero features, every value off the quartile is flagged)
print("=" * 80)
print("OUTLIER ANALYSIS")
print("=" * 80)

outlier_counts = {}
for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = ((df[col] < lower_bound) | (df[col] > upper_bound)).sum()
    outlier_counts[col] = outliers

outlier_df = pd.DataFrame.from_dict(outlier_counts, orient='index', columns=['Outliers'])
outlier_df = outlier_df.sort_values('Outliers', ascending=False)

print("\nTop 15 features with the most outliers:")
print(outlier_df.head(15))

In [None]:
# Visualize outliers
top_features = correlations.head(6).index.tolist()

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

for idx, feature in enumerate(top_features):
    data_by_class = [df[df['target'] == target_class][feature].values 
                     for target_class in df['target'].unique()]
    
    bp = axes[idx].boxplot(data_by_class, tick_labels=df['target'].unique(), 
                           patch_artist=True, showmeans=True, showfliers=True)
    
    colors = ['#FF6B6B', '#4ECDC4']
    for patch, color in zip(bp['boxes'], colors):
        patch.set_facecolor(color)
        patch.set_alpha(0.6)
    
    for flier, color in zip(bp['fliers'], colors):
        flier.set(marker='o', color=color, markersize=4, alpha=0.5)
    
    short_name = feature.replace('mean_', '').replace('_', ' ')[:40]
    axes[idx].set_title(f'{short_name}', fontsize=11, fontweight='bold')
    axes[idx].set_ylabel('Value', fontsize=10)
    axes[idx].grid(axis='y', alpha=0.3)

    # Add the outlier count
    Q1 = df[feature].quantile(0.25)
    Q3 = df[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = ((df[feature] < lower_bound) | (df[feature] > upper_bound)).sum()
    axes[idx].text(0.05, 0.95, f'Outliers: {outliers}', transform=axes[idx].transAxes,
                   fontsize=9, verticalalignment='top', bbox=dict(boxstyle='round', 
                   facecolor='yellow', alpha=0.3))

plt.tight_layout()
plt.show()

## 11. Statistical Tests

In [None]:
# Descriptive statistics by class
print("=" * 80)
print("DESCRIPTIVE STATISTICS BY CLASS")
print("=" * 80)

top_5_features = correlations.head(5).index.tolist()

for feature in top_5_features:
    print(f"\nFeature: {feature}")
    print("-" * 80)
    for target_class in df['target'].unique():
        subset = df[df['target'] == target_class][feature]
        print(f"{target_class:10s}: mean={subset.mean():12.2f}, std={subset.std():12.2f}, "
              f"min={subset.min():12.2f}, max={subset.max():12.2f}")

In [None]:
# t-tests for the top 10 features
print("\n" + "=" * 80)
print("STATISTICAL TESTS (T-TEST) - Top 10 Features")
print("=" * 80)
print("\n")

results = []
for feature in correlations.head(10).index:
    normal_data = df[df['target'] == 'normal'][feature]
    interf_data = df[df['target'] == 'interf'][feature]

    # Student's t-test with SciPy's default equal-variance assumption
    t_stat, p_value = stats.ttest_ind(normal_data, interf_data)
    results.append({
        'Feature': feature.replace('mean_', '')[:40],
        'T-Statistic': t_stat,
        'P-Value': p_value,
        'Significant': 'Yes ✓' if p_value < 0.05 else 'No ✗'
    })

results_df = pd.DataFrame(results)
results_df
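
The t-test above uses SciPy's default pooled-variance form, which assumes both classes share the same variance. With classes this unbalanced and features this skewed, that assumption may not hold, so a Welch test and a rank-based Mann-Whitney U test are common cross-checks. The sketch below runs both on synthetic data; the arrays and their parameters are made up for illustration and are not taken from `svm_claude.csv`.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for one feature split by class (hypothetical values)
rng = np.random.default_rng(42)
normal_data = rng.normal(loc=10.0, scale=1.0, size=40)    # minority class
interf_data = rng.normal(loc=14.0, scale=5.0, size=400)   # majority class, larger spread

# Welch's t-test: does not assume equal variances
t_stat, p_welch = stats.ttest_ind(normal_data, interf_data, equal_var=False)

# Mann-Whitney U: rank-based, makes no normality assumption
u_stat, p_mwu = stats.mannwhitneyu(normal_data, interf_data, alternative='two-sided')

print(f"Welch t-test:   t={t_stat:.2f}, p={p_welch:.3g}")
print(f"Mann-Whitney U: U={u_stat:.0f}, p={p_mwu:.3g}")
```

Because ten features are tested at once, a multiple-comparison correction may also be worth considering before reading individual p-values as significant.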

## 12. Summary and Insights

In [None]:
# Generate the executive summary
print("=" * 80)
print("EXECUTIVE SUMMARY OF THE ANALYSIS")
print("=" * 80)

# Count features without the label column or its numeric encoding
n_features = df.drop(columns=['target', 'target_encoded'], errors='ignore').shape[1]

print("\n📊 DATASET:")
print(f"   • Total samples: {len(df)}")
print(f"   • Total features: {n_features}")
print(f"   • Numeric features: {len(numeric_cols)}")

print("\n🎯 CLASSES (TARGET):")
print(f"   • 'interf': {(df['target']=='interf').sum()} samples ({(df['target']=='interf').sum()/len(df)*100:.1f}%)")
print(f"   • 'normal': {(df['target']=='normal').sum()} samples ({(df['target']=='normal').sum()/len(df)*100:.1f}%)")
print(f"   • Imbalance ratio: {desbalanceamento:.1f}:1")

print("\n✅ DATA QUALITY:")
print(f"   • Missing values: {df.isnull().sum().sum()}")
print(f"   • Features with >80% zeros: {len(high_zeros)}")
print(f"   • Duplicates: {df.duplicated().sum()}")

print("\n🔝 TOP 5 MOST IMPORTANT FEATURES:")
for i, (feature, corr) in enumerate(correlations.head(5).items(), 1):
    print(f"   {i}. {feature.replace('mean_', '')[:40]:42s} (corr: {corr:.4f})")

print("\n📁 MOST RELEVANT CATEGORIES:")
for i, (cat, corr) in enumerate(cat_mean_corrs[:3], 1):
    print(f"   {i}. {cat:20s} (mean corr: {corr:.4f})")

## 13. Key Insights

### 🎯 Main Insights

1. **Severe Class Imbalance:**
   - 91.6% of the samples belong to the 'interf' class
   - Balancing techniques are needed (SMOTE, class weights)

2. **Most Important Feature:**
   - `mean_os_net_num_connections` has an absolute correlation of 0.95 with the target
   - Strong predictor of the target class

3. **Relevant Categories:**
   - Network metrics are the most correlated
   - Container Memory and Process Network are also relevant

4. **Data Characteristics:**
   - 68 features have more than 80% zeros (high sparsity)
   - High skewness in several features
   - Moderate presence of outliers

### 💡 Modeling Recommendations

#### Preprocessing:
- Remove features with >90% zeros
- Apply normalization/standardization
- Consider logarithmic transformations for skewed features
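
The three preprocessing steps can be sketched with pandas and NumPy alone. This is a minimal illustration on deterministic toy data: the column names are hypothetical, and `SKEW_LIMIT` is an illustrative cutoff chosen for the toy data rather than a value from this notebook.

```python
import numpy as np
import pandas as pd

# Deterministic toy data; the column names are hypothetical
X = pd.DataFrame({
    'mostly_zero': np.r_[np.zeros(95), np.linspace(1.0, 5.0, 5)],  # 95% zeros
    'skewed': np.exp(np.linspace(0.0, 10.0, 100)),                 # long right tail
    'regular': np.linspace(40.0, 60.0, 100),                       # symmetric
})

ZERO_SHARE_LIMIT = 0.90   # from the recommendation above
SKEW_LIMIT = 1.0          # illustrative cutoff, not taken from the notebook

# 1. Drop features whose share of zeros exceeds the limit
zero_share = (X == 0).mean()
X = X.loc[:, zero_share <= ZERO_SHARE_LIMIT].copy()

# 2. Apply log1p to non-negative features with strong skew
skewed_cols = [c for c in X.columns
               if abs(X[c].skew()) > SKEW_LIMIT and (X[c] >= 0).all()]
X[skewed_cols] = np.log1p(X[skewed_cols])

# 3. Standardize to zero mean and unit variance (population std, ddof=0)
X_scaled = (X - X.mean()) / X.std(ddof=0)

print("Kept columns:", list(X_scaled.columns))
print("log1p applied to:", skewed_cols)
```

In a modeling pipeline, the zero shares, means, and standard deviations would normally be computed on the training split only and then applied to the validation data.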

#### Class Balancing:
- Use SMOTE or ADASYN to generate synthetic samples
- Apply class weights in the models
- Consider undersampling the majority class
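
As a rough illustration of the class-weight option, the sketch below computes inverse-frequency weights with the formula `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn documents for `class_weight='balanced'`. The label counts are illustrative values chosen to mirror the 91.6% share reported above, not the real dataset.

```python
import pandas as pd

# Toy labels mirroring the reported 91.6% / 8.4% split (counts are illustrative)
y = pd.Series(['interf'] * 916 + ['normal'] * 84)

# "Balanced" heuristic: n_samples / (n_classes * class_count)
counts = y.value_counts()
class_weights = (len(y) / (len(counts) * counts)).to_dict()

for label, weight in class_weights.items():
    print(f"{label}: {weight:.3f}")
```

A dictionary like this can be passed as the `class_weight` argument of estimators that accept one, such as `sklearn.svm.SVC`, so that errors on the minority class cost more during training.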

#### Feature Engineering:
- Focus on Network metrics (highest predictive power)
- Create aggregated features per category
- Apply feature selection (RFE, LASSO)
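
One possible reading of "aggregated features per category" is a row-wise mean over all columns that share a prefix. The sketch below assumes that interpretation; the frame `df_demo`, its column names, and the `agg_` naming are hypothetical and only imitate the prefix scheme used in section 9.

```python
import pandas as pd

# Toy frame; the column names only imitate the notebook's prefix scheme
df_demo = pd.DataFrame({
    'mean_os_net_a': [1.0, 2.0, 3.0],
    'mean_os_net_b': [3.0, 4.0, 5.0],
    'mean_os_cpu_a': [10.0, 20.0, 30.0],
    'mean_os_cpu_b': [30.0, 40.0, 50.0],
})

prefixes = {'os_net': 'mean_os_net', 'os_cpu': 'mean_os_cpu'}

# One aggregated feature per category: the row-wise mean of its columns
agg = pd.DataFrame({
    f'agg_{name}': df_demo.filter(regex=f'^{prefix}').mean(axis=1)
    for name, prefix in prefixes.items()
})
print(agg)
```

Since columns within a category may use different units, standardizing them before averaging is probably advisable.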

#### Validation:
- Use Stratified K-Fold cross-validation
- Appropriate metrics: F1-score, AUC-ROC, Precision-Recall
- Pay special attention to overfitting caused by the class imbalance
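
A minimal sketch of this validation setup with scikit-learn follows. It assumes synthetic data from `make_classification` in place of the real features, and the estimator settings are illustrative defaults rather than tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced data standing in for the real features (roughly 92% / 8%)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           weights=[0.92, 0.08], random_state=0)

# Scaling lives inside the pipeline, so each fold is scaled using its own training part
model = make_pipeline(StandardScaler(), SVC(class_weight='balanced'))

# Stratification keeps the class ratio approximately equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# scoring='f1' evaluates class 1, which is the minority class in this synthetic data
scores = cross_val_score(model, X, y, cv=cv, scoring='f1')

print(f"F1 per fold: {scores.round(3)}")
print(f"Mean F1: {scores.mean():.3f}")
```

With the real dataset, the positive label for F1 would need to be chosen deliberately, since the encoding in section 7 maps the majority class 'interf' to 1.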

## 14. Exporting Results

In [None]:
# Build a DataFrame with the main statistics
summary_stats = pd.DataFrame({
    'Feature': correlations.head(20).index,
    'Correlation': correlations.head(20).values,
    'Mean': [df[col].mean() for col in correlations.head(20).index],
    'Std': [df[col].std() for col in correlations.head(20).index],
    'Zeros_Pct': [((df[col] == 0).sum() / len(df) * 100) for col in correlations.head(20).index]
})

print("\nTop 20 Features - Summary Statistics:")
summary_stats

In [None]:
# Save the statistics to CSV
summary_stats.to_csv('eda_summary_statistics.csv', index=False)
print("\n✅ Statistics saved to 'eda_summary_statistics.csv'")

---

## 🎉 Exploratory Analysis Complete!

This notebook provided a complete analysis of the dataset, including:
- ✅ Analysis of the target variable and class imbalance
- ✅ Identification of the most important features
- ✅ Analysis of correlations and distributions
- ✅ Outlier detection
- ✅ Statistical tests
- ✅ Modeling recommendations

**Next steps:** Apply the preprocessing recommendations and develop classification models (SVM, Random Forest, XGBoost, etc.)