# Preprocessing Experiment - Dewangga Megananda

Dicoding Machine Learning Final Project - Skilled Level

This notebook contains:
- Exploratory Data Analysis (EDA)
- Data Preprocessing
- Data Splitting

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
import warnings
warnings.filterwarnings('ignore')

# Set the style for visualizations
plt.style.use('seaborn-v0_8')
sns.set_palette('husl')

## 1. Load Dataset

Load the dataset used for the experiment. This demo uses the Iris dataset,
but it can be replaced with whatever dataset the project requires.

In [None]:
# Load the dataset (the Iris dataset is used as an example)
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target
df['target_name'] = df['target'].map({i: name for i, name in enumerate(iris.target_names)})

print("Dataset shape:", df.shape)
print("\nFirst 5 rows:")
df.head()
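
Swapping in a different dataset could be handled by a small loader like the sketch below. The function `load_dataset` and its parameters `csv_path` and `target_col` are hypothetical names introduced for illustration; they are not part of the original notebook. When no path is given, the sketch falls back to Iris.

```python
from typing import Optional

import pandas as pd
from sklearn.datasets import load_iris


def load_dataset(csv_path: Optional[str] = None, target_col: str = "target") -> pd.DataFrame:
    """Return a DataFrame holding the feature columns plus `target_col`.

    `csv_path` and `target_col` are hypothetical parameters for this sketch.
    """
    if csv_path is not None:
        data = pd.read_csv(csv_path)
        if target_col not in data.columns:
            raise ValueError(f"Column {target_col!r} not found in {csv_path}")
        return data

    # Fallback: the Iris frame has four feature columns and a 'target' column.
    data = load_iris(as_frame=True).frame.copy()
    return data.rename(columns={"target": target_col})


demo_df = load_dataset()
print("Dataset shape:", demo_df.shape)
```

Failing early when the target column is missing keeps a mislabeled CSV from surfacing as a confusing error later in the notebook.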

## 2. Exploratory Data Analysis (EDA)

In [None]:
# Dataset information
print("Dataset Info:")
df.info()

print("\nStatistical Summary:")
df.describe()

In [None]:
# Check missing values
print("Missing Values:")
missing_values = df.isnull().sum()
print(missing_values)

# Visualize missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()
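
On wider datasets a heatmap can be hard to read, so a tabular summary may help. The sketch below is one possible way to report missing counts and percentages per column; `missing_summary` and the toy frame are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count missing values per column and express them as a percentage of rows."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(2)
    summary = pd.DataFrame({"missing_count": counts, "missing_percent": percent})
    # Columns with the most missing values come first.
    return summary.sort_values("missing_count", ascending=False)


# Toy data: column 'a' has two missing values out of four rows, 'b' has none.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan], "b": [1.0, 2.0, 3.0, 4.0]})
report = missing_summary(toy)
print(report)
```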

In [None]:
# Target class distribution
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='target_name')
plt.title('Distribution of Target Classes')
plt.xlabel('Species')
plt.ylabel('Count')
plt.show()

print("Class distribution:")
print(df['target_name'].value_counts())

In [None]:
# Visualize feature distributions
features = iris.feature_names
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()

for i, feature in enumerate(features):
    sns.histplot(data=df, x=feature, hue='target_name', ax=axes[i], alpha=0.7)
    axes[i].set_title(f'Distribution of {feature}')

plt.tight_layout()
plt.show()

In [None]:
# Correlation matrix
plt.figure(figsize=(8, 6))
correlation_matrix = df[features].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.show()
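
To complement the heatmap, the strongest pairwise correlations can also be listed numerically. The following is a minimal sketch that reloads Iris so it runs on its own; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_features = load_iris(as_frame=True).data
corr = iris_features.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack().dropna()

# Rank pairs by the absolute value of the correlation, strongest first.
pairs = pairs.sort_values(key=np.abs, ascending=False)
print(pairs.head(3))
```

Ranking by absolute value treats strong negative correlations the same as strong positive ones, which is usually what matters when looking for redundant features.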

## 3. Data Preprocessing

In [None]:
# Separate features and target
X = df[features]
y = df['target']

print("Features shape:", X.shape)
print("Target shape:", y.shape)

In [None]:
# Handle missing values (if any)
# For this demo, we simulate some missing values
np.random.seed(42)
mask = np.random.random(X.shape) < 0.05  # each cell has a 5% chance of being set to NaN
X_missing = X.copy()
X_missing[mask] = np.nan

print("Missing values after simulation:")
print(X_missing.isnull().sum())

# Impute missing values with the column mean (keep the original row index)
imputer = SimpleImputer(strategy='mean')
X_imputed = pd.DataFrame(imputer.fit_transform(X_missing), columns=features, index=X_missing.index)
print("\nMissing values after imputation:")
print(X_imputed.isnull().sum())
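
How mean imputation behaves is easiest to see on a tiny array. In the sketch below, which uses made-up numbers, the imputer learns the column means from the training rows only and then reuses those means for unseen rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: each column has one missing value in the training rows.
train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
test = np.array([[np.nan, np.nan]])

toy_imputer = SimpleImputer(strategy="mean")
train_filled = toy_imputer.fit_transform(train)  # learns the column means, ignoring NaN
test_filled = toy_imputer.transform(test)        # reuses the means learned from train

print("Learned means:", toy_imputer.statistics_)
print("Filled test row:", test_filled)
```

The learned means are stored in `statistics_`, so the same fitted imputer can be applied to any later batch without recomputing them.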

In [None]:
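# Feature scaling
# Note: the scaler is fit on the full dataset before the split, so test rows
# influence the scaling statistics; fit on the training split only to avoid leakage.
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X_imputed), columns=features, index=X_imputed.index)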

print("Features after scaling (first 5 rows):")
X_scaled.head()
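
As a sanity check on what `StandardScaler` computes, the sketch below compares it with the manual formula `(x - mean) / std` on a toy column. The values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

values = np.array([[1.0], [2.0], [3.0], [4.0]])

scaled = StandardScaler().fit_transform(values)

# Manual standardization; np.std defaults to the population formula (ddof=0),
# which matches what StandardScaler uses.
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print("Matches manual formula:", np.allclose(scaled, manual))
print("Mean after scaling:", scaled.mean(axis=0))
print("Std after scaling:", scaled.std(axis=0))
```

Because the scaler uses the population standard deviation, `pandas.DataFrame.std()` (which defaults to `ddof=1`) will report values slightly different from 1 on scaled data.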

## 4. Data Splitting

In [None]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set shape:", X_train.shape, y_train.shape)
print("Test set shape:", X_test.shape, y_test.shape)
print("\nClass distribution in training set:")
print(pd.Series(y_train).value_counts())
print("\nClass distribution in test set:")
print(pd.Series(y_test).value_counts())
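
The cells above impute and scale the full dataset before splitting, which lets test rows influence the imputation means and scaling statistics. One possible alternative, sketched below, is to split first and fit the preprocessing on the training split only. This is not the notebook's original flow; the `Pipeline` and the variable names are assumptions made for the sketch.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris_data = load_iris(as_frame=True)
X_raw, y_raw = iris_data.data.copy(), iris_data.target

# Simulate missing values: each cell has a 5% chance of becoming NaN.
rng = np.random.default_rng(42)
X_raw = X_raw.mask(rng.random(X_raw.shape) < 0.05)

# Split first, so the test rows stay unseen by the preprocessing steps.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw
)

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Fit on the training split only, then apply the fitted steps to the test split.
X_tr_ready = pd.DataFrame(preprocess.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_ready = pd.DataFrame(preprocess.transform(X_te), columns=X_te.columns, index=X_te.index)

print("Train:", X_tr_ready.shape, "Test:", X_te_ready.shape)
```

Passing `index=` when rebuilding the DataFrames keeps the rows aligned with `y_tr` and `y_te`, which matters if features and targets are concatenated later.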

## 5. Save Preprocessed Data

In [None]:
# Save the preprocessed data
train_data = pd.concat([X_train, y_train], axis=1)
test_data = pd.concat([X_test, y_test], axis=1)

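# Save to the dataset_preprocessing folder, creating it first if it does not exist
# (to_csv raises an error when the target directory is missing)
import os
os.makedirs('dataset_preprocessing', exist_ok=True)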
train_data.to_csv('dataset_preprocessing/train_data.csv', index=False)
test_data.to_csv('dataset_preprocessing/test_data.csv', index=False)
X_scaled.to_csv('dataset_preprocessing/X_scaled.csv', index=False)
pd.Series(y).to_csv('dataset_preprocessing/y.csv', index=False)

print("Preprocessed data saved to dataset_preprocessing/ folder")
print("Files saved:")
print("- train_data.csv")
print("- test_data.csv")
print("- X_scaled.csv")
print("- y.csv")
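
A quick round-trip check can confirm that the saved files load back unchanged. The sketch below writes to a temporary directory so it does not touch the project folder; `save_frames` is a hypothetical helper introduced for illustration.

```python
import os
import tempfile

import pandas as pd


def save_frames(frames: dict, output_dir: str) -> dict:
    """Write each DataFrame to `<output_dir>/<name>.csv` and return the paths."""
    os.makedirs(output_dir, exist_ok=True)  # create the folder if it is missing
    paths = {}
    for name, frame in frames.items():
        path = os.path.join(output_dir, f"{name}.csv")
        frame.to_csv(path, index=False)
        paths[name] = path
    return paths


demo = pd.DataFrame({"feature": [0.1, 0.2], "target": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "dataset_preprocessing")
    saved = save_frames({"train_data": demo}, out_dir)
    restored = pd.read_csv(saved["train_data"])

print("Round trip preserved the data:", restored.equals(demo))
```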

## Summary

This notebook covered the following steps:

1. **EDA**: Analysis of the data distribution, missing values, and feature correlations
2. **Preprocessing**:
   - Handling missing values with mean imputation
   - Feature scaling using StandardScaler
3. **Data Splitting**: Splitting the data into training and test sets
4. **Save Data**: Saving the preprocessed results to the dataset_preprocessing folder

The data is ready to be used for modeling in the next stage.