# Week 6 Lab: Exploratory Data Analysis (EDA)

**Course:** Big Data Analytics  
**Topic:** Data Exploration, Descriptive Statistics, Correlation, Visualization, and Outlier Detection

---
### Lab Objectives
1. Perform a complete EDA on a real dataset
2. Compute and interpret descriptive statistics
3. Analyze data distributions
4. Measure and visualize correlations between variables
5. Detect outliers using the IQR method

In [None]:
# ============================================================
# Import the Required Libraries
# ============================================================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
from scipy import stats

# Display configuration
sns.set_theme(style='whitegrid', palette='muted')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['figure.dpi'] = 100
pd.set_option('display.max_columns', 20)
pd.set_option('display.float_format', '{:.4f}'.format)

print('✅ Libraries imported successfully')
print(f'   Pandas  : {pd.__version__}')
print(f'   NumPy   : {np.__version__}')
print(f'   Seaborn : {sns.__version__}')

## 1. Loading the Dataset

We will use Seaborn's built-in **`tips`** dataset, which records the tips left by customers at a restaurant.

**Variable descriptions:**
| Variable | Type | Description |
|----------|------|-------------|
| `total_bill` | float | Total bill (USD) |
| `tip` | float | Tip amount (USD) |
| `sex` | category | Sex of the customer paying the bill |
| `smoker` | category | Whether the party included a smoker |
| `day` | category | Day of the week (Thur, Fri, Sat, Sun) |
| `time` | category | Meal time (Lunch/Dinner) |
| `size` | int | Number of people at the table |

In [None]:
# Load the tips dataset from seaborn
df = sns.load_dataset('tips')

print('=' * 50)
print('BASIC DATASET INFORMATION')
print('=' * 50)
print(f'Dataset size   : {df.shape[0]} rows x {df.shape[1]} columns')
print(f'Total elements : {df.size}')
print()

print('--- First 5 Rows ---')
display(df.head())

print('\n--- Data Types ---')
print(df.dtypes)

print('\n--- Dataset Info ---')
df.info()

print('\n--- Missing Value Check ---')
print(df.isnull().sum())
print(f'Total missing: {df.isnull().sum().sum()}')

## 2. Descriptive Statistics

Descriptive statistics give a concise summary of each variable's center, spread, and shape: mean and median, standard deviation and quartiles, skewness and kurtosis.
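
To make skewness and excess kurtosis concrete, here is a minimal sketch that computes both from their moment definitions with NumPy. The `bills` values and the helper names `moment_skewness` and `excess_kurtosis` are hypothetical and chosen only for illustration. pandas' `.skew()` and `.kurtosis()` apply a small-sample bias correction, so their results will differ slightly from these plain moment formulas.

```python
import numpy as np

def moment_skewness(x):
    """Biased (population) skewness: g1 = m3 / m2**1.5."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    m2 = np.mean(d ** 2)   # second central moment (variance)
    m3 = np.mean(d ** 3)   # third central moment
    return m3 / m2 ** 1.5

def excess_kurtosis(x):
    """Biased (population) excess kurtosis: g2 = m4 / m2**2 - 3."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    m2 = np.mean(d ** 2)
    m4 = np.mean(d ** 4)   # fourth central moment
    return m4 / m2 ** 2 - 3.0

# Hypothetical bills: mostly small, with one large value stretching the right tail
bills = [10, 12, 13, 15, 16, 18, 20, 48]
symmetric = [1, 2, 3, 4, 5]

mean_bill = np.mean(bills)
median_bill = np.median(bills)
skew_bill = moment_skewness(bills)
skew_symmetric = moment_skewness(symmetric)

print(f'mean={mean_bill:.2f}, median={median_bill:.2f}')
print(f'skewness (bills)     = {skew_bill:+.3f}')
print(f'skewness (symmetric) = {skew_symmetric:+.3f}')
print(f'excess kurtosis (bills) = {excess_kurtosis(bills):+.3f}')
```

In the right-skewed sample the single large bill pulls the mean above the median, which is the same pattern the histograms in the next section mark with the mean and median lines.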

In [None]:
# ============================================================
# Descriptive Statistics for Numeric Variables
# ============================================================
print('=== Descriptive Statistics for Numeric Variables ===')
display(df.describe().T.style.background_gradient(cmap='YlOrRd'))

print('\n=== Distribution of Categorical Variables ===')
cat_cols = df.select_dtypes(include='category').columns
for col in cat_cols:
    print(f'\n[{col}]')
    vc = df[col].value_counts()
    pct = df[col].value_counts(normalize=True) * 100
    summary = pd.DataFrame({'Count': vc, 'Percentage (%)': pct.round(1)})
    print(summary)

# Skewness and kurtosis (pandas reports excess kurtosis, so a normal distribution scores 0)
print('\n=== Skewness and Kurtosis ===')
num_cols = df.select_dtypes(include='number').columns
skew_kurt = pd.DataFrame({
    'Skewness': df[num_cols].skew(),
    'Kurtosis (Excess)': df[num_cols].kurtosis()
})

def interpret_skew(s):
    """Label skewness using the common |skew| < 0.5 rule of thumb."""
    if abs(s) < 0.5:
        return 'Symmetric'
    elif s > 0:  # covers s == 0.5 exactly, which previously fell through to 'Left-skewed'
        return 'Right-skewed'
    else:
        return 'Left-skewed'

skew_kurt['Skewness Interpretation'] = skew_kurt['Skewness'].apply(interpret_skew)
print(skew_kurt)

# Derived column: tip as a percentage of the total bill
df['tip_pct'] = (df['tip'] / df['total_bill']) * 100
print('\n[New Column] tip_pct (tip percentage):')
print(df['tip_pct'].describe())

## 3. Distribution Analysis

Examine the distribution of each numeric variable with a histogram and a KDE (kernel density estimate) plot.
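
As a rough illustration of what a KDE curve is, the sketch below builds a one-dimensional Gaussian kernel density estimate by hand: each observation contributes a small bell curve, and the curves are averaged. The function `gaussian_kde_1d`, the sample values, and the use of Scott's rule of thumb for the bandwidth are assumptions of this sketch, so the smoothing may not match a plotting library's output exactly.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth=None):
    """Evaluate a Gaussian KDE on `grid`: f(x) = 1/(n*h) * sum K((x - x_i) / h)."""
    x = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = x.size
    if bandwidth is None:
        # Scott's rule of thumb (an assumption of this sketch)
        bandwidth = x.std(ddof=1) * n ** (-1 / 5)
    # One row per grid point, one column per observation
    z = (grid[:, None] - x[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernel.sum(axis=1) / (n * bandwidth)

# Hypothetical tip amounts
tips_sample = np.array([1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 3.5, 5.0, 7.5])
grid = np.linspace(-5, 15, 801)
density = gaussian_kde_1d(tips_sample, grid)

# A density should integrate to about 1 (trapezoidal rule)
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))
print(f'Area under the KDE curve: {area:.4f}')
print(f'Density peaks near x = {grid[np.argmax(density)]:.2f}')
```

A smaller bandwidth gives a bumpier curve that follows individual points, while a larger one smooths over real structure, so the KDE is best read together with the histogram.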

In [None]:
# ============================================================
# Histogram + KDE for all numeric columns
# ============================================================
num_cols_ext = ['total_bill', 'tip', 'size', 'tip_pct']

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Distribution of Numeric Variables (Histogram + KDE)', fontsize=15, fontweight='bold')

colors = ['steelblue', 'coral', 'mediumseagreen', 'orchid']
for ax, col, color in zip(axes.flatten(), num_cols_ext, colors):
    sns.histplot(df[col], kde=True, ax=ax, color=color, bins=25, edgecolor='white')
    ax.axvline(df[col].mean(), color='red', linestyle='--', linewidth=1.5, label=f'Mean={df[col].mean():.2f}')
    ax.axvline(df[col].median(), color='navy', linestyle=':', linewidth=1.5, label=f'Median={df[col].median():.2f}')
    ax.set_title(f'{col}  |  Skew={df[col].skew():.2f}', fontsize=11)
    ax.set_xlabel(col)
    ax.set_ylabel('Frequency')
    ax.legend(fontsize=8)

plt.tight_layout()
plt.show()

# Interpretation
print('\n=== Distribution Interpretation ===')
for col in num_cols_ext:
    s = df[col].skew()
    if s > 0.5:
        direction = 'right-skewed (long right tail)'
    elif s < -0.5:
        direction = 'left-skewed (long left tail)'
    else:
        direction = 'approximately symmetric'
    print(f'{col:12s}: skewness={s:+.3f} → {direction}')

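## 4. Correlation Analysis

Measure the linear relationship between numeric variables with the Pearson correlation coefficient, and compare it with the rank-based Spearman coefficient, which captures monotonic relationships and is less sensitive to outliers.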
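
The difference between the two coefficients is easiest to see on a relationship that is monotonic but not linear. The sketch below computes Pearson's r from its definition and Spearman's rho as Pearson's r applied to ranks. The helper names and the cubic toy data are hypothetical, and the simple `ranks` helper assumes there are no tied values.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

def ranks(values):
    """Ranks 1..n in ascending order (assumes no ties, for simplicity)."""
    values = np.asarray(values)
    order = np.argsort(values)
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(1, len(values) + 1)
    return result

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical data: y grows monotonically with x, but not linearly
x = np.arange(1, 11)
y = x ** 3

r_linear = pearson_r(x, 2 * x + 1)   # exactly linear relationship
r_cubic = pearson_r(x, y)            # monotonic but curved
rho_cubic = spearman_rho(x, y)

print(f'Pearson  (y = 2x + 1): {r_linear:.4f}')
print(f'Pearson  (y = x^3)   : {r_cubic:.4f}')
print(f'Spearman (y = x^3)   : {rho_cubic:.4f}')
```

Pearson's r falls below 1 for the cubic data because the points do not lie on a straight line, while Spearman's rho stays at 1 because the ordering is perfectly preserved.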

In [None]:
# ============================================================
# Correlation Matrix & Heatmap
# ============================================================
corr_matrix = df[num_cols_ext].corr(method='pearson')

print('=== Pearson Correlation Matrix ===')
print(corr_matrix.round(4))

# Heatmap
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Pearson
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, annot=True, fmt='.3f', cmap='RdYlGn',
            center=0, vmin=-1, vmax=1, ax=axes[0],
            linewidths=0.5, mask=mask, square=True,
            annot_kws={'size': 11})
axes[0].set_title('Pearson Correlation Heatmap', fontsize=13, fontweight='bold')

# Spearman
corr_spearman = df[num_cols_ext].corr(method='spearman')
sns.heatmap(corr_spearman, annot=True, fmt='.3f', cmap='RdYlGn',
            center=0, vmin=-1, vmax=1, ax=axes[1],
            linewidths=0.5, mask=mask, square=True,
            annot_kws={'size': 11})
axes[1].set_title('Spearman Correlation Heatmap', fontsize=13, fontweight='bold')

plt.suptitle('Pearson vs Spearman Correlation', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

# Scatter plot of the most strongly correlated pair
print('\n=== Scatter Plot: total_bill vs tip ===')
fig, ax = plt.subplots(figsize=(8, 5))
sns.regplot(data=df, x='total_bill', y='tip', scatter_kws={'alpha': 0.5, 'color': 'steelblue'},
            line_kws={'color': 'red', 'linewidth': 2}, ax=ax)
r, p = stats.pearsonr(df['total_bill'], df['tip'])
ax.set_title(f'total_bill vs tip  |  r = {r:.4f}, p-value = {p:.4e}', fontsize=12)
ax.set_xlabel('Total Bill (USD)')
ax.set_ylabel('Tip (USD)')
plt.tight_layout()
plt.show()

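print(f'\nPearson correlation total_bill~tip: r = {r:.4f} → strong positive correlation')
# Report significance from the computed p-value instead of hard-coding the verdict
verdict = 'statistically significant' if p < 0.05 else 'not statistically significant'
print(f'P-value = {p:.4e} → {verdict} at the 0.05 level')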

## 5. Bivariate & Multivariate Analysis

Explore the relationships among several variables at once, including categorical variables.
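
Before plotting, it can help to see the two tabular operations that sit behind most bivariate summaries: a grouped aggregate (numeric by category) and a contingency table (category by category). The sketch below uses a hypothetical six-row frame named `mini` that borrows column names from `tips`; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical mini-dataset reusing the column names of `tips`
mini = pd.DataFrame({
    'day':  ['Thur', 'Thur', 'Sat', 'Sat', 'Sat', 'Sun'],
    'time': ['Lunch', 'Lunch', 'Dinner', 'Dinner', 'Dinner', 'Dinner'],
    'tip':  [2.0, 3.0, 4.0, 5.0, 6.0, 3.5],
})

# Numeric vs categorical: summarize `tip` within each `day`
tip_by_day = mini.groupby('day')['tip'].agg(['mean', 'count'])

# Categorical vs categorical: count the rows for each (day, time) combination
day_time = pd.crosstab(mini['day'], mini['time'])

print(tip_by_day)
print()
print(day_time)
```

A box plot or bar chart of tip by day visualizes the first table, and the crosstab shows which category combinations actually occur, which matters when comparing groups that are unevenly represented.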

In [None]:
# ============================================================
# Pair Plot
# ============================================================
print('Generating pair plot...')
pair_plot = sns.pairplot(df[['total_bill', 'tip', 'size', 'tip_pct', 'time']],
                         hue='time', diag_kind='kde',
                         plot_kws={'alpha': 0.5})
# `.figure` supersedes the older `.fig` accessor in recent seaborn releases
pair_plot.figure.suptitle('Pair Plot of the Tips Dataset (hue=time)', y=1.02, fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

# ============================================================
# Box Plots per Category
# ============================================================
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Distribution of Total Bill & Tip by Categorical Variable',
             fontsize=13, fontweight='bold')

sns.boxplot(data=df, x='day', y='total_bill', palette='pastel', ax=axes[0, 0])
axes[0, 0].set_title('Total Bill by Day')

sns.boxplot(data=df, x='sex', y='tip', palette='pastel', ax=axes[0, 1])
axes[0, 1].set_title('Tip by Sex')

# Violin plot
sns.violinplot(data=df, x='time', y='total_bill', palette='muted', inner='quartile', ax=axes[1, 0])
axes[1, 0].set_title('Violin Plot: Total Bill by Meal Time')

# Bar chart
day_avg = df.groupby('day', observed=True)['tip'].mean().reset_index()
sns.barplot(data=day_avg, x='day', y='tip', palette='viridis', ax=axes[1, 1])
axes[1, 1].set_title('Average Tip by Day')
axes[1, 1].set_ylabel('Average Tip (USD)')

plt.tight_layout()
plt.show()

# Further analysis
print('\n=== Average Tip by Day ===')
print(df.groupby('day', observed=True)['tip'].agg(['mean', 'median', 'std', 'count']).round(3))

print('\n=== Crosstab: Day vs Meal Time ===')
print(pd.crosstab(df['day'], df['time']))

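## 6. Outlier Detection

Detect values that lie far from the bulk of the data using the IQR method: a value is flagged as an outlier when it falls below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 − Q1. The method relies only on quartiles, so it does not assume the data are normally distributed.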
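
The fences can be worked through by hand on a small sample. The following sketch uses hypothetical bill amounts and a hypothetical helper `iqr_fences`; it relies on NumPy's default linear interpolation for percentiles, which is also the default of pandas' `quantile`.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical bills: Q1 = 12.75, Q3 = 18.5, so IQR = 5.75
bills = np.array([10, 12, 13, 15, 16, 18, 20, 48], dtype=float)

lower, upper = iqr_fences(bills)
outliers = bills[(bills < lower) | (bills > upper)]

print(f'fences: [{lower:.3f}, {upper:.3f}]')   # fences: [4.125, 27.125]
print(f'outliers: {outliers.tolist()}')        # outliers: [48.0]
```

The multiplier `k = 1.5` is the conventional choice used by box plots; a larger value such as 3 flags only the more extreme points.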

In [None]:
# ============================================================
# Outlier Detection with the IQR Method
# ============================================================
def detect_outliers_iqr(data, col):
    """Summarize outliers in data[col] using the 1.5 × IQR fences."""
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = data[(data[col] < lower) | (data[col] > upper)]
    return {
        'Q1': Q1, 'Q3': Q3, 'IQR': IQR,
        'Lower Fence': lower, 'Upper Fence': upper,
        'N Outliers': len(outliers),
        'Pct Outliers': f"{len(outliers)/len(data)*100:.1f}%"
    }

print('=== Outlier Detection Summary (IQR Method) ===')
outlier_summary = pd.DataFrame({
    col: detect_outliers_iqr(df, col) for col in num_cols_ext
}).T
print(outlier_summary)

# Visualize the outliers with box plots
fig, axes = plt.subplots(1, 4, figsize=(16, 5))
fig.suptitle('Box Plots for Outlier Detection', fontsize=13, fontweight='bold')

for ax, col, color in zip(axes, num_cols_ext, ['steelblue', 'coral', 'mediumseagreen', 'orchid']):
    # Flier markers are colored via markerfacecolor/markeredgecolor; `color` alone leaves them at the default
    ax.boxplot(df[col].dropna(), patch_artist=True,
               boxprops=dict(facecolor=color, alpha=0.7),
               medianprops=dict(color='black', linewidth=2),
               flierprops=dict(marker='o', markerfacecolor='red', markeredgecolor='red',
                               markersize=6, alpha=0.7))
    ax.set_title(col, fontsize=11)
    ax.set_ylabel('Value')

    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    n_out = ((df[col] < Q1 - 1.5*IQR) | (df[col] > Q3 + 1.5*IQR)).sum()
    ax.set_xlabel(f'Outliers: {n_out}', fontsize=9, color='red')

plt.tight_layout()
plt.show()

# Identify the rows whose total_bill is an outlier
Q1_tb = df['total_bill'].quantile(0.25)
Q3_tb = df['total_bill'].quantile(0.75)
IQR_tb = Q3_tb - Q1_tb
outlier_rows = df[(df['total_bill'] < Q1_tb - 1.5*IQR_tb) | (df['total_bill'] > Q3_tb + 1.5*IQR_tb)]
print(f'\n=== Outlier Rows in total_bill (n={len(outlier_rows)}) ===')
print(outlier_rows[['total_bill', 'tip', 'size', 'day', 'time']].sort_values('total_bill', ascending=False))

## 7. EDA Conclusions

Summarize the key findings from the data exploration.

In [None]:
# ============================================================
# Summary of EDA Findings
# ============================================================
WIDTH = 58          # inner width of the summary box
INNER = WIDTH - 2   # text width after the two-space left margin

def row(text=''):
    """Print one box row, padded so the right border stays aligned."""
    print(f'║  {text:<{INNER}}║')

title = 'SUMMARY OF EDA FINDINGS: TIPS DATASET'
n_missing = df.isnull().sum().sum()
r_val, _ = stats.pearsonr(df['total_bill'], df['tip'])
max_tip_day = df.groupby('day', observed=True)['tip'].mean().idxmax()
max_bill_time = df.groupby('time', observed=True)['total_bill'].mean().idxmax()

print('╔' + '═' * WIDTH + '╗')
print(f'║{title:^{WIDTH}}║')
print('╠' + '═' * WIDTH + '╣')
row(f'Dataset: {df.shape[0]} observations, {df.shape[1]} variables')
row(f'Missing values: {n_missing}' + (' (clean dataset)' if n_missing == 0 else ''))
row()
row('DISTRIBUTION:')
for col in ['total_bill', 'tip', 'tip_pct']:
    row(f'- {col}: right-skewed (skew={df[col].skew():.2f})')
row()
row('CORRELATION (Pearson):')
row(f'- total_bill ~ tip: r={r_val:.3f} (strong positive correlation)')
row()
row('OUTLIERS (IQR method):')
for col in ['total_bill', 'tip', 'tip_pct']:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    n_out = ((df[col] < q1 - 1.5*iqr) | (df[col] > q3 + 1.5*iqr)).sum()
    pct = n_out / len(df) * 100
    row(f'- {col:12s}: {n_out} outliers ({pct:.1f}%)')
row()
row('BUSINESS PATTERNS:')
row(f'- Day with the highest average tip: {max_tip_day}')
row(f'- Average bill is higher at: {max_bill_time}')
row(f'- Average tip percentage: {df["tip_pct"].mean():.1f}%')
print('╚' + '═' * WIDTH + '╝')

## Lab Assignment

Complete the following exercises on your own:

### Exercise 1
Load the `titanic` dataset from seaborn (`sns.load_dataset('titanic')`). Perform an initial inspection:
- Display the first 10 rows
- Check the missing values per column and compute their percentages
- Display descriptive statistics for all numeric columns

### Exercise 2
For the Titanic dataset, analyze the distributions of the `age` and `fare` columns:
- Create a histogram + KDE for each
- Compute the skewness and kurtosis, and interpret the results
- Compare the distribution of `fare` across 1st-, 2nd-, and 3rd-class passengers using a violin plot

### Exercise 3
Compute the Pearson and Spearman correlations among `age`, `fare`, `pclass`, and `survived`.
- Create a correlation heatmap
- Which variable is most strongly correlated with `survived`?
- Is there a substantial difference between the Pearson and Spearman results? Why?

### Exercise 4
Perform a bivariate analysis on the Titanic dataset:
- Create bar charts of the survival rate (`survived`) by `sex` and by `pclass`
- Create a boxplot of the `age` distribution for survivors vs. non-survivors
- What conclusions can you draw?

### Exercise 5
Detect outliers in the `fare` column using the IQR method **and** the Z-score method:
- Display the rows flagged as outliers
- How many outliers does each method find?
- Which method is stricter? Explain why the results differ.