# Week 8 Lab: Midterm Exam Practice - Comprehensive Review

**Course:** Big Data Analytics  
**Topics:** Review of Weeks 1-7 | Apache Spark, Web Scraping, Preprocessing, EDA, Integrated Analysis

---
### Instructions
- Work through the questions in order
- Read the code comments for additional hints
- Each question's weight is listed in the table below
- Time limit: 90 minutes

| Question | Topic | Weight |
|----------|-------|--------|
| 1 | Apache Spark RDD Operations | 20% |
| 2 | Web Scraping & API | 15% |
| 3 | Data Preprocessing | 25% |
| 4 | EDA & Visualization | 25% |
| 5 | Integrated Data Analysis | 15% |

In [None]:
# Import all required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import json
import sqlite3
import os
import warnings
from scipy import stats
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.datasets import load_iris

warnings.filterwarnings('ignore')
sns.set_theme(style='whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
pd.set_option('display.max_columns', 20)

import sklearn  # imported here only to report the installed version

print('All libraries imported successfully')
print(f'  pandas  : {pd.__version__}')
print(f'  numpy   : {np.__version__}')
print(f'  sklearn : {sklearn.__version__}')

## Question 1: Apache Spark RDD Operations
**Weight: 20 points**

Demonstrate your understanding of Apache Spark by completing the RDD operations below.

**Concepts tested:**
- Transformations vs. actions
- Lazy evaluation
- The map, filter, reduceByKey, and groupByKey operations
- The classic WordCount problem
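
Lazy evaluation can be illustrated without a Spark cluster. The sketch below is an analogy only: plain Python generators stand in for RDD transformations, and `list(...)` stands in for an action such as `collect()`. The `square` helper and the `calls` list are hypothetical names introduced for this illustration.

```python
# Analogy only: Python generators stand in for Spark transformations.
calls = []  # records every time the mapping function actually runs


def square(x):
    calls.append(x)
    return x * x


data = [3, 7, 2, 9, 5, 8, 4]

# "Transformations": building the pipeline does not process any element yet.
evens = (x for x in data if x % 2 == 0)
squared = (square(x) for x in evens)
calls_before_action = len(calls)

# "Action": consuming the pipeline triggers the computation.
result = list(squared)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, result)
```

The counter stays at zero until the pipeline is consumed. Spark behaves the same way in that `filter` and `map` only describe the work, and an action such as `collect()` or `count()` runs it.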

In [None]:
# Install PySpark
!pip install pyspark --quiet
print('PySpark installed successfully')

In [None]:
from pyspark.sql import SparkSession
from pyspark import SparkContext
import warnings
warnings.filterwarnings('ignore')

# Initialize the SparkSession
spark = SparkSession.builder \
    .master('local[*]') \
    .appName('UTS_BigData_Review') \
    .config('spark.ui.showConsoleProgress', 'false') \
    .getOrCreate()
spark.sparkContext.setLogLevel('ERROR')
sc = spark.sparkContext

print(f'Spark Version: {spark.version}')
print(f'Spark Master : {sc.master}')

# ============================================================
# Exercise 1a: Basic RDD Operations
# ============================================================
print('\n=== 1a. Basic RDD Operations ===')

# Create an RDD from a Python list
data = [3, 7, 2, 9, 5, 1, 8, 4, 6, 10, 15, 12, 3, 7, 5]
rdd = sc.parallelize(data)

print(f'Source data: {data}')
print(f'Count      : {rdd.count()}')
print(f'Sum        : {rdd.sum()}')
print(f'Mean       : {rdd.mean():.2f}')
print(f'Min / Max  : {rdd.min()} / {rdd.max()}')

# Transformations (lazy: nothing runs until an action is called)
rdd_genap = rdd.filter(lambda x: x % 2 == 0)
rdd_kuadrat = rdd.map(lambda x: x ** 2)
rdd_besar = rdd.filter(lambda x: x > 7)

# Actions (trigger the actual computation)
print(f'\nEven numbers   : {sorted(rdd_genap.collect())}')
print(f'Squared        : {sorted(rdd_kuadrat.collect())}')
print(f'Greater than 7 : {sorted(rdd_besar.collect())}')
print(f'Sum of squares : {rdd_kuadrat.sum()}')

# ============================================================
# Exercise 1b: WordCount (Classic MapReduce)
# ============================================================
print('\n=== 1b. WordCount Problem ===')

teks = [
    'big data adalah era baru teknologi informasi',
    'spark memproses big data dengan cepat',
    'hadoop adalah fondasi ekosistem big data',
    'data science menggunakan big data untuk analisis',
    'spark lebih cepat dari hadoop untuk iterasi',
]

rdd_text = sc.parallelize(teks)

word_count = (
    rdd_text
    .flatMap(lambda line: line.split())      # Split each line into words
    .map(lambda word: (word, 1))             # Build (word, 1) pairs
    .reduceByKey(lambda a, b: a + b)         # Sum the counts per word
    .sortBy(lambda x: x[1], ascending=False) # Sort by count, descending
)

print('Top 10 most frequent words:')
for word, count in word_count.take(10):
    bar = '#' * count
    print(f'  {word:15s}: {count:2d} {bar}')

# ============================================================
# Exercise 1c: Key-Value Pair RDDs
# ============================================================
print('\n=== 1c. Key-Value RDD: Student Score Analysis ===')

nilai_data = [
    ('Informatika', 85), ('Informatika', 92), ('Informatika', 78),
    ('Sistem Informasi', 88), ('Sistem Informasi', 75), ('Sistem Informasi', 91),
    ('Data Science', 95), ('Data Science', 87), ('Data Science', 93),
    ('Informatika', 80), ('Sistem Informasi', 82), ('Data Science', 89),
]

rdd_nilai = sc.parallelize(nilai_data)

# Average score per study program (prodi)
avg_per_prodi = (
    rdd_nilai
    .groupByKey()
    .mapValues(lambda scores: sum(scores) / len(scores))
    .sortBy(lambda x: x[1], ascending=False)
)

print('Average score per study program:')
for prodi, avg in avg_per_prodi.collect():
    print(f'  {prodi:20s}: {avg:.2f}')

# Scores of 90 or higher
nilai_tinggi = rdd_nilai.filter(lambda x: x[1] >= 90)
print('\nNumber of scores >= 90 per study program:')
count_tinggi = nilai_tinggi.map(lambda x: (x[0], 1)).reduceByKey(lambda a, b: a + b)
for prodi, count in count_tinggi.collect():
    print(f'  {prodi:20s}: {count}')

spark.stop()
print('\nSparkSession stopped')

## Question 2: Web Scraping & API
**Weight: 15 points**

Demonstrate your ability to collect data through a REST API and manipulate JSON data.

**Concepts tested:**
- HTTP GET requests
- Parsing JSON responses
- Transforming data into a DataFrame
- Analyzing the collected data
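
Nested JSON has to be flattened before it fits a tabular structure. The sketch below uses only the standard library to show the flattening idea that `pd.json_normalize(..., sep='_')` applies to nested objects. The `flatten` helper and the sample payload are hypothetical and written for this illustration.

```python
import json


def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the key prefix along.
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


# Hypothetical payload shaped like a single user record.
payload = (
    '{"id": 1, "name": "User 1", '
    '"address": {"city": "Jakarta"}, "company": {"name": "Company B"}}'
)
row = flatten(json.loads(payload))
print(row)
```

Each flattened dict corresponds to one DataFrame row, which is why the nested `address.city` field appears as the `address_city` column after normalization.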

In [None]:
import requests
import pandas as pd
import numpy as np  # used by the simulated-posts fallback
import json
import time

# ============================================================
# Exercise 2a: Fetching Data from the JSONPlaceholder API
# ============================================================
print('=== 2a. Fetching Data via REST API ===')
BASE_URL = 'https://jsonplaceholder.typicode.com'

# Helper function for GET requests
def api_get(endpoint, params=None):
    """Wrapper around a GET request with error handling."""
    try:
        url = f'{BASE_URL}{endpoint}'
        response = requests.get(url, params=params, timeout=10)
        response.raise_for_status()  # raise an exception on HTTP error status
        return response.json()
    except requests.exceptions.ConnectionError:
        print('  Could not connect to the API - falling back to simulated data')
        return None
    except requests.exceptions.Timeout:
        print('  Request timed out')
        return None
    except Exception as e:
        print(f'  Error: {e}')
        return None

# Fetch the list of users
print('\n[1] Fetching Users data...')
users_raw = api_get('/users')

if users_raw is None:
    # Fallback: use simulated data
    print('Using simulated data...')
    users_raw = [
        {'id': i, 'name': f'User {i}', 'username': f'user{i}',
         'email': f'user{i}@test.com',
         'address': {'city': ['Jakarta', 'Bandung', 'Surabaya', 'Medan'][i % 4]},
         'company': {'name': f'Company {chr(65 + i % 5)}'}}
        for i in range(1, 11)
    ]

# Parse into a DataFrame
users_df = pd.json_normalize(
    users_raw,
    sep='_'
)[['id', 'name', 'username', 'email', 'address_city', 'company_name']]
users_df.columns = ['id', 'nama', 'username', 'email', 'kota', 'perusahaan']

print(f'Fetched {len(users_df)} users')
print(users_df.to_string(index=False))

# Fetch posts
print('\n[2] Fetching Posts data...')
posts_raw = api_get('/posts')

if posts_raw is None:
    # Fallback: simulated posts with random body lengths
    posts_raw = [
        {'userId': (i % 10) + 1, 'id': i, 'title': f'Post judul {i}',
         'body': ' '.join([f'kata{j}' for j in range(np.random.randint(5, 15))])}
        for i in range(1, 101)
    ]

posts_df = pd.DataFrame(posts_raw)
posts_df['word_count'] = posts_df['body'].str.split().str.len()
print(f'Fetched {len(posts_df)} posts')

# ============================================================
# Exercise 2b: Analyzing the API Data
# ============================================================
print('\n=== 2b. Analysis of API Data ===')

# Merge users and posts on the user ID. posts_df already has its own 'id'
# column (the post ID), so rename the users' 'id' to 'userId' rather than
# renaming 'userId' in posts_df, which would create two 'id' columns.
user_lookup = users_df[['id', 'nama', 'kota']].rename(columns={'id': 'userId'})
merged = posts_df.merge(user_lookup, on='userId', how='left')

# Statistics
print('\n[a] Number of posts per user:')
post_count = merged.groupby('nama')['id'].count().reset_index()
post_count.columns = ['nama', 'jumlah_post']
print(post_count.sort_values('jumlah_post', ascending=False).head(5).to_string(index=False))

print('\n[b] Average post length per user (top 5):')
avg_words = merged.groupby('nama')['word_count'].mean().reset_index()
avg_words.columns = ['nama', 'avg_word_count']
print(avg_words.sort_values('avg_word_count', ascending=False).head(5).to_string(index=False))

# ============================================================
# Exercise 2c: Error Handling and Rate Limiting
# ============================================================
print('\n=== 2c. Error Handling & Rate Limiting ===')

def fetch_with_retry(endpoint, params=None, max_retries=3, delay=1):
    """Fetch with retry logic and exponential backoff.

    api_get() already catches request errors and returns None,
    so a None result is treated as a failed attempt.
    """
    for attempt in range(1, max_retries + 1):
        result = api_get(endpoint, params=params)
        if result is not None:
            return result
        if attempt < max_retries:
            print(f'  Attempt {attempt}/{max_retries} failed, waiting {delay}s...')
            time.sleep(delay)
            delay *= 2  # exponential backoff
    return None

# Fetch a filtered subset of todos (query string passed via params)
todos_raw = fetch_with_retry('/todos', params={'userId': 1, '_limit': 10})
if todos_raw is None:
    # Fallback: simulated todos
    todos_raw = [{'userId': 1, 'id': i, 'title': f'Task {i}',
                  'completed': bool(i % 3 == 0)} for i in range(1, 11)]

todos_df = pd.DataFrame(todos_raw)
print(f'Todos for user 1: {len(todos_df)} tasks')
completed = todos_df['completed'].sum()
print(f'  Completed  : {completed} ({completed/len(todos_df)*100:.0f}%)')
print(f'  Pending    : {len(todos_df) - completed}')
print('\nDone!')

## Question 3: Data Preprocessing
**Weight: 25 points**

You are given a dirty dataset with a range of data quality problems. Perform complete preprocessing.

**Problems in the dataset:**
1. Missing values in several columns
2. Inconsistent formats (dates, upper/lower case)
3. Extreme outliers
4. Duplicate records
5. Incorrect data types
6. Categorical variables that need encoding
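
Extreme outliers are commonly flagged with the IQR rule: anything below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is treated as suspect. The sketch below implements the rule with the standard library. The `iqr_bounds` helper and the `ages` list are hypothetical. The `method="inclusive"` setting is chosen because it interpolates linearly between data points, the same approach as the default of pandas `quantile`.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences using the IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical ages with one impossible value.
ages = [24, 26, 27, 28, 29, 30, 31, 33, 150]
lower, upper = iqr_bounds(ages)
outliers = [a for a in ages if a < lower or a > upper]

print(lower, upper, outliers)
```

The IQR rule is purely statistical. For columns with known valid ranges, such as a working age between 18 and 65, a domain rule is often the better first filter.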

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# ============================================================
# Dirty dataset - simulated employee data
# ============================================================
np.random.seed(42)

dirty_data = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 3, 11, 12, 13, 14, 15],   # ID 3 is duplicated
    'Nama': ['Budi', 'sari DEWI', 'Andi', None, 'maya', 'RIZKY', 'Dewi', 'Ahmad',
              'Putri', 'Fajar', 'Andi', 'Lina', 'Hendra', 'wati', 'Bimo', None],
    'Usia': [28, 35, 24, 42, None, 29, 31, 27, 150, 33, 24, 38, -5, 26, 45, 30],  # 150 and -5 are outliers
    'Gaji': ['5500000', '7200000', '4800000', '9100000', '6300000', None,
              '5100000', '6700000', '8200000', '7500000', '4800000',
              '5900000', '12000000000', '6100000', '8900000', '5700000'],  # 12 billion is an outlier
    'Departemen': ['IT', 'hr', 'IT', 'Finance', 'HR', 'it', 'Finance', 'HR',
                   'IT', 'Finance', 'IT', 'hr', 'IT', 'Finance', 'HR', 'IT'],
    'Tgl_Masuk': ['01/03/2020', '15-06-2019', '20/11/2021', '03/01/2018',
                  '2022-07-25', '10/09/2020', '28/02/2021', '14-04-2019',
                  '05/08/2022', '17/12/2019', '20/11/2021', '09/03/2020',
                  '22-10-2017', '11/05/2021', '30/01/2023', '07/06/2020'],
    'Pendidikan': ['S1', 'S2', 'S1', 'S3', 'D3', 'S1', 'S2', 'S1',
                   'S1', 'S2', 'S1', 'D3', 'S2', 'S1', 'S3', 'S1'],
    'Status': ['Aktif', 'aktif', 'AKTIF', 'Aktif', 'Resign', 'Aktif', 'Resign',
                'Aktif', 'Aktif', 'aktif', 'AKTIF', 'Aktif', 'Resign', 'Aktif', 'Aktif', None],
})

print('=== INITIAL DIRTY DATA ===')
print(f'Shape: {dirty_data.shape}')
print(dirty_data.to_string())

print('\n=== PROBLEMS DETECTED ===')
print(f'1. Duplicates: {dirty_data.duplicated(subset=["ID"]).sum()} duplicate rows')
print('2. Missing values per column:')
print(dirty_data.isnull().sum()[dirty_data.isnull().sum() > 0])
print(f'3. Inconsistent format (Departemen): {dirty_data["Departemen"].unique()}')
print(f'4. Inconsistent format (Status): {dirty_data["Status"].unique()}')

In [None]:
# ============================================================
# SOLUTION: Complete Preprocessing
# ============================================================
df_clean = dirty_data.copy()

print('PREPROCESSING STEPS:')
print('=' * 55)

# STEP 1: Remove duplicates
before = len(df_clean)
df_clean = df_clean.drop_duplicates(subset=['ID'], keep='first')
print(f'[1] Remove duplicates: {before} -> {len(df_clean)} rows (removed: {before - len(df_clean)})')

# STEP 2: Standardize letter casing
df_clean['Nama'] = df_clean['Nama'].str.title()
df_clean['Departemen'] = df_clean['Departemen'].str.upper()
df_clean['Status'] = df_clean['Status'].str.capitalize()
print('[2] Standardize casing: Nama, Departemen, Status')

# STEP 3: Convert Gaji (salary) from string to numeric
df_clean['Gaji'] = pd.to_numeric(df_clean['Gaji'], errors='coerce')
print(f'[3] Convert Gaji to numeric: {df_clean["Gaji"].dtype}')

# STEP 4: Detect and handle Usia (age) outliers with a domain rule
usia_valid = df_clean['Usia'].between(18, 65)
n_outlier_usia = (~usia_valid & df_clean['Usia'].notna()).sum()
df_clean.loc[~usia_valid, 'Usia'] = np.nan
print(f'[4] Usia outliers (outside 18-65): {n_outlier_usia} values -> set to NaN')

# STEP 5: Detect and handle Gaji outliers using the IQR method
Q1_gaji = df_clean['Gaji'].quantile(0.25)
Q3_gaji = df_clean['Gaji'].quantile(0.75)
IQR_gaji = Q3_gaji - Q1_gaji
lower_g = Q1_gaji - 1.5 * IQR_gaji
upper_g = Q3_gaji + 1.5 * IQR_gaji
n_outlier_gaji = ((df_clean['Gaji'] < lower_g) | (df_clean['Gaji'] > upper_g)).sum()
df_clean.loc[(df_clean['Gaji'] < lower_g) | (df_clean['Gaji'] > upper_g), 'Gaji'] = np.nan
print(f'[5] Gaji outliers (IQR method): {n_outlier_gaji} values -> set to NaN')
print(f'    Valid range: Rp{lower_g:,.0f} - Rp{upper_g:,.0f}')

# STEP 6: Impute missing values (median for numeric, mode for categorical)
median_usia = df_clean['Usia'].median()
median_gaji = df_clean['Gaji'].median()
mode_status = df_clean['Status'].mode()[0]

df_clean['Usia'] = df_clean['Usia'].fillna(median_usia)
df_clean['Gaji'] = df_clean['Gaji'].fillna(median_gaji)
df_clean['Status'] = df_clean['Status'].fillna(mode_status)
df_clean['Nama'] = df_clean['Nama'].fillna('Tidak Diketahui')
print(f'[6] Imputation: Usia=median({median_usia}), Gaji=median(Rp{median_gaji:,.0f}), Status=mode({mode_status})')

# STEP 7: Parse dates
def parse_flexible_date(date_str):
    """Parse dates written as DD/MM/YYYY, DD-MM-YYYY, or YYYY-MM-DD."""
    if pd.isna(date_str):
        return pd.NaT
    date_str = str(date_str).replace('-', '/')
    for fmt in ['%d/%m/%Y', '%Y/%m/%d']:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except ValueError:
            continue
    return pd.NaT

df_clean['Tgl_Masuk'] = df_clean['Tgl_Masuk'].apply(parse_flexible_date)
df_clean['Tahun_Masuk'] = df_clean['Tgl_Masuk'].dt.year
df_clean['Lama_Kerja_Tahun'] = 2024 - df_clean['Tahun_Masuk']  # tenure relative to 2024
print(f'[7] Date parsing: {df_clean["Tgl_Masuk"].notna().sum()}/{len(df_clean)} succeeded')

# STEP 8: Label encoding for categorical variables
le_dept = LabelEncoder()
le_pend = LabelEncoder()
le_stat = LabelEncoder()

df_clean['Departemen_Enc'] = le_dept.fit_transform(df_clean['Departemen'])
df_clean['Pendidikan_Enc'] = le_pend.fit_transform(df_clean['Pendidikan'])
df_clean['Status_Enc'] = le_stat.fit_transform(df_clean['Status'])

print('[8] Label Encoding:')
print(f'    Departemen: {dict(zip(le_dept.classes_, le_dept.transform(le_dept.classes_)))}')
print(f'    Status    : {dict(zip(le_stat.classes_, le_stat.transform(le_stat.classes_)))}')

# STEP 9: Feature scaling (one scaler per column, so each keeps its own fit)
scaler_usia = StandardScaler()
scaler_gaji = StandardScaler()
df_clean['Usia_Scaled'] = scaler_usia.fit_transform(df_clean[['Usia']]).ravel()
df_clean['Gaji_Scaled'] = scaler_gaji.fit_transform(df_clean[['Gaji']]).ravel()
print('[9] StandardScaler applied to Usia and Gaji')

print('\n=== FINAL RESULT AFTER CLEANING ===')
print(df_clean[['ID', 'Nama', 'Usia', 'Gaji', 'Departemen', 'Status', 'Lama_Kerja_Tahun']].to_string(index=False))

print('\nSummary:')
print(f'  Initial rows : {len(dirty_data)}')
print(f'  Final rows   : {len(df_clean)}')
print(f'  Missing final: {df_clean[["Usia","Gaji","Nama","Status"]].isnull().sum().sum()}')

## Question 4: EDA & Visualization
**Weight: 25 points**

Perform a complete EDA on the Iris dataset (a classic ML dataset) and the Tips dataset. Create informative visualizations.
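
Correlation analysis in an EDA usually relies on Pearson's r, the covariance of two variables divided by the product of their standard deviations. The sketch below computes it from scratch to make the formula concrete. The `pearson_r` helper and the toy measurements are hypothetical and are not taken from Iris.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy measurements, made up for illustration.
length = [1.0, 2.0, 3.0, 4.0, 5.0]
width = [2.1, 3.9, 6.2, 8.0, 9.8]
r = pearson_r(length, width)

print(f"r = {r:.4f}")
```

In practice `scipy.stats.pearsonr` is preferable because it also returns a p-value. The manual version is only meant to show what the coefficient measures.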

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from scipy import stats

# Load the Iris dataset
iris_raw = load_iris()
iris_df = pd.DataFrame(iris_raw.data, columns=iris_raw.feature_names)
iris_df['species'] = [iris_raw.target_names[t] for t in iris_raw.target]
iris_df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

print('=== Iris Dataset ===')
print(f'Shape: {iris_df.shape}')
print(iris_df.head())
print('\nSpecies distribution:')
print(iris_df['species'].value_counts())

# ============================================================
# Descriptive Statistics per Species
# ============================================================
print('\n=== Descriptive Statistics per Species ===')
print(iris_df.groupby('species')[['sepal_length', 'petal_length']]
             .agg(['mean', 'std', 'min', 'max'])
             .round(2))

# ============================================================
# Comprehensive Visualization
# ============================================================
fig = plt.figure(figsize=(16, 12))
fig.suptitle('Iris Dataset EDA - Comprehensive Analysis', fontsize=15, fontweight='bold', y=1.01)

# Numeric feature columns used by the plots below
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# Plot 1: Petal length distribution per species
ax1 = fig.add_subplot(3, 3, 1)
for species in iris_df['species'].unique():
    subset = iris_df[iris_df['species'] == species]['petal_length']
    ax1.hist(subset, alpha=0.6, label=species, bins=15)
ax1.set_title('Petal Length Distribution')
ax1.set_xlabel('Petal Length (cm)')
ax1.legend(fontsize=8)

# Plot 2: Box plot per species
# (sns.boxplot is used instead of DataFrame.boxplot(by=...), which
# overwrites the figure-level suptitle set above)
ax2 = fig.add_subplot(3, 3, 2)
sns.boxplot(data=iris_df, x='species', y='petal_width', ax=ax2)
ax2.set_title('Petal Width per Species')
ax2.set_xlabel('Species')

# Plot 3: Scatter plot
ax3 = fig.add_subplot(3, 3, 3)
colors = {'setosa': 'blue', 'versicolor': 'orange', 'virginica': 'green'}
for species, color in colors.items():
    subset = iris_df[iris_df['species'] == species]
    ax3.scatter(subset['sepal_length'], subset['petal_length'],
               c=color, label=species, alpha=0.6)
ax3.set_xlabel('Sepal Length')
ax3.set_ylabel('Petal Length')
ax3.set_title('Sepal vs Petal Length')
ax3.legend(fontsize=8)

# Plot 4: Correlation heatmap
ax4 = fig.add_subplot(3, 3, 4)
corr = iris_df[features].corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='RdYlGn', ax=ax4,
            center=0, square=True, linewidths=0.5, annot_kws={'size': 8})
ax4.set_title('Correlation Heatmap')

# Plot 5: Violin plot
ax5 = fig.add_subplot(3, 3, 5)
sns.violinplot(data=iris_df, x='species', y='sepal_length',
               palette='muted', inner='quartile', ax=ax5)
ax5.set_title('Violin Plot Sepal Length')
ax5.set_xlabel('')

# Plot 6: KDE plot
ax6 = fig.add_subplot(3, 3, 6)
for species in iris_df['species'].unique():
    subset = iris_df[iris_df['species'] == species]['petal_width']
    subset.plot.kde(ax=ax6, label=species)
ax6.set_title('KDE Plot Petal Width')
ax6.set_xlabel('Petal Width (cm)')
ax6.legend(fontsize=8)

# Plot 7: Mean per species (bar chart)
ax7 = fig.add_subplot(3, 1, 3)
mean_vals = iris_df.groupby('species')[features].mean()
mean_vals.T.plot(kind='bar', ax=ax7, colormap='viridis', alpha=0.8, edgecolor='black')
ax7.set_title('Mean of Each Feature per Species')
ax7.set_xlabel('Feature')
ax7.set_ylabel('Mean value (cm)')
ax7.legend(title='Species', loc='upper left', fontsize=9)
ax7.tick_params(axis='x', rotation=15)

plt.tight_layout()
plt.show()

# ============================================================
# Correlation Analysis
# ============================================================
print('\n=== Correlation Analysis (Pearson) ===')
corr_matrix = iris_df[features].corr()
print(corr_matrix.round(4))

r, p = stats.pearsonr(iris_df['petal_length'], iris_df['petal_width'])
print(f'\nCorrelation petal_length ~ petal_width: r={r:.4f}, p={p:.2e}')
print('Interpretation: VERY STRONG POSITIVE CORRELATION (r > 0.9)')

# ============================================================
# Outlier Detection
# ============================================================
print('\n=== Outlier Detection (IQR Method) ===')
for feat in features:
    Q1 = iris_df[feat].quantile(0.25)
    Q3 = iris_df[feat].quantile(0.75)
    IQR = Q3 - Q1
    n_out = ((iris_df[feat] < Q1 - 1.5*IQR) | (iris_df[feat] > Q3 + 1.5*IQR)).sum()
    print(f'  {feat:15s}: {n_out} outliers')

## Question 5: Integrated Data Analysis
**Weight: 15 points**

An end-to-end mini project: build a simulated dataset, clean it, explore it, then produce visualizations and business insights.
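
The cleaning and aggregation stages of such a pipeline can be sketched on a handful of records before scaling up to a DataFrame. The example below fills a missing rating with the median of its category and sums revenue net of returned orders. The `orders` records are hypothetical, and the field names simply mirror the column names used in this exercise.

```python
from collections import defaultdict
from statistics import median

# Hypothetical mini order table; None marks a missing rating.
orders = [
    {"kategori": "Pakaian", "harga": 100_000, "qty": 2, "rating": 4.0, "is_returned": False},
    {"kategori": "Pakaian", "harga": 50_000, "qty": 1, "rating": None, "is_returned": True},
    {"kategori": "Pakaian", "harga": 80_000, "qty": 1, "rating": 5.0, "is_returned": False},
    {"kategori": "Makanan", "harga": 20_000, "qty": 3, "rating": 3.5, "is_returned": False},
]

# Cleaning: collect known ratings per category, then fill gaps with the median.
known = defaultdict(list)
for o in orders:
    if o["rating"] is not None:
        known[o["kategori"]].append(o["rating"])
for o in orders:
    if o["rating"] is None:
        o["rating"] = median(known[o["kategori"]])

# Feature engineering and aggregation: net revenue ignores returned orders.
net_revenue = defaultdict(int)
for o in orders:
    revenue = o["harga"] * o["qty"]
    net_revenue[o["kategori"]] += 0 if o["is_returned"] else revenue

print(orders[1]["rating"], dict(net_revenue))
```

Imputing within each category rather than across the whole table keeps a category with unusually high or low ratings from being pulled toward the global median.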

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

np.random.seed(42)

# ============================================================
# STEP 1: LOAD / CREATE THE DATA
# ============================================================
print('STEP 1: Creating the e-commerce sales dataset...')

n = 1000
cities = ['Jakarta', 'Bandung', 'Surabaya', 'Medan', 'Makassar', 'Yogyakarta']
categories = ['Elektronik', 'Pakaian', 'Makanan', 'Kecantikan', 'Olahraga']
months = list(range(1, 13))

df_raw = pd.DataFrame({
    'order_id': range(1001, 1001 + n),
    'tanggal': pd.date_range('2024-01-01', periods=n, freq='8h'),
    'kota': np.random.choice(cities, n, p=[0.35, 0.20, 0.20, 0.10, 0.08, 0.07]),
    'kategori': np.random.choice(categories, n, p=[0.25, 0.30, 0.20, 0.15, 0.10]),
    'harga': np.round(np.random.lognormal(mean=12.5, sigma=1.0, size=n), -3),
    'qty': np.random.randint(1, 6, n),
    'rating': np.clip(np.random.normal(4.0, 0.8, n), 1, 5).round(1),
    'is_returned': np.random.choice([False, True], n, p=[0.93, 0.07]),
})

# Add noise (missing values)
mask = np.random.choice([True, False], n, p=[0.03, 0.97])
df_raw.loc[mask, 'rating'] = np.nan

print(f'  Dataset: {df_raw.shape[0]} rows x {df_raw.shape[1]} columns')

# ============================================================
# STEP 2: CLEANING
# ============================================================
print('\nSTEP 2: Cleaning...')
df = df_raw.copy()

# Handle missing values
n_missing = df['rating'].isnull().sum()
df['rating'] = df['rating'].fillna(df.groupby('kategori')['rating'].transform('median'))
print(f'  Missing ratings: {n_missing} -> filled with the per-category median')

# Feature engineering
df['bulan'] = df['tanggal'].dt.month
df['hari_minggu'] = df['tanggal'].dt.day_name()  # day of the week
df['revenue'] = df['harga'] * df['qty']
df['revenue_bersih'] = df['revenue'] * (~df['is_returned']).astype(int)  # net of returns

print('  New features: bulan, hari_minggu, revenue, revenue_bersih')

# ============================================================
# STEP 3: EDA
# ============================================================
print('\nSTEP 3: EDA...')

total_revenue = df['revenue_bersih'].sum()
avg_order = df['revenue'].mean()
return_rate = df['is_returned'].mean() * 100
avg_rating = df['rating'].mean()

print(f'  Total net revenue: Rp{total_revenue:,.0f}')
print(f'  Average order value: Rp{avg_order:,.0f}')
print(f'  Return rate: {return_rate:.1f}%')
print(f'  Average rating: {avg_rating:.2f}/5.0')

# ============================================================
# STEP 4: VISUALIZATION
# ============================================================
print('\nSTEP 4: Building visualizations...')

fig, axes = plt.subplots(2, 3, figsize=(18, 11))
fig.suptitle('E-Commerce Sales Analysis Dashboard 2024', fontsize=15, fontweight='bold')

# 1. Revenue per category
rev_kategori = df.groupby('kategori')['revenue_bersih'].sum().sort_values(ascending=True)
rev_kategori.plot(kind='barh', ax=axes[0, 0], color='steelblue', edgecolor='white')
axes[0, 0].set_title('Total Net Revenue per Category')
axes[0, 0].set_xlabel('Revenue (Rp)')
for i, v in enumerate(rev_kategori.values):
    axes[0, 0].text(v, i, f' Rp{v/1e6:.0f}M', va='center', fontsize=8)

# 2. Revenue per city (pie chart)
rev_kota = df.groupby('kota')['revenue_bersih'].sum()
axes[0, 1].pie(rev_kota.values, labels=rev_kota.index, autopct='%1.1f%%',
               startangle=90, colors=sns.color_palette('pastel'))
axes[0, 1].set_title('Revenue Distribution per City')

# 3. Monthly revenue trend
monthly = df.groupby('bulan')['revenue_bersih'].sum()
axes[0, 2].plot(monthly.index, monthly.values, marker='o', linewidth=2, color='coral')
axes[0, 2].fill_between(monthly.index, monthly.values, alpha=0.2, color='coral')
axes[0, 2].set_title('Monthly Revenue Trend')
axes[0, 2].set_xlabel('Month')
axes[0, 2].set_ylabel('Revenue (Rp)')
axes[0, 2].set_xticks(range(1, 13))

# 4. Rating distribution per category
sns.boxplot(data=df, x='kategori', y='rating', palette='muted', ax=axes[1, 0])
axes[1, 0].set_title('Rating Distribution per Category')
axes[1, 0].tick_params(axis='x', rotation=15)

# 5. Heatmap: revenue per city per category
pivot = df.pivot_table(values='revenue_bersih', index='kota', columns='kategori', aggfunc='sum')
sns.heatmap(pivot / 1e6, annot=True, fmt='.0f', cmap='YlOrRd',
            ax=axes[1, 1], annot_kws={'size': 8})
axes[1, 1].set_title('Revenue (million Rp): City vs Category')
axes[1, 1].tick_params(axis='x', rotation=20)

# 6. Return rate per category
return_by_cat = df.groupby('kategori')['is_returned'].mean() * 100
return_by_cat.sort_values(ascending=False).plot(
    kind='bar', ax=axes[1, 2], color='salmon', edgecolor='white'
)
axes[1, 2].set_title('Return Rate per Category (%)')
axes[1, 2].set_xlabel('')
axes[1, 2].tick_params(axis='x', rotation=15)
axes[1, 2].axhline(y=return_rate, color='red', linestyle='--', label=f'Avg={return_rate:.1f}%')
axes[1, 2].legend(fontsize=9)

plt.tight_layout()
plt.show()

# ============================================================
# STEP 5: INSIGHTS & RECOMMENDATIONS
# ============================================================
print('\nSTEP 5: Business Insights & Recommendations')
print('=' * 60)

top_kategori = rev_kategori.idxmax()
top_kota = rev_kota.idxmax()
best_rating_cat = df.groupby('kategori')['rating'].mean().idxmax()
highest_return_cat = return_by_cat.idxmax()
best_month = monthly.idxmax()

print('KEY FINDINGS:')
print(f'1. Top category (revenue): {top_kategori}')
print(f'2. City with the highest revenue: {top_kota}')
print(f'3. Category with the highest rating: {best_rating_cat}')
print(f'4. Category with the highest return rate: {highest_return_cat}')
print(f'5. Busiest month (highest revenue): Month {best_month}')

print('\nRECOMMENDATIONS:')
print(f'- Focus promotions on the {top_kategori} category in {top_kota}')
print(f'- Improve product quality in {highest_return_cat} to reduce the return rate')
print(f'- Plan for extra stock ahead of month {best_month}')

## Answer Key & Grading

This section contains the grading rubric for each question.

In [None]:
# Midterm grading rubric

rubrik = {
    'Question 1 - Apache Spark RDD (20 points)': {
        'Successfully initialize the SparkSession': 3,
        'Create an RDD and run basic operations (count, sum, mean)': 4,
        'WordCount with flatMap + map + reduceByKey': 7,
        'Key-value RDD + groupByKey + analysis': 6,
    },
    'Question 2 - Web Scraping & API (15 points)': {
        'GET request with error handling': 4,
        'Parsing JSON into a DataFrame': 4,
        'Analysis & aggregation of API data': 4,
        'Retry logic implementation': 3,
    },
    'Question 3 - Data Preprocessing (25 points)': {
        'Identify all data problems': 5,
        'Remove duplicates': 2,
        'Standardize formats': 3,
        'Handle Usia and Gaji outliers': 5,
        'Impute missing values with an appropriate method': 5,
        'Date parsing + feature engineering': 3,
        'Encoding + scaling': 2,
    },
    'Question 4 - EDA & Visualization (25 points)': {
        'Descriptive statistics per species': 5,
        'At least 5 types of visualization': 10,
        'Correlation analysis with interpretation': 5,
        'Outlier detection': 5,
    },
    'Question 5 - Integrated Analysis (15 points)': {
        'Dataset creation and cleaning': 3,
        'Feature engineering': 3,
        'EDA with summary statistics': 4,
        'Visualization dashboard (min. 4 plots)': 3,
        'Business insights & recommendations': 2,
    }
}

WIDTH = 72  # inner width of the box, in characters

def box_line(text=''):
    """Return one box row with the text left-aligned and padded to WIDTH."""
    return f'║{text:<{WIDTH}}║'

print('╔' + '═' * WIDTH + '╗')
print('║' + 'MIDTERM GRADING RUBRIC'.center(WIDTH) + '║')
print('║' + 'Big Data Analytics - Week 8'.center(WIDTH) + '║')
print('╠' + '═' * WIDTH + '╣')

total_bobot = 0
for soal, kriteria in rubrik.items():
    bobot_soal = sum(kriteria.values())
    total_bobot += bobot_soal
    print(box_line())
    print(box_line(f' {soal}'))
    for k, v in kriteria.items():
        print(box_line(f'   [{v:2d} pts] {k}'))

print('╠' + '═' * WIDTH + '╣')
print(box_line(f'  MAXIMUM TOTAL SCORE: {total_bobot} POINTS'))
print('╠' + '═' * WIDTH + '╣')
print(box_line('  Grade conversion:'))
print(box_line('  90-100 = A (Excellent)'))
print(box_line('  80-89  = B (Good)'))
print(box_line('  70-79  = C (Satisfactory)'))
print(box_line('  60-69  = D (Poor)'))
print(box_line('  < 60   = E (Fail)'))
print('╚' + '═' * WIDTH + '╝')

## Lab Assignment

Complete the following additional midterm practice exercises:

### Exercise 1: Hadoop & Spark Review
Explain with code or pseudocode:
- How does HDFS store large files? Illustrate with ASCII art
- How do `reduceByKey()` and `groupByKey()` differ in performance in Spark?
- Write sample Spark code that computes the average student score per faculty

### Exercise 2: Preprocessing Review
Given the following dataset, perform complete preprocessing:
```
Nama, Usia, Nilai, Kota, Kategori
Budi, 22, 85.5, jakarta, A
Sari, , 92.0, BANDUNG, B
Andi, 19, , surabaya, A
Maya, 250, 78.3, Yogyakarta, C
```
- Handle every problem you find
- Apply One-Hot Encoding to the Kota column
- Normalize the Nilai column to the range [0, 1]

### Exercise 3: EDA Review
Use the `diamonds` dataset from seaborn:
- Perform a complete EDA (statistics, distributions, correlation, outliers)
- Find 3 interesting insights in the data
- Create at least 4 visualizations of different types

### Exercise 4: NoSQL vs SQL Review
Write a comparison (in a markdown cell) and simulation code:
- Store the same data (student records) in SQL form (SQLite) and NoSQL form (dict-based)
- Show the same query executed on both systems
- Discuss when each one is the better fit

### Exercise 5: Midterm Essay Questions
Answer the following questions in a markdown cell:
1. A fintech startup processes 10 million transactions per day, each with 50+ attributes. Recommend a suitable data storage architecture and justify your choice.
2. Why is Spark faster than Hadoop MapReduce for iterative algorithms (such as ML training)? Explain with a diagram.
3. What is the CAP Theorem? Give an example of a real system for each category (CP, AP, CA).