# Shuffling the Row Order of a DataFrame

## Import Modules

In [1]:
import pandas as pd
import numpy as np

print('Pandas version:', pd.__version__)
print('Numpy version:', np.__version__)

Pandas version: 2.3.1
Numpy version: 2.3.2


## Preparing the DataFrame

In [2]:
n_rows = 6
n_cols = 5
cols = tuple('ABCDE')

df = pd.DataFrame(
    np.random.randint(1, 5, size=(n_rows, n_cols)),
    columns=cols
)

df

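|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 3 | 1 |
| 1 | 2 | 3 | 4 | 2 | 1 |
| 2 | 2 | 3 | 3 | 2 | 4 |
| 3 | 1 | 1 | 1 | 1 | 2 |
| 4 | 3 | 4 | 4 | 4 | 3 |
| 5 | 1 | 4 | 2 | 1 | 1 |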


## Shuffling the Row Order of the DataFrame

In [3]:
df.sample(frac=1.0, random_state=1)

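|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 2 | 2 | 3 | 3 | 2 | 4 |
| 1 | 2 | 3 | 4 | 2 | 1 |
| 4 | 3 | 4 | 4 | 4 | 3 |
| 0 | 1 | 1 | 1 | 3 | 1 |
| 3 | 1 | 1 | 1 | 1 | 2 |
| 5 | 1 | 4 | 2 | 1 | 1 |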


In [7]:
df.sample(frac=1, random_state=1).reset_index(drop=True)

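|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | 2 | 3 | 3 | 2 | 4 |
| 1 | 2 | 3 | 4 | 2 | 1 |
| 2 | 3 | 4 | 4 | 4 | 3 |
| 3 | 1 | 1 | 1 | 3 | 1 |
| 4 | 1 | 1 | 1 | 1 | 2 |
| 5 | 1 | 4 | 2 | 1 | 1 |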
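The index reset can also be folded into `sample()` itself. The following is a minimal sketch, assuming pandas 1.3 or newer (where the `ignore_index` parameter is available) and using a small stand-in DataFrame named `demo` rather than the random one above:

```python
import pandas as pd

# Small stand-in DataFrame (hypothetical values, not the notebook's random data)
demo = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6], "B": list("uvwxyz")})

# Shuffle, then renumber the index in a second step
two_step = demo.sample(frac=1, random_state=1).reset_index(drop=True)

# Shuffle and renumber in a single call
one_step = demo.sample(frac=1, random_state=1, ignore_index=True)

# Both approaches should give the same rows in the same order
print(two_step.equals(one_step))
```

Either form works; `ignore_index=True` simply saves the extra method call.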


## 📋 Summary: Shuffling the Row Order of a DataFrame

### 🎯 Key Concepts

**Shuffling the row order** is a common preprocessing step for **removing ordering bias from a dataset** and **preparing data for machine learning**. pandas provides the `sample()` method for this: it draws a random sample of rows, and with `frac=1` it returns every row exactly once in random order.
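To make the "every row exactly once" point concrete, here is a small sketch on made-up data (the `demo` frame and its columns are assumptions for illustration). Sorting the shuffled result by its index should recover the original frame:

```python
import pandas as pd

demo = pd.DataFrame({"id": [10, 20, 30, 40], "label": ["a", "b", "c", "d"]})
shuffled = demo.sample(frac=1, random_state=0)

# Shuffling changes only the order: same number of rows, same index labels
print(len(shuffled) == len(demo))

# Sorting by the preserved index labels restores the original order
restored = shuffled.sort_index()
print(restored["id"].tolist())
```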

### 🔧 The `sample()` Method: Parameters and Purpose

| Parameter | Description | Default | Example |
|-----------|-------------|---------|---------|
| **n** | Number of rows to draw (cannot be combined with `frac`) | None (one row is drawn if `frac` is also None) | `df.sample(n=10)` |
| **frac** | Fraction of the rows to draw; values above 1.0 require `replace=True` | None | `df.sample(frac=0.5)` |
| **replace** | Sample with replacement, so a row can be drawn more than once | False | `df.sample(n=10, replace=True)` |
| **weights** | Sampling weight for each row; normalized automatically if the weights do not sum to 1 | None (equal probability) | `df.sample(n=5, weights=df['weight'])` |
| **random_state** | Seed (or random generator) for reproducibility | None | `df.sample(frac=1, random_state=42)` |
| **axis** | 0 to sample rows, 1 to sample columns | None (rows for a DataFrame) | `df.sample(axis=1)` |
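The parameters in the table can be tried side by side. This is only a sketch: the `demo` frame and its `weight` column are hypothetical, chosen so that every call is valid on six rows.

```python
import pandas as pd

demo = pd.DataFrame({
    "value": [1, 2, 3, 4, 5, 6],
    "weight": [0, 0, 1, 1, 2, 2],  # hypothetical weights; rows with weight 0 are never drawn
})

three_rows = demo.sample(n=3, random_state=0)                  # exactly 3 distinct rows
half = demo.sample(frac=0.5, random_state=0)                   # 50% of the rows
with_repl = demo.sample(n=10, replace=True, random_state=0)    # more rows than the source has
weighted = demo.sample(n=2, weights="weight", random_state=0)  # weights read from a column
one_col = demo.sample(n=1, axis=1, random_state=0)             # sample a column instead of rows

print(len(three_rows), len(half), len(with_repl), one_col.shape)
```

Note that `n` larger than the number of rows only works together with `replace=True`.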

### 💡 Practical Examples from the Notebook

```python
# 1. Shuffle all rows, keeping the original index labels
df.sample(frac=1.0, random_state=1)

# 2. Shuffle and reset to a clean integer index
df.sample(frac=1, random_state=1).reset_index(drop=True)
```

### 🔍 Shuffling Techniques

#### **1. Full Shuffle**
```python
# Shuffle all rows
shuffled_df = df.sample(frac=1.0, random_state=42)

# Reset the index for clean numbering
clean_shuffle = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
```

#### **2. Partial Sampling**
```python
# Take a random 50% of the data
half_data = df.sample(frac=0.5, random_state=42)

# Take 100 random rows (needs len(df) >= 100 unless replace=True)
subset = df.sample(n=100, random_state=42)
```

#### **3. Stratified Sampling**
```python
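# Proportional sampling per category
def stratified_sample(dataframe, column, frac=0.5, random_state=42):
    # GroupBy.sample draws the same fraction from every group, so class
    # proportions are preserved. It also avoids the deprecation warning that
    # groupby().apply() emits in pandas 2.2+ when the grouping column is used.
    return (
        dataframe.groupby(column)
        .sample(frac=frac, random_state=random_state)
        .reset_index(drop=True)
    )

# Usage (assumes df has a 'category' column)
stratified = stratified_sample(df, 'category', frac=0.3)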
```
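To check that per-category sampling keeps the class proportions, here is a minimal sketch on made-up data. The `category` and `value` columns are assumptions for illustration, and the sketch relies on `GroupBy.sample`, which is available since pandas 1.1:

```python
import pandas as pd

# Hypothetical imbalanced data: 80 rows of class "x", 20 rows of class "y"
data = pd.DataFrame({
    "category": ["x"] * 80 + ["y"] * 20,
    "value": range(100),
})

# Draw 50% of the rows from each category
sampled = data.groupby("category").sample(frac=0.5, random_state=42)

# The class shares before and after sampling should match
print(data["category"].value_counts(normalize=True).to_dict())
print(sampled["category"].value_counts(normalize=True).to_dict())
```

A plain `data.sample(frac=0.5)` would only match these proportions on average, not exactly.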

### 🆚 Alternative Shuffling Methods

| Method | Syntax | Advantages | Drawbacks |
|--------|--------|------------|-----------|
| **sample()** | `df.sample(frac=1)` | Simple, built in, many options | None of note |
| **NumPy permutation** | `df.iloc[np.random.permutation(len(df))]` | Works on row positions; the permutation can be reused for other arrays | Uses NumPy's global random state unless it is seeded |
| **sklearn shuffle** | `from sklearn.utils import shuffle`, then `shuffle(df)` | Shuffles several arrays consistently; fits into ML pipelines | External dependency |
| **reindex** | `df.reindex(np.random.permutation(df.index))` | Works on index labels | More verbose; requires a unique index |
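The pandas and NumPy variants from the table can be compared directly. The sketch below uses a hypothetical `demo` frame with string labels and a seeded `np.random.default_rng` generator instead of the global random state; the scikit-learn variant is left out so the example has no extra dependency.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"A": [1, 2, 3, 4, 5]}, index=list("vwxyz"))
rng = np.random.default_rng(42)  # seeded generator for reproducibility

# 1. Built-in sampling
by_sample = demo.sample(frac=1, random_state=42)

# 2. Permute row positions
by_position = demo.iloc[rng.permutation(len(demo))]

# 3. Permute index labels (labels must be unique)
by_label = demo.reindex(rng.permutation(demo.index.to_numpy()))

for name, result in [("sample", by_sample), ("iloc", by_position), ("reindex", by_label)]:
    print(name, result.index.tolist())
```

All three return the same rows in some random order; they differ only in how the order is generated.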

### 📊 Use Cases in Data Science

#### **1. Machine Learning Preparation**
```python
# Shuffle before the train-test split
shuffled_data = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Train-test split
train_size = int(0.8 * len(shuffled_data))
train_data = shuffled_data[:train_size]
test_data = shuffled_data[train_size:]
```

#### **2. Cross-Validation Setup**
```python
# K-fold preparation with shuffling
from sklearn.model_selection import KFold

# Shuffle data first
shuffled_df = df.sample(frac=1, random_state=42).reset_index(drop=True)
kfold = KFold(n_splits=5, shuffle=False)  # Already shuffled
```

#### **3. A/B Testing**
```python
# Random assignment for an A/B test
shuffled_users = users_df.sample(frac=1, random_state=42)
group_a = shuffled_users[:len(shuffled_users)//2]
group_b = shuffled_users[len(shuffled_users)//2:]
```

#### **4. Data Augmentation**
```python
# Bootstrap sample: draw len(df) rows with replacement (some rows repeat, some are left out)
bootstrap_sample = df.sample(n=len(df), replace=True, random_state=42)
```
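A single bootstrap sample only repeats existing rows; its usual use is to repeat the resampling many times and look at the spread of a statistic. A minimal sketch with made-up values (the `demo` frame and the choice of 200 resamples are assumptions):

```python
import pandas as pd

demo = pd.DataFrame({"value": [2, 4, 4, 5, 7, 9, 10, 12]})

# Draw 200 bootstrap samples and record the mean of each one.
# Each resample uses a different seed so that the samples differ.
boot_means = pd.Series([
    demo.sample(n=len(demo), replace=True, random_state=seed)["value"].mean()
    for seed in range(200)
])

print("sample mean:", demo["value"].mean())
print("bootstrap standard error:", boot_means.std())
```

Reusing one fixed `random_state` inside the loop would produce 200 identical samples, so the seed has to vary.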

### 🎯 Best Practices for Random State

```python
# ✅ Always set random_state for reproducibility
df.sample(frac=1, random_state=42)

# ✅ Use one consistent seed across the whole project
RANDOM_STATE = 42
train_data = df.sample(frac=0.8, random_state=RANDOM_STATE)
test_data = df.drop(train_data.index)  # remaining rows (requires a unique index)

# ✅ Document the random state for reproducible research
def shuffle_data(dataframe, random_state=42):
    """
    Shuffle DataFrame rows randomly
    
    Args:
        dataframe: Input DataFrame
        random_state: Seed for reproducibility (default: 42)
    
    Returns:
        Shuffled DataFrame with reset index
    """
    return dataframe.sample(frac=1, random_state=random_state).reset_index(drop=True)
```
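The effect of a fixed seed is easy to verify. A small sketch on a hypothetical ten-row frame:

```python
import pandas as pd

demo = pd.DataFrame({"A": range(10)})

first = demo.sample(frac=1, random_state=42)
second = demo.sample(frac=1, random_state=42)  # same seed, same order
other = demo.sample(frac=1, random_state=7)    # different seed, usually a different order

print(first.index.equals(second.index))
print(first.index.equals(other.index))
```

Without `random_state`, each call draws a fresh order, so results cannot be reproduced between runs.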

### ⚠️ Important Considerations

#### **1. Index Handling**
```python
# ⚠️ Keeping the original index labels can cause confusion
shuffled = df.sample(frac=1, random_state=1)
print(shuffled.index.tolist())  # [2, 1, 4, 0, 3, 5], as in the notebook output above

# ✅ Reset the index for clean numbering
shuffled_clean = df.sample(frac=1, random_state=1).reset_index(drop=True)
print(shuffled_clean.index.tolist())  # [0, 1, 2, 3, 4, 5]
```

#### **2. Time Series Data**
```python
# ⚠️ Be careful with time series: do not shuffle!
# Temporal order matters for forecasting

# ✅ For time series, use time-based splits
train_data = df[df['date'] < '2024-01-01']
test_data = df[df['date'] >= '2024-01-01']
```
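A runnable version of the time-based split, as a sketch on a hypothetical daily series (the `ts` frame and its `date` and `value` columns are assumptions):

```python
import pandas as pd

# Hypothetical daily series that crosses the cutoff date
ts = pd.DataFrame({
    "date": pd.date_range("2023-12-28", periods=8, freq="D"),
    "value": range(8),
})

cutoff = pd.Timestamp("2024-01-01")
ts = ts.sort_values("date")  # make sure rows are in chronological order
train = ts[ts["date"] < cutoff]
test = ts[ts["date"] >= cutoff]

# Every training row precedes every test row, so no future data leaks into training
print(train["date"].max() < test["date"].min())
```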

#### **3. Memory Considerations**
```python
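# ⚠️ Chunk-wise shuffling only mixes rows within each chunk, so the result is
# a local (approximate) shuffle. It does not lower peak memory once the
# DataFrame is already loaded.
def shuffle_large_dataset(dataframe, chunk_size=10000, random_state=42):
    # One shared generator, so each chunk gets a different permutation
    rng = np.random.default_rng(random_state)
    chunks = []
    for i in range(0, len(dataframe), chunk_size):
        chunk = dataframe.iloc[i:i+chunk_size]
        shuffled_chunk = chunk.sample(frac=1, random_state=rng)
        chunks.append(shuffled_chunk)
    
    return pd.concat(chunks, ignore_index=True)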
```
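Shuffling chunk by chunk only reorders rows inside each chunk. When the DataFrame fits in memory, a uniform shuffle over all rows can be sketched by permuting row positions instead. The helper name `shuffle_by_position` is hypothetical:

```python
import numpy as np
import pandas as pd

def shuffle_by_position(dataframe, random_state=42):
    """Return all rows in a uniformly random order with a fresh 0..n-1 index."""
    rng = np.random.default_rng(random_state)
    positions = rng.permutation(len(dataframe))  # one integer array, cheap to store or reuse
    return dataframe.iloc[positions].reset_index(drop=True)

demo = pd.DataFrame({"A": range(1000)})
shuffled = shuffle_by_position(demo)
print(shuffled.head())
```

The permutation array is small compared with the data, so it can be saved and reapplied later if the same order is needed again.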

### 🚀 Advanced Shuffling Techniques

#### **1. Weighted Random Sampling**
```python
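# Sampling based on weights (e.g. importance or frequency)
# pandas normalizes the weights itself, so dividing by the sum is optional
weights = df['importance_score'] / df['importance_score'].sum()
# n=1000 needs at least 1000 rows with non-zero weight unless replace=True
weighted_sample = df.sample(n=1000, weights=weights, random_state=42)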
```
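The influence of the weights shows up clearly when sampling with replacement. In this sketch the `demo` frame and its `importance_score` values are made up, and the weights are deliberately left unnormalized:

```python
import pandas as pd

demo = pd.DataFrame({
    "item": list("abcd"),
    "importance_score": [0, 1, 1, 8],  # hypothetical scores; 'a' has weight 0
})

# Weights do not need to sum to 1; pandas normalizes them
picks = demo.sample(n=1000, replace=True, weights="importance_score", random_state=42)

# 'd' should dominate and 'a' should never appear
print(picks["item"].value_counts().to_dict())
```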

#### **2. Conditional Shuffling**
```python
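# Shuffle rows within each group (rows of a group stay together)
def shuffle_within_groups(dataframe, group_column, random_state=42):
    # GroupBy.sample avoids the deprecation warning that groupby().apply()
    # emits in pandas 2.2+ when the grouping column is used
    return (
        dataframe.groupby(group_column)
        .sample(frac=1, random_state=random_state)
        .reset_index(drop=True)
    )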
```
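A quick check of what shuffling within groups does, sketched on a hypothetical two-group frame and using `GroupBy.sample` (pandas 1.1+):

```python
import pandas as pd

demo = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [1, 2, 3, 4, 5, 6],
})

# Rows stay next to their group; only the order inside each group changes
shuffled = demo.groupby("group").sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled)
```

This is useful when group membership must stay contiguous, for example when each group is later processed as one block.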

#### **3. Balanced Shuffling**
```python
# Downsample each class to at most n_samples_per_class rows, then shuffle the result
def balanced_shuffle(dataframe, target_column, n_samples_per_class=100):
    balanced_data = []
    for class_value in dataframe[target_column].unique():
        class_data = dataframe[dataframe[target_column] == class_value]
        sampled = class_data.sample(n=min(n_samples_per_class, len(class_data)), 
                                  random_state=42)
        balanced_data.append(sampled)
    
    return pd.concat(balanced_data, ignore_index=True).sample(frac=1, random_state=42)
```

### 🔍 Performance Tips

| Scenario | Recommendation | Reason |
|----------|----------------|--------|
| **Small data (<10K rows)** | `df.sample(frac=1)` | Simple and fast enough |
| **Medium data (10K-1M rows)** | `df.sample(frac=1).reset_index(drop=True)` | Still a single in-memory copy; the reset index keeps later positional logic simple |
| **Large data (>1M rows)** | Permute row positions with NumPy, or shuffle while reading in chunks | Every full shuffle copies the data, so keep the number of shuffled copies low |
| **Repeated shuffling** | Cache the shuffled indices | Avoids recomputing the permutation |
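Caching the shuffled indices can be sketched as follows. The `features` and `labels` objects are hypothetical and assumed to be aligned row by row; the same stored order is applied to both so the pairs stay matched:

```python
import numpy as np
import pandas as pd

features = pd.DataFrame({"x": range(5)})
labels = pd.Series(list("abcde"), name="y")  # row i of labels belongs to row i of features

# Compute the permutation once and keep it
rng = np.random.default_rng(42)
order = rng.permutation(len(features))

# Reuse the same order for every aligned object
features_shuffled = features.iloc[order].reset_index(drop=True)
labels_shuffled = labels.iloc[order].reset_index(drop=True)

print(pd.concat([features_shuffled, labels_shuffled], axis=1))
```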

### 🎯 Key Takeaways

- ✅ **`sample(frac=1.0)`** is the standard way to shuffle all rows
- ✅ **Always set `random_state`** for reproducible results
- ✅ **`reset_index(drop=True)`** gives a clean integer index
- ✅ **Be careful with time series**: do not shuffle data with temporal dependencies
- ✅ **Stratified sampling** preserves class proportions, which matters for imbalanced datasets
- ✅ **Document random seeds** for reproducible research
- ✅ **Consider memory usage** for large datasets

**Random shuffling is a fundamental preprocessing step for reducing ordering bias in machine learning!** 🚀