# Data Transformation and Aggregation for Seaborn

## Seminar Paper - SISJ

**Authors:** Mihajlovic Luka 2020/0136, Ilic Andrija 2020/0236  
**Date:** 29.08.2025.

---

### **What we will learn:**

In this notebook we explore **advanced data transformation techniques** specific to Seaborn visualization:

1. **🔄 Data Transformation** - preparing data for optimal plots
2. **📊 Aggregation Functions** - summarizing data for insights
3. **🎯 Groupby Operations** - combining them with Seaborn functions
4. **📈 Statistical Transformations** - log scale, normalization, binning
5. **🔗 Multi-level Data** - hierarchical indexing for complex visualizations

### **Why this matters:**
- Raw data is rarely ready for visualization
- Transformation can reveal hidden patterns
- Aggregation enables high-level insights
- Good preparation = better plots

---

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
import warnings
from datetime import datetime, timedelta

# Setup
sns.set_theme()
warnings.filterwarnings('ignore')
plt.rcParams['figure.figsize'] = (12, 8)

print("✅ Libraries loaded!")
print(f"Seaborn: {sns.__version__}")
print(f"Pandas: {pd.__version__}")
print(f"NumPy: {np.__version__}")

# Load the starting datasets
flights = sns.load_dataset('flights')
tips = sns.load_dataset('tips')
penguins = sns.load_dataset('penguins').dropna()

print(f"\n📊 LOADED DATA:")
print(f"Flights: {flights.shape} - airline passenger data")
print(f"Tips: {tips.shape} - restaurant tips data")
print(f"Penguins: {penguins.shape} - penguin measurements")

# Data preview
print(f"\n🔍 FLIGHTS PREVIEW:")
print(flights.head())
print(f"Year range: {flights['year'].min()}-{flights['year'].max()}")
# 'month' is a categorical in calendar order; sorted() would order the labels alphabetically
print(f"Months: {flights['month'].cat.categories.tolist()}")

### 1. **Aggregating data for visualization**

Seaborn often aggregates data automatically (for example, `lineplot` and `barplot` draw the mean with a confidence interval by default), but **explicit aggregation** gives us:
- **Full control** over how the data is summarized
- **Performance benefits** for large datasets
- **Cleaner data** for complex visualizations

**Key pandas aggregations:**
- `groupby().agg()` - flexible aggregation
- `pivot_table()` - cross-tabulation with aggregation
- `resample()` - time-based aggregation
- `cut()` / `qcut()` - binning continuous variables
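
Before the full example on the `flights` data, here is a minimal sketch of three of these calls on a tiny hand-made table. The DataFrame `df` and its values are hypothetical and exist only for illustration; they are not part of the notebook's datasets.

```python
import pandas as pd

# Hypothetical toy data (not from the notebook): monthly passengers for two years
df = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001, 2001],
    "month": ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar"],
    "passengers": [100, 120, 140, 110, 130, 160],
})

# groupby().agg() with named aggregation: one output column per keyword
yearly = df.groupby("year").agg(
    total=("passengers", "sum"),
    average=("passengers", "mean"),
).reset_index()

# pivot_table(): months as rows, years as columns (wide form)
wide = df.pivot_table(values="passengers", index="month",
                      columns="year", aggfunc="sum")

# cut() makes equal-width bins; qcut() makes equal-count (quantile) bins
df["width_bin"] = pd.cut(df["passengers"], bins=2, labels=["low", "high"])
df["count_bin"] = pd.qcut(df["passengers"], q=2, labels=["low", "high"])

print(yearly)
print(wide)
print(df[["passengers", "width_bin", "count_bin"]])
```

The long-form `yearly` table suits functions such as `sns.barplot`, while the wide-form `wide` table is the shape `sns.heatmap` expects.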

In [None]:
# 1. BASIC AGGREGATION with groupby
print("📊 BASIC AGGREGATION EXAMPLES:")

# Aggregate the flights data by year
yearly_flights = flights.groupby('year')['passengers'].agg([
    'count',    # number of months
    'sum',      # total passengers
    'mean',     # average monthly passengers
    'std',      # standard deviation
    'min',      # smallest value
    'max'       # largest value
])

print("Yearly flight statistics:")
print(yearly_flights.round(1))

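# Multi-level aggregation (named aggregation; passing a renaming dict to
# SeriesGroupBy.agg raises SpecificationError in pandas >= 1.0)
monthly_stats = flights.groupby(['year', 'month'], observed=True)['passengers'].agg(
    total_passengers='sum'
).reset_index()
# Each (year, month) group holds a single row, so growth is computed
# year over year for the same month instead of inside the group
monthly_stats['growth_rate'] = (
    monthly_stats.groupby('month', observed=True)['total_passengers'].pct_change() * 100
)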

print(f"\nMonthly stats shape: {monthly_stats.shape}")
print(monthly_stats.head())

# Pivot table aggregation
pivot_flights = flights.pivot_table(
    values='passengers',
    index='month', 
    columns='year', 
    aggfunc='sum'
)

print(f"\nPivot table shape: {pivot_flights.shape}")
print("Pivot table sample:")
print(pivot_flights.head())

# Plot the aggregated data
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Aggregating Data for Visualization', fontsize=16)

# 1. Yearly trend from the raw data (lineplot aggregates to mean + CI itself)
sns.lineplot(data=flights, x='year', y='passengers', ax=axes[0,0])
axes[0,0].set_title('Original Data (Seaborn mean + CI)')
axes[0,0].set_ylabel('Passengers')

# 2. Aggregated yearly total
yearly_sum = flights.groupby('year')['passengers'].sum().reset_index()
sns.barplot(data=yearly_sum, x='year', y='passengers', ax=axes[0,1])
axes[0,1].set_title('Aggregated: Yearly Total')
axes[0,1].set_ylabel('Total Passengers')

# 3. Heatmap sa pivot table
sns.heatmap(pivot_flights, cmap='YlOrRd', ax=axes[1,0])
axes[1,0].set_title('Pivot Table Heatmap')
axes[1,0].set_xlabel('Year')
axes[1,0].set_ylabel('Month')

# 4. Statistical summary
monthly_means = flights.groupby('month')['passengers'].mean()
sns.barplot(x=monthly_means.index, y=monthly_means.values, ax=axes[1,1])
axes[1,1].set_title('Average Passengers by Month')
axes[1,1].set_xlabel('Month')
axes[1,1].set_ylabel('Avg Passengers')

plt.tight_layout()
plt.show()

print("\n🔍 INSIGHTS FROM AGGREGATION:")
print(f"Busiest year: {yearly_sum.loc[yearly_sum['passengers'].idxmax(), 'year']}")
print(f"Busiest month: {monthly_means.idxmax()}")
print(f"Average yearly growth: {yearly_flights['sum'].pct_change().mean()*100:.1f}%")

### 2. **Statistical Transformations**

**Transformations** change the scale or distribution of the data in order to:
- **Reduce skewness** in skewed data
- **Reduce the impact of outliers**
- **Linearize relationships**
- **Standardize scales**

**Common transformations:**
- **Log transformation** - for positively skewed data
- **Square root** - reduces skewness (milder than the log)
- **Standardization (z-score)** - mean=0, std=1
- **Min-Max scaling** - scale to [0,1] range
- **Quantile binning** - convert continuous to categorical
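
The sketch below applies each of these transformations to a short, hypothetical right-skewed sample (the values in `x` are made up for illustration). It also shows a point that is easy to miss: z-scores and min-max scaling are linear rescalings, so they change the axis but not the shape of the distribution, while the log and square root actually reduce skewness.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample (values chosen only for illustration)
x = pd.Series([1.0, 2.0, 2.0, 3.0, 4.0, 8.0, 30.0])

log_x = np.log1p(x)                            # log(1 + x), defined at x = 0
sqrt_x = np.sqrt(x)                            # milder compression than the log
z = (x - x.mean()) / x.std(ddof=0)             # z-score: mean 0, std 1
minmax = (x - x.min()) / (x.max() - x.min())   # rescaled to [0, 1]
quartile = pd.qcut(x, q=4, labels=False)       # integer quartile code per value

# Compare sample skewness before and after each transformation
skewness = pd.Series({
    "original": x.skew(),
    "sqrt": sqrt_x.skew(),
    "log1p": log_x.skew(),
    "z-score": z.skew(),
    "min-max": minmax.skew(),
})
print(skewness.round(3))
print(quartile.tolist())
```

In this sample the skewness drops from the original to the square root to the log, and stays unchanged for the z-score and min-max versions.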

In [None]:
# 2. STATISTICAL TRANSFORMATIONS
print("🧮 STATISTICAL TRANSFORMATIONS DEMO:")

# Create skewed data for the demonstration
np.random.seed(42)
n = 1000
skewed_data = pd.DataFrame({
    'original': np.random.exponential(2, n),  # Positively skewed
    'group': np.random.choice(['A', 'B', 'C'], n)
})

# Apply the different transformations
skewed_data['log_transformed'] = np.log1p(skewed_data['original'])  # log(1+x)
skewed_data['sqrt_transformed'] = np.sqrt(skewed_data['original'])
skewed_data['standardized'] = stats.zscore(skewed_data['original'])
skewed_data['minmax_scaled'] = ((skewed_data['original'] - skewed_data['original'].min()) / 
                                (skewed_data['original'].max() - skewed_data['original'].min()))

# Binning: convert the continuous variable into categories
skewed_data['quantile_bins'] = pd.qcut(skewed_data['original'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
skewed_data['equal_bins'] = pd.cut(skewed_data['original'], bins=4, labels=['Low', 'Mid-Low', 'Mid-High', 'High'])

print(f"Original data shape: {skewed_data.shape}")
print("Transformation statistics:")
transformation_stats = skewed_data[[
    'original', 'log_transformed', 'sqrt_transformed', 'standardized', 'minmax_scaled'
]].describe()
print(transformation_stats.round(3))

# Visualize the transformations
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
fig.suptitle('Statistical Data Transformations', fontsize=16)

# 1. Original distribution
sns.histplot(data=skewed_data, x='original', kde=True, ax=axes[0,0])
axes[0,0].set_title('Original (Skewed)')
axes[0,0].set_ylabel('Frequency')

# 2. Log transformation
sns.histplot(data=skewed_data, x='log_transformed', kde=True, ax=axes[0,1])
axes[0,1].set_title('Log Transformed')
axes[0,1].set_ylabel('Frequency')

# 3. Square root transformation
sns.histplot(data=skewed_data, x='sqrt_transformed', kde=True, ax=axes[0,2])
axes[0,2].set_title('Square Root Transformed')
axes[0,2].set_ylabel('Frequency')

# 4. Standardized (z-score)
sns.histplot(data=skewed_data, x='standardized', kde=True, ax=axes[1,0])
axes[1,0].set_title('Standardized (Z-score)')
axes[1,0].axvline(0, color='red', linestyle='--', alpha=0.7, label='Mean=0')
axes[1,0].legend()
axes[1,0].set_ylabel('Frequency')

# 5. Min-Max scaled
sns.histplot(data=skewed_data, x='minmax_scaled', kde=True, ax=axes[1,1])
axes[1,1].set_title('Min-Max Scaled [0,1]')
axes[1,1].set_ylabel('Frequency')

# 6. Binned data
sns.countplot(data=skewed_data, x='quantile_bins', ax=axes[1,2])
axes[1,2].set_title('Quantile Bins (Categorical)')
axes[1,2].set_ylabel('Count')

plt.tight_layout()
plt.show()

# Demonstrate the impact of a transformation on the visualization
print("\n📈 IMPACT ON VISUALIZATION:")

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
fig.suptitle('How Transformation Affects Relationship Patterns', fontsize=14)

# Create correlated variables (each is linear in its own x by construction)
skewed_data['y_original'] = skewed_data['original'] * 2 + np.random.normal(0, 1, n)
skewed_data['y_log'] = skewed_data['log_transformed'] * 2 + np.random.normal(0, 0.5, n)

# Original scale relationship
sns.scatterplot(data=skewed_data, x='original', y='y_original', alpha=0.6, ax=axes[0])
axes[0].set_title('Original Scale\n(points crowded near zero)')

# Log scale relationship
sns.scatterplot(data=skewed_data, x='log_transformed', y='y_log', alpha=0.6, ax=axes[1])
axes[1].set_title('Log Scale\n(points spread more evenly)')

# Binned relationship
sns.boxplot(data=skewed_data, x='quantile_bins', y='original', ax=axes[2])
axes[2].set_title('Binned Analysis\n(Categorical approach)')
axes[2].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# Skewness analysis
transformations = ['original', 'log_transformed', 'sqrt_transformed', 'standardized']
skewness_results = {}

for col in transformations:
    skew_val = stats.skew(skewed_data[col])
    skewness_results[col] = skew_val

print("\n📊 SKEWNESS COMPARISON:")
for transform, skew_val in skewness_results.items():
    interpretation = "Symmetric" if abs(skew_val) < 0.5 else "Moderately skewed" if abs(skew_val) < 1 else "Highly skewed"
    print(f"{transform:18}: {skew_val:6.3f} ({interpretation})")

print("\n✅ BEST PRACTICES:")
print("- Log transform for positively skewed data")
print("- Standardization for scale-sensitive machine learning models")
print("- Min-max scaling when inputs need a bounded range (e.g. neural networks)")
print("- Binning for interpretability")

### 3. **Advanced Groupby Operations**

Combining **pandas groupby** with **Seaborn** enables:
- **Custom aggregation functions**
- **Multiple statistics** in a single step
- **Conditional aggregations**
- **Time-based grouping**

**Advanced techniques:**
- `transform()` - keeps original shape
- `apply()` - custom functions per group
- `agg()` with dictionaries - multiple functions
- `filter()` - filter entire groups
- `resample()` - time-based aggregation
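
The difference between these methods is mostly about the shape of the result. The sketch below runs `agg()`, `transform()`, `filter()`, and `apply()` on a small hypothetical table (the `day` and `tip` values are invented for illustration) so the output shapes can be compared directly.

```python
import pandas as pd

# Hypothetical toy data: six bills over three days
df = pd.DataFrame({
    "day": ["Thu", "Thu", "Thu", "Fri", "Sat", "Sat"],
    "tip": [2.0, 3.0, 4.0, 1.0, 5.0, 7.0],
})

# agg(): reduces each group to one row
per_day = df.groupby("day")["tip"].agg(["mean", "max"])

# transform(): returns a value for every original row (same length as df)
df["day_mean"] = df.groupby("day")["tip"].transform("mean")
df["above_day_mean"] = df["tip"] > df["day_mean"]

# filter(): keeps or drops whole groups; here, days with at least 2 bills
busy = df.groupby("day").filter(lambda g: len(g) >= 2)

# apply(): arbitrary function per group; here, the tip range
tip_range = df.groupby("day")["tip"].apply(lambda s: s.max() - s.min())

print(per_day)
print(df)
print(busy["day"].unique())
print(tip_range)
```

Because `transform()` keeps the original row count, its output can be used directly as a `hue` or `x` column in a Seaborn call on the unaggregated data.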

In [None]:
# 3. ADVANCED GROUPBY OPERATIONS
print("🎯 ADVANCED GROUPBY TECHNIQUES:")

# Prepare richer data - the tips dataset with additional features
tips_enhanced = tips.copy()
tips_enhanced['tip_percentage'] = (tips_enhanced['tip'] / tips_enhanced['total_bill']) * 100
# Open-ended last bin so that bills above 50 do not become NaN
tips_enhanced['bill_category'] = pd.cut(tips_enhanced['total_bill'], 
                                       bins=[0, 15, 25, np.inf], 
                                       labels=['Small', 'Medium', 'Large'])
tips_enhanced['party_size_group'] = tips_enhanced['size'].apply(
    lambda x: 'Solo/Pair' if x <= 2 else 'Small Group' if x <= 4 else 'Large Group'
)

print(f"Enhanced tips dataset: {tips_enhanced.shape}")
print(f"New columns: {list(tips_enhanced.columns[-3:])}")

# 1. CUSTOM AGGREGATION FUNCTIONS
def custom_stats(series):
    return pd.Series({
        'count': len(series),
        'mean': series.mean(),
        'std': series.std(),
        'min': series.min(),
        'max': series.max(),
        'range': series.max() - series.min(),
        'iqr': series.quantile(0.75) - series.quantile(0.25),
        'cv': (series.std() / series.mean()) * 100  # Coefficient of variation
    })

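# Apply the custom function; unstack() turns the (day, statistic) index
# into one column per statistic so that 'cv' can be plotted later
day_stats = tips_enhanced.groupby('day', observed=True)['tip_percentage'].apply(custom_stats).unstack()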
print("\nCustom statistics by day:")
print(day_stats.round(2))

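# 2. MULTIPLE AGGREGATIONS with different functions per column
# Named helpers give readable flattened column names; 'mode' is not a
# groupby reduction, so the most frequent value is picked explicitly
def q90(x):
    return x.quantile(0.9)

def most_common(x):
    return x.mode().iloc[0]

multi_agg = tips_enhanced.groupby(['day', 'time'], observed=True).agg({
    'total_bill': ['count', 'mean', 'sum'],
    'tip': ['mean', 'std', 'max'],
    'tip_percentage': ['mean', 'median', q90],
    'size': ['mean', most_common]
})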

# Flatten column names
multi_agg.columns = ['_'.join(col).strip() for col in multi_agg.columns]
multi_agg = multi_agg.reset_index()

print(f"\nMulti-aggregation result shape: {multi_agg.shape}")
print("Sample columns:", list(multi_agg.columns[:8]))

# 3. TRANSFORM - adds the group-level aggregate back to the original dataset
tips_enhanced['day_avg_tip_pct'] = tips_enhanced.groupby('day')['tip_percentage'].transform('mean')
tips_enhanced['above_day_average'] = tips_enhanced['tip_percentage'] > tips_enhanced['day_avg_tip_pct']

# 4. CONDITIONAL AGGREGATION
def conditional_agg(group):
    return pd.Series({
        'high_tippers_count': (group['tip_percentage'] > 20).sum(),
        'high_tippers_pct': (group['tip_percentage'] > 20).mean() * 100,
        'avg_bill_high_tippers': group[group['tip_percentage'] > 20]['total_bill'].mean(),
        'total_revenue': group['total_bill'].sum()
    })

# observed=True skips (day, time) combinations that never occur in the data
conditional_results = tips_enhanced.groupby(['day', 'time'], observed=True).apply(conditional_agg).reset_index()
print(f"\nConditional aggregation shape: {conditional_results.shape}")
print(conditional_results.head())

# Visualize the advanced groupby results
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Advanced Groupby with Seaborn Visualizations', fontsize=16)

# 1. Custom statistics visualization
day_stats_viz = day_stats.reset_index()
sns.barplot(data=day_stats_viz, x='day', y='cv', ax=axes[0,0])
axes[0,0].set_title('Coefficient of Variation\n(Tip % by Day)')
axes[0,0].set_ylabel('CV (%)')

# 2. Transform results - comparing individual vs group average
sns.scatterplot(data=tips_enhanced, x='day_avg_tip_pct', y='tip_percentage', 
                hue='above_day_average', alpha=0.7, ax=axes[0,1])
axes[0,1].plot([10, 25], [10, 25], 'r--', alpha=0.5)
axes[0,1].set_title('Individual vs Group Average')
axes[0,1].set_xlabel('Day Average Tip %')
axes[0,1].set_ylabel('Individual Tip %')

# 3. Multi-aggregation results
sns.barplot(data=multi_agg, x='day', y='total_bill_mean', hue='time', ax=axes[1,0])
axes[1,0].set_title('Average Bill by Day & Time')
axes[1,0].set_ylabel('Average Bill ($)')

# 4. Conditional aggregation results
sns.scatterplot(data=conditional_results, x='total_revenue', y='high_tippers_pct', 
                hue='day', style='time', s=100, ax=axes[1,1])
axes[1,1].set_title('Revenue vs High Tippers %')
axes[1,1].set_xlabel('Total Revenue ($)')
axes[1,1].set_ylabel('High Tippers %')

plt.tight_layout()
plt.show()

# Filter groups example: keep only days with more than 50 records
high_volume_days = tips_enhanced.groupby('day').filter(lambda x: len(x) > 50)
print(f"\n🔍 FILTER RESULTS:")
print(f"Original dataset: {len(tips_enhanced)} records")
print(f"High volume days only: {len(high_volume_days)} records")
print("Dropped days with 50 or fewer records")

print("\n💡 ADVANCED GROUPBY INSIGHTS:")
print(f"Most variable day (CV): {day_stats_viz.loc[day_stats_viz['cv'].idxmax(), 'day']}")
print(f"Share of customers tipping >20%: {(tips_enhanced['tip_percentage'] > 20).mean()*100:.1f}%")
print(f"Day with the highest share of high tippers: {conditional_results.loc[conditional_results['high_tippers_pct'].idxmax(), 'day']}")

### 4. **Multi-level Data and Hierarchical Indexing**

A **MultiIndex** (hierarchical index) enables:
- **Complex data structures** with several levels of grouping
- **Efficient storage** for high-dimensional data
- **Advanced slicing** and filtering
- **Sophisticated visualizations** with Seaborn

**Use cases:**
- Time series with multiple variables
- Geographic data with sub-regions  
- A/B test results across segments
- Financial data (stocks, sectors, time)
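
As a minimal sketch of what hierarchical indexing looks like in practice, the example below builds a small three-level index and shows the selections and reshapes used later in this section. The level names and the `Sales` values are hypothetical and chosen so the results are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical three-level index: 2 years x 2 quarters x 2 regions = 8 rows
index = pd.MultiIndex.from_product(
    [[2023, 2024], ["Q1", "Q2"], ["North", "South"]],
    names=["Year", "Quarter", "Region"],
)
sales = pd.DataFrame({"Sales": np.arange(1, 9) * 100.0}, index=index)

# Select everything under one value of the outermost level
sales_2024 = sales.loc[2024]

# Cross-section on an inner level
north = sales.xs("North", level="Region")

# Aggregate by index levels, then unstack one level into columns (wide form)
by_year_quarter = (
    sales.groupby(level=["Year", "Quarter"])["Sales"].sum().unstack("Quarter")
)

# Most Seaborn functions expect long form: move index levels into columns
long_form = sales.reset_index()

print(by_year_quarter)
print(long_form.head())
```

The wide `by_year_quarter` table has the shape `sns.heatmap` expects, while `long_form` can be passed as `data=` to functions such as `sns.lineplot` or `sns.barplot`.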

In [None]:
# 4. MULTI-LEVEL DATA & HIERARCHICAL INDEXING
print("🏗️ MULTI-LEVEL DATA STRUCTURES:")

# Create complex hierarchical data
np.random.seed(123)

# Simulated business data with multiple dimensions
regions = ['North', 'South', 'East', 'West']
products = ['Laptop', 'Phone', 'Tablet']
quarters = ['Q1', 'Q2', 'Q3', 'Q4']
years = [2022, 2023, 2024]

# Build the MultiIndex as the Cartesian product of all levels
# (same row order as nested loops over years, quarters, regions, products)
multi_index = pd.MultiIndex.from_product(
    [years, quarters, regions, products],
    names=['Year', 'Quarter', 'Region', 'Product']
)

# Generate the data
n_records = len(multi_index)
sales_data_multi = pd.DataFrame({
    'Sales': np.random.normal(1000, 300, n_records),
    'Units': np.random.poisson(50, n_records),
    'Profit_Margin': np.random.normal(0.2, 0.05, n_records),
    'Market_Share': np.random.uniform(0.1, 0.4, n_records)
}, index=multi_index)

# Add business logic
seasonal_boost = {
    'Q1': 0.9, 'Q2': 1.0, 'Q3': 1.1, 'Q4': 1.3  # Holiday boost in Q4
}

product_multiplier = {
    'Laptop': 2.5, 'Phone': 1.0, 'Tablet': 1.5  # Laptops are more expensive
}

# Vectorized scaling by index level (equivalent to a row-by-row .loc loop)
quarter_factor = sales_data_multi.index.get_level_values('Quarter').map(seasonal_boost)
product_factor = sales_data_multi.index.get_level_values('Product').map(product_multiplier)
sales_data_multi['Sales'] *= quarter_factor.to_numpy() * product_factor.to_numpy()

print(f"Multi-level dataset shape: {sales_data_multi.shape}")
print(f"Index levels: {sales_data_multi.index.names}")
print("\nSample data:")
print(sales_data_multi.head())

# MULTI-LEVEL SLICING and AGGREGATION
print("\n🔍 MULTI-LEVEL SLICING EXAMPLES:")

# Slice by year
sales_2023 = sales_data_multi.loc[2023]
print(f"2023 data shape: {sales_2023.shape}")

# Slice by year and quarter (the trailing ':' makes the tuple an explicit row key)
q4_2023 = sales_data_multi.loc[(2023, 'Q4'), :]
print(f"Q4 2023 data shape: {q4_2023.shape}")

# Cross-section
laptop_sales = sales_data_multi.xs('Laptop', level='Product')
print(f"Laptop sales shape: {laptop_sales.shape}")

# COMPLEX AGGREGATIONS with a MultiIndex
# Aggregation at different index levels
yearly_totals = sales_data_multi.groupby(level='Year')['Sales'].sum()
quarterly_totals = sales_data_multi.groupby(level=['Year', 'Quarter'])['Sales'].sum()
product_totals = sales_data_multi.groupby(level=['Year', 'Product'])['Sales'].sum()
region_product = sales_data_multi.groupby(level=['Region', 'Product'])['Sales'].mean()

print(f"\nYearly totals:")
print(yearly_totals.round(0))

# UNSTACKING for visualization
quarterly_pivot = quarterly_totals.unstack(level='Quarter')
product_pivot = product_totals.unstack(level='Product')
region_product_pivot = region_product.unstack(level='Product')

print(f"\nQuarterly pivot shape: {quarterly_pivot.shape}")
print(quarterly_pivot)

# Visualize the multi-level data
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Multi-level Data Visualizations', fontsize=16)

# 1. Quarterly trends heatmap
sns.heatmap(quarterly_pivot, annot=True, fmt='.0f', cmap='YlOrRd', ax=axes[0,0])
axes[0,0].set_title('Sales by Year & Quarter')
axes[0,0].set_ylabel('Year')

# 2. Product comparison across years
sns.heatmap(product_pivot, annot=True, fmt='.0f', cmap='Blues', ax=axes[0,1])
axes[0,1].set_title('Sales by Year & Product')
axes[0,1].set_ylabel('Year')

# 3. Regional product performance
sns.heatmap(region_product_pivot, annot=True, fmt='.0f', cmap='Greens', ax=axes[1,0])
axes[1,0].set_title('Avg Sales by Region & Product')
axes[1,0].set_ylabel('Region')

# 4. Time series with reset_index for a Seaborn line plot
quarterly_ts = quarterly_totals.reset_index()
quarterly_ts['Period'] = quarterly_ts['Year'].astype(str) + '-' + quarterly_ts['Quarter']
sns.lineplot(data=quarterly_ts, x='Period', y='Sales', marker='o', ax=axes[1,1])
axes[1,1].set_title('Quarterly Sales Trend')
axes[1,1].tick_params(axis='x', rotation=45)
axes[1,1].set_ylabel('Total Sales')

plt.tight_layout()
plt.show()

# ADVANCED MULTI-LEVEL ANALYSIS
print("\n📊 ADVANCED MULTI-LEVEL ANALYSIS:")

# Quarter-over-quarter growth rates
growth_rates = sales_data_multi.groupby(level=['Year', 'Quarter'])['Sales'].sum().pct_change() * 100
print("Quarterly growth rates:")
print(growth_rates.dropna().round(2))

# Top performers along different dimensions
top_region = region_product_pivot.sum(axis=1).idxmax()
top_product = region_product_pivot.sum(axis=0).idxmax()
best_quarter = quarterly_pivot.sum(axis=0).idxmax()

print(f"\n🏆 TOP PERFORMERS:")
print(f"Best region: {top_region}")
print(f"Best product: {top_product}")
print(f"Best quarter: {best_quarter}")

# Market share analysis
market_share_analysis = sales_data_multi.groupby(level=['Year', 'Product'])['Market_Share'].mean()
print(f"\nMarket share trends:")
print(market_share_analysis.round(3))

print("\n✅ MULTI-LEVEL DATA BENEFITS:")
print("- Efficient storage for complex hierarchical data")
print("- Powerful slicing and dicing capabilities")
print("- Natural groupby operations")
print("- Easy pivoting for different views")
print("- Well suited to business intelligence")

### 5. **Time Series Data Transformations**

**Time series** transformation is critical for:
- **Trend analysis** - long-term patterns
- **Seasonality detection** - recurring patterns
- **Anomaly detection** - unusual values
- **Forecasting preparation** - stationary data

**Key transformations:**
- **Resampling** - change time frequency
- **Rolling windows** - moving averages
- **Lag features** - previous period values
- **Differencing** - remove trends
- **Seasonal decomposition** - trend + seasonal + residual
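
The first four of these can be sketched with pandas alone. The example below uses a hypothetical 28-day series that simply counts upward from a Monday, so every result can be verified by hand. Seasonal decomposition is left out because it typically relies on an additional library.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: 28 days starting on a Monday, values 0, 1, ..., 27
dates = pd.date_range("2024-01-01", periods=28, freq="D")
ts = pd.DataFrame({"traffic": np.arange(28, dtype=float)}, index=dates)

# Resampling: daily -> weekly mean (weeks end on Sunday by default)
weekly_mean = ts["traffic"].resample("W").mean()

# Rolling window: 7-day moving average (first 6 rows have no full window)
ts["ma_7"] = ts["traffic"].rolling(window=7).mean()

# Lag feature: value from the same weekday one week earlier
ts["lag_7"] = ts["traffic"].shift(7)

# Differencing: day-to-day change, which removes the linear trend
ts["diff_1"] = ts["traffic"].diff(1)

print(weekly_mean)
print(ts.iloc[5:9])
```

Rolling, lag, and difference columns start with missing values, so a `dropna()` on the relevant columns is usually needed before plotting them against each other.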

In [None]:
# 5. TIME SERIES DATA TRANSFORMATIONS
print("⏰ TIME SERIES TRANSFORMATIONS:")

# Create realistic time series data
np.random.seed(42)
date_range = pd.date_range('2020-01-01', '2024-12-31', freq='D')
n_days = len(date_range)

# Simulate website traffic with a trend and seasonality
trend = np.linspace(1000, 2000, n_days)  # Linear growth
seasonal = 200 * np.sin(2 * np.pi * np.arange(n_days) / 365.25)  # Yearly cycle
weekly = 100 * np.sin(2 * np.pi * np.arange(n_days) / 7)  # Weekly cycle
noise = np.random.normal(0, 50, n_days)

# Combine the components
traffic = trend + seasonal + weekly + noise
traffic = np.maximum(traffic, 100)  # Minimum traffic

# Build the DataFrame
ts_data = pd.DataFrame({
    'date': date_range,
    'traffic': traffic,
    'weekday': date_range.day_name(),
    'month': date_range.month,
    'year': date_range.year,
    'quarter': date_range.quarter
})

# Add business events (Black Friday effect, approximated by fixed dates Nov 24-26)
black_friday_mask = ts_data['date'].dt.strftime('%m-%d').isin(['11-24', '11-25', '11-26'])
ts_data.loc[black_friday_mask, 'traffic'] *= 2.5  # Black Friday boost

# Set date as the index
ts_data.set_index('date', inplace=True)

print(f"Time series data shape: {ts_data.shape}")
print(f"Date range: {ts_data.index.min()} to {ts_data.index.max()}")
print("\nSample data:")
print(ts_data.head())

# TIME SERIES TRANSFORMATIONS
print("\n🔄 APPLYING TIME SERIES TRANSFORMATIONS:")

# 1. RESAMPLING - different frequencies
# Note: pandas 2.2+ deprecates the 'M', 'Q', 'Y' aliases in favor of 'ME', 'QE', 'YE'
weekly_avg = ts_data['traffic'].resample('W').mean()
monthly_sum = ts_data['traffic'].resample('M').sum()
quarterly_max = ts_data['traffic'].resample('Q').max()
yearly_stats = ts_data['traffic'].resample('Y').agg(['count', 'mean', 'std', 'min', 'max'])

print(f"Weekly average shape: {weekly_avg.shape}")
print(f"Monthly sum shape: {monthly_sum.shape}")
print(f"Quarterly max shape: {quarterly_max.shape}")

# 2. ROLLING WINDOWS - moving statistics
ts_data['traffic_ma_7'] = ts_data['traffic'].rolling(window=7).mean()  # 7-day MA
ts_data['traffic_ma_30'] = ts_data['traffic'].rolling(window=30).mean()  # 30-day MA
ts_data['traffic_std_7'] = ts_data['traffic'].rolling(window=7).std()  # 7-day volatility
ts_data['traffic_min_30'] = ts_data['traffic'].rolling(window=30).min()  # 30-day min
ts_data['traffic_max_30'] = ts_data['traffic'].rolling(window=30).max()  # 30-day max

# 3. LAG FEATURES
ts_data['traffic_lag_1'] = ts_data['traffic'].shift(1)  # Previous day
ts_data['traffic_lag_7'] = ts_data['traffic'].shift(7)  # Same day last week
ts_data['traffic_lag_365'] = ts_data['traffic'].shift(365)  # Same day last year

# 4. DIFFERENCES - for trend removal
ts_data['traffic_diff_1'] = ts_data['traffic'].diff(1)  # Day-to-day change
ts_data['traffic_diff_7'] = ts_data['traffic'].diff(7)  # Week-to-week change
ts_data['traffic_pct_change'] = ts_data['traffic'].pct_change() * 100  # Percentage change

# 5. CUSTOM TIME FEATURES
ts_data['is_weekend'] = ts_data.index.dayofweek >= 5
ts_data['day_of_year'] = ts_data.index.dayofyear
ts_data['week_of_year'] = ts_data.index.isocalendar().week
ts_data['is_month_start'] = ts_data.index.is_month_start
ts_data['is_quarter_start'] = ts_data.index.is_quarter_start

print(f"\nEnhanced dataset shape: {ts_data.shape}")
print(f"New features: {list(ts_data.columns[-5:])}")

# Visualize the time series transformations
fig, axes = plt.subplots(3, 2, figsize=(16, 12))
fig.suptitle('Time Series Transformations', fontsize=16)

# 1. Original vs Moving Averages
sample_period = ts_data.loc['2023-01':'2023-06']  # First half of 2023
axes[0,0].plot(sample_period.index, sample_period['traffic'], alpha=0.3, label='Original')
axes[0,0].plot(sample_period.index, sample_period['traffic_ma_7'], label='7-day MA')
axes[0,0].plot(sample_period.index, sample_period['traffic_ma_30'], label='30-day MA')
axes[0,0].set_title('Original vs Moving Averages')
axes[0,0].legend()
axes[0,0].tick_params(axis='x', rotation=45)

# 2. Resampled data
monthly_data = monthly_sum.reset_index()
monthly_data['month_year'] = monthly_data['date'].dt.strftime('%Y-%m')
sample_months = monthly_data[monthly_data['date'] >= '2023-01-01']
sns.barplot(data=sample_months, x='month_year', y='traffic', ax=axes[0,1])
axes[0,1].set_title('Monthly Traffic Totals (2023-2024)')
axes[0,1].tick_params(axis='x', rotation=45)
axes[0,1].set_ylabel('Total Traffic')

# 3. Lag relationship
lag_sample = ts_data[['traffic', 'traffic_lag_7']].dropna()
sns.scatterplot(data=lag_sample.sample(1000, random_state=42), x='traffic_lag_7', y='traffic', 
                alpha=0.6, ax=axes[1,0])  # fixed seed keeps the plotted subset reproducible
axes[1,0].plot([lag_sample['traffic_lag_7'].min(), lag_sample['traffic_lag_7'].max()],
              [lag_sample['traffic_lag_7'].min(), lag_sample['traffic_lag_7'].max()], 
              'r--', alpha=0.5)
axes[1,0].set_title('Today vs Same Day Last Week')
axes[1,0].set_xlabel('Traffic (7 days ago)')
axes[1,0].set_ylabel('Traffic (today)')

# 4. Differenced data (removing trend)
diff_sample = ts_data['2023-01':'2023-06']['traffic_diff_1']
axes[1,1].plot(diff_sample.index, diff_sample, alpha=0.7)
axes[1,1].axhline(y=0, color='r', linestyle='--', alpha=0.5)
axes[1,1].set_title('Daily Changes (Differenced)')
axes[1,1].tick_params(axis='x', rotation=45)
axes[1,1].set_ylabel('Traffic Change')

# 5. Weekend vs Weekday analysis
weekend_analysis = ts_data.groupby('is_weekend')['traffic'].mean().reset_index()
weekend_analysis['day_type'] = weekend_analysis['is_weekend'].map({True: 'Weekend', False: 'Weekday'})
sns.barplot(data=weekend_analysis, x='day_type', y='traffic', ax=axes[2,0])
axes[2,0].set_title('Weekend vs Weekday Traffic')
axes[2,0].set_ylabel('Average Traffic')

# 6. Seasonal pattern
seasonal_pattern = ts_data.groupby('month')['traffic'].mean().reset_index()
sns.lineplot(data=seasonal_pattern, x='month', y='traffic', marker='o', ax=axes[2,1])
axes[2,1].set_title('Average Traffic by Month')
axes[2,1].set_xlabel('Month')
axes[2,1].set_ylabel('Average Traffic')
axes[2,1].set_xticks(range(1, 13))

plt.tight_layout()
plt.show()

# Statistical insights
print("\n📈 TIME SERIES INSIGHTS:")

# Correlation analysis
correlation_7day = ts_data[['traffic', 'traffic_lag_7']].corr().iloc[0,1]
correlation_365day = ts_data[['traffic', 'traffic_lag_365']].corr().iloc[0,1]

print(f"7-day lag correlation: {correlation_7day:.3f}")
print(f"365-day lag correlation: {correlation_365day:.3f}")

# Trend analysis
yearly_growth = yearly_stats['mean'].pct_change().mean() * 100
print(f"Average yearly growth: {yearly_growth:.1f}%")

# Volatility analysis
avg_volatility = ts_data['traffic_std_7'].mean()
print(f"Average 7-day volatility: {avg_volatility:.1f}")

# Weekend effect
weekend_effect = (weekend_analysis[weekend_analysis['day_type']=='Weekend']['traffic'].values[0] / 
                 weekend_analysis[weekend_analysis['day_type']=='Weekday']['traffic'].values[0] - 1) * 100
print(f"Weekend effect: {weekend_effect:+.1f}%")

print("\n✅ TIME SERIES TRANSFORMATION BENEFITS:")
print("- Moving averages smooth noise")
print("- Lag features capture temporal dependencies")
print("- Differencing removes trends")
print("- Resampling changes time resolution")
print("- Custom features extract domain knowledge")

## 6. Conclusion

In this notebook we covered **advanced data transformation techniques** for Seaborn:

### **What we covered:**

1. **🔄 Data aggregation**
   - `groupby().agg()` for flexible summarization
   - `pivot_table()` for cross-tabulation
   - Custom aggregations for domain-specific insights

2. **🧮 Statistical Transformations**
   - Log, sqrt, and standardization for rescaling data
   - Binning for categorical conversion
   - Skewness reduction techniques

3. **🎯 Advanced Groupby**
   - `transform()` for adding group statistics
   - `apply()` for custom functions per group
   - `filter()` for conditional group selection

4. **🏗️ Multi-level Data**
   - MultiIndex for hierarchical data
   - Complex slicing and aggregation
   - Efficient storage for high-dimensional data

5. **⏰ Time Series**
   - `resample()` for frequency changes
   - Rolling windows for trend analysis
   - Lag features for temporal patterns

### **Key Takeaways:**

- **📊 Raw data is rarely visualization-ready**
- **🔧 Proper transformation = better insights**
- **⚡ Pandas + Seaborn = a powerful combination**
- **🎯 Domain knowledge drives transformation choices**

### **Best Practices:**

| **Scenario** | **Recommended Approach** | **Seaborn Benefit** |
|--------------|-------------------------|---------------------|
| 📈 **Skewed data** | Log transformation | Better distributions |
| 🎯 **Large datasets** | Aggregation first | Faster plotting |
| ⏰ **Time series** | Resampling + rolling | Cleaner trends |
| 🏗️ **Complex hierarchy** | MultiIndex | Advanced grouping |
| 📊 **Business metrics** | Custom aggregation | Domain-specific insights |

### **Next steps:**

These transformations are the **foundation** for more advanced analysis. In real projects:
- Combine several transformations
- Validate transformations with domain experts
- Document the transformation pipeline
- Test on different subsets of the data

---

**Note**: This notebook demonstrates how **data transformation** can substantially improve the quality of Seaborn visualizations. Always consider a transformation before plotting!