# Lab 4: Hyperparameter Search

## Task: Preparing and Processing the Taxis Dataset

1. **Loading a real dataset**: Load the taxis dataset using the seaborn (`sns`) library: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
2. **Vectorizing categorical data**: Identify the categorical columns. Apply `LabelEncoder` and `OneHotEncoder` to the appropriate columns.
3. **Time data**: Vectorize the time data in the best possible way.
4. **Imputation**: Detect whether any NaN values exist and handle them with an appropriate technique. Detect columns that are highly correlated and drop the redundant ones.
5. **Outliers**: Define a threshold of 3 standard deviations for the numeric column `total_bill` and remove the outliers (the taxis dataset has no column of that name; `fare` or `total` appears to be the intended counterpart).
6. **Scaling**: Apply `StandardScaler` or `MinMaxScaler` to the same column.
7. **Training**: Split the data into training, validation, and test sets, then run a hyperparameter search. Use a decision tree model for regression. Target column: `tip`.
8. **Evaluation**: Build the final model with the best parameters and evaluate the metrics on the test set.
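
Steps 7 and 8 can be sketched as follows. This is only an illustration under stated assumptions: the feature matrix and target are synthetic stand-ins for the prepared taxis features and the `tip` column, the 60/20/20 split is one common choice, and the parameter grid values are hypothetical (the lab does not prescribe them). The search here is a plain loop over `ParameterGrid` scored on the validation set; `GridSearchCV` with cross-validation would be a reasonable alternative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared features and the `tip` target (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(600, 3))
y = 0.2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, size=600)

# 20% test first, then 25% of the remainder as validation: 60/20/20 overall.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

# Hypothetical grid; tune the ranges for the real data.
param_grid = {'max_depth': [2, 4, 8, None], 'min_samples_leaf': [1, 5, 20]}

# Fit on the training set, pick the combination with the lowest validation MAE.
best_params, best_mae = None, np.inf
for params in ParameterGrid(param_grid):
    model = DecisionTreeRegressor(random_state=0, **params).fit(X_train, y_train)
    mae = mean_absolute_error(y_val, model.predict(X_val))
    if mae < best_mae:
        best_params, best_mae = params, mae

# Refit the final model on train + validation, then score once on the test set.
final_model = DecisionTreeRegressor(random_state=0, **best_params)
final_model.fit(X_trainval, y_trainval)
test_pred = final_model.predict(X_test)
test_mae = mean_absolute_error(y_test, test_pred)
test_r2 = r2_score(y_test, test_pred)
print(best_params, test_mae, test_r2)
```

The test set is touched only once, after the parameters are fixed, so the reported metrics are not biased by the search itself.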

In [13]:
import seaborn as sns
import numpy as np
import pandas as pd

df = sns.load_dataset('taxis')
print(f'Original shape: {df.shape}')
df.head()

Original shape: (6433, 14)


| | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.6 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
| 1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.0 | 0.0 | 9.3 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
| 2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
| 3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.7 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
| 4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.1 | 0.0 | 13.4 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |


In [34]:
def process_time_duration(df):
    """Add trip duration in minutes (dropoff minus pickup)."""
    df = df.copy()
    df['duration'] = (df['dropoff'] - df['pickup']).dt.total_seconds() / 60
    return df


def process_time_binning_interval(df, interval=30):
    """Bin pickup time of day into fixed-width intervals labeled by bin start (HH:MM)."""
    df = df.copy()
    mins = df['pickup'].dt.hour * 60 + df['pickup'].dt.minute
    bins = np.arange(0, 1440 + interval, interval)
    # Bins are left-closed [start, end), so each label is the bin's start time.
    labels = [f"{b//60:02d}:{b%60:02d}" for b in bins[:-1]]
    df['time_bin'] = pd.cut(mins, bins=bins, labels=labels, right=False)  # time_bin has categorical dtype
    return df


def process_time_cyclical(df):
    """Encode pickup hour as sine/cosine so that 23:00 and 00:00 end up close together."""
    df = df.copy()
    h = df['pickup'].dt.hour
    df['hour_sin'] = np.sin(2*np.pi*h/24)
    df['hour_cos'] = np.cos(2*np.pi*h/24)
    return df

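def process_time_month_seasons(df):
    """Map month to season: Winter, Spring, Summer, Fall."""
    df = df.copy()
    month = df['pickup'].dt.month
    # Right-closed bins: (0,2] Jan-Feb, (2,5] Mar-May, (5,8] Jun-Aug, (8,11] Sep-Nov, (11,12] Dec
    bins = [0, 2, 5, 8, 11, 12]
    labels = ['Winter', 'Spring', 'Summer', 'Fall', 'Winter']
    # 'Winter' appears twice, and pd.cut rejects duplicate labels unless ordered=False
    df['season'] = pd.cut(month, bins=bins, labels=labels, right=True, ordered=False)
    return df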


def process_time_month_categorical(df):
    """Use each month as a categorical variable."""
    df = df.copy()
    df['month'] = df['pickup'].dt.month_name()
    return df


def process_time_dayofweek(df):
    """Categorize day of week (Monday ... Sunday)."""
    df = df.copy()
    df['day_of_week'] = df['pickup'].dt.day_name()
    return df


def process_time_weekday_weekend(df):
    """Binary feature: Weekday vs Weekend."""
    df = df.copy()
    df['is_weekend'] = df['pickup'].dt.dayofweek >= 5
    return df


def process_time_hour_slots(df):
    """Bin hours into Night/Morning/WorkHours/Evening."""
    df = df.copy()
    hr = df['pickup'].dt.hour
    conditions = [hr.between(0,5), hr.between(6,11), hr.between(12,17), hr.between(18,23)]
    labels = ['Night', 'Morning', 'WorkHours', 'Evening']
    df['hour_slot'] = np.select(conditions, labels, default='Unknown')
    return df

# Many other ways of encoding time are possible

approaches = {
    'duration': process_time_duration,
    'bin_30min': lambda df: process_time_binning_interval(df, 30),
    'cyclical': process_time_cyclical,
    'season': process_time_month_seasons,
    'month_cat': process_time_month_categorical,
    'dayofweek': process_time_dayofweek,
    'weekend_flag': process_time_weekday_weekend,
    'hour_slots': process_time_hour_slots
}
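
The motivation for the sine/cosine encoding in `process_time_cyclical` can be checked numerically. The sketch below uses only NumPy; the helper names `encode_hour` and `cyclical_distance` are hypothetical and are not part of the lab code. It shows that, in the encoded space, the hours 23 and 0 are as close as any other pair of adjacent hours, whereas the raw integers would put them 23 units apart.

```python
import numpy as np

def encode_hour(h):
    """Map an hour in [0, 24) to a point on the unit circle (same formula as the lab)."""
    angle = 2 * np.pi * h / 24
    return np.array([np.sin(angle), np.cos(angle)])

def cyclical_distance(h1, h2):
    """Euclidean distance between two hours in the (sin, cos) space."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))

d_wrap = cyclical_distance(23, 0)       # one hour apart, across midnight
d_adjacent = cyclical_distance(11, 12)  # one hour apart, at midday
d_opposite = cyclical_distance(0, 12)   # half a day apart

print(d_wrap, d_adjacent, d_opposite)
```

Both columns are needed together: the sine alone cannot distinguish, for example, 03:00 from 09:00.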

In [33]:
# We can combine several approaches to get the final result
df_time_encoded = approaches['duration'](df)
df_time_encoded = approaches['month_cat'](df_time_encoded)
df_time_encoded = approaches['weekend_flag'](df_time_encoded)
df_time_encoded = approaches['hour_slots'](df_time_encoded)
df_time_encoded = df_time_encoded.drop(columns=['pickup', 'dropoff'])
df_time_encoded.head()

| | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | duration | month | is_weekend | hour_slot |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.6 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan | 6.25 | March | True | Evening |
| 1 | 1 | 0.79 | 5.0 | 0.0 | 0.0 | 9.3 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan | 7.083333 | March | False | WorkHours |
| 2 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan | 7.4 | March | False | WorkHours |
| 3 | 1 | 7.7 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan | 25.866667 | March | True | Night |
| 4 | 3 | 2.16 | 9.0 | 1.1 | 0.0 | 13.4 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan | 9.533333 | March | True | WorkHours |


Depending on the chosen time encoding, we continue with the processing of the categorical and numerical data.