# Project - classification model

Project goal: Build an alerting system (forecasting whether rentals will exceed returns).
The alert should cover the next hour, so that staff can be sent to areas short of bikes and bring in reserve bikes or vehicles from other locations.

Tasks
1. Data preprocessing:
    - Load the data.
    - Filter the data (how much history do we need for modelling?).
    - Merge the data:
        - What type of join should be used?
        - How should missing values in the departure name and date columns be filled?
    - Initial data check.
    - Resample to hourly data and fill missing values.
    - Create a new categorical variable (y): is the number of rentals in the current hour greater than the number of returns?
    - Create date-based variables: hour, month, quarter.
    - Encode the departure name variable.
    - Create lags (values from previous periods):
        - Values for a given station from previous hours / days.
        - Consider other transformations and aggregations of the variables.
        - Write a function that prepares the variables.
    - Filter the dataset:
        - Do we need all months?
        - Will the model run at every time of day?
        - Do we want to model all stations?
    - Feature selection.
    - Outlier detection.
    - Remember to delete objects you no longer need, otherwise you may run out of memory during processing.
2. Model optimisation:
    - Use one of the optimisation algorithms covered so far.
    - If the results are unsatisfactory, test another algorithm.

In [3]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
pd.options.display.float_format = '{:.3f}'.format
pd.set_option('display.max_columns',None)

1. Data preprocessing
- Load the data.
- Filter the data (how much history do we need for modelling?).

In [1]:
import os 
os.chdir('../')

In [4]:
# Load the data - number of rentals (departures)
df_dep = pd.read_parquet('data/hourly_data_per_station.parquet')
df_dep = df_dep[df_dep['departure_date_hours']>='2018-01-01'].reset_index(drop=True)


In [5]:
# Load the data - number of returns
df_ret = pd.read_parquet('data/hourly_data_per_station_returns.parquet')
df_ret = df_ret[df_ret['return_date_hours']>='2018-01-01'].reset_index(drop=True)

- Merge the data:
    - What type of join should be used?
    - How should missing values in the departure name and date columns be filled?

In [6]:
# merge the data (outer join keeps station-hours that have only departures or only returns)
df_merged = df_dep.merge(df_ret,
                         left_on = ['departure_name','departure_date_hours'],
                         right_on=['return_name','return_date_hours'],
                         how = 'outer',
                         suffixes=('_dep','_ret'))

In [None]:
# sizes of the individual data frames
print(df_dep.shape)
print(df_ret.shape)
print(df_merged.shape)

In [7]:
# missing values
df_merged.isna().sum()

departure_name                544085
departure_date_hours          544085
numbers_of_departures         544085
distance (m)_dep              544085
duration (sec.)_dep           544085
avg_speed (km/h)_dep          544382
Air temperature (degC)_dep    546027
return_name                   512563
return_date_hours             512563
number_of_returns             512563
distance (m)_ret              512563
duration (sec.)_ret           512563
avg_speed (km/h)_ret          512926
Air temperature (degC)_ret    514659
dtype: int64

In [8]:
# imputation - departure name / date and temperature (taken from the return side where the departure side is missing)
df_merged['departure_name'] = df_merged['departure_name'].fillna(df_merged['return_name'])
df_merged['departure_date_hours'] = df_merged['departure_date_hours'].fillna(df_merged['return_date_hours'])
df_merged['Air temperature (degC)'] = df_merged['Air temperature (degC)_dep'].fillna(df_merged['Air temperature (degC)_ret'])
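
The outer join and the key coalescing used above can be illustrated on a tiny made-up example. The station names and counts below are invented for illustration only; `indicator=True` is added here just to show where each row came from and is not part of the original notebook.

```python
import pandas as pd

# Toy data, invented for illustration only
dep = pd.DataFrame({
    'departure_name': ['A', 'A', 'B'],
    'departure_date_hours': pd.to_datetime(['2020-06-01 08:00', '2020-06-01 09:00', '2020-06-01 08:00']),
    'numbers_of_departures': [3, 1, 2],
})
ret = pd.DataFrame({
    'return_name': ['A', 'B', 'B'],
    'return_date_hours': pd.to_datetime(['2020-06-01 08:00', '2020-06-01 08:00', '2020-06-01 09:00']),
    'number_of_returns': [1, 4, 2],
})

toy = dep.merge(ret,
                left_on=['departure_name', 'departure_date_hours'],
                right_on=['return_name', 'return_date_hours'],
                how='outer',
                indicator=True)  # adds a '_merge' column: both / left_only / right_only

# Coalesce the keys so that every row has a station and a timestamp
toy['departure_name'] = toy['departure_name'].fillna(toy['return_name'])
toy['departure_date_hours'] = toy['departure_date_hours'].fillna(toy['return_date_hours'])
print(toy[['departure_name', 'departure_date_hours', '_merge']])
```

An inner join would have dropped the two station-hours that appear on one side only, which are exactly the hours where departures or returns were zero.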

In [None]:
# check the missing values again
df_merged.isna().sum()

In [9]:
# drop redundant columns and data frames
del df_merged['return_date_hours']
del df_merged['return_name']
del df_merged['Air temperature (degC)_dep']
del df_merged['Air temperature (degC)_ret']
del df_ret
del df_dep

- Initial data check.

In [None]:
# head
df_merged.head()

In [None]:
# info
df_merged.info()

In [11]:
# describe
df_merged.describe()

Unnamed: 0,departure_date_hours,numbers_of_departures,distance (m)_dep,duration (sec.)_dep,avg_speed (km/h)_dep,number_of_returns,distance (m)_ret,duration (sec.)_ret,avg_speed (km/h)_ret,Air temperature (degC)
count,2955057,2410972.0,2410972.0,2410972.0,2410675.0,2442494.0,2442494.0,2442494.0,2442131.0,2953142.0
mean,2019-08-27 12:13:34.770882816,4.207,2216.939,992.561,0.186,4.153,2203.705,1016.668,0.182,14.17
min,2018-04-03 08:00:00,1.0,-4290278.0,0.0,-205.113,1.0,-2144724.0,0.0,-234.358,-5.2
25%,2018-10-23 17:00:00,1.0,1302.0,431.5,0.159,1.0,1283.333,430.0,0.157,10.1
50%,2019-08-13 17:00:00,2.0,1964.17,649.333,0.189,2.0,1958.0,650.4,0.188,14.85
75%,2020-06-10 08:00:00,5.0,2783.0,914.667,0.217,5.0,2785.0,919.333,0.215,18.3
max,2020-11-03 00:00:00,217.0,2106675.0,3100714.0,24.38,153.0,2106675.0,1950617.0,19.55,32.9
std,,5.119,4919.753,6744.542,0.241,5.019,3326.258,6708.934,0.2,5.901


In [10]:
# check for invalid values (negative distances)
df_merged[df_merged['distance (m)_dep']<0]

Unnamed: 0,departure_name,departure_date_hours,numbers_of_departures,distance (m)_dep,duration (sec.)_dep,avg_speed (km/h)_dep,number_of_returns,distance (m)_ret,duration (sec.)_ret,avg_speed (km/h)_ret,Air temperature (degC)
1751118,Niittymaa,2019-10-27 20:00:00,3.0,-1429816.667,281.0,-156.228,2.0,1641.5,471.0,0.107,1.9
1925829,Painiitty,2020-08-20 16:00:00,1.0,-4290278.0,1255.0,-205.113,,,,,21.8
2019720,Pohjankulma,2020-08-01 00:00:00,2.0,-2143629.5,1424.0,-89.942,,,,,15.5
2432529,Sörnäinen (M),2020-07-23 22:00:00,8.0,-535255.375,413.875,-50.711,7.0,2080.429,764.143,0.124,12.9
2759127,Valimotie,2020-07-17 17:00:00,12.0,-355391.417,736.5,-12.819,7.0,1290.571,405.714,0.136,23.7
2820296,Velodrominrinne,2018-10-15 19:00:00,2.0,-2141771.5,956.0,-142.006,2.0,3204.5,962.0,0.2,12.65
2851333,Verkatehtaanpuisto,2020-07-04 12:00:00,1.0,-4289957.0,1362.0,-188.985,,,,,16.4


In [12]:
df_merged.loc[df_merged['avg_speed (km/h)_dep']<0,'avg_speed (km/h)_dep']

1751118   -156.228
1925829   -205.113
2019720    -89.942
2432529    -50.711
2759127    -12.819
2820296   -142.006
2851333   -188.985
Name: avg_speed (km/h)_dep, dtype: float64

In [13]:
df_merged.loc[df_merged['distance (m)_ret']<0,'distance (m)_ret']

426349    -2142811.500
644555     -475758.889
737509     -388698.091
1261469   -1069230.250
1266063   -2143244.000
1442165   -2144724.000
2478683    -855928.000
Name: distance (m)_ret, dtype: float64

In [14]:
df_merged.loc[df_merged['avg_speed (km/h)_ret']<0,'avg_speed (km/h)_ret']

426349    -142.001
644555     -45.071
737509     -13.963
1261469    -44.876
1266063    -94.360
1442165   -234.358
2478683    -40.867
Name: avg_speed (km/h)_ret, dtype: float64

In [15]:
# replace invalid (negative) values with the mean of the positive values in the same column
df_merged.loc[df_merged['distance (m)_dep']<0,'distance (m)_dep'] = df_merged.loc[df_merged['distance (m)_dep']>0,'distance (m)_dep'].mean()
df_merged.loc[df_merged['avg_speed (km/h)_dep']<0,'avg_speed (km/h)_dep'] = df_merged.loc[df_merged['avg_speed (km/h)_dep']>0,'avg_speed (km/h)_dep'].mean()
df_merged.loc[df_merged['distance (m)_ret']<0,'distance (m)_ret'] = df_merged.loc[df_merged['distance (m)_ret']>0,'distance (m)_ret'].mean()
df_merged.loc[df_merged['avg_speed (km/h)_ret']<0,'avg_speed (km/h)_ret'] = df_merged.loc[df_merged['avg_speed (km/h)_ret']>0,'avg_speed (km/h)_ret'].mean()
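
The four assignments above repeat one pattern, so they could be folded into a small helper. This is only a sketch: the function name `replace_negative_with_mean` is hypothetical and the toy values are invented.

```python
import pandas as pd

def replace_negative_with_mean(df, cols):
    """Replace negative entries in each column with the mean of its positive entries."""
    df = df.copy()
    for col in cols:
        positive_mean = df.loc[df[col] > 0, col].mean()
        df.loc[df[col] < 0, col] = positive_mean
    return df

# Toy data, invented for illustration only
toy = pd.DataFrame({'distance (m)_dep': [1000.0, -4290278.0, 3000.0]})
cleaned = replace_negative_with_mean(toy, ['distance (m)_dep'])
print(cleaned)
```

Returning a copy keeps the input frame untouched, which makes the step easier to rerun in a notebook.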

- Resample the data to hourly frequency.

In [None]:
# resample
df_merged = df_merged.set_index('departure_date_hours').groupby('departure_name').resample('h').mean().reset_index()

In [None]:
df_merged.shape

In [None]:
# fill missing values: temperature gets a -999 sentinel (repaired later from the lags), everything else becomes 0
df_merged['Air temperature (degC)'] = df_merged['Air temperature (degC)'].fillna(-999)
df_merged =df_merged.fillna(0)
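
What the per-station resampling does can be seen on a toy frame. This sketch uses invented data and, unlike the notebook, selects a single numeric column before resampling; hours with no records appear as new rows and are then filled with 0.

```python
import pandas as pd

# Toy data, invented for illustration only: station A has a gap at 09:00 and 10:00
toy = pd.DataFrame({
    'departure_name': ['A', 'A', 'B'],
    'departure_date_hours': pd.to_datetime(['2020-06-01 08:00', '2020-06-01 11:00', '2020-06-01 09:00']),
    'numbers_of_departures': [3.0, 1.0, 2.0],
})

hourly = (toy.set_index('departure_date_hours')
             .groupby('departure_name')['numbers_of_departures']
             .resample('h')
             .mean()
             .reset_index())

# Hours without any record are NaN after resampling; treat them as zero departures
hourly['numbers_of_departures'] = hourly['numbers_of_departures'].fillna(0)
print(hourly)
```

Each station is only resampled between its own first and last timestamp, so station B stays a single row here.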

In [None]:
# convert float to int
df_merged[['numbers_of_departures','number_of_returns']] = df_merged[['numbers_of_departures','number_of_returns']].astype(int)

In [None]:
df_merged.info()

- Create a new categorical variable (y): is the number of rentals in the current hour greater than the number of returns?

In [None]:
# categorical target y: 1 when departures exceed returns by more than one (note the -1 margin)
df_merged['y_cat'] = ((df_merged['numbers_of_departures']-1)> df_merged['number_of_returns']).astype(int)
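
Because the code subtracts 1 from the departures, the flag is set only when rentals exceed returns by more than one. The sketch below, on invented counts, compares this with a plain `>` comparison; the column name `strictly_more` is hypothetical.

```python
import pandas as pd

# Toy counts, invented for illustration only
toy = pd.DataFrame({'numbers_of_departures': [5, 4, 3, 0],
                    'number_of_returns':     [3, 3, 3, 0]})

# Definition used in the notebook: a margin of one bike is tolerated
toy['y_cat'] = ((toy['numbers_of_departures'] - 1) > toy['number_of_returns']).astype(int)

# Plain comparison, for contrast
toy['strictly_more'] = (toy['numbers_of_departures'] > toy['number_of_returns']).astype(int)
print(toy)
```

The second row (4 departures, 3 returns) is where the two definitions differ.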

In [None]:
# share of each y value
df_merged['y_cat'].value_counts() / df_merged.shape[0]

- Create date-based variables: hour, day, month, quarter.

In [None]:
df_merged['hours'] = df_merged['departure_date_hours'].dt.hour
df_merged['day'] = df_merged['departure_date_hours'].dt.day
df_merged['month'] = df_merged['departure_date_hours'].dt.month
df_merged['quarter'] = df_merged['departure_date_hours'].dt.quarter

- Encode the departure name variable.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

In [None]:
# OrdinalEncoder object
oe = OrdinalEncoder().fit(df_merged[['departure_name']])

In [None]:
# add the encoded variable
df_merged['departure_name_encoded'] = oe.transform(df_merged[['departure_name']]).astype(int)
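
With default settings, `OrdinalEncoder` raises an error when it meets a station name it did not see during `fit`. If new stations can appear at prediction time, one option is to map them to a reserved code. This is a sketch of that option and an assumption about how the encoder might be configured, not what the notebook does; the name `Brand New Station` is invented.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train_names = pd.DataFrame({'departure_name': ['Niittymaa', 'Valimotie', 'Painiitty']})
new_names = pd.DataFrame({'departure_name': ['Valimotie', 'Brand New Station']})

# Unknown categories are encoded as -1 instead of raising an error
oe_safe = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1).fit(train_names)
codes = oe_safe.transform(new_names).astype(int).ravel()
print(codes)
```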

In [None]:
# check - head
df_merged.head()

- Create lags (values from previous periods):
    - Values for a given station from previous hours / days.
    - Consider other transformations and aggregations of the variables.
    - Write a function that prepares the variables.

In [None]:
# import the helper function
from help_function import lag_n

In [None]:
df_merged.columns

In [None]:
# columns to build lags for
lag_cols = ['numbers_of_departures',
       'distance (m)_dep', 'duration (sec.)_dep', 'avg_speed (km/h)_dep',
       'number_of_returns', 'distance (m)_ret', 'duration (sec.)_ret',
       'avg_speed (km/h)_ret', 'Air temperature (degC)', 'y_cat']

In [None]:
# create lags of 1, 2, 3, 6, 9, 12 and 24 hours
for i in [1,2,3,6,9,12,24]:
    df_merged = lag_n(df = df_merged,
                      group_col='departure_name',
                      lag_cols=lag_cols,
                      sort_by='departure_date_hours',
                      lag_number=i)
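
The notebook imports `lag_n` from `help_function` without showing it. The sketch below is a hypothetical stand-in, written only from the call signature and the `_lag_<n>` column names visible in the outputs; the real helper may differ. It shifts rows within each station, which equals a shift in hours only because the data has already been resampled to a regular hourly grid.

```python
import pandas as pd

def lag_n(df, group_col, lag_cols, sort_by, lag_number):
    """Hypothetical stand-in for help_function.lag_n: per-group shifted copies of lag_cols."""
    df = df.sort_values([group_col, sort_by]).reset_index(drop=True)
    shifted = df.groupby(group_col)[lag_cols].shift(lag_number)
    shifted.columns = [f'{col}_lag_{lag_number}' for col in lag_cols]
    return pd.concat([df, shifted], axis=1)

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'departure_name': ['A', 'A', 'A', 'B', 'B'],
    'departure_date_hours': pd.to_datetime(['2020-06-01 08:00', '2020-06-01 09:00', '2020-06-01 10:00',
                                            '2020-06-01 08:00', '2020-06-01 09:00']),
    'numbers_of_departures': [3, 1, 4, 2, 5],
})
toy = lag_n(toy, 'departure_name', ['numbers_of_departures'], 'departure_date_hours', 1)
print(toy)
```

The first row of every station has no predecessor, so its lag is missing; values never leak from one station into another.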

In [None]:
# replace missing temperature values (-999 sentinel) with the value from one hour earlier
df_merged.loc[df_merged['Air temperature (degC)']==-999,'Air temperature (degC)'] = df_merged.loc[df_merged['Air temperature (degC)']==-999,
                                                                                                  'Air temperature (degC)_lag_1']

In [None]:
# replace the remaining missing temperature values with the value from two hours earlier
df_merged.loc[df_merged['Air temperature (degC)']==-999,'Air temperature (degC)'] = df_merged.loc[df_merged['Air temperature (degC)']==-999,
                                                                                                  'Air temperature (degC)_lag_2']
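
Since the lags are built after the `-999` sentinel has been written, the temperature lag columns presumably still carry the sentinel wherever an earlier hour was missing. One alternative, sketched here on invented data and not used in the notebook, is to keep the gaps as `NaN` and forward-fill them per station for at most two hours before any lags are created. The column name `temp_filled` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'departure_name': ['A', 'A', 'A', 'A', 'B', 'B'],
    'departure_date_hours': pd.to_datetime(['2020-06-01 08:00', '2020-06-01 09:00', '2020-06-01 10:00',
                                            '2020-06-01 11:00', '2020-06-01 08:00', '2020-06-01 09:00']),
    'Air temperature (degC)': [15.0, np.nan, np.nan, 17.0, np.nan, 12.0],
})
toy = toy.sort_values(['departure_name', 'departure_date_hours']).reset_index(drop=True)

# Carry the last known temperature forward within each station, for at most two hours
toy['temp_filled'] = toy.groupby('departure_name')['Air temperature (degC)'].ffill(limit=2)
print(toy)
```

A station whose very first hour is missing stays `NaN`, which is easier to spot and drop than a sentinel value.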

In [None]:
# prepare_data: processes the data frame and adds the variables the model needs for later prediction
def prepare_data(df,lag_cols):
    # resample every station to an hourly grid
    df = df.set_index('departure_date_hours').groupby('departure_name').resample('h').mean().reset_index()
    # missing temperature gets a -999 sentinel, all other gaps become 0
    df['Air temperature (degC)'] = df['Air temperature (degC)'].fillna(-999)
    df =df.fillna(0)
    # target: 1 when departures exceed returns by more than one
    df['y_cat'] = ((df['numbers_of_departures']-1)> df['number_of_returns']).astype(int)
    if not set(lag_cols).issubset(df.columns):
        raise KeyError('Given dataframe does not contain required fields.')
    # date-based variables
    df['hour'] = df['departure_date_hours'].dt.hour
    df['day'] = df['departure_date_hours'].dt.day
    df['month'] = df['departure_date_hours'].dt.month
    df['quarter'] = df['departure_date_hours'].dt.quarter
    # lags per station
    for i in [1,2,3,6,9,12,24]:
        df = lag_n(df = df,
                      group_col='departure_name',
                      lag_cols=lag_cols,
                      sort_by='departure_date_hours',
                      lag_number=i)
    # repair the temperature sentinel from the 1-hour lag, then from the 2-hour lag
    df.loc[df['Air temperature (degC)']==-999,'Air temperature (degC)'] = df.loc[df['Air temperature (degC)']==-999,
                                                                                                  'Air temperature (degC)_lag_1']
    df.loc[df['Air temperature (degC)']==-999,'Air temperature (degC)'] = df.loc[df['Air temperature (degC)']==-999,
                                                                                                  'Air temperature (degC)_lag_2']
    return df
    

In [None]:
df_merged.columns

Index(['departure_name', 'departure_date_hours', 'numbers_of_departures',
       'distance (m)_dep', 'duration (sec.)_dep', 'avg_speed (km/h)_dep',
       'number_of_returns', 'distance (m)_ret', 'duration (sec.)_ret',
       'avg_speed (km/h)_ret', 'Air temperature (degC)', 'y_cat', 'hours',
       'day', 'month', 'quarter', 'departure_name_encoded',
       'numbers_of_departures_lag_1', 'distance (m)_dep_lag_1',
       'duration (sec.)_dep_lag_1', 'avg_speed (km/h)_dep_lag_1',
       'number_of_returns_lag_1', 'distance (m)_ret_lag_1',
       'duration (sec.)_ret_lag_1', 'avg_speed (km/h)_ret_lag_1',
       'Air temperature (degC)_lag_1', 'y_cat_lag_1',
       'numbers_of_departures_lag_2', 'distance (m)_dep_lag_2',
       'duration (sec.)_dep_lag_2', 'avg_speed (km/h)_dep_lag_2',
       'number_of_returns_lag_2', 'distance (m)_ret_lag_2',
       'duration (sec.)_ret_lag_2', 'avg_speed (km/h)_ret_lag_2',
       'Air temperature (degC)_lag_2', 'y_cat_lag_2',
       'numbers_of_departu

In [None]:
# call the function
df_merged_2 = prepare_data(df_merged[['departure_name', 'departure_date_hours', 'numbers_of_departures',
       'distance (m)_dep', 'duration (sec.)_dep', 'avg_speed (km/h)_dep',
       'number_of_returns', 'distance (m)_ret', 'duration (sec.)_ret',
       'avg_speed (km/h)_ret', 'Air temperature (degC)', 'y_cat', 'hours',
       'day', 'month', 'quarter', 'departure_name_encoded']],lag_cols = lag_cols)

In [None]:
# tail
df_merged_2.tail()

In [None]:
# shape
print(df_merged.shape)
print(df_merged_2.shape)

In [None]:
# compare the column lists (Index.equals also works when the two frames have a different number of columns)
df_merged_2.columns.equals(df_merged.columns)

In [None]:
# pick one station
st = df_merged.departure_name.unique()[0]

In [None]:
# inspect the values for that one station
df_merged[df_merged['departure_name']==st].sort_values('departure_date_hours').head(20)

In [None]:
# delete the duplicate data frame
del df_merged_2

- Filter the dataset:
    - Do we need all months?
    - Will the model run at every time of day?
    - Do we want to model all stations?

In [None]:
# date truncated to day precision
df_merged['departure_date'] = pd.to_datetime(df_merged['departure_date_hours'].dt.date)

In [None]:
# number of rentals by month
df_merged.groupby('month')['numbers_of_departures'].sum()

In [None]:
# keep the "high" season months
df_merged = df_merged[df_merged.month.isin([5,6,7,8,9])]

In [None]:
# number of rentals by departure_date_hours
hours = df_merged.groupby('departure_date_hours')['numbers_of_departures'].sum().reset_index()

In [None]:
# extract the hour
hours['hour'] = hours['departure_date_hours'].dt.hour

In [None]:
# average number of rentals per hour of the day
hours.groupby('hour').mean()

In [None]:
df_merged['hour'] = df_merged['hours'].copy()
del df_merged['hours']

In [None]:
# filter out the irrelevant hours (keep 8:00 to 22:00)
df_merged = df_merged[(df_merged['hour']>=8) & (df_merged['hour']<=22)].reset_index(drop=True)

In [None]:
# first and last date for each station
min_date = df_merged.groupby('departure_name')['departure_date'].agg(['min','max']).reset_index().rename(
    columns = {'min': 'min_date',
               'max': 'max_date'}
)

In [None]:
min_date.head()

In [None]:
min_date.info()

In [None]:
# stations to exclude: first seen in 2020 or later, or last seen before 2020
stations_to_exclude = min_date[(min_date['min_date']>='2020-01-01') | (min_date['max_date']<'2020-01-01')]['departure_name'].values
stations_to_exclude

In [None]:
# filtering
df_merged = df_merged[~(df_merged['departure_name'].isin(stations_to_exclude))].reset_index(drop=True)
df_merged.shape

In [None]:
# add the min / max date to the data
df_merged = df_merged.merge(min_date,on ='departure_name')

In [None]:
# drop each station's first day of data
df_merged = df_merged[df_merged['departure_date']>df_merged['min_date']].reset_index(drop=True)

In [None]:
df_merged = df_merged.dropna()

In [None]:
df_merged.shape

- Feature selection

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
# columns in the data frame
df_merged.columns

In [None]:
# candidate feature list
potential_x_names = [
        'day',
       'month', 'quarter', 'departure_name_encoded',
       'numbers_of_departures_lag_1', 'distance (m)_dep_lag_1',
       'duration (sec.)_dep_lag_1', 'avg_speed (km/h)_dep_lag_1',
       'number_of_returns_lag_1', 'distance (m)_ret_lag_1',
       'duration (sec.)_ret_lag_1', 'avg_speed (km/h)_ret_lag_1',
       'Air temperature (degC)_lag_1', 'y_cat_lag_1',
       'numbers_of_departures_lag_2', 'distance (m)_dep_lag_2',
       'duration (sec.)_dep_lag_2', 'avg_speed (km/h)_dep_lag_2',
       'number_of_returns_lag_2', 'distance (m)_ret_lag_2',
       'duration (sec.)_ret_lag_2', 'avg_speed (km/h)_ret_lag_2',
       'Air temperature (degC)_lag_2', 'y_cat_lag_2',
       'numbers_of_departures_lag_3', 'distance (m)_dep_lag_3',
       'duration (sec.)_dep_lag_3', 'avg_speed (km/h)_dep_lag_3',
       'number_of_returns_lag_3', 'distance (m)_ret_lag_3',
       'duration (sec.)_ret_lag_3', 'avg_speed (km/h)_ret_lag_3',
       'Air temperature (degC)_lag_3', 'y_cat_lag_3',
       'numbers_of_departures_lag_6', 'distance (m)_dep_lag_6',
       'duration (sec.)_dep_lag_6', 'avg_speed (km/h)_dep_lag_6',
       'number_of_returns_lag_6', 'distance (m)_ret_lag_6',
       'duration (sec.)_ret_lag_6', 'avg_speed (km/h)_ret_lag_6',
       'Air temperature (degC)_lag_6', 'y_cat_lag_6',
       'numbers_of_departures_lag_9', 'distance (m)_dep_lag_9',
       'duration (sec.)_dep_lag_9', 'avg_speed (km/h)_dep_lag_9',
       'number_of_returns_lag_9', 'distance (m)_ret_lag_9',
       'duration (sec.)_ret_lag_9', 'avg_speed (km/h)_ret_lag_9',
       'Air temperature (degC)_lag_9', 'y_cat_lag_9',
       'numbers_of_departures_lag_12', 'distance (m)_dep_lag_12',
       'duration (sec.)_dep_lag_12', 'avg_speed (km/h)_dep_lag_12',
       'number_of_returns_lag_12', 'distance (m)_ret_lag_12',
       'duration (sec.)_ret_lag_12', 'avg_speed (km/h)_ret_lag_12',
       'Air temperature (degC)_lag_12', 'y_cat_lag_12',
       'numbers_of_departures_lag_24', 'distance (m)_dep_lag_24',
       'duration (sec.)_dep_lag_24', 'avg_speed (km/h)_dep_lag_24',
       'number_of_returns_lag_24', 'distance (m)_ret_lag_24',
       'duration (sec.)_ret_lag_24', 'avg_speed (km/h)_ret_lag_24',
       'Air temperature (degC)_lag_24', 'y_cat_lag_24', 
       'hour'
]

In [None]:
len(potential_x_names)

In [None]:
# baseline model - decision tree
model_base = DecisionTreeClassifier(max_depth=7, random_state=123).fit(df_merged[potential_x_names],df_merged['y_cat'])

In [None]:
pd.options.display.float_format = '{:.4f}'.format

In [None]:
# feature importance
model_base.feature_importances_

In [None]:
# number of features the tree did not use
(model_base.feature_importances_==0).sum()

In [None]:
# final feature list (features with non-zero importance)
x_names = model_base.feature_names_in_[model_base.feature_importances_>0]
x_names


In [None]:
# describe the selected features
df_merged[x_names].describe()

In [None]:
# correlation
df_merged[list(x_names)+['y_cat']].corr(method='spearman')['y_cat']

In [None]:
# example density plot
sns.kdeplot(data=df_merged, x='numbers_of_departures_lag_1',hue= 'y_cat')
plt.show()

- Outlier detection.

In [None]:
from sklearn.ensemble import IsolationForest

In [None]:
# define the object
iso_forest = IsolationForest(bootstrap=True,random_state=123)


In [None]:
# fit
iso_forest.fit(df_merged[x_names[1:]])

In [None]:
# predict outliers (1 = inlier, -1 = outlier)
is_outlier = iso_forest.predict(df_merged[x_names[1:]])

In [None]:
# share
pd.Series(is_outlier).value_counts()/ len(df_merged)

In [None]:
# add the outlier flags to the data
df_merged['outlier'] = is_outlier
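
`IsolationForest.predict` returns `1` for inliers and `-1` for outliers, so the column named `outlier` holds `1` for the rows that are kept later. The sketch below demonstrates the convention on synthetic points with one obviously extreme observation; the data is invented.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data, invented for illustration only: a Gaussian cloud plus one far-away point
rng = np.random.default_rng(123)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[25.0, 25.0]]])

labels = IsolationForest(random_state=123).fit(X).predict(X)

# The last row is the extreme point; inliers are labelled 1
print(labels[-1], (labels == 1).mean())
```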


2. Model optimisation:
    - Use one of the optimisation algorithms covered so far.
    - If the results are unsatisfactory, test another algorithm.

In [None]:
# boundary dates for train / test / valid
date_train = '2020-01-01'
date_test = '2020-07-15'

In [None]:
# remove outliers (IsolationForest labels inliers as 1 and outliers as -1)
df_merged = df_merged[df_merged['outlier']==1].reset_index(drop=True)


In [None]:
# class share
df_merged.y_cat.value_counts()/len(df_merged)

In [None]:
# split into train / test / valid (.copy() so that columns can be added later without a SettingWithCopyWarning)
train = df_merged[df_merged['departure_date']<date_train].copy()
test = df_merged[(df_merged['departure_date']>=date_train) & (df_merged['departure_date']<=date_test)].copy()
valid = df_merged[df_merged['departure_date']>date_test].copy()
print(train.shape)
print(test.shape)
print(valid.shape)

In [None]:
# split into x / y
train_x = train[x_names]
train_y = train['y_cat']
test_x  =test[x_names]
test_y = test['y_cat']
valid_x = valid[x_names]
valid_y = valid['y_cat']

In [None]:
# import functions and libraries
import optuna 
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score, classification_report

In [None]:
# objective function: train on train, return AUC on test
def objective(trial):
    params = {'max_iter': trial.suggest_int('max_iter',200,2000),
              'max_depth': trial.suggest_int('max_depth',5,100),
              'learning_rate': trial.suggest_float('learning_rate',0.01,0.9),
              'min_samples_leaf': trial.suggest_int('min_samples_leaf',5,500)}
    model = HistGradientBoostingClassifier(**params).fit(train_x,train_y)
    preds = model.predict_proba(test_x)[:,1]
    return roc_auc_score(test_y, preds)

In [None]:
# study
study = optuna.create_study(direction='maximize')


In [None]:
# optimisation
study.optimize(objective, n_trials=15, n_jobs=-1)


In [None]:
# best parameters
best_params= study.best_params

In [None]:
# final model 
final_model = HistGradientBoostingClassifier(**best_params).fit(train_x,train_y)

In [None]:
from sklearn.preprocessing  import TargetEncoder

In [None]:
te = TargetEncoder().fit(train[['departure_name']],train_y)

In [None]:
train['departure_name_encoded_te'] = te.transform(train[['departure_name']])
test['departure_name_encoded_te'] = te.transform(test[['departure_name']])
valid['departure_name_encoded_te'] = te.transform(valid[['departure_name']])

In [None]:
x_names_2 = list(x_names.copy())

In [None]:
x_names_2.append('departure_name_encoded_te')


In [None]:
x_names_2

In [None]:
x_names_2 = x_names_2[1:]
x_names_2

In [None]:
# x sets with the target-encoded station name
train_x = train[x_names_2]
test_x  =test[x_names_2]
valid_x = valid[x_names_2]


In [None]:
final_model_te = HistGradientBoostingClassifier(**best_params).fit(train_x,train_y)

In [None]:
# x sets with all candidate features
train_x = train[potential_x_names]
test_x  =test[potential_x_names]
valid_x = valid[potential_x_names]


In [None]:
final_model_all_f = HistGradientBoostingClassifier(**best_params).fit(train_x,train_y)

In [None]:
# predictions
test_pred_1 = final_model.predict_proba(test[final_model.feature_names_in_])[:,1]
test_pred_te = final_model_te.predict_proba(test[final_model_te.feature_names_in_])[:,1]
test_pred_all_f  =final_model_all_f.predict_proba(test[final_model_all_f.feature_names_in_])[:,1]

In [None]:
final_model_all_f.feature_names_in_

In [None]:
# AUC
print(roc_auc_score(test_y,test_pred_1))
print(roc_auc_score(test_y,test_pred_te))
print(roc_auc_score(test_y,test_pred_all_f))

In [None]:
valid_pred_proba  =final_model.predict_proba(valid[final_model.feature_names_in_])[:,1]

In [None]:
roc_auc_score(valid_y,valid_pred_proba)

In [None]:
valid_pred  =final_model.predict(valid[final_model.feature_names_in_])

In [None]:
# evaluation - classification report
print(classification_report(valid_y,valid_pred))

In [None]:
import numpy as np

In [None]:
# number of correct predictions for each cut-off; probabilities are computed once, outside the loop
valid_pred_proba_all = final_model.predict_proba(valid[final_model.feature_names_in_])[:,1]
accuracy = []
for i in np.arange(0,1,0.01):
    valid_pred_i = (valid_pred_proba_all > i).astype(int)
    acc_i = sum((valid_y == valid_pred_i).astype(int))
    accuracy.append(acc_i)

In [None]:
cut_offs = np.arange(0,1,0.01)
cut_off = cut_offs[accuracy.index(max(accuracy))]
cut_off
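
The cut-off search above can also be written without a Python loop. The sketch below uses invented labels and probabilities and evaluates every threshold at once through NumPy broadcasting. Note that a cut-off chosen on the validation set and then reported on the same set gives an optimistic picture; picking it on the test set would leave the validation set untouched.

```python
import numpy as np

# Toy labels and predicted probabilities, invented for illustration only
y_true = np.array([0, 0, 1, 1, 1, 0])
proba = np.array([0.105, 0.405, 0.355, 0.805, 0.705, 0.205])

cut_offs = np.arange(0, 1, 0.01)

# One row per cut-off, one column per observation
preds = (proba[None, :] > cut_offs[:, None]).astype(int)
accuracy = (preds == y_true[None, :]).mean(axis=1)

# argmax returns the first cut-off that reaches the best accuracy
best_cut_off = cut_offs[accuracy.argmax()]
print(best_cut_off, accuracy.max())
```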

In [None]:
valid_pred_new = (final_model.predict_proba(valid[final_model.feature_names_in_])[:,1] >cut_off).astype(int)

In [None]:
print(classification_report(valid_y,valid_pred_new))

In [None]:
df_merged['pred_proba'] = final_model.predict_proba(df_merged[final_model.feature_names_in_])[:,1]

In [None]:
# results by station (AUC on data from 2020 onwards)
results = df_merged.loc[df_merged['departure_date']>='2020-01-01',['y_cat','pred_proba','departure_name']].groupby('departure_name').apply(
    lambda x: roc_auc_score(x['y_cat'], x['pred_proba']) 
).reset_index()
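
AUC is undefined for a station whose evaluation rows all belong to one class, and depending on the scikit-learn version `roc_auc_score` may raise an error in that case. A guarded version, sketched here with the hypothetical helper `safe_auc` and invented data, returns `NaN` for such stations instead.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def safe_auc(group):
    """Return the AUC for one station, or NaN when only one class is present."""
    if group['y_cat'].nunique() < 2:
        return np.nan
    return roc_auc_score(group['y_cat'], group['pred_proba'])

# Toy data, invented for illustration only: station B never triggers an alert
toy = pd.DataFrame({
    'departure_name': ['A', 'A', 'A', 'A', 'B', 'B'],
    'y_cat':          [0, 1, 0, 1, 0, 0],
    'pred_proba':     [0.1, 0.9, 0.2, 0.7, 0.3, 0.4],
})
per_station = {name: safe_auc(group) for name, group in toy.groupby('departure_name')}
print(per_station)
```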

In [None]:
results.head()

In [None]:
results.sort_values(by=0).tail(10)

In [None]:
# save the model (create the target directory first if it does not exist)
import joblib
os.makedirs('models', exist_ok=True)

In [None]:
joblib.dump(final_model, 'models/classification_model.joblib')

In [None]:
joblib.dump(oe, 'models/ordinal_encoder_for_classification.joblib')
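
To use the saved artefacts for alerts, they have to be loaded back with `joblib.load`. The round trip can be sketched with a small encoder written to a temporary directory; the file name and station names are invented, and the real paths are the ones used in the `joblib.dump` calls above.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Small stand-in encoder, invented for illustration only
encoder = OrdinalEncoder().fit(pd.DataFrame({'departure_name': ['Niittymaa', 'Valimotie']}))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'ordinal_encoder.joblib')
    joblib.dump(encoder, path)
    restored = joblib.load(path)

# The restored object behaves like the original one
codes = restored.transform(pd.DataFrame({'departure_name': ['Valimotie']})).astype(int).ravel()
print(codes)
```

Loading should happen with the same scikit-learn version that was used for saving, since pickled estimators are not guaranteed to be compatible across versions.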