Implement an application that estimates half-marathon finish time from the given data

1. Upload the data to DigitalOcean Spaces

1. Write a notebook that serves as your model-training pipeline
    * reads the data from DigitalOcean Spaces
    * cleans it
    * trains the model (select appropriate metrics [feature selection])
    * saves each new model version locally and to DigitalOcean Spaces

1. Application
    * wrap the model in a Streamlit application
    * deploy the application with DigitalOcean App Platform
    * the input is a text field in which users introduce themselves and state
    their gender, age, and 5 km pace
    * if the user provided too little data, show a message saying which data are missing
    * use an LLM (OpenAI) to extract the data your model needs
    into a dictionary (dict or JSON)
    * connect this part to Langfuse to collect metrics on how well the LLM performs
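
The missing-data check from the list above can be sketched independently of the LLM call. The sketch assumes the LLM returns a dictionary with the keys `gender`, `age`, and `pace_5km` (pace written as `MM:SS` per kilometre). These key names and both helper functions are hypothetical and are not part of the task. The second helper converts the pace into a 5 km time in seconds, on the assumption that the model expects the 5 km time as a feature.

```python
from typing import Optional

# Hypothetical field names; the real keys depend on the prompt given to the LLM.
REQUIRED_FIELDS = {
    "gender": "gender",
    "age": "age",
    "pace_5km": "5 km pace",
}


def missing_fields(extracted: dict) -> list:
    """Return readable names of required fields that are absent or empty."""
    return [
        label
        for key, label in REQUIRED_FIELDS.items()
        if extracted.get(key) in (None, "")
    ]


def pace_to_5km_seconds(pace: str) -> Optional[int]:
    """Convert a pace given as 'MM:SS' per kilometre into a 5 km time in seconds."""
    parts = pace.strip().split(":")
    if len(parts) != 2:
        return None
    try:
        minutes, seconds = int(parts[0]), int(parts[1])
    except ValueError:
        return None
    return (minutes * 60 + seconds) * 5


extracted = {"gender": "M", "age": None, "pace_5km": "5:30"}
print(missing_fields(extracted))      # ['age']
print(pace_to_5km_seconds("5:30"))    # 1650
```

In the Streamlit app, a non-empty list from `missing_fields` could be shown with `st.warning` before any prediction is attempted.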



In [None]:
import pandas as pd
from pycaret.regression import setup, compare_models, finalize_model, save_model

# Load the data
df_2023 = pd.read_csv('halfmarathon_wroclaw_2023__final.csv', sep=';')
df_2024 = pd.read_csv('halfmarathon_wroclaw_2024__final.csv', sep=';')


# Compute age from the year of birth
df_2023['Wiek'] = 2023 - df_2023['Rocznik']
df_2024['Wiek'] = 2024 - df_2024['Rocznik']

# Combine both DataFrames into one
df = pd.concat([df_2023, df_2024], ignore_index=True)

# Drop rows with missing values in the key columns
df = df.dropna(subset=['5 km Czas', '10 km Czas', '15 km Czas', '20 km Czas', 'Czas'])

# Encode gender as a number (0 = K, female; 1 = M, male)
df['Płeć'] = df['Płeć'].map({'K': 0, 'M': 1})

# Columns to drop
columns_to_drop = ['Miejsce', 'Numer startowy', 'Rocznik', 'Kategoria wiekowa', 'Imię', 'Nazwisko', 'Miasto', 'Kraj', 'Drużyna', 'Płeć Miejsce', 'Kategoria wiekowa Miejsce', 
                  '5 km Miejsce Open', '10 km Miejsce Open', '15 km Miejsce Open', '20 km Miejsce Open', 'Tempo Stabilność']
df.drop(columns=columns_to_drop, inplace=True, errors='ignore')

# Convert a time or pace string ('MM:SS' or 'HH:MM:SS') to seconds
def convert_time_to_seconds(time_str):
    if pd.isna(time_str):
        return None
    time_parts = time_str.split(':')
    if len(time_parts) == 2:
        minutes, seconds = map(int, time_parts)
        return minutes * 60 + seconds
    elif len(time_parts) == 3:
        hours, minutes, seconds = map(int, time_parts)
        return hours * 3600 + minutes * 60 + seconds
    return None

# Apply the conversion
df['Czas_półmaratonu'] = df['Czas'].apply(convert_time_to_seconds)
df['Czas_na_5km'] = df['5 km Czas'].apply(convert_time_to_seconds)
df['Czas_na_10km'] = df['10 km Czas'].apply(convert_time_to_seconds)
df['Czas_na_15km'] = df['15 km Czas'].apply(convert_time_to_seconds)

# Drop the columns that are no longer needed
df.drop(columns=['Czas', 'Tempo', '5 km Czas', '10 km Czas', '15 km Czas', '20 km Czas',
                 '5 km Tempo', '10 km Tempo', '15 km Tempo', '20 km Tempo'],
        inplace=True, errors='ignore')

df

# Train, finalize, and save the best model for a given target column
def train_and_predict(target_column):
    # Set up the PyCaret environment; every remaining column of df,
    # including Płeć, Wiek, and the other split times, is used as a feature
    setup(
        data=df,
        target=target_column,
        session_id=123,
        normalize=True,
        # feature_selection=True,
        # remove_multicollinearity=True,
        use_gpu=True,
        numeric_features=['Wiek', 'Płeć'],
        verbose=True,
    )

    # Compare the candidate models with 5-fold cross-validation
    best_model = compare_models(fold=5)

    # Refit the best model on the full dataset
    final_model = finalize_model(best_model)

    # Save the model to '<target_column>_model.pkl'
    save_model(final_model, f'{target_column}_model')

    return final_model

# Train and save one model per target (5 km, 10 km, 15 km); the half-marathon call is commented out
time_5km_model = train_and_predict('Czas_na_5km')
time_10km_model = train_and_predict('Czas_na_10km')
time_15km_model = train_and_predict('Czas_na_15km')
# halfmarathon_time_model = train_and_predict('Czas_półmaratonu')

[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 2, number of used features: 0
[LightGBM] [Info] Using GPU Device: gfx1103, Vendor: Advanced Micro Devices, Inc.
[LightGBM] [Info] Compiling OpenCL Kernel with 16 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Start training from score 0.500000

|  | Description | Value |
|---|---|---|
| 0 | Session id | 123 |
| 1 | Target | Czas_na_10km |
| 2 | Target type | Regression |
| 3 | Original data shape | (18377, 6) |
| 4 | Transformed data shape | (18377, 6) |
| 5 | Transformed train set shape | (12863, 6) |
| 6 | Transformed test set shape | (5514, 6) |
| 7 | Numeric features | 2 |
| 8 | Rows with missing values | 2.6% |
| 9 | Preprocess | True |



|  | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| lr | Linear Regression | 18.1573 | 1028.3605 | 31.9356 | 0.9961 | 0.0092 | 0.0052 | 0.018 |
| lar | Least Angle Regression | 18.1573 | 1028.3605 | 31.9356 | 0.9961 | 0.0092 | 0.0052 | 0.014 |
| br | Bayesian Ridge | 18.1568 | 1028.3599 | 31.9354 | 0.9961 | 0.0092 | 0.0052 | 0.014 |
| huber | Huber Regressor | 18.0757 | 1017.0609 | 31.7644 | 0.9961 | 0.0091 | 0.0052 | 0.052 |
| ridge | Ridge Regression | 18.136 | 1028.5506 | 31.9321 | 0.9961 | 0.0092 | 0.0052 | 0.014 |
| par | Passive Aggressive Regressor | 18.8269 | 1059.0634 | 32.4006 | 0.996 | 0.0093 | 0.0054 | 0.018 |
| rf | Random Forest Regressor | 19.0941 | 1054.146 | 32.3534 | 0.996 | 0.0088 | 0.0055 | 0.356 |
| et | Extra Trees Regressor | 19.0167 | 1086.5053 | 32.8508 | 0.9959 | 0.0089 | 0.0055 | 0.226 |
| gbr | Gradient Boosting Regressor | 20.9816 | 1180.2849 | 34.2955 | 0.9955 | 0.0094 | 0.0061 | 0.69 |
| llar | Lasso Least Angle Regression | 19.9956 | 1218.1749 | 34.6885 | 0.9954 | 0.0101 | 0.0057 | 0.014 |


Transformation Pipeline and Model Successfully Saved

|  | Description | Value |
|---|---|---|
| 0 | Session id | 123 |
| 1 | Target | Czas_na_15km |
| 2 | Target type | Regression |
| 3 | Original data shape | (18377, 6) |
| 4 | Transformed data shape | (18377, 6) |
| 5 | Transformed train set shape | (12863, 6) |
| 6 | Transformed test set shape | (5514, 6) |
| 7 | Numeric features | 2 |
| 8 | Rows with missing values | 2.6% |
| 9 | Preprocess | True |



|  | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| lr | Linear Regression | 26.728 | 1975.9622 | 44.4298 | 0.997 | 0.0081 | 0.005 | 0.018 |
| ridge | Ridge Regression | 26.7447 | 1975.2828 | 44.4215 | 0.997 | 0.008 | 0.005 | 0.014 |
| lar | Least Angle Regression | 26.728 | 1975.9622 | 44.4298 | 0.997 | 0.0081 | 0.005 | 0.014 |
| br | Bayesian Ridge | 26.7283 | 1975.9436 | 44.4296 | 0.997 | 0.0081 | 0.005 | 0.014 |
| huber | Huber Regressor | 26.6312 | 1976.12 | 44.4332 | 0.997 | 0.008 | 0.005 | 0.048 |
| par | Passive Aggressive Regressor | 27.2352 | 2034.8798 | 45.0873 | 0.9969 | 0.0082 | 0.0051 | 0.018 |
| lasso | Lasso Regression | 27.9042 | 2100.4812 | 45.7979 | 0.9968 | 0.0083 | 0.0053 | 0.102 |
| llar | Lasso Least Angle Regression | 27.9033 | 2100.5084 | 45.7984 | 0.9968 | 0.0083 | 0.0053 | 0.014 |
| et | Extra Trees Regressor | 28.2124 | 2177.1623 | 46.6095 | 0.9967 | 0.0084 | 0.0053 | 0.222 |
| rf | Random Forest Regressor | 28.3773 | 2206.1714 | 46.913 | 0.9966 | 0.0084 | 0.0053 | 0.348 |


Transformation Pipeline and Model Successfully Saved

|  | Description | Value |
|---|---|---|
| 0 | Session id | 123 |
| 1 | Target | Czas_półmaratonu |
| 2 | Target type | Regression |
| 3 | Original data shape | (18377, 6) |
| 4 | Transformed data shape | (18377, 6) |
| 5 | Transformed train set shape | (12863, 6) |
| 6 | Transformed test set shape | (5514, 6) |
| 7 | Numeric features | 2 |
| 8 | Rows with missing values | 2.6% |
| 9 | Preprocess | True |



|  | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| lr | Linear Regression | 87.8501 | 19101.6746 | 138.0594 | 0.9872 | 0.0172 | 0.0114 | 0.016 |
| ridge | Ridge Regression | 88.0433 | 19113.5561 | 138.1032 | 0.9872 | 0.0172 | 0.0114 | 0.014 |
| lar | Least Angle Regression | 87.8501 | 19101.6746 | 138.0594 | 0.9872 | 0.0172 | 0.0114 | 0.014 |
| br | Bayesian Ridge | 87.8532 | 19101.7295 | 138.0596 | 0.9872 | 0.0172 | 0.0114 | 0.016 |
| huber | Huber Regressor | 86.5532 | 19295.0189 | 138.7623 | 0.987 | 0.0172 | 0.0112 | 0.046 |
| lasso | Lasso Regression | 89.9873 | 19541.0166 | 139.6494 | 0.9869 | 0.0174 | 0.0117 | 0.416 |
| llar | Lasso Least Angle Regression | 89.8738 | 19499.0627 | 139.4976 | 0.9869 | 0.0174 | 0.0117 | 0.014 |
| par | Passive Aggressive Regressor | 87.9758 | 19803.8194 | 140.5596 | 0.9867 | 0.0175 | 0.0114 | 0.024 |
| rf | Random Forest Regressor | 92.5355 | 20248.9915 | 142.1565 | 0.9864 | 0.0178 | 0.0121 | 0.354 |
| et | Extra Trees Regressor | 92.6574 | 20386.3916 | 142.6522 | 0.9863 | 0.0178 | 0.0121 | 0.224 |


Transformation Pipeline and Model Successfully Saved
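
The task list asks for each new model version to be stored both locally and in Spaces, whereas `save_model` in the cell above only writes a local `.pkl` file. One possible upload step is sketched below, assuming `boto3` is installed (Spaces exposes an S3-compatible API) and that the access keys are available through the standard AWS environment variables. The region `fra1`, the Space name, the key layout, and both helper functions are hypothetical.

```python
from datetime import datetime, timezone


def versioned_model_key(target_column: str, now: datetime) -> str:
    """Build an object key that embeds a UTC timestamp as the model version."""
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return f"models/{target_column}/{stamp}.pkl"


def upload_model(local_path: str, bucket: str, key: str, region: str = "fra1") -> None:
    """Upload a saved model file to a Space through its S3-compatible endpoint."""
    import boto3  # imported lazily so the key helper works without boto3

    client = boto3.client(
        "s3",
        region_name=region,
        endpoint_url=f"https://{region}.digitaloceanspaces.com",
    )
    client.upload_file(local_path, bucket, key)


key = versioned_model_key(
    "Czas_na_5km", datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
)
print(key)  # models/Czas_na_5km/20240101T120000Z.pkl

# Hypothetical Space name; uncomment once credentials are configured.
# upload_model("Czas_na_5km_model.pkl", "my-space", key)
```

A timestamp in the key keeps earlier model versions in the Space instead of overwriting them, which makes it possible to roll back to a previous model.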
