## Preparation

In [None]:
import pandas as pd
import numpy as np
import time
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
import lightgbm as lgb

In [None]:
df=pd.read_csv('/datasets/taxi.csv', index_col=0, parse_dates=[0])

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df=df.resample('1H').sum()

In [None]:
df.head()

In [None]:
df.isna().sum()

## Analysis

In [None]:
df.plot(figsize=(15,5), legend=False)
plt.title('Taxi Orders Over Time (Hourly)')
plt.xlabel('Time')
plt.ylabel('Number of Orders')
plt.show()

Let's take a closer look at just August and September:

In [None]:
df['2018-08':'2018-09'].plot(figsize=(15,5), legend=False)
plt.title('Taxi Orders (August-September 2018)')
plt.xlabel('Time')
plt.ylabel('Number of Orders')
plt.show()

In [None]:
df['rolling_mean_24h']=df['num_orders'].rolling(24).mean()

plt.figure(figsize=(15,5))
plt.plot(df['num_orders'], label='Hourly Orders')
plt.plot(df['rolling_mean_24h'], label='24-Hour Moving Average', linewidth=3)
plt.title('Taxi Orders with 24-Hour Moving Average')
plt.xlabel('Time')
plt.ylabel('Number of Orders')
plt.legend()
plt.show()

In [None]:
df['rolling_mean_168h']=df['num_orders'].rolling(168).mean()

plt.figure(figsize=(15,5))
plt.plot(df['num_orders'], label='Hourly Orders')
plt.plot(df['rolling_mean_168h'], label='7-Day Moving Average', linewidth=3)
plt.title('Taxi Orders with 7-Day Moving Average')
plt.xlabel('Time')
plt.ylabel('Number of Orders')
plt.legend()
plt.show()

We successfully loaded and resampled the Sweet Lift taxi order data to one-hour intervals without encountering missing values. Visual analysis revealed clear daily and weekly patterns, with higher demand during typical peak hours. Moving averages were applied to better visualize short- and long-term trends, confirming the presence of seasonality in the data. The dataset is now clean, structured, and ready for modeling.
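One way to back up the visual impression of daily and weekly patterns is to average the orders by hour of day and by day of week. The sketch below uses a synthetic hourly series as a stand-in for `df['num_orders']`; the helper name `seasonal_profile` and the synthetic data are assumptions made for illustration and are not part of the original analysis.

```python
import numpy as np
import pandas as pd

def seasonal_profile(series):
    """Return the mean value by hour of day and by day of week for an hourly series."""
    by_hour = series.groupby(series.index.hour).mean()
    by_weekday = series.groupby(series.index.dayofweek).mean()
    return by_hour, by_weekday

# Synthetic stand-in for df['num_orders']: four weeks of hourly data
# with a built-in daily cycle (illustrative values only).
idx = pd.date_range('2018-03-05', periods=24 * 28, freq='h')
hours = idx.hour.to_numpy()
orders = pd.Series(50 + 20 * np.sin(2 * np.pi * hours / 24), index=idx)

by_hour, by_weekday = seasonal_profile(orders)
print(by_hour.round(1))     # varies strongly across the day
print(by_weekday.round(1))  # flat here, because the toy data has no weekly cycle
```

On the real data, a pronounced shape in `by_hour` or `by_weekday` would support the daily or weekly seasonality seen in the plots.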

## Training

In [None]:
split_index= int(len(df) * 0.9)

train=df.iloc[:split_index]
test=df.iloc[split_index:]

print('Train Shape:', train.shape)
print('Test Shape:', test.shape)

In [None]:
print('Last train date:', train.index.max())
print('First test date:', test.index.min())

Great, there are no gaps or overlaps in the split: the older data is used for training and the newer data for testing.
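The same check can be made programmatic rather than read off the printed dates. This is a minimal sketch on a synthetic hourly index; `check_chronological_split` is a hypothetical helper name, and the one-hour step assumes the data has been resampled to hourly intervals as above.

```python
import pandas as pd

def check_chronological_split(train_index, test_index, step=pd.Timedelta(hours=1)):
    """Return True if the test period starts exactly one step after the train period ends."""
    return (test_index.min() - train_index.max()) == step

# Synthetic hourly index standing in for df.index
idx = pd.date_range('2018-03-01', periods=100, freq='h')
split_index = int(len(idx) * 0.9)
train_idx, test_idx = idx[:split_index], idx[split_index:]

print(check_chronological_split(train_idx, test_idx))
```

In the notebook, the call would be `check_chronological_split(train.index, test.index)`.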

In [None]:
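def make_features(data, lags=(1, 2, 3), rolling_windows=(24, 168)):
    # Keep only the target column so the plot-only rolling means added during
    # the analysis (which include the current hour) are not used as features.
    data = data[['num_orders']].copy()
    for lag in lags:
        # Orders observed `lag` hours earlier
        data[f'lag_{lag}'] = data['num_orders'].shift(lag)
    for window in rolling_windows:
        # Shift by one hour first so the window never includes the target itself
        data[f'rolling_mean_{window}'] = data['num_orders'].shift(1).rolling(window=window).mean()
    # Calendar features for the daily and weekly cycles
    data['hour'] = data.index.hour
    data['day_of_week'] = data.index.dayofweek
    # Drop the leading rows that lack a full lag/rolling history
    return data.dropna()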
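The `shift(1)` before `rolling(...)` in `make_features` matters: without it, each rolling mean would include the very value being predicted. A small sketch on a toy series (values chosen only for illustration) shows the difference.

```python
import pandas as pd

# Toy target series, illustrative values only
s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Rolling mean over the current and previous value: the window contains the target itself.
leaky = s.rolling(window=2).mean()

# Shift first, so the window covers only the two values before the current row.
safe = s.shift(1).rolling(window=2).mean()

print(pd.DataFrame({'target': s, 'leaky': leaky, 'safe': safe}))
```

For the last row (target 50), the leaky feature averages 40 and 50, while the safe feature averages 30 and 40, using only information available before that hour.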

In [None]:
train_features=make_features(train)
test_features= make_features(test)

In [None]:
X_train = train_features.drop('num_orders', axis=1)
y_train = train_features['num_orders']

X_test = test_features.drop('num_orders', axis=1)
y_test = test_features['num_orders']

In [None]:
#Linear Regression
model= LinearRegression()
model.fit(X_train, y_train)

y_pred=model.predict(X_test)

In [None]:
#Random Forest
rf_model = RandomForestRegressor(
    n_estimators=100, 
    random_state=42,   
    n_jobs=-1         
)

rf_model.fit(X_train, y_train)

y_pred_rf= rf_model.predict(X_test)

#Random Forest-model2
rf_model2 = RandomForestRegressor(
    n_estimators=300,        
    max_depth=10,            
    min_samples_split=5,     
    min_samples_leaf=2,      
    random_state=42,
    n_jobs=-1      
)

rf_model2.fit(X_train, y_train)

y_pred_rf2= rf_model2.predict(X_test)

In [None]:
#LightGBM
lgb_model= lgb.LGBMRegressor(
    n_estimators=1000,
    learning_rate=0.05,
    num_leaves=32,
    random_state=42
)

lgb_model.fit(X_train,y_train)

y_pred_lgb = lgb_model.predict(X_test)

#LightGBM-model2
lgb_model2= lgb.LGBMRegressor(
    n_estimators=500,
    learning_rate=0.1,
    num_leaves=64,
    random_state=42
)

lgb_model2.fit(X_train,y_train)

y_pred_lgb2 = lgb_model2.predict(X_test)

## Testing

In [None]:
#Linear Regression
rmse= np.sqrt(mean_squared_error(y_test, y_pred))
print('Linear Regression Test RMSE:', rmse)

#Random Forest
rmse_rf= np.sqrt(mean_squared_error(y_test, y_pred_rf))
print('Random Forest Test RMSE:', rmse_rf)

#Random Forest-model2
rmse_rf2 = np.sqrt(mean_squared_error(y_test, y_pred_rf2))
print('Random Forest (different Hyperparameters) Test RMSE:', rmse_rf2)


#LightGBM
rmse_lgb= np.sqrt(mean_squared_error(y_test, y_pred_lgb))
print('LightGBM Test RMSE:', rmse_lgb)


#LightGBM-model2
rmse_lgb2= np.sqrt(mean_squared_error(y_test, y_pred_lgb2))
print('LightGBM (Different Hyperparameters) Test RMSE:', rmse_lgb2)


During the training and testing phase, several models were trained on the training set and evaluated on the reserved 10% test set. Linear Regression performed best with a test RMSE of 40.67, meeting the project’s RMSE target of 48. The two Random Forest configurations did not meet the target, with test RMSE values of 70.79 and 52.82. The two LightGBM configurations both reached a test RMSE of 47.78, which also meets the target. Based on these results, Linear Regression was selected as the final model for its superior predictive accuracy.
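For reference, RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same unit as the target (orders per hour). Below is a minimal sketch of the metric and the threshold check, using made-up toy values rather than the project's data; `rmse` and `RMSE_TARGET` are names introduced here for illustration.

```python
import numpy as np

RMSE_TARGET = 48  # project threshold as stated in the write-up

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values, not taken from the taxi dataset
y_true = [100, 120, 90, 110]
y_pred = [110, 115, 95, 100]

score = rmse(y_true, y_pred)
print(f'RMSE: {score:.2f}, meets target: {score <= RMSE_TARGET}')
```

This is equivalent to the `np.sqrt(mean_squared_error(...))` calls used in the testing cell.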

## Conclusion

In this project, we developed a model to predict hourly taxi orders for Sweet Lift Taxi Company. After resampling the data to hourly intervals and creating lag, rolling-average, and time-based features, we split the dataset chronologically into 90% for training and 10% for testing. Linear Regression, Random Forest, and LightGBM models were trained and evaluated, with a second hyperparameter configuration tried for Random Forest and LightGBM.

Linear Regression achieved the best test RMSE, 40.67, and both LightGBM configurations reached 47.78, so both model types met the project’s RMSE target of 48. The Random Forest models did not meet the target under either hyperparameter configuration. Linear Regression was therefore selected as the final model. It will support Sweet Lift Taxi Company in optimizing driver allocation during peak hours.