# RidePulse Nairobi
## Predictive Demand Hotspots for Boda-Boda Riders

RidePulse Nairobi is a data science project that tackles a core economic problem for thousands of independent boda-boda riders in Nairobi: inefficient positioning and excessive idle time. By analyzing historical ride data, we will build a machine learning model that predicts high-demand "hotspots" across the city for any given hour and day. The final output will be a simple, interactive heat map prototype that guides riders to the areas with the highest probability of securing a customer, which should translate into reduced fuel costs, less idle time, and increased daily income.


### Problem Statement

Nairobi's boda-boda riders operate in a hyper-competitive market. Their income is directly proportional to the number of trips they complete. However, they lack predictive tools, forcing them to rely on gut instinct and experience to find customers. This leads to critical inefficiencies such as:

1. Wasted Fuel and Time: Riders spend significant portions of their day "cruising empty" in search of passengers.

2. Oversupply and Competition: Riders often congregate in traditionally "busy" areas (e.g., CBD, Westlands), only to find intense competition and long waits.

3. Missed Opportunities: A lucrative ride request might be available just a few blocks away in a non-obvious location, but the rider has no way of knowing.

This information gap puts a hard ceiling on a rider's potential earnings and operational efficiency.



### Proposed Solution: Data-Driven Positioning

We propose building a predictive system that transforms raw data into actionable intelligence. The system will:

1. Forecast Demand: Use a machine learning model to predict the number of ride requests for specific zones across Nairobi for any given hour and day.

2. Visualize Insights: Translate these predictions into a simple, color-coded heat map overlaid on a map of Nairobi.

    - Red/Orange: "Hot Zone" - Go here for a high chance of a ride.

    - Yellow: "Warm Zone" - Moderate demand.

    - Blue/Clear: "Cold Zone" - Avoid waiting here.

3. Empower Riders: Provide a simple, visual tool (simulated via a web app) that answers the rider's most important question: "Where should I be right now to find my next customer?"
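
The colour bands described above could be derived from the forecast in several ways. As a minimal sketch, assuming hypothetical cut-offs (`warm_threshold` and `hot_threshold` are illustrative names and values, not part of this project, and would need tuning against real forecasts), a predicted hourly demand could be mapped to a band like this:

```python
def classify_zone(predicted_demand, warm_threshold=2.0, hot_threshold=4.0):
    """Map a predicted hourly ride count to a heat-map band.

    The default thresholds are illustrative assumptions, not values
    taken from the RidePulse model.
    """
    if predicted_demand >= hot_threshold:
        return "hot"
    if predicted_demand >= warm_threshold:
        return "warm"
    return "cold"


# Hypothetical forecasts keyed by made-up zone names
forecast = {"zone_a": 5.2, "zone_b": 2.5, "zone_c": 0.4}
bands = {zone: classify_zone(value) for zone, value in forecast.items()}
print(bands)
```

Fixed thresholds are easy to explain to riders; quantile-based cut-offs would be an alternative if demand levels shift a lot between hours.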


### Key Objectives

1. Process and transform raw ride data into a structured feature set by engineering time-based features and implementing Uber's H3 spatial indexing to grid the city into hexagonal zones.

2. Develop a regression model (e.g., LightGBM) to forecast ride demand per zone per hour, aiming for an R-squared (R²) above 0.75 on held-out data.

3. Build an interactive prototype using Streamlit and Folium that displays the demand forecast as an intuitive heat map, proving the project's real-world applicability.


## Data Understanding 

We will use the Sendy Logistics Challenge dataset available on Zindi. It contains over 20,000 real-world boda-boda delivery records from Nairobi, including precise pickup timestamps and latitude/longitude coordinates. Direct Link: https://zindi.africa/competitions/sendy-logistics-challenge/data

Tech Stack:

- Language: Python

- Core Libraries: Pandas, NumPy, Scikit-learn, LightGBM

- Geospatial: Geopandas, H3-py, Folium

- Prototyping: Streamlit

#### Success Metrics

We will measure success both technically and practically:

1. Technical Metric (MAE): The model's Mean Absolute Error should be less than 3 rides, meaning our predictions are, on average, very close to the actual demand.

2. Business Metric (Hotspot Precision): The model must correctly identify at least 8 out of the 10 actual busiest zones during peak hours, proving its effectiveness in finding profitable locations.
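
The hotspot metric can be read as a top-k overlap between actual and predicted demand. The sketch below is one possible way to compute it, not necessarily how the project will; the function name `hotspot_precision` and the toy demand values are assumptions made for illustration. With `k=10`, the target above corresponds to a score of at least 0.8.

```python
def hotspot_precision(actual, predicted, k=10):
    """Fraction of the k busiest zones (by actual demand) that also
    appear among the k zones with the highest predicted demand."""
    def top_k(demand):
        ranked = sorted(demand, key=demand.get, reverse=True)
        return set(ranked[:k])

    return len(top_k(actual) & top_k(predicted)) / k


# Made-up demand for five zones, evaluated with k=3 for brevity
actual = {"a": 9, "b": 7, "c": 6, "d": 2, "e": 1}
predicted = {"a": 8.1, "b": 3.0, "c": 6.5, "d": 4.2, "e": 0.9}
score = hotspot_precision(actual, predicted, k=3)
print(f"Hotspot precision@3: {score:.2f}")
```

Ties at the cut-off are broken arbitrarily by the sort order here; a real evaluation may want an explicit tie-breaking rule.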


## Exploratory Data Analysis

In [1]:
# seaborn is needed for the plots further down; install it first if the import fails
!pip install seaborn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
!ls ./data

**Looking through the dataset**

We begin by loading the dataset and examining its structure, columns, and a few sample rows to get a feel for the data we are working with.

In [None]:
df = pd.read_csv('./data/Train.csv')
df.head()

In [None]:
df.isnull().sum()

From the cell above, we see that two columns (Temperature and Precipitation in millimeters) have null values.

**Building a 'Placement_Datetime' column from the day of month and time**

In [None]:
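# The dataset gives only the day of the month, so we assume a placeholder month (January) and year.
# Note: weekdays derived from this date follow the January 2019 calendar, not necessarily the real one.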
placeholder_month = 1
placeholder_year = 2019

df['Placement_Datetime'] = pd.to_datetime(
    df['Placement - Day of Month'].astype(str) + '-' +
    str(placeholder_month) + '-' +
    str(placeholder_year) + ' ' +
    df['Placement - Time'],
    format='%d-%m-%Y %I:%M:%S %p'  
)

In [None]:
df.head()
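
To make the format string above concrete, here is a small standard-library sketch of how a 12-hour time such as `9:35:46 AM` is combined with a day of month and the placeholder month and year. The sample values are made up for illustration; only the format string comes from the cell above.

```python
from datetime import datetime

FORMAT = "%d-%m-%Y %I:%M:%S %p"  # same format string as the cell above

# Hypothetical raw values: day of month 9, time "9:35:46 AM"
day_of_month = 9
time_text = "9:35:46 AM"

stamp = datetime.strptime(f"{day_of_month}-1-2019 {time_text}", FORMAT)
noon = datetime.strptime("9-1-2019 12:05:10 PM", FORMAT)
midnight = datetime.strptime("9-1-2019 12:05:10 AM", FORMAT)

print(stamp.isoformat())                      # 24-hour representation
print(stamp.hour, noon.hour, midnight.hour)   # %I with %p handles 12 AM/PM
```

The hour is reliable because it comes straight from the recorded time, whereas anything that depends on the month or year inherits the placeholder assumption.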

The columns we are going to use in this project are listed below:

In [None]:

essential_cols = [
    'Order No',            
    'Placement_Datetime',  
    'Personal or Business',
    'Platform Type',
    'Pickup Lat', 
    'Pickup Long'          
]


df_focused = df[essential_cols].copy()



In [None]:
df_focused.head()

### Displaying a Map of Pickup Locations

Here, we install and import **folium**, a Python library used to create interactive maps directly in Jupyter Notebooks. We create an interactive heat map showing the density of pickup locations in Nairobi, based on latitude and longitude coordinates in our DataFrame.

In [None]:
!pip install folium


In [None]:
import folium
from folium.plugins import HeatMap

# Visualize the pickup locations using latitude and longitude
heat_data = df_focused[['Pickup Lat', 'Pickup Long']].values.tolist()
nairobi_map = folium.Map(location=[-1.286389, 36.817223], zoom_start=12)
HeatMap(heat_data).add_to(nairobi_map)
print("Displaying Heat Map of Pickup Locations...")
nairobi_map

In [None]:
# Extracting numerical time features from the datetime column
df_focused['hour_of_day'] = df_focused['Placement_Datetime'].dt.hour
df_focused['day_of_week'] = df_focused['Placement_Datetime'].dt.dayofweek
df_focused.head()

### H3 Geospatial Indexing

H3 is an open-source geospatial indexing system developed by Uber. It divides the surface of the Earth into hexagonal cells, which makes it easier to analyze and visualize geographic data efficiently and accurately. In the cell below, we use H3 to convert each pickup location's latitude and longitude into a unique H3 hexagonal cell ID, and store the result in a new column called **h3_cell**.



In [None]:
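# converting the latitude and longitude into H3 cells
# (latlng_to_cell is the h3-py v4 name; v3 called it geo_to_h3)
!pip install h3

import h3

# Resolution 12 is very fine-grained: cells average roughly 300 m² each
H3_RESOLUTION = 12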

def latlon_to_h3(row):
    return h3.latlng_to_cell(row['Pickup Lat'], row['Pickup Long'], H3_RESOLUTION)

df_focused['h3_cell'] = df_focused.apply(latlon_to_h3, axis=1)
df_focused.head()
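
The choice of resolution determines how large each zone is. As a rough back-of-the-envelope sketch (standard library only, using an approximate Earth surface area and the H3 cell-count formula of 2 + 120 × 7^resolution), the average cell area at a few resolutions can be estimated as follows. The helper names are illustrative, and individual cells deviate somewhat from the average.

```python
EARTH_SURFACE_KM2 = 510_065_622  # approximate surface area of the Earth


def h3_cell_count(resolution):
    """Total number of H3 cells at a resolution: 2 + 120 * 7**resolution."""
    return 2 + 120 * 7 ** resolution


def average_cell_area_m2(resolution):
    """Average cell area in square metres (individual cells vary slightly)."""
    return EARTH_SURFACE_KM2 / h3_cell_count(resolution) * 1_000_000


for res in (8, 10, 12):
    print(f"resolution {res}: ~{average_cell_area_m2(res):,.0f} m^2 per cell")
```

At resolution 12 each zone is only a few hundred square metres, so most cell and hour combinations will hold very few pickups; a coarser resolution may be worth comparing if the zones turn out too sparse to forecast well.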

**Binary encoding of the 'Personal or Business' column**

In [None]:
# We binary encode the personal/business column (1 = Business, 0 = anything else)
df_focused['is_business'] = (df_focused['Personal or Business'] == 'Business').astype(int)
#preview
df_focused[['Personal or Business', 'is_business']].head()

In [None]:
df_focused.shape

Below, we group the ride request data by location (H3 cell), day of the week, and hour of the day to prepare our dataset for modelling. For each group we calculate how many ride requests occurred (demand_count) and what proportion were business rides (business_ratio), then merge both metrics into one DataFrame (df_model_ready).

In [None]:
# group by location cell, day of the week, and hour of the day
demand_counts = df_focused.groupby(['h3_cell', 'day_of_week', 'hour_of_day']).size().reset_index(name='demand_count')
business_proportion = df_focused.groupby(['h3_cell', 'day_of_week', 'hour_of_day'])['is_business'].mean().reset_index(name='business_ratio')

#we merge both dataframes
df_model_ready = pd.merge(demand_counts, business_proportion, on=['h3_cell', 'day_of_week', 'hour_of_day'])

print(df_model_ready.shape)
df_model_ready.head()
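
To show what this aggregation produces, here is a tiny standard-library sketch on made-up pickups (the cell names and values are hypothetical). It mirrors the two group-by steps above: a count per group and the mean of `is_business` per group.

```python
from collections import defaultdict

# Hypothetical pickups: (h3_cell, day_of_week, hour_of_day, is_business)
pickups = [
    ("cell_a", 0, 9, 1),
    ("cell_a", 0, 9, 0),
    ("cell_a", 0, 9, 1),
    ("cell_b", 2, 17, 1),
]

groups = defaultdict(list)
for cell, day, hour, is_business in pickups:
    groups[(cell, day, hour)].append(is_business)

# One row per observed (cell, day, hour) combination
rows = {
    key: {"demand_count": len(flags), "business_ratio": sum(flags) / len(flags)}
    for key, flags in groups.items()
}
for key, row in rows.items():
    print(key, row)
```

Note that only combinations that actually occur get a row, so the smallest possible `demand_count` is 1; cell and hour combinations with no pickups are absent rather than recorded as zero.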

Below, we convert the h3_cell column into a categorical data type so that the models can treat it as a categorical feature.



In [None]:

df_model_ready['h3_cell'] = df_model_ready['h3_cell'].astype('category')
df_model_ready.info()

**Visualizing patterns in ride demand across hours of the day and days of the week using bar plots**



In [None]:
# Group by hour and day to see the patterns
hourly_demand = df_model_ready.groupby('hour_of_day')['demand_count'].mean()
daily_demand = df_model_ready.groupby('day_of_week')['demand_count'].mean()
day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

#creating the plots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

#hourly Demand Plot
sns.barplot(x=hourly_demand.index, y=hourly_demand.values, ax=ax1, palette='viridis')
ax1.set_title('Average Demand by Hour of Day')
ax1.set_xlabel('Hour')
ax1.set_ylabel('Average Demand')

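#Daily Demand Plot (labels are looked up from the day index, in case a day is missing)
day_labels = [day_names[d] for d in daily_demand.index]
sns.barplot(x=day_labels, y=daily_demand.values, ax=ax2, palette='plasma')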
ax2.set_title('Average Demand by Day of Week')
ax2.set_xlabel('Day')
ax2.set_ylabel('Average Demand')

plt.tight_layout()
plt.show()

**Selecting Features and Target variable**

In [None]:
features = ['h3_cell', 'day_of_week', 'hour_of_day', 'business_ratio']
target = 'demand_count'

X = df_model_ready[features]
y = df_model_ready[target]
X.info()

In [None]:
from sklearn.model_selection import train_test_split

#splitting the data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_val: {X_val.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_val: {y_val.shape}")

## Modelling: Ensemble Methods
### LightGBM


In [None]:
!pip install lightgbm


In [None]:
import lightgbm as lgb

#instantiate the model
lgbm = lgb.LGBMRegressor(random_state=42)

#fitting the model
lgbm.fit(X_train, y_train, categorical_feature=['h3_cell'])

print("Model Training Complete")

In [None]:

print("A visualization of the feature importance")

lgb.plot_importance(lgbm, height=0.9, figsize=(10, 6))
plt.title("LightGBM Feature Importance")
plt.show()

In [None]:
# Model evaluation
from sklearn.metrics import mean_absolute_error, r2_score

y_pred = lgbm.predict(X_val)
mae = mean_absolute_error(y_val, y_pred)
r2 = r2_score(y_val, y_pred)

print("Model Evaluation Results")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"R-squared (R²): {r2:.2f}")

# comparing a few predictions against the actual values
comparison_df = pd.DataFrame({'Actual': y_val, 'Predicted': y_pred})
print("\nSample of Actual vs. Predicted demand:")
print(comparison_df.head(10))

### Observations from Sample Predictions
* **MAE of 0.41:** On average, the model's predictions are within 0.41 rides of the actual `demand_count`.
* **R-squared of 0.86:** The model explains 86% of the variance in `demand_count` on the validation set.
* **Accuracy for Low Demand:** For `Actual` values of 1, the model often predicts values very close to 1 (e.g., 1.017240, 0.993542, 1.050034), which suggests good accuracy in low-demand scenarios.
* **Variability for Moderate Demand:** For `Actual` values of 2, the predictions vary more (e.g., 0.722886, 1.018311). Both fall well short of 2, with 0.72 showing the larger error.
* **Challenges with Higher Demand:** For `Actual` values of 3 and 4, the results are mixed. An `Actual` of 4 was predicted as 1.517652, which is far off, while an `Actual` of 3 was predicted as 3.398914, which is reasonably close. So although the model performs well on average, its accuracy varies by demand level and tends to be weaker for higher demand.
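
One way to make the "accuracy varies by demand level" point measurable is to break the absolute error down by actual value. The sketch below does this for the handful of pairs quoted in the observations only, so the numbers describe those samples, not the full validation set; the same grouping could be applied to `comparison_df`.

```python
from collections import defaultdict

# (actual, predicted) pairs quoted in the observations above
samples = [
    (1, 1.017240), (1, 0.993542), (1, 1.050034),
    (2, 0.722886), (2, 1.018311),
    (3, 3.398914),
    (4, 1.517652),
]

errors_by_level = defaultdict(list)
for actual, predicted in samples:
    errors_by_level[actual].append(abs(actual - predicted))

# Mean absolute error per actual demand level
mae_by_level = {
    level: sum(errs) / len(errs)
    for level, errs in sorted(errors_by_level.items())
}
for level, level_mae in mae_by_level.items():
    print(f"actual={level}: MAE over quoted samples = {level_mae:.3f}")
```

Because low-demand rows dominate the data, a small overall MAE can coexist with large errors on the rarer high-demand rows, which are the ones that matter most for finding hotspots.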


### XGBoost Model

In [None]:
#Install XGBoost
#!pip install --upgrade xgboost

import xgboost as xgb

X_train_xgb = X_train.copy()
X_val_xgb = X_val.copy()

# convert the categorical h3 column into integer codes for the model
X_train_xgb['h3_cell'] = X_train_xgb['h3_cell'].cat.codes
X_val_xgb['h3_cell'] = X_val_xgb['h3_cell'].cat.codes

# initialize and train the model
# Note: the data is passed as a plain NumPy array, so the h3 codes are most likely
# treated as ordinary numbers here even though enable_categorical is set
xgb_reg = xgb.XGBRegressor(
    objective='reg:squarederror',
    tree_method='hist',
    enable_categorical=True,
    random_state=42
)
xgb_reg.fit(np.array(X_train_xgb), y_train)
print('finished fitting')

In [None]:
# Evaluate the XGBoost model
y_pred_xgb = xgb_reg.predict(np.array(X_val_xgb))
mae_xgb = mean_absolute_error(y_val, y_pred_xgb)
r2_xgb = r2_score(y_val, y_pred_xgb)

print("\n--- XGBoost Evaluation Results ---")
print(f"Mean Absolute Error (MAE): {mae_xgb:.2f}")
print(f"R-squared (R²): {r2_xgb:.2f}")

# comparing a few predictions against the actual values
comparison_df_xgb = pd.DataFrame({'Actual': y_val, 'Predicted': y_pred_xgb})
print("\nSample of Actual vs. Predicted demand:")
print(comparison_df_xgb.head(10))

### CatBoost Model

In [None]:
!pip install catboost
from catboost import CatBoostRegressor

In [None]:
cat_features = ['h3_cell']

cat_reg = CatBoostRegressor(
    iterations=500,  
    verbose=0,      
    cat_features=cat_features,
    random_state=42
)

# Train the model
print("Training CatBoost Model")
cat_reg.fit(X_train, y_train)

# Evaluate the CatBoost Model
y_pred_cat = cat_reg.predict(X_val)
mae_cat = mean_absolute_error(y_val, y_pred_cat)
r2_cat = r2_score(y_val, y_pred_cat)

print("CatBoost Evaluation Results")
print(f"Mean Absolute Error (MAE): {mae_cat:.2f}")
print(f"R-squared (R²): {r2_cat:.2f}")

Mean Absolute Error (MAE) of 0.40: On average, the model's predictions are only 0.40 rides off the actual demand values.

R-squared (R²) of 0.87: The model explains 87% of the variance in `demand_count`, which is high and suggests that it captures the demand patterns well.

CatBoost slightly outperformed LightGBM (MAE 0.40 vs. 0.41, R² 0.87 vs. 0.86). A likely reason is that it handles categorical features such as h3_cell natively, without one-hot encoding, and works well on tabular data with mixed types.
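
For reference, the two metrics reported for each model follow the standard definitions, where $y_i$ is the actual `demand_count` for row $i$ of the validation set, $\hat{y}_i$ is the prediction, $\bar{y}$ is the mean of the actual values, and $n$ is the number of validation rows:

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

MAE is in the same unit as the target (rides per cell, day, and hour), while R² is unitless and compares the model against always predicting the mean.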

## Neural Network

Given this structure, a good neural network to start with is a Multi-Layer Perceptron (MLP), also known as a feed-forward neural network, enhanced with embedding layers for the categorical features.
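
Conceptually, an embedding layer is a lookup table with one trainable vector per category code. The following standard-library sketch illustrates the idea only: the random values stand in for weights that training would learn, the helper name `embedding_dim` is hypothetical, and the sizing rule (half the category count, capped at 50) is the one this notebook applies further down.

```python
import random


def embedding_dim(num_categories, cap=50):
    """Sizing rule used in this notebook: half the category count, capped."""
    return min(cap, num_categories // 2)


random.seed(0)
num_days = 7
day_dim = embedding_dim(num_days)

# An embedding is a lookup table: one vector per category code.
# Random values stand in for weights that training would learn.
day_table = [
    [random.uniform(-0.05, 0.05) for _ in range(day_dim)]
    for _ in range(num_days)
]

day_code = 4           # e.g. Friday when Monday is 0
business_ratio = 0.5   # numeric feature, passed through unchanged

# Concatenate the looked-up vector with the numeric feature
mlp_input = day_table[day_code] + [business_ratio]
print(day_dim, embedding_dim(24), len(mlp_input))
```

In the actual model the same lookup happens for the H3 cell, the day, and the hour, and the concatenated vector feeds the dense layers.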

In [None]:
X_train.info()

In [None]:
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.preprocessing import MinMaxScaler

# Copy datasets
X_train_nn = X_train.copy()
X_val_nn = X_val.copy()

# category mapping: build one shared category set so train and validation codes agree
combined = pd.concat([X_train['h3_cell'], X_val['h3_cell']], axis=0).astype('category')
combined = combined.cat.set_categories(combined.unique())  # optional but safe

# Assign consistent codes (iloc makes the positional split explicit)
X_train_nn['h3_cell'] = combined.iloc[:len(X_train)].cat.codes
X_val_nn['h3_cell'] = combined.iloc[len(X_train):].cat.codes

# unique counts for embeddings
num_h3_cells = combined.nunique()+1
num_days = 7
num_hours = 24

# Embedding dimensions
h3_embedding_dim = min(50, num_h3_cells // 2)
day_embedding_dim = min(50, num_days // 2)
hour_embedding_dim = min(50, num_hours // 2)

# Scaling the numerical column
scaler = MinMaxScaler()
X_train_nn['business_ratio'] = scaler.fit_transform(X_train_nn[['business_ratio']])
X_val_nn['business_ratio'] = scaler.transform(X_val_nn[['business_ratio']])

print(X_val_nn.info())
X_train_nn.info()


In [None]:
# Input layers
input_h3 = layers.Input(shape=(1,), name='h3_cell')
input_day = layers.Input(shape=(1,), name='day_of_week')
input_hour = layers.Input(shape=(1,), name='hour_of_day')
input_num = layers.Input(shape=(1,), name='business_ratio')

# Embedding layers
embed_h3 = layers.Embedding(input_dim=num_h3_cells, output_dim=h3_embedding_dim)(input_h3)
embed_day = layers.Embedding(input_dim=num_days, output_dim=day_embedding_dim)(input_day)
embed_hour = layers.Embedding(input_dim=num_hours, output_dim=hour_embedding_dim)(input_hour)

# Flatten embeddings
flat_h3 = layers.Flatten()(embed_h3)
flat_day = layers.Flatten()(embed_day)
flat_hour = layers.Flatten()(embed_hour)

# Concatenate all features
x = layers.Concatenate()([flat_h3, flat_day, flat_hour, input_num])

# Dense layers
x = layers.Dense(128, activation='relu')(x)
x = layers.Dense(64, activation='relu')(x)
output = layers.Dense(1)(x) 

# Compile model
model = keras.Model(inputs=[input_h3, input_day, input_hour, input_num], outputs=output)
model.compile(optimizer='adam', loss='mse') 
model.summary()


In [None]:
print(X_val_nn['h3_cell'].tail(10))
X_train_nn['h3_cell'].tail(10)


In [None]:
history = model.fit(
    x={
        'h3_cell': X_train_nn['h3_cell'],
        'day_of_week': X_train_nn['day_of_week'],
        'hour_of_day': X_train_nn['hour_of_day'],
        'business_ratio': X_train_nn['business_ratio']
    },
    y=y_train,
    validation_data=(
        {
            'h3_cell': X_val_nn['h3_cell'],
            'day_of_week': X_val_nn['day_of_week'],
            'hour_of_day': X_val_nn['hour_of_day'],
            'business_ratio': X_val_nn['business_ratio']
        },
        y_val
    ),
    epochs=20,
    batch_size=32
)

In [None]:
# Predicting on validation set (kept separate from the tree models' y_pred)
y_pred_nn = model.predict({
    'h3_cell': X_val_nn['h3_cell'],
    'day_of_week': X_val_nn['day_of_week'],
    'hour_of_day': X_val_nn['hour_of_day'],
    'business_ratio': X_val_nn['business_ratio']
}).ravel()  # flatten the (n, 1) output to match y_val

# Evaluate
print("MAE:", mean_absolute_error(y_val, y_pred_nn))
print("R²:", r2_score(y_val, y_pred_nn))
