# RideWise Customer Analytics Project  
## Notebook 01: Data Audit & Table Relationships

### Project Context
RideWise is a fictional ride-hailing company operating across multiple cities (Nairobi, Lagos, and Cairo in the data used here).  
The company faces a **high customer churn rate** and lacks a unified, data-driven system to:
- Understand customer behavior
- Segment customers meaningfully
- Predict churn before it happens
- Design effective promotion strategies

This project aims to build the **analytical foundation** for a customer segmentation and churn prediction system using structured behavioral data.

---

### Purpose of This Notebook
The goal of this notebook is to perform a **data audit and relationship analysis** before any modeling begins.

Specifically, this notebook will:
1. Load and inspect all five datasets
2. Understand the structure and contents of each table
3. Validate primary and foreign key relationships
4. Identify missing values and potential data quality issues
5. Establish which tables should be joined and **how** they should be joined

This step is critical to avoid:
- Incorrect joins
- Data leakage
- Duplicated records
- Silent data quality bugs

---

### Datasets Used
This project uses the following datasets:

- **riders.csv**  
  Customer-level information (demographics, loyalty status, signup details)

- **trips.csv**  
  Trip-level transactional data (usage frequency, spend, recency)

- **sessions.csv**  
  App engagement data (session frequency, duration, conversion behavior)

- **promotions.csv**  
  Marketing campaign metadata and targeting rules

- **drivers.csv**  
  Driver and supply-side operational data (used for context only)

All datasets are located in the relative folder:  
`../data/`

---

> **The unit of analysis is the customer (rider).**

This means:
- Final modeling datasets will contain **one row per rider**
- Transactional and session-level data will be **aggregated before joining**
- Joins are performed only when they serve a clear analytical purpose
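
To illustrate why aggregation comes before joining, here is a minimal sketch on made-up toy frames (the values and the `*_toy` names are illustrative only and are not taken from the RideWise files). Joining two event-level tables to the same rider multiplies rows, whereas aggregating each table first keeps one row per rider.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
riders_toy = pd.DataFrame({"user_id": ["R1", "R2"]})
trips_toy = pd.DataFrame({"user_id": ["R1", "R1", "R2"], "fare": [10.0, 20.0, 5.0]})
sessions_toy = pd.DataFrame(
    {"user_id": ["R1", "R1", "R1", "R2"], "time_on_app": [30, 60, 90, 15]}
)

# Naive approach: join raw events. R1 ends up with 2 trips x 3 sessions = 6 rows,
# so any sum over "fare" is inflated.
naive = riders_toy.merge(trips_toy, on="user_id").merge(sessions_toy, on="user_id")

# Aggregate each event table first, then join: one row per rider.
trip_agg = trips_toy.groupby("user_id", as_index=False).agg(
    trip_count=("fare", "size"), total_fare=("fare", "sum")
)
session_agg = sessions_toy.groupby("user_id", as_index=False).agg(
    session_count=("time_on_app", "size")
)
rider_level = riders_toy.merge(trip_agg, on="user_id", how="left").merge(
    session_agg, on="user_id", how="left"
)

print("rows after naive join:", len(naive))
print("rows after aggregate-then-join:", len(rider_level))
```

The naive frame has more rows than there are riders, which is exactly the kind of silent duplication this audit is meant to catch.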

---

### What Comes Next
After completing this notebook, the next steps will be:
- Defining customer churn using behavioral inactivity
- Feature engineering from trips and sessions
- Customer segmentation using clustering
- Churn prediction modeling
- Business interpretation and strategy recommendations

This notebook lays the groundwork for all downstream analysis.


### Import libraries & display settings

In [1]:
%pip install haversine



In [2]:
import pandas as pd
import numpy as np
from haversine import haversine

pd.set_option("display.max_columns", 200)
pd.set_option("display.max_rows", 200)
pd.set_option("display.width", 120)

print("Libraries imported. Ready!")

Libraries imported. Ready!


### Set the data folder path

In [3]:
DATA_DIR = '../data/'
print("Data directory set to:", DATA_DIR)

Data directory set to: ../data/


### Load the datasets

In [4]:
riders = pd.read_csv(DATA_DIR + "riders.csv")
trips = pd.read_csv(DATA_DIR + "trips.csv")
sessions = pd.read_csv(DATA_DIR + "sessions.csv")
promotions = pd.read_csv(DATA_DIR + "promotions.csv")
drivers = pd.read_csv(DATA_DIR + "drivers.csv")

datasets = {
    "riders": riders,
    "trips": trips,
    "sessions": sessions,
    "promotions": promotions,
    "drivers": drivers
}

print("Datasets Loaded")

Datasets Loaded


### Peek at each dataset

In [5]:
for name, df in datasets.items():
    print("\n" + "="*60)
    print(f"{name.upper()}  |  shape = {df.shape}")
    display(df.info())
    print('\n')
    display(df.head(3))
    print()
    print(f'Number of duplicated rows in {name.title()}: {df.duplicated().sum()}')


RIDERS  |  shape = (10000, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   user_id           10000 non-null  object 
 1   signup_date       10000 non-null  object 
 2   loyalty_status    10000 non-null  object 
 3   age               10000 non-null  float64
 4   city              10000 non-null  object 
 5   avg_rating_given  10000 non-null  float64
 6   churn_prob        10000 non-null  float64
 7   referred_by       3053 non-null   object 
dtypes: float64(3), object(5)
memory usage: 625.1+ KB


None





Unnamed: 0,user_id,signup_date,loyalty_status,age,city,avg_rating_given,churn_prob,referred_by
0,R00000,2025-01-24,Bronze,34.729629,Nairobi,5.0,0.142431,R00001
1,R00001,2024-09-09,Bronze,34.57102,Nairobi,4.7,0.674161,
2,R00002,2024-09-07,Bronze,47.13396,Lagos,4.2,0.510379,



Number of duplicated rows in Riders: 0

TRIPS  |  shape = (200000, 16)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 16 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   trip_id           200000 non-null  object 
 1   user_id           200000 non-null  object 
 2   driver_id         200000 non-null  object 
 3   fare              200000 non-null  float64
 4   surge_multiplier  200000 non-null  float64
 5   tip               200000 non-null  float64
 6   payment_type      200000 non-null  object 
 7   pickup_time       200000 non-null  object 
 8   dropoff_time      200000 non-null  object 
 9   pickup_lat        200000 non-null  float64
 10  pickup_lng        200000 non-null  float64
 11  dropoff_lat       200000 non-null  float64
 12  dropoff_lng       200000 non-null  float64
 13  weather           200000 non-null  object 
 14  city              200000 non-null  object 
 

None





Unnamed: 0,trip_id,user_id,driver_id,fare,surge_multiplier,tip,payment_type,pickup_time,dropoff_time,pickup_lat,pickup_lng,dropoff_lat,dropoff_lng,weather,city,loyalty_status
0,T000000,R05207,D00315,12.11,1.0,0.0,Card,2024-11-27 18:41:50+02:27,2024-11-27 19:33:50+02:27,-1.108123,36.912209,-1.068155,36.875377,Foggy,Nairobi,Bronze
1,T000001,R09453,D03717,8.73,1.0,0.02,Card,2024-10-28 23:13:48+00:14,2024-10-28 23:26:48+00:14,6.675266,3.51574,6.641734,3.52562,Sunny,Lagos,Gold
2,T000002,R00567,D02035,19.68,1.0,0.0,Card,2025-02-17 05:36:41+02:27,2025-02-17 05:52:41+02:27,-1.248589,37.010668,-1.273182,37.018586,Cloudy,Nairobi,Bronze



Number of duplicated rows in Trips: 0

SESSIONS  |  shape = (50000, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   session_id      50000 non-null  object
 1   rider_id        50000 non-null  object
 2   session_time    50000 non-null  object
 3   time_on_app     50000 non-null  int64 
 4   pages_visited   50000 non-null  int64 
 5   converted       50000 non-null  int64 
 6   city            50000 non-null  object
 7   loyalty_status  50000 non-null  object
dtypes: int64(3), object(5)
memory usage: 3.1+ MB


None





Unnamed: 0,session_id,rider_id,session_time,time_on_app,pages_visited,converted,city,loyalty_status
0,S000000,R08605,2025-04-27 18:57:06+02:05,79,4,1,Cairo,Bronze
1,S000001,R08823,2025-04-27 07:32:22+02:27,101,3,0,Nairobi,Silver
2,S000002,R05342,2025-04-27 23:17:25+02:05,12,1,0,Cairo,Bronze



Number of duplicated rows in Sessions: 0

PROMOTIONS  |  shape = (20, 11)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   promo_id         20 non-null     object 
 1   promo_name       20 non-null     object 
 2   promo_type       20 non-null     object 
 3   promo_value      20 non-null     float64
 4   start_date       20 non-null     object 
 5   end_date         20 non-null     object 
 6   target_segment   20 non-null     object 
 7   city_scope       20 non-null     object 
 8   ab_test_groups   20 non-null     object 
 9   test_allocation  20 non-null     object 
 10  success_metric   20 non-null     object 
dtypes: float64(1), object(10)
memory usage: 1.8+ KB


None





Unnamed: 0,promo_id,promo_name,promo_type,promo_value,start_date,end_date,target_segment,city_scope,ab_test_groups,test_allocation,success_metric
0,P000,Peak Hour Pass,surge_waiver,1.0,2025-04-26,2025-05-25,All,Nairobi,['All'],[1.0],Usage Frequency
1,P001,Peak Hour Pass,surge_waiver,1.0,2025-04-26,2025-05-22,All,Cairo,"['Control', 'Variant A', 'Variant B']","[0.3, 0.4, 0.3]",Conversion Rate
2,P002,Peak Hour Pass,surge_waiver,1.0,2025-04-26,2025-05-16,All,Cairo,"['Control', 'Variant A', 'Variant B']","[0.3, 0.4, 0.3]",ROI



Number of duplicated rows in Promotions: 0

DRIVERS  |  shape = (5000, 7)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   driver_id        5000 non-null   object 
 1   rating           5000 non-null   float64
 2   vehicle_type     5000 non-null   object 
 3   signup_date      5000 non-null   object 
 4   last_active      5000 non-null   object 
 5   city             5000 non-null   object 
 6   acceptance_rate  5000 non-null   float64
dtypes: float64(2), object(5)
memory usage: 273.6+ KB


None





Unnamed: 0,driver_id,rating,vehicle_type,signup_date,last_active,city,acceptance_rate
0,D00000,3.1,SUV,2025-01-20,2025-01-06 18:23:09.312275,Cairo,0.679555
1,D00001,5.0,Sedan,2023-03-27,2025-04-27 01:44:02.472554,Nairobi,0.548786
2,D00002,4.5,Motorcycle,2024-05-02,2025-03-07 19:24:46.367672,Nairobi,0.593724



Number of duplicated rows in Drivers: 0


### Key relationship checks (the join keys)

The tables above share key columns, which tells us how they relate to one another and therefore how they can be joined.

The related tables are:
- The trips table has two foreign keys: `user_id`, which references the riders table, and `driver_id`, which references the drivers table.
- The sessions table also has a foreign key to the riders table, but it is named `rider_id` instead of `user_id`. Because the business objective is customer churn, only the riders and trips tables are merged in this notebook; the drivers table is kept for context only.
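
The relationships listed above can be checked in code before any join is attempted. The sketch below is one possible way to do it; the helper names `check_primary_key` and `orphan_keys` are hypothetical (they are not part of the original notebook) and the demo uses toy frames.

```python
import pandas as pd

def check_primary_key(frame, key):
    """Return (n_missing, n_duplicated) for a candidate primary key column."""
    return int(frame[key].isna().sum()), int(frame[key].duplicated().sum())

def orphan_keys(child, child_key, parent, parent_key):
    """Return the child key values that have no match in the parent table."""
    child_values = child[child_key].dropna()
    return set(child_values[~child_values.isin(parent[parent_key])])

# Toy frames (hypothetical values, for illustration only)
riders_toy = pd.DataFrame({"user_id": ["R1", "R2", "R3"]})
trips_toy = pd.DataFrame(
    {"trip_id": ["T1", "T2", "T3"], "user_id": ["R1", "R1", "R9"]}
)

pk_report = check_primary_key(riders_toy, "user_id")
orphans = orphan_keys(trips_toy, "user_id", riders_toy, "user_id")
print("missing / duplicated keys:", pk_report)
print("trip user_ids with no rider:", orphans)
```

On the real data the same helpers could be pointed at `riders`/`trips` on `user_id`, `drivers`/`trips` on `driver_id`, and `riders`/`sessions` on `user_id` versus `rider_id`.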

### Standardize naming across datasets

In [6]:
sessions = sessions.rename(columns={"rider_id": "user_id"})
datasets["sessions"] = sessions  # rename() returns a new frame, so refresh the lookup dict too

### Identify date columns

In [7]:
# Identify date-like columns in each dataset
date_columns = {
    "riders": ["signup_date"],
    "trips": ["pickup_time", "dropoff_time"],
    "sessions": ["session_time"],
    "promotions": ["start_date", "end_date"],
    "drivers": ["signup_date", "last_active"]
}

date_columns

{'riders': ['signup_date'],
 'trips': ['pickup_time', 'dropoff_time'],
 'sessions': ['session_time'],
 'promotions': ['start_date', 'end_date'],
 'drivers': ['signup_date', 'last_active']}

### Parse dates safely

In [8]:
# Convert date columns to datetime

for col in date_columns["riders"]:
    riders[col] = pd.to_datetime(riders[col], errors="coerce")

for col in date_columns["trips"]:
    trips[col] = pd.to_datetime(trips[col], errors="coerce", utc=True)

for col in date_columns["sessions"]:
    sessions[col] = pd.to_datetime(sessions[col], errors="coerce", utc=True)

for col in date_columns["promotions"]:
    promotions[col] = pd.to_datetime(promotions[col], errors="coerce")

for col in date_columns["drivers"]:
    drivers[col] = pd.to_datetime(drivers[col], errors="coerce")

print("Trips and sessions datetime parsing forced to UTC.")
print("Date parsing completed.")

Trips and sessions datetime parsing forced to UTC.
Date parsing completed.
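
Because `errors="coerce"` turns every unparseable value into `NaT` without raising, it is worth counting how many values were lost in parsing. The helper below, `parse_with_report`, is a hypothetical name introduced for this sketch and is shown on toy strings.

```python
import pandas as pd

def parse_with_report(values, utc=False):
    """Parse to datetime and count values that were present but failed to parse."""
    parsed = pd.to_datetime(values, errors="coerce", utc=utc)
    # A value counts as "newly missing" if it was non-null before parsing and NaT after
    newly_missing = int((parsed.isna() & values.notna()).sum())
    return parsed, newly_missing

# Toy input: two valid dates, one bad string, one genuinely missing value
raw = pd.Series(["2025-01-24", "2024-09-09", "not a date", None])
parsed, n_failed = parse_with_report(raw)
print("values lost to coercion:", n_failed)
```

A non-zero count on a real column would be a signal to inspect the offending raw strings before continuing.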


### Validate date parsing

In [9]:
# Confirm date parsing worked
print("Riders date types:")
display(riders[date_columns["riders"]].dtypes)

print("\nTrips date types:")
display(trips[date_columns["trips"]].dtypes)

print("\nSessions date types:")
display(sessions[date_columns["sessions"]].dtypes)

Riders date types:


signup_date    datetime64[ns]
dtype: object


Trips date types:


pickup_time     datetime64[ns, UTC]
dropoff_time    datetime64[ns, UTC]
dtype: object


Sessions date types:


session_time    datetime64[ns, UTC]
dtype: object

In [10]:
trips[date_columns["trips"]].head()

Unnamed: 0,pickup_time,dropoff_time
0,2024-11-27 16:14:50+00:00,2024-11-27 17:06:50+00:00
1,2024-10-28 22:59:48+00:00,2024-10-28 23:12:48+00:00
2,2025-02-17 03:09:41+00:00,2025-02-17 03:25:41+00:00
3,2024-06-18 17:22:14+00:00,2024-06-18 17:27:14+00:00
4,2024-10-05 07:31:16+00:00,2024-10-05 08:01:16+00:00
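
Note that after this step the timestamps are in UTC, so any hour-of-day value derived from them is a UTC hour, not the rider's local hour. If local time is needed, one option is to convert per city. The sketch below assumes a city to IANA time zone mapping (`CITY_TZ`) that is not defined anywhere in the original notebook; the raw offsets in the files (`+02:27`, `+00:14`, `+02:05`) appear to be local-mean-time offsets rather than standard zone offsets, so the choice of zones here is an assumption. The helper name `local_hour` is also hypothetical.

```python
import pandas as pd

# Assumed mapping (not part of the original notebook)
CITY_TZ = {
    "Nairobi": "Africa/Nairobi",
    "Lagos": "Africa/Lagos",
    "Cairo": "Africa/Cairo",
}

def local_hour(frame, time_col="pickup_time", city_col="city"):
    """Hour of day in each city's local time, for a UTC-aware timestamp column."""
    parts = [
        frame.loc[frame[city_col] == city, time_col].dt.tz_convert(tz).dt.hour
        for city, tz in CITY_TZ.items()
    ]
    # Rows whose city is not in CITY_TZ come back as missing values
    return pd.concat(parts).reindex(frame.index)

# Toy frame: the first two UTC timestamps mirror the output above, the Cairo row is made up
toy = pd.DataFrame({
    "city": ["Nairobi", "Lagos", "Cairo"],
    "pickup_time": pd.to_datetime(
        ["2024-11-27 16:14:50", "2024-10-28 22:59:48", "2025-03-01 10:00:00"],
        utc=True,
    ),
})
toy["pickup_hour_local"] = local_hour(toy)
print(toy)
```

Only the hour is returned because a single pandas datetime column cannot hold several time zones at once.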


### Merging Riders & Trips Dataset

In [11]:
df = pd.merge(datasets['riders'], datasets['trips'], on='user_id', how='outer')

df.info()

print('\n')

df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 23 columns):
 #   Column            Non-Null Count   Dtype              
---  ------            --------------   -----              
 0   user_id           200000 non-null  object             
 1   signup_date       200000 non-null  datetime64[ns]     
 2   loyalty_status_x  200000 non-null  object             
 3   age               200000 non-null  float64            
 4   city_x            200000 non-null  object             
 5   avg_rating_given  200000 non-null  float64            
 6   churn_prob        200000 non-null  float64            
 7   referred_by       61013 non-null   object             
 8   trip_id           200000 non-null  object             
 9   driver_id         200000 non-null  object             
 10  fare              200000 non-null  float64            
 11  surge_multiplier  200000 non-null  float64            
 12  tip               200000 non-null  float64  

Unnamed: 0,user_id,signup_date,loyalty_status_x,age,city_x,avg_rating_given,churn_prob,referred_by,trip_id,driver_id,fare,surge_multiplier,tip,payment_type,pickup_time,dropoff_time,pickup_lat,pickup_lng,dropoff_lat,dropoff_lng,weather,city_y,loyalty_status_y
0,R00000,2025-01-24,Bronze,34.729629,Nairobi,5.0,0.142431,R00001,T001144,D03414,23.62,1.4,0.0,Card,2024-09-03 22:29:02+00:00,2024-09-03 22:55:02+00:00,-1.115239,36.805339,-1.136842,36.793631,Rainy,Nairobi,Bronze
1,R00000,2025-01-24,Bronze,34.729629,Nairobi,5.0,0.142431,R00001,T022441,D04441,16.31,1.0,0.0,Card,2025-04-02 14:46:29+00:00,2025-04-02 14:52:29+00:00,-1.350546,36.74521,-1.339873,36.770102,Sunny,Nairobi,Bronze
2,R00000,2025-01-24,Bronze,34.729629,Nairobi,5.0,0.142431,R00001,T024771,D00635,9.66,1.0,0.03,Card,2024-05-23 07:10:47+00:00,2024-05-23 08:06:47+00:00,-1.31656,36.687127,-1.310676,36.680729,Sunny,Nairobi,Bronze
3,R00000,2025-01-24,Bronze,34.729629,Nairobi,5.0,0.142431,R00001,T042553,D03102,11.02,1.1,0.55,Mobile Money,2025-01-02 13:42:13+00:00,2025-01-02 14:18:13+00:00,-1.726473,37.30156,-1.713882,37.311035,Sunny,Nairobi,Bronze
4,R00000,2025-01-24,Bronze,34.729629,Nairobi,5.0,0.142431,R00001,T055259,D03417,20.83,1.0,0.91,Card,2025-01-07 11:56:49+00:00,2025-01-07 12:16:49+00:00,-1.483414,36.974683,-1.474478,36.932673,Sunny,Nairobi,Bronze
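
The merge above relies on `user_id` being unique in riders and on every trip having a matching rider. pandas can assert both during the merge itself through the `validate` and `indicator` parameters. The sketch below uses toy frames; the `_rider` and `_trip` suffixes are a suggestion, not what the notebook uses.

```python
import pandas as pd

# Toy frames (hypothetical values): R2 and R3 have no trips, R9 has no rider record
riders_toy = pd.DataFrame(
    {"user_id": ["R1", "R2", "R3"], "city": ["Nairobi", "Lagos", "Cairo"]}
)
trips_toy = pd.DataFrame({
    "trip_id": ["T1", "T2", "T3"],
    "user_id": ["R1", "R1", "R9"],
    "city": ["Nairobi", "Nairobi", "Lagos"],
})

audited = riders_toy.merge(
    trips_toy,
    on="user_id",
    how="outer",
    validate="one_to_many",        # raises MergeError if user_id repeats in riders
    indicator=True,                # adds a _merge column: both / left_only / right_only
    suffixes=("_rider", "_trip"),  # clearer than the default _x / _y
)
merge_counts = audited["_merge"].value_counts().to_dict()
print(merge_counts)
```

Any `left_only` rows are riders without trips and any `right_only` rows are trips without a rider record; in the real merge above, the absence of nulls in either side's columns suggests both counts are zero.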


#### Missing Data in the riders_trips dataset

In [12]:
df.isnull().mean() * 100

user_id              0.0000
signup_date          0.0000
loyalty_status_x     0.0000
age                  0.0000
city_x               0.0000
avg_rating_given     0.0000
churn_prob           0.0000
referred_by         69.4935
trip_id              0.0000
driver_id            0.0000
fare                 0.0000
surge_multiplier     0.0000
tip                  0.0000
payment_type         0.0000
pickup_time          0.0000
dropoff_time         0.0000
pickup_lat           0.0000
pickup_lng           0.0000
dropoff_lat          0.0000
dropoff_lng          0.0000
weather              0.0000
city_y               0.0000
loyalty_status_y     0.0000
dtype: float64

#### Dropping duplicated and irrelevant columns

In [13]:
df[df['loyalty_status_x'] != df['loyalty_status_y']]

Unnamed: 0,user_id,signup_date,loyalty_status_x,age,city_x,avg_rating_given,churn_prob,referred_by,trip_id,driver_id,fare,surge_multiplier,tip,payment_type,pickup_time,dropoff_time,pickup_lat,pickup_lng,dropoff_lat,dropoff_lng,weather,city_y,loyalty_status_y


In [14]:
df = df.drop(columns=['referred_by', 'loyalty_status_y', 'city_y'])
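
The check in the previous cell covers `loyalty_status` only, while `city_y` is dropped without an equivalent comparison. A small helper can compare every overlapping pair in one pass. `mismatch_counts` is a hypothetical name, and the demo runs on a toy frame.

```python
import pandas as pd

def mismatch_counts(frame, base_names, suffixes=("_x", "_y")):
    """Count rows where the two copies of each overlapping column disagree.

    Note: a missing value on either side is counted as a mismatch.
    """
    left, right = suffixes
    return {
        name: int((frame[name + left] != frame[name + right]).sum())
        for name in base_names
    }

# Toy frame (hypothetical values): city disagrees on the second row
toy = pd.DataFrame({
    "loyalty_status_x": ["Bronze", "Gold"],
    "loyalty_status_y": ["Bronze", "Gold"],
    "city_x": ["Nairobi", "Lagos"],
    "city_y": ["Nairobi", "Cairo"],
})
report = mismatch_counts(toy, ["loyalty_status", "city"])
print(report)
```

A non-zero count for `city` would mean the rider's home city and the trip's city are different facts, in which case dropping one of them loses information.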

#### Renaming columns

In [15]:
df.rename(columns={'loyalty_status_x': 'loyalty_status', 'city_x': 'city'}, inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 20 columns):
 #   Column            Non-Null Count   Dtype              
---  ------            --------------   -----              
 0   user_id           200000 non-null  object             
 1   signup_date       200000 non-null  datetime64[ns]     
 2   loyalty_status    200000 non-null  object             
 3   age               200000 non-null  float64            
 4   city              200000 non-null  object             
 5   avg_rating_given  200000 non-null  float64            
 6   churn_prob        200000 non-null  float64            
 7   trip_id           200000 non-null  object             
 8   driver_id         200000 non-null  object             
 9   fare              200000 non-null  float64            
 10  surge_multiplier  200000 non-null  float64            
 11  tip               200000 non-null  float64            
 12  payment_type      200000 non-null  object   

#### Typographic Errors

In [16]:
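# Names of the text (object) columns
cat_cols = df.select_dtypes(include='object').columns

# Print the distinct values of each text column to spot misspelled labels
for col in cat_cols:
    print(col)
    print(df[col].unique())
    print('\n')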

user_id
['R00000' 'R00001' 'R00002' ... 'R09997' 'R09998' 'R09999']


loyalty_status
['Bronze' 'Silver' 'Gold' 'Platinum']


city
['Nairobi' 'Lagos' 'Cairo']


trip_id
['T001144' 'T022441' 'T024771' ... 'T166786' 'T176764' 'T187733']


driver_id
['D03414' 'D04441' 'D00635' ... 'D04427' 'D04825' 'D00014']


payment_type
['Card' 'Mobile Money' 'Cash']


weather
['Rainy' 'Sunny' 'Foggy' 'Cloudy']
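
Reading the unique values by eye works for a handful of labels. For longer lists, labels that differ only by case or surrounding whitespace can be flagged automatically. The function `find_label_variants` below is a hypothetical helper, demonstrated on a made-up list.

```python
import pandas as pd

def find_label_variants(values):
    """Group raw labels that collapse to the same value after trimming and case-folding."""
    raw = pd.Series(values).dropna().astype(str)
    normalized = raw.str.strip().str.casefold()
    groups = raw.groupby(normalized).unique()
    # Keep only normalized labels that have more than one raw spelling
    return {key: sorted(labels) for key, labels in groups.items() if len(labels) > 1}

# Made-up example containing one inconsistent spelling
variants = find_label_variants(["Card", "card ", "Cash", "Mobile Money"])
print(variants)

# The payment types printed above collapse cleanly, so nothing is flagged
clean = find_label_variants(["Card", "Mobile Money", "Cash"])
print(clean)
```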




#### Verifying Dates 

In [17]:
df[['signup_date', 'pickup_time', 'dropoff_time']].describe().transpose()

Unnamed: 0,count,mean,min,25%,50%,75%,max
signup_date,200000,2024-04-26 21:01:14.303999744,2023-04-27 00:00:00,2023-10-27 00:00:00,2024-04-25 00:00:00,2024-10-31 00:00:00,2025-04-26 00:00:00
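
Summary statistics do not reveal ordering problems between the date columns. For example, the first merged rows shown earlier pair a rider who signed up on 2025-01-24 with a trip on 2024-09-03. The sketch below counts such cases; `date_logic_report` is a hypothetical helper and the toy frame mirrors that row plus one made-up consistent row.

```python
import pandas as pd

def date_logic_report(frame):
    """Count rows that violate basic ordering rules between the date columns."""
    # signup_date is timezone-naive, so localize it before comparing with UTC timestamps
    signup_utc = frame["signup_date"].dt.tz_localize("UTC")
    return {
        "dropoff_not_after_pickup": int(
            (frame["dropoff_time"] <= frame["pickup_time"]).sum()
        ),
        "pickup_before_signup": int((frame["pickup_time"] < signup_utc).sum()),
    }

toy = pd.DataFrame({
    "signup_date": pd.to_datetime(["2025-01-24", "2024-01-01"]),
    "pickup_time": pd.to_datetime(
        ["2024-09-03 22:29:02", "2024-05-23 07:10:47"], utc=True
    ),
    "dropoff_time": pd.to_datetime(
        ["2024-09-03 22:55:02", "2024-05-23 08:06:47"], utc=True
    ),
})
date_report = date_logic_report(toy)
print(date_report)
```

How to treat trips that precede the signup date is a modeling decision; the point of the check is only to make the count visible.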


### Feature Engineering

#### Behaviour/Demand Features

In [18]:
df['pickup_time_year'] = df['pickup_time'].dt.year
df['pickup_time_month'] = df['pickup_time'].dt.month
df['pickup_time_month_year'] = df['pickup_time'].dt.strftime('%b %Y')
df["pickup_time_day"] = df["pickup_time"].dt.day_name()
df["pickup_time_day_num"] = df["pickup_time"].dt.day
df['pickup_hour'] = df['pickup_time'].dt.hour

df['time_of_day'] = pd.cut(df['pickup_hour'], bins=[-1, 5, 11, 17, 21, 24], labels=['Night', 'Morning', 'Afternoon', 'Evening', 'Late Night'])

df["pickup_is_weekend"] = df["pickup_time"].dt.weekday >= 5  # Saturday=5, Sunday=6

df['pickup_is_peak_hour'] = df['pickup_hour'].between(7, 9) | df['pickup_hour'].between(16, 19)

df['pickup_is_night'] = df['pickup_hour'].between(22, 23) | df['pickup_hour'].between(0, 5)

def map_season(month):
    if month in [12, 1, 2]:
        return "Winter"
    elif month in [3, 4, 5]:
        return "Spring"
    elif month in [6, 7, 8]:
        return "Summer"
    else:
        return "Autumn"

df["pickup_time_season"] = df["pickup_time_month"].apply(map_season)

df['trip_duration_min'] = (df['dropoff_time'] - df['pickup_time']).dt.total_seconds() / 60

df['trip_distance_km'] = df.apply(
    lambda x: haversine(
        (x['pickup_lat'], x['pickup_lng']),
        (x['dropoff_lat'], x['dropoff_lng'])
    ), axis=1
)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 33 columns):
 #   Column                  Non-Null Count   Dtype              
---  ------                  --------------   -----              
 0   user_id                 200000 non-null  object             
 1   signup_date             200000 non-null  datetime64[ns]     
 2   loyalty_status          200000 non-null  object             
 3   age                     200000 non-null  float64            
 4   city                    200000 non-null  object             
 5   avg_rating_given        200000 non-null  float64            
 6   churn_prob              200000 non-null  float64            
 7   trip_id                 200000 non-null  object             
 8   driver_id               200000 non-null  object             
 9   fare                    200000 non-null  float64            
 10  surge_multiplier        200000 non-null  float64            
 11  tip                     20
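
The row-wise `df.apply` call above invokes Python once per trip, which can be slow on 200,000 rows. A vectorized NumPy version of the haversine formula is a possible alternative. This is a sketch, not the `haversine` package's own implementation: the function name `haversine_km` is hypothetical and the Earth radius constant is assumed to match the package default.

```python
import numpy as np

# Mean Earth radius in km (assumed to match the haversine package default)
EARTH_RADIUS_KM = 6371.0088

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between arrays of points given in decimal degrees."""
    lat1, lng1, lat2, lng2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lng1, lat2, lng2)
    )
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Coordinates of the first merged trip shown earlier (T001144)
distance = haversine_km([-1.115239], [36.805339], [-1.136842], [36.793631])
print(distance)
```

On the merged frame this could be used as `df["trip_distance_km"] = haversine_km(df["pickup_lat"], df["pickup_lng"], df["dropoff_lat"], df["dropoff_lng"])`.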

#### Fare/Revenue Features

In [19]:
df['total_fare'] = df['fare'] * df['surge_multiplier']

df['total_fare_with_tip'] = df['total_fare'] + df['tip']

# Despite the name, this is the tip's share of the total paid, stored as a 0-1 fraction
df['tip_percentage'] = df['tip'] / df['total_fare_with_tip']

df['is_surge_trip'] = df['surge_multiplier'] > 1

df['total_fare_bucket'] = pd.qcut(df['total_fare'], q=4, labels=['low','medium','high','very_high'])

#### User Behaviour

In [20]:
user_agg = df.groupby('user_id').agg(
    user_trip_count=('trip_id', 'count'),
    avg_user_fare=('total_fare', 'mean'),
    avg_user_tip=('tip', 'mean')
).reset_index()
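
`user_agg` already has one row per rider, which matches the stated unit of analysis. One possible way to attach it to the rider table is a validated left merge, sketched here on toy frames (`riders_toy` and `user_agg_toy` are made up, and the notebook itself does not perform this step).

```python
import pandas as pd

# Toy frames (hypothetical values): R3 has no trips, so no aggregate row
riders_toy = pd.DataFrame(
    {"user_id": ["R1", "R2", "R3"], "loyalty_status": ["Bronze", "Gold", "Silver"]}
)
user_agg_toy = pd.DataFrame(
    {"user_id": ["R1", "R2"], "user_trip_count": [3, 1], "avg_user_fare": [12.5, 8.0]}
)

rider_features = riders_toy.merge(
    user_agg_toy,
    on="user_id",
    how="left",              # keep riders even if they never took a trip
    validate="one_to_one",   # raises MergeError if either side repeats a user_id
)
# A rider with no trips has a count of zero; averages stay missing on purpose
rider_features["user_trip_count"] = (
    rider_features["user_trip_count"].fillna(0).astype(int)
)
print(rider_features)
```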

In [21]:
df.drop(columns=['signup_date', 'dropoff_time'], inplace=True) 

In [22]:
with pd.option_context('display.max_columns', None):
    display(df)

Unnamed: 0,user_id,loyalty_status,age,city,avg_rating_given,churn_prob,trip_id,driver_id,fare,surge_multiplier,tip,payment_type,pickup_time,pickup_lat,pickup_lng,dropoff_lat,dropoff_lng,weather,pickup_time_year,pickup_time_month,pickup_time_month_year,pickup_time_day,pickup_time_day_num,pickup_hour,time_of_day,pickup_is_weekend,pickup_is_peak_hour,pickup_is_night,pickup_time_season,trip_duration_min,trip_distance_km,total_fare,total_fare_with_tip,tip_percentage,is_surge_trip,total_fare_bucket
0,R00000,Bronze,34.729629,Nairobi,5.0,0.142431,T001144,D03414,23.62,1.4,0.00,Card,2024-09-03 22:29:02+00:00,-1.115239,36.805339,-1.136842,36.793631,Rainy,2024,9,Sep 2024,Tuesday,3,22,Late Night,False,False,True,Autumn,26.0,2.732109,33.068,33.068,0.000000,True,very_high
1,R00000,Bronze,34.729629,Nairobi,5.0,0.142431,T022441,D04441,16.31,1.0,0.00,Card,2025-04-02 14:46:29+00:00,-1.350546,36.745210,-1.339873,36.770102,Sunny,2025,4,Apr 2025,Wednesday,2,14,Afternoon,False,False,False,Spring,6.0,3.010959,16.310,16.310,0.000000,False,high
2,R00000,Bronze,34.729629,Nairobi,5.0,0.142431,T024771,D00635,9.66,1.0,0.03,Card,2024-05-23 07:10:47+00:00,-1.316560,36.687127,-1.310676,36.680729,Sunny,2024,5,May 2024,Thursday,23,7,Morning,False,True,False,Spring,56.0,0.966453,9.660,9.690,0.003096,False,low
3,R00000,Bronze,34.729629,Nairobi,5.0,0.142431,T042553,D03102,11.02,1.1,0.55,Mobile Money,2025-01-02 13:42:13+00:00,-1.726473,37.301560,-1.713882,37.311035,Sunny,2025,1,Jan 2025,Thursday,2,13,Afternoon,False,False,False,Winter,36.0,1.751916,12.122,12.672,0.043403,True,medium
4,R00000,Bronze,34.729629,Nairobi,5.0,0.142431,T055259,D03417,20.83,1.0,0.91,Card,2025-01-07 11:56:49+00:00,-1.483414,36.974683,-1.474478,36.932673,Sunny,2025,1,Jan 2025,Tuesday,7,11,Morning,False,False,False,Winter,20.0,4.774206,20.830,21.740,0.041858,False,high
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199995,R09999,Gold,36.089597,Nairobi,3.9,0.401529,T161109,D00887,18.04,1.0,0.49,Card,2025-02-08 08:03:19+00:00,-1.281023,36.756645,-1.274055,36.713220,Sunny,2025,2,Feb 2025,Saturday,8,8,Morning,True,True,False,Winter,32.0,4.889174,18.040,18.530,0.026444,False,high
199996,R09999,Gold,36.089597,Nairobi,3.9,0.401529,T166028,D02903,26.68,1.3,0.00,Cash,2024-09-11 06:12:51+00:00,-1.483096,36.833612,-1.497719,36.826340,Cloudy,2024,9,Sep 2024,Wednesday,11,6,Morning,False,False,False,Autumn,49.0,1.815820,34.684,34.684,0.000000,True,very_high
199997,R09999,Gold,36.089597,Nairobi,3.9,0.401529,T166786,D02777,9.10,1.0,1.46,Cash,2024-12-12 15:47:38+00:00,-1.135358,36.654228,-1.120021,36.655476,Sunny,2024,12,Dec 2024,Thursday,12,15,Afternoon,False,False,False,Winter,40.0,1.711043,9.100,10.560,0.138258,False,low
199998,R09999,Gold,36.089597,Nairobi,3.9,0.401529,T176764,D04642,20.27,1.0,0.08,Mobile Money,2024-09-06 16:49:13+00:00,-1.109425,36.967027,-1.065215,36.917535,Sunny,2024,9,Sep 2024,Friday,6,16,Afternoon,False,True,False,Autumn,17.0,7.378412,20.270,20.350,0.003931,False,high


In [23]:
print("Trips columns:")
print(sorted(df.columns.tolist()))

Trips columns:
['age', 'avg_rating_given', 'churn_prob', 'city', 'driver_id', 'dropoff_lat', 'dropoff_lng', 'fare', 'is_surge_trip', 'loyalty_status', 'payment_type', 'pickup_hour', 'pickup_is_night', 'pickup_is_peak_hour', 'pickup_is_weekend', 'pickup_lat', 'pickup_lng', 'pickup_time', 'pickup_time_day', 'pickup_time_day_num', 'pickup_time_month', 'pickup_time_month_year', 'pickup_time_season', 'pickup_time_year', 'surge_multiplier', 'time_of_day', 'tip', 'tip_percentage', 'total_fare', 'total_fare_bucket', 'total_fare_with_tip', 'trip_distance_km', 'trip_duration_min', 'trip_id', 'user_id', 'weather']


### Saving the dataset

In [24]:
df.to_csv(DATA_DIR + 'riders_trips.csv', index=False)
print('Dataset saved')

Dataset saved
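
CSV stores text only, so the datetime and categorical dtypes built in this notebook are not preserved in the saved file. When the file is read back in a later notebook, the dtypes can be restored at load time. The sketch below round-trips a toy frame through an in-memory buffer instead of the real file.

```python
import io
import pandas as pd

# Toy frame with one tz-aware datetime column and one categorical column
toy = pd.DataFrame({
    "user_id": ["R1"],
    "pickup_time": pd.to_datetime(["2024-09-03 22:29:02"], utc=True),
    "time_of_day": pd.Categorical(["Late Night"]),
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Plain read: the datetime and categorical columns come back as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# Read with explicit dtypes: both columns are restored
buffer.seek(0)
restored = pd.read_csv(
    buffer,
    parse_dates=["pickup_time"],
    dtype={"time_of_day": "category"},
)
print(plain.dtypes, restored.dtypes, sep="\n\n")
```

A typed format such as Parquet would avoid the issue altogether, at the cost of an extra dependency.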
