<a href="https://colab.research.google.com/github/joshuabusinge/TotalEnergies/blob/main/TotalEnergiesFinal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

###Description of the Problem

This challenge asks you to build a model that predicts the number of seats that Mobiticket can expect to sell for each ride, i.e. for a specific route on a specific date and time. There are 14 routes in this dataset. All of the routes end in Nairobi and originate in towns to the North-West of Nairobi towards Lake Victoria.

The towns from which these routes originate are:

- Awendo
- Homa Bay
- Kehancha
- Kendu Bay
- Keroka
- Keumbu
- Kijauri
- Kisii
- Mbita
- Migori
- Ndhiwa
- Nyachenge
- Oyugis
- Rodi
- Rongo
- Sirare
- Sori

The routes from these 14 origins to the first stop on the outskirts of Nairobi take approximately 8 to 9 hours from the time of departure. From that first stop to the main bus terminal in the Central Business District, where most passengers get off, takes another 2 to 3 hours depending on traffic.

The three stops that all these routes make in Nairobi (in order) are:

1. Kawangware: the first stop on the outskirts of Nairobi
2. Westlands
3. Afya Centre: the main bus terminal where most passengers disembark


All of these points are mapped here.

Passengers on these bus (or shuttle) rides are affected by Nairobi traffic not only during the ride into the city but also afterwards, when they continue to their final destination in Nairobi. Traffic can act as a deterrent for those who have the option to avoid buses that arrive in Nairobi during peak traffic hours. On the other hand, traffic may be an indicator of people’s movement patterns, reflecting business hours, cultural events, political events, and holidays.

Uber Movement traffic data can be accessed at movement.uber.com. Uber Movement provided historic hourly travel time between any two points in Nairobi. Any tables that are extracted from the Uber Movement platform can be used in your model.

Variables description:

- `ride_id`: unique ID of a vehicle on a specific route on a specific day and time.
- `seat_number`: seat assigned to ticket
- `payment_method`: method used by customer to purchase ticket from Mobiticket (cash or Mpesa)
- `payment_receipt`: unique id number for ticket purchased from Mobiticket
- `travel_date`: date of ride departure. (MM/DD/YYYY)
- `travel_time`: scheduled departure time of ride. Rides generally depart on time. (hh:mm)
- `travel_from`: town from which ride originated
- `travel_to`: destination of ride. All rides are to Nairobi.
- `car_type`: vehicle type (shuttle or bus)
- `max_capacity`: number of seats on the vehicle

##Import the Libraries

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
from sklearn.metrics import r2_score
from sklearn.ensemble import HistGradientBoostingRegressor

##Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


##Data Cleaning and Preprocessing

In [None]:
# Load training data, aggregate tickets into one row per ride, convert departure time to minutes after midnight, and add day of week:
train_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/TotalEnergies/Train.csv', parse_dates=['travel_date'], dayfirst=True)
train = train_df.groupby(['ride_id', 'travel_date', 'travel_time', 'travel_from', 'max_capacity']).size().reset_index(name='Count') #sort=False if needed?
train["travel_time"] = train["travel_time"].str.split(':').apply(lambda x: int(x[0]) * 60 + int(x[1]))
train['day'] = train['travel_date'].dt.dayofweek
train.head()

Unnamed: 0,ride_id,travel_date,travel_time,travel_from,max_capacity,Count,day
0,1442,2017-10-17,435,Migori,49,1,1
1,5437,2017-11-19,432,Migori,49,1,6
2,5710,2017-11-26,425,Keroka,49,1,6
3,5777,2017-11-27,430,Homa Bay,49,5,0
4,5778,2017-11-27,432,Migori,49,31,0
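
The cell above collapses ticket-level rows into one row per ride and turns `hh:mm` into minutes after midnight. A minimal sketch of those two steps on a made-up three-ticket frame (the values are invented for illustration and are not taken from Train.csv):

```python
import pandas as pd

# Hypothetical ticket-level rows: two tickets on ride 1, one on ride 2.
tickets = pd.DataFrame({
    "ride_id": [1, 1, 2],
    "travel_time": ["7:15", "7:15", "10:51"],
    "travel_from": ["Migori", "Migori", "Kisii"],
})

# One row per ride; Count is the number of tickets sold for it.
rides = (
    tickets.groupby(["ride_id", "travel_time", "travel_from"])
    .size()
    .reset_index(name="Count")
)

# "hh:mm" -> minutes after midnight (7:15 -> 7 * 60 + 15 = 435).
hm = rides["travel_time"].str.split(":", expand=True).astype(int)
rides["travel_time"] = hm[0] * 60 + hm[1]

print(rides)
```

Splitting with `expand=True` avoids a Python-level `apply` and gives the same integers as the lambda used in the notebook.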


In [None]:
# The same for the test data
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/TotalEnergies/Test.csv', parse_dates=['travel_date'], dayfirst=True).drop(['car_type', 'travel_to'], axis=1)
test["travel_time"] = test["travel_time"].str.split(':').apply(lambda x: int(x[0]) * 60 + int(x[1]))
test['day'] = test['travel_date'].dt.dayofweek
test.head()

Unnamed: 0,ride_id,travel_date,travel_time,travel_from,max_capacity,day
0,4446,2018-04-27,540,Kisii,11,4
1,13962,2018-04-23,430,Homa Bay,49,0
2,5569,2018-04-24,440,Kisii,11,1
3,1675,2018-05-01,661,Kisii,11,1
4,5711,2018-04-22,651,Kisii,11,6


In [None]:
# Combine training and test data for now, so that we can add uber movement data all in one go
train['t'] = 0
test['t'] = 1
train_test = pd.concat([train, test], sort=False)
train_test.head()

Unnamed: 0,ride_id,travel_date,travel_time,travel_from,max_capacity,Count,day,t
0,1442,2017-10-17,435,Migori,49,1.0,1,0
1,5437,2017-11-19,432,Migori,49,1.0,6,0
2,5710,2017-11-26,425,Keroka,49,1.0,6,0
3,5777,2017-11-27,430,Homa Bay,49,5.0,0,0
4,5778,2017-11-27,432,Migori,49,31.0,0,0


In [None]:
# Load travel times from Uber movement data ( 3 x 3month periods)
t1 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/TotalEnergies/Awendo-Afya.csv',parse_dates=['Date'])
t2 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/TotalEnergies/Awendo-Westlands.csv',parse_dates=['Date'])
t3 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/TotalEnergies/Awendo.csv',parse_dates=['Date'])
t4 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/TotalEnergies/Homa-Afya.csv',parse_dates=['Date'])
t5 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/TotalEnergies/Kehancha.csv',parse_dates=['Date'])
t6 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/TotalEnergies/Kendu.csv',parse_dates=['Date'])
t7 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/TotalEnergies/Keroka.csv',parse_dates=['Date'])
t8 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/TotalEnergies/Keumbu.csv',parse_dates=['Date'])
t9 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/TotalEnergies/Kijauri.csv',parse_dates=['Date'])
t10 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/TotalEnergies/Kisii.csv',parse_dates=['Date'])
t11 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/TotalEnergies/Mbita.csv',parse_dates=['Date'])
t12 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/TotalEnergies/Migori.csv',parse_dates=['Date'])
t13 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/TotalEnergies/Nyachenge.csv',parse_dates=['Date'])
t14 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/TotalEnergies/Rongo-AfyaCentre.csv',parse_dates=['Date'])
t15 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/TotalEnergies/Rongo.csv',parse_dates=['Date'])
t16 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/TotalEnergies/Rongo-Westlands.csv',parse_dates=['Date'])
t17 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/TotalEnergies/Sirare-Westlands.csv',parse_dates=['Date'])
t18 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/TotalEnergies/Sori-Kawangware.csv',parse_dates=['Date'])
t19 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/TotalEnergies/Sori-Westlands.csv',parse_dates=['Date'])

travel_times = pd.concat([t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11, t12, t13, t14, t15, t16, t17, t18, t19], ignore_index=True)
travel_times = travel_times.ffill()[['Daily Mean Travel Time (Seconds)', 'Date']]  # fillna(method='ffill') is deprecated in recent pandas
travel_times['Date'] = pd.to_datetime(travel_times['Date'])
travel_times.tail()

Unnamed: 0,Daily Mean Travel Time (Seconds),Date
1738,751.0,2017-12-11
1739,410.0,2017-12-12
1740,733.0,2017-12-13
1741,810.0,2017-12-14
1742,707.0,2017-12-15
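
The nineteen near-identical `read_csv` calls above could also be written as a loop. A minimal sketch, assuming every export shares the `Date` and `Daily Mean Travel Time (Seconds)` columns; `load_travel_times` is a hypothetical helper name, and the two in-memory CSVs are made-up stand-ins for the files on Drive:

```python
import io
import pandas as pd

def load_travel_times(sources):
    """Read each Uber Movement export and stack them into one frame."""
    frames = [pd.read_csv(src, parse_dates=["Date"]) for src in sources]
    combined = pd.concat(frames, ignore_index=True)
    # Forward-fill gaps, then keep only the two columns used downstream.
    return combined.ffill()[["Daily Mean Travel Time (Seconds)", "Date"]]

# Stand-ins for two CSV files (invented numbers); real use would pass file paths.
csv_a = io.StringIO("Date,Daily Mean Travel Time (Seconds)\n2017-12-14,810\n2017-12-15,\n")
csv_b = io.StringIO("Date,Daily Mean Travel Time (Seconds)\n2017-12-15,707\n")

demo = load_travel_times([csv_a, csv_b])
print(demo)
```

In the notebook, `sources` would be the list of CSV paths under the TotalEnergies folder. Note that forward-filling across concatenated files can carry a value from one route file into the next, which may or may not be intended.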


In [None]:
# Merge with our contest data
train_test['Date'] = train_test['travel_date']
train_test.set_index('travel_date', inplace=True)
merged_train_test = train_test.merge(travel_times, how='left', on='Date')
merged_train_test.head(5)

Unnamed: 0,ride_id,travel_time,travel_from,max_capacity,Count,day,t,Date,Daily Mean Travel Time (Seconds)
0,1442,435,Migori,49,1.0,1,0,2017-10-17,952.0
1,1442,435,Migori,49,1.0,1,0,2017-10-17,874.0
2,1442,435,Migori,49,1.0,1,0,2017-10-17,1234.0
3,1442,435,Migori,49,1.0,1,0,2017-10-17,936.0
4,1442,435,Migori,49,1.0,1,0,2017-10-17,2800.0
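
The output above shows ride 1442 repeated once per matching travel-time row, because `travel_times` holds several rows for the same date. If one row per ride is wanted, one option is to average the travel times per date before merging. This is a sketch under that assumption, not what the notebook does, and the frames below are made-up stand-ins:

```python
import pandas as pd

# Made-up stand-ins for train_test and travel_times.
rides = pd.DataFrame({
    "ride_id": [1442, 5437],
    "Date": pd.to_datetime(["2017-10-17", "2017-11-19"]),
})
travel_times = pd.DataFrame({
    "Date": pd.to_datetime(["2017-10-17", "2017-10-17", "2017-11-19"]),
    "Daily Mean Travel Time (Seconds)": [900.0, 1100.0, 800.0],
})

# Collapse to a single mean travel time per calendar date.
daily = (
    travel_times.groupby("Date", as_index=False)["Daily Mean Travel Time (Seconds)"]
    .mean()
)

# validate="many_to_one" raises if the right side still has duplicate dates.
merged = rides.merge(daily, how="left", on="Date", validate="many_to_one")
print(merged)
```

With the per-date aggregate, the merged frame keeps the same number of rows as the ride table, so each ride contributes to training exactly once.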


In [None]:
# One-hot encode travel_from and day into 0/1 indicator columns
travel_from = pd.get_dummies(merged_train_test, columns=['travel_from', 'day'])
travel_from.tail()

Unnamed: 0,ride_id,travel_time,max_capacity,Count,t,Date,Daily Mean Travel Time (Seconds),travel_from_Awendo,travel_from_Homa Bay,travel_from_Kehancha,...,travel_from_Rongo,travel_from_Sirare,travel_from_Sori,day_0,day_1,day_2,day_3,day_4,day_5,day_6
36617,718,421,49,,1,2018-05-03,,0,0,0,...,0,0,0,0,0,0,1,0,0,0
36618,4795,410,11,,1,2018-04-27,,0,0,0,...,0,0,0,0,0,0,0,1,0,0
36619,5500,340,11,,1,2018-04-25,,0,0,0,...,0,0,0,0,0,1,0,0,0,0
36620,14615,431,49,,1,2018-04-28,,0,1,0,...,0,0,0,0,0,0,0,0,1,0
36621,3921,659,11,,1,2018-04-27,,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [None]:
X_train = travel_from.loc[travel_from['t'] == 0].drop(['Count', 'ride_id', 'Date'], axis=1)
y_train = travel_from.loc[travel_from['t'] == 0]['Count']

# Initialize models
models = {
    'HistGradientBoosting': HistGradientBoostingRegressor(loss="absolute_error", max_depth=5, learning_rate=0.1, max_iter=500, max_leaf_nodes=30, min_samples_leaf=10, l2_regularization=0.0)
}

# Create a DataFrame to store predictions
predictions_df = pd.DataFrame()

X_test = travel_from.loc[travel_from['t'] == 1].drop(['Count', 'ride_id', 'Date'], axis=1)
# Train and predict each model, then store predictions in the DataFrame
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    predictions_df[model_name] = y_pred

# # Display the stacked predictions DataFrame
# print(predictions_df.head())

# Calculate the mean of predictions across models
predictions_df['Stacked_Predictions'] = predictions_df.mean(axis=1)

# Display the DataFrame with stacked predictions
print(predictions_df.head())


   HistGradientBoosting  Stacked_Predictions
0              9.257026             9.257026
1              4.515762             4.515762
2              9.861227             9.861227
3             10.014322            10.014322
4              9.962304             9.962304


In [None]:
# Score the model on the training data (an optimistic estimate, since the model has already seen these rows)
from sklearn.metrics import mean_absolute_error
print(mean_absolute_error(y_train, model.predict(X_train)))

3.05345378433452


In [None]:
# The sample submission file
sample = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/TotalEnergies/SampleSubmission.csv')
sample.head()

Unnamed: 0,ride_id,number_of_ticket
0,4446,0
1,13962,0
2,5569,0
3,1675,0
4,5711,0


In [None]:
# Make predictions, write them into the sample submission frame, and save as CSV
X_test = travel_from.loc[travel_from['t'] == 1].drop(['Count', 'ride_id', 'Date'], axis=1)
pred = model.predict(X_test)
sample['number_of_ticket'] = pred  # assumes Test.csv and SampleSubmission.csv list rides in the same order
sample.to_csv('Final_predictions.csv', index=False)
sample.head(10)

Unnamed: 0,ride_id,number_of_ticket
0,4446,9.257026
1,13962,4.515762
2,5569,9.861227
3,1675,10.014322
4,5711,9.962304
5,2417,8.0822
6,15010,11.536515
7,1823,8.402429
8,15191,8.616424
9,14402,2.903443
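
The regressor is free to output values below zero or above a vehicle's seat count. If true ticket counts always lie between 0 and `max_capacity` (an assumption about the data, not something the notebook checks), clipping predictions to that range cannot increase the absolute error. A small sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up predictions and capacities, for illustration only.
pred = np.array([9.26, -0.4, 12.7])
max_capacity = pd.Series([11, 49, 11])

# Keep each prediction within [0, capacity of that vehicle].
clipped = np.clip(pred, 0, max_capacity.to_numpy())
print(clipped)
```

In the notebook, the capacities would come from the `max_capacity` column of the test rows, in the same order as `pred`.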
