<a href="https://colab.research.google.com/github/pratap-vj/Transport-Demand-Prediction/blob/main/Transport_Demand_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Transport Demand Prediction



##### **Project Type**    - Regression
##### **Contribution**    - Individual


# **Project Summary -**

The objective of this project is to develop a predictive model that forecasts the number of seats that Mobiticket, a transportation service provider, can expect to sell for each ride on 14 different routes. These routes originate in towns to the North-West of Nairobi, located towards Lake Victoria, and all end in Nairobi. The aim is to estimate the seat sales for a specific route on a given date and time, taking into account the travel time from the route's origin to Nairobi's Central Business District (CBD).

The routes under consideration connect Nairobi with the following towns: Awendo, Homa Bay, Kehancha, Kendu Bay, Keroka, Keumbu, Kijauri, Kisii, Mbita, Migori, Ndhiwa, Nyachenge, Oyugis, Rodi, Rongo, Sirare, and Sori. Although the summary refers to 14 routes, this list names 17 towns, which matches the 17 distinct origins in the dataset's travel_from column. The travel time from each of these towns to Nairobi is approximately 8 to 9 hours. After reaching the outskirts of Nairobi, it takes an additional 2 to 3 hours to reach the main bus terminal in the CBD, depending on traffic conditions.

To achieve the goal of predicting seat sales for each route, historical data will be collected from Mobiticket's booking and ticketing system. The dataset will include information such as the departure date, time, and route for each ride, as well as the number of seats sold for that specific journey. Other relevant factors, such as holidays, special events, and weather conditions, will also be incorporated into the dataset to account for potential fluctuations in seat demand.

The data pre-processing stage will involve handling missing values, encoding categorical variables, and performing feature engineering to extract relevant features that may impact seat sales. Time-related features, such as day of the week and month, may be extracted to capture any temporal patterns in seat demand. Additionally, factors like route popularity, distance, and the availability of alternative transportation options may be considered to enrich the feature set.
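As a minimal sketch of the time-related feature extraction described above, the snippet below parses the `travel_date` and `travel_time` strings and derives day of week, month, and departure time in minutes. The three rows are copied from the dataset preview; the new column names (`day_of_week`, `month`, `departure_minutes`) are illustrative assumptions, not part of the original data.

```python
import pandas as pd

# Three rows copied from the dataset preview (travel_date is day-month-year).
sample = pd.DataFrame({
    "travel_date": ["17-10-17", "19-11-17", "26-11-17"],
    "travel_time": ["7:15", "7:12", "7:05"],
})

# Give the format explicitly so that 17-10-17 is read as 17 October 2017.
parsed = pd.to_datetime(sample["travel_date"], format="%d-%m-%y")
sample["day_of_week"] = parsed.dt.dayofweek  # Monday = 0 ... Sunday = 6
sample["month"] = parsed.dt.month

# Express the departure time as minutes after midnight (hypothetical feature name).
hours_minutes = sample["travel_time"].str.split(":", expand=True).astype(int)
sample["departure_minutes"] = hours_minutes[0] * 60 + hours_minutes[1]

print(sample)
```

Passing an explicit `format` avoids relying on pandas to guess whether the day or the month comes first.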

Various machine learning algorithms will be evaluated to build the predictive model. Regression models, such as Linear Regression, may be employed as a baseline. More sophisticated techniques like Random Forest, Gradient Boosting, or Neural Networks will be explored to capture complex relationships and nonlinearities in the data. Model hyperparameters will be tuned to enhance performance, and ensemble methods may be considered to combine the strengths of multiple models.

The dataset will be divided into training and testing sets to evaluate the model's performance accurately. The model's effectiveness will be assessed using appropriate evaluation metrics, such as Mean Squared Error (MSE) or Root Mean Squared Error (RMSE). Furthermore, time series cross-validation techniques will be applied to account for potential temporal dependencies in the data.
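The evaluation idea above can be sketched as follows, assuming the rows are already sorted by travel date so that the most recent rides form the test set. The helper names `rmse` and `chronological_split` and the toy target values are hypothetical; scikit-learn's `TimeSeriesSplit` would be one option for a full expanding-window cross-validation.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def chronological_split(n_rows, test_fraction=0.2):
    """Split row positions so that the last rows (latest dates) are held out."""
    n_test = int(round(n_rows * test_fraction))
    cut = n_rows - n_test
    return np.arange(cut), np.arange(cut, n_rows)

# Toy seat counts in date order, invented purely for illustration.
y = np.array([10, 12, 9, 11, 14, 13, 8, 15, 12, 10])
train_idx, test_idx = chronological_split(len(y))

# Naive baseline: predict the training mean for every held-out ride.
baseline = y[train_idx].mean()
score = rmse(y[test_idx], np.full(len(test_idx), baseline))
print(f"baseline RMSE: {score:.3f}")
```

Holding out the latest rows rather than a random sample keeps future rides out of the training data, which is the concern behind time series cross-validation.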

Once the predictive model is trained and validated, it will be deployed as a service to provide real-time seat sale predictions for each route. Mobiticket's booking system can then use these forecasts to optimize seat availability and pricing strategies. By accurately estimating seat demand, Mobiticket can ensure efficient resource allocation, maximize seat occupancy, and enhance customer satisfaction by minimizing the occurrence of overbooked or underbooked rides.

In conclusion, this project aims to develop a robust predictive model that forecasts the number of seats Mobiticket can expect to sell for each ride on 14 different routes originating from towns towards Lake Victoria and terminating in Nairobi. The model's implementation will enable Mobiticket to optimize its operations, enhance revenue generation, and provide better services to its customers, ultimately contributing to the growth and success of the transportation company.

# **GitHub Link -**

https://github.com/pratap-vj/Transport-Demand-Prediction

# **Problem Statement**


Mobiticket, a transportation service provider, needs a predictive model to estimate the number of seats it can sell for each ride on 14 specific routes originating from towns towards Lake Victoria and ending in Nairobi. The goal is to optimize seat allocation, pricing strategies, and enhance customer satisfaction by accurately forecasting seat demand, considering travel time and other relevant factors. The model will be deployed as a real-time service to optimize operations and provide a seamless booking experience for passengers.

#### **Define Your Business Objective?**

The primary business objective is to optimize seat allocation, pricing strategies, and improve customer satisfaction for Mobiticket, the transportation service provider. This will be achieved through the development and deployment of a predictive model that accurately forecasts the number of seats that can be sold for each ride on 14 specific routes originating from towns towards Lake Victoria and terminating in Nairobi.

**By achieving this objective, Mobiticket aims to:**

**Maximize Seat Occupancy:** The predictive model will help Mobiticket allocate the right number of seats for each route and schedule, ensuring maximum occupancy on their buses. This will reduce the likelihood of underbooked or overbooked rides, leading to efficient resource utilization.

**Optimize Pricing Strategies:** Accurate seat demand forecasts will enable Mobiticket to implement dynamic pricing strategies. They can adjust ticket prices based on predicted demand, increasing revenue during peak times and offering competitive prices during off-peak periods.

**Enhance Customer Satisfaction:** Providing an optimal number of seats and avoiding overbooked rides will improve the overall customer experience. Passengers will have a higher chance of securing seats for their preferred travel dates and times, leading to increased customer satisfaction and loyalty.

**Resource Allocation Efficiency:** With precise seat demand forecasts, Mobiticket can efficiently allocate its resources, including buses and staff, to meet the expected passenger demand on each route. This will result in cost savings and improved operational efficiency.

**Real-time Booking Experience:** The predictive model will be integrated into Mobiticket's booking system as a real-time service. Passengers will benefit from an enhanced booking experience, knowing the availability of seats in advance and making informed decisions.

**Competitive Advantage:** By leveraging data-driven insights and optimizing operations, Mobiticket can gain a competitive edge in the transportation market. Providing accurate seat availability information can attract more customers, leading to increased market share.

In conclusion, the business objective is to develop a predictive model that empowers Mobiticket to make informed decisions regarding seat allocation and pricing, leading to improved resource management, customer satisfaction, and a competitive advantage in the transportation industry.

# ***Let's Begin!***

## ***1. Know Your Data***

### Import Libraries

```python
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
```

### Dataset Loading

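```python
# Load Dataset
# pandas infers zip compression from the file extension; the archive must hold a single CSV file.
df = pd.read_csv('C:\\Users\\HP\\Downloads\\Nairobi Transport Data.zip')
```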


### Dataset First View

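```python
# Dataset First Look
df.head()
```

|   | ride_id | seat_number | payment_method | payment_receipt | travel_date | travel_time | travel_from | travel_to | car_type | max_capacity |
|---|---------|-------------|----------------|-----------------|-------------|-------------|-------------|-----------|----------|--------------|
| 0 | 1442 | 15A | Mpesa | UZUEHCBUSO | 17-10-17 | 7:15 | Migori | Nairobi | Bus | 49 |
| 1 | 5437 | 14A | Mpesa | TIHLBUSGTE | 19-11-17 | 7:12 | Migori | Nairobi | Bus | 49 |
| 2 | 5710 | 8B | Mpesa | EQX8Q5G19O | 26-11-17 | 7:05 | Keroka | Nairobi | Bus | 49 |
| 3 | 5777 | 19A | Mpesa | SGP18CL0ME | 27-11-17 | 7:10 | Homa Bay | Nairobi | Bus | 49 |
| 4 | 5778 | 11A | Mpesa | BM97HFRGL9 | 27-11-17 | 7:12 | Migori | Nairobi | Bus | 49 |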


### Dataset Rows & Columns count

```python
# Dataset Rows & Columns count
df.shape
```

```text
(51645, 10)
```

### Dataset Information

```python
# Dataset Info
df.info()
```

```text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51645 entries, 0 to 51644
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   ride_id          51645 non-null  int64 
 1   seat_number      51645 non-null  object
 2   payment_method   51645 non-null  object
 3   payment_receipt  51645 non-null  object
 4   travel_date      51645 non-null  object
 5   travel_time      51645 non-null  object
 6   travel_from      51645 non-null  object
 7   travel_to        51645 non-null  object
 8   car_type         51645 non-null  object
 9   max_capacity     51645 non-null  int64 
dtypes: int64(2), object(8)
memory usage: 3.9+ MB
```


#### Duplicate Values

```python
# Show fully duplicated rows (an empty result means there are none)
df[df.duplicated()]
```

The result is an empty DataFrame (column headers only, no rows), so the dataset contains no fully duplicated rows.


#### Missing Values/Null Values

```python
# Missing Values/Null Values Count
df.isnull().sum()
```

```text
ride_id            0
seat_number        0
payment_method     0
payment_receipt    0
travel_date        0
travel_time        0
travel_from        0
travel_to          0
car_type           0
max_capacity       0
dtype: int64
```

### What do you know about your dataset?

**Number of Entries:** The dataset contains 51,645 entries (rows).

**Data Columns:** There are 10 columns in the dataset.

**Column Names and Data Types:**

a. ride_id: An integer column with non-null values, likely representing a unique identifier for each ride.

b. seat_number: An object (string) column with non-null values, probably representing the seat booked on each ticket.

c. payment_method: An object (string) column with non-null values, indicating the payment method used for each ticket.

d. payment_receipt: An object (string) column with non-null values, possibly containing the payment receipt code for each ticket.

e. travel_date: An object (string) column with non-null values, representing the date of travel for each ride.

f. travel_time: An object (string) column with non-null values, representing the time of travel for each ride.

g. travel_from: An object (string) column with non-null values, indicating the starting point of each ride.

h. travel_to: An object (string) column with non-null values, indicating the destination of each ride.

i. car_type: An object (string) column with non-null values, representing the type of car used for each ride.

j. max_capacity: An integer column with non-null values, likely denoting the maximum capacity of the car used for each ride.

**Null Values:** The dataset doesn't contain any null values, as indicated by the "Non-Null Count" for each column (51645 non-null values for all columns).

From this information, we can infer that each row appears to describe a single ticket purchase, with its ride ID, seat number, payment method and receipt, travel date and time, origin and destination, car type, and the maximum capacity of the vehicle. The dataset is relatively clean and has no missing values.

## ***2. Understanding Your Variables***

```python
# Dataset Columns
df.columns.values
```

```text
array(['ride_id', 'seat_number', 'payment_method', 'payment_receipt',
       'travel_date', 'travel_time', 'travel_from', 'travel_to',
       'car_type', 'max_capacity'], dtype=object)
```

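```python
# Dataset Describe
df.describe(include = 'object')
```

|        | seat_number | payment_method | payment_receipt | travel_date | travel_time | travel_from | travel_to | car_type |
|--------|-------------|----------------|-----------------|-------------|-------------|-------------|-----------|----------|
| count  | 51645 | 51645 | 51645 | 51645 | 51645 | 51645 | 51645 | 51645 |
| unique | 61 | 2 | 51645 | 149 | 78 | 17 | 1 | 2 |
| top    | 1 | Mpesa | UZUEHCBUSO | 10-12-17 | 7:09 | Kisii | Nairobi | Bus |
| freq   | 2065 | 51532 | 1 | 856 | 3926 | 22607 | 51645 | 31985 |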


### Variables Description

**seat_number:** This column contains the seat numbers for the passengers in the rides. The values in this column are categorical, and there are 61 unique seat numbers in the dataset.

**payment_method:** This column indicates the payment method used by the passengers for their rides. The values are categorical and have two unique options: "Mpesa" and another payment method.

**payment_receipt:** Every row has its own payment receipt: all 51,645 values are distinct. The values are alphanumeric and likely serve as unique identifiers for individual payments.

**travel_date:** This column holds the date of travel, stored as a string in day-month-year form (for example, 17-10-17). There are 149 unique dates in the dataset.

**travel_time:** This column holds the departure time, stored as a string in hour:minute form (for example, 7:15). There are 78 unique times.

**travel_from:** This column specifies the starting point or origin of each ride. The values are categorical, and there are 17 unique travel origin locations.

**travel_to:** This column specifies the destination of each ride. The values are categorical, and there is only one unique destination in the dataset.

**car_type:** This column represents the type of car used for each ride. The values are categorical, and there are two unique car types: "Bus" and another car type.


Note that the "travel_to" column has only one unique value (Nairobi), so it cannot distinguish one ride from another and is a candidate for removal. The "payment_receipt" column, by contrast, has a different value in every row, which suggests that it identifies individual transactions rather than rides.
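The check for single-valued columns mentioned above can be sketched as follows. The helper `constant_columns` and the three-row frame are illustrative assumptions; on the real data the same call would be made on `df`.

```python
import pandas as pd

def constant_columns(frame):
    """Return the names of columns that hold a single distinct value."""
    return [col for col in frame.columns if frame[col].nunique(dropna=False) <= 1]

# Toy frame built from values in the dataset preview.
toy = pd.DataFrame({
    "ride_id": [1442, 5437, 5710],
    "travel_from": ["Migori", "Migori", "Keroka"],
    "travel_to": ["Nairobi", "Nairobi", "Nairobi"],
})

to_drop = constant_columns(toy)
reduced = toy.drop(columns=to_drop)
print(to_drop)
print(reduced.columns.tolist())
```

A column with one distinct value has zero variance, so dropping it loses no information a model could use.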



### Check Unique Values for each variable.

```python
# Check Unique Values for each variable.
for i in df.columns.tolist():
    print("No. of unique values in ",i,"is",df[i].nunique(),".")
```

```text
No. of unique values in  ride_id is 6249 .
No. of unique values in  seat_number is 61 .
No. of unique values in  payment_method is 2 .
No. of unique values in  payment_receipt is 51645 .
No. of unique values in  travel_date is 149 .
No. of unique values in  travel_time is 78 .
No. of unique values in  travel_from is 17 .
No. of unique values in  travel_to is 1 .
No. of unique values in  car_type is 2 .
No. of unique values in  max_capacity is 2 .
```
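The counts above show 6,249 distinct `ride_id` values across 51,645 rows, which suggests that each row is one ticket and that seats sold per ride could be obtained by counting rows per ride. The sketch below illustrates that idea on a toy frame; the repeated ride IDs, the extra seat numbers, and the target name `number_of_tickets` are assumptions made for illustration, not values taken from the data.

```python
import pandas as pd

# Toy ticket-level rows: several tickets may share one ride_id.
tickets = pd.DataFrame({
    "ride_id": [1442, 1442, 5437, 5710, 5710, 5710],
    "travel_from": ["Migori", "Migori", "Migori", "Keroka", "Keroka", "Keroka"],
    "seat_number": ["15A", "16A", "14A", "8B", "9B", "1A"],
})

# One row per ride, with the number of tickets as a candidate regression target.
rides = (
    tickets.groupby(["ride_id", "travel_from"])
    .size()
    .reset_index(name="number_of_tickets")
)
print(rides)
```

Ride-level attributes such as travel date, travel time, and car type could be added to the grouping keys in the same way, provided they are constant within a ride.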


## 3. ***Data Wrangling***

### Data Wrangling Code

```python
# Write your code to make your dataset analysis ready.
```

### What manipulations have you done and what insights did you find?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

```python
# Chart - 1 visualization code
```

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 2

```python
# Chart - 2 visualization code
```

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 3

```python
# Chart - 3 visualization code
```

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 4

```python
# Chart - 4 visualization code
```

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 5

```python
# Chart - 5 visualization code
```

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 6

```python
# Chart - 6 visualization code
```

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 7

```python
# Chart - 7 visualization code
```

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 8

```python
# Chart - 8 visualization code
```

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 9

```python
# Chart - 9 visualization code
```

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 10

```python
# Chart - 10 visualization code
```

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 11

```python
# Chart - 11 visualization code
```

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 12

```python
# Chart - 12 visualization code
```

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 13

```python
# Chart - 13 visualization code
```

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

```python
# Correlation Heatmap visualization code
```

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

```python
# Pair Plot visualization code
```

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***