# Problem Statement
Sweet Lift Taxi company has collected historical data on taxi orders at airports. To attract more
drivers during peak hours, we need to predict the number of taxi orders for the next hour. Build a
model for such a prediction.

# Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt

# Suppress all warnings (not only FutureWarning) to keep the notebook output clean
import warnings
warnings.simplefilter('ignore')

# Loading Data

In [None]:
df = pd.read_csv("taxi.csv")
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index("datetime", inplace=True, drop=True)
df.head()

datetime,num_orders
2018-03-01 00:00:00,9
2018-03-01 00:10:00,14
2018-03-01 00:20:00,28
2018-03-01 00:30:00,20
2018-03-01 00:40:00,32


In [None]:
df.shape

(26496, 1)

In [None]:
df.describe()

statistic,num_orders
count,26496.0
mean,14.070463
std,9.21133
min,0.0
25%,8.0
50%,13.0
75%,19.0
max,119.0


In [None]:
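# Resampling data to 1 hour (mean of the 10-minute counts in each hour).
# The result is only displayed, not assigned, so df keeps its 10-minute resolution.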
df.resample("60min").mean()

datetime,num_orders
2018-03-01 00:00:00,20.666667
2018-03-01 01:00:00,14.166667
2018-03-01 02:00:00,11.833333
2018-03-01 03:00:00,11.000000
2018-03-01 04:00:00,7.166667
...,...
2018-08-31 19:00:00,22.666667
2018-08-31 20:00:00,25.666667
2018-08-31 21:00:00,26.500000
2018-08-31 22:00:00,37.166667
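
Since the target is the number of orders in the next hour, an hourly total may be a more natural target than the hourly mean shown above. The sketch below sums the 10-minute counts per hour and keeps the result. It runs on a small synthetic frame (`demo`, with made-up values) rather than on `taxi.csv`, and the names `demo` and `hourly` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for taxi.csv: one day of 10-minute order counts (made-up values)
idx = pd.date_range("2018-03-01", periods=144, freq="10min")
rng = np.random.default_rng(0)
demo = pd.DataFrame({"num_orders": rng.integers(0, 30, size=len(idx))}, index=idx)

# Sum the six 10-minute counts in each hour to get orders per hour,
# and assign the result so later steps work at hourly resolution
hourly = demo.resample("60min").sum()
print(hourly.head())
```

When every hour contains all six 10-minute rows, the hourly sum is simply six times the hourly mean, so the two views carry the same information at different scales.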


# Data Preparation

In [None]:
df.reset_index(inplace=True)
df['month'] = df['datetime'].dt.month
df['day'] = df['datetime'].dt.dayofweek  # day of week: Monday=0 ... Sunday=6
df['year'] = df['datetime'].dt.year
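
The month, day-of-week, and year features are constant within any given day, so on their own they cannot separate one hour from the next. A possible extension, sketched here under assumptions, is to add the hour of day and a few lagged values of the target. `make_features` and `max_lag` are hypothetical names, and the input is a synthetic hourly frame, not the notebook's `df`.

```python
import numpy as np
import pandas as pd

def make_features(hourly, max_lag=3):
    """Hypothetical helper: add calendar and lag features to an hourly frame."""
    out = hourly.copy()
    out["hour"] = out.index.hour
    out["dayofweek"] = out.index.dayofweek
    # lag_k holds the order count observed k hours earlier
    for lag in range(1, max_lag + 1):
        out[f"lag_{lag}"] = out["num_orders"].shift(lag)
    # The first max_lag rows have no full history, so drop them
    return out.dropna()

# Synthetic hourly series (made-up values 0..47) with a DatetimeIndex
idx = pd.date_range("2018-03-01", periods=48, freq="60min")
demo = pd.DataFrame({"num_orders": np.arange(48)}, index=idx)

features = make_features(demo, max_lag=3)
print(features.head())
```

Lag features only use values from earlier timestamps, so they stay available at prediction time for a one-hour-ahead forecast.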

In [None]:
X = df[["month", "day", "year"]]
y = df.num_orders

In [None]:
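from sklearn.model_selection import train_test_split
# Note: train_test_split shuffles rows by default, so the 10% test set is
# drawn from across the whole period rather than from its most recent part
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)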
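
For a forecasting task, a common alternative is to hold out the most recent observations so that the model is evaluated only on data later than anything it was trained on. A minimal sketch, assuming time-ordered rows and using synthetic data (`X_demo` and `y_demo` are illustrative names, not the notebook's variables):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, time-ordered data (made-up values)
idx = pd.date_range("2018-03-01", periods=100, freq="60min")
X_demo = pd.DataFrame({"hour": idx.hour}, index=idx)
y_demo = pd.Series(np.arange(100), index=idx)

# shuffle=False keeps row order: the last 10% of rows become the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.10, shuffle=False
)
print(X_tr.index.max(), "<", X_te.index.min())
```

With `shuffle=False` the `random_state` argument has no effect, so it is omitted here.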

# Model Building

#### Linear Regression Model

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()

In [None]:
y_pred = model.predict(X_test)

In [None]:
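from sklearn.metrics import mean_absolute_error
# Note: this is the square root of the mean absolute error (MAE), not the RMSE;
# RMSE would be the square root of mean_squared_error
np.sqrt(mean_absolute_error(y_test, y_pred))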

2.513333398736652
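
The value above is the square root of the MAE, which differs from the root mean squared error (RMSE). The sketch below computes MAE, RMSE, and the square root of MAE side by side on a tiny hand-made example, so the arithmetic can be checked by hand. The arrays `y_true` and `y_hat` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up targets and predictions; absolute errors are 2, 2, 3, 5
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 35.0])

mae = mean_absolute_error(y_true, y_hat)            # (2 + 2 + 3 + 5) / 4 = 3.0
rmse = np.sqrt(mean_squared_error(y_true, y_hat))   # sqrt((4 + 4 + 9 + 25) / 4) = sqrt(10.5)
sqrt_mae = np.sqrt(mae)                             # sqrt(3.0), not an error in order units

print(mae, rmse, sqrt_mae)
```

MAE and RMSE are both expressed in the same units as the target (orders), whereas the square root of MAE is not, which makes it harder to interpret.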

#### Random Forest Model

In [None]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

In [None]:
from sklearn.metrics import mean_absolute_error
# Same metric as for the linear model: square root of MAE, not RMSE
np.sqrt(mean_absolute_error(y_test, y_pred))

2.4956037229537014
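
The two scores are easier to judge against a simple reference. One common choice for time series is a persistence forecast, which predicts each hour with the previous hour's value. The sketch below is an illustration on a made-up series (`series` is not the notebook's data) and is not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Made-up hourly order counts
idx = pd.date_range("2018-03-01", periods=8, freq="60min")
series = pd.Series([10, 12, 15, 11, 9, 14, 20, 18], index=idx, dtype=float)

# Persistence forecast: the prediction for hour t is the value at hour t-1
pred = series.shift(1).dropna()
actual = series.loc[pred.index]

# RMSE of the baseline; errors are 2, 3, -4, -2, 5, 6, -2
baseline_rmse = float(np.sqrt(((actual - pred) ** 2).mean()))
print(baseline_rmse)
```

A model that cannot beat this kind of baseline on the same test period is probably not extracting much signal from its features.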

## <font color='#2F4F4F'>Summary of Findings</font>

What are your findings?

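Predicting the number of taxi orders is important for increasing sales. For example, if few orders are expected on a given day, some taxis can be taken off the road for that day. Both models perform almost equally: the square root of the mean absolute error is about 2.51 for linear regression and about 2.50 for the random forest. Note that the metric computed above is the square root of the MAE, not the root mean squared error.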

## <font color='#2F4F4F'>Recommendations</font>

What do you recommend?



## <font color='#2F4F4F'>Challenging your Solution</font>

#### a) Did we have the right question?

Yes

#### b) Did we have the right data?

Yes

