<a href="https://colab.research.google.com/github/ravi72munde/scala-spark-cab-rides-predictions/blob/Ravi/Cab_Price_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Uber ride prices**:

**Hamed Tabrizchi**



The purpose of this project is to investigate the relationship between cab fares and the weather. The datasets used in this study were taken from Kaggle. A price prediction model is also trained and then evaluated on held-out test data.

Unlike public transportation fares, Uber ride prices fluctuate. The demand for and supply of rides at any given time have a significant impact on them. In this data science study, we would like to learn more about what drives demand for rides and how pricing changes with time and weather. The main goal is to understand the elements that determine the price of a cab ride and to forecast trip costs based on these variables: the cab-ride fields (Distance, Cab type, Timestamp, Destination, Source, Price estimate) and the weather fields (Temperature, Location, Clouds, Pressure, Timestamp, Humidity, and Wind).

* **DATA**

In this project, two data sets are considered: one describes individual cab rides and the other describes weather conditions. Each data set is examined separately below.

The dataset for this project has been taken from Kaggle. The dataset can be viewed or downloaded by visiting the following link: https://www.kaggle.com/datasets/ravi72munde/uber-lyft-cab-prices

In [None]:
import pandas as pd

In [None]:
cab_df = pd.read_csv("../input/cab_rides.csv")
weather_df = pd.read_csv("../input/weather.csv")

In [None]:
cab_df.head()

In [None]:
weather_df.head()

In [None]:
cab_df.describe()

In [None]:
weather_df.describe()

In [None]:
# numeric_only=True skips string columns (required in pandas >= 2.0)
cab_df.corr(numeric_only=True)

In [None]:
# numeric_only=True skips string columns (required in pandas >= 2.0)
weather_df.corr(numeric_only=True)

In [None]:
pd.plotting.scatter_matrix(weather_df, alpha=0.2)

In [None]:
pd.plotting.scatter_matrix(cab_df, alpha=0.2)

**Describing data set**

In [None]:
print('Weather Data set size = ', weather_df.shape)
print('Weather Data set Dimension = ', weather_df.ndim)
print('*================================*')
print('Cab Data set size = ', cab_df.shape)
print('Cab Data set Dimension = ', cab_df.ndim)

In [None]:
print('Weather Data set types -> ')
weather_df.dtypes

In [None]:
print('Cab Data set types -> ')
cab_df.dtypes

* **Combining two data sets**

In this section, the two data sets are merged to provide a more complete data set with more features. To check that the merge was performed correctly, the linear relationships (correlations) between the features are evaluated after the combination.
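
The hourly key-based merge can be sketched on toy data as follows. This is a minimal illustration, not the notebook's exact procedure: the rows are invented, and the helper `add_hourly_key` and the step that averages weather readings per key are assumptions added here. The point it shows is that when several weather readings share one location and hour, a plain join repeats the matching cab rows, whereas aggregating the weather data to one row per key first keeps one row per ride.

```python
import pandas as pd

# Toy stand-ins for the two data sets (all values are invented).
cab = pd.DataFrame({
    "source": ["Back Bay", "Back Bay", "Fenway"],
    "time_stamp": [1543201800000, 1543203600000, 1543201500000],  # milliseconds
    "price": [10.5, 12.0, 7.0],
})
weather = pd.DataFrame({
    "location": ["Back Bay", "Back Bay", "Fenway"],
    "time_stamp": [1543201260, 1543203000, 1543201300],  # seconds
    "temp": [40.0, 42.0, 39.0],
})


def add_hourly_key(df, place_col, unit):
    """Hypothetical helper: add a 'place - date - hour' key column."""
    out = df.copy()
    out["date_time"] = pd.to_datetime(out["time_stamp"], unit=unit)
    out["merge_date"] = (
        out[place_col].astype(str)
        + " - " + out["date_time"].dt.date.astype(str)
        + " - " + out["date_time"].dt.hour.astype(str)
    )
    return out


cab = add_hourly_key(cab, "source", "ms")
weather = add_hourly_key(weather, "location", "s")

# Plain merge: two Back Bay readings in the same hour duplicate both Back Bay rides.
naive = cab.merge(weather[["merge_date", "temp"]], on="merge_date", how="left")

# Average the weather per key first, so each ride matches at most one weather row.
hourly_weather = weather.groupby("merge_date", as_index=False)["temp"].mean()
merged = cab.merge(hourly_weather, on="merge_date", how="left",
                   validate="many_to_one")

print(len(cab), len(naive), len(merged))
```

Whether averaging is the right way to collapse repeated readings depends on the analysis; taking the reading closest in time is another option.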

In [None]:
cab_df['date_time'] = pd.to_datetime(cab_df['time_stamp']/1000, unit='s')
weather_df['date_time'] = pd.to_datetime(weather_df['time_stamp'], unit='s')
cab_df.head()

In [None]:
# Build a "location - date - hour" key so rides and weather readings can be matched by place and hour
cab_df['merge_date'] = cab_df.source.astype(str) +" - "+ cab_df.date_time.dt.date.astype("str") +" - "+ cab_df.date_time.dt.hour.astype("str")
weather_df['merge_date'] = weather_df.location.astype(str) +" - "+ weather_df.date_time.dt.date.astype("str") +" - "+ weather_df.date_time.dt.hour.astype("str")

In [None]:
weather_df.index = weather_df['merge_date']

In [None]:
cab_df.head()

In [None]:
merged_df = cab_df.join(weather_df,on=['merge_date'],rsuffix ='_w')

In [None]:
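# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
merged_df['rain'] = merged_df['rain'].fillna(0)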

In [None]:
merged_df = merged_df[pd.notnull(merged_df['date_time_w'])]

In [None]:
merged_df = merged_df[pd.notnull(merged_df['price'])]

In [None]:
merged_df['day'] = merged_df.date_time.dt.dayofweek

In [None]:
merged_df['hour'] = merged_df.date_time.dt.hour

In [None]:
merged_df.columns

In [None]:
merged_df.count()

In [None]:
merged_df.head()

In [None]:
# numeric_only=True skips string and datetime columns (required in pandas >= 2.0)
merged_df.corr(numeric_only=True)

**Identifying independent and dependent variables**

As mentioned, after merging the two data sets, the linear relationships between the features are examined to check that the merge behaved as expected.
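
As a small illustration of this check, the sketch below ranks features by their Pearson correlation with the price. The data frame `toy` and its coefficients are synthetic assumptions, not values from the Kaggle data; only the pattern (select numeric columns, correlate, sort by absolute value) is meant to carry over.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic data: price depends on distance, not on temperature.
toy = pd.DataFrame({
    "distance": rng.uniform(0.5, 5.0, n),
    "temp": rng.normal(40.0, 5.0, n),
    "source": rng.choice(["Back Bay", "Fenway"], n),  # non-numeric column
})
toy["price"] = 3.0 + 2.5 * toy["distance"] + rng.normal(0.0, 0.5, n)

# Keep numeric columns only, then rank features by |correlation with price|.
price_corr = (
    toy.select_dtypes(include="number")
    .corr()["price"]
    .drop("price")
    .sort_values(key=abs, ascending=False)
)
print(price_corr)
```

Pearson correlation only captures linear relationships, so a weak value does not rule out a nonlinear effect.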

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15, 10))
# numeric_only=True skips string and datetime columns (required in pandas >= 2.0)
sns.heatmap(merged_df.corr(numeric_only=True), annot=True)

In [None]:
#Check for possible null values in the current data set
merged_df.info()

In [None]:
plt.figure(figsize=(12, 10))
plt.plot(merged_df['distance'], merged_df['price'], 'ro')
plt.title('distance vs price')
plt.xlabel("distance")
plt.ylabel("price")

In [None]:
plt.figure(figsize=(12, 10))
plt.plot(merged_df['rain'], merged_df['price'], 'b^')
plt.title('rain vs price')
plt.xlabel("rain")
plt.ylabel("price")

In [None]:
plt.figure(figsize=(12, 10))
plt.plot(merged_df['surge_multiplier'], merged_df['price'], 'ks')
plt.title('surge_multiplier vs price')
plt.xlabel("surge_multiplier")
plt.ylabel("price")

In [None]:
merged_df.product_id.unique()

As the output above shows, several services are available (*lyft_line, lyft_premier, lyft_luxsuv, lyft_plus, lyft_lux, and lyft*). To improve the accuracy of the final predictions, we build a forecasting model for a single service (*lyft_line*). The same procedure works for any of the other services: it is enough to change the selected service before training the model.
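
Switching services can be made a one-argument change by wrapping the selection in a function. The sketch below is an assumption about how this could be organized: `select_service` and `FEATURE_COLS` are hypothetical names, and the toy frame carries only a few of the columns.

```python
import pandas as pd

# Feature columns used for the model in this notebook.
FEATURE_COLS = ["day", "distance", "hour", "temp", "clouds",
                "pressure", "humidity", "wind", "rain"]


def select_service(df, product_id, feature_cols=FEATURE_COLS, target="price"):
    """Hypothetical helper: return (X, y) for a single product_id."""
    subset = df.loc[df["product_id"] == product_id]
    X = subset[list(feature_cols)].reset_index(drop=True)
    y = subset[target].reset_index(drop=True)
    return X, y


# Toy rows (invented) with a reduced set of feature columns.
toy = pd.DataFrame({
    "product_id": ["lyft_line", "lyft", "lyft_line", "lyft_lux"],
    "distance": [1.2, 2.5, 3.1, 0.8],
    "hour": [9, 17, 22, 3],
    "price": [5.0, 11.0, 8.5, 16.0],
})

X_toy, y_toy = select_service(toy, "lyft_line", feature_cols=["distance", "hour"])
print(X_toy)
print(y_toy.tolist())
```

Calling `select_service(merged_df, 'lyft_plus')` would then prepare the data for another service without touching the rest of the pipeline.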

In [None]:
X = merged_df[merged_df.product_id=='lyft_line'][['day','distance','hour','temp','clouds', 'pressure','humidity', 'wind', 'rain']]

In [None]:
X.count()

In [None]:
y = merged_df[merged_df.product_id=='lyft_line']['price'] 

In [None]:
y.count()

In [None]:
# Renumber the rows from 0 and discard the old index
X = X.reset_index(drop=True)

In [None]:
X.head()

In [None]:
# get_dummies converts categorical columns into indicator (dummy) variables.
# If every selected column is already numeric, it returns X unchanged.
features = pd.get_dummies(X)

In [None]:
features.columns

In [None]:
# Use numpy to convert to arrays
import numpy as np
# Labels are the values we want to predict
labels = np.array(y)

# Saving feature names for later use
feature_list = list(features.columns)
# Convert to numpy array
features = np.array(features)

In [None]:
# Using scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.25, random_state = 42)

In [None]:
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
est = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1)
est.fit(train_features, train_labels);

In [None]:
predictions = est.predict(test_features)
errors = abs(predictions - test_labels)
# The error is a price difference, so it is in the units of price, not degrees
print('Mean Absolute Error:', round(np.mean(errors), 2), '(in the units of price).')

* **Prediction Accuracy**
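
The accuracy reported in this section is defined as 100 minus the mean absolute percentage error (MAPE). A small worked example with made-up numbers shows the arithmetic; the arrays `actual` and `predicted` are illustrative only.

```python
import numpy as np

actual = np.array([10.0, 20.0, 40.0])
predicted = np.array([11.0, 18.0, 40.0])

abs_errors = np.abs(predicted - actual)   # absolute errors: 1, 2, 0
pct_errors = 100 * abs_errors / actual    # percentage errors: 10, 10, 0
mape = pct_errors.mean()                  # (10 + 10 + 0) / 3
accuracy = 100 - mape

print(round(mape, 2), round(accuracy, 2))
```

Because each error is divided by the actual value, this measure is undefined when an actual price is zero and becomes unstable when prices are very small.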

In [None]:
# Calculate mean absolute percentage error (MAPE)
mape = 100 * (errors / test_labels)
# Calculate and display accuracy
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')

In [None]:
#feature importances
importances = list(est.feature_importances_)
#variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out each feature and its importance
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))

According to the feature importances above, distance appears to be the most important factor in the model's price estimates.
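
The `feature_importances_` attribute used above is impurity-based and computed from the training data. As a cross-check, one could also compute permutation importance on the test set with scikit-learn's `permutation_importance`. The sketch below uses synthetic data in which price depends only on distance; the data, the coefficients, and the smaller `n_estimators` are assumptions made to keep the example fast and self-contained.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 400

# Synthetic features: price is driven by distance, temperature is pure noise.
distance = rng.uniform(0.5, 5.0, n)
temp = rng.normal(40.0, 5.0, n)
X = np.column_stack([distance, temp])
y = 3.0 + 2.5 * distance + rng.normal(0.0, 0.3, n)
names = ["distance", "temp"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model = GradientBoostingRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Shuffle one column at a time and measure how much the test score drops.
result = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=42
)
ranking = sorted(
    zip(names, result.importances_mean), key=lambda p: p[1], reverse=True
)
for name, score in ranking:
    print(f"{name:10s} {score:.3f}")
```

If both rankings put distance first on the real data, that would support the conclusion drawn from the impurity-based importances.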

* **Final Evaluation Stage**
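
The `sd` function in this stage relies on the identity RMSE² = MBE² + SD², where SD is the standard deviation of the errors (actual minus predicted). The following check on made-up numbers shows that the value obtained from the identity matches the directly computed population standard deviation.

```python
import numpy as np

actual = np.array([10.0, 12.0, 9.0, 15.0])
predicted = np.array([11.0, 11.0, 10.0, 13.0])
errors = actual - predicted           # -1, 1, -1, 2

mbe = errors.mean()                   # mean bias error: 0.25
mse = np.mean(errors ** 2)            # mean squared error: 1.75
rmse = np.sqrt(mse)

sd_from_identity = np.sqrt(rmse ** 2 - mbe ** 2)
sd_direct = errors.std()              # population standard deviation (ddof=0)

print(np.isclose(sd_from_identity, sd_direct))
```

In other words, the MBE measures the systematic offset of the predictions, while this SD measures the spread of the errors around that offset.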

In [None]:
import math

import numpy as np


def mae(predict, actual):
    """Mean absolute error."""
    predict = np.array(predict)
    actual = np.array(actual)
    return np.abs(predict - actual).mean()


def mse(predict, actual):
    """Mean squared error."""
    predict = np.array(predict)
    actual = np.array(actual)
    return np.square(predict - actual).mean()


def mbe(y, y_predict):
    """Mean bias error: average of (actual - predicted)."""
    return (np.array(y) - np.array(y_predict)).mean()


def sd(y, y_predict):
    """Standard deviation of the errors, using RMSE^2 = MBE^2 + SD^2."""
    variance = mse(y_predict, y) - mbe(y, y_predict) ** 2
    # Guard against a tiny negative value caused by floating-point rounding
    return math.sqrt(max(variance, 0.0))


def MAPE(actual, pred):
    """Mean absolute percentage error, in percent."""
    actual, pred = np.array(actual), np.array(pred)
    return np.mean(np.abs((actual - pred) / actual)) * 100


def eva(y, y_predict):
    """Print every metric for actual values y and predictions y_predict."""
    metrics = [
        ("Mean Absolute Error", mae(y_predict, y)),
        ("Root Mean Absolute Error", math.sqrt(mae(y_predict, y))),
        ("Mean Squared Error", mse(y_predict, y)),
        ("Root Mean Squared Error", math.sqrt(mse(y_predict, y))),
        ("Mean Bias Error", mbe(y, y_predict)),
        ("Standard Deviation of Error", sd(y, y_predict)),
        ("MAPE", MAPE(y, y_predict)),
    ]
    for name, value in metrics:
        print("        ")
        print(name + " : ")
        print(value)

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import mean_absolute_error
import math
eva(test_labels,predictions)

**Conclusion**

Considering the evaluation metrics computed on the test set, the trained model (Gradient Boosting Regressor) produces reasonable price estimates. In this project, we examined the Uber and Lyft cab price data set together with the weather data and built a model for estimating prices at different distances and under different weather conditions. Based on the initial analysis and the final evaluation, the model's performance is acceptable (Mean Absolute Error: 1.107435777607448, Mean Bias Error: 0.011175742322951845).
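
Whether a given Mean Absolute Error is acceptable depends on how much the prices themselves vary. One hedged way to put the figure in context is to compare it with a constant baseline that always predicts the median training price. The sketch below uses invented numbers, and `baseline_mae` is a hypothetical helper, not part of the notebook.

```python
import numpy as np


def mae(actual, predicted):
    """Mean absolute error between two sequences."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return float(np.mean(np.abs(actual - predicted)))


def baseline_mae(train_labels, test_labels):
    """Hypothetical helper: MAE of always predicting the training median."""
    constant = np.median(train_labels)
    return mae(test_labels, np.full(len(test_labels), constant))


# Toy prices (invented): the training median is 7.0.
train = np.array([5.0, 6.0, 7.0, 8.0, 9.0])
test = np.array([6.0, 10.0])

print(baseline_mae(train, test))  # (|6 - 7| + |10 - 7|) / 2 = 2.0
```

With the notebook's variables, `baseline_mae(train_labels, test_labels)` could be set against the model's error: the further the model's MAE falls below the baseline, the more the features contribute.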