Uber Data Analysis

Overview

Uber, a pioneering ride-sharing service, generates extensive data on rides and deliveries. This project focuses on developing an accurate price prediction model for Uber rides, taking into account various influential factors such as location, distance, time, weather, cab types, and more. Through Exploratory Data Analysis (EDA), we aim to uncover insights and visualize the relationships between these factors.

Dataset

The dataset used for this project is sourced from Kaggle and can be accessed here. It spans two months of ride information in Boston, MA, with approximately 600,000 rows and 57 feature columns. The dataset covers essential ride attributes and detailed weather data, providing a comprehensive basis for analysis.

Data Preparation

The initial stages involve meticulous data cleaning and transformation. We focus exclusively on Uber rides, filtering out other ride types to narrow our analysis. Notably, around 50,000 missing values for price are associated with Taxi rides. To ensure accuracy, consistency, and error-free data, we've dropped rows with missing values, which won't impact our analysis of Uber rides.

Data Visualization

Data visualization plays a pivotal role in understanding the dataset. We've employed various visualization techniques, including barplots, scatterplots, and strip plots, to explore relationships such as hourly ride counts, rides based on weather forecasts, and price vs. distance.

Machine Learning Models

Categorical values have been transformed using Label Encoder and fed as inputs to machine learning models. Four models were employed for price prediction: Linear Regression, Decision Tree Regressor, Random Forest Regressor, and XGBoost (Extreme Gradient Boost) Regressor. Scikit-learn library has been extensively utilized for model development.

Model Evaluation

After fine-tuning the model parameters, XGBoost emerged as the top performer, achieving an impressive 95.4% accuracy on the test dataset. Random Forest also performed well but with a longer training time. We further analyzed feature importance and found that only five features had the most significant impact on prediction.

Evaluation Metrics

The following evaluation metrics were obtained for the XGBoost predictions on the test data:

Mean Absolute Error (MAE): 1.12
Mean Squared Error (MSE): 3.31
Root Mean Absolute Error (RMAE): 1.82

Regression Plots

Regression plots for Random Forest and XGBoost predictions:

Conclusion

This project encompasses a comprehensive journey through the data analysis process. From data preparation to visualization and predictive analysis, it offers detailed explanations, code snippets, and visualizations to illuminate the results. Through exploring Uber's dataset, we've gained valuable insights into the factors influencing ride prices, ultimately developing an accurate pricing model.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
README.md		README.md
Uber_Data_Analysis.ipynb		Uber_Data_Analysis.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Uber Data Analysis

Overview

Dataset

Data Preparation

Data Visualization

Machine Learning Models

Model Evaluation

Evaluation Metrics

Regression Plots

Conclusion

About

Releases

Packages

Languages

heet-shahh/Uber-Data-Analysis

Folders and files

Latest commit

History

Repository files navigation

Uber Data Analysis

Overview

Dataset

Data Preparation

Data Visualization

Machine Learning Models

Model Evaluation

Evaluation Metrics

Regression Plots

Conclusion

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages