USwB - Data Analysis & Prediction with Apache Spark, Databricks, MLlib

Author: Martyna Pitera

View the project on Databricks: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/173542347700804/2929076465318146/6176203754563543/latest.html

The project was carried out using Apache Spark on Databricks, utilizing Python and SQL.

The goal of this project is to analyze the Body Fat Prediction Dataset and build models that predict body fat percentage. Dataset: https://www.kaggle.com/datasets/fedesoriano/body-fat-prediction-dataset

The dataset contains:

Density determined from underwater weighing
Percent body fat from Siri's (1956) equation
Age (years)
Weight (lbs)
Height (inches)
Neck circumference (cm)
Chest circumference (cm)
Abdomen circumference (cm)
Hip circumference (cm)
Thigh circumference (cm)
Knee circumference (cm)
Ankle circumference (cm)
Biceps (extended) circumference (cm)
Forearm circumference (cm)
Wrist circumference (cm)
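
The percent body fat column is derived from density through Siri's (1956) equation, commonly written as percent body fat = 495 / density - 450, with density in g/cm³. A minimal sketch of that relationship follows; the function name `siri_body_fat` and the sample density are illustrative assumptions, not code or data taken from the project notebook.

```python
def siri_body_fat(density_g_per_cm3: float) -> float:
    """Estimate percent body fat from body density using Siri's (1956) equation.

    The function name is hypothetical; density is expected in g/cm^3.
    """
    if density_g_per_cm3 <= 0:
        raise ValueError("density must be positive")
    return 495.0 / density_g_per_cm3 - 450.0


if __name__ == "__main__":
    # Illustrative density value, chosen only to show the calculation.
    print(round(siri_body_fat(1.0708), 2))
```

Because percent body fat is a direct function of density, the two columns are strongly related, which is worth keeping in mind when reading correlation results or using density as a model feature.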

The project involved:

Loading and preprocessing of the dataset
Statistical analysis of the data
Exploratory Data Analysis to uncover patterns and insights
Correlation Analysis to understand relationships between variables
Using linear and tree-based regression models to predict body fat percentage

The Root Mean Squared Error (RMSE) for each model on the test data was:

Linear Regression: 0.622103
Decision Tree Regression: 0.96897
Gradient-Boosted Tree Regression: 0.891016

These results indicate that Linear Regression, with the lowest RMSE, predicted body fat percentage more accurately on the test data than both the Decision Tree Regression and Gradient-Boosted Tree Regression models.
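
For reference, RMSE is the square root of the mean squared difference between actual and predicted values, so a lower value means predictions sit closer to the true values. The sketch below computes it in plain Python; the helper name `rmse` and the sample numbers are illustrative assumptions and do not reproduce the project's data or results.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root Mean Squared Error between two equally long sequences.

    Hypothetical helper for illustration only.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one observation is required")
    # Square each residual, average them, then take the square root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    # Made-up body fat percentages, used only to exercise the function.
    actual_values = [12.0, 20.5, 25.0]
    predicted_values = [13.0, 19.5, 26.0]
    print(rmse(actual_values, predicted_values))
```

In Spark MLlib, `RegressionEvaluator` with `metricName="rmse"` reports the same metric on a predictions DataFrame; whether the notebook computed its figures that way is an assumption.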