Created By: Leah Nagy
The goal of this project was to predict the average user rating of rock-climbing routes in Kentucky using linear regression models. By web scraping the Mountain Project website, I collected information about each route that could be used to predict ratings for future routes. After collecting the data, I evaluated several types of regression models before arriving at a final model.
Kentucky has some of the best rock climbing in the world and is considered the climbing mecca of the East Coast, where guides are a vital part of the community. With over 3,000 routes to choose from, a rock-climbing guide company wants to better understand what makes some routes more desirable than others so it can provide the optimal experience for its clients.
After exploratory data analysis and feature engineering, the dataset contains 1,582 routes. I collected 17 features on each route, and the final model includes 12 of them. The data was collected from the Mountain Project website using Selenium and BeautifulSoup.
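Once Selenium has rendered a route page, the extraction step boils down to pulling fields out of the HTML. A minimal sketch of that parsing step is below; the class names (`route-name`, `difficulty`, `stars`) are hypothetical stand-ins, since Mountain Project's real markup differs.

```python
from bs4 import BeautifulSoup

# Stand-in for page source captured by Selenium; the selectors below are
# hypothetical, not Mountain Project's actual markup.
sample_html = """
<div class="route">
  <h1 class="route-name">Fuzzy Undercling</h1>
  <span class="difficulty">5.11b</span>
  <span class="stars">3.8</span>
</div>
"""

def parse_route(html):
    """Extract one route's fields from a rendered page."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "name": soup.select_one(".route-name").get_text(strip=True),
        "difficulty": soup.select_one(".difficulty").get_text(strip=True),
        "avg_rating": float(soup.select_one(".stars").get_text(strip=True)),
    }

print(parse_route(sample_html))
```

In practice each parsed dictionary becomes one row of the routes table.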
- The route's share date was converted to the number of years on the app, so routes of different ages are comparable
- The numbers of ratings, comments, photos, and ticks were summed into a single feature, since these counts were highly correlated
- Categorical features were encoded
- Interaction variables were added:
  - Difficulty Rating × Route Length
  - Popularity / Route Age
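The feature-engineering steps above can be sketched in pandas. The column names and the two-row toy table are hypothetical, not the project's exact schema:

```python
import pandas as pd

# Toy route table with hypothetical column names.
routes = pd.DataFrame({
    "share_date": pd.to_datetime(["2015-06-01", "2019-03-15"]),
    "num_ratings": [120, 8],
    "num_comments": [14, 1],
    "num_photos": [30, 2],
    "num_ticks": [900, 40],
    "route_type": ["Sport", "Trad"],
    "difficulty_num": [11.2, 9.5],  # numeric encoding of the YDS grade
    "length_ft": [60, 45],
})

# Share date -> years on the app, relative to a fixed snapshot date.
snapshot = pd.Timestamp("2021-06-01")
routes["years_on_app"] = (snapshot - routes["share_date"]).dt.days / 365.25

# Collapse the highly correlated engagement counts into one popularity feature.
routes["popularity"] = routes[
    ["num_ratings", "num_comments", "num_photos", "num_ticks"]
].sum(axis=1)

# One-hot encode categorical features.
routes = pd.get_dummies(routes, columns=["route_type"], drop_first=True)

# Interaction variables.
routes["difficulty_x_length"] = routes["difficulty_num"] * routes["length_ft"]
routes["popularity_per_year"] = routes["popularity"] / routes["years_on_app"]

print(routes[["years_on_app", "popularity", "difficulty_x_length"]])
```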
I tried simple linear, polynomial, Ridge, and LASSO regression. The final model was a simple linear regression with features removed according to the LASSO regression results.
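The LASSO-guided feature selection can be sketched as follows, using synthetic data in place of the route table; the alpha value here is illustrative, not the project's tuned value:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 17 features, of which 12 carry signal.
X, y = make_regression(n_samples=300, n_features=17, n_informative=12,
                       noise=5, random_state=0)

# Fit LASSO on standardized features; the L1 penalty drives weak
# coefficients exactly to zero.
scaler = StandardScaler().fit(X)
lasso = Lasso(alpha=1.0).fit(scaler.transform(X), y)

# Keep only the features LASSO retained, then refit a plain linear model
# on that subset.
keep = np.flatnonzero(lasso.coef_)
final = LinearRegression().fit(X[:, keep], y)
print(len(keep), "features kept")
```

Refitting an unpenalized linear regression on the surviving features keeps the final coefficients unbiased by the L1 shrinkage.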
The entire dataset was split 60/20/20 into training/validation/testing sets. I used 5-fold cross-validation while testing the various models and scored them on the validation set. I then combined the training and validation sets for a final 80/20 training/testing split. The testing data was used only on the final model, with the same random state kept throughout so the test rows never changed.
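The splitting scheme above can be sketched with scikit-learn, again on synthetic data (the sizes and random state are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=500, n_features=12, noise=10, random_state=42)

# 60/20/20: hold out 20% for testing, then carve 20% of the total
# (i.e. 25% of the remainder) off as a validation set. A fixed
# random_state keeps the test rows identical across experiments.
X_rem, X_test, y_rem, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rem, y_rem, test_size=0.25, random_state=42)

# 5-fold cross-validation on the training portion while comparing models.
cv_mae = -cross_val_score(LinearRegression(), X_train, y_train,
                          scoring="neg_mean_absolute_error", cv=5)
print("CV MAE:", cv_mae.mean())

# For the final model, recombine train + validation (80/20 overall).
final_model = LinearRegression().fit(X_rem, y_rem)
```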
The metric I used to score my models was Mean Absolute Error (MAE), because it is in the same units as the target. Without a need to further penalize outliers, MAE keeps the model more interpretable to stakeholders. While I focused on MAE, I also worked to reduce multicollinearity, which slightly increased the MAE. In the future I would like to try more interaction terms to improve the model's performance further.
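Because MAE stays in the target's units, a score of 0.374 means predictions are off by about 0.37 stars on average. A tiny illustration with made-up ratings:

```python
from sklearn.metrics import mean_absolute_error

# Hypothetical true vs. predicted average ratings (in stars).
y_true = [3.0, 3.5, 4.0, 2.5]
y_pred = [3.4, 3.1, 3.8, 2.9]

# Mean of the absolute errors: (0.4 + 0.4 + 0.2 + 0.4) / 4 = 0.35 stars.
print(round(mean_absolute_error(y_true, y_pred), 2))
```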
- Accuracy: 0.566
- Accuracy: 0.572
- MAE: 0.374
- Selenium, BeautifulSoup & Requests for web scraping
- NumPy and pandas for data manipulation
- scikit-learn for modeling
- Matplotlib and seaborn for plotting