Focus: Web-Scraping and Linear Regression Model
Problem Statement:
- Was Russell Wilson (Seattle Seahawks quarterback) underpaid in his early NFL career?
- How do we value NFL players in terms of salaries, given their performance?
Estimating the value of football players in dollars is an important task for NFL team managers.
Each football player drafted into the NFL receives a 4-year contract. As the end of this contract approaches, players and managers (owners) negotiate contract extensions. In that case, what is the value of an NFL player in his 4th year in the NFL?
Read the full story in my Blog
Project Goal:
- Collect performance and salary data for players in their early NFL careers (years 1-4), and build models to predict players' salaries
- Given the available dataset (2000-2018), a linear regression model can predict players' salaries in the fourth year of their careers with an error of roughly 1 million USD.
- All linear regression models show comparable performance.
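As a rough illustration of what a prediction error of about 1 million USD means in practice, the sketch below fits a linear regression on synthetic data and reports the hold-out mean absolute error (MAE). The feature names, coefficients, and the choice of MAE as the error metric are assumptions made for this example; none of the numbers come from the project's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for early-career statistics (NOT the project's data).
n_players = 200
games_started = rng.integers(0, 64, size=n_players)
total_yards = rng.normal(3000, 1000, size=n_players).clip(min=0)
X = np.column_stack([games_started, total_yards])

# Hypothetical fourth-year salary in millions of USD: linear signal plus noise.
salary_musd = (
    0.5
    + 0.03 * games_started
    + 0.0005 * total_yards
    + rng.normal(0, 0.5, size=n_players)
)

# Hold out 25% of the players to estimate the out-of-sample error.
X_train, X_test, y_train, y_test = train_test_split(
    X, salary_musd, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"Hold-out MAE: {mae:.2f} million USD")
```

The hold-out split matters here: an error measured on the training players alone would understate how far off a prediction for a new player is likely to be.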
Problem Solution:
- Was Russell Wilson (Seattle Seahawks quarterback) underpaid in his early career? Yes!
- How do we value NFL players in their early contract years in terms of (base) salaries? A multivariate linear regression model provides the best predictive method for evaluating players' salaries, based on their initial NFL career performance
Code, notebooks, and Summary
- Project_Luther_Report.md - detailed explanations of data acquisition, cleaning and modeling
- Step1_DataAcquisition.ipynb - notebook describing the process of web-scraping, converting data into dataframes and data pre-processing
- Workflow.md - a step-by-step procedure for scraping data and converting it to dataframes
- ScrapeProcFunc.py - a library of Python functions to web-scrape players' information (statistics and salaries), convert HTML into dataframes, and perform data wrangling prior to machine learning
- Step2_EDA.ipynb - initial exploratory data analysis
- Step3_Engineering_Selection.ipynb - notebook describing feature engineering and selection of predictive models
- Step4_Evaluation.ipynb - notebook describing the evaluation of selected model
- Project_Presentation.pdf - High-level overview of the project and results summary
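The HTML-to-dataframe step mentioned for ScrapeProcFunc.py can be sketched roughly as follows. This is not the code from that file: the function name `table_to_dataframe`, the table id, and the inline HTML snippet with placeholder players are all hypothetical, chosen only to show the general BeautifulSoup-to-pandas pattern.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical page fragment with placeholder values (not real statistics).
HTML = """
<table id="passing">
  <thead><tr><th>Player</th><th>Year</th><th>Yds</th></tr></thead>
  <tbody>
    <tr><td>Player A</td><td>2012</td><td>3118</td></tr>
    <tr><td>Player B</td><td>2013</td><td>3357</td></tr>
  </tbody>
</table>
"""


def table_to_dataframe(html, table_id, numeric_cols=()):
    """Parse one HTML table into a DataFrame (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        raise ValueError(f"no table with id={table_id!r}")

    # Column names come from the header row, data from the body rows.
    headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find("tbody").find_all("tr")
    ]

    df = pd.DataFrame(rows, columns=headers)
    # Scraped cells are strings; convert the requested columns to numbers.
    for col in numeric_cols:
        df[col] = pd.to_numeric(df[col])
    return df


df = table_to_dataframe(HTML, "passing", numeric_cols=("Year", "Yds"))
print(df)
```

For pages rendered by JavaScript, the HTML string would typically come from a Selenium driver's `page_source` rather than a static snippet.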
Data sources:
- Pro-football-reference.com - for rushing, passing, receiving, and NFL-Combine stats
- Spotrac.com - for Salary information
- USinflationcalculator.com - for annual inflation rate
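Annual inflation rates are presumably used to put salaries from different seasons on a common dollar basis. A minimal sketch of that adjustment is shown below; the function name `to_constant_dollars`, the convention that the rate for year y covers the move from year y-1 to year y, and the rates themselves are placeholders for illustration, not values taken from USinflationcalculator.com.

```python
def to_constant_dollars(amount, from_year, to_year, annual_rates):
    """Compound `amount` from `from_year` dollars into `to_year` dollars.

    `annual_rates[y]` is the inflation rate (as a fraction) applied when
    moving from year y-1 to year y. Hypothetical helper for illustration.
    """
    if from_year > to_year:
        raise ValueError("from_year must not be later than to_year")
    factor = 1.0
    for year in range(from_year + 1, to_year + 1):
        factor *= 1.0 + annual_rates[year]
    return amount * factor


# Placeholder rates, NOT actual inflation figures.
rates = {2017: 0.02, 2018: 0.03}

adjusted = to_constant_dollars(1_000_000, 2016, 2018, rates)
print(f"{adjusted:,.0f}")
```

Expressing every salary in the dollars of a single reference year keeps a model from mistaking general price growth for a change in player value.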
Tools:
- Data acquisition: Selenium, BeautifulSoup
- Data analysis: Pandas, seaborn
- Models: Scikit-learn (linear regression with and without regularization, decision tree, random forest, bagging, boosting)
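One way to compare several of these scikit-learn model families on equal footing is k-fold cross-validation with a shared error metric. The sketch below uses synthetic data, and the hyperparameters (`alpha`, `n_estimators`), the 5-fold split, and the MAE scoring are assumptions for illustration rather than the settings used in the notebooks.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Synthetic features and target (NOT the project's data).
X = rng.normal(size=(150, 4))
y = X @ np.array([1.5, -0.5, 0.0, 0.8]) + rng.normal(0, 0.3, size=150)

# Illustrative hyperparameters; the notebooks may use different ones.
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.01),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "boosting": GradientBoostingRegressor(random_state=0),
}

cv_mae = {}
for name, model in models.items():
    # scikit-learn reports negated MAE so that higher is always better.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    cv_mae[name] = -scores.mean()

for name, err in sorted(cv_mae.items(), key=lambda item: item[1]):
    print(f"{name:>14}: MAE = {err:.3f}")
```

Because every model sees the same folds and is scored with the same metric, differences in the printed errors reflect the models rather than the split.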
How to reproduce this work?
- Check out this code folder, and follow the step-by-step procedure for data acquisition and wrangling described in Workflow.md and its accompanying notebook, Workflow-4th.ipynb
- Python functions for scraping and cleaning are saved in ScrapeProcFunc.py, which can be imported directly to the Jupyter notebook's workspace
- Exploratory data analysis and predictive modeling are described in EDA-WR-4th.ipynb and Engineering_Modeling-4th.ipynb
How to contribute to this work?
- Fork (and star ⭐️ ) the repository
- Create annotated copies of the corresponding notebooks
- Submit a pull request
Attribution:
- This project is inspired by similar projects conducted by METIS alumni Ka Hou Sio and Jason SA, who investigated NBA and MLB player evaluation