This project attempts to predict MLB arbitration salaries from player statistics and collected arbitration data. The repository includes Jupyter notebooks containing the data cleaning and model building procedures. All models are neural networks trained with TensorFlow Keras and evaluated on metrics including Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE). Also included are Python files containing helper functions used in the data cleaning and preprocessing steps. An article explaining this project can be found on my Medium page.
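
For readers unfamiliar with the two evaluation metrics, the following is a minimal plain-Python sketch of how MAE and MAPE are defined. It is an illustration only, not the notebooks' code: the function names are hypothetical, the salary figures are made up, and the models themselves presumably rely on Keras's built-in metrics instead.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between true and predicted values (in dollars here)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_percentage_error(y_true, y_pred):
    """Average absolute gap expressed as a percentage of each true value."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    ratios = (abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * sum(ratios) / len(y_true)


# Made-up salaries in dollars, for illustration only (not project results).
true_salaries = [1_000_000, 2_000_000, 4_000_000]
predicted_salaries = [1_100_000, 1_800_000, 4_000_000]

mae = mean_absolute_error(true_salaries, predicted_salaries)
mape = mean_absolute_percentage_error(true_salaries, predicted_salaries)
print(f"MAE:  ${mae:,.0f}")
print(f"MAPE: {mape:.2f}%")
```

MAE reports the typical miss in dollars, while MAPE scales each miss by the true salary, so an error of a given size counts for more on a small contract than on a large one.
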
- Python
- Python Packages: pandas, numpy, sklearn, TensorFlow, matplotlib, seaborn
Component | Description |
---|---|
Data | CSV files of raw data to be cleaned, including custom FanGraphs data and scraped and collected arbitration data supplemented by further data collection. Includes a 'Metadata' folder describing the datasets |
Predictions | Excel file with true and predicted test-set salaries, presented in tables and pivot tables |
Train Test Data | CSV files of the cleaned data, split into training and test sets |
Visualizations | Histograms of the distributions of individual features, and scatterplots of model predictions under different groupings |
Data_Cleaning | Jupyter notebook for importing and cleaning data. The procedure includes standardizing names and positions and changing data types and values |
DNN_pitchers | Jupyter notebook for importing and preprocessing data, then training and evaluating a TensorFlow neural network for MLB pitchers. Trained on 2011-2022 pitchers, tested on 2023 pitchers |
DNN_players | Jupyter notebook for importing and preprocessing data, then training and evaluating a TensorFlow neural network for MLB position players. Trained on 2011-2022 position players, tested on 2023 position players |
Helper Functions | Python files containing helper functions to scrape data, build histogram visuals, and clean data |
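
The season-based split described for the DNN notebooks (train on 2011-2022, test on 2023) can be sketched with pandas as follows. This is a minimal sketch under stated assumptions: the column names `Year`, `WAR`, and `Salary` and the toy rows are hypothetical, and the actual notebooks may structure the split differently.

```python
import pandas as pd

# Hypothetical toy data; the real column names live in the repository's CSV files.
players = pd.DataFrame(
    {
        "Year": [2011, 2015, 2022, 2023, 2023],
        "WAR": [1.2, 3.4, 0.8, 2.1, 4.0],
        "Salary": [1.5e6, 4.2e6, 0.9e6, 2.8e6, 6.1e6],
    }
)

TRAIN_YEARS = (2011, 2022)  # inclusive range of training seasons
TEST_YEAR = 2023            # held-out season

# Series.between is inclusive on both ends by default.
train = players[players["Year"].between(*TRAIN_YEARS)]
test = players[players["Year"] == TEST_YEAR]

# Separate the features from the target column (assumed here to be "Salary").
X_train, y_train = train.drop(columns=["Salary"]), train["Salary"]
X_test, y_test = test.drop(columns=["Salary"]), test["Salary"]

print(len(X_train), "training rows;", len(X_test), "test rows")
```

Splitting by season rather than at random keeps every 2023 row out of training, so the test score reflects how the model would do on an arbitration class it has not seen.
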
- Pitchers & Position Players: Visualizations of Model Evaluation/Prediction
- Histograms, Pitchers / Histograms, Position Players: Histograms of each dataset's continuous features
- MAE Loss, Pitchers / MAE Loss, Position Players: MAE learning curves for each model
- MAPE Loss, Pitchers / MAPE Loss, Position Players: MAPE learning curves for each model
- SHAP Values, Pitchers / SHAP Values, Position Players: Horizontal bar plots of the features with the most positive SHAP values
- True vs. Pred, Pitchers / True vs. Pred, Position Players: Scatterplots of true 2023 salary vs. predicted 2023 salary
- True vs. Pred (Position), Pitchers / True vs. Pred (Position), Position Players: Scatterplots of true 2023 salary vs. predicted 2023 salary, grouped by player position
- True vs. Pred (Service Time), Pitchers / True vs. Pred (Service Time), Position Players: Scatterplots of true 2023 salary vs. predicted 2023 salary, grouped by service time