# Table of Contents
This file serves as a more detailed table of contents for my repository. There is a sub-header for each folder in the repo, in the general order the work was done, with a description of the individual files in each folder.

## 1. MLB_Pitch_Data_Setup_SQL Folder - DONE
This folder contains notebooks and code used to create a PostgreSQL database of data from the [Kaggle dataset](https://www.kaggle.com/pschale/mlb-pitch-data-20152018?select=games.csv).
- Initial_Kaggle_Dataset_Construction.ipynb: code for merging the source .csv files into one combined file each for pitches, at bats, and games.
- kaggle_dataset_sql_construction.ipynb: code for loading the combined .csv files into mlb_pitches, a PostgreSQL database on my local computer.
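
As a rough illustration of the two steps above (combine, then load), here is a minimal sketch using only the Python standard library. It is not the code from the notebooks: the sample rows and the `combine_csv` and `load_rows` helpers are hypothetical, the column names are assumptions about the Kaggle files, and an in-memory SQLite database stands in for the local PostgreSQL database so the example runs anywhere.

```python
import csv
import io
import sqlite3

# Hypothetical stand-ins for two source pitch files (the real files have many more columns).
PITCHES_A = "ab_id,pitch_type,px,pz\n1,FF,0.10,2.50\n1,SL,-0.40,1.90\n"
PITCHES_B = "ab_id,pitch_type,px,pz\n2,CH,0.30,2.10\n"


def combine_csv(texts):
    """Concatenate CSV texts that share a header into one list of row dicts."""
    rows = []
    for text in texts:
        rows.extend(csv.DictReader(io.StringIO(text)))
    return rows


def load_rows(conn, table, rows):
    """Create `table` from the row keys and bulk-insert the rows."""
    columns = list(rows[0].keys())
    conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO {table} VALUES ({placeholders})",
        [tuple(row[c] for c in columns) for row in rows],
    )
    conn.commit()


combined = combine_csv([PITCHES_A, PITCHES_B])
conn = sqlite3.connect(":memory:")  # SQLite stand-in for the PostgreSQL database
load_rows(conn, "pitches", combined)
n_rows = conn.execute("SELECT COUNT(*) FROM pitches").fetchone()[0]
```

With PostgreSQL, the same flow would go through a PostgreSQL driver and explicit column types rather than SQLite's untyped columns.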

## 2. EDA-SQL Folder - DONE
This folder contains code on initial EDA of the data, including SQLAlchemy and Pandas code to structure the dataset in preparation for machine learning.  
- initial_sql_queries.ipynb: code on running initial SQL queries to explore the data within the mlb_pitches database.  This includes initial queries to join the tables, and general EDA on pitch types and counts on different pitchers to explore the data.  
- data_cleaning.ipynb: code to clean and prepare the data from initial_sql_queries.ipynb for use in a machine learning model that predicts pitches.  Some secondary EDA is also done here, mainly visualizing pitch locations relative to the strike zone.
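
The kind of join-and-count query described above might look like the sketch below. The table names (`pitches`, `atbats`, `games`), the key columns (`ab_id`, `g_id`, `pitcher_id`), and the sample rows are assumptions for illustration, and SQLite again stands in for the mlb_pitches PostgreSQL database; the actual queries live in initial_sql_queries.ipynb.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for the mlb_pitches database
conn.executescript("""
CREATE TABLE games (g_id INTEGER, date TEXT);
CREATE TABLE atbats (ab_id INTEGER, g_id INTEGER, pitcher_id INTEGER);
CREATE TABLE pitches (ab_id INTEGER, pitch_type TEXT);
INSERT INTO games VALUES (10, '2018-04-01');
INSERT INTO atbats VALUES (1, 10, 500), (2, 10, 500), (3, 10, 600);
INSERT INTO pitches VALUES (1, 'FF'), (1, 'SL'), (2, 'FF'), (3, 'CH');
""")

# Join pitches -> at bats -> games, then count each pitch type per pitcher.
QUERY = """
SELECT a.pitcher_id, p.pitch_type, COUNT(*) AS n_pitches
FROM pitches AS p
JOIN atbats AS a ON p.ab_id = a.ab_id
JOIN games AS g ON a.g_id = g.g_id
GROUP BY a.pitcher_id, p.pitch_type
ORDER BY a.pitcher_id, n_pitches DESC
"""
pitch_counts = conn.execute(QUERY).fetchall()
```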

## 3. Clustering Folder - DONE
This folder contains code for clustering batters of similar types, in order to reduce the dimensionality of the batter features fed into the machine learning algorithms.
- Batter_Clustering.ipynb: this notebook contains code on collecting hitter statistical data from FanGraphs, as well as running K-Means clustering to group similar hitters together.  
- k_means_clustering_functions.py: contains Python functions for running K-Means clustering on the statistics passed in from Batter_Clustering.ipynb.
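
To show the idea behind K-Means, the sketch below implements the basic assign-then-update loop (Lloyd's algorithm) in plain Python. It is not the code in k_means_clustering_functions.py, which may well rely on a library implementation; the `k_means` function, the two features, and the hitter values are all made up for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, n_iter=50, seed=0):
    """Cluster `points` (tuples of floats) into `k` groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Hypothetical hitters described by (strikeout rate, isolated power).
hitters = [(0.12, 0.10), (0.14, 0.11), (0.13, 0.09),
           (0.30, 0.25), (0.32, 0.27), (0.31, 0.26)]
labels, centroids = k_means(hitters, k=2)
```

The resulting cluster label can then replace a batter's individual statistics as a single categorical feature. With real statistics on different scales, standardizing the features before clustering is usually advisable.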

## 4. Pitch_Classification Folder
This folder contains code on building the following machine learning models: 
  1.  Classification: Determining pitch type
  2.  Linear Regression: Predicting pitch location (x and y coordinates) 

Code files in this folder are broken into sub-folders: 
- Modeling_Preparation, to prepare the dataframe/features for modeling  
    - modeling_prep.ipynb: this notebook contains code on preparing the dataframe for modeling, including some additional SQL queries to build new features
    - pitch_dataframe_functions.py: contains Python functions for building a pitcher's arsenal of pitches, to perform EDA on what types of pitches they throw and in what proportions.
- Individual_Pitcher_Runs, to establish the general pipeline for modeling:
    - Pitch_Classification_Intro.ipynb: this notebook contains the initial work on a pitch classification algorithm to predict pitch type.  
    - Pitch_Classification_Oversample.ipynb: this notebook contains code on pitch classification, using oversampling to try to address the class imbalance problem
    - pitch_location_regression.ipynb: this notebook contains code on initial runs of the linear regression algorithm for predicting pitch location on individual pitchers  
    - location_regression_functions.py: contains functions on running linear regression to predict pitch location
    - pitch_cat_functions.py: contains functions on running classification modeling to predict pitch type  
    - Location_Viz_Checks: visualizations of the residuals of the location predictions for Max Scherzer (i.e. one model), to get a sense of how the model is performing
- Pipeline_Building, to develop a pipeline for chaining together the regression and classification algorithms, and to incorporate various pitchers:
    - pipeline_architecture.ipynb: contains code on building out a pipeline for pitch prediction, chaining together classification for pitch type and regression for pitch location
    - classification_location_combo.py: contains functions for training and validating models on pitch type prediction (classification) and pitch location prediction (regression)  
    - Pipeline_Part_2.ipynb: includes pipeline work with new features engineered in Feature_Engineering.ipynb
    - classification_location_combo-2.py: contains functions similar to those in classification_location_combo.py, but adds features for the previous 100 pitches/pitch type counts the pitcher has thrown
- Final_Modeling: contains notebooks and functions utilized in the final model:
    - final_modeling_work.ipynb: this notebook contains code on my final modeling process, and some visualizations on the final results
    - classification_location_combo_2_l_10.py: contains functions similar to those in classification_location_combo.py, but adds features for the previous 10 pitches/pitch type counts the pitcher has thrown
    - classification_location_combo_2_l_5.py: contains functions similar to those in classification_location_combo.py, but adds features for the previous 5 pitches/pitch type counts the pitcher has thrown
    - final_model_fitter.py: contains functions on fitting the final models by player, on the train/validation data combined
    - final_model_scorer.py: contains functions for scoring the final models from final_model_fitter.py for each pitcher on the test set.
    - weighted_final_model_fitter.py: contains functions for fitting the final models by player, on the train/validation data combined.  Uses balanced class weights for the XGBoost classifier
    - weighted_final_model_scorer: contains functions for scoring the final models for each pitcher on the test set, using the weighted models from weighted_final_model_fitter.py
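
Several of the files above add features based on the previous 100, 10, or 5 pitches a pitcher has thrown. One way such a rolling pitch type count could be computed is sketched below; the `previous_pitch_counts` function and the sample sequence are hypothetical and are not taken from the classification_location_combo files.

```python
from collections import Counter, deque


def previous_pitch_counts(pitch_types, window):
    """For each pitch, count the pitch types among the `window` pitches thrown before it."""
    recent = deque(maxlen=window)  # oldest pitch drops out automatically
    features = []
    for pitch in pitch_types:
        # Record counts BEFORE adding the current pitch, so the feature
        # never includes the pitch the model is trying to predict.
        features.append(dict(Counter(recent)))
        recent.append(pitch)
    return features


# Hypothetical pitch sequence for one pitcher, in the order thrown.
sequence = ["FF", "FF", "SL", "CH", "FF", "SL"]
features = previous_pitch_counts(sequence, window=3)
```

Changing `window` to 100, 10, or 5 would mirror the three variants listed above. The first few pitches have fewer than `window` predecessors, so their counts are based on a shorter history.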

## 5. Feature_Engineering_Additional Folder
This folder contains code on running additional feature engineering to utilize in my modeling process.  
- Feature_Engineering.ipynb: code on general additional feature engineering from my dataset

## 6. Final_Presentation Folder  - DONE
This folder contains the slides from my final presentation of this project at Metis.  PDF and PowerPoint versions of the slides are available here.
- patrick_bovard_final_presentation.pdf
- patrick_bovard_final_presentation.pptx

## 7. Streamlit_App Folder - DONE
This folder contains code for running my Streamlit App to showcase the project model and results.  *Note: this is currently under construction.*
- streamlit_script.py: the code for the Streamlit App related to this project.