**Advanced Modeling**
- 

Building upon the insights gained from the baseline modeling phase, this notebook focuses on advancing the predictive capabilities of our NFL fantasy points model by leveraging ensemble decision tree techniques. Ensemble methods, known for their robustness and ability to capture non-linear relationships, are particularly well-suited for handling the complexities and variability inherent in player performance data.

**Objectives**

- Develop advanced models for predicting fantasy points across different positions (e.g., QB, RB, WR, TE)

- Explore and implement ensemble decision tree algorithms such as Random Forests, Gradient Boosting Machines (GBMs), and XGBoost to enhance predictive performance

- Evaluate the impact of ensemble methods in capturing the positional nuances identified in the baseline models

**Key Advancements Over Baseline Models**

- Non-linear Modeling: Unlike linear regression, decision trees can naturally handle non-linear patterns, which are prevalent in sports performance metrics

- Feature Interactions: Tree ensembles capture interactions between cumulative metrics, rolling averages, and other features without hand-built interaction terms

- Position-Specific Insights: Fitting a separate tree ensemble for each position shows which predictors matter most for that position
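
As a rough illustration of the non-linearity and interaction points above, the sketch below fits a linear regression and a random forest to synthetic data whose target contains a threshold effect and a feature interaction. The data, coefficients, and thresholds are invented for the example; nothing here comes from the NFL dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in data (assumption, not real player stats):
# a step at x0 > 5 plus an x0 effect that only applies when x1 > 5.
X = rng.uniform(0, 10, size=(600, 2))
step = np.where(X[:, 0] > 5, 10.0, 2.0)
interaction = 0.5 * X[:, 0] * (X[:, 1] > 5)
y = step + interaction + rng.normal(0, 0.5, size=600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=200, random_state=42).fit(X_train, y_train)

r2_linear = r2_score(y_test, linear.predict(X_test))
r2_forest = r2_score(y_test, forest.predict(X_test))
print(f"Linear regression R^2: {r2_linear:.3f}")
print(f"Random forest R^2:     {r2_forest:.3f}")
```

On data shaped like this the forest should score higher, because the linear model can only draw a single plane through the step and the interaction. Real player data may show a smaller gap.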

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

**To-Do List**
- 

1. Change intro for other notebooks
2. Decide on type of ensemble method

**Model Decision**
- 

1. Random Forest
   - Best for: Quick implementation and a solid baseline for ensemble methods.
   - Advantages:
     - More robust to overfitting than a single decision tree.
     - Handles categorical and numerical features well.
     - Provides feature importance metrics.
   - Disadvantages:
     - Often trails well-tuned boosting models on datasets with complex relationships.
     - Averaging independent trees mainly reduces variance, so it corrects systematic errors less effectively than boosting, which fits each new tree to the remaining error.
   - Recommendation: Use as a baseline ensemble method to compare against boosting models.

2. Gradient Boosting (e.g., sklearn's GradientBoostingRegressor)
   - Best for: Moderate-sized datasets with few missing values.
   - Advantages:
     - Often models complex relationships more effectively than Random Forest.
     - Good for datasets where small gains in prediction accuracy are critical.
   - Disadvantages:
     - Slower to train than Random Forest.
     - Can overfit without careful parameter tuning.
   - Recommendation: Use when the dataset is not too large and the goal is to refine predictions beyond Random Forest.

3. XGBoost
   - Best for: Large datasets or highly competitive projects where squeezing out extra performance matters.
   - Advantages:
     - Highly optimized and efficient implementation of gradient boosting.
     - Regularization parameters reduce the risk of overfitting.
     - Handles missing values internally.
   - Disadvantages:
     - More complex to tune than Random Forest or basic Gradient Boosting.
     - May require substantial compute power for large datasets.
   - Recommendation: Use when the best performance is the goal and time is available for hyperparameter tuning.

4. LightGBM or CatBoost
   - Best for: Very large datasets or highly categorical data (e.g., one-hot encoded quarterback data).
   - Advantages:
     - Often faster to train than XGBoost.
     - CatBoost handles categorical data directly.
   - Disadvantages:
     - LightGBM can struggle with datasets that are small or have many outliers.
     - CatBoost may require additional preprocessing for text-heavy features.
   - Recommendation: Try LightGBM or CatBoost when speed is critical or the dataset is very large.

**Start with Random Forest to establish a baseline and analyze feature importance.**

**Transition to XGBoost or LightGBM for more refined models.**

**These methods suit this dataset, which mixes numerical features (e.g., stats, averages) with encoded categorical features (e.g., positions, teams).**
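
The plan above (Random Forest baseline first, then a boosting model) can be sketched as follows. This is a minimal example under stated assumptions: the feature values and target are randomly generated, only the column names are borrowed from the QB frame, and sklearn's `GradientBoostingRegressor` stands in for XGBoost or LightGBM so the block runs without extra packages.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in features (assumption): names mirror the QB frame, values are random.
X = pd.DataFrame({
    "total_5_game_avg": rng.uniform(0, 30, n),
    "passing_yds_5_game_avg": rng.uniform(0, 350, n),
    "Home/Away_Encoded": rng.integers(0, 2, n),
})
# Made-up target so the example has a known dominant predictor.
y = 0.6 * X["total_5_game_avg"] + 0.02 * X["passing_yds_5_game_avg"] + rng.normal(0, 3, n)

models = {
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Compare the models on the same 5-fold split using mean absolute error.
cv_mae = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    cv_mae[name] = -scores.mean()
    print(f"{name}: CV MAE = {cv_mae[name]:.2f}")

# Feature importances from the baseline forest, largest first.
baseline = models["random_forest"].fit(X, y)
importances = pd.Series(baseline.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Scoring both models with the same cross-validation setup keeps the comparison fair before any time goes into tuning the boosting model.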

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

Importing Libraries/Packages

In [4]:
# Folder that holds the modeling CSVs
data_dir = "/Users/mychalortiz/Downloads/Brainstation/FantasyForecasts/notebooks/ModelingDataframes"

dfAll = pd.read_csv(f"{data_dir}/dfAll.csv")
qbsM = pd.read_csv(f"{data_dir}/qbsM.csv")
rbsM = pd.read_csv(f"{data_dir}/rbsM.csv")
wrsM = pd.read_csv(f"{data_dir}/wrsM.csv")
tesM = pd.read_csv(f"{data_dir}/tesM.csv")

Importing DataFrames

In [5]:
qbsM.head(5)

Unnamed: 0.1,Unnamed: 0,PLAYER NAME,PLAYER TEAM,PLAYER POSITION,STATUS,TOTAL,Opponent,Location,rank,DATE,...,season_rushing_td,season_total_avg,season_passing_yds_avg,season_passing_td_avg,season_receiving_rec_avg,season_receiving_yds_avg,season_receiving_td_avg,season_rushing_car_avg,season_rushing_yds_avg,season_fantasy_points_avg
0,335,AJ McCarron,Cin,QB,W 31-14,0.8,Cle,Home,34.0,01-09-24,...,0.0,0.8,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.8
1,300,AJ McCarron,Cin,QB,W 34-14,-0.04,Ind,Home,38.0,12-12-23,...,0.0,0.38,9.5,0.0,0.0,0.0,0.0,0.0,0.0,0.38
2,75,Aaron Rodgers,NYJ,QB,W 22-16,0.0,Buf,Home,36.0,09-12-23,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,468,Aidan O'Connell,LV,QB,L 20-23,20.26,Ind,Away,9.0,01-02-24,...,0.0,20.26,299.0,2.0,0.0,0.0,0.0,2.0,3.0,20.26
4,312,Aidan O'Connell,LV,QB,W 27-14,17.86,Den,Home,11.0,01-09-24,...,0.0,19.06,271.5,2.0,0.0,0.0,0.0,3.0,2.0,19.06


- Ensuring DataFrames have been loaded correctly
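
Beyond eyeballing `head()`, a few automated checks can confirm each frame loaded as expected. The helper below is a hypothetical sketch (`summarize_frames` is not part of the notebook); it reports row and column counts, missing cells, duplicate rows, and any leftover `Unnamed` index columns. The demo frame is a tiny stand-in built from two rows shown in the output above.

```python
import pandas as pd

def summarize_frames(frames):
    """Return one row of basic load diagnostics per DataFrame."""
    rows = []
    for name, df in frames.items():
        rows.append({
            "frame": name,
            "rows": len(df),
            "columns": df.shape[1],
            "missing_cells": int(df.isna().sum().sum()),
            "duplicate_rows": int(df.duplicated().sum()),
            # Columns pandas names "Unnamed: ..." usually come from a saved index.
            "leftover_index_cols": [c for c in df.columns if str(c).startswith("Unnamed")],
        })
    return pd.DataFrame(rows).set_index("frame")

# Stand-in frame for demonstration only.
demo = pd.DataFrame({
    "Unnamed: 0": [335, 300],
    "PLAYER NAME": ["AJ McCarron", "AJ McCarron"],
    "TOTAL": [0.8, -0.04],
})
summary = summarize_frames({"demo": demo})
print(summary)
```

In the notebook itself the call would look like `summarize_frames({"qbsM": qbsM, "rbsM": rbsM, "wrsM": wrsM, "tesM": tesM})`.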

**Quarterback Ensemble Decision Tree**
- 

In [12]:
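# Features: every numeric column except the target and the leftover index column
Xqb = qbsM.select_dtypes(include='number').drop(columns=['TOTAL', 'Unnamed: 0'])

# Target: TOTAL, the fantasy points column the model predicts
yqb = qbsM['TOTAL']

Xqb.head(5)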

Unnamed: 0,rank,Week,did_not_play,Name_Encoded,Team_Encoded,Opponent_Encoded,Home/Away_Encoded,total_5_game_avg,passing_yds_5_game_avg,passing_td_5_game_avg,...,season_rushing_td,season_total_avg,season_passing_yds_avg,season_passing_td_avg,season_receiving_rec_avg,season_receiving_yds_avg,season_receiving_td_avg,season_rushing_car_avg,season_rushing_yds_avg,season_fantasy_points_avg
0,34.0,18,0,0,6,7,1,0.8,20.0,0.0,...,0.0,0.8,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.8
1,38.0,14,0,0,6,13,1,0.38,9.5,0.0,...,0.0,0.38,9.5,0.0,0.0,0.0,0.0,0.0,0.0,0.38
2,36.0,1,0,1,24,3,1,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,9.0,17,0,2,18,13,0,20.26,299.0,2.0,...,0.0,20.26,299.0,2.0,0.0,0.0,0.0,2.0,3.0,20.26
4,11.0,18,0,2,18,9,1,19.06,271.5,2.0,...,0.0,19.06,271.5,2.0,0.0,0.0,0.0,3.0,2.0,19.06
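
The `qbsM.head()` output lists both `Unnamed: 0.1` and `Unnamed: 0`, while the cell above drops only one of them by name. Columns like these are typically row counters left over from saving a DataFrame with its index, so they carry no football information. One possible way to make the feature selection robust to them is sketched below. `split_features_target` is a hypothetical helper, and whether these columns should be dropped is an assumption to check against how the CSVs were produced.

```python
import pandas as pd

def split_features_target(df, target="TOTAL"):
    """Return (X, y): numeric features without the target or 'Unnamed' index columns."""
    numeric = df.select_dtypes(include="number")
    index_like = [c for c in numeric.columns if c.startswith("Unnamed")]
    X = numeric.drop(columns=[target, *index_like])
    y = df[target]
    return X, y

# Stand-in frame for demonstration only, using values visible in the output above.
demo = pd.DataFrame({
    "Unnamed: 0.1": [0, 1],
    "Unnamed: 0": [335, 300],
    "PLAYER NAME": ["AJ McCarron", "AJ McCarron"],
    "TOTAL": [0.8, -0.04],
    "rank": [34.0, 38.0],
})
X_demo, y_demo = split_features_target(demo)
print(list(X_demo.columns))  # ['rank']
```

The same helper could then be reused for the RB, WR, and TE frames so every position goes through identical feature selection.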
