In [None]:
import sys

import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from itertools import product
from collections import OrderedDict
import xgboost as xgb
sns.set(rc={'figure.figsize':(16,9)})

In [None]:
election_df = pd.read_csv('county_level_election.csv')
election_df.head()

### Section 2: Bagging / Random Forest
We will use train/test splits, cross validation, and a random forest fit to the data. Create an 80/20 train/test split. For accuracy, use the .score method, which for a regressor returns R^2 rather than a classification accuracy.
1. Set the number of estimators to be 100, the features to be the square root of available features, and iterate through depths (1-20). Use only 5 folds for cross validation to save some compute resources. Plot the max depth on the x axis and the accuracy on the y axis for training and for the mean cross validation.
2. Based on the plot, what max depth would you recommend?
3. What is the accuracy (mean cv) at your chosen depth?
4. The cross validation looks different than the lab, why?

In [None]:
X = election_df[['population',
                 'hispanic',
                 'minority',
                 'female',
                 'unemployed',
                 'income',
                 'nodegree',
                 'bachelor',
                 'inactivity',
                 'obesity',
                 'density',
                 'cancer']]
y = election_df['votergap']

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
depths = list(range(1, 21))
train_scores = []
cvmeans = []
cvstds = []

for depth in depths:
    
    print(f'Training with depth: {depth}...')
    randfor = RandomForestRegressor(n_estimators=100,
                                   max_features='sqrt',
                                   max_depth=depth,
                                   random_state=42)

    # Perform training and 5-fold cross validation
    train_scores.append(randfor.fit(Xtrain, ytrain).score(Xtrain, ytrain))
    scores = cross_val_score(estimator=randfor, X=Xtrain, y=ytrain, cv=5)

    cvmeans.append(scores.mean())
    cvstds.append(scores.std())

cvmeans = np.array(cvmeans)
cvstds = np.array(cvstds)

# Plot the mean CV score with a band of +/- 2 standard deviations
plt.plot(depths, cvmeans, '*-', label="Mean CV")
plt.fill_between(depths, cvmeans - 2*cvstds, cvmeans + 2*cvstds, alpha=0.3)

# Plot the training score at each depth
# against the cross-validation score at the same depth
plt.plot(depths, train_scores, '-+', label="Train")
plt.legend()
plt.ylabel("Accuracy (R^2 from .score)")
plt.xlabel("Max depth")
plt.xticks(depths);

I would recommend a max depth of 8 to 9. Around depth 9 the mean CV curve flattens, and deeper trees do not improve accuracy significantly.

The mean CV accuracy at depth 9 is around 73% (an R^2 of about 0.73, since .score on a regressor returns R^2).
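
Reading the depth off the plot by eye can also be done programmatically. The sketch below picks the smallest depth whose mean CV score is within a small tolerance of the best one. The helper name `smallest_depth_within_tol`, the tolerance of 0.01, and the toy scores are all assumptions made for illustration; they are not the values from the run above.

```python
import numpy as np

def smallest_depth_within_tol(depths, cv_means, tol=0.01):
    """Return the smallest depth whose mean CV score is within `tol` of the best.

    Hypothetical helper: the tolerance is an arbitrary choice, not a standard value.
    """
    depths = np.asarray(depths)
    cv_means = np.asarray(cv_means, dtype=float)
    best = cv_means.max()
    good_enough = cv_means >= best - tol
    # depths are assumed to be sorted in increasing order
    return int(depths[good_enough][0])

# Made-up scores for illustration only (not results from the election data)
toy_depths = [1, 2, 3, 4, 5, 6]
toy_cv_means = [0.40, 0.55, 0.66, 0.70, 0.705, 0.703]

chosen = smallest_depth_within_tol(toy_depths, toy_cv_means, tol=0.01)
print(chosen)
```

In the notebook, the same call would take `depths` and `cvmeans` from the loop above in place of the toy lists.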

The random forest plot looks smoother and more stable because it averages the predictions of many trees, which reduces variance and overfitting compared to a single decision tree. The cross-validation line doesn't drop off as sharply as depth grows, since ensemble averaging helps the model generalize better. It also uses 5-fold instead of 10-fold cross-validation, so each validation fold is larger and the per-fold scores vary less.
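
The variance-reduction argument can be illustrated with a small simulation. This is a sketch under a strong assumption: each "tree" is modeled as the true value plus independent noise, whereas the trees in a real forest are correlated, so the actual reduction is smaller. All names and numbers here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 1.0   # quantity every "tree" is trying to predict
n_trials = 2000    # number of repeated experiments
n_trees = 100      # ensemble size, matching n_estimators=100 above

# A single noisy predictor per trial
single = true_value + rng.normal(0.0, 1.0, size=n_trials)

# An ensemble per trial: average of n_trees independent noisy predictors
ensemble = (true_value + rng.normal(0.0, 1.0, size=(n_trials, n_trees))).mean(axis=1)

var_single = single.var()
var_ensemble = ensemble.var()
print(f"single: {var_single:.3f}, ensemble: {var_ensemble:.4f}")
```

With independent noise the variance of the average shrinks roughly in proportion to 1/n_trees, which is the idealized version of why the random forest curve is more stable.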

### Section 3: Boosting / XGBoost  
5. Use the defaults for most parameters. Iterate through depths (1-20). Use only 5 folds for cross validation to save some compute resources. Plot the max depth on the x axis and the accuracy on the y axis for training and for the mean cross validation.
6. Based on the plot, what max depth would you recommend?
7. What is the accuracy (mean cv) at your chosen depth?
8. The cross validation looks different than random forest, why?

In [None]:
depths = list(range(1, 21))
train_scores = []
cvmeans = []
cvstds = []

for depth in depths:
    
    print(f'Training with depth: {depth}...')
    boosting = xgb.XGBRegressor(max_depth=depth,
                  random_state=42)

    # Perform training and 5-fold cross validation
    train_scores.append(boosting.fit(Xtrain, ytrain).score(Xtrain, ytrain))
    scores = cross_val_score(estimator=boosting, X=Xtrain, y=ytrain, cv=5)

    cvmeans.append(scores.mean())
    cvstds.append(scores.std())

cvmeans = np.array(cvmeans)
cvstds = np.array(cvstds)

# Plot the mean CV score with a band of +/- 2 standard deviations
plt.plot(depths, cvmeans, '*-', label="Mean CV")
plt.fill_between(depths, cvmeans - 2*cvstds, cvmeans + 2*cvstds, alpha=0.3)

# Plot the training score at each depth
# against the cross-validation score at the same depth
plt.plot(depths, train_scores, '-+', label="Train")
plt.legend()
plt.ylabel("Accuracy (R^2 from .score)")
plt.xlabel("Max depth")
plt.xticks(depths);

I'd recommend a max depth of around 3 to 5, since that's where the mean cross-validation score levels off and the model stops improving on unseen data. Beyond that point, the training accuracy keeps climbing toward 1.0 while the CV score does not, which suggests overfitting.

In that depth range, the mean CV accuracy is around 76%.


The cross-validation curve looks different because XGBoost learns in steps, adding one tree at a time to fix the mistakes of the trees before it. That makes it improve fast at shallow depths but also start overfitting sooner as depth grows. Random forests average many independently grown trees, so their results come out smoother and more balanced.
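
The "one tree at a time" idea can be sketched by hand with plain gradient boosting on squared error: each new shallow tree is fit to the residuals of the current prediction. This is a simplified illustration and not XGBoost's actual algorithm, which adds regularization and other refinements. The toy data, `learning_rate`, and `n_rounds` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy 1-D regression problem (made up for illustration)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.3   # shrinkage applied to each tree's contribution
n_rounds = 50         # number of trees added sequentially

# Start from a constant prediction, then repeatedly fit a stump to the residuals
prediction = np.full_like(y_toy, y_toy.mean())
train_mse = [np.mean((y_toy - prediction) ** 2)]

for _ in range(n_rounds):
    residual = y_toy - prediction                      # what is still unexplained
    stump = DecisionTreeRegressor(max_depth=1, random_state=0)
    stump.fit(X_toy, residual)                         # new tree targets the mistakes
    prediction += learning_rate * stump.predict(X_toy)
    train_mse.append(np.mean((y_toy - prediction) ** 2))

print(f"training MSE: start {train_mse[0]:.4f}, end {train_mse[-1]:.4f}")
```

The training error can only go down from round to round here, which mirrors the training curve climbing toward 1.0 in the plot; whether the CV score follows depends on how much of each step is signal rather than noise.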