In [None]:
# 4
# The apparent contradiction between a low R^2 (17.6% of variance explained) and large, statistically significant
# coefficients disappears once we recognize that R^2 and p-values measure different things. R^2 measures how much of the
# variability in the outcome (HP) the model explains; a low value means most of that variability comes from sources outside
# the model, such as unmeasured variables or noise. A p-value measures the strength of evidence against a coefficient being
# zero. Large coefficients with small p-values therefore indicate strong evidence that the included variables (e.g.,
# Sp. Def, Generation, and their interaction) are associated with HP on average, even though they do not account for most
# of the variation. In essence, the predictors matter, but HP is largely driven by factors not captured in the model.
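#
# A minimal simulated sketch of this point, not the Pokemon model itself: the data, seed, sample size, and effect size
# below are all made-up assumptions. A real but modest slope buried in a lot of noise gives a small R^2 together with a
# large t-statistic (and hence a small p-value).

```python
import numpy as np

# Synthetic data (assumption): true slope 0.5, noise standard deviation 2.0
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=2.0, size=n)

# Ordinary least squares with an intercept
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# R^2: share of the variance in y explained by the fitted line
r_squared = 1 - resid.var() / y.var()

# t-statistic for the slope: estimate divided by its standard error
sigma2 = resid @ resid / (n - X.shape[1])
se_slope = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
t_slope = beta[1] / se_slope

print(f"R^2 = {r_squared:.3f}, slope = {beta[1]:.3f}, t = {t_slope:.1f}")
```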

In [None]:
# 6 
# The design matrix for model4, based on a formula with multiple main effects and interaction terms, leads to a large number
# of predictor variables, some of which are complex combinations of scaled and centered variables. This structure introduces
# multicollinearity: many predictors, especially the interaction terms, are highly correlated with one another, which makes
# the coefficient estimates unstable. As a result, small changes in the data can cause large fluctuations in the estimates,
# and the model becomes prone to overfitting. It ends up relying on patterns specific to the training data, which hurts its
# ability to generalize to new, unseen data and reduces its out-of-sample predictive performance.
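#
# The instability described above can be illustrated with a small simulation. This is a sketch on synthetic data, not
# model4: the helper `slope_spread` and all variable names are hypothetical. It refits the same regression on random
# half-samples and compares how much one coefficient moves when the second predictor is independent of the first versus
# nearly collinear with it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400

def slope_spread(x1, x2, y, reps=200):
    """Std. dev. of the x1 coefficient across fits on random half-samples."""
    X = np.column_stack([np.ones(len(y)), x1, x2])
    coefs = np.empty(reps)
    for i in range(reps):
        idx = rng.choice(len(y), size=len(y) // 2, replace=False)
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        coefs[i] = beta[1]
    return coefs.std()

# Synthetic predictors (assumption): one independent pair, one nearly collinear pair
x1 = rng.normal(size=n)
noise = rng.normal(size=n)
x2_indep = rng.normal(size=n)
x2_collinear = x1 + 0.05 * rng.normal(size=n)

spread_indep = slope_spread(x1, x2_indep, x1 + x2_indep + noise)
spread_collinear = slope_spread(x1, x2_collinear, x1 + x2_collinear + noise)

print(f"coefficient spread, independent predictors: {spread_indep:.3f}")
print(f"coefficient spread, collinear predictors:   {spread_collinear:.3f}")
```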

In [None]:
# 7
# Models 5, 6, and 7 represent a series of refinements from models 3 and 4, aimed at improving predictive performance and 
# stability. Model 5 streamlines the approach by focusing on main predictors and categorical variables, reducing unnecessary
# interactions to enhance both in-sample and out-of-sample R^2. Model 6 further refines this by removing less significant
# predictors and reducing multicollinearity, resulting in a more stable design matrix. Model 7 reintroduces interaction 
# terms among key numeric predictors to capture non-additive effects while retaining significant categorical variables.
# By centering and scaling continuous predictors, Model 7 also reduces the condition number, mitigating multicollinearity 
# and improving model stability. Overall, each model incrementally optimizes predictor selection and interactions, 
# balancing complexity with predictive accuracy and generalization.
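#
# To see why centering and scaling lowers the condition number, here is a rough sketch on made-up data. The two
# predictors, their means and spreads, and the helpers `design` and `standardize` are assumptions for illustration only.
# `np.linalg.cond` is used as a stand-in for the condition number reported in a regression summary.

```python
import numpy as np

# Two synthetic positive-valued predictors (assumption), loosely on the scale of game stats
rng = np.random.default_rng(2)
n = 500
a = rng.normal(loc=70, scale=20, size=n)
b = rng.normal(loc=70, scale=20, size=n)

def design(u, v):
    """Design matrix with an intercept, two main effects, and their interaction."""
    return np.column_stack([np.ones(len(u)), u, v, u * v])

def standardize(u):
    """Center to mean 0 and scale to standard deviation 1."""
    return (u - u.mean()) / u.std()

cond_raw = np.linalg.cond(design(a, b))
cond_scaled = np.linalg.cond(design(standardize(a), standardize(b)))

print(f"condition number, raw predictors:          {cond_raw:.1f}")
print(f"condition number, standardized predictors: {cond_scaled:.1f}")
```

# With raw predictors the interaction column is on a much larger scale than, and strongly correlated with, the main
# effects; after standardizing, the columns are on a common scale and far less correlated.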

In [None]:
# 8
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split
import statsmodels.formula.api as smf

# Define model formula for model3 as a starting point
linear_form = 'HP ~ Attack + Defense + Speed + Legendary + Q("Sp. Def") + Q("Sp. Atk")'

# Number of repetitions
reps = 100
in_sample_Rsquared = np.zeros(reps)
out_of_sample_Rsquared = np.zeros(reps)

# Run the loop to gather performance metrics
for i in range(reps):
    # Perform a new 50-50 train-test split for each iteration
    # (`pokeaman` is assumed to be the DataFrame loaded in an earlier cell)
    pokeaman_training_data, pokeaman_testing_data = train_test_split(pokeaman, test_size=0.5)
    
    # Fit the model on the training data
    final_model_fit = smf.ols(formula=linear_form, data=pokeaman_training_data).fit()
    
    # Record the in-sample R-squared
    in_sample_Rsquared[i] = final_model_fit.rsquared
    
    # Calculate and record the out-of-sample R-squared, measured here as the
    # squared correlation between observed and predicted HP on the test half
    out_of_sample_Rsquared[i] = np.corrcoef(
        pokeaman_testing_data.HP, 
        final_model_fit.predict(pokeaman_testing_data)
    )[0, 1] ** 2

# Create a DataFrame to store in-sample and out-of-sample R-squared values
df = pd.DataFrame({
    "In Sample Performance (R-squared)": in_sample_Rsquared,
    "Out of Sample Performance (R-squared)": out_of_sample_Rsquared
})

# Visualize the results
fig = px.scatter(df, x="In Sample Performance (R-squared)", y="Out of Sample Performance (R-squared)",
                 title="In-Sample vs Out-of-Sample Performance of Model3",
                 labels={"In Sample Performance (R-squared)": "In-Sample R-squared",
                         "Out of Sample Performance (R-squared)": "Out-of-Sample R-squared"})
fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], mode="lines", name="y=x", line=dict(color="red", dash="dash")))

fig.show()


In [None]:
# 9
# The comparison of models 6 and 7, trained on different subsets of Pokémon generations, highlights how well these models 
# generalize when applied to newer generations not included in the training data. Both in-sample and out-of-sample R^2 
# values are calculated to assess performance. When trained on Generation 1 data only, both models show good in-sample fit 
# but poor out-of-sample performance, indicating they struggle to predict HP for later generations. Training on Generations 
# 1-5 improves generalization, but out-of-sample R^2 for Generation 6 still shows a decline. This underscores the challenge 
# of extrapolating to future data and emphasizes the importance of training models on a diverse range of generations to 
# improve their ability to generalize across different data subsets.
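#
# The train-on-one-subset, score-on-another idea can be sketched with synthetic groups. This is a deliberately exaggerated
# illustration, not the Pokemon analysis: the two groups, their coefficients, and the helpers `make_group` and
# `squared_corr` are hypothetical. The outcome depends on different predictors in the "old" and "new" groups, and the
# score is the same squared-correlation measure used in the cell for question 8.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_group(n, coefs):
    """Synthetic group: intercept plus two predictors, with group-specific coefficients."""
    X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
    y = X @ np.array(coefs) + rng.normal(scale=0.5, size=n)
    return X, y

def squared_corr(y, y_hat):
    """Squared correlation between observed and predicted values."""
    return np.corrcoef(y, y_hat)[0, 1] ** 2

# Assumption: the outcome follows the first predictor in the old group, the second in the new group
X_old, y_old = make_group(300, [1.0, 1.0, 0.0])
X_new, y_new = make_group(300, [1.0, 0.0, 1.0])

# Fit on the old group only, then score on both groups
beta, *_ = np.linalg.lstsq(X_old, y_old, rcond=None)
in_sample = squared_corr(y_old, X_old @ beta)
out_of_sample = squared_corr(y_new, X_new @ beta)

print(f"in-sample R^2 (old group):     {in_sample:.3f}")
print(f"out-of-sample R^2 (new group): {out_of_sample:.3f}")
```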

# https://chatgpt.com/share/6736d392-fc54-8012-b3d5-0584b5e94b12