Question 1

1. Simple Linear Regression (SLR) and Multiple Linear Regression (MLR) differ in the number of predictor (independent) variables used to explain the outcome (dependent) variable. SLR has exactly one predictor, so it models the linear association between that single predictor and the outcome. This is helpful for understanding how one variable relates to the outcome, but it ignores other factors that may also have an effect. MLR, in contrast, incorporates two or more predictors. Additional predictors can increase the model's explanatory power by capturing more of the factors that affect the outcome. Because several predictors are fitted at once, MLR estimates each variable's distinct contribution while "holding constant" the other predictors. As a result, MLR can give a more nuanced picture of the relationships between variables, which makes it particularly useful when several factors influence the outcome.

2. Using a continuous variable in Simple Linear Regression, as opposed to an indicator (binary) variable, changes how the link between the predictor and the outcome is interpreted. A continuous variable can take any value in a range, such as age or height, and the model assumes the same change in the outcome for every unit of the predictor. Its coefficient is the expected change in the outcome for each one-unit increase in the predictor. An indicator variable, on the other hand, denotes two different categories (for example, 0 and 1 for male and female). One category acts as the reference or baseline, the intercept is the average outcome in that baseline category, and the coefficient is the difference in average outcome between the two categories.

3. In a Multiple Linear Regression (MLR) model, adding a single indicator variable alongside a continuous variable lets the model account for a continuous effect and a categorical difference at the same time. In Simple Linear Regression (SLR) with only a continuous variable, the model posits a single straight-line association between the predictor and the outcome across all data points, so any systematic variation in the outcome is attributed to changes in the continuous variable. In the MLR model, the indicator variable divides the data into two groups, each with its own intercept but sharing the same slope, so the fitted lines are parallel and the indicator's coefficient is the vertical offset between them.

4. Including an interaction term between the continuous variable and the indicator variable in a Multiple Linear Regression model allows the effect of the continuous variable on the outcome to vary across the categories defined by the indicator. Instead of the parallel lines produced without the interaction, the model fits two lines with different slopes, one per group. The change in the outcome associated with the continuous predictor therefore depends on the category, which lets the model capture more intricate, category-specific relationships in the data.

5. A Multiple Linear Regression model based exclusively on indicator variables derived from a non-binary categorical variable models differences in the outcome across the categories, rather than a trend along a continuous predictor. To avoid redundancy and guarantee a full-rank design matrix, the categorical variable is represented by a set of binary indicator (dummy) variables, one fewer than the total number of categories. The omitted category serves as the baseline, and each indicator's coefficient is the contrast (difference in average outcome) between its category and that baseline. The model therefore captures discrete shifts between categories as opposed to continuous changes, with each level of the categorical predictor having its own effect on the outcome.
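
The five specifications described above can be written out as linear forms. This is a sketch in generic notation that is not taken from any particular dataset: $Y$ is the outcome, $x$ a continuous predictor, $1_{[B]}$ an indicator that equals 1 for the non-baseline group, and $k$ the number of categories; all symbols are illustrative assumptions.

```latex
\begin{align*}
\text{(1) SLR, continuous predictor:} \quad & Y = \beta_0 + \beta_1 x + \epsilon \\
\text{(2) SLR, indicator predictor:} \quad & Y = \beta_0 + \beta_1 1_{[B]} + \epsilon \\
\text{(3) MLR, additive:} \quad & Y = \beta_0 + \beta_1 x + \beta_2 1_{[B]} + \epsilon \\
\text{(4) MLR, with interaction:} \quad & Y = \beta_0 + \beta_1 x + \beta_2 1_{[B]} + \beta_3 \, x \, 1_{[B]} + \epsilon \\
\text{(5) MLR, } k \text{ categories:} \quad & Y = \beta_0 + \sum_{j=2}^{k} \beta_j 1_{[\text{category}=j]} + \epsilon
\end{align*}
```

In (3) the two groups have intercepts $\beta_0$ and $\beta_0 + \beta_2$ with the shared slope $\beta_1$; in (4) the slopes are $\beta_1$ and $\beta_1 + \beta_3$; in (5) $\beta_0$ is the baseline category's average and each $\beta_j$ is the difference from that baseline.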

Question 2

Identifying the specific outcome and predictor variables is the crucial first step in this situation. The outcome variable is the one we wish to predict or explain, for example a health score, an income level, or another relevant metric. The predictor variables are the characteristics we think affect this outcome, and they may be continuous (like age and income) or categorical (like gender and region).

We should examine whether the impact of one predictor variable is dependent on the level of another in order to ascertain whether interactions may be significant. An interaction term between industry type and education level, for instance, may be significant if we are forecasting income based on these factors because the impact of education on income may differ depending on the industry. By adding interaction terms, the model is better able to capture these dependencies and depict intricate interactions.
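
The income example above can be sketched numerically. The block below is only an illustration on synthetic data: the variable names (`education`, `industry_b`, `income`) and every coefficient are assumptions made up for the demonstration, and plain NumPy least squares stands in for a regression library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical synthetic data: years of education and a binary industry flag
education = rng.uniform(8, 20, size=n)
industry_b = rng.integers(0, 2, size=n)  # 0 = baseline industry, 1 = industry B

# Assumed "true" relationship: education pays off more in industry B
income = 20 + 1.5 * education + 5 * industry_b + 2.0 * education * industry_b
income = income + rng.normal(0, 1, size=n)

# Design matrices: additive (no interaction) vs. with an interaction column
ones = np.ones(n)
X_additive = np.column_stack([ones, education, industry_b])
X_interact = np.column_stack([ones, education, industry_b, education * industry_b])

beta_add, *_ = np.linalg.lstsq(X_additive, income, rcond=None)
beta_int, *_ = np.linalg.lstsq(X_interact, income, rcond=None)


def r_squared(X, beta, y):
    """Proportion of variation in y explained by the fitted values X @ beta."""
    resid = y - X @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)


r2_additive = r_squared(X_additive, beta_add, income)
r2_interact = r_squared(X_interact, beta_int, income)

print("estimated interaction coefficient:", beta_int[3])
print("R-squared, additive vs interaction:", r2_additive, r2_interact)
```

The last coefficient of the interaction fit estimates how much the education slope differs between the two industries, which is the dependency the additive model cannot represent.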

Question 3

In [4]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import plotly.express as px

url = "https://raw.githubusercontent.com/KeithGalli/pandas/master/pokemon_data.csv"
pokeaman = pd.read_csv(url).fillna('None')

pokeaman['str8fyre'] = (pokeaman['Type 1'] == 'Fire').astype(int)

linear_model_specification_formula = 'str8fyre ~ Attack * Legendary + Defense * I(Q("Type 2")=="None") + C(Generation)'

log_reg_fit = smf.logit(linear_model_specification_formula, data=pokeaman).fit()

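# Predicted probabilities from the interaction model fitted above
pokeaman['pred_prob'] = log_reg_fit.predict(pokeaman)

# Additive comparison model: same predictors with the interaction terms dropped.
# disp=0 suppresses the convergence message for this second fit.
additive_formula = 'str8fyre ~ Attack + Legendary + Defense + I(Q("Type 2")=="None") + C(Generation)'
additive_fit = smf.logit(additive_formula, data=pokeaman).fit(disp=0)
pokeaman['pred_prob_additive'] = additive_fit.predict(pokeaman)

plot_labels = {'Attack': 'Attack', 'str8fyre': 'Fire Type (0=No, 1=Yes)', 'Legendary': 'Legendary'}

# Predictions also depend on Defense, Type 2, and Generation, so they do not
# fall on a single curve in Attack; they are drawn as markers, not a joined line.
fig_additive = px.scatter(pokeaman, x='Attack', y='str8fyre', color='Legendary',
                          title='Additive Model: Attack + Legendary vs Fire Type',
                          labels=plot_labels)
fig_additive.add_scatter(x=pokeaman['Attack'], y=pokeaman['pred_prob_additive'], mode='markers', name='Predicted probability (additive)')

fig_interaction = px.scatter(pokeaman, x='Attack', y='str8fyre', color='Legendary',
                             title='Interaction Model: Attack * Legendary vs Fire Type',
                             labels=plot_labels)
fig_interaction.add_scatter(x=pokeaman['Attack'], y=pokeaman['pred_prob'], mode='markers', name='Predicted probability (interaction)')

fig_additive.show()
fig_interaction.show()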

Optimization terminated successfully.
         Current function value: 0.228109
         Iterations 8


Question 4

The apparent contradiction between significant coefficient estimates with low p-values (strong evidence against the null hypothesis) and a low R-squared (only 17.6% of the variability in the outcome is explained by the model) is resolved by recognizing that the two measure different model characteristics. R-squared quantifies how well the model as a whole explains the variability in the outcome, so a low value indicates that many of the factors influencing the outcome are not captured by the model. A small p-value for a coefficient indicates strong evidence of an association between that predictor and the outcome once the other predictors are taken into account. Certain predictors can therefore have statistically significant associations with the outcome even if the model does not account for a large portion of the total variation. In short, p-values and R-squared have distinct uses: p-values evaluate the strength of evidence for each individual predictor's effect (not the size of that effect), whereas R-squared evaluates the model's total explanatory power.
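
A small simulation can make this distinction concrete. The sketch below uses made-up synthetic data (a weak slope of 0.3 with unit-variance noise, both assumptions chosen for illustration) and computes R-squared and the slope's t statistic by hand with NumPy rather than through a regression library.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Hypothetical data: a real but weak effect buried in a lot of noise
x = rng.normal(0, 1, size=n)
y = 0.3 * x + rng.normal(0, 1, size=n)

# Ordinary least squares with an intercept
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# R-squared: share of outcome variability explained by the model
r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

# t statistic for the slope: estimate divided by its standard error
sigma2 = resid @ resid / (n - 2)
cov_beta = sigma2 * np.linalg.inv(X.T @ X)
t_slope = beta[1] / np.sqrt(cov_beta[1, 1])

print("R-squared:", r2)
print("t statistic for the slope:", t_slope)
```

With many observations the slope is estimated precisely, so the t statistic is large (and the p-value small) even though most of the variation in `y` is noise that the model cannot explain.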

Question 5

1. Data Preparation and Splitting: The code starts by utilizing a 50/50 split to divide the Pokémon dataset into two halves, pokeaman_train and pokeaman_test. The test set is set aside for evaluating predicted performance on unseen data, reflecting how well the model might perform on future or out-of-sample data, while the training set is used to fit the model. This division is essential for determining generalizability.

2. Model 3 Fitting and Evaluation: Model 3 is a more straightforward model that only includes Attack and Defense as HP predictors. An "in-sample" R-squared value, which shows the percentage of HP variation that this model explains within the training data, is reported after the model has been fitted to the training set. The "out-of-sample" R-squared value can then be computed since the model predicts HP values on the test data (yhat_model3). By comparing these R-squared values, one may determine whether this more straightforward model exhibits overfitting or good generalization.

3. Model 4 Fitting and Evaluation: To capture possibly synergistic effects between variables, the code then builds a more complicated model, model4, which includes many predictors (Attack, Defense, Speed, Legendary, Sp. Def, and Sp. Atk) as well as several interaction terms. This extended model is also fitted to the training data, and its "in-sample" and "out-of-sample" R-squared values are calculated. Because of its complexity, this model should have a higher in-sample R-squared. However, the out-of-sample R-squared reveals whether the improvement in in-sample performance actually represents better generalizability or whether the model is overfit to the training set.

4. A crucial point is brought to light by the comparison of the more complex and simpler models: while complexity generally raises in-sample R-squared, it can also raise the risk of overfitting, as demonstrated by a notable decline in out-of-sample R-squared. This procedure highlights the necessity of striking a balance between generalizability and model complexity, as well as the significance of assessing model resilience using distinct training and test datasets.
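
The comparison in the steps above can be imitated on synthetic data. The following is a sketch under stated assumptions, not the assignment's code: the outcome depends on two predictors only, the "complex" model adds 40 irrelevant columns (an arbitrary choice), and out-of-sample R-squared is taken as the squared correlation between observed and predicted values.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Hypothetical data: the outcome depends on two predictors only
signal = rng.normal(size=(n, 2))
noise_cols = rng.normal(size=(n, 40))  # irrelevant extra predictors
y = 1.0 + signal @ np.array([2.0, -1.0]) + rng.normal(0, 2, size=n)

X_simple = np.column_stack([np.ones(n), signal])
X_complex = np.column_stack([X_simple, noise_cols])

# 50/50 split into training and test halves
idx = rng.permutation(n)
train, test = idx[: n // 2], idx[n // 2:]


def in_and_out_r2(X):
    """Fit on the training rows, then score on training and test rows."""
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    r2_in = np.corrcoef(y[train], X[train] @ beta)[0, 1] ** 2
    r2_out = np.corrcoef(y[test], X[test] @ beta)[0, 1] ** 2
    return r2_in, r2_out


simple_in, simple_out = in_and_out_r2(X_simple)
complex_in, complex_out = in_and_out_r2(X_complex)

print("simple : in-sample", simple_in, "out-of-sample", simple_out)
print("complex: in-sample", complex_in, "out-of-sample", complex_out)
```

Because the simple model is nested in the complex one, the complex model's in-sample R-squared can never be lower; whether its out-of-sample R-squared holds up is exactly what the test half is for.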

Question 6

The code's model4_linear_form expands the model's capacity to capture intricate interactions by generating a "design matrix" with many predictors and interaction terms between them. In particular, it incorporates all pairwise and higher-order interactions between the continuous variables (Attack, Defense, Speed, Sp. Def, Sp. Atk) and a binary variable (Legendary). Each resulting variable or interaction term becomes a new column in the high-dimensional design matrix (model4_spec.exog). Including these interaction terms produces multicollinearity, meaning strongly correlated predictor columns. Strong correlations between columns of the design matrix impair the model's capacity to separate the independent contribution of each predictor, which makes the estimated regression coefficients unstable. Because of this instability the model may fit random noise in the training data, so it shows a high in-sample R-squared but a low out-of-sample R-squared when tested on new data. This impairs generalizability.
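
To get a feel for how quickly the design matrix grows, the columns can be enumerated directly. This sketch assumes the specification is a full `*` expansion of the six variables named above (the exact formula is not shown here, so that is an assumption) and that the binary Legendary variable contributes a single column.

```python
from itertools import combinations

# Assumed variable list: five continuous predictors plus one binary indicator
variables = ["Attack", "Defense", "Speed", "Legendary", "Sp. Def", "Sp. Atk"]

# A full expansion yields one column per non-empty subset of the variables
# (main effects, pairwise products, three-way products, ...) plus an intercept
terms = ["Intercept"]
for size in range(1, len(variables) + 1):
    for subset in combinations(variables, size):
        terms.append(":".join(subset))

print(len(terms))  # 2**6 = 64 columns
print(terms[:8])   # the intercept, six main effects, first pairwise product
```

Every one of these columns is a product of the same few underlying variables, which is why the columns end up strongly correlated with one another.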

Question 7

A methodical approach to increasing complexity with each successive model is demonstrated by the journey from `model3_fit` to `model7_fit`. Starting with `model5_linear_form`, a baseline model comprising certain categorical variables like `Generation` and `Type` and primary predictors like `Attack` and `Defense` is formed using simpler predictor combinations. In order to improve predictive accuracy while preserving interpretability, the model moves on to `model6_linear_form`, where more indicator variables are added for particular categories and predictors with less evidence are excluded.

By including interaction terms across several variables, including `Attack`, `Speed`, and `Special Defense`, `model7_linear_form` expands on `model6_linear_form`. These additions improve predictive capability, but they also add complexity and increase multicollinearity. Centering and scaling the continuous predictors addresses this by lowering the extreme condition number, which helps preserve model stability.

By gradually increasing predictive complexity and subsequently controlling multicollinearity through modifications to the model specification, the model evolution aims to strike a balance between interpretability and model performance.
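
The effect of centering and scaling on the condition number can be checked on a toy design matrix. The sketch below is an illustration with synthetic, stat-like predictors (the means and spreads are assumptions) and a single interaction column; it is not the `model7` specification itself.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400

# Hypothetical stat-like predictors with means far from zero
attack = rng.normal(80, 30, size=n)
speed = rng.normal(70, 25, size=n)


def design(a, b):
    """Intercept, two main effects, and their interaction."""
    return np.column_stack([np.ones(len(a)), a, b, a * b])


def standardize(v):
    """Center to mean 0 and scale to standard deviation 1."""
    return (v - v.mean()) / v.std()


cond_raw = np.linalg.cond(design(attack, speed))
cond_scaled = np.linalg.cond(design(standardize(attack), standardize(speed)))

print("condition number, raw predictors:         ", cond_raw)
print("condition number, centered and scaled:    ", cond_scaled)
```

On the raw scale the interaction column is thousands of times larger than the intercept column and closely tracks the main effects, which inflates the condition number; after standardizing, the columns are on comparable scales and far less correlated.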

Question 8

In [5]:
import numpy as np
import pandas as pd
import plotly.express as px
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split

reps = 100
in_sample_Rsquared = np.zeros(reps)
out_of_sample_Rsquared = np.zeros(reps)

linear_form = 'HP ~ Attack + Defense'

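for i in range(reps):
    # New random 50/50 split on every repetition (no fixed seed)
    pokeaman_train, pokeaman_test = train_test_split(pokeaman, train_size=0.5)

    # Fit the same specification on the training half only
    model_spec = smf.ols(formula=linear_form, data=pokeaman_train)
    model_fit = model_spec.fit()

    in_sample_Rsquared[i] = model_fit.rsquared

    # Out-of-sample R-squared: squared correlation between observed and predicted HP
    yhat = model_fit.predict(pokeaman_test)
    y = pokeaman_test['HP']
    out_of_sample_Rsquared[i] = np.corrcoef(y, yhat)[0, 1] ** 2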

df = pd.DataFrame({
    "In Sample R-squared": in_sample_Rsquared,
    "Out of Sample R-squared": out_of_sample_Rsquared
})

fig = px.scatter(df, x="In Sample R-squared", y="Out of Sample R-squared",
                 labels={"In Sample R-squared": "In-Sample R²", "Out of Sample R-squared": "Out-of-Sample R²"},
                 title="In-Sample vs Out-of-Sample R² Values over Different Splits")
fig.update_layout(shapes=[dict(type='line', x0=0, x1=1, y0=0, y1=1, line=dict(dash='dash'))])
fig.show()


This code performs 100 random 50/50 train-test splits, fits the same basic linear regression model (`HP ~ Attack + Defense`) on each training half, and records the R-squared on the training set ("in-sample") and on the test set ("out-of-sample", computed as the squared correlation between observed and predicted HP). Plotting the paired values against the dashed y = x reference line shows how both R-squared values vary from split to split, exposing trends in the generalizability of the model.

Question 9

Particularly in the context of linear regression, this topic highlights the need to strike a compromise between interpretability and model complexity. Two models are compared in the analysis: `model6_fit`, a simpler model, and `model7_fit`, a more complicated model with more interaction terms. Even if `model7_fit` performs better out-of-sample, some of its estimated coefficients have less evidence, and its complexity makes it more difficult to interpret. Thus, the more straightforward model (`model6_fit`) is easier to understand and might be more reliably generalizable.

The findings also show how generalizability can suffer when sophisticated models overfit to particular patterns in the training data that do not hold in the testing data. The extra code evaluates the generalizability of each model by using data from earlier generations in the Pokémon data to predict values for later generations. It illustrates the danger of overfitting in complex models by showing that model7_fit's performance is more adversely affected when switching between generations. Because simpler models are frequently more dependable for real-world applications where data arrives sequentially, this emphasizes the value of choosing the simpler model wherever feasible, particularly if predictive accuracy across datasets is close.

https://chatgpt.com/share/6736b239-db94-8010-978c-515995d73997