# STA130 Homework 07

This is my submission for Homework 07. Each answer is based on my own understanding, refined with ChatGPT to improve clarity and accuracy.

### Question 1: Differences Between Types of Regression and Model Behavior with Variables

**1. Simple Linear Regression vs. Multiple Linear Regression**

In Simple Linear Regression, we predict the outcome variable using one predictor variable. The equation is of the form:

$$ outcome = \beta_0 + \beta_1 \cdot predictor + \epsilon $$

In contrast, Multiple Linear Regression includes two or more predictors, which enhances the model's capacity to capture more complex relationships. The equation for Multiple Linear Regression is:

$$ outcome = \beta_0 + \beta_1 \cdot predictor_1 + \beta_2 \cdot predictor_2 + ... + \epsilon $$

**2. Continuous vs. Indicator Variables in Simple Linear Regression**

A continuous variable can take any value within a range, whereas an indicator variable represents membership in a category as a binary value (0 or 1). For example, we might have an indicator variable $1(\text{Male})$ to represent gender.

In simple linear regression, the model with a continuous predictor is:

$$ outcome = \beta_0 + \beta_1 \cdot continuous\ predictor + \epsilon $$

For an indicator variable:

$$ outcome = \beta_0 + \beta_1 \cdot 1(indicator) + \epsilon $$
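With an indicator as the only predictor, the fitted intercept is the mean of the baseline group and the fitted slope is the difference between the two group means. The following is a minimal sketch of that fact using NumPy least squares; the data values and the name `is_male` are made up for illustration.

```python
import numpy as np

# Hypothetical toy data: three observations per group.
outcome = np.array([10.0, 12.0, 11.0, 15.0, 17.0, 16.0])
is_male = np.array([0, 0, 0, 1, 1, 1])  # indicator 1(Male)

# Design matrix: a column of ones (intercept) and the indicator column.
X = np.column_stack([np.ones_like(outcome), is_male])
beta0, beta1 = np.linalg.lstsq(X, outcome, rcond=None)[0]

baseline_mean = outcome[is_male == 0].mean()
group_mean = outcome[is_male == 1].mean()

print(beta0, baseline_mean)               # intercept vs. baseline group mean
print(beta1, group_mean - baseline_mean)  # slope vs. difference in group means
```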


**3. Changes in Model Behavior with Indicator and Continuous Variables**

When a continuous variable and an indicator variable are both included in a Multiple Linear Regression model, the predicted outcome shifts by a fixed amount depending on the category marked by the indicator.

$$ outcome = \beta_0 + \beta_1 \cdot continuous + \beta_2 \cdot 1(indicator) + \epsilon $$

This lets the model fit a separate intercept for each category while sharing a single slope for the continuous predictor.
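Substituting the two possible values of the indicator into the model above shows what $\beta_2$ does (a short worked step, with the error term omitted):

$$ 1(indicator) = 0: \quad outcome = \beta_0 + \beta_1 \cdot continuous $$

$$ 1(indicator) = 1: \quad outcome = (\beta_0 + \beta_2) + \beta_1 \cdot continuous $$

The two lines are parallel: they have the same slope $\beta_1$ and differ only by the vertical offset $\beta_2$.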


**4. Interaction Effects between Continuous and Indicator Variables**

Adding an interaction term between a continuous and indicator variable enables the model to capture how the effect of the continuous variable changes based on the category. The equation becomes:

$$ outcome = \beta_0 + \beta_1 \cdot continuous + \beta_2 \cdot 1(indicator) + \beta_3 \cdot continuous \cdot 1(indicator) + \epsilon $$

This provides flexibility, especially useful when the effect of one variable depends on another.
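As a rough illustration of the interaction model, the sketch below simulates data in which the slope is 1.0 for one category and 3.0 for the other, then recovers both slopes from the fitted coefficients: $\beta_1$ for the baseline category and $\beta_1 + \beta_3$ for the other. The simulated data, the seed, and all coefficient values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
continuous = rng.uniform(0, 10, n)
indicator = rng.integers(0, 2, n)  # 0 or 1

# Assumed true model: slope 1.0 when indicator = 0, slope 1.0 + 2.0 when indicator = 1.
outcome = (2.0 + 1.0 * continuous + 4.0 * indicator
           + 2.0 * continuous * indicator + rng.normal(0, 0.5, n))

# Columns: intercept, continuous, indicator, interaction.
X = np.column_stack([np.ones(n), continuous, indicator, continuous * indicator])
b0, b1, b2, b3 = np.linalg.lstsq(X, outcome, rcond=None)[0]

slope_group0 = b1       # slope when indicator = 0
slope_group1 = b1 + b3  # slope when indicator = 1
print(slope_group0, slope_group1)
```

Without the interaction column, the model would be forced to use one common slope for both categories.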


**5. Multiple Linear Regression with Only Indicator Variables**

When the only predictor is a categorical variable with $k$ categories, we need $k-1$ indicator variables to encode it. The category without an indicator becomes the baseline group, and each coefficient is an offset from that baseline.

$$ outcome = \beta_0 + \beta_1 \cdot 1(cat_1) + \beta_2 \cdot 1(cat_2) + ... + \epsilon $$

Here $\beta_0$ is the predicted outcome for the baseline group, and each $\beta_j$ is the difference between the predicted outcome for category $j$ and the baseline.
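The baseline-plus-offsets encoding can be checked on a small example. This sketch uses a hypothetical three-category variable with "A" as the baseline, so only two indicator columns are needed; the category labels and outcome values are made up.

```python
import numpy as np

# Hypothetical categorical variable with k = 3 categories.
category = np.array(["A", "A", "B", "B", "C", "C"])
outcome = np.array([4.0, 6.0, 9.0, 11.0, 1.0, 3.0])

# k - 1 = 2 indicator columns; "A" is the baseline and gets no column.
is_B = (category == "B").astype(float)
is_C = (category == "C").astype(float)
X = np.column_stack([np.ones(len(outcome)), is_B, is_C])

beta0, beta_B, beta_C = np.linalg.lstsq(X, outcome, rcond=None)[0]

print(beta0)   # mean of category A (baseline)
print(beta_B)  # mean of B minus mean of A
print(beta_C)  # mean of C minus mean of A
```

Adding a third indicator for "A" alongside the intercept would make the columns linearly dependent, which is why only $k-1$ indicators are used.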


### Question 2: Predicting Advertising Effectiveness with Interaction Effects

**Identifying Variables**

For this scenario, the outcome variable might be `Sales` or `Effectiveness`. The predictors are `TV_ad_budget` and `online_ad_budget`.

We might consider a basic linear form as:

$$ Sales = \beta_0 + \beta_1 \cdot TV\_ad\_budget + \beta_2 \cdot online\_ad\_budget + \epsilon $$

With interaction:

$$ Sales = \beta_0 + \beta_1 \cdot TV\_ad\_budget + \beta_2 \cdot online\_ad\_budget + \beta_3 \cdot (TV\_ad\_budget \times online\_ad\_budget) + \epsilon $$

This interaction term lets us examine if the effectiveness of one ad channel depends on the other.
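Grouping the terms that involve `TV_ad_budget` in the interaction model makes this dependence explicit (a worked step, with the error term omitted):

$$ Sales = \beta_0 + (\beta_1 + \beta_3 \cdot online\_ad\_budget) \cdot TV\_ad\_budget + \beta_2 \cdot online\_ad\_budget $$

So a one-unit increase in the TV budget changes predicted sales by $\beta_1 + \beta_3 \cdot online\_ad\_budget$, which is constant only when $\beta_3 = 0$.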


```python
# Example with high and low budgets (using binary encoding)
import pandas as pd
import statsmodels.formula.api as smf

# Sample data setup
ads_data = pd.DataFrame({
    'TV_ad_budget': ['high', 'low', 'high', 'low'],
    'online_ad_budget': ['high', 'high', 'low', 'low'],
    'sales': [300, 200, 150, 100]
})

# Encoding
ads_data['TV_high'] = (ads_data['TV_ad_budget'] == 'high').astype(int)
ads_data['Online_high'] = (ads_data['online_ad_budget'] == 'high').astype(int)

# Fit model with interaction
model = smf.ols('sales ~ TV_high * Online_high', data=ads_data).fit()
model.summary()
```

### Question 3: Logistic Regression and Model Building with CSCS Data

**Logistic Regression Setup**

In this question, I explored using logistic regression instead of multiple linear regression because we are predicting a binary outcome. Using categorical and continuous variables, we created a logistic model.

The following code demonstrates the setup of a logistic regression model:


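```python
# Logistic Regression example setup
import pandas as pd
import statsmodels.formula.api as smf

# Sample data for illustration
url = "https://raw.githubusercontent.com/KeithGalli/pandas/master/pokemon_data.csv"
df = pd.read_csv(url)
# Missing secondary types are read as NaN; label them "None" so the comparison below works
df['Type 2'] = df['Type 2'].fillna('None')
df['FireType'] = (df['Type 1'] == 'Fire').astype(int)

# Logistic regression example
# Q("Type 2") quotes a column name containing a space; I(...) evaluates the comparison as one term
model_spec = 'FireType ~ Attack * Legendary + Defense * I(Q("Type 2") == "None") + C(Generation)'
log_reg = smf.logit(model_spec, data=df).fit()
log_reg.summary()
```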

**Interpretation of Logistic Model Output**

Logistic regression coefficients are reported on the log-odds scale: a positive coefficient means the odds of the outcome increase as the predictor increases, holding the other predictors fixed. In this case, the coefficients describe how the odds of being a 'Fire' type relate to attack, defense, whether it is a Legendary Pokemon, whether it has a second type, and its generation.
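Because log-odds are hard to read directly, it can help to convert them to probabilities and odds ratios. The sketch below does this with the standard logistic function; the intercept and coefficient values are hypothetical and are not taken from the fitted model above.

```python
import math

def log_odds_to_probability(log_odds: float) -> float:
    """Inverse of the logit: maps a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical values for illustration only.
intercept = -2.0
coefficient = 0.5

p_before = log_odds_to_probability(intercept)               # predictor = 0
p_after = log_odds_to_probability(intercept + coefficient)  # predictor = 1

# exp(coefficient) is the multiplicative change in odds per one-unit increase.
odds_ratio = math.exp(coefficient)
print(p_before, p_after, odds_ratio)
```

The odds ratio is the same at every starting point, while the change in probability depends on where on the curve the prediction starts.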


### Question 4: Contradictions between Model Fit and Coefficients

It is possible for a model to have large coefficients and significant p-values while explaining only a small share of the variability in the outcome (low R-squared). The two measures answer different questions: R-squared measures the proportion of variance in the outcome explained by the model as a whole, while a p-value measures the strength of evidence against the null hypothesis that an individual coefficient is zero.

$$ R^2 = 1 - \frac{\sum_{i=1}^n(y_i - \hat{y}_i)^2}{\sum_{i=1}^n(y_i - \bar{y})^2} $$

Thus, significant p-values on coefficients do not guarantee a high R-squared and vice versa.
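A small simulation can make this concrete. The sketch below generates a large sample with a weak but real effect and a lot of noise, then computes R-squared and the slope's t-statistic from the usual OLS formulas. The sample size, effect size, and seed are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(130)
n = 10_000
x = rng.normal(size=n)
y = 0.2 * x + rng.normal(size=n)  # assumed weak true effect, noisy outcome

# Fit y = beta0 + beta1 * x by least squares.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Proportion of variance explained.
r_squared = 1 - np.sum(residuals**2) / np.sum((y - y.mean())**2)

# Standard error of the slope from the usual OLS covariance formula.
sigma2 = np.sum(residuals**2) / (n - 2)
cov_beta = sigma2 * np.linalg.inv(X.T @ X)
t_slope = beta[1] / np.sqrt(cov_beta[1, 1])

print(f"R-squared: {r_squared:.3f}, t-statistic for slope: {t_slope:.1f}")
```

With this much data the slope is estimated precisely, so the evidence against a zero slope is strong even though most of the variation in `y` remains unexplained.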

### Question 5: Model Performance with Train-Test Split

Here we use a 50-50 split to evaluate in-sample and out-of-sample R-squared to check generalizability.


```python
import numpy as np
from sklearn.model_selection import train_test_split

# Load and split data
np.random.seed(130)
df_train, df_test = train_test_split(df, test_size=0.5)

# Model fit and R-squared calculation
simple_model = smf.ols('HP ~ Attack + Defense', data=df_train).fit()
in_sample_r2 = simple_model.rsquared
out_sample_r2 = np.corrcoef(df_test['HP'], simple_model.predict(df_test))[0,1]**2

print(f"In-sample R-squared: {in_sample_r2}")
print(f"Out-of-sample R-squared: {out_sample_r2}")
```
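The out-of-sample value above is computed as a squared correlation between observed and predicted values. That agrees with the $1 - SS_{res}/SS_{tot}$ definition for in-sample OLS fits, but the two can differ on new data, because correlation ignores systematic bias in the predictions. The sketch below compares both on made-up numbers; the function names are hypothetical helpers, not part of any library.

```python
import numpy as np

def r2_squared_correlation(y_true, y_pred):
    """Squared Pearson correlation between observed and predicted values."""
    return np.corrcoef(y_true, y_pred)[0, 1] ** 2

def r2_one_minus_ratio(y_true, y_pred):
    """1 - SS_res / SS_tot; can be negative when predictions are poor."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Made-up example: predictions perfectly correlated with the truth but off by 2.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_biased = y_true + 2.0

print(r2_squared_correlation(y_true, y_biased))
print(r2_one_minus_ratio(y_true, y_biased))
```

The squared correlation rewards predictions that move with the outcome, while the ratio form also penalizes predictions that are consistently too high or too low.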

### Question 6: Multicollinearity and Condition Number in Design Matrix

A high condition number for the design matrix signals multicollinearity: some predictors are close to linear combinations of others. This makes coefficient estimates unstable and hard to interpret individually. Condition numbers above 30 may indicate potential issues.

Centering and scaling the predictors removes the part of the condition number that is inflated purely by differences in units and location, as demonstrated below:


```python
from patsy import center, scale

# Centering and Scaling for better Condition Number
scaled_model = smf.ols('HP ~ scale(center(Attack)) + scale(center(Defense))', data=df_train).fit()
# The last summary table holds the diagnostics, including the condition number
scaled_model.summary().tables[-1]
```
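The effect of centering and scaling can also be seen directly on a design matrix. The sketch below builds a matrix from simulated, uncentered predictors and compares its condition number (via `np.linalg.cond`, the ratio of largest to smallest singular value) before and after standardizing. The simulated `attack` and `defense` values are assumptions and are not the Pokemon data.

```python
import numpy as np

rng = np.random.default_rng(130)
n = 500

# Hypothetical predictors on a large, uncentered scale.
attack = rng.normal(80, 30, n)
defense = rng.normal(70, 30, n)

def standardize(v):
    """Center to mean 0 and scale to standard deviation 1."""
    return (v - v.mean()) / v.std()

X_raw = np.column_stack([np.ones(n), attack, defense])
X_scaled = np.column_stack([np.ones(n), standardize(attack), standardize(defense)])

cond_raw = np.linalg.cond(X_raw)
cond_scaled = np.linalg.cond(X_scaled)
print(cond_raw, cond_scaled)
```

In this simulation the predictors are generated independently, so the large raw condition number comes entirely from scale and location. If predictors were genuinely collinear, standardizing would not remove the problem.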

### Question 7: Model Development and Complexity Trade-offs

As we develop models by extending previous versions, we balance complexity against interpretability. Simpler models with reasonable fit, like `model6`, may generalize better than more complex models.


### Question 8: Variation in Model Performance Metrics

This question involves observing the variation in in-sample and out-of-sample R-squared over multiple train-test splits to examine model stability and risk of overfitting.

```python
import plotly.express as px
import numpy as np
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split

reps = 100
in_sample = []
out_sample = []

for _ in range(reps):
    train, test = train_test_split(df, test_size=0.5)
    model = smf.ols('HP ~ Attack + Defense', data=train).fit()
    in_sample.append(model.rsquared)
    out_sample.append(np.corrcoef(test['HP'], model.predict(test))[0,1]**2)

# Visualization
px.scatter(x=in_sample, y=out_sample, labels={'x':'In-Sample R^2', 'y':'Out-of-Sample R^2'})
```
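Alongside the scatter plot, a few summary numbers can describe how much the two metrics vary across splits. The helper below is a hypothetical sketch (the name `summarize_r2` is not from any library), and the example values stand in for the `in_sample` and `out_sample` lists built in the loop above.

```python
import numpy as np

def summarize_r2(in_sample, out_sample):
    """Summarize paired in-sample and out-of-sample R-squared values."""
    in_sample = np.asarray(in_sample, dtype=float)
    out_sample = np.asarray(out_sample, dtype=float)
    gap = in_sample - out_sample
    return {
        "in_sample_mean": in_sample.mean(),
        "out_sample_mean": out_sample.mean(),
        "out_sample_sd": out_sample.std(ddof=1),
        "mean_gap": gap.mean(),
        # Share of splits where the model did worse on the test half.
        "share_out_below_in": float(np.mean(out_sample < in_sample)),
    }

# Made-up values for illustration only.
example = summarize_r2([0.30, 0.25, 0.28], [0.22, 0.27, 0.20])
print(example)
```

A mean gap near zero with similar spread in both metrics suggests a stable model, while a large positive gap points toward overfitting.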



### Question 9: Interpreting Future Prediction Reliability

Using a model to predict outcomes for future data (e.g., new generations) is a direct test of generalizability. If a model fit on earlier data predicts new data poorly, it lacks generalizability, which is often a sign of overfitting.


### Chat Summary and Link

For this homework, I consulted with ChatGPT to help clarify complex questions and ensure accurate explanations. In particular, the chatbot provided detailed insights into differences between linear and logistic regression, the importance of model interpretability, and techniques to evaluate model performance and generalizability.

Key takeaways from this chat include:

- **Simple vs. Multiple Linear Regression:** Clear understanding of model expansion and added predictive power with additional predictors.
- **Logistic Regression and Interaction Effects:** The benefits of logistic regression for binary outcomes and modeling interactions.
- **Overfitting and Generalizability:** Recognizing overfitting through out-of-sample testing and the role of condition numbers in multicollinearity detection.
- **Model Building Approach:** Balancing complexity with interpretability for reliable predictions and insights.

Link to the chat: [Chat Summary and Link](https://chatgpt.com/share/67354dc3-23ec-8011-aa6e-e6b6827e68ca)
