<a href="https://colab.research.google.com/github/thedarredondo/data-science-fundamentals/blob/main/Unit6/Unit6ExercisesSF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unit 6 Exercises: Is my model good?

#### Overfitting and Underfitting, Model Visualization, and Model/Variable Selection Concepts

These exercises are meant to get you to think about the model and variable selection process, and consider how we determine if a model is "good".

## Task1

Does `elpd_loo` mean anything if we only have one model?

---

No, because `elpd_loo` is a metric that helps us tell the difference between two or more models, but doesn't convey information without doing such a comparison.
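To make this concrete, here is a minimal sketch of a hand-rolled leave-one-out log predictive density. It is an assumption-laden toy: it uses plug-in normal fits on made-up data rather than the PSIS-LOO estimate that ArviZ computes from posterior draws, and the helper names (`fit_mean`, `fit_line`, `loo_elpd`) are hypothetical. The point is that rescaling the outcome changes each model's score but leaves the difference between models unchanged.

```python
import math
import random


def normal_logpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) evaluated at y."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2


def fit_mean(xs, ys):
    """Intercept-only model: predict the training mean for every x."""
    mu = sum(ys) / len(ys)
    sigma = math.sqrt(sum((y - mu) ** 2 for y in ys) / len(ys))
    return (lambda x: mu), sigma


def fit_line(xs, ys):
    """Least-squares line with a plug-in residual standard deviation."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    sigma = math.sqrt(sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / n)
    return (lambda x: intercept + slope * x), sigma


def loo_elpd(fit, xs, ys):
    """Sum of log predictive densities, each point scored by a fit that excluded it."""
    total = 0.0
    for i in range(len(xs)):
        predict, sigma = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += normal_logpdf(ys[i], predict(xs[i]), sigma)
    return total


random.seed(0)
xs = [i / 10 for i in range(30)]
ys = [1 + 2 * x + random.gauss(0, 0.5) for x in xs]  # synthetic data, not the basketball set
ys_scaled = [100 * y for y in ys]  # the same outcome expressed in different units

elpd_mean = loo_elpd(fit_mean, xs, ys)
elpd_line = loo_elpd(fit_line, xs, ys)
elpd_mean_scaled = loo_elpd(fit_mean, xs, ys_scaled)
elpd_line_scaled = loo_elpd(fit_line, xs, ys_scaled)

print("difference, original units:", elpd_line - elpd_mean)
print("difference, rescaled units:", elpd_line_scaled - elpd_mean_scaled)
```

Each individual score shifts when the units change, which is why a single value is hard to read in isolation. The gap between the two models is what carries the information.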

## Task2

Describe overfitting, in the context of this course.

---

Overfitting is when a model fits noise or other irrelevant patterns in the data it was trained on, so it predicts new data poorly. In general, the more variables a model has, the more prone it is to overfitting.
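As a rough illustration outside the Bayesian workflow used in this course, the sketch below compares a straight-line fit with a hypothetical "memorizer" that simply returns the outcome of the nearest training point. The data are synthetic and every function name is made up for this example. The memorizer reproduces its training data perfectly, yet does worse than the line when each point is held out.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares line; returns a prediction function."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def fit_memorizer(xs, ys):
    """Deliberately overfit model: copy the outcome of the nearest training x."""
    def predict(x):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[nearest]
    return predict


def mse(predict, xs, ys):
    """Mean squared error on the given points."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def loo_mse(fit, xs, ys):
    """Mean squared error where each point is predicted by a fit that excluded it."""
    total = 0.0
    for i in range(len(xs)):
        predict = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - predict(xs[i])) ** 2
    return total / len(xs)


random.seed(1)
xs = [i / 20 for i in range(200)]
ys = [0.3 + 0.5 * x + random.gauss(0, 1.0) for x in xs]  # linear trend plus noise

train_mse_line = mse(fit_line(xs, ys), xs, ys)
train_mse_memorizer = mse(fit_memorizer(xs, ys), xs, ys)
loo_mse_line = loo_mse(fit_line, xs, ys)
loo_mse_memorizer = loo_mse(fit_memorizer, xs, ys)

print("in-sample MSE:    line", train_mse_line, "memorizer", train_mse_memorizer)
print("leave-one-out MSE: line", loo_mse_line, "memorizer", loo_mse_memorizer)
```

The in-sample error rewards the memorizer for fitting the noise, while the leave-one-out error penalizes it. The same gap between in-sample and out-of-sample performance is what `elpd_loo` is meant to expose.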

## Task3

How do we mitigate overfitting?

---

To mitigate overfitting, we use weakly informative priors and keep the number of variables in our models small.
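One way to see why a prior helps is a conjugate toy example: estimating a normal mean with known `sigma` under a `Normal(0, prior_sd)` prior. This is only a sketch under those assumptions, with made-up numbers and a hypothetical helper name, and it is not how Bambi sets or applies its default priors. It shows the prior pulling a noisy estimate toward zero, more strongly as the prior gets tighter.

```python
def posterior_mean(sample_mean, n, sigma, prior_sd):
    """Posterior mean of a normal mean with known sigma and a Normal(0, prior_sd) prior."""
    data_precision = n / sigma**2        # information contributed by the data
    prior_precision = 1 / prior_sd**2    # information contributed by the prior
    return data_precision * sample_mean / (data_precision + prior_precision)


# Made-up numbers: a small, noisy sample whose mean is 2.0.
sample_mean, n, sigma = 2.0, 5, 3.0

nearly_flat = posterior_mean(sample_mean, n, sigma, prior_sd=100.0)
weakly_informative = posterior_mean(sample_mean, n, sigma, prior_sd=1.0)
very_tight = posterior_mean(sample_mean, n, sigma, prior_sd=0.1)

for label, value in [
    ("prior_sd=100", nearly_flat),
    ("prior_sd=1", weakly_informative),
    ("prior_sd=0.1", very_tight),
]:
    print(label, round(value, 3))
```

With a nearly flat prior the estimate stays close to the sample mean, so the model is free to chase noise. A very tight prior swamps the data, which risks underfitting instead.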

## Task4

How do we mitigate underfitting?

---

We mitigate underfitting by comparing models and choosing the one with a relatively good `elpd_loo`.

## Task5

Why would we want more than one predictor in a model?

---

We want more than one predictor when we determine that there is a more complicated relationship between the variables we are trying to analyze.

## Task6

Can we have too many predictors? How would we know?

---

There is such a thing as too many predictors, and the symptom is overfitting. We can use our knowledge of the variables to draw a DAG and see which ones actually cause what we are trying to analyze. We can also use metrics like `elpd_loo` to hint at weaknesses in the model.

## Task7

What is variable selection, and how does it work?

---

Variable selection is the process of choosing the correct predictors for a given task. This includes determining whether multiple predictors are needed, how many are needed, and which are the best to use.

## Task8

Describe the differences and similarities between the following three models: linear regression with two predictors, one of which is a categorical variable:

- adding the variables in the model, as is standard.
- using that categorical variable as a hierarchy upon the other predictor variable.
- adding the variables, plus the categorical variable's interaction with the other variable.

---

In the example from the notes, I noticed the following similarities and differences.

When we simply add the variables, Bambi gives every category the same slope but a different intercept, so the fitted lines are parallel. When we also add the interaction term, each category gets its own slope as well as its own intercept. The hierarchical model also lets both the intercept and the slope vary by category, but it ties the category-level values together through a shared distribution, so they are partially pooled rather than estimated independently. All three are tools we choose between based on our knowledge of the relationships between the variables.
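The difference between adding a categorical variable and adding its interaction can be sketched with plain line equations. The coefficients below are invented for illustration and are not fitted values from any model in this notebook. A hierarchical model produces per-category lines of the same form as the interaction version, except that the category-level shifts are drawn from a shared distribution.

```python
# Hypothetical coefficients for two positions, made up purely for illustration.
BASE_INTERCEPT = 0.50
BASE_SLOPE = -0.10
INTERCEPT_SHIFT = {"C": 0.00, "PG": -0.08}
SLOPE_SHIFT = {"C": 0.00, "PG": 0.05}


def additive(ft_pct, pos):
    """y ~ x + category: every category shares BASE_SLOPE, only the intercept moves."""
    return BASE_INTERCEPT + INTERCEPT_SHIFT[pos] + BASE_SLOPE * ft_pct


def with_interaction(ft_pct, pos):
    """y ~ x + category + x:category: each category also shifts the slope."""
    return BASE_INTERCEPT + INTERCEPT_SHIFT[pos] + (BASE_SLOPE + SLOPE_SHIFT[pos]) * ft_pct


def gap(model, ft_pct):
    """Vertical distance between the two categories' lines at a given x."""
    return model(ft_pct, "C") - model(ft_pct, "PG")


ft_values = (0.5, 0.7, 0.9)
additive_gaps = [gap(additive, x) for x in ft_values]
interaction_gaps = [gap(with_interaction, x) for x in ft_values]

print("additive gaps:   ", additive_gaps)      # constant gap means parallel lines
print("interaction gaps:", interaction_gaps)   # changing gap means different slopes
```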

## Task9

How do we visualize multiple linear regression models? Can we visualize the entire model, all at once?

---

We visualize multiple linear regression models by choosing two variables to plot on the x and y axis, and possibly a category for which we draw multiple lines. We typically cannot visualize the entire model all at once, and need multiple plots to understand what is going on.
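A minimal sketch of that slicing idea: pick one predictor for the x axis, draw one line per category, and hold the remaining predictor at a fixed value for each panel. The `predict` function and its coefficients are hypothetical stand-ins for a fitted model, chosen only so the example runs on its own.

```python
# Hypothetical stand-in for a fitted model with three predictors.
INTERCEPT = {"C": 0.62, "PG": 0.50}
SLOPE_FT = -0.10
SLOPE_3PA = -0.01


def predict(ft_pct, pos, three_pa):
    """Made-up mean function: field goal % from free throw %, position, and 3PA."""
    return INTERCEPT[pos] + SLOPE_FT * ft_pct + SLOPE_3PA * three_pa


def panel(three_pa_value, ft_grid, positions):
    """One plot's worth of data: a line per position, with 3PA held fixed."""
    return {pos: [predict(ft, pos, three_pa_value) for ft in ft_grid] for pos in positions}


ft_grid = [0.5, 0.6, 0.7, 0.8, 0.9]
positions = ["C", "PG"]

# Two panels, because a single 2D plot cannot show all three predictors at once.
panels = {three_pa: panel(three_pa, ft_grid, positions) for three_pa in (2.0, 6.0)}

for three_pa, lines in panels.items():
    print(f"3PA fixed at {three_pa}")
    for pos, ys in lines.items():
        print("  ", pos, [round(y, 3) for y in ys])
```

Each panel is a two-dimensional slice of the model, so understanding the whole model means looking at several slices side by side.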

## Task10

Compare the following linear models that all use the basketball data to predict field goal percentage:

1. predictors free throw percentage and position (with position as a categorical predictor)
2. predictors free throw percentage and position (with position as a hierarchy)
3. predictors free throw percentage and position (with position interacting with free throw percentage)
4. predictors free throw percentage, position, 3 point attempts, and interactions between all three predictors
5. predictors free throw percentage, position, 3 point attempts, with an interaction between 3 point attempts and position.

using `az.compare()` and `az.plot_compare()`, or an equivalent method using LOO (`elpd_loo`).

You may use the following two code blocks to load and clean the data.

In [1]:
import pandas as pd
import bambi as bmb
import arviz as az

In [2]:
#have to drop incomplete rows, so that bambi will run
bb = pd.read_csv(
    'https://raw.githubusercontent.com/thedarredondo/data-science-fundamentals/refs/heads/main/Data/basketball2324.csv').dropna()

In [3]:
#only look at players who played more than 600 minutes
#which is 20 min per game, for 30 games
bb = bb.query('MP > 600')
#remove players who never missed a free throw
bb = bb.query('`FT%` != 1.0')
#filter out the combo positions. This will make it easier to read the graphs
bb = bb.query("Pos in ['C','PF','SF','SG','PG']")
#gets rid of the annoying '%' sign
bb.rename(columns={"FT%":"FTp","FG%":"FGp"}, inplace=True)

In [4]:
# Model 1: FTp plus Pos as an ordinary categorical predictor
model1 = bmb.Model("FGp ~ FTp + Pos", data=bb)
idata_model1 = model1.fit(idata_kwargs={"log_likelihood": True})

Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, Intercept, FTp, Pos]


Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 2 seconds.


In [5]:
# Model 2: Pos as a hierarchy, with group-level intercepts and FTp slopes
model2 = bmb.Model("FGp ~ (FTp | Pos)", data=bb)
idata_model2 = model2.fit(idata_kwargs={"log_likelihood": True})

Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, Intercept, 1|Pos_sigma, 1|Pos_offset, FTp|Pos_sigma, FTp|Pos_offset]


Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 26 seconds.
There were 137 divergences after tuning. Increase `target_accept` or reparameterize.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details


In [6]:
# Model 3: FTp and Pos plus their interaction
model3 = bmb.Model("FGp ~ FTp + Pos + FTp:Pos", data=bb)
idata_model3 = model3.fit(idata_kwargs={"log_likelihood": True})

Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, Intercept, FTp, Pos, FTp:Pos]


Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 14 seconds.


In [7]:
# Model 4: main effects plus only the three-way FTp:Pos:3PA interaction (no two-way terms)
model4 = bmb.Model("FGp ~ FTp + Pos + `3PA` + FTp:Pos:`3PA`", data=bb)
idata_model4 = model4.fit(idata_kwargs={"log_likelihood": True})

Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, Intercept, FTp, Pos, 3PA, FTp:Pos:3PA]


Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 11 seconds.


In [8]:
# Model 5: main effects plus the 3PA:Pos interaction
model5 = bmb.Model("FGp ~ FTp + Pos + `3PA` + `3PA`:Pos", data=bb)
idata_model5 = model5.fit(idata_kwargs={"log_likelihood": True})

Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, Intercept, FTp, Pos, 3PA, 3PA:Pos]


Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 7 seconds.


In [9]:
az.compare({"Model 1": idata_model1,
            "Model 2": idata_model2,
            "Model 3": idata_model3,
            "Model 4": idata_model4,
            "Model 5": idata_model5})

| | rank | elpd_loo | p_loo | elpd_diff | weight | se | dse | warning | scale |
|---|---|---|---|---|---|---|---|---|---|
| Model 4 | 0 | 531.554708 | 13.833942 | 0.0 | 0.6866742 | 15.105571 | 0.0 | False | log |
| Model 5 | 1 | 529.836221 | 12.720547 | 1.718486 | 0.3133258 | 15.877077 | 3.1275 | False | log |
| Model 3 | 2 | 509.396937 | 13.758973 | 22.157771 | 0.0 | 16.701151 | 6.913375 | False | log |
| Model 2 | 3 | 508.473508 | 13.650741 | 23.081199 | 2.221671e-15 | 16.921504 | 6.870815 | False | log |
| Model 1 | 4 | 507.390955 | 8.224211 | 24.163752 | 0.0 | 16.0623 | 6.992583 | False | log |


## Task11

Which model is "better" according to this metric?

Why do you think that is?

---

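According to this metric, model 4 is the best, with the highest `elpd_loo` (531.55). Models 4 and 5 both score more than 20 points above models 1 through 3, and they are the only ones that include 3 point attempts, so that predictor is probably what improves the fit. However, the three-way interaction in model 4 might be overfitting, in which case model 5 might even be better: its `elpd_diff` of 1.72 is smaller than its `dse` of 3.13, so the two are effectively tied.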
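A quick way to judge how close two models are is to divide each `elpd_diff` by its `dse`. The sketch below does that with the numbers copied from the `az.compare()` output above. Treating a ratio well under 1 as "hard to tell apart" is a rough rule of thumb assumed here, not a formal test, and the helper name is hypothetical.

```python
# (elpd_diff, dse) pairs copied from the az.compare() output; Model 4 is the reference.
comparison = {
    "Model 5": (1.718486, 3.1275),
    "Model 3": (22.157771, 6.913375),
    "Model 2": (23.081199, 6.870815),
    "Model 1": (24.163752, 6.992583),
}


def diff_in_standard_errors(elpd_diff, dse):
    """How many standard errors the model sits below the top-ranked model."""
    return elpd_diff / dse


ratios = {
    name: diff_in_standard_errors(elpd_diff, dse)
    for name, (elpd_diff, dse) in comparison.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} standard errors below Model 4")
```

Model 5 comes out at roughly half a standard error below model 4, while models 1 through 3 are each more than three standard errors below it. Model 2's value should be read with extra caution, since its sampler reported divergences and a high rhat.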