# Danger Zone {#sec-danger-zone}

The following are very brief thoughts on common pitfalls in modeling we thought of while writing this book. This is not meant to be comprehensive, but rather a quick reference for common things that can trip folks up. We're not going to tell you not to do these things, but we are going to tell you to be very careful if you do, and be ready to defend your choices.

## Things to Generally Avoid

### The Linear Model

-   Using stepwise/best subset regression for feature selection. Instead, use ridge/lasso/elastic net.

-   Focusing on statistical significance while ignoring predictive/practical results can lead you down dangerous paths. It is often just as important to know if a feature doesn't have any effect on the target.

-   Forgetting to rely heavily on domain knowledge. Technical skills alone cannot make sense of context-specific effects or tell you what you actually want to model.

-   Forgetting to rely heavily on visualizations. Visualizations can tell you general relationships (both linear and nonlinear), while also giving you an idea of variables that are unlikely to be helpful. The average person is going to find a visualization far more useful than a model.

-   Using statistical tests in lieu of visualizations or practical metrics for assumptions. Tests are often overly conservative, whereas a visualization shows you just how severe any violation actually is.

-   Comparing models with R^2^. Statistical tests like ANOVA or metrics like AIC would be a better choice for comparing models.

-   Dropping/modifying extreme values. They are valid values, so you might just need to use a different distribution.

-   Misinterpreting simple contrasts of categorical features. A significant result doesn't mean that the level is "important"; it only means that the level is significantly different from the baseline level.
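
To make the last point concrete, here is a minimal sketch using made-up data (the group names and values are purely hypothetical). In a one-factor linear model with treatment (dummy) coding, each level's coefficient equals that level's mean minus the baseline mean, so the same level can look large or small depending only on which baseline you chose.

```python
from statistics import mean

# Hypothetical outcome values for three levels of a categorical feature.
groups = {
    "A": [10.0, 12.0, 14.0],
    "B": [11.0, 13.0, 15.0],
    "C": [20.0, 22.0, 24.0],
}

def treatment_contrasts(groups, baseline):
    """Dummy-coded OLS coefficients for a single factor:
    each level's mean minus the baseline level's mean."""
    base_mean = mean(groups[baseline])
    return {
        level: mean(values) - base_mean
        for level, values in groups.items()
        if level != baseline
    }

vs_a = treatment_contrasts(groups, "A")  # B and C compared to A
vs_c = treatment_contrasts(groups, "C")  # A and B compared to C

print("baseline A:", vs_a)
print("baseline C:", vs_c)
```

Level B differs from A by 1 but from C by -9. Nothing about B changed, only the reference level did.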

### Generalized Linear Models

-   Using pseudo-R^2^ for any important conclusion. These aren't very useful and will never really be the R^2^ you get from OLS.

-   Thinking accuracy is the primary metric for your logistic regression. A combination of balanced accuracy and AUC is a better choice than standard accuracy alone.

-   Thinking GLMs are enough for your problem. For example, the Poisson assumption that the mean and variance are equal is very restrictive and typically doesn't hold.
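
As a rough illustration of the Poisson point, the sketch below compares the mean and variance of some made-up counts. This is only a crude marginal check that ignores covariates. For a fitted model you would more commonly look at something like the Pearson chi-square statistic divided by the residual degrees of freedom.

```python
from statistics import mean, variance

# Hypothetical count data (e.g., events per unit); values are invented.
counts = [0, 0, 1, 1, 2, 3, 5, 8, 13, 21]

count_mean = mean(counts)
count_var = variance(counts)  # sample variance (n - 1 denominator)

# Under a Poisson distribution this ratio should be close to 1.
dispersion_ratio = count_var / count_mean

print(f"mean = {count_mean:.2f}, variance = {count_var:.2f}")
print(f"variance / mean = {dispersion_ratio:.2f}")
```

A ratio far above 1, as here, suggests overdispersion, in which case alternatives such as a negative binomial model may be worth considering.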

### Estimation

-   Obsessing about distributions if your primary goal is interpretation.

-   Assuming a bootstrap is good enough for inference.

-   Being overly concerned with which optimization technique you choose. Very often, there is no single optimal solution.

### Data

-   Forgetting to transform your features. Many models perform better when working with variables that are on the same scale.

-   Using simple imputation (like the mean or modal category) when you're missing a lot of data. Model-based imputation (e.g., MICE) is generally preferred, but even it might not work well when you have a lot of missingness.

-   Making something binary when it naturally isn't and has notable variability. This is effectively burning information about the variable.

-   Forgetting that large data can get small very quickly due to class imbalance, interactions, etc.
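
The cost of dichotomizing can be seen in a small, deliberately idealized sketch. The data are invented and perfectly linear, and the split point of 5 is an arbitrary assumption. Real data will be noisier, but the direction of the effect is the same.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

# Hypothetical feature and a target that depends on it linearly.
x = list(range(1, 11))
y = [2 * xi for xi in x]

# Split the feature into "low" vs. "high" at an arbitrary cut.
x_binary = [1 if xi > 5 else 0 for xi in x]

r_continuous = pearson(x, y)
r_binary = pearson(x_binary, y)

print(f"continuous feature: r = {r_continuous:.3f}")
print(f"binary feature:     r = {r_binary:.3f}")
```

Even in this best-case setup, the binary version recovers only part of the relationship that the original feature captured completely.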

### Machine Learning

-   Using a single model without any baseline. It is hard to improve if you don't know the starting point.

-   Starting with a complicated model. Parsimony is always key, so create that baseline model and iterate from there.

-   Using a simple 0.5 cutoff for binary classification. It is tempting to assume that 0.5 is the ideal cutpoint, but you should test a range of values before settling on one.

-   Splitting your target just so you can do classification. Most models will happily perform regression or classification, so try not to unnecessarily disturb your target.

-   Using a single metric for model assessment. Each metric has its own pros and cons, so you should evaluate your model's performance with a suite of metrics.

-   Ignoring uncertainty in your predictions or metrics. There will always be uncertainty, so acknowledging it will help mitigate the chance of embarrassment when your model doesn't perform as expected.

-   Comparing models on different datasets. Models and data are always tied together, so a difference in performance between models evaluated on different datasets tells you little about the models themselves.

-   Letting information leak between the training and test sets. Doing this gives your model an unfair advantage when it is time for testing, leading you to believe that your model is doing better than it really is.

-   Assuming variable importance is telling you what you think it is. A high importance score doesn't mean that the feature is the "best" of your features, or that it is causal; it only means that the model relies on that feature the most.

-   Assuming that because your approach dropped a feature, it has no effect on the target.

-   Using a model just because it seems popular. Popularity fades as models advance; you should focus on usability and explainability, while balancing performance.

-   Using a model just because it's the only one you know. In the age of rapid advancement and sharing, you should seek out new solutions to problems.

-   Using older black box methods that don't keep up with more modern techniques. Modern models, in conjunction with advances in "explainable AI", will help you make the most sense of the relationships within your model.

-   Forgetting to scale your data. You don't want the scale of the data to be something that the model homes in on.

-   Forgetting that the data is more important than your modeling technique. The axiom will always ring true: 90% of analytics work is on data prep.

-   Forgetting you need to be able to explain your model and code to someone else. The only good model is a useful one; if you can't explain it to someone, it probably isn't useful.

-   Assuming that grid search is good enough for all or even most cases. Not only is it computationally expensive, but you might well miss good tuning parameter values that are not on the grid; instead, you might consider Bayesian optimization.

-   Ignoring temporal/spatial data structure. People will often forget about the effects of time and space on relationships; fortunately, many methods exist for exploring these important effects.

-   Thinking deep learning will solve all your problems. If you are dealing with standard, tabular data, deep learning will just increase computational complexity, all with very little promise of increased performance.

-   Using SHAP values for feature importance. They weren't meant for this and can be very misleading; they are merely there to help demonstrate direction.
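
Several of the points above (the 0.5 cutoff, relying on a single metric, and accuracy under class imbalance) can be illustrated together. The following is a minimal sketch with invented predicted probabilities for an imbalanced problem. The scores, the cutoff grid, and the `evaluate` helper are all assumptions made for illustration, and in practice you would choose the cutoff on validation data rather than on the test set.

```python
# Made-up predicted probabilities: 16 negatives and 4 positives.
neg_scores = [0.05, 0.08, 0.11, 0.12, 0.15, 0.18, 0.21, 0.22,
              0.25, 0.28, 0.31, 0.32, 0.35, 0.38, 0.42, 0.55]
pos_scores = [0.41, 0.43, 0.45, 0.72]

scores = neg_scores + pos_scores
labels = [0] * len(neg_scores) + [1] * len(pos_scores)

def evaluate(cutoff):
    """Return (accuracy, balanced accuracy) at a given cutoff."""
    preds = [1 if s >= cutoff else 0 for s in scores]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, labels))
    tn = sum(p == 0 and t == 0 for p, t in zip(preds, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    sensitivity = tp / n_pos
    specificity = tn / n_neg
    accuracy = (tp + tn) / len(labels)
    return accuracy, (sensitivity + specificity) / 2

# Sweep a simple grid of candidate cutoffs instead of assuming 0.5.
cutoffs = [i / 10 for i in range(1, 10)]
results = {c: evaluate(c) for c in cutoffs}
best_cutoff = max(cutoffs, key=lambda c: results[c][1])

for c in cutoffs:
    acc, bal = results[c]
    print(f"cutoff {c:.1f}: accuracy = {acc:.3f}, balanced accuracy = {bal:.3f}")
print("best cutoff by balanced accuracy:", best_cutoff)
```

With these scores, the default 0.5 cutoff looks respectable on accuracy yet catches only one of the four positives, while a lower cutoff does much better on balanced accuracy.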

### Causal Modeling

-   Pretending a modeling technique can prove a causal relationship. These models can certainly get you a bit closer, but they are still models at the end of the day.

-   Thinking random assignment can prove a causal relationship. While random assignment certainly is the "gold standard" for experimental design, there could be hundreds of other unmeasured variables influencing the target.

-   Ignoring potentially confounding variables. You'll want to carefully consider variables that might have an impact on the target and be sure to measure them.

-   Ignoring the possibility of reverse causality. Reverse causality and simultaneity are both real possibilities when working on causal models, so don't fall in love with a single model.

-   Ignoring the possibility of selection bias. You can never assume that you've gotten a truly random sample of rational people, so don't assume that your sample will be representative of the target population.

-   Ignoring measurement error. Even with the tightest tolerances, everything has measurement error. 
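
The confounding point can be sketched with a small simulation. Everything here is an assumption made for illustration: the data-generating process, the variable names `x`, `y`, and `z`, and the seed. In this setup `z` drives both `x` and `y`, and `x` has no effect on `y` at all. The adjusted estimate uses the fact that the coefficient on `x` in a regression of `y` on `x` and `z` equals the slope obtained after removing `z` from both variables.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible
n = 5000

# Hypothetical process: z is a confounder; x has NO direct effect on y.
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 1) for zi in z]
y = [2 * zi + random.gauss(0, 1) for zi in z]

def center(values):
    """Subtract the mean from every value."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def slope(predictor, response):
    """OLS slope of response on predictor (intercept included)."""
    p, r = center(predictor), center(response)
    return sum(a * b for a, b in zip(p, r)) / sum(a * a for a in p)

def residualize(values, on):
    """Residuals of `values` after regressing them on `on`."""
    b = slope(on, values)
    v, o = center(values), center(on)
    return [vi - b * oi for vi, oi in zip(v, o)]

# Ignoring z: x appears to have a sizable effect on y.
naive_slope = slope(x, y)

# Adjusting for z: the apparent effect of x largely disappears.
adjusted_slope = slope(residualize(x, z), residualize(y, z))

print(f"naive slope of y on x:    {naive_slope:.3f}")
print(f"slope after adjusting z:  {adjusted_slope:.3f}")
```

The adjustment only works here because `z` was measured. An unmeasured confounder would leave the naive estimate looking just as convincing.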

## Perfect is not Possible

Your model will never be perfect. Time, money, and other constraints will conspire to make sure of that, and you will always have to make trade-offs. You're also human, and mistakes will be made. The best you can do is to be aware of these limitations and do your best to mitigate them. Modeling is about exploration and discovery, and the best models are the ones that are the most useful, not the most perfect.