data.qmd

# Dealing with Data {#sec-data}

```{r}
#| label: setup-data
#| include: false
df_reviews = read_csv("data/movie_reviews.csv")
```


It's an inescapable fact that models need data to work. One of the dirty secrets in data science is that getting the data right will do more for your model than any fancy algorithm or hyperparameter tuning.  Data is messy, and there are a lot of ways it can be messy. In addition, dealing with a target variable on its terms can lead to more interesting results. In this chapter we'll discuss some of the most common data issues and things to consider. There's a lot to know about data before you ever get into modeling your own, so we'll give you some things to think about it in this chapter. 


## Key Ideas

- Data transformations can provide many modeling benefits.
- Label and text-based data still needs a numeric representation, and this can be accomplished in a variety of ways.
- The data type for the target may suggest a particular model, but does not necessitate one.
- The data *structure*, e.g. temporal, spatial, censored, etc., may suggest a particular modeling domain to use.
- Missing data can be handled in a variety of ways, and the simpler approaches are typically not great.
- Class imbalance is a very common issue in classification problems, and there are a number of ways to deal with it.
- Latent variables are everywhere!


### Why this matters {#sec-data-common-why-matters}

Knowing your data is one of the most important aspects of any application of *data* science. It's not just about knowing what you have, but also what you can do with it. The data you have will influence the models you can potentially use, the features you can create and manipulate, and have a say on the results you can expect. 

### Good to Know {#sec-data-common-good-to-know}

We're talking very generally about data here, so not much background is needed. The models mentioned are covered in other chapters, or build upon those, but we're not doing any actual modeling here. 

## Standard Feature & Target Transformations {#sec-data-transformations}

Transforming variables an provide several benefits in modeling, whether applied to the target, features, or both, and *should regularly be used for most model situations*. Some of these benefits include:

- More comparable feature effects and related parameters
- Faster estimation
- Easier convergence
- Helping with heteroscedasticity

For example, just centering features, i.e. subtracting the mean, provides a more interpretable intercept that will fall within the actual range of the target variable. After centering, the intercept tels us what the value of the target variable is when the features are at their means (or reference value if categorical). Centering also puts the intercept within the expected range of the target, which often makes for easier parameter estimation. So even if easier interpretation isn't a major concern, variable transformations can help with convergence and speed up estimation, so can always be of benefit.

### Numeric variables {#sec-data-numeric}

The following table shows the interpretation of some very common transformations applied to numeric variables- logging and standardizing to mean zero, standard deviation of one.


```{r}
#| echo: false
#| eval: true
#| label: tbl-transformation
#| tbl-cap: Common numeric transformations
#|
# gt doesn't seem to have an actual working answer to latex within cells, though kabel_extra and modelsummary appear to handle it as expected, at least if you just run it. Rendering to pdfseems to fail. tried using \U03B2 for beta, but that didn't work either.
# tibble(
#     Target = c("y", "y", "log(y)", "log(y)", "y", "scale(y)", "scale(y)"),
#     Feature = c("x", "log(x)", "x", "log(x)", "scale(x)", "x", "scale(x)"),
#     Interpretation = c(
#         "$\\Delta y = \\beta\\Delta x$",
#         "$\\Delta y \\approx (\\beta/100)\\%\\Delta x$",
#         "$\\%\\Delta y \\approx 100\\cdot \\beta\\%\\Delta x$",
#         "$\\%\\Delta y = \\beta\\%\\Delta x$",
#         "$\\Delta y =  \\beta\\sigma\\Delta x$",
#         "$\\sigma\\Delta y =  \\beta\\Delta x$",
#         "$\\sigma\\Delta y =  \\beta\\sigma\\Delta x$"
#     ),
# ) |>
#     modelsummary::datasummary_df()
# christ what year is this?
tibble(
    Target = c("y", "log(y)", "log(y)", "y", "scale(y)"),
    Feature = c("x", "x", "log(x)", "scale(x)", "scale(x)"),
    `Change in X` = c("1 unit", "1 unit", "1% change", "1 standard deviation", "1 standard deviation"),
    `Change in Y` = c("B unit", "100 * (exp(B) -1) ", "B% change", "B unit", "B standard deviation"),
    Benefits = c("Interpretation", "Heteroscedasticity in y", "Interpretation, deal with feature extremes", "Interpretation, estimation", "Interpretation, estimation"),
) |>
    gt()

```

For example, it is very common to use **standardized** variables, or simply 'scaling' them. Some also call this **normalizing** but [this term can mean a lot of things](https://en.wikipedia.org/wiki/Normalization_(statistics)), so one should be clear in their communication. If $y$ and $x$ are both standardized, a one unit (i.e. one standard deviation) change in $x$ leads to a $\beta$ standard deviation change in $y$. So, if $\beta$ was .5, a standard deviation change in $x$ leads to a half standard deviation change in $y$. In general, there is nothing to lose by standardizing, so you should employ it often.

Another common transformation, particularly in machine learning, is **min-max scaling**. This involves changing variables to range from some minimum value to some maximum value, and usually this means zero and one respectively. This transformation can make numeric and categorical indicators more comparable, or at least put the on the same scale for estimation purposes, and so can help with convergence and speed up estimation. 

:::{.panel-tabset}

###### Python

When using `sklearn`, it's a bit of a verbose process to do such a simple transformation. However it is beneficial when you want to do more complicated things, especially when using data pipelines.


```{python}
#| eval: false
#| label: py-numeric-transformation
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np

# Create a sample dataset
import numpy as np

# Create a random sample of integers
data = np.random.randint(low=0, high=100, size=(5, 3))

# Apply StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# Apply MinMaxScaler
minmax_scaler = MinMaxScaler()
minmax_scaled_data = minmax_scaler.fit_transform(data)
```

###### R

R, being made for statistics, makes it easy to do simple transformations, but you can also use tools like `recipes` and `mlr3` pipeline operations when needed to make sure your preprocessing is applied appropriately.

```{r}
#| eval: false
#| label: r-numeric-transformation

# Create a sample dataset
data = matrix(sample(1:100, 15), nrow = 5)

# Standardization
scaled_data = scale(data)

# Min-Max Scaling
minmax_scaled_data = apply(data, 2, function(x) {
    (x - min(x)) / (max(x) - min(x))
})

```
:::


Using a **log** transformation for numeric targets and features is straightforward, and [comes with several benefits](https://stats.stackexchange.com/questions/107610/what-is-the-reason-the-log-transformation-is-used-with-right-skewed-distribution). For example, it can help with **heteroscedasticity**, which is when the variance of the target is not constant across the range of the predictions[^notnormal] (demonstrated below). It can also help to keep predictions positive after transformation, allows for interpretability gains, and more. One issue with logging is that it is not a linear transformation, which can help capture nonlinear feature-target relationships, but can also can make some post-modeling transformations more less straightforward. Also if you have a lot of zeros, 'log plus one' transformations are not going to be enough to help you overcome that hurdle[^logp1]. Logging also won't help much when the variables in question have few distinct values, like ordinal variables, which we'll discuss later in @sec-data-ordinal.

[^logp1]: That doesn't mean you won't see many people try (and fail).

:::{.content-visible when-format="html"}
```{r}
#| echo: false
#| label: fig-log-transformation-heteroscedasticity
#| fig-cap: Log transformation and heteroscedasticity
#| out-width: 100%

# Generate x values
set.seed(123)
N = 250

x = rpois(N, lambda = 10)

# Generate y values with heteroscedasticity
y = exp(.1*x + rnorm(length(x), mean = 0, sd = .5))

# Create a data frame
data = tibble(x, y)

# Fit a linear regression model without log transformation
model = lm(y ~ x, data = data)
# plot(model)

# Fit a linear regression model with log transformation for y
model_log = lm(log(y) ~ x, data = data)
# plot(model_log)


fitted_residuals = data.frame(
    model = rep(c("Model", "Model w/ Log"), each = length(data$x)),
    fitted_values = c(model$fitted, model_log$fitted),
    residuals = c(resid(model), resid(model_log))
)


# Create a ggplot with facetted scatter plots
p = ggplot(fitted_residuals, aes(x = fitted_values, y = residuals)) +
    geom_point() +
    facet_wrap(~ model, nrow = 1, scales = 'free') +
    labs(
        x = "Fitted Values", 
        y = "Residuals",
        caption = "\nVariance increases with predicted values for the model without log transformation,\nbut is consistent across prediction for the model with log transformation."
    ) +
    theme(
        plot.caption = element_text(hjust = 0, size = 10)
    )

p

ggsave(
    "img/data-log_transformation_heteroscedasticity.svg", 
    p, 
    width = 8, 
    height = 6
)
```
:::

:::{.content-visible when-format="pdf"}
![](img/data-log_transformation_heteroscedasticity.svg)(img/data-log_transformation_heteroscedasticity.svg){width=100% #fig-log-transformation-heteroscedasticity}
:::

[^notnormal]: For the bazillionth time, logging does not make data 'normal' so that you can meet your normality assumption in linear regression. The only time that would work is if your data is log-normally distributed to begin with.

:::{.callout-tip title='Categorizing continuous variables' collapse="true"}
It is rarely necessary or a good idea to transform a numeric feature or target to a categorical one. Doing so potentially throws away useful information by making the feature a less reliable measure of the underlying construct. For example, discretizing age to 'young' and 'old' does not help your model, and you can always get predictions for what you would consider 'young' and 'old' after the fact. Though extremely common, particularly in machine learning contexts when applied to target variables, it's just not a statistically or practically sound thing to do, and can ultimately hinder interpretation.
:::


### Categorical Variables {#sec-data-cat}

Despite their ubiquity in data, we can't analyze raw text information as it is. Character strings, and labeled features like factors, must be converted to a numeric representation before we can analyze them. For categorical features, we can use something called **effects coding** to test for specific types of group differences. Far and away the most common type is called **dummy coding** or **one-hot encoding**[^onehotnotdummy], which we visited previously @sec-lm-categorical-features. In these situations we create columns for each category, and the value of the column is 1 if the observation is in that category, and 0 otherwise.  Here is a one-hot encoded version of the `season` feature that was demonstrated previously.

[^onehotnotdummy]: Note that one-hot encoding can refer to just the 1/0 coding for all categories, or to the specific case of dummy coding where one category is dropped. Make sure the context is clear.


```{r}
#| echo: false
#| label: tbl-one-hot
#| tbl-cap: One-hot encoding
model.matrix(rating ~ -1 + season, df_reviews) |>
    as_tibble() |>
    bind_cols(df_reviews |> select(season)) |>
    head() |>
    gt() |> 
    fmt_number(columns = everything(), decimals = 0)
```

:::{.callout-note title='Dummy Coding Explained' collapse="true"}
For statistical models, when doing one-hot encoding all relevant information is incorporated in k-1 groups, where k is the number of groups, so one category will be dropped from the model matrix. This dropped category is called the **reference**.  In a standard linear model, the intercept represents the mean of the target for the reference category. The coefficients for the other categories are the difference between the mean for the reference category and the group mean of the category being considered. 

As an example, in the case of the `season` feature, if the dropped category is `winter`, the intercept tells us the mean rating for `winter`, and the coefficients for the other categories are the difference between the value for `winter` and the mean of the target for `fall`, `summer` and `spring`. 

In other models, we include all categories in the model. The model learns how to use them best and might only consider some or one of them at a time.
:::

When we encode categories for statistical analysis, we can summarize their impact on the target variable single measure. For a model with only categorical features, we can use an **ANOVA** (@sec-lm-categorical-anova) for this. But a similar approach can also be used for mixed models, splines, and other models. Techniques like SHAP also provide a way to summarize the total effect of a categorical feature.


#### Text Embeddings {#sec-data-embeddings}

When it comes to other character representations like text, we use other methods to represent them numerically. One important way to encode textis through an **embedding**.  This is a way of representing the text as a vector of numbers, at which point the embedding feature is used in the model like anything else. The way to do this usually involves a model itself, one that learns the best way to represent the text or categories numerically. This is commonly used in deep learning and natural language processing in particular. However, embeddings can also be used as a preprocessing step in any modeling situation.

To understand how embeddings work, consider a one-hot encoded matrix for a categorical variable. This matrix then connects to a hidden layer of a neural network, and the weights of that layer are the embeddings for the categorical variable. While this isn't the exact method used (there are more efficient methods that don't require the actual matrix), the concept is the same. In addition, we normally don't even use whole words. Instead, we break the text into smaller units called **tokens**, like characters or subwords, and then use embeddings for those units. [Tokenization](https://huggingface.co/learn/nlp-course/en/chapter6/5) is is used in many of the most successful models in the  in natural language processing domain, including those such as [ChatGPT](https://www.youtube.com/watch?v=zduSFxRajkE).


```{r}
#| echo: false
#| eval: false
#| label: half-nn-graph-setup

g = DiagrammeR::grViz('img/graphical-embedding.dot')

g %>%
    DiagrammeRsvg::export_svg() %>% charToRaw() %>% rsvg::rsvg_svg("img/graphical-embedding.svg")
```

![Conceptual example of an embedding](img/graphical-embedding.svg){width=25% #fig-embedding}


#### Multiclass Targets {#sec-data-multiclass}


We've talked and demonstrated binary targets a lot, but what about when there are more than two classes?  In statistical settings, we can use a **multinomial regression**, which is a generalization of (binomial) logistic regression to more than two classes via the multinomial distribution. Depending on the tool, you may have to use a *multivariate* target of the counts, though most commonly they would be zeros and ones for a classification model, which then is just a one-hot encoded target.  The following table demonstrates how this might look.

```{r}
#| echo: false
#| label: tbl-multinomial-data
#| tbl-cap: Multinomial data example

tibble(
    x1 = rnorm(5),
    x2 = rpois(5, 5),
    target = c("A", "C", "C", "B", "A"),
    `Class A` = c(1, 0, 0, 0, 1),
    `Class B` = c(0, 0, 0, 1, 0),
    `Class C` = c(0, 1, 1, 0, 0)
) |> 
    gt() |> 
    fmt_number(columns = -x1, decimals = 0) |> 
    cols_align("center")
```

With Bayesian tools, it's common to use the **[categorical distribution](https://en.wikipedia.org/wiki/Categorical_distribution)**, which is a different generalization of the Bernoulli distribution to more than two classes. Unlike the multinomial, it is not a count distribution, but an actual distribution over categories. 

In the machine learning context, we can use a variety of models we'd use for binary classification. How the model is actually implemented will depend on the tool, but one of the more popular methods is to use **one-vs-all** or **one-vs-one** strategies, where you treat each class as the target in a binary classification problem. In the first case you would have a model for each class that predicts whether an observation is in that class or not. In the second case, you would have a model for each pair of classes. You should generally be careful with either approach if interpretation is important, as it can make the feature effects very difficult to understand.  As an example, we can't expect feature X to have the same effect on the target in a model for class A vs B, as it does in a model for class A vs. (B & C) or A & C. It also can be misleading when the models are conducted as if the categories are independent.

Regardless of the context, interpretation is now spread across multiple target output, and so it can be difficult to understand the overall effect of a feature on the target. Even in the statistical setting, you now have coefficients that regard *relative* effects for one class versus a reference group, and so they cannot tell you a general effect of a feature on the target. This is where tools like marginal effects and SHAP can be useful (@sec-knowing).

#### Multilabel Targets {#sec-data-multilabel}

Multilabel targets are a bit more complicated, and are not as common as multiclass targets. In this case, each observation can have multiple labels. For example, if we wanted to predict genre based on the movie review data, we could choose to allow a movie to be both a comedy and action film, a sci-fi horror, or a romantic comedy. In this setting, labels are not mutually exclusive.


:::{.callout-note title='Categorical Objective Functions' collapse="true"}
In many situations where you have a categorical target you will use a form of **cross-entropy loss** for the objective function.  You may see other names such as *log loss* or *logistic loss* or *negative log likelihood* depending on the context, but usually it's just a [different name for the same underlying objective](https://en.wikipedia.org/wiki/Cross-entropy#Cross-entropy_loss_function_and_logistic_regression).
:::


### Ordinal Variables {#sec-data-ordinal}

So far in our discussion of categorical data, it's assumed to have no order. But it's quite common to have labels like "low", "medium", and "high", or "very bad", "bad", "neutral", "good", "very good", or simply are a few numbers, like ratings from 1 to 5. **Ordinal data** is categorical data that has a known ordering, but which still has arbitrary labels. Let us repeat that, *ordinal data is categorical data*.  

#### Ordinal Features {#sec-data-ordinal-features}

The simplest way to treat ordinal features is as if they were numeric.  If you do this, then you're just pretending that it's not categorical, and this is usually fine.  Most of the transformations we mentioned probably aren't going to be as useful, but you can still use them if you want. For example, logging ratings 1-5 isn't going to do anything for you model-wise, but it technically doesn't hurt anything.  But you should know that typical statistics like means and standard deviations don't really make sense for ordinal data, so the main reason for treating them as numeric is for modeling convenience.  

If you choose to treat an ordinal feature as categorical, you can ignore the ordering and do the same as you would with categorical data.  This would allow for some nonlinearity since the category means whatever they need to be. There are also some specific techniques to coding ordinal data for use in linear models, but they are not commonly used, and they generally aren't going to help the model performance or interpreting the feature, so we do not recommend them. You could however use old-school **effects coding** that you would incorporate traditional ANOVA models, but again, you'd need a good reason to do so.

The take home message for ordinal features is generally simple. Teat them as you would numeric  features or non-ordered categorical features. Either is fine.


#### Ordinal Targets {#sec-data-ordinal-targets}

Ordinal targets can be trickier to deal with. If you treat them as numeric, you're assuming that the difference between 1 and 2 is the same as the difference between 2 and 3, and so on.  This is probably not true.  If you treat them as categorical and use standard models for that setting, you're assuming that there is no connection between categories. So what should you do?  

There are a number of ways to model ordinal targets, but probably the most common is the **proportional odds model**.  This model can be seen as a generalization of the logistic regression model, and is very similar to it, and actually identical if you only had two categories. It basically is a model of 2 vs. 1, 3 vs. (2, 1), 4 vs. (3, 2, 1), etc. But other models beyond proportional odds are also possible, and your results could return something that gives coefficients for the model for the 1-2 category change, the 2-3 category change, and so on.  

Ordinality of a categorical outcome is largely ignored in machine learning applications. The outcome is either treated as numeric or multiclass classification. This is not necessarily a bad thing, especially if prediction is the primary goal. But if you need a categorical prediction, treating the target as numeric means you have to make an arbitrary choice to classify the predictions. And if you treat it as multiclass, you're ignoring the ordinality of the target, which may not work as well.


#### Rank Data {#sec-data-rank}

Though ranks are ordered, with rank data we are referring to cases where the observations are uniquely ordered. An ordinal vector with numeric labels could be something like [2, 1, 1, 3, 4, 2], whereas rank data would be [2, 1, 3, 4, 5, 6].  For example, in sports, the actual finish of the runners in a race is what we'd want to predict. Assuming you have a modeling tool that actually handles this situation, the objective will be different than other scenarios. Statistical modeling methods include using the Plackett-Luce distribution (or the simpler variant Bradley-Terry model). In machine learning, you might use so-called [learning to rank methods](https://en.wikipedia.org/wiki/Learning_to_rank), like the [RankNet and LambdaRank algorithms](https://icml.cc/Conferences/2015/wp-content/uploads/2015/06/icml_ranking.pdf), and other variants for [deep learning models](https://github.com/tensorflow/ranking).


## Missing Data {#sec-data-missing}


```{r}
#| echo: false
#| 
#| label: fig-missing-data
set.seed(123)
df <- data.frame(
    x1 = rpois(5, 5),
    x2 = rpois(5, 3),
    x3 = sample(c("A", "B", "C"), 5, replace = TRUE)
) |>
    mutate(across(everything(), as.character)) |>
    as.matrix()

# Randomly introduce missing values
df[sample(1:15, 4)] <- NA

# Print the resulting data frame
df |> 
    as_tibble() |>
    mutate(across(everything(), \(x) replace_na(x, '?'))) |>
    gt()

```


Missing data is a common challenge in data science, and there are a number of ways to deal with it, usually by substituting, or **imputing**, the substituted value for the missing one. Here we'll provide an overview of common techniques to deal with missing data.


### Complete Case Analysis {#sec-data-missing-complete}

The first way to deal with missing data is the simplest - **complete case analysis**. Here we only use observations that have no missing data and drop the rest.  Unfortunately, can lead to a lot of lost data, and can lead to biased statistical results if the data is not **missing completely at random**.  There are special cases of some models that by their nature can ignore the missingness under an assumption of **missing at random**, but even those models would likely benefit from some sort of imputation. If you don't have much missing data though, dropping the missing data is fine for practical purposes[^dropmiss]. How much is too much? Unfortunately that depends on the context, but if you have more than 10% missing, you should probably be looking at alternatives.

[^dropmiss]: While many statisticians will possibly huff and puff at the idea of dropping data, there are two things to consider. With minimal missingness you'll likely never come to a different conclusion unless you have very little data to come to a conclusion about, which is already the bigger problem. Secondly, it's impossible to prove one way or another if the data is missing at random, because doing so would require knowing the missing values.

### Single Value Imputation {#sec-data-missing-single}

**Single value imputation**: Replace missing values with the mean, median, mode or some other typical value of the feature.This will probably rarely help your model for a variety of reasons. Consider a numeric feature that is 50% missing, and for which you replace the missing with the mean. How good do you think that feature will be when at least half the values are identical? Whatever variance it normally would have and share with the target is probably reduced, and possibly dramatically. Furthermore, you've also attenuated correlations it has with the other features, which may mute other modeling issues that you would otherwise deal with in some way (e.g. collinearity), or cause you to miss out on interactions. 

Single value imputation makes perfect sense if you *know* that the missingness should be a specific value, like a count feature where missing means a count of zero. If you don't have much missing data, it's unlikely this would have any real benefit over complete case analysis, *except* if it allows you to use all the other features that would otherwise be dropped. But then, why not just drop this feature and keep the others?


### Model-based Imputation {#sec-data-missing-model}

**Model-based imputation** is more complicated, but can be very effective. In essence, you run a model for complete cases in which the feature with missing values is now the target, and all the other features and primary target are used to predict it. You then use that model to predict the missing values, and use those predictions as the imputed values. After these predictions are made, you move on to the next feature and do the same. There are no restrictions on which model you use for which feature. If the other features in the imputation model also have missing data, you can use something like mean imputation to get more complete data if necessary as a first step, and then when their turn comes, impute those values.

Although the implication is that you would have one model per feature and then be done, you can do this iteratively for several rounds, such that the initial imputed values are then used in subsequent models to reimpute the missing values. You can do this as many times as you want, but the returns will diminish. In this setting, we are assuming you'll have a single value imputed for each missing one, but this approach is the precursor for our next method.

### Multiple Imputation {#sec-data-missing-mi}

**Multiple imputation** (MI) is a more complicated technique, but can be very useful under some situations, depending on what you're willing to sacrifice for having better uncertainty estimates versus a deeper dive into the model. The idea is that you create multiple imputed datasets, each of which is based on the **predictive distribution** of the model used in model-based imputation. Say we use a linear regression assuming a normal distribution to impute feature A. We would then draw repeatedly from the predictive distribution of that model to create multiple datasets with (randomly) imputed values for feature A

Let's say we do this 10 times, and we now have 10 imputed data sets. We now run our actual model of interest on each of these datasets, and final model results are averaged in some way to get final parameter estimates. Doing so acknowledges that your single imputation methods have uncertainty in those imputed values, and that uncertainty is incorporated into the final model results. 

MI can in theory handle any source of missingness and can be a very powerful technique. But it has several drawbacks. One is that you need a a specified target distribution for all imputation models used in order to generate random draws from. Your final model presumably is also a probabilistic model with coefficients and variances you are trying to estimate and understand. MI probably isn't going to help a boosting or deep learning models that have native methods for dealing with missing values, or at least offer little if anything over single value imputation. If you have very large data and a complicated model, you could be waiting a long time, and as modeling is an iterative process itself, this can be rather tedious to work through. Finally, few data or post-model processing tools that you commonly use will work with MI results, especially visualization tools. So you will have to hope that whatever package you use for MI has what you need. As an example, you'd have to figure out how you're going to impute interaction terms if you have them.

*Practically speaking*, MI takes a lot of effort to often come to the same conclusions you would have with a single imputation method, or possibly fewer conclusions for anything beyond GLM coefficients and their standard errors. But if you want your uncertainty estimate for those models to be better, MI can be an option.


### Bayesian Imputation {#sec-data-missing-bayesian}

One final option is to run a Bayesian model where the missing values are treated as parameters to be estimated, and they would have priors just like other parameters as well. MI is basically a variant of Bayesian imputation that can be applied to the non-Bayesian model setting, so why not just use the actual Bayesian method? Some modeling packages can allow you to try this very easily, and it can be very effective. But it is also very computationally intensive, and can be very slow as you may be increasing the number of parameters to estimate dramatically. At least it would be more fun than standard MI!


## Class Imbalance {#sec-data-class-imbalance}

:::{.content-visible when-format="html"}
```{r}
#| echo: false
#| label: fig-class-imbalance
#| fig-cap: Class Imbalance

p = df_reviews |>
    mutate(y = ifelse(rating < 2, "Class A", "Class B")) |>
    count(y) |> 
    mutate(prop = n / sum(n)) |>
    ggplot(aes(x = y, y = prop)) +
    geom_col(width = .05) +
    geom_point(aes(y = prop), size = 10, alpha = 1, color = okabe_ito[2]) +
    coord_flip() +
    scale_y_continuous(breaks = seq(0, 1, .1), labels = scales::percent_format()) +
    labs(x = "", y = "", title = "")

p

ggsave(
    "img/data-class_imbalance.svg", 
    p, 
    width = 6, 
    height = 4
)
```
:::

:::{.content-visible when-format="pdf"}
![](img/data-class_imbalance.svg)(img/data-class_imbalance.svg){width=100% #fig-class-imbalance}
:::

**Class imbalance** refers to the situation where the target variable has a large difference in the number of observations in each class.  For example, if you have a binary target, and 90% of the observations are in one class, and 10% in the other, you would have class imbalance. You'll almost never see a 50/50 split in the real world, but the issue is that as we move further away from that point, we can start to see problems in model estimation, prediction, and interpretation. With our example, if we just predict the majority class in a binary classification problem, our accuracy would 90%! Under other circumstances that might be a great result, but in this case it's not. So right off the bat one of our favorite metrics to use for classification models isn't going to help us much. 


For classification problems, *class imbalance is the rule, not the exception*. This is because nature just doesn't sort itself into nice and even bins.  The majority of people of a random sample do not have cancer, the vast majority of people have not had a heart attack in the past year, most people do not default on their loans, and so on.

There are a number of ways to help deal with class imbalance, and the method that works best will depend on the situation. Some of the most common are:

- **Use different metrics**: Use metrics that are less affected by class imbalance, such as area under a receiver operating characteristic curve (AUC), or those that balance the tradeoff between precision and recall, like the F1 score, or balance recall and true negative rate, like the balanced accuracy score.
- **Oversampling/Undersampling**: Randomly sample from the minority (majority) class to increase (decrease) the number of observations in that class. 
- **Weighted objectives**: Weight the loss function to give more weight to the minority class. Although commonly employed, and a simple thing to use with models like lightgbm and xgboost, it often fails to help, and can cause other issues.
- **Thresholding**: Change the threshold for classification to be more sensitive to the minority class. Nothing says you have to use 0.5 as the threshold for classification, and you can change it to be more sensitive to the minority class. This is a very simple approach, and may be all you need.

These are not necessarily mutually exclusive. For example, in the imbalanced setting it's probably a good idea to switch your focus to a metric besides accuracy even as you employ other techniques.


### Calibration Issues in Classification {#sec-data-calibration}

Probability **calibration** is often an issue in classification problems, and is a bit more complicated than just class imbalance, but is often discussed in the same setting. Having calibrated probabilities refers to the situation where the predicted probabilities of the target match up well to the actual probabilities. For example, if you predict that 10% of people will default on their loans, and 10% of people actually do default on their loans, one would say your model is well calibrated. Conversely, if you predict that 10% of people will default on their loans, but 20% of people actually do default on their loans, your model is not so well-calibrated. 

One way to assess calibration is to use a **calibration curve**, which is a plot of the predicted probabilities vs. the observed proportions. We bin the observations in some way, and then calculate the average predicted probability and the average observed proportion of the target in each bin. If the model is well-calibrated, the points should fall along the 45-degree line. If the model is not well-calibrated, the points will fall above or below the line. In the following, one model seems to align well with the observed proportions based on the chosen bins. The other model (dashed line) is not so well calibrated, and is overshooting with its predictions. For example, if the model's average prediction predicts a 0.5 probability of default, the actual proportion of defaults is only around 0.6.


```{python}
#| echo: false
#| eval: false
#| label: py-calibration-plot-data

# reticulate basically refused to run this code properly except under random situations. see bootstrap_calibration notebook.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.calibration import calibration_curve
from sklearn.linear_model import LogisticRegression

# Generate synthetic data
X, y = make_classification(n_samples=1500, n_features=10, random_state=42)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)


# Get predicted probabilities on the test set
y_pred_prob = model.predict_proba(X_test)[:, 1]

# Compute calibration curve
true_prob, pred_prob = calibration_curve(y_test, y_pred_prob, n_bins=10)

# Create data frame
df = pd.DataFrame({
    'true_prob': true_prob, 
    'pred_prob': pred_prob, 
})

# write to file
df.to_csv("data/basic_calibration.csv", index=False)
```

```{python}
#| echo: false
#| eval: false
#| label: py-bootstrap-calibration-plot-data

# Set the number of bootstrap samples
n_bootstrap = 100

# Initialize arrays to store bootstrap results
bootstrap_true_prob = np.zeros(shape=(n_bootstrap, len(true_prob)))
bootstrap_pred_prob = np.zeros(shape=(n_bootstrap, len(pred_prob)))


# Perform bootstrap sampling

for i in np.arange(n_bootstrap):
    # Split the data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train a logistic regression model
    model = LogisticRegression()
    model.fit(X_train, y_train)

    # Generate bootstrap sample
    bootstrap_indices = np.random.choice(len(X_test), size=len(X_test), replace=True)
    bootstrap_X_test = X_test[bootstrap_indices]
    bootstrap_y_test = y_test[bootstrap_indices]
    bootstrap_y_pred_prob = model.predict_proba(bootstrap_X_test)[:, 1]
    
    # Compute calibration curve for bootstrap sample
    bootstrap_true_prob[i, :], bootstrap_pred_prob[i, :] = calibration_curve(
        bootstrap_y_test, 
        bootstrap_y_pred_prob, 
        n_bins=10
    )

# Calculate mean and confidence intervals
mean_true_prob = np.mean(bootstrap_true_prob, axis=0)
mean_pred_prob = np.mean(bootstrap_pred_prob, axis=0)
lower_bound_true = np.percentile(bootstrap_true_prob, 2.5, axis=0)
upper_bound_true = np.percentile(bootstrap_true_prob, 97.5, axis=0)
lower_bound_pred = np.percentile(bootstrap_pred_prob, 2.5, axis=0)
upper_bound_pred = np.percentile(bootstrap_pred_prob, 97.5, axis=0)

df_true = pd.DataFrame({
    'prob': mean_true_prob, 
    'lower': lower_bound_true,
    'upper': upper_bound_true,    
})

df_pred = pd.DataFrame({
    'prob': mean_pred_prob, 
    'lower': lower_bound_pred,
    'upper': upper_bound_pred,    
})

df_true['type'] = 'true'
df_pred['type'] = 'pred'

p_dat = pd.concat([df_true, df_pred])
p_dat.to_csv("data/bootstrap_calibration.csv", index=False)
```

:::{.content-visible when-format="html"}
```{r}
#| echo: false
#| label: fig-calibration-plot
#| fig-cap: Calibration Plot
p_dat = read_csv("data/basic_calibration.csv")

set.seed(123)
nb = 5
p_dat = tibble(
    true_prob = rep(seq(0,1, length.out = nb), 2),
    pred_prob = c(
        seq(0,1, length.out = nb) + rnorm(nb, sd=.05), 
        c(seq(0,1, length.out = nb)^.5) #+ rnorm(nb, sd=.1))
    ),
    pred_type = c(rep("better", nb), rep("worse", nb))
)

p = p_dat |>
    # pivot_longer(cols = c(pred_prob, pred_worse), names_to = "prob_type", values_to = "pred_prob") |>
    ggplot(aes(x = pred_prob, y = true_prob, color = pred_type)) +
    geom_abline(intercept = 0, slope = 1, color = 'gray', lwd= 1.5) +
    geom_line(aes(linetype = pred_type)) +
    geom_point(size = 3, alpha = 1) +
    scale_x_continuous(breaks = seq(0, 1, .1)) +
    scale_y_continuous(breaks = seq(0, 1, .1)) +
    scale_color_manual(values = c(okabe_ito[2], okabe_ito[1])) +
    labs(x = "Average (Binned) Probability", y = "Observed (Binned)\nProportions", title = "")

p

ggsave(
    "img/data-calibration_plot.svg", 
    p, 
    width = 6, 
    height = 4
)
```
:::

:::{.content-visible when-format="pdf"}
![](img/data-calibration_plot.svg)(img/data-calibration_plot.svg){width=100% #fig-calibration-plot}
:::


While the issue is an important one, it's good to keep the issue of calibration and imbalance separate. As miscalibration implies bias, bias can happen irrespective of the class proportions, and can be due to a variety of factors related to the model, target, or features. Furthermore miscalibration is not inherent to any particular model. 

Furthermore, the assessment of calibration in this manner has a few issues. For one, the observed 'probabilities' are proportions based on arbitrarily chosen bins, and the observed values are measured with some **measurement error** and have a natural variability with will partly reflect sample size[^testcalibbinsize]. These plots are often presented such that observed proportions are labeled as the "true" probabilities. However, you do not have the *true* probabilities, just the observed class labels, so whether your model's predicted probabilities match observed proportions is a bit of a different question. The predictions obviously have uncertainty as well, and this will depend on the model, sample size, and other factors. And finally, the number of bins chosen can also affect the appearance of the plot in a notable way if the sample size is small.

[^testcalibbinsize]: Note that each bin will reflect the portion of the test set size in this situation. If you have a small test set, the observed proportions will be more variable, and the calibration plot will be more variable as well. 


All this is to say that each point in a calibration plot, along with the reference line itself, has some error bar around it, and the differences between models and the 'best case scenario' would need additional steps to suss out if we are interested in doing so in a statistically rigorous fashion. Some methods are available to calibrate probabilities, but they are not commonly implemented in practice, and often involve a model-based technique, with all of its own assumptions and limitations. It's also not exactly clear that forcing your probabilities to be on the line is helping solve the actual modeling goal in any way[^practicalprob]. But if you are interested, you can read more [here](https://scikit-learn.org/stable/modules/calibration.html).

[^practicalprob]: Oftentimes we are only interested in the ordering of the predictions, and not the actual probabilities. For example, if we are trying to identify the top 10% of people most likely to default on their loans, we'll just take the top 10% of predictions, and the actual probabilities are irrelevant for that goal.


```{r}
#| echo: false
#| eval: false
#| label: r-bootstrap-calibration-plot
#| fig-cap: Bootstrap Calibration Plot

# ultimately I think this gets too far afield
p_dat = read_csv("data/bootstrap_calibration.csv")

p_dat = p_dat |> 
    filter(type == "true") |>
    rename(
        true_prob = prob, 
        true_lower = lower, 
        true_upper = upper
    ) |>
    bind_cols(
        p_dat |> 
            filter(type == "pred") |>
            select(-type) |>
            rename(pred_prob = prob, pred_lower = lower, pred_upper = upper)
    )


p_dat |>
    ggplot(aes(x = pred_prob, y = true_prob)) +
    geom_abline(intercept = 0, slope = 1, linetype = 'dashed', color = 'gray', lwd= 1) +
    geom_segment(aes(x = pred_lower, y = true_prob, xend = pred_upper, yend = true_prob), color = "gray", size = 0.5) +
    geom_segment(aes(x = pred_prob, y = true_lower, xend = pred_prob, yend = true_upper), color = "gray", size = 0.5) +
    geom_line() +
    geom_point(size = 3, alpha = 1) +
    scale_x_continuous(breaks = seq(0, 1, .1)) +
    scale_y_continuous(breaks = seq(0, 1, .1)) +
    scale_color_manual(values = c(okabe_ito[2], okabe_ito[1])) +
    labs(x = "Avg (Binned) Probability", y = "Observed (Binned)\nProportions", title = "")

```


## Censoring and Truncation {#sec-data-censoring}


:::{.content-visible when-format="html"}
```{r}
#| echo: false
#| label: fig-censoring
#| fig-cap: Censoring for Time Until Death

# Create a data frame with age and censoring information
df = tibble(
    person =  LETTERS[1:5],
    age = c(53, 72, 80, 65, 83),
    censoring = c("End of Life", "Censored", "Censored", "End of Life", "Censored"))

# Plot the potential censoring

p = ggplot(df, aes(x = person, age, fill = censoring, color = censoring)) +
    geom_col(width = .05, alpha = .5, show.legend=FALSE) +
    geom_segment(
        aes(x = person, y = age, xend = person, yend = 85), 
        linetype = "dashed", 
        color = "gray50",
        data = df |> filter(censoring == "Censored")
    ) +
    geom_point(size = 5, alpha = 1) +
    lims(y = c(0, 90)) + 
    labs(x = "Person", y = "Age", title = "") +
    scale_fill_manual(values = okabe_ito, aesthetics = c('color', 'fill')) +
    coord_flip()

p

ggsave(
    "img/data-censoring.svg", 
    p, 
    width = 6, 
    height = 4
)
```
:::

:::{.content-visible when-format="pdf"}
![Censoring for Time Until Death](img/data-censoring.svg)(img/data-censoring.svg){width=100% #fig-censoring}
:::


Sometimes, we just don't see all the data there is to see. **Censoring** is a situation where the target variable is not fully observed. This is common in techniques where the target is the time to an event, like 'death', but the event has not yet occurred for some observations (thankfully!). Specifically this is called **right censoring**, and is the most common type of censoring, and depicted in the above plot, where several individuals are only observed to a certain age and were still alive at that time. There is also [**left censoring**](https://stats.stackexchange.com/questions/144037/right-censoring-and-left-censoring), where the censoring happens from the other direction, and data before a certain point is unknown. Finally, there is **interval censoring**, where the event of interest occurs within some interval, but the exact value is unknown. 

[^eventhistory]: Survival analysis is also called event history analysis, and is widely used in biostatistics, sociology, demography, and other disciplines where the target is the time to an event, such as death, marriage, divorce, etc.

**Survival analysis**[^eventhistory] is a common modeling technique in this situation, but you may also be able to keep things even more simple via something like [tobit regression](https://m-clark.github.io/models-by-example/tobit.html).  In this model, you assume that the target is fully observed, but that the values are censored, and you model the probability of censoring. This is a common technique in econometrics, and can keep you in a traditional linear model context.  

:::{.content-visible when-format="html"}
```{r}
#| echo: false
#| label: fig-truncation
#| fig-cap: Truncation
library(ggplot2)

set.seed(1234)

# Generate a random sample from a normal distribution
data = rpois(100, 60)

# Censor some of the values
key_value = 70

# Create a data frame for plotting
df = tibble(
    data = data, 
    censored = factor(if_else(data > key_value, "Unobserved", "Observed")),
)

# Plot the original data and the censored data
p = ggplot(df, aes(x = data)) +
    geom_bar(
        aes(
            alpha = after_stat(x < key_value),
            group = after_scale(fill)
        ), 
        # alpha = 0.5,
        color = NA,
        show.legend = FALSE,
    ) +
    geom_vline(xintercept = key_value, color = "gray", size = 1) +
    annotate(
        geom = "text",
        x = key_value + 1, 
        y = 3, 
        label = "Unobserved", 
        color = "gray", 
        size = 6,
        hjust = 0,
    ) +
    scale_fill_manual(values = c(okabe_ito)) +
    labs(x = "Ages", y = "", title = "") +
    theme(
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
    )

p

ggsave(
    "img/data-truncation.svg", 
    p, 
    width = 6, 
    height = 4
)
```
:::

:::{.content-visible when-format="pdf"}
![](img/data-truncation.svg)(img/data-truncation.svg){width=100% #fig-truncation}
:::


**Truncation** is a situation where the target variable is only observed if it is above or below some value, even though we know other possibilities exist. One of the issues is that default distributional methods assume a distribution that is not necessarily bounded in the way that the data exhibits. In our plot above, we restrict our data to 70 and below for practical or other reasons, but typical modeling methods predicting age would not respect that.

You could truncate predictions after the fact, but this is a bit of a hack, and often results in lumpiness in the predictions at the boundary that likely isn't realistic. Alternatively, Bayesian methods allow you to model the target as a distribution with truncated distributions, and so you can model the probability of the target being above or below some value. There are also models such as **hurdle models** that might prove useful where the truncation is theoretically motivated, e.g. a zero-inflated Poisson model for count data where the zero counts are due to a separate process than the non-zero counts.


:::{.callout-note title="Censoring vs. Truncation" collapse="true"}
One way to distinguish censored and truncated data is that censored data is usually due to some external process such that the target is not observed, but could be possible (capping reported income at $1 million). Truncated data, on the other hand, is due to some internal process that prevents the target from being observed, and is often derived from sample selection (we only want to model non-millionaires). We would not want predictions past the censored point to be unlikely, but we would want predictions past the truncated point to be impossible. Trickier still is that for bounded or truncated distributions that might be applied to the truncated scenario, such as folded vs. truncated distributions, they would not result in the same probability distributions even if they can be applied to the same data situation.


![Folded and Truncated Distributions](img/trunc_vs_folded.png){width=33%}

Image from [StackExchange]((https://stats.stackexchange.com/questions/16089/is-sampling-from-a-folded-normal-distribution-equivalent-to-sampling-from-a-norm)
)

:::


## Time Series {#sec-data-time}

:::{.content-visible when-format="html"}
```{r}
#| echo: false
#| label: fig-time-series
#| fig-cap: Time Series Data

set.seed(123)

# Generate date sequence
dates <- seq(as.Date("2022-01-01"), as.Date("2022-12-31"), by = "day")

# Generate increasing trend
trend <- seq(1, length(dates))

# Generate seasonality
seasonality <- sin(2 * pi * (1:length(dates)) / 365 * 7) *10000

# Combine trend and seasonality
y <- trend + trend^2 + seasonality


p = tibble(date = dates, value = y) |> 
    ggplot(aes(x = date, y = value)) +
    geom_line(linewidth=3, color = okabe_ito[1]) +
    # geom_smooth(se = TRUE, color = okabe_ito[1], method = 'gam', formula = y ~ s(x, bs = 'cs', k = 25)) +
    geom_smooth(se = TRUE, method = 'gam') +
    labs(x = "", y = "") +
    scale_y_continuous(labels = NULL) +
    scale_x_date(date_labels = "%b", date_breaks = "1 month") +
    theme(
        axis.ticks.y = element_blank(),
    )

p

ggsave(
    "img/data-time_series.svg", 
    p, 
    width = 8, 
    height = 6
)
```
:::

:::{.content-visible when-format="pdf"}
![](img/data-time_series.svg)(img/data-time_series.svg){#fig-time-series}
:::


Time series data is any data that incorporates values over a period of time. This could be something like the a state's population over years, or the max temperature of an area over days. Time series data is very common in data science, and there are a number of ways to model such data.

### Time-based Targets

As in other settings, the most common approach when the target is some value that varies time is to use a linear model of some kind. While the target varies over time, the features may be time-varying or not. There are traditional autoregressive models that use the target's past values as features, for example, autoregressive moving average (**ARIMA**) models. Others can incorporate historical information in other ways such as is done in Bayesian methods to marketing data or in reinforcement learning (@sec-ml-more-reinforcement). Still others can get quite complex, such **recurrent neural networks** and their generalizations that form the backbone of modern AI models. Lately [transformer-based models](https://research.google/blog/a-decoder-only-foundation-model-for-time-series-forecasting/) have looked promising.

**Longitudinal data**[^paneldata] is a special case of time series data, where the target is a function of time, but it is typically grouped in some fashion. An example is a model for school performance for students over several semesters, where values are clustered within students over time. In this case, you can use some sort of time series regression, but you can also use a **mixed model** (@sec-lm-extend-mixed-models), where you model the target as a function of time, but also include a random effect for the grouping variable, in this case, students.  This is a very common approach in many domains, and can be very effective in terms of performance as well. It can also be used for time series data that is not longitudinal, where the random effects are based on autoregressive covariance matrices. In this case, an ARIMA component is added to linear model as a random effect to account for the time series nature of the data. This is fairly common in Bayesian contexts.

In general lots of models can be found specific to time series data, and the choice of model will depend on the data, the questions we want to ask, and the goals we have.

:::{.callout-tip title="Time Series vs. Longitudinal" collapse="true"}
The primary distinguishing feature for referring data as 'time-series' and 'longitudinal' is the number of time points, where the latter typically has relatively few. This arbitrary though.
:::


[^paneldata]: You may also hear the term **panel data** in econometrics-oriented disciplines.


### Time-based Features

When it comes to time-series features, we can apply time-specific transformations. One technique is the [**fourier transform**](https://betterexplained.com/articles/an-interactive-guide-to-the-fourier-transform/), which can be used to decompose a time series into its component frequencies, much like how we use PCA (@sec-ml-more-unsuper). This can be useful for identifying periodicity in the data, which can be used as a feature in a model. 

In marketing contexts, some perform **adstocking** with features. This method models the delayed effect of features over time, such that they may have their most important impact immediately, but still can impact the present target from past values. For example, a marketing campaign might have the most significant impact immediately after it's launched, but it can still influence the target variable at later time points. Adstocking helps capture this delayed effect without having to include multiple lagged features in the model.  That said, including lagged features is also an option. In this case, you would have a feature for the current time point (*t*), the same feature for the previous time point (*t-1*), the feature for the time point before that (*t-2*), and so on.

If you have the year as a feature, you can use it as a numeric feature or as a categorical feature. If you treat it as numeric, you need to consider what a zero means. In a linear model, the intercept usually represents the outcome when all features are zero. But with something like year, a zero year isn't meaningful in most contexts. To solve this, you can shift the values so that the earliest time point, like the first year in your data, becomes zero. This way, the intercept in your model will represent the outcome for this first time point, which is more meaningful. The same goes if you are using months or days as a numeric feature. It doesn't really matter which year/month/day is zero, just that zero refers to one of the actual time points observed.

Dates and/or times can be it bit trickier. Often you can just split dates out into year, month, day, etc., and proceed as discussed. In other cases you'd want to track the time period to assess possible seasonal effects. You can use something lie a **cyclic** approach (e.g. [cyclic spline or sine/cosine transformation](https://developer.nvidia.com/blog/three-approaches-to-encoding-time-information-as-features-for-ml-models/)) to get at yearly or within-day seasonal effects. As mentioned, a fourier transform can also be used to decompose the time series into its component frequencies for use as model features. Time components like hours, minutes, and seconds can often be dealt with in similar ways, but you will more often deal with the periodicity in the data. For example, if you are looking at hourly data, you may want to consider the 24-hour cycle.

:::{.callout-note title="Calendars are hard." collapse="true"}
*Weeks are not universal*. Some start on Sunday, others Monday. Some data contexts only consider weekdays. [Some systems](https://en.wikipedia.org/wiki/ISO_week_date) may have 52 or 53 weeks in a year, and dates may not be in the same week from one year to the next, etc. So use caution when considering weeks as a feature. 
:::

#### Covariance Structures

In many cases you'll have features that vary over time but are not a time-oriented feature like year or month. For example, you might have a feature that is the number of people who visited a website over days. This is a time-varying feature, but it's not a time metric in and of itself. 

In general, we'd like to account for the time-dependent correlations in our data, and the main way to do so is to posit a covariance structure that accounts for this in some fashion.  This helps us understand how data points are related to each other over time, and requires us to estimate the correlations between observations. As a starting point, consider linear regression. In a standard linear regression model, we assume that the samples are independent of one another, with a constant variance and no covariance. 

Instead, we can also use something like a **mixed model**, where we include a random effect for each group and estimate the variance attributable to the grouping effect. By default, this ultimately assumes a constant correlation from time point to time point, but many tools allow you to specify a more complex covariance structure. A common method is to use autoregressive covariance structure that allows for correlations further apart in time to lessen. In this sense the covariance comes in as an added *random effect*, rather than being a model in and of itself as with ARIMA. Many such approaches to covariance structures are special cases of **gaussian processes**, which are a very general technique to model time series, spatial, and other types of data. 

:::{.content-visible when-format="html"}
```{r}
#| echo: false
#| label: fig-covariance-structure
#| fig-cap: AR (1) Covariance Structure Visualized
library(nlme)
library(lme4)
model_corAR <- lme(Reaction ~ 1, random= ~1|Subject, data = sleepstudy, correlation = corAR1(form=~Days))
df = tibble(id = rep(1:2, e= 12), Months = rep(1:12, 2))
model_corAR <- lme(rnorm(24) ~ 1, random= ~1|id, data = df, correlation = corAR1(form=~Months))


# phi = abs(coef(model_corAR$modelStruct$corStruct, unconstrained = F)*3)
phi = .66
init = corMatrix(Initialize(corAR1(phi, form = ~1|id), data=df))
rc_ar = bdsmatrix::bdsmatrix(rep(12, 2), blocks = rep(init[[1]][lower.tri(init[[1]], diag = T)], 2)) %>% 
  as.matrix()
 
dates = seq.Date(as.Date("2022-01-15"), as.Date("2022-12-15"), by = "month")
dates = factor(month.abb, levels = month.abb)
l = length(dates)

p = tibble(x = rep(dates, e=l), y = rep(dates, times=l)) |> 
    mutate(
        z = c(rc_ar[1:l, 1:l]),
        z = ifelse(z == 0, NA, z)
    ) |>
    ggplot(aes(x = x, y = y)) +
    geom_tile(aes(fill=z), color = 'white', linewidth=3,show.legend = FALSE) +
    # scale_x_date(breaks = '3 months', labels = label_date_short(format = c('%b'))) +
    # scale_y_date(breaks = '3 months', labels = label_date_short(format = c('%b'))) +
    # scale_x_date(breaks = '1 months', labels = date_format(format = c('%b'))) +
    # scale_y_date(breaks = '1 months', labels = date_format(format = c('%b'))) +
    scale_fill_scico(palette = "devon", na.value = "transparent", direction = -1) +
    labs(
        x = '',
        y = '',
        caption = 'A AR(1) structure for months within year. Stronger correlation is indicated by a darker color. Correlations systematically weaken for time points further apart.') +
    # theme_void() +
    theme(
        plot.caption = element_text(hjust = 0, size = 12),
    )

p

ggsave(
    "img/data-covariance_structure.svg", 
    p, 
    width = 6, 
    height = 6
)
# scico::scico_palette_show()
```
:::

:::{.content-visible when-format="pdf"}
![](img/data-covariance_structure.svg)(img/data-covariance_structure.svg){width=100% #fig-covariance-structure}
:::


## Spatial Data {#sec-data-spatial}


```{r}
#| echo: false
#| eval: false
#| label: r-spatial-data
#| fig-cap: Spatial Weighting Applied to the Dallas-Fort Worth Area Census Tracts

# from https://walker-data.com/census-r/spatial-analysis-with-us-census-data.html
library(tidycensus)
library(sf)
library(spdep)
library(tigris)

dfw <- core_based_statistical_areas(cb = TRUE, year = 2020) %>%
    filter(str_detect(NAME, "Dallas")) %>%
    st_transform(32138)

dfw_tracts <- get_acs(
    geography = "tract",
    variables = "B01002_001",
    state = "TX",
    year = 2020,
    geometry = TRUE
) %>%
    st_transform(32138) %>%
    st_filter(dfw, .predicate = st_within) %>%
    na.omit()

neighbors <- poly2nb(dfw_tracts, queen = TRUE)
summary(neighbors)

weights <- nb2listw(neighbors, style = "W")

weights$weights[[1]]
# For Gi*, re-compute the weights with `include.self()`

dfw_tracts$lag_estimate <- lag.listw(weights, dfw_tracts$estimate)
localg_weights <- nb2listw(include.self(neighbors))
dfw_tracts$localG <- localG(dfw_tracts$estimate, localg_weights)
str(dfw_tracts$localG, 0)
dim(dfw_tracts)

dfw_tracts$localG[1:5]

ggplot(dfw_tracts) + 
    geom_sf(aes(fill=as.numeric(dfw_tracts$localG)), color = NA, show.legend = FALSE) + 
    scico::scale_fill_scico(
        palette = "vik", 
        na.value = "transparent", 
        midpoint = 0
        ) +
    #   scale_fill_distiller(palette = "RdYlBu") + 
    theme_void() + 
    labs(
        caption = ''
        # fill = "Local Gi* statistic"
    ) +
    theme(
        plot.caption = element_text(hjust = 0, size = 14),
    )
    

# scico::scico_palette_show()
ggsave(
    "img/data-spatial_data.svg", 
    width = 8, 
    height = 6
)
```
![Spatial Weighting Applied to the Dallas-Fort Worth Area Census Tracts](img/data-spatial_data.svg){#fig-spatial-data}

We visited spatial data in a discussion on non-tabular data (@sec-ml-more-spatial), but here we want to talk about it from a modeling perspective, especially within the tabular domain. Say you have a target that is a function of location, such as the proportion of people voting a certain way in a county, or the number of crimes in a city.  You can use a **spatial regression** model, where the target is a function of location among other features that may or may not be spatially oriented.  Two approaches already discussed may be applied in the case of having continuous spatial features, such as latitude and longitude, or discrete features like county. For the continuous case, we could use a GAM (@sec-gam), where we use a smooth interaction of latitude and longitude. For the discrete setting, we can use a mixed model, where we include a random effect for county.

There are other traditional techniques to spatial regression, especially in the continuous spatial domain, such as using a **spatial lag**, where we incorporate information about the neighborhood of a observation's location into the model (e.g. a weighted mean of neighboring values, as in the visualization above based on code from @walker_analyzing_2023). Such models fall under names such as CAR (conditional autoregressive), SAR (spatial autoregressive), BYM, kriging, and so on. These models can be very effective, but are in general different forms of random effects models very similar to those used for time-based settings, and likewise can be seen as special cases of gaussian processes.


## Multivariate Targets

Often you will encounter settings where the target is not a single value, but a vector of values. This is often called a **multivariate target** in statistical settings, or just the norm for deep learning. For example, you might be interested in predicting the number of people who will buy a product, the number of people who will click on an ad, and the number of people who will sign up for a newsletter. This is a common setting in marketing, where you are interested in multiple outcomes. The main idea is that there is a relationship among the targets, and you want to take this into account.

One model example we've already seen is the case where we have more than two categories for the target. Some default approaches may take that input and just do a one-vs-all, for each category, but this kind of misses the point. Others will simultaneously model the multiple targets in some way. On the other hand, it can be difficult to interpret results with multiple targets. Because of this, you'll often see results presented in terms of the respective targets anyway, and often even ignoring parameters specifically associated with such a model[^manova].

In deep learning contexts, the multivariate setting is ubiquitous. For example, if you want to classify the content of an image, you might have to predict something like different species of animals, or different types of car models. In natural language processing, you might want to predict the probability of different words in a sentence. In some cases, there are even multiple kinds of targets considered simultaneously! It can get very complex, but often in these settings prediction performance far outweighs the need to interpret specific parameters, and so it's a good fit.


[^manova]: It was common in social sciences back in the day to run a Multivariate ANOVA, and then if the result was statistically significant, and mostly because few practitioners knew what to do with the result, they would run separate ANOVAs for each target. 


<!-- :::{.callout-tip }
Some are a little too eager to jump to simultaneously modeling multiple target variables, e.g. in structural equation modeling or multivariate regression. It's not wrong to do so, but given the difficulty our brains have with interpreting results for a single target, you might think twice. However, for scenarios focused much more on prediction performance that involve method like deep learning, it makes more sense, and is actually required.
::: -->


## Latent Variables {#sec-data-latent}

<!-- TODO: maybe update to look like other nnet plots? -->
```{r}
#| echo: false
#| eval: false

g = DiagrammeR::grViz('img/graphical-bifactor.dot')

g |> DiagrammeRsvg::export_svg() %>% charToRaw() %>% rsvg::rsvg_svg("img/graphical-bifactor.svg")
```

![Latent Variable Model (Bifactor)](img/graphical-bifactor.svg){#fig-graphical-bifactor}


Latent variables are a fundamental aspect of modeling, and simply put, are variables that are not directly observed, but are inferred from other variables. Here are some examples of what might be called latent variables:

- The linear combination of features in a linear regression model is a latent variable, but usually we only think of it as such before the link transformation in GLMs (@sec-glm).
- The error term in any model is a latent variable representing all the unknown/unobserved/unmodeled factors that influence the target (@eq-lm-observed).
- The principal components in PCA (@sec-ml-more).
- The measurement error in any feature or target.
- The factor scores in a factor analysis model or structural equation (visualization above).
- The true target underlying the censored values (@sec-data-censoring).
- The clusters in cluster analysis/mixture models (@sec-ml-more-clusters-latent).
- The random effects in a mixed model (@sec-lm-extend-mixed-models).
- The hidden states in a hidden Markov model.
- The hidden layers in a deep learning model (@sec-ml-dl-nn).

It's easy to see from such a list that latent variables are very common in modeling, so it's good to get comfortable with the concept. Whether they're appropriate to your specific situation will depend on a variety of factors, but they can be very useful in many settings, if not a required part of the modeling approach.


## Data Augmentation {#sec-data-aug}

**[Data augmentation](https://aws.amazon.com/what-is/data-augmentation/)** is a technique where you artificially increase the size of your dataset by creating new data points based on the existing data. This is a common technique in deep learning for computer vision, where you might rotate, flip, or crop images to create new training data. This can help improve the performance of your model, especially when you have a small dataset. Techniques are also available for text.

In the tabular domain, data augmentation is less common, but still possible. You'll see it most commonly applied with class-imbalance settings (@sec-data-class-imbalance), where you might create new data points for the minority class to balance the dataset. This can be done by randomly sampling from the existing data points, or by creating new data points based on the existing data points. For the latter, SMOTE and many variants of it are quite common.

Unfortunately for tabular data, these techniques are not nearly as successful, nor consistently so. Part of the issue is that tabular data is very noisy and fraught with measurement error, so in a sense, such techniques are just adding noise to the modeling process[^vscv]. Simple random upsampling can potentially lead to an overconfident model that still doesn't generalize well. In the end, the best approach is to get more and/or better data, but hopefully more successful methods will be developed in the future.

[^vscv]: Compare to the image settings where there is relatively little measurement error, by just rotating an image, you are still preserving the underlying structure of the data.

<!-- TODO: Should we have a section on measurement error? I removed this todo and now I'm putting it back in. -->


## Wrapping Up

There's a lot going on with data before you ever get to modeling, and which will affect every aspect of your modeling approach. This chapter outlines common data types, issues, and associated modeling aspects, but in the end, you'll always have to make decisions based on your specific situation, and they will often not be easy ones. These are only some of the things to consider, so be ready for surprises, and, be ready to learn from them!


### The Common Thread

Many of the transformations and missing data techniques could possibly be applied in many modeling settings. Likewise, you may find yourself dealing with different target variable issues like imbalance or censoring, and deal with temporal, spatial or other structures, in a variety of models. The key is to understand the data, the target, and the features, and to make the best decisions you can based on that understanding.


### Choose Your Own Adventure

Consider revisiting a model covered in the other parts of this book in light of the data issues discussed here. For example,might you deal with class imbalance for a boosted tree model? How would you deal with spatial structure in a neural network? How would you deal with a multivariate target in a time series model?

### Additional Resources

Here are some additional resources to help you learn more about the topics covered in this chapter.

**Transformations**

- [About Feature Scaling and Normalization](https://sebastianraschka.com/Articles/2014_about_feature_scaling.html) (@raschka_about_2014)
- [What are Embedding](https://vickiboykis.com/what_are_embeddings/) (@boykis_what_2023)

**Class Imbalance**

- [Brief Overview](https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data) (Google)
- [Handling imbalanced datasets in machine learning](https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28) (@rocca_handling_2019)
- [Imbalanced Outcomes: Challenges & Solutions](https:strong.io/blog/imbalanced-outcomes-challenges-and-solutions) (@monroe_imbalanced_2024, in preparation)

- [A Gentle Introduction to Imbalanced Classification](https://machinelearningmastery.com/what-is-imbalanced-classification/) (@brownlee_gentle_2019)

**Calibration**

- [Why some algorithms produce calibrated probabilities](https://stats.stackexchange.com/questions/452483/why-some-algorithms-produce-calibrated-probabilities) (StackExchange)
- [Predicting good probabilities with supervised learning](https://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf) (@niculescu-mizil_predicting_2005) 


**Survival, Ordinal and Other Models**

- Regression Modeling Strategies is a great resource for statistical modeling in general (@harrell_regression_2015)

**Time Series**

- [Forecasting: Principles and Practice](https://otexts.com/fpp3/) (@hyndman_forecasting_2021)

**Latent Variables**

There's a ton of stuff, but one of your humble authors has an applied that is not far the conceptual approach of this text. However, we'd recommend more of an 'awareness' as you will likely be more interested in the specific context of your data and model.

[Graphical and Latent Variable Modeling](https://m-clark.github.io/sem/) (@clark_graphical_2018)

**Data Augmentation**

[What is Data Augmentation?](https://aws.amazon.com/what-is/data-augmentation/) (@amazon_what_2024)