-
Notifications
You must be signed in to change notification settings - Fork 7
/
data.qmd
1142 lines (789 loc) · 72.4 KB
/
data.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Dealing with Data {#sec-data}
```{r}
#| label: setup-data
#| include: false
df_reviews = read_csv("data/movie_reviews.csv")
```
It's an inescapable fact that models need data to work. One of the dirty secrets in data science is that getting the data right will do more for your model than any fancy algorithm or hyperparameter tuning. Data is messy, and there are a lot of ways it can be messy. In addition, dealing with a target variable on its terms can lead to more interesting results. In this chapter we'll discuss some of the most common data issues and things to consider. There's a lot to know about data before you ever get into modeling your own, so we'll give you some things to think about it in this chapter.
## Key Ideas
- Data transformations can provide many modeling benefits.
- Label and text-based data still needs a numeric representation, and this can be accomplished in a variety of ways.
- The data type for the target may suggest a particular model, but does not necessitate one.
- The data *structure*, e.g. temporal, spatial, censored, etc., may suggest a particular modeling domain to use.
- Missing data can be handled in a variety of ways, and the simpler approaches are typically not great.
- Class imbalance is a very common issue in classification problems, and there are a number of ways to deal with it.
- Latent variables are everywhere!
### Why this matters {#sec-data-common-why-matters}
Knowing your data is one of the most important aspects of any application of *data* science. It's not just about knowing what you have, but also what you can do with it. The data you have will influence the models you can potentially use, the features you can create and manipulate, and have a say on the results you can expect.
### Good to Know {#sec-data-common-good-to-know}
We're talking very generally about data here, so not much background is needed. The models mentioned are covered in other chapters, or build upon those, but we're not doing any actual modeling here.
## Standard Feature & Target Transformations {#sec-data-transformations}
Transforming variables an provide several benefits in modeling, whether applied to the target, features, or both, and *should regularly be used for most model situations*. Some of these benefits include:
- More comparable feature effects and related parameters
- Faster estimation
- Easier convergence
- Helping with heteroscedasticity
For example, just centering features, i.e. subtracting the mean, provides a more interpretable intercept that will fall within the actual range of the target variable. After centering, the intercept tels us what the value of the target variable is when the features are at their means (or reference value if categorical). Centering also puts the intercept within the expected range of the target, which often makes for easier parameter estimation. So even if easier interpretation isn't a major concern, variable transformations can help with convergence and speed up estimation, so can always be of benefit.
### Numeric variables {#sec-data-numeric}
The following table shows the interpretation of some very common transformations applied to numeric variables- logging and standardizing to mean zero, standard deviation of one.
```{r}
#| echo: false
#| eval: true
#| label: tbl-transformation
#| tbl-cap: Common numeric transformations
#|
# gt doesn't seem to have an actual working answer to latex within cells, though kabel_extra and modelsummary appear to handle it as expected, at least if you just run it. Rendering to pdfseems to fail. tried using \U03B2 for beta, but that didn't work either.
# tibble(
# Target = c("y", "y", "log(y)", "log(y)", "y", "scale(y)", "scale(y)"),
# Feature = c("x", "log(x)", "x", "log(x)", "scale(x)", "x", "scale(x)"),
# Interpretation = c(
# "$\\Delta y = \\beta\\Delta x$",
# "$\\Delta y \\approx (\\beta/100)\\%\\Delta x$",
# "$\\%\\Delta y \\approx 100\\cdot \\beta\\%\\Delta x$",
# "$\\%\\Delta y = \\beta\\%\\Delta x$",
# "$\\Delta y = \\beta\\sigma\\Delta x$",
# "$\\sigma\\Delta y = \\beta\\Delta x$",
# "$\\sigma\\Delta y = \\beta\\sigma\\Delta x$"
# ),
# ) |>
# modelsummary::datasummary_df()
# christ what year is this?
tibble(
Target = c("y", "log(y)", "log(y)", "y", "scale(y)"),
Feature = c("x", "x", "log(x)", "scale(x)", "scale(x)"),
`Change in X` = c("1 unit", "1 unit", "1% change", "1 standard deviation", "1 standard deviation"),
`Change in Y` = c("B unit", "100 * (exp(B) -1) ", "B% change", "B unit", "B standard deviation"),
Benefits = c("Interpretation", "Heteroscedasticity in y", "Interpretation, deal with feature extremes", "Interpretation, estimation", "Interpretation, estimation"),
) |>
gt()
```
For example, it is very common to use **standardized** variables, or simply 'scaling' them. Some also call this **normalizing** but [this term can mean a lot of things](https://en.wikipedia.org/wiki/Normalization_(statistics)), so one should be clear in their communication. If $y$ and $x$ are both standardized, a one unit (i.e. one standard deviation) change in $x$ leads to a $\beta$ standard deviation change in $y$. So, if $\beta$ was .5, a standard deviation change in $x$ leads to a half standard deviation change in $y$. In general, there is nothing to lose by standardizing, so you should employ it often.
Another common transformation, particularly in machine learning, is **min-max scaling**. This involves changing variables to range from some minimum value to some maximum value, and usually this means zero and one respectively. This transformation can make numeric and categorical indicators more comparable, or at least put the on the same scale for estimation purposes, and so can help with convergence and speed up estimation.
:::{.panel-tabset}
###### Python
When using `sklearn`, it's a bit of a verbose process to do such a simple transformation. However it is beneficial when you want to do more complicated things, especially when using data pipelines.
```{python}
#| eval: false
#| label: py-numeric-transformation
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np
# Create a sample dataset
import numpy as np
# Create a random sample of integers
data = np.random.randint(low=0, high=100, size=(5, 3))
# Apply StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
# Apply MinMaxScaler
minmax_scaler = MinMaxScaler()
minmax_scaled_data = minmax_scaler.fit_transform(data)
```
###### R
R, being made for statistics, makes it easy to do simple transformations, but you can also use tools like `recipes` and `mlr3` pipeline operations when needed to make sure your preprocessing is applied appropriately.
```{r}
#| eval: false
#| label: r-numeric-transformation
# Create a sample dataset
data = matrix(sample(1:100, 15), nrow = 5)
# Standardization
scaled_data = scale(data)
# Min-Max Scaling
minmax_scaled_data = apply(data, 2, function(x) {
(x - min(x)) / (max(x) - min(x))
})
```
:::
Using a **log** transformation for numeric targets and features is straightforward, and [comes with several benefits](https://stats.stackexchange.com/questions/107610/what-is-the-reason-the-log-transformation-is-used-with-right-skewed-distribution). For example, it can help with **heteroscedasticity**, which is when the variance of the target is not constant across the range of the predictions[^notnormal] (demonstrated below). It can also help to keep predictions positive after transformation, allows for interpretability gains, and more. One issue with logging is that it is not a linear transformation, which can help capture nonlinear feature-target relationships, but can also can make some post-modeling transformations more less straightforward. Also if you have a lot of zeros, 'log plus one' transformations are not going to be enough to help you overcome that hurdle[^logp1]. Logging also won't help much when the variables in question have few distinct values, like ordinal variables, which we'll discuss later in @sec-data-ordinal.
[^logp1]: That doesn't mean you won't see many people try (and fail).
:::{.content-visible when-format="html"}
```{r}
#| echo: false
#| label: fig-log-transformation-heteroscedasticity
#| fig-cap: Log transformation and heteroscedasticity
#| out-width: 100%
# Generate x values
set.seed(123)
N = 250
x = rpois(N, lambda = 10)
# Generate y values with heteroscedasticity
y = exp(.1*x + rnorm(length(x), mean = 0, sd = .5))
# Create a data frame
data = tibble(x, y)
# Fit a linear regression model without log transformation
model = lm(y ~ x, data = data)
# plot(model)
# Fit a linear regression model with log transformation for y
model_log = lm(log(y) ~ x, data = data)
# plot(model_log)
fitted_residuals = data.frame(
model = rep(c("Model", "Model w/ Log"), each = length(data$x)),
fitted_values = c(model$fitted, model_log$fitted),
residuals = c(resid(model), resid(model_log))
)
# Create a ggplot with facetted scatter plots
p = ggplot(fitted_residuals, aes(x = fitted_values, y = residuals)) +
geom_point() +
facet_wrap(~ model, nrow = 1, scales = 'free') +
labs(
x = "Fitted Values",
y = "Residuals",
caption = "\nVariance increases with predicted values for the model without log transformation,\nbut is consistent across prediction for the model with log transformation."
) +
theme(
plot.caption = element_text(hjust = 0, size = 10)
)
p
ggsave(
"img/data-log_transformation_heteroscedasticity.svg",
p,
width = 8,
height = 6
)
```
:::
:::{.content-visible when-format="pdf"}
![](img/data-log_transformation_heteroscedasticity.svg)(img/data-log_transformation_heteroscedasticity.svg){width=100% #fig-log-transformation-heteroscedasticity}
:::
[^notnormal]: For the bazillionth time, logging does not make data 'normal' so that you can meet your normality assumption in linear regression. The only time that would work is if your data is log-normally distributed to begin with.
:::{.callout-tip title='Categorizing continuous variables' collapse="true"}
It is rarely necessary or a good idea to transform a numeric feature or target to a categorical one. Doing so potentially throws away useful information by making the feature a less reliable measure of the underlying construct. For example, discretizing age to 'young' and 'old' does not help your model, and you can always get predictions for what you would consider 'young' and 'old' after the fact. Though extremely common, particularly in machine learning contexts when applied to target variables, it's just not a statistically or practically sound thing to do, and can ultimately hinder interpretation.
:::
### Categorical Variables {#sec-data-cat}
Despite their ubiquity in data, we can't analyze raw text information as it is. Character strings, and labeled features like factors, must be converted to a numeric representation before we can analyze them. For categorical features, we can use something called **effects coding** to test for specific types of group differences. Far and away the most common type is called **dummy coding** or **one-hot encoding**[^onehotnotdummy], which we visited previously @sec-lm-categorical-features. In these situations we create columns for each category, and the value of the column is 1 if the observation is in that category, and 0 otherwise. Here is a one-hot encoded version of the `season` feature that was demonstrated previously.
[^onehotnotdummy]: Note that one-hot encoding can refer to just the 1/0 coding for all categories, or to the specific case of dummy coding where one category is dropped. Make sure the context is clear.
```{r}
#| echo: false
#| label: tbl-one-hot
#| tbl-cap: One-hot encoding
model.matrix(rating ~ -1 + season, df_reviews) |>
as_tibble() |>
bind_cols(df_reviews |> select(season)) |>
head() |>
gt() |>
fmt_number(columns = everything(), decimals = 0)
```
:::{.callout-note title='Dummy Coding Explained' collapse="true"}
For statistical models, when doing one-hot encoding all relevant information is incorporated in k-1 groups, where k is the number of groups, so one category will be dropped from the model matrix. This dropped category is called the **reference**. In a standard linear model, the intercept represents the mean of the target for the reference category. The coefficients for the other categories are the difference between the mean for the reference category and the group mean of the category being considered.
As an example, in the case of the `season` feature, if the dropped category is `winter`, the intercept tells us the mean rating for `winter`, and the coefficients for the other categories are the difference between the value for `winter` and the mean of the target for `fall`, `summer` and `spring`.
In other models, we include all categories in the model. The model learns how to use them best and might only consider some or one of them at a time.
:::
When we encode categories for statistical analysis, we can summarize their impact on the target variable single measure. For a model with only categorical features, we can use an **ANOVA** (@sec-lm-categorical-anova) for this. But a similar approach can also be used for mixed models, splines, and other models. Techniques like SHAP also provide a way to summarize the total effect of a categorical feature.
#### Text Embeddings {#sec-data-embeddings}
When it comes to other character representations like text, we use other methods to represent them numerically. One important way to encode textis through an **embedding**. This is a way of representing the text as a vector of numbers, at which point the embedding feature is used in the model like anything else. The way to do this usually involves a model itself, one that learns the best way to represent the text or categories numerically. This is commonly used in deep learning and natural language processing in particular. However, embeddings can also be used as a preprocessing step in any modeling situation.
To understand how embeddings work, consider a one-hot encoded matrix for a categorical variable. This matrix then connects to a hidden layer of a neural network, and the weights of that layer are the embeddings for the categorical variable. While this isn't the exact method used (there are more efficient methods that don't require the actual matrix), the concept is the same. In addition, we normally don't even use whole words. Instead, we break the text into smaller units called **tokens**, like characters or subwords, and then use embeddings for those units. [Tokenization](https://huggingface.co/learn/nlp-course/en/chapter6/5) is is used in many of the most successful models in the in natural language processing domain, including those such as [ChatGPT](https://www.youtube.com/watch?v=zduSFxRajkE).
```{r}
#| echo: false
#| eval: false
#| label: half-nn-graph-setup
g = DiagrammeR::grViz('img/graphical-embedding.dot')
g %>%
DiagrammeRsvg::export_svg() %>% charToRaw() %>% rsvg::rsvg_svg("img/graphical-embedding.svg")
```
![Conceptual example of an embedding](img/graphical-embedding.svg){width=25% #fig-embedding}
#### Multiclass Targets {#sec-data-multiclass}
We've talked and demonstrated binary targets a lot, but what about when there are more than two classes? In statistical settings, we can use a **multinomial regression**, which is a generalization of (binomial) logistic regression to more than two classes via the multinomial distribution. Depending on the tool, you may have to use a *multivariate* target of the counts, though most commonly they would be zeros and ones for a classification model, which then is just a one-hot encoded target. The following table demonstrates how this might look.
```{r}
#| echo: false
#| label: tbl-multinomial-data
#| tbl-cap: Multinomial data example
tibble(
x1 = rnorm(5),
x2 = rpois(5, 5),
target = c("A", "C", "C", "B", "A"),
`Class A` = c(1, 0, 0, 0, 1),
`Class B` = c(0, 0, 0, 1, 0),
`Class C` = c(0, 1, 1, 0, 0)
) |>
gt() |>
fmt_number(columns = -x1, decimals = 0) |>
cols_align("center")
```
With Bayesian tools, it's common to use the **[categorical distribution](https://en.wikipedia.org/wiki/Categorical_distribution)**, which is a different generalization of the Bernoulli distribution to more than two classes. Unlike the multinomial, it is not a count distribution, but an actual distribution over categories.
In the machine learning context, we can use a variety of models we'd use for binary classification. How the model is actually implemented will depend on the tool, but one of the more popular methods is to use **one-vs-all** or **one-vs-one** strategies, where you treat each class as the target in a binary classification problem. In the first case you would have a model for each class that predicts whether an observation is in that class or not. In the second case, you would have a model for each pair of classes. You should generally be careful with either approach if interpretation is important, as it can make the feature effects very difficult to understand. As an example, we can't expect feature X to have the same effect on the target in a model for class A vs B, as it does in a model for class A vs. (B & C) or A & C. It also can be misleading when the models are conducted as if the categories are independent.
Regardless of the context, interpretation is now spread across multiple target output, and so it can be difficult to understand the overall effect of a feature on the target. Even in the statistical setting, you now have coefficients that regard *relative* effects for one class versus a reference group, and so they cannot tell you a general effect of a feature on the target. This is where tools like marginal effects and SHAP can be useful (@sec-knowing).
#### Multilabel Targets {#sec-data-multilabel}
Multilabel targets are a bit more complicated, and are not as common as multiclass targets. In this case, each observation can have multiple labels. For example, if we wanted to predict genre based on the movie review data, we could choose to allow a movie to be both a comedy and action film, a sci-fi horror, or a romantic comedy. In this setting, labels are not mutually exclusive.
:::{.callout-note title='Categorical Objective Functions' collapse="true"}
In many situations where you have a categorical target you will use a form of **cross-entropy loss** for the objective function. You may see other names such as *log loss* or *logistic loss* or *negative log likelihood* depending on the context, but usually it's just a [different name for the same underlying objective](https://en.wikipedia.org/wiki/Cross-entropy#Cross-entropy_loss_function_and_logistic_regression).
:::
### Ordinal Variables {#sec-data-ordinal}
So far in our discussion of categorical data, it's assumed to have no order. But it's quite common to have labels like "low", "medium", and "high", or "very bad", "bad", "neutral", "good", "very good", or simply are a few numbers, like ratings from 1 to 5. **Ordinal data** is categorical data that has a known ordering, but which still has arbitrary labels. Let us repeat that, *ordinal data is categorical data*.
#### Ordinal Features {#sec-data-ordinal-features}
The simplest way to treat ordinal features is as if they were numeric. If you do this, then you're just pretending that it's not categorical, and this is usually fine. Most of the transformations we mentioned probably aren't going to be as useful, but you can still use them if you want. For example, logging ratings 1-5 isn't going to do anything for you model-wise, but it technically doesn't hurt anything. But you should know that typical statistics like means and standard deviations don't really make sense for ordinal data, so the main reason for treating them as numeric is for modeling convenience.
If you choose to treat an ordinal feature as categorical, you can ignore the ordering and do the same as you would with categorical data. This would allow for some nonlinearity since the category means whatever they need to be. There are also some specific techniques to coding ordinal data for use in linear models, but they are not commonly used, and they generally aren't going to help the model performance or interpreting the feature, so we do not recommend them. You could however use old-school **effects coding** that you would incorporate traditional ANOVA models, but again, you'd need a good reason to do so.
The take home message for ordinal features is generally simple. Teat them as you would numeric features or non-ordered categorical features. Either is fine.
#### Ordinal Targets {#sec-data-ordinal-targets}
Ordinal targets can be trickier to deal with. If you treat them as numeric, you're assuming that the difference between 1 and 2 is the same as the difference between 2 and 3, and so on. This is probably not true. If you treat them as categorical and use standard models for that setting, you're assuming that there is no connection between categories. So what should you do?
There are a number of ways to model ordinal targets, but probably the most common is the **proportional odds model**. This model can be seen as a generalization of the logistic regression model, and is very similar to it, and actually identical if you only had two categories. It basically is a model of 2 vs. 1, 3 vs. (2, 1), 4 vs. (3, 2, 1), etc. But other models beyond proportional odds are also possible, and your results could return something that gives coefficients for the model for the 1-2 category change, the 2-3 category change, and so on.
Ordinality of a categorical outcome is largely ignored in machine learning applications. The outcome is either treated as numeric or multiclass classification. This is not necessarily a bad thing, especially if prediction is the primary goal. But if you need a categorical prediction, treating the target as numeric means you have to make an arbitrary choice to classify the predictions. And if you treat it as multiclass, you're ignoring the ordinality of the target, which may not work as well.
#### Rank Data {#sec-data-rank}
Though ranks are ordered, with rank data we are referring to cases where the observations are uniquely ordered. An ordinal vector with numeric labels could be something like [2, 1, 1, 3, 4, 2], whereas rank data would be [2, 1, 3, 4, 5, 6]. For example, in sports, the actual finish of the runners in a race is what we'd want to predict. Assuming you have a modeling tool that actually handles this situation, the objective will be different than other scenarios. Statistical modeling methods include using the Plackett-Luce distribution (or the simpler variant Bradley-Terry model). In machine learning, you might use so-called [learning to rank methods](https://en.wikipedia.org/wiki/Learning_to_rank), like the [RankNet and LambdaRank algorithms](https://icml.cc/Conferences/2015/wp-content/uploads/2015/06/icml_ranking.pdf), and other variants for [deep learning models](https://github.com/tensorflow/ranking).
## Missing Data {#sec-data-missing}
```{r}
#| echo: false
#|
#| label: fig-missing-data
set.seed(123)
df <- data.frame(
x1 = rpois(5, 5),
x2 = rpois(5, 3),
x3 = sample(c("A", "B", "C"), 5, replace = TRUE)
) |>
mutate(across(everything(), as.character)) |>
as.matrix()
# Randomly introduce missing values
df[sample(1:15, 4)] <- NA
# Print the resulting data frame
df |>
as_tibble() |>
mutate(across(everything(), \(x) replace_na(x, '?'))) |>
gt()
```
Missing data is a common challenge in data science, and there are a number of ways to deal with it, usually by substituting, or **imputing**, the substituted value for the missing one. Here we'll provide an overview of common techniques to deal with missing data.
### Complete Case Analysis {#sec-data-missing-complete}
The first way to deal with missing data is the simplest - **complete case analysis**. Here we only use observations that have no missing data and drop the rest. Unfortunately, can lead to a lot of lost data, and can lead to biased statistical results if the data is not **missing completely at random**. There are special cases of some models that by their nature can ignore the missingness under an assumption of **missing at random**, but even those models would likely benefit from some sort of imputation. If you don't have much missing data though, dropping the missing data is fine for practical purposes[^dropmiss]. How much is too much? Unfortunately that depends on the context, but if you have more than 10% missing, you should probably be looking at alternatives.
[^dropmiss]: While many statisticians will possibly huff and puff at the idea of dropping data, there are two things to consider. With minimal missingness you'll likely never come to a different conclusion unless you have very little data to come to a conclusion about, which is already the bigger problem. Secondly, it's impossible to prove one way or another if the data is missing at random, because doing so would require knowing the missing values.
### Single Value Imputation {#sec-data-missing-single}
**Single value imputation**: Replace missing values with the mean, median, mode or some other typical value of the feature.This will probably rarely help your model for a variety of reasons. Consider a numeric feature that is 50% missing, and for which you replace the missing with the mean. How good do you think that feature will be when at least half the values are identical? Whatever variance it normally would have and share with the target is probably reduced, and possibly dramatically. Furthermore, you've also attenuated correlations it has with the other features, which may mute other modeling issues that you would otherwise deal with in some way (e.g. collinearity), or cause you to miss out on interactions.
Single value imputation makes perfect sense if you *know* that the missingness should be a specific value, like a count feature where missing means a count of zero. If you don't have much missing data, it's unlikely this would have any real benefit over complete case analysis, *except* if it allows you to use all the other features that would otherwise be dropped. But then, why not just drop this feature and keep the others?
### Model-based Imputation {#sec-data-missing-model}
**Model-based imputation** is more complicated, but can be very effective. In essence, you run a model for complete cases in which the feature with missing values is now the target, and all the other features and primary target are used to predict it. You then use that model to predict the missing values, and use those predictions as the imputed values. After these predictions are made, you move on to the next feature and do the same. There are no restrictions on which model you use for which feature. If the other features in the imputation model also have missing data, you can use something like mean imputation to get more complete data if necessary as a first step, and then when their turn comes, impute those values.
Although the implication is that you would have one model per feature and then be done, you can do this iteratively for several rounds, such that the initial imputed values are then used in subsequent models to reimpute the missing values. You can do this as many times as you want, but the returns will diminish. In this setting, we are assuming you'll have a single value imputed for each missing one, but this approach is the precursor for our next method.
### Multiple Imputation {#sec-data-missing-mi}
**Multiple imputation** (MI) is a more complicated technique, but can be very useful under some situations, depending on what you're willing to sacrifice for having better uncertainty estimates versus a deeper dive into the model. The idea is that you create multiple imputed datasets, each of which is based on the **predictive distribution** of the model used in model-based imputation. Say we use a linear regression assuming a normal distribution to impute feature A. We would then draw repeatedly from the predictive distribution of that model to create multiple datasets with (randomly) imputed values for feature A
Let's say we do this 10 times, and we now have 10 imputed data sets. We now run our actual model of interest on each of these datasets, and final model results are averaged in some way to get final parameter estimates. Doing so acknowledges that your single imputation methods have uncertainty in those imputed values, and that uncertainty is incorporated into the final model results.
MI can in theory handle any source of missingness and can be a very powerful technique. But it has several drawbacks. One is that you need a a specified target distribution for all imputation models used in order to generate random draws from. Your final model presumably is also a probabilistic model with coefficients and variances you are trying to estimate and understand. MI probably isn't going to help a boosting or deep learning models that have native methods for dealing with missing values, or at least offer little if anything over single value imputation. If you have very large data and a complicated model, you could be waiting a long time, and as modeling is an iterative process itself, this can be rather tedious to work through. Finally, few data or post-model processing tools that you commonly use will work with MI results, especially visualization tools. So you will have to hope that whatever package you use for MI has what you need. As an example, you'd have to figure out how you're going to impute interaction terms if you have them.
*Practically speaking*, MI takes a lot of effort to often come to the same conclusions you would have with a single imputation method, or possibly fewer conclusions for anything beyond GLM coefficients and their standard errors. But if you want your uncertainty estimate for those models to be better, MI can be an option.
### Bayesian Imputation {#sec-data-missing-bayesian}
One final option is to run a Bayesian model where the missing values are treated as parameters to be estimated, and they would have priors just like other parameters as well. MI is basically a variant of Bayesian imputation that can be applied to the non-Bayesian model setting, so why not just use the actual Bayesian method? Some modeling packages can allow you to try this very easily, and it can be very effective. But it is also very computationally intensive, and can be very slow as you may be increasing the number of parameters to estimate dramatically. At least it would be more fun than standard MI!
## Class Imbalance {#sec-data-class-imbalance}
:::{.content-visible when-format="html"}
```{r}
#| echo: false
#| label: fig-class-imbalance
#| fig-cap: Class Imbalance
p = df_reviews |>
mutate(y = ifelse(rating < 2, "Class A", "Class B")) |>
count(y) |>
mutate(prop = n / sum(n)) |>
ggplot(aes(x = y, y = prop)) +
geom_col(width = .05) +
geom_point(aes(y = prop), size = 10, alpha = 1, color = okabe_ito[2]) +
coord_flip() +
scale_y_continuous(breaks = seq(0, 1, .1), labels = scales::percent_format()) +
labs(x = "", y = "", title = "")
p
ggsave(
"img/data-class_imbalance.svg",
p,
width = 6,
height = 4
)
```
:::
:::{.content-visible when-format="pdf"}
![](img/data-class_imbalance.svg)(img/data-class_imbalance.svg){width=100% #fig-class-imbalance}
:::
**Class imbalance** refers to the situation where the target variable has a large difference in the number of observations in each class. For example, if you have a binary target, and 90% of the observations are in one class, and 10% in the other, you would have class imbalance. You'll almost never see a 50/50 split in the real world, but the issue is that as we move further away from that point, we can start to see problems in model estimation, prediction, and interpretation. With our example, if we just predict the majority class in a binary classification problem, our accuracy would 90%! Under other circumstances that might be a great result, but in this case it's not. So right off the bat one of our favorite metrics to use for classification models isn't going to help us much.
For classification problems, *class imbalance is the rule, not the exception*. This is because nature just doesn't sort itself into nice and even bins. The majority of people of a random sample do not have cancer, the vast majority of people have not had a heart attack in the past year, most people do not default on their loans, and so on.
There are a number of ways to help deal with class imbalance, and the method that works best will depend on the situation. Some of the most common are:
- **Use different metrics**: Use metrics that are less affected by class imbalance, such as area under a receiver operating characteristic curve (AUC), or those that balance the tradeoff between precision and recall, like the F1 score, or balance recall and true negative rate, like the balanced accuracy score.
- **Oversampling/Undersampling**: Randomly sample from the minority (majority) class to increase (decrease) the number of observations in that class.
- **Weighted objectives**: Weight the loss function to give more weight to the minority class. Although commonly employed, and a simple thing to use with models like lightgbm and xgboost, it often fails to help, and can cause other issues.
- **Thresholding**: Change the threshold for classification to be more sensitive to the minority class. Nothing says you have to use 0.5 as the threshold for classification, and you can change it to be more sensitive to the minority class. This is a very simple approach, and may be all you need.
These are not necessarily mutually exclusive. For example, in the imbalanced setting it's probably a good idea to switch your focus to a metric besides accuracy even as you employ other techniques.
### Calibration Issues in Classification {#sec-data-calibration}
Probability **calibration** is often an issue in classification problems, and is a bit more complicated than just class imbalance, but is often discussed in the same setting. Having calibrated probabilities refers to the situation where the predicted probabilities of the target match up well to the actual probabilities. For example, if you predict that 10% of people will default on their loans, and 10% of people actually do default on their loans, one would say your model is well calibrated. Conversely, if you predict that 10% of people will default on their loans, but 20% of people actually do default on their loans, your model is not so well-calibrated.
One way to assess calibration is to use a **calibration curve**, which is a plot of the predicted probabilities vs. the observed proportions. We bin the observations in some way, and then calculate the average predicted probability and the average observed proportion of the target in each bin. If the model is well-calibrated, the points should fall along the 45-degree line. If the model is not well-calibrated, the points will fall above or below the line. In the following, one model seems to align well with the observed proportions based on the chosen bins. The other model (dashed line) is not so well calibrated, and is overshooting with its predictions. For example, if the model's average prediction predicts a 0.5 probability of default, the actual proportion of defaults is only around 0.6.
```{python}
#| echo: false
#| eval: false
#| label: py-calibration-plot-data
# reticulate basically refused to run this code properly except under random situations. see bootstrap_calibration notebook.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.calibration import calibration_curve
from sklearn.linear_model import LogisticRegression
# Generate synthetic data
X, y = make_classification(n_samples=1500, n_features=10, random_state=42)
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
# Get predicted probabilities on the test set
y_pred_prob = model.predict_proba(X_test)[:, 1]
# Compute calibration curve
true_prob, pred_prob = calibration_curve(y_test, y_pred_prob, n_bins=10)
# Create data frame
df = pd.DataFrame({
'true_prob': true_prob,
'pred_prob': pred_prob,
})
# write to file
df.to_csv("data/basic_calibration.csv", index=False)
```
```{python}
#| echo: false
#| eval: false
#| label: py-bootstrap-calibration-plot-data
# Set the number of bootstrap samples
n_bootstrap = 100
# Initialize arrays to store bootstrap results
bootstrap_true_prob = np.zeros(shape=(n_bootstrap, len(true_prob)))
bootstrap_pred_prob = np.zeros(shape=(n_bootstrap, len(pred_prob)))
# Perform bootstrap sampling
for i in np.arange(n_bootstrap):
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
# Generate bootstrap sample
bootstrap_indices = np.random.choice(len(X_test), size=len(X_test), replace=True)
bootstrap_X_test = X_test[bootstrap_indices]
bootstrap_y_test = y_test[bootstrap_indices]
bootstrap_y_pred_prob = model.predict_proba(bootstrap_X_test)[:, 1]
# Compute calibration curve for bootstrap sample
bootstrap_true_prob[i, :], bootstrap_pred_prob[i, :] = calibration_curve(
bootstrap_y_test,
bootstrap_y_pred_prob,
n_bins=10
)
# Calculate mean and confidence intervals
mean_true_prob = np.mean(bootstrap_true_prob, axis=0)
mean_pred_prob = np.mean(bootstrap_pred_prob, axis=0)
lower_bound_true = np.percentile(bootstrap_true_prob, 2.5, axis=0)
upper_bound_true = np.percentile(bootstrap_true_prob, 97.5, axis=0)
lower_bound_pred = np.percentile(bootstrap_pred_prob, 2.5, axis=0)
upper_bound_pred = np.percentile(bootstrap_pred_prob, 97.5, axis=0)
df_true = pd.DataFrame({
'prob': mean_true_prob,
'lower': lower_bound_true,
'upper': upper_bound_true,
})
df_pred = pd.DataFrame({
'prob': mean_pred_prob,
'lower': lower_bound_pred,
'upper': upper_bound_pred,
})
df_true['type'] = 'true'
df_pred['type'] = 'pred'
p_dat = pd.concat([df_true, df_pred])
p_dat.to_csv("data/bootstrap_calibration.csv", index=False)
```
:::{.content-visible when-format="html"}
```{r}
#| echo: false
#| label: fig-calibration-plot
#| fig-cap: Calibration Plot
p_dat = read_csv("data/basic_calibration.csv")
set.seed(123)
nb = 5
p_dat = tibble(
true_prob = rep(seq(0,1, length.out = nb), 2),
pred_prob = c(
seq(0,1, length.out = nb) + rnorm(nb, sd=.05),
c(seq(0,1, length.out = nb)^.5) #+ rnorm(nb, sd=.1))
),
pred_type = c(rep("better", nb), rep("worse", nb))
)
p = p_dat |>
# pivot_longer(cols = c(pred_prob, pred_worse), names_to = "prob_type", values_to = "pred_prob") |>
ggplot(aes(x = pred_prob, y = true_prob, color = pred_type)) +
geom_abline(intercept = 0, slope = 1, color = 'gray', lwd= 1.5) +
geom_line(aes(linetype = pred_type)) +
geom_point(size = 3, alpha = 1) +
scale_x_continuous(breaks = seq(0, 1, .1)) +
scale_y_continuous(breaks = seq(0, 1, .1)) +
scale_color_manual(values = c(okabe_ito[2], okabe_ito[1])) +
labs(x = "Average (Binned) Probability", y = "Observed (Binned)\nProportions", title = "")
p
ggsave(
"img/data-calibration_plot.svg",
p,
width = 6,
height = 4
)
```
:::
:::{.content-visible when-format="pdf"}
![](img/data-calibration_plot.svg)(img/data-calibration_plot.svg){width=100% #fig-calibration-plot}
:::
While the issue is an important one, it's good to keep the issue of calibration and imbalance separate. As miscalibration implies bias, bias can happen irrespective of the class proportions, and can be due to a variety of factors related to the model, target, or features. Furthermore miscalibration is not inherent to any particular model.
Furthermore, the assessment of calibration in this manner has a few issues. For one, the observed 'probabilities' are proportions based on arbitrarily chosen bins, and the observed values are measured with some **measurement error** and have a natural variability with will partly reflect sample size[^testcalibbinsize]. These plots are often presented such that observed proportions are labeled as the "true" probabilities. However, you do not have the *true* probabilities, just the observed class labels, so whether your model's predicted probabilities match observed proportions is a bit of a different question. The predictions obviously have uncertainty as well, and this will depend on the model, sample size, and other factors. And finally, the number of bins chosen can also affect the appearance of the plot in a notable way if the sample size is small.
[^testcalibbinsize]: Note that each bin will reflect the portion of the test set size in this situation. If you have a small test set, the observed proportions will be more variable, and the calibration plot will be more variable as well.
All this is to say that each point in a calibration plot, along with the reference line itself, has some error bar around it, and the differences between models and the 'best case scenario' would need additional steps to suss out if we are interested in doing so in a statistically rigorous fashion. Some methods are available to calibrate probabilities, but they are not commonly implemented in practice, and often involve a model-based technique, with all of its own assumptions and limitations. It's also not exactly clear that forcing your probabilities to be on the line is helping solve the actual modeling goal in any way[^practicalprob]. But if you are interested, you can read more [here](https://scikit-learn.org/stable/modules/calibration.html).
[^practicalprob]: Oftentimes we are only interested in the ordering of the predictions, and not the actual probabilities. For example, if we are trying to identify the top 10% of people most likely to default on their loans, we'll just take the top 10% of predictions, and the actual probabilities are irrelevant for that goal.
```{r}
#| echo: false
#| eval: false
#| label: r-bootstrap-calibration-plot
#| fig-cap: Bootstrap Calibration Plot
# ultimately I think this gets too far afield
p_dat = read_csv("data/bootstrap_calibration.csv")
p_dat = p_dat |>
filter(type == "true") |>
rename(
true_prob = prob,
true_lower = lower,
true_upper = upper
) |>
bind_cols(
p_dat |>
filter(type == "pred") |>
select(-type) |>
rename(pred_prob = prob, pred_lower = lower, pred_upper = upper)
)
p_dat |>
ggplot(aes(x = pred_prob, y = true_prob)) +
geom_abline(intercept = 0, slope = 1, linetype = 'dashed', color = 'gray', lwd= 1) +
geom_segment(aes(x = pred_lower, y = true_prob, xend = pred_upper, yend = true_prob), color = "gray", size = 0.5) +
geom_segment(aes(x = pred_prob, y = true_lower, xend = pred_prob, yend = true_upper), color = "gray", size = 0.5) +
geom_line() +
geom_point(size = 3, alpha = 1) +
scale_x_continuous(breaks = seq(0, 1, .1)) +
scale_y_continuous(breaks = seq(0, 1, .1)) +
scale_color_manual(values = c(okabe_ito[2], okabe_ito[1])) +
labs(x = "Avg (Binned) Probability", y = "Observed (Binned)\nProportions", title = "")
```
## Censoring and Truncation {#sec-data-censoring}
:::{.content-visible when-format="html"}
```{r}
#| echo: false
#| label: fig-censoring
#| fig-cap: Censoring for Time Until Death
# Create a data frame with age and censoring information
df = tibble(
person = LETTERS[1:5],
age = c(53, 72, 80, 65, 83),
censoring = c("End of Life", "Censored", "Censored", "End of Life", "Censored"))
# Plot the potential censoring
p = ggplot(df, aes(x = person, age, fill = censoring, color = censoring)) +
geom_col(width = .05, alpha = .5, show.legend=FALSE) +
geom_segment(
aes(x = person, y = age, xend = person, yend = 85),
linetype = "dashed",
color = "gray50",
data = df |> filter(censoring == "Censored")
) +
geom_point(size = 5, alpha = 1) +
lims(y = c(0, 90)) +
labs(x = "Person", y = "Age", title = "") +
scale_fill_manual(values = okabe_ito, aesthetics = c('color', 'fill')) +
coord_flip()
p
ggsave(
"img/data-censoring.svg",
p,
width = 6,
height = 4
)
```
:::
:::{.content-visible when-format="pdf"}
![Censoring for Time Until Death](img/data-censoring.svg)(img/data-censoring.svg){width=100% #fig-censoring}
:::
Sometimes, we just don't see all the data there is to see. **Censoring** is a situation where the target variable is not fully observed. This is common in techniques where the target is the time to an event, like 'death', but the event has not yet occurred for some observations (thankfully!). Specifically this is called **right censoring**, and is the most common type of censoring, and depicted in the above plot, where several individuals are only observed to a certain age and were still alive at that time. There is also [**left censoring**](https://stats.stackexchange.com/questions/144037/right-censoring-and-left-censoring), where the censoring happens from the other direction, and data before a certain point is unknown. Finally, there is **interval censoring**, where the event of interest occurs within some interval, but the exact value is unknown.
[^eventhistory]: Survival analysis is also called event history analysis, and is widely used in biostatistics, sociology, demography, and other disciplines where the target is the time to an event, such as death, marriage, divorce, etc.
**Survival analysis**[^eventhistory] is a common modeling technique in this situation, but you may also be able to keep things even more simple via something like [tobit regression](https://m-clark.github.io/models-by-example/tobit.html). In this model, you assume that the target is fully observed, but that the values are censored, and you model the probability of censoring. This is a common technique in econometrics, and can keep you in a traditional linear model context.
:::{.content-visible when-format="html"}
```{r}
#| echo: false
#| label: fig-truncation
#| fig-cap: Truncation
library(ggplot2)
set.seed(1234)
# Generate a random sample from a normal distribution
data = rpois(100, 60)
# Censor some of the values
key_value = 70
# Create a data frame for plotting
df = tibble(
data = data,
censored = factor(if_else(data > key_value, "Unobserved", "Observed")),
)
# Plot the original data and the censored data
p = ggplot(df, aes(x = data)) +
geom_bar(
aes(
alpha = after_stat(x < key_value),
group = after_scale(fill)
),
# alpha = 0.5,
color = NA,
show.legend = FALSE,
) +
geom_vline(xintercept = key_value, color = "gray", size = 1) +
annotate(
geom = "text",
x = key_value + 1,
y = 3,
label = "Unobserved",
color = "gray",
size = 6,
hjust = 0,
) +
scale_fill_manual(values = c(okabe_ito)) +
labs(x = "Ages", y = "", title = "") +
theme(
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
)
p
ggsave(
"img/data-truncation.svg",
p,
width = 6,
height = 4
)
```
:::
:::{.content-visible when-format="pdf"}
![](img/data-truncation.svg)(img/data-truncation.svg){width=100% #fig-truncation}
:::
**Truncation** is a situation where the target variable is only observed if it is above or below some value, even though we know other possibilities exist. One of the issues is that default distributional methods assume a distribution that is not necessarily bounded in the way that the data exhibits. In our plot above, we restrict our data to 70 and below for practical or other reasons, but typical modeling methods predicting age would not respect that.
You could truncate predictions after the fact, but this is a bit of a hack, and often results in lumpiness in the predictions at the boundary that likely isn't realistic. Alternatively, Bayesian methods allow you to model the target as a distribution with truncated distributions, and so you can model the probability of the target being above or below some value. There are also models such as **hurdle models** that might prove useful where the truncation is theoretically motivated, e.g. a zero-inflated Poisson model for count data where the zero counts are due to a separate process than the non-zero counts.
:::{.callout-note title="Censoring vs. Truncation" collapse="true"}
One way to distinguish censored and truncated data is that censored data is usually due to some external process such that the target is not observed, but could be possible (capping reported income at $1 million). Truncated data, on the other hand, is due to some internal process that prevents the target from being observed, and is often derived from sample selection (we only want to model non-millionaires). We would not want predictions past the censored point to be unlikely, but we would want predictions past the truncated point to be impossible. Trickier still is that for bounded or truncated distributions that might be applied to the truncated scenario, such as folded vs. truncated distributions, they would not result in the same probability distributions even if they can be applied to the same data situation.
![Folded and Truncated Distributions](img/trunc_vs_folded.png){width=33%}
Image from [StackExchange]((https://stats.stackexchange.com/questions/16089/is-sampling-from-a-folded-normal-distribution-equivalent-to-sampling-from-a-norm)
)
:::
## Time Series {#sec-data-time}
:::{.content-visible when-format="html"}
```{r}
#| echo: false
#| label: fig-time-series
#| fig-cap: Time Series Data
set.seed(123)
# Generate date sequence
dates <- seq(as.Date("2022-01-01"), as.Date("2022-12-31"), by = "day")
# Generate increasing trend
trend <- seq(1, length(dates))
# Generate seasonality
seasonality <- sin(2 * pi * (1:length(dates)) / 365 * 7) *10000
# Combine trend and seasonality
y <- trend + trend^2 + seasonality
p = tibble(date = dates, value = y) |>
ggplot(aes(x = date, y = value)) +
geom_line(linewidth=3, color = okabe_ito[1]) +
# geom_smooth(se = TRUE, color = okabe_ito[1], method = 'gam', formula = y ~ s(x, bs = 'cs', k = 25)) +
geom_smooth(se = TRUE, method = 'gam') +
labs(x = "", y = "") +
scale_y_continuous(labels = NULL) +
scale_x_date(date_labels = "%b", date_breaks = "1 month") +
theme(
axis.ticks.y = element_blank(),
)
p
ggsave(
"img/data-time_series.svg",
p,
width = 8,
height = 6
)
```
:::
:::{.content-visible when-format="pdf"}
![](img/data-time_series.svg)(img/data-time_series.svg){#fig-time-series}
:::
Time series data is any data that incorporates values over a period of time. This could be something like the a state's population over years, or the max temperature of an area over days. Time series data is very common in data science, and there are a number of ways to model such data.
### Time-based Targets
As in other settings, the most common approach when the target is some value that varies time is to use a linear model of some kind. While the target varies over time, the features may be time-varying or not. There are traditional autoregressive models that use the target's past values as features, for example, autoregressive moving average (**ARIMA**) models. Others can incorporate historical information in other ways such as is done in Bayesian methods to marketing data or in reinforcement learning (@sec-ml-more-reinforcement). Still others can get quite complex, such **recurrent neural networks** and their generalizations that form the backbone of modern AI models. Lately [transformer-based models](https://research.google/blog/a-decoder-only-foundation-model-for-time-series-forecasting/) have looked promising.
**Longitudinal data**[^paneldata] is a special case of time series data, where the target is a function of time, but it is typically grouped in some fashion. An example is a model for school performance for students over several semesters, where values are clustered within students over time. In this case, you can use some sort of time series regression, but you can also use a **mixed model** (@sec-lm-extend-mixed-models), where you model the target as a function of time, but also include a random effect for the grouping variable, in this case, students. This is a very common approach in many domains, and can be very effective in terms of performance as well. It can also be used for time series data that is not longitudinal, where the random effects are based on autoregressive covariance matrices. In this case, an ARIMA component is added to linear model as a random effect to account for the time series nature of the data. This is fairly common in Bayesian contexts.
In general lots of models can be found specific to time series data, and the choice of model will depend on the data, the questions we want to ask, and the goals we have.
:::{.callout-tip title="Time Series vs. Longitudinal" collapse="true"}
The primary distinguishing feature for referring data as 'time-series' and 'longitudinal' is the number of time points, where the latter typically has relatively few. This arbitrary though.
:::
[^paneldata]: You may also hear the term **panel data** in econometrics-oriented disciplines.
### Time-based Features
When it comes to time-series features, we can apply time-specific transformations. One technique is the [**fourier transform**](https://betterexplained.com/articles/an-interactive-guide-to-the-fourier-transform/), which can be used to decompose a time series into its component frequencies, much like how we use PCA (@sec-ml-more-unsuper). This can be useful for identifying periodicity in the data, which can be used as a feature in a model.
In marketing contexts, some perform **adstocking** with features. This method models the delayed effect of features over time, such that they may have their most important impact immediately, but still can impact the present target from past values. For example, a marketing campaign might have the most significant impact immediately after it's launched, but it can still influence the target variable at later time points. Adstocking helps capture this delayed effect without having to include multiple lagged features in the model. That said, including lagged features is also an option. In this case, you would have a feature for the current time point (*t*), the same feature for the previous time point (*t-1*), the feature for the time point before that (*t-2*), and so on.
If you have the year as a feature, you can use it as a numeric feature or as a categorical feature. If you treat it as numeric, you need to consider what a zero means. In a linear model, the intercept usually represents the outcome when all features are zero. But with something like year, a zero year isn't meaningful in most contexts. To solve this, you can shift the values so that the earliest time point, like the first year in your data, becomes zero. This way, the intercept in your model will represent the outcome for this first time point, which is more meaningful. The same goes if you are using months or days as a numeric feature. It doesn't really matter which year/month/day is zero, just that zero refers to one of the actual time points observed.
Dates and/or times can be it bit trickier. Often you can just split dates out into year, month, day, etc., and proceed as discussed. In other cases you'd want to track the time period to assess possible seasonal effects. You can use something lie a **cyclic** approach (e.g. [cyclic spline or sine/cosine transformation](https://developer.nvidia.com/blog/three-approaches-to-encoding-time-information-as-features-for-ml-models/)) to get at yearly or within-day seasonal effects. As mentioned, a fourier transform can also be used to decompose the time series into its component frequencies for use as model features. Time components like hours, minutes, and seconds can often be dealt with in similar ways, but you will more often deal with the periodicity in the data. For example, if you are looking at hourly data, you may want to consider the 24-hour cycle.
:::{.callout-note title="Calendars are hard." collapse="true"}
*Weeks are not universal*. Some start on Sunday, others Monday. Some data contexts only consider weekdays. [Some systems](https://en.wikipedia.org/wiki/ISO_week_date) may have 52 or 53 weeks in a year, and dates may not be in the same week from one year to the next, etc. So use caution when considering weeks as a feature.
:::
#### Covariance Structures
In many cases you'll have features that vary over time but are not a time-oriented feature like year or month. For example, you might have a feature that is the number of people who visited a website over days. This is a time-varying feature, but it's not a time metric in and of itself.
In general, we'd like to account for the time-dependent correlations in our data, and the main way to do so is to posit a covariance structure that accounts for this in some fashion. This helps us understand how data points are related to each other over time, and requires us to estimate the correlations between observations. As a starting point, consider linear regression. In a standard linear regression model, we assume that the samples are independent of one another, with a constant variance and no covariance.
Instead, we can also use something like a **mixed model**, where we include a random effect for each group and estimate the variance attributable to the grouping effect. By default, this ultimately assumes a constant correlation from time point to time point, but many tools allow you to specify a more complex covariance structure. A common method is to use autoregressive covariance structure that allows for correlations further apart in time to lessen. In this sense the covariance comes in as an added *random effect*, rather than being a model in and of itself as with ARIMA. Many such approaches to covariance structures are special cases of **gaussian processes**, which are a very general technique to model time series, spatial, and other types of data.
:::{.content-visible when-format="html"}
```{r}
#| echo: false
#| label: fig-covariance-structure
#| fig-cap: AR (1) Covariance Structure Visualized
library(nlme)
library(lme4)
model_corAR <- lme(Reaction ~ 1, random= ~1|Subject, data = sleepstudy, correlation = corAR1(form=~Days))
df = tibble(id = rep(1:2, e= 12), Months = rep(1:12, 2))
model_corAR <- lme(rnorm(24) ~ 1, random= ~1|id, data = df, correlation = corAR1(form=~Months))
# phi = abs(coef(model_corAR$modelStruct$corStruct, unconstrained = F)*3)
phi = .66
init = corMatrix(Initialize(corAR1(phi, form = ~1|id), data=df))
rc_ar = bdsmatrix::bdsmatrix(rep(12, 2), blocks = rep(init[[1]][lower.tri(init[[1]], diag = T)], 2)) %>%
as.matrix()
dates = seq.Date(as.Date("2022-01-15"), as.Date("2022-12-15"), by = "month")
dates = factor(month.abb, levels = month.abb)
l = length(dates)
p = tibble(x = rep(dates, e=l), y = rep(dates, times=l)) |>
mutate(
z = c(rc_ar[1:l, 1:l]),
z = ifelse(z == 0, NA, z)
) |>
ggplot(aes(x = x, y = y)) +
geom_tile(aes(fill=z), color = 'white', linewidth=3,show.legend = FALSE) +
# scale_x_date(breaks = '3 months', labels = label_date_short(format = c('%b'))) +
# scale_y_date(breaks = '3 months', labels = label_date_short(format = c('%b'))) +
# scale_x_date(breaks = '1 months', labels = date_format(format = c('%b'))) +
# scale_y_date(breaks = '1 months', labels = date_format(format = c('%b'))) +
scale_fill_scico(palette = "devon", na.value = "transparent", direction = -1) +
labs(
x = '',
y = '',
caption = 'A AR(1) structure for months within year. Stronger correlation is indicated by a darker color. Correlations systematically weaken for time points further apart.') +
# theme_void() +
theme(
plot.caption = element_text(hjust = 0, size = 12),
)
p
ggsave(
"img/data-covariance_structure.svg",
p,
width = 6,
height = 6
)
# scico::scico_palette_show()
```
:::
:::{.content-visible when-format="pdf"}
![](img/data-covariance_structure.svg)(img/data-covariance_structure.svg){width=100% #fig-covariance-structure}
:::
## Spatial Data {#sec-data-spatial}
```{r}
#| echo: false
#| eval: false
#| label: r-spatial-data
#| fig-cap: Spatial Weighting Applied to the Dallas-Fort Worth Area Census Tracts
# from https://walker-data.com/census-r/spatial-analysis-with-us-census-data.html
library(tidycensus)
library(sf)
library(spdep)
library(tigris)
dfw <- core_based_statistical_areas(cb = TRUE, year = 2020) %>%
filter(str_detect(NAME, "Dallas")) %>%
st_transform(32138)
dfw_tracts <- get_acs(
geography = "tract",
variables = "B01002_001",
state = "TX",
year = 2020,
geometry = TRUE
) %>%
st_transform(32138) %>%
st_filter(dfw, .predicate = st_within) %>%
na.omit()
neighbors <- poly2nb(dfw_tracts, queen = TRUE)
summary(neighbors)
weights <- nb2listw(neighbors, style = "W")
weights$weights[[1]]
# For Gi*, re-compute the weights with `include.self()`
dfw_tracts$lag_estimate <- lag.listw(weights, dfw_tracts$estimate)
localg_weights <- nb2listw(include.self(neighbors))
dfw_tracts$localG <- localG(dfw_tracts$estimate, localg_weights)
str(dfw_tracts$localG, 0)
dim(dfw_tracts)
dfw_tracts$localG[1:5]
ggplot(dfw_tracts) +
geom_sf(aes(fill=as.numeric(dfw_tracts$localG)), color = NA, show.legend = FALSE) +
scico::scale_fill_scico(
palette = "vik",
na.value = "transparent",
midpoint = 0
) +