/
05-datastructures.Rmd
executable file
·161 lines (123 loc) · 7.29 KB
/
05-datastructures.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
# Data Structures {#datastructures}
***
CHAPTER OVERVIEW NEEDED??
***
## Tidy Data
[INSERT QUIP ABOUT TIDY DATA OR QUOTE]
In Chapter 2 when we introduced the R programming language as our preferred tool
for marketing analytics we briefly referenced the concept of "tidy" data. It is
important to understand tidy data principles because they are the motivation for
many packages that make data processing a less tedious exercise. Hadley Wickham
proposes three rules that make a dataset tidy:
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.
[INSERT FIGURE, POSSIBLY https://r4ds.had.co.nz/tidy-data.html#fig:tidy-structure]
Wickham has even reimagined the `data.frame` object type that exists in base R with
a version that is more "tidy". It is a `tbl_df` object, also referred to as a "tibble".
For example, tibbles do not add rownames because rownames are values that should be
stored in their own column. Tibbles are more convenient than data.frames for other
reasons, which is also why we recommend using them.
Not all datasets are tidy,
****
EUROMONITOR EXAMPLE???
****
## Relational Data
In programming there is a matra of "Don't repeat yourself" (DRY). It is a principle
instilled in software developers because unnecessary repetition makes code less
efficient, requires additional time to maintain, and can quickly become out of sync.
In code, this means creating functions or classes. In databases, this is usually done
through data normalization. The degree of data normalization is the extent to which
the data is replicated in multiple places within a database. Highly normalized means
that a single data point only appears in one place. In contrast, a highly denormalized
dataset means that a single data point may appear in multiple places. For example, ...
****
COMPLETEJOURNEY AS AN EXAMPLE???
****
Databases use "keys" in order to keep track of related data points. The word "key"
simply refers to values in one or more columns that identify each row. In
the example above column XYZ was the key. This column exists in both tables and
refers to the same row. When working with real-world data it is important to understand
how keys are represented in your dataset. There are names for two different types
of keys that work together:
1. Primary key
2. Foreign key
A primary key is one or more columns in a dataset that are always unique and non-missing
for each row. Database architects can put explicit constraints on database tables
so that duplicates are never entered into the data, even by mistake. This is a
good practice for preserving data integrity. In the example above, the primary key
is ...
A foreign key is just the existence of a primary key in a different table from where
it is defined. In the example above column XYZ was the primary key in the first dataset,
but it existed in the second dataset for us to join on. In this situation it called the
foreign key in the second dataset because it is a key, but it refers to a different dataset
where it is a primary key. A great discussion about keys is available in Chapter 13
Section 3 of *R for Data Science* [@wickham2017]
Below is an entity-relationship diagram (ERD). This visually represents the relationships
between a set of tables so that analysts can get a high-level view of what data exists
in each dataset and how to bring them together through joins.
[INSERT COMPLETE JOURNEY DIAGRAM]
[INSERT SOME COMMENTARY ABOUT HOW THE COMPLETE JOURNEY DATASET IS ORGANIZED AND WHY]
****
SHOULD WE REVISIT SPREAD/GATHER UNDER THE BACKDROP OF NORMALIZATION?
****
## Joins
If database architects "normalize" a database by creating multiple tables that reference
each other, then we need a way to bring the data back together from each of these
sources. The process of combining columns from two or more tables is called a "join".
"Joining" tables may also be described as "merging". These phrases usually mean the
same thing. The tricky part about joining datasets is understanding how observations
in each set should be matched and how to handle instances where the data is missing
from one of the tables. One of the most effective ways to learn the behavior of different
joins is to use venn diagrams. Below is a comprehensive outline of join operations.
You will notice that depending on the type of join certain records will be retained.
For example, the `inner_join()` function will only retain records if they both exist
in the dataset.
****
[INSERT IMAGE HERE] SOMETHING LIKE https://r4ds.had.co.nz/diagrams/join-venn.png
****
You will also notice in the join reference above how the observations are matched.
Typically, you will choose one column and whereever there is a match between the
datasets, then the rows will be combined having now columns from both datasets together
for the same observation. The one or more columns used to match records is typically
called the join "key".
Consider a simple example where we have one dataset with three columns, the first
is a record ID column and a second dataset with 2 columns, its first column is
also a record ID column. We will assume the majority of records should have matching
IDs between the two datasets. If we would like to "join" the second dataset into the
first, then we should use the `left_join()` function. This function takes the first
dataset provided, keeping all of its rows regardless of whether a matching record
is found in the second dataset.
```{r left-join}
#left_join code here
```
All of the columns in the second dataset are now added to the first dataset. In
this example, that means only one additional column. The ID column is not duplicated
after the merge because that information is reflected in the ID column from the
first dataset. For readers with knowledge of SQL programming, the join functions
in R perform exactly as their counterparts in SQL. For readers with knowledge of
Excel this is akin to using the `VLOOKUP` function or a combination of `INDEX-MATCH`
to pull records from another dataset.
****
SHOULD WE EXPLICITLY SHOW THE TRANSLATION FROM VLOOKUP TO LEFT_JOIN()? OR INDEX-MATCH?
****
As depicted in the joins reference diagram there are different join types for
returning records from both datasets, all from one dataset, or only the overalapping
records. For example, the `outer_join()` function combines the rows that match on the
join key(s) and retains all of the unmatched rows from each dataset. This is also
sometimes referred to as a "full join" since it retains the full set of records
from both datasets.
```{r outer-join}
#outer_join code here
```
One join function that is unique is the `anti_join()` function. This join returns
only the rows that are not matched in the second dataset. This is particularly useful
for diagnosing what is not matching. If you think about its behavior, it should,
together with the `inner_join()` function span all the records in the first dataset
you provide. This is because either the records are matched via `inner_join()` or
they are not matched and identifed via `anti_join()`. Below is an example which
highlights that 95% of records are matched between the two datasets. The remaining
5% are returned via the `anti_join()` function for us to further inspect.
```{r inner-and-anti-join}
#inner_join and anti_join code here
```