The concat.split
function takes a column with multiple values, splits the values into a list or into separate columns, and returns a new data.frame
.
-
data
: the sourcedata.frame
. -
split.col
: the variable that needs to be split; can be specified either by the column number or the variable name. -
sep
: the character separating each value (defaults to","
).Note: If using
structure = "compact"
, the value forsep
can only be a single character. See the "Advanced Usage" example of how to specify multiple characters for batch conversion of columns. -
structure
: Can be either"compact"
,"expanded"
, or"list"
. Defaults to"compact"
."compact"
creates as many columns as the maximum length of the resulting split. This is the most useful general-case application of this function.- When the input is numeric,
"expanded"
creates as many columns as the maximum value of the input data. This is most useful when converting tomode = "binary"
. "list"
creates a single new column that is structurally alist
within adata.frame
.
-
mode
: can be eitherbinary
orvalue
(wherebinary
is default and it recodes values to1
orNA
, like Boolean data, but without assuming0
when data is not available). This setting only applies whenstructure = "expanded"
; an warning message will be issued if used with other structures. -
drop.col
: logical (whether to remove the original variable from the output or not; defaults toTRUE
). -
fixed
: Is the input for thesep
value fixed, or a regular expression? Whenstructure = "expanded"
orstructure = "list"
, it is possible to supply a a regular expression containing the characters to split on. For example, to split on","
,";"
, or"|"
, you can setsep = ",|;|\\|"
orsep = "[,;|]"
, andfixed = FALSE
to split on any of those characters.
First load some data from a CSV stored at github. The URL is an HTTPS, so we need to use getURL
from RCurl
.
require(RCurl)
baseURL = c("https://raw.github.com/mrdwab/2657-R-Functions/master/")
temp = getURL(paste0(baseURL, "data/concatenated-cells.csv"))
concat.test = read.csv(textConnection(temp))
rm(temp)
# How big is the dataset?
dim(concat.test)
## [1] 48 4
# Just show me the first few rows
head(concat.test)
## Name Likes Siblings Hates
## 1 Boyd 1,2,4,5,6 Reynolds , Albert , Ortega 2;4;
## 2 Rufus 1,2,4,5,6 Cohen , Bert , Montgomery 1;2;3;4;
## 3 Dana 1,2,4,5,6 Pierce 2;
## 4 Carole 1,2,4,5,6 Colon , Michelle , Ballard 1;4;
## 5 Ramona 1,2,5,6 Snyder , Joann , 1;2;3;
## 6 Kelley 1,2,5,6 James , Roxanne , 1;4;
Notice that the data have been entered in a very silly manner. Let's split it up!
# Load the function!
# require(RCurl)
# baseURL = c("https://raw.github.com/mrdwab/2657-R-Functions/master/")
source(textConnection(getURL(paste0(baseURL, "scripts/concat.split.R"))))
# Split up the second column, selecting by column number
head(concat.split(concat.test, 2))
## Name Likes Siblings Hates Likes_1 Likes_2 Likes_3
## 1 Boyd 1,2,4,5,6 Reynolds , Albert , Ortega 2;4; 1 2 4
## 2 Rufus 1,2,4,5,6 Cohen , Bert , Montgomery 1;2;3;4; 1 2 4
## 3 Dana 1,2,4,5,6 Pierce 2; 1 2 4
## 4 Carole 1,2,4,5,6 Colon , Michelle , Ballard 1;4; 1 2 4
## 5 Ramona 1,2,5,6 Snyder , Joann , 1;2;3; 1 2 5
## 6 Kelley 1,2,5,6 James , Roxanne , 1;4; 1 2 5
## Likes_4 Likes_5
## 1 5 6
## 2 5 6
## 3 5 6
## 4 5 6
## 5 6 NA
## 6 6 NA
# ... or by name, and drop the offensive first column
head(concat.split(concat.test, "Likes", drop.col = TRUE))
## Name Siblings Hates Likes_1 Likes_2 Likes_3 Likes_4
## 1 Boyd Reynolds , Albert , Ortega 2;4; 1 2 4 5
## 2 Rufus Cohen , Bert , Montgomery 1;2;3;4; 1 2 4 5
## 3 Dana Pierce 2; 1 2 4 5
## 4 Carole Colon , Michelle , Ballard 1;4; 1 2 4 5
## 5 Ramona Snyder , Joann , 1;2;3; 1 2 5 6
## 6 Kelley James , Roxanne , 1;4; 1 2 5 6
## Likes_5
## 1 6
## 2 6
## 3 6
## 4 6
## 5 NA
## 6 NA
# The "Hates" column uses a different separator:
head(concat.split(concat.test, "Hates", sep = ";", drop.col = TRUE))
## Name Likes Siblings Hates_1 Hates_2 Hates_3 Hates_4
## 1 Boyd 1,2,4,5,6 Reynolds , Albert , Ortega 2 4 NA NA
## 2 Rufus 1,2,4,5,6 Cohen , Bert , Montgomery 1 2 3 4
## 3 Dana 1,2,4,5,6 Pierce 2 NA NA NA
## 4 Carole 1,2,4,5,6 Colon , Michelle , Ballard 1 4 NA NA
## 5 Ramona 1,2,5,6 Snyder , Joann , 1 2 3 NA
## 6 Kelley 1,2,5,6 James , Roxanne , 1 4 NA NA
## Hates_5
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
# You'll get a warning here, when trying to retain the original values
head(concat.split(concat.test, 2, mode = "value", drop.col = TRUE))
## Warning: 'mode' supplied but ignored. 'mode' setting only applicable when
## structure='expanded'.
## Name Siblings Hates Likes_1 Likes_2 Likes_3 Likes_4
## 1 Boyd Reynolds , Albert , Ortega 2;4; 1 2 4 5
## 2 Rufus Cohen , Bert , Montgomery 1;2;3;4; 1 2 4 5
## 3 Dana Pierce 2; 1 2 4 5
## 4 Carole Colon , Michelle , Ballard 1;4; 1 2 4 5
## 5 Ramona Snyder , Joann , 1;2;3; 1 2 5 6
## 6 Kelley James , Roxanne , 1;4; 1 2 5 6
## Likes_5
## 1 6
## 2 6
## 3 6
## 4 6
## 5 NA
## 6 NA
# Try again. Notice the differing number of resulting columns
head(concat.split(concat.test, 2, structure = "expanded",
mode = "value", drop.col = TRUE))
## Name Siblings Hates Likes_1 Likes_2 Likes_3 Likes_4
## 1 Boyd Reynolds , Albert , Ortega 2;4; 1 2 NA 4
## 2 Rufus Cohen , Bert , Montgomery 1;2;3;4; 1 2 NA 4
## 3 Dana Pierce 2; 1 2 NA 4
## 4 Carole Colon , Michelle , Ballard 1;4; 1 2 NA 4
## 5 Ramona Snyder , Joann , 1;2;3; 1 2 NA NA
## 6 Kelley James , Roxanne , 1;4; 1 2 NA NA
## Likes_5 Likes_6
## 1 5 6
## 2 5 6
## 3 5 6
## 4 5 6
## 5 5 6
## 6 5 6
# Let's try splitting some strings... Same syntax
head(concat.split(concat.test, 3, drop.col = TRUE))
## Name Likes Hates Siblings_1 Siblings_2 Siblings_3
## 1 Boyd 1,2,4,5,6 2;4; Reynolds Albert Ortega
## 2 Rufus 1,2,4,5,6 1;2;3;4; Cohen Bert Montgomery
## 3 Dana 1,2,4,5,6 2; Pierce
## 4 Carole 1,2,4,5,6 1;4; Colon Michelle Ballard
## 5 Ramona 1,2,5,6 1;2;3; Snyder Joann
## 6 Kelley 1,2,5,6 1;4; James Roxanne
# Split up the "Likes column" into a list variable; retain original column
head(concat.split(concat.test, 2, structure = "list", drop.col=FALSE))
## Name Likes Siblings Hates Likes_list
## 1 Boyd 1,2,4,5,6 Reynolds , Albert , Ortega 2;4; 1, 2, 4, 5, 6
## 2 Rufus 1,2,4,5,6 Cohen , Bert , Montgomery 1;2;3;4; 1, 2, 4, 5, 6
## 3 Dana 1,2,4,5,6 Pierce 2; 1, 2, 4, 5, 6
## 4 Carole 1,2,4,5,6 Colon , Michelle , Ballard 1;4; 1, 2, 4, 5, 6
## 5 Ramona 1,2,5,6 Snyder , Joann , 1;2;3; 1, 2, 5, 6
## 6 Kelley 1,2,5,6 James , Roxanne , 1;4; 1, 2, 5, 6
# View the structure of the output for the first 10 rows to verify
# that the new column is a list; note the difference between "Likes"
# and "Likes_list".
str(concat.split(concat.test, 2, structure = "list",
drop.col=FALSE)[1:10, c(2, 5)])
## 'data.frame': 10 obs. of 2 variables:
## $ Likes : Factor w/ 5 levels "1,2,3,4,5","1,2,4,5",..: 3 3 3 3 5 5 3 3 3 4
## $ Likes_list:List of 10
## ..$ : num 1 2 4 5 6
## ..$ : num 1 2 4 5 6
## ..$ : num 1 2 4 5 6
## ..$ : num 1 2 4 5 6
## ..$ : num 1 2 5 6
## ..$ : num 1 2 5 6
## ..$ : num 1 2 4 5 6
## ..$ : num 1 2 4 5 6
## ..$ : num 1 2 4 5 6
## ..$ : num 1 2 5
It is also possible to use concat.split
to split multiple columns at once. This can be done in stages, or it can be all wrapped in nested statements, as follows:
do.call(cbind,
c(concat.test[1],
lapply(1:(ncol(concat.test)-1),
function(x) {
splitchars = c(",", ",", ";")
concat.split(concat.test[-1][x], 1,
splitchars[x],
drop.col=TRUE)
})))
In the example above:
- The
lapply()
function is applied to each column in thedata.frame
except the first one. - Before applying the
concat.split
function, we enter a vector of the characters on which we should split, in the same order as the columns. Here, the first two columns are separated by commas, and the third is separated by a semicolon. - The
concat.split
function arguments can then be included as you would if splitting a single column. - We use
do.call(cbind, ...)
to "bind" the data together by columns. Since we had dropped the first column for thelapply
step, we add that back in at this stage.
# Show just the first few lines, compact structure
# Note that the split characters must be specified
# in the same order that lapply will encounter them
head(do.call(cbind,
c(concat.test[1],
lapply(1:(ncol(concat.test)-1),
function(x) {
splitchars = c(",", ",", ";")
concat.split(concat.test[-1][x], 1,
splitchars[x],
drop.col=TRUE)
}))))
## Name Likes_1 Likes_2 Likes_3 Likes_4 Likes_5 Siblings_1 Siblings_2
## 1 Boyd 1 2 4 5 6 Reynolds Albert
## 2 Rufus 1 2 4 5 6 Cohen Bert
## 3 Dana 1 2 4 5 6 Pierce
## 4 Carole 1 2 4 5 6 Colon Michelle
## 5 Ramona 1 2 5 6 NA Snyder Joann
## 6 Kelley 1 2 5 6 NA James Roxanne
## Siblings_3 Hates_1 Hates_2 Hates_3 Hates_4 Hates_5
## 1 Ortega 2 4 NA NA NA
## 2 Montgomery 1 2 3 4 NA
## 3 2 NA NA NA NA
## 4 Ballard 1 4 NA NA NA
## 5 1 2 3 NA NA
## 6 1 4 NA NA NA
# Show just the first few lines, Boolean mode
# Note the use of a regular expression for sep
# and the setting of fixed to FALSE
head(do.call(cbind,
c(concat.test[1],
lapply(1:(ncol(concat.test)-1),
function(x) {
concat.split(concat.test[-1][x], 1,
sep = "[,;]",
structure = "expanded",
fixed = FALSE,
drop.col=TRUE)
}))))
## Name Likes_1 Likes_2 Likes_3 Likes_4 Likes_5 Likes_6 Siblings_1 Siblings_2
## 1 Boyd 1 1 NA 1 1 1 Reynolds Albert
## 2 Rufus 1 1 NA 1 1 1 Cohen Bert
## 3 Dana 1 1 NA 1 1 1 Pierce <NA>
## 4 Carole 1 1 NA 1 1 1 Colon Michelle
## 5 Ramona 1 1 NA NA 1 1 Snyder Joann
## 6 Kelley 1 1 NA NA 1 1 James Roxanne
## Siblings_3 Hates_1 Hates_2 Hates_3 Hates_4
## 1 Ortega NA 1 NA 1
## 2 Montgomery 1 1 1 1
## 3 <NA> NA 1 NA NA
## 4 Ballard 1 NA NA 1
## 5 <NA> 1 1 1 NA
## 6 <NA> 1 NA NA 1
# Show just the first few lines, value mode
head(do.call(cbind,
c(concat.test[1],
lapply(1:(ncol(concat.test)-1),
function(x) {
concat.split(concat.test[-1][x], 1,
sep = "[,;]",
structure = "expanded",
mode = "value",
fixed = FALSE,
drop.col=TRUE)
}))))
## Name Likes_1 Likes_2 Likes_3 Likes_4 Likes_5 Likes_6 Siblings_1 Siblings_2
## 1 Boyd 1 2 NA 4 5 6 Reynolds Albert
## 2 Rufus 1 2 NA 4 5 6 Cohen Bert
## 3 Dana 1 2 NA 4 5 6 Pierce <NA>
## 4 Carole 1 2 NA 4 5 6 Colon Michelle
## 5 Ramona 1 2 NA NA 5 6 Snyder Joann
## 6 Kelley 1 2 NA NA 5 6 James Roxanne
## Siblings_3 Hates_1 Hates_2 Hates_3 Hates_4
## 1 Ortega NA 2 NA 4
## 2 Montgomery 1 2 3 4
## 3 <NA> NA 2 NA NA
## 4 Ballard 1 NA NA 4
## 5 <NA> 1 2 3 NA
## 6 <NA> 1 NA NA 4
# Show just the first few lines, list output mode
head(do.call(cbind,
c(concat.test[1],
lapply(1:(ncol(concat.test)-1),
function(x) {
concat.split(concat.test[-1][x], 1,
sep = "[,;]",
structure = "list",
fixed = FALSE,
drop.col=TRUE)
}))))
## Name Likes_list Siblings_list Hates_list
## 1 Boyd 1, 2, 4, 5, 6 Reynolds, Albert, Ortega 2, 4
## 2 Rufus 1, 2, 4, 5, 6 Cohen, Bert, Montgomery 1, 2, 3, 4
## 3 Dana 1, 2, 4, 5, 6 Pierce 2
## 4 Carole 1, 2, 4, 5, 6 Colon, Michelle, Ballard 1, 4
## 5 Ramona 1, 2, 5, 6 Snyder, Joann 1, 2, 3
## 6 Kelley 1, 2, 5, 6 James, Roxanne 1, 4
See: http://stackoverflow.com/q/10100887/1270695
The "condensed"
setting was inspired by an answer from David Winsemius (@DWin) to a question at Stack Overflow. See: http://stackoverflow.com/a/13924245/1270695
\cleardoublepage