In [None]:
options(jupyter.rich_display = FALSE)

# WRANGLING IRIS DATA FRAME

**by Serhat Çevikel**

## Summarizing a data frame with aggregate()

iris is a famous dataset that ships with R as a built-in data frame:

In [None]:
iris

Info on iris:

In [None]:
?iris

```
Format
iris is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.

iris3 gives the same data arranged as a 3-dimensional array of size 50 by 4 by 3, as represented by S-PLUS. The first dimension gives the case number within the species subsample, the second the measurements with names Sepal L., Sepal W., Petal L., and Petal W., and the third the species.
```

See the unique values of species:

In [None]:
unique(iris$Species)

Now let's get the average of the following columns for each species:

- Sepal.Length
- Sepal.Width
- Petal.Length
- Petal.Width

First, the long way.

Split the data frame into a list by species:

In [None]:
iris_split <- split(iris[,-5], iris[,5])
iris_split

In [None]:
str(iris_split)

For a single list item (here setosa), we get the mean of each column:

In [None]:
sapply(iris_split$setosa, mean)

Now repeat this for each list item:

In [None]:
t(sapply(iris_split, function(x) sapply(x, mean)))
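The inner sapply() call can be replaced by colMeans(), which averages every column of a data frame in one step. The following is a minimal sketch of that variant; the names by_colmeans and nested are introduced here only for illustration.

```r
# Rebuild the split list so this block runs on its own
iris_split <- split(iris[, -5], iris[, 5])

# colMeans() averages every column of one data frame in a single call
by_colmeans <- t(sapply(iris_split, colMeans))
by_colmeans

# Compare with the nested sapply() version used above
nested <- t(sapply(iris_split, function(x) sapply(x, mean)))
all.equal(by_colmeans, nested)
```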

Now let's do this with the aggregate function:

In [None]:
?aggregate

```
## S3 method for class 'data.frame'
aggregate(x, by, FUN, ..., simplify = TRUE, drop = TRUE)

Arguments
x	
an R object.

by	
a list of grouping elements, each as long as the variables in the data frame x. The elements are coerced to factors before use.

FUN	
a function to compute the summary statistics which can be applied to all data subsets.

simplify	
a logical indicating whether results should be simplified to a vector or matrix if possible.

drop	
a logical indicating whether to drop unused combinations of grouping values. The non-default case drop=FALSE has been amended for R 3.5.0 to drop unused combinations.

formula	
a formula, such as y ~ x or cbind(y1, y2) ~ x1 + x2, where the y variables are numeric data to be split into groups according to the grouping x variables (usually factors).

data	
a data frame (or list) from which the variables in formula should be taken.
```

In [None]:
averages <- aggregate(iris[,-5], by = list(iris[,5]), FUN = mean)

In [None]:
averages

Note that the average values of all four variables differ across the species.
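The help excerpt above also lists formula and data arguments. As a sketch of that interface, the same averages can be requested with a formula, where the dot stands for all remaining columns of data. The names averages_f and averages_by are illustrative only.

```r
# Formula interface: summarize every other column, grouped by Species
averages_f <- aggregate(. ~ Species, data = iris, FUN = mean)
averages_f

# The by = list(...) interface, with the grouping column given an explicit name
averages_by <- aggregate(iris[, -5], by = list(Species = iris$Species), FUN = mean)

# The value columns of both results can be compared directly
all.equal(averages_f[, -1], averages_by[, -1])
```

With the formula interface the grouping column keeps the name Species instead of the generic Group.1.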

## Merging data frames

Now let's merge the average values back into the original iris df

In [None]:
iris_avs <- merge(iris, averages, by.x = "Species", by.y = "Group.1")
iris_avs

Columns with the suffix .x hold the original values, while columns with the suffix .y hold the average values.
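The .x and .y suffixes are the defaults that merge() appends when both data frames contain columns with the same name. A short sketch of overriding them through the suffixes argument follows; the labels .obs and .avg and the name iris_avs_named are arbitrary choices made for this example.

```r
# Recompute the averages so this block runs on its own
averages <- aggregate(iris[, -5], by = list(iris[, 5]), FUN = mean)

# suffixes replaces the default c(".x", ".y") for clashing column names
iris_avs_named <- merge(iris, averages,
                        by.x = "Species", by.y = "Group.1",
                        suffixes = c(".obs", ".avg"))
names(iris_avs_named)
```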

Now let's get the difference from average values for all species and columns:

In [None]:
# Take Species from the merged frame so the labels always match its row order
data.frame(Species = iris_avs$Species, iris_avs[,2:5] - iris_avs[,6:9])
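As a cross-check that does not rely on merge(), the same deviations can be sketched with ave(), which returns the group mean repeated for every row in the original row order. The name deviations is introduced here for illustration.

```r
# Deviation of each measurement from its species mean, without merge()
deviations <- data.frame(
  Species = iris$Species,
  lapply(iris[, 1:4], function(col) col - ave(col, iris$Species))
)
head(deviations)

# Sanity check: within each species the deviations should sum to
# (numerically) zero
aggregate(deviations[, -1], by = list(deviations$Species), FUN = sum)
```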

### Merge types

Now let's create a small sample: just the first row of each species:

In [None]:
sample1 <- aggregate(iris[,-5], by = list(iris[,5]), FUN = head, 1)
sample1

Let's delete the virginica row from sample1 

In [None]:
sample2 <- sample1[-3,]
sample2

And from the averages df, let's delete the setosa row:

In [None]:
averages2 <- averages[-1,]
averages2

#### Left join

Join the sepal lengths on species, keeping every species found in the LEFT df:

In [None]:
merge(sample2[,1:2], averages2[,1:2], by = "Group.1", all.x = TRUE)

#### Right join

Now keep every species found in the RIGHT df:

In [None]:
merge(sample2[,1:2], averages2[,1:2], by = "Group.1", all.y = TRUE)

#### Full outer join

Take the union of the species found in either df:

In [None]:
merge(sample2[,1:2], averages2[,1:2], by = "Group.1", all = TRUE)

#### Inner join

Keep only the species common to both dfs (all = FALSE is the default):

In [None]:
merge(sample2[,1:2], averages2[,1:2], by = "Group.1", all = FALSE)
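To compare the four join types side by side, the sketch below runs them on two small data frames and counts the resulting rows. The names left, right and joins are hypothetical stand-ins for sample2 and averages2: left lacks virginica and right lacks setosa.

```r
# Two small frames with partly overlapping species
averages <- aggregate(iris[, -5], by = list(iris[, 5]), FUN = mean)
left  <- averages[-3, 1:2]   # setosa, versicolor
right <- averages[-1, 1:2]   # versicolor, virginica

joins <- list(
  left  = merge(left, right, by = "Group.1", all.x = TRUE),
  right = merge(left, right, by = "Group.1", all.y = TRUE),
  full  = merge(left, right, by = "Group.1", all = TRUE),
  inner = merge(left, right, by = "Group.1")
)

# Number of rows kept by each join type
sapply(joins, nrow)

# Species kept by each join type
lapply(joins, function(d) as.character(d$Group.1))
```

Rows that have no match on the other side are filled with NA in the columns coming from that side.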

## Reshape data frames

Now let's calculate both min and max values for each column and each species:

In [None]:
agg1 <- aggregate(iris[,1:4],
                  by = list(iris[,5]),
                  FUN = function(x) c(min = min(x), max = max(x)))

In [None]:
str(agg1)

The aggregate output for each column is an embedded matrix.
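A quick way to see the embedded matrices is to inspect one column directly. This sketch rebuilds agg1 so that it runs on its own.

```r
agg1 <- aggregate(iris[, 1:4],
                  by = list(iris[, 5]),
                  FUN = function(x) c(min = min(x), max = max(x)))

# The data frame reports 5 columns: each matrix counts as a single column
dim(agg1)

# Each value column is itself a matrix with a "min" and a "max" column
is.matrix(agg1$Sepal.Length)
colnames(agg1$Sepal.Length)

# Individual statistics are reached with matrix indexing
agg1$Sepal.Length[, "max"]
```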

We combine all of them into a single data frame as such:

In [None]:
agg2 <- do.call(data.frame, unclass(agg1))
agg2

That's too many columns.

Instead, we could have four value columns, with the min and max values in separate rows for each species.

The reshape() function will be used for this:

In [None]:
?reshape

```
data: a data frame

 varying: names of sets of variables in the wide format that correspond
          to single variables in long format (‘time-varying’).  This is
          canonically a list of vectors of variable names, but it can
          optionally be a matrix of names, or a single vector of names.
          In each case, the names can be replaced by indices which are
          interpreted as referring to ‘names(data)’.  See ‘Details’ for
          more details and options.

 v.names: names of variables in the long format that correspond to
          multiple variables in the wide format.  See ‘Details’.

 timevar: the variable in long format that differentiates multiple
          records from the same group or individual.  If more than one
          record matches, the first will be taken (with a warning).

   idvar: Names of one or more variables in long format that identify
          multiple records from the same group/individual.  These
          variables may also be present in wide format.

     ids: the values to use for a newly created ‘idvar’ variable in
          long format.

   times: the values to use for a newly created ‘timevar’ variable in
          long format.  See ‘Details’.

    drop: a vector of names of variables to drop before reshaping.

direction: character string, partially matched to either ‘"wide"’ to
          reshape to wide format, or ‘"long"’ to reshape to long
          format.
```

### Melt data frame

We melt all value columns so that each original row becomes 8 rows and column headers become a separate column:

In [None]:
cols <- names(agg2)[-1]

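iris_long <- reshape(agg2,
                     idvar = "Group.1",    # column that identifies each row
                     varying = cols,       # wide columns to stack
                     times = cols,         # labels stored in the new "time" column
                     v.names = "value",    # name of the stacked value column
                     direction = "long")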

Reset the row names for a cleaner view:

In [None]:
rownames(iris_long) <- NULL
iris_long
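To make the roles of varying, times and v.names easier to see, here is a minimal sketch on a made-up two-row data frame. The data and the names wide and long are hypothetical and serve only as an illustration.

```r
# Hypothetical toy data: one id column and two value columns
wide <- data.frame(id = c("a", "b"), x.min = c(1, 2), x.max = c(10, 20))

long <- reshape(wide,
                idvar     = "id",
                varying   = c("x.min", "x.max"),  # columns to stack
                times     = c("x.min", "x.max"),  # labels for the "time" column
                v.names   = "value",              # name of the stacked column
                direction = "long")
rownames(long) <- NULL
long
```

Each of the two original rows becomes two rows, one per stacked column, just as each row of agg2 becomes eight rows.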

It would be better to split the time column into two columns:

- Sepal.Length, Sepal.Width, Petal.Length and Petal.Width in one column
- min and max in the other:

In [None]:
split1 <- sapply(iris_long$time, strsplit, split = "\\.")
split1

In [None]:
split2 <- t(sapply(unname(split1),
                   function(x) c(paste(x[1], x[2], sep = "."), x[3])))
split2

In [None]:
iris_long2 <- data.frame(iris_long[,-2], split2)
iris_long2
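The same split can also be sketched with two regular expressions instead of strsplit(): one removes the last dot-separated part, the other keeps only that part. The vector labels and the names measure and statistic are illustrative, not part of the notebook.

```r
# Example labels in the same form as the "time" column
labels <- c("Sepal.Length.min", "Petal.Width.max")

measure   <- sub("\\.[^.]+$", "", labels)  # drop the final ".suffix"
statistic <- sub("^.*\\.", "", labels)     # keep what follows the last dot

data.frame(labels, measure, statistic)
```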

### Cast data frame

Now let's keep Group.1 and X2 columns and convert each unique string in X1 into a separate column. The "value" column will be the values in each cell of the new columns:

In [None]:
iris_wide <- reshape(iris_long2,
                      idvar = c("Group.1", "X2"),
                      v.names = "value",
                      timevar = "X1",
                      direction = "wide")

In [None]:
colnames(iris_wide)[3:6] <- colnames(iris)[1:4]
iris_wide
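The casting step can be followed more easily on a tiny example. The sketch below uses a made-up long data frame; toy_long and toy_wide are hypothetical names. It shows how the values in the timevar column become the suffixes of the new column names.

```r
# Hypothetical long data: one row per (id, stat) pair
toy_long <- data.frame(id    = c("a", "b", "a", "b"),
                       stat  = c("min", "min", "max", "max"),
                       value = c(1, 2, 10, 20))

toy_wide <- reshape(toy_long,
                    idvar     = "id",     # one output row per id
                    timevar   = "stat",   # its unique values become columns
                    v.names   = "value",  # cell contents of the new columns
                    direction = "wide")
toy_wide
names(toy_wide)
```

The new columns are named by pasting v.names and each timevar value together, which is why the notebook renames columns 3 to 6 afterwards.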