In [1]:
options(jupyter.rich_display = FALSE)

# Week 10: Data Wrangling in R

## POP77001 Computer Programming for Social Scientists

### Tom Paskhalis

##### 15 November 2021

##### Module website: [bit.ly/POP77001](https://bit.ly/POP77001)

## Overview

- Data frames
- `tidyverse` packages
- Working with tabular data
- Data input and output
- Summary statistics


## Tidy data

- Tidy data is a specific subset of rectangular data, where:
    - Each variable is in a column
    - Each observation is in a row
    - Each value is in a cell

<div style="text-align: center;">
    <img width="700" height="700" src="../imgs/tidy_data.png">
</div>

Source: [R for Data Science](https://r4ds.had.co.nz/tidy-data.html) 

## Creating a data frame

- Data frames can be created using the `data.frame()` function
- Or converted from other R objects such as vectors, matrices and lists (with elements of equal length)

In [2]:
df <- data.frame(
    x = 1:2,
    y = letters[1:2],
    z = c(TRUE, FALSE)
)
df

  x y z    
1 1 a  TRUE
2 2 b FALSE

## Creating a data frame examples

In [3]:
l <- list(x = 1:5, y = letters[1:5], z = rep(c(TRUE, FALSE), length.out = 5))
l

$x
[1] 1 2 3 4 5

$y
[1] "a" "b" "c" "d" "e"

$z
[1]  TRUE FALSE  TRUE FALSE  TRUE


In [4]:
df <- data.frame(l)
df

  x y z    
1 1 a  TRUE
2 2 b FALSE
3 3 c  TRUE
4 4 d FALSE
5 5 e  TRUE

In [5]:
str(df)

'data.frame':	5 obs. of  3 variables:
 $ x: int  1 2 3 4 5
 $ y: chr  "a" "b" "c" "d" ...
 $ z: logi  TRUE FALSE TRUE FALSE TRUE
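
Conversion from a matrix or from separate vectors follows the same pattern as the list example. The following is a minimal sketch in base R; the object names (`m`, `df_m`, `ids`, `grp`, `df_v`) are made up for illustration.

```r
# A matrix with named columns can be coerced with as.data.frame()
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("a", "b")))
df_m <- as.data.frame(m)
str(df_m)

# Vectors of equal length become columns of a new data frame
ids <- 1:3
grp <- c("x", "y", "x")
df_v <- data.frame(id = ids, group = grp)
dim(df_v)
```

Note that every column of `df_m` has the same type, since a matrix can only hold one data type.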


## Building a data frame

- `rbind()` (row bind) - appends a row to a data frame
- `cbind()` (column bind) - appends a column to a data frame
- Both require compatible sizes (the same number of columns for `rbind()`, rows for `cbind()`)

## Building a data frame examples

In [6]:
rand <- rnorm(5)
rand

[1] -1.3910228 -2.3316790 -0.6458175  1.3926078  1.0089422

In [7]:
df <- cbind(df, rand)
df

  x y z     rand      
1 1 a  TRUE -1.3910228
2 2 b FALSE -2.3316790
3 3 c  TRUE -0.6458175
4 4 d FALSE  1.3926078
5 5 e  TRUE  1.0089422

In [8]:
# Note that a row has to be a list as it contains different data types
r <- list(6, letters[6], FALSE, rnorm(1))
r

[[1]]
[1] 6

[[2]]
[1] "f"

[[3]]
[1] FALSE

[[4]]
[1] -2.018545


In [9]:
df <- rbind(df, r)
df

  x y z     rand      
1 1 a  TRUE -1.3910228
2 2 b FALSE -2.3316790
3 3 c  TRUE -0.6458175
4 4 d FALSE  1.3926078
5 5 e  TRUE  1.0089422
6 6 f FALSE -2.0185451
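
The size requirement can be checked by deliberately passing incompatible inputs. This is a minimal sketch with a made-up data frame `df_small`; `tryCatch()` is used only to capture the error message instead of stopping execution.

```r
df_small <- data.frame(x = 1:3, y = c("a", "b", "c"))

# 2 values cannot be recycled evenly into 3 rows, so cbind() fails
col_result <- tryCatch(
    cbind(df_small, w = c(10, 20)),
    error = function(e) conditionMessage(e)
)
col_result

# A data frame with a different number of columns cannot be row-bound
row_result <- tryCatch(
    rbind(df_small, data.frame(x = 4L)),
    error = function(e) conditionMessage(e)
)
row_result
```

In both cases the result is the captured error message rather than a modified data frame.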

## `tidyverse` packages

- `tidyverse` [package ecosystem](https://www.tidyverse.org/) - rich collection of data science packages
- Designed with consistent interfaces and generally higher usability than base R functions
- Notable packages:
    - `readr` - data input/output (also `readxl` for spreadsheets, `haven` for SPSS/Stata)
    - `dplyr` - data manipulation (also `tidyr` for pivoting)
    - `ggplot2` - data visualisation
    - `lubridate` - working with dates and time
    - `tibble` - enhanced data frame

```
install.packages("tidyverse")
```

## Tibble vs data frame

- Tibbles are designed to be backward compatible with base R data frames
- Console printing of tibbles is cleaner (prettified, only first 10 rows by default)
- Tibbles can have columns that themselves contain lists as elements
- Tibbles can be created with the `tibble::tibble()` function
- Or objects can be coerced into a tibble using the `tibble::as_tibble()` function

In [10]:
tb <- tibble::tibble(
    x = 1:4,
    y = c("a", "b", "c", "d"),
    z = c(TRUE, FALSE, FALSE, TRUE)
)
tb

  x y z    
1 1 a  TRUE
2 2 b FALSE
3 3 c FALSE
4 4 d  TRUE
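
Coercion and list columns can be sketched as follows. The example assumes the `tibble` package is installed; the objects `df_base`, `tb_from_df` and `tb_list` are hypothetical and used only for illustration.

```r
# Coerce an existing base R data frame into a tibble
df_base <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb_from_df <- tibble::as_tibble(df_base)
class(tb_from_df)

# A list column: each cell holds a vector of a different length
tb_list <- tibble::tibble(
    id = 1:2,
    scores = list(c(90, 85), c(70, 60, 75))
)
lengths(tb_list$scores)
```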

## Tibbles work (mostly) like data frames

In [11]:
str(tb)

tibble [4 × 3] (S3: tbl_df/tbl/data.frame)
 $ x: int [1:4] 1 2 3 4
 $ y: chr [1:4] "a" "b" "c" "d"
 $ z: logi [1:4] TRUE FALSE FALSE TRUE


In [12]:
dim(tb)

[1] 4 3

In [13]:
tb[c("x", "z")]

  x z    
1 1  TRUE
2 2 FALSE
3 3 FALSE
4 4  TRUE

In [14]:
tb[tb$y == "b",]

  x y z    
1 2 b FALSE
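
One place where the two differ is single-column subsetting with `[ , ]`. The sketch below assumes the `tibble` package is installed and uses made-up objects `df_cmp` and `tb_cmp`.

```r
df_cmp <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb_cmp <- tibble::as_tibble(df_cmp)

# A data frame drops to a plain vector when one column is selected...
is.data.frame(df_cmp[, "x"])
# ...while a tibble stays a tibble
is.data.frame(tb_cmp[, "x"])

# [[ ]] (or $) extracts the underlying vector from either object
identical(df_cmp[["x"]], tb_cmp[["x"]])
```

Code that relies on `df[, "x"]` returning a vector may therefore need `[[ ]]` when it is given a tibble.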

## Manipulating columns in base R

In [15]:
# New columns can also be created/modified by assignment (if the RHS object has the correct length)
tb['r'] <- rnorm(4)
tb

  x y z     r        
1 1 a  TRUE 0.5000782
2 2 b FALSE 0.5489620
3 3 c FALSE 1.1344874
4 4 d  TRUE 0.5488395

In [16]:
# Individual columns can also be selected with $ operator
tb$r <- tb$r + 5
tb

  x y z     r       
1 1 a  TRUE 5.500078
2 2 b FALSE 5.548962
3 3 c FALSE 6.134487
4 4 d  TRUE 5.548840

In [17]:
# names() attribute for data frames/tibbles contains column names
names(tb)

[1] "x" "y" "z" "r"

In [18]:
names(tb)[4] <- "rand"
tb

  x y z     rand    
1 1 a  TRUE 5.500078
2 2 b FALSE 5.548962
3 3 c FALSE 6.134487
4 4 d  TRUE 5.548840

## Data manipulation with `dplyr`

- `dplyr` is one of the core packages for data manipulation in `tidyverse`
- Its principal functions are:
    - `filter()` - subset rows from data
    - `mutate()` - add new/modify existing variables
    - `rename()` - rename existing variable
    - `select()` - subset columns from data
    - `arrange()` - order data by some variable
    
- For data summary:
    - `group_by()` - group data by some variable
    - `summarise()` - create a summary of aggregated variables

In [19]:
library("dplyr")


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




## Data manipulation with `dplyr` examples

In [20]:
dplyr::filter(tb, y == 'b', z == FALSE)

  x y z     rand    
1 2 b FALSE 5.548962

In [21]:
# Note that dplyr functions do not require quoted variable names
dplyr::select(tb, x, z)

  x z    
1 1  TRUE
2 2 FALSE
3 3 FALSE
4 4  TRUE

In [22]:
# We can also use helpful tidyselect functions for more complex rules
dplyr::select(tb, tidyselect::starts_with('r'))

  rand    
1 5.500078
2 5.548962
3 6.134487
4 5.548840

In [23]:
# Data is not modified in place; you need to reassign the result
tb <- dplyr::rename(tb, random = rand)

In [24]:
dplyr::mutate(tb, random_8plus = ifelse(random >= 8, TRUE, FALSE))

  x y z     random   random_8plus
1 1 a  TRUE 5.500078 FALSE       
2 2 b FALSE 5.548962 FALSE       
3 3 c FALSE 6.134487 FALSE       
4 4 d  TRUE 5.548840 FALSE       
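
The grouping, summarising and ordering functions can be combined in the same way. This is a minimal sketch on a made-up data frame `votes` (the party labels and turnout values are invented for illustration), assuming `dplyr` is installed.

```r
library(dplyr)

# Hypothetical toy data, not part of the lecture dataset
votes <- data.frame(
    party = c("A", "B", "A", "C", "B", "A"),
    turnout = c(0.61, 0.48, 0.72, 0.55, 0.52, 0.65)
)

party_summary <- votes %>%
    group_by(party) %>%                                  # one group per party
    summarise(n = n(), mean_turnout = mean(turnout)) %>% # one row per group
    arrange(desc(mean_turnout))                          # highest mean first
party_summary
```

`group_by()` on its own does not change the data; it only records the grouping that `summarise()` then collapses into one row per group.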

## `%>%` operator

- Users of `tidyverse` packages are encouraged to use the pipe operator `%>%`
- It allows chaining data transformations without creating intermediate variables
- It passes the result of the previous operation as the first argument to the next
- Base R (since version 4.1) also includes its own pipe operator `|>` but it is still relatively uncommon

```
<result> <- <input> %>%
  <function_name>(., arg_1, arg_2, ..., arg_n)

<result> <- <input> %>%
  <function_name>(arg_1, arg_2, ..., arg_n)
```

## `%>%` operator examples

In [25]:
tb

  x y z     random  
1 1 a  TRUE 5.500078
2 2 b FALSE 5.548962
3 3 c FALSE 6.134487
4 4 d  TRUE 5.548840

In [26]:
tb <- tb %>%
  dplyr::mutate(random_2 = rnorm(4)) %>%
  dplyr::filter(z == FALSE)

In [27]:
tb

  x y z     random   random_2 
1 2 b FALSE 5.548962 -1.924009
2 3 c FALSE 6.134487  1.460439

In [28]:
# Pipe %>% can also be used with non-dplyr functions
tb$x %>% .[2]

[1] 3

In [29]:
# Base R pipe operator |> is more restrictive (e.g. tb$x |> `[`(2) doesn't work)
tb |> nrow()

[1] 2
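
The equivalence between nested calls and a pipeline, and one way around the `[` restriction, can be sketched with base R only. The example assumes R 4.1 or later (for `|>` and the `\(v)` lambda shorthand); the vector `values` is made up.

```r
values <- c(1, 4, 9)

# Nested call and pipeline compute the same thing
nested <- sum(sqrt(values))
piped <- values |> sqrt() |> sum()
identical(nested, piped)

# Wrapping the subsetting in an anonymous function makes it a regular call
second <- values |> (\(v) v[2])()
second
```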

## Pivoting data

- Sometimes you want to pivot your data by:
    - Spreading some variable across columns (`tidyr::pivot_wider()`)
    - Gathering some columns in a variable pair (`tidyr::pivot_longer()`)
    
<table>
    <tr>
        <td><img width="500" src='../imgs/pivot_wider.png'></td>
        <td><img width="500" src='../imgs/pivot_longer.png'></td>
    </tr>
    <tr>
        <td style="text-align:center"><h2>pivot_wider()</h2></td>
        <td style="text-align:center"><h2>pivot_longer()</h2></td>
    </tr>
</table>

Source: [R for Data Science](https://r4ds.had.co.nz/tidy-data.html?q=pivot#pivoting)

## Pivoting data example

In [30]:
tb2 <- tibble::tibble(
  country = c("Afghanistan", "Brazil"),
  `1999` = c(745, 2666),
  `2000` = c(37737, 80488)
)
tb2

  country     1999 2000 
1 Afghanistan  745 37737
2 Brazil      2666 80488

In [31]:
tb2 <- tb2 %>%
  # Note that pivoting functions come from the 'tidyr' package
  tidyr::pivot_longer(cols = c("1999", "2000"), names_to = "year", values_to = "cases")
tb2

  country     year cases
1 Afghanistan 1999   745
2 Afghanistan 2000 37737
3 Brazil      1999  2666
4 Brazil      2000 80488

In [32]:
tb2 <- tb2 %>%
  tidyr::pivot_wider(names_from = "year", values_from = "cases")
tb2

  country     1999 2000 
1 Afghanistan  745 37737
2 Brazil      2666 80488
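
After `pivot_longer()` the former column names are stored as character strings, so `year` above is text rather than a number. One possible way to convert it during pivoting is the `names_transform` argument. The sketch below assumes reasonably recent versions of `tidyr` and `tibble`, and rebuilds the same table under the hypothetical name `cases_wide`.

```r
cases_wide <- tibble::tibble(
    country = c("Afghanistan", "Brazil"),
    `1999` = c(745, 2666),
    `2000` = c(37737, 80488)
)

cases_long <- tidyr::pivot_longer(
    cases_wide,
    cols = -country,                            # every column except country
    names_to = "year",
    names_transform = list(year = as.integer),  # store year as integer
    values_to = "cases"
)
str(cases_long$year)
```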

## Data formats in R

- `.csv` (Comma-separated values) files for storing tabular data
    - Standard file format for storing data that is highly interoperable across systems
    - Human-readable and can be opened in a simple text editor
- `.rds` (R data serialization) files store a single R object
    - Can store arbitrary R objects (e.g. fitted model), similar to Python `pickle`
    - Offers data compression
    - Only works within R
- `.rda` (R data) files for saving and loading multiple R objects
    - Offers data compression
    - Compares unfavourably to `.rds` and, generally, should be avoided
- `.feather`/`.parquet` - big data formats associated with the [Apache Hadoop](https://en.wikipedia.org/wiki/Apache_Hadoop) ecosystem
    - Cutting-edge performance (compression and read/write access)
    - Not human-readable
    - Relatively new, could be overkill for some tasks

## Functions for data I/O

- `.csv` (Comma-separated values)
    - `read.csv()`/`write.csv()` - base R functions
    - `readr::read_csv()`/`readr::write_csv()` - functions from `readr` package in `tidyverse`
- `.rds` (R data serialization)
    - `readRDS()`/`saveRDS()` - base R functions
    - `readr::read_rds()`/`readr::write_rds()` - functions from `readr` (no default compression)
- `.rda` (R data)
    - `save()`/`load()` - base R functions
- `.feather`/`.parquet`
    - `arrow::read_feather()`/`arrow::write_feather()` - functions from the `arrow` package in [Apache Arrow](https://github.com/apache/arrow/tree/master/r)
    - `arrow::read_parquet()`/`arrow::write_parquet()` - also from the `arrow` package
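
A round trip through the base R functions shows the practical difference between the text and the serialized formats. This is a minimal sketch that writes to temporary files; the data frame `df_io` and the path variables are made up for illustration.

```r
df_io <- data.frame(
    id = 1:3,
    party = factor(c("A", "B", "A"))
)

# CSV: plain text, so R-specific types such as factors are not stored
csv_path <- tempfile(fileext = ".csv")
write.csv(df_io, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path)
class(from_csv$party)

# RDS: the object is serialized as-is, including column types
rds_path <- tempfile(fileext = ".rds")
saveRDS(df_io, rds_path)
from_rds <- readRDS(rds_path)
class(from_rds$party)
```

Passing `row.names = FALSE` to `write.csv()` avoids an extra column of row numbers appearing when the file is read back.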

In [33]:
options(jupyter.rich_display = TRUE)

## Reading data in R example

In [34]:
# We are skipping the first row as this dataset has a composite header of 2 rows (variable name, question)
kaggle2020 <- readr::read_csv('../data/kaggle_survey_2020_responses.csv', skip = 1)


── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
  .default = col_character(),
  `Duration (in seconds)` = col_double()
)
ℹ Use `spec()` for the full column specifications.




In [35]:
head(kaggle2020)

Duration (in seconds),What is your age (# years)?,What is your gender? - Selected Choice,In which country do you currently reside?,What is the highest level of formal education that you have attained or plan to attain within the next 2 years?,Select the title most similar to your current role (or most recent title if retired): - Selected Choice,For how many years have you been writing code and/or programming?,What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python,What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - R,What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - SQL,⋯,"In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Weights & Biases","In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Comet.ml","In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Sacred + Omniboard","In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - TensorBoard","In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Guild.ai","In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Polyaxon","In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? 
(Select all that apply) - Selected Choice - Trains","In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Domino Model Monitor","In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - None","In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Other"
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1838,35-39,Man,Colombia,Doctoral degree,Student,5-10 years,Python,R,SQL,⋯,,,,TensorBoard,,,,,,
289287,30-34,Man,United States of America,Master’s degree,Data Engineer,5-10 years,Python,R,SQL,⋯,,,,,,,,,,
860,35-39,Man,Argentina,Bachelor’s degree,Software Engineer,10-20 years,,,,⋯,,,,,,,,,,
507,30-34,Man,United States of America,Master’s degree,Data Scientist,5-10 years,Python,,SQL,⋯,,,,,,,,,,
78,30-34,Man,Japan,Master’s degree,Software Engineer,3-5 years,Python,,,⋯,,,,,,,,,,
401,30-34,Man,India,Bachelor’s degree,Data Analyst,< 1 years,Python,R,,⋯,,,,,,,,,,
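
The effect of `skip = 1` on a two-row header can be reproduced on a small inline example. This is a minimal sketch that uses base R's `read.csv()` with its `text` argument rather than `readr`; the two-question survey in `csv_text` is made up for illustration.

```r
# First row: short variable names; second row: full question text
csv_text <- "Q1,Q2
What is your age?,What is your gender?
30-34,Man
35-39,Woman"

# skip = 1 drops the first row, so the question text becomes the header;
# check.names = FALSE keeps the header strings unmodified
survey <- read.csv(text = csv_text, skip = 1, check.names = FALSE)
names(survey)
nrow(survey)
```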


## Summarizing numeric variables

In [36]:
# Note that summary() as opposed to pandas' describe() gives summary for all variable types by default
summary(kaggle2020)

 Duration (in seconds) What is your age (# years)?
 Min.   :     20       Length:20036               
 1st Qu.:    398       Class :character           
 Median :    626       Mode  :character           
 Mean   :   9156                                  
 3rd Qu.:   1030                                  
 Max.   :1144493                                  
 What is your gender? - Selected Choice
 Length:20036                          
 Class :character                      
 Mode  :character                      
                                       
                                       
                                       
 In which country do you currently reside?
 Length:20036                             
 Class :character                         
 Mode  :character                         
                                          
                                          
                                          
 What is the highest level of formal education that you have a
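
For a single numeric variable the same statistics can be computed individually. The sketch below uses a made-up vector `durations` with one extreme value to illustrate why the mean can sit far above the median, as it does for the survey duration above.

```r
# Hypothetical durations with one very large outlier
durations <- c(1, 2, 3, 4, 100)

summary(durations)
mean(durations)      # pulled upwards by the outlier
median(durations)    # robust to the outlier
quantile(durations, probs = c(0.25, 0.75))
```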

## Summarizing categorical variables

In [37]:
# table() is flexible: it can tabulate a single variable or produce crosstabs
table(kaggle2020[3])


                    Man               Nonbinary       Prefer not to say 
                  15789                      52                     263 
Prefer to self-describe                   Woman 
                     54                    3878 

In [38]:
# Wrapping it inside prop.table() gives proportions of each category
prop.table(table(kaggle2020[3]))


                    Man               Nonbinary       Prefer not to say 
            0.788031543             0.002595328             0.013126373 
Prefer to self-describe                   Woman 
            0.002695149             0.193551607 

In [39]:
# Wrapping it inside sort() orders by frequency, as opposed to alphabetically (or by factor levels)
sort(table(kaggle2020[3]), decreasing = TRUE)[1]
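
A crosstab of two variables uses the same `table()` function with two arguments. This is a minimal sketch on made-up vectors `gender` and `uses_r`, not on the survey data.

```r
# Hypothetical responses for five people
gender <- c("Man", "Woman", "Man", "Man", "Woman")
uses_r <- c(TRUE, TRUE, FALSE, TRUE, FALSE)

# Two-way frequency table (crosstab)
tab <- table(gender, uses_r)
tab

# Row proportions: share of R users within each gender
prop.table(tab, margin = 1)

# Most frequent category of a single variable
top <- sort(table(gender), decreasing = TRUE)[1]
top
```

The `margin` argument of `prop.table()` controls whether proportions are computed within rows (`1`), within columns (`2`) or over the whole table (omitted).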

## Next

- Tutorial: Working with data in R
- Next week: Performance and Complexity