# Data Wrangling in R Part 2

In [None]:
library(bigrquery)
library(tidyverse) #imports stringr, dplyr and tidyr
library(ggplot2)

In [None]:
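# Run a query against the workspace CDR and download the result.
# str_glue() fills in placeholders such as {DATASET} before the query is sent.
download_data <- function(query) {
  tb <- bq_project_query(Sys.getenv('GOOGLE_PROJECT'),
                         query = str_glue(query),
                         default_dataset = Sys.getenv('WORKSPACE_CDR'))
  bq_table_download(tb)
}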

DATASET <- Sys.getenv('WORKSPACE_CDR')
DATASET

# How to pivot if questions have multiple answers for each person_id

For example, some questions in 'The Basics' survey can have multiple answers for each person_id. Datasets in which a question can have several answers per person need special attention when pivoting. This part of the tutorial walks through pivoting such data with pivot_wider() from the R tidyr package, as well as with SQL.

We will cover:

- Basic Pivoting: Converting long-format data into wide-format.
- Handling Multiple Answers: Aggregating multiple answers for the same question.
- Examples: Step-by-step examples demonstrating various scenarios.
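Before querying the real data, the idea can be sketched on a tiny, made-up table. The column names mirror the survey data, but the values and the question labels `q_single` and `q_multi` are assumptions for illustration only.

```r
library(tibble)
library(tidyr)

# Toy long-format survey data (made-up values, not from the CDR):
# person 1 gave two answers to the question "q_multi".
long <- tribble(
  ~person_id, ~question,  ~answer,
  1,          "q_single", "A",
  1,          "q_multi",  "X",
  1,          "q_multi",  "Y",
  2,          "q_single", "B",
  2,          "q_multi",  "X"
)

# One row per person, one column per question; multiple answers
# are collapsed into a single comma-separated string.
wide <- pivot_wider(
  long,
  id_cols     = person_id,
  names_from  = question,
  values_from = answer,
  values_fn   = function(x) paste(x, collapse = ", ")
)
wide
```

The result has one row per person, and person 1's `q_multi` cell holds both answers in one string.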

## How many questions can have multiple answers in 'The Basics' survey?

In [None]:
query ="
with df1 as (
SELECT 
  person_id, 
  question_concept_id, question,
  COUNT(DISTINCT answer_concept_id) AS num_answers
FROM 
  `{DATASET}.ds_survey`
WHERE survey='The Basics'
GROUP BY 
  1,2,3
HAVING 
  COUNT(DISTINCT answer_concept_id) > 1
)

SELECT DISTINCT question_concept_id, question
FROM df1
"

In [None]:
df = download_data(query)
dim(df)

In the Controlled Tier (CT) dataset, 8 questions in 'The Basics' survey allow multiple answers per person_id.

In [None]:
df

## We use two questions (1585899, 1585845) as an example

The question 1585899 (The Basics: Sexual Orientation) allows multiple answers.

In [None]:
query ="
WITH df1 AS (
SELECT 
  person_id, 
  question_concept_id, question,
  COUNT(DISTINCT answer_concept_id) AS num_answers
FROM 
  `{DATASET}.ds_survey`
WHERE survey='The Basics'
AND question_concept_id IN (1585899)
GROUP BY 
  1,2,3
HAVING 
  COUNT(DISTINCT answer_concept_id) > 1
  ),
  
df2 AS (SELECT DISTINCT person_id, question_concept_id, question, answer_concept_id, answer

FROM 
  `{DATASET}.ds_survey`
WHERE question_concept_id IN (1585899,1585845) 
AND person_id IN (SELECT DISTINCT person_id FROM df1 LIMIT 5)
)

SELECT * FROM df2
ORDER BY person_id
"

In [None]:
df = download_data(query)
dim(df)

In [None]:
df

## How to check which rows have multiple answers in a data frame

In [None]:
# library(dplyr)
df2=df |>
  dplyr::summarise(n = dplyr::n(), .by = c(person_id, question_concept_id)) |>
  dplyr::filter(n > 1L)
dim(df2)

In [None]:
df2
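The check matters because of how pivot_wider() treats duplicate keys. The sketch below uses a made-up table (the values and the concept id 100 are assumptions) to show the same check written with count(), and what happens when the table is pivoted without a `values_fn`.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Toy data: person 1 has two answers for question_concept_id 100.
toy <- tribble(
  ~person_id, ~question_concept_id, ~answer,
  1,          100,                  "X",
  1,          100,                  "Y",
  2,          100,                  "X"
)

# Same duplicate check as above, written with count().
dups <- toy |>
  count(person_id, question_concept_id) |>
  filter(n > 1)

# Without values_fn, pivot_wider() warns and stores the answers in list-columns.
wide_list <- suppressWarnings(
  pivot_wider(toy,
              id_cols     = person_id,
              names_from  = question_concept_id,
              values_from = answer)
)
is.list(wide_list[["100"]])
```

List-columns are awkward to export or join on, which is why the methods in this tutorial collapse the answers into a single string.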

## Method 1: how to pivot this table using pivot_wider() in the R tidyr package

In [None]:
df_wide <- pivot_wider(df,
                   id_cols = person_id,
                   names_from = question,
                   values_from = answer_concept_id,
                   values_fn = list(answer_concept_id = ~ paste(.x, collapse = ", ")))
dim(df_wide)

In [None]:
df_wide

In [None]:
df_wide <- pivot_wider(df,
                   id_cols = person_id,
                   names_from = question,
                   values_from = answer,
                   values_fn = list(answer = ~ paste(.x, collapse = ", ")))
dim(df_wide)

In [None]:
df_wide
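One possible refinement, shown here as a sketch on made-up data, is to de-duplicate and sort the answers before pasting them, so that the collapsed string does not depend on row order. The helper name `collapse_unique` is hypothetical.

```r
library(tibble)
library(tidyr)

# Toy data with a repeated answer and unsorted rows (made-up values).
toy <- tribble(
  ~person_id, ~question, ~answer,
  1,          "q_multi", "Y",
  1,          "q_multi", "X",
  1,          "q_multi", "Y",
  2,          "q_multi", "X"
)

# Hypothetical helper: drop repeats, sort, then collapse to one string.
collapse_unique <- function(x) paste(sort(unique(x)), collapse = ", ")

wide_sorted <- pivot_wider(
  toy,
  id_cols     = person_id,
  names_from  = question,
  values_from = answer,
  values_fn   = collapse_unique
)
wide_sorted
```

Whether de-duplication is appropriate depends on the analysis; the queries above already use SELECT DISTINCT, so repeats are unlikely in that particular data frame.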

## Method 2: how to aggregate multiple answers using SQL

We can also perform the aggregation directly in BigQuery using STRING_AGG(). This is useful for RStudio/SAS users with limited RAM, because only the aggregated result is downloaded into memory.

In [None]:
query="
WITH df1 AS (
SELECT 
  person_id, 
  question_concept_id, question,
  COUNT(DISTINCT answer_concept_id) AS num_answers
FROM 
  `{DATASET}.ds_survey`
WHERE survey='The Basics'
AND question_concept_id IN (1585899)
GROUP BY 
  1,2,3
HAVING 
  COUNT(DISTINCT answer_concept_id) > 1
  )

SELECT 
  person_id,
  question_concept_id,question,
  STRING_AGG(CAST(answer_concept_id AS STRING), ', ') AS answers_id,
  STRING_AGG(CAST(answer AS STRING), ', ') AS answers 
FROM 
  `{DATASET}.ds_survey`
WHERE question_concept_id IN (1585899,1585845)
AND person_id IN (SELECT DISTINCT person_id FROM df1 LIMIT 5)

GROUP BY 1,2,3
"

In [None]:
df = download_data(query)
dim(df)

In [None]:
head(df)

**We can then use the regular pivot_wider() function**

In [None]:
df_wide <- pivot_wider(df, id_cols = person_id, names_from = question_concept_id, values_from = answers)
dim(df_wide)

In [None]:
head(df_wide)
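For intuition, the two steps of Method 2 (GROUP BY with STRING_AGG(), then a plain pivot) can be mimicked locally with dplyr and tidyr. This is only an analogy on made-up data, not what BigQuery executes; the ids and answers are assumptions.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Toy data (made-up ids and answers).
toy <- tribble(
  ~person_id, ~question_concept_id, ~answer_concept_id, ~answer,
  1,          100,                  11,                 "X",
  1,          100,                  12,                 "Y",
  2,          100,                  11,                 "X"
)

# Step 1: local analogue of GROUP BY person_id, question_concept_id + STRING_AGG().
agg <- toy |>
  summarise(
    answers_id = paste(answer_concept_id, collapse = ", "),
    answers    = paste(answer, collapse = ", "),
    .by = c(person_id, question_concept_id)
  )

# Step 2: each person/question pair is now unique, so no values_fn is needed.
agg_wide <- pivot_wider(agg,
                        id_cols     = person_id,
                        names_from  = question_concept_id,
                        values_from = answers)
agg_wide
```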

## Method 3: how to pivot using SQL

We can also perform the pivot directly in BigQuery using both STRING_AGG() and PIVOT(). This is useful for RStudio/SAS users with limited RAM, because only the pivoted result is downloaded into memory.

In [None]:
df=download_data("SELECT DISTINCT question 
                    FROM  `{DATASET}.ds_survey` WHERE survey='The Basics' and question_concept_id IN (1585899,1585845)")
dim(df)

In [None]:
df

In [None]:
questions <- paste0("'", unique(df$question), "'", collapse = ", ")
questions
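The paste0() call above works as long as no question text contains a single quote. If one did, the generated SQL would be malformed. A minimal sketch of a safer variant follows; `sql_quote_list` is a hypothetical helper, not part of bigrquery, and it assumes BigQuery's backslash escapes inside single-quoted string literals.

```r
# Hypothetical helper: build a quoted, comma-separated list for an IN (...) clause,
# escaping backslashes and single quotes so the string literals stay valid.
sql_quote_list <- function(values) {
  escaped <- gsub("\\", "\\\\", unique(values), fixed = TRUE)  # \ becomes \\
  escaped <- gsub("'", "\\'", escaped, fixed = TRUE)            # ' becomes \'
  paste0("'", escaped, "'", collapse = ", ")
}

# Made-up question texts for illustration.
quoted <- sql_quote_list(c("What's your age?", "Plain question"))
cat(quoted)
```

With question texts that contain no quotes or backslashes, the helper returns the same string as the paste0() call.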

In [None]:
query="
WITH df1 AS (
SELECT 
  person_id, 
  question_concept_id, question,
  COUNT(DISTINCT answer_concept_id) AS num_answers
FROM 
  `{DATASET}.ds_survey`
WHERE survey='The Basics'
AND question_concept_id IN (1585899)
GROUP BY 
  1,2,3
HAVING 
  COUNT(DISTINCT answer_concept_id) > 1
  ),
  
df2 as (
SELECT 
  person_id,
  -- question_concept_id,
  question,
  -- STRING_AGG(CAST(answer_concept_id AS STRING), ', ') AS answers_id,
  STRING_AGG(CAST(answer AS STRING), ', ') AS answers 
FROM 
  `{DATASET}.ds_survey`
WHERE question_concept_id IN (1585899,1585845)
AND person_id IN (SELECT DISTINCT person_id FROM df1 LIMIT 5)
GROUP BY 1,2
order by person_id
)

SELECT * FROM df2
PIVOT
(MAX(answers) FOR question IN ({questions})
)

"

In [None]:
df=download_data(query)
dim(df)

In [None]:
df
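When a dynamic query fails, it can help to render the SQL locally and inspect it before sending it to BigQuery. The sketch below passes made-up values as named arguments to str_glue(), so the notebook's real `DATASET` and `questions` variables are left untouched; the project, dataset, and question names are assumptions.

```r
library(stringr)

# A shortened template in the same style as the query above.
sql_template <- "
SELECT * FROM `{DATASET}.ds_survey`
PIVOT (MAX(answers) FOR question IN ({questions}))
"

# Named arguments act as temporary variables for the placeholders.
rendered <- str_glue(
  sql_template,
  DATASET   = "toy_project.toy_dataset",
  questions = "'Question A', 'Question B'"
)
cat(rendered)
```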

**Other pivot examples using BigQuery directly**

In [None]:
## Using PIVOT() in BigQuery on physical measurement data

system.time({

  df = download_data("SELECT DISTINCT standard_concept_name
                      FROM `{DATASET}.ds_measurement`
                      WHERE measurement_concept_id IN (3036277, 3027018)")

  # Quoted, comma-separated list of measurement names for the PIVOT IN (...) clause
  pms <- paste0("'", unique(df$standard_concept_name), "'", collapse = ", ")

  query = "
    SELECT * FROM
    (##1 -- data to pivot
     SELECT DISTINCT person_id, standard_concept_name AS physical_meas, value_as_number
     FROM `{DATASET}.ds_measurement`
     WHERE measurement_concept_id IN (3036277, 3027018) AND value_as_number IS NOT NULL)
    PIVOT
    (
    ##2 -- pivot aggregation
    AVG(value_as_number) AS avg
    ##3 -- pivot column: either list the values explicitly or inject them dynamically as below
    FOR physical_meas IN ({pms})
    )
    "
  pm_pivot = download_data(query)

})

In [None]:
dim(pm_pivot)
head(pm_pivot)
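As a rough local analogue of the AVG pivot, pivot_wider() can average repeated measurements with `values_fn = mean`. The measurement names and values below are made up, and the `avg_` prefix is chosen only to resemble the column names that the `AS avg` alias produces.

```r
library(tibble)
library(tidyr)

# Toy measurements (made-up names and values): person 1 has two heights.
toy_meas <- tribble(
  ~person_id, ~physical_meas, ~value_as_number,
  1,          "height",       170,
  1,          "height",       172,
  1,          "weight",       70,
  2,          "height",       160
)

# Average repeated measurements per person while pivoting.
toy_pivot <- pivot_wider(
  toy_meas,
  id_cols      = person_id,
  names_from   = physical_meas,
  values_from  = value_as_number,
  values_fn    = mean,
  names_prefix = "avg_"
)
toy_pivot
```

Person 2 has no weight measurement, so that cell is NA in the pivoted table.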

In [None]:
## Using PIVOT() in BigQuery on survey data: counting how many answers each person has per question

system.time({

  df = download_data("SELECT DISTINCT question
                      FROM `{DATASET}.ds_survey` WHERE survey='The Basics'")

  # Quoted, comma-separated list of question texts for the PIVOT IN (...) clause
  questions <- paste0("'", unique(df$question), "'", collapse = ", ")

  query = "
    SELECT * FROM
    (##1 -- data to pivot
     SELECT DISTINCT person_id, question, answer FROM `{DATASET}.ds_survey` WHERE survey='The Basics')
    PIVOT
    (
    ##2 -- pivot aggregation
    COUNT(CAST(answer AS STRING)) AS answers
    ##3 -- pivot column
    FOR question IN ({questions})
    )
    "
  basics_count_pivot = download_data(query)

})

In [None]:
dim(basics_count_pivot)
head(basics_count_pivot)
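The same counting idea can be sketched locally with `values_fn = length`. The data below are made up, and `values_fill = 0L` is an assumption about the desired output: it makes unanswered questions show as 0 rather than NA.

```r
library(tibble)
library(tidyr)

# Toy survey data (made-up values): person 1 gave two answers to q2.
toy_survey <- tribble(
  ~person_id, ~question, ~answer,
  1,          "q1",      "A",
  1,          "q2",      "X",
  1,          "q2",      "Y",
  2,          "q1",      "B"
)

# Count answers per person and question while pivoting.
toy_counts <- pivot_wider(
  toy_survey,
  id_cols      = person_id,
  names_from   = question,
  values_from  = answer,
  values_fn    = length,
  values_fill  = 0L,
  names_prefix = "answers_"
)
toy_counts
```

Any count above 1 flags a question for which that person has multiple answers.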