# Data Wrangling in Python Part 2

In [None]:
import pandas as pd
import numpy as np
import os
pd.set_option('display.max_colwidth', None)  # show the full contents of each column

In [None]:
# dataset name used in the SQL queries
DATASET = os.getenv('WORKSPACE_CDR')
DATASET

# How to pivot if questions have multiple answers for each person_id

Some questions in 'The Basics' can have multiple answers for each person_id, and pivoting such data requires special attention. This part of the tutorial walks through pivoting such data with the pivot_table function in pandas as well as with SQL.

We will cover:

- Basic Pivoting: Converting long-format data into wide-format.
- Handling Multiple Answers: Aggregating multiple answers for the same question.
- Dealing with Numeric and Non-Numeric Data: Appropriate aggregation functions for different data types.
- Examples: Step-by-step examples demonstrating various scenarios.
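
As a warm-up for basic pivoting, here is a minimal sketch on a toy table. The column values (`q_age`, `q_state` and the answers) are made up for illustration and are not taken from the survey. Because each person has exactly one answer per question here, plain `pivot()` is enough.

```python
import pandas as pd

# Toy long-format data: one row per (person_id, question). Values are made up.
long_df = pd.DataFrame({
    "person_id": [1, 1, 2, 2],
    "question": ["q_age", "q_state", "q_age", "q_state"],
    "answer": ["30-39", "NY", "40-49", "CA"],
})

# Long to wide: one row per person, one column per question.
wide_df = long_df.pivot(index="person_id", columns="question", values="answer").reset_index()
wide_df.columns.name = None  # drop the leftover "question" label on the column axis
print(wide_df)
```

Once a person can have several answers to the same question, this simple call no longer applies, which is what the rest of this part addresses.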

## How many questions can have multiple answers in 'The Basics' survey?

In [None]:
query =f"""
SELECT 
  person_id, 
  question_concept_id, question,
  COUNT(DISTINCT answer_concept_id) AS num_answers
FROM 
  `{DATASET}.ds_survey`
WHERE survey='The Basics'
GROUP BY 
  1,2,3
HAVING 
  COUNT(DISTINCT answer_concept_id) > 1
"""

In [None]:
df = pd.read_gbq(query, dialect="standard")
# check the number of (rows, columns) returned
df.shape

In [None]:
df.head()

**Eight questions have multiple answers**

In [None]:
df.question_concept_id.nunique()

In [None]:
df.question_concept_id.unique()

In [None]:
df.question.unique()

## We use two questions (1585899, 1585845) as an example

The question 1585899 (The Basics: Sexual Orientation) allows multiple answers.

In [None]:
query =f"""
WITH df1 AS (
SELECT 
  person_id, 
  question_concept_id, question,
  COUNT(DISTINCT answer_concept_id) AS num_answers
FROM 
  `{DATASET}.ds_survey`
WHERE survey='The Basics'
AND question_concept_id IN (1585899)
GROUP BY 
  1,2,3
HAVING 
  COUNT(DISTINCT answer_concept_id) > 1
  ),
  
df2 AS (SELECT DISTINCT person_id, question_concept_id, question,answer_concept_id, answer,

FROM 
  `{DATASET}.ds_survey`
WHERE question_concept_id IN (1585899,1585845) 
AND person_id IN (SELECT DISTINCT person_id FROM df1 LIMIT 5)
)

SELECT * FROM df2
ORDER BY person_id
"""

In [None]:
df = pd.read_gbq(query, dialect="standard")
# check the number of (rows, columns) returned
df.shape

In [None]:
df

## How to check rows that have multiple answers in a data frame?

In [None]:
# Group by person_id and question_concept_id and count occurrences
df2 = df.groupby(['person_id', 'question_concept_id']).size().reset_index(name='n')

# Filter to keep only rows with more than 1 occurrence
df2 = df2[df2['n'] > 1]

# Display the dimensions of the resulting DataFrame
df2.shape

In [None]:
df2
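
The counts above tell us which (person_id, question) pairs repeat. To look at the underlying rows themselves, one option is `DataFrame.duplicated` with `keep=False`. This is a sketch on a toy frame with made-up answers, not the survey data.

```python
import pandas as pd

# Toy data: person 1 gave two answers to question 1585899 (answer values are made up).
toy = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 2],
    "question_concept_id": [1585899, 1585899, 1585845, 1585899, 1585845],
    "answer": ["A", "B", "X", "A", "Y"],
})

# keep=False flags every row of a repeated (person_id, question) pair, not only the later ones.
multi_mask = toy.duplicated(subset=["person_id", "question_concept_id"], keep=False)
multi_rows = toy[multi_mask]
print(multi_rows)
```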

If we call pivot_table() with its default settings, we run into problems.

With values='answer', we get the first error, shown below: the 'answer' column holds strings, but the default aggregation function (mean) needs numeric values.

In [None]:
df_wide = df.pivot_table(index='person_id', columns='question', values='answer').reset_index()
df_wide.shape
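
The failure can be reproduced without BigQuery. The sketch below uses a made-up three-row frame and catches a broad `Exception`, because the exact exception class raised by the default mean depends on the pandas version.

```python
import pandas as pd

# Toy data with string answers (values are made up).
toy = pd.DataFrame({
    "person_id": [1, 1, 2],
    "question": ["q1", "q1", "q1"],
    "answer": ["A", "B", "A"],
})

error_name = None
try:
    # aggfunc defaults to "mean", which is undefined for strings.
    toy.pivot_table(index="person_id", columns="question", values="answer")
except Exception as exc:  # the exact class varies across pandas versions
    error_name = type(exc).__name__

print(error_name)
```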

With values='answer_concept_id', the call runs, but it does not handle multiple answers well: the default mean averages the concept IDs, so those cells no longer hold valid concept IDs.

In [None]:
df_wide = df.pivot_table(index='person_id', columns='question', values='answer_concept_id').reset_index()
df_wide.shape

In [None]:
df_wide
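
To see concretely what goes wrong, here is a small sketch with made-up IDs (100 and 201 are not real concept IDs). The default mean silently blends the two answers of person 1 into a number that matches neither.

```python
import pandas as pd

toy = pd.DataFrame({
    "person_id": [1, 1, 2],
    "question": ["q1", "q1", "q1"],
    "answer_concept_id": [100, 201, 100],  # made-up IDs for illustration
})

# The default aggfunc="mean" averages the two IDs of person 1.
wide = toy.pivot_table(index="person_id", columns="question", values="answer_concept_id")
print(wide)  # person 1 ends up with 150.5, which is not one of the original IDs
```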

## Method 1: How to pivot such data frame using Pandas

With values='answer', we need to pass an aggfunc that works on strings when pivoting the data frame.

In [None]:
df_wide = df.pivot_table(index='person_id', columns='question', values='answer', aggfunc=lambda x: ','.join(x)).reset_index()
df_wide
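
The plain `','.join(x)` assumes every answer is a non-missing string. A slightly more defensive variant, sketched below on made-up data, drops missing values, removes repeats, and sorts before joining. The helper name `join_unique` is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 2],
    "question": ["q1", "q1", "q1", "q1", "q1"],
    "answer": ["B", "A", "B", "A", None],  # a repeated answer and a missing one
})

def join_unique(values):
    """Join the distinct, non-missing answers in sorted order."""
    return ", ".join(sorted(set(values.dropna())))

wide = toy.pivot_table(index="person_id", columns="question", values="answer",
                       aggfunc=join_unique).reset_index()
print(wide)
```

Sorting makes the combined string independent of row order, which helps when comparing results across runs.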

With values='answer_concept_id', we need to convert the column to str first, because join() only accepts strings, and then run pivot_table().

In [None]:
df['answer_concept_id'] = df['answer_concept_id'].astype(str)

In [None]:
df_wide = df.pivot_table(index='person_id', columns='question', values='answer_concept_id', aggfunc=lambda x: ','.join(x)).reset_index()
df_wide
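
If you would rather leave the original column as integers, one alternative is to cast inside the aggregation function. This is a sketch on made-up IDs, under the assumption that only the pivoted output needs strings.

```python
import pandas as pd

toy = pd.DataFrame({
    "person_id": [1, 1, 2],
    "question": ["q1", "q1", "q1"],
    "answer_concept_id": [100, 201, 100],  # made-up IDs for illustration
})

# Cast to str only inside the aggregation, so toy["answer_concept_id"] stays numeric.
wide = toy.pivot_table(index="person_id", columns="question", values="answer_concept_id",
                       aggfunc=lambda ids: ",".join(ids.astype(str))).reset_index()
print(wide)
print(toy["answer_concept_id"].dtype)  # the source column is still an integer dtype
```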

## Method 2: How to aggregate multiple answers using SQL

We can also perform the aggregation directly in BigQuery using STRING_AGG(). This is useful for RStudio/SAS users, since RAM is limited and the aggregation then runs in BigQuery rather than in memory.

In [None]:
query=f"""

WITH df1 AS (
SELECT 
  person_id, 
  question_concept_id, question,
  COUNT(DISTINCT answer_concept_id) AS num_answers
FROM 
  `{DATASET}.ds_survey`
WHERE survey='The Basics'
AND question_concept_id IN (1585899)
GROUP BY 
  1,2,3
HAVING 
  COUNT(DISTINCT answer_concept_id) > 1
  )

SELECT 
  person_id,
  question_concept_id,question,
  STRING_AGG(CAST(answer_concept_id AS STRING), ', ') AS answers_id,
  STRING_AGG(CAST(answer AS STRING), ', ') AS answers 
FROM 
  `{DATASET}.ds_survey`
WHERE question_concept_id IN (1585899,1585845)
AND person_id IN (SELECT DISTINCT person_id FROM df1 LIMIT 5)

GROUP BY 1,2,3
"""

In [None]:
df.person_id.nunique()

In [None]:
df = pd.read_gbq(query, dialect="standard")
# check the number of (rows, columns) returned
df.shape

In [None]:
df

Then we run pivot_table() with aggfunc. By default, pivot_table() aggregates values with the mean, which raises an error on strings. To avoid this, we pass an aggfunc that works with strings, such as 'first', or a join() that combines them. Since we already used STRING_AGG() in SQL to combine the answer_concept_ids/answers, each person_id and question pair has a single row, so the two cells below give the same results.

In [None]:
df_wide = df.pivot_table(index='person_id', columns='question', values='answers',aggfunc='first').reset_index()
df_wide

In [None]:
df_wide = df.pivot_table(index='person_id', columns='question', values='answers', aggfunc=lambda x: ','.join(x)).reset_index()
df_wide
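
The same idea can be checked locally. In the sketch below (toy data, made-up answers), a groupby with join plays the role of STRING_AGG(), after which 'first' and join produce identical wide tables because every group holds a single row.

```python
import pandas as pd

toy = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 2],
    "question": ["q1", "q1", "q2", "q1", "q2"],
    "answer": ["A", "B", "X", "A", "Y"],
})

# pandas counterpart of STRING_AGG(answer, ', ') ... GROUP BY person_id, question
agg = (toy.groupby(["person_id", "question"])["answer"]
          .agg(", ".join)
          .reset_index(name="answers"))

# One row per (person_id, question) now, so both aggregations return that single value.
wide_first = agg.pivot_table(index="person_id", columns="question",
                             values="answers", aggfunc="first")
wide_join = agg.pivot_table(index="person_id", columns="question",
                            values="answers", aggfunc=lambda x: ",".join(x))

same = bool((wide_first == wide_join).all().all())
print(same)
```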

## Method 3: How to pivot using SQL

We can also pivot directly in BigQuery by combining STRING_AGG() and PIVOT(). This is useful for RStudio/SAS users, since RAM is limited and the pivot then runs in BigQuery rather than in memory.

In [None]:
questions = tuple(pd.read_gbq(f'''SELECT DISTINCT question 
                    FROM  `{DATASET}.ds_survey` WHERE survey='The Basics' and question_concept_id IN (1585899,1585845) ''').question)

In [None]:
questions

In [None]:
query=f"""
WITH df1 AS (
SELECT 
  person_id, 
  question_concept_id, question,
  COUNT(DISTINCT answer_concept_id) AS num_answers
FROM 
  `{DATASET}.ds_survey`
WHERE survey='The Basics'
AND question_concept_id IN (1585899)
GROUP BY 
  1,2,3
HAVING 
  COUNT(DISTINCT answer_concept_id) > 1
  ),
  
df2 as (
SELECT 
  person_id,
  -- question_concept_id,
  question,
  -- STRING_AGG(CAST(answer_concept_id AS STRING), ', ') AS answers_id,
  STRING_AGG(CAST(answer AS STRING), ', ') AS answers 
FROM 
  `{DATASET}.ds_survey`
WHERE question_concept_id IN (1585899,1585845)
AND person_id IN (SELECT DISTINCT person_id FROM df1 LIMIT 5)
GROUP BY 1,2
order by person_id
)

SELECT * FROM df2
PIVOT
(MAX(answers) FOR question IN {questions}
)

"""

In [None]:
df = pd.read_gbq(query, dialect="standard")
df.shape

In [None]:
df
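
Interpolating a Python tuple into the query works here because there are two questions. With a single question, Python renders the tuple with a trailing comma, which may not be accepted in the PIVOT clause. A hypothetical helper such as `sql_string_list` below (not part of pandas or BigQuery) builds the list explicitly and escapes quotes. Treat it as a sketch, not a general-purpose SQL escaper.

```python
def sql_string_list(values):
    """Render strings as a parenthesized, comma-separated list of SQL string literals."""
    quoted = []
    for value in values:
        # Escape backslashes first, then single quotes.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        quoted.append(f"'{escaped}'")
    return "(" + ", ".join(quoted) + ")"

one_question = ["Only one question"]  # made-up question text
print(str(tuple(one_question)))       # tuple formatting leaves a trailing comma
print(sql_string_list(one_question))  # explicit formatting does not
```

The result could then be interpolated in place of `{questions}` in the query above.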

**Other pivot examples using BigQuery directly**

In [None]:
%%time 
## USING PIVOT() IN GBQ on Physical Meas data

pms = tuple(pd.read_gbq(f'''SELECT DISTINCT standard_concept_name  
                         FROM  `{DATASET}.ds_measurement` 
                         WHERE measurement_concept_id IN (3036277, 3027018) ''').standard_concept_name)
query=f"""
    SELECT * FROM 
    (##1 -- data to pivot
     SELECT DISTINCT person_id, standard_concept_name as physical_meas, value_as_number 
     FROM  `{DATASET}.ds_measurement` 
     WHERE measurement_concept_id IN (3036277, 3027018) and value_as_number is not null )
    PIVOT
    (
    ##2 -- pivot aggregation
    AVG(value_as_number) AS avg 
    ##3 -- pivot column - either specify the unique columns or dynamically input like the below
    FOR physical_meas IN {pms}
    )
    """
pm_pivot = pd.read_gbq(query, dialect = 'standard')
pm_pivot.head()
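
For comparison, the same reshaping can be sketched in pandas on a toy frame. The measurement names `meas_a` and `meas_b` and all values are made up. `drop_duplicates()` mirrors the SELECT DISTINCT, and `aggfunc="mean"` mirrors AVG().

```python
import pandas as pd

toy = pd.DataFrame({
    "person_id": [1, 1, 1, 1, 2],
    "physical_meas": ["meas_a", "meas_a", "meas_a", "meas_b", "meas_a"],
    "value_as_number": [170.0, 170.0, 172.0, 65.0, 160.0],  # one exact duplicate row
})

pm_wide = (toy.drop_duplicates()  # same effect as SELECT DISTINCT
              .pivot_table(index="person_id", columns="physical_meas",
                           values="value_as_number", aggfunc="mean")
              .reset_index())
print(pm_wide)
```

Person 2 has no `meas_b` value, so that cell is left missing rather than filled in.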

In [None]:
# check that the pivoted table has one row per person
pm_pivot.shape, pm_pivot.person_id.nunique()

In [None]:
%%time
## USING PIVOT() IN GBQ on survey data - counting if/how many answers per question people have
questions = tuple(pd.read_gbq(f'''SELECT DISTINCT question 
                    FROM  `{DATASET}.ds_survey` WHERE survey='The Basics' ''').question)
query=f"""
    SELECT * FROM 
    (##1 -- data to pivot
     SELECT DISTINCT person_id, question, answer FROM  `{DATASET}.ds_survey` WHERE survey='The Basics')
    PIVOT
    (
    ##2 -- pivot aggregation
    COUNT(CAST(answer AS STRING)) AS answers 
    ##3 -- pivot column
    FOR question in {questions}
    )
    """
basics_count_pivot = pd.read_gbq(query, dialect = 'standard')
basics_count_pivot.head()
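
A pandas sketch of the same count, on made-up data, is shown below. `aggfunc="count"` counts the non-missing answers per person and question, and `fill_value=0` fills in the combinations that never occur.

```python
import pandas as pd

toy = pd.DataFrame({
    "person_id": [1, 1, 1, 2],
    "question": ["q1", "q1", "q2", "q1"],
    "answer": ["A", "B", "X", "A"],
})

counts = (toy.drop_duplicates()  # same effect as SELECT DISTINCT
             .pivot_table(index="person_id", columns="question",
                          values="answer", aggfunc="count", fill_value=0))
print(counts)
```

Cells greater than 1 identify the people and questions with multiple answers.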