# Guide to the data structures and processing steps used in creation of the synthetic standard survey school dashboard

Please note: This will differ for the symbol survey and public dashboards.

In [1]:
# Import packages required to produce this notebook
import pandas as pd

## Data processing for the synthetic dashboard

### Key:

```mermaid
  graph TD;
    d[(Database)]; 
    r(Dataset);
    f{{Function}}
    p((Element on<br>Streamlit));
    sub[Location];

    %% Add custom colour to nodes

    classDef database fill:#b4e1d4;
    class d database;

    classDef function fill:#f5e3cb;
    class f function;

    classDef streamlit fill:#bfe9ff;
    class p streamlit;

    classDef subg fill:#fffcdc;
    class sub subg;
```

### Figure:

```mermaid
  graph TD;

    %% Define the nodes and subgraphs

    in["Amy completed each of the<br>six versions of the survey"]

    subgraph subgraph_clean["Within DSH"]
        red[("REDCap DSH")]
        red_raw("Raw data from REDCap with separate<br>columns for the six shuffles")
        head("Headings from the cleaned dataset<br>which has a single set of REDCap columns<br>and some fake demographic columns")
    end

    f_label{{"Function creating dictionary with<br>labels for each of the question responses"}};
    f_scores{{"Function creating scores"}}
    ons[("Office for National Statistics (ONS)<br>Open Geography Portal")];
    msoa_ew("MSOA shapefile for England + Wales");
    msoa_nd("MSOA shapefile for Northern Devon");
    data_synth("Synthetic pupil dataset");

    agg_scores("
        Aggregate dataset with <b>mean scores</b>
        for each school overall and by
        year/gender/FSM/SEN");
    agg_rag("
        Aggregate scores dataset with
        addition of <b>RAG ratings</b>");
    agg_resp("
        Aggregate dataset with proportion
        of each response option for each
        <b>non-demographic</b> survey question for
        each school, overall and by
        year/gender/FSM/SEN");
    agg_counts("
        Aggregate dataset with count of
        <b>total respondents</b> for each school,
        overall and by year/gender/FSM/SEN");
    agg_dem("
        Aggregate dataset with proportion
        of each response option for each
        <b>demographic</b> variable or survey
        question for each school");

    st_sum(("
        RAG boxes on the
        summary page"));
    st_exp(("
        Question bar charts
        on the explore
        results page"));
    st_order(("
        Ordered RAG bar charts
        on the explore
        results page"));
    st_count(("
        Total pupils on the 
        who took part page"));
    st_dem(("
        Bar charts on the 
        who took part page"));

    %% Produce the figure

    in --> red;
    red --> red_raw;
    red_raw --> head;

    ons --> msoa_ew;
    msoa_ew --> msoa_nd;
    msoa_nd --> data_synth;
    head --> data_synth;
    f_scores --> data_synth;
    f_label --> data_synth;

    data_synth --> agg_scores; agg_scores --> agg_rag;
    data_synth --> agg_resp;
    data_synth --> agg_counts;
    data_synth --> agg_dem;

    agg_scores --> st_order;
    agg_rag --> st_sum;
    agg_resp --> st_exp;
    agg_counts --> st_count;
    agg_dem --> st_dem;

    %% Add custom colour to nodes

    classDef database fill:#b4e1d4;
    class red,ons database;

    classDef function fill:#f5e3cb;
    class f_label,f_scores function;

    classDef transparent fill:transparent, stroke:transparent;
    class in transparent;

    classDef streamlit fill:#bfe9ff;
    class st_order,st_sum,st_exp,st_count,st_dem streamlit;
```

### Description:

#### Cleaning the REDCap data extract

Pupil survey responses are stored within REDCap on the Data Safe Haven (DSH). Pupils were assigned to one of six survey orders to mitigate the impact of response fatigue, so the extract contains a separate set of columns for each order. For example, for a question on acceptance by peers, there will be seven sets of columns: one from the default survey set-up ('accept_peer_shuffle') and one for each of the six shuffles ('accept_peer_shuffle_1', 'accept_peer_shuffle_2', 'accept_peer_shuffle_3', 'accept_peer_shuffle_4', 'accept_peer_shuffle_5', and 'accept_peer_shuffle_6'). All the data can be downloaded as a single extract using the "Data Exports, Reports, and Stats" page on REDCap.

Cleaning is performed using the script `clean_standard_survey_1.ipynb` on the DSH under Group(S:)/ Kailo_Consortium_BeeWell/ scripts/. This script creates a single set of columns for each question (rather than a separate set for each survey order). It also adds some synthetic demographic data columns (to mimic the data that would be received from the council).
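As a rough illustration of collapsing the shuffled columns, the sketch below takes the first non-missing value across the default and shuffled versions of one question. It is not taken from `clean_standard_survey_1.ipynb`: the function `combine_shuffles`, the output column name `accept_peer` and the toy values are assumptions, and it only handles single-column questions (not checkbox questions such as `places_barriers_shuffle_6___1`).

```python
import numpy as np
import pandas as pd

# Toy extract: each pupil answered in exactly one survey order, so only one
# of the columns for a question is populated per row (column names follow
# the 'accept_peer_shuffle' example above; the values are made up)
raw = pd.DataFrame({
    'record_id': [1, 2, 3],
    'accept_peer_shuffle': [np.nan, 2.0, np.nan],
    'accept_peer_shuffle_1': [4.0, np.nan, np.nan],
    'accept_peer_shuffle_2': [np.nan, np.nan, 1.0],
})


def combine_shuffles(df, question, suffix='_shuffle'):
    '''
    Collapse the default and shuffled columns for one question into a
    single series, taking the first non-missing value in each row
    (hypothetical helper, for illustration only)
    '''
    cols = [col for col in df.columns
            if col == question + suffix
            or col.startswith(question + suffix + '_')]
    # Back-fill across columns so the first column holds the first
    # non-missing response, then keep only that first column
    return df[cols].bfill(axis=1).iloc[:, 0]


clean = raw[['record_id']].copy()
clean['accept_peer'] = combine_shuffles(raw, 'accept_peer')
print(clean)
```

If a pupil somehow had responses in more than one survey order, this approach would silently keep the first one, so a check for that case may be worth adding on the actual data.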

#### Creating the synthetic dataset
The synthetic pupil dataset is produced through:
* Extracting the headings from the cleaned REDCap dataset
* Populating each of those columns by sampling from the numeric response options in the dictionary of labels (for each question, there is a dictionary with all possible numeric responses from REDCap and the relevant labels)
* Adding an MSOA for each pupil by randomly sampling from the list of MSOAs in Northern Devon (as extracted from the ONS shapefile)
* Adding some random missing data for all variables except school
* Adding some intentional missing data (e.g. school missing a whole year group)
* Adding scores for each pupil on each topic
* Adding labels for each of the responses
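The steps above can be sketched roughly as follows. This is a simplified illustration rather than the script used for the dashboard: the MSOA names, the 10% missingness rate, the choice of School B and Year 8 for the intentional gap, and the random seed are all assumptions, and the scoring step is left out because the scoring rules are not described here. The two label dictionaries use response options shown in the demographic dataset later in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible
n_pupils = 50

# Dictionary of labels: numeric response -> label, for each question
labels = {
    'neurodivergent': {1: 'Yes', 2: 'No', 3: 'Unsure'},
    'birth_parent1': {1: 'Yes', 2: 'No', 3: "I don't know"},
}
# Placeholder MSOA names (the real list comes from the ONS shapefile)
msoa_options = ['MSOA 1', 'MSOA 2', 'MSOA 3']

# Populate each question by sampling from its numeric response options
synth = pd.DataFrame({
    col: rng.choice(list(options), size=n_pupils)
    for col, options in labels.items()
}).astype(float)

# Randomly assign an MSOA, school and year group to each pupil
synth['msoa'] = rng.choice(msoa_options, size=n_pupils)
synth['school_lab'] = rng.choice(['School A', 'School B'], size=n_pupils)
synth['year_group_lab'] = rng.choice(['Year 8', 'Year 10'], size=n_pupils)

# Random missing data for every variable except school (10% is arbitrary)
for col in synth.columns.drop('school_lab'):
    synth.loc[rng.random(n_pupils) < 0.1, col] = np.nan

# Intentional missing data: School B has no Year 8 pupils
drop = ((synth['school_lab'] == 'School B')
        & (synth['year_group_lab'] == 'Year 8'))
synth = synth[~drop].reset_index(drop=True)

# Add a label column for each numeric response column
for col, options in labels.items():
    synth[col + '_lab'] = synth[col].map(options)

print(synth.head())
```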

#### Producing aggregated datasets for use in Streamlit

Five datasets are produced for use in Streamlit.

1. **Aggregate scores** - provides the mean score (and count of pupils it was based on) for each question by school and the chosen pupil characteristics (gender, year group, FSM and SEN).
2. **RAG** - the RAG ratings are calculated by finding the overall mean score for a question across all schools using a weighted mean (the mean of each school is weighted by school size). This approach is used so that it is consistent with how the ratings would be calculated for GM data, where pupil-level data would not be available. The weighted standard deviation is also calculated. Each school is then rated as above average, average or below average based on whether its mean score is within 1 standard deviation of the weighted mean. Variable labels and descriptions are also added for use on the Streamlit page.
3. **Aggregate non-demographic responses** - provides the proportion of people giving each possible response to each question. In order to ensure one row per question, these are stored as lists within a single cell of the dataframe (see head of dataframe below for example). Labels for each response option are included, as well as an overall label for the question to use on the Streamlit page. Results are provided for each school and by the chosen characteristics.
4. **Overall counts** - this dataset provides the overall count of pupils who answered at least one question (and were therefore included in the dashboard) for each school and by the chosen characteristics.
5. **Aggregate demographic responses** - this is calculated in the same way as the non-demographic responses, except that results are only provided by school and not by pupil characteristics, and there are different rules for censoring small sample sizes.
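To illustrate the RAG calculation described in point 2, the sketch below works through a single score with made-up school means and counts. The exact form of the weighted standard deviation (here the square root of the weighted mean of squared deviations) and the rating labels other than 'average' are assumptions, so treat this as one plausible reading rather than the dashboard's code.

```python
import numpy as np
import pandas as pd

# Made-up school-level means and pupil counts for a single score
scores = pd.DataFrame({
    'school_lab': ['School A', 'School B', 'School C', 'School D'],
    'mean': [18.0, 16.0, 20.0, 18.0],
    'count': [100, 50, 50, 100],
})

# Weighted mean of the school means, weighting each school by its size
wt_mean = np.average(scores['mean'], weights=scores['count'])

# Weighted standard deviation of the school means around that mean
wt_var = np.average((scores['mean'] - wt_mean) ** 2,
                    weights=scores['count'])
wt_std = np.sqrt(wt_var)

# Schools within 1 standard deviation of the weighted mean are 'average'
lower, upper = wt_mean - wt_std, wt_mean + wt_std
scores['rag'] = np.select(
    [scores['mean'] < lower, scores['mean'] > upper],
    ['below', 'above'],
    default='average')

print(scores)
```

In this toy example the weighted mean is 18 and the weighted standard deviation is about 1.15, so School B (16) is rated below average, School C (20) above average, and Schools A and D average.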

## Data processing that will be required for the actual dashboards

Differences (beyond the obvious, of not creating synthetic data) will include:
* **Demographic data** will be provided by Devon County Council. It will be linked to the survey responses based on the pseudonymised UPN associated with each of the survey responses. It will likely have different column names and response options, compared to what I have used.
* **Location** of processing will need to be entirely within the DSH. This was not done for the synthetic dashboard as it is using synthetic data, so storing the scripts outside of the DSH allows us to have a trackable and forever accessible record in GitHub of how data were processed.
* **Gender** will need to be chosen from either the survey response for gender, council data on gender, or a combination of both.
* **Cleaning** will need to be reviewed to check for any additional cleaning steps required on the actual data.
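As a sketch of the linkage step mentioned above, the example below joins council demographic data to survey responses on a pseudonymised UPN. The column name `upn_pseudo`, the identifiers and the values are all hypothetical, since the format of the council data is not yet known.

```python
import pandas as pd

# Hypothetical survey responses and council data, keyed on pseudonymised UPN
survey = pd.DataFrame({
    'upn_pseudo': ['a1', 'b2', 'c3'],
    'life_satisfaction': [7, 5, 9],
})
council = pd.DataFrame({
    'upn_pseudo': ['a1', 'b2', 'd4'],
    'fsm': ['FSM', 'Non-FSM', 'Non-FSM'],
})

# Keep every survey response, check that each UPN appears at most once in
# each dataset, and record whether a match was found in the council data
linked = survey.merge(council, on='upn_pseudo', how='left',
                      validate='one_to_one', indicator=True)

# Survey responses with no matching council record
unmatched = linked[linked['_merge'] == 'left_only']
print(unmatched)
```

Using `validate` means the merge raises an error if a UPN is duplicated in either dataset, and `indicator` makes it easy to count how many survey responses could not be linked.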

## Data structure at key stages

In [2]:
def describe_data(filepath):
    '''
    Describe the shape of the data, preview the first five rows, and print
    the name and type of every column

    Parameters:
    filepath : string
        Filepath of dataset to import and describe
    '''
    df = pd.read_csv(filepath)

    # Print shape of dataframe
    print(df.shape)

    # Preview first 5 rows of dataframe
    display(df.head())

    # Print the name and type of every column
    with pd.option_context('display.max_rows', None,
                           'display.max_columns', None):
        print(df.dtypes)

### Raw data

In [3]:
describe_data('data/survey_data/KailoBeeWellStandard_DATA_2023-11-06_1152.csv')

(23, 860)


Unnamed: 0,record_id,redcap_event_name,redcap_survey_identifier,id_and_consent_timestamp,consent_answer,id_and_consent_complete,survey_questions_default_order_timestamp,gender,transgender,sexual_orientation,...,places_barriers_shuffle_6___1,places_barriers_shuffle_6___2,places_barriers_shuffle_6___3,places_barriers_shuffle_6___4,places_barriers_shuffle_6___5,places_barriers_shuffle_6___6,places_barriers_shuffle_6___7,places_barriers_shuffle_6___8,places_barriers_shuffle_6___9,survey_questions_shuffle_6_complete
0,1,survey_completion_arm_1,,2023-10-06 12:11:47,1,2,,,,,...,,,,,,,,,,
1,2,survey_completion_arm_1,,2023-10-06 13:06:38,1,2,,,,,...,,,,,,,,,,
2,3,survey_completion_arm_1,,2023-10-10 16:00:13,1,2,,,,,...,,,,,,,,,,
3,4,survey_completion_arm_2,,2023-10-10 16:00:27,1,2,,,,,...,,,,,,,,,,
4,5,survey_completion_arm_1,,2023-10-11 11:39:58,1,2,,,,,...,,,,,,,,,,


record_id                                     int64
redcap_event_name                            object
redcap_survey_identifier                     object
id_and_consent_timestamp                     object
consent_answer                                int64
id_and_consent_complete                       int64
survey_questions_default_order_timestamp     object
gender                                      float64
transgender                                 float64
sexual_orientation                          float64
neurodivergent                              float64
birth_parent1                               float64
birth_parent2                               float64
birth_you                                   float64
birth_you_age                               float64
autonomy_pressure                           float64
autonomy_express                            float64
autonomy_decide                             float64
autonomy_told                               float64
autonomy_mys

### Headings

The headings dataset contains only the column headings and has no rows of data.

In [4]:
head = pd.read_csv('data/survey_data/headings.csv')

# Print shape of dataframe
print(head.shape)

# Print all the columns in the dataframe
head.columns.tolist()

(0, 122)


['Unnamed: 0',
 'record_id',
 'gender',
 'transgender',
 'sexual_orientation',
 'neurodivergent',
 'birth_parent1',
 'birth_parent2',
 'birth_you',
 'birth_you_age',
 'autonomy_pressure',
 'autonomy_express',
 'autonomy_decide',
 'autonomy_told',
 'autonomy_myself',
 'autonomy_choice',
 'life_satisfaction',
 'optimism_future',
 'optimism_best',
 'optimism_good',
 'optimism_work',
 'wellbeing_optimistic',
 'wellbeing_useful',
 'wellbeing_relaxed',
 'wellbeing_problems',
 'wellbeing_thinking',
 'wellbeing_close',
 'wellbeing_mind',
 'esteem_satisfied',
 'esteem_qualities',
 'esteem_well',
 'esteem_value',
 'esteem_good',
 'stress_control',
 'stress_overcome',
 'stress_confident',
 'stress_way',
 'appearance_happy',
 'appearance_feel',
 'negative_lonely',
 'negative_unhappy',
 'negative_like',
 'negative_cry',
 'negative_school',
 'negative_worry',
 'negative_sleep',
 'negative_wake',
 'negative_shy',
 'negative_scared',
 'lonely',
 'support_ways',
 'support_look',
 'sleep',
 'physical_da

### Synthetic pupil dataset

In [5]:
describe_data('data/survey_data/synthetic_data_raw.csv')

(800, 281)


Unnamed: 0,gender,transgender,sexual_orientation,neurodivergent,birth_parent1,birth_parent2,birth_you,birth_you_age,autonomy_pressure,autonomy_express,...,peer_talk_listen_lab,peer_talk_helpful_lab,peer_talk_if_lab,accept_peer_lab,year_group_lab,fsm_lab,sen_lab,ethnicity_lab,english_additional_lab,school_lab
0,4.0,2.0,6.0,3.0,2.0,1.0,2.0,1.0,2.0,5.0,...,Fully,Somewhat helpful,Very uncomfortable,Not at all,Year 10,Non-FSM,Non-SEN,Ethnic minority,No,School E
1,1.0,2.0,1.0,3.0,3.0,2.0,3.0,,4.0,2.0,...,Mostly,Very helpful,Very uncomfortable,Slightly,Year 10,Non-FSM,Non-SEN,Ethnic minority,,School D
2,,3.0,4.0,1.0,1.0,1.0,1.0,1.0,,4.0,...,Mostly,Very helpful,Very comfortable,Not at all,Year 10,Non-FSM,Non-SEN,White British,No,School E
3,2.0,5.0,5.0,2.0,2.0,2.0,1.0,3.0,1.0,2.0,...,Fully,Somewhat helpful,Uncomfortable,Mostly,Year 10,Non-FSM,,White British,No,School G
4,,3.0,4.0,1.0,1.0,3.0,3.0,2.0,5.0,2.0,...,Slightly,Somewhat helpful,Uncomfortable,Not at all,Year 8,Non-FSM,Non-SEN,,Yes,School B


gender                      float64
transgender                 float64
sexual_orientation          float64
neurodivergent              float64
birth_parent1               float64
birth_parent2               float64
birth_you                   float64
birth_you_age               float64
autonomy_pressure           float64
autonomy_express            float64
autonomy_decide             float64
autonomy_told               float64
autonomy_myself             float64
autonomy_choice             float64
life_satisfaction           float64
optimism_future             float64
optimism_best               float64
optimism_good               float64
optimism_work               float64
wellbeing_optimistic        float64
wellbeing_useful            float64
wellbeing_relaxed           float64
wellbeing_problems          float64
wellbeing_thinking          float64
wellbeing_close             float64
wellbeing_mind              float64
esteem_satisfied            float64
esteem_qualities            

### Aggregated dataset with scores and RAG ratings

In [6]:
describe_data('data/survey_data/aggregate_scores_rag.csv')

(2079, 17)


Unnamed: 0,variable,mean,count,school_lab,year_group_lab,gender_lab,fsm_lab,sen_lab,total_pupils,group_n,group_wt_mean,group_wt_std,lower,upper,rag,variable_lab,description
0,birth_you_age_score,7.850427,117.0,School A,All,All,All,All,742.0,,,,,,,,
1,autonomy_score,18.220779,77.0,School A,All,All,All,All,500.0,7.0,17.922,0.436777,17.485223,18.358777,average,Autonomy,How 'in control' young people feel of their life
2,life_satisfaction_score,5.065041,123.0,School A,All,All,All,All,788.0,7.0,5.076142,0.304727,4.771415,5.380869,average,Life satisfaction,How satisfied young people feel with their life
3,optimism_score,11.841463,82.0,School A,All,All,All,All,572.0,7.0,12.006993,0.388618,11.618375,12.395611,average,Optimism,Young people's hopefulness and confidence for ...
4,wellbeing_score,21.466667,75.0,School A,All,All,All,All,471.0,7.0,20.96603,0.527471,20.438559,21.493501,average,Psychological wellbeing,How positive and generally happy young people ...


variable           object
mean              float64
count             float64
school_lab         object
year_group_lab     object
gender_lab         object
fsm_lab            object
sen_lab            object
total_pupils      float64
group_n           float64
group_wt_mean     float64
group_wt_std      float64
lower             float64
upper             float64
rag                object
variable_lab       object
description        object
dtype: object


### Aggregated dataset with non-demographic question responses

In [7]:
describe_data('data/survey_data/aggregate_responses.csv')

(6678, 13)


Unnamed: 0,cat,cat_lab,count,percentage,measure,n_responses,school_lab,year_group_lab,gender_lab,fsm_lab,sen_lab,group,measure_lab
0,"[1, 2, 3, 4, 5, nan]","['1 - Completely not true', '2', '3', '4', '5 ...","[18, 23, 24, 20, 20, 23]","[14.0625, 17.96875, 18.75, 15.625, 15.625, 17....",autonomy_pressure,128.0,School A,All,All,All,All,autonomy,I feel pressured in my life
1,"[1, 2, 3, 4, 5, nan]","['1 - Completely not true', '2', '3', '4', '5 ...","[26, 28, 20, 25, 26, 3]","[20.3125, 21.875, 15.625, 19.53125, 20.3125, 2...",autonomy_express,128.0,School A,All,All,All,All,autonomy,I generally feel free to express my ideas and ...
2,"[1, 2, 3, 4, 5, nan]","['1 - Completely not true', '2', '3', '4', '5 ...","[26, 26, 17, 35, 21, 3]","[20.3125, 20.3125, 13.28125, 27.34375, 16.4062...",autonomy_decide,128.0,School A,All,All,All,All,autonomy,I feel like I am free to decide for myself how...
3,"[1, 2, 3, 4, 5, nan]","['1 - Completely not true', '2', '3', '4', '5 ...","[23, 22, 12, 32, 21, 18]","[17.96875, 17.1875, 9.375, 25.0, 16.40625, 14....",autonomy_told,128.0,School A,All,All,All,All,autonomy,In my daily life I often have to do what I am ...
4,"[1, 2, 3, 4, 5, nan]","['1 - Completely not true', '2', '3', '4', '5 ...","[29, 22, 21, 21, 20, 15]","[22.65625, 17.1875, 16.40625, 16.40625, 15.625...",autonomy_myself,128.0,School A,All,All,All,All,autonomy,I feel I can pretty much be myself in daily si...


cat                object
cat_lab            object
count              object
percentage         object
measure            object
n_responses       float64
school_lab         object
year_group_lab     object
gender_lab         object
fsm_lab            object
sen_lab            object
group              object
measure_lab        object
dtype: object


### Aggregated dataset with overall counts

In [8]:
describe_data('data/survey_data/overall_counts.csv')

(63, 6)


Unnamed: 0,count,school_lab,year_group_lab,gender_lab,fsm_lab,sen_lab
0,128.0,School A,All,All,All,All
1,67.0,School A,Year 8,All,All,All
2,58.0,School A,Year 10,All,All,All
3,17.0,School A,All,Girl,All,All
4,28.0,School A,All,Boy,All,All


count             float64
school_lab         object
year_group_lab     object
gender_lab         object
fsm_lab            object
sen_lab            object
dtype: object


### Aggregated dataset with demographic question responses

In [9]:
describe_data('data/survey_data/aggregate_demographic.csv')

(210, 11)


Unnamed: 0,cat,cat_lab,count,percentage,measure,n_responses,school_lab,school_group,school_group_lab,plot_group,measure_lab
0,"[1, 2, 3, 4, 5, 6, nan]","['Girl', 'Boy', 'Non-binary', 'I describe myse...","[17, 28, 19, 20, 16, 20, 8]","[13.28125, 21.875, 14.84375, 15.625, 12.5, 15....",gender,128.0,School A,1,Your school,gender,Gender
1,"[1, 2, 3, 4, 5, nan]","['Yes', 'No', 'Prefer not to say', 'I describe...","[23, 24, 19, 22, 27, 13]","[17.96875, 18.75, 14.84375, 17.1875, 21.09375,...",transgender,128.0,School A,1,Your school,gender,Do you consider yourself to be transgender?
2,"[1, 2, 3, 4, 5, 6, nan]","['Bi/pansexual', 'Gay/lesbian', 'Heterosexual/...","[20, 26, 15, 20, 26, 16, 5]","[15.625, 20.3125, 11.71875, 15.625, 20.3125, 1...",sexual_orientation,128.0,School A,1,Your school,sexual_orientation,Sexual orientation
3,"[1, 2, 3, nan]","['Yes', 'No', 'Unsure', 'No response']","[45, 31, 38, 14]","[35.15625, 24.21875, 29.6875, 10.9375]",neurodivergent,128.0,School A,1,Your school,neuro,Do you identify as neurodivergent?
4,"[1, 2, 3, nan]","['Yes', 'No', ""I don't know"", 'No response']","[39, 43, 39, 7]","[30.46875, 33.59375, 30.46875, 5.46875]",birth_parent1,128.0,School A,1,Your school,birth,Was birth parent 1 born outside the UK?


cat                  object
cat_lab              object
count                object
percentage           object
measure              object
n_responses         float64
school_lab           object
school_group          int64
school_group_lab     object
plot_group           object
measure_lab          object
dtype: object
