# **Table of contents**

# **Abstract**

# **1. Short introduction**

# **2. Data collection & dataset overview**

Before we start wrangling, cleaning, and analyzing the data, it's essential to understand what we are working with. Here is some vital information about our dataset and the data collection process.

The online survey was created in [Qualtrics XM](https://www.qualtrics.com/) and distributed through various social media channels. Recruitment for the research was open from February 17, 2023, to March 29, 2023. The distributed survey consisted of the following sections:

- **Homepage**

  Participants were informed about the study conditions, a general outline of their participation, and an estimate of its duration. The introduction mentioned that certain parts of the survey might cause some emotional discomfort. It was stated that complete anonymity would be ensured and participants could withdraw from the study at any time. Clicking the "Proceed" button at the bottom of the page indicated consent to participate.

- **Demographic metrics**

  Basic demographic data of participants were collected: age, gender, level of education, and worldview on spiritual matters. In this section, participants were also asked about their preferred survey version – masculine or feminine. Based on their choice, subsequent content was presented using either masculine or feminine forms of verbs and adjectives (Polish language).

- **Meditation/mindfulness practice**

  Participants were asked if they practice meditation/mindfulness. If they did, they were asked to estimate the number of minutes devoted to this practice in the last 30 days.

- **Psychedelics use**

  A multiple-choice list was presented for participants to mark the psychedelic compounds they had used at least once in their lives, with an option to add other substances. The list included: LSD (or 1P-LSD); psilocybin mushrooms (or synthetic psilocybin); ayahuasca; DMT (other than ayahuasca); 5-MeO-DMT; mescaline; ibogaine; salvia divinorum.

  Those who selected at least one substance were asked to estimate how many times they had used psychedelics and which subjective doses they had taken (microdoses, low, average, high, very high). Salvia divinorum and ibogaine were eventually excluded from the psychedelics group because they do not meet the criterion for classic psychedelics (affinity for 5-HT<sub>2A</sub> receptors).

- **Transcendental experiences**

  Participants were asked if they had experienced a state of consciousness characterized by: **(1)** an altered sense of time and/or space, **(2)** a sense of awe, wonder, or fear, **(3)** ineffability, and **(4)** subjective transcendental/mystical qualities. Participants who answered negatively skipped the remaining items in this section and the entire Mystical Experience Questionnaire (MEQ30), proceeding directly to the dependent-variable questionnaires.

  Participants who answered affirmatively were shown a multiple-choice list to select the circumstances of the experience. If the experience involved a psychedelic substance, they specified which compound triggered the most intense experience. Those who selected multiple circumstances indicated which one accompanied their first experience and which one accompanied the most intense experience. The last item asked them to estimate how long ago the most intense experience occurred.

- **Revised Mystical Experience Questionnaire (MEQ30)**

  This questionnaire, consisting of 30 items, measures the intensity of mystical experiences (MacLean et al., 2012; Barrett et al., 2015). Based on Walter Stace's classic concept (1960/1973), four dimensions are measured: *Mysticism*, *Positive Affect*, *Transcendence of Time and Space*, and *Ineffability*. Participants rated their experience on a 6-point Likert scale. The Polish translation (α = 0.95, N = 515) by the research author was used.

  Confirmatory factor analysis showed a structural difference from the original. The item "Experience of amazement" loaded onto the *Ineffability* dimension rather than *Positive Affect*. However, retaining the original structure preserved slightly better reliability (α<sub>mysticism</sub> = 0.95; α<sub>positive affect</sub> = 0.80; α<sub>transcendence</sub> = 0.85; α<sub>ineffability</sub> = 0.79).

- **Perth Empathy Scale (PES)**

  This tool measures cognitive and affective empathy (Brett et al., 2022), categorized into positive and negative emotions, forming four subscales. The study used the *Positive* and *Negative Affective Empathy* subscales (α<sub>positive-affective</sub> = 0.70; α<sub>negative-affective</sub> = 0.72, N = 676), contributing to *Overall Affective Empathy* (α = 0.73; 10 items). Each item followed a structured format with responses on a 5-point Likert scale. The Polish version by Paweł Larionow and Karolina Mudło-Głagolska (2022) was used.

- **Satisfaction with Life Scale (SWLS)**

  Created by Ed Diener and colleagues (1985), this scale has 5 items rated on a 7-point Likert scale. It measures life satisfaction (example item: *If I could live my life over, I would change almost nothing*). The Polish translation by Konrad Jankowski (2015) was used (α = 0.86, N = 676).

- **Death Attitude Profile - Revised (DAP-R-PL)**

  This tool measures five areas of attitudes toward death, including types of death acceptance and avoidance, and fear of death (Wong et al., 1994). Only the *Fear of Death* subscale was used (example item: *Death is undoubtedly an unpleasant experience*) in the Polish adaptation by Paweł Brudek et al. (2020). This subscale has 7 items rated on a 7-point Likert scale (α = 0.92, N = 676).

- **Subjectively perceived influence**

  This section was for participants who reported undergoing a mystical experience. They answered three questions about the impact of the experience on empathy, life satisfaction, and fear of death, rated on a 5-point Likert scale. Participants could also share their experience in an optional text field.
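
The α values reported for the questionnaires above are reliability coefficients. The notebook does not name the estimator, so the sketch below assumes the usual choice, Cronbach's alpha: for k items, α = k / (k − 1) · (1 − sum of item variances / variance of the summed score). The function name `cronbach_alpha` and the toy responses are hypothetical and are not taken from the study data.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])  # number of items
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of the summed score.
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical toy responses (4 respondents x 3 items), not study data.
# The items rise and fall together, so the scale is perfectly consistent.
toy = [
    [1, 2, 1],
    [2, 3, 2],
    [3, 4, 3],
    [4, 5, 4],
]
alpha = cronbach_alpha(toy)
```

In this toy case the three items move in lockstep, so `alpha` comes out at 1 (up to floating-point rounding). Real questionnaire data would give lower values, such as those listed above.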

## **2.1. Table of variables**

| Variable | Description | Values | Type |
| - | - | - | - |
| start_date | Timestamp when the survey started | N/A | datetime64 |
| end_date | Timestamp when the survey finished | N/A | datetime64 |
| progress_in_percent | Survey completion percentage | N/A | int64 |
| duration_in_seconds | Time taken to complete the survey in seconds | N/A | int64 |
| finished | Survey completion status | 0 = *False*; 1 = *True* | int64 |
| recorded_date | Date when the data was recorded | N/A | datetime64 |
| age | Participant's age in years | N/A | float64 |
| sex | Participant's gender | 1 = *Male*; 2 = *Female*; 3 = *Other* | float64 |
| survey_version | Survey version based on gender-specific language | 1 = *Male*; 2 = *Female* | float64 |
| worldview | Participant's religious or spiritual worldview | 1 = *Religious*; 2 = *Spiritual but not religious*; 3 = *Atheistic*; 4 = *Agnostic or no specific worldview* | float64 |
| education | Participant's highest level of education | 1 = *Primary*; 2 = *Vocational*; 3 = *Secondary*; 4 = *Bachelor's or Master's degree*; 5 = *Doctorate or higher* | float64 |
| meditation | Practice of meditation or mindfulness techniques | 1 = *Yes*; 2 = *No* | float64 |
| meditation_minutes | Estimated time (in minutes) spent on meditation or mindfulness in the past 30 days | N/A | float64 |
| compound_never | Never used any psychedelic compounds | 1 = *Yes*; 2 = *No* | float64 |
| compound_LSD | Used LSD at least once | 1 = *Yes*; 2 = *No* | float64 |
| compound_psilocybin | Used psilocybin at least once | 1 = *Yes*; 2 = *No* | float64 |
| compound_ayahuasca | Used ayahuasca at least once | 1 = *Yes*; 2 = *No* | float64 |
| compound_DMT | Used DMT (other than ayahuasca) at least once | 1 = *Yes*; 2 = *No* | float64 |
| compound_5MeODMT | Used 5-MeO-DMT at least once | 1 = *Yes*; 2 = *No* | float64 |
| compound_mescaline | Used mescaline at least once | 1 = *Yes*; 2 = *No* | float64 |
| compound_ibogaine | Used ibogaine at least once | 1 = *Yes*; 2 = *No* | float64 |
| compound_salvia | Used salvinorin A (*Salvia divinorum*) at least once | 1 = *Yes*; 2 = *No* | float64 |
| compound_other | Used other psychedelic compounds at least once | 1 = *Yes*; 2 = *No* | float64 |
| compound_text | Details of other psychedelic compounds used | N/A | object |
| use_amount | Estimated total number of times psychedelic compounds were used | N/A | float64 |
| microdose | Used a microdose at least once | 1 = *Yes*; 2 = *No* | float64 |
| low_dose | Used a low dose at least once | 1 = *Yes*; 2 = *No* | float64 |
| average_dose | Used an average dose at least once | 1 = *Yes*; 2 = *No* | float64 |
| high_dose | Used a high dose at least once | 1 = *Yes*; 2 = *No* | float64 |
| very_high_dose | Used a very high dose at least once | 1 = *Yes*; 2 = *No* | float64 |
| mystical_experience | Experienced a transcendental/mystical occurrence | 1 = *Yes*; 2 = *No* | float64 |
| context_psychedelic | Transcendental experience triggered by a psychedelic compound | 1 = *Yes*; 2 = *No* | float64 |
| context_other_psychoactive | Transcendental experience triggered by a non-psychedelic psychoactive compound | 1 = *Yes*; 2 = *No* | float64 |
| context_NDE | Transcendental experience triggered by a near-death experience | 1 = *Yes*; 2 = *No* | float64 |
| context_meditation | Transcendental experience triggered by meditation | 1 = *Yes*; 2 = *No* | float64 |
| context_ritual | Transcendental experience triggered by a religious ritual | 1 = *Yes*; 2 = *No* | float64 |
| context_hypnosis | Transcendental experience triggered by a hypnotic state | 1 = *Yes*; 2 = *No* | float64 |
| context_other | Transcendental experience triggered by other contexts | 1 = *Yes*; 2 = *No* | float64 |
| context_other_psychoactive_text | Details of the non-psychedelic psychoactive compound trigger | N/A | object |
| context_other_text | Details of the other context or event that triggered the experience | N/A | object |
| order | Context that triggered the first transcendental experience (if multiple) | 1 = *Psychedelic use*; 2 = *Near-death experience*; 3 = *Meditation*; 4 = *Religious ritual*; 5 = *Hypnotic state*; 6 = *Other event*; 7 = *Other psychoactive compounds use* | float64 |
| intensity | Context that triggered the most intense transcendental experience (if multiple) | 1 = *Psychedelic use*; 2 = *Near-death experience*; 3 = *Meditation*; 4 = *Religious ritual*; 5 = *Hypnotic state*; 6 = *Other event*; 7 = *Other psychoactive compounds use* | float64 |
| how_long_ago | Time (in months) since the most intense transcendental experience | N/A | float64 |
| MEQ30_1 - MEQ30_30 | Items from the Revised Mystical Experience Questionnaire (MEQ30) | 0-5 scale | float64 |
| PES_1 - PES_11 | Items from the Perth Empathy Scale (Positive and Negative Emotional Empathy subscales) + one control question (PES_9) | 1-5 scale | float64 |
| SWLS_1 - SWLS_5 | Items from the Satisfaction with Life Scale | 1-7 scale | float64 |
| DAP_R_1 - DAP_R_8 | Items from the Death Attitude Profile-Revised (Fear of Death subscale) + one control question (DAP_R_4) | 1-7 scale | float64 |
| influence_empathy | Perceived influence of the transcendental experience on empathy levels | 1 = *Definitely positive*; 2 = *Rather positive*; 3 = *No influence*; 4 = *Rather negative*; 5 = *Definitely negative* | float64 |
| influence_satisfaction | Perceived influence of the transcendental experience on life satisfaction levels | 1 = *Definitely positive*; 2 = *Rather positive*; 3 = *No influence*; 4 = *Rather negative*; 5 = *Definitely negative* | float64 |
| influence_fear | Perceived influence of the transcendental experience on fear of death | 1 = *Definitely positive (lower fear)*; 2 = *Rather positive*; 3 = *No influence*; 4 = *Rather negative*; 5 = *Definitely negative (higher fear)* | float64 |
| description | Additional details about the experience | N/A | object |


# **3. Imports**

## **3.1. Importing libraries**

Let's first import all the libraries and packages needed to run the following code cells.

In [23]:
import pandas as pd
import numpy as np

## **3.2. Loading dataset**

Now that the pandas package is imported, we can load our dataset from the GitHub repository as a pandas DataFrame.

In [2]:
dataset_url = "https://github.com/michal-owsiak/research/raw/main/dataset.xlsx"
df = pd.read_excel(dataset_url)

Now let's check how big the loaded set is...

In [3]:
print(f"The dataset consists of {df.shape[0]} records and {df.shape[1]} variables.")

The dataset consists of 1127 records and 193 variables.


...and what all those variables are.

In [4]:
list(df.columns)

['start_date',
 'end_date',
 'progress_in_percent',
 'duration_in_seconds',
 'finished',
 'recorded_date',
 'age',
 'sex',
 'survey_version',
 'worldview',
 'education',
 'meditation_M',
 'meditation_minutes_M',
 'compound_never_M',
 'compound_LSD_M',
 'compound_psylocybin_M',
 'compound_ayahuasca_M',
 'compound_DMT_M',
 'compound_5MeODMT_M',
 'compound_mescaline_M',
 'compound_ibogaine_M',
 'compound_salvia_M',
 'compound_other_M',
 'compound_text_M',
 'use_amount_M',
 'microdose_M',
 'low_dose_M',
 'average_dose_M',
 'high_dose_M',
 'very_high_dose_M',
 'mystical_experience_M',
 'context_psychedelic_M',
 'context_other_psychoactive_M',
 'context_NDE_M',
 'context_meditation_M',
 'context_ritual_M',
 'context_hypnosis_M',
 'context_other_M',
 'context_other_psychoactive_text_M',
 'context_other_text_M',
 'trigger_compound_M',
 'order_M',
 'intensity_M',
 'how_long_ago_M',
 'MEQ30_1_M',
 'MEQ30_2_M',
 'MEQ30_3_M',
 'MEQ30_4_M',
 'MEQ30_5_M',
 'MEQ30_6_M',
 'MEQ30_7_M',
 'MEQ30_8_M',
 '

*Tip: To get a comprehensive overview of all columns, viewing the above list as a scrollable element is recommended.*

Apart from the first 11 columns, every variable appears twice, once with an `_M` suffix and once with an `_F` suffix. This duplication stems from the survey's branching design, which split the main survey flow into two paths, one using masculine and the other feminine language forms, depending on each participant's choice. Variables ending with `_F` contain data obtained from the female version of the survey, while those ending with `_M` come from the male version.

We don't want to analyze these two sets of variables separately, so merging each `_F`/`_M` pair should be one of our first steps. Let's now proceed to the **Data wrangling & cleaning** section.
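
Before merging, it can be worth confirming that every `_M` column really has an `_F` counterpart. The sketch below does this on a shortened, hypothetical column list; the helper name `split_paired_columns` is an assumption and is not part of the notebook.

```python
def split_paired_columns(columns):
    """Split column names into shared columns and base names of _M/_F pairs."""
    male = {c[:-2] for c in columns if c.endswith("_M")}
    female = {c[:-2] for c in columns if c.endswith("_F")}

    # Base names that appear with only one of the two suffixes.
    unpaired = male ^ female
    if unpaired:
        raise ValueError(f"Columns without a counterpart: {sorted(unpaired)}")

    shared = [c for c in columns if not c.endswith(("_M", "_F"))]
    # Keep the base names in their original column order.
    bases = [c[:-2] for c in columns if c.endswith("_M")]
    return shared, bases


# Hypothetical, shortened column list for illustration only.
example_columns = [
    "age", "sex",
    "meditation_M", "use_amount_M",
    "meditation_F", "use_amount_F",
]
shared, bases = split_paired_columns(example_columns)
```

With the real dataframe, the same call would be `split_paired_columns(list(df.columns))`. The reported shape is consistent with this structure: 11 shared columns plus two copies of 91 paired variables gives 193.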

# **4. Data wrangling & cleaning**

First, let's see what a sample of our dataframe looks like.

In [5]:
pd.set_option("display.max_columns", None)
df.head(10)

Unnamed: 0,start_date,end_date,progress_in_percent,duration_in_seconds,finished,recorded_date,age,sex,survey_version,worldview,education,meditation_M,meditation_minutes_M,compound_never_M,compound_LSD_M,compound_psylocybin_M,compound_ayahuasca_M,compound_DMT_M,compound_5MeODMT_M,compound_mescaline_M,compound_ibogaine_M,compound_salvia_M,compound_other_M,compound_text_M,use_amount_M,microdose_M,low_dose_M,average_dose_M,high_dose_M,very_high_dose_M,mystical_experience_M,context_psychedelic_M,context_other_psychoactive_M,context_NDE_M,context_meditation_M,context_ritual_M,context_hypnosis_M,context_other_M,context_other_psychoactive_text_M,context_other_text_M,trigger_compound_M,order_M,intensity_M,how_long_ago_M,MEQ30_1_M,MEQ30_2_M,MEQ30_3_M,MEQ30_4_M,MEQ30_5_M,MEQ30_6_M,MEQ30_7_M,MEQ30_8_M,MEQ30_9_M,MEQ30_10_M,MEQ30_11_M,MEQ30_12_M,MEQ30_13_M,MEQ30_14_M,MEQ30_15_M,MEQ30_16_M,MEQ30_17_M,MEQ30_18_M,MEQ30_19_M,MEQ30_20_M,MEQ30_21_M,MEQ30_22_M,MEQ30_23_M,MEQ30_24_M,MEQ30_25_M,MEQ30_26_M,MEQ30_27_M,MEQ30_28_M,MEQ30_29_M,MEQ30_30_M,PES_1_M,PES_2_M,PES_3_M,PES_4_M,PES_5_M,PES_6_M,PES_7_M,PES_8_M,PES_9_M,PES_10_M,PES_11_M,SWLS_1_M,SWLS_2_M,SWLS_3_M,SWLS_4_M,SWLS_5_M,DAP_R_1_M,DAP_R_2_M,DAP_R_3_M,DAP_R_4_M,DAP_R_5_M,DAP_R_6_M,DAP_R_7_M,DAP_R_8_M,influence_empathy_M,influence_satisfaction_M,influence_fear_M,description_text_M,meditation_F,meditation_minutes_F,compound_never_F,compound_LSD_F,compound_psylocybin_F,compound_ayahuasca_F,compound_DMT_F,compound_5MeODMT_F,compound_mescaline_F,compound_ibogaine_F,compound_salvia_F,compound_other_F,compound_text_F,use_amount_F,microdose_F,low_dose_F,average_dose_F,high_dose_F,very_high_dose_F,mystical_experience_F,context_psychedelic_F,context_other_psychoactive_F,context_NDE_F,context_meditation_F,context_ritual_F,context_hypnosis_F,context_other_F,context_other_psychoactive_text_F,context_other_text_F,trigger_compound_F,order_F,intensity_F,how_long_ago_F,MEQ30_1_F,MEQ30_2_F,MEQ30_3_F,MEQ30_4_F,MEQ30_5_F,MEQ30_6_F,MEQ30_7_F,MEQ30_8_F,MEQ30_9_F,MEQ30_10_F,MEQ30_11_F,MEQ30_12_F,MEQ30_13_F,MEQ30_14_F,MEQ30_15_F,MEQ30_16_F,MEQ30_17_F,MEQ30_18_F,MEQ30_19_F,MEQ30_20_F,MEQ30_21_F,MEQ30_22_F,MEQ30_23_F,MEQ30_24_F,MEQ30_25_F,MEQ30_26_F,MEQ30_27_F,MEQ30_28_F,MEQ30_29_F,MEQ30_30_F,PES_1_F,PES_2_F,PES_3_F,PES_4_F,PES_5_F,PES_6_F,PES_7_F,PES_8_F,PES_9_F,PES_10_F,PES_11_F,SWLS_1_F,SWLS_2_F,SWLS_3_F,SWLS_4_F,SWLS_5_F,DAP_R_1_F,DAP_R_2_F,DAP_R_3_F,DAP_R_4_F,DAP_R_5_F,DAP_R_6_F,DAP_R_7_F,DAP_R_8_F,influence_empathy_F,influence_satisfaction_F,influence_fear_F,description_text_F
0,2023-02-17 22:50:49,2023-02-17 23:04:08,100,798,1,2023-02-17 23:04:09,28.0,1.0,1.0,2.0,4.0,1.0,200.0,,,1.0,,,,,,,,,6.0,1.0,1.0,1.0,1.0,,1.0,1.0,,,,,,,,,2.0,,,4.0,6.0,6.0,6.0,6.0,6.0,4.0,5.0,5.0,5.0,6.0,5.0,4.0,6.0,5.0,5.0,5.0,4.0,6.0,5.0,5.0,5.0,6.0,5.0,4.0,4.0,5.0,5.0,5.0,6.0,5.0,4.0,4.0,2.0,5.0,2.0,4.0,3.0,3.0,4.0,4.0,2.0,3.0,4.0,2.0,1.0,3.0,5.0,6.0,7.0,5.0,6.0,7.0,7.0,5.0,2.0,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2023-02-18 00:24:48,2023-02-18 00:34:16,100,567,1,2023-02-18 00:34:18,30.0,1.0,1.0,4.0,4.0,2.0,,1.0,,,,,,,,,,,,,,,,,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5.0,3.0,2.0,5.0,3.0,2.0,2.0,3.0,4.0,2.0,1.0,5.0,4.0,4.0,3.0,4.0,3.0,1.0,2.0,3.0,1.0,4.0,4.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,2023-02-18 12:57:58,2023-02-18 13:13:33,100,934,1,2023-02-18 13:13:33,25.0,2.0,2.0,2.0,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,5.0,,,,,,,,,,1.0,Haszysz,1.0,,1.0,,,,1.0,,1.0,,,,,,Ciastko z haszyszem,,,,,52.0,6.0,4.0,6.0,5.0,6.0,4.0,6.0,4.0,6.0,6.0,6.0,3.0,6.0,5.0,5.0,5.0,3.0,4.0,6.0,4.0,5.0,6.0,1.0,3.0,2.0,5.0,1.0,5.0,6.0,1.0,4.0,4.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0,3.0,3.0,3.0,3.0,3.0,2.0,2.0,1.0,2.0,1.0,5.0,2.0,2.0,2.0,1.0,3.0,3.0,4.0,
3,2023-02-18 13:14:33,2023-02-18 13:18:33,100,239,1,2023-02-18 13:18:34,22.0,2.0,2.0,4.0,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,1.0,,,,,,,,,,,,,,,,,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3.0,3.0,3.0,3.0,2.0,3.0,2.0,3.0,4.0,3.0,2.0,3.0,4.0,2.0,2.0,3.0,4.0,4.0,4.0,5.0,5.0,4.0,4.0,4.0,,,,
4,2023-02-18 13:19:13,2023-02-18 13:26:12,100,419,1,2023-02-18 13:26:13,28.0,2.0,2.0,4.0,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,1.0,,,,,,,,,,,,,,,,,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,4.0,4.0,2.0,2.0,3.0,3.0,2.0,3.0,4.0,3.0,3.0,4.0,5.0,5.0,6.0,5.0,3.0,2.0,2.0,5.0,2.0,3.0,2.0,2.0,,,,
5,2023-02-18 13:31:12,2023-02-18 13:42:46,100,693,1,2023-02-18 13:42:47,40.0,1.0,1.0,1.0,5.0,2.0,,1.0,,,,,,,,,,,,,,,,,1.0,,1.0,,,,,1.0,,,,6.0,6.0,180.0,2.0,3.0,2.0,1.0,2.0,1.0,2.0,3.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,3.0,2.0,2.0,2.0,3.0,2.0,3.0,2.0,2.0,2.0,4.0,3.0,2.0,5.0,3.0,4.0,3.0,4.0,3.0,3.0,2.0,4.0,4.0,2.0,2.0,4.0,2.0,2.0,2.0,4.0,1.0,1.0,1.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
6,2023-02-18 13:44:18,2023-02-18 13:54:10,100,592,1,2023-02-18 13:54:10,27.0,2.0,2.0,4.0,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,100.0,,1.0,1.0,,1.0,,,,,,,10.0,1.0,1.0,1.0,,,1.0,1.0,1.0,1.0,1.0,,,,Marihuana,,1.0,5.0,1.0,120.0,6.0,5.0,5.0,4.0,6.0,5.0,5.0,5.0,5.0,5.0,6.0,6.0,6.0,6.0,5.0,5.0,5.0,5.0,5.0,6.0,4.0,5.0,6.0,4.0,5.0,5.0,5.0,6.0,6.0,6.0,3.0,3.0,3.0,4.0,3.0,3.0,2.0,3.0,4.0,3.0,2.0,3.0,4.0,3.0,4.0,3.0,4.0,5.0,6.0,5.0,6.0,4.0,3.0,4.0,2.0,2.0,3.0,
7,2023-02-18 13:54:16,2023-02-18 14:01:42,79,446,0,2023-02-25 14:01:45,33.0,2.0,2.0,1.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,1.0,,,,,,,,,,,,,,,,,1.0,1.0,,,,,,,,,9.0,,,8.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
8,2023-02-18 14:07:19,2023-02-18 14:13:37,100,377,1,2023-02-18 14:13:38,27.0,2.0,2.0,1.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,1.0,,,,,,,,,,,,,,,,,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,4.0,3.0,2.0,4.0,3.0,3.0,4.0,4.0,2.0,3.0,5.0,5.0,4.0,6.0,6.0,6.0,,,,
9,2023-02-18 14:07:44,2023-02-18 14:10:39,33,175,0,2023-02-25 14:10:50,45.0,2.0,1.0,2.0,5.0,1.0,10.0,,1.0,1.0,,,,,,,,,10.0,,,1.0,,,1.0,1.0,1.0,,1.0,,,,,,2.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


We can observe numerous **NaN** values in the dataset. What causes this?

In some cases, NaN values correspond to incomplete survey responses (as seen in rows 7 and 9). However, what about the NaN values appearing even when participants have completed 100% of the survey? Again, this is a result of the survey's interactive nature. Besides splitting the questionnaires into two gender-specific forms, certain sections were hidden based on participants' responses. For instance, if a participant indicated never having used a psychedelic compound, the section asking about dosage wouldn't appear, leading to NaN values. The same scenario applies to questions about meditation time and those specific to mystical experiences. Also, unchecked boxes on the lists were automatically exported as *Not a Number*.
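
To see the two sources of missingness side by side, one option is to compute the share of empty answer cells per respondent and compare finished with unfinished surveys. The sketch below uses a tiny hypothetical frame with made-up values; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two finished respondents and one who dropped out.
toy = pd.DataFrame({
    "finished": [1, 1, 0],
    "compound_never_M": [1.0, np.nan, np.nan],
    "use_amount_M": [np.nan, 6.0, np.nan],
})
answer_cols = ["compound_never_M", "use_amount_M"]

# Share of empty answer cells for each respondent.
nan_share = toy[answer_cols].isna().mean(axis=1)

# Average share of empty cells, split by completion status.
by_status = nan_share.groupby(toy["finished"]).mean()
```

Here the finished respondents still have half of their cells empty, purely because of skip logic and unchecked boxes, while the unfinished respondent has no answers at all.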

## **4.1. Joining female and male variables into one dataframe**

Regarding the incomplete records mentioned earlier, we will set them aside for now and proceed to aggregate the `_F` and `_M` variables into a consolidated dataframe.

First, we will convert all non-numeric and missing values to zeros (we can apply this to every column, string variables included, as we don't need the free-text answers for our analysis). With numeric values in place, we will sum the corresponding `_F` and `_M` values, storing the combined columns in a separate dataframe. This dataframe will then be concatenated with the original one. Finally, we will drop all `_F` and `_M` columns, which will leave us with the desired dataset for further wrangling.

In [6]:
# Base names of all gender-specific variables (strip the "_M" suffix)
columns_to_process = [col[:-2] for col in df.columns if col.endswith("_M")]
combined_columns = {}

for col in columns_to_process:
    male_col = f"{col}_M"
    female_col = f"{col}_F"

    # Coerce text to NaN, then replace every NaN with 0
    male_variables = pd.to_numeric(df[male_col], errors="coerce").fillna(0)
    female_variables = pd.to_numeric(df[female_col], errors="coerce").fillna(0)

    # Each participant answered only one version, so the sum keeps that answer
    combined_columns[col] = male_variables + female_variables

combined_df = pd.DataFrame(combined_columns)

# Append the merged columns to the original dataframe
df = pd.concat([df, combined_df], axis=1)

# Drop the original "_M" and "_F" columns
df.drop(columns=[f"{col}_M" for col in columns_to_process] + [f"{col}_F" for col in columns_to_process], inplace=True)

df.head(10)

Unnamed: 0,start_date,end_date,progress_in_percent,duration_in_seconds,finished,recorded_date,age,sex,survey_version,worldview,education,meditation,meditation_minutes,compound_never,compound_LSD,compound_psylocybin,compound_ayahuasca,compound_DMT,compound_5MeODMT,compound_mescaline,compound_ibogaine,compound_salvia,compound_other,compound_text,use_amount,microdose,low_dose,average_dose,high_dose,very_high_dose,mystical_experience,context_psychedelic,context_other_psychoactive,context_NDE,context_meditation,context_ritual,context_hypnosis,context_other,context_other_psychoactive_text,context_other_text,trigger_compound,order,intensity,how_long_ago,MEQ30_1,MEQ30_2,MEQ30_3,MEQ30_4,MEQ30_5,MEQ30_6,MEQ30_7,MEQ30_8,MEQ30_9,MEQ30_10,MEQ30_11,MEQ30_12,MEQ30_13,MEQ30_14,MEQ30_15,MEQ30_16,MEQ30_17,MEQ30_18,MEQ30_19,MEQ30_20,MEQ30_21,MEQ30_22,MEQ30_23,MEQ30_24,MEQ30_25,MEQ30_26,MEQ30_27,MEQ30_28,MEQ30_29,MEQ30_30,PES_1,PES_2,PES_3,PES_4,PES_5,PES_6,PES_7,PES_8,PES_9,PES_10,PES_11,SWLS_1,SWLS_2,SWLS_3,SWLS_4,SWLS_5,DAP_R_1,DAP_R_2,DAP_R_3,DAP_R_4,DAP_R_5,DAP_R_6,DAP_R_7,DAP_R_8,influence_empathy,influence_satisfaction,influence_fear,description_text
0,2023-02-17 22:50:49,2023-02-17 23:04:08,100,798,1,2023-02-17 23:04:09,28.0,1.0,1.0,2.0,4.0,1.0,200.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,4.0,6.0,6.0,6.0,6.0,6.0,4.0,5.0,5.0,5.0,6.0,5.0,4.0,6.0,5.0,5.0,5.0,4.0,6.0,5.0,5.0,5.0,6.0,5.0,4.0,4.0,5.0,5.0,5.0,6.0,5.0,4.0,4.0,2.0,5.0,2.0,4.0,3.0,3.0,4.0,4.0,2.0,3.0,4.0,2.0,1.0,3.0,5.0,6.0,7.0,5.0,6.0,7.0,7.0,5.0,2.0,2.0,2.0,0.0
1,2023-02-18 00:24:48,2023-02-18 00:34:16,100,567,1,2023-02-18 00:34:18,30.0,1.0,1.0,4.0,4.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,3.0,2.0,5.0,3.0,2.0,2.0,3.0,4.0,2.0,1.0,5.0,4.0,4.0,3.0,4.0,3.0,1.0,2.0,3.0,1.0,4.0,4.0,2.0,0.0,0.0,0.0,0.0
2,2023-02-18 12:57:58,2023-02-18 13:13:33,100,934,1,2023-02-18 13:13:33,25.0,2.0,2.0,2.0,4.0,1.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,52.0,6.0,4.0,6.0,5.0,6.0,4.0,6.0,4.0,6.0,6.0,6.0,3.0,6.0,5.0,5.0,5.0,3.0,4.0,6.0,4.0,5.0,6.0,1.0,3.0,2.0,5.0,1.0,5.0,6.0,1.0,4.0,4.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0,3.0,3.0,3.0,3.0,3.0,2.0,2.0,1.0,2.0,1.0,5.0,2.0,2.0,2.0,1.0,3.0,3.0,4.0,0.0
3,2023-02-18 13:14:33,2023-02-18 13:18:33,100,239,1,2023-02-18 13:18:34,22.0,2.0,2.0,4.0,4.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,3.0,3.0,3.0,2.0,3.0,2.0,3.0,4.0,3.0,2.0,3.0,4.0,2.0,2.0,3.0,4.0,4.0,4.0,5.0,5.0,4.0,4.0,4.0,0.0,0.0,0.0,0.0
4,2023-02-18 13:19:13,2023-02-18 13:26:12,100,419,1,2023-02-18 13:26:13,28.0,2.0,2.0,4.0,4.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,4.0,2.0,2.0,3.0,3.0,2.0,3.0,4.0,3.0,3.0,4.0,5.0,5.0,6.0,5.0,3.0,2.0,2.0,5.0,2.0,3.0,2.0,2.0,0.0,0.0,0.0,0.0
5,2023-02-18 13:31:12,2023-02-18 13:42:46,100,693,1,2023-02-18 13:42:47,40.0,1.0,1.0,1.0,5.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,6.0,6.0,180.0,2.0,3.0,2.0,1.0,2.0,1.0,2.0,3.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,3.0,2.0,2.0,2.0,3.0,2.0,3.0,2.0,2.0,2.0,4.0,3.0,2.0,5.0,3.0,4.0,3.0,4.0,3.0,3.0,2.0,4.0,4.0,2.0,2.0,4.0,2.0,2.0,2.0,4.0,1.0,1.0,1.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,5.0,0.0
6,2023-02-18 13:44:18,2023-02-18 13:54:10,100,592,1,2023-02-18 13:54:10,27.0,2.0,2.0,4.0,4.0,1.0,100.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,5.0,1.0,120.0,6.0,5.0,5.0,4.0,6.0,5.0,5.0,5.0,5.0,5.0,6.0,6.0,6.0,6.0,5.0,5.0,5.0,5.0,5.0,6.0,4.0,5.0,6.0,4.0,5.0,5.0,5.0,6.0,6.0,6.0,3.0,3.0,3.0,4.0,3.0,3.0,2.0,3.0,4.0,3.0,2.0,3.0,4.0,3.0,4.0,3.0,4.0,5.0,6.0,5.0,6.0,4.0,3.0,4.0,2.0,2.0,3.0,0.0
7,2023-02-18 13:54:16,2023-02-18 14:01:42,79,446,0,2023-02-25 14:01:45,33.0,2.0,2.0,1.0,5.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,2023-02-18 14:07:19,2023-02-18 14:13:37,100,377,1,2023-02-18 14:13:38,27.0,2.0,2.0,1.0,5.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,4.0,3.0,2.0,4.0,3.0,3.0,4.0,4.0,2.0,3.0,5.0,5.0,4.0,6.0,6.0,6.0,0.0,0.0,0.0,0.0
9,2023-02-18 14:07:44,2023-02-18 14:10:39,33,175,0,2023-02-25 14:10:50,45.0,2.0,1.0,2.0,5.0,1.0,10.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now we can see that all the NaN values have been replaced with zeros and that we no longer have duplicated sets of female and male variables.
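
The sum works as a merge because each participant saw only one survey version, so at most one value in every `_M`/`_F` pair is populated. A side effect is that genuinely missing answers become 0. If that distinction matters for a later step, one possible alternative (not what this notebook does) is `Series.combine_first`, sketched here on hypothetical toy values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy pair: one male-version answer, one female-version answer,
# and one respondent who answered neither.
toy = pd.DataFrame({
    "use_amount_M": [6.0, np.nan, np.nan],
    "use_amount_F": [np.nan, 10.0, np.nan],
})

# Approach used above: fill NaN with 0, then add.
summed = toy["use_amount_M"].fillna(0) + toy["use_amount_F"].fillna(0)

# Alternative: take the _M value and fall back to _F where _M is missing.
merged = toy["use_amount_M"].combine_first(toy["use_amount_F"])
```

Both versions agree wherever an answer exists. They differ only for the third respondent, who gets 0 in `summed` and stays NaN in `merged`.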

## **4.2. Recoding the variables**

The next crucial step is to recode the questionnaire scores appropriately. While none of the questionnaires include reversed items, the coding of their scales needs some adjustment (apart from the PES questionnaire).

**MEQ30**

When exporting data from Qualtrics (the survey creation tool) to an *.xlsx* file, Likert-type responses are automatically coded as follows: the first answer on the scale is marked as `1`, the second as `2`, and so on. However, for the MEQ30 questionnaire, the original scale ranges from 0 to 5. Therefore, we need to recode the current values accordingly. We'll identify all columns starting with `MEQ30`, create a dictionary where each key corresponds to a value that needs replacement, and then apply this dictionary to replace the values in these columns.

In [7]:
meq30_columns = [col for col in df.columns if col.startswith("MEQ30")]

replacement_mapping = {
    1: 0, 
    2: 1, 
    3: 2, 
    4: 3, 
    5: 4, 
    6: 5
}

df[meq30_columns] = df[meq30_columns].replace(replacement_mapping)
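
A quick way to sanity-check the recoding is to apply the same mapping to a toy series. The sketch below uses made-up values and builds the mapping programmatically. It also shows that the 0 placeholders introduced in section 4.1 are left untouched, so after recoding a 0 can stand either for the lowest response option or for an item the participant never saw.

```python
import pandas as pd

# Same mapping as in the cell above, built programmatically (1 -> 0, ..., 6 -> 5).
replacement_mapping = {code: code - 1 for code in range(1, 7)}

# Made-up values: 0.0 stands for a placeholder created in section 4.1,
# the rest are raw Qualtrics codes.
toy = pd.Series([0.0, 1.0, 2.0, 6.0])
recoded = toy.replace(replacement_mapping)
```

The placeholder 0.0 and the raw code 1.0 both end up as 0.0, while 2.0 and 6.0 become 1.0 and 5.0.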

**SWLS and DAP-R**

For the SWLS and DAP-R questionnaires, our goal is to obtain scales starting from 1, so there shouldn't be an issue as with the MEQ30. However, these questionnaires are structured such that the first answer on the list reflects **total agreement** with the statement of each item (e.g., *I am satisfied with my life*). Therefore, to ensure that a higher level of agreement corresponds to a higher score on the scale, we need to reverse the scoring of all variables in these two questionnaires. We will employ the same approach used for the MEQ30 described above.

In [8]:
swls_columns = [col for col in df.columns if col.startswith("SWLS")]
dap_r_columns = [col for col in df.columns if col.startswith("DAP_R")]

reverse_mapping = {
    1: 7, 
    2: 6, 
    3: 5, 
    4: 4, 
    5: 3, 
    6: 2, 
    7: 1
}

df[swls_columns] = df[swls_columns].replace(reverse_mapping)
df[dap_r_columns] = df[dap_r_columns].replace(reverse_mapping)
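
The reversal can also be derived rather than typed out: on a scale from 1 to 7, the reversed value of x is 1 + 7 - x, that is 8 - x. The sketch below rebuilds the same dictionary from that formula; `SCALE_MIN`, `SCALE_MAX`, and `reverse_score` are hypothetical names introduced for illustration.

```python
SCALE_MIN, SCALE_MAX = 1, 7

# Reversed value = min + max - value, for every valid response code.
reverse_mapping = {
    value: SCALE_MIN + SCALE_MAX - value
    for value in range(SCALE_MIN, SCALE_MAX + 1)
}


def reverse_score(value):
    """Reverse a 1-7 response; leave anything else (e.g. the 0 placeholder) as is."""
    return reverse_mapping.get(value, value)


examples = [reverse_score(v) for v in (1, 4, 7, 0)]
```

The midpoint 4 maps to itself, the endpoints swap, and the 0 placeholder from section 4.1 passes through unchanged, which matches the behavior of `replace` with the hand-written dictionary.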

## **4.3. Calculating the summed results**

Now that we have all the scales ready, we can proceed to summing the values up. Let's start with the MEQ30.

### **4.3.1. MEQ30**

First, we need to create a dictionary based on the questionnaire's scoring instructions. For each subscale (`key`), we will provide a list of items belonging to that subscale. Next, we will create a `for` loop to iterate through all the items in each subscale and sum them. This loop will create new columns with the summed values for each subscale.

In [9]:
MEQ30_subscales = {
    "Mystical": [4, 5, 6, 9, 14, 15, 16, 18, 20, 21, 23, 24, 25, 26, 28],
    "Positive Mood": [2, 8, 12, 17, 27, 30],
    "Transcendence": [1, 7, 11, 13, 19, 22],
    "Ineffability": [3, 10, 29]
}

for subscale, items in MEQ30_subscales.items():
    df.loc[:, subscale] = df[[f"MEQ30_{item}" for item in items]].sum(axis=1)

Let's check that the last four columns of our DataFrame now hold the MEQ30 subscale scores.

In [10]:
df.iloc[:10, -4:]

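| | Mystical | Positive Mood | Transcendence | Ineffability |
|---|---|---|---|---|
| 0 | 60.0 | 23.0 | 27.0 | 15.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 50.0 | 10.0 | 30.0 | 15.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 15.0 | 14.0 | 6.0 | 3.0 |
| 6 | 62.0 | 26.0 | 27.0 | 13.0 |
| 7 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | 0.0 | 0.0 | 0.0 |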


As seen above, this approach seems to be working. However, we are still not finished with the MEQ30. The crucial information we are looking for is whether a person has undergone a *complete mystical experience*. According to the authors of the questionnaire, this is defined by scoring at least 60% of the points in each of the four subscales.

To do this, we will create a dictionary that assigns the maximum possible score to each subscale. It is obtained by taking the length of each list in our `MEQ30_subscales` dictionary and multiplying it by 5 (since our scale for this questionnaire ranges from 0 to 5). Next, we will divide the scores displayed in the table above by these maximum values and check whether all four results are at least 60%. Finally, we will convert the resulting Boolean flags into numeric values.

In [11]:
max_score = {
    "Mystical": len(MEQ30_subscales["Mystical"]) * 5,
    "Positive Mood": len(MEQ30_subscales["Positive Mood"]) * 5,
    "Transcendence": len(MEQ30_subscales["Transcendence"]) * 5,
    "Ineffability": len(MEQ30_subscales["Ineffability"]) * 5
}

df["MEQ_complete"] = (
    (df["Mystical"] / max_score["Mystical"] >= 0.6) &
    (df["Positive Mood"] / max_score["Positive Mood"] >= 0.6) &
    (df["Transcendence"] / max_score["Transcendence"] >= 0.6) &
    (df["Ineffability"] / max_score["Ineffability"] >= 0.6)
).astype(int)

df.iloc[:10, -1:]

| | MEQ_complete |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| 5 | 0 |
| 6 | 1 |
| 7 | 0 |
| 8 | 0 |
| 9 | 0 |
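
In absolute terms, the 60% rule means at least 45 of 75 points for Mystical, 18 of 30 for Positive Mood and for Transcendence, and 9 of 15 for Ineffability. The sketch below is one possible loop-driven variant of the same check, run on four rows copied from the subscale table above (rows 0, 2, 5, and 6). The `scores` and `ratios` names are introduced here for illustration only.

```python
import pandas as pd

MEQ30_subscales = {
    "Mystical": [4, 5, 6, 9, 14, 15, 16, 18, 20, 21, 23, 24, 25, 26, 28],
    "Positive Mood": [2, 8, 12, 17, 27, 30],
    "Transcendence": [1, 7, 11, 13, 19, 22],
    "Ineffability": [3, 10, 29],
}

# Rows 0, 2, 5 and 6 from the subscale table shown earlier.
scores = pd.DataFrame({
    "Mystical": [60.0, 50.0, 15.0, 62.0],
    "Positive Mood": [23.0, 10.0, 14.0, 26.0],
    "Transcendence": [27.0, 30.0, 6.0, 27.0],
    "Ineffability": [15.0, 15.0, 3.0, 13.0],
})

# Maximum score per subscale: number of items times the top answer (5).
max_score = {name: len(items) * 5 for name, items in MEQ30_subscales.items()}

# Share of the maximum reached on each subscale.
ratios = pd.DataFrame({name: scores[name] / max_score[name] for name in MEQ30_subscales})

# Complete mystical experience: every subscale reaches at least 60%.
scores["MEQ_complete"] = (ratios >= 0.6).all(axis=1).astype(int)

print(max_score)
print(scores["MEQ_complete"].tolist())  # [1, 0, 0, 1]
```

The second row illustrates why all four conditions are needed: it reaches the maximum on Transcendence and Ineffability but only 10 of 30 points on Positive Mood, so it is not counted as a complete mystical experience.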


### **4.3.2. PES**

To compute the overall emotional empathy score, we will sum the 10 scored PES items, that is, items 1 to 11 excluding item 9, which is an additional question included only to anchor the participant's attention.

In [12]:
pes_items = [f"PES_{i}" for i in range(1, 12) if i != 9]

df["PES_sum"] = df[pes_items].sum(axis=1)

Additionally, it will be useful to have separate summed scores for positive and negative emotions. Therefore, we need to implement the same approach as we did with the MEQ30.

In [13]:
PES_subscales = {
    "PES_positive": [2, 4, 6, 8, 11],
    "PES_negative": [1, 3, 5, 7, 10]
}

for subscale, items in PES_subscales.items():
    df.loc[:, subscale] = df[[f"PES_{item}" for item in items]].sum(axis=1)

Now we have three summed columns for the PES questionnaire:

In [14]:
df.iloc[:10, -3:]

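| | PES_sum | PES_positive | PES_negative |
|---|---|---|---|
| 0 | 33.0 | 18.0 | 15.0 |
| 1 | 28.0 | 14.0 | 14.0 |
| 2 | 33.0 | 17.0 | 16.0 |
| 3 | 27.0 | 14.0 | 13.0 |
| 4 | 29.0 | 15.0 | 14.0 |
| 5 | 30.0 | 17.0 | 13.0 |
| 6 | 29.0 | 15.0 | 14.0 |
| 7 | 0.0 | 0.0 | 0.0 |
| 8 | 29.0 | 14.0 | 15.0 |
| 9 | 0.0 | 0.0 | 0.0 |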
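
Rows 7 and 9 show `0.0` in every column. One possible explanation, stated here as an assumption rather than something the output confirms, is that these respondents left all PES items blank: `sum(axis=1)` skips missing values by default and returns 0 when nothing is left to add. If such rows need to stay distinguishable from genuine scores, the `min_count` parameter of pandas' `sum` is one option, sketched below on made-up data.

```python
import numpy as np
import pandas as pd

# Toy data only: the second respondent answered no items at all.
toy = pd.DataFrame({
    "PES_1": [3.0, np.nan],
    "PES_2": [4.0, np.nan],
})

default_sum = toy.sum(axis=1)              # all-NaN row becomes 0.0
strict_sum = toy.sum(axis=1, min_count=1)  # all-NaN row stays NaN

print(default_sum.tolist())  # [7.0, 0.0]
print(strict_sum.tolist())   # [7.0, nan]
```

Whether this matters depends on the order of the cleaning steps: if empty rows are removed before any statistics are computed, the zeros never reach the analysis.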


### **4.3.3. SWLS and DAP-R**

Similarly to the PES, we will sum the items of the DAP-R, and then of the SWLS questionnaire. For the DAP-R, we will exclude the 4th item as it is a control question.

In [15]:
dap_r_items = [f"DAP_R_{i}" for i in range(1, 9) if i != 4]
swls_items = [f"SWLS_{i}" for i in range(1, 6)]

df["DAP_R_sum"] = df[dap_r_items].sum(axis=1)
df["SWLS_sum"] = df[swls_items].sum(axis=1)

df.iloc[:10, -2:]

| | DAP_R_sum | SWLS_sum |
|---|---|---|
| 0 | 13.0 | 27.0 |
| 1 | 39.0 | 20.0 |
| 2 | 45.0 | 27.0 |
| 3 | 27.0 | 26.0 |
| 4 | 40.0 | 15.0 |
| 5 | 49.0 | 26.0 |
| 6 | 24.0 | 23.0 |
| 7 | 0.0 | 0.0 |
| 8 | 24.0 | 22.0 |
| 9 | 0.0 | 0.0 |


## **4.4. Cleaning the dataset**

After merging the separated columns, recoding the values, and summing the scores, we proceed to the data cleaning phase. In this stage, we will remove rows that meet any of the following criteria:

- do not demonstrate `100%` survey completion,
- have incorrect answers to both attention-check questions,
- lie more than `2.5` standard deviations from the mean score in any of the three dependent variables (emotional empathy, satisfaction with life, fear of death).

### **4.4.1. Dropping incomplete records**

In [16]:
incomplete_survey = df[df["finished"] == 0]

print(f"Out of {df.shape[0]} records, there are {incomplete_survey.shape[0]} ({((incomplete_survey.shape[0] / df.shape[0]) * 100):.0f}%) incomplete surveys.")

Out of 1127 records, there are 403 (36%) incomplete surveys.


As many as `36%` of the records are incomplete and therefore unusable for the analysis. Imputing the missing values (for example, the mean for numeric variables and the most frequent value for categorical ones) would distort the data too much. In this case, the best approach is simply to drop the incomplete records from our dataset.

In [17]:
df.drop(incomplete_survey.index, inplace=True)
df.reset_index(drop=True, inplace=True)

### **4.4.2. Dropping records with unconscious responses**

Within the original questionnaire items, there were 2 control questions mixed in. The purpose of these questions was to identify participants who may have been marking random answers.

For PES, the control question is:
- "If you are reading this, please mark ***Often***", which corresponds to an expected value of `4`.

For DAP-R, the control question is:
- "If you are reading this, please mark ***Rather disagree***", which corresponds to an expected value of `3` on our recoded scale.

Let's create new columns that flag whether each of the above questions was answered correctly (`1`) or incorrectly (`0`).

In [19]:
df["control_q1"] = (df["PES_9"] == 4).astype(int)
df["control_q2"] = (df["DAP_R_4"] == 3).astype(int)

df.iloc[:10, -2:]

| | control_q1 | control_q2 |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 1 | 0 |
| 2 | 1 | 1 |
| 3 | 1 | 1 |
| 4 | 1 | 1 |
| 5 | 1 | 1 |
| 6 | 1 | 1 |
| 7 | 1 | 1 |
| 8 | 1 | 1 |
| 9 | 1 | 1 |


Now that our flags are ready, we will filter out the records that have a `0` in both the `control_q1` and `control_q2` columns. This is the less rigorous approach: records with only one incorrect answer are retained. To implement it, we keep the records where either `control_q1` or `control_q2` equals `1`.

*For a more rigorous approach that discards any record with at least one* `0` *value, we only need to replace* `|` *with* `&` *in the first line below*.

In [20]:
df_filtered = df[(df["control_q1"] == 1) | (df["control_q2"] == 1)]

print(f"{(df.shape[0] - df_filtered.shape[0])} records have been filtered out.")

# reset_index returns a new DataFrame, so later column assignments do not act on a slice of the old one
df = df_filtered.reset_index(drop=True)

8 records have been filtered out.
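
To make the difference between the `|` and `&` variants concrete, here is a small sketch on a made-up frame that contains every combination of the two flags. The `lenient` and `strict` names are used only in this illustration.

```python
import pandas as pd

# Toy flags only: one row per possible combination of the two control questions.
toy = pd.DataFrame({
    "control_q1": [1, 1, 0, 0],
    "control_q2": [1, 0, 1, 0],
})

# Lenient rule: keep a record if at least one control question is correct.
lenient = toy[(toy["control_q1"] == 1) | (toy["control_q2"] == 1)]

# Strict rule: keep a record only if both control questions are correct.
strict = toy[(toy["control_q1"] == 1) & (toy["control_q2"] == 1)]

print(len(lenient))  # 3
print(len(strict))   # 1
```

The lenient rule removes only the last row, where both answers are wrong, while the strict rule keeps only the first row.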


### **4.4.3. Dropping the outliers**

During the research design phase, we decided to exclude outliers lying more than `2.5` standard deviations (*SD*) from the mean score of any dependent variable. This approach aims to enhance the likelihood of obtaining normally distributed results.

To begin, we need to calculate the *Z-scores* for each record. The *Z-score* indicates how many *SD* a particular data point deviates from the mean score. Here is the formula for calculating it:

<p style="font-size: 22px; font-weight: bold">
    <i>Z = (X - μ) ÷ σ</i>
</p>

where:

***X*** - the individual data point,

***μ*** - the mean score of the data,

***σ*** - the *SD* value.
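
As a quick illustration of the formula, here is a worked example on three made-up scores. Note that the pandas `std` method computes the sample *SD* (`ddof=1`) by default, whereas NumPy's `np.std` defaults to the population *SD* (`ddof=0`), so the two can give slightly different *Z-scores*.

```python
import pandas as pd

# Toy scores only, chosen so that the arithmetic is easy to follow.
toy_scores = pd.Series([10.0, 20.0, 30.0])

mean = toy_scores.mean()  # 20.0
sd = toy_scores.std()     # sample SD (ddof=1): 10.0

# Z = (X - mean) / SD, computed for every data point at once.
z = (toy_scores - mean) / sd

print(z.tolist())  # [-1.0, 0.0, 1.0]
```

The first score lies one *SD* below the mean and the last one lies one *SD* above it.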

We will apply this formula to our dependent variables and create new columns to store the calculated *Z-scores*. The sign of a *Z-score* indicates the direction of the deviation (above or below the mean), but for outlier detection only its magnitude matters, so we will use the NumPy function `abs` to obtain absolute (non-negative) values of the scores.

In [24]:
df.loc[:, "PES_Z-score"] = np.abs((df["PES_sum"] - df["PES_sum"].mean()) / df["PES_sum"].std())
df.loc[:, "SWLS_Z-score"] = np.abs((df["SWLS_sum"] - df["SWLS_sum"].mean()) / df["SWLS_sum"].std())
df.loc[:, "DAP_R_Z-score"] = np.abs((df["DAP_R_sum"] - df["DAP_R_sum"].mean()) / df["DAP_R_sum"].std())

df.iloc[:10, -3:]

| | PES_Z-score | SWLS_Z-score | DAP_R_Z-score |
|---|---|---|---|
| 0 | 0.62537 | 0.83876 | 0.871172 |
| 1 | 0.229469 | 0.297113 | 1.588671 |
| 2 | 0.62537 | 0.83876 | 2.156327 |
| 3 | 0.400437 | 0.676493 | 0.453358 |
| 4 | 0.058502 | 1.108451 | 1.68328 |
| 5 | 0.112466 | 0.676493 | 2.534764 |
| 6 | 0.058502 | 0.18969 | 0.16953 |
| 7 | 0.058502 | 0.027422 | 0.16953 |
| 8 | 0.112466 | 0.946183 | 2.250936 |
| 9 | 3.01892 | 1.270718 | 0.398126 |


As demonstrated in the sample above, certain scores exceed `2.5` standard deviations from the mean (e.g., row 9 for PES and row 5 for DAP-R). To maintain consistency with our established criteria, these outliers will be removed from the dataset.

In [25]:
df_outliers = df[(df["PES_Z-score"] > 2.5) | (df["SWLS_Z-score"] > 2.5) | (df["DAP_R_Z-score"] > 2.5)]

# drop=True discards the old index instead of adding it as a new "index" column
df = df[~df.index.isin(df_outliers.index)].reset_index(drop=True)

print(f"We have filtered out {df_outliers.shape[0]} outliers.")

We have filtered out 30 outliers.
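
The Z-score filtering in this section could also be wrapped in a small reusable function. The sketch below is one way to do it, not the code used in this analysis: the `drop_outliers` name, its signature, and the toy data are assumptions made for illustration.

```python
import pandas as pd

def drop_outliers(frame, columns, threshold=2.5):
    """Hypothetical helper: drop rows with |Z| above `threshold` in any listed column.

    Returns the cleaned frame and the number of dropped rows.
    """
    # Z-scores for all listed columns at once (sample SD, as in pandas' default).
    z = (frame[columns] - frame[columns].mean()) / frame[columns].std()
    # A row is an outlier if any of its absolute Z-scores exceeds the threshold.
    is_outlier = (z.abs() > threshold).any(axis=1)
    return frame[~is_outlier].reset_index(drop=True), int(is_outlier.sum())

# Toy data only: the last PES_sum value is far from the rest.
toy = pd.DataFrame({
    "PES_sum": [30.0] * 9 + [90.0],
    "SWLS_sum": [20.0, 22.0, 24.0, 26.0, 28.0, 20.0, 22.0, 24.0, 26.0, 28.0],
})

cleaned, n_dropped = drop_outliers(toy, ["PES_sum", "SWLS_sum"])
print(f"We have filtered out {n_dropped} outliers.")
```

Unlike the cells above, this variant does not store the *Z-scores* as new columns, which keeps the DataFrame free of helper variables but makes the individual scores harder to inspect afterwards.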
