## Data Exploration

In this session we will work with a single data frame, `sophonet_data_simulated`.

The `sophonet_data_simulated` data frame contains information from a project investigating the effects of cognitive-behavioral therapy (CBT) and short-term psychodynamic psychotherapy (STPP) on social phobia. The project included 328 participants allocated to two treatment groups, coded in the column `tb` as "CBT" or "STPP". The study followed a pretest-posttest design, so the outcome variable, social phobia, was assessed on two measurement occasions (column suffixes `.t1` and `.t2`). Social phobia is measured by two item parcels: `lsas.a` (intensity of fear) and `lsas.v` (frequency of avoidance behavior). In this data set, `lsas.a` is an item parcel obtained by averaging over 24 items. Further variables will be described as we advance. `sophonet_data_simulated` is a built-in data frame from the R package `EffectLiteR`, so in R it is accessible by simply typing its name once the package is loaded. Here, I provide the data frame as a .csv file so that we can explore the data in Python and illustrate the full analysis pipeline.

Note: An item parcel is an aggregated indicator obtained by averaging two or more item scores from a given test. Averaging removes some of the measurement noise, so parcels are more reliable indicators than single items and can be used to identify latent variables in a Structural Equation Model (SEM).
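To make the idea of parceling concrete, here is a minimal sketch with made-up numbers. The columns `item_1` to `item_4` and the parcel names are hypothetical and are not part of `sophonet_data_simulated`, which already contains the finished parcels.

```python
import pandas as pd

# Hypothetical item-level scores for three respondents; these column
# names are invented for illustration and do not exist in the data set.
items = pd.DataFrame({
    "item_1": [1.0, 2.0, 0.0],
    "item_2": [3.0, 2.0, 1.0],
    "item_3": [2.0, 3.0, 0.0],
    "item_4": [2.0, 1.0, 1.0],
})

# A parcel is the row-wise mean of the items assigned to it.
parcels = pd.DataFrame({
    "parcel_1": items[["item_1", "item_2"]].mean(axis=1),
    "parcel_2": items[["item_3", "item_4"]].mean(axis=1),
})
print(parcels)
```

How items are assigned to parcels is a modeling decision. The split used here is arbitrary.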

Feel free to inspect the data frame in the next code chunk.

In [4]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Read the CSV file using a relative path
effects = pd.read_csv("../EffectLiteR/Datasets/sophonet_data_simulated.csv")
effects.head()


Unnamed: 0,lsas.a.t2,lsas.v.t2,lsas.a.t1,lsas.v.t1,bdi.t1.i1,bdi.t1.i2,bdi.t1.i3,ecr.anx.t1.i1,ecr.anx.t1.i2,ecr.anx.t1.i3,...,tpq.ha.i2,tpq.ha.i3,tosca.shame.t1.i1,tosca.shame.t1.i2,fskn.se.t1.i1,fskn.se.t1.i2,comorbid,iip.lov,iip.dom,tb
1,1.239,1.359,1.784,1.523,0.754,1.16,0.957,3.031,3.539,2.695,...,0.927,0.922,3.665,3.57,2.475,2.935,1.563,-0.195,-0.991,CBT
2,1.661,1.057,1.718,1.386,1.531,1.768,1.567,3.837,2.729,2.241,...,0.777,0.822,3.807,3.86,2.901,2.592,1.804,-0.083,-0.322,CBT
3,1.643,0.988,1.306,0.901,-0.303,0.121,0.199,1.754,1.012,0.227,...,0.699,0.596,3.774,2.501,5.044,4.511,-0.714,-0.418,-0.795,CBT
4,1.24,0.943,2.282,2.282,0.597,1.151,0.397,4.752,4.761,5.354,...,0.519,0.633,2.363,2.631,4.489,3.269,1.031,-0.429,-1.355,CBT
5,0.189,0.324,0.734,0.721,0.174,0.338,0.647,1.635,4.423,3.014,...,0.435,0.566,2.971,1.769,5.89,4.499,0.543,0.226,-0.01,CBT
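The `Unnamed: 0` column in the output above is most likely the R row names that were written to the .csv file with an empty header field. That is an assumption about how the file was exported, not something the file itself confirms. If it holds, passing `index_col=0` to `pd.read_csv` turns that column into the index instead of a data column. The sketch below shows the difference on a small in-memory CSV built from the first two rows and a few of the columns shown above.

```python
import io
import pandas as pd

# First two rows and a subset of columns copied from the output above.
# The empty first header field mimics row names exported from R
# (an assumption about how the real file was written).
csv_text = """,lsas.a.t2,lsas.v.t2,lsas.a.t1,lsas.v.t1,tb
1,1.239,1.359,1.784,1.523,CBT
2,1.661,1.057,1.718,1.386,CBT
"""

# Default behavior: the unnamed first column becomes a data column.
default_read = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column is used as the row index instead.
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_read.columns.tolist())
print(indexed_read.columns.tolist())
```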


In [10]:
# Check missing values in the effects DataFrame
missing_values = effects.isnull().sum()
print("Missing values in each column:")
print(missing_values)

# Print the unique values of column "tb"
print("Values in column 'tb':")
print(effects['tb'].unique())

# Print the number of rows for each value in column "tb"
tb_counts = effects['tb'].value_counts()
print("\nNumber of rows with different values in column 'tb':")
print(tb_counts)


Missing values in each column:
lsas.a.t2            0
lsas.v.t2            0
lsas.a.t1            0
lsas.v.t1            0
bdi.t1.i1            0
bdi.t1.i2            0
bdi.t1.i3            0
ecr.anx.t1.i1        0
ecr.anx.t1.i2        0
ecr.anx.t1.i3        0
ecr.avoi.t1.i1       0
ecr.avoi.t1.i2       0
ecr.avoi.t1.i3       0
tpq.ha.i1            0
tpq.ha.i2            0
tpq.ha.i3            0
tosca.shame.t1.i1    0
tosca.shame.t1.i2    0
fskn.se.t1.i1        0
fskn.se.t1.i2        0
comorbid             0
iip.lov              0
iip.dom              0
tb                   0
dtype: int64
Values in column 'tb':
['CBT' 'STPP']

Number of rows with different values in column 'tb':
tb
CBT     170
STPP    158
Name: count, dtype: int64
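Because the study follows a pretest-posttest design, a natural next exploratory step is to compare the two measurement occasions within each treatment group. The sketch below uses a tiny data frame with made-up values (only the column names `tb`, `lsas.a.t1` and `lsas.a.t2` come from the real data, and `lsas.a.change` is a helper column introduced here). The same two lines could be applied to `effects`.

```python
import pandas as pd

# Toy data with invented values; only the column names match the real data.
toy = pd.DataFrame({
    "tb": ["CBT", "CBT", "STPP", "STPP"],
    "lsas.a.t1": [2.0, 1.0, 2.0, 1.0],
    "lsas.a.t2": [1.0, 0.5, 1.5, 1.0],
})

# Pre-post change per participant: negative values mean that the
# intensity of fear decreased between the two occasions.
toy["lsas.a.change"] = toy["lsas.a.t2"] - toy["lsas.a.t1"]

# Mean pretest score, posttest score and change within each group.
summary = toy.groupby("tb")[["lsas.a.t1", "lsas.a.t2", "lsas.a.change"]].mean()
print(summary)
```

These group means are purely descriptive. They do not adjust for any covariates and should not be read as treatment effect estimates.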
