# Survey example statistics
This workshop for doctoral students and young faculty is based on a survey on accessibility conducted at PŁ. The survey uses three progressively more detailed questions:
1. An overall rating of accessibility on a numbered scale.
2. A rating of 7 elements on a categorical scale.
3. A rating of 12 elements of accessible bathroom design (based on technical requirements for accessible bathrooms).

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# pd.set_option('display.float_format', '{:.4f}'.format)

In [2]:
data = pd.read_csv('./data.csv')
data.columns

Index(['Sygnatura czasowa', 'What's your name/nick/alias?',
       'How accessible to people with special needs is this university?',
       'Rate to what extent this university is accessible to people with special needs. [Parking]',
       'Rate to what extent this university is accessible to people with special needs. [Signs]',
       'Rate to what extent this university is accessible to people with special needs. [Bathrooms]',
       'Rate to what extent this university is accessible to people with special needs. [Classrooms]',
       'Rate to what extent this university is accessible to people with special needs. [Furniture]',
       'Rate to what extent this university is accessible to people with special needs. [Building access]',
       'Rate to what extent this university is accessible to people with special needs. [Floor access]',
       'How accessible are bathrooms at this university? [Bathroom on the same floor with no stairs/steps]',
       'How accessible are bathrooms at

In [3]:
data.head()

Unnamed: 0,Sygnatura czasowa,What's your name/nick/alias?,How accessible to people with special needs is this university?,Rate to what extent this university is accessible to people with special needs. [Parking],Rate to what extent this university is accessible to people with special needs. [Signs],Rate to what extent this university is accessible to people with special needs. [Bathrooms],Rate to what extent this university is accessible to people with special needs. [Classrooms],Rate to what extent this university is accessible to people with special needs. [Furniture],Rate to what extent this university is accessible to people with special needs. [Building access],Rate to what extent this university is accessible to people with special needs. [Floor access],...,How accessible are bathrooms at this university? [Bathroom clearly marked],How accessible are bathrooms at this university? [Doors wide enough for wheelchair],How accessible are bathrooms at this university? [Doors open easily],How accessible are bathrooms at this university? [Toilet at max 0.46m height],How accessible are bathrooms at this university? [Toilet paper within reach],How accessible are bathrooms at this university? [Support bars in toilet],How accessible are bathrooms at this university? [Support bars at sink],How accessible are bathrooms at this university? [Mirror positioned low enough],How accessible are bathrooms at this university? [Soap dispenser within reach],How accessible are bathrooms at this university? [Drier/Towels within reach]
0,2023/05/10 7:13:05 PM EEST,Wojtek,4,Accessible to people with some requirements,Accessible to people with minor requirements,Accessible to people with some requirements,Accessible to people with some requirements,Accessible to people with minor requirements,Accessible to people with some requirements,Accessible to people with minor requirements,...,Most of the bathrooms,Most of the bathrooms,All of the bathrooms,All of the bathrooms,All of the bathrooms,All of the bathrooms,Most of the bathrooms,Half of the bathrooms,Half of the bathrooms,Most of the bathrooms
1,2023/05/10 7:13:36 PM EEST,Magdalena,4,Accessible to people with various requirements,Accessible to people with some requirements,Accessible to people with some requirements,Accessible to people with some requirements,Accessible to people with minor requirements,Accessible to people with minor requirements,Accessible to people with some requirements,...,Most of the bathrooms,Don't know,Don't know,Don't know,Don't know,In some cases,In some cases,In some cases,In some cases,In some cases
2,2023/05/10 7:15:17 PM EEST,Sylwia,3,Accessible to people with some requirements,Hardly accessible,Accessible to people with minor requirements,Accessible to people with minor requirements,Accessible to people with minor requirements,Accessible to people with various requirements,Accessible to people with some requirements,...,Most of the bathrooms,Don't know,In some cases,Don't know,In some cases,Hardly ever,Hardly ever,Hardly ever,Hardly ever,In some cases
3,2023/05/10 7:16:46 PM EEST,Kaczewiak,4,Accessible to people with minor requirements,Accessible to people with minor requirements,Hardly accessible,Hardly accessible,Hardly accessible,Accessible to people with minor requirements,Accessible to people with minor requirements,...,In some cases,In some cases,In some cases,In some cases,Hardly ever,In some cases,In some cases,In some cases,Hardly ever,Hardly ever
4,2023/05/10 7:16:55 PM EEST,XYZ,3,Hardly accessible,Hardly accessible,Hardly accessible,Accessible to people with minor requirements,Hardly accessible,Accessible to people with some requirements,Accessible to people with some requirements,...,In some cases,Hardly ever,In some cases,In some cases,Don't know,Don't know,Don't know,In some cases,Don't know,In some cases


In [4]:
# Create a new dataframe that will hold numeric values instead of labels
df = data.copy()
df.columns = ['Time', 'Nick', 'Overall',
              'R_Parking', 'R_Signs', 'R_Bathrooms', 'R_Classrooms',
              'R_Furniture', 'R_Building', 'R_Floor',
              'B_Floor', 'B_Find', 'B_Mark', 'B_DoorWide', 'B_DoorEasy', 'B_Toilet',
              'B_Paper', 'B_BarsToilet', 'B_BarsSink', 'B_Miror', 'B_Sink', 'B_Towel']

In [5]:
# Change categories to ordered rankings
x = df['R_Parking'].astype('category')
dict(enumerate(x.cat.categories))  # lists the labels that need a numeric value
dict_R_ratings = {'Not accessible': 0,
                  'Hardly accessible': 1,
                  'Accessible to people with minor requirements': 2,
                  'Accessible to people with some requirements': 3,
                  'Accessible to people with various requirements': 4}
for c in df.columns[3:10]:
    print('Changing values in column ' + c)
    # Assign the result back: an in-place replace on a selected column may not update df
    df[c] = df[c].replace(dict_R_ratings)
df.iloc[1:5, 3:10]

Changing values in column R_Parking
Changing values in column R_Signs
Changing values in column R_Bathrooms
Changing values in column R_Classrooms
Changing values in column R_Furniture
Changing values in column R_Building
Changing values in column R_Floor


Unnamed: 0,R_Parking,R_Signs,R_Bathrooms,R_Classrooms,R_Furniture,R_Building,R_Floor
1,4,3,3,3,2,2,3
2,3,1,2,2,2,4,3
3,2,2,1,1,1,2,2
4,1,1,1,2,1,3,3
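
The same recoding can also be sketched with an ordered categorical, which keeps the labels and their rank together. This is only an illustration: the category order is taken from `dict_R_ratings`, while `levels`, `answers`, `ordered` and `codes` are hypothetical names and the answers are made up.

```python
import pandas as pd

# Labels in ascending order of accessibility (order taken from dict_R_ratings)
levels = [
    'Not accessible',
    'Hardly accessible',
    'Accessible to people with minor requirements',
    'Accessible to people with some requirements',
    'Accessible to people with various requirements',
]

# Made-up answers, for illustration only
answers = pd.Series([
    'Hardly accessible',
    'Accessible to people with some requirements',
    'Not accessible',
    'Accessible to people with various requirements',
])

# An ordered categorical stores the labels and their rank together
ordered = pd.Series(pd.Categorical(answers, categories=levels, ordered=True))

# .cat.codes is the position of each label in `levels` (0..4);
# a label missing from `levels` would get the code -1
codes = ordered.cat.codes
print(codes.tolist())
```

A code of -1 would flag a label that the mapping does not cover, which a plain dictionary replace would silently leave as text.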


In [6]:
pd.crosstab(data.iloc[:,3], df.iloc[:,3])

Rate to what extent this university is accessible to people with special needs. [Parking] \ R_Parking,0,1,2,3,4
Accessible to people with minor requirements,0,0,4,0,0
Accessible to people with some requirements,0,0,0,9,0
Accessible to people with various requirements,0,0,0,0,13
Hardly accessible,0,3,0,0,0
Not accessible,1,0,0,0,0
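
The crosstab has exactly one non-zero cell per row and per column, which is what a one-to-one recoding should look like. A minimal sketch of the same check done programmatically, on made-up data with hypothetical names (`labels`, `mapping`, `values`, `check`):

```python
import pandas as pd

# Made-up labels and mapping, for illustration only
labels = pd.Series(['Hardly ever', 'In some cases', 'Hardly ever', 'All of the bathrooms'])
mapping = {'Hardly ever': 0, 'In some cases': 1, 'All of the bathrooms': 4}
values = labels.map(mapping)

check = pd.crosstab(labels, values)

# Each original label (row) should land in exactly one numeric value (column)
one_value_per_label = ((check > 0).sum(axis=1) == 1).all()
# Each numeric value (column) should come from exactly one label (row)
one_label_per_value = ((check > 0).sum(axis=0) == 1).all()
print(one_value_per_label, one_label_per_value)
```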


In [7]:
# Change categories to ordered rankings; "Don't know" becomes a missing value
x = df['B_Floor'].astype('category')
dict(enumerate(x.cat.categories))  # lists the labels that need a numeric value
dict_B_ratings = {'Hardly ever': 0,
                  'In some cases': 1,
                  'Half of the bathrooms': 2,
                  'Most of the bathrooms': 3,
                  'All of the bathrooms': 4,
                  "Don't know": np.nan}
for c in df.columns[10:22]:
    print('Changing values in column ' + c)
    # Assign the result back: an in-place replace on a selected column may not update df
    df[c] = df[c].replace(dict_B_ratings)
df.iloc[1:5, 10:22]

Changing values in column B_Floor
Changing values in column B_Find
Changing values in column B_Mark
Changing values in column B_DoorWide
Changing values in column B_DoorEasy
Changing values in column B_Toilet
Changing values in column B_Paper
Changing values in column B_BarsToilet
Changing values in column B_BarsSink
Changing values in column B_Miror
Changing values in column B_Sink
Changing values in column B_Towel


Unnamed: 0,B_Floor,B_Find,B_Mark,B_DoorWide,B_DoorEasy,B_Toilet,B_Paper,B_BarsToilet,B_BarsSink,B_Miror,B_Sink,B_Towel
1,3.0,3.0,3.0,,,,,1.0,1.0,1.0,1.0,1.0
2,1.0,3.0,3.0,,1.0,,1.0,0.0,0.0,0.0,0.0,1.0
3,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0
4,2.0,1.0,1.0,0.0,1.0,1.0,,,,1.0,,1.0


In [8]:
pd.crosstab(data.iloc[:,10], df.iloc[:,10])

How accessible are bathrooms at this university? [Bathroom on the same floor with no stairs/steps] \ B_Floor,0.0,1.0,2.0,3.0,4.0
All of the bathrooms,0,0,0,0,8
Half of the bathrooms,0,0,6,0,0
Hardly ever,1,0,0,0,0
In some cases,0,6,0,0,0
Most of the bathrooms,0,0,0,6,0


## Does Q2 form one factor or many?
* With two factors we explain about 76% of the variance, with one only 54%.
* Only two items (building and floor access) correlate with answers to question 1.
* The first factor actually loads negatively.
* This means people did not consider the remaining dimensions of accessibility.
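
One way to decide how many factors to keep is to look at the cumulative share of variance explained. The sketch below illustrates the idea on synthetic ratings, not on the survey data; the 0.75 threshold and the names `ratings`, `pca_full`, `cumulative` and `n_needed` are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the seven Q2 items (30 respondents, ratings 0-4); not the survey data
rng = np.random.default_rng(0)
ratings = rng.integers(0, 5, size=(30, 7)).astype(float)

# Fit all seven components so the full variance profile is visible
pca_full = PCA().fit(ratings)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the threshold
threshold = 0.75  # arbitrary cut-off, chosen for illustration
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print(cumulative.round(3), n_needed)
```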

In [9]:
pd.set_option('display.float_format', '{:.4f}'.format)

In [10]:
from sklearn.decomposition import PCA
x = df.iloc[:,3:10]
pca_model = PCA(n_components=2)  # about 76% of variance with just two components, 87% with three
pca_model.fit(x)
print(pca_model.explained_variance_ratio_)
print("Total variance explained is " + str(pca_model.explained_variance_ratio_.sum()))

[0.5412446  0.21579027]
Total variance explained is 0.7570348767200109


In [11]:
# Loadings
pd.DataFrame(pca_model.components_.transpose(), index = x.columns).sort_values(0)

Unnamed: 0,0,1
R_Signs,-0.4372,-0.4517
R_Floor,-0.4268,0.3614
R_Furniture,-0.4154,0.1172
R_Bathrooms,-0.4027,-0.1897
R_Classrooms,-0.3674,0.224
R_Building,-0.3074,0.4519
R_Parking,-0.25,-0.601


In [12]:
# Calculate PCA values per observation (projection onto the components, without centering)
df_pca = pd.DataFrame(pca_model.components_.transpose(), index = x.columns)
result = []
for i in range(2):
    result.append(x.dot(df_pca.iloc[:,i]))
df_pca_R = pd.DataFrame(result).T
df_pca_R.columns = ['R_PCA0','R_PCA1']
df_pca_R.head()
# df_pca_R.join(df.iloc[:,3:10]).corr()[(df_pca_R>0.4) | (df_pca_R<-0.4)]

Unnamed: 0,R_PCA0,R_PCA1
0,-6.5414,-0.2908
1,-7.3479,-1.4341
2,-6.0684,0.9399
3,-4.0284,-0.3275
4,-4.4428,1.7626
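
The values above come from multiplying the raw ratings by the component vectors. scikit-learn's `PCA.transform` centers the data first, so its scores differ from these by a constant per component. A constant shift does not change a Pearson correlation, so either version can be used for the correlations in this notebook. A minimal sketch on synthetic data (the names `items`, `model`, `raw_scores`, `centered_scores` and `offset` are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for the Q2 items; not the survey data
rng = np.random.default_rng(1)
items = pd.DataFrame(rng.integers(0, 5, size=(30, 7)).astype(float),
                     columns=[f'item_{i}' for i in range(7)])

model = PCA(n_components=2).fit(items)

# Projection without centering, as done with x.dot(...)
raw_scores = items.to_numpy() @ model.components_.T
# Projection as computed by scikit-learn: the column means are subtracted first
centered_scores = model.transform(items)

# The two versions differ only by a constant per component ...
offset = raw_scores - centered_scores
print(np.allclose(offset, offset[0]))
# ... and that constant is the projection of the column means
print(np.allclose(offset[0], model.mean_ @ model.components_.T))
```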


## How does Q2 correlate with Q1?
* The mean of the Q2 items correlates with Q1 at 0.71.
* That correlation comes only from PCA0, i.e. the main principal component, not from PCA1.

In [13]:
# Calculate the average of the seven Q2 items
df['mean_q2'] = df.iloc[:, 3:10].sum(axis=1) / 7
# Correlation with the overall rating (Q1)
df[['mean_q2', 'Overall']].dropna().corr().iloc[0,1]

0.7130693166594904

In [14]:
# Correlations between Overall and pca vectors
df_pca_R.join(df[['Overall']]).dropna().corr().iloc[2,:2]

R_PCA0   -0.7171
R_PCA1    0.0674
Name: Overall, dtype: float64
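
The negative sign of the correlation with `R_PCA0` should be read with care: the sign of a principal component is arbitrary, because multiplying a component by -1 describes the same axis. A minimal sketch on synthetic data (all names and numbers are made up) showing that flipping the sign flips the correlation but leaves its strength unchanged:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: five items driven by one shared trait plus noise; not the survey data
rng = np.random.default_rng(2)
trait = rng.normal(size=200)
items = trait[:, None] + 0.5 * rng.normal(size=(200, 5))
overall = trait + 0.5 * rng.normal(size=200)

model = PCA(n_components=1).fit(items)
scores = model.transform(items)[:, 0]
flipped = -scores  # the same component with the opposite sign

r = np.corrcoef(scores, overall)[0, 1]
r_flipped = np.corrcoef(flipped, overall)[0, 1]

# Same strength of association, opposite sign; the variance of the scores is identical
print(round(r, 3), round(r_flipped, 3))
print(np.isclose(scores.var(), flipped.var()))
```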

## What new information do we get from Q3?
* Respondents did not treat bathroom access as important for the overall rating: it loaded into the negative factor.
* So, what are bathrooms like? We need a detailed question.
* There are three PCA vectors. The first correlates most strongly with the overall rating of bathrooms (0.45), the third more weakly and negatively (-0.35), and the second hardly at all (0.03).
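
The bathroom items contain "Don't know" answers, recoded as missing values, so any analysis that needs complete rows (such as PCA after `dropna()`) works on fewer respondents. A minimal sketch of how to see how many rows are lost and which items cause it; the values are made up and only the column names are borrowed from `df`:

```python
import numpy as np
import pandas as pd

# Made-up recoded answers; NaN stands for "Don't know"
answers = pd.DataFrame({
    'B_Floor': [3.0, 1.0, 1.0, 2.0],
    'B_DoorWide': [np.nan, np.nan, 1.0, 0.0],
    'B_Paper': [np.nan, 1.0, 0.0, np.nan],
})

complete = answers.dropna()               # rows with an answer on every item
n_dropped = len(answers) - len(complete)  # respondents lost to missing answers
missing_per_item = answers.isna().sum()   # which items cause the loss

print(len(complete), n_dropped)
print(missing_per_item.to_dict())
```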

In [15]:
from sklearn.decomposition import PCA
x = df.iloc[:,10:22].dropna()  # PCA needs complete rows, so respondents with a "Don't know" are left out
pca_model = PCA(n_components=3)  # about 76% of variance with just two components, 86% with three
pca_model.fit(x)
print(pca_model.explained_variance_ratio_)
print("Total variance explained is " + str(pca_model.explained_variance_ratio_.sum()))

[0.61380861 0.14562477 0.09641629]
Total variance explained is 0.8558496678254084


In [16]:
# Loadings
pd.DataFrame(pca_model.components_.transpose(), index = x.columns).sort_values(0)

Unnamed: 0,0,1,2
B_Sink,0.2282,0.4698,0.3483
B_Miror,0.2344,-0.1598,-0.1675
B_BarsToilet,0.2434,0.0392,-0.3034
B_BarsSink,0.2591,0.0402,-0.2732
B_Towel,0.2604,0.4414,0.0113
B_Toilet,0.2625,0.2182,-0.3739
B_DoorWide,0.2656,-0.0639,-0.0262
B_Find,0.2788,-0.2929,0.287
B_DoorEasy,0.297,-0.1957,-0.4833
B_Floor,0.3036,-0.444,0.3644


In [17]:
# Calculate PCA values per observation (projection onto the components, without centering)
df_pca = pd.DataFrame(pca_model.components_.transpose(), index = x.columns)
result = []
for i in range(3):
    result.append(x.dot(df_pca.iloc[:,i]))
df_pca_B = pd.DataFrame(result).T
df_pca_B.columns = ['B_PCA0','B_PCA1', 'B_PCA2']

In [18]:
# Obtain correlations
df_pca_B.join(df[['R_Bathrooms']]).dropna().corr().iloc[3,:3]

B_PCA0    0.4459
B_PCA1    0.0262
B_PCA2   -0.3514
Name: R_Bathrooms, dtype: float64
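
Because the bathroom scores were computed only on the rows that survived `dropna()`, `df_pca_B` has fewer rows than `df`. `DataFrame.join` matches rows on the index, so each score is paired with the rating of the same respondent. A minimal sketch with made-up data and hypothetical names (`items`, `overall`, `scores`, `joined`):

```python
import numpy as np
import pandas as pd

# Made-up data: respondent 1 has a missing item and is dropped before scoring
items = pd.DataFrame({'a': [1.0, np.nan, 3.0, 0.0], 'b': [2.0, 1.0, 4.0, 1.0]})
overall = pd.DataFrame({'rating': [2, 3, 4, 1]})

complete = items.dropna()                        # keeps index labels 0, 2, 3
scores = complete.sum(axis=1).to_frame('score')  # hypothetical per-respondent score

# join() aligns on the index, so each score stays with its own respondent
joined = scores.join(overall)
print(joined.index.tolist(), joined['rating'].tolist())
```

Resetting the index after `dropna()` would break this alignment, so the original index is worth keeping until the join is done.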