<a href="https://colab.research.google.com/github/ped4416/Research-Methods-Workshop/blob/main/PracticalSessionPart2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Research Methods Talk 16th Dec 2020
## Introduction to statistics for questionnaires
### Practical session Part 2

* Taken from Chapter 12 in the **Computer Booklet.pdf** in your resources folder
* Thanks to Mike Griffiths for this content. If you have SPSS, you can work through much more of the excellent **Core Quantitative Methods** material in your own time. This booklet summarises many statistical methods that may be useful for your research.

First we need to import our data as a .csv file.

*   We will use [pandas](https://pandas.pydata.org/) to do this.
*   We are using Colab (short for Colaboratory) to access our data and run some statistical tests on that data! 



In [None]:
#load our dependencies 
from google.colab import files
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

print("pd v = {}\nnp v = {}".format(pd.__version__, np.__version__))

In [None]:
#we will upload from our local drive 
#run this cell and select the file from your computer...
from google.colab import files
uploaded = files.upload()

# Introduction to the data
The data file contains the results of an imaginary questionnaire, given to 60 participants.
The data file contains the sort of information you might collect from a simple questionnaire:
* participant numbers (Part). It is good practice to give each participant a
number, and to write the same number on each questionnaire. This
allows you to check back to the questionnaire if you need to.
* demographic information (Gender and Age)
* participants’ responses to six questions (Q1 to Q6). These are related to
job satisfaction (e.g. “I like the work I do”). Responses are on a Likert
scale running from 1 (strongly disagree) to 7 (strongly agree).


In [None]:
#now we can load the data into a pd DataFrame to view 
import io
df = pd.read_csv(io.BytesIO(uploaded['Questionnaire_Data_v1.csv']))

#let's view our data
print(df.head(3))
df.info() #missing data causes some columns to show the "object" Dtype

# Missing data
There are a few data points missing.
We will ignore these participants as we want to look at the overall questions in this example.

```
Use df.dropna() to drop rows with NaN from a pandas DataFrame
```
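
As an alternative to replacing one specific blank string, here is a minimal sketch that converts anything non-numeric to NaN before dropping incomplete rows. The `toy` DataFrame and its values are made up for illustration and are not the workshop data; `errors="coerce"` is a standard option of `pd.to_numeric`.

```python
import pandas as pd

# Hypothetical toy data: blank answers were read in as a single space.
toy = pd.DataFrame({
    "Part": [1, 2, 3],
    "Q1": ["5", " ", "4"],
    "Q5": ["6", "7", " "],
})

# errors="coerce" turns any value that cannot be parsed as a number into NaN.
toy_numeric = toy.apply(pd.to_numeric, errors="coerce")
print(toy_numeric.isnull().sum())

# Keep only the participants who answered every question.
toy_clean = toy_numeric.dropna(axis="index")
print(toy_clean)
```

This approach also catches blanks of other lengths, but it silently turns any mistyped answer into NaN, so it is worth checking the null counts before dropping rows.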

In [None]:
#We have 2 missing values. They are stored as blank strings (" "), which pandas doesn't recognise as null.
#Let's replace the blank cells with NaN.
print(df.shape) #shows 60 rows
df.replace(" ", np.nan, inplace=True) #replace blank values
print(df.isnull().sum()) #note that Q1 and Q5 now each show a missing value
df = df.dropna(axis='rows') #drop the rows that contain a null value
df = df.apply(pd.to_numeric) #ensure all data is numeric for analysis
print(df.shape) #shows 58 rows
df.info()


In [None]:
#print a few basic statistics
df.describe()

In [None]:
#descriptive stats - how many males vs females are there?
#male = 1 female = 2 in this case
#calculate count
counts = df["Gender"].value_counts()
#calculate each group's proportion (a number between 0 and 1)
percent = df["Gender"].value_counts(normalize=True)
#calculate the percentage, rounded to 1 decimal place, with a % sign
percent100 = df["Gender"].value_counts(normalize=True).mul(100).round(1).astype(str) + "%"
#create a new dataframe to view the data
df_gender = pd.DataFrame({"gender_count" : counts, "percentage" : percent100})
print(df_gender)

# Calculating overall scores on a questionnaire

* If the questions constitute a scale, you will need to add up (or average) the
answers to different questions. For example, if the questions are all about job satisfaction (as in the example), we might have to add them up to get a total score for job satisfaction. There may be just one total, or sometimes different questions have to be added up to make sub-scales.

* If using a published questionnaire, make sure you find the instructions. These may be in a manual or in a journal article. They will give important information on how to calculate the overall score (e.g. whether it is a total or an average, whether there are subscales, whether there are any reverse-scored questions, whether certain questions need to be ignored for any reason). The same source will also tell you who the questionnaire was tested on (the so-called norm group) and what their mean score was; you may wish to compare your own participants’ mean against the norm group.
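
As a rough sketch of the two options (total or average), pandas can add up or average a list of question columns row by row. The `toy` data and the `items` list are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical answers from three participants to three questions.
toy = pd.DataFrame({
    "Q1": [6, 2, 4],
    "Q2": [5, 3, 4],
    "Q3": [7, 1, 4],
})
items = ["Q1", "Q2", "Q3"]  # the questions that make up the scale

# axis=1 works across the columns, giving one value per participant.
total = toy[items].sum(axis=1)
mean = toy[items].mean(axis=1)
print(total.tolist())  # [18, 6, 12]
print(mean.tolist())   # [6.0, 2.0, 4.0]
```

Both methods skip missing values by default, so incomplete rows should be dropped or handled first; otherwise a participant's total will be too low without any warning.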

# Reverse-scored questions: what they are
* We will see how to add up scores in a minute, but first you may have to deal with reverse-scored questions.

* If the questions are about how happy people are in their job, a typical question might ask people how much they agree with the statement “I enjoy coming to work.” Obviously the more people agree with this statement, the happier they are in their job. The people who most strongly agree with the statement get the highest score.

* However there might be some questions such as “If I could give up work
tomorrow, I would.” In this case, the more people agree with the statement, the less they are happy in their job. If the questionnaire has been devised and published by someone else, they should have made it clear if there are any such questions.

# Reverse-scored questions: how to deal with them
The score needs to be amended so that low scores are changed into high
scores, and vice versa.

To keep an audit trail and prevent mistakes, it is advisable to keep the old variable as it is and to create a new variable with a new name, such as Q3rev for a reversed version of Q3.

The calculation to reverse the scores is quite simple: as this is a 7-point Likert scale, the reversed score is 8 minus the original score (8 - Q3).

| Participant’s score | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Reverse score (subtract from 8) | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
| Total of score and reverse score | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
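
The same rule works for a scale of any length: subtract each answer from the lowest possible score plus the highest. Here is a small sketch; the `reverse_score` helper and its `low` and `high` parameters are illustrative names, not part of the booklet.

```python
import pandas as pd

def reverse_score(scores, low=1, high=7):
    """Reverse-score answers on a scale running from `low` to `high`."""
    return (low + high) - scores

# Every possible answer on the 7-point scale.
q3 = pd.Series([1, 2, 3, 4, 5, 6, 7])
q3_rev = reverse_score(q3)
print(q3_rev.tolist())  # [7, 6, 5, 4, 3, 2, 1]

# The same helper on a 5-point scale: subtract from 6 instead of 8.
print(reverse_score(pd.Series([1, 5]), low=1, high=5).tolist())  # [5, 1]
```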

In [None]:
#okay now to create a new Q3_rev column and position it next to the original Q3
#look up where Q3 sits and insert the new column immediately after it
Q3_rev = 8 - df["Q3"]
print(type(Q3_rev))
df.insert(df.columns.get_loc("Q3") + 1, 'Q3_rev', Q3_rev)
df.head()

In [None]:
#Add up our scores: ensure you use Q3_rev
#create a new pd series summing all 6 questions 
JobSatTotal = df["Q1"] + df["Q2"] + df["Q3_rev"] + df["Q4"] + df["Q5"] + df["Q6"]

#calculate the mean score 
JobSatMean = JobSatTotal/6

#check the scores: 
#For the example file, the correct first few means are 5.33, 2.33 and 3.50.
JobSatMean.head(3)

# Checking for problematic questions
When you have given your questionnaire to a sample of people, you can check
for questions which might cause a problem. There are three ways in which
questions might stand out as being problematic. The first two are matters of
opinion – their correlations and standard deviations. The third is to look at how they affect Cronbach’s alpha.

* Firstly, you can look at the correlations between questions

In [None]:
#Pearson product-moment correlation between Q1 and Q2
#pearsonr returns two values: the correlation coefficient and its p-value
corrQ1_Q2, p_value = stats.pearsonr(df["Q1"], df["Q2"])
print('Pearsons correlation: %.3f, p-value: %.3g' % (corrQ1_Q2, p_value))


* For example, this shows that the correlation coefficient between
Q1 and Q2 is .592, and that this is highly significant (p < .001).

* If one or more of the questions has a particularly low correlation with the others, it suggests that particular question is not getting at the same concept as the other questions. (You might think it is, but perhaps your participants are interpreting the question differently from you.) Read over the question and consider eliminating it from the analysis. Or, if some questions correlate with each other but not with the rest, it may indicate that your questions are tapping into two or more sub-scales, and you might want to consider a factor analysis.

* If one of the questions is negatively correlated with the others, it would appear that you forgot to reverse-score it; or that your respondents are interpreting the question very differently from the way you expected. If the latter, you may need to eliminate it.
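
To check every pair of questions at once rather than one pair at a time, a correlation matrix can help. The sketch below uses made-up data in which Q3 deliberately runs against the other questions; the cut-off of 0 for flagging a question is an assumption for illustration, not a rule from the booklet.

```python
import pandas as pd

# Hypothetical answers; Q3 has not been reverse-scored.
toy = pd.DataFrame({
    "Q1": [1, 2, 3, 4, 5, 6, 7],
    "Q2": [2, 1, 4, 3, 6, 5, 7],
    "Q3": [7, 6, 5, 4, 3, 2, 1],
    "Q4": [1, 3, 2, 4, 5, 7, 6],
})

# Pearson correlation between every pair of questions.
corr = toy.corr()
print(corr.round(2))

# Average correlation of each question with the others.
# Subtracting 1 removes the question's correlation with itself.
n_items = corr.shape[0]
mean_with_others = (corr.sum(axis=0) - 1) / (n_items - 1)
print(mean_with_others.round(2))

# Questions that are negatively related to the rest on average.
flagged = mean_with_others[mean_with_others < 0].index.tolist()
print(flagged)
```

A flagged question is only a prompt to re-read it and decide whether it needs reverse-scoring or removing.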

Another thing you could consider looking at is the standard deviations. If one of the questions has a much smaller standard deviation than the others, it appears that there is very little difference between participants as to how they answer that question, so perhaps it is not telling you anything. Consider whether it is worth keeping.

Of course, if you discard questions for any reason this is an important part of your findings and should be reported in your Results section.

In [None]:
#let's review the standard deviations
df_final = df[["Q1","Q2","Q3_rev","Q4","Q5","Q6"]].copy()
df_final.describe()
#none appear very small so we can consider keeping them all in 

# Cronbach’s alpha: how to calculate it
Cronbach’s alpha is a measure of how much the questions measure the same
thing (i.e. how much the questions as a whole correlate with each other). Again, if you have any reverse-scored questions, the reversed versions are the ones to use.

We will use a function - see this link for a far more detailed tutorial:
https://towardsdatascience.com/cronbachs-alpha-theory-and-application-in-python-d2915dd63586 
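
For reference, the `cronbach_alpha` function in the next code cell computes alpha from the correlations between questions (often called the standardised form):

$$\alpha = \frac{N \, \bar{r}}{1 + (N - 1) \, \bar{r}}$$

Here $N$ is the number of questions and $\bar{r}$ is the mean of the correlations between each pair of different questions.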




In [None]:
#Cronbach’s alpha for Q1, Q2, Q3_rev, Q4, Q5 and Q6.
#use our df_final with this function:
def cronbach_alpha(df):
    # 1. Transform the df into a correlation matrix
    df_corr = df.corr()

    # 2.1 Calculate N
    # The number of variables equals the number of columns in the df
    N = df.shape[1]

    # 2.2 Calculate R
    # For this, we'll loop through the columns and collect every
    # correlation below the diagonal in an array called "rs". Then, we'll
    # calculate the mean of "rs"
    rs = np.array([])
    for i, col in enumerate(df_corr.columns):
        below_diagonal = df_corr[col][i+1:].values
        rs = np.append(below_diagonal, rs)
    mean_r = np.mean(rs)

    # 3. Use the formula to calculate Cronbach's Alpha
    alpha = (N * mean_r) / (1 + (N - 1) * mean_r)
    return alpha


cronbach_alpha(df_final) #the result is looking good. See info below. 

* There is no hard-and-fast rule on what is an acceptable level for Cronbach’s
alpha. Most people would find a figure of above .8 good, and a figure above .7
acceptable. Some people (but not everyone) consider that it is possible for alpha to be too high, and say that if it is above .9 then the scale contains too many items that are just the same as each other, and is wasteful.

* If Cronbach’s alpha is too low, you might consider excluding problematic
questions (see sections 12.4.1 and 12.4.3). Or it might be appropriate to carry
out a factor analysis to create two or more separate scales; each of these is likely to have a higher Cronbach’s alpha than the overall scale.

In our case we have an alpha of 0.87, which suggests that the questions are all measuring the same thing for our research.
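
One way to see how each question affects alpha is to recompute it with each question left out in turn. The sketch below is illustrative only: the helper names and the `toy` data are made up, and it uses the variance-based form of alpha, so its values can differ slightly from the correlation-based function above.

```python
import pandas as pd

def alpha_from_variances(items):
    """Cronbach's alpha from the item variances and the variance of the total score."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def alpha_if_item_deleted(items):
    """Recompute alpha with each question left out in turn."""
    return pd.Series(
        {col: alpha_from_variances(items.drop(columns=col)) for col in items.columns}
    )

# Hypothetical answers; Q3 runs against the other questions.
toy = pd.DataFrame({
    "Q1": [1, 2, 3, 4, 5, 6, 7],
    "Q2": [2, 1, 4, 3, 6, 5, 7],
    "Q3": [7, 6, 5, 4, 3, 2, 1],
    "Q4": [1, 3, 2, 4, 5, 7, 6],
})

print(round(alpha_from_variances(toy), 3))
print(alpha_if_item_deleted(toy).round(3))
```

If alpha rises noticeably when a question is left out, that question is a candidate for closer inspection; any question you do exclude should still be reported.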



END