<a href="https://colab.research.google.com/github/ped4416/Research-Methods-Workshop/blob/main/PracticalSessionPart1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Research Methods Talk 16th Dec 2020
### Practical session Part 1
First we need to import our data from a .csv file.

*   We will use [pandas](https://pandas.pydata.org/) to do this.
*   We are using Colab (short for Colaboratory) to access our data and run some statistical tests on that data! 



In [None]:
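#load our dependencies
#the __future__ import must be the first statement in the cell, otherwise Python raises a SyntaxError
from __future__ import print_function
from google.colab import files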
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

print("pd v = {}\nnp v = {}".format(pd.__version__, np.__version__))

In [None]:
#we will upload from our local drive 
#run this cell and select the file from your computer...
from google.colab import files
uploaded = files.upload()

In [None]:
#now we can load the cat_dog.csv into a pd DataFrame to view 
import io
df = pd.read_csv(io.BytesIO(uploaded['cat_dog.csv']))

#let's view our data
df

In [None]:
#remove the Timestamp as it is not really needed now
del df['Timestamp']
#the column headers are a little long - let's update them
df.columns = ["gender", "age", "cats", "dogs"]
#add an ID number 
#df.insert(loc=0, column='id', value=np.arange(len(df)))
df.insert(loc=0, column='id', value=df.index + 1)

#let's view our data again - it should be a bit neater
df

In [None]:
#print a few basic statistics
#what variable is missing? 
df.describe()

In [None]:
#we can also use head and tail to print the first or last rows - pass the number of rows as an argument
#this can be useful for larger data sets
print("df head = \n{}\ndf tail = \n{}".format(df.head(3), df.tail(3)))

In [None]:
#descriptive stats - how many males vs females are there?

#calculate count
counts = df["gender"].value_counts()
#calculate each gender's proportion (a fraction between 0 and 1)
percent = df["gender"].value_counts(normalize=True)
#convert the proportion to a percentage string with a % sign
percent100 = df["gender"].value_counts(normalize=True).mul(100).round(1).astype(str) + "%"
#create a new dataframe to view the data
df_gender = pd.DataFrame({"gender_count" : counts, "percentage" : percent100})
print(df_gender)

# Run a paired-samples t-test

Do we have a prediction based on our sample? 
Will there be a preference for dogs or cats?

Remember we have assumptions... 

1.   A continuous variable (we assume yes - our preference scores)
2.   Two related groups (yes same participants)
3.   No significant outliers in the differences between the two related groups ([Test with boxplots - python tutorial](https://statinfer.com/104-3-5-box-plots-and-outlier-dectection-using-python))
4.   The distribution of these differences should be approximately normally distributed 

However, the paired-samples t-test is considered "robust" to violations of normality. This means that violations of this assumption can be somewhat tolerated.

See [this link](https://statistics.laerd.com/premium/spss/pstt/paired-samples-t-test-in-spss-7.php) for a full description of these assumptions. 
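
As a numeric complement to the boxplot check in assumption 3, here is a minimal sketch of the 1.5 × IQR rule, which is the default whisker rule used by matplotlib's `boxplot`. The difference scores are made up for illustration and are not the workshop data; `diffs_demo` is a hypothetical name.

```python
import numpy as np

# Hypothetical difference scores (dogs - cats), for illustration only
diffs_demo = np.array([1, 1, 2, 0, -1, 1, 2, 9])

# Quartiles and interquartile range of the differences
q1, q3 = np.percentile(diffs_demo, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are drawn as outliers on a boxplot
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = diffs_demo[(diffs_demo < lower) | (diffs_demo > upper)]

print("bounds = ({}, {})".format(lower, upper))
print("outliers = {}".format(outliers))
```

Whether a flagged point is a "significant" outlier is still a judgement call; the rule only tells you which points to look at.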

The hypothesis being tested is:

* Null hypothesis (H0): the population mean difference between the paired values is equal to zero
* Alternative hypothesis (H1): the population mean difference between the paired values is not equal to zero
* If the p-value is less than .05, we can reject the null hypothesis.
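
To make the hypothesis concrete, the sketch below computes the paired-samples t statistic by hand as the mean difference divided by its standard error, then compares it with `scipy.stats.ttest_rel`. The scores are hypothetical values chosen for illustration, not the workshop data, and the `_demo` and `_manual` names are assumptions.

```python
import numpy as np
from scipy import stats

# Hypothetical preference scores from the same 8 participants (illustration only)
cats_demo = np.array([3, 4, 2, 5, 3, 4, 2, 3])
dogs_demo = np.array([4, 5, 4, 5, 2, 5, 4, 4])

# The paired t-test is a one-sample t-test on the differences
diffs_demo = dogs_demo - cats_demo
n = len(diffs_demo)

# t = mean difference / (sample standard deviation / sqrt(n))
t_manual = diffs_demo.mean() / (diffs_demo.std(ddof=1) / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# scipy should give the same answer
res = stats.ttest_rel(dogs_demo, cats_demo)
print("manual: t = {:.3f}, p = {:.3f}".format(t_manual, p_manual))
print("scipy:  t = {:.3f}, p = {:.3f}".format(res.statistic, res.pvalue))
```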

In [None]:
#first let's look at some descriptives on our two key variables again
#look at the mean value - this should provide a clue
df[['dogs','cats']].describe()

In [None]:
#Let's look for some outliers...
#We need to calculate a difference score
#simply subtract the cats score from the dogs score
diffs = df['dogs'] - df['cats']
print("Difference values\n{}".format(diffs))
%matplotlib inline 

plt.title("diff box plot")
plt.boxplot(diffs)
plt.tight_layout()

In [None]:
#Now to test that the data came from a normal distribution
#The Shapiro-Wilk test is recommended if you have small sample sizes (< 50 participants) 
stats.shapiro(diffs)
#The first value is the W test statistic, and the second value is the p-value.
#If the assumption of normality is met, the p-value should be greater than .05 (i.e., p > .05).

In [None]:
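#Finally let's run the t-test, but only if we can say all assumptions are met!
#note: the arguments are (cats, dogs), so a negative t value means dogs scored higher on average
stats.ttest_rel(df['cats'], df['dogs'])
#if the p-value < .05 the test is significant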
#if it is significant and you have many questions you could run further 
#t-tests to see which questions are driving the effect

# Run a Wilcoxon signed-rank test
The Wilcoxon signed-rank test is the non-parametric alternative to the paired-samples (dependent) t-test.

Do we have a prediction based on our sample? 
Will there be a preference for dogs or cats?

Remember we have assumptions... 

1.   A continuous variable (we assume yes - our preference scores)
2.   Two related groups (yes same participants)
3.   The distribution of the differences between groups should be approximately symmetrical in shape


See [this link](https://statistics.laerd.com/premium/spss/wsrt/wilcoxon-signed-rank-test-in-spss-3.php) for a full description of these assumptions.

The hypothesis being tested is:

* Null hypothesis (H0): The difference between the pairs follows a symmetric distribution around zero.
* Alternative hypothesis (H1): The difference between the pairs does not follow a symmetric distribution around zero.
* If the p-value is less than .05, we can reject the null hypothesis.
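
To show what the test actually computes, here is a minimal sketch that builds the signed-rank statistic by hand and compares it with `scipy.stats.wilcoxon`. The scores are hypothetical and were chosen so that there are no zero differences and no tied ranks; real Likert data will usually have both, and scipy then handles them according to its `zero_method` setting.

```python
import numpy as np
from scipy import stats

# Hypothetical paired scores (illustration only, not the workshop data)
cats_demo = np.array([3, 4, 2, 9, 3, 4, 1, 2])
dogs_demo = np.array([4, 6, 5, 5, 8, 10, 8, 10])

# Step 1: difference scores for each participant
diffs_demo = dogs_demo - cats_demo

# Step 2: rank the absolute differences (smallest = rank 1)
ranks = stats.rankdata(np.abs(diffs_demo))

# Step 3: sum the ranks separately for positive and negative differences
w_plus = ranks[diffs_demo > 0].sum()
w_minus = ranks[diffs_demo < 0].sum()

# Step 4: for a two-sided test the statistic is the smaller of the two sums
w_manual = min(w_plus, w_minus)

res = stats.wilcoxon(dogs_demo, cats_demo)
print("W+ = {}, W- = {}".format(w_plus, w_minus))
print("manual W = {}, scipy W = {}, p = {:.4f}".format(w_manual, res.statistic, res.pvalue))
```

If the differences were symmetric around zero, the two rank sums would be similar; a very small statistic means one direction dominates.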

In [None]:
#So let's check the assumption that the distribution of the differences between groups is approximately symmetrical in shape
#We can use a histogram to check this.
#Looking at the histogram, you need to judge whether the distribution of difference scores looks symmetrical.
#plt.hist is a wrapper around the matplotlib.axes.Axes.hist() method

plt.hist(diffs, bins = 5)
plt.title("differences - symmetry")
plt.show()

In [None]:
#so let's run the Wilcoxon signed-rank test
stats.wilcoxon(df['cats'], df['dogs'])
#if the p-value < .05 the test is significant
#if it is significant and you have many questions you could run further
#tests to see which questions are driving the effect

When reporting the results of non-parametric tests it is usual to report medians rather than means. But with Likert scales there is some debate over median vs mean... 

Consider reporting the mean if your Likert scores are approximately normally distributed, and the median if they are skewed.
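
One rough way to inform that choice is to look at the sample skewness alongside the mean and median. The sketch below uses made-up scores, not the workshop data, and is only an aid to judgement: it assumes no particular cut-off for "too skewed".

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed Likert-style scores (illustration only)
scores_demo = np.array([1, 1, 1, 1, 2, 2, 2, 3, 4, 5])

mean_demo = scores_demo.mean()
median_demo = np.median(scores_demo)

# Skewness near 0 suggests a symmetric distribution;
# a positive value means a longer tail towards high scores
skew_demo = stats.skew(scores_demo)

print("mean = {:.2f}, median = {:.2f}, skew = {:.2f}".format(
    mean_demo, median_demo, skew_demo))
```

Here the long tail of high scores pulls the mean above the median, which is the situation where the median is often the more representative summary.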

In [None]:
#median and mean for our two outcomes
mean_dogs = df['dogs'].mean()
mean_cats = df['cats'].mean()
med_dogs = df['dogs'].median()
med_cats = df['cats'].median()

#build a dictionary of lists: one row for the means, one for the medians
data = {'average':['mean', 'median'],
        'dogs':[mean_dogs, med_dogs], 
        'cats':[mean_cats, med_cats]} 
  
#pandas DataFrame of our data
df_averages = pd.DataFrame(data)
df_averages