# Welcome to the biostatistics tutorial!

In this tutorial we will use the Python language to analyze some biological datasets and perform statistical tests. The tutorial was created as a Jupyter notebook and can be run as a live notebook, either in the cloud or locally on your computer after downloading it.

**Important:** You don't have to know how to code in Python to execute the notebook! 

Just focus on concepts and basic ideas behind the statistical analysis, not on the code (you only have to follow the instructions to execute the cells as you scroll through the notebook).



# Outline of the Tutorial

We will analyze 3 datasets and perform different statistical tests:

1. Association between genotype and a binary phenotype
2. Association between genotype and a continuous phenotype
3. Association between Sickle Cell Disease and ovarian reserve

For each example, the following steps are taken:

-  State biological hypothesis
-  State the number and types of variables
-  Determine the preferred statistical test and null hypothesis
-  Check if data meet the assumptions of the preferred statistical test
-  Decide what statistical test to use
-  Run the statistical test
-  Interpret the results of the statistical test
-  Display the data and statistical results in a figure

You do not need to install Python or perform any of the analyses in the tutorial in order to learn from the examples!

# Source of datasets

- Dataset 1 contains random data generated specifically for this tutorial;
- Dataset 2 is taken from: Pollard et al. (2019), "Empowering Statistical Methods for Cellular and Molecular Biologists", MBoC Vol. 30, No. 12 https://www.molbiolcell.org/doi/full/10.1091/mbc.E15-02-0076
- Dataset 3 (ovarian reserve data) is derived from: Kopeika et al. (2019), "Ovarian reserve in women with sickle cell disease", PLOS ONE https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0213024&type=printable

If you wish, you can download the datasets (provided as Excel files) from here:
https://github.com/matteofigliuzzi/biostatistics_tutorial/tree/main/data

# The computational tools

## What is Python?

Python is a high-level, interpreted, general-purpose programming language. It provides a wide range of tools for biostatistical analysis. A simple tutorial focusing on the essentials you need to start programming with Python can be found here:
https://realpython.com/python-first-steps/

## What is a Jupyter notebook?

A Jupyter Notebook is an open source application that you can use to create and share documents that contain live code in different languages (such as Python or R), analysis, visualizations, and text. 

In a Jupyter notebook, code blocks appear in grey boxes (see just below), and the output from running the code (including text, results of calculations and plots) appears just after each code block.

To execute the code, click on the cell and then use the "Run" button in the top menu ($\blacktriangleright$), or alternatively press Shift + Enter (press Enter while holding down Shift).

In [None]:
print('this is the output from some python code')

In [None]:
# These first 2 lines are comments in Python, as they start with the pound sign "#"
# print('this is just a comment, it is not run')

print(3+5) # in this line the comment starts after the print command

You can create new blocks or, alternatively, you can edit the code of existing blocks:



In [None]:
# Try to edit the content below to change the output of the cell

a=22
b=5
print(a+b)

c='this is a string'
print(c)

To update the output, you have to execute the code again, clicking on the cell and then using the "Run" button in the top menu ($\blacktriangleright$) or pressing shift + enter. 

If you run the following cell you will get an error.

In [None]:
x = 5
y = 2

##option A
#z = x+y

##option B
#z = x-y

print(z)

**Question:** What happened? Why did we get an error?

In the previous cell, try to uncomment (remove the '#' symbol from) the line below option A or option B, and execute the cell again!

If you are interested, you can find more information on Jupyter notebooks here: https://realpython.com/jupyter-notebook-introduction/


## How to run this Jupyter notebook in the cloud?

You don't have to do anything but copy and paste the following URL into your web browser:

https://mybinder.org/v2/gh/matteofigliuzzi/biostatistics_tutorial/main?labpath=notebooks%2FTutorial_HypothesisTesting_Jupyter_python.ipynb

and an interactive notebook session will be opened in your browser.

In case you are curious how it works: I have pre-built a Binder repository associated with the following GitHub repo: https://github.com/matteofigliuzzi/biostatistics_tutorial . Binder is a service provided by the Binder Project. It allows you to input the URL of any public Git repository, and it will open that repository within the native Jupyter Notebook interface. You can run any notebook in the repository, though any changes you make will not be saved back to the repository. The repository must include a configuration file that specifies its package requirements; Binder uses this file to build a Docker image in which all the configuration and dependencies needed to run the notebook are satisfied.

## How to run this notebook locally on your computer?

To run the notebook on your computer, you have to download it from here: https://github.com/matteofigliuzzi/biostatistics_tutorial and make sure that you have Python and Jupyter notebook installed (and all dependencies satisfied).

### Installing Python
Installing Python is generally easy, and nowadays most Linux and UNIX distributions include a recent Python. Even some Windows computers now come with Python already installed. If you do need to install Python, follow the instructions on this page: https://wiki.python.org/moin/BeginnersGuide/Download


### Downloading and using Jupyter
To install Jupyter on your computer, follow the instructions on this page: https://jupyter.org/install


## Importing the libraries

In Python, libraries, packages and modules are files (or groups of files) containing specialized functions.

Although you only have to install a specialized library once (this was done by Binder if you are running the notebook in the cloud), you have to load it every time you restart Python and want to use the functions in the library.

Libraries, packages or modules can be loaded using the import statement as follows:

In [None]:
#Fundamental packages for scientific computing with Python
import numpy as np 
from scipy import stats as st

#Package for data analysis and manipulation
import pandas as pd 

#Packages for data visualization 
import seaborn as sns 
from matplotlib import pyplot as plt

#Package for biostatistical analysis, check documentation here https://github.com/reneshbedre/bioinfokit
from bioinfokit.analys import stat 

print('library imported')

# Example 1: Test for association between mutant genotype and a binary phenotype

In this first example, we are testing the biological hypothesis that a mutant genotype affects a phenotype we are measuring. Our statistical null hypothesis is that genotype has no effect on the phenotype.

There are two variables in the experiment: Genotype and Phenotype. 

- Genotype (our independent variable) is a categorical variable with two possible values: WT and Mutant. 
- Phenotype (our dependent variable) is a categorical binary variable: Pizza or Pasta.
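
Two categorical variables with two values each can be summarized in a 2x2 contingency table. The following is a minimal sketch of that counting step in plain Python; the records are invented for illustration and are not taken from the tutorial dataset, and the name `manual_table` is hypothetical.

```python
from collections import Counter

# Hypothetical records, invented for illustration: (genotype, phenotype)
records = [
    ("WT", "Pizza"), ("WT", "Pizza"), ("WT", "Pasta"),
    ("Mutant", "Pasta"), ("Mutant", "Pasta"), ("Mutant", "Pizza"),
    ("Mutant", "Pasta"),
]

# Count how many individuals fall in each genotype/phenotype combination
counts = Counter(records)

genotypes = ["WT", "Mutant"]
phenotypes = ["Pizza", "Pasta"]

# Build the 2x2 table: one row per phenotype, one column per genotype
manual_table = [[counts[(g, p)] for g in genotypes] for p in phenotypes]

for phenotype, row in zip(phenotypes, manual_table):
    print(phenotype, row)
```

Each cell holds the number of individuals with that combination of genotype and phenotype, so the cells always add up to the number of records.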

We will run a statistical test to check if they are significantly associated or not.

**Question**: Which Statistical Test would you perform? 

In [None]:
# read first data file into dataframe (using pandas library)
data0 = pd.read_excel('../data/dataset0.xlsx')

In [None]:
# show the first 5 lines of the dataframe
data0.head(5)

In [None]:
# show the last 3 lines of the dataframe
data0.tail(3)

In the above code blocks we read the data in from an Excel file and saved it in a data frame variable called 'data0'. The data appear to have been read in correctly. Each row is a record and the columns are the variables. In this case, each individual has a genotype and a phenotype.

In [None]:
# look at the size of the dataframe (rows, columns)
data0.shape

In [None]:
#use pandas to calculate phenotype frequency
data0['Phenotype'].value_counts()

In [None]:
#use pandas to calculate genotype frequency
data0['Genotype'].value_counts()

In [None]:
# use pandas to calculate a contingency table
table = pd.crosstab(data0.Phenotype,data0.Genotype)
display(table)

In [None]:
#histogram of the phenotype data, stratified by genotype
sns.histplot(data=data0,x='Genotype',hue='Phenotype',multiple='dodge',shrink=0.6)

Since we are dealing with binary categorical variables, we will perform Fisher's exact test to test their association.
We will set the significance level alpha=0.05.

**Question:** what is the null hypothesis for Fisher's exact test?
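
To make the test less of a black box, here is a minimal sketch of how a two-sided Fisher's exact p-value can be computed by hand for a 2x2 table. The table values are hypothetical and the function name `fisher_exact_2x2` is made up for this illustration (it is not a library function). The sketch assumes the usual two-sided convention of summing the probabilities of all tables, with the same row and column totals, that are no more likely than the observed one.

```python
from math import comb

def fisher_exact_2x2(table):
    """Sample odds ratio and two-sided Fisher p-value for [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c

    def ways(x):
        # Number of tables with x in the top-left cell and the same margins
        return comb(row1, x) * comb(row2, col1 - x)

    lowest = max(0, col1 - row2)   # smallest possible top-left count
    highest = min(row1, col1)      # largest possible top-left count
    observed = ways(a)

    # Add up every table that is no more likely than the observed one
    extreme = sum(ways(x) for x in range(lowest, highest + 1) if ways(x) <= observed)
    p_value = extreme / comb(row1 + row2, col1)

    # Sample odds ratio (infinite when a cell on the off-diagonal is zero)
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, p_value

# Hypothetical 2x2 table, invented for illustration
example_table = [[8, 2], [1, 5]]
odds_ratio, p_value = fisher_exact_2x2(example_table)
print(odds_ratio, p_value)
```

For this hypothetical table the p-value is about 0.035, which is below alpha=0.05. In practice you would rely on a library implementation such as `st.fisher_exact`, used below.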

In [None]:
# Perform Fisher's exact test on the contingency table
oddsratio,pvalue = st.fisher_exact(table)

In [None]:
print('Odds ratio is:',oddsratio)

In [None]:
print('Fisher Exact Test p-value is:',pvalue)

**Question**: given that the significance level was set at 0.05, do we reject or fail to reject the null hypothesis?


# Example 2: Test for association between mutant genotype and a continuous phenotype

In this example, we are testing again the biological hypothesis that a mutant genotype affects a phenotype we are measuring. Again, our statistical null hypothesis is that genotype has no effect on the measurement.

Genotype is again a categorical variable with two possible values: WT and Mutant, but Measurement is now a continuous numerical variable.

**Question:** Which test would you choose?

In [None]:
# read second data file into data frame object
data1 = pd.read_excel('../data/dataset1.xlsx')

In [None]:
# look at the first several rows of data
data1.head(5)

In [None]:
# look at the size of the dataframe (rows, columns)
data1.shape

In [None]:
# look at summary information on the Measurement variable
data1['Measurement'].describe()

The describe function shows us some information about the data. For example we learn that the measurements range from 2 to 45.

Based on these two variables, we will run a Student's two-sample t-test as long as we can meet the assumptions of that test: 
- normally distributed responses within each treatment  
- equal variances between treatments

We have to look at our data to see if we have met these assumptions, so we will plot the data 
and calculate  summary statistics.

**Question:** which plotting method would you choose? Uncomment the options below

In [None]:
# Uncomment one of the following plotting options

plotting = 'undefined'

#plotting = 'boxplot' 
#plotting = 'swarmplot'
#plotting = 'violin-plot'

print(plotting)

if plotting == 'boxplot':
    # plot data as boxplot
    sns.boxplot(data=data1,x='Genotype',y='Measurement')
elif plotting == 'swarmplot':
    # plot data as swarmplot
    sns.swarmplot(data=data1,x='Genotype',y='Measurement')
elif plotting == 'violin-plot':
    # plot data as violin-plot
    sns.violinplot(data=data1,x='Genotype',y='Measurement')
else:
    print('I don\'t know this option!')

We will now plot the histogram of phenotype data, stratified by genotype.

Try to change the value of variable number_of_bins to modify the binning.

**Question:** how many bins would you use?
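
As a starting point for this question, here is a small sketch of how equal-width bin edges are built from a minimum, a maximum and a number of bins, together with Sturges' rule, one common rule of thumb for choosing the number of bins. The function names and the example numbers are hypothetical, and Sturges' rule is only one of several possible choices.

```python
import math

def equal_width_edges(min_value, max_value, number_of_bins):
    """Return number_of_bins + 1 equally spaced edges from min to max."""
    width = (max_value - min_value) / number_of_bins
    return [min_value + i * width for i in range(number_of_bins + 1)]

def sturges_bins(n_observations):
    """Sturges' rule of thumb: ceil(log2(n)) + 1 bins."""
    return math.ceil(math.log2(n_observations)) + 1

# Hypothetical example: values between 0 and 10 split into 5 bins
edges = equal_width_edges(0, 10, 5)
print(edges)

# Hypothetical example: suggested number of bins for 30 observations
print(sturges_bins(30))
```

Too few bins hide the shape of the distribution, while too many bins leave most of them nearly empty when the sample is small.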

In [None]:
# plot data as stacked histograms

#binning
number_of_bins = 3 #


min_value = np.floor(data1['Measurement'].min())
max_value = np.ceil(data1['Measurement'].max())
bins = np.linspace(min_value,max_value,number_of_bins+1)

fig,ax=plt.subplots(2,1,sharex=True)
ax[0].set_title('Mutant')
sns.histplot(data=data1[data1.Genotype=='Mutant'],x='Measurement',ax=ax[0],bins=bins)
ax[1].set_title('WT')
sns.histplot(data=data1[data1.Genotype=='WT'],x='Measurement',ax=ax[1],bins=bins)
plt.tight_layout()

In [None]:
#summary statistics, stratified by genotype
data1.groupby('Genotype')['Measurement'].agg(['count','mean','median','std','var'])

From the boxplot/violinplot/swarmplot we can tell that the variances are somewhat different between genotypes. The histplot function makes histograms. The stacked histograms give a sense for the shapes of each distribution. Although neither looks perfectly normal, neither is strongly skewed. 

The groupby method helps us organize our summary statistics for each genotype. 
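
For readers curious about what such a grouping does, the following sketch reproduces the idea with plain Python: split the measurements by genotype, then compute the same statistics on each group. The measurements are invented for illustration and are not the tutorial data.

```python
from statistics import mean, median, variance

# Hypothetical measurements, invented for illustration: (genotype, value)
records = [
    ("WT", 10), ("WT", 12), ("WT", 14), ("WT", 16), ("WT", 18),
    ("Mutant", 20), ("Mutant", 30), ("Mutant", 40), ("Mutant", 50), ("Mutant", 60),
]

# Split the values into one list per genotype
groups = {}
for genotype, value in records:
    groups.setdefault(genotype, []).append(value)

# Compute summary statistics for each group (sample variance, n - 1 denominator)
summary = {
    genotype: {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "var": variance(values),
    }
    for genotype, values in groups.items()
}

for genotype, stats_row in summary.items():
    print(genotype, stats_row)

# Ratio between the two variances, a quick check of the equal-variance assumption
variance_ratio = summary["Mutant"]["var"] / summary["WT"]["var"]
print(variance_ratio)
```

In this made-up example the mean equals the median in both groups (no skew), but the variances differ by a factor of 25, which would argue against assuming equal variances.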

The median and mean values for each genotype are very similar, confirming that the distributions are not highly skewed. From this information, we will say that the data have met the t-test assumption of normally distributed responses in each treatment. What about equal variances between treatments? The variances are an order of magnitude different, which violates the assumption of the t-test.

Instead of a Student's t-test, we can run a Welch's t-test which assumes normality but does not assume equal variances.
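
The sketch below shows, under the standard textbook formulas, how Welch's t statistic and its approximate degrees of freedom (Welch-Satterthwaite) are obtained from the two samples. The data and the function name `welch_t` are hypothetical; the p-value is left to a library, since it requires the t distribution.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)

    # Squared standard error contributed by each group (variances are NOT pooled)
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b

    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)

    # Approximate degrees of freedom; usually not a whole number
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical measurements, invented for illustration
wt_values = [10, 12, 14, 16, 18]
mutant_values = [20, 30, 40, 50, 60]

t_value, df_value = welch_t(wt_values, mutant_values)
print(t_value, df_value)
```

Because the variances are not pooled, the degrees of freedom are generally fractional and smaller than in Student's t-test, which is why a non-integer df appears in the reported result.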

To learn more about how to use a function, call help() on it, for example help(st.ttest_ind):

In [None]:
#We can use scipy library to calculate T test

#extract measurement data from WT records
data_wt = data1[data1.Genotype=='WT']['Measurement'].values

#extract measurement data from mutant records
data_mutant = data1[data1.Genotype=='Mutant']['Measurement'].values

# run Welch's t-test on data (equal_var=False); the p-value is then compared with alpha=0.05
test_result = st.ttest_ind(data_wt,data_mutant,equal_var=False,alternative='two-sided')
print(test_result)

Our p-value is less than our alpha value of 0.05 so we reject the null hypothesis that genotype has no effect on our measured response. 

In [None]:
# Alternatively, if you are more familiar with R output, we can use the bioinfokit library to calculate Welch's t-test on data 
res = stat()
res.ttest(df=data1,xfac='Genotype',res='Measurement',evar=False,test_type=2)
print(res.summary)

In [None]:
help(res.ttest)

When reporting this result in a paper it is best to include t, df, and p-value. Here is what that might look like:

The mutant had significantly different measurements than wild type (Welch's t(2,0.05) = -3.57, df = 27.9, p-value < 0.0015).

The numbers in parentheses next to the t are 2 for a two-sided test (i.e. allowing the effect of the mutant to either increase or decrease the measurement) and 0.05 for the alpha value.


# Example 3: Logistic regression to predict Ovarian Reserve

In this example we analyze real data from: Kopeika et al. (2019), "Ovarian reserve in women with sickle cell disease", PLOS ONE

In this study the authors investigate whether women of reproductive age with sickle cell disease (SCD) are more likely to have a low ovarian reserve (AMH<5) at a younger age than patients with no haemoglobinopathy.


In [None]:
data = pd.read_excel('../data/dataset_amh.xlsx')

In [None]:
#show the first rows of data
data.head()

In [None]:
data['AMH<5'].value_counts()

Now we have a dependent variable (AMH<5) and three independent variables (SCD, Smoking Status & Age).

The biological hypothesis is that SCD affects AMH levels. We include Age and Smoking status as so-called covariates, or 'nuisance' variables. Our statistical null hypothesis is that the frequency of low AMH is the same regardless of SCD status.

Since our response variable is a binary categorical variable, a possible type of regression is the **logistic** regression. The logistic regression model will predict the probability of low AMH depending on SCD, the age and the smoking status. We can also test the hypotheses using logistic regression and a series of Wald tests. 
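
To illustrate what "predicting a probability" means here, the sketch below applies the logistic function to a linear combination of the predictors. The coefficients are invented for illustration and are NOT the fitted values from this dataset; the function names are hypothetical as well.

```python
import math

def logistic(x):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical coefficients, invented for illustration (not fitted values)
coefficients = {"Constant": -6.0, "Age": 0.15, "SCD": 1.0, "Smoking": 0.1}

def predict_low_amh(age, scd, smoking):
    """Predicted probability of low AMH for one patient."""
    linear_part = (
        coefficients["Constant"]
        + coefficients["Age"] * age
        + coefficients["SCD"] * scd
        + coefficients["Smoking"] * smoking
    )
    return logistic(linear_part)

# Two hypothetical 35-year-old non-smokers, without and with SCD
p_without = predict_low_amh(age=35, scd=0, smoking=0)
p_with = predict_low_amh(age=35, scd=1, smoking=0)
print(p_without, p_with)

# The ratio of the two odds equals exp(SCD coefficient)
odds_ratio = (p_with / (1 - p_with)) / (p_without / (1 - p_without))
print(odds_ratio)
```

The last lines show why a logistic regression coefficient is read as a log odds ratio: changing SCD from 0 to 1 multiplies the odds of low AMH by the exponential of the SCD coefficient, whatever the values of the other predictors.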

This approach makes few assumptions about the structure of the data and the function that we will use will warn us if our data are not meeting those assumptions.

More information on performing logistic regression in python can be found here:
- https://realpython.com/logistic-regression-python/
- https://www.reneshbedre.com/blog/logistic-regression.html

In [None]:
data.groupby('SCD')[['Age','Smoking','AMH']].mean()

In [None]:
# plot AMH data as boxplot
sns.boxplot(data=data,x='SCD',y='AMH')

In [None]:
# plot Age data as boxplot
sns.boxplot(data=data,x='SCD',y='Age')

In [None]:
#binning
num_bins= 15
min_value = data['AMH'].min()
max_value = data['AMH'].max()

bins = np.linspace(min_value,max_value,num_bins+1)

# plot data as stacked histograms
fig,ax=plt.subplots(2,1,sharex=True)
ax[0].set_title('SCD')
sns.histplot(data=data[data.SCD==1],x='AMH',ax=ax[0],bins=bins)
ax[1].set_title('not SCD')
sns.histplot(data=data[data.SCD==0],x='AMH',ax=ax[1],bins=bins)
plt.tight_layout()

In [None]:
data.groupby('SCD')[['AMH']].describe()

In [None]:
# run Welch's t-test on AMH by SCD status (equal_var=False); compare the p-value with alpha=0.05
st.ttest_ind(data[data.SCD==0]['AMH'],data[data.SCD==1]['AMH'],equal_var=False,alternative='two-sided')

In [None]:
# feature engineering:

# age class
def age_class(x):
    if x<=30:
        return '<=30'
    elif x<=35:
        return '31-35'
    elif x<=40:
        return '36-40'
    else:
        return '>40'
data['Age class'] = data['Age'].apply(age_class)


In [None]:
amh_by_ageclass = data.groupby(['Age class','SCD'])['AMH<5'].mean().reset_index()

In [None]:
sns.barplot(data=amh_by_ageclass,x='Age class',hue='SCD',y='AMH<5',order=['<=30','31-35','36-40','>40'])

In [None]:
# import a few more modules to perform logistic regression
from sklearn import linear_model
from sklearn.metrics import classification_report, confusion_matrix
import statsmodels.api as sm 

In [None]:
# logistic regression model
# get independent variables
data['Constant'] = 1
X = data[['Age','SCD','Smoking','Constant']]
# to get intercept -- this is optional
# X = sm.add_constant(X)
# get response variables
Y = data[['AMH<5']]
# fit the model by maximum likelihood
model = sm.Logit(endog=Y, exog=X).fit()

print(model.summary())

The section of the function output that we are most interested in is the table of coefficients. The line that starts with "SCD" tells us about the effect of SCD. The estimated coefficient for SCD is 0.99; it is the natural log of the odds ratio. A Wald test was run on this coefficient to determine whether it departs significantly from what would be expected if SCD had no effect. The Wald test is based on z-scores: dividing the coefficient by its standard error gives a z-score of 2.26, which corresponds to a p-value < 0.025. So SCD has a significant effect on the probability of low AMH.
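
The arithmetic behind these numbers can be sketched as follows, using the rounded values quoted above (coefficient 0.99 and z-score 2.26) and assuming the usual two-sided test against the standard normal distribution. The function name `two_sided_p_from_z` is hypothetical.

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p-value of a z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# Rounded values quoted in the text above
coefficient = 0.99
z_score = 2.26

p_value = two_sided_p_from_z(z_score)
odds_ratio = math.exp(coefficient)

print(p_value)
print(odds_ratio)
```

With these rounded inputs the p-value comes out just under 0.025 and the odds ratio is about 2.7, meaning the odds of low AMH are roughly 2.7 times higher with SCD at a given age and smoking status.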

There was also a significant effect of Age (p<1e-3). By including it as a variable in the model we were able to estimate, and therefore control for, significant variation in AMH across ages. Controlling for the age effect allowed us to estimate the effect and significance of SCD more accurately.

No significant effect is detected for smoking status (p=0.63).

In [None]:
# Odds Ratio = (Low AMH SCD / High AMH SCD) / (Low AMH notSCD / High AMH notSCD)
np.exp(model.params['SCD'])