![Screenshot%20from%202023-02-07%2014-58-16.png](attachment:Screenshot%20from%202023-02-07%2014-58-16.png)

# **_Microcosm Data Analysis_** 

# Part 0: Introduction

Welcome to this Data Analysis Workshop! This workshop is designed to refresh your skills in data analysis using Python. We will be using the modules pandas, seaborn, and scipy. We will cover some fundamental concepts of data analysis and practise applying them in an appropriate fashion.

By the end of this workshop, you will have a better understanding of how to do the following:

_Download and view our data_

_Use decision tools to choose appropriate plots for your data_

_Use data visualisation tools to make plots_

_Learn how to choose the right statistical test for your data_

_Perform appropriate statistical tests on your data_

But first let's think about our objectives for the Microcosm experiment. It's good to have these in mind as we explore our data.

<div class="alert alert-info">
"To investigate how predation by sea anemones affect the demographic characteristics -
population abundance and make up (sex-ratio and/or juvenile stages) of brine shrimp, as well as
the effects of predation on the nitrogen cycle in this three-level trophic system"
</div>



### Hypotheses


Before we start our data analysis, indeed before we started our experiment, we should have a hypothesis in mind.

**What is our null hypothesis $(H_{0})$?** 

**What is our alternative hypothesis $(H_{A})$?** 


## Task 0:

Write down your hypotheses for this experiment:

Null Hypothesis:

Alternative Hypothesis: 

# Part 1: Your data

Ok, now we have some predictions about the results of our experiment we need some data to analyse. 

The data for your experiment are going to look similar to the data we generated last year. These data should be available here: https://docs.google.com/spreadsheets/d/1mdaYVDsj-id4a8VwbFjf7RBw1bKxXOTopMqk3YDLubw/edit?usp=sharing

If you haven't done so already, download a copy as an Excel file (.xlsx) into a local folder and upload it to Noteable for analysis ("Upload" button on the right-hand side). Upload it to the same folder in which you find this notebook; this will keep everything in the same place and avoid long file paths when loading the data.

I've also stuck a CSV on the GitHub as an alternative. In real life your data may be collected and stored in a myriad of formats; it's up to you to find it, download it, and import it in the most appropriate way. 


## Task 1: Read in and print your group's data

Using pandas, you are now going to read in the excel spreadsheet and call it something sensible.

1. To read in excel spreadsheets we use the command `pd.read_excel(filename)`. Do this now, calling the DataFrame something sensible, such as `micros`.

2. Print the data to make sure it is okay.

In [None]:
# read and print your Microcosm dataset

import pandas as pd

micros = pd.read_excel('Microcosm practical data.xlsx')

#or maybe

#micros = pd.read_csv('Microcosm practical data - Sheet1.csv')

#To display the first 50 rows of the dataframe we use the .head() method 
#... but equally you could just call the name of the dataframe here
micros.head(50)

### Are our data 'tidy'?

When it comes to data analysis there is a specific way of organising your raw data which makes it easier to visualise and analyse. This organisation of data, known as 'tidy' data, is a simple but effective way of making sure our data frame is easy to read and access. 

### The Three Rules of Tidy Data

1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.

![tidy_all.png](attachment:tidy_all.png)

Why ensure that your data is tidy? There are two main advantages:

There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.

There’s a specific advantage to placing variables in columns because it allows programming languages to see each column as a list or vector.

If our data needed some rearrangement, such as 'melting' or 'pivoting', then the pandas module has lots of methods to help. Check out this really useful cheatsheet with some of the ways we can manipulate our data into the right shape:

https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
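
To make 'melting' a little more concrete, here is a minimal sketch using a small made-up table (not our Microcosm data). The column names `tank`, `day_0`, `day_4` and `live_shrimp` are my own assumptions for illustration. The 'wide' table breaks the first rule of tidy data because the variable "day" is spread across several columns; `melt` gathers those columns into one.

```python
import pandas as pd

# Hypothetical 'wide' table: one row per tank, one column per sampling day.
wide = pd.DataFrame({
    "tank": ["A", "B"],
    "day_0": [20, 20],
    "day_4": [18, 19],
})

# Gather the day columns into two tidy columns: 'day' and 'live_shrimp'.
tidy = wide.melt(id_vars="tank", var_name="day", value_name="live_shrimp")

# Turn labels such as 'day_4' into the integer 4.
tidy["day"] = tidy["day"].str.replace("day_", "", regex=False).astype(int)

print(tidy)
```

After melting, each row is a single observation (one tank on one day) and each variable has its own column.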

# Part 2: Exploring variables

A variable has two defining properties.

1. a variable is any characteristic of a person, place, thing, or idea that can be measured.
2. the value of that variable can vary from one entity to another.

Examples of variables: Eye colour, height, weight, temperature, pH, leaf shape and butterfly wing shape 
*All of these can be measured, and all vary from one individual to another.*

Hopefully, when we designed our experiment we decided on variables to measure that have good explanatory power. These will help us decide whether to reject, or fail to reject, our null hypothesis. 

We also need to identify what type of variables we will be working with. 

Remember there are two main types of variable: **categorical variables** and **numerical variables**.

### Categorical variables

Categorical variables describe **qualitative** characteristics of individuals. A characteristic may have two or more values, or categories. 

There are two types of categorical variable: **categorical nominal** and **categorical ordinal**.

Nominal means "name". The following table lists some examples of categorical nominal variables.

Variable | Values or categories
:--- | :---
Blood group | O+, O-, A+, A-, B+, B-, AB+, AB-
Sex chromosome genotype | XX, XY, XO, XXY, XYY
Eye colour | brown, blue, green, hazel, grey
Survival | alive, dead

Ordinal means "having order". The categories have an inherent order to them, often temporal or spatial. The following table lists some examples of categorical ordinal variables.

Variable | Values or categories
:--- | :---
Life stage | Egg, larva, juvenile, adult
Age class | 0-4 years, 5-9 years, 10-14 years, etc.
Rank | 1<sup>st</sup>, 2<sup>nd</sup>, 3<sup>rd</sup>, etc.


### Numerical variables

Numerical data are measurements on individuals that have magnitude. They describe **quantitative** characteristics of individuals. 

Numerical variables can be either **continuous** or **discrete**. 

Continuous variables can take any value within a range, including fractional values, so they are usually written with a decimal point. For example,

Variable | Values
:--- | :---
Body temperature | 37.5 $^\circ$C
Territory size| 4.5 m<sup>2</sup>
Concentration | 0.5 Molar

Characteristics that are discrete are generally counts. For example,

Variable | Values
:--- | :---
Number of amino acids | 546
Harem size | 5
Bird nests per tree | 11

#### Explanatory vs Response Variables 

So we know the type of data our variable contains, but there is an extra distinction that is important to make when plotting and analysing our data: whether it is an explanatory or response variable.

These two additional types of variable are important to understand in statistics and overlay our previous variable types, i.e. **both response and explanatory variables can be either categorical or numerical.** 

**Explanatory Variable:** Sometimes referred to as an independent variable or a predictor variable, this variable explains the variation in the response variable.

**Response Variable:** Sometimes referred to as a dependent variable or an outcome variable, the value of this variable responds to changes in the explanatory variable. 


## Task 2: Choose Variables to Analyse

With this in mind let's think about which variables we are interested in plotting to visualise our data and help us understand the patterns of change within our tanks.  

A good starting place might be this site: https://www.data-to-viz.com/ which shows you some data visualisations appropriate to your data types. Use the 'explore' button to find a graph type that suits our data. 

**Which of the variables in our data set are categorical? Which are numeric?**

**If we were to pick two variables to analyse and plot, which would be the explanatory and which the response variable?**
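
One way to start on the first question is to ask pandas how it has stored each column. The sketch below uses a few made-up rows; the column names mimic ones used in this notebook, but the values are invented. Treat the result as a first guess only: a column of numbers can still be categorical in the statistical sense (a tank ID, for example).

```python
import pandas as pd

# Made-up rows for illustration; not the real Microcosm data.
toy = pd.DataFrame({
    "day": [0, 4, 8],
    "predation": ["yes", "no", "yes"],
    "Live Shrimp": [20, 18, 11],
    "NO2-Nitrite": [0.1, 0.25, 0.5],
})

# dtypes shows how pandas has stored each column.
print(toy.dtypes)

# Split the column names into numeric and non-numeric groups.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```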

# Part 3: Plotting our Data

Good data visualisation is a powerful tool in the biological sciences, helping researchers to effectively communicate complex data and results. It's going to form an essential part of any scientific report, thesis, or paper you will ever write, so it's an important skill to practise. 

(Not only that, data visualisation skills are applicable well outside the realms of biology so very transferable!)

Let's start with an example of a graph that we may well want to create to track the numbers of shrimp alive in our tanks at each time point. 

As I am interested initially in visualising the number of shrimp (discrete numerical) over time, I'm going to create a line plot using seaborn.lineplot() https://seaborn.pydata.org/generated/seaborn.lineplot.html

In [None]:
#Remember when using a new module we need to 'import' it; we say 'as sns' here just as an abbreviation
import seaborn as sns

#You may notice I have broken this graph code down across multiple lines for readability
sns.lineplot(data = micros, 
             x="day", 
             y="Live Shrimp",);#This semicolon stops Jupyter printing the text description of the plot object above the graph


The above graph looks about right but there are some issues: the axis labels are the names of our columns (not very presentable), and the graph has no title. 

**Can you correct this below?**


In [None]:
sns.lineplot(data = micros, 
             x="day", 
             y="Live Shrimp",)
#add code here to set axis labels


Ok, so now your plot looks a little neater, but one 'trick' we're missing here is to separate out those tanks with and without predation to help us visualise the effect of 'predation' on shrimp numbers. We can do this with the argument 'hue'.

In [None]:
sns.lineplot(data = micros, 
             x="day", 
             y="Live Shrimp",
            hue = "predation");

Ok, so this graph is an important one. One of our main goals for this experiment was to track population abundance. 

**Can you add to this graph or adapt it to your own needs?**

There are some other variables of interest here. We are also interested in how shrimp numbers are correlated with chemical indicators such as nitrite. 

This time we'll use a 'scatterplot' because we are plotting two numeric variables against one another.

In [None]:
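#plt has not been imported yet, so we need to do that before using plt.xlabel and plt.ylabel
import matplotlib.pyplot as plt

sns.scatterplot(data = micros, 
             x="Live Shrimp", 
             y="NO2-Nitrite",
            hue = "predation")
plt.xlabel('Live Shrimp')
plt.ylabel('NO2 - Nitrite');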

In [None]:
sns.regplot(data = micros, 
             x="Live Shrimp", 
             y="NO2-Nitrite");

#### Saving your graph

In [None]:
shrimp = sns.regplot(data = micros, 
             x="Live Shrimp", 
             y="NO2-Nitrite");
shrimp_fig = shrimp.get_figure()
shrimp_fig.savefig('shrimpvsNO2.jpg')

## Task 3: Choose another visualisation

**This time you're on your own!**

Pick two (or more, if appropriate) relevant variables to visualise using the modules we've used above.  

Don't forget this handy website to aid your choice! https://www.data-to-viz.com/ 

And don't forget to upload your successes and failures (both valid) to our padlet page:
https://padlet.com/btopad/environment-1d-data-visualisation-23zpefzm2ceg1h3 

# Part 4: Statistical Analysis

### Parametric vs Non-parametric

Parametric and nonparametric statistical tests are two different approaches to statistical hypothesis testing. The main **difference between the two is the assumption about the underlying population distribution.**

**Parametric tests** are based on the **assumption that the population follows a specific probability distribution**, usually the normal distribution. Examples of parametric tests include t-tests, ANOVA, and linear regression. These tests make use of the population parameters (mean and variance) to determine the significance of the results.

**Nonparametric tests**, on the other hand, **do not assume that the population follows a particular distribution** such as the normal distribution. Instead, many of them use the rank or ordering of the data to perform the hypothesis test. Examples of nonparametric tests include the Mann-Whitney U test (also known as the Wilcoxon rank-sum test), the Kruskal-Wallis test, and Spearman's rank correlation. Nonparametric tests are useful when the data do not meet the assumptions of parametric tests or when the sample size is small.

In general, parametric tests are more powerful than nonparametric tests, but they are also more sensitive to violations of their assumptions. If the assumptions of a parametric test are not met, the results can be misleading, while nonparametric tests are less sensitive to such violations.
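
To illustrate that sensitivity, here is a minimal sketch with made-up counts (not our data). The second group contains a single extreme value, which breaks the normality assumption. The group names and numbers are invented purely for demonstration.

```python
from scipy import stats

# Made-up shrimp counts; the value 60 is a deliberate outlier.
control = [20, 19, 21, 18, 22, 23, 17]
predation = [12, 14, 11, 13, 15, 10, 60]

# Parametric: independent-samples t-test compares the group means.
t_stat, t_p = stats.ttest_ind(control, predation)

# Nonparametric: Mann-Whitney U test compares the ranks of the values.
u_stat, u_p = stats.mannwhitneyu(control, predation, alternative="two-sided")

print(f"t-test:       p = {t_p:.3f}")
print(f"Mann-Whitney: p = {u_p:.3f}")
```

The single outlier drags the mean of the second group up and inflates its variance, so the t-test sees little difference between the groups. The rank-based test only cares that six of the seven values are lower than every control value, so it is much less affected.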

### How to Choose a Parametric Test

The most common types of parametric test include regression tests, comparison tests, and correlation tests.

#### Regression tests
Regression tests look for cause-and-effect relationships. They can be used to estimate the effect of one or more continuous variables on another variable.

#### Comparison tests
Comparison tests look for differences among group means. They can be used to test the effect of a categorical variable on the mean value of some other characteristic.

T-tests are used when comparing the means of precisely two groups (e.g., the average heights of men and women). ANOVA and MANOVA tests are used when comparing the means of more than two groups (e.g., the average heights of children, teenagers, and adults).
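
As a rough sketch of how these two comparison tests look in scipy, the example below uses invented heights (in cm) for three age groups. The numbers are made up for illustration, and I have not checked the tests' assumptions here, which you would need to do with real data.

```python
from scipy import stats

# Made-up heights in cm, for illustration only.
children = [120, 125, 130, 128, 122]
teenagers = [155, 160, 158, 162, 150]
adults = [170, 175, 168, 180, 172]

# Exactly two groups: independent-samples t-test.
t_stat, t_p = stats.ttest_ind(children, adults)

# More than two groups: one-way ANOVA.
f_stat, f_p = stats.f_oneway(children, teenagers, adults)

print(f"t-test: t = {t_stat:.2f}, p = {t_p:.3g}")
print(f"ANOVA:  F = {f_stat:.2f}, p = {f_p:.3g}")
```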

#### Correlation tests
Correlation tests check whether variables are related without hypothesizing a cause-and-effect relationship.

These can be used to test whether two variables you want to use in (for example) a multiple regression test are correlated with each other.

![Screenshot%20%281%29.png](attachment:Screenshot%20%281%29.png)

### How to Choose a Non-parametric Test
For most parametric tests you come across there is a non-parametric equivalent to use if you believe the assumption of normally distributed data has been violated. 

![non-para.png](attachment:non-para.png)

To read more about choosing stats tests and to see the original source of the above figures check out:

https://www.scribbr.com/statistics/statistical-tests/


## Task 4: Choose an appropriate test

Use the above resources and flow chart to decide on an appropriate test for our data.

I'm going to run with an example to demonstrate the kind of decision making and process involved in choosing and applying statistical tests. 

### Example 1. 

**Variable 1:** Live Shrimp four days after predator introduction (day 8) - numerical discrete - our response variable

**Variable 2:** Anemone starting weight (day 4) - numerical continuous - our explanatory variable



Firstly, because I'm taking two bits of data from different parts of our spreadsheet I will have to do some trimming and rearranging of our table. 

In [None]:
#I'm going to use the square bracket indexing of pandas to 'subset' my micros dataframe
# I use the logical equals '==' to pick out all rows where day is equal to 8
micros_day8 = micros[micros["day"]==8]

#To display the first 10 rows of the dataframe we use the .head() method again
micros_day8.head(10)

Excellent now we have a section from just day 8: Let's do the same for day 4.

In [None]:
micros_day4 = micros[micros["day"]==4]

Now I'm going to slice out just the columns I want from these two new dataframes, again using the square brackets. 

I'm also going to remove any indexes (on the far left of our dataframe) given to these rows when we imported our data. '.reset_index(drop=True)' is a method we can use on slices of a dataframe and it does what it says on the tin: it resets the indexing of rows and drops the old one. This will help me merge these two columns into a new dataset. 

In [None]:
day4_anem_weight = micros_day4["anenome.weight"].reset_index(drop=True) 
day8_shrimp = micros_day8["Live Shrimp"].reset_index(drop=True)

We can now merge these two variables together to form a new dataset. To do this I am using a new type of object called a 'dictionary'. Don't worry too much about these right now; this intermediate step is not necessary if your two variables are already in the same dataframe. 

In [None]:
#Creating a 'dictionary' called 'frame' with our two variables
frame = { 'Anemone Weight': day4_anem_weight, 'Live Shrimp': day8_shrimp }

#Creating DataFrame from this dictionary with pd.DataFrame
result = pd.DataFrame(frame)

result
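
Resetting the index pairs the rows up by position, so it relies on the day 4 and day 8 rows being in the same tank order. If your data has a column that identifies each tank, matching on that column is a safer alternative. The sketch below uses made-up data, and the `tank` column is a hypothetical name; check what your own spreadsheet calls it.

```python
import pandas as pd

# Made-up long-format data; 'tank' is a hypothetical identifier column.
toy = pd.DataFrame({
    "tank": ["A", "B", "C", "A", "B", "C"],
    "day": [4, 4, 4, 8, 8, 8],
    "anenome.weight": [1.2, 0.8, 1.5, None, None, None],
    "Live Shrimp": [20, 19, 20, 12, 16, 9],
})

# Pick the rows for each day and keep only the columns we need.
day4 = toy.loc[toy["day"] == 4, ["tank", "anenome.weight"]]
day8 = toy.loc[toy["day"] == 8, ["tank", "Live Shrimp"]]

# Match rows by tank rather than by position.
paired = day4.merge(day8, on="tank", how="inner")
print(paired)
```

With `how="inner"`, only tanks that appear on both days are kept, which also removes unmatched rows.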

We have a lot of 'NaN' values in our new data frame: unfortunately these will need to be removed before further analysis. I use the .dropna method from Pandas https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html 

In [None]:
result.dropna(axis = 0, how = 'any', inplace = True)
result

And to see if any pattern of correlation is visible in our data, let's make a quick scatter plot: 

In [None]:
sns.scatterplot(data=result, x="Anemone Weight", y="Live Shrimp");

There seems to be some negative correlation between these two variables: how can we test that?

Well, first let's decide if we need a parametric or non-parametric test.

Is our data normally distributed?

In [None]:
#a quick histogram will show me
sns.histplot(data=result, x="Live Shrimp");

Not particularly! 
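
A histogram is a visual check. If you want a number to go with it, one option (not part of the workflow above) is the Shapiro-Wilk test from scipy. The sketch below runs it on made-up, strongly skewed counts rather than on our `result` dataframe.

```python
import numpy as np
from scipy.stats import shapiro

# Made-up, strongly right-skewed counts (not the workshop data).
counts = np.array([0, 0, 1, 1, 1, 2, 2, 3, 5, 9, 14, 25])

# The null hypothesis is that the sample comes from a normal distribution.
stat, p = shapiro(counts)
print(f"W = {stat:.3f}, p = {p:.4f}")

# A small p-value (conventionally below 0.05) suggests the sample
# is unlikely to have come from a normal distribution.
looks_normal = p > 0.05
print("Consistent with normality:", looks_normal)
```

Bear in mind that with very small samples this test has little power to detect non-normality, so it is best used alongside the histogram rather than instead of it.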

We also have a small sample size for this particular test, which means I'm going to stick with a non-parametric test that does not make these assumptions about my data: a Spearman's rank correlation. 

In [None]:
from scipy.stats import spearmanr

#calculate Spearman Rank correlation and corresponding p-value
spearmanr(result['Anemone Weight'],result['Live Shrimp'])


So it would appear that my impression from the scatterplot was accurate: there is a negative correlation between anemone starting weight and live shrimp four days after predator introduction (Spearman's r = -0.746, p = 0.005).
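
If you want to report the two numbers in a sentence like the one above, it can help to unpack them into named variables. Here is a minimal sketch on made-up values (the variable names and numbers are invented, so the output will not match the result above).

```python
from scipy.stats import spearmanr

# Made-up paired measurements, for illustration only.
anemone_weight = [0.5, 0.8, 1.1, 1.4, 1.9, 2.3]
live_shrimp = [18, 17, 15, 16, 11, 9]

# spearmanr returns the correlation coefficient and the p-value.
rho, p_value = spearmanr(anemone_weight, live_shrimp)

print(f"Spearman's r = {rho:.3f}, p = {p_value:.3f}")
```

A coefficient close to -1 means that as one variable increases the other almost always decreases; a coefficient close to 0 means there is no consistent monotonic relationship.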

## Now give it a go yourself! 

You don't have to choose the same variables, you don't have to slice and dice your dataset as I have, you can pick any variable you think will be interesting to examine, and you don't have to get it right first time. Practice makes perfect and this is a great time to try, fail, and learn. 