# Welcome to the Jupyter notebook! 

### This platform allows you to use the power of computer code to perform complex calculations. "But I don't know anything about computer code!!" Don't worry! The coding language this platform uses is called Python, and the Jupyter notebook allows us to write code and test it in small chunks. You don't have to understand every line of code, but I hope you will be able to follow the general goal of each step. Each gray rectangle below is called a cell and contains a set of instructions for the computer to follow. By running each cell, we ask the computer to follow the instructions we give it. The result of the code will show up below the cell. I can write text instructions in the cell as long as I start them with the # symbol, which tells the computer that what follows is not code. Follow along with your protocol and let's get started!

## In each box, descriptions of each command are given and are set apart from the coding language by the # symbol. You should click on each box and then the "run cell" button (looks like a "play" button, on the far left side of the cell) to execute the code written inside.

In [0]:
#The first thing we need to do is set up the notebook so it will be ready to do what we ask of it.
#We are going to want to make some graphs, also called plots, so we first create this environment.
#The code below this line sets up a plotting environment inside the notebook.
%matplotlib notebook
#Next we will bring in some shortcut libraries that we will use for our analyses.
#Think of these like toolboxes containing lots of shortcuts. We call these modules.
#When the code line says "import", we are simply bringing in Python modules with code and objects we can use.
import pandas as pd #pandas provides the capability for spreadsheets
import numpy as np #for numerical analysis
import matplotlib.pyplot as plt #for plotting
import scipy.stats as ss #for statistical analysis
import seaborn as sb #for nicer graphics
#now we have a set of tools that we can call on inside the notebook to do things for us.
sb.set_style('darkgrid') #sets background style for graphics
#!pip install seaborn --upgrade

Now we are ready to begin our data analysis. 

The first step is to import the data your section generated. When you click "play" on the cell below, click the "Choose Files" box and select the Leaf_data_example_student.csv file your TA sent you.

In [0]:
from google.colab import files
uploaded = files.upload()

In [0]:
#This code allows you to verify that your file was correctly uploaded.
#You should see a message that the file was uploaded and has a specific length.
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(name=fn, length=len(uploaded[fn])))

In [0]:
#Now you will take the data you just imported and extract the numbers into a form Python understands.
#Here you are using the "pandas" library to tell the computer to pull the numbers from the file
#into a table and to call that table "leaves". That way any time you want to reference this data,
#you simply need to use the name "leaves".
#If you were successful here, you should see a table appear with all the data collected.
import pandas as pd
import io
leaves = pd.read_csv(io.StringIO(uploaded['Leaf_data_example_student.csv'].decode('utf-8')))
print(leaves)

**Awesome**! Now we have imported our spreadsheet data into the Notebook and we can begin analyzing it. You've used Excel many times in the past so you know how much effort goes into calculating basic descriptive statistics like the mean and standard error. Now you will see how a single line of code can get us the same information.
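
As a small illustration of that "single line" idea, here is a sketch on a tiny made-up table. The table name `demo` and its numbers are hypothetical and are not your section's data; the point is only to show what `describe()`, `mean()` and `sem()` return.

```python
import pandas as pd

# Made-up measurements in millimetres (hypothetical, not your section's data)
demo = pd.DataFrame({
    'Length': [50.0, 60.0, 70.0, 80.0],
    'Width': [20.0, 25.0, 30.0, 35.0],
})

# One line gives the count, mean, standard deviation, min, quartiles and max
summary = demo.describe()
print(summary)

# Individual statistics are also one-liners
mean_length = demo['Length'].mean()   # average of the Length column
sem_length = demo['Length'].sem()     # standard error of the mean
print(mean_length, sem_length)
```

The same calls work on any table with numeric columns, which is how they are used on the leaf data in the cells that follow.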

In [0]:
#Before we get started, let's generate a new variable for the ratio just in case you've decided to use it.
#We can easily do that by asking the computer to calculate the ratio of length to width
#If you've done everything right, you should see the data table again but now with a new variable we're
#calling "l_w_ratio". If you look at the code below, you can see how we did that.
leaves['l_w_ratio'] = leaves['Length']/leaves['Width']
leaves.head()

In [0]:
#Next we need to divide the data into two sets: sun and shade
#Here the code makes new variables composed of data of either category
sun = leaves[leaves['Shade/Sun']=='Sun']
shade = leaves[leaves['Shade/Sun']=='Shade']
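
To see what the ratio and the sun/shade split are doing, here is a minimal sketch on a made-up table. The names `demo`, `is_sun` and `demo_sun` and all the values are hypothetical; only the column names match the class data.

```python
import pandas as pd

# Made-up table with the same column names as the class data (hypothetical values)
demo = pd.DataFrame({
    'Shade/Sun': ['Sun', 'Shade', 'Sun', 'Shade'],
    'Length': [60.0, 80.0, 50.0, 90.0],
    'Width': [30.0, 50.0, 25.0, 45.0],
})

# Dividing two columns works row by row and gives a new column
demo['l_w_ratio'] = demo['Length'] / demo['Width']

# The comparison gives True or False for every row...
is_sun = demo['Shade/Sun'] == 'Sun'
# ...and using it inside the brackets keeps only the True rows
demo_sun = demo[is_sun]

print(demo_sun)
```

Because the comparison is an exact text match, a label typed as 'sun' or 'Sun ' in the spreadsheet would not be picked up.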


In [0]:
#We can compute the descriptive stats for each data set.
#describe() gives the count, mean, standard deviation, min, quartiles and max.
cols = ['Length','Width']
basic = sun[cols].describe()
#Then add some other goodies.
#Here's the command to find the range (largest value minus smallest value).
r = sun[cols].max() - sun[cols].min()
#Here's the command to find the Variance (sample variance, to match the std from describe)
v = sun[cols].var()
#And the Standard error
stderr = sun[cols].sem()
#And the 95% CI (normal approximation: mean plus or minus about 1.96 x SEM)
z = ss.norm.ppf(0.975)
ci_low = sun[cols].mean() - z*stderr
ci_high = sun[cols].mean() + z*stderr
#Merge these into one table of sun statistics
extra = pd.DataFrame({'Range':r,'Variance':v,'SEM':stderr,
                      '95% CI lower':ci_low,'95% CI upper':ci_high}).T
sun_stats = pd.concat([basic, extra])
sun_stats
#Use this output to complete the table (for the sun leaves) in your eLN under Part 3.

In [0]:
#Here's the Shade data set full stats
#Use this output to fill out the Shade side of the table in your eLN.
cols = ['Length','Width']
r = shade[cols].max() - shade[cols].min()   #range
v = shade[cols].var()                       #sample variance
stderr = shade[cols].sem()                  #standard error
z = ss.norm.ppf(0.975)                      #about 1.96 for a 95% CI
ci_low = shade[cols].mean() - z*stderr
ci_high = shade[cols].mean() + z*stderr
extra = pd.DataFrame({'Range':r,'Variance':v,'SEM':stderr,
                      '95% CI lower':ci_low,'95% CI upper':ci_high}).T
shade_stats = pd.concat([shade[cols].describe(), extra])
shade_stats
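
If you are curious where the standard error and the 95% CI come from, the arithmetic can be sketched by hand. The numbers below are made up for illustration, and the interval uses the normal approximation (mean plus or minus about 1.96 standard errors), which is the assumption behind `ss.norm.interval`.

```python
import numpy as np
import scipy.stats as ss

# Made-up leaf lengths (hypothetical values for illustration only)
x = np.array([50.0, 60.0, 70.0, 80.0])

n = len(x)
mean = x.mean()
sd = x.std(ddof=1)             # sample standard deviation
sem = sd / np.sqrt(n)          # standard error of the mean
z = ss.norm.ppf(0.975)         # about 1.96 for a 95% interval
ci_low, ci_high = mean - z * sem, mean + z * sem

print('SEM:', sem)
print('95% CI:', ci_low, ci_high)

# The SciPy shortcuts give the same numbers
assert np.isclose(sem, ss.sem(x))
assert np.allclose((ci_low, ci_high), ss.norm.interval(0.95, loc=mean, scale=sem))
```

With only a handful of leaves, an interval based on the t distribution would be somewhat wider than this normal approximation.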

*You*'ve taken a look at the basic summarized data, so now let's make a plot of these data. You should have decided which plot to make in the pre-experiment planning. 
## Run the cell that corresponds to your chosen plot type. You only need to run one of these cells.

In [0]:
#If you chose to make a box plot, run the code in this cell
%matplotlib inline
plt.subplot(211)
sb.boxplot(x="Shade/Sun", y="Length", data=leaves)
plt.subplot(212)
sb.boxplot(x = "Shade/Sun", y='Width', data=leaves)

# Show the plot                   
plt.show()

In [0]:
#If you chose to make a scatter plot, run the code in this cell which will
#generate both the plot and the equation for the line of the form
#y=mx+b where y is the width value, m is the slope, x is the length value
#and b is the y-intercept
%matplotlib inline
equation=np.poly1d(np.polyfit(leaves['Length'],leaves['Width'],1))
print(equation)
#regplot draws the scatter plot plus the fit line, with a 95% confidence band
sb.regplot(x=leaves['Length'], y=leaves['Width'])


Take a screenshot of the plot you decided to make and place it in your eLN.
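
For the scatter plot option, the line equation comes from a straight-line fit. Here is a minimal sketch of that fit on made-up points that lie exactly on y = 0.5x + 2, so you can check that the fit recovers the slope and intercept. The names `length`, `width` and `line` and the values are hypothetical.

```python
import numpy as np

# Made-up points that lie exactly on the line y = 0.5x + 2 (hypothetical data)
length = np.array([10.0, 20.0, 30.0, 40.0])
width = 0.5 * length + 2.0

# Degree 1 asks for a straight line; the result is [slope, intercept]
slope, intercept = np.polyfit(length, width, 1)
line = np.poly1d([slope, intercept])

print('slope:', slope, 'intercept:', intercept)
print('predicted width at length 25:', line(25.0))
```

Real measurements will scatter around the line, so the fitted slope and intercept are estimates rather than exact values.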

**We** are ready to start our statistical analysis for Question 1. The first thing we need to do is simply plot the distribution of our data one variable at a time and look for the characteristic "bell curve" shape. This would mean that the data is normally distributed. Remember that we need to know this so we can decide what test to do. 

In [0]:
#Above, we brought in the toolbox seaborn and called it sb.
#Now we can call on that toolbox anytime we type sb and use one of the tools it contains. 
#First, we will plot the distribution of the Length variable, keep in mind sun and shade are combined for this!
#the command below tells the computer to use the histogram tool in the seaborn toolbox to plot the Length column of leaves,
#and kde=True adds a smooth curve over the bars.
#screenshot this plot into your eLN
%matplotlib inline
sb.histplot(leaves['Length'], kde=True)

Great! You should see a plot appear above this box. Does the data look normal? The curve here is roughly bell shaped, so we will treat the data as normally distributed.

In [0]:
#Change the code below so the plot shows the distribution of the width variable. 
sb.histplot(leaves['Width'], kde=True)

If you completed the code string correctly, your distribution plot should be shown above. Is this variable normally distributed? Seems to be. Remember that determining normal distribution in this manner is a bit of a judgement call.
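
Because judging the shape by eye is a judgement call, a numerical check can be used alongside the plot. The Shapiro-Wilk test sketched below is an optional extra and is not part of your protocol; the data are simulated, and the names `bell` and `skewed` are made up for the example.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(0)                     # fixed seed so the example repeats
bell = rng.normal(loc=60.0, scale=5.0, size=200)   # simulated bell-shaped data
skewed = rng.exponential(scale=5.0, size=200)      # simulated strongly skewed data

for name, values in [('bell', bell), ('skewed', skewed)]:
    stat, p = ss.shapiro(values)
    # A small p-value (commonly below 0.05) suggests the data are not normal
    print(name, 'W =', round(stat, 3), 'p =', p)

p_skewed = ss.shapiro(skewed)[1]
```

A large p-value does not prove the data are normal; it only means the test found no clear evidence against it.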



Now we will perform the appropriate statistical test. 
## You must choose one of the code cells below depending on which statistical test you chose. A few cells ago, we decided that both length and width are normally distributed. Check back in your eLN to see which test you decided was appropriate when both variables are normally distributed.

In [0]:
#This code will run a Pearson correlation coefficient analysis.
#The results are shown as the coefficient then the p-value.
print(ss.pearsonr(leaves['Length'],leaves['Width']))
#This line of code returns the regression analysis
slope,intercept,rval,pvalue,stderr = ss.linregress(x=leaves['Length'],y=leaves['Width'])
print('Slope',slope)
print('Intercept',intercept)
print('Rsq',rval**2)
print('pvalue',pvalue)

In [0]:
#This code will run a Spearman correlation coefficient analysis.
ss.spearmanr(leaves['Length'], leaves['Width'])

Record the test you chose to run and the resulting p-value from that test in your eLN. Also record the regression output. Interpret the results of your analysis as it relates to the original question.
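
To get a feel for how the two correlation coefficients differ, here is a sketch on made-up data where y always increases with x, but along a curve rather than a straight line. The values are hypothetical and chosen only to make the contrast visible.

```python
import numpy as np
import scipy.stats as ss

# Made-up data: y always increases with x, but along a curve (hypothetical)
x = np.arange(1.0, 11.0)
y = np.exp(x / 2.0)

r, p_r = ss.pearsonr(x, y)        # measures straight-line association
rho, p_rho = ss.spearmanr(x, y)   # measures whether the ranks move together

print('Pearson r:', r)
print('Spearman rho:', rho)
```

Spearman works on ranks, so it reaches 1 whenever the ordering is perfectly preserved, while Pearson falls below 1 when the relationship is not a straight line.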

# Let's start the analysis for Question 2.

In [0]:
#For Question 2, we are going to need to define sun and shade as separate groups. 
grouped = leaves.groupby('Shade/Sun')

## You must modify the code in the cell below to include your variable of choice. You only need to do this for the plot you've chosen to make.

In [0]:
#If you chose to make a bar/box plot, you should run the code in this cell
#AFTER you put your chosen variable where the code says y=
#Type your chosen variable between the '' marks, either Length, Width or l_w_ratio
#(Width is filled in as an example). This is case-sensitive!
sb.boxplot(data=leaves, x='Shade/Sun',y='Width',width=0.2)

In [0]:
#If you chose to make a scatterplot, you should run the code in this cell
#AFTER you add your chosen variables between the '' marks (x first, then y)
sb.regplot(x=leaves[''], y=leaves[''])

Take a screenshot of your plot and put it in your eLN under Question 2 plot.

Just as we did for Question 1, first you should look at how your chosen variable is distributed and check for normality. Remember that when we did this earlier, all the data was lumped together. Now we will separate it out into sun and shade. You will need to change the title and the variable in BOTH of these cells. Use the distribution of the data to decide if your variable is normally distributed. If one of your variables is non-normal but the other is normal, perform the test for non-normal data.

In [0]:
#Evaluate normality for the Sun group. You must add your variable to the first
#line of code which sets the title as well as the second line of code which
#defines the data set. Enter your variable of choice between the '' marks.
#Remember this is case-sensitive, so either Length, Width or l_w_ratio
plt.title('Histogram of for Sun')
sb.histplot(sun[''], kde=True)

In [0]:
#Evaluate normality for the shade group. You must add your variable to the first
#line of code which sets the title as well as the second line of code which
#defines the data set. Enter your variable of choice between the '' marks.
#Remember this is case-sensitive, so either Length, Width or l_w_ratio
plt.title('Histogram of for Shade')
sb.histplot(shade[''], kde=True)

Now you are ready to perform your statistical test. Depending on which type of question you thought was best and the type of data we have, you will need to pick one of the cells below to run a statistical test. Make sure you record which test you chose and the resulting p-value in your eLN.
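
Before you fill in your own variable, here is a rough sketch of how the two group-comparison tests are called and what they return. The two lists are made-up widths, and the names `group_a` and `group_b` are hypothetical.

```python
import scipy.stats as ss

# Made-up widths for two groups (hypothetical values)
group_a = [30.1, 32.4, 29.8, 31.5, 33.0, 30.7]
group_b = [27.9, 28.4, 26.5, 29.1, 27.2, 28.8]

# Unpaired t-test; the reported p-value is two-sided by default
t_stat, t_p = ss.ttest_ind(group_a, group_b)

# Mann-Whitney U test, also two-sided here
u_stat, u_p = ss.mannwhitneyu(group_a, group_b, alternative='two-sided')

print('t-test:', t_stat, t_p)
print('Mann-Whitney U:', u_stat, u_p)
```

Both calls return a test statistic followed by a p-value, which is why the cells below read the p-value from position 1 of the result.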

In [0]:
#If you chose to do a t-test, run this cell. This is the code to run an unpaired
#t-test. You would only run a paired t-test if the leaves were taken from the same tree
#YOU MUST MODIFY THIS CODE with your variable of choice. Enter your variable 
#between the '' marks and again it must match either Length, Width or l_w_ratio
result=ss.ttest_ind(sun[''],shade[''])
#The p-value reported here is two-sided, like the one from the Mann-Whitney U test
print("Test statistic:",result[0],"P-value:",result[1])


In [0]:
#If you chose to do a Mann-Whitney U test, run this cell. Enter your variable
#between the '' marks, either Length, Width or l_w_ratio
result=ss.mannwhitneyu(sun[''], shade[''],use_continuity=True, alternative='two-sided')
print("p-value:",result[1])

In [0]:
#If you chose to do a Pearson Correlation, run this cell. You must modify this
#code with your variable of choice. Enter your variable between the '' marks and
#again it must match either Length, Width or l_w_ratio
ss.pearsonr(leaves[''],leaves[''])

In [0]:
#If you chose to do a Spearman Correlation, run this cell
ss.spearmanr(leaves[''], leaves[''])

# Great work! Now make sure to interpret your test result in your eLN.