# Lab 03 Prelab, Part 2

Please complete Part 1 of the prelab on Canvas before working through this notebook.

In [None]:
# After executing this cell successfully once, please comment out the next two lines:
!wget -N --quiet https://www.phas.ubc.ca/~michal/data_entry.py # download data_entry.py
%run data_entry.py  # install it.

# but leave these lines alone:
import numpy as np
import data_entry

## Welcome to the Lab 03 Prelab, Part 2: notebook and initial data collection

This document walks you through how to calculate the average, standard deviation, and (standard) uncertainty of the mean in a Python notebook.

Both the ideas and the skills discussed in this prelab assignment are extremely important to understand in order to be successful in Lab 03 and later labs. 

This prelab activity starts by using a hypothetical example data set to guide you through the use of the relevant Python functions. The work done with the hypothetical data set will not be handed in directly, and instead will set you up to perform these same calculations on some real data, also collected in this prelab. 

## Summary of Part 1 of the prelab

Part 1 of this week’s prelab was found in the Prelab section of Canvas in the Lab 03 module. It was a guided lesson on statistics concepts to prepare you for Part 2 of the prelab. Here is a summary of the statistics concepts covered or reviewed in Part 1 of this prelab:

a. Average is given by
     
$$x_{ave} = \frac{1}{N} \sum_{i=1}^N x_i$$

b. For variables that follow a Gaussian distribution, approximately 68% of the values lie within the range $x_{ave} - \sigma$ to $x_{ave} + \sigma$ **(68% CI)**

c. Approximately 95% of the values will lie within the range $ x_{ave} - 2\sigma$ to $x_{ave} + 2\sigma$ **(95% CI)**

d. Standard deviation is given by 

$$ \sigma = \frac{95\% \,\mathrm{CI}}{4} = \sqrt{\frac{1}{N-1}\sum_{i=1}^N \left(x_i - x_{ave}\right)^2} $$

e. We use the standard deviation as an indicator of the uncertainty (or the variability) in a single measurement and it does not depend on the number of measurements taken. 

f. Uncertainty of the mean (often called standard error of the mean) is given by

$$\sigma_m = u[x_{ave}] = \frac{\sigma}{\sqrt{N}}.$$

We use uncertainty of the mean as an indicator of the uncertainty (or the variability) in the average of multiple measurements and it does improve as we increase the number of measurements.
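Points e and f can be illustrated with a small simulation. The sketch below is only an illustration under assumed numbers: the simulated "measurements" (a true value of 435 mm with a spread of 4 mm), the seed, and the helper name `spreadAndUncertaintyOfMean` are all made up for this example and are not part of the lab.

```python
import numpy as np

def spreadAndUncertaintyOfMean(data):
    """Return (standard deviation, uncertainty of the mean) for a 1-D data vector."""
    N = len(data)
    sigma = np.std(data, ddof=1)    # single-measurement uncertainty
    sigmaM = sigma / np.sqrt(N)     # uncertainty of the mean
    return sigma, sigmaM

# Hypothetical simulated measurements (assumed values, not real data)
rng = np.random.default_rng(seed=0)
fewVec = rng.normal(loc=435.0, scale=4.0, size=25)
manyVec = rng.normal(loc=435.0, scale=4.0, size=2500)

sigmaFew, sigmaMFew = spreadAndUncertaintyOfMean(fewVec)
sigmaMany, sigmaMMany = spreadAndUncertaintyOfMean(manyVec)

print('N = 25:   sigma =', sigmaFew, ' sigma_m =', sigmaMFew)
print('N = 2500: sigma =', sigmaMany, ' sigma_m =', sigmaMMany)
```

With these assumptions, both standard deviations should come out in the neighbourhood of 4 mm, while the uncertainty of the mean should be roughly ten times smaller for the larger data set.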

## Developing your Python skills

Use the data_entry library in the code block below to load in a blank spreadsheet for the purposes of this prelab (consult Lab 2 for help if you forget how to do this). Title the spreadsheet 'lab03_prelab_data1' and the first column variable as 'd' with units of 'mm'.

In [None]:
de=data_entry.sheet('lab03_prelab_data1')

Below is a table of some hypothetical data. Add enough rows to your spreadsheet to hold 25 data points. (Recall that Python indexing starts at 0!) If you highlight the table below and use the copy command (Ctrl + C on a PC, Command + C on a Mac), you can then select the first data row of your spreadsheet and **repeatedly** press the paste command (Ctrl + V on a PC, Command + V on a Mac) to paste the data into your spreadsheet. Paste works a little strangely here: each "paste" will paste in one more value, so you'll need to press it 25 times.

### Hypothetical data

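| d (mm) |
| ------ |
| 439.3 |
| 431.6 |
| 434.6 |
| 433.3 |
| 439.3 |
| 442.6 |
| 428.6 |
| 441.6 |
| 431.2 |
| 427.6 |
| 433.2 |
| 441.3 |
| 436 |
| 437.6 |
| 434.7 |
| 433.2 |
| 433.1 |
| 431.3 |
| 436 |
| 432.9 |
| 436.5 |
| 437.2 |
| 435.7 |
| 432.6 |
| 434.7 |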

## Calculating average and standard deviation using Python

Press the 'Generate Vectors' button below your spreadsheet to transfer the data into the Python environment. Both average and standard deviation can be calculated the "long way", but first we will use the NumPy functions 'np.mean()' and 'np.std()' as a shortcut to calculate these two quantities. These functions require you to specify an *argument*, that is, a vector of data over which the function will calculate the average or the standard deviation.

If you correctly titled the single column of data 'd', then the vector 'dVec' should now be in the Python environment. The code block below will calculate and print out both the average and standard deviation of the data in 'dVec'.

In [None]:
dAvg = np.mean(dVec)
print('Average of data = ', dAvg)

dStd = np.std(dVec, ddof=1)
print('Standard deviation of data = ', dStd)

If you copied the data over properly, you should find that the average is 435.028 mm, which is consistent with our earlier estimate of 435 mm from the histogram. The standard deviation should be 3.8362872676586677 mm (3.8 mm if we were to round it to 2 significant figures when we report it). As a check, this is also consistent with our earlier estimate of 4 mm using the 95% Confidence Interval with the histogram.

Note that in 'np.std()' we are supplying an additional argument, 'ddof=1'; this additional argument is needed because the np.std() function uses a general formula in its calculation - it can be used for a number of related calculations. In particular the formula it uses is:

$$ \textrm{np.std()} = \sqrt{\frac{1}{N-\textrm{ddof}}\sum_{i=1}^N \left(x_i - x_{ave}\right)^2} $$

We want:
$$ \sigma = \sqrt{\frac{1}{N-1}\sum_{i=1}^N \left(x_i - x_{ave}\right)^2} $$
So the 1 in ddof = 1 corresponds to the 1 in the $N-1$ in the denominator of the definition of the standard deviation. 
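To see the effect of 'ddof' concretely, here is a minimal sketch on a three-element dummy vector. The vector 'baz' and the variable names are made up for this illustration; only 'np.std()' and its 'ddof' argument come from NumPy.

```python
import numpy as np

baz = np.array([2.0, 4.0, 6.0])      # dummy data with average 4

# Squared differences from the average are 4, 0, 4, which sum to 8.
stdDdof0 = np.std(baz)               # default ddof=0: sqrt(8 / 3)
stdDdof1 = np.std(baz, ddof=1)       # ddof=1:         sqrt(8 / 2) = 2.0

print('np.std with ddof=0 = ', stdDdof0)
print('np.std with ddof=1 = ', stdDdof1)
```

The 'ddof=1' result is the one that matches the definition of $\sigma$ used in this course.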

What happens to $\sigma$ if $N=1$?

> If you are interested, ddof is an abbreviation for 'delta degrees of freedom.'  We use one 'degree of freedom' from our dataset when we calculate the average. Since the average is used in the calculation of standard deviation, we control for this in the formula for standard deviation by dividing the sum of the squared differences between each data point and the mean by $N-1$ instead of $N$. 

If you want to control the number of significant figures displayed you can modify the print statement as follows:

In [None]:
print('Standard deviation to 2 sig figs = {:.2}'.format(dStd))

Within the curly braces, the ':.2' tells the 'format()' method to round the variable supplied to it - in this case 'dStd', the standard deviation of 'd' - to two significant figures.
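The short sketch below compares a few format specifiers. The variable names are made up for this illustration, and the numbers are simply the standard deviation and average quoted earlier typed in by hand, so the cell runs on its own.

```python
dStdExample = 3.8362872676586677   # standard deviation quoted earlier, typed in by hand
dAvgExample = 435.028              # average quoted earlier, typed in by hand

twoSigFigs = '{:.2}'.format(dStdExample)     # 2 significant figures -> '3.8'
twoDecimals = '{:.2f}'.format(dStdExample)   # 2 digits after the decimal point -> '3.84'
bigTwoSigFigs = '{:.2}'.format(dAvgExample)  # switches to scientific notation -> '4.4e+02'

print('Two significant figures: ', twoSigFigs)
print('Two decimal places:      ', twoDecimals)
print('Large value, 2 sig figs: ', bigTwoSigFigs)
```

Note the trailing 'f' in the second case: ':.2f' counts digits after the decimal point, while ':.2' counts significant figures and may fall back to scientific notation for larger values.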

Let's step back for a moment and think about what the standard deviation represents. Twenty-five measurements were made using the same experimental procedure, so this standard deviation is a method we can use to represent the variability in our measurements. In the language we are using in the lab, this standard deviation is the single-measurement standard uncertainty of the distance, $u[d_1]$. What does this mean? It means that if we wanted to report the value and uncertainty for one of our measurements of $d$, 434.7 mm for example, we would report it as:

$$ d_1 = (434.7 \pm 3.8) \, mm$$

The subscript '1' is being used here to emphasize that we are talking about a single measurement and not the average. We will look at the uncertainty in the average later.
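If you would like Python to print a result in this "value ± uncertainty" form, one possible approach is sketched below. The variable names are hypothetical and the numbers are typed in by hand from the example above; the choice of one decimal place is an assumption that happens to suit this particular value and uncertainty.

```python
d1 = 434.7                   # one of the individual measurements, in mm
ud1 = 3.8362872676586677     # standard deviation quoted earlier, in mm

# ':.1f' keeps one digit after the decimal point for both numbers
report = 'd_1 = ({:.1f} ± {:.1f}) mm'.format(d1, ud1)
print(report)
```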

The variability (the standard deviation) in the 25 measurements that we made tells us how confident we should be in any one of the individual values. Instead of estimating our uncertainty from a single measurement as we did with the height of the spring in the first two labs, the use of repeated measurements can allow us to measure the variability in our measurements in a more rigorous way.

## Calculating average and standard deviation the "long way" using Python

*In the lab, you do not need to perform your calculations the "long way", but we want you to learn how to do it this way as part of the prelab for the following reasons:*

1. Many of the calculations we perform later in this course will not correspond to built-in functions, so it is useful to learn how to do more complicated calculations.
2. Breaking down complicated calculations into several lines of code, as in these “long way” calculations, is the strategy that we will be encouraging you to use for most of your coding work going forward in this course.
3. We will also be giving you a few tips and skills here that you will find generally useful.
4. It is often easier to find problems or errors in your calculations if you can look at intermediate values.

### Calculating average the "long way"

Let's revisit our equation for calculating average,

$$x_{ave} = \frac{1}{N} \sum_{i=1}^N x_i$$

We will break the operation of calculating the average into steps. We will first sum up all the $x_i$ values, then count how many values there are ($N$), and finally calculate the quotient.

Similar to 'np.mean()' and 'np.std()', there is a NumPy function for calculating a sum, namely 'np.sum()'. Use this function in the code cell below to define a variable 'dSum' which is the result of the sum over 'dVec'.

Next, the built-in Python function 'len()' calculates how "long" a vector is, i.e. it counts up the number of elements within the supplied variable. For instance, if you run the code cell below you can see 'len()' returns '3' when we supply it with the three-element vector 'foo':

In [None]:
foo = np.array([1, 2, 3])
len(foo)

Use 'len()' in the code cell below to define another variable 'dCount' which is the result of counting the number of elements in 'dVec'.

Finally, divide 'dSum' by 'dCount' in the code cell below to arrive at the average of 'dVec' the "long way".

You should find that you calculated an average distance of 435.028 mm just like when using the short way.
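If you get stuck, the same three steps can be tried first on the dummy vector 'foo' from above. This is only a sketch of the pattern; the names 'fooSum', 'fooCount' and 'fooAvgLong' are made up for this illustration.

```python
import numpy as np

foo = np.array([1, 2, 3])

fooSum = np.sum(foo)              # 1 + 2 + 3 = 6
fooCount = len(foo)               # 3 elements
fooAvgLong = fooSum / fooCount    # 6 / 3 = 2.0

print('Average (long way)  = ', fooAvgLong)
print('Average (short way) = ', np.mean(foo))
```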

### Calculating standard deviation the "long way"

This equation is a little more involved, but we want you to have some practice with these methods in addition to having to stop and think a bit about each of the pieces involved in doing the standard deviation calculation.

Let's look again at our equation for the standard deviation,

$$ \sigma = \sqrt{\frac{1}{N-1}\sum_{i=1}^N \left(x_i - x_{ave}\right)^2} $$

We need to first find the average (done!), then for each value $x_i$ find the difference between it and the average, then find the square of that difference for each value, then sum up all of those squared differences, divide that sum by $N-1$ and finally take the square root. Let's do it!

Starting with calculating $x_i - x_{ave}$. What we want Python to do is take each data point in 'dVec' and subtract off 'dAvg'. Thankfully, this can be done in a single, intuitive line of code. If we were to do this in a calculator, we'd have to make 25 calculations - one for each data point in 'dVec'. However, Python is smart enough that when we supply it with a 25-element vector like 'dVec' and ask it to subtract off a one-element vector (or scalar) like 'dAvg', then it knows that you want to subtract 'dAvg' from each data point in 'dVec'.

Run the example code cell below to see how this works.

In [None]:
bar = np.array([1, 2, 3, 4, 5])
print('Dummy data = ', bar)

barMinusOne = bar - 1
print('Dummy data subtracted by 1 = ', barMinusOne)

Using this example, define a new Python variable 'diffFromAvg' below which subtracts off 'dAvg' from each element of 'dVec'.

Going back to the standard deviation formula, we see that we now need to square each of these differences from the average. In Python, the operator that raises a number to a power is two stars. Again, Python is smart enough to know that when we ask it to square a vector, it should square each element within the vector. Run the cell below to define the new variable 'diffFromAvgSquared', which squares your previous result.

In [None]:
diffFromAvgSquared = diffFromAvg**2

Our next step is to sum up these squared differences. You already learned how to perform sums in Python using 'np.sum()' earlier in calculating the average the "long way". Use 'np.sum()' to define a new variable 'sumSquaredDiffs' which is the result of summing over 'diffFromAvgSquared'.

Only two steps remaining! Recall that because we use one degree of freedom to calculate the average, which is used in the formula for standard deviation, we divide the sum of the squared differences by $N-1$ instead of $N$. We already have $N$ calculated and stored in the variable 'dCount', so below define a new variable 'dCountMinusOne' which stores $N-1$.

Finally, we can combine everything together by running the code cell below, which takes the square root of the sum of the squared differences divided by $N-1$:

In [None]:
dStdLong = np.sqrt( sumSquaredDiffs / dCountMinusOne )
print('Standard deviation (long way) = ', dStdLong)
print('Standard deviation (short way) = ', dStd)

If all went well, you should see identical results for calculating the standard deviation of 'dVec' the long or short way.
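For reference, here is the whole "long way" chain in one place, applied to the five-element dummy vector 'bar' rather than to 'dVec'. It is a sketch of the pattern only; every variable name starting with 'bar' is made up for this illustration.

```python
import numpy as np

bar = np.array([1, 2, 3, 4, 5])

barAvg = np.sum(bar) / len(bar)                  # average = 3.0
barDiffFromAvg = bar - barAvg                    # [-2, -1, 0, 1, 2]
barDiffFromAvgSquared = barDiffFromAvg**2        # [4, 1, 0, 1, 4]
barSumSquaredDiffs = np.sum(barDiffFromAvgSquared)   # 10.0
barCountMinusOne = len(bar) - 1                  # N - 1 = 4

barStdLong = np.sqrt(barSumSquaredDiffs / barCountMinusOne)   # sqrt(2.5)
barStdShort = np.std(bar, ddof=1)

print('Standard deviation (long way)  = ', barStdLong)
print('Standard deviation (short way) = ', barStdShort)
```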

# Collecting your first set of data (approx. 15 min)

For this lab, we are asking you to collect some initial data using a simulation of the experimental equipment.

Notes:

You may find it helpful to add some notes about your observations in the space before the “GROUP DISCUSSION ABOUT PRELAB MEASUREMENTS” section of your Lab 03 notes.

All of your calculations for this analysis and all future analyses should use the “short way” ('np.std(dVec, ddof=1)', etc). The “long way” was intended to help you better understand what the equations are doing and to give you some initial practice with doing calculations by column, which will come up again later in the course. 

Please open the Pendulum simulation (link found on Canvas in the Lab 03 module). Play around with the pendulum simulation so that you understand how the pendulum and the timer work. In this prelab, you’ll be taking some initial measurements to determine the period of a pendulum (T) at a starting amplitude of $15^\circ$. Here are things to consider when planning your first set of measurements:

1. Remember that the period, T,  is defined as one complete cycle of the pendulum’s motion, returning to the same initial position while also travelling in the same initial direction.
2. Once you have figured out how to use the timer and pendulum, you will have a design choice to make: you need to decide how many swings back and forth (Mswings) will be counted in each of your trials (Ntrials). Be sure to record Mswings as a Python variable (i.e., have something like: `Mswings = <value>` in a code cell).
3. Start a fresh spreadsheet below for data collection (make sure the name **is different** from the name used for the earlier spreadsheet above). In the new spreadsheet you will record the time taken for Mswings swings of the pendulum in each trial.
4. Set an external timer and give yourself 7 minutes total to collect data.
   1. Start with a release amplitude of $15^\circ$. Record the time taken, t, for the pendulum to complete Mswings cycles. We will refer to this as your "measured time" or just "time."
   2. Repeat your measurement as many times as you can in 7 minutes. We will refer to the number of data points you collected as your number of trials, Ntrials.
   
After your 7 minutes of data collection are finished:

5. Press the 'Generate Vectors' button to create a vector with your data.
6. In a new code cell below your new spreadsheet, calculate the average time for Mswings swings (tave) and the average period (Tave).
7. Calculate utave ($u[t_{ave}]$) and uTave ($u[T_{ave}]$), the uncertainties of the means for tave and Tave.
8. Calculate relutave and reluTave, the relative uncertainties in tave and Tave.
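One possible shape for steps 6 to 8 is sketched below with made-up numbers. Everything specific here is an assumption for illustration: the vector name 'tVec' (yours depends on how you title your spreadsheet column), the five hypothetical times, and the choice of 10 swings. The sketch also assumes the swing count is exact, so that the period and its uncertainty are both obtained by dividing by Mswings.

```python
import numpy as np

# Hypothetical values, NOT real data: replace with your own measurements
Mswings = 10
tVec = np.array([14.2, 14.5, 14.3, 14.4, 14.1])   # time for Mswings swings, in s

Ntrials = len(tVec)

tave = np.mean(tVec)                              # average time for Mswings swings
utave = np.std(tVec, ddof=1) / np.sqrt(Ntrials)   # uncertainty of the mean of t

Tave = tave / Mswings                             # average period
uTave = utave / Mswings                           # assumes Mswings is counted exactly

relutave = utave / tave                           # relative uncertainty in tave
reluTave = uTave / Tave                           # relative uncertainty in Tave

print('tave = ', tave, ' utave = ', utave, ' relutave = ', relutave)
print('Tave = ', Tave, ' uTave = ', uTave, ' reluTave = ', reluTave)
```

Under these assumptions the two relative uncertainties come out equal, since tave and Tave differ only by the constant factor Mswings.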


## Share your prelab results

We will use everybody's shared prelab results as the basis for a discussion about measurement design during the lab. Please add your results to the  Prelab tab of this week's shared student results spreadsheet (find the link on Canvas).

# Submission
As we've done before, rerun your notebook, proofread, export as HTML and then submit to Canvas.

In [None]:
display_sheets()