
# Prelab 03 – part 2: Analysis preparation and data collection practice


Please complete Prelab 03 – part 1 on Canvas before working through this notebook.

In [None]:
# Clear all variables, start with a clean environment.
%reset -f

import numpy as np
import data_entry2

This prelab activity introduces useful features of our spreadsheet tool, `data_entry2`, and then shows you how to use Python to calculate the quantities _mean_ (_average_), _standard deviation_, and _(standard) uncertainty of the mean_. 

It starts with a hypothetical example data set to guide you through the use of the relevant Python functions. The work done with the hypothetical data set will not be turned in directly, but will instead set you up to perform the same calculations on some real data, also collected in this prelab. 

❗ <span style="color:#D22B2B"> **NOTE**  
We will now adopt a more conventional approach and write the uncertainty of a measured quantity $x$ as $u(x)$; such that $x$ is reported as $x \pm u(x)$ and that the relative uncertainty is $u_\text{rel}(x) = u(x) / x$.  
In Python the uncertainty of a variable will be noted `u_x` and its relative uncertainty `urel_x`.</span>

## Simple Calculations in data_entry2 cells

It is possible to do some simple calculations directly in the `data_entry2` sheet. In general we want you to do calculations using Python, but for some tasks, most notably recording your uncertainties, it is very convenient to use this feature of the sheet.

As an example, if you measure a mass of 497 g, and estimate a 95% confidence interval of [477, 516] g, you can record the mass and its uncertainty $u(m)$ in your spreadsheet like this:


| m | u_m|
| ------ | ------- |
| g | g |
| 497 | = (516-477)/4|


Alternatively, if you have a rectangular PDF on a balance with a 10 g resolution, you might use something like:

| m | u_m |
| ------ | ------- |
| g | g |
| 142 | = 10/(2 * np.sqrt(3))|
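
The two expressions above can also be checked in an ordinary Python cell. This is a minimal sketch; the names `u_m_ci` and `u_m_rect` are illustrative only and are not created by `data_entry2`.

```python
import numpy as np

# 95% confidence interval [477, 516] g: the full width spans about four
# standard uncertainties, so divide the width by 4.
u_m_ci = (516 - 477) / 4

# Rectangular PDF with a 10 g resolution: half-width divided by sqrt(3).
u_m_rect = 10 / (2 * np.sqrt(3))

print("u(m) from the 95% CI =", u_m_ci, "g")
print("u(m) from the rectangular PDF =", u_m_rect, "g")
```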


**Your turn #1:**

Use the sheet below to try out both of these styles of uncertainty.
- Enter a variable name, $m$ (in grams) for the first column, and $u(m)$ in the second column for the uncertainty. 

- In the next two rows, enter the measurements and expressions to calculate uncertainties as shown in the two examples above.

- Notice that in the sheet interface, you see the formulas you've entered, but that when you `Generate Vectors`, the expressions are evaluated and the generated uncertainty vector contains the results of the calculations.

- Alter one of the expressions in the uncertainty column so that it contains an error - perhaps add an extra ')' at the end of the expression to see what happens.

- To get rid of unused rows and columns, execute (Shift+Enter) in the cell that you used to create the data_entry2 sheet.

In [None]:
de0 = data_entry2.sheet("test_formulas")

## Summary of Prelab 03 – part 1

Here is a summary of the statistics concepts covered or reviewed in Part 1 of this prelab. 

For a distribution of **$N$ data points**:

1. The **average** (or mean) is defined by:
     
$$\bar{x} = \frac{1}{N} \sum_{i=1}^N x_i.$$

2. The **standard deviation** is defined by:

$$ \sigma = \sqrt{\frac{1}{N-1}\sum_{i=1}^N \left(x_i - \bar{x}\right)^2} $$ 

3. For variables that follow a Gaussian distribution (bell curve), 
    - the **68 \% confidence interval** ($\mathrm{CI_{68 \%}}$) is defined as $[\bar{x} - \sigma,~\bar{x} + \sigma]$; and $\sigma = \frac{\mathrm{CI_{68 \%}}}{2}$,
    - the **95 \% confidence interval** ($\mathrm{CI_{95 \%}}$) is defined as $[\bar{x} - 2\sigma,~\bar{x} + 2\sigma]$; and $\sigma = \frac{\mathrm{CI_{95 \%}}}{4}$.

   *Note that most of the time one has several data points from an experiment and so these confidence intervals are not estimated but calculated from the standard deviation of the distribution of the measured values.*

4. We use the **standard deviation** as an indicator of the uncertainty (or the variability) in a **single measurement** and this value does not depend on the number of measurements taken. 

5. The **uncertainty in the mean** (often called the **standard error of the mean**) is given by:

    $$u(\bar{x}) = \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}$$

    We use the uncertainty in the mean as an indicator of the uncertainty (or the variability) in the average of multiple measurements and it does improve as we increase the number of measurements.

💡**Confidence-interval interpretation:** If an experiment is conducted repeatedly (multiple sampling of a population) and you construct a 95 percent (68 percent) confidence interval (CI) for the mean each time, you can anticipate that 95 percent (68 percent) of these CIs will contain the (true) mean of your population.
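
The interpretation above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population mean and standard deviation (`true_mean`, `true_sigma`) and the number of repeated experiments are made-up values, and each interval is built as the sample mean plus or minus twice the uncertainty in the mean.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

true_mean = 435.0     # assumed population mean (made up)
true_sigma = 4.0      # assumed population standard deviation (made up)
n_points = 25         # measurements per simulated experiment
n_experiments = 2000  # number of repeated experiments

# Each row is one simulated experiment with n_points measurements.
samples = rng.normal(true_mean, true_sigma, size=(n_experiments, n_points))

means = np.mean(samples, axis=1)
u_means = np.std(samples, axis=1, ddof=1) / np.sqrt(n_points)

# Does the interval [mean - 2u, mean + 2u] contain the true mean?
contains_true_mean = np.abs(means - true_mean) <= 2 * u_means
coverage = np.mean(contains_true_mean)

print(f"Fraction of intervals containing the true mean = {coverage:.3f}")
```

The printed fraction should land close to 0.95. It will not be exactly 0.95, partly because only a finite number of experiments are simulated and partly because $\sigma$ is itself estimated from each sample.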

## Developing your Python skills

Let's import a spreadsheet of our data, `prelab03_1`.

In [None]:
# Run me to import the spreadsheet, `prelab03_1`, which is found in the same 
# directory as `Prelab03.ipynb`
de1 = data_entry2.sheet('prelab03_1')

Below is a table of the hypothetical data in your imported spreadsheet

**Your turn #2:**
  
Double-check that you have the correct number of data points. It should be 25, but you need to recall that Python indexing starts at 0! 
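
One way to check a count in code is the built-in `len()` function. The sketch below uses a made-up 25-element vector (`example_vec` is a placeholder, not your `dVec`) to show that 25 elements occupy indices 0 through 24.

```python
import numpy as np

example_vec = np.arange(100.0, 125.0)  # 25 made-up values: 100.0, ..., 124.0

n_points = len(example_vec)   # number of elements
last_index = n_points - 1     # indexing starts at 0, so the last index is N - 1

print("Number of data points =", n_points)
print("Index of the last data point =", last_index)
print("Last value =", example_vec[last_index])
```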

### Hypothetical data

| d (mm) |
| ------ |
| 439.3  |
| 431.6  |
| 434.6  |
| 433.3  |
| 439.3  |
| 442.6  |
| 428.6  |
| 441.6  |
| 431.2  |
| 427.6  |
| 433.2  |
| 441.3  |
| 436  |
| 437.6  |
| 434.7  |
| 433.2  |
| 433.1  |
| 431.3  |
| 436  |
| 432.9  |
| 436.5  |
| 437.2  |
| 435.7  |
| 432.6  |
| 434.7  |

## Calculating average and standard deviation using Python numpy functions

**Your turn #3:**

Press the `Generate Vectors` button at the top of your spreadsheet to transfer the data into the Python environment. Then, use the cell below to calculate the average and standard deviation using the `np.mean()` and `np.std()` functions, respectively. Here `np.mean()` takes a single *argument*, which is the vector of values over which to calculate the average. We discuss the second argument in `np.std()` below.

_Note: If it is not working correctly, double-check above that you have correctly titled the single spreadsheet column as `d` and that there is a resulting generated vector `dVec`._

In [None]:
# Run me to calculate average and standard deviation. 
# - Notice how we're able to include descriptive text and units in the print
# commands.
d_avg = np.mean(dVec)
print("Average of d =", d_avg, "mm")
d_std = np.std(dVec, ddof=1)
print("Standard deviation of d =", d_std, "mm")

You should find that the average is 435.028 mm, which is consistent with our estimate of 435 mm from the histogram in Part 1 of the prelab. The standard deviation should be 3.8362872676586677 mm, which would be 3.8 mm if we were to round it to 2 significant figures when we report it. This is also consistent with our estimate of 4 mm using the 95% confidence interval with the histogram earlier.

Note that in `np.std()` we are supplying a second argument, `ddof=1`. This additional argument is needed because the `np.std()` function uses a general formula in its calculation - it can be used for a number of related calculations. In particular the formula it uses is:

$$ \textrm{np.std()} = \sqrt{\frac{1}{N-\textrm{ddof}}\sum_{i=1}^N \left(x_i - \bar{x}\right)^2}. $$

We want $N-1$ in the denominator as per our definition of standard deviation, so we need to use `ddof = 1`:

$$ \sigma = \sqrt{\frac{1}{N-1}\sum_{i=1}^N \left(x_i - \bar{x}\right)^2}. $$

If you are interested, `ddof` is an abbreviation for 'delta degrees of freedom.' As discussed in Lab 01, we use up one 'degree of freedom' from our overall dataset when we calculate the average. Since the average is used in the calculation of standard deviation, we account for this in the formula for standard deviation by dividing the sum of the squared differences between each data point and the mean by $N-1$ instead of $N$. 
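
To see the effect of `ddof` directly, here is a short sketch on a made-up five-point data set (`values` is an illustrative name). With `ddof=0`, the default, the sum of squared differences is divided by $N$; with `ddof=1` it is divided by $N-1$.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up data, N = 5

std_ddof0 = np.std(values)          # default ddof=0: divides by N
std_ddof1 = np.std(values, ddof=1)  # ddof=1: divides by N - 1

print("np.std with ddof=0:", std_ddof0)
print("np.std with ddof=1:", std_ddof1)
```

The `ddof=1` result is always the larger of the two, and the difference shrinks as $N$ grows.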

If you want to control the number of significant figures displayed you can modify the print statement to be an f-string as follows. Recall that we first encountered f-strings in Lab 00.

Within the curly braces, the `:.2` format specifier rounds the variable to the left of the colon, in this case `d_std`, the standard deviation of `d`, to two significant digits when it is displayed. 

In [None]:
# Run me to print d_std with 2 significant figures "{d_std:.2}"
print(f"Standard deviation to 2 sig figs = {d_std:.2} mm")

Let's step back for a moment and think about what the standard deviation represents. Twenty-five measurements were made using the same experimental procedure, so this standard deviation is a measure of the variability in our measurements. In the language we are using in this course, this standard deviation is the single-measurement standard uncertainty of the distance, $u(d_1)$. What does this mean? It means that if we wanted to report the value and uncertainty for one of our measurements of $d$, 434.7 mm for example, we would report it as:

$d_1 = (434.7 \pm 3.8)$ mm.


The subscript '1' is being used here to emphasize that we are talking about a single measurement and not the average. We will look at the uncertainty in the average later.

The variability (the standard deviation) in the 25 measurements that we made describes how confident we should be in any one of the individual values. Instead of estimating our uncertainty from a single measurement as we did with the height of the spring in the first two labs, **the use of repeated measurements can allow us to measure the variability in our measurements directly**.
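
To connect this with the uncertainty in the mean defined in the summary, the sketch below types in the 25 hypothetical distances from the table and computes both quantities. The names `d_table`, `d_table_avg`, `d_table_std`, and `u_d_table_avg` are illustrative and separate from the `dVec` generated by your spreadsheet.

```python
import numpy as np

# The 25 hypothetical distances from the table above, in mm.
d_table = np.array([
    439.3, 431.6, 434.6, 433.3, 439.3,
    442.6, 428.6, 441.6, 431.2, 427.6,
    433.2, 441.3, 436.0, 437.6, 434.7,
    433.2, 433.1, 431.3, 436.0, 432.9,
    436.5, 437.2, 435.7, 432.6, 434.7,
])

d_table_avg = np.mean(d_table)
d_table_std = np.std(d_table, ddof=1)                # single-measurement uncertainty
u_d_table_avg = d_table_std / np.sqrt(len(d_table))  # uncertainty in the mean

print("Average =", d_table_avg, "mm")
print("Standard deviation =", d_table_std, "mm")
print("Uncertainty in the mean =", u_d_table_avg, "mm")
```

With $N = 25$, the uncertainty in the mean is the standard deviation divided by 5, so it comes out at roughly 0.77 mm compared with about 3.8 mm for a single measurement.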

## Calculating average and standard deviation the "long way" using Python

<span style="color:#D22B2B">*In Lab 03, you should perform your calculations the "short way", but we want you to learn how to do it the "long way" as part of the prelab for the following reasons:*</span>

1. Many of the calculations we perform later in this course will not correspond to built-in functions, so it is useful to learn how to do more complicated calculations.
2. Breaking down complicated calculations into several lines of code, as we do in these "long way" calculations, is the strategy that we will be encouraging you to use for most of your coding work going forward in this course.
3. It is often easier to find problems or errors in your calculations if you can look at intermediate values.
4. We will also be giving you a few generally useful tips and skills during this process.


### Calculating an average the "long way"

Let's revisit our equation for calculating an average,

$$\bar{x} = \frac{1}{N} \sum_{i=1}^N x_i.$$

We will break the operation of calculating the average into steps. We will first sum up all the $x_i$ values, then count how many values there are ($N$), and then finally calculate the quotient.

**Your turn #4a:**

Similar to `np.mean()` and `np.std()`, there is a NumPy function for calculating a sum, `np.sum()`. Use this function in the code cell below to define a variable `d_sum` which is the result of the sum over the elements in `dVec`.

In [None]:
# Use this cell to define your variable d_sum
d_sum = 

Next, the built-in Python function `len()` calculates how "long" a vector is, i.e. it counts up the number of elements within the supplied variable. For instance, if you run the code cell below you can see `len()` returns `3` when we supply it with the three-element vector `foo`:

In [None]:
# Run me to see how len() works
foo = np.array([1, 2, 3])
len(foo)

**Your turn #4b:**

Use `len()` in the cell below to define another variable `d_count` which is the result of counting the number of elements in `dVec`.

In [None]:
# Use this cell to define a variable d_count
d_count = 

**Your turn #4c:**

Finally, define the variable `d_avg_long`, which is calculated by dividing `d_sum` by `d_count` to arrive at the average of `dVec` the "long way". Print out the value of `d_avg_long`.

In [None]:
# Use this cell to define d_avg_long. Add a second line of code to print out
# the value.
d_avg_long = 

You should find that you calculated an average distance of 435.028 mm just like when using the short way.

### Calculating standard deviation the "long way"

This equation is a little more involved, but we want you to have some practice with these methods, in addition to taking the time to stop and consider each of the pieces involved in the standard deviation calculation.

Let's look again at our equation for the standard deviation,

$$ \sigma = \sqrt{\frac{1}{N-1}\sum_{i=1}^N \left(x_i - \bar{x}\right)^2}.$$

Our steps, in order, are as follows:

1. Find the average (done!)
2. For each value $x_i$, find the difference between it and the average.
3. Find the square of that difference for each value and then sum up all of those squared differences.
4. Finally, we need to divide that sum by $N-1$ and take the square root. Let's do it!

Let's start with calculating $x_i - \bar{x}$ for each data point (step 2 above). What we want Python to do is take each data point in `dVec` and subtract `d_avg`. Thankfully, this can be done in a single, intuitive line of code. If we were to do this on a calculator, we'd have to make 25 calculations, one for each data point in `dVec`. However, when we supply Python with a 25-element NumPy vector like `dVec` and ask it to subtract a one-element vector or scalar like `d_avg`, it subtracts `d_avg` from each data point in `dVec`.

In [None]:
# Run me to see an example of subtracting a single number from a vector
bar = np.array([1, 2, 3, 4, 5])
print('Dummy data = ', bar)
bar_minus_one = bar - 1
print('Dummy data subtracted by 1 = ', bar_minus_one)

**Your turn #5a:**

Using the example above, define a new Python variable `diff_from_avg` below which subtracts `d_avg` from each element of `dVec`.

In [None]:
# Use this cell to define diff_from_avg
diff_from_avg = 

Going back to the standard deviation formula, we see that we now need to _square_ each of these differences from the average. In Python, the operator that raises a number to a power is two stars, `**`. Again, when we ask Python to square a NumPy vector, it squares each element within the vector. Run the cell below to define the new variable `diff_from_avg_squared`, which squares your previous result.

In [None]:
# Run this cell to define "diff_from_avg_squared", the square of each element
# from the vector diff_from_avg.
diff_from_avg_squared = diff_from_avg**2

**Your turn #5b:**

Our next step is to sum up these squared differences. You already learned how to perform sums in Python using `np.sum()` earlier in calculating the average the "long way". Use `np.sum()` to define a new variable `diff_from_avg_squared_sum` which is the result of summing all the elements from the vector `diff_from_avg_squared`.

In [None]:
# Use this cell to define "diff_from_avg_squared_sum"
diff_from_avg_squared_sum = 

**Your turn #5c:**

Only two steps to go! Recall that because we use one degree of freedom to calculate the average, we divide the sum of the squared differences by $N-1$ instead of $N$ when we calculate the standard deviation.

We already have $N$ calculated and stored in the variable `d_count`, so below define a new variable `d_count_minus_one` which stores $N-1$.

In [None]:
# Use this cell to define "d_count_minus_one"
d_count_minus_one = 

Finally, we can combine everything together by running the code cell below, which takes the square root of the sum of the squared differences divided by $N-1$:

In [None]:
# Run me to finish the "long way" calculation of the standard deviation and
# compare it to the "short way".
d_std_long = np.sqrt(diff_from_avg_squared_sum / d_count_minus_one)
print("Standard deviation (long way) =", d_std_long, "mm")
print("Standard deviation (short way) =", d_std, "mm")

If all went well, you should see identical results for calculating the standard deviation of `dVec` the long or short way.
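
For reference, the same sequence of steps can be sketched end to end on a small made-up vector. The data in `demo` and all of the `demo_...` names are illustrative and are not part of your prelab data.

```python
import numpy as np

demo = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up data

demo_avg = np.sum(demo) / len(demo)           # step 1: average
demo_diff = demo - demo_avg                   # step 2: differences from the average
demo_diff_squared_sum = np.sum(demo_diff**2)  # step 3: sum of the squared differences
demo_std_long = np.sqrt(demo_diff_squared_sum / (len(demo) - 1))  # step 4

demo_std_short = np.std(demo, ddof=1)         # "short way" for comparison

print("Standard deviation (long way) =", demo_std_long)
print("Standard deviation (short way) =", demo_std_short)
```

Keeping each intermediate value in its own variable makes it easy to print any one of them if the final answer looks wrong.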

# Familiarizing yourself with collecting pendulum data (approx. 15 min)

For this lab, we are asking you to collect some initial data using a simulation of the experimental equipment.

Notes:

* You may find it helpful to add some notes about your observations in the "Part B - Start of familiarization" section of your Lab03.ipynb notebook.
* **All of your calculations should use the "short way"** (e.g., `np.std(dVec, ddof=1)`). The "long way" was to help you better understand what the equations are doing and to give you some initial practice doing complicated multi-step calculations, which will come up again later in the course. 

**Your turn #6:**

Open the Pendulum simulation ([link](https://phas.ubc.ca/~sqilabs/Lab03-Pendulum.html)). Play around with the pendulum simulation so that you understand how the pendulum and the timer work. In this prelab, you’ll be taking some initial measurements to determine the period of a pendulum $T$ at a starting amplitude of $15^\circ$.

In theory, the period $T$ is defined as the time taken for one complete cycle of the pendulum’s motion, that is, returning to the same initial position while also travelling in the same initial direction.

Here are things to consider when planning your measurements:

1. In practice, to reduce the uncertainty of your measurements, it is more advantageous to measure multiple periods at once over multiple pendulum cycles and to repeat your experiment over multiple trials than to measure a single pendulum period.
2. Therefore, you will have a design choice to make: how many cycles, $M_{\text{cycles}}$, will be counted in each of your trials. Be sure to record your choice for $M_{\text{cycles}}$ as a Python variable (i.e., `m_cycles = ` in a code cell).
3. Start a fresh spreadsheet below for data collection. In the new spreadsheet you will record the time $\Delta t$ (in Python, `delta_t`) taken for $M_{\text{cycles}}$ cycles of the pendulum in each trial.
4. Set an external timer and give yourself **5 minutes** total to collect data:
   1. Set the initial release amplitude to $15^\circ$.
   2. Start your first trial and record $\Delta t$ **directly in your spreadsheet in this notebook**. 
   3. Repeat your $\Delta t$ measurement **as many times as you can** in 5 minutes, and record each measurement in the spreadsheet. The number of data points you collected for $\Delta t$ is your number of trials, $N_{\text{trials}}$ (record it as the Python variable `n_trials = ` in a code cell).
   
5. After your 5 minutes of data collection are finished, press `Generate Vectors` to create a vector with your data.
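
To illustrate point 1 in the list above, here is a rough sketch of why timing several cycles per trial helps. It assumes, purely for illustration, that each timed interval carries a fixed uncertainty `u_delta_t` (for example from starting and stopping the timer) regardless of how many cycles are counted; the value 0.2 s is made up.

```python
u_delta_t = 0.2  # assumed timing uncertainty per trial, in s (made up)

# Uncertainty in one period from a single trial, for a few choices of M_cycles.
u_t_by_cycles = {m: u_delta_t / m for m in [1, 5, 10]}

for m, u in u_t_by_cycles.items():
    print("m_cycles =", m, "-> u(T) from one trial =", u, "s")
```

Under this assumption, counting 10 cycles reduces the single-trial uncertainty in $T$ by a factor of 10. The trade-off is that longer trials leave time for fewer trials within the 5 minutes.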

In [None]:
# Use this cell to create a new spreadsheet, prelab03_2, for data collection
de2 = data_entry2.sheet('prelab03_2')

Now, let's calculate the period of the pendulum from your measurements. In the case of our experiment, the period $T$ (in Python we will call it `t`) of the pendulum is given by the average of the periods derived from each of the trials.

Let $T_i$ be the period from the $i$-th trial:

$$ T_i = \frac{\Delta t_i}{M_{\text{cycles}}},$$

where $\Delta t_i$ is the time taken for $M_{\text{cycles}}$ cycles of the pendulum in the $i$-th trial. Then, $T$ is given by:

$$ T = \overline{T_i},$$

where the $~\bar{}~$ sign represents the average (as seen before), such that:

$$ \overline{T_i} = \frac{1}{N_{\text{trials}}} \sum_{i=1}^{N_{\text{trials}}} T_i,$$
$$ \overline{T_i} = \frac{1}{N_{\text{trials}}} \sum_{i=1}^{N_{\text{trials}}} \frac{\Delta t_i}{M_{\text{cycles}}},$$

since $M_{\text{cycles}}$ is a constant, it can be factored out of the sum $\sum$, 
$$ \overline{T_i} =  \frac{1}{M_{\text{cycles}}} \frac{1}{N_{\text{trials}}} \sum_{i=1}^{N_{\text{trials}}} \Delta t_i.$$

We can now identify the average of the $\Delta t$ measurements, that we call $\overline{\Delta t}$:
$$ \overline{\Delta t} = \frac{1}{N_{\text{trials}}} \sum_{i=1}^{N_{\text{trials}}} \Delta t_i,$$

such that

$$  \overline{T_i} = \frac{\overline{\Delta t}}{M_{\text{cycles}}}.$$

Consequently, the period of the pendulum is given by:

$$ \boxed{T = \frac{\overline{\Delta t}}{M_{\text{cycles}}}}.$$

In Python, for $\overline{\Delta t}$ we will be using the following variable name `delta_t_avg`.
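
As a numerical check of the derivation above, the sketch below confirms that averaging the per-trial periods gives the same result as dividing the average of $\Delta t$ by $M_{\text{cycles}}$. The times in `delta_t_demo` and the choice `m_cycles_demo = 5` are made up for illustration.

```python
import numpy as np

m_cycles_demo = 5                                        # made-up choice of M_cycles
delta_t_demo = np.array([7.12, 7.05, 7.21, 7.09, 7.16])  # made-up times, in s

t_i_demo = delta_t_demo / m_cycles_demo   # period from each trial, T_i
t_from_trials = np.mean(t_i_demo)         # average of the T_i

delta_t_avg_demo = np.mean(delta_t_demo)
t_from_avg = delta_t_avg_demo / m_cycles_demo  # boxed formula

print("Average of the T_i =", t_from_trials, "s")
print("Average of delta_t divided by M_cycles =", t_from_avg, "s")
```

The two printed values agree up to floating-point rounding.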



The uncertainty $u(T)$ (in Python `u_t`) in the pendulum period $T$ is given by:

$$ u(T) = \sigma(T), $$

where

$$ \sigma(T) = \frac{\sigma(\overline{\Delta t})}{M_{\text{cycles}}}, $$

where $\sigma(\overline{\Delta t})$ is the standard error of the mean $\overline{\Delta t}$ which is the average of your $\Delta t$ measurements. Remembering the standard error of the mean formula from the beginning of this prelab, we get:

$$ \boxed{u(T) = \frac{\sigma(\Delta t)}{M_{\text{cycles}} \sqrt{N_{\text{trials}}}}}, $$

where $\sigma(\Delta t)$ is the standard deviation of your distribution of $\Delta t$ measurements across your $N_{\text{trials}}$.

In Python, we will write `delta_t_std` for $\sigma(\Delta t)$.
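
The boxed formula for $u(T)$ can be cross-checked in the same way. In this sketch, which again uses made-up times and illustrative `..._demo` names, the boxed expression is compared with the standard error of the mean computed directly from the per-trial periods $T_i$.

```python
import numpy as np

m_cycles_demo = 5                                        # made-up choice of M_cycles
delta_t_demo = np.array([7.12, 7.05, 7.21, 7.09, 7.16])  # made-up times, in s
n_trials_demo = len(delta_t_demo)

# Boxed formula: standard deviation of delta_t divided by M_cycles * sqrt(N_trials).
delta_t_std_demo = np.std(delta_t_demo, ddof=1)
u_t_boxed = delta_t_std_demo / (m_cycles_demo * np.sqrt(n_trials_demo))

# Cross-check: standard error of the mean of the per-trial periods T_i.
t_i_demo = delta_t_demo / m_cycles_demo
u_t_check = np.std(t_i_demo, ddof=1) / np.sqrt(n_trials_demo)

print("u(T) from the boxed formula =", u_t_boxed, "s")
print("u(T) from the per-trial periods =", u_t_check, "s")
```

Both routes give the same number because dividing every $\Delta t_i$ by the constant $M_{\text{cycles}}$ also divides the standard deviation by $M_{\text{cycles}}$.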


**Your turn #7:**

Using the data you just collected, calculate the pendulum period $T$ (`t`), the uncertainty in the period $u(T)$ (`u_t`), and the relative uncertainty in the period $u(T) / T$ (`urel_t`).

In [None]:
# Use this cell (and additional ones if needed) to define and calculate t, 
# u_t, and urel_t.
t = 
u_t = 
urel_t = 


**Your turn #8:**

Report the values calculated above in the "Prelab 03 – shared data" spreadsheet. The spreadsheet link is in the description of the "Prelab 03 – part 2" page in Canvas.

❗Remember the rule about significant figures as seen in Lab 02.

## Preparing your Lab 03 notebook
In this final set of tasks you will prepare your Lab 03 notebook for data collection and analysis.

**Your turn #9:**

1. Open the Lab 03 Instructions on Canvas and take a couple minutes to read through them so that you have a sense of how you will be spending your time during the lab.
2. Focusing on Part C.1, open up your Lab 03 notebook and notice that we have again provided you with a ready-to-go spreadsheet with two columns for data entry. Instead of just `delta_t` from the prelab (for 15°), we have specified `delta_t_10` and `delta_t_20` since in the lab you will be collecting data at two different angles.
3. In the provided spreadsheet, make up a few rows of test data for these two angles and press `Generate Vectors`.
4. Copy in and modify your code as needed from this prelab so that you can calculate the average periods `t_10` and `t_20`, as well as their uncertainties `u_t_10` and `u_t_20`, and relative uncertainties `urel_t_10` and `urel_t_20`. Note that you will need to specify or extract your values for `m_cycles` and `n_trials` to be able to do these calculations.
5. Test your code in your Lab 03 notebook using the provided prelab data to ensure you are getting the same values in that notebook as in this one.

You should now be ready for data collection and data analysis in the lab.

# Submit

Steps for submission:

1. Click: Run => Run_All_Cells
2. Read through the notebook to ensure all the cells executed correctly and without error.
3. Correct any errors you find.
4. File => Save_and_Export_Notebook_As->HTML
5. Upload the HTML document to the lab submission assignment on Canvas.

In [None]:
# The following function will display tables based on the data currently
# stored in your data_entry2 spreadsheets. Please do not modify this cell.
display_sheets()