# Averaging a subset of data

In this notebook we'll demonstrate how to compute the average of a chunk of data taken from a larger dataset.
We start by loading the Python packages.

# The Jupyter Notebook
First of all, we import the Python packages.

In [1]:
# python packages
import pandas as pd # DataFrames and reading CSV files
import numpy as np # Numerical libraries
import matplotlib.pyplot as plt # Plotting library
from lmfit import Model # Least squares fitting library

We then read a data file into a DataFrame and rename the columns.

In [2]:
data = pd.read_csv("random1.csv")
data.columns = ("X","Y")
print(data)

         X          Y
0      0.0   8.318571
1      1.0  11.950092
2      2.0   7.428268
3      3.0   6.325524
4      4.0  11.849234
..     ...        ...
995  995.0  10.243258
996  996.0  10.029642
997  997.0   8.203106
998  998.0  12.357048
999  999.0   7.434732

[1000 rows x 2 columns]


The most common scenario is to compute the average of a chunk of data, discarding the initial and/or final part of the data set. We can therefore define two variables: the index of the first point to be included in the average and the total number of points to be averaged. Alternatively, one could set the index of the last point to be included in the average.
Remember that Python starts counting from zero.

In [3]:
# Total number of points
totalNumberOfValues = len(data["Y"]) 
# First element to be included in the average
firstValue = 0
# Number of elements to be included in the average
numberOfValuesToAverage = 3 
# Last element to be included in the average
lastValue = firstValue + numberOfValuesToAverage - 1
print("Total number of points in the DataFrame        :",totalNumberOfValues)
print("First element to be included in the average    :",firstValue)
print("Last element to be included in the average     :",lastValue)
print("Number of values to be included in the average :",
      numberOfValuesToAverage)

Total number of points in the DataFrame        : 1000
First element to be included in the average    : 0
Last element to be included in the average     : 2
Number of values to be included in the average : 3
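
The alternative mentioned above, where we fix the last index instead of the number of points, can be sketched as follows. The bounds 100 and 499 are hypothetical values chosen only for illustration, and the total of 1000 points is taken from the output above.

```python
# Hypothetical bounds, chosen only for illustration
totalNumberOfValues = 1000   # size of the data set used in this notebook
firstValue = 100             # first element to be included in the average
lastValue = 499              # last element to be included (inclusive)

# Basic sanity check on the chosen interval
if not 0 <= firstValue <= lastValue < totalNumberOfValues:
    raise ValueError("The interval must lie inside the data set")

# Both ends are included, hence the +1
numberOfValuesToAverage = lastValue - firstValue + 1
print("Number of values to be included in the average :",
      numberOfValuesToAverage)
```

Because both the first and the last index are included, the number of points is the difference between the two indices plus one.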


Let's print the values in the second column that correspond to the interval we have chosen.
We can also check that they are what we expect.

In [4]:
values = data.iloc[firstValue:lastValue+1]["Y"].values
print(values)

[ 8.31857138 11.95009171  7.42826837]


* Note how in the cell above we used a different syntax for selecting the elements of the DataFrame, **iloc[:]["Y"]**. It is equivalent to the code in the next cell.
* Also note how we used **.values** to convert the selected column to a NumPy array.

In [5]:
v0 = data["Y"].values
v1 = v0[firstValue:lastValue+1]
print(v1)

[ 8.31857138 11.95009171  7.42826837]
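
As a quick check that these ways of selecting the data agree, here is a minimal sketch on a small synthetic DataFrame. The DataFrame `demo` and its contents are assumptions made for illustration, standing in for the file read above. The third variant, which selects the column first and uses **.to_numpy()**, is not used in the original notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the data file (assumption: same two-column layout)
demo = pd.DataFrame({"X": np.arange(10.0), "Y": np.arange(10.0) * 2.0})
first, last = 2, 5

# Slice the rows by position, then pick the column
a = demo.iloc[first:last + 1]["Y"].values
# Convert the whole column to an array, then slice the array
b = demo["Y"].values[first:last + 1]
# Pick the column, slice it by position, then convert to an array
c = demo["Y"].iloc[first:last + 1].to_numpy()

print(a)
print(np.array_equal(a, b), np.array_equal(a, c))
```

In all three cases the slice stops before `last + 1`, so the element at index `last` is included.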


We can now compute the average of the numbers in the array using the **mean** function in NumPy.

In [6]:
average = np.mean(values)
print("Average :",average)

Average : 9.232310487540373
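
To make explicit what **mean** computes, the sketch below repeats the calculation by hand on the three values printed earlier. The numbers are typed in as displayed, so they are rounded to eight decimals and the result can differ from the one above in the last digits.

```python
import numpy as np

# The three values printed by the cells above (rounded as displayed)
values = np.array([8.31857138, 11.95009171, 7.42826837])

# Mean computed by hand: sum of the elements divided by their number
manualAverage = values.sum() / len(values)

# np.mean and the array method .mean() give the same number
print("Manual average :", manualAverage)
print("np.mean        :", np.mean(values))
print("values.mean()  :", values.mean())
```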


For some types of statistical analysis, like bootstrapping, we might be interested in randomly selecting a subset of data, to reduce human bias in the analysis. To do this we can use the **random.choice()** function in NumPy to create an array of _numberOfValues_ random integers between 0 (included) and the size of our sample, _totalNumberOfValues_ (excluded). With **replace=False** each index is picked at most once.
This array will contain the indices of the elements that we'll pick from our global array.

In [7]:
numberOfValues = 20 
randomIndices = np.random.choice(totalNumberOfValues, 
                                 replace=False, 
                                 size=numberOfValues)
print(randomIndices)

[914 133 569 557 343 149 542 321 455 297 682 469   0 383 660 583   5 432
 847 413]
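
Because the indices are random, the output above changes every time the cell is run. If a repeatable selection is needed, one option, not used in the original notebook, is NumPy's Generator interface with a fixed seed. The seed 12345 below is an arbitrary choice.

```python
import numpy as np

totalNumberOfValues = 1000  # size of the data set used in this notebook
numberOfValues = 20

# Arbitrary seed (assumption): any fixed integer makes the draw repeatable
rng = np.random.default_rng(12345)
randomIndices = rng.choice(totalNumberOfValues,
                           size=numberOfValues,
                           replace=False)
print(randomIndices)

# Re-creating the generator with the same seed reproduces the same indices
rngAgain = np.random.default_rng(12345)
sameIndices = rngAgain.choice(totalNumberOfValues,
                              size=numberOfValues,
                              replace=False)
print(np.array_equal(randomIndices, sameIndices))
```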


We can then use that array of indices to create an array with the data that we are going to average.

In [8]:
randomValues = data.iloc[randomIndices]["Y"].values
print(randomValues)

[14.43290722 10.01038488 12.82186055  7.99014575 10.12693564 13.87310542
 19.70650205 12.37137983 17.24604556 18.73483869 19.8881469   7.75569392
  8.31857138  6.03471242 18.79729107 10.13443119  8.55755266 19.98451194
 18.79274328 16.54646406]


We can then compute the average of _randomValues_.

In [9]:
averageOfRandomValues = np.mean(randomValues)
print("Number of randomly selected values      :",numberOfValues)
print("Average of the randomly selected values :",averageOfRandomValues)

Number of randomly selected values      : 20
Average of the randomly selected values : 13.606211220335126
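
Since bootstrapping was mentioned as a motivation, here is a minimal sketch of how the same tools could be used for it. Note that a bootstrap resample is usually drawn with replacement and has the same size as the original sample, unlike the selection above. The synthetic data, the seed and the number of resamples are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only

# Synthetic stand-in for the "Y" column (assumption, not the real data)
sample = rng.uniform(5.0, 20.0, size=1000)

numberOfResamples = 200  # hypothetical choice
bootstrapMeans = np.empty(numberOfResamples)
for i in range(numberOfResamples):
    # Draw indices WITH replacement, as many as there are data points
    indices = rng.choice(sample.size, size=sample.size, replace=True)
    bootstrapMeans[i] = np.mean(sample[indices])

# The spread of the resampled means estimates the uncertainty of the mean
print("Average of the sample        :", np.mean(sample))
print("Bootstrap standard error     :", np.std(bootstrapMeans, ddof=1))
```

The standard deviation of the resampled averages gives an estimate of the standard error of the average, without assuming a particular distribution for the data.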
