# Video: Calculating Basic Statistics with Python

In this video, you will learn how to compute means and standard deviations in Python.

## Calculating Basic Statistics with Python

* $X = [x_1, x_2, \ldots, x_n]$
* $\mathrm{mean}(X) = \mu_X = \bar{X}$
* $\mathrm{stdev}(X) = \sigma_X$

Script:
* In this video, I will show you how to calculate means and standard deviations in Python.
* I will focus on just the calculations, and not try to wrap them in a function, to emphasize how the calculation works.
* Then I will show you how to use a couple of libraries, which we will cover later, to perform these calculations efficiently.

## Calculating the Mean with Python

* $X = [x_1, x_2, \ldots, x_n]$
* $\mathrm{mean}(X) = \mu_X = \bar{X} = \frac{\sum_i x_i}{|X|}$

Script:
* Let's start with calculating the mean.
* For a data set $X$ with values $x_1$, $x_2$, and so on up to $x_n$, the calculation is simple.
* First you add up all the values $x_1$ through $x_n$, and then you divide by $n$, the number of values.
* Let's do that with a small data set.
* Just 1, 2, 3, 4, 5.

In [None]:
X = [1, 2, 3, 4, 5]

Script:
* To calculate the sum manually, we can just type in 1 plus 2 plus 3 plus 4 plus 5.

In [None]:
1 + 2 + 3 + 4 + 5

15

Script:
* Or we can use Python's built-in `sum` function, which is more concise and also works for bigger data sets.
* We do not want to type in every number every time, of course.

In [None]:
sum(X)

15

Script:
* What is the length of $X$?
* Looking at the definition of $X$, we can see that there are 5 numbers.
* Or we can use Python's built-in `len` function, short for length, to calculate it more generally.

In [None]:
len(X)

5

Script:
* Now we have all the pieces to finish the computation of the mean.

In [None]:
sum(X) / len(X)

3.0
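
As a rough sketch of what `sum` and `len` do for us here, the same mean can be computed with an explicit loop. The variable names `total` and `count` are only illustrative and are not part of the original notebook.

```python
X = [1, 2, 3, 4, 5]

total = 0  # running sum of the values
count = 0  # number of values seen so far
for x in X:
    total += x
    count += 1

# Same result as sum(X) / len(X)
mean = total / count
print(mean)  # 3.0
```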

Script:
* That answer should make intuitive sense for this $X$.
* It is right in the middle, and the values are evenly split on each side.
* Most cases won't be so neat. Beware that this loose description also fits the median here, so don't just pick the middle value when you want the mean.
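
To illustrate the difference between the mean and the median, here is a small sketch using a made-up data set (not from the video) where the last value is an outlier. It assumes the standard library `statistics` module for the median.

```python
import statistics

# Hypothetical data: the same as X, but with the last value replaced by an outlier.
skewed = [1, 2, 3, 4, 100]

mean_value = sum(skewed) / len(skewed)
median_value = statistics.median(skewed)

print(mean_value)    # 22.0
print(median_value)  # 3
```

The middle value is still 3, but the mean is pulled far toward the outlier.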


## Using NumPy to Calculate the Mean

Libraries give us pre-written code to solve common tasks.
* Pre-written
* Pre-tested
* Often faster

NumPy is the first library we will use for these common tasks.

Script:
* Next, we will look at using libraries like NumPy and pandas to perform these calculations.
* Libraries are programming resources that have code that is already written for common tasks.
* They save us time writing the same code over and over again from scratch, or copying and pasting from previous projects.
* The libraries that we recommend in this program have been developed over many years and have been thoroughly tested. They are usually significantly faster than code you would quickly write yourself.
* That is especially the case when you start working with larger data sets.

Script:
* We will start with the NumPy library.
* We make the NumPy library accessible to our code with an import statement.

In [None]:
import numpy as np

Script:
* In this import statement, we imported the NumPy library with the alias `np`.
* You can think of this as a short nickname since we will be referencing the NumPy library a lot.
* Most code examples on the internet will use `np` instead of spelling out NumPy all the time.
* The NumPy library is focused on array operations.
* We can convert the `X` variable easily as follows.
* Here is the `X` variable that we made before.

In [None]:
X

[1, 2, 3, 4, 5]

Script:
* And here is the `X` variable converted for use with NumPy.

In [None]:
X_np = np.array(X)

In [None]:
X_np

array([1, 2, 3, 4, 5])

Script:
* Normally we would just replace `X` with the NumPy version, but I am keeping them separate to emphasize which version we are using here.
* Once we have a NumPy version of `X`, we can call its `mean` method to calculate the mean.

In [None]:
X_np.mean()

np.float64(3.0)

Script:
* NumPy also has a function `np.mean` that does the same calculation when you pass in an array.

In [None]:
np.mean(X_np)

np.float64(3.0)

Script:
* An advantage of `np.mean` is that it will automatically convert its input to the NumPy version as needed.
* So we can call it on the original version of `X`.

In [None]:
np.mean(X)

np.float64(3.0)

Script:
* Many NumPy functions have two versions like this for convenience.
* If you are calculating a lot with the same data, you should convert the data to NumPy once and work with that version.
* But you do not need to worry about those details yet.
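
As a minimal sketch of the "convert once, then reuse" advice, the example below builds the array a single time and then calls several array methods on it. The choice of methods is just for illustration.

```python
import numpy as np

X = [1, 2, 3, 4, 5]

# Convert the list once, then reuse the array for every calculation.
X_np = np.array(X)

print(X_np.sum())   # 15
print(X_np.mean())  # 3.0
print(X_np.min(), X_np.max())  # 1 5
```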

## Using Pandas to Calculate the Mean

* pandas is a library for data management and visualization.
* We will use pandas to load data in module 1.
* More advanced pandas usage comes in module 2.

Script:
* The next library that we will look at is called pandas.
* Pandas is a library that makes loading and analyzing data easy.
* A common usage pattern is using pandas to load a data file, and then immediately running lots of statistics with pandas.

Script:
* Like with NumPy, we need to import the pandas module to access it within our Python code.
* Pandas is usually imported with the alias `pd`.

In [None]:
import pandas as pd

Script:
* Usually we would load a file next.
* To continue the previous example, I will manually make a pandas data frame object with the previous data.
* You do not need to remember how to do this yet.

In [None]:
X_pd = pd.DataFrame(data={"X": X})

In [None]:
X_pd

| | X |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |


Script:
* Now we have a pandas data frame, which is pandas's version of a data set, with one column called `X`.
* Taking the mean of a data frame is simple, just like with NumPy.

In [None]:
X_pd.mean()

| | 0 |
|---|---|
| X | 3.0 |


Script:
* If there were more columns of data, we would get the average of each column separately.
* If we wanted, we could also use the NumPy `np.mean` function on this data frame.

In [None]:
np.mean(X_pd)

np.float64(3.0)

Script:
* Pandas and NumPy are designed to work together well.
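
The load-then-summarize pattern and the per-column behavior described above can be sketched as follows. The CSV text and the second column `Y` are made up for illustration, and `io.StringIO` stands in for a real data file.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a data file on disk.
csv_text = """X,Y
1,10
2,20
3,30
4,40
5,50
"""

# pd.read_csv accepts file paths as well as file-like objects such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

# DataFrame.mean() averages each numeric column separately.
column_means = df.mean()
print(column_means["X"])  # 3.0
print(column_means["Y"])  # 30.0
```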

## Calculating the Standard Deviation with Python

$\mathrm{stdev}(X) = \sigma_X = \sqrt{\frac{\sum_i (x_i - \mu_X)^2}{|X|-1}}$


Script:
* Next, let's look at calculating the standard deviation.

Script:
* As a reminder, here is our data in `X`.

In [None]:
X

[1, 2, 3, 4, 5]

Script:
* And the mean that we calculated before was 3.

In [None]:
np.mean(X)

np.float64(3.0)

Script:
* Let's start with calculating the differences between values in X and the mean.

In [None]:
[x - 3 for x in X]

[-2, -1, 0, 1, 2]

Script:
* Next, we will square each of them.
* As a reminder, the double asterisk notation is for exponentiation.
* We will be using it to square the differences from the mean.

In [None]:
[(x - 3) ** 2 for x in X]

[4, 1, 0, 1, 4]

Script:
* Then we add them all up.

In [None]:
sum([(x - 3) ** 2 for x in X])

10

Script:
* And divide by 4.
* We are dividing by $n - 1 = 4$ instead of $n = 5$ to avoid biasing this sample standard deviation to be too low.

In [None]:
sum([(x - 3) ** 2 for x in X]) / 4

2.5

Script:
* This value here is the sample variance.
* We will cover variance in module 1.
* To calculate the sample standard deviation, we will use the math module for the square root function.

In [None]:
import math

In [None]:
math.sqrt(sum([(x - 3) ** 2 for x in X]) / 4)

1.5811388300841898
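
Once the individual steps are clear, they can be collected into a function. The sketch below is one possible way to do that. The name `sample_stdev` is hypothetical, and the comparison assumes the standard library function `statistics.stdev`, which also divides by $n - 1$.

```python
import math
import statistics


def sample_stdev(values):
    """Sample standard deviation, dividing by n - 1 (hypothetical helper)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    squared_diffs = [(v - mean) ** 2 for v in values]
    return math.sqrt(sum(squared_diffs) / (n - 1))


X = [1, 2, 3, 4, 5]
print(sample_stdev(X))  # 1.5811388300841898
print(math.isclose(sample_stdev(X), statistics.stdev(X)))  # True
```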

Script:
* We could use exponentiation with power one half instead of the square root function, but most people find `math.sqrt` quicker to read.
* NumPy also has its own standard deviation function called `np.std`.
* Various libraries will call the standard deviation `std` or `stdev` or even `stddev` with two d's.

In [None]:
np.std(X, ddof=1)

np.float64(1.5811388300841898)

Script:
* The `ddof` parameter is for the sample correction. With `ddof=1`, NumPy divides by $n - 1$ instead of $n$.
* If we omit it, NumPy uses `ddof=0` and divides by $n$, so we will get a lower number because of the downward bias.

In [None]:
np.std(X)

np.float64(1.4142135623730951)
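
To make the effect of `ddof` concrete, here is a small sketch that computes both versions by hand and checks them against `np.std`. The variable names `population` and `sample` are only illustrative.

```python
import math

import numpy as np

X = [1, 2, 3, 4, 5]
n = len(X)
mean = sum(X) / n
sum_sq = sum((x - mean) ** 2 for x in X)  # sum of squared differences, 10.0

population = math.sqrt(sum_sq / n)    # divide by n, matches ddof=0
sample = math.sqrt(sum_sq / (n - 1))  # divide by n - 1, matches ddof=1

print(math.isclose(population, np.std(X)))      # True
print(math.isclose(sample, np.std(X, ddof=1)))  # True
```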

Script:
* NumPy arrays also have their own `std` method like we saw with the mean.

In [None]:
X_np.std(ddof=1)

np.float64(1.5811388300841898)

Script:
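* And finally, pandas also has its own `std` method that we call just like the NumPy method.
* Unlike NumPy, pandas applies the sample correction (`ddof=1`) by default, so passing it here is optional.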

In [None]:
X_pd.std(ddof=1)

| | 0 |
|---|---|
| X | 1.581139 |


Script:
* One last thing before wrapping up this video.
* If you have your data in a pandas data frame, pandas has a really convenient method `describe` which will give you many different statistics all at once.

In [None]:
X_pd.describe()

| | X |
|---|---|
| count | 5.0 |
| mean | 3.0 |
| std | 1.581139 |
| min | 1.0 |
| 25% | 2.0 |
| 50% | 3.0 |
| 75% | 4.0 |
| max | 5.0 |


Script:
* That output has the count of entries with values, the mean and standard deviation that we already saw, plus the minimum, the maximum, and the quartiles in between.
* We will look at those more in both modules 1 and 2, but you should be seeing already that these stats are easy to compute with Python and these libraries in your toolkit.
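
If you only need one or two of those statistics, a possible approach is to call `describe` on a single column and look up values by their row label. This is a sketch that rebuilds the same data frame so that it runs on its own.

```python
import pandas as pd

X_pd = pd.DataFrame(data={"X": [1, 2, 3, 4, 5]})

# describe() on one column returns a Series indexed by statistic name.
summary = X_pd["X"].describe()

print(summary["mean"])  # 3.0
print(summary["50%"])   # 3.0
```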