# Chapter 05 - Statistics

Code from chapter 5, "Statistics", of the book _Data Science from Scratch_, 2nd edition.

In [None]:
from collections import Counter

In [None]:
import matplotlib.pyplot as plt

In [None]:
import dsfs as scratch

## Describing a single set of data

The VP of Fundraising asks you for some sort of description of how many friends members of the DataSciencester community have.

An obvious description is the data itself.

In [None]:
num_friends = [100,49,41,40,25,21,21,19,19,18,18,16,15,15,15,15,14,14,13,13,13,13,12,12,11,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,8,8,8,8,8,8,8,8,8,8,8,8,8,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]

Not very friendly. How 'bout a picture?

In [None]:
# Create a "histogram" of the number of friends for each user
friend_counts = Counter(num_friends)

In [None]:
# Calculate data ranges for x- and y-axes.
min(num_friends), max(num_friends)

In [None]:
min(friend_counts.values()), max(friend_counts.values())

In [None]:
# Plot the histogram as a bar chart
xs = range(max(num_friends) + 1)  # add one for "whitespace"
ys = [friend_counts[i] for i in xs]  # for each count, extract the number of people who have that many friends
plt.bar(xs, ys)
# Set the axis ranges. The x-axis runs from 0 to one past the largest
# number of friends (100 + 1 = 101). The y-axis upper limit, 25, is a
# "nice" round value chosen by eye to sit just above the tallest bar,
# i.e., above max(friend_counts.values()) computed earlier.
plt.axis([0, max(num_friends) + 1, 0, 25])
plt.title('Histogram of Friend Counts')
plt.xlabel('# of friends')
plt.ylabel('# of people with that number of friends')
plt.show()

Unfortunately, this chart is too difficult to slip into conversations. Let's try some statistics.

In [None]:
# The simplest statistic (a summary of data) is the number of data points.
len(num_friends)

In [None]:
# The largest and smallest values are also interesting.
smallest_value = min(num_friends)
largest_value = max(num_friends)

In [None]:
smallest_value, largest_value

The minimum and maximum values are actually special cases of wanting to know the values in specific positions.

In [None]:
# Let's calculate some others
sorted_values = sorted(num_friends)
smallest_value = sorted_values[0]
second_smallest_value = sorted_values[1]
next_to_largest_value = sorted_values[-2]
largest_value = sorted_values[-1]

In [None]:
smallest_value, second_smallest_value, next_to_largest_value, largest_value

### Central Tendencies

A typical question / statistic is "Where is our data centered?" Most commonly, we use the _mean_.

In [None]:
scratch.statistics.mean(num_friends)
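For reference, the mean is the sum of the values divided by their count. The following is a minimal sketch, assuming `scratch.statistics.mean` behaves like the usual arithmetic mean. The name `mean_sketch` is hypothetical and not part of `dsfs`.

```python
from typing import List


def mean_sketch(xs: List[float]) -> float:
    """Arithmetic mean: the sum of the values divided by how many there are."""
    if not xs:
        raise ValueError("mean_sketch requires at least one data point")
    return sum(xs) / len(xs)


# Small illustrative data set (not the notebook's num_friends).
example_mean = mean_sketch([1, 10, 2, 9, 5])  # 27 / 5
print(example_mean)
```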

Another measure of central tendency is the _median_. Whereas the mean is affected by **all** values, the median depends only on:

- The central value if the number of items in the data is odd
- The two central values if the number of items in the data is even

In [None]:
scratch.statistics.median([1, 10, 2, 9, 5])  ## == 5

In [None]:
scratch.statistics.median([9, 1, 10, 2]) ## == 5.5

In [None]:
scratch.statistics.median(num_friends)  ## == 6
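The odd and even cases described above can be sketched as follows. This is an illustration only, assuming `scratch.statistics.median` follows the standard definition. The name `median_sketch` is hypothetical and not part of `dsfs`.

```python
from typing import List


def median_sketch(xs: List[float]) -> float:
    """Middle value of the sorted data, or the average of the two middle values."""
    if not xs:
        raise ValueError("median_sketch requires at least one data point")
    ordered = sorted(xs)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        # Odd number of items: a single central value.
        return ordered[mid]
    # Even number of items: average the two central values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = median_sketch([1, 10, 2, 9, 5])   # sorted: 1, 2, 5, 9, 10
even_case = median_sketch([9, 1, 10, 2])     # sorted: 1, 2, 9, 10
print(odd_case, even_case)
```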

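The mean is easier to compute, and it varies smoothly as our data changes: if one of n data points increases by a small amount, _delta_, the mean increases by _delta_ / n. The median, by contrast, requires sorting our data.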

Changing our data points by a small amount, _delta_, might change the median by _delta_, by some smaller amount, or not at all. The mean, on the other hand, is very sensitive to outliers.
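To illustrate the sensitivity to outliers, here is a small sketch using Python's standard-library `statistics` module rather than `dsfs`. The two lists are made-up illustrative data, not taken from the notebook.

```python
from statistics import mean, median

# Two hypothetical data sets that differ only in their largest value.
typical = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]

mean_typical = mean(typical)            # 15 / 5
mean_outlier = mean(with_outlier)       # 110 / 5
median_typical = median(typical)        # middle of the sorted values
median_outlier = median(with_outlier)   # same middle value

print("mean:  ", mean_typical, "->", mean_outlier)
print("median:", median_typical, "->", median_outlier)
```

The single extreme value pulls the mean far from the bulk of the data, while the median is unchanged because the middle position still holds the same value.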