# Chapter 05 - Statistics

Code from chapter 5, "Statistics", of the book, _Data Science from Scratch_, 2nd edition.

In [None]:
from collections import Counter

In [None]:
import matplotlib.pyplot as plt

In [None]:
import dsfs as scratch

## Describing a single set of data

The VP of Fundraising asks you for some sort of description of how many friends members of the DataSciencester community have.

An obvious description is the data itself.

In [None]:
num_friends = [100,49,41,40,25,21,21,19,19,18,18,16,15,15,15,15,14,14,13,13,
               13,13,12,12,11,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,
               9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,8,8,8,8,8,8,8,8,8,8,8,8,
               8,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
               6,6,6,6,6,6,6,6,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,4,4,4,4,4,
               4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
               3,3,3,3,3,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,
               1,1,1,1,1,1,1,1,1,1,1,1,1,1]

Not very friendly. How 'bout a picture?

In [None]:
# Create a "histogram" of the number of friends for each user
friend_counts = Counter(num_friends)

In [None]:
# Calculate data ranges for x- and y-axes.
min(num_friends), max(num_friends)

In [None]:
min(friend_counts.values()), max(friend_counts.values())

In [None]:
# Plot the histogram as a bar chart
xs = range(max(num_friends) + 1)  # add one for "whitespace"
ys = [friend_counts[i] for i in xs]  # for each count, extract the number of people who have that many friends
plt.bar(xs, ys)
# The values on the x- and y-axes of the chart. The range of the x-axis is
# the minimum, 0, to the maximum, 101, number of friends for a person.
# The y-axis range is determined empirically based on a "nice" value
# above the maximum of the number of people who have a specific number
# of friends (the maximum is 22; the "nice" number is 25).
plt.axis([0, max(num_friends) + 1, 0, 25])
plt.title('Histogram of Friend Counts')
plt.xlabel('# of friends')
plt.ylabel('# of people with that number of friends')
plt.show()
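
The list comprehension that builds `ys` relies on a property of `Counter`: looking up a key that never occurred returns 0 instead of raising a `KeyError`. A minimal sketch of that behavior, using made-up toy data rather than `num_friends`:

```python
from collections import Counter

counts = Counter([3, 1, 3, 2, 3])   # hypothetical toy data

print(counts[3])   # 3
print(counts[7])   # 0, no KeyError for a value that never occurs

# Same pattern as the plotting cell: one bar height per possible value.
ys = [counts[i] for i in range(5)]
print(ys)          # [0, 1, 1, 3, 0]
```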

Unfortunately, this chart is too difficult to slip into conversations. Let's try some statistics.

In [None]:
# The simplest statistic (a summary of data) is the number of data points.
len(num_friends)

In [None]:
# The largest and smallest values are also interesting.
smallest_value = min(num_friends)
largest_value = max(num_friends)

In [None]:
smallest_value, largest_value

The minimum and maximum values are actually special cases of wanting to know the values in specific positions.

In [None]:
# Let's calculate some others
sorted_values = sorted(num_friends)
smallest_value = sorted_values[0]
second_smallest_value = sorted_values[1]
next_to_largest_value = sorted_values[-2]
largest_value = sorted_values[-1]

In [None]:
smallest_value, second_smallest_value, next_to_largest_value, largest_value

### Central Tendencies

A typical question / statistic is "Where is our data centered?" Most commonly, we use the _mean_.

In [None]:
scratch.statistics.mean(num_friends)
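
The notebook does not show what `scratch.statistics.mean` does internally. A minimal sketch of an arithmetic mean, assuming the helper computes the usual sum divided by count (the function below is illustrative, not the module's code):

```python
from typing import List


def mean(xs: List[float]) -> float:
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(xs) / len(xs)


sample = [1, 2, 3, 4, 10]   # hypothetical toy data
print(mean(sample))         # 4.0
```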

Another measure of central tendency is the _median_. Whereas the mean is affected by **all** values, the median depends only on:

- The central value if the number of items in the data is odd
- The two central values if the number of items in the data is even

In [None]:
scratch.statistics.median([1, 10, 2, 9, 5])  ## == 5

In [None]:
scratch.statistics.median([9, 1, 10, 2]) ## == 5.5

In [None]:
scratch.statistics.median(num_friends)  ## == 6
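
The odd and even cases described above can be sketched as follows. This is one straightforward way to write it, assuming the even case averages the two central values (which matches the `5.5` result above); it is not necessarily how `scratch.statistics.median` is implemented.

```python
from typing import List


def median(xs: List[float]) -> float:
    """Middle value of the sorted data, or the average of the two
    middle values when the number of items is even."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd: single central value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even: average the two central values


print(median([1, 10, 2, 9, 5]))   # 5
print(median([9, 1, 10, 2]))      # 5.5
```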

The mean is easier to compute, and it varies smoothly as our data changes: if one of _n_ data points increases by a small amount, _delta_, the mean increases by _delta_ / _n_. The median, by contrast, requires sorting our data.

Changing one data point by a small amount, _delta_, might change the median by _delta_, by some amount smaller than _delta_, or not at all. On the other hand, the mean is very sensitive to outliers, while the median is not.
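
To make the contrast concrete, here is a small sketch using Python's standard-library `statistics` module (not the notebook's `scratch.statistics`) on made-up data. Moving one point slightly nudges the mean but leaves this median alone; a single extreme value drags the mean far away.

```python
from statistics import mean, median

data = [1, 2, 3, 4, 5]        # hypothetical toy data
bumped = [1, 2, 3, 4, 5.5]    # one point moved by delta = 0.5
outlier = [1, 2, 3, 4, 500]   # one point replaced by an extreme value

print(mean(data), median(data))         # 3 3
print(mean(bumped), median(bumped))     # 3.1 3  (mean moved by delta / n = 0.1)
print(mean(outlier), median(outlier))   # 102 3  (mean dominated by the outlier)
```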

A generalization of the median is the _quantile_. The _quantile_ represents the value under which a certain percentage of the data lies. For example, the median is the value under which 50% of the data lies.

In [None]:
ps = [0.1, 0.25, 0.5, 0.75, 0.9]
#    [1,   3,    6,   9,    13]
for p in ps:
    print(scratch.statistics.quantile(num_friends, p))
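
There are several conventions for computing quantiles. The sketch below assumes the simplest one: sort the data and take the element at index `int(p * n)`, with no interpolation. Whether `scratch.statistics.quantile` uses exactly this rule is an assumption; other libraries interpolate between neighbors and can give slightly different answers.

```python
from typing import List


def quantile(xs: List[float], p: float) -> float:
    """Value at position int(p * len(xs)) of the sorted data (0 <= p < 1)."""
    p_index = int(p * len(xs))
    return sorted(xs)[p_index]


sample = list(range(1, 11))     # hypothetical toy data: 1..10
print(quantile(sample, 0.25))   # 3
print(quantile(sample, 0.5))    # 6
print(quantile(sample, 0.9))    # 10
```

Note that with this convention `p = 1.0` would index past the end of the list, so callers need to keep `p` strictly below 1.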

In [None]:
scratch.statistics.mode(num_friends)  # {1, 6}
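
The _mode_ is the most common value, and there can be more than one (as the `{1, 6}` result above suggests). A minimal sketch with `Counter`, assuming ties are all returned; the return type of `scratch.statistics.mode` itself is not shown in the notebook.

```python
from collections import Counter
from typing import List


def mode(xs: List[float]) -> List[float]:
    """All values that occur with the highest frequency."""
    counts = Counter(xs)
    max_count = max(counts.values())
    return [x for x, count in counts.items() if count == max_count]


print(sorted(mode([1, 1, 2, 6, 6, 3])))   # [1, 6]
```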

Despite all these options, one most frequently uses the mean to measure the central tendency.

### Dispersion

_Dispersion_ is a general measure of the spread of the data. Typically, statistics measuring dispersion use a value near zero (0) to signify that data is _not spread out at all_ and use "large" values to indicate _very spread out data_.

A very simple measure is the _range_; that is, the difference between the maximum and minimum values.

In [None]:
scratch.statistics.data_range(num_friends)  # == 99

The range, like the median, **does not** depend on all values in the data. For example, a list consisting only of 0's and 100's and a list consisting of a single 0, a single 100, and many 50's **both** have a range of 100.
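
The two lists from that example can be checked directly. The `data_range` function below is a stand-in written for this sketch, assuming the range is simply `max - min`:

```python
from typing import List


def data_range(xs: List[float]) -> float:
    """Difference between the largest and smallest values."""
    return max(xs) - min(xs)


extremes = [0, 100] * 5            # only 0's and 100's
clustered = [0, 100] + [50] * 8    # one 0, one 100, many 50's

print(data_range(extremes))    # 100
print(data_range(clustered))   # 100
```

The two lists are spread out very differently, yet the range cannot tell them apart.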

A more complex measure of dispersion is the _variance_, defined (for a sample) as the sum of the squared differences between each value and the mean of the data, divided by one less than the number of values.

In [None]:
scratch.statistics.variance(num_friends)  # =~ 81.5435
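
A minimal sketch of a sample variance, assuming the `n - 1` divisor (the notebook does not show which divisor `scratch.statistics.variance` uses, so treat that as an assumption):

```python
from typing import List


def variance(xs: List[float]) -> float:
    """Sample variance: sum of squared deviations from the mean, over n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("variance requires at least two data points")
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


sample = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical toy data, mean is 5
print(variance(sample))             # squared deviations sum to 32, so 32 / 7
```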

The "units" of variance are the **square** of the data "units". These "square units" can sometimes be difficult to interpret. The _standard deviation_ addresses that issue by taking the square root of the variance.

In [None]:
scratch.statistics.standard_deviation(num_friends)  # =~ 9.0301
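
The unit argument can be illustrated with Python's standard-library `statistics` module (used here in place of the notebook's `scratch.statistics`; both `variance` and `stdev` in that module use the sample, `n - 1`, convention). The data and units are made up for the example.

```python
import math
import statistics

heights_cm = [150, 160, 170, 180, 190]   # hypothetical toy data, in centimetres

var = statistics.variance(heights_cm)    # in square centimetres
sd = statistics.stdev(heights_cm)        # back in centimetres

print(var)   # 250
print(sd)    # square root of 250, roughly 15.8
```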

Both the range and the standard deviation have the same outlier problem we encountered for the mean. For example, if the friendliest user in the `num_friends` data had 200 friends instead of 100, the standard deviation would be about 14.89 - an increase of more than 60%!

In [None]:
scratch.statistics.standard_deviation([200] + num_friends[1:])

A more robust metric is the interquartile range: the difference between the values at the 75th and 25th percentiles.

In [None]:
scratch.statistics.interquartile_range(num_friends)  # = 6

In [None]:
scratch.statistics.interquartile_range([200] + num_friends[1:])
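
Why the interquartile range shrugs off a single outlier can be seen in a small sketch. It assumes the simple index-based quantile rule (`sorted(xs)[int(p * n)]`), which may differ from what `scratch.statistics` does; the data is made up.

```python
from typing import List


def quantile(xs: List[float], p: float) -> float:
    """Value at position int(p * len(xs)) of the sorted data (0 <= p < 1)."""
    return sorted(xs)[int(p * len(xs))]


def interquartile_range(xs: List[float]) -> float:
    """Difference between the 75th and 25th percentile values."""
    return quantile(xs, 0.75) - quantile(xs, 0.25)


base = list(range(1, 21))           # hypothetical toy data: 1..20
with_outlier = base[:-1] + [2000]   # largest value replaced by an extreme one

print(interquartile_range(base))           # 10
print(interquartile_range(with_outlier))   # 10
```

Only the values near the 25th and 75th percentiles matter, so changing the maximum has no effect here.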

## Correlation

The VP of Growth at DataSciencester has a theory that the amount of time people spend on the site is related to the number of friends they have. She's asked you to verify this theory.

Using the traffic logs, you have created a list called `daily_minutes` that shows the number of minutes per day each user spends on DataSciencester. You've ordered the list so that each item in `daily_minutes` corresponds to the same user's entry in our previous `num_friends` list. We'd like to investigate the relationship between these two lists.

We'll first look at _covariance_, the paired analogue of variance. Whereas variance measures how the items in a single list deviate from their mean, covariance measures how two variables vary in tandem from their respective means.

In [None]:
daily_minutes = [1,68.77,51.25,52.08,38.36,44.54,57.13,51.4,41.42,31.22,
                 34.76,54.01,38.79,47.59,49.1,27.66,41.03,36.73,48.65,28.12,
                 46.62,35.57,32.98,35,26.07,23.77,39.73,40.57,31.65,31.21,
                 36.32,20.45,21.93,26.02,27.34,23.49,46.94,30.5,33.8,24.23,
                 21.4,27.94,32.24,40.57,25.07,19.42,22.39,18.42,46.96,23.72,
                 26.41,26.97,36.76,40.32,35.02,29.47,30.2,31,38.11,38.18,
                 36.31,21.03,30.86,36.07,28.66,29.08,37.28,15.28,24.17,22.31,
                 30.17,25.53,19.85,35.37,44.6,17.23,13.47,26.33,35.02,32.09,
                 24.81,19.33,28.77,24.26,31.98,25.73,24.86,16.28,34.51,15.23,
                 39.72,40.8,26.06,35.76,34.76,16.13,44.04,18.03,19.65,32.62,
                 35.59,39.43,14.18,35.24,40.13,41.82,35.45,36.07,43.67,24.61,
                 20.9,21.9,18.79,27.61,27.21,26.61,29.77,20.59,27.53,13.82,
                 33.2,25,33.1,36.65,18.63,14.87,22.2,36.81,25.53,24.62,26.25,
                 18.21,28.08,19.42,29.79,32.8,35.99,28.32,27.79,35.88,29.06,
                 36.28,14.1,36.63,37.49,26.9,18.58,38.48,24.48,18.95,33.55,
                 14.24,29.04,32.51,25.63,22.22,19,32.73,15.16,13.9,27.2,
                 32.01,29.27,33,13.74,20.42,27.32,18.23,35.35,28.48,9.08,24.62,
                 20.12,35.26,19.92,31.02,16.49,12.16,30.7,31.22,34.65,13.13,
                 27.51,33.2,31.57,14.1,33.42,17.44,10.12,24.42,9.82,23.39,
                 30.93,15.03,21.67,31.09,33.29,22.61,26.89,23.48,8.38,27.81,
                 32.35,23.84]

In [None]:
daily_hours = [dm / 60 for dm in daily_minutes]

In [None]:
# Daily minutes
scratch.statistics.covariance(num_friends, daily_minutes)  # =~ 22.43

In [None]:
# Daily hours
scratch.statistics.covariance(num_friends, daily_hours)  # =~ 22.43 / 60
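
The two results above differ by a factor of 60 because covariance carries the units of both inputs. A minimal sketch on made-up data, assuming a sample covariance with an `n - 1` divisor (the internals of `scratch.statistics.covariance` are not shown in the notebook):

```python
from typing import List


def covariance(xs: List[float], ys: List[float]) -> float:
    """Sample covariance: sum of products of paired deviations, over n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same number of elements")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


friends = [1, 2, 3, 4, 5]             # hypothetical toy data
minutes = [10, 20, 30, 40, 60]
hours = [m / 60 for m in minutes]     # same quantity, different units

print(covariance(friends, minutes))   # 30.0
print(covariance(friends, hours))     # about 0.5, i.e. 30 / 60
```

Rescaling one variable rescales the covariance by the same factor, which makes its raw magnitude hard to interpret on its own.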