<h3>Chapter 5: Statistics</h3>


In [7]:
# Central Tendencies

def mean(x):
    return sum(x) / len(x)

def median(v):
    n = len(v)
    sorted_v = sorted(v)
    midpoint = n // 2
    
    if n % 2 == 1:
        return sorted_v[midpoint]
    else:
        low = midpoint - 1
        high = midpoint
        return (sorted_v[low] + sorted_v[high]) / 2
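
As a quick sanity check, the standard library's `statistics` module provides its own `mean` and `median`. The sketch below uses those library functions (not the ones defined above) on made-up data to show how a single outlier pulls the mean far more than the median.

```python
import statistics

# Made-up data for illustration; the last value is an outlier.
data = [1, 2, 3, 4, 100]

data_mean = statistics.mean(data)      # (1 + 2 + 3 + 4 + 100) / 5
data_median = statistics.median(data)  # middle value of the sorted data

print(data_mean, data_median)
```

The median stays at the middle of the bulk of the data, while the mean is dragged toward the outlier.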

In [18]:
from collections import Counter  # needed by mode() below

def quantile(x, p):
    """returns the pth-percentile value in x (p is a fraction, 0 <= p < 1)"""
    p_index = int(p * len(x))
    return sorted(x)[p_index]

def mode(x):
    """returns a list, might be more than one mode"""
    counts = Counter(x)
    max_count = max(counts.values())
    return [x_i for x_i, count in counts.items() if count == max_count]
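
A small usage sketch for the two functions above, on made-up data. One thing worth noticing is that `quantile` as written indexes past the end of the list when `p` is exactly 1.0. The `quantile_clamped` helper below is a hypothetical variant (not part of the original code) that clamps the index, which is one possible way to handle that edge case.

```python
from collections import Counter

def quantile(x, p):
    """Same logic as above: index the sorted data at int(p * len(x))."""
    p_index = int(p * len(x))
    return sorted(x)[p_index]

def quantile_clamped(x, p):
    """Hypothetical variant: clamp the index so p = 1.0 returns the maximum."""
    p_index = min(int(p * len(x)), len(x) - 1)
    return sorted(x)[p_index]

def mode(x):
    """Same logic as above: every value that ties for the highest count."""
    counts = Counter(x)
    max_count = max(counts.values())
    return [x_i for x_i, count in counts.items() if count == max_count]

values = [7, 1, 3, 3, 9, 1, 5, 8]       # made-up data

middle = quantile(values, 0.5)          # element at sorted index 4
top = quantile_clamped(values, 1.0)     # largest element
modes = sorted(mode(values))            # 1 and 3 both appear twice

print(middle, top, modes)
```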

In [19]:
# Dispersion
def data_range(x):
    return max(x) - min(x)

data_range([1,2,5,99])

98

In [28]:
def de_mean(x):
    """translate x by subtracting its mean (so the result has mean 0)"""
    x_bar = mean(x)
    return [x_i - x_bar for x_i in x]
de_mean([1,2,3,4, 99])

def dot(v, w):
    """The dot product of two vectors is the sum of their componentwise products
    v_1 * w_1 + ... + v_n * w_n"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def sum_of_squares(v):
    """v_1 * v_1 + ... + v_n * v_n"""
    return dot(v, v)

def variance(x):
    """assumes x has at least two elements"""
    n = len(x)
    deviations = de_mean(x)
    return sum_of_squares(deviations) / (n-1)


In [31]:
import math

def standard_deviation(x):
    return math.sqrt(variance(x))

In [33]:
standard_deviation([1,2,3,99])

48.50687236533259
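
The `n-1` divisor in `variance` makes it the sample variance. As a cross-check, the standard library's `statistics.variance` and `statistics.stdev` use the same convention, while `statistics.pstdev` divides by `n` instead. A minimal sketch on the same data:

```python
import statistics

data = [1, 2, 3, 99]

sample_variance = statistics.variance(data)  # divides by n - 1
sample_stdev = statistics.stdev(data)        # square root of the sample variance
population_stdev = statistics.pstdev(data)   # divides by n instead

print(sample_stdev, population_stdev)
```

The sample standard deviation should agree with the output above, and the population version comes out slightly smaller.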

In [34]:
# The interquartile range is more robust to outliers than standard deviation and variance
def interquartile_range(x):
    return quantile(x, 0.75) - quantile(x, 0.25)
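
To make the robustness claim concrete, the sketch below repeats the definitions so it runs on its own and compares the range and the interquartile range on made-up data, with and without one extreme value.

```python
def quantile(x, p):
    p_index = int(p * len(x))
    return sorted(x)[p_index]

def interquartile_range(x):
    return quantile(x, 0.75) - quantile(x, 0.25)

def data_range(x):
    return max(x) - min(x)

# Made-up data: identical except for the last value.
clean = [1, 2, 3, 4, 5, 6, 7, 8]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 800]

range_clean = data_range(clean)
range_outlier = data_range(with_outlier)
iqr_clean = interquartile_range(clean)
iqr_outlier = interquartile_range(with_outlier)

print(range_clean, range_outlier)  # the range jumps with the outlier
print(iqr_clean, iqr_outlier)      # the interquartile range does not move
```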

In [38]:
# Covariance is the paired analogue of variance
# It measures how two variables vary in tandem from their means
def covariance(x, y):
    n = len(x)
    return dot(de_mean(x), de_mean(y)) / (n-1)

num_friends = [20, 90, 200, 3, 15]
num_daily_minutes = [10, 20, 50, 2, 7]

covariance(num_friends, num_daily_minutes)

1568.15

Recall that dot sums up the products of corresponding pairs of elements. When corresponding elements of x and y are either both above their means or both below their means, a positive number enters the sum. When one is above its mean and the other below, a negative number enters the sum. A "large" positive covariance means that x tends to be large when y is large and small when y is small. A "large" negative covariance means the opposite: x tends to be small when y is large and vice versa. A covariance close to zero means that no such relationship exists.

However, the number can be hard to interpret for the following reasons:
1. Its units are the product of the inputs' units (e.g., friend-minutes-per-day), which can be hard to make sense of.
2. If each user had twice as many friends (but the same number of minutes), the covariance would be twice as large. But in a sense the variables would be just as interrelated. It's hard to say what counts as a "large" covariance.
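
The second point can be checked directly. The sketch below redefines `covariance` in a self-contained form (equivalent to the version above, but without the `dot` and `de_mean` helpers) and doubles every friend count while leaving the minutes alone.

```python
def mean(x):
    return sum(x) / len(x)

def covariance(x, y):
    """Sample covariance, dividing by n - 1 as above."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    return sum((x_i - x_bar) * (y_i - y_bar) for x_i, y_i in zip(x, y)) / (n - 1)

num_friends = [20, 90, 200, 3, 15]
num_daily_minutes = [10, 20, 50, 2, 7]
doubled_friends = [2 * f for f in num_friends]

original = covariance(num_friends, num_daily_minutes)
doubled = covariance(doubled_friends, num_daily_minutes)

print(original, doubled)  # the second value is twice the first
```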

In [40]:
# Correlation is unitless and always lies between -1 and 1

def correlation(x, y):
    stdev_x = standard_deviation(x)
    stdev_y = standard_deviation(y)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(x, y) / stdev_x / stdev_y
    else:
        return 0
    
correlation(num_friends, num_daily_minutes)

0.9920750771995687
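
Dividing by both standard deviations is what removes the units, so rescaling either variable by a positive factor leaves the correlation unchanged. A self-contained sketch (the helper definitions are condensed restatements of the ones above) that doubles the friend counts and converts minutes to hours:

```python
import math

def mean(x):
    return sum(x) / len(x)

def covariance(x, y):
    x_bar, y_bar = mean(x), mean(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)

def standard_deviation(x):
    return math.sqrt(covariance(x, x))  # variance is the covariance of x with itself

def correlation(x, y):
    stdev_x, stdev_y = standard_deviation(x), standard_deviation(y)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(x, y) / stdev_x / stdev_y
    return 0

num_friends = [20, 90, 200, 3, 15]
num_daily_minutes = [10, 20, 50, 2, 7]

base = correlation(num_friends, num_daily_minutes)
rescaled = correlation([2 * f for f in num_friends],
                       [m / 60 for m in num_daily_minutes])  # minutes -> hours

print(base, rescaled)  # equal up to floating-point rounding
```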

<h3>Simpson's Paradox</h3>

Correlations can be misleading when <i>confounding</i> variables are ignored. For example, imagine that you can label every member as either an east coast or a west coast data scientist, and you want to see which group is friendlier. Looking only at the coast, you may find that the west coast data scientists have more friends on average. However, once you also split the data by degree type (PhD vs. non-PhD), you find that east coast data scientists have more friends on average within each degree type. Bucketing the data only as east coast/west coast disguised the fact that the east coast data scientists skew much more heavily toward PhD types (who, in this scenario, have fewer friends on average).

The key issue is that correlation is measuring the relationship between your two variables <i>all else being equal</i>. If your data classes are assigned at random, as they might be in a well-designed experiment, "all else being equal" might not be a terrible assumption. But when there is a deeper pattern to class assignments, "all else being equal" can be a terrible assumption.

The only way to avoid this is by <i>knowing your data</i> and doing what you can to check for possible confounding factors. This is not always possible: if you didn't have the educational attainment of the data scientists, you might simply conclude that there was something inherently more sociable about the west coast.
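
The reversal described in this section is easy to reproduce with a handful of numbers. The figures below are hypothetical, chosen only to make the effect visible, and the `groups` layout and `average_friends` helper are assumptions made for this sketch.

```python
# Hypothetical (total friends, member count) per coast and degree type.
groups = {
    ("west", "PhD"): (6, 2),
    ("west", "no PhD"): (80, 8),
    ("east", "PhD"): (32, 8),
    ("east", "no PhD"): (24, 2),
}

def average_friends(coast, degree=None):
    """Average friends for a coast, optionally restricted to one degree type."""
    selected = [
        (total, count)
        for (c, d), (total, count) in groups.items()
        if c == coast and (degree is None or d == degree)
    ]
    return sum(t for t, _ in selected) / sum(n for _, n in selected)

# Aggregated by coast only, the west coast looks friendlier...
print(average_friends("west"), average_friends("east"))

# ...but within each degree type the east coast has the higher average.
for degree in ("PhD", "no PhD"):
    print(degree, average_friends("west", degree), average_friends("east", degree))
```

The aggregate flips because most east coast members in this made-up data are PhDs, and PhDs here have fewer friends on both coasts.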

In [43]:
# A correlation of zero indicates that there is no linear relationship
# between the two variables. However, there are other sorts of relationships


x = [-2,-1,0,1,2]
y = [2, 1,0,1,2]
correlation(x, y)

# Here, each element of y equals the absolute value of the corresponding
# element of x. What they do not have is a relationship in which knowing
# how x_i compares to mean(x) gives us information about how y_i
# compares to mean(y). That's what correlation looks for.

0.0

In [46]:
# Additionally, correlation tells you nothing about how large the
# relationship actually is. These variables are perfectly correlated,
# but depending on what you're measuring, it's quite possible that
# the relationship isn't all that interesting
x = [-2,-1,0,1,2]
y = [99.98, 99.99, 100, 100.01, 100.02]
correlation(x, y)

1.0
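
One way to put a number on "how large" the relationship is, offered here as an illustration and not something defined earlier in this chapter, is the least-squares slope `covariance(x, y) / variance(x)`: the average change in y per unit change in x. For the data above it is tiny even though the correlation is perfect.

```python
def mean(x):
    return sum(x) / len(x)

def covariance(x, y):
    x_bar, y_bar = mean(x), mean(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)

x = [-2, -1, 0, 1, 2]
y = [99.98, 99.99, 100, 100.01, 100.02]

# Least-squares slope of y on x; variance(x) is covariance(x, x).
slope = covariance(x, y) / covariance(x, x)

print(slope)  # roughly 0.01: y barely moves as x changes
```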

<h3>Correlation and Causation</h3>

If x and y are strongly correlated, that might mean x causes y, that y causes x, that each causes the other, that some third factor causes both, or it might mean nothing.

Consider the relationship between num_friends and num_daily_minutes. It's possible that having more friends on the site <i>causes</i> DataSciencester users to spend more time on the site. This might be the case if each friend posts a certain amount of content each day, which means that the more friends you have, the more time it takes to stay current with their updates.

However, it's also possible that the more time you spend arguing in the forums, the more you encounter and befriend like-minded people. That is, spending more time on the site <i>causes</i> users to have more friends.

A third possibility is that the users who are most passionate about data science spend more time on the site and more actively collect data science friends.

One way to feel more confident about causality is by conducting randomized trials. If you can randomly split your users into two groups with similar demographics and give one of the groups a slightly different experience, then you can often feel pretty good that the different experiences are causing the different outcomes. 

For example, you could randomly choose a subset of your users and show them content from only a fraction of their friends. If this subset subsequently spent less time on the site, this would give you some confidence that having more friends <i>causes</i> more time to be spent on the site.
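
A minimal sketch of the random assignment step, assuming users are identified by integer ids. The `random_split` helper and the id range are hypothetical, and the fixed seed is there only to make the sketch repeatable.

```python
import random

def random_split(user_ids, seed=0):
    """Shuffle a copy of the ids and cut it in half: (control, treatment)."""
    rng = random.Random(seed)   # fixed seed only so the sketch is repeatable
    shuffled = list(user_ids)   # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

user_ids = list(range(100))     # hypothetical user ids
control, treatment = random_split(user_ids)

print(len(control), len(treatment))
```

The treatment group would then see content from only a fraction of their friends, and time on site would be compared between the two groups.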