## Describing Data using Statistics

First, I'd like to compile a list of users into a DataFrame so that I can easily reference their values. Even though I am using a DataFrame, I will first work through the statistics the long way and calculate the values manually.

In [1]:
# import dependencies
import pandas as pd

In [2]:
df = pd.DataFrame()

df["name"] = ["Greg", "Marcia", "Peter", "Jan", "Bobby", "Cindy", "Oliver"]
df["age"] = [14, 12, 10, 11, 8, 6, 8]

In [3]:
df

Unnamed: 0,name,age
0,Greg,14
1,Marcia,12
2,Peter,10
3,Jan,11
4,Bobby,8
5,Cindy,6
6,Oliver,8


### Compute Statistics

Now I want to compute some values. I'll calculate the `mean`, `median`, `mode` as well as the `variance`, `standard deviation` and `standard error`.


First I will do this by hand, and then just use the built-ins for convenience.

#### Mean:

The sum of all the data points, divided by the number of data points
```python
mean = (14 + 12 + 10 + 11 + 8 + 6 + 8) / 7

# mean ≈ 9.86
```

$$ \bar x = \dfrac{\sum{x}}n $$


#### Median:

The middle value of the sorted data points.
```python
ages = sorted([14, 12, 10, 11, 8, 6, 8])  # [6, 8, 8, 10, 11, 12, 14]
median = ages[len(ages) // 2]  # middle value of an odd-length list

# median = 10
```
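The "middle value" rule needs one extra case when the number of data points is even: there is no single middle value, so the two central values are averaged. The following is a minimal sketch of that rule; the helper name `median_by_hand` is hypothetical and not part of the original notebook.

```python
def median_by_hand(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # odd count: a single middle value exists
        return ordered[mid]
    # even count: average the two central values
    return (ordered[mid - 1] + ordered[mid]) / 2


ages = [14, 12, 10, 11, 8, 6, 8]
print(median_by_hand(ages))  # 10
```

With seven ages the odd branch applies; a four-value dataset would take the even branch instead.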


#### Mode:

The most frequently occurring data point in the dataset.

```python
from collections import Counter

mode = Counter([6, 8, 8, 10, 11, 12, 14]).most_common(1)[0][0]
# mode = 8
```
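A dataset can have more than one mode when several values tie for the highest count, and when every value occurs exactly once, every value is technically a mode. The sketch below returns all tied values; `modes_by_hand` is a hypothetical helper name used only for illustration.

```python
from collections import Counter


def modes_by_hand(values):
    """Return every value tied for the highest count, in sorted order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


print(modes_by_hand([6, 8, 8, 10, 11, 12, 14]))  # [8]
print(modes_by_hand([1, 1, 2, 2, 3]))            # [1, 2]
```

Returning a list rather than a single number keeps ties visible instead of silently picking one of them.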

#### Variance

The "spread" of the data: roughly, the average squared distance of the data points from the mean.
~~~python
# v = sum((x - mean) ** 2) / (n - 1)

v = sum((x - mean) ** 2 for x in df["age"]) / (len(df["age"]) - 1)

~~~
Sample variance:

$$ \sigma^2 = \dfrac{\sum{(x - \bar x)^2}}{n - 1} $$
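The choice of denominator matters on a small dataset. Dividing by `n - 1` gives the sample variance (what `ddof=1` requests in pandas), while dividing by `n` gives the population variance. A small sketch using the seven ages, with the standard library's `statistics` module as a cross-check:

```python
import statistics

ages = [14, 12, 10, 11, 8, 6, 8]
n = len(ages)
mean = sum(ages) / n

# sum of squared deviations from the mean
squared_deviations = sum((x - mean) ** 2 for x in ages)

sample_variance = squared_deviations / (n - 1)  # divides by n - 1
population_variance = squared_deviations / n    # divides by n

print(round(sample_variance, 4))      # 7.4762
print(round(population_variance, 4))  # 6.4082

# cross-check against the standard library
print(round(statistics.variance(ages), 4))   # 7.4762
print(round(statistics.pvariance(ages), 4))  # 6.4082
```

With only seven data points the two results differ by about one unit, so it is worth stating which one is being reported.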

#### Standard Deviation

Used to measure how far values typically lie from the mean. It is the square root of the variance, so it is expressed in the same units as the data.

```python
s = v ** 0.5

```
$$ \sigma = \sqrt{\sigma^2}$$


#### Standard Error

Used to measure the uncertainty in the estimate of the sample mean. This can help in understanding error when estimating a population using a sample.

```python

n = len(df["age"])  # number of data points
se = s / (n ** 0.5)

```


$$ SE = \dfrac{\sigma}{\sqrt{n}} $$
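The three spread measures build on one another: the variance feeds the standard deviation, which feeds the standard error. As a rough end-to-end check in plain Python (no pandas), the chain for the seven ages can be sketched as follows; the variable names mirror the snippets above.

```python
import math

ages = [14, 12, 10, 11, 8, 6, 8]
n = len(ages)
mean = sum(ages) / n

v = sum((x - mean) ** 2 for x in ages) / (n - 1)  # sample variance
s = math.sqrt(v)                                  # standard deviation
se = s / math.sqrt(n)                             # standard error of the mean

print(round(v, 4), round(s, 4), round(se, 4))  # 7.4762 2.7343 1.0335
```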


### Deciding what estimate of Central Tendency to use

Based on the `mean: 9.86` and the `median: 10`, it seems pretty safe to go with the `mean` (or average) of the dataset, as the `mean` and `median` are very close in value. That closeness suggests no extreme values are pulling the mean away from the center of the data, so I'm choosing the mean for the current dataset.

### Calculations

Below I will perform the actual calculations using pandas.

In [4]:
def compute_statistics(values: pd.Series):
    print("Sample Statistics:")

    mean = values.mean()
    print("Mean: {}".format(mean))

    median = values.median()
    print("Median: {}".format(median))

    mode = values.mode()
    print("Mode: {}".format(mode))

    variance = values.var(ddof=1)
    print("Variance: {}".format(variance))

    standard_deviation = values.std(ddof=1)
    print("Standard Deviation: {}".format(standard_deviation))

    standard_error = values.sem(ddof=1)
    print("Standard Error: {}".format(standard_error))

compute_statistics(df["age"])

Sample Statistics:
Mean: 9.857142857142858
Median: 10.0
Mode: 0    8
dtype: int64
Variance: 7.476190476190475
Standard Deviation: 2.734262327610589
Standard Error: 1.0334540197243192


#### Cindy has a birthday

In [5]:
# index 5 is Cindy, this is pretty hacky
df.at[5, "age"] += 1
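One way to avoid the hard-coded index is to select the row by name with a boolean mask. This is a sketch of an alternative, not what the notebook actually ran; it rebuilds the same DataFrame so that it is self-contained.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Greg", "Marcia", "Peter", "Jan", "Bobby", "Cindy", "Oliver"],
    "age": [14, 12, 10, 11, 8, 6, 8],
})

# select the row by name instead of by position
is_cindy = df["name"] == "Cindy"
df.loc[is_cindy, "age"] += 1

print(df.loc[is_cindy, "age"].item())  # 7
```

The mask updates every row whose name matches, so this assumes names are unique in the table.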

In [6]:
df

Unnamed: 0,name,age
0,Greg,14
1,Marcia,12
2,Peter,10
3,Jan,11
4,Bobby,8
5,Cindy,7
6,Oliver,8


In [7]:
compute_statistics(df["age"])

Sample Statistics:
Mean: 10.0
Median: 10.0
Mode: 0    8
dtype: int64
Variance: 6.333333333333333
Standard Deviation: 2.516611478423583
Standard Error: 0.9511897312113418


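This change moves the mean, variance, standard deviation, and standard error, but leaves the median and mode the same. The median is unchanged because Cindy's new age of 7 is still below the middle value of 10, and the mode is unchanged because 8 is still the only repeated value.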

### Boot Oliver, bring in Jessica
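Oliver wasn't doing well, so Jessica, age 1, needs to come in. Let's see the difference. Note that the cell below overwrites row 0, which is Greg's row (Oliver is at index 6), so the statistics that follow reflect replacing Greg's age of 14 with 1.

In [8]:
# index 0 is Greg's row, not Oliver's (index 6); the output below reflects this
df.at[0, "name"] = "Jessica"
df.at[0, "age"] = 1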

In [9]:
compute_statistics(df["age"])

Sample Statistics:
Mean: 8.142857142857142
Median: 8.0
Mode: 0    8
dtype: int64
Variance: 13.14285714285714
Standard Deviation: 3.6253078686998625
Standard Error: 1.3702375780893483


Almost everything changed. The new low value pulled the mean down and shifted the sorted order, so the median dropped from 10 to 8. The mode stayed at 8. The variance rose because the new value lies far from the rest of the data, and the standard deviation and standard error rose with it.
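To put numbers on those shifts, the before and after lists can be compared directly. This sketch uses the standard library's `statistics` module and hard-codes the two age lists as they stand before and after row 0 is overwritten.

```python
import statistics

before = [14, 12, 10, 11, 8, 7, 8]  # ages after Cindy's birthday
after = [1, 12, 10, 11, 8, 7, 8]    # row 0 overwritten with age 1

mean_shift = statistics.mean(after) - statistics.mean(before)
median_shift = statistics.median(after) - statistics.median(before)
stdev_shift = statistics.stdev(after) - statistics.stdev(before)

print(round(mean_shift, 3))    # -1.857
print(median_shift)            # -2
print(round(stdev_shift, 3))   # 1.109
```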

In [12]:
ratings_df = pd.DataFrame()

ratings_df["poll"] = ["TV Guide", "Entertainment Weekly", "Pop Culture Today", "SciPhi Phanatic"]
ratings_df["percentage"] = [0.2, 0.23, 0.17, 0.05]

In [13]:
# compute the metrics for the ratings
compute_statistics(ratings_df["percentage"])

Sample Statistics:
Mean: 0.16250000000000003
Median: 0.185
Mode: 0    0.05
1    0.17
2    0.20
3    0.23
dtype: float64
Variance: 0.0062250000000000005
Standard Deviation: 0.0788986691902975
Standard Error: 0.03944933459514875


Let's use the mean.

In [14]:
# people who like Brady bunch
# 16.25%
brady_bunch_approvers = 325700000 * 0.1625
print("{} people approve of the Brady Bunch".format(brady_bunch_approvers))

52926250.0 people approve of the Brady Bunch
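Because the estimate rests on only four polls, the standard error computed above gives a rough sense of how uncertain it is. The sketch below scales the mean plus and minus one standard error by the same population figure; this is an informal range for illustration, not a formal confidence interval.

```python
import statistics

percentages = [0.2, 0.23, 0.17, 0.05]
population = 325_700_000  # same figure as in the cell above

mean = statistics.mean(percentages)
se = statistics.stdev(percentages) / len(percentages) ** 0.5

# range implied by one standard error on either side of the mean
low = population * (mean - se)
high = population * (mean + se)

print(f"{low:,.0f} to {high:,.0f} people")
```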
