# Coding exercises
Exercises 1-3 are thought exercises that don't require coding. If you need a Python crash course or refresher, work through the [`python_101.ipynb`](./python_101.ipynb) notebook in chapter 1.

## Exercise 4: Generate the data by running this cell
This will give you a list of numbers to work with in the remaining exercises.

In [8]:
import random

random.seed(0)
salaries = [round(random.random()*1000000, -3) for _ in range(100)]

## Exercise 5: Calculating statistics and verifying
### mean

In [41]:
def as_money(amount):
    # Format the amount as USD since the data are salaries
    # (named `amount` to avoid shadowing the built-in input())
    return f"${amount:,.2f}"

sal_mean = sum(salaries) / len(salaries)
print(as_money(sal_mean))

$585,690.00


### median

In [27]:
from statistics import median
# I hope it's okay to use statistics to complete some of these, since it's built-in to Python
# https://docs.python.org/3/library/statistics.html

print(as_money(median(salaries)))

$589,000.00


### mode

In [28]:
from statistics import mode
# https://docs.python.org/3/library/statistics.html#statistics.mode
print(as_money(mode(salaries)))

$477,000.00
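
As a way to verify the result without `statistics.mode`, here is a minimal sketch that counts values with `collections.Counter`. It regenerates the Exercise 4 data so it can run on its own; the helper name `manual_mode` is hypothetical and not part of the exercise.

```python
import random
from collections import Counter
from statistics import mode

# Recreate the Exercise 4 data so this cell is self-contained
random.seed(0)
salaries = [round(random.random() * 1000000, -3) for _ in range(100)]

def manual_mode(data):
    # Count how often each value occurs; most_common(1) returns the
    # (value, count) pair with the highest count. Ties resolve to the
    # value encountered first, as statistics.mode does on Python 3.8+.
    value, _count = Counter(data).most_common(1)[0]
    return value

# Prints True when the manual result agrees with statistics.mode
print(manual_mode(salaries) == mode(salaries))
```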


### sample variance
Remember to use Bessel's correction.

In [29]:
from statistics import variance

# Bessel's correction divides by n - 1 instead of n; statistics.variance computes the sample variance, so it already applies it
# https://docs.python.org/3/library/statistics.html#statistics.variance
print(variance(salaries))

70664054444.44444
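
Since the exercise asks to verify the result, the following sketch computes the sample variance by hand and compares it with `statistics.variance`. It regenerates the Exercise 4 data so it is self-contained; the function name `sample_variance` is an illustrative choice.

```python
import math
import random
from statistics import variance

# Recreate the Exercise 4 data so this cell is self-contained
random.seed(0)
salaries = [round(random.random() * 1000000, -3) for _ in range(100)]

def sample_variance(data):
    n = len(data)
    data_mean = sum(data) / n
    # Sum of squared deviations from the mean, divided by n - 1
    # (Bessel's correction) rather than n
    return sum((x - data_mean) ** 2 for x in data) / (n - 1)

# Prints True when the manual result agrees with statistics.variance
print(math.isclose(sample_variance(salaries), variance(salaries)))
```

Taking `math.sqrt` of this value should likewise reproduce the sample standard deviation.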


### sample standard deviation
Remember to use Bessel's correction.

In [30]:
from statistics import stdev
# https://docs.python.org/3/library/statistics.html#statistics.stdev

print(stdev(salaries))

265827.11382484


## Exercise 6: Calculating more statistics
### range

In [31]:
sal_range = max(salaries) - min(salaries)
# max = built-in function to find highest number in a list
# min = built-in function to find lowest number in a list
print(sal_range)

995000.0


### coefficient of variation
Make sure to use the sample standard deviation.

In [33]:
from statistics import mean, stdev
# cv = (sample stdev / mean) * 100
cv = (stdev(salaries) / mean(salaries)) * 100
print(cv)

45.38699889443903


### interquartile range

In [34]:
from numpy import percentile
# iqr = q3 (75%) - q1 (25%)
# numpy.percentile (linear interpolation by default) is a convenient way to find the quartiles of a list
iqr = percentile(salaries, 75) - percentile(salaries, 25)
print(iqr)

413250.0
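
If NumPy is not available, the quartiles can also be sketched with the standard library. To my understanding, `statistics.quantiles` with `method="inclusive"` (Python 3.8+) uses the same linear interpolation as the default of `numpy.percentile`, while its default `method="exclusive"` generally gives different quartiles. The helper `linear_percentile` below is a hypothetical hand-rolled version, included only to show the interpolation.

```python
import math
import random
from statistics import quantiles

# Recreate the Exercise 4 data so this cell is self-contained
random.seed(0)
salaries = [round(random.random() * 1000000, -3) for _ in range(100)]

def linear_percentile(data, pct):
    # Linear interpolation between the two closest ranks of the sorted data
    ordered = sorted(data)
    position = (len(ordered) - 1) * pct / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

# n=4 splits the data into quartiles and returns the three cut points
q1, _q2, q3 = quantiles(salaries, n=4, method="inclusive")
manual_iqr = linear_percentile(salaries, 75) - linear_percentile(salaries, 25)

print(q3 - q1)
print(math.isclose(q3 - q1, manual_iqr))
```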


### quartile coefficient of dispersion

In [35]:
from numpy import percentile
# https://www.statisticshowto.com/coefficient-of-dispersion/
# qcd = (q3 - q1) / (q3 + q1)
q1 = percentile(salaries, 25)
q3 = percentile(salaries, 75)
qcd = (q3 - q1) / (q3 + q1)
print(qcd)

0.338660110633067


## Exercise 7: Scaling data
### min-max scaling

In [57]:
# min-max scaling = sal_scaled
# sal_scaled = (x - min) / (max - min) where x = each data point
# this formula scales the data to the range [0, 1] to normalize it

sal_min = min(salaries)                        # compute the minimum once
sal_range = max(salaries) - sal_min            # get range of salaries
sal_scaled = [(x - sal_min) / sal_range for x in salaries]

# I prefer to use list comprehensions. As a regular loop it would look something like:
# sal_scaled = []
# for x in salaries:
#     scaled = (x - sal_min) / sal_range
#     sal_scaled.append(scaled)

print(sal_scaled[:10])

[0.0, 0.01306532663316583, 0.07939698492462312, 0.0814070351758794, 0.08944723618090453, 0.10050251256281408, 0.10854271356783919, 0.18693467336683417, 0.18894472361809045, 0.19095477386934673]
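
A quick sanity check for min-max scaling is that the smallest value maps to 0 and the largest to 1. The sketch below regenerates the Exercise 4 data and checks exactly that; it assumes the data are not all identical (otherwise the range would be 0).

```python
import random

# Recreate the Exercise 4 data so this cell is self-contained
random.seed(0)
salaries = [round(random.random() * 1000000, -3) for _ in range(100)]

sal_min = min(salaries)
sal_range = max(salaries) - sal_min
sal_scaled = [(x - sal_min) / sal_range for x in salaries]

# The minimum maps to (min - min) / range and the maximum to range / range
print(min(sal_scaled), max(sal_scaled))  # → 0.0 1.0
```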


### standardizing

In [58]:
from statistics import mean, stdev
# z-score = (datapoint - mean of dataset) / stdev of dataset

sal_mean = mean(salaries)
sal_stdev = stdev(salaries)
z_score = [(x - sal_mean) / sal_stdev for x in salaries]
# similar to min-max scaling, this list comp as a loop looks something like:
# z_score = []
# for x in salaries:
#     score = (x - sal_mean) / sal_stdev
#     z_score.append(score)

print(z_score[:10])

[-2.199512275430514, -2.150608309943509, -1.9023266390094862, -1.8948029520114855, -1.8647082040194827, -1.8233279255304788, -1.7932331775384762, -1.4998093846164489, -1.4922856976184482, -1.4847620106204475]
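
Standardized data should have a mean of 0 and a (sample) standard deviation of 1, up to floating-point rounding. A minimal check, regenerating the Exercise 4 data so it runs on its own:

```python
import math
import random
from statistics import mean, stdev

# Recreate the Exercise 4 data so this cell is self-contained
random.seed(0)
salaries = [round(random.random() * 1000000, -3) for _ in range(100)]

sal_mean = mean(salaries)
sal_stdev = stdev(salaries)
z_score = [(x - sal_mean) / sal_stdev for x in salaries]

# Compare with tolerances because floating-point results are rarely exact
print(math.isclose(mean(z_score), 0, abs_tol=1e-9))
print(math.isclose(stdev(z_score), 1))
```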


## Exercise 8: Calculating covariance and correlation
### covariance

In [60]:
from numpy import cov
from statistics import mean

# cov(X, Y) = E[(X - E[X])(Y - E[Y])]
# E[X] = expected value of X = sum of each possible value of X times its probability
# for data: covariance of x and y = mean of ((x_dp - mean of x) * (y_dp - mean of y))
# dp = datapoint; this requires looping through both lists together
# https://numpy.org/doc/stable/reference/generated/numpy.cov.html

x = sal_scaled      # Min Max Scaling from previous exercise
y = z_score         # Standardization from previous exercise

sal_cov = cov(x, y)
print(f"numpy solution:\n{sal_cov}")

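# Since using numpy for everything feels like a cop-out, this is (roughly) what it's doing.
# Note: this version divides by n (population covariance), while numpy.cov divides by
# n - 1 by default, so the two results differ by a factor of (n - 1) / n.
def covariance(x, y):
    # x_dp and y_dp = datapoints of x and y
    # zip pairs the two lists (x and y) together based on index position
    # note that the lists MUST be equal length (zip silently stops at the shorter one)
    x_mean, y_mean = mean(x), mean(y)
    return mean([(x_dp - x_mean) * (y_dp - y_mean) for x_dp, y_dp in zip(x, y)])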

print(f"function solution:\n{covariance(x, y)}")

numpy solution:
[[0.07137603 0.26716293]
 [0.26716293 1.        ]]
function solution:
0.26449129918250414
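
The two results differ because `numpy.cov` divides by n - 1 by default, whereas taking the mean of the products divides by n; with n = 100 the ratio is 0.99, which matches 0.26449 / 0.26716. The sketch below is a sample-covariance version with Bessel's correction. The name `sample_covariance` is hypothetical, and the data are regenerated so the cell is self-contained.

```python
import math
import random
from statistics import mean, stdev, variance

# Recreate the data and both scaled versions from the earlier exercises
random.seed(0)
salaries = [round(random.random() * 1000000, -3) for _ in range(100)]
sal_min, sal_max = min(salaries), max(salaries)
x = [(s - sal_min) / (sal_max - sal_min) for s in salaries]       # min-max scaled
y = [(s - mean(salaries)) / stdev(salaries) for s in salaries]    # standardized

def sample_covariance(x, y):
    # Sum of the products of deviations, divided by n - 1 (Bessel's correction)
    x_mean, y_mean = mean(x), mean(y)
    products = [(a - x_mean) * (b - y_mean) for a, b in zip(x, y)]
    return sum(products) / (len(x) - 1)

print(sample_covariance(x, y))
# Self-check: the covariance of a variable with itself is its sample variance
print(math.isclose(sample_covariance(x, x), variance(x)))
```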


### Pearson correlation coefficient ($\rho$)

In [59]:
from statistics import stdev

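# rho = cov(x, y) / (stdev(x) * stdev(y))
# note: covariance() above divides by n while stdev divides by n - 1, so for perfectly
# correlated data this yields (n - 1) / n = 0.99 rather than 1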

x = sal_scaled
y = z_score
sal_pearson = covariance(x, y) / (stdev(x) * stdev(y))
print(sal_pearson)

0.9900000000000001
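
Min-max scaling and standardizing are both increasing linear transformations of the same salaries, so the correlation should be 1. The 0.99 comes from combining a covariance that divides by n with standard deviations that divide by n - 1. A minimal sketch that uses n - 1 consistently (the function name `pearson` is an illustrative choice):

```python
import math
import random
from statistics import mean, stdev

# Recreate the data and both scaled versions from the earlier exercises
random.seed(0)
salaries = [round(random.random() * 1000000, -3) for _ in range(100)]
sal_min, sal_max = min(salaries), max(salaries)
x = [(s - sal_min) / (sal_max - sal_min) for s in salaries]       # min-max scaled
y = [(s - mean(salaries)) / stdev(salaries) for s in salaries]    # standardized

def pearson(x, y):
    # Sample covariance (n - 1) divided by the product of sample stdevs (n - 1)
    x_mean, y_mean = mean(x), mean(y)
    sample_cov = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y)) / (len(x) - 1)
    return sample_cov / (stdev(x) * stdev(y))

# Prints a value equal to 1 up to floating-point rounding
print(pearson(x, y))
```

On Python 3.10+, `statistics.correlation(x, y)` offers a built-in alternative.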

