## Summary notes

Visualising how the average cap value of the highest-paid players in each NFL position has changed over time.

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Welcome to Week 2 of <a href="https://twitter.com/hashtag/TidyTuesday?src=hash&amp;ref_src=twsrc%5Etfw">#TidyTuesday</a>! We&#39;ll be exploring a 538 article on NFL salaries! Good luck &amp; have fun!<br><br>Data: <a href="https://t.co/8NaXR93uIX">https://t.co/8NaXR93uIX</a><br>Article: <a href="https://t.co/6GdoPb0hJc">https://t.co/6GdoPb0hJc</a><br>Data Source: <a href="https://t.co/vliDMOl9Lc">https://t.co/vliDMOl9Lc</a><br>Blog: <a href="https://t.co/cZJ94Hhz7U">https://t.co/cZJ94Hhz7U</a><a href="https://twitter.com/hashtag/tidyverse?src=hash&amp;ref_src=twsrc%5Etfw">#tidyverse</a> <a href="https://twitter.com/hashtag/rstats?src=hash&amp;ref_src=twsrc%5Etfw">#rstats</a> <a href="https://twitter.com/hashtag/dataviz?src=hash&amp;ref_src=twsrc%5Etfw">#dataviz</a> <a href="https://twitter.com/hashtag/ggplot2?src=hash&amp;ref_src=twsrc%5Etfw">#ggplot2</a> <a href="https://twitter.com/hashtag/r4ds?src=hash&amp;ref_src=twsrc%5Etfw">#r4ds</a> <a href="https://t.co/AF7qTFLvkj">pic.twitter.com/AF7qTFLvkj</a></p>&mdash; Tom Mock ❤️ Quarto (@thomas_mock) <a href="https://twitter.com/thomas_mock/status/983330650257272832?ref_src=twsrc%5Etfw">April 9, 2018</a></blockquote>

## Dependencies

In [1]:
import os
import requests
import polars as pl
import pandas as pd
import altair as alt
from matplotlib import pyplot as plt
import seaborn as sns

## Functions

In [2]:
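def cache_file(url: str, fname: str, dir_: str = './__cache') -> str:
    """Cache the file at the given url in dir_ under the name fname
    and return the local path.

    Preconditions:
    - dir_ exists
    """
    local_path = f'{dir_}/{fname}'
    # Download only if the file is not already in the cache
    if fname not in os.listdir(dir_):
        r = requests.get(url, allow_redirects=True)
        r.raise_for_status()  # fail loudly rather than cache an error page
        # Use a context manager so the file handle is closed promptly
        with open(local_path, 'wb') as f:
            f.write(r.content)
    return local_path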
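
`cache_file` states as a precondition that `dir_` exists, and it fails with a `FileNotFoundError` on a fresh checkout where the directory is missing. A minimal sketch of a setup step that could run first is given below. The helper name `ensure_cache_dir` is hypothetical and not part of the original notebook.

```python
import os


def ensure_cache_dir(dir_: str = './__cache') -> str:
    """Create dir_ if it is missing and return it unchanged.

    Hypothetical helper: satisfies the precondition of cache_file.
    """
    # exist_ok=True makes the call a no-op when the directory already exists
    os.makedirs(dir_, exist_ok=True)
    return dir_
```

Calling `ensure_cache_dir()` once before the caching cell should be enough; repeated calls are harmless.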

## Main

### Cache the data

In [3]:
nfl_salary_path = cache_file(
    url=('https://github.com/rfordatascience/tidytuesday/blob'
         + '/master/data/2018/2018-04-09/nfl_salary.xlsx?raw=true'),
    fname='nfl_salary.xlsx'
)

### Load the data

In [4]:
nfl_salary = pl.DataFrame(pd.read_excel(nfl_salary_path)).lazy()
nfl_salary.schema

{'year': polars.datatypes.Int64,
 'Cornerback': polars.datatypes.Int64,
 'Defensive Lineman': polars.datatypes.Int64,
 'Linebacker': polars.datatypes.Int64,
 'Offensive Lineman': polars.datatypes.Int64,
 'Quarterback': polars.datatypes.Float64,
 'Running Back': polars.datatypes.Int64,
 'Safety': polars.datatypes.Int64,
 'Special Teamer': polars.datatypes.Float64,
 'Tight End': polars.datatypes.Int64,
 'Wide Receiver': polars.datatypes.Int64}

### Prepare the data

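We reshape the data to long format (one row per year, position and salary), drop missing values, remove high outliers (salaries of \$30M or more), and keep the 16 highest salaries for each position and year.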

In [5]:
lazy_query = nfl_salary.melt(
    id_vars='year',
    variable_name='position',
    value_name='salary'
).with_column(
    pl.col('salary').cast(int)
).drop_nulls(
).sort(
    by=['position', 'year', 'salary'],
    reverse=[False, False, True]
).filter(
    pl.col('salary') < 30_000_000
).groupby(
    by=['position', 'year'],
    maintain_order=True
).head(
    16
)
lazy_query.schema

{'position': polars.datatypes.Utf8,
 'year': polars.datatypes.Int64,
 'salary': polars.datatypes.Int64}
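
To make the query above concrete, here is a plain-Python sketch of the same steps (drop missing values and outliers, then keep the n highest salaries per position and year) on made-up rows. The function name `top_n_per_group`, the toy figures and the small `n` are illustrative assumptions. It does not use polars and is not the notebook's own method.

```python
from collections import defaultdict


def top_n_per_group(rows, n=16, cap=30_000_000):
    """Return the n largest salaries below cap for each (position, year)."""
    groups = defaultdict(list)
    for position, year, salary in rows:
        # Skip missing values and high outliers, as the lazy query does
        if salary is None or salary >= cap:
            continue
        groups[(position, year)].append(salary)
    # Sort each group in descending order and keep the first n entries
    return {key: sorted(vals, reverse=True)[:n] for key, vals in groups.items()}


# Toy rows (made-up figures): (position, year, salary)
rows = [
    ('Quarterback', 2018, 37_000_000),   # dropped: at or above the cap
    ('Quarterback', 2018, 25_000_000),
    ('Quarterback', 2018, 20_000_000),
    ('Quarterback', 2018, None),         # dropped: missing value
    ('Running Back', 2018, 8_000_000),
    ('Running Back', 2018, 9_000_000),
    ('Running Back', 2018, 7_000_000),
]
top2 = top_n_per_group(rows, n=2)
```

With `n=2`, each group keeps only its two largest remaining salaries, which mirrors what `head(16)` does after the descending sort in the query.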

### Visualise the data

In [6]:
_ch = alt.Chart(
    lazy_query.collect().to_pandas()
).properties(
    width=110,
    height=200
)
_line = _ch.mark_line(
    color='darkorange'
).encode(
    x=alt.X('year:N', title=''),
    y=alt.Y('mean(salary)', title='Average cap value ($USD)'),
)
_scatter = _ch.mark_circle(color='lightgrey').encode(
    x='year:N',
    y='salary'
)
alt.layer(_scatter, _line).facet(
    facet=alt.Facet('position', title=''),
    columns=5,
    title=alt.TitleParams(
        'The average pay for top running backs has stalled',
        subtitle=('Average cap value of 16 highest-paid players in each'
                  + ' position'),
        fontSize=18
    )
)

In [7]:
_gsource = lazy_query.with_columns(
    [pl.col('salary').sum().over(['position', 'year']).alias('total_by_pos'),
     pl.col('salary').sum().over(['year']).alias('total')]
).with_column(
    (pl.col('total_by_pos') / pl.col('total')).alias('prop_by_pos')
).with_column(
    (pl.col('prop_by_pos') * 100).round(1).alias('pct_by_pos')
).select(
    ['position',
     'year',
     'pct_by_pos']
).unique(
)

_ch = alt.Chart(
    _gsource.collect().to_pandas()
).properties(
    width=600,
    height=400,
    title=alt.TitleParams(
        'Teams are spending less on RBs',
        subtitle=('Percent of money spent on the top 16 players at each'
                  + ' position'),
        anchor='start',
        fontSize=18
    )
)
# Colour comes from the 'position' encoding, so no fixed mark colour is set
_line = _ch.mark_line().encode(
    x=alt.X('year:N', title=''),
    y=alt.Y('pct_by_pos', title='Percent spent at each position'),
    color='position'
)
_scatter = _ch.mark_circle().encode(
    x='year:N',
    y='pct_by_pos',
    color='position'
)
_scatter + _line
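
The percentage plotted above is each position's summed salary divided by the summed salary of all positions in the same year. A plain-Python sketch of that arithmetic on made-up figures follows. The function name `pct_by_position` and the toy rows are illustrative assumptions, not part of the notebook.

```python
from collections import defaultdict


def pct_by_position(rows):
    """Return {(position, year): percent of that year's total salary}."""
    total_by_pos = defaultdict(int)
    total_by_year = defaultdict(int)
    for position, year, salary in rows:
        # Accumulate the two totals that the window expressions compute
        total_by_pos[(position, year)] += salary
        total_by_year[year] += salary
    # Share of the yearly total, as a percentage rounded to one decimal place
    return {
        (position, year): round(100 * total / total_by_year[year], 1)
        for (position, year), total in total_by_pos.items()
    }


# Toy rows (made-up figures): (position, year, salary)
rows = [
    ('Quarterback', 2018, 60),
    ('Quarterback', 2018, 40),
    ('Running Back', 2018, 50),
    ('Wide Receiver', 2018, 50),
]
shares = pct_by_position(rows)
```

Within a single year the shares should add up to roughly 100, apart from rounding.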

In [8]:
%load_ext watermark
%watermark --iv

sys       : 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
altair    : 4.2.0
seaborn   : 0.11.2
matplotlib: 3.5.3
requests  : 2.28.1
pandas    : 1.4.3
polars    : 0.14.6

