## Summary notes

Visualising the average annual cost of college tuition fees in the USA.

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Happy to announce the newest <a href="https://twitter.com/hashtag/R4DS?src=hash&amp;ref_src=twsrc%5Etfw">#R4DS</a> online learning community project! <a href="https://twitter.com/hashtag/TidyTuesday?src=hash&amp;ref_src=twsrc%5Etfw">#TidyTuesday</a> is your weekly <a href="https://twitter.com/hashtag/tidyverse?src=hash&amp;ref_src=twsrc%5Etfw">#tidyverse</a> practice!<br><br>Each week we&#39;ll post data and a plot at <a href="https://t.co/8NaXR93uIX">https://t.co/8NaXR93uIX</a> under the datasets link.<br><br>You clean the data and tweak the plot in R!<a href="https://twitter.com/hashtag/rstats?src=hash&amp;ref_src=twsrc%5Etfw">#rstats</a> <a href="https://twitter.com/hashtag/ggplot2?src=hash&amp;ref_src=twsrc%5Etfw">#ggplot2</a> <a href="https://t.co/sDaHsB8uwL">pic.twitter.com/sDaHsB8uwL</a></p>&mdash; Tom Mock ❤️ Quarto (@thomas_mock) <a href="https://twitter.com/thomas_mock/status/980921600429252608?ref_src=twsrc%5Etfw">April 2, 2018</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

## Dependencies

In [1]:
import os
import requests
import pandas as pd
import polars as pl
import altair as alt
from vega_datasets import data as vdata

## Functions

In [2]:
def cache_file(url: str, fname: str, dir_: str = './__cache') -> str:
    """Cache the file at given url in the given dir_ with the given
    fname and return the local path.

    Preconditions:
    - dir_ exists
    """
    local_path = f'{dir_}/{fname}'
    if fname not in os.listdir(dir_):
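        r = requests.get(url, allow_redirects=True)
        r.raise_for_status()  # fail loudly rather than cache an error page
        with open(local_path, 'wb') as f:  # close the file handle promptly
            f.write(r.content)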
    return local_path
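
The caching pattern used by `cache_file` can be exercised without network access. The following is a minimal sketch, not the notebook's own helper: `cache_with` and its `fetch` parameter are hypothetical names introduced here, with `fetch` standing in for the `requests.get` call, and the sketch creates the cache directory rather than assuming it exists.

```python
import os
import tempfile
from typing import Callable


def cache_with(fetch: Callable[[], bytes], fname: str, dir_: str) -> str:
    """Write the bytes returned by fetch to dir_/fname unless that file
    already exists, and return the local path.

    Hypothetical helper: fetch stands in for the download step.
    """
    os.makedirs(dir_, exist_ok=True)  # assumption: create dir_ if missing
    local_path = os.path.join(dir_, fname)
    if not os.path.exists(local_path):
        with open(local_path, 'wb') as f:
            f.write(fetch())
    return local_path


calls = []  # records each time the fake download runs


def fake_fetch() -> bytes:
    """Pretend download that returns fixed bytes and logs the call."""
    calls.append(1)
    return b'example bytes'


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = os.path.join(tmp, '__cache')
    first = cache_with(fake_fetch, 'example.bin', cache_dir)
    second = cache_with(fake_fetch, 'example.bin', cache_dir)
    with open(first, 'rb') as f:
        cached = f.read()
```

The second call returns the same path without invoking `fake_fetch` again, which is the behaviour the notebook relies on when a cell is re-run.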

## Main

### Set theme

In [3]:
alt.themes.enable('latimes')

ThemeRegistry.enable('latimes')

### Cache the data

In [4]:
us_avg_tuition_path = cache_file(
    url=('https://github.com/rfordatascience/tidytuesday/blob/master'
         + '/data/2018/2018-04-02/us_avg_tuition.xlsx?raw=true'),
    fname='us_avg_tuition.xlsx'
)

In [5]:
ansi_path = cache_file(
    url=('https://www2.census.gov/geo/docs/reference/state.txt'),
    fname='state.txt'
)

### Load the data

In [6]:
us_avg_tuition = pl.DataFrame(pd.read_excel(us_avg_tuition_path)).lazy()
us_avg_tuition.schema

{'State': polars.datatypes.Utf8,
 '2004-05': polars.datatypes.Float64,
 '2005-06': polars.datatypes.Float64,
 '2006-07': polars.datatypes.Float64,
 '  2007-08 ': polars.datatypes.Float64,
 '2008-09': polars.datatypes.Float64,
 '2009-10': polars.datatypes.Float64,
 '2010-11': polars.datatypes.Float64,
 '2011-12': polars.datatypes.Float64,
 '2012-13': polars.datatypes.Float64,
 '2013-14': polars.datatypes.Float64,
 '2014-15': polars.datatypes.Float64,
 '2015-16': polars.datatypes.Float64}

In [7]:
ansi = pl.DataFrame(pd.read_csv(ansi_path, sep='|')).lazy()
ansi.schema

{'STATE': polars.datatypes.Int64,
 'STUSAB': polars.datatypes.Utf8,
 'STATE_NAME': polars.datatypes.Utf8,
 'STATENS': polars.datatypes.Int64}

In [8]:
states = alt.topo_feature(vdata.us_10m.url, 'states')

### Prepare the data

In [9]:
lazy_query = us_avg_tuition.select(
    ['State',
     '2010-11',
     '2015-16']
).melt(
    id_vars='State',
    variable_name='year',
    value_name='tuition'
).join(
    other=ansi,
    left_on='State',
    right_on='STATE_NAME',
    how='inner'
).with_column(
    pl.col('tuition').pct_change().over('State').alias('pct_change')
).filter(
    pl.col('pct_change').is_not_null()
).select(
    [pl.col('STATE').alias('state_id'),
     pl.col('State').alias('state_name'),
     pl.col('tuition'),
     (pl.col('pct_change') * 100).round(1)]
)
lazy_query.schema

{'state_id': polars.datatypes.Int64,
 'state_name': polars.datatypes.Utf8,
 'tuition': polars.datatypes.Float64,
 'pct_change': polars.datatypes.Float64}
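
The `pct_change` column is the relative change between the two selected years, scaled to a percentage and rounded to one decimal place. As a plain-Python sketch of that arithmetic (the function below and its input figures are made up for illustration and are not taken from the dataset):

```python
def pct_change(old: float, new: float) -> float:
    """Return the percentage change from old to new, rounded to 1 dp."""
    return round((new - old) / old * 100, 1)


# Illustrative values only, not rows from us_avg_tuition.xlsx
example = pct_change(old=8000.0, new=9000.0)
```

A positive result means tuition rose over the period, and a negative result means it fell.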

### Visualise the data

Both visualisations will use the same source, so we initialise a single instance of `alt.Chart`.

In [10]:
ch = alt.Chart(lazy_query.collect().to_pandas())

Plot the percentage change in annual college tuition costs from 2010-11 to 2015-16 as a *choropleth map*.

In [11]:
ch.mark_geoshape(
    stroke='black'
).encode(
    shape='geo:G',
    color=alt.Color(
        'pct_change',
        scale=alt.Scale(scheme='oranges'),
        legend=alt.Legend(title='Change (%)')
    ),
    tooltip=[alt.Tooltip('state_name', title='State'),
             alt.Tooltip('pct_change', title='Change (%)')]
).transform_lookup(
    lookup='state_id',
    from_=alt.LookupData(data=states, key='id'),
    as_='geo'
).project(
    type='albersUsa'
).properties(
    title='Percentage change in college tuition costs between 2010 and 2015',
    width=600,
    height=400
)

Plot the annual cost of college tuition in the USA in 2015-16 as a *dot plot*.

In [12]:
ch.mark_circle(
    size=60
).encode(
    x=alt.X('tuition', title='Tuition ($)'),
    y=alt.Y('state_name', sort='-x', axis=alt.Axis(grid=True), title='State')
).properties(
    title='College tuition costs in the USA (2015-16)',
    width=400,
    height=600
)

In [13]:
%load_ext watermark
%watermark --iv

sys     : 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
polars  : 0.14.6
altair  : 4.2.0
requests: 2.28.1
pandas  : 1.4.3

