## Summary notes

This **#TidyTuesday** project was posted back on 21st May, 2018.
Here's the motivating tweet from [@thomas_mock](https://twitter.com/thomas_mock):

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">/1 The <a href="https://twitter.com/R4DScommunity?ref_src=twsrc%5Etfw">@R4DScommunity</a> welcomes you to week 8 of <a href="https://twitter.com/hashtag/TidyTuesday?src=hash&amp;ref_src=twsrc%5Etfw">#TidyTuesday</a> !<br><br>We&#39;re exploring US Honey production data! Trying something new this week, please read through the thread!<br><br>Data: <a href="https://t.co/sElb4fcv3u">https://t.co/sElb4fcv3u</a> <br>Article: <a href="https://t.co/0OSk49AwlR">https://t.co/0OSk49AwlR</a><a href="https://twitter.com/hashtag/rstats?src=hash&amp;ref_src=twsrc%5Etfw">#rstats</a> <a href="https://twitter.com/hashtag/tidyverse?src=hash&amp;ref_src=twsrc%5Etfw">#tidyverse</a> <a href="https://twitter.com/hashtag/r4ds?src=hash&amp;ref_src=twsrc%5Etfw">#r4ds</a> <a href="https://twitter.com/hashtag/dataviz?src=hash&amp;ref_src=twsrc%5Etfw">#dataviz</a> <a href="https://t.co/NMaCTgtkEn">pic.twitter.com/NMaCTgtkEn</a></p>&mdash; Tom Mock (@thomas_mock) <a href="https://twitter.com/thomas_mock/status/998560140612784128?ref_src=twsrc%5Etfw">May 21, 2018</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

We defined a `dataclass`, *Constants*, to hold our `str` literals.

The data was in a zip file, so the first part of the analysis focused on downloading the zip file and extracting the files to a local directory.

The zip file contained four CSV files: three raw, messy CSV files and one tidy, combined CSV file.
There was a bonus exercise to tidy the raw CSV files, which we were sorely tempted to do, but we used the tidied file instead.

We processed the data by recasting the [*state*, *year*] columns to [`CategoricalDtype`, `datetime64`], and then setting these columns as a `MultiIndex`.

We closed by plotting three graphs:

1. Annual total honey yield
2. Annual total number of colonies
3. Annual median honey yield per colony 

## Dependencies

In [1]:
import os
import requests
import zipfile
from dataclasses import dataclass
import pandas as pd
import altair as alt

## Classes

In [2]:
@dataclass(frozen=True)
class Constants:
    # type annotations make these dataclass fields rather than class attributes
    zfile_remote: str = ('https://github.com/rfordatascience/tidytuesday/blob/'
                         + 'master/data/2018/2018-05-21/'
                         + 'week8_honey_production.zip?raw=true')
    temp_dir: str = './__temp'
    zfile: str = 'week8_honey_production.zip'
    honeyprod: str = 'honeyproduction.csv'

    @property
    def zfile_local(self) -> str:
        return self.temp_dir + '/' + self.zfile

    @property
    def honeyprod_local(self) -> str:
        return self.temp_dir + '/' + self.honeyprod
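
As a minimal sketch of how a frozen dataclass of string constants behaves, the example below uses a hypothetical `Paths` class (its names are assumptions and not part of this notebook). Only annotated attributes are registered as dataclass fields, and `frozen=True` makes reassignment on an instance raise `FrozenInstanceError`.

```python
from dataclasses import FrozenInstanceError, dataclass, fields


# Hypothetical class for illustration only; the names are assumptions.
@dataclass(frozen=True)
class Paths:
    temp_dir: str = './__temp'
    csv_name: str = 'example.csv'

    @property
    def csv_local(self) -> str:
        # Derived from the fields each time it is accessed
        return self.temp_dir + '/' + self.csv_name


paths = Paths()
field_names = [f.name for f in fields(paths)]

# Reassigning an attribute on a frozen instance is rejected.
try:
    paths.temp_dir = './elsewhere'
    reassigned = True
except FrozenInstanceError:
    reassigned = False
```

The derived paths are properties rather than fields, so they always reflect the current field values and do not need to be kept in sync by hand.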

## Main

### Initialise the constants

In [3]:
constants = Constants()

### Get the zip file

In [4]:
#| code-summary: 'Download the zip file'
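os.makedirs(constants.temp_dir, exist_ok=True)  # listdir fails on a missing directory
if constants.zfile not in os.listdir(constants.temp_dir):
    r = requests.get(constants.zfile_remote, allow_redirects=True)
    r.raise_for_status()  # fail loudly rather than save an error page
    with open(constants.zfile_local, 'wb') as f:
        f.write(r.content)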

In [5]:
#| code-summary: 'Inspect the zip file'
zf = zipfile.ZipFile(constants.zfile_local)
zf.printdir()

File Name                                             Modified             Size
honeyraw_2003to2007.csv                        2018-04-09 23:31:26        19029
honeyproduction.csv                            2018-04-09 23:31:26        29143
honeyraw_2008to2012.csv                        2018-04-09 23:31:26        20722
honeyraw_1998to2002.csv                        2018-04-09 23:31:26        12920


In [6]:
#| code-summary: 'Extract the honeyprod csv file'
zf.extract(constants.honeyprod, constants.temp_dir)

'__temp\\honeyproduction.csv'
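
Extracting to disk is one option. As a sketch of an alternative, a member can also be read straight from the archive with `ZipFile.open`. The example builds a small in-memory archive with made-up contents (the member name and rows are assumptions, not the honey data) so that it runs without a download.

```python
import csv
import io
import zipfile

# Build a tiny in-memory zip; the member name and rows are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode='w') as zf_out:
    zf_out.writestr('example.csv', 'state,value\nAL,1\nAZ,2\n')

# Re-open the archive and read the member without extracting it to disk.
with zipfile.ZipFile(buffer) as zf_in:
    names = zf_in.namelist()
    with zf_in.open('example.csv') as member:
        text = io.TextIOWrapper(member, encoding='utf-8', newline='')
        rows = list(csv.DictReader(text))
```

`pd.read_csv` also accepts a file-like object, so the opened member could be passed to it directly instead of going through the `csv` module.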

### Load the data

In [7]:
#| code-summary: 'Read honeyprod into a DataFrame'
honeyprod = pd.read_csv(constants.honeyprod_local)
honeyprod.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 626 entries, 0 to 625
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   state        626 non-null    object 
 1   numcol       626 non-null    float64
 2   yieldpercol  626 non-null    int64  
 3   totalprod    626 non-null    float64
 4   stocks       626 non-null    float64
 5   priceperlb   626 non-null    float64
 6   prodvalue    626 non-null    float64
 7   year         626 non-null    int64  
dtypes: float64(5), int64(2), object(1)
memory usage: 39.2+ KB


### Process the data

In [8]:
#| code-summary: 'Copy honeyprod and transform the copy'
v_honeyprod = honeyprod.copy()  # a copy, so the raw DataFrame is left unchanged
# set [state] to CategoricalDtype
v_honeyprod['state'] = pd.Categorical(
    honeyprod['state'].to_numpy(),
    honeyprod['state'].drop_duplicates().to_numpy(),
    ordered=False
)
# set [year] to datetime64
v_honeyprod['year'] = pd.to_datetime(
    v_honeyprod['year'].to_numpy(), format='%Y'
)
# set a new multiindex on (year, state)
v_honeyprod = v_honeyprod.set_index(['year', 'state'])
v_honeyprod.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 626 entries, (Timestamp('1998-01-01 00:00:00'), 'AL') to (Timestamp('2012-01-01 00:00:00'), 'WY')
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   numcol       626 non-null    float64
 1   yieldpercol  626 non-null    int64  
 2   totalprod    626 non-null    float64
 3   stocks       626 non-null    float64
 4   priceperlb   626 non-null    float64
 5   prodvalue    626 non-null    float64
dtypes: float64(5), int64(1)
memory usage: 32.8 KB
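
With a (year, state) `MultiIndex` in place, rows can be selected or aggregated by either level. The sketch below illustrates this on a toy frame with made-up values (not taken from the honey data); the variable names are assumptions for the example only.

```python
import pandas as pd

# Toy rows with made-up values (not taken from the honey data).
toy = pd.DataFrame({
    'state': ['AL', 'AZ', 'AL', 'AZ'],
    'year': [1998, 1998, 1999, 1999],
    'numcol': [10.0, 20.0, 30.0, 40.0],
})
toy['state'] = toy['state'].astype('category')
toy['year'] = pd.to_datetime(toy['year'].astype(str), format='%Y')
toy = toy.set_index(['year', 'state'])

# Cross-section on one index level: every year for a single state.
al_rows = toy.xs('AL', level='state')

# Aggregate over the other level: yearly totals across states.
yearly = toy.groupby(level='year')['numcol'].sum()
```

Grouping by the `year` level is the same pattern the charts in the next section rely on.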


### Visualise the data

In [9]:
#| code-summary: 'Annual total honey yield'
gsource = (
    v_honeyprod
    .pipe(lambda df: df['yieldpercol'] * df['numcol'])  # vectorised product
    .rename('totalyield')
    .groupby('year')
    .sum()
    .div(1_000_000)
    .to_frame()
    .reset_index()
)
alt.Chart(gsource).mark_line().encode(
    x='year',
    y=alt.Y('totalyield', title='total yield (millions of pounds)')
).properties(
    width=600,
    height=400,
    title='Total honey yield decreased between 1998 and 2012'
).configure_title(
    anchor='start'
)

In [10]:
#| code-summary: 'Annual total number of colonies'
gsource = (
    v_honeyprod
    .get('numcol')
    .groupby('year')
    .sum()
    .div(1_000_000)
    .rename('totalnumcol')
    .to_frame()
    .reset_index()
)
alt.Chart(gsource).mark_line().encode(
    x='year',
    y=alt.Y('totalnumcol', title='total number of colonies (millions)')
).properties(
    width=600,
    height=400,
    title='Total number of colonies remained stable between 1998 and 2012'
).configure_title(
    anchor='start'
)

In [11]:
#| code-summary: 'Annual median honey yield per colony'
ch = alt.Chart(v_honeyprod.reset_index())
line = ch.mark_line().encode(
    x='year',
    y=alt.Y('median(yieldpercol)', title='median yield per colony (lbs)')
)
band = ch.mark_errorband(extent='iqr').encode(
    x='year',
    y=alt.Y('yieldpercol', title='')
)
(line + band).properties(
    width=600,
    height=400,
    title='Median honey yield per colony decreased between 1998 and 2012'
).configure_title(
    anchor='start'
)
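
The band drawn with `extent='iqr'` spans the first to third quartile around the median line. As a rough cross-check, the same three statistics can be computed per year in pandas. This is a sketch on made-up toy values (not the honey data), and pandas' default quantile interpolation may differ slightly from the one used by the charting library.

```python
import pandas as pd

# Toy yields with made-up values (not taken from the honey data).
toy = pd.DataFrame({
    'year': [1998] * 5 + [1999] * 5,
    'yieldpercol': [1, 2, 3, 4, 5, 10, 20, 30, 40, 50],
})

# Per-year quartiles and median, one row per year.
grouped = toy.groupby('year')['yieldpercol']
summary = pd.DataFrame({
    'q1': grouped.quantile(0.25),
    'median': grouped.median(),
    'q3': grouped.quantile(0.75),
})
```

Each row of `summary` then holds the lower edge of the band, the line, and the upper edge of the band for one year.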