# Colorado cannabis data

In this notebook, we're going to look at some county-level data on retail and medical cannabis sales in Colorado. [The data are](https://www.colorado.gov/pacific/revenue/colorado-marijuana-sales-reports) updated every month by the Colorado Department of Revenue.

The cannabis sales data lives here: `../data/co-cannabis-sales.csv`.

A few things to note about this data:

- Every row in our data is the sum of one month of sales for one category of cannabis ("retail" or "medical") for one county
- Not every county in Colorado has pot shops
- Not every county in Colorado has _retail_ pot shops
- To maintain taxpayer privacy, the state releases aggregate sales data only for counties with at least three dispensaries, and then only if none represent more than 80 percent of total sales, according to the Colorado Department of Revenue. Totals for counties that don't meet these criteria are represented in the data as 'NR'
- One of the "counties" in the data is "Sum of NR Counties" -- the weed sales from all of the "NR" counties grouped together for that month -- which is how everything totals up like it should

We also have a CSV of Colorado county population estimates for 2016: `../data/co-county-pop.csv`. We'll use this to help us answer a question about per-capita sales.

Let's load up our data.

First, we'll import pandas. Then we'll tell pandas to change the way it displays floating-point numbers (decimals, basically) so that we won't have to look at big numbers in scientific notation later on (gross, no thanks).

Then we'll use the `read_csv()` method to create dataframes for each CSV as we need them.

In [None]:
# import pandas


In [None]:
# display floats with thousand-separator commas and no decimal points


In [None]:
# read in data frame from the sales CSV


In [None]:
# use `head()` to check the output
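
Here's a rough sketch of how the four cells above might be filled in. To keep it runnable on its own, it reads a tiny made-up CSV from a string instead of `../data/co-cannabis-sales.csv`. The `month` and `sales_type` column names are assumptions; check the real file's header.

```python
import io

import pandas as pd

# display floats with thousand-separator commas and no decimal places
pd.options.display.float_format = '{:,.0f}'.format

# made-up stand-in for the sales CSV (values invented, column names partly assumed)
sample_csv = io.StringIO(
    "month,year,county,sales_type,amount\n"
    "1,2017,Denver,retail,1000000.5\n"
    "1,2017,Denver,medical,500000.25\n"
    "2,2017,Boulder,retail,250000.0\n"
)

# in the notebook, this would be pd.read_csv('../data/co-cannabis-sales.csv')
df_sales = pd.read_csv(sample_csv)

# check the first few rows
print(df_sales.head())
```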


### Noodling

Let's check out the unique values in the columns, run some summary stats, check out samples, etc.

In [None]:
# check sorted() month values
# https://docs.python.org/3/howto/sorting.html#sorting-basics


In [None]:
# check year values


In [None]:
# check county values


In [None]:
# check sales type values


In [None]:
# grab a sample
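
One possible way to do this kind of noodling, sketched on a small made-up dataframe. The `month` and `sales_type` column names are assumptions, not confirmed by the data.

```python
import pandas as pd

# made-up rows standing in for df_sales (column names partly assumed)
df_sales = pd.DataFrame({
    'month': [3, 1, 2, 1],
    'year': [2016, 2017, 2017, 2016],
    'county': ['Denver', 'Boulder', 'Denver', 'Sum of NR Counties'],
    'sales_type': ['retail', 'medical', 'retail', 'retail'],
    'amount': [100.0, 50.0, 75.0, 20.0],
})

# unique values in each column, sorted so they're easier to scan
months = sorted(df_sales['month'].unique())
years = sorted(df_sales['year'].unique())
counties = sorted(df_sales['county'].unique())
sales_types = sorted(df_sales['sales_type'].unique())

# a random sample of two rows
row_sample = df_sales.sample(2)
print(months, years, counties, sales_types)
print(row_sample)
```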


### Analysis

Let's answer some questions:

- Total sales for all years?
- Totals by year?
- Totals by county by year?
- Percent difference from 2014-2017, by county?
- Top counties in terms of per-capita retail sales for 2017? ([Checking this guy's work, basically](https://www.thecannabist.co/2018/02/09/colorado-marijuana-sales-southern-border/98669/).)

#### Total sales, all years

This one's pretty simple: Use `sum()`.

In [None]:
# sum the values in the `amount` column
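
A minimal sketch of the sum, using invented numbers in place of the real sales data:

```python
import pandas as pd

# made-up stand-in for df_sales
df_sales = pd.DataFrame({
    'year': [2016, 2016, 2017],
    'amount': [100.0, 50.0, 75.0],
})

# sum the values in the `amount` column
total_sales = df_sales['amount'].sum()
print(total_sales)
```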


#### Totals by year

For this, we'll select the two columns we're interested in ('year' and 'amount') and then use `groupby()` and `sum()`.

👉 For more details on grouping data in pandas, [check out this notebook](../reference/Grouping%20data%20in%20pandas.ipynb).

In [None]:
# select year, amount and groupby year, sum values
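
The select-then-group pattern described above could look something like this, again on invented numbers:

```python
import pandas as pd

# made-up stand-in for df_sales
df_sales = pd.DataFrame({
    'year': [2016, 2016, 2017],
    'county': ['Denver', 'Boulder', 'Denver'],
    'amount': [100.0, 50.0, 75.0],
})

# select the two columns, group by year, sum the amounts
by_year = df_sales[['year', 'amount']].groupby('year').sum()
print(by_year)
```

The result is a dataframe indexed by `year`, with one `amount` total per year.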


#### Totals by county and year

For this one, we'll need a pivot table. We're going to hand the [`pd.pivot_table()`](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.pivot_table.html) method five arguments:
- `df_sales`: The dataframe we're pivoting
- `index='county'`: The grouping column that will become the rows in our pivot table
- `values='amount'`: The column we're doing math on
- `aggfunc='sum'`: What aggregate function to apply to the values -- in this case, we want a sum
- `columns='year'`: The second grouping column that will become the columns in our pivot table

We'll fill null values with zeroes using the [`fillna()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html) method. We're also going to save this one to a variable, `by_county_by_year`, because we'll use the pivot table to help us answer our next question.

In [None]:
# make a pivot table
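
Here's a sketch of the pivot table call with the five arguments listed above, followed by `fillna()`. The sample rows are made up; Boulder has no 2014 row on purpose, to show the null getting filled with zero.

```python
import pandas as pd

# made-up stand-in for df_sales
df_sales = pd.DataFrame({
    'year': [2014, 2017, 2017, 2017],
    'county': ['Denver', 'Denver', 'Denver', 'Boulder'],
    'amount': [100.0, 100.0, 200.0, 50.0],
})

# counties as rows, years as columns, summed amounts as values
by_county_by_year = pd.pivot_table(
    df_sales,
    index='county',
    values='amount',
    aggfunc='sum',
    columns='year',
).fillna(0)

print(by_county_by_year)
```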


#### Percent difference from 2014-2017, by county

Let's build on the pivot table we just made by adding a calculated column.

First, though, we need to use [`reset_index()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html) to change the data structure from a pivot table with multiple levels into a plain old dataframe. Then we'll change the name of the columns index from `'year'` to `''`.

In [None]:
# reset index

# change the name of the columns group to an empty string
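
A sketch of those two steps, starting from a small made-up pivot table. Reassigning the result back to `by_county_by_year` is one option; the notebook doesn't say whether to reuse the name.

```python
import pandas as pd

# made-up stand-in for the pivot table from the previous step
df_sales = pd.DataFrame({
    'year': [2014, 2017, 2017],
    'county': ['Denver', 'Denver', 'Boulder'],
    'amount': [100.0, 300.0, 50.0],
})
by_county_by_year = pd.pivot_table(
    df_sales, index='county', values='amount', aggfunc='sum', columns='year'
).fillna(0)

# move `county` out of the index and into a regular column
by_county_by_year = by_county_by_year.reset_index()

# the columns index is still named 'year'; blank it out
by_county_by_year.columns.name = ''

print(by_county_by_year)
```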


Now we can calculate the percent change over time. The formula is:

```
(new value - old value)
-----------------------    *   100
       old value
```

In [None]:
# new calculated column, `change14-17`


Then sort descending by our new value, and voilà!

In [None]:
# sort descending by change column
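
The formula and the sort might be applied like this. The numbers are invented, and the sketch assumes the year columns are integers (`2014`, `2017`); if they come through as strings, the column labels would need quotes.

```python
import pandas as pd

# made-up stand-in for the reset pivot table
by_county_by_year = pd.DataFrame({
    'county': ['Adams', 'Boulder', 'Denver'],
    2014: [100.0, 200.0, 400.0],
    2017: [150.0, 600.0, 500.0],
})

# (new value - old value) / old value * 100
by_county_by_year['change14-17'] = (
    (by_county_by_year[2017] - by_county_by_year[2014])
    / by_county_by_year[2014]
    * 100
)

# biggest percent increase first
sorted_change = by_county_by_year.sort_values('change14-17', ascending=False)
print(sorted_change)
```

One thing to watch for: a county whose 2014 total is zero (for example, a null filled in by `fillna(0)`) will come out as `inf` or `NaN` here, because the formula divides by the old value.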


#### Top counties, per-capita retail sales, 2017

To answer this one, we'll first need to join the cannabis sales data to the population data. Let's do that now.

👉 For more information on merging data in pandas, [see this notebook](../reference/Merging%20data%20in%20pandas.ipynb).

First, read in the population CSV. We're going to use the `dtype` argument to specify that the FIPS code is a string, not a number.

In [None]:
# read in the pop data, specify fips column is a string


In [None]:
# check the output with `head()`
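
Here's a sketch of why the `dtype` argument matters, using a made-up CSV string in place of `../data/co-county-pop.csv`. The `fips` and `pop_2016` column names and the population figures are assumptions for illustration.

```python
import io

import pandas as pd

# made-up stand-in for the population CSV (column names partly assumed)
pop_csv = io.StringIO(
    "fips,county_name,pop_2016\n"
    "08001,Adams,1000\n"
    "08013,Boulder,2000\n"
)

# read the FIPS code as a string so the leading zero survives
df_pop = pd.read_csv(pop_csv, dtype={'fips': str})

print(df_pop.head())
```

Without the `dtype` argument, pandas would parse `08001` as the integer `8001` and drop the leading zero.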


Now we need to isolate the sales data we're interested in: retail sales from 2017. I like to do this in two steps. First, filter to get retail sales. Then filter to get 2017 sales.

In [None]:
# filter for 'retail' sales


In [None]:
# check the output with `head()`


In [None]:
# filter that for 2017 sales


In [None]:
# check the output with `head()`
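
The two-step filter could be sketched like this. The `sales_type` column name is an assumption (use whatever the sales type column is actually called), and the rows are made up.

```python
import pandas as pd

# made-up stand-in for df_sales (sales_type column name assumed)
df_sales = pd.DataFrame({
    'year': [2017, 2017, 2016, 2017],
    'county': ['Denver', 'Denver', 'Denver', 'Boulder'],
    'sales_type': ['retail', 'medical', 'retail', 'retail'],
    'amount': [100.0, 50.0, 75.0, 20.0],
})

# step 1: keep only retail sales
retail = df_sales[df_sales['sales_type'] == 'retail']

# step 2: of those, keep only 2017
retail_17 = retail[retail['year'] == 2017]

print(retail_17.head())
```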


Excellent! Each row in this dataframe is a month's worth of sales, but we want to get the annual total. So we want to select the `county` and `amount` columns, group by `county` and `sum()` the amount:

In [None]:
# select county and amount, groupby county, run sum()


In [None]:
# check the output with `head()`
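
A sketch of the annual rollup on made-up monthly rows:

```python
import pandas as pd

# made-up stand-in for retail_17: one row per county per month
retail_17 = pd.DataFrame({
    'county': ['Denver', 'Denver', 'Boulder'],
    'month': [1, 2, 1],
    'amount': [100.0, 200.0, 50.0],
})

# select county and amount, group by county, sum the amounts
retail_17_grouped = retail_17[['county', 'amount']].groupby('county').sum()

print(retail_17_grouped.head())
```

After `groupby()`, `county` becomes the index of the result. Calling `.reset_index()` on it would turn `county` back into a regular column, if a later step needs that.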


Perf. Now we'll join the two dataframes using the `merge()` function. (Note: Every county is represented in the population data, but not every county is present in the weed sales data.)

👉 For more information on merging data in pandas, [see this notebook](../reference/Merging%20data%20in%20pandas.ipynb).

We'll hand the `merge()` function five arguments:
- `retail_17_grouped`: The "left" table
- `df_pop`: The "right" table
- `left_on='county'`: The name of the column we're joining on in the "left" table
- `right_on='county_name'`: The name of the column we're joining on in the "right" table
- `how='left'`: We're doing a left join

In [None]:
# merge the data


In [None]:
# check the output with `head()`
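
Here's a sketch of the left join with the five arguments listed above. All figures are invented, the `pop_2016` column name is an assumption, and `county` is treated as a regular column in the left table.

```python
import pandas as pd

# made-up stand-in for the grouped 2017 retail sales
retail_17_grouped = pd.DataFrame({
    'county': ['Denver', 'Las Animas', 'Sum of NR Counties'],
    'amount': [1000.0, 600.0, 300.0],
})

# made-up stand-in for the population data (pop_2016 name assumed)
df_pop = pd.DataFrame({
    'county_name': ['Adams', 'Denver', 'Las Animas'],
    'pop_2016': [400, 500, 100],
})

# left join: keep every sales row, attach population where the names match
merged = pd.merge(
    retail_17_grouped,
    df_pop,
    left_on='county',
    right_on='county_name',
    how='left',
)

print(merged.head())
```

Because it's a left join, rows with no match in the population table (like "Sum of NR Counties") stay in the result with null population values, and counties that appear only in the population table are dropped.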


Now we can calculate the per-capita sales by dividing the amount by the population:

In [None]:
# new calculated column, per_capita


In [None]:
# `sort_values()` on per_capita column and check the output with `head()`
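
The per-capita calculation and sort might look like this. The county labels and numbers are placeholders, and `pop_2016` is an assumed name for the population column.

```python
import pandas as pd

# made-up stand-in for the merged sales + population data
merged = pd.DataFrame({
    'county': ['County A', 'County B', 'County C'],
    'amount': [1000.0, 600.0, 900.0],
    'pop_2016': [500, 100, 300],
})

# per-capita sales = total sales divided by population
merged['per_capita'] = merged['amount'] / merged['pop_2016']

# highest per-capita sales first
top = merged.sort_values('per_capita', ascending=False).head()
print(top)
```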


Las Animas! Looks like the dude's numbers check out.

# 📚 GROUP HOMEWORK 📚

In groups, answer this question:
- Statewide, which month has the highest average sales?