# Setup

## Imports

In [89]:
import pandas as pd
from pyarrow import csv
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
import seaborn as sns

## Config

In [2]:
pio.renderers.default = 'notebook_connected'

In [3]:
pd.options.display.float_format = '{:,.2f}'.format
pd.options.display.max_rows = 25
pd.options.display.min_rows = 25

cm_blue_red = sns.diverging_palette(10, 240, n=19, as_cmap=True)

## To-Dos

This analysis will be revisited after taking more stats / probability courses.

- Add introduction and specific problems to be answered by analysis
- Look into city and region specific trends
- Add bag size dimension / analysis

# Data Ingestion

**Data source**: https://www.kaggle.com/neuromusic/avocado-prices

**Data Dictionary**
- **Date**: The date of the observation
- **AveragePrice**: the average price of a single avocado
- **type**: conventional or organic
- **year**: the year
- **Region**: the city or region of the observation
- **Total Volume**: Total number of avocados sold
- **4046**: Total number of avocados with PLU 4046 sold
- **4225**: Total number of avocados with PLU 4225 sold
- **4770**: Total number of avocados with PLU 4770 sold

**Data Overview**

The table below represents weekly 2018 retail scan data for National retail volume (units) and price. Retail scan data comes directly from retailers’ cash registers based on actual retail sales of Hass avocados. Starting in 2013, the table below reflects an expanded, multi-outlet retail data set. Multi-outlet reporting includes an aggregation of the following channels: grocery, mass, club, drug, dollar and military. The Average Price (of avocados) in the table reflects a per unit (per avocado) cost, even when multiple units (avocados) are sold in bags. The Product Lookup codes (PLU’s) in the table are only for Hass avocados. Other varieties of avocados (e.g. greenskins) are not included in this table.

In [4]:
avocado_path = r"C:\Users\matth\OneDrive\Data\Kaggle\avocado.csv"

In [5]:
arrow_avo = csv.read_csv(avocado_path)

Arrow infers the data type of each column and picks up Date as a timestamp, which saves the effort of converting it in pandas as a separate operation.  The first column has no label, which needs to be investigated.  After checking the source documentation I see that it represents the row index; since pandas already provides an index, the column can be removed.

In [6]:
arrow_avo

pyarrow.Table
: int64
Date: timestamp[s]
AveragePrice: double
Total Volume: double
4046: double
4225: double
4770: double
Total Bags: double
Small Bags: double
Large Bags: double
XLarge Bags: double
type: string
year: int64
region: string

After removing the column I can safely convert to pandas for further analysis.  The one additional cleanup is that I want to remove spaces in the column names to enable query methods on the dataframe down the line.

In [7]:
df_avo = arrow_avo.remove_column(0).to_pandas()
df_avo.columns = df_avo.columns.str.replace(" ", "")
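
As a minimal sketch of why the rename matters (toy data, not the avocado file): `DataFrame.query` parses column names as Python identifiers, so a name containing a space has to be wrapped in backticks, while a name without spaces can be used directly.  The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy frame with a column name that contains a space (illustrative values only)
toy = pd.DataFrame({"Total Volume": [10.0, 25.0, 40.0], "type": ["a", "b", "a"]})

# With the space, the name must be backtick-quoted inside query()
quoted = toy.query("`Total Volume` > 20")

# After stripping spaces, the name works as a plain identifier
toy.columns = toy.columns.str.replace(" ", "")
plain = toy.query("TotalVolume > 20")

print(plain)
```

Both queries select the same two rows; the second form is simply easier to read and type.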

In [8]:
df_avo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18249 entries, 0 to 18248
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Date          18249 non-null  datetime64[ns]
 1   AveragePrice  18249 non-null  float64       
 2   TotalVolume   18249 non-null  float64       
 3   4046          18249 non-null  float64       
 4   4225          18249 non-null  float64       
 5   4770          18249 non-null  float64       
 6   TotalBags     18249 non-null  float64       
 7   SmallBags     18249 non-null  float64       
 8   LargeBags     18249 non-null  float64       
 9   XLargeBags    18249 non-null  float64       
 10  type          18249 non-null  object        
 11  year          18249 non-null  int64         
 12  region        18249 non-null  object        
dtypes: datetime64[ns](1), float64(9), int64(1), object(2)
memory usage: 1.8+ MB


First, I want to give the numeric PLU columns more human-friendly names.  The documentation gives some information on this, but it is not explicit, so I searched for the PLUs on Google for confirmation.  The result is:

- **PLU 4046**: California Small Hass
- **PLU 4225**: Mexico Large Hass
- **PLU 4770**: California Extra Large Hass

It's good that we looked into this, because it shows that PLU 4225 is from Mexico while the other two are from California.  The fact that they come from different geographic regions could certainly have an impact on prices and volumes.

Aside from this, there are no null values, which will make my life easier.

# Data Cleaning

In [9]:
avo_col_mapper = {'4046': 'CaliSmall', '4225': 'MexicoLarge', '4770': 'CalixLarge'}
df_avo.rename(columns=avo_col_mapper, inplace=True)
df_avo.head(10)

Unnamed: 0,Date,AveragePrice,TotalVolume,CaliSmall,MexicoLarge,CalixLarge,TotalBags,SmallBags,LargeBags,XLargeBags,type,year,region
0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015,Albany
4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany
5,2015-11-22,1.26,55979.78,1184.27,48067.99,43.61,6683.91,6556.47,127.44,0.0,conventional,2015,Albany
6,2015-11-15,0.99,83453.76,1368.92,73672.72,93.26,8318.86,8196.81,122.05,0.0,conventional,2015,Albany
7,2015-11-08,0.98,109428.33,703.75,101815.36,80.0,6829.22,6266.85,562.37,0.0,conventional,2015,Albany
8,2015-11-01,1.02,99811.42,1022.15,87315.57,85.34,11388.36,11104.53,283.83,0.0,conventional,2015,Albany
9,2015-10-25,1.07,74338.76,842.4,64757.44,113.0,8625.92,8061.47,564.45,0.0,conventional,2015,Albany


Now I will sanity check the values in the columns to see if there are any outliers that need to be handled.

- For the column 'Date' I will look at the range and distribution of dates available
- For integer and float data I will generate summary stats (mean, quartiles, min, max) to get a high level view of the distribution
- For strings I will look at the cardinality and unique values.

For the 'Date' field I can see that the observation period starts on January 4th 2015 and ends on March 25th 2018.  The distribution across the 169 weekly dates is almost completely uniform: the most frequent date has 108 records (54 regions x 2 types), and 169 x 108 = 18,252 against 18,249 actual rows, so only 3 records are missing.  This is positive, as it shows consistency in collecting data.  Now we'll need to see if the actual data collected is of good quality.

In [10]:
df_avo.Date.min()

Timestamp('2015-01-04 00:00:00')

In [11]:
df_avo.Date.max()

Timestamp('2018-03-25 00:00:00')

In [12]:
df_avo.Date.describe()

count                   18249
unique                    169
top       2015-05-24 00:00:00
freq                      108
first     2015-01-04 00:00:00
last      2018-03-25 00:00:00
Name: Date, dtype: object

In [13]:
df_avo.groupby('year').size()

year
2015    5615
2016    5616
2017    5722
2018    1296
dtype: int64

There are some interesting spikes in the record counts that are worth looking into.  What makes them interesting is that a given month has essentially only two possible values, 432 or 540, with almost no variation (the one exception is 430 in June 2017).  Also of interest is that the months with 540 records do not appear at a regular cadence (i.e. the same month each year or following a distinct pattern).  Depending on the validity of these spikes, they could affect statistics that I want to compute later on.  The spikes do not seem to be driven by the type of avocado, as they occur at the same time for both conventional and organic.  Additionally, looking by region to see whether some regions contributed inconsistently or more than others shows that this is not the case: the number of records is uniform across all regions.

In [14]:
df_monthly_count = df_avo.set_index('Date').resample('m').size().to_frame('Size')
px.bar(df_monthly_count, x=df_monthly_count.index, y='Size')

I put this data into a table with a month index and year columns, highlighting the max counts to see if that helps identify a trend.  The only trend that I can see is that a large record count appears about 4 times per year.  The outlier is 2018: it looks like its larger record count may have been added to December 2017, leaving 2017 with five 540 counts.  As far as data cleaning goes, it is hard at this stage to pick out anything immediately concerning.  Once we dig into the data analysis section we may have to circle back and clean up the data if something shows up.  It will be interesting to see later whether there is a material difference between the data from the two different-sized reporting groups.

In [15]:
df_month_year_count = (df_avo.assign(month=lambda x: x['Date'].dt.month)
                             .pivot_table(index='month', columns='year', values='Date', aggfunc='count'))
df_month_year_count.style.highlight_max(axis=0)

month,2015,2016,2017,2018
1,432.0,540.0,540.0,432.0
2,432.0,432.0,432.0,432.0
3,540.0,432.0,432.0,432.0
4,432.0,432.0,540.0,
5,540.0,540.0,432.0,
6,432.0,432.0,430.0,
7,432.0,540.0,540.0,
8,540.0,432.0,432.0,
9,432.0,432.0,432.0,
10,432.0,540.0,540.0,


In [16]:
df_monthly_count_conventional = df_avo.query("type=='conventional'").set_index('Date').resample('m').size().to_frame('Size')
# Use this frame's own index for the x axis rather than the index of df_monthly_count
px.bar(df_monthly_count_conventional, x=df_monthly_count_conventional.index, y='Size')

In [17]:
df_monthly_count_organic = df_avo.query("type=='organic'").set_index('Date').resample('m').size().to_frame('Size')
# Use this frame's own index for the x axis rather than the index of df_monthly_count
px.bar(df_monthly_count_organic, x=df_monthly_count_organic.index, y='Size')

In [18]:
df_region_sum = (df_avo.set_index('Date')
                       .groupby('region').size().to_frame('sum_records'))
px.bar(df_region_sum, x=df_region_sum.index, y='sum_records', title="Total Records by Region")

In [19]:
df_region_avg = (df_avo.set_index('Date')
                       .groupby('region').resample('m').size()
                       .stack().to_frame('size').reset_index()
                       .groupby('region').agg(avg_monthly_records=('size', 'mean')))
px.bar(df_region_avg, x=df_region_avg.index, y='avg_monthly_records', title="Average Monthly Records by Region")

The two string fields to be investigated are region and type.  For region it appears that a majority of the items are major metropolitan areas. However, it can also be seen that there are aggregated areas as well.  These can be put into the following categories:

- **Metro Area**: A city in a state - the deepest level of detail.  Example: Albany
- **State**: Contains multiple cities from the dataset. First level of aggregation.  If the dataset is consistent, the expectation is that the sum of all metro areas within a state equals the state total.  Example: California
- **Geographic Region**: Contains multiple states from the dataset. Second level of aggregation.  If the dataset is consistent, the expectation is that the sum of all states within a geographic region equals the geographic region total.  Example: West
- **Total**: The sum of the entire dataset.  If the dataset is consistent the expectation is that if you sum all values within any of the sub groups that it would equal the total.

In [20]:
df_avo['region'].unique()
# TODO Automate mapping of regions to categories

array(['Albany', 'Atlanta', 'BaltimoreWashington', 'Boise', 'Boston',
       'BuffaloRochester', 'California', 'Charlotte', 'Chicago',
       'CincinnatiDayton', 'Columbus', 'DallasFtWorth', 'Denver',
       'Detroit', 'GrandRapids', 'GreatLakes', 'HarrisburgScranton',
       'HartfordSpringfield', 'Houston', 'Indianapolis', 'Jacksonville',
       'LasVegas', 'LosAngeles', 'Louisville', 'MiamiFtLauderdale',
       'Midsouth', 'Nashville', 'NewOrleansMobile', 'NewYork',
       'Northeast', 'NorthernNewEngland', 'Orlando', 'Philadelphia',
       'PhoenixTucson', 'Pittsburgh', 'Plains', 'Portland',
       'RaleighGreensboro', 'RichmondNorfolk', 'Roanoke', 'Sacramento',
       'SanDiego', 'SanFrancisco', 'Seattle', 'SouthCarolina',
       'SouthCentral', 'Southeast', 'Spokane', 'StLouis', 'Syracuse',
       'Tampa', 'TotalUS', 'West', 'WestTexNewMexico'], dtype=object)

Each region will need to be mapped to its appropriate category so that figures can be aggregated correctly.  For now, I will do a quick and easy separation of the categories.

In [21]:
df_avo_total = df_avo.query("region=='TotalUS'").sort_values('Date')
regions = ['Midsouth', 'Northeast', 'SouthCentral', 'Southeast', 'West']
df_avo_region = df_avo.query("region in @regions").sort_values('Date')
df_avo_metro = df_avo.query("region != 'TotalUS' & region not in @regions").sort_values('Date')
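
The mapping mentioned in the TODO could eventually look something like the sketch below.  This is only an illustration: the `REGION_CATEGORY` dictionary and the `categorize_region` helper are hypothetical, and the assignments in it are assumptions based on the region names (for example, treating `GreatLakes` and `Plains` as geographic regions), not something confirmed by the source documentation.  Anything not listed falls back to "Metro Area", which is a judgment call for names such as `NorthernNewEngland`.

```python
import pandas as pd

# Assumed category assignments, written by hand for illustration only
REGION_CATEGORY = {
    "TotalUS": "Total",
    "California": "State",
    "SouthCarolina": "State",
    "GreatLakes": "Geographic Region",
    "Midsouth": "Geographic Region",
    "Northeast": "Geographic Region",
    "Plains": "Geographic Region",
    "SouthCentral": "Geographic Region",
    "Southeast": "Geographic Region",
    "West": "Geographic Region",
}


def categorize_region(region: str) -> str:
    """Return the aggregation level; unlisted names are treated as metro areas."""
    return REGION_CATEGORY.get(region, "Metro Area")


# Small sample of region names taken from the unique() output above
sample = pd.Series(["Albany", "California", "West", "TotalUS"], name="region")
categories = sample.map(categorize_region)

print(categories.tolist())
```

Applied to the full frame, the same `map` call could populate a new category column that replaces the three separate `query` filters.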

We are now ready to move into analyzing the data.

# Data Analysis

I will do a top-down analysis where I first look at the totals to see if anything sticks out.  To start, a simple line graph of the overall trend, together with summary statistics, will guide me in the right direction; I will focus on 'AveragePrice' and 'TotalVolume'.  You can quickly see that there seems to be something wrong with the price data for organic avocados in July of 2015.  Additionally, you can quickly see that conventional avocados are significantly more popular than organic.  In 2016 and 2017 there was also a significant increase and subsequent decrease in prices in the fall.

In [22]:
px.line(df_avo_total, 
        x='Date', 
        y='AveragePrice', 
        color='type', 
        title='Average Prices by Type')

Looking at the distribution of prices for the two we can see that the distribution of organic has a wider range.  Part of this is due to what is likely an outlier data point.  The conventional avocados also seem to have a positive or right skew.

In [23]:
px.histogram(df_avo_total, 
             x='AveragePrice', 
             color='type', 
             nbins=25,
             barmode='overlay',
             opacity=.75,
             marginal='violin')

In [24]:
px.histogram(df_avo_total, 
             x='AveragePrice', 
             color='type', 
             nbins=50,
             cumulative=True,
             histnorm='probability',
             barmode='overlay',
             opacity=.75)

From looking at the prices, it appears that organic and conventional avocados have a fairly strong correlation, but we'll compute that to be sure.  The result is a correlation of 0.65, which is moderately strong.

In [25]:
(df_avo_total[['Date', 'AveragePrice', 'type']].pivot(index='Date', 
                                                     columns='type', 
                                                     values='AveragePrice')
                                              .corr())

type,conventional,organic
conventional,1.0,0.65
organic,0.65,1.0
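
To make the computation concrete, here is a small sketch on made-up prices (not the real data) showing that pivoting to one column per type and calling `.corr()` gives the same number as the Pearson formula written out by hand.  The frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Toy long-format prices: three weekly dates, two types (illustrative values)
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-04"] * 2 + ["2015-01-11"] * 2 + ["2015-01-18"] * 2),
    "type": ["conventional", "organic"] * 3,
    "AveragePrice": [1.00, 1.50, 1.10, 1.70, 1.20, 1.60],
})

# One column per type, aligned on Date, then the pairwise correlation matrix
wide = toy.pivot(index="Date", columns="type", values="AveragePrice")
corr_matrix = wide.corr()

# Pearson correlation written out: covariance over the product of spreads
dx = wide["conventional"] - wide["conventional"].mean()
dy = wide["organic"] - wide["organic"].mean()
manual = (dx * dy).sum() / ((dx ** 2).sum() * (dy ** 2).sum()) ** 0.5

print(corr_matrix)
print(manual)
```

The pivot step matters because the correlation is only meaningful when the two price series are paired by date.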


When looking at volume it is immediately apparent that there is seasonality in the conventional data, as every January there is a spike in volume (an increase in weekly volume of over 100%, or 30 million plus avocados a week, between December and the end of January).  Additionally, from July on there is a general downtrend to the end of the year.  While there does appear to be a small uptrend in the conventional volumes (highs continue to get higher and like dates across years appear to be going up), this is accompanied by significant volatility as well as lows that keep returning to the same level.  Further statistics will be required to see the actual magnitude of change.  Because the volume of organic avocados is significantly lower than the conventional volume, it is hard to get an idea of the organic trend here; this will require plotting it on its own.

In [26]:
px.line(df_avo_total, 
        x='Date', 
        y='TotalVolume', 
        color='type', 
        title='Total Volume by Type')

I have created a helper function to add some additional summary statistics such as coefficient of variation and interquartile range to the standard pandas dataframe describe method.

In [31]:
def describe_stats(df):
    """Extend DataFrame.describe() with coefficient of variation and IQR rows."""
    stats_df = df.describe()
    stats_df.loc['cv'] = stats_df.loc['std'] / stats_df.loc['mean']
    stats_df.loc['iqr'] = stats_df.loc['75%'] - stats_df.loc['25%']
    # Row order for the returned frame
    stats_index = ['count', 'mean', 'std', 'cv', 'min', '25%',
                   '50%', '75%', 'max', 'iqr']

    return stats_df.reindex(stats_index)
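
As a quick sanity check of the helper, the sketch below repeats its logic and runs it on a tiny made-up column where the expected values are easy to verify by hand.  The frame `toy` is an assumption for illustration.

```python
import pandas as pd


def describe_stats(df):
    """Extend DataFrame.describe() with coefficient of variation and IQR rows."""
    stats_df = df.describe()
    stats_df.loc['cv'] = stats_df.loc['std'] / stats_df.loc['mean']
    stats_df.loc['iqr'] = stats_df.loc['75%'] - stats_df.loc['25%']
    stats_index = ['count', 'mean', 'std', 'cv', 'min', '25%',
                   '50%', '75%', 'max', 'iqr']
    return stats_df.reindex(stats_index)


# Values 1..5: mean 3, quartiles 2 and 4, so the IQR should be 2
toy = pd.DataFrame({"volume": [1.0, 2.0, 3.0, 4.0, 5.0]})
stats = describe_stats(toy)

print(stats)
```

Because `describe()` uses the sample standard deviation, the cv here is sqrt(2.5) / 3, roughly 0.53.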

Viewing the summary statistics for conventional avocados does show some evidence of an uptrend.  Comparing the earliest and latest full years (2015 and 2017), you can see that the mean, min, max, and quartiles are all increasing.  In each case there is at least a 4% increase over the two years, and in most cases closer to 10%.  The largest increase is in the max, which rose 37%.  An interesting point is that over the two full years there was a substantial increase in volatility (std up 63%, cv up 50%) compared to the modest increase in volume (mean up 9%).

In [33]:
(df_avo_total.query("type=='conventional'")
            .pivot(index='Date', columns='year', values='TotalVolume')
            .pipe(describe_stats)
            .add_prefix('Y_')
            .eval("Ratio=Y_2017/Y_2015"))

year,Y_2015,Y_2016,Y_2017,Y_2018,Ratio
count,52.0,52.0,53.0,12.0,1.02
mean,31224729.15,34043449.79,33995658.14,42125533.35,1.09
std,3894760.81,5815718.2,6346590.41,6837607.18,1.63
cv,0.12,0.17,0.19,0.16,1.5
min,22617999.38,21009730.21,24397166.19,36703156.72,1.08
25%,28982693.28,30977470.89,30237911.23,39341132.88,1.04
50%,30773184.44,34406259.58,33824253.0,40595408.58,1.1
75%,32633666.42,36840872.14,37352360.59,42996817.69,1.14
max,44655461.51,52288697.89,61034457.1,62505646.52,1.37
iqr,3650973.14,5863401.25,7114449.36,3655684.81,1.95


Plotting the volume for organic on its own, you can quickly see that there is an obvious uptrend, as the lows and highs continue to increase.  The seasonality is also apparent; however, it appears that the spike in volume comes just after the spike for conventional avocados, usually starting after January for organic.

In [34]:
px.line(df_avo_total.query("type=='organic'"), 
        x='Date', 
        y='TotalVolume', 
        title='Total Volume for Organic')

Looking at the summary statistics paints a similar picture: the mean, min, max, and quartiles all grew by roughly 60% to 90% over the two years (the mean by 84%).  While there was a substantial increase in volume, with the attendant increase in volatility, the coefficient of variation was essentially unchanged over the two periods (0.15 in both 2015 and 2017).

In [37]:
(df_avo_total.query("type=='organic'")
            .pivot(index='Date', columns='year', values='TotalVolume')
            .pipe(describe_stats)
            .add_prefix('Y_')
            .eval("Ratio=Y_2017/Y_2015"))

year,Y_2015,Y_2016,Y_2017,Y_2018,Ratio
count,52.0,52.0,53.0,12.0,1.02
mean,645563.57,940379.88,1187239.29,1510487.83,1.84
std,98459.77,162169.92,181386.85,165114.33,1.84
cv,0.15,0.17,0.15,0.11,1.0
min,501814.87,647723.55,808971.88,1283987.65,1.61
25%,570612.39,825467.81,1068530.09,1372757.33,1.87
50%,644637.19,925106.11,1148617.16,1496991.89,1.78
75%,677590.98,1035734.41,1302205.55,1641881.8,1.92
max,912681.57,1475457.53,1634877.11,1814929.97,1.79
iqr,106978.6,210266.59,233675.46,269124.47,2.18


I want to get a better understanding of the total revenue brought in, so I will multiply the average price by the volume to get the revenue per week (the observations are weekly).  The summary stats are mostly in line with expectations.  The one interesting point is that the cv for organic is quite high compared to conventional (0.33 vs. 0.16).

In [58]:
df_avo_total.eval("Revenue=TotalVolume*AveragePrice", inplace=True)
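
For reference, the same derived column can be sketched with `assign`, which returns a new frame instead of modifying a filtered frame in place.  The two rows below are copied from the Albany rows in the `head()` output above; the names `toy` and `with_revenue` are introduced here for illustration.

```python
import pandas as pd

# Two rows taken from the head() output (Albany, conventional)
toy = pd.DataFrame({
    "AveragePrice": [1.33, 1.35],
    "TotalVolume": [64236.62, 54876.98],
})

# Revenue per observation = units sold x average price per unit
with_revenue = toy.assign(Revenue=lambda d: d["TotalVolume"] * d["AveragePrice"])

print(with_revenue)
```

Since `AveragePrice` is itself an average, the product is an approximation of revenue rather than an exact register total.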

In [59]:
df_avo_total.pivot(index='Date', columns='type', values='Revenue').pipe(describe_stats)

type,conventional,organic
count,169.0,169.0
mean,36299028.44,1497380.44
std,5649787.93,501588.85
cv,0.16,0.33
min,22391819.39,573873.0
25%,32360761.36,1085445.35
50%,36449306.18,1413311.01
75%,39743360.74,1939747.19
max,54379912.47,2758693.55
iqr,7382599.39,854301.84


In [57]:
df_avo_total.groupby(['type', 'year'])['Revenue'].describe()

type,year,count,mean,std,min,25%,50%,75%,max
conventional,2015,52.0,31602546.03,3979590.95,22391819.39,29182491.7,31924573.45,34112247.88,39743360.74
conventional,2016,52.0,34862218.41,3300763.32,27736905.81,32427664.59,34414833.97,37570389.31,42392777.2
conventional,2017,53.0,40524497.14,4160410.32,30611477.0,37669304.59,40361458.14,43274254.56,51550374.54
conventional,2018,12.0,44214142.3,3652828.97,39646678.49,42285308.34,43690093.77,44724081.68,54379912.47
organic,2015,52.0,965857.96,177175.22,573873.0,876275.12,983633.51,1083096.18,1264746.65
organic,2016,52.0,1380886.98,196205.76,1001291.64,1224552.91,1406491.23,1520384.95,1844321.91
organic,2017,53.0,1941616.33,267848.42,1197278.38,1875227.75,1984033.94,2149099.3,2230787.84
organic,2018,12.0,2343407.64,218980.42,2054520.73,2144043.66,2319921.86,2495660.33,2758693.55


In [103]:
df_tidy_conv = (df_avo_total.query("type=='conventional'")
                    .melt(id_vars=['Date'], 
                          var_name=['Category'], 
                          value_vars=['TotalVolume', 'Revenue', 'AveragePrice'], 
                          value_name='Value'))

When looking at a graph of price, volume, and revenue charted together for conventional avocados, it appears there was a change in the correlation between price and volume starting in 2016.  In 2015 there was effectively no correlation, but starting in 2016 a strong negative correlation appeared.  Perhaps this is because there was much less volume volatility in 2015, which prevented large price changes.  By 2017 the standard deviation of volume had increased by more than 60% from the 2015 baseline.

In [132]:
fig = make_subplots(specs=[[{"secondary_y": True}]])

dates = df_tidy_conv.query("Category=='Revenue'").Date # Arbitrary filter to get unique dates
revenue = df_tidy_conv.query("Category=='Revenue'").Value
volume = df_tidy_conv.query("Category=='TotalVolume'").Value
price = df_tidy_conv.query("Category=='AveragePrice'").Value

fig.add_trace(go.Scatter(x=dates, 
                         y=revenue, 
                         mode='lines', 
                         name='revenue'), 
              secondary_y=False)
fig.add_trace(go.Scatter(x=dates, 
                         y=volume, 
                         mode='lines',
                         name='volume'), 
              secondary_y=False)
fig.add_trace(go.Scatter(x=dates, 
                         y=price, 
                         mode='markers',
                         marker={'size': 4},
                         name='price'), 
              secondary_y=True)
fig.update_layout(title_text='Time Series Revenue and its Components for Conventional Type',
                 xaxis_rangeslider_visible=True,
                 width=800)
fig.show()

In [133]:
df_avo_total.query("type=='conventional'")[['AveragePrice', 'TotalVolume', 'Revenue']].corr()

Unnamed: 0,AveragePrice,TotalVolume,Revenue
AveragePrice,1.0,-0.51,0.36
TotalVolume,-0.51,1.0,0.6
Revenue,0.36,0.6,1.0


In [140]:
(df_avo_total.query("type=='conventional'")
             .groupby('year')
             [['AveragePrice', 'TotalVolume', 'Revenue']].corr())

year,,AveragePrice,TotalVolume,Revenue
2015,AveragePrice,1.0,-0.07,0.32
2015,TotalVolume,-0.07,1.0,0.92
2015,Revenue,0.32,0.92,1.0
2016,AveragePrice,1.0,-0.83,0.02
2016,TotalVolume,-0.83,1.0,0.51
2016,Revenue,0.02,0.51,1.0
2017,AveragePrice,1.0,-0.79,0.2
2017,TotalVolume,-0.79,1.0,0.4
2017,Revenue,0.2,0.4,1.0
2018,AveragePrice,1.0,-0.86,-0.59
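
When only the price/volume pair is of interest, the per-year figure can be pulled out directly instead of reading it from the full matrix.  The sketch below uses made-up numbers, chosen so that one year has no linear relationship and the other a perfectly inverse one; `toy` and `price_volume_corr` are names introduced for illustration.

```python
import pandas as pd

# Toy data: 2015 has no price/volume relationship, 2016 a perfectly inverse one
toy = pd.DataFrame({
    "year": [2015] * 4 + [2016] * 4,
    "AveragePrice": [1.0, 1.1, 1.0, 1.1, 1.0, 1.1, 1.2, 1.3],
    "TotalVolume": [30.0, 30.0, 32.0, 32.0, 40.0, 38.0, 36.0, 34.0],
})

# One Pearson correlation per year for the single pair of interest
price_volume_corr = {
    year: grp["AveragePrice"].corr(grp["TotalVolume"])
    for year, grp in toy.groupby("year")
}

for year, value in price_volume_corr.items():
    print(year, round(value, 2))
```

With only a handful of observations per group, as in 2018 with 12 weeks, such correlations should be read with some caution.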


With organic avocados, the natural dynamic of an inverse relationship between volume and price was present from the beginning.  There was also a significant increase in weekly revenue, going from around $1 million a week to around $2.5 million, a 150% increase over the course of 3 years.

In [135]:
df_tidy_org = (df_avo_total.query("type=='organic'")
                    .melt(id_vars=['Date'], 
                          var_name=['Category'], 
                          value_vars=['TotalVolume', 'Revenue', 'AveragePrice'], 
                          value_name='Value'))

In [137]:
fig = make_subplots(specs=[[{"secondary_y": True}]])

dates = df_tidy_org.query("Category=='Revenue'").Date # Arbitrary filter to get unique dates
revenue = df_tidy_org.query("Category=='Revenue'").Value
volume = df_tidy_org.query("Category=='TotalVolume'").Value
price = df_tidy_org.query("Category=='AveragePrice'").Value

fig.add_trace(go.Scatter(x=dates, 
                         y=revenue, 
                         mode='lines', 
                         name='revenue'), 
              secondary_y=False)
fig.add_trace(go.Scatter(x=dates, 
                         y=volume, 
                         mode='lines',
                         name='volume'), 
              secondary_y=False)
fig.add_trace(go.Scatter(x=dates, 
                         y=price, 
                         mode='markers',
                         marker={'size': 4},
                         name='price'), 
              secondary_y=True)
fig.update_layout(title_text='Time Series Revenue and its Components for Organic Type',
                 xaxis_rangeslider_visible=True,
                 width=800)
fig.show()

In [138]:
df_avo_total.query("type=='organic'")[['AveragePrice', 'TotalVolume', 'Revenue']].corr()

Unnamed: 0,AveragePrice,TotalVolume,Revenue
AveragePrice,1.0,0.02,0.4
TotalVolume,0.02,1.0,0.92
Revenue,0.4,0.92,1.0


In [141]:
(df_avo_total.query("type=='organic'")
             .groupby('year')
             [['AveragePrice', 'TotalVolume', 'Revenue']].corr())

year,,AveragePrice,TotalVolume,Revenue
2015,AveragePrice,1.0,-0.19,0.58
2015,TotalVolume,-0.19,1.0,0.68
2015,Revenue,0.58,0.68,1.0
2016,AveragePrice,1.0,-0.51,0.03
2016,TotalVolume,-0.51,1.0,0.84
2016,Revenue,0.03,0.84,1.0
2017,AveragePrice,1.0,-0.47,0.5
2017,TotalVolume,-0.47,1.0,0.51
2017,Revenue,0.5,0.51,1.0
2018,AveragePrice,1.0,-0.7,-0.57


In [146]:
px.line(df_avo_total.query("type=='organic'"), 
        x='Date', 
        y='Revenue', 
        title='Organic Revenue')

Digging into aggregation statistics by year on revenue data we can see that there is a clear positive trend in price and volume with the attendant increase in revenue. 

In [160]:
df_rev_by_year = pd.pivot_table(df_avo_total, 
                                index=df_avo_total.Date.dt.year, 
                                columns='type', 
                                values=['TotalVolume', 'Revenue', 'AveragePrice'], 
                                aggfunc={'TotalVolume': 'sum',
                                        'Revenue': 'sum',
                                        'AveragePrice': 'mean'})

num_format = {'AveragePrice': "{:,.4f}", 
              'Revenue': "{:,.0f}",
              'TotalVolume': "{:,.0f}"}
df_rev_by_year.style.background_gradient(cmap=cm_blue_red).format("{:,.2f}")

Date,AveragePrice (conventional),AveragePrice (organic),Revenue (conventional),Revenue (organic),TotalVolume (conventional),TotalVolume (organic)
2015,1.01,1.5,1643332393.71,50224613.92,1623685915.93,33569305.49
2016,1.05,1.48,1812835357.48,71806123.13,1770259388.99,48899753.79
2017,1.22,1.65,2147798348.19,102905665.67,1801769881.21,62923682.17
2018,1.06,1.55,530569707.58,28120891.65,505506400.23,18125853.99


When looking at the data by month of the year, it's clear that the first quarter is the highest grossing in total (summed) revenue, driven by volumes that are also highest then.  The highest prices are seen in the summer and fall, between July and November.

In [176]:
df_rev_by_month = pd.pivot_table(df_avo_total, 
                                index=df_avo_total.Date.dt.month, 
                                columns='type', 
                                values=['TotalVolume', 'Revenue', 'AveragePrice'], 
                                aggfunc={'TotalVolume': ['sum', 'mean'],
                                        'Revenue': ['sum', 'mean'],
                                        'AveragePrice': 'mean'})

df_rev_by_month.style.background_gradient(cmap=cm_blue_red).format("{:,.2f}".format)

Date,AveragePrice mean (conventional),AveragePrice mean (organic),Revenue mean (conventional),Revenue mean (organic),Revenue sum (conventional),Revenue sum (organic),TotalVolume mean (conventional),TotalVolume mean (organic),TotalVolume sum (conventional),TotalVolume sum (organic)
1,0.99,1.47,35757913.29,1370260.86,643642439.29,24664695.42,35949790.55,925941.1,647096229.97,16666939.74
2,0.94,1.44,37317975.07,1538075.97,597087601.1,24609215.55,40534011.55,1067557.85,648544184.74,17080925.6
3,1.05,1.42,36766112.12,1665611.04,625023906.06,28315387.69,34982576.48,1172877.57,594703800.08,19938918.69
4,1.06,1.47,36955102.95,1621588.07,480416338.31,21080644.9,34872393.12,1110736.85,453341110.61,14439579.08
5,1.02,1.48,39257278.41,1588693.76,549601897.67,22241712.7,38587542.9,1069962.81,540225600.6,14979479.38
6,1.08,1.6,39743543.18,1566800.71,476922518.2,18801608.54,36881969.33,983924.89,442583631.99,11807098.7
7,1.17,1.47,39773734.14,1459445.09,556832277.98,20432231.31,34181115.31,947634.6,478535614.35,13266884.39
8,1.18,1.61,37559804.79,1452285.62,488277462.22,18879713.1,31911541.07,881503.06,414850033.92,11459539.77
9,1.24,1.8,36387661.69,1560818.23,436651940.32,18729818.7,29852423.15,864042.91,358229077.85,10368514.91
10,1.3,1.74,34595402.29,1473009.6,484335632.04,20622134.42,26857290.29,835859.79,376002064.06,11702037.06
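
One caveat when comparing the `sum` columns across months: dividing a month's sum by its mean recovers how many weekly observations went into it, and that number is not the same for every month.  The sketch below does this for two rows copied from the table above; the dictionary name is introduced for illustration.  The result suggests that January pools 18 weeks against 12 for June, presumably because 2018 only contributes January through March, so the `mean` columns may be the fairer basis for comparing months.

```python
# (sum, mean) of conventional revenue by calendar month, copied from the table above
conventional_revenue = {
    1: (643642439.29, 35757913.29),
    6: (476922518.20, 39743543.18),
}

# sum / mean = number of weekly observations behind each month's figures
weeks_per_month = {
    month: round(total / mean)
    for month, (total, mean) in conventional_revenue.items()
}

print(weeks_per_month)
```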
