# Setup

## Imports

In [1]:
import pandas as pd
from pyarrow import csv
import plotly.express as px
import plotly.io as pio

## Config

In [3]:
pio.renderers.default = 'notebook_connected'

In [4]:
pd.options.display.float_format = '{:,.2f}'.format
pd.options.display.max_rows = 25

# Introduction

To be completed

# Data Ingestion

**Data source**: https://www.kaggle.com/neuromusic/avocado-prices

**Data Dictionary**
- **Date**: The date of the observation
- **AveragePrice**: The average price of a single avocado
- **type**: Conventional or organic
- **year**: The year of the observation
- **region**: The city or region of the observation
- **Total Volume**: Total number of avocados sold
- **4046**: Total number of avocados with PLU 4046 sold
- **4225**: Total number of avocados with PLU 4225 sold
- **4770**: Total number of avocados with PLU 4770 sold

**Data Overview**

The table below represents weekly 2018 retail scan data for National retail volume (units) and price. Retail scan data comes directly from retailers’ cash registers based on actual retail sales of Hass avocados. Starting in 2013, the table below reflects an expanded, multi-outlet retail data set. Multi-outlet reporting includes an aggregation of the following channels: grocery, mass, club, drug, dollar and military. The Average Price (of avocados) in the table reflects a per unit (per avocado) cost, even when multiple units (avocados) are sold in bags. The Product Lookup codes (PLU’s) in the table are only for Hass avocados. Other varieties of avocados (e.g. greenskins) are not included in this table.

In [5]:
avocado_path = r"C:\Users\matth\OneDrive\Data\Kaggle\avocado.csv"

In [6]:
arrow_avo = csv.read_csv(avocado_path)

Arrow infers a data type for each column and picks up Date as a timestamp, which saves the effort of converting it in pandas as a separate operation.  The first column has no label, which needs to be investigated.  After checking the source documentation, I see that it represents the row index; since pandas already provides an index, the column can be removed.

In [7]:
arrow_avo

pyarrow.Table
: int64
Date: timestamp[s]
AveragePrice: double
Total Volume: double
4046: double
4225: double
4770: double
Total Bags: double
Small Bags: double
Large Bags: double
XLarge Bags: double
type: string
year: int64
region: string

After removing the column I can safely convert to pandas for further analysis.

In [8]:
df_avo = arrow_avo.remove_column(0).to_pandas()

In [9]:
df_avo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18249 entries, 0 to 18248
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Date          18249 non-null  datetime64[ns]
 1   AveragePrice  18249 non-null  float64       
 2   Total Volume  18249 non-null  float64       
 3   4046          18249 non-null  float64       
 4   4225          18249 non-null  float64       
 5   4770          18249 non-null  float64       
 6   Total Bags    18249 non-null  float64       
 7   Small Bags    18249 non-null  float64       
 8   Large Bags    18249 non-null  float64       
 9   XLarge Bags   18249 non-null  float64       
 10  type          18249 non-null  object        
 11  year          18249 non-null  int64         
 12  region        18249 non-null  object        
dtypes: datetime64[ns](1), float64(9), int64(1), object(2)
memory usage: 1.8+ MB


First, I want to give the numerically named columns more human-friendly names.  The documentation gives some information on this, but it is not explicit, so I searched for the PLUs on Google for confirmation.  The result is:

- **PLU 4046**: California Small Hass
- **PLU 4225**: Mexico Large Hass
- **PLU 4770**: California Extra Large Hass

It was worth looking into this: according to that search, PLU 4225 is from Mexico while the other two are from California.  The fact that they come from different geographic regions could certainly have an impact on price and volumes.

Aside from this, the `info()` output shows no null values, which will make my life easier.
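
The null check can also be made explicit rather than read off the `info()` output. The sketch below uses a small hypothetical frame (`sample`) with one deliberately missing value to show what a non-clean result looks like; on `df_avo` the same calls would be expected to report zeros.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing price, standing in for df_avo.
sample = pd.DataFrame({
    "AveragePrice": [1.33, np.nan, 0.93],
    "region": ["Albany", "Albany", "Boston"],
})

# Count missing values per column; all zeros means no nulls.
null_counts = sample.isna().sum()
print(null_counts)

# Single yes/no answer for the whole frame.
has_nulls = bool(sample.isna().any().any())
print(has_nulls)
```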

# Data Cleaning

In [10]:
avo_col_mapper = {'4046': 'Cali Small', '4225': 'Mexico Large', '4770': 'Cali xLarge'}
df_avo.rename(columns=avo_col_mapper, inplace=True)
df_avo.head(10)

| | Date | AveragePrice | Total Volume | Cali Small | Mexico Large | Cali xLarge | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-12-27 | 1.33 | 64236.62 | 1036.74 | 54454.85 | 48.16 | 8696.87 | 8603.62 | 93.25 | 0.0 | conventional | 2015 | Albany |
| 1 | 2015-12-20 | 1.35 | 54876.98 | 674.28 | 44638.81 | 58.33 | 9505.56 | 9408.07 | 97.49 | 0.0 | conventional | 2015 | Albany |
| 2 | 2015-12-13 | 0.93 | 118220.22 | 794.7 | 109149.67 | 130.5 | 8145.35 | 8042.21 | 103.14 | 0.0 | conventional | 2015 | Albany |
| 3 | 2015-12-06 | 1.08 | 78992.15 | 1132.0 | 71976.41 | 72.58 | 5811.16 | 5677.4 | 133.76 | 0.0 | conventional | 2015 | Albany |
| 4 | 2015-11-29 | 1.28 | 51039.6 | 941.48 | 43838.39 | 75.78 | 6183.95 | 5986.26 | 197.69 | 0.0 | conventional | 2015 | Albany |
| 5 | 2015-11-22 | 1.26 | 55979.78 | 1184.27 | 48067.99 | 43.61 | 6683.91 | 6556.47 | 127.44 | 0.0 | conventional | 2015 | Albany |
| 6 | 2015-11-15 | 0.99 | 83453.76 | 1368.92 | 73672.72 | 93.26 | 8318.86 | 8196.81 | 122.05 | 0.0 | conventional | 2015 | Albany |
| 7 | 2015-11-08 | 0.98 | 109428.33 | 703.75 | 101815.36 | 80.0 | 6829.22 | 6266.85 | 562.37 | 0.0 | conventional | 2015 | Albany |
| 8 | 2015-11-01 | 1.02 | 99811.42 | 1022.15 | 87315.57 | 85.34 | 11388.36 | 11104.53 | 283.83 | 0.0 | conventional | 2015 | Albany |
| 9 | 2015-10-25 | 1.07 | 74338.76 | 842.4 | 64757.44 | 113.0 | 8625.92 | 8061.47 | 564.45 | 0.0 | conventional | 2015 | Albany |


Now I will sanity check the values in the columns to see if there are any outliers that need to be handled.

- For the 'Date' column I will look at the range and distribution of dates available
- For integer and float data I will generate summary stats (mean, quartiles, min, max) to get a high level view of the distribution
- For strings I will look at the cardinality and unique values
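
The three checks listed above can be sketched on a small hypothetical frame as follows. The frame `sample` and the list `string_cols` are illustrative assumptions; with the real data the same calls would be applied to `df_avo`, with `type` and `region` as the string columns.

```python
import pandas as pd

# Hypothetical four-row stand-in for df_avo.
sample = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2015-01-04", "2015-01-11", "2015-01-04", "2015-01-11"]
    ),
    "AveragePrice": [1.10, 1.20, 1.60, 1.70],
    "type": ["conventional", "conventional", "organic", "organic"],
    "region": ["Albany", "Albany", "Albany", "Albany"],
})

# 1. Dates: first and last observation.
date_range = (sample["Date"].min(), sample["Date"].max())
print(date_range)

# 2. Numeric columns: count, mean, std, min, quartiles, max.
numeric_summary = sample.select_dtypes(include="number").describe()
print(numeric_summary)

# 3. String columns: number of distinct values, then the values themselves.
string_cols = ["type", "region"]
cardinality = sample[string_cols].nunique()
unique_values = {col: sorted(sample[col].unique()) for col in string_cols}
print(cardinality)
print(unique_values)
```

Naming the string columns explicitly avoids depending on how a given pandas version labels the dtype of text columns.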

For the 'Date' field I can see that the observation period starts on January 4th, 2015 and ends on March 25th, 2018.  In the histogram the records look evenly distributed across the period.  This is positive, as it suggests the data was collected consistently.  Next we need to see whether the data collected is actually of good quality.
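
A histogram groups dates into bins, so a handful of missing rows can be invisible in the plot. As a complementary check, the rows can be counted per date and the spacing between dates verified. The sketch below runs on a hypothetical frame (`sample`) in which one date is deliberately short by one row; the names `rows_per_date`, `short_dates` and `date_steps` are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for df_avo: the last date has one row fewer.
sample = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2015-01-04", "2015-01-04", "2015-01-11", "2015-01-11", "2015-01-18"]
    ),
    "region": ["Albany", "Boston", "Albany", "Boston", "Albany"],
})

# Number of records for each observation date.
rows_per_date = sample.groupby("Date").size()

# Exactly uniform only if every date has the same number of records.
is_uniform = bool(rows_per_date.nunique() == 1)

# Dates with fewer records than the fullest date.
short_dates = rows_per_date[rows_per_date < rows_per_date.max()]

# Gaps between consecutive dates; weekly data should always step by 7 days.
date_steps = rows_per_date.index.to_series().diff().dropna()
is_weekly = bool((date_steps == pd.Timedelta(days=7)).all())

print(is_uniform, is_weekly)
print(short_dates)
```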

In [11]:
df_avo.Date.min()

Timestamp('2015-01-04 00:00:00')

In [12]:
df_avo.Date.max()

Timestamp('2018-03-25 00:00:00')

In [15]:
fig = px.histogram(df_avo, x='Date', title='Distribution of Record Dates')
fig.show()

In [18]:
df_avo.describe()

| | AveragePrice | Total Volume | Cali Small | Mexico Large | Cali xLarge | Total Bags | Small Bags | Large Bags | XLarge Bags | year |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 18249.0 | 18249.0 | 18249.0 | 18249.0 | 18249.0 | 18249.0 | 18249.0 | 18249.0 | 18249.0 | 18249.0 |
| mean | 1.41 | 850644.01 | 293008.42 | 295154.57 | 22839.74 | 239639.2 | 182194.69 | 54338.09 | 3106.43 | 2016.15 |
| std | 0.4 | 3453545.36 | 1264989.08 | 1204120.4 | 107464.07 | 986242.4 | 746178.51 | 243965.96 | 17692.89 | 0.94 |
| min | 0.44 | 84.56 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2015.0 |
| 25% | 1.1 | 10838.58 | 854.07 | 3008.78 | 0.0 | 5088.64 | 2849.42 | 127.47 | 0.0 | 2015.0 |
| 50% | 1.37 | 107376.76 | 8645.3 | 29061.02 | 184.99 | 39743.83 | 26362.82 | 2647.71 | 0.0 | 2016.0 |
| 75% | 1.66 | 432962.29 | 111020.2 | 150206.86 | 6243.42 | 110783.37 | 83337.67 | 22029.25 | 132.5 | 2017.0 |
| max | 3.25 | 62505646.52 | 22743616.17 | 20470572.61 | 2546439.11 | 19373134.37 | 13384586.8 | 5719096.61 | 551693.65 | 2018.0 |
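
In the summary, the maxima of the volume columns sit far above their upper quartiles, which is the kind of pattern the outlier check is meant to surface. One rough way to flag such rows is the 1.5 × IQR rule. This rule is an assumption of the sketch, not a method the notebook applies, and the series `volumes` holds made-up values rather than the avocado data.

```python
import pandas as pd

# Hypothetical volumes with one value far above the rest.
volumes = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 500.0], name="Total Volume")

# Interquartile range and the conventional upper fence at Q3 + 1.5 * IQR.
q1 = volumes.quantile(0.25)
q3 = volumes.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Rows above the fence are candidates for a closer look, not automatic removals.
flagged = volumes[volumes > upper_fence]
print(upper_fence)
print(flagged)
```

A flagged value is not necessarily an error; in aggregated retail data a very large volume may simply belong to a larger reporting area, so flagged rows are best inspected before anything is dropped.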
