In [1]:
import numpy as np
import pandas as pd

from warnings import simplefilter
# silence the FutureWarning that plotly currently triggers in pandas
simplefilter(action='ignore', category=FutureWarning)

ONTARIO = '/kaggle/input/fuel-prices-in-ontario-1990-2023-eda/Ontario_Fuel_Prices_1990_2023.csv'
df = pd.read_csv(filepath_or_buffer=ONTARIO, parse_dates=['Date'], index_col=['_id'])
df['year'] = df['Date'].dt.year
df.head()

_id,Date,Ottawa,Toronto West/Ouest,Toronto East/Est,Windsor,London,Peterborough,St. Catharine's,Sudbury,Sault Saint Marie,...,North Bay,Timmins,Kenora,Parry Sound,Ontario Average/Moyenne provinciale,Southern Average/Moyenne du sud de l'Ontario,Northern Average/Moyenne du nord de l'Ontario,Fuel Type,Type de carburant,year
1,1990-01-03,55.9,49.1,48.7,45.2,50.1,0.0,0.0,56.4,54.8,...,55.1,58.1,0.0,0.0,50.3,49.5,56.2,Regular Unleaded Gasoline,Essence sans plomb,1990
2,1990-01-10,55.9,47.7,46.8,49.7,47.6,0.0,0.0,56.4,54.9,...,55.0,58.2,0.0,0.0,49.2,48.3,56.2,Regular Unleaded Gasoline,Essence sans plomb,1990
3,1990-01-17,55.9,53.2,53.2,49.6,53.7,0.0,0.0,55.8,54.9,...,54.4,58.2,0.0,0.0,53.6,53.3,56.0,Regular Unleaded Gasoline,Essence sans plomb,1990
4,1990-01-24,55.9,53.2,53.5,49.0,52.1,0.0,0.0,55.7,54.9,...,54.3,58.2,0.0,0.0,53.5,53.2,56.0,Regular Unleaded Gasoline,Essence sans plomb,1990
5,1990-01-31,55.9,51.9,52.6,48.6,49.1,0.0,0.0,55.6,54.8,...,54.2,58.1,0.0,0.0,52.5,52.1,55.9,Regular Unleaded Gasoline,Essence sans plomb,1990


Let's pick a place and look at the different fuel prices across time.

In [2]:
from plotly.express import scatter
scatter(data_frame=df.replace(0, np.nan), x='Date', y=['Ottawa'], facet_col='Fuel Type', height=900, facet_col_wrap=3)

Not surprisingly, the gas and diesel prices appear to move together; the data for the other two fuel types is fragmentary.
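To put a number on "fragmentary", one option is to measure what share of rows per fuel type actually carry a price once zeros are treated as missing, as the plots here do. The sketch below is only illustrative: the `toy` frame and the `coverage_by_fuel` helper are made up for the example and reuse the `Fuel Type` and `Ottawa` column names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real frame: same column names, invented values.
toy = pd.DataFrame({
    'Fuel Type': ['Diesel', 'Diesel', 'Diesel',
                  'Auto Propane', 'Auto Propane', 'Auto Propane'],
    'Ottawa': [45.0, 46.1, 0.0, 0.0, 0.0, 30.2],
})

def coverage_by_fuel(frame, place):
    """Share of rows per fuel type that have a price, treating 0 as missing."""
    prices = frame[place].replace(0, np.nan)
    return prices.notna().groupby(frame['Fuel Type']).mean()

coverage = coverage_by_fuel(toy, 'Ottawa')
print(coverage)
```

On the real data the same call would be `coverage_by_fuel(df, 'Ottawa')`; a low share for a fuel type would back up what the scatter plot suggests.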

In [3]:
scatter(data_frame=df[['Date', 'Ottawa', 'Fuel Type']].replace(0, np.nan), x='Date', y='Ottawa', color='Fuel Type')

We see essentially the same story in a single graph that uses color instead of facets for the fuel type. Here we also see that diesel is cheaper than gas most of the time, and that premium gas is more expensive than regular.
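"Most of the time" can be checked directly by pivoting to one column per fuel type and comparing week by week. This is a rough sketch on invented numbers; `share_cheaper` is a hypothetical helper, and only dates where both fuels have a nonzero price are counted.

```python
import numpy as np
import pandas as pd

# Invented weekly prices for two fuel types in one city.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2000-01-05', '2000-01-05', '2000-01-12', '2000-01-12',
                            '2000-01-19', '2000-01-19', '2000-01-26', '2000-01-26']),
    'Fuel Type': ['Diesel', 'Regular Unleaded Gasoline'] * 4,
    'Ottawa': [60.0, 65.0, 61.0, 66.0, 70.0, 68.0, 0.0, 67.0],
})

def share_cheaper(frame, place, cheaper, dearer):
    """Fraction of dates on which `cheaper` is priced below `dearer`."""
    # One row per date, one column per fuel type; zeros become missing.
    wide = frame.pivot(index='Date', columns='Fuel Type', values=place).replace(0, np.nan)
    both = wide[[cheaper, dearer]].dropna()
    return (both[cheaper] < both[dearer]).mean()

share = share_cheaper(toy, 'Ottawa', 'Diesel', 'Regular Unleaded Gasoline')
print(share)
```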

In [4]:
scatter(data_frame=df[df['Fuel Type'] == 'Diesel'].replace(0, np.nan), x='Date', y=df.columns[1:-3], height=900)

Diesel prices are highly correlated across locations. This is not surprising; the likely sources of differences between locations are taxes, regulatory issues, and transportation costs. In some periods one location stands out from the rest, and in other periods a different one does.
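One way to see which location stands out at a given time is to measure each price's distance from the cross-location median on that date. The sketch below uses invented numbers and a hypothetical `standout` helper; the median is an assumption made for this example, not something the plot itself uses.

```python
import numpy as np
import pandas as pd

# Invented prices: rows are dates, columns are locations, 0 means no data.
toy = pd.DataFrame(
    {
        'Ottawa': [50.0, 52.0, 66.0],
        'Windsor': [51.0, 58.0, 61.0],
        'Kenora': [56.0, 53.0, 62.0],
        'Timmins': [0.0, 54.0, 62.0],
    },
    index=pd.to_datetime(['2000-01-05', '2000-01-12', '2000-01-19']),
)

def standout(prices):
    """For each date, the location farthest from that date's median price."""
    clean = prices.replace(0, np.nan)
    gap = clean.sub(clean.median(axis=1), axis=0).abs()
    return gap.idxmax(axis=1)

furthest = standout(toy)
print(furthest)
```

Counting the values of `furthest` by year would show whether the stand-out location changes over time.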

In [5]:
from plotly.express import imshow
imshow(img=df[df['Fuel Type'] == 'Diesel'].replace(0, np.nan)[df.columns[1:-3]].corr(), height=900)

This is probably the nut graf for diesel price correlations: Kenora is the outlier, and even it has a correlation above 0.96 with every other location. What does this look like for all fuel types?
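Reading the outlier and the weakest correlation off a heatmap by eye is error-prone, so here is a small sketch that pulls both from the correlation matrix. The data is synthetic (a shared trend plus noise, with one deliberately noisier column) and `correlation_summary` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Synthetic prices: a common trend plus noise; 'Kenora' gets extra noise.
rng = np.random.default_rng(0)
base = np.linspace(40, 120, 200)
toy = pd.DataFrame({
    'Ottawa': base + rng.normal(0, 1, 200),
    'Timmins': base + rng.normal(0, 1, 200),
    'Kenora': base + rng.normal(0, 8, 200),
})

def correlation_summary(prices):
    """Return (column with the lowest mean correlation, weakest pairwise correlation)."""
    corr = prices.replace(0, np.nan).corr()
    # Blank out the diagonal so self-correlations of 1.0 do not dominate.
    off_diag = corr.where(~np.eye(len(corr), dtype=bool))
    return off_diag.mean().idxmin(), off_diag.min().min()

outlier, weakest = correlation_summary(toy)
print(outlier, round(weakest, 3))
```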

In [6]:
for fuel_type in df['Fuel Type'].unique():
    imshow(img=df[df['Fuel Type'] == fuel_type].replace(0, np.nan)[df.columns[1:-3]].corr(), height=900).show()

The CNG and auto propane cases are odd; let's look at those.

In [7]:
scatter(data_frame=df[df['Fuel Type'].isin({'Compressed Natural Gas', 'Auto Propane'})].replace(0, np.nan), x='Date', y=df.columns[1:-3], height=1200,
       facet_col='Fuel Type', facet_col_wrap=1)

This data is strange; we probably have some data quality issues, particularly in the Auto Propane series. But we suspected that from our first Ottawa x Fuel Type plot.
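Sparse series can also make a correlation heatmap look odd on their own: `DataFrame.corr` uses whatever overlapping observations each pair has, and two overlapping points always give a correlation of plus or minus one. A possible guard, sketched here on invented numbers, is the `min_periods` argument, which leaves thinly supported pairs as NaN. The threshold of 4 is arbitrary.

```python
import numpy as np
import pandas as pd

# Invented sparse prices; 0 marks a missing observation.
toy = pd.DataFrame({
    'Ottawa': [30.0, 31.0, 0.0, 0.0, 33.0, 34.0],
    'Windsor': [29.0, 0.0, 0.0, 0.0, 0.0, 35.0],
    'London': [30.5, 31.5, 0.0, 0.0, 33.5, 34.0],
})
clean = toy.replace(0, np.nan)

loose = clean.corr()                 # any pair with overlapping data gets a value
strict = clean.corr(min_periods=4)   # pairs with fewer than 4 shared rows become NaN

print(loose.round(3))
print(strict.round(3))
```

Here Ottawa and Windsor share only two rows, so the loose matrix reports a perfect correlation that the strict one declines to compute.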

Let's pick a couple of locations where we have lots of data and look at their correlations.

In [8]:
scatter(data_frame=df[['North Bay', 'Timmins', 'year', 'Fuel Type']][df['Fuel Type'] == 'Diesel'].replace(0, np.nan).dropna(), x='North Bay', y='Timmins', color='year', trendline='ols')

These diesel prices are very highly correlated. The correlation does seem to have some sensitivity to the price level, which has an embedded time component that we have shown rather crudely here using year buckets.
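The year buckets can be made explicit by computing the correlation within each year, alongside the number of observations behind it. This is a sketch on invented numbers; `yearly_correlation` is a hypothetical helper that reuses the `North Bay`, `Timmins`, and `year` column names.

```python
import numpy as np
import pandas as pd

# Invented prices for two locations across two years.
toy = pd.DataFrame({
    'North Bay': [50.0, 51.0, 52.0, 53.0, 60.0, 61.0, 62.0, 63.0],
    'Timmins': [50.5, 51.5, 52.5, 53.5, 63.0, 61.0, 62.0, 60.0],
    'year': [1990] * 4 + [1991] * 4,
})

def yearly_correlation(frame, a, b):
    """Observation count and Pearson correlation of a and b within each year."""
    clean = frame[[a, b, 'year']].replace(0, np.nan).dropna()
    rows = {year: (len(g), g[a].corr(g[b])) for year, g in clean.groupby('year')}
    return pd.DataFrame.from_dict(rows, orient='index', columns=['n', 'r'])

by_year = yearly_correlation(toy, 'North Bay', 'Timmins')
print(by_year)
```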

In [9]:
scatter(data_frame=df[['Windsor', 'London', 'year', 'Fuel Type']][df['Fuel Type'] == 'Regular Unleaded Gasoline'].replace(0, np.nan).dropna(), x='Windsor', y='London', color='year', trendline='ols')

For Windsor x London unleaded gas prices we see lower correlation at lower prices and higher correlation at higher prices, which is surprising. Then again, we also have less data at the higher prices.
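To check how much the thinner data at high prices matters, we could bucket by price band rather than by year and report the count next to each correlation. The sketch below uses invented numbers, arbitrary band edges, and a hypothetical `correlation_by_price_band` helper.

```python
import numpy as np
import pandas as pd

# Invented prices: plenty of low-price weeks, only two high-price weeks.
toy = pd.DataFrame({
    'Windsor': [50.0, 55.0, 60.0, 65.0, 70.0, 120.0, 130.0],
    'London': [51.0, 54.0, 61.0, 66.0, 69.0, 128.0, 125.0],
})

def correlation_by_price_band(frame, a, b, edges):
    """Observation count and correlation of a and b within bands of a's price."""
    clean = frame[[a, b]].replace(0, np.nan).dropna()
    band = pd.cut(clean[a], bins=edges)
    rows = {str(label): (len(g), g[a].corr(g[b]))
            for label, g in clean.groupby(band, observed=True)}
    return pd.DataFrame.from_dict(rows, orient='index', columns=['n', 'r'])

by_band = correlation_by_price_band(toy, 'Windsor', 'London', edges=[40, 80, 160])
print(by_band)
```

In this toy case the high band has only two points, so its correlation is exactly -1 and tells us nothing; a small `n` in the real data would call for the same caution.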