Udacity Data Scientist Nanodegree Capstone Project
"Forecasting of Fuel prices in Germany"

Table of Contents

  1. Project Description
  2. Project Summary
  3. Prerequisites
  4. Instructions and Files
  5. Discussion on the Project
  6. Data Structure of Tankerkoenig
  7. Acknowledgements

Project Description

Fuel prices, like stock prices, change heavily within a day, though usually not with the same amplitude. I therefore wanted to take a look at the data and see what is inside, and furthermore to build a forecasting model for the price on a daily basis. For this I used the historical data from Tankerkoenig, which is available in the Azure cloud for private use under the following license: https://creativecommons.org/licenses/by-nc-sa/4.0/.

The accompanying blog post is on Medium: Medium Post

Project Summary

I found some interesting statistics in the data and visualized them. Additionally, I built different SARIMAX models, predicted the fuel price on the test data with each of them, and calculated the mean squared error and R2 score for each. I also built and trained an LSTM neural network. The models show promising values, as you can see here:

Model                 MSE                    R2 Score
(0,1,1)x(0,1,1,52)    0.001827971789383413   -3.891623163418407
(3,1,1)x(3,1,1,52)    0.000816361372162868   -1.184569926617356
(3,1,2)x(3,1,2,52)    0.000982089719322026   -1.628056323129107
LSTM                  8.22876984342417e-05    0.735263088586480

Nevertheless, predicting a fuel price is not that easy, and if you take a closer look at the predictions you will see differences. In the case of the LSTM, it is not really predicting the future: the prediction for days further ahead converges towards a boundary value.

Therefore the models would have to be optimized with additional features.
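
For reference, the two scores in the table can be computed with scikit-learn (which is in the library list below). This is only a minimal sketch; the arrays here are toy stand-ins, not values from the notebook:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy arrays standing in for the held-out prices and a model's forecast;
# in the notebook these come from the test split and the fitted model.
y_test = np.array([1.32, 1.34, 1.31, 1.33])
y_pred = np.array([1.31, 1.33, 1.32, 1.33])

print('MSE:', mean_squared_error(y_test, y_pred))
print('R2 :', r2_score(y_test, y_pred))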

Prerequisites

History data from Tankerkoenig History, if you want to start with the data reading at the beginning of the notebook.

Libraries needed

I used the Anaconda distribution and therefore also recommend it, because it already includes most of the libraries mentioned below.

You'll need:

  • pandas
  • glob
  • math
  • itertools
  • matplotlib.pyplot
  • plotly
  • numpy
  • sklearn
  • keras
  • sqlalchemy
  • statsmodels
  • pylab

Instructions and Files

If you downloaded the history data (2015-2019) from Tankerkoenig, you can start at the beginning of the notebook. There I load the files and reduce the dataset to a smaller size. A description of the data structure used can be found below.

For better handling I created a slice out of the data, since one year of data is around 5 GB. So I collected only stations whose PLZ (postal code) begins with 40, loaded the data per year, and saved the data for these few stations in a SQL file for later use. The functions needed for loading and saving data and for getting the dataset of a specific station are in helper_functions; a rough sketch of the slicing step is shown below.
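
The actual code lives in the notebook and in helper_functions; the following is only a rough sketch of the idea, with the file layout and table names taken from the description in this README (the exact filtering and helper signatures may differ):

import glob
import pandas as pd
from sqlalchemy import create_engine

# SQLite file and table names as described below (prices_40.sql with
# the tables Prices and Stations).
engine = create_engine('sqlite:///prices_40.sql')

# Keep only stations whose PLZ (post_code) starts with 40.
stations = pd.read_csv('./sprit/stations.csv')
stations_40 = stations[stations['post_code'].astype(str).str.startswith('40')]
stations_40.to_sql('Stations', engine, if_exists='replace', index=False)

# Go through the daily price files year by year and store only the
# rows that belong to the selected stations.
for year in range(2015, 2020):
    for path in sorted(glob.glob('./sprit/{}/*/*-prices.csv'.format(year))):
        prices = pd.read_csv(path)
        prices = prices[prices['station_uuid'].isin(stations_40['uuid'])]
        prices.to_sql('Prices', engine, if_exists='append', index=False)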

If you don't want to download all the data, you can run cell 1 to import all relevant libraries and go directly to the DataAnalysis section of the notebook. There I read the data from the previously created SQLite database prices_40.sql, which consists of the two tables Prices and Stations, and do some data analysis and plotting with it. A sketch of how the database can be read back is shown below.
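
A minimal sketch of reading the two tables back with sqlalchemy and pandas, with paths and table names as described above:

import pandas as pd
from sqlalchemy import create_engine

# prices_40.sql and the table names Prices / Stations are taken from the
# description above; adjust the path if the database lives elsewhere.
engine = create_engine('sqlite:///prices_40.sql')

prices_pd = pd.read_sql_table('Prices', engine, parse_dates=['date'])
stations_pd = pd.read_sql_table('Stations', engine)

print(prices_pd.shape, stations_pd.shape)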

The preparation and implementation for building a forecast model follow in the notebook. On my MacBook Pro (10 GB RAM) it was not possible to calculate the different ARIMA models in one sequence, even though I overwrite the data from the previous calculation each time and return only the fitted ARIMA model and forecast. A minimal sketch of fitting one of the seasonal models from the summary table is shown below.
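
This sketch only illustrates the statsmodels call for the (0,1,1)x(0,1,1,52) model from the summary table; the series here is a random stand-in, while in the notebook it comes from the Prices table, and the train/test split is done there as well:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# A weekly random-walk-like price series as a placeholder for the real data.
idx = pd.date_range('2015-01-01', periods=200, freq='W')
series = pd.Series(1.30 + np.random.randn(200).cumsum() * 0.01, index=idx)

model = sm.tsa.statespace.SARIMAX(series,
                                  order=(0, 1, 1),
                                  seasonal_order=(0, 1, 1, 52),
                                  enforce_stationarity=False,
                                  enforce_invertibility=False)
results = model.fit(disp=False)

# Forecast a number of steps ahead and take the point forecast.
forecast = results.get_forecast(steps=30).predicted_mean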

Discussion on the Project

You'll find a discussion of this project in the Medium Post.

Data Structure of Tankerkoenig

I put the data in one directory, "sprit"; inside this directory there is a folder for every year and month. Inside the folders there is a CSV file for every day, so we have ./sprit/year/month/year-month-day-prices.csv, for example ./sprit/2015/12/2015-12-01-prices.csv. That is the same structure as on Tankerkoenig, apart from the stations.csv.

sprit
+-- stations.csv
+-- 2015
| +-- 01
| | +-- 2015-01-01-prices.csv
| | +-- ...
| +-- 02
| +-- ...
| +-- 12
+-- ...
+-- 2019
| +-- 01
| +-- 02
| +-- ...
| +-- 12

Inside the CSV files, the data is structured as follows:

date,station_uuid,diesel,e5,e10,dieselchange,e5change,e10change

The corresponding stations are in the ./sprit/stations.csv file and have the following structure inside:

uuid,name,brand,street,house_number,post_code,city,latitude,longitude

The two files are connected via the UUID, so I've written a function to extract the data for a specific UUID and type of fuel from the prices dataframe.

def get_data4uid(uid, typ):
    """Return the price series of one fuel type for a given station.

    Input:  uid - station_uuid of the station
            typ - fuel column to extract ('diesel', 'e5' or 'e10')
    Output: DataFrame indexed by date with one price column
    """
    # Select the rows of this station and keep only the date and price column.
    data_uid = prices_pd[prices_pd['station_uuid'] == uid][['date', typ]].copy()

    # Convert the timestamps and use them as the index.
    data_uid['date'] = pd.to_datetime(data_uid['date'], utc=True)
    data_uid.set_index('date', inplace=True)

    return data_uid
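
A small usage example, assuming prices_pd and stations_pd were loaded from prices_40.sql as sketched above:

# Take one station from the Stations table and plot its diesel price history.
uid = stations_pd['uuid'].iloc[0]
diesel_uid = get_data4uid(uid, 'diesel')
diesel_uid.plot(title=stations_pd['name'].iloc[0])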

Acknowledgements

This project is the capstone project of the Udacity Data Science Nanodegree. Thanks to Udacity for preparing me technically for this task with great learning material and projects. I also have to thank Tankerkoenig for making the data openly available, which is what makes this data analysis possible in the first place.

And last but not least, thanks to my family for their great patience over the last months.
