In [58]:
import os
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib_inline
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

In [59]:
DATA = 'data'

# Problem statement
Palm oil has grown over the last 60 years to become the world's single largest vegetable oil crop. It makes up 35% of edible oil consumption globally, and is used in numerous industries.

Since late 2020, palm olein prices have surged to break one record after another. Prices today are at an all-time high. It is unclear what is driving this price surge. There has been no obvious structural change to the market that could easily explain such a sustained bull market. This suggests that current prices may be unsustainable and that a market correction is likely at some point.

In the meantime, businesses need to plan, and the uncertainty about future prices makes this task challenging, as margins are <8% for most actors. The palm oil market affects many industries including foods, industrial applications (consumer goods like soaps, detergents & cosmetics) and energy. In a market where price has increased by 400% in 24 months, accurately forecasting future prices and making appropriate decisions becomes mission-critical.

**The supplier's pricing mechanism**: the supplier sets the price for the month on the first day of the new month. The supplier closes their order book on the 21st of the current month. By the 21st of the month, purchasing must choose whether to place forward orders for 1-3 months at the current price, or to wait for the new price in 10 days.
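The timing above can be sketched in a few lines. This is only an illustration of the calendar logic and of one possible decision rule; the function names, the `threshold` parameter and the example prices are hypothetical and not part of the supplier's process.

```python
from datetime import date


def order_book_close(today: date) -> date:
    """The 21st of the current month, when the supplier closes the order book."""
    return date(today.year, today.month, 21)


def next_price_date(today: date) -> date:
    """The first day of the following month, when the new price is set."""
    if today.month == 12:
        return date(today.year + 1, 1, 1)
    return date(today.year, today.month + 1, 1)


def should_order_forward(current_price: float,
                         forecast_next_price: float,
                         threshold: float = 0.0) -> bool:
    """Hypothetical rule: order forward at the current price when the
    forecast for next month's price exceeds it by more than `threshold`
    (expressed as a fraction, e.g. 0.02 for 2%)."""
    return forecast_next_price > current_price * (1 + threshold)


# Example with made-up numbers: deciding on 21 April 2022.
decision_day = date(2022, 4, 21)
days_to_wait = (next_price_date(decision_day) - decision_day).days
print(days_to_wait, should_order_forward(1500.0, 1600.0, threshold=0.02))
```

The quality of the decision depends entirely on the forecast passed in, which is what the rest of this project is meant to provide.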

# Goal
The goals of this project are:
- to understand the drivers for the palm oil price
- to understand the current price surge
- to model the supply and demand factors that drive price and quantity
- to forecast future prices over 30 - 60 - 90 day horizons
- to use a naive forecast as the benchmark
- to explore Fourier transform or wavelet methods
- to build a pipeline that runs automatically and handles duplication and failure
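A minimal sketch of the naive benchmark mentioned in the goals, assuming "naive" means that the price `horizon` days ahead is predicted to equal today's price. The function name and the synthetic price series are illustrative assumptions; real price data would replace them.

```python
import numpy as np
import pandas as pd


def naive_forecast_mae(prices: pd.Series, horizon: int) -> float:
    """Mean absolute error of the naive forecast at a given horizon.

    The forecast for time t is simply the price observed `horizon`
    steps earlier.
    """
    forecast = prices.shift(horizon)
    errors = (prices - forecast).dropna()
    return float(errors.abs().mean())


# Synthetic daily random-walk prices, for illustration only.
rng = np.random.default_rng(0)
index = pd.date_range("2021-01-01", periods=400, freq="D")
synthetic_prices = pd.Series(1000 + rng.normal(0, 10, size=400).cumsum(), index=index)

# Benchmark error at the 30, 60 and 90 day horizons.
benchmark = {h: naive_forecast_mae(synthetic_prices, h) for h in (30, 60, 90)}
print(benchmark)
```

Any candidate model would then need to beat these errors at the same horizons to be worth using.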

# Price Drivers

We have major data challenges. We need more data over longer periods.

### General
- political policy
- inflation (quarterly)
- exchange rate movements (daily)
- Russia-Ukraine war & geopolitics
- economic effects of COVID flow-on to palm oil

### Market
- price of palm oil (daily)
- price of substitute oils (daily)

### Demand Side
- general increase in the consumption of edible oils globally
- growth in demand with the growth of specific economies, in China and India.

### Supply Side
- weather: daily
- labor shortages:
- land clearing:?
- output per ha:?

# Visualisation
- **production of palm oil**: total world production & by country and region over time. line plot
- **vegetable oils production**: production and by oil type, country and region over time. stacked area plot
- **imports**: which countries import the most oil? how has it changed over time? global map/heatmap
- **production by country**: palm oil production by country over time. global map/heatmap
- **production by country**: horizontal bar chart with national output at end of bar.
- **exports**: which countries export palm oil? how has it changed over time? global map/heatmap.
- **land used for palm oil**: how much land is used for palm oil cultivation over time? line plot.
- **land used for vegetable oil**: how much land is used for the cultivation of oil crops, by crop, by country and region, over time? stacked area plot.
- **oil yield by crop**: a comparison of oil yield per hectare of land cultivated by crop. shows that palm oil is the most productive per hectare. horizontal bar plot showing top 10 crops
- **price**: palm olein and other edible oil prices. line chart

# Introduction
There are a number of forces at play that make this an interesting topic
- palm oil demand appears to continuously be growing year-on-year
- there is an increasingly negative view of palm oil and its role in driving deforestation. "palm oil free" is now a prominent issue and should cause downward pressure on demand
- labour shortages due to restrictions imposed during COVID-19 had an effect in 2020 and the "stickiness" of agricultural production means the effect of this will take time to work through the supply chain. This may be one of the causes behind lagging supply.
- global shipping disruptions and delays have impacted trade broadly and palm oil is affected. This should have a dampening effect on demand
- public policy in Indonesia constraining exports will have had a dampening effect on supply
- El Niño weather conditions will have an impact on supply (dry conditions in Southeast Asia typically reduce palm yields after a lag), however the timing of this impact is unclear at this time

# 1) Global palm oil production
Production by country over time. Currently, showing only world.

Production of palm oil has increased rapidly over the last 60 years, roughly 48-fold in total (an increase of about 4,700%). The growth has occurred to meet rising demand for vegetable oils in general.


In [60]:
production = pd.read_csv(os.path.join(DATA, "palm-oil-production.csv"))
world_df = production.loc[production['Entity'] == 'World']

In [61]:
fig = px.line(world_df, x='Year', y='Crops - Oil, palm - 257 - Production - 5510 - tonnes', title='Global Palm Oil Production')
fig.show()
# To-do: format plot. button to add country or region. automation

- 1961 = 1,478,901mt
- 2018 = 71,453,193mt
- 48x increase in 57 years

Need to include a drop-down menu to select countries and regions.

# 2) Land used for Palm Oil Production

There should be a strong correlation between increased areas under cultivation for oil palm and increased production of palm oil.

Total production should effectively be the product of total hectares under cultivation and yield per hectare. Production increases are driven by increases in land under cultivation and improving (or deteriorating) yields per hectare.

- plot by country over time (stacked line plot)
- plot by country over time (geo heat map)
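The relationship between production, area and yield described above can be sketched as follows. Because production is area multiplied by yield, log growth in production splits exactly into an area term and a yield term. The function names and the example figures are illustrative assumptions, not values from the data.

```python
import math


def production(area_ha: float, yield_t_per_ha: float) -> float:
    """Production (tonnes) = area harvested (ha) x yield (tonnes per ha)."""
    return area_ha * yield_t_per_ha


def growth_decomposition(area_0, yield_0, area_1, yield_1):
    """Split log growth in production into area and yield contributions."""
    area_growth = math.log(area_1 / area_0)
    yield_growth = math.log(yield_1 / yield_0)
    return {
        "area": area_growth,
        "yield": yield_growth,
        "total": area_growth + yield_growth,
    }


# Made-up numbers: area grows from 1.0 to 4.0 million ha,
# yield from 2.0 to 3.0 tonnes per ha.
parts = growth_decomposition(1.0e6, 2.0, 4.0e6, 3.0)
share_from_area = parts["area"] / parts["total"]
print(parts, round(share_from_area, 2))
```

Applied to the real area and production series, this would show how much of the growth came from land expansion rather than yield improvement.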

In [62]:
land = pd.read_csv(os.path.join(DATA, 'land-use-palm-oil.csv'))
world_land = land.loc[land['Entity'] == 'World']
oil_palm_fruit = world_land["Crops - Oil palm fruit - 254 - Area harvested - 5312 - ha"]

In [63]:
fig = px.line(world_land, x="Year", y=oil_palm_fruit)

# Add figure title
fig.update_layout(title_text="<b>Land under Cultivation (Palm Oil)</b>", title_font_size=40, legend_font_size=20, width=1600, height=1400)

# Format x-axis
fig.update_xaxes(title_text="<b>Year</b>", title_font=dict(size=30, family='Verdana', color='black'), tickfont=dict(family='Calibri', color='white', size=25))

# Format y-axis (the plotted column is area harvested, in hectares)
fig.update_yaxes(title_text="<b>Area Harvested (ha)</b>", title_font=dict(size=30, family='Verdana', color='black'), tickfont=dict(family='Calibri', color='white', size=25))

fig.show()

# I want this to be stacked
# I need to format the hovertext
# the x and y titles aren't showing
# need to fill below the line
# need to update the data to include 2019, 2020, 2021
# interesting to see a drop in land under cultivation
# upsampling the yearly or monthly data should then be OK
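A minimal sketch of the upsampling idea noted above, assuming yearly observations are placed on 1 January and monthly values are filled by linear interpolation. The `area_ha` column and its values are hypothetical placeholders; other interpolation choices may suit the real data better.

```python
import pandas as pd

# Placeholder yearly data, for illustration only.
yearly = pd.DataFrame({
    "Year": [2016, 2017, 2018],
    "area_ha": [100.0, 112.0, 124.0],
})

# Turn the integer year into a DatetimeIndex (1 January of each year).
yearly.index = pd.to_datetime(yearly["Year"].astype(str), format="%Y")

# Reindex to month starts, then fill the gaps by linear interpolation.
monthly = yearly["area_ha"].asfreq("MS").interpolate(method="linear")
print(monthly.head())
```

Interpolated values are not new information, so any model using them should treat the yearly series as the true resolution.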

# 3) Vegetable Oil Production
- plot 1: production over time by crop. currently a simple line chart. Need to turn it into a stacked area chart
- geo-map plot showing production by country or region over time

In [64]:
vegetable_oil_production = pd.read_csv(os.path.join(DATA, 'vegetable-oil-production.csv'))
year = vegetable_oil_production['Year'].drop_duplicates(keep='first', inplace=False)
vegetable_oil_production.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11064 entries, 0 to 11063
Data columns (total 16 columns):
 #   Column                                                                     Non-Null Count  Dtype  
---  ------                                                                     --------------  -----  
 0   Entity                                                                     11064 non-null  object 
 1   Code                                                                       9323 non-null   object 
 2   Year                                                                       11064 non-null  int64  
 3   Crops processed - Oil, soybean - 237 - Production - 5510 - tonnes          5739 non-null   float64
 4   Crops processed - Oil, sesame - 290 - Production - 5510 - tonnes           4212 non-null   float64
 5   Crops processed - Oil, linseed - 334 - Production - 5510 - tonnes          4387 non-null   float64
 6   Crops processed - Oil, palm - 257 - Production - 5510 

In [65]:
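# Sum the numeric columns per year. Note: this sums over every Entity, so any
# aggregate rows in the file (e.g. regions or 'World') would be double counted.
veg_oil_yearly_production = vegetable_oil_production.groupby('Year').sum(numeric_only=True)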
veg_oil_yearly_production.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 54 entries, 1961 to 2014
Data columns (total 13 columns):
 #   Column                                                                     Non-Null Count  Dtype  
---  ------                                                                     --------------  -----  
 0   Crops processed - Oil, soybean - 237 - Production - 5510 - tonnes          54 non-null     float64
 1   Crops processed - Oil, sesame - 290 - Production - 5510 - tonnes           54 non-null     float64
 2   Crops processed - Oil, linseed - 334 - Production - 5510 - tonnes          54 non-null     float64
 3   Crops processed - Oil, palm - 257 - Production - 5510 - tonnes             54 non-null     float64
 4   Crops processed - Oil, rapeseed - 271 - Production - 5510 - tonnes         54 non-null     float64
 5   Crops processed - Oil, groundnut - 244 - Production - 5510 - tonnes        54 non-null     float64
 6   Crops processed - Oil, cottonseed - 331 - Production - 

In [66]:
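import re
# Extract the crop name between 'Oil, ' and the ' - <code>' suffix
pattern = r'(?<=Oil, ).+?(?= - \d)'
cols = [re.search(pattern, c, re.IGNORECASE)[0] for c in veg_oil_yearly_production]
# Replace spaces with underscores, then strip remaining non-word characters
cols = [re.sub(' ', '_', c) for c in cols]
cols = [re.sub(r'\W', '', c) for c in cols]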

In [67]:
veg_oil_yearly_production.columns = cols
veg_oil_yearly_production.reset_index(inplace=True)
veg_oil_yearly_production.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54 entries, 0 to 53
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Year           54 non-null     int64  
 1   soybean        54 non-null     float64
 2   sesame         54 non-null     float64
 3   linseed        54 non-null     float64
 4   palm           54 non-null     float64
 5   rapeseed       54 non-null     float64
 6   groundnut      54 non-null     float64
 7   cottonseed     54 non-null     float64
 8   coconut_copra  54 non-null     float64
 9   olive_virgin   54 non-null     float64
 10  safflower      54 non-null     float64
 11  sunflower      54 non-null     float64
 12  maize          54 non-null     float64
 13  palm_kernel    54 non-null     float64
dtypes: float64(13), int64(1)
memory usage: 6.0 KB


In [68]:
veg_oil_fig = px.area(
    veg_oil_yearly_production,
    x='Year',
    y=veg_oil_yearly_production.columns[1:]
)
veg_oil_fig.update_traces(textfont_size=16, hovertemplate=None)
veg_oil_fig.update_layout(hovermode="x")

# Add figure title
veg_oil_fig.update_layout(title_text="<b>Vegetable Oil Production by Crop</b>", title_font_size=40, legend_font_size=20, width=1800, height=1400)

# Format x-axis
veg_oil_fig.update_xaxes(title_text="<b>Year</b>", title_font=dict(size=30, family='Verdana', color='black'), tickfont=dict(family='Calibri', color='white', size=25))

# Format y-axis
veg_oil_fig.update_yaxes(title_text="<b>Production (tonnes)</b>", title_font=dict(size=30, family='Verdana', color='black'), tickfont=dict(family='Calibri', color='white', size=25))

veg_oil_fig.show()


In [None]:
oil_yield = pd.read_csv(os.path.join(DATA, "oil-yield-by-crop.csv"))
oil_yield.head()

# 4) Price

In [None]:
price = pd.read_csv(os.path.join(DATA, 'palm oil prices 021020 - 290422.csv'))
price.head()

I need to find more data. Ideally daily price data back to 1961

In [None]:
fig = px.line(price, x='DAILY PRICES', y='Palm olein RBD Mal FOB US$', title='Palm Olein Price')
fig.show()
# format this

In [None]:
price.shape

In [None]:
# line chart horizontal, no hover text

# 5) Export Volumes
Export volumes are a very good way to look at total volumes. Most palm oil is exported, as evidenced by comparing the importer country vs the

In [None]:
# stacked line chart with hover text

# 6) Key Import Markets

In [None]:
imports = pd.read_csv(os.path.join(DATA, 'palm-oil-imports.csv'))
imports.head()

In [None]:
imports.shape

In [None]:
imports = pd.read_csv(os.path.join(DATA, 'vegetable-oil-production.csv'))
# imports.head()
# Sum numeric columns per year (non-numeric columns such as Entity and Code are skipped)
imports = imports.groupby('Year').sum(numeric_only=True)
imports

In [None]:
# stacked area plot

In [None]:
# world heat map

# 7) Palm Oil Production Volumes by Country & over time

In [None]:
production_time = pd.read_csv(os.path.join(DATA, 'palm-oil-production.csv'))
production_time.head()

# Sum numeric columns per year (non-numeric columns such as Entity and Code are skipped)
production_time = production_time.groupby('Year').sum(numeric_only=True)
production_time

In [None]:
production_time.shape

In [None]:
# world heat map of producers
# line chart with drop box for country

# 8) Land Use for Vegetable Oil Crops

In [None]:
crops = pd.read_csv(os.path.join(DATA, 'land-use-for-vegetable-oil-crops.csv'))
crops.head()

In [None]:
crops.shape

In [None]:
# stacked line chart

# 9) Total Vegetable Oil Production (substitutes) Over Time
Palm oil prices are affected by the price and supply of substitute vegetable oils.
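One simple way to check how closely palm olein tracks a substitute oil is the correlation of daily log returns. The sketch below uses synthetic prices with a shared shock built in, purely to show the mechanics; the column names `palm_olein` and `soybean_oil` and all numbers are assumptions, not real market data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_days = 250

# A common shock shared by both oils, plus an idiosyncratic shock for each.
common = rng.normal(0, 0.01, n_days)
prices = pd.DataFrame({
    "palm_olein": 1000 * np.exp(np.cumsum(common + rng.normal(0, 0.005, n_days))),
    "soybean_oil": 1100 * np.exp(np.cumsum(common + rng.normal(0, 0.005, n_days))),
})

# Correlate daily log returns rather than price levels, since trending
# price levels can look correlated even when the series are unrelated.
returns = np.log(prices).diff().dropna()
correlation = returns["palm_olein"].corr(returns["soybean_oil"])
print(round(correlation, 2))
```

A high correlation on real data would support including substitute oil prices as model features.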

In [None]:
oil_production = pd.read_csv(os.path.join(DATA, 'vegetable-oil-production.csv'))
oil_production.head()

In [None]:
oil_production.shape

# 10) Palm Oil Uses

# 11) Oil Yield by Crop (2018)

In [None]:
_yield = pd.read_csv(os.path.join(DATA, 'oil-yield-by-crop.csv'))
_yield.head()
# need to pivot this

In [None]:
_yield.shape
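The pivot noted above might look like the sketch below. The layout of `oil-yield-by-crop.csv` is not shown here, so this assumes a long table with hypothetical `Entity`, `Crop` and `Yield` columns and placeholder values; if the file is already wide, `DataFrame.melt` would be the inverse step.

```python
import pandas as pd

# Placeholder long-format data, for illustration only (not real yields).
long_yield = pd.DataFrame({
    "Entity": ["A", "A", "B", "B"],
    "Crop": ["palm", "soybean", "palm", "soybean"],
    "Yield": [3.0, 0.5, 2.5, 0.4],
})

# One row per entity, one column per crop.
wide_yield = long_yield.pivot(index="Entity", columns="Crop", values="Yield")

# Average yield per crop, highest first: the input for a top-10 bar plot.
top_crops = wide_yield.mean().sort_values(ascending=False).head(10)
print(top_crops)
```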

# Palm oil terms
We're mainly interested in RBD palm olein but it's worth mentioning some others
- RBD: refined, bleached and deodorised
- CPO: crude palm oil
- CPKO: crude palm kernel oil

# Incoterms
We're mainly interested in FOB but it's worth knowing some other Incoterms
- EXW: ex works
- FAS: free alongside ship
- FOB: free on board
- CFR: cost and freight
- CIF: cost, insurance & freight

# Ports of origin
We will reference 'Malaysia'; however, there are in fact two primary ports for exports of palm olein:
- Port Klang
- Kuala Lumpur

The often-quoted 'FOB Malaysia' is an index price, traded on Bursa Malaysia. In reality, prices will differ slightly between the two ports.

# Data Sources
- https://agropost.wordpress.com/
  - This has a lot of great market data going back to 2010. However, it doesn't seem to offer the data as CSV. This seems like a perfect opportunity to demonstrate my scraping skills and build a data pipeline.
- https://www.fao.org/home/en
  - FAO: the Food and Agriculture Organization, part of the UN. Good data, high quality and easy to download, but lacking in granularity.

# Data Pipeline
A software automation solution to move data around: host the data in a data warehouse and move it to where it can be used.

http://tmgmdashboarding-env-1.eba-srqgw9ij.ap-southeast-2.elasticbeanstalk.com/

- data storage
- ETL
- preprocess data
- seasonality
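Two of the pipeline concerns raised in the goals, duplication and failure, can be sketched with small helpers. This is a rough outline under stated assumptions, not the pipeline design: `with_retries`, `append_without_duplicates`, the `date` and `price` columns and the sample rows are all hypothetical.

```python
import time
import pandas as pd


def with_retries(func, attempts=3, delay_seconds=0.0):
    """Call `func` until it succeeds or `attempts` is exhausted."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except Exception as error:  # broad on purpose for this sketch
            last_error = error
            time.sleep(delay_seconds)
    raise RuntimeError(f"failed after {attempts} attempts") from last_error


def append_without_duplicates(existing, new_rows, key_cols):
    """Append new rows, keeping the latest row for each key."""
    combined = pd.concat([existing, new_rows], ignore_index=True)
    deduplicated = combined.drop_duplicates(subset=key_cols, keep="last")
    return deduplicated.sort_values(key_cols).reset_index(drop=True)


# Placeholder rows: the second batch repeats one date with a revised price.
stored = pd.DataFrame({"date": ["2022-04-01", "2022-04-02"], "price": [1.0, 2.0]})
batch = pd.DataFrame({"date": ["2022-04-02", "2022-04-03"], "price": [2.5, 3.0]})

updated = with_retries(lambda: append_without_duplicates(stored, batch, ["date"]))
print(updated)
```

Keeping the load step idempotent in this way means a failed run can simply be repeated without creating duplicate rows.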