# Business Case:
My previous employer was a major oil bottler. We purchased, bottled and distributed approximately 2,000 mt per month, making us the largest oil bottler in the country.

From May 2020, we saw prices for palm olein increase at an unprecedented rate. The explanation that I was giving to our customers was not satisfactory.

Not knowing how long the bubble would last made purchasing and trading difficult:
- buying forward is good when the market is rising
- buying forward is bad when the market is falling
- buying forward in a flat market is a risk with no reward
- buying forward in a volatile market is risky

There were no readily available tools or resources that could forecast short-term prices accurately enough to guide purchase decisions, so I decided to make one.

This notebook represents the first step: **understand the market**. This EDA notebook uses UN FAO data.

**Assumptions**: commodity markets are efficient and self-clearing. At market prices, everything that can be produced will be sold and consumed. Total production is therefore a proxy for total consumption.

In [234]:
import os
import pandas as pd
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")

In [235]:
DATA_DIR = '../data/palm_olein_data'

production_elements = ['Area harvested', 'Yield', 'Production', 'Stocks', 'Laying', 'Producing Animals/Slaughtered', 'Yield/Carcass Weight', 'Milk Animals', 'Prod Popultn']

# these area codes are regional aggregates (mostly codes > 1000); drop them to avoid double counting
area_drop_list = [261, 265, 266, 268, 269, 5100, 5200, 5300, 5501, 5206, 5204, 5301, 5101
, 5302, 5400, 5401, 5707, 5802, 5801, 5815, 5102, 5817, 5103, 5803, 5105, 5305, 5404, 5000]

oil_crops = ['Oil, castor beans', 'Oil, citronella', 'Oil, coconut (copra)', 'Oil, cottonseed', 'Oil, essential nes', 'Oil, groundnut', 'Oil, linseed', 'Oil, maize', 'Oil, olive residues', 'Oil, olive, virgin', 'Oil, palm', 'Oil, palm kernel', 'Oil, rapeseed', 'Oil, sesame', 'Oil, soybean', 'Oil, sunflower']

# Global Pricing Data

In [236]:
price = pd.read_csv(os.path.join(DATA_DIR, 'palm oil prices 021020 - 290422.csv'))
price.columns = ['date', 'price']
price['price'] = price['price'].ffill()
fig = px.line(price, x='date', y='price')
fig.update_layout(title_text="<b>Global Palm Oil Price</b>", title_font_size=40, legend_font_size=20, width=800, height=400)
fig.show()

# Conclusions:
The price of RBD palm olein FOB Malaysia increased from USD 528 on 7 May 2020 to a peak of USD 1,946 on 1 September 2022, an increase of roughly 270% (about 3.7 times the starting price).
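The size of the move can be checked directly from the two quoted prices. A minimal sketch, where the helper name `pct_increase` is illustrative:

```python
def pct_increase(start, end):
    """Percentage increase from `start` to `end`."""
    return (end - start) / start * 100

start_price = 528    # USD/mt, starting price quoted above
peak_price = 1946    # USD/mt, peak price quoted above

increase = pct_increase(start_price, peak_price)
multiple = peak_price / start_price
print(f"increase: {increase:.0f}%  multiple: {multiple:.2f}x")
```

A price that ends at about 3.7 times its starting level has increased by about 270%, since the first 100% is the starting price itself.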

# Global Production Data

In [237]:
volume_df = pd.read_csv(os.path.join(DATA_DIR, 'Production_Crops_Livestock_E_All_Data.csv'), encoding='ISO-8859-1')

volume_df = volume_df[volume_df['Element'] == 'Production']
volume_df = volume_df[~volume_df['Area Code'].isin(area_drop_list)]
volume_df = volume_df.drop(['Item Code', 'Area', 'Area Code', 'Element Code', 'Element', 'Unit'], axis=1)


volume_df.columns = volume_df.columns.str.replace('Y', '')
volume_df = volume_df.loc[:,~volume_df.columns.str.endswith('F')]

volume_df = volume_df.groupby(['Item']).sum()
volume_df = volume_df[volume_df.index.isin(oil_crops)].transpose()
# after transposing, the years form the index: name it and move it into a column
volume_df.index.name = "Year"
volume_df = volume_df.reset_index(drop=False)

volume_df = volume_df.rename(columns={'Oil, coconut (copra)': 'coconut', 'Oil, cottonseed': 'cottonseed', 'Oil, groundnut': 'groundnut', 'Oil, linseed': 'linseed', 'Oil, maize': 'maize', 'Oil, olive, virgin': 'olive', 'Oil, palm': 'palm', 'Oil, palm kernel': 'palm kernel', 'Oil, rapeseed': 'rapeseed', 'Oil, sesame': 'sesame', 'Oil, soybean': 'soybean', 'Oil, sunflower': 'sunflower'})

volume_df = volume_df[volume_df.Year != '2020'].copy()
volume_df

Item,Year,coconut,cottonseed,groundnut,linseed,maize,olive,palm,palm kernel,rapeseed,sesame,soybean,sunflower
0,1961,2739433.0,3596745.0,4281323.0,1451646.0,641099.0,2468065.0,1798920.0,678573.0,1798668.0,671017.0,5597224.0,2253415.0
1,1962,3428045.0,3727521.0,4442355.0,1682264.0,678591.0,1690084.0,1809568.0,687505.0,1914866.0,815141.0,6119198.0,2663020.0
2,1963,3361038.0,4024319.0,4841528.0,1623127.0,692458.0,3327269.0,1894481.0,627261.0,1868024.0,815803.0,6511223.0,2781423.0
3,1964,3196247.0,4280103.0,5052844.0,1621186.0,731832.0,1551145.0,1946783.0,685158.0,1774766.0,828957.0,6424453.0,2684406.0
4,1965,3261468.0,4480258.0,4448143.0,1807789.0,792634.0,2279575.0,1980316.0,717691.0,2550553.0,822099.0,6948570.0,3496456.0
5,1966,3605218.0,4522104.0,4688294.0,1508791.0,795928.0,2294783.0,2147166.0,724729.0,2535416.0,777030.0,7870776.0,3586534.0
6,1967,3325858.0,3803959.0,5188633.0,1331200.0,802822.0,2463330.0,2206307.0,593207.0,2688876.0,828556.0,8375872.0,4158174.0
7,1968,3273892.0,3732499.0,4595862.0,1313786.0,809758.0,2652178.0,2456927.0,595207.0,2971300.0,875346.0,8361924.0,4306140.0
8,1969,3254430.0,4166902.0,4805574.0,1339067.0,846927.0,2383160.0,2656521.0,687676.0,2762883.0,870910.0,9103723.0,4295300.0
9,1970,3475054.0,3997412.0,5511103.0,1755617.0,872586.0,2534900.0,2766389.0,679333.0,3063131.0,998848.0,11218382.0,4430620.0
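The cleaning cell above relies on the FAO bulk layout, where each year has a value column (`Y1961`) and a flag column (`Y1961F`). Stripping every `Y` and dropping every column ending in `F` works for this file, but it would also affect any other column name containing those letters. The sketch below shows a stricter selection on a tiny made-up frame; the values and flags are invented for illustration only.

```python
import pandas as pd

# Tiny made-up frame in the FAO bulk-download layout: one value column
# (Y<year>) and one flag column (Y<year>F) per year.
raw = pd.DataFrame({
    "Item": ["Oil, palm", "Oil, soybean"],
    "Y1961": [10.0, 20.0],
    "Y1961F": ["A", "A"],
    "Y1962": [11.0, 22.0],
    "Y1962F": ["A", "E"],
})

# Keep only columns that are exactly 'Y' followed by four digits.
year_cols = [c for c in raw.columns
             if len(c) == 5 and c[0] == "Y" and c[1:].isdigit()]
clean = raw[["Item"] + year_cols].rename(columns={c: c[1:] for c in year_cols})

# One row per year, one column per oil, as in volume_df above.
by_year = clean.groupby("Item").sum().transpose()
by_year.index.name = "Year"
by_year = by_year.reset_index()
print(by_year)
```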


In [238]:
veg_oil_prodn_fig = px.area(volume_df, x='Year', y=volume_df.columns[1:])
veg_oil_prodn_fig.update_traces(textfont_size=16, hovertemplate=None)
veg_oil_prodn_fig.update_layout(hovermode="x")

veg_oil_prodn_fig.update_layout(
    title_text="<b>Global Vegetable Oil Production</b>",
    title_font_size=40,
    legend_font_size=20,
    width=1000,
    height=700
)

veg_oil_prodn_fig.update_xaxes(
    title_text="<b>Year</b>",
    title_font=dict(size=30, family='Verdana', color='white'),
    tickfont=dict(family='Calibri', color='white', size=25)
)

veg_oil_prodn_fig.update_yaxes(
    title_text="<b>Production (mt)</b>",
    title_font=dict(size=30, family='Verdana', color='white'),
    tickfont=dict(family='Calibri', color='white', size=25))

veg_oil_prodn_fig.show()

# Conclusions:
- Palm oil has grown to be the largest contributor to the global supply of edible oils.
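One way to check this kind of claim numerically is to convert production into shares of the yearly total and ask which oil leads in each year. The sketch below uses two rows and four oils copied from the `volume_df` output above. Those rows cover only the 1960s, when soybean led among these four, so it would need to be run on the full `volume_df` to test the claim for recent years.

```python
import pandas as pd

# Two rows copied from the volume_df output above, four oils only.
sample = pd.DataFrame({
    "Year": ["1961", "1970"],
    "coconut": [2739433.0, 3475054.0],
    "palm": [1798920.0, 2766389.0],
    "soybean": [5597224.0, 11218382.0],
    "sunflower": [2253415.0, 4430620.0],
}).set_index("Year")

# Each oil's share of the row total, and the largest oil per year.
shares = sample.div(sample.sum(axis=1), axis=0)
largest = shares.idxmax(axis=1)
print(shares.round(3))
print(largest)
```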


# Palm Oil Production

In [239]:
palm_oil_production = volume_df[["Year", "palm"]]  # new df
palm_oil_prodn_fig = px.line(palm_oil_production, x="Year", y="palm")

palm_oil_prodn_fig.update_layout(
    title_text="<b>Global Palm Oil Production</b>",
    title_font_size=40,
    legend_font_size=20,
    width=1400,
    height=1000
)

palm_oil_prodn_fig.update_xaxes(
    title_text="Year",
    title_font=dict(size=30, family='Verdana', color='white'),
    tickfont=dict(family='Calibri', color='white', size=25)
)

palm_oil_prodn_fig.update_yaxes(
    title_text="<b>Palm Oil production (mt)</b>",
    title_font=dict(size=30, family='Verdana', color='white'),
    tickfont=dict(family='Calibri', color='white', size=25)
)

palm_oil_prodn_fig.show()

# Conclusions:
Production of palm oil has increased rapidly to keep up with growing demand.
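"Rapidly" can be quantified with a compound annual growth rate (CAGR). The sketch below applies it to the two palm figures visible in the `volume_df` output above (1961 and 1970); the same function could be applied to any pair of years in the full table.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two observations `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1

# Palm oil production copied from the volume_df output above.
palm_1961 = 1798920.0
palm_1970 = 2766389.0

growth = cagr(palm_1961, palm_1970, years=1970 - 1961)
print(f"1961-1970 CAGR: {growth:.1%}")
```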

In [240]:
production_df = pd.read_csv(os.path.join(DATA_DIR, 'Production_Crops_Livestock_E_All_Data.csv'), encoding='ISO-8859-1')
production_df = production_df[production_df['Element'] == 'Production']
production_df = production_df[~production_df['Area Code'].isin(area_drop_list)]
production_df = production_df.drop(['Area Code', 'Item Code', 'Element Code', 'Element', 'Unit'], axis=1)

production_df.columns = production_df.columns.str.replace('Y', '')
production_df = production_df.loc[:,~production_df.columns.str.endswith('F')]

production_df = production_df[production_df['Item'].isin(oil_crops)].reset_index(drop=True)

production_df['Code'] = production_df['Area'].str[:3]
production_df['Code'] = production_df['Code'].str.upper()
code = production_df.pop('Code')  # pop the code column
production_df.insert(1, 'Code', code)  # move it to the position we want
temp_df = production_df.melt(id_vars=production_df.columns[:3], value_vars=production_df.columns[3:])
temp_df.rename(columns={'variable': 'Year'}, inplace=True)
new_production_df = temp_df.pivot_table(index=['Area', 'Code', 'Year'], columns=['Item'], values='value', aggfunc='sum').reset_index()
production_df = new_production_df.rename(columns={'Area': 'Entity'})  # fix this
production_df = production_df[production_df.Year != '2020'].copy()

production_df = production_df.rename(columns={'Oil, coconut (copra)': 'coconut', 'Oil, cottonseed': 'cottonseed', 'Oil, groundnut': 'groundnut', 'Oil, linseed': 'linseed', 'Oil, maize': 'maize', 'Oil, olive, virgin': 'olive', 'Oil, palm': 'palm', 'Oil, palm kernel': 'palm kernel', 'Oil, rapeseed': 'rapeseed', 'Oil, sesame': 'sesame', 'Oil, soybean': 'soybean', 'Oil, sunflower': 'sunflower'})

production_df

Item,Entity,Code,Year,coconut,cottonseed,groundnut,linseed,maize,olive,palm,palm kernel,rapeseed,sesame,soybean,sunflower
0,Afghanistan,AFG,1961,,4997.0,,3531.0,,82.0,,,,2253.0,,2938.0
1,Afghanistan,AFG,1962,,7716.0,,3701.0,,90.0,,,,1876.0,,3138.0
2,Afghanistan,AFG,1963,,11742.0,,2857.0,,82.0,,,,1831.0,,3138.0
3,Afghanistan,AFG,1964,,7960.0,,3377.0,,90.0,,,,2722.0,,3138.0
4,Afghanistan,AFG,1965,,7926.0,,4327.0,,82.0,,,,2821.0,,3238.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12354,Zimbabwe,ZIM,2015,0.0,12400.0,8100.0,0.0,14000.0,0.0,,0.0,0.0,0.0,10400.0,2000.0
12355,Zimbabwe,ZIM,2016,0.0,6300.0,7000.0,0.0,13500.0,0.0,,0.0,0.0,0.0,9600.0,1400.0
12356,Zimbabwe,ZIM,2017,0.0,9300.0,7300.0,0.0,14700.0,0.0,,0.0,0.0,0.0,14400.0,1500.0
12357,Zimbabwe,ZIM,2018,0.0,10200.0,10100.0,0.0,14898.0,0.0,,0.0,0.0,0.0,11700.0,1800.0


In [241]:
oil_type = 'palm'
geo_fig = px.choropleth(
    production_df,
    locations='Code',
    color=oil_type,
    color_continuous_scale=px.colors.diverging.PiYG,
    locationmode='ISO-3',
    animation_frame='Year',
    projection='natural earth'
)
geo_fig.update_layout(title_text=f'{oil_type}')

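geo_fig.show()  # known issue: 'Code' is the first three letters of the country name, not a real ISO-3 code, so many countries are misplaced or missing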
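The map is wrong because `locationmode='ISO-3'` expects ISO 3166-1 alpha-3 codes, while `Code` is built by slicing the first three letters of the country name. Zimbabwe, for example, becomes `ZIM` rather than `ZWE`. A minimal sketch of a lookup-based alternative follows; the `ISO3_BY_NAME` dictionary is hypothetical and deliberately tiny, and "Atlantis" is a placeholder for an area name with no match.

```python
import pandas as pd

# Hypothetical, deliberately tiny lookup; a real one would cover every
# FAO area name (for example, built from an ISO 3166 reference table).
ISO3_BY_NAME = {
    "Afghanistan": "AFG",
    "Indonesia": "IDN",
    "Malaysia": "MYS",
    "Zimbabwe": "ZWE",
}

df = pd.DataFrame(
    {"Entity": ["Afghanistan", "Indonesia", "Malaysia", "Zimbabwe", "Atlantis"]}
)

# Slicing the name matches a real code only by coincidence (AFG).
df["Sliced"] = df["Entity"].str[:3].str.upper()
# Looking the name up gives real ISO-3 codes and NaN for unknown names.
df["Code"] = df["Entity"].map(ISO3_BY_NAME)

# Names that still need a manual mapping before plotting.
unmatched = df.loc[df["Code"].isna(), "Entity"].tolist()
print(df)
print("unmatched:", unmatched)
```

Listing the unmatched names makes it easy to see which FAO area labels still need to be mapped by hand before the choropleth can be trusted.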

# Conclusions:
The bulk of palm oil cultivation and production occurs in SE Asia, primarily in Malaysia and Indonesia. Malaysia was the early dominant player but has been overtaken by Indonesia in more recent times.

Both countries are dependent on available land, which is finite, for further growth. As a result, major oil growers from those countries are now exporting cultivation and production to new countries with suitable land.

# Export Data

In [242]:
export_df = pd.read_csv(os.path.join(DATA_DIR, 'Trade_CropsLivestock_E_All_Data.csv'), encoding='ISO-8859-1')
export_df = export_df[export_df['Element'] == 'Export Quantity']
export_df = export_df[~export_df['Area Code'].isin(area_drop_list)]
export_df = export_df.drop(['Item Code', 'Area Code', 'Element Code', 'Element', 'Unit'], axis=1)

export_df.columns = export_df.columns.str.replace('Y', '')
export_df = export_df.loc[:,~export_df.columns.str.endswith('F')]

export_df = export_df[export_df['Item'].isin(oil_crops)].reset_index(drop=True)

export_df['Code'] = export_df['Area'].str[:3]
export_df['Code'] = export_df['Code'].str.upper()
code = export_df.pop('Code')  # pop the code column
export_df.insert(1, 'Code', code)  # move it to the position we want

temp_df = export_df.melt(id_vars=export_df.columns[:3], value_vars=export_df.columns[3:])
temp_df.rename(columns={'variable': 'Year'}, inplace=True)
export_df = temp_df.pivot_table(index=['Area', 'Code', 'Year'], columns=['Item'], values='value', aggfunc='sum').reset_index()

export_df = export_df.rename(columns={'Area': 'Entity', 'Oil, coconut (copra)': 'coconut', 'Oil, cottonseed': 'cottonseed', 'Oil, groundnut': 'groundnut', 'Oil, linseed': 'linseed', 'Oil, maize': 'maize', 'Oil, olive, virgin': 'olive', 'Oil, palm': 'palm', 'Oil, palm kernel': 'palm kernel', 'Oil, rapeseed': 'rapeseed', 'Oil, sesame': 'sesame', 'Oil, soybean': 'soybean', 'Oil, sunflower': 'sunflower', 'Oil, castor beans': 'castor oil', 'Oil, citronella': 'citronella', 'Oil, essential nes': 'essential', 'Oil, olive residues': 'other olive'})

export_df = export_df[export_df.Year != '2020'].copy()

export_df

Item,Entity,Code,Year,castor oil,citronella,coconut,cottonseed,essential,groundnut,linseed,maize,other olive,olive,palm,palm kernel,rapeseed,sesame,soybean,sunflower
0,Afghanistan,AFG,1961,,0.0,0.0,,0.0,,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,,0.0
1,Afghanistan,AFG,1962,,0.0,0.0,,0.0,,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,,0.0
2,Afghanistan,AFG,1963,,0.0,0.0,,0.0,,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,,0.0
3,Afghanistan,AFG,1964,,0.0,0.0,,0.0,,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,,0.0
4,Afghanistan,AFG,1965,,0.0,0.0,,0.0,,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14874,Zimbabwe,ZIM,2015,0.0,0.0,0.0,0.0,62.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14875,Zimbabwe,ZIM,2016,0.0,0.0,0.0,0.0,25.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,31.0,1.0
14876,Zimbabwe,ZIM,2017,0.0,0.0,0.0,0.0,58.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1118.0,0.0
14877,Zimbabwe,ZIM,2018,0.0,0.0,0.0,0.0,59.0,0.0,2.0,0.0,0.0,1.0,229.0,0.0,0.0,0.0,320.0,9.0


In [243]:
oil_type = 'palm'
geo_fig = px.choropleth(
    export_df,
    locations='Code',
    color=oil_type,
    color_continuous_scale=px.colors.diverging.PiYG,
    locationmode='ISO-3',
    animation_frame='Year',
    projection='natural earth'
)

geo_fig.update_layout(title_text=f'{oil_type}')

geo_fig.show()  # not right: 'Code' is the first three letters of the country name, not a real ISO-3 code

# Conclusions:
Unsurprisingly, export volumes closely resemble production volumes, as the major growing nations export most of the oil they grow and refine.


# Imports

In [244]:
import_df = pd.read_csv(os.path.join(DATA_DIR, 'Trade_CropsLivestock_E_All_Data.csv'), encoding='ISO-8859-1')
import_df = import_df[import_df['Element'] == 'Import Quantity']
import_df = import_df[~import_df['Area Code'].isin(area_drop_list)]
import_df = import_df.drop(['Area Code', 'Item Code', 'Element Code', 'Element', 'Unit'], axis=1)

import_df.columns = import_df.columns.str.replace('Y', '')
import_df = import_df.loc[:,~import_df.columns.str.endswith('F')]

import_df = import_df[import_df['Item'].isin(oil_crops)].reset_index(drop=True)

import_df['Code'] = import_df['Area'].str[:3]
import_df['Code'] = import_df['Code'].str.upper()
code = import_df.pop('Code')  # pop the code column
import_df.insert(1, 'Code', code)  # move it to the position we want

temp_df = import_df.melt(id_vars=import_df.columns[:3], value_vars=import_df.columns[3:])
temp_df.rename(columns={'variable': 'Year'}, inplace=True)
import_df = temp_df.pivot_table(index=['Area', 'Code', 'Year'], columns=['Item'], values='value', aggfunc='sum').reset_index()

import_df = import_df.rename(columns={'Area': 'Entity', 'Oil, coconut (copra)': 'coconut', 'Oil, cottonseed': 'cottonseed', 'Oil, groundnut': 'groundnut', 'Oil, linseed': 'linseed', 'Oil, maize': 'maize', 'Oil, olive, virgin': 'olive', 'Oil, palm': 'palm', 'Oil, palm kernel': 'palm kernel', 'Oil, rapeseed': 'rapeseed', 'Oil, sesame': 'sesame', 'Oil, soybean': 'soybean', 'Oil, sunflower': 'sunflower', 'Oil, castor beans': 'castor oil', 'Oil, citronella': 'citronella', 'Oil, essential nes': 'essential', 'Oil, olive residues': 'other olive'})

import_df = import_df[import_df.Year != '2020'].copy()

import_df

Item,Entity,Code,Year,castor oil,citronella,coconut,cottonseed,essential,groundnut,linseed,maize,other olive,olive,palm,palm kernel,rapeseed,sesame,soybean,sunflower
0,Afghanistan,AFG,1961,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,319.0,0.0
1,Afghanistan,AFG,1962,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,300.0,0.0
2,Afghanistan,AFG,1963,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,400.0,0.0
3,Afghanistan,AFG,1964,0.0,,0.0,0.0,0.0,10.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,600.0,0.0
4,Afghanistan,AFG,1965,0.0,,0.0,0.0,0.0,5.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,958.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14874,Zimbabwe,ZIM,2015,4.0,,49.0,0.0,142.0,0.0,27.0,64.0,5.0,182.0,4670.0,60.0,10.0,0.0,104655.0,5036.0
14875,Zimbabwe,ZIM,2016,4.0,,111.0,0.0,191.0,0.0,14.0,0.0,0.0,177.0,7689.0,0.0,126.0,1.0,121158.0,2308.0
14876,Zimbabwe,ZIM,2017,2.0,,37.0,0.0,189.0,0.0,25.0,0.0,0.0,156.0,7901.0,34.0,215.0,0.0,96173.0,1663.0
14877,Zimbabwe,ZIM,2018,6.0,,81.0,344.0,158.0,0.0,21.0,0.0,0.0,167.0,6883.0,297.0,70.0,0.0,121304.0,950.0
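The production, export and import cells repeat the same filter, melt and pivot steps and differ only in the file and the `Element` value. One possible refactor is a single helper, sketched below. The function name `load_fao_element` and its parameters are hypothetical, and the demo data is made up purely to exercise the function.

```python
import os
import tempfile

import pandas as pd

def load_fao_element(csv_path, element, items, drop_area_codes=()):
    """Return one row per (Area, Year) with one column per item, for a
    single FAO element such as 'Production' or 'Import Quantity'."""
    df = pd.read_csv(csv_path, encoding="ISO-8859-1")
    df = df[df["Element"] == element]
    df = df[~df["Area Code"].isin(drop_area_codes)]
    df = df[df["Item"].isin(items)]

    # Value columns are 'Y' + four digits; flag columns ('Y1961F') are skipped.
    year_cols = [c for c in df.columns
                 if len(c) == 5 and c[0] == "Y" and c[1:].isdigit()]
    long_df = df.melt(id_vars=["Area", "Item"], value_vars=year_cols,
                      var_name="Year", value_name="value")
    long_df["Year"] = long_df["Year"].str[1:]

    wide = long_df.pivot_table(index=["Area", "Year"], columns="Item",
                               values="value", aggfunc="sum")
    return wide.reset_index()

# Made-up rows in the FAO layout, including one regional aggregate (5000).
demo = pd.DataFrame({
    "Area Code": [2, 2, 5000],
    "Area": ["Afghanistan", "Afghanistan", "World"],
    "Item": ["Oil, palm", "Oil, soybean", "Oil, palm"],
    "Element": ["Import Quantity"] * 3,
    "Y2017": [1.0, 2.0, 100.0],
    "Y2017F": ["A", "A", "A"],
    "Y2018": [3.0, 4.0, 200.0],
    "Y2018F": ["A", "A", "A"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    demo.to_csv(path, index=False)
    imports = load_fao_element(path, "Import Quantity",
                               ["Oil, palm", "Oil, soybean"],
                               drop_area_codes=[5000])
print(imports)
```

With a helper like this, the three datasets could each be loaded with one call, and a fix to the cleaning logic would only need to be made once.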


In [245]:
oil_type = 'palm'
geo_fig = px.choropleth(
    import_df,
    locations='Code',
    color=oil_type,
    color_continuous_scale=px.colors.diverging.PiYG,
    locationmode='ISO-3',
    animation_frame='Year',
    projection='natural earth'
)

geo_fig.update_layout(title_text=f'{oil_type}')

geo_fig.show()  # fix data type

# Conclusions:
Unsurprisingly, the growth in demand is driven by the two most populous countries in the world, India and China, in both cases likely as a result of population growth and economic growth.

# Summary: The palm oil and edible oil markets:
- Demand for edible oil has grown enormously over the last 60 years, driven by population growth and economic development.
- Production and consumption of palm oil have increased more than those of other edible oils over the period, due to its favourable production costs, relatively low price and high yield per hectare.
- Palm olein/oil now accounts for the largest share of edible oil production and consumption, and this is likely to remain the case.


# Hypothesis:
**Long term** prices are driven by structural supply and demand factors.

**Short term** prices are heavily influenced by information, politics, sentiment and unexpected events.

On the supply side, over the **long term** I expect to see a relationship between price and
- the total land area under cultivation
- changing production efficiencies per hectare

On the supply side over the **short term** I expect to see fluctuations driven by:
- weather
- abnormal events: war, pandemic, etc
- government export policy

On the **demand** side over the **long term** I expect to see a relationship between price and
- population growth in key import markets: China & India in particular.
- economic development (GDP growth of the total economy and per capita) in key import markets: China & India

On the **demand** side over the **short term** I expect to see a relationship between price and
- seasonality around harvests, holidays and other events that the market expects to affect price.
- prices of substitute edible oils
- market sentiment
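The long-term versus short-term split in this hypothesis can be explored by separating a price series into a smoothed trend and the remainder. The sketch below uses a trailing moving average for the trend, which is only one of several possible choices. The prices and the window length are made up for illustration.

```python
import pandas as pd

# Made-up daily prices (USD/mt) standing in for the real series.
prices = pd.Series(
    [700, 705, 698, 720, 735, 730, 760, 790, 770, 800, 830, 815],
    index=pd.date_range("2021-01-01", periods=12, freq="D"),
    name="price",
    dtype="float64",
)

WINDOW = 4  # hypothetical smoothing window, in days

# Long-term component: trailing moving average.
trend = prices.rolling(WINDOW).mean()
# Short-term component: what is left after removing the trend.
residual = prices - trend
# A simple volatility measure: rolling std of daily percentage changes.
volatility = prices.pct_change().rolling(WINDOW).std()

summary = pd.DataFrame({"price": prices, "trend": trend,
                        "residual": residual, "volatility": volatility})
print(summary.round(3))
```

The trend column is the natural candidate to compare against structural drivers such as cultivated area or population, while the residual and volatility columns are where weather, policy and sentiment effects would be expected to show up.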

# Technical Objectives of the project
- Build a pipeline to automatically collect and clean data from a mixture of sources.
- Aggregate and save the data to a MariaDB database hosted on AWS.
- analyse the data to understand long term and short term trends in the palm oil market.
- Build a model to accurately infer/predict the price of palm oil in the long-, mid-, and short-term time horizons
- Based on predictions, improve decision-making regarding whether to buy or hold at spot, or to buy at the forward pricing.
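As a rough illustration of the storage step, the sketch below writes daily prices to a table and re-sends one date with a revised value. SQLite from the standard library stands in for MariaDB so the example runs anywhere. The table name, column names and prices are hypothetical, and MariaDB would need its own upsert syntax (for example `INSERT ... ON DUPLICATE KEY UPDATE`) and a real connection.

```python
import sqlite3

# SQLite stands in for MariaDB here; names and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS palm_price (
        price_date TEXT PRIMARY KEY,   -- ISO date, one row per day
        price_usd  REAL NOT NULL       -- USD per mt
    )
    """
)

def upsert_prices(connection, rows):
    """Insert (date, price) rows, overwriting any existing row for a date."""
    connection.executemany(
        "INSERT OR REPLACE INTO palm_price (price_date, price_usd) VALUES (?, ?)",
        rows,
    )
    connection.commit()

# Made-up values; the second batch re-sends one date with a revised price.
upsert_prices(conn, [("2021-01-04", 950.0), ("2021-01-05", 955.5)])
upsert_prices(conn, [("2021-01-05", 957.0), ("2021-01-06", 960.0)])

stored = conn.execute(
    "SELECT price_date, price_usd FROM palm_price ORDER BY price_date"
).fetchall()
print(stored)
```

Keying the table on the date means a daily scraper can be re-run safely without creating duplicate rows.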

# Next steps
1) Build a data pipeline to gather the supplementary information needed to complete this project, scraping it from the web and updating it daily: palm oil prices, edible oil prices, population, GDP growth, etc.
2) Perform time series analysis on the aggregated data to identify features that add power to the model.
3) Perform feature selection and engineering, and transform the data in preparation for step 4.
4) Test a range of statistical, ML, DL and ensemble approaches to time series forecasting.
5) Perform forward and backward testing to validate the models and choose which modelling and forecasting approach to take.
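The validation in step 5 could take the form of a walk-forward backtest, in which each forecast uses only the data available before the day being predicted. The sketch below compares two trivial baselines on made-up prices. The function names are illustrative, and any real model would be plugged in as another `forecast_fn`.

```python
def walk_forward_mae(series, forecast_fn, min_train=3):
    """Expanding-window backtest: at each step, forecast series[t] from
    series[:t] and return the mean absolute error over all steps."""
    errors = []
    for t in range(min_train, len(series)):
        history = series[:t]
        prediction = forecast_fn(history)
        errors.append(abs(series[t] - prediction))
    return sum(errors) / len(errors)

def naive_last_value(history):
    """Baseline: the next price equals the latest observed price."""
    return history[-1]

def mean_of_history(history):
    """Baseline: the next price equals the average price so far."""
    return sum(history) / len(history)

# Made-up upward-trending prices (USD/mt).
prices = [700.0, 710.0, 705.0, 720.0, 735.0, 750.0, 745.0, 770.0]

scores = {
    "naive": walk_forward_mae(prices, naive_last_value),
    "mean": walk_forward_mae(prices, mean_of_history),
}
best = min(scores, key=scores.get)
print(scores, "best:", best)
```

A candidate model is only worth keeping if it beats simple baselines like these on the same walk-forward splits.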

# Goal:
To make optimal purchase strategies and tactical decisions based on sound, reliable, accurate forecasts of price movements.