# Coffee Market Analysis
## Exploratory Data Analysis Notebook

### Matthew Garton - February 2019

**Purpose:** This notebook performs exploratory data analysis on my coffee dataset. It examines the distributions of the variables and the relationships between them, and tries to determine which variables will be most useful for predicting coffee prices.

**Context:** The ultimate goal of my project is to develop trading signals for coffee futures. I will attempt to build a machine learning model that uses fundamental and technical data to predict the direction of future price changes in coffee futures. At the outset, I expect my feature matrix to include weather, GDP, coffee production, and export data for major coffee-producing nations; GDP and coffee import data for major coffee-importing nations; and volume, open interest, and Commitments of Traders data for ICE coffee futures contracts.

This notebook imports the cleaned dataset, CoffeeDataset, that I prepared in the Data Wrangling Notebook. See '../data/' for the raw data that I started with, or follow the links in the Data Wrangling Notebook to get the data directly from the source.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

pd.options.display.max_columns = 1000
pd.options.display.max_rows = 1000
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

In [2]:
# import the dataset
coffee = pd.read_csv('../data/CoffeeDataset.csv')
coffee['Date'] = pd.to_datetime(coffee['Date'])
coffee.set_index('Date', inplace=True)

In [3]:
coffee.ffill(inplace=True)  # Fill NaNs with last observation carried forward (same behavior as fillna(method='pad'))
coffee.dropna(inplace=True)  # Drop remaining NaNs: rows with no prior observation to carry forward
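
The two-step fill above can be checked on a toy frame. This is a minimal sketch with made-up values: the `toy` frame and its numbers are hypothetical and not taken from CoffeeDataset. It uses `DataFrame.ffill`, which carries the last observation forward in the same way as `fillna(method='pad')`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a daily price plus a sparsely reported series.
toy = pd.DataFrame(
    {
        "Settle": [100.0, 101.0, 102.0, 103.0],
        "Production": [np.nan, 50.0, np.nan, np.nan],
    },
    index=pd.date_range("2019-01-01", periods=4, freq="D"),
)

filled = toy.ffill()        # carry the last known observation forward
cleaned = filled.dropna()   # drop leading rows that had nothing to carry forward

print(cleaned)
```

Only the first row is dropped, because it is the only one with no earlier `Production` value. Forward filling uses past observations only, so it does not pull future information into earlier rows.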

In [4]:
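def get_forward_returns(df, ranges):
    """Add an '<r>D_Return' column to df (in place) for each horizon r in ranges."""
    for r in ranges:
        # Return from the open at row t to the open r rows later, stored at row t
        df["{}D_Return".format(r)] = df['Open'].pct_change(r).shift(-r)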

In [5]:
ranges = [5, 10, 20]
get_forward_returns(coffee, ranges)
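
To make the forward-return alignment concrete, here is a small sketch on a hypothetical price series. The `forward_return` helper and the prices are illustrative assumptions; the helper applies the same `pct_change(r).shift(-r)` expression to a single Series.

```python
import pandas as pd

def forward_return(prices, horizon):
    """Return from each row's price to the price `horizon` rows later."""
    return prices.pct_change(horizon).shift(-horizon)

# Hypothetical opening prices, growing 10% per row.
opens = pd.Series([100.0, 110.0, 121.0, 133.1])

fwd = forward_return(opens, 2)

# Equivalent formulation: future price divided by current price, minus one.
check = opens.shift(-2) / opens - 1

print(fwd)
```

The last `horizon` rows are NaN because there is no future price to compare against, so those rows need to be dropped or masked before the return columns are used as targets.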

In [7]:
coffee.columns

Index(['Open', 'High', 'Low', 'Settle', 'Volume', 'BRA_Temp', 'BRA_Precip',
       'COL_Precip', 'COL_Temp', 'ETH_Precip', 'ETH_Temp', 'IDN_Precip',
       'IDN_Temp', 'VNM_Precip', 'VNM_Temp', 'Production',
       'Consumption (domestic)', 'Exportable Production',
       'Gross Opening Stocks', 'Exports', 'Imports', 'Re-exports',
       'Inventories', 'Disappearance', 'Open_Interest_All',
       'NonComm_Positions_Long_All', 'NonComm_Positions_Short_All',
       'NonComm_Postions_Spread_All', 'Comm_Positions_Long_All',
       'Comm_Positions_Short_All', 'Tot_Rept_Positions_Long_All',
       'Tot_Rept_Positions_Short_All', 'NonRept_Positions_Long_All',
       'NonRept_Positions_Short_All', 'Pct_of_OI_NonComm_Long_All',
       'Pct_of_OI_NonComm_Short_All', 'Pct_of_OI_NonComm_Spread_All',
       'Pct_of_OI_Comm_Long_All', 'Pct_of_OI_Comm_Short_All',
       'Pct_of_OI_Tot_Rept_Long_All', 'Pct_of_OI_Tot_Rept_Short_All',
       'Pct_of_OI_NonRept_Long_All', 'Pct_of_OI_NonRept_Short_All',
 

In [26]:
annual_factors = ['Production',
       'Consumption (domestic)', 'Exportable Production',
       'Gross Opening Stocks', 'Exports', 'Imports', 'Re-exports',
       'Inventories', 'Disappearance']

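for factor in annual_factors:
    # The earlier attempt with pct_change(freq='Y') ended in
    # "ValueError: cannot reindex from a duplicate axis", which pandas raises
    # when it has to reindex over repeated index labels.
    # Assumption: use the last value in each calendar year and compare consecutive years.
    yearly_change = coffee[factor].groupby(coffee.index.year).last().pct_change()
    coffee['{}_YoY_Change'.format(factor)] = coffee.index.year.map(yearly_change).to_numpy()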

In [25]:
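# The earlier call with pct_change(freq='Y').dropna() ended in "ValueError: cannot reindex from a duplicate axis".
# Assumption: year-over-year change by calendar year, kept row-for-row with 20D_Return (regplot drops NaN pairs).
prod_yoy = coffee.index.year.map(coffee['Production'].groupby(coffee.index.year).last().pct_change())
sns.regplot(x=prod_yoy.to_numpy(), y=coffee['20D_Return'].to_numpy())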
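
A `ValueError: cannot reindex from a duplicate axis` is typically raised when pandas has to reindex an object whose index contains repeated labels. The sketch below shows one way to check a date index for duplicates and remove them. The `toy` frame is hypothetical, and keeping the last row per date is an assumption; the right rule depends on how the duplicates arose in the wrangling step.

```python
import pandas as pd

# Hypothetical frame with one repeated date in the index.
idx = pd.to_datetime(["2019-01-02", "2019-01-03", "2019-01-03", "2019-01-04"])
toy = pd.DataFrame({"Open": [100.0, 101.0, 101.5, 102.0]}, index=idx)

has_dupes = toy.index.has_duplicates            # True if any label repeats
dupes = toy[toy.index.duplicated(keep=False)]   # every row that shares a date

# Assumption: keep the last row recorded for each date.
deduped = toy[~toy.index.duplicated(keep="last")]

print(dupes)
print(deduped.index.is_unique)
```

Running the same check on `coffee.index` before any frequency-based operation would show whether duplicated dates are the cause here.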