# Coffee Market Analysis
## Data-Wrangling Notebook

### Matthew Garton - February 2019

**Purpose:** The purpose of this notebook is to acquire my data, inspect it, clean it and prepare it for EDA and modeling.

**Context**: The ultimate goal of my project is to develop trading signals for coffee futures. I will attempt to build a machine learning model which uses fundamental and technical data to predict the direction of future price changes in coffee futures. My expectation at the outset of this project is that my feature matrix will include weather, GDP, and coffee production and export data for major coffee-producing nations; GDP and coffee import data for major coffee-importing nations; and volume, open-interest, and Commitment of Traders data for ICE coffee futures contracts.

Note that many of the decisions made and functions written here came up at various stages of the project, from initial inspection all the way to model-building (as is the non-linear nature of the data science workflow). To keep things clean, I have moved all of the data cleaning/prep (outside of train-test splitting and some feature engineering) to this notebook. The csv file that I output can then be accessed in other notebooks in this repository.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

pd.options.display.max_columns = 1000
pd.options.display.max_rows = 1000
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

## Gathering data

1. Price data - daily OHLC prices (plus Volume and OI) for ICE Coffee 'C' futures.
      
    source: [Wiki Continuous Futures database on Quandl](https://www.quandl.com/data/CHRIS-Wiki-Continuous-Futures)
      
      
2. Weather data - monthly average temperature (Celsius) and rainfall (mm) for the top five coffee-exporting countries (Brazil, Vietnam, Colombia, Indonesia, Ethiopia).
    
    source: [World Bank Climate Change Knowledge Portal](http://sdwebx.worldbank.org/climateportal/index.cfm?page=downscaled_data_download&menu=historical)
    
    
3. Fundamental data - annual data on coffee production, imports, exports, etc. from International Coffee Organization*.

    source: [International Coffee Organization](http://www.ico.org/new_historical.asp?section=Statistics)


4. Positioning data - monthly Commitment of Traders reports from the CFTC.

    source: [Commodity Futures Trading Commission](https://www.cftc.gov/MarketReports/CommitmentsofTraders/HistoricalCompressed/index.htm)
    
*Note: Before getting started here, I did some initial data assembling/cleaning in Excel, so if you choose to get the data directly from the sources listed above, some preparation will be necessary before getting it into the format shown here. The biggest decision I have made so far is how to handle some of the ICO data, which was indexed by 'Crop Year' rather than 'Calendar Year'. My initial solution is to treat the later year of the 'Crop Year' as the relevant 'year' for the data (so Crop Year 1991/1992 is treated as Year 1992, with the understanding that all of the data for the 1991-1992 period would have been available by EOY 1992). For now, this is a simplifying assumption to avoid any 'look-ahead bias.' This might be an oversimplification that I'll have to come back to.
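
The crop-year rule described above can be sketched as follows. This is only an illustration on made-up numbers: the column names `Crop Year` and `Production`, the `YYYY/YYYY` label format, and the choice to stamp each row at December 31 are assumptions, not the layout of the actual ICO files or of `SupplyDemand.csv`.

```python
import pandas as pd

# Made-up sample rows; column names and values are hypothetical
ico = pd.DataFrame({
    "Crop Year": ["1990/1991", "1991/1992", "1992/1993"],
    "Production": [100.0, 110.0, 120.0],
})


def crop_year_to_calendar_year(label: str) -> int:
    """Return the later year of a 'YYYY/YYYY' crop-year label."""
    return int(label.split("/")[-1])


ico["Year"] = ico["Crop Year"].map(crop_year_to_calendar_year)

# Assumption: stamp each row at year end, the point by which the
# crop-year figures are taken to be known to the market
ico["Date"] = pd.to_datetime(ico["Year"].astype(str) + "-12-31")
ico = ico.set_index("Date")

print(ico[["Crop Year", "Year", "Production"]])
```

Dating each row at the end of the later year keeps the figure out of the feature matrix until the whole crop year has passed, which is the conservative side of the look-ahead trade-off.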

In [2]:
# import Daily ICE Coffee 'C' Futures price data
coffee = pd.read_csv('../data/CHRIS-ICE_KC1.csv')

# import Monthly Weather data for major coffee producing countries
weather = pd.read_csv('../data/Weather.csv')

# import Annual fundamental (Production, Exports, Imports, etc.) data
fundamental = pd.read_csv('../data/SupplyDemand.csv')

# import Monthly Commitment of Traders report data
cot = pd.read_csv('../data/CommitmentOfTraders.csv')

In [3]:
# Quick fix: strip the leading space from the ' Country' column name
weather.rename(columns={' Country':'Country'}, inplace=True)
weather.head()

,Date,Country,Temperature (Monthly – C),Precip (mm)
0,01/31/91,BRA,25.643,260.878
1,02/28/91,BRA,25.9575,193.859
2,03/31/91,BRA,25.6557,238.866
3,04/30/91,BRA,25.3129,194.848
4,05/31/91,BRA,24.791,119.09


In [4]:
# For each dataframe, parse Date as a datetime and use it as the index
dfs = [coffee, weather, fundamental, cot]
for df in dfs:
    df['Date'] = pd.to_datetime(df['Date'])
    df.set_index('Date', inplace=True)   
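
The weather file stores dates as two-digit-year strings such as `01/31/91`, and the loop above leaves it to `pd.to_datetime` to infer the format. Passing an explicit `format` is one way to make the parsing unambiguous. A minimal sketch on made-up strings (not the real file), assuming every date follows month/day/two-digit-year:

```python
import pandas as pd

# Made-up date strings in the same style as the weather file
raw_dates = pd.Series(["01/31/91", "02/28/91", "03/31/05"])

# %y follows the usual strptime convention:
# 69-99 map to 1969-1999 and 00-68 map to 2000-2068
parsed = pd.to_datetime(raw_dates, format="%m/%d/%y")

print(parsed.dt.year.tolist())
```

The other three files may use different date layouts, so a single explicit format would not necessarily work inside the shared loop.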

In [5]:
countries = ['BRA', 'COL', 'ETH', 'IDN', 'VNM']

dfs = []
for country in countries:
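    # .copy() avoids modifying a view of the filtered slice
    df = weather[weather['Country'] == country].copy()
    # Rename columns only; index=str would turn the DatetimeIndex into strings
    # and break date alignment with the other dataframes
    df.rename(columns={'Temperature (Monthly – C)':'{}_Temp'.format(country),
                       'Precip (mm)':'{}_Precip'.format(country)}, inplace=True)
    df.drop(columns=['Country'], inplace=True)
    dfs.append(df)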

In [6]:
weather = dfs[0]

for df in dfs[1:]:
    cols = df.columns.difference(weather.columns)
    weather = weather.merge(df[cols], left_index=True, right_index=True, how='outer')
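
The split-rename-merge steps in the two cells above amount to a long-to-wide reshape. As an alternative, the same shape can be sketched with `DataFrame.pivot`. The frame below is made-up toy data with shortened column names (`Temp`, `Precip`), not the real weather file:

```python
import pandas as pd

# Toy long-format weather data: one row per (date, country)
long_weather = pd.DataFrame({
    "Date": pd.to_datetime(
        ["1991-01-31", "1991-01-31", "1991-02-28", "1991-02-28"]
    ),
    "Country": ["BRA", "COL", "BRA", "COL"],
    "Temp": [25.0, 24.0, 26.0, 23.0],
    "Precip": [200.0, 100.0, 150.0, 120.0],
})

# One row per date, one column per (measure, country) pair
wide = long_weather.pivot(
    index="Date", columns="Country", values=["Temp", "Precip"]
)

# Flatten the (measure, country) MultiIndex into names like 'BRA_Temp'
wide.columns = [f"{country}_{measure}" for measure, country in wide.columns]
wide = wide.sort_index(axis=1)

print(wide)
```

`pivot` raises an error if a (date, country) pair appears more than once, which doubles as a check for duplicate rows in the source file.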

In [7]:
dfs = [coffee, weather, fundamental, cot]

In [8]:
coffee_data = pd.concat(dfs, axis=1)

In [9]:
coffee_data.head()

Date,Open,High,Low,Settle,Change,Wave,Volume,Prev. Day Open Interest,EFP Volume,EFS Volume,Block Volume,BRA_Temp,BRA_Precip,COL_Precip,COL_Temp,ETH_Precip,ETH_Temp,IDN_Precip,IDN_Temp,VNM_Precip,VNM_Temp,Year,Production,Consumption (domestic),Exportable Production,Gross Opening Stocks,Exports,Imports,Re-exports,Inventories,Disappearance,Market_and_Exchange_Names,As_of_Date_In_Form_YYMMDD,CFTC_Contract_Market_Code,CFTC_Market_Code,CFTC_Region_Code,CFTC_Commodity_Code,Open_Interest_All,NonComm_Positions_Long_All,NonComm_Positions_Short_All,NonComm_Postions_Spread_All,Comm_Positions_Long_All,Comm_Positions_Short_All,Tot_Rept_Positions_Long_All,Tot_Rept_Positions_Short_All,NonRept_Positions_Long_All,NonRept_Positions_Short_All,Open_Interest_Old,NonComm_Positions_Long_Old,NonComm_Positions_Short_Old,NonComm_Positions_Spread_Old,Comm_Positions_Long_Old,Comm_Positions_Short_Old,Tot_Rept_Positions_Long_Old,Tot_Rept_Positions_Short_Old,NonRept_Positions_Long_Old,NonRept_Positions_Short_Old,Open_Interest_Other,NonComm_Positions_Long_Other,NonComm_Positions_Short_Other,NonComm_Positions_Spread_Other,Comm_Positions_Long_Other,Comm_Positions_Short_Other,Tot_Rept_Positions_Long_Other,Tot_Rept_Positions_Short_Other,NonRept_Positions_Long_Other,NonRept_Positions_Short_Other,Change_in_Open_Interest_All,Change_in_NonComm_Long_All,Change_in_NonComm_Short_All,Change_in_NonComm_Spead_All,Change_in_Comm_Long_All,Change_in_Comm_Short_All,Change_in_Tot_Rept_Long_All,Change_in_Tot_Rept_Short_All,Change_in_NonRept_Long_All,Change_in_NonRept_Short_All,Pct_of_Open_Interest_All,Pct_of_OI_NonComm_Long_All,Pct_of_OI_NonComm_Short_All,Pct_of_OI_NonComm_Spread_All,Pct_of_OI_Comm_Long_All,Pct_of_OI_Comm_Short_All,Pct_of_OI_Tot_Rept_Long_All,Pct_of_OI_Tot_Rept_Short_All,Pct_of_OI_NonRept_Long_All,Pct_of_OI_NonRept_Short_All,Pct_of_Open_Interest_Old,Pct_of_OI_NonComm_Long_Old,Pct_of_OI_NonComm_Short_Old,Pct_of_OI_NonComm_Spread_Old,Pct_of_OI_Comm_Long_Old,Pct_of_OI_Comm_Short_Old,Pct_of_OI_Tot_Rept_Long_Old,Pct_of_OI_Tot_Rept_Short_Old,Pct_of_OI_NonRept_Long_Old,Pct_of_OI_NonRept_Short_Old,Pct_of_Open_Interest_Other,Pct_of_OI_NonComm_Long_Other,Pct_of_OI_NonComm_Short_Other,Pct_of_OI_NonComm_Spread_Other,Pct_of_OI_Comm_Long_Other,Pct_of_OI_Comm_Short_Other,Pct_of_OI_Tot_Rept_Long_Other,Pct_of_OI_Tot_Rept_Short_Other,Pct_of_OI_NonRept_Long_Other,Pct_of_OI_NonRept_Short_Other,Traders_Tot_All,Traders_NonComm_Long_All,Traders_NonComm_Short_All,Traders_NonComm_Spread_All,Traders_Comm_Long_All,Traders_Comm_Short_All,Traders_Tot_Rept_Long_All,Traders_Tot_Rept_Short_All,Traders_Tot_Old,Traders_NonComm_Long_Old,Traders_NonComm_Short_Old,Traders_NonComm_Spead_Old,Traders_Comm_Long_Old,Traders_Comm_Short_Old,Traders_Tot_Rept_Long_Old,Traders_Tot_Rept_Short_Old,Traders_Tot_Other,Traders_NonComm_Long_Other,Traders_NonComm_Short_Other,Traders_NonComm_Spread_Other,Traders_Comm_Long_Other,Traders_Comm_Short_Other,Traders_Tot_Rept_Long_Other,Traders_Tot_Rept_Short_Other,Conc_Gross_LE_4_TDR_Long_All,Conc_Gross_LE_4_TDR_Short_All,Conc_Gross_LE_8_TDR_Long_All,Conc_Gross_LE_8_TDR_Short_All,Conc_Net_LE_4_TDR_Long_All,Conc_Net_LE_4_TDR_Short_All,Conc_Net_LE_8_TDR_Long_All,Conc_Net_LE_8_TDR_Short_All,Conc_Gross_LE_4_TDR_Long_Old,Conc_Gross_LE_4_TDR_Short_Old,Conc_Gross_LE_8_TDR_Long_Old,Conc_Gross_LE_8_TDR_Short_Old,Conc_Net_LE_4_TDR_Long_Old,Conc_Net_LE_4_TDR_Short_Old,Conc_Net_LE_8_TDR_Long_Old,Conc_Net_LE_8_TDR_Short_Old,Conc_Gross_LE_4_TDR_Long_Other,Conc_Gross_LE_4_TDR_Short_Other,Conc_Gross_LE_8_TDR_Long_Other,Conc_Gross_LE_8_TDR_Short_Other,Conc_Net_LE_4_TDR_Long_Other,Conc_Net_LE_4_TDR_Short_Other,Conc_Net_LE_8_TDR_Long_Other,Conc_Net_LE_8_TDR_Short_Other,Contract_Units
1973-08-20,67.8,68.0,67.3,67.35,,,150.0,1220.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1973-08-21,67.5,67.6,66.6,67.1,,,113.0,1221.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1973-08-22,67.0,67.8,65.5,65.8,,,168.0,1169.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1973-08-23,65.4,67.75,65.4,66.75,,,151.0,1165.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1973-08-24,67.4,67.55,66.4,66.6,,,99.0,1172.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
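
The output above is mostly NaN because the sources start on different dates and arrive at different frequencies. One possible way to handle both issues, sketched here on made-up toy data rather than the real `coffee_data`, is to find the first date on which every column has an observation and then carry the last observation forward. The column choice and values are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2000-01-01", periods=6, freq="D")

# Toy frame: one dense column and two sparse ones, all values made up
combined = pd.DataFrame(
    {
        "Settle": [100.0, 101.0, 102.0, 103.0, 104.0, 105.0],
        "BRA_Temp": [np.nan, 25.0, np.nan, np.nan, 26.0, np.nan],
        "Production": [np.nan, np.nan, 90.0, np.nan, np.nan, np.nan],
    },
    index=dates,
)

# First date on which each column has a value
first_valid = pd.Series(
    {col: combined[col].first_valid_index() for col in combined.columns}
)

# From the latest of those dates onward, every column has been observed
start = first_valid.max()

# Carry the last observation forward, then keep the fully covered range
filled = combined.ffill().loc[start:]

print(first_valid)
print(filled)
```

Forward filling only uses past values, so it does not introduce look-ahead by itself. It does assume that a value stamped on a given date was actually known to the market on that date.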


# Notes

A few things left to work on before I'm ready for basic EDA:

1. I need to deal with the mismatch of available data and narrow my study to the range of dates for which I have all of the data I need. Using decades of coffee prices for which I have no feature data is not helpful.

2. I need to comb through all of the columns that I have, particularly in the Commitment of Traders data. I expect the EDA/Feature Engineering stages will pare things down further, but I can probably eliminate a large number of columns quickly by sorting through the data dictionary. My guess is that many of the CoT features will be highly correlated with each other, so it probably makes more sense to start with a few of the most representative.

3. 'Missing' data - My data differ significantly in frequency, so I need to fill in gaps. As of now, the plan is to fill in missing values by the 'last observation carried forward' method, ensuring that for each point in time, I have the most recent (according to the data I have) entry for each column. I should think through the efficacy of this, versus trying to match up the frequency of my observations better (i.e., do I need daily futures prices for my purposes?). How do I handle the fact that the same data point would represent 'new' information to the market on one day, but would be discounted