# Coffee Market Analysis
## Data-Wrangling Notebook

### Matthew Garton - February 2019

**Purpose:** This notebook acquires my data, inspects it, cleans it, and prepares it for EDA and modeling.

**Context**: The ultimate goal of my project is to develop trading signals for coffee futures. I will attempt to build a machine learning model that uses fundamental and technical data to predict the direction of future price changes in coffee futures. At the outset, I expect my feature matrix to include weather, GDP, and coffee production and export data for major coffee-producing nations; GDP and coffee import data for major coffee-importing nations; and volume, open interest, and Commitments of Traders data for ICE coffee futures contracts.

Note that many of the decisions made and functions written here came up at various stages of the project, from initial inspection all the way to model-building, since the data science workflow is rarely linear. To keep things clean, I have moved all of the data cleaning and preparation (apart from train-test splitting and some feature engineering) to this notebook. The CSV file that I output can then be accessed from other notebooks in this repository.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

pd.options.display.max_columns = 1000
pd.options.display.max_rows = 1000
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

## Gathering data

1. Price data (1973-2019) - daily OHLC prices (plus Volume and OI) for ICE Coffee 'C' futures.

    source: [Wiki Continuous Futures database on Quandl](https://www.quandl.com/data/CHRIS-Wiki-Continuous-Futures)
      
      
2. Weather data (1991-2015) - monthly average temperature (Celsius) and rainfall (mm) for the top five coffee-exporting countries (Brazil, Vietnam, Colombia, Indonesia, Ethiopia).
    
    source: [World Bank Climate Change Knowledge Portal](http://sdwebx.worldbank.org/climateportal/index.cfm?page=downscaled_data_download&menu=historical)
    
    
3. Fundamental data (1990-2017) - annual data on coffee production, imports, exports, etc. from the International Coffee Organization*.

    source: [International Coffee Organization](http://www.ico.org/new_historical.asp?section=Statistics)


4. Positioning data (1995-2016) - monthly Commitments of Traders reports from the CFTC.

    source: [Commodity Futures Trading Commission](https://www.cftc.gov/MarketReports/CommitmentsofTraders/HistoricalCompressed/index.htm)
    
*Note: Before getting started here, I did some initial data assembly and cleaning in Excel, so if you choose to get the data directly from the sources listed above, some preparation will be necessary to get it into the format shown here. The biggest decision I have made so far is how to handle the ICO data that is indexed by 'Crop Year' rather than 'Calendar Year'. My initial solution is to treat the later year of the 'Crop Year' as the relevant year for the data (so Crop Year 1991/1992 is treated as Year 1992, on the understanding that all of the data for the 1991-1992 period would have been available by the end of 1992). For now, this is a simplifying assumption to avoid look-ahead bias. It may be an oversimplification that I will have to come back to.
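The crop-year rule described in the note can be sketched in a few lines. This is only an illustration: the helper name `crop_year_to_calendar_year` is hypothetical, and the label formats it accepts ('1991/1992' or '1991/92') are an assumption about how the raw ICO labels might look, since that step was done in Excel and is not shown here.

```python
def crop_year_to_calendar_year(label: str) -> int:
    """Map a crop-year label (hypothetical format) to its later calendar year."""
    start_text, end_text = label.strip().split('/')
    start = int(start_text)

    # a four-digit ending year can be used directly
    if len(end_text) == 4:
        return int(end_text)

    # a two-digit ending year borrows its century from the start year
    end = (start // 100) * 100 + int(end_text)

    # handle a century rollover such as '1999/00'
    if end < start:
        end += 100
    return end


print(crop_year_to_calendar_year('1991/1992'))
print(crop_year_to_calendar_year('1991/92'))
```

Using the later year means a row labelled 1992 only contains information that was available by the end of 1992, which is the point of the assumption.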

In [2]:
# import Daily ICE Coffee 'C' Futures price data
coffee = pd.read_csv('../data/CHRIS-ICE_KC1.csv')

# import Monthly Weather data for major coffee producing countries
weather = pd.read_csv('../data/Weather.csv')

# import Annual fundamental (Production, Exports, Imports, etc.) data
fundamental = pd.read_csv('../data/SupplyDemand.csv')

# import Monthly Commitment of Traders report data
cot = pd.read_csv('../data/CommitmentOfTraders.csv')

In [3]:
# Quick fix: remove the leading space from the ' Country' column name
weather.rename(columns={' Country': 'Country'}, inplace=True)
weather.head()

Unnamed: 0,Date,Country,Temperature (Monthly – C),Precip (mm)
0,01/31/91,BRA,25.643,260.878
1,02/28/91,BRA,25.9575,193.859
2,03/31/91,BRA,25.6557,238.866
3,04/30/91,BRA,25.3129,194.848
4,05/31/91,BRA,24.791,119.09


In [4]:
# For each dataframe, parse Date as a datetime and use it as the index
dfs = [coffee, weather, fundamental, cot]
for df in dfs:
    df['Date'] = pd.to_datetime(df['Date'])
    df.set_index('Date', inplace=True)
    
# coffee and cot are stored newest-first; sort into chronological order
coffee.sort_index(inplace=True)
cot.sort_index(inplace=True)
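As a small, self-contained illustration of the parse, index, and sort steps in the cell above, the sketch below uses a made-up toy frame rather than the project files. The `month/day/two-digit-year` layout mirrors the weather preview, and passing an explicit `format` is my own suggestion for avoiding ambiguous date guessing, not something the cell above does.

```python
import pandas as pd

# toy frame (invented values), newest row first, dates as mm/dd/yy strings
toy = pd.DataFrame({
    'Date': ['03/31/91', '02/28/91', '01/31/91'],
    'Value': [3.0, 2.0, 1.0],
})

# an explicit format removes any day/month ambiguity when parsing
toy['Date'] = pd.to_datetime(toy['Date'], format='%m/%d/%y')

# index by date, then put the rows in chronological order
toy = toy.set_index('Date').sort_index()

print(toy)
```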

In [5]:
# For weather data, what I want is one row per observation, with each country's
# data represented in columns of that row

countries = ['BRA', 'COL', 'ETH', 'IDN', 'VNM']

# split weather into dfs for each country and rename columns appropriately
dfs = []
for country in countries:
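    # copy so the slice can be modified without a SettingWithCopyWarning
    df = weather[weather['Country'] == country].copy()
    # rename columns only; passing index=str would turn the datetime index into strings
    df.rename(columns={'Temperature (Monthly – C)':'{}_Temp'.format(country),
                       'Precip (mm)':'{}_Precip'.format(country)}, inplace=True)
    df.drop(columns=['Country'], inplace=True)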
    dfs.append(df)

# combine separate countries' weather data into one frame indexed by date
weather = dfs[0]

for df in dfs[1:]:
    cols = df.columns.difference(weather.columns)
    weather = weather.merge(df[cols], left_index=True, right_index=True, how='outer')
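The same long-to-wide reshape could also be expressed with `DataFrame.pivot`. The sketch below is an alternative to the loop above, not the method this notebook uses, and it runs on a made-up toy frame with invented values and shortened column names (`Temp`, `Precip`).

```python
import pandas as pd

# toy long-format weather data (invented values): one row per date and country
long_weather = pd.DataFrame({
    'Date': pd.to_datetime(['1991-01-31', '1991-01-31',
                            '1991-02-28', '1991-02-28']),
    'Country': ['BRA', 'COL', 'BRA', 'COL'],
    'Temp': [25.6, 23.1, 26.0, 23.4],
    'Precip': [260.9, 80.2, 193.9, 95.7],
})

# one row per date, one column per (measure, country) pair
wide = long_weather.pivot(index='Date', columns='Country',
                          values=['Temp', 'Precip'])

# flatten the two-level column labels, e.g. ('Temp', 'BRA') -> 'BRA_Temp'
wide.columns = ['{}_{}'.format(country, measure)
                for measure, country in wide.columns]

print(wide)
```

Note that `pivot` raises an error if a date and country pair appears more than once, which can double as a quick check for duplicate observations.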

In [6]:
# combine all data into one dataframe, slicing each source to 1995-2016
# (the sources listed above fully overlap only through 2015)
dfs = [coffee['1995':'2016'], weather['1995':'2016'], fundamental['1995':'2016'], cot['1995':'2016']]

coffee_data = pd.concat(dfs, axis=1)

In [8]:
coffee_data.shape

(5484, 58)
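To make clear what a column-wise `pd.concat` does when the inputs have different frequencies, here is a toy sketch with invented values and a hypothetical `Price` column. It is not the project data; it only shows that the result is indexed by the union of all dates, so lower-frequency columns are `NaN` on most rows, and a month-end date that is not a trading day adds a row of its own.

```python
import pandas as pd

# toy daily prices on three trading days (Thu, Fri, Mon)
daily = pd.DataFrame(
    {'Price': [100.0, 101.0, 102.0]},
    index=pd.to_datetime(['2000-04-27', '2000-04-28', '2000-05-01']),
)

# toy monthly observation stamped on a month-end that falls on a Sunday
monthly = pd.DataFrame(
    {'BRA_Temp': [25.0]},
    index=pd.to_datetime(['2000-04-30']),
)

# axis=1 aligns on the index and keeps every date from either frame
combined = pd.concat([daily, monthly], axis=1)

print(combined)
print(combined.isna().sum())
```

How those gaps should be filled (for example by carrying the last known monthly value forward) is a separate modeling decision and is not settled by the concatenation itself.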