# Coffee Market Analysis
## Data-Wrangling Notebook

### Matthew Garton - February 2019

**Purpose:** This notebook acquires, inspects, and cleans my data and prepares it for EDA and modeling.

**Context:** The ultimate goal of my project is to develop trading signals for coffee futures. I will attempt to build a machine learning model that uses fundamental and technical data to predict the direction of future price changes in coffee futures. At the outset, I expect my feature matrix to include weather, GDP, and coffee production and export data for major coffee-producing nations; GDP and coffee import data for major coffee-importing nations; and volume, open-interest, and Commitments of Traders data for ICE coffee futures contracts.

Note that many of the decisions and functions here arose at various stages of the project, from initial inspection through model-building, which reflects the non-linear nature of the data science workflow. To keep things clean, I have moved all data cleaning and preparation (other than train-test splitting and some feature engineering) into this notebook. The CSV file it outputs can then be read by the other notebooks in this repository.
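
The hand-off described above can be sketched as a CSV round trip. This is a minimal illustration, not the notebook's actual output step: the toy frame, the in-memory buffer, and the file name mentioned in the comment are all assumptions.

```python
import io

import pandas as pd

# Toy stand-in for the cleaned data; values are illustrative only.
cleaned = pd.DataFrame(
    {"Settle": [100.2, 102.6, 104.3]},
    index=pd.to_datetime(["2019-02-11", "2019-02-08", "2019-02-07"]),
)
cleaned.index.name = "Date"

# Write to CSV. An in-memory buffer stands in for a file path such as
# './data/coffee_clean.csv' (hypothetical name).
buffer = io.StringIO()
cleaned.to_csv(buffer)

# Read it back as another notebook would, restoring the datetime index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="Date", parse_dates=["Date"])
```

Passing `index_col` and `parse_dates` on the read side matters because a CSV stores dates as plain text; without them the downstream notebook would get a string column and a fresh integer index.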

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

import quandl

pd.options.display.max_columns = 1000
pd.options.display.max_rows = 1000
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

In [2]:
# import ICE Coffee 'C' Futures price data
coffee = pd.read_csv('./data/CHRIS-ICE_KC1.csv')

In [3]:
coffee.head()

Unnamed: 0,Date,Open,High,Low,Settle,Change,Wave,Volume,Prev. Day Open Interest,EFP Volume,EFS Volume,Block Volume
0,2019-02-11,102.6,102.7,99.85,100.2,-2.4,101.01,41306.0,81909.0,733.0,1851.0,
1,2019-02-08,104.25,104.75,102.25,102.6,-1.7,103.24,39198.0,91190.0,384.0,2525.0,
2,2019-02-07,105.2,105.3,103.55,104.3,-1.2,104.41,38973.0,103661.0,385.0,119.0,
3,2019-02-06,104.7,105.9,104.35,105.5,0.65,105.09,23725.0,106848.0,483.0,18.0,
4,2019-02-05,105.8,106.2,104.25,104.85,-0.75,105.16,21214.0,110696.0,268.0,15.0,


In [4]:
coffee['Date'] = pd.to_datetime(coffee['Date'])
coffee.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11244 entries, 0 to 11243
Data columns (total 12 columns):
Date                       11244 non-null datetime64[ns]
Open                       11187 non-null float64
High                       11187 non-null float64
Low                        11187 non-null float64
Settle                     11243 non-null float64
Change                     6051 non-null float64
Wave                       707 non-null float64
Volume                     11186 non-null float64
Prev. Day Open Interest    11235 non-null float64
EFP Volume                 5299 non-null float64
EFS Volume                 4630 non-null float64
Block Volume               3589 non-null float64
dtypes: datetime64[ns](1), float64(11)
memory usage: 1.0 MB
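
The `info()` output shows very uneven coverage: `Wave` is populated in only 707 of 11,244 rows, while `Settle` is missing just one value. One way to put every column on the same scale is to compute the share of missing values per column. The sketch below uses a toy frame with made-up gaps; on the real data the same expression would be applied to `coffee`.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; the real data would be the `coffee` DataFrame.
toy = pd.DataFrame({
    "Settle": [100.2, 102.6, 104.3, 105.5],
    "Change": [-2.4, np.nan, np.nan, 0.65],
    "Wave": [101.01, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```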


In [5]:
coffee.set_index(coffee['Date'], inplace=True)

In [6]:
coffee.head()

Date (index),Date,Open,High,Low,Settle,Change,Wave,Volume,Prev. Day Open Interest,EFP Volume,EFS Volume,Block Volume
2019-02-11,2019-02-11,102.6,102.7,99.85,100.2,-2.4,101.01,41306.0,81909.0,733.0,1851.0,
2019-02-08,2019-02-08,104.25,104.75,102.25,102.6,-1.7,103.24,39198.0,91190.0,384.0,2525.0,
2019-02-07,2019-02-07,105.2,105.3,103.55,104.3,-1.2,104.41,38973.0,103661.0,385.0,119.0,
2019-02-06,2019-02-06,104.7,105.9,104.35,105.5,0.65,105.09,23725.0,106848.0,483.0,18.0,
2019-02-05,2019-02-05,105.8,106.2,104.25,104.85,-0.75,105.16,21214.0,110696.0,268.0,15.0,
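
The output above shows `Date` twice, once as the index and once as a regular column. That follows from passing the Series `coffee['Date']` to `set_index` rather than the column name. The sketch below contrasts the two calls on a toy frame; it only illustrates the behavior and does not suggest changing the notebook's approach, since later cells may rely on the `Date` column being present.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": pd.to_datetime(["2019-02-11", "2019-02-08"]),
    "Settle": [100.2, 102.6],
})

# Passing the Series copies it into the index and leaves the column in place.
by_series = toy.set_index(toy["Date"])

# Passing the column name moves the column into the index (drop=True is the default).
by_name = toy.set_index("Date")

print(list(by_series.columns))  # ['Date', 'Settle']
print(list(by_name.columns))    # ['Settle']
```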


In [11]:
# import a second data series (BCB-1262), parse its dates, and index it by date
new = pd.read_csv('./data/BCB-1262.csv')
new['Date'] = pd.to_datetime(new['Date'])
new.set_index(new['Date'], inplace=True)

In [16]:
# outer merge on 'Date' keeps every date that appears in either dataset
coffee = coffee.merge(new, how='outer', on='Date')

In [17]:
coffee.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11259 entries, 0 to 11258
Data columns (total 13 columns):
Date                       11259 non-null datetime64[ns]
Open                       11187 non-null float64
High                       11187 non-null float64
Low                        11187 non-null float64
Settle                     11243 non-null float64
Change                     6051 non-null float64
Wave                       707 non-null float64
Volume                     11186 non-null float64
Prev. Day Open Interest    11235 non-null float64
EFP Volume                 5299 non-null float64
EFS Volume                 4630 non-null float64
Block Volume               3589 non-null float64
Value                      35 non-null float64
dtypes: datetime64[ns](1), float64(12)
memory usage: 1.2 MB
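
After the outer merge the row count grows from 11,244 to 11,259 while `Settle` stays at 11,243 non-null values, which suggests that about 15 dates in the new series have no matching price row (assuming no duplicated dates). A way to check this directly is the `indicator` option of `merge`. The sketch below uses two toy frames; the `other` series and its values are hypothetical.

```python
import pandas as pd

prices = pd.DataFrame({
    "Date": pd.to_datetime(["2019-02-11", "2019-02-08", "2019-02-07"]),
    "Settle": [100.2, 102.6, 104.3],
})
# Hypothetical lower-frequency series; one date has no matching price row.
other = pd.DataFrame({
    "Date": pd.to_datetime(["2019-02-08", "2019-02-09"]),
    "Value": [1.0, 2.0],
})

# indicator=True adds a '_merge' column recording where each row came from.
merged = prices.merge(other, how="outer", on="Date", indicator=True)

# Count rows by origin: 'both', 'left_only', 'right_only'.
origin_counts = merged["_merge"].value_counts()

# Put the combined rows in chronological order.
merged = merged.sort_values("Date").reset_index(drop=True)
```

Rows marked `right_only` are the ones that exist only in the new series; they carry `NaN` in every price column and may need to be dropped or filled before modeling.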


In [18]:
coffee.head()

Unnamed: 0,Date,Open,High,Low,Settle,Change,Wave,Volume,Prev. Day Open Interest,EFP Volume,EFS Volume,Block Volume,Value
0,2019-02-11,102.6,102.7,99.85,100.2,-2.4,101.01,41306.0,81909.0,733.0,1851.0,,
1,2019-02-08,104.25,104.75,102.25,102.6,-1.7,103.24,39198.0,91190.0,384.0,2525.0,,
2,2019-02-07,105.2,105.3,103.55,104.3,-1.2,104.41,38973.0,103661.0,385.0,119.0,,
3,2019-02-06,104.7,105.9,104.35,105.5,0.65,105.09,23725.0,106848.0,483.0,18.0,,
4,2019-02-05,105.8,106.2,104.25,104.85,-0.75,105.16,21214.0,110696.0,268.0,15.0,,


In [None]:
# Write a function to read in potential feature data
# clean it, index by time, and merge with main time series
def import_clean_merge(data, df):
    '''Function reads in csv data, backfills missing data, sets index to time
    then merges with main dataframe'''
    
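    new = pd.read_csv('./data/' + data + '.csv')
    # Parse dates so the key matches the datetime 'Date' column in the main dataframe
    new['Date'] = pd.to_datetime(new['Date'])
    new.set_index(new['Date'], inplace=True)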
    
    df.