# GETTING FAMILIAR WITH THE DATA 

# Table of contents
1. [Forex Factory](#forex)
    - [Initial exploration](#forex_explore)
    - [Data curation & feature extraction](#next_forex)
2. [Forexite](#forexite)
    - [Initial exploration](#forexite_explore)
    - [Data curation & feature extraction](#next_forexite)




## Forex Factory <a name="forex"></a>


Data from https://www.forexfactory.com/ was collected with our own scraper, so we need some sanity checks to ensure that the downloaded data matches what we expect.

As we have data from several years, the best approach for data curation is to create a script.
Before that, we need to explore the data to become familiar with our dataset. That's exactly the goal of this notebook.


-----------


In [830]:
import pandas as pd
import numpy as np
from datetime import datetime
import pytz

In [831]:
# Global variables
# Please note these directories are relative to the project, so edit these variables if you change the folder structure

data_directory = '../../data/raw/'


### Initial exploration. Just for 2017, to get familiar with the data <a name="forex_explore"></a>


In [832]:
ff_2017 = pd.read_csv(data_directory + 'forexfactory_2017.csv')

In [833]:
ff_2017['datetime'] =  pd.to_datetime(ff_2017['datetime'])


In [834]:
ff_2017.head()

Unnamed: 0.1,Unnamed: 0,actual,country,datetime,forecast,forecast_error,impact,new,previous,previous_error,week
0,0,,NZD,2017-12-31,,,Non-Economic,Bank Holiday,,,52
1,1,,AUD,2017-12-31,,,Non-Economic,Bank Holiday,,,52
2,2,,JPY,2017-12-31,,,Non-Economic,Bank Holiday,,,52
3,3,,CNY,2017-12-31,,,Non-Economic,Bank Holiday,,,52
4,4,,NZD,2017-12-24,,,Non-Economic,Bank Holiday,,,52


In [835]:
ff_2017.dtypes

Unnamed: 0                 int64
actual                    object
country                   object
datetime          datetime64[ns]
forecast                  object
forecast_error            object
impact                    object
new                       object
previous                  object
previous_error            object
week                       int64
dtype: object

Let's ensure the datetime column also has the time information

In [836]:
str(ff_2017['datetime'][16])

'2017-12-25 18:30:00'

In [837]:
ff_2017 = ff_2017.drop(columns = ['Unnamed: 0'])


Please note that **"forecast_error"** is a variable that I created when scraping the website, set to NaN whenever there was no difference between the published forecast and the actual value.<br/> 
Likewise, **"previous_error"** was also created by me, set to NaN whenever there was no government revision of the value published for the previous release. 

Let's replace those NaN values with the categorical value 'accurate'

In [838]:
# fillna is the direct way to replace missing values; regex matching is not needed here
ff_2017['forecast_error'] = ff_2017['forecast_error'].fillna('accurate')
ff_2017['previous_error'] = ff_2017['previous_error'].fillna('accurate')


Our preliminary analysis is going to be focused on **EUR-USD only**, analysing the impact of news published by the American government, so we filter the dataframe to only get **macroeconomic news from USA** (macroeconomic news = those which have a forecast)

In [839]:
ff_2017_USA = ff_2017[ff_2017['country'] == 'USD'] 
ff_2017_USA = ff_2017_USA[ff_2017_USA['forecast'].notnull()]
ff_2017_USA.head()

Unnamed: 0,actual,country,datetime,forecast,forecast_error,impact,new,previous,previous_error,week
27,6.4%,USD,2017-12-26 09:00:00,6.3%,accurate,Low,S&P/CS Composite-20 HPI y/y,6.2%,accurate,52
28,20,USD,2017-12-26 09:59:00,22,worse,Low,Richmond Manufacturing Index,30,accurate,52
32,122.1,USD,2017-12-27 10:00:00,128.2,worse,High,CB Consumer Confidence,128.6,worse,52
33,0.2%,USD,2017-12-27 10:00:00,-0.4%,better,Medium,Pending Home Sales m/m,3.5%,accurate,52
41,245K,USD,2017-12-28 08:30:00,240K,worse,High,Unemployment Claims,245K,accurate,52


How many macroeconomic news items were published in 2017?

In [840]:
len(ff_2017_USA)

872

How many by 'impact' rate?

In [841]:
ff_2017_USA_high = ff_2017_USA[ff_2017_USA['impact'] == 'High']
ff_2017_USA_medium = ff_2017_USA[ff_2017_USA['impact'] == 'Medium']
ff_2017_USA_low = ff_2017_USA[ff_2017_USA['impact'] == 'Low']

print('High: ' + str(len(ff_2017_USA_high)) + ' - Medium: ' + str(len(ff_2017_USA_medium)) + ' - Low: ' + str(len(ff_2017_USA_low)))

High: 296 - Medium: 243 - Low: 333


Our favourite news for this analysis are those with a high expected impact on the market. Let's see how many distinct news types we have per impact level

In [842]:
print('number of news, high: ' + 
      str(len(ff_2017_USA_high.groupby('new').impact.count())) +
      ' - med: ' +
        str(len(ff_2017_USA_medium.groupby('new').impact.count())) +
      ' - low: ' + 
        str(len(ff_2017_USA_low.groupby('new').impact.count())))

number of news, high: 22 - med: 27 - low: 29


Hmmm, not that many... :-(

Let's see how often the forecast published on Forex Factory misses the actual value

In [843]:
ff_2017_USA.groupby('forecast_error').impact.count()

forecast_error
accurate    301
better      288
worse       283
Name: impact, dtype: int64

Cool, roughly two thirds of the forecasts published on forexfactory.com (571 of 872) miss the actual value!

Let's see how often high-impact news is published

In [844]:
ff_2017_USA_high.groupby('new').impact.count()

new
ADP Non-Farm Employment Change    12
Advance GDP q/q                    4
Average Hourly Earnings m/m       12
Building Permits                  12
CB Consumer Confidence            12
CPI m/m                           12
Core CPI m/m                      12
Core Durable Goods Orders m/m     12
Core Retail Sales m/m             12
Crude Oil Inventories             52
Federal Funds Rate                 6
Final GDP q/q                      4
ISM Manufacturing PMI             12
ISM Non-Manufacturing PMI         12
Non-Farm Employment Change        12
PPI m/m                           12
Philly Fed Manufacturing Index     4
Prelim GDP q/q                     4
Prelim UoM Consumer Sentiment      4
Retail Sales m/m                  12
Unemployment Claims               50
Unemployment Rate                 12
Name: impact, dtype: int64

Most of them are monthly. Let's review one of them, picked at random

In [845]:
ff_2017_USA[ff_2017_USA['new'] == 'ADP Non-Farm Employment Change']

Unnamed: 0,actual,country,datetime,forecast,forecast_error,impact,new,previous,previous_error,week
278,190K,USD,2017-12-06 08:15:00,189K,accurate,High,ADP Non-Farm Employment Change,235K,accurate,49
764,235K,USD,2017-11-01 07:15:00,202K,better,High,ADP Non-Farm Employment Change,110K,worse,44
1111,135K,USD,2017-10-04 07:15:00,131K,accurate,High,ADP Non-Farm Employment Change,228K,worse,40
1547,237K,USD,2017-08-30 07:15:00,185K,better,High,ADP Non-Farm Employment Change,201K,better,35
1884,178K,USD,2017-08-02 07:15:00,187K,worse,High,ADP Non-Farm Employment Change,191K,better,31
2215,158K,USD,2017-07-06 07:15:00,184K,worse,High,ADP Non-Farm Employment Change,230K,worse,27
2672,253K,USD,2017-06-01 07:15:00,181K,better,High,ADP Non-Farm Employment Change,174K,accurate,22
2998,177K,USD,2017-05-03 07:15:00,178K,accurate,High,ADP Non-Farm Employment Change,255K,worse,18
3368,263K,USD,2017-04-05 07:15:00,184K,better,High,ADP Non-Farm Employment Change,245K,worse,14
3711,298K,USD,2017-03-08 08:15:00,184K,better,High,ADP Non-Farm Employment Change,261K,better,10


**Interesting...**
Forexfactory provided its data in US/Eastern with DST = off (as I ran the scraper during winter time). <br/>
This means that we need to manually add an extra hour whenever DST = on in US/Eastern. That's exactly what forexfactory does.

Extra work to be done... After some time-consuming searching on Google, it turns out to be easier than originally thought.

In [846]:
def add_dts_flag(df):

    # DST transition instants for US/Eastern, in UTC time.
    # Note: _utc_transition_times is a private pytz attribute, so it may change between versions.
    dst_changes_utc = pytz.timezone('US/Eastern')._utc_transition_times[1:]

    # Convert to local times from UTC times and then remove timezone information
    dst_changes = [pd.Timestamp(i).tz_localize('UTC').tz_convert('US/Eastern').tz_localize(None) for i in dst_changes_utc]

    flag_list = []
    # Series.items() replaces Series.iteritems(), which was removed in pandas 2.0
    for index, row in df['datetime'].items():
        # Isolate the start and end dates for DST in each year
        dst_dates_in_year = [date for date in dst_changes if date.year == row.year]
        spring = dst_dates_in_year[0]
        fall = dst_dates_in_year[1]
        # 1 while DST is active in US/Eastern, 0 otherwise
        if spring <= row < fall:
            flag = 1
        else:
            flag = 0
        flag_list.append(flag)
    
    return flag_list


In [847]:
ff_2017_USA['dst_flag'] = add_dts_flag(ff_2017_USA)
ff_2017_USA[ff_2017_USA['new'] == 'ADP Non-Farm Employment Change']

Unnamed: 0,actual,country,datetime,forecast,forecast_error,impact,new,previous,previous_error,week,dst_flag
278,190K,USD,2017-12-06 08:15:00,189K,accurate,High,ADP Non-Farm Employment Change,235K,accurate,49,0
764,235K,USD,2017-11-01 07:15:00,202K,better,High,ADP Non-Farm Employment Change,110K,worse,44,1
1111,135K,USD,2017-10-04 07:15:00,131K,accurate,High,ADP Non-Farm Employment Change,228K,worse,40,1
1547,237K,USD,2017-08-30 07:15:00,185K,better,High,ADP Non-Farm Employment Change,201K,better,35,1
1884,178K,USD,2017-08-02 07:15:00,187K,worse,High,ADP Non-Farm Employment Change,191K,better,31,1
2215,158K,USD,2017-07-06 07:15:00,184K,worse,High,ADP Non-Farm Employment Change,230K,worse,27,1
2672,253K,USD,2017-06-01 07:15:00,181K,better,High,ADP Non-Farm Employment Change,174K,accurate,22,1
2998,177K,USD,2017-05-03 07:15:00,178K,accurate,High,ADP Non-Farm Employment Change,255K,worse,18,1
3368,263K,USD,2017-04-05 07:15:00,184K,better,High,ADP Non-Farm Employment Change,245K,worse,14,1
3711,298K,USD,2017-03-08 08:15:00,184K,better,High,ADP Non-Farm Employment Change,261K,better,10,0


Cool, it works pretty well. Let's apply it to the dataframe.

In [848]:
def apply_dts_flag(row):
    return row['datetime'] + pd.DateOffset(hours=row['dst_flag'])

In [849]:
ff_2017_USA['datetime'] = ff_2017_USA.apply(apply_dts_flag, axis=1)


In [850]:
ff_2017_USA[ff_2017_USA['new'] == 'ADP Non-Farm Employment Change']

Unnamed: 0,actual,country,datetime,forecast,forecast_error,impact,new,previous,previous_error,week,dst_flag
278,190K,USD,2017-12-06 08:15:00,189K,accurate,High,ADP Non-Farm Employment Change,235K,accurate,49,0
764,235K,USD,2017-11-01 08:15:00,202K,better,High,ADP Non-Farm Employment Change,110K,worse,44,1
1111,135K,USD,2017-10-04 08:15:00,131K,accurate,High,ADP Non-Farm Employment Change,228K,worse,40,1
1547,237K,USD,2017-08-30 08:15:00,185K,better,High,ADP Non-Farm Employment Change,201K,better,35,1
1884,178K,USD,2017-08-02 08:15:00,187K,worse,High,ADP Non-Farm Employment Change,191K,better,31,1
2215,158K,USD,2017-07-06 08:15:00,184K,worse,High,ADP Non-Farm Employment Change,230K,worse,27,1
2672,253K,USD,2017-06-01 08:15:00,181K,better,High,ADP Non-Farm Employment Change,174K,accurate,22,1
2998,177K,USD,2017-05-03 08:15:00,178K,accurate,High,ADP Non-Farm Employment Change,255K,worse,18,1
3368,263K,USD,2017-04-05 08:15:00,184K,better,High,ADP Non-Farm Employment Change,245K,worse,14,1
3711,298K,USD,2017-03-08 08:15:00,184K,better,High,ADP Non-Farm Employment Change,261K,better,10,0


As the Forexite data was downloaded in GMT, without DST, we need to convert our timestamps before merging both dataframes

In [851]:
ff_2017_USA['datetime_gmt'] = ff_2017_USA['datetime'].dt.tz_localize('US/Eastern').dt.tz_convert('GMT')
ff_2017_USA[ff_2017_USA['new'] == 'ADP Non-Farm Employment Change']

Unnamed: 0,actual,country,datetime,forecast,forecast_error,impact,new,previous,previous_error,week,dst_flag,datetime_gmt
278,190K,USD,2017-12-06 08:15:00,189K,accurate,High,ADP Non-Farm Employment Change,235K,accurate,49,0,2017-12-06 13:15:00+00:00
764,235K,USD,2017-11-01 08:15:00,202K,better,High,ADP Non-Farm Employment Change,110K,worse,44,1,2017-11-01 12:15:00+00:00
1111,135K,USD,2017-10-04 08:15:00,131K,accurate,High,ADP Non-Farm Employment Change,228K,worse,40,1,2017-10-04 12:15:00+00:00
1547,237K,USD,2017-08-30 08:15:00,185K,better,High,ADP Non-Farm Employment Change,201K,better,35,1,2017-08-30 12:15:00+00:00
1884,178K,USD,2017-08-02 08:15:00,187K,worse,High,ADP Non-Farm Employment Change,191K,better,31,1,2017-08-02 12:15:00+00:00
2215,158K,USD,2017-07-06 08:15:00,184K,worse,High,ADP Non-Farm Employment Change,230K,worse,27,1,2017-07-06 12:15:00+00:00
2672,253K,USD,2017-06-01 08:15:00,181K,better,High,ADP Non-Farm Employment Change,174K,accurate,22,1,2017-06-01 12:15:00+00:00
2998,177K,USD,2017-05-03 08:15:00,178K,accurate,High,ADP Non-Farm Employment Change,255K,worse,18,1,2017-05-03 12:15:00+00:00
3368,263K,USD,2017-04-05 08:15:00,184K,better,High,ADP Non-Farm Employment Change,245K,worse,14,1,2017-04-05 12:15:00+00:00
3711,298K,USD,2017-03-08 08:15:00,184K,better,High,ADP Non-Farm Employment Change,261K,better,10,0,2017-03-08 13:15:00+00:00
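
As a cross-check on the two-step conversion above (add one hour under DST, then localize to US/Eastern), the raw scraped timestamps can also be treated as fixed UTC-5 times and converted directly. This is only a sketch under that assumption; the two sample timestamps are the raw ADP values shown earlier, and the variable names are illustrative.

```python
import pandas as pd

# Raw scraped timestamps (before the +1h correction), copied from the
# ADP rows above: one falls in the DST period, one in winter time.
raw = pd.Series(pd.to_datetime(['2017-11-01 07:15:00', '2017-12-06 08:15:00']))

# 'Etc/GMT+5' is the IANA name for a fixed UTC-5 offset (note the inverted sign).
fixed_est = raw.dt.tz_localize('Etc/GMT+5')

# Wall-clock time in US/Eastern, DST included, with the timezone dropped again.
eastern_wall_clock = fixed_est.dt.tz_convert('US/Eastern').dt.tz_localize(None)

# The same instants expressed in GMT/UTC.
gmt = fixed_est.dt.tz_convert('UTC')

print(eastern_wall_clock.tolist())
print(gmt.tolist())
```

Both results should agree with the `datetime` and `datetime_gmt` columns in the table above (08:15 local time, 12:15 and 13:15 GMT respectively).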


-----------
On a different topic, we also need to know which units are used for each macroeconomic news item, so that we can compute the error rate between the forecast and the actual value.

In [852]:
list(set(ff_2017_USA.groupby('new').first().forecast))

['58.4',
 '97.1',
 '-135.2B',
 '-67.7B',
 '104.6',
 '99.0',
 '198K',
 '62.2',
 '0.7%',
 '57.6B',
 '-116B',
 '0.5%',
 '2.1%',
 '3.3%',
 '-3.9M',
 '-0.4%',
 '18.8',
 '54.8',
 '22',
 '1.27M',
 '4.1%',
 '2.6%',
 '0.6%',
 '0.4%',
 '17.5M',
 '6.3%',
 '2.5%',
 '6.03M',
 '54.0',
 '-0.1%',
 '-115B',
 '55.4',
 '5.53M',
 '1.25M',
 '-0.3%',
 '189K',
 '2.2%',
 '59.2',
 '240K',
 '654K',
 '1.7%',
 '17.4B',
 '0.2%',
 '-46.2B',
 '0.3%',
 '70',
 '77.2%',
 '54.6',
 '67.0',
 '21.5',
 '<1.50%',
 '53.8',
 '0.1%',
 '128.2']

OK, again this is not nice... Extra processing will be needed to compute error_ratio: the values mix plain numbers, percentages, K/M/B suffixes and even an inequality ('<1.50%').
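
A minimal sketch of that extra processing is shown below. The helper names `parse_value` and `error_pct` are hypothetical, dropping a leading `<` or `>` is an assumption, and the error formula `(forecast - actual) / forecast * 100` is inferred from the `prediction_error` values that appear in the curated data later in this notebook, not taken from the curation script itself.

```python
import re

# Suffix multipliers seen in the forecast column; percentages keep their face value.
MULTIPLIERS = {'K': 1e3, 'M': 1e6, 'B': 1e9}
VALUE_PATTERN = re.compile(r'^[<>]?(-?\d+(?:\.\d+)?)([KMB%]?)$')

def parse_value(text):
    """Return the numeric value of a string such as '198K', '-0.4%' or '54.8'.

    Returns None when the text cannot be parsed (for example NaN).
    A leading '<' or '>' (as in '<1.50%') is dropped, which is an assumption.
    """
    match = VALUE_PATTERN.match(str(text).strip())
    if match is None:
        return None
    number, suffix = float(match.group(1)), match.group(2)
    return number * MULTIPLIERS.get(suffix, 1.0)

def error_pct(forecast, actual):
    """Percentage error (forecast - actual) / forecast * 100, or None if undefined."""
    f, a = parse_value(forecast), parse_value(actual)
    if f is None or a is None or f == 0:
        return None
    return round((f - a) / f * 100, 2)

print(parse_value('198K'))
print(parse_value('<1.50%'))
print(error_pct('4.5%', '4.6%'))   # -2.22
```

Because forecast and actual share the same unit for a given news item, the suffix cancels out in the ratio; the multipliers only matter if raw magnitudes are compared across different news items.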

--- 

## Next steps on Forex Factory <a name="next_forex"></a>


### Sanity checks:

 - No missing weeks. Each year should have 52 weeks.
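
The week check can be sketched as follows. The function name `missing_weeks` and the toy dataframe are illustrative; the only thing assumed about the real data is the integer `week` column shown in the outputs above.

```python
import pandas as pd

def missing_weeks(df, week_col='week', expected=range(1, 53)):
    """Return the sorted week numbers in `expected` that never appear in df[week_col]."""
    present = set(df[week_col].dropna().astype(int))
    return sorted(set(expected) - present)

# Hypothetical toy frame: weeks 1 to 52 with week 7 dropped.
toy = pd.DataFrame({'week': [w for w in range(1, 53) if w != 7]})
print(missing_weeks(toy))  # [7]
```

An empty list means every expected week has at least one row.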
 
### Data selection:

 - Filter just macro-economic news.
 - Filter news just on the currency of interest.
 
### Feature engineering:

 - Replace NaN in "forecast_error" and "previous_error" fields by "accurate".
 - Manually add +1h to forexfactory data to account for DST (daylight saving time).
 - Set all timestamps to match the trading pair values obtained from Forexite, i.e. GMT without DST. Otherwise we won't be comparing apples with apples!
 - Compute the percentage error between the forecast and actual values, taking into account the different units handled (int, float, %, Billions = 'B', Millions = 'M', Thousands = 'K').
 - Add year, quarter, month, day of week as categorical variables.
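
The calendar features in the last item could be derived along these lines. This is a sketch on two sample timestamps taken from the ADP rows above; the column names `year`, `quarter`, `month` and `weekday` are assumptions.

```python
import pandas as pd

news = pd.DataFrame({'datetime': pd.to_datetime(['2017-12-06 08:15:00',
                                                 '2017-03-08 08:15:00'])})

# Calendar components derived from the release timestamp.
news['year'] = news['datetime'].dt.year
news['quarter'] = news['datetime'].dt.quarter
news['month'] = news['datetime'].dt.month
news['weekday'] = news['datetime'].dt.day_name()

# Store them as categoricals rather than plain integers or strings.
for col in ['year', 'quarter', 'month', 'weekday']:
    news[col] = news[col].astype('category')

print(news.dtypes)
```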


<br/>

----

## Forexite <a name="forexite"></a>


### Initial exploration. Just EUR-USD, to get familiar with the data <a name="forexite_explore"></a>




Currency data from https://forextester.com/data/datasources is already provided as CSV files, one per currency pair.



In [911]:
eurusd = pd.read_csv(data_directory + 'EURUSD.zip', compression='zip', header=0, sep=',')


In [912]:
eurusd.head()

Unnamed: 0,Gmt time,Open,High,Low,Close,Volume
0,01.01.2007 00:00:00.000,1.31908,1.31961,1.31896,1.31947,5268.6
1,01.01.2007 00:05:00.000,1.31942,1.31963,1.31935,1.31945,4019.1
2,01.01.2007 00:10:00.000,1.31959,1.31964,1.31928,1.31953,3784.6
3,01.01.2007 00:15:00.000,1.31942,1.31961,1.31918,1.31929,3550.8
4,01.01.2007 00:20:00.000,1.31919,1.31934,1.31902,1.31923,4096.8


In [913]:
eurusd.dtypes

Gmt time     object
Open        float64
High        float64
Low         float64
Close       float64
Volume      float64
dtype: object

The data is listed in 5-minute candles (see the timestamps in the head output above). We are not interested in such a fine granularity; for our study, we will need to group this data into broader chunks
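
Grouping the candles into broader chunks can be sketched with `resample`. The rows below are the first four rows of the head output above; the 15-minute bar size is an arbitrary choice for illustration.

```python
import pandas as pd

# First four rows of the EUR-USD file, copied from the head() output above.
sample = pd.DataFrame({
    'Gmt time': ['01.01.2007 00:00:00.000', '01.01.2007 00:05:00.000',
                 '01.01.2007 00:10:00.000', '01.01.2007 00:15:00.000'],
    'Open':   [1.31908, 1.31942, 1.31959, 1.31942],
    'High':   [1.31961, 1.31963, 1.31964, 1.31961],
    'Low':    [1.31896, 1.31935, 1.31928, 1.31918],
    'Close':  [1.31947, 1.31945, 1.31953, 1.31929],
    'Volume': [5268.6, 4019.1, 3784.6, 3550.8],
})

# The file uses day.month.year; an explicit format avoids a month/day mix-up.
sample['Gmt time'] = pd.to_datetime(sample['Gmt time'], format='%d.%m.%Y %H:%M:%S.%f')

# Aggregate each 15-minute bucket into a single candle.
bars_15min = (sample.set_index('Gmt time')
                    .resample('15min')
                    .agg({'Open': 'first', 'High': 'max', 'Low': 'min',
                          'Close': 'last', 'Volume': 'sum'}))
print(bars_15min)
```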

In [910]:
eurusd.describe()

Unnamed: 0,Open,High,Low,Close,Volume
count,1245888.0,1245888.0,1245888.0,1245888.0,1245888.0
mean,1.287445,1.287671,1.287212,1.287445,882.6395
std,0.1296695,0.1297179,0.1296122,0.1296695,1749.533
min,1.03452,1.03584,1.03403,1.03453,0.0
25%,1.17086,1.17107,1.17066,1.17086,0.0
50%,1.30679,1.30695,1.3067,1.30679,380.32
75%,1.37196,1.37229,1.37167,1.37195,1032.592
max,1.60343,1.60389,1.60155,1.60305,229081.8


--- 

## Feature engineering using Dukascopy data <a name="next_dukas"></a>

Dukascopy provides the exchange rate for the major pairs of interest. We will use this data to evaluate the impact that macroeconomic data releases have on each pair

#### Situation of the market _before_ the news release:
    
 - Create a new dataframe, grouping the data per day (open, high, low, close).
 - Add 12 new features to the news dataframe -> (open, high, low, close) for the 3 days prior to the news release.
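
The two steps above can be sketched as follows. The candles are synthetic, the `_prev_k` column suffix and the `date` join key are assumptions, and "3 days prior" is interpreted here as the three previous days that have data.

```python
import numpy as np
import pandas as pd

# Synthetic 5-minute candles for five days (made-up prices, for illustration only).
idx = pd.date_range('2017-01-02', periods=5 * 288, freq='5min')
price = 1.05 + 0.00001 * np.arange(len(idx))
candles = pd.DataFrame({'Open': price, 'High': price + 0.0002,
                        'Low': price - 0.0002, 'Close': price + 0.0001}, index=idx)

# One (open, high, low, close) row per day; days without any candle are dropped.
daily = (candles.resample('1D')
                .agg({'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'})
                .dropna(how='all'))

# 3 previous days x 4 prices = 12 columns, e.g. 'Open_prev_1' ... 'Close_prev_3'.
lagged = pd.concat([daily.shift(k).add_suffix(f'_prev_{k}') for k in (1, 2, 3)], axis=1)

# Hypothetical news row, joined on the calendar date of its release.
news = pd.DataFrame({'datetime_gmt': pd.to_datetime(['2017-01-05 13:30:00'])})
news['date'] = news['datetime_gmt'].dt.normalize()
news = news.merge(lagged, left_on='date', right_index=True, how='left')
print(news.filter(like='_prev_').T)
```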

#### Situation of the market _after_ the news release:

 - 5,10,15,30,60,90,120-min window size (volatility (high - low), direction (up|down), close).
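
A window summary of this kind might look like the sketch below. The function `window_features` and the sample candles are hypothetical, volatility is left in price units rather than pips, and treating the window as `[release, release + minutes)` is an assumption.

```python
import pandas as pd

def window_features(candles, release_time, minutes):
    """Summarise the candles in [release_time, release_time + minutes)."""
    end = release_time + pd.Timedelta(minutes=minutes)
    window = candles[(candles.index >= release_time) & (candles.index < end)]
    if window.empty:
        return None
    first_open = window['Open'].iloc[0]
    last_close = window['Close'].iloc[-1]
    return {
        'volatility': window['High'].max() - window['Low'].min(),
        'direction': 'up' if last_close >= first_open else 'down',
        'close': last_close,
    }

# Made-up 5-minute candles around a hypothetical 13:30 release.
candles = pd.DataFrame(
    {'Open':  [1.1000, 1.1005, 1.1010, 1.1008],
     'High':  [1.1006, 1.1012, 1.1011, 1.1009],
     'Low':   [1.0998, 1.1004, 1.1007, 1.1002],
     'Close': [1.1005, 1.1010, 1.1008, 1.1003]},
    index=pd.date_range('2017-12-08 13:30', periods=4, freq='5min'))

release = pd.Timestamp('2017-12-08 13:30')
for minutes in (5, 10, 15):
    print(minutes, window_features(candles, release, minutes))
```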


## Feature engineering <a name="feature_engineer"></a>

 - Dataframe:
      - surprise_forecast
      - surprise_volatility
     

<br/>

----

In [902]:
ff_2017_USA[ff_2017_USA['new'] == 'Unemployment Rate']

Unnamed: 0,actual,country,datetime,forecast,forecast_error,impact,new,previous,previous_error,week,dst_flag,datetime_gmt,a_5_min
332,4.1%,USD,2017-12-08 08:30:00,4.1%,accurate,High,Unemployment Rate,4.1%,accurate,49,0,2017-12-08 13:30:00+00:00,2017-12-08 13:35:00+00:00
816,4.1%,USD,2017-11-03 08:30:00,4.2%,better,High,Unemployment Rate,4.2%,accurate,44,1,2017-11-03 12:30:00+00:00,2017-11-03 12:35:00+00:00
1152,4.2%,USD,2017-10-06 08:30:00,4.4%,better,High,Unemployment Rate,4.4%,accurate,40,1,2017-10-06 12:30:00+00:00,2017-10-06 12:35:00+00:00
1596,4.4%,USD,2017-09-01 08:30:00,4.3%,worse,High,Unemployment Rate,4.3%,accurate,35,1,2017-09-01 12:30:00+00:00,2017-09-01 12:35:00+00:00
1924,4.3%,USD,2017-08-04 08:30:00,4.3%,accurate,High,Unemployment Rate,4.4%,accurate,31,1,2017-08-04 12:30:00+00:00,2017-08-04 12:35:00+00:00
2247,4.4%,USD,2017-07-07 08:30:00,4.3%,worse,High,Unemployment Rate,4.3%,accurate,27,1,2017-07-07 12:30:00+00:00,2017-07-07 12:35:00+00:00
2694,4.3%,USD,2017-06-02 08:30:00,4.4%,better,High,Unemployment Rate,4.4%,accurate,22,1,2017-06-02 12:30:00+00:00,2017-06-02 12:35:00+00:00
3044,4.4%,USD,2017-05-05 08:30:00,4.6%,better,High,Unemployment Rate,4.5%,accurate,18,1,2017-05-05 12:30:00+00:00,2017-05-05 12:35:00+00:00
3412,4.5%,USD,2017-04-07 08:30:00,4.7%,better,High,Unemployment Rate,4.7%,accurate,14,1,2017-04-07 12:30:00+00:00,2017-04-07 12:35:00+00:00
3756,4.7%,USD,2017-03-10 08:30:00,4.7%,accurate,High,Unemployment Rate,4.8%,accurate,10,0,2017-03-10 13:30:00+00:00,2017-03-10 13:35:00+00:00


In [903]:
df = pd.read_csv('../../data/curated/macroeconomic_news_2007_2018.csv')
df.head(2)

Unnamed: 0,actual,country,datetime,forecast,forecast_error,impact,new,previous,previous_error,week,...,direction_60,pips_diff_60,close_90,volatility_90,direction_90,pips_diff_90,close_120,volatility_120,direction_120,pips_diff_120
0,12.3B,USD,2007-01-08 15:00:00,5.4B,better,Low,Consumer Credit m/m,-1.3B,accurate,2,...,down,2.0,13019,5,up,1.0,13019,5,down,1.0
1,53.7,USD,2007-01-09 10:00:00,53.7,accurate,Low,IBD/TIPP Economic Optimism,53.5,accurate,2,...,down,1.0,12994,5,down,6.0,12993,6,down,7.0


In [904]:
df[(df['new'] ==  'Unemployment Rate') & (df['year'] == 2018)][['new', 'impact','datetime','datetime_gmt', 'open_released','high_released','low_released','close_released']]

Unnamed: 0,new,impact,datetime,datetime_gmt,open_released,high_released,low_released,close_released
8765,Unemployment Rate,High,2018-01-05 08:30:00,2018-01-05 13:30:00+00:00,1.20492,12051.0,12046.0,12051.0
8834,Unemployment Rate,High,2018-02-02 08:30:00,2018-02-02 13:30:00+00:00,1.2491,12497.0,12490.0,12492.0
8911,Unemployment Rate,High,2018-03-09 08:30:00,2018-03-09 13:30:00+00:00,1.22853,12288.0,12282.0,12285.0
8981,Unemployment Rate,High,2018-04-06 08:30:00,2018-04-06 12:30:00+00:00,1.22345,12241.0,12230.0,12230.0
9052,Unemployment Rate,High,2018-05-04 08:30:00,2018-05-04 12:30:00+00:00,1.19674,11968.0,11959.0,11964.0
9110,Unemployment Rate,High,2018-06-01 08:30:00,2018-06-01 12:30:00+00:00,1.1669,11673.0,11667.0,11672.0
9191,Unemployment Rate,High,2018-07-06 08:30:00,2018-07-06 12:30:00+00:00,1.17208,11722.0,11713.0,11716.0
9259,Unemployment Rate,High,2018-08-03 08:30:00,2018-08-03 12:30:00+00:00,1.15889,11590.0,11585.0,11587.0
9337,Unemployment Rate,High,2018-09-07 08:30:00,2018-09-07 12:30:00+00:00,1.16113,11620.0,11611.0,11618.0
9405,Unemployment Rate,High,2018-10-05 08:30:00,2018-10-05 12:30:00+00:00,1.15056,11509.0,11504.0,11506.0


In [906]:
df[(df['new'] ==  'Unemployment Claims') & (df['year'] == 2017)][['new', 'impact','datetime','datetime_gmt', 'open_released','high_released','low_released','close_released']]

Unnamed: 0,new,impact,datetime,datetime_gmt,open_released,high_released,low_released,close_released
7909,Unemployment Claims,High,2017-01-05 08:30:00,2017-01-05 13:30:00+00:00,1.05187,10521.0,10514.0,10518.0
7925,Unemployment Claims,High,2017-01-12 08:30:00,2017-01-12 13:30:00+00:00,1.06659,10667.0,10662.0,10664.0
7944,Unemployment Claims,High,2017-01-19 08:30:00,2017-01-19 13:30:00+00:00,1.06693,10670.0,10667.0,10668.0
7953,Unemployment Claims,High,2017-01-26 08:30:00,2017-01-26 13:30:00+00:00,1.06921,10695.0,10689.0,10694.0
7981,Unemployment Claims,High,2017-02-02 08:30:00,2017-02-02 13:30:00+00:00,1.08133,10814.0,10812.0,10812.0
7996,Unemployment Claims,High,2017-02-09 08:30:00,2017-02-09 13:30:00+00:00,1.0685,10686.0,10680.0,10682.0
8017,Unemployment Claims,High,2017-02-16 08:30:00,2017-02-16 13:30:00+00:00,1.06483,10650.0,10648.0,10649.0
8024,Unemployment Claims,High,2017-02-23 08:30:00,2017-02-23 13:30:00+00:00,1.05717,10575.0,10568.0,10570.0
8049,Unemployment Claims,High,2017-03-02 08:30:00,2017-03-02 13:30:00+00:00,1.05161,10518.0,10515.0,10517.0
8062,Unemployment Claims,High,2017-03-09 08:30:00,2017-03-09 13:30:00+00:00,1.05548,10558.0,10554.0,10556.0


In [958]:
df = pd.read_csv("/Users/wola/Documents/MSS/Personales/GitRepos/PFM_EconomicNewsImpact/data/curated/macroeconomic_news_2007_2018.csv")

In [963]:
df_nan = df[df['new'] == 'Unemployment Rate']


In [708]:
df[(df['new'] ==  'Unemployment Rate') & (df['year'] == 2018)][['new', 'impact','datetime','datetime_gmt', 'open_released','high_released','low_released','close_released']]

Unnamed: 0,new,impact,datetime,datetime_gmt,open_released,high_released,low_released,close_released
8664,Unemployment Rate,High,2018-01-05 08:30:00,2018-01-05 13:30:00+00:00,1.2043,12046.0,12043.0,12046.0
8727,Unemployment Rate,High,2018-02-02 08:30:00,2018-02-02 13:30:00+00:00,1.2488,12488.0,12486.0,12486.0
8804,Unemployment Rate,High,2018-03-09 08:30:00,2018-03-09 13:30:00+00:00,1.2284,12285.0,12283.0,12283.0
8874,Unemployment Rate,High,2018-04-06 08:30:00,2018-04-06 12:30:00+00:00,1.2233,12239.0,12229.0,12230.0
8939,Unemployment Rate,High,2018-05-04 08:30:00,2018-05-04 12:30:00+00:00,1.1965,11966.0,11957.0,11961.0
8997,Unemployment Rate,High,2018-06-01 08:30:00,2018-06-01 12:30:00+00:00,1.1667,11672.0,11666.0,11670.0
9077,Unemployment Rate,High,2018-07-06 08:30:00,2018-07-06 12:30:00+00:00,1.1716,11717.0,11713.0,11713.0
9145,Unemployment Rate,High,2018-08-03 08:30:00,2018-08-03 12:30:00+00:00,1.1587,11587.0,11584.0,11586.0
9222,Unemployment Rate,High,2018-09-07 08:30:00,2018-09-07 12:30:00+00:00,1.1609,11617.0,11609.0,11617.0
9290,Unemployment Rate,High,2018-10-05 08:30:00,2018-10-05 12:30:00+00:00,1.1506,11507.0,11506.0,11507.0


In [965]:
df_nan[['new', 'datetime_gmt','forecast','actual', 'prediction_error', 'prediction_mean', 'prediction_std', 'prediction_zscore']].head(10)

Unnamed: 0,new,datetime_gmt,forecast,actual,prediction_error,prediction_mean,prediction_std,prediction_zscore
7895,Unemployment Rate,2007-02-02 13:30:00+00:00,4.5%,4.6%,-2.22,-2.22,1.0,0.0
7896,Unemployment Rate,2007-03-09 13:30:00+00:00,4.6%,4.5%,2.17,-0.025,3.104199,4.39
7897,Unemployment Rate,2007-04-06 12:30:00+00:00,4.6%,4.4%,4.35,1.433333,3.346376,1.409381
7898,Unemployment Rate,2007-05-04 12:30:00+00:00,4.5%,4.5%,0.0,1.075,2.82473,-0.428324
7899,Unemployment Rate,2007-06-01 12:30:00+00:00,4.5%,4.5%,0.0,0.86,2.49308,-0.380567
7900,Unemployment Rate,2007-07-06 12:30:00+00:00,4.5%,4.5%,0.0,1.304,1.944821,-0.344955
7901,Unemployment Rate,2007-08-03 12:30:00+00:00,4.5%,4.6%,-2.22,0.426,2.39497,-1.811992
7902,Unemployment Rate,2007-09-07 12:30:00+00:00,4.6%,4.6%,0.0,-0.444,0.992814,-0.177873
7903,Unemployment Rate,2007-10-05 12:30:00+00:00,4.7%,4.7%,0.0,-0.444,0.992814,0.447214
7904,Unemployment Rate,2007-11-02 12:30:00+00:00,4.7%,4.7%,0.0,-0.444,0.992814,0.447214
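
The notebook does not state how `prediction_mean`, `prediction_std` and `prediction_zscore` were computed. The values in the table above are consistent with a 5-release rolling mean and sample standard deviation, with each error scored against the statistics of the previous row. The sketch below reproduces them under that assumption; it is inferred from the numbers, not taken from the curation script.

```python
import pandas as pd

# prediction_error values copied from the ten Unemployment Rate rows above.
errors = pd.Series([-2.22, 2.17, 4.35, 0.0, 0.0, 0.0, -2.22, 0.0, 0.0, 0.0])

# Rolling statistics over the last 5 releases; a lone first value has no
# sample standard deviation, so it is filled with 1.0 (assumed from the table).
rolling = errors.rolling(window=5, min_periods=1)
prediction_mean = rolling.mean()
prediction_std = rolling.std().fillna(1.0)

# Each error is scored against the statistics known before that release.
prediction_zscore = ((errors - prediction_mean.shift(1))
                     / prediction_std.shift(1)).fillna(0.0)

print(pd.DataFrame({'prediction_error': errors,
                    'prediction_mean': prediction_mean,
                    'prediction_std': prediction_std,
                    'prediction_zscore': prediction_zscore}).round(6))
```

Scoring against the previous row's statistics keeps the current release out of its own baseline.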


In [961]:
df[['new', 'forecast','actual', 'prediction_error', 'prediction_mean', 'prediction_std', 'prediction_zscore']].head(10)

Unnamed: 0,new,forecast,actual,prediction_error,prediction_mean,prediction_std,prediction_zscore
0,Consumer Credit m/m,5.4B,12.3B,-127.78,-127.78,1.0,0.0
1,Consumer Credit m/m,7.0B,6.0B,14.29,-56.745,100.45866,142.07
2,Consumer Credit m/m,7.0B,6.4B,8.57,-34.973333,80.4238,0.650168
3,Consumer Credit m/m,5.5B,3.0B,45.45,-14.8675,76.999804,0.999994
4,Consumer Credit m/m,4.2B,13.5B,-221.43,-56.18,113.931298,-2.682637
5,Consumer Credit m/m,6.0B,2.6B,56.67,-19.29,114.81024,0.990509
6,Consumer Credit m/m,6.4B,12.9B,-101.56,-42.46,117.985229,-0.716574
7,Consumer Credit m/m,5.5B,13.2B,-140.0,-72.174,120.599998,-0.826714
8,Consumer Credit m/m,8.8B,7.5B,14.77,-78.31,113.702506,0.720929
9,Consumer Credit m/m,10.0B,12.2B,-22.0,-38.424,81.311218,0.49524


In [962]:
df.head(2)

Unnamed: 0,index,actual,country,datetime,forecast,forecast_error,impact,new,previous,previous_error,...,high_240,low_240,close_240,volatility_240,direction_240,pips_agg_240,pips_candle_240,prediction_mean,prediction_std,prediction_zscore
0,0,12.3B,USD,2007-01-08 15:00:00,5.4B,better,Low,Consumer Credit m/m,-1.3B,accurate,...,13038,13034,13036,4,up,18,1,-127.78,1.0,0.0
1,65,6.0B,USD,2007-02-07 15:00:00,7.0B,worse,Low,Consumer Credit m/m,13.7B,better,...,13022,13017,13021,5,up,16,2,-56.745,100.45866,142.07


In [921]:
df1 = df.groupby('new').forecast_error_ratio.mean()
df1 = df1.reset_index()

In [930]:
df1['new'].unique()

array(['ADP Non-Farm Employment Change', 'Advance GDP Price Index q/q',
       'Advance GDP q/q', 'Average Hourly Earnings m/m',
       'Building Permits', 'Business Inventories m/m',
       'CB Consumer Confidence', 'CB Leading Index m/m', 'CPI m/m',
       'Capacity Utilization Rate', 'Chicago PMI',
       'Construction Spending m/m', 'Consumer Credit m/m', 'Core CPI m/m',
       'Core Durable Goods Orders m/m', 'Core PCE Price Index m/m',
       'Core PPI m/m', 'Core Retail Sales m/m', 'Crude Oil Inventories',
       'Current Account', 'Durable Goods Orders m/m',
       'Empire State Manufacturing Index', 'Employment Cost Index q/q',
       'Existing Home Sales', 'Factory Orders m/m',
       'Federal Budget Balance', 'Federal Funds Rate',
       'Final GDP Price Index q/q', 'Final GDP q/q',
       'Final Manufacturing PMI', 'Final Services PMI',
       'Final Wholesale Inventories m/m', 'Flash Manufacturing PMI',
       'Flash Services PMI', 'Goods Trade Balance', 'HPI m/m',
       

In [936]:
df[df['new']=='ADP Non-Farm Employment Change'].head(4)

Unnamed: 0,actual,country,datetime,forecast,forecast_error,impact,new,previous,previous_error,week,...,direction_60,pips_diff_60,close_90,volatility_90,direction_90,pips_diff_90,close_120,volatility_120,direction_120,pips_diff_120
39,152K,USD,2007-01-31 08:15:00,135K,better,Medium,ADP Non-Farm Employment Change,147K,better,5,...,up,11.0,12967,16,up,14.0,12973,6,up,20.0
122,57K,USD,2007-03-07 08:15:00,100K,worse,Medium,ADP Non-Farm Employment Change,121K,worse,10,...,up,5.0,13133,4,down,1.0,13142,5,up,10.0
182,106K,USD,2007-04-04 08:15:00,125K,worse,High,ADP Non-Farm Employment Change,65K,better,14,...,up,1.0,13356,11,up,7.0,13360,5,up,11.0
243,64K,USD,2007-05-02 08:15:00,107K,worse,High,ADP Non-Farm Employment Change,98K,worse,18,...,up,14.0,13587,2,down,7.0,13585,6,down,5.0


39      NaN
122     NaN
182    152K
243     57K
Name: actual, dtype: object

In [983]:
df_temp = df[df['datetime_gmt'] == '2018-06-01 12:30:00+00:00']
df_temp.head()

Unnamed: 0,index,actual,country,datetime,forecast,forecast_error,impact,new,previous,previous_error,...,high_240,low_240,close_240,volatility_240,direction_240,pips_agg_240,pips_candle_240,prediction_mean,prediction_std,prediction_zscore
7890,366,223K,USD,2018-06-01 08:30:00,189K,better,High,Non-Farm Employment Change,159K,accurate,...,11674,11669,11672,5,down,0,1,-4.456,36.54066,-0.579069
8031,367,3.8%,USD,2018-06-01 08:30:00,3.9%,better,High,Unemployment Rate,3.9%,accurate,...,11674,11669,11672,5,down,0,1,0.012,2.515098,1.46296
8172,365,0.3%,USD,2018-06-01 08:30:00,0.2%,better,High,Average Hourly Earnings m/m,0.1%,accurate,...,11674,11669,11672,5,down,0,1,2.842171e-15,50.0,-1.434274


In [996]:
df_temp[['new', 'datetime_gmt','pips_agg_30','direction_30','volatility_30','direction_60','pips_agg_60','forecast','actual', 'prediction_error', 'prediction_mean', 'prediction_std', 'prediction_zscore']]

Unnamed: 0,new,datetime_gmt,pips_agg_30,direction_30,volatility_30,direction_60,pips_agg_60,forecast,actual,prediction_error,prediction_mean,prediction_std,prediction_zscore
7890,Non-Farm Employment Change,2018-06-01 12:30:00+00:00,14,down,16,up,10,189K,223K,-17.99,-4.456,36.54066,-0.579069
8031,Unemployment Rate,2018-06-01 12:30:00+00:00,14,down,16,up,10,3.9%,3.8%,2.56,0.012,2.515098,1.46296
8172,Average Hourly Earnings m/m,2018-06-01 12:30:00+00:00,14,down,16,up,10,0.2%,0.3%,-50.0,2.842171e-15,50.0,-1.434274


In [981]:
df_temp = df[df['new'] == 'Unemployment Claims']
df_temp[['new', 'datetime_gmt','pips_agg_30','forecast','actual', 'prediction_error', 'prediction_mean', 'prediction_std', 'prediction_zscore']].tail(5)

Unnamed: 0,new,datetime_gmt,pips_agg_30,forecast,actual,prediction_error,prediction_mean,prediction_std,prediction_zscore
1795,Unemployment Claims,2018-09-27 12:30:00+00:00,17,208K,214K,-2.88,1.976,3.240205,-3.257684
1796,Unemployment Claims,2018-10-04 12:30:00+00:00,11,214K,207K,3.27,2.536,3.155714,0.399357
1797,Unemployment Claims,2018-10-11 12:30:00+00:00,4,207K,214K,-3.38,0.832,3.658356,-1.874695
1798,Unemployment Claims,2018-10-18 12:30:00+00:00,6,211K,210K,0.47,0.354,3.478869,-0.098952
1799,Unemployment Claims,2018-10-25 12:30:00+00:00,15,214K,215K,-0.47,-0.598,2.695621,-0.236859


In [982]:
%matplotlib
df_temp.pips_agg_30.hist()

Using matplotlib backend: MacOSX


<matplotlib.axes._subplots.AxesSubplot at 0x6f22bfda0>