# GETTING FAMILIAR WITH THE DATA 
----



## Forex Factory <a name="forex"></a>


Data from https://www.forexfactory.com/ was obtained using our own scraper. We therefore need some sanity checks to ensure that the downloaded data matches what we expect.

As we have data from several years, the best approach for data curation is to create a script.
Before that, we need to explore the data to get familiar with our dataset. That's exactly the goal of this notebook.


-----------


In [1]:
import pandas as pd
import numpy as np

In [7]:
data_directory_news = '../data/raw/'

### Initial exploratory analysis, just for 2017, to get familiar with the data


In [8]:
ff_2017 = pd.read_csv(data_directory_news + 'forexfactory_2017.csv')

In [9]:
ff_2017.head()

Unnamed: 0.1,Unnamed: 0,actual,country,datetime,forecast,forecast_error,impact,new,previous,previous_error,week
0,0,,NZD,2017-12-31 00:00:00,,,Non-Economic,Bank Holiday,,,52
1,1,,AUD,2017-12-31 00:00:00,,,Non-Economic,Bank Holiday,,,52
2,2,,JPY,2017-12-31 00:00:00,,,Non-Economic,Bank Holiday,,,52
3,3,,CNY,2017-12-31 00:00:00,,,Non-Economic,Bank Holiday,,,52
4,4,,NZD,2017-12-24 00:00:00,,,Non-Economic,Bank Holiday,,,52


In [10]:
ff_2017 = ff_2017.drop(columns = ['Unnamed: 0'])
ff_2017.describe()

Unnamed: 0,week
count,4566.0
mean,26.382611
std,15.01986
min,1.0
25%,13.0
50%,26.0
75%,40.0
max,52.0


In [11]:
ff_2017.dtypes

actual            object
country           object
datetime          object
forecast          object
forecast_error    object
impact            object
new               object
previous          object
previous_error    object
week               int64
dtype: object

Please note that **"forecast_error"** is a variable that I created when scraping the website, set to NaN whenever there was no error between the published forecast and the actual value.<br/> 
Likewise, **"previous_error"** was also created by me, set to NaN whenever there was no government correction to the value published for the previous release event. 

Let's replace those NaN values with the categorical value 'accurate'.

In [12]:
# fillna is the idiomatic way to replace missing values; no regex is involved
ff_2017['forecast_error'] = ff_2017['forecast_error'].fillna('accurate')
ff_2017['previous_error'] = ff_2017['previous_error'].fillna('accurate')


Our preliminary analysis focuses on **EUR-USD only**, analysing the impact of news published by the American government, so we filter the dataframe to keep only **macroeconomic news from the USA** (macroeconomic news = news items that have a forecast).

In [13]:
ff_2017_USA = ff_2017[ff_2017['country'] == 'USD'] 
ff_2017_USA = ff_2017_USA[ff_2017_USA['forecast'].notnull()]
ff_2017_USA.head()

Unnamed: 0,actual,country,datetime,forecast,forecast_error,impact,new,previous,previous_error,week
27,6.4%,USD,2017-12-26 09:00:00,6.3%,accurate,Low,S&P/CS Composite-20 HPI y/y,6.2%,accurate,52
28,20,USD,2017-12-26 09:59:00,22,worse,Low,Richmond Manufacturing Index,30,accurate,52
32,122.1,USD,2017-12-27 10:00:00,128.2,worse,High,CB Consumer Confidence,128.6,worse,52
33,0.2%,USD,2017-12-27 10:00:00,-0.4%,better,Medium,Pending Home Sales m/m,3.5%,accurate,52
41,245K,USD,2017-12-28 08:30:00,240K,worse,High,Unemployment Claims,245K,accurate,52


How many macroeconomic news items are published per year?

In [14]:
len(ff_2017_USA)

872

How many releases are there for each 'impact' level?

In [26]:
ff_2017_USA_high = ff_2017_USA[ff_2017_USA['impact'] == 'High']
ff_2017_USA_medium = ff_2017_USA[ff_2017_USA['impact'] == 'Medium']
ff_2017_USA_low = ff_2017_USA[ff_2017_USA['impact'] == 'Low']

print('High: ' + str(len(ff_2017_USA_high)) + ' - Medium: ' + str(len(ff_2017_USA_medium)) + ' - Low: ' + str(len(ff_2017_USA_low)))

High: 296 - Medium: 243 - Low: 333


In [27]:
len(ff_2017_USA_high) *100/len(ff_2017_USA)

33.944954128440365

In [28]:
len(ff_2017_USA_medium) *100/len(ff_2017_USA)

27.86697247706422

In [29]:
len(ff_2017_USA_low) *100/len(ff_2017_USA)

38.18807339449541
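
The three shares above can also be obtained in a single call. The following is a minimal sketch, assuming a small hand-built frame `sample` (hypothetical, it only stands in for `ff_2017_USA`) with an `impact` column:

```python
import pandas as pd

# Hypothetical stand-in for ff_2017_USA: only the 'impact' column matters here.
sample = pd.DataFrame(
    {'impact': ['High', 'High', 'Medium', 'Low', 'Low', 'Low', 'Low', 'Low']}
)

# Absolute number of releases per impact level.
counts = sample['impact'].value_counts()

# Shares in percent: normalize=True divides each count by the total.
shares = sample['impact'].value_counts(normalize=True) * 100

print(counts)
print(shares.round(2))
```

Applied to the real dataframe, `value_counts(normalize=True)` should reproduce the three percentages computed cell by cell above.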

The most interesting news for this analysis are those with the highest expected impact on the market. Let's see how many different ones we have.

In [16]:
print('number of news, high: ' + 
      str(len(ff_2017_USA_high.groupby('new').impact.count())) +
      ' - med: ' +
        str(len(ff_2017_USA_medium.groupby('new').impact.count())) +
      ' - low: ' + 
        str(len(ff_2017_USA_low.groupby('new').impact.count())))

number of news, high: 22 - med: 27 - low: 29


In [17]:
ff_2017_USA_high.groupby('new').impact.count()

new
ADP Non-Farm Employment Change    12
Advance GDP q/q                    4
Average Hourly Earnings m/m       12
Building Permits                  12
CB Consumer Confidence            12
CPI m/m                           12
Core CPI m/m                      12
Core Durable Goods Orders m/m     12
Core Retail Sales m/m             12
Crude Oil Inventories             52
Federal Funds Rate                 6
Final GDP q/q                      4
ISM Manufacturing PMI             12
ISM Non-Manufacturing PMI         12
Non-Farm Employment Change        12
PPI m/m                           12
Philly Fed Manufacturing Index     4
Prelim GDP q/q                     4
Prelim UoM Consumer Sentiment      4
Retail Sales m/m                  12
Unemployment Claims               50
Unemployment Rate                 12
Name: impact, dtype: int64

Hmmm, not that many... :-(

Interestingly, some news have High impact most of the time but Medium impact at other times, like 'Unemployment Claims'.

In [18]:
ff_2017_USA[ff_2017_USA['new'] == 'Unemployment Claims']

Unnamed: 0,actual,country,datetime,forecast,forecast_error,impact,new,previous,previous_error,week
41,245K,USD,2017-12-28 08:30:00,240K,worse,High,Unemployment Claims,245K,accurate,52
109,245K,USD,2017-12-21 08:30:00,232K,worse,Medium,Unemployment Claims,225K,accurate,51
219,225K,USD,2017-12-14 08:30:00,237K,better,Medium,Unemployment Claims,236K,accurate,50
303,236K,USD,2017-12-07 08:30:00,239K,accurate,High,Unemployment Claims,238K,accurate,49
413,238K,USD,2017-11-30 08:30:00,241K,accurate,High,Unemployment Claims,240K,accurate,48
477,239K,USD,2017-11-22 08:30:00,241K,accurate,High,Unemployment Claims,252K,accurate,47
586,249K,USD,2017-11-16 08:30:00,235K,worse,High,Unemployment Claims,239K,accurate,46
684,239K,USD,2017-11-09 08:30:00,232K,worse,High,Unemployment Claims,229K,accurate,45
799,229K,USD,2017-11-02 07:30:00,235K,better,High,Unemployment Claims,234K,accurate,44
870,233K,USD,2017-10-26 07:30:00,235K,accurate,High,Unemployment Claims,223K,accurate,43
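
To find every news title whose impact level changes over the year, rather than inspecting them one by one, something like the following could be used. It is a sketch on a few rows copied from the tables above (reduced to the `new` and `impact` columns); `sample`, `levels_per_new`, `mixed` and `table` are names introduced here for illustration.

```python
import pandas as pd

# A few rows taken from the outputs above, reduced to two columns.
sample = pd.DataFrame({
    'new': ['Unemployment Claims', 'Unemployment Claims', 'Unemployment Claims',
            'CB Consumer Confidence', 'Richmond Manufacturing Index'],
    'impact': ['High', 'Medium', 'Medium', 'High', 'Low'],
})

# Number of distinct impact levels observed for each news title.
levels_per_new = sample.groupby('new')['impact'].nunique()

# Titles that appear with more than one impact level.
mixed = levels_per_new[levels_per_new > 1].index.tolist()

# Cross-tabulation: how often each title appears at each level.
table = pd.crosstab(sample['new'], sample['impact'])

print(mixed)
print(table)
```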


We need to know which measurement units are used for each macroeconomic news item, so that we can compute the error rate between the forecast and the actual value.

In [19]:
list(set(ff_2017_USA.groupby('new').actual.first()))

['77.1%',
 '733K',
 '67.6',
 '0.2%',
 '0.6%',
 '17.5M',
 '190K',
 '23.2B',
 '74',
 '245K',
 '65.5',
 '2.2%',
 '58.2',
 '20',
 '4.1%',
 '-69.7B',
 '228K',
 '0.1%',
 '18.0',
 '-138.5B',
 '52.4',
 '51.9',
 '95.9',
 '6.4%',
 '0.3%',
 '3.3%',
 '96.8',
 '5.81M',
 '0.5%',
 '53.9',
 '0.8%',
 '-0.5%',
 '3.2%',
 '1.4%',
 '1.30M',
 '122.1',
 '3.0%',
 '54.5',
 '55.0',
 '107.5',
 '2.1%',
 '1.0%',
 '57.4',
 '-101B',
 '1.3%',
 '0.7%',
 '6.00M',
 '<1.50%',
 '26.2',
 '20.5B',
 '0.4%',
 '-112B',
 '-0.1%',
 '-0.2%',
 '-48.7B',
 '-4.6M']
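
The list above shows four kinds of values: plain numbers, percentages, and numbers with a `K`, `M` or `B` suffix, plus one value with a leading `<`. Below is a minimal sketch of how such strings might be turned into numbers and compared. `parse_value` and `forecast_gap` are hypothetical helpers, not part of the scraper, and two choices in them are assumptions: the `<` sign is simply dropped, and percentages are compared in percentage points while all other values use a relative error.

```python
import math

# Multipliers for the magnitude suffixes that appear in the list above.
MULTIPLIERS = {'K': 1e3, 'M': 1e6, 'B': 1e9}


def parse_value(raw):
    """Return (number, unit) for strings such as '245K', '6.4%', '-69.7B' or '20'.

    unit is '%' for percentages and '' for plain numbers; K/M/B are expanded.
    A leading '<' or '>' (as in '<1.50%') is dropped, which is an assumption.
    """
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return math.nan, ''
    text = str(raw).strip().lstrip('<>')
    if text.endswith('%'):
        return float(text[:-1]), '%'
    if text[-1:] in MULTIPLIERS:
        return float(text[:-1]) * MULTIPLIERS[text[-1]], ''
    return float(text), ''


def forecast_gap(actual, forecast):
    """Gap between actual and forecast.

    Percentages: difference in percentage points.
    Everything else: relative error in percent of the forecast.
    Returns NaN when a value is missing, units differ, or the forecast is zero.
    """
    a, unit_a = parse_value(actual)
    f, unit_f = parse_value(forecast)
    if math.isnan(a) or math.isnan(f) or unit_a != unit_f:
        return math.nan
    if unit_a == '%':
        return a - f
    if f == 0:
        return math.nan
    return (a - f) / abs(f) * 100


print(parse_value('245K'))             # (245000.0, '')
print(forecast_gap('245K', '240K'))    # about 2.08 (percent of the forecast)
print(forecast_gap('0.2%', '-0.4%'))   # about 0.6 (percentage points)
```

Whether a relative error or a plain difference is the better measure depends on the indicator, so this choice would need to be revisited before it goes into the curation script.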

Let's see how often Forex Factory publishes a wrong forecast.

In [20]:
ff_2017_USA.groupby('forecast_error').impact.count()

forecast_error
accurate    301
better      288
worse       283
Name: impact, dtype: int64

Cool, forexfactory.com publishes inaccurate forecasts around 2/3 of the time (571 of 872 releases, about 65%)!

Are there news items published at exactly the same time?

In [21]:
ff_2017_USA.groupby('datetime').new.count().sort_values(ascending=False)

datetime
2017-09-28 07:30:00    5
2017-03-15 07:30:00    5
2017-04-27 07:30:00    5
2017-07-27 07:30:00    5
2017-12-22 08:30:00    5
2017-11-15 08:30:00    5
2017-02-15 08:30:00    5
2017-05-12 07:30:00    4
2017-03-16 07:30:00    4
2017-11-30 08:30:00    4
2017-06-14 07:30:00    4
2017-12-14 08:30:00    4
2017-09-19 07:30:00    4
2017-01-19 08:30:00    4
2017-08-31 07:30:00    4
2017-01-27 08:30:00    4
2017-04-14 07:30:00    4
2017-11-03 07:30:00    4
2017-01-13 08:30:00    4
2017-10-13 07:30:00    4
2017-07-14 07:30:00    4
2017-05-04 07:30:00    4
2017-09-01 09:00:00    4
2017-06-15 07:30:00    4
2017-12-21 08:30:00    4
2017-06-02 07:30:00    4
2017-02-28 08:30:00    4
2017-05-26 07:30:00    4
2017-08-04 07:30:00    4
2017-02-16 08:30:00    4
                      ..
2017-08-22 08:00:00    1
2017-08-17 09:30:00    1
2017-07-20 09:30:00    1
2017-08-17 09:00:00    1
2017-07-24 09:00:00    1
2017-07-25 08:59:00    1
2017-07-25 09:00:00    1
2017-07-26 09:00:00    1
2017-07-26 09:30

Ok, there are... Not great news... :-(

Let's see one of them as an example.

In [22]:
ff_2017_USA[ff_2017_USA['datetime'] == '2017-09-28 07:30:00']

Unnamed: 0,actual,country,datetime,forecast,forecast_error,impact,new,previous,previous_error,week
1203,3.1%,USD,2017-09-28 07:30:00,3.0%,better,High,Final GDP q/q,3.0%,accurate,39
1204,272K,USD,2017-09-28 07:30:00,269K,accurate,High,Unemployment Claims,260K,accurate,39
1205,1.0%,USD,2017-09-28 07:30:00,1.0%,accurate,Low,Final GDP Price Index q/q,1.0%,accurate,39
1206,-62.9B,USD,2017-09-28 07:30:00,-65.0B,better,Low,Goods Trade Balance,-63.9B,better,39
1207,1.0%,USD,2017-09-28 07:30:00,0.4%,worse,Low,Prelim Wholesale Inventories m/m,0.6%,worse,39


What % of release times carry more than one news item?

In [23]:
df_temp = ff_2017_USA.groupby('datetime').new.count().reset_index()
df_temp.columns = ['datetime', 'news']
df_temp.head(2)

Unnamed: 0,datetime,news
0,2017-01-03 09:45:00,1
1,2017-01-03 10:00:00,3


In [24]:
len(df_temp[df_temp['news'] > 1]) * 100 / len(df_temp)

31.010452961672474

Ok, 31% of the release times, not a negligible number...


For the news classified as 'High' impact, what % of release times carry more than one of them?

In [25]:
df_temp = ff_2017_USA_high.groupby('datetime').new.count().reset_index()
df_temp.columns = ['datetime', 'news']
len(df_temp[df_temp['news'] > 1]) * 100 / len(df_temp)

20.642201834862384
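
Note that the two percentages above are shares of release times, not of news items. The share of individual news items that come out together with at least one other item is a different (and larger) number. The sketch below computes both on a handful of rows taken from the outputs above; `sample`, `pct_timestamps` and `pct_news` are names introduced here for illustration.

```python
import pandas as pd

# Three releases sharing one timestamp plus two stand-alone releases,
# all taken from the tables shown earlier in this notebook.
sample = pd.DataFrame({
    'datetime': ['2017-09-28 07:30:00'] * 3
                + ['2017-12-26 09:00:00', '2017-12-26 09:59:00'],
    'new': ['Final GDP q/q', 'Unemployment Claims', 'Goods Trade Balance',
            'S&P/CS Composite-20 HPI y/y', 'Richmond Manufacturing Index'],
})

# Share of release times with more than one item (what the cells above measure).
per_timestamp = sample.groupby('datetime').size()
pct_timestamps = (per_timestamp > 1).mean() * 100

# Share of individual items released together with at least one other item.
bundle_size = sample.groupby('datetime')['new'].transform('size')
pct_news = (bundle_size > 1).mean() * 100

print(round(pct_timestamps, 2), round(pct_news, 2))
```

In this small sample one release time out of three is a bundle, but three news items out of five belong to it.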

--- 

## Next steps on Forex Factory <a name="next_forex"></a>

### Sanity checks:

 - Check that the raw data has no missing weeks.
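
A possible shape for this check is sketched below. `missing_weeks` is a hypothetical helper, and the default range of weeks 1 to 52 is an assumption based on the `min` and `max` reported by `describe()` for 2017.

```python
import pandas as pd


def missing_weeks(df, expected=range(1, 53)):
    """Return the sorted week numbers in `expected` that never appear in df['week']."""
    present = set(df['week'].dropna().astype(int))
    return sorted(set(expected) - present)


# Tiny hand-made example: weeks 3 and 5 are absent from the range 1..5.
sample = pd.DataFrame({'week': [1, 2, 2, 4, 52]})
print(missing_weeks(sample, expected=range(1, 6)))  # [3, 5]
```

An empty list for a given year's file would mean that every expected week is present at least once.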

### Data selection:

 - Filter out non macro-economic news.
 - Filter out non USA news.

### Feature engineering:

 - Compute the % error between the forecast and the actual values, taking into account the different units handled (int, float, %, Billions = 'B', Millions = 'M', Thousands = 'K').
 - Compute government corrections to the official values published in the previous release.
 - Set all timestamps to match the trading pair values obtained from Dukascopy, i.e. GMT with DST. Otherwise we won't be comparing apples with apples!
 - Split the current date and time fields to capture year, month, day of week, hour, time.
 - Replace NaN in the "forecast_error" and "previous_error" fields with "accurate".
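
The timestamp alignment and the date split could look roughly like the sketch below. `add_time_features` is a hypothetical helper. The source timezone of the scraped data is not established in this notebook, so the fixed UTC-5 offset used in the example is only a placeholder assumption, and UTC as the target is likewise an assumption that must be checked against the Dukascopy data.

```python
from datetime import timedelta, timezone

import pandas as pd


def add_time_features(df, source_tz, target_tz='UTC'):
    """Parse 'datetime', move it from source_tz to target_tz, add calendar columns."""
    out = df.copy()
    stamps = pd.to_datetime(out['datetime'])
    # Attach the source timezone to the naive timestamps, then convert.
    stamps = stamps.dt.tz_localize(source_tz).dt.tz_convert(target_tz)
    out['datetime'] = stamps
    out['year'] = stamps.dt.year
    out['month'] = stamps.dt.month
    out['day_of_week'] = stamps.dt.dayofweek  # Monday = 0
    out['hour'] = stamps.dt.hour
    out['minute'] = stamps.dt.minute
    return out


# Two 'Unemployment Claims' timestamps from the table shown earlier.
sample = pd.DataFrame({'datetime': ['2017-12-28 08:30:00', '2017-10-26 07:30:00']})

# Placeholder assumption: the scraped times are in a fixed UTC-5 offset.
assumed_source_tz = timezone(timedelta(hours=-5))

features = add_time_features(sample, source_tz=assumed_source_tz)
print(features)
```

A fixed offset avoids the ambiguous or non-existent local times that a DST-aware zone can produce in `tz_localize`. If the source turns out to be a DST-aware zone, the `ambiguous` and `nonexistent` arguments of `tz_localize` would need to be set explicitly.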
 

