# GETTING FAMILIAR WITH THE DATA 

# Table of contents
1. [Forex Factory](#forex)
    - [Next steps](#next_forex)
2. [Forexite](#forexite)
    - [Next steps](#next_forexite)



## Forex Factory <a name="forex"></a>


Data from https://www.forexfactory.com/ was collected with our own scraper. We therefore need some sanity checks to ensure that the downloaded data matches what we expect.

As we have data from several years, the best approach for this data curation is to write a script.
Before that, we need to explore the data to become familiar with our dataset. That's exactly the goal of this notebook.


-----------


In [25]:
import pandas as pd
import numpy as np

In [18]:
# Global variables
# Please note these are directories relative to the project, so edit these variables if you change the folder structure

data_directory_news = '../data/news/'

### Initial exploratory analysis, just for 2017, to get familiar with the data


In [28]:
ff_2017 = pd.read_csv(data_directory_news + 'forexfactory_2017.csv')

In [29]:
ff_2017.head()

Unnamed: 0.1,Unnamed: 0,actual,country,date,forecast,forecast_error,impact,new,previous,previous_error,time
0,0,,NZD,2017-12-31 00:00:00,,,Non-Economic,Bank Holiday,,,12:00am
1,1,,AUD,2017-12-31 00:00:00,,,Non-Economic,Bank Holiday,,,12:00am
2,2,,JPY,2017-12-31 00:00:00,,,Non-Economic,Bank Holiday,,,12:00am
3,3,,CNY,2017-12-31 00:00:00,,,Non-Economic,Bank Holiday,,,12:00am
4,4,,CHF,2017-01-01 00:00:00,,,Non-Economic,Bank Holiday,,,12:00am


In [30]:
ff_2017 = ff_2017.drop(columns = ['Unnamed: 0'])
ff_2017.describe()

Unnamed: 0,actual,country,date,forecast,forecast_error,impact,new,previous,previous_error,time
count,3789,4644,4644,3079,2150,4644,4644,3788,863,4644
unique,1357,10,331,1042,2,4,413,1375,2,314
top,0.4%,USD,2017-01-04 00:00:00,0.2%,better,Low,Trade Balance,0.4%,better,8:30am
freq,116,1201,42,170,1127,2247,100,116,487,412


In [22]:
ff_2017.dtypes

date_time         object
actual            object
country           object
forecast          object
forecast_error    object
impact            object
new               object
previous          object
previous_error    object
dtype: object

"forecast_error" is NaN when there was no difference between the published forecast and the actual value. Similarly, "previous_error" is NaN when the government did not revise the value published for the previous release.

Let's replace those NaN values with the categorical value 'accurate'.

In [23]:
# fillna is the idiomatic way to replace missing values; no regex is needed here
ff_2017['forecast_error'] = ff_2017['forecast_error'].fillna('accurate')
ff_2017['previous_error'] = ff_2017['previous_error'].fillna('accurate')


Our preliminary analysis focuses on **EUR-USD only** and on the impact of news published by the US government, so we filter the dataframe to keep only **macroeconomic news from the USA** (macroeconomic news = releases that have a forecast).

In [8]:
ff_2017_USA = ff_2017[ff_2017['country'] == 'USD'] 
ff_2017_USA = ff_2017_USA[ff_2017_USA['forecast'].notnull()]
ff_2017_USA.head()

Unnamed: 0,date_time,actual,country,forecast,forecast_error,impact,new,previous,previous_error
26,2017-01-02 00:00:00 10:45am,55.1,USD,55.0,accurate,Low,Final Manufacturing PMI,55.0,accurate
32,2017-01-03 00:00:00 11:00am,59.7,USD,58.1,better,High,ISM Manufacturing PMI,58.2,accurate
33,2017-01-03 00:00:00 11:00am,0.8%,USD,0.6%,accurate,Low,Construction Spending m/m,0.9%,worse
34,2017-01-03 00:00:00 11:00am,69.0,USD,64.8,better,Low,ISM Manufacturing Prices,65.5,accurate
35,2017-01-03 00:00:00 12:00am,17.9M,USD,17.5M,better,Low,Total Vehicle Sales,17.5M,accurate


How many macroeconomic news releases are published in a year (here, 2017)?

In [9]:
len(ff_2017_USA)

888

How many releases are there for each 'impact' level?

In [10]:
ff_2017_USA_high = ff_2017_USA[ff_2017_USA['impact'] == 'High']
ff_2017_USA_medium = ff_2017_USA[ff_2017_USA['impact'] == 'Medium']
ff_2017_USA_low = ff_2017_USA[ff_2017_USA['impact'] == 'Low']

print('High: ' + str(len(ff_2017_USA_high)) + ' - Medium: ' + str(len(ff_2017_USA_medium)) + ' - Low: ' + str(len(ff_2017_USA_low)))

High: 304 - Medium: 243 - Low: 341


The most interesting news for this analysis are those with the highest expected impact on the market. Let's see how many distinct news items we have at each impact level.

In [11]:
print('number of news, high: ' + 
      str(len(ff_2017_USA_high.groupby('new').impact.count())) +
      ' - med: ' +
        str(len(ff_2017_USA_medium.groupby('new').impact.count())) +
      ' - low: ' + 
        str(len(ff_2017_USA_low.groupby('new').impact.count())))

number of news, high: 22 - med: 27 - low: 30


In [12]:
ff_2017_USA_high.groupby('new').impact.count()

new
ADP Non-Farm Employment Change    13
Advance GDP q/q                    4
Average Hourly Earnings m/m       13
Building Permits                  12
CB Consumer Confidence            12
CPI m/m                           12
Core CPI m/m                      12
Core Durable Goods Orders m/m     12
Core Retail Sales m/m             12
Crude Oil Inventories             53
Federal Funds Rate                 6
Final GDP q/q                      4
ISM Manufacturing PMI             13
ISM Non-Manufacturing PMI         13
Non-Farm Employment Change        13
PPI m/m                           12
Philly Fed Manufacturing Index     4
Prelim GDP q/q                     4
Prelim UoM Consumer Sentiment      4
Retail Sales m/m                  12
Unemployment Claims               51
Unemployment Rate                 13
Name: impact, dtype: int64

Hmmm, not that many... :-(

We need to know which units of measurement are used for each macroeconomic news item, so that we can compute the error rate.

In [27]:
list(set(ff_2017_USA.groupby('new').actual.first()))

['0.3%',
 '733K',
 '107.5',
 '-206B',
 '-101B',
 '5.81M',
 '55.9',
 '20',
 '6.4%',
 '23.2B',
 '0.1%',
 '1.0%',
 '0.2%',
 '-0.5%',
 '20.5B',
 '-50.5B',
 '18.0',
 '<1.50%',
 '59.7',
 '-138.5B',
 '-69.7B',
 '-0.1%',
 '148K',
 '95.9',
 '0.7%',
 '3.2%',
 '0.8%',
 '-7.4M',
 '55.0',
 '74',
 '26.2',
 '122.1',
 '2.2%',
 '1.30M',
 '0.6%',
 '67.6',
 '53.7',
 '4.1%',
 '0.4%',
 '-0.2%',
 '250K',
 '96.8',
 '77.1%',
 '51.9',
 '1.3%',
 '52.4',
 '69.0',
 '17.9M',
 '6.00M',
 '3.0%',
 '2.1%',
 '55.1',
 '3.3%',
 '0.5%']
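
The values above mix plain numbers, percentages and magnitude suffixes (K, M, B), plus oddities such as '<1.50%'. Here is a minimal sketch of how they could be normalised before computing an error rate. The names `parse_value`, `forecast_error_pct` and `MULTIPLIERS` are hypothetical, and reading 'B' as billions, dropping a leading '<', and defining the error relative to the forecast are all assumptions rather than something the dataset documents.

```python
import math

# Assumed meaning of the suffixes seen in the 'actual' column.
MULTIPLIERS = {'K': 1e3, 'M': 1e6, 'B': 1e9}


def parse_value(raw):
    """Convert strings such as '0.3%', '733K' or '-206B' to a float.

    Percentages are returned in percentage points (e.g. 0.3);
    missing or unparseable values are returned as NaN.
    """
    if not isinstance(raw, str):
        return float('nan')
    text = raw.strip().lstrip('<>')  # assumption: '<1.50%' is read as 1.50%
    if not text:
        return float('nan')
    factor = 1.0
    if text.endswith('%'):
        text = text[:-1]
    elif text[-1] in MULTIPLIERS:
        factor = MULTIPLIERS[text[-1]]
        text = text[:-1]
    try:
        return float(text) * factor
    except ValueError:
        return float('nan')


def forecast_error_pct(actual, forecast):
    """Error of the actual value relative to the forecast, in percent."""
    a, f = parse_value(actual), parse_value(forecast)
    if math.isnan(a) or math.isnan(f) or f == 0:
        return float('nan')
    return (a - f) / abs(f) * 100
```

On the dataframe this could be applied row by row, for example with `ff_2017_USA['actual'].map(parse_value)`.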

Let's see how often Forex Factory publishes a forecast that differs from the actual value.

In [14]:
ff_2017_USA.groupby('forecast_error').impact.count()

forecast_error
accurate    307
better      294
worse       287
Name: impact, dtype: int64

Cool, forexfactory.com publishes inaccurate forecasts around 2/3 of the time (581 of 888 releases)!
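
As a quick check of that fraction, the share can be computed directly from the counts printed above. This is only a sketch: the `counts` series is typed in by hand from the groupby output, and `missed_share` is a name made up for this example.

```python
import pandas as pd

# Counts copied from the groupby output above.
counts = pd.Series({'accurate': 307, 'better': 294, 'worse': 287})

# Share of releases whose actual value differed from the forecast.
missed_share = counts.drop('accurate').sum() / counts.sum()
print(f'{missed_share:.1%}')  # prints 65.4%
```

On the real dataframe, `ff_2017_USA['forecast_error'].value_counts(normalize=True)` should give the per-category shares in one call.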

Does Forex Factory always assign the same impact level to every release of the same economic news item?
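
One way to answer this would be to count the distinct impact levels per news item. The sketch below uses a toy frame with the same 'new' and 'impact' columns as `ff_2017_USA`; its rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame with the same 'new' and 'impact' columns as ff_2017_USA;
# the rows are invented for illustration only.
sample = pd.DataFrame({
    'new': ['CPI m/m', 'CPI m/m', 'Building Permits', 'Building Permits'],
    'impact': ['High', 'High', 'High', 'Medium'],
})

# Number of distinct impact levels assigned to each news item.
levels_per_news = sample.groupby('new')['impact'].nunique()

# News items that were rated with more than one impact level.
inconsistent = levels_per_news[levels_per_news > 1]
print(inconsistent)
```

Running the same two lines on `ff_2017_USA` instead of `sample` would list any news items whose impact level changes between releases.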

--- 

## Next steps on Forex Factory <a name="next_forex"></a>

### Sanity checks:

 - No missing weeks.
 - Same news released each year, with the same cadence.
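
The "no missing weeks" check could look roughly like the sketch below. `missing_weeks` is a hypothetical helper, and it assumes that dates in the 'YYYY-MM-DD HH:MM:SS' format shown earlier and ISO week numbering are acceptable for this purpose.

```python
import pandas as pd


def missing_weeks(dates, year):
    """Return the ISO week numbers of `year` that have no entry in `dates`."""
    parsed = pd.to_datetime(pd.Series(list(dates)))
    iso = parsed.dt.isocalendar()
    seen = {int(week) for week in iso.loc[iso['year'] == year, 'week']}
    # 28 December always falls in the last ISO week of its year.
    last_week = pd.Timestamp(year=year, month=12, day=28).isocalendar()[1]
    return [week for week in range(1, last_week + 1) if week not in seen]
```

For example, `missing_weeks(ff_2017['date'], 2017)` should return an empty list if every week of 2017 has at least one scraped row.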

### Data selection:

 - Filter out non-macroeconomic news.
 - Filter out non-USA news.

### Feature engineering:

 - Compute the % error between the forecast and the actual values, taking into account the different units handled (int, float, %, Millions = 'M', Thousands = 'K')
 - Set all timestamps to match the trading pair values obtained from Forexite, i.e. GMT with DST. Otherwise we won't be comparing apples with apples!
 - Split current date and time fields to capture year, month, day of week, hour, time
 - Replace NaN in "forecast_error" and "previous_error" fields with "accurate"
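
The timestamp steps in the list above can be sketched as follows. Both timezones are placeholders: `SOURCE_TZ` stands for whatever timezone the scraper actually saw on the site and `TARGET_TZ` for "GMT with DST", and neither has been verified against Forex Factory or Forexite. `add_time_features` is a hypothetical helper that expects the 'date' and 'time' columns shown in the first `head()` output.

```python
import pandas as pd

SOURCE_TZ = 'America/New_York'  # assumption: timezone of the scraped times
TARGET_TZ = 'Europe/London'     # assumption: stand-in for "GMT with DST"


def add_time_features(df):
    """Return a copy of `df` with a tz-aware timestamp and calendar fields."""
    out = df.copy()
    # Keep only the calendar day, then append the clock time ('10:45am').
    day = pd.to_datetime(out['date']).dt.strftime('%Y-%m-%d')
    stamp = pd.to_datetime(day + ' ' + out['time'],
                           format='%Y-%m-%d %I:%M%p', errors='coerce')
    # Times that cannot be parsed or localised become NaT instead of failing.
    stamp = stamp.dt.tz_localize(SOURCE_TZ, ambiguous='NaT',
                                 nonexistent='NaT').dt.tz_convert(TARGET_TZ)
    out['timestamp'] = stamp
    out['year'] = stamp.dt.year
    out['month'] = stamp.dt.month
    out['day_of_week'] = stamp.dt.dayofweek  # Monday = 0
    out['hour'] = stamp.dt.hour
    return out
```

Keeping the two timezones as module-level constants makes it easy to correct them once the real conventions of both data sources are confirmed.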


