# TOC
- [Imports](#imports)
- [Export the train data files](#export-train-data-file)
- [Exploring data](#exploring-data)
  - [Market data train exploration](#data-train-exploration)
- [Prices](#prices)
- [Feature engineering](#feature-engineering)
  - [Daily percent change](#daily-percent-change)
  - [Moving average](#moving-average)
  - [SMA 5 days](#sma-5days)
  - [EMA 10 days](#ema-10-days)
  - [EMA 20 days](#ema-20-days)
  - [EMA 30 days](#ema-30-days)
  - [EMA 50 days](#ema-50-days)
  - [EMA 100 days](#ema-100-days)
  - [EMA 200 days](#ema-200-days)
- [MACD](#macd)
  - [26-days EMA](#26-days-ema)
  - [12-days EMA](#12-days-ema)
  - [MACD calc](#macd-calc)
  - [Signal line](#signal-line)
  - [Playing with signal](#playing-with-signal)
- [Using the time](#using-time)
  - [Trading on different months](#trading-on-different-months)
  - [Trading on different days](#trading-on-different-days)
- [assetCode](#assetCode)


# Imports <a name="imports"></a>
Let's import the modules that we will use

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn  as sns
import gc
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

# Export the train data files <a name="export-train-data-file"></a>
We have to do something special to access the training data.

In [None]:
from kaggle.competitions import twosigmanews
env = twosigmanews.make_env()

In [None]:
(market_train_df, news_train_df) = env.get_training_data()

In [None]:
# Make a real copy, because the method above can only be called once
market_train = market_train_df.copy()
news_train = news_train_df.copy()

# Exploring data <a name="exploring-data"></a>
Now I will explore the data, starting with the market data.
## Market data train exploration <a name="data-train-exploration"></a>

In [None]:
print("shape market_train ", market_train.shape)
market_train.head(5)

In [None]:
print(market_train.info())

In [None]:
# Let's count the NaN values in each column.
print(market_train.isnull().sum())

In [None]:
print("Fraction of NaN values in returnsClosePrevMktres1: %f" % (15980/4072956))
print("Fraction of NaN values in returnsClosePrevMktres10: %f" % (93010/4072956))

There is a small proportion of NaN values.

EDIT: These NaN values are correct. They can appear when there are no previous values from which to calculate the returns.
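
To illustrate why lagged returns start with NaN, here is a minimal sketch on made-up prices. It uses raw returns, whereas the Mktres columns in the dataset are market-residualized, so it only shows the mechanism and not how those columns were built. The prices and variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical closing prices for one asset (illustrative values only)
close = pd.Series([100.0, 101.0, 99.0, 102.0, 104.0])

# 1-day and 3-day raw returns; the first rows have no earlier price to compare with
ret_1 = close / close.shift(1) - 1
ret_3 = close / close.shift(3) - 1

# Count the undefined returns for each lag
nan_counts = {"ret_1": int(ret_1.isnull().sum()), "ret_3": int(ret_3.isnull().sum())}
print(nan_counts)
```

A longer lag leaves more undefined rows per asset, which is consistent with the 10-day column having more NaN values than the 1-day column.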

### Dtypes columns

In [None]:
market_train.dtypes

### Number of unique values


In [None]:
market_train.nunique()

### Describe dataframe

In [None]:
market_train.describe(include='all')

# Prices <a name="prices"></a>

In [None]:
aapl_jan = market_train.query("time.dt.year == 2010 and assetCode == 'AAPL.O'")
aapl_jan

In [None]:
plt.figure(figsize=(10,6))
# plt.plot(range(len(aapl_jan.time)), aapl_jan.close, label='Close price')
# plt.plot(range(len(aapl_jan.time)), aapl_jan.open, label='Open price')
plt.title("Opening and closing price")
plt.plot(aapl_jan.time, aapl_jan.open, label='Open price')
plt.plot(aapl_jan.time, aapl_jan.close, label='Close price')
plt.legend()
plt.show()

In [None]:
plt.figure(figsize=(10,6))
plt.title("Open and close market-residualized returns (1 day)")
plt.bar(range(len(aapl_jan.time)), aapl_jan.returnsOpenPrevMktres1, label='Return Open price')
plt.bar(range(len(aapl_jan.time)), aapl_jan.returnsClosePrevMktres1, label='Return Close price')
plt.legend()
plt.show()

# Feature engineering  <a name="feature-engineering"></a>
## Daily percent change <a name="daily-percent-change"></a>
This is the percentage change in the price of a security. It can be calculated this way:

[(New Price - Old Price)/Old Price], multiplied by 100. The result is positive when the price increased and negative when it decreased. (The source states the decrease as [(Old Price - New Price)/Old Price] multiplied by 100, which gives the size of the drop as a positive number.) The code below uses the signed form and leaves the result as a fraction rather than multiplying by 100.

This way you can track the price of an asset, as well as compare the values of different currencies.

Source: https://www.investopedia.com/terms/p/percentage-change.asp
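
As a small worked example of the formula, the sketch below computes the signed change on made-up prices and checks that it matches the ratio form `close / close.shift(1) - 1` used in the cells that follow. The prices and variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical closing prices (illustrative values, not from the dataset)
close = pd.Series([100.0, 110.0, 99.0])

# Signed daily change: (new - old) / old, negative when the price falls
signed_change = (close - close.shift(1)) / close.shift(1)

# Equivalent ratio form: new / old - 1
ratio_change = close / close.shift(1) - 1

# Multiply by 100 to express the change as a percentage
print((signed_change * 100).round(2).tolist())
```

The first value is NaN because the first day has no previous close.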

In [None]:
aapl_daily_pct_change = aapl_jan.close / aapl_jan.close.shift(1) - 1
aapl_daily_pct_change.hist(bins=50)

In [None]:
market_train = market_train.assign(
    daily_percent_price=market_train.groupby('assetCode',
                                            as_index=False).apply(lambda x: x.close / x.close.shift(1) - 1)
    .reset_index(0, drop=True)
)

Let's look at the daily_percent_price distribution for a few stocks.

In [None]:
plt.figure(figsize=(12,8))
ax1 = plt.subplot(221)
market_train.query("time.dt.year == 2016 and assetCode == 'AAPL.O'")['daily_percent_price'].hist(bins=50)
ax2 = plt.subplot(222)
market_train.query("time.dt.year == 2016 and assetCode == 'YPF.N'")['daily_percent_price'].hist(bins=50)
ax3 = plt.subplot(223)
market_train.query("time.dt.year == 2016 and assetCode == 'A.N'")['daily_percent_price'].hist(bins=50)
ax4 = plt.subplot(224)
market_train.query("time.dt.year == 2016 and assetCode == 'CMC.N'")['daily_percent_price'].hist(bins=50)
plt.show()

## Moving average <a name="moving-average"></a>
This is a technical analysis tool that helps us identify the price trend in the short, medium and long term. The average is taken over a specific number of periods (e.g. 10, 20, 30, 100, 200). A period can be a second, a minute, a day, a week, etc. A moving average helps cut down the amount of "noise" on a price chart.

Sometimes the moving average can act as a support or resistance level.

There are different kinds of moving average:

- SMA (simple moving average): this adds up the N most recent daily closing prices and divides by N, which gives one average for each day.
- EMA (exponential moving average): it gives more weight to the most recent prices.

Commonly used windows are 5 days (with an SMA) and 10, 20, 50, 100 and 200 days (at least from what I have read), so we will calculate these averages.

Source: https://www.investopedia.com/articles/active-trading/052014/how-use-moving-average-buy-stocks.asp
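
The two definitions can be sketched on a toy series. This is only an illustration on made-up prices: the 3-day window and the choice of `adjust=False` (the plain recursive form) are assumptions made to keep the arithmetic easy to follow by hand.

```python
import pandas as pd

# Hypothetical closing prices (illustrative values only)
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])

# SMA: plain mean of the last 3 closes; undefined until 3 values exist
sma_3 = close.rolling(window=3).mean()

# EMA with span=3: smoothing factor alpha = 2 / (span + 1) = 0.5
ema_span_3 = close.ewm(span=3, adjust=False).mean()

# The same EMA written out by hand as a recursion
alpha = 2 / (3 + 1)
manual = [close.iloc[0]]
for price in close.iloc[1:]:
    manual.append(alpha * price + (1 - alpha) * manual[-1])

print(sma_3.tolist())
print(ema_span_3.round(4).tolist())
```

Note that the first positional argument of `ewm` is `com` (center of mass), with alpha = 1 / (1 + com), so `ewm(3)` and `ewm(span=3)` produce different averages.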



### SMA 5 days <a name="sma-5days"></a>

In [None]:
market_train = market_train.assign(
    sma_5=market_train.groupby(['assetCode'], 
                     as_index=False)[['close']]
    .rolling(window=5).mean().reset_index(0, drop=True))

### EMA 10 days <a name="ema-10-days"></a>

In [None]:
market_train = market_train.assign(
    # span=10 gives the conventional 10-day EMA; a bare ewm(10) would set com=10 instead
    ema_10=market_train.groupby(['assetCode'], as_index=False)
    .apply(lambda g: g.close.ewm(span=10).mean()).reset_index(0, drop=True)
)

### EMA 20 days <a name="ema-20-days"></a>

In [None]:
market_train = market_train.assign(
    # span=20 gives the conventional 20-day EMA; a bare ewm(20) would set com=20 instead
    ema_20=market_train.groupby(['assetCode'], as_index=False)
    .apply(lambda g: g.close.ewm(span=20).mean()).reset_index(0, drop=True)
)

### EMA 30 days <a name="ema-30-days"></a>

In [None]:
market_train = market_train.assign(
    # span=30 gives the conventional 30-day EMA; a bare ewm(30) would set com=30 instead
    ema_30=market_train.groupby(['assetCode'], as_index=False)
    .apply(lambda g: g.close.ewm(span=30).mean()).reset_index(0, drop=True)
)

### EMA 50 days <a name="ema-50-days"></a>

In [None]:
market_train = market_train.assign(
    # span=50 gives the conventional 50-day EMA; a bare ewm(50) would set com=50 instead
    ema_50=market_train.groupby(['assetCode'], as_index=False)
    .apply(lambda g: g.close.ewm(span=50).mean()).reset_index(0, drop=True)
)

### EMA 100 days <a name="ema-100-days"></a>

In [None]:
market_train = market_train.assign(
    # span=100 gives the conventional 100-day EMA; a bare ewm(100) would set com=100 instead
    ema_100=market_train.groupby(['assetCode'], as_index=False)
    .apply(lambda g: g.close.ewm(span=100).mean()).reset_index(0, drop=True)
)

### EMA 200 days <a name="ema-200-days"></a>

In [None]:
market_train = market_train.assign(
    # span=200 gives the conventional 200-day EMA; a bare ewm(200) would set com=200 instead
    ema_200=market_train.groupby(['assetCode'], as_index=False)
    .apply(lambda g: g.close.ewm(span=200).mean()).reset_index(0, drop=True)
)

In [None]:
plt.figure(figsize=(10, 8))
plt.title("Moving average for AAPL. 2016")
market_train.query("time.dt.year == 2016 and assetCode == 'AAPL.O'").close.plot(legend=True)
market_train.query("time.dt.year == 2016 and assetCode == 'AAPL.O'").sma_5.plot(legend=True)
market_train.query("time.dt.year == 2016 and assetCode == 'AAPL.O'").ema_10.plot(legend=True)
market_train.query("time.dt.year == 2016 and assetCode == 'AAPL.O'").ema_20.plot(legend=True)
market_train.query("time.dt.year == 2016 and assetCode == 'AAPL.O'").ema_30.plot(legend=True)
market_train.query("time.dt.year == 2016 and assetCode == 'AAPL.O'").ema_50.plot(legend=True)
market_train.query("time.dt.year == 2016 and assetCode == 'AAPL.O'").ema_100.plot(legend=True)
market_train.query("time.dt.year == 2016 and assetCode == 'AAPL.O'").ema_200.plot(legend=True)
plt.show()

Here we can see that, in the long term, the trend for AAPL is bearish (looking at the 100-day EMA), although it declines slowly. The 100-day EMA also traces a U shape, which may indicate a change of trend.

In the short term, the trend is bullish.

# MACD <a name="macd"></a>
Moving average convergence divergence (MACD) is a trend-following momentum indicator that shows the relationship between two moving averages of prices. The MACD is calculated by subtracting the 26-day exponential moving average (EMA) from the 12-day EMA. A nine-day EMA of the MACD, called the "signal line", is then plotted on top of the MACD, functioning as a trigger for buy and sell signals.

MACD can be interpreted using 3 different methods:
* Crossover: When the MACD falls below the signal line (9-day EMA), this is a bearish signal. When the MACD rises above the signal line, this is a bullish signal.

* Divergence: When the price diverges from the MACD, it signals the end of the current trend.

* Dramatic rise: When the MACD rises dramatically, that is, the shorter moving average pulls away from the longer-term moving average, this is a signal that the stock is overbought.

Source: https://www.investopedia.com/terms/m/macd.asp
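
The definition above can be sketched on a synthetic price path. This is a minimal illustration, not the notebook's pipeline: the prices are invented (a steady rise followed by a steady fall), the variable names are hypothetical, and `adjust=False` is an assumption chosen to get the plain recursive EMA.

```python
import numpy as np
import pandas as pd

# Hypothetical price path: a steady rise followed by a steady fall
close = pd.Series(np.concatenate([np.linspace(100, 130, 60), np.linspace(130, 100, 60)]))

# Fast and slow EMAs with the conventional 12- and 26-day spans
ema_fast = close.ewm(span=12, adjust=False).mean()
ema_slow = close.ewm(span=26, adjust=False).mean()

# MACD line, its 9-day EMA (the signal line), and their difference
macd = ema_fast - ema_slow
signal = macd.ewm(span=9, adjust=False).mean()
histogram = macd - signal

# Inspect one point in the rise, the peak, and one point in the fall
summary = pd.DataFrame({"macd": macd, "signal": signal, "hist": histogram})
print(summary.iloc[[30, 59, 90]].round(3))
```

During the rise the fast EMA tracks the price more closely than the slow one, so the MACD is positive; during the fall the sign flips.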

Let's calculate the 26-day EMA, the 12-day EMA and the 9-day EMA of the MACD.

## 26-days EMA <a name="26-days-ema"></a>

In [None]:
market_train = market_train.assign(
    # span=26 gives the conventional 26-day EMA; a bare ewm(26) would set com=26 instead
    ema_26=market_train.groupby(['assetCode'], as_index=False)
    .apply(lambda g: g.close.ewm(span=26).mean()).reset_index(0, drop=True)
)

## 12-days EMA <a name="12-days-ema"></a>

In [None]:
market_train = market_train.assign(
    # span=12 gives the conventional 12-day EMA; a bare ewm(12) would set com=12 instead
    ema_12=market_train.groupby(['assetCode'], as_index=False)
    .apply(lambda g: g.close.ewm(span=12).mean()).reset_index(0, drop=True)
)

## MACD calc <a name="macd-calc"></a>

In [None]:
market_train['MACD'] = market_train.ema_12 - market_train.ema_26

In [None]:
market_train.tail(1)

## Signal line <a name="signal-line"></a>

In [None]:
market_train = market_train.assign(
    # span=9 gives the conventional 9-day signal line; a bare ewm(9) would set com=9 instead
    signal_line_macd=market_train.groupby(['assetCode'], as_index=False)
    .apply(lambda g: g.MACD.ewm(span=9).mean()).reset_index(0, drop=True)
)

Let's draw the MACD for AAPL in 2011

In [None]:
query = market_train.query("time.dt.year == 2011 and assetCode == 'AAPL.O'")
f1, ax1 = plt.subplots(figsize=(8,4))
ax1.plot(query.index, query.close, color='black', lw=2, label='Close Price')
ax1.legend(loc='upper right')
ax1.set(title="Close Price for AAPL. 2011", ylabel='Price')
f2, ax2 = plt.subplots(figsize=(8,4))
ax2.plot(query.index, query.MACD, color='green', lw=1, label='MACD Line (26, 12)')
ax2.plot(query.index, query.signal_line_macd, color='purple', lw=1, label='Signal')
ax2.fill_between(query.index, query.MACD - query.signal_line_macd, color='gray', alpha=0.5, label='MACD Histogram')
ax2.set(title='MACD for AAPL. 2011', ylabel='MACD')
ax2.legend(loc='upper right')
plt.show()

We can see that the crossover method could be a good strategy: when the signal line falls below the MACD, it is a sign of price growth.

Let's store this method in a variable:

## Playing with signal <a name="playing-with-signal"></a>
Following the first method for interpreting the MACD, we will try to save the crossover signals. To do this, we need to know when the MACD and the signal line (SL) cross. If the MACD crosses above the SL, this is a bullish signal; otherwise it is a bearish signal. So, here we go.
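
The idea can be sketched on a handful of made-up values before applying it to the full dataset: mark the days where the MACD is above the signal line, then take the day-to-day difference of that marker. The numbers and variable names below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical MACD and signal-line values (illustrative only)
macd = pd.Series([-0.5, -0.2, 0.1, 0.4, 0.2, -0.1, -0.3])
signal = pd.Series([-0.3, -0.25, -0.05, 0.2, 0.25, 0.1, -0.1])

# 1.0 while the MACD is above the signal line, 0.0 otherwise
above = pd.Series(np.where(macd > signal, 1.0, 0.0))

# The day-to-day difference is +1 on an upward cross and -1 on a downward cross
crossover = above.diff()

print(crossover.tolist())
```

Days without a cross get 0, and the first day is NaN because it has no previous day to compare with.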

In [None]:
market_train['signal_crossover_macd'] = 0.0

In [None]:
market_train.signal_crossover_macd = np.where(market_train.MACD > market_train.signal_line_macd, 1.0, 0.0)

In [None]:
# diff() keeps the original row index, so the result aligns without resetting it
market_train['signal_crossover_macd'] = market_train.groupby('assetCode')['signal_crossover_macd'].diff()

So now we have a strategy based on the crossover method. When the MACD crosses below the signal line (signal_crossover_macd == -1.0) we should go short; when it crosses above (signal_crossover_macd == 1.0) we should go long.

# Using the time <a name='using-time'></a>
## Trading on different months <a name='trading-on-different-months'></a>

Are there certain months with more trading activity? Let's see!

In [None]:
market_train['month'] = market_train['time'].dt.month

In [None]:
market_train.groupby('month')['volume'].sum().plot(figsize=(10,8))

We can see that there are more transactions in October.

Now I want to know which years in the dataset have the most volume (I expect the most recent years).

In [None]:
market_train['year'] = market_train['time'].dt.year

In [None]:
# market_train.groupby(['year', 'month']).sum()['volume'].heatmap(figsize=(10,8))
df = market_train.pivot_table(index='year', columns='month', values='volume', aggfunc=np.sum)
plt.figure(figsize=(10, 8))
sns.heatmap(df, annot=False, fmt=".1f")
plt.show()

OK, we can see that in October 2008 there was a lot of volume! What happened there? This happened: https://en.wikipedia.org/wiki/Financial_crisis_of_2007%E2%80%932008

This is not the result I expected, but I can see that the volume grows in the last year. There was also a lot of volume in 2009.

## Trading on different days <a name='trading-on-different-days'></a>

Now I want to know which days of the week have the most volume.


In [None]:
market_train['day'] = market_train['time'].dt.dayofweek

In [None]:
market_train.groupby('day')['volume'].sum().plot(figsize=(10,8))

We can see that the busiest days are Wednesday and Thursday.
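
To read the x-axis of that plot, it helps to recall how pandas numbers the days of the week. The sketch below uses made-up timestamps for a single week; the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical timestamps covering one trading week (Monday 2016-01-04 to Friday 2016-01-08)
times = pd.Series(pd.date_range("2016-01-04", periods=5, freq="D"))

# dayofweek encodes Monday as 0 and Sunday as 6
day_table = pd.DataFrame({"dayofweek": times.dt.dayofweek, "name": times.dt.day_name()})
print(day_table)
```

So the values 2 and 3 on the plot correspond to Wednesday and Thursday.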

# assetCode <a name="assetCode"></a>
Each assetCode contains a dot followed by a symbol; maybe that symbol represents a market.
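
As a quick sketch of the split, the example below separates the part before the dot from the suffix for a few asset codes that appear earlier in this notebook. The column names `symbol` and `suffix` are hypothetical, chosen only for this illustration.

```python
import pandas as pd

# Asset codes that appear earlier in this notebook
codes = pd.Series(["AAPL.O", "YPF.N", "A.N", "CMC.N"])

# n=1 splits on the first dot only, giving exactly two columns
parts = codes.str.split(".", n=1, expand=True)
parts.columns = ["symbol", "suffix"]

# Count how many codes carry each suffix
suffix_counts = parts["suffix"].value_counts().to_dict()
print(suffix_counts)
```

Limiting the split to the first dot keeps the result at two columns even if a code were to contain more than one dot.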

In [None]:
# Split assetCode once and reuse the result for both new columns
code_parts = market_train.assetCode.str.split('.', expand=True)
market_train['ticket'] = code_parts.iloc[:, 0]
market_train['market'] = code_parts.iloc[:, 1]
market_train.market.value_counts()

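Looking at Wikipedia: https://en.wikipedia.org/wiki/Ticker_symbol

These are the meanings it lists for each letter:

- N: third class – preferred shares
- O: second class – preferred shares
- A: Class "A"
- Q: In bankruptcy
- B: Class "B"
- P: first class preferred shares

Note that these letters describe share-class suffixes. In Reuters-style instrument codes such as AAPL.O the suffix usually identifies the exchange instead (for example, .N for the New York Stock Exchange and .O for NASDAQ), so the market interpretation above is probably the right one.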

