# Project 2: Breakout Strategy
## Instructions
Each problem consists of a function to implement and instructions on how to implement the function.  The parts of the function that need to be implemented are marked with a `# TODO` comment. After implementing the function, run the cell to test it against the unit tests we've provided. For each problem, we provide one or more unit tests from our `project_tests` package. These unit tests won't tell you if your answer is correct, but will warn you of any major errors. Your code will be checked for the correct solution when you submit it to Udacity.

## Packages
When you implement the functions, you'll only need to you use the packages you've used in the classroom, like [Pandas](https://pandas.pydata.org/) and [Numpy](http://www.numpy.org/). These packages will be imported for you. We recommend you don't add any import statements, otherwise the grader might not be able to run your code.

The other packages that we're importing are `helper`, `project_helper`, and `project_tests`. These are custom packages built to help you solve the problems.  The `helper` and `project_helper` module contains utility functions and graph functions. The `project_tests` contains the unit tests for all the problems.

### Install Packages

In [1]:
# import sys
# !{sys.executable} -m pip install -r requirements.txt

### Load Packages

In [2]:
import pandas as pd
import numpy as np
import helper
import project_helper
import project_tests

## Market Data
### Load Data
While using real data will give you hands on experience, it's doesn't cover all the topics we try to condense in one project. We'll solve this by creating new stocks. We've create a scenario where companies mining [Terbium](https://en.wikipedia.org/wiki/Terbium) are making huge profits. All the companies in this sector of the market are made up. They represent a sector with large growth that will be used for demonstration latter in this project.

In [6]:
df_original = pd.read_csv('../../data/project_2/eod-quotemedia.csv', parse_dates=['date'], index_col=False)

# Add TB sector to the market
df = df_original
df = pd.concat([df] + project_helper.generate_tb_sector(df[df['ticker'] == 'AAPL']['date']), ignore_index=True)

close = df.reset_index().pivot(index='date', columns='ticker', values='adj_close')
high = df.reset_index().pivot(index='date', columns='ticker', values='adj_high')
low = df.reset_index().pivot(index='date', columns='ticker', values='adj_low')

print('Loaded Data')

Loaded Data


### View Data
To see what one of these 2-d matrices looks like, let's take a look at the closing prices matrix.

In [9]:
close

ticker,A,AAL,AAP,AAPL,ABBV,ABC,ABT,ACN,ADBE,ADI,...,XL,XLNX,XOM,XRAY,XRX,XYL,YUM,ZBH,ZION,ZTS
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2013-07-01,29.99418563,16.17609308,81.13821681,53.10917319,34.92447839,50.86319750,31.42538772,64.69409505,46.23500000,39.91336014,...,27.66879066,35.28892781,76.32080247,40.02387348,22.10666494,25.75338607,45.48038323,71.89882693,27.85858718,29.44789315
2013-07-02,29.65013670,15.81983388,80.72207258,54.31224742,35.42807578,50.69676639,31.27288084,64.71204071,46.03000000,39.86057632,...,27.54228410,35.05903252,76.60816761,39.96552964,22.08273998,25.61367511,45.40266113,72.93417195,28.03893238,28.57244125
2013-07-03,29.70518453,16.12794994,81.23729877,54.61204262,35.44486235,50.93716689,30.72565028,65.21451912,46.42000000,40.18607651,...,27.33445191,35.28008569,76.65042719,40.00442554,22.20236479,25.73475794,46.06329899,72.30145844,28.18131017,28.16838652
2013-07-05,30.43456826,16.21460758,81.82188233,54.17338125,35.85613355,51.37173702,31.32670680,66.07591068,47.00000000,40.65233352,...,27.69589920,35.80177117,77.39419581,40.67537968,22.58516418,26.06075017,46.41304845,73.16424628,29.39626730,29.02459772
2013-07-08,30.52402098,16.31089385,82.95141667,53.86579916,36.66188936,52.03746147,31.76628544,66.82065546,46.62500000,40.25645492,...,27.98505704,35.20050655,77.96892611,40.64620776,22.48946433,26.22840332,46.95062632,73.89282298,29.57661249,29.76536472
2013-07-09,30.68916447,16.71529618,82.43619048,54.81320389,36.35973093,51.69535307,31.16522893,66.48866080,47.26000000,40.69632003,...,28.31939579,35.50113886,78.89018496,40.80179133,22.48946433,26.58233774,47.28094525,73.70108798,28.91218282,29.80384612
2013-07-10,31.17771395,16.53235227,81.99032166,54.60295791,36.85493502,52.28710814,31.16522893,66.71298151,47.25000000,41.10979324,...,27.95794850,36.39419366,78.45068533,40.71427558,22.96796358,26.98284247,47.08340158,74.00785631,28.32368796,29.86156823
2013-07-11,31.45983407,16.72492481,82.00022986,55.45406479,37.08155384,53.72026495,31.85599537,67.47567196,47.99000000,42.22705062,...,28.50011944,37.00430040,78.83102155,41.01571874,23.23113816,27.03872686,46.54333492,74.93774876,27.84909533,29.74612402
2013-07-12,31.48047700,16.90786872,81.91105609,55.35309481,38.15724076,53.98840397,31.81096287,67.76280247,48.39000000,42.53495620,...,28.92482002,38.00346072,78.94089646,40.83096325,23.49431274,27.08529718,45.96422730,75.68549560,28.44708204,30.15979909
2013-07-15,31.72819223,17.10044125,82.61453801,55.47379158,37.79303181,53.84971137,31.95506689,68.41781897,48.12000000,42.57894271,...,29.27723113,38.17146113,78.81411772,40.84068723,23.54216266,27.06666905,46.69299195,76.27027369,28.77929688,30.38106716


### Stock Example
Let's see what a single stock looks like from the closing prices. For this example and future display examples in this project, we'll use Apple's stock (AAPL). If we tried to graph all the stocks, it would be too much information.

In [10]:
apple_ticker = 'AAPL'
project_helper.plot_stock(close[apple_ticker], '{} Stock'.format(apple_ticker))

## The Alpha Research Process

In this project you will code and evaluate a "breakout" signal. It is important to understand where these steps fit in the alpha research workflow. The signal-to-noise ratio in trading signals is very low and, as such, it is very easy to fall into the trap of _overfitting_ to noise. It is therefore inadvisable to jump right into signal coding. To help mitigate overfitting, it is best to start with a general observation and hypothesis; i.e., you should be able to answer the following question _before_ you touch any data:

> What feature of markets or investor behaviour would lead to a persistent anomaly that my signal will try to use?

Ideally the assumptions behind the hypothesis will be testable _before_ you actually code and evaluate the signal itself. The workflow therefore is as follows:

![image](images/alpha_steps.png)

In this project, we assume that the first three steps area done ("observe & research", "form hypothesis", "validate hypothesis"). The hypothesis you'll be using for this project is the following:
- In the absence of news or significant investor trading interest, stocks oscillate in a range.
- Traders seek to capitalize on this range-bound behaviour periodically by selling/shorting at the top of the range and buying/covering at the bottom of the range. This behaviour reinforces the existence of the range.
- When stocks break out of the range, due to, e.g., a significant news release or from market pressure from a large investor:
    - the liquidity traders who have been providing liquidity at the bounds of the range seek to cover their positions to mitigate losses, thus magnifying the move out of the range, _and_
    - the move out of the range attracts other investor interest; these investors, due to the behavioural bias of _herding_ (e.g., [Herd Behavior](https://www.investopedia.com/university/behavioral_finance/behavioral8.asp)) build positions which favor continuation of the trend.


Using this hypothesis, let start coding..
## Compute the Highs and Lows in a Window
You'll use the price highs and lows as an indicator for the breakout strategy. In this section, implement `get_high_lows_lookback` to get the maximum high price and minimum low price over a window of days. The variable `lookback_days` contains the number of days to look in the past. Make sure this doesn't include the current day.

In [22]:
def get_high_lows_lookback(high, low, lookback_days):
    """
    Get the highs and lows in a lookback window.
    
    Parameters
    ----------
    high : DataFrame
        High price for each ticker and date
    low : DataFrame
        Low price for each ticker and date
    lookback_days : int
        The number of days to look back
    
    Returns
    -------
    lookback_high : DataFrame
        Lookback high price for each ticker and date
    lookback_low : DataFrame
        Lookback low price for each ticker and date
    """
    #TODO: Implement function
#     print high
#     print high.shift(1)
#     print high.shift(1).rolling(lookback_days).max()
#     print lookback_days
    return high.shift(1).rolling(lookback_days).max(), low.shift(1).rolling(lookback_days).min()

project_tests.test_get_high_lows_lookback(get_high_lows_lookback)

                  EPPL        DMTE        RKPE
2005-11-24 35.44110000 34.17990000 34.02230000
2005-11-25 92.11310000 91.05430000 90.95720000
2005-11-26 57.97080000 57.78140000 58.19820000
2005-11-27 34.17050000 92.45300000 58.51070000
                  EPPL        DMTE        RKPE
2005-11-24         nan         nan         nan
2005-11-25 35.44110000 34.17990000 34.02230000
2005-11-26 92.11310000 91.05430000 90.95720000
2005-11-27 57.97080000 57.78140000 58.19820000
                  EPPL        DMTE        RKPE
2005-11-24         nan         nan         nan
2005-11-25         nan         nan         nan
2005-11-26 92.11310000 91.05430000 90.95720000
2005-11-27 92.11310000 91.05430000 90.95720000
2
Tests Passed


### View Data
Let's use your implementation of `get_high_lows_lookback` to get the highs and lows for the past 50 days and compare it to it their respective stock.  Just like last time, we'll use Apple's stock as the example to look at.

In [23]:
lookback_days = 50
lookback_high, lookback_low = get_high_lows_lookback(high, low, lookback_days)
project_helper.plot_high_low(
    close[apple_ticker],
    lookback_high[apple_ticker],
    lookback_low[apple_ticker],
    'High and Low of {} Stock'.format(apple_ticker))

ticker       A  AAL  AAP  AAPL  ABBV  ABC  ABT  ACN  ADBE  ADI ...   XL  XLNX  \
date                                                           ...              
2013-07-01 nan  nan  nan   nan   nan  nan  nan  nan   nan  nan ...  nan   nan   
2013-07-02 nan  nan  nan   nan   nan  nan  nan  nan   nan  nan ...  nan   nan   
2013-07-03 nan  nan  nan   nan   nan  nan  nan  nan   nan  nan ...  nan   nan   
2013-07-05 nan  nan  nan   nan   nan  nan  nan  nan   nan  nan ...  nan   nan   
2013-07-08 nan  nan  nan   nan   nan  nan  nan  nan   nan  nan ...  nan   nan   
2013-07-09 nan  nan  nan   nan   nan  nan  nan  nan   nan  nan ...  nan   nan   
2013-07-10 nan  nan  nan   nan   nan  nan  nan  nan   nan  nan ...  nan   nan   
2013-07-11 nan  nan  nan   nan   nan  nan  nan  nan   nan  nan ...  nan   nan   
2013-07-12 nan  nan  nan   nan   nan  nan  nan  nan   nan  nan ...  nan   nan   
2013-07-15 nan  nan  nan   nan   nan  nan  nan  nan   nan  nan ...  nan   nan   
2013-07-16 nan  nan  nan   n

## Compute Long and Short Signals
Using the generated indicator of highs and lows, create long and short signals using a breakout strategy. Implement `get_long_short` to generate the following signals:

| Signal | Condition |
|----|------|
| -1 | Low > Close Price |
| 1  | High < Close Price |
| 0  | Otherwise |

In this chart, **Close Price** is the `close` parameter. **Low** and **High** are the values generated from `get_high_lows_lookback`, the `lookback_high` and `lookback_low` parameters.

In [41]:
def get_long_short(close, lookback_high, lookback_low):
    """
    Generate the signals long, short, and do nothing.
    
    Parameters
    ----------
    close : DataFrame
        Close price for each ticker and date
    lookback_high : DataFrame
        Lookback high price for each ticker and date
    lookback_low : DataFrame
        Lookback low price for each ticker and date
    
    Returns
    -------
    long_short : DataFrame
        The long, short, and do nothing signals for each ticker and date
    """
    #TODO: Implement function
#     print close
#     print lookback_high
#     print lookback_low

    higher_high = close - lookback_high
    lower_low = close - lookback_low
    
#     print higher_high
#     print lower_low 
#     print higher_high[higher_high > 0]

    singal = close.copy()
    singal.loc[:,:] = 0
    singal[higher_high > 0] = 1
    singal[lower_low < 0] = -1
    singal = singal.astype('int')
    return singal

project_tests.test_get_long_short(get_long_short)

Tests Passed


### View Data
Let's compare the signals you generated against the close prices. This chart will show a lot of signals. Too many in fact. We'll talk about filtering the redundant signals in the next problem. 

In [42]:
signal = get_long_short(close, lookback_high, lookback_low)
project_helper.plot_signal(
    close[apple_ticker],
    signal[apple_ticker],
    'Long and Short of {} Stock'.format(apple_ticker))

## Filter Signals
That was a lot of repeated signals! If we're already shorting a stock, having an additional signal to short a stock isn't helpful for this strategy. This also applies to additional long signals when the last signal was long.

Implement `filter_signals` to filter out repeated long or short signals within the `lookahead_days`. If the previous signal was the same, change the signal to `0` (do nothing signal). For example, say you have a single stock time series that is

`[1, 0, 1, 0, 1, 0, -1, -1]`

Running `filter_signals` with a lookahead of 3 days should turn those signals into

`[1, 0, 0, 0, 1, 0, -1, 0]`

To help you implement the function, we have provided you with the `clear_signals` function. This will remove all signals within a window after the last signal. For example, say you're using a windows size of 3 with `clear_signals`. It would turn the Series of long signals

`[0, 1, 0, 0, 1, 1, 0, 1, 0]`

into

`[0, 1, 0, 0, 0, 1, 0, 0, 0]`

`clear_signals` only takes a Series of the same type of signals, where `1` is the signal and `0` is no signal. It can't take a mix of long and short signals. Using this function, implement `filter_signals`. 

For implementing `filter_signals`, we don't reccommend you try to find a vectorized solution. Instead, you should use the [`iterrows`](https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.iterrows.html) over each column.

In [74]:
def clear_signals(signals, window_size):
    """
    Clear out signals in a Series of just long or short signals.
    
    Remove the number of signals down to 1 within the window size time period.
    
    Parameters
    ----------
    signals : Pandas Series
        The long, short, or do nothing signals
    window_size : int
        The number of days to have a single signal       
    
    Returns
    -------
    signals : Pandas Series
        Signals with the signals removed from the window size
    """
    # Start with buffer of window size
    # This handles the edge case of calculating past_signal in the beginning
    clean_signals = [0]*window_size
    
    for signal_i, current_signal in enumerate(signals):
        # Check if there was a signal in the past window_size of days
        has_past_signal = bool(sum(clean_signals[signal_i:signal_i+window_size]))
        # Use the current signal if there's no past signal, else 0/False
        clean_signals.append(not has_past_signal and current_signal)
        
    # Remove buffer
    clean_signals = clean_signals[window_size:]

    # Return the signals as a Series of Ints
    return pd.Series(np.array(clean_signals).astype(np.int), signals.index)


def filter_signals(signal, lookahead_days):
    """
    Filter out signals in a DataFrame.
    
    Parameters
    ----------
    signal : DataFrame
        The long, short, and do nothing signals for each ticker and date
    lookahead_days : int
        The number of days to look ahead
    
    Returns
    -------
    filtered_signal : DataFrame
        The filtered long, short, and do nothing signals for each ticker and date
    """
    #TODO: Implement function
    print signal
    print lookahead_days
    
    pos_sig = signal.copy()
    neg_sig = signal.copy()
    pos_sig[pos_sig <=0] = 0
    neg_sig[neg_sig >=0] = 0
    
    print pos_sig
    for pos_label, pos_j in pos_sig.iteritems():
        pos_sig[pos_label] = clear_signals(pos_j, lookahead_days)
    print pos_sig
    
    for neg_label, neg_j in neg_sig.iteritems():
        neg_sig[neg_label] = clear_signals(neg_j, lookahead_days)
    
    return pos_sig + neg_sig

project_tests.test_filter_signals(filter_signals)

            KBX  DKFI  NKD
2006-06-04    0     0    0
2006-06-05   -1    -1   -1
2006-06-06    1     0   -1
2006-06-07    0     0    0
2006-06-08    1     0    0
2006-06-09    0     1    0
2006-06-10    0     0    1
2006-06-11    0    -1    1
2006-06-12   -1     0    0
2006-06-13    0     0    0
3
            KBX  DKFI  NKD
2006-06-04    0     0    0
2006-06-05    0     0    0
2006-06-06    1     0    0
2006-06-07    0     0    0
2006-06-08    1     0    0
2006-06-09    0     1    0
2006-06-10    0     0    1
2006-06-11    0     0    1
2006-06-12    0     0    0
2006-06-13    0     0    0
            KBX  DKFI  NKD
2006-06-04    0     0    0
2006-06-05    0     0    0
2006-06-06    1     0    0
2006-06-07    0     0    0
2006-06-08    0     0    0
2006-06-09    0     1    0
2006-06-10    0     0    1
2006-06-11    0     0    0
2006-06-12    0     0    0
2006-06-13    0     0    0
Tests Passed


### View Data
Let's view the same chart as before, but with the redundant signals removed.

In [75]:
signal_5 = filter_signals(signal, 5)
signal_10 = filter_signals(signal, 10)
signal_20 = filter_signals(signal, 20)
for signal_data, signal_days in [(signal_5, 5), (signal_10, 10), (signal_20, 20)]:
    project_helper.plot_signal(
        close[apple_ticker],
        signal_data[apple_ticker],
        'Long and Short of {} Stock with {} day signal window'.format(apple_ticker, signal_days))

ticker      A  AAL  AAP  AAPL  ABBV  ABC  ABT  ACN  ADBE  ADI ...   XL  XLNX  \
date                                                          ...              
2013-07-01  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-02  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-03  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-05  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-08  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-09  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-10  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-11  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-12  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-15  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-16  0    0    0     0     0    0

ticker      A  AAL  AAP  AAPL  ABBV  ABC  ABT  ACN  ADBE  ADI ...   XL  XLNX  \
date                                                          ...              
2013-07-01  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-02  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-03  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-05  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-08  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-09  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-10  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-11  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-12  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-15  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-16  0    0    0     0     0    0

ticker      A  AAL  AAP  AAPL  ABBV  ABC  ABT  ACN  ADBE  ADI ...   XL  XLNX  \
date                                                          ...              
2013-07-01  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-02  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-03  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-05  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-08  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-09  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-10  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-11  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-12  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-15  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-16  0    0    0     0     0    0

ticker      A  AAL  AAP  AAPL  ABBV  ABC  ABT  ACN  ADBE  ADI ...   XL  XLNX  \
date                                                          ...              
2013-07-01  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-02  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-03  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-05  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-08  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-09  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-10  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-11  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-12  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-15  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-16  0    0    0     0     0    0

ticker      A  AAL  AAP  AAPL  ABBV  ABC  ABT  ACN  ADBE  ADI ...   XL  XLNX  \
date                                                          ...              
2013-07-01  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-02  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-03  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-05  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-08  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-09  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-10  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-11  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-12  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-15  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-16  0    0    0     0     0    0

ticker      A  AAL  AAP  AAPL  ABBV  ABC  ABT  ACN  ADBE  ADI ...   XL  XLNX  \
date                                                          ...              
2013-07-01  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-02  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-03  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-05  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-08  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-09  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-10  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-11  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-12  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-15  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-16  0    0    0     0     0    0

## Lookahead Close Prices
With the trading signal done, we can start working on evaluating how many days to short or long the stocks. In this problem, implement `get_lookahead_prices` to get the close price days ahead in time. You can get the number of days from the variable `lookahead_days`. We'll use the lookahead prices to calculate future returns in another problem.

In [77]:
def get_lookahead_prices(close, lookahead_days):
    """
    Get the lookahead prices for `lookahead_days` number of days.
    
    Parameters
    ----------
    close : DataFrame
        Close price for each ticker and date
    lookahead_days : int
        The number of days to look ahead
    
    Returns
    -------
    lookahead_prices : DataFrame
        The lookahead prices for each ticker and date
    """
    #TODO: Implement function
    
    return close.shift(-lookahead_days)

project_tests.test_get_lookahead_prices(get_lookahead_prices)

Tests Passed


### View Data
Using the `get_lookahead_prices` function, let's generate lookahead closing prices for 5, 10, and 20 days.

Let's also chart a subsection of a few months of the Apple stock instead of years. This will allow you to view the differences between the 5, 10, and 20 day lookaheads. Otherwise, they will mesh together when looking at a chart that is zoomed out.

In [78]:
lookahead_5 = get_lookahead_prices(close, 5)
lookahead_10 = get_lookahead_prices(close, 10)
lookahead_20 = get_lookahead_prices(close, 20)
project_helper.plot_lookahead_prices(
    close[apple_ticker].iloc[150:250],
    [
        (lookahead_5[apple_ticker].iloc[150:250], 5),
        (lookahead_10[apple_ticker].iloc[150:250], 10),
        (lookahead_20[apple_ticker].iloc[150:250], 20)],
    '5, 10, and 20 day Lookahead Prices for Slice of {} Stock'.format(apple_ticker))

## Lookahead Price Returns
Implement `get_return_lookahead` to generate the log price return between the closing price and the lookahead price.

In [79]:
def get_return_lookahead(close, lookahead_prices):
    """
    Calculate the log returns from the lookahead days to the signal day.
    
    Parameters
    ----------
    close : DataFrame
        Close price for each ticker and date
    lookahead_prices : DataFrame
        The lookahead prices for each ticker and date
    
    Returns
    -------
    lookahead_returns : DataFrame
        The lookahead log returns for each ticker and date
    """
    #TODO: Implement function
    
    return np.log(lookahead_prices) - np.log(close)

project_tests.test_get_return_lookahead(get_return_lookahead)

Tests Passed


### View Data
Using the same lookahead prices and same subsection of the Apple stock from the previous problem, we'll view the lookahead returns.

In order to view price returns on the same chart as the stock, a second y-axis will be added. When viewing this chart, the axis for the price of the stock will be on the left side, like previous charts. The axis for price returns will be located on the right side.

In [80]:
price_return_5 = get_return_lookahead(close, lookahead_5)
price_return_10 = get_return_lookahead(close, lookahead_10)
price_return_20 = get_return_lookahead(close, lookahead_20)
project_helper.plot_price_returns(
    close[apple_ticker].iloc[150:250],
    [
        (price_return_5[apple_ticker].iloc[150:250], 5),
        (price_return_10[apple_ticker].iloc[150:250], 10),
        (price_return_20[apple_ticker].iloc[150:250], 20)],
    '5, 10, and 20 day Lookahead Returns for Slice {} Stock'.format(apple_ticker))

## Compute the Signal Return
Using the price returns generate the signal returns.

In [83]:
def get_signal_return(signal, lookahead_returns):
    """
    Compute the signal returns.
    
    Parameters
    ----------
    signal : DataFrame
        The long, short, and do nothing signals for each ticker and date
    lookahead_returns : DataFrame
        The lookahead log returns for each ticker and date
    
    Returns
    -------
    signal_return : DataFrame
        Signal returns for each ticker and date
    """
    #TODO: Implement function
#     print signal, '\n'
#     print lookahead_returns
    return lookahead_returns * signal

project_tests.test_get_signal_return(get_signal_return)

            XMBO  PYN  JRS
2002-02-27     0    0    0
2002-02-28    -1   -1   -1
2002-03-01     1    0    0
2002-03-02     0    0    0
2002-03-03     0    1    0 

                 XMBO         PYN         JRS
2002-02-27 0.88702896  0.96521098  0.65854789
2002-02-28 1.13391240  0.87420969 -0.53914925
2002-03-01 0.35450805 -0.56900529 -0.64808965
2002-03-02 0.38572896 -0.94655617  0.12356438
2002-03-03        nan         nan         nan
Tests Passed


### View Data
Let's continue using the previous lookahead prices to view the signal returns. Just like before, the axis for the signal returns is on the right side of the chart.

In [84]:
title_string = '{} day LookaheadSignal Returns for {} Stock'
signal_return_5 = get_signal_return(signal_5, price_return_5)
signal_return_10 = get_signal_return(signal_10, price_return_10)
signal_return_20 = get_signal_return(signal_20, price_return_20)
project_helper.plot_signal_returns(
    close[apple_ticker],
    [
        (signal_return_5[apple_ticker], signal_5[apple_ticker], 5),
        (signal_return_10[apple_ticker], signal_10[apple_ticker], 10),
        (signal_return_20[apple_ticker], signal_20[apple_ticker], 20)],
    [title_string.format(5, apple_ticker), title_string.format(10, apple_ticker), title_string.format(20, apple_ticker)])

ticker      A  AAL  AAP  AAPL  ABBV  ABC  ABT  ACN  ADBE  ADI ...   XL  XLNX  \
date                                                          ...              
2013-07-01  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-02  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-03  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-05  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-08  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-09  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-10  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-11  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-12  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-15  0    0    0     0     0    0    0    0     0    0 ...    0     0   
2013-07-16  0    0    0     0     0    0

ticker               A         AAL         AAP        AAPL        ABBV  \
date                                                                     
2013-07-01  0.05185519  0.06621123  0.00596581  0.04998569  0.06061668   
2013-07-02  0.05683604  0.11595544  0.00024546  0.02785288  0.04971808   
2013-07-03  0.06131931  0.09503877  0.00620100  0.02570758  0.05740028   
2013-07-05  0.04357651  0.07931158 -0.00449057  0.01787861  0.05036644   
2013-07-08  0.05354504  0.06899287 -0.01165388  0.02676779  0.03820038   
2013-07-09  0.04107308  0.03897698 -0.00602775 -0.00798730  0.03169107   
2013-07-10  0.03150038  0.07460706 -0.01485296  0.04594185  0.00292718   
2013-07-11  0.02035143  0.08070676 -0.00654627  0.02590155  0.01045227   
2013-07-12 -0.00328408  0.08354105 -0.00036295  0.03338639 -0.00578808   
2013-07-15 -0.01684020  0.07952282 -0.01267291  0.04651044  0.00981275   
2013-07-16 -0.02335937  0.06732016 -0.00230905  0.05235968  0.02011140   
2013-07-17 -0.01948257  0.04762805  0.

## Test for Significance
### Histogram
Let's plot a histogram of the signal return values.

In [85]:
project_helper.plot_signal_histograms(
    [signal_return_5, signal_return_10, signal_return_20],
    'Signal Return',
    ('5 Days', '10 Days', '20 Days'))

### Question: What do the histograms tell you about the signal returns?

*#TODO: Put Answer In this Cell*

## Outliers
You might have noticed the outliers in the 10 and 20 day histograms. To better visualize the outliers, let's compare the 5, 10, and 20 day signals returns to normal distributions with the same mean and deviation for each signal return distributions.

In [86]:
project_helper.plot_signal_to_normal_histograms(
    [signal_return_5, signal_return_10, signal_return_20],
    'Signal Return',
    ('5 Days', '10 Days', '20 Days'))

## Kolmogorov-Smirnov Test
While you can see the outliers in the histogram, we need to find the stocks that are causing these outlying returns. We'll use the Kolmogorov-Smirnov Test or KS-Test. This test will be applied to teach ticker's signal returns where a long or short signal exits.

In [87]:
# Filter out returns that don't have a long or short signal.
long_short_signal_returns_5 = signal_return_5[signal_5 != 0].stack()
long_short_signal_returns_10 = signal_return_10[signal_10 != 0].stack()
long_short_signal_returns_20 = signal_return_20[signal_20 != 0].stack()

# Get just ticker and signal return
long_short_signal_returns_5 = long_short_signal_returns_5.reset_index().iloc[:, [1,2]]
long_short_signal_returns_5.columns = ['ticker', 'signal_return']
long_short_signal_returns_10 = long_short_signal_returns_10.reset_index().iloc[:, [1,2]]
long_short_signal_returns_10.columns = ['ticker', 'signal_return']
long_short_signal_returns_20 = long_short_signal_returns_20.reset_index().iloc[:, [1,2]]
long_short_signal_returns_20.columns = ['ticker', 'signal_return']

# View some of the data
long_short_signal_returns_5.head(10)

Unnamed: 0,ticker,signal_return
0,AGENEN,0.02908735
1,BAKERI,0.02434866
2,BIFLOR,0.02010358
3,CLUSIA,0.01655266
4,DASYST,0.02261208
5,GESNER,0.02987758
6,GREIGI,0.02357519
7,KOLPAK,0.00689596
8,LINIFO,0.02801644
9,PRAEST,0.01489702


This gives you the data to use in the KS-Test.

Now it's time to implement the function `calculate_kstest` to use Kolmogorov-Smirnov test (KS test) between a normal distribution and each stock's signal returns. Run KS test on a normal distribution against each stock's signal returns. Use [`scipy.stats.kstest`](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.kstest.html#scipy-stats-kstest) perform the KS test. When calculating the standard deviation of the signal returns, make sure to set the delta degrees of freedom to 0.

For this function, we don't reccommend you try to find a vectorized solution. Instead, you should iterate over the [`groupby`](https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.groupby.html) function.

In [163]:
from scipy.stats import kstest


def calculate_kstest(long_short_signal_returns):
    """
    Calculate the KS-Test against the signal returns with a long or short signal.
    
    Parameters
    ----------
    long_short_signal_returns : DataFrame
        The signal returns which have a signal.
        This DataFrame contains two columns, "ticker" and "signal_return"
    
    Returns
    -------
    ks_values : Pandas Series
        KS static for all the tickers
    p_values : Pandas Series
        P value for all the tickers
    """
    #TODO: Implement function
    ticker_list = []
    ks_values_list = []
    p_values_list = []
    print type(norm.rvs(size=100)), "========"
    print kstest(norm.rvs(size=100), 'norm')
    print long_short_signal_returns
    print long_short_signal_returns.groupby('ticker')
    for i,j in long_short_signal_returns.groupby('ticker'):
        print "=======ticker??====", i
        ticker_list.append(i)
        print j
        print "=========\n",np.array(j.loc[:,'signal_return'])
        temp = np.array(j.loc[:,'signal_return'])
        print "=========\n", type(temp)
        print kstest(temp, 'norm')
        ks_values_list.append(kstest(temp, 'norm',args = (0,))[0])
        p_values_list.append(kstest(temp, 'norm',args = (0,))[1])
    ks_values = pd.Series(ks_values_list, index =ticker_list)
    p_values = pd.Series(p_values_list, index =ticker_list)
    
    return ks_values, p_values


project_tests.test_calculate_kstest(calculate_kstest)

KstestResult(statistic=0.06974890299910269, pvalue=0.7223576454515594)
    signal_return ticker
0      0.12000000    XZM
1     -0.83000000   AINK
2      0.37000000    PAK
3      0.83000000    XZM
4     -0.34000000   AINK
5      0.27000000    PAK
6     -0.68000000    XZM
7      0.29000000   AINK
8      0.69000000    PAK
9      0.57000000    XZM
10     0.39000000   AINK
11     0.56000000    PAK
12    -0.97000000    XZM
13    -0.72000000   AINK
14     0.26000000    PAK
<pandas.core.groupby.DataFrameGroupBy object at 0x1a2062cad0>
    signal_return ticker
1     -0.83000000   AINK
4     -0.34000000   AINK
7      0.29000000   AINK
10     0.39000000   AINK
13    -0.72000000   AINK
[-0.83 -0.34  0.29  0.39 -0.72]
<type 'numpy.ndarray'>
KstestResult(statistic=0.3482682734640177, pvalue=0.4800914400131395)
    signal_return ticker
2      0.37000000    PAK
5      0.27000000    PAK
8      0.69000000    PAK
11     0.56000000    PAK
14     0.26000000    PAK
[0.37 0.27 0.69 0.56 0.26]
<type 'numpy.nd

AssertionError: Wrong value for calculate_kstest.

INPUT long_short_signal_returns:
    signal_return ticker
0      0.12000000    XZM
1     -0.83000000   AINK
2      0.37000000    PAK
3      0.83000000    XZM
4     -0.34000000   AINK
5      0.27000000    PAK
6     -0.68000000    XZM
7      0.29000000   AINK
8      0.69000000    PAK
9      0.57000000    XZM
10     0.39000000   AINK
11     0.56000000    PAK
12    -0.97000000    XZM
13    -0.72000000   AINK
14     0.26000000    PAK

OUTPUT ks_values:
AINK   0.34826827
PAK    0.60256811
XZM    0.20326939
dtype: float64

OUTPUT p_values:
AINK   0.48009144
PAK    0.02898631
XZM    0.98593727
dtype: float64

EXPECTED OUTPUT FOR ks_values:
XZM    0.29787827
AINK   0.35221525
PAK    0.63919407
dtype: float64

EXPECTED OUTPUT FOR p_values:
XZM    0.69536353
AINK   0.46493498
PAK    0.01650327
dtype: float64


In [164]:
ks_values_5, p_values_5 = calculate_kstest(long_short_signal_returns_5)
ks_values_10, p_values_10 = calculate_kstest(long_short_signal_returns_10)
ks_values_20, p_values_20 = calculate_kstest(long_short_signal_returns_20)

print('ks_values_5')
print(ks_values_5.head(10))
print('p_values_5')
print(p_values_5.head(10))

KstestResult(statistic=0.08798493054751688, pvalue=0.4012900628512701)
      ticker  signal_return
0     AGENEN     0.02908735
1     BAKERI     0.02434866
2     BIFLOR     0.02010358
3     CLUSIA     0.01655266
4     DASYST     0.02261208
5     GESNER     0.02987758
6     GREIGI     0.02357519
7     KOLPAK     0.00689596
8     LINIFO     0.02801644
9     PRAEST     0.01489702
10    PULCHE     0.02619500
11    SAXATI     0.01930950
12    SYLVES     0.01051589
13    TURKES     0.02405270
14    URUMIE     0.01304008
15    ALTAIC     0.02143803
16    ARMENA     0.02833592
17    KAUFMA     0.02905800
18    SCHREN     0.01115147
19    SPRENG     0.01630705
20     TARDA     0.02926331
21    VVEDEN     0.02759809
22    ORPHAN     0.02122366
23    HUMILI     0.02115505
24    BAKERI     0.03118183
25    CLUSIA     0.01893035
26    DASYST     0.02112652
27    GESNER     0.01726671
28    KOLPAK     0.01920617
29    LINIFO     0.01959694
...      ...            ...
3682  SAXATI     0.04483649
3683 

 0.05314549 0.05648337 0.05484528 0.05280004 0.0541991 ]
<type 'numpy.ndarray'>
KstestResult(statistic=0.5050121985582317, pvalue=0.0)
      ticker  signal_return
5     GESNER     0.02987758
27    GESNER     0.01726671
50    GESNER     0.03486794
74    GESNER     0.01857232
97    GESNER     0.02499698
122   GESNER     0.02263244
145   GESNER     0.02695143
172   GESNER     0.03021558
202   GESNER     0.01688940
223   GESNER     0.02991103
250   GESNER     0.02140915
275   GESNER     0.02950598
298   GESNER     0.02961145
321   GESNER     0.02745639
343   GESNER     0.02587033
370   GESNER     0.02726423
397   GESNER     0.03447945
420   GESNER     0.02525461
443   GESNER     0.03127350
465   GESNER     0.02955061
493   GESNER     0.03593470
518   GESNER     0.03593313
541   GESNER     0.03279765
567   GESNER     0.03453840
590   GESNER     0.03387038
617   GESNER     0.03652862
643   GESNER     0.02796855
666   GESNER     0.02735023
689   GESNER     0.04082403
712   GESNER     0.026706

KstestResult(statistic=0.5059428329034227, pvalue=0.0)
      ticker  signal_return
10    PULCHE     0.02619500
45    PULCHE     0.03035077
64    PULCHE     0.03216861
90    PULCHE     0.02269160
113   PULCHE     0.02112875
135   PULCHE     0.02602425
156   PULCHE     0.01812928
176   PULCHE     0.01709735
199   PULCHE     0.02858578
227   PULCHE     0.03406798
254   PULCHE     0.02290265
277   PULCHE     0.02771878
301   PULCHE     0.02755745
328   PULCHE     0.02368800
352   PULCHE     0.03031698
374   PULCHE     0.02875168
398   PULCHE     0.02149136
422   PULCHE     0.02654788
446   PULCHE     0.03800508
472   PULCHE     0.02723901
496   PULCHE     0.03201978
524   PULCHE     0.03167444
550   PULCHE     0.02939299
573   PULCHE     0.03219388
597   PULCHE     0.03046945
618   PULCHE     0.03331003
640   PULCHE     0.03521002
667   PULCHE     0.03129926
694   PULCHE     0.02672943
718   PULCHE     0.02205366
...      ...            ...
2999  PULCHE     0.04798844
3023  PULCHE     0.05

KstestResult(statistic=0.5069520811574241, pvalue=0.0)
KstestResult(statistic=0.07873059153231465, pvalue=0.5498944918341624)
      ticker  signal_return
0     AGENEN     0.03825018
1     BAKERI     0.04734411
2     BIFLOR     0.04957341
3     CLUSIA     0.04492722
4     DASYST     0.04687383
5     GESNER     0.04241275
6     GREIGI     0.04215971
7     KOLPAK     0.03340674
8     LINIFO     0.05088290
9     PRAEST     0.03784840
10    PULCHE     0.05142979
11    SAXATI     0.04677755
12    SYLVES     0.05022989
13    TURKES     0.05327831
14    URUMIE     0.03906136
15    ALTAIC     0.04662437
16    ARMENA     0.04441871
17    KAUFMA     0.04545542
18    SCHREN     0.03888338
19    SPRENG     0.04693952
20     TARDA     0.05589757
21    VVEDEN     0.05073637
22    ORPHAN     0.05141064
23    HUMILI     0.04094813
24    AGENEN     0.04337903
25    BAKERI     0.04430670
26    BIFLOR     0.04258929
27    CLUSIA     0.05216387
28    GESNER     0.05761123
29    GREIGI     0.04084296
...   

 0.1028024 ]
<type 'numpy.ndarray'>
KstestResult(statistic=0.5164845689903534, pvalue=0.0)
      ticker  signal_return
6     GREIGI     0.04215971
29    GREIGI     0.04084296
52    GREIGI     0.04246865
75    GREIGI     0.05000007
98    GREIGI     0.05306387
120   GREIGI     0.05395299
144   GREIGI     0.05366565
168   GREIGI     0.05735519
194   GREIGI     0.05287972
218   GREIGI     0.06092711
241   GREIGI     0.06034205
268   GREIGI     0.06074272
299   GREIGI     0.06819373
328   GREIGI     0.06276750
352   GREIGI     0.06317214
376   GREIGI     0.06596330
399   GREIGI     0.07167458
425   GREIGI     0.06875971
450   GREIGI     0.05980941
474   GREIGI     0.07476302
496   GREIGI     0.06815933
520   GREIGI     0.07818168
543   GREIGI     0.07380215
567   GREIGI     0.08012958
591   GREIGI     0.06017968
614   GREIGI     0.06407969
637   GREIGI     0.06761028
660   GREIGI     0.07073482
683   GREIGI     0.07822444
707   GREIGI     0.07876645
...      ...            ...
1334  GREIGI 

[86 rows x 2 columns]
[0.04677755 0.04027473 0.04018178 0.05144855 0.04831857 0.05133228
 0.06053753 0.04765304 0.0508356  0.06791876 0.06742875 0.05616268
 0.06223467 0.05154234 0.06482846 0.07197663 0.05752295 0.06003429
 0.06273225 0.0645152  0.06970622 0.06706201 0.07871576 0.07051554
 0.07197653 0.07141437 0.07601924 0.07668075 0.06656654 0.07707032
 0.0779445  0.06898097 0.07976213 0.07689602 0.07782373 0.08032589
 0.08400589 0.08127251 0.07713978 0.07751716 0.08620488 0.08588136
 0.07598277 0.08355197 0.08027697 0.09216077 0.08694858 0.07912749
 0.08224103 0.08166824 0.09233339 0.09191826 0.08330289 0.08049771
 0.10102772 0.09268765 0.08905339 0.09411076 0.09269688 0.09844968
 0.09532132 0.09140122 0.09599813 0.09539536 0.0910383  0.10209091
 0.10003015 0.09108166 0.08348826 0.0942999  0.10556518 0.10395868
 0.11068804 0.10385983 0.09606905 0.10692697 0.09923803 0.10596097
 0.09391025 0.10151828 0.10507324 0.1046434  0.10691288 0.09209502
 0.09519542 0.10689326]
<type 'numpy.nda

1070  AGENEN     0.19943394
[0.09263043 0.10602324 0.10091476 0.10824934 0.10988121 0.12843256
 0.12293448 0.13210586 0.11674913 0.12396617 0.14356932 0.13821137
 0.13019745 0.15036813 0.1537515  0.15802875 0.15884338 0.15618713
 0.16448538 0.16578872 0.1672462  0.17463959 0.16615247 0.16407186
 0.16293263 0.16990924 0.17386671 0.17625267 0.17208519 0.18700416
 0.18718812 0.19078566 0.18840484 0.19452858 0.18424095 0.19330262
 0.19017756 0.19537621 0.19607091 0.19845177 0.20668639 0.20540875
 0.20457244 0.21274098 0.19943394]
<type 'numpy.ndarray'>
KstestResult(statistic=0.5369014176437047, pvalue=1.1726175586090903e-12)
      ticker  signal_return
15    ALTAIC     0.08612202
35    ALTAIC     0.11534353
55    ALTAIC     0.09746208
78    ALTAIC     0.10662137
99    ALTAIC     0.10752463
122   ALTAIC     0.12758746
150   ALTAIC     0.12247333
173   ALTAIC     0.12111107
197   ALTAIC     0.12812515
221   ALTAIC     0.12653143
247   ALTAIC     0.13936240
270   ALTAIC     0.13914824
292   A

1059  LINIFO     0.21155469
[0.10397425 0.09952904 0.10415718 0.11048295 0.1109781  0.11753803
 0.12654712 0.13277299 0.1246333  0.13802163 0.13603537 0.15052517
 0.1428145  0.14467257 0.14171863 0.15531163 0.16215543 0.15066271
 0.15599394 0.16039994 0.16968969 0.16968813 0.177042   0.17147674
 0.16739599 0.16532492 0.17311209 0.17034416 0.17030824 0.18756321
 0.18371053 0.18645915 0.18693307 0.19262555 0.18640065 0.18517891
 0.19441393 0.19362971 0.1985954  0.20352238 0.19936852 0.19937636
 0.20071612 0.20952815 0.21155469]
<type 'numpy.ndarray'>
KstestResult(statistic=0.5396408825570782, pvalue=8.590905764549461e-13)
      ticker  signal_return
22    ORPHAN     0.09846531
44    ORPHAN     0.09793568
66    ORPHAN     0.10877400
87    ORPHAN     0.10615980
112   ORPHAN     0.11028828
136   ORPHAN     0.12030151
158   ORPHAN     0.12362030
182   ORPHAN     0.12511540
206   ORPHAN     0.14016676
229   ORPHAN     0.13617320
253   ORPHAN     0.13273524
277   ORPHAN     0.13499942
299   OR

ks_values_5
AGENEN   0.50603502
ALTAIC   0.50570193
ARMENA   0.50676634
BAKERI   0.50548842
BIFLOR   0.50633464
CLUSIA   0.50590502
DASYST   0.50501220
GESNER   0.50673758
GREIGI   0.50576659
HUMILI   0.50628688
dtype: float64
p_values_5
AGENEN   0.00000000
ALTAIC   0.00000000
ARMENA   0.00000000
BAKERI   0.00000000
BIFLOR   0.00000000
CLUSIA   0.00000000
DASYST   0.00000000
GESNER   0.00000000
GREIGI   0.00000000
HUMILI   0.00000000
dtype: float64


## Find Outliers
With the ks and p values calculate, let's find which symbols are the outliers. Implement the `find_outliers` function to find the following outliers:
- Symbols that pass the null hypothesis with a p-value less than `pvalue_threshold`.
- Symbols that with a KS value above `ks_threshold`.

In [192]:
def find_outliers(ks_values, p_values, ks_threshold, pvalue_threshold=0.05):
    """
    Find outlying symbols using KS values and P-values
    
    Parameters
    ----------
    ks_values : Pandas Series
        KS static for all the tickers
    p_values : Pandas Series
        P value for all the tickers
    ks_threshold : float
        The threshold for the KS statistic
    pvalue_threshold : float
        The threshold for the p-value
    
    Returns
    -------
    outliers : set of str
        Symbols that are outliers
    """
    #TODO: Implement function
    tickers = []
    temp_ks = ks_values[ks_values > ks_threshold].index
    temp_p = p_values[ks_values < pvalue_threshold].index
    temp_ks = [x.encode('UTF8') for x in temp_ks]
    temp_p = [x.encode('UTF8') for x in temp_p]
    print temp_ks
    print temp_p
    tickers = temp_ks + temp_p
    print tickers
    
    return tickers


project_tests.test_find_outliers(find_outliers)

['EMXO']
[]
['EMXO']


AssertionError: Wrong type for output outliers. Got <type 'list'>, expected <type 'set'>

### View Data
Using the `find_outliers` function you implemented, let's see what we found.

In [None]:
ks_threshold = 0.8
outliers_5 = find_outliers(ks_values_5, p_values_5, ks_threshold)
outliers_10 = find_outliers(ks_values_10, p_values_10, ks_threshold)
outliers_20 = find_outliers(ks_values_20, p_values_20, ks_threshold)

outlier_tickers = outliers_5.union(outliers_10).union(outliers_20)
print('{} Outliers Found:\n{}'.format(len(outlier_tickers), ', '.join(list(outlier_tickers))))

### Show Significance without Outliers
Let's compare the 5, 10, and 20 day signals returns without outliers to normal distributions. Also, let's see how the P-Value has changed with the outliers removed.

In [None]:
good_tickers = list(set(close.columns) - outlier_tickers)

project_helper.plot_signal_to_normal_histograms(
    [signal_return_5[good_tickers], signal_return_10[good_tickers], signal_return_20[good_tickers]],
    'Signal Return Without Outliers',
    ('5 Days', '10 Days', '20 Days'))

That's more like it! The returns are closer to a normal distribution. You have finished the research phase of a Breakout Strategy. You can now submit your project.
## Submission
Now that you're done with the project, it's time to submit it. Click the submit button in the bottom right. One of our reviewers will give you feedback on your project with a pass or not passed grade. You can continue to the next section while you wait for feedback.