<h1>Pipeline</h1>

In [1]:
from quantopian.pipeline import Pipeline
from quantopian.research import run_pipeline  # available in the research environment only

It is good practice to wrap pipeline construction in a function, which makes more complex pipelines easier to set up later. For now, this is just an empty pipeline.

In [2]:
def make_pipeline():
    return Pipeline()


The output of an empty pipeline is a DataFrame with no columns. In this example, the pipeline has an index made up of all 8000+ securities (only the first 5 rows are displayed below) for May 5th, 2015, but it doesn't have any columns.

In [3]:
my_pipe = make_pipeline()

# running the pipeline returns a DataFrame with a MultiIndex of (date, security)
result = run_pipeline(my_pipe, '2015-05-05', '2015-05-05')
result.head()

# the pipeline is currently empty: the index is populated, but there are no columns



date,security
2015-05-05 00:00:00+00:00,Equity(2 [HWM])
2015-05-05 00:00:00+00:00,Equity(21 [AAME])
2015-05-05 00:00:00+00:00,Equity(24 [AAPL])
2015-05-05 00:00:00+00:00,Equity(25 [HWM_PR])
2015-05-05 00:00:00+00:00,Equity(31 [ABAX])


<h1>Factors</h1> <br>
Before using a factor, check which parameters it requires in the API reference:
<br>
<a href = "https://www.quantopian.com/docs/api-reference/pipeline-api-reference#pipeline-quickref-factors">Open Link in new tab</a>

In [4]:
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.factors import SimpleMovingAverage

# each factor must be instantiated with its parameters before it can be used
# (see the link above for the parameters each factor accepts)

def make_pipeline():

    # this only instantiates the factor; no computation happens yet
    mean_close_10 = SimpleMovingAverage(
        inputs=[USEquityPricing.close],  # BoundColumn: the data the factor reads
        window_length=10
    )

    # attach the factor to the pipeline under the given column name;
    # it is computed when the pipeline runs
    pipe = Pipeline(
        columns={
            '10_day_mean_close': mean_close_10
        }
    )

    return pipe

The DataFrame has a MultiIndex where the first level is a datetime representing the date of the computation and the second level is an Equity object corresponding to the security. For example, the first row of the output below (`2015-05-05 00:00:00+00:00`, `Equity(2 [HWM])`) contains the result of our mean_close_10 factor for HWM on May 5th, 2015.

In [5]:
result = run_pipeline(make_pipeline(), '2015-05-05', '2015-05-05')
result.head()



date,security,10_day_mean_close
2015-05-05 00:00:00+00:00,Equity(2 [HWM]),13.5595
2015-05-05 00:00:00+00:00,Equity(21 [AAME]),3.9625
2015-05-05 00:00:00+00:00,Equity(24 [AAPL]),129.0257
2015-05-05 00:00:00+00:00,Equity(25 [HWM_PR]),88.3625
2015-05-05 00:00:00+00:00,Equity(31 [ABAX]),61.9209
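
To make the factor concrete, here is a minimal plain-Python sketch of what a simple moving average with a given window length computes for a single security. The helper name `simple_moving_average` and the price list are made up for illustration; this is not the Quantopian implementation, which works on whole arrays of securities at once.

```python
def simple_moving_average(closes, window_length):
    """Mean of the last `window_length` values, or None if there are too few."""
    if window_length <= 0:
        raise ValueError("window_length must be positive")
    if len(closes) < window_length:
        return None
    window = closes[-window_length:]
    return sum(window) / window_length


# Hypothetical closing prices for one security, oldest first.
closes = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0,
          16.0, 17.0, 18.0, 19.0, 20.0, 21.0]

# Only the 10 most recent closes (12.0 through 21.0) enter the average.
sma_10 = simple_moving_average(closes, 10)
print(sma_10)
```

The pipeline conceptually repeats this calculation for every security on every date in the requested range.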


# Latest - inbuilt factor
The Latest factor gets the most recent value of a given data column. This factor is common enough that it is instantiated differently from other factors. The best way to get the latest value of a data column is by getting its .latest attribute. 

In [6]:
def make_pipeline():
    
    # this only instantiates the factor; no computation happens yet
    mean_close_10 = SimpleMovingAverage(
        inputs=[USEquityPricing.close],  # BoundColumn: the data the factor reads
        window_length=10
    )

    # .latest is not instantiated like other factors; depending on the column's
    # dtype it can also return a Filter or a Classifier instead of a Factor
    latest_close = USEquityPricing.close.latest

    # attach the terms to the pipeline under the given column names;
    # they are computed when the pipeline runs
    pipe = Pipeline(
        columns={
            '10_day_mean_close': mean_close_10,
            'latest_closing_price': latest_close
        }
    )
    
    return pipe

In [7]:
result = run_pipeline(make_pipeline(), '2015-05-05', '2015-05-05')
result.head()



date,security,10_day_mean_close,latest_closing_price
2015-05-05 00:00:00+00:00,Equity(2 [HWM]),13.5595,14.015
2015-05-05 00:00:00+00:00,Equity(21 [AAME]),3.9625,
2015-05-05 00:00:00+00:00,Equity(24 [AAPL]),129.0257,128.699
2015-05-05 00:00:00+00:00,Equity(25 [HWM_PR]),88.3625,
2015-05-05 00:00:00+00:00,Equity(31 [ABAX]),61.9209,55.03


Some factors have default inputs that should never be changed. For example the VWAP built-in factor is always calculated from USEquityPricing.close and USEquityPricing.volume. When a factor is always calculated from the same BoundColumns, we can call the constructor without specifying inputs.

In [8]:
from quantopian.pipeline.factors import VWAP
vwap = VWAP(window_length=10)
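
For intuition, a volume-weighted average price over a window is the sum of price times volume divided by the total volume. The sketch below shows that arithmetic in plain Python. The function name `volume_weighted_average_price` and the sample numbers are illustrative assumptions, and it is not claimed to match the platform's internal implementation.

```python
def volume_weighted_average_price(prices, volumes):
    """Sum of price * volume divided by total volume, or None if no volume."""
    if len(prices) != len(volumes):
        raise ValueError("prices and volumes must have the same length")
    total_volume = sum(volumes)
    if total_volume == 0:
        return None
    traded_value = sum(p * v for p, v in zip(prices, volumes))
    return traded_value / total_volume


# Hypothetical 3-day window for one security.
prices = [10.0, 12.0, 11.0]
volumes = [100, 300, 100]

# The heavily traded second day pulls the result toward 12.0.
window_vwap = volume_weighted_average_price(prices, volumes)
print(window_vwap)
```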

# Combining Factors

Factors can be combined, both with other Factors and with scalar values, via any of the builtin mathematical operators (+, -, *, etc). <br><br>
f1 = SomeFactor(...)<br>
f2 = SomeOtherFactor(...)<br>
average = (f1 + f2) / 2.0

In [9]:
def make_pipeline():
    
    # this only instantiates the factors; no computation happens yet
    mean_close_10 = SimpleMovingAverage(
        inputs=[USEquityPricing.close],  # BoundColumn: the data the factor reads
        window_length=10
    )

    mean_close_30 = SimpleMovingAverage(
        inputs=[USEquityPricing.close],  # BoundColumn: the data the factor reads
        window_length=30
    )

    # define a factor in terms of other factors; it is NOT YET COMPUTED
    # combining factors with arithmetic operators yields another Factor
    percent_difference = (mean_close_10 - mean_close_30) / mean_close_30

    # .latest is not instantiated like other factors; depending on the column's
    # dtype it can also return a Filter or a Classifier instead of a Factor
    latest_close = USEquityPricing.close.latest

    # attach the terms to the pipeline under the given column names;
    # they are computed when the pipeline runs
    pipe = Pipeline(
        columns={
            '10_day_mean_close': mean_close_10,
            '30_day_mean_close': mean_close_30,
            'percent_difference': percent_difference,
            'latest_closing_price': latest_close
        }
    )
    
    return pipe

In [15]:
result = run_pipeline(make_pipeline(), '2015-05-05', '2015-05-05')
result.head()



date,security,10_day_mean_close,30_day_mean_close,latest_closing_price,percent_difference
2015-05-05 00:00:00+00:00,Equity(2 [HWM]),13.5595,13.320067,14.015,0.017975
2015-05-05 00:00:00+00:00,Equity(21 [AAME]),3.9625,3.971735,,-0.002325
2015-05-05 00:00:00+00:00,Equity(24 [AAPL]),129.0257,126.880733,128.699,0.016905
2015-05-05 00:00:00+00:00,Equity(25 [HWM_PR]),88.3625,86.498944,,0.021544
2015-05-05 00:00:00+00:00,Equity(31 [ABAX]),61.9209,63.1613,55.03,-0.019639
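
As a sanity check, the percent_difference column can be recomputed by hand from the two moving-average columns. The sketch below does this in plain Python for three rows copied from the table above; the helper name `percent_difference` is simply reused for illustration, and the results should agree with the table to within rounding.

```python
def percent_difference(short_mean, long_mean):
    """Relative gap between two averages, as a fraction of the long one."""
    return (short_mean - long_mean) / long_mean


# (10-day mean, 30-day mean) pairs copied from the output table.
rows = {
    "HWM": (13.5595, 13.320067),
    "AAPL": (129.0257, 126.880733),
    "ABAX": (61.9209, 63.1613),
}

recomputed = {
    ticker: percent_difference(mean_10, mean_30)
    for ticker, (mean_10, mean_30) in rows.items()
}

for ticker, value in recomputed.items():
    print(ticker, round(value, 6))
```

A positive value means the 10-day average is above the 30-day average, and a negative value means it is below.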


# Filters

A Filter is a function from an asset and a timestamp to a boolean:

f (asset, timestamp) -> boolean

Use a filter to narrow down the set of securities included in a computation or in the final output of a pipeline.

Comparison operators on Factors and Classifiers produce Filters.

In [None]:
last_close_price = USEquityPricing.close.latest
filter_close_price = last_close_price > 20
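
Conceptually, a comparison filter yields one boolean per security per day. The following plain-Python sketch mimics that for a single day, using three latest closing prices taken from the earlier output table; the dictionary layout is an assumption made for illustration and is not how the platform stores data.

```python
# Latest closing prices for one day, copied from the earlier output table.
latest_close = {"HWM": 14.015, "AAPL": 128.699, "ABAX": 55.03}

# The comparison is applied to each security independently,
# producing one True/False value per security.
close_above_20 = {asset: price > 20 for asset, price in latest_close.items()}

for asset, passed in close_above_20.items():
    print(asset, passed)
```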

In [16]:
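# alternatively, we can produce this filter as a column of its own, which
# would simplify the code in rebalance quite a bit: instead of comparing the
# two moving-average columns while iterating over the securities, we record
# the signal directly

# the factors must be defined in this scope; in earlier cells they only
# existed inside make_pipeline()
mean_close_10 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=10)
mean_close_30 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=30)
mean_crossover_filter = mean_close_10 < mean_close_30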


Various methods of the Factor and Classifier classes return Filters. The Factor.top(n) method produces a Filter that returns True for the top n securities of a given factor each day. The following example produces a filter that returns True for exactly 200 securities every day, indicating that those securities were in the top 200 by last close price across all known securities.

<b>Reference for the built-in factor methods, including methods that return filters and methods that create other factors (such as linear regression):</b>

https://www.quantopian.com/docs/api-reference/pipeline-api-reference#pipeline-quickref-factor-methods


In [18]:
last_close_price = USEquityPricing.close.latest
top_close_price_filter = last_close_price.top(200)
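
The idea behind a top-n filter can be sketched in plain Python as "rank the values, then mark the n largest as True". The helper `top_n_filter` below is a hypothetical stand-in written for this note. It ignores details such as missing values and ties, so it should be read as an illustration of the idea only.

```python
def top_n_filter(values, n):
    """Return {asset: bool}, True for the n assets with the largest values."""
    ranked = sorted(values, key=values.get, reverse=True)
    winners = set(ranked[:n])
    return {asset: asset in winners for asset in values}


# Latest closing prices for one day, copied from the earlier output table.
latest_close = {"HWM": 14.015, "AAPL": 128.699, "ABAX": 55.03}

top_2_by_close = top_n_filter(latest_close, 2)
print(top_2_by_close)
```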

Next, we create a filter that returns True if a security's 30-day average dollar volume is above $10,000,000. To do this, we first need to create an AverageDollarVolume factor to compute the 30-day average dollar volume.

In [19]:
from quantopian.pipeline.factors import AverageDollarVolume, SimpleMovingAverage

In [20]:
def make_pipeline():
    
    # by default this factor uses USEquityPricing.close and
    # USEquityPricing.volume as inputs, so they don't need to be specified
    dollar_volume = AverageDollarVolume(window_length=30)
    filter_volume = dollar_volume > 10**7

    # this only instantiates the factors; no computation happens yet
    mean_close_10 = SimpleMovingAverage(
        inputs=[USEquityPricing.close],  # BoundColumn: the data the factor reads
        window_length=10
    )

    mean_close_30 = SimpleMovingAverage(
        inputs=[USEquityPricing.close],  # BoundColumn: the data the factor reads
        window_length=30
    )

    # define a factor in terms of other factors; it is NOT YET COMPUTED
    # combining factors with arithmetic operators yields another Factor
    percent_difference = (mean_close_10 - mean_close_30) / mean_close_30

    # attach the terms to the pipeline under the given column names;
    # they are computed when the pipeline runs
    pipe = Pipeline(
        columns={
            'percent_difference': percent_difference,
            'high_dollar_volume': filter_volume
        }
    )
    
    return pipe

In [21]:
result = run_pipeline(make_pipeline(), '2015-05-05', '2015-05-05')
result.head()



date,security,high_dollar_volume,percent_difference
2015-05-05 00:00:00+00:00,Equity(2 [HWM]),True,0.017975
2015-05-05 00:00:00+00:00,Equity(21 [AAME]),False,-0.002325
2015-05-05 00:00:00+00:00,Equity(24 [AAPL]),True,0.016905
2015-05-05 00:00:00+00:00,Equity(25 [HWM_PR]),False,0.021544
2015-05-05 00:00:00+00:00,Equity(31 [ABAX]),False,-0.019639


So far we have learned to produce filters and factors as pipeline columns, and by default Quantopian produces results for every security on EACH DAY. We may want only a subset of these securities to be considered at all. How do we do that?

# Apply a Screen

We can tell our Pipeline to ignore securities for which a filter produces False by passing that filter to our Pipeline via the screen keyword.

To screen our pipeline output for securities with a 30-day average dollar volume greater than $10,000,000, we can simply pass our high dollar volume filter (filter_volume in the code below) as the screen argument.


In [22]:
def make_pipeline():
    
    # by default this factor uses USEquityPricing.close and
    # USEquityPricing.volume as inputs, so they don't need to be specified
    dollar_volume = AverageDollarVolume(window_length=30)
    filter_volume = dollar_volume > 10**7

    # this only instantiates the factors; no computation happens yet
    mean_close_10 = SimpleMovingAverage(
        inputs=[USEquityPricing.close],  # BoundColumn: the data the factor reads
        window_length=10
    )

    mean_close_30 = SimpleMovingAverage(
        inputs=[USEquityPricing.close],  # BoundColumn: the data the factor reads
        window_length=30
    )

    # define a factor in terms of other factors; it is NOT YET COMPUTED
    # combining factors with arithmetic operators yields another Factor
    percent_difference = (mean_close_10 - mean_close_30) / mean_close_30

    # attach the factor to the pipeline under the given column name and
    # keep only the securities for which the screen filter is True
    pipe = Pipeline(
        columns={
            'percent_difference': percent_difference,
        },
        screen=filter_volume
    )
    
    return pipe

In [24]:
result = run_pipeline(make_pipeline(), '2015-05-05', '2015-05-05')
result.head()
print('Number of securities that passed the filter: %d' % len(result))



Number of securities that passed the filter: 2106
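
The effect of a screen on the output can be pictured as dropping every row whose filter value is False. The plain-Python sketch below applies that idea to the five rows from the earlier high_dollar_volume table; the nested-dictionary layout is an assumption made for illustration.

```python
# Rows copied from the earlier output that included high_dollar_volume.
rows = {
    "HWM": {"percent_difference": 0.017975, "high_dollar_volume": True},
    "AAME": {"percent_difference": -0.002325, "high_dollar_volume": False},
    "AAPL": {"percent_difference": 0.016905, "high_dollar_volume": True},
    "HWM_PR": {"percent_difference": 0.021544, "high_dollar_volume": False},
    "ABAX": {"percent_difference": -0.019639, "high_dollar_volume": False},
}

# Screening keeps only the rows where the filter is True; the filter
# itself no longer needs to appear as a column in the output.
screened = {
    asset: {"percent_difference": row["percent_difference"]}
    for asset, row in rows.items()
    if row["high_dollar_volume"]
}

print("Number of securities that passed the filter: %d" % len(screened))
```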


# Higher order filters

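Filters can be inverted using `~`.

They can also be combined into a new filter using the boolean operators `&` (and) and `|` (or).

New filter => all_volume = high_dollar_vol | low_dollar_vol (True for every security, because low_dollar_vol is the inverse of high_dollar_vol; combining the two with `&` would be False everywhere)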

In [25]:
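# filter_volume must be defined in this scope (earlier it only existed inside make_pipeline)
dollar_volume = AverageDollarVolume(window_length=30)
filter_volume = dollar_volume > 10**7  # high dollar volume filter
low_dollar_vol = ~filter_volume  # inverse of the high dollar volume filter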
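
The behaviour of inverted and combined filters can be checked with ordinary booleans. In this plain-Python sketch, `not`, `and` and `or` play the roles of the pipeline operators `~`, `&` and `|`. The three filter values are copied from the earlier high_dollar_volume output, and the dictionary layout is an assumption made for illustration.

```python
# high_dollar_volume values for three securities, from the earlier output.
high_dollar_vol = {"HWM": True, "AAME": False, "AAPL": True}

# Inversion flips every value (the pipeline equivalent is ~high_dollar_vol).
low_dollar_vol = {asset: not flag for asset, flag in high_dollar_vol.items()}

# AND of a filter with its own inverse can never be True.
both = {a: high_dollar_vol[a] and low_dollar_vol[a] for a in high_dollar_vol}

# OR of a filter with its own inverse is always True.
either = {a: high_dollar_vol[a] or low_dollar_vol[a] for a in high_dollar_vol}

print(low_dollar_vol)
print(both)
print(either)
```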

# Masking

Sometimes we want to ignore certain assets when computing pipeline expressions. There are two common cases where ignoring assets is useful:

1. We want to compute an expression that's computationally expensive, and we know we only care about results for certain assets. A common example of such an expensive expression is a Factor computing the coefficients of a regression (RollingLinearRegressionOfReturns).

2. We want to compute an expression that performs comparisons between assets, but we only want those comparisons to be performed against a subset of all assets. For example, we might want to use the top method of Factor to compute the top 200 assets by earnings yield, ignoring assets that don't meet some liquidity constraint.

To support these two use-cases, all Factors and many Factor methods can accept a mask argument, which must be a Filter indicating which assets to consider when computing.

<b><i><u>Masking is applied to the FACTOR itself. This is EXTREMELY efficient because it filters out asset / date pairs BEFORE the computation is performed, whereas a SCREEN computes everything and filters AFTER. So for computing good alpha factors, masking is the way to go.</u></i></b>


In [None]:
# Dollar volume factor
dollar_volume = AverageDollarVolume(window_length=30)

# High dollar volume filter
high_dollar_volume = (dollar_volume > 10000000)

# Average close price factors
mean_close_10 = SimpleMovingAverage(
    inputs=[USEquityPricing.close],
    window_length=10,
    mask=high_dollar_volume
)
mean_close_30 = SimpleMovingAverage(
    inputs=[USEquityPricing.close],
    window_length=30,
    mask=high_dollar_volume
)

# Relative difference factor
percent_difference = (mean_close_10 - mean_close_30) / mean_close_30

Applying the mask to SimpleMovingAverage restricts the average close price factors to a computation over the ~2000 securities passing the high_dollar_volume filter, as opposed to ~8000 without a mask. When we combine mean_close_10 and mean_close_30 to form percent_difference, the computation is performed on the same ~2000 securities.
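
The difference between filtering after the computation (screen) and before it (mask) can be illustrated by counting how often a factor is evaluated. The sketch below is a conceptual model in plain Python with made-up tickers and prices. The function names `with_screen` and `with_mask` are hypothetical, and the sketch says nothing about how the pipeline engine is implemented internally.

```python
def mean_close(closes, counter):
    """Stand-in for an expensive factor; counts how often it is evaluated."""
    counter["calls"] += 1
    return sum(closes) / len(closes)


def with_screen(prices, passes_filter):
    """Compute the factor for every asset, then drop the ones that fail."""
    counter = {"calls": 0}
    computed = {a: mean_close(c, counter) for a, c in prices.items()}
    result = {a: v for a, v in computed.items() if passes_filter[a]}
    return result, counter["calls"]


def with_mask(prices, passes_filter):
    """Skip failing assets up front, so the factor is never computed for them."""
    counter = {"calls": 0}
    result = {
        a: mean_close(c, counter)
        for a, c in prices.items()
        if passes_filter[a]
    }
    return result, counter["calls"]


# Hypothetical data: four assets, two of which pass a liquidity filter.
prices = {"AAA": [10.0, 12.0], "BBB": [3.0, 5.0],
          "CCC": [100.0, 102.0], "DDD": [1.0, 1.0]}
liquid = {"AAA": True, "BBB": False, "CCC": True, "DDD": False}

screened, screen_calls = with_screen(prices, liquid)
masked, mask_calls = with_mask(prices, liquid)
print(screened == masked, screen_calls, mask_calls)
```

Both approaches return the same rows here, but the masked version evaluates the factor only for the assets that pass the filter.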

# Masking Filters

Masks can also be applied to methods that return filters, such as top, bottom, and percentile_between.

Masks are most useful when we want to apply a filter in the earlier steps of a combined computation. For example, suppose we want to get the 50 securities with the highest open price that are also in the top 10% of dollar volume. Suppose that we then want the 90th-100th percentile of these securities by close price. We can do this with the following:


In [28]:
def make_pipeline():

    # Dollar volume factor
    dollar_volume = AverageDollarVolume(window_length=30)

    # High dollar volume filter
    high_dollar_volume = dollar_volume.percentile_between(90,100)
    
    # Top open price filter, restricted to high dollar volume securities
    top_open_price = USEquityPricing.open.latest.top(50, mask=high_dollar_volume)

    # Top percentile close price filter, restricted to the top 50 open price
    # securities (which are themselves high dollar volume securities)
    high_close_price = USEquityPricing.close.latest.percentile_between(90, 100, mask=top_open_price)

    # chaining masks like this works as an asset funnel
    return Pipeline(
        screen=high_close_price
    )

In [29]:
result = run_pipeline(make_pipeline(), '2015-05-05', '2015-05-05')
result



date,security
2015-05-05 00:00:00+00:00,Equity(693 [AZO])
2015-05-05 00:00:00+00:00,Equity(1091 [BRK_A])
2015-05-05 00:00:00+00:00,Equity(19917 [BKNG])
2015-05-05 00:00:00+00:00,Equity(23709 [NFLX])
2015-05-05 00:00:00+00:00,Equity(28016 [CMG])


# CLASSIFIERS - SKIPPED AS I DON'T THINK WE NEED IT

# DATASETS AND STATIC TYPING IN THIS PLATFORM

Please read this lesson; it is very important:
https://www.quantopian.com/tutorials/pipeline#lesson9