#Pipeline Tutorial Lesson 2

##Creating a Pipeline

In [1]:
# importing the pipeline class
from quantopian.pipeline import Pipeline

In [2]:
# define a function to wrap our pipeline
# for now this returns an empty pipeline
def make_pipeline():
    return Pipeline()


In [3]:
# instantiate pipeline by running make_pipeline()
my_pipe = make_pipeline()

##Running the Pipeline
Before running our pipeline, we first need to import `run_pipeline`, a research-only function that allows us to run a pipeline over a specified time period.

In [4]:
from quantopian.research import run_pipeline

In [5]:
# run the pipeline for a single day
result = run_pipeline(my_pipe, '2015-05-05','2015-05-05')

A call to `run_pipeline` returns a [pandas DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) indexed by date and security.

In [6]:
# let's take a look
result.head()

Unnamed: 0,Unnamed: 1
2015-05-05 00:00:00+00:00,Equity(2 [ARNC])
2015-05-05 00:00:00+00:00,Equity(21 [AAME])
2015-05-05 00:00:00+00:00,Equity(24 [AAPL])
2015-05-05 00:00:00+00:00,Equity(25 [ARNC_PR])
2015-05-05 00:00:00+00:00,Equity(31 [ABAX])


The output of an empty pipeline is a DataFrame with no columns.
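
Outside the Quantopian research environment, the shape of this result can be imitated with plain pandas. The sketch below is only an illustration: the ticker strings are stand-ins (real pipeline output uses `Equity` objects), and the variable names are made up for this example.

```python
import pandas as pd

# Hypothetical stand-ins for the (date, security) pairs a pipeline would emit.
dates = pd.to_datetime(['2015-05-05'])
tickers = ['ARNC', 'AAME', 'AAPL']

# Two-level index: first level is the date, second level is the security.
index = pd.MultiIndex.from_product([dates, tickers])

# An "empty pipeline" result: one row per (date, security), zero columns.
empty_result = pd.DataFrame(index=index)

print(empty_result.shape)
```

The frame has three rows and no columns, which mirrors what `result.head()` shows for an empty pipeline: only the index is populated.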

#Pipeline Tutorial Lesson 3

##Factors
A factor is a function from an asset and a moment in time to a number.
$$f(\text{asset}, \text{timestamp}) \to \text{numerical value}$$

Factors represent the result of any computation producing a numerical result. Factors require a column of data as well as a window length.

The simplest factors in Pipeline are [built-in Factors](https://www.quantopian.com/help#built-in-factors), which are prebuilt to perform common computations.



I will use the built-in `SimpleMovingAverage` factor with a 10-day window. It is first necessary to import the `SimpleMovingAverage` factor and the [USEquityPricing dataset](https://www.quantopian.com/help#importing-datasets).

In [7]:
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.factors import SimpleMovingAverage

##Creating a Factor

The following creates a factor for computing the 10-day mean close price:

In [8]:
mean_close_10 = SimpleMovingAverage(inputs=[USEquityPricing.close],
                                    window_length=10)
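
To make the idea of "a column of data plus a window length" concrete, here is a minimal NumPy sketch of what a 10-day mean amounts to on one day. It does not use the Quantopian API: `simple_moving_average` and the toy prices are hypothetical, and skipping NaNs is an assumption about how missing prices are treated.

```python
import numpy as np

def simple_moving_average(window):
    """Mean of each column of an M x N window (M days, N assets), skipping NaNs."""
    return np.nanmean(window, axis=0)

# Toy close prices: 10 days (rows) for 2 hypothetical assets (columns).
closes = np.column_stack([
    np.arange(1.0, 11.0),   # asset 0 closes at 1, 2, ..., 10
    np.full(10, 50.0),      # asset 1 closes at 50 every day
])

sma_10 = simple_moving_average(closes)   # one number per asset
```

Each asset gets a single number for the day (5.5 and 50.0 here), which is exactly the "asset and timestamp to number" shape of a factor.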

##Adding a Factor to a Pipeline
First we will update the old `make_pipeline` function to use the newly instantiated factor. Then we tell the pipeline to compute the factor by passing it a `columns` argument, which should be a dictionary mapping column names to factors, filters, or classifiers.

In [9]:
def make_pipeline():
    #creates a Factor for computing the 10-day mean close price
    mean_close_10 = SimpleMovingAverage(inputs=[USEquityPricing.close],
                                        window_length=10)
    return Pipeline(columns={'10_day_mean_close': mean_close_10})

Let's run it and display some of the result:

In [10]:
result = run_pipeline(make_pipeline(), '2015-05-05', '2015-05-05')
result.head()

Unnamed: 0,Unnamed: 1,10_day_mean_close
2015-05-05 00:00:00+00:00,Equity(2 [ARNC]),13.5595
2015-05-05 00:00:00+00:00,Equity(21 [AAME]),3.9625
2015-05-05 00:00:00+00:00,Equity(24 [AAPL]),129.0257
2015-05-05 00:00:00+00:00,Equity(25 [ARNC_PR]),88.3625
2015-05-05 00:00:00+00:00,Equity(31 [ABAX]),61.9209


Note: factors can also be added to an existing `Pipeline` instance using the `Pipeline.add` method. Using `add` looks something like this:
```python
>>> my_pipe = Pipeline()
>>> f1 = SomeFactor(...)
>>> my_pipe.add(f1)
```

##Latest
The most commonly used built-in factor is `Latest`. The `Latest` factor gets the most recent value of a given data column. This factor is common enough that it is **instantiated differently** from other factors. The best way to get the latest value of a data column is by getting its `.latest` attribute. As an example, let's update `make_pipeline` to create a latest close price factor and add it to our pipeline:

In [11]:
def make_pipeline():
    mean_close_10 = SimpleMovingAverage(inputs=[USEquityPricing.close],
                                        window_length=10)
    latest_close = USEquityPricing.close.latest
    
    return Pipeline(
        columns={
            '10_day_mean_close': mean_close_10,
            'Latest_close_price': latest_close
        }
    )

In [12]:
result = run_pipeline(make_pipeline(), '2015-05-05', '2015-05-05')
result.head()

Unnamed: 0,Unnamed: 1,10_day_mean_close,Latest_close_price
2015-05-05 00:00:00+00:00,Equity(2 [ARNC]),13.5595,14.015
2015-05-05 00:00:00+00:00,Equity(21 [AAME]),3.9625,
2015-05-05 00:00:00+00:00,Equity(24 [AAPL]),129.0257,128.699
2015-05-05 00:00:00+00:00,Equity(25 [ARNC_PR]),88.3625,
2015-05-05 00:00:00+00:00,Equity(31 [ABAX]),61.9209,55.03


##Default Inputs
Some factors have default inputs that should never be changed. For example, the [VWAP built-in factor](https://www.quantopian.com/help#built-in-factors) is always calculated from `USEquityPricing.close` and `USEquityPricing.volume`. When a factor is always calculated from the same `BoundColumns`, we can call the constructor without specifying `inputs`.

In [13]:
from quantopian.pipeline.factors import VWAP
vwap = VWAP(window_length=10)

#Pipeline Tutorial Lesson 4
##Combining Factors
We can combine factors, both with other factors and with scalar values, via any builtin math operator. This makes it easy to write complex expressions that combine multiple Factors. For example, constructing a Factor that computes the average of two other Factors is simply:
```python
>>> f1 = SomeFactor(...)
>>> f2 = SomeOtherFactor(...)
>>> average = (f1 + f2) / 2.0
```

In this lesson, we will create a pipeline that creates a `percent_difference` factor by combining a 10-day average price factor and a 30-day one. Let's start out by making the two factors.

In [14]:
mean_close_10 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=10)
mean_close_30 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=30)

Then let's create a percent difference factor by combining our `mean_close_30` factor with our `mean_close_10` factor.

In [15]:
percent_difference = ((mean_close_10 - mean_close_30)
                      / mean_close_30)
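
The arithmetic is applied per asset. As a rough worked example in plain NumPy (the numbers are made up, and the arrays stand in for the per-asset output of each factor on a single day):

```python
import numpy as np

# Hypothetical per-asset outputs of the two moving-average factors on one day.
mean_close_10 = np.array([105.0, 19.0, 50.0])
mean_close_30 = np.array([100.0, 20.0, 50.0])

# Same expression as the pipeline factor, evaluated element by element.
percent_difference = (mean_close_10 - mean_close_30) / mean_close_30
```

The first asset's 10-day mean is 5% above its 30-day mean, the second is 5% below, and the third is unchanged, giving roughly `0.05`, `-0.05`, and `0.0`.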

In this example, `percent_difference` is still a `Factor` even though it's composed as a combination of more primitive factors. We can add `percent_difference` as a column in our pipeline. Let's define `make_pipeline` to create a pipeline with `percent_difference` as a column (and not the mean close factors):

In [16]:
def make_pipeline():
    
    mean_close_10 = SimpleMovingAverage(inputs=[USEquityPricing.close],
                                        window_length=10)
    mean_close_30 = SimpleMovingAverage(inputs=[USEquityPricing.close],
                                        window_length=30)
    percent_difference = ((mean_close_10 - mean_close_30)
                          / mean_close_30)
    return Pipeline(
        columns={
            'percent_difference': percent_difference
        }
    )


Let's see what the new output looks like.

In [17]:
result = run_pipeline(make_pipeline(), '2015-05-05', '2015-05-05')
result.head()

Unnamed: 0,Unnamed: 1,percent_difference
2015-05-05 00:00:00+00:00,Equity(2 [ARNC]),0.017975
2015-05-05 00:00:00+00:00,Equity(21 [AAME]),-0.002325
2015-05-05 00:00:00+00:00,Equity(24 [AAPL]),0.016905
2015-05-05 00:00:00+00:00,Equity(25 [ARNC_PR]),0.021544
2015-05-05 00:00:00+00:00,Equity(31 [ABAX]),-0.019639


#Pipeline Tutorial Lesson 5
##Filters
A filter is a function from an asset and a moment in time to a boolean:

$$f(\text{asset}, \text{timestamp}) \to \text{boolean}$$

In Pipeline, [Filters](https://www.quantopian.com/help#quantopian_pipeline_filters_Filter) are used for narrowing down the set of securities included in a computation or in the final output of a pipeline. There are two common ways to create a Filter: comparison operators and `Factor`/`Classifier` methods.

##Comparison Operators
Comparison operators on `Factors` and `Classifiers` produce `Filters`. Since we haven't looked at `Classifiers` yet, let's stick to examples using `Factors`. The following example produces a filter that returns `True` whenever the latest close price is above $20.

In [18]:
latest_close_price = USEquityPricing.close.latest
close_price_filter = latest_close_price > 20

And this example produces a filter that returns `True` whenever the 10-day mean is below the 30-day mean.

In [19]:
mean_close_10 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=10)
mean_close_30 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=30)
mean_crossover_filter = mean_close_10 < mean_close_30

Remember, a filter produces a `True` or `False` value for each security every day.
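
A comparison on per-asset numbers yields one boolean per asset. The following NumPy sketch mimics the two filters above on made-up values for four hypothetical assets; it illustrates the idea and is not how Pipeline evaluates filters internally.

```python
import numpy as np

# Hypothetical latest close prices for four assets on one day.
latest_close = np.array([14.0, 3.9, 128.7, 55.0])
close_price_filter = latest_close > 20

# Hypothetical 10-day and 30-day mean closes for the same assets.
mean_close_10 = np.array([13.5, 3.9, 129.0, 61.9])
mean_close_30 = np.array([13.3, 4.0, 126.9, 63.1])
mean_crossover_filter = mean_close_10 < mean_close_30
```

Here `close_price_filter` is `True` only for the last two assets, and `mean_crossover_filter` is `True` for the second and fourth.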

##Factor/Classifier Methods
Various methods of the `Factor` and `Classifier` classes return `Filters`. Again, since we haven't yet looked at `Classifiers`, let's stick to `Factor` methods for now (we'll look at `Classifier` methods later). The `Factor.top(n)` method produces a `Filter` that returns `True` for the top $n$ securities of a given factor each day. The following example produces a filter that returns `True` for exactly 200 securities every day, indicating that those securities were in the top 200 by last close price across all known securities.

In [20]:
latest_close_price = USEquityPricing.close.latest
top_close_price_filter = latest_close_price.top(200)

For a full list of `Factor` methods that return `Filters`, see [this link](https://www.quantopian.com/help#quantopian_pipeline_classifiers_Classifier).
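
To see what a `top(n)`-style filter amounts to, here is a small NumPy sketch. The helper `top` below is a hypothetical stand-in written for this note, and excluding NaN values from the ranking is an assumption rather than documented behavior.

```python
import numpy as np

def top(values, n):
    """Boolean array marking the n largest non-NaN values."""
    values = np.asarray(values, dtype=float)
    result = np.zeros(values.shape, dtype=bool)
    valid = np.flatnonzero(~np.isnan(values))
    # Sort the valid positions by value, descending, and keep the first n.
    order = valid[np.argsort(-values[valid], kind='stable')]
    result[order[:n]] = True
    return result

# Made-up latest close prices; one asset has no price.
prices = [14.0, 3.9, 128.7, float('nan'), 55.0]
top_2 = top(prices, 2)
```

Exactly two entries of `top_2` are `True`: the ones for 128.7 and 55.0.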

##Dollar Volume Filter
Let's create a filter that returns `True` if a security's 30-day average dollar volume is above $10,000,000. To do this, we'll first need to create an `AverageDollarVolume` factor to compute the 30-day average dollar volume.

To import the built-in `AverageDollarVolume` factor, we can add to the line that we used to import `SimpleMovingAverage`.

In [21]:
from quantopian.pipeline.factors import AverageDollarVolume, SimpleMovingAverage

And then we can instantiate the Factor.

In [22]:
dollar_volume = AverageDollarVolume(window_length=30)

By default, `AverageDollarVolume` uses `USEquityPricing.close` and `USEquityPricing.volume` as its inputs, so we don't specify them.

Now that we have a dollar volume factor, we can create a filter with a boolean expression. The following line creates a filter returning `True` for securities with a `dollar_volume` greater than 10,000,000:

In [23]:
high_dollar_volume = (dollar_volume > 10000000)

To see what this filter looks like, we can add it as a column to our pipeline.

In [24]:
def make_pipeline():
    mean_close_10 = SimpleMovingAverage(inputs=[USEquityPricing.close],
                                        window_length=10)
    mean_close_30 = SimpleMovingAverage(inputs=[USEquityPricing.close],
                                        window_length=30)

    percent_difference = ((mean_close_10 - mean_close_30)
                          / mean_close_30)

    dollar_volume = AverageDollarVolume(window_length=30)

    high_dollar_volume = (dollar_volume > 10000000)

    return Pipeline(
        columns={
            'percent_difference': percent_difference,
            'high_dollar_volume': high_dollar_volume
        }
    )

If we make and run our pipeline, we now have a column `high_dollar_volume` with a boolean value corresponding to the result of the expression for each security.

In [25]:
result = run_pipeline(make_pipeline(), '2015-05-05', '2015-05-05')
result.head()

Unnamed: 0,Unnamed: 1,high_dollar_volume,percent_difference
2015-05-05 00:00:00+00:00,Equity(2 [ARNC]),True,0.017975
2015-05-05 00:00:00+00:00,Equity(21 [AAME]),False,-0.002325
2015-05-05 00:00:00+00:00,Equity(24 [AAPL]),True,0.016905
2015-05-05 00:00:00+00:00,Equity(25 [ARNC_PR]),False,0.021544
2015-05-05 00:00:00+00:00,Equity(31 [ABAX]),False,-0.019639


##Applying a Screen
By default, a pipeline produces computed values each day for every asset in the Quantopian database. Very often however, we only care about a subset of securities that meet specific criteria (for example, we might only care about securities that have enough daily trading volume to fill our orders quickly). We can tell our Pipeline to ignore securities for which a filter produces `False` by passing that filter to our Pipeline via the `screen` keyword.

To screen our pipeline output for securities with a 30-day average dollar volume greater than $10,000,000, we can simply pass our `high_dollar_volume` filter as the `screen` argument. This is what our `make_pipeline` function now looks like:


In [26]:
def make_pipeline():

    mean_close_10 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=10)
    mean_close_30 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=30)

    percent_difference = (mean_close_10 - mean_close_30) / mean_close_30

    dollar_volume = AverageDollarVolume(window_length=30)
    high_dollar_volume = (dollar_volume > 10000000)

    return Pipeline(
        columns={
            'percent_difference': percent_difference
        },
        screen=high_dollar_volume  # here is the screen
    )

Running this will produce an output for only the securities that passed the `high_dollar_volume` filter on a given day. For example, running this pipeline on May 5th, 2015 results in an output for ~2,100 securities.
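
Conceptually, a screen drops the rows for which the filter is `False`. A rough pandas analogy, using made-up values and plain ticker strings in place of `Equity` objects:

```python
import pandas as pd

# Hypothetical unscreened output for three assets on one day.
output = pd.DataFrame(
    {'percent_difference': [0.018, -0.002, 0.017]},
    index=['ARNC', 'AAME', 'AAPL'],
)

# Hypothetical filter values for the same assets.
high_dollar_volume = pd.Series([True, False, True], index=output.index)

# Keep only the rows where the filter is True.
screened = output[high_dollar_volume]
```

After screening, only the `ARNC` and `AAPL` rows remain; the filter itself does not appear as a column unless it is also listed in `columns`.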


In [27]:
result = run_pipeline(make_pipeline(), '2015-05-05', '2015-05-05')
print('Number of securities that passed the filter: %d' % len(result))

Number of securities that passed the filter: 2106


##Inverting a Filter
The `~` operator is used to invert a filter, swapping all `True` values with `False` and vice versa. For example, we can write the following to filter for low dollar volume securities:

In [28]:
low_dollar_volume = ~high_dollar_volume

This will return `True` for all securities with an average dollar volume below or equal to $10,000,000 over the last 30 days.


#Pipeline Tutorial Lesson 6
##Combining Filters
Like factors, filters can be combined. Combining filters is done using the `&` (and) and `|` (or) operators. For example, let's say we want to screen for securities that are in the top 10% of average dollar volume and have a latest close price above $20. To start, let's make a high dollar volume filter using an `AverageDollarVolume` factor and `percentile_between`:

In [29]:
high_dollar_volume = dollar_volume.percentile_between(90, 100)

Note: `percentile_between` is a `Factor` method returning a Filter.

Next, let's create a `latest_close` factor and define a filter for securities that close above $20:

In [30]:
latest_close = USEquityPricing.close.latest
above_20 = latest_close > 20

Now we can combine our `high_dollar_volume` filter with our `above_20` filter using the `&` operator:

In [31]:
is_tradeable = high_dollar_volume & above_20

This filter will evaluate to `True` for securities where both `high_dollar_volume` and `above_20` are `True`. Otherwise, it will evaluate to `False`. A similar computation can be made with the `|` (or) operator.
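
The truth tables behind `&`, `|`, and `~` can be checked with NumPy boolean arrays, which use the same operators element by element. The four values below are hypothetical and cover every combination of the two filters:

```python
import numpy as np

# One entry per hypothetical asset.
high_dollar_volume = np.array([True, True, False, False])
above_20 = np.array([True, False, True, False])

is_tradeable = high_dollar_volume & above_20   # both must hold
either = high_dollar_volume | above_20         # at least one holds
low_dollar_volume = ~high_dollar_volume        # inversion
```

Only the first asset passes `is_tradeable`, the first three pass `either`, and `low_dollar_volume` is `True` for the last two.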

If we want to use this filter as a screen in our pipeline, we can set the `screen` to be `is_tradeable`.

In [32]:
def make_pipeline():
    mean_close_10 = SimpleMovingAverage(inputs=[USEquityPricing.close],
                                        window_length=10)
    mean_close_30 = SimpleMovingAverage(inputs=[USEquityPricing.close],
                                        window_length=30)
    percent_difference = ((mean_close_10 - mean_close_30)
                          / mean_close_30)
    
    dollar_volume = AverageDollarVolume(window_length=30)
    high_dollar_volume = dollar_volume.percentile_between(90, 100)
    
    latest_close = USEquityPricing.close.latest
    above_20 = latest_close > 20
    
    is_tradeable = high_dollar_volume & above_20
    
    return Pipeline(
        columns={
            'percent_difference': percent_difference
        },
        screen=is_tradeable
    )

        


Running this pipeline on May 5th, 2015 outputs around 700 securities.

In [33]:
result = run_pipeline(make_pipeline(), '2015-05-05', '2015-05-05')
print('Number of securities that passed the filter: %d' % len(result))

Number of securities that passed the filter: 741


#Pipeline Tutorial Lesson 7
##Masking

Sometimes we want to ignore certain assets when computing a pipeline expression. There are two common cases where ignoring is useful:

1. We want to compute an expression that's computationally expensive, and we know we only care about results for certain assets. A common example of such an expensive expression is a `Factor` computing the coefficients of a regression (`RollingLinearRegressionOfReturns`).
2. We want to compute an expression that performs comparisons between assets, but we only want those comparisons to be performed against a subset of all assets. For example, we might want to use the `top` method of `Factor` to compute the top 200 assets by earnings yield, ignoring assets that don't meet some liquidity constraint.

To support these two use cases, all factors and many `Factor` methods can accept a `mask` argument, which must be a `Filter` indicating which assets to consider when computing.

##Masking Factors

Let's say we want our pipeline to output securities with a high or low percent difference, but we also only want to consider securities with a dollar volume above $10,000,000. To do this, let's rearrange our `make_pipeline` function so that we first create the `high_dollar_volume` filter. We can then use this filter as a `mask` for the moving average factors by passing `high_dollar_volume` as the `mask` argument to `SimpleMovingAverage`.

In [34]:
# Dollar volume factor
dollar_volume = AverageDollarVolume(window_length=30)

# High dollar volume filter
high_dollar_volume = (dollar_volume > 10000000)

# Average close price factors
mean_close_10 = SimpleMovingAverage(inputs=[USEquityPricing.close],
                                    window_length=10,
                                    mask=high_dollar_volume)

mean_close_30 = SimpleMovingAverage(inputs=[USEquityPricing.close],
                                    window_length=30,
                                    mask=high_dollar_volume)

# Relative difference factor
percent_difference = (mean_close_10 - mean_close_30) / mean_close_30



Applying the mask to `SimpleMovingAverage` restricts the average close price factors to a computation over the ~2000 securities passing the `high_dollar_volume` filter, as opposed to ~8000 without a mask. When we combine `mean_close_10` and `mean_close_30` to form `percent_difference`, the computation is performed on the same ~2000 securities.
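
The effect of a mask on a windowed computation can be sketched in NumPy: only the columns of assets that pass the mask are touched. `masked_mean` is a hypothetical helper, and filling the masked-out assets with NaN is an assumption made for this illustration.

```python
import numpy as np

def masked_mean(window, mask):
    """Column means of an M x N window, computed only where mask is True."""
    out = np.full(window.shape[1], np.nan)
    # Only the selected columns are read and averaged.
    out[mask] = window[:, mask].mean(axis=0)
    return out

# Toy 2-day window for 3 hypothetical assets.
window = np.array([[10.0, 20.0, 30.0],
                   [12.0, 22.0, 32.0]])
mask = np.array([True, False, True])

result = masked_mean(window, mask)
```

The middle asset is never averaged and stays NaN, while the other two get means of 11.0 and 31.0. With thousands of assets and an expensive computation, skipping the masked-out columns is where the savings come from.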


##Masking Filters

Masks can also be applied to methods that return filters like `top`, `bottom`, and `percentile_between`.

Masks are most useful when we want to apply a filter in the earlier steps of a combined computation. For example, suppose we want to get the 50 securities with the highest open price that are also in the top 10% of dollar volume. Suppose that we then want the 90th-100th percentile of these securities by close price. We can do this with the following:

In [35]:
# Dollar volume factor
dollar_volume = AverageDollarVolume(window_length=30)

# High dollar volume filter
high_dollar_volume = dollar_volume.percentile_between(90, 100)

# Top open price filter (high dollar volume securities)
top_open_price = USEquityPricing.open.latest.top(50, mask=high_dollar_volume)

# Top percentile close price filter (high dollar volume, top 50 open price)
high_close_price = USEquityPricing.close.latest.percentile_between(90, 100, mask=top_open_price)

Let's put this into `make_pipeline` and output an empty pipeline screened with our `high_close_price` filter.

In [36]:
def make_pipeline():
    # Dollar volume factor
    dollar_volume = AverageDollarVolume(window_length=30)

    # High dollar volume filter
    high_dollar_volume = dollar_volume.percentile_between(90, 100)

    # Top open securities filter (high dollar volume securities)
    top_open_price = USEquityPricing.open.latest.top(50, mask=high_dollar_volume)

    # Top percentile close price filter (high dollar volume, top 50 open price)
    high_close_price = USEquityPricing.close.latest.percentile_between(90, 100, mask=top_open_price)

    return Pipeline(
        screen=high_close_price
    )

Running this pipeline outputs 5 securities on May 5th, 2015.

In [37]:
result = run_pipeline(make_pipeline(), '2015-05-05', '2015-05-05')
print('Number of securities that passed the filter: %d' % len(result))

Number of securities that passed the filter: 5


Note that applying masks in layers as we did above can be thought of as an "asset funnel".
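
The funnel can be imitated on toy data with NumPy. The helpers `top` and `percentile_between` below are hypothetical re-implementations written for this note (inclusive percentile bounds are an assumption), and the prices are made up so the stage sizes are easy to verify by hand.

```python
import numpy as np

def top(values, n, mask):
    """Mark the n largest values among the assets where mask is True."""
    result = np.zeros(values.shape, dtype=bool)
    candidates = np.flatnonzero(mask)
    order = candidates[np.argsort(-values[candidates], kind='stable')]
    result[order[:n]] = True
    return result

def percentile_between(values, low, high, mask):
    """Mark masked assets whose value lies between the low-th and high-th
    percentiles of the masked values (both bounds inclusive)."""
    result = np.zeros(values.shape, dtype=bool)
    lo, hi = np.percentile(values[mask], [low, high])
    result[mask] = (values[mask] >= lo) & (values[mask] <= hi)
    return result

# Ten hypothetical assets.
dollar_volume = np.arange(10, dtype=float)     # 0, 1, ..., 9
open_price = 100.0 - 10.0 * np.arange(10)      # 100, 90, ..., 10
close_price = np.arange(1.0, 11.0)             # 1, 2, ..., 10

everything = np.ones(10, dtype=bool)
high_dollar_volume = percentile_between(dollar_volume, 50, 100, everything)
top_open_price = top(open_price, 3, high_dollar_volume)
high_close_price = percentile_between(close_price, 50, 100, top_open_price)

funnel_sizes = [int(f.sum()) for f in
                (everything, high_dollar_volume, top_open_price, high_close_price)]
```

Each stage only ranks the assets that survived the previous one, so the candidate set shrinks from 10 to 5 to 3 to 2.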

In the next lesson, we will look at classifiers.

#Pipeline Tutorial Lesson 8
##Classifiers

A classifier is a function from an asset and a moment in time to a categorical output such as a string or integer label:

$$f(\text{asset}, \text{timestamp}) \to \text{category}$$

An example of a classifier producing a string output is the exchange ID of a security. To create this classifier, we'll have to import `morningstar.share_class_reference.exchange_id` and use the `latest` attribute to instantiate our classifier:

In [38]:
from quantopian.pipeline.data import morningstar

# Since the underlying data of morningstar.share_class_reference.exchange_id
# is of type string, .latest returns a Classifier
exchange = morningstar.share_class_reference.exchange_id.latest

Previously, we saw that the `latest` attribute produced an instance of a `Factor`. In this case, since the underlying data is of type `string`, `latest` produces a `Classifier`. 

Similarly, a computation producing the latest Morningstar sector code of a security is a `Classifier`. In this case, the underlying type is an `int`, but the integer doesn't represent a numerical value (it's a category) so it produces a classifier. To get the latest sector code, we can use the built-in `Sector` classifier.

In [39]:
from quantopian.pipeline.classifiers.morningstar import Sector
morningstar_sector = Sector()

Using `Sector` is equivalent to `morningstar.asset_classification.morningstar_sector_code.latest`.

##Building Filters from Classifiers

Classifiers can also be used to produce filters with methods like `isnull`, `eq`, and `startswith`. The full list of `Classifier` methods producing `Filters` can be found [here](https://www.quantopian.com/help#quantopian_pipeline_classifiers_Classifier).

As an example, if we wanted a filter to select securities trading on the New York Stock Exchange, we can use the `eq` method of our `exchange` classifier.

In [40]:
nyse_filter = exchange.eq('NYS')

This filter will return `True` for securities having `NYS` as their most recent `exchange_id`.

##Quantiles

Classifiers can also be produced from various `Factor` methods. The most general of these is the `quantiles` method, which accepts a bin count as an argument. The `quantiles` classifier assigns a label from 0 to (bins - 1) to every non-NaN data point in the factor output. `NaN`s are labeled with -1. Aliases are available for [quartiles](https://www.quantopian.com/help/#quantopian_pipeline_factors_Factor_quartiles) (`quantiles(4)`), [quintiles](https://www.quantopian.com/help/#quantopian_pipeline_factors_Factor_quintiles) (`quantiles(5)`), and [deciles](https://www.quantopian.com/help/#quantopian_pipeline_factors_Factor_deciles) (`quantiles(10)`). As an example, this is what a filter for the top decile of a factor might look like:

In [41]:
dollar_volume_decile = AverageDollarVolume(window_length=10).deciles()
top_decile = (dollar_volume_decile.eq(9))
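
The labeling rule (0 to bins - 1 for valid values, -1 for NaN) can be sketched with a rank-based binning in NumPy. The `quantiles` function here is a hypothetical approximation for illustration only; the library may place bin edges and break ties differently.

```python
import numpy as np

def quantiles(values, bins):
    """Label each non-NaN value with a bin from 0 to bins - 1; NaNs get -1."""
    values = np.asarray(values, dtype=float)
    labels = np.full(values.shape, -1, dtype=int)
    valid = ~np.isnan(values)
    # Rank the valid values 0..k-1, then scale the ranks into equal-sized bins.
    ranks = np.argsort(np.argsort(values[valid], kind='stable'), kind='stable')
    labels[valid] = ranks * bins // valid.sum()
    return labels

# Made-up factor output with one missing value.
factor_output = [10.0, float('nan'), 30.0, 20.0, 40.0]
halves = quantiles(factor_output, 2)

# A filter for the top bin, analogous to `.eq(9)` on deciles.
top_half = halves == 1
```

With two bins, the two smallest valid values (10 and 20) get label 0, the two largest (30 and 40) get label 1, and the NaN entry gets -1.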

Let's put each of our classifiers into a pipeline and run it to see what they look like.

In [42]:
def make_pipeline():
    exchange = morningstar.share_class_reference.exchange_id.latest
    nyse_filter = exchange.eq('NYS')
    
    morningstar_sector = Sector()
    
    dollar_volume_decile = AverageDollarVolume(window_length=10).deciles()
    top_decile = (dollar_volume_decile.eq(9))
    
    return Pipeline(
        columns={
            'exchange': exchange,
            'sector_code': morningstar_sector,
            'dollar_volume_decile': dollar_volume_decile
        },
        screen=(nyse_filter & top_decile)
    )
    

In [43]:
result = run_pipeline(make_pipeline(), '2015-05-05', '2015-05-05')
print('Number of securities that passed the filter: %d' % len(result))
result.head()

Number of securities that passed the filter: 510


Unnamed: 0,Unnamed: 1,dollar_volume_decile,exchange,sector_code
2015-05-05 00:00:00+00:00,Equity(2 [ARNC]),9,NYS,101
2015-05-05 00:00:00+00:00,Equity(62 [ABT]),9,NYS,206
2015-05-05 00:00:00+00:00,Equity(64 [ABX]),9,NYS,101
2015-05-05 00:00:00+00:00,Equity(76 [TAP]),9,NYS,205
2015-05-05 00:00:00+00:00,Equity(128 [ADM]),9,NYS,205


Classifiers are also useful for describing grouping keys for complex transformations on Factor outputs. Grouping operations such as [demean](https://www.quantopian.com/help#quantopian_pipeline_factors_Factor_demean) and [groupby](https://www.quantopian.com/help#quantopian_pipeline_factors_Factor_groupby) are out of the scope of this tutorial. A future tutorial will cover more advanced features of classifiers.
 
In the next lesson, we'll look at the different datasets that we can use in pipeline.

#Pipeline Tutorial Lesson 9
##Datasets and BoundColumns

When building a pipeline, we need a way to identify the inputs to our computations. The input to a pipeline is specified using `DataSets` and `BoundColumns`.

`DataSets` are simply collections of objects that tell the Pipeline API where and how to find the inputs to computations. An example of a `DataSet` that we have already seen is `USEquityPricing`.

A `BoundColumn` is a column of data that is concretely bound to a `DataSet`. Instances of `BoundColumns` are dynamically created upon access to attributes of `DataSets`. Inputs to pipeline computations must be of type `BoundColumn`. An example of a `BoundColumn` that we have already seen is `USEquityPricing.close`.

It is important to understand that `DataSets` and `BoundColumns` do not hold actual data. Remember that when computations are created and added to a pipeline, they don't actually perform the computation until the pipeline is run. `DataSets` and `BoundColumns` can be thought of in a similar way; they are simply used to identify the inputs of a computation. The data is populated when the pipeline is run.
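
This "describe now, compute later" idea can be shown with a toy in plain Python. The classes `ToyColumn` and `ToyMean` are invented for this sketch and are not part of the Pipeline API; they only illustrate that a column object can name an input without holding any data.

```python
class ToyColumn(object):
    """Names an input; holds no data itself (a stand-in for a BoundColumn)."""
    def __init__(self, name):
        self.name = name


class ToyMean(object):
    """Describes 'mean of the last window_length values of a column'."""
    def __init__(self, column, window_length):
        self.column = column
        self.window_length = window_length

    def run(self, data):
        # Data is looked up only now, when the computation is executed.
        window = data[self.column.name][-self.window_length:]
        return sum(window) / float(len(window))


close = ToyColumn('close')
mean_close_3 = ToyMean(close, window_length=3)   # nothing computed yet

# Hypothetical price history supplied at "run time".
history = {'close': [10.0, 11.0, 12.0, 13.0]}
value = mean_close_3.run(history)
```

`mean_close_3` exists before any prices do; only the call to `run` reads the last three closes (11, 12, 13) and returns their mean, 12.0.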

##dtypes

When defining pipeline computations, we need to know the types of our inputs in order to know which operations and functions we can use. The `dtype` of a `BoundColumn` tells a computation what the type of the data will be when the pipeline is run. For example, `USEquityPricing.close` has a `float` `dtype`, so a factor may perform arithmetic operations on it (e.g. compute the 5-day mean). The importance of this will become more clear in the next lesson.

The `dtype` of a `BoundColumn` can also determine the type of a computation. In the case of the `latest` computation, the `dtype` determines whether the computation is a factor (`float`), a filter (`bool`), or a classifier (`string`, `int`).

##Pricing Data

US equity pricing data is stored in the `USEquityPricing` dataset. `USEquityPricing` provides five columns:

- `USEquityPricing.open`
- `USEquityPricing.high`
- `USEquityPricing.low`
- `USEquityPricing.close`
- `USEquityPricing.volume`

Each of these columns has a `float` `dtype`.


##Fundamental Data

[Morningstar](http://corporate1.morningstar.com/us/home/) fundamental datasets are namespaced under the `quantopian.pipeline.data.morningstar` module.

The following datasets are currently available from the `morningstar` module:
- [asset_classification](https://www.quantopian.com/help/fundamentals#asset-classification)
- [balance_sheet](https://www.quantopian.com/help/fundamentals#balance-sheet)
- [cash_flow_statement](https://www.quantopian.com/help/fundamentals#cash-flow-statement)
- [company_reference](https://www.quantopian.com/help/fundamentals#company-reference)
- [earnings_ratios](https://www.quantopian.com/help/fundamentals#earnings-ratios)
- [earnings_report](https://www.quantopian.com/help/fundamentals#earnings-report)
- [financial_statement_filing](https://www.quantopian.com/help/fundamentals#financial-statement-filing)
- [general_profile](https://www.quantopian.com/help/fundamentals#general-profile)
- [income_statement](https://www.quantopian.com/help/fundamentals#income-statement)
- [operation_ratios](https://www.quantopian.com/help/fundamentals#operation-ratios)
- [share_class_reference](https://www.quantopian.com/help/fundamentals#share-class-reference)
- [valuation](https://www.quantopian.com/help/fundamentals#valuation)
- [valuation_ratios](https://www.quantopian.com/help/fundamentals#valuation-ratios)

Each of these datasets provides columns that can be passed as inputs to pipeline computations. The `dtypes` of the columns vary. For example, `morningstar.valuation.market_cap` is a column representing the most recently reported market cap for each asset on each date. There are over 900 total columns available in the Morningstar datasets. See the [Quantopian Fundamentals Reference](https://www.quantopian.com/help/fundamentals) for a full description of all such attributes.

##Partner Data

Many datasets besides `USEquityPricing` and Morningstar fundamentals are available on Quantopian. These include corporate fundamental data, news sentiment, macroeconomic indicators, and more. All datasets are namespaced by provider under `quantopian.pipeline.data`.

* `quantopian.pipeline.data.accern` ([Accern](https://www.quantopian.com/data/accern))
* `quantopian.pipeline.data.estimize` ([Estimize](https://www.quantopian.com/data/estimize))
* `quantopian.pipeline.data.eventvestor` ([EventVestor](https://www.quantopian.com/data/eventvestor))
* `quantopian.pipeline.data.psychsignal` ([PsychSignal](https://www.quantopian.com/data/psychsignal))
* `quantopian.pipeline.data.quandl` ([Quandl](https://www.quantopian.com/data/quandl))
* `quantopian.pipeline.data.sentdex` ([Sentdex](https://www.quantopian.com/data/sentdex))

Similar to `USEquityPricing`, each of these datasets has columns (`BoundColumns`) that can be used in pipeline computations. The columns, along with example algorithms and notebooks, can be found on the [Data page](https://www.quantopian.com/data). The `dtypes` of the columns vary.

`BoundColumns` are most commonly used in `CustomFactors`, which we will explore in the next lesson.

#Pipeline Tutorial Lesson 10
##Custom Factors

When we first looked at factors, we explored the set of built-in factors. Frequently, a desired computation isn't included as a built-in factor. One of the most powerful features of the Pipeline API is that it allows us to define our own custom factor.

Conceptually, a custom factor is identical to a built-in factor. It accepts `inputs`, `window_length`, and `mask` as constructor arguments, and returns a `Factor` object each day.

Let's take an example of a computation that doesn't exist as a built-in: standard deviation. To create a factor that computes the [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation) over a trailing window, we can subclass `quantopian.pipeline.CustomFactor` and implement a compute method whose signature is:
```python
def compute(self, today, asset_ids, out, *inputs):
    ...
```

- `*inputs` are M x N [numpy arrays](http://docs.scipy.org/doc/numpy-1.10.1/reference/arrays.ndarray.html), where M is the `window_length` and N is the number of securities (usually around 8000 unless a mask is provided). `*inputs` are trailing data windows. Note that there will be one M x N array for each `BoundColumn` provided in the factor's `inputs` list. The data type of each array will be the `dtype` of the corresponding `BoundColumn`.
- `out` is an empty array of length N. `out` will be the output of our custom factor each day. The job of compute is to write output values into `out`.
- `asset_ids` will be an integer [array](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.array.html) of length N containing security ids corresponding to the columns in our `*inputs` arrays.
- `today` will be a [pandas Timestamp](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#converting-to-timestamps) representing the day for which `compute` is being called.

Of these, `*inputs` and `out` are most commonly used.
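
To make this contract concrete, here is a minimal sketch in plain numpy that runs outside of Quantopian. The `fake_window` array, the `asset_ids` values, and the standalone `compute` function are hypothetical stand-ins for what the Pipeline engine would normally build and pass in; `None` is used as a placeholder for `today`.

```python
import numpy

# Hypothetical stand-ins for the arguments the engine passes to compute():
# a 3-day window (M = 3) for 2 securities (N = 2).
fake_window = numpy.array([
    [10.0, 20.0],   # oldest day in the window
    [11.0, 22.0],
    [12.0, 21.0],   # most recent day in the window
])
asset_ids = numpy.array([2, 24])
out = numpy.empty(fake_window.shape[1])  # length-N output buffer


def compute(today, asset_ids, out, values):
    # Reduce each column (one per security) to a single number
    # and write the results into out in place.
    out[:] = values.max(axis=0)


compute(None, asset_ids, out, fake_window)
print(out)
```

The important detail is that `compute` does not return anything. It fills `out` in place, one value per column of the input window.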

An instance of `CustomFactor` that's been added to a pipeline will have its compute method called every day. For example, let's define a custom factor that computes the standard deviation of the close price over the last 5 days. To start, let's add `CustomFactor` and `numpy` to our import statements.

In [44]:
from quantopian.pipeline import CustomFactor, Pipeline
import numpy

Next, let's define our custom factor to calculate the standard deviation over a trailing window using [numpy.nanstd](http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.nanstd.html):

In [45]:
class StdDev(CustomFactor):
    def compute(self, today, asset_ids, out, values):
        # Calculates the column-wise standard deviation, ignoring NaNs
        out[:] = numpy.nanstd(values, axis=0)

Finally, let's instantiate our factor in `make_pipeline()`:

In [46]:
def make_pipeline():
    std_dev = StdDev(inputs=[USEquityPricing.close], window_length=5)
    
    return Pipeline(
        columns={
            'std_dev': std_dev
        }
    )

When this pipeline is run, `StdDev.compute()` will be called every day with data as follows:

- `values`: An M x N [numpy](http://docs.scipy.org/doc/numpy-1.10.1/reference/arrays.ndarray.html) array, where M is 5 (`window_length`), and N is ~8000 (the number of securities in our database on the day in question).
- `out`: An empty array of length N (~8000). In this example, the job of `compute` is to populate `out` with an array of 5-day close price standard deviations.
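
As a rough illustration of what happens inside `StdDev.compute()`, the sketch below applies `numpy.nanstd` to a small made-up window. The prices are hypothetical, and only 3 securities are used instead of ~8000 so the result is easy to check by hand.

```python
import numpy

# Hypothetical 5-day window of close prices for 3 securities.
# The third security has a missing price on the oldest day.
values = numpy.array([
    [10.0, 5.0, numpy.nan],
    [12.0, 5.0, 2.0],
    [11.0, 5.0, 4.0],
    [13.0, 5.0, 2.0],
    [14.0, 5.0, 4.0],
])

out = numpy.empty(values.shape[1])

# axis=0 collapses the 5 rows (days), leaving one value per column (security).
out[:] = numpy.nanstd(values, axis=0)
print(out)
```

Because `nanstd` skips NaNs, the third security still gets a standard deviation computed from its four available prices rather than a NaN. The second security, whose price never changes, gets a standard deviation of zero.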

In [47]:
result = run_pipeline(make_pipeline(), '2015-05-05', '2015-05-05')



In [48]:
result.head()

Unnamed: 0,Unnamed: 1,std_dev
2015-05-05 00:00:00+00:00,Equity(2 [ARNC]),0.293428
2015-05-05 00:00:00+00:00,Equity(21 [AAME]),0.004714
2015-05-05 00:00:00+00:00,Equity(24 [AAPL]),1.737677
2015-05-05 00:00:00+00:00,Equity(25 [ARNC_PR]),0.275
2015-05-05 00:00:00+00:00,Equity(31 [ABAX]),4.402971


##Default Inputs

When writing a custom factor, we can set default `inputs` and `window_length` in our `CustomFactor` subclass. For example, let's define the `TenDayMeanDifference` custom factor to compute the mean difference between two data columns over a trailing window using [numpy.nanmean](http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.nanmean.html). Let's set the default inputs to `[USEquityPricing.close, USEquityPricing.open]` and the default `window_length` to 10:

In [49]:
class TenDayMeanDifference(CustomFactor):
    # Default inputs.
    inputs = [USEquityPricing.close, USEquityPricing.open]
    window_length = 10
    
    def compute(self, today, asset_ids, out, close, open):
        # Calculates the column-wise mean difference, ignoring NaNs
        out[:] = numpy.nanmean(close - open, axis=0)

Remember in this case that `close` and `open` are each 10 x ~8000 2D [numpy arrays](http://docs.scipy.org/doc/numpy-1.10.1/reference/arrays.ndarray.html).
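
The subtraction in `compute` is element-wise, so each (day, security) cell gets its own close-minus-open value before the column-wise mean is taken. A small sketch with made-up numbers (3 days, 2 securities) shows the shapes involved. The name `open_` is used here only to avoid shadowing Python's built-in `open`.

```python
import numpy

# Hypothetical 3-day windows for 2 securities.
close = numpy.array([
    [10.0, 20.0],
    [11.0, numpy.nan],   # missing close for the second security
    [12.0, 22.0],
])
open_ = numpy.array([
    [9.0, 21.0],
    [11.0, 20.0],
    [10.0, 21.0],
])

# Element-wise difference: still a 3 x 2 array.
diff = close - open_

# Column-wise mean, ignoring the NaN produced by the missing close.
mean_diff = numpy.nanmean(diff, axis=0)
print(mean_diff)
```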

Now, if we call `TenDayMeanDifference` without providing any arguments, it will use the defaults.

In [50]:
# Computes the 10-day mean difference between the daily open and close price
close_open_diff = TenDayMeanDifference()

The defaults can be manually overridden by specifying arguments in the constructor call.

In [51]:
# Computes the 10-day mean difference between the daily high and low prices.
high_low_diff = TenDayMeanDifference(inputs=[USEquityPricing.high, USEquityPricing.low])


##Further Example

Let's take another example where we build a [momentum](http://www.investopedia.com/terms/m/momentum.asp) factor and use it to create a filter. We will then use that filter as a `screen` on our pipeline.

Let's start by defining a `Momentum` factor to be the division of the most recent close price by the oldest close price in a trailing window of `n` days, where `n` is the `window_length`.

In [52]:
class Momentum(CustomFactor):
    # Default inputs
    inputs = [USEquityPricing.close]
    
    # Compute momentum
    def compute(self, today, assets, out, close):
        out[:] = close[-1] / close[0]

Now, let's instantiate our `Momentum` factor (twice) to create a 10-day momentum factor and a 20-day momentum factor. Let's also create a `positive_momentum` filter returning `True` for securities with both a positive 10-day momentum and a positive 20-day momentum.

In [53]:
ten_day_momentum = Momentum(window_length=10)
twenty_day_momentum = Momentum(window_length=20)

positive_momentum = ((ten_day_momentum > 1) & (twenty_day_momentum > 1))
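
Note that this momentum is a price ratio, so "positive" momentum means a value above 1, not above 0. The following sketch reproduces the arithmetic in plain numpy on a made-up 20-day window. The prices and the helper function `momentum` are hypothetical and exist only for illustration.

```python
import numpy

# Hypothetical 20-day window of close prices for 3 securities.
days = numpy.arange(20, dtype=float)
close = numpy.column_stack([
    100.0 + days,                                            # steadily rising
    100.0 - days,                                            # steadily falling
    numpy.where(days < 10, 100.0 + 2 * days, 130.0 - days),  # rises, then fades
])


def momentum(window):
    # Same arithmetic as Momentum.compute: newest row over oldest row.
    return window[-1] / window[0]


ten_day = momentum(close[-10:])   # corresponds to window_length=10
twenty_day = momentum(close)      # corresponds to window_length=20

# A ratio above 1 means the price rose over the window.
positive = (ten_day > 1) & (twenty_day > 1)
print(positive)
```

The third security is up over 20 days but down over the last 10, so it fails the combined filter even though one of its momentum values is above 1.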

Next, let's add our momentum factors and our `positive_momentum` filter to `make_pipeline`. Let's also set `positive_momentum` to be the screen of our pipeline.

In [56]:
def make_pipeline():
    ten_day_momentum = Momentum(window_length=10)
    twenty_day_momentum = Momentum(window_length=20)

    positive_momentum = ((ten_day_momentum > 1) & (twenty_day_momentum > 1))
    
    std_dev = StdDev(inputs=[USEquityPricing.close],
                     window_length=5)
    
    return Pipeline(
        columns={
            'std_dev': std_dev,
            'ten_day_momentum': ten_day_momentum,
            'twenty_day_momentum': twenty_day_momentum
        }, screen=positive_momentum
    )

Running this pipeline outputs the standard deviation and each of our momentum computations for securities with positive 10-day and 20-day momentum.

In [57]:
result = run_pipeline(make_pipeline(), '2015-05-05', '2015-05-05')
result.head()

Unnamed: 0,Unnamed: 1,std_dev,ten_day_momentum,twenty_day_momentum
2015-05-05 00:00:00+00:00,Equity(2 [ARNC]),0.293428,1.036612,1.042783
2015-05-05 00:00:00+00:00,Equity(24 [AAPL]),1.737677,1.014256,1.02138
2015-05-05 00:00:00+00:00,Equity(39 [DDC]),0.138939,1.062261,1.167319
2015-05-05 00:00:00+00:00,Equity(52 [ABM]),0.09368,1.009212,1.015075
2015-05-05 00:00:00+00:00,Equity(64 [ABX]),0.178034,1.025721,1.065587


Custom factors allow us to define custom computations in a pipeline. They are frequently the best way to perform computations on [partner datasets](https://www.quantopian.com/data) or on multiple data columns. The full documentation for CustomFactor is available [here](https://www.quantopian.com/help#custom-factors).

In the next lesson, we'll use everything we've learned so far to create a pipeline for an algorithm.

#Pipeline Tutorial Lesson 11
##Putting It All Together

Now that we've covered the basic components of the Pipeline API, let's construct a pipeline that we might want to use in an algorithm.

To start, let's first create a filter to narrow down the types of securities coming out of our pipeline. In this example, we will create a filter to select for securities that meet the following criteria:

- Is a primary share
- Is listed as a common stock
- Is not a [depositary receipt](http://www.investopedia.com/terms/d/depositaryreceipt.asp) (ADR/GDR)
- Is not trading [over-the-counter](http://www.investopedia.com/terms/o/otc.asp) (OTC)
- Is not [when-issued](http://www.investopedia.com/terms/w/wi.asp) (WI)
- Doesn't have a name indicating it's a [limited partnership](http://www.investopedia.com/terms/l/limitedpartnership.asp) (LP)
- Doesn't have a company reference indicating it's an LP
- Is not an [ETF](http://www.investopedia.com/terms/e/etf.asp) (has Morningstar fundamental data)

###Why These Criteria?

Selecting for primary shares and common stock helps us select only a single security for each company. In general, primary shares are a good representative asset of a company so we will select for these in our pipeline.

ADRs and GDRs are issuances in the US equity market for stocks that trade on other exchanges. Frequently, there is inherent risk associated with depositary receipts due to currency fluctuations so we exclude them from our pipeline.

OTC, WI, and LP securities are not tradeable with most brokers. As a result, we exclude them from our pipeline.

When it comes to ranking and comparing securities, it rarely makes sense to compare ETFs with regular stocks. ETFs are composites without fundamental data. They derive their value from a larger group of securities. To avoid comparing apples and oranges, we exclude them from our pipeline.

##Creating Our Pipeline

Let's create a filter for each criterion and combine them together to create a `tradeable_stocks` filter. First, we need to import the Morningstar `Dataset` as well as the `IsPrimaryShare` built-in filter.

In [58]:
from quantopian.pipeline.data import morningstar
from quantopian.pipeline.filters.morningstar import IsPrimaryShare

Now we can define our filters:

In [59]:
# Filter for primary share equities. IsPrimaryShare is a built-in filter.
primary_share = IsPrimaryShare()

# Equities listed as common stock (as opposed to, say, preferred stock).
# 'ST00000001' indicates common stock.
common_stock = morningstar.share_class_reference.security_type.latest.eq('ST00000001')

# Non-depositary receipts. Recall that the ~ operator inverts filters,
# turning Trues into Falses and vice versa
not_depositary = ~morningstar.share_class_reference.is_depositary_receipt.latest

# Equities not trading over-the-counter.
not_otc = ~morningstar.share_class_reference.exchange_id.latest.startswith('OTC')

# Not when-issued equities.
not_wi = ~morningstar.share_class_reference.symbol.latest.endswith('.WI')

# Equities without LP in their name, .matches does a match using a regular
# expression
not_lp_name = ~morningstar.company_reference.standard_name.latest.matches('.* L[. ]?P.?$')

# Equities with a null value in the limited_partnership Morningstar
# fundamental field.
not_lp_balance_sheet = morningstar.balance_sheet.limited_partnership.latest.isnull()

# Equities whose most recent Morningstar market cap is not null have
# fundamental data and therefore are not ETFs.
have_market_cap = morningstar.valuation.market_cap.latest.notnull()

# Filter for stocks that pass all of our previous filters.
tradeable_stocks = (
    primary_share
    & common_stock
    & not_depositary
    & not_otc
    & not_wi
    & not_lp_name
    & not_lp_balance_sheet
    & have_market_cap
)

Note that when defining our filters, we used several `Classifier` methods that we haven't yet seen, including `notnull`, `isnull`, `startswith`, `endswith`, and `matches`. Documentation on these methods is available [here](https://www.quantopian.com/help#quantopian_pipeline_classifiers_Classifier).
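
To get a feel for what the regular expression in `not_lp_name` accepts, we can try it with Python's standard `re` module. This is only a sketch: it assumes `matches` behaves like `re.match` (anchored at the start of the string), and the company names below are made up for illustration.

```python
import re

# The same pattern that is passed to .matches() above.
pattern = re.compile(r'.* L[. ]?P.?$')

# Hypothetical company names, not taken from the Morningstar data.
names = [
    'Example Partners L.P.',
    'Example Holdings LP',
    'Example Inc',
    'Helpful Corp',
]

# True where the name looks like a limited partnership.
is_lp_name = [bool(pattern.match(name)) for name in names]

# The pipeline filter is the inverse (~), keeping the non-LP names.
not_lp_name = [not flag for flag in is_lp_name]

for name, keep in zip(names, not_lp_name):
    print(name, keep)
```

The pattern requires a space followed by an uppercase `L` near the end of the name, so a name that merely contains the letters "lp" (such as the last one) is not treated as a limited partnership.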

Next, let's create a filter for the top 30% of tradeable stocks by 20-day average dollar volume. We'll call this our `base_universe`.

In [60]:
base_universe = AverageDollarVolume(window_length=20, mask=tradeable_stocks).percentile_between(70,100)
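
The combination of `mask` and `percentile_between(70, 100)` can be pictured with a small numpy sketch. This is not how Pipeline implements it internally, and the exact handling of boundary values may differ; the dollar volumes and the `tradeable` flags are hypothetical. The point is that masked-out securities are ignored when the percentile cutoff is computed and can never pass the resulting filter.

```python
import numpy

# Hypothetical 20-day average dollar volumes for 10 securities.
dollar_volume = numpy.array(
    [5.0, 50.0, 20.0, 80.0, 10.0, 40.0, 90.0, 30.0, 70.0, 60.0]
)

# Hypothetical tradeable_stocks mask: the last two securities fail it.
tradeable = numpy.array([True] * 8 + [False] * 2)

# The 70th percentile is computed over the masked-in securities only.
cutoff = numpy.percentile(dollar_volume[tradeable], 70)

# Keep tradeable securities at or above the cutoff (the top 30%).
in_base_universe = tradeable & (dollar_volume >= cutoff)
print(cutoff, numpy.flatnonzero(in_base_universe))
```

The last two securities have high dollar volume but are excluded anyway, because they did not pass the mask.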

###Built-in Base Universe

We have just defined our own universe to select 'tradeable' securities with high dollar volume. However, Quantopian has two built-in filters that do something similar. The [Q500US](https://www.quantopian.com/help#quantopian_pipeline_filters_Q500US) and the [Q1500US](https://www.quantopian.com/help#quantopian_pipeline_filters_Q1500US) are built-in pipeline filters that select a group of 500 or 1500 tradeable, liquid stocks each day. Constituents of these groups are chosen at the start of each calendar month by selecting the top 'tradeable' stocks by 200-day average dollar volume, capped at 30% of equities allocated to any single sector (more detail on the selection criteria of these filters can be found [here](https://www.quantopian.com/posts/the-q500us-and-q1500us)).

To simplify our pipeline, let's replace what we've already written for our `base_universe` with the `Q1500US` built-in filter. First, we need to import it.

In [61]:
from quantopian.pipeline.filters.morningstar import Q1500US

Then, let's set our `base_universe` to the Q1500US.

In [62]:
base_universe = Q1500US()

###Mean Reversion Factors

Now that we have a filter `base_universe` that we can use to select a subset of securities, let's focus on creating factors for this subset. For example, let's create a pipeline for a mean reversion strategy. In this strategy, we'll look at the 10-day and 30-day moving averages (close price). Let's plan to open equally weighted long positions in the 25 securities with the least (most negative) percent difference and equally weighted short positions in the 25 with the greatest percent difference. To do this, let's create two moving average factors using our `base_universe` filter as a mask. Then, let's combine them into a factor computing the percent difference.

In [63]:
# 10-day close price average
mean_10 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=10, mask=base_universe)

# 30-day close price average.
mean_30 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=30, mask=base_universe)

percent_difference = (mean_10 - mean_30) / mean_30
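
To see what `percent_difference` measures, here is the same arithmetic in plain numpy on a made-up 30-day window of close prices for two securities. The prices are hypothetical, and the moving averages are computed by hand rather than with `SimpleMovingAverage`.

```python
import numpy

# Hypothetical 30-day window of close prices for 2 securities.
days = numpy.arange(30, dtype=float)
close = numpy.column_stack([
    100.0 + days,   # steadily rising
    100.0 - days,   # steadily falling
])

# Column-wise averages over the last 10 rows and over all 30 rows.
mean_10 = close[-10:].mean(axis=0)
mean_30 = close.mean(axis=0)

percent_difference = (mean_10 - mean_30) / mean_30
print(percent_difference)
```

The rising security has a positive percent difference (its recent average is above its longer-term average), while the falling one has a negative value. Under the mean reversion idea described above, the first would be a candidate for a short position and the second for a long position.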

Next, let's create filters for the top 25 and bottom 25 equities by `percent_difference`.

In [64]:
# Create a filter to select securities to short.
shorts = percent_difference.top(25)

# Create a filter to select securities to long.
longs = percent_difference.bottom(25)

Let's then combine `shorts` and `longs` to create a new filter that we can use as the screen of our pipeline:

In [65]:
securities_to_trade = (shorts | longs)
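
The `top`, `bottom`, and `|` operations can be mimicked with boolean numpy arrays. The sketch below is only an analogy to what the filters compute on a single day: the factor values are hypothetical, and 2 securities are selected on each side instead of 25.

```python
import numpy

# Hypothetical percent_difference values for 6 securities.
percent_difference = numpy.array([0.05, -0.10, 0.02, 0.08, -0.03, -0.07])
n = 2  # stands in for the 25 used in the pipeline

# Positions of the securities sorted from smallest to largest value.
order = numpy.argsort(percent_difference)

longs = numpy.zeros(percent_difference.shape, dtype=bool)
shorts = numpy.zeros(percent_difference.shape, dtype=bool)
longs[order[:n]] = True     # the n most negative values, like bottom(n)
shorts[order[-n:]] = True   # the n largest values, like top(n)

# A security is traded if it is in either group.
securities_to_trade = shorts | longs
print(securities_to_trade)
```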

Since our earlier filters were used as masks as we built up to this final filter, when we use `securities_to_trade` as a screen, the output securities will meet the criteria outlined at the beginning of the lesson (primary shares, non-ETFs, etc.). They will also have high dollar volume.

Finally, let's instantiate our pipeline. Since we are planning on opening equally weighted long and short positions later, the only information that we actually need from our pipeline is which securities we want to trade (the pipeline index) and whether to open a long or a short position. Let's add our `longs` and `shorts` filters to our pipeline and set our screen to be `securities_to_trade`.

In [66]:
def make_pipeline():

    # Base universe filter.
    base_universe = Q1500US()

    # 10-day close price average.
    mean_10 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=10, mask=base_universe)

    # 30-day close price average.
    mean_30 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=30, mask=base_universe)

    # Percent difference factor.
    percent_difference = (mean_10 - mean_30) / mean_30

    # Create a filter to select securities to short.
    shorts = percent_difference.top(25)

    # Create a filter to select securities to long.
    longs = percent_difference.bottom(25)

    # Filter for the securities that we want to trade.
    securities_to_trade = (shorts | longs)

    return Pipeline(
      columns={
        'longs': longs,
        'shorts': shorts
      },
      screen=securities_to_trade
    )

Running this pipeline will result in a DataFrame with 50 rows and 2 columns each day. Each day, the columns will contain boolean values that we can use to decide whether we want to open a long or a short position in each security.

In [67]:
result = run_pipeline(make_pipeline(), '2015-05-05', '2015-05-05')
result.head()

Unnamed: 0,Unnamed: 1,longs,shorts
2015-05-05 00:00:00+00:00,Equity(351 [AMD]),True,False
2015-05-05 00:00:00+00:00,Equity(523 [AAN]),False,True
2015-05-05 00:00:00+00:00,Equity(1068 [BPT]),False,True
2015-05-05 00:00:00+00:00,Equity(1663 [CRK]),False,True
2015-05-05 00:00:00+00:00,Equity(4668 [MAT]),False,True
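
One way to turn the boolean columns into lists of securities is ordinary pandas boolean indexing. The sketch below builds a small stand-in DataFrame with the same shape as the pipeline output. The plain ticker strings in the index are an assumption for illustration; the real index holds `Equity` objects.

```python
import pandas as pd

# Stand-in for the pipeline output: a (date, security) MultiIndex
# with boolean 'longs' and 'shorts' columns.
date = pd.Timestamp('2015-05-05', tz='UTC')
index = pd.MultiIndex.from_tuples([
    (date, 'AMD'),
    (date, 'AAN'),
    (date, 'BPT'),
])
result = pd.DataFrame(
    {'longs': [True, False, False], 'shorts': [False, True, True]},
    index=index,
)

# Select rows where each flag is True, then read the security level
# (level 1) of the index.
long_list = result[result['longs']].index.get_level_values(1).tolist()
short_list = result[result['shorts']].index.get_level_values(1).tolist()
print(long_list, short_list)
```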


In the next lesson, we'll add this pipeline to an algorithm.