# Project 2. Analyzing the Effect of Short Sale Volume on Stock Pricing

In this project, we will analyze the correlations between short sale volume and stock pricing. Specifically, we'd like to answer the following questions:

1. Is a large short sale volume followed by a higher return over the next 1, 5, 22, and 65 working days (1 day, 1 week, 1 month, and 3 months)?
2. Are there certain sectors that are more sensitive to short interests?

<div class="alert alert-info">
Welcome to Project 2 of the Python for Finance course! As in the previous project, update the code under the <code># Todo</code> comments in the code cells below, and run your cells until they yield the desired outputs.
</div>

## 1. Import Libraries and Setup Global Variables

Re-run the code cell below whenever you need to include additional modules or update global variables.

In [None]:
# Todo: Import the Self-Serve dataset into your environment and include it here.
from quantopian.pipeline.data...

# from quantopian.pipeline.filters import QTradableStocksUS

from quantopian.pipeline.data import USEquityPricing
from quantopian.pipeline import Pipeline
from quantopian.research import run_pipeline
from datetime import datetime
from quantopian.pipeline.factors import CustomFactor, Returns
from quantopian.pipeline.filters import Q500US
from quantopian.pipeline.classifiers.morningstar import Sector

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

MORNINGSTAR_SECTOR_CODES = {
     -1: 'Misc',
    101: 'Basic Materials',
    102: 'Consumer Cyclical',
    103: 'Financial Services',
    104: 'Real Estate',
    205: 'Consumer Defensive',
    206: 'Healthcare',
    207: 'Utilities',
    308: 'Communication Services',
    309: 'Energy',
    310: 'Industrials',
    311: 'Technology',
}

# When building the code, use a short time range and a few tickers in your universe.
# When the code is ready, re-run with a longer time range and all stocks.
# Comment and uncomment the lines below as necessary.

# Todo: Don't forget to switch from development settings to
#       production settings after you have completed the project.

# Development settings
# start_date = datetime.strptime('04/01/2019', '%m/%d/%Y') + pd.tseries.offsets.BDay(65)
# end_date = datetime.strptime('05/01/2019', '%m/%d/%Y') + pd.tseries.offsets.BDay(65)
# def universe():
#     return (Q500US()) & (short_interests.short_volume.latest.notnull())
# mask = {'mask': universe()}

# Production settings (warning, will take about 5 minutes for each run!)
start_date = datetime.strptime('04/02/2013', '%m/%d/%Y') + pd.tseries.offsets.BDay(65)
end_date = datetime.strptime('02/10/2020', '%m/%d/%Y')
def universe():
    return short_interests.short_volume.latest.notnull()
mask = {'mask': universe()}

## 2. Review the Dataset Columns

Print out the columns in the short sale volume dataset.

In [None]:
# Todo: Print out the columns of the short sale volume dataset.


## 3. Build a Custom Factor to Get the Short Interest Ratio
On its own, short interest volume is not a very useful factor without proper context. Say a stock has 10,000,000 shares sold short: is that a lot, or only a few? To figure this out, we need to know how large the short volume is relative to the total number of shares traded. Therefore, a ratio of `short_volume/total_volume` would be ideal.
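
As a minimal sketch of the ratio itself, outside the Pipeline API, the snippet below uses a small pandas DataFrame with made-up numbers. The tickers and values are invented, and the column names `short_volume` and `total_volume` are assumptions taken from the formula above, not necessarily the names in your Self-Serve dataset.

```python
import pandas as pd

# Made-up daily volumes for three hypothetical tickers.
volumes = pd.DataFrame(
    {
        "short_volume": [10_000_000, 10_000_000, 250_000],
        "total_volume": [12_500_000, 80_000_000, 1_000_000],
    },
    index=["AAA", "BBB", "CCC"],
)

# The same 10,000,000 shorted shares mean very different things
# depending on how many shares traded in total.
volumes["sir"] = volumes["short_volume"] / volumes["total_volume"]
print(volumes["sir"])
```

The first two tickers have the same short volume but very different ratios, which is why the ratio is the more useful factor.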

In [None]:
class ShortInterestRatio(CustomFactor):
    # Todo: create a Factor that calculates the ratio of short interest and
    #       total volume.

In [None]:
# Debug by getting 1 day of data.
def make_pipeline():
    # Todo: Build and return a Pipeline object.
    #       Don't forget to include proper mask and screen parameters.
    
si_pipe = ...
si_mdf = run_pipeline(si_pipe, '04/02/2013', '04/02/2013')
si_mdf

## 4. Build a Pipeline that outputs Short Interest Ratio and Stock Returns

When this is correct, the first value of `sir_d65` should be the same as the `sir` value above, and the value of `sir_d2` should be the same as the next day's `sir_today`.

In [None]:
def make_pipeline():
    # Todo: Get returns and short interest ratios with various window lengths as
    #       asked in question 1.
    u = ...
    sector = ...
    sir_today = ...
    sir_d2 = ...
    sir_d5 = ...
    sir_d22 = ...
    sir_d65 = ...
    return_d2 = ...
    return_d5 = ...
    return_d22 = ...
    return_d65 = ...
    si_pipe = Pipeline(
        columns={
            'sector': ...,
            'sir_today': ...,
            'sir_d2': ...,
            'sir_d5': ...,
            'sir_d22': ...,
            'sir_d65': ...,
            'return_d2': ...,
            'return_d5': ...,
            'return_d22': ...,
            'return_d65': ...
        },
        screen=u
    )
    return si_pipe
si_pipe = ...
si_mdf = ...

**Sample output:**

In [None]:
si_mdf.head(5)

**DataFrame info:**

In [None]:
si_mdf.info()

**How many equities are there?**

In [None]:
len(si_mdf.index.get_level_values(1).unique())

## 5. Sampling The Data

Visualizing 7+ million rows of data is pointless, as most of the data points will cluster in the same region (remember the return distribution plot in the first project?). Therefore, we will take a sample of our data.

Sampling needs to be done carefully so that no sector is over-represented:

1. First, make sure there is no missing data: remove all rows that contain NaN values.
2. Then sample 1000 records from each sector, so we end up with the same number of records for each sector.
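
The two steps above can be sketched on a toy DataFrame as follows. The data are synthetic, the column names are assumptions, and 10 rows per sector stand in for the 1000 used in the project. This is one possible approach, not the only way to do it.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)

# Toy data: two sectors of unequal size, with a few missing values.
toy = pd.DataFrame(
    {
        "sector": [101] * 50 + [311] * 200,
        "sir_today": rng.rand(250),
        "return_d2": rng.normal(0, 0.03, 250),
    }
)
toy.loc[[0, 60, 61], "return_d2"] = np.nan

# Step 1: drop every row that has at least one NaN.
clean = toy.dropna()

# Step 2: draw the same number of rows from each sector
# (10 here; the project uses 1000).
n_per_sector = 10
sample = pd.concat(
    [group.sample(n=n_per_sector, random_state=1)
     for _, group in clean.groupby("sector")]
)

print(sample["sector"].value_counts())
```

Sampling within each group, rather than from the whole DataFrame, is what keeps the larger sector from dominating the sample.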

In [None]:
# Todo: Drop missing data from si_mdf 
si_sample_mdf = ...

random_state = 1
# Todo: Sample 1000 rows per sector from si_sample_mdf
si_sample_mdf = ...

In [None]:
si_sample_mdf.head(5)

In [None]:
si_sample_mdf.info()

<div class="alert alert-info">When correct, you should see 12000 rows in the DataFrame.</div>

In [None]:
# Print out summary statistics
si_sample_mdf.describe()

## 6. Initial Visualization

In this first visualization step, we simply plot every pair of returns and short interest ratios. As the summary statistics from `describe()` above show, the returns contain some extreme outliers. For instance, `return_d2` has a median (50% quantile) of 0 and a maximum of 1.721973 despite a standard deviation of only 0.031926, which puts this outlier far more than 2 standard deviations from the center of the distribution.
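
To put a rough number on that distance, the arithmetic can be sketched as below. It uses only the three figures quoted above and assumes the median is a reasonable stand-in for the mean, which is not quoted here.

```python
# Figures quoted from the describe() output above.
max_return = 1.721973
median_return = 0.0
std_return = 0.031926

# Assumption: the median stands in for the mean, which is not quoted here.
distance_in_std = (max_return - median_return) / std_return
print("max return is about {:.1f} standard deviations away".format(distance_in_std))
```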

However, it's still worth viewing an initial plot of the data points, if only to get a better picture of what we are dealing with.

In the next code cell, create a facet grid of four regplots, one for each variable pair. When done, your facet grid should look like the following:

![facet-regplots](https://platform.codingnomads.co/learn/pluginfile.php/6233/mod_page/content/3/facet-regplots.png)
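
A minimal sketch of such a grid on synthetic data is shown below. It assumes the four variable pairs are `sir_dN` against `return_dN`, and the random values are placeholders rather than real market data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.RandomState(1)
n = 200
windows = ["d2", "d5", "d22", "d65"]

# Synthetic stand-in for the sampled DataFrame (assumed column names).
toy = pd.DataFrame(
    {"{}_{}".format(kind, w): rng.normal(0.3, 0.1, n)
     for kind in ("sir", "return") for w in windows}
)

# One regplot per window, laid out on a 2x2 grid of Axes.
fig, axes = plt.subplots(2, 2, figsize=(10, 10))
for ax, w in zip(axes.ravel(), windows):
    sns.regplot(x="sir_" + w, y="return_" + w, data=toy, ax=ax, ci=None)
    ax.set_title("{} window".format(w))
fig.tight_layout()
```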

In [None]:
# Todo: Create a facet grid of regplots


## 7. Convert Sector Codes to Names

Since we are going to group the data points by sector, convert the sector codes to sector names to make the analysis easier, looking them up in the `MORNINGSTAR_SECTOR_CODES` variable above. You may use the [pandas.DataFrame.replace](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html) function for this.
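
As a small illustration of the lookup, the sketch below applies `replace` to a toy column. It uses only a subset of the mapping, and the DataFrame is made up.

```python
import pandas as pd

# Subset of the MORNINGSTAR_SECTOR_CODES mapping defined earlier.
sector_codes = {-1: "Misc", 206: "Healthcare", 311: "Technology"}

toy = pd.DataFrame({"sector": [311, 206, -1, 311]})

# replace() swaps each code for its name; codes missing from the
# dictionary would be left unchanged.
toy["sector_name"] = toy["sector"].replace(sector_codes)
print(toy)
```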

In [None]:
# Todo: Create a column `sector_name` that stores the sector names.
si_sample_mdf['sector_name'] = ...
si_sample_mdf.head(5)

## 8. Remove Outliers

As the visualization above shows, the outliers make it impossible to notice any trend in the data. Next, we are going to keep only the "average" returns. There are different arguments about what counts as "average", but since stock market returns are [not normally distributed](https://www.investopedia.com/terms/t/tailrisk.asp), we can't simply follow the [68–95–99.7 rule](https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule), take the range between the 16% and 84% quantiles, and assume the returns there lie within 1 standard deviation of the mean.

Therefore, let's take the liberty of keeping only the rows between the 0.25 and 0.75 quantiles of each return column. You will need to create four DataFrames here.
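
A minimal sketch of the quantile filter for a single return column follows. It uses synthetic data with one planted outlier, and the column name is an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)
toy = pd.DataFrame({"return_d2": rng.normal(0, 0.03, 1000)})
toy.loc[0, "return_d2"] = 1.7  # one extreme outlier

# Keep only the rows whose return lies between the two quartiles.
low, high = toy["return_d2"].quantile([0.25, 0.75])
middle = toy[toy["return_d2"].between(low, high)]

print("kept {} of {} rows".format(len(middle), len(toy)))
```

Because the bounds are the quartiles of the column itself, about half of the rows survive, and the planted outlier is dropped.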

In [None]:
# `df` is a quick-reference to si_sample_mdf, to shorten the code since
# it needs to be referenced multiple times.
df = si_sample_mdf

# Todo: Create one DataFrame for each number of days.
si_sample_d2_mdf = ...
si_sample_d5_mdf = ...
si_sample_d22_mdf = ...
si_sample_d65_mdf = ...

print("Size of d2 sample: {} rows".format(len(si_sample_d2_mdf)))
print("Size of d5 sample: {} rows".format(len(si_sample_d5_mdf)))
print("Size of d22 sample: {} rows".format(len(si_sample_d22_mdf)))
print("Size of d65 sample: {} rows".format(len(si_sample_d65_mdf)))

<div class="alert alert-info">
There should be 6000 rows for each of the DataFrames.
</div>

## 9. Visualize Multivariate Plots

Before visualizing multivariate plots, we first choose a color palette to use. Some of the available palettes are documented [here](https://seaborn.pydata.org/tutorial/color_palettes.html).

In [None]:
# Todo: Choose a color palette.


Now, you will create a facet grid of four Axes, with each Axes visualizing an independent DataFrame. Your final plot should look similar to the following:

![facet-lmplots](https://platform.codingnomads.co/learn/pluginfile.php/6233/mod_page/content/3/facet-lmplots.png)

Note that it does not need to be a 100% exact copy. The result is acceptable so long as it contains all the information.
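
One way to get a separate regression line per sector on a single Axes is to call `sns.regplot` once per group. The sketch below does this on synthetic data. The helper name `draw_by_group` is hypothetical, and this is only one possible approach, not necessarily how the reference plot was produced.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.RandomState(1)
toy = pd.DataFrame(
    {
        "sir_d2": rng.rand(90),
        "return_d2": rng.normal(0, 0.01, 90),
        "sector_name": ["Technology", "Utilities", "Energy"] * 30,
    }
)

def draw_by_group(x, y, hue, data, ax, pal):
    """Hypothetical helper: one scatter and regression line per `hue` value."""
    groups = sorted(data[hue].unique())
    for color, name in zip(pal, groups):
        subset = data[data[hue] == name]
        sns.regplot(x=x, y=y, data=subset, ax=ax, color=color,
                    label=name, ci=None, scatter_kws={"s": 10})
    ax.legend()

pal = sns.color_palette("husl", 3)
fig, ax = plt.subplots(figsize=(5, 5))
draw_by_group("sir_d2", "return_d2", "sector_name", toy, ax, pal)
```

Calling the helper once per Axes, each time with a different DataFrame, would fill a four-panel grid.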

In [None]:
# Todo: Implement draw_plot() and use it to fill the facet grid.
def draw_plot(x, y, hue, data, ax, pal, legend=False):
    pass

fig = plt.figure(figsize=(10, 10))
...

In [None]:
# Bonus Todo: If you'd like to see the larger version of the visualization,
# run this code cell. Replace the '...' with the appropriate values.

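# Note: recent seaborn releases use `height` instead of `size`, and recent
# matplotlib releases use `facecolor` instead of `axis_bgcolor`.
facet = sns.lmplot(..., palette=pal, size=7, aspect=1)
facet.set(axis_bgcolor='grey');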

## 10. Conclusion and Future Work

As the visualizations above show, for the 2-day data, short interests had the most positive correlation with stock returns in the Technology sector. As we move toward longer timeframes, however, the Utilities sector takes over that position.

In other words, for stocks in the Technology sector, the visualization suggests that a high number of short interests correlates with a higher return on the next day, but the trend reverses over the next three months.

This information might be useful when deciding whether to use short interest data to choose which sectors' stocks to go long or short on.

For future work, it might be interesting to see how the correlation changes in different periods. In addition to sector-based grouping, you may add time-based grouping e.g. according to business or political cycles.
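
As a starting point for such time-based grouping, the sketch below computes the correlation between a short interest ratio and a return separately for each calendar year. The data are synthetic, the column names are assumptions, and the calendar year is only a stand-in for whatever cycle you choose to study.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)
dates = pd.date_range("2018-01-01", "2019-12-31", freq="B")
toy = pd.DataFrame(
    {
        "sir_d2": rng.rand(len(dates)),
        "return_d2": rng.normal(0, 0.03, len(dates)),
    },
    index=dates,
)

# Group by calendar year as a stand-in for a business or political cycle.
corr_by_year = pd.Series(
    {year: grp["sir_d2"].corr(grp["return_d2"])
     for year, grp in toy.groupby(toy.index.year)}
)
print(corr_by_year)
```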