# TODO: Read 6.1 Time Series and 6.4 Advanced NLP

# Abstract

##### Abstract will be written after all analyses are complete.

For a copy of the proposal, visit https://github.com/makennedy626/Final-Capstone/blob/master/Final%20Capstone%20Project%20Proposal.ipynb. 

# Table of Contents

### I. Objective

### II. Data Access and Overview

### III. Scraping Google News API for Investor Sentiment

### IV. Unsupervised Machine Learning on Scraped Data to Generate Sentiment Analysis

### V. Supervised Machine Learning to Generate Prediction Models

### VI. Unsupervised Neural Networks on Scraped Data to Generate Sentiment Analysis

### VII. Supervised Neural Networks to Generate Prediction Models

### VIII. Performance Comparison of Techniques

### IX. Creating an Ensemble Model of the Highest Performing Components of the Techniques

### X. Results of the Ensemble Model

### XI. Conclusion

### XII. Important Information from Quantopian's Contest Page

# I. Objective

The objective of this project is to create a profitable algorithm leveraging Quantopian's investment platform. The platform provides users with an integrated Python environment in which they can backtest algorithms on historical stock prices. Quantopian also provides many free and paid data sets and built-in functions.

The end goal is a solution that generates positive gains over time in the stock market while meeting Quantopian's Contest Criteria. For an overview of the contest rules and judging criteria, see XII. Important Information from Quantopian's Contest Page.

If the criteria are met, the algorithm is evaluated for possible licensing to Quantopian. If licensed, capital is allocated (starting allocations average approximately five million dollars per algorithm), and the author receives ten percent of the net profit.

# II. Data Access and Overview

The project will use Zipline, Quantopian's open-source backtesting engine, together with Quantopian's free data sources (provided by Morningstar). This data consists of over 600 metrics that measure the financial performance of companies and are derived from their public filings. The data is split- and dividend-adjusted.

Quantopian algorithms utilize "universes": collections of approved stocks, selected using the metrics outlined below, that Quantopian generates and updates frequently.

The data will be accessed using Quantopian's Pipeline API.

## II.I Information About the QTradableStocksUS Universe from Quantopian
Taken from Quantopian's page, https://www.quantopian.com/posts/working-on-our-best-universe-yet-qtradablestocksus, where the advantages of this new universe over the previous ones (Q500US and Q1500US) are discussed.

QTradableStocksUS has no explicit size limit, and generally has between 1,600 and 2,100 members.

The new universe has more effective screens to remove illiquid or otherwise untradeable stocks.

For companies with more than one share class, the new universe picks the most liquid rather than always picking the primary share.

The new universe is updated daily rather than monthly.

Here are the specific limits applied to the QTradableStocksUS:

Market cap: Over \$500M. This restriction eliminates many undiversifiable risks like low liquidity and difficulty in shorting.

Dollar volume: It is important that stocks in our universe be relatively easy to trade when entering and exiting positions. The QTradableStocksUS manages that by including only stocks that have median daily dollar volume of \$2.5M or more over the trailing 200 days.

Prior day's close: If a stock's price is lower than \$5, the bid-ask spread becomes larger relative to the price, and the transaction cost becomes too high.

200 days of price and volume: If a stock has missing data for the previous 200 days, the company is excluded. This targets stocks with trading halts, IPOs, and other situations that make them harder to assess.

Primary/Common share: The QTradableStocksUS chooses a single share class for each company. The criterion is to find the common share with the most dollar volume.

ADRs, Limited Partnerships: QTradableStocksUS excludes ADRs and LPs.
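
The three numeric limits above (market cap, dollar volume, prior close) can be illustrated with a small pandas filter. This is only a sketch on made-up data: the tickers, values, column names, and the `passes_basic_screens` helper are hypothetical, and it does not reproduce Quantopian's actual implementation or the remaining screens (data history, share class, ADRs/LPs).

```python
import pandas as pd

# Hypothetical snapshot of per-stock metrics; tickers and values are made up.
snapshot = pd.DataFrame(
    {
        "market_cap": [8.0e8, 3.0e8, 2.0e9, 1.2e9],
        "median_dollar_volume_200d": [4.0e6, 5.0e6, 1.0e6, 3.0e6],
        "prior_close": [12.50, 40.00, 25.00, 3.75],
    },
    index=["AAA", "BBB", "CCC", "DDD"],
)


def passes_basic_screens(metrics):
    """Return a boolean Series marking rows that clear the three numeric limits."""
    return (
        (metrics["market_cap"] > 500e6)                     # market cap over $500M
        & (metrics["median_dollar_volume_200d"] >= 2.5e6)   # $2.5M or more median dollar volume
        & (metrics["prior_close"] >= 5.0)                   # price not lower than $5
    )


tradable = snapshot[passes_basic_screens(snapshot)]
print(tradable.index.tolist())
```

In this toy data, BBB fails the market cap limit, CCC fails the dollar volume limit, and DDD fails the price limit, so only AAA remains.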

## II.II Visualizing QTradableStocksUS

In [2]:
from quantopian.pipeline.data import Fundamentals
from quantopian.pipeline.data import morningstar
from quantopian.pipeline.factors import AverageDollarVolume
from quantopian.pipeline.factors.morningstar import MarketCap
from quantopian.pipeline.classifiers.morningstar import Sector
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline import Pipeline
from quantopian.research import run_pipeline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import DataFrame as df
import time

from quantopian.pipeline.experimental import QTradableStocksUS

ModuleNotFoundError: No module named 'quantopian'

In [None]:
# Function used to create a pipeline with the QTradableStocksUS as a screen. The pipeline contains common metrics as columns:
def make_pipeline():
    
    average_day_dv_200 = AverageDollarVolume(window_length=200)
    market_cap = Fundamentals.market_cap.latest
    price = USEquityPricing.close.latest
    volume = USEquityPricing.volume.latest
    sector = Sector()

    return Pipeline(
        columns={
            'AverageDollarVolume200': average_day_dv_200,
            'MarketCap': market_cap,
            'Price': price,
            'Volume': volume,
            'Sector': sector,
        },
        screen=QTradableStocksUS()
    )

In [None]:
# Pipeline is run over this time range and outputs a dataframe indexed by date and asset:
START_DATE = '2003'
END_DATE = '2018-01-01'

start = time.time()
QTU_pipeline = run_pipeline(make_pipeline(), START_DATE, END_DATE, chunksize=252)
print('Took %s seconds' % (time.time() - start))

In [None]:
# Display constituent stocks of QTradableStocksUS:
QTU_pipeline

In [None]:
# The QTradableStocksUS universe generally contains a greater number of assets than previous iterations of the tradable universe.
# Number of assets in the universe on each day:
daily_constituent_count = QTU_pipeline.groupby(level=0).size()
# The resulting summary table displays the mean, std, and min-max of the daily median of each metric:
QTU_pipeline.groupby(level=0).median().describe()

In [None]:
#Displays the number of assets in universe every day, which mirrors major economic events throughout the time period.
dates = QTU_pipeline.index.levels[0]
grouped = QTU_pipeline.groupby(level=0).count()
num_securities = grouped['AverageDollarVolume200'].values
plt.plot(dates, num_securities)
plt.title('Number of securities in tradeable universe')

In [None]:
#Number of assets in universe broken down by sector type:

colors = ['b', 'g', 'r', 'c', 'm', 'y', 'orange', 'gray', 'maroon', 'olive', 'navy']

for (sector, name), color in zip(Sector.SECTOR_NAMES.items(), colors):
    sector_result = QTU_pipeline.loc[QTU_pipeline['Sector'] == sector]
    grouped = sector_result.groupby(level=0).count()
    num_securities = grouped['AverageDollarVolume200'].values
    plt.plot(dates, num_securities, label=name, color=color)

plt.legend(bbox_to_anchor=(1.3, 1))
plt.title('Number of securities per sector in tradeable universe')

In [None]:
#Added and Removed Assets
#Number of assets added to universe is usually slightly greater than number of assets removed from universe.

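# The loop variable is named day_frame so it does not shadow the `df` alias imported from pandas.
assets_each_day = [set(day_frame.loc[date].index) for date, day_frame in QTU_pipeline.groupby(level=0)]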
a = []
for i in range(1, len(assets_each_day)):
    a.append(assets_each_day[i] - assets_each_day[i-1])

#Record the number of new assets to universe each day:
new_assets_each_day = pd.Series(a, dates[1:])
num_new_assets_each_day = new_assets_each_day.apply(lambda x: len(x))

b = []
for i in range(1, len(assets_each_day)):
    b.append(assets_each_day[i-1] - assets_each_day[i])

#Record the number of assets removed from universe each day:
removed_assets_each_day = pd.Series(b, dates[1:])
num_removed_assets_each_day = removed_assets_each_day.apply(lambda x: len(x))

plt.plot(num_new_assets_each_day)
plt.title('Number of new securities per day in tradeable universe')

In [None]:
plt.plot(num_removed_assets_each_day)
plt.title('Number of removed securities per day in tradeable universe')
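
The added/removed bookkeeping above relies on set differences between consecutive days. Because the Quantopian modules are only available on the platform, here is a minimal sketch of the same idea on a toy frame that mimics the pipeline output's (date, asset) index. The dates, tickers, prices, and the names `toy`, `members`, `added`, and `removed` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the pipeline output: rows indexed by (date, asset).
toy_index = pd.MultiIndex.from_tuples(
    [
        ("2018-01-02", "AAA"), ("2018-01-02", "BBB"),
        ("2018-01-03", "BBB"), ("2018-01-03", "CCC"), ("2018-01-03", "DDD"),
        ("2018-01-04", "CCC"), ("2018-01-04", "DDD"),
    ],
    names=["date", "asset"],
)
toy = pd.DataFrame(
    {"Price": [10.0, 20.0, 21.0, 30.0, 40.0, 31.0, 41.0]}, index=toy_index
)

# One set of assets per date.
members = {
    date: set(group.index.get_level_values("asset"))
    for date, group in toy.groupby(level="date")
}
ordered_dates = sorted(members)

# Compare each day with the previous one using set differences.
day_pairs = list(zip(ordered_dates[:-1], ordered_dates[1:]))
added = [len(members[today] - members[prev]) for prev, today in day_pairs]
removed = [len(members[prev] - members[today]) for prev, today in day_pairs]

changes = pd.DataFrame(
    {"added": added, "removed": removed}, index=ordered_dates[1:]
)
print(changes)
```

On 2018-01-03 two assets enter (CCC, DDD) and one leaves (AAA); on 2018-01-04 none enter and one leaves (BBB).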

# III. Scraping Google News API for Investor Sentiment

The project will utilize Scrapy and Google News API to capture articles about stocks in the QTradableStocksUS universe to be used in sentiment analysis techniques. 

### Do I need to use Scrapy at all? Newsapi may be sufficient by itself.
### I can take the description from each article, store it in a DataFrame, then run sentiment analysis on each one.

In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
import json
import requests
from scrapy.linkextractors import LinkExtractor
from newsapi import NewsApiClient

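newsapi = NewsApiClient(api_key='2613ce5e838a464b814b7d5b4c2e6bf8')

# TODO: Add an LDA_Result column once the sentiment step exists
# Rows for a DF of Date, Stock, Article_Source, Article_Description
article_rows = []

# QTradableStocksUS is a pipeline filter and cannot be indexed, so take the first
# asset from the pipeline output instead (assumes QTU_pipeline from section II.II)
stocks_list = []
stocks_list.append(QTU_pipeline.index.levels[1][0].symbol)

# For each stock, query the Google News API for relevant articles
for stock in stocks_list:
    all_articles = newsapi.get_everything(q=stock,
                                          #sources='bbc-news,the-verge',
                                          #domains='bbc.co.uk,techcrunch.com',
                                          # START_DATE and END_DATE were declared when making the pipeline.
                                          # Some newsapi-python releases name this keyword from_param.
                                          from_parameter=START_DATE,
                                          to=END_DATE,
                                          language='en')
                                          #,sort_by='relevancy'
                                          # Only a limited number are shown at a time though, so use the page
                                          # parameter in your requests to page through them.
                                          #,page=2)

    # Extract the date, source url, and description of each article
    for article in all_articles['articles']:
        article_rows.append({
            'Date': article['publishedAt'],
            'Stock': stock,
            'Article_Source': article['url'],
            'Article_Description': article['description'],
        })

articles_df = pd.DataFrame(article_rows)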


In [3]:
# Example response from Google News API
'''
https://newsapi.org/v2/everything?q=bitcoin&apiKey=API_KEY
{
"status": "ok",
"totalResults": 46339,
"articles": [
{
"source": {
"id": null,
"name": "Cointelegraph.com"
},
"author": "CoinTelegraph By Darryn Pollock",
"title": "Bitcoin-Related Jobs Booming Along With Bitcoin",
"description": "With an 82 percent growth in the third quarter, Bitcoin-related jobs are the fastest growing category, according to employment website Freelancer.",
"url": "https://cointelegraph.com/news/bitcoin-related-jobs-booming-along-with-bitcoin",
"urlToImage": "https://cointelegraph.com/images/725_Ly9jb2ludGVsZWdyYXBoLmNvbS9zdG9yYWdlL3VwbG9hZHMvdmlldy80MDY5YWQ5MmQxNTU4YWIyYTdhYTg0MTIxM2QwM2M5Zi5qcGc=.jpg",
"publishedAt": "2017-10-31T08:43:21Z"
}
]
}
'''
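
The example response above can be flattened into the DataFrame described in the TODO (Date, Stock, Article_Source, Article_Description). The following is a minimal offline sketch: it parses a trimmed copy of that response rather than calling the API, and the `articles_to_frame` helper and the choice of `"bitcoin"` as the Stock value are assumptions made for illustration.

```python
import json

import pandas as pd

# Trimmed copy of the example response (the "urlToImage" field is omitted).
raw_response = """
{
  "status": "ok",
  "totalResults": 46339,
  "articles": [
    {
      "source": {"id": null, "name": "Cointelegraph.com"},
      "author": "CoinTelegraph By Darryn Pollock",
      "title": "Bitcoin-Related Jobs Booming Along With Bitcoin",
      "description": "With an 82 percent growth in the third quarter, Bitcoin-related jobs are the fastest growing category, according to employment website Freelancer.",
      "url": "https://cointelegraph.com/news/bitcoin-related-jobs-booming-along-with-bitcoin",
      "publishedAt": "2017-10-31T08:43:21Z"
    }
  ]
}
"""

COLUMNS = ["Date", "Stock", "Article_Source", "Article_Description"]


def articles_to_frame(payload, stock):
    """Build one row per article from a parsed News API response (hypothetical helper)."""
    rows = [
        {
            "Date": article["publishedAt"],
            "Stock": stock,
            "Article_Source": article["url"],
            "Article_Description": article["description"],
        }
        for article in payload.get("articles", [])
    ]
    frame = pd.DataFrame(rows, columns=COLUMNS)
    # Parse the ISO 8601 timestamps so rows can later be grouped by day.
    frame["Date"] = pd.to_datetime(frame["Date"])
    return frame


articles_df = articles_to_frame(json.loads(raw_response), stock="bitcoin")
print(articles_df[["Date", "Stock", "Article_Source"]])
```

An LDA_Result column could be appended to this frame once the sentiment step in section IV exists.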


In [None]:
# ****************************************************************************************
# Might not need this


#Build a crawler to crawl Google's top news articles and pull articles containing any mentions of a stock from QTradableStocksUS
class GoogleSpider(scrapy.Spider):
    name = "GS"
    
    allowed_domains = ['newsapi.org']
    
    
    # Here is where we insert our API calls to get Google news articles related to bitcoin and ethereum.
    start_urls = ['https://newsapi.org/v2/everything?q=bitcoin&apiKey=2613ce5e838a464b814b7d5b4c2e6bf8'
                 ,'https://newsapi.org/v2/everything?q=ethereum&apiKey=2613ce5e838a464b814b7d5b4c2e6bf8']
      
    # Identify the information we want from the query response and extract it from the JSON body.
    def parse(self, response):
        data = json.loads(response.text)
        for article in data['articles']:
            yield {
                'url' : article['url']
            }
                

process = CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': 'GoogleLinks5.json',
    # Note that because we are doing API queries, the robots.txt file doesn't apply to us.
    'ROBOTSTXT_OBEY': False,
    'USER_AGENT': 'MatthewGoogleNewsCrawler (makennedy626@gmail.com)',
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True,
    'LOG_ENABLED': False,
    # We use CLOSESPIDER_PAGECOUNT to stop the crawl after 10 responses have been crawled.
    'CLOSESPIDER_PAGECOUNT' : 10
})
                                         

# Starting the crawler with our spider.
process.crawl(GoogleSpider)
process.start()
print('Links extracted!')

# ***********************************************************************

# IV. Unsupervised Machine Learning on Scraped Data to Generate Sentiment Analysis

### Methods for Unsupervised Sentiment Analysis:
#### Latent Dirichlet Allocation (LDA): https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
#### LDA with Gensim: https://en.wikipedia.org/wiki/Gensim
#### LDA with Spark: https://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda
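
As a rough illustration of the LDA step, the sketch below fits a two-topic model to a handful of made-up article descriptions. It uses scikit-learn's `LatentDirichletAllocation` rather than the Gensim or Spark implementations linked above, purely to keep the example short; the sample texts and the choice of two topics are assumptions. Note that LDA produces topic weights per document, so mapping topics to a sentiment label would be a separate step.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up article descriptions, for illustration only.
descriptions = [
    "shares surge after strong quarterly earnings beat",
    "earnings beat expectations and shares rally",
    "regulators open probe into accounting fraud",
    "fraud probe widens as regulators subpoena executives",
]

# Bag-of-words counts are the input LDA expects.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(descriptions)

# Two topics is an arbitrary choice for this toy corpus.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # shape: (n_documents, n_topics)

# Each row is a document's topic distribution.
print(doc_topics.round(2))
```

Each row of `doc_topics` could be stored in the LDA_Result column planned for the articles DataFrame.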

# V. Supervised Machine Learning to Generate Prediction Models

# XII. Important Information from Quantopian's Contest Page

### Process

A new contest is started at the beginning of each month. Contest entries are paper traded for 6 months. At the end of the 6 months, the winner is announced.

While the Quantopian Open is limited to one winner per month, our allocations are not. We want dozens of algorithms, and anyone meeting the low beta and consistent-returns filters is well-positioned for those rewards.

It is a rolling competition. All entries are automatically entered in every subsequent contest, unless withdrawn.

If you win, the overall performance of your algorithm during the prize period will be public - other people will want to see how you are doing!

### Judging

Your algorithm's performance must have low correlation to the general market's performance. This correlation is calculated as the beta-to-SPY, and it must be between -0.3 and 0.3. The algorithms that meet this requirement are placed at the top of the leaderboard and are marked with a badge.

Your algorithm must be hedged. It should hold both long and short positions simultaneously, or be entirely in cash. Hedged strategies reduce their market risk and correlation risk to individual positions. These algorithms are placed at the top of the leaderboard and are marked with a badge.

Your algorithm must have positive returns. It must make trades in both paper trading and backtesting. Algorithms meeting this criterion also are placed at the top of the leaderboard and are marked with a badge.

To generate your algorithm's score, its live trading performance is ranked against all of the other entries on the 7 criteria listed below. The ranks are averaged and scored on a scale of 0 to 100. Your score is quite volatile in the days immediately after you make your entry, and your score smooths out as your entry runs for a longer time period.

The best way to evaluate your algorithm is to use our backtest analysis tool to generate a 'tear sheet.' The tear sheet is chock-full of statistics and comparisons to help you evaluate your algorithm's performance. It's the same tool that we use at Quantopian to evaluate algorithms for allocation.

These criteria were picked to encourage algorithm creation that matches our allocation interests.

#### Sharpe Ratio: The gold standard of performance metrics. Penalizes an algorithm if it takes excessive risk to achieve its return. Higher Sharpe is better.

#### Annualized volatility: Lower volatility is better.

#### Annualized returns: The algorithm has to make money.

#### Max Drawdown: The greatest loss suffered from a peak in equity to its subsequent trough. By minimizing drawdown, a participant's algorithm can better compound gains.

#### Stability of Return: This measures how consistently an algorithm generates its profits over time. (Mathematically, the R-squared of the linear regression line drawn through the algorithm's equity curve based on log-returns).

#### Sortino Ratio: Annual Return / standard deviation of negative returns. This is a quick way to compare across algorithms how long it might take to get back above its high-water mark after suffering a loss equal to its historical max drawdown. A large Sortino ratio indicates a low probability of a large loss.

#### Beta-to-SPY: How connected your algorithm is to swings in the value of SPY.
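
To make the seven criteria above concrete, the sketch below computes each one from a series of daily returns using common textbook definitions. It is not Quantopian's scoring code: the 252-day annualization factor, the zero risk-free rate, the function names, and the sample return values are all assumptions, and the Sortino ratio follows the wording above (annual return divided by the annualized standard deviation of negative returns).

```python
import numpy as np

TRADING_DAYS = 252  # assumed annualization factor


def annualized_return(returns):
    """Geometric annualized return from daily simple returns."""
    growth = np.prod(1.0 + returns)
    return growth ** (TRADING_DAYS / len(returns)) - 1.0


def annualized_volatility(returns):
    return np.std(returns, ddof=1) * np.sqrt(TRADING_DAYS)


def sharpe_ratio(returns):
    """Annualized mean over standard deviation, assuming a zero risk-free rate."""
    return np.mean(returns) / np.std(returns, ddof=1) * np.sqrt(TRADING_DAYS)


def sortino_ratio(returns):
    """Annual return over the annualized standard deviation of negative returns."""
    downside = returns[returns < 0]
    return annualized_return(returns) / (np.std(downside, ddof=1) * np.sqrt(TRADING_DAYS))


def max_drawdown(returns):
    """Largest peak-to-trough loss of the equity curve, as a negative fraction."""
    equity = np.concatenate(([1.0], np.cumprod(1.0 + returns)))
    running_peak = np.maximum.accumulate(equity)
    return np.min(equity / running_peak - 1.0)


def stability(returns):
    """R-squared of a straight line fitted to cumulative log returns."""
    cumulative_log = np.cumsum(np.log1p(returns))
    r = np.corrcoef(np.arange(len(cumulative_log)), cumulative_log)[0, 1]
    return r ** 2


def beta(returns, benchmark):
    """Covariance with the benchmark divided by the benchmark's variance."""
    return np.cov(returns, benchmark, ddof=1)[0, 1] / np.var(benchmark, ddof=1)


# Made-up daily returns for an algorithm and for SPY.
algo_returns = np.array([0.010, -0.004, 0.006, -0.008, 0.012, 0.003])
spy_returns = np.array([0.008, -0.010, 0.004, -0.006, 0.009, 0.001])

summary = {
    "sharpe": sharpe_ratio(algo_returns),
    "volatility": annualized_volatility(algo_returns),
    "annual_return": annualized_return(algo_returns),
    "max_drawdown": max_drawdown(algo_returns),
    "stability": stability(algo_returns),
    "sortino": sortino_ratio(algo_returns),
    "beta_to_spy": beta(algo_returns, spy_returns),
}
for name, value in summary.items():
    print("%s: %.4f" % (name, value))
```

A beta check against the contest limit would then be `abs(summary["beta_to_spy"]) <= 0.3`.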

### Important Rules 

Your algorithm must keep its leverage under three. If your leverage exceeds three, in backtest or paper trading, your entry will be disqualified. You can track your leverage using context.account.leverage.

Your algorithm must use the default slippage model. We will assign a custom commission model of \$0.001/share for judging. If you would like to test your algorithm with this commission model, you can use set_commission(commission.PerShare(cost=0.001, min_trade_cost=0)).

Your algorithm can't use fetcher. The Participant's algorithm must not use the fetch_csv() feature. (If you have a data source or signal that you think should be available for the contest, please tell us about it at feedback@quantopian.com)

Your algorithm can't trade leveraged ETFs or ETNs. The Participant's algorithm must not trade in leveraged ETFs, such as the Ultra S&P500 or Ultra Dow30. To avoid trading these assets, use the set_do_not_order() trading guard. You can also use the QTradableStocksUS, which doesn't contain any ETFs, as your base universe.

If your algorithm crashes, your entry will be disqualified.

Each person is limited to three entries. If you want to enter a fourth time, you will need to stop one of your existing entries to make room.

There is no fee for entry. You can read the full set of contest rules here: https://www.quantopian.com/open/rules