# Capstone Project
## Machine Learning Engineer Nanodegree
Jessica Yung
September 2016

## I. Definition
_(approx. 1-2 pages)_



### Project Overview
In this section, look to provide a high-level overview of the project in layman’s terms. Questions to ask yourself when writing this section:
- _Has an overview of the project been provided, such as the problem domain, project origin, and related datasets or input data?_
- _Has enough background information been given so that an uninformed reader would understand the problem domain and following problem statement?_

### Problem Statement
In this section, you will want to clearly define the problem that you are trying to solve, including the strategy (outline of tasks) you will use to achieve the desired solution. You should also thoroughly discuss what the intended solution will be for this problem. Questions to ask yourself when writing this section:
- _Is the problem statement clearly defined? Will the reader understand what you are expecting to solve?_
- _Have you thoroughly discussed how you will attempt to solve the problem?_
- _Is an anticipated solution clearly defined? Will the reader understand what results you are looking for?_

### Project Overview

The problem domain is trading. We are trying to predict the daily prices of a chosen stock s.

People have used machine learning in trading for decades, with a wide variety of strategies.

This is an interesting domain for several reasons.

Firstly, there are many non-engineered features. If we include only equities, we already have over 10,000 equities globally. That makes for at least 10,000 potential non-engineered features.

Secondly, there are many datapoints. High-frequency trading firms trade many times per second, so intraday data is far denser than daily data. Even daily trading information gives us thousands of datapoints per stock: markets are closed on weekends and holidays, so 30 years at roughly 250 trading days per year is about 7,500 datapoints.

It is also interesting because research in machine learning and statistics has affected how markets behave. There is no strategy or algorithm that will solve this problem or remain forever 'optimal' - if a profitable strategy is found, it may be copied by other people and so be priced in or it may be fought against or taken advantage of. This is more relevant to high-frequency trading than daily trading but nonetheless has an impact.

This exploratory study is 

### Problem
Build a stock price predictor with:

* Input: daily trade data over a certain date range (Open, High, Low, Adjusted Close) for a set of stocks S.
* Output: Projected estimates of Adjusted Close prices for query dates for pre-chosen stock(s) s_i in S.
    * Define 
    * Success criterion: the predicted stock value 7 days out is within +/- 5% of the actual value, on average.
* Optional Output: Suggested trades
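
One way to check the +/- 5% criterion is the mean absolute percentage error between the predicted and actual prices 7 days out. The sketch below is only an illustration: the function name `mean_abs_pct_error` and the prices are made up, and the report does not yet commit to this exact metric.

```python
import numpy as np


def mean_abs_pct_error(actual, predicted):
    """Mean absolute percentage error as a fraction (0.05 means 5%)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(predicted - actual) / np.abs(actual)))


# Made-up prices for illustration only (not taken from the dataset).
actual_7d = [100.0, 102.0, 98.0, 105.0]
predicted_7d = [103.0, 100.0, 99.0, 101.0]

error = mean_abs_pct_error(actual_7d, predicted_7d)
meets_target = error <= 0.05  # the +/- 5% threshold from the problem statement
print(f"mean error: {error:.2%}, meets target: {meets_target}")
```

Averaging absolute percentage errors keeps the check independent of the price level of the stock, which matters when stocks trade at very different prices.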

### Strategy

This is a regression problem (as opposed to a classification problem) because we are predicting daily Adjusted Close prices for a stock. These prices are continuous.

A related problem: if this were high-frequency trading and we were trying to predict the stock price in the next nanosecond, we could tackle price prediction as a binary classification problem (does the price go up or down?).
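
To make the two framings concrete, the sketch below derives both a regression target (the future price) and a binary classification target (up or down) from one price series. The prices, dates and the `horizon` of 2 trading days are hypothetical values chosen to keep the example short.

```python
import pandas as pd

# Made-up Adjusted Close prices for one stock, oldest first.
adj_close = pd.Series(
    [10.0, 10.5, 10.2, 9.8, 11.0, 9.7],
    index=pd.date_range("2016-01-04", periods=6, freq="B"),
)

horizon = 2  # hypothetical look-ahead in trading days

# Regression target: the price `horizon` rows ahead (a continuous value).
targets = pd.DataFrame(
    {"adj_close": adj_close, "future_price": adj_close.shift(-horizon)}
).dropna()  # the last `horizon` rows have no known future price

# Classification target: does the price go up over the horizon? (binary)
targets["goes_up"] = targets["future_price"] > targets["adj_close"]
print(targets)
```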

It's not immediately obvious what kind of model will be best.

Characteristics of the problem:
- Time-series data.
- Noisy data.
- Features (prices of different stocks) are not independent of each other, so Naive Bayes, which assumes independent features, is not appropriate.
- Many features (daily open, high, low and adjusted close for many stocks).
- Regression problem (continuous output).
- Training cost or time: not critical. Training does not need to finish in under 12 hours, because we are predicting daily prices from statistics on prior days' trading.
- Prediction time: again, not critical. Anything within an hour would do.

#### Expected Solution

### Metrics
In this section, you will need to clearly define the metrics or calculations you will use to measure performance of a model or result in your project. These calculations and metrics should be justified based on the characteristics of the problem and problem domain. Questions to ask yourself when writing this section:
- _Are the metrics you’ve chosen to measure the performance of your models clearly discussed and defined?_
- _Have you provided reasonable justification for the metrics chosen based on the problem and solution?_

## II. Analysis
_(approx. 2-4 pages)_

### Data Exploration
In this section, you will be expected to analyze the data you are using for the problem. This data can either be in the form of a dataset (or datasets), input data (or input files), or even an environment. The type of data should be thoroughly described and, if possible, have basic statistics and information presented (such as discussion of input features or defining characteristics about the input or environment). Any abnormalities or interesting qualities about the data that may need to be addressed have been identified (such as features that need to be transformed or the possibility of outliers). Questions to ask yourself when writing this section:
- _If a dataset is present for this problem, have you thoroughly discussed certain features about the dataset? Has a data sample been provided to the reader?_
- _If a dataset is present for this problem, are statistics about the dataset calculated and reported? Have any relevant results from this calculation been discussed?_
- _If a dataset is **not** present for this problem, has discussion been made about the input space or input data for your problem?_
- _Are there any abnormalities or characteristics about the input space or dataset that need to be addressed? (categorical variables, missing values, outliers, etc.)_

### Exploratory Visualization
In this section, you will need to provide some form of visualization that summarizes or extracts a relevant characteristic or feature about the data. The visualization should adequately support the data being used. Discuss why this visualization was chosen and how it is relevant. Questions to ask yourself when writing this section:
- _Have you visualized a relevant characteristic or feature about the dataset or input data?_
- _Is the visualization thoroughly analyzed and discussed?_
- _If a plot is provided, are the axes, title, and datum clearly defined?_

### Algorithms and Techniques
In this section, you will need to discuss the algorithms and techniques you intend to use for solving the problem. You should justify the use of each one based on the characteristics of the problem and the problem domain. Questions to ask yourself when writing this section:
- _Are the algorithms you will use, including any default variables/parameters in the project clearly defined?_
- _Are the techniques to be used thoroughly discussed and justified?_
- _Is it made clear how the input data or datasets will be handled by the algorithms and techniques chosen?_

### Benchmark
In this section, you will need to provide a clearly defined benchmark result or threshold for comparing across performances obtained by your solution. The reasoning behind the benchmark (in the case where it is not an established result) should be discussed. Questions to ask yourself when writing this section:
- _Has some result or value been provided that acts as a benchmark for measuring performance?_
- _Is it clear how this result or value was obtained (whether by data or by hypothesis)?_


The data used is daily stock data for stocks on the London Stock Exchange (LSE). The date range for each stock varies depending on when the stock went public. The earliest date was **YEAR**. The most recent date in the dataset was 9 September 2016. The data was taken from Quandl.

All the data is in one CSV, with each row being one datapoint. Each row includes:
* Stock symbol
* Date
* Open
* High
* Low
* Close
* Volume
* Ex-Dividend
* Split Ratio
* Adjusted Open
* Adjusted High
* Adjusted Low
* Adjusted Close
* Adjusted Volume

That means we have 12 features for each stock on every trading day since YEAR when the stock was tradable.


**Data Preprocessing**
On opening the CSV and sampling it with `df.head()`, I realised the CSV had no header row, so I supplied the column names when reading it:
```python
import pandas as pd
df = pd.read_csv('~/lse-data/lse/WIKI_20160909.csv', header=None, names=header_names)  # header_names: list of the 14 column names
```
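
The effect of `header=None` can be seen on a tiny in-memory CSV. This is a sketch with made-up rows and a shortened, hypothetical column list (`column_names`), not the real file:

```python
import io

import pandas as pd

# Two illustrative rows in a headerless layout: symbol, date, close.
raw_csv = "A,1999-11-18,44.0\nA,1999-11-19,40.38\n"
column_names = ["Symbol", "Date", "Close"]  # hypothetical, shortened list

# Default behaviour: the first data row is consumed as the header.
wrong = pd.read_csv(io.StringIO(raw_csv))

# With header=None and names=..., every row is kept as data.
right = pd.read_csv(io.StringIO(raw_csv), header=None, names=column_names)

print(list(wrong.columns), len(wrong))
print(list(right.columns), len(right))
```

Without `header=None`, one row of data is silently lost and the column labels are data values, which is easy to miss on a file with millions of rows.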


**What are Adjusted figures?**

**What is an 'Adjusted Closing Price'?**
An adjusted closing price is a stock's closing price on any given day of trading that has been amended to include any distributions and corporate actions that occurred at any time prior to the next day's open. The adjusted closing price is often used when examining historical returns or performing a detailed analysis on historical returns.

**Breaking down 'Adjusted Closing Price'**
A stock's price is typically affected by supply and demand of market participants. However, there are some corporate actions that affect a stock's price, which needs to be adjusted in the event of these actions. The adjusted closing price is a useful tool when examining historical returns because it gives analysts an accurate representation of the firm's equity value beyond the simple market price. It accounts for all corporate actions such as stock splits, dividends/distributions and rights offerings. Investors should understand how corporate actions are accounted for in a stock's adjusted closing price.

**Adjusting Prices for Stock Splits**
A stock split is a corporate action that is usually done by companies to make their share prices more marketable. A stock split does not affect a company's total market capitalization, but it does affect the company's stock price. Consequently, a company undergoing a stock split must adjust its closing price to depict the effect of the corporate action.

For example, a company's board of directors may decide to split the company's stock three-for-one. Therefore, the company's shares outstanding increase by a multiple of three, while its share price is divided by three. If a stock closed at $300 the day prior to its stock split, the closing price is adjusted to $100, or $300 divided by 3, per share to show the effect of the corporate action.
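
The three-for-one example can be checked with a few lines of arithmetic. The function name and the share counts below are hypothetical and only serve to show that market capitalization is unchanged by the split:

```python
def split_adjusted_close(close, split_ratio):
    """Restate a pre-split closing price in post-split terms."""
    return close / split_ratio


pre_split_close = 300.0
split_ratio = 3  # three-for-one split

adjusted = split_adjusted_close(pre_split_close, split_ratio)

# Hypothetical share count, to show market capitalization is unchanged.
shares_before = 1_000_000
shares_after = shares_before * split_ratio
cap_before = pre_split_close * shares_before
cap_after = adjusted * shares_after
print(adjusted, cap_before == cap_after)
```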

**Adjusting for Dividends**
Common distributions that affect a stock's price include cash dividends and stock dividends. The difference between cash dividends and stock dividends is shareholders are entitled to a predetermined price per share and additional shares, respectively. For example, assume a company declared a $1 cash dividend and is trading at $51 per share on the ex-dividend date. On the ex-dividend date, the stock price is reduced by $1 and the adjusted closing price is $50.
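
The cash dividend example works out as follows. The multiplicative factor at the end is an assumption about how such an adjustment could be carried back to earlier prices; the excerpt does not say how the dataset applies it.

```python
price_on_ex_date = 51.0
cash_dividend = 1.0

# Adjusted close from the example: the price reduced by the dividend.
adjusted_close = price_on_ex_date - cash_dividend

# Assumption: the same adjustment expressed as a ratio, which could be
# multiplied into earlier prices to keep the series comparable.
implied_factor = adjusted_close / price_on_ex_date
print(adjusted_close, round(implied_factor, 4))
```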

**Adjusting for Rights Offerings**
A stock's adjusted closing price also reflects rights offerings that may occur. A rights offering is an issue of rights given to existing shareholders, which entitles the shareholders to subscribe to the rights issue in proportion to their shares. For example assume a company declares a rights offering, in which existing shareholders are entitled to one additional share for every two shares owned. Assume the stock is trading at $50 and existing shareholders are able to purchase additional shares at a subscription price of $45. On the ex-date, the adjusted closing price is calculated based on the adjusting factor and the closing price.
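
The excerpt does not give the formula for the adjusting factor. One standard way to obtain it, assumed here rather than confirmed by the source, is the theoretical ex-rights price: the value of the old and new shares averaged over the total number of shares. The function name is hypothetical.

```python
def theoretical_ex_rights_price(cum_price, sub_price, old_shares, new_shares):
    """Value-weighted price per share after the rights are exercised."""
    total_value = old_shares * cum_price + new_shares * sub_price
    return total_value / (old_shares + new_shares)


cum_rights_close = 50.0    # price before the rights issue
subscription_price = 45.0  # price of each new share
# One new share for every two held, as in the example above.
terp = theoretical_ex_rights_price(
    cum_rights_close, subscription_price, old_shares=2, new_shares=1
)

# Assumed adjusting factor: ratio of the ex-rights price to the prior close.
adjusting_factor = terp / cum_rights_close
adjusted_close = cum_rights_close * adjusting_factor
print(round(terp, 2), round(adjusting_factor, 4))
```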




Reference: [Investopedia](http://www.investopedia.com/terms/a/adjusted_closing_price.asp#ixzz4LOlpDX7)


```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```

```python
header_names_hpq = [
    'Date',
    'Open',
    'High',
    'Low',
    'Close',
    'Volume',
    'Ex-Dividend',
    'Split Ratio',
    'Adj. Open',
    'Adj. High',
    'Adj. Low',
    'Adj. Close',
    'Adj. Volume']
```

```python
hpq = pd.read_csv('/Users/jessica/lse-data/lse/WIKI-HPQ.csv')
```

```python
hpq.describe()
header = hpq.dtypes.index
header_list = header
header_list
```

```
Index(['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Ex-Dividend',
       'Split Ratio', 'Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close',
       'Adj. Volume'],
      dtype='object')
```

```python
header_names = [
    'Symbol',
    'Date',
    'Open',
    'High',
    'Low',
    'Close',
    'Volume',
    'Ex-Dividend',
    'Split Ratio',
    'Adj. Open',
    'Adj. High',
    'Adj. Low',
    'Adj. Close',
    'Adj. Volume']
```

```python
hpq.head()
```

|   | Date | Open | High | Low | Close | Volume | Ex-Dividend | Split Ratio | Adj. Open | Adj. High | Adj. Low | Adj. Close | Adj. Volume |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2016-09-09 | 14.52 | 14.53 | 14.07 | 14.07 | 16495753.0 | 0.0 | 1.0 | 14.52 | 14.53 | 14.07 | 14.07 | 16495753.0 |
| 1 | 2016-09-08 | 14.6 | 14.69 | 14.55 | 14.63 | 10247628.0 | 0.0 | 1.0 | 14.6 | 14.69 | 14.55 | 14.63 | 10247628.0 |
| 2 | 2016-09-07 | 14.63 | 14.75 | 14.51 | 14.69 | 6804777.0 | 0.0 | 1.0 | 14.63 | 14.75 | 14.51 | 14.69 | 6804777.0 |
| 3 | 2016-09-06 | 14.54 | 14.67 | 14.53 | 14.63 | 7386291.0 | 0.0 | 1.0 | 14.54 | 14.67 | 14.53 | 14.63 | 7386291.0 |
| 4 | 2016-09-02 | 14.61 | 14.7 | 14.45 | 14.49 | 8245584.0 | 0.0 | 1.0 | 14.61 | 14.7 | 14.45 | 14.49 | 8245584.0 |


```python
df = pd.read_csv('~/lse-data/lse/WIKI_20160909.csv', header=None, names=header_names)
```

```python
df = data
```

```python
data.describe()
```

|   | 45.5 | 50.0 | 40.0 | 44.0 | 44739900.0 | 0.0 | 1.0 | 43.471809559155 | 47.771219295775 | 38.21697543662 | 42.038672980282 | 44739900.0.1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 14328190.0 | 14328860.0 | 14328860.0 | 14329130.0 | 14329350.0 | 14329320.0 | 14329220.0 | 14328190.0 | 14328860.0 | 14328860.0 | 14329130.0 | 14329340.0 |
| mean | 70.92291 | 71.88109 | 70.47024 | 71.20251 | 1182023.0 | 0.00198279 | 1.00021 | 75.18079 | 76.33755 | 74.51613 | 75.4457 | 1402922.0 |
| std | 2193.723 | 2220.224 | 2191.79 | 2206.792 | 8868544.0 | 0.3370723 | 0.02165061 | 2266.636 | 2295.34 | 2261.718 | 2279.264 | 6620807.0 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 11.8 | 12.0 | 11.56 | 11.8 | 34200.0 | 0.0 | 1.0 | 6.213267 | 6.328367 | 6.096832 | 6.214642 | 44100.0 |
| 50% | 22.88 | 23.21 | 22.5 | 22.88 | 171200.0 | 0.0 | 1.0 | 13.5568 | 13.78368 | 13.32 | 13.55915 | 223000.0 |
| 75% | 38.33 | 38.85 | 37.82 | 38.35 | 668600.0 | 0.0 | 1.0 | 26.89341 | 27.3 | 26.46492 | 26.89548 | 880000.0 |
| max | 228180.0 | 229374.0 | 227530.0 | 229300.0 | 6674913000.0 | 962.5 | 50.0 | 228180.0 | 229374.0 | 227530.0 | 229300.0 | 2304019000.0 |


```python
data.head()
```

|   | A | 1999-11-18 | 45.5 | 50.0 | 40.0 | 44.0 | 44739900.0 | 0.0 | 1.0 | 43.471809559155 | 47.771219295775 | 38.21697543662 | 42.038672980282 | 44739900.0.1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | A | 1999-11-19 | 42.94 | 43.0 | 39.81 | 40.38 | 10897100.0 | 0.0 | 1.0 | 41.025923 | 41.083249 | 38.035445 | 38.580037 | 10897100.0 |
| 1 | A | 1999-11-22 | 41.31 | 44.0 | 40.06 | 44.0 | 4705200.0 | 0.0 | 1.0 | 39.468581 | 42.038673 | 38.274301 | 42.038673 | 4705200.0 |
| 2 | A | 1999-11-23 | 42.5 | 43.63 | 40.25 | 40.25 | 4274400.0 | 0.0 | 1.0 | 40.605536 | 41.685166 | 38.455832 | 38.455832 | 4274400.0 |
| 3 | A | 1999-11-24 | 40.13 | 41.94 | 40.0 | 41.06 | 3464400.0 | 0.0 | 1.0 | 38.341181 | 40.070499 | 38.216975 | 39.229725 | 3464400.0 |
| 4 | A | 1999-11-26 | 40.88 | 41.5 | 40.75 | 41.19 | 1237100.0 | 0.0 | 1.0 | 39.057749 | 39.650112 | 38.933544 | 39.35393 | 1237100.0 |


## III. Methodology
_(approx. 3-5 pages)_

### Data Preprocessing
In this section, all of your preprocessing steps will need to be clearly documented, if any were necessary. From the previous section, any of the abnormalities or characteristics that you identified about the dataset will be addressed and corrected here. Questions to ask yourself when writing this section:
- _If the algorithms chosen require preprocessing steps like feature selection or feature transformations, have they been properly documented?_
- _Based on the **Data Exploration** section, if there were abnormalities or characteristics that needed to be addressed, have they been properly corrected?_
- _If no preprocessing is needed, has it been made clear why?_

### Implementation
In this section, the process for which metrics, algorithms, and techniques that you implemented for the given data will need to be clearly documented. It should be abundantly clear how the implementation was carried out, and discussion should be made regarding any complications that occurred during this process. Questions to ask yourself when writing this section:
- _Is it made clear how the algorithms and techniques were implemented with the given datasets or input data?_
- _Were there any complications with the original metrics or techniques that required changing prior to acquiring a solution?_
- _Was there any part of the coding process (e.g., writing complicated functions) that should be documented?_

### Refinement
In this section, you will need to discuss the process of improvement you made upon the algorithms and techniques you used in your implementation. For example, adjusting parameters for certain models to acquire improved solutions would fall under the refinement category. Your initial and final solutions should be reported, as well as any significant intermediate results as necessary. Questions to ask yourself when writing this section:
- _Has an initial solution been found and clearly reported?_
- _Is the process of improvement clearly documented, such as what techniques were used?_
- _Are intermediate and final solutions clearly reported as the process is improved?_


## IV. Results
_(approx. 2-3 pages)_

### Model Evaluation and Validation
In this section, the final model and any supporting qualities should be evaluated in detail. It should be clear how the final model was derived and why this model was chosen. In addition, some type of analysis should be used to validate the robustness of this model and its solution, such as manipulating the input data or environment to see how the model’s solution is affected (this is called sensitivity analysis). Questions to ask yourself when writing this section:
- _Is the final model reasonable and aligning with solution expectations? Are the final parameters of the model appropriate?_
- _Has the final model been tested with various inputs to evaluate whether the model generalizes well to unseen data?_
- _Is the model robust enough for the problem? Do small perturbations (changes) in training data or the input space greatly affect the results?_
- _Can results found from the model be trusted?_

### Justification
In this section, your model’s final solution and its results should be compared to the benchmark you established earlier in the project using some type of statistical analysis. You should also justify whether these results and the solution are significant enough to have solved the problem posed in the project. Questions to ask yourself when writing this section:
- _Are the final results found stronger than the benchmark result reported earlier?_
- _Have you thoroughly analyzed and discussed the final solution?_
- _Is the final solution significant enough to have solved the problem?_




## V. Conclusion
_(approx. 1-2 pages)_

### Free-Form Visualization
In this section, you will need to provide some form of visualization that emphasizes an important quality about the project. It is much more free-form, but should reasonably support a significant result or characteristic about the problem that you want to discuss. Questions to ask yourself when writing this section:
- _Have you visualized a relevant or important quality about the problem, dataset, input data, or results?_
- _Is the visualization thoroughly analyzed and discussed?_
- _If a plot is provided, are the axes, title, and datum clearly defined?_

### Reflection
In this section, you will summarize the entire end-to-end problem solution and discuss one or two particular aspects of the project you found interesting or difficult. You are expected to reflect on the project as a whole to show that you have a firm understanding of the entire process employed in your work. Questions to ask yourself when writing this section:
- _Have you thoroughly summarized the entire process you used for this project?_
- _Were there any interesting aspects of the project?_
- _Were there any difficult aspects of the project?_
- _Does the final model and solution fit your expectations for the problem, and should it be used in a general setting to solve these types of problems?_

### Improvement
In this section, you will need to provide discussion as to how one aspect of the implementation you designed could be improved. As an example, consider ways your implementation can be made more general, and what would need to be modified. You do not need to make this improvement, but the potential solutions resulting from these changes are considered and compared/contrasted to your current solution. Questions to ask yourself when writing this section:
- _Are there further improvements that could be made on the algorithms or techniques you used in this project?_
- _Were there algorithms or techniques you researched that you did not know how to implement, but would consider using if you knew how?_
- _If you used your final solution as the new benchmark, do you think an even better solution exists?_

-----------

**Before submitting, ask yourself. . .**

- Does the project report you’ve written follow a well-organized structure similar to that of the project template?
- Is each section (particularly **Analysis** and **Methodology**) written in a clear, concise and specific fashion? Are there any ambiguous terms or phrases that need clarification?
- Would the intended audience of your project be able to understand your analysis, methods, and results?
- Have you properly proof-read your project report to assure there are minimal grammatical and spelling mistakes?
- Are all the resources used for this project correctly cited and referenced?
- Is the code that implements your solution easily readable and properly commented?
- Does the code execute without error and produce results similar to those reported?
