# Machine Learning Engineer Nanodegree
## Capstone Project
Nicolás Kittsteiner
December 29, 2018

## I. Definition

### Project Overview

In finance, particularly in investment and stock trading, the rise of cryptocurrencies<sup>1</sup> and crypto assets<sup>2</sup> built on blockchain<sup>3</sup> technology has created a new ecosystem of possibilities for investors. Given the amount of public information about cryptocurrencies available on the Web, it is possible to perform different analyses of valuation and future price prediction using appropriate key data and supervised machine learning algorithms.

Ethereum<sup>4</sup> is a cryptocurrency built around a technology called the Ethereum Virtual Machine (EVM)<sup>5</sup>. The EVM executes smart contracts<sup>6</sup>, which are custom programs that enable use cases such as creating autonomous organizations, running crowdfunding projects, or building applications that transfer value automatically when the rules defined in the contract are met. In this sense, Ethereum can be understood as a commodity, comparable to a cloud-based computing service.

The idea of this project is to investigate whether it is possible to predict Ethereum prices using different machine learning algorithms and a sample dataset containing daily information about trades and network health.


### Problem Statement

In this exploration, the goal is to predict, with some degree of confidence, the 'Close' price of Ethereum (ETH) based on its network and pricing information. The steps involved in the analysis and implementation are:

- Analyze the Ethereum price and network datasets.
- Clean and merge the data.
- Determine the most important features in the dataset.
- Separate training and testing datasets.
- Determine an evaluation metric.
- Explore different supervised machine learning approaches to the problem.
- Test each model's results using the defined evaluation metric.
- Provide a benchmark analysis.

With these steps, the expectation is to identify one or more approaches that may solve the price prediction problem, and to determine whether the performance of these models is good enough for a real-world application.

### Metrics

Because this is a regression problem, the selected metric is R<sup>2</sup> (the coefficient of determination), which measures how well future samples are likely to be predicted by each model. The best possible score is 1.0, and the score can be negative when the model performs worse than a constant prediction. An R<sup>2</sup> score of 0.0 corresponds to a model that ignores the input features and always predicts the mean of the observed values.

R<sup>2</sup> is more informative in this context because it is scale-free: the mean squared error depends on the units of the target, so its value alone does not show whether a model performs well or badly relative to a simple baseline.
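
As a rough illustration of the behaviour described above, the following sketch uses scikit-learn's `r2_score` on made-up numbers (the values are hypothetical and not taken from the Ethereum data). It compares a perfect prediction, a constant prediction equal to the mean of the observed values, and a prediction that follows the trend in the wrong direction.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical closing prices, for illustration only.
y_true = np.array([10.0, 12.0, 14.0, 16.0])

perfect = y_true.copy()                          # reproduces every observation
constant = np.full_like(y_true, y_true.mean())   # ignores the inputs, predicts the mean
poor = y_true[::-1]                              # predicts the trend in the wrong direction

r2_perfect = r2_score(y_true, perfect)
r2_constant = r2_score(y_true, constant)
r2_poor = r2_score(y_true, poor)

print("perfect:", r2_perfect)
print("constant:", r2_constant)
print("poor:", r2_poor)
```

The first score is the maximum, the second is zero, and the third is negative, which matches the interpretation of the metric given in this section.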


## II. Analysis

### Data Exploration

The dataset is a composition of different sources with historical information on several cryptocurrencies, but this analysis considers only two of its files (ethereum_price and ethereum_dataset). The fields in each file are:

- Ethereum Dataset (ethereum_dataset.csv):
    - Date(UTC) : Date of transaction
    - UnixTimeStamp : Unix timestamp
    - eth_etherprice : Price of Ethereum
    - eth_tx : Number of transactions per day
    - eth_address : Cumulative address growth
    - eth_supply : Number of ethers in supply
    - eth_marketcap : Market cap in USD
    - eth_hashrate : Hash rate in GH/s
    - eth_difficulty : Difficulty level in TH
    - eth_blocks : Number of blocks per day
    - eth_uncles : Number of uncles per day
    - eth_blocksize : Average block size in bytes
    - eth_blocktime : Average block time in seconds
    - eth_gasprice : Average gas price in Wei
    - eth_gaslimit : Gas limit per day
    - eth_gasused : Total gas used per day
    - eth_ethersupply : New ether supply per day
    - eth_chaindatasize : Chain data size in bytes
    - eth_ens_register : Ethereum Name Service (ENS) registrations per day

- Ethereum prices (ethereum_price.csv):
    - Date : date of observation
    - Open : Opening price on the given day
    - High : Highest price on the given day
    - Low : Lowest price on the given day
    - Close : Closing price on the given day (*)
    - Volume : Volume of transactions on the given day
    - Market Cap : Market capitalization in USD

Most fields in each dataset are dates or decimal features whose values range from negative numbers to the thousands. To merge the two datasets, the dates in each one have to be converted to a common format and used as the key. The last row of the prices dataset has no price information, so it needs to be excluded. Some 'Close' values are also missing from the prices dataset. Judging by the summary statistics, each dataset appears consistent, without many outliers in any feature.
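
The merge described above could be sketched with pandas roughly as follows. This is a minimal sketch under stated assumptions: the two small frames stand in for `ethereum_dataset.csv` and `ethereum_price.csv`, and both the sample values and the two date formats are hypothetical, chosen only to show the key normalization and the removal of the last price row.

```python
import pandas as pd

# Tiny stand-ins for the two CSV files. The values and the date
# formats are assumptions made for illustration.
network = pd.DataFrame({
    "Date(UTC)": ["8/7/2015", "8/8/2015", "8/9/2015"],
    "eth_tx": [2050, 2881, 1329],
})
prices = pd.DataFrame({
    "Date": ["Aug 09, 2015", "Aug 08, 2015", "Aug 07, 2015", "Aug 06, 2015"],
    "Close": [0.70, 0.75, 2.77, None],
})

# Convert both date columns to datetime so they can act as the join key.
network["Date"] = pd.to_datetime(network["Date(UTC)"], format="%m/%d/%Y")
prices["Date"] = pd.to_datetime(prices["Date"], format="%b %d, %Y")

# Exclude the last price row, which carries no price information.
prices = prices.iloc[:-1]

# Join on the date, order chronologically, and index by date.
merged = (
    network.drop(columns=["Date(UTC)"])
    .merge(prices, on="Date", how="inner")
    .sort_values("Date")
    .set_index("Date")
)
print(merged)
```

An inner join keeps only the days present in both files; a left or outer join would instead keep unmatched days and leave gaps to be filled later.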


Source: https://www.kaggle.com/sudalairajkumar/cryptocurrencypricehistory/home


### Exploratory Visualization

The first task in identifying the most relevant features is to visualize the correlation between the features of the merged dataset.

![Correlation matrix of the merged dataset](img/correlation_matrix.png "Correlation matrix")

In this graph the correlation index is identified by color (the lighter cells). Features that are weakly correlated with the target contribute little when selecting the inputs for each model. The features most highly correlated with 'Close' are:


Field | Correlation index with 'Close'
--- | ---
eth_marketcap | 0.999769
Market Cap | 0.996268
eth_tx | 0.966170
eth_address | 0.958083
eth_gasused | 0.951524
eth_hashrate | 0.939107
eth_blocksize | 0.930363
Volume | 0.907974
eth_difficulty | 0.827023
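
A ranking like the one in the table can be obtained from the correlation matrix of the merged frame. The sketch below is only an illustration: the data is synthetic, the frame name `merged` is hypothetical, and the 0.8 cut-off is an assumption (the source does not state a threshold).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the merged dataset (values are made up).
rng = np.random.default_rng(0)
close = np.linspace(1.0, 100.0, 50)
merged = pd.DataFrame({
    "Close": close,
    "eth_tx": close * 1000 + rng.normal(0, 2000, 50),  # strongly related to Close
    "eth_blocktime": rng.normal(15, 1, 50),             # unrelated to Close
})

# Pearson correlation between every pair of numeric columns.
corr_matrix = merged.corr()

# Correlation of each feature with the target, highest first.
corr_with_close = (
    corr_matrix["Close"].drop("Close").sort_values(ascending=False)
)

# Keep the features above an assumed threshold of 0.8.
selected = corr_with_close[corr_with_close > 0.8].index.tolist()
print(corr_with_close)
print(selected)
```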


### Algorithms and Techniques

Forecasting a numeric variable is a regression problem, and several algorithms and techniques can be applied to it. In the context of this investigation, the algorithms used to predict the ETH price are:

- Linear Regression: The most common and simple method for forecasting a variable from a set of features. The model is used with its default parameters.
- K-Nearest Neighbors Regressor: Often used for classification problems, it also has applications in time series analysis and financial prediction. The only parameter changed is the number of neighbors, which is set to two.
- Random Forest Regressor: Also used with its default parameters.
- Ada Boost Regressor: Configured differently from the defaults. Specifically, the base estimator is set to a decision tree regressor with a maximum depth of 10.
- Gradient Boosting Regressor: Uses the following configuration:
    - Number of estimators: 1000
    - Max depth: 4
    - Minimum samples split: 2
    - Learning rate: 0.01
    - Loss: ls (least squares)

All of these algorithms are used in the same way: the training set (X and y) is passed to fit the model, and a prediction is then made from the X features of the test set.
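
The configurations listed above might be set up in scikit-learn roughly as shown below. This is a sketch, not the project's actual code: the data is synthetic, the `random_state` values are added here only for reproducibility, and the `models` and `predictions` names are hypothetical. The gradient boosting `loss` is left at its default, which is least squares, because the accepted string for that option differs between scikit-learn releases.

```python
import numpy as np
from sklearn.ensemble import (
    AdaBoostRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

models = {
    "Linear Regression": LinearRegression(),
    "K-Nearest Neighbors Regressor": KNeighborsRegressor(n_neighbors=2),
    "Random Forest Regressor": RandomForestRegressor(random_state=0),
    # The base estimator is passed positionally because its keyword
    # name changed between scikit-learn releases.
    "Ada Boost Regressor": AdaBoostRegressor(
        DecisionTreeRegressor(max_depth=10), random_state=0
    ),
    "Gradient Boosting Regressor": GradientBoostingRegressor(
        n_estimators=1000,
        max_depth=4,
        min_samples_split=2,
        learning_rate=0.01,
        random_state=0,
    ),
}

# Synthetic features and target, for illustration only.
rng = np.random.default_rng(0)
X = rng.random((120, 3))
y = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(0, 0.01, 120)
X_train, X_test = X[:90], X[90:]
y_train = y[:90]

# Same workflow for every model: fit on the training set, predict on the test features.
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
```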


### Benchmark

To compare these techniques, each previously trained model makes a prediction on the test set of features, and the R<sup>2</sup> score is computed by comparing that prediction with the y values of the test set. The results for each algorithm are:

Model | R<sup>2</sup> Score
--- | ---
Linear Regression | 0.7654767848001398
K-Nearest Neighbors Regressor | -1.9013806360331635
Random Forest Regressor | -1.169761995372531
Ada Boost Regressor | -1.112811356682978
Gradient Boosting Regressor | -1.3588933042301603
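
The evaluation step can be sketched as a loop that scores each fitted model on the held-out data. The source does not state how the split was made or why most scores are negative, so the example below is only one possible illustration. It assumes a chronological split on a synthetic, steadily rising target, a setting in which a tree ensemble cannot predict values above the range seen during training.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic rising series (an assumption, not the Ethereum data).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 400)
y = 2.0 * x + rng.normal(0, 0.05, 400)
X = x.reshape(-1, 1)

# Chronological split: the first 300 points train, the last 100 test.
split = 300
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest Regressor": RandomForestRegressor(random_state=0),
}

# Fit each model, predict on the test features, and score against y_test.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

# Highest value the forest predicts on the test period.
rf_max_prediction = models["Random Forest Regressor"].predict(X_test).max()

for name, score in scores.items():
    print(name, score)
print("forest max prediction:", rf_max_prediction, "train max:", y_train.max())
```

In this toy setting the linear model extrapolates the trend while the forest stays capped at the highest training value. Whether the same effect explains the scores in the table would need to be checked against the actual split used in the project.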


## III. Methodology


### Data Preprocessing

During data exploration, four steps were needed to normalize the data formats and to fill in 'Close' values that were missing.

![Raw values](img/raw_values.png "Raw values")

- The two datasets need to be merged on their key, in this case the date of the record, before analysis. A date transformation was applied to match the format of the 'price' dataset. The last record in the price dataset was discarded because it has no information about the close price.

- Next, a logarithmic transformation is applied to all numerical variables.

![Log-scaled values](img/log_scaled_values.png "Log-scaled values")

- Then a Min-Max scaler maps all of these numerical features to values between 0 and 1.

![Min-Max scaled features](img/minmax-scaled-feat.png "Min-Max scaled features")

- Finally, null values are filled using a backfill method.

Before:

![Missing values before backfill](img/data-loss.png "Before backfill")

After:

![Missing values after backfill](img/fixed_data_loss.png "After backfill")
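
The transformation steps above (log transform, Min-Max scaling, backfill) could look roughly like this in pandas and scikit-learn. The sample values are made up, and `np.log1p` is an assumption: the text only says "logarithmic transformation", and `log1p` is used here because it stays defined at zero.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical slice of the merged dataset, with a few missing values.
df = pd.DataFrame({
    "eth_tx": [8893.0, 2050.0, np.nan, 4200.0, 5100.0],
    "Volume": [164329.0, 674188.0, 532170.0, np.nan, 405283.0],
    "Close": [2.77, 0.75, np.nan, 0.71, 1.07],
})
numeric_cols = ["eth_tx", "Volume", "Close"]

# 1. Logarithmic transformation of the numerical columns.
logged = np.log1p(df[numeric_cols])

# 2. Min-Max scaling to the [0, 1] range. MinMaxScaler ignores NaN
#    when fitting and leaves it in place when transforming.
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(logged),
    columns=numeric_cols,
    index=df.index,
)

# 3. Backfill: each missing value takes the next valid observation.
filled = scaled.bfill()
print(filled)
```

Backfill cannot fill a gap in the final row because there is no later observation to copy, so a trailing gap would need separate handling.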


### Implementation

The models were implemented in a straightforward way using the scikit-learn framework.
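
As a hedged illustration of the step that separates training and testing data, the sketch below uses `train_test_split` on a synthetic frame. The column subset, the 80/20 ratio, and `shuffle=False` (which keeps rows in their original order) are all assumptions, since the source does not specify how the split was made.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed dataset (values are made up).
rng = np.random.default_rng(0)
n = 100
data = pd.DataFrame({
    "eth_tx": rng.random(n),
    "eth_address": rng.random(n),
    "eth_gasused": rng.random(n),
    "Close": rng.random(n),
})

# Features (X) and target (y).
feature_cols = ["eth_tx", "eth_address", "eth_gasused"]
X = data[feature_cols]
y = data["Close"]

# Hold out the last 20% of the rows without shuffling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
print(X_train.shape, X_test.shape)
```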


### Refinement


## IV. Results

### Model Evaluation and Validation

### Justification


## V. Conclusion

### Free-Form Visualization

### Reflection

### Improvement
