# **Summative Assessment 2**
by Leiana Mari D. Aquino

### **1 - Modelling Bitcoin Returns**

Find out which probability distribution function best fits Bitcoin’s returns for trading data every minute, from January 1, 2012 to April 15, 2023, for Bitcoin quoted in United States dollars or the BTC/USD pair.

**Background**

Bitcoin is a decentralized digital currency that was created in 2009 by an unknown person or group using the pseudonym Satoshi Nakamoto. It is based on a peer-to-peer network that allows users to send and receive payments without the need for a central authority, such as a bank or government.

**Data Description**

This dataset contains historical Bitcoin market data at 1-minute intervals from selected Bitcoin exchanges.

**Loading packages**

In [11]:
oldw <- getOption("warn") # save the current warning level so it can be restored later
options(warn = -1)        # suppress warnings while the packages load

library("readr")          # for csv file
library("dplyr")          # for data manipulation & transformation
library("data.table")     # for data manipulation & transformation
library("plyr")           # for data manipulation & transformation
library("ggplot2")        # for data visualization
library("DataCombine")    # for combining & reshaping data frames

library("EnvStats")       # for environmental statistics
library("anytime")        # for converting timestamps to dates

library("VGAM")           # for fitting various regression models
library("fitur")          # for fitting various probability distributions to data

library("fitdistrplus")   # for fitting probability distributions to data
library("tsallisqexp")    # for fitting the tsallis q-exponential distribution to data 
library("poweRlaw")       # for fitting power-law distributions to data
library("dgof")           # for testing goodness-of-fit of probability distributions

options(warn = oldw)


Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



Attaching package: 'data.table'


The following objects are masked from 'package:dplyr':

    between, first, last


The following object is masked from 'package:DataCombine':

    shift


------------------------------------------------------------------------------

You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr); library(dplyr)

------------------------------------------------------------------------------


Attaching package: 'plyr'


The following objects are masked from 'package:dplyr':

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize



Attaching package: 'EnvStats'


The following objects are masked from 'package:stats':

    pred

**Importing the Data Set**

Importing the data set is the first step in any data analysis project. It involves reading in the data from a file or database and storing it in a data structure that can be manipulated and analyzed.

In [1]:
# load the data set then run a summary
btc_data <- read.csv("C:/Users/maril/Downloads/SA2/merged_btc_data.csv")
summary(btc_data)

   Timestamp              Open              High              Low         
 Min.   :1.325e+09   Min.   :    3.8   Min.   :    3.8   Min.   :    1.5  
 1st Qu.:1.414e+09   1st Qu.:  608.3   1st Qu.:  608.6   1st Qu.:  608.0  
 Median :1.503e+09   Median : 6856.9   Median : 6860.6   Median : 6852.2  
 Mean   :2.943e+11   Mean   :12546.6   Mean   :12554.7   Mean   :12538.4  
 3rd Qu.:1.592e+09   3rd Qu.:17207.6   3rd Qu.:17214.5   3rd Qu.:17201.8  
 Max.   :1.680e+12   Max.   :69000.0   Max.   :69000.0   Max.   :68786.7  
                     NA's   :1243608   NA's   :1243608   NA's   :1243608  
     Close             volume        Volume_.Currency.  Weighted_Price   
 Min.   :    1.5   Min.   :   0.0    Min.   :       0   Min.   :    3.8  
 1st Qu.:  608.3   1st Qu.:   0.7    1st Qu.:     452   1st Qu.:  443.8  
 Median : 6857.1   Median :   4.1    Median :    3810   Median : 3596.8  
 Mean   :12546.6   Mean   :  28.0    Mean   :   41763   Mean   : 6008.9  
 3rd Qu.:17207.7   3rd Qu.:  1

**Data Dictionary**


The summary above reports the minimum, maximum, mean, median, and 1st and 3rd quartiles of each column, which gives a numerical picture of how the data are distributed. This is useful context for interpreting the plots of the Bitcoin returns later on.

- **Open** & **Close** represent the *opening* and *closing* prices of the given time period.
- **Low** & **High** represent the *lowest* and *highest* prices of the given time period.
- **Volume** refers to the total number of Bitcoins traded in a given time period.
- **Volume Currency** refers to the total value of the Bitcoin traded in a given time period, expressed in the quote currency (US dollars here).
- **Weighted Price** refers to the average price of all Bitcoin trades in a given time period, weighted by the volume of each trade.

**Pre-Processing the Data Set**

Pre-processing the data set is a crucial step in any data analysis project. It involves cleaning and transforming the data to make it suitable for analysis, and performing any necessary data manipulations. This step is important because it ensures that the data is accurate, complete, and in the correct format for analysis.


*To analyze returns, we can start by calculating the daily returns for the Bitcoin trading data set. Therefore, the dataframe needs to be grouped into 24-hour periods.*
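
The grouping step can be illustrated on a few rows of made-up data. This is only a sketch: the `toy` data frame, its values, and the use of UTC for the day boundaries are assumptions for illustration, not part of the Bitcoin data set.

```r
# Toy minute-level data: hypothetical Unix timestamps (in seconds) on two UTC days
toy <- data.frame(
  Timestamp = c(1325376000, 1325376060, 1325462400, 1325462460),
  Low       = c(4.0, 3.9, 4.5, 4.4),
  High      = c(4.2, 4.1, 4.8, 4.9)
)

# Map each timestamp to its calendar date (UTC is an assumption)
toy$Day <- as.Date(as.POSIXct(toy$Timestamp, origin = "1970-01-01", tz = "UTC"))

# One row per day: the lowest low and the highest high
daily_low  <- aggregate(Low ~ Day, data = toy, FUN = min)
daily_high <- aggregate(High ~ Day, data = toy, FUN = max)
daily <- merge(daily_low, daily_high, by = "Day")
print(daily)
```

Each calendar day collapses to a single row, which is the shape needed before daily returns can be computed.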

In [12]:
# remove rows with missing values in the "Close" column
btc_data = DropNA(btc_data, Var = "Close", message = FALSE)

# create a 'Data' (date) column, initialised with the date of the first timestamp;
# the remaining rows are converted in the next cell
btc_data['Data'] = anydate(btc_data[1, 'Timestamp'])

# replace any remaining missing values with 0
btc_data[is.na(btc_data)] <- 0

In [None]:
# convert every timestamp to a date in a single vectorized call;
# anydate() accepts a whole vector, so a row-by-row loop is not needed
btc_data$Data <- anydate(btc_data$Timestamp)

*Calculate each return as the difference between the current day's midpoint and the previous day's midpoint, divided by the previous day's midpoint.*
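
In symbols, writing $H_t$ and $L_t$ for the highest and lowest price on day $t$ (notation introduced here for illustration only), the midpoint $M_t$ and the return $r_t$ are:

$$
M_t = \frac{H_t + L_t}{2}, \qquad r_t = \frac{M_t - M_{t-1}}{M_{t-1}}
$$

The first day has no predecessor, so $r_1$ is left undefined.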

In [None]:
# convert the data frame to a data.table object
df <- data.table(btc_data)

# aggregate the low and high prices by date
a <- aggregate(df$Low, by = list(df$Data), min)
names(a)[1] <- c("Data")
names(a)[2] <- c("Low")

b <- aggregate(df$High, by = list(df$Data), max)
names(b)[1] <- c("Data")
names(b)[2] <- c("High")

# merge the low and high price data frames by date
df <- merge(x = a, y = b, by = "Data", all = TRUE)

*This gives the relative change in the midpoint price from one 24-hour period to the next (multiply by 100 for a percentage).*

In [None]:
# midpoint of the daily trading range, computed for all rows at once
# ((High - Low) / 2 + Low is the same as (High + Low) / 2)
df$Mid <- (df$High + df$Low) / 2

# daily return: change in midpoint relative to the previous day's midpoint;
# the first row has no previous day, so its return is NaN
df$return <- c(NaN, diff(df$Mid) / head(df$Mid, -1))

**Data Visualization**

After calculating the daily returns, the distribution of returns can be visualized using a histogram or density plot. By visualizing the distribution of returns, we can get a better understanding of the shape of the distribution, including its central tendency, spread, and skewness. This information can be useful for identifying any outliers or extreme values in the data, as well as for selecting appropriate statistical models for analyzing the data.
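
The same shape information can also be summarised numerically. The sketch below uses simulated heavy-tailed values as a stand-in for returns (not the Bitcoin data) and computes moment-based skewness and excess kurtosis in base R. The variable names are illustrative.

```r
set.seed(1)
r <- rt(1000, df = 4) * 0.02   # simulated heavy-tailed "returns", not real data

m <- mean(r)
s <- sqrt(mean((r - m)^2))                     # population-style standard deviation
skewness <- mean((r - m)^3) / s^3              # 0 for a symmetric distribution
excess_kurtosis <- mean((r - m)^4) / s^4 - 3   # 0 for a normal distribution

print(c(mean = m, sd = s, skewness = skewness, excess_kurtosis = excess_kurtosis))
```

A clearly positive excess kurtosis points to heavier tails than a normal distribution, which is one reason to consider candidates such as the Laplace or Student's t distribution.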

In [None]:
# create a histogram of the daily returns for Bitcoin
# (qplot expects a numeric vector, so pass the return column with missing values removed)
qplot(df$return[!is.na(df$return)],
      geom = "histogram",
      binwidth = 0.005,
      main = "Histogram of Returns of Bitcoin in US Dollars",
      xlab = "Return",
      fill = I("blue"),
      col = I("red"),
      alpha = I(.2),
      xlim = c(-0.3, 0.3))

In [None]:
# create a density plot of the daily returns for Bitcoin
ggplot(data = df, aes(x = return)) +
  geom_density(fill = "blue", alpha = 0.2, color = "red") +
  ggtitle("Density Plot of Returns of Bitcoin in US Dollars") +
  xlab("Return") +
  xlim(-0.3, 0.3)

**Data Distribution**

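Assess which of the candidate distributions (Normal, Student's t, Laplace, Tsallis, Power Law) best fits the daily returns of Bitcoin. Each cell below draws a simulated sample from the candidate distribution and compares it with the observed returns using a Kolmogorov-Smirnov test.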

- Normal Distribution

In [None]:
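# normal distribution:
returns <- df$return[!is.na(df$return)]   # numeric vector of returns, missing values removed
# simulate a normal sample with the same size, mean and standard deviation as the returns
df_test = rnorm(length(returns), mean = mean(returns), sd = sd(returns))
# two-sample Kolmogorov-Smirnov test: observed returns vs. the simulated sample
ks.test(returns, df_test)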

- Student's T Distribution

In [None]:
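# student's t distribution:
returns <- df$return[!is.na(df$return)]   # numeric vector of returns, missing values removed
# rt() draws from a standard t distribution, so shift and scale the draws to match the returns
df_test = mean(returns) + sd(returns) * rt(length(returns), df = length(returns) - 1)
ks.test(returns, df_test)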

- Laplace Distribution

In [None]:
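# laplace distribution:
returns <- df$return[!is.na(df$return)]   # numeric vector of returns, missing values removed
# VGAM::rlaplace() takes a location and a scale b; the variance is 2 * b^2, so b = sd / sqrt(2)
df_test = rlaplace(length(returns), location = mean(returns), scale = sd(returns) / sqrt(2))
ks.test(returns, df_test)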

- Tsallis Distribution

In [None]:
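# tsallis distribution:
returns <- df$return[!is.na(df$return)]   # numeric vector of returns, missing values removed
# rtsal() draws are non-negative and need a positive shape and scale,
# so this comparison only makes sense when mean(returns) > 0
df_test = rtsal(length(returns), shape = mean(returns), scale = sd(returns))
ks.test(returns, df_test)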

- Power Law Distribution

In [None]:
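# powerLaw distribution:
returns <- df$return[!is.na(df$return)]   # numeric vector of returns, missing values removed
# rplcon(n, xmin, alpha): xmin is the lower cut-off and alpha the scaling exponent;
# a continuous power law is defined for xmin > 0 and alpha > 1, so read this result with caution
df_test = rplcon(length(returns), xmin = -0.3, alpha = sd(returns))
ks.test(returns, df_test)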

- Other ways to determine which probability distribution fits best.

In [None]:
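# fit several probability distributions to the daily returns by maximum likelihood
returns <- df$return[!is.na(df$return)]   # numeric vector of returns, missing values removed
fit.norm <- fitdist(returns, "norm")
# "t" and "laplace" have no built-in starting values in fitdistrplus, so supply them
# (df = 5 is an arbitrary starting guess; dlaplace() comes from VGAM)
fit.t <- fitdist(returns, "t", start = list(df = 5))
fit.laplace <- fitdist(returns, "laplace",
                       start = list(location = median(returns), scale = sd(returns) / sqrt(2)))
fit.cauchy <- fitdist(returns, "cauchy")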

In [None]:
# compare the goodness of fit of the probability distributions using the AIC criterion
AIC.df <- data.frame(Distribution = c("Normal", "Student-t", "Laplace", "Cauchy"),
                     AIC = c(AIC(fit.norm), AIC(fit.t), AIC(fit.laplace), AIC(fit.cauchy)))

# print the AIC values for each distribution
print(AIC.df)
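
To make the AIC comparison concrete, the sketch below computes AIC = 2k - 2 log L by hand for a Normal and a Laplace fit, using only base R. The data are simulated as a stand-in for returns, and the variable names are illustrative. The closed-form maximum-likelihood estimates are used, so no fitting package is needed.

```r
set.seed(1)
x <- rnorm(500, mean = 0, sd = 0.04)   # simulated stand-in for returns, not real data

# Normal MLE: sample mean and the maximum-likelihood (divide-by-n) standard deviation
mu_hat <- mean(x)
sigma_hat <- sqrt(mean((x - mu_hat)^2))
loglik_norm <- sum(dnorm(x, mean = mu_hat, sd = sigma_hat, log = TRUE))

# Laplace MLE: location = median, scale = mean absolute deviation from the median
m_hat <- median(x)
b_hat <- mean(abs(x - m_hat))
loglik_lap <- sum(-log(2 * b_hat) - abs(x - m_hat) / b_hat)

# AIC = 2k - 2 * log-likelihood, with k = 2 estimated parameters in both models
aic <- c(Normal = 2 * 2 - 2 * loglik_norm, Laplace = 2 * 2 - 2 * loglik_lap)
print(aic)
```

The distribution with the lower AIC is preferred. Because both models here have two parameters, the comparison reduces to comparing their maximised log-likelihoods.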

**Conclusion**

The D statistic of the Kolmogorov-Smirnov test is the largest absolute distance between the empirical distribution function (EDF) of the data and the theoretical distribution function (TDF) of the candidate probability distribution. A smaller D statistic indicates a closer fit between the data and the distribution.
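
As a small illustration of this definition, the sketch below computes D by hand for simulated data against a standard normal distribution and compares it with the value from the built-in one-sample test. The data and the fixed parameters (mean 0, sd 1) are assumptions for illustration. `stats::ks.test` is called explicitly so that the result does not depend on whether another package masks `ks.test`.

```r
set.seed(42)
x <- rnorm(200)   # simulated data, a stand-in for the returns

# Evaluate the candidate CDF at the sorted observations
x_sorted <- sort(x)
n <- length(x_sorted)
cdf <- pnorm(x_sorted, mean = 0, sd = 1)

# The EDF jumps at each observation, so check the gap just after and just before each jump
d_manual <- max(pmax((1:n) / n - cdf, cdf - (0:(n - 1)) / n))

# Compare with the built-in one-sample Kolmogorov-Smirnov test
d_builtin <- unname(stats::ks.test(x, "pnorm", mean = 0, sd = 1)$statistic)
print(c(manual = d_manual, builtin = d_builtin))
```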


Based on the D statistics obtained, the **Laplace distribution** has the smallest value (0.1285), followed closely by the Normal distribution (0.14951). This suggests that the Laplace distribution provides the best fit to the Bitcoin returns data, with the Normal distribution a close second.

### **2 - Testing the Ethereum Returns**

Using the Shapiro-Wilk normality test, check whether the Ethereum returns are normally distributed, for trading data every five minutes from August 7, 2015 to April 15, 2023.

**Background**

Ethereum is a decentralized, open-source blockchain with smart contract functionality. Ether (ETH) is the native cryptocurrency of the platform. Among cryptocurrencies, Ether is second only to Bitcoin in market capitalization. Ethereum was proposed in 2013 by programmer Vitalik Buterin.

**Data Description**

This dataset provides the price history of Ethereum at 5-minute intervals.

In [1]:
# load the packages
library("stats")
library("quantmod")
library("readr")

"package 'quantmod' was built under R version 4.2.3"
Loading required package: xts

"package 'xts' was built under R version 4.2.3"
Loading required package: zoo

"package 'zoo' was built under R version 4.2.3"

Attaching package: 'zoo'


The following objects are masked from 'package:base':

    as.Date, as.Date.numeric


Loading required package: TTR

"package 'TTR' was built under R version 4.2.3"
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 

"package 'readr' was built under R version 4.2.3"


In [1]:
# load the Ethereum trading data set
eth_data <- read.csv("C:/Users/maril/Downloads/SA2/ethereum_compiled_data.csv")
summary(eth_data)

 Unix.Timestamp          Date              Symbol               Open       
 Min.   :1.463e+09   Length:2294948     Length:2294948     Min.   :   0.0  
 1st Qu.:1.493e+09   Class :character   Class :character   1st Qu.: 108.0  
 Median :1.522e+09   Mode  :character   Mode  :character   Median : 209.0  
 Mean   :6.155e+11                                         Mean   : 456.4  
 3rd Qu.:1.555e+12                                         3rd Qu.: 441.0  
 Max.   :1.587e+12                                         Max.   :4862.0  
 NA's   :310936                                                            
      High              Low             Close             Volume         
 Min.   :   5.99   Min.   :   0.0   Min.   :   5.99   Min.   :        0  
 1st Qu.: 108.00   1st Qu.: 107.9   1st Qu.: 107.95   1st Qu.:        0  
 Median : 209.09   Median : 208.9   Median : 209.00   Median :        0  
 Mean   : 456.98   Mean   : 455.8   Mean   : 456.41   Mean   :   609621  
 3rd Qu.: 441.46   3rd

In [None]:
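# calculate the log returns from the Ethereum closing prices
eth_returns <- diff(log(eth_data$Close))

# keep finite values only (drops NA and any infinite values caused by zero prices)
eth_returns <- eth_returns[is.finite(eth_returns)]

# shapiro.test() accepts between 3 and 5000 observations,
# so test a random sample of at most 5000 returns (the seed is arbitrary)
set.seed(123)
eth_sample <- sample(eth_returns, min(5000, length(eth_returns)))

# perform the Shapiro-Wilk normality test once and store the result
sw <- shapiro.test(eth_sample)

# print the results of the Shapiro-Wilk test
cat("Shapiro-Wilk normality test results:\n")
cat("Test statistic = ", sw$statistic, "\n")
cat("p-value = ", sw$p.value, "\n")

# interpret the results of the Shapiro-Wilk test at the 5% significance level
if (sw$p.value < 0.05) {
  cat("The Ethereum returns are not normally distributed.\n")
} else {
  cat("There is no evidence against normality of the Ethereum returns.\n")
}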
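
The behaviour of `shapiro.test` can be checked on simulated data before applying it to the Ethereum returns. The sketch below uses made-up samples, not the Ethereum data. It shows that a heavy-tailed sample is rejected as non-normal and that the function refuses inputs with more than 5000 observations.

```r
set.seed(7)
heavy <- rt(5000, df = 3)   # simulated heavy-tailed sample, not real returns

# Shapiro-Wilk test on the heavy-tailed sample
res <- shapiro.test(heavy)
print(res$p.value)

# shapiro.test() stops with an error when given more than 5000 observations
too_big <- try(shapiro.test(rnorm(5001)), silent = TRUE)
print(inherits(too_big, "try-error"))
```

This size limit is why a long return series has to be subsampled, or tested with a different method, before a Shapiro-Wilk test can be run.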