In [12]:
library(data.table)
library(caret)
library(randomForest)
library(gbm)
options(warn=-1)

## Toyota Corolla Data Set

Toyota Corolla data set consists of 1436 observations which shows features of the cars such as age, fuel type or price. In this assignment, price will be the output variable and it will be predicted with different models depending on the numerical and catagorical attributes.

In [2]:
data <- fread("C:/Users/kaan9/OneDrive/Masaüstü/ToyotaCorolla.csv",
              colClasses=list(numeric=c(1,2,3,5,8,10),factor=c(4,6,7,9)))

### 1.a) Splitting the Data Set 

The data should be splitted in to test and training data sets. The training data set consists of 70% of the observations, which is approximately 1005 observations, while the test data set consists of 431. Splitting is completely  random.

In [3]:
set.seed(425)
train <- sample(1:1436,1005)
train_dt <- data[train,]
test_dt <- data[-train,]

### 1.b) RandomForest Model

In [4]:
control1 <- trainControl(method="repeatedcv",
                        number=10,
                        repeats=5)
models <- list()
for(ntree in (1:5)*100){
    model <- train(Price ~ ., 
                data = train_dt,
                method = 'rf',
                metric = 'RMSE',
                trControl = control1,
                tuneGrid = expand.grid(.mtry=1:9),
                ntree=ntree)
    models[[toString(ntree)]] <- model
}

In [5]:
models

$`100`
Random Forest 

1005 samples
   9 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 905, 903, 902, 906, 905, 905, ... 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared   MAE      
  1     2080.656  0.8355822  1556.9839
  2     1309.297  0.8929262   965.4054
  3     1145.338  0.9086789   843.1122
  4     1109.745  0.9126914   822.7885
  5     1098.741  0.9136345   816.7343
  6     1096.281  0.9137045   818.8668
  7     1100.641  0.9128943   822.0562
  8     1105.881  0.9119093   826.4065
  9     1114.124  0.9103800   829.2413

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 6.

$`200`
Random Forest 

1005 samples
   9 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 905, 903, 905, 905, 905, 904, ... 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquare

To find the best paramters, CARET package is used. After the tunning process, the best tree in terms of the RMSE is the one with ntree=200 and mtry=6. The lowest RMSE score for these parameters ,which is obtained by repeated cross-validation, is 1088.532. With the randomForest package, the model with ntree=200 and mtry=6 parameters is constructed and applied in test data set to evaluate the performance.

In [6]:
model_opt <- randomForest(Price~.,
                          data=train_dt,
                          mtry=6,
                          ntree=200,
                          importance=TRUE)
prediction <- predict(model_opt,newdata=test_dt)
RMSE_rf <- (sum((test_dt$Price-prediction)^2)/length(prediction))^0.5
round(RMSE_rf,3)

The optimum model has the RMSE of 1077.616.

### 1.c) Variable Importance

In [7]:
importance_dt <- data.frame(model_opt$importance)
importance_dt[order(importance_dt$IncNodePurity,decreasing=TRUE),]

Unnamed: 0,X.IncMSE,IncNodePurity
Age,16591448.441,9632767856
KM,1427289.275,2159009284
Weight,1574060.482,1374842749
HP,342225.266,290432141
CC,229916.175,141338387
Doors,78766.825,67399700
MetColor,29020.117,52921508
FuelType,60896.578,44503236
Automatic,5258.658,15491597


IncMSE means the increase in the mean squared error (MSE) if the predictor excluded from the model. Thus, Age and KM attributes are the most important predictors while FuelType and Automatic are the least.

### 1.d) Comparision with Linear Regression Model

In [8]:
model_reg <- lm(Price~.,data=train_dt)
predict_reg <- predict(model_reg,newdata=test_dt)
RMSE_reg <- (sum((test_dt$Price-predict_reg)^2)/length(predict_reg))^0.5
round(RMSE_reg,3)
round((RMSE_reg-RMSE_rf)/RMSE_rf*100,2)

The linear regression model has RMSE of 1423.096. The RMSE of the linear regression model is 32.06% higher than the Random Forest model which means Random Forest model is a significantly better model.

### 2) Gradient Boosting Machines

In [9]:
control2 <- trainControl(method="repeatedcv",
                        number=10,
                        repeats=5)
grid <- expand.grid(interaction.depth = c(3:5), 
                        n.trees = (3:6)*25, 
                        shrinkage = (1:3)*0.1,
                        n.minobsinnode = c(1:3)*5)
model_gbm <- train(Price ~ ., 
                data = train_dt,
                method = 'gbm',
                metric = 'RMSE',
                trControl = control2,
                tuneGrid = grid)
model_gbm

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1 11808735.2191             nan     0.1000 2027949.4487
     2 10014359.2975             nan     0.1000 1659362.5973
     3  8553128.0659             nan     0.1000 1413507.8721
     4  7391074.0534             nan     0.1000 1204463.0726
     5  6475544.0959             nan     0.1000 992258.1246
     6  5688698.4317             nan     0.1000 786441.6737
     7  5070633.2912             nan     0.1000 604821.1888
     8  4511744.2855             nan     0.1000 496309.9093
     9  4055595.1667             nan     0.1000 455971.5940
    10  3670949.0901             nan     0.1000 378186.2013
    20  1829163.0900             nan     0.1000 46117.9874
    40  1162565.4186             nan     0.1000 2332.2895
    60  1039222.6363             nan     0.1000 1932.9625
    80   979929.1689             nan     0.1000 -5517.4026
   100   944884.0582             nan     0.1000 -1783.6226
   120   909293.2045             nan     0.10

Stochastic Gradient Boosting 

1005 samples
   9 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 904, 904, 904, 906, 906, 904, ... 
Resampling results across tuning parameters:

  shrinkage  interaction.depth  n.minobsinnode  n.trees  RMSE      Rsquared 
  0.1        3                   5               75      1130.334  0.9085235
  0.1        3                   5              100      1125.085  0.9094134
  0.1        3                   5              125      1123.139  0.9096858
  0.1        3                   5              150      1122.558  0.9097338
  0.1        3                  10               75      1142.803  0.9065004
  0.1        3                  10              100      1133.570  0.9079417
  0.1        3                  10              125      1130.511  0.9083393
  0.1        3                  10              150      1125.363  0.9090157
  0.1        3                  15               75      1154.316 

After the tuning of the parameters, the lowest cross-validation RMSE is 1110.212 with the parameters ntree=150, interaction.depth=5 n.minobsinnode=5 and shrinkage = 0.1.

In [11]:
model_opt_gbm <- gbm(Price~.,
                     data=train_dt,
                     n.trees = 150,
                     interaction.depth = 5,
                     n.minobsinnode = 5,
                     shrinkage = 0.1)
predict_gbm <- predict(model_opt_gbm,newdata=test_dt)
RMSE_gbm <- (sum((test_dt$Price-predict_gbm)^2)/length(predict_gbm))^0.5
round(RMSE_gbm,3)

Distribution not specified, assuming gaussian ...


Using 150 trees...



The predictions obtained with the optimum GBM model and it has RMSE of 1058.501. It could be concluded that for the predictions of the Toyota Corolla prices, Gradient Boosting Machines and Random Forest models have the smilar error rates and both of them are significantly better than the linear regression model.

In [None]:
library(jsonlite)
library(httr)
library(data.table)
library(ggplot2)
library(ggcorrplot)
library(readr)
library(tsibble)
library(zoo)
library(forecast)
library(mgcv)
library(lubridate)
library(urca)
library(Metrics)
library(mgcv)
library(gratia)
options(repr.plot.width = 10, repr.plot.height = 4)

accu=function(actual,forecast){
  n=length(actual)
  error=actual-forecast
  mean=mean(actual)
  sd=sd(actual)
  CV=sd/mean
  FBias=sum(error)/sum(actual)
  MAPE=sum(abs(error/actual))/n
  RMSE=sqrt(sum(error^2)/n)
  MAD=sum(abs(error))/n
  MADP=sum(abs(error))/sum(abs(actual))
  WMAPE=MAD/mean
  l=data.frame(n,mean,sd,CV,FBias,MAPE,RMSE,MAD,MADP,WMAPE)
  return(l)
}

get_token <- function(username, password, url_site){
  
  post_body = list(username=username,password=password)
  post_url_string = paste0(url_site,'/token/')
  result = POST(post_url_string, body = post_body)
  
  # error handling (wrong credentials)
  if(result$status_code==400){
    print('Check your credentials')
    return(0)
  }
  else if (result$status_code==201){
    output = content(result)
    token = output$key
  }
  
  return(token)
}

get_data <- function(start_date="1", token, url_site){
  
  post_body = list(start_date=start_date,username=username,password=password)
  post_url_string = paste0(url_site,'/dataset/')
  
  header = add_headers(c(Authorization=paste('Token',token,sep=' ')))
  result = GET(post_url_string, header, body = post_body)
  output = content(result)
  data = data.table::rbindlist(output)
  data[,event_date:=as.Date(event_date)]
  data = data[order(product_content_id,event_date)]
  return(data)
}


subm_url = 'http://46.101.163.177'

u_name = "Group4"
p_word = "a4TStQDQYjpverak"
submit_now = FALSE

username = u_name
password = p_word

token = get_token(username=u_name, password=p_word, url=subm_url)
data_son = get_data(token=token,url=subm_url)

ProjectRawData <- read_csv("C:/Users/kaan9/OneDrive/Masaüstü/ProjectRawData.csv", 
                           col_types = cols(event_date = col_date(format = "%Y-%m-%d")))
data <- data.table(ProjectRawData)
data <- data.table(rbind(data,data_son[event_date > max(data$event_date)]))[order(event_date)]


discount_dates <- as.Date(c("2021-03-10", "2021-03-11", "2021-03-12",
                            "2021-02-13", "2021-02-14", "2021-02-12",
                            "2020-12-31", "2020-12-30","2020-12-29",
                            "2020-11-25", "2020-11-26", "2020-11-27", "2020-11-28", "2020-11-29",
                            "2020-11-09", "2020-11-10","2020-11-11",
                            "2020-09-10", "2020-09-11", "2020-09-12",
                            "2020-06-18", "2020-06-19", "2020-06-20",
                            "2021-05-07","2021-05-08", "2021-05-09"))

data <- data[,special_days:=0]
data[data$event_date %in% discount_dates]$special_days <- 1
data$ratio <- data$category_sold / data$category_favored


train_start=as.Date('2020-05-25')
test_start=as.Date('2021-05-24')
test_end=as.Date('2021-05-31')
test_dates=seq(test_start,test_end,by='day')

### ORAL B ŞARJLI DİŞ FIRÇASI

disf <- data[product_content_id=="32939029",][order(event_date)]
disf_train <- disf[event_date<test_start,]
disf_ts <- ts(disf_train$sold_count,freq=7)

After the decomposition, it could be said that the chargable toothbrush has a weekly seasonality. Differencing could make the series stationary. 

ggplot(disf_train[2:.N,],aes(x=event_date)) + geom_line(aes(y=diff(disf_train$sold_count,1))) + 
labs(title= "Graph of Differenced Series", x= "Date", y="Quantity")

After the differencing, the data seems stationary but it has some outliers and non-constant variance.

acf(diff(disf_train$sold_count,1))
pacf(diff(disf_train$sold_count,1))

At the ACF graph, there is a spike at lag 2 which indicates MA(2) model and also there is slighlty significant spike at lag 7 which is the seasonal MA(1). At the PACF graph, there is spikes at lag 2 and 4.  AR(2) and AR(4) models could be applied.

arima(disf_ts,order=c(2,1,2),seasonal=c(0,0,1))

arima(disf_ts,order=c(4,1,2),seasonal=c(0,0,1))

auto.arima(disf_ts)

After trying some SARIMA models, auto.arima function gives the nearly same result with the SARIMA(2,1,2)(0,0,1). We could use our model to forecasting.

model_disf_sarima <- arima(disf_ts,order=c(2,1,2),seasonal=c(0,0,1))
checkresiduals(model_disf_sarima)

Residuals look stationary with constant mean at 0. Variance isn't constant but there is not any significant correlations at ACF graph. The distribution seems normal.

disf_train$res1 <- model_disf_sarima$residuals
corr <- cor(disf_train[!is.na(disf_train$price),c(-1,-2)])
ggcorrplot(corr,hc.order = TRUE, type = "lower")

Residuals have highest correlation with basket_count which 0.4. Also it still has correlation with output variable which means there is still unexplained data in sold_count. We could add basket_count as a regressor.

model_disf_sarimax <- arima(disf_ts,order=c(2,1,2),seasonal=c(0,0,1),xreg=disf_train$basket_count)
disf_train$res2 <- model_disf_sarimax$residuals
model_disf_sarimax
checkresiduals(model_disf_sarimax)

After the addition of the baskes_count as a regressor, AIC value is reduced to 3359 from 3767. Residuals didn't change much after the addition of the regressor.

corr <- cor(disf_train[!is.na(disf_train$price),c(-1,-2)])
ggcorrplot(corr,hc.order = TRUE, type = "lower")

Residuals still have correlation with category_visits but not significantly, it could be added as a regressor.

model_disf_sarimax2 <- arima(disf_ts,order=c(2,1,2),seasonal=c(0,0,1),xreg=as.matrix(disf_train$basket_count,disf_train$category_visits))
disf_train$res3 <- model_disf_sarimax2$residuals
model_disf_sarimax2

The addition of the new regressor didn't change the AIC value. So, the previous SARIMAX model could be use in forecasting.

forecast_data_disf <- data.table(event_date=disf[event_date>=test_start&event_date<=test_end,]$event_date,
                            sold_count=disf[event_date>=test_start&event_date<=test_end,]$sold_count)
sarima_fc <- numeric(0)
sarimax_fc <- numeric(0)
for(i in 1:length(test_dates)){
  
  train_dt <- disf[event_date<test_dates[i],]
  model_sarima <- Arima(train_dt$sold_count,order=c(2,1,2),seasonal=c(0,0,1))
  model_sarimax <- Arima(train_dt$sold_count,order=c(2,1,2),seasonal=c(0,0,1),xreg=train_dt$basket_count)
  newreg <- forecast(auto.arima(train_dt$basket_count),h=1)$mean[1]
  sarima_temp <- forecast(model_sarima)
  sarimax_temp <- forecast(model_sarimax,xreg=newreg)
  sarima_fc <- c(sarima_fc,sarima_temp$mean[1])
  sarimax_fc <- c(sarimax_fc,sarimax_temp$mean[1])
  
}
forecast_data_disf <- forecast_data_disf[,`:=`(sarima_p=sarima_fc,
                                     sarimax_p=sarimax_fc)]
accu(forecast_data_disf$sold_count,forecast_data_disf$sarima_p)
accu(forecast_data_disf$sold_count,forecast_data_disf$sarimax_p)

After the applicaiton of the models on the test period, SARIMA model has WMAPE value of 0.288 while the SARIMAX model with basket_count as a regressor has WMAPE value of 0.307. It could be concluded that the SARIMA model is a better predictive model.

### Sleepy Bebek Islak Mendil

mendil <- data[product_content_id=="4066298",][order(event_date)]
mendil_train <- mendil[event_date<test_start,]
mendil_ts <- ts(mendil_train$sold_count)

ggplot(mendil_train,aes(x=event_date,y=sold_count)) + geom_line() +
labs(title= "Graph of Sleepy Towel", x= "Date", y="Quantity")

There are a lot of outliers in the data, mean isn't constant and also variance varies through time. Differencing should be applied to the series.

ggplot(mendil_train[2:.N,],aes(x=event_date)) + geom_line(aes(y=diff(mendil_train$sold_count,1))) + 
labs(title= "Graph of Differenced Series", x= "Date", y="Quantity")

acf(diff(mendil_train$sold_count,1))
pacf(diff(mendil_train$sold_count,1))

At the ACF graph, there is a spike at lags 3 and 4 which could be MA(3) or MA(4) model. Also, there is spikes at PACF graph at lags 3 and 4 which indicates AR(3) OR AR(4) models. Seasonality is not occured in the decomposition steps so, ARIMA model could be consturcted.

arima(mendil_ts,order=c(3,1,3))

arima(mendil_ts,order=c(3,1,4))

arima(mendil_ts,order=c(4,1,3))

arima(mendil_ts,order=c(4,1,4))

auto.arima(mendil_ts)

The best ARIMA model is (3,1,4) with the lowest AIC value of 5207.07.

model_mendil_arima <- arima(mendil_ts,order=c(3,1,4))
checkresiduals(model_mendil_arima)
mendil_train$res1 <- model_mendil_arima$residuals
corr <- cor(mendil_train[!is.na(mendil_train$price),c(-1,-2)])
ggcorrplot(corr,hc.order = TRUE, type = "lower")

Redisuals of the first model isn't autocorrelated and mean is constant at 0. Variance is mostly constant through time. Distribution looks normal but slightly right skewed. Residuals still correlated with sold_count which is output variable and also with basket_count.

ggplot(mendil_train,aes(x=res1,y=basket_count)) +  geom_point() + geom_smooth()

ggplot(mendil_train[2:.N],aes(x=res1)) +  geom_point(aes(y=diff(mendil_train$basket_count,1))) +
geom_smooth(aes(y=diff(mendil_train$basket_count,1)))

cor(mendil_train[2:.N]$res1,diff(mendil_train$basket_count,1))
mendil_train$diff_basket <- c(NA,diff(mendil_train$basket_count,1))
mendil$diff_basket <- c(0,diff(mendil$basket_count,1))

The correlation between residuals and basket_count is non-linear but differencing the basket_count make this relation linear and with a correlation 0.678. We could add differenced basket_count as a regressor.

arima(mendil_ts,order=c(3,1,4),xreg=mendil_train$diff_basket)

After the addition of regressor, AIC value is reduced slightly. 

model_mendil_arimax <- arima(mendil_ts,order=c(3,1,4),xreg=mendil_train$diff_basket)
checkresiduals(model_mendil_arimax)
mendil_train$res2 <- model_mendil_arimax$residuals
corr <- cor(mendil_train[!is.na(mendil_train$price)& !is.na(mendil_train$res2),c(-1,-2)])
ggcorrplot(corr,hc.order = TRUE, type = "lower")

Residuals are more stationary than the previous model. Variance looks more constant and the distribution have mean near to the zero.

forecast_data_mendil <- data.table(event_date=mendil[event_date>=test_start&event_date<=test_end,]$event_date,
                            sold_count=mendil[event_date>=test_start&event_date<=test_end,]$sold_count)
arima_fc <- numeric(0)
arimax_fc <- numeric(0)
for(i in 1:length(test_dates)){
  
  train_dt <- mendil[event_date<test_dates[i],]
  model_arima <- Arima(train_dt$sold_count,order=c(3,1,4))
  model_arimax <- Arima(train_dt$sold_count,order=c(3,1,4),xreg=train_dt$diff_basket)
  newreg <- forecast(auto.arima(train_dt$diff_basket),h=1)$mean[1]
  arima_temp <- forecast(model_arima)
  arimax_temp <- forecast(model_arimax,xreg=newreg)
  arima_fc <- c(arima_fc,arima_temp$mean[1])
  arimax_fc <- c(arimax_fc,arimax_temp$mean[1])
  
}
forecast_data_mendil <- forecast_data_mendil[,`:=`(arima_p=arima_fc,
                                     arimax_p=arimax_fc)]
accu(forecast_data_mendil$sold_count,forecast_data_mendil$arima_p)
accu(forecast_data_mendil$sold_count,forecast_data_mendil$arimax_p)

### La Roche Posay Yüz Temizleyici

jel <- data[product_content_id=="85004",][order(event_date)]
jel_train <- jel[event_date<test_start,]
jel_ts <- ts(jel_train$sold_count)

ggplot(jel_train,aes(x=event_date,y=sold_count)) + geom_line() +
labs(title= "Graph of SLa Roche Posay", x= "Date", y="Quantity")

Again, series are not stationariy. Analysis could start with differencing to make it stationary.

ggplot(jel_train[2:.N,],aes(x=event_date)) + geom_line(aes(y=diff(jel_train$sold_count,1))) + 
labs(title= "Graph of Differenced Series", x= "Date", y="Quantity")

acf(diff(jel_train$sold_count,1))
pacf(diff(jel_train$sold_count,1))

In the ACF graph there is spikes at lag 3 and 4. Pacf graph shows spikes at lag 3 and 4 also. Decomposition tells there is no seasonality, so an ARIMA model could be constructed.

arima(jel_ts,order=c(3,1,3))

arima(jel_ts,order=c(4,1,3))

arima(jel_ts,order=c(3,1,4))

arima(jel_ts,order=c(4,1,4))

auto.arima(jel_ts)

The best model is ARIMA(3,1,3) with an AIC of 3725.75. Thus, we could use it in forecast steps.

model_jel_arima <- arima(jel_ts,order=c(3,1,3))
checkresiduals(model_jel_arima)
jel_train$res1 <- model_jel_arima$residuals
corr <- cor(jel_train[!is.na(jel_train$price),c(-1,-2)])
ggcorrplot(corr,hc.order = TRUE, type = "lower")

Residuals have constant mean at 0 and variance looks constant although there is some outliers. There is not lags with correlation. Distribution is slightly right skewed. Residuals are correlated with category_visits.

ggplot(jel_train,aes(x=res1,y=category_visits)) +  geom_point() + geom_smooth()

ggplot(jel_train[2:.N],aes(x=res1)) +  geom_point(aes(y=diff(jel_train$category_visits,1))) +
geom_smooth(aes(y=diff(jel_train$category_visits,1)))

cor(jel_train[2:.N]$res1,diff(jel_train$category_visits,1))
jel_train$diff_visits <- c(NA,diff(jel_train$category_visits,1))
jel$diff_visits <- c(NA,diff(jel$category_visits,1))

Differenced category_visits is highlt correlated with residuals with a value of 0.694. We could add it as a regressor.

arima(jel_ts,order=c(3,1,3),xreg=jel_train$diff_visits)

auto.arima(jel_ts,xreg=jel_train$diff_visits)

model_jel_arimax <- arima(jel_ts,order=c(3,1,3),xreg=jel_train$diff_visits)
checkresiduals(model_jel_arimax)
jel_train$res2 <- model_jel_arimax$residuals
corr <- cor(jel_train[!is.na(jel_train$price)& !is.na(jel_train$res2),c(-1,-2)])
ggcorrplot(corr,hc.order = TRUE, type = "lower")

Residuals don't look changed much in the stationary manner. However, we could see its success in the test period.

forecast_data_jel <- data.table(event_date=jel[event_date>=test_start&event_date<=test_end,]$event_date,
                            sold_count=jel[event_date>=test_start&event_date<=test_end,]$sold_count)
arima_fc <- numeric(0)
arimax_fc <- numeric(0)
for(i in 1:length(test_dates)){
  
  train_dt <- jel[event_date<test_dates[i],]
  model_arima <- Arima(train_dt$sold_count,order=c(3,1,4))
  model_arimax <- Arima(train_dt$sold_count,order=c(3,1,4),xreg=train_dt$diff_visits)
  newreg <- forecast(auto.arima(train_dt$diff_visits),h=1)$mean[1]
  arima_temp <- forecast(model_arima)
  arimax_temp <- forecast(model_arimax,xreg=newreg)
  arima_fc <- c(arima_fc,arima_temp$mean[1])
  arimax_fc <- c(arimax_fc,arimax_temp$mean[1])
  
}
forecast_data_jel <- forecast_data_jel[,`:=`(arima_p=arima_fc,
                                     arimax_p=arimax_fc)]
accu(forecast_data_jel$sold_count,forecast_data_jel$arima_p)
accu(forecast_data_jel$sold_count,forecast_data_jel$arimax_p)

### Fakir Dik Süpürge

fakir <- data[product_content_id=="7061886",][order(event_date)]
fakir_train <- fakir[event_date<test_start,]
fakir_ts <- ts(fakir_train$sold_count)

ggplot(fakir_train,aes(x=event_date,y=sold_count)) + geom_line() +
labs(title= "Graph of Fakir Vacuum Clenaer", x= "Date", y="Quantity")

Fakir vacuum cleaner data is not stationary, because of the trend and outliers.

ggplot(fakir_train[2:.N,],aes(x=event_date)) + geom_line(aes(y=diff(fakir_train$sold_count,1))) + 
  labs(title= "Graph of Differenced Series", x= "Date", y="Quantity")

acf(diff(fakir_train$sold_count,1))
pacf(diff(fakir_train$sold_count,1))

After analyzing the ACF and PACF graphs of the differenced series, AR(1), AR(4), MA(1) and MA(4) models could be tried as a ARIMA model.

arima(fakir_ts,order=c(1,1,1))

arima(fakir_ts,order=c(4,1,1))

arima(fakir_ts,order=c(1,1,4))

arima(fakir_ts,order=c(4,1,4))

auto.arima(fakir_ts)

The best model is ARIMA(1,1,4) with the AIC value of 3374.99

model_fakir_arima <- arima(fakir_ts,order=c(1,1,4))
checkresiduals(model_fakir_arima)
fakir_train$res1 <- model_fakir_arima$residuals
corr <- cor(fakir_train[!is.na(fakir_train$price),c(-1,-2)])
ggcorrplot(corr,hc.order = TRUE, type = "lower")

Residuals seem stationary with a constant mean of 0 and nearly constant variance. Histogram of the residuals look as a normal distribution. Category_sold attribute is correlated with the residuals.

ggplot(fakir_train,aes(x=res1,y=category_sold)) +  geom_point() + geom_smooth()

ggplot(fakir_train[2:.N],aes(x=res1)) +  geom_point(aes(y=diff(fakir_train$category_sold,1))) +
geom_smooth(aes(y=diff(fakir_train$category_sold,1)))

cor(fakir_train[2:.N]$res1,diff(fakir_train$category_sold,1))
fakir_train$diff_sold <- c(NA,diff(fakir_train$category_sold,1))
fakir$diff_sold <- c(NA,diff(fakir$category_sold,1))

arima(fakir_ts,order=c(1,1,4),xreg=fakir_train$diff_sold)

New regressor give better AIC value which is 3139.72 and better than the previous model.

model_fakir_arimax <- arima(fakir_ts,order=c(1,1,4),xreg=fakir_train$diff_sold)
checkresiduals(model_fakir_arimax)
fakir_train$res2 <- model_fakir_arimax$residuals
corr <- cor(fakir_train[!is.na(fakir_train$price)& !is.na(fakir_train$res2),c(-1,-2)])
ggcorrplot(corr,hc.order = TRUE, type = "lower")

Residuals are more stationary now. We should apply the new model to the test period.

forecast_data_fakir <- data.table(event_date=fakir[event_date>=test_start&event_date<=test_end,]$event_date,
                            sold_count=fakir[event_date>=test_start&event_date<=test_end,]$sold_count)
arima_fc <- numeric(0)
arimax_fc <- numeric(0)
for(i in 1:length(test_dates)){
  
  train_dt <- fakir[event_date<test_dates[i],]
  model_arima <- arima(train_dt$sold_count,order=c(1,1,4))
  model_arimax <- Arima(train_dt$sold_count,order=c(1,1,4),xreg=train_dt$diff_sold)
  newreg <- forecast(auto.arima(train_dt$diff_sold),h=1)$mean[1]
  arima_temp <- forecast(model_arima)
  arimax_temp <- forecast(model_arimax,xreg=newreg)
  arima_fc <- c(arima_fc,arima_temp$mean[1])
  arimax_fc <- c(arimax_fc,arimax_temp$mean[1])
  
}
forecast_data_fakir <- forecast_data_fakir[,`:=`(arima_p=arima_fc,
                                     arimax_p=arimax_fc)]
accu(forecast_data_fakir$sold_count,forecast_data_fakir$arima_p)
accu(forecast_data_fakir$sold_count,forecast_data_fakir$arimax_p)

### Xaomi Bluetooth Kulaklık

xaomi <- data[product_content_id=="6676673",][order(event_date)]
xaomi_train <- xaomi[event_date<test_start,]
xaomi_ts <- ts(xaomi_train$sold_count,freq=7)

ggplot(xaomi_train,aes(x=event_date,y=sold_count)) + geom_line() +
  labs(title= "Graph of Xaomi Bluetooth Earphone", x= "Date", y="Quantity")

Xaomi sold_count series is not stationary due to the non constant mean and variance.

ggplot(xaomi_train[2:.N,],aes(x=event_date)) + geom_line(aes(y=diff(xaomi_train$sold_count,1))) + 
  labs(title= "Graph of Differenced Series", x= "Date", y="Quantity")

acf(diff(xaomi_train$sold_count,1))
pacf(diff(xaomi_train$sold_count,1))

At the decomposition step, it was said that the data has weekly seasonality. We could build a SARIMA model. ACF graphs has spikes at lag 1 and 3. Also, PACF graph has spikes at lag 1,3,4 and 9. We could try these parameters on a SARIMA model.

arima(xaomi_ts,order=c(1,1,1),seasonal=c(2,0,0))

arima(xaomi_ts,order=c(3,1,1),seasonal=c(2,0,0))

arima(xaomi_ts,order=c(3,1,3),seasonal=c(2,0,0))

arima(xaomi_ts,order=c(1,1,3),seasonal=c(2,0,0))

auto.arima(xaomi_ts)

The best model is SARIMA(1,1,3)(2,0,0) with the AIC value of 4692.53

model_xaomi_sarima <- arima(xaomi_ts,order=c(1,1,3),seasonal=c(2,0,0))
checkresiduals(model_xaomi_sarima)
xaomi_train$res1 <- model_xaomi_sarima$residuals
corr <- cor(xaomi_train[!is.na(xaomi_train$price),c(-1,-2)])
ggcorrplot(corr,hc.order = TRUE, type = "lower")


Residuals have a constant mean at 0. They are not correlated and they have nearly constant variance. Although the distribution seems normal, histogram is right skewed. Residuals are correlated with category_sold.

ggplot(xaomi_train,aes(x=res1,y=category_sold)) +  geom_point() + geom_smooth()

cor(xaomi_train$res1,xaomi_train$category_sold)

The correlation seems to be linear with 0.544, there is no need to transform the data. We could add it as a regressor.

model_xaomi_sarimax <- arima(xaomi_ts,order=c(1,1,3),seasonal=c(2,0,0),xreg=xaomi_train$category_sold)
checkresiduals(model_xaomi_sarimax)
xaomi_train$res2 <- model_xaomi_sarimax$residuals
corr <- cor(xaomi_train[!is.na(xaomi_train$price),c(-1,-2)])
ggcorrplot(corr,hc.order = TRUE, type = "lower")

Residuals seems more stationart and also right skew of the histogram is gone. We could use this SARIMAX model in the forecasting.

forecast_data_xaomi <- data.table(event_date=xaomi[event_date>=test_start&event_date<=test_end,]$event_date,
                            sold_count=xaomi[event_date>=test_start&event_date<=test_end,]$sold_count)
sarima_fc <- numeric(0)
sarimax_fc <- numeric(0)
for(i in 1:length(test_dates)){
  
  train_dt <- xaomi[event_date<test_dates[i],]
  model_sarima <- Arima(train_dt$sold_count,order=c(1,1,3),seasonal=c(2,0,0))
  model_sarimax <- Arima(train_dt$sold_count,order=c(1,1,3),seasonal=c(2,0,0),xreg=train_dt$category_sold)
  newreg <- forecast(auto.arima(train_dt$category_sold),h=1)$mean[1]
  sarima_temp <- forecast(model_sarima)
  sarimax_temp <- forecast(model_sarimax,xreg=newreg)
  sarima_fc <- c(sarima_fc,sarima_temp$mean[1])
  sarimax_fc <- c(sarimax_fc,sarimax_temp$mean[1])
  
}
forecast_data_xaomi <- forecast_data_xaomi[,`:=`(sarima_p=sarima_fc,
                                                 sarimax_p=sarimax_fc)]
accu(forecast_data_xaomi$sold_count,forecast_data_xaomi$sarima_p)
accu(forecast_data_xaomi$sold_count,forecast_data_xaomi$sarimax_p)

### Trendyolmilla Tayt

tayt <- data[product_content_id=="31515569",][order(event_date)]
tayt_train <- tayt[event_date<test_start,]
tayt_ts <- ts(tayt_train$sold_count)

ggplot(tayt_train,aes(x=event_date,y=sold_count)) + geom_line() +
  labs(title= "Graph of Trendyolmilla Tayt", x= "Date", y="Quantity")

ggplot(tayt_train[2:.N,],aes(x=event_date)) + geom_line(aes(y=diff(tayt_train$sold_count,1))) + 
  labs(title= "Graph of Differenced Series", x= "Date", y="Quantity")

acf(diff(tayt_train$sold_count,1))
pacf(diff(tayt_train$sold_count,1))

After the differencing, there is spikes at ACF graph at lag 4 and at PACF graph lag 4. An ARIMA(4,1,4) model could be constructed.

arima(tayt_ts,order=c(4,1,4))

auto.arima(tayt_ts)

The best model is ARIMA(4,1,4) with lowest AIC 5792.55.

model_tayt_arima <- arima(tayt_ts,order=c(4,1,4))
checkresiduals(model_tayt_arima)
tayt_train$res1 <- model_tayt_arima$residuals
corr <- cor(tayt_train[!is.na(tayt_train$price),c(-1,-2)])
ggcorrplot(corr,hc.order = TRUE, type = "lower")

Residuals have constant mean at 0 and constant variance. Also, distribution seems normal while there is no autocorrelation. Residuals are correlated with basket_count and  category_sold. First we could add the basket_count variable.

ggplot(tayt_train,aes(x=res1,y=basket_count)) +  geom_point() + geom_smooth()

ggplot(tayt_train[2:.N],aes(x=res1)) +  geom_point(aes(y=diff(tayt_train$basket_count,1))) + 
geom_smooth(aes(y=diff(tayt_train$basket_count,1)))

A more linear curve obtained with differencing.

cor(tayt_train[2:.N]$res1,diff(tayt_train$basket_count,1))
tayt_train$diff_basket <- c(NA,diff(tayt_train$basket_count,1))
tayt$diff_basket <- c(NA,diff(tayt$basket_count,1))

arima(tayt_ts,order=c(4,1,4),xreg=tayt_train$diff_basket)

model_tayt_arimax <- arima(tayt_ts,order=c(4,1,4),xreg=tayt_train$diff_basket)
checkresiduals(model_tayt_arimax)
tayt_train$res2 <- model_tayt_arimax$residuals
corr <- cor(tayt_train[!is.na(tayt_train$price) & !is.na(tayt_train$res2),c(-1,-2)])
ggcorrplot(corr,hc.order = TRUE, type = "lower")

ggplot(tayt_train,aes(x=res2,y=category_sold)) +  geom_point() + geom_smooth()

ggplot(tayt_train[2:.N],aes(x=res1)) +  geom_point(aes(y=diff(tayt_train$category_sold,1))) + geom_smooth(aes(y=diff(tayt_train$category_sold,1)))

cor(tayt_train[2:.N]$res2,diff(tayt_train$category_sold,1))
tayt_train$diff_sold <- c(NA,diff(tayt_train$category_sold,1))
tayt$diff_sold <- c(NA,diff(tayt$category_sold,1))

arima(tayt_ts,order=c(4,1,4),xreg=as.matrix(tayt_train$diff_basket,tayt_train$diff_sold))

model_tayt_arimax <- arima(tayt_ts,order=c(4,1,4),xreg=as.matrix(tayt_train$diff_basket,tayt_train$diff_sold))
checkresiduals(model_tayt_arimax)
tayt_train$res3 <- model_tayt_arimax$residuals
corr <- cor(tayt_train[!is.na(tayt_train$price) & !is.na(tayt_train$res2),c(-1,-2)])
ggcorrplot(corr,hc.order = TRUE, type = "lower")

The final model has basket_count and category_sold as regressors. And residuals seem better both in stationarity and distribution manners.

forecast_data_tayt <- data.table(event_date=tayt[event_date>=test_start&event_date<=test_end,]$event_date,
                            sold_count=tayt[event_date>=test_start&event_date<=test_end,]$sold_count)
arima_fc <- numeric(0)
arimax_fc <- numeric(0)
for(i in 1:length(test_dates)){
  
  train_dt <- tayt[event_date<test_dates[i],]
  model_arima <- Arima(train_dt$sold_count,order=c(4,1,4))
  model_arimax <- Arima(tayt_ts,order=c(4,1,4),xreg=as.matrix(tayt_train$diff_basket,tayt_train$diff_sold))
  newreg <- forecast(auto.arima(train_dt$diff_basket),h=1)$mean[1]
  newreg2 <- forecast(auto.arima(train_dt$diff_sold),h=1)$mean[1]
  arima_temp <- forecast(model_arima)
  arimax_temp <- forecast(model_arimax,xreg=as.matrix(newreg,newreg2))
  arima_fc <- c(arima_fc,arima_temp$mean[1])
  arimax_fc <- c(arimax_fc,arimax_temp$mean[1])
  
}
forecast_data_tayt <- forecast_data_tayt[,`:=`(arima_p=arima_fc,
                                     arimax_p=arimax_fc)]
accu(forecast_data_tayt$sold_count,forecast_data_tayt$arima_p)
accu(forecast_data_tayt$sold_count,forecast_data_tayt$arimax_p)

### Trendyolmilla Bikini 2

bikini2 <- data[product_content_id=="32737302",][order(event_date)]
bikini2$is_summer <- 0
bikini2[month(bikini2$event_date) %in% c(6,7,8),]$is_summer <- 1
bikini2_train <- bikini2[event_date<test_start,]
bikini2_ts <- ts(bikini2_train$sold_count,freq=9)

ggplot(bikini2_train,aes(x=event_date,y=sold_count)) + geom_line() +
  labs(title= "Graph of Trendyolmilla Bikini 2", x= "Date", y="Quantity")

ggplot(bikini2_train[2:.N,],aes(x=event_date)) + geom_line(aes(y=diff(bikini2_train$sold_count,1))) + 
  labs(title= "Graph of Differenced Series", x= "Date", y="Quantity")

After the differencing, series seem more stationary now. According to the decomposition, there is a seasonal cycle at every 9 data points.

acf(diff(bikini2_train$sold_count,1))
pacf(diff(bikini2_train$sold_count,1))

After the analysis of ACF and PACF graphs, we could say that AR(1), AR(2), AR(3) and seasonal AR(1)-AR(2) could be tried. And also MA(2), MA(3) and seasonal MA(1)-MA(2) are strong candidates for a SARIMA model.

arima(bikini2_ts,order=c(1,1,2),seasonal=c(2,0,2))

arima(bikini2_ts,order=c(2,1,2),seasonal=c(2,0,2))

arima(bikini2_ts,order=c(1,1,3),seasonal=c(2,0,2))

arima(bikini2_ts,order=c(3,1,3),seasonal=c(2,0,2))

arima(bikini2_ts,order=c(3,1,3),seasonal=c(1,0,1))

auto.arima(bikini2_ts)

The best model is SARIMA(3,1,3)(1,0,1) with AIC value 2361.16.

model_bikini2_sarima <- arima(bikini2_ts,order=c(3,1,3),seasonal=c(1,0,1))
checkresiduals(model_bikini2_sarima)
bikini2_train$res1 <- model_bikini2_sarima$residuals
corr <- cor(bikini2_train[!is.na(bikini2_train$price),c(-1,-2)])
ggcorrplot(corr,hc.order = TRUE, type = "lower")

Residuals have constant mean at 0 and constant variance. Distribution seems normal. Due to the correlation with residuals, basket_count could be added as a regressor.

ggplot(bikini2_train,aes(x=res1,y=basket_count)) +  geom_point() + geom_smooth()

ggplot(bikini2_train[2:.N],aes(x=res1)) +  geom_point(aes(y=diff(bikini2_train$basket_count,1))) + 
geom_smooth(aes(y=diff(bikini2_train$basket_count,1)))

cor(bikini2_train[2:.N]$res1,diff(bikini2_train$basket_count,1))
bikini2_train$diff_basket <- c(NA,diff(bikini2_train$basket_count,1))
bikini2$diff_basket <- c(NA,diff(bikini2$basket_count,1))

model_bikini2_arimax <- arima(bikini2_ts,order=c(3,1,3),seasonal=c(1,0,1),xreg=bikini2_train$diff_basket)
checkresiduals(model_bikini2_arimax)
bikini2_train$res2 <- model_bikini2_arimax$residuals
corr <- cor(bikini2_train[!is.na(bikini2_train$price),c(-1,-2)])
ggcorrplot(corr,hc.order = TRUE, type = "lower")

forecast_data_bikini2 <- data.table(event_date=bikini2[event_date>=test_start&event_date<=test_end,]$event_date,
                                 sold_count=bikini2[event_date>=test_start&event_date<=test_end,]$sold_count)
sarima_fc <- numeric(0)
sarimax_fc <- numeric(0)
for(i in 1:length(test_dates)){
  
  train_dt <- bikini2[event_date<test_dates[i],]
  model_sarima <- Arima(train_dt$sold_count,order=c(3,1,3),seasonal=c(1,0,1))
  model_sarimax <- Arima(train_dt$sold_count,order=c(3,1,3),seasonal=c(1,0,1),xreg=train_dt$diff_basket)
  newreg <- forecast(auto.arima(train_dt$diff_basket),h=1)$mean[1]
  sarima_temp <- forecast(model_sarima)
  sarimax_temp <- forecast(model_sarimax,xreg=newreg)
  sarima_fc <- c(sarima_fc,sarima_temp$mean[1])
  sarimax_fc <- c(sarimax_fc,sarimax_temp$mean[1])
  
}
forecast_data_bikini2 <- forecast_data_bikini2[,`:=`(sarima_p=sarima_fc,
                                               sarimax_p=sarimax_fc)]
accu(forecast_data_bikini2$sold_count,forecast_data_bikini2$sarima_p)
accu(forecast_data_bikini2$sold_count,forecast_data_bikini2$sarimax_p)

### Trendyolmilla Bikini1

bikini1 <- data[product_content_id=="73318567",][order(event_date)]
bikini1_train <- bikini1[event_date<test_start,]
bikini1_ts <- ts(bikini1_train$sold_count)

ggplot(bikini1_train,aes(x=event_date,y=sold_count)) + geom_line() +
  labs(title= "Graph of Trendyolmilla Bikini1", x= "Date", y="Quantity")

ggplot(bikini1_train[2:.N,],aes(x=event_date)) + geom_line(aes(y=diff(bikini1_train$sold_count,1))) + 
  labs(title= "Graph of Differenced Series", x= "Date", y="Quantity")

acf(diff(bikini1_train$sold_count,1))
pacf(diff(bikini1_train$sold_count,1))

According to the ACF and PACF graphs, AR(2), AR(3) and MA(2), MA(3) are good candidates for an ARIMA model.

arima(bikini1_ts,order=c(3,1,2))

arima(bikini1_ts,order=c(2,1,3))

auto.arima(bikini1_ts)

The lowest AIC value is 2825 on the ARIMA(2,1,3)

model_bikini1_arima <- arima(bikini1_ts,order=c(2,1,1))
checkresiduals(model_bikini1_arima)
bikini1_train$res1 <- model_bikini1_arima$residuals
corr <- cor(bikini1_train[!is.na(bikini1_train$price),c(-1,-2)])
ggcorrplot(corr,hc.order = TRUE, type = "lower")

There is no significant correlation between residuals and other variables because the nature of the data.

forecast_data_bikini1 <- data.table(event_date=bikini1[event_date>=test_start&event_date<=test_end,]$event_date,
                            sold_count=bikini1[event_date>=test_start&event_date<=test_end,]$sold_count)
arima_fc <- numeric(0)
for(i in 1:length(test_dates)){
  
  train_dt <- bikini1[event_date<test_dates[i],]
  model_arima <- Arima(train_dt$sold_count,order=c(3,1,4))
  arima_temp <- forecast(model_arima)
  arima_fc <- c(arima_fc,arima_temp$mean[1])
  
}
forecast_data_bikini1 <- forecast_data_bikini1[,`:=`(arima_p=arima_fc)]
accu(forecast_data_bikini1$sold_count,forecast_data_bikini1$arima_p)

### MONT 

mont <- data[product_content_id=="48740784",][order(event_date)]
mont$is_winter <- 0
mont[month(mont$event_date) %in% c(9,10,11,12,1),]$is_winter <- 1
mont_train <- mont[event_date<test_start,]
mont_ts <- ts(mont_train$sold_count)

ggplot(mont_train,aes(x=event_date,y=sold_count)) + geom_line() +
  labs(title= "Graph of Altınyıldız Mont", x= "Date", y="Quantity")

ggplot(mont_train[2:.N,],aes(x=event_date)) + geom_line(aes(y=diff(mont_train$sold_count,1))) + 
  labs(title= "Graph of Differenced Series", x= "Date", y="Quantity")

acf(diff(mont_train$sold_count,1))
pacf(diff(mont_train$sold_count,1))

There are spikes at the ACF graph at lags 1 and 4. There are also spikes at the PACF graph at lags 1,2,4. These paramters could be tried on a ARIMA model.

arima(mont_ts,order=c(1,1,1))

arima(mont_ts,order=c(1,1,2))

arima(mont_ts,order=c(1,1,4))

arima(mont_ts,order=c(4,1,1))

arima(mont_ts,order=c(4,1,2))

arima(mont_ts,order=c(4,1,4))

The best model is ARIMA(4,1,2) with lowest AIC value of 1866.99.

model_mont_arima <- arima(mont_ts,order=c(4,1,2))
checkresiduals(model_mont_arima)
mont_train$res1 <- model_mont_arima$residuals
corr <- cor(mont_train[!is.na(mont_train$price),c(-1,-2)])
ggcorrplot(corr,hc.order = TRUE, type = "lower")

Resdiuals looks statioanry with 0 mean and constant variance. Basket_count could be added as a regressor.

ggplot(mont_train,aes(x=res1,y=basket_count)) +  geom_point() + geom_smooth()

cor(mont_train$res1,mont_train$basket_count)

arima(mont_ts,order=c(4,1,2),xreg=mont_train$basket_count)

AIC value is decreased when the regressor added.

model_mont_arimax <- arima(mont_ts,order=c(4,1,2),xreg=mont_train$basket_count)
checkresiduals(model_mont_arimax)
mont_train$res2 <- model_mont_arimax$residuals
corr <- cor(mont_train[!is.na(mont_train$price),c(-1,-2)])
ggcorrplot(corr,hc.order = TRUE, type = "lower")

forecast_data_mont <- data.table(event_date=mont[event_date>=test_start&event_date<=test_end,]$event_date,
                            sold_count=mont[event_date>=test_start&event_date<=test_end,]$sold_count)
arima_fc <- numeric(0)
arimax_fc <- numeric(0)
for(i in 1:length(test_dates)){
  
  train_dt <- mont[event_date<test_dates[i],]
  model_arima <- Arima(train_dt$sold_count,order=c(4,1,2))
  model_arimax <- Arima(train_dt$sold_count,order=c(4,1,2),xreg=train_dt$basket_count)
  newreg <- forecast(auto.arima(train_dt$basket_count),h=1)$mean[1]
  arima_temp <- forecast(model_arima)
  arimax_temp <- forecast(model_arimax,xreg=newreg)
  arima_fc <- c(arima_fc,arima_temp$mean[1])
  arimax_fc <- c(arimax_fc,arimax_temp$mean[1])
  
}
forecast_data_mont <- forecast_data_mont[,`:=`(arima_p=arima_fc,
                                     arimax_p=arimax_fc)]
accu(forecast_data_mont$sold_count,forecast_data_mont$arima_p)
accu(forecast_data_mont$sold_count,forecast_data_mont$arimax_p)

