# Tuning

![](banner_project.jpg)

In [1]:
analyst = "Lilit Petrosyan"

In [2]:
f = "setup.R"; for (i in 1:10) { if (file.exists(f)) break else f = paste0("../", f) }; source(f)
options(repr.matrix.max.rows=674)
options(repr.matrix.max.cols=200)
update_geom_defaults("point", list(size=1))                                

### Objective

Construct and tune a classifier and a regressor, each trained on a transformed dataset of public company fundamentals.  Use the best classifier or regressor, along with additional analysis, to recommend a portfolio of 12 company investments that maximizes the 12-month return on an overall \$1,000,000 investment.

### Approach

Retrieve a transformed dataset.

Construct a model to predict whether stock price will grow more than 30% over 12 months, given 12 months of past company fundamentals data, and using a machine learning model construction method and transformed data.  Tune the model by systematically selecting various combinations of predictor variables and cutoffs, and identify the best business performance based on a business model and business parameters.  

Similarly, construct a model to predict how much stock price will grow over 12 months, given 12 months of past company fundamentals data, and using a machine learning model construction method and transformed data.  Tune the model by systematically selecting various combinations of predictor variables, and identify the best business performance based on a business model and business parameters.
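The systematic search described above can be sketched as a tuning grid. This is a minimal illustration in base R, assuming the candidate predictors are `PC1` and `PC2` and the candidate cutoffs are 0.25, 0.33, and 0.50; the names `predictors`, `cutoffs`, `subsets`, and `grid` are introduced here for illustration only.

```r
# Hypothetical tuning grid: every non-empty subset of predictors crossed with every cutoff.
predictors = c("PC1", "PC2")
cutoffs = c(0.25, 0.33, 0.50)

# One right-hand side per non-empty predictor subset: "PC1", "PC2", "PC1+PC2".
subsets = unlist(lapply(seq_along(predictors),
                        function(k) combn(predictors, k, FUN=paste, collapse="+")))

# Cross the predictor subsets with the cutoffs and attach a formula string to each row.
grid = expand.grid(variables=subsets, cutoff=cutoffs, stringsAsFactors=FALSE)
grid$formula = paste("big_growth ~", grid$variables)

grid
```

Each row of `grid` is one candidate model to evaluate by cross-validation. The regressor search would use the same predictor subsets without the cutoff column.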

## Business Model


The business model is ...

$ \begin{align} profit = \left( \sum_{i \in portfolio} (1 + growth_i) \times allocation_i \right) - budget \end{align} $

<br>

$ profit\,rate = profit \div budget $


$ \begin{align} budget = \sum_{i \in portfolio} allocation_i \end{align} $

<br>

Business model parameters include ...

* Budget = \\$1,000,000: total investment to allocate across the companies in the portfolio
* Portfolio Size = 12: number of companies in the portfolio
* Allocations = \\$1,000,000 $\div$ 12 to each company: investments to allocate to specific companies in the portfolio 
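The business model can be written as a small function. This is a sketch only: `portfolio_profit` and the uniform 10% growth figures are hypothetical and chosen to make the arithmetic easy to check.

```r
# Sketch of the business model; the function name and arguments are illustrative.
portfolio_profit = function(growth, allocation) {
    budget = sum(allocation)
    sum((1 + growth) * allocation) - budget
}

budget = 1000000
portfolio_size = 12
allocation = rep(budget / portfolio_size, portfolio_size)

# Hypothetical growths: every company in the portfolio grows 10%.
growth = rep(0.10, portfolio_size)
profit = portfolio_profit(growth, allocation)
profit_rate = profit / budget
```

With equal allocations the formula reduces to `sum(growth) * allocation[1]`, so a uniform 10% growth yields a profit of \$100,000 and a profit rate of 0.10.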

For classifier evaluation, fill the portfolio with companies with the lowest gvkey values from among those you predict to grow above 30%.  If you predict fewer than the portfolio size to grow above 30%, then fill the rest of the portfolio with the remaining companies with lowest gvkey values.

For regressor evaluation, fill the portfolio with companies that have the highest predicted growths.
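The classifier fill rule can be expressed as one helper instead of per-fold row counts. The sketch below is not part of the notebook: `fill_portfolio` is a hypothetical name, and it assumes `pred` holds "YES"/"NO" predictions aligned with the rows of the test set.

```r
# Illustrative helper: return the row indices of the companies to put in the portfolio.
# Predicted-YES companies come first (lowest gvkey first); any remaining slots are
# filled with the other companies, again lowest gvkey first.
fill_portfolio = function(test, pred, size) {
    by_gvkey = order(test$gvkey)
    yes = by_gvkey[pred[by_gvkey] == "YES"]
    no  = by_gvkey[pred[by_gvkey] != "YES"]
    head(c(yes, no), size)
}

# Hypothetical test fold of 5 companies and a portfolio of 3.
test = data.frame(gvkey=c(1004, 1045, 1050, 1062, 1072),
                  growth=c(0.05, -0.38, 0.32, -0.22, -0.12))
pred = c("NO", "NO", "YES", "NO", "NO")

picked = fill_portfolio(test, pred, size=3)
test$gvkey[picked]
```

Here only one company is predicted to grow above 30%, so the portfolio holds gvkey 1050 followed by the two lowest remaining gvkeys, 1004 and 1045.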

In [3]:
# Set the business parameters.

budget = 1000000
portfolio_size = 12
allocation = rep(budget/portfolio_size, portfolio_size)

layout(fmt(budget), fmt(portfolio_size), fmt(allocation))

budget
1000000

portfolio_size
12

allocation
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33


## Data Retrieval

We retrieve the data previously stored in Project B as "My Data.csv". The dataset includes gvkey (the global company key), sorted from lowest to highest; the ticker symbol and company name; two principal components (PC1 and PC2); the stock price; the growth rate; and a character variable indicating whether the company grew more than 30% over 12 months.

In [4]:
# Retrieve "My Data.csv"
data = read.csv("My Data.csv", header=TRUE)

# Present a few rows ...
data[1:6,]

| gvkey | tic | conm | PC1 | PC2 | prccq | growth | big_growth |
|---|---|---|---|---|---|---|---|
| 1004 | AIR | AAR CORP | 3.4371231 | -0.2260719 | 43.69 | 0.0507455507 | NO |
| 1045 | AAL | AMERICAN AIRLINES GROUP INC | -12.0332067 | 0.8045109 | 32.11 | -0.3828560446 | NO |
| 1050 | CECE | CECO ENVIRONMENTAL CORP | 3.9532234 | -0.7553386 | 6.75 | 0.3157894737 | YES |
| 1062 | ASA | ASA GOLD AND PRECIOUS METALS | 3.6561434 | -0.7981915 | 8.66 | -0.2164739518 | NO |
| 1072 | AVX | AVX CORP | 2.9282228 | -0.71042 | 15.25 | -0.1184971098 | NO |
| 1075 | PNW | PINNACLE WEST CAPITAL CORP | 0.3488491 | 1.1389605 | 85.2 | 0.0002347969 | NO |


## Build & Tune Classification Model

In [5]:
set.seed(0)
fold = createFolds(data$big_growth, k=5)
data.train = list()
data.test  = list()
for (i in 1:5) { data.train[[i]] = data[setdiff(1:nrow(data), fold[[i]]),]
                     data.test[[i]]  = data[fold[[i]],] }
# Naive Bayes on PC1 + PC2 with cutoff 0.33, evaluated by 5-fold cross-validation.
cm = list()
accuracy = list()
profit=list()
for (i in 1:5) {
                set.seed(0)
                model = naiveBayes(big_growth~PC1+PC2, data.train[[i]],laplace=TRUE)
                prob = predict(model, data.test[[i]], type="raw")
                class.predicted = as.class(prob, class="YES", cutoff=0.33)
                CM = confusionMatrix(class.predicted, data.test[[i]]$big_growth)$table
                cm[[i]] = CM/ sum(CM)
                accuracy[[i]] = (cm[[i]][1,1]+cm[[i]][2,2])/sum(cm[[i]])
                data.test[[i]]$pred=class.predicted
                # Portfolio: the first portfolio_size predicted-YES companies, in the row order of the data.
                # Assumes each fold has at least portfolio_size YES predictions; otherwise the sum is NA.
                dd=filter(data.test[[i]], ( pred=="YES"))[1:portfolio_size,]
                profit[[i]]=sum(dd$growth)*allocation[1]
                }
d=data.frame(fold=1:5, accuracy=unlist(accuracy), profit=unlist(profit))
accuracy.cv=mean(d$accuracy)
profit.cv=mean(d$profit)
d_PC1_PC2_33=data.frame(method="naive bayes", variables="PC1, PC2, big_growth",cutoff="0.33", accuracy.cv=accuracy.cv,profit.cv=profit.cv)

In [6]:
# Naive Bayes on PC1 + PC2 with cutoff 0.25, evaluated by 5-fold cross-validation.
cm = list()
accuracy = list()
profit=list()
for (i in 1:5) {
                set.seed(0)
                model = naiveBayes(big_growth~PC1+PC2, data.train[[i]],laplace=TRUE)
                prob = predict(model, data.test[[i]], type="raw")
                class.predicted = as.class(prob, class="YES", cutoff=0.25)
                CM = confusionMatrix(class.predicted, data.test[[i]]$big_growth)$table
                cm[[i]] = CM/ sum(CM)
                accuracy[[i]] = (cm[[i]][1,1]+cm[[i]][2,2])/sum(cm[[i]])
                data.test[[i]]$pred=class.predicted
                # Portfolio: the first portfolio_size predicted-YES companies, in the row order of the data.
                # Assumes each fold has at least portfolio_size YES predictions; otherwise the sum is NA.
                dd=filter(data.test[[i]], ( pred=="YES"))[1:portfolio_size,]
                profit[[i]]=sum(dd$growth)*allocation[1]
                }
d=data.frame(fold=1:5, accuracy=unlist(accuracy), profit=unlist(profit))
accuracy.cv=mean(d$accuracy)
profit.cv=mean(d$profit)
d_PC1_PC2_25=data.frame(method="naive bayes", variables="PC1, PC2, big_growth",cutoff="0.25", accuracy.cv=accuracy.cv,profit.cv=profit.cv)

In [7]:
data_1.train = data[setdiff(1:nrow(data), fold$Fold1),]
data_1.test  = data[fold$Fold1,]
model.1 = naiveBayes(big_growth ~PC1+PC2, data_1.train)
prob.1 = predict(model.1, data_1.test, type="raw")
class.predicted.1 = as.class(prob.1, class="YES", cutoff=0.5)
CM.1 = confusionMatrix(class.predicted.1, data_1.test$big_growth)$table
cm.1 = CM.1 / sum(CM.1)
accuracy.1 = (cm.1[1,1]+cm.1[2,2])/sum(cm.1)
data_1.test$pred=class.predicted.1
dd.1=filter(data_1.test, ( pred=="YES"))[1:12,]
profit.1=sum(dd.1$growth)*allocation[1]
profit_rate.1=profit.1/budget

data_2.train = data[setdiff(1:nrow(data), fold$Fold2),]
data_2.test  = data[fold$Fold2,]
model.2 = naiveBayes(big_growth ~ PC1+PC2, data_2.train)
prob.2 = predict(model.2, data_2.test, type="raw")
class.predicted.2 = as.class(prob.2, class="NO", cutoff=0.5)

CM.2 = confusionMatrix(class.predicted.2, data_2.test$big_growth)$table
cm.2 = CM.2 / sum(CM.2)
accuracy.2 = (cm.2[1,1]+cm.2[2,2])/sum(cm.2)
data_2.test$pred=class.predicted.2
dd.2=filter(data_2.test, ( pred=="NO"))[1:12,]
profit.2=sum(dd.2$growth)*allocation[1]
profit_rate.2=profit.2/budget

data_3.train = data[setdiff(1:nrow(data), fold$Fold3),]
data_3.test  = data[fold$Fold3,]
model.3 = naiveBayes(big_growth ~ PC1+PC2, data_3.train)
prob.3 = predict(model.3, data_3.test, type="raw")
class.predicted.3 = as.class(prob.3, class="YES", cutoff=0.5)

CM.3 = confusionMatrix(class.predicted.3, data_3.test$big_growth)$table
cm.3 = CM.3 / sum(CM.3)
accuracy.3 = (cm.3[1,1]+cm.3[2,2])/sum(cm.3)

data_3.test$pred=class.predicted.3
dd.3=filter(data_3.test, ( pred=="YES"))[1:12,]
profit.3=sum(dd.3$growth)*allocation[1]
profit_rate.3=profit.3/budget

data_4.train = data[setdiff(1:nrow(data), fold$Fold4),]
data_4.test  = data[fold$Fold4,]
model.4 = naiveBayes(big_growth ~ PC1+PC2, data_4.train)
prob.4 = predict(model.4, data_4.test, type="raw")
class.predicted.4 = as.class(prob.4, class="YES", cutoff=0.5)

CM.4 = confusionMatrix(class.predicted.4, data_4.test$big_growth)$table
cm.4 = CM.4 / sum(CM.4)
accuracy.4 = (cm.4[1,1]+cm.4[2,2])/sum(cm.4)

data_4.test$pred=class.predicted.4
dd.4=filter(data_4.test, ( pred=="YES"))[1:12,]
profit.4=sum(dd.4$growth)*allocation[1]
profit_rate.4=profit.4/budget

data_5.train = data[setdiff(1:nrow(data), fold$Fold5),]
data_5.test  = data[fold$Fold5,]
model.5 = naiveBayes(big_growth ~ PC1+PC2, data_5.train)
prob.5 = predict(model.5, data_5.test, type="raw")
class.predicted.5 = as.class(prob.5, class="YES", cutoff=0.5)

CM.5 = confusionMatrix(class.predicted.5, data_5.test$big_growth)$table
cm.5 = CM.5 / sum(CM.5)
accuracy.5 = (cm.5[1,1]+cm.5[2,2])/sum(cm.5)
data_5.test$pred=class.predicted.5
dd.5=filter(data_5.test, ( pred=="YES"))[1:12,]
profit.5=sum(dd.5$growth)*allocation[1]
profit_rate.5=profit.5/budget

d=data.frame(fold=c(1,2,3,4,5),accuracy=c(accuracy.1,accuracy.2,accuracy.3,accuracy.4,accuracy.5), profit=c(profit.1,profit.2,profit.3,profit.4,profit.5))
accuracy.cv=mean(d$accuracy)
profit.cv=mean(c(profit.1, profit.2, profit.3, profit.4, profit.5))
profit_rate.cv=mean(c(profit_rate.1,profit_rate.2,profit_rate.3,profit_rate.4,profit_rate.5))
d_PC1_PC2_5=data.frame(method="naive bayes", variables="PC1, PC2, big_growth",cutoff="0.50", accuracy.cv=accuracy.cv,profit.cv=profit.cv)

In [8]:
model.1 = naiveBayes(big_growth ~PC2, data_1.train)
prob.1 = predict(model.1, data_1.test, type="raw")
class.predicted.1 = as.class(prob.1, class="YES", cutoff=0.25)
CM.1 = confusionMatrix(class.predicted.1, data_1.test$big_growth)$table
cm.1 = CM.1 / sum(CM.1)
accuracy.1 = (cm.1[1,1]+cm.1[2,2])/sum(cm.1)
data_1.test$pred=class.predicted.1
dd.1=filter(data_1.test, ( pred=="YES"))[1:12,]
profit.1=sum(dd.1$growth)*allocation[1]
profit_rate.1=profit.1/budget

model.2 = naiveBayes(big_growth ~PC2, data_2.train)
prob.2 = predict(model.2, data_2.test, type="raw")
class.predicted.2 = as.class(prob.2, class="NO", cutoff=0.25)
CM.2 = confusionMatrix(class.predicted.2, data_2.test$big_growth)$table
cm.2 = CM.2 / sum(CM.2)
accuracy.2 = (cm.2[1,1]+cm.2[2,2])/sum(cm.2)
data_2.test$pred=class.predicted.2
dd.2=filter(data_2.test, ( pred=="NO"))[1:12,]
profit.2=sum(dd.2$growth)*allocation[1]
profit_rate.2=profit.2/budget


model.3 = naiveBayes(big_growth ~ PC2, data_3.train)
prob.3 = predict(model.3, data_3.test, type="raw")
class.predicted.3 = as.class(prob.3, class="YES", cutoff=0.25)
CM.3 = confusionMatrix(class.predicted.3, data_3.test$big_growth)$table
cm.3 = CM.3 / sum(CM.3)
accuracy.3 = (cm.3[1,1]+cm.3[2,2])/sum(cm.3)
data_3.test$pred=class.predicted.3
dd.3=filter(data_3.test, ( pred=="YES"))[1:12,]
profit.3=sum(dd.3$growth)*allocation[1]
profit_rate.3=profit.3/budget


model.4 = naiveBayes(big_growth ~ PC2, data_4.train)
prob.4 = predict(model.4, data_4.test, type="raw")
class.predicted.4 = as.class(prob.4, class="YES", cutoff=0.25)
CM.4 = confusionMatrix(class.predicted.4, data_4.test$big_growth)$table
cm.4 = CM.4 / sum(CM.4)
accuracy.4 = (cm.4[1,1]+cm.4[2,2])/sum(cm.4)
data_4.test$pred=class.predicted.4
dd.4=filter(data_4.test, ( pred=="YES"))[1:12,]
profit.4=sum(dd.4$growth)*allocation[1]
profit_rate.4=profit.4/budget


model.5 = naiveBayes(big_growth ~ PC2, data_5.train)
prob.5 = predict(model.5, data_5.test, type="raw")
class.predicted.5 = as.class(prob.5, class="YES", cutoff=0.25)
CM.5 = confusionMatrix(class.predicted.5, data_5.test$big_growth)$table
cm.5 = CM.5 / sum(CM.5)
accuracy.5 = (cm.5[1,1]+cm.5[2,2])/sum(cm.5)
data_5.test$pred=class.predicted.5
dd.5=filter(data_5.test, ( pred=="YES"))[1:12,]
profit.5=sum(dd.5$growth)*allocation[1]
profit_rate.5=profit.5/budget

d=data.frame(fold=c(1,2,3,4,5),accuracy=c(accuracy.1,accuracy.2,accuracy.3,accuracy.4,accuracy.5), profit=c(profit.1,profit.2,profit.3,profit.4,profit.5))
accuracy.cv=mean(d$accuracy)
profit.cv=mean(c(profit.1, profit.2, profit.3, profit.4, profit.5))
profit_rate.cv=mean(c(profit_rate.1,profit_rate.2,profit_rate.3,profit_rate.4,profit_rate.5))
d_PC2_25=data.frame(method="naive bayes", variables="PC2, big_growth",cutoff="0.25", accuracy.cv=accuracy.cv,profit.cv=profit.cv)

In [9]:
model.1 = naiveBayes(big_growth ~PC2, data_1.train)
prob.1 = predict(model.1, data_1.test, type="raw")
class.predicted.1 = as.class(prob.1, class="YES", cutoff=0.33)
CM.1 = confusionMatrix(class.predicted.1, data_1.test$big_growth)$table
cm.1 = CM.1 / sum(CM.1)
accuracy.1 = (cm.1[1,1]+cm.1[2,2])/sum(cm.1)
data_1.test$pred=class.predicted.1
dd.1=filter(data_1.test, ( pred=="YES"))[1:12,]
profit.1=sum(dd.1$growth)*allocation[1]
profit_rate.1=profit.1/budget

model.2 = naiveBayes(big_growth ~PC2, data_2.train)
prob.2 = predict(model.2, data_2.test, type="raw")
class.predicted.2 = as.class(prob.2, class="NO", cutoff=0.33)
CM.2 = confusionMatrix(class.predicted.2, data_2.test$big_growth)$table
cm.2 = CM.2 / sum(CM.2)
accuracy.2 = (cm.2[1,1]+cm.2[2,2])/sum(cm.2)
data_2.test$pred=class.predicted.2
dd.2=filter(data_2.test, ( pred=="NO"))[1:12,]
profit.2=sum(dd.2$growth)*allocation[1]
profit_rate.2=profit.2/budget

model.3 = naiveBayes(big_growth ~ PC2, data_3.train)
prob.3 = predict(model.3, data_3.test, type="raw")
class.predicted.3 = as.class(prob.3, class="NO", cutoff=0.33)
CM.3 = confusionMatrix(class.predicted.3, data_3.test$big_growth)$table
cm.3 = CM.3 / sum(CM.3)
accuracy.3 = (cm.3[1,1]+cm.3[2,2])/sum(cm.3)
data_3.test$pred=class.predicted.3
dd.3=filter(data_3.test, ( pred=="NO"))[1:12,]
profit.3=sum(dd.3$growth)*allocation[1]
profit_rate.3=profit.3/budget

model.4 = naiveBayes(big_growth ~ PC2, data_4.train)
prob.4 = predict(model.4, data_4.test, type="raw")
class.predicted.4 = as.class(prob.4, class="YES", cutoff=0.33)
CM.4 = confusionMatrix(class.predicted.4, data_4.test$big_growth)$table
cm.4 = CM.4 / sum(CM.4)
accuracy.4 = (cm.4[1,1]+cm.4[2,2])/sum(cm.4)
data_4.test$pred=class.predicted.4
dd.4=filter(data_4.test, ( pred=="YES"))[1:12,]
profit.4=sum(dd.4$growth)*allocation[1]
profit_rate.4=profit.4/budget


model.5 = naiveBayes(big_growth ~ PC2, data_5.train)
prob.5 = predict(model.5, data_5.test, type="raw")
class.predicted.5 = as.class(prob.5, class="YES", cutoff=0.33)
CM.5 = confusionMatrix(class.predicted.5, data_5.test$big_growth)$table
cm.5 = CM.5 / sum(CM.5)
accuracy.5 = (cm.5[1,1]+cm.5[2,2])/sum(cm.5)
data_5.test$pred=class.predicted.5
dd.5=filter(data_5.test, ( pred=="YES"))[1:12,]
profit.5=sum(dd.5$growth)*allocation[1]
profit_rate.5=profit.5/budget

d=data.frame(fold=c(1,2,3,4,5),accuracy=c(accuracy.1,accuracy.2,accuracy.3,accuracy.4,accuracy.5), profit=c(profit.1,profit.2,profit.3,profit.4,profit.5))
accuracy.cv=mean(d$accuracy)
profit.cv=mean(c(profit.1, profit.2, profit.3, profit.4, profit.5))
profit_rate.cv=mean(c(profit_rate.1,profit_rate.2,profit_rate.3,profit_rate.4,profit_rate.5))
d_PC2_33=data.frame(method="naive bayes", variables="PC2, big_growth",cutoff="0.33", accuracy.cv=accuracy.cv,profit.cv=profit.cv)

In [10]:
model.1 = naiveBayes(big_growth ~PC2, data_1.train)
prob.1 = predict(model.1, data_1.test, type="raw")
class.predicted.1 = as.class(prob.1, class="YES", cutoff=0.5)
CM.1 = confusionMatrix(class.predicted.1, data_1.test$big_growth)$table
cm.1 = CM.1 / sum(CM.1)
accuracy.1 = (cm.1[1,1]+cm.1[2,2])/sum(cm.1)
data_1.test$pred=class.predicted.1
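# Portfolio fill: all predicted-YES companies plus the first 11 predicted-NO companies.
# This makes a 12-company portfolio only if this fold has exactly one YES prediction.
dd.1=filter(data_1.test, ( pred=="YES"))
ddd.1=filter(data_1.test, ( pred=="NO"))[1:11,]
profit.1=sum(dd.1$growth)*allocation[1]+sum(ddd.1$growth)*allocation[1]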
profit_rate.1=profit.1/budget



model.2 = naiveBayes(big_growth ~PC2, data_2.train)
prob.2 = predict(model.2, data_2.test, type="raw")
class.predicted.2 = as.class(prob.2, class="YES", cutoff=0.5)
CM.2 = confusionMatrix(class.predicted.2, data_2.test$big_growth)$table
cm.2 = CM.2 / sum(CM.2)
accuracy.2 = (cm.2[1,1]+cm.2[2,2])/sum(cm.2)
data_2.test$pred=class.predicted.2
dd.2=filter(data_2.test, ( pred=="NO"))[1:12,]
profit.2=sum(dd.2$growth)*allocation[1]
profit_rate.2=profit.2/budget


model.3 = naiveBayes(big_growth ~ PC2, data_3.train)
prob.3 = predict(model.3, data_3.test, type="raw")
class.predicted.3 = as.class(prob.3, class="YES", cutoff=0.5)
CM.3 = confusionMatrix(class.predicted.3, data_3.test$big_growth)$table
cm.3 = CM.3 / sum(CM.3)
accuracy.3 = (cm.3[1,1]+cm.3[2,2])/sum(cm.3)
data_3.test$pred=class.predicted.3
dd.3=filter(data_3.test, ( pred=="NO"))[1:12,]
profit.3=sum(dd.3$growth)*allocation[1]
profit_rate.3=profit.3/budget

model.4 = naiveBayes(big_growth ~ PC2, data_4.train)
prob.4 = predict(model.4, data_4.test, type="raw")
class.predicted.4 = as.class(prob.4, class="YES", cutoff=0.5)
CM.4 = confusionMatrix(class.predicted.4, data_4.test$big_growth)$table
cm.4 = CM.4 / sum(CM.4)
accuracy.4 = (cm.4[1,1]+cm.4[2,2])/sum(cm.4)
data_4.test$pred=class.predicted.4
dd.4=filter(data_4.test, ( pred=="NO"))[1:12,]
profit.4=sum(dd.4$growth)*allocation[1]
profit_rate.4=profit.4/budget

model.5 = naiveBayes(big_growth ~ PC2, data_5.train)
prob.5 = predict(model.5, data_5.test, type="raw")
class.predicted.5 = as.class(prob.5, class="YES", cutoff=0.5)
CM.5 = confusionMatrix(class.predicted.5, data_5.test$big_growth)$table
cm.5 = CM.5 / sum(CM.5)
accuracy.5 = (cm.5[1,1]+cm.5[2,2])/sum(cm.5)
data_5.test$pred=class.predicted.5
# Portfolio fill: all predicted-YES companies plus the first 10 predicted-NO companies.
# This makes a 12-company portfolio only if this fold has exactly two YES predictions.
dd.5=filter(data_5.test, (pred=="YES"))
ddd.5=filter(data_5.test, ( pred=="NO"))[1:10,]
profit.5=sum(dd.5$growth)*allocation[1]+sum(ddd.5$growth)*allocation[1]
profit_rate.5=profit.5/budget

d=data.frame(fold=c(1,2,3,4,5),accuracy=c(accuracy.1,accuracy.2,accuracy.3,accuracy.4,accuracy.5), profit=c(profit.1,profit.2,profit.3,profit.4,profit.5))
accuracy.cv=mean(d$accuracy)
profit.cv=mean(c(profit.1, profit.2, profit.3, profit.4, profit.5))
profit_rate.cv=mean(c(profit_rate.1,profit_rate.2,profit_rate.3,profit_rate.4,profit_rate.5))
d_PC2_5=data.frame(method="naive bayes", variables="PC2, big_growth",cutoff="0.50", accuracy.cv=accuracy.cv,profit.cv=profit.cv)

In [11]:
model.1 = naiveBayes(big_growth ~PC1, data_1.train)
prob.1 = predict(model.1, data_1.test, type="raw")
class.predicted.1 = as.class(prob.1, class="YES", cutoff=0.25)
CM.1 = confusionMatrix(class.predicted.1, data_1.test$big_growth)$table
cm.1 = CM.1 / sum(CM.1)
accuracy.1 = (cm.1[1,1]+cm.1[2,2])/sum(cm.1)
data_1.test$pred=class.predicted.1
dd.1=filter(data_1.test, ( pred=="NO"))[1:12,]
profit.1=sum(dd.1$growth)*allocation[1]
profit_rate.1=profit.1/budget


model.2 = naiveBayes(big_growth ~PC1, data_2.train)
prob.2 = predict(model.2, data_2.test, type="raw")
class.predicted.2 = as.class(prob.2, class="YES", cutoff=0.25)
CM.2 = confusionMatrix(class.predicted.2, data_2.test$big_growth)$table
cm.2 = CM.2 / sum(CM.2)
accuracy.2 = (cm.2[1,1]+cm.2[2,2])/sum(cm.2)
data_2.test$pred=class.predicted.2
dd.2=filter(data_2.test, ( pred=="NO"))[1:12,]
profit.2=sum(dd.2$growth)*allocation[1]
profit_rate.2=profit.2/budget


model.3 = naiveBayes(big_growth ~ PC1, data_3.train)
prob.3 = predict(model.3, data_3.test, type="raw")
class.predicted.3 = as.class(prob.3, class="YES", cutoff=0.25)
CM.3 = confusionMatrix(class.predicted.3, data_3.test$big_growth)$table
cm.3 = CM.3 / sum(CM.3)
accuracy.3 = (cm.3[1,1]+cm.3[2,2])/sum(cm.3)
data_3.test$pred=class.predicted.3
ddd.3=filter(data_3.test, (pred=="YES"))[1:12,]
profit.3=sum(ddd.3$growth)*allocation[1]
profit_rate.3=profit.3/budget


model.4 = naiveBayes(big_growth ~ PC1, data_4.train)
prob.4 = predict(model.4, data_4.test, type="raw")
class.predicted.4 = as.class(prob.4, class="YES", cutoff=0.25)
CM.4 = confusionMatrix(class.predicted.4, data_4.test$big_growth)$table
cm.4 = CM.4 / sum(CM.4)
accuracy.4 = (cm.4[1,1]+cm.4[2,2])/sum(cm.4)
data_4.test$pred=class.predicted.4
dd.4=filter(data_4.test, ( pred=="NO"))[1:11,]
ddd.4=filter(data_4.test, ( pred=="YES"))
profit.4=sum(dd.4$growth)*allocation[1]+sum(ddd.4$growth)*allocation[1]
profit_rate.4=profit.4/budget

model.5 = naiveBayes(big_growth ~ PC1, data_5.train)
prob.5 = predict(model.5, data_5.test, type="raw")
class.predicted.5 = as.class(prob.5, class="YES", cutoff=0.25)
CM.5 = confusionMatrix(class.predicted.5, data_5.test$big_growth)$table
cm.5 = CM.5 / sum(CM.5)
accuracy.5 = (cm.5[1,1]+cm.5[2,2])/sum(cm.5)
data_5.test$pred=class.predicted.5
dd.5=filter(data_5.test, ( pred=="NO"))[1:11,]
ddd.5=filter(data_5.test, ( pred=="YES"))
profit.5=sum(dd.5$growth)*allocation[1]+sum(ddd.5$growth)*allocation[1]
profit_rate.5=profit.5/budget

d=data.frame(fold=c(1,2,3,4,5),accuracy=c(accuracy.1,accuracy.2,accuracy.3,accuracy.4,accuracy.5), profit=c(profit.1,profit.2,profit.3,profit.4,profit.5))
accuracy.cv=mean(d$accuracy)
profit.cv=mean(c(profit.1, profit.2, profit.3, profit.4, profit.5))
profit_rate.cv=mean(c(profit_rate.1,profit_rate.2,profit_rate.3,profit_rate.4,profit_rate.5))
d_PC1_25=data.frame(method="naive bayes", variables="PC1, big_growth",cutoff="0.25", accuracy.cv=accuracy.cv,profit.cv=profit.cv)

In [12]:
model.1 = naiveBayes(big_growth ~PC1, data_1.train)
prob.1 = predict(model.1, data_1.test, type="raw")
class.predicted.1 = as.class(prob.1, class="YES", cutoff=0.33)
CM.1 = confusionMatrix(class.predicted.1, data_1.test$big_growth)$table
cm.1 = CM.1 / sum(CM.1)
accuracy.1 = (cm.1[1,1]+cm.1[2,2])/sum(cm.1)
data_1.test$pred=class.predicted.1
dd.1=filter(data_1.test, ( pred=="NO"))[1:12,]
profit.1=sum(dd.1$growth)*allocation[1]
profit_rate.1=profit.1/budget

model.2 = naiveBayes(big_growth ~PC1, data_2.train)
prob.2 = predict(model.2, data_2.test, type="raw")
class.predicted.2 = as.class(prob.2, class="YES", cutoff=0.33)
CM.2 = confusionMatrix(class.predicted.2, data_2.test$big_growth)$table
cm.2 = CM.2 / sum(CM.2)
accuracy.2 = (cm.2[1,1]+cm.2[2,2])/sum(cm.2)
data_2.test$pred=class.predicted.2
dd.2=filter(data_2.test, ( pred=="NO"))[1:12,]
profit.2=sum(dd.2$growth)*allocation[1]
profit_rate.2=profit.2/budget

model.3 = naiveBayes(big_growth ~ PC1, data_3.train)
prob.3 = predict(model.3, data_3.test, type="raw")
class.predicted.3 = as.class(prob.3, class="YES", cutoff=0.33)
CM.3 = confusionMatrix(class.predicted.3, data_3.test$big_growth)$table
cm.3 = CM.3 / sum(CM.3)
accuracy.3 = (cm.3[1,1]+cm.3[2,2])/sum(cm.3)
data_3.test$pred=class.predicted.3
dd.3=filter(data_3.test, (pred=="NO"))[1:10,]
ddd.3=filter(data_3.test, (pred=="YES"))
profit.3=sum(ddd.3$growth)*allocation[1]+sum(dd.3$growth)*allocation[1]
profit_rate.3=profit.3/budget

model.4 = naiveBayes(big_growth ~ PC1, data_4.train)
prob.4 = predict(model.4, data_4.test, type="raw")
class.predicted.4 = as.class(prob.4, class="YES", cutoff=0.33)
CM.4 = confusionMatrix(class.predicted.4, data_4.test$big_growth)$table
cm.4 = CM.4 / sum(CM.4)
accuracy.4 = (cm.4[1,1]+cm.4[2,2])/sum(cm.4)
data_4.test$pred=class.predicted.4
dd.4=filter(data_4.test, ( pred=="NO"))[1:11,]
ddd.4=filter(data_4.test, ( pred=="YES"))
profit.4=sum(dd.4$growth)*allocation[1]+sum(ddd.4$growth)*allocation[1]
profit_rate.4=profit.4/budget


model.5 = naiveBayes(big_growth ~ PC1, data_5.train)
prob.5 = predict(model.5, data_5.test, type="raw")
class.predicted.5 = as.class(prob.5, class="YES", cutoff=0.33)
CM.5 = confusionMatrix(class.predicted.5, data_5.test$big_growth)$table
cm.5 = CM.5 / sum(CM.5)
accuracy.5 = (cm.5[1,1]+cm.5[2,2])/sum(cm.5)
data_5.test$pred=class.predicted.5
dd.5=filter(data_5.test, ( pred=="NO"))[1:11,]
ddd.5=filter(data_5.test, ( pred=="YES"))
profit.5=sum(dd.5$growth)*allocation[1]+sum(ddd.5$growth)*allocation[1]
profit_rate.5=profit.5/budget


d=data.frame(fold=c(1,2,3,4,5),accuracy=c(accuracy.1,accuracy.2,accuracy.3,accuracy.4,accuracy.5), profit=c(profit.1,profit.2,profit.3,profit.4,profit.5))
accuracy.cv=mean(d$accuracy)
profit.cv=mean(c(profit.1, profit.2, profit.3, profit.4, profit.5))
profit_rate.cv=mean(c(profit_rate.1,profit_rate.2,profit_rate.3,profit_rate.4,profit_rate.5))
d_PC1_33=data.frame(method="naive bayes", variables="PC1, big_growth",cutoff="0.33", accuracy.cv=accuracy.cv,profit.cv=profit.cv)

In [13]:
model.1 = naiveBayes(big_growth ~PC1, data_1.train)
prob.1 = predict(model.1, data_1.test, type="raw")
class.predicted.1 = as.class(prob.1, class="YES", cutoff=0.5)
CM.1 = confusionMatrix(class.predicted.1, data_1.test$big_growth)$table
cm.1 = CM.1 / sum(CM.1)
accuracy.1 = (cm.1[1,1]+cm.1[2,2])/sum(cm.1)
data_1.test$pred=class.predicted.1
dd.1=filter(data_1.test, ( pred=="NO"))[1:12,]
profit.1=sum(dd.1$growth)*allocation[1]
profit_rate.1=profit.1/budget


model.2 = naiveBayes(big_growth ~PC1, data_2.train)
prob.2 = predict(model.2, data_2.test, type="raw")
class.predicted.2 = as.class(prob.2, class="YES", cutoff=0.5)
CM.2 = confusionMatrix(class.predicted.2, data_2.test$big_growth)$table
cm.2 = CM.2 / sum(CM.2)
accuracy.2 = (cm.2[1,1]+cm.2[2,2])/sum(cm.2)
data_2.test$pred=class.predicted.2
dd.2=filter(data_2.test, ( pred=="NO"))[1:12,]
profit.2=sum(dd.2$growth)*allocation[1]
profit_rate.2=profit.2/budget

model.3 = naiveBayes(big_growth ~ PC1, data_3.train)
prob.3 = predict(model.3, data_3.test, type="raw")
class.predicted.3 = as.class(prob.3, class="YES", cutoff=0.5)
CM.3 = confusionMatrix(class.predicted.3, data_3.test$big_growth)$table
cm.3 = CM.3 / sum(CM.3)
accuracy.3 = (cm.3[1,1]+cm.3[2,2])/sum(cm.3)
data_3.test$pred=class.predicted.3
dd.3=filter(data_3.test, (pred=="NO"))[1:10,]
ddd.3=filter(data_3.test, (pred=="YES"))
profit.3=sum(ddd.3$growth)*allocation[1]+sum(dd.3$growth)*allocation[1]
profit_rate.3=profit.3/budget

model.4 = naiveBayes(big_growth ~ PC1, data_4.train)
prob.4 = predict(model.4, data_4.test, type="raw")
class.predicted.4 = as.class(prob.4, class="YES", cutoff=0.5)
CM.4 = confusionMatrix(class.predicted.4, data_4.test$big_growth)$table
cm.4 = CM.4 / sum(CM.4)
accuracy.4 = (cm.4[1,1]+cm.4[2,2])/sum(cm.4)
data_4.test$pred=class.predicted.4
dd.4=filter(data_4.test, ( pred=="NO"))[1:11,]
ddd.4=filter(data_4.test, ( pred=="YES"))
profit.4=sum(dd.4$growth)*allocation[1]+sum(ddd.4$growth)*allocation[1]
profit_rate.4=profit.4/budget


model.5 = naiveBayes(big_growth ~ PC1, data_5.train)
prob.5 = predict(model.5, data_5.test, type="raw")
class.predicted.5 = as.class(prob.5, class="YES", cutoff=0.5)
CM.5 = confusionMatrix(class.predicted.5, data_5.test$big_growth)$table
cm.5 = CM.5 / sum(CM.5)
accuracy.5 = (cm.5[1,1]+cm.5[2,2])/sum(cm.5)
data_5.test$pred=class.predicted.5
dd.5=filter(data_5.test, ( pred=="NO"))[1:11,]
ddd.5=filter(data_5.test, ( pred=="YES"))
profit.5=sum(dd.5$growth)*allocation[1]+sum(ddd.5$growth)*allocation[1]
profit_rate.5=profit.5/budget


d=data.frame(fold=c(1,2,3,4,5),accuracy=c(accuracy.1,accuracy.2,accuracy.3,accuracy.4,accuracy.5), profit=c(profit.1,profit.2,profit.3,profit.4,profit.5))
accuracy.cv=mean(d$accuracy)
profit.cv=mean(c(profit.1, profit.2, profit.3, profit.4, profit.5))
profit_rate.cv=mean(c(profit_rate.1,profit_rate.2,profit_rate.3,profit_rate.4,profit_rate.5))
d_PC1_5=data.frame(method="naive bayes", variables="PC1, big_growth",cutoff="0.50", accuracy.cv=accuracy.cv,profit.cv=profit.cv)


In [14]:
fmt(d_PC2_5,"best model")
fmt(rbind(d_PC1_25,d_PC1_33,d_PC1_5, d_PC2_25,d_PC2_33, d_PC2_5,d_PC1_PC2_25,d_PC1_PC2_33, d_PC1_PC2_5),"search for best model")

best model

| method | variables | cutoff | accuracy.cv | profit.cv |
|---|---|---|---|---|
| naive bayes | PC2, big_growth | 0.5 | 0.9156794 | -74823.24 |

search for best model

| method | variables | cutoff | accuracy.cv | profit.cv |
|---|---|---|---|---|
| naive bayes | PC1, big_growth | 0.25 | 0.7869919 | -96079.26 |
| naive bayes | PC1, big_growth | 0.33 | 0.9154472 | -84246.11 |
| naive bayes | PC1, big_growth | 0.5 | 0.9154472 | -84246.11 |
| naive bayes | PC2, big_growth | 0.25 | 0.3530778 | -75691.51 |
| naive bayes | PC2, big_growth | 0.33 | 0.7087108 | -100165.74 |
| naive bayes | PC2, big_growth | 0.5 | 0.9156794 | -74823.24 |
| naive bayes | PC1, PC2, big_growth | 0.25 | 0.2157956 | -83721.86 |
| naive bayes | PC1, PC2, big_growth | 0.33 | 0.2573751 | -104010.0 |
| naive bayes | PC1, PC2, big_growth | 0.5 | 0.4176539 | -79601.1 |
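The best model is the candidate with the highest cross-validated profit, here the least negative one. A minimal sketch of that selection follows, re-entering the values reported above by hand; the data frame name `results` and its shortened `variables` labels are illustrative.

```r
# Cross-validated results copied from the search table above.
results = data.frame(
    variables   = rep(c("PC1", "PC2", "PC1, PC2"), each=3),
    cutoff      = rep(c(0.25, 0.33, 0.50), times=3),
    accuracy.cv = c(0.7869919, 0.9154472, 0.9154472,
                    0.3530778, 0.7087108, 0.9156794,
                    0.2157956, 0.2573751, 0.4176539),
    profit.cv   = c(-96079.26, -84246.11, -84246.11,
                    -75691.51, -100165.74, -74823.24,
                    -83721.86, -104010.00, -79601.10))

# Select by business performance (profit), not by accuracy.
best = results[which.max(results$profit.cv),]
best
```

Selecting on `profit.cv` picks the PC2 model at cutoff 0.50, which matches the model presented as best.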


## Build & Tune Regression Model

In [15]:
# Partitioning the data into 5 folds (use set.seed(0) and createFolds(...) based on growth).

# Constructing several linear regression models to predict growth.
# Iterating through unique combinations of predictor variables, selected from PC1 and PC2.

# Estimating each model's RMSE and profit, using 5-fold cross validation.

# Presenting the best model: selected variables, RMSE, and profit.
# Presenting all the models: selected variables, RMSE, and profit.
set.seed(0)
fold = createFolds(data$growth, k=5)
data.train = list()
data.test  = list()
for (i in 1:5) { data.train[[i]] = data[setdiff(1:nrow(data), fold[[i]]),]
                     data.test[[i]]  = data[fold[[i]],] }
cm = list()
RMSE = list()
profit=list()
for (i in 1:5) {
                model = lm(growth~PC2, data.train[[i]])
                growth.predicted = predict(model, data.test[[i]])
                error = growth.predicted-data.test[[i]]$growth
                RMSE[[i]] = sqrt(mean(error^2))
                # Portfolio: the companies with the highest predicted growth.
                top = order(-growth.predicted)[1:portfolio_size]
                profit[[i]]=sum(data.test[[i]]$growth[top])*allocation[1]
                }

d=data.frame(fold=c(1,2,3,4,5),rmse=c(RMSE[[1]],RMSE[[2]],RMSE[[3]],RMSE[[4]],RMSE[[5]]), profit=c(profit[[1]],profit[[2]],profit[[3]],profit[[4]],profit[[5]]))
rmse.cv=mean(d$rmse)
profit.cv=mean(c(profit[[1]],profit[[2]],profit[[3]],profit[[4]],profit[[5]]))
p11=data.frame(method="linear regression", variables="PC2, growth",rmse.cv=rmse.cv,profit.cv=profit.cv)

cm = list()
RMSE = list()
profit=list()
for (i in 1:5) {
                model = lm(growth~PC1, data.train[[i]])
                growth.predicted = predict(model, data.test[[i]])
                error = growth.predicted-data.test[[i]]$growth
                RMSE[[i]] = sqrt(mean(error^2))
                # Portfolio: the companies with the highest predicted growth.
                top = order(-growth.predicted)[1:portfolio_size]
                profit[[i]]=sum(data.test[[i]]$growth[top])*allocation[1]
                }

d=data.frame(fold=c(1,2,3,4,5),rmse=c(RMSE[[1]],RMSE[[2]],RMSE[[3]],RMSE[[4]],RMSE[[5]]), profit=c(profit[[1]],profit[[2]],profit[[3]],profit[[4]],profit[[5]]))
rmse.cv=mean(d$rmse)
profit.cv=mean(c(profit[[1]],profit[[2]],profit[[3]],profit[[4]],profit[[5]]))
p22=data.frame(method="linear regression", variables="PC1, growth",rmse.cv=rmse.cv,profit.cv=profit.cv)

cm = list()
RMSE = list()
profit=list()
for (i in 1:5) {
                model = lm(growth~PC1+PC2, data.train[[i]])
                growth.predicted = predict(model, data.test[[i]])
                error = growth.predicted-data.test[[i]]$growth
                RMSE[[i]] = sqrt(mean(error^2))
                # Portfolio: the companies with the highest predicted growth.
                top = order(-growth.predicted)[1:portfolio_size]
                profit[[i]]=sum(data.test[[i]]$growth[top])*allocation[1]
                }

d=data.frame(fold=c(1,2,3,4,5),rmse=c(RMSE[[1]],RMSE[[2]],RMSE[[3]],RMSE[[4]],RMSE[[5]]), profit=c(profit[[1]],profit[[2]],profit[[3]],profit[[4]],profit[[5]]))
rmse.cv=mean(d$rmse)
profit.cv=mean(c(profit[[1]],profit[[2]],profit[[3]],profit[[4]],profit[[5]]))
p33=data.frame(method="linear regression", variables="PC1, PC2, growth",rmse.cv=rmse.cv,profit.cv=profit.cv)

fmt(p33,"best model")
fmt(rbind(p22,p11,p33),"search for best model")

best model

| method | variables | rmse.cv | profit.cv |
|---|---|---|---|
| linear regression | PC1, PC2, growth | 0.4660607 | -71385.76 |

search for best model

| method | variables | rmse.cv | profit.cv |
|---|---|---|---|
| linear regression | PC1, growth | 0.4659713 | -260853.48 |
| linear regression | PC2, growth | 0.466048 | -77011.29 |
| linear regression | PC1, PC2, growth | 0.4660607 | -71385.76 |
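The three regression evaluations share one pattern, which could be factored into a helper. The sketch below is an assumption-laden illustration, not the notebook's code: `cv_lm` is a hypothetical function, the data frame `toy` is synthetic, and the folds are built with base R's `sample` and `split` rather than `createFolds`.

```r
# Illustrative helper: k-fold cross-validation of a linear model predicting growth.
# Returns the mean RMSE and the mean profit of the top-`size` predicted companies.
cv_lm = function(formula, data, folds, size, allocation) {
    rmse = numeric(length(folds))
    profit = numeric(length(folds))
    for (i in seq_along(folds)) {
        train = data[-folds[[i]],]
        test  = data[folds[[i]],]
        predicted = predict(lm(formula, train), test)
        rmse[i] = sqrt(mean((predicted - test$growth)^2))
        top = order(-predicted)[1:size]           # highest predicted growth first
        profit[i] = sum(test$growth[top]) * allocation
    }
    data.frame(rmse.cv=mean(rmse), profit.cv=mean(profit))
}

# Synthetic stand-in data (not the project dataset).
set.seed(0)
toy = data.frame(PC1=rnorm(100), PC2=rnorm(100))
toy$growth = 0.1*toy$PC2 + rnorm(100, sd=0.3)
folds = split(sample(nrow(toy)), rep(1:5, length.out=nrow(toy)))

result = rbind(cv_lm(growth ~ PC1, toy, folds, 12, 1000000/12),
               cv_lm(growth ~ PC2, toy, folds, 12, 1000000/12),
               cv_lm(growth ~ PC1 + PC2, toy, folds, 12, 1000000/12))
result
```

Reporting both columns matters because the two criteria can disagree: in the results above, the PC1 model has the lowest RMSE but by far the worst profit.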


<font size=1;>
<p style="text-align: left;">
Copyright (c) Berkeley Data Analytics Group, LLC
<span style="float: right;">
Document revised June 14, 2020
</span>
</p>
</font>