# Classification

![](banner_project.jpg)

In [1]:
analyst = "Lilit Petrosyan"

In [2]:
f = "setup.R"; for (i in 1:10) { if (file.exists(f)) break else f = paste0("../", f) }; source(f)
options(repr.matrix.max.rows=674)
options(repr.matrix.max.cols=200)
update_geom_defaults("point", list(size=1))                                

### Objective

I will construct and evaluate a classifier trained on a transformed dataset about public company fundamentals.  Later, I will use the classifier along with additional analysis to recommend a portfolio of 12 company investments that maximizes 12-month return of an overall \$1,000,000 investment.

### Approach

Retrieve the transformed dataset.

Construct a model to predict whether a company's stock price will grow more than 30% over 12 months, given 12 months of past company fundamentals data, by applying a machine learning model construction method to the transformed data.

Evaluate the model's business performance based on a business model and business parameters.

## Business Model

The business model is ...

$ \begin{align} profit = \left( \sum_{i \in portfolio} (1 + growth_i) \times allocation_i \right) - budget \end{align} $

<br>

$ profit\,rate = profit \div budget $


$ \begin{align} budget = \sum_{i \in portfolio} allocation_i \end{align} $

<br>

Business model parameters include ...

* Budget = \$1,000,000: total investment to allocate across the companies in the portfolio
* Portfolio Size = 12: number of companies in the portfolio
* Allocations = \$1,000,000 $\div$ 12 to each company: investments to allocate to specific companies in the portfolio

task: Fill the portfolio with companies with the lowest gvkey values from among those you predict to grow above 30%.  If you predict fewer than the portfolio size to grow above 30%, then fill the rest of the portfolio with the remaining companies with lowest gvkey values.
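The selection rule above can be sketched in a few lines of R. This is a minimal illustration, not the notebook's own implementation: `pick_portfolio`, the `toy` data frame, and all of its values are hypothetical, and the sketch assumes a `pred` column holding "YES"/"NO" predictions.

```r
# Hypothetical helper (not from setup.R): choose `size` companies by the rule above.
# `df` needs the columns gvkey and pred, where pred is "YES" or "NO".
pick_portfolio = function(df, size) {
  df = df[order(df$gvkey), ]                    # lowest gvkey first
  yes = df[df$pred == "YES", ]                  # predicted to grow above 30%
  rest = df[df$pred != "YES", ]                 # everything else, used as filler
  picked = rbind(yes, rest)                     # predicted growers ahead of fillers
  picked[seq_len(min(size, nrow(picked))), ]
}

# Made-up example: only two predicted growers for a portfolio of three.
toy = data.frame(gvkey  = c(1004, 1045, 1050, 1062, 1072),
                 pred   = c("NO", "NO", "YES", "NO", "YES"),
                 growth = c(0.05, -0.38, 0.32, -0.22, -0.12))
portfolio = pick_portfolio(toy, size = 3)
portfolio$gvkey

# Profit under the business model with equal allocations (toy budget of 300).
toy_budget = 300
toy_allocation = toy_budget / nrow(portfolio)
toy_profit = sum((1 + portfolio$growth) * toy_allocation) - toy_budget
toy_profit
```

Because the predicted "YES" rows are placed ahead of the rest before truncating, the same call covers both cases in the rule: enough predicted growers, or too few.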

In [3]:
# Set the business parameters.

budget = 1000000
portfolio_size = 12
allocation = rep(budget/portfolio_size, portfolio_size)

layout(fmt(budget), fmt(portfolio_size), fmt(allocation))

budget
1000000

portfolio_size
12

allocation
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33


## Data

In this section, we retrieve the data that we previously stored in our folder from the Project B notebook. The data is sorted by gvkey (the global company key) from lowest to highest and includes the ticker symbol, the company name, the first two principal components (PC1 and PC2), the stock price, the growth rate, and a character variable that indicates whether the company had over 30% growth over 12 months.

In [12]:
# Retrieve "My Data.csv"
data = read.csv("My Data.csv", header=TRUE)

# Present a few rows ...
data[1:6,]

gvkey,tic,conm,PC1,PC2,prccq,growth,big_growth
1004,AIR,AAR CORP,3.4371231,-0.2260719,43.69,0.0507455507,NO
1045,AAL,AMERICAN AIRLINES GROUP INC,-12.0332067,0.8045109,32.11,-0.3828560446,NO
1050,CECE,CECO ENVIRONMENTAL CORP,3.9532234,-0.7553386,6.75,0.3157894737,YES
1062,ASA,ASA GOLD AND PRECIOUS METALS,3.6561434,-0.7981915,8.66,-0.2164739518,NO
1072,AVX,AVX CORP,2.9282228,-0.71042,15.25,-0.1184971098,NO
1075,PNW,PINNACLE WEST CAPITAL CORP,0.3488491,1.1389605,85.2,0.0002347969,NO


In [13]:
size(data)

observations,variables
4305,8


## Classification Model

In this section, we build naive Bayes models and estimate their performance in three ways, to see which estimate yields the highest profit.

The first estimate is in-sample: a naive Bayes model is built on all of the data and evaluated on the same data. Its confusion matrix gives the proportion of observations in each combination of predicted big_growth and actual big_growth. Because the portfolio is filled from the observations with a predicted big_growth of "YES", those are the predictions we are most interested in. From the predictions, we calculate the model's accuracy and profit.

The second estimate uses an out-of-sample method, which builds a model on part of the original data and tests it on the remainder. We build the model on 75% of the data and use the remaining 25% to evaluate its performance. Using the same calculations as in the first estimate, we obtain its accuracy and profit.

The last estimate uses the fold method (cross-validation), which divides the data into sections, or folds; in our case, 5 of them. Each fold in turn is used to test a model built on the other 4 folds, which is similar to repeating the second method 5 times. As before, we determine the accuracy and profit for each fold.

To calculate profit, we take the growth values of the first 12 "YES" predictions, multiply each by one allocation to get the growth in dollars for each of the 12 stocks, and add the results to obtain the overall profit.
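
As a worked step, here is why summing growth values and multiplying by one allocation agrees with the business model. The derivation assumes equal allocations, $allocation_i = budget \div n$, and a portfolio of exactly $n$ companies ($n = 12$ here); it does not hold if fewer than $n$ rows are selected.

$ \begin{align} profit = \left( \sum_{i \in portfolio} (1 + growth_i) \times \frac{budget}{n} \right) - budget = \frac{budget}{n} \sum_{i \in portfolio} growth_i \end{align} $

The $1$ inside the sum contributes $n \times budget \div n = budget$, which cancels the subtracted budget and leaves only the growth terms.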

### Build Model

In [14]:
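# Constructing a naive Bayes model to predict big_growth given PC1 and PC2 (use laplace=TRUE).
# Laplace smoothing applies only to categorical predictors, so it does not change the numeric PC1/PC2 estimates.
model = naiveBayes(big_growth ~ PC1+PC2, data, laplace=TRUE)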
model


Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
        NO        YES 
0.91637631 0.08362369 

Conditional probabilities:
     PC1
Y           [,1]      [,2]
  NO  -0.2239142 13.299922
  YES  2.4537263  4.550796

     PC2
Y           [,1]     [,2]
  NO   0.0424303 7.676443
  YES -0.4649654 1.453473
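
For numeric predictors, the two columns in each table printed above are the class-conditional mean (`[,1]`) and standard deviation (`[,2]`). As a rough sketch of how a Gaussian naive Bayes prediction follows from them, the block below recomputes the posterior by hand with base R for two rows of the data. The helper name `posterior` is hypothetical, the parameters are copied from the printout, and small differences from `predict(model, ..., type="raw")` are possible because of rounding.

```r
# Parameters copied from the printed model above.
prior = c(NO = 0.91637631, YES = 0.08362369)
pc1 = rbind(NO  = c(mean = -0.2239142, sd = 13.299922),
            YES = c(mean =  2.4537263, sd =  4.550796))
pc2 = rbind(NO  = c(mean =  0.0424303, sd = 7.676443),
            YES = c(mean = -0.4649654, sd = 1.453473))

# Hypothetical helper: P(class | PC1, PC2), treating PC1 and PC2 as
# independent normal variables within each class.
posterior = function(x1, x2) {
  joint = prior *
    dnorm(x1, mean = pc1[, "mean"], sd = pc1[, "sd"]) *
    dnorm(x2, mean = pc2[, "mean"], sd = pc2[, "sd"])
  joint / sum(joint)                        # normalize so the two classes sum to 1
}

p_air = posterior(3.4371231, -0.2260719)    # AAR CORP (gvkey 1004)
p_aal = posterior(-12.0332067, 0.8045109)   # AMERICAN AIRLINES GROUP INC (gvkey 1045)
round(rbind(AIR = p_air, AAL = p_aal), 3)
```

The "YES" class has a much smaller standard deviation on both components, so its density is high near its mean. That can outweigh the low "YES" prior for companies whose PC1 and PC2 values fall in that region.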


### In-Sample Estimated Performance

In [15]:
# Presenting the model's in-sample estimated accuracy, profit, and profit rate at cutoff=0.5.
prob = predict(model, data, type="raw")
class.predicted = as.class(prob, class="NO", cutoff=0.5)
data.pred = data
data.pred$pred = class.predicted
CM = confusionMatrix(class.predicted, data$big_growth)$table
cm = CM / sum(CM)                      # proportion in each predicted/actual combination
accuracy = (cm[1,1]+cm[2,2])/sum(cm)   # diagonal cells are the correct predictions

# Portfolio: the first 12 predicted "YES" rows (the data is sorted by gvkey).
dd = filter(data.pred, pred=="YES")[1:12,]
profit = sum(dd$growth)*allocation[1]
profit_rate = round(profit/budget, digits=7)
data.frame(accuracy, profit, profit_rate)
dd

accuracy,profit,profit_rate
0.3082462,-80393.21,-0.0803932


gvkey,tic,conm,PC1,PC2,prccq,growth,big_growth,pred
1004,AIR,AAR CORP,3.437123,-0.2260719,43.69,0.050745551,NO,YES
1050,CECE,CECO ENVIRONMENTAL CORP,3.953223,-0.7553386,6.75,0.315789474,YES,YES
1062,ASA,ASA GOLD AND PRECIOUS METALS,3.656143,-0.7981915,8.66,-0.216473952,NO,YES
1072,AVX,AVX CORP,2.928223,-0.71042,15.25,-0.11849711,NO,YES
1076,AAN,AARON'S INC,2.253755,0.1343366,42.05,0.055207026,NO,YES
1094,ACETQ,ACETO CORP,3.769009,-0.4208701,0.84,-0.918683446,NO,YES
1097,ACMTA,ACMAT CORP -CL A,4.091633,-0.3227982,29.0,0.444942701,YES,YES
1104,ACU,ACME UNITED CORP,4.046675,-0.1964383,14.25,-0.391025641,NO,YES
1117,BKTI,BK TECHNOLOGIES CORP,4.064954,-0.8300392,3.75,0.056338028,NO,YES
1121,AE,ADAMS RESOURCES & ENERGY INC,3.97556,-0.5404158,38.71,-0.110114943,NO,YES
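
Since profit depends only on the rows predicted "YES", it can help to look at how often those predictions are right, alongside overall accuracy. The sketch below uses base R `table` on made-up labels; `actual`, `predicted`, and the derived names are illustrative and are not taken from the notebook's data.

```r
# Made-up actual and predicted labels, for illustration only.
lv = c("NO", "YES")
actual    = factor(c("NO", "NO", "NO", "NO", "NO", "NO", "YES", "YES", "NO", "NO"), levels = lv)
predicted = factor(c("YES", "NO", "YES", "YES", "NO", "YES", "YES", "NO", "YES", "NO"), levels = lv)

cm_toy = table(predicted, actual)                 # rows: predicted, columns: actual
accuracy_toy = sum(diag(cm_toy)) / sum(cm_toy)    # share of all predictions that are correct
# Share of predicted "YES" rows that are actually "YES".
precision_yes = cm_toy["YES", "YES"] / sum(cm_toy["YES", ])
c(accuracy = accuracy_toy, precision_yes = precision_yes)
```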


### Out-of-Sample Estimated Performance

In [16]:
# Partitioning the data into training (75%) and validation (25%)
set.seed(0)
train.rows = sample(1:nrow(data), 0.75*nrow(data))   # random 75% of the row indices
dev.rows = setdiff(1:nrow(data), train.rows)         # the remaining 25%
data.train = data[train.rows,]
data.dev  = data[dev.rows,]

layout(fmt(size(data.train)), fmt(size(data.dev)))

size(data.train)
observations,variables
3228,8

size(data.dev)
observations,variables
1077,8


In [17]:
# The model's out-of-sample estimated accuracy, profit, and profit rate at cutoff=0.5.
model.1 = naiveBayes(big_growth ~ PC1+PC2, data.train)
prob = predict(model.1, data.dev, type="raw")
class.predicted = as.class(prob, class="YES", cutoff=0.5)
data.dev$pred=class.predicted
CM.1 = confusionMatrix(class.predicted, data.dev$big_growth)$table
cm.1 = CM.1 / sum(CM.1)
accuracy.1 = (cm.1[1,1]+cm.1[2,2])/sum(cm.1)
dd.a=filter(data.dev, ( pred=="YES"))[1:12,]
profit.a=sum(dd.a$growth)*allocation[1]
profit.dev=profit.a
profit_rate.dev=round(profit.dev/budget, digits=7)
fmt(data.frame(accuracy=accuracy.1, profit=profit.dev, profit_rate=profit_rate.dev),"Out-of-Sample Estimated Performance")

accuracy,profit,profit_rate
0.2989786,-120201.9,-0.1202019


### 5-Fold Cross-Validation Estimated Performance

In [18]:
# Partitioning the data into 5 folds (use set.seed(0) and createFolds(...)).
set.seed(0)
fold = createFolds(data$big_growth, k=5)
str(fold)

List of 5
 $ Fold1: int [1:861] 9 13 17 19 31 42 44 54 60 66 ...
 $ Fold2: int [1:861] 1 2 6 11 16 25 32 49 55 59 ...
 $ Fold3: int [1:861] 4 8 14 22 28 34 40 45 50 52 ...
 $ Fold4: int [1:861] 3 5 15 18 21 24 26 27 30 36 ...
 $ Fold5: int [1:861] 7 10 12 20 23 29 33 35 37 46 ...


In [19]:
# Fold 1: build the model on the other four folds and test it on Fold1.
data_1.train = data[setdiff(1:nrow(data), fold$Fold1),]
data_1.test  = data[fold$Fold1,]
model.1 = naiveBayes(big_growth ~PC1+PC2, data_1.train)
prob.1 = predict(model.1, data_1.test, type="raw")
class.predicted.1 = as.class(prob.1, class="YES", cutoff=0.5)
CM.1 = confusionMatrix(class.predicted.1, data_1.test$big_growth)$table
cm.1 = CM.1 / sum(CM.1)
accuracy.1 = (cm.1[1,1]+cm.1[2,2])/sum(cm.1)
data_1.test$pred=class.predicted.1
dd.1=filter(data_1.test, ( pred=="YES"))[1:12,]
profit.1=sum(dd.1$growth)*allocation[1]
profit_rate.1=profit.1/budget

In [20]:
# Fold 2: build the model on the other four folds and test it on Fold2.
data_2.train = data[setdiff(1:nrow(data), fold$Fold2),]
data_2.test  = data[fold$Fold2,]
model.2 = naiveBayes(big_growth ~ PC1+PC2, data_2.train)
prob.2 = predict(model.2, data_2.test, type="raw")
# Unlike the other folds, this cell calls as.class with class="NO" ...
class.predicted.2 = as.class(prob.2, class="NO", cutoff=0.5)

CM.2 = confusionMatrix(class.predicted.2, data_2.test$big_growth)$table
cm.2 = CM.2 / sum(CM.2)
accuracy.2 = (cm.2[1,1]+cm.2[2,2])/sum(cm.2)
data_2.test$pred=class.predicted.2
# ... and takes the first 12 predicted "NO" rows, so its profit is not directly comparable with the other folds.
dd.2=filter(data_2.test, ( pred=="NO"))[1:12,]
profit.2=sum(dd.2$growth)*allocation[1]
profit_rate.2=profit.2/budget

In [21]:
# Fold 3: build the model on the other four folds and test it on Fold3.
data_3.train = data[setdiff(1:nrow(data), fold$Fold3),]
data_3.test  = data[fold$Fold3,]
model.3 = naiveBayes(big_growth ~ PC1+PC2, data_3.train)
prob.3 = predict(model.3, data_3.test, type="raw")
class.predicted.3 = as.class(prob.3, class="YES", cutoff=0.5)

CM.3 = confusionMatrix(class.predicted.3, data_3.test$big_growth)$table
cm.3 = CM.3 / sum(CM.3)
accuracy.3 = (cm.3[1,1]+cm.3[2,2])/sum(cm.3)

data_3.test$pred=class.predicted.3
dd.3=filter(data_3.test, ( pred=="YES"))[1:12,]
profit.3=sum(dd.3$growth)*allocation[1]
profit_rate.3=profit.3/budget

In [22]:
# Fold 4: build the model on the other four folds and test it on Fold4.
data_4.train = data[setdiff(1:nrow(data), fold$Fold4),]
data_4.test  = data[fold$Fold4,]
model.4 = naiveBayes(big_growth ~ PC1+PC2, data_4.train)
prob.4 = predict(model.4, data_4.test, type="raw")
class.predicted.4 = as.class(prob.4, class="YES", cutoff=0.5)

CM.4 = confusionMatrix(class.predicted.4, data_4.test$big_growth)$table
cm.4 = CM.4 / sum(CM.4)
accuracy.4 = (cm.4[1,1]+cm.4[2,2])/sum(cm.4)

data_4.test$pred=class.predicted.4
dd.4=filter(data_4.test, ( pred=="YES"))[1:12,]
profit.4=sum(dd.4$growth)*allocation[1]
profit_rate.4=profit.4/budget

In [23]:
# Fold 5: build the model on the other four folds and test it on Fold5.
data_5.train = data[setdiff(1:nrow(data), fold$Fold5),]
data_5.test  = data[fold$Fold5,]
model.5 = naiveBayes(big_growth ~ PC1+PC2, data_5.train)
prob.5 = predict(model.5, data_5.test, type="raw")
class.predicted.5 = as.class(prob.5, class="YES", cutoff=0.5)

CM.5 = confusionMatrix(class.predicted.5, data_5.test$big_growth)$table
cm.5 = CM.5 / sum(CM.5)
accuracy.5 = (cm.5[1,1]+cm.5[2,2])/sum(cm.5)
data_5.test$pred=class.predicted.5
dd.5=filter(data_5.test, ( pred=="YES"))[1:12,]
profit.5=sum(dd.5$growth)*allocation[1]
profit_rate.5=profit.5/budget
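
The five fold cells above repeat the same steps, so they could be collapsed into a single loop. The sketch below is one possible way to do that, run on synthetic data so that it stands alone. The `toy` data frame and its values are made up, plain thresholding on the "YES" probability stands in for the notebook's `as.class` helper (an assumption about its behavior), and `head(..., 12)` is used in place of `[1:12,]` so that a fold with fewer than 12 predicted "YES" rows does not produce NA rows.

```r
library(e1071)   # naiveBayes
library(caret)   # createFolds

# Synthetic stand-in for the notebook's data (all values are made up).
set.seed(0)
n = 500
toy = data.frame(PC1 = rnorm(n), PC2 = rnorm(n), growth = rnorm(n, sd = 0.3))
toy$big_growth = factor(ifelse(toy$growth > 0.3, "YES", "NO"), levels = c("NO", "YES"))

k = 5
allocation_one = 1000000 / 12                    # one equal allocation
fold = createFolds(toy$big_growth, k = k)

results = do.call(rbind, lapply(seq_len(k), function(i) {
  train = toy[-fold[[i]], ]                      # the other k-1 folds
  test  = toy[fold[[i]], ]                       # the held-out fold
  model = naiveBayes(big_growth ~ PC1 + PC2, train)
  prob  = predict(model, test, type = "raw")
  # Plain thresholding in place of the notebook's as.class helper (assumption).
  pred  = factor(ifelse(prob[, "YES"] >= 0.5, "YES", "NO"), levels = c("NO", "YES"))
  picks = head(test[pred == "YES", ], 12)        # up to 12 predicted "YES" rows
  data.frame(fold = i,
             accuracy = mean(pred == test$big_growth),
             profit = sum(picks$growth) * allocation_one)
}))

results
colMeans(results[, c("accuracy", "profit")])     # cross-validation averages
```

Keeping the per-fold logic in one function means every fold uses the same class label and the same selection rule.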

In [24]:
# Presenting the model's estimated accuracy and profit at cutoff=0.5 for each fold.
d=data.frame(fold=c(1,2,3,4,5),accuracy=c(accuracy.1,accuracy.2,accuracy.3,accuracy.4,accuracy.5), profit=c(profit.1,profit.2,profit.3,profit.4,profit.5))
d

fold,accuracy,profit
1,0.3042973,-221281.1
2,0.9163763,-90115.24
3,0.2659698,-28710.93
4,0.2950058,-89837.99
5,0.3066202,31939.77


In [25]:
# Presenting the model's 5-fold cross-validation estimated accuracy, profit, and profit rate at cutoff=0.5
accuracy.cv=mean(d$accuracy)
profit.cv=mean(c(profit.1, profit.2, profit.3, profit.4, profit.5))
profit_rate.cv=mean(c(profit_rate.1,profit_rate.2,profit_rate.3,profit_rate.4,profit_rate.5))
fmt(data.frame(accuracy.cv,profit.cv,profit_rate.cv), "5-Fold Cross-Validation Estimated Performance")

accuracy.cv,profit.cv,profit_rate.cv
0.4176539,-79601.1,-0.0796011


<font size=1;>
<p style="text-align: left;">
Copyright (c) Berkeley Data Analytics Group, LLC
<span style="float: right;">
Document revised June 9, 2020
</span>
</p>
</font>