#  Regression

![](banner_project.jpg)

In [1]:
analyst = "Lilit Petrosyan"

In [2]:
f = "setup.R"; for (i in 1:10) { if (file.exists(f)) break else f = paste0("../", f) }; source(f)
options(repr.matrix.max.rows=674)
options(repr.matrix.max.cols=200)
update_geom_defaults("point", list(size=1))                                

## Approach

Retrieve a transformed dataset.

Construct a model that predicts how much a company's stock price will grow over the next 12 months, given the past 12 months of company fundamentals data, using a machine learning model construction method applied to the transformed data.

Evaluate the model's business performance based on a business model and business parameters.

## Business Model

The business model is ...

$ \begin{align} profit = \left( \sum_{i \in portfolio} (1 + growth_i) \times allocation_i \right) - budget \end{align} $

<br>

$ profit\,rate = profit \div budget $


$ \begin{align} budget = \sum_{i \in portfolio} allocation_i \end{align} $

<br>

Business model parameters include ...

* Budget = \\$1,000,000: total investment to allocate across the companies in the portfolio
* Portfolio Size = 12: number of companies in the portfolio
* Allocations = \\$1,000,000 $\div$ 12 to each company: investments to allocate to specific companies in the portfolio 

Fill the portfolio with companies that have the highest predicted growths.
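
As a minimal sketch of the business model above, the following R snippet computes profit and profit rate for a small portfolio. The function name `portfolio_profit`, the `example_` variables, and the growth values are hypothetical and chosen only to illustrate the arithmetic; they are not results from this project.

```r
# Hypothetical helper: profit and profit rate under the business model.
portfolio_profit = function(growth, allocation) {
  budget = sum(allocation)                          # budget is the sum of allocations
  profit = sum((1 + growth) * allocation) - budget  # value after 12 months minus budget
  list(profit = profit, profit_rate = profit / budget)
}

# Made-up growths for a 4-company portfolio with equal allocations.
example_growth = c(0.10, -0.05, 0.20, 0.00)
example_allocation = rep(1000000 / 4, 4)
result = portfolio_profit(example_growth, example_allocation)
result$profit       # 62500
result$profit_rate  # 0.0625
```

With equal allocations, the profit rate is simply the mean of the portfolio's growths (here 0.25 / 4 = 0.0625).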

In [3]:
# Set the business parameters.
budget = 1000000
portfolio_size = 12
allocation = rep(budget/portfolio_size, portfolio_size)

layout(fmt(budget), fmt(portfolio_size), fmt(allocation))

budget
1000000

portfolio_size
12

allocation
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33


## Data

We retrieve the data that we previously stored from the Project B notebook. The data includes gvkey (the global company key, sorted from lowest to highest), the ticker symbol and company name, the values of two principal components (PC1 and PC2) from the principal component analysis, the stock price, the growth rate, and a character variable that indicates whether a company had over 30% growth over 12 months.

In [4]:
# Retrieve "My Data.csv"
data = read.csv("My Data.csv", header=TRUE)
# Present a few rows ...
data[1:6,]

gvkey,tic,conm,PC1,PC2,prccq,growth,big_growth
1004,AIR,AAR CORP,3.4371231,-0.2260719,43.69,0.0507455507,NO
1045,AAL,AMERICAN AIRLINES GROUP INC,-12.0332067,0.8045109,32.11,-0.3828560446,NO
1050,CECE,CECO ENVIRONMENTAL CORP,3.9532234,-0.7553386,6.75,0.3157894737,YES
1062,ASA,ASA GOLD AND PRECIOUS METALS,3.6561434,-0.7981915,8.66,-0.2164739518,NO
1072,AVX,AVX CORP,2.9282228,-0.71042,15.25,-0.1184971098,NO
1075,PNW,PINNACLE WEST CAPITAL CORP,0.3488491,1.1389605,85.2,0.0002347969,NO


## Regression Model

In this section we use modeling methods similar to those in Project C, but here we build the model with linear regression. To test the effectiveness of the model, we estimate its performance in three ways: in-sample estimation, out-of-sample estimation, and 5-fold cross-validation estimation.

### Build Model

In [5]:
# Constructing a linear regression model to predict growth given PC1 and PC2.
model=lm(growth ~ PC1 + PC2, data)
model


Call:
lm(formula = growth ~ PC1 + PC2, data = data)

Coefficients:
(Intercept)          PC1          PC2  
 -0.1185887    0.0002455    0.0006294  
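
To make the fitted coefficients concrete, the sketch below applies them by hand to the first row of the data shown earlier (AAR CORP, PC1 = 3.4371231, PC2 = -0.2260719). The coefficients are copied from the rounded printout above, so the result may differ slightly from what `predict(model, data)` returns; the variable names are illustrative.

```r
# Coefficients as printed above (rounded by R's print method).
b0 = -0.1185887
b1 = 0.0002455
b2 = 0.0006294

# First row of the data shown earlier (AAR CORP).
pc1 = 3.4371231
pc2 = -0.2260719

# Linear regression prediction: intercept + b1*PC1 + b2*PC2.
predicted_growth = b0 + b1 * pc1 + b2 * pc2
predicted_growth  # about -0.118
```

Both slope coefficients are small relative to the intercept, so for the rows shown earlier the predictions stay close to the intercept.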


### In-Sample Estimated Performance

In [8]:
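# Predict growth for every company (prob holds predicted growth, not a probability).
prob = predict(model, data)
error = prob-data$growth
square_error = error^2
RMSE = sqrt(mean(square_error))
# Rank companies by predicted growth and take the top 12 as the portfolio.
dd=data.frame(data$growth, prob, error, square_error)
d=dd[order(-prob),]
p=d[1:12,]
# With equal allocations, profit is the sum of actual growths times one allocation.
profit=sum(p$data.growth)*allocation[1]
profit_rate=profit/budget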

In [9]:
# The model's in-sample estimated RMSE, profit, and profit rate.
data.frame(rmse=RMSE,profit=profit, profit_rate=profit_rate)

rmse,profit,profit_rate
0.468815,-115641.1,-0.1156411
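
The profit calculation in the cell above relies on all allocations being equal. Under that assumption the business model formula reduces as follows, where $n$ is the portfolio size and $a = budget \div n$ is the common allocation (the symbols $n$ and $a$ are introduced here only for illustration):

$ \begin{align} profit &= \left( \sum_{i \in portfolio} (1 + growth_i) \times a \right) - budget \\ &= n \times a + a \times \sum_{i \in portfolio} growth_i - budget \\ &= a \times \sum_{i \in portfolio} growth_i \end{align} $

The last step uses $n \times a = budget$. The RMSE reported alongside it is the usual root mean squared error over the $N$ companies being evaluated, with $\widehat{growth}_j$ denoting the predicted growth:

$ \begin{align} RMSE = \sqrt{ \frac{1}{N} \sum_{j=1}^{N} \left( \widehat{growth}_j - growth_j \right)^2 } \end{align} $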


### Out-of-Sample Estimated Performance

In [10]:
# Partitioning the data into training (75%) and validation (25%)
set.seed(0)
train_idx = sample(1:nrow(data), 0.75*nrow(data))   # row indices of the 75% training partition
dev_idx = setdiff(1:nrow(data), train_idx)          # remaining 25% used for validation
data.train = data[train_idx,]
data.dev  = data[dev_idx,]

layout(fmt(size(data.train)), fmt(size(data.dev)))

size(data.train)
observations,variables
3228,8

size(data.dev)
observations,variables
1077,8


In [11]:
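# Fit on the training partition only, then predict growth for the validation partition.
model=lm(growth ~ PC1 + PC2, data.train)
prob2 = predict(model, data.dev)
error2 = prob2-data.dev$growth
square_error2 = error2^2
RMSE2 = sqrt(mean(square_error2))
# Build the portfolio from the 12 validation companies with the highest predicted growth.
dd2=data.frame(data.dev$growth, prob2, error2, square_error2)
d2=dd2[order(-prob2),]
p2=d2[1:12,]
# With equal allocations, profit is the sum of actual growths times one allocation.
profit2=sum(p2$data.dev.growth)*allocation[1]
profit_rate2=profit2/budget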

In [12]:
# the model's out-of-sample estimated RMSE, profit, and profit rate.
fmt(data.frame(rmse=RMSE2,profit=profit2, profit_rate=profit_rate2), "Out-of-Sample Estimated Performance")

rmse,profit,profit_rate
0.5081251,-40986.91,-0.0409869


### 5-Fold Cross-Validation Estimated Performance

In [13]:
# Partition the data into 5 folds (use set.seed(0) and createFolds(...) based on growth).
set.seed(0)
fold = createFolds(data$growth, k=5)
str(fold)

List of 5
 $ Fold1: int [1:862] 8 11 16 22 30 32 38 40 41 44 ...
 $ Fold2: int [1:860] 3 9 10 23 26 27 34 39 52 64 ...
 $ Fold3: int [1:862] 2 7 19 29 35 42 53 57 61 62 ...
 $ Fold4: int [1:861] 1 4 5 6 15 17 28 33 36 43 ...
 $ Fold5: int [1:860] 12 13 14 18 20 21 24 25 31 37 ...


In [14]:
# Fold 1: validate on Fold1, train on the remaining four folds.
set.seed(0)
data_1.train = data[setdiff(1:nrow(data), fold$Fold1),]
data_1.dev  = data[fold$Fold1,]
model.1 = lm(growth ~PC1+PC2, data_1.train)
# Predict growth on the held-out fold and compute its RMSE.
prob.1 = predict(model.1, data_1.dev)
error.1 = prob.1-data_1.dev$growth
square_error.1 = error.1^2
RMSE.1 = sqrt(mean(square_error.1))
# Portfolio: the 12 held-out companies with the highest predicted growth.
dd2.1=data.frame(data_1.dev$growth, prob.1, error.1, square_error.1)
d2.1=dd2.1[order(-prob.1),]
p2.1=d2.1[1:12,]
profit.1=sum(p2.1$data_1.dev.growth)*allocation[1]
profit_rate.1=profit.1/budget

In [15]:
# Fold 2: validate on Fold2, train on the remaining four folds.
set.seed(0)
data_2.train = data[setdiff(1:nrow(data), fold$Fold2),]
data_2.dev  = data[fold$Fold2,]
model.2 = lm(growth ~PC1+PC2, data_2.train)
prob.2 = predict(model.2, data_2.dev)
error.2 = prob.2-data_2.dev$growth
square_error.2 = error.2^2
RMSE.2 = sqrt(mean(square_error.2))
dd2.2=data.frame(data_2.dev$growth, prob.2, error.2, square_error.2)
d2.2=dd2.2[order(-prob.2),]
p2.2=d2.2[1:12,]
profit.2=sum(p2.2$data_2.dev.growth)*allocation[1]
profit_rate.2=profit.2/budget

In [16]:
# Fold 3: validate on Fold3, train on the remaining four folds.
set.seed(0)
data_3.train = data[setdiff(1:nrow(data), fold$Fold3),]
data_3.dev  = data[fold$Fold3,]
model.3 = lm(growth ~PC1+PC2, data_3.train)
prob.3 = predict(model.3, data_3.dev)
error.3 = prob.3-data_3.dev$growth
square_error.3 = error.3^2
RMSE.3 = sqrt(mean(square_error.3))
dd2.3=data.frame(data_3.dev$growth, prob.3, error.3, square_error.3)
d2.3=dd2.3[order(-prob.3),]
p2.3=d2.3[1:12,]
profit.3=sum(p2.3$data_3.dev.growth)*allocation[1]
profit_rate.3=profit.3/budget

In [17]:
# Fold 4: validate on Fold4, train on the remaining four folds.
set.seed(0)
data_4.train = data[setdiff(1:nrow(data), fold$Fold4),]
data_4.dev  = data[fold$Fold4,]
model.4 = lm(growth ~PC1+PC2, data_4.train)
prob.4 = predict(model.4, data_4.dev)
error.4 = prob.4-data_4.dev$growth
square_error.4 = error.4^2
RMSE.4 = sqrt(mean(square_error.4))
dd2.4=data.frame(data_4.dev$growth, prob.4, error.4, square_error.4)
d2.4=dd2.4[order(-prob.4),]
p2.4=d2.4[1:12,]
profit.4=sum(p2.4$data_4.dev.growth)*allocation[1]
profit_rate.4=profit.4/budget

In [18]:
# Fold 5: validate on Fold5, train on the remaining four folds.
set.seed(0)
data_5.train = data[setdiff(1:nrow(data), fold$Fold5),]
data_5.dev  = data[fold$Fold5,]
model.5 = lm(growth ~PC1+PC2, data_5.train)
prob.5 = predict(model.5, data_5.dev)
error.5 = prob.5-data_5.dev$growth
square_error.5 = error.5^2
RMSE.5 = sqrt(mean(square_error.5))
dd2.5=data.frame(data_5.dev$growth, prob.5, error.5, square_error.5)
d2.5=dd2.5[order(-prob.5),]
p2.5=d2.5[1:12,]
profit.5=sum(p2.5$data_5.dev.growth)*allocation[1]
profit_rate.5=profit.5/budget

In [19]:
# Presenting the model's estimated RMSE and profit for each fold.
d=data.frame(fold=c(1,2,3,4,5), rmse=c(RMSE.1,RMSE.2,RMSE.3,RMSE.4,RMSE.5), profit=c(profit.1,profit.2,profit.3,profit.4,profit.5))
d

fold,rmse,profit
1,0.4446211,-68328.2
2,0.435679,-84359.01
3,0.5041439,-83515.95
4,0.3998234,-114715.51
5,0.546036,-6010.14


In [20]:
# Presenting the model's 5-fold cross-validation estimated RMSE, profit, and profit rate.
rmse.cv=mean(d$rmse)
profit.cv=mean(c(profit.1, profit.2, profit.3, profit.4, profit.5))
profit_rate.cv=mean(c(profit.1/budget, profit.2/budget, profit.3/budget, profit.4/budget, profit.5/budget))
fmt(data.frame(rmse.cv,profit.cv,profit_rate.cv), "5-Fold Cross-Validation Estimated Performance")

rmse.cv,profit.cv,profit_rate.cv
0.4660607,-71385.76,-0.0713858
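
The five fold cells above repeat the same steps with different row indices. As a sketch only, the same procedure could be written as a loop. To stay self-contained, the example uses synthetic data and assigns folds with base R `sample()` as a stand-in for `createFolds(...)`. The helper name `evaluate_split` and all `sim_` and `cv_` names are hypothetical, and the numbers it produces say nothing about the real dataset.

```r
# Sketch: k-fold cross-validation of growth ~ PC1 + PC2 written as a loop.
# Synthetic data stands in for "My Data.csv"; all values are made up.
set.seed(0)
n = 500
sim_data = data.frame(PC1 = rnorm(n), PC2 = rnorm(n))
sim_data$growth = 0.05 * sim_data$PC1 + rnorm(n, sd = 0.4)

sim_budget = 1000000
sim_portfolio_size = 12
sim_allocation = sim_budget / sim_portfolio_size   # equal allocation per company

# Hypothetical helper: fit on train, then score RMSE and profit on dev.
evaluate_split = function(train, dev) {
  fit = lm(growth ~ PC1 + PC2, train)
  pred = predict(fit, dev)
  rmse = sqrt(mean((pred - dev$growth)^2))
  top = order(-pred)[1:sim_portfolio_size]          # highest predicted growths
  profit = sum(dev$growth[top]) * sim_allocation    # actual profit of that portfolio
  data.frame(rmse = rmse, profit = profit, profit_rate = profit / sim_budget)
}

# Assign each row to one of k folds (base R stand-in for createFolds).
k = 5
fold_id = sample(rep(1:k, length.out = n))

# One row of results per fold, then average across folds.
cv_results = do.call(rbind, lapply(1:k, function(i) {
  evaluate_split(sim_data[fold_id != i, ], sim_data[fold_id == i, ])
}))
cv_summary = colMeans(cv_results)
cv_results
cv_summary
```

Because the budget is constant, averaging the per-fold profit rates gives the same value as dividing the average profit by the budget.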


<font size=1;>
<p style="text-align: left;">
Copyright (c) Berkeley Data Analytics Group, LLC
<span style="float: right;">
Document revised June 14, 2020
</span>
</p>
</font>