# Project Part D: Regression

![](banner_project.jpg)

In [1]:
analyst = "Priscila Carcamo Amorim" # Replace this with your name

In [2]:
f = "setup.R"; for (i in 1:10) { if (file.exists(f)) break else f = paste0("../", f) }; source(f)
options(repr.matrix.max.rows=674)
options(repr.matrix.max.cols=200)
update_geom_defaults("point", list(size=1))                                

## Directions

### Objective

Construct and evaluate a regressor trained on a transformed dataset about public company fundamentals.  Later, use the regressor along with additional analysis to recommend a portfolio of 12 company investments that maximizes 12-month return of an overall \$1,000,000 investment.

### Approach

Retrieve a transformed dataset.

Construct a model to predict how much stock price will grow over 12 months, given 12 months of past company fundamentals data, and using a machine learning model construction method and transformed data.

Evaluate the model's business performance based on a business model and business parameters.

## Business Model

The business model is ...

$ \begin{align} profit = \left( \sum_{i \in portfolio} (1 + growth_i) \times allocation_i \right) - budget \end{align} $

<br>

$ profit\,rate = profit \div budget $


$ \begin{align} budget = \sum_{i \in portfolio} allocation_i \end{align} $

<br>

Business model parameters include ...

* Budget = \\$1,000,000: total investment to allocate across the companies in the portfolio
* Portfolio Size = 12: number of companies in the portfolio
* Allocations = \\$1,000,000 $\div$ 12 to each company: investments to allocate to specific companies in the portfolio 

Fill the portfolio with companies that have the highest predicted growths.
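
As a minimal sketch of the arithmetic, the formulas above can be evaluated on a made-up portfolio. The growth values below are hypothetical and chosen only to illustrate the calculation, and `portfolio_profit` is an illustrative helper name, not something provided by `setup.R`.

```r
# Hypothetical helper: profit of a portfolio given realized growths and allocations.
portfolio_profit = function(growth, allocation) {
    budget = sum(allocation)                          # budget is the sum of allocations
    profit = sum((1 + growth) * allocation) - budget  # value after growth, minus budget
    list(budget=budget, profit=profit, profit_rate=profit/budget)
}

# Made-up growths for 12 companies (illustration only, not from the dataset).
example_growth = c(0.10, -0.05, 0.20, 0.00, -0.10, 0.05,
                   0.15, -0.20, 0.30, 0.00, -0.15, 0.10)
example_allocation = rep(1000000/12, 12)

result = portfolio_profit(example_growth, example_allocation)
result$profit_rate   # with equal allocations this equals mean(example_growth)
```

With equal allocations, the profit rate reduces to the mean growth of the selected companies, so the portfolio's performance depends entirely on which 12 companies are chosen.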

In [3]:
# Set the business parameters.

budget = 1000000
portfolio_size = 12
allocation = rep(budget/portfolio_size, portfolio_size)

layout(fmt(budget), fmt(portfolio_size), fmt(allocation))

budget
1000000

portfolio_size
12

allocation
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33


## Data

_<< Discuss this data retrieval. >>_  
  
**ANSWER:**  
  
This data comes from Project Part B. It is the transformed dataset that we stored after selecting identifier variables (gvkey, tic, and conm), the principal components PC1 and PC2 as predictor variables, and the outcome variables prccq, growth, and big_growth. This dataset, which has already been cleaned, filtered, and transformed, is used in the following section to build regression models.

In [4]:
# Retrieve "My Data.csv"
data = read.csv("My Data.csv", header=TRUE)

# Present a few rows ...
head(data)

gvkey,tic,conm,PC1,PC2,prccq,growth,big_growth
1004,AIR,AAR CORP,3.4371231,-0.2260719,43.69,0.0507455507,NO
1045,AAL,AMERICAN AIRLINES GROUP INC,-12.0332067,0.8045109,32.11,-0.3828560446,NO
1050,CECE,CECO ENVIRONMENTAL CORP,3.9532234,-0.7553386,6.75,0.3157894737,YES
1062,ASA,ASA GOLD AND PRECIOUS METALS,3.6561434,-0.7981915,8.66,-0.2164739518,NO
1072,AVX,AVX CORP,2.9282228,-0.71042,15.25,-0.1184971098,NO
1075,PNW,PINNACLE WEST CAPITAL CORP,0.3488491,1.1389605,85.2,0.0002347969,NO


## Regression Model

_<< Discuss this model construction and evaluation. >>_  
  
**ANSWER:**  
  
Throughout this section, we use linear regression to predict growth from only PC1 and PC2; what changes from part to part is how the model is trained and evaluated. In 4.2 (in-sample), we build and evaluate the model on the entire dataset, without partitioning it. This gives an RMSE of roughly 0.47 and a loss of about \$116,000, for a profit rate of roughly -0.12.  
  
In 4.3 (out-of-sample), we partition the data into training (75%) and validation (25%) sets, which yields 3228 training observations and 1077 validation observations. We train the linear regression model on the training data only, then predict growth from PC1 and PC2 of the validation data. The RMSE is the square root of the mean squared difference between the actual and predicted growth across the validation observations. Here the RMSE is roughly 0.51, slightly higher than the in-sample RMSE from 4.2. The profit is roughly -\$41,000, a much smaller loss than in 4.2, so the profit rate is also better, at about -0.04. Note that the two portfolios are drawn from different pools (the full dataset in 4.2, the validation set only in 4.3), so the two profits are not directly comparable. The held-out evaluation thus shows a worse RMSE but a smaller loss, which invites us to consider the tradeoff between the two measures. So far, linear regression on just PC1 and PC2 is not predicting growth well; PC1 and PC2 may simply not be good predictors.  
  
In the final part, 4.4, we use 5-fold cross-validation. We separate the data into 5 folds; for each fold, the observations in that fold form the validation set and the remaining observations form the training set. For each fold, we build a model on the training data, predict growth for the validation data, and compute the RMSE, profit, and profit rate as in the previous parts. Every fold resulted in a negative profit. Fold 5 had the highest profit (the smallest loss, roughly -\$6,000) even though it also had the highest RMSE (roughly 0.55), while fold 4 had the lowest RMSE (roughly 0.40) but the largest loss (roughly -\$115,000). The overall cross-validation RMSE, profit, and profit rate are roughly 0.47, -\$71,000, and -0.07, respectively.  
  
Overall, across these different ways of evaluating the linear regression model, we consistently see negative profit and high RMSE values, suggesting that PC1 and PC2 alone are not good predictors of a company's growth. We should therefore not rely on these linear regression models for accurate growth predictions. It is likely that more predictor variables are needed to predict growth well.
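
One possible reading of the fold results is that RMSE and portfolio profit measure different things: RMSE averages the error over all observations, while profit depends only on which companies are ranked at the top. The following is a minimal sketch with made-up numbers (not taken from the dataset) in which the predictions with the lower RMSE still pick the worse company; `rmse_of` is an illustrative helper name.

```r
# RMSE: square root of the mean squared difference between actual and predicted values.
rmse_of = function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Made-up actual growths for five companies (illustration only).
actual = c(0.50, 0.10, 0.00, -0.10, -0.40)

# Predictions A: small errors overall, but the top-ranked company is a poor performer.
pred_a = c(0.00, 0.05, 0.00, 0.06, -0.30)
# Predictions B: larger errors, but the ranking matches the actual ordering.
pred_b = c(1.20, 0.60, 0.30, 0.00, -0.90)

# Actual growth of the single company that each set of predictions ranks highest.
top_a = actual[which.max(pred_a)]
top_b = actual[which.max(pred_b)]

c(rmse_a=rmse_of(actual, pred_a), rmse_b=rmse_of(actual, pred_b), top_a=top_a, top_b=top_b)
```

In this toy case, predictions A have the lower RMSE but select a company with negative growth, while predictions B have the higher RMSE but select the best company.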

### Build Model

In [5]:
# Construct a linear regression model to predict growth given PC1 and PC2.
# Present a brief summary of the model parameters.
model = lm(growth ~ PC1+PC2, data)
model


Call:
lm(formula = growth ~ PC1 + PC2, data = data)

Coefficients:
(Intercept)          PC1          PC2  
 -0.1185887    0.0002455    0.0006294  


### In-Sample Estimated Performance

In [6]:
# Present the model's in-sample estimated RMSE, profit, and profit rate.
data.4_2 = data
data.4_2$predicted.growth = predict(model, data)
rmse = sqrt(mean((data.4_2$growth - data.4_2$predicted.growth)^2))


data.4_2_sorted = data.4_2[order(-data.4_2$predicted.growth),]
portfolio.data = data.4_2_sorted[1:portfolio_size,]
profit = sum((rep(1, portfolio_size) + portfolio.data$growth) * allocation) - budget
profit_rate = profit / budget

df.4_2 = data.frame(rmse, profit, profit_rate)
fmt(df.4_2, "In-Sample Estimated Performance")

rmse,profit,profit_rate
0.468815,-115641.1,-0.1156411


### Out-of-Sample Estimated Performance

In [7]:
# Partition the data into training (75%) and validation (25%)
# (use set.seed(0) and sample(...) to choose training observations).
# How many observations and variables in the training data?
# How many observations and variables in the validation data?
set.seed(0)
train = sample(1:nrow(data), 0.75*nrow(data))
dev = setdiff(1:nrow(data), train)

data.train = data[train,]
data.dev = data[dev,]

layout(fmt(size(data.train)), fmt(size(data.dev)))

observations,variables
3228,8

observations,variables
1077,8


In [8]:
# Present the model's out-of-sample estimated RMSE, profit, and profit rate.
model = lm(growth ~ PC1+PC2, data.train)
data.4_3 = data.dev
data.4_3$predicted.growth = predict(model, data.dev)
rmse = sqrt(mean((data.4_3$growth - data.4_3$predicted.growth)^2))


data.4_3_sorted = data.4_3[order(-data.4_3$predicted.growth),]
portfolio.data = data.4_3_sorted[1:portfolio_size,]
profit = sum((rep(1, portfolio_size) + portfolio.data$growth) * allocation) - budget
profit_rate = profit / budget

df.4_3 = data.frame(rmse, profit, profit_rate)
fmt(df.4_3, "Out-of-Sample Estimated Performance")

rmse,profit,profit_rate
0.5081251,-40986.91,-0.0409869


### 5-Fold Cross-Validation Estimated Performance

In [9]:
# Partition the data into 5 folds (use set.seed(0) and createFolds(...) based on growth).
# Present the first few observation (row) numbers for each of the folds.
#
# You can use the str() function.
set.seed(0)
fold = createFolds(data$growth, k=5)
str(fold)

List of 5
 $ Fold1: int [1:862] 8 11 16 22 30 32 38 40 41 44 ...
 $ Fold2: int [1:860] 3 9 10 23 26 27 34 39 52 64 ...
 $ Fold3: int [1:862] 2 7 19 29 35 42 53 57 61 62 ...
 $ Fold4: int [1:861] 1 4 5 6 15 17 28 33 36 43 ...
 $ Fold5: int [1:860] 12 13 14 18 20 21 24 25 31 37 ...
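
`createFolds` comes from the caret package (presumably loaded by `setup.R`). As a rough base-R approximation, shuffled row numbers can be assigned to k groups. This is only a sketch: unlike `createFolds`, it does not stratify on the outcome variable, so it will not reproduce the folds shown above. `make_folds` is a hypothetical helper name.

```r
# Hypothetical helper: split row numbers 1..n into k roughly equal random folds.
# Unlike caret::createFolds, this does not stratify on the outcome variable.
make_folds = function(n, k) {
    shuffled = sample(n)                      # random permutation of 1..n
    fold_id = rep(1:k, length.out=n)          # fold labels 1,2,...,k,1,2,...
    folds = split(shuffled, fold_id)          # list of k integer vectors
    names(folds) = paste0("Fold", 1:k)
    lapply(folds, sort)                       # sort row numbers within each fold
}

set.seed(0)
example_fold = make_folds(4305, 5)            # 4305 = 3228 + 1077 rows
str(example_fold)
```

Each row number lands in exactly one fold, so every observation is used for validation once and for training in the other folds.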


In [10]:
# Present the model's estimated RMSE and profit for each fold.
rmse = c(NA, NA, NA, NA, NA)
profit = c(NA, NA, NA, NA, NA)
profit_rate = c(NA, NA, NA, NA, NA)

for (i in 1:5) { 
    data.train = data[setdiff(1:nrow(data), fold[[i]]),]
    data.dev = data[fold[[i]],]

    model = lm(growth ~ PC1+PC2, data.train)
    data.dev$predicted.growth = predict(model, data.dev)
   
    rmse.val = sqrt(mean((data.dev$growth - data.dev$predicted.growth)^2))
    rmse[i] = rmse.val
    
    new.data_dev = data.dev[order(-data.dev$predicted.growth),]
    portfolio.data = new.data_dev[1:portfolio_size,]
    profit.value = sum((rep(1, portfolio_size) + portfolio.data$growth) * allocation) - budget
    profit[i] = profit.value
    
    profit.rate_value = profit.value / budget
    profit_rate[i] = profit.rate_value
  }

data.frame(fold=1:5, rmse, profit)

fold,rmse,profit
1,0.4446211,-68328.2
2,0.435679,-84359.01
3,0.5041439,-83515.95
4,0.3998234,-114715.51
5,0.546036,-6010.14


In [11]:
# Present the model's 5-fold cross-validation estimated RMSE, profit, and profit rate.
rmse.cv = mean(rmse)
profit.cv = mean(profit)
profit_rate.cv = mean(profit_rate)

fmt(data.frame(rmse.cv, profit.cv, profit_rate.cv), "5-Fold Cross-Validation Estimated Performance")

rmse.cv,profit.cv,profit_rate.cv
0.4660607,-71385.76,-0.0713858
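
The RMSE and profit steps are repeated in the in-sample, out-of-sample, and cross-validation cells. One way to reduce that repetition, sketched here under the assumption that the evaluation data has `growth`, `PC1`, and `PC2` columns, is a small helper function. `evaluate_model` is a hypothetical name, and the data below is synthetic, generated only so that the sketch runs on its own.

```r
# Hypothetical helper: RMSE, profit, and profit rate of a model on evaluation data.
evaluate_model = function(model, data.eval, budget=1000000, portfolio_size=12) {
    predicted = predict(model, data.eval)
    rmse = sqrt(mean((data.eval$growth - predicted)^2))

    # Pick the companies with the highest predicted growth; allocate equally.
    top = order(-predicted)[1:portfolio_size]
    allocation = rep(budget/portfolio_size, portfolio_size)
    profit = sum((1 + data.eval$growth[top]) * allocation) - budget

    data.frame(rmse, profit, profit_rate=profit/budget)
}

# Synthetic data (illustration only): growth is unrelated to PC1 and PC2.
set.seed(0)
toy = data.frame(PC1=rnorm(200), PC2=rnorm(200), growth=rnorm(200, mean=0, sd=0.5))
toy.train = toy[1:150,]
toy.dev = toy[151:200,]

toy.model = lm(growth ~ PC1+PC2, toy.train)
toy.result = evaluate_model(toy.model, toy.dev)
toy.result
```

The same function could be called once on the full data, once on the validation partition, and once per fold, which keeps the three evaluations consistent with each other.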


<font size=1;>
<p style="text-align: left;">
Copyright (c) Berkeley Data Analytics Group, LLC
<span style="float: right;">
Document revised June 14, 2020
</span>
</p>
</font>