# Project Part F: Deployment

![](banner_project.jpg)

In [6]:
f = "setup.R"; for (i in 1:10) { if (file.exists(f)) break else f = paste0("../", f) }; source(f)
options(repr.matrix.max.rows=674)
options(repr.matrix.max.cols=200)
update_geom_defaults("point", list(size=1))                                

## Directions

### Objective

Recommend a portfolio of 12 company investments that maximizes 12-month return of an overall \$1,000,000 investment.

### Approach

Retrieve a transformed dataset about public company fundamentals and use it to reproduce the construction of a selected model.

Retrieve an investment opportunities dataset, comprising fundamentals for some set of public companies over some one-year period.  Transform the representation of the investment opportunities to match the representation expected by the model, leveraging previous analysis.

Use the model to make predictions about the investment opportunities and accordingly recommend a portfolio of 12 company investments.

## Business Model


The business model is ...

$ \begin{align} profit = \left( \sum_{i \in portfolio} (1 + growth_i) \times allocation_i \right) - budget \end{align} $

<br>

$ profit\,rate = profit \div budget $


$ \begin{align} budget = \sum_{i \in portfolio} allocation_i \end{align} $

<br>

Business model parameters include ...

* Budget = \$1,000,000: total investment to allocate across the companies in the portfolio
* Portfolio Size = 12: number of companies in the portfolio
* Allocations = \$1,000,000 $\div$ 12 to each company: investments to allocate to specific companies in the portfolio

Fill the portfolio with companies that have the highest predicted growths.
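
To make the formulas above concrete, here is a minimal sketch that evaluates them for a made-up scenario. The function name `business_profit` and the growth values are assumptions for illustration only; they are not results from this analysis.

```r
# Evaluate the business model for a portfolio (illustrative sketch).
# growth:     12-month growth of each company, e.g. 0.10 means +10% (hypothetical values below)
# allocation: dollars invested in each company
business_profit = function(growth, allocation) {
  budget = sum(allocation)
  profit = sum((1 + growth) * allocation) - budget
  list(budget = budget, profit = profit, profit_rate = profit / budget)
}

# Hypothetical scenario: six companies grow 10%, six shrink 5%.
growth = c(rep(0.10, 6), rep(-0.05, 6))
allocation = rep(1000000 / 12, 12)

result = business_profit(growth, allocation)
print(result)
```

With equal allocations, the profit rate reduces to the mean growth of the companies in the portfolio (2.5% in this hypothetical scenario), which is why selecting the companies with the highest predicted growths is a reasonable strategy.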

In [9]:
# Set the business parameters.

budget = 1000000
portfolio_size = 12
allocation = rep(budget/portfolio_size, portfolio_size)

layout(fmt(budget), fmt(portfolio_size), fmt(allocation))

ERROR: Error in row_spec(x, 0, background = "#FFFFFF"): could not find function "row_spec"


## Data Retrieval

_<< Discuss this data retrieval. >>_  
  
**ANSWER:**  
  
This data comes from Project Part B. It is the transformed dataset that we stored after selecting certain predictor variables--such as gvkey, tic, conm, and the PCs--and outcome variables, which include prccq, growth, and big_growth. This transformed data set, which has gone through rigorous data cleaning, filtering, and transformation, will be used in the following section for building regression models.

In [8]:
# Retrieve "My Data.csv".  This is the ORIGINAL model training data.
data = read.csv("My Data.csv", header=TRUE)

# Present a few rows ...
head(data)

“cannot open file 'My Data.csv': No such file or directory”

ERROR: Error in file(file, "rt"): cannot open the connection


## Build Model

_<< Discuss this model construction. >>_  
  
**ANSWER:**  
  
We build a linear regression model on the original model training data, which is the data we have been working with since Project Part B and which comes from "My Data.csv". The model uses PC1 and PC2 from this dataset to predict growth. Linear regression is appropriate here because growth, the outcome we want to predict, is a numeric variable.

In [5]:
# Construct a linear regression model to predict growth given PC1 and PC2, based on the
# ORIGINAL model training data.
# Present a brief summary of the model parameters.
model = lm(growth ~ PC1+PC2, data)
model


Call:
lm(formula = growth ~ PC1 + PC2, data = data)

Coefficients:
(Intercept)          PC1          PC2  
 -0.1185887    0.0002455    0.0006294  


## Investment Opportunities

_<< Discuss this handling of investment opportunities. >>_  
  
**ANSWER:**  
  
We read in new data from "Investment Opportunities.csv" and transform it to match the representation of the data the model was trained on, so that the model can make valid predictions. In section 5.2, we partition the investment opportunities data (IOD from here on) by calendar quarter, keep only the observations that have a value for prccq, and drop the quarter column once it is no longer needed. We also remove observations about companies that reported more than once in a quarter. After this filtering, we merge the four quarterly datasets so that a single observation holds all the information about one company, with a distinct gvkey in every row. This makes it easier to retrieve information about a single company and significantly reduces the number of observations (from the original 918 to 230).  
  
In section 5.4, we transform the data further so that it closely matches our original data from "My Data.csv". We first keep only the columns that passed the pre-filter applied to "My Data.csv". We then impute missing values in each column of the IOD using the same imputation values that were computed for "My Data.csv", which keeps the two datasets consistent. Next, we use the principal component information from "My Data.csv" to compute the principal component values of the IOD, so that PC1 and PC2 mean the same thing for the new data as they did for the training data. In the final stage of this section, we filter the columns to keep only the predictor variables stored by the previous analysis: gvkey, tic, conm, PC1, PC2.
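
The idea of reusing training-time imputation values and principal components on new data can be sketched on a toy example. Everything here is an assumption for illustration: the variables `x1` and `x2`, imputation by the training mean, a scaled `prcomp()`, and the helper `apply_impute()`, which is a hypothetical stand-in for `put_impute()` and not its actual implementation.

```r
set.seed(1)

# Hypothetical training data with two numeric variables.
train = data.frame(x1 = rnorm(50, 10, 2), x2 = rnorm(50, 5, 1))

# Artifacts computed once on the training data (these would be stored with saveRDS()).
impute.vals = sapply(train, mean, na.rm = TRUE)    # assumed: impute with training means
pc = prcomp(train, center = TRUE, scale. = TRUE)   # assumed: centered and scaled PCA

# New data with a missing value.
new = data.frame(x1 = c(12, NA, 9), x2 = c(4, 6, 5))

# Hypothetical stand-in for put_impute(): fill each column's NAs with the stored value.
apply_impute = function(d, vals) {
  for (v in names(vals)) d[[v]][is.na(d[[v]])] = vals[[v]]
  d
}
new.impute = apply_impute(new, impute.vals)

# predict() centers and scales the new data with the training values,
# then applies the training rotation (weight matrix).
new.pc = predict(pc, new.impute[, rownames(pc$rotation)])
print(new.pc)
```

The key design point is that nothing is re-estimated from the new data: the imputation values, centers, scales, and rotation all come from the training data, so the new observations are expressed in the same coordinate system the model was trained on.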

### Retrieve Data

In [6]:
# Retrieve "Investment Opportunities.csv"
new.data = read.csv("Investment Opportunities.csv", header=TRUE)

# Present the dataset size ...
size(new.data)

observations,variables
918,680


### Partition Data by Calendar Quarter 

To partition the dataset by calendar quarter in which information is reported, first add a synthetic variable to indicate such.  Then partition into four new datasets, one for each quarter, and drop the quarter variables. Additionally, filter the observations to include only those with non-missing `prccq`.  Then remove any observations about companies that reported more than once per quarter.  Then change all the variable names (except for the `gvkey`, `tic`, and `conm` variables) by suffixing them with quarter information - e.g., in the Quarter 1 dataset, `prccq` becomes `prccq.q1`, etc.
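
The steps described above can be sketched on a toy dataset using base R only. All values, the `datadate` column name, and the helper `partition_quarter()` are hypothetical; the sketch also suffixes variables by name rather than by column position, which is one possible alternative and not necessarily how the notebook does it.

```r
# Toy stand-in for the investment opportunities data (all values hypothetical).
toy = data.frame(
  gvkey    = c(1, 1, 2, 2, 3),
  datadate = c("01/31/2018", "02/28/2018", "03/31/2018", "06/30/2018", "02/15/2018"),
  tic      = c("AAA", "AAA", "BBB", "BBB", "CCC"),
  conm     = c("A CO", "A CO", "B CO", "B CO", "C CO"),
  prccq    = c(10, 11, 20, 21, NA),
  stringsAsFactors = FALSE
)

# Derive the calendar quarter (1-4) from the month of the report date.
month = as.integer(format(as.Date(toy$datadate, format = "%m/%d/%Y"), "%m"))
toy$quarter = (month - 1) %/% 3 + 1

key.vars = c("gvkey", "tic", "conm")

partition_quarter = function(d, q) {
  # Keep quarter q, require non-missing prccq, and drop the quarter variable.
  part = d[d$quarter == q & !is.na(d$prccq), colnames(d) != "quarter"]
  # Keep only the first report per company.
  part = part[!duplicated(part$gvkey), ]
  # Suffix every variable except the keys with the quarter.
  rename = !(colnames(part) %in% key.vars)
  colnames(part)[rename] = paste0(colnames(part)[rename], ".q", q)
  part
}

toy.q1 = partition_quarter(toy, 1)
toy.q2 = partition_quarter(toy, 2)
print(toy.q1)
print(toy.q2)
```

Selecting the key variables by name avoids depending on their column positions, at the cost of assuming the key names are known in advance.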

In [7]:
# Partition the dataset as described.
new.data$quarter = quarter(mdy(new.data[,2]))

data.current.q1 = new.data[(new.data$quarter==1) & !is.na(new.data$prccq), -ncol(new.data)]
data.current.q2 = new.data[(new.data$quarter==2) & !is.na(new.data$prccq), -ncol(new.data)]
data.current.q3 = new.data[(new.data$quarter==3) & !is.na(new.data$prccq), -ncol(new.data)]
data.current.q4 = new.data[(new.data$quarter==4) & !is.na(new.data$prccq), -ncol(new.data)]

data.current.q1 = data.current.q1[,colnames(data.current.q1)!="quarter"]
data.current.q2 = data.current.q2[,colnames(data.current.q2)!="quarter"]
data.current.q3 = data.current.q3[,colnames(data.current.q3)!="quarter"]
data.current.q4 = data.current.q4[,colnames(data.current.q4)!="quarter"]

data.current.q1 = data.current.q1[!duplicated(data.current.q1$gvkey),]
data.current.q2 = data.current.q2[!duplicated(data.current.q2$gvkey),]
data.current.q3 = data.current.q3[!duplicated(data.current.q3$gvkey),]
data.current.q4 = data.current.q4[!duplicated(data.current.q4$gvkey),]

colnames(data.current.q1)[-c(1, 10, 12)] = paste0(colnames(data.current.q1)[-c(1, 10, 12)], ".q1")
colnames(data.current.q2)[-c(1, 10, 12)] = paste0(colnames(data.current.q2)[-c(1, 10, 12)], ".q2")
colnames(data.current.q3)[-c(1, 10, 12)] = paste0(colnames(data.current.q3)[-c(1, 10, 12)], ".q3")
colnames(data.current.q4)[-c(1, 10, 12)] = paste0(colnames(data.current.q4)[-c(1, 10, 12)], ".q4")

In [8]:
# Present the sizes of the data partitions
layout(fmt(size(data.current.q1)),
       fmt(size(data.current.q2)),
       fmt(size(data.current.q3)),
       fmt(size(data.current.q4)))

partition,observations,variables
data.current.q1,209,680
data.current.q2,221,680
data.current.q3,227,680
data.current.q4,230,680


### Consolidate Data by Company

Consolidate the four quarter datasets into one dataset, with one observation per company that includes variables for all four quarters.  Remove any observations with missing `prccq.q4` values.
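
As a rough illustration of this consolidation, the sketch below merges four hypothetical quarter partitions with a full outer join on the key variables and then drops companies without a fourth-quarter price. The toy data frames are assumptions, and `Reduce()` is used here as a compact alternative to chaining pairwise merges.

```r
# Hypothetical quarter partitions, already suffixed by quarter.
q1 = data.frame(gvkey = c(1, 2), tic = c("AAA", "BBB"), conm = c("A CO", "B CO"),
                prccq.q1 = c(10, 20), stringsAsFactors = FALSE)
q2 = data.frame(gvkey = c(1, 2), tic = c("AAA", "BBB"), conm = c("A CO", "B CO"),
                prccq.q2 = c(11, 21), stringsAsFactors = FALSE)
q3 = data.frame(gvkey = c(1, 2, 3), tic = c("AAA", "BBB", "CCC"),
                conm = c("A CO", "B CO", "C CO"),
                prccq.q3 = c(12, 22, 32), stringsAsFactors = FALSE)
q4 = data.frame(gvkey = c(1, 3), tic = c("AAA", "CCC"), conm = c("A CO", "C CO"),
                prccq.q4 = c(13, 33), stringsAsFactors = FALSE)

key.vars = c("gvkey", "tic", "conm")

# Full outer join across all four quarters: one row per company.
merged = Reduce(function(a, b) merge(a, b, by = key.vars, all = TRUE),
                list(q1, q2, q3, q4))

# Keep only companies with a non-missing fourth-quarter price.
merged = merged[!is.na(merged$prccq.q4), ]
print(merged)
```

In this toy case company 2 is dropped because it has no fourth-quarter report, while company 3 is kept with missing values for the first two quarters. The variable count follows the same logic in the real data: 3 key variables plus 4 × 677 suffixed variables gives 2,711.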

In [9]:
# Consolidate the partitions as described.
# How many observations and variables in the resulting dataset? 
m12 = merge(data.current.q1, data.current.q2, by=c("gvkey", "tic", "conm"), all=TRUE)
m34 = merge(data.current.q3, data.current.q4, by=c("gvkey", "tic", "conm"), all=TRUE)
data.current = merge(m12, m34, by=c("gvkey", "tic", "conm"), all=TRUE, sort=TRUE)

data.current = data.current[!is.na(data.current$prccq.q4),]

size(data.current)

observations,variables
230,2711


### Transform Representation of Data

In [10]:
# Filter the data to include only those variables with at least 80% non-missing values
# in the ORIGINAL model training data.
# How many observations and variables in the resulting dataset? 
#
# You can use the readRDS() function. 
cols = readRDS("My Pre-Filter.rds")

new.data = data.current[,cols]
size(new.data)

observations,variables
230,923


In [11]:
# Impute the data using the same imputation values as computed for the ORIGINAL model
# training data. 
# How many observations and variables in the resulting dataset? 
#
# You can use the readRDS() and put_impute() functions.
impute.vals = readRDS("My Imputation.rds")
data.impute = put_impute(new.data, impute.vals)
size(data.impute)

observations,variables
230,923


In [12]:
# Compute principal components using the centroids and weight matrix from the analysis
# of the ORIGINAL model training data.  Apply to only the (numeric and integer) variables
# used in the analysis of the ORIGINAL model training data. 
# How many observations and variables in the resulting dataset? 
#
# You can use the readRDS() and predict() functions.
# You can use rownames(pc$rotation) to get the (numeric and integer) variables. 
pc = readRDS("My PC.rds")
data.pc = predict(pc, data.impute[,rownames(pc$rotation)])
size(data.pc)

observations,variables
230,737


In [13]:
# Combine and filter datasets as necessary to produce a new dataset that includes all investment
# opportunities, but includes only predictor variables stored by previous analysis. 
# How many observations and variables?
# Present the first few observations of the resulting dataset.
#
# You can use the readRDS() function.
predictor.vars = readRDS("My Post-Filter.rds")
data.pc = data.pc[, c("PC1", "PC2")]
data.final = cbind(data.impute, data.pc)
data.final = data.final[, predictor.vars]

size(data.final)
head(data.final)

observations,variables
230,5


gvkey,tic,conm,PC1,PC2
1004,AIR,AAR CORP,3.472287,-0.08425766
1410,ABM,ABM INDUSTRIES INC,2.796849,-0.10000479
1562,AMSWA,AMERICAN SOFTWARE -CL A,3.986999,-0.65366925
1618,AXR,AMREP CORP,3.642196,-0.56307768
1632,ADI,ANALOG DEVICES,-4.313079,0.9635113
1686,APOG,APOGEE ENTERPRISES INC,3.491347,0.18518797


## Apply Model

_<< Discuss this application of the model. >>_  
  
**ANSWER:**  
  
We predict growth values for the IOD using the PC1 and PC2 values computed above, because the model was trained with PC1 and PC2 as its only predictor variables. We then sort the data in descending order of predicted growth so that we can quickly select the top 12 companies for the portfolio, since we want to recommend the companies with the highest predicted growth. Upon examination, none of the predictions are positive, so the model does not predict positive growth for any company. The recommended portfolio is therefore the set of companies with the least negative predicted growth, and it is not necessarily expected to produce a profit. We present only the gvkey, tic, conm, and allocation columns to keep unnecessary information out of the portfolio presentation.
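
As a sanity check on the predictions, the linear model can be evaluated by hand for a few companies. This is a sketch under stated assumptions: the coefficients are copied from the printed model summary, so they are rounded to the displayed digits, the three rows are copied from the `head()` output of the transformed data, and `k = 2` is an arbitrary cutoff for the toy example.

```r
# Coefficients as printed in the model summary (rounded to the displayed digits).
b = c(intercept = -0.1185887, PC1 = 0.0002455, PC2 = 0.0006294)

# A few investment opportunities, copied from the head() output of the transformed data.
opps = data.frame(
  tic = c("AIR", "ABM", "ADI"),
  PC1 = c(3.472287, 2.796849, -4.313079),
  PC2 = c(-0.08425766, -0.10000479, 0.9635113),
  stringsAsFactors = FALSE
)

# Linear prediction: intercept + b1 * PC1 + b2 * PC2.
opps$predicted = b[["intercept"]] + b[["PC1"]] * opps$PC1 + b[["PC2"]] * opps$PC2

# Rank in descending order of predicted growth and keep the top k.
k = 2
top = opps[order(-opps$predicted), ][1:k, ]
print(top)
```

For these three companies the predictions all land close to the intercept, at roughly -0.118 to -0.119, because the PC1 and PC2 coefficients are small relative to the PC values shown. This is consistent with the observation that the ranking separates companies by small differences in predicted growth rather than by sign.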

### Predict & Make Portfolio Recommendation

In [14]:
# Use the model to predict growths of each investment opportunity.
# Recommend a portfolio of allocations to 12 investment opportunities: gvkey, tic, conm, allocation
data.final$predictions = predict(model, data.final)
portfolio = data.final[order(-data.final$predictions),]
portfolio = portfolio[1:portfolio_size,]
portfolio = cbind(portfolio[, c("gvkey", "tic", "conm")], allocation)

fmt(portfolio, "portfolio")

gvkey,tic,conm,allocation
23809,AZO,AUTOZONE INC,83333.33
29692,WEBC,WEBCO INDUSTRIES INC,83333.33
3570,CBRL,CRACKER BARREL OLD CTRY STOR,83333.33
63172,FDS,FACTSET RESEARCH SYSTEMS INC,83333.33
64344,MTN,VAIL RESORTS INC,83333.33
178704,ULTA,ULTA BEAUTY INC,83333.33
65430,PLCE,CHILDRENS PLACE INC,83333.33
10549,THO,THOR INDUSTRIES INC,83333.33
3504,COO,COOPER COS INC (THE),83333.33
7921,NDSN,NORDSON CORP,83333.33


### Store Portfolio Recommendation

In [15]:
# Store portfolio recommendation

write.csv(portfolio, paste0(analyst, ".csv"), row.names=FALSE)