# Project Part C: Classification

![](banner_project.jpg)

In [1]:
analyst = "Firstname Lastname" # Replace this with your name

In [2]:
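# Locate setup.R in the current directory or up to nine directories above it, then source it.
f = "setup.R"; for (i in 1:10) { if (file.exists(f)) break else f = paste0("../", f) }; source(f)
options(repr.matrix.max.rows=674)  # show up to 674 rows of a displayed table
options(repr.matrix.max.cols=200)  # show up to 200 columns of a displayed table
update_geom_defaults("point", list(size=1))  # ggplot2: draw points at size 1 by default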

## Directions

### Objective

Construct and evaluate a classifier trained on a transformed dataset of public company fundamentals.  Later, use the classifier along with additional analysis to recommend a portfolio of 12 company investments that maximizes the 12-month return on a total \$1,000,000 investment.

### Approach

Retrieve a transformed dataset.

Using a machine learning model construction method and the transformed data, construct a model that predicts whether a company's stock price will grow more than 30% over 12 months, given the previous 12 months of company fundamentals data.

Evaluate the model's business performance based on a business model and business parameters.

## Business Model

The business model is ...

$ \begin{align} profit = \left( \sum_{i \in portfolio} (1 + growth_i) \times allocation_i \right) - budget \end{align} $

<br>

$ profit\,rate = profit \div budget $


$ \begin{align} budget = \sum_{i \in portfolio} allocation_i \end{align} $

<br>

Business model parameters include ...

* Budget = \\$1,000,000: total investment to allocate across the companies in the portfolio
* Portfolio Size = 12: number of companies in the portfolio
* Allocations = \\$1,000,000 $\div$ 12 to each company: investments to allocate to specific companies in the portfolio 

Fill the portfolio with companies with the lowest gvkey values from among those you predict to grow above 30%.  If you predict fewer than the portfolio size to grow above 30%, then fill the rest of the portfolio with the remaining companies with lowest gvkey values.
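
The fill rule and the profit formula can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the data frame `toy`, its `prediction` column, the helper `pick.portfolio`, and the small toy budget and portfolio size are all hypothetical and are not part of the project data or of setup.R.

```r
# Hypothetical toy data: gvkey, predicted class, and realized 12-month growth.
toy = data.frame(gvkey      = c(1004, 1045, 1050, 1062, 1072, 1075),
                 prediction = c("NO", "NO", "YES", "NO", "YES", "NO"),
                 growth     = c(0.05, -0.38, 0.32, -0.22, -0.12, 0.00))

# Illustrative parameters, deliberately smaller than the project's.
toy.budget = 4000
toy.portfolio_size = 4
toy.allocation = rep(toy.budget/toy.portfolio_size, toy.portfolio_size)

# Hypothetical helper: predicted-YES companies come first, then the rest;
# within each group, companies are ordered by lowest gvkey.
pick.portfolio = function(d, portfolio_size) {
  d = d[order(d$prediction != "YES", d$gvkey), ]
  d[1:min(portfolio_size, nrow(d)), ]
}

portfolio = pick.portfolio(toy, toy.portfolio_size)

# profit = sum over portfolio of (1 + growth_i) * allocation_i, minus budget
profit = sum((1 + portfolio$growth) * toy.allocation) - toy.budget
profit_rate = profit / toy.budget

portfolio
data.frame(profit, profit_rate)
```

Only two companies are predicted to grow above 30% here, so the last two portfolio slots are filled with the lowest-gvkey companies among the rest.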

In [3]:
# Set the business parameters.

budget = 1000000
portfolio_size = 12
allocation = rep(budget/portfolio_size, portfolio_size)

layout(fmt(budget), fmt(portfolio_size), fmt(allocation))

budget
1000000

portfolio_size
12

allocation
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33


## Data

_<< Discuss this data retrieval. >>_

In [4]:
# Retrieve "My Data.csv"
data = read.csv("My Data.csv", header=TRUE)

# Present a few rows ...
data[1:6,]

gvkey,tic,conm,PC1,PC2,prccq,growth,big_growth
1004,AIR,AAR CORP,3.4371231,-0.2260719,43.69,0.0507455507,NO
1045,AAL,AMERICAN AIRLINES GROUP INC,-12.0332067,0.8045109,32.11,-0.3828560446,NO
1050,CECE,CECO ENVIRONMENTAL CORP,3.9532234,-0.7553386,6.75,0.3157894737,YES
1062,ASA,ASA GOLD AND PRECIOUS METALS,3.6561434,-0.7981915,8.66,-0.2164739518,NO
1072,AVX,AVX CORP,2.9282228,-0.71042,15.25,-0.1184971098,NO
1075,PNW,PINNACLE WEST CAPITAL CORP,0.3488491,1.1389605,85.2,0.0002347969,NO


## Classification Model

_<< Discuss this model construction and evaluation. >>_

### Build Model

In [5]:
# Construct a naive Bayes model to predict big_growth given PC1 and PC2 (use laplace=TRUE).
# Present a brief summary of the model parameters.



Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
        NO        YES 
0.91637631 0.08362369 

Conditional probabilities:
     PC1
Y           [,1]      [,2]
  NO  -0.2239142 13.299922
  YES  2.4537263  4.550796

     PC2
Y           [,1]     [,2]
  NO   0.0424303 7.676443
  YES -0.4649654 1.453473
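
For numeric predictors, the conditional probability tables in a summary like this hold a per-class mean (column `[,1]`) and standard deviation (column `[,2]`). The base R sketch below shows how such parameters, and the resulting posterior probabilities, could be computed by hand. It is a simplified stand-in for illustration, not the library's implementation; the toy data and the helper `posterior` are hypothetical.

```r
# Hypothetical toy data with the same column names as the project data.
toy = data.frame(PC1 = c(-1.0, 0.5,  1.5,  2.0,  3.0,  2.5),
                 PC2 = c( 0.2, 1.0, -0.5, -0.4, -0.6, -0.3),
                 big_growth = factor(c("NO", "NO", "NO", "NO", "YES", "YES")))

# A-priori probabilities: the relative frequency of each class.
prior = prop.table(table(toy$big_growth))

# Per-class mean and standard deviation of each predictor.
params = lapply(c(PC1="PC1", PC2="PC2"), function(v)
  cbind(mean = tapply(toy[[v]], toy$big_growth, mean),
        sd   = tapply(toy[[v]], toy$big_growth, sd)))

# Posterior for one observation: prior times the product of the
# per-predictor normal densities, normalized to sum to 1.
posterior = function(pc1, pc2) {
  score = as.numeric(prior) *
          dnorm(pc1, params$PC1[, "mean"], params$PC1[, "sd"]) *
          dnorm(pc2, params$PC2[, "mean"], params$PC2[, "sd"])
  setNames(score / sum(score), names(prior))
}

prior
params
post = posterior(2.8, -0.5)
post
```

The "naive" part is the product of densities: each predictor is treated as independent of the other within a class.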


### In-Sample Estimated Performance

In [6]:
# Present the model's in-sample estimated accuracy, profit, and profit rate at cutoff=0.5.


accuracy,profit,profit_rate
0.3082462,-80393.21,-0.0803932
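
As a reference point for reading this result, the a-priori probabilities reported for the model put the share of `NO` observations at about 0.916, so an accuracy near 0.31 is well below what predicting `NO` for every company would achieve. The sketch below shows one way accuracy at a cutoff could be computed in base R; the probability and class vectors are hypothetical.

```r
# Hypothetical predicted probabilities of big_growth == "YES" and the actual classes.
prob.yes = c(0.80, 0.40, 0.55, 0.10, 0.65)
actual   = c("YES", "NO", "NO", "NO", "YES")

cutoff = 0.5
prediction = ifelse(prob.yes >= cutoff, "YES", "NO")

# Accuracy: the fraction of predictions that match the actual class.
accuracy = mean(prediction == actual)

# Confusion table: shows which kind of error dominates.
table(prediction, actual)
accuracy
```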


### Out-of-Sample Estimated Performance

In [7]:
# Partition the data into training (75%) and validation (25%)
# (use set.seed(0) and sample(...) to choose training observations).
# How many observations and variables in the training data?
# How many observations and variables in the validation data?


size(data.train)
observations,variables
3228,8

size(data.dev)
observations,variables
1077,8
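
The two sizes sum to 4,305 observations. A 75%/25% holdout split with `set.seed(0)` and `sample(...)` might look like the sketch below, which uses a hypothetical 20-row data frame instead of the project data.

```r
# Hypothetical toy data standing in for the project data.
toy = data.frame(id = 1:20, x = (1:20) / 10)

set.seed(0)
train = sample(1:nrow(toy), 0.75 * nrow(toy))  # row numbers chosen for training

toy.train = toy[train, ]
toy.dev   = toy[-train, ]  # all remaining rows form the validation set

dim(toy.train)
dim(toy.dev)
```

Indexing with `-train` ensures the validation rows are exactly the ones not sampled, so no observation appears in both sets.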


In [8]:
# Present the model's out-of-sample estimated accuracy, profit, and profit rate at cutoff=0.5.


accuracy,profit,profit_rate
0.2989786,-120201.9,-0.1202019


### 5-Fold Cross-Validation Estimated Performance

In [9]:
# Partition the data into 5 folds (use set.seed(0) and createFolds(...)).
# Present the first few observation (row) numbers for each of the folds.
#
# You can use the str() function.


List of 5
 $ Fold1: int [1:861] 9 13 17 19 31 42 44 54 60 66 ...
 $ Fold2: int [1:861] 1 2 6 11 16 25 32 49 55 59 ...
 $ Fold3: int [1:861] 4 8 14 22 28 34 40 45 50 52 ...
 $ Fold4: int [1:861] 3 5 15 18 21 24 26 27 30 36 ...
 $ Fold5: int [1:861] 7 10 12 20 23 29 33 35 37 46 ...
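
Each of the five folds holds 861 row numbers, and no row number appears in more than one fold. The base R sketch below builds a list with that shape for illustration. It is a stand-in and is not `createFolds(...)` itself: for a factor outcome, `createFolds` from the caret package also tries to balance the class distribution across folds, which this simple shuffle does not do.

```r
n = 20  # hypothetical number of observations
k = 5   # number of folds

set.seed(0)
fold.id = sample(rep(1:k, length.out = n))  # shuffled fold label for each row
fold = split(1:n, fold.id)                  # list of row numbers, one element per fold
names(fold) = paste0("Fold", 1:k)

str(fold)
```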


In [10]:
# Present the model's estimated accuracy and profit at cutoff=0.5 for each fold.


fold,accuracy,profit
1,0.3042973,-221281.1
2,0.9163763,-90115.24
3,0.2659698,-28710.93
4,0.2950058,-89837.99
5,0.3066202,31939.77
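
Per-fold results like these come from a loop in which each fold serves once as the test set while the other four folds form the training set. The sketch below shows only that loop structure. The toy data, the fixed fold list, and the midpoint rule used in place of a real classifier are all hypothetical.

```r
# Hypothetical toy data and a fixed list of 5 folds (row numbers).
toy = data.frame(x = c(0.1, 0.9, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.95, 0.05),
                 y = c("NO", "YES", "YES", "NO", "YES", "NO", "YES", "NO", "YES", "NO"))
fold = list(Fold1 = c(1, 2), Fold2 = c(3, 4), Fold3 = c(5, 6),
            Fold4 = c(7, 8), Fold5 = c(9, 10))

result = data.frame(fold = 1:5, accuracy = NA)

for (i in 1:5) {
  toy.train = toy[-fold[[i]], ]  # the other four folds
  toy.test  = toy[fold[[i]], ]   # the held-out fold

  # Stand-in "model": a cutoff halfway between the class means of x,
  # estimated from the training rows only.
  midpoint = mean(tapply(toy.train$x, toy.train$y, mean))
  prediction = ifelse(toy.test$x >= midpoint, "YES", "NO")

  result$accuracy[i] = mean(prediction == toy.test$y)
}

result
```

The key point is that the model parameters are re-estimated inside the loop, so each fold is scored by a model that never saw it.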


In [11]:
# Present the model's 5-fold cross-validation estimated accuracy, profit, and profit rate at cutoff=0.5


accuracy.cv,profit.cv,profit_rate.cv
0.4176539,-79601.1,-0.0796011
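
These cross-validation estimates are consistent with taking the mean of the per-fold values and dividing the mean profit by the budget. The check below re-enters the per-fold numbers reported in this notebook; the name `fold.result` is introduced only for this illustration.

```r
# Per-fold values as reported in this notebook.
fold.result = data.frame(fold     = 1:5,
                         accuracy = c(0.3042973, 0.9163763, 0.2659698, 0.2950058, 0.3066202),
                         profit   = c(-221281.1, -90115.24, -28710.93, -89837.99, 31939.77))
budget = 1000000

accuracy.cv    = mean(fold.result$accuracy)
profit.cv      = mean(fold.result$profit)
profit_rate.cv = profit.cv / budget

data.frame(accuracy.cv, profit.cv, profit_rate.cv)
```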


<font size=1;>
<p style="text-align: left;">
Copyright (c) Berkeley Data Analytics Group, LLC
<span style="float: right;">
Document revised June 9, 2020
</span>
</p>
</font>