## Project information:
### Program: Udacity | Business Analyst Nanodegree
### Title: Project 1 | Mail catalog ROI predictive analysis
### Date started: 2018.03.15

## Business problem:
You recently started working for a company that manufactures and sells high-end home goods. Last year the company sent out its first print catalog, and is preparing to send out this year's catalog in the coming months. The company has 250 new customers from their mailing list that they want to send the catalog to.

Your manager has been asked to determine how much profit the company can expect from sending a catalog to these customers. You, the business analyst, are assigned to help your manager run the numbers. While fairly knowledgeable about data analysis, your manager is not very familiar with predictive models.

You’ve been asked to predict the expected profit from these 250 new customers. Management does not want to send the catalog out to these new customers unless the expected profit contribution exceeds $10,000.

Details:
- The cost of printing and distributing is \$6.50 per catalog.
- The average gross margin (price - cost) on all products sold through the catalog is 50%.
- Multiply revenue by the gross margin before subtracting the \$6.50 cost when calculating profit.
- Write a short report for your manager with your recommendation and the reasons the company should follow it.
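
The profit rule in the details above can be sketched as a small R function. This is only an illustration: the function name `expected_profit`, its argument names, and the example customer are hypothetical, and it assumes the catalog cost is paid for every customer mailed while the sale is weighted by the chance of a response.

```r
# Constants taken from the problem statement above
gross_margin <- 0.50   # average gross margin on catalog sales
catalog_cost <- 6.50   # printing and distribution cost per catalog

# Expected profit for one mailed catalog.
# `predicted_sale` and `p_respond` are hypothetical argument names.
expected_profit <- function(predicted_sale, p_respond) {
  predicted_sale * p_respond * gross_margin - catalog_cost
}

# Hypothetical customer: $400 predicted sale, 30% chance of responding
example_profit <- expected_profit(400, 0.30)
example_profit   # 400 * 0.30 * 0.50 - 6.50
```

Summing this quantity over all 250 customers gives the figure to compare against the \$10,000 threshold.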

### What decision needs to be made?
The primary decision to be made is whether or not to send a mail catalog to new members on the company mailing list.
### What data is needed to inform that decision?
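The decision requires historical data on past catalog mailings: who responded, how much they spent, and any other covariates that are useful for prediction and feasible to gather. For the new customers, it also requires each customer's probability of responding, the average gross margin (50%), and the cost per catalog (\$6.50).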

### Libraries

In [51]:
library( tidyverse )
library( PerformanceAnalytics )

Loading required package: xts
Loading required package: zoo

Attaching package: ‘zoo’

The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric


Attaching package: ‘xts’

The following objects are masked from ‘package:dplyr’:

    first, last


Attaching package: ‘PerformanceAnalytics’

The following object is masked from ‘package:graphics’:

    legend



### Load up data

In [16]:
df.train <- readxl::read_excel( '../data/raw/p1-customers.xlsx' )
df.pred <- readxl::read_excel( '../data/raw/p1-mailinglist.xlsx')

### Munge

In [19]:
# Recode the Yes/No response strings as "0"/"1" (the column stays character)
df.train <- within( df.train, {
        Responded_to_Last_Catalog[ Responded_to_Last_Catalog == 'No' ]  <- 0
        Responded_to_Last_Catalog[ Responded_to_Last_Catalog == 'Yes' ] <- 1 })

In [73]:
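# Convert the recoded response and the store number to factors.
# Run this cell once: re-running it on the already-converted factor
# matches no level in c( 0, 1 ) and turns every value into NA.
df.train <- transform( df.train
                     , Responded_to_Last_Catalog = factor( Responded_to_Last_Catalog
                                                         , levels = c( 0, 1 )
                                                         , labels = c( 'No response', 'Responded' )
                                                         )
                     , Store_Number = factor( Store_Number )
                     )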
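
The `str()` output further down shows `Responded_to_Last_Catalog` as all `NA`, which is what `factor()` returns when none of the values match `levels`. One possible cause (an assumption, not confirmed by the notebook) is that the conversion cell was run more than once. A minimal sketch of a conversion that is safe to repeat, using a made-up data frame `toy` and a hypothetical helper `to_response_factor`:

```r
# Toy stand-in for df.train; the values are made up for illustration
toy <- data.frame(
  Responded_to_Last_Catalog = c("No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Map the raw strings straight to labelled factor levels
to_response_factor <- function(x) {
  if (is.factor(x)) return(x)   # already converted, leave untouched
  factor(x, levels = c("No", "Yes"), labels = c("No response", "Responded"))
}

toy$Responded_to_Last_Catalog <- to_response_factor(toy$Responded_to_Last_Catalog)
# Running the conversion a second time changes nothing
toy$Responded_to_Last_Catalog <- to_response_factor(toy$Responded_to_Last_Catalog)

table(toy$Responded_to_Last_Catalog, useNA = "ifany")
```

Skipping the intermediate 0/1 recode also removes one place where the two steps can drift out of sync.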

In [89]:
df.pred <- transform( df.pred
                    , Store_Number = factor( Store_Number )
                    )
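
Because `Store_Number` is converted to a factor separately in each data frame, the two factors could end up with different level sets. A hedged sketch of one way to keep them aligned, using made-up data frames `train` and `pred` in place of `df.train` and `df.pred`:

```r
# Toy stand-ins for df.train and df.pred; store numbers are made up
train <- data.frame(Store_Number = factor(c(100, 101, 102, 106)))
pred  <- data.frame(Store_Number = c(101, 106, 100))

# Reuse the training levels so both factors share one coding
pred$Store_Number <- factor(pred$Store_Number,
                            levels = levels(train$Store_Number))

# A store that never appears in the training data would become NA here
has_unknown_store <- anyNA(pred$Store_Number)
has_unknown_store
```

Checking for `NA` before scoring surfaces unseen stores early, rather than as an error from `predict()`.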

### Modeling

In [74]:
str( df.train )

'data.frame':	2375 obs. of  15 variables:
 $ Name                      : chr  "Pamela Wright" "Danell Valdez" "Jessica Rinehart" "Nancy Clark" ...
 $ Customer_Segment          : chr  "Store Mailing List" "Store Mailing List" "Store Mailing List" "Store Mailing List" ...
 $ Customer_ID               : num  2 7 8 9 10 11 12 16 17 19 ...
 $ Address                   : chr  "376 S Jasmine St" "12066 E Lake Cir" "7225 S Gaylord St" "4497 Cornish Way" ...
 $ City                      : chr  "Denver" "Greenwood Village" "Centennial" "Denver" ...
 $ State                     : chr  "CO" "CO" "CO" "CO" ...
 $ ZIP                       : num  80224 80111 80122 80239 80206 ...
 $ Avg_Sale_Amount           : num  228 55 213 195 111 ...
 $ Store_Number              : Factor w/ 10 levels "100","101","102",..: 1 6 2 6 1 7 9 4 8 3 ...
 $ Responded_to_Last_Catalog : Factor w/ 2 levels "No response",..: NA NA NA NA NA NA NA NA NA NA ...
 $ Avg_Num_Products_Purchased: num  1 1 1 1 1 1 1 3 2 1 ...
 $ X._Y

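#### Calculate margin and profit

In [75]:
# Gross margin is 50% of the average sale amount
df.train$margin <- df.train$Avg_Sale_Amount / 2
# Profit per mailed customer: margin less the $6.50 catalog cost
df.train$profit <- df.train$margin - 6.50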

In [76]:
fit.train <- lm( margin
               ~ ZIP
               + Customer_Segment
               + Store_Number
               , data = df.train )

In [77]:
summary( fit.train )


Call:
lm(formula = margin ~ ZIP + Customer_Segment + Store_Number, 
    data = df.train)

Residuals:
    Min      1Q  Median      3Q     Max 
-501.29  -34.79    0.60   35.63  937.85 

Coefficients:
                                               Estimate Std. Error t value
(Intercept)                                  -1.054e+03  1.680e+03  -0.627
ZIP                                           1.750e-02  2.095e-02   0.835
Customer_SegmentLoyalty Club and Credit Card  1.961e+02  7.876e+00  24.902
Customer_SegmentLoyalty Club Only            -1.434e+02  5.694e+00 -25.183
Customer_SegmentStore Mailing List           -2.627e+02  5.026e+00 -52.269
Store_Number101                              -2.538e+00  7.610e+00  -0.334
Store_Number102                              -3.591e+00  1.153e+01  -0.311
Store_Number103                              -7.081e-01  8.224e+00  -0.086
Store_Number104                              -1.005e+01  8.030e+00  -1.252
Store_Number105                              -9.841

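Removing ZIP: its coefficient is not significant in the fit above (t = 0.835), so the model is refit without it.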

In [78]:
fit.train <- lm( margin
               ~ Customer_Segment
               + Store_Number
               , data = df.train )
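
Beyond reading a single t value, one common way to check whether a dropped term mattered is a nested-model F test with `anova()`. The sketch below uses simulated data, not the project data; the data frame `sim` and its columns are made up for illustration.

```r
set.seed(1)
# Simulated stand-in for df.train; none of these values come from the project data
n <- 200
sim <- data.frame(
  segment = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  zip     = sample(80000:80300, n, replace = TRUE)
)
sim$margin <- 100 + 50 * (sim$segment == "B") + rnorm(n, sd = 20)

fit_full    <- lm(margin ~ zip + segment, data = sim)
fit_reduced <- lm(margin ~ segment, data = sim)

# F test for the dropped term; a large p-value supports the smaller model
cmp <- anova(fit_reduced, fit_full)
cmp
```

For a single numeric term such as `zip`, this F test is equivalent to the t test in `summary()`; it becomes more useful when dropping a whole factor at once.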

In [79]:
summary( fit.train )


Call:
lm(formula = margin ~ Customer_Segment + Store_Number, data = df.train)

Residuals:
    Min      1Q  Median      3Q     Max 
-498.69  -34.58    0.92   35.64  938.38 

Coefficients:
                                             Estimate Std. Error t value
(Intercept)                                   349.414      6.352  55.008
Customer_SegmentLoyalty Club and Credit Card  196.061      7.875  24.896
Customer_SegmentLoyalty Club Only            -143.392      5.694 -25.185
Customer_SegmentStore Mailing List           -262.719      5.025 -52.279
Store_Number101                                -2.113      7.592  -0.278
Store_Number102                                -5.493     11.304  -0.486
Store_Number103                                -2.135      8.044  -0.265
Store_Number104                               -12.122      7.637  -1.587
Store_Number105                               -11.472      7.401  -1.550
Store_Number106                               -19.807      7.541  -2.627
Store_Num

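Including only Store 106: among the store levels shown above, Store 106 has the largest t value (-2.627), so a flag for it is created below. Note that the refit in this section still uses the full `Store_Number` factor rather than `store_106_flg`, so its summary is identical to the previous fit.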

In [80]:
df.train$store_106_flg <- as.numeric( df.train$Store_Number == '106' )

In [84]:
fit.train <- lm( margin
               ~ Customer_Segment
               + Store_Number 
               , data = df.train )
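
A model that keeps only the Store 106 effect would put the flag in the formula in place of `Store_Number`. The following is a sketch on simulated data, assuming store numbers 100 to 109 and made-up effect sizes; it is not the fit used for the results in this notebook.

```r
set.seed(2)
# Simulated stand-in for df.train; values are illustrative only
n <- 300
sim <- data.frame(
  Customer_Segment = factor(sample(c("Credit Card Only", "Loyalty Club Only"),
                                   n, replace = TRUE)),
  Store_Number     = factor(sample(100:109, n, replace = TRUE))
)
sim$store_106_flg <- as.numeric(sim$Store_Number == "106")
sim$margin <- 300 - 140 * (sim$Customer_Segment == "Loyalty Club Only") -
  20 * sim$store_106_flg + rnorm(n, sd = 30)

# The flag replaces the 10-level factor: one store coefficient instead of nine
fit_flag <- lm(margin ~ Customer_Segment + store_106_flg, data = sim)
names(coef(fit_flag))
```

With this formula, the data frame being scored would need the same `store_106_flg` column before calling `predict()`.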

In [85]:
summary( fit.train )


Call:
lm(formula = margin ~ Customer_Segment + Store_Number, data = df.train)

Residuals:
    Min      1Q  Median      3Q     Max 
-498.69  -34.58    0.92   35.64  938.38 

Coefficients:
                                             Estimate Std. Error t value
(Intercept)                                   349.414      6.352  55.008
Customer_SegmentLoyalty Club and Credit Card  196.061      7.875  24.896
Customer_SegmentLoyalty Club Only            -143.392      5.694 -25.185
Customer_SegmentStore Mailing List           -262.719      5.025 -52.279
Store_Number101                                -2.113      7.592  -0.278
Store_Number102                                -5.493     11.304  -0.486
Store_Number103                                -2.135      8.044  -0.265
Store_Number104                               -12.122      7.637  -1.587
Store_Number105                               -11.472      7.401  -1.550
Store_Number106                               -19.807      7.541  -2.627
Store_Num

### Score the new dataset

In [91]:
df.pred$margin <- predict( fit.train, newdata = df.pred )

In [94]:
df.pred$prob_margin <- df.pred$Score_Yes * df.pred$margin

In [95]:
df.pred

| Name | Customer_Segment | Customer_ID | Address | City | State | ZIP | Store_Number | Avg_Num_Products_Purchased | X._Years_as_Customer | Score_No | Score_Yes | margin | prob_margin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A Giametti | Loyalty Club Only | 2213 | 5326 S Lisbon Way | Centennial | CO | 80015 | 105 | 3 | 0.2 | 0.6949642 | 0.3050358 | 194.5506 | 59.34490 |
| Abby Pierson | Loyalty Club and Credit Card | 2785 | 4344 W Roanoke Pl | Denver | CO | 80236 | 101 | 6 | 0.6 | 0.5272755 | 0.4727245 | 543.3626 | 256.86082 |
| Adele Hallman | Loyalty Club Only | 2931 | 5219 S Delaware St | Englewood | CO | 80110 | 101 | 7 | 0.9 | 0.4211182 | 0.5788819 | 203.9096 | 118.03959 |
| Alejandra Baird | Loyalty Club Only | 2231 | 2301 Lawrence St | Denver | CO | 80205 | 103 | 2 | 0.6 | 0.6948622 | 0.3051378 | 203.8871 | 62.21367 |
| Alice Dewitt | Loyalty Club Only | 2530 | 5549 S Hannibal Way | Centennial | CO | 80015 | 104 | 4 | 0.5 | 0.6122941 | 0.3877059 | 193.9000 | 75.17617 |
| Amanda Donahoe | Credit Card Only | 1946 | 10093 E Warren Ave | Denver | CO | 80247 | 105 | 7 | 0.7 | 0.7327217 | 0.2672783 | 337.9426 | 90.32473 |
| Amanda Huerta | Loyalty Club and Credit Card | 1212 | 3889 Aldenbridge Cir | Highlands Ranch | CO | 80126 | 101 | 4 | 1.0 | 0.7782605 | 0.2217395 | 543.3626 | 120.48494 |
| Angie Reffel | Credit Card Only | 369 | 4502 S Buckley Way | Aurora | CO | 80015 | 104 | 6 | 0.2 | 0.8065529 | 0.1934471 | 337.2920 | 65.24818 |
| Anh Tran | Credit Card Only | 1683 | 7328 E Maple Ave | Denver | CO | 80230 | 100 | 6 | 0.0 | 0.7493424 | 0.2506576 | 349.4142 | 87.58334 |
| Anna Crumrine | Loyalty Club Only | 1940 | 7354 S Catawba Way | Aurora | CO | 80016 | 102 | 4 | 0.9 | 0.7354768 | 0.2645232 | 200.5296 | 53.04471 |


In [96]:
sum( df.pred$prob_margin )
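
The sum above is a probability-weighted gross margin and does not yet include the \$6.50 cost of each catalog. A minimal sketch of the remaining step, using the `Score_Yes` and `margin` values of the first three scored rows shown in this notebook; the data frame name `scored` is made up for the example.

```r
catalog_cost <- 6.50   # per-catalog cost from the problem statement

# First three rows of the scored output shown in this notebook
scored <- data.frame(
  Score_Yes = c(0.3050358, 0.4727245, 0.5788819),
  margin    = c(194.5506, 543.3626, 203.9096)
)
scored$prob_margin <- scored$Score_Yes * scored$margin

# Expected profit = probability-weighted margin minus the cost of every catalog sent
expected_profit <- sum(scored$prob_margin) - catalog_cost * nrow(scored)
expected_profit
```

Applied to the full data frame, the same expression would be `sum( df.pred$prob_margin ) - 6.50 * nrow( df.pred )`.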

### Preliminary answer:
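Yes. Send the catalog. The estimated gross margin (the sum of `prob_margin`) is about \$23,000. Subtracting the catalog cost for 250 customers (250 × \$6.50 = \$1,625) still leaves the expected profit well above the \$10,000 threshold.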

## TODO: 
- Review the submission template and make additions to the R notebook
- Check the submission against the rubric at: https://review.udacity.com/#!/rubrics/186/view
- Complete the other sections of the project