# Forecasting Elantra Sales

An important application of linear regression is understanding sales. Consider a company that produces and sells a product. In a given period, if the company produces more units than consumers will buy, it earns nothing on the unsold units and incurs additional costs for storing them in inventory until they can be sold. If it produces fewer units than consumers will buy, it earns less than it could have. Being able to predict consumer sales is therefore of first-order importance to the company.

In this problem, we will try to predict monthly sales of the Hyundai Elantra in the United States. The Hyundai Motor Company is a major automobile manufacturer based in South Korea. The Elantra is a car model that has been produced by Hyundai since 1990 and is sold all over the world, including the United States. We will build a linear regression model to predict monthly sales using economic indicators of the United States as well as Google search queries.

The file elantra.csv contains data for the problem. Each observation is a month, from January 2010 to February 2014. For each month, we have the following variables:

- Month = the month of the year for the observation (1 = January, 2 = February, 3 = March, ...).
- Year = the year of the observation.
- ElantraSales = the number of units of the Hyundai Elantra sold in the United States in the given month.
- Unemployment = the estimated unemployment percentage in the United States in the given month.
- Queries = a (normalized) approximation of the number of Google searches for "hyundai elantra" in the given month.
- CPI_energy = the monthly consumer price index (CPI) for energy for the given month.
- CPI_all = the consumer price index (CPI) for all products for the given month; this is a measure of the magnitude of the prices paid by consumer households for goods and services (e.g., food, clothing, electricity, etc.).

Load the data set. Split the data set into training and testing sets as follows: place all observations for 2012 and earlier in the training set, and all observations for 2013 and 2014 into the testing set.

In [2]:
elantra = read.csv('./dataset/elantra.csv')
str(elantra)

'data.frame':	50 obs. of  7 variables:
 $ Month       : int  1 1 1 1 1 2 2 2 2 2 ...
 $ Year        : int  2010 2011 2012 2013 2014 2010 2011 2012 2013 2014 ...
 $ ElantraSales: int  7690 9659 10900 12174 15326 7966 12289 13820 16219 16393 ...
 $ Unemployment: num  9.7 9.1 8.2 7.9 6.6 9.8 9 8.3 7.7 6.7 ...
 $ Queries     : int  153 259 354 230 232 130 266 296 239 240 ...
 $ CPI_energy  : num  213 229 244 243 248 ...
 $ CPI_all     : num  217 221 228 231 235 ...


In [3]:
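# Training set: all observations from 2012 and earlier
elantra_train = subset(elantra, Year <= 2012)
# Testing set: all observations from 2013 and 2014
elantra_test = subset(elantra, Year > 2012)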

In [4]:
nrow(elantra_train)
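
As a rough sanity check on the split, the expected row counts can be derived from the date range alone. The sketch below does not read elantra.csv; it rebuilds only the Month/Year skeleton described above (January 2010 to February 2014), and the names `calendar`, `train_skeleton` and `test_skeleton` are illustrative rather than part of the original notebook.

```r
# Rebuild the Month/Year grid implied by the data description
# (assumption: one row per month, January 2010 through February 2014).
calendar <- expand.grid(Month = 1:12, Year = 2010:2014)
calendar <- subset(calendar, !(Year == 2014 & Month > 2))

# Apply the same year-based rule used for the real data frames.
train_skeleton <- subset(calendar, Year <= 2012)
test_skeleton  <- subset(calendar, Year > 2012)

nrow(calendar)        # 50, matching the str() output
nrow(train_skeleton)  # 36 months: 2010 through 2012
nrow(test_skeleton)   # 14 months: 2013 plus January and February 2014
```

If `nrow(elantra_train)` differs from the count obtained this way, the split condition or the input file is worth rechecking.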

Build a linear regression model to predict monthly Elantra sales using Unemployment, CPI_all, CPI_energy and Queries as the independent variables. Use all of the training set data to do this.

In [5]:
salesReg = lm(ElantraSales ~ Unemployment + CPI_all + CPI_energy + Queries, data=elantra_train)
summary(salesReg)


Call:
lm(formula = ElantraSales ~ Unemployment + CPI_all + CPI_energy + 
    Queries, data = elantra_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-6785.2 -2101.8  -562.5  2901.7  7021.0 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   95385.36  170663.81   0.559    0.580
Unemployment  -3179.90    3610.26  -0.881    0.385
CPI_all        -297.65     704.84  -0.422    0.676
CPI_energy       38.51     109.60   0.351    0.728
Queries          19.03      11.26   1.690    0.101

Residual standard error: 3295 on 31 degrees of freedom
Multiple R-squared:  0.4282,	Adjusted R-squared:  0.3544 
F-statistic: 5.803 on 4 and 31 DF,  p-value: 0.00132
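
The adjusted R-squared and the F-statistic in this summary can be recovered from the multiple R-squared and the degrees of freedom. The sketch below is only a consistency check on the printed numbers, not part of the original analysis. The variable names are illustrative, and the sample size is inferred from the 31 residual degrees of freedom plus the five estimated coefficients.

```r
# Values copied from the summary output above (rounded as printed).
r_squared   <- 0.4282
residual_df <- 31
n_pred      <- 4                          # Unemployment, CPI_all, CPI_energy, Queries
n_obs       <- residual_df + n_pred + 1   # inferred sample size: 36

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / residual_df

# Overall F-statistic: explained variance per predictor over
# unexplained variance per residual degree of freedom.
f_stat <- (r_squared / n_pred) / ((1 - r_squared) / residual_df)

round(adj_r_squared, 4)  # 0.3544, as reported
f_stat                   # close to the reported 5.803 (inputs are rounded)
```

Because the inputs are the rounded values from the printout, the recomputed F-statistic agrees with the summary only approximately.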


The R-squared of our model is relatively low (0.4282), so we would now like to improve the model. In modeling demand and sales, it is often useful to model seasonality. Seasonality refers to the fact that demand is often cyclical or periodic in time. For example, in countries with distinct seasons, demand for warm outerwear (like jackets and coats) is higher in fall/autumn and winter (due to the colder weather) than in spring and summer. (In contrast, demand for swimsuits and sunscreen is higher in the summer than in the other seasons.) Another example is the "back to school" period in North America: demand for stationery (pencils, notebooks and so on) in late July and all of August is higher than in the rest of the year due to the start of the school year in September.

In our problem, since our data includes the month of the year in which the units were sold, it is feasible for us to incorporate monthly seasonality. From a modeling point of view, it is reasonable to expect that the month has an effect on how many Elantra units are sold.

To incorporate the seasonal effect due to the month, build a new linear regression model that predicts monthly Elantra sales using Month as well as Unemployment, CPI_all, CPI_energy and Queries. Do not modify the training and testing data frames before building the model.

In [6]:
salesReg2 = lm(ElantraSales ~ Month + Unemployment + CPI_all + CPI_energy + Queries, data=elantra_train)
summary(salesReg2)


Call:
lm(formula = ElantraSales ~ Month + Unemployment + CPI_all + 
    CPI_energy + Queries, data = elantra_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-6416.6 -2068.7  -597.1  2616.3  7183.2 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  148330.49  195373.51   0.759   0.4536  
Month           110.69     191.66   0.578   0.5679  
Unemployment  -4137.28    4008.56  -1.032   0.3103  
CPI_all        -517.99     808.26  -0.641   0.5265  
CPI_energy       54.18     114.08   0.475   0.6382  
Queries          21.19      11.98   1.769   0.0871 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3331 on 30 degrees of freedom
Multiple R-squared:  0.4344,	Adjusted R-squared:  0.3402 
F-statistic: 4.609 on 5 and 30 DF,  p-value: 0.003078
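
One way to read the Month coefficient: because Month enters this model as a plain number, the fitted model assumes that predicted sales change by the same amount for every one-month step, all else being equal. The sketch below works through that arithmetic using the estimate printed above; `month_coef` and `month_effect` are illustrative names, not part of the original notebook.

```r
# Month estimate copied from the summary output above.
month_coef <- 110.69

# Predicted change in sales when moving from one month to another,
# holding the other predictors fixed (hypothetical helper).
month_effect <- function(from, to) {
  month_coef * (to - from)
}

month_effect(1, 3)  # March vs. January: 221.38
month_effect(1, 5)  # May vs. January:   442.76
```

Under this specification, the predicted gap between January and May is exactly twice the gap between January and March, which is a strong assumption for a seasonal pattern. The summary is consistent with that concern: the Month term is not significant (p = 0.5679), and the adjusted R-squared falls from 0.3544 to 0.3402 after adding it.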
