# Combining predictive techniques

#### Description

The capstone project has three main tasks, each of which requires you to use skills you developed during the Nanodegree program. Once you complete all three tasks, please submit the project as a PDF. 

#### Importing required modules

In [1]:
from IPython.display import Image, HTML, display
import os, glob, sys
import pandas as pd
import numpy as np
import scipy as sci
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn as sk

## Task 1: Store Format for Existing Stores

Your company currently has 85 grocery stores and is planning to open 10 new stores at the beginning of the year. Currently, all stores use the same store format for selling their products. Up until now, the company has treated all stores similarly, shipping the same amount of product to each store. This is beginning to cause problems as stores are suffering from product surpluses in some product categories and shortages in others. You've been asked to provide analytical support to make decisions about store formats and inventory planning.

<img src="Images/Task1.JPG" style="width: 300px;">

### Determining Store Format

To remedy the product surplus and shortages, the company wants to introduce different store formats. Each store format will have a different product selection in order to better match local demand. The actual building sizes will not change, just the product selection and internal layouts. The terms "formats" and "segments" will be used interchangeably throughout this project. You’ve been asked to:

- Determine the optimal number of store formats based on sales data.
    - Sum sales data by StoreID and Year
    - Use percentage sales per category per store for clustering (category sales as a percentage of total store sales).
    - Use only 2015 sales data.
    - Use a K-means clustering model.
- Segment the 85 current stores into the different store formats.
- Use the StoreSalesData.csv and StoreInformation.csv files.
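The data-preparation steps in the list above can be sketched in pandas. This is a minimal illustration on an invented miniature table, not the Alteryx workflow itself; the column names (`Store`, `Year`, `Dry_Grocery`, `Produce`) are assumptions and may not match StoreSalesData.csv.

```python
import pandas as pd

# Hypothetical miniature of StoreSalesData.csv; real column names may differ.
sales = pd.DataFrame({
    "Store": ["S0001", "S0001", "S0002", "S0002", "S0001"],
    "Year": [2015, 2015, 2015, 2015, 2014],
    "Dry_Grocery": [600.0, 200.0, 300.0, 100.0, 999.0],
    "Produce": [150.0, 50.0, 500.0, 100.0, 999.0],
})
category_cols = ["Dry_Grocery", "Produce"]

# Keep 2015 only, then sum each category per store.
totals_2015 = (
    sales[sales["Year"] == 2015]
    .groupby("Store")[category_cols]
    .sum()
)

# Express each category as a percentage of the store's total sales.
pct_sales = totals_2015.div(totals_2015.sum(axis=1), axis=0) * 100
print(pct_sales.round(1))
```

Using shares rather than raw totals keeps large stores from dominating the clustering simply because they sell more of everything.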


***Alteryx workflow:***

<img src="Images/Alteryx1.PNG" alt="Drawing" style="width: 900px;"/>

#### 1.	What is the optimal number of store formats? How did you arrive at that number?

The optimal number of store formats (clusters) is three. This number was chosen based on the *Adjusted Rand* and *Calinski-Harabasz* indices, as shown below:

<img src="Images/Kmeans_Report.PNG" alt="Drawing" style="width: 900px;"/>

<img src="Images/Clusters_Indexs.PNG" alt="Drawing" style="width: 500px;"/>

Both indices reach their highest values when the stores are grouped into three clusters, also taking into account the number of outliers that fall into each cluster. Three clusters are therefore used to group the stores according to their sales.
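As a rough Python counterpart to this selection step, the sketch below scores K-means solutions for several values of k with scikit-learn's `calinski_harabasz_score`. The data are synthetic, the standardization step is an assumption, and only the Calinski-Harabasz index is computed; the Adjusted Rand index compares pairs of labelings and is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the percentage-sales features: three separated groups.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])
X_scaled = StandardScaler().fit_transform(X)  # assumption: z-score scaling

# Fit K-means for each candidate k and record the index (higher is better).
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = calinski_harabasz_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
print({k: round(v, 1) for k, v in scores.items()}, "best k:", best_k)
```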

#### 2.	How many stores fall into each store format?

The number of stores assigned to each cluster is shown below:

<img src="Images/Kmeans_summary.PNG" alt="Drawing" style="width: 900px;"/>

#### 3.	Based on the results of the clustering model, what is one way that the clusters differ from one another?

The differences among the clusters are illustrated below for three sales features: Dry Grocery, Bakery and Produce. The remaining sales features show the same behaviour. The next figures relate the summed sales of these three features to the clusters found.

<img src="Images/ClustersFeatures1.PNG" alt="Drawing" style="width: 900px;"/>

***Public link:***

[Clusters characteristics - Tableau Dashboard](https://public.tableau.com/profile/alfonso.sanchez#!/vizhome/ClustersCharacteristics/Dashboard2?publish=yes)

The figure on the left shows that the expected sales for each of these features fall, to some extent, into distinct ranges for each cluster. This is what allows the stores to be classified into the assigned clusters. The classification is easier to understand when the expected ranges of the sales features are analyzed together, as in the figure on the right.

#### 4.	Please provide a Tableau visualization (saved as a Tableau Public file) that shows the location of the stores, uses color to show cluster, and size to show total sales.

<img src="Images/ClustersLocations.PNG" alt="Drawing" style="width: 700px;"/>

***Public link:***

[Clusters Locations - Tableau Dashboard](https://public.tableau.com/profile/alfonso.sanchez#!/vizhome/ClustersLocations/Dashboard1?publish=yes)

## Task 2: Formats for New Stores 

The grocery store chain has 10 new stores opening up at the beginning of the year. The company wants to determine which store format each of the new stores should have. However, we don’t have sales data for these new stores yet, so we’ll have to determine the format using each new store’s demographic data.

<img src="Images/Task2.JPG" alt="Drawing" style="width: 500px;"/>

### Determine the Store Format for New Stores


- Develop a model that predicts which segment a store falls into based on the demographic and socioeconomic characteristics of the population that resides in the area around each new store.
    - Use a 20% validation sample with Random Seed = 3 when creating samples with which to compare the accuracy of the models. Make sure to compare a decision tree, forest, and boosted model.
- Use the model to predict the best store format for each of the 10 new stores.
- Use the StoreDemographicData.csv file, which contains the information for the area around each store.

***Alteryx workflow:***

<img src="Images/Alteryx2.PNG" alt="Drawing" style="width: 900px;"/>

#### 1.	What methodology did you use to predict the best store format for the new stores? Why did you choose that methodology? (Remember to Use a 20% validation sample with Random Seed = 3 to test differences in models.)

The methodology used to find the best format (cluster) for each new store started with PCA, to reduce the number of features used for the classification. The first 10 components were found to contain almost 90% of the variance of the features, as presented below:

<img src="Images/PCA1.PNG" alt="Drawing" style="width: 700px;"/>

<img src="Images/PCA2.PNG" alt="Drawing" style="width: 400px;"/>

Thus, the 10 components obtained from the PCA are used to classify the new stores into the clusters found in Task 1. The next figure supports this choice: it shows how the stores can be separated into clusters according to their component values:
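The "how many components reach 90% of the variance" check can be sketched with NumPy alone. The matrix below is synthetic (85 rows and 20 columns are illustrative sizes, not the real demographic file), and standardizing the columns before the decomposition is an assumption about the preprocessing.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in: 85 stores x 20 correlated demographic indicators.
latent = rng.normal(size=(85, 4))
mixing = rng.normal(size=(4, 20))
X = latent @ mixing + 0.3 * rng.normal(size=(85, 20))

# Standardize each column, then take the singular values of the result.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
singular_values = np.linalg.svd(Z, compute_uv=False)

# Each squared singular value is proportional to a component's variance.
explained = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative share reaches 90%.
n_components = int(np.searchsorted(cumulative, 0.90) + 1)
print(n_components, cumulative[:n_components].round(3))
```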

<img src="Images/PCAAnalysis1st.PNG" alt="Drawing" style="width: 700px;"/>

***Public link:***

[PCA analysis (1st component) - Tableau Dashboard](https://public.tableau.com/profile/alfonso.sanchez#!/vizhome/PCAAnalysis_16115551864340/Dashboard1?publish=yes)

The next step was to balance the classes, to avoid bias toward one specific cluster. A Decision Tree, a Random Forest and a Boosted model were then fitted to find the best model for the classification task. The performance of these models is shown in the next figure:
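A comparable three-model comparison can be sketched with scikit-learn. The data are generated with `make_classification` as a stand-in for the PCA components and cluster labels, the class-balancing step is not reproduced, and seed 3 in scikit-learn will not select the same rows as Random Seed = 3 in Alteryx.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 10 PCA components (X) and the 3 clusters (y).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=3, n_clusters_per_class=1, random_state=3,
)

# 20% validation sample; stratify so every cluster appears in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, random_state=3, stratify=y
)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=3),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=3),
    "boosted": GradientBoostingClassifier(random_state=3),
}

# Fit each model on the training part and score it on the validation part.
results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_val)
    results[name] = {
        "accuracy": accuracy_score(y_val, pred),
        "macro_f1": f1_score(y_val, pred, average="macro"),
    }
    print(name, results[name])
```

Macro-averaged F1 is used here because it weights each cluster equally, which matters when one format has far fewer stores than the others.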

<img src="Images/Model1.PNG" alt="Drawing" style="width: 700px;"/>

As a result, the boosted model was chosen for the classification, based on its confusion matrix, accuracy and F1 score. The most important variables for explaining the relationship between demographic indicators and store formats are presented next:

<img src="Images/VaraibleImportance.PNG" alt="Drawing" style="width: 350px;"/>

The PCA components are directly related to the demographic indicators as shown below for the second component:

<img src="Images/PCA2_Components.PNG" alt="Drawing" style="width: 700px;"/>

#### 2.	What format do each of the 10 new stores fall into? Please fill in the table below.

In [2]:
path = os.path.join(os.getcwd(),'Solution_2nd.csv')
df = pd.read_csv(path)
display(df)

Unnamed: 0,Store,Cluster
0,S0086,1
1,S0087,2
2,S0088,1
3,S0089,2
4,S0090,2
5,S0091,3
6,S0092,2
7,S0093,3
8,S0094,2
9,S0095,2


## Task 3: Forecasting

Fresh produce has a short life span, and due to increasing costs, the company wants to have an accurate monthly sales forecast.

<img src="Images/Task3.JPG" alt="Drawing" style="width: 500px;"/>

#### Task 3: Forecasting Produce Sales

You’ve been asked to prepare a monthly forecast for produce sales for the full year of 2016 for both existing and new stores. To do so, follow the steps below.

Note: Use a 6-month holdout sample for the TS Compare tool. We do not have much data, so a 12-month holdout would remove too much of it.

**Step 1:** To forecast produce sales for existing stores you should aggregate produce sales across all stores by month and create a forecast.

***Alteryx workflow:***

<img src="Images/Alteryx3a.PNG" alt="Drawing" style="width: 900px;"/>

**Step 2:** To forecast produce sales for new stores:

- Forecast produce sales (not total sales) for the average store (rather than the aggregate) for each segment.
- Multiply the average store produce sales forecast by the number of new stores in that segment.
- For example, if the forecasted average store produce sales for segment 1 for March is 10,000, and there are 4 new stores in segment 1, the forecast for the new stores in segment 1 would be 40,000.
- Sum the new stores produce sales forecasts for each of the segments to get the forecast for all new stores.
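The arithmetic in Step 2 can be written out for a single month as follows. Segment 1 reuses the 10,000 and 4-store figures from the example above; every other number is invented purely for illustration.

```python
# Hypothetical average-store produce forecasts per segment for one month.
# Segment 1 uses the figure from the worked example; the rest are invented.
avg_store_forecast = {1: 10_000.0, 2: 12_000.0, 3: 8_000.0}

# Hypothetical number of new stores assigned to each segment.
new_store_counts = {1: 4, 2: 3, 3: 3}

# Average-store forecast times the number of new stores in that segment.
segment_forecast = {
    seg: avg_store_forecast[seg] * new_store_counts[seg]
    for seg in avg_store_forecast
}

# Sum over segments to get the forecast for all new stores.
new_stores_total = sum(segment_forecast.values())

print(segment_forecast[1])  # 40000.0
print(new_stores_total)     # 100000.0
```

The same calculation is repeated for each forecast month.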

***Alteryx workflow:***

<img src="Images/Alteryx3b.PNG" alt="Drawing" style="width: 900px;"/>

**Step 3:** Sum the forecasts of the existing and new stores together for the total produce sales forecast. 

***Alteryx workflow:***

<img src="Images/Alteryx3c.PNG" alt="Drawing" style="width: 900px;"/>

#### 1. What type of ETS or ARIMA model did you use for each forecast? Use ETS(a,m,n) or ARIMA(ar, i, ma) notation. How did you come to that decision?

The time series modelling started with an analysis of the behaviour of the stores' sales. The next figure shows the decomposition of the store sales time series, including its seasonal behaviour and its trend.

<img src="Images/TSAnalysis.PNG" alt="Drawing" style="width: 900px;"/>

The previous figure helps to determine the characteristics of the model; for example, the ACF and the PACF can be used to determine the order of the *ARIMA* model. Nonetheless, in this project, *Alteryx* selected the characteristics of the model automatically. The performance of each model when forecasting the next six months is presented next.

***ARIMA(1,0,0)(1,1,0)[12]***

<img src="Images/ARIMA.PNG" alt="Drawing" style="width: 900px;"/>

***ETS(M,N,M)***

<img src="Images/ETS.PNG" alt="Drawing" style="width: 900px;"/>
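For reference, the two specifications named above can be written out in standard textbook notation. The symbols ($B$ for the backshift operator, $\ell_t$ for the level, $s_t$ for the seasonal component, $m = 12$ for monthly data) are notation assumptions rather than anything taken from the Alteryx output, and any constant or drift term is omitted.

ARIMA(1,0,0)(1,1,0)[12], with one non-seasonal AR term, one seasonal AR term and one seasonal difference:

$$
(1 - \phi_1 B)\,(1 - \Phi_1 B^{12})\,(1 - B^{12})\, y_t = \varepsilon_t
$$

ETS(M,N,M), with multiplicative error, no trend and multiplicative seasonality:

$$
\begin{aligned}
y_t &= \ell_{t-1}\, s_{t-m}\,(1 + \varepsilon_t) \\
\ell_t &= \ell_{t-1}\,(1 + \alpha\,\varepsilon_t) \\
s_t &= s_{t-m}\,(1 + \gamma\,\varepsilon_t)
\end{aligned}
$$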

As a result, the ETS model was selected for forecasting sales because of its better performance. The improvement is clearer in the next figures, where both models are evaluated with several performance metrics, used to forecast the next 6 months of sales, and validated against the actual sales:
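A holdout comparison of this kind can be sketched in plain Python. The series and both sets of forecasts below are invented, the function names are hypothetical, and the metrics are common forecast-accuracy measures rather than a reproduction of the TS Compare report.

```python
import math

def holdout_split(series, holdout=6):
    """Split a series into a training part and the last `holdout` points."""
    return series[:-holdout], series[-holdout:]

def accuracy_metrics(actual, forecast):
    """Return common error measures for forecasts against actual values."""
    errors = [a - f for a, f in zip(actual, forecast)]
    n = len(errors)
    return {
        "ME": sum(errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE": 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
    }

# Invented monthly values, for illustration only.
history = [100.0, 110.0, 120.0, 130.0, 125.0, 115.0,
           105.0, 112.0, 122.0, 131.0, 127.0, 118.0]
train, test = holdout_split(history, holdout=6)

# Two hypothetical sets of forecasts for the six held-out months.
forecast_a = [104.0, 113.0, 121.0, 133.0, 126.0, 117.0]
forecast_b = [100.0, 110.0, 120.0, 130.0, 125.0, 115.0]

for name, fc in {"model_a": forecast_a, "model_b": forecast_b}.items():
    metrics = accuracy_metrics(test, fc)
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

The model with the lower errors on the held-out months is the one to prefer; in this invented example that is `model_a`.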

***Models Performances***

<img src="Images/Stores_clusterssize.PNG" alt="Drawing" style="width: 400px;"/>

***Models Forecast***

<img src="Images/ModelsForecast.PNG" alt="Drawing" style="width: 900px;"/>

#### 2. Please provide a table of your forecasts for existing and new stores. Also, provide visualization of your forecasts that includes historical data, existing stores forecasts, and 

In [3]:
pd.options.display.float_format = '${:,.2f}'.format
path = os.path.join(os.getcwd(), 'Solution_3th_c.csv')
df = pd.read_csv(path)
# Start from the new-store rows; the 'Label' column is reused to hold the existing-store forecast.
df_Sales = df[df.Label == 'Forecasted_New'].rename(columns={'DateTime':'Date', 'Sales':'New_Stores', 'Label':'Existing_Stores'})
# Overwrite the reused column (rows are assumed to be in the same date order in both subsets).
df_Sales['Existing_Stores'] = df[df.Label == 'Forecasted_Existed'].Sales.values
df_Sales.reset_index(drop=True, inplace=True)
display(df_Sales)

Unnamed: 0,Date,New_Stores,Existing_Stores
0,2016-01-01,"$2,527,338.50","$21,136,641.78"
1,2016-02-01,"$2,446,154.76","$20,507,039.12"
2,2016-03-01,"$2,872,050.73","$23,506,565.98"
3,2016-04-01,"$2,722,157.62","$22,208,405.76"
4,2016-05-01,"$3,098,095.87","$25,380,147.77"
5,2016-06-01,"$3,150,602.99","$25,966,799.47"
6,2016-07-01,"$3,172,545.05","$26,113,792.57"
7,2016-08-01,"$2,814,269.98","$22,899,285.77"
8,2016-09-01,"$2,486,631.56","$20,499,583.91"
9,2016-10-01,"$2,434,261.23","$19,971,242.82"


<img src="Images/ForecastedSales.PNG" alt="Drawing" style="width: 900px;"/>

***Public link:***

[Forecasted Sales - Tableau Dashboard](https://public.tableau.com/profile/alfonso.sanchez#!/vizhome/ForecastedSales_16115531667100/Dashboard1?publish=yes)

### End