# Data Visualisation Report for ChrisCo

## Detailed specification

This report carries out a visual data exploration for ChrisCo, the fictional company whose sales and website data we have been analysing throughout the module, using a Python notebook (in Colab or Jupyter).
ChrisCo is a fictional but nonetheless very successful company managing a range of retail outlets across the UK. It collects a large amount of data about individual customers visiting its outlets through its loyalty card scheme, but this customer data has been aggregated or averaged to give information about the company's 45 outlets, each identified by a unique three-letter code (e.g. ABC, XYZ).

### Data Sources

The data for this report are available at the following locations:

- [Customer Visits per day](https://tinyurl.com/ChrisCoDV/001207659/OutletDailyCustomers.csv)

- [Total annual spend on local Marketing](https://tinyurl.com/ChrisCoDV/001207659/OutletMarketing.csv)
- [Total annual overhead cost per outlet](https://tinyurl.com/ChrisCoDV/001207659/OutletOverheads.csv)
- [Outlet size in square metres](https://tinyurl.com/ChrisCoDV/001207659/OutletSize.csv)
- [Staff count per outlet](https://tinyurl.com/ChrisCoDV/001207659/OutletStaff.csv)

## Preparing Data for Exploration

First, let's import the necessary libraries to explore and visualise our data.

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import hvplot.pandas
import holoviews as hv
import panel as pn
import panel.interact as interact
import math


import warnings

warnings.simplefilter('ignore')
%matplotlib inline

Now that the necessary libraries are imported, let's load our datasets with the `pandas` library and then compile them into two dataframes: Daily Customer Data (indexed by dates) and Summary Data (indexed by outlets).

In [None]:
# All files are stored in .csv format.
daily_customers = pd.read_csv("https://tinyurl.com/ChrisCoDV/001207659/OutletDailyCustomers.csv", index_col=0) # Time series data
outlet_marketing = pd.read_csv("https://tinyurl.com/ChrisCoDV/001207659/OutletMarketing.csv")
outlet_overheads = pd.read_csv("https://tinyurl.com/ChrisCoDV/001207659/OutletOverheads.csv")
outlet_size = pd.read_csv("https://tinyurl.com/ChrisCoDV/001207659/OutletSize.csv")
outlet_staff = pd.read_csv("https://tinyurl.com/ChrisCoDV/001207659/OutletStaff.csv")

# convert the date column into `pandas date` object.
daily_customers.index = pd.to_datetime(daily_customers.index)

The first column of `daily_customers` holds dates, so it is used as the index and converted to datetime.

Let's have a look at the daily customer visits data.

In [None]:
print(f"Data gathered over {daily_customers.shape[0]} days")
daily_customers.head()

The Customer Daily Visits data is `Time Series` data, that is, data whose data points are sequenced in order of time. The table shows that the data were collected from 45 outlets over a 365-day (1 year) period.
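
Because the later steps (monthly grouping and rolling means) assume one row per calendar day, it may be worth confirming that the date index has no gaps or duplicates. The following is a minimal sketch on a small stand-in frame; `check_daily_index` and the demo data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def check_daily_index(df: pd.DataFrame) -> dict:
    """Report gaps, duplicates and ordering problems in a daily DatetimeIndex."""
    expected = pd.date_range(df.index.min(), df.index.max(), freq="D")
    return {
        "missing_days": expected.difference(df.index).tolist(),
        "duplicate_days": df.index[df.index.duplicated()].tolist(),
        "is_sorted": df.index.is_monotonic_increasing,
    }

# Stand-in data (assumed shape: dates as index, one column per outlet),
# with 2021-01-03 deliberately left out.
demo = pd.DataFrame(
    {"ABC": [10, 12, 9], "XYZ": [5, 7, 6]},
    index=pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-04"]),
)
report = check_daily_index(demo)
print(report)
```

The same helper could be called on `daily_customers` once it has been loaded; an empty `missing_days` list would support the 365-day reading above.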

Briefly, let's check for any missing values.

In [None]:
print(daily_customers.isna().sum())
print(outlet_marketing.isna().sum())
print(outlet_overheads.isna().sum())
print(outlet_size.isna().sum())
print(outlet_staff.isna().sum())

None of the five files contains missing values, so each has a complete set of data points. We now combine everything into two dataframes:
- Daily Customer Data (indexed by dates): Already built.
- Summary Data (indexed by outlets)

In [None]:
# Summary Data (indexed by outlets)
summary = pd.DataFrame(index=daily_customers.columns)
summary["Marketing (£)"] = outlet_marketing.iloc[:, 1].values
summary["Overheads (£)"] = outlet_overheads.iloc[:, 1].values
summary["OutletSize"] = outlet_size.iloc[:, 1].values
summary["StaffCount"] = outlet_staff.iloc[:, 1].values
summary['TotalVisit'] = daily_customers.sum()

print("A look into the Summary Data")
summary.head()
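
The cell above copies each file's second column by position, which is only correct if every file lists the outlets in the same order as the columns of `daily_customers`. A possible safeguard, sketched below on stand-in data, is to index each file by its outlet code and let pandas align on that index. The column names `Outlet` and `Marketing` are assumptions; the real CSV headers may differ.

```python
import pandas as pd

# Stand-in for one of the per-outlet files. The column names "Outlet" and
# "Marketing" are assumptions; the real CSV headers may differ.
outlet_marketing_demo = pd.DataFrame(
    {"Outlet": ["XYZ", "ABC"], "Marketing": [2000, 1000]}
)
# Stand-in for the daily data, whose columns are in a different order.
daily_demo = pd.DataFrame({"ABC": [10, 12], "XYZ": [5, 7]})

summary_demo = pd.DataFrame(index=daily_demo.columns)
# Index the file by its first (outlet code) column, then assign the Series:
# pandas matches rows by outlet code rather than by position.
marketing_by_outlet = outlet_marketing_demo.set_index(outlet_marketing_demo.columns[0])
summary_demo["Marketing (£)"] = marketing_by_outlet.iloc[:, 0]
summary_demo["TotalVisit"] = daily_demo.sum()
print(summary_demo)
```

With this approach a mismatch in outlet order no longer attaches a value to the wrong outlet, and an outlet missing from a file shows up as `NaN` instead of silently shifting the rows.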


## Analysis Of Daily Customer Visits

#### **High and Medium Volume Outlets**

The outlets are segmented into high, medium and low volume according to their total visits for the year.

Let's look at these total visits, which will also justify our segmentation by visits.

In [None]:
total_visit = daily_customers.sum().sort_values(ascending=False)

plt.figure(figsize=(10,8))
plt.grid(True)
ax = sns.barplot(x=total_visit, y=total_visit.index)
ax.set_title('Sorted Bar Plot of Total Visits of Customers to Outlets', fontsize=18)
# Annotate bars
ax.bar_label(ax.containers[0], color='#336e2d', label_type='edge');

From the totals above, the outlets can be segmented at `600000` and `200000`: outlets with more than `600000` visits are high volume, those with more than `200000` and up to `600000` are medium volume, and those with `200000` or fewer are low volume.

In [None]:
high_volumes = []
medium_volumes = []
low_volumes = []

for outlet in daily_customers.columns:
    annual_total = daily_customers[outlet].sum()  # compute the annual total once
    if annual_total > 600000:
        high_volumes.append(outlet)
    elif annual_total > 200000:
        medium_volumes.append(outlet)
    else:
        low_volumes.append(outlet)

    
#  Data categories for daily customer visits
low_visit_data = daily_customers[low_volumes]
med_visit_data = daily_customers[medium_volumes]
high_visit_data = daily_customers[high_volumes]

# Split by monthly visits
monthly_customers = daily_customers.groupby(daily_customers.index.to_period('M')).sum()
low_monthly_data = monthly_customers[low_volumes]
med_monthly_data = monthly_customers[medium_volumes]
high_monthly_data = monthly_customers[high_volumes]
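
The same thresholds can also be expressed with `pd.cut`, which assigns each outlet a segment label in one step. This is only an alternative sketch on stand-in totals, not the method used in the notebook; the outlet codes and values below are made up.

```python
import numpy as np
import pandas as pd

# Stand-in annual totals; the real values come from daily_customers.sum().
totals = pd.Series({"AAA": 650_000, "BBB": 600_000, "CCC": 200_000, "DDD": 30_000})

# Right-closed bins: (-inf, 200000] low, (200000, 600000] medium, (600000, inf) high.
segments = pd.cut(
    totals,
    bins=[-np.inf, 200_000, 600_000, np.inf],
    labels=["low", "medium", "high"],
)
# Map each segment label to its list of outlets.
outlets_by_segment = {
    label: segments.index[segments == label].tolist()
    for label in ["low", "medium", "high"]
}
print(outlets_by_segment)
```

The bin edges reproduce the loop's boundary behaviour: a total of exactly 600000 counts as medium and exactly 200000 counts as low.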

Based on these thresholds, there are 3 high volume outlets (`'RFY', 'DMN', 'RAN'`), 7 medium volume outlets (`'DSA', 'EYS', 'EEC', 'BMF', 'CYK', 'CGV', 'BSQ'`), and 35 low volume outlets.

Since the interest is in the high and medium volume outlets, but the client would also like a summary of the low volume outlets, let's quickly run through the low volume outlets first.



#### Useful Functions

Let's define some functions for visualizing our findings and for code reuse.

In [None]:
import matplotlib.patches as patch

# Function for visualizing distribution of a given data.
def visualize_annual_distribution(segment: list, figure_size: tuple=(8,6), annot: bool=True, sort: bool = True):
    '''
    Plot the total annual visits for each outlet in a segment.
    The legend shows the total number of visits across the segment.

    Parameters:
    segment: list of outlet codes (columns of daily_customers)
    figure_size: size of the figure
    annot: if True, bars are annotated with their height value.
    sort: if True, outlets are sorted by total annual visits, descending.

    Use case:
    visualize_annual_distribution(['A', 'B', 'C'])
    '''
    # Total annual visits per outlet, optionally sorted in descending order
    totals = daily_customers[segment].sum()
    if sort:
        totals = totals.sort_values(ascending=False)

    # Total visits across the whole segment
    visits = totals.sum()

    # Plot distribution; the labels come from the (possibly sorted) index
    # so that each bar is matched to the correct outlet
    plt.figure(figsize=figure_size)

    ax = sns.barplot(x=totals.index, y=totals.values)
    ax.set_ylabel('Annual Visits', fontsize=14)
    ax.set_xlabel('Outlets', fontsize=14)

    if segment == high_volumes:
        ax.set_title('Total Annual Visits in High Volume Outlets', fontsize=16)
    elif segment == medium_volumes:
        ax.set_title('Total Annual Visits in Medium Volume Outlets', fontsize=16)
    else:
        ax.set_title('Total Annual Visits in Low Volume Outlets', fontsize=16)

    if annot:
        ax.bar_label(ax.containers[0]) # annotate bars
        ax.set_yticks([])

    color_patch = patch.Patch(color='black', label=f'Total: {visits}')
    ax.legend(loc='best', handles=[color_patch])



# Function for visualizing correlation with subplot

def visualize_scatter_correlation(data: pd.DataFrame, volumes: list,  figure_size: tuple=(8,6)):
    '''
    \nFunction gives scatter plots of correlation of features.\n
    \nParameters\n
    data: Pandas DataFrame.
    volumes: List of features.
    figure_size: Size of super figure plot.
    '''
    fig = plt.figure(figsize=figure_size)
    if volumes == low_volumes:
        fig.suptitle("Low Volume Visits Correlation", fontsize=20)
    elif volumes == medium_volumes:
        fig.suptitle("Medium Volume Visits Correlation", fontsize=20)
    else:
        fig.suptitle("High Volume Visits Correlation", fontsize=20)


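    N = len(volumes)
    index = 1

    # Plot every unique pair of features once. For N == 2 the loop below
    # already draws the single required plot (a DataFrame has no `.scatter` method).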
    
    for i, name_i in enumerate(volumes):
        for j in range(i+1, N):
            name_j = volumes[j]
            ax = fig.add_subplot(N-1, N-1, index)
            ax.set_title(name_i + " vs " + name_j, fontsize=11)
            ax.scatter(data[name_i], data[name_j]);
            index += 1

    plt.subplots_adjust(wspace=.5, hspace=.5)
    plt.tight_layout()




# Function that plots heatmap of a given data.
def visualize_correlation_heatmap(data: pd.DataFrame, size: int=6, title='Heatmap of Summary Data'):
    '''
    \nFunction plots a heatmap of data.\n
    \nParameters\n
    data: Pandas dataframe
    size: figure size
    title: title of figure
    '''
    plt.figure(figsize=(size+1,size))
    corr = data.corr()
    ax = sns.heatmap(corr, vmin=-1, vmax=1, annot=True, annot_kws={"size": 8}, cmap='coolwarm')
    ax.set_xticklabels(ax.get_xticklabels())
    ax.set_title(title, fontsize=16)
    plt.show()



# Function to return pairs with a high correlation coefficient.
def get_high_correlations(data, threshold: float=0.5):
    '''
    Return the pairs of features whose correlation coefficient is
    greater than a threshold. Default threshold is 0.5.

    Parameters
    data: data whose correlations are required
    threshold: minimum correlation coefficient of interest
    '''
    corr = data.corr()
    # Keep only the triangle above the main diagonal so each pair appears once
    # (k=0 is the main diagonal, k=1 the triangle above, k=-1 the triangle below)
    high_corr = corr.where(np.triu(np.ones(corr.shape), k=1).astype("bool"))

    pairs_above_threshold = {}
    temp_corr = high_corr.values

    for i in range(len(temp_corr)):
        for j in range(len(temp_corr)):
            # Masked cells are NaN and never pass this comparison
            if temp_corr[i,j] > threshold:
                pairs_above_threshold[(corr.columns[i], corr.columns[j])] = round(temp_corr[i,j], 3)
    print(f'The following {len(pairs_above_threshold)} pair(s) have correlation coefficients greater than {threshold}')
    return pairs_above_threshold
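
The upper-triangle selection in `get_high_correlations` can also be written without explicit loops. The sketch below shows one possible vectorised form on a small stand-in frame (the columns `a`, `b` and `c` are made up); it is an illustration of the masking idea, not a replacement for the function above.

```python
import numpy as np
import pandas as pd

# Stand-in data: b rises with a, c falls as a rises.
demo = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 9], "c": [4, 3, 2, 1]})

corr = demo.corr()
# Keep only the triangle above the main diagonal so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
# stack() gives one entry per (row, column) pair; any masked cell is NaN
# and therefore fails the > 0.5 comparison below.
pairs = corr.where(mask).stack()
high_pairs = pairs[pairs > 0.5].round(3).to_dict()
print(high_pairs)
```

Only the positively correlated pair `('a', 'b')` survives here; strongly negative pairs are excluded, just as in the loop version, because the test is on the signed coefficient.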



### **Summarized EDA For Low Visits Outlets**

There are 35 outlets with low customer visits. These will be explored briefly to pick out specific information about them.

#### Daily Customer Visits

First, I'll look at the distribution of the dataset:

In [None]:
visualize_annual_distribution(low_volumes, figure_size=(16,8), annot=False, sort=False)

The distribution shows that we have the lowest sum of total visits at Outlet `ZMY` (less than 5000) and the highest sum at Outlet `END` (about 34000).

I would like to look into each outlet for further exploration. Let's visualise this with a composite `line plot`.

#### Line Chart

In [None]:
fig, ax = plt.subplots(figsize=(12,8))
ax.plot(daily_customers[low_volumes], linewidth=2)
ax.set_title("Line Plots of Low Volume Outlets", fontsize=18)
ax.set_ylim(0) # set y-axis to start from 0
plt.gca().set_prop_cycle(None) #gca = get current axes
ax.legend(daily_customers[low_volumes].columns, loc="best")
plt.show()

The combined line plot looks messy, so I will use subplots to visualise how these outlets behaved over the months.

So, let's do a subplot for each outlet:

#### Line Charts For Each Outlet

In [None]:
# Set figure size
plt.figure(figsize=(12,10))
plt.suptitle("Plots of Low Visits Outlets", fontsize=18)
low_volumes_sorted = sorted(low_volumes)

index = 1
for outlet in low_volumes_sorted:
    sub = plt.subplot(6, 6, index)
    sub.plot(daily_customers[outlet], c="g") # plot lines for each outlet of low volumes
    sub.set_title(outlet)
    sub.set_xticklabels([])
    index += 1
plt.tight_layout()
plt.show()

#### Observations - Line Chart Subplots for Low Visits Outlets

The subplots grouping has given a better insight into the visits to each outlet through the year.

>The following can be observed from the line charts of low customer visits above:
- Outlet `AYD` opened in April 2021.
- Outlet `IZX` closed in May 2021.
- Outlet `AGN` opened in October 2021.
- Outlet `ZMY` opened in October 2021. This outlet saw a sharp increase in customer visits for the next four months. This could be due to prior awareness, high customer expectations of an outlet in that area, or relatively low prices at the outlet. So, this outlet looks promising.
- Outlet `HTF` closed in May 2021.
- Outlet `ZYT` closed in April 2021.
- Outlet `XSB` opened in April 2021.
- Outlet `ZSJ` opened in July 2021.
- Outlet `HNV` closed in October 2021.
- Outlet `YMQ` opened in July 2021.
- Outlet `YGE` closed in October 2021.

However, visits were higher for these outlets in December 2021, presumably because it was a festive period, with Outlet `AYD` having the highest turnout of 227 customers.

### **Summary For Low Volume Outlets**

The chart below gives a broad picture of low visits outlets

In [None]:
low_monthly_data.plot(figsize=(12,8), title='Low Visits Outlets Records for Year 2021');
plt.legend(loc='upper right');

- Most stores recorded their lowest visits in February.
- 2 outlets **opened** in April, 2 in July and 2 in October. This makes a total of 6 new outlets.
- 1 outlet closed in April, 2 in May and 2 in October. This gives a total of 5 closures.
- In effect, 30 low volume outlets were left at the end of 2021.
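
The opening and closing months above were read off the charts. They could also be derived from the data, for example by finding the first and last day on which an outlet recorded any visits. The sketch below assumes that a closed (or not yet opened) outlet is recorded with 0 visits, which the notebook does not state explicitly; `trading_window` and the demo columns are hypothetical.

```python
import pandas as pd

def trading_window(visits: pd.Series):
    """Return the first and last dates with at least one visit."""
    open_days = visits[visits > 0]
    return open_days.index.min(), open_days.index.max()

# Stand-in series; assumes a closed outlet is recorded as 0 visits.
dates = pd.date_range("2021-01-01", periods=6, freq="D")
demo = pd.DataFrame(
    {"OPENED_LATE": [0, 0, 4, 6, 5, 7], "CLOSED_EARLY": [8, 6, 5, 0, 0, 0]},
    index=dates,
)
windows = {name: trading_window(demo[name]) for name in demo.columns}
for name, (first, last) in windows.items():
    print(name, first.date(), last.date())
```

An outlet whose first trading day is later than the first date in the data opened during the year, and one whose last trading day is earlier than the final date closed during the year.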


## Medium And High Visits Exploratory Data Analysis

Monthly Total Visits Data Visualization

In [None]:
def visualize_monthly_barchart_for(data: pd.DataFrame):
    '''
    Function to visualize monthly visits of a given dataframe\n

    Parameters\n
    data: given pandas dataframe
    '''
    data.plot(kind='bar', figsize=(10,8))
    plt.ylabel("Monthly Visits", fontsize=14)
    plt.legend(loc="upper right")
    if len(data.columns) == 3:
        plt.title('High Outlets Customer Visits Per Month', fontsize=16)
    else:
        plt.title('Medium Outlets Customer Visits Per Month', fontsize=16)
    plt.show()




def visualize_line_charts_for(segment: list):
    '''
    Function to visualize line charts of daily visits for a given segment.\n
    Parameters
    segment: list of volume outlet names
    '''
    fig, ax = plt.subplots(figsize=(12,8))
    ax.plot(daily_customers[segment], linewidth=2)
    ax.set_ylim(0) # set y-axis to start from 0
    ax.legend(daily_customers[segment].columns, loc="best")

    if segment == high_volumes:
        ax.set_title("Line Plots of High Volume Outlets", fontsize=18)
    elif segment == medium_volumes:
        ax.set_title("Line Plots of Medium Volume Outlets", fontsize=18)
    else:
        ax.set_title("Line Plots of Low Volume Outlets", fontsize=18)

    plt.show()





def visualize_subcharts_for(segment: list):
    '''
    To visualize subplots for a given volume\n
    Parameter:
    segment: list of volume outlet names
    '''
    # Set figure size
    plt.figure(figsize=(12,10))
    sub_plots = ()
    if segment == high_volumes:
        plt.suptitle("Subplots of Daily Visits for High Volume Outlets", fontsize=18)
        sub_plots = (2,2)
    elif segment == medium_volumes:
        plt.suptitle("Subplots of Daily Visits for Medium Volume Outlets", fontsize=18)
        sub_plots = (3,3)
    else:
        plt.suptitle("Subplots of Daily Visits for Low Volume Outlets", fontsize=18)
        sub_plots = (6,6)

    index = 1
    for outlet in segment:
        sub = plt.subplot(sub_plots[0], sub_plots[1], index)
        sub.plot(daily_customers[outlet]) # plot lines for each outlet of medium volumes
        sub.set_title(outlet)
        sub.set_xticks([])
        index += 1
    plt.tight_layout()
    plt.show()




def visualize_visits_per_month(data: pd.DataFrame):
    '''
    Monthly visits visualization for the year.\n
    Parameter
    data: monthly dataframe -\n
    \tlow_monthly_data,\n
    \tmed_monthly_data,\n
    \thigh_monthly_data
    '''
    viz = data.sum(axis=1).plot(kind='bar', figsize=(10,6))
    viz.bar_label(viz.containers[0])
    plt.yticks([])
    plt.ylabel('Total Monthly Visits', fontsize=14)
    
    # Compare the column names by value; `is` checks identity and is never True here
    if list(data.columns) == high_volumes:
        plt.title('High Volumes Visits Per Month', fontsize=16)
    else:
        plt.title('Medium Volumes Visits Per Month', fontsize=16)
    plt.show()


In [None]:
visualize_visits_per_month(med_monthly_data)

### **EDA For Medium Visits Outlets**

This Exploratory Data Analysis (EDA) will follow the template of annual to monthly to daily explorations.

First, we'll look at the statistics for this segment of outlets

In [None]:
med_visit_data.describe()

Outlet `BMF` recorded the lowest daily customer visits, while `EEC` had the highest.

#### **Total Annual Visits for Medium Outlets**

There are 7 outlets with annual medium visits by customers based on our segmentation. These outlets are:

- 'DSA', 'EYS', 'EEC', 'BMF', 'CYK', 'CGV', 'BSQ'

The total number of visits to all medium volume outlets through the year is 2,186,224.

So, let's explore how these 2,186,224 visits were distributed by outlet over the entire year.

In [None]:
visualize_annual_distribution(segment=medium_volumes)

#### Distribution Observation



 - Total annual visits across all medium volume outlets is 2,186,224.
 
 - Among the medium volume outlets (over 200k and up to 600k visits), `DSA` has the highest total with over 360k visits, while `CGV` has the lowest at just above 273k visits through the year.

#### Per Outlet Daily Visits

I will be exploring each outlet's behaviour by visits over the year.

In [None]:
visualize_line_charts_for(medium_volumes)

From the plot above, although it is still noisy, we can see that turnout at these outlets stays within a similar range. Outlet `BMF` had the lowest visits towards the end of December, while outlet `EEC` had the highest visits a few weeks into December.

Now, let's see an individual plot for each of the medium volume outlets.

In [None]:
visualize_subcharts_for(medium_volumes)

#### Observations - Medium Visits Outlets Subplots

We can't say for sure whether there is any trend, as there is a lot of noise in the data, but Outlet `BSQ` appears to be stationary. However, there seems to be some sort of seasonality in the plots.

>Definitions

- **Seasonality**: Variations in a time series that occur at regular intervals or time periods within a year.
- **Trend**: The observed upward or downward movement of a time series over a period of time.
- **Stationarity**: The absence of any upward or downward movement of a time series over the entire period.



Let's create subplots of line charts for Medium Volumes Outlet with trendlines added. The subplots are sorted in decreasing order of total visits.

In [None]:
import matplotlib.dates as mpldate

# sort outlets by total visits
med_sum_sorted = daily_customers[medium_volumes].sum().sort_values(ascending=False)

# create rolling mean object with a 28 day period
period = 28
moving_average = med_visit_data.rolling(window=period).mean() #rolling mean

# loop through all medium visits outlets.
plt.figure(figsize=(12,8))
plt.suptitle('Rolling Mean Plot with Trends For Medium Volume Outlets', fontsize=18)
index = 1

for outlet in med_sum_sorted.index:
    x = mpldate.date2num(moving_average.index.values)
    y = med_visit_data[outlet].values
    z1 = np.polyfit(x, y, 1)
    z3 = np.polyfit(x, y, 3)
    trend1 = np.poly1d(z1)
    trend3 = np.poly1d(z3)
    sub = plt.subplot(3,3,index)
    sub.plot(moving_average[outlet], linewidth=.8)
    sub.plot(x, trend1(x), linewidth=2, linestyle="--", label='linear')
    sub.plot(x, trend3(x), linewidth=2, linestyle="--", label='poly', c='m')
    sub.tick_params(axis='x', labelrotation=30)  # rotate date labels without overwriting them
    sub.set(title=outlet)
    # sub.legend(loc='best')
    index += 1
plt.tight_layout()
plt.show()
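
Fitting a cubic directly on Matplotlib date numbers (values in the tens of thousands) can be numerically ill-conditioned, and NumPy would normally warn about this if warnings were not suppressed. One possible alternative, sketched here on synthetic data, is to fit against the number of days since the first observation. The series below is made up for illustration and only the linear fit is shown.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: a noisy series rising by 2 visits per day from 1000.
rng = np.random.default_rng(0)
dates = pd.date_range("2021-01-01", periods=365, freq="D")
visits = pd.Series(1000 + 2 * np.arange(365) + rng.normal(0, 20, 365), index=dates)

# Use days since the first observation as x, so values run from 0 to 364.
x = (visits.index - visits.index[0]).days.to_numpy()
slope, intercept = np.polyfit(x, visits.to_numpy(), 1)
trend = np.poly1d([slope, intercept])

print(f"estimated change per day: {slope:.2f}")
print(f"fitted change over the year: {trend(x[-1]) - trend(x[0]):.0f}")
```

The fitted slope then reads directly as visits gained or lost per day, which makes the trend lines easier to compare between outlets.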

#### Observations for Seasonality and Trend

The rolling mean plots are quite interesting. The trends and seasonalities are now visible. Linear and polynomial trend lines are combined for more exploration at these outlets.

>**DSA**

- `Trend`: I would say this outlet's visits behaviour is cyclic, because the series looks set to start January 2022 at almost the level where it started in January 2021. On average, monthly visits saw an uptrend through the year. It has the highest customer turnout through the year, with visits peaking in July at 1540 visits.

- `Seasonality`: Outlet visits seem to be seasonal on a bimonthly basis. There was a dip in visits in early June, followed by a peak in July. Seasonality is also observed between November and January.

>**EEC**

- `Trend`: This outlet's visits behaviour is almost the opposite of `DSA`'s, although it also shows an uptrend, with a peak in December and a trough in February.

- `Seasonality`: Bimonthly seasonality also observed in the outlet.


>**EYS**

- `Trend`: There is an uptrend in this outlet as observed with both linear and poly plots.

- `Seasonality`: Seasonality is almost bimonthly.

>**BSQ**

- `Trend`: This outlet's visits behaviour is rather stationary, as shown by both trend lines through the year.

- `Seasonality`: Seasonality can also be seen at this outlet through the year. It is almost quarterly.

>**BMF**

- `Trend`: This outlet experienced a downtrend of customer visits through the year. This trend seems to be the inverse of the behavior at outlet `EEC` with January and December inverted.

- `Seasonality`: The outlet's visits seem to have quarterly seasonality.


>**CYK**

- `Trend`: This outlet has a gentle rising trend.

- `Seasonality`: Seasonality is roughly quarterly.

>**CGV**

- `Trend`: There is an uptrend.

- `Seasonality`: Seasonality is roughly monthly.

Generally, most of the medium volume outlets saw customer turnout peak between September and December, most likely because of the festive season.

The grouped plot shows how the medium volume outlets compare monthly. On average, visits peaked in July.

The medium volume outlets recorded their lowest customer turnout in February, as the chart below shows.

In [None]:
med_monthly_data.plot(figsize=(8,6))
plt.ylabel('Monthly Visits', fontsize=14)
plt.title('Medium Volume Outlets Customer Visits');

Finally, let's check for outliers in medium volumes

In [None]:
sns.boxplot(data=daily_customers[medium_volumes]);
plt.title('Boxplot For Medium Volume Outlets')
plt.ylabel('Visits Records')
plt.xlabel('Medium Volumes')
plt.show()

> **Observations**

- Lowest visit count: Outlet BMF had the lowest visit count.
- Spread of the visit records (visit consistency): EEC had the highest spread and was therefore the most irregular in visits. EYS had the least spread and was therefore the most consistent.
- Median visits: DSA comes first with the highest median daily visits.
- Highest visits: EEC recorded the highest.
- Outliers: An outlier is a data point whose value is significantly different from the rest of the data. Five outlets (DSA, EEC, BMF, CYK and CGV) have outliers, with DSA having the most.
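
The points drawn beyond the whiskers of a boxplot follow, by default, the 1.5 × IQR rule, so the outliers can also be counted numerically instead of being read off the chart. The sketch below applies that rule to stand-in data; `count_iqr_outliers` is a hypothetical helper and the columns `A` and `B` are made up.

```python
import pandas as pd

def count_iqr_outliers(df: pd.DataFrame, whis: float = 1.5) -> pd.Series:
    """Count values per column lying beyond whis * IQR from the quartiles."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    outside = (df < q1 - whis * iqr) | (df > q3 + whis * iqr)
    return outside.sum()

# Stand-in data: column "B" contains one extreme value.
demo = pd.DataFrame({"A": [10, 11, 12, 13, 14], "B": [10, 11, 12, 13, 100]})
outlier_counts = count_iqr_outliers(demo)
print(outlier_counts)
```

Applied to `daily_customers[medium_volumes]`, the resulting counts could be used to check which outlet has the most outliers.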



### **EDA For High Volume Visits Outlets**

This Exploratory Data Analysis (EDA) will follow the template of annual to monthly to daily explorations for the high volume outlets (`'RFY', 'DMN', 'RAN'`).

#### **Total Annual Visits for High Customer Outlets**

As shown below, the total number of visits to all high volume outlets through the year is 2,069,183.

In [None]:
visualize_annual_distribution(segment=high_volumes)

There are 3 outlets with annual high visits by customers based on our segmentation. These outlets are:

- `'RFY', 'DMN', 'RAN'`

#### Monthly Visits

So, let's explore how these 2,069,183 visits were distributed by outlet each month over the entire year.

In [None]:
visualize_monthly_barchart_for(high_monthly_data)

One behaviour the high volume outlets have in common is that they recorded their lowest visits in February, just like the medium volume outlets.

January happens to be the month with the most customer visits. Other months with high visits are July, October and December.


#### Daily Visits Analysis

Let's see plots of how each outlet performed based on daily visits. This will be visualized with line charts.

In [None]:
visualize_subcharts_for(high_volumes)

An uptrend is suspected in outlet `RFY`, while outlets `DMN` and `RAN` show downtrends. This will be explored further with a moving average.


Plot rolling average over 28 day period

In [None]:
# sort outlets by total visits
high_sum_sorted = daily_customers[high_volumes].sum().sort_values(ascending=False)

# create rolling mean object with a 28 day period
period = 28
moving_average_h = high_visit_data.rolling(window=period).mean() #rolling mean

# loop through all high visits outlets.
plt.figure(figsize=(12,5))
plt.suptitle('Moving Average Plot with Trends For High Volume Outlets', fontsize=18)
index = 1

for outlet in high_sum_sorted.index:
    sub = plt.subplot(1,3,index)
    sub.plot(moving_average_h[outlet], linewidth=.8)
    sub.tick_params(axis='x', labelrotation=30)  # rotate date labels without overwriting them
    sub.set(title=outlet,
            xlabel='Date',
            ylabel='Visits')
    index += 1
plt.tight_layout()
plt.show()

#### Observations

Trends and seasonality are now obvious.

>RFY
- **Trend**: The outlet with the highest overall visit volume continued to see growth in customer visits through the year. It also had the highest marginal growth, with a total customer growth of 6680 at year end.

- **Seasonality**: This can be observed in April and in November.


>RAN
- **Trend**: `RAN` lost customers as the year went on, losing 5820 between January and December inclusive.

- **Seasonality**: No seasonality observed.


>DMN
- **Trend**: Just like `RAN`, `DMN` also lost customers as the year went on, but lost more (6507) between January and December inclusive.

- **Seasonality**: No seasonality observed.
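
The notebook does not show how the growth and loss figures above were obtained. One plausible reading, which is purely an assumption here, is the difference between the December and January monthly totals. The sketch below computes that difference for a synthetic outlet called `DEMO`; the figures it produces are illustrative and unrelated to the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one outlet's daily visits in 2021 (steady growth).
dates = pd.date_range("2021-01-01", "2021-12-31", freq="D")
demo = pd.DataFrame({"DEMO": np.linspace(1000, 1200, len(dates)).round()}, index=dates)

# Monthly totals, grouped the same way as monthly_customers in the notebook.
monthly = demo.groupby(demo.index.to_period("M")).sum()

# Assumed definition: December total minus January total.
change = monthly.iloc[-1] - monthly.iloc[0]
print(change)
```

A positive value indicates growth over the year and a negative value a loss. Other definitions, such as the difference between the first and last 28-day rolling means, would give different numbers.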

## Conclusion of Daily Visits Data

So far, the daily customer visits to these outlets show quite interesting behaviour. In general, the lowest monthly visits were recorded in February and the highest in July.

We saw some low volume outlets such as `HTF, ZYT, YGE` struggle with customer visits and eventually close, while others such as `YMQ`, `XSB`, `ZSJ` opened during the year, with some, like `ZMY`, even seeing a sharp increase in customer visits.

Some medium volume outlets, for example `DSA, EEC, EYS, CGV`, could be invested in through outlet size expansion or increased staff strength.

Among the high volume outlets, visits dropped at `RAN` and `DMN`.

Overall, average daily visits across ChrisCo were about 13,800 in 2021.




## EDA For Summary Data

The Summary Data records the spending and staff strength at each outlet through the year. Just as with the [Daily Customer Visits](#analysis-of-daily-customer-visits) data, the analysis will be based on the segmentation, mostly for high and medium volume outlets. Interactive plots will also be used in this section.

### Pre-Exploration

But first, let's have a look again at what the data looks like.

In [None]:
summary_data = summary.copy()
summary_data.head()

Let's check for any missing data

In [None]:
print(f'Missing data:\n{summary_data.isna().sum()}')

No missing data. Now, we look at the statistical properties.

In [None]:
summary_data.describe()

> Key Notes:

- There are 45 data points (outlets) and 5 features: `Marketing, Overheads, Outlet Size, Staff Count and Total Annual Visits`.

- The highest staff strength is 67! We will see how this outlet made the most of it.

- The biggest outlet has a size of 7530 $m^2$.

- The biggest spend on marketing was £87k.

- The highest spend on overheads was £98k.


Let's explore how these outlets used their respective features to their advantage.

#### Useful Functions

Let's define functions for creating plots and for normalizing.

In [None]:
from sklearn.preprocessing import StandardScaler

# MinMaxScaler Normalizer
def min_max_scale(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result


def standard_rescale(df):
    return pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns, index=df.index)



# To visualize Summary or Modified Summary Data
def visualize_radar_plot_for(segment: list, data=summary_data):
    '''
    To visualize Summary or Modified Summary Data

    parameters:\n
    segment: list of outlets

    data: dataframe of summary or modified summary data.
    '''
    normalized_data = min_max_scale(data)


    # Axis labels must follow the column order of the data, because the
    # values for each outlet are read row-wise in that order
    categories = list(normalized_data.columns)

    n_attributes = len(categories)
    angles = [n / float(n_attributes) * 2 * np.pi for n in range(n_attributes + 1)]
    colours = ['b', 'g', 'r', 'c', 'm', 'y', 'k', 'w']
    radius = np.linspace(0, 1, 5)
    c = 0
    plt.figure(figsize=(8, 8))

    if segment == high_volumes:
        plt.suptitle('Radar Plot for High Volume Outlets', fontsize=16)
    else:
        plt.suptitle('Radar Plot for Medium Volume Outlets', fontsize=16)
        
    sub = plt.subplot(1, 1, 1, polar=True)
    
    for name in segment:
        values = normalized_data.loc[[name]].values.flatten().tolist()
        values += values[:1]
        sub.plot(angles, values, colours[c % len(colours)], label='Outlet ' + name)
        sub.fill(angles, values, colours[c % len(colours)], alpha=0.1)
        sub.set_ylim(ymax=1.05)
        sub.set_yticks(radius)
        sub.set_xticks(angles[0:-1])
        sub.set_xticklabels(categories)
        c += 1
    plt.legend(loc='upper right', bbox_to_anchor=(1.1, 1))
    plt.show()


# Bar charts for summary data or modified summary data
def visualize_barcharts_for(segment: list, data: pd.DataFrame=summary_data):
    '''
    Bar charts for summary data for high and medium volumes

    Parameters
    segment: list of outlets
    data: Pandas DataFrame of summary data (one row per outlet)
    '''
    plt.figure(figsize=(10, 6))
    counter = 1
    if segment == high_volumes:
        plt.suptitle('Bar Subplots for High Volume Outlets', fontsize=18)
    else:
        plt.suptitle('Bar Subplots for Medium Volume Outlets', fontsize=18)

    for attribute in data:
        x_pos = np.arange(len(data.loc[segment].index))
        sub = plt.subplot(3,3, counter)
        sub.bar(x_pos, data[attribute].loc[segment], align='center')
        sub.set_xticks(x_pos, data.loc[segment].index)
        # sub.set_xlabel('Outlets', fontsize=18)
        color_patch = patch.Patch(color='green', label=attribute)
        sub.legend(loc='best', handles=[color_patch])
        # sub.set_ylabel(attribute, fontsize=18)
        counter += 1
    plt.tight_layout()
    plt.show()
    # return sub
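
One detail of `min_max_scale` worth noting is that a column whose values are all equal has a range of zero, so the division produces `NaN` for that column. The notebook's data may never hit this case; the sketch below only illustrates one possible guard, using a hypothetical helper `min_max_scale_safe` and made-up values.

```python
import pandas as pd

def min_max_scale_safe(df: pd.DataFrame) -> pd.DataFrame:
    """Rescale each column to [0, 1]; constant columns are mapped to 0."""
    value_range = df.max() - df.min()
    # Replace a zero range with 1 so constant columns become 0 instead of NaN.
    value_range = value_range.replace(0, 1)
    return (df - df.min()) / value_range

demo = pd.DataFrame(
    {"StaffCount": [10, 20, 30], "Flat": [5, 5, 5]}, index=["ABC", "DEF", "XYZ"]
)
scaled = min_max_scale_safe(demo)
print(scaled)
```

For columns with a non-zero range the result is the same as `min_max_scale`: the smallest value maps to 0 and the largest to 1, which is what lets features with very different units share one radar plot.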

#### Parameters Correlation

Let's visualize how these features correlate with each other. I'll also introduce a metric called marketing returns: the number of visits per unit of marketing spend at an outlet.

In [None]:
# high_plus_med = high_volumes + medium_volumes # high and medium volume outlets
# marketing_returns = (summary_data['TotalVisit']*10/summary_data['Marketing (£)']).astype('int')
# # marketing_returns.loc[volumes]
# suggested_data = summary_data.copy()
# suggested_data['MarketingReturns'] = marketing_returns
# suggested_data.head()
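
As a worked illustration of the marketing returns metric, the sketch below computes it for two made-up outlets. The values are stand-ins, not ChrisCo data; the scaling by 10 and the truncation to an integer mirror the expression used in the notebook.

```python
import pandas as pd

# Stand-in summary rows; the real values come from summary_data.
demo = pd.DataFrame(
    {"TotalVisit": [300_000, 30_000], "Marketing (£)": [60_000, 10_000]},
    index=["ABC", "XYZ"],
)

# Visits per £1 of marketing; the notebook multiplies this ratio by 10
# and truncates it to an integer (visits per £10 of marketing).
visits_per_pound = demo["TotalVisit"] / demo["Marketing (£)"]
visits_per_ten_pounds = (visits_per_pound * 10).astype(int)
print(visits_per_ten_pounds)
```

Here the outlet with ten times the visits earns fewer than twice the returns, because it also spends six times as much on marketing. This is the kind of contrast the marker sizes in the scatter plots are meant to expose.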


In [None]:
plots = []
size = (summary_data['TotalVisit']*10/summary_data['Marketing (£)']).astype('int')
categories = summary_data.columns
for i, name_i in enumerate(categories):
    for j in range(i + 1, len(categories)):
        name_j = categories[j]
        plot = standard_rescale(summary_data).hvplot.scatter(x=name_i, y=name_j,
                                                          frame_width=200, frame_height=200,
                                                          xlabel=name_i, ylabel=name_j, size=size,
                                                          title=name_i + ' vs ' + name_j,
                                                          color=name_i, cmap='coolwarm', alpha=.6)
        plots.append(plot)
        
hv.Layout(plots).cols(3)

**Heatmap for Summary Data**

Let's visualize the correlation matrix of the summary data as a heatmap.

In [None]:
visualize_correlation_heatmap(summary_data, size=7)

### **Observations**

- `Marketing vs Staff Count` has a strong positive correlation: outlets with more staff tend to spend more on marketing.
- `Marketing vs Outlet Size` has a strong positive correlation: bigger outlets, which also have more staff, require more marketing.
- `Marketing vs Overheads` has a strong negative correlation. This suggests that each outlet works to a fixed budget: the higher the overhead cost, the lower the spend on marketing.
- `Outlet Size vs Staff Count` are fully correlated: bigger outlets require more staff.
- `Daily Visits` correlates highly with every feature except overheads. Of particular interest is its near-unity correlation with marketing.
- Strangely, `Overheads` correlates negatively with the other parameters.

The heatmap shows several feature pairs with a notable correlation coefficient. Let's list the pairs whose correlation is greater than 0.5.

In [None]:
get_high_correlations(summary_data, threshold=0.5)
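
The thresholding step can be sketched as follows. This is not the notebook's `get_high_correlations`; `high_correlation_pairs` and the `toy` data are hypothetical names and values used only for illustration, and the sketch assumes that "high" means a signed correlation above the threshold, so strongly negative pairs are left out.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(data: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """Return feature pairs whose correlation is greater than threshold."""
    corr = data.corr()
    # Keep only the strict upper triangle so each pair appears exactly once
    # and the diagonal (self-correlation) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Hypothetical data: Staff rises with Size, Overheads falls as Size rises.
toy = pd.DataFrame(
    {
        "Size": [1, 2, 3, 4, 5],
        "Staff": [2, 4, 6, 8, 10],
        "Overheads": [5, 4, 3, 2, 1],
    }
)
result = high_correlation_pairs(toy, threshold=0.5)
print(result)
```

Replacing `pairs > threshold` with `pairs.abs() > threshold` would also report strong negative correlations, such as those involving overheads.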

Interpretation:

- Outlet size, staff count, and marketing all contribute to total visits.
- Bigger outlets tend to spend more on marketing.
- Outlets with more staff tend to spend more on marketing.
- Customer presence at outlets depends largely on how much marketing was done.
- Bigger outlets require more staff.
- Outlets with more staff attract more customers.
- Bigger outlets attract more customers.

<!-- Before we proceed, new features will be added to the Summary dataframe for in-depth analysis. They are:

- Spend Efficiency: total customer visits to an outlet per total marketing spend:

    $$
    SpendReturns = \frac{\text{Total Customer Visits}}{\text{Marketing Spend (£)}} \times 100
    $$
 -->



### **Summary Data For High Volume Outlets**

Let's define the summary data for the high volume outlets.

The modified summary data will be used.

In [None]:
summary_high = summary_data.loc[high_volumes]
summary_high

In [None]:
def scale_by_max(data):
    return data/data.max()


def visualize_heatmap(data: pd.DataFrame, width: int=700, height: int=400, title='Heatmap of data'):
    '''
    Plots a heatmap of the correlation matrix of data.

    Parameters
    data: Pandas dataframe
    width: figure width in pixels
    height: figure height in pixels
    title: title of figure
    '''
    corr = data.corr()
    # Use the width and height arguments rather than hard-coded values
    ax = corr.hvplot.heatmap(title=title, cmap='coolwarm',
                             width=width, height=height)
    return ax

bubbles = []

def summary_scatter3d_plot(data: pd.DataFrame, segment: list, factor: int=500):
    '''
    Works only with summary data.

    Parameters:
    data: Pandas Dataframe
    segment: list of segmented outlets
    factor: factor to resize bubble
    '''
    
    # Only the volume label differs between the two titles
    label = 'High' if segment == high_volumes else 'Medium'
    title = (f'3D Scatter Summary Plot for {label} Volume Outlets with scaled TotalVisit\n'
             'as the Third Dimension')
    newplot = pd.DataFrame({'x': np.zeros(5), 'y': np.zeros(5)}).hvplot(kind='scatter', x='x', y='y',
                                                                        xlabel='Feature X', ylabel='Feature Y',
                                                                        alpha=0)

    selected = data.columns
    for i, name_i in enumerate(selected):
        for j in range(i + 1, len(selected)):
            name_j = selected[j]
            # Explicit positional lookup: TotalVisit of the i-th row, scaled down by factor
            bubble_size = int(data['TotalVisit'].iloc[i] / factor)
            newplot *= scale_by_max(data.loc[segment]).hvplot(kind='scatter', x=name_i, y=name_j,
                              frame_width=600, frame_height=500, 
                              title=title,
                              label=f'{name_i} vs {name_j}',
                              alpha=.8, size=bubble_size)
            # plots.append(plot)
    return newplot.opts(legend_position='right')


In [None]:
summary_scatter3d_plot(summary_data, high_volumes)

**Observation**

- Outlet size and staff strength are closely related to total visits.

- The Marketing vs Visits points are the most tightly clustered, which indicates the highest correlation.

### Radar Plot for High Volume


Radar Plot Structure
- Categories that ChrisCo might want to maximize are placed on the upper half of the plot.
- Categories to minimize are on the lower half of the plot.

In [None]:
visualize_radar_plot_for(high_volumes, data=summary_data)

#### Observations: High Volumes Radar Plots

>RFY
- This outlet recorded the highest customer visits in 2021, with the biggest outlet size and marketing spend, but the lowest staff strength and overhead cost.

>DMN
- This outlet spent the least on marketing but a great deal on overheads; its staff strength is relatively low.

>RAN
- RAN ranks just behind DMN in overheads and only a little behind RFY in marketing. It has the largest staff count but the lowest visits.



Let's explore high volume outlets for metrics

Interactive Plot for Medium Volume

### Summary Data for Medium Volumes

In [None]:
totals = summary_data.sum().plot(kind='barh', title='Sum of Features Across Outlets for Summary Data', ylabel='Features')
totals.bar_label(totals.containers[0], fmt='%.0f');

**Observations**
- ChrisCo's budget for 2021 was £2,817,000.

- Every £1 spent on marketing generated 9 visits.

In [None]:
summary_scatter3d_plot(summary_data, medium_volumes)

### DashBoard

The dashboard will contain the daily customer visits and the summary data, and both will be bound to widget events.

In [None]:
pan_title = '## Dashboard' # panel title
volumes = ['High Volumes', 'Medium Volumes', 'Low Volumes', 'All Volumes'] # segments of outlets
all_outlets = list(summary_data.index) # all outlets

# create widget to select daily visits volumes
select_volumes = pn.widgets.Select(name='Select Volumes', options=volumes, width=128)

# create widgets to select summary data features
parameter_x = pn.widgets.Select(name='Feature X', options=list(summary_data.columns), width=128)
parameter_y = pn.widgets.Select(name='Feature Y', options=list(summary_data.columns), width=128)

# create widget to select period
period = pn.widgets.Select(name='Select Period', options=[1, 7, 14, 28], value=7, width=80)


# define function to return the outlets for the selected volume
def get_volume():
    if select_volumes.value == 'High Volumes':
        return high_volumes
    elif select_volumes.value == 'Medium Volumes':
        return medium_volumes
    elif select_volumes.value == 'Low Volumes':  # compare the widget's value, not the widget itself
        return low_volumes
    else:
        return all_outlets


# define function to plot scatterplot
def plot_scatter(segment, x, y):
    segment = get_volume()
    corr = summary_data.corr() # get correlation
    # marker size: visits per marketing spend, scaled by 10
    factor = (summary_data['TotalVisit']*10/summary_data['Marketing (£)']).astype('int')
    return (summary_data.loc[segment]).hvplot.scatter(x, y, frame_width=300,
                           cmap='coolwarm', color=x,
                           size=factor,
                           title=f'{select_volumes.value}: {x} vs {y}\ncorrelation: {corr.loc[x,y]:.2f}')


# define function to plot lines
def plot_line(segment, period):
    segment = get_volume() # get segment of outlets
    moving_average = daily_customers[segment].rolling(window=period).mean() #rolling mean
    return moving_average.hvplot(title=f'Moving Average for {select_volumes.value} at period {period}',
                                 frame_width=350)


# define function to combine plots
def display_plots(segment, period, x, y):
    return plot_line(segment, period) + plot_scatter(segment, x, y)


interact(display_plots, segment=select_volumes, period=period,x=parameter_x, y=parameter_y)

**Description**

Moving Average Plot
- This shows the moving average over a selected period for a segment of the outlets in the daily customers data.

Scatter Plot
- This shows the correlation between two selected features of the summary data.
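
To make the moving average concrete, here is a minimal sketch on a made-up series. The outlet code `AAA`, the dates, and the values in `toy_daily` are hypothetical; the `rolling(...).mean()` call is the same one used in `plot_line`.

```python
import pandas as pd

# Hypothetical daily visits for one made-up outlet (not ChrisCo data).
dates = pd.date_range("2021-01-01", periods=6, freq="D")
toy_daily = pd.DataFrame({"AAA": [10, 20, 30, 40, 50, 60]}, index=dates)

# Each value is the mean of the current day and the two days before it.
window = 3
moving_average = toy_daily.rolling(window=window).mean()
print(moving_average)
```

The first `window - 1` rows are `NaN` because a full window is not yet available, which is why longer periods give a smoother line that starts later.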