# SECRID DATA ANALYSIS PROJECT


<span style="color: gray; font-size:1em;">November-2019</span>


## Table of Contents
* [Introduction](#introduction)
* [Section One - Import Data into IDE](#import_data)
    * [Part I - Gathering Data](#gather_data)
    * [Part II - Assessing Data](#assess_data)
    * [Part III - Cleaning Data](#clean_data)

* [Section Two - Exploratory Data Analysis](#EDA)
    * [Part I - Dataset of interest](#Dataset_of_interest)
    * [Part II - Assess dataset of interest](#Asses_dataset_of_interest)
    * [Part III - Variable : Customer Name](#Customer_Name)
        * [Customer Cluster: Cluster A](#cluster_A)
        * [Customer Cluster: Cluster B](#cluster_B)
        * [Customer Cluster: Cluster C](#cluster_C)
        * [Customer Cluster: Cluster D](#cluster_D)
        * [Customer Cluster: Cluster E](#cluster_E)

<a id='introduction'></a>
## Introduction

SECRID is a business entity based in the Netherlands. It produces, stocks and sells designer wallets, particularly leather-based wallets, in more than 100 countries worldwide.

This notebook explores SECRID sales data.


<a id='import_data'></a>
## Section One : Import Data into IDE

<a id='gather_data'></a>
## Part I : Gathering Data

In [1]:
# load required libraries
import numpy as np
import pandas as pd
import docx
from pandas import DataFrame
from pandas import read_excel


import xlsxwriter

import seaborn as sns
# Use seaborn style defaults and set the default figure size
sns.set(rc={'figure.figsize':(11, 4)})

import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.ticker as ticker
from matplotlib import pyplot as plt
from matplotlib import dates as mpl_dates
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
plt.style.use('seaborn')

import six

from datetime import datetime, timedelta

# environment settings:
pd.options.display.float_format = '{:.2f}'.format
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)
pd.set_option('display.max_seq_items',None)
pd.set_option('display.max_colwidth', 500)
pd.set_option('expand_frame_repr', True)


### Load .xlsx files

In [2]:
#load sales 2015 
df1 = pd.read_excel('SECRID DATA.xlsx', sheet_name=0) #load first sheet of SECRID DATA.xlsx

In [3]:
#load sales 2016 
df2 = pd.read_excel('SECRID DATA.xlsx', sheet_name=1) #load second sheet of SECRID DATA.xlsx

In [4]:
#load sales 2017
df3 = pd.read_excel('SECRID DATA.xlsx', sheet_name=2)  #load third sheet of SECRID DATA.xlsx

In [5]:
#load sales 2018
df4 = pd.read_excel('SECRID DATA.xlsx', sheet_name=3)  #load fourth sheet of SECRID DATA.xlsx

In [6]:
#load sales 2019
df5 = pd.read_excel('SECRID DATA.xlsx', sheet_name=4)  #load fifth sheet of SECRID DATA.xlsx

In [7]:
#combine df1, df2, df3, df4 and df5  into one complete dataframe 'df' for sales data
df = pd.concat([df1, df2, df3, df4, df5]) 
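The five `read_excel` calls above could probably be collapsed into one, since `sheet_name=None` makes pandas return every sheet as a dictionary of DataFrames. The sketch below is only an illustration: the helper names `combine_sheets` and `load_all_sheets` and the `source_sheet` column are hypothetical, and unlike the cell above it resets the row index with `ignore_index=True`.

```python
import pandas as pd

def combine_sheets(sheets):
    """Stack a {sheet_name: DataFrame} mapping into one DataFrame.

    The sheet name is kept in a hypothetical 'source_sheet' column so that
    every row can be traced back to the sheet it came from.
    """
    frames = [frame.assign(source_sheet=name) for name, frame in sheets.items()]
    return pd.concat(frames, ignore_index=True)

def load_all_sheets(path):
    # sheet_name=None asks pandas for all sheets as a dict of DataFrames
    return combine_sheets(pd.read_excel(path, sheet_name=None))

# Toy stand-in for the workbook, so the sketch runs without the Excel file
demo = combine_sheets({
    "2015": pd.DataFrame({"Quantity": [1, 2]}),
    "2016": pd.DataFrame({"Quantity": [3]}),
})
print(demo)
```

With the real workbook the call would be `df = load_all_sheets('SECRID DATA.xlsx')`, assuming every sheet shares the same columns.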

<a id='assess_data'></a>
## Part II - Assessing Data

In [None]:
df.head() #preview first five rows

In [None]:
df.tail() #preview last five rows

In [8]:
# Check size of the dataframe 
df.shape 

(1609533, 18)

In [None]:
# list names of columns in dataframe
df.columns 

In [None]:
# View info of the dataframe 
df.info()

In [None]:
# view some of the core statistics about columns
df.describe(include='all')

In [None]:
# check the Data types (dtypes) of each column in Dataframe
df.dtypes 

In [None]:
# Total sum of duplicate rows
df.duplicated().sum() # returns a Boolean Series with True value for each duplicated row and sums them

In [None]:
#return the number of unique elements in each column
print(df.nunique()) 

In [None]:
df.count() #returns the number of non-missing values for each column or row

In [None]:
#Total missing values(NaN) in a DataFrame
df.isnull().sum().sum()

In [None]:
#Count number of NaN for each column in DataFrame
print(df.isnull().sum()) 

<a id='issues'></a>
**Quality issues**
 * Rename columns so that their names are clear, descriptive and lower case, in line with best practice. Column 'name' can be renamed to 'customer_name' and column 'material' to 'type_of_material'.
 * Convert the following columns to the category data type: 'internal_id', 'document_number', 'customer_name', 'customer_category', 'retailer_role', 'shipping_country', 'item', 'display_name', 'pim_category', 'type_of_material', 
   'pim_colour', 'wsl', 'while_stock_lasts' and 'cardprotector_colour'.

<a id='clean_data'></a>
## Part III - Cleaning Data

In [9]:
# Create copy of original DataFrame
df_clean = df.copy()

In [10]:
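#Fixing messy column names (regex=False so that '(' and ')' are treated as literal characters)
df_clean.columns = df_clean.columns.str.strip().str.lower().str.replace(' ', '_', regex=False).str.replace('(', '', regex=False).str.replace(')', '', regex=False)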

In [11]:
# change column names using rename function
df_clean.rename(columns={                                                 
                         'name':'customer_name',
                         'wsl_+':'wsl',
                         'material':'type_of_material' }, 
                 inplace=True)

**Test**

In [12]:
df_clean.columns #List of column names in df_clean Dataframe

Index(['internal_id', 'document_number', 'date', 'customer_name',
       'customer_category', 'retailer_role', 'shipping_country', 'item',
       'display_name', 'quantity', 'amount', 'amount_foreign_currency',
       'pim_category', 'type_of_material', 'pim_colour', 'wsl',
       'while_stock_lasts', 'cardprotector_colour'],
      dtype='object')
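The column clean-up above can be checked in isolation on a few header strings. In the sketch below the raw header names are assumptions, guessed from the cleaned names in the output above; the real spreadsheet headers may differ.

```python
import pandas as pd

# Hypothetical raw headers, inferred from the cleaned names shown above
raw = pd.DataFrame(columns=["Internal ID", " Amount (Foreign Currency)", "WSL +"])

cleaned = (
    raw.columns.str.strip()                   # drop leading/trailing spaces
    .str.lower()                              # lower case
    .str.replace(" ", "_", regex=False)       # spaces -> underscores
    .str.replace("(", "", regex=False)        # remove literal parentheses
    .str.replace(")", "", regex=False)
)
print(list(cleaned))
```

Passing `regex=False` keeps the parentheses from being interpreted as regular-expression syntax, whatever the default of the installed pandas version is.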

**Define**
<br>Set appropriate data types for fields mentioned in the [Quality issues](#issues) 

In [13]:
# use .astype to change data type of dataframe columns
category_columns = ["internal_id", "document_number", "customer_name", "customer_category",
                    "retailer_role", "shipping_country", "item", "display_name",
                    "pim_category", "type_of_material", "pim_colour", "wsl",
                    "while_stock_lasts", "cardprotector_colour"]
df_clean = df_clean.astype({column: 'category' for column in category_columns})

**Test**

In [14]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1609533 entries, 0 to 310899
Data columns (total 18 columns):
internal_id                1609533 non-null category
document_number            1609533 non-null category
date                       1609533 non-null datetime64[ns]
customer_name              1609533 non-null category
customer_category          1604118 non-null category
retailer_role              45994 non-null category
shipping_country           1603431 non-null category
item                       1609533 non-null category
display_name               1609533 non-null category
quantity                   1609533 non-null int64
amount                     1609533 non-null float64
amount_foreign_currency    1609533 non-null float64
pim_category               1608493 non-null category
type_of_material           1608493 non-null category
pim_colour                 1603475 non-null category
wsl                        1609533 non-null category
while_stock_lasts          1609533 non-nu

In [None]:
# view some of the core statistics about the columns
df_clean.describe(include='all')

### Content structure of sales dataset
The sales data contains 18 columns (variables) and 1,609,533 rows (entries). 
That is, 1,609,533 sale transactions were recorded for SECRID in the January 2015 – July 2019 period. The dataset contains features about:

* Products for sale: item, display_name, pim_category, pim_colour, type_of_material and cardprotector_colour
* The country the item was shipped to: shipping_country  
* Customer data: customer_name, customer_category and retailer_role
* Sale transactions: internal_id, document_number, quantity, amount, amount_foreign_currency and date


### Detected Missing Values
A null value is a field with no value; it appears blank in the data. 
The table below gives the number and the resulting percentage of missing values per column.

| Variable Name | Value Count | Number of Missing Values | % of Missing Values |
| ------------- | ------------- | ------------- | ------------- |
| internal_id | 1,609,533 | 0 | 0% |
| document_number | 1,609,533 | 0 | 0% |
| date | 1,609,533 | 0 | 0% |
| customer_name | 1,609,533 | 0 | 0% |
| customer_category | 1,604,118 | 5,415 | 0.34% |
| retailer_role | 45,994 | 1,563,539 | 97.14% |
| shipping_country | 1,603,431 | 6,102 | 0.38% |
| item | 1,609,533 | 0 | 0% |
| display_name | 1,609,533 | 0 | 0% |
| quantity | 1,609,533 | 0 | 0% |
| amount | 1,609,533 | 0 | 0% |
| amount_foreign_currency | 1,609,533 | 0 | 0% |
| type_of_material | 1,608,493 | 1,040 | 0.065% |
| pim_category | 1,608,493 | 1,040 | 0.065% |
| pim_colour | 1,603,475 | 6,058 | 0.38% |
| wsl | 1,609,533 | 0 | 0% |
| while_stock_lasts | 1,609,533 | 0 | 0% |
| cardprotector_colour | 1,556,521 | 53,012 | 3.29% |
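A table like the one above can be derived directly from a DataFrame instead of being typed by hand. The following is a minimal sketch on toy data; `missing_summary` is a hypothetical helper name, and the toy frame only stands in for `df_clean`.

```python
import pandas as pd

def missing_summary(frame):
    """Per-column non-null count, missing count and missing percentage."""
    missing = frame.isnull().sum()
    return pd.DataFrame({
        "value_count": frame.count(),                      # non-missing values
        "missing": missing,                                # NaN / None values
        "pct_missing": (100 * missing / len(frame)).round(3),
    })

# Toy frame standing in for df_clean
toy = pd.DataFrame({
    "retailer_role": ["A", None, None, None],
    "quantity": [1, 2, 3, 4],
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df_clean)` should reproduce the counts listed in the table, provided the data is unchanged.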




<a id='EDA'></a>
## Exploratory Data Analysis

In [None]:
# create new column 'year' holding the year in which each sale transaction took place (helps with analysis)
df_clean['year'] = df_clean.date.dt.year

**Test**

In [None]:
df_clean.head()

<a id='Dataset_of_interest'></a>
## DATASET OF INTEREST

### The dataset of interest is made up only of transactions shipped to the Netherlands ('shipping_country'), where the quantity is positive (> 0) and the amount is greater than or equal to €9.95.

In [None]:
# rows with Quantity > 0 column and Amount column >= 9.95
dfrevenue = df_clean.loc[(df_clean.amount>=9.95) & (df_clean.quantity>0)]

In [None]:
# subset of transactions shipped to the Netherlands with Quantity > 0 and Amount >= 9.95
netherlandsrevenue = dfrevenue.loc[(dfrevenue.shipping_country == 'Netherlands')]

In [None]:
netherlandsrevenue.head()
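The two filtering steps above can also be written as a single boolean mask, which keeps the three conditions in one place. This is a sketch on toy rows; `dataset_of_interest` and `MIN_AMOUNT` are hypothetical names, not part of the notebook.

```python
import pandas as pd

MIN_AMOUNT = 9.95  # euro threshold stated in the heading above

def dataset_of_interest(frame, country="Netherlands"):
    """Rows shipped to `country` with quantity > 0 and amount >= MIN_AMOUNT."""
    mask = (
        (frame["shipping_country"] == country)
        & (frame["quantity"] > 0)
        & (frame["amount"] >= MIN_AMOUNT)
    )
    # .copy() returns an independent frame rather than a view of the original
    return frame.loc[mask].copy()

toy = pd.DataFrame({
    "shipping_country": ["Netherlands", "Netherlands", "Germany", "Netherlands"],
    "quantity": [2, -1, 5, 1],
    "amount": [19.90, 25.00, 50.00, 5.00],
})
subset = dataset_of_interest(toy)
print(subset)
```

Only the first toy row passes all three conditions: the second has a negative quantity, the third ships to Germany, and the fourth is below the amount threshold.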

In [None]:
#export to csv
netherlandsrevenue.to_csv (r'netherlandsrevenue.csv', index = True, header=True)

In [None]:
# subset to only columns of interest
netherlandsrevenue=netherlandsrevenue[['date','customer_name','customer_category','display_name','quantity','amount','year']]

**Test**

In [None]:
netherlandsrevenue.head()

<a id='Asses_dataset_of_interest'></a>
### Assess dataset of interest

In [None]:
netherlandsrevenue.info()

In [None]:
#total revenue generated in netherlands transactions for January 2015 -July 2019
netherlandsrevenue.amount.sum() 

In [None]:
#maximum revenue generated in a netherlands transaction
netherlandsrevenue.amount.max() 

In [None]:
#minimum revenue generated in a netherlands transaction
netherlandsrevenue.amount.min() 

In [None]:
#Total quantity of items sold in netherlands transactions for January 2015 -July 2019
netherlandsrevenue.quantity.sum() 

In [None]:
#maximum quantity sold in a single netherlands transaction
netherlandsrevenue.quantity.max() 

In [None]:
#minimum quantity sold in a single netherlands transaction
netherlandsrevenue.quantity.min() 

## ASSESS DATASET OF REVENUE LESS THAN €9.95

In [None]:
# rows with Quantity > 0 and Amount > 0
lessdfrevenue = df_clean.loc[(df_clean.amount > 0) & (df_clean.quantity>0)]

In [None]:
# keep only rows with 0 < Amount < 9.95
lessdfrevenue = lessdfrevenue[(lessdfrevenue.amount < 9.95) & (lessdfrevenue.amount > 0)]

In [None]:
lessdfrevenue.info()

In [None]:
# subset of transactions shipped to the Netherlands with Quantity > 0 and 0 < Amount < 9.95
lessdfrevenue = lessdfrevenue.loc[(lessdfrevenue.shipping_country == 'Netherlands')]

In [None]:
lessdfrevenue.quantity.sum() 

In [None]:
lessdfrevenue.amount.min()

In [None]:
lessdfrevenue.amount.max()

In [None]:
lessdfrevenue.amount.mean()

In [None]:
#quantity of items sold and revenue generated in the Netherlands per year (amounts below €9.95)
lessdfrevenuerank = lessdfrevenue.groupby(
   ['year']
).agg(
    {
         'amount': 'sum',    # total revenue per year
         'quantity': 'sum'   # total items sold per year
    }
)

lessdfrevenuerank.sort_values(by=['year'], inplace=True, ascending=True)

In [None]:
lessdfrevenuerank

<a id='Customer_Name'></a>
### VARIABLE  : customer_name

In [None]:
netherlandsrevenue.customer_name.describe()#overview of variable; count, unique, top,freq

In [None]:
print(netherlandsrevenue.customer_name.cat.categories)  # list customer categories (inherited from df_clean, so names outside this subset may appear)

In [None]:
netherlandsrevenue.customer_name.value_counts()  #Return counts of unique values

In [None]:
print(netherlandsrevenue.customer_name.isnull().sum())  #Number of missing values in Customer_Name column

### Findings
This column records the customer name for each sale transaction in the Netherlands during the January 2015 – July 2019 period. Each entry starts with the prefix 'C-' (a capital C followed by a hyphen), then a unique number, then the customer name. There are 15,003 unique customer names, which means that the business had 15,003 unique **Netherlands customers** in this period. The customer with the most transactions is **C-6423 Sarthro Travelbags B.V.**, with 17,783 sale transactions.
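One caveat when counting customers in a filtered categorical column: the category list is inherited from the full DataFrame, so it can still contain names that no longer occur in the subset. The sketch below illustrates this on three made-up names (hypothetical, not taken from the data).

```python
import pandas as pd

# Hypothetical customer names, stored as a categorical column
names = pd.Series(["C-1 Alpha", "C-2 Beta", "C-3 Gamma"], dtype="category")
subset = names[names != "C-3 Gamma"]            # filter one customer out

print(len(subset.cat.categories))               # categories still include the removed name
print(subset.nunique())                         # values actually present in the subset

trimmed = subset.cat.remove_unused_categories() # drop categories with no rows
print(len(trimmed.cat.categories))
```

If the unique-customer figure is read from `.cat.categories` or from a `value_counts()` that lists zero counts, applying `remove_unused_categories()` first may be the safer option.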

<a id='qi_Customer_Name'></a>
**Quality Issue**

For customers with several branches, each branch is recorded as a separate customer.
The 'C-' prefix and the unique number that follows it are repeated for all branches of the same business.<br>
For example:

* 'C-10215 James Shoe Care Canada Place Limited',
* 'C-10215| C-1 James Shoe Care Canada Place Limited : James Shoe Care Canada Place Limited : James Shoe Care Wharf',
* 'C-10215| C-2 James Shoe Care Canada Place Limited : James Shoe Care Canada Place Limited : James Shoe Care Westfield',
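One possible way to group branches under a single customer is to extract the leading `C-<number>` prefix. This is only a sketch: it assumes every name starts with that prefix (names that do not match would yield `NaN`), and the `customer_id` variable is a hypothetical name.

```python
import pandas as pd

names = pd.Series([
    "C-10215 James Shoe Care Canada Place Limited",
    "C-10215| C-1 James Shoe Care Canada Place Limited : James Shoe Care Canada Place Limited : James Shoe Care Wharf",
    "C-10215| C-2 James Shoe Care Canada Place Limited : James Shoe Care Canada Place Limited : James Shoe Care Westfield",
])

# Capture 'C-' followed by digits at the very start of each name
customer_id = names.str.extract(r"^(C-\d+)", expand=False)
print(customer_id.tolist())
```

Grouping by such an identifier instead of `customer_name` would treat the three entries above as one customer.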


In [None]:
#quantity of items sold and revenue generated in the Netherlands per year
netherlandsrevenuerank = netherlandsrevenue.groupby(
   ['year']
).agg(
    {
         'amount': 'sum',    # total revenue per year
         'quantity': 'sum'   # total items sold per year
    }
)

netherlandsrevenuerank.sort_values(by=['year'], inplace=True, ascending=True)

In [None]:
netherlandsrevenuerank

In [None]:
#export to csv
netherlandsrevenuerank.to_csv(r'Netherlands_aggregatedrevenue_per_year.csv', index = True, header=True)

In [None]:
# view revenue per customer in the Netherlands
customerrank = netherlandsrevenue.groupby(
   ['customer_name']
).agg(
    {
         'amount': 'sum',    # total revenue per customer
         'quantity': 'sum'   # total items sold per customer
    }
)

customerrank.sort_values(by=['amount'], inplace=True, ascending=False)

In [None]:
customerrank.head(2)

In [None]:
# view revenue per customer per year in the netherlands using pivot table
customernametable = pd.pivot_table(netherlandsrevenue, index= "customer_name",columns='year',
                       values=["amount","quantity"],aggfunc=sum, margins=True)

In [None]:
# sort by total amount per customer in descending order
customernametable.sort_values(by=('amount', 'All'), ascending=False, inplace=True)

**Test**

In [None]:
customernametable.head()

In [None]:
#fix column names
customernametable.columns =[s1 + '_' + str(s2) for (s1,s2) in customernametable.columns.tolist()]

In [None]:
#proposed column order
columnsTitles = ['quantity_2015','amount_2015','quantity_2016','amount_2016','quantity_2017','amount_2017','quantity_2018','amount_2018','quantity_2019','amount_2019','quantity_All','amount_All']

In [None]:
#re-arrange column indexes(order) based on above columnsTitles
customernametable = customernametable.reindex(columns=columnsTitles)

**Test**

In [None]:
customernametable.head()

In [None]:
#drop the first row, which is the 'All' margins row after the descending sort
customernametable=customernametable.drop(customernametable.index[0])

**Test**

In [None]:
customernametable.head()
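The pivot, flatten and drop steps above are repeated later for other variables, so they could be wrapped in one function. The sketch below is an illustration under assumptions: `yearly_pivot` is a hypothetical helper, and it removes the `All` row by label rather than by position, which does not rely on the sort order.

```python
import pandas as pd

def yearly_pivot(frame, index):
    """Revenue and quantity per `index` value and year, with flat column names."""
    table = pd.pivot_table(
        frame, index=index, columns="year",
        values=["amount", "quantity"], aggfunc="sum",
        margins=True,                      # adds an 'All' row and 'All' columns
    )
    # Flatten ('amount', 2015) -> 'amount_2015', ('amount', 'All') -> 'amount_All'
    table.columns = [f"{measure}_{year}" for measure, year in table.columns]
    table = table.drop(index="All")        # remove the grand-total row by label
    return table.sort_values("amount_All", ascending=False)

# Toy rows with hypothetical customer names
toy = pd.DataFrame({
    "customer_name": ["C-1 Alpha", "C-1 Alpha", "C-2 Beta"],
    "year": [2015, 2016, 2015],
    "amount": [100.0, 50.0, 30.0],
    "quantity": [4, 2, 1],
})
table = yearly_pivot(toy, "customer_name")
print(table)
```

A customer with no sales in a given year shows `NaN` in that year's columns, which is why the cluster cells later call `fillna(0)`.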

## CUSTOMER CLUSTERS
### Group customers into clusters based on the total revenue per customer for the January 2015 - July 2019 period

<a id='cluster_A'></a>
### Cluster A : cluster_A

In [None]:
# cluster of customers who brought in revenue greater than or equal to 1,000,000 Euros in the 2015 - 2019 period
cluster_A = customernametable[(customernametable.amount_All>=1000000)]

**Test**

In [None]:
cluster_A

In [None]:
cluster_A.info()

### Totals for cluster_A

In [None]:
cluster_A.isnull().sum()

In [None]:
cluster_A = cluster_A.fillna(0)

In [None]:
#sum total of revenue for customer cluster_A
sum(amount for amount in cluster_A.amount_All)

In [None]:
#sum total of revenue for 2015 customer cluster_A
sum(amount for amount in cluster_A.amount_2015)

In [None]:
#sum total of revenue for 2016 customer cluster_A
sum(amount for amount in cluster_A.amount_2016)

In [None]:
#sum total of revenue for 2017 customer cluster_A
sum(amount for amount in cluster_A.amount_2017)

In [None]:
#sum total of revenue for 2018 customer cluster_A
sum(amount for amount in cluster_A.amount_2018)

In [None]:
#sum total of revenue for 2019 customer cluster_A
sum(amount for amount in cluster_A.amount_2019)

In [None]:
#sum total of quantity of items sold for customer cluster_A
sum(qty for qty in cluster_A.quantity_All)

In [None]:
#sum total of quantity of items sold for 2015 customer cluster_A
sum(qty for qty in cluster_A.quantity_2015)

In [None]:
#sum total of quantity of items sold for 2016 customer cluster_A
sum(qty for qty in cluster_A.quantity_2016)

In [None]:
#sum total of quantity of items sold for 2017 customer cluster_A
sum(qty for qty in cluster_A.quantity_2017)

In [None]:
#sum total of quantity of items sold for 2018 customer  cluster_A
sum(qty for qty in cluster_A.quantity_2018)

In [None]:
#sum total of quantity of items sold for 2019 customer cluster_A
sum(qty for qty in cluster_A.quantity_2019)
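The per-column totals computed cell by cell above can also be obtained in a single call, because `DataFrame.sum()` returns one total per column and skips `NaN` values by default. A minimal sketch on a toy cluster with hypothetical customers:

```python
import pandas as pd

# Toy stand-in for a cluster table; one customer has no 2015 sales
cluster = pd.DataFrame(
    {"amount_2015": [10.0, None], "amount_All": [10.0, 5.0], "quantity_All": [2, 1]},
    index=["C-1 Alpha", "C-2 Beta"],
)

totals = cluster.fillna(0).sum()   # one total per column
print(totals)
```

For the real table, `cluster_A.sum()` should give the same figures as the individual `sum(...)` cells.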

<a id='cluster_B'></a>
### Cluster B : cluster_B

In [None]:
# cluster of customers who brought in total revenue less than 1,000,000 but greater than or equal to 100,000 Euros in the 2015 - 2019 period
cluster_B = customernametable[(customernametable.amount_All < 1000000) & (customernametable.amount_All >= 100000)]

In [None]:
cluster_B.head()

In [None]:
cluster_B.info()

In [None]:
cluster_B = cluster_B.fillna(0)

In [None]:
#sum total of revenue for customer cluster_B
sum(amount for amount in cluster_B.amount_All)

In [None]:
#sum total of revenue for 2015 customer cluster_B
sum(amount for amount in cluster_B.amount_2015)

In [None]:
#sum total of revenue for 2016 customer cluster_B
sum(amount for amount in cluster_B.amount_2016)

In [None]:
#sum total of revenue for 2017 customer cluster_B
sum(amount for amount in cluster_B.amount_2017)

In [None]:
#sum total of revenue for 2018 customer cluster_B
sum(amount for amount in cluster_B.amount_2018)

In [None]:
#sum total of revenue for 2019 customer cluster_B
sum(amount for amount in cluster_B.amount_2019)

In [None]:
#sum total of quantity of items sold for customer cluster_B
sum(qty for qty in cluster_B.quantity_All)

In [None]:
#sum total of quantity of items sold for 2015 customer cluster_B
sum(qty for qty in cluster_B.quantity_2015)

In [None]:
#sum total of quantity of items sold for 2016 customer cluster_B
sum(qty for qty in cluster_B.quantity_2016)

In [None]:
#sum total of quantity of items sold for 2017 customer cluster_B
sum(qty for qty in cluster_B.quantity_2017)

In [None]:
#sum total of quantity of items sold for 2018 customer cluster_B
sum(qty for qty in cluster_B.quantity_2018)

In [None]:
#sum total of quantity of items sold for 2019 customer cluster_B
sum(qty for qty in cluster_B.quantity_2019)

<a id='cluster_C'></a>
### Cluster C : cluster_C

In [None]:
# cluster of customers who brought in total revenue less than 100,000 but greater than or equal to 10,000 Euros in the 2015 - 2019 period
cluster_C = customernametable[(customernametable.amount_All < 100000)& (customernametable.amount_All >= 10000)]

In [None]:
cluster_C.head()

In [None]:
cluster_C.info()

In [None]:
cluster_C = cluster_C.fillna(0)

In [None]:
#sum total of revenue for customer cluster_C
sum(amount for amount in cluster_C.amount_All)

In [None]:
#sum total of revenue for 2015 customer cluster_C
sum(amount for amount in cluster_C.amount_2015)

In [None]:
#sum total of revenue for 2016 customer cluster_C
sum(amount for amount in cluster_C.amount_2016)

In [None]:
#sum total of revenue for 2017 customer cluster_C
sum(amount for amount in cluster_C.amount_2017)

In [None]:
#sum total of revenue for 2018 customer cluster_C
sum(amount for amount in cluster_C.amount_2018)

In [None]:
#sum total of revenue for 2019 customer cluster_C
sum(amount for amount in cluster_C.amount_2019)

In [None]:
#sum total of quantity of items sold for customer cluster_C
sum(qty for qty in cluster_C.quantity_All)

In [None]:
#sum total of quantity of items sold for 2015 customer cluster_C
sum(qty for qty in cluster_C.quantity_2015)

In [None]:
#sum total of quantity of items sold for 2016 customer cluster_C
sum(qty for qty in cluster_C.quantity_2016)

In [None]:
#sum total of quantity of items sold for 2017 customer cluster_C
sum(qty for qty in cluster_C.quantity_2017)

In [None]:
#sum total of quantity of items sold for 2018 customer cluster_C
sum(qty for qty in cluster_C.quantity_2018)

In [None]:
#sum total of quantity of items sold for 2019 customer cluster_C
sum(qty for qty in cluster_C.quantity_2019)

<a id='cluster_D'></a>
### Cluster D : cluster_D

In [None]:
# cluster of customers who brought in revenue greater than or equal to 1,000 but less than 10,000 Euros in the 2015 - 2019 period
cluster_D = customernametable[(customernametable.amount_All < 10000)& (customernametable.amount_All >= 1000)]

In [None]:
cluster_D

In [None]:
cluster_D = cluster_D.fillna(0)

In [None]:
#sum total of revenue for customer cluster_D
sum(amount for amount in cluster_D.amount_All)

In [None]:
#sum total of revenue for 2015 customer cluster_D
sum(amount for amount in cluster_D.amount_2015)

In [None]:
#sum total of revenue for 2016 customer cluster_D
sum(amount for amount in cluster_D.amount_2016)

In [None]:
#sum total of revenue for 2017 customer cluster_D
sum(amount for amount in cluster_D.amount_2017)

In [None]:
#sum total of revenue for 2018 customer cluster_D
sum(amount for amount in cluster_D.amount_2018)

In [None]:
#sum total of revenue for 2019 customer cluster_D
sum(amount for amount in cluster_D.amount_2019)

In [None]:
#sum total of quantity of items sold for customer cluster_D
sum(qty for qty in cluster_D.quantity_All)

In [None]:
#sum total of quantity of items sold for 2015 customer cluster_D
sum(qty for qty in cluster_D.quantity_2015)

In [None]:
#sum total of quantity of items sold for 2016 customer cluster_D
sum(qty for qty in cluster_D.quantity_2016)

In [None]:
#sum total of quantity of items sold for 2017 customer cluster_D
sum(qty for qty in cluster_D.quantity_2017)

In [None]:
#sum total of quantity of items sold for 2018 customer cluster_D
sum(qty for qty in cluster_D.quantity_2018)

In [None]:
#sum total of quantity of items sold for 2019 customer cluster_D
sum(qty for qty in cluster_D.quantity_2019)

<a id='cluster_E'></a>
### Cluster E: cluster_E

In [None]:
# cluster of customers who brought in revenue less than 1,000 but greater than or equal to 100 Euros in the 2015 - 2019 period
cluster_E = customernametable[(customernametable.amount_All < 1000)&(customernametable.amount_All >= 100)]

In [None]:
cluster_E.info()

In [None]:
cluster_E = cluster_E.fillna(0)

In [None]:
#sum total of revenue for customer cluster_E
sum(amount for amount in cluster_E.amount_All)

In [None]:
#sum total of revenue for 2015 customer cluster_E
sum(amount for amount in cluster_E.amount_2015)

In [None]:
#sum total of revenue for 2016 customer cluster_E
sum(amount for amount in cluster_E.amount_2016)

In [None]:
#sum total of revenue for 2017 customer cluster_E
sum(amount for amount in cluster_E.amount_2017)

In [None]:
#sum total of revenue for 2018 customer cluster_E
sum(amount for amount in cluster_E.amount_2018)

In [None]:
#sum total of revenue for 2019 customer cluster_E
sum(amount for amount in cluster_E.amount_2019)

In [None]:
#sum total of quantity of items sold for customer cluster_E
sum(qty for qty in cluster_E.quantity_All)

In [None]:
#sum total of quantity of items sold for 2015 customer cluster_E
sum(qty for qty in cluster_E.quantity_2015)

In [None]:
#sum total of quantity of items sold for 2016 customer cluster_E
sum(qty for qty in cluster_E.quantity_2016)

In [None]:
#sum total of quantity of items sold for 2017 customer cluster_E
sum(qty for qty in cluster_E.quantity_2017)

In [None]:
#sum total of quantity of items sold for 2018 customer cluster_E
sum(qty for qty in cluster_E.quantity_2018)

In [None]:
#sum total of quantity of items sold for 2019 customer cluster_E
sum(qty for qty in cluster_E.quantity_2019)

<a id='cluster_F'></a>
### Cluster F: cluster_F

In [None]:
# cluster of customers who brought in revenue less than 100 Euros in the 2015 - 2019 period
cluster_F = customernametable[(customernametable.amount_All < 100)]

In [None]:
cluster_F.info()

In [None]:
cluster_F = cluster_F.fillna(0)

In [None]:
#sum total of revenue for customer cluster_F
sum(amount for amount in cluster_F.amount_All)

In [None]:
#sum total of revenue for 2015 customer cluster_F
sum(amount for amount in cluster_F.amount_2015)

In [None]:
#sum total of revenue for 2016 customer cluster_F
sum(amount for amount in cluster_F.amount_2016)

In [None]:
#sum total of revenue for 2017 customer cluster_F
sum(amount for amount in cluster_F.amount_2017)

In [None]:
#sum total of revenue for 2018 customer cluster_F
sum(amount for amount in cluster_F.amount_2018)

In [None]:
#sum total of revenue for 2019 customer cluster_F
sum(amount for amount in cluster_F.amount_2019)

In [None]:
#sum total of quantity of items sold for customer cluster_F
sum(qty for qty in cluster_F.quantity_All)

In [None]:
#sum total of quantity of items sold for 2015 customer cluster_F
sum(qty for qty in cluster_F.quantity_2015)

In [None]:
#sum total of quantity of items sold for 2016 customer cluster_F
sum(qty for qty in cluster_F.quantity_2016)

In [None]:
#sum total of quantity of items sold for 2017 customer cluster_F
sum(qty for qty in cluster_F.quantity_2017)

In [None]:
#sum total of quantity of items sold for 2018 customer cluster_F
sum(qty for qty in cluster_F.quantity_2018)

In [None]:
#sum total of quantity of items sold for 2019 customer cluster_F
sum(qty for qty in cluster_F.quantity_2019)
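The six clusters are defined by revenue thresholds in the filtering code (100, 1,000, 10,000, 100,000 and 1,000,000 Euros), so the assignment could also be expressed once with `pd.cut`. The sketch below assumes left-closed intervals, matching the `>=` lower bounds used in the cells above; `assign_cluster` is a hypothetical helper and the revenue values are made up.

```python
import numpy as np
import pandas as pd

# Left-closed bins [lower, upper), mirroring the >= / < conditions used above
edges = [-np.inf, 100, 1_000, 10_000, 100_000, 1_000_000, np.inf]
labels = ["F", "E", "D", "C", "B", "A"]

def assign_cluster(total_revenue):
    """Map total revenue per customer to a cluster label A-F."""
    return pd.cut(total_revenue, bins=edges, labels=labels, right=False)

# Made-up totals that sit on and around the thresholds
totals = pd.Series([50.0, 100.0, 999.99, 10_000.0, 250_000.0, 1_000_000.0])
clusters = assign_cluster(totals)
print(clusters.tolist())
```

Applied to the real table, something like `customernametable.groupby(assign_cluster(customernametable.amount_All)).sum()` would give the totals for all clusters in one step, assuming `amount_All` has no missing values.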

## customer category

In [None]:
netherlandsrevenue.customer_category.describe()#overview of variable; count, unique, top,freq

In [None]:
dict(enumerate(netherlandsrevenue['customer_category'].cat.categories ) )

In [None]:
netherlandsrevenue.customer_category.cat.categories # list all unique customer categories

In [None]:
netherlandsrevenue.customer_category.value_counts()  #Return the number of rows for each unique value in the column

In [None]:
print(netherlandsrevenue.customer_category.isnull().sum()) #Number of missing values in customer_category column

In [None]:
# view revenue per customer and customer category in the Netherlands
customercategoryrank = netherlandsrevenue.groupby(
   ['customer_name','customer_category']
).agg(
    {
         'amount': 'sum',    # total revenue per customer and category
         'quantity': 'sum'   # total items sold per customer and category
    }
)

customercategoryrank.sort_values(by=['amount'], inplace=True, ascending=False)

**Test**

In [None]:
customercategoryrank.head()

## customer_category per customer_name pivot table

In [None]:
# view revenue per customer per customer category per year in the netherlands using pivot table
customercategorytable = pd.pivot_table(netherlandsrevenue, index= ['customer_name', 'customer_category'],columns='year',
                       values=["amount","quantity"],aggfunc=sum, margins=True)

In [None]:
# sort by total amount per customer in descending order
customercategorytable.sort_values(by=('amount', 'All'), ascending=False, inplace=True)

In [None]:
# view pivot table
customercategorytable.head(7)

In [None]:
#fix column names
customercategorytable.columns =[s1 + '_' + str(s2) for (s1,s2) in customercategorytable.columns.tolist()]

In [None]:
#proposed column order
columnsTitles = ['quantity_2015','amount_2015','quantity_2016','amount_2016','quantity_2017','amount_2017','quantity_2018','amount_2018','quantity_2019','amount_2019','quantity_All','amount_All']

In [None]:
#re-arrange column indexes(order) based on above columnsTitles
customercategorytable  = customercategorytable.reindex(columns=columnsTitles)

In [None]:
#drop the first row, which is the 'All' margins row after the descending sort
customercategorytable=customercategorytable.drop(customercategorytable.index[0])

**Test**

In [None]:
customercategorytable.head()

In [None]:
#export to csv
customercategorytable.to_csv (r'customername_customercategorytable.csv', index = True, header=True)

## Revenue/Amount CUSTOMER CATEGORY

In [None]:
# view quantity of items sold and revenue generated per customer category per year in the Netherlands using pivot table
amount_customercategorytable = pd.pivot_table(netherlandsrevenue, index= "customer_category",columns='year',
                       values=["amount","quantity"],aggfunc=sum, margins=True)

In [None]:
# sort by total amount per customer category in descending order
amount_customercategorytable.sort_values(by=('amount', 'All'), ascending=False, inplace=True)

In [None]:
# view pivot table
amount_customercategorytable.head()

In [None]:
#fix column names
amount_customercategorytable.columns =[s1 + '_' + str(s2) for (s1,s2) in amount_customercategorytable.columns.tolist()]

In [None]:
#proposed column order
columnsTitles = ['quantity_2015','amount_2015','quantity_2016','amount_2016','quantity_2017','amount_2017','quantity_2018','amount_2018','quantity_2019','amount_2019','quantity_All','amount_All']

In [None]:
#re-arrange column indexes(order) based on above columnsTitles
amount_customercategorytable = amount_customercategorytable.reindex(columns=columnsTitles)

In [None]:
#drop the first row, which is the 'All' margins row after the descending sort
amount_customercategorytable = amount_customercategorytable.drop(amount_customercategorytable.index[0])

In [None]:
amount_customercategorytable

In [None]:
#export to csv
amount_customercategorytable.to_csv (r'customercategorytable_per_year.csv', index = True, header=True)

## display_name

In [None]:
netherlandsrevenue.display_name.describe()

In [None]:
# view revenue per product (display_name) in the Netherlands
dispalynamerank = netherlandsrevenue.groupby(
   ['display_name']
).agg(
    {
         'amount': 'sum',    # total revenue per product
         'quantity': 'sum'   # total items sold per product
    }
)

dispalynamerank.sort_values(by=['amount'], inplace=True, ascending=False)

In [None]:
dispalynamerank

In [None]:
dispalynameranktable = pd.pivot_table(netherlandsrevenue, index= "display_name",columns='year',
                       values=["amount","quantity"],aggfunc=sum, margins=True)

# sort by total amount per product in descending order
dispalynameranktable.sort_values(by=('amount', 'All'), ascending=False, inplace=True)

In [None]:
dispalynameranktable

## TIME SERIES SEASONALITY

In [None]:
netherlandsrevenue.head()

In [None]:
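# use the transaction date as a sorted index so that date-based slicing works reliably
netherlandsrevenue = netherlandsrevenue.set_index('date').sort_index()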

In [None]:
netherlandsrevenue.head()

In [None]:
netherlandsrevenue['amount'].plot(linewidth=0.5)

In [None]:
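netherlandsrevenue.loc["2017-01-02"].head()  # select rows by date with .loc (plain [] row selection by date string is not supported in recent pandas)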

In [None]:
# Plot the data
plt.rcParams['figure.figsize']=(15,10)
netherlandsrevenue.amount.resample('W').sum().plot(linestyle='solid',x_compat=True)

# Add a legend
plt.legend(['Weekly Revenue generated'],fontsize=18)
plt.xlabel('Date', fontsize=18)
plt.ylabel('Revenue(€)',fontsize=18)
plt.title( 'TIME SERIES PLOT OF WEEKLY REVENUE GENERATED OVER JAN 2015-JULY 2019 IN THE NETHERLANDS',fontsize=20,y=1.03)
plt.tick_params(labelsize=12)
plt.tight_layout()

plt.savefig('tsamount.png', format="png", dpi=300)  # export graph as .png
# Show the plot
plt.show()

In [None]:
ts = netherlandsrevenue.copy()

In [None]:
ts.head()

### 2015 QUANTITY OF ITEMS SOLD AND REVENUE GENERATED PER QUARTER

In [None]:
Q12015 = ts.loc['2015-01-01':'2015-03-31'].agg(
    {
         'amount': 'sum',    # total revenue in the quarter
         'quantity': 'sum'   # total items sold in the quarter
         
    }
)


In [None]:
Q12015

In [None]:
Q22015 = ts.loc['2015-04-01':'2015-06-30'].agg(
    {
        'amount': 'sum',    # total revenue for the quarter
        'quantity': 'sum'   # total items sold in the quarter
    }
)
Q22015

In [None]:
Q32015 = ts.loc['2015-07-01':'2015-09-30'].agg(
    {
        'amount': 'sum',    # total revenue for the quarter
        'quantity': 'sum'   # total items sold in the quarter
    }
)
Q32015

In [None]:
Q42015 = ts.loc['2015-10-01':'2015-12-31'].agg(
    {
        'amount': 'sum',    # total revenue for the quarter
        'quantity': 'sum'   # total items sold in the quarter
    }
)
Q42015

### 2016 QUANTITY OF ITEMS SOLD AND REVENUE GENERATED PER QUARTER

In [None]:
Q12016 = ts.loc['2016-01-01':'2016-03-31'].agg(
    {
        'amount': 'sum',    # total revenue for the quarter
        'quantity': 'sum'   # total items sold in the quarter
    }
)
Q12016

In [None]:
Q22016 = ts.loc['2016-04-01':'2016-06-30'].agg(
    {
        'amount': 'sum',    # total revenue for the quarter
        'quantity': 'sum'   # total items sold in the quarter
    }
)
Q22016

In [None]:
Q32016 = ts.loc['2016-07-01':'2016-09-30'].agg(
    {
        'amount': 'sum',    # total revenue for the quarter
        'quantity': 'sum'   # total items sold in the quarter
    }
)
Q32016

In [None]:
Q42016 = ts.loc['2016-10-01':'2016-12-31'].agg(
    {
        'amount': 'sum',    # total revenue for the quarter
        'quantity': 'sum'   # total items sold in the quarter
    }
)
Q42016

### 2017 QUANTITY OF ITEMS SOLD AND REVENUE GENERATED PER QUARTER

In [None]:
Q12017 = ts.loc['2017-01-01':'2017-03-31'].agg(
    {
        'amount': 'sum',    # total revenue for the quarter
        'quantity': 'sum'   # total items sold in the quarter
    }
)
Q12017

In [None]:
Q22017 = ts.loc['2017-04-01':'2017-06-30'].agg(
    {
        'amount': 'sum',    # total revenue for the quarter
        'quantity': 'sum'   # total items sold in the quarter
    }
)
Q22017

In [None]:
Q32017 = ts.loc['2017-07-01':'2017-09-30'].agg(
    {
        'amount': 'sum',    # total revenue for the quarter
        'quantity': 'sum'   # total items sold in the quarter
    }
)
Q32017

In [None]:
Q42017 = ts.loc['2017-10-01':'2017-12-31'].agg(
    {
        'amount': 'sum',    # total revenue for the quarter
        'quantity': 'sum'   # total items sold in the quarter
    }
)
Q42017

### 2018 QUANTITY OF ITEMS SOLD AND REVENUE GENERATED PER QUARTER

In [None]:
Q12018 = ts.loc['2018-01-01':'2018-03-31'].agg(
    {
        'amount': 'sum',    # total revenue for the quarter
        'quantity': 'sum'   # total items sold in the quarter
    }
)
Q12018

In [None]:
Q22018 = ts.loc['2018-04-01':'2018-06-30'].agg(
    {
        'amount': 'sum',    # total revenue for the quarter
        'quantity': 'sum'   # total items sold in the quarter
    }
)
Q22018

In [None]:
Q32018 = ts.loc['2018-07-01':'2018-09-30'].agg(
    {
        'amount': 'sum',    # total revenue for the quarter
        'quantity': 'sum'   # total items sold in the quarter
    }
)
Q32018

In [None]:
Q42018 = ts.loc['2018-10-01':'2018-12-31'].agg(
    {
        'amount': 'sum',    # total revenue for the quarter
        'quantity': 'sum'   # total items sold in the quarter
    }
)
Q42018

### 2019 QUANTITY OF ITEMS SOLD AND REVENUE GENERATED PER QUARTER

In [None]:
Q12019 = ts.loc['2019-01-01':'2019-03-31'].agg(
    {
        'amount': 'sum',    # total revenue for the quarter
        'quantity': 'sum'   # total items sold in the quarter
    }
)
Q12019

In [None]:
Q22019 = ts.loc['2019-04-01':'2019-06-30'].agg(
    {
        'amount': 'sum',    # total revenue for the quarter
        'quantity': 'sum'   # total items sold in the quarter
    }
)
Q22019

In [None]:
Q32019 = ts.loc['2019-07-01':'2019-09-30'].agg(
    {
        'amount': 'sum',    # total revenue for the quarter
        'quantity': 'sum'   # total items sold in the quarter
    }
)
Q32019
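
The per-quarter cells above repeat the same date slice and aggregation. As a possible shortcut, the same totals could be computed in one step by grouping on the calendar quarter of each date. This is a sketch on a small made-up frame (`demo_ts` is a hypothetical stand-in for `ts`), assuming a `DatetimeIndex` and the `amount` and `quantity` columns.

```python
import pandas as pd

# Hypothetical stand-in for ts (values are made up)
demo_ts = pd.DataFrame(
    {"amount": [100.0, 50.0, 25.0, 10.0], "quantity": [4, 2, 1, 1]},
    index=pd.to_datetime(["2015-01-15", "2015-03-31", "2015-04-01", "2016-02-10"]),
)

# Map each date to its calendar quarter and sum both columns per quarter
quarters = demo_ts.index.to_period("Q")
quarterly_totals = demo_ts.groupby(quarters)[["amount", "quantity"]].sum()
print(quarterly_totals)
```

Only quarters that contain at least one row appear in the result, so a quarter without sales would be missing rather than shown as zero.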