# SECRID DATA ANALYSIS PROJECT


<span style="color: gray; font-size:1em;">January-2020</span>


## Table of Contents
* [Introduction](#introduction)
* [Section One - Import Data into IDE](#import_data)
    * [Part I - Gathering Data](#gather_data)
    * [Part II - Assessing Data](#assess_data)
    * [Part III - Cleaning Data](#clean_data)
* [Section Two - Product Item Analysis](#items)
    * [Item I - Miniwallet Analysis](#miniwallet)
    * [Item II - Slimwallet Analysis](#slimwallet)
    * [Item III - Twinwallet Analysis](#twinwallet)



<a id='introduction'></a>
## Introduction

SECRID is a company based in the Netherlands. It produces, stocks and sells designer wallets, particularly leather-based wallets, in more than 100 countries.

This notebook explores SECRID Miniwallet sales data.


<a id='import_data'></a>
## Section One : Import Data into IDE

<a id='gather_data'></a>
## Part I : Gathering Data

In [1]:
# load required libraries
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import read_excel

import zipfile
import xlsxwriter

import seaborn as sns
# Use seaborn style defaults and set the default figure size
sns.set(rc={'figure.figsize':(11, 4)})

import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.ticker as ticker
from matplotlib import dates as mpl_dates
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
plt.style.use('seaborn')  # this style is named 'seaborn-v0_8' in Matplotlib 3.6 and later

import six

from datetime import datetime, timedelta

# environment settings:
pd.options.display.float_format = '{:.2f}'.format
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows',None)
pd.set_option('display.max_seq_items',None)
pd.set_option('display.max_colwidth', 500)
pd.set_option('expand_frame_repr', True)


### Load .xlsx files

In [2]:
# load sales 2015
df1 = pd.read_excel('SECRID DATA.xlsx', sheet_name=0)  # first sheet of SECRID DATA.xlsx

In [3]:
# load sales 2016
df2 = pd.read_excel('SECRID DATA.xlsx', sheet_name=1)  # second sheet of SECRID DATA.xlsx

In [4]:
# load sales 2017
df3 = pd.read_excel('SECRID DATA.xlsx', sheet_name=2)  # third sheet of SECRID DATA.xlsx

In [5]:
# load sales 2018
df4 = pd.read_excel('SECRID DATA.xlsx', sheet_name=3)  # fourth sheet of SECRID DATA.xlsx

In [6]:
# load sales 2019
df5 = pd.read_excel('SECRID DATA.xlsx', sheet_name=4)  # fifth sheet of SECRID DATA.xlsx

In [7]:
# load product data
df6 = pd.read_excel('SECRID DATA.xlsx', sheet_name=5)  # sixth sheet of SECRID DATA.xlsx

In [8]:
# combine the yearly sales dataframes (df1 to df5) into one complete dataframe 'df'
df = pd.concat([df1, df2, df3, df4, df5]) 
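
The concatenation step can be sketched on small in-memory stand-ins. The `toy_sheets` frames and their values are made up for illustration; with the real workbook, a similar dict could be obtained in one call, because `pd.read_excel` accepts a list for `sheet_name` and then returns a dict of DataFrames. The sketch shows that a plain `pd.concat` keeps each sheet's own row labels, while `ignore_index=True` renumbers them.

```python
import pandas as pd

# Hypothetical stand-ins for two yearly sheets (values are illustrative only).
# With the real file, a dict like this could come from:
#   pd.read_excel('SECRID DATA.xlsx', sheet_name=[0, 1, 2, 3, 4])
toy_sheets = {
    0: pd.DataFrame({"Date": pd.to_datetime(["2015-01-05", "2015-01-06"]), "Quantity": [1, 4]}),
    1: pd.DataFrame({"Date": pd.to_datetime(["2016-02-01"]), "Quantity": [3]}),
}

# Default behaviour: each sheet keeps its own row labels, so labels repeat.
stacked = pd.concat(list(toy_sheets.values()))

# ignore_index=True assigns a fresh 0..n-1 index to the combined frame.
renumbered = pd.concat(list(toy_sheets.values()), ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Repeated row labels are harmless for the aggregations in this notebook, but they matter as soon as rows are selected by label with `.loc`.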

<a id='assess_data'></a>
## Part II - Assessing Data

In [None]:
df.head() #preview first five rows

In [None]:
df.tail() #preview last five rows

In [9]:
# Check size of the dataframe 
df.shape 

(1609533, 18)

In [None]:
# list names of columns in dataframe
df.columns 

In [None]:
# View info of the dataframe 
df.info()

In [None]:
# view some of the core statistics about columns
df.describe(include='all')

In [None]:
# check the Data types (dtypes) of each column in Dataframe
df.dtypes 

In [None]:
# Total sum of duplicate rows
df.duplicated().sum() # returns a Boolean Series with True value for each duplicated row and sums them

In [None]:
#return the number of unique elements in each column
print(df.nunique()) 

In [None]:
df.count() #returns the number of non-missing values for each column or row

In [None]:
#Total missing values(NaN) in a DataFrame
df.isnull().sum().sum()

In [None]:
#Count number of NaN for each column in DataFrame
print(df.isnull().sum()) 

<a id='issues'></a>
**Quality issues**
 * Rename columns to clear, descriptive, lowercase names, following best practice. Column 'name' can be renamed to 'customer_name' and column 'material' to 'type_of_material'.
 * Set the following columns to the category data type: 'internal_id', 'document_number', 'customer_name', 'customer_category', 'retailer_role', 'shipping_country', 'item', 'display_name', 'pim_category', 'type_of_material',
   'pim_colour', 'wsl', 'while_stock_lasts' and 'cardprotector_colour'.

<a id='clean_data'></a>
## Part III - Cleaning Data

In [10]:
# Create copy of original DataFrame
df_clean = df.copy()

In [11]:
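# Fix messy column names: trim, lowercase, replace spaces with underscores, drop parentheses
# regex=False treats '(' and ')' as literal characters in every pandas version
df_clean.columns = df_clean.columns.str.strip().str.lower().str.replace(' ', '_', regex=False).str.replace('(', '', regex=False).str.replace(')', '', regex=False)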

In [12]:
# change column names using rename function
df_clean.rename(columns={                                                 
                         'name':'customer_name',
                         'wsl_+':'wsl',
                         'material':'type_of_material' }, 
                 inplace=True)

**Test**

In [13]:
df_clean.columns #List of column names in df_clean Dataframe

Index(['internal_id', 'document_number', 'date', 'customer_name',
       'customer_category', 'retailer_role', 'shipping_country', 'item',
       'display_name', 'quantity', 'amount', 'amount_foreign_currency',
       'pim_category', 'type_of_material', 'pim_colour', 'wsl',
       'while_stock_lasts', 'cardprotector_colour'],
      dtype='object')
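
The column-name cleaning chain can be traced on a few headers. The raw headers in `raw_columns` are assumptions (the notebook does not show the original spreadsheet headers); they were chosen so that the result matches names seen in the cleaned index above.

```python
import pandas as pd

# Assumed raw headers, for illustration only.
raw_columns = pd.Index([" Internal ID", "Amount (Foreign Currency)", "WSL +", "Name"])

cleaned = (
    raw_columns.str.strip()                      # remove leading/trailing spaces
    .str.lower()                                 # lowercase everything
    .str.replace(" ", "_", regex=False)          # spaces -> underscores
    .str.replace("(", "", regex=False)           # drop opening parentheses
    .str.replace(")", "", regex=False)           # drop closing parentheses
)

print(cleaned.tolist())
```

Under these assumed headers, 'WSL +' becomes 'wsl_+' and 'Name' becomes 'name', which is why a separate `rename` step is still needed to reach 'wsl' and 'customer_name'.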

**Define**
<br>Set appropriate data types for fields mentioned in the [Quality issues](#issues) 

In [14]:
# use .astype to change data type of dataframe columns
category_columns = [
    "internal_id", "document_number", "customer_name", "customer_category",
    "retailer_role", "shipping_country", "item", "display_name", "pim_category",
    "type_of_material", "pim_colour", "wsl", "while_stock_lasts", "cardprotector_colour",
]
df_clean = df_clean.astype({column: "category" for column in category_columns})

**Test**

In [None]:
df_clean.info()
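
One reason to convert repeated text values to the category data type is memory use, which `info()` reports. The sketch below compares both representations on a made-up series of country names; the exact byte counts depend on the pandas version, so only the comparison is meaningful.

```python
import pandas as pd

# Illustrative data: three distinct values repeated many times.
countries = ["Netherlands", "Italy", "Luxembourg"] * 1000

as_object = pd.Series(countries, dtype="object")
as_category = as_object.astype("category")

# deep=True also counts the memory held by the Python strings themselves.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
print(as_category.cat.categories.tolist())
```

The saving comes from storing each distinct value once plus a small integer code per row, so columns with few distinct values (such as 'shipping_country') are likely to benefit more than near-unique identifiers (such as 'internal_id').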

In [None]:
# view some of the core statistics about columns
df_clean.describe(include='all')

### Content structure of the sales dataset
The sales data contains 18 columns (variables) and 1,609,533 rows (entries).
Each row is one line item of a sales document (several rows can share a document_number), so 1,609,533 line items were recorded for SECRID in the January 2015 – July 2019 period. The dataset contains features about:

* Products for sale: item, display_name, pim_category, pim_colour, type_of_material and cardprotector_colour
* The country the item was shipped to: shipping_country
* Customer data: customer_name, customer_category and retailer_role
* Sale transactions: internal_id, document_number, quantity, amount, amount_foreign_currency and date
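
Rows and sales documents are not the same thing: in the preview of the data, several rows share one document_number. A minimal sketch of how the two counts could be separated, using a toy frame built from document numbers and items that appear in the first rows of the dataset:

```python
import pandas as pd

# Toy frame with the document/item pattern of the first rows of the dataset.
toy = pd.DataFrame({
    "document_number": ["I-1510000", "I-1510001", "I-1510001", "I-1510001"],
    "item": ["TA-Brown", "C-Black", "C-Blue", "C-Red"],
    "quantity": [1, 1, 2, 2],
})

n_rows = len(toy)                                   # number of line items
n_documents = toy["document_number"].nunique()      # number of distinct documents
lines_per_document = toy.groupby("document_number").size()

print(n_rows, n_documents)
print(lines_per_document)
```

Applied to `df_clean`, the same two numbers would show how many distinct documents stand behind the 1,609,533 rows.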


### Detected Missing Values
A null value is a field with no value; it appears blank.
The table below shows the number and percentage of missing values per column.

| Variable Name | Value Count | Number of Missing Values | % of Missing Values |
| ------------- | ------------- | ------------- | ------------- |
| internal_id | 1,609,533 | 0 | 0% |
| document_number | 1,609,533 | 0 | 0% |
| date | 1,609,533 | 0 | 0% |
| customer_name | 1,609,533 | 0 | 0% |
| customer_category | 1,604,118 | 5,415 | 0.34% |
| retailer_role | 45,994 | 1,563,539 | 97.14% |
| shipping_country | 1,603,431 | 6,102 | 0.38% |
| item | 1,609,533 | 0 | 0% |
| display_name | 1,609,533 | 0 | 0% |
| quantity | 1,609,533 | 0 | 0% |
| amount | 1,609,533 | 0 | 0% |
| amount_foreign_currency | 1,609,533 | 0 | 0% |
| type_of_material | 1,608,493 | 1,040 | 0.065% |
| pim_category | 1,608,493 | 1,040 | 0.065% |
| pim_colour | 1,603,475 | 6,058 | 0.38% |
| wsl | 1,609,533 | 0 | 0% |
| while_stock_lasts | 1,609,533 | 0 | 0% |
| cardprotector_colour | 1,556,521 | 53,012 | 3.29% |
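
A table of this shape can be produced directly from a DataFrame. The helper below, `missing_value_summary`, is a hypothetical name introduced for this sketch, and the toy values (including the 'Reseller' role) are made up.

```python
import numpy as np
import pandas as pd


def missing_value_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return non-null count, null count and null percentage per column."""
    total_rows = len(frame)
    missing = frame.isnull().sum()
    return pd.DataFrame({
        "value_count": frame.count(),                       # non-missing values
        "missing": missing,                                 # missing values
        "missing_pct": (missing / total_rows * 100).round(3),
    })


# Toy data: 'retailer_role' is mostly empty, 'item' is complete.
toy = pd.DataFrame({
    "item": ["M-Black", "C-Red", "TA-Brown", "M-Black"],
    "retailer_role": [np.nan, np.nan, np.nan, "Reseller"],
})

summary = missing_value_summary(toy)
print(summary)
```

Calling `missing_value_summary(df_clean)` would give the three numeric columns of the table for the full dataset.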




<a id='items'></a>
## Section Two : Product Item Analysis

In [15]:
# create new column 'year' holding the year in which each sale was recorded
df_clean['year'] = df_clean.date.dt.year

In [16]:
df_clean.head()

| | internal_id | document_number | date | customer_name | customer_category | retailer_role | shipping_country | item | display_name | quantity | amount | amount_foreign_currency | pim_category | type_of_material | pim_colour | wsl | while_stock_lasts | cardprotector_colour | year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 24560 | I-1510000 | 2015-01-05 | C-9855 SkilledIn | Promotional sales - End user | NaN | Netherlands | TA-Brown | Twinwallet Amazon Brown | 1 | 54.26 | 54.26 | Twinwallet | Amazon | Brown | No | No | Silver | 2015 |
| 1 | 24580 | I-1510001 | 2015-01-05 | C-9255 Rubino di vittorio E Sergio Della Rocca & C. SNC attn. Spimar | Leather goods | NaN | Italy | C-Black | Cardprotector Black | 1 | 11.2 | 11.2 | Cardprotector | Aluminium | Black | No | No | Black | 2015 |
| 2 | 24580 | I-1510001 | 2015-01-05 | C-9255 Rubino di vittorio E Sergio Della Rocca & C. SNC attn. Spimar | Leather goods | NaN | Italy | C-Blue | Cardprotector Blue | 2 | 22.4 | 22.4 | Cardprotector | Aluminium | Blue | No | No | Blue | 2015 |
| 3 | 24580 | I-1510001 | 2015-01-05 | C-9255 Rubino di vittorio E Sergio Della Rocca & C. SNC attn. Spimar | Leather goods | NaN | Italy | C-Red | Cardprotector Red | 2 | 22.4 | 22.4 | Cardprotector | Aluminium | Red | No | No | Red | 2015 |
| 4 | 24580 | I-1510001 | 2015-01-05 | C-9255 Rubino di vittorio E Sergio Della Rocca & C. SNC attn. Spimar | Leather goods | NaN | Italy | C-Titanium | Cardprotector Titanium Color | 2 | 22.4 | 22.4 | Cardprotector | Aluminium | Titanium | No | No | Titanium | 2015 |


<a id='miniwallet'></a>
## MINIWALLET ANALYSIS

In [17]:
# filter to only the category of interest ('Miniwallet')
miniwalletdf = df_clean[(df_clean.pim_category == 'Miniwallet')]

In [18]:
# keep only rows with positive values in both the quantity and the amount column
miniwalletdf = miniwalletdf[(miniwalletdf.amount > 0) & (miniwalletdf.quantity > 0)]
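
Before dropping rows, it can help to look at what the filter excludes. The sketch below builds the same kind of boolean mask on made-up data; what non-positive rows mean in the real dataset (for example returns or corrections) is not stated in the notebook, so that interpretation is left open.

```python
import pandas as pd

# Made-up rows: one regular sale, one negative line, one zero line.
toy = pd.DataFrame({
    "display_name": ["item A", "item A", "item B"],
    "quantity": [4, -1, 0],
    "amount": [91.6, -22.9, 0.0],
})

# A row is kept only if BOTH amount and quantity are strictly positive.
keep = (toy["amount"] > 0) & (toy["quantity"] > 0)

kept = toy[keep]
excluded = toy[~keep]   # rows the filter removes, worth inspecting first

print(len(kept), len(excluded))
```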

In [19]:
miniwalletdf.head() #preview first five rows

| | internal_id | document_number | date | customer_name | customer_category | retailer_role | shipping_country | item | display_name | quantity | amount | amount_foreign_currency | pim_category | type_of_material | pim_colour | wsl | while_stock_lasts | cardprotector_colour | year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12 | 30196 | I-1510002 | 2015-01-06 | C-6496 Eug Hoffman | Leather goods | NaN | Luxembourg | M-Black | Miniwallet Original Black | 4 | 91.6 | 91.6 | Miniwallet | Original | Black | No | No | Silver | 2015 |
| 13 | 30196 | I-1510002 | 2015-01-06 | C-6496 Eug Hoffman | Leather goods | NaN | Luxembourg | M-Dark brown | Miniwallet Original Dark Brown | 3 | 68.7 | 68.7 | Miniwallet | Original | Dark brown | No | No | Silver | 2015 |
| 14 | 30196 | I-1510002 | 2015-01-06 | C-6496 Eug Hoffman | Leather goods | NaN | Luxembourg | MV-Black | Miniwallet Vintage Black | 4 | 91.6 | 91.6 | Miniwallet | Vintage | Black | No | No | Black | 2015 |
| 15 | 30196 | I-1510002 | 2015-01-06 | C-6496 Eug Hoffman | Leather goods | NaN | Luxembourg | MV-Blue Silver | Miniwallet Vintage Blue | 1 | 22.9 | 22.9 | Miniwallet | Vintage | Blue | No | No | Silver | 2015 |
| 16 | 30196 | I-1510002 | 2015-01-06 | C-6496 Eug Hoffman | Leather goods | NaN | Luxembourg | MV-Cognac | Miniwallet Vintage Cognac | 3 | 68.7 | 68.7 | Miniwallet | Vintage | Brown | No | No | Silver | 2015 |


In [None]:
miniwalletdf.type_of_material.describe() #overview of variable; count, unique, top,freq

In [None]:
# Types of material used in production of the Miniwallet
# drop categories that no Miniwallet row uses, then list the remaining ones
print(miniwalletdf.type_of_material.cat.remove_unused_categories().cat.categories)

In [None]:
miniwalletdf.display_name.value_counts() #transactions per display name
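
A point worth keeping in mind with category columns: filtering rows does not remove category levels, so `.cat.categories` and `value_counts()` on a filtered frame still list levels that no remaining row uses. The toy frame below illustrates this with three rows.

```python
import pandas as pd

toy = pd.DataFrame({
    "pim_category": ["Miniwallet", "Miniwallet", "Twinwallet"],
    "type_of_material": pd.Categorical(["Original", "Vintage", "Amazon"]),
})

# Keep only Miniwallet rows; the 'Amazon' level survives as an unused category.
mini = toy[toy["pim_category"] == "Miniwallet"]

all_levels = mini["type_of_material"].cat.categories.tolist()
used_levels = (
    mini["type_of_material"].cat.remove_unused_categories().cat.categories.tolist()
)
counts = mini["type_of_material"].value_counts()   # unused levels show a count of 0

print(all_levels)
print(used_levels)
print(counts)
```

In the notebook this means that zero-count entries in a `value_counts()` output are levels inherited from the full dataset, not Miniwallet rows.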

In [None]:
print(miniwalletdf.isnull().sum()) # check for missing values in miniwalletdf

In [20]:
# Total quantity sold and revenue generated per Miniwallet item
miniwallet_items = miniwalletdf.groupby(
    ['display_name']
).agg(
    {
        'amount': 'sum',    # total revenue per item
        'quantity': 'sum',  # total units sold per item
    }
)

# rank items by revenue, highest first
miniwallet_items.sort_values(by=['amount'], inplace=True, ascending=False)

In [22]:
miniwallet_items.head()

| display_name | amount | quantity |
| ------------- | ------------- | ------------- |
| Miniwallet Original Black | 10031289.43 | 445285 |
| Miniwallet Vintage Black | 7656124.06 | 336362 |
| Miniwallet Vintage Brown | 7562415.69 | 334856 |
| Miniwallet Vintage Chocolate | 4521212.76 | 196919 |
| Miniwallet Original Dark Brown | 2949940.0 | 130085 |
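
The per-item ranking can also be written with named aggregation, which gives the result columns explicit names. This is an alternative sketch, not the notebook's own code: the result names `revenue` and `units` are chosen here for illustration, and `observed=True` is an added option that leaves out category levels with no rows. The toy amounts follow the pattern of the Miniwallet preview rows.

```python
import pandas as pd

toy = pd.DataFrame({
    "display_name": pd.Categorical(
        ["Miniwallet Original Black", "Miniwallet Vintage Black", "Miniwallet Original Black"],
        # a third level with no rows, as happens after filtering a category column
        categories=["Miniwallet Original Black", "Miniwallet Vintage Black",
                    "Twinwallet Amazon Brown"],
    ),
    "amount": [91.6, 91.6, 22.9],
    "quantity": [4, 4, 1],
})

per_item = (
    toy.groupby("display_name", observed=True)                 # skip unused levels
    .agg(revenue=("amount", "sum"), units=("quantity", "sum"))  # named aggregation
    .sort_values("revenue", ascending=False)                    # highest revenue first
)

print(per_item)
```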


In [23]:
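# PIVOT TABLE: revenue and units per item (rows) and year (columns), with 'All' totals
product_table = pd.pivot_table(miniwalletdf, index="display_name", columns='year',
                               values=["amount", "quantity"], aggfunc='sum', margins=True)

# rank rows by total revenue; the 'All' totals row sorts to the top
product_table.sort_values(by=('amount', 'All'), ascending=False, inplace=True)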

In [25]:
product_table.head()

| display_name | amount 2015 | amount 2016 | amount 2017 | amount 2018 | amount 2019 | amount All | quantity 2015 | quantity 2016 | quantity 2017 | quantity 2018 | quantity 2019 | quantity All |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| All | 7640767.46 | 12788547.47 | 19105985.67 | 24533341.64 | 13717442.09 | 77786084.33 | 345635.0 | 550526.0 | 811785.0 | 1020926.0 | 559589.0 | 3288461 |
| Miniwallet Original Black | 1599384.69 | 2185042.38 | 2451277.69 | 2603448.83 | 1192135.84 | 10031289.43 | 75143.0 | 96678.0 | 107809.0 | 113730.0 | 51925.0 | 445285 |
| Miniwallet Vintage Black | 1191610.59 | 1567755.09 | 1850371.6 | 2133114.81 | 913271.97 | 7656124.06 | 54055.0 | 68662.0 | 80953.0 | 93025.0 | 39667.0 | 336362 |
| Miniwallet Vintage Brown | 1798526.06 | 1732866.12 | 1690950.91 | 1638341.16 | 701731.44 | 7562415.69 | 81577.0 | 76492.0 | 74501.0 | 71904.0 | 30382.0 | 334856 |
| Miniwallet Vintage Chocolate | 148.07 | 1040080.08 | 1366693.95 | 1455563.65 | 658727.01 | 4521212.76 | 7.0 | 45341.0 | 59552.0 | 63460.0 | 28559.0 | 196919 |
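
The structure of this pivot table is easier to see on a tiny example. The sketch below uses placeholder item names and made-up numbers; it shows how the `margins=True` totals land in an 'All' row and column, and how the totals row could be set aside before ranking items.

```python
import pandas as pd

toy = pd.DataFrame({
    "display_name": ["A", "A", "B", "B"],   # placeholder item names
    "year": [2015, 2016, 2015, 2016],
    "amount": [10.0, 30.0, 5.0, 5.0],
    "quantity": [1, 3, 1, 1],
})

# One row per item, one column per (measure, year) pair, plus 'All' totals.
table = pd.pivot_table(toy, index="display_name", columns="year",
                       values=["amount", "quantity"], aggfunc="sum", margins=True)

# Remove the totals row so that only items are ranked by total revenue.
items_only = table.drop(index="All")
ranked = items_only.sort_values(by=("amount", "All"), ascending=False)

print(table)
print(ranked.index.tolist())
```

The columns form a two-level index, which is why a single column is addressed with a tuple such as `('amount', 'All')`.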


## MINIWALLET COLLECTIONS 

### TIMELESS COLLECTION DATAFRAME

In [None]:
timeless_list = ["Original","Vintage","Dutch Martin","Vegetable Tanned"]

In [None]:
timelessdisplay = miniwalletdf[ miniwalletdf.type_of_material.isin(timeless_list)]

In [None]:
timelessdisplay.display_name.value_counts() # missing Miniwallet Vegan Soft Touch Black
                                            # includes Slimwallet Perforated Cognac and Slimwallet Perforated Black

In [None]:
#Miniwallet Vegan Soft Touch Black subset dataframe
vegansoft_black = miniwalletdf[miniwalletdf.display_name.str.contains("Miniwallet Vegan Soft Touch Black")]

In [None]:
vegansoft_black.head()

In [None]:
# combine two dataframes
miniwallettimelessdesign = pd.concat([timelessdisplay,vegansoft_black])

In [None]:
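# delete all rows whose display_name contains 'Slimwallet' (Slimwallet Perforated Cognac and Slimwallet Perforated Black)
# .copy() makes the result an independent frame, so adding a column later does not raise SettingWithCopyWarning
miniwallettimelessdesign = miniwallettimelessdesign[~miniwallettimelessdesign.display_name.str.contains("Slimwallet")].copy()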

In [None]:
#insert new column with one value 'tm'
miniwallettimelessdesign['design']='tm'

In [None]:
miniwallettimelessdesign.tail()

In [None]:
timelessdesign = miniwallettimelessdesign.groupby(
    ['display_name']
).agg(
    {
        'amount': 'sum',    # total revenue per item
        'quantity': 'sum',  # total units sold per item
    }
)

# rank items by revenue, highest first
timelessdesign.sort_values(by=['amount'], inplace=True, ascending=False)

In [None]:
timelessdesign

In [None]:
#export to csv
miniwallettimelessdesign.to_csv('timelessdesign.csv', index=False, header=True)

In [None]:
# total revenue of the Timeless collection (amounts were already filtered to positive values)
miniwallettimelessdesign.amount.sum()

### FASHIONABLE COLLECTION DATAFRAME

In [None]:
fashionable_list = ["Crisple","Cleo","Prism","Nile","Vegetable Tanned Stitched","Ornament","Metallic","Indigo"]

In [None]:
fashinabledisplay = miniwalletdf[ miniwalletdf.type_of_material.isin(fashionable_list)]

In [None]:
fashinabledisplay.head()

In [None]:
fashinabledisplay.display_name.value_counts()

In [None]:
matte = miniwalletdf[miniwalletdf.display_name.isin(['Miniwallet Matte Black & Yellow', 'Miniwallet Matte Purple', 'Miniwallet Matte Black & Red'])]

In [None]:
matte.display_name.value_counts()

In [None]:
rango = miniwalletdf[miniwalletdf.display_name.isin(['Miniwallet Rango Green', 'Miniwallet Rango Red-Bordeaux', 'Miniwallet Rango Violet-Violet'])]

In [None]:
rango.display_name.value_counts()

In [None]:
diamond = miniwalletdf[(miniwalletdf.display_name == 'Miniwallet Diamond Black')] 

In [None]:
diamond.display_name.value_counts()

In [None]:
optical = miniwalletdf[(miniwalletdf.display_name == 'Miniwallet Optical Black')] 

In [None]:
optical.display_name.value_counts()

In [None]:
cubic = miniwalletdf[(miniwalletdf.display_name == 'Miniwallet Cubic Black-Blue')] 

In [None]:
cubic.display_name.value_counts()

In [None]:
dash = miniwalletdf[(miniwalletdf.display_name == 'Miniwallet Dash Navy')] 

In [None]:
dash.display_name.value_counts()

In [None]:
miniwalletfashionabledesign = pd.concat([fashinabledisplay,matte,rango,diamond,optical,cubic,dash])

In [None]:
#insert new column with one value 'fsh'
miniwalletfashionabledesign['design']='fsh'

In [None]:
miniwalletfashionabledesign.head()

In [None]:
miniwalletfashionabledesign.display_name.value_counts()

In [None]:
# total revenue of the Fashionable collection
miniwalletfashionabledesign.amount.sum()

In [None]:
fashionabledesign = miniwalletfashionabledesign.groupby(
    ['display_name']
).agg(
    {
        'amount': 'sum',    # total revenue per item
        'quantity': 'sum',  # total units sold per item
    }
)

# rank items by revenue, highest first
fashionabledesign.sort_values(by=['amount'], inplace=True, ascending=False)

In [None]:
fashionabledesign

## COMBINED TIMELESS AND FASHIONABLE DESIGN DATAFRAMES

In [None]:
miniwalletdesign =  pd.concat([miniwallettimelessdesign, miniwalletfashionabledesign])
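
As an alternative sketch (not the approach used in this notebook), the design label could be assigned in one pass by mapping each material to its collection, instead of building two frames and concatenating them. The two material lists are taken from the cells above; the toy rows, their placeholder item names and the 'Unlisted' material are made up. Items that the notebook adds by display name (for example the Matte, Rango or Vegan Soft Touch items) are not covered by this mapping and would still need separate handling.

```python
import pandas as pd

timeless_list = ["Original", "Vintage", "Dutch Martin", "Vegetable Tanned"]
fashionable_list = ["Crisple", "Cleo", "Prism", "Nile", "Vegetable Tanned Stitched",
                    "Ornament", "Metallic", "Indigo"]

# Build one lookup: material -> design label ('tm' or 'fsh').
material_to_design = {material: "tm" for material in timeless_list}
material_to_design.update({material: "fsh" for material in fashionable_list})

# Made-up rows with placeholder item names.
toy = pd.DataFrame({
    "display_name": ["item A", "item B", "item C"],
    "type_of_material": ["Original", "Crisple", "Unlisted"],
    "amount": [91.6, 30.0, 10.0],
})

# Materials missing from the lookup map to NaN and are dropped afterwards.
toy["design"] = toy["type_of_material"].map(material_to_design)
tagged = toy.dropna(subset=["design"])

print(tagged)
```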

In [None]:
miniwalletdesign.head()

In [None]:
miniwalletdesign.tail()

In [None]:
# total revenue of both collections combined
miniwalletdesign.amount.sum()

In [None]:
#export to csv
miniwalletdesign.to_csv('miniwalletdesign.csv', index=False, header=True)

In [None]:
rankrevenuedesign = miniwalletdesign.groupby(
    ['design']
).agg(
    {
        'amount': 'sum',    # total revenue per design
        'quantity': 'sum',  # total units sold per design
    }
)

# rank designs by revenue, highest first
rankrevenuedesign.sort_values(by=['amount'], inplace=True, ascending=False)

In [None]:
rankrevenuedesign

### NETHERLANDS MINIWALLET ANALYSIS

In [None]:
# filter miniwalletdesign for items shipped to the Netherlands only

In [None]:
# filter to only the country of interest ('Netherlands')
miniwalletdesign = miniwalletdesign[(miniwalletdesign.shipping_country == 'Netherlands')]

In [None]:
miniwalletdesign.shipping_country.value_counts()

In [None]:
netherlandsrankrevenuedesign = miniwalletdesign.groupby(
    ['design']
).agg(
    {
        'amount': 'sum',    # total Netherlands revenue per design
        'quantity': 'sum',  # total Netherlands units sold per design
    }
)

# rank designs by revenue, highest first
netherlandsrankrevenuedesign.sort_values(by=['amount'], inplace=True, ascending=False)

In [None]:
netherlandsrankrevenuedesign
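
A per-design Netherlands share can be derived by keeping the country subset under its own name, so that the all-country totals stay available for comparison. A minimal sketch on made-up numbers (the variable names here are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "design": ["tm", "tm", "fsh", "fsh"],
    "shipping_country": ["Netherlands", "Italy", "Netherlands", "Luxembourg"],
    "amount": [60.0, 40.0, 10.0, 30.0],
})

# Keep the subset in a separate variable instead of overwriting the full frame.
netherlands_only = toy[toy["shipping_country"] == "Netherlands"]

total_by_design = toy.groupby("design")["amount"].sum()
nl_by_design = netherlands_only.groupby("design")["amount"].sum()

# Share of each design's revenue that was shipped to the Netherlands, in percent.
nl_share_pct = (nl_by_design / total_by_design * 100).round(1)

print(nl_share_pct)
```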

In [None]:
# netherlands timeless breakdown
miniwallettimelessdesign = miniwallettimelessdesign[(miniwallettimelessdesign.shipping_country == 'Netherlands')]

In [None]:
miniwallettimelessdesign.head()

In [None]:
# total Timeless revenue in the Netherlands
miniwallettimelessdesign.amount.sum()

In [None]:
netherlandstimelessdesign = miniwallettimelessdesign.groupby(
    ['display_name']
).agg(
    {
        'amount': 'sum',    # total Netherlands revenue per item
        'quantity': 'sum',  # total Netherlands units sold per item
    }
)

# rank items by revenue, highest first
netherlandstimelessdesign.sort_values(by=['amount'], inplace=True, ascending=False)

In [None]:
netherlandstimelessdesign

In [None]:
# netherlands fashionable breakdown
miniwalletfashionabledesign = miniwalletfashionabledesign[(miniwalletfashionabledesign.shipping_country == 'Netherlands')]

In [None]:
miniwalletfashionabledesign.head()

In [None]:
# total Fashionable revenue in the Netherlands
miniwalletfashionabledesign.amount.sum()

In [None]:
netherlandsfashionabledesign = miniwalletfashionabledesign.groupby(
    ['display_name']
).agg(
    {
        'amount': 'sum',    # total Netherlands revenue per item
        'quantity': 'sum',  # total Netherlands units sold per item
    }
)

# rank items by revenue, highest first
netherlandsfashionabledesign.sort_values(by=['amount'], inplace=True, ascending=False)

In [None]:
netherlandsfashionabledesign