# Data Quality Assessment for a Medium Size Bikes & Cycling Accessories Organization
> This project was done as part of a KPMG internship experience. I was given the datasets of a client organization that wants feedback on the quality of its data and on how it can be improved.

### Background
- Sprocket Central Pty Ltd is a medium-sized bikes & cycling accessories organisation.
- It needs help with its customer and transactions data.
- It wants to know how to analyse that data to optimise its marketing strategy.

### Datasets
- New Customer List
- Customer Demographic
- Customer Addresses
- Transactions data in the past 3 months

### [Data Quality Framework Table](https://towardsdatascience.com/a-comprehensive-framework-for-data-quality-management-b110a0465e83)
- Accuracy : Accuracy is the closeness of a value to the correct representation of the real-life phenomenon it describes
- Completeness : The extent to which data are of sufficient breadth, depth, and scope for the task at hand
- Consistency : The extent to which data are uniform in format, use, and meaning across a data collection
- Currency : Currency reflects the freshness of data.
- Volatility : Volatility can also be expressed as the length of time the data remains valid.
- Relevancy : Relevancy is the extent to which data are appropriate for the task at hand.
- Validity : Validity is the extent to which data conform to defined business rules or constraints.
- Uniqueness : Uniqueness is the extent to which data are unique within the dataset.

In [1]:
# Importing libraries
import pandas as pd
import numpy as np


In [2]:
# Importing data and making respective dataframes for each sheet
xls = pd.ExcelFile('KPMG_VI_New_raw_data_update_final.xlsx')

Transactions = pd.read_excel(xls, "Transactions", skiprows=1)
NewList = pd.read_excel(xls, "NewCustomerList", skiprows=1)
Demographic = pd.read_excel(xls, "CustomerDemographic", skiprows=1)
Address = pd.read_excel(xls, "CustomerAddress", skiprows=1)


### Exploring and Analysing the Dataset: Transactions

In [3]:
Transactions.head()

Unnamed: 0,transaction_id,product_id,customer_id,transaction_date,online_order,order_status,brand,product_line,product_class,product_size,list_price,standard_cost,product_first_sold_date
0,1,2,2950,2017-02-25,0.0,Approved,Solex,Standard,medium,medium,71.49,53.62,41245.0
1,2,3,3120,2017-05-21,1.0,Approved,Trek Bicycles,Standard,medium,large,2091.47,388.92,41701.0
2,3,37,402,2017-10-16,0.0,Approved,OHM Cycles,Standard,low,medium,1793.43,248.82,36361.0
3,4,88,3135,2017-08-31,0.0,Approved,Norco Bicycles,Standard,medium,medium,1198.46,381.1,36145.0
4,5,78,787,2017-10-01,1.0,Approved,Giant Bicycles,Standard,medium,large,1765.3,709.48,42226.0


### Checking Consistency and Validity of the Dataset

In [4]:
# Display columns of the dataframe
Transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   transaction_id           20000 non-null  int64         
 1   product_id               20000 non-null  int64         
 2   customer_id              20000 non-null  int64         
 3   transaction_date         20000 non-null  datetime64[ns]
 4   online_order             19640 non-null  float64       
 5   order_status             20000 non-null  object        
 6   brand                    19803 non-null  object        
 7   product_line             19803 non-null  object        
 8   product_class            19803 non-null  object        
 9   product_size             19803 non-null  object        
 10  list_price               20000 non-null  float64       
 11  standard_cost            19803 non-null  float64       
 12  product_first_sold_date  19803 n

In [5]:
# checking the shape of your data
print(Transactions.shape)

(20000, 13)


### Highlights of Consistency and Validity Issues
The Transactions dataset has 20,000 records and 13 columns.
- transaction_id, customer_id, product_id are keys and are of **int64** datatype
- transaction_date is of **datetime64** datatype in the format **YYYY-MM-DD**
- online_order is of **float64** datatype and should be of **boolean** datatype
- 5 columns are of **object** datatype: order_status, brand, product_line, product_class, product_size
- 3 further columns are of **float64** datatype: list_price, standard_cost, product_first_sold_date; product_first_sold_date is a date stored as a number and should be of **datetime64** datatype
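
The two type fixes listed above could be sketched as follows. This is a minimal illustration on a small hand-made frame (`sample` is hypothetical), not the cleaning code used in the project. It assumes that `product_first_sold_date` holds Excel serial day numbers, which the values suggest but the source does not confirm, and that `online_order` contains only 0.0, 1.0 or missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the first rows of Transactions
sample = pd.DataFrame({
    "online_order": [0.0, 1.0, np.nan],
    "product_first_sold_date": [41245.0, 41701.0, np.nan],
})

# Nullable boolean keeps missing values as <NA> instead of forcing True/False
sample["online_order"] = sample["online_order"].astype("boolean")

# Assumption: the floats are Excel serial days, counted from 1899-12-30
sample["product_first_sold_date"] = pd.to_datetime(
    sample["product_first_sold_date"], unit="D", origin="1899-12-30"
)

print(sample.dtypes)
```

If the serial-day assumption is wrong, the resulting dates will look implausible for a bike retailer, which is a quick sanity check before applying the conversion to the full column.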

### Checking Completeness of the Dataset

In [6]:
# looking for null values
total_null_values = Transactions.isnull().sum()

# counting the non-null values per column
total_values = Transactions.count().sort_values(ascending=True)

# nulls as a percentage of the non-null count (divide by len(Transactions) for the share of all rows)
null_values_percentage = (total_null_values/total_values)*100

# creating a dataframe of null values and percentage
missing_data = pd.concat({'Null Values': total_null_values,'Total Values':total_values, 'Percentage of Missing Values': null_values_percentage}, axis=1)

In [7]:
missing_data.sort_values(by='Percentage of Missing Values', ascending=False)

Unnamed: 0,Null Values,Total Values,Percentage of Missing Values
online_order,360,19640,1.832994
brand,197,19803,0.994799
product_line,197,19803,0.994799
product_class,197,19803,0.994799
product_size,197,19803,0.994799
standard_cost,197,19803,0.994799
product_first_sold_date,197,19803,0.994799
transaction_id,0,20000,0.0
product_id,0,20000,0.0
customer_id,0,20000,0.0


### Highlights of Completeness Issues in Transactions
- 360 missing values in the online_order column (1.8% of the 20,000 rows)
- 197 missing values in each of brand, product_line, product_class, product_size, standard_cost and product_first_sold_date (about 1% of rows)
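
The completeness cell divides the null count by the non-null count, so 360 nulls against 19,640 non-null values gives 1.83%, while the share of all 20,000 rows is 1.80%. A small sketch of a summary that reports the share of all rows is shown here; `missing_summary` and the `demo` frame are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Null count and share of all rows per column (hypothetical helper)."""
    summary = pd.DataFrame({
        "Null Values": df.isnull().sum(),
        "Total Rows": len(df),
        # isnull().mean() divides by the number of rows, not by the non-null count
        "Percentage of Missing Values": df.isnull().mean() * 100,
    })
    return summary.sort_values("Percentage of Missing Values", ascending=False)

# Tiny made-up frame: 1 of 4 online_order values is missing
demo = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4],
    "online_order": [0.0, np.nan, 1.0, 1.0],
})
result = missing_summary(demo)
print(result)
```

The same helper could be called on each of the four dataframes, which would also avoid repeating the cell for every sheet.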

### Checking Accuracy of the Dataset

In [13]:
# checking the product_id and its details

# boolean mask selecting the rows where product_id is 0
bool_series = Transactions["product_id"] == 0

# product_id_0 is a dataframe with only product_id = 0
product_id_0 = Transactions[bool_series]

product_id_0[['product_id', 'brand', 'product_line', 'product_class', 'product_size']].head()


Unnamed: 0,product_id,brand,product_line,product_class,product_size
34,0,Norco Bicycles,Road,medium,medium
39,0,Norco Bicycles,Road,medium,medium
54,0,Norco Bicycles,Standard,low,medium
60,0,OHM Cycles,Road,high,large
63,0,Trek Bicycles,Standard,medium,medium


### Highlights of Accuracy Issues in Transactions
The same product_id appears with several different values of brand, product_line, product_class and product_size. A single product_id should refer to exactly one brand, product_line, product_class and product_size.
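
The check above looks at product_id 0 only. One way to find every affected product_id is to count the distinct attribute combinations per id. The sketch below uses made-up rows (`demo` is hypothetical); on the real data the same expression would be applied to `Transactions`.

```python
import pandas as pd

attribute_cols = ["brand", "product_line", "product_class", "product_size"]

# Made-up rows: product_id 0 appears with two different descriptions
demo = pd.DataFrame({
    "product_id": [0, 0, 0, 1, 1],
    "brand": ["Norco Bicycles", "Norco Bicycles", "OHM Cycles", "Solex", "Solex"],
    "product_line": ["Road", "Road", "Road", "Standard", "Standard"],
    "product_class": ["medium", "medium", "high", "medium", "medium"],
    "product_size": ["medium", "medium", "large", "medium", "medium"],
})

# Count distinct attribute combinations per product_id
combos_per_product = (
    demo.drop_duplicates(["product_id"] + attribute_cols)
        .groupby("product_id")
        .size()
)

# Any product_id with more than one combination is ambiguous
conflicting_ids = combos_per_product[combos_per_product > 1].index.tolist()
print(conflicting_ids)
```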

### Checking Uniqueness of the Dataset

In [18]:
# looking for duplicates
duplicated_values = Transactions.duplicated().sum()
duplicated_values

0

### Highlights of Uniqueness Issues in Transactions
There are no fully duplicated rows in the Transactions dataset
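
`duplicated()` without arguments only flags rows that are identical in every column. A complementary check, sketched here on made-up rows, is whether the key column itself repeats; the `demo` frame is hypothetical and the result for the real `Transactions` data is not known from the source.

```python
import pandas as pd

# Made-up rows: no row is fully identical, but transaction_id 2 repeats
demo = pd.DataFrame({
    "transaction_id": [1, 2, 2],
    "customer_id": [10, 11, 12],
})

# Rows identical in every column
full_row_duplicates = demo.duplicated().sum()

# Rows whose key value has already appeared earlier
key_duplicates = demo.duplicated(subset=["transaction_id"]).sum()

# Shortcut for the same question on a single column
key_is_unique = demo["transaction_id"].is_unique

print(full_row_duplicates, key_duplicates, key_is_unique)
```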

### Exploring and Analysing the Dataset: NewCustomerList

In [20]:
NewList.head()

Unnamed: 0,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,job_title,job_industry_category,wealth_segment,deceased_indicator,owns_car,...,state,country,property_valuation,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Rank,Value
0,Chickie,Brister,Male,86,1957-07-12,General Manager,Manufacturing,Mass Customer,N,Yes,...,QLD,Australia,6,0.56,0.7,0.875,0.74375,1,1,1.71875
1,Morly,Genery,Male,69,1970-03-22,Structural Engineer,Property,Mass Customer,N,No,...,NSW,Australia,11,0.89,0.89,1.1125,0.945625,1,1,1.71875
2,Ardelis,Forrester,Female,10,1974-08-28 00:00:00,Senior Cost Accountant,Financial Services,Affluent Customer,N,No,...,VIC,Australia,5,1.01,1.01,1.01,1.01,1,1,1.71875
3,Lucine,Stutt,Female,64,1979-01-28,Account Representative III,Manufacturing,Affluent Customer,N,Yes,...,QLD,Australia,1,0.87,1.0875,1.0875,1.0875,4,4,1.703125
4,Melinda,Hadlee,Female,34,1965-09-21,Financial Analyst,Financial Services,Affluent Customer,N,No,...,NSW,Australia,9,0.52,0.52,0.65,0.65,4,4,1.703125


### Checking Consistency and Validity of the Dataset

In [24]:
NewList.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 23 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   first_name                           1000 non-null   object 
 1   last_name                            971 non-null    object 
 2   gender                               1000 non-null   object 
 3   past_3_years_bike_related_purchases  1000 non-null   int64  
 4   DOB                                  983 non-null    object 
 5   job_title                            894 non-null    object 
 6   job_industry_category                835 non-null    object 
 7   wealth_segment                       1000 non-null   object 
 8   deceased_indicator                   1000 non-null   object 
 9   owns_car                             1000 non-null   object 
 10  tenure                               1000 non-null   int64  
 11  address                        

In [22]:
NewList.shape

(1000, 23)

### Highlights of Consistency and Validity Issues
The NewCustomerList dataset has 1,000 records and 23 columns.
- DOB is of **object** datatype and should be of **datetime64** datatype in the format **YYYY-MM-DD**
- deceased_indicator is of **object** datatype and should be of **boolean** datatype
- there is no customer_id column in the dataset, hence no key
- values captured in gender should be restricted to M, F, U
- there are 5 Unnamed columns (Unnamed: 16 to Unnamed: 20); four of them cannot be identified
- Unnamed: 20 matches Rank in the rows shown, so it may simply duplicate the Rank column
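
Two of these points can be checked in code: parsing DOB into dates and testing whether Unnamed: 20 repeats Rank. The sketch below runs on made-up rows (`demo` is hypothetical). It assumes every valid DOB starts with a `YYYY-MM-DD` prefix, as in the rows shown, and cuts off any trailing time part before parsing.

```python
import pandas as pd

# Made-up rows shaped like NewCustomerList; values are illustrative only
demo = pd.DataFrame({
    "DOB": ["1957-07-12", "1974-08-28 00:00:00", None],
    "Unnamed: 20": [1, 1, 4],
    "Rank": [1, 1, 4],
})

# Keep only the YYYY-MM-DD part, then parse; anything unparseable becomes NaT
dob_text = demo["DOB"].astype(str).str.slice(0, 10)
demo["DOB"] = pd.to_datetime(dob_text, format="%Y-%m-%d", errors="coerce")

# Drop Unnamed: 20 only if it really repeats Rank row for row
rank_is_repeated = bool((demo["Unnamed: 20"] == demo["Rank"]).all())
if rank_is_repeated:
    demo = demo.drop(columns=["Unnamed: 20"])

print(demo.dtypes)
```

Comparing the two columns before dropping one keeps the decision evidence-based rather than relying on the first five rows.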


### Checking the Completeness of the Dataset

In [27]:
# looking for null values
total_null_values = NewList.isnull().sum()

# counting the non-null values per column
total_values = NewList.count().sort_values(ascending=True)

# nulls as a percentage of the non-null count (divide by len(NewList) for the share of all rows)
null_values_percentage = (total_null_values/total_values)*100

missing_data_NewList = pd.concat({'Null Values': total_null_values,'Total Values':total_values, 'Percentage of Missing Values': null_values_percentage}, axis=1)
missing_data_NewList.sort_values(by='Percentage of Missing Values', ascending=False)

Unnamed: 0,Null Values,Total Values,Percentage of Missing Values
job_industry_category,165,835,19.760479
job_title,106,894,11.856823
last_name,29,971,2.986612
DOB,17,983,1.7294
first_name,0,1000,0.0
country,0,1000,0.0
Rank,0,1000,0.0
Unnamed: 20,0,1000,0.0
Unnamed: 19,0,1000,0.0
Unnamed: 18,0,1000,0.0


### Highlights of Completeness Issues in NewCustomerList
The percentages in the table above are relative to the non-null count. As a share of all 1,000 rows:
- job_industry_category: 165 values missing (16.5%)
- job_title: 106 values missing (10.6%)
- last_name: 29 values missing (2.9%)
- DOB: 17 values missing (1.7%)

### Checking the Uniqueness of the Dataset

In [29]:
# looking for duplicates
duplicated_values = NewList.duplicated().sum()
duplicated_values

0

### Highlights of Uniqueness Issues in NewCustomerList
There are no fully duplicated rows in the dataset

### Exploring and Analysing the Dataset: CustomerDemographic

In [31]:
Demographic.head()

Unnamed: 0,customer_id,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,job_title,job_industry_category,wealth_segment,deceased_indicator,default,owns_car,tenure
0,1,Laraine,Medendorp,F,93,1953-10-12 00:00:00,Executive Secretary,Health,Mass Customer,N,"""'",Yes,11.0
1,2,Eli,Bockman,Male,81,1980-12-16 00:00:00,Administrative Officer,Financial Services,Mass Customer,N,<script>alert('hi')</script>,Yes,16.0
2,3,Arlin,Dearle,Male,61,1954-01-20 00:00:00,Recruiting Manager,Property,Mass Customer,N,2018-02-01 00:00:00,Yes,15.0
3,4,Talbot,,Male,33,1961-10-03 00:00:00,,IT,Mass Customer,N,() { _; } >_[$($())] { touch /tmp/blns.shellsh...,No,7.0
4,5,Sheila-kathryn,Calton,Female,56,1977-05-13 00:00:00,Senior Editor,,Affluent Customer,N,NIL,Yes,8.0


### Checking Consistency and Validity of the Dataset

In [32]:
Demographic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 13 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   customer_id                          4000 non-null   int64  
 1   first_name                           4000 non-null   object 
 2   last_name                            3875 non-null   object 
 3   gender                               4000 non-null   object 
 4   past_3_years_bike_related_purchases  4000 non-null   int64  
 5   DOB                                  3913 non-null   object 
 6   job_title                            3494 non-null   object 
 7   job_industry_category                3344 non-null   object 
 8   wealth_segment                       4000 non-null   object 
 9   deceased_indicator                   4000 non-null   object 
 10  default                              3698 non-null   object 
 11  owns_car                      

### Highlights of Consistency and Validity Issues
CustomerDemographic dataset has 4k records and 13 columns
- customer_id is the key and is of **int64** datatype
- gender is coded inconsistently (F, Male and Female all appear in the first rows) and needs a single coding
- DOB is of **object** datatype and should be of **datetime64** datatype in the format **YYYY-MM-DD**
- deceased_indicator is of **object** datatype and should be of **boolean** datatype
- owns_car is of **object** datatype and should be of **boolean** datatype
- tenure is of **float64** datatype and should be of an integer datatype (a nullable one, since some values are missing)
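
The conversions listed above could look roughly like this. The sketch runs on made-up rows (`demo` is hypothetical), and the `gender_map` is an assumed mapping onto the codes M, F, U; the real column may contain further spellings that would need to be added.

```python
import numpy as np
import pandas as pd

# Made-up rows shaped like CustomerDemographic
demo = pd.DataFrame({
    "gender": ["F", "Male", "Female", "U"],
    "deceased_indicator": ["N", "N", "Y", "N"],
    "owns_car": ["Yes", "No", "Yes", "No"],
    "tenure": [11.0, 16.0, np.nan, 7.0],
})

# Assumed mapping onto the target codes; extend it with whatever
# spellings value_counts() reveals in the real column
gender_map = {"F": "F", "Female": "F", "M": "M", "Male": "M", "U": "U"}
demo["gender"] = demo["gender"].map(gender_map)

# Flags become real booleans; unmapped spellings would turn into <NA>
demo["deceased_indicator"] = (
    demo["deceased_indicator"].map({"Y": True, "N": False}).astype("boolean")
)
demo["owns_car"] = demo["owns_car"].map({"Yes": True, "No": False}).astype("boolean")

# Nullable Int64 holds whole numbers and still allows missing values
demo["tenure"] = demo["tenure"].astype("Int64")

print(demo.dtypes)
```

Using `map` rather than `replace` has the side effect that any unexpected spelling becomes missing, which makes leftovers easy to spot with `isna()`.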

### Checking the Completeness of the Dataset

In [34]:
# looking for null values
total_null_values = Demographic.isnull().sum()

# counting the non-null values per column
total_values = Demographic.count().sort_values(ascending=True)

# nulls as a percentage of the non-null count (divide by len(Demographic) for the share of all rows)
null_values_percentage = (total_null_values/total_values)*100

missing_data_Demographic = pd.concat({'Null Values': total_null_values,'Total Values':total_values, 'Percentage of Missing Values': null_values_percentage}, axis=1)
missing_data_Demographic.sort_values(by='Percentage of Missing Values', ascending=False)

Unnamed: 0,Null Values,Total Values,Percentage of Missing Values
job_industry_category,656,3344,19.617225
job_title,506,3494,14.481969
default,302,3698,8.166577
last_name,125,3875,3.225806
DOB,87,3913,2.223358
tenure,87,3913,2.223358
customer_id,0,4000,0.0
first_name,0,4000,0.0
gender,0,4000,0.0
past_3_years_bike_related_purchases,0,4000,0.0


### Highlights of Completeness Issues in CustomerDemographic
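The percentages in the table above are relative to the non-null count. As a share of all 4,000 rows:
- job_title: 506 values missing (about 12.7%)
- last_name: 125 values missing (about 3.1%)
- DOB: 87 values missing (about 2.2%)
- job_industry_category: 656 values missing (16.4%)
- tenure: 87 values missing (about 2.2%)
- default: 302 values missing (about 7.6%), and the populated values (for example `<script>alert('hi')</script>` and `NIL`) carry no apparent meaning
- the keys should be consistent with the other datasets, but the NewCustomerList dataset doesn't have a customer_id column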
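
For the datasets that do carry customer_id, key consistency can be checked by comparing the sets of ids. The sketch below uses made-up id lists (`demographic_ids` and `address_ids` are hypothetical stand-ins for `Demographic["customer_id"]` and `Address["customer_id"]`).

```python
import pandas as pd

# Made-up key columns; the real ones come from Demographic and Address
demographic_ids = pd.Series([1, 2, 3, 4, 5], name="customer_id")
address_ids = pd.Series([1, 2, 4, 5, 6], name="customer_id")

demographic_set = set(demographic_ids.tolist())
address_set = set(address_ids.tolist())

# Customers known to Demographic but without an address row
missing_address = sorted(demographic_set - address_set)

# Address rows pointing at customers that Demographic does not contain
orphan_address = sorted(address_set - demographic_set)

print(missing_address, orphan_address)
```

The same comparison against `Transactions["customer_id"]` would show whether any transaction refers to a customer missing from the demographic data.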

### Checking the Uniqueness of the Dataset

In [36]:
duplicated_values = Demographic.duplicated().sum()
duplicated_values

0

### Highlights of Uniqueness Issues in CustomerDemographic
There are no fully duplicated rows in the dataset

### Exploring and Analysing the Dataset: CustomerAddress

In [37]:
Address.head()

Unnamed: 0,customer_id,address,postcode,state,country,property_valuation
0,1,060 Morning Avenue,2016,New South Wales,Australia,10
1,2,6 Meadow Vale Court,2153,New South Wales,Australia,10
2,4,0 Holy Cross Court,4211,QLD,Australia,9
3,5,17979 Del Mar Point,2448,New South Wales,Australia,4
4,6,9 Oakridge Court,3216,VIC,Australia,9


### Checking Consistency and Validity of the Dataset

In [38]:
Address.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3999 entries, 0 to 3998
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   customer_id         3999 non-null   int64 
 1   address             3999 non-null   object
 2   postcode            3999 non-null   int64 
 3   state               3999 non-null   object
 4   country             3999 non-null   object
 5   property_valuation  3999 non-null   int64 
dtypes: int64(3), object(3)
memory usage: 187.6+ KB


### Highlights of Consistency and Validity Issues
The CustomerAddress dataset has 3,999 records and 6 columns.
- customer_id is the key and is of **int64** datatype
- postcode should be consistent with the country and state
- state should be consistent with the country and use a single format; both full names (New South Wales) and abbreviations (QLD, VIC) appear
- the keys should be consistent with the other datasets, but the NewCustomerList dataset doesn't have a customer_id column
- the NewCustomerList dataset has 1,000 records and the CustomerAddress dataset has 3,999 records
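
The mixed state formats could be brought onto one coding with a simple lookup. This is a sketch on made-up rows (`demo` is hypothetical); `state_map` covers only the spellings assumed here, and "Victoria" is included as an assumption since it does not appear in the rows shown.

```python
import pandas as pd

# Made-up rows mixing full names and abbreviations
demo = pd.DataFrame({"state": ["New South Wales", "QLD", "VIC", "Victoria", "NSW"]})

# Map full names onto abbreviations; values already abbreviated pass through
state_map = {"New South Wales": "NSW", "Victoria": "VIC", "Queensland": "QLD"}
demo["state"] = demo["state"].replace(state_map)

print(demo["state"].value_counts())
```

Running `value_counts()` on the real column before and after the replacement would show whether any other spelling remains.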

### Checking the Completeness of the Dataset

In [39]:
total_null_values = Address.isnull().sum()
total_values = Address.count().sort_values(ascending=True)
null_values_percentage = (total_null_values/total_values)*100
missing_data_Address = pd.concat({'Null Values': total_null_values,'Total Values':total_values, 'Percentage of Missing Values': null_values_percentage}, axis=1)
missing_data_Address.sort_values(by='Percentage of Missing Values', ascending=False)

Unnamed: 0,Null Values,Total Values,Percentage of Missing Values
customer_id,0,3999,0.0
address,0,3999,0.0
postcode,0,3999,0.0
state,0,3999,0.0
country,0,3999,0.0
property_valuation,0,3999,0.0


### Highlights of Completeness Issues in CustomerAddress
There are no missing values in the dataset

### Checking the Uniqueness of the Dataset

In [41]:
address_duplicated_values = Address.duplicated().sum()
address_duplicated_values

0

### Highlights of Uniqueness Issues in CustomerAddress
There are no fully duplicated rows in the dataset