# Data Quality Assessment for a Medium Size Bikes & Cycling Accessories Organization

> This project was completed as part of the KPMG internship experience. I was given the datasets of an organisation, a client that wants feedback on the quality of its data and on how that quality can be improved.

### Purpose 
Primarily, Sprocket Central Pty Ltd needs help with its customer and transactions data. The organisation has a large dataset relating to its customers, but their team is unsure how to effectively analyse it to help optimise its marketing strategy. 

“The importance of optimising the quality of customer datasets cannot be underestimated. The better the quality of the dataset, the better chance you will be able to use it to drive company growth.”

Perform the preliminary data exploration and identify ways to improve the quality of Sprocket Central Pty Ltd’s data.

### Datasets
The client provided KPMG with 3 datasets:
- Customer Demographic 
- Customer Addresses
- Transactions data for the past 3 months

### Data Quality Framework Table
I will assess the quality of these datasets using the dimensions of the Data Quality Framework. The framework provides the following dimensions:
- Completeness: How much information each entity has. Measured by the number of missing values.
- Consistency: How consistent the data is. Measured by the number of inconsistencies in the data.
- Accuracy: How accurate the data is. Measured by the number of errors in the data.
- Relevancy/Auditability: Whether the data in each entity is relevant. Measured by the number of irrelevant values.
- Validity: Whether the data contains only allowable values.
- Uniqueness: How unique the data is. Measured by the number of duplicated values.
- Timeliness: Whether the data is current and up to date.
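
As a rough illustration of how two of these dimensions (completeness and uniqueness) turn into counts, here is a minimal sketch on toy data. The helper name `quality_summary` and the toy frame are hypothetical and are not part of the client data or the original analysis.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame) -> dict:
    """Return simple counts for completeness and uniqueness (hypothetical helper)."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        # completeness: total number of empty cells
        "missing_cells": int(df.isnull().sum().sum()),
        # uniqueness: rows that repeat an earlier row exactly
        "duplicate_rows": int(df.duplicated().sum()),
    }


# toy data, invented for illustration only
toy = pd.DataFrame({"id": [1, 2, 2, 4], "brand": ["A", "B", "B", None]})
summary = quality_summary(toy)
print(summary)
```

The other dimensions (accuracy, validity, relevancy, timeliness) usually need domain rules and cannot be reduced to a generic count like this.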


In [None]:
# importing pandas library for i/o and dataframes 
import pandas as pd

# loading the workbook so that its sheets can be parsed
dataset = pd.ExcelFile('KPMG_VI_New_raw_data_update_final.xlsx')

# parsing sheets
Transactions = dataset.parse('Transactions', header=0, skiprows=1)
NewCustomerList = dataset.parse('NewCustomerList')
CustomerDemographic = dataset.parse('CustomerDemographic')
CustomerAddress = dataset.parse('CustomerAddress')

## Exploring and Analyzing Data Quality of Sheet: Transactions 

In [None]:
# display data inside sheet
print(Transactions.head())

## Checking Consistency and Validity of Dataset

In [None]:
# Display columns of dataset Transactions
print(Transactions.info())

In [None]:
# checking the shape of your data
print(Transactions.shape)

## Highlights of Consistency and Validity in Transactions
The Transactions dataset has 20000 records and 13 columns.
- 3 columns are of datatype **int64**; these are the keys.
- One column, the transaction date, is of datatype **datetime64** and uses the format **MM/DD/YYYY**, while the customers' DOB is captured as **YYYY-MM-DD**. The date format should be kept consistent across the datasets.
- The online order flag is stored in a **float64** column, although its values are **boolean**, that is, true and false.
- 5 columns are of datatype **object**: order_status, brand, product_line, product_class and product_size.
- The last 3 columns are of datatype **float64**. One of them is a date, so it should be stored as **datetime64** in the standard format.
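
The type fixes suggested above could look like the following sketch on toy data. It assumes the float date column holds Excel serial day numbers, which the source does not confirm. The column names `online_order` and `product_first_sold_date` are used only to illustrate the idea.

```python
import numpy as np
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    "online_order": [1.0, 0.0, np.nan],
    "product_first_sold_date": [43831.0, 43832.0, np.nan],
})

clean = toy.copy()

# nullable boolean dtype keeps missing values as <NA> instead of forcing True/False
clean["online_order"] = clean["online_order"].astype("boolean")

# ASSUMPTION: the floats are Excel serial days, counted from 1899-12-30
clean["product_first_sold_date"] = pd.to_datetime(
    clean["product_first_sold_date"], unit="D", origin="1899-12-30"
)

print(clean.dtypes)
print(clean)
```

If the floats turn out to encode something else, only the `origin` and `unit` arguments would need to change.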

## Checking Completeness of Dataset

In [None]:
# counting the null values in each column
total_null_values = Transactions.isnull().sum()

# counting the non-null values in each column
total_values = Transactions.count().sort_values(ascending=True)

# calculating null values as a percentage of the non-null values
null_values_percentage = total_null_values / total_values * 100

# converting to dataframe of missing values
missing_values = pd.concat({'Total Values' : total_values, 'Null_values': total_null_values, 'Percentage of Missing Values': null_values_percentage}, axis=1)

# display missing values
print(missing_values)

## Highlights of Completeness in Transactions
- The online_order column has about 1.83% null values: there are 360 records in which the online order flag was not captured.
- The columns brand, product_line, product_class, product_size, standard_cost and product_first_sold_date each have about 0.995% missing values, that is, 197 nulls. These details should not be missing, because they are inherited from the product that product_id references.
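
The percentages above come from dividing the null count by the non-null count, because `DataFrame.count()` excludes nulls. A common alternative is the share of missing values across all rows, which gives slightly lower figures. A minimal sketch on toy data (the frame and column labels are invented for illustration):

```python
import numpy as np
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    "online_order": [1.0, np.nan, 0.0, 1.0],
    "brand": ["A", "B", None, None],
})

# number of nulls per column
null_count = toy.isnull().sum()

# mean of the boolean null mask = fraction of ALL rows that are missing
null_share = toy.isnull().mean() * 100

report = pd.DataFrame({
    "Null values": null_count,
    "Percentage of rows missing": null_share,
})
print(report)
```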

## Checking Accuracy of Dataset

In [None]:
# checking a single product id and its details
bool_series = Transactions['product_id'] == 0

product_id_0 = Transactions[bool_series]

# view the product details
print(product_id_0[['brand', 'product_line','product_class']])

## Highlights of Accuracy in Transactions
A single product ID should reference a single product, so every record with the same product_id should carry the same brand, product_line and product_class.
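
Rather than inspecting one product ID at a time, the same check can be run for every ID at once. The sketch below uses toy data and only illustrates the idea; the values are invented and say nothing about the client's data.

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    "product_id": [0, 0, 1, 1],
    "brand": ["A", "B", "C", "C"],
    "product_line": ["Road", "Standard", "Road", "Road"],
    "product_class": ["medium", "medium", "medium", "medium"],
})

detail_cols = ["brand", "product_line", "product_class"]

# count the distinct detail combinations recorded under each product_id
variants = (
    toy.drop_duplicates(["product_id"] + detail_cols)
       .groupby("product_id")
       .size()
)

# IDs that map to more than one combination break the one-ID-one-product rule
ambiguous_ids = variants[variants > 1]
print(ambiguous_ids)
```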

## Checking Uniqueness of Dataset

In [None]:
# looking for duplicated values
duplicated_values = Transactions.duplicated()

# number of duplicated values in dataset
print("The number of duplicated records in Transactions dataset is {}".format(duplicated_values.sum()))

## Highlights of Uniqueness in Transactions
Transaction records are unique.
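
`DataFrame.duplicated()` with no arguments only flags rows that match on every column. A complementary check, sketched below on toy data, is to test the key column alone. It is assumed here that `transaction_id` is meant to be the unique key.

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    "transaction_id": [1, 2, 2],
    "list_price": [10.0, 20.0, 25.0],
})

# rows that are identical in every column
full_row_duplicates = int(toy.duplicated().sum())

# rows that reuse a transaction_id, even if other columns differ
key_duplicates = int(toy.duplicated(subset=["transaction_id"]).sum())

print(full_row_duplicates, key_duplicates)
```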

## Exploring and Analyzing Data Quality of Sheet: NewCustomerList, Customer Demographic and Customer Address

In [None]:
# display data of sheet NewCustomerList
print(NewCustomerList.head())

In [None]:
# display data of sheet Customer Demographic
print(CustomerDemographic.head())

In [None]:
# display data of sheet Customer Address
print(CustomerAddress.head())

## Checking Consistency and Validity of Datasets

In [None]:
# Display columns of dataset NewCustomerList
print(NewCustomerList.info())

In [None]:
# checking the shape of your data
print(NewCustomerList.shape)

In [None]:
# Display columns of dataset CustomerDemographic
print(CustomerDemographic.info())

In [None]:
# Display columns of dataset CustomerAddress
print(CustomerAddress.info())

In [None]:
# checking the shape of your data
print(CustomerDemographic.shape)

In [None]:
# checking the shape of your data
print(CustomerAddress.shape)

## Highlights of Consistency and Validity in NewCustomerList, Customer Demographic and Customer Address
The NewCustomerList dataset has 1000 records with 23 columns, while Customer Demographic has 4000 records with 13 columns and Customer Address holds the remaining details in 6 columns, using **customer_id** as the key.
- The structure of NewCustomerList should be consistent with Customer Demographic and Customer Address.
- There is no **customer_id** in NewCustomerList.
- The number of columns is inconsistent: NewCustomerList has **4 unnamed columns** that contain values but are not labelled, so they cannot be identified.
- *NewCustomerList* has a column named **Value** of datatype **float64**. It was not captured before and is not present in *CustomerDemographic* or *CustomerAddress*.
- *CustomerDemographic* has a column named **default** of datatype **object**. Some of its values appear to be dates, but the column was not captured afterwards and is not present in *NewCustomerList*.
- Of the remaining columns, 5 are of datatype **int64**: past_3_years_bike_related_purchases, tenure, postcode, property_valuation and Rank.
- DOB is the date column, of datatype **datetime64** in format **YYYY-MM-DD**. The transaction date in Transactions uses the format **MM/DD/YYYY**. The date format should be kept consistent.
- The rest of the columns are of datatype **object**, but deceased_indicator should contain **boolean** values such as True and False.
- The data captured in the gender column of CustomerDemographic is not consistent. The values should be "Male", "Female" and "U", as in NewCustomerList.
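
One way to bring the gender values into line is a simple replacement map followed by a check for anything still outside the allowed set. The variants in `gender_map` below are hypothetical examples; the actual variants would have to be read from the data first, for instance with `value_counts()`.

```python
import pandas as pd

# HYPOTHETICAL variants, invented for illustration only
gender_map = {"F": "Female", "Femal": "Female", "M": "Male"}
allowed = ["Male", "Female", "U"]

# toy data, invented for illustration only
toy = pd.Series(["Male", "F", "Femal", "M", "U", "Female"], name="gender")

# replace whole values that match a key in the map; other values are untouched
standardised = toy.replace(gender_map)

# anything left outside the allowed set still needs attention
unexpected = standardised[~standardised.isin(allowed)]

print(standardised.value_counts())
print(unexpected)
```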

## Checking Completeness of Datasets

In [None]:
# counting the null values in each column
total_null_values = NewCustomerList.isnull().sum()

# counting the non-null values in each column
total_values = NewCustomerList.count().sort_values(ascending=True)

# calculating null values as a percentage of the non-null values
null_values_percentage = total_null_values / total_values * 100

# converting to dataframe of missing values
missing_values_NewCustomerList = pd.concat({'Total Values' : total_values, 'Null_values': total_null_values, 'Percentage of Missing Values': null_values_percentage}, axis=1)

# display missing values
print(missing_values_NewCustomerList)

In [None]:
# counting the null values in each column
total_null_values = CustomerDemographic.isnull().sum()

# counting the non-null values in each column
total_values = CustomerDemographic.count().sort_values(ascending=True)

# calculating null values as a percentage of the non-null values
null_values_percentage = total_null_values / total_values * 100

# converting to dataframe of missing values
missing_values_CustomerDemographic = pd.concat({'Total Values' : total_values, 'Null_values': total_null_values, 'Percentage of Missing Values': null_values_percentage}, axis=1)

# display missing values
print(missing_values_CustomerDemographic)

## Highlights of Completeness in NewCustomerList, Customer Demographic and Customer Address
- In NewCustomerList, 19.76% of job_industry_category values are missing, almost the same as the 19.61% in CustomerDemographic.
- 11.85% of job_title values are missing in NewCustomerList, slightly fewer than the 14.48% missing in CustomerDemographic.
- 3.22% of last_name values are missing in CustomerDemographic, and 2.98% are missing in NewCustomerList.
- CustomerDemographic has 2.22% missing DOB values, which drops slightly to 1.72% in NewCustomerList.
- 2.22% of tenure values are missing in CustomerDemographic, but no tenure values are missing in NewCustomerList.
- One address record, that of **customer_id = 3**, is missing from CustomerAddress, as indicated by the shapes of the datasets.
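
Comparing shapes shows that a record is missing, but not which one. A left merge with `indicator=True` can list the customer IDs that have no address. The sketch below uses toy frames, invented for illustration, in place of the real sheets.

```python
import pandas as pd

# toy stand-ins for CustomerDemographic and CustomerAddress
demographic = pd.DataFrame({"customer_id": [1, 2, 3, 4]})
address = pd.DataFrame({
    "customer_id": [1, 2, 4],
    "postcode": [2000, 3000, 4000],
})

# keep every demographic row and mark whether a matching address exists
merged = demographic.merge(
    address[["customer_id"]], on="customer_id", how="left", indicator=True
)

# "left_only" rows are customers without an address record
missing_ids = merged.loc[merged["_merge"] == "left_only", "customer_id"].tolist()
print(missing_ids)
```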


## Checking Accuracy of Dataset

In [None]:
# display the DOB column to inspect the recorded dates
print(CustomerDemographic['DOB'])

## Highlights of Accuracy in NewCustomerList, Customer Demographic and Customer Address
One DOB value is wrong: a birth year of 1843 is not possible.
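
Scanning the column by eye works for one outlier, but a range check is easier to repeat. The sketch below flags dates of birth outside a plausible window. The cutoff of 1900-01-01 is an assumption chosen for illustration, and the toy values are invented.

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "DOB": pd.to_datetime(["1843-01-01", "1975-06-01", "1999-03-15"]),
})

# ASSUMPTION: no living customer was born before 1900
earliest_plausible = pd.Timestamp("1900-01-01")
latest_plausible = pd.Timestamp.today()

# flag birth dates that are too old or lie in the future
implausible = toy[
    (toy["DOB"] < earliest_plausible) | (toy["DOB"] > latest_plausible)
]
print(implausible)
```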

## Checking Uniqueness of Dataset

In [None]:
# looking for duplicated values
duplicated_values = NewCustomerList.duplicated()

# number of duplicated values in dataset
print("The number of duplicated records in NewCustomerList dataset is {}".format(duplicated_values.sum()))

In [None]:
# looking for duplicated values
duplicated_values = CustomerDemographic.duplicated()

# number of duplicated values in dataset
print("The number of duplicated records in CustomerDemographic dataset is {}".format(duplicated_values.sum()))

In [None]:
# looking for duplicated values
duplicated_values = CustomerAddress.duplicated()

# number of duplicated values in dataset
print("The number of duplicated records in CustomerAddress dataset is {}".format(duplicated_values.sum()))

## Highlights of Uniqueness
All records are unique.