## Data Quality Assessment Report: Sprocket Central Pty Ltd

### Introduction

Sprocket Central Pty Ltd, a medium-sized bikes and cycling accessories organisation, has approached Tony Smith (Partner) in KPMG’s Lighthouse & Innovation Team. Sprocket Central Pty Ltd is keen to learn more about KPMG’s expertise in its Analytics, Information & Modelling team, and needs help with its customer and transactions data. The organisation has a large dataset relating to its customers, but its team is unsure how to analyse it effectively to help optimise its marketing strategy.

The purpose of this assessment was to optimise the customer's datasets so that they are of better quality for analysis and various insights can be uncovered.

### Gathering

The datasets were gathered from the file provided by the company, [here](https://cdn-assets.theforage.com/vinternship_modules/kpmg_data_analytics/KPMG_VI_New_raw_data_update_final.xlsx).
A quick visual assessment using Google Sheets revealed that the three datasets we would be working with were in different tabs, so I downloaded each of the three as a CSV file.

I assessed the three datasets extracted from the provided sheet:
1. Customer Demographic
2. Customer Address
3. Transactions Data for the past 3 months


The wrangling efforts followed the iterative approach of the three steps in Data Wrangling: Gather -> Assess -> Clean. For the purpose of this exercise, the final Cleaning step is omitted; instead, recommendations on how to go about it are given.

### Assessing

Both visual and programmatic assessments were carried out on the three datasets. This helped identify a number of Data Quality and Data Tidiness issues, each of which was documented against the dataset in which it was found. The four Data Quality dimensions (Completeness, Validity, Accuracy, and Consistency) guided the identification of quality issues.
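
As an illustration of the programmatic side of the assessment, the sketch below summarises each column's datatype, missing-value count, and number of distinct values. It assumes pandas is the tool in use; the helper name `summarise_quality` and the sample rows are made up for illustration and are not the code or data used in this assessment.

```python
import pandas as pd


def summarise_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: its dtype, missing count and distinct count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "distinct": df.nunique(dropna=True),
    })


# Made-up rows for illustration only (not the client's data)
sample = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "gender": ["Male", "F", "Female"],
    "DOB": ["1953-10-12", None, "1980-12-16"],
})

report = summarise_quality(sample)
print(report)
```

A summary like this makes completeness problems (missing values) and consistency problems (more distinct labels than expected, as in `gender`) visible at a glance, and it complements the visual check in the spreadsheet.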

### Identified Issues

After assessing the datasets, I identified various issues with their quality. For the purpose of documentation, I used a Quality and Tidiness Issues Framework. Below are the issues found in the datasets.

#### Quality Issues
`customer_demographic`
- last_name: missing values
- gender: inconsistent representation, with full labels in some rows (Male, Female) and abbreviations in others (M, F)
- DOB: missing values
- DOB: has a string datatype
- job_title: missing values
- job_industry_category: missing values
- default: unrecognised data format
- tenure: missing values

`customer_addresses`
- missing record (3999 instead of 4000)


`transactions`
- transaction_date: has a string datatype
- online_order: missing values
- brand: missing values
- product_line: missing values
- product_class: missing values
- product_size: missing values
- standard_cost: missing values
- standard_cost: has a string datatype
- product_first_sold_date: missing values


#### Tidiness Issues
`customer_demographic`
- DOB: does not follow the lowercase naming format of the other columns
- DOB: unrealistic value for one customer (1843-12-21)

`transactions`
- transaction_date: format should be consistent (year-month-day)
- online_order: has a string datatype (with True and False values)
- order_status, brand, product_line, product_class, product_size: all have a string datatype

One table, `customers`, is split into two: `customer_demographic` and `customer_addresses`.
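
A minimal sketch of how the two customer tables might be recombined into a single `customers` table. It assumes pandas and that both tables share a `customer_id` key; the key name, the `state` column, and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Made-up rows standing in for the two customer tables
customer_demographic = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "gender": ["Female", "Male", "Female"],
})
customer_addresses = pd.DataFrame({
    "customer_id": [1, 2],
    "state": ["NSW", "VIC"],
})

# A left join keeps every demographic record; customers without an
# address get missing values in the address columns.
customers = customer_demographic.merge(
    customer_addresses,
    on="customer_id",
    how="left",
    validate="one_to_one",  # raises if customer_id is duplicated on either side
)
print(customers)
```

The `validate` argument is a cheap safeguard: if either table turned out to hold duplicate keys, the merge would fail loudly instead of silently multiplying rows.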


### Recommendations

After a thorough assessment of the provided datasets, I identified both quality and tidiness issues, which have been documented above. The next step is to give recommendations on how the datasets can be cleaned so that they are ready for both exploratory and explanatory analysis.

#### Quality Issues:

`customer_demographic`
- The empty values in last_name can be replaced with N/A (Not Applicable).
- For consistency, the gender column needs to be reformatted to use a single representation of gender: (Male, Female, Others) or (M, F, O).
- The datatype of the DOB column should be converted to datetime.
- When it is time for analysis, if the DOB column is essential, then entries without DOB should be dropped.
- The entries with empty job_title and job_industry_category should be dropped if these columns are crucial to our analysis.
- The default column should be removed, as it is not clear what it represents or how it can be used.
- The entries with empty tenure should be dropped if this column is crucial to our analysis.
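
The recommendations above could be sketched roughly as follows, assuming pandas. The function name `clean_demographic`, the mapping of unrecognised or missing gender labels to "Others", and the sample rows are assumptions for illustration, not decisions taken in this assessment.

```python
import pandas as pd

# Assumed target representation: full labels
GENDER_MAP = {"M": "Male", "F": "Female", "Male": "Male", "Female": "Female"}


def clean_demographic(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the suggested fixes to a copy of the demographic table."""
    out = df.copy()
    # Replace empty last names with an explicit marker
    out["last_name"] = out["last_name"].fillna("N/A")
    # Use one representation of gender; anything unmapped becomes "Others" (assumption)
    out["gender"] = out["gender"].str.strip().map(GENDER_MAP).fillna("Others")
    # Convert DOB from string to datetime; unparseable values become NaT
    out["DOB"] = pd.to_datetime(out["DOB"], errors="coerce")
    # Remove the column whose meaning is unclear
    out = out.drop(columns=["default"], errors="ignore")
    return out


# Made-up rows for illustration only
sample = pd.DataFrame({
    "last_name": ["Doe", None],
    "gender": ["F", "Male"],
    "DOB": ["1953-10-12", None],
    "default": ["x", "y"],
})
cleaned = clean_demographic(sample)
print(cleaned)
```

Rows with a missing DOB, job_title, job_industry_category, or tenure are deliberately left in place here, since whether to drop them depends on which columns the later analysis needs.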

`customer_addresses`
- There are 4 customers without address information. We can drop these entries.
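
One way to locate the customers without address information is to compare the keys of the two tables. This is a sketch that assumes pandas and a shared `customer_id` column; the helper name, the `postcode` column, and the sample rows are hypothetical.

```python
import pandas as pd


def customers_without_address(demographic, addresses, key="customer_id"):
    """Return the demographic rows whose key has no match in the address table."""
    return demographic[~demographic[key].isin(addresses[key])]


# Made-up rows for illustration only
demographic = pd.DataFrame({"customer_id": [1, 2, 3, 4]})
addresses = pd.DataFrame({
    "customer_id": [1, 2, 4],
    "postcode": [2000, 3000, 4000],
})

missing = customers_without_address(demographic, addresses)
# Drop the unmatched customers, as recommended above
with_address = demographic.drop(index=missing.index)
print(missing["customer_id"].tolist())
```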


`transactions`
- The transaction_date column datatype should be converted to a datetime
- The entries with missing online_order values should be dropped and datatype converted to boolean
- The entries with missing brand should be dropped and datatype should be converted to categorical
- The entries with missing product_line should be dropped and datatype should be converted to categorical
- The entries with missing product_class should be dropped and datatype should be converted to categorical
- The entries with missing product_size should be dropped and datatype should be converted to categorical
- The entries with missing standard_cost should be dropped, and the data in the column should be reformatted to just the numeric value (stripping the currency sign)
- The entries with missing product_first_sold_date should be dropped
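
The transaction recommendations above can be sketched as a single cleaning step, assuming pandas. The function name `clean_transactions`, the exact string forms of the boolean and currency values, and the sample rows are assumptions for illustration.

```python
import pandas as pd

REQUIRED = ["online_order", "brand", "product_line", "product_class",
            "product_size", "standard_cost", "product_first_sold_date"]
CATEGORICAL = ["order_status", "brand", "product_line",
               "product_class", "product_size"]


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Drop incomplete rows, then fix the datatypes of the remaining ones."""
    out = df.dropna(subset=REQUIRED).copy()
    out["transaction_date"] = pd.to_datetime(out["transaction_date"], errors="coerce")
    # Map the string labels to real booleans (nullable dtype keeps unknowns as <NA>)
    out["online_order"] = (
        out["online_order"].astype(str).str.strip().str.lower()
        .map({"true": True, "false": False})
        .astype("boolean")
    )
    # Strip the currency sign and any thousands separators, then parse as a number
    out["standard_cost"] = pd.to_numeric(
        out["standard_cost"].astype(str).str.replace(r"[^0-9.\-]", "", regex=True),
        errors="coerce",
    )
    for col in CATEGORICAL:
        out[col] = out[col].astype("category")
    return out


# Made-up rows for illustration only; the last row has no online_order value
sample = pd.DataFrame({
    "transaction_date": ["2017-02-25", "2017-05-21", "2017-10-16"],
    "online_order": ["False", "True", None],
    "order_status": ["Approved", "Approved", "Approved"],
    "brand": ["Brand A", "Brand B", "Brand A"],
    "product_line": ["Standard", "Standard", "Road"],
    "product_class": ["medium", "high", "low"],
    "product_size": ["medium", "large", "small"],
    "standard_cost": ["$53.50", "$1,388.25", "$248.75"],
    "product_first_sold_date": ["2012-12-02", "2014-03-03", "1999-07-20"],
})
cleaned = clean_transactions(sample)
print(cleaned.dtypes)
```

Dropping incomplete rows before converting types keeps the conversions simple, at the cost of losing every transaction with any missing required field; filling in defaults is the alternative when that loss is too large.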


#### Note
For columns that have been identified as crucial to our analysis, where we do not want to drop entries with empty values, a specified default value can be set instead.
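
A minimal sketch of setting per-column defaults, assuming pandas. The choice of columns and the "Unknown" placeholder are hypothetical; the actual defaults would need to be agreed for each column.

```python
import pandas as pd

# Hypothetical defaults, one per column that should not lose rows
DEFAULTS = {"job_title": "Unknown", "job_industry_category": "Unknown"}

# Made-up rows for illustration only
sample = pd.DataFrame({
    "customer_id": [1, 2],
    "job_title": ["Engineer", None],
    "job_industry_category": [None, "Health"],
})

# Fill each listed column with its own default; other columns are untouched
filled = sample.fillna(value=DEFAULTS)
print(filled)
```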
