# Loading Libraries

In [2]:
options(warn=-1)
library(dplyr)
library(mltools)
library(data.table)
library(ggplot2)
library(flextable)

# Importing Dataset

The data set is imported using read.csv() function. By using the str() function we can
print the all the variables of the data set.

In [2]:
superStore <- read.csv("Superstore.csv")

str(superStore)

'data.frame':	9994 obs. of  20 variables:
 $ Order.ID     : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
 $ Order.Date   : chr  "11/08/2016" "11/08/2016" "06/12/2016" "10/11/2015" ...
 $ Ship.Date    : chr  "11/11/2016" "11/11/2016" "6/16/2016" "10/18/2015" ...
 $ Ship.Mode    : chr  "Second Class" "Second Class" "second class" "Standard Class" ...
 $ Customer.ID  : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
 $ Customer.Name: chr  "Claire Gute" "Claire Gute" "Darrin Van Huff" "Sean O'Donnell" ...
 $ Segment      : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
 $ Country      : chr  "United States" "United States" "United States" "United States" ...
 $ City         : chr  "Henderson" "Henderson" "Los Angeles" "Fort Lauderdale" ...
 $ State        : chr  "Kentucky" "Kentucky" "California" "Florida" ...
 $ Postal.Code  : int  42420 42420 90036 33311 33311 90032 90032 90032 90032 90032 ...
 $ Region       : chr  "South" "South" "West" "Sout

# 1 Introduction

The objective of the task is to apply the six information quality aspects to one of the picked
datasets to investigate and upgrade its quality. The 6 information quality aspects were
applied to the super store data set to assess its quality and, subsequently, proven methods for information quality
aspects were utilized at whatever point important to sanitize the information and accomplish a more serious level of 
value. Fulfillment, Conformity,Consistency, Accuracy, Duplicates, and Integrity are the six information quality
aspects that would be suggested in this paper.

###  Dataset

This dataset is very self explanatory.It contains data about the product sales of a global superstore.
The data set have 9994 observations (rows) with 20 variables


### Data Dictionary

*Order id:Contains the order's ID
*Order Date:Contains the order date 
*Ship Date:Contains the shipping date
*Ship mode:Contains the class of the transport
*Customer id:Contains the customer id
*Customer name:Stores the customer name
*Segment:Contains the segment of the product
*Country:Contains the country of the order
*City:Contains the city of the order
*State:Contains the state of the order
*Postal Code:Contains the postal code of the order's destination
*Region:Contains the region for the order
*Product Id:Contains the product ID
*Category:Contains the product category
*Sub category:Contains the product sub category
*Product Name:Contains the product name
*Sales:Contains the sales of a product
*Quantity:Contains the quantity of a product
*Discount:Contains the discount of a product
*Profit:Contains the profit of a product

# 2 Completeness

### 2.1 Counting missing values

The blank cells or NAs were found in two factors: postal code and amount.
Postal code and amount factors have 9, 3 missing values, separately

In [4]:
df_withna <- superStore %>% filter_all(any_vars(is.na(.)))
summary(df_withna)

   Order.ID          Order.Date         Ship.Date          Ship.Mode        
 Length:12          Length:12          Length:12          Length:12         
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 Customer.ID        Customer.Name        Segment            Country         
 Length:12          Length:12          Length:12          Length:12         
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            

### 2.2 Replacing missing values 

For variable postal code and amount factors the missing values are replaced  with zero
to demonstrate no postal code and no amount for those particular variables.

In [5]:
superStore$Postal.Code[is.na(superStore$Postal.Code)] <- 0
superStore$Quantity[is.na(superStore$Quantity)] <- 0

### 2.3 Correcting variable labels

There was one more issue in  ship mode, segment and region labels. For the ship mode section and  
segment column had a portion of the labels utilized were lower case and others capitalized .
The region segment had labels utilized for the west classification were W, the south classification 
were S, the central class C and the east classification were E. Along these columns, there is need
to replace this with the right ones.

In [20]:
flextable(filter(superStore, Region == "C" | Region == "E" | Region == "S" | Region == "W"))

a flextable object.
col_keys: `Order.ID`, `Order.Date`, `Ship.Date`, `Ship.Mode`, `Customer.ID`, `Customer.Name`, `Segment`, `Country`, `City`, `State`, `Postal.Code`, `Region`, `Product.ID`, `Category`, `Sub.Category`, `Product.Name`, `Sales`, `Quantity`, `Discount`, `Profit` 
header has 1 row(s) 
body has 11 row(s) 
original dataset sample: 
        Order.ID Order.Date  Ship.Date      Ship.Mode Customer.ID
1 CA-2014-115812 06/09/2014  6/14/2014 Standard Class    BH-11710
2 CA-2016-137330 12/09/2016 12/13/2016 Standard Class    KB-16585
3 CA-2016-118255 03/11/2016  3/13/2016    First Class    ON-18715
4 CA-2016-111682  6/17/2016  6/18/2016    First Class    TB-21055
5 US-2015-134026  4/26/2015 05/02/2015 Standard Class    JE-15745
    Customer.Name   Segment       Country        City      State Postal.Code
1 Brosina Hoffman  Consumer United States Los Angeles California       90032
2       Ken Black corporate United States     Fremont   Nebraska       68025
3   Odella Nelson Corporate

In [9]:
superStore$Ship.Mode[superStore$Ship.Mode == "first class"] <- "First Class"
superStore$S[superStore$Ship.Mode == "second class"] <-  "Second Class"
superStore$Ship.Mode[superStore$Ship.Mode == "standard class"] <- "Standard Class"
superStore$Ship.Mode[superStore$Ship.Mode == "Standard class"] <- "Standard Class"
superStore$Ship.Mode[superStore$Ship.Mode == "standard Class"] <- "Standard Class"

superStore$Segment[superStore$Segment == "consumer"] <- "Consumer"
superStore$Segment[superStore$Segment == "corporate"] <- "Corporate"
superStore$Segment[superStore$Segment == "home office"] <- "Home Office"

superStore$Region[superStore$Region == "C"] <- "Central"
superStore$Region[superStore$Region == "E"] <- "East"
superStore$Region[superStore$Region == "S"] <- "South"
superStore$Region[superStore$Region == "W"] <- "West"

superStore$State[superStore$State == "FL"] <- "Florida"
superStore$State[superStore$State == "CL"] <- "Colorado"

superStore$Country[superStore$Country == "US"] <- "United States"

### 2.4 Setting Decimal Places

The sales columnn and the quantity column have two decimal places but in real sense the 
sales and quantity columns have to be whole numbers to incidate an exact amount.

In [3]:
round(superStore$Sales)
round(superStore$Quantity)

# 3 Conformity

By looking at the structure of the variables we can see that the factors having right design. 
The data type for every factor is underneath where postal code, sales, quantity, discount and profit
are numeric while the other factors are character factors.

In [4]:
str(superStore)

'data.frame':	9994 obs. of  20 variables:
 $ Order.ID     : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
 $ Order.Date   : chr  "11/08/2016" "11/08/2016" "06/12/2016" "10/11/2015" ...
 $ Ship.Date    : chr  "11/11/2016" "11/11/2016" "6/16/2016" "10/18/2015" ...
 $ Ship.Mode    : chr  "Second Class" "Second Class" "second class" "Standard Class" ...
 $ Customer.ID  : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
 $ Customer.Name: chr  "Claire Gute" "Claire Gute" "Darrin Van Huff" "Sean O'Donnell" ...
 $ Segment      : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
 $ Country      : chr  "United States" "United States" "United States" "United States" ...
 $ City         : chr  "Henderson" "Henderson" "Los Angeles" "Fort Lauderdale" ...
 $ State        : chr  "Kentucky" "Kentucky" "California" "Florida" ...
 $ Postal.Code  : int  42420 42420 90036 33311 33311 90032 90032 90032 90032 90032 ...
 $ Region       : chr  "South" "South" "West" "Sout

# 4 Consistency

The values for profits have a major issue in consistency. The profits are not steady under one.

In [22]:
flextable(filter(superStore, Profit < 0))

a flextable object.
col_keys: `Order.ID`, `Order.Date`, `Ship.Date`, `Ship.Mode`, `Customer.ID`, `Customer.Name`, `Segment`, `Country`, `City`, `State`, `Postal.Code`, `Region`, `Product.ID`, `Category`, `Sub.Category`, `Product.Name`, `Sales`, `Quantity`, `Discount`, `Profit` 
header has 1 row(s) 
body has 1871 row(s) 
original dataset sample: 
        Order.ID Order.Date  Ship.Date      Ship.Mode Customer.ID
1 US-2015-108966 10/11/2015 10/18/2015 Standard Class    SO-20335
2 US-2015-118983 11/22/2015 11/26/2015 Standard Class    HP-14815
3 US-2015-118983 11/22/2015 11/26/2015 Standard Class    HP-14815
4 US-2017-156909  7/16/2017  7/18/2017   Second Class    SF-20065
5 US-2015-150630  9/17/2015  9/21/2015 Standard Class    TB-21520
    Customer.Name     Segment       Country            City        State
1  Sean O'Donnell    Consumer United States Fort Lauderdale      Florida
2   Harold Pawlan Home Office United States      Fort Worth        Texas
3   Harold Pawlan Home Office United 

We can create a subset of the data by removing these inconsistent number of profits.

In [24]:
superStore <- subset(superStore, Profit >= 0)


# 5 Accuracy

The data in this dataset is accurate as it matches the various categories used.

In [29]:
flextable(head(filter(superStore, Region == "South")))

a flextable object.
col_keys: `Order.ID`, `Order.Date`, `Ship.Date`, `Ship.Mode`, `Customer.ID`, `Customer.Name`, `Segment`, `Country`, `City`, `State`, `Postal.Code`, `Region`, `Product.ID`, `Category`, `Sub.Category`, `Product.Name`, `Sales`, `Quantity`, `Discount`, `Profit` 
header has 1 row(s) 
body has 6 row(s) 
original dataset sample: 
        Order.ID Order.Date  Ship.Date      Ship.Mode Customer.ID
1 US-2017-100930 04/07/2017 04/12/2017 Standard Class    CS-12400
2 CA-2014-136567 12/20/2014 12/21/2014    First Class    PS-19045
3 CA-2016-144344 10/28/2016 10/28/2016       Same Day    PG-18820
4 CA-2014-104283  6/27/2014 07/01/2014 Standard Class    LM-17065
5 US-2014-155502  1/26/2014  1/31/2014 Standard Class    SD-20485
       Customer.Name     Segment       Country          City       State
1 Christopher Schild Home Office United States         Tampa     Florida
2    Penelope Sewall Home Office United States  Harrisonburg    Virginia
3    Patrick Gardner    Consumer United 

# 6 Duplicates

There are no duplicate entries in the data set.

In [30]:
superStore %>%
transmute(duplicate = duplicated(.)) %>%
filter(duplicate == "TRUE")

duplicate
<lgl>


# 7 Intergrity

The data has no intergral issues.When the integrity of data is secure, the information stored will
remain complete, accurate, and reliable no matter how long it’s stored or how often it’s accessed.

# 8 Conclusion

It is exceptionally vital to clean the data set before examination and perceptions. The super store data set 
was cleaned by utilizing six layered information quality ascribes. The data set was found have missing qualities
and was cleaned. A few issues were found as to ship mode, segment and region variable having two qualities for
a similar classification and were rectified.
The information was conformant and put away in appropriate configuration. There was a serious issue in 
profits which were conflicting under 0, and were eliminated. No copied sections were found in the dataset.
There is need to appropriately investigate the informational index and may address the class or may erase these columns
to abstain from any deceptive data during information examination and representations.