# IC Data consistency checks

### Contents list

##### 01. Importing data
##### 02. Consistency checks
###### 02.1 Finding missing values
###### 02.2 Subset data to view missing values
###### 02.3 New dataframe without missing values
###### 02.4 Finding duplicate values
###### 02.5 Drop duplicates
##### 03. Mixed data types
##### 04. Describe dataframes (step 2)
##### 05. Checking for mixed data types (step 3)
###### 05.1 step 4
##### 06. Missing values (step 5)
##### 07. Statistics (median, mean, frequency, max, describe) (step 6)
###### 07.1 Impute a value for missing values (e.g., median)
##### 08. Duplicates and missing values (step 7 and 8)
##### 09. Export dataframe (step 9)

In [1]:
#import libraries
import pandas as pd
import numpy as np
import os

### 01. Importing data

In [2]:
path= r'C:\Users\isobr\Box\02122022Instacart Basket Analysis'

In [3]:
path

'C:\\Users\\isobr\\Box\\02122022Instacart Basket Analysis'

In [5]:
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Prepared data', 'orders_wrangled.csv'), index_col = False)

In [6]:
df_ords.head()

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,0,2539329,1,1,2,8,
1,1,2398795,1,2,3,7,15.0
2,2,473747,1,3,3,12,21.0
3,3,2254736,1,4,4,7,29.0
4,4,431534,1,5,4,15,28.0


In [7]:
df_prods=pd.read_csv(os.path.join(path, '02 Data', 'Original data', 'products.csv'), index_col = False)

In [8]:
df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


In [9]:
df_dep=pd.read_csv(os.path.join(path, '02 Data', 'Original data', 'departments.csv'), index_col = False)

In [10]:
df_dep.head()

Unnamed: 0,department_id,1,2,3,4,5,6,7,8,9,...,12,13,14,15,16,17,18,19,20,21
0,department,frozen,other,bakery,produce,alcohol,international,beverages,pets,dry goods pasta,...,meat seafood,pantry,breakfast,canned goods,dairy eggs,household,babies,snacks,deli,missing


In [55]:
df_dep.T

Unnamed: 0,0
department_id,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta


### 02. Consistency checks

##### 02.1 Finding missing values

In [11]:
#finding missing values
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

There are 16 missing values in the product_name column.

##### 02.2 Subset data to view missing values

In [12]:
# To view the values, create a subset of the dataframe that shows them
df_nan = df_prods[df_prods['product_name'].isnull()]

In [13]:
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


In [14]:
df_nan.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,16.0,16.0,16.0,16.0
mean,6684.0,89.9375,10.9375,13.0125
std,12836.665242,33.731229,4.639953,3.881731
min,34.0,26.0,1.0,1.2
25%,459.25,70.75,7.75,12.175
50%,2413.0,98.5,11.5,13.65
75%,3872.75,120.0,14.5,14.425
max,40440.0,126.0,16.0,20.9


In [15]:
df_nan['department_id'].value_counts()

16    4
11    3
7     2
13    2
14    1
3     1
1     1
8     1
12    1
Name: department_id, dtype: int64

In [16]:
# Use the .shape attribute to record the number of rows before cleaning, for comparison afterwards
df_prods.shape

(49693, 5)

##### 02.3 New dataframe without missing values

In [17]:
# Create a new dataframe without the missing values.
# Alternative: df_prods.dropna(inplace = True), but that overwrites df_prods.
df_prods_clean = df_prods[df_prods['product_name'].notnull()]

In [18]:
df_prods_clean.shape

(49677, 5)
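
As a rough sketch of the same filtering step, the example below compares the boolean mask with `dropna(subset=...)`, which also returns a new dataframe and leaves the original untouched. The `toy` frame is hypothetical data made up for illustration, not the Instacart products file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_prods
toy = pd.DataFrame({
    "product_id": [1, 2, 3, 4],
    "product_name": ["Cookies", np.nan, "Tea", None],
    "prices": [5.8, 9.3, 4.5, 10.5],
})

# Boolean-mask approach, as in the cell above
masked = toy[toy["product_name"].notna()]

# dropna restricted to one column; returns a copy, toy is not modified
dropped = toy.dropna(subset=["product_name"])

print(toy.shape, dropped.shape)   # rows before and after
print(masked.equals(dropped))     # both approaches keep the same rows
```

Restricting `dropna` to `subset=["product_name"]` keeps rows that have gaps in other columns, which matches what the mask does.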

##### 02.4 Finding duplicate values

In [19]:
#Find if there are duplicate values
df_dups = df_prods_clean[df_prods_clean.duplicated()]

In [20]:
df_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


In [21]:
#use .shape to compare number of rows once duplicates are removed
df_prods_clean.shape

(49677, 5)

##### 02.5 Drop duplicates

In [22]:
# Create a new dataframe without the duplicates
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

In [23]:
df_prods_clean_no_dups.shape

(49672, 5)
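
The row counts above can be sanity-checked: 49677 rows minus the 5 duplicates found gives 49672. A minimal sketch of that check on hypothetical toy data (the `toy` frame is invented for illustration):

```python
import pandas as pd

# Hypothetical toy data with two fully duplicated rows
toy = pd.DataFrame({
    "product_id": [1, 2, 2, 3, 3],
    "product_name": ["Tea", "Salt", "Salt", "Stout", "Stout"],
    "prices": [4.5, 9.3, 9.3, 13.4, 13.4],
})

# duplicated() marks every repeat of an earlier row (the first occurrence is kept)
dups = toy[toy.duplicated()]

# drop_duplicates() returns a new dataframe without those repeats
no_dups = toy.drop_duplicates()

# The rows removed should equal the rows flagged
print(len(toy), len(dups), len(no_dups))
assert len(toy) - len(dups) == len(no_dups)
```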

In [24]:
df_prods_clean_no_dups.to_csv(os.path.join(path, '02 Data','Prepared data', 'products_checked.csv'))

### 03. Mixed data types

In [25]:
#Create a dataframe
df_test = pd.DataFrame()

In [26]:
#Create a mixed type column
df_test ['mix']= ['a','b',1,True]

In [27]:
df_test.head()

Unnamed: 0,mix
0,a
1,b
2,1
3,True


In [28]:
#check for mixed data types

for col in df_test.columns.tolist():
  weird = (df_test[[col]].applymap(type) != df_test[[col]].iloc[0].apply(type)).any(axis = 1)
  if len (df_test[weird]) > 0:
    print (col)

mix
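
An alternative way to run the same check, shown here only as a sketch: count the distinct Python types per column with `Series.map(type)`. The helper name `mixed_type_columns` and the `toy` frame are hypothetical. Like the loop above, this treats a missing value in a text column as a second type, because NaN is a float.

```python
import pandas as pd

def mixed_type_columns(df):
    """Return the names of columns whose values have more than one Python type."""
    return [col for col in df.columns if df[col].map(type).nunique() > 1]

# Hypothetical toy data: one mixed column, one purely numeric column
toy = pd.DataFrame({"mix": ["a", "b", 1, True], "clean": [1, 2, 3, 4]})

print(mixed_type_columns(toy))  # ['mix']
```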


### 04. Describe dataframes (step 2)

In [29]:
df_prods.describe ()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49693.0,49693.0,49693.0,49693.0
mean,24844.345139,67.770249,11.728433,9.994136
std,14343.717401,38.316774,5.850282,453.519686
min,1.0,1.0,1.0,1.0
25%,12423.0,35.0,7.0,4.1
50%,24845.0,69.0,13.0,7.1
75%,37265.0,100.0,17.0,11.2
max,49688.0,134.0,21.0,99999.0


In [30]:
df_dep.describe ()

Unnamed: 0,department_id,1,2,3,4,5,6,7,8,9,...,12,13,14,15,16,17,18,19,20,21
count,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
unique,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
top,department,frozen,other,bakery,produce,alcohol,international,beverages,pets,dry goods pasta,...,meat seafood,pantry,breakfast,canned goods,dairy eggs,household,babies,snacks,deli,missing
freq,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1



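In df_prods, the minimum is 1.0 in every column. That is plausible for the ID columns, which start at 1, but is worth a second look for prices. The maximum price of 99999.0 is far above the 75th percentile (11.2) and inflates the standard deviation to 453.5, so it may be a placeholder for a missing value or an outlier.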
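
One possible way to follow up on a suspicious maximum is to list the rows above a plausibility threshold and, if they turn out to be errors, mark them as missing. This is a sketch on hypothetical toy data; the cut-off `PRICE_CEILING = 100` is an assumption chosen for illustration, not a value taken from this analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one implausible price
toy = pd.DataFrame({
    "product_id": [1, 2, 3, 4],
    "prices": [5.8, 9.3, 99999.0, 4.5],
})

PRICE_CEILING = 100  # assumed cut-off, to be adjusted to the real data

# Inspect the rows that exceed the ceiling before changing anything
suspicious = toy[toy["prices"] > PRICE_CEILING]
print(suspicious)

# Work on a copy and mark the implausible prices as missing
cleaned = toy.copy()
cleaned.loc[cleaned["prices"] > PRICE_CEILING, "prices"] = np.nan
print(cleaned["prices"].max())  # max() skips NaN by default
```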

### 05. Checking for mixed data types (step 3)

In [32]:
#checking for mixed data type in df_ords
for col in df_ords.columns.tolist():
  weird = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis = 1)
  if len (df_ords[weird]) > 0:
    print (col)

##### 05.1 step 4

The loop prints nothing, so there are no mixed data types in df_ords.

### 06. Missing values (step 5)

In [33]:
#checking for missing values
df_ords.isnull().sum()

Unnamed: 0                     0
order_id                       0
user_id                        0
order_number                   0
orders_day_of_week             0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

There are 206209 missing values in the column 'days_since_prior_order'. This is exactly the number of rows with order_number 1 (see the order_number value counts in section 07), and the first row of df_ords.head() shows a missing value for a customer's first order. The most likely explanation is therefore that these are first orders, which have no prior order to measure from, rather than values that failed to be recorded.
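
One way to probe where the missing values come from is to cross-tabulate them against order_number. The sketch below uses hypothetical toy data and assumes, without confirmation from the source data, that the gaps sit on first orders; running the same comparison on df_ords would confirm or refute that.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two customers, gaps only on their first order
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [np.nan, 15.0, 21.0, np.nan, 7.0],
})

missing = toy["days_since_prior_order"].isna()
first_order = toy["order_number"] == 1

# Rows: is the value missing? Columns: is it a first order?
print(pd.crosstab(missing, first_order))

# True only if the gaps and the first orders are exactly the same rows
print((missing == first_order).all())
```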

### 07. Statistics (median, mean, frequency, max, describe) (step 6)

In [34]:
#Finding out the median values first for days_since_prior_order
df_ords['days_since_prior_order'].median()

7.0

In [35]:
# Finding the mean. 
df_ords['days_since_prior_order'].mean()

11.114836226863012

In [36]:
#alternative 
df_ords.days_since_prior_order.describe()

count    3.214874e+06
mean     1.111484e+01
std      9.206737e+00
min      0.000000e+00
25%      4.000000e+00
50%      7.000000e+00
75%      1.500000e+01
max      3.000000e+01
Name: days_since_prior_order, dtype: float64

In [37]:
df_ords['days_since_prior_order'].value_counts()

30.0    369323
7.0     320608
6.0     240013
4.0     221696
3.0     217005
5.0     214503
2.0     193206
8.0     181717
1.0     145247
9.0     118188
14.0    100230
10.0     95186
13.0     83214
11.0     80970
12.0     76146
0.0      67755
15.0     66579
16.0     46941
21.0     45470
17.0     39245
20.0     38527
18.0     35881
19.0     34384
22.0     32012
28.0     26777
23.0     23885
27.0     22013
24.0     20712
25.0     19234
29.0     19191
26.0     19016
Name: days_since_prior_order, dtype: int64

In [40]:
# and checking the counts for each value to look for outliers
df_ords['orders_day_of_week'].value_counts()

0    600905
1    587478
2    467260
5    453368
6    448761
3    436972
4    426339
Name: orders_day_of_week, dtype: int64

In [41]:
df_ords['orders_day_of_week'].median()

3.0

In [42]:
df_ords['orders_day_of_week'].max()

6

In [43]:
df_ords['order_number'].value_counts()

1      206209
2      206209
3      206209
4      206209
5      182223
        ...  
96       1592
97       1525
98       1471
99       1421
100      1374
Name: order_number, Length: 100, dtype: int64

Since the column is numeric, the missing values can be imputed with either the mean or the median, or flagged with a sentinel value such as 999. I chose the median. The median (7.0) is lower than the mean (11.1), so the distribution is skewed to the right, with a large cluster of orders at the maximum of 30 days pulling the mean up. The median is less sensitive to that tail, which makes it the better option in this case.
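
The reasoning above can be sketched on a small made-up series. The values in `days` are hypothetical, chosen only to mimic a right-skewed column with a cluster at 30; they are not drawn from df_ords.

```python
import pandas as pd

# Hypothetical right-skewed values with a cluster at the cap of 30
days = pd.Series([1, 2, 3, 4, 5, 6, 7, 7, 8, 30, 30, 30], dtype="float64")

# A mean above the median and a positive skew both point to a long right tail
print(days.mean(), days.median(), days.skew())

# Add two gaps, then impute them with the median (median() ignores NaN)
with_gaps = pd.concat([days, pd.Series([float("nan")] * 2)], ignore_index=True)
filled = with_gaps.fillna(with_gaps.median())

print(filled.isna().sum())  # no gaps left
print(filled.median())      # the median is unchanged by the imputation
```

Imputing with the median leaves the median itself unchanged, whereas imputing with the mean would pull filled values toward the tail.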

##### 07.1 Impute a value for missing values (e.g., median)

In [44]:
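# Impute the median for the missing values in this column.
# Assigning the result back avoids relying on inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to df_ords.
df_ords['days_since_prior_order'] = df_ords['days_since_prior_order'].fillna(df_ords['days_since_prior_order'].median())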

### 08. Duplicates and missing values (step 7 and 8)

In [45]:
#checking for duplicates
df_ords_dups = df_ords[df_ords.duplicated()]

In [46]:
df_ords_dups

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order


There are no duplicates in df_ords. If there were, I would create a new dataframe without them, e.g.:
df_ords_no_dups = df_ords.drop_duplicates()

### 09. Export dataframe (step 9)

In [47]:
df_ords.to_csv(os.path.join(path, '02 Data','Prepared data', 'orders_checked.csv'))
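
The `Unnamed: 0` column listed by df_ords.isnull().sum() is typically what a saved index looks like after a round trip through CSV. As a sketch on hypothetical toy data (written to a temporary directory rather than the project path), passing `index=False` to `to_csv` keeps that extra column from appearing on the next import:

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for df_ords
toy = pd.DataFrame({"order_id": [2539329, 2398795], "user_id": [1, 1]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    toy.to_csv(with_index)                   # default: the index is written as a column
    toy.to_csv(without_index, index=False)   # the index is left out

    cols_with = pd.read_csv(with_index).columns.tolist()
    cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'order_id', 'user_id']
print(cols_without)  # ['order_id', 'user_id']
```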