# Data Consistency Checks

### Step 1

1. If you haven’t performed the consistency checks covered in this Exercise on your df_prods dataframe, do so now.

In [1]:
# Importing Libraries
import numpy as np
import pandas as pd
import os

# setting master path
path = r'/Users/Norberto/Desktop/2023-10 Instacart Basket Analysis'

In [2]:
# importing products data file
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'))


#### Addressing Missing Values

In [3]:
# List a count of null values in each column
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [4]:
# Create new dataframe of only NaN values
df_nan = df_prods[df_prods['product_name'].isnull()]

# Create new dataframe of non-NaN values
df_prods_clean = df_prods[df_prods['product_name'].notnull()]

In [5]:
# Confirming dataframe row counts match
a = df_prods.shape[0]
b = df_nan.shape[0]
c = df_prods_clean.shape[0]
print('df_prods({}) - df_nan({}) = df_prods_clean({})'.format(a,b,c))
print(a-b==c)

df_prods(49693) - df_nan(16) = df_prods_clean(49677)
True
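
The same split and row-count check can be sketched on a small made-up frame. The `toy` rows below are invented for illustration and are not taken from products.csv; `dropna(subset=...)` is shown as one alternative to the boolean filter, not as the method used above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_prods; values are invented for illustration only.
toy = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'product_name': ['Cookies', np.nan, 'Tea', np.nan],
})

# Rows whose product_name is missing.
toy_nan = toy[toy['product_name'].isnull()]

# Rows with a product_name; dropna(subset=...) only inspects that column.
toy_clean = toy.dropna(subset=['product_name'])

# The two pieces should add back up to the original row count.
assert len(toy) - len(toy_nan) == len(toy_clean)
print(len(toy), len(toy_nan), len(toy_clean))
```

Restricting `dropna` to one column avoids dropping rows that are only missing values in other columns.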


#### Addressing Duplicates

In [6]:
# Find full-row duplicates (every copy after the first occurrence)
df_dups = df_prods_clean[df_prods_clean.duplicated()]
df_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


In [7]:
# Drop the duplicate rows, keeping the first copy of each
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()
df_prods_clean_no_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3
...,...,...,...,...,...
49688,49684,"Vodka, Triple Distilled, Twist of Vanilla",124,5,5.3
49689,49685,En Croute Roast Hazelnut Cranberry,42,1,3.1
49690,49686,Artisan Baguette,112,3,7.8
49691,49687,Smartblend Healthy Metabolism Dry Cat Food,41,8,4.7


In [8]:
# Confirming dataframe row counts match
d = df_dups.shape[0]
e = df_prods_clean_no_dups.shape[0]
print('df_prods_clean({}) - df_dups({}) = df_prods_clean_no_dups({})'.format(c,d,e))
print(c-d==e)

df_prods_clean(49677) - df_dups(5) = df_prods_clean_no_dups(49672)
True
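
A minimal sketch of how `duplicated()` and `drop_duplicates()` relate, using a made-up frame (the `toy` rows are hypothetical). The `keep=False` option is an addition for inspection and is not used in the cells above.

```python
import pandas as pd

# Toy frame with one repeated row; values are invented for illustration.
toy = pd.DataFrame({
    'product_id': [1, 2, 2, 3],
    'product_name': ['Cookies', 'Tea', 'Tea', 'Salt'],
})

# By default, duplicated() marks only the second and later copies.
later_copies = toy[toy.duplicated()]

# keep=False marks every member of a duplicate group, which makes it
# easier to compare the copies side by side before dropping anything.
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates() keeps the first copy of each group.
deduped = toy.drop_duplicates()

# Same arithmetic as the row-count check above.
assert len(toy) - len(later_copies) == len(deduped)
```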


#### Addressing Mixed-Type Data
I completed these steps after checking for missing and duplicate data because converting a column's data type can change the output of those checks.

In [9]:
# List all columns with mixed data types
# Note: applymap is deprecated since pandas 2.1 in favor of DataFrame.map
for col in df_prods.columns.tolist():
    weird = (df_prods[[col]].applymap(type) != df_prods[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df_prods[weird]) > 0:
        print(col)

product_name
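
One likely reason `product_name` is flagged is that missing values are stored as `NaN`, which is a float, alongside the string names. The sketch below illustrates this on a made-up frame; it uses `Series.map(type)` rather than the loop above, and the `toy` data and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame: an object column holding strings plus one NaN (a float).
toy = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': pd.Series(['Cookies', np.nan, 'Tea'], dtype=object),
})

# Collect the set of Python types found in each column.
types_per_column = {col: set(toy[col].map(type)) for col in toy.columns}

# A column counts as "mixed" when more than one type shows up in it.
mixed = [col for col, types in types_per_column.items() if len(types) > 1]
print(mixed)
```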


In [10]:
# List beginning and end of data to understand and inspect it
df_prods['product_name']


0                               Chocolate Sandwich Cookies
1                                         All-Seasons Salt
2                     Robust Golden Unsweetened Oolong Tea
3        Smart Ones Classic Favorites Mini Rigatoni Wit...
4                                Green Chile Anytime Sauce
                               ...                        
49688            Vodka, Triple Distilled, Twist of Vanilla
49689                   En Croute Roast Hazelnut Cranberry
49690                                     Artisan Baguette
49691           Smartblend Healthy Metabolism Dry Cat Food
49692                               Fresh Foaming Cleanser
Name: product_name, Length: 49693, dtype: object

In [11]:
# Set new data type for product name to str
df_prods['product_name'] = df_prods['product_name'].astype('str')

# Rerun mixed-type data check
for col in df_prods.columns.tolist():
    weird = (df_prods[[col]].applymap(type) != df_prods[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df_prods[weird]) > 0:
        print(col)

## Task

### Step 2

2. Run the df.describe() function on your df_ords dataframe. Using your new knowledge about how to interpret the output of this function, share in a markdown cell whether anything about the data looks off or should be investigated further.
    * Tip: Keep an eye on min and max values!

In [12]:
# importing orders data file; column 0 is the saved index and the ID columns are read as strings
df_ords = pd.read_csv(
    os.path.join(path, '02 Data', 'Prepared Data', 'orders_wrangled.csv'),
    index_col=0,
    dtype={'order_id': 'str', 'user_id': 'str'},
)

In [13]:
df_ords.describe()

Unnamed: 0,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3214874.0
mean,17.15486,2.776219,13.45202,11.11484
std,17.73316,2.046829,4.226088,9.206737
min,1.0,0.0,0.0,0.0
25%,5.0,1.0,10.0,4.0
50%,11.0,3.0,13.0,7.0
75%,23.0,5.0,16.0,15.0
max,100.0,6.0,23.0,30.0


The 'Unnamed: 0' column kept appearing because the import did not indicate that the file already had an index. I tried passing index_col=False, but still got the same result. After I set index_col=0 to mark the first column as the index, it was left out of the analysis.

The 'order_id' and 'user_id' columns were being read as integers when pandas loaded the file, even though they had been converted to strings when the data was wrangled. A CSV file does not store data types, so pandas infers them again on every read. The only solution I found was to specify the data type for those columns when importing the file. These columns are identifiers, so they are treated as strings and hold no value for numerical analysis.

The 'days_since_prior_order' column has fewer values than the other columns (3,214,874 versus 3,421,083). This makes sense if some orders are a customer's first, since there is no prior order to count from. Its min of 0 and max of 30 are appropriate. Because this field is not filled in for every order, a lower count is expected.

The 'order_number' column has an appropriate min of 1 and max of 100. The median (11) is lower than the mean (about 17.2), which makes sense if most customers place relatively few orders and a smaller group places many.

The 'orders_day_of_week' and 'order_hour_of_day' columns have appropriate ranges: 0 to 6 and 0 to 23. Both start counting at 0, so their maximums are one less than the number of days in a week (7) and hours in a day (24). The medians (3 and 13) sit near the middle of each range.
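
The min and max review above can also be expressed as a programmatic range check. This is a sketch on a made-up frame: the `toy` rows and the `expected` dictionary are hypothetical, with the ranges taken from the values discussed above.

```python
import pandas as pd

# Toy stand-in for df_ords; values are invented for illustration only.
toy = pd.DataFrame({
    'order_number': [1, 2, 3, 1],
    'orders_day_of_week': [0, 6, 3, 2],
    'order_hour_of_day': [0, 23, 12, 7],
    'days_since_prior_order': [None, 15.0, 30.0, None],
})

# Expected inclusive ranges for each numeric column.
expected = {
    'order_number': (1, 100),
    'orders_day_of_week': (0, 6),
    'order_hour_of_day': (0, 23),
    'days_since_prior_order': (0, 30),
}

# between() is False for NaN, so missing values are dropped before the test.
out_of_range = {
    col: int((~toy[col].dropna().between(low, high)).sum())
    for col, (low, high) in expected.items()
}
print(out_of_range)
```

A nonzero count for any column would point to values worth investigating further.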



### Steps 3 & 4

3. Check for mixed-type data in your df_ords dataframe.
4. If you find mixed-type data, fix it. The column in question should contain observations of a single data type.

In [14]:
# Iterate through the df_ords columns and print the name of any column that holds more than one data type
for col in df_ords.columns.tolist():
    weird = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df_ords[weird]) > 0:
        print(col)

There were no mixed data types in the orders csv file. When I imported the file, I explicitly read two columns as strings. Before I made that adjustment to the import, pandas was evaluating them as int64, which tells me that every value in them is an integer. With that information, I know that specifying dtypes did not mask any mixed data types. Also, if I know before importing that a column has mixed data types, one way to fix it is to specify the column's dtype as an argument when reading the file.
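
A small sketch of the difference between letting pandas infer the ID columns and naming them in `dtype`. The in-memory CSV below is a made-up stand-in for orders_wrangled.csv, and `csv_text`, `inferred`, and `as_text` are hypothetical names.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for the orders file (made-up rows).
csv_text = ",order_id,user_id,order_number\n0,2539329,1,1\n1,2398795,1,2\n"

# Without dtype, pandas infers integers for the ID columns.
inferred = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Naming the columns in dtype reads them as text instead.
as_text = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    dtype={'order_id': 'str', 'user_id': 'str'},
)

print(inferred.dtypes)
print(as_text.dtypes)
```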

### Step 5

5. Run a check for missing values in your df_ords dataframe.
    * In a markdown cell, report your findings and propose an explanation for any missing values you find.

In [15]:
# List a count of null values in each column
df_ords.isnull().sum()

order_id                       0
user_id                        0
order_number                   0
orders_day_of_week             0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

The 'days_since_prior_order' column has 206,209 missing values. These most likely mark each customer's first order, which has no prior order to count days from. When I examined the numerical summary of the orders csv file before converting user_id to 'str', the max value in that field was also 206,209, which is consistent with one missing value per customer if user IDs run from 1 to 206,209. It makes sense that the first order a customer made would have an empty value in this column.
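
One way to test this explanation, sketched on a made-up frame, is to check that every missing value falls on an order with order_number 1 and that there is exactly one per customer. The `toy` rows and variable names are hypothetical, and this check was not part of the original cells.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_ords; values are invented for illustration only.
toy = pd.DataFrame({
    'user_id': ['1', '1', '1', '2', '2'],
    'order_number': [1, 2, 3, 1, 2],
    'days_since_prior_order': [np.nan, 15.0, 21.0, np.nan, 7.0],
})

missing = toy['days_since_prior_order'].isnull()

# If the explanation holds, every missing value sits on a first order...
missing_only_on_first_order = (toy.loc[missing, 'order_number'] == 1).all()

# ...and there is exactly one missing value per customer.
one_per_customer = missing.sum() == toy['user_id'].nunique()

print(missing_only_on_first_order, one_per_customer)
```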

### Step 6

6. Address the missing values using an appropriate method.
    * In a markdown cell, explain why you used your method of choice.

In [16]:
# Create a new flag column that is True where days_since_prior_order is null
# Note: assign() returns a new dataframe; df_ords itself is not modified here
df_ords.assign(new_customer=lambda x: x.days_since_prior_order.isnull())

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,new_customer
0,2539329,1,1,2,8,,True
1,2398795,1,2,3,7,15.0,False
2,473747,1,3,3,12,21.0,False
3,2254736,1,4,4,7,29.0,False
4,431534,1,5,4,15,28.0,False
...,...,...,...,...,...,...,...
3421078,2266710,206209,10,5,18,29.0,False
3421079,1854736,206209,11,4,10,30.0,False
3421080,626363,206209,12,1,12,18.0,False
3421081,2977660,206209,13,1,12,7.0,False


I elected to create a new variable that acts as a flag for the missing value. The new column is called 'new_customer': any record with a missing 'days_since_prior_order' value is labeled True, and all other records are labeled False.

I used this method because it is the clearest indicator of what the missing value means, and it explains why the value is missing.
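
Because `assign()` returns a new dataframe, the flag only persists if the result is stored. A minimal sketch on a made-up frame (the `toy` rows are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_ords; values are invented for illustration only.
toy = pd.DataFrame({
    'order_number': [1, 2, 1],
    'days_since_prior_order': [np.nan, 15.0, np.nan],
})

# Store the result of assign() so the flag column is kept.
toy = toy.assign(new_customer=lambda x: x['days_since_prior_order'].isnull())

print(toy)
```

Storing the result this way means the flag would also be included in any later export of the dataframe.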

### Steps 7 & 8

7. Run a check for duplicate values in your df_ords data.
    * In a markdown cell, report your findings and propose an explanation for any duplicate values you find.

In [17]:
df_ords_dups = df_ords[df_ords.duplicated()]
df_ords_dups


Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order


No duplicates were found in the orders dataframe.

### Step 9

9. Export your final, cleaned df_prods and df_ords data as “.csv” files in your “Prepared Data” folder and give them appropriate, succinct names.


In [18]:
# export checked data to new orders csv file in Prepared Data Folder
df_ords.to_csv(os.path.join(path, '02 Data','Prepared Data', 'orders_checked.csv'))

# export checked data to new products csv file in Prepared Data Folder
df_prods_clean_no_dups.to_csv(os.path.join(path, '02 Data','Prepared Data', 'products_checked.csv'))
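
Exporting with the default settings writes the index as an unnamed first column, which is where 'Unnamed: 0' comes from on the next import. The sketch below shows the round trip on a made-up frame; `index=False` is offered as one option and is not what the cell above does.

```python
import io
import pandas as pd

# Toy frame; values are invented for illustration only.
toy = pd.DataFrame({'order_id': ['10', '11'], 'order_number': [1, 2]})

# Default export writes the index as an unnamed first column.
with_index = toy.to_csv()
reread = pd.read_csv(io.StringIO(with_index))

# index=False leaves the index out of the file.
without_index = toy.to_csv(index=False)
reread_clean = pd.read_csv(io.StringIO(without_index))

print(list(reread.columns))
print(list(reread_clean.columns))
```

Either approach works as long as the import matches the export: keep the index and read it back with index_col=0, or leave it out with index=False.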