# 4.5. IC_ Data Consistency Checks

# 01. Importing Libraries

In [2]:
# Import libraries: pandas, NumPy and os.

import pandas as pd
import numpy as np
import os

# 02. Importing Data Files

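A Python shortcut for importing data files using the `os.path.join()` function: store the path to the main project folder in a variable, then join the subfolder and file names onto it.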

path = r'/folderpath_to main project folder/'

df = pd.read_csv(os.path.join(path,'folderpath','name.csv'), index_col = False)
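
As a small illustration of what `os.path.join()` does with those pieces, here is a minimal sketch. The folder and file names are placeholders chosen for the example, not the real project layout.

```python
import os

# Hypothetical folder and file names, used only for illustration
base = os.path.join('project_root', '02 Data')
full_path = os.path.join(base, 'Prepared Data', 'example.csv')

# os.path.join inserts the path separator that is correct for the current OS
print(full_path)
```

Because the separator is added automatically, the same line of code should work unchanged on macOS, Linux and Windows.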


In [3]:
# folder path to my main project folder is now stored within variable 'path'

path = r'/Users/pau/06-05-2024 Instacart Basket Analysis'

#### Importing the “orders_wrangled.csv” data set into my Jupyter notebook using the os library as df_ords

In [4]:
# Use os.path.join() to build the file path and load the orders data into the dataframe df_ords

df_ords = pd.read_csv(os.path.join(path,'02 Data','Prepared Data','orders_wrangled.csv'), index_col = False)

#### Importing the “products.csv” data set into my Jupyter notebook using the os library as df_prods

In [5]:
# Use os.path.join() to build the file path and load the products data into the dataframe df_prods

df_prods = pd.read_csv(os.path.join(path,'02 Data', 'Original Data', 'products.csv'), index_col = False)

#### Checking that the dataframes are correctly loaded

In [6]:
# Checking "orders_wrangled.csv" data is correctly loaded
print(df_ords.head())
print(df_ords.info())
print(df_ords.shape)

   Unnamed: 0  order_id  user_id  order_number  orders_day_of_week  \
0           0   2539329        1             1                   2   
1           1   2398795        1             2                   3   
2           2    473747        1             3                   3   
3           3   2254736        1             4                   4   
4           4    431534        1             5                   4   

   order_hour_of_day  days_since_prior_order  
0                  8                     NaN  
1                  7                    15.0  
2                 12                    21.0  
3                  7                    29.0  
4                 15                    28.0  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 7 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   Unnamed: 0              int64  
 1   order_id                int64  
 2   user_id                 int64  


In [7]:
# Checking "products.csv" data is correctly loaded
print(df_prods.head())
print(df_prods.info())
print(df_prods.shape)

   product_id                                       product_name  aisle_id  \
0           1                         Chocolate Sandwich Cookies        61   
1           2                                   All-Seasons Salt       104   
2           3               Robust Golden Unsweetened Oolong Tea        94   
3           4  Smart Ones Classic Favorites Mini Rigatoni Wit...        38   
4           5                          Green Chile Anytime Sauce         5   

   department_id  prices  
0             19     5.8  
1             13     9.3  
2              7     4.5  
3              1    10.5  
4             13     4.3  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49693 entries, 0 to 49692
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   product_id     49693 non-null  int64  
 1   product_name   49677 non-null  object 
 2   aisle_id       49693 non-null  int64  
 3   department_id  49693 non-null  int64

# 03. Data Consistency Checks

### df.describe() function

**The `df.describe()` function returns descriptive statistics for the numeric values in your dataframe.**

- Using these results, you can begin investigating the accuracy of the columns in your dataframe.
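
A minimal sketch of how `describe()` output can be read, using a made-up dataframe (the column names and values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Toy data (made up): one missing value in the 'days' column
toy = pd.DataFrame({
    'hour': [8, 7, 12, 15],
    'days': [np.nan, 15.0, 21.0, 28.0],
})

stats = toy.describe()

# 'count' ignores NaN, so a lower count in one column hints at missing values
print(stats.loc['count'])

# 'min' and 'max' make out-of-range values easy to spot
print(stats.loc[['min', 'max']])
```

Comparing the `count` row across columns is usually the quickest first check, followed by a look at `min` and `max`.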


In [8]:
# Descriptive statistics for the numeric columns of the df_ords dataframe

df_ords.describe()

| | Unnamed: 0 | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| count | 3421083.0 | 3421083.0 | 3421083.0 | 3421083.0 | 3421083.0 | 3421083.0 | 3214874.0 |
| mean | 1710541.0 | 1710542.0 | 102978.2 | 17.15486 | 2.776219 | 13.45202 | 11.11484 |
| std | 987581.7 | 987581.7 | 59533.72 | 17.73316 | 2.046829 | 4.226088 | 9.206737 |
| min | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 855270.5 | 855271.5 | 51394.0 | 5.0 | 1.0 | 10.0 | 4.0 |
| 50% | 1710541.0 | 1710542.0 | 102689.0 | 11.0 | 3.0 | 13.0 | 7.0 |
| 75% | 2565812.0 | 2565812.0 | 154385.0 | 23.0 | 5.0 | 16.0 | 15.0 |
| max | 3421082.0 | 3421083.0 | 206209.0 | 100.0 | 6.0 | 23.0 | 30.0 |


### Common data consistency checks
Some of the most common checks to perform on data to confirm its consistency include:

- **Finding and addressing mixed data types**
- **Finding and addressing missing values**
- **Finding and addressing duplicate records**

# 04. Mixed-Type Data

- A mixed-type column is a column that contains more than one data type, for example both string values and numeric values.
- Mixed types are a common reason for changing the data type of a column.
- Ex.: a column of names (string format) in which missing values were marked with a `0` (numeric format).
- When you import a data set into pandas, it infers the data type of every column from the values it finds. A column that mixes types usually can’t be given one specific data type, so pandas typically stores it with the generic `object` dtype instead.
- Always check for these mixed-type columns before moving forward with any analytical work, as they can break functions and generally cause problems in your procedures.
- Data prep and data analysis should be two separate stages in any data project.
    - This holds true for your scripts as well: *analysis* code should never be interspersed with data *prep* code.

- **Your Instacart data has already undergone all these data-prep checks, and you know there aren’t any mixed-type columns.**
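
To make the names example above concrete, here is a small sketch with a made-up column. It shows the generic dtype pandas reports for such a column and one way to count the Python types hiding inside it.

```python
import pandas as pd

# Made-up column: names, with a numeric 0 standing in for a missing name
names = pd.Series(['Ana', 'Ben', 0, 'Cleo'], name='name')

# A column mixing strings and numbers is stored with the generic 'object' dtype
print(names.dtype)

# Counting the Python type of each element exposes the mix
type_counts = names.map(type).value_counts().to_dict()
print(type_counts)
```

A column that reports more than one type in this count is a candidate for conversion to a single type.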

### To practice fixing mixed-type columns, let’s create a small test dataframe to work with.

In [9]:
# creating a new dataframe called df_test

df_test = pd.DataFrame()

In [10]:
# creating a mixed type column: 
# creates a new column, mix, within df_test and fills it with numeric, string, and boolean values

df_test['mix'] = ['a', 'b', 1, True]

In [11]:
# Checking the new mixed-typed column

df_test.head()

| | mix |
|---|---|
| 0 | a |
| 1 | b |
| 2 | 1 |
| 3 | True |


In [12]:
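# Loop that checks whether a dataframe contains any mixed-type columns
# (DataFrame.map requires pandas 2.1 or later; older versions use applymap)

for col in df_test.columns.tolist():
    # True for every row whose data type differs from the type of the first row
    weird = (df_test[[col]].map(type) != df_test[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_test[weird]) > 0:
        print(col)

# The structure used in this code is a for-loop
# != means "not equal"
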
mix


- The structure used in this code is called a “for-loop.”
    - The “for” in for-loop stands for “for these elements, do this,”
    - and the “loop” describes how the structure works: it repeats the same procedure once for each element.
    - Here, the for-loop goes through each column in the dataframe and executes the same block of code each time.
    - Within the for-loop,
        - a new variable is created: `weird`. Assigned to it is a test that compares the data type of every value in the column with the data type of the first value.
            - `weird` holds one boolean value (`True` or `False`) per row.
                - `True` means the value in that row has a different data type from the first value in the column.
                - `False` means the value has the same data type as the first value.
                - Boolean values can also be represented by numbers: 0 as `False` and 1 as `True`.
    - Then comes the “if” statement.
        - An if statement checks whether some condition is met and, if it is, executes a block of code. If the condition isn’t met, the code isn’t executed.
            - Here, `df_test[weird]` keeps only the rows where `weird` is `True`, and the if statement checks whether the number of those rows is greater than 0.
                - If it is greater than 0, then the condition is true and the column contains inconsistent data types. If not, the column contains only one data type.
                - When the condition is true, the command `print(col)` is executed, which prints the problematic column for you to see.
                - Because of the for-loop, this check is run on every column in your dataframe, printing every mixed-type column it finds.
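
The same idea can be wrapped in a reusable function. The sketch below is one possible formulation, not the exercise’s own code: `find_mixed_columns` is a hypothetical helper name, and it uses `Series.map` on one column at a time instead of `DataFrame.map`.

```python
import pandas as pd

def find_mixed_columns(df):
    """Return the names of columns whose values are not all the same Python type.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    mixed = []
    for col in df.columns:
        # Python type of every value in the column
        types = df[col].map(type)
        first_type = types.iloc[0]
        # True for each row whose type differs from the type of the first row
        weird = types.map(lambda t: t is not first_type)
        if weird.any():
            mixed.append(col)
    return mixed

# Made-up dataframe: 'mix' combines str, int and bool; 'nums' is purely numeric
demo = pd.DataFrame({'mix': ['a', 'b', 1, True], 'nums': [1, 2, 3, 4]})
mixed_cols = find_mixed_columns(demo)
print(mixed_cols)
```

Returning a list rather than printing makes the result easy to reuse, for example to loop over the flagged columns and convert them.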

### How to fix mixed-type columns
First, decide what single data type the column in question should be, usually based on the most frequent data type in the column.

In [13]:
# Change "mix" column data type to string

df_test['mix'] = df_test['mix'].astype('str')

In [14]:
# Check the results of the change
df_test['mix'].dtype

dtype('O')
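
Since the reported dtype alone does not show whether the conversion worked, a quick follow-up check is to look at the type of each element. This is a sketch on a made-up series that mirrors the `mix` column:

```python
import pandas as pd

# Made-up series with the same values as the 'mix' test column
col = pd.Series(['a', 'b', 1, True])

# Convert every value to its string form
as_text = col.astype('str')

# Collect the distinct Python types of the converted elements
element_types = set(as_text.map(type))
print(element_types)

# The number 1 and the boolean True are now the strings '1' and 'True'
print(as_text.tolist())
```

If the set contains only `str`, every value in the column now has the same type.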

# 04. Missing Values

- Missing values can occur for two reasons:
    - 1) data corruption, or
    - 2) they were never recorded in the first place.
- Important: investigate and address any missing values in your data when conducting an analysis in Python.
- They can break your functions and throw errors in your analytical procedures.
- This is especially important when deriving or creating new variables.



## How you can find and fix missing values in Python

### Finding Missing Values

#### df.isnull().sum()

- Apply the `isnull()` function to the `df_prods` dataframe, then sum the result with the attached `sum()` function.
- The `isnull()` function
    - is used to find missing observations/entries in your dataframe (comparable to empty cells in Excel).
    - If you were to use the `isnull()` function by itself,
        - it would return `True` or `False` for every single entry, which, by itself, isn’t very helpful.
    - You need to know how many total missing observations there are, which is where the `sum()` function comes in.
        - `True` values can also be interpreted numerically as 1,
        - and `False` values can also be interpreted numerically as 0.
        - Because every missing observation counts as 1, adding them up with the `sum()` function gives the total number of missing observations in each column.
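
A minimal sketch of the two steps on a made-up dataframe (the values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy data (made up): two missing names and one missing price
toy = pd.DataFrame({
    'product_name': ['Salt', None, 'Tea', np.nan],
    'prices': [9.3, 4.5, np.nan, 1.2],
})

# isnull() alone returns a True/False mask with the same shape as the dataframe
mask = toy.isnull()
print(mask)

# Summing the mask counts the True values in each column
missing_per_column = mask.sum()
print(missing_per_column)
```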


In [15]:
# Looking for missing values in df_prods dataframe

df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

The only column with missing values is the "product_name" column, and it’s missing 16 values.

- To view these 16 rows, you can create a subset of the df containing only the rows in question.
- Create a new dataframe, `df_nan`, containing only those rows whose "product_name" value meets the condition `isnull() == True`.

In [16]:
# Create a subset of df_prods called "df_nan" that contains only the missing values from the "product_name" column

df_nan = df_prods[df_prods['product_name'].isnull() == True]

In [17]:
df_nan

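| | product_id | product_name | aisle_id | department_id | prices |
|---|---|---|---|---|---|
| 33 | 34 | NaN | 121 | 14 | 12.2 |
| 68 | 69 | NaN | 26 | 7 | 11.8 |
| 115 | 116 | NaN | 93 | 3 | 10.8 |
| 261 | 262 | NaN | 110 | 13 | 12.1 |
| 525 | 525 | NaN | 109 | 11 | 1.2 |
| 1511 | 1511 | NaN | 84 | 16 | 14.3 |
| 1780 | 1780 | NaN | 126 | 11 | 12.3 |
| 2240 | 2240 | NaN | 52 | 1 | 14.2 |
| 2586 | 2586 | NaN | 104 | 13 | 12.4 |
| 3159 | 3159 | NaN | 126 | 11 | 13.1 |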


### Addressing Missing Values

Ways to deal with missing data:

**1. Create a new variable that acts like a flag based on the missing value.**
- Useful when missing values hold just as much importance as the non-missing values.
- Sol.: create a new column that marks each row as missing or not missing (for example, with the string values A or B).

**2. Impute the value with the mean or median of the column (if the variable is numeric).**
- MEAN is a statistical measure that can be greatly influenced by extreme values.
    - Use the **df.describe() function** to find the mean of the column in question.
    - Code to replace missing values with the mean (assigning the result back to the column, since calling `fillna(..., inplace=True)` on a selected column may not update the dataframe in recent pandas versions):
         - **df['column with missings'] = df['column with missings'].fillna(mean value)**
- MEDIAN is less affected by extreme values.
    - Find it using the **df.median() function**.
    - Code to replace missing values with the median (same command, but with the median):
      - **df['column with missings'] = df['column with missings'].fillna(median value)**
- An alternative way to impute missing values is **linear interpolation** (“connecting two points with a line”). It is mostly used for missing data in time series: the missing value is estimated from the known values before and after it, by placing it on the straight line between those two points.
- String values can’t be imputed like numeric values.
    
**3. Remove or filter out the missing data.**
- When the missing values are strings, you can
    - 1. either **remove the missing values entirely**
    - 2. or **filter the non-missing values into a subset dataframe** and continue your analysis with this new dataframe.
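
The three approaches can be sketched side by side on a made-up numeric column. The column names (`was_missing`, `value_median`, `value_interp`) are hypothetical, and the results are assigned to new columns rather than written back with `inplace`.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one missing value between 1.0 and 3.0
toy = pd.DataFrame({'value': [1.0, np.nan, 3.0, 8.0]})

# 1. Flag: record which rows were missing before changing anything
toy['was_missing'] = toy['value'].isnull()

# 2. Impute with the median of the non-missing values
median_value = toy['value'].median()
toy['value_median'] = toy['value'].fillna(median_value)

# 2b. Linear interpolation: place the missing value on the straight line
#     between its neighbouring known values
toy['value_interp'] = toy['value'].interpolate()

# 3. Filter: keep only the rows where the value is present
present = toy[toy['value'].notnull()]

print(toy)
print(present.shape)
```

Keeping the original column untouched makes it easy to compare the strategies before committing to one.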
    





#### df_nan: 
- The missing values are strings, so imputation is not possible. We will instead **create a new dataframe that excludes the missing values**.


#### creating a new dataframe that excludes the missing values.


- First check the **"df_prods.shape"** in order to later compare the number of rows in the original dataframe with the number in the new subset once the missing rows have been removed

In [18]:
# Check the rows in "df_prods"
df_prods.shape

(49693, 5)

- Create a new dataframe called `df_prods_clean`.
    - Use the same line of code as above, when you created the `df_nan` dataframe:
      - **df_nan = df_prods[df_prods['product_name'].isnull() == True]**
    - This time, set the `isnull()` condition to `False` instead of `True`, because you want the *non-missing* values in your new dataframe as opposed to the *missing* values:
      - **df_prods_clean = df_prods[df_prods['product_name'].isnull() == False]**

In [19]:
# Creating a new dataframe called df_prods_clean

df_prods_clean = df_prods[df_prods['product_name'].isnull() == False]

- Run **df_prods_clean.shape** to check that the number of rows has decreased.
- The new dataframe should have exactly 16 fewer rows than the original dataframe (the same as the number of missing values).


In [20]:
# Checking the rows in the new subset (it should be 16 rows less than the original df_prods)

df_prods_clean.shape

(49677, 5)

#### **Another way you can drop all missing values** is via the following command:

```python
df_prods.dropna(inplace = True)
```

#### If you wanted to use this command to drop only the NaNs from a particular column, the code would look like this:

```python
df_prods.dropna(subset = ['product_name'], inplace = True)
```

- In both cases, **rather than creating an entirely new dataframe, you’re overwriting `df_prods` with a new version of `df_prods` that doesn’t contain the missing values.**
    - This is done by way of the `inplace = True` argument, which modifies the original dataframe.
    - If you don’t specify an `inplace` argument in your code, the function will take the default setting, which is `inplace = False`.
        - With `inplace = False`, the command returns a new dataframe with the rows dropped and leaves the original dataframe untouched.
- As mentioned before, overwriting can be risky. Unless you’re absolutely sure it’s safe to drop the values in question, you should create a new dataframe instead.
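
A short sketch of the safer, non-overwriting use of `dropna()` on a made-up dataframe (the values and variable names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy data (made up): one missing name and one missing price
toy = pd.DataFrame({'product_name': ['Salt', np.nan, 'Tea'],
                    'prices': [9.3, 4.5, np.nan]})

# Default (inplace=False): a new dataframe is returned and 'toy' keeps all rows
only_named = toy.dropna(subset=['product_name'])

# Without 'subset', any row with at least one NaN in any column is dropped
complete_rows = toy.dropna()

print(len(toy), len(only_named), len(complete_rows))
```

Assigning the result to a new name keeps the original dataframe available for comparison.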

# 05. Duplicates

- Duplicates are a common occurrence when working with data.
- They need to be handled with care and investigated thoroughly.
- It’s important to know what kind of duplicates exist in your data.
    - Oftentimes, you’ll need clarification from your client before proceeding with any data manipulation.

### Finding Duplicates

- You’ll want to look for full duplicates: multiple rows that have the exact same values in every column.
- This is because a value that repeats in a single column (for example, two products in the same aisle) isn’t actually an inconsistency in your data.

The following command looks for full duplicates within your dataframe:

```python
df_dups = df_prods_clean[df_prods_clean.duplicated()]
```

- This code creates a new subset of `df_prods_clean`, called `df_dups`, containing *only* rows that are duplicates.
    - The `duplicated()` function is what identifies duplicate rows.
    - It’s run on the `df_prods_clean` dataframe.
    - Any duplicate rows that it finds are saved within the new `df_dups` dataframe.
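
One detail worth seeing on a small example is which copies `duplicated()` marks. The sketch below uses a made-up dataframe. By default the first occurrence of a row is not marked, while `keep=False` marks every copy.

```python
import pandas as pd

# Toy data (made up): rows 0 and 2 are identical, row 3 differs only in price
toy = pd.DataFrame({'product_name': ['Tea', 'Salt', 'Tea', 'Tea'],
                    'prices': [4.5, 9.3, 4.5, 5.0]})

# Default keep='first': only the second and later copies are marked
later_copies = toy[toy.duplicated()]

# keep=False: every row that has an identical twin is marked
all_copies = toy[toy.duplicated(keep=False)]

print(later_copies.index.tolist())
print(all_copies.index.tolist())
```

Row 3 is not flagged in either case, because it matches the others in only one column.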

In [21]:
# Checking for duplicates in df_prods_clean by creating a new subset that contains only duplicates

df_dups = df_prods_clean[df_prods_clean.duplicated()]

In [23]:
# calling the df_dups dataframe. 
# This will display all the duplicate rows within the dataframe df_prods_clean

df_dups

| | product_id | product_name | aisle_id | department_id | prices |
|---|---|---|---|---|---|
| 462 | 462 | Fiber 4g Gummy Dietary Supplement | 70 | 11 | 4.8 |
| 18459 | 18458 | Ranger IPA | 27 | 5 | 9.2 |
| 26810 | 26808 | Black House Coffee Roasty Stout Beer | 27 | 5 | 13.4 |
| 35309 | 35306 | Gluten Free Organic Peanut Butter & Chocolate ... | 121 | 14 | 6.8 |
| 35495 | 35491 | Adore Forever Body Wash | 127 | 11 | 9.9 |


### Addressing Duplicates

1. Check the current number of rows in your df_prods_clean dataframe so that you can compare it with the number after removing the duplicates.

2. Next, create a new dataframe that doesn’t include the duplicates you just identified, using the drop_duplicates() function:

#### df.drop_duplicates()

- Returns a new dataframe with the duplicate rows removed (the first occurrence of each row is kept).


In [25]:
# Check the number of rows in "df_prods_clean" before removing the duplicates

df_prods_clean.shape

(49677, 5)

In [26]:
# Create a new dataframe that doesn't include the duplicates

df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

This command has created a new dataframe: **df_prods_clean_no_dups** that contains only the unique rows from df_prods_clean.

In [27]:
# Check the number of rows in the new dataframe (it should be 5 less than the original df_prods_clean)
df_prods_clean_no_dups.shape

(49672, 5)

Now we have 49,672 rows in our dataframe. The five duplicates have been successfully removed.
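
The before-and-after row comparison can also be written as an explicit check. This is a sketch on made-up data. It also shows that `drop_duplicates()` keeps the original index labels, which is why the index of the cleaned dataframe can end at a number higher than its row count.

```python
import pandas as pd

# Toy data (made up): rows 1 and 2 are identical
toy = pd.DataFrame({'product_id': [1, 2, 2, 3],
                    'product_name': ['Tea', 'Salt', 'Salt', 'Stout']})

n_dups = int(toy.duplicated().sum())
no_dups = toy.drop_duplicates()

# The number of rows removed should equal the number of rows flagged
rows_removed = len(toy) - len(no_dups)
print(n_dups, rows_removed)

# The surviving rows keep their old index labels, leaving a gap
print(no_dups.index.tolist())

# Optional: renumber the rows from 0 if a continuous index is preferred
tidy = no_dups.reset_index(drop=True)
print(tidy.index.tolist())
```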

# 06. Tidying Up and Exporting Changes:

## Exporting the new cleaned Products dataframe

In [28]:
# Perform a final check of the dataframe before exporting
print(df_prods_clean_no_dups.head())
print(df_prods_clean_no_dups.info())
print(df_prods_clean_no_dups.shape)

   product_id                                       product_name  aisle_id  \
0           1                         Chocolate Sandwich Cookies        61   
1           2                                   All-Seasons Salt       104   
2           3               Robust Golden Unsweetened Oolong Tea        94   
3           4  Smart Ones Classic Favorites Mini Rigatoni Wit...        38   
4           5                          Green Chile Anytime Sauce         5   

   department_id  prices  
0             19     5.8  
1             13     9.3  
2              7     4.5  
3              1    10.5  
4             13     4.3  
<class 'pandas.core.frame.DataFrame'>
Index: 49672 entries, 0 to 49692
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   product_id     49672 non-null  int64  
 1   product_name   49672 non-null  object 
 2   aisle_id       49672 non-null  int64  
 3   department_id  49672 non-null  int64  
 4

In [29]:
df_prods_clean_no_dups.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'products_checked.csv'))

# ______________________________________________


# Task 4.5
 

## 01.  Data Consistency Checks on the "df_prods" dataframe

The data consistency checks on "df_prods" were performed during the exercise above, and the cleaned Products dataframe was exported as 'products_checked.csv' to the “Prepared Data” folder.

## Data Consistency Checks on the "df_ords" dataframe


## 02. Run the df.describe() function on the "df_ords" dataframe.

In [30]:
# Start by running the "df.describe()" function

df_ords.describe()

| | Unnamed: 0 | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| count | 3421083.0 | 3421083.0 | 3421083.0 | 3421083.0 | 3421083.0 | 3421083.0 | 3214874.0 |
| mean | 1710541.0 | 1710542.0 | 102978.2 | 17.15486 | 2.776219 | 13.45202 | 11.11484 |
| std | 987581.7 | 987581.7 | 59533.72 | 17.73316 | 2.046829 | 4.226088 | 9.206737 |
| min | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 855270.5 | 855271.5 | 51394.0 | 5.0 | 1.0 | 10.0 | 4.0 |
| 50% | 1710541.0 | 1710542.0 | 102689.0 | 11.0 | 3.0 | 13.0 | 7.0 |
| 75% | 2565812.0 | 2565812.0 | 154385.0 | 23.0 | 5.0 | 16.0 | 15.0 |
| max | 3421082.0 | 3421083.0 | 206209.0 | 100.0 | 6.0 | 23.0 | 30.0 |


**Output**

The df.describe() function returns descriptive statistics for the numeric values in the dataframe.
Using these results, we can begin investigating the accuracy of the columns in the df_ords  dataframe

#### Order Number: 
- The maximum order number is 100, suggesting that only up to 100 orders per customer are kept in the data.
#### Order Day of the Week: 
- Ranges from 0 to 6, correctly representing a full week.
#### Order Hour of Day: 
- Ranges from 0 to 23, representing each hour of the day.
#### Days Since Prior Order: 
- The count (3,214,874) is lower than in the other columns (3,421,083), which points to 206,209 missing values.
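
The range observations above were read off the `describe()` output by eye. As a sketch of how the same checks could be automated, the code below counts out-of-range values per column on a made-up dataframe. The `expected` dictionary and the deliberately invalid day value 7 are assumptions for illustration.

```python
import pandas as pd

# Toy data (made up); the day value 7 is deliberately invalid
toy = pd.DataFrame({'orders_day_of_week': [2, 3, 6, 7],
                    'order_hour_of_day': [8, 23, 0, 12]})

# Expected inclusive ranges: days 0 to 6, hours 0 to 23
expected = {'orders_day_of_week': (0, 6), 'order_hour_of_day': (0, 23)}

out_of_range = {}
for col, (low, high) in expected.items():
    # between() is inclusive on both ends by default
    bad = ~toy[col].between(low, high)
    out_of_range[col] = int(bad.sum())

print(out_of_range)
```

A count of 0 for every column would mean all values fall inside the expected ranges.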

## 03. Check for mixed-type data in your df_ords dataframe

In [34]:
# Check "df_ords" for any mixed-type columns with the exercise function

for col in df_ords.columns.tolist():
    weird = (df_ords[[col]].map(type) != df_ords[[col]].iloc[0].apply(type)).any(axis = 1)
    if len (df_ords[weird]) > 0:
        print (col)

In [33]:
# Double-checking with the same test, this time printing an explicit message for every column

for col in df_ords.columns.tolist():
    weird = (df_ords[[col]].map(type) != df_ords[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_ords[weird]) > 0:
        print(f"Mixed-type data found in column: {col}")
    else:
        print(f"No mixed-type data in column: {col}")

No mixed-type data in column: Unnamed: 0
No mixed-type data in column: order_id
No mixed-type data in column: user_id
No mixed-type data in column: order_number
No mixed-type data in column: orders_day_of_week
No mixed-type data in column: order_hour_of_day
No mixed-type data in column: days_since_prior_order


#### Output: "df_ords" appears to have no columns with mixed-type data to correct


## 05. Run a check for missing values in your df_ords dataframe

#### 1. To find missing values we are going to use the function:
df.isnull().sum()

In [35]:
# Look for missing values in df_ords dataframe

df_ords.isnull().sum()

Unnamed: 0                     0
order_id                       0
user_id                        0
order_number                   0
orders_day_of_week             0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

#### 2. Interpreting the results:

#### The "days_since_prior_order" column has missing values as was suspected after running the "df_ords.describe()" function above.

- Since "days_since_prior_order" has a MAX value of 30, the missing values could represent customers who hadn’t placed an order within the previous 30 days.
- Another explanation is that the missing values mark orders with no earlier order to count from, i.e. customers who had not placed any order before this one.

#### 3. Examining the missing values:

- To view these 206,209 rows, we can create a subset of the df containing only the rows in question.
- Create a new dataframe, `df_ords_nan`, containing only those rows whose "days_since_prior_order" value meets the condition `isnull() == True`.


In [36]:
# To more closely examine the rows with missing values, create a subset containing only those rows

df_ords_nan = df_ords[df_ords['days_since_prior_order'].isnull() == True]

In [37]:
df_ords_nan

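| | Unnamed: 0 | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 2539329 | 1 | 1 | 2 | 8 | NaN |
| 11 | 11 | 2168274 | 2 | 1 | 2 | 11 | NaN |
| 26 | 26 | 1374495 | 3 | 1 | 1 | 14 | NaN |
| 39 | 39 | 3343014 | 4 | 1 | 6 | 11 | NaN |
| 45 | 45 | 2717275 | 5 | 1 | 3 | 12 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3420930 | 3420930 | 969311 | 206205 | 1 | 4 | 12 | NaN |
| 3420934 | 3420934 | 3189322 | 206206 | 1 | 3 | 18 | NaN |
| 3421002 | 3421002 | 2166133 | 206207 | 1 | 6 | 19 | NaN |
| 3421019 | 3421019 | 2227043 | 206208 | 1 | 1 | 15 | NaN |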


#### Output

#### It appears that all the "days_since_prior_order" NaNs occur when the "order_number" is 1. 
This could mean that when a customer places a first order, the system has no prior order to reference and therefore fills this column with "NaN".

## 06. Address the missing values using an appropriate method

#### To address the missing values in "days_since_prior_order" (assuming all NaNs correspond to an "order_number" of 1), I will create a new column that flags first-time orders, then count the flags.

In [38]:
# Creating a new column to flag first orders with either 1 (it is a first time order) or 0 (it is not a first time order)

df_ords['is_first_order'] = df_ords['days_since_prior_order'].isnull().astype(int)

In [44]:
# Counting the number of first time orders

print(df_ords['is_first_order'].sum())

206209


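#### The flag count (206,209) equals the number of "NaN"s, as expected, since the flag is built from them.
It also equals the maximum "user_id" reported by describe() (206,209). Assuming user IDs run from 1 without gaps, this is consistent with one first-time order per customer, each given "NaN" in the "days_since_prior_order" column.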
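
The assumption can also be tested directly by comparing the rows where "days_since_prior_order" is missing with the rows where "order_number" is 1. The sketch below does this on a made-up dataframe. The same comparison could be run on `df_ords`, but no result for the real data is claimed here.

```python
import numpy as np
import pandas as pd

# Toy data (made up): two customers, NaN only on each customer's first order
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2],
    'order_number': [1, 2, 1, 2, 3],
    'days_since_prior_order': [np.nan, 15.0, np.nan, 7.0, 30.0],
})

is_nan = toy['days_since_prior_order'].isnull()
is_first = toy['order_number'] == 1

# True only if the two masks agree on every row
nan_matches_first = bool((is_nan == is_first).all())
print(nan_matches_first)
```

Unlike counting the flags, this comparison uses "order_number" as an independent source, so it can actually fail if the assumption is wrong.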

In [45]:
# Checking the top rows to confirm the new "flag" column

df_ords.head()

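| | Unnamed: 0 | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | is_first_order |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2539329 | 1 | 1 | 2 | 8 | NaN | 1 |
| 1 | 1 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | 0 |
| 2 | 2 | 473747 | 1 | 3 | 3 | 12 | 21.0 | 0 |
| 3 | 3 | 2254736 | 1 | 4 | 4 | 7 | 29.0 | 0 |
| 4 | 4 | 431534 | 1 | 5 | 4 | 15 | 28.0 | 0 |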


## 07. Run a check for duplicate values in your df_ords data

In [47]:
# Checking for duplicates in df_ords by creating a new subset that contains only duplicates

df_ords_dups = df_ords[df_ords.duplicated()]

In [48]:
# calling the df_ords_dups dataframe. 
# This will display all the duplicate rows within the dataframe df_ords

df_ords_dups

| | Unnamed: 0 | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | is_first_order |
|---|---|---|---|---|---|---|---|---|


#### Output: No duplicate rows were found in the orders dataframe.

## 09. Export your final, cleaned df_prods and df_ords data as “.csv” files in your “Prepared Data” folder

#### Export of the final, cleaned "df_prods" data was already performed above. New csv file created "products_checked.csv"

In [50]:
# Perform a final check of the dataframe before exporting
print(df_ords.head())
print(df_ords.info())
print(df_ords.shape)

   Unnamed: 0  order_id  user_id  order_number  orders_day_of_week  \
0           0   2539329        1             1                   2   
1           1   2398795        1             2                   3   
2           2    473747        1             3                   3   
3           3   2254736        1             4                   4   
4           4    431534        1             5                   4   

   order_hour_of_day  days_since_prior_order  is_first_order  
0                  8                     NaN               1  
1                  7                    15.0               0  
2                 12                    21.0               0  
3                  7                    29.0               0  
4                 15                    28.0               0  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 8 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   Unnamed: 0

In [51]:
# Exporting the final cleaned "df_ords" data as "orders_checked.csv"

df_ords.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_checked.csv'))
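
A note on the "Unnamed: 0" column seen in `df_ords`: a column with that name typically appears when a CSV file was written with its index and then read back. The sketch below shows the effect on a made-up dataframe, using temporary files with hypothetical names. Passing `index=False` to `to_csv()` is one way to avoid it.

```python
import os
import tempfile

import pandas as pd

# Toy data (made up)
toy = pd.DataFrame({'order_id': [10, 11], 'user_id': [1, 1]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, 'with_index.csv')
    without_index = os.path.join(tmp, 'without_index.csv')

    # Default: the index is written as an extra, unnamed first column
    toy.to_csv(with_index)
    # index=False: only the data columns are written
    toy.to_csv(without_index, index=False)

    cols_with = pd.read_csv(with_index).columns.tolist()
    cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```

Whether to keep the index in the exported file is a design choice. It matters mainly when the index carries information that is not stored in any column.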