# Table of Contents


1) Import Libraries and Data

2) Describe products data

3) Check for missing data in products

4) Check for duplicate data in products

5) Check for mixed data types in orders data

6) Check for missing values in orders data

7) Check for duplicate data in orders

8) Export data

### 1) Import Libraries and Data

In [1]:
#import libraries
import pandas as pd
import numpy as np
import os

In [2]:
#set path to data folder
path = r'C:\Users\Owner\Documents\Career Foundry\Tasks\Data Immersion Tasks\Instacart Project\2 Data'

In [3]:
#import original product data as df_prods and wrangled orders data as df_ords
df_prods = pd.read_csv(os.path.join(path, 'original data','products.csv'))
df_ords = pd.read_csv(os.path.join(path, 'prepared data','orders_wrangled.csv'))

In [5]:
##remove the extra index column from df_ords
df_ords = df_ords.drop(columns = ['Unnamed: 0'])

## 2) Run the df.describe() function on your df_prods dataframe. 
Using your new knowledge about how to interpret the output of this function, share in a markdown cell whether anything about the data looks off or should be investigated further.

In [6]:
#run describe function on df_prods
df_prods.describe()

         product_id   aisle_id  department_id      prices
count       49693.0    49693.0        49693.0     49693.0
mean   24844.345139  67.770249      11.728433    9.994136
std    14343.717401  38.316774       5.850282  453.519686
min             1.0        1.0            1.0         1.0
25%         12423.0       35.0            7.0         4.1
50%         24845.0       69.0           13.0         7.1
75%         37265.0      100.0           17.0        11.2
max         49688.0      134.0           21.0     99999.0


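As a confirmation, I would like to verify that the physical store really has 134 aisles, matching the max of aisle_id. Similarly, I would confirm that there are 21 departments, matching the max of department_id.

The max value of 99999 in prices looks suspect. Based on the min and quartile values (75% of prices are at or below 11.2), I would expect the max to be roughly in the 12-20 range. The standard deviation of about 453.5 against a mean of about 9.99 also points to extreme values that should be investigated.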
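One way to follow up on the suspicious maximum is to pull out the rows above some cutoff and look at them directly. The sketch below uses a small made-up frame in place of df_prods, and the cutoff of 100 is an assumption chosen for illustration, not a value taken from the data.

```python
import pandas as pd

# Toy stand-in for df_prods; these values are made up for illustration.
sample = pd.DataFrame({
    'product_id': [1, 2, 3, 4, 5],
    'prices': [4.1, 7.1, 11.2, 14.9, 99999.0],
})

# Hypothetical cutoff: anything priced above 100 is treated as suspicious here.
price_cutoff = 100

# Build a boolean mask and keep only the rows that exceed the cutoff.
suspect_rows = sample[sample['prices'] > price_cutoff]
print(suspect_rows)

# A single extreme value drags the mean far more than the median.
print(sample['prices'].median(), sample['prices'].mean())
```

On the real data, the same filter applied to df_prods would list the products behind the 99999 maximum so they can be reviewed before deciding how to handle them.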

### 3) Check for missing values in products data

In [7]:
#finding sum of missing values by column in df_prods
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [8]:
#create a subset of the df_prods dataframe called df_nan that only has the 16 missing product_name values
df_nan = df_prods[df_prods['product_name'].isnull()]

In [9]:
#check size of df_prods
df_prods.shape

(49693, 5)

In [10]:
#create a subset of the df_prods dataframe called df_prods_clean that keeps only the rows where product_name is not missing
df_prods_clean = df_prods[df_prods['product_name'].notnull()]

In [11]:
#check size of df_prods_clean
df_prods_clean.shape

(49677, 5)
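The boolean-mask filter above can also be written with `dropna`, restricted to the one column of interest. This is a minimal sketch on a made-up three-row frame, shown only to illustrate that the two approaches keep the same rows.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_prods with one missing product_name (made-up rows).
sample = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': ['Ranger IPA', np.nan, 'Adore Forever Body Wash'],
})

# Approach used in this notebook: keep rows where product_name is not null.
by_mask = sample[sample['product_name'].notnull()]

# Alternative: drop rows that are null in the product_name column only.
by_dropna = sample.dropna(subset=['product_name'])

print(by_mask.equals(by_dropna))
print(sample.shape, by_dropna.shape)
```

Passing `subset` matters: without it, `dropna` would drop rows with a null in any column, which may remove more than intended.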

### 4) Check products data for duplicates

In [12]:
#creating new data frame df_dups of just the duplicate rows in df_prods_clean
df_dups = df_prods_clean[df_prods_clean.duplicated()]

In [13]:
df_dups

       product_id  product_name                                       aisle_id  department_id  prices
462           462  Fiber 4g Gummy Dietary Supplement                        70             11     4.8
18459       18458  Ranger IPA                                               27              5     9.2
26810       26808  Black House Coffee Roasty Stout Beer                     27              5    13.4
35309       35306  Gluten Free Organic Peanut Butter & Chocolate ...       121             14     6.8
35495       35491  Adore Forever Body Wash                                 127             11     9.9


In [15]:
#remove duplicates from df_prods_clean and create new dataframe df_prods_clean_no_dups
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

In [16]:
df_prods_clean_no_dups.shape

(49672, 5)
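The shapes are consistent with the five duplicates found: 49677 rows minus 5 leaves 49672. As a sketch of how `duplicated()` and `drop_duplicates()` relate, the example below uses a made-up four-row frame; the `keep=False` call is an optional extra that shows every copy of a duplicated row rather than only the later repeats.

```python
import pandas as pd

# Toy stand-in for df_prods_clean with one fully duplicated row (made-up values).
sample = pd.DataFrame({
    'product_id': [461, 462, 462, 463],
    'product_name': ['A', 'B', 'B', 'C'],
    'prices': [3.0, 4.8, 4.8, 5.5],
})

# Default keep='first': only the later repeats are flagged.
later_copies = sample[sample.duplicated()]

# keep=False: every member of a duplicate group is flagged.
all_copies = sample[sample.duplicated(keep=False)]

# drop_duplicates removes exactly the rows flagged by the default duplicated().
deduped = sample.drop_duplicates()
print(len(sample), len(later_copies), len(deduped))
```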

### 5) Check for mixed-type data in your df_ords dataframe.

In [None]:
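#search df_ords for columns that have mixed data types
#note: applymap is deprecated in pandas 2.1 and later, where DataFrame.map replaces it
for col in df_ords.columns.tolist():
  #flag rows whose value type differs from the type of the first value in the column
  weird = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis = 1)
  if len(df_ords[weird]) > 0:
    print(col)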

The loop prints nothing, which suggests that df_ords has no mixed-type columns, assuming the check ran correctly.
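As a cross-check on the result, the same question can be asked per column by counting the distinct Python types among its values. This is a sketch on a made-up frame, not the check used in this notebook; the column named `mixed` is hypothetical and exists only to show a positive hit.

```python
import pandas as pd

# Toy frame: one clean numeric column and one deliberately mixed column.
sample = pd.DataFrame({
    'order_number': [1, 2, 3],
    'mixed': [1, 'two', 3.0],  # hypothetical column holding int, str and float
})

# A column is mixed when its values map to more than one Python type.
mixed_cols = [
    col for col in sample.columns
    if sample[col].map(type).nunique() > 1
]
print(mixed_cols)
```

An empty list from this check on df_ords would agree with the loop above printing nothing.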

## 6) Run a check for missing values in your df_ords dataframe.
In a markdown cell, report your findings and propose an explanation for any missing values you find.

In [None]:
#find count of missing values in df_ords by column
df_ords.isnull().sum()

In [None]:
df_ords.head()

In the data frame df_ords there are 206209 missing values in the column 'days_since_prior_order'. This is the only column with missing data.
These most likely represent each customer's first order. A first order has no earlier order to measure from, so the calculation for this column cannot be completed and the value is left missing.

Address the missing values using an appropriate method.

In [None]:
#create data frame df_ords_miss that only has the rows from df_ords with null values in 'days_since_prior_order' column
df_ords_miss = df_ords[df_ords['days_since_prior_order'].isnull()]

In [None]:
#use describe on df_ords_miss to show that order number is always 1 for the null values
df_ords_miss.describe()

I chose to leave the missing values as they are.

The first of three options I considered was to create a new column that uses string values to label each order as some variation of "First" or "Repeat". I decided against this because the order_number column already provides this information, so the new column would be duplicate data.

The second option was to replace the null values in 'days_since_prior_order' with a string value denoting a 'First' order. I chose not to do this because I don't want to put string values in a column of floating point values.

The third option was to replace the null values in 'days_since_prior_order' with 0.0. I felt this was inappropriate because 0.0 could also mean a repeat order placed on the same day as the previous one.
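The reasoning for leaving the nulls alone can be illustrated on a small made-up frame. The sketch below checks that the nulls line up exactly with order_number equal to 1, and shows how filling with 0.0 would pull the average gap down, whereas pandas skips NaN in `mean()` by default. The values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_ords: two customers, first orders have no prior gap.
sample = pd.DataFrame({
    'order_number': [1, 2, 3, 1, 2],
    'days_since_prior_order': [np.nan, 15.0, 0.0, np.nan, 30.0],
})

# Check that a value is missing exactly when the row is a first order.
is_null = sample['days_since_prior_order'].isnull()
is_first = sample['order_number'] == 1
nulls_are_first_orders = (is_null == is_first).all()
print(nulls_are_first_orders)

# mean() skips NaN by default, so first orders do not distort the average.
mean_with_nan = sample['days_since_prior_order'].mean()

# Filling with 0.0 treats first orders as same-day repeats and lowers the mean.
mean_zero_filled = sample['days_since_prior_order'].fillna(0.0).mean()
print(mean_with_nan, mean_zero_filled)
```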

## 7) Run a check for duplicate values in your df_ords data.

In [None]:
#create new data frame df_ords_dupes that returns rows of df_ords that are duplicates across all columns
df_ords_dupes = df_ords[df_ords.duplicated()]

In [None]:
#look at df_ords_dupes to see which rows are duplicates in df_ords
df_ords_dupes

No duplicate rows were found in df_ords. This is likely because the order_number column, together with the other columns, keeps each row unique.

## 8) Export your final, cleaned df_prods and df_ords data as “.csv” files in your “Prepared Data” folder and give them appropriate, succinct names.

In [None]:
#export df_ords data as 'orders_cleaned.csv'
df_ords.to_csv(os.path.join(path, 'prepared data','orders_cleaned.csv' ))

In [None]:
#export the cleaned products data (missing names and duplicates removed) as 'products_cleaned.csv'
df_prods_clean_no_dups.to_csv(os.path.join(path, 'prepared data','products_cleaned.csv' ))
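By default `to_csv` also writes the row index, which comes back as an 'Unnamed: 0' column on the next import, like the one dropped from df_ords in section 1. Passing `index=False` is one way to avoid that. The sketch below demonstrates the difference on a made-up frame written to a temporary folder; the file names are placeholders.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned data (made-up values).
sample = pd.DataFrame({'product_id': [1, 2], 'prices': [4.1, 7.1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index_path = os.path.join(tmp_dir, 'with_index.csv')
    no_index_path = os.path.join(tmp_dir, 'no_index.csv')

    # Default export: the row index is written as an unnamed first column.
    sample.to_csv(with_index_path)

    # index=False leaves the row index out of the file.
    sample.to_csv(no_index_path, index=False)

    cols_with_index = pd.read_csv(with_index_path).columns.tolist()
    cols_no_index = pd.read_csv(no_index_path).columns.tolist()

print(cols_with_index)
print(cols_no_index)
```

If later notebooks already drop 'Unnamed: 0' after import, that step would need to be removed when switching to `index=False`.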