# Table of Contents 

### Exercise Walk-Through
- 01 Importing Data
- 02 Checking for Mixed-Type Data 
- 03 Missing Values 
- 04 Duplicates 

### Exercises 2-9
- Ex 2. Describe function
- Ex 3 & 4. Checking for mixed-type data
- Ex 5 & 6. Checking and addressing missing values 
- Ex 7 & 8. Checking and addressing duplicates
- Ex 9. Exporting data

# Exercise Walk-Through

## 01 Importing Data

In [2]:
# Import libraries
import pandas as pd
import numpy as np
import os

In [3]:
# Project folder string
path = r'/Users/nora/Desktop/Instacart Basket Analysis'

In [4]:
# Import products.csv from Original Data folder as df_prods
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'), index_col = False)

In [5]:
df_prods.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49693.0,49693.0,49693.0,49693.0
mean,24844.345139,67.770249,11.728433,9.994136
std,14343.717401,38.316774,5.850282,453.519686
min,1.0,1.0,1.0,1.0
25%,12423.0,35.0,7.0,4.1
50%,24845.0,69.0,13.0,7.1
75%,37265.0,100.0,17.0,11.2
max,49688.0,134.0,21.0,99999.0


In [6]:
df_prods.shape

(49693, 5)

In [5]:
# Import orders_wrangled.csv from Prepared Data folder as df_ords

In [6]:
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_wrangled.csv'), index_col = False)

## 02 Checking for Mixed-Type Data

In [7]:
# Create a data frame

In [8]:
df_test = pd.DataFrame()

In [9]:
# Create a mixed type column

In [10]:
df_test['mix'] = ['a', 'b', 1, True]

In [11]:
df_test.head()

Unnamed: 0,mix
0,a
1,b
2,1
3,True


In [12]:
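# Check for mixed-type columns: compare every cell's type with the type of the column's first cell
# (on pandas 2.1+, DataFrame.map replaces the deprecated applymap)
for col in df_test.columns.tolist():
    weird = (df_test[[col]].applymap(type) != df_test[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_test[weird]) > 0:
        print(col)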

mix


In [13]:
# Fix the mixed-type column
df_test['mix'] = df_test['mix'].astype('str')
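
The check above compares each cell's type with the type of the first cell in its column. A minimal alternative sketch, assuming a small toy frame (`df_demo` and its `clean` column are made up for illustration), counts the distinct Python types per column instead:

```python
import pandas as pd

# Toy frame mirroring df_test from the walk-through, plus a clean column
# (df_demo and 'clean' are hypothetical, for illustration only)
df_demo = pd.DataFrame({
    'mix': ['a', 'b', 1, True],
    'clean': [1.5, 2.0, 3.25, 4.0],
})

# Count the distinct Python types found in each column
type_counts = {col: df_demo[col].map(type).nunique() for col in df_demo.columns}

# Columns holding more than one type are mixed
mixed_cols = [col for col, n in type_counts.items() if n > 1]
print(mixed_cols)
```

Only `mix` is reported here, because it holds strings, an integer and a boolean, while `clean` holds floats only.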

## 03 Missing Values

In [14]:
# Find missing values
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [15]:
# Create a subset with the missing values
df_nan = df_prods[df_prods['product_name'].isnull()]

In [16]:
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


In [17]:
# Create a new data frame with clean data
df_prods_clean = df_prods[df_prods['product_name'].notnull()]

In [18]:
# Compare number of rows in current data frame with number of rows in subset without missing values

In [19]:
df_prods.shape

(49693, 5)

In [20]:
df_prods_clean.shape

(49677, 5)

In [21]:
# Another way of dropping missing values is via the following command, which overwrites df_prods:
# df_prods.dropna(subset = ['product_name'], inplace = True)
# Overwriting can be risky, so creating a new data frame is the better option
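
As a minimal sketch of the non-overwriting route, `dropna` with the `subset` argument returns a new data frame and leaves the original as it was. The toy frame `df_demo` below is an assumption made for illustration, not part of the Instacart data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing product name
df_demo = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': ['Milk', np.nan, 'Bread'],
})

# Without inplace = True, dropna returns a new frame; df_demo is left untouched
df_demo_clean = df_demo.dropna(subset=['product_name'])

print(df_demo.shape)
print(df_demo_clean.shape)
```

Comparing the two shapes shows that only the row with the missing name was removed.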

## 04 Duplicates

In [22]:
# Look for full duplicates within your data frame

In [23]:
df_dups = df_prods_clean[df_prods_clean.duplicated()]

In [24]:
df_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


In [25]:
# Check the number of rows in the df_prods_clean data frame
df_prods_clean.shape

(49677, 5)

In [26]:
# Drop duplicates
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

In [27]:
# Check the number of rows in the data frame without duplicates 
df_prods_clean_no_dups.shape

(49672, 5)
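
By default, `duplicated()` flags only the second and later copies of a row, which is why the five flagged rows match the five rows removed by `drop_duplicates()`. The sketch below shows this on a toy frame (`df_demo` is hypothetical) and uses `keep=False` to list every copy, including the first:

```python
import pandas as pd

# Hypothetical toy frame with one fully duplicated row
df_demo = pd.DataFrame({
    'product_id': [1, 2, 2, 3],
    'product_name': ['Milk', 'Ranger IPA', 'Ranger IPA', 'Bread'],
})

# Default keep='first': only the later copy is flagged
later_copies = df_demo[df_demo.duplicated()]

# keep=False: every row that has a twin is flagged
all_copies = df_demo[df_demo.duplicated(keep=False)]

# drop_duplicates keeps the first copy of each row
df_demo_no_dups = df_demo.drop_duplicates()

print(len(later_copies), len(all_copies), len(df_demo_no_dups))
```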

# Exercises 2-9

### Ex 2. Describe function

In [28]:
# 2. Run the df.describe() function on your df_prods dataframe. 
#    Using your new knowledge about how to interpret the output of this function, 
#    share in a markdown cell whether anything about the data looks off or should be investigated further.

In [29]:
df_prods.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49693.0,49693.0,49693.0,49693.0
mean,24844.345139,67.770249,11.728433,9.994136
std,14343.717401,38.316774,5.850282,453.519686
min,1.0,1.0,1.0,1.0
25%,12423.0,35.0,7.0,4.1
50%,24845.0,69.0,13.0,7.1
75%,37265.0,100.0,17.0,11.2
max,49688.0,134.0,21.0,99999.0


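The max value for the prices column (99999.0) is far too high for a grocery item: 75% of products cost 11.2 or less. It also inflates the standard deviation to about 453.5 against a mean of about 9.99, so the prices column should be investigated.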
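
One way to follow up on a suspicious maximum is to list the rows above a plausibility threshold. This is only a sketch: the toy frame `df_demo` and the cut-off `price_cap = 100` are assumptions chosen for illustration, not values taken from the exercise.

```python
import pandas as pd

# Hypothetical toy frame with one implausible price
df_demo = pd.DataFrame({
    'product_name': ['Milk', 'Bread', 'Suspicious item'],
    'prices': [4.1, 7.1, 99999.0],
})

# Assumed plausibility threshold for a grocery price
price_cap = 100

# Keep only the rows whose price exceeds the threshold
df_outliers = df_demo[df_demo['prices'] > price_cap]
print(df_outliers)
```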

### Ex 3 & 4. Checking for mixed-type data

In [30]:
# 3. Check for mixed-type data in your df_ords dataframe

In [31]:
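# Compare every cell's type with the type of the column's first cell
# (on pandas 2.1+, DataFrame.map replaces the deprecated applymap)
for col in df_ords.columns.tolist():
    weird = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_ords[weird]) > 0:
        print(col)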

In [32]:
# 4. If you find mixed-type data, fix it. The column in question should contain observations of a single data type.

The check prints nothing, so there are no mixed-type columns in the df_ords data frame and nothing needs to be fixed.

### Ex 5 & 6. Checking and addressing missing values 

In [33]:
# 5. Run a check for missing values in your df_ords dataframe.

In [34]:
df_ords.isnull().sum()

Unnamed: 0                     0
order_id                       0
user_id                        0
order_number                   0
orders_day_of_week             0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

In [35]:
df_ords_nan = df_ords[df_ords['days_since_prior_order'].isnull()]

In [36]:
df_ords_nan

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,0,2539329,1,1,2,8,
11,11,2168274,2,1,2,11,
26,26,1374495,3,1,1,14,
39,39,3343014,4,1,6,11,
45,45,2717275,5,1,3,12,
...,...,...,...,...,...,...,...
3420930,3420930,969311,206205,1,4,12,
3420934,3420934,3189322,206206,1,3,18,
3421002,3421002,2166133,206207,1,6,19,
3421019,3421019,2227043,206208,1,1,15,


In [37]:
# In a markdown cell, report your findings and propose an explanation for any missing values you find.

The only column with missing values is the days_since_prior_order column, which has 206209 missing values. Every row shown in the subset above has an order_number of 1, so these are most likely each customer's first order: with no earlier order, there is no number of days to record.

In [38]:
# 6. Address the missing values using an appropriate method.

We cannot remove the rows with missing data, because a missing value carries important information: it marks the customer's first order. For the same reason, we should not impute the missing data. Instead, we create a new variable that acts as a flag based on the missing value. The new column, first_order, indicates whether a row is the customer's first order, and so explains the missing values in the days_since_prior_order column.

In [39]:
# Define a new data frame for the cleaned data
# (.copy() makes it independent, so df_ords is not modified by later changes)
df_ords_clean = df_ords.copy()

In [40]:
df_ords_clean['first_order'] = df_ords['days_since_prior_order'].isnull()
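
Whether adding a column to a "clean" frame also changes the original depends on how the clean frame was created. In pandas, plain assignment only gives the same object a second name, while `.copy()` creates an independent frame. A minimal sketch with made-up names (`df_a`, `df_alias`, `df_copy`):

```python
import pandas as pd

# Hypothetical toy frame
df_a = pd.DataFrame({'x': [1, 2]})

df_alias = df_a         # same object under a second name
df_copy = df_a.copy()   # independent frame

# A column added through the alias also appears on df_a
df_alias['flag'] = True

# A column added to the copy does not
df_copy['other'] = False

print(list(df_a.columns))
```

The printed column list contains `flag` but not `other`, which is why a copy is the safer starting point for a cleaned data frame.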

In [41]:
df_ords_clean.head()

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order
0,0,2539329,1,1,2,8,,True
1,1,2398795,1,2,3,7,15.0,False
2,2,473747,1,3,3,12,21.0,False
3,3,2254736,1,4,4,7,29.0,False
4,4,431534,1,5,4,15,28.0,False
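
A quick sanity check on such a flag is to confirm that it agrees with `order_number == 1`. The sketch below runs that check on a toy frame (`df_demo` is hypothetical); whether the full orders data passes the same check is an assumption that would need to be verified on `df_ords_clean` itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two customers, three orders
df_demo = pd.DataFrame({
    'user_id': [1, 1, 2],
    'order_number': [1, 2, 1],
    'days_since_prior_order': [np.nan, 15.0, np.nan],
})

# Flag rows where no prior order exists
df_demo['first_order'] = df_demo['days_since_prior_order'].isnull()

# The flag should be True exactly where order_number is 1
flags_agree = (df_demo['first_order'] == (df_demo['order_number'] == 1)).all()
print(flags_agree)
```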


### Ex 7 & 8. Checking and addressing duplicates

In [42]:
# 7. Run a check for duplicate values in your df_ords data.

In [43]:
df_ords_dups = df_ords_clean[df_ords_clean.duplicated()]

In [44]:
df_ords_dups

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order


In [45]:
# 8. Address the duplicates using an appropriate method.

In [46]:
df_ords_dups.shape

(0, 8)

The duplicates subset is empty, so there are no full duplicate rows in df_ords_clean and nothing needs to be removed.

### Ex 9. Exporting data 

In [47]:
# 9. Export your final, cleaned df_prods and df_ords data as “.csv” files in your “Prepared Data” folder 
#    and give them appropriate, succinct names.

In [48]:
df_ords_clean.to_csv(os.path.join(path, '02 Data','Prepared Data', 'orders_cleaned.csv'))

In [49]:
df_prods_clean_no_dups.to_csv(os.path.join(path, '02 Data','Prepared Data', 'products_cleaned.csv'))
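
By default, `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again (the same column name appears in df_ords above). If that extra column is not wanted, passing `index=False` is one option. The sketch below uses an in-memory buffer and a made-up frame (`df_demo`) rather than the project files:

```python
import io
import pandas as pd

# Hypothetical toy frame
df_demo = pd.DataFrame({'order_id': [10, 11], 'user_id': [1, 2]})

# Default export: the index is written as an extra unnamed column
buf_with_index = io.StringIO()
df_demo.to_csv(buf_with_index)
buf_with_index.seek(0)
cols_with_index = list(pd.read_csv(buf_with_index).columns)

# Export without the index
buf_no_index = io.StringIO()
df_demo.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
cols_no_index = list(pd.read_csv(buf_no_index).columns)

print(cols_with_index)
print(cols_no_index)
```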

In [50]:
df_ords_clean.shape

(3421083, 8)

In [51]:
df_prods_clean_no_dups.shape

(49672, 5)