# Exercise 4.9 Part 1 Steps 1-5

## This script will contain the following:

### Step 1 - Download customer data set and add to original data folder
### Step 2 - Create New Notebook in scripts folder for part 1 of this task
### Step 3 - Import Analysis Libraries and Customer data set as dataframe
### Step 4 - Wrangle the Data so that it follows consistent logic
### Step 5 - Complete fundamental data quality and consistency checks

In [1]:
# Some of these steps are completed outside of this script; they are addressed in markdown cells.

#### Step 1 - Download customer data set and add to original data folder

The customer data set was downloaded and added to the original data folder in the Instacart analysis folder.

#### Step 2 - Create New Notebook in Scripts folder for part 1 of this task

This notebook was created for the purposes of Step 2.

#### Step 3 - Import Analysis Libraries and customer data set as dataframe

In [2]:
# Import libraries

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

In [3]:
# Set Path

path = r'C:\Users\Josh Wattay\anaconda3\Instacart Basket Analysis'

In [4]:
# Import customer data set as dataframe

customers = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'customers.csv'))

In [5]:
# Check output and shape

customers.head()

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


In [6]:
customers.shape

(206209, 10)

In [9]:
customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206209 entries, 0 to 206208
Data columns (total 10 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       206209 non-null  int64 
 1   First Name    194950 non-null  object
 2   Surnam        206209 non-null  object
 3   Gender        206209 non-null  object
 4   STATE         206209 non-null  object
 5   Age           206209 non-null  int64 
 6   date_joined   206209 non-null  object
 7   n_dependants  206209 non-null  int64 
 8   fam_status    206209 non-null  object
 9   income        206209 non-null  int64 
dtypes: int64(4), object(6)
memory usage: 15.7+ MB


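At first glance, the First Name column has 11,259 missing entries (206,209 rows minus 194,950 non-null values). Every column reports a single dtype, but info() cannot reveal mixed types inside object columns, so that is checked explicitly in Step 5.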
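The missing-value count can be read off directly rather than estimated from the info() output. The following is a minimal sketch on a hypothetical four-row sample (the `sample` frame and its values are illustrative, not taken from customers.csv); it shows that subtracting the non-null count from the row count matches the more common isna().sum() idiom.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the customers table (illustrative values only)
sample = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'First Name': ['Ann', np.nan, 'Kenneth', np.nan],
    'income': [40374, 59285, 99568, 42049],
})

# Missing entries per column: total rows minus the non-null count
missing_per_column = len(sample) - sample.count()

# The same figure via isna(), which is the more common idiom
missing_via_isna = sample.isna().sum()

print(missing_per_column)
```

Applied to the real dataframe, the same call would be customers.isna().sum().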

#### Step 4 - Wrangle the Data so that it follows consistent logic

The Surnam column name is misspelled, and I would like to rename fam_status to marital_status to be more specific.

In [14]:
# Inspect the range of the date_joined column to judge whether it is relevant to the analysis

customers['date_joined'].min()

'1/1/2017'

In [15]:
customers['date_joined'].max()

'9/9/2019'
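Because date_joined has dtype object, min() and max() compare the values as text, character by character, so '9/9/2019' is the largest string rather than the latest date. The sketch below uses a hypothetical four-value series in the same M/D/YYYY format to show the difference; parsing with pd.to_datetime first is one way to get the chronological range, assuming every value follows that format.

```python
import pandas as pd

# Hypothetical sample of date strings in M/D/YYYY format
dates = pd.Series(['1/1/2017', '9/9/2019', '4/1/2020', '12/31/2018'])

# String comparison: '9' sorts after '1' and '4', so this is the "largest" text
string_max = dates.max()

# Parse to datetime first to get the true chronological range
parsed = pd.to_datetime(dates, format='%m/%d/%Y')
true_min = parsed.min()
true_max = parsed.max()

print(string_max, true_min.date(), true_max.date())
```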

In [16]:
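# I will keep this column, as the date joined will be useful for analyzing customer spending patterns over time.
# Note: date_joined is stored as strings, so min() and max() above compare text, not dates; rows shown later include 2020 dates.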

In [17]:
# Descriptive stats check

customers.describe()

Unnamed: 0,user_id,Age,n_dependants,income
count,206209.0,206209.0,206209.0,206209.0
mean,103105.0,49.501646,1.499823,94632.852548
std,59527.555167,18.480962,1.118433,42473.786988
min,1.0,18.0,0.0,25903.0
25%,51553.0,33.0,0.0,59874.0
50%,103105.0,49.0,1.0,93547.0
75%,154657.0,66.0,3.0,124244.0
max,206209.0,81.0,3.0,593901.0


There may be a few outliers in the income column (the maximum of 593,901 is far above the 75th percentile of 124,244), but the descriptive statistics for the remaining columns look reasonable.
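One common way to make "a few outliers" concrete is the 1.5 × IQR rule. The sketch below is only illustrative: it reuses the five income summary values from describe() as if they were data points (an assumption made so the example is self-contained), and the 1.5 multiplier is a convention rather than something this exercise prescribes.

```python
import pandas as pd

# Assumption: the min, quartiles, and max from describe() reused as five data points
income = pd.Series([25903, 59874, 93547, 124244, 593901])

q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1

# Values above this fence are flagged as potential high-end outliers
upper_fence = q3 + 1.5 * iqr
outliers = income[income > upper_fence]

print(upper_fence)
print(outliers.tolist())
```

With the quartiles reported above, the upper fence works out to 220,799, so the maximum income sits well beyond it.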

In [19]:
# I now want to make sure all of the states are accounted for in the STATE column

customers['STATE'].nunique()

51

There are 51 unique values rather than 50, so perhaps this includes Washington, D.C.

In [20]:
# Use str.contains to search for rows containing "DC"

dc_rows = customers[customers['STATE'].str.contains('DC', case=False, na=False)]

In [21]:
# Check output

dc_rows

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income


In [22]:
# Rerun str.contains with "D.C." (regex=False so the periods are matched literally, not as wildcards)

dc_rows = customers[customers['STATE'].str.contains('D.C.', case=False, na=False, regex=False)]

In [23]:
# Check output

dc_rows

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income


In [24]:
# Rerun str.contains with "District"

dc_rows = customers[customers['STATE'].str.contains('District', case=False, na=False)]

In [25]:
# Check output

dc_rows

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
91,202887,John,Harris,Male,District of Columbia,20,1/1/2017,3,living with parents and siblings,84380
122,155355,Bonnie,Richardson,Female,District of Columbia,60,1/1/2017,1,married,115744
225,25305,Aaron,Beck,Male,District of Columbia,43,1/2/2017,3,married,40127
342,98745,Karen,Parker,Female,District of Columbia,50,1/3/2017,2,married,146183
551,163821,Sandra,Blankenship,Female,District of Columbia,31,1/4/2017,2,married,81922
...,...,...,...,...,...,...,...,...,...,...
205821,192177,Robin,Ward,Female,District of Columbia,63,3/30/2020,2,married,144042
205863,126591,Shirley,Ellis,Female,District of Columbia,20,3/30/2020,1,living with parents and siblings,56023
206078,134292,Sarah,Rich,Female,District of Columbia,28,4/1/2020,0,single,36946
206104,723,Alice,Mayo,Female,District of Columbia,48,4/1/2020,2,married,123120


Now we know the District of Columbia is the reason there are 51 unique values in the STATE column.
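Instead of guessing spellings with str.contains, the extra value can be found by comparing the observed values against a reference set. This is a sketch under assumptions: `US_STATES` is a hand-typed set of the 50 state names and `observed` is a hypothetical stand-in for customers['STATE'].

```python
import pandas as pd

# Reference set of the 50 U.S. state names (typed by hand for this sketch)
US_STATES = {
    'Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado',
    'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho',
    'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
    'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
    'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
    'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
    'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon',
    'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota',
    'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
    'West Virginia', 'Wisconsin', 'Wyoming',
}

# Hypothetical stand-in for customers['STATE']
observed = pd.Series(['Missouri', 'Idaho', 'District of Columbia', 'Iowa'])

# Anything observed that is not one of the 50 states
unexpected = sorted(set(observed.unique()) - US_STATES)

print(unexpected)
```

The same set difference would also surface misspelled or inconsistently capitalized state names, if there were any.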

I will now change the datatypes of the int64 columns to reduce memory usage.

In [26]:
# Use .astype to change datatypes to int32

customers[['user_id', 'Age', 'n_dependants', 'income']] = customers[['user_id', 'Age', 'n_dependants', 'income']].astype('int32')

In [27]:
# Check datatypes

customers[['user_id', 'Age', 'n_dependants', 'income']].dtypes

user_id         int32
Age             int32
n_dependants    int32
income          int32
dtype: object
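Downcasting to int32 is only safe when every value fits in the int32 range, and the saving can be measured rather than assumed. The following is a rough sketch on a hypothetical two-row sample; it checks the range with np.iinfo before converting and compares memory use before and after. It uses astype with a column-to-dtype mapping, which is an alternative spelling of the assignment above, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row sample with the same integer columns, stored as int64
sample = pd.DataFrame(
    {
        'user_id': [26711, 33890],
        'Age': [48, 36],
        'n_dependants': [3, 0],
        'income': [165665, 59285],
    },
    dtype='int64',
)
int_cols = ['user_id', 'Age', 'n_dependants', 'income']

# Confirm every value fits inside the int32 range before downcasting
limits = np.iinfo(np.int32)
fits = bool(
    ((sample[int_cols].min() >= limits.min)
     & (sample[int_cols].max() <= limits.max)).all()
)

before = sample[int_cols].memory_usage(index=False).sum()
if fits:
    sample = sample.astype({col: 'int32' for col in int_cols})
after = sample[int_cols].memory_usage(index=False).sum()

print(before, after)
```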

#### Renaming the Surname column to fix typo

In [30]:
customers.rename(columns = {'Surnam' : 'Surname'}, inplace = True)

In [31]:
# Check output

customers.head()

Unnamed: 0,user_id,First Name,Surname,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


#### Renaming STATE so it's not in all caps

In [32]:
customers.rename(columns = {'STATE' : 'State'}, inplace = True)

In [33]:
# Check output

customers.head()

Unnamed: 0,user_id,First Name,Surname,Gender,State,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


#### Renaming fam_status to marital_status to be more specific

In [35]:
customers.rename(columns = {'fam_status' : 'marital_status'}, inplace = True)

In [36]:
# Check output

customers.head()

Unnamed: 0,user_id,First Name,Surname,Gender,State,Age,date_joined,n_dependants,marital_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374
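The three renames above could also be applied in a single call by passing one dictionary to rename. This is a minimal sketch on an empty hypothetical frame with the same column names; `column_fixes` is a name introduced here for illustration. It returns a new dataframe instead of using inplace = True.

```python
import pandas as pd

# Hypothetical empty frame carrying the original column names
sample = pd.DataFrame(columns=[
    'user_id', 'First Name', 'Surnam', 'Gender', 'STATE', 'Age',
    'date_joined', 'n_dependants', 'fam_status', 'income',
])

# All three fixes collected in one mapping
column_fixes = {
    'Surnam': 'Surname',
    'STATE': 'State',
    'fam_status': 'marital_status',
}

# rename returns a new dataframe; the original is left unchanged
renamed = sample.rename(columns=column_fixes)

print(list(renamed.columns))
```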


After review, I will be dropping the First Name and Surname columns. These won't contribute to this analysis, as the exact identity of customers is not a key question or criterion.

We can glean answers to our key questions from the remaining columns of data in this dataframe.

Therefore, I will create a cust_checked dataframe so as not to overwrite the customers dataframe, while still producing a wrangled dataframe for this project.

In [39]:
# Create wrangled dataframe that drops the First Name and Surname columns

cust_checked = customers.drop(columns = ['First Name', 'Surname'])

In [40]:
# Check output

cust_checked.head()

Unnamed: 0,user_id,Gender,State,Age,date_joined,n_dependants,marital_status,income
0,26711,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Female,Maryland,26,1/1/2017,1,married,40374


This wrangled dataframe accomplishes three things:
1. It removes the names of customers, which are irrelevant to this analysis.
2. In doing so, it increases the privacy of the customers.
3. It eliminates the need to handle the missing values in the First Name column, as that column is now removed.

#### Step 5 - Complete fundamental data quality and consistency checks

In [43]:
# Check for duplicates

df_dups = cust_checked[cust_checked.duplicated()]

In [44]:
# Check output for duplicates

df_dups

Unnamed: 0,user_id,Gender,State,Age,date_joined,n_dependants,marital_status,income


No duplicates were discovered within our wrangled data.
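duplicated() with no arguments only flags rows that match on every column. Since user_id is meant to identify a customer, it may also be worth checking that column on its own. The sketch below uses a hypothetical four-row sample in which one user_id repeats with a different income, which a full-row check would miss.

```python
import pandas as pd

# Hypothetical sample: user_id 2 appears twice with different incomes
sample = pd.DataFrame({
    'user_id': [1, 2, 2, 3],
    'income': [100, 200, 250, 300],
})

# Full-row duplicates: none, because the two user_id 2 rows differ in income
full_row_dups = sample[sample.duplicated()]

# Duplicates on the key column only; keep=False returns every repeated row
key_dups = sample[sample.duplicated(subset='user_id', keep=False)]

print(len(full_row_dups), len(key_dups))
```

For the real data, cust_checked['user_id'].is_unique would give the same answer as a single boolean.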

In [45]:
# Check for missing values

cust_checked.isnull().sum()

user_id           0
Gender            0
State             0
Age               0
date_joined       0
n_dependants      0
marital_status    0
income            0
dtype: int64

No missing values were discovered within our wrangled data.

In [46]:
# Check for mixed-type data columns

for col in cust_checked.columns.tolist():
    weird = (cust_checked[[col]].map(type) != cust_checked[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(cust_checked[weird]) > 0:
        print(col)

In [49]:
# Check output to verify no mixed-type data columns

In [48]:
weird.value_counts()

False    206209
Name: count, dtype: int64

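The loop printed no column names, so no mixed-type data columns were discovered within our wrangled data. Note that weird holds only the mask for the last column checked (income), so the value_counts() output above confirms that column alone.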
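A variant of the mixed-type check that returns the offending column names makes the result easier to inspect than relying on printed output. This is a sketch, not the course's method: `mixed_type_columns` is a hypothetical helper, and it counts distinct Python types per column, so a NaN (a float) inside a string column would also be reported.

```python
import pandas as pd

def mixed_type_columns(df):
    """Return the names of columns whose values have more than one Python type."""
    return [col for col in df.columns if df[col].map(type).nunique() > 1]

# Hypothetical sample: 'State' mixes strings with an integer, 'Age' is clean
sample = pd.DataFrame({
    'State': ['Iowa', 'Idaho', 7],
    'Age': [40, 35, 26],
})

mixed = mixed_type_columns(sample)

print(mixed)
```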

#### Step 6 - Combine Customer Data with prepared Instacart data

In [50]:
# Import ords_prods_merge data

ords_prods_merge = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_merge_4.8.pkl'))

In [52]:
# Make sure data types are the same for the user_id column

ords_prods_merge['user_id'].dtype

dtype('int32')

In [53]:
cust_checked['user_id'].dtype

dtype('int32')
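With matching key dtypes, the eventual combination would typically be a merge on user_id. The following is a rough sketch only, on two hypothetical frames; the actual merge happens in a separate notebook and may differ. The how='left' choice and the `cust_match` indicator name are assumptions. A custom indicator name is used because ords_prods_merge already carries a _merge column, which would clash with the default indicator name.

```python
import pandas as pd

# Hypothetical stand-ins for ords_prods_merge and cust_checked
orders = pd.DataFrame({
    'user_id': pd.Series([1, 1, 2], dtype='int32'),
    'order_id': [10, 11, 12],
})
custs = pd.DataFrame({
    'user_id': pd.Series([1, 2, 3], dtype='int32'),
    'State': ['Missouri', 'Idaho', 'Iowa'],
})

# Left merge keeps every order row; the indicator records which rows found a match
merged = orders.merge(custs, on='user_id', how='left', indicator='cust_match')

print(merged['cust_match'].value_counts())
```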

In [55]:
ords_prods_merge.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,orders_time_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,_merge,price_range_loc,busiest_days,busiest_period_of_day,max_order,loyalty_flag,average_purchase_cost,spending_flag,median_days_btwn_orders,order_frequency_flag
0,2539329,1,1,2,8,,196,1,0,Soda,...,both,Mid-range product,Regularly busy,Average Orders,10,New customer,6.367535,Low spender,20.5,Non-frequent customer
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,both,Mid-range product,Slowest days,Average Orders,10,New customer,6.367535,Low spender,20.5,Non-frequent customer
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,both,Mid-range product,Slowest days,Most Orders,10,New customer,6.367535,Low spender,20.5,Non-frequent customer
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,both,Mid-range product,Slowest days,Average Orders,10,New customer,6.367535,Low spender,20.5,Non-frequent customer
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,both,Mid-range product,Slowest days,Most Orders,10,New customer,6.367535,Low spender,20.5,Non-frequent customer


## Export wrangled Data

The wrangled data is exported here; it will be imported and combined with the order data in a separate notebook.

In [58]:
cust_checked.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'cust_checked.pkl'))
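A quick round-trip check can confirm that the exported pickle reloads with the same values and dtypes. This is a minimal sketch on a hypothetical two-row frame, written to a temporary directory so it does not touch the project folders; the file name is illustrative.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row stand-in for cust_checked
sample = pd.DataFrame({
    'user_id': pd.Series([26711, 33890], dtype='int32'),
    'State': ['Missouri', 'New Mexico'],
})

# Write to a temporary folder, then read the file straight back
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'cust_checked_sample.pkl')
    sample.to_pickle(out_path)
    restored = pd.read_pickle(out_path)

# equals() compares values and dtypes, so int32 must survive the round trip
round_trip_ok = restored.equals(sample)

print(round_trip_ok)
```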