# 4.9 Part 1: Preparing Customer Data

### Contents
    01. Import data and libraries
    02. Wrangle customer data
    03. Consistency checks
    04. Combine customer data with ords_prods_merged
    05. Export combined data

## 01. Import data and libraries

In [3]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

In [1]:
# Define path to data

path = r'/Users/louise/Desktop/CF Coursework/Achievement 4 /Instacart Basket Analysis'

In [5]:
# Import customer data


customers = pd.read_csv(os.path.join(path, '02 Data', 'Original Data','customers.csv'))

## 02. Wrangle customer data

In [6]:
customers.head()

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


In [7]:
# Rename columns for clarity and consistency

customers.rename(columns = {'First Name' : 'first_name'}, inplace = True)

In [8]:
customers.rename(columns = {'Surnam' : 'last_name'}, inplace = True)

In [9]:
customers.rename(columns = {'Gender' : 'gender'}, inplace = True)

In [10]:
customers.rename(columns = {'STATE' : 'state'}, inplace = True)

In [11]:
customers.rename(columns = {'Age' : 'age'}, inplace = True)

In [12]:
customers.rename(columns = {'n_dependants' : 'num_of_dependants'}, inplace = True)

In [13]:
customers.rename(columns = {'fam_status' : 'marital_status'}, inplace = True)

In [14]:
# Check results

customers.head()

Unnamed: 0,user_id,first_name,last_name,gender,state,age,date_joined,num_of_dependants,marital_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374
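As an aside, the seven renames above could probably be collapsed into a single call, since `rename` accepts a dictionary with several old-to-new pairs. The sketch below uses a small stand-in frame (`customers_demo` and `new_names` are illustrative names, not part of the notebook) built from the first two rows shown above.

```python
import pandas as pd

# Stand-in frame with the original column names (first two rows of the output above)
customers_demo = pd.DataFrame({
    'user_id': [26711, 33890],
    'First Name': ['Deborah', 'Patricia'],
    'Surnam': ['Esquivel', 'Hart'],
    'Gender': ['Female', 'Female'],
    'STATE': ['Missouri', 'New Mexico'],
    'Age': [48, 36],
    'date_joined': ['1/1/2017', '1/1/2017'],
    'n_dependants': [3, 0],
    'fam_status': ['married', 'single'],
    'income': [165665, 59285],
})

# One mapping from old names to new names; unlisted columns keep their names
new_names = {
    'First Name': 'first_name',
    'Surnam': 'last_name',
    'Gender': 'gender',
    'STATE': 'state',
    'Age': 'age',
    'n_dependants': 'num_of_dependants',
    'fam_status': 'marital_status',
}

customers_demo = customers_demo.rename(columns = new_names)
print(customers_demo.columns.tolist())
```

Keeping the mapping in one dictionary also makes it easier to review every rename at a glance.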


In [17]:
# Check for missing values

customers['first_name'].value_counts(dropna = False)

NaN        11259
Marilyn     2213
Barbara     2154
Todd        2113
Jeremy      2104
           ...  
Merry        197
Eugene       197
Garry        191
Ned          186
David        186
Name: first_name, Length: 208, dtype: int64

In [25]:
customers['income'].value_counts(dropna = False)

57192     10
95891     10
95710     10
97532      9
98675      9
          ..
73141      1
71524      1
74408      1
44780      1
148828     1
Name: income, Length: 108012, dtype: int64

I found missing values only in the 'first_name' column (11,259 of them). I think it best neither to delete these rows nor to impute the names, but just to let these customers go by their last names in our data.
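To put that decision in proportion, here is a quick back-of-the-envelope calculation of how many customers are affected. Both numbers come from this notebook's own outputs (the NaN count above and the row count reported for the customers dataframe); the variable names are just for illustration.

```python
# Counts taken from the outputs in this notebook
missing_first_names = 11259   # NaN count from value_counts(dropna = False)
total_customers = 206209      # row count of the customers dataframe

# Share of customers without a first name
share_missing = missing_first_names / total_customers
print(f'{share_missing:.2%} of customers have no first name')
```

That is roughly one customer in eighteen, so deleting those rows would discard a noticeable number of customers for the sake of a column that is unlikely to matter for the analysis.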

## 03. Consistency checks

In [26]:
# Check major stats of all continuous variables

customers.describe()

Unnamed: 0,user_id,age,num_of_dependants,income
count,206209.0,206209.0,206209.0,206209.0
mean,103105.0,49.501646,1.499823,94632.852548
std,59527.555167,18.480962,1.118433,42473.786988
min,1.0,18.0,0.0,25903.0
25%,51553.0,33.0,0.0,59874.0
50%,103105.0,49.0,1.0,93547.0
75%,154657.0,66.0,3.0,124244.0
max,206209.0,81.0,3.0,593901.0


These stats look plausible at first blush. The minimums, maximums, and means are all about where they should be for this kind of data: ages run from 18 to 81, dependants from 0 to 3, and incomes from 25,903 to 593,901.

In [27]:
# Check for null values

customers.isnull().sum()

user_id                  0
first_name           11259
last_name                0
gender                   0
state                    0
age                      0
date_joined              0
num_of_dependants        0
marital_status           0
income                   0
dtype: int64

This confirms that we only have null values in the 'first_name' column, which I'm not worried about.

In [28]:
# Check for duplicate rows (by making a subset)

df_dups = customers[customers.duplicated()]

In [29]:
df_dups

Unnamed: 0,user_id,first_name,last_name,gender,state,age,date_joined,num_of_dependants,marital_status,income


No duplicates, awesome.
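A related check, assuming each customer should appear only once, is to look for repeated user_id values even when the rest of the row differs. The sketch below uses a made-up three-row frame (`customers_demo` is hypothetical) to show how full-row duplicates and key duplicates can disagree.

```python
import pandas as pd

# Hypothetical data: user 33890 appears twice with different incomes
customers_demo = pd.DataFrame({
    'user_id': [26711, 33890, 33890],
    'last_name': ['Esquivel', 'Hart', 'Hart'],
    'income': [165665, 59285, 60000],
})

# Rows identical in every column
full_row_dups = customers_demo.duplicated().sum()

# Rows that repeat an earlier user_id, whatever the other columns hold
id_dups = customers_demo.duplicated(subset = ['user_id']).sum()

print(full_row_dups, id_dups)
```

Running the `subset` version on the real customers dataframe would confirm that user_id can safely act as the merge key later on.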

In [31]:
# Check for mixed-type data: print any column in which some value's
# type differs from the type of the column's first value

for col in customers.columns.tolist():
    weird = (customers[[col]].applymap(type) != customers[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(customers[weird]) > 0:
        print(col)

first_name


Ok, so the first_name column has mixed-type data, most likely because its missing values are stored as NaN (a float) alongside the string names. What type is it?

In [33]:
# Check data types for whole df

customers.dtypes

user_id               int64
first_name           object
last_name            object
gender               object
state                object
age                   int64
date_joined          object
num_of_dependants     int64
marital_status       object
income                int64
dtype: object
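Note that `dtypes` only reports one dtype per column, so an object column can still hide several Python types. To see the types of the individual values, something like the following sketch could help; it uses a small stand-in Series (`first_name_demo` is illustrative) rather than the real column.

```python
import numpy as np
import pandas as pd

# Stand-in for the first_name column: two names and one missing value
first_name_demo = pd.Series(['Deborah', np.nan, 'Kenneth'], dtype = object, name = 'first_name')

# Count how many values there are of each Python type (by type name)
type_counts = first_name_demo.map(lambda value: type(value).__name__).value_counts()
print(type_counts)
```

Here the strings show up as `str` and the missing value as `float`, which is the kind of mix the check above flags.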

In [34]:
# Change data type for first_name to string 

customers['first_name'] = customers['first_name'].astype('str')

In [45]:
# Check data types again

customers.dtypes

user_id               int64
first_name           object
last_name            object
gender               object
state                object
age                   int64
date_joined          object
num_of_dependants     int64
marital_status       object
income                int64
dtype: object

I looked it up: pandas stores Python strings under the object dtype, so it makes sense that 'first_name' still shows as object. Let's see if the column still contains mixed types.

In [38]:
# Check for mixed-type data again (no output means no mixed-type columns)

for col in customers.columns.tolist():
    weird = (customers[[col]].applymap(type) != customers[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(customers[weird]) > 0:
        print(col)

No longer! Good
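One caveat worth keeping in mind: in pandas versions where `astype('str')` returns an object column, as it does above, my understanding is that each NaN becomes the literal text 'nan'. That would explain why the check now passes, but it also means those rows would no longer be counted by `isnull()`. The sketch below, on a stand-in Series, shows how to test for this and, as one possible alternative (not what this notebook does), the nullable `'string'` dtype, which keeps missing values missing.

```python
import numpy as np
import pandas as pd

# Stand-in for the first_name column
names = pd.Series(['Deborah', np.nan, 'Kenneth'], dtype = object)

# Conversion used in the notebook; depending on the pandas version, the
# missing value either stays null or turns into the text 'nan'
as_str = names.astype('str')
print('still null:', as_str.isna().sum(), '| literal nan:', (as_str == 'nan').sum())

# Alternative: the nullable string dtype keeps the missing value as <NA>
as_string = names.astype('string')
print('still null with string dtype:', as_string.isna().sum())
```

If the literal 'nan' values are a concern downstream, they can be filtered with a comparison such as `customers['first_name'] == 'nan'`.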

## 04. Combine customer data with ords_prods_merged

In [40]:
# Import ords_prods_merged

ords_prods_merged = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data','orders_products_merged.pkl'))

In [41]:
# Check outputs

ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,prices,price_range_loc,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price,spending_flag,median_days_since,frequency_flag
0,2539329,1,1,2,8,,196,1,0,Soda,...,9.0,Mid-range product,Regularly busy days,Average Orders,10,New customer,6.367797,Low Spender,20.5,Infrequent
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,9.0,Mid-range product,Least busy days,Average Orders,10,New customer,6.367797,Low Spender,20.5,Infrequent
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,9.0,Mid-range product,Least busy days,Most Orders,10,New customer,6.367797,Low Spender,20.5,Infrequent
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,9.0,Mid-range product,Least busy days,Average Orders,10,New customer,6.367797,Low Spender,20.5,Infrequent
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,9.0,Mid-range product,Least busy days,Most Orders,10,New customer,6.367797,Low Spender,20.5,Infrequent


In [47]:
ords_prods_merged.shape

(32404859, 22)

In [42]:
customers.head()

Unnamed: 0,user_id,first_name,last_name,gender,state,age,date_joined,num_of_dependants,marital_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


In [48]:
customers.shape

(206209, 10)

In [43]:
# Check ords_prods_merged data types 

ords_prods_merged.dtypes

order_id                    int64
user_id                     int64
order_number                int64
order_day_of_week           int64
order_hour_of_day           int64
days_since_prior_order    float64
product_id                  int64
add_to_cart_order           int64
reordered                   int64
product_name               object
aisle_id                    int64
department_id               int64
prices                    float64
price_range_loc            object
busiest_days               object
busiest_period_of_day      object
max_order                   int64
loyalty_flag               object
avg_price                 float64
spending_flag              object
median_days_since         float64
frequency_flag             object
dtype: object

user_id is an integer in both dataframes. Let's combine!

In [52]:
# Merge ords_prods_merged with customers (merge() defaults to an inner join on user_id)

df_merged = ords_prods_merged.merge(customers, on = 'user_id', indicator = True)

In [53]:
# Confirm the results of the merge

df_merged['_merge'].value_counts()

both          32404859
left_only            0
right_only           0
Name: _merge, dtype: int64

In [54]:
df_merged.shape

(32404859, 32)
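A note on reading these results: because `merge` defaults to an inner join, the `_merge` flag can only ever say `both`, so the zero counts for `left_only` and `right_only` are guaranteed rather than informative. What does carry information is the shape: the row count matches ords_prods_merged (32,404,859), which suggests no order rows were dropped, and the 32 columns are the 22 from the orders data plus 10 from customers, minus the shared user_id, plus `_merge`. The toy sketch below (all frames and values hypothetical) contrasts the inner join with `how = 'outer'`, which would surface unmatched rows on either side.

```python
import pandas as pd

# Hypothetical data: user 4 has an order but no customer record,
# and user 3 is a customer with no orders
orders_demo = pd.DataFrame({'user_id': [1, 1, 2, 4], 'order_id': [10, 11, 12, 13]})
customers_demo = pd.DataFrame({'user_id': [1, 2, 3], 'state': ['Iowa', 'Idaho', 'Maryland']})

# Inner join (the default): unmatched rows are silently dropped
inner = orders_demo.merge(customers_demo, on = 'user_id', indicator = True)

# Outer join: unmatched rows are kept and flagged
outer = orders_demo.merge(customers_demo, on = 'user_id', how = 'outer', indicator = True)

print(inner['_merge'].value_counts())
print(outer['_merge'].value_counts())
```

With the outer join, the order for user 4 is flagged `left_only` and customer 3 is flagged `right_only`; the inner join simply leaves both out.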

## 05. Export combined data

In [58]:
# Given the large size, exporting as pickle:

df_merged.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'orders_products_customers.pkl'))
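As an optional sanity check after exporting, the file can be read back and compared with the dataframe in memory. The sketch below does this on a tiny stand-in frame in a temporary folder (the frame, file name, and folder are all illustrative), so it does not touch the project data.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for df_merged
df_demo = pd.DataFrame({'user_id': [1, 2], 'income': [165665, 59285]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'orders_products_customers_demo.pkl')

    # Export, then read the file straight back in
    df_demo.to_pickle(out_path)
    reloaded = pd.read_pickle(out_path)

# True when values, column names, and dtypes all survived the round trip
print(reloaded.equals(df_demo))
```

On the full 32-million-row dataframe the reload takes a while, so checking just the shape of the reloaded data may be enough.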