# 4.9.1 IC_Intro to Data Visualization with Python_Part 1

#### Context:
- In this task, you’ll revisit some of the fundamental data preparation and combination techniques you learned in earlier Exercises as you incorporate an additional dataframe into your project. Then, you’ll move on to generating visualizations for your analysis.
- The senior Instacart officers have given you a new data set of customer information to go along with your product and order data. 
    - In part 1 of the task, you’ll need to incorporate this new data set into your project. 
    - In part 2, you’ll create some visualizations, conduct some exploratory analysis, and begin wrapping up everything you’ve done in this Achievement in preparation for the final task in the next Exercise, where you’ll write up a report for your client.

#### Directions Part 1
- 1. Download the customer data set and add it to your “Original Data” folder.
- 2. Create a new notebook in your “Scripts” folder for part 1 of this task.
- 3. Import your analysis libraries, as well as your new customer data set as a dataframe.
- 4. Wrangle the data so that it follows consistent logic; for example, rename columns with illogical names and drop columns that don’t add anything to your analysis.
- 5. Complete the fundamental data quality and consistency checks you’ve learned throughout this Achievement; for example, check for and address missing values and duplicates, and convert any mixed-type data.
- 6. Combine your customer data with the rest of your prepared Instacart data. (Hint: Make sure the key columns are the same data type!)
- 7. Ensure your notebook contains logical titles, section headings, and descriptive code comments.
- 8. Export this new dataframe as a pickle file so you can continue to use it in the second part of this task.
- 9. Save your notebook so that you can send it to your tutor for review after completing part 2.


## This script contains the following points:

#### 0. Importing Libraries
#### 1. Loading and Checking the Data
#### 2. Wrangling the Data
#### 3. Data Quality and Consistency Checks 
#### 4. Combining Customer Data with Previously Prepared Data
#### 5. Exporting the New Dataframe as a Pickle


## 0. Importing Libraries

In [1]:
# Import analysis libraries: pandas, NumPy, os, and SciPy
# Import visualization libraries: matplotlib and seaborn

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

## 1. Loading and Checking the Data

In [2]:
# Define the path to the main project folder and store it in the variable 'path'

path = r'/Users/pau/06-05-2024 Instacart Basket Analysis'

#### Loading the "customers.csv" data set into the notebook as df_customers, building the file path with the os library

In [3]:
# Load "customers.csv" from the "Original Data" folder as "df_customers"

df_customers = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'customers.csv'), index_col = False)
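As an optional variation (not what this notebook does), the key column could be read as a string at load time, which would remove the need for a later type conversion. The sketch below uses a hypothetical two-row sample in place of "customers.csv"; the `dtype` argument of `pd.read_csv` is the only part that matters.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for customers.csv
sample_csv = io.StringIO(
    "user_id,First Name,date_joined\n"
    "26711,Deborah,1/1/2017\n"
    "33890,,1/1/2017\n"
)

# Read the key column as text so it can later be merged with another text key
df_sample = pd.read_csv(sample_csv, dtype={"user_id": str}, index_col=False)

print(df_sample.dtypes)
print(df_sample.isnull().sum())  # the empty first name is still reported as missing
```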

#### Checking the dimensions of the imported dataframe and if the data is correctly loaded

In [4]:
# Checking "customers.csv" data is correctly loaded

print(df_customers.head()) # Preview the first rows to confirm nothing looks off in the imported dataframe
print(df_customers.info())
df_customers.shape # Confirm the total size of the imported dataframe before deciding how to proceed

   user_id First Name    Surnam  Gender       STATE  Age date_joined  \
0    26711    Deborah  Esquivel  Female    Missouri   48    1/1/2017   
1    33890   Patricia      Hart  Female  New Mexico   36    1/1/2017   
2    65803    Kenneth    Farley    Male       Idaho   35    1/1/2017   
3   125935   Michelle     Hicks  Female        Iowa   40    1/1/2017   
4   130797        Ann   Gilmore  Female    Maryland   26    1/1/2017   

   n_dependants fam_status  income  
0             3    married  165665  
1             0     single   59285  
2             2    married   99568  
3             0     single   42049  
4             1    married   40374  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206209 entries, 0 to 206208
Data columns (total 10 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       206209 non-null  int64 
 1   First Name    194950 non-null  object
 2   Surnam        206209 non-null  object
 3   Gender        

(206209, 10)

## 2. Wrangling the Data

#### Wrangle the data so that it follows consistent logic:

- **rename columns** with illogical names
- **drop columns** that don’t add anything to our analysis
- change wrong **data types** 
- **transpose** data if needed.
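A minimal sketch of the first two wrangling steps on a toy dataframe. The column "unused_col" is hypothetical and only stands in for a column that adds nothing to the analysis; the other names mirror the customer data.

```python
import pandas as pd

# Toy dataframe with inconsistent column names and one hypothetical unused column
df_toy = pd.DataFrame({
    "First Name": ["Deborah", "Patricia"],
    "Surnam": ["Esquivel", "Hart"],
    "STATE": ["Missouri", "New Mexico"],
    "unused_col": [0, 0],
})

# Rename to consistent snake_case names
df_toy = df_toy.rename(columns={"First Name": "first_name", "Surnam": "surname", "STATE": "state"})

# Drop the column that is not needed
df_toy = df_toy.drop(columns=["unused_col"])

print(df_toy.columns.tolist())
```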

In [5]:
# Renaming columns for consistency and clarity

df_customers.rename(columns={'First Name': 'first_name', 'Surnam': 'surname', 'Gender': 'gender', 'STATE': 'state',
                             'Age': 'age', 'n_dependants': 'dependants', 'fam_status': 'family_status'}, inplace=True)

In [6]:
# Verify the names of the columns after making changes

print(df_customers.columns)

Index(['user_id', 'first_name', 'surname', 'gender', 'state', 'age',
       'date_joined', 'dependants', 'family_status', 'income'],
      dtype='object')


In [7]:
# Check data types and make adjustments as needed

print(df_customers.dtypes)

user_id           int64
first_name       object
surname          object
gender           object
state            object
age               int64
date_joined      object
dependants        int64
family_status    object
income            int64
dtype: object


In [8]:
# Convert "user_id" data type to "string"

df_customers['user_id'] = df_customers['user_id'].astype(str)

In [9]:
# Convert "date_joined" to data type datetime

df_customers['date_joined'] = pd.to_datetime(df_customers['date_joined'])
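The conversion above lets pandas infer the date layout. A possible alternative is to state the layout explicitly; the sketch below assumes the values are month/day/year, which is an assumption based on the sample values (for example "1/1/2017") and the later descriptive statistics, not something the data set documents.

```python
import pandas as pd

# Sample strings in the same style as the "date_joined" column
raw_dates = pd.Series(["1/1/2017", "10/23/2017", "4/1/2020"])

# Assumed layout: month/day/year
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y")

print(parsed)
```

With an explicit format, a value that does not fit the layout raises an error instead of being silently parsed in a different way.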

In [10]:
# Check data types after making changes

print(df_customers.dtypes)

user_id                  object
first_name               object
surname                  object
gender                   object
state                    object
age                       int64
date_joined      datetime64[ns]
dependants                int64
family_status            object
income                    int64
dtype: object


## 3. Data Quality and Consistency Checks 

#### Complete the fundamental data quality and consistency checks:
- Find and address **mixed type** variables in the dataframe
- Find and address **missing values** in the dataframe
- Find and address **duplicate values** in the dataframe
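The three checks can be sketched on a toy dataframe as follows. The data is made up, and the mixed-type test used here (counting distinct Python types per column) is a simplified stand-in for the loop used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data: one missing name pattern and one fully duplicated row
df_toy = pd.DataFrame({
    "user_id": [1, 2, 2],
    "first_name": ["Ann", np.nan, np.nan],
    "age": [26, 40, 40],
})

# Mixed types: a column is flagged when its values have more than one Python type
mixed_cols = [col for col in df_toy.columns if df_toy[col].map(type).nunique() > 1]

# Missing values per column
missing_counts = df_toy.isnull().sum()

# Fully duplicated rows
n_duplicates = df_toy.duplicated().sum()

print(mixed_cols)
print(missing_counts)
print("Duplicates:", n_duplicates)
```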

In [11]:
# Check for mixed data types in each column

for col in df_customers.columns.tolist():
    weird = (df_customers[[col]].map(type) != df_customers[[col]].iloc[0].apply(type)).any(axis = 1)
    if len (df_customers[weird]) > 0:
        print (col)

first_name


In [12]:
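# The "first_name" column mixes strings with missing values (NaN), so cast it to "string"
# Note: astype('str') turns each NaN into the text 'nan', so the missing-value check further down reports 0 for this column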

df_customers['first_name'] = df_customers['first_name'].astype('str')
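Casting a column that contains NaN to string can hide the gaps: under the pandas behaviour reflected in this notebook's output, the missing names become the text 'nan' and are no longer counted by `isnull()`. One possible alternative, sketched below, is to fill the gaps with an explicit label first. The label "Unknown" is a hypothetical choice, not something this project prescribes.

```python
import numpy as np
import pandas as pd

# Toy column with one missing first name
names = pd.Series(["Deborah", np.nan, "Ann"], dtype="object")

# Casting directly: inspect how the missing entry is represented afterwards
as_text = names.astype(str)
print(as_text.tolist())
print("Missing after astype(str):", as_text.isnull().sum())

# Alternative: make the gap explicit with a placeholder label (hypothetical label)
flagged = names.fillna("Unknown")
print(flagged.tolist())
```

Either way, the count of originally missing names is worth recording before the conversion, since it can no longer be recovered from `isnull()` afterwards.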

In [13]:
# Check data types after making changes

print(df_customers.dtypes)

user_id                  object
first_name               object
surname                  object
gender                   object
state                    object
age                       int64
date_joined      datetime64[ns]
dependants                int64
family_status            object
income                    int64
dtype: object


In [14]:
# Re-check for mixed data types after the conversion (no output means none remain)

for col in df_customers.columns.tolist():
    weird = (df_customers[[col]].map(type) != df_customers[[col]].iloc[0].apply(type)).any(axis = 1)
    if len (df_customers[weird]) > 0:
        print (col)

In [15]:
# Check for missing values

print(df_customers.isnull().sum())

user_id          0
first_name       0
surname          0
gender           0
state            0
age              0
date_joined      0
dependants       0
family_status    0
income           0
dtype: int64


In [16]:
# Check for duplicates

print("Duplicates:", df_customers.duplicated().sum())

Duplicates: 0


In [17]:
# Check the descriptive stats for anything unusual

df_customers.describe()

| | age | date_joined | dependants | income |
|---|---|---|---|---|
| count | 206209.0 | 206209 | 206209.0 | 206209.0 |
| mean | 49.501646 | 2018-08-17 03:06:30.029532928 | 1.499823 | 94632.852548 |
| min | 18.0 | 2017-01-01 00:00:00 | 0.0 | 25903.0 |
| 25% | 33.0 | 2017-10-23 00:00:00 | 0.0 | 59874.0 |
| 50% | 49.0 | 2018-08-16 00:00:00 | 1.0 | 93547.0 |
| 75% | 66.0 | 2019-06-10 00:00:00 | 3.0 | 124244.0 |
| max | 81.0 | 2020-04-01 00:00:00 | 3.0 | 593901.0 |
| std | 18.480962 | | 1.118433 | 42473.786988 |


## 4. Combining Customer Data with Previously Prepared Data

#### Loading the prepared Instacart data from task 4.8 ("ords_prods_merge_new_var_group_agg.pkl") as "df_ords_prods_new"

In [18]:
# Load the most up-to-date version of the previously prepared data as "df_ords_prods_new"

df_ords_prods_new = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_merge_new_var_group_agg.pkl'))


#### Checking the dimensions of the imported dataframe and if the data is correctly loaded

In [19]:
# Checking "ords_prods_merge_new_var_group_agg.pkl" data is correctly loaded

print(df_ords_prods_new.head()) 
print(df_ords_prods_new.info()) 
df_ords_prods_new.shape 

   product_id                product_name  aisle_id  department_id  prices  \
0           1  Chocolate Sandwich Cookies        61             19     5.8   
1           1  Chocolate Sandwich Cookies        61             19     5.8   
2           1  Chocolate Sandwich Cookies        61             19     5.8   
3           1  Chocolate Sandwich Cookies        61             19     5.8   
4           1  Chocolate Sandwich Cookies        61             19     5.8   

   order_id  user_id  order_number  orders_day_of_week  order_hour_of_day  \
0   3139998      138            28                   6                 11   
1   1977647      138            30                   6                 17   
2    389851      709             2                   0                 21   
3    652770      764             1                   3                 13   
4   1813452      764             3                   4                 17   

   ...    price_range_loc     busiest_day  busiest_days  \
0  ...  M

(32404859, 25)

#### Combining our customer data with the rest of our prepared Instacart data. 

- 1. Find the key (common identifier) column that links the two data sets:
    - Both dataframes have a "user_id" column, which we can use to combine them.
    - The "user_id" values should match fully across both dataframes.
- 2. Make sure the key columns are the same data type before combining:
    - "user_id" in the "df_ords_prods_new" dataframe must first be converted to "string" to match the "user_id" column in "df_customers".
    - All identifier columns in the dataframe can be converted to "string" in the same step:
        - "product_id"
        - "aisle_id"
        - "department_id"
        - "order_id"
        - "user_id"
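The reason the key types must agree can be sketched on two toy dataframes (made-up values): an integer key on one side and a text key on the other do not line up, so the key is converted first. The `validate="m:1"` argument is an optional extra safeguard, not used in this notebook, that checks each customer appears only once on the right-hand side.

```python
import pandas as pd

# Toy stand-ins: integer key on the order side, text key on the customer side
orders = pd.DataFrame({"user_id": [138, 138, 709], "order_id": [1, 2, 3]})
customers = pd.DataFrame({"user_id": ["138", "709"], "state": ["Minnesota", "Vermont"]})

# Merging on mismatched key types is expected to be refused by pandas
try:
    orders.merge(customers, on="user_id")
except ValueError as err:
    print("Merge refused:", err)

# Align the key type, then merge; the indicator column records where each row came from
orders["user_id"] = orders["user_id"].astype(str)
merged = orders.merge(customers, on="user_id", how="inner", validate="m:1", indicator=True)

print(merged)
```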

In [20]:
# Convert the identifier columns in "df_ords_prods_new" to "string"

id_cols = ['product_id', 'aisle_id', 'department_id', 'order_id', 'user_id']
df_ords_prods_new[id_cols] = df_ords_prods_new[id_cols].astype(str)

In [21]:
# Check the results of the change

print(df_ords_prods_new.dtypes)

product_id                  object
product_name                object
aisle_id                    object
department_id               object
prices                     float64
order_id                    object
user_id                     object
order_number                 int64
orders_day_of_week           int64
order_hour_of_day            int64
days_since_prior_order     float64
is_first_order               int64
add_to_cart_order            int64
reordered                    int64
_merge                    category
price_range_loc             object
busiest_day                 object
busiest_days                object
busiest_period_of_day       object
max_order                    int64
loyalty_flag                object
mean_product_price         float64
spending_flag               object
order_frequency            float64
frequency_flag              object
dtype: object


In [22]:
# Drop the existing "_merge" column from "df_ords_prods_new"

df_ords_prods_new = df_ords_prods_new.drop(columns=['_merge'])

In [23]:
# Check the columns after the change

print(df_ords_prods_new.columns)

Index(['product_id', 'product_name', 'aisle_id', 'department_id', 'prices',
       'order_id', 'user_id', 'order_number', 'orders_day_of_week',
       'order_hour_of_day', 'days_since_prior_order', 'is_first_order',
       'add_to_cart_order', 'reordered', 'price_range_loc', 'busiest_day',
       'busiest_days', 'busiest_period_of_day', 'max_order', 'loyalty_flag',
       'mean_product_price', 'spending_flag', 'order_frequency',
       'frequency_flag'],
      dtype='object')


In [24]:
# Merge the two dataframes using the default inner join

df_final_merged = df_ords_prods_new.merge(df_customers, on='user_id', indicator=True)

In [25]:
# Confirming the results of the merge

df_final_merged.head()

| | product_id | product_name | aisle_id | department_id | prices | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | ... | first_name | surname | gender | state | age | date_joined | dependants | family_status | income | _merge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 | 3139998 | 138 | 28 | 6 | 11 | ... | Charles | Cox | Male | Minnesota | 81 | 2019-08-01 | 1 | married | 49620 | both |
| 1 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 | 1977647 | 138 | 30 | 6 | 17 | ... | Charles | Cox | Male | Minnesota | 81 | 2019-08-01 | 1 | married | 49620 | both |
| 2 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 | 389851 | 709 | 2 | 0 | 21 | ... | Deborah | Glass | Female | Vermont | 66 | 2018-06-16 | 2 | married | 158302 | both |
| 3 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 | 652770 | 764 | 1 | 3 | 13 | ... | Heather | Myers | Female | Wisconsin | 40 | 2020-02-09 | 3 | married | 31308 | both |
| 4 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 | 1813452 | 764 | 3 | 4 | 17 | ... | Heather | Myers | Female | Wisconsin | 40 | 2020-02-09 | 3 | married | 31308 | both |


In [26]:
# Check the details of the new "df_final_merged" dataframe

print(df_final_merged.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32404859 entries, 0 to 32404858
Data columns (total 34 columns):
 #   Column                  Dtype         
---  ------                  -----         
 0   product_id              object        
 1   product_name            object        
 2   aisle_id                object        
 3   department_id           object        
 4   prices                  float64       
 5   order_id                object        
 6   user_id                 object        
 7   order_number            int64         
 8   orders_day_of_week      int64         
 9   order_hour_of_day       int64         
 10  days_since_prior_order  float64       
 11  is_first_order          int64         
 12  add_to_cart_order       int64         
 13  reordered               int64         
 14  price_range_loc         object        
 15  busiest_day             object        
 16  busiest_days            object        
 17  busiest_period_of_day   object        
 18  

In [27]:
# Checking the results of the merged data

df_final_merged.shape

(32404859, 34)

In [28]:
# Check "value_counts" after inner join

df_final_merged['_merge'].value_counts()

_merge
both          32404859
left_only            0
right_only           0
Name: count, dtype: int64

#### Checking the resulting rows and columns:

#### -  df_ords_prods_new
(32404859, 24) after dropping the existing '_merge' column (25 before the drop)

#### -  df_customers 
(206209, 10)


#### - df_final_merged
(32404859, 34): 24 + 10 columns, minus the shared 'user_id' key, plus the new '_merge' indicator
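The column count of a merged dataframe can be predicted before looking at the result: the columns of both inputs, minus one for the shared key, plus one for the indicator column. A small sketch with made-up dataframes:

```python
import pandas as pd

# Toy stand-ins for the order data (left) and the customer data (right)
left = pd.DataFrame({"user_id": ["1", "1", "2"], "prices": [5.8, 5.8, 3.1]})
right = pd.DataFrame({"user_id": ["1", "2"], "state": ["Idaho", "Iowa"], "age": [35, 40]})

merged = left.merge(right, on="user_id", indicator=True)

# Left columns + right columns - 1 shared key + 1 "_merge" indicator
expected_cols = left.shape[1] + right.shape[1] - 1 + 1

print(merged.shape, expected_cols)
```

When every key on the left finds a match, as in this toy case, the row count of the result equals the row count of the left dataframe.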

## 5. Exporting the New Dataframe as a Pickle

In [29]:
# Export the "df_final_merged" dataframe as "ords_prods_cust_merge" for use in Part 2

df_final_merged.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'ords_prods_cust_merge.pkl'))
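A quick way to confirm an export worked is to read the file back and compare it with the dataframe in memory. The sketch below does this with a small made-up dataframe in a temporary folder; the file name "demo.pkl" is hypothetical. On the full merged data the same check would be slow and memory-heavy, so comparing shapes may be enough there.

```python
import os
import tempfile
import pandas as pd

# Small made-up dataframe with a text key and a datetime column
df_demo = pd.DataFrame({
    "user_id": ["138", "709"],
    "date_joined": pd.to_datetime(["2019-08-01", "2018-06-16"]),
})

# Write to a temporary folder, then read the pickle back
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "demo.pkl")
    df_demo.to_pickle(pkl_path)
    df_back = pd.read_pickle(pkl_path)

# Pickle keeps values and data types, so the round trip should be lossless
same = df_back.equals(df_demo)
print(same, df_back.shape)
```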