Instacart, an online grocery store that operates through an app.
Instacart already has very good sales, but they want to uncover more information about their sales patterns.
My task was to perform an initial data and exploratory analysis of some of their data in order to derive insights and suggest strategies for better segmentation based on the provided criteria.
- Exploratory Analysis using open-source data sets.
- Provided business insights and strategy suggestions for marketing and segmentation.
- Data set analysis to detect patterns
- Data wrangling
- Data merging
- Deriving variables
- Grouping data
- Aggregating data
- Python, creating scripts to perform data wrangling and merge data frames.
- Python, Data visualization.
- Reporting in Excel
Or Refer to the short video Jupyter Notebook with the code.
Some code and visualization options are shown here, if You want to see the full report look this file
Table of Contents
- Data importing
- Data checks and merging dataframes
- PII privacy checks
- Assign regions
#creating region flags with loc
df.loc[df['state'].isin(northeast), 'region'] = 'Northeast'
df.loc[df['state'].isin(midwest), 'region'] = 'Midwest'
df.loc[df['state'].isin(south), 'region'] = 'South'
df.loc[df['state'].isin(west), 'region'] = 'West'
- Flag and exclude customers with less than 5 orders
# creating activity flag using loc
df.loc[df['max_order']< 5, 'activity_flag'] = 'Low Activity'
df.loc[df['max_order'] >= 5, 'activity_flag'] = 'High Activity'.
- Creating customer profile
6a. Age Profile
# creating age group profile
df_high_activity.loc[df_high_activity['age']<=30, 'age_profile']= 'Young Adult'
df_high_activity.loc[(df_high_activity['age']>30) & (df_high_activity['age']<65), 'age_profile']= "Mid Adult"
df_high_activity.loc[df_high_activity['age']>=65, 'age_profile']= 'Senior'
df_high_activity['age_profile'].value_counts()
6b. Parental Age Profile
# create parental age profile for different groups
df_high_activity.loc[(df_high_activity['age']<30) & (df_high_activity['no_of_dependents']>=1), 'parental_age_profile']= 'Young Parent'
df_high_activity.loc[(df_high_activity['age']>=30) & (df_high_activity['age']<40) & (df_high_activity['no_of_dependents']>=1), 'parental_age_profile']= 'Mid Parent'
df_high_activity.loc[(df_high_activity['age']>=40) & (df_high_activity['no_of_dependents']>=1), 'parental_age_profile']= 'Older Parent'
df_high_activity['parental_age_profile'].value_counts()
6c. Single Parent Profile (Dependants Profile)
crosstabfam= pd.crosstab(df_high_activity['marital_status'], df_high_activity['no_of_dependents'], dropna=False)
crosstabfam
# create parental-relationship status flag for different groups
df_high_activity.loc[(df_high_activity['marital_status'] == 'single') & (df_high_activity['no_of_dependents']>=1), 'parental_relationship_status']= 'Single Parent'
df_high_activity.loc[(df_high_activity['marital_status'] == 'divorved/widowed') & (df_high_activity['no_of_dependents']>=1), 'parental_relationship_status']= 'Single Parent'
df_high_activity.loc[(df_high_activity['marital_status'] == 'living with parents and siblings') & (df_high_activity['no_of_dependents']>=1), 'parental_relationship_status']= 'Single Parent'
df_high_activity.loc[(df_high_activity['marital_status'] =='married') & (df_high_activity['no_of_dependents']>=1), 'parental_relationship_status']= 'Nuclear Family'
df_high_activity['parental_relationship_status'].value_counts()
6d. Relationship-Age Profile
# create a relationship-age flag
crosstab_ra= pd.crosstab(df_high_activity['age_profile'], df_high_activity['marital_status'], dropna=False)
crosstab_ra
df_high_activity.loc[(df_high_activity['age_profile']=='Mid Adult') & (df_high_activity['marital_status'] == 'divorced/widowed'), 'relationship_age_status']= 'Single Mid Adult'
df_high_activity.loc[(df_high_activity['age_profile']=='Mid Adult') & (df_high_activity['marital_status'] == 'single'), 'relationship_age_status']= 'Single Mid Adult'
df_high_activity.loc[(df_high_activity['age_profile']=='Mid Adult') & (df_high_activity['marital_status'] == 'living with parents and siblings'), 'relationship_age_status']= 'Single Mid Adult'
df_high_activity.loc[(df_high_activity['age_profile']=='Mid Adult') & (df_high_activity['marital_status']=='married'), 'relationship_age_status']= 'Married Mid Adult'
df_high_activity.loc[(df_high_activity['age_profile']=='Senior') & (df_high_activity['marital_status'] == 'divorced/widowed'), 'relationship_age_status']= 'Single Senior'
df_high_activity.loc[(df_high_activity['age_profile']=='Senior') & (df_high_activity['marital_status'] == 'single'), 'relationship_age_status']= 'Single Senior'
df_high_activity.loc[(df_high_activity['age_profile']=='Senior') & (df_high_activity['marital_status'] == 'living with parents and siblings'), 'relationship_age_status']= 'Single Senior'
df_high_activity.loc[(df_high_activity['age_profile']=='Senior') & (df_high_activity['marital_status']=='married'), 'relationship_age_status']= 'Married Senior'
df_high_activity.loc[(df_high_activity['age_profile']=='Young Adult') & (df_high_activity['marital_status'] == 'divorced/widowed'), 'relationship_age_status']= 'Single Young Adult'
df_high_activity.loc[(df_high_activity['age_profile']=='Young Adult') & (df_high_activity['marital_status'] == 'single'), 'relationship_age_status']= 'Single Young Adult'
df_high_activity.loc[(df_high_activity['age_profile']=='Young Adult') & (df_high_activity['marital_status'] == 'living with parents and siblings'), 'relationship_age_status']= 'Single Young Adult'
df_high_activity.loc[(df_high_activity['age_profile']=='Young Adult') & (df_high_activity['marital_status']=='married'), 'relationship_age_status']= 'Married Young Adult'
6e. Income Profile.
# creating Income profile
df_high_activity.loc[df_high_activity['income']<=50000, 'income_profile']= 'Low Class'
df_high_activity.loc[(df_high_activity['income']>51000) & (df_high_activity['income']<=150000), 'income_profile']= "Mid Class"
df_high_activity.loc[df_high_activity['income']>=151000, 'income_profile']= 'High Class'
df_high_activity['income_profile'].value_counts()
6f. Gender Profile
6g. Deparments Profiles - Alcohol, Baby items, International items
7. Creating an appropriate visualization to show the distribution of profiles
7a. Age Profile
7b. Parental Age Profile.
7c. Single Parent Profile (Dependants Profile).
7d. Relationship-Age Profile
7e. Income Profile
7f. Gender Profile
7g. Deparments Profiles - Alcohol, Baby items, International items
df_high_activity['alcohol_flag']= [1 if x== 'alcohol' else 0 for x in df_high_activity['department']]
df_high_activity['alcohol_flag']= df_high_activity.groupby('user_id')['alcohol_flag'].transform(np.max)
df_high_activity['alcohol_flag'].value_counts(dropna=False)
- Aggregating the max, mean, and min variables on a customer-profile level for usage frequency and expenditure
8a. Age Profile
# aggregate for age profile expenditure
df_unique_customers.groupby('age_profile').agg({'prices': ['mean', 'min', 'max']})
8b. Parental Age Profile
8c. Single Parent Profile (Dependants Profile)
8d. Income Profile
8e. Relationship-Age Profile
8f. Gender Profile
8g. Deparments Profiles - Alcohol, Baby items, International items.
9. Comparison Visualizations customer profiles with regions and departments
9a. Region Distribution
# Total customers across regions
bar_total_customers=df_unique_customers['region'].value_counts(dropna=False).sort_values(ascending=True).plot.barh(color='teal', fontsize=15)
plt.title('Total customers in each region', fontsize=18)
plt.xlabel('Number of customers', fontsize=15)
plt.legend(fontsize=15)
Output
9b. Order Frequency Distribution
9c. Price Range Distribution
# Grouped bar chart of price range products by regions
bar_ppr_region=ppr_region.plot.bar(color=['teal', 'goldenrod', 'darkkhaki'], fontsize=12)
plt.title('Distribution price ranges in US regions', fontsize=20)
plt.legend(ncol=3, fontsize=12)
Output
9d. Distribution Departments by Regions
# create crosstab for Departments and Regions
orders_region=pd.crosstab(index=df_unique_customers['department'], columns=df_unique_customers['region'], normalize='index')
dept_region
# create stacked bar chart for Departments by Regions
dept_region_stacked=dept_region.plot(kind='bar',
stacked=True,
color=['darkseagreen', 'olivedrab', 'teal', 'powderblue', 'bisque', 'orange'],
figsize=(12,10))
plt.title('Distribution of Departments by Region', fontsize = 20)
plt.legend(loc='upper center', ncol=4)
- Basic Visualisation
10a. Departments
# Distribution by Departments
department_bar=df_high_activity['department'].value_counts(dropna=False).sort_values(ascending=True).plot.barh(color=['olivedrab'], fontsize=12)
plt.title('Distribution by Deparments', fontsize=20)
plt.xlabel('The Amount of Products Purchased', fontsize=15)
Output
10b. Product Price Ranges
#create pie chart for product price range distribution
price_pie=df_high_activity['price_range_loc'].value_counts().plot.pie(shadow=True,
startangle=90,
autopct=('%1.f%%'),
colors=['steelblue', 'Sienna', 'paleturquoise'],
label='')
Output
10c. Customer Loyalty
10d. Customer Spending
10e. Order Frequency
11. Exporting Final Dataset to_pickle.
- bar_chart_orders_day_of_week.
- bar_chart_parental_age_distribution.
- bar_chart_parental_age_region_stacked
- bar_ppr_region
- bar_relationship_age_distribution
- bar_relationship_single_parent_distribution
- bar_total_customers
- gender_distribution_pie
- histogram_income
- hist_prices
- income_distribution_hist
- income_distribution_of_customres
- income_distribution
- line_dow_prices
- line_fam_status_age
- line_hour_of_day
- pei_alcohol_distribution
- pie_baby_distribution
- pie_customer_spending
- pie_international_distribution
- pie_loyalty_to_brand
- pie_order_frequency
- pie_pet_distribution
- price_range_region_stacked
- splot_income
- splot_prices
- stacked_age_order_freq
- stacked_department_region
- stacked_family_order_freq
- stacked_family_stat_price_range
- stacked_parental_relationship_region
- stacked_order_frequency_region
After reviewing all the charts above, we should focus on showing ads to 3 main groups: young adults, young parents, and people from the upper economic class. As stated earlier, you adults and parents may be struggling to afford using Instacart, so we can create a discount program for students or people with dependents. In addition, we can show more ads that target people from the upper classes because they have the money to afford more expensive products and can increase the company's revenue. They already have the funds, we just need to create an easy environment to keep our customers coming back to us and not to our competitors.