In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id="top"></a>
# <div style="padding:20px;color:white;margin:0;font-size:35px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#254E58;overflow:hidden"><b>Table of Contents</b></div>

<div style="background-color:aliceblue; padding:30px; font-size:15px;color:#034914">
    
* [1. Introduction](#1)
    - [Problem statement](#1.1)
    - [Data description](#1.2)
    
* [2. Import Libraries](#2) 
    
* [3. Basic Exploration](#3)
    - [Read dataset](#3.1)
    - [Some information](#3.2)
    - [Data transformation](#3.3)
    - [Data visualization](#3.4)
* [4. Machine Learning model](#4)
    
* [5. Conclusion](#5)

* [6. Author Message](#6)

</div>

<a id="1"></a>
# <div style="padding:20px;color:white;margin:0;font-size:35px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#254E58;overflow:hidden"><b>Introduction</b></div>

<a id="1.1"></a>
<h2 style="font-family: Verdana; font-size: 25px; font-style: normal; font-weight: normal; text-decoration: none; text-transform: none; letter-spacing: 2px; color: #155D07; background-color: #ffffff;"><b>Problem</b> statement</h2>

<span style="font-size:14px; font-family:Verdana;"> 
Shopping is a global phenomenon that is influenced by a wide range of factors, including economic conditions, cultural preferences, and technological advancements. In recent years, the way people shop has undergone significant changes due to the rise of e-commerce and the global pandemic, which has led to an increase in online shopping.

Several factors can affect retail sales data, including: <br>

* <b>Economic conditions:</b> The state of the economy can have a significant impact on consumer behavior and spending. During times of economic uncertainty or recession, consumers tend to be more cautious with their spending, leading to a decrease in sales. Conversely, during times of economic growth, consumers tend to have more disposable income and are more likely to spend, leading to an increase in sales.

* <b>Cultural preferences:</b> Cultural differences can affect the types of products that are popular in different regions of the world. For example, certain countries may have a preference for luxury brands, while others may prioritize affordability and practicality. These cultural preferences can have a significant impact on sales data.

* <b>Marketing and advertising:</b> Effective marketing and advertising campaigns can drive sales by creating awareness and interest in a particular product or brand. Companies that invest in effective marketing and advertising strategies are often able to achieve higher sales than those that do not.

* <b>Online shopping:</b> The rise of e-commerce has revolutionized the way people shop, allowing consumers to purchase products from anywhere in the world with just a few clicks. As a result, online sales have become a significant factor in overall sales data, and companies that invest in their online presence are often able to achieve higher sales than those that do not.

* <b>External factors:</b> External factors such as weather patterns, natural disasters, and political unrest can also have a significant impact on sales data. For example, a natural disaster may lead to a decrease in sales due to supply chain disruptions, while political unrest may lead to a decrease in consumer confidence and spending.

Overall, many different factors can affect sales data, and companies that navigate them effectively are often the most successful. I hope this analysis helps you compare how the factors in the dataset interact, and that it prompts your own conclusions or even better ideas.
</span>

<a id="1.2"></a>
<h2 style="font-family: Verdana; font-size: 25px; font-style: normal; font-weight: normal; text-decoration: none; text-transform: none; letter-spacing: 2px; color: #155D07; background-color: #ffffff;"><b>Data</b> description</h2>

<span style="font-size:14px; font-family:Verdana;"> <b>Attribute Information:</b> <br>
* <b>invoice_no:</b> Invoice number. Nominal. A combination of the letter 'I' and a 6-digit integer uniquely assigned to each operation.
* <b>customer_id:</b> Customer number. Nominal. A combination of the letter 'C' and a 6-digit integer uniquely assigned to each operation.
* <b>gender:</b> String variable of the customer's gender.
* <b>age:</b> Positive integer variable of the customer's age.
* <b>category:</b> String variable of the category of the purchased product.
* <b>quantity:</b> The quantities of each product (item) per transaction. Numeric.
* <b>price:</b> Unit price. Numeric. Product price per unit in Turkish Liras (TL).
* <b>payment_method:</b> String variable of the payment method (cash, credit card or debit card) used for the transaction.
* <b>invoice_date:</b> Invoice date. The day when a transaction was generated.
* <b>shopping_mall:</b> String variable of the name of the shopping mall where the transaction was made.
</span>
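The attribute descriptions above imply a few simple checks that can be run on any record. The following is a minimal sketch using only the standard library. The function name `check_row` and the sample records are hypothetical and made up for illustration; the patterns are assumptions read off the descriptions (the letter 'I' or 'C' followed by six digits, a positive integer age).

```python
import re

# Patterns assumed from the attribute descriptions:
# 'I' + 6 digits for invoices, 'C' + 6 digits for customers.
INVOICE_RE = re.compile(r"I\d{6}")
CUSTOMER_RE = re.compile(r"C\d{6}")


def check_row(row):
    """Return the names of the fields in `row` (a dict) that break the described format."""
    problems = []
    if not INVOICE_RE.fullmatch(row["invoice_no"]):
        problems.append("invoice_no")
    if not CUSTOMER_RE.fullmatch(row["customer_id"]):
        problems.append("customer_id")
    if not (isinstance(row["age"], int) and row["age"] > 0):
        problems.append("age")
    return problems


# Made-up records for illustration only; they are not rows from the dataset.
good = {"invoice_no": "I123456", "customer_id": "C654321", "age": 28}
bad = {"invoice_no": "123456", "customer_id": "C65432", "age": 0}

print(check_row(good))
print(check_row(bad))
```

On the real data, the same patterns could be applied column-wise with `Series.str.fullmatch` instead of row by row.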

<a id="2"></a>
# <div style="padding:20px;color:white;margin:0;font-size:35px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#254E58;overflow:hidden"><b>Import Libraries</b></div>

In [None]:
import pandas as pd
import numpy as np
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
from pandas.api.types import CategoricalDtype
print("Setup Complete")

<a id="3"></a>
# <div style="padding:20px;color:white;margin:0;font-size:35px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#254E58;overflow:hidden"><b>Basic Exploration</b></div>

<a id="3.1"></a>
<h2 style="font-family: Verdana; font-size: 25px; font-style: normal; font-weight: normal; text-decoration: none; text-transform: none; letter-spacing: 2px; color: #155D07; background-color: #ffffff;"><b>Read</b> dataset</h2>

In [None]:
def read_dataset(file_path):
    data = pd.read_csv(file_path, index_col = 0)
    return data

In [None]:
data = read_dataset('/kaggle/input/customer-shopping-dataset/customer_shopping_data.csv')

<a id="3.2"></a>
<h2 style="font-family: Verdana; font-size: 25px; font-style: normal; font-weight: normal; text-decoration: none; text-transform: none; letter-spacing: 2px; color: #155D07; background-color: #ffffff;"><b>Some</b> information</h2>

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.nunique()

In [None]:
data.duplicated().any()

<a id="3.3"></a>
<h2 style="font-family: Verdana; font-size: 25px; font-style: normal; font-weight: normal; text-decoration: none; text-transform: none; letter-spacing: 2px; color: #155D07; background-color: #ffffff;"><b>Data</b> transformation</h2>

> <span style='font-size:15px; font-family:Verdana;color: #254E58;'><b>
Missing Data Treatment</b></span>

In [None]:
total_null = data.isnull().sum().sort_values(ascending = False)
percent = ((data.isnull().sum()/data.isnull().count())*100).sort_values(ascending = False)
print("Total records = ", data.shape[0])

missing_data = pd.concat([total_null,percent.round(2)],axis=1,keys=['Total Missing','In Percent'])
missing_data

> <span style='font-size:15px; font-family:Verdana;color: #254E58;'><b>
Duplicated Data Treatment</b></span>

In [None]:
duplicated_data = pd.DataFrame(data.loc[data.duplicated()].count())
duplicated_data.columns = ['Total Duplicate']
duplicated_data

> <span style='font-size:15px; font-family:Verdana;color: #254E58;'><b>
Clean Data</b></span>

In [None]:
print('Gender :', data['gender'].unique().tolist())
print('Category :', data['category'].unique().tolist())
print('Payment method :', data['payment_method'].unique().tolist())
print('Shopping mall :', data['shopping_mall'].unique().tolist())

In [None]:
data['age_group'] = pd.cut(x=data['age'], bins=[0, 16, 30, 45, 100], labels=['Child', 'Young Adults', 'Middle-aged Adults','Old-aged Adults'])
data[['age', 'age_group']].head(10)
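To make the bin edges used above concrete: `pd.cut` closes each interval on the right by default, so an age equal to an edge falls into the lower group. A minimal sketch with made-up ages placed on and around the edges (the ages are illustrative, not taken from the dataset):

```python
import pandas as pd

# Same bins and labels as in the cell above.
bins = [0, 16, 30, 45, 100]
labels = ['Child', 'Young Adults', 'Middle-aged Adults', 'Old-aged Adults']

# Made-up ages chosen to sit on and just past each bin edge.
ages = pd.Series([16, 17, 30, 31, 45, 46])

# Default right=True gives the intervals (0, 16], (16, 30], (30, 45], (45, 100].
groups = pd.cut(ages, bins=bins, labels=labels)

print(pd.DataFrame({'age': ages, 'age_group': groups}))
```

With these edges, a 30-year-old is a "Young Adult" and a 31-year-old is a "Middle-aged Adult"; passing `right=False` would shift each edge value into the upper group instead.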

In [None]:
data['invoice_date'] = pd.to_datetime(data['invoice_date'])
data['day'] = data['invoice_date'].dt.day
data['month'] = data['invoice_date'].dt.strftime('%b')
data['year'] = data['invoice_date'].dt.year.astype('str')
data['day_of_week'] = data['invoice_date'].dt.day_name()

In [None]:
data[['quantity', 'price']].describe().round(2)
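The data description lists `price` as a unit price, so under that reading the amount spent on a row would be `quantity * price` rather than `price` alone. A minimal sketch, assuming that interpretation holds (it is worth checking against the data first). The column name `total_amount` and the toy rows are hypothetical and are not part of the dataset.

```python
import pandas as pd

# Made-up transactions for illustration only; not rows from the dataset.
toy = pd.DataFrame({
    "category": ["Books", "Books", "Shoes"],
    "quantity": [2, 1, 3],
    "price": [15.0, 15.0, 600.0],  # assumed to be the unit price in TL
})

# Hypothetical helper column: amount spent per transaction row.
toy["total_amount"] = toy["quantity"] * toy["price"]

# Compare summing unit prices with summing the amount actually spent.
summary = toy.groupby("category").agg(
    unit_price_sum=("price", "sum"),
    total_amount=("total_amount", "sum"),
)
print(summary)
```

The two columns of `summary` differ whenever a row has a quantity above one, which is the distinction to keep in mind when reading any chart that sums `price` directly.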

In [None]:
data.head()

<a id="3.4"></a>
<h2 style="font-family: Verdana; font-size: 25px; font-style: normal; font-weight: normal; text-decoration: none; text-transform: none; letter-spacing: 2px; color: #155D07; background-color: #ffffff;"><b>Data</b> visualization</h2>

In [None]:
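# Matplotlib 3.6 renamed the 'seaborn' style to 'seaborn-v0_8'; fall back for older versions
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')

fig = plt.figure(figsize=(10,5))

colors = ['steelblue', 'lightcoral']
ax = sns.histplot(data=data, x='age', hue='gender', palette=colors, alpha=0.7)

plt.title("Age Distribution by Gender", pad=10, fontsize=15)
plt.ylabel("Count", labelpad=20)
plt.xlabel("Age", labelpad=20)

# Reposition the legend seaborn already drew; plt.legend() would replace it with an empty one
sns.move_legend(ax, "upper right", title="Gender")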

sns.despine()
plt.tight_layout()

plt.show()

In [None]:
def count_share(column):
    """Count transactions per value of `column` and each value's share in percent."""
    # observed=True skips empty categories (e.g. an unused age group)
    counts = data.groupby(column, observed=True)['customer_id'].count().reset_index()
    counts.columns = [column, 'count']
    counts['percent'] = (counts['count'] / counts['count'].sum() * 100).round(2)
    return counts

df_age_group = count_share('age_group')
df_category = count_share('category')
df_payment_method = count_share('payment_method')
df_shopping_mall = count_share('shopping_mall')


In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(12, 8))

colors = sns.color_palette("Set2")

ax1.pie(df_age_group['percent'], labels=df_age_group['age_group'].tolist(), colors=colors, autopct='%1.1f%%', startangle=90, shadow=True, wedgeprops=dict(width=0.5))
ax1.set_title('Distribution of Values by Age Group', pad=20, fontsize=15)
ax1.axis('equal')
#ax1.legend(loc="upper right", labels=df_age_group['age_group'].tolist(), bbox_to_anchor=(-0.2, 1), ncol=1)

ax2.pie(df_category['percent'], labels=df_category['category'].tolist(), colors=colors, autopct='%1.1f%%', startangle=90, shadow=True, wedgeprops=dict(width=0.5))
ax2.set_title('Distribution of Values by Category', pad=20, fontsize=15)
ax2.axis('equal')
#ax2.legend(loc="upper right", labels=df_category['category'].tolist(), bbox_to_anchor=(-0.2, 1), ncol=1)

ax3.pie(df_payment_method['percent'], labels=df_payment_method['payment_method'].tolist(), colors=colors, autopct='%1.1f%%', startangle=90, shadow=True, wedgeprops=dict(width=0.5))
ax3.set_title('Distribution of Values by Payment Method', pad=20, fontsize=15)
ax3.axis('equal')
#ax3.legend(loc="upper right", labels=df_payment_method['payment_method'].tolist(), bbox_to_anchor=(-0.2, 1), ncol=1)

ax4.pie(df_shopping_mall['percent'], labels=df_shopping_mall['shopping_mall'].tolist(), colors=colors, autopct='%1.1f%%', startangle=90, shadow=True, wedgeprops=dict(width=0.5))
ax4.set_title('Distribution of Values by Shopping Mall', pad=20, fontsize=15)
ax4.axis('equal')
#ax4.legend(loc="upper right", labels=df_shopping_mall['shopping_mall'].tolist(), bbox_to_anchor=(-0.2, 1), ncol=1)

plt.subplots_adjust(hspace=1.0)

plt.tight_layout()
plt.show()


In [None]:
df_day = data.groupby(['gender','day'])['quantity'].sum().reset_index()

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_type = CategoricalDtype(categories=months, ordered=True)

df_month = data.groupby(['gender','month'])['quantity'].sum().reset_index()
df_month['month'] = df_month['month'].astype(month_type)

df_year = data.groupby(['gender','year'])['quantity'].sum().reset_index()

cats = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
cat_type = CategoricalDtype(categories=cats, ordered=True)

df_day_of_week = data.groupby(['gender','day_of_week'])['quantity'].sum().reset_index()
df_day_of_week['day_of_week'] = df_day_of_week['day_of_week'].astype(cat_type)
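The cell above casts month and weekday names to ordered categoricals. A plain string column sorts alphabetically, whereas an ordered `CategoricalDtype` sorts in the order its categories are listed, here calendar order. A minimal sketch with made-up totals (the numbers are illustrative only):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_type = CategoricalDtype(categories=months, ordered=True)

# Made-up monthly totals for illustration only.
toy = pd.DataFrame({"month": ["Apr", "Jan", "Mar", "Feb"],
                    "quantity": [40, 10, 30, 20]})

# As plain strings, sorting is alphabetical.
alphabetical = toy.sort_values("month")["month"].tolist()

# As an ordered categorical, sorting follows the category order.
toy["month"] = toy["month"].astype(month_type)
calendar = toy.sort_values("month")["month"].tolist()

print(alphabetical)
print(calendar)
```

Plotting libraries generally take the axis order from the data, so sorting the frame by the categorical column before plotting is a safe way to get a calendar-ordered x-axis.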

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(16, 10),  sharey = True) 

sns.lineplot(ax = axes[0, 0], data = df_day, x = 'day', y = 'quantity', hue = 'gender', palette = 'Dark2')
sns.lineplot(ax = axes[0, 1], data = df_month, x = 'month', y = 'quantity', hue = 'gender', palette = 'Dark2')
sns.lineplot(ax = axes[1, 0], data = df_year, x = 'year', y = 'quantity', hue = 'gender', palette = 'Dark2')
sns.lineplot(ax = axes[1, 1], data = df_day_of_week, x = 'day_of_week', y = 'quantity', hue = 'gender', palette = 'Dark2')

axes[0, 0].set_title("Quantity by Day", pad=10, fontsize=15)
axes[0, 0].set_ylabel("Number of Products", labelpad=20)
axes[0, 0].set_xlabel("Day", labelpad=20)

axes[0, 1].set_title("Quantity by Month", pad=10, fontsize=15)
axes[0, 1].set_ylabel("Number of Products", labelpad=20)
axes[0, 1].set_xlabel("Month", labelpad=20)

axes[1, 0].set_title("Quantity by Year", pad=10, fontsize=15)
axes[1, 0].set_ylabel("Number of Products", labelpad=20)
axes[1, 0].set_xlabel("Year", labelpad=20)

axes[1, 1].set_title("Quantity by Weekday", pad=10, fontsize=15)
axes[1, 1].set_ylabel("Number of Products", labelpad=20)
axes[1, 1].set_xlabel("Weekday", labelpad=20)

plt.tight_layout()

plt.show()


In [None]:
df_day = data.groupby(['gender','day'])['price'].sum().reset_index()

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_type = CategoricalDtype(categories=months, ordered=True)

df_month = data.groupby(['gender','month'])['price'].sum().reset_index()
df_month['month'] = df_month['month'].astype(month_type)

df_year = data.groupby(['gender','year'])['price'].sum().reset_index()

cats = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
cat_type = CategoricalDtype(categories=cats, ordered=True)

df_day_of_week = data.groupby(['gender','day_of_week'])['price'].sum().reset_index()
df_day_of_week['day_of_week'] = df_day_of_week['day_of_week'].astype(cat_type)

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(16, 10),  sharey = True) 

sns.lineplot(ax = axes[0, 0], data = df_day, x = 'day', y = 'price', hue = 'gender', palette = 'Dark2')
sns.lineplot(ax = axes[0, 1], data = df_month, x = 'month', y = 'price', hue = 'gender', palette = 'Dark2')
sns.lineplot(ax = axes[1, 0], data = df_year, x = 'year', y = 'price', hue = 'gender', palette = 'Dark2')
sns.lineplot(ax = axes[1, 1], data = df_day_of_week, x = 'day_of_week', y = 'price', hue = 'gender', palette = 'Dark2')

axes[0, 0].set_title("Price by Day", pad=10, fontsize=15)
axes[0, 0].set_ylabel("Total Price (TL)", labelpad=20)
axes[0, 0].set_xlabel("Day", labelpad=20)

axes[0, 1].set_title("Price by Month", pad=10, fontsize=15)
axes[0, 1].set_ylabel("Total Price (TL)", labelpad=20)
axes[0, 1].set_xlabel("Month", labelpad=20)

axes[1, 0].set_title("Price by Year", pad=10, fontsize=15)
axes[1, 0].set_ylabel("Total Price (TL)", labelpad=20)
axes[1, 0].set_xlabel("Year", labelpad=20)

axes[1, 1].set_title("Price by Weekday", pad=10, fontsize=15)
axes[1, 1].set_ylabel("Total Price (TL)", labelpad=20)
axes[1, 1].set_xlabel("Weekday", labelpad=20)

plt.tight_layout()

plt.show()


In [None]:
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_type = CategoricalDtype(categories=months, ordered=True)

df_month = pd.DataFrame(data.groupby(['shopping_mall','month'])['shopping_mall'].count())
df_month.columns = ['quality']
df_month = df_month.reset_index()
df_month['month'] = df_month['month'].astype(month_type)

cats = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
cat_type = CategoricalDtype(categories=cats, ordered=True)

df_day_of_week = pd.DataFrame(data.groupby(['shopping_mall','day_of_week'])['shopping_mall'].count())
df_day_of_week.columns = ['quality']
df_day_of_week = df_day_of_week.reset_index()
df_day_of_week['day_of_week'] = df_day_of_week['day_of_week'].astype(cat_type)

In [None]:
# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(16, 10), sharey = True)

# Plot line charts, one line per shopping mall (seaborn's default palette)
# The 'quality' column holds the transaction counts computed in the previous cell
sns.lineplot(ax=axes[0], data=df_month, x='month', y='quality', hue='shopping_mall', linewidth=2.5)
sns.lineplot(ax=axes[1], data=df_day_of_week, x='day_of_week', y='quality', hue='shopping_mall', linewidth=2.5)

# Set titles, labels, and legends
axes[0].set_title("Shopping Mall Performance by Month", pad=10, fontsize=18, fontweight='bold')
axes[0].set_ylabel("Number of Transactions", labelpad=20, fontsize=14)
axes[0].set_xlabel("Month", labelpad=20, fontsize=14)
axes[0].legend(loc='upper left', bbox_to_anchor=(0, 1), fontsize=12)

axes[1].set_title("Shopping Mall Performance by Weekday", pad=10, fontsize=18, fontweight='bold')
axes[1].set_ylabel("Number of Transactions", labelpad=20, fontsize=14)
axes[1].set_xlabel("Weekday", labelpad=20, fontsize=14)
axes[1].legend(loc='upper left', bbox_to_anchor=(0, 1), fontsize=12)

# Add gridlines to the plots
for ax in axes:
    ax.grid(axis='y', alpha=0.5)

# Add a tight layout and show the plots
plt.tight_layout()
plt.show()

In [None]:
# Create a pivot table of count of categories by payment method
table_category = pd.pivot_table(data, values='quantity', index='category', columns='payment_method', aggfunc='count')

# Define custom colors for each payment method
colors = ['steelblue', 'limegreen', 'gold']

# Create a figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))

# Plot stacked bar chart for first subplot
table_category.plot(kind='bar', stacked=True, ax=ax1, color=colors)

# Set chart title and axis labels for first subplot
ax1.set_title('Distribution of Categories by Payment Method', pad=10, fontsize=15)
ax1.set_xlabel('Category', labelpad=20, fontsize=12)
ax1.set_ylabel('Count', labelpad=20, fontsize=12)

# Add a legend for first subplot
handles, labels = ax1.get_legend_handles_labels()
ax1.legend(handles, labels, title='Payment Method', loc='upper right')

# Remove top and right spines for first subplot
sns.despine(ax=ax1)

# Create a pivot table of transaction counts per shopping mall by payment method
table_shopping_mall = pd.pivot_table(data, values='quantity', index='shopping_mall', columns='payment_method', aggfunc='count')

# Plot stacked bar chart for second subplot
table_shopping_mall.plot(kind='bar', stacked=True, ax=ax2, color=colors)

# Set chart title and axis labels for second subplot
ax2.set_title('Distribution of Shopping Mall by Payment Method', pad=10, fontsize=15)
ax2.set_xlabel('Shopping mall', labelpad=20, fontsize=12)
ax2.set_ylabel('Count', labelpad=20, fontsize=12)

# Add a legend for second subplot
handles, labels = ax2.get_legend_handles_labels()
ax2.legend(handles, labels, title='Payment Method', loc='upper right')

# Remove top and right spines for second subplot
sns.despine(ax=ax2)
plt.tight_layout()
# Show the chart
plt.show()


In [None]:
# Create a pivot table of transaction counts per category by gender
table_category = pd.pivot_table(data, values='quantity', index='category', columns='gender', aggfunc='count')

# Define custom colors for the stacked bars (only as many as there are genders get used)
colors = ['steelblue', 'limegreen', 'gold']

# Create a figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))

# Plot stacked bar chart for first subplot
table_category.plot(kind='bar', stacked=True, ax=ax1, color=colors)

# Set chart title and axis labels for first subplot
ax1.set_title('Distribution of Categories by Gender', pad=10, fontsize=15)
ax1.set_xlabel('Category', labelpad=20, fontsize=12)
ax1.set_ylabel('Count', labelpad=20, fontsize=12)

# Add a legend for first subplot
handles, labels = ax1.get_legend_handles_labels()
ax1.legend(handles, labels, title='Gender', loc='upper right')

# Remove top and right spines for first subplot
sns.despine(ax=ax1)

# Create a pivot table of transaction counts per shopping mall by gender
table_shopping_mall = pd.pivot_table(data, values='quantity', index='shopping_mall', columns='gender', aggfunc='count')

# Plot stacked bar chart for second subplot
table_shopping_mall.plot(kind='bar', stacked=True, ax=ax2, color=colors)

# Set chart title and axis labels for second subplot
ax2.set_title('Distribution of Shopping Mall by Gender', pad=10, fontsize=15)
ax2.set_xlabel('Shopping Mall', labelpad=20, fontsize=12)
ax2.set_ylabel('Count', labelpad=20, fontsize=12)

# Add a legend for second subplot
handles, labels = ax2.get_legend_handles_labels()
ax2.legend(handles, labels, title='Gender', loc='upper right')

# Remove top and right spines for second subplot
sns.despine(ax=ax2)
plt.tight_layout()
# Show the chart
plt.show()


In [None]:
# Create a pivot table of transaction counts per category by age group
table_category = pd.pivot_table(data, values='quantity', index='category', columns='age_group', aggfunc='count')

# Define one color per age group (four groups) so no color is reused in the legend
colors = ['steelblue', 'limegreen', 'gold', 'lightcoral']

# Create a figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))

# Plot stacked bar chart for first subplot
table_category.plot(kind='bar', stacked=True, ax=ax1, color=colors)

# Set chart title and axis labels for first subplot
ax1.set_title('Distribution of Categories by Age Group', pad=10, fontsize=15)
ax1.set_xlabel('Category', labelpad=20, fontsize=12)
ax1.set_ylabel('Count', labelpad=20, fontsize=12)

# Add a legend for first subplot
handles, labels = ax1.get_legend_handles_labels()
ax1.legend(handles, labels, title='Age Group', loc='upper right')

# Remove top and right spines for first subplot
sns.despine(ax=ax1)

# Create a pivot table of transaction counts per shopping mall by age group
table_shopping_mall = pd.pivot_table(data, values='quantity', index='shopping_mall', columns='age_group', aggfunc='count')

# Plot stacked bar chart for second subplot
table_shopping_mall.plot(kind='bar', stacked=True, ax=ax2, color=colors)

# Set chart title and axis labels for second subplot
ax2.set_title('Distribution of Shopping Mall by Age Group', pad=10, fontsize=15)
ax2.set_xlabel('Shopping Mall', labelpad=20, fontsize=12)
ax2.set_ylabel('Count', labelpad=20, fontsize=12)

# Add a legend for second subplot
handles, labels = ax2.get_legend_handles_labels()
ax2.legend(handles, labels, title='Age Group', loc='upper right')

# Remove top and right spines for second subplot
sns.despine(ax=ax2)
plt.tight_layout()
# Show the chart
plt.show()


In [None]:
# Create a box plot of age by category
sns.set(style="ticks", palette="pastel")
plt.figure(figsize=(16,6))

sns.boxplot(x="category", y="age", data=data)
sns.despine(offset=10, trim=True)

plt.title("Distribution of Age by Product Category", fontsize=16)
plt.xlabel("Category", fontsize=14)
plt.ylabel("Age", fontsize=14)

plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

In [None]:
sns.set_style("whitegrid")
fig, ax = plt.subplots(figsize=(15, 8))

sns.boxplot(x="category", y="price", data=data, palette="pastel", ax=ax)
sns.despine(offset=10, trim=True)

ax.set_title("Distribution of Price by Product Category", fontsize=16)
ax.set_xlabel("Category", fontsize=14)
ax.set_ylabel("Price", fontsize=14)

plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

plt.show()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
sns.set(style="ticks", palette="pastel")

sns.boxplot(ax=axes[0], x="payment_method", y="age", data=data)
sns.boxplot(ax=axes[1], x="payment_method", y="price", data=data)

for ax_idx, var in enumerate(["age", "price"]):
    axes[ax_idx].set_title(f"Distribution of {var.capitalize()} by Payment Method", fontsize=16)
    axes[ax_idx].set_xlabel("Payment Method", fontsize=14)
    axes[ax_idx].set_ylabel(f"{var.capitalize()}", fontsize=14)
    # Set tick label size on each subplot; plt.xticks/plt.yticks only affect the last one
    axes[ax_idx].tick_params(axis='both', labelsize=12)

sns.despine(offset=10, trim=True)
plt.tight_layout(pad = 2)
plt.show()

In [None]:
# Set the size of the figure
fig, ax = plt.subplots(figsize=(17, 6))

# Set the style of the plot
sns.set_style("whitegrid")

# Create a violin plot with gender as hue, split by gender
sns.violinplot(x="shopping_mall", y="price", hue="gender", data=data, palette="Set2", split = True, ax=ax)

# Add a title and axis labels
ax.set_title("Distribution of Price by Shopping Mall", fontsize=18, fontweight='bold')
ax.set_xlabel("Shopping Mall", fontsize=14)
ax.set_ylabel("Price", fontsize=14)

# Increase the font size of the tick labels
ax.tick_params(axis='both', labelsize=12)

# Add a legend
ax.legend(title="Gender", loc="upper right", fontsize=12)

# Remove top and right spines
sns.despine()
plt.tight_layout()
# Display the plot
plt.show()

In [None]:
# Set the style of the plot
sns.set_style("whitegrid")

# Create a figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Create a violin plot with gender as hue in the first subplot
sns.violinplot(ax=axes[0], x="age_group", y="price", hue='gender', data=data, palette="Set2")

# Create a scatterplot with color in the second subplot
sns.scatterplot(ax=axes[1], x='age', y='price', data=data, color='steelblue')

# Set titles and axis labels for both subplots
axes[0].set_title("Distribution of Price by Age Group", fontsize=18, fontweight='bold')
axes[0].set_xlabel("Age Group", fontsize=14)
axes[0].set_ylabel("Price", fontsize=14)
axes[1].set_title('Relationship between Price and Age', fontsize=18, fontweight='bold')
axes[1].set_xlabel('Age', fontsize=14)
axes[1].set_ylabel('Price', fontsize=14)

# Increase the font size of the tick labels
axes[0].tick_params(axis='both', labelsize=12)
axes[1].tick_params(axis='both', labelsize=12)

# Adjust the space between subplots and set the top margin
plt.subplots_adjust(wspace=0.3, top=0.9)

plt.tight_layout()
# Display the plot
plt.show()


In [None]:
# create a pivot table of the total quantity sold per category at each shopping mall
pivot_table = data.pivot_table(index='shopping_mall', columns='category', values='quantity', aggfunc='sum')

# create heatmap
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(pivot_table, cmap='Blues', annot=True, fmt='g', linewidths=.5, ax=ax)

# set plot title and axis labels
ax.set_title("Total Quantity Sold per Product Category by Shopping Mall",pad = 20, fontsize=16)
ax.set_xlabel("Category", fontsize=12)
ax.set_ylabel("Shopping Mall", fontsize=12)

plt.show()

<a id="4"></a>
# <div style="padding:20px;color:white;margin:0;font-size:35px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#254E58;overflow:hidden"><b>Machine Learning Model</b></div>

<span style="font-size:14px; font-family:Verdana;"> 
In progress.
</span>

<a id="5"></a>
# <div style="padding:20px;color:white;margin:0;font-size:35px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#254E58;overflow:hidden"><b>Conclusion</b></div>

<span style="font-size:14px; font-family:Verdana;"> 
In progress.
</span>

<a id="6"></a>
# <div style="padding:20px;color:white;margin:0;font-size:35px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#254E58;overflow:hidden"><b>Author Message</b></div>

<div style="border-radius:10px;border:#034914 solid;padding: 15px;background-color:aliceblue;font-size:90%;text-align:left">

<h4><b>Author :</b> Nguyen Tuan Thanh </h4>

<h4> <b>Some information:</b> </h4>

<b>👉Read more projects :</b> https://www.kaggle.com/nttthanh <br>
<b>👉Email me :</b> thanh.ntt0504@gmail.com<br>
<b>👉Connect on LinkedIn :</b> https://www.linkedin.com/in/thanh-nguyen-a2ab32265/ <br>
<b>👉Explore GitHub :</b> https://github.com/TuanThanhpm <br>
    
    
<center> <strong> If you liked this notebook, please upvote. </strong> </center>
    
<center> <strong> If you have any questions, feel free to comment! </strong> </center>
    
<center> <strong> ✨Best Wishes✨ </strong> </center>

</div>