### Table of Contents

* [Goals](#Goals)
* [Data](#Data)
    * [Data Context](#data_context)
    * [Loading Data](#loading_data)
    * [Data Information](#data_information)
* [Data Cleaning](#Data_Cleaning)
* [Exploratory Data Analysis](#Exploratory_Analysis)

### Goals <a class="anchor" id="Goals">

This notebook contains an analysis of supermarket sales data. The goals of this project are to:

- Get acquainted with the data
- Clean the data so it is ready for analysis
- Develop questions for analysis
- Analyze variables within the data to find patterns and insights

### Data <a class="anchor" id="Data">

The data for this project was downloaded from Kaggle:

https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales

Some code inspiration for this analysis was sourced from 

#### Data Context <a class="anchor" id="data_context">

The growth of supermarkets in the most populated cities is increasing, and market competition is also high. The dataset is one of the historical sales records of a supermarket company, recorded across 3 different branches over 3 months. Predictive data analytics methods are easy to apply to this dataset.

##### Attribute information
    
- Invoice id: Computer generated sales slip invoice identification number
- Branch: Branch of supercenter (3 branches are available identified by A, B and C).
- City: Location of supercenters
- Customer type: Type of customer, recorded as Member for customers using a member card and Normal for those without one.
- Gender: Gender type of customer
- Product line: General item categorization groups - Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel
- Unit price: Price of each product in $
- Quantity: Number of products purchased by customer
- Tax: 5% tax fee for customer buying
- Total: Total price including tax
- Date: Date of purchase (Record available from January 2019 to March 2019)
- Time: Purchase time (10am to 9pm)
- Payment: Payment used by customer for purchase (3 methods are available – Cash, Credit card and Ewallet)
- COGS: Cost of goods sold
- Gross margin percentage: Gross margin percentage
- Gross income: Gross income
- Rating: Customer satisfaction rating on their overall shopping experience (on a scale of 1 to 10)


#### Loading Data <a class="anchor" id="loading_data">
    
First, we are loading the necessary libraries.

In [1]:
# sets up matplotlib with interactive features
%matplotlib notebook
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [2]:
df = pd.read_csv(r"C:\Users\mateo\OneDrive - CORE Education Trust\Documents\GitHub\CodeCademy-Projects\Business Intelligence Data Analyst\Final Project\Supermarket Sales\supermarket_sales - Sheet1.csv")

# Having a first look at our data
df.head()

#### Data Information <a class="anchor" id="data_information">

To understand our data, we need to answer some fundamental questions:

- How many columns and rows do we have?
- What are the name and data type of each column?
- Are there any missing values?
- Should we rename any of the columns for better consistency?
- What do the basic summary statistics look like?
- Are there any duplicates?

In [3]:
print(f"There are {df.shape[1]} columns and {df.shape[0]} rows in our dataset.\n\n")
print("The column names with their data types and their missing values\n")
df.info()

In [4]:
print(f"Basic summary statistics\n")
df.describe()


We notice that gross margin percentage has the same value in every row. We can drop it, since it doesn't provide any new information.
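A quick way to confirm that a column is constant is to count its distinct values. The sketch below uses a small made-up frame (`toy` and its values are illustrative assumptions, not the real dataset) and lists every column with exactly one unique value.

```python
import pandas as pd

# Toy frame; the values are invented purely for illustration.
toy = pd.DataFrame({
    "Total": [10.5, 21.0, 31.5],
    "gross margin percentage": [4.761905, 4.761905, 4.761905],
})

# A column with a single distinct value carries no information.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)
```

Running the same two lines on `df` would show whether gross margin percentage is the only constant column.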

In [5]:
print(f"There are {df.duplicated().sum()} duplicates.")
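The question about renaming columns is not followed up in a cell. One possible approach, shown here only as a sketch on a made-up frame, is to convert every name to snake_case with `DataFrame.rename`. The frame `toy` and its values are assumptions for illustration; the rest of this notebook keeps the original column names, so this step is optional.

```python
import pandas as pd

# Toy frame with column names in the style of the dataset (values are made up).
toy = pd.DataFrame({
    "Invoice ID": ["a-1"],
    "Customer type": ["Member"],
    "Tax 5%": [1.5],
})

# Lower-case each name and replace spaces with underscores.
renamed = toy.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
print(list(renamed.columns))
```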

### Data Cleaning <a class="anchor" id="Data_Cleaning">

As mentioned, we will remove the gross margin percentage column.
We can also remove the Branch column, because it carries the same information as the City column.
Lastly, we can drop Tax 5%, because the gross income column holds the same information.

In [6]:
df1 = df.drop(columns=["gross margin percentage", "Branch","Tax 5%"])
df1.head()
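The redundancy claims behind these drops can be checked before removing anything. The sketch below is a minimal example on a made-up frame (`toy` and its values are assumptions): it tests whether Branch and City map one-to-one, and whether Tax 5% and gross income are numerically equal.

```python
import numpy as np
import pandas as pd

# Toy frame; the branch/city pairs and amounts are invented for illustration.
toy = pd.DataFrame({
    "Branch": ["A", "B", "C", "A"],
    "City": ["Yangon", "Mandalay", "Naypyitaw", "Yangon"],
    "Tax 5%": [1.0, 2.5, 0.75, 3.2],
    "gross income": [1.0, 2.5, 0.75, 3.2],
})

# Branch and City are interchangeable only if the mapping is one-to-one.
one_city_per_branch = bool((toy.groupby("Branch")["City"].nunique() == 1).all())
one_branch_per_city = bool((toy.groupby("City")["Branch"].nunique() == 1).all())

# Compare the two numeric columns with a floating point tolerance.
same_tax_and_income = bool(np.allclose(toy["Tax 5%"], toy["gross income"]))

print(one_city_per_branch, one_branch_per_city, same_tax_and_income)
```

If any of the three checks came back `False` on the real `df`, the corresponding column should be kept.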

### Exploratory Data Analysis <a class="anchor" id="Exploratory_Analysis">
    
We are going to explore the following questions:

- Which of the branches has more sales?
- Which customer type buys the most? And how do they fare against each other?
- Who buys more, men or women? Is it the same in all the branches?
- Which category sells the most?
- What is the average price of a unit?
- Do we have more purchases with large quantities or small ones?
- What do the total sales per day and week look like? Can we infer any patterns?
- What payment method is used the most? How much money is being paid with that method?
- Is the rating correlated with the total amount spent?

##### Which of the branches has more sales?


In [7]:
df["Total"].groupby(df.City).sum()

All three branches generate almost the same revenue, which may point to consistent management across them.

##### Which customer type buys the most? And how do they fare against each other?

In [8]:
df["Total"].groupby(df["Customer type"]).sum()

In [9]:
%matplotlib inline
plt.figure()
grouped_df = df.groupby("Customer type")["Total"].sum().reset_index()

plt.bar(grouped_df["Customer type"], grouped_df["Total"])
plt.show()

In [10]:
sns.boxplot(x="Customer type", y='Total', data=df, palette='Accent')
plt.xlabel('Customer type')
plt.ylabel('Total amount per transaction')
plt.title('Individual Transactions by Customer Type')
plt.show()

In [11]:
df["Customer type"].value_counts()

The two groups look very similar: nothing in these figures differentiates members from normal customers. That in itself is a concern. Members probably pay a certain amount for their membership, yet they spend about the same as normal customers. This might indicate that they do not have enough incentive to buy more products. Members-only deals could be one way to address this.
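To put numbers behind "very similar", the count, mean, median and sum per customer type can be placed side by side. This is a minimal sketch on made-up values (`toy` is an assumption, not the real data); applying the same `groupby` to `df` would give the actual figures.

```python
import pandas as pd

# Toy transactions; the amounts are invented for illustration.
toy = pd.DataFrame({
    "Customer type": ["Member", "Normal", "Member", "Normal"],
    "Total": [100.0, 90.0, 300.0, 310.0],
})

# One row per customer type, one column per statistic.
summary = toy.groupby("Customer type")["Total"].agg(["count", "mean", "median", "sum"])
print(summary)
```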

##### Who buys more? Men or women? Is it the same on all the branches?

In [12]:
df.Total.groupby(df.Gender).sum()

In [13]:
df.Gender.value_counts()

In [14]:
gender = df.groupby(['Gender','City'])['Total'].sum().reset_index()
gender

This data seems promising. Let's make a graph out of it.

In [15]:
plt.figure(figsize=(10,6))
ax = sns.barplot(x='City', y='Total', hue='Gender', data=gender)
plt.xlabel('City')
plt.ylabel('Total')
plt.legend(title='Gender')
plt.title('Total Sales by Gender in each City')
# Add total labels on top of bars
for p in ax.patches:
    ax.annotate(format(p.get_height(), '.0f'), 
                (p.get_x() + p.get_width() / 2., p.get_height()), 
                ha = 'center', va = 'center', 
                xytext = (0, 9), 
                textcoords = 'offset points')
plt.axis([-1,3,0,70000]) # Zooming out because of the legend
plt.show()

We notice that in Naypyitaw, female customers spend noticeably more than male customers. This could be a reason to stock more products aimed at women there and to focus marketing a bit more on them.
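The gap between genders is easier to compare across cities as a share rather than as raw totals. The sketch below, on made-up numbers (`toy` and the column name `Female share` are assumptions), pivots the totals so each city is a row and then computes the female share of revenue.

```python
import pandas as pd

# Toy totals per city and gender; the amounts are invented for illustration.
toy = pd.DataFrame({
    "City": ["Yangon", "Yangon", "Naypyitaw", "Naypyitaw"],
    "Gender": ["Female", "Male", "Female", "Male"],
    "Total": [100.0, 100.0, 150.0, 100.0],
})

# One row per city, one column per gender.
by_city = toy.pivot_table(index="City", columns="Gender", values="Total", aggfunc="sum")

# Share of each city's revenue that comes from female customers.
by_city["Female share"] = by_city["Female"] / (by_city["Female"] + by_city["Male"])
print(by_city)
```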

#### Which category sells the most?

In [16]:
# How many transactions do we have per product line?

df['Product line'].value_counts()

In [17]:
df.Total.groupby(df['Product line']).sum().sort_values(ascending=False)

Most of the categories are in the same range. It would be worth asking why Health and beauty comes last.

#### What is the average price of a unit? Where do their prices fall?

I want to create a new column called Unit price range. It assigns each unit price to one of three categories:

- Below 21: "Low price"
- From 21 up to (but not including) 51: "Medium price"
- 51 and above: "High price"

In [18]:
def categorize_price(price):
    if price >= 0 and price < 21:
        return "Low price"
    elif price >= 21 and price < 51:
        return "Medium price"
    else:
        return "High price"

df['Unit price range'] = df['Unit price'].apply(categorize_price)
df.head()
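The binning in `categorize_price` can also be expressed with `pd.cut`, which avoids a Python-level function call per row. The sketch below is an alternative under the same boundaries, not the method used in this notebook; the sample prices are made up. One difference to be aware of: `pd.cut` returns `NaN` for values outside the bins (for example negative or missing prices), whereas the `else` branch above labels them "High price".

```python
import numpy as np
import pandas as pd

# Made-up prices that sit on and around the bin edges.
prices = pd.Series([10.0, 20.5, 21.0, 50.99, 51.0, 99.96])

# right=False makes each bin include its left edge: [0, 21), [21, 51), [51, inf)
price_range = pd.cut(
    prices,
    bins=[0, 21, 51, np.inf],
    right=False,
    labels=["Low price", "Medium price", "High price"],
)
print(price_range.tolist())
```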

Let's check how many products we have in each unit price range

In [19]:
df['Unit price range'].value_counts()

In [20]:
sns.set_palette("Set2")
# Compute the counts once and reuse them for both axes
price_counts = df['Unit price range'].value_counts()
sns.barplot(x=price_counts.index, y=price_counts.values)
plt.xlabel("Unit Price Range")
plt.ylabel("Product Amount")
plt.title("Amount of Products distributed in Price Ranges")

text = "High price is 51 to 100\nMedium price is 21 to 50\nLow price is 0 to 20"
plt.text(0.95, 0.95, text, transform=plt.gca().transAxes, ha='right', va='top', fontsize=13)
plt.show()

Surprisingly, most of our products fall in the high price range. A point worth investigating is whether any store has a higher share of high-priced products and, if so, whether the total income for that store is lower.

#### Do we have more purchases with large quantities or small ones?

Let's do the same as above. We can create a new column called Quantity range, which assigns each quantity to one of three categories:

- From 0 to 3: "Low quantity"
- From 4 to 6: "Medium quantity"
- From 7 to 10: "High quantity"

In [21]:
def categorize_quantity(quantity):
    if 0 <= quantity < 4:
        return "Low quantity"
    elif 4 <= quantity < 7:
        return "Medium quantity"
    else:
        return "High quantity"

df['Quantity range'] = df['Quantity'].apply(categorize_quantity)
df.head()

In [22]:
sns.barplot(y=df['Quantity range'].value_counts().values, x=df['Quantity range'].value_counts().index)

Most of the transactions are high quantity ones. It would be worth A/B testing some ideas here. For example, if we offer a discount when customers buy in high quantities, will demand increase?

In [23]:
df.groupby(['Quantity range'])['Total'].sum()

In [24]:
# Take the wedge sizes and the labels from the same Series so they cannot get out of sync
quantity_totals = df.groupby('Quantity range')['Total'].sum().sort_values(ascending=False)
plt.pie(quantity_totals, labels=quantity_totals.index, autopct='%1.1f%%')
plt.show()

Most of the income comes from the high quantity transactions. This is partly expected, because Total grows with Quantity. The company can either continue to focus its marketing and sales on high quantity purchases or try to increase sales for low and medium quantity transactions.
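The revenue shares can also be computed as plain numbers, which makes them easier to quote than reading them off a chart. This is a minimal sketch on made-up totals (`toy` is an assumption, not the real data).

```python
import pandas as pd

# Toy transactions; the amounts are invented for illustration.
toy = pd.DataFrame({
    "Quantity range": ["Low quantity", "Medium quantity", "High quantity", "High quantity"],
    "Total": [100.0, 300.0, 250.0, 350.0],
})

# Revenue per range, then each range's percentage of the overall revenue.
totals = toy.groupby("Quantity range")["Total"].sum()
share = (totals / totals.sum() * 100).round(1).sort_values(ascending=False)
print(share)
```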

#### What do the total sales per day look like? Can we infer any patterns?

There are a lot of things we can look for when dealing with time. For now we are going to explore only some basic questions:

- What is the total revenue by month?
- What is the revenue for each of the 7 days of the week?
- Is there any pattern in the daily revenue? What about weekly?

In [25]:
df.head()

In [26]:
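# 'Datetime' is not among the dataset's listed attributes; it is assumed here to be built from 'Date' and 'Time'
df['Datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
monthly_totals = df.groupby(df['Datetime'].dt.month)['Total'].sum()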

# Plot the total sum for each month using a bar plot
ax = plt.subplot()
# plt.figure(figsize=(10, 6))
plt.bar(monthly_totals.index, monthly_totals)

plt.xlabel("Months")
plt.ylabel("Revenue")
plt.title("Total Revenue by Month")
# plt.xticks(monthly_totals.index, calendar.month_name[1:], rotation=45)
ax.set_xticks([1,2,3])
ax.set_xticklabels(["January","February", "March"], rotation=35)
for p in ax.patches:
    ax.annotate(format(p.get_height(), '.0f'), 
                (p.get_x() + p.get_width() / 2., p.get_height()), 
                ha = 'center', va = 'center', 
                xytext = (0, 9), 
                textcoords = 'offset points')

plt.axis([0,4,0,130000])

plt.show()

We notice that January had the highest revenue, possibly because of the New Year festivities. Revenue then dips in February, which might be attributed to customers overspending in January, and rises again in March. One idea would be to lower prices slightly in February in order to attract more customers and sales.
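One thing to keep in mind when comparing months is that February 2019 has 28 days while January and March have 31, so part of a February dip can come from the calendar alone. A possible check is to divide each month's revenue by the number of days with sales. The sketch below uses made-up data in which every day earns exactly the same amount (`toy` is an assumption, not the real dataset).

```python
import pandas as pd

# Toy data: exactly 100 in revenue on every day from January to March 2019.
days = pd.date_range("2019-01-01", "2019-03-31", freq="D")
toy = pd.DataFrame({"Datetime": days, "Total": 100.0})

month = toy["Datetime"].dt.month
monthly = toy.groupby(month)["Total"].sum()

# Number of distinct calendar days with at least one sale in each month.
active_days = toy["Datetime"].dt.normalize().groupby(month).nunique()

per_day = monthly / active_days
print(monthly.tolist(), per_day.tolist())
```

In this toy case the monthly totals differ (February is lower) even though the revenue per day is identical, which is exactly the effect the normalization removes.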

In [None]:
import calendar

# Group the data by day of the week (0 = Monday) and calculate the total sum
daily_totals = df.groupby(df['Datetime'].dt.dayofweek)['Total'].sum()

# Create a line plot for the total sum for each day of the week
plt.figure(figsize=(10, 6))
plt.plot(daily_totals.index, daily_totals, marker='o')

plt.xlabel("Day of the Week")
plt.ylabel("Total Sum")
plt.title("Total Revenue by Day of the Week")
# Label each tick with the weekday name that matches its index
plt.xticks(daily_totals.index, [calendar.day_name[d] for d in daily_totals.index], rotation=45)

plt.show()

In [None]:
# Set 'Datetime' as the index
df.set_index("Datetime", inplace=True)

# Calculate the total sum for each day
daily_totals = df.resample('D')['Total'].sum()

# Create a line plot for the total sum for each day
plt.figure(figsize=(10, 6))
plt.plot(daily_totals.index, daily_totals, marker='o')

plt.xlabel("Date")
plt.ylabel("Total Sum")
plt.title("Total Revenue by Date")
plt.xticks(rotation=45)

plt.show()
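Daily totals tend to be noisy, so a rolling mean can make a pattern easier to see. The sketch below is one possible smoothing step, shown on a made-up series (`daily` is an assumption); on the real data the same call could be applied to `daily_totals`.

```python
import pandas as pd

# Toy daily revenue: 1, 2, ..., 14 over two weeks (values are made up).
days = pd.date_range("2019-01-01", periods=14, freq="D")
daily = pd.Series(range(1, 15), index=days, dtype="float64")

# 7-day rolling mean; the first 6 entries are NaN because the window is not yet full.
rolling = daily.rolling(window=7).mean()
print(rolling)
```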

In [None]:
# ! pip install mplcursors
import matplotlib.dates as mdates
import mplcursors
# Calculate the total sum for each week
weekly_totals = df.resample('W')['Total'].sum()

# Create a line plot for the total sum for each week
plt.figure(figsize=(10, 6))
plt.plot(weekly_totals.index, weekly_totals, marker='o')

plt.xlabel("Week")
plt.ylabel("Total Sum")
plt.title("Total Revenue by Week")
plt.xticks(rotation=45)
# Hover annotations need an interactive backend such as %matplotlib notebook
cursor = mplcursors.cursor()
# sel.target holds plain floats; the x value is a Matplotlib date number, so convert it first
cursor.connect(
    "add", lambda sel: sel.annotation.set_text(
        f"{mdates.num2date(sel.target[0]):%d-%b-%Y}\nTotal: {sel.target[1]:.2f}")
)
plt.show()

#### What payment method is used the most? How much money is being paid with that method?

We are going to check:

- The total amount spent with each payment method
- Which method is used the most
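Both checks can be answered with a single aggregation. The sketch below is a minimal example on made-up transactions (`toy` and the output column names `transactions` and `revenue` are assumptions); the same `groupby` on `df` would give the real figures.

```python
import pandas as pd

# Toy transactions; the methods match the dataset's three options, the amounts are made up.
toy = pd.DataFrame({
    "Payment": ["Cash", "Ewallet", "Cash", "Credit card"],
    "Total": [100.0, 50.0, 25.0, 80.0],
})

# Number of transactions and total revenue per payment method.
payment_summary = (
    toy.groupby("Payment")["Total"]
    .agg(transactions="count", revenue="sum")
    .sort_values("transactions", ascending=False)
)

# The first row after sorting is the most frequently used method.
most_used = payment_summary.index[0]
print(payment_summary)
print(most_used)
```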

Open questions for further investigation:

- Why do more women buy in Naypyitaw? Is it because more products aimed at women are sold there, such as fashion accessories or health and beauty products?
- Why does the Health and beauty category earn less income than the other categories? Is it because of its prices?