### Table of Contents

* [Goals](#Goals)
* [Data](#Data)
    * [Data context](#data_context)
    * [Loading the Data](#loading_data)
    * [Data Information](#data_information)
* [Data Cleaning](#Data_Cleaning)
* [Exploratory Data Analysis](#Exploratory_Analysis)

### Goals <a class="anchor" id="Goals"></a>

This notebook contains an analysis of supermarket sales data. The goals for this project are to:

- Get acquainted with the data
- Clean the data so it is ready for analysis
- Develop questions for analysis
- Analyze variables within the data to find patterns and insights

### Data <a class="anchor" id="Data"></a>

The data for this project was downloaded from Kaggle:

https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales

Some code inspiration for this analysis was sourced from 

#### Data Context <a class="anchor" id="data_context"></a>

Supermarkets are growing in most populous cities, and market competition is high. This dataset contains the historical sales of a supermarket company, recorded at 3 different branches over 3 months. Predictive data analytics methods are easy to apply to this dataset.

##### Attribute information
    
- Invoice id: Computer generated sales slip invoice identification number
- Branch: Branch of supercenter (3 branches are available identified by A, B and C).
- City: Location of supercenters
- Customer type: Type of customer, recorded as Member for customers using a member card and Normal for customers without one.
- Gender: Gender type of customer
- Product line: General item categorization groups - Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel
- Unit price: Price of each product in $
- Quantity: Number of products purchased by customer
- Tax: 5% tax fee charged on the customer's purchase
- Total: Total price including tax
- Date: Date of purchase (Record available from January 2019 to March 2019)
- Time: Purchase time (10am to 9pm)
- Payment: Payment used by customer for purchase (3 methods are available – Cash, Credit card and Ewallet)
- COGS: Cost of goods sold
- Gross margin percentage: Gross margin percentage
- Gross income: Gross income
- Rating: Customer satisfaction rating of their overall shopping experience (on a scale of 1 to 10)
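
The monetary attributes above are related by simple arithmetic. The following is a minimal sketch with made-up numbers, assuming that COGS is unit price times quantity and that gross income is the amount added on top of COGS (neither is stated explicitly in the attribute list):

```python
# Hypothetical single purchase; the numbers are made up for illustration.
unit_price = 20.00
quantity = 3

# Assumption: COGS is unit price times quantity (not stated in the list above).
cogs = unit_price * quantity
tax = 0.05 * cogs    # "Tax": 5% fee on the purchase
total = cogs + tax   # "Total": price including tax

# Assumption: gross income is the amount added on top of COGS.
gross_income = total - cogs
gross_margin_pct = gross_income / total * 100

print(f"cogs={cogs:.2f} tax={tax:.2f} total={total:.2f}")
print(f"gross margin percentage={gross_margin_pct:.4f}")
```

Under these assumptions the gross margin percentage works out to 0.05 / 1.05, roughly 4.76%, regardless of the price or quantity.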


#### Loading Data <a class="anchor" id="loading_data"></a>
    
First, we load the necessary libraries.

In [1]:
# sets up matplotlib with interactive features
%matplotlib notebook
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [10]:
df = pd.read_csv(r"C:\Users\mateo\OneDrive - CORE Education Trust\Documents\GitHub\CodeCademy-Projects\Business Intelligence Data Analyst\Final Project\Supermarket Sales\supermarket_sales - Sheet1.csv")

# Having a first look at our data
df.head()
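
The absolute path above only works on the original machine. As a hedged alternative, here is a minimal sketch of a loader that takes any path and also parses the `Date` column. The `load_sales` helper, the temporary file, and its rows are hypothetical and exist only to keep the example self-contained; it assumes the dates are in a format pandas can parse.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_sales(csv_path):
    """Read a sales CSV and parse the Date column into datetimes."""
    return pd.read_csv(csv_path, parse_dates=["Date"])


# Demonstration on a tiny stand-in file with made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample_sales.csv"
    sample_path.write_text(
        "Invoice ID,City,Date,Total\n"
        "000-00-0001,City 1,1/5/2019,100.0\n"
        "000-00-0002,City 2,3/8/2019,50.0\n",
        encoding="utf-8",
    )
    sample = load_sales(sample_path)

print(sample.dtypes)
```

In the notebook itself, the same helper could be called with a path relative to the notebook's folder so the project can be moved between machines.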

#### Data Information <a class="anchor" id="data_information"></a>

To understand our data, we need to answer some fundamental questions:

- How many columns and rows do we have?
- What are the name and data type of each column?
- Are there any missing values?
- Should we rename any of the columns for better consistency?
- What do the basic summary statistics look like?
- Are there any duplicates?

In [3]:
print(f"There are {len(df.columns)} columns and {df.shape[0]} rows in our dataset.\n\n")
# info() lists each column's name, data type, and non-null count
print("Column names with their data types and non-null counts\n")
df.info()

In [4]:
print("Basic summary statistics\n")
df.describe()


We notice that gross margin percentage has the same value in every row. We can drop it, since it doesn't provide any new information.
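
Rather than reading constant columns off the summary table, they can be detected programmatically. The following is a minimal sketch on a small made-up frame; the `constant_columns` helper and the sample values are hypothetical.

```python
import pandas as pd


def constant_columns(frame):
    """Return the names of columns that hold a single distinct value."""
    distinct_counts = frame.nunique(dropna=False)
    return distinct_counts[distinct_counts == 1].index.tolist()


# Made-up rows that mimic the situation described above.
sample = pd.DataFrame({
    "Total": [63.0, 105.0, 21.0],
    "gross margin percentage": [4.7619, 4.7619, 4.7619],
    "Rating": [9.1, 7.4, 8.0],
})

print(constant_columns(sample))
```

On the real data the call would be `constant_columns(df)`.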

In [5]:
print(f"There are {df.duplicated().sum()} duplicates.")
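
The checklist above also asks about missing values and column naming. Here is a minimal sketch of both steps on a small made-up frame; the `to_snake_case` helper and the sample rows are hypothetical, and snake_case is only one possible naming convention.

```python
import pandas as pd


def to_snake_case(name):
    """Lower-case a column name and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")


# Made-up rows, including one missing unit price.
sample = pd.DataFrame({
    "Invoice ID": ["000-00-0001", "000-00-0002"],
    "Customer type": ["Member", "Normal"],
    "Unit price": [20.0, None],
})

# Number of missing values in each column.
print(sample.isna().sum())

# rename() accepts a function that is applied to every column label.
renamed = sample.rename(columns=to_snake_case)
print(renamed.columns.tolist())
```

The rest of this notebook keeps the original column names, so the later cells do not depend on this renaming.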

### Data Cleaning <a class="anchor" id="Data_Cleaning"></a>

As mentioned, we will remove the gross margin percentage column.
We can also remove the Branch column, because it carries the same information as the City column.
Lastly, we can drop Tax 5%, because the gross income column holds the same values.
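
Both redundancy claims can be checked before anything is dropped. The following is a minimal sketch on a small made-up frame (the city labels and amounts are hypothetical); on the real data, `sample` would be replaced by `df`.

```python
import numpy as np
import pandas as pd

# Made-up rows with the same column names as the dataset.
sample = pd.DataFrame({
    "Branch": ["A", "B", "C", "A"],
    "City": ["City 1", "City 2", "City 3", "City 1"],
    "Tax 5%": [3.0, 5.0, 1.0, 2.5],
    "gross income": [3.0, 5.0, 1.0, 2.5],
})

# Branch and City are redundant if each branch maps to exactly one city
# and each city maps to exactly one branch.
one_city_per_branch = (sample.groupby("Branch")["City"].nunique() == 1).all()
one_branch_per_city = (sample.groupby("City")["Branch"].nunique() == 1).all()

# Compare the float columns with a tolerance rather than exact equality.
same_values = np.allclose(sample["Tax 5%"], sample["gross income"])

print(one_city_per_branch, one_branch_per_city, same_values)
```

If any of the three checks came back `False` on the real data, the corresponding column would carry extra information and should be kept.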

In [6]:
# Drop the constant and redundant columns; df itself is left unchanged
df1 = df.drop(columns=["gross margin percentage", "Branch", "Tax 5%"])
df1.head()

### Exploratory Data Analysis <a class="anchor" id="Exploratory_Analysis"></a>
    
We are going to explore the following questions:

- Which branch has the most sales?
- Which customer type buys the most, and how do the types compare?
- Who buys more, men or women? Is it the same across all branches?
- Which category sells the most?
- What is the average price of a unit?
- Do we have more purchases with large quantities or small ones? Is that the same across all branches?
- What is the average total per branch?
- What does the distribution of the Total look like? Do we have a lot of outliers?
- What do the total sales per day look like? Can we infer any patterns?
- The same for the time of day: are there times when we have more sales?
- Which payment method is used the most? How much money is paid with that method?
- Is the rating correlated with the total amount spent?
##### Which branch has the most sales?


In [7]:
# Total sales per city (Branch and City carry the same information)
df.groupby("City")["Total"].sum()

All three branches generate almost the same total revenue. This may point to consistent management across branches, although the totals alone cannot confirm it.
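
To put a number on "almost the same", each branch's share of the overall revenue can be computed. The following is a minimal sketch on made-up totals (the city labels and amounts are hypothetical, not results from the dataset):

```python
import pandas as pd

# Made-up sales rows for three hypothetical cities.
sample = pd.DataFrame({
    "City": ["City 1", "City 2", "City 3", "City 1", "City 2", "City 3"],
    "Total": [100.0, 90.0, 110.0, 60.0, 60.0, 30.0],
})

# Revenue per city, then each city's share of the overall revenue in percent.
totals = sample.groupby("City")["Total"].sum()
share_pct = (totals / totals.sum() * 100).round(1)

# Gap between the strongest and weakest city, in percentage points.
spread_pct = share_pct.max() - share_pct.min()

print(share_pct)
print(f"Spread: {spread_pct:.1f} percentage points")
```

A small spread on the real data would support the observation that the branches perform similarly.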

##### Which customer type buys the most, and how do the types compare?

In [8]:
# Total sales per customer type
df.groupby("Customer type")["Total"].sum()

In [9]:
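%matplotlib inline
# Total sales per customer type, as a frame that is easy to plot
grouped_df = df.groupby("Customer type")["Total"].sum().reset_index()

plt.figure()
plt.bar(grouped_df["Customer type"], grouped_df["Total"])
# Label the axes so the chart can be read on its own
plt.xlabel("Customer type")
plt.ylabel("Total sales ($)")
plt.title("Total sales by customer type")
plt.show()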
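
A sum alone can hide whether one group simply makes more purchases. As a hedged extension, the sketch below computes the total, the number of purchases, and the mean spend per customer type in one pass, using made-up rows (the amounts are hypothetical):

```python
import pandas as pd

# Made-up purchases for the two customer types.
sample = pd.DataFrame({
    "Customer type": ["Member", "Normal", "Member", "Normal", "Member"],
    "Total": [120.0, 80.0, 60.0, 100.0, 90.0],
})

# Named aggregation: each keyword becomes a column in the result.
summary = sample.groupby("Customer type")["Total"].agg(
    total_spend="sum",
    purchases="count",
    mean_spend="mean",
)

print(summary)
```

On the real data, replacing `sample` with `df` would show whether any difference in total sales comes from more purchases or from larger ones.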

##### Who buys more, men or women? Is it the same across all branches?