<h2 align=center>Exploratory Data Analysis With Python and Pandas</h2>
<img src="logo.png">

### Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import calmap
from pandas_profiling import ProfileReport




Link to data source: https://www.kaggle.com/aungpyaeap/supermarket-sales

**Context**

Supermarkets are growing in most populated cities, and market competition is high. This dataset contains the historical sales of a supermarket company, recorded at 3 different branches over 3 months.

**Data Dictionary**

1. ***Invoice id:*** Computer-generated sales slip invoice identification number

2. ***Branch:*** Branch of supercenter (3 branches are available identified by A, B and C).

3. ***City:*** Location of supercenters

4. ***Customer type:*** Type of customer, recorded as Member for customers using a member card and Normal for those without one.

5. ***Gender:*** Gender type of customer

6. ***Product line:*** General item categorization groups - Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel

7. ***Unit price:*** Price of each product in USD

8. ***Quantity:*** Number of products purchased by customer

9. ***Tax:*** 5% tax charged on the customer's purchase

10. ***Total:*** Total price including tax

11. ***Date:*** Date of purchase (records available from January 2019 to March 2019)

12. ***Time:*** Purchase time (10am to 9pm)

13. ***Payment:*** Payment used by customer for purchase (3 methods are available – Cash, Credit card and Ewallet)

14. ***COGS:*** Cost of goods sold

15. ***Gross margin percentage:*** Gross margin percentage

16. ***Gross income:*** Gross income

17. ***Rating:*** Customer satisfaction rating of their overall shopping experience (on a scale of 1 to 10)
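
The Tax, Total, COGS, and gross margin entries are linked by simple arithmetic. The check below is a minimal sketch, assuming (the dictionary does not state this explicitly) that COGS is unit price times quantity, tax is 5% of COGS, total is COGS plus tax, and gross income equals the tax amount, as it does in the rows shown in this notebook. The input values come from invoice 750-67-8428.

```python
# Values taken from one row of the dataset (invoice 750-67-8428).
unit_price = 74.69
quantity = 7

# Assumed relationships; the data dictionary does not spell them out.
cogs = unit_price * quantity          # cost of goods sold
tax = 0.05 * cogs                     # 5% tax on the purchase
total = cogs + tax                    # total price including tax
gross_margin_pct = tax / total * 100  # gross income as a share of the total

print(round(cogs, 2), round(tax, 4), round(total, 4), round(gross_margin_pct, 6))
```

Under these assumptions the gross margin percentage is the same constant for every row (5 / 105 of the total), which would explain why that column never varies.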

### Task 1: Initial Data Exploration

In [2]:
df = pd.read_csv('supermarket_sales.csv')

In [9]:
df.head()

| | 149-71-6266 | B | Mandalay | Member | Male | Sports and travel | 78.07 | 9 | 35.1315 | 737.7615 | 1/28/19 | 12:43 | Cash | 702.63 | 4.761904762 | 35.1315.1 | 4.5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 640-49-2076 | B | Mandalay | Normal | Male | Sports and travel | 83.78 | 8.0 | 33.512 | 703.752 | 1/10/19 | 14:49 | Cash | 670.24 | 4.761905 | 33.512 | 5.1 |
| 1 | 595-11-5460 | A | Yangon | | Male | Health and beauty | 96.58 | 2.0 | 9.658 | 202.818 | 3/15/19 | 10:12 | Credit card | 193.16 | 4.761905 | 9.658 | 5.1 |
| 2 | 183-56-6882 | C | Naypyitaw | | Female | Food and beverages | 99.42 | 4.0 | 19.884 | 417.564 | 2/6/19 | 10:42 | Ewallet | 397.68 | 4.761905 | 19.884 | 7.5 |
| 3 | 232-16-2483 | C | Naypyitaw | | Female | Sports and travel | 68.12 | 1.0 | 3.406 | 71.526 | 1/7/19 | 12:28 | Ewallet | 68.12 | 4.761905 | 3.406 | 6.8 |
| 4 | 129-29-8530 | A | Yangon | | Male | Sports and travel | 62.62 | 5.0 | 15.655 | 328.755 | 3/10/19 | 19:15 | Ewallet | 313.1 | 4.761905 | 15.655 | 7.0 |


In [5]:
df.columns

Index(['149-71-6266', 'B', 'Mandalay', 'Member', 'Male', 'Sports and travel',
       '78.07', '9', '35.1315', '737.7615', '1/28/19', '12:43', 'Cash',
       '702.63', '4.761904762', '35.1315.1', '4.5'],
      dtype='object')

In [6]:
df.dtypes

149-71-6266           object
B                     object
Mandalay              object
Member                object
Male                  object
Sports and travel     object
78.07                float64
9                    float64
35.1315              float64
737.7615             float64
1/28/19               object
12:43                 object
Cash                  object
702.63               float64
4.761904762          float64
35.1315.1            float64
4.5                  float64
dtype: object
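
The column labels in the outputs above look like a data row, which suggests that this run read a file without a header row. The sketch below shows one way to handle that, assuming the column names and their order match the `df.head()` output further down in this notebook. The two sample rows are copied from the output above and stand in for the real file.

```python
import io
import pandas as pd

# Assumed column names, taken from the later df.head() output in this notebook.
COLUMNS = [
    'Invoice ID', 'Branch', 'City', 'Customer type', 'Gender', 'Product line',
    'Unit price', 'Quantity', 'Tax 5%', 'Total', 'Date', 'Time', 'Payment',
    'cogs', 'gross margin percentage', 'gross income', 'Rating',
]

# Two header-less rows copied from the output above; a stand-in for the CSV file.
raw = io.StringIO(
    "149-71-6266,B,Mandalay,Member,Male,Sports and travel,78.07,9,35.1315,737.7615,"
    "1/28/19,12:43,Cash,702.63,4.761904762,35.1315,4.5\n"
    "640-49-2076,B,Mandalay,Normal,Male,Sports and travel,83.78,8,33.512,703.752,"
    "1/10/19,14:49,Cash,670.24,4.761905,33.512,5.1\n"
)

# header=None stops pandas from using the first data row as column labels.
sample = pd.read_csv(raw, header=None, names=COLUMNS)
print(sample.columns.tolist())
print(len(sample))
```

With the real file, the same `header=None, names=COLUMNS` arguments would be passed to `pd.read_csv('supermarket_sales.csv', ...)`, but only if the file really lacks a header.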

In [None]:
# Parse the date strings into datetime64 values
df['Date'] = pd.to_datetime(df['Date'])

In [None]:
df.set_index('Date', inplace=True)

In [12]:
df.head()

| Date | Invoice ID | Branch | City | Customer type | Gender | Product line | Unit price | Quantity | Tax 5% | Total | Time | Payment | cogs | gross margin percentage | gross income | Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019-01-05 | 750-67-8428 | A | Yangon | Member | Female | Health and beauty | 74.69 | 7.0 | 26.1415 | 548.9715 | 13:08 | Ewallet | 522.83 | 4.761905 | 26.1415 | 9.1 |
| 2019-03-08 | 226-31-3081 | C | Naypyitaw | Normal | Female | Electronic accessories | 15.28 | 5.0 | 3.82 | 80.22 | 10:29 | Cash | 76.4 | 4.761905 | 3.82 | 9.6 |
| 2019-03-03 | 631-41-3108 | A | Yangon | Normal | Male | Home and lifestyle | 46.33 | 7.0 | 16.2155 | 340.5255 | 13:23 | Credit card | 324.31 | 4.761905 | 16.2155 | 7.4 |
| 2019-01-27 | 123-19-1176 | A | Yangon | Member | Male | Health and beauty | 58.22 | 8.0 | 23.288 | 489.048 | 20:33 | Ewallet | 465.76 | 4.761905 | 23.288 | 8.4 |
| 2019-02-08 | 373-73-7910 | A | Yangon | Normal | Male | Sports and travel | 86.31 | 7.0 | 30.2085 | 634.3785 | 10:37 | Ewallet | 604.17 | 4.761905 | 30.2085 | 5.3 |


In [None]:
df.describe().T


### Task 2: Univariate Analysis

**Question 1:** What does the distribution of customer ratings look like? Is it skewed?

In [None]:
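# Histogram of ratings with a KDE overlay (sns.distplot is deprecated)
sns.histplot(df['Rating'], kde=True)
# Mark the mean rating with a dashed vertical line
plt.axvline(x=df['Rating'].mean(), color='red', linestyle='--')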


**Question 2:** Do aggregate sales numbers differ by much between branches?
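
One possible way to approach this question is to group by branch and compare transaction counts and sales totals. The sketch below uses only the five rows shown in the `df.head()` output earlier, so its numbers say nothing about the full dataset. With the real data, `df` would take the place of the hypothetical `sample` frame.

```python
import pandas as pd

# Five rows copied from the df.head() output earlier in this notebook.
sample = pd.DataFrame({
    'Branch': ['A', 'C', 'A', 'A', 'A'],
    'Total': [548.9715, 80.22, 340.5255, 489.048, 634.3785],
})

# Transaction count, summed sales, and mean sale value per branch.
by_branch = sample.groupby('Branch')['Total'].agg(['count', 'sum', 'mean'])
print(by_branch)
```

A bar chart of the `sum` column, or a count plot of `Branch`, would show the same comparison visually.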

### Task 3: Bivariate Analysis

**Question 3:** Is there a relationship between gross income and customer ratings?

**Question 4:** Is there a noticeable time trend in gross income?
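
Because `Date` was set as the index earlier and a date can carry several transactions, one option is to average gross income per period before looking for a trend. This is a sketch on the five rows from the earlier `df.head()` output, not a result for the full dataset. The `sample` frame is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Five rows copied from the df.head() output earlier in this notebook.
sample = pd.DataFrame(
    {'gross income': [26.1415, 3.82, 16.2155, 23.288, 30.2085]},
    index=pd.to_datetime(
        ['2019-01-05', '2019-03-08', '2019-03-03', '2019-01-27', '2019-02-08']
    ),
)

# Mean gross income per calendar day, in chronological order.
daily = sample.groupby(sample.index)['gross income'].mean().sort_index()

# Mean gross income per calendar month.
monthly = sample['gross income'].groupby(sample.index.to_period('M')).mean()
print(monthly)
```

Plotting `daily` as a line chart is a natural next step. With only three months of data, any apparent trend should be read cautiously.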

### Task 4: Dealing With Duplicate Rows and Missing Values
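
The sketch below illustrates a typical workflow for this task on a small toy frame with made-up values: count and drop duplicate rows, count missing values per column, then impute. The imputation strategy (column mean for numbers, most frequent value for text) is an assumption for illustration, not necessarily the right choice for this dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one duplicated row and two missing entries.
toy = pd.DataFrame({
    'Customer type': ['Member', None, 'Normal', 'Normal', 'Member'],
    'Quantity': [7.0, 5.0, np.nan, 4.0, 7.0],
    'Rating': [9.1, 9.6, 7.4, 6.8, 9.1],
})

n_duplicates = toy.duplicated().sum()      # rows identical to an earlier row
deduped = toy.drop_duplicates()
missing_per_column = deduped.isna().sum()  # missing values in each column

# Impute: column mean for numeric data, most frequent value for text data.
filled = deduped.copy()
filled['Quantity'] = filled['Quantity'].fillna(filled['Quantity'].mean())
most_common = filled['Customer type'].mode().iloc[0]
filled['Customer type'] = filled['Customer type'].fillna(most_common)

print(n_duplicates)
print(missing_per_column)
```

Whether to impute or drop rows depends on how much data is missing and why; `deduped.dropna()` is the simpler alternative when few rows are affected.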

### Task 5: Correlation Analysis
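
A minimal sketch of two common correlation checks: a single Pearson coefficient between two columns, and a pairwise correlation matrix. It uses the five rows from the earlier `df.head()` output, so the coefficients are illustrative only. The `sample` frame is a hypothetical stand-in for the numeric columns of `df`.

```python
import numpy as np
import pandas as pd

# Numeric columns from the five rows shown in the earlier df.head() output.
sample = pd.DataFrame({
    'Unit price': [74.69, 15.28, 46.33, 58.22, 86.31],
    'Quantity': [7.0, 5.0, 7.0, 8.0, 7.0],
    'gross income': [26.1415, 3.82, 16.2155, 23.288, 30.2085],
    'Rating': [9.1, 9.6, 7.4, 8.4, 5.3],
})

# Pearson correlation between two columns.
r = np.corrcoef(sample['gross income'], sample['Rating'])[0, 1]

# Pairwise Pearson correlation matrix, rounded for readability.
corr = sample.corr().round(2)
print(round(r, 2))
print(corr)
```

The matrix can be passed to `sns.heatmap(corr, annot=True)` for a visual summary. On the full frame, non-numeric columns need to be excluded first, for example with `df.corr(numeric_only=True)` in recent pandas versions.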

### Helpful Links

1. More visualizations: https://www.data-to-viz.com/
2. Seaborn gallery: https://seaborn.pydata.org/examples/index.html
3. Pandas profiling documentation: https://pypi.org/project/pandas-profiling/