<h2 align=center>Exploratory Data Analysis With Python and Pandas</h2>
<img src="logo.png">

### Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import calmap
from ydata_profiling import ProfileReport

Link to data source: https://www.kaggle.com/aungpyaeap/supermarket-sales

**Context**

Supermarkets are growing in most populated cities, and market competition is high. This dataset is a historical record of one supermarket company's sales, collected at 3 branches over 3 months.

**Data Dictionary**

1. ***Invoice id:*** Computer generated sales slip invoice identification number

2. ***Branch:*** Branch of supercenter (3 branches are available identified by A, B and C).

3. ***City:*** Location of supercenters

4. ***Customer type:*** Type of customer, recorded as Member for customers using a member card and Normal for customers without one.

5. ***Gender:*** Gender type of customer

6. ***Product line:*** General item categorization groups - Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel

7. ***Unit price:*** Price of each product in USD

8. ***Quantity:*** Number of products purchased by customer

9. ***Tax:*** 5% tax charged on the customer's purchase

10. ***Total:*** Total price including tax

11. ***Date:*** Date of purchase (Record available from January 2019 to March 2019)

12. ***Time:*** Purchase time (10am to 9pm)

13. ***Payment:*** Payment used by customer for purchase (3 methods are available – Cash, Credit card and Ewallet)

14. ***COGS:*** Cost of goods sold

15. ***Gross margin percentage:*** Gross margin percentage

16. ***Gross income:*** Gross income

17. ***Rating:*** Customer satisfaction rating of their overall shopping experience (on a scale of 1 to 10)
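The monetary columns in the dictionary are related by simple arithmetic. The sketch below illustrates that relationship on two made-up rows. It assumes that COGS equals unit price times quantity, which the dictionary does not state explicitly, and the column labels here follow the dictionary rather than the exact headers in the CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration; these values are not taken from the dataset.
sample = pd.DataFrame({
    "Unit price": [10.00, 25.50],
    "Quantity": [3, 2],
})

# Assumption: cost of goods sold is unit price times quantity.
sample["COGS"] = sample["Unit price"] * sample["Quantity"]
# Tax is 5% of the pre-tax amount, as described in the data dictionary.
sample["Tax"] = 0.05 * sample["COGS"]
# Total is the price including tax.
sample["Total"] = sample["COGS"] + sample["Tax"]

print(sample)
```

If the assumption holds, the same check can be run against the real columns to spot rows where the totals do not add up.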

### Task 1: Initial Data Exploration

In [None]:
df = pd.read_csv('supermarket_sales.csv')

In [None]:
df.head()

In [None]:
df.columns

In [None]:
df.dtypes

In [None]:
df['Date']=pd.to_datetime(df['Date'])

In [None]:
def convert_date_column_to_datetime(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    """
    Converts a specified column in a DataFrame to datetime format.
    
    Args:
    df (pd.DataFrame): The DataFrame containing the date column.
    column_name (str): The name of the column to convert to datetime.
    
    Returns:
    pd.DataFrame: The DataFrame with the specified column converted to datetime.
    """
    # Use the column passed in rather than a hard-coded name
    df[column_name] = pd.to_datetime(df[column_name])
    return df

# Example usage
# df = convert_date_column_to_datetime(df, 'Date')
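When the date layout is known, passing an explicit `format` makes the conversion unambiguous, and `errors="coerce"` turns unparseable entries into `NaT` instead of raising. The sketch below uses made-up strings and assumes a month/day/year layout; check the actual file before relying on that.

```python
import pandas as pd

# Hypothetical raw values; the month/day/year layout is an assumption.
raw = pd.DataFrame({"Date": ["1/5/2019", "3/8/2019", "not a date"]})

# An explicit format avoids day/month ambiguity; bad values become NaT.
parsed = pd.to_datetime(raw["Date"], format="%m/%d/%Y", errors="coerce")

print(parsed)
print("Unparseable rows:", parsed.isna().sum())
```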

In [None]:
df.set_index('Date',inplace=True)

In [None]:
df.describe()

### Task 2: Univariate Analysis

**Question 1:** What does the distribution of customer ratings look like? Is it skewed?

In [None]:
sns.histplot(df['Rating'],kde=True)
plt.axvline(x=np.mean(df['Rating']),c='red',ls='--',label='mean')
plt.axvline(x=np.percentile(df['Rating'],25),c='green',ls='--',label='25-75th percentile')
plt.axvline(x=np.percentile(df['Rating'],75),c='green',ls='--')
plt.legend()
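The histogram answers the skew question visually; `Series.skew()` gives a number to go with it. The sketch below uses two small made-up series (not the real ratings) to show how a symmetric and a right-skewed sample compare. Values near 0 suggest a roughly symmetric distribution.

```python
import pandas as pd

# Hypothetical samples for illustration only.
symmetric = pd.Series([4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
right_skewed = pd.Series([1.0, 1.0, 2.0, 2.0, 3.0, 10.0])

# Sample skewness: about 0 for symmetric data, positive for a long right tail.
print("symmetric skew:", symmetric.skew())
print("right-skewed skew:", right_skewed.skew())

# A mean well above the median is another sign of right skew.
print("mean - median:", right_skewed.mean() - right_skewed.median())
```

On the real data the equivalent call would be `df['Rating'].skew()`.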

In [None]:
df.hist(figsize=(10,10))

**Question 2:** Do aggregate sales numbers differ by much between branches?

In [None]:
sns.countplot(data=df, x='Branch', hue='Branch',palette='dark')

In [None]:
df['Branch'].value_counts()

In [None]:
sns.countplot(data=df, x='Payment',hue='Branch', palette='viridis')
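A count plot compares the number of transactions per branch, which is not the same as the amount sold. One possible way to compare the amounts is to aggregate a monetary column by branch. The sketch below does this on a made-up frame; the `Total` column name follows the data dictionary and the values are hypothetical.

```python
import pandas as pd

# Hypothetical transactions for illustration only.
sales = pd.DataFrame({
    "Branch": ["A", "B", "A", "C", "B", "A"],
    "Total": [100.0, 250.0, 50.0, 300.0, 25.0, 75.0],
})

# Transaction count, summed sales, and average sale per branch.
summary = sales.groupby("Branch")["Total"].agg(["count", "sum", "mean"])

print(summary)
```

Here branch A has the most transactions but not the highest summed sales, which is the kind of difference a count plot alone would hide.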

### Task 3: Bivariate Analysis

**Question 3:** Is there a relationship between gross income and customer ratings?

In [None]:
sns.regplot(data=df, x='Rating', y='gross income')
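The regression line can also be summarized numerically. The sketch below fits a straight line with `np.polyfit` and computes the Pearson correlation on synthetic data that is independent by construction, so the slope and correlation should both be close to 0. The variable names and value ranges are assumptions chosen for illustration, not properties of the dataset.

```python
import numpy as np

# Synthetic data: the two variables are generated independently.
rng = np.random.default_rng(0)
rating = rng.uniform(4, 10, size=200)
income = rng.uniform(0, 50, size=200)

# Least-squares line: income is approximated by slope * rating + intercept.
slope, intercept = np.polyfit(rating, income, deg=1)

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(rating, income)[0, 1]

print("slope:", slope)
print("correlation:", r)
```

A near-flat line on the real plot would point to the same conclusion: little or no linear relationship.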

In [None]:
sns.boxplot(x=df['Branch'], y=df['gross income'], hue=df['Branch'], palette='dark')

In [None]:
sns.boxplot(x=df['Gender'], y=df['gross income'], hue=df['Gender'], palette='Greens')

**Question 4:** Is there a noticeable time trend in gross income?

In [None]:
# Average the numeric columns for each date; non-numeric columns cannot be averaged
df_numeric = df.select_dtypes(include=['number'])
result = df_numeric.groupby(df.index).mean()
result

In [None]:
# Set the figure size
plt.figure(figsize=(10, 10))  # Width: 10, Height: 10
sns.lineplot(x=result.index, y=result['gross income'])
plt.show()
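Daily averages tend to be noisy, which can hide a trend. One common option is to smooth the series with a rolling mean before plotting. The sketch below shows the mechanics on a made-up daily series; the 7-day window is an arbitrary choice, not something the dataset prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series for illustration only.
dates = pd.date_range("2019-01-01", periods=14, freq="D")
daily = pd.Series(np.arange(14, dtype=float), index=dates, name="gross income")

# 7-day rolling mean; the first 6 entries are NaN because the window is not yet full.
rolling = daily.rolling(window=7).mean()

print(rolling)
```

On the real data the same idea would be `result['gross income'].rolling(window=7).mean()`, plotted alongside the raw line.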

In [None]:
sns.pairplot(result)
plt.show()

### Task 4: Dealing With Duplicate Rows and Missing Values

In [None]:
df.duplicated().sum()
#result.duplicated()

In [None]:
df[df.duplicated()]

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
df.isna().sum()

In [None]:
sns.heatmap(df.isnull(),cbar=False)

In [None]:
df_num = df.select_dtypes(include=['number'])  # This filters the DataFrame to include only numeric columns
df.fillna(df_num.mean(),inplace=True)

In [None]:
df.fillna(df.mode().iloc[0],inplace=True)
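The two cells above fill numeric gaps with the column mean and the remaining gaps with the most frequent value. The sketch below reproduces that idea on a tiny made-up frame so the effect is easy to inspect. It handles the numeric and non-numeric columns separately, which is one possible arrangement rather than the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing numeric and one missing categorical value.
toy = pd.DataFrame({
    "Quantity": [1.0, np.nan, 3.0],
    "Payment": ["Cash", None, "Cash"],
})

# Numeric columns: fill with the column mean.
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

# Remaining columns: fill with the most frequent value (first mode).
other_cols = toy.columns.difference(numeric_cols)
toy[other_cols] = toy[other_cols].fillna(toy[other_cols].mode().iloc[0])

print(toy)
```

Mean imputation pulls values toward the center and shrinks the variance, so it is worth checking how many values were filled before interpreting later statistics.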

In [None]:
dataset = pd.read_csv('supermarket_sales.csv')
prof = ProfileReport(dataset)
prof

### Task 5: Correlation Analysis

In [None]:
# Correlation analysis between gross income and ratings
round(np.corrcoef(df['gross income'], df['Rating'])[1][0],2)

In [None]:
df_num1 = df.select_dtypes(include=['number'])  # This filters the DataFrame to include only numeric columns
np.round(df_num1.corr(),2)

In [None]:
sns.heatmap(np.round(df_num1.corr(),2), annot=True)
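A heatmap gets hard to read as the number of columns grows. One way to complement it is to list the column pairs ranked by absolute correlation. The sketch below does this on a made-up frame with hypothetical columns `a`, `b`, and `c`; it keeps only the upper triangle of the matrix so each pair appears once.

```python
import numpy as np
import pandas as pd

# Hypothetical data: b is exactly 2 * a, c is negatively related to a.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

corr = toy.corr()

# Keep the strict upper triangle so the diagonal and mirrored pairs are dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(key=abs, ascending=False)

print(pairs)
```

Applied to `df_num1.corr()`, this would surface columns that are near-duplicates of each other, which is worth knowing before reading too much into any single coefficient.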

In [None]:
df.head(10)

### Helpful Links

1. More visualizations: https://www.data-to-viz.com/
2. Seaborn gallery: https://seaborn.pydata.org/examples/index.html
3. Pandas profiling documentation: https://pypi.org/project/pandas-profiling/