# GLOBAL SUPERSTORE DATASET SALES ANALYSIS
By Raju Vaneshwar Nareshwar

## Table of contents
1. [Task 1](#task1)

    1.1 [Load first 10 records](#load-first-10-records)
    
    1.2 [Understanding the dataset](#understanding-dataset)

2. [Assess the data](#assess)

    2.1 [Meta Data](#metadata)
    
    2.2 [Assessment Summary](#summary)

3. [Data Cleaning](#clean)

4. [Analysis and Data Visualization](#analysis)

    4.1 [Product Analysis](#product)

    4.2 [Segment Analysis](#segment)

    4.3 [Geographical market location Analysis](#market)

    4.4 [Shipping](#shipping)

    4.5 [Time Series Analysis](#time)

5. [Insights](#insights)

    5.1 [Findings](#findings)

    5.2 [Limitations](#limitation)

    5.3 [Recommendations](#recommendation)

## 1. Task 1  <a id='task1'></a>

### 1.1 Load first 10 records <a id='load-first-10-records'></a>

In [138]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import ticker as mtick
import seaborn as sns

# Let pandas display all columns instead of truncating wide frames
pd.set_option('display.max_columns', None)

In [139]:
# Read the CSV file into the dataframe global_super_store_data
global_super_store_data = pd.read_csv('sample-superstore-2023-T3.csv')

# Set the head to 10 to retrieve the first 10 records
first_10_rows = global_super_store_data.head(n=10)
first_10_rows

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,7773,CA-2016-108196,25/11/2016,12/02/2016,Standard Class,CS-12505,Cindy Stewart,Consumer,United States,Lancaster,Ohio,43130,Est,TEC-MA-10000418,Technology,Machines,Cubify CubeX 3D Printer Double Head Print,4499.985,5,0.7,-6599.978
1,684,US-2017-168116,11/04/2017,11/04/2017,Same Day,GT-14635,Grant Thornton,Corporate,United States,Burlington,North Carolina,"""27217""",South,TEC-MA-10004125,Technology,Machines,Cubify CubeX 3D Printer Triple Head Print,7999.98,4,0.5,-3839.9904
2,9775,CA-2014-169019,26/07/2014,30/07/2014,Standard Class,LF-17185,Luke Foster,Consumer,United States,San Antonio,Texas,78207,Central,OFF-BI-10004995,Office Supplies,Binders,GBC DocuBind P400 Electric Binding System,2177.584,8,0.8,-3701.8928
3,3012,CA-2017-134845,17/04/2017,24/04/2017,Standard Class,SR-20425,Sharelle Roach,Home Office,United States,Louisville,Colorado,80027,West,TEC-MA-10000822,Technology,Machines,Lexmark MX611dhe Monochrome Laser Printer,2549.985,5,0.7,-3399.98
4,4992,US-2017-122714,12/07/2017,13/12/2017,Standard Class,HG-14965,Henry Goldwyn,Corporate,United States,Chicago,Illinois,60653,Central,OFF-BI-10001120,Office Supplies,Binders,Ibico EPK-21 Electric Binding System,1889.99,5,0.8,-2929.4845
5,3152,CA-2015-147830,15/12/2015,18/12/2015,First Class,NF-18385,Natalie Fritzler,Consumer,United States,Newark,Ohio,43055,East,TEC-MA-10000418,Technology,Machines,Cubify CubeX 3D Printer Double Head Print,1799.994,Two,0.7,"""-2639.9912"""
6,5311,CA-2017-131254,19/11/2017,21/11/2017,First Class,NC-18415,Nathan Cano,Consumer,United States,Houston,Texas,77095,Central,OFF-BI-10003527,Office Supplies,Binders,Fellowes PB500 Electric Punch Plastic Comb Bin...,1525.188,6,0.8,-2287.782
7,9640,CA-2015-116638,28/01/2015,,Second Class,JH-15985,Joseph Holt,Consumer,United States,Concord,North Carolina,28027,South,FUR-TA-10000198,Frnture,Tables,Chromcraft Bull-Nose Wood Oval Conference Tabl...,4297.644,Thirteen,0.4,
8,1200,CA-2016-130946,04/08/2016,04/12/2016,Standard Class,ZC-21910,Zuschuss Carroll,Consumer,United States,Houston,Texas,77041,Central,OFF-BI-10004995,Office Supplies,Binders,GBC DocuBind P400 Electric Binding System,1088.792,4,0.8,-1850.9464
9,2698,CA-2014-145317,18/03/2014,23/03/2014,Standard Class,SM-20320,Sean Miller,Home Office,,Jacksonville,Florida,32216,Southh,TEC-MA-10002412,Technology,Machines,Cisco TelePresence System EX90 Videoconferenci...,22638.48,6,0.5,-1811.0784


### 1.2 Understanding the dataset <a id='understanding-dataset'></a>

The info() function reports each column's datatype and non-null count, and describe() gives descriptive statistics for the numeric columns.

In [140]:
# Get the metadata information about the dataset
global_super_store_data.info()

# Get descriptive statistics on the dataset
global_super_store_data.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Row ID         9994 non-null   int64  
 1   Order ID       9993 non-null   object 
 2   Order Date     9992 non-null   object 
 3   Ship Date      9991 non-null   object 
 4   Ship Mode      9990 non-null   object 
 5   Customer ID    9994 non-null   object 
 6   Customer Name  9991 non-null   object 
 7   Segment        9991 non-null   object 
 8   Country        9990 non-null   object 
 9   City           9992 non-null   object 
 10  State          9990 non-null   object 
 11  Postal Code    9991 non-null   object 
 12  Region         9991 non-null   object 
 13  Product ID     9992 non-null   object 
 14  Category       9992 non-null   object 
 15  Sub-Category   9990 non-null   object 
 16  Product Name   9991 non-null   object 
 17  Sales          9993 non-null   float64
 18  Quantity

Unnamed: 0,Row ID,Sales,Discount
count,9994.0,9993.0,9991.0
mean,4997.5,229.86378,0.15618
std,2885.163629,623.276019,0.206399
min,1.0,0.444,0.0
25%,2499.25,17.28,0.0
50%,4997.5,54.48,0.2
75%,7495.75,209.94,0.2
max,9994.0,22638.48,0.8


The primary key of each record is the system-generated *Row ID* column.

The dataset has the following datatypes:
* int64(1)
* float64(2)
* object(18)

The *Quantity* and *Profit* columns are stored as object because a few records contain non-numeric text; both should be float64, so those records need to be cleansed or transformed.  
The *Order Date* and *Ship Date* columns are stored as strings and need to be converted to datetime.

Once cleansed, descriptive statistics can be computed for the numerical columns: Sales, Quantity, Discount and Profit.
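
The date conversion described above could look like the following sketch. It assumes the strings are day-first (`dd/mm/yyyy`), which fits most of the sample rows; a few sample rows (for example an order dated 25/11/2016 with a ship date of 12/02/2016) suggest that some values may be month-first, so those would need a separate check. The `demo` frame and the `Days to Ship` column are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical miniature frame that mimics the two date columns
demo = pd.DataFrame({
    "Order Date": ["26/07/2014", "15/12/2015", "28/01/2015"],
    "Ship Date": ["30/07/2014", "18/12/2015", None],
})

# Assumption: dates are day-first (dd/mm/yyyy).
# Missing or unparseable values become NaT instead of raising an error.
for col in ["Order Date", "Ship Date"]:
    demo[col] = pd.to_datetime(demo[col], format="%d/%m/%Y", errors="coerce")

# Shipping delay in days; a negative value would hint at a date
# whose day and month were read in the wrong order
demo["Days to Ship"] = (demo["Ship Date"] - demo["Order Date"]).dt.days

print(demo.dtypes)
print(demo)
```

Using `errors="coerce"` keeps the conversion from failing on the few missing dates, at the cost of having to inspect the resulting `NaT` values afterwards.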


The function **text2float()** takes a number given as text (for example, "Thirteen") and returns it as a float.

In [141]:
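def text2float(textnum, numwords={}):
    # Note: the mutable default `numwords` is deliberate here; it caches the
    # word lookup table so that it is built only once across calls.
    try:
        # Attempt a direct numeric conversion first (handles "5", "-2639.9912", NaN)
        return float(textnum)
    except ValueError:
        # Direct conversion failed, so parse the value as number words (e.g. "Thirteen")
        textnum = textnum.lower()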
        
        if not numwords:
            units = [
                "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
                "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
                "sixteen", "seventeen", "eighteen", "nineteen",
            ]

            tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

            scales = ["hundred", "thousand", "million", "billion", "trillion"]

            numwords["and"] = (1, 0)
            for idx, word in enumerate(units):
                numwords[word] = (1, idx)
            for idx, word in enumerate(tens):
                numwords[word] = (1, idx * 10)
            for idx, word in enumerate(scales):
                numwords[word] = (10 ** (idx * 3 or 2), 0)

        current = result = 0
        for word in textnum.split():
            if word not in numwords:
                raise ValueError("Illegal word: " + word)

            scale, increment = numwords[word]
            current = current * scale + increment
            if scale > 100:
                result += current
                current = 0

        return result + current

## 2. Task 2

Before any statistical analysis, the numerical columns have to be cleansed so that the results are meaningful.
* Records with special characters in *Quantity* need to be cleansed.
* Records with special characters in *Profit* need to be cleansed.
* The **text2float()** function is applied to convert the *Quantity* and *Profit* columns to numbers.
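
Before cleaning, it can help to list exactly which entries are not numeric. The sketch below is one way to do that with `pd.to_numeric`; the `raw_quantity` series is a hypothetical sample, and the value `"6?"` is only an assumed example of how a stray `?` might appear.

```python
import pandas as pd

# Hypothetical sample mimicking the raw Quantity column (strings, as read from the CSV)
raw_quantity = pd.Series(["5", "4", "Two", "Thirteen", "6?", None])

# Values that pandas cannot parse as numbers become NaN under errors="coerce"
parsed = pd.to_numeric(raw_quantity, errors="coerce")

# Keep only entries that were present in the data but failed to parse
needs_cleaning = raw_quantity[parsed.isna() & raw_quantity.notna()]

print(needs_cleaning.tolist())
```

The resulting list shows which records need character stripping and which need word-to-number conversion, so the cleaning steps can be checked against real values rather than assumed.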

In [None]:
# Remove literal "?" characters from the Quantity column
global_super_store_data['Quantity'] = global_super_store_data['Quantity'].str.replace('?', '', regex=False)

# Remove stray double quotes from the Profit column
global_super_store_data['Profit'] = global_super_store_data['Profit'].str.replace('"', '', regex=False)

# Assuming zero values for NaN on Profits
global_super_store_data['Profit'] = global_super_store_data['Profit'].fillna(0)


# Remove stray double quotes from the Postal Code column
global_super_store_data['Postal Code'] = global_super_store_data['Postal Code'].str.replace('"', '')

# Applying text2float function
global_super_store_data['Quantity'] = global_super_store_data['Quantity'].apply(text2float)
global_super_store_data['Profit'] = global_super_store_data['Profit'].apply(text2float)
global_super_store_data

### 2.1 Descriptive Statistics

In [None]:
# Row ID is not needed for the analysis, hence dropping the column
if 'Row ID' in global_super_store_data.columns:
    global_super_store_data.drop('Row ID', axis=1, inplace=True)

global_super_store_data.describe()

In [None]:
# Columns with missing data
print(f"Sum of null records:\n{global_super_store_data.isnull().sum()}")

### 2.2 Outlier Treatment (Still needs to be worked on)

In [None]:
mean_value = global_super_store_data['Sales'].mean()
std_value = global_super_store_data['Sales'].std()

# Define a threshold for identifying outliers (e.g., 3 standard deviations from the mean)
threshold = 3
lower_threshold = mean_value - threshold * std_value
upper_threshold = mean_value + threshold * std_value

print(f"mean: {mean_value}")
print(f"std: {std_value}")
print(f"lower_threshold: {lower_threshold}")
print(f"upper_threshold: {upper_threshold}")


# Filter outliers
outliers = global_super_store_data[(global_super_store_data['Sales'] < lower_threshold) | (global_super_store_data['Sales'] > upper_threshold)]
outliers
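
Because *Sales* is strongly right-skewed (a mean of about 229.86 against a maximum of 22638.48 in the statistics above), the mean and standard deviation are themselves pulled up by the extreme values. An interquartile-range (IQR) rule is a common alternative; the sketch below is not the method used in this notebook, and the `sales` series is a hypothetical sample.

```python
import pandas as pd

# Hypothetical right-skewed sample standing in for the Sales column
sales = pd.Series([17.28, 54.48, 209.94, 12.5, 88.0, 150.0, 30.0, 22638.48])

# Interquartile range and the conventional 1.5 * IQR fences
q1 = sales.quantile(0.25)
q3 = sales.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
iqr_outliers = sales[(sales < lower_fence) | (sales > upper_fence)]

print(f"lower_fence: {lower_fence}")
print(f"upper_fence: {upper_fence}")
print(iqr_outliers)
```

Quartiles are less sensitive to a handful of very large orders than the mean and standard deviation, which is why this rule is often preferred for skewed data.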

### 2.3 Normalizing and Scaling

In [None]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

#Select Numerical Features
numerical_features = ['Sales', 'Quantity', 'Discount', 'Profit']
global_super_store_numerial_data = global_super_store_data[numerical_features]

# # Min-Max Scaling
# scaler = MinMaxScaler()
# scaled_data = scaler.fit_transform(global_super_store_numerial_data)

# # Creating DataFrame with scaled data
# scaled_df = pd.DataFrame(scaled_data, columns=global_super_store_numerial_data.columns)

# print(f"Min-Max Scaled DataFrame:\n{scaled_df}")

# # Standard Scaling (Z-score normalization)
# scaler = StandardScaler()
# standardized_data = scaler.fit_transform(global_super_store_numerial_data)

# # Creating DataFrame with standardized data
# standardized_df = pd.DataFrame(standardized_data, columns=global_super_store_numerial_data.columns)

# print("\nStandardized DataFrame:")
# standardized_df


# Correlation matrix
correlation_matrix = global_super_store_numerial_data.corr()
correlation_matrix
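
The two scaling steps that are commented out above can also be expressed in plain pandas, which makes it easier to see what each one computes. This is a minimal sketch on a hypothetical `numeric` frame, not a replacement for the scikit-learn scalers.

```python
import pandas as pd

# Hypothetical numeric sample reusing two of the notebook's column names
numeric = pd.DataFrame({
    "Sales": [10.0, 20.0, 30.0],
    "Quantity": [1.0, 2.0, 3.0],
})

# Min-max scaling: maps each column onto the range [0, 1]
min_max_scaled = (numeric - numeric.min()) / (numeric.max() - numeric.min())

# Z-score standardization using the population standard deviation (ddof=0),
# which is the convention scikit-learn's StandardScaler follows
standardized = (numeric - numeric.mean()) / numeric.std(ddof=0)

print(min_max_scaled)
print(standardized)
```

Note that pandas' `std()` defaults to `ddof=1`, so leaving out `ddof=0` would give slightly different numbers from `StandardScaler`.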


### 2.4 Grouping of data

In [None]:
# group total sales by category from the highest sale.
sales_category = global_super_store_data.groupby('Category')['Sales'].sum().sort_values(ascending=False)
sales_category

In [None]:
# group total profits by category
profit_category = global_super_store_data.groupby('Category')['Profit'].sum().sort_values(ascending=False)
profit_category

In [None]:
# group total sales by category
sales_category = global_super_store_data.groupby('Category')['Sales'].sum()

# group total profits by category
profit_category = global_super_store_data.groupby('Category')['Profit'].sum()


# figure size
plt.figure(figsize=(16,12));

# left total sales pie chart
plt.subplot(1,2,1); # 1 row, 2 columns, the 1st plot.
plt.pie(sales_category.values, labels=sales_category.index, startangle=90, counterclock=False,
        autopct=lambda p:f'{p:.1f}% \n ${p*np.sum(sales_category.values)/100 :,.0f}',
        wedgeprops={'linewidth': 1, 'edgecolor':'black', 'alpha':0.75});
plt.axis('square');
plt.title('Total Sales by Category',  fontdict={'fontsize':16});

# right total profits pie chart
plt.subplot(1,2,2); # 1 row, 2 columns, the 2nd plot
plt.pie(profit_category.values, labels=profit_category.index, startangle=90, counterclock=False,
        autopct=lambda p:f'{p:.1f}% \n ${p*np.sum(profit_category.values)/100 :,.0f}',
        wedgeprops={'linewidth': 1, 'edgecolor':'black', 'alpha':0.75});
plt.axis('square');
plt.title('Total Profit by Category', fontdict={'fontsize':16});

### 2.5 Handling missing values in the dataset
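
The info() output shows only a handful of missing values per column, so one possible approach is to drop rows that lack a key identifier and to mark categorical gaps explicitly. The sketch below illustrates that idea on a hypothetical `demo` frame; the choice of which columns to drop or fill is an assumption, not a decision taken in this notebook.

```python
import pandas as pd

# Hypothetical rows with gaps similar to those reported by info()
demo = pd.DataFrame({
    "Order ID": ["CA-1", None, "CA-3", "CA-4"],
    "Segment": ["Consumer", "Consumer", None, "Corporate"],
    "Sales": [10.0, 20.0, 30.0, None],
})

# Assumption: rows without an order identifier cannot be traced, so drop them
cleaned = demo.dropna(subset=["Order ID"]).copy()

# Fill a categorical gap with an explicit placeholder rather than guessing a value
cleaned["Segment"] = cleaned["Segment"].fillna("Unknown")

# Report what is still missing instead of silently imputing numeric columns
print(cleaned.isnull().sum())
```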

### 2.6 Correlation 

### 2.7 Univariate analysis and visualisation

In [None]:
columns_to_describe['Profit'] = columns_to_describe['Profit'].str.replace('"','')
columns_to_describe['Profit'] = columns_to_describe['Profit'].fillna(0)
columns_to_describe

In [None]:
columns_to_describe.describe()
columns_to_describe.info()

In [None]:
columns_to_describe['Profit'] = columns_to_describe['Profit'].apply(text2float)
columns_to_describe

In [None]:
columns_to_describe.describe()

In [None]:
columns_to_describe.info()

In [None]:
columns_to_describe['Transformed'] = np.log1p(columns_to_describe['Sales'])
print("Original data:")
columns_to_describe

In [None]:
ss_data.describe()

In [None]:
ss_data['Profit'] = ss_data['Profit'].str.replace('"', '')

In [None]:
ss_data['Quantity'] = ss_data['Quantity'].str.replace('?', '', regex=False)

In [None]:
ss_data['Quantity'] = ss_data['Quantity'].apply(text2float)

In [None]:
ss_data.describe()

In [None]:
ss_data['Postal Code'] = ss_data['Postal Code'].str.replace('"', '')

In [None]:
# Replace only cells that are exactly 'US'; str.replace would also rewrite 'US' inside longer names
ss_data['Country'] = ss_data['Country'].replace('US', 'United States')

In [None]:
ss_data.to_csv('10042024.csv')

In [None]:
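# Drop missing states before taking unique values; assigning dropna() back
# to the column is a no-op because pandas realigns on the index
us_states = ss_data['State'].dropna().unique()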

print(us_states)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
# %matplotlib inline

In [None]:
sns.pairplot(ss_data)

In [None]:
ss_data['Category'] = ss_data['Category'].str.replace('Frnture', 'Furniture')
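
The first 10 records also show misspelled region labels ("Est", "Southh") alongside "Frnture". A whole-value mapping, sketched below on a hypothetical `demo` frame, fixes several known misspellings in one step and leaves every other value untouched. The `corrections` dictionary covers only the misspellings visible in the sample, so the full data may need more entries.

```python
import pandas as pd

# Hypothetical sample containing the misspellings visible in the first 10 records
demo = pd.DataFrame({
    "Category": ["Technology", "Frnture", "Office Supplies"],
    "Region": ["Est", "Southh", "Central"],
})

# Per-column mapping of misspelled value to corrected value.
# DataFrame.replace matches whole cell values, not substrings.
corrections = {
    "Category": {"Frnture": "Furniture"},
    "Region": {"Est": "East", "Southh": "South"},
}
demo = demo.replace(corrections)

print(demo)
```

Checking `demo["Region"].unique()` and `demo["Category"].unique()` afterwards is a quick way to confirm that no unexpected labels remain.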

In [None]:
sns.pairplot(ss_data)

In [None]:
ss_data.groupby('State').size()

In [None]:
sns.lmplot(data=ss_data, x='Sales', y='Quantity', hue='Category')