# Sales Data Analysis Project

## Table of Contents
1. [Introduction](#Introduction)
2. [Data Loading and Inspection](#Data-Loading-and-Inspection)
3. [Data Cleaning](#Data-Cleaning)
4. [Exploratory Data Analysis (EDA)](#Exploratory-Data-Analysis-(EDA))
    - [Sales Over Time](#Sales-Over-Time)
    - [Sales by Product Line](#Sales-by-Product-Line)
    - [Sales by Deal Size](#Sales-by-Deal-Size)
5. [Advanced Analysis](#Advanced-Analysis)
    - [Sales by Territory](#Sales-by-Territory)
    - [Top Customers](#Top-Customers)
6. [Conclusion](#Conclusion)

## Introduction

## Data Loading and Inspection

In [1]:
# import libraries for data processing and visualization
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

First, we load the data and inspect some basic information about it.

In [2]:
sales_data = pd.read_csv('sales_data_sample.csv', index_col='ORDERNUMBER')

# First 5 rows
sales_data.head()

ORDERNUMBER,QUANTITYORDERED,PRICEEACH,ORDERLINENUMBER,SALES,ORDERDATE,STATUS,QTR_ID,MONTH_ID,YEAR_ID,PRODUCTLINE,...,ADDRESSLINE1,ADDRESSLINE2,CITY,STATE,POSTALCODE,COUNTRY,TERRITORY,CONTACTLASTNAME,CONTACTFIRSTNAME,DEALSIZE
10107,30,95.7,2,2871.0,2/24/2003 0:00,Shipped,1,2,2003,Motorcycles,...,897 Long Airport Avenue,,NYC,NY,10022.0,USA,,Yu,Kwai,Small
10121,34,81.35,5,2765.9,5/7/2003 0:00,Shipped,2,5,2003,Motorcycles,...,59 rue de l'Abbaye,,Reims,,51100.0,France,EMEA,Henriot,Paul,Small
10134,41,94.74,2,3884.34,7/1/2003 0:00,Shipped,3,7,2003,Motorcycles,...,27 rue du Colonel Pierre Avia,,Paris,,75508.0,France,EMEA,Da Cunha,Daniel,Medium
10145,45,83.26,6,3746.7,8/25/2003 0:00,Shipped,3,8,2003,Motorcycles,...,78934 Hillside Dr.,,Pasadena,CA,90003.0,USA,,Young,Julie,Medium
10159,49,100.0,14,5205.27,10/10/2003 0:00,Shipped,4,10,2003,Motorcycles,...,7734 Strong St.,,San Francisco,CA,,USA,,Brown,Julie,Medium


In [3]:
# Outputs the five-number summary (min, 25%, 50%, 75%, max) as well as count, mean, and standard deviation
sales_data.describe()

Statistic,QUANTITYORDERED,PRICEEACH,ORDERLINENUMBER,SALES,QTR_ID,MONTH_ID,YEAR_ID,MSRP
count,2823.0,2823.0,2823.0,2823.0,2823.0,2823.0,2823.0,2823.0
mean,35.092809,83.658544,6.466171,3553.889072,2.717676,7.092455,2003.81509,100.715551
std,9.741443,20.174277,4.225841,1841.865106,1.203878,3.656633,0.69967,40.187912
min,6.0,26.88,1.0,482.13,1.0,1.0,2003.0,33.0
25%,27.0,68.86,3.0,2203.43,2.0,4.0,2003.0,68.0
50%,35.0,95.7,6.0,3184.8,3.0,8.0,2004.0,99.0
75%,43.0,100.0,9.0,4508.0,4.0,11.0,2004.0,124.0
max,97.0,100.0,18.0,14082.8,4.0,12.0,2005.0,214.0
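
`describe()` summarizes only numeric columns by default, which is why text columns such as STATUS and PRODUCTLINE do not appear above. A minimal sketch of how the text columns could be summarized as well, using a small made-up frame (the rows below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for sales_data; the values are made up for illustration.
toy = pd.DataFrame({
    'STATUS': ['Shipped', 'Shipped', 'Cancelled', 'Shipped'],
    'PRODUCTLINE': ['Motorcycles', 'Classic Cars', 'Motorcycles', 'Motorcycles'],
    'SALES': [2871.0, 2765.9, 3884.34, 3746.7],
})

# Excluding numeric dtypes leaves the text columns, for which describe()
# reports count, number of unique values, most frequent value (top),
# and how often that value occurs (freq).
text_summary = toy.describe(exclude=['number'])
print(text_summary)
```

On the real data, the same call on `sales_data` should give a quick view of how many distinct statuses, product lines, and customers there are.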


In [4]:
null_counts = sales_data.isnull().sum().astype(str) + ' null'
non_null_counts = sales_data.notnull().sum().astype(str) + ' non-null'
dtypes = sales_data.dtypes

# Contains number of null and non-null entries as well as the type of the entry for each column
sales_info = pd.DataFrame({'Non-Null Count': non_null_counts, 'Null Count': null_counts, 'dtype': dtypes,})
print(sales_info)

                 Non-Null Count Null Count    dtype
QUANTITYORDERED   2823 non-null     0 null    int64
PRICEEACH         2823 non-null     0 null  float64
ORDERLINENUMBER   2823 non-null     0 null    int64
SALES             2823 non-null     0 null  float64
ORDERDATE         2823 non-null     0 null   object
STATUS            2823 non-null     0 null   object
QTR_ID            2823 non-null     0 null    int64
MONTH_ID          2823 non-null     0 null    int64
YEAR_ID           2823 non-null     0 null    int64
PRODUCTLINE       2823 non-null     0 null   object
MSRP              2823 non-null     0 null    int64
PRODUCTCODE       2823 non-null     0 null   object
CUSTOMERNAME      2823 non-null     0 null   object
PHONE             2823 non-null     0 null   object
ADDRESSLINE1      2823 non-null     0 null   object
ADDRESSLINE2       302 non-null  2521 null   object
CITY              2823 non-null     0 null   object
STATE             1337 non-null  1486 null   object
POSTALCODE  
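
The listing above reports ORDERDATE as `object`, meaning the dates are held as plain strings. A minimal sketch of converting them to datetimes, assuming every value follows the month/day/year hour:minute pattern seen in the first rows (the format string is an assumption, not something this notebook specifies):

```python
import pandas as pd

# Sample strings copied from the first rows shown earlier.
order_dates = pd.Series(['2/24/2003 0:00', '5/7/2003 0:00', '10/10/2003 0:00'])

# Assumed format: month/day/year hour:minute.
parsed = pd.to_datetime(order_dates, format='%m/%d/%Y %H:%M')

print(parsed.dt.year.tolist())
print(parsed.dt.month.tolist())
```

Applied to the real frame, this would look like `sales_data['ORDERDATE'] = pd.to_datetime(sales_data['ORDERDATE'], format='%m/%d/%Y %H:%M')`, after which date-based grouping and resampling become available.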

## Data Cleaning

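From the summary above, we see that the ADDRESSLINE2, STATE, and TERRITORY columns contain null values. Dropping the affected rows would discard too much valuable information (ADDRESSLINE2 alone is null in 2521 of the 2823 rows), so let's fill the missing values with placeholder labels instead: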

In [5]:
# Fill each column's missing values with its own placeholder label
sales_data.fillna({
    'ADDRESSLINE2': 'No Address Line',
    'STATE': 'No State',
    'TERRITORY': 'No Territory',
}, inplace=True)
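
One thing worth checking about TERRITORY: in the first rows shown earlier, the USA orders are the ones with a missing territory. pandas treats the literal text `NA` as missing by default, so if the raw file encodes North America as `NA` (an assumption, since the CSV itself is not shown here), those values would have been read as nulls. A sketch of the effect on a small in-memory CSV with made-up rows:

```python
import io
import pandas as pd

# Made-up CSV in which 'NA' is a real territory code and one field is truly empty.
raw = "ORDERNUMBER,COUNTRY,TERRITORY\n1,USA,NA\n2,France,EMEA\n3,Spain,\n"

# Default parsing: both 'NA' and the empty field become NaN.
default = pd.read_csv(io.StringIO(raw), index_col='ORDERNUMBER')

# Keep 'NA' as text and treat only empty fields as missing.
strict = pd.read_csv(io.StringIO(raw), index_col='ORDERNUMBER',
                     keep_default_na=False, na_values=[''])

print(default['TERRITORY'].isnull().sum())
print(strict['TERRITORY'].tolist())
```

If the assumption holds for the real file, reading it with the second set of options would preserve the territory code rather than replacing it with a placeholder.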

Next, let's check for duplicate rows:

In [6]:
duplicate_rows = sales_data.duplicated().sum()
print(f"Number of duplicate rows is {duplicate_rows}")

Number of duplicate rows is 0
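
Note that `duplicated()` compares column values only and ignores the index, so with ORDERNUMBER as the index the check above does not take the order number into account. A sketch of a key-based check, assuming each (ORDERNUMBER, ORDERLINENUMBER) pair is meant to identify exactly one row (an assumption about the data, not something this notebook states); the toy rows are made up:

```python
import pandas as pd

# Toy frame indexed by ORDERNUMBER, mirroring how sales_data was loaded.
toy = pd.DataFrame(
    {'ORDERLINENUMBER': [1, 2, 1, 1], 'SALES': [100.0, 250.0, 100.0, 80.0]},
    index=pd.Index([10107, 10107, 10121, 10121], name='ORDERNUMBER'),
)

# Column-only comparison: the third row repeats the first row's column
# values, so it is flagged even though it belongs to a different order.
full_row_flags = toy.duplicated().tolist()

# Key-based comparison: move ORDERNUMBER into the columns and compare
# only the assumed key, which flags the repeated (10121, 1) pair instead.
key_flags = (
    toy.reset_index()
       .duplicated(subset=['ORDERNUMBER', 'ORDERLINENUMBER'])
       .tolist()
)

print(full_row_flags)
print(key_flags)
```

The two checks answer different questions: the first asks whether any row's column values repeat, the second whether any order line appears more than once.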
