# Business Understanding

## Overview

## Business Problem

With limited data visibility across your operations, your company lacks the insights needed for strategic decision-making. The current system does not provide adequate analytics on sales performance, customer behavior, product popularity, or regional trends, which makes it difficult to optimize inventory, pricing strategies, and customer relationships. This data gap prevents you from identifying growth opportunities and addressing operational inefficiencies.

## Objectives

### Main Objective

Implement a comprehensive data analytics system to track and analyze sales performance across products, customers, and regions.

### Specific Objectives

- Data Cleaning: Standardize and preprocess the sales data to ensure accuracy, completeness, and consistency
- Data Analysis: Perform exploratory data analysis to uncover patterns, trends, and relationships within the sales data
- Data Visualization: Create meaningful visual representations of the data to communicate insights effectively
- Modeling: Develop predictive models to forecast sales trends, customer behavior, and product performance
- Evaluation: Assess the performance and accuracy of the developed models
- Deployment: Implement the models and insights into business operations for continuous improvement

# Data Understanding

## Data Source
The dataset was downloaded from Kaggle, a platform for data science competitions and datasets. As the file name `synthetic_beverage_sales_data.csv` indicates, it contains synthetic transaction records for a beverage distribution business, covering product categories such as water, soft drinks, and juices.

## Data Structure
The dataset contains the following fields:

- Order_ID: Identifier for each order; an order can span several rows, one per product (e.g., "ORD1")
- Customer_ID: Unique identifier for each customer (e.g., "CUS1496")
- Customer_Type: Classification of customer (e.g., "B2B" for business-to-business)
- Product: Name of the product sold (e.g., "Vio Wasser", "Evian")
- Category: Product category (e.g., "Water")
- Unit_Price: Price per unit of the product (e.g., 1.66, 1.56)
- Quantity: Number of units ordered (e.g., 53, 90)
- Discount: Discount rate applied to the order line, stored as a fraction (e.g., 0.1 for 10%)
- Total_Price: Final price after applying discount (e.g., 79.18, 126.36)
- Region: Geographic location of the sale (e.g., "Baden-Württemberg")
- Order_Date: Date when the order was placed (e.g., "2023-08-23")
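The relationship between the price fields can be checked directly. The sketch below assumes that `Total_Price` equals `Unit_Price * Quantity * (1 - Discount)`, rounded to two decimals. That formula matches the five rows previewed in this notebook but is not documented by the dataset itself. The `sample` frame and the `expected_total` helper are illustrative names, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Five rows copied from the df.head() output of this notebook.
sample = pd.DataFrame(
    {
        "Unit_Price": [1.66, 1.56, 1.17, 3.22, 0.87],
        "Quantity": [53, 90, 73, 59, 35],
        "Discount": [0.10, 0.10, 0.05, 0.10, 0.10],
        "Total_Price": [79.18, 126.36, 81.14, 170.98, 27.40],
    }
)


def expected_total(frame: pd.DataFrame) -> pd.Series:
    """Recompute the line total under the assumed pricing formula."""
    return frame["Unit_Price"] * frame["Quantity"] * (1 - frame["Discount"])


# Allow one cent of slack because the stored totals are rounded.
mismatch = ~np.isclose(expected_total(sample), sample["Total_Price"], atol=0.01)
print(f"Rows that break the assumed formula: {int(mismatch.sum())}")
```

Running the same comparison on the full `df` would show whether the assumption holds beyond these five rows.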

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [4]:
df = pd.read_csv("synthetic_beverage_sales_data.csv")
df.head()
 

|   | Order_ID | Customer_ID | Customer_Type | Product | Category | Unit_Price | Quantity | Discount | Total_Price | Region | Order_Date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ORD1 | CUS1496 | B2B | Vio Wasser | Water | 1.66 | 53 | 0.1 | 79.18 | Baden-Württemberg | 2023-08-23 |
| 1 | ORD1 | CUS1496 | B2B | Evian | Water | 1.56 | 90 | 0.1 | 126.36 | Baden-Württemberg | 2023-08-23 |
| 2 | ORD1 | CUS1496 | B2B | Sprite | Soft Drinks | 1.17 | 73 | 0.05 | 81.14 | Baden-Württemberg | 2023-08-23 |
| 3 | ORD1 | CUS1496 | B2B | Rauch Multivitamin | Juices | 3.22 | 59 | 0.1 | 170.98 | Baden-Württemberg | 2023-08-23 |
| 4 | ORD1 | CUS1496 | B2B | Gerolsteiner | Water | 0.87 | 35 | 0.1 | 27.4 | Baden-Württemberg | 2023-08-23 |


In [5]:
# Info of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8999910 entries, 0 to 8999909
Data columns (total 11 columns):
 #   Column         Dtype  
---  ------         -----  
 0   Order_ID       object 
 1   Customer_ID    object 
 2   Customer_Type  object 
 3   Product        object 
 4   Category       object 
 5   Unit_Price     float64
 6   Quantity       int64  
 7   Discount       float64
 8   Total_Price    float64
 9   Region         object 
 10  Order_Date     object 
dtypes: float64(3), int64(1), object(7)
memory usage: 755.3+ MB
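`df.info()` reports `Order_Date` as `object` and roughly 755 MB for the frame; the `+` means this is a lower bound because the contents of the string columns are not fully counted. Below is a minimal sketch of one way to tighten the dtypes. It assumes the text columns listed have few distinct values and that dates follow the `YYYY-MM-DD` format seen in the preview. `optimize_dtypes` is a hypothetical helper, shown on a tiny in-memory frame rather than the full CSV.

```python
import pandas as pd


def optimize_dtypes(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with parsed dates and categorical text columns."""
    out = frame.copy()
    # Parse the date strings so time-based grouping and filtering work.
    out["Order_Date"] = pd.to_datetime(out["Order_Date"], format="%Y-%m-%d")
    # Assumed low-cardinality columns; categories store each label only once.
    for col in ["Customer_Type", "Product", "Category", "Region"]:
        out[col] = out[col].astype("category")
    return out


# Two rows in the shape of the notebook's preview, for demonstration only.
demo = pd.DataFrame(
    {
        "Customer_Type": ["B2B", "B2B"],
        "Product": ["Vio Wasser", "Evian"],
        "Category": ["Water", "Water"],
        "Region": ["Baden-Württemberg", "Baden-Württemberg"],
        "Order_Date": ["2023-08-23", "2023-08-23"],
    }
)
optimized = optimize_dtypes(demo)
print(optimized.dtypes)
```

On the full dataset, calling `df.info(memory_usage="deep")` before and after the conversion gives a like-for-like memory comparison.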


In [7]:
# Descriptive statistics
df.describe()

|       | Unit_Price | Quantity | Discount | Total_Price |
|---|---|---|---|---|
| count | 8999910.0 | 8999910.0 | 8999910.0 | 8999910.0 |
| mean | 5.818037 | 23.13813 | 0.02972879 | 130.7437 |
| std | 14.7005 | 26.89321 | 0.04479841 | 509.6947 |
| min | 0.32 | 1.0 | 0.0 | 0.3 |
| 25% | 1.05 | 6.0 | 0.0 | 8.4 |
| 50% | 1.75 | 11.0 | 0.0 | 21.14 |
| 75% | 3.21 | 30.0 | 0.05 | 69.49 |
| max | 169.53 | 100.0 | 0.15 | 14295.3 |


## Data Preprocessing

### Data Cleaning

In [8]:
# Check for null values in each column
df.isna().sum()

Order_ID         0
Customer_ID      0
Customer_Type    0
Product          0
Category         0
Unit_Price       0
Quantity         0
Discount         0
Total_Price      0
Region           0
Order_Date       0
dtype: int64
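Null counts cover only part of the cleaning objective. As a sketch of further checks, the hypothetical `quality_report` helper below also counts fully duplicated rows and values outside assumed valid ranges (positive price and quantity, discount between 0 and 1). These ranges are assumptions for illustration, not rules stated by the dataset.

```python
import pandas as pd


def quality_report(frame: pd.DataFrame) -> dict:
    """Summarize basic data-quality problems as plain integer counts."""
    return {
        "null_cells": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "non_positive_price": int((frame["Unit_Price"] <= 0).sum()),
        "non_positive_quantity": int((frame["Quantity"] <= 0).sum()),
        # Assumed valid range: discounts are fractions between 0 and 1.
        "discount_out_of_range": int((~frame["Discount"].between(0, 1)).sum()),
    }


# Demo frame; the third row deliberately repeats the second one.
demo = pd.DataFrame(
    {
        "Order_ID": ["ORD1", "ORD1", "ORD1"],
        "Product": ["Vio Wasser", "Evian", "Evian"],
        "Unit_Price": [1.66, 1.56, 1.56],
        "Quantity": [53, 90, 90],
        "Discount": [0.1, 0.1, 0.1],
    }
)
report = quality_report(demo)
print(report)
```

A repeated row is not necessarily an error in transactional data, so any duplicates found this way are worth inspecting before dropping them.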

In [None]:
class DataUnderstanding: