# Retail Sales Data - Classification Analysis

Goal: build a classification model on the Superstore dataset to derive business insights and make predictions.

**Business Problem:**

The company wants to identify customers who are likely to generate high profit versus those who generate low or negative profit. This classification can help in targeting high-value customers for loyalty programs or optimizing marketing spend.

**Project Objective:**

Build a classification model that predicts whether a customer is **Profitable** or **Not Profitable** based on historical purchase data.

**Target Variable:**

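A new column <code>Profitable</code> (1 = Profitable, 0 = Not Profitable), derived from the <code>Profit</code> column. Each row of the dataset is an order line, so the label describes individual transactions rather than whole customers.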

**Key Deliverables:**

- Data cleaning & preparation
- Feature engineering
- Exploratory data analysis (EDA)
- Classification model building
- Model evaluation
- Insights & recommendations

In [None]:
!pip install pandas numpy matplotlib seaborn scikit-learn

## Phase 1: Data Collection & Understanding
### Step 1: Importing Required Libraries
We begin by importing the essential libraries for data manipulation, visualization, and building our machine learning model.

In [None]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# For machine learning and preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

### Step 2: Load the Dataset

In [None]:
url = 'https://raw.githubusercontent.com/harsh-aithal/Retail-Sales-Analysis-DS-Project/main/data/superstore_data.csv'
df = pd.read_csv(url, encoding='ISO-8859-1')
df.head()

### Step 3: Basic Info Check

In [None]:
df.shape  # Check rows and columns

In [None]:
df.info()  # Data types and nulls

In [None]:
df.describe()  # Stats summary

In [None]:
df.isnull().sum()  # Total nulls

In [None]:
df.duplicated().sum()  # Check for duplicates

## Phase 2: Data Cleaning & Preparation

### Step 1: Drop Unnecessary Columns

Some columns, such as <code>Row ID</code>, <code>Customer Name</code>, <code>Postal Code</code>, <code>Country</code>, and <code>Product ID</code>, won't help with prediction. We'll drop them to simplify the model.

In [None]:
df.drop(['Row ID', 'Customer Name', 'Postal Code', 'Country', 'Product ID'], axis=1, inplace=True)

### Step 2: Convert Date Columns to DateTime Format

In [None]:
# format='mixed' infers the format per value and requires pandas 2.0 or newer
df['Order Date'] = pd.to_datetime(df['Order Date'], format='mixed')
df['Ship Date'] = pd.to_datetime(df['Ship Date'], format='mixed')

### Step 3: Feature Engineering - Create New Features

In [None]:
df['Order Month'] = df['Order Date'].dt.month
df['Order Day'] = df['Order Date'].dt.day
df['Order Weekday'] = df['Order Date'].dt.weekday
df['Shipping Duration'] = (df['Ship Date'] - df['Order Date']).dt.days
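
The derived date features are easy to sanity-check on a couple of rows before trusting them on the full dataset. The sketch below uses a hypothetical two-row sample and assumes the dates are written month/day/year; if the real file mixes formats, the explicit `format` string would need to change.

```python
import pandas as pd

# Hypothetical two-row sample; the notebook itself uses the Superstore CSV.
sample = pd.DataFrame({
    "Order Date": ["11/8/2016", "6/12/2016"],
    "Ship Date": ["11/11/2016", "6/16/2016"],
})

# An explicit format (assumed month/day/year here) avoids ambiguous parsing.
for col in ["Order Date", "Ship Date"]:
    sample[col] = pd.to_datetime(sample[col], format="%m/%d/%Y")

sample["Order Month"] = sample["Order Date"].dt.month
sample["Order Weekday"] = sample["Order Date"].dt.weekday  # Monday = 0
sample["Shipping Duration"] = (sample["Ship Date"] - sample["Order Date"]).dt.days

# A negative duration would mean an order shipped before it was placed.
negative_durations = int((sample["Shipping Duration"] < 0).sum())
print(sample)
print("Rows shipped before ordering:", negative_durations)
```

Running the same negative-duration check on the real `df` is a cheap way to catch dates that were parsed with day and month swapped.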

### Step 4: Encode Categorical Variables
Let's convert the categorical columns into numbers using label encoding (for now; we'll explore alternatives later).

In [None]:
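le = LabelEncoder()
categorical_cols = ['Order ID', 'Ship Mode', 'Customer ID', 'Segment', 'City', 'State', 'Region', 'Category', 'Sub-Category', 'Product Name']

# Label-encode each categorical column in place.
# The same encoder is refit for every column, so after the loop `le`
# only holds the mapping for the last column ('Product Name').
for col in categorical_cols:
    df[col] = le.fit_transform(df[col])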
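
Label encoding replaces the original names with integers, which makes later plots harder to read. One possible variation, sketched here on hypothetical sample rows, keeps a separate fitted encoder per column so codes can be mapped back to names, and shows one-hot encoding with `pd.get_dummies` as one alternative. The `encoders` dictionary and the sample frame are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample rows standing in for the Superstore data
sample = pd.DataFrame({
    "Category": ["Technology", "Furniture", "Office Supplies", "Furniture"],
    "Region": ["West", "East", "West", "South"],
})

# One fitted encoder per column, so codes can be mapped back to labels later
encoders = {}
encoded = sample.copy()
for col in ["Category", "Region"]:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(sample[col])

# LabelEncoder assigns codes in sorted order of the labels
print(list(encoders["Category"].classes_))
restored = encoders["Category"].inverse_transform(encoded["Category"])

# One-hot encoding avoids implying an order between categories
one_hot = pd.get_dummies(sample, columns=["Category", "Region"])
print(one_hot.columns.tolist())
```

Tree-based models such as random forests usually tolerate label-encoded inputs, but one-hot columns tend to be easier to interpret in a feature-importance chart.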

### Step 5: Define Target Column

We'll be doing classification. Let's use **Profitable vs Non-Profitable** as our classification target:

In [None]:
df['Profitable'] = df['Profit'].apply(lambda x: 1 if x > 0 else 0)

Now <code>Profitable</code> will be your target, and the rest will be your features.
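
Before modeling, it helps to know how balanced the two classes are, because accuracy is only meaningful relative to the share of the majority class. The sketch below uses made-up profit values to show a vectorized form of the same rule (profit above zero maps to 1) and how to read off the class shares.

```python
import pandas as pd

# Hypothetical profit values; the real column comes from the Superstore CSV
profit = pd.Series([12.5, -3.0, 0.0, 40.1, 7.2, -15.8, 22.0, 5.5])

# Vectorized equivalent of the lambda rule: profit > 0 -> 1, otherwise 0
profitable = (profit > 0).astype(int)

# Share of each class; the larger one is what a trivial model would score
class_share = profitable.value_counts(normalize=True)
majority_share = class_share.max()
print(class_share)
print(f"Majority-class share: {majority_share:.2%}")
```

Note that a profit of exactly zero is labeled 0 (not profitable) under this rule.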

## Phase 3: Model Building & Evaluation
### Step 1: Feature Selection & Splitting
- Drop identifier columns, raw date columns, and <code>Profit</code> (the target is derived from it, so keeping it would leak the answer): <code>Order ID</code>, <code>Customer ID</code>, <code>Product Name</code>, <code>Profit</code>, <code>Order Date</code>, <code>Ship Date</code>
- Define X (features) and y (target: <code>Profitable</code>)

In [None]:
df_model = df.drop(['Order ID', 'Customer ID', 'Product Name', 'Profit'], axis=1)
df_model = df_model.drop(['Order Date', 'Ship Date'], axis=1)

X = df_model.drop('Profitable', axis=1)
y = df_model['Profitable']

### Step 2: Train-Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
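
If the classes turn out to be imbalanced, passing `stratify=y` to `train_test_split` keeps the class proportions the same in both splits. The following is a minimal sketch on hypothetical data with an assumed 80/20 class mix; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80% positive, 20% negative
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "Sales": rng.uniform(1, 500, 200),
    "Discount": rng.uniform(0, 0.8, 200),
})
y_demo = pd.Series([1] * 160 + [0] * 40)

# stratify=y_demo preserves the class ratio in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Positive share, train:", y_tr.mean())
print("Positive share, test: ", y_te.mean())
```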

### Step 3: Model Training (Random Forest Classifier)

In [None]:
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

### Step 4: Model Evaluation

In [None]:
y_pred = rf_model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

**Output:**
- **Accuracy: ~94.7%**
- Very high precision & recall for predicting profitable transactions.
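
An accuracy figure is easier to judge next to a trivial baseline. The sketch below compares a random forest against scikit-learn's `DummyClassifier`, which always predicts the most frequent class. It runs on synthetic imbalanced data from `make_classification`, not on the Superstore data, so the numbers it prints say nothing about the result above.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 80% in class 1), for illustration only
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.2, 0.8], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Baseline: always predict the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
forest = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
forest_acc = accuracy_score(y_te, forest.predict(X_te))
# Balanced accuracy averages recall over both classes
forest_bal_acc = balanced_accuracy_score(y_te, forest.predict(X_te))

print(f"Baseline accuracy:        {baseline_acc:.3f}")
print(f"Random forest accuracy:   {forest_acc:.3f}")
print(f"Random forest (balanced): {forest_bal_acc:.3f}")
```

The same comparison can be run with the notebook's `X_train`, `X_test`, `y_train`, and `y_test` to see how much of the reported accuracy comes from the model rather than from class imbalance.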

### Step 5: Feature Importance Visualizations

In [None]:
importances = rf_model.feature_importances_
features = X.columns

feature_importance_df = pd.DataFrame({'Feature': features, 'Importance': importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

plt.figure(figsize=(10,6))
sns.barplot(x='Importance', y='Feature', data=feature_importance_df)
plt.title('Feature Importance from Random Forest')
plt.tight_layout()
plt.show()

**Key Observations:**
- Features such as <code>Sales</code>, <code>Discount</code>, <code>Shipping Duration</code>, and <code>Order Month</code> were the top contributors to the model's predictions.
- The model is highly accurate, especially at predicting profitable orders.
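
The impurity-based scores in `feature_importances_` are computed on the training data and can favor features with many distinct values. As a cross-check, scikit-learn's `permutation_importance` measures how much held-out accuracy drops when one feature is shuffled. The sketch below uses synthetic data in which, by construction, only the hypothetical `Discount` column determines the label.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends on "Discount" only; "Noise" is irrelevant
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "Discount": rng.uniform(0, 0.8, 500),
    "Noise": rng.normal(size=500),
})
y_demo = (X_demo["Discount"] < 0.3).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the drop in accuracy
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
perm_df = pd.DataFrame({
    "Feature": X_demo.columns,
    "Importance": result.importances_mean,
}).sort_values("Importance", ascending=False)
print(perm_df)
```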

## Phase 4: Insights and Recommendations
#### Objective:
Draw actionable business insights from the analysis and model results to help improve profitability.

### Step 1: Review Feature Importance Again

Look again at the top features from the bar plot. Importance scores show how much a feature contributes to predictions, not the direction of its effect, so treat the directional readings below as hypotheses to check against the data:
- High <code>Sales</code> and long <code>Shipping Duration</code> appear to be associated with lower profitability
- <code>Discount</code> also plays a major role: too much discount reduces profit
- <code>Order Month</code>, <code>Region</code>, and <code>Category</code> affect patterns in profit

### Step 2: Deep Dive into Key Business Insights

Use basic visualizations to derive real-world insights:

#### Insight 1: High Discounts Hurt Profitability

In [None]:
plt.figure(figsize=(6,4))
sns.boxplot(x='Profitable', y='Discount', data=df)
plt.title('Discount vs Profitability')
plt.show()

**Observation:** Non-profitable orders usually have much higher discounts.

**Recommendation:**
Limit high discounts, especially on low-margin products.
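
To turn "limit high discounts" into a concrete threshold, one option is to bucket orders by discount and compare the share of profitable orders in each band. The sketch below uses hypothetical order lines, and the band edges are illustrative assumptions rather than values taken from the analysis.

```python
import pandas as pd

# Hypothetical order lines; the real values come from the Superstore data
orders = pd.DataFrame({
    "Discount": [0.0, 0.0, 0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.7, 0.8],
    "Profitable": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})

# Band edges are assumptions; each band includes its upper edge
bands = pd.cut(
    orders["Discount"],
    bins=[-0.01, 0.1, 0.2, 0.4, 1.0],
    labels=["0-10%", "10-20%", "20-40%", "40%+"],
)

# Share of profitable orders and number of orders per discount band
band_summary = (
    orders.groupby(bands, observed=True)["Profitable"].agg(["mean", "count"])
)
print(band_summary)
```

The band where the profitable share drops sharply is a candidate for a discount cap, provided the `count` column shows enough orders to trust the estimate.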

#### Insight 2: Long Shipping Duration = Less Profit

In [None]:
plt.figure(figsize=(6,4))
sns.boxplot(x='Profitable', y='Shipping Duration', data=df)
plt.title('Shipping Duration vs Profitability')
plt.show()

**Observation:** Orders that took longer to ship were more likely to be unprofitable.

**Recommendation:**
Optimize logistics and partner with faster delivery services in regions with long shipping delays.

#### Insight 3: Certain Categories Have More Losses

In [None]:
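# 'Category' was label-encoded in Phase 2, so the bars are labeled with
# integer codes (assigned in sorted order of the original category names)
category_profit = df.groupby('Category')['Profitable'].mean().sort_values()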

category_profit.plot(kind='bar', figsize=(6,4), title='Profitability by Category')
plt.ylabel('Proportion of Profitable Orders')
plt.show()

**Observation:** Some product categories consistently show lower profitability.

**Recommendation:**
Audit pricing strategy and cost structure for low-performing categories.
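
The share of profitable orders does not show how much money a category makes or loses, so an audit may also want total profit and order volume side by side. The sketch below runs on hypothetical rows with the original category names; the `category_summary` table and its column names are illustrative.

```python
import pandas as pd

# Hypothetical order lines with the original (unencoded) category names
orders = pd.DataFrame({
    "Category": ["Furniture", "Furniture", "Technology",
                 "Technology", "Office Supplies", "Office Supplies"],
    "Profit": [-20.0, 5.0, 60.0, 35.0, 4.0, -1.0],
})
orders["Profitable"] = (orders["Profit"] > 0).astype(int)

# Share of profitable lines, total profit, and volume per category
category_summary = (
    orders.groupby("Category")
    .agg(
        profitable_share=("Profitable", "mean"),
        total_profit=("Profit", "sum"),
        order_lines=("Profit", "size"),
    )
    .sort_values("profitable_share")
)
print(category_summary)
```

In this made-up sample, two categories have the same profitable share but opposite total profit, which is the kind of difference a share-only chart would hide.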

#### Insight 4: Regional Profitability Varies

In [None]:
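# 'Region' was label-encoded in Phase 2, so the bars are labeled with
# integer codes (assigned in sorted order of the original region names)
region_profit = df.groupby('Region')['Profitable'].mean().sort_values()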

region_profit.plot(kind='bar', figsize=(6,4), title='Profitability by Region')
plt.ylabel('Proportion of Profitable Orders')
plt.show()

**Observation:** Certain regions (e.g. South/West) may perform better.

**Recommendation:**
Focus marketing & logistics efforts in profitable regions. Re-evaluate strategy in loss-prone regions.

### Final Business Suggestions:
1. **Limit high discounts** to protect profit margins.
2. **Speed up shipping** to improve customer satisfaction and profitability.
3. **Focus on high-margin categories** and audit the low-performing ones.
4. **Invest in profitable regions** and reconsider strategy in weak areas.