
**(Retail Sales Data Collection)**

## Objectives

The purpose of this notebook is to collect and import the Retail Sales dataset for analysis.
The dataset, sourced from Kaggle (Superstore Sales Dataset by Rohit Sahoo), contains transactional order records with their dates, customers, regions, products, and sales values.
In this stage I load the dataset, explore its structure, and save a verified copy of the raw data to the `data/raw` directory.
This step gives later notebooks (cleaning, analysis, and modelling) a consistent and reproducible data source.

## Inputs

* This notebook uses the Kaggle "Superstore Sales Dataset" by Rohit Sahoo, which records retail sales transactions by order, customer, and product over time.
* The input is a single CSV file, `data/raw/sales_raw.csv`, with columns such as `Order Date`, `Ship Date`, `Product ID`, `Category`, `Region`, and `Sales`.
* The main Python libraries used in this notebook are `pandas` and `numpy` for data manipulation, `matplotlib` and `seaborn` for plotting, and `os` for managing file directories.

## Outputs

* By the end of this notebook, the dataset will be loaded and verified.
* The raw dataset is read from `data/raw/sales_raw.csv`, and a verified copy is saved to the `data/raw` directory under the filename `sales_raw_varified.csv`.
* This output forms the foundation for the data cleaning and exploration tasks in the next notebook.

## Additional Comments

* This notebook focuses only on collecting and storing the dataset in its raw form. 
* Further data cleaning, feature engineering, and analysis will be handled in subsequent notebooks to maintain a clear, modular workflow.
* All data sources are used responsibly, following the dataset's licensing and ethical guidelines. 

---

## Loading the dataset
In this step, I load the Superstore Sales dataset (sourced from Kaggle, created by Rohit Sahoo) into a pandas DataFrame.
The dataset contains historical retail sales transactions, including product categories, customer and shipping details, and sales values.
Loading and reviewing the data structure helps confirm that it was imported correctly before proceeding with cleaning and exploration.

In [1]:
# import required libraries 
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

# set visualisation style 
sns.set_style('whitegrid')

              

In [3]:
data_path = '../data/raw/sales_raw.csv'
df = pd.read_csv(data_path)
df.head()


,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales
0,1,CA-2017-152156,08/11/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96
1,2,CA-2017-152156,08/11/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94
2,3,CA-2017-138688,12/06/2017,16/06/2017,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62
3,4,US-2016-108966,11/10/2016,18/10/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775
4,5,US-2016-108966,11/10/2016,18/10/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368


The dataset loaded successfully, and its first five rows are displayed above.
This confirms that the file path is correct and that the data is structured as expected.
Next, I will perform basic checks on column names, data types, and missing values.

In [None]:
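# basic data verification
df.info()          # prints column dtypes and non-null counts
df.describe()      # not displayed: a cell renders only its last expression
df.isnull().sum()  # missing values per column (rendered as the cell output)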




<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Row ID         9800 non-null   int64  
 1   Order ID       9800 non-null   object 
 2   Order Date     9800 non-null   object 
 3   Ship Date      9800 non-null   object 
 4   Ship Mode      9800 non-null   object 
 5   Customer ID    9800 non-null   object 
 6   Customer Name  9800 non-null   object 
 7   Segment        9800 non-null   object 
 8   Country        9800 non-null   object 
 9   City           9800 non-null   object 
 10  State          9800 non-null   object 
 11  Postal Code    9789 non-null   float64
 12  Region         9800 non-null   object 
 13  Product ID     9800 non-null   object 
 14  Category       9800 non-null   object 
 15  Sub-Category   9800 non-null   object 
 16  Product Name   9800 non-null   object 
 17  Sales          9800 non-null   float64
dtypes: float

Row ID            0
Order ID          0
Order Date        0
Ship Date         0
Ship Mode         0
Customer ID       0
Customer Name     0
Segment           0
Country           0
City              0
State             0
Postal Code      11
Region            0
Product ID        0
Category          0
Sub-Category      0
Product Name      0
Sales             0
dtype: int64
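
The counts above show 11 missing values, all in `Postal Code`. Before deciding how to handle them in the cleaning notebook, it can help to see which locations are affected. The snippet below is a minimal sketch of that inspection. It runs on a hypothetical three-row `sample` frame built from locations in the `df.head()` output, with one postal code deliberately blanked for illustration; on the real data the same steps would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `df`: locations come from the df.head() output,
# and the Fort Lauderdale postal code is blanked on purpose for illustration.
sample = pd.DataFrame({
    "City": ["Henderson", "Los Angeles", "Fort Lauderdale"],
    "State": ["Kentucky", "California", "Florida"],
    "Postal Code": [42420.0, 90036.0, np.nan],
})

# Boolean mask of rows where the postal code is missing.
missing_mask = sample["Postal Code"].isna()

# Count the affected rows per city/state combination.
missing_locations = (
    sample.loc[missing_mask]
    .groupby(["City", "State"])
    .size()
    .reset_index(name="rows")
)

print("Rows with missing postal code:", int(missing_mask.sum()))
print(missing_locations)
```

If the missing values turn out to be concentrated in a few locations, that is worth recording here so the cleaning notebook can decide whether to fill or drop them.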

#### Inspect structure and basic health
In this step, I confirm the dataset's shape, column names, data types, and unique counts.
These checks ensure the data loaded correctly and that no immediate structural issues exist.

In [13]:
# Overview of dataset
print("Shape:", df.shape)
print("\nColumns:", list(df.columns))

print("\nunique counts (top 10 columns by cardinality):")
print(df.nunique().sort_values(ascending=False).head(10))


Shape: (9800, 18)

Columns: ['Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Ship Mode', 'Customer ID', 'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code', 'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name', 'Sales']

unique counts (top 10 columns by cardinality):
Row ID           9800
Sales            5757
Order ID         4922
Product ID       1861
Product Name     1849
Ship Date        1326
Order Date       1230
Customer ID       793
Customer Name     793
Postal Code       626
dtype: int64
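
`Order Date` and `Ship Date` are stored as `object` (string) columns, and the sample rows suggest a day-first format: values such as `16/06/2017` and `18/10/2016` cannot be month-first. The sketch below checks that assumption without modifying the data. The `DATE_FORMAT` constant and the `parse_dates` helper are names introduced here for illustration, and the rows are copied from the `df.head()` output.

```python
import pandas as pd

# Sample rows copied from the df.head() output.
sample = pd.DataFrame({
    "Order Date": ["08/11/2017", "12/06/2017", "11/10/2016"],
    "Ship Date": ["11/11/2017", "16/06/2017", "18/10/2016"],
})

# Assumption: dates are day-first (dd/mm/yyyy), inferred from the sample rows.
DATE_FORMAT = "%d/%m/%Y"


def parse_dates(column: pd.Series) -> pd.Series:
    """Parse strings with DATE_FORMAT; values that do not match become NaT."""
    return pd.to_datetime(column, format=DATE_FORMAT, errors="coerce")


order_dates = parse_dates(sample["Order Date"])
ship_dates = parse_dates(sample["Ship Date"])

# Count values that failed to parse under the assumed format.
unparsed = int(order_dates.isna().sum() + ship_dates.isna().sum())

# With the right format, no order should ship before it was placed.
shipped_before_ordered = int((ship_dates < order_dates).sum())

print("Unparsed dates:", unparsed)
print("Shipped before ordered:", shipped_before_ordered)
```

A non-zero count from either check on the full dataset would indicate that the format assumption is wrong or that some rows need attention during cleaning.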


In [14]:
# save the verified dataset
output_path = '../data/raw/sales_raw_varified.csv'
df.to_csv(output_path, index=False)
print(f" Dataset verified and saved to {output_path}")

 Dataset verified and saved to ../data/raw/sales_raw_varified.csv
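
One way to gain confidence in the saved file is to read it back and compare it with the in-memory frame. The sketch below illustrates this round-trip check. It uses a small `sample` frame with values from the `df.head()` output and a temporary directory standing in for the project's `data/raw` folder, both of which are assumptions for the sake of a self-contained example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in for the real `df`, using values from the df.head() output.
sample = pd.DataFrame({
    "Row ID": [1, 2, 3],
    "Order ID": ["CA-2017-152156", "CA-2017-152156", "CA-2017-138688"],
    "Sales": [261.96, 731.94, 14.62],
})

# A temporary directory stands in for the project's data/raw folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = Path(tmp_dir) / "data" / "raw" / "sales_raw_varified.csv"

    # Create the target folder if it does not exist, so to_csv does not fail.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    sample.to_csv(output_path, index=False)

    # Read the file back and compare it with the in-memory frame.
    reloaded = pd.read_csv(output_path)
    same_shape = reloaded.shape == sample.shape
    same_columns = list(reloaded.columns) == list(sample.columns)

print("Round trip preserved shape:", same_shape)
print("Round trip preserved columns:", same_columns)
```

Saving with `index=False` matters for this check: without it, the row index would be written as an extra unnamed column and the reloaded shape would no longer match.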


## Notebook Summary
* Dataset successfully imported and verified.
* No major structural issues detected; the only missing values are 11 entries in `Postal Code`.
* Verified dataset saved to `data/raw/sales_raw_varified.csv`.


