Splitting Raw_data into raw (historical) and incremental data by Order Date.

In [3]:
import pandas as pd

# === 1. Load the dataset ===
# Using your file path
df = pd.read_csv(
    'C:/Users/kidig/OneDrive/Desktop/ET_Exam_Peter_341/Data/Raw_data.csv',
    encoding='cp1252'
)

# Preview data
print("✅ Data loaded successfully!")
print("Shape:", df.shape)
df.head()

# === 2. Convert and sort by Order Date ===
# Ensure your column name matches exactly ("Order Date" or similar)
df['Order Date'] = pd.to_datetime(df['Order Date'], errors='coerce')
df = df.sort_values('Order Date')

# Check for missing dates
print("\nMissing 'Order Date' values:", df['Order Date'].isna().sum())

# === 3. Split data into raw and incremental subsets ===
# Use roughly the most recent 10% of records (by Order Date) as incremental data;
# rows dated on or before the cutoff stay in the raw subset
cutoff_date = df['Order Date'].quantile(0.9)

raw_data = df[df['Order Date'] <= cutoff_date]
incremental_data = df[df['Order Date'] > cutoff_date]

print("\nRaw data shape:", raw_data.shape)
print("Incremental data shape:", incremental_data.shape)

print("\nRaw data date range:", raw_data['Order Date'].min(), "to", raw_data['Order Date'].max())
print("Incremental data date range:", incremental_data['Order Date'].min(), "to", incremental_data['Order Date'].max())

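# === 4. Save both datasets ===
# Note: on a case-insensitive file system (the Windows default), 'raw_data.csv'
# and 'Raw_data.csv' are the same path, so this line overwrites the source file.
raw_data.to_csv('C:/Users/kidig/OneDrive/Desktop/ET_Exam_Peter_341/Data/raw_data.csv', index=False)
incremental_data.to_csv('C:/Users/kidig/OneDrive/Desktop/ET_Exam_Peter_341/Data/incremental_data.csv', index=False)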

print("\n✅ Files saved successfully:")
print(" - raw_data.csv")
print(" - incremental_data.csv")


✅ Data loaded successfully!
Shape: (9994, 21)

Missing 'Order Date' values: 0

Raw data shape: (8996, 21)
Incremental data shape: (998, 21)

Raw data date range: 2014-01-03 00:00:00 to 2017-10-22 00:00:00
Incremental data date range: 2017-10-23 00:00:00 to 2017-12-30 00:00:00

✅ Files saved successfully:
 - raw_data.csv
 - incremental_data.csv
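
The quantile-based split above can be sketched on a small synthetic frame. This is only an illustration: the helper name `split_by_date_quantile` and the demo data are assumptions, not part of the original notebook. One point worth noting is that rows whose date failed to parse (NaT) compare as False on both sides of the cutoff, so they would land in neither subset; here that does not matter because the check above reported 0 missing dates.

```python
import pandas as pd

def split_by_date_quantile(df, date_col="Order Date", q=0.9):
    """Split df into (historical, incremental) at the q-th quantile of date_col.

    Hypothetical helper for illustration; mirrors the <= / > logic used above.
    """
    dates = pd.to_datetime(df[date_col], errors="coerce")
    cutoff = dates.quantile(q)
    historical = df[dates <= cutoff]   # on or before the cutoff
    incremental = df[dates > cutoff]   # strictly after the cutoff
    return historical, incremental, cutoff

# Synthetic stand-in for the real file: 20 orders on consecutive days
demo = pd.DataFrame({
    "Row ID": range(1, 21),
    "Order Date": pd.date_range("2017-01-01", periods=20, freq="D"),
})

historical, incremental, cutoff = split_by_date_quantile(demo)
print("Cutoff:", cutoff)
print("Historical rows:", len(historical), "| Incremental rows:", len(incremental))
```

Because the comparison is on dates rather than row positions, all rows sharing the cutoff date stay together in the historical subset, which is why the real split is close to, but not exactly, 90/10.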


Loading both datasets to verify the split.

In [5]:
import pandas as pd

# Load the incremental dataset
df = pd.read_csv('C:/Users/kidig/OneDrive/Desktop/ET_Exam_Peter_341/Data/incremental_data.csv')

# Check structure
df.shape
df.head()



Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,8446,CA-2017-125451,2017-10-23,10/24/2017,First Class,AH-10075,Adam Hart,Corporate,United States,Cranston,...,2920,East,FUR-TA-10001039,Furniture,Tables,KI Adjustable-Height Table,240.744,4,0.3,-13.7568
1,8451,CA-2017-125451,2017-10-23,10/24/2017,First Class,AH-10075,Adam Hart,Corporate,United States,Cranston,...,2920,East,OFF-AP-10002906,Office Supplies,Appliances,Hoover Replacement Belt for Commercial Guardsm...,2.22,1,0.0,0.666
2,8450,CA-2017-125451,2017-10-23,10/24/2017,First Class,AH-10075,Adam Hart,Corporate,United States,Cranston,...,2920,East,OFF-PA-10003724,Office Supplies,Paper,"Wirebound Message Book, 4 per Page",43.44,8,0.0,21.2856
3,1261,CA-2017-117079,2017-10-23,10/27/2017,Standard Class,JR-15700,Jocasta Rupert,Consumer,United States,Jacksonville,...,32216,South,TEC-PH-10004586,Technology,Phones,Wilson SignalBoost 841262 DB PRO Amplifier Kit,863.88,3,0.2,107.985
4,8449,CA-2017-125451,2017-10-23,10/24/2017,First Class,AH-10075,Adam Hart,Corporate,United States,Cranston,...,2920,East,FUR-TA-10004915,Furniture,Tables,"Office Impressions End Table, 20-1/2""H x 24""W ...",637.896,3,0.3,-127.5792


In [6]:
import pandas as pd

# Load the raw (historical) dataset written by the split step
df = pd.read_csv('C:/Users/kidig/OneDrive/Desktop/ET_Exam_Peter_341/Data/Raw_data.csv')

# Check structure
df.shape
df.head()


Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,7981,CA-2014-103800,2014-01-03,1/7/2014,Standard Class,DP-13000,Darren Powers,Consumer,United States,Houston,...,77095,Central,OFF-PA-10000174,Office Supplies,Paper,"Message Book, Wirebound, Four 5 1/2"" X 4"" Form...",16.448,2,0.2,5.5512
1,740,CA-2014-112326,2014-01-04,1/8/2014,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,...,60540,Central,OFF-LA-10003223,Office Supplies,Labels,Avery 508,11.784,3,0.2,4.2717
2,741,CA-2014-112326,2014-01-04,1/8/2014,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,...,60540,Central,OFF-ST-10002743,Office Supplies,Storage,SAFCO Boltless Steel Shelving,272.736,3,0.2,-64.7748
3,742,CA-2014-112326,2014-01-04,1/8/2014,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,...,60540,Central,OFF-BI-10004094,Office Supplies,Binders,GBC Standard Plastic Binding Systems Combs,3.54,2,0.8,-5.487
4,1760,CA-2014-141817,2014-01-05,1/12/2014,Standard Class,MB-18085,Mick Brown,Consumer,United States,Philadelphia,...,19143,East,OFF-AR-10003478,Office Supplies,Art,Avery Hi-Liter EverBold Pen Style Fluorescent ...,19.536,3,0.2,4.884
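
Eyeballing `head()` confirms that the files load, but a few programmatic checks make the verification explicit. The sketch below is a hedged illustration: `check_split` is a hypothetical helper, and the tiny frames stand in for the two saved files. On the real data, `total_rows` would be 9994, the shape reported when the source was first loaded.

```python
import pandas as pd

def check_split(historical, incremental, total_rows,
                key="Row ID", date_col="Order Date"):
    """Return a dict of basic sanity checks for a date-based split.

    Hypothetical helper; the column names follow the dataset shown above.
    """
    hist_dates = pd.to_datetime(historical[date_col])
    inc_dates = pd.to_datetime(incremental[date_col])
    return {
        # Every source row ended up in exactly one of the two subsets
        "row_count_matches": len(historical) + len(incremental) == total_rows,
        "no_shared_keys": set(historical[key]).isdisjoint(incremental[key]),
        # The newest historical order predates the oldest incremental order
        "dates_do_not_overlap": hist_dates.max() < inc_dates.min(),
    }

# Small stand-ins for raw_data.csv and incremental_data.csv
hist = pd.DataFrame({
    "Row ID": [1, 2, 3],
    "Order Date": ["2017-10-20", "2017-10-21", "2017-10-22"],
})
inc = pd.DataFrame({
    "Row ID": [4, 5],
    "Order Date": ["2017-10-23", "2017-12-30"],
})

report = check_split(hist, inc, total_rows=5)
print(report)
```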


In [7]:
# Display first few rows
display(raw_data.head())

# Dataset information
raw_data.info()

# Descriptive statistics for numerical columns
raw_data.describe()


Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
7980,7981,CA-2014-103800,2014-01-03,1/7/2014,Standard Class,DP-13000,Darren Powers,Consumer,United States,Houston,...,77095,Central,OFF-PA-10000174,Office Supplies,Paper,"Message Book, Wirebound, Four 5 1/2"" X 4"" Form...",16.448,2,0.2,5.5512
739,740,CA-2014-112326,2014-01-04,1/8/2014,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,...,60540,Central,OFF-LA-10003223,Office Supplies,Labels,Avery 508,11.784,3,0.2,4.2717
740,741,CA-2014-112326,2014-01-04,1/8/2014,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,...,60540,Central,OFF-ST-10002743,Office Supplies,Storage,SAFCO Boltless Steel Shelving,272.736,3,0.2,-64.7748
741,742,CA-2014-112326,2014-01-04,1/8/2014,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,...,60540,Central,OFF-BI-10004094,Office Supplies,Binders,GBC Standard Plastic Binding Systems Combs,3.54,2,0.8,-5.487
1759,1760,CA-2014-141817,2014-01-05,1/12/2014,Standard Class,MB-18085,Mick Brown,Consumer,United States,Philadelphia,...,19143,East,OFF-AR-10003478,Office Supplies,Art,Avery Hi-Liter EverBold Pen Style Fluorescent ...,19.536,3,0.2,4.884


<class 'pandas.core.frame.DataFrame'>
Index: 8996 entries, 7980 to 2625
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Row ID         8996 non-null   int64         
 1   Order ID       8996 non-null   object        
 2   Order Date     8996 non-null   datetime64[ns]
 3   Ship Date      8996 non-null   object        
 4   Ship Mode      8996 non-null   object        
 5   Customer ID    8996 non-null   object        
 6   Customer Name  8996 non-null   object        
 7   Segment        8996 non-null   object        
 8   Country        8996 non-null   object        
 9   City           8996 non-null   object        
 10  State          8996 non-null   object        
 11  Postal Code    8996 non-null   int64         
 12  Region         8996 non-null   object        
 13  Product ID     8996 non-null   object        
 14  Category       8996 non-null   object        
 15  Sub-Category   8996 non

Unnamed: 0,Row ID,Order Date,Postal Code,Sales,Quantity,Discount,Profit
count,8996.0,8996,8996.0,8996.0,8996.0,8996.0,8996.0
mean,5003.387839,2016-02-26 03:40:53.890618112,55199.434971,231.653035,3.782792,0.156228,29.722777
min,1.0,2014-01-03 00:00:00,1040.0,0.444,1.0,0.0,-6599.978
25%,2500.75,2015-03-30 00:00:00,23223.0,17.272,2.0,0.0,1.7343
50%,5027.5,2016-04-11 00:00:00,58103.0,54.804,3.0,0.2,8.7138
75%,7473.25,2017-01-16 00:00:00,90008.0,211.168,5.0,0.2,29.6604
max,9994.0,2017-10-22 00:00:00,99301.0,22638.48,14.0,0.8,8399.976
std,2880.828381,,32068.337866,631.084692,2.213044,0.2067,231.698223


Dataset Overview

The raw dataset contains 8,996 records and 21 columns.
Each record represents one order line item (a single Order ID can span several rows) with details such as order information, customer demographics, product data, and financial metrics.

Key Observations

The dataset is well-organized and includes a clear schema with 21 columns.

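The Order Date column is a datetime type because it was converted with pd.to_datetime in the split step. When the saved CSV is re-read without parse_dates, it loads as text again and must be converted once more.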

Other date fields like Ship Date are still stored as text (object) and will need to be converted during transformation.

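Numerical columns (Sales, Quantity, Discount, Profit, Postal Code, Row ID) are typed as integers or floats. Postal Code and Row ID are identifiers rather than measures, and storing Postal Code as int64 drops leading zeros (the minimum value is 1040, and Cranston appears as 2920).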

All columns show 8996 non-null entries, meaning there are no missing values.

The memory usage (~1.5 MB) is efficient for processing and transformations in pandas.

Conclusion

The dataset structure is clean and consistent. It is suitable for downstream ETL tasks without requiring major preprocessing at this stage.
Minor improvements, such as date standardization (Ship Date) and potential feature derivations (e.g., Total Cost, Profit Margin), will be handled in the Transform phase.
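
As a rough preview of those Transform-phase steps, the sketch below converts Ship Date and derives two columns on a two-row sample taken from the previews above. The definitions are assumptions for illustration: Profit Margin is taken as Profit / Sales, and Days to Ship is a hypothetical extra column that the notebook does not mention. The format string `%m/%d/%Y` is inferred from values such as 1/7/2014 and 10/24/2017.

```python
import pandas as pd

# Two rows copied from the previews above
sample = pd.DataFrame({
    "Order Date": pd.to_datetime(["2014-01-03", "2017-10-23"]),
    "Ship Date": ["1/7/2014", "10/24/2017"],
    "Sales": [16.448, 240.744],
    "Profit": [5.5512, -13.7568],
})

# Convert the text column; unparseable values would become NaT instead of raising
sample["Ship Date"] = pd.to_datetime(
    sample["Ship Date"], format="%m/%d/%Y", errors="coerce"
)

# Assumed definitions, for illustration only
sample["Days to Ship"] = (sample["Ship Date"] - sample["Order Date"]).dt.days
sample["Profit Margin"] = sample["Profit"] / sample["Sales"]

print(sample.dtypes)
print(sample)
```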

In [8]:
# Check for missing values
missing = raw_data.isna().sum()
print("Missing values per column:\n", missing)

# Check for duplicates
duplicates = raw_data.duplicated().sum()
print("\nNumber of duplicate rows:", duplicates)

# Check for inconsistent data types
print("\nColumn Data Types:\n", raw_data.dtypes)

# Example: look for invalid or negative quantities or sales
print("\nInvalid quantity values:", (raw_data['Quantity'] <= 0).sum())
print("Invalid sales values:", (raw_data['Sales'] <= 0).sum())


Missing values per column:
 Row ID           0
Order ID         0
Order Date       0
Ship Date        0
Ship Mode        0
Customer ID      0
Customer Name    0
Segment          0
Country          0
City             0
State            0
Postal Code      0
Region           0
Product ID       0
Category         0
Sub-Category     0
Product Name     0
Sales            0
Quantity         0
Discount         0
Profit           0
dtype: int64

Number of duplicate rows: 0

Column Data Types:
 Row ID                    int64
Order ID                 object
Order Date       datetime64[ns]
Ship Date                object
Ship Mode                object
Customer ID              object
Customer Name            object
Segment                  object
Country                  object
City                     object
State                    object
Postal Code               int64
Region                   object
Product ID               object
Category                 object
Sub-Category             objec

1. Missing Values

The dataset has no missing values across all 21 columns.
This means there are no nulls in key fields such as Order Date, Sales, Customer ID, or Profit, indicating that the dataset is complete and ready for transformation without imputation.

2. Duplicate Records

There are no fully duplicated rows, so no record appears twice across all 21 columns.
Row ID identifies each line item, whereas Order ID is not unique on its own: the previews above show the same Order ID (for example CA-2014-112326) repeated across several products in one order.
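
The difference between a full-row duplicate check and a key uniqueness check can be sketched on the five incremental rows previewed earlier. Only the two identifier columns are reproduced here; the variable names are illustrative.

```python
import pandas as pd

# Row ID / Order ID pairs copied from the incremental preview above
sample = pd.DataFrame({
    "Row ID": [8446, 8451, 8450, 1261, 8449],
    "Order ID": [
        "CA-2017-125451", "CA-2017-125451", "CA-2017-125451",
        "CA-2017-117079", "CA-2017-125451",
    ],
})

# duplicated() only flags rows that repeat across every column
full_row_duplicates = int(sample.duplicated().sum())

# Key-level checks: Row ID is a candidate key, Order ID is not
row_id_unique = sample["Row ID"].is_unique
order_id_unique = sample["Order ID"].is_unique
lines_per_order = sample.groupby("Order ID").size()

print("Full-row duplicates:", full_row_duplicates)
print("Row ID unique:", row_id_unique)
print("Order ID unique:", order_id_unique)
print(lines_per_order)
```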

3. Data Types

Most columns have appropriate data types:

Order Date is correctly stored as datetime64[ns].

Numeric columns such as Sales, Quantity, Discount, and Profit are all numeric (float64 or int64).

Categorical columns like Region, Segment, and Category are object types.

One observation is that Ship Date is still stored as a string (object), so it should be converted to datetime during the transformation phase.
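
A related typing issue is visible in the previews: Postal Code is stored as int64, so Cranston appears as 2920. A minimal sketch of one possible fix follows, assuming every value is a five-digit US ZIP code (the Country column in the previews is United States); this step is a suggestion and not part of the original notebook.

```python
import pandas as pd

# Values copied from the incremental preview above
sample = pd.DataFrame({
    "City": ["Cranston", "Jacksonville"],
    "Postal Code": [2920, 32216],
})

# Treat the code as an identifier: cast to text and restore leading zeros
sample["Postal Code"] = sample["Postal Code"].astype(str).str.zfill(5)

print(sample)
```

Reading the column as text in the first place (for example with `dtype={'Postal Code': str}` in `pd.read_csv`) would avoid the loss only if the source file still contains the zeros.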

4. Invalid or Outlier Values

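No negative or zero values were found in the Quantity or Sales columns, which is consistent with the describe() output (minimum Quantity 1, minimum Sales 0.444).

Profit, however, does take negative values and has a wide range: from -6599.978 to 8399.976, against a 75th percentile of only 29.6604. Sales is similarly skewed, with a maximum of 22638.48 against a 75th percentile of 211.168. These extreme values are not necessarily errors, but they should be reviewed in the Transform phase rather than assumed away.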
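
One common way to flag such extreme values is the 1.5 x IQR rule. The sketch below is an assumption about how this could be done, not a method the notebook uses: `iqr_outlier_mask` is a hypothetical helper, and the sample combines the ten Profit values from the previews with the minimum and maximum reported by describe().

```python
import pandas as pd

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Profit values from the previews above, plus the describe() min and max
profit = pd.Series(
    [5.5512, 4.2717, -64.7748, -5.487, 4.884,
     -13.7568, 0.666, 21.2856, 107.985, -127.5792,
     -6599.978, 8399.976],
    name="Profit",
)

mask = iqr_outlier_mask(profit)
print(profit[mask])
```

On the full dataset the same call would be `iqr_outlier_mask(raw_data['Profit'])`; flagged rows are candidates for review, not automatic removal.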

✅ Summary of Extract Phase

The extract phase was successful.

The raw and incremental datasets were loaded, and the raw dataset was inspected in detail and found to be complete.

No missing or duplicate values were detected.

The data types are mostly appropriate, with a minor adjustment needed for Ship Date.

The dataset is now ready for the Transform phase, where data will be standardized, enriched, and prepared for analytical use.