# 01 - Data Intake and Audit

**Goal**: load the dataset, check basic stats, and look for duplicates and missing values.

### Load raw data
> We load the source CSV file and check basic metadata.

In [501]:
import pandas as pd

pd.set_option('display.max_columns', None)

file_path = "../data/raw/superstore_raw.csv"

df = pd.read_csv(file_path)
print(f"Data loaded successfully. \n\
        Shape: {df.shape[0]} rows x {df.shape[1]} columns.")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 2944: invalid start byte

The file is not valid UTF-8: byte 0xa0 cannot start a UTF-8 sequence, but in Latin-1 it is a non-breaking space. We therefore read the file with Latin-1 instead.
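The fallback can also be made explicit in code. The following is a minimal sketch, not the approach used in this notebook: `read_csv_with_fallback` and the demo file are hypothetical names introduced only for illustration. Latin-1 maps every possible byte to a character, so it never raises a decode error and belongs last in the list.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin1")):
    """Try each encoding in order; return the DataFrame and the encoding that worked."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next candidate
    raise last_error


# Hypothetical demo file: one value contains byte 0xa0, which is not valid UTF-8 here.
demo_path = Path(tempfile.mkdtemp()) / "demo.csv"
demo_path.write_bytes(b"Product Name,Sales\nDesk\xa0Lamp,22.5\n")

demo_df, used_encoding = read_csv_with_fallback(demo_path)
print(used_encoding)
```

Because Latin-1 always succeeds, a successful read does not prove the encoding is right; text columns still deserve a quick visual check for mis-decoded characters.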

In [507]:
df = pd.read_csv(file_path, encoding = 'latin1')
print(f"Data loaded successfully. \n\
        Shape: {df.shape[0]} rows x {df.shape[1]} columns. \n\
        Read error is fixed. ")

Data loaded successfully. 
        Shape: 9994 rows x 21 columns. 
        Read error is fixed. 


### Profiling
> * Run a quick profile using `.head()`, `.info()`, `.describe()`
> * Record suspicious patterns
> * Identify incorrect column types
> * Check missing values and duplicates

In [338]:
df.head()

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,1,CA-2016-152156,11/8/2016,11/11/2016,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2,0.0,41.9136
1,2,CA-2016-152156,11/8/2016,11/11/2016,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,3,0.0,219.582
2,3,CA-2016-138688,6/12/2016,6/16/2016,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,2,0.0,6.8714
3,4,US-2015-108966,10/11/2015,10/18/2015,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5,0.45,-383.031
4,5,US-2015-108966,10/11/2015,10/18/2015,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368,2,0.2,2.5164


- Column names are readable and match business context (`Order ID`, `Sales`, `Profit`, etc.).
- Date columns (`Order Date`, `Ship Date`) appear as strings in month/day/year order without zero padding (e.g. `11/8/2016`) and will require parsing.
- Categorical variables (`Ship Mode`, `Segment`, `Region`) look clean and consistent.
- Monetary columns (`Sales`, `Profit`) use decimal points, no visible currency symbols.
- `Row ID` seems to be an internal index rather than a business key (to review in cleaning stage).
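As a preview of the date parsing noted above, here is a minimal sketch on a few values copied from the `.head()` output. The `sample` frame and the `days_to_ship` check are illustrative assumptions; the actual conversion is left to the cleaning stage.

```python
import pandas as pd

# Sample values copied from the `.head()` output above.
sample = pd.DataFrame({
    "Order Date": ["11/8/2016", "6/12/2016", "10/11/2015"],
    "Ship Date": ["11/11/2016", "6/16/2016", "10/18/2015"],
})

for col in ["Order Date", "Ship Date"]:
    # An explicit format avoids day/month guessing; errors="raise" surfaces bad values early.
    sample[col] = pd.to_datetime(sample[col], format="%m/%d/%Y", errors="raise")

# Sanity check: the gap between order and shipment should never be negative.
days_to_ship = (sample["Ship Date"] - sample["Order Date"]).dt.days
print(sample.dtypes)
print(days_to_ship.tolist())
```

Passing `format` explicitly matters here because a value such as `6/12/2016` is ambiguous on its own; only rows like `10/18/2015` reveal that the month comes first.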

In [358]:
df.describe()

Unnamed: 0,Row ID,Postal Code,Sales,Quantity,Discount,Profit
count,9994.0,9994.0,9994.0,9994.0,9994.0,9994.0
mean,4997.5,55190.379428,229.858001,3.789574,0.156203,28.656896
std,2885.163629,32063.69335,623.245101,2.22511,0.206452,234.260108
min,1.0,1040.0,0.444,1.0,0.0,-6599.978
25%,2499.25,23223.0,17.28,2.0,0.0,1.72875
50%,4997.5,56430.5,54.49,3.0,0.2,8.6665
75%,7495.75,90008.0,209.94,5.0,0.2,29.364
max,9994.0,99301.0,22638.48,14.0,0.8,8399.976


In [356]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Row ID         9994 non-null   int64  
 1   Order ID       9994 non-null   object 
 2   Order Date     9994 non-null   object 
 3   Ship Date      9994 non-null   object 
 4   Ship Mode      9994 non-null   object 
 5   Customer ID    9994 non-null   object 
 6   Customer Name  9994 non-null   object 
 7   Segment        9994 non-null   object 
 8   Country        9994 non-null   object 
 9   City           9994 non-null   object 
 10  State          9994 non-null   object 
 11  Postal Code    9994 non-null   int64  
 12  Region         9994 non-null   object 
 13  Product ID     9994 non-null   object 
 14  Category       9994 non-null   object 
 15  Sub-Category   9994 non-null   object 
 16  Product Name   9994 non-null   object 
 17  Sales          9994 non-null   float64
 18  Quantity

Next, we identify which object-type columns are appropriate for conversion to the `category` dtype.

In [579]:
threshold = 0.05
object_cols = df.select_dtypes(include='object').columns

for col in object_cols:
    if df[col].nunique() / len(df) < threshold:
        print(f"{df.columns.get_loc(col)} {col}")

4 Ship Mode
7 Segment
8 Country
10 State
12 Region
14 Category
15 Sub-Category


- Column 0 duplicates the index and should be removed.
- Columns 4, 7, 8, 10, 12, 14, and 15 contain categorical data and will be converted to `category` dtype in the cleaning stage.
- Columns 1, 5, 6, 9, 13, and 16 contain identifiers or short free-text entries that do not represent meaningful categories and/or have too many unique values. These will be kept as plain `str` dtype rather than converted to `category`. Column 11 (`Postal Code`) belongs to this group but is currently `int64`, so it needs an explicit conversion to `str`.
- Columns 2 and 3 represent datetime values and require parsing into proper datetime format.
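- All numerical columns except `Postal Code` are correctly typed. `Postal Code` is an identifier rather than a quantity, and reading it as `int64` drops leading zeros (the minimum of 1040 in `.describe()` is a four-digit value).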
- No formal missing values in object columns are detected in the dataset. However, a manual inspection is still required to identify potential placeholder values representing missing data.
- Based on the `.describe().loc['min']` output, the numerical columns expected to be non-negative (`Sales`, `Quantity`, `Discount`) contain no negative values, so no missing data is encoded as invalid negative placeholders. Negative `Profit` values are legitimate losses.
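The type changes planned above could look roughly like this. It is a sketch on a small stand-in frame, not the cleaning-stage code; the zero-padding to five digits assumes that all postal codes are five-digit US ZIP codes, which this notebook has not verified.

```python
import pandas as pd

# Small stand-in frame; the real `df` from this notebook has 21 columns.
sample = pd.DataFrame({
    "Row ID": [1, 2, 3],
    "Ship Mode": ["Second Class", "Standard Class", "Second Class"],
    "Segment": ["Consumer", "Corporate", "Consumer"],
    "Postal Code": [42420, 90036, 1040],
})

# Subset of the low-cardinality columns flagged by the threshold check.
category_cols = ["Ship Mode", "Segment"]
sample = sample.astype({col: "category" for col in category_cols})

# Assumption: five-digit ZIP codes, so restore any leading zero lost by the int64 read.
sample["Postal Code"] = sample["Postal Code"].astype(str).str.zfill(5)

print(sample.dtypes)
print(sample["Postal Code"].tolist())
```

Storing postal codes as text also prevents accidental arithmetic on them, such as the meaningless mean reported by `.describe()`.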

Manually inspect potential placeholder values by reviewing `.unique()` outputs for selected columns likely to contain encoded missing data.

In [533]:
selected_cols = [4, 7, 8, 9, 10, 12, 14, 15]
for col_number in selected_cols:
    print(df.columns[col_number])
    print(df.iloc[:, col_number].unique())
    print('------------------------------------------------------------------')

Ship Mode
['Second Class' 'Standard Class' 'First Class' 'Same Day']
------------------------------------------------------------------
Segment
['Consumer' 'Corporate' 'Home Office']
------------------------------------------------------------------
Country
['United States']
------------------------------------------------------------------
City
['Henderson' 'Los Angeles' 'Fort Lauderdale' 'Concord' 'Seattle'
 'Fort Worth' 'Madison' 'West Jordan' 'San Francisco' 'Fremont'
 'Philadelphia' 'Orem' 'Houston' 'Richardson' 'Naperville' 'Melbourne'
 'Eagan' 'Westland' 'Dover' 'New Albany' 'New York City' 'Troy' 'Chicago'
 'Gilbert' 'Springfield' 'Jackson' 'Memphis' 'Decatur' 'Durham' 'Columbia'
 'Rochester' 'Minneapolis' 'Portland' 'Saint Paul' 'Aurora' 'Charlotte'
 'Orland Park' 'Urbandale' 'Columbus' 'Bristol' 'Wilmington' 'Bloomington'
 'Phoenix' 'Roseville' 'Independence' 'Pasadena' 'Newark' 'Franklin'
 'Scottsdale' 'San Jose' 'Edmond' 'Carlsbad' 'San Antonio' 'Monroe'
 'Fairfield' 'Grand

No placeholder values were detected in the inspected columns, so the dataset shows no signs of encoded missingness.
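For columns with many unique values, such as `City`, reading the full `.unique()` output is error-prone. A programmatic scan is one possible complement. In the sketch below, the `PLACEHOLDERS` set and the `count_placeholders` helper are hypothetical and were not part of the original audit; the token list is a guess at common conventions and should be adapted to the data source.

```python
import pandas as pd

# Hypothetical tokens that often stand in for missing values; extend as needed.
PLACEHOLDERS = ["", "-", "?", "n/a", "na", "none", "null", "unknown"]


def count_placeholders(frame, columns):
    """Count placeholder-like strings per column, ignoring case and surrounding whitespace."""
    counts = {}
    for col in columns:
        normalized = frame[col].astype(str).str.strip().str.lower()
        counts[col] = int(normalized.isin(PLACEHOLDERS).sum())
    return pd.Series(counts, dtype="int64")


# Made-up demo data containing two placeholder-like entries in `City`.
demo = pd.DataFrame({
    "City": ["Henderson", " N/A ", "Seattle", "?"],
    "Segment": ["Consumer", "Corporate", "Home Office", "Consumer"],
})
placeholder_counts = count_placeholders(demo, ["City", "Segment"])
print(placeholder_counts)
```

A count of zero for every column would support the conclusion above; any non-zero count points to values worth inspecting by hand.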

The `Country` column contains a single constant value and provides no variance; therefore, it should be removed.

Check the dataset for duplicate records. The `Row ID` column is excluded from this step because it only replicates the index.

In [471]:
df[df.duplicated(subset = df.columns[1:])]

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
3406,3407,US-2014-150119,4/23/2014,4/27/2014,Standard Class,LB-16795,Laurel Beltran,Home Office,United States,Columbus,Ohio,43229,East,FUR-CH-10002965,Furniture,Chairs,Global Leather Highback Executive Chair with P...,281.372,2,0.3,-12.0588


Inspect all rows of the affected order:

In [474]:
df.loc[(df['Order ID'] == "US-2014-150119")]

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
3405,3406,US-2014-150119,4/23/2014,4/27/2014,Standard Class,LB-16795,Laurel Beltran,Home Office,United States,Columbus,Ohio,43229,East,FUR-CH-10002965,Furniture,Chairs,Global Leather Highback Executive Chair with P...,281.372,2,0.3,-12.0588
3406,3407,US-2014-150119,4/23/2014,4/27/2014,Standard Class,LB-16795,Laurel Beltran,Home Office,United States,Columbus,Ohio,43229,East,FUR-CH-10002965,Furniture,Chairs,Global Leather Highback Executive Chair with P...,281.372,2,0.3,-12.0588
3407,3408,US-2014-150119,4/23/2014,4/27/2014,Standard Class,LB-16795,Laurel Beltran,Home Office,United States,Columbus,Ohio,43229,East,OFF-BI-10000145,Office Supplies,Binders,Zipper Ring Binder Pockets,7.488,8,0.7,-5.2416
3408,3409,US-2014-150119,4/23/2014,4/27/2014,Standard Class,LB-16795,Laurel Beltran,Home Office,United States,Columbus,Ohio,43229,East,FUR-FU-10002191,Furniture,Furnishings,G.E. Halogen Desk Lamp Bulbs,22.336,4,0.2,7.8176


The rows with `Row ID` 3406 and 3407 (index 3405 and 3406) contain identical data. However, this duplication may still be valid due to operational factors not reflected in the dataset, such as partial fulfillment or duplicate packaging records.

**Conclusion**: All rows should be retained.
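By default `duplicated()` flags only the later copies, which is why the first check returned a single row and a second lookup by `Order ID` was needed. Passing `keep=False` shows every member of a duplicate group at once. The sketch below uses stand-in rows modelled on the order above, with most columns omitted.

```python
import pandas as pd

# Stand-in rows modelled on order US-2014-150119; most columns are omitted.
sample = pd.DataFrame({
    "Row ID": [3406, 3407, 3408],
    "Order ID": ["US-2014-150119"] * 3,
    "Product ID": ["FUR-CH-10002965", "FUR-CH-10002965", "OFF-BI-10000145"],
    "Quantity": [2, 2, 8],
})

compare_cols = [c for c in sample.columns if c != "Row ID"]

# keep=False marks every member of a duplicate group, not only the later copies.
dupe_mask = sample.duplicated(subset=compare_cols, keep=False)
print(sample[dupe_mask])

# Group sizes show how many copies each duplicated record has.
print(sample[dupe_mask].groupby(compare_cols).size())
```

Since the rows are retained, the mask could also be stored as a flag column so that later stages can treat the pair differently if needed.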

Perform an additional duplicate check on a subset of columns most likely to contain unintended duplicates: `['Order ID', 'Product ID', 'Quantity']`

In [517]:
df[df.duplicated(subset = ['Order ID', 'Product ID', 'Quantity'])]

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
3406,3407,US-2014-150119,4/23/2014,4/27/2014,Standard Class,LB-16795,Laurel Beltran,Home Office,United States,Columbus,Ohio,43229,East,FUR-CH-10002965,Furniture,Chairs,Global Leather Highback Executive Chair with P...,281.372,2,0.3,-12.0588


No additional duplicates were detected, and the initial analysis phase is complete.

### Save Intermediate “readable” copy

In [526]:
df.to_csv('../data/interim/01_bronze_intake.csv', index = False)
print('Copy is saved')

Copy is saved
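A quick round-trip read can confirm that the saved copy is intact. The sketch below uses a stand-in frame and a temporary path rather than the notebook's `../data/interim/` location. It relies on `to_csv` writing UTF-8 by default, so reading the copy back needs no `encoding` argument.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in frame and a temporary path; the notebook writes to ../data/interim/ instead.
sample = pd.DataFrame({
    "Product Name": ["Desk\u00a0Lamp", "Zipper Ring Binder Pockets"],
    "Sales": [22.336, 7.488],
})
out_path = Path(tempfile.mkdtemp()) / "01_bronze_intake.csv"

sample.to_csv(out_path, index=False)  # to_csv writes UTF-8 unless told otherwise
reloaded = pd.read_csv(out_path)      # so no encoding argument is needed here

# Raises AssertionError if values or shape changed during the round trip.
pd.testing.assert_frame_equal(sample, reloaded, check_dtype=False)
print(f"Round trip OK: {reloaded.shape[0]} rows x {reloaded.shape[1]} columns")
```

The same check on the real file would also confirm that downstream notebooks can drop the `latin1` workaround.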
