# Data Discovery

## Importing Data and Preliminary Cleaning

#### Note: For now, this covers only the Harrods transactions data

In [5]:
import pandas as pd

### Head and Columns

In [13]:
data = pd.read_csv("Harrods January Transactions_LBS.csv")
data.head()

| | TRANX_ID | CAL_DAY | ZDATBIRTH | AGE | GENERATION | ZFIRSTSHOP | ZLASTSHOP | ZCUSTOMER | ZTRADE | QUANTITY | ... | SPENDBAND_POINTOFTRANX | ZDATBIRTH_RAW | AGE_GROUP | RESIDENCY | TIER_LATEST | CALC_COUNTRY_GP | PRIVATE_SHOPPING_TRANSACTION | PRIVATE_SHOPPING_CUST_FLAG | ZPERSONA | ZISREWTR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025010400000018200000000126 | 2025-01-04 | 1900-01-01 | 125.0 | OTHERS | 2021-06-06 | 2025-02-14 | 9017096 | 27.0 | 1.0 | ... | < £ 5K | 19000101 | Unknown | UK | Green 0 | UK | | | Aspirational | X |
| 1 | 2025010400000024020000002114 | 2025-01-04 | | | | 2017-01-03 | 2025-01-07 | 3662245 | 24.0 | 1.0 | ... | £ 5K - £ 10K | 0 | Unknown | International | Gold | Middle East - GCC | | | VIP | X |
| 2 | 2025010400000028860000001411 | 2025-01-04 | | | | 2017-01-03 | 2025-01-07 | 3662245 | 4.0 | 1.0 | ... | £ 5K - £ 10K | 0 | Unknown | International | Gold | Middle East - GCC | | | VIP | X |
| 3 | 2025010400000021700000000208 | 2025-01-04 | | | | 2009-04-26 | 2025-01-04 | 3093317 | 16.0 | 1.0 | ... | < £ 5K | 0 | Unknown | Dual Resident | Green 0 | EU | | | Local Affluent | X |
| 4 | 2025010400000027640000000204 | 2025-01-04 | 1990-05-21 | 34.0 | MILLENNIALS | 2009-08-25 | 2025-01-08 | 3095197 | 47.9 | 2.0 | ... | £ 5K - £ 10K | 19900521 | 31 - 40 | International | Gold | Middle East - GCC | | | VIP | X |


- Some customer ages are missing or implausibly high: the `AGE` of 125 in row 0 pairs with a `ZDATBIRTH` of `1900-01-01`, which points to a placeholder birth date rather than a real one.
- Many rows have missing values in `ZDATBIRTH`, `AGE`, and `GENERATION`.
- Fields like `GENDER`, `ZPERSONA`, and `RESIDENCY` also have high levels of missingness.
- These columns may need cleaning, filtering, or exclusion depending on their relevance to the analysis.
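The placeholder pattern noted above can be neutralised before any age-based analysis. The following is a minimal sketch on toy rows modelled on the `head()` output, not the real dataset. The helper name `mask_placeholder_ages` and the cutoff `MAX_PLAUSIBLE_AGE = 110` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows modelled on the head() output above; not the real dataset.
toy = pd.DataFrame({
    "ZDATBIRTH": ["1900-01-01", None, "1990-05-21"],
    "AGE": [125.0, np.nan, 34.0],
    "GENERATION": ["OTHERS", None, "MILLENNIALS"],
})

PLACEHOLDER_DOB = "1900-01-01"  # sentinel value seen in row 0 of the preview
MAX_PLAUSIBLE_AGE = 110         # assumed cutoff; adjust to the business rule

def mask_placeholder_ages(df):
    """Return a copy where placeholder birth dates and implausible ages become NaN."""
    out = df.copy()
    bad = out["ZDATBIRTH"].eq(PLACEHOLDER_DOB) | out["AGE"].gt(MAX_PLAUSIBLE_AGE)
    out.loc[bad, ["ZDATBIRTH", "AGE"]] = np.nan
    return out

cleaned = mask_placeholder_ages(toy)
print(cleaned)
```

Masking rather than dropping keeps the transaction rows available for analyses that do not depend on age.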


In [5]:
data.columns

Index(['TRANX_ID', 'CAL_DAY', 'ZDATBIRTH', 'AGE', 'GENERATION', 'ZFIRSTSHOP',
       'ZLASTSHOP', 'ZCUSTOMER', 'ZTRADE', 'QUANTITY', 'CHANNEL',
       'DISTANCE_TO_STORE', 'SITE', 'TYEAR', 'TYEAR_HALF', 'QTR', 'MNTH',
       'TYEAR_MNTH', 'WK', 'MCH3', 'MCH1', 'BRAND', 'PRODUCT_DETAIL',
       'PRODUCT_DESCRIPTION_LONG', 'GENDER', 'SPENDBAND_LATEST',
       'SPENDBAND_POINTOFTRANX', 'ZDATBIRTH_RAW', 'AGE_GROUP', 'RESIDENCY',
       'TIER_LATEST', 'CALC_COUNTRY_GP', 'PRIVATE_SHOPPING_TRANSACTION',
       'PRIVATE_SHOPPING_CUST_FLAG', 'ZPERSONA', 'ZISREWTR'],
      dtype='object')

### Column Descriptions (Preliminary Assumptions)



- `TRANX_ID` — Unique identifier for each transaction
- `CAL_DAY` — Calendar day of transaction
- `ZDATBIRTH` — Customer date of birth
- `AGE` — Customer age at time of transaction
- `GENERATION` — Generation category (e.g. Millennials, Gen X)
- `ZFIRSTSHOP` — Date of customer’s first recorded purchase
- `ZLASTSHOP` — Date of customer’s most recent purchase
- `ZCUSTOMER` — Customer ID
- `ZTRADE` — Likely the transaction (trade) value rather than a classification code: it is continuous and can be negative (to be confirmed)
- `QUANTITY` — Number of items in the transaction
- `CHANNEL` — Sales channel (e.g. store location or online)
- `DISTANCE_TO_STORE` — Approximate distance between customer and store
- `SITE` — Specific store or department (e.g. Harrods Shop)
- `TYEAR` — Transaction year
- `TYEAR_HALF` — Half-year marker (e.g. H01.2024 or H02.2024)
- `QTR` — Quarter of the year (e.g. Q1, Q4)
- `MNTH` — Month number
- `TYEAR_MNTH` — Combined year-month identifier (e.g. 202412)
- `WK` — Week number of the year
- `MCH3` — High-level merchandise category (e.g. RESTAURANTS)
- `MCH1` — More detailed product category
- `BRAND` — Brand name of purchased item
- `PRODUCT_DETAIL` — Brief item description
- `PRODUCT_DESCRIPTION_LONG` — Detailed product description
- `GENDER` — Customer gender
- `SPENDBAND_LATEST` — Customer's latest spending tier
- `SPENDBAND_POINTOFTRANX` — Spend band at time of transaction
- `ZDATBIRTH_RAW` — Raw date of birth field (possibly uncleaned)
- `AGE_GROUP` — Customer's age bracket
- `RESIDENCY` — Customer residency status (e.g. UK, International, Dual Resident)
- `TIER_LATEST` — Customer loyalty tier (e.g. Gold, Black)
- `CALC_COUNTRY_GP` — Country grouping (e.g. UK, GCC)
- `PRIVATE_SHOPPING_TRANSACTION` — Indicates a private shopping event
- `PRIVATE_SHOPPING_CUST_FLAG` — Marks whether customer is a private shopper
- `ZPERSONA` — Segment or persona assigned to the customer
- `ZISREWTR` — Flag for rewards or loyalty program membership
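Several of the fields listed above hold dates, and `read_csv` leaves them as strings unless told otherwise. The following is a minimal sketch of converting them, using toy rows copied from the `head()` output. The helper name `parse_dates` and the derived column `ZDATBIRTH_RAW_DATE` are hypothetical, and the `YYYYMMDD` reading of `ZDATBIRTH_RAW` is an assumption based on values such as 19900521.

```python
import pandas as pd

# Toy rows copied from the head() output above; not the real dataset.
toy = pd.DataFrame({
    "CAL_DAY": ["2025-01-04", "2025-01-04"],
    "ZDATBIRTH": ["1900-01-01", None],
    "ZFIRSTSHOP": ["2021-06-06", "2017-01-03"],
    "ZLASTSHOP": ["2025-02-14", "2025-01-07"],
    "ZDATBIRTH_RAW": [19000101, 0],
})

DATE_COLS = ["CAL_DAY", "ZDATBIRTH", "ZFIRSTSHOP", "ZLASTSHOP"]

def parse_dates(df):
    """Return a copy with ISO date strings converted to datetime64 columns."""
    out = df.copy()
    for col in DATE_COLS:
        # errors="coerce" turns unparseable values into NaT instead of raising.
        out[col] = pd.to_datetime(out[col], format="%Y-%m-%d", errors="coerce")
    # Assumed YYYYMMDD integer encoding; a raw value of 0 cannot be parsed and becomes NaT.
    raw = out["ZDATBIRTH_RAW"].astype(str)
    out["ZDATBIRTH_RAW_DATE"] = pd.to_datetime(raw, format="%Y%m%d", errors="coerce")
    return out

parsed = parse_dates(toy)
print(parsed.dtypes)
```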


## Basic Exploration

In [7]:
data.describe()

| | AGE | ZCUSTOMER | ZTRADE | QUANTITY | TYEAR | MNTH | TYEAR_MNTH | WK | ZDATBIRTH_RAW |
|---|---|---|---|---|---|---|---|---|---|
| count | 128880.0 | 212107.0 | 212107.0 | 212107.0 | 212107.0 | 212107.0 | 212107.0 | 212107.0 | 212107.0 |
| mean | 42.981137 | 8621012.0 | 180.8452 | 1.231286 | 2024.0 | 12.0 | 202412.0 | 49.572466 | 12037830.0 |
| std | 16.738716 | 3739320.0 | 3937.254 | 1.773588 | 0.0 | 0.0 | 0.0 | 1.185347 | 9674483.0 |
| min | 0.0 | 2315791.0 | -1192683.0 | -74.0 | 2024.0 | 12.0 | 202412.0 | 48.0 | 0.0 |
| 25% | 32.0 | 5192512.0 | 6.95 | 1.0 | 2024.0 | 12.0 | 202412.0 | 49.0 | 0.0 |
| 50% | 41.0 | 9017750.0 | 20.28 | 1.0 | 2024.0 | 12.0 | 202412.0 | 50.0 | 19690630.0 |
| 75% | 51.0 | 12011820.0 | 63.0 | 1.0 | 2024.0 | 12.0 | 202412.0 | 50.0 | 19870200.0 |
| max | 125.0 | 13437840.0 | 907250.0 | 324.0 | 2024.0 | 12.0 | 202412.0 | 52.0 | 20250210.0 |


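- `AGE` ranges from 0 to 125 with a median of 41; values such as 0 and 125 are implausible and point to data issues.
- `ZTRADE` includes extreme outliers (min: -1.19M, max: 907K), which may indicate data entry errors or large refunds and may require winsorization.
- `QUANTITY` has negative values and outliers (min: -74, max: 324); the negatives may be returns or refunds rather than invalid entries, so they should be checked before being dropped.
- `ZDATBIRTH_RAW` has a large cluster at 0 (the 25th percentile is 0), reinforcing that many birthdates are placeholders or missing. Its maximum falls in 2025, which is implausible as a birth date.
- The time-related fields (`TYEAR`, `MNTH`, `TYEAR_MNTH`, `WK`) are internally consistent and likely system-generated. Their values (2024, month 12, weeks 48 to 52) do not match the January 2025 dates in `CAL_DAY`, so they probably follow a fiscal or trading calendar rather than the calendar year; this should be confirmed before they are used for time-based analysis.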
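The winsorization mentioned above can be sketched with quantile clipping. The values below are toy numbers borrowed from the `head()` and `describe()` output, not the real data. The helper name `winsorize`, the 10%/90% limits used on the toy sample, and the new column names are assumptions chosen for illustration. Negative quantities are flagged rather than removed, since they may be returns.

```python
import pandas as pd

# Toy values borrowed from the head() and describe() output; not the real data.
toy = pd.DataFrame({
    "ZTRADE": [-1192683.0, 6.95, 20.28, 27.0, 47.9, 63.0, 907250.0],
    "QUANTITY": [-74.0, 1.0, 1.0, 1.0, 2.0, 1.0, 324.0],
})

def winsorize(series, lower=0.01, upper=0.99):
    """Clip values outside the [lower, upper] quantile range to the boundary values."""
    low, high = series.quantile([lower, upper])
    return series.clip(lower=low, upper=high)

# Wide limits are used here only because the toy sample has seven rows.
toy["ZTRADE_WINSOR"] = winsorize(toy["ZTRADE"], lower=0.10, upper=0.90)

# Flag negative quantities for review instead of deleting them.
toy["IS_NEGATIVE_QTY"] = toy["QUANTITY"] < 0
print(toy)
```

Keeping the original `ZTRADE` column next to the clipped one makes it easy to compare results with and without the extreme values.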


In [43]:
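def summarize_df(df):
    """Return one row per column with null counts, placeholder counts, and basic stats."""
    # String values treated as disguised ("hidden") nulls.
    hidden_nulls = ["Unknown", "None", "0", "1900-01-01", "", " ", "NaT", "nan", "NaN"]

    hidden_null_counts = {}
    for col in df.columns:
        # Compare on stripped string representations. Floats render as e.g. "0.0",
        # so the "0" placeholder only matches integer or string columns.
        col_data = df[col].astype(str).str.strip()
        # Count only values that are not already true nulls.
        mask_not_null = ~df[col].isnull()
        hidden_mask = mask_not_null & col_data.isin(hidden_nulls)
        hidden_null_counts[col] = hidden_mask.sum()

    summary = pd.DataFrame({
        'non_null_count': df.count(),
        'null_count': df.isnull().sum(),
        'hidden_null_count': pd.Series(hidden_null_counts),
        'unique_count': df.nunique(),
        'dtype': df.dtypes,
        # First modal value per column (mode() returns several rows on ties).
        'most_common': df.mode().iloc[0]
    })

    return summary

summarize_df(data)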

| | non_null_count | null_count | hidden_null_count | unique_count | dtype | most_common |
|---|---|---|---|---|---|---|
| TRANX_ID | 212107 | 0 | 0 | 96404 | object | 2025010400000087460000000131 |
| CAL_DAY | 212107 | 0 | 0 | 20 | object | 2025-01-12 |
| ZDATBIRTH | 128880 | 83227 | 937 | 12891 | object | 1900-01-01 |
| AGE | 128880 | 83227 | 0 | 90 | float64 | 37.0 |
| GENERATION | 128862 | 83245 | 0 | 5 | object | MILLENNIALS |
| ZFIRSTSHOP | 212107 | 0 | 0 | 5946 | object | 2025-01-03 |
| ZLASTSHOP | 212107 | 0 | 0 | 57 | object | 2025-02-25 |
| ZCUSTOMER | 212107 | 0 | 0 | 42272 | int64 | 4335723 |
| ZTRADE | 212107 | 0 | 0 | 9667 | float64 | 0.0 |
| QUANTITY | 212107 | 0 | 0 | 751 | float64 | 1.0 |
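One caveat when reading `hidden_null_count`: the check compares string representations, so a float column renders zero as `"0.0"` and never matches the `"0"` placeholder. This may be why `AGE` reports 0 hidden nulls even though `describe()` shows a minimum age of 0. A minimal sketch on toy series (not the real data) illustrates the difference and one possible alternative, a numeric comparison.

```python
import pandas as pd

# Toy series mimicking the dtypes reported in the summary; not the real data.
ages = pd.Series([0.0, 37.0, 125.0])            # float64, like AGE
raw_dob = pd.Series([0, 19900521, 19000101])    # int64, like ZDATBIRTH_RAW

# String-based check, as used in summarize_df.
float_hits = int(ages.astype(str).isin(["0"]).sum())     # "0.0" does not equal "0"
int_hits = int(raw_dob.astype(str).isin(["0"]).sum())    # "0" matches

# Numeric comparison catches the zero regardless of dtype.
numeric_hits = int(ages.eq(0).sum())

print(float_hits, int_hits, numeric_hits)
```

Whether a zero is a placeholder depends on the column, so a per-column list of sentinel values is likely safer than one global list.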


### Data Quality Observations (Based on Field Purpose)

- `ZDATBIRTH` has 83k true nulls (39%) and 937 placeholders like "1900-01-01", so about 40% of rows lack a usable date of birth.
- `AGE` is missing in the same rows as `ZDATBIRTH`, and `AGE_GROUP` has ~85k (40%) rows marked as "Unknown".
- `ZDATBIRTH_RAW` has 83k entries set to 0, confirming poor reliability of raw birthdate values.
- `GENERATION` is derived from age and also missing for over 83k rows, limiting demographic segmentation.
- `RESIDENCY` and `CALC_COUNTRY_GP` have ~43k nulls each (20%) which affects geographic profiling.
- `PRODUCT_DESCRIPTION_LONG` is missing in ~89% of rows; likely unusable for product-level analysis or NLP.
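- `PRIVATE_SHOPPING_TRANSACTION` and `PRIVATE_SHOPPING_CUST_FLAG` are empty for over 95% of records, so they are useful only for specific segmentation, if needed. An empty value may simply mean the flag is not set rather than that data is missing.
- Fields like `CHANNEL`, `SITE`, `TYEAR`, `QTR`, `MNTH`, and `TYEAR_MNTH` are complete and consistent. `CHANNEL` and `SITE` suit location-based analysis; `TYEAR`, `MNTH`, and `TYEAR_MNTH` take a single value in this extract, so `CAL_DAY` and `WK` are the fields that carry time variation.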
- `BRAND` is missing in 816 rows, which is minimal given the dataset size (~0.4%).
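The percentages and the "missing in the same rows" claim above can be checked directly. The following is a minimal sketch on toy data rather than the real file. The helper names `missing_share` and `null_rows_match` are hypothetical, and the brand labels are made up.

```python
import numpy as np
import pandas as pd

# Toy frame: 3 of 5 rows lack a birth date and age, 1 of 5 lacks a brand.
toy = pd.DataFrame({
    "ZDATBIRTH": ["1900-01-01", None, None, None, "1990-05-21"],
    "AGE": [125.0, np.nan, np.nan, np.nan, 34.0],
    "BRAND": ["A", "B", None, "C", "D"],  # hypothetical brand labels
})

def missing_share(df):
    """Percentage of true nulls per column, highest first."""
    return df.isnull().mean().mul(100).round(1).sort_values(ascending=False)

def null_rows_match(df, col_a, col_b):
    """True if the two columns are null in exactly the same rows."""
    return bool((df[col_a].isnull() == df[col_b].isnull()).all())

shares = missing_share(toy)
same_rows = null_rows_match(toy, "ZDATBIRTH", "AGE")
print(shares)
print(same_rows)
```

On the real data, `null_rows_match(data, "ZDATBIRTH", "AGE")` would confirm whether the identical null counts (83,227 each) really come from the same rows.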


# Exploratory Data Analysis

In [None]:
## Work here!