# Initial Data Analysis (IDA) of `returns.csv`

## 1. Dataset Overview

- **File:** `returns.csv`
- **Number of rows:** 2,033
- **Number of columns:** 3
- **Columns:**
  - `Returned` – Whether the order was returned (Yes/No)
  - `Order ID` – Unique identifier for each order
  - `Region` – Region associated with the order
- **Type of data:** Transactional
- **Primary use:** Identifies returned orders. Joined to order data, it can supply the target variable for ML classification and support analysis of return patterns by region.


In [2]:
import numpy as np
import pandas as pd

returns = pd.read_csv('../data/raw/returns.csv')

display(returns.head())

| | Returned | Order ID | Region |
|---|---|---|---|
| 0 | Yes | IN-2017-CA120551-42816 | Southern Asia |
| 1 | Yes | IN-2017-AA103751-42926 | Southern Asia |
| 2 | Yes | IN-2017-TS212051-42904 | Southern Asia |
| 3 | Yes | AG-2014-RO97803-41695 | North Africa |
| 4 | Yes | AG-2015-LC70503-42265 | North Africa |


In [4]:
print("Columns:")
print(returns.columns)

print("\nShape:")
print(returns.shape)

Columns:
Index(['Returned', 'Order ID', 'Region'], dtype='str')

Shape:
(2033, 3)


## Columns and Data Types

- The dataset contains only 3 columns: `Returned`, `Order ID`, and `Region`.
- All columns hold strings (dtype `str` in the `info()` output; older pandas versions report `object`).
- There are **no missing values** in any column.
- `Order ID` can be joined with `orders.csv` for enrichment.
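
Because `returns.csv` lists only returned orders, a usable Yes/No label has to come from that join. The following is a minimal sketch, assuming `orders.csv` also carries an `Order ID` column (not verified in this notebook). The toy frames `orders_demo` and `returns_demo` are hypothetical stand-ins for the real files:

```python
import pandas as pd

# Toy stand-ins for orders.csv and returns.csv (hypothetical values).
orders_demo = pd.DataFrame({
    "Order ID": ["A-1", "A-2", "A-3", "A-4"],
    "Sales": [100.0, 250.0, 75.0, 40.0],
})
returns_demo = pd.DataFrame({
    "Returned": ["Yes", "Yes", "Yes"],
    "Order ID": ["A-2", "A-2", "A-4"],  # A-2 is listed twice, as some IDs are in the real file
    "Region": ["Oceania", "Oceania", "Canada"],
})

# De-duplicate on the key first so the left join cannot multiply order rows.
returned_ids = returns_demo[["Order ID", "Returned"]].drop_duplicates(subset="Order ID")

# validate= raises if the right side still has repeated keys.
labeled = orders_demo.merge(
    returned_ids, on="Order ID", how="left", validate="many_to_one"
)

# Assumption: orders absent from the returns file were not returned.
labeled["Returned"] = labeled["Returned"].fillna("No")

print(labeled)
```

Dropping repeated keys before the merge keeps one row per order; whether that is the right treatment depends on what the repeated IDs in the real file represent.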


In [5]:
print("Info:")
print(returns.info())

Info:
<class 'pandas.DataFrame'>
RangeIndex: 2033 entries, 0 to 2032
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Returned  2033 non-null   str  
 1   Order ID  2033 non-null   str  
 2   Region    2033 non-null   str  
dtypes: str(3)
memory usage: 47.8 KB
None


In [6]:
print("Missing values per column:")
print(returns.isnull().sum())

Missing values per column:
Returned    0
Order ID    0
Region      0
dtype: int64


In [7]:
print("Summary statistics:")
display(returns.describe())

Summary statistics:


| | Returned | Order ID | Region |
|---|---|---|---|
| count | 2033 | 2033 | 2033 |
| unique | 1 | 1970 | 23 |
| top | Yes | MX-2016-BT1130526-42695 | Western Europe |
| freq | 2033 | 3 | 226 |


## Summary Statistics

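- **Returned**: Every entry is `Yes`, so this file lists returned orders only. The column cannot act as an ML target by itself; non-returned orders have to come from another source such as `orders.csv`.
- **Order ID**: 1,970 unique order IDs across 2,033 rows, so 63 rows repeat an ID that already appears. These may be repeat returns or duplicate records; the data alone does not say which.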
- **Region**: 23 unique regions represented, with `Western Europe` having the most returns (226 entries), and `Central Asia` having the fewest (1 entry).
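
One way to tell exact duplicate rows apart from conflicting records for the same order is sketched below. The frame `demo` is a hypothetical stand-in with the same three columns; on the real data the same calls would run against `returns`:

```python
import pandas as pd

# Hypothetical stand-in with the same columns as returns.csv.
demo = pd.DataFrame({
    "Returned": ["Yes", "Yes", "Yes", "Yes", "Yes"],
    "Order ID": ["A-1", "A-1", "A-2", "A-2", "A-3"],
    "Region": ["Oceania", "Oceania", "Canada", "Caribbean", "Canada"],
})

# All rows whose Order ID appears more than once.
repeated = demo[demo.duplicated(subset="Order ID", keep=False)]

# Rows that are identical to another row in every column.
exact_dupes = demo[demo.duplicated(keep=False)]

# Order IDs recorded under more than one region (conflicting records).
regions_per_order = demo.groupby("Order ID")["Region"].nunique()
conflicting_ids = regions_per_order[regions_per_order > 1].index.tolist()

print(len(repeated), len(exact_dupes), conflicting_ids)
```

If every repeated row turns out to be an exact duplicate, `drop_duplicates()` is probably sufficient; conflicting regions would need a closer look before any join.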


In [8]:
# Strings are `object` in pandas 2 and `str` in pandas 3; selecting both covers either version.
cat_cols = returns.select_dtypes(include=['object', 'string']).columns
print("\nTop categories per categorical column:")
for col in cat_cols:
    print(f"\n{col} value counts:")
    display(returns[col].value_counts())


Top categories per categorical column:

Returned value counts:



Returned
Yes    2033
Name: count, dtype: int64


Order ID value counts:


Order ID
MX-2016-BT1130526-42695    3
MX-2015-DE1325551-42100    3
US-2016-JB160005-42392     2
ID-2017-CJ118757-42792     2
ES-2017-TB211758-42907     2
                          ..
ZA-2017-BT1305147-43069    1
ZA-2016-BD1500147-42581    1
ZA-2017-BT1395147-43067    1
ZA-2015-DN3690147-42159    1
ZA-2015-LC6885147-42021    1
Name: count, Length: 1970, dtype: int64


Region value counts:


Region
Western Europe       226
Central America      208
Oceania              144
Western US           140
Southeastern Asia    139
South America        117
Eastern US           115
Southern Asia        105
Western Asia         102
Eastern Asia          92
Central US            92
Southern Europe       86
Northern Europe       84
Caribbean             66
Eastern Europe        58
Western Africa        58
Southern US           58
North Africa          54
Eastern Africa        39
Central Africa        22
Southern Africa       15
Canada                12
Central Asia           1
Name: count, dtype: int64

## Observations

- **Returned:** Constant (`Yes`) in this file. It becomes a usable target for return prediction only after joining with the full order list and labeling unmatched orders as not returned.
- **Order ID:** Some orders appear multiple times. This may reflect repeat returns or duplicate records and should be checked before any join or customer behavior analysis.
- **Region:** Returns are not evenly distributed; some regions have much higher counts, which might indicate regional trends, business factors, or simply higher order volume.
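
Raw return counts partly reflect how many orders each region places, so a per-region return rate is usually easier to compare than a count. A sketch, assuming a labeled order table with a `Region` column and a Yes/No `Returned` column; `labeled_demo` is hypothetical, since this notebook does not build such a table:

```python
import pandas as pd

# Hypothetical labeled orders: one row per order, with a Yes/No return flag.
labeled_demo = pd.DataFrame({
    "Region": ["Oceania", "Oceania", "Oceania", "Oceania", "Canada", "Canada"],
    "Returned": ["Yes", "No", "No", "No", "Yes", "No"],
})

summary = (
    labeled_demo.assign(is_returned=labeled_demo["Returned"].eq("Yes"))
    .groupby("Region")["is_returned"]
    # Named aggregation: order count, return count, and share of orders returned.
    .agg(orders="count", returns="sum", return_rate="mean")
    .sort_values("return_rate", ascending=False)
)

print(summary)
```

In this toy data Oceania and Canada have one return each, but Canada's rate is higher because it has fewer orders.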


## Key Takeaways

1. The dataset has **no missing values**, but 63 rows repeat an existing `Order ID`.  
2. `Order ID` can be linked with `orders.csv` for detailed analysis.  
3. `Returned` can be used as the **target variable** for classification tasks once non-returned orders are added through that link.  
4. `Region` can be used as a **categorical feature** to analyze regional return trends.  
5. IDA complete – no transformations were applied in this notebook; the repeated `Order ID` values should be reviewed before joining.  
6. With that check done, the dataset is ready for **exploratory data analysis (EDA)** and ML modeling.
