# Simulation of Vinted Dataset for Analysis

### Simulation Strategy:

1. **Define Core Parameters:** Set up the number of orders, time ranges, and probabilities for various events (e.g., shipping within 3 days, cancellation rates).
2. **Generate Orders:** Create a base of orders with order_id, seller_id, buyer_id, and order_date.
3. **Simulate Shipping Times:**
    - Most orders ship within 0-1 days.
    - A smaller share ship within 2-5 days.
    - A small share are "delayed but shipped" (6-15 days).
    - The remainder are "never shipped."
4. **Simulate Cancellations:** Assign cancellation reasons, with a focus on "Seller did not ship in time" for the "never shipped" items.
5. **Simulate Item and Seller Attributes:** Add realistic item_category, item_price, seller_rating, and seller_total_sales.
6. **Calculate Derived Fields:** Compute days_to_ship, is_shipped, and shipped_within_3_days.

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random
from faker import Faker

In [2]:
%load_ext blackcellmagic

In [3]:
# Initialize Faker for generating realistic-looking data
fake = Faker()
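The notebook does not fix any random seeds, so the counts printed further down will differ from run to run. The sketch below shows the general idea with the standard library only: a seeded generator yields the same IDs every time. The names `SEED` and `make_ids` are hypothetical and not part of the notebook.

```python
import random
import uuid

SEED = 42  # hypothetical value; any fixed integer works


def make_ids(n, seed):
    """Return n UUID strings drawn from a generator seeded with `seed`."""
    rng = random.Random(seed)
    # Build version-4-style UUIDs from 128 random bits of the seeded generator
    return [str(uuid.UUID(int=rng.getrandbits(128), version=4)) for _ in range(n)]


first_run = make_ids(3, SEED)
second_run = make_ids(3, SEED)
print(first_run == second_run)  # True: same seed, same IDs
```

In this notebook the equivalent step would presumably be to call `random.seed(SEED)`, `np.random.seed(SEED)`, and `Faker.seed(SEED)` once before any data is generated, since all three sources of randomness are used.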

#### 1. Configuration Parameters for Simulation

In [4]:
num_orders = 10000
start_date = datetime(2024, 1, 1)
end_date = datetime(2025, 6, 30)

##### Adjust the probabilities below to test different outcome mixes

In [5]:
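# Probabilities for the shipping scenarios (mutually exclusive; must sum to 1)
prob_shipped_1_day = 0.50  # shipped 0-1 days after the order
prob_shipped_2_3_days = 0.30
prob_shipped_4_5_days = 0.10
prob_shipped_delayed = 0.05  # shipped late, 6-15 days after the order
prob_never_shipped = 0.05  # not referenced directly; the generation loop treats it as the remainder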
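The generation loop later in the notebook picks a scenario by comparing one uniform draw against running sums of these probabilities. A minimal standalone sketch of that technique follows, assuming the values above; the `scenarios` dictionary, its labels, and `pick_scenario` are illustrative names, not part of the notebook.

```python
import random
from itertools import accumulate

# Values copied from the configuration cell; the labels are illustrative
scenarios = {
    "0-1 days": 0.50,
    "2-3 days": 0.30,
    "4-5 days": 0.10,
    "delayed (6-15 days)": 0.05,
    "never shipped": 0.05,
}

# The scenarios are mutually exclusive, so the probabilities should sum to 1
assert abs(sum(scenarios.values()) - 1.0) < 1e-9

# Running sums act as the upper bound of each scenario's interval
thresholds = list(accumulate(scenarios.values()))


def pick_scenario(u):
    """Map a uniform draw u in [0, 1) to a scenario label."""
    for name, upper in zip(scenarios, thresholds):
        if u < upper:
            return name
    return list(scenarios)[-1]  # guard against floating-point round-off near 1


rng = random.Random(0)
counts = {name: 0 for name in scenarios}
for _ in range(10_000):
    counts[pick_scenario(rng.random())] += 1
print(counts)
```

With 10,000 draws the counts should land close to the configured shares, which is a quick way to check a new set of probabilities before running the full simulation.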

In [6]:
# Probabilities of the cancellation reasons for never-shipped orders (must sum to 1)
prob_cancelled_seller_no_ship = 0.6 # Vinted cancels because seller didn't ship
prob_cancelled_buyer_choice = 0.2 # Buyer cancels after waiting past the shipping deadline
prob_cancelled_other = 0.2 # Other reasons (e.g., item out of stock for seller); applied as the remainder

In [7]:
# Probability of an order being cancelled early (e.g., within 3 days)
prob_early_cancellation = 0.03 # 3% chance of early cancellation
prob_late_cancellation_unshipped = 0.9 # High chance of cancellation if never shipped
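Because the early-cancellation check runs before the shipping scenario is drawn, the configured probabilities combine multiplicatively. The short calculation below works out the expected share of each outcome under that assumption; it ignores the end-of-window adjustment applied to the most recent orders, so the simulated counts will deviate slightly. The `share_*` names are illustrative.

```python
# Values copied from the configuration cells
prob_early_cancellation = 0.03
prob_never_shipped = 0.05
prob_late_cancellation_unshipped = 0.9
num_orders = 10_000

# An order reaches the shipping draw only if it was not cancelled early
share_never_shipped = (1 - prob_early_cancellation) * prob_never_shipped
share_unshipped = prob_early_cancellation + share_never_shipped
share_shipped = 1 - share_unshipped

# Never-shipped orders are either cancelled late or left open
share_cancelled_late = share_never_shipped * prob_late_cancellation_unshipped
share_open = share_never_shipped * (1 - prob_late_cancellation_unshipped)

for label, share in [
    ("shipped", share_shipped),
    ("unshipped (all causes)", share_unshipped),
    ("cancelled late", share_cancelled_late),
    ("unshipped and still open", share_open),
]:
    print(f"{label}: {share:.2%} (about {share * num_orders:.1f} orders)")
```

The expected shipped share is 92.15% and the unshipped share 7.85%, which can be compared against the `is_shipped` value counts printed in the overview section.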

In [8]:
# Item and Seller attributes
item_categories = [
    "Women's Clothing",
    "Women's Shoes",
    "Men's Clothing",
    "Men's Shoes",
    "Kids' Clothing",
    "Accessories",
    "Bags",
    "Jewellery",
    "Home Decor",
    "Books",
    "Electronics",
    "Beauty Products",
]
avg_item_price = 25.0
std_item_price = 15.0
min_item_price = 5.0
max_item_price = 200.0
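The generation loop clamps each normal draw to the range between `min_item_price` and `max_item_price`, which piles probability mass onto the two bounds. The sketch below estimates how much, using the standard library's `NormalDist`; `clip_price` is a hypothetical helper that mirrors the clamping expression used in the loop.

```python
from statistics import NormalDist

# Values copied from the configuration cell
avg_item_price, std_item_price = 25.0, 15.0
min_item_price, max_item_price = 5.0, 200.0

price_dist = NormalDist(mu=avg_item_price, sigma=std_item_price)

# Draws below the floor or above the cap are replaced by the bound itself
share_at_floor = price_dist.cdf(min_item_price)
share_at_cap = 1 - price_dist.cdf(max_item_price)


def clip_price(raw_price):
    """Clamp a raw draw to [min_item_price, max_item_price] and round to cents."""
    return round(max(min_item_price, min(max_item_price, raw_price)), 2)


print(f"Share of prices set to the floor: {share_at_floor:.1%}")
print(f"Share of prices set to the cap: {share_at_cap:.1%}")
print(clip_price(-3.2), clip_price(42.567), clip_price(950.0))
```

Roughly 9% of items end up priced at exactly 5.00, so a visible spike at the minimum price is an artifact of the clamping rather than a feature of the simulated market.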

In [9]:
# Seller rating distribution (skewed towards higher ratings)
seller_ratings = [1.0, 2.0, 3.0, 3.5, 4.0, 4.2, 4.5, 4.7, 4.8, 4.9, 5.0]
seller_rating_probabilities = [0.01, 0.02, 0.05, 0.07, 0.10, 0.15, 0.20, 0.20, 0.10, 0.05, 0.05]
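`np.random.choice` raises a `ValueError` when the probabilities passed via `p` do not sum to 1, so it is worth validating the two lists whenever they are edited. The check below is a small sketch that also computes the mean rating implied by the distribution; `expected_rating` is an illustrative name.

```python
import math

# Values copied from the cell above
seller_ratings = [1.0, 2.0, 3.0, 3.5, 4.0, 4.2, 4.5, 4.7, 4.8, 4.9, 5.0]
seller_rating_probabilities = [0.01, 0.02, 0.05, 0.07, 0.10, 0.15, 0.20, 0.20, 0.10, 0.05, 0.05]

# Each rating needs exactly one probability, and the probabilities must sum to 1
assert len(seller_ratings) == len(seller_rating_probabilities)
assert math.isclose(sum(seller_rating_probabilities), 1.0)

# Probability-weighted mean of the rating values
expected_rating = sum(
    rating * prob for rating, prob in zip(seller_ratings, seller_rating_probabilities)
)
print(round(expected_rating, 2))  # 4.29
```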

#### 2. Data Generation

In [10]:
data = []
seller_ids = [fake.uuid4() for _ in range(int(num_orders * 0.2))] # Fewer sellers than orders to simulate multiple sales
buyer_ids = [fake.uuid4() for _ in range(int(num_orders * 0.8))]

In [11]:
# Pre-generate seller sales history to influence their future behavior
seller_history = {
    seller_id: {
        "total_sales": random.randint(0, 500),
        "avg_shipping_days": random.uniform(1.0, 7.0),
    }
    for seller_id in seller_ids
}

In [12]:
for i in range(num_orders):
    order_id = fake.uuid4()
    seller_id = random.choice(seller_ids)
    buyer_id = random.choice(buyer_ids)
    order_date = start_date + timedelta(days=random.randint(0, (end_date - start_date).days))

    shipment_date = None
    cancellation_date = None
    cancellation_reason = None
    is_shipped = False
    days_to_add = 0

    # Cancellation logic
    if random.random() < prob_early_cancellation:
        # Order is cancelled early, before any shipment takes place
        cancellation_date = order_date + timedelta(days=random.randint(1, 3))
        cancellation_reason = "Seller unable to fulfill (out of stock)"
        is_shipped = False
    else:
        # Not cancelled early: draw a shipping scenario from the cumulative probabilities
        rand_prob = random.random()

        if rand_prob < prob_shipped_1_day:
            days_to_add = random.randint(0, 1)
            is_shipped = True
        elif rand_prob < prob_shipped_1_day + prob_shipped_2_3_days:
            days_to_add = random.randint(2, 3)
            is_shipped = True
        elif rand_prob < prob_shipped_1_day + prob_shipped_2_3_days + prob_shipped_4_5_days:
            days_to_add = random.randint(4, 5)
            is_shipped = True
        elif rand_prob < prob_shipped_1_day + prob_shipped_2_3_days + prob_shipped_4_5_days + prob_shipped_delayed:
            days_to_add = random.randint(6, 15)
            is_shipped = True
        else:
            # Never shipped and not cancelled early
            is_shipped = False
            if random.random() < prob_late_cancellation_unshipped:
                # Cancelled after the shipping deadline has passed
                cancellation_date = order_date + timedelta(days=random.randint(6, 10))
                cancel_rand_prob = random.random()
                if cancel_rand_prob < prob_cancelled_seller_no_ship:
                    cancellation_reason = "Seller did not ship in time"
                elif cancel_rand_prob < prob_cancelled_seller_no_ship + prob_cancelled_buyer_choice:
                    cancellation_reason = "Buyer canceled due to seller delay"
                else:
                    # Remaining share (prob_cancelled_other)
                    cancellation_reason = "Seller unable to fulfill (out of stock)"
            # Otherwise the order stays open: unshipped, with no cancellation recorded


    # Handle incomplete data for the most recent orders (shipped scenarios only)
    if is_shipped and shipment_date is None:
        shipment_date = order_date + timedelta(days=days_to_add)

        if shipment_date > end_date:
            # The shipment would fall after the observation window: cap it at end_date
            shipment_date = end_date

            # Treat half of these capped orders as still awaiting shipment
            if random.random() < 0.5:
                is_shipped = False
                shipment_date = None
                cancellation_date = end_date
                cancellation_reason = "Still awaiting shipment / Order too recent for full cycle"


    # Item attributes
    item_category = random.choice(item_categories)
    item_price = max(min_item_price, min(max_item_price, np.random.normal(avg_item_price, std_item_price)))
    item_price = round(item_price, 2)

    # Seller attributes
    seller_data = seller_history.get(seller_id, {"total_sales": 0, "avg_shipping_days": 2.0})
    seller_total_sales = seller_data["total_sales"] + (1 if is_shipped else 0)
    seller_rating = np.random.choice(seller_ratings, p=seller_rating_probabilities)
    
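    # Seller behavior based on their history: historically slow sellers ship up to one day later.
    # The delay is applied to shipment_date directly, because days_to_add has already been used above.
    if seller_data["avg_shipping_days"] > 4 and is_shipped:
        shipment_date = min(shipment_date + timedelta(days=random.randint(0, 1)), end_date)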

    data.append({
        "order_id": order_id,
        "seller_id": seller_id,
        "buyer_id": buyer_id,
        "order_date": order_date,
        "shipment_date": shipment_date,
        "cancellation_date": cancellation_date,
        "cancellation_reason": cancellation_reason,
        "item_category": item_category,
        "item_price": item_price,
        "seller_rating": seller_rating,
        "seller_total_sales": seller_total_sales,
        "vinted_shipping_deadline_days": 5
    })

In [13]:
df = pd.DataFrame(data)

#### 3. Derived Fields Calculation

In [14]:
# Ensure date columns are datetime objects
df["order_date"] = pd.to_datetime(df["order_date"])
df["shipment_date"] = pd.to_datetime(df["shipment_date"])
df["cancellation_date"] = pd.to_datetime(df["cancellation_date"])

In [15]:
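# Calculate business days to ship (weekends excluded; the shipment day itself is not counted)
def business_days_between(row):
    """Return weekdays from order_date up to, but not including, shipment_date (NaN if unshipped)."""
    if pd.notna(row["shipment_date"]):
        return np.busday_count(row["order_date"].date(), row["shipment_date"].date())
    return np.nan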
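With its default week mask, `np.busday_count` counts Monday to Friday dates in the half-open range from the start date up to, but excluding, the end date. That is why an order placed on a Sunday and shipped on Monday shows 0 days to ship. The pure-Python function below is an illustrative re-implementation of that rule, not the NumPy code itself; it assumes the end date is not before the start date and ignores holidays. The dates are taken from the first two rows of the sample output shown later.

```python
from datetime import date, timedelta


def weekday_count(start, end):
    """Count Monday-Friday dates in the half-open range [start, end)."""
    span = (end - start).days
    return sum(
        1
        for offset in range(span)
        if (start + timedelta(days=offset)).weekday() < 5  # 0-4 are Mon-Fri
    )


# Ordered Sunday 2024-06-02, shipped Monday 2024-06-03
print(weekday_count(date(2024, 6, 2), date(2024, 6, 3)))  # 0

# Ordered Wednesday 2024-06-05, shipped Saturday 2024-06-08
print(weekday_count(date(2024, 6, 5), date(2024, 6, 8)))  # 3
```

One consequence is that `days_to_ship` can be smaller than the calendar gap, so a threshold such as "within 3 days" is measured in business days here.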

In [16]:
df["days_to_ship"] = df.apply(business_days_between, axis=1)
df["is_shipped"] = df["shipment_date"].notna()

In [17]:
# Flag orders that were never shipped and were cancelled for a seller-side reason.
# This includes early "out of stock" cancellations, but excludes unshipped orders
# that are still open and orders marked as too recent for a full cycle.
df["is_never_shipped"] = ~df["is_shipped"] & df["cancellation_reason"].isin(
    [
        "Seller did not ship in time",
        "Buyer canceled due to seller delay",
        "Seller unable to fulfill (out of stock)",
    ]
)

In [18]:
# Flag orders shipped within 3 business days; unshipped orders (NaN days_to_ship) evaluate to False
df["shipped_within_3_days"] = df["days_to_ship"] <= 3
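Comparisons against NaN evaluate to False, in pandas as in plain Python, so unshipped orders are counted as "not within 3 days" rather than being excluded. The toy example below, with made-up values, shows how the resulting rate depends on whether unshipped orders stay in the denominator.

```python
import math

# Made-up days_to_ship values; NaN stands for an order that never shipped
days_to_ship = [0.0, 3.0, 5.0, math.nan]

# NaN <= 3 is False, so the unshipped order is flagged False
within_3 = [d <= 3 for d in days_to_ship]
print(within_3)  # [True, True, False, False]

# Rate over all orders (unshipped orders count against the rate)
rate_all = sum(within_3) / len(days_to_ship)

# Rate over shipped orders only
shipped = [d for d in days_to_ship if not math.isnan(d)]
rate_shipped = sum(d <= 3 for d in shipped) / len(shipped)

print(rate_all, round(rate_shipped, 3))
```

Which denominator is appropriate depends on the question being asked, so it helps to state it explicitly when reporting a "shipped within 3 days" rate from this dataset.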

## General Overview of Simulated Data

In [19]:
# Display basic info about the generated data
print(f"Generated {len(df)} orders.")
print("\nDataFrame Info:")
df.info()

Generated 10000 orders.

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 16 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   order_id                       10000 non-null  object        
 1   seller_id                      10000 non-null  object        
 2   buyer_id                       10000 non-null  object        
 3   order_date                     10000 non-null  datetime64[ns]
 4   shipment_date                  9216 non-null   datetime64[ns]
 5   cancellation_date              743 non-null    datetime64[ns]
 6   cancellation_reason            743 non-null    object        
 7   item_category                  10000 non-null  object        
 8   item_price                     10000 non-null  float64       
 9   seller_rating                  10000 non-null  float64       
 10  seller_total_sales             10000 non-n

In [20]:
print("First 5 rows of the generated data:")
df.head()

First 5 rows of the generated data:


Unnamed: 0,order_id,seller_id,buyer_id,order_date,shipment_date,cancellation_date,cancellation_reason,item_category,item_price,seller_rating,seller_total_sales,vinted_shipping_deadline_days,days_to_ship,is_shipped,is_never_shipped,shipped_within_3_days
0,a5a3bca6-440a-452e-bc4f-5644aff73d94,5eed2795-8d16-4113-a4e5-2896609dbe8b,4edbe609-f16e-4708-b990-05dffc570edc,2024-06-02,2024-06-03,NaT,,Accessories,30.95,5.0,317,5,0.0,True,False,True
1,f1eba7e2-99d1-4ef7-a925-fd2cb4da83be,792043b5-71db-4c5f-913b-696145d7c50f,c6b51b26-dd73-49d8-a8b8-5a9d07f13dfe,2024-06-05,2024-06-08,NaT,,Men's Clothing,6.82,4.2,366,5,3.0,True,False,True
2,fa09b0ae-565d-4be7-927e-5a531f8fe374,d8b6bbe3-5a31-4895-b015-9361c66b999c,02334f8d-a6f0-49f9-b653-2b1835e98059,2025-03-17,2025-03-18,NaT,,Bags,25.48,4.7,328,5,1.0,True,False,True
3,b22e3bea-fb66-46e5-a974-0a57cbb99552,35c0e53b-732b-496c-9eab-e16cbfb1590e,9dd0776d-2769-4cc2-bd46-4d91e085faaf,2024-09-23,2024-09-25,NaT,,Books,6.73,4.5,13,5,2.0,True,False,True
4,88157c5a-f21b-46e6-991d-754cfc9bb69e,489a3382-7c17-42b5-bd5c-6dea8fd79568,abdd6d55-4152-4ca3-9c6a-d7e4402dbe03,2025-04-20,NaT,2025-04-26,Seller unable to fulfill (out of stock),Women's Clothing,42.56,4.5,139,5,,False,True,False


In [21]:
print("\nValue counts for 'is_shipped':")
print(df["is_shipped"].value_counts())


Value counts for 'is_shipped':
is_shipped
True     9216
False     784
Name: count, dtype: int64


In [22]:
print("\nValue counts for 'is_never_shipped':")
print(df["is_never_shipped"].value_counts())


Value counts for 'is_never_shipped':
is_never_shipped
False    9279
True      721
Name: count, dtype: int64


In [23]:
print("\nDistribution of 'days_to_ship' (for shipped items):")
print(df["days_to_ship"].describe())


Distribution of 'days_to_ship' (for shipped items):
count    9216.000000
mean        1.478624
std         1.903699
min         0.000000
25%         0.000000
50%         1.000000
75%         2.000000
max        11.000000
Name: days_to_ship, dtype: float64


In [24]:
print("\nDistribution of 'cancellation_reason':")
print(df["cancellation_reason"].value_counts(dropna=False))


Distribution of 'cancellation_reason':
cancellation_reason
None                                                         9257
Seller unable to fulfill (out of stock)                       376
Seller did not ship in time                                   255
Buyer canceled due to seller delay                             90
Still awaiting shipment / Order too recent for full cycle      22
Name: count, dtype: int64


In [25]:
# Save this DataFrame to a CSV for later use in Analysis
df.to_csv("vinted_orders_simulated_data.csv", index=False)