## Market Evolution Dynamics: Does Scale Compromise Quality?

Thessaloniki's short-term rental market has grown rapidly, raising questions about whether professional operators enhance or diminish the guest experience. Understanding host ecosystem dynamics is crucial for evidence-based tourism policy.

### Research Questions
1. Do multi-property hosts achieve different guest engagement patterns than smaller operators?
2. Does the current host ecosystem structure benefit the rental market?
3. Is there a "sweet spot" in host portfolio size for optimal performance?

### Host Portfolio Categorization Framework

| Category | Listings | Profile |
|----------|----------|---------|
| **Individual** | 1 | Casual/occasional hosts, often sharing personal space |
| **Small Multi** | 2-3 | Semi-professional, transitioning to STR business |
| **Medium Multi** | 4-10 | Professional operators, dedicated STR management |
| **Large Multi** | 11+ | Commercial/corporate operators, scaled operations |
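
The table above can be turned into a category column with `pd.cut`. This is a minimal sketch only: the bin edges mirror the table, but the input column name `host_listings_count` and the helper `categorize_hosts` are assumptions for illustration, not names taken from the processed dataset.

```python
import pandas as pd

# Bin edges mirror the table: 1, 2-3, 4-10, 11+ (intervals are right-closed)
HOST_BINS = [0, 1, 3, 10, float("inf")]
HOST_LABELS = ["Individual", "Small Multi", "Medium Multi", "Large Multi"]


def categorize_hosts(listing_counts: pd.Series) -> pd.Series:
    """Map each host's listing count to a portfolio category (hypothetical helper)."""
    return pd.cut(listing_counts, bins=HOST_BINS, labels=HOST_LABELS, right=True)


# Toy data, not drawn from the Thessaloniki dataset
example = pd.Series([1, 2, 3, 4, 10, 11, 57], name="host_listings_count")
categories = categorize_hosts(example)
print(pd.DataFrame({"listings": example, "category": categories}))
```

Using right-closed bins keeps the boundary cases (3, 10) in the lower category, which matches the ranges in the table.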

### Key Finding Preview
> *Mid-scale professional hosts (2-10 listings) achieve the optimal balance of operational efficiency and guest experience quality. Large commercial operators (11+ listings) show signs of "scale without soul": higher market share but lower quality scores, suggesting a volume-over-quality approach.*

In [None]:
import numpy as np
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
import sys
from scipy import stats

sys.path.insert(0, str(Path.cwd().parent))
from scripts.eda_functions import (
    analyze_numeric_variable,
    analyze_categorical_variable,
    analyze_categorical_numerical,
    analyze_categorical_categorical
    )

In [None]:
data_path = Path.cwd().parent / "data" / "processed"
df = pd.read_parquet(data_path / "listings_regular_license.parquet", engine="pyarrow")
pd.set_option('display.float_format', '{:,.2f}'.format)
df.shape

## Listing Age

*TODO: add the listing age distribution graphic here.*

*TODO: describe the market maturity categorization.*
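
The cells below rely on `listing_age_years` and `market_maturity`, which are not derived in this notebook. One possible derivation is sketched here, assuming listing age is measured from `first_review_date` to a snapshot date. The helper name `add_listing_age`, the bin thresholds, and the maturity labels are all hypothetical; the actual processed data may use different definitions.

```python
import pandas as pd


def add_listing_age(frame, snapshot_date, date_col="first_review_date"):
    """Add listing_age_years and a binned market_maturity column (hypothetical helper)."""
    out = frame.copy()
    snapshot = pd.Timestamp(snapshot_date)
    # Age in years since the first review; listings without reviews stay NaN
    out["listing_age_years"] = (snapshot - out[date_col]).dt.days / 365.25
    # Assumed thresholds and labels, chosen only for illustration
    out["market_maturity"] = pd.cut(
        out["listing_age_years"],
        bins=[0, 1, 3, 5, float("inf")],
        labels=["New (<1y)", "Growing (1-3y)", "Established (3-5y)", "Mature (5y+)"],
        right=False,  # left-closed: a listing aged exactly 1.0 years is "Growing"
    )
    return out


# Toy data, not drawn from the Thessaloniki dataset
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "first_review_date": pd.to_datetime(
        ["2024-07-01", "2023-01-01", "2021-01-01", "2015-01-01", None]
    ),
})
aged = add_listing_age(toy, snapshot_date="2025-01-01")
print(aged[["id", "listing_age_years", "market_maturity"]])
```

Because first review date is only a proxy for listing creation, listings that never received a review drop out of every maturity bucket in this sketch.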

In [None]:
# Variables to check against market maturity
variables = [
    "distance_category",
    "host_category",
    "estimated_occupancy_l365d",
    "estimated_revenue_l365d",
    "review_scores_rating",
    "price_cat",
    "reviews_per_month"
]

In [None]:
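for var in variables:
    if pd.api.types.is_numeric_dtype(df[var]):
        # Numeric variable: analyze alone, then against market maturity
        analyze_numeric_variable(df[var])
        analyze_categorical_numerical(df["market_maturity"], df[var])
    else:
        # Categorical variable: analyze alone, then against market maturity
        analyze_categorical_variable(df[var])
        analyze_categorical_categorical(df["market_maturity"], df[var])

    print("\n\n")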

In [None]:
# === FIRST REVIEW TEMPORAL ANALYSIS ===
# Time budget: 45 minutes

# 1. Listing Age Distribution
print("=== LISTING AGE DISTRIBUTION ===")
print(df['listing_age_years'].describe())
print("\n", df['market_maturity'].value_counts())

# 2. Neighborhood Growth Analysis (15 min)
neighborhood_maturity = df.groupby('neighbourhood_cleansed').agg({
    'listing_age_years': ['median', 'mean'],
    'first_review_date': ['min', 'max'],
    'id': 'count'
}).round(2)

# The date columns hold the earliest and latest first-review dates per group
neighborhood_maturity.columns = ['median_age', 'mean_age', 'earliest_first_review', 'latest_first_review', 'count']
print("\n=== NEIGHBORHOOD MATURITY ===")
print(neighborhood_maturity.sort_values('median_age'))

# Distance category maturity
distance_maturity = df.groupby('distance_cat').agg({
    'listing_age_years': ['median', 'mean'],
    'first_review_date': ['min', 'max'],
    'id': 'count'
}).round(2)

# The date columns hold the earliest and latest first-review dates per group
distance_maturity.columns = ['median_age', 'mean_age', 'earliest_first_review', 'latest_first_review', 'count']
print("\n=== DISTANCE CATEGORY MATURITY ===")
print(distance_maturity.sort_values('median_age'))

# 3. Recent Growth Trends (10 min)
df['first_review_year'] = df['first_review_date'].dt.year

recent_growth = df[df['first_review_year'] >= 2020].groupby(
    ['neighbourhood_cleansed', 'first_review_year']
).size().reset_index(name='new_listings')

growth_pivot = recent_growth.pivot(
    index='neighbourhood_cleansed',
    columns='first_review_year',
    values='new_listings'
).fillna(0)

print("\n=== NEW LISTINGS BY FIRST-REVIEW YEAR (2020 ONWARD) ===")
print(growth_pivot)

# 4. Performance by Maturity (10 min)
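# Named aggregation gives each output column a descriptive label
maturity_performance = df.groupby('market_maturity').agg(
    median_price=('price', 'median'),
    median_revenue_l365d=('estimated_revenue_l365d', 'median'),
    mean_rating=('review_scores_rating', 'mean'),
    pct_large_multi=('Host_Category', lambda x: (x == 'Large Multi (4+)').mean() * 100),
    pct_superhost=('host_is_superhost', lambda x: x.mean() * 100),
).round(2)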

print("\n=== PERFORMANCE BY MARKET MATURITY ===")
print(maturity_performance)

# 5. Key Questions
print("\n=== KEY INSIGHTS ===")
print(f"Oldest neighborhood: {neighborhood_maturity['median_age'].idxmax()} ({neighborhood_maturity['median_age'].max():.1f} years)")
print(f"Newest neighborhood: {neighborhood_maturity['median_age'].idxmin()} ({neighborhood_maturity['median_age'].min():.1f} years)")

# Check if Pavlou Mela shows recent growth
if 'Pavlou Mela' in growth_pivot.index:
    pm_growth = growth_pivot.loc['Pavlou Mela']
    print("\nPavlou Mela growth trend:")
    print(pm_growth[pm_growth > 0])

In [None]:
# TODO: identify the host with the most recent listings and the host with the most listings

In [None]:
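# Quick cohort comparison: Large Multi listings first reviewed in 2023 or later
# vs. those first reviewed in 2021 or earlier (2022 is excluded from both cohorts)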
recent_multis = df[(df['Host_Category'] == 'Large Multi (4+)') & 
                   (df['first_review_year'] >= 2023)]
older_multis = df[(df['Host_Category'] == 'Large Multi (4+)') & 
                  (df['first_review_year'] <= 2021)]

print("Multi-host quality by cohort:")
print(f"Pre-2022: {older_multis['review_scores_rating'].mean():.2f}")
print(f"2023+: {recent_multis['review_scores_rating'].mean():.2f}")

# T-test for significance
from scipy.stats import ttest_ind
t_stat, p_val = ttest_ind(older_multis['review_scores_rating'].dropna(),
                           recent_multis['review_scores_rating'].dropna())
print(f"T-test: t={t_stat:.2f}, p={p_val:.4f}")
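
`ttest_ind` assumes equal variances by default, and review scores are often skewed toward the top of the scale. As a robustness check, the comparison can be repeated with Welch's t-test and a Mann-Whitney U test. The sketch below uses synthetic ratings and a hypothetical helper `compare_cohorts`; it is not the notebook's result.

```python
import numpy as np
from scipy import stats


def compare_cohorts(a, b):
    """Run Welch's t-test and a two-sided Mann-Whitney U test (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Drop missing ratings before testing
    a = a[~np.isnan(a)]
    b = b[~np.isnan(b)]
    welch = stats.ttest_ind(a, b, equal_var=False)  # no equal-variance assumption
    mwu = stats.mannwhitneyu(a, b, alternative="two-sided")  # rank-based test
    return {
        "welch_t": float(welch.statistic),
        "welch_p": float(welch.pvalue),
        "mwu_u": float(mwu.statistic),
        "mwu_p": float(mwu.pvalue),
    }


# Synthetic ratings clipped to the 1-5 scale, for illustration only
rng = np.random.default_rng(0)
older = np.clip(rng.normal(4.8, 0.2, 200), 1, 5)
recent = np.clip(rng.normal(4.6, 0.4, 150), 1, 5)

result = compare_cohorts(older, recent)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

With the notebook's data, the call would be `compare_cohorts(older_multis['review_scores_rating'], recent_multis['review_scores_rating'])`. If the two tests disagree with the pooled-variance t-test, the distributional assumptions deserve a closer look.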

In [None]:
# Requires older_multis and recent_multis from the previous cell
# (numpy is already imported as np in the first cell)
older_ratings = older_multis['review_scores_rating'].dropna()
recent_ratings = recent_multis['review_scores_rating'].dropna()

# Calculate Cohen's d
mean_diff = older_ratings.mean() - recent_ratings.mean()
pooled_std = np.sqrt(((len(older_ratings)-1)*older_ratings.std()**2 + 
                       (len(recent_ratings)-1)*recent_ratings.std()**2) / 
                      (len(older_ratings) + len(recent_ratings) - 2))

cohens_d = mean_diff / pooled_std

print(f"Cohen's d: {cohens_d:.3f}")

# Interpretation guide:
# 0.2 = small effect
# 0.5 = medium effect  
# 0.8 = large effect
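
The Cohen's d calculation above can be wrapped in a reusable function so that other cohort comparisons use the same formula. This is a sketch under the same pooled-variance definition; the function names `cohens_d` and `effect_label` are hypothetical, and the "negligible" label for values below 0.2 is an added convention, not part of the guide above.

```python
import numpy as np


def cohens_d(a, b):
    """Cohen's d with a pooled sample standard deviation (ddof=1)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Drop missing values before computing moments
    a = a[~np.isnan(a)]
    b = b[~np.isnan(b)]
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


def effect_label(d):
    """Translate |d| into the conventional small/medium/large labels."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Toy ratings, for illustration only
d = cohens_d([4.9, 4.8, 5.0, 4.7], [4.5, 4.6, 4.4, 4.7])
print(f"Cohen's d: {d:.3f} ({effect_label(d)})")
```

The sign of d follows the argument order: a positive value means the first group has the higher mean.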

In [None]:
# TODO: compare review scores by price category for recent years