# Closure Risk Proxy Model

## Objective

This notebook develops a closure risk proxy model for the Yelp-BI project.
The objective is to estimate **relative business closure risk** using observable
Yelp signals related to customer engagement, pricing, and competitive market
context.

Because true business closure outcomes are not consistently available through
the Yelp Fusion API, this analysis does not perform binary classification.
Instead, it constructs an interpretable risk score and risk segments that
identify businesses that appear more or less vulnerable relative to peers.

## Approach

The analysis follows a structured workflow:

- Engineer interpretable risk signals from available Yelp data  
- Construct a standardized closure risk proxy score  
- Segment businesses into risk buckets and risk profiles  
- Decompose risk scores into component-level drivers  
- Aggregate risk at the market and category level  

This approach emphasizes transparency, interpretability, and business relevance,
with all outputs designed for downstream use in Power BI dashboards and
strategic market analysis.

# Imports and Setup

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")

In [2]:
df = pd.read_csv("/Users/nathanho/Desktop/Yelp-BI/data/clean/austin_clean.csv")

In [3]:
df.shape
df.head()

Unnamed: 0,business_id,name,city,state,address,postal_code,rating,review_count,price,price_level,is_closed,categories_alias,categories_title,latitude,longitude,search_location,search_category
0,cs6HfZNykLVitm09jWFqWg,Moonshine Grill,Austin,TX,"303 Red River St, Austin, TX 78701",78701,4.4,6314,$$,2.0,f,"southern, breakfast_brunch, cocktailbars","southern, breakfast & brunch, cocktail bars",30.263754,-97.738077,"Austin, TX",restaurants
1,He2KYtXXfaIR0nkCXH5xiQ,Loro Asian Smokehouse & Bar,Austin,TX,"2115 S Lamar Blvd, Austin, TX 78704",78704,4.3,2497,$$,2.0,f,"smokehouse, asianfusion, cocktailbars","smokehouse, asian fusion, cocktail bars",30.24774,-97.771355,"Austin, TX",restaurants
2,Rba9Ol4jnTiov6_iAuoF5g,1618 Asian Fusion,Austin,TX,"1618 E Riverside Dr, Austin, TX 78741",78741,4.7,3856,$$,2.0,f,"vietnamese, thai, dimsum","vietnamese, thai, dim sum",30.245474,-97.730411,"Austin, TX",restaurants
3,BHZ9puL8YuHE-YEdkmxH-g,Canje,Austin,TX,"1914 E 6th St, Ste C, Austin, TX 78702",78702,4.5,574,$$$,3.0,f,caribbean,caribbean,30.26177,-97.72232,"Austin, TX",restaurants
4,YZs1gNSh_sN8JmN_nrpxeA,Terry Black's Barbecue,Austin,TX,"1003 Barton Springs Rd, Austin, TX 78704",78704,4.5,8396,$$,2.0,f,"bbq, sandwiches, southern","barbeque, sandwiches, southern",30.259692,-97.754801,"Austin, TX",restaurants


In [4]:
list(df.columns)

['business_id',
 'name',
 'city',
 'state',
 'address',
 'postal_code',
 'rating',
 'review_count',
 'price',
 'price_level',
 'is_closed',
 'categories_alias',
 'categories_title',
 'latitude',
 'longitude',
 'search_location',
 'search_category']

## 2. Feature Engineering for Risk Signals

We construct simple, interpretable proxy features that reflect known business
risk factors using only available data.

### 2.1 Engagement Risk (Review Count)

Businesses with fewer reviews tend to have weaker customer traction.
We log-transform review count to reduce skew.

In [5]:
df["ReviewCountLog"] = np.log1p(df["review_count"])
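
As a small illustration of why the log transform helps, the sketch below applies
`np.log1p` to the five review counts shown in the sample rows above and compares
the spread before and after. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Review counts copied from the first five rows of the Austin sample shown above.
review_count = pd.Series([6314, 2497, 3856, 574, 8396], name="review_count")

# log1p(x) = log(1 + x); unlike a plain log, it is defined at x = 0.
review_count_log = np.log1p(review_count)

# Ratio of the largest to the smallest value before and after the transform.
raw_spread = review_count.max() / review_count.min()
log_spread = review_count_log.max() / review_count_log.min()

print(f"raw spread: {raw_spread:.1f}x, log spread: {log_spread:.2f}x")
```

The transform compresses the long right tail, so a handful of very popular
businesses no longer dominate the standardized engagement signal.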

### 2.2 Rating Risk

Lower ratings indicate weaker customer satisfaction.
We convert rating into a risk signal by measuring distance from the maximum.

In [6]:
df["RatingRisk"] = 5 - df["rating"]

### 2.3 Price Risk

Higher-priced businesses may face higher closure risk if demand is weak.
We use `price_level` and fill missing values with the column median.

In [7]:
df["PriceRisk"] = df["price_level"].fillna(df["price_level"].median())

### 2.4 Market Competition Risk

Businesses operating in dense city-category markets face more competition.
We measure this as the number of businesses in the same city and search category.

In [8]:
df["MarketDensity"] = (
    df.groupby(["city", "search_category"])["business_id"]
      .transform("count")
)

## 3. Assemble Risk Feature Matrix
We combine the engineered risk signals into a single feature matrix for modeling.

In [9]:
RiskFeatures = [
    "ReviewCountLog",
    "RatingRisk",
    "PriceRisk",
    "MarketDensity"
]

## 4. Standardize Risk Features

Features are standardized so they are comparable on the same scale.

In [10]:
from sklearn.preprocessing import StandardScaler
Scaler = StandardScaler()
RiskScaled = Scaler.fit_transform(df[RiskFeatures])

RiskDf = pd.DataFrame(
    RiskScaled,
    columns=RiskFeatures,
    index=df.index
)
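
One thing worth checking at this step is whether any feature is constant, since
a constant column standardizes to all zeros and carries no signal. The sketch
below uses a toy frame with hypothetical values (not the Yelp data) to show the
behavior of `StandardScaler` in that case.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame (hypothetical values): one varying feature, one constant feature.
toy = pd.DataFrame({
    "RatingRisk": [0.3, 0.5, 0.6, 1.2],
    "MarketDensity": [200, 200, 200, 200],
})

scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

# Features with a single unique value carry no information after scaling.
constant_cols = toy.columns[toy.nunique() <= 1].tolist()
print(constant_cols)
print(scaled.round(3))
```

In a single-city, single-category sample, a density feature computed per
city and category falls into this case.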

## 5. Align Direction of Risk Signals

We ensure that higher values always indicate higher risk.
Since higher review counts indicate lower risk, we invert that signal.

In [11]:
RiskDf["ReviewCountLog"] = -RiskDf["ReviewCountLog"]

## 6. Construct Closure Risk Proxy Score

The final closure risk score is computed as the average of standardized
risk signals.

In [12]:
df["ClosureRiskScore"] = RiskDf.mean(axis=1)

## 7. Create Risk Buckets
Businesses are grouped into relative risk tiers for interpretability.
We use terciles to define three risk buckets: Low, Medium, and High Risk.

In [13]:
df["RiskBucket"] = pd.qcut(
    df["ClosureRiskScore"],
    q=3,
    labels=["Low Risk", "Medium Risk", "High Risk"]
)
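
The sketch below shows how `pd.qcut` with `q=3` splits 200 values into three
equal-frequency bins. The scores are evenly spaced placeholders, not the
notebook's actual `ClosureRiskScore` values.

```python
import numpy as np
import pandas as pd

# Hypothetical scores: 200 evenly spaced values standing in for ClosureRiskScore.
scores = pd.Series(np.linspace(-1.0, 1.0, 200), name="ClosureRiskScore")

# q=3 yields terciles (three equal-frequency bins); q=4 would yield quartiles.
buckets, edges = pd.qcut(
    scores,
    q=3,
    labels=["Low Risk", "Medium Risk", "High Risk"],
    retbins=True,
)

print(buckets.value_counts().sort_index())
print(edges)
```

Because 200 is not divisible by 3, the bins cannot be exactly equal; the bucket
sizes differ by at most one.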

## 8. Sanity Check

We validate that higher risk buckets correspond to weaker business signals.

In [14]:
df.groupby("RiskBucket")[RiskFeatures].mean()



RiskBucket,ReviewCountLog,RatingRisk,PriceRisk,MarketDensity
Low Risk,7.171434,0.591045,2.014925,200.0
Medium Risk,6.294257,0.724242,2.227273,200.0
High Risk,6.225831,0.862687,2.880597,200.0


## 9. Exploring and Interpreting the Risk Model

In [15]:
df[[
    "name",
    "city",
    "rating",
    "review_count",
    "ClosureRiskScore",
    "RiskBucket"
]].head()

Unnamed: 0,name,city,rating,review_count,ClosureRiskScore,RiskBucket
0,Moonshine Grill,Austin,4.4,6314,-0.901423,Low Risk
1,Loro Asian Smokehouse & Bar,Austin,4.3,2497,-0.532834,Low Risk
2,1618 Asian Fusion,Austin,4.7,3856,-1.080504,Low Risk
3,Canje,Austin,4.5,574,0.068461,Medium Risk
4,Terry Black's Barbecue,Austin,4.5,8396,-1.088234,Low Risk


In [16]:
df["RiskBucket"].value_counts()

RiskBucket
Low Risk       67
High Risk      67
Medium Risk    66
Name: count, dtype: int64

In [17]:
df.sort_values("ClosureRiskScore", ascending=True)[[
    "name",
    "city",
    "rating",
    "review_count",
    "ClosureRiskScore",
    "RiskBucket"
]].head(10)

Unnamed: 0,name,city,rating,review_count,ClosureRiskScore,RiskBucket
4,Terry Black's Barbecue,Austin,4.5,8396,-1.088234,Low Risk
2,1618 Asian Fusion,Austin,4.7,3856,-1.080504,Low Risk
20,Franklin Barbecue,Austin,4.5,6368,-1.010023,Low Risk
110,Granny's Tacos,Austin,4.7,682,-0.988809,Low Risk
0,Moonshine Grill,Austin,4.4,6314,-0.901423,Low Risk
55,Bird Bird Biscuit,Austin,4.7,1802,-0.865356,Low Risk
28,Home Slice Pizza,Austin,4.4,4871,-0.828025,Low Risk
23,Bouldin Creek Cafe,Austin,4.5,3056,-0.802353,Low Risk
22,Jewboy Burgers,Austin,4.6,1660,-0.735955,Low Risk
91,Santorini Cafe,Austin,4.7,1045,-0.711308,Low Risk


In [18]:
df.sort_values("ClosureRiskScore", ascending=False)[[
    "name",
    "city",
    "rating",
    "review_count",
    "ClosureRiskScore",
    "RiskBucket"
]].head(10)

Unnamed: 0,name,city,rating,review_count,ClosureRiskScore,RiskBucket
194,The Guest House Austin,Austin,4.1,291,1.083038,High Risk
65,Comedor,Austin,4.1,528,0.914913,High Risk
85,Jeffrey's,Austin,4.1,565,0.895786,High Risk
156,Oseyo,Austin,3.8,499,0.851347,High Risk
125,Otoko,Austin,4.5,167,0.81467,High Risk
33,Arlo Grey by Kristen Kish,Austin,3.9,494,0.747998,High Risk
38,Juniper,Austin,4.2,740,0.713372,High Risk
79,J Prime Steakhouse - Austin,Austin,4.5,240,0.712581,High Risk
162,Jacoby's Restaurant & Mercantile,Austin,3.8,840,0.704229,High Risk
119,Olamaie,Austin,3.9,628,0.680216,High Risk


## Interpreting the Closure Risk Proxy Results

The closure risk proxy model produces a **relative risk score**, not a prediction
of actual business closure. Higher scores indicate greater vulnerability relative
to other businesses in the same market.

### Why High-Quality Businesses Can Appear High Risk

Several well-known and highly rated restaurants appear in the **High Risk**
category. This does not indicate poor quality. Instead, it reflects **relative
fragility** driven by structural factors such as:

- Lower review volume compared to peers, indicating a narrower customer base
- Higher price positioning, which increases sensitivity to demand fluctuations
- Operation within a dense and competitive market environment

High-end or niche businesses may be more vulnerable despite strong ratings
because they have less margin for error when customer traffic declines.

### Why Popular Restaurants Appear Low Risk

Businesses classified as **Low Risk** tend to exhibit:

- Very high review counts, signaling strong and diversified customer traction
- Consistently strong ratings
- Broad appeal within the local market

These characteristics suggest greater resilience to competitive pressure and
market shocks.

### Understanding the Risk Score Scale

The closure risk score is a standardized index:

- Negative values indicate **below-average risk**
- Positive values indicate **above-average risk**

The score should be interpreted as a **relative vulnerability measure**, similar
to a z-score, rather than a probability of closure.

### Risk Buckets

Businesses are grouped into three risk tiers using quantile-based thresholds:

- Low Risk
- Medium Risk
- High Risk

This ensures balanced segmentation and enables intuitive comparison across
markets and categories for downstream analysis and dashboarding.

### Key Insight

This analysis demonstrates that **risk is not equivalent to quality**. Even
highly rated businesses can exhibit elevated closure risk when engagement,
pricing strategy, and competitive context are considered together. The closure
risk proxy model is designed to surface these structural vulnerabilities for
strategic decision-making rather than to predict failure outcomes.


# K-Means Risk Segmentation

## Objective

In addition to a continuous closure risk score, we use k-means clustering to
identify distinct **risk profiles** among Yelp businesses. While the risk score
answers *how risky* a business is relative to peers, clustering helps explain
*why* businesses are risky by grouping similar risk characteristics together.

This segmentation provides an interpretable, unsupervised view of business
fragility that complements the closure risk proxy model.

## 1. Select Features for Clustering

We use the same standardized risk features from the closure risk proxy to ensure
consistency between scoring and segmentation.

In [19]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

In [20]:
RiskFeatures = [
    "ReviewCountLog",
    "RatingRisk",
    "PriceRisk",
    "MarketDensity"
]

## 2. Standardize Features

K-means is distance-based, so features must be on the same scale.

In [21]:
Scaler = StandardScaler()
RiskScaled = Scaler.fit_transform(df[RiskFeatures])

## 3. Choose Number of Clusters (k)

We begin with k = 3 to align with Low / Medium / High risk segmentation and
maintain interpretability.

In [22]:
K = 3
KMeansModel = KMeans(
    n_clusters=K,
    random_state=42,
    n_init=10
)

df["RiskCluster"] = KMeansModel.fit_predict(RiskScaled)
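
The choice of k = 3 here is driven by interpretability. One common way to
sanity-check it, not used in this notebook, is to compare silhouette scores
across several values of k. The sketch below does this on synthetic blobs that
stand in for the standardized feature matrix; `X_demo` and `silhouette_by_k`
are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the standardized feature matrix (not the Yelp data):
# three well-separated groups in two dimensions.
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.5,
    random_state=42,
)

silhouette_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
    silhouette_by_k[k] = silhouette_score(X_demo, labels)

best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(best_k, {k: round(v, 3) for k, v in silhouette_by_k.items()})
```

On real data the silhouette curve is usually much flatter than on clean
synthetic blobs, so it is best treated as one input alongside interpretability.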

## 4. Examine Cluster Characteristics

We inspect average feature values within each cluster to understand the dominant
risk drivers.

In [23]:
ClusterProfile = (
    df.groupby("RiskCluster")[RiskFeatures]
      .mean()
      .round(2)
)

ClusterProfile

RiskCluster,ReviewCountLog,RatingRisk,PriceRisk,MarketDensity
0,7.3,0.88,2.02,200.0
1,6.07,0.59,1.98,200.0
2,6.52,0.75,3.23,200.0


## 5. Interpret Risk Profiles

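Clusters are interpreted from the dominant signals in the profile table above
rather than assigned arbitrarily. `MarketDensity` is 200 for every business in
this single-city, single-category sample, so it does not separate the clusters;
the differences come from engagement, rating, and price.
- Cluster 0
Highest engagement (ReviewCountLog 7.30), highest rating risk (0.88), mid-range price (2.02)
→ Popular businesses with weaker ratings

- Cluster 1
Lowest engagement (6.07), lowest rating risk (0.59), lowest price (1.98)
→ Well-rated businesses with smaller review bases

- Cluster 2
Highest price (3.23), moderate engagement (6.52), moderate rating risk (0.75)
→ Higher-priced businesses

The profile names assigned in the next section should be read against these
values: cluster 2, not cluster 1, is the higher-priced group.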

## 6. Assign Readable Risk Profiles
We label clusters based on their characteristics for easier interpretation.

In [24]:
RiskProfileMap = {
    0: "Crowded Market Risk",
    1: "High-End / Niche Risk",
    2: "Lower Relative Risk"
}

df["RiskProfile"] = df["RiskCluster"].map(RiskProfileMap)

## 7. Compare Risk Score vs Risk Profile

This checks whether the clusters align with the closure risk proxy.

In [25]:
df.groupby("RiskProfile")["ClosureRiskScore"].mean().round(2)

RiskProfile
Crowded Market Risk     -0.18
High-End / Niche Risk   -0.16
Lower Relative Risk      0.38
Name: ClosureRiskScore, dtype: float64
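
Group means can hide how the buckets are distributed inside each profile. A
row-normalized crosstab is one way to look at that relationship directly. The
sketch below uses a handful of hypothetical rows with placeholder profile names
(`A`, `B`, `C`), not the notebook's output.

```python
import pandas as pd

# Hypothetical rows illustrating the shape of the data; not the notebook's output.
demo = pd.DataFrame({
    "RiskBucket": ["Low Risk", "Low Risk", "Medium Risk",
                   "High Risk", "High Risk", "High Risk"],
    "RiskProfile": ["A", "B", "A", "C", "C", "B"],
})

# Row-normalized crosstab: for each profile, the share of businesses in each bucket.
bucket_share = pd.crosstab(
    demo["RiskProfile"],
    demo["RiskBucket"],
    normalize="index",
).round(2)

print(bucket_share)
```

Applied to the real `df`, the same call would show whether a profile is
concentrated in one bucket or spread across all three.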

## 8. Example Segmentation Output

In [26]:
df[[
    "name",
    "city",
    "rating",
    "review_count",
    "ClosureRiskScore",
    "RiskBucket",
    "RiskProfile"
]].head(10)

Unnamed: 0,name,city,rating,review_count,ClosureRiskScore,RiskBucket,RiskProfile
0,Moonshine Grill,Austin,4.4,6314,-0.901423,Low Risk,Crowded Market Risk
1,Loro Asian Smokehouse & Bar,Austin,4.3,2497,-0.532834,Low Risk,Crowded Market Risk
2,1618 Asian Fusion,Austin,4.7,3856,-1.080504,Low Risk,Crowded Market Risk
3,Canje,Austin,4.5,574,0.068461,Medium Risk,Lower Relative Risk
4,Terry Black's Barbecue,Austin,4.5,8396,-1.088234,Low Risk,Crowded Market Risk
5,Suerte,Austin,4.3,1281,0.053992,Medium Risk,Lower Relative Risk
6,Qi Austin,Austin,4.4,850,-0.33436,Low Risk,High-End / Niche Risk
7,Salty Sow,Austin,4.3,3394,-0.221546,Low Risk,Lower Relative Risk
8,Odd Duck,Austin,4.4,2397,-0.229373,Low Risk,Lower Relative Risk
9,Red Ash,Austin,4.4,1498,0.301651,High Risk,Lower Relative Risk


## K-Means Risk Segmentation Results

To complement the closure risk proxy score, k-means clustering was applied to
identify **distinct risk profiles** among Yelp businesses based on engagement,
rating quality, price positioning, and market density.

### Average Closure Risk Score by Risk Profile

| Risk Profile              | Avg Closure Risk Score |
|---------------------------|------------------------|
| Crowded Market Risk       | -0.18                  |
| High-End / Niche Risk     | -0.16                  |
| Lower Relative Risk       | 0.38                   |

At first glance, these averages may appear counterintuitive. However, this result
highlights an important distinction between **risk magnitude** and **risk type**.

---

### Interpreting the Risk Profiles

#### Crowded Market Risk
Businesses in this profile (cluster 0) have the highest average engagement
(mean ReviewCountLog of 7.30) and mid-range prices, but also the highest average
rating risk (0.88). Market density is identical for every business in this
sample, so crowding does not distinguish this group; its below-average mean
score (-0.18) reflects strong review volume offsetting weaker ratings.

#### High-End / Niche Risk
This profile (cluster 1) has the lowest average engagement (6.07) together with
the lowest rating risk (0.59) and the lowest average price level (1.98). These
businesses are well rated but draw on a smaller review base, which fits the
niche part of the label better than the high-end part.

#### Lower Relative Risk
Although labeled “Lower Relative Risk”, this group (cluster 2) has the
**highest average closure risk score** (0.38). It also has the highest average
price level (3.23) with only moderate engagement (6.52), so its elevated score
is driven mainly by the price component.

This result underscores that clustering identifies **similarity in risk drivers**,
not overall risk magnitude.

---

### Relationship Between Risk Score and Risk Profile

- **ClosureRiskScore** measures *how risky* a business is relative to peers.
- **RiskProfile** explains *why* a business is risky by identifying the dominant
  structural risk pattern.

As a result, a business may:
- Belong to a relatively stable risk profile but still rank high in overall risk,
  or
- Appear structurally fragile while currently exhibiting lower aggregate risk.

Together, these two signals provide a more complete and interpretable view of
business vulnerability.

---

### Key Takeaway

Risk segmentation and risk scoring serve complementary purposes. The closure risk
proxy score enables ranking and prioritization, while k-means risk profiles
provide explanatory insight into the underlying drivers of risk. This combined
approach supports more informed business intelligence and strategic analysis
than either method alone.

## Risk Component Contribution Analysis

While the closure risk proxy score provides an overall measure of business
vulnerability, it does not explain *why* a business is considered risky.
To improve interpretability and actionability, we decompose the total risk score
into its underlying component contributions.

### Constructing Component-Level Risk Signals

Each component represents a standardized contribution to closure risk, aligned
so that higher values indicate higher relative risk.

In [27]:
RiskComponents = RiskDf.copy()

RiskComponents.columns = [
    "EngagementRisk",
    "RatingRiskComponent",
    "PriceRiskComponent",
    "CompetitionRisk"
]

RiskComponents.head()

Unnamed: 0,EngagementRisk,RatingRiskComponent,PriceRiskComponent,CompetitionRisk
0,-2.473343,-0.535207,-0.597141,0.0
1,-1.423754,-0.110439,-0.597141,0.0
2,-1.915368,-1.809509,-0.597141,0.0
3,0.238585,-0.959974,0.995234,0.0
4,-2.79582,-0.959974,-0.597141,0.0


### Attaching Risk Components to the Business-Level Dataset
The component-level risk signals are merged back into the main business dataset
for analysis and reporting.

In [28]:
df = pd.concat([df, RiskComponents], axis=1)

### Validating Component Contributions

As a consistency check, we verify that the mean of the component contributions
reconstructs the overall closure risk score.
If the components are correctly calculated, the mean absolute difference between
the two should be 0. (The column is named `ComponentSum`, but it holds the
row-wise mean.)

In [29]:
df["ComponentSum"] = df[
    ["EngagementRisk", "RatingRiskComponent", "PriceRiskComponent", "CompetitionRisk"]
].mean(axis=1)

(df["ComponentSum"] - df["ClosureRiskScore"]).abs().mean()

0.0
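
The same check can be done by hand for a single business. The sketch below
takes the four component values shown for Moonshine Grill (row 0) in the table
above and compares their mean with the reported `ClosureRiskScore`.

```python
import numpy as np

# Component values for Moonshine Grill (row 0), copied from the table above:
# EngagementRisk, RatingRiskComponent, PriceRiskComponent, CompetitionRisk.
components = np.array([-2.473343, -0.535207, -0.597141, 0.0])

# The score is the unweighted mean of the four standardized components.
reconstructed = components.mean()
reported = -0.901423  # ClosureRiskScore shown for the same business

# Compare with a tolerance, since the displayed values are rounded to six decimals.
matches = bool(np.isclose(reconstructed, reported, atol=1e-6))
print(reconstructed, matches)
```

A tolerance is used because the displayed values are rounded; inside the
notebook, where both sides come from the same unrounded arrays, the difference
is exactly zero.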

### Identifying the Primary Risk Driver

For each business, we identify the dominant risk component contributing most
to the overall closure risk score.

In [30]:
df["PrimaryRiskDriver"] = df[
    ["EngagementRisk", "RatingRiskComponent", "PriceRiskComponent", "CompetitionRisk"]
].idxmax(axis=1)
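
A detail of `idxmax` worth keeping in mind: it returns the largest component
even when every component is at or below average. When one component is 0.0 for
all businesses, that column is selected whenever the others are all negative.
The sketch below illustrates this on two hypothetical rows and shows one
possible guard; the label `"NoElevatedComponent"` is an illustrative choice, not
part of the notebook.

```python
import pandas as pd

# Hypothetical standardized components for two businesses.
# CompetitionRisk is 0.0 for both, mirroring a constant MarketDensity column.
demo = pd.DataFrame(
    {
        "EngagementRisk": [-2.47, 0.24],
        "RatingRiskComponent": [-0.54, -0.96],
        "PriceRiskComponent": [-0.60, 1.00],
        "CompetitionRisk": [0.0, 0.0],
    },
    index=["all_below_average", "price_driven"],
)

# Plain idxmax: picks the largest value even if no component is above average.
demo_driver = demo.idxmax(axis=1)

# Variant: only report a driver when the largest component is positive.
has_positive = demo.max(axis=1) > 0
guarded_driver = demo_driver.where(has_positive, "NoElevatedComponent")

print(pd.DataFrame({"idxmax": demo_driver, "guarded": guarded_driver}))
```

Whether to apply such a guard is a design choice; the notebook reports the plain
`idxmax` result.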

### High-Risk Businesses with Component-Level Explanation
This allows us to highlight high-risk businesses along with the primary factor
driving their vulnerability.

In [31]:
df.sort_values("ClosureRiskScore", ascending=False)[[
    "name",
    "rating",
    "review_count",
    "ClosureRiskScore",
    "RiskBucket",
    "PrimaryRiskDriver"
]].head(10)

Unnamed: 0,name,rating,review_count,ClosureRiskScore,RiskBucket,PrimaryRiskDriver
194,The Guest House Austin,4.1,291,1.083038,High Risk,PriceRiskComponent
65,Comedor,4.1,528,0.914913,High Risk,PriceRiskComponent
85,Jeffrey's,4.1,565,0.895786,High Risk,PriceRiskComponent
156,Oseyo,3.8,499,0.851347,High Risk,RatingRiskComponent
125,Otoko,4.5,167,0.81467,High Risk,PriceRiskComponent
33,Arlo Grey by Kristen Kish,3.9,494,0.747998,High Risk,RatingRiskComponent
38,Juniper,4.2,740,0.713372,High Risk,PriceRiskComponent
79,J Prime Steakhouse - Austin,4.5,240,0.712581,High Risk,PriceRiskComponent
162,Jacoby's Restaurant & Mercantile,3.8,840,0.704229,High Risk,RatingRiskComponent
119,Olamaie,3.9,628,0.680216,High Risk,RatingRiskComponent


### Distribution of Primary Risk Drivers

We aggregate the dominant risk drivers to understand which factors most commonly
contribute to elevated closure risk across the dataset.

In [32]:
df.groupby("PrimaryRiskDriver")["business_id"] \
  .count() \
  .sort_values(ascending=False)

PrimaryRiskDriver
RatingRiskComponent    62
EngagementRisk         57
PriceRiskComponent     47
CompetitionRisk        34
Name: business_id, dtype: int64

In [33]:
df[df["RiskBucket"] == "High Risk"] \
  .groupby("PrimaryRiskDriver")["business_id"] \
  .count() \
  .sort_values(ascending=False)

PrimaryRiskDriver
PriceRiskComponent     32
RatingRiskComponent    25
EngagementRisk         10
Name: business_id, dtype: int64

### Interpretation of Risk Components

- **Engagement Risk** reflects limited customer traction, primarily driven by
  relatively low review volume.
- **Rating Risk** captures weaker customer satisfaction relative to peers.
- **Price Risk** highlights higher-priced positioning that may increase exposure
  to demand fluctuations.
- **Competition Risk** reflects operating in dense and highly competitive
  market environments.

By identifying the dominant risk driver for each business, this decomposition
transforms the closure risk proxy from a ranking tool into an actionable decision
support framework.

## Market-Level Risk Analysis

While business-level risk scores identify individual vulnerabilities, aggregating
risk across markets reveals structural patterns that affect groups of businesses.
This section examines closure risk by category and dominant risk driver to
highlight systemic sources of business fragility.

### Distribution of Primary Risk Drivers by Category
This shows why categories are risky, not just that they are.

In [35]:
DriverByCategory = (
    df.groupby(["search_category", "PrimaryRiskDriver"])["business_id"]
      .count()
      .unstack(fill_value=0)
)

DriverByCategory

search_category,CompetitionRisk,EngagementRisk,PriceRiskComponent,RatingRiskComponent
restaurants,34,57,47,62


### Risk by Price Tier

In [37]:
df.groupby("price_level")["ClosureRiskScore"].mean().round(2)

price_level
1.0   -0.77
2.0   -0.20
3.0    0.32
4.0    0.58
Name: ClosureRiskScore, dtype: float64

### Closure Risk by Price Level

Aggregating closure risk by price tier reveals a clear structural pattern across
restaurant businesses.

Average closure risk increases monotonically with price level, indicating that
higher-priced restaurants exhibit greater relative vulnerability. Lower-priced
businesses tend to benefit from broader customer bases and more stable demand,
while higher-priced establishments rely on narrower, more discretionary
consumer segments.

This result is consistent with the interpretation that **high-end and niche
pricing strategies amplify closure risk**. Because price level is itself one of
the four components of the score, part of this gradient is expected by
construction, so it serves as a consistency check on the closure risk proxy
model rather than as independent validation.

### Share of High-Risk Businesses by Category
This shows which categories have a disproportionate share of vulnerable businesses.

In [36]:
HighRiskShare = (
    df.assign(IsHighRisk=df["RiskBucket"] == "High Risk")
      .groupby("search_category")["IsHighRisk"]
      .mean()
      .sort_values(ascending=False)
      .round(2)
)

HighRiskShare

search_category
restaurants    0.34
Name: IsHighRisk, dtype: float64

### Market-Level Insights

Market-level analysis of restaurant businesses reveals that closure risk is driven
primarily by **engagement and customer satisfaction**, with pricing strategy and
competitive pressure playing secondary roles.

Within the restaurant category, the most common dominant risk drivers are:

- **Rating Risk**: 62 businesses  
- **Engagement Risk**: 57 businesses  
- **Price Risk**: 47 businesses  
- **Competition Risk**: 34 businesses  

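This distribution indicates that **weak customer perception and limited customer
traction** are the most common dominant drivers across all businesses. Within
the High Risk bucket, however, price risk is the most frequent driver (32 of
67), followed by rating risk (25) and engagement risk (10). The 34 businesses
assigned to Competition Risk are those whose other three components are all
below average, since `CompetitionRisk` is 0.0 for every business in this
single-market sample.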

Approximately **34% of restaurant businesses** (67 of 200) fall into the High
Risk bucket. This share follows directly from the tercile-based bucketing and
reflects the relative segmentation of vulnerability within the market rather
than absolute failure prediction.

Price-level aggregation further highlights a structural risk gradient. Lower-priced
restaurants exhibit the lowest relative risk, benefiting from broader customer
bases and more stable demand. In contrast, higher-priced establishments show
elevated risk, consistent with reliance on narrower, more discretionary consumer
segments.

Taken together, these findings suggest that **engagement and reputation are the
most common sources of elevated relative risk**, while high-end pricing
amplifies vulnerability when not supported by strong customer traction. The
closure risk proxy model therefore distinguishes between businesses that appear
structurally fragile due to engagement limitations, weaker ratings, or pricing
exposure; competitive market conditions cannot be assessed from this
single-market sample.

# Export the CSV for Power BI

In [46]:
PowerBICols = [
    # Identifiers
    "business_id",
    "name",
    "city",
    "state",
    "search_category",

    # Core risk outputs
    "ClosureRiskScore",
    "RiskBucket",
    "RiskProfile",

    # Risk explanation
    "PrimaryRiskDriver",
    "EngagementRisk",
    "RatingRiskComponent",
    "PriceRiskComponent",
    "CompetitionRisk",

    # Context
    "rating",
    "review_count",
    "price_level",
    "MarketDensity",
    "latitude",
    "longitude"
]

In [47]:
df[PowerBICols].to_csv(
    "/Users/nathanho/Desktop/Yelp-BI/data/kpi/closure_risk_results.csv",
    index=False)

In [48]:
df[PowerBICols].head()

Unnamed: 0,business_id,name,city,state,search_category,ClosureRiskScore,RiskBucket,RiskProfile,PrimaryRiskDriver,EngagementRisk,RatingRiskComponent,PriceRiskComponent,CompetitionRisk,rating,review_count,price_level,MarketDensity,latitude,longitude
0,cs6HfZNykLVitm09jWFqWg,Moonshine Grill,Austin,TX,restaurants,-0.901423,Low Risk,Crowded Market Risk,CompetitionRisk,-2.473343,-0.535207,-0.597141,0.0,4.4,6314,2.0,200,30.263754,-97.738077
1,He2KYtXXfaIR0nkCXH5xiQ,Loro Asian Smokehouse & Bar,Austin,TX,restaurants,-0.532834,Low Risk,Crowded Market Risk,CompetitionRisk,-1.423754,-0.110439,-0.597141,0.0,4.3,2497,2.0,200,30.24774,-97.771355
2,Rba9Ol4jnTiov6_iAuoF5g,1618 Asian Fusion,Austin,TX,restaurants,-1.080504,Low Risk,Crowded Market Risk,CompetitionRisk,-1.915368,-1.809509,-0.597141,0.0,4.7,3856,2.0,200,30.245474,-97.730411
3,BHZ9puL8YuHE-YEdkmxH-g,Canje,Austin,TX,restaurants,0.068461,Medium Risk,Lower Relative Risk,PriceRiskComponent,0.238585,-0.959974,0.995234,0.0,4.5,574,3.0,200,30.26177,-97.72232
4,YZs1gNSh_sN8JmN_nrpxeA,Terry Black's Barbecue,Austin,TX,restaurants,-1.088234,Low Risk,Crowded Market Risk,CompetitionRisk,-2.79582,-0.959974,-0.597141,0.0,4.5,8396,2.0,200,30.259692,-97.754801


## Proxy Classification

While true business closure labels are not consistently available through the
Yelp Fusion API, we include an optional proxy classification task to explore
whether observable Yelp signals can identify businesses with **elevated closure
risk** as defined by the closure risk proxy score.

This section is explicitly labeled as an exploratory extension and does **not**
claim to predict actual business closures.

### Proxy Target Definition

Rather than predicting true closure outcomes, we define a **proxy risk label**
based on the distribution of the closure risk score:

- **High Risk (1)**: Businesses in the top 20% of `ClosureRiskScore`
- **Not High Risk (0)**: All remaining businesses

This reframes the task from outcome prediction to **risk identification**.

In [38]:
RiskThreshold = df["ClosureRiskScore"].quantile(0.80)

df["HighRiskProxy"] = (df["ClosureRiskScore"] >= RiskThreshold).astype(int)

df["HighRiskProxy"].value_counts()

HighRiskProxy
0    160
1     40
Name: count, dtype: int64
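
The sketch below reproduces the thresholding logic on 200 hypothetical scores
drawn from a normal distribution, to show why an 80th-percentile cutoff with
`>=` yields 40 positives when all values are distinct.

```python
import numpy as np
import pandas as pd

# Hypothetical scores: 200 distinct values standing in for ClosureRiskScore.
rng = np.random.default_rng(0)
scores = pd.Series(rng.normal(size=200), name="ClosureRiskScore")

# The 80th percentile (linear interpolation by default) splits off the top 20%.
threshold = scores.quantile(0.80)
high_risk_proxy = (scores >= threshold).astype(int)

print(high_risk_proxy.value_counts().to_dict())
```

With tied scores at the threshold, the positive class could be slightly larger
than 20%.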

### Feature Set

The proxy classification models use the same interpretable risk components
developed earlier in the notebook to ensure consistency with the closure risk
proxy framework.

In [39]:
ProxyFeatures = [
    "EngagementRisk",
    "RatingRiskComponent",
    "PriceRiskComponent",
    "CompetitionRisk"
]

X = df[ProxyFeatures]
y = df["HighRiskProxy"]

### Train–Test Split

We split the data into training and testing sets using stratification to preserve
the class distribution of the proxy risk label.

In [40]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.30,
    random_state=42,
    stratify=y
)
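
To see what stratification does here, the sketch below splits placeholder
labels with the same class balance as `HighRiskProxy` (160 zeros, 40 ones) and
counts the positives on each side. `X_demo` is a dummy feature used only so the
call has something to split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same class balance as HighRiskProxy (160 zeros, 40 ones).
y_demo = np.array([0] * 160 + [1] * 40)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # dummy feature, values are irrelevant

_, _, y_train_demo, y_test_demo = train_test_split(
    X_demo,
    y_demo,
    test_size=0.30,
    random_state=42,
    stratify=y_demo,
)

# With stratify, both splits keep the 20% positive rate.
print(len(y_test_demo), y_test_demo.sum(), y_train_demo.mean())
```

This gives a 60-row test set with 12 positives, so each misclassified high-risk
business moves recall by roughly eight percentage points.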

### Logistic Regression (Baseline Model)

Logistic regression serves as a transparent baseline model for assessing the
directional relationship between individual risk components and the proxy
high-risk label.

In [41]:
from sklearn.linear_model import LogisticRegression

LogitModel = LogisticRegression(max_iter=1000)
LogitModel.fit(X_train, y_train)

#### Logistic Regression Coefficients

In [42]:
LogitCoefficients = pd.Series(
    LogitModel.coef_[0],
    index=ProxyFeatures
).sort_values(ascending=False)

LogitCoefficients

EngagementRisk         2.282914
RatingRiskComponent    2.146207
PriceRiskComponent     2.116090
CompetitionRisk        0.000000
dtype: float64
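
The pattern in these coefficients can be reproduced on synthetic data. The
sketch below builds three random standardized features plus one constant
column, defines the label as the top 20% of their row mean (the same
construction as `HighRiskProxy`), and fits a default `LogisticRegression`. The
feature names `a`, `b`, `c`, and `constant` are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic standardized components (not the Yelp data); the fourth is constant.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
X_demo["constant"] = 0.0

# Proxy label built the same way as in the notebook: top 20% of the row mean.
score = X_demo.mean(axis=1)
y_demo = (score >= score.quantile(0.80)).astype(int)

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
coefs = pd.Series(model.coef_[0], index=X_demo.columns)
print(coefs.round(3))
```

The varying features receive positive coefficients because the label is a
threshold on their equal-weighted mean, and the constant column receives none
because it cannot separate the classes.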

### Interpreting Logistic Regression Coefficients

The logistic regression model provides a transparent view of how individual risk
components relate to the proxy high-risk classification within the context of the
Austin restaurant market.

Engagement, rating, and price risk receive positive coefficients of similar size
(2.28, 2.15, and 2.12). This is expected: the proxy label is a threshold on the
equal-weighted mean of these standardized components, so each contributes about
equally to the classification. Engagement risk has the largest coefficient, but
only by a small margin.

Price risk also contributes positively, reflecting the increased vulnerability
of higher-priced restaurants operating in a discretionary demand environment.
This aligns with earlier findings showing a monotonic increase in risk across
price levels.

The competition risk coefficient is exactly zero in this model. This does not
imply that competition is unimportant. Rather, `MarketDensity` takes the same
value (200) for every business in this single-market dataset, so the
standardized component is constant at 0.0 and cannot contribute to any model.
Assessing competition risk would require data spanning multiple cities or
categories.

Overall, the coefficient patterns reinforce earlier results from the closure risk
proxy, risk segmentation, and market-level analyses, demonstrating strong
internal consistency across modeling approaches and supporting the validity of
the risk framework.

### Random Forest Classifier (Nonlinear Comparison)

A random forest classifier is used as a nonlinear comparison model to capture
interaction effects among risk components and provide an alternative view of
feature importance.

In [43]:
from sklearn.ensemble import RandomForestClassifier

RFModel = RandomForestClassifier(
    n_estimators=200,
    random_state=42
)

RFModel.fit(X_train, y_train)

### Random Forest Feature Importances

In [44]:
RFImportances = pd.Series(
    RFModel.feature_importances_,
    index=ProxyFeatures
).sort_values(ascending=False)

RFImportances

EngagementRisk         0.384049
PriceRiskComponent     0.353197
RatingRiskComponent    0.262754
CompetitionRisk        0.000000
dtype: float64

### Model Evaluation (Exploratory)

Evaluation metrics are included for internal comparison only and should **not**
be interpreted as real-world closure prediction performance.

In [45]:
from sklearn.metrics import classification_report

y_pred_logit = LogitModel.predict(X_test)
y_pred_rf = RFModel.predict(X_test)

print("Logistic Regression")
print(classification_report(y_test, y_pred_logit))

print("Random Forest")
print(classification_report(y_test, y_pred_rf))

Logistic Regression
              precision    recall  f1-score   support

           0       0.92      1.00      0.96        48
           1       1.00      0.67      0.80        12

    accuracy                           0.93        60
   macro avg       0.96      0.83      0.88        60
weighted avg       0.94      0.93      0.93        60

Random Forest
              precision    recall  f1-score   support

           0       0.92      0.98      0.95        48
           1       0.89      0.67      0.76        12

    accuracy                           0.92        60
   macro avg       0.91      0.82      0.86        60
weighted avg       0.92      0.92      0.91        60



### Interpreting Random Forest Feature Importances

The random forest classifier provides a nonlinear perspective on which risk
components most strongly differentiate high-risk businesses from the rest of
the market.

Engagement risk remains the most influential feature (importance 0.38),
reinforcing the importance of customer traction in identifying elevated closure
risk. Price risk emerges as a strong secondary driver (0.35) under nonlinear
modeling, suggesting that pricing effects may intensify beyond certain
thresholds. Rating risk also contributes meaningfully (0.26). Competition risk
receives an importance of exactly zero, meaning the forest gained nothing from
splitting on it within the single-city dataset.

The convergence of feature importance rankings between the random forest and
logistic regression models supports the robustness of the underlying risk
signals.
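
An importance of exactly `0.000000` is what a tree ensemble reports for a column with no variation, since a constant feature offers no candidate split. Whether `CompetitionRisk` is actually constant in this dataset is not established here. The sketch below only illustrates the mechanism on synthetic data, with a hypothetical equal-weight score and an assumed top-20% labeling rule.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 300

# Synthetic stand-ins for the risk components (not the notebook's data).
demo = pd.DataFrame({
    "EngagementRisk": rng.normal(size=n),
    "PriceRiskComponent": rng.normal(size=n),
    "RatingRiskComponent": rng.normal(size=n),
    "CompetitionRisk": np.full(n, 0.5),  # constant within one market (assumption)
})

# Label the top 20% of an equal-weight score as high risk (assumed rule).
score = demo.mean(axis=1)
labels = (score >= score.quantile(0.8)).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(demo, labels)

# Impurity-based importances; the constant column can never be split on.
importances = pd.Series(forest.feature_importances_, index=demo.columns)
print(importances.sort_values(ascending=False))
print("Distinct CompetitionRisk values:", demo["CompetitionRisk"].nunique())
```

A similar check on the notebook's own feature matrix, for example counting the distinct `CompetitionRisk` values, would show whether this explanation applies to the Austin data.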

### Important Caveat

This proxy classification task does **not** predict actual business closures.
The target labels are derived from the closure risk proxy score itself, so the
models are evaluated on whether observable Yelp signals can recover the
businesses that the proxy model already classifies as high risk.

Results should be interpreted as:
- Exploratory validation of signal strength
- Internal consistency checks
- Learning tools for understanding risk drivers

They should **not** be used as measures of real-world closure prediction accuracy.
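
The circularity behind this caveat can be shown on synthetic data. In the sketch below, the proxy label is built from the same features the classifier sees, while a hypothetical "true" outcome is drawn independently of them. All data, the equal-weight score, and the 20% cutoff are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic risk components (illustrative only, not the notebook's data).
X_demo = rng.normal(size=(n, 4))

# Proxy label: top 20% of an equal-weight score built from the same features.
proxy_score = X_demo.mean(axis=1)
proxy_label = (proxy_score >= np.quantile(proxy_score, 0.8)).astype(int)

# Hypothetical "true" outcome drawn independently of every feature.
true_outcome = rng.binomial(1, 0.2, size=n)

X_tr, X_te, proxy_tr, proxy_te, _, true_te = train_test_split(
    X_demo, proxy_label, true_outcome,
    test_size=0.2, random_state=42, stratify=proxy_label,
)

# The classifier only ever sees the proxy labels during training.
clf = LogisticRegression().fit(X_tr, proxy_tr)

proxy_accuracy = accuracy_score(proxy_te, clf.predict(X_te))
true_auc = roc_auc_score(true_te, clf.predict_proba(X_te)[:, 1])

print(f"Accuracy on proxy labels: {proxy_accuracy:.2f}")
print(f"ROC AUC against independent outcome: {true_auc:.2f}")
```

High accuracy on the proxy labels is expected by construction, because the label is a deterministic function of the inputs. It carries no information about how well the same features would rank businesses on an outcome the score was not built from.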

## Final Summary and Business Implications

This notebook developed a **closure risk proxy model** to estimate relative
business vulnerability using observable Yelp signals when true closure outcomes
are unavailable. Rather than attempting binary closure prediction, the analysis
focused on identifying **structural risk patterns** that differentiate more
fragile businesses from more resilient peers within the Austin restaurant
market.

### Key Methodological Contributions

- Constructed an interpretable **closure risk proxy score** using standardized
  signals related to customer engagement, customer satisfaction, pricing
  strategy, and competitive market density.
- Segmented businesses into **risk buckets** (Low, Medium, High) to support
  prioritization and communication.
- Applied **k-means clustering** to identify distinct **risk profiles**, providing
  insight into different types of business vulnerability.
- Decomposed the total risk score into **component-level contributions**, enabling
  identification of dominant risk drivers for individual businesses.
- Conducted **market-level aggregation** to uncover structural risk patterns
  within the restaurant category.
- Included an optional **proxy classification extension** to validate signal
  strength using familiar supervised learning models while clearly documenting
  limitations.

### Core Findings

Across all modeling approaches, results consistently indicate that **engagement
and customer perception are the strongest drivers of elevated closure risk**.
Businesses with fewer reviews or weaker ratings appear substantially more
vulnerable than those facing competition alone.

Price positioning introduces a clear structural risk gradient. Lower-priced
restaurants exhibit the lowest relative risk, benefiting from broader customer
bases and more stable demand. In contrast, higher-priced establishments show
elevated vulnerability due to reliance on narrower, more discretionary consumer
segments.

Competitive pressure contributes to closure risk indirectly but does not emerge
as a dominant standalone predictor within a single-city dataset. In the Austin
market, competition appears to influence risk primarily through its effects on
engagement and customer perception rather than as an independent driver.

### Proxy Classification Insights

An optional proxy classification task using logistic regression and random forest
models reinforces the consistency of the risk framework. Both models converge on
engagement risk, rating risk, and price risk as the most informative signals,
with competition risk providing little incremental explanatory power once other
factors are considered.

Both models identify high-risk businesses conservatively, favoring precision
over recall. On the 60-business test set, precision for the high-risk class is
1.00 (logistic regression) and 0.89 (random forest), while recall is 0.67 for
both. Importantly, these results should be interpreted as **internal consistency
checks**, not real-world closure prediction accuracy, since proxy labels are
derived from the risk score itself.
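
The precision-recall balance depends on the decision threshold. For scikit-learn binary classifiers such as the two used here, `.predict()` corresponds to a 0.5 probability cutoff, and lowering that cutoff flags more businesses. The sketch below shows the trade-off on synthetic data. The features, the noisy label, and the 0.3 alternative threshold are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 1000

# Synthetic features and a noisy high-risk label (illustrative only).
X_demo = rng.normal(size=(n, 3))
noisy_score = X_demo.mean(axis=1) + rng.normal(scale=0.3, size=n)
y_demo = (noisy_score >= np.quantile(noisy_score, 0.8)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
clf = LogisticRegression().fit(X_tr, y_tr)

# Probability of the high-risk class for each held-out business.
high_risk_prob = clf.predict_proba(X_te)[:, 1]

results = {}
for threshold in (0.5, 0.3):
    flagged = (high_risk_prob >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, flagged, zero_division=0),
        recall_score(y_te, flagged),
    )
    print(f"threshold={threshold}: precision={results[threshold][0]:.2f} "
          f"recall={results[threshold][1]:.2f}")
```

Lowering the threshold can only keep or increase recall, usually at some cost in precision. For a dashboard meant to surface candidates for review rather than make final calls, a lower cutoff may be the more useful setting.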

### Business Implications

The closure risk proxy model highlights that **risk is not synonymous with
quality**. Highly rated businesses may still exhibit elevated vulnerability due
to limited customer traction, niche pricing strategies, or structural exposure
to demand shocks.

From a strategic perspective, the findings suggest that the most effective levers
for reducing closure risk are:

- Improving **customer engagement** and visibility
- Maintaining strong **customer satisfaction and reputation**
- Carefully managing **pricing strategy** in discretionary market segments

By surfacing both the magnitude and drivers of risk, this framework supports
data-driven prioritization, targeted intervention, and market-level strategic
analysis.

### Conclusion

Overall, this analysis demonstrates how proxy modeling, unsupervised learning,
and interpretable risk decomposition can be combined to produce actionable
business intelligence when true outcome labels are unavailable. The resulting
risk scores, segments, and explanations are designed for direct integration into
Power BI dashboards and provide a scalable foundation for future multi-market
extensions.