# CoffeeKing - Yelp Data Exploration (Milestone 1)

**Goal:** Use Yelp business + review data to generate actionable recommendations for a coffee startup (location & positioning insights).

**Dataset:** Yelp Open Dataset (Academic Use)
**Local environment:** Python (pandas) + SQLite + Jupyter (VSCode)

> Note: Raw Yelp JSON files are not committed to GitHub (see `.gitignore`).

## 1. Load Data

In [10]:
import pandas as pd

biz_path = "../data_raw/yelp_academic_dataset_business.json"
biz = pd.read_json(biz_path, lines = True)
biz.head(3)

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,5.0,7,0,{'ByAppointmentOnly': 'True'},"Doctors, Traditional Chinese Medicine, Naturop...",
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,1,{'BusinessAcceptsCreditCards': 'True'},"Shipping Centers, Local Services, Notaries, Ma...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ..."


## 2. Quick Dataset Overview

In [6]:
biz.shape

(150346, 14)

In [4]:
biz.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150346 entries, 0 to 150345
Data columns (total 14 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   business_id   150346 non-null  object 
 1   name          150346 non-null  object 
 2   address       150346 non-null  object 
 3   city          150346 non-null  object 
 4   state         150346 non-null  object 
 5   postal_code   150346 non-null  object 
 6   latitude      150346 non-null  float64
 7   longitude     150346 non-null  float64
 8   stars         150346 non-null  float64
 9   review_count  150346 non-null  int64  
 10  is_open       150346 non-null  int64  
 11  attributes    136602 non-null  object 
 12  categories    150243 non-null  object 
 13  hours         127123 non-null  object 
dtypes: float64(3), int64(2), object(9)
memory usage: 16.1+ MB


## 3. Column Quality Summary (Nulls)

In [7]:
overview = pd.DataFrame({
    "column": biz.columns,
    "dtype": [str(t) for t in biz.dtypes],
    "non_null": [biz[c].notna().sum() for c in biz.columns],
    "nulls": [biz[c].isna().sum() for c in biz.columns],
    "null_pct": [round(biz[c].isna().mean() * 100, 2) for c in biz.columns],
})
overview.sort_values("null_pct", ascending=False).reset_index(drop=True)

Unnamed: 0,column,dtype,non_null,nulls,null_pct
0,hours,object,127123,23223,15.45
1,attributes,object,136602,13744,9.14
2,categories,object,150243,103,0.07
3,business_id,object,150346,0,0.0
4,name,object,150346,0,0.0
5,address,object,150346,0,0.0
6,city,object,150346,0,0.0
7,state,object,150346,0,0.0
8,postal_code,object,150346,0,0.0
9,latitude,float64,150346,0,0.0


**Key takeaway:** `hours` (15.45%) and `attributes` (9.14%) have the most missing values; core identifiers and location fields are complete.

## 4. Coffee-related Subset (Scope Definition)

We define "coffee-related businesses" using the `categories` text field (case-insensitive keyword match).
This is a simple and explainable starting rule for the proposal stage. 

In [11]:
# Make a safe text column for filtering (handle missing categories)
biz["categories_clean"] = biz["categories"].fillna("").str.lower()

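# Define keywords (start simple; refine later).
# Matching is by substring, so "cafes" and "coffee & tea" are already covered
# by "cafe" and "coffee"; "cafe" also matches longer words such as "cafeteria".
coffee_keywords = ["coffee", "cafe", "cafes", "coffee & tea"]

# Build one alternation pattern and flag every row whose categories contain any keyword
coffee_mask = biz["categories_clean"].str.contains("|".join(coffee_keywords), regex=True)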
coffee_biz = biz.loc[coffee_mask].copy()

coffee_biz.shape


(8509, 15)

In [12]:
coffee_biz[["business_id", "name", "city", "state", "stars", "review_count", "categories"]].head(10)

Unnamed: 0,business_id,name,city,state,stars,review_count,categories
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,Philadelphia,PA,4.0,80,"Restaurants, Food, Bubble Tea, Coffee & Tea, B..."
20,WKMJwqnfZKsAae75RMP6jA,Roast Coffeehouse and Wine Bar,Edmonton,AB,4.0,40,"Coffee & Tea, Food, Cafes, Bars, Wine Bars, Re..."
46,JX4tUpd09YFchLBuI43lGw,Naked Cyber Cafe & Espresso Bar,Edmonton,AB,4.0,12,"Arts & Entertainment, Music Venues, Internet S..."
47,lk9IwjZXqUMqqOhM774DtQ,Caviar & Bananas,Nashville,TN,3.5,159,"Coffee & Tea, Restaurants, Wine Bars, Bars, Ni..."
53,cVBxfMC4lp3DnocjYA3FHQ,Paws The Cat Cafe,Edmonton,AB,5.0,20,"Coffee & Tea, Cafes, Pets, Restaurants, Pet Ad..."
82,ppFCk9aQkM338Rgwpl2F5A,Wawa,Philadelphia,PA,3.0,56,"Restaurants, Automotive, Delis, Gas Stations, ..."
85,IDtLPgUrqorrpqSLdfMhZQ,Helena Avenue Bakery,Santa Barbara,CA,4.0,389,"Food, Restaurants, Salad, Coffee & Tea, Breakf..."
89,oaboaRBUgGjbo2kfUIKDLQ,Mike's Ice Cream,Nashville,TN,4.5,593,"Ice Cream & Frozen Yogurt, Coffee & Tea, Resta..."
99,1MeIwdbTnZOBFCKOrgaxuw,Ricardo's Italian Cafe,Saint Louis,MO,3.5,80,"American (New), Restaurants, Cafes, Italian, A..."
187,h_qlv6CIXGVurFOhFQ945w,Tim Hortons,Edmonton,AB,3.5,6,"Coffee & Tea, Food"


In [22]:
coffee_biz["state"].value_counts().head(10)

state
PA    2128
FL    1519
TN     647
LA     633
IN     573
MO     551
NJ     487
AB     464
AZ     435
NV     327
Name: count, dtype: int64

In [23]:
coffee_biz["is_open"].value_counts(dropna=False)

is_open
1    6052
0    2457
Name: count, dtype: int64

## 5. Milestone 1 Notes (for Proposal)

We summarize what we learned from the initial `business.json` exploration and the coffee-related subset definition.
These notes will feed directly into the Milestone 1 proposal and help guide Milestone 2 analysis.


In [25]:
# Key Dataset Sizes
biz_rows, biz_cols = biz.shape
coffee_rows, coffee_cols = coffee_biz.shape

biz_rows, biz_cols, coffee_rows, coffee_cols

(150346, 15, 8509, 15)

In [26]:
# Missingness summary (top)
overview = pd.DataFrame({
    "column": biz.columns,
    "dtype": [str(t) for t in biz.dtypes],
    "non_null": [biz[c].notna().sum() for c in biz.columns],
    "nulls": [biz[c].isna().sum() for c in biz.columns],
    "null_pct": [round(biz[c].isna().mean() * 100, 2) for c in biz.columns],
}).sort_values("null_pct", ascending=False)

overview.head(5)

Unnamed: 0,column,dtype,non_null,nulls,null_pct
13,hours,object,127123,23223,15.45
11,attributes,object,136602,13744,9.14
12,categories,object,150243,103,0.07
0,business_id,object,150346,0,0.0
1,name,object,150346,0,0.0


### Notes

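- **Dataset size:** `business.json` has **150,346 rows** and **14 columns** (the `biz.shape` output above reports 15 because it includes the derived `categories_clean` column).
- **Coffee-related subset size:** Using keyword matching on `categories`, we extracted **8,509 coffee-related businesses** (about 5.7% of all rows).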

- **Key missing fields:**
    - `hours` has ~15% missing values
    - `attributes` has ~9% missing values
    - Most core identifier and location fields are complete.

- **Initial cleaning needs (proposal-stage):**
    - Use a safe text field for category filtering (`categories_clean`)
    - Keep missing `hours`/`attributes` as null for now; revisit later depending on analysis needs.

- **Next step (Milestone 2 direction):**
    - Describe rating and review volume distributions for coffee businesses
    - Identify where coffee businesses cluster (state/city)
    - Consider refining the coffee filter to reduce false positives (e.g., gas stations / restaurants mixed in)
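
One possible way to tighten the filter is to match whole category tokens instead of substrings, and to exclude rows carrying an obviously unrelated category. The sketch below runs on a small made-up DataFrame, not on the real `biz` data. The `INCLUDE` and `EXCLUDE` sets and the helper names `category_tokens` and `is_coffee` are hypothetical choices for illustration, and it assumes `categories` is a comma-separated string as in the samples above.

```python
import pandas as pd

# Toy rows standing in for the real `biz` DataFrame (illustrative values only).
toy = pd.DataFrame({
    "name": ["Bean Lab", "Quick Fuel", "School Cafeteria Co", "No Categories"],
    "categories": [
        "Coffee & Tea, Food",
        "Gas Stations, Automotive, Coffee & Tea",
        "Cafeteria, Restaurants",
        None,
    ],
})

# Hypothetical allow/deny lists; these would need tuning against the real data.
INCLUDE = {"coffee & tea", "cafes", "coffee roasteries"}
EXCLUDE = {"gas stations"}


def category_tokens(raw):
    """Split a comma-separated category string into a set of lowercase tokens."""
    if not isinstance(raw, str):
        return set()  # missing categories yield no tokens
    return {tok.strip().lower() for tok in raw.split(",") if tok.strip()}


def is_coffee(raw):
    """True when at least one INCLUDE token is present and no EXCLUDE token is."""
    tokens = category_tokens(raw)
    return bool(tokens & INCLUDE) and not (tokens & EXCLUDE)


toy["is_coffee"] = toy["categories"].apply(is_coffee)
print(toy[["name", "is_coffee"]])
```

With whole-token matching, "Cafeteria" no longer counts as a cafe, and the gas station row is dropped even though it lists "Coffee & Tea". Whether excluding such rows is desirable depends on the analysis question, so the deny list should be reviewed before use.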

## 6. Descriptive Statistics (Coffee Subset)

In Milestone 2, we start with simple descriptive statistics to understand the distribution of ratings and review volume.
These are basic but essential for explaining the dataset and defending any later insights. 

In [27]:
coffee_biz["stars"].describe()

count    8509.000000
mean        3.596310
std         0.963291
min         1.000000
25%         3.000000
50%         4.000000
75%         4.500000
max         5.000000
Name: stars, dtype: float64

In [29]:
coffee_biz["review_count"].describe()

count    8509.000000
mean       71.675755
std       169.898609
min         5.000000
25%        11.000000
50%        26.000000
75%        67.000000
max      5721.000000
Name: review_count, dtype: float64

In [30]:
coffee_biz["state"].value_counts().head(10), coffee_biz["state"].nunique()

(state
 PA    2128
 FL    1519
 TN     647
 LA     633
 IN     573
 MO     551
 NJ     487
 AB     464
 AZ     435
 NV     327
 Name: count, dtype: int64,
 15)

### Quick Interpretation

- **Stars:** The mean is approximately **3.60** and the median is **4.0**, so ratings lean toward the "OK to good" range, with a tail of low-rated businesses pulling the mean below the median.
- **Review count:** The mean is **71.7**, the standard deviation is **169.9**, and the maximum is **5,721**. The median (26) is far below the mean, which indicates a heavy right tail: a **small number of very popular listings (large outliers)** pull the mean upward.
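
To make the mean-versus-median comparison concrete, here is a minimal sketch on made-up review counts (not real Yelp rows) showing how a single large value moves the mean while leaving the median almost untouched. The variable names are illustrative; in the notebook the same calls could be applied to `coffee_biz["review_count"]`.

```python
import pandas as pd

# Illustrative review counts: mostly small values plus one large outlier.
counts = pd.Series([5, 8, 11, 15, 26, 30, 67, 90, 250, 5721], name="review_count")

mean_val = counts.mean()      # sensitive to the outlier
median_val = counts.median()  # robust to the outlier
skewness = counts.skew()      # sample skewness; positive means a longer right tail

# Share of businesses whose review count exceeds the mean
share_above_mean = (counts > mean_val).mean()

print(f"mean={mean_val:.1f}, median={median_val:.1f}, skew={skewness:.2f}")
print(f"share above mean: {share_above_mean:.0%}")
```

In this toy series only one value in ten exceeds the mean, which is why the median is usually the safer "typical business" summary for review counts.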

In [31]:
coffee_biz.sort_values("review_count", ascending = False)[
    ["business_id", "name", "city", "state", "stars", "review_count", "categories"]
].head(10)

Unnamed: 0,business_id,name,city,state,stars,review_count,categories
143157,ytynqOUb3hjKeJfRj5Tshw,Reading Terminal Market,Philadelphia,PA,4.5,5721,"Candy Stores, Shopping, Department Stores, Fas..."
147081,oBNrLz4EDhiscSlbOl8uAw,Ruby Slipper - New Orleans,New Orleans,LA,4.5,5193,"Restaurants, American (Traditional), American ..."
20078,j-qtdD55OLfSqfsWuQTDJg,Parc,Philadelphia,PA,4.0,2761,"Restaurants, French, Wine Bars, Nightlife, Ame..."
126929,Y2Pfil51rNvTd_lFHwzb_g,Cafe Beignet on Royal Street,New Orleans,LA,4.0,2688,"Cafes, Breakfast & Brunch, Food, Restaurants, ..."
81594,PGd06nrseC2YAIqP6S9gUA,Ruby Slipper Cafe,New Orleans,LA,4.0,2159,"American (New), Breakfast & Brunch, Restaurant..."
100087,BjeHLwKOlHyV6DJgmZxAjA,Jimmy J's Cafe,New Orleans,LA,4.0,2137,"American (Traditional), Cafes, American (New),..."
5851,vN6v8m4DO45Z4pp8yxxF_w,Surrey's Café & Juice Bar,New Orleans,LA,4.5,2084,"Vegetarian, Restaurants, Breakfast & Brunch, C..."
957,W4ZEKkva9HpAdZG88juwyQ,Mr. B's Bistro,New Orleans,LA,4.0,2064,"Bars, Breakfast & Brunch, Restaurants, Barbequ..."
1971,8uF-bhJFgT4Tn6DTb27viA,District Donuts Sliders Brew,New Orleans,LA,4.5,2062,"Food, Donuts, Burgers, American (Traditional),..."
147511,qQO7ErS_RAN4Vs1uX0L55Q,The Franklin Fountain,Philadelphia,PA,4.0,2062,"Ice Cream & Frozen Yogurt, Coffee & Tea, Food,..."


In [32]:
bins = [0, 10, 25, 50, 100, 250, 500, 1000, 100000]
labels = ["0-10", "11-25", "26-50", "51-100", "101-250", "251-500", "501-1000", "1000+"]

coffee_biz["review_bin"] = pd.cut(coffee_biz["review_count"], bins = bins, labels = labels, include_lowest = True)
coffee_biz["review_bin"].value_counts(normalize=True).sort_index().round(3)

review_bin
0-10        0.230
11-25       0.260
26-50       0.196
51-100      0.142
101-250     0.116
251-500     0.038
501-1000    0.014
1000+       0.004
Name: proportion, dtype: float64

### Interpretation (what these stats suggest)

**Review count distribution is highly skewed (long-tailed).**
- ~68.6% of coffee-related businesses have **≤ 50 reviews** (0-10: 23.0%, 11-25: 26.0%, 26-50: 19.6%).
- ~82.8% have **≤ 100 reviews** (adding 51-100: 14.2%).
- Only **0.4%** have **more than 1,000 reviews**, meaning "mega-popular" listings are extremely rare.
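
The cumulative shares quoted above can be checked with a running sum over the bin proportions. This sketch re-enters the rounded proportions from the `review_bin` output by hand, so its results are only as precise as those three-decimal values.

```python
import pandas as pd

# Rounded proportions copied from the review_bin output above.
props = pd.Series(
    [0.230, 0.260, 0.196, 0.142, 0.116, 0.038, 0.014, 0.004],
    index=["0-10", "11-25", "26-50", "51-100", "101-250", "251-500", "501-1000", "1000+"],
)

# Running total: share of businesses at or below each bin's upper edge
cumulative = props.cumsum().round(3)

# Complement: share of businesses with more than 100 reviews
over_100 = round(1 - cumulative["51-100"], 3)

print(cumulative)
print("share with more than 100 reviews:", over_100)
```

In the notebook itself, the same figures should follow from chaining `.cumsum()` onto `coffee_biz["review_bin"].value_counts(normalize=True).sort_index()`.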

**This explains why the mean review_count can be misleading.**
The average review count (~71.7) is pulled upward by a small number of very high-review businesses (max = 5,721).

**Top review-volume listings often look like "coffee + destination food" hybrids.**
In the top 10 by review_count, several businesses include categories like markets, brunch restaurants, donuts, or dessert, suggesting that high review volume may reflect being a broader food destination rather than a pure coffee shop.

**Geographic concentration appears among top examples.**
Several of the highest-review businesses are located in **New Orleans (LA)** and **Philadelphia (PA)**, which suggests city-level factors (tourism, density, food culture) may influence review volume. 
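
A city-level summary would be one way to test this impression beyond the top 10. The sketch below uses a made-up five-row DataFrame (the numbers are illustrative, not Yelp values) and assumes the same `city`, `state`, and `review_count` columns as `coffee_biz`. The output column names are hypothetical.

```python
import pandas as pd

# Toy stand-in for coffee_biz; values are invented for illustration.
toy = pd.DataFrame({
    "city": ["New Orleans", "New Orleans", "Philadelphia", "Philadelphia", "Tucson"],
    "state": ["LA", "LA", "PA", "PA", "AZ"],
    "review_count": [2000, 40, 300, 20, 15],
})

# Per-city business count plus median and maximum review volume
city_summary = (
    toy.groupby(["city", "state"])
    .agg(
        n_businesses=("review_count", "size"),
        median_reviews=("review_count", "median"),
        max_reviews=("review_count", "max"),
    )
    .sort_values("median_reviews", ascending=False)
    .reset_index()
)

print(city_summary)
```

Comparing the median with the maximum per city helps separate cities where review volume is broadly high from cities that merely host one or two outliers. On the real data, cities with very few businesses would likely need to be filtered out first.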