# CoffeeKing - Yelp Data Exploration (Milestone 1)

**Goal:** Use Yelp business + review data to generate actionable recommendations for a coffee startup (location & positioning insights).

**Dataset:** Yelp Open Dataset (Academic Use)

**Local environment:** Python (pandas) + SQLite + Jupyter (VS Code)

> Note: Raw Yelp JSON files are not committed to GitHub (see `.gitignore`).

## 1. Load Data

In [10]:
import pandas as pd

# The business file is JSON Lines (one JSON object per line), hence lines=True
biz_path = "../data_raw/yelp_academic_dataset_business.json"
biz = pd.read_json(biz_path, lines=True)
biz.head(3)

| | business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | attributes | categories | hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pns2l4eNsfO8kk83dixA6A | Abby Rappoport, LAC, CMQ | 1616 Chapala St, Ste 2 | Santa Barbara | CA | 93101 | 34.426679 | -119.711197 | 5.0 | 7 | 0 | {'ByAppointmentOnly': 'True'} | Doctors, Traditional Chinese Medicine, Naturop... | |
| 1 | mpf3x-BjTdTEA3yCZrAYPw | The UPS Store | 87 Grasso Plaza Shopping Center | Affton | MO | 63123 | 38.551126 | -90.335695 | 3.0 | 15 | 1 | {'BusinessAcceptsCreditCards': 'True'} | Shipping Centers, Local Services, Notaries, Ma... | {'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ... |
| 2 | tUFrWirKiKi_TAnsVWINQQ | Target | 5255 E Broadway Blvd | Tucson | AZ | 85711 | 32.223236 | -110.880452 | 3.5 | 22 | 0 | {'BikeParking': 'True', 'BusinessAcceptsCredit... | Department Stores, Shopping, Fashion, Home & G... | {'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ... |
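
Because the environment lists SQLite, one possible follow-up is to persist the loaded frame to a database. The sketch below is only an illustration under stated assumptions: the in-memory database, the table name `business`, and the third toy row are hypothetical and not part of the original notebook. The first two rows reuse values from the `head()` output above. Nested columns such as `attributes` hold Python dicts, which `sqlite3` cannot bind directly, so they are serialized to JSON text first.

```python
import json
import sqlite3

import pandas as pd

# Toy stand-in for `biz`: two rows taken from the head() output,
# plus one hypothetical row with missing attributes.
toy = pd.DataFrame({
    "business_id": ["Pns2l4eNsfO8kk83dixA6A", "mpf3x-BjTdTEA3yCZrAYPw", "hypothetical_id"],
    "name": ["Abby Rappoport, LAC, CMQ", "The UPS Store", "Example Cafe"],
    "state": ["CA", "MO", "PA"],
    "stars": [5.0, 3.0, 4.0],
    "attributes": [
        {"ByAppointmentOnly": "True"},
        {"BusinessAcceptsCreditCards": "True"},
        None,
    ],
})


def to_json_text(value):
    """Serialize a dict to JSON text; anything else (e.g. missing) becomes None."""
    return json.dumps(value) if isinstance(value, dict) else None


toy["attributes"] = toy["attributes"].map(to_json_text)

# In-memory database for the sketch; a file path would be used in practice.
conn = sqlite3.connect(":memory:")
toy.to_sql("business", conn, index=False, if_exists="replace")

rows = conn.execute(
    "SELECT state, COUNT(*) FROM business GROUP BY state ORDER BY state"
).fetchall()
print(rows)
```

The same JSON-text treatment would presumably be needed for `hours`, which is also dict-valued in the output above.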


## 2. Quick Dataset Overview

In [6]:
biz.shape

(150346, 14)

In [4]:
biz.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150346 entries, 0 to 150345
Data columns (total 14 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   business_id   150346 non-null  object 
 1   name          150346 non-null  object 
 2   address       150346 non-null  object 
 3   city          150346 non-null  object 
 4   state         150346 non-null  object 
 5   postal_code   150346 non-null  object 
 6   latitude      150346 non-null  float64
 7   longitude     150346 non-null  float64
 8   stars         150346 non-null  float64
 9   review_count  150346 non-null  int64  
 10  is_open       150346 non-null  int64  
 11  attributes    136602 non-null  object 
 12  categories    150243 non-null  object 
 13  hours         127123 non-null  object 
dtypes: float64(3), int64(2), object(9)
memory usage: 16.1+ MB


## 3. Column Quality Summary (Nulls)

In [7]:
overview = pd.DataFrame({
    "column": biz.columns,
    "dtype": [str(t) for t in biz.dtypes],
    "non_null": [biz[c].notna().sum() for c in biz.columns],
    "nulls": [biz[c].isna().sum() for c in biz.columns],
    "null_pct": [round(biz[c].isna().mean() * 100, 2) for c in biz.columns],
})
overview.sort_values("null_pct", ascending=False).reset_index(drop=True)

| | column | dtype | non_null | nulls | null_pct |
|---|---|---|---|---|---|
| 0 | hours | object | 127123 | 23223 | 15.45 |
| 1 | attributes | object | 136602 | 13744 | 9.14 |
| 2 | categories | object | 150243 | 103 | 0.07 |
| 3 | business_id | object | 150346 | 0 | 0.0 |
| 4 | name | object | 150346 | 0 | 0.0 |
| 5 | address | object | 150346 | 0 | 0.0 |
| 6 | city | object | 150346 | 0 | 0.0 |
| 7 | state | object | 150346 | 0 | 0.0 |
| 8 | postal_code | object | 150346 | 0 | 0.0 |
| 9 | latitude | float64 | 150346 | 0 | 0.0 |


**Key takeaway:** `hours` (15.45%) and `attributes` (9.14%) have the most missing values; `categories` is missing for only 103 rows (0.07%), and the identifier and location fields contain no nulls.
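
One caveat: `isna()` counts only true nulls, so a text column can report 0% missing while still holding empty or whitespace-only strings. Whether that happens in this dataset has not been checked here. The helper below is a minimal sketch of such a check on toy data; `blank_string_counts` is a hypothetical name, and the blank cells in the toy frame are made up for illustration.

```python
import pandas as pd


def blank_string_counts(df):
    """Count cells per column that are strings and empty after stripping whitespace."""
    counts = {}
    for col in df.columns:
        is_blank = df[col].map(lambda v: isinstance(v, str) and v.strip() == "")
        counts[col] = int(is_blank.sum())
    return pd.Series(counts, name="blank_strings")


# Toy frame: the blank cells are invented to show what the check would catch.
toy = pd.DataFrame({
    "address": ["1616 Chapala St, Ste 2", "", "   "],
    "postal_code": ["93101", "63123", ""],
    "stars": [5.0, 3.0, 3.5],
})

result = blank_string_counts(toy)
print(result)
```

Running `blank_string_counts(biz)` on the real frame would show whether columns such as `address` or `postal_code` need extra cleaning despite having no nulls.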

## 4. Coffee-related Subset (Scope Definition)

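We define "coffee-related businesses" with a case-insensitive substring match on the `categories` text field.
This rule is simple and easy to explain, which suits the proposal stage, but it is deliberately broad: any business whose categories contain "coffee" or "cafe" is included.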

In [11]:
# Lowercased copy of `categories` for filtering; missing values become "" so they never match
biz["categories_clean"] = biz["categories"].fillna("").str.lower()

# Keywords are matched as substrings, so "coffee" already covers "coffee & tea"
# and "cafe" already covers "cafes"; the longer entries are kept for readability.
coffee_keywords = ["coffee", "cafe", "cafes", "coffee & tea"]

coffee_mask = biz["categories_clean"].str.contains("|".join(coffee_keywords), regex=True)
coffee_biz = biz.loc[coffee_mask].copy()

coffee_biz.shape


(8509, 15)
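
A possible refinement of the rule is to match whole category names instead of substrings. The sketch below is not the notebook's method. It assumes the target set `COFFEE_CATEGORIES` (a hypothetical name, built from two categories that appear in this notebook's sample rows) and a hypothetical "Cafeteria" row to show a case where the two rules disagree.

```python
import pandas as pd

# Assumed target categories, lowercased for comparison.
COFFEE_CATEGORIES = {"coffee & tea", "cafes"}


def has_coffee_category(categories):
    """Return True if any comma-separated category exactly equals a target."""
    if not isinstance(categories, str):
        return False  # missing categories never match
    tokens = {token.strip().lower() for token in categories.split(",")}
    return not tokens.isdisjoint(COFFEE_CATEGORIES)


toy = pd.Series([
    "Coffee & Tea, Food",      # category string seen in this notebook's sample
    "Restaurants, Cafeteria",  # hypothetical row
    None,                      # missing categories
])

# Substring rule, equivalent in spirit to the keyword regex used above.
substring_mask = toy.fillna("").str.lower().str.contains("coffee|cafe", regex=True)
# Exact-token rule.
token_mask = toy.map(has_coffee_category)

print(pd.DataFrame({"substring": substring_mask, "token": token_mask}))
```

Switching to the exact-token rule would change the subset size, so the `(8509, 15)` shape reported above applies only to the substring rule.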

In [12]:
coffee_biz[["business_id", "name", "city", "state", "stars", "review_count", "categories"]].head(10)

| | business_id | name | city | state | stars | review_count | categories |
|---|---|---|---|---|---|---|---|
| 3 | MTSW4McQd7CbVtyjqoe9mw | St Honore Pastries | Philadelphia | PA | 4.0 | 80 | Restaurants, Food, Bubble Tea, Coffee & Tea, B... |
| 20 | WKMJwqnfZKsAae75RMP6jA | Roast Coffeehouse and Wine Bar | Edmonton | AB | 4.0 | 40 | Coffee & Tea, Food, Cafes, Bars, Wine Bars, Re... |
| 46 | JX4tUpd09YFchLBuI43lGw | Naked Cyber Cafe & Espresso Bar | Edmonton | AB | 4.0 | 12 | Arts & Entertainment, Music Venues, Internet S... |
| 47 | lk9IwjZXqUMqqOhM774DtQ | Caviar & Bananas | Nashville | TN | 3.5 | 159 | Coffee & Tea, Restaurants, Wine Bars, Bars, Ni... |
| 53 | cVBxfMC4lp3DnocjYA3FHQ | Paws The Cat Cafe | Edmonton | AB | 5.0 | 20 | Coffee & Tea, Cafes, Pets, Restaurants, Pet Ad... |
| 82 | ppFCk9aQkM338Rgwpl2F5A | Wawa | Philadelphia | PA | 3.0 | 56 | Restaurants, Automotive, Delis, Gas Stations, ... |
| 85 | IDtLPgUrqorrpqSLdfMhZQ | Helena Avenue Bakery | Santa Barbara | CA | 4.0 | 389 | Food, Restaurants, Salad, Coffee & Tea, Breakf... |
| 89 | oaboaRBUgGjbo2kfUIKDLQ | Mike's Ice Cream | Nashville | TN | 4.5 | 593 | Ice Cream & Frozen Yogurt, Coffee & Tea, Resta... |
| 99 | 1MeIwdbTnZOBFCKOrgaxuw | Ricardo's Italian Cafe | Saint Louis | MO | 3.5 | 80 | American (New), Restaurants, Cafes, Italian, A... |
| 187 | h_qlv6CIXGVurFOhFQ945w | Tim Hortons | Edmonton | AB | 3.5 | 6 | Coffee & Tea, Food |


In [22]:
coffee_biz["state"].value_counts().head(10)

state
PA    2128
FL    1519
TN     647
LA     633
IN     573
MO     551
NJ     487
AB     464
AZ     435
NV     327
Name: count, dtype: int64

In [23]:
coffee_biz["is_open"].value_counts(dropna=False)

is_open
1    6052
0    2457
Name: count, dtype: int64

## 5. Milestone 1 Notes (for Proposal)
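- Dataset size: 150,346 businesses × 14 columns; 8,509 match the coffee keyword rule (6,052 open, 2,457 closed)
- Key missing fields: `hours` (15.45%), `attributes` (9.14%), `categories` (0.07%)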
- Initial cleaning needs:
- Next step: