# Preliminary Data Exploration of Yelp Dataset
### Purpose
This notebook performs preliminary exploration and cleaning of the Yelp academic dataset in preparation for the final group project. The steps are:
- Find null values
- Filter for restaurants
- Determine metro areas in data set
- Delete unneeded columns
- Export a preliminary cleaned data set to csv

### Data Source
- Data: https://www.yelp.com/dataset/download
- Documentation: https://www.yelp.com/dataset/documentation/main
- Only the business JSON file was explored in this notebook

In [1]:
# Dependencies
import os
import pandas as pd

In [2]:
# Load File
file_path = os.path.join("Raw_Data", "yelp_dataset", "yelp_academic_dataset_business.json")
businesses_df = pd.read_json(file_path, lines=True)
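
The `lines=True` argument tells pandas to treat each line of the file as a separate JSON object. The sketch below shows that behavior on two made-up records held in memory; the field values are illustrative assumptions, not rows from the Yelp file.

```python
import io
import pandas as pd

# Two made-up records in JSON Lines form: one JSON object per line
sample = io.StringIO(
    '{"business_id": "a1", "name": "Cafe One", "is_open": 1}\n'
    '{"business_id": "b2", "name": "Shop Two", "is_open": 0}\n'
)

# lines=True parses each line as its own record (one row per line)
sample_df = pd.read_json(sample, lines=True)
print(sample_df)
```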

In [3]:
businesses_df.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,5.0,7,0,{'ByAppointmentOnly': 'True'},"Doctors, Traditional Chinese Medicine, Naturop...",
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,1,{'BusinessAcceptsCreditCards': 'True'},"Shipping Centers, Local Services, Notaries, Ma...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ..."
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...","Restaurants, Food, Bubble Tea, Coffee & Tea, B...","{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'Wheelc...","Brewpubs, Breweries, Food","{'Wednesday': '14:0-22:0', 'Thursday': '16:0-2..."


In [4]:
businesses_df.columns

Index(['business_id', 'name', 'address', 'city', 'state', 'postal_code',
       'latitude', 'longitude', 'stars', 'review_count', 'is_open',
       'attributes', 'categories', 'hours'],
      dtype='object')

In [5]:
businesses_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150346 entries, 0 to 150345
Data columns (total 14 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   business_id   150346 non-null  object 
 1   name          150346 non-null  object 
 2   address       150346 non-null  object 
 3   city          150346 non-null  object 
 4   state         150346 non-null  object 
 5   postal_code   150346 non-null  object 
 6   latitude      150346 non-null  float64
 7   longitude     150346 non-null  float64
 8   stars         150346 non-null  float64
 9   review_count  150346 non-null  int64  
 10  is_open       150346 non-null  int64  
 11  attributes    136602 non-null  object 
 12  categories    150243 non-null  object 
 13  hours         127123 non-null  object 
dtypes: float64(3), int64(2), object(9)
memory usage: 16.1+ MB


NOTE: Three columns contain null values: attributes (13,744 nulls), categories (103 nulls), and hours (23,223 nulls).
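
The null counts can also be read off directly instead of being inferred from the `info()` output. This is a minimal sketch on a made-up frame (`toy_df` and its values are assumptions for illustration); the same two lines should work on `businesses_df`.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the Yelp data
toy_df = pd.DataFrame({
    "name": ["Cafe One", "Shop Two", "Bar Three"],
    "categories": ["Restaurants, Cafes", None, "Bars, Nightlife"],
    "hours": [None, None, "{'Monday': '8:0-22:0'}"],
})

# Count nulls per column, then keep only the columns that have any
null_counts = toy_df.isnull().sum()
columns_with_nulls = null_counts[null_counts > 0]
print(columns_with_nulls)
```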

## Delete Unneeded Columns / Rows
- attributes column (likely not using)
- hours column (likely not using)
- latitude and longitude columns (likely not using)
- rows with businesses not open
- rows without categories (need categories to filter for restaurants)

In [6]:
businesses_df.drop(['attributes','hours','latitude','longitude'],axis=1, inplace=True)
businesses_df.columns

Index(['business_id', 'name', 'address', 'city', 'state', 'postal_code',
       'stars', 'review_count', 'is_open', 'categories'],
      dtype='object')

In [7]:
# Explore is_open data
businesses_df['is_open'].value_counts()

1    119698
0     30648
Name: is_open, dtype: int64

In [8]:
# Delete rows for businesses not open
open_businesses_df = businesses_df[businesses_df['is_open']==1]
len(open_businesses_df)

119698

In [9]:
# Explore categories column
open_businesses_df['categories'].value_counts()

Beauty & Spas, Nail Salons                                                                             900
Nail Salons, Beauty & Spas                                                                             849
Restaurants, Pizza                                                                                     668
Pizza, Restaurants                                                                                     575
Restaurants, Chinese                                                                                   537
                                                                                                      ... 
Hotels & Travel, Tours, Train Stations, Local Flavor, Arts & Entertainment                               1
Pet Stores, Pet Groomers, Pet Sitting, Pet Services, Pet Training, Pets                                  1
Convenience Stores, Automotive, Breakfast & Brunch, Food, Restaurants, Gas Stations, Grocery, Delis      1
Specialty Food, Flowers & Gifts, Shop

In [10]:
# Find null values in categories column
open_businesses_df['categories'].isnull().sum()

95

In [11]:
# Delete rows with null values in categories column
open_businesses_df = open_businesses_df[open_businesses_df['categories'].notnull()]

In [12]:
len(open_businesses_df)

119603

In [13]:
open_businesses_df['categories'].isnull().sum()

0

In [17]:
# Drop is_open column since no longer needed
open_businesses_df.drop(['is_open'],axis=1,inplace=True)

In [18]:
open_businesses_df.columns

Index(['business_id', 'name', 'address', 'city', 'state', 'postal_code',
       'stars', 'review_count', 'categories'],
      dtype='object')
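
The row and column deletions in this section could also be written as one method chain. The sketch below is one possible way to do that on a made-up frame (`toy_df` is an assumption, and the final `reset_index(drop=True)` is an addition that renumbers the rows). Because each step returns a new frame rather than modifying a filtered slice in place, this style may also help avoid copy-related warnings.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the Yelp data
toy_df = pd.DataFrame({
    "name": ["Cafe One", "Shop Two", "Bar Three", "Diner Four"],
    "is_open": [1, 0, 1, 1],
    "categories": ["Restaurants, Cafes", "Shopping", None, "Diners, Restaurants"],
    "hours": [None, None, None, None],
})

cleaned_df = (
    toy_df[toy_df["is_open"] == 1]          # keep open businesses only
    .dropna(subset=["categories"])          # drop rows with no categories
    .drop(columns=["is_open", "hours"])     # drop columns no longer needed
    .reset_index(drop=True)                 # renumber rows 0..n-1
)
print(cleaned_df)
```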

## Filter for Restaurants

In [19]:
# Test whether a substring check can identify restaurants in the "categories" column
"Restaurants" in open_businesses_df["categories"][1]

False

In [20]:
"Restaurants" in open_businesses_df["categories"][3]

True
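
The `in` test above is a substring match, so it would also match any longer category label that happens to contain the word "Restaurants". If an exact label match is wanted, one option is to split the string on commas first. The helper `has_category` and the label "Pop-Up Restaurants" below are hypothetical, used only to show the difference.

```python
def has_category(categories: str, target: str) -> bool:
    """Return True if target is one of the comma-separated category labels."""
    labels = [label.strip() for label in categories.split(",")]
    return target in labels


# Exact label match on a typical categories string
exact = has_category("Restaurants, Pizza", "Restaurants")

# A hypothetical label that contains the word but is not the same label
substring_hit = "Restaurants" in "Pop-Up Restaurants, Food"
exact_on_same = has_category("Pop-Up Restaurants, Food", "Restaurants")

print(exact, substring_hit, exact_on_same)
```

Whether the substring match or the exact match is the better choice depends on which category labels should count as restaurants for the project.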

In [21]:
open_businesses_df.head(10)
# Note: some index labels are now missing because rows were deleted

Unnamed: 0,business_id,name,address,city,state,postal_code,stars,review_count,categories
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,3.0,15,"Shipping Centers, Local Services, Notaries, Ma..."
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,4.0,80,"Restaurants, Food, Bubble Tea, Coffee & Tea, B..."
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,4.5,13,"Brewpubs, Breweries, Food"
5,CF33F8-E6oudUQ46HnavjQ,Sonic Drive-In,615 S Main St,Ashland City,TN,37015,2.0,6,"Burgers, Fast Food, Sandwiches, Food, Ice Crea..."
6,n_0UpQx1hsNbnPUSlodU8w,Famous Footwear,"8522 Eager Road, Dierbergs Brentwood Point",Brentwood,MO,63144,2.5,13,"Sporting Goods, Fashion, Shoe Stores, Shopping..."
7,qkRM_2X51Yqxk3btlwAQIg,Temple Beth-El,400 Pasadena Ave S,St. Petersburg,FL,33707,3.5,5,"Synagogues, Religious Organizations"
9,bBDDEgkFA1Otx9Lfe7BZUQ,Sonic Drive-In,2312 Dickerson Pike,Nashville,TN,37207,1.5,10,"Ice Cream & Frozen Yogurt, Fast Food, Burgers,..."
10,UJsufbvfyfONHeWdvAHKjA,Marshalls,21705 Village Lakes Sc Dr,Land O' Lakes,FL,34639,3.5,6,"Department Stores, Shopping, Fashion"
11,eEOYSgkmpB90uNA7lDOMRA,Vietnamese Food Truck,,Tampa Bay,FL,33602,4.0,10,"Vietnamese, Food, Restaurants, Food Trucks"
12,il_Ro8jwPlHresjw9EGmBg,Denny's,8901 US 31 S,Indianapolis,IN,46227,2.5,28,"American (Traditional), Restaurants, Diners, B..."


In [22]:
# Reset the index so that row labels are sequential again
open_businesses_df = open_businesses_df.reset_index()

In [23]:
open_businesses_df.columns

Index(['index', 'business_id', 'name', 'address', 'city', 'state',
       'postal_code', 'stars', 'review_count', 'categories'],
      dtype='object')

In [24]:
open_businesses_df.head(10)

Unnamed: 0,index,business_id,name,address,city,state,postal_code,stars,review_count,categories
0,1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,3.0,15,"Shipping Centers, Local Services, Notaries, Ma..."
1,3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,4.0,80,"Restaurants, Food, Bubble Tea, Coffee & Tea, B..."
2,4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,4.5,13,"Brewpubs, Breweries, Food"
3,5,CF33F8-E6oudUQ46HnavjQ,Sonic Drive-In,615 S Main St,Ashland City,TN,37015,2.0,6,"Burgers, Fast Food, Sandwiches, Food, Ice Crea..."
4,6,n_0UpQx1hsNbnPUSlodU8w,Famous Footwear,"8522 Eager Road, Dierbergs Brentwood Point",Brentwood,MO,63144,2.5,13,"Sporting Goods, Fashion, Shoe Stores, Shopping..."
5,7,qkRM_2X51Yqxk3btlwAQIg,Temple Beth-El,400 Pasadena Ave S,St. Petersburg,FL,33707,3.5,5,"Synagogues, Religious Organizations"
6,9,bBDDEgkFA1Otx9Lfe7BZUQ,Sonic Drive-In,2312 Dickerson Pike,Nashville,TN,37207,1.5,10,"Ice Cream & Frozen Yogurt, Fast Food, Burgers,..."
7,10,UJsufbvfyfONHeWdvAHKjA,Marshalls,21705 Village Lakes Sc Dr,Land O' Lakes,FL,34639,3.5,6,"Department Stores, Shopping, Fashion"
8,11,eEOYSgkmpB90uNA7lDOMRA,Vietnamese Food Truck,,Tampa Bay,FL,33602,4.0,10,"Vietnamese, Food, Restaurants, Food Trucks"
9,12,il_Ro8jwPlHresjw9EGmBg,Denny's,8901 US 31 S,Indianapolis,IN,46227,2.5,28,"American (Traditional), Restaurants, Diners, B..."


In [25]:
# Drop old index column
open_businesses_df.drop(['index'], axis=1, inplace=True)

In [26]:
open_businesses_df.head(10)

Unnamed: 0,business_id,name,address,city,state,postal_code,stars,review_count,categories
0,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,3.0,15,"Shipping Centers, Local Services, Notaries, Ma..."
1,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,4.0,80,"Restaurants, Food, Bubble Tea, Coffee & Tea, B..."
2,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,4.5,13,"Brewpubs, Breweries, Food"
3,CF33F8-E6oudUQ46HnavjQ,Sonic Drive-In,615 S Main St,Ashland City,TN,37015,2.0,6,"Burgers, Fast Food, Sandwiches, Food, Ice Crea..."
4,n_0UpQx1hsNbnPUSlodU8w,Famous Footwear,"8522 Eager Road, Dierbergs Brentwood Point",Brentwood,MO,63144,2.5,13,"Sporting Goods, Fashion, Shoe Stores, Shopping..."
5,qkRM_2X51Yqxk3btlwAQIg,Temple Beth-El,400 Pasadena Ave S,St. Petersburg,FL,33707,3.5,5,"Synagogues, Religious Organizations"
6,bBDDEgkFA1Otx9Lfe7BZUQ,Sonic Drive-In,2312 Dickerson Pike,Nashville,TN,37207,1.5,10,"Ice Cream & Frozen Yogurt, Fast Food, Burgers,..."
7,UJsufbvfyfONHeWdvAHKjA,Marshalls,21705 Village Lakes Sc Dr,Land O' Lakes,FL,34639,3.5,6,"Department Stores, Shopping, Fashion"
8,eEOYSgkmpB90uNA7lDOMRA,Vietnamese Food Truck,,Tampa Bay,FL,33602,4.0,10,"Vietnamese, Food, Restaurants, Food Trucks"
9,il_Ro8jwPlHresjw9EGmBg,Denny's,8901 US 31 S,Indianapolis,IN,46227,2.5,28,"American (Traditional), Restaurants, Diners, B..."


In [44]:
open_businesses_df.columns

Index(['business_id', 'name', 'address', 'city', 'state', 'postal_code',
       'stars', 'review_count', 'categories'],
      dtype='object')

In [45]:
# Make new dataframe for restaurants
restaurants_df = pd.DataFrame(columns = ['business_id', 'name', 'address', 'city', 'state', 'postal_code',
       'stars', 'review_count', 'categories'])

In [46]:
restaurants_df

Unnamed: 0,business_id,name,address,city,state,postal_code,stars,review_count,categories


In [59]:
# Select rows whose categories string contains "Restaurants"
# (DataFrame.append was removed in pandas 2.0, so a boolean mask replaces the row-by-row loop)
is_restaurant = open_businesses_df['categories'].str.contains("Restaurants", regex=False)
restaurants_df = open_businesses_df[is_restaurant].reset_index(drop=True)

In [60]:
len(restaurants_df)

34988

In [61]:
# Export restaurants_df to csv
restaurants_df.to_csv("yelp_restaurants_prelim_clean.csv")

## Determine Metro Areas and Zip Codes

In [62]:
# Explore zip codes in restaurants_df
restaurants_df["postal_code"].groupby(restaurants_df["city"]).value_counts()

city               postal_code
Abington           19001          30
                   19027           1
Abington Township  19006           1
Affton             63123          14
                   63129           1
                                  ..
philadelphia       19103           1
reno               89508           1
sewell             08096           1
wilmington         19801           1
wimauma            33598           1
Name: postal_code, Length: 2684, dtype: int64

In [64]:
len(restaurants_df["postal_code"].unique())

1877

In [65]:
# Explore cities in restaurants_df
restaurants_df["city"].value_counts()

Philadelphia        3525
Tampa               1964
Indianapolis        1904
Nashville           1681
Tucson              1639
                    ... 
Sassamansville         1
LITHIA                 1
Ashland                1
Oldmans Township       1
Montgomery             1
Name: city, Length: 846, dtype: int64

In [69]:
# Make a list of unique city names after converting them to lower case
cities = list(restaurants_df["city"].str.lower().unique())
len(cities)

796

In [70]:
cities

['affton',
 'philadelphia',
 'ashland city',
 'nashville',
 'tampa bay',
 'indianapolis',
 'reno',
 'white house',
 'ardmore',
 'alton',
 'bala cynwyd',
 'williamstown',
 'glenolden',
 'wesley chapel',
 'santa barbara',
 'new orleans',
 'camden',
 'tampa',
 'fairview heights',
 'wilmington',
 'treasure island',
 'saint louis',
 'brentwood',
 'tucson',
 'woodbury',
 'largo',
 'madison',
 'ewing',
 'warrington',
 'st. louis',
 'lutz',
 'langhorne',
 'king of prussia',
 'clearwater',
 'avon',
 'franklin',
 'meridian',
 'st albert',
 'downingtown',
 'virginia city',
 'saint petersburg',
 'brandon',
 'exton',
 'odessa',
 'brownsburg',
 'maple shade',
 'edmonton',
 'lansdale',
 'goodlettsville',
 'narberth',
 'oldsmar',
 'glassboro',
 'spring hill',
 'pennsville',
 'haddon heights',
 'goleta',
 'view',
 'brookhaven',
 'noblesville',
 'metairie',
 'norristown',
 'cherry hill',
 'isla vista',
 'boise',
 'mount juliet',
 'carmel',
 'fishers',
 'saint charles',
 'high ridge',
 'conshohocken',
 '
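
The list above includes both 'saint louis' and 'st. louis', so lower-casing alone does not merge every spelling of the same city. The sketch below is one possible normalization step; the function `normalize_city` and its rule of expanding a leading "st." to "saint" are assumptions, not part of the original notebook.

```python
import re


def normalize_city(name: str) -> str:
    """Lower-case, collapse whitespace, and expand a leading 'st'/'st.' to 'saint'."""
    cleaned = " ".join(name.lower().split())
    return re.sub(r"^st\.?\s+", "saint ", cleaned)


raw_names = ["St. Louis", " Saint Louis ", "St Albert", "LITHIA", "Stanton"]
normalized = [normalize_city(name) for name in raw_names]
print(normalized)
```

Variants such as 'tampa' and 'tampa bay' would still need a manual mapping, since they differ by more than spelling.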

In [77]:
# List of major US cities by population (top 50), plus common abbreviations and alternate names
# Source: (https://worldpopulationreview.com/us-cities), originally from US Census Bureau
Major_Cities = ["new york", "new york city", "ny", "los angeles", "la", "houston", "chicago", "phoenix", "san antonio",
                "philadelphia", "san diego", "sd", "dallas", "austin", "san jose", "fort worth", "jacksonville", "charlotte",
                "columbus", "indianapolis", "san francisco", "sf", "seattle", "denver", "washington", "boston", "el paso",
                "nashville", "oklahoma city", "las vegas", "portland", "detroit", "memphis", "louisville", "milwaukee",
                "baltimore", "albuquerque", "tucson", "mesa", "fresno", "atlanta", "sacramento", "kansas city", "colorado springs",
                "raleigh", "miami", "omaha", "long beach", "virginia beach", "oakland", "minneapolis", "tampa", "tulsa",
                "arlington", "aurora"]

In [78]:
for i in Major_Cities: 
    if i in cities:
        print(f'{i}: YES')

san antonio: YES
philadelphia: YES
columbus: YES
indianapolis: YES
nashville: YES
tucson: YES
tampa: YES
