<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-the-Business-Dataset" data-toc-modified-id="Load-the-Business-Dataset-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load the Business Dataset</a></span></li><li><span><a href='#Exploring-the-"attributes"-Column' data-toc-modified-id='Exploring-the-"attributes"-Column-2'><span class="toc-item-num">2&nbsp;&nbsp;</span>Exploring the "attributes" Column</a></span></li><li><span><a href='#Exploring-the-"categories"-Column' data-toc-modified-id='Exploring-the-"categories"-Column-3'><span class="toc-item-num">3&nbsp;&nbsp;</span>Exploring the "categories" Column</a></span><ul class="toc-item"><li><span><a href='#The-Column-"cuisine"' data-toc-modified-id='The-Column-"cuisine"-3.1'><span class="toc-item-num">3.1&nbsp;&nbsp;</span>The Column "cuisine"</a></span></li><li><span><a href='#The-Column-"special_food"' data-toc-modified-id='The-Column-"special_food"-3.2'><span class="toc-item-num">3.2&nbsp;&nbsp;</span>The Column "special_food"</a></span></li><li><span><a href='#The-Column-"place"' data-toc-modified-id='The-Column-"place"-3.3'><span class="toc-item-num">3.3&nbsp;&nbsp;</span>The Column "place"</a></span></li></ul></li><li><span><a href="#Next-Steps" data-toc-modified-id="Next-Steps-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Next Steps</a></span></li></ul></div>

The Yelp business dataset assigns a set of attributes and a list of categories to each restaurant. Here we examine the content of these two columns. We explore the different types of attributes and turn each attribute into its own column. We also partition the categories into classes, which we add as new columns.

## Load the Business Dataset

We load the Yelp business dataset, restricted to the restaurants that also appear in the inspection dataset.

In [1]:
import pandas as pd
bus_to_focus = pd.read_csv('bus_to_focus.csv')

In [2]:
bus_to_focus.columns

Index(['Unnamed: 0', 'business_id', 'name', 'address', 'city', 'state',
       'postal_code', 'latitude', 'longitude', 'stars', 'review_count',
       'is_open', 'attributes', 'categories', 'hours', 'sty', 'id'],
      dtype='object')

## Exploring the "attributes" Column

The attributes of each restaurant are stored in a dictionary-like format. As shown below, they describe, for instance, the ambience, price range and parking options of the restaurant.

In [3]:
print(bus_to_focus.attributes.iloc[0])

{'CoatCheck': 'False', 'BusinessParking': "{'garage': False, 'street': True, 'validated': False, 'lot': False, 'valet': False}", 'HappyHour': 'True', 'Smoking': "u'no'", 'WiFi': "u'free'", 'RestaurantsTableService': 'False', 'RestaurantsDelivery': 'False', 'Alcohol': "u'full_bar'", 'RestaurantsPriceRange2': '2', 'HasTV': 'False', 'Caters': 'True', 'Music': "{'dj': False, 'background_music': True, 'jukebox': False, 'live': False, 'video': False, 'karaoke': False}", 'RestaurantsTakeOut': 'True', 'BestNights': "{'monday': False, 'tuesday': False, 'friday': True, 'wednesday': True, 'thursday': True, 'sunday': False, 'saturday': False}", 'WheelchairAccessible': 'True', 'BusinessAcceptsCreditCards': 'True', 'GoodForKids': 'False', 'BusinessAcceptsBitcoin': 'False', 'GoodForDancing': 'False', 'BikeParking': 'True', 'RestaurantsAttire': "u'casual'", 'RestaurantsGoodForGroups': 'True', 'NoiseLevel': "u'average'", 'RestaurantsReservations': 'False', 'Ambience': "{'romantic': False, 'intimate': F
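
The value printed above is stored in the CSV as a string rather than a dictionary, and nested attributes such as `BusinessParking` are themselves strings inside it. A minimal sketch of parsing such a record with `ast.literal_eval` follows; the shortened `raw` string and the helper name `parse_attributes` are illustrative assumptions, not part of the notebook.

```python
import ast

def parse_attributes(raw):
    """Parse a Yelp attributes string into a dict, decoding nested values.

    Hypothetical helper: returns an empty dict when `raw` is not a valid literal.
    """
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return {}
    if not isinstance(parsed, dict):
        return {}
    result = {}
    for key, value in parsed.items():
        # Values such as "u'free'" or "{'garage': False}" are string-encoded
        # literals, so a second literal_eval pass recovers the real value.
        if isinstance(value, str):
            try:
                value = ast.literal_eval(value)
            except (ValueError, SyntaxError):
                pass  # keep the raw string if it is not a literal
        result[key] = value
    return result

# Shortened, made-up record in the same format as the output above.
raw = ("{'WiFi': \"u'free'\", 'HappyHour': 'True', "
       "'BusinessParking': \"{'garage': False, 'street': True}\"}")
attrs = parse_attributes(raw)
print(attrs['WiFi'], attrs['HappyHour'], attrs['BusinessParking']['street'])
```

After this second pass, `attrs['BusinessParking']` is a real dictionary, so individual flags such as `street` can be read directly.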

Note that the set of attributes varies from restaurant to restaurant. We now turn each attribute into its own column in the same data frame.

In [4]:
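import ast

df = pd.DataFrame()
for index, row in bus_to_focus.iterrows():
    row_f = pd.DataFrame()
    try:
        # parse the dictionary-like string without executing arbitrary code
        row_f = pd.Series(ast.literal_eval(bus_to_focus.loc[index,'attributes'])).to_frame().T
        row_f['business_id'] = bus_to_focus.loc[index,'business_id']
    except (ValueError, SyntaxError, TypeError):
        # missing or malformed attributes: this restaurant gets no attribute values
        pass
    # sort=True keeps the attribute columns in alphabetical order
    df = pd.concat([df,row_f], sort=True)

bus_to_focus = bus_to_focus.merge(df, on='business_id', how='left')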
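
Calling `pd.concat` inside the loop copies the accumulated frame on every iteration. A possible alternative, sketched below on a small made-up frame, parses every row first and builds the attribute frame once; `expand_attributes` and the toy data are assumptions for illustration, not the notebook's code.

```python
import ast
import pandas as pd

def expand_attributes(frame, column="attributes", key="business_id"):
    """Return one row per parsable record, one column per attribute (hypothetical helper)."""
    records = []
    for business_id, raw in zip(frame[key], frame[column]):
        try:
            parsed = ast.literal_eval(raw)
        except (ValueError, SyntaxError, TypeError):
            continue  # missing or malformed attributes: skip the row
        if isinstance(parsed, dict):
            records.append({**parsed, key: business_id})
    return pd.DataFrame(records)

# Toy data standing in for bus_to_focus.
toy = pd.DataFrame({
    "business_id": ["a", "b", "c"],
    "attributes": ["{'WiFi': \"u'free'\", 'HasTV': 'True'}", None, "{'HasTV': 'False'}"],
})
expanded = toy.merge(expand_attributes(toy), on="business_id", how="left")
print(expanded[["business_id", "WiFi", "HasTV"]])
```

Rows without parsable attributes (here business `b`) keep empty values in the new columns because of the left merge.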

We now have additional columns in the business data frame, one for each attribute provided by the Yelp business dataset.

In [5]:
bus_to_focus.shape

(2557, 53)

In [6]:
bus_to_focus.columns

Index(['Unnamed: 0', 'business_id', 'name', 'address', 'city', 'state',
       'postal_code', 'latitude', 'longitude', 'stars', 'review_count',
       'is_open', 'attributes', 'categories', 'hours', 'sty', 'id',
       'AcceptsInsurance', 'AgesAllowed', 'Alcohol', 'Ambience', 'BYOB',
       'BYOBCorkage', 'BestNights', 'BikeParking', 'BusinessAcceptsBitcoin',
       'BusinessAcceptsCreditCards', 'BusinessParking', 'ByAppointmentOnly',
       'Caters', 'CoatCheck', 'Corkage', 'DogsAllowed', 'DriveThru',
       'GoodForDancing', 'GoodForKids', 'GoodForMeal', 'HappyHour', 'HasTV',
       'Music', 'NoiseLevel', 'Open24Hours', 'OutdoorSeating',
       'RestaurantsAttire', 'RestaurantsDelivery', 'RestaurantsGoodForGroups',
       'RestaurantsPriceRange2', 'RestaurantsReservations',
       'RestaurantsTableService', 'RestaurantsTakeOut', 'Smoking',
       'WheelchairAccessible', 'WiFi'],
      dtype='object')

Since not every attribute is specified for every restaurant, some of these columns are mostly empty. We count the empty entries in each column to decide which ones to drop.

In [7]:
# attributes 
cols = ['AcceptsInsurance', 'AgesAllowed', 'Alcohol', 'Ambience', 'BYOB',
       'BYOBCorkage', 'BestNights', 'BikeParking', 'BusinessAcceptsBitcoin',
       'BusinessAcceptsCreditCards', 'BusinessParking', 'ByAppointmentOnly',
       'Caters', 'CoatCheck', 'Corkage', 'DogsAllowed', 'DriveThru',
       'GoodForDancing', 'GoodForKids', 'GoodForMeal', 'HappyHour', 'HasTV',
       'Music', 'NoiseLevel', 'Open24Hours', 'OutdoorSeating',
       'RestaurantsAttire', 'RestaurantsDelivery', 'RestaurantsGoodForGroups',
       'RestaurantsPriceRange2', 'RestaurantsReservations',
       'RestaurantsTableService', 'RestaurantsTakeOut', 'Smoking',
       'WheelchairAccessible', 'WiFi']

# check how many entries are empty for each attribute
for col in cols:
    num = bus_to_focus[col].isnull().sum()
    print(col,':',num)

AcceptsInsurance : 2553
AgesAllowed : 2555
Alcohol : 959
Ambience : 955
BYOB : 2555
BYOBCorkage : 2499
BestNights : 2405
BikeParking : 598
BusinessAcceptsBitcoin : 2289
BusinessAcceptsCreditCards : 180
BusinessParking : 416
ByAppointmentOnly : 2494
Caters : 1091
CoatCheck : 2416
Corkage : 2535
DogsAllowed : 2380
DriveThru : 2484
GoodForDancing : 2377
GoodForKids : 875
GoodForMeal : 1490
HappyHour : 2345
HasTV : 940
Music : 2332
NoiseLevel : 1076
Open24Hours : 2556
OutdoorSeating : 833
RestaurantsAttire : 990
RestaurantsDelivery : 828
RestaurantsGoodForGroups : 834
RestaurantsPriceRange2 : 317
RestaurantsReservations : 838
RestaurantsTableService : 2016
RestaurantsTakeOut : 503
Smoking : 2440
WheelchairAccessible : 2200
WiFi : 974
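
Because `isnull().sum()` also works on a whole frame, the per-column loop can be condensed, and dividing by the number of rows puts the counts on a common scale. A small sketch on made-up data (the frame `toy` is an assumption for the example):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Alcohol": ["u'full_bar'", np.nan, "u'none'", np.nan],
    "Open24Hours": [np.nan, np.nan, np.nan, "True"],
})

missing = toy.isnull().sum()          # count of empty entries per column
share = missing / len(toy)            # fraction of rows that are empty
summary = pd.DataFrame({"missing": missing, "share": share})
print(summary.sort_values("missing", ascending=False))
```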


We now drop every attribute that is empty for more than 1000 of the 2557 restaurants.

In [8]:
for col in cols:
    num = bus_to_focus[col].isnull().sum()
    if (num>1000):
        bus_to_focus = bus_to_focus.drop(columns=col)
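
The threshold rule above can also be written without a loop. The sketch below keeps a column when its number of empty entries is at most a cutoff; the toy frame and the cutoff of 2 are made up for the example, whereas the notebook uses 1000.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "business_id": ["a", "b", "c", "d"],
    "Alcohol": ["u'full_bar'", np.nan, "u'none'", np.nan],
    "Open24Hours": [np.nan, np.nan, np.nan, "True"],
})
candidate_cols = ["Alcohol", "Open24Hours"]   # only these may be dropped
max_missing = 2                               # illustrative cutoff

missing = toy[candidate_cols].isnull().sum()
to_drop = missing[missing > max_missing].index
trimmed = toy.drop(columns=to_drop)
print(list(trimmed.columns))
```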

In [9]:
bus_to_focus.columns

Index(['Unnamed: 0', 'business_id', 'name', 'address', 'city', 'state',
       'postal_code', 'latitude', 'longitude', 'stars', 'review_count',
       'is_open', 'attributes', 'categories', 'hours', 'sty', 'id', 'Alcohol',
       'Ambience', 'BikeParking', 'BusinessAcceptsCreditCards',
       'BusinessParking', 'GoodForKids', 'HasTV', 'OutdoorSeating',
       'RestaurantsAttire', 'RestaurantsDelivery', 'RestaurantsGoodForGroups',
       'RestaurantsPriceRange2', 'RestaurantsReservations',
       'RestaurantsTakeOut', 'WiFi'],
      dtype='object')

We fill the remaining empty entries with "unknown".

In [10]:
bus_to_focus = bus_to_focus.fillna('unknown')
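
Calling `fillna('unknown')` on the whole frame also writes the string into any other column that has empty entries, which can turn a numeric column into an object column. If that is not wanted, one option is to restrict the fill to the attribute columns, as in this sketch on made-up data (the column list `attribute_cols` is an assumption):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "stars": [4.0, np.nan],
    "WiFi": ["u'free'", np.nan],
    "HasTV": [np.nan, "True"],
})
attribute_cols = ["WiFi", "HasTV"]   # assumed list of attribute columns

# Fill only the attribute columns; numeric columns keep their NaN values.
toy[attribute_cols] = toy[attribute_cols].fillna("unknown")
print(toy)
```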

## Exploring the "categories" Column

The "categories" column describes the type of place, the cuisine and any specific foods served. For each restaurant, the categories are stored as a single comma-separated string, as shown below. Next, we derive three columns from it: "place", "cuisine" and "special_food". The column "place" describes the kind of establishment, "cuisine" lists the types of cuisine served, and "special_food" lists any specific foods the restaurant serves.

In [11]:
bus_to_focus.categories.iloc[16]

'Bagels, Donuts, Coffee & Tea, Breakfast & Brunch, Restaurants, Food'

### The Column "cuisine"

We extract the types of cuisine served by the restaurants and partition them into the following classes:
American, Italian, French, Mediterranean, Spanish, European, Mexican, Latin American, African, Caribbean, Southern, Japanese, Chinese, E Asian (East Asian), SE Asian (South East Asian), N/C Asian (North and Central Asian), Indian, Australian. Since a restaurant can serve multiple cuisines, we store the cuisines of each restaurant as a Python set.

In [12]:
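import re

# create the column with dtype object so that each cell can hold a set
bus_to_focus['cuisine'] = ""
bus_to_focus['cuisine'] = bus_to_focus['cuisine'].astype(object)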
for index, row in bus_to_focus.iterrows():
    lists = str(row["categories"]).split(", ")
    cuisine = set()
    for ele in lists:
        if ("American" in ele) and ("Latin American" not in ele):
            cuisine.add("American")
        if (re.search(("Italian|Sicilian"),ele)):
            cuisine.add("Italian")
        if (("French") in ele):
            cuisine.add("French")
        if (re.search(
            ("Lebanese|Middle Eastern|Greek|Turkish|Syrian|Armenian|Mediterranean|Moroccan"),ele)):
            cuisine.add("Mediterranean")
        if re.search(("Portuguese|Basque|Spanish|Iberian"),ele):
            cuisine.add("Spanish")
        if (re.search(("European|Polish|Modern European|Irish|German|British|Belgian"),ele)):
            cuisine.add("European")
        if re.search(("Mexican|Tex-Mex"),ele):
            cuisine.add("Mexican")
        if re.search(("Latin American|Venezuelan|Argentine|Peruvian|Brazilian|Colombian"),ele):
            cuisine.add("Latin American")
        if re.search(("African|Ethiopian"), ele):
            cuisine.add("African")
        if re.search(("Caribbean|Cuban"),ele):
            cuisine.add("Caribbean")
        if re.search(("Southern|Soul Food|Cajun|Creole"), ele):
            cuisine.add("Southern")
        if re.search(("Japanese|Izakaya|Teppanyaki"), ele):
            cuisine.add("Japanese")
        if re.search(("Chinese|Szechuan|Cantonese"), ele):
            cuisine.add("Chinese")
        if re.search(("Indian|Pakistani|Bangladeshi|Nepalese"), ele):
            cuisine.add("Indian")
        if re.search(("Thai|Korean|Vietnamese|Ramen|Taiwanese|Pan Asian|Asian Fusion"), ele):
            cuisine.add("E Asian")
        if re.search(("Uzbek|Russian|Mongolian"), ele):
            cuisine.add("N/C Asian")
        if re.search(("Cambodian|Singaporean|Malaysian|Burmese|Laotian"), ele):
            cuisine.add("SE Asian")
        if ("Australian" in ele):
            cuisine.add("Australian")
    #print(cuisine)
    bus_to_focus.at[index,'cuisine'] = cuisine
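
The chain of `if` statements can also be expressed as a table of patterns. The sketch below is one possible refactoring, not the notebook's code: `CUISINE_RULES` lists only a few of the classes, and `classify` is a hypothetical helper.

```python
import re

# A few (label, pattern) pairs taken from the rules above; not the full list.
CUISINE_RULES = [
    ("Italian", r"Italian|Sicilian"),
    ("Mexican", r"Mexican|Tex-Mex"),
    ("Latin American", r"Latin American|Venezuelan|Argentine|Peruvian|Brazilian|Colombian"),
    ("Japanese", r"Japanese|Izakaya|Teppanyaki"),
]

def classify(categories, rules):
    """Return the set of labels whose pattern matches any comma-separated category."""
    labels = set()
    for element in str(categories).split(", "):
        for label, pattern in rules:
            if re.search(pattern, element):
                labels.add(label)
    return labels

print(classify("Sicilian, Tex-Mex, Restaurants", CUISINE_RULES))
```

The same helper could serve the "special_food" and "place" columns with their own rule tables, which keeps the keyword lists separate from the matching logic.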

### The Column "special_food"

We extract the types of special food served by the restaurants and partition them into the following classes:
Pizza, Fast/Fried Foods, Burgers, Desserts, Bagels/Pretzels, Gelato, Seafood, BBQ, Steaks, Vegetarian, Vegan, Gluten-Free, Noodles, Tacos, Sandwiches, Sushi, Kosher, Fruit/Veg and Other.

In [13]:
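# create the column with dtype object so that each cell can hold a set
bus_to_focus['special_food'] = ""
bus_to_focus['special_food'] = bus_to_focus['special_food'].astype(object)
for index, row in bus_to_focus.iterrows():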
    lists = str(row["categories"]).split(", ")
    food = set()
    for ele in lists:
        if ("Pizza" in ele):
            food.add("Pizza")
        if (re.search(("Fast Food|Fish & Chips|Chicken Wings|Hot Dogs"),ele)):
            food.add("Fast/Fried Foods")
        if (("Burgers") in ele):
            food.add("Burgers")
        if (re.search(
            ("Desserts|Cupcakes|Custom Cakes|Creperies|Waffles|Fondue"),ele)):
            food.add("Desserts")
        if re.search(("Bagels|Pretzels|Donuts"),ele):
            food.add("Bagels/Pretzels")
        if (re.search(("Gelato|Ice Cream"),ele)):
            food.add("Gelato")
        if re.search(("Seafood"),ele):
            food.add("Seafood")
        if re.search(("Barbeque|Smokehouse|BBQ"),ele):
            food.add("BBQ")
        if re.search(("Steak|Steakhouses|Cheesesteaks"), ele):
            food.add("Steaks")
        if "Vegetarian" in ele:
            food.add("Vegetarian")
        if "Vegan" in ele:
            food.add("Vegan")
        if "Gluten-Free" in ele:
            food.add("Gluten-Free")
        if "Noodles" in ele:
            food.add("Noodles")
        if re.search(("Tacos|Empanadas"), ele):
            food.add("Tacos")
        if re.search(("Sandwiches|Wraps"), ele):
            food.add("Sandwiches")
        if re.search(("Kosher|Halal"), ele):
            food.add("Kosher")
        if re.search(("Sushi|Raw|Poke"), ele):
            food.add("Sushi")
        if re.search(("Salad|Fruit & Veggies|Juice|Smoothies|Acai Bowls"), ele):
            food.add("Fruit/Veg")
        if ("Specialty Food" in ele):
            food.add("Other")
    
    bus_to_focus.at[index,'special_food'] = food
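
In these patterns, `|` separates the alternatives. A comma is matched literally, so a pattern written as a comma-separated list only matches an element that contains that entire text. A quick check (the category string is illustrative):

```python
import re

element = "Tacos"

# With a comma, the whole text "Tacos, Empanadas" must occur in the element.
comma_match = re.search("Tacos, Empanadas", element)

# With "|", either alternative is enough.
pipe_match = re.search("Tacos|Empanadas", element)

print(comma_match is None, pipe_match is not None)
```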

### The Column "place"

Finally, we add the column "place", which describes the kind of establishment. We partition it into the following classes:
Restaurants, Mobile, Convenience Store, Grocery Store, Food Shop, Bakeries, Coffee Place, Other Goods (i.e., the place sells products other than food), Shopping, Services (i.e., the place provides services other than preparing food), Entertainment/Event Place, Fitness/Sport Place, Teaching/School Place, Religious Place, Health & Medical Place, Pub/Bars and Liquor Manufacturing.

In [14]:
bus_to_focus['place'] = ""
bus_to_focus['place'] = bus_to_focus['place'].astype(object)
for index, row in bus_to_focus.iterrows():
    lists = str(row["categories"]).split(", ")
    place = set()
    for ele in lists:
        if re.search("Restaurants|Bistros|Cafes|Cafeteria", ele):
            place.add("Restaurants")
        if re.search("Trucks|Vendors|Stands", ele):
            place.add("Mobile")
        if re.search("Convenience|Gas Stations", ele):
            place.add("Convenience Store")
        if re.search("Grocery|Health Markets|Organic Stores|Farmers Market", ele):
            place.add("Grocery Store")
        if re.search("Seafood Markets|Meat|Chicken|Cheese|Pasta|Beverage|Candy|Chocolatiers|Popcorn|Imported|Banks|Delis|Delicatessen|Butcher", ele):
            place.add("Food Shop")
        if re.search("Patisserie|Bakeries", ele):
            place.add("Bakeries")
        if re.search("Coffee|Tea", ele):
            place.add("Coffee Place")
        if re.search("Equipment|Supplies|Supply|Drugstores|Discount|Department|Home|Garden|Flower|Gift|Florist|Furniture|Hobby|Wholesale|Tobacco|Gear|Phones|Electronics|Toy|Paint|DVD|Antique|Bikes|Pet|Vitamins|Appliance|Thrift|Book|Goods|Bird|Glass|Skate|Tires|Mattresses|Rugs|Vintage|Automotive|Computer|Mag|Newspapers", ele):
            place.add("Other Goods")
        if re.search("Shopping|Fashion|Clothing|Wear|Shoe", ele):
            place.add("Shopping")
        if re.search("Service|Planning|Repair|Rental|Sewing|Sales|Design|Storage|Wash|Hotels|Dealers|Electricians|Dietitians|Nutritionists|Gardeners", ele):
            place.add("Services")
        if re.search("Entertainment|Venue|Dance|Karaoke|Cinema|Golf|Bowling|Performing|Galleries|Activities|Skating|Pool|Camps|Park|Stadium|Comedy|Leisure|Jazz|Social|Theater|Playcentre|Tours|Botanical|Beaches|Vinyl", ele):
            place.add("Entertainment/Event Place")
        if re.search("Active Life|Gym|Yoga|Sports|Fitness|Cycling|Boxing|Barre|Cardio|Tennis|Soccer|Climbing|Biking", ele):
            place.add("Fitness/Sport Place")
        if re.search("Education|Child Care|Preschool|Schools|Colleges|BootCamps|Art Classes", ele):
            place.add("Teaching/School Place")
        if re.search("Religious|Churches|Synagogues", ele):
            place.add("Religious Place")
        if re.search("Health & Medical|Hospitals|Pharmacy|Health Care|Mental Health|Medical|Weight|Therapy", ele):
            place.add("Health & Medical Place")
        if (re.search("Nightlife|Bars|Beer|Wine|Gastropubs|Brewpubs|Whiskey|Pub|Champagne|Speakeasies", ele)):
            if ("Sushi" not in ele) and ("Juice" not in ele):
                place.add("Pub/Bars")
        if re.search("Wineries|Cideries|Distilleries|Breweries", ele):
            place.add("Liquor Manufacturing")

    bus_to_focus.at[index, "place"] = place
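
Since `re.search` looks for the pattern anywhere in the element, short keywords can match inside longer, unrelated words. One way to tighten the rules, sketched below with made-up category strings, is to anchor keywords with `\b` word boundaries; whether this changes any labels in the actual data has not been checked.

```python
import re

loose = re.compile(r"Park|Pool")
strict = re.compile(r"\b(?:Parks?|Pools?)\b")

for element in ["Parks", "Parking", "Swimming Pools"]:
    print(element, bool(loose.search(element)), bool(strict.search(element)))
```

The loose pattern accepts all three strings, while the strict one rejects "Parking" because the keyword is not followed by a word boundary there.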

## Next Steps

Next, we explore how the restaurant information relates to the types of violations detected during inspections.

In particular, we examine whether the cuisine, the types of food served, or the characteristics of the place are associated with the types of violations. We also examine in which neighborhoods or areas violations occur more often, and how the number of violations of each type evolves over time and across seasons. Finally, we check whether the age, price range and star rating of a restaurant are associated with the types of violations detected.

We will also examine the reviews that were available before each inspection to check for any words or phrases that could indicate any possible violations.

The idea is to collect all available information about each restaurant that could indicate which restaurants are more likely to commit a violation, so that health inspectors can prioritize their inspection efforts.