# Yelp Dataset Exploration and Visualization

After data cleaning, we have a CSV file containing the processed Yelp business dataset. This Jupyter notebook is dedicated to a detailed exploration of that dataset. The exploration was done in parallel with modelling, so we recommend opening both notebooks side by side. The purpose of the exploration is to get a better sense of the data we are dealing with and to identify insights for our task: finding out what makes a business satisfy more customers.

In [38]:
# library imports
import pandas as pd
import numpy as np
import modules.plots as plots
from dython.nominal import associations
from IPython.display import display, HTML

We will do our analysis and modelling with respect to the response 'satisfied', which we have defined as 1 if a business has at least 4.5 stars and 0 if it has fewer than 4.5 stars. The same approach could be extended to 'unsatisfied' and 'stars'. Taking 'unsatisfied' into account as well would give us a fuller picture of the problem.
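
The definition above can be sketched in a few lines of pandas. This is only an illustration on made-up toy data: in this notebook the 'satisfied' column is already present in the processed CSV, and the `toy` frame and its values are assumptions.

```python
import pandas as pd

# Toy data for illustration only; the real notebook loads the processed CSV.
toy = pd.DataFrame({"stars": [5.0, 4.5, 4.0, 3.5, 2.0]})

# 1 if the business has at least 4.5 stars, otherwise 0.
toy["satisfied"] = (toy["stars"] >= 4.5).astype(int)

print(toy)
print(f"Mean response value is {toy['satisfied'].mean()}")
```

Because the response is binary, its mean is simply the share of businesses rated 4.5 stars or higher.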

In [3]:
# let's start with loading data in jupyter
df = pd.read_csv('data/Yelp_business_data_processed.csv')

# response = 'stars'
# response = 'unsatisfied'
response = 'satisfied'

print(f'Mean response value is {np.mean(df[response])}')

Mean response value is 0.2892527902305349


## Important Variables

In the modelling notebook we applied regularization and feature importance to reveal the variables that have the strongest connection with the response. In this part we take a closer look at the variables identified as most important.

### avg_time_open_week

It is interesting that average opening hours have such a big impact on positive ratings. Businesses with shorter opening hours appear to have a much higher chance of being highly rated.

In [5]:
plots.investigate_categoric_variable(df, response, ['avg_time_open_week_binned'], ["Unknown"])


This variable contains these levels: ['0. [0,2]', '1. (2,4]', '2. (4,6]', '3. (6,8]', '4. (10,12]', '5. (12,14]', '6. (12,14]', '7. (14,16]', '8. (16, 18]', '9. >18']
Number of observation by level of this variable: [1292, 3733, 6817, 18443, 30077, 53912, 16594, 7835, 3375, 8268]
Mean value of response by level of this variable: [0.4551083591331269, 0.4374497723010983, 0.43596890127622123, 0.42178604348533316, 0.3287561924394055, 0.2551194539249147, 0.22230926841026877, 0.1700063816209317, 0.1034074074074074, 0.1819061441702951]
Mean fitted value by level of this variable: None


The reason might be that only specific kinds of businesses have short or long opening hours. Let's look at a two-way graph where we add business categories.

In [9]:
fig = plots.investigate_categoric_variable(df, response, ['categories_classified', 'avg_time_open_week_binned'], ["Unknown"])
fig.update_layout(height=800)

This variable contains these levels: ["('Automotive', '0. [0,2]')", "('Automotive', '1. (2,4]')", "('Automotive', '2. (4,6]')", "('Automotive', '3. (6,8]')", "('Automotive', '4. (10,12]')", "('Automotive', '5. (12,14]')", "('Automotive', '6. (12,14]')", "('Automotive', '7. (14,16]')", "('Automotive', '8. (16, 18]')", "('Automotive', '9. >18')", "('Business Support & Supplies', '0. [0,2]')", "('Business Support & Supplies', '1. (2,4]')", "('Business Support & Supplies', '2. (4,6]')", "('Business Support & Supplies', '3. (6,8]')", "('Business Support & Supplies', '4. (10,12]')", "('Business Support & Supplies', '5. (12,14]')", "('Business Support & Supplies', '6. (12,14]')", "('Business Support & Supplies', '7. (14,16]')", "('Business Support & Supplies', '8. (16, 18]')", "('Business Support & Supplies', '9. >18')", "('Computers & Electronics', '0. [0,2]')", "('Computers & Electronics', '1. (2,4]')", "('Computers & Electronics', '2. (4,6]')", "('Computers & Electronics', '3. (6,8]')", "(

There is a lot happening in this graph, but when we zoom in and look at each business category one by one, we can see that the decreasing satisfaction trend is consistent across most business categories. The exceptions are Real Estate, where the trend is the opposite, Construction & Contractors, and Legal & Financial. Since we are interested in the restaurant business, we can confirm that the trend is strong there.
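
When the two-way chart is too dense to read, a pivot table of mean response by category and opening-hours bin can be easier to scan. The snippet below is a minimal sketch on made-up toy data; it reuses the column names from the cells above, but the `toy` frame and its values are assumptions, not the real dataset.

```python
import pandas as pd

# Toy data for illustration only (values are invented).
toy = pd.DataFrame({
    "categories_classified": ["Food & Dining"] * 4 + ["Real Estate"] * 4,
    "avg_time_open_week_binned": ["0. [0,2]", "0. [0,2]", "9. >18", "9. >18"] * 2,
    "satisfied": [1, 1, 0, 1, 0, 0, 1, 1],
})

# Rows: business category, columns: opening-hours bin, cells: mean response.
rates = toy.pivot_table(
    index="categories_classified",
    columns="avg_time_open_week_binned",
    values="satisfied",
    aggfunc="mean",
)
print(rates)
```

Reading each row from left to right shows whether the satisfaction rate falls or rises with longer opening hours within that category.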

### state

We can confirm that there are some differences between states, so we can add this variable to the models; customers may have different standards across states. However, the insight from this variable is limited for our case, since we are looking mainly at business attributes. Location could affect whether a business is successful, but a business owner will usually not relocate to a different state. A geographic analysis of the surroundings might still be useful for selecting the optimal location for a business within a particular city.

In [50]:
plots.investigate_categoric_variable(df, response, ['state'], ["Unknown"])

This variable contains these levels:  ['AB', 'AZ', 'CA', 'CO', 'DE', 'FL', 'HI', 'ID', 'IL', 'IN', 'LA', 'MA', 'MI', 'MO', 'MT', 'NC', 'NJ', 'NV', 'PA', 'SD', 'TN', 'TX', 'UT', 'VI', 'VT', 'WA', 'XMS']
Number of observation by level of this variable:  [5573, 9912, 5203, 3, 2265, 26330, 2, 4467, 2145, 11247, 9924, 2, 1, 10913, 1, 1, 8536, 7715, 34039, 1, 12056, 4, 1, 1, 1, 2, 1]
Mean value of response by level of this variable: [0.1880495244930917, 0.29106133979015336, 0.46723044397463004, 0.3333333333333333, 0.2, 0.2988606152677554, 0.5, 0.33781061114842176, 0.22004662004662004, 0.2900328976615986, 0.30945183393792824, 0.0, 0.0, 0.26518830752313755, 1.0, 0.0, 0.22375820056232426, 0.3668178872326636, 0.2760656893563266, 1.0, 0.2860816191108162, 0.25, 1.0, 0.0, 1.0, 0.5, 0.0]
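
Several states in the output above have only one to four observations, so their mean response (0.0, 0.5 or 1.0) says very little. One option is to keep only levels with enough data before comparing rates. The sketch below uses made-up toy data; the helper name `rates_for_frequent_levels` and the `min_count` threshold are hypothetical and not part of `modules.plots`.

```python
import pandas as pd

def rates_for_frequent_levels(data, column, response, min_count=1000):
    """Return count and mean response per level, keeping levels with enough rows."""
    summary = data.groupby(column)[response].agg(count="count", mean_response="mean")
    return summary[summary["count"] >= min_count]

# Toy data for illustration only (values are invented).
toy = pd.DataFrame({
    "state": ["PA"] * 4 + ["FL"] * 3 + ["MT"],
    "satisfied": [1, 0, 0, 0, 1, 0, 0, 1],
})

frequent = rates_for_frequent_levels(toy, "state", "satisfied", min_count=3)
print(frequent)
```

On the real data a threshold in the hundreds or thousands would be more sensible; the right value is a judgment call.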


### ByAppointmentOnly

This variable looks promising: businesses that are by appointment only seem to have a much higher satisfied rate. However, this might not apply to the restaurant business, so we should check the two-way graph with business categories.

In [51]:
plots.investigate_categoric_variable(df, response, ['ByAppointmentOnly'], ["Unknown"])

This variable contains these levels:  ['False', 'True', 'Unknown']
Number of observation by level of this variable:  [26690, 15609, 108047]
Mean value of response by level of this variable: [0.33461970775571376, 0.4965084246268179, 0.24810499134635852]


In [13]:
fig = plots.investigate_categoric_variable(df, response, ['categories_classified', 'ByAppointmentOnly'], ["Unknown"])
fig.update_layout(height=800)

This variable contains these levels: ["('Automotive', 'False')", "('Automotive', 'True')", "('Automotive', 'Unknown')", "('Business Support & Supplies', 'False')", "('Business Support & Supplies', 'True')", "('Business Support & Supplies', 'Unknown')", "('Computers & Electronics', 'False')", "('Computers & Electronics', 'True')", "('Computers & Electronics', 'Unknown')", "('Construction & Contractors', 'False')", "('Construction & Contractors', 'True')", "('Construction & Contractors', 'Unknown')", "('Education', 'False')", "('Education', 'True')", "('Education', 'Unknown')", "('Entertainment', 'False')", "('Entertainment', 'True')", "('Entertainment', 'Unknown')", "('Food & Dining', 'False')", "('Food & Dining', 'True')", "('Food & Dining', 'Unknown')", "('Health & Medicine', 'False')", "('Health & Medicine', 'True')", "('Health & Medicine', 'Unknown')", "('Home & Garden', 'False')", "('Home & Garden', 'True')", "('Home & Garden', 'Unknown')", "('Legal & Financial', 'False')", "('Lega

As we suspected, appointments are not relevant for the restaurant business: the level True has no data for Food & Dining. However, the graph gives us valuable information about the Personal Care & Services category, where appointments seem to be very important.

### BusinessAcceptsCreditCards

Accepting credit cards seems necessary in the age of digitalization. However, the data show that businesses that don't accept credit cards have a higher satisfaction rate. We can dig deeper to see whether this is caused by a particular business category.

In [52]:
plots.investigate_categoric_variable(df, response, ['BusinessAcceptsCreditCards'], ["Unknown"])

This variable contains these levels:  ['False', 'True', 'Unknown']
Number of observation by level of this variable:  [6025, 113667, 30654]
Mean value of response by level of this variable: [0.4250622406639004, 0.28049477860768735, 0.29503490572192864]


In [14]:
fig = plots.investigate_categoric_variable(df, response, ['categories_classified', 'BusinessAcceptsCreditCards'], ["Unknown"])
fig.update_layout(height=800)

This variable contains these levels: ["('Automotive', 'False')", "('Automotive', 'True')", "('Automotive', 'Unknown')", "('Business Support & Supplies', 'False')", "('Business Support & Supplies', 'True')", "('Business Support & Supplies', 'Unknown')", "('Computers & Electronics', 'False')", "('Computers & Electronics', 'True')", "('Computers & Electronics', 'Unknown')", "('Construction & Contractors', 'False')", "('Construction & Contractors', 'True')", "('Construction & Contractors', 'Unknown')", "('Education', 'False')", "('Education', 'True')", "('Education', 'Unknown')", "('Entertainment', 'False')", "('Entertainment', 'True')", "('Entertainment', 'Unknown')", "('Food & Dining', 'False')", "('Food & Dining', 'True')", "('Food & Dining', 'Unknown')", "('Health & Medicine', 'False')", "('Health & Medicine', 'True')", "('Health & Medicine', 'Unknown')", "('Home & Garden', 'False')", "('Home & Garden', 'True')", "('Home & Garden', 'Unknown')", "('Legal & Financial', 'False')", "('Lega

It's surprising that this trend is consistent across different business categories. Should we investigate it further, or accept that not accepting credit cards could increase customer satisfaction?
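
One way to check whether a gap really holds inside each category is to compute the difference in mean response between the two levels, category by category. The following is a minimal sketch on made-up toy data; the column names match the cells above, while the `toy` frame and the `False_minus_True` column are assumptions introduced for illustration.

```python
import pandas as pd

# Toy data for illustration only (values are invented).
toy = pd.DataFrame({
    "categories_classified": ["Food & Dining"] * 4 + ["Automotive"] * 4,
    "BusinessAcceptsCreditCards": ["False", "False", "True", "True"] * 2,
    "satisfied": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean response per (category, level), with levels spread into columns.
rates = (
    toy.groupby(["categories_classified", "BusinessAcceptsCreditCards"])["satisfied"]
    .mean()
    .unstack()
)

# Positive values mean businesses NOT accepting cards have the higher rate.
rates["False_minus_True"] = rates["False"] - rates["True"]
print(rates)
```

If the sign of the difference is the same in nearly every category, the overall gap is unlikely to be driven by the category mix alone.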

### BikeParking

It seems reasonable that making life easier for people arriving by bike, by providing parking spots, may have some impact on the satisfaction rate.

In [53]:
plots.investigate_categoric_variable(df, response, ['BikeParking'], ["Unknown"])

This variable contains these levels:  ['False', 'True', 'Unknown']
Number of observation by level of this variable:  [17518, 55040, 77788]
Mean value of response by level of this variable: [0.19659778513528942, 0.2997456395348837, 0.3026945030081761]


In [15]:
fig = plots.investigate_categoric_variable(df, response, ['categories_classified', 'BikeParking'], ["Unknown"])
fig.update_layout(height=800)

This variable contains these levels: ["('Automotive', 'False')", "('Automotive', 'True')", "('Automotive', 'Unknown')", "('Business Support & Supplies', 'False')", "('Business Support & Supplies', 'True')", "('Business Support & Supplies', 'Unknown')", "('Computers & Electronics', 'False')", "('Computers & Electronics', 'True')", "('Computers & Electronics', 'Unknown')", "('Construction & Contractors', 'False')", "('Construction & Contractors', 'True')", "('Construction & Contractors', 'Unknown')", "('Education', 'False')", "('Education', 'True')", "('Education', 'Unknown')", "('Entertainment', 'False')", "('Entertainment', 'True')", "('Entertainment', 'Unknown')", "('Food & Dining', 'False')", "('Food & Dining', 'True')", "('Food & Dining', 'Unknown')", "('Health & Medicine', 'False')", "('Health & Medicine', 'True')", "('Health & Medicine', 'Unknown')", "('Home & Garden', 'False')", "('Home & Garden', 'True')", "('Home & Garden', 'Unknown')", "('Legal & Financial', 'False')", "('Lega

The drop in satisfaction rate for businesses without bike parking is a very consistent pattern, even across different business categories. So it would be a good suggestion for business owners to provide such parking spots.

### RestaurantsDelivery

It may seem obvious that offering a delivery service would make customers more satisfied.

In [54]:
plots.investigate_categoric_variable(df, response, ['RestaurantsDelivery'], ["Unknown"])

This variable contains these levels:  ['False', 'True', 'Unknown']
Number of observation by level of this variable:  [20188, 32146, 98012]
Mean value of response by level of this variable: [0.21294828611056074, 0.1800223978099919, 0.34079500469330287]


In [16]:
fig = plots.investigate_categoric_variable(df, response, ['categories_classified', 'RestaurantsDelivery'], ["Unknown"])
fig.update_layout(height=800)

This variable contains these levels: ["('Automotive', 'False')", "('Automotive', 'True')", "('Automotive', 'Unknown')", "('Business Support & Supplies', 'False')", "('Business Support & Supplies', 'True')", "('Business Support & Supplies', 'Unknown')", "('Computers & Electronics', 'False')", "('Computers & Electronics', 'True')", "('Computers & Electronics', 'Unknown')", "('Construction & Contractors', 'False')", "('Construction & Contractors', 'True')", "('Construction & Contractors', 'Unknown')", "('Education', 'False')", "('Education', 'True')", "('Education', 'Unknown')", "('Entertainment', 'False')", "('Entertainment', 'True')", "('Entertainment', 'Unknown')", "('Food & Dining', 'False')", "('Food & Dining', 'True')", "('Food & Dining', 'Unknown')", "('Health & Medicine', 'False')", "('Health & Medicine', 'True')", "('Health & Medicine', 'Unknown')", "('Home & Garden', 'False')", "('Home & Garden', 'True')", "('Home & Garden', 'Unknown')", "('Legal & Financial', 'False')", "('Lega

However, the data don't support this theory: there isn't a big difference between restaurants that offer delivery and those that don't. If anything, delivery seems to be slightly less popular among customers.

### WheelchairAccessible

Making the environment wheelchair friendly is a very humane thing to do, so it might be a good idea for a business to do so. It could attract more positive reviews.

In [55]:
plots.investigate_categoric_variable(df, response, ['WheelchairAccessible'], ["Unknown"])

This variable contains these levels:  ['False', 'True', 'Unknown']
Number of observation by level of this variable:  [2933, 25993, 121420]
Mean value of response by level of this variable: [0.5666553017388339, 0.3934905551494633, 0.2602371932136386]


In [17]:
fig = plots.investigate_categoric_variable(df, response, ['categories_classified', 'WheelchairAccessible'], ["Unknown"])
fig.update_layout(height=800)

This variable contains these levels: ["('Automotive', 'False')", "('Automotive', 'True')", "('Automotive', 'Unknown')", "('Business Support & Supplies', 'False')", "('Business Support & Supplies', 'True')", "('Business Support & Supplies', 'Unknown')", "('Computers & Electronics', 'False')", "('Computers & Electronics', 'True')", "('Computers & Electronics', 'Unknown')", "('Construction & Contractors', 'False')", "('Construction & Contractors', 'True')", "('Construction & Contractors', 'Unknown')", "('Education', 'False')", "('Education', 'True')", "('Education', 'Unknown')", "('Entertainment', 'False')", "('Entertainment', 'True')", "('Entertainment', 'Unknown')", "('Food & Dining', 'False')", "('Food & Dining', 'True')", "('Food & Dining', 'Unknown')", "('Health & Medicine', 'False')", "('Health & Medicine', 'True')", "('Health & Medicine', 'Unknown')", "('Home & Garden', 'False')", "('Home & Garden', 'True')", "('Home & Garden', 'Unknown')", "('Legal & Financial', 'False')", "('Lega

Once again our theory is crushed by the data, or is it? It seems that not being wheelchair friendly pays off with higher customer satisfaction. But a large share of the data is missing (about 81% of records are 'Unknown'), so it would be good to investigate how the data was collected. We suspect that a wheelchair-friendly business is more willing to share that information, so most businesses that aren't wheelchair accessible probably leave this field empty because it has a negative undertone.
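
To see where the missing values are concentrated, the share of 'Unknown' can be computed per business category. This is a minimal sketch on made-up toy data; the column names follow the cells above, and the `toy` frame is an assumption used only for illustration.

```python
import pandas as pd

# Toy data for illustration only (values are invented).
toy = pd.DataFrame({
    "categories_classified": ["Food & Dining"] * 4 + ["Automotive"] * 4,
    "WheelchairAccessible": [
        "True", "False", "Unknown", "Unknown",
        "Unknown", "Unknown", "Unknown", "True",
    ],
})

# Share of records per category where the attribute was not filled in.
unknown_share = (
    toy["WheelchairAccessible"].eq("Unknown")
    .groupby(toy["categories_classified"])
    .mean()
    .sort_values(ascending=False)
)
print(unknown_share)
```

Categories with a very high share of 'Unknown' are the ones where conclusions about this attribute are least reliable.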

### HappyHour

Having a happy hour is typical for some types of restaurants and bars. Let's look at the data.

In [56]:
plots.investigate_categoric_variable(df, response, ['HappyHour'], ["Unknown"])

This variable contains these levels:  ['False', 'True', 'Unknown']
Number of observation by level of this variable:  [5448, 9721, 135177]
Mean value of response by level of this variable: [0.34397944199706315, 0.16356341940129615, 0.2960858725966696]


In [18]:
fig = plots.investigate_categoric_variable(df, response, ['categories_classified', 'HappyHour'], ["Unknown"])
fig.update_layout(height=800)

This variable contains these levels: ["('Automotive', 'False')", "('Automotive', 'True')", "('Automotive', 'Unknown')", "('Business Support & Supplies', 'False')", "('Business Support & Supplies', 'True')", "('Business Support & Supplies', 'Unknown')", "('Computers & Electronics', 'False')", "('Computers & Electronics', 'True')", "('Computers & Electronics', 'Unknown')", "('Construction & Contractors', 'True')", "('Construction & Contractors', 'Unknown')", "('Education', 'False')", "('Education', 'True')", "('Education', 'Unknown')", "('Entertainment', 'False')", "('Entertainment', 'True')", "('Entertainment', 'Unknown')", "('Food & Dining', 'False')", "('Food & Dining', 'True')", "('Food & Dining', 'Unknown')", "('Health & Medicine', 'False')", "('Health & Medicine', 'Unknown')", "('Home & Garden', 'False')", "('Home & Garden', 'True')", "('Home & Garden', 'Unknown')", "('Legal & Financial', 'False')", "('Legal & Financial', 'True')", "('Legal & Financial', 'Unknown')", "('Manufacturi

It seems that not having a happy hour might be linked to a higher satisfaction rate, but again we don't have much data to support this hypothesis, since the value is missing for most records.

### OutdoorSeating

During hot summer months everybody likes to sit outside, so we should investigate how outdoor seating affects customer satisfaction.

In [57]:
plots.investigate_categoric_variable(df, response, ['OutdoorSeating'], ["Unknown"])

This variable contains these levels:  ['False', 'True', 'Unknown']
Number of observation by level of this variable:  [24371, 22549, 103426]
Mean value of response by level of this variable: [0.15169668868737435, 0.2135793161559271, 0.33816448475238337]


In [19]:
fig = plots.investigate_categoric_variable(df, response, ['categories_classified', 'OutdoorSeating'], ["Unknown"])
fig.update_layout(height=800)

This variable contains these levels: ["('Automotive', 'False')", "('Automotive', 'True')", "('Automotive', 'Unknown')", "('Business Support & Supplies', 'False')", "('Business Support & Supplies', 'True')", "('Business Support & Supplies', 'Unknown')", "('Computers & Electronics', 'False')", "('Computers & Electronics', 'True')", "('Computers & Electronics', 'Unknown')", "('Construction & Contractors', 'False')", "('Construction & Contractors', 'True')", "('Construction & Contractors', 'Unknown')", "('Education', 'False')", "('Education', 'True')", "('Education', 'Unknown')", "('Entertainment', 'False')", "('Entertainment', 'True')", "('Entertainment', 'Unknown')", "('Food & Dining', 'False')", "('Food & Dining', 'True')", "('Food & Dining', 'Unknown')", "('Health & Medicine', 'False')", "('Health & Medicine', 'True')", "('Health & Medicine', 'Unknown')", "('Home & Garden', 'False')", "('Home & Garden', 'True')", "('Home & Garden', 'Unknown')", "('Legal & Financial', 'False')", "('Lega

There is a small difference in favor of having outdoor seating. However, records with a missing value have a much higher satisfaction rate. Again, this might be caused by the way the data was collected, or the missing values may belong to businesses for which outdoor seating is not applicable. For example, if we look only at Food & Dining, almost 1/3 of the data is missing, and this part has higher satisfaction than the businesses that filled in this information. Still, we can probably recommend that restaurants consider outdoor seating, because according to the data it doesn't seem to do any harm.

### HasTV

Having a TV in a restaurant is quite common, especially in places like sports bars. It is interesting to find out whether it has a positive impact on satisfaction.

In [58]:
plots.investigate_categoric_variable(df, response, ['HasTV'], ["Unknown"])

This variable contains these levels:  ['False', 'True', 'Unknown']
Number of observation by level of this variable:  [10911, 34154, 105281]
Mean value of response by level of this variable: [0.2721107139583906, 0.15312994085612228, 0.33518868551780473]


In [21]:
fig = plots.investigate_categoric_variable(df, response, ['categories_classified', 'HasTV'], ["Unknown"])
fig.update_layout(height=800)

This variable contains these levels: ["('Automotive', 'False')", "('Automotive', 'True')", "('Automotive', 'Unknown')", "('Business Support & Supplies', 'False')", "('Business Support & Supplies', 'True')", "('Business Support & Supplies', 'Unknown')", "('Computers & Electronics', 'False')", "('Computers & Electronics', 'True')", "('Computers & Electronics', 'Unknown')", "('Construction & Contractors', 'False')", "('Construction & Contractors', 'True')", "('Construction & Contractors', 'Unknown')", "('Education', 'False')", "('Education', 'True')", "('Education', 'Unknown')", "('Entertainment', 'False')", "('Entertainment', 'True')", "('Entertainment', 'Unknown')", "('Food & Dining', 'False')", "('Food & Dining', 'True')", "('Food & Dining', 'Unknown')", "('Health & Medicine', 'False')", "('Health & Medicine', 'True')", "('Health & Medicine', 'Unknown')", "('Home & Garden', 'False')", "('Home & Garden', 'True')", "('Home & Garden', 'Unknown')", "('Legal & Financial', 'False')", "('Lega

It seems that it could have some impact: restaurants without a TV have roughly average satisfaction, but restaurants with a TV have a much lower satisfaction rate.

### DogsAllowed

In [59]:
plots.investigate_categoric_variable(df, response, ['DogsAllowed'], ["Unknown"])

This variable contains these levels:  ['False', 'True', 'Unknown']
Number of observation by level of this variable:  [12267, 5991, 132088]
Mean value of response by level of this variable: [0.2990136137604956, 0.4560173593723919, 0.2807825086306099]


In [22]:
fig = plots.investigate_categoric_variable(df, response, ['categories_classified', 'DogsAllowed'], ["Unknown"])
fig.update_layout(height=800)

This variable contains these levels: ["('Automotive', 'False')", "('Automotive', 'True')", "('Automotive', 'Unknown')", "('Business Support & Supplies', 'False')", "('Business Support & Supplies', 'True')", "('Business Support & Supplies', 'Unknown')", "('Computers & Electronics', 'False')", "('Computers & Electronics', 'True')", "('Computers & Electronics', 'Unknown')", "('Construction & Contractors', 'False')", "('Construction & Contractors', 'True')", "('Construction & Contractors', 'Unknown')", "('Education', 'False')", "('Education', 'True')", "('Education', 'Unknown')", "('Entertainment', 'False')", "('Entertainment', 'True')", "('Entertainment', 'Unknown')", "('Food & Dining', 'False')", "('Food & Dining', 'True')", "('Food & Dining', 'Unknown')", "('Health & Medicine', 'False')", "('Health & Medicine', 'True')", "('Health & Medicine', 'Unknown')", "('Home & Garden', 'False')", "('Home & Garden', 'True')", "('Home & Garden', 'Unknown')", "('Legal & Financial', 'False')", "('Lega

It seems that welcoming man's best friend pays off with higher satisfaction among dog owners. So being a dog-friendly restaurant might be a good suggestion.

### Alcohol

In [60]:
plots.investigate_categoric_variable(df, response, ['Alcohol'], ["Unknown"])

This variable contains these levels:  ['Unknown', 'beer_and_wine', 'full_bar']
Number of observation by level of this variable:  [128105, 6249, 15992]
Mean value of response by level of this variable: [0.31183013933882364, 0.25636101776284204, 0.12124812406203102]


In [23]:
fig = plots.investigate_categoric_variable(df, response, ['categories_classified', 'Alcohol'], ["Unknown"])
fig.update_layout(height=800)

This variable contains these levels: ["('Automotive', 'Unknown')", "('Automotive', 'beer_and_wine')", "('Automotive', 'full_bar')", "('Business Support & Supplies', 'Unknown')", "('Business Support & Supplies', 'beer_and_wine')", "('Business Support & Supplies', 'full_bar')", "('Computers & Electronics', 'Unknown')", "('Computers & Electronics', 'beer_and_wine')", "('Computers & Electronics', 'full_bar')", "('Construction & Contractors', 'Unknown')", "('Education', 'Unknown')", "('Education', 'beer_and_wine')", "('Education', 'full_bar')", "('Entertainment', 'Unknown')", "('Entertainment', 'beer_and_wine')", "('Entertainment', 'full_bar')", "('Food & Dining', 'Unknown')", "('Food & Dining', 'beer_and_wine')", "('Food & Dining', 'full_bar')", "('Health & Medicine', 'Unknown')", "('Health & Medicine', 'beer_and_wine')", "('Home & Garden', 'Unknown')", "('Home & Garden', 'beer_and_wine')", "('Home & Garden', 'full_bar')", "('Legal & Financial', 'Unknown')", "('Legal & Financial', 'full_ba

The data suggest that having a full bar is less popular than serving beer and wine only. But this is probably connected with the type of restaurant.

### RestaurantsAttire

In [61]:
plots.investigate_categoric_variable(df, response, ['RestaurantsAttire'], ["Unknown"])

This variable contains these levels:  ['Unknown', 'casual', 'dressy', 'formal']
Number of observation by level of this variable:  [111129, 38344, 803, 70]
Mean value of response by level of this variable: [0.3409911004328303, 0.142264761109952, 0.16811955168119552, 0.05714285714285714]


In [24]:
fig = plots.investigate_categoric_variable(df, response, ['categories_classified', 'RestaurantsAttire'], ["Unknown"])
fig.update_layout(height=800)

This variable contains these levels: ["('Automotive', 'Unknown')", "('Automotive', 'casual')", "('Automotive', 'dressy')", "('Automotive', 'formal')", "('Business Support & Supplies', 'Unknown')", "('Business Support & Supplies', 'casual')", "('Computers & Electronics', 'Unknown')", "('Computers & Electronics', 'casual')", "('Computers & Electronics', 'dressy')", "('Computers & Electronics', 'formal')", "('Construction & Contractors', 'Unknown')", "('Construction & Contractors', 'casual')", "('Education', 'Unknown')", "('Education', 'casual')", "('Entertainment', 'Unknown')", "('Entertainment', 'casual')", "('Entertainment', 'dressy')", "('Entertainment', 'formal')", "('Food & Dining', 'Unknown')", "('Food & Dining', 'casual')", "('Food & Dining', 'dressy')", "('Food & Dining', 'formal')", "('Health & Medicine', 'Unknown')", "('Health & Medicine', 'casual')", "('Home & Garden', 'Unknown')", "('Home & Garden', 'casual')", "('Legal & Financial', 'Unknown')", "('Legal & Financial', 'casua

We can't tell whether there is a difference between casual, dressy, and formal restaurants. Again, a lot of restaurants have a missing value, and these seem to have higher satisfaction rates.
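
Part of the reason it is hard to compare these levels is the very different group sizes (70 formal versus 38344 casual businesses). A rough way to express that uncertainty is a confidence interval for each rate. The sketch below uses the Wilson score interval, which is our choice for illustration and not something used elsewhere in this notebook; the counts and rates are copied from the output above, and the number of satisfied businesses is back-calculated by rounding.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# (number of observations, mean response) from the RestaurantsAttire output.
levels = {
    "casual": (38344, 0.142264761109952),
    "dressy": (803, 0.16811955168119552),
    "formal": (70, 0.05714285714285714),
}

intervals = {}
for name, (n, rate) in levels.items():
    successes = round(rate * n)  # back-calculated count of satisfied businesses
    intervals[name] = wilson_interval(successes, n)
    lo, hi = intervals[name]
    print(f"{name}: rate={rate:.3f}, interval=({lo:.3f}, {hi:.3f}), n={n}")
```

The interval for 'formal' is much wider than the one for 'casual', which reflects how little the 70 formal restaurants can tell us on their own.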

### RestaurantsTableService

In [62]:
plots.investigate_categoric_variable(df, response, ['RestaurantsTableService'], ["Unknown"])

This variable contains these levels:  ['False', 'True', 'Unknown']
Number of observation by level of this variable:  [7293, 12674, 130379]
Mean value of response by level of this variable: [0.2971342383107089, 0.22802587975382674, 0.2947637272873699]


In [25]:
fig = plots.investigate_categoric_variable(df, response, ['categories_classified', 'RestaurantsTableService'], ["Unknown"])
fig.update_layout(height=800)

This variable contains these levels: ["('Automotive', 'False')", "('Automotive', 'True')", "('Automotive', 'Unknown')", "('Business Support & Supplies', 'False')", "('Business Support & Supplies', 'True')", "('Business Support & Supplies', 'Unknown')", "('Computers & Electronics', 'False')", "('Computers & Electronics', 'True')", "('Computers & Electronics', 'Unknown')", "('Construction & Contractors', 'False')", "('Construction & Contractors', 'Unknown')", "('Education', 'False')", "('Education', 'True')", "('Education', 'Unknown')", "('Entertainment', 'False')", "('Entertainment', 'True')", "('Entertainment', 'Unknown')", "('Food & Dining', 'False')", "('Food & Dining', 'True')", "('Food & Dining', 'Unknown')", "('Health & Medicine', 'False')", "('Health & Medicine', 'Unknown')", "('Home & Garden', 'False')", "('Home & Garden', 'True')", "('Home & Garden', 'Unknown')", "('Legal & Financial', 'True')", "('Legal & Financial', 'Unknown')", "('Manufacturing, Wholesale, Distribution', 'Fa

It seems that not having table service is slightly better for restaurants, but most of the data is missing.

### RestaurantsGoodForGroups

In [63]:
plots.investigate_categoric_variable(df, response, ['RestaurantsGoodForGroups'], ["Unknown"])

This variable contains these levels:  ['False', 'True', 'Unknown']
Number of observation by level of this variable:  [6001, 38148, 106197]
Mean value of response by level of this variable: [0.19046825529078487, 0.1511743734927126, 0.3444353418646478]


In [26]:
fig = plots.investigate_categoric_variable(df, response, ['categories_classified', 'RestaurantsGoodForGroups'], ["Unknown"])
fig.update_layout(height=800)

This variable contains these levels: ["('Automotive', 'False')", "('Automotive', 'True')", "('Automotive', 'Unknown')", "('Business Support & Supplies', 'False')", "('Business Support & Supplies', 'True')", "('Business Support & Supplies', 'Unknown')", "('Computers & Electronics', 'False')", "('Computers & Electronics', 'True')", "('Computers & Electronics', 'Unknown')", "('Construction & Contractors', 'False')", "('Construction & Contractors', 'True')", "('Construction & Contractors', 'Unknown')", "('Education', 'False')", "('Education', 'True')", "('Education', 'Unknown')", "('Entertainment', 'False')", "('Entertainment', 'True')", "('Entertainment', 'Unknown')", "('Food & Dining', 'False')", "('Food & Dining', 'True')", "('Food & Dining', 'Unknown')", "('Health & Medicine', 'False')", "('Health & Medicine', 'True')", "('Health & Medicine', 'Unknown')", "('Home & Garden', 'False')", "('Home & Garden', 'True')", "('Home & Garden', 'Unknown')", "('Legal & Financial', 'False')", "('Lega

The data suggest that restaurants which aren't good for groups have slightly higher satisfaction, but a lot of data is missing, and the missing group has a high satisfaction rate.

### DriveThru

In [64]:
plots.investigate_categoric_variable(df, response, ['DriveThru'], ["Unknown"])

This variable contains these levels:  ['False', 'True', 'Unknown']
Number of observation by level of this variable:  [2631, 4374, 143341]
Mean value of response by level of this variable: [0.20752565564424175, 0.045496113397347965, 0.2981910269915795]


In [27]:
fig = plots.investigate_categoric_variable(df, response, ['categories_classified', 'DriveThru'], ["Unknown"])
fig.update_layout(height=800)

This variable contains these levels: ["('Automotive', 'False')", "('Automotive', 'True')", "('Automotive', 'Unknown')", "('Business Support & Supplies', 'False')", "('Business Support & Supplies', 'True')", "('Business Support & Supplies', 'Unknown')", "('Computers & Electronics', 'False')", "('Computers & Electronics', 'True')", "('Computers & Electronics', 'Unknown')", "('Construction & Contractors', 'False')", "('Construction & Contractors', 'True')", "('Construction & Contractors', 'Unknown')", "('Education', 'False')", "('Education', 'True')", "('Education', 'Unknown')", "('Entertainment', 'False')", "('Entertainment', 'True')", "('Entertainment', 'Unknown')", "('Food & Dining', 'False')", "('Food & Dining', 'True')", "('Food & Dining', 'Unknown')", "('Health & Medicine', 'False')", "('Health & Medicine', 'True')", "('Health & Medicine', 'Unknown')", "('Home & Garden', 'False')", "('Home & Garden', 'True')", "('Home & Garden', 'Unknown')", "('Legal & Financial', 'True')", "('Legal

Having a drive-thru seems to lower customer satisfaction, but this might be because drive-thrus are mostly found at fast food chains like McDonald's. We cannot expect high satisfaction from fast food chains.

### BusinessAcceptsBitcoin

In [65]:
plots.investigate_categoric_variable(df, response, ['BusinessAcceptsBitcoin'], ["Unknown"])

This variable contains these levels:  ['False', 'True', 'Unknown']
Number of observation by level of this variable:  [16957, 470, 132919]
Mean value of response by level of this variable: [0.4169369581883588, 0.5404255319148936, 0.2720754745371241]


In [29]:
fig = plots.investigate_categoric_variable(df, response, ['categories_classified', 'BusinessAcceptsBitcoin'], ["Unknown"])
fig.update_layout(height=800)

This variable contains these levels: ["('Automotive', 'False')", "('Automotive', 'True')", "('Automotive', 'Unknown')", "('Business Support & Supplies', 'False')", "('Business Support & Supplies', 'True')", "('Business Support & Supplies', 'Unknown')", "('Computers & Electronics', 'False')", "('Computers & Electronics', 'True')", "('Computers & Electronics', 'Unknown')", "('Construction & Contractors', 'False')", "('Construction & Contractors', 'True')", "('Construction & Contractors', 'Unknown')", "('Education', 'False')", "('Education', 'True')", "('Education', 'Unknown')", "('Entertainment', 'False')", "('Entertainment', 'True')", "('Entertainment', 'Unknown')", "('Food & Dining', 'False')", "('Food & Dining', 'True')", "('Food & Dining', 'Unknown')", "('Health & Medicine', 'False')", "('Health & Medicine', 'True')", "('Health & Medicine', 'Unknown')", "('Home & Garden', 'False')", "('Home & Garden', 'True')", "('Home & Garden', 'Unknown')", "('Legal & Financial', 'False')", "('Lega

Cryptocurrencies are very new, and only a small number of businesses (470 in this dataset) accept them as payment. This may become a useful feature in the future, but for now we don't expect customers to care enough about paying with crypto rather than regular currency for it to influence their ratings.
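Because so few businesses report accepting Bitcoin, the satisfaction rate for that level is estimated much less precisely than for the others. A minimal sketch of how this uncertainty could be gauged, using the counts and means printed above and a normal-approximation interval for a proportion. The choice of interval is an assumption made here for illustration and is not something the notebook itself uses; `proportion_interval` is a hypothetical helper.

```python
import math

def proportion_interval(p_hat, n, z=1.96):
    """Normal-approximation interval for a proportion (z=1.96 is roughly 95%)."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Mean responses and counts copied from the printed output above
levels = {
    "False": (0.4169369581883588, 16957),
    "True": (0.5404255319148936, 470),
}

for level, (rate, count) in levels.items():
    low, high = proportion_interval(rate, count)
    print(f"{level}: {rate:.3f} (approx. 95% interval {low:.3f} to {high:.3f}, n={count})")
```

The interval for the True level comes out several times wider than for False. Note that this only describes sampling noise; it says nothing about whether the difference is driven by the mix of business categories.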

### AcceptsInsurance

In [66]:
plots.investigate_categoric_variable(df, response, ['AcceptsInsurance'], ["Unknown"])

This variable contains these levels:  ['False', 'True', 'Unknown']
Number of observation by level of this variable:  [1767, 3938, 144641]
Mean value of response by level of this variable: [0.6061120543293718, 0.3430675469781615, 0.28391673177038323]


In [30]:
fig = plots.investigate_categoric_variable(df, response, ['categories_classified', 'AcceptsInsurance'], ["Unknown"])
fig.update_layout(height=800)

This variable contains these levels: ["('Automotive', 'True')", "('Automotive', 'Unknown')", "('Business Support & Supplies', 'False')", "('Business Support & Supplies', 'Unknown')", "('Computers & Electronics', 'Unknown')", "('Construction & Contractors', 'False')", "('Construction & Contractors', 'True')", "('Construction & Contractors', 'Unknown')", "('Education', 'False')", "('Education', 'True')", "('Education', 'Unknown')", "('Entertainment', 'False')", "('Entertainment', 'True')", "('Entertainment', 'Unknown')", "('Food & Dining', 'False')", "('Food & Dining', 'True')", "('Food & Dining', 'Unknown')", "('Health & Medicine', 'False')", "('Health & Medicine', 'True')", "('Health & Medicine', 'Unknown')", "('Home & Garden', 'Unknown')", "('Legal & Financial', 'False')", "('Legal & Financial', 'True')", "('Legal & Financial', 'Unknown')", "('Manufacturing, Wholesale, Distribution', 'False')", "('Manufacturing, Wholesale, Distribution', 'True')", "('Manufacturing, Wholesale, Distribu

This factor is relevant only for the Health & Medicine category.
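A quick way to see which categories actually report an attribute is to count, per category, how many businesses have a value other than "Unknown". The sketch below does this with `pd.crosstab` on a small made-up frame; the rows are illustrative only and are not taken from the Yelp data.

```python
import pandas as pd

# Toy rows for illustration only; the values are NOT from the Yelp dataset.
toy = pd.DataFrame({
    "categories_classified": (["Health & Medicine"] * 4
                              + ["Food & Dining"] * 3
                              + ["Automotive"] * 3),
    "AcceptsInsurance": ["True", "False", "True", "Unknown",
                         "Unknown", "Unknown", "Unknown",
                         "Unknown", "True", "Unknown"],
})

# Rows are categories, columns are attribute levels, cells are business counts
counts = pd.crosstab(toy["categories_classified"], toy["AcceptsInsurance"])

# Share of businesses in each category with a known (non-"Unknown") value
known_share = 1 - counts["Unknown"] / counts.sum(axis=1)

print(known_share.sort_values(ascending=False))
```

Applied to `df`, a category where this share is close to zero carries almost no information about the attribute.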

### BYOB

In [31]:
plots.investigate_categoric_variable(df, response, ['BYOB'], ["Unknown"])

This variable contains these levels: ['False', 'True', 'Unknown']
Number of observation by level of this variable: [3437, 1002, 145907]
Mean value of response by level of this variable: [0.3552516729706139, 0.46407185628742514, 0.28649756351648653]
Mean fitted value by level of this variable: None


In [32]:
fig = plots.investigate_categoric_variable(df, response, ['categories_classified', 'BYOB'], ["Unknown"])
fig.update_layout(height=800)

This variable contains these levels: ["('Automotive', 'False')", "('Automotive', 'True')", "('Automotive', 'Unknown')", "('Business Support & Supplies', 'False')", "('Business Support & Supplies', 'True')", "('Business Support & Supplies', 'Unknown')", "('Computers & Electronics', 'False')", "('Computers & Electronics', 'True')", "('Computers & Electronics', 'Unknown')", "('Construction & Contractors', 'Unknown')", "('Education', 'False')", "('Education', 'True')", "('Education', 'Unknown')", "('Entertainment', 'False')", "('Entertainment', 'True')", "('Entertainment', 'Unknown')", "('Food & Dining', 'False')", "('Food & Dining', 'True')", "('Food & Dining', 'Unknown')", "('Health & Medicine', 'Unknown')", "('Home & Garden', 'False')", "('Home & Garden', 'True')", "('Home & Garden', 'Unknown')", "('Legal & Financial', 'True')", "('Legal & Financial', 'Unknown')", "('Manufacturing, Wholesale, Distribution', 'Unknown')", "('Merchants (Retail)', 'False')", "('Merchants (Retail)', 'True')"

Bring your own bottle (BYOB) is a restaurant concept where customers are allowed to bring their own alcohol. The volume of data isn't big, but it seems that BYOB restaurants have higher satisfaction. However, when considering whether to allow BYOB, a business owner also has to take into account the lost income from selling alcoholic beverages.

### BusinessParking_street

In [68]:
plots.investigate_categoric_variable(df, response, ['BusinessParking_street'], ["Unknown"])

This variable contains these levels:  ['False', 'True', 'Unknown']
Number of observation by level of this variable:  [62448, 23026, 64872]
Mean value of response by level of this variable: [0.22916666666666666, 0.3695387822461565, 0.31859662103835246]


In [33]:
fig = plots.investigate_categoric_variable(df, response, ['categories_classified', 'BusinessParking_street'], ["Unknown"])
fig.update_layout(height=800)

This variable contains these levels: ["('Automotive', 'False')", "('Automotive', 'True')", "('Automotive', 'Unknown')", "('Business Support & Supplies', 'False')", "('Business Support & Supplies', 'True')", "('Business Support & Supplies', 'Unknown')", "('Computers & Electronics', 'False')", "('Computers & Electronics', 'True')", "('Computers & Electronics', 'Unknown')", "('Construction & Contractors', 'False')", "('Construction & Contractors', 'True')", "('Construction & Contractors', 'Unknown')", "('Education', 'False')", "('Education', 'True')", "('Education', 'Unknown')", "('Entertainment', 'False')", "('Entertainment', 'True')", "('Entertainment', 'Unknown')", "('Food & Dining', 'False')", "('Food & Dining', 'True')", "('Food & Dining', 'Unknown')", "('Health & Medicine', 'False')", "('Health & Medicine', 'True')", "('Health & Medicine', 'Unknown')", "('Home & Garden', 'False')", "('Home & Garden', 'True')", "('Home & Garden', 'Unknown')", "('Legal & Financial', 'False')", "('Lega

The data suggest that businesses with street parking have higher customer satisfaction (about 37% satisfied versus about 23% without it). We recommend reserving parking for guests.

### HairSpecializesIn

In [69]:
plots.investigate_categoric_variable(df, response, ['HairSpecializesIn_africanamerican'], ["Unknown"])

This variable contains these levels:  ['False', 'True', 'Unknown']
Number of observation by level of this variable:  [698, 304, 149344]
Mean value of response by level of this variable: [0.6475644699140402, 0.5197368421052632, 0.2871089565031069]


This factor isn't relevant for restaurants.

### categories_classified

In [75]:
plots.investigate_categoric_variable(df, response, ['categories_classified'], ["Unknown"])

This variable contains these levels:  ['Automotive', 'Business Support & Supplies', 'Computers & Electronics', 'Construction & Contractors', 'Education', 'Entertainment', 'Food & Dining', 'Health & Medicine', 'Home & Garden', 'Legal & Financial', 'Manufacturing, Wholesale, Distribution', 'Merchants (Retail)', 'Miscellaneous', 'Personal Care & Services', 'Real Estate', 'Travel & Transportation']
Number of observation by level of this variable:  [10923, 465, 3920, 6843, 2214, 21862, 45552, 10835, 2911, 2614, 6825, 3726, 4171, 17023, 5997, 4465]
Mean value of response by level of this variable: [0.2877414629680491, 0.2709677419354839, 0.21709183673469387, 0.422475522431682, 0.2804878048780488, 0.27563809349556306, 0.22914471373375483, 0.3596677434240886, 0.34627275850223294, 0.2796480489671002, 0.2946520146520146, 0.2665056360708535, 0.34140493886358186, 0.3900605063737297, 0.263965315991329, 0.24748040313549832]


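We can clearly see differences between business categories: customers rate each category differently, with the share of satisfied businesses ranging from about 22% (Computers & Electronics) to about 42% (Construction & Contractors).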

### WiFi

In [71]:
plots.investigate_categoric_variable(df, response, ['WiFi'], ["Unknown"])

This variable contains these levels:  ['Unknown', 'free', 'no', 'paid']
Number of observation by level of this variable:  [93482, 34414, 21831, 619]
Mean value of response by level of this variable: [0.30773838813889304, 0.27814261637705584, 0.22816178828271724, 0.26978998384491115]


In [35]:
fig = plots.investigate_categoric_variable(df, response, ['categories_classified', 'WiFi'], ["Unknown"])
fig.update_layout(height=800)

This variable contains these levels: ["('Automotive', 'Unknown')", "('Automotive', 'free')", "('Automotive', 'no')", "('Automotive', 'paid')", "('Business Support & Supplies', 'Unknown')", "('Business Support & Supplies', 'free')", "('Business Support & Supplies', 'no')", "('Business Support & Supplies', 'paid')", "('Computers & Electronics', 'Unknown')", "('Computers & Electronics', 'free')", "('Computers & Electronics', 'no')", "('Computers & Electronics', 'paid')", "('Construction & Contractors', 'Unknown')", "('Construction & Contractors', 'free')", "('Construction & Contractors', 'no')", "('Construction & Contractors', 'paid')", "('Education', 'Unknown')", "('Education', 'free')", "('Education', 'no')", "('Education', 'paid')", "('Entertainment', 'Unknown')", "('Entertainment', 'free')", "('Entertainment', 'no')", "('Entertainment', 'paid')", "('Food & Dining', 'Unknown')", "('Food & Dining', 'free')", "('Food & Dining', 'no')", "('Food & Dining', 'paid')", "('Health & Medicine', 

Having WiFi is convenient for customers, and businesses with free WiFi have slightly higher satisfaction than those without it. WiFi isn't very expensive, so it might be a good idea to offer it in a restaurant.

### RestaurantsPriceRange2

In [37]:
# fig = plots.investigate_categoric_variable(df, 'stars', ['RestaurantsPriceRange2'], ["Unknown"])
# fig = plots.investigate_categoric_variable(df, 'unsatisfied', ['RestaurantsPriceRange2'], ["Unknown"])
fig = plots.investigate_categoric_variable(df, 'satisfied', ['RestaurantsPriceRange2'], ["Unknown"])
fig.show()

This variable contains these levels: [0.0, 1.0, 2.0, 3.0, 4.0]
Number of observation by level of this variable: [65066, 28840, 48581, 6667, 1192]
Mean value of response by level of this variable: [0.37275996680293855, 0.2058252427184466, 0.2386118029682386, 0.22258887055647217, 0.18624161073825504]
Mean fitted value by level of this variable: None


In [36]:
fig = plots.investigate_categoric_variable(df, response, ['categories_classified', 'RestaurantsPriceRange2'], ["Unknown"])
fig.update_layout(height=800)

This variable contains these levels: ["('Automotive', 0.0)", "('Automotive', 1.0)", "('Automotive', 2.0)", "('Automotive', 3.0)", "('Automotive', 4.0)", "('Business Support & Supplies', 0.0)", "('Business Support & Supplies', 1.0)", "('Business Support & Supplies', 2.0)", "('Business Support & Supplies', 3.0)", "('Business Support & Supplies', 4.0)", "('Computers & Electronics', 0.0)", "('Computers & Electronics', 1.0)", "('Computers & Electronics', 2.0)", "('Computers & Electronics', 3.0)", "('Computers & Electronics', 4.0)", "('Construction & Contractors', 0.0)", "('Construction & Contractors', 1.0)", "('Construction & Contractors', 2.0)", "('Construction & Contractors', 3.0)", "('Construction & Contractors', 4.0)", "('Education', 0.0)", "('Education', 1.0)", "('Education', 2.0)", "('Education', 3.0)", "('Education', 4.0)", "('Entertainment', 0.0)", "('Entertainment', 1.0)", "('Entertainment', 2.0)", "('Entertainment', 3.0)", "('Entertainment', 4.0)", "('Food & Dining', 0.0)", "('Foo

Although price is often assumed to matter the most, we can see that price range has little impact on customers' ratings of restaurants. Level 0 represents missing data, and it seems that these businesses have better average ratings.
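To see how much the missing level shifts the overall picture, the per-level figures printed above can be recombined into weighted means with and without level 0. This is a small sketch using only the printed counts and means; `weighted_mean` is a hypothetical helper introduced for illustration.

```python
# Counts and mean responses copied from the single-variable output above
levels = [0.0, 1.0, 2.0, 3.0, 4.0]
counts = [65066, 28840, 48581, 6667, 1192]
means = [0.37275996680293855, 0.2058252427184466, 0.2386118029682386,
         0.22258887055647217, 0.18624161073825504]

def weighted_mean(values, weights):
    """Mean of `values` weighted by `weights`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# All levels together; this should reproduce the overall mean response
overall = weighted_mean(means, counts)

# Levels 1-4 only, i.e. businesses with a known price range
known = weighted_mean(means[1:], counts[1:])

print(f"All levels:       {overall:.4f}")
print(f"Known price only: {known:.4f}")
```

With these figures, businesses with a known price range sit at roughly 0.23, well below the 0.37 of level 0, so a noticeable part of the overall satisfaction rate comes from businesses without price information.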

## Correlation Analysis

In [43]:
# %%capture --no-display

exclude = ['name', 'latitude', 'longitude']
numeric = ['stars', 'review_count', 'is_open', 'RestaurantsPriceRange2_Unknown',
           'Monday_time', 'Tuesday_time', 'Wednesday_time', 'Thursday_time', 'Friday_time', 
           'Saturday_time', 'Sunday_time', 'avg_time_open_week', 'avg_time_open_weekend']
ordered = ['RestaurantsPriceRange2']
unordered = [col for col in list(df.columns) if col not in exclude + ordered + numeric]

df_corr = df.drop(exclude, axis=1)
complete_correlation = associations(df_corr, compute_only=True)
correlation_matrix = complete_correlation['corr']

def magnify():
    return [dict(selector="th",
                 props=[("font-size", "8pt")]),
            dict(selector="td",
                 props=[('padding', "0em 0em")]),
            dict(selector="th:hover",
                 props=[("font-size", "12pt")]),
            dict(selector="tr:hover td:hover",
                 props=[('max-width', '200px'),
                        ('font-size', '12pt')]),
            dict(selector='thead th',
                 props=[('position', 'sticky'),
                        ('top', '0'),
                        ('background-color', 'black')]),
            dict(selector='tbody th',
                 props=[('position', 'sticky'),
                        ('left', '0'),
                        ('background-color', 'black')])
]

     

# Puts the scrollbar next to the DataFrame
display(
    HTML(
        "<div style='height: 600px; overflow: auto; width: 1400px;'>" +
        correlation_matrix.style.background_gradient(cmap='coolwarm').format(precision=2).set_properties(**{'max-width': '10px', 'font-size': '0pt'}).set_caption("Hover to magnify").set_table_styles(magnify()).to_html() +
        "</div>"
    )
)
    

Unnamed: 0,city,state,stars,review_count,is_open,satisfied,unsatisfied,ByAppointmentOnly,BusinessAcceptsCreditCards,BikeParking,RestaurantsPriceRange2,CoatCheck,RestaurantsTakeOut,RestaurantsDelivery,Caters,WiFi,WheelchairAccessible,HappyHour,OutdoorSeating,HasTV,RestaurantsReservations,DogsAllowed,Alcohol,GoodForKids,RestaurantsAttire,RestaurantsTableService,RestaurantsGoodForGroups,DriveThru,NoiseLevel,BusinessAcceptsBitcoin,Smoking,GoodForDancing,AcceptsInsurance,BYOB,Corkage,BYOBCorkage,Open24Hours,RestaurantsCounterService,AgesAllowed,BusinessParking_garage,BusinessParking_street,BusinessParking_validated,BusinessParking_lot,BusinessParking_valet,Ambience_romantic,Ambience_intimate,Ambience_touristy,Ambience_hipster,Ambience_divey,Ambience_classy,Ambience_trendy,Ambience_upscale,Ambience_casual,GoodForMeal_dessert,GoodForMeal_latenight,GoodForMeal_lunch,GoodForMeal_dinner,GoodForMeal_brunch,GoodForMeal_breakfast,Music_dj,Music_background_music,Music_no_music,Music_jukebox,Music_live,Music_video,Music_karaoke,BestNights_monday,BestNights_tuesday,BestNights_friday,BestNights_wednesday,BestNights_thursday,BestNights_sunday,BestNights_saturday,HairSpecializesIn_straightperms,HairSpecializesIn_coloring,HairSpecializesIn_extensions,HairSpecializesIn_africanamerican,HairSpecializesIn_curly,HairSpecializesIn_kids,HairSpecializesIn_perms,HairSpecializesIn_asian,DietaryRestrictions_dairy-free,DietaryRestrictions_gluten-free,DietaryRestrictions_vegan,DietaryRestrictions_kosher,DietaryRestrictions_halal,DietaryRestrictions_soy-free,DietaryRestrictions_vegetarian,RestaurantsPriceRange2_Unknown,avg_time_open_week,avg_time_open_weekend,avg_time_open_week_Unknown,avg_time_open_weekend_Unknown,categories_classified,review_count_binned,avg_time_open_week_binned,avg_time_open_weekend_binned
city,1.0,0.92,0.2,0.16,0.14,0.18,0.17,0.1,0.27,0.12,0.17,0.04,0.1,0.12,0.08,0.05,0.08,0.08,0.14,0.1,0.11,0.09,0.13,0.11,0.11,0.06,0.1,0.07,0.05,0.08,0.03,0.0,0.03,0.06,0.05,0.0,0.0,0.0,0.0,0.11,0.26,0.08,0.2,0.09,0.07,0.09,0.08,0.07,0.08,0.09,0.09,0.08,0.09,0.07,0.04,0.07,0.07,0.07,0.06,0.03,0.02,0.08,0.03,0.06,0.0,0.0,0.02,0.03,0.04,0.04,0.03,0.03,0.03,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,0.08,0.09,0.1,0.03,0.07,0.1,0.16,0.16,0.15,0.13,0.14,0.07,0.05,0.07,0.07
state,0.92,1.0,0.11,0.1,0.06,0.11,0.07,0.07,0.27,0.08,0.11,0.04,0.08,0.09,0.05,0.04,0.05,0.05,0.09,0.07,0.09,0.07,0.12,0.07,0.05,0.04,0.07,0.05,0.04,0.06,0.04,0.03,0.04,0.04,0.04,0.04,0.0,0.01,0.01,0.07,0.15,0.06,0.12,0.07,0.06,0.06,0.07,0.06,0.07,0.06,0.06,0.06,0.07,0.05,0.04,0.05,0.05,0.05,0.05,0.03,0.03,0.05,0.04,0.04,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.02,0.01,0.01,0.02,0.01,0.01,0.01,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.05,0.04,0.05,0.09,0.04,0.04,0.03,0.04
stars,0.2,0.11,1.0,0.06,0.04,0.71,-0.79,0.14,0.07,0.13,-0.03,0.01,0.06,0.1,0.06,0.05,0.19,0.06,0.09,0.07,0.09,0.13,0.05,0.01,0.09,0.07,0.08,0.18,0.07,0.14,0.04,0.02,0.07,0.08,0.06,0.01,0.0,0.0,0.02,0.07,0.16,0.07,0.06,0.06,0.05,0.07,0.04,0.07,0.05,0.09,0.07,0.05,0.08,0.01,0.02,0.04,0.04,0.03,0.01,0.03,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.06,0.05,0.06,0.06,0.05,0.06,0.06,0.06,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.06,-0.22,-0.21,-0.11,-0.04,0.14,0.08,0.24,0.23
review_count,0.16,0.1,0.06,1.0,0.03,-0.02,-0.09,0.07,0.12,0.22,0.2,0.19,0.23,0.22,0.32,0.28,0.17,0.36,0.28,0.3,0.31,0.34,0.33,0.26,0.31,0.34,0.3,0.14,0.37,0.11,0.19,0.23,0.04,0.19,0.21,0.21,0.03,0.04,0.02,0.2,0.25,0.19,0.18,0.22,0.29,0.28,0.3,0.29,0.29,0.41,0.31,0.3,0.36,0.26,0.3,0.36,0.43,0.3,0.33,0.2,0.2,0.2,0.19,0.2,0.2,0.18,0.23,0.23,0.24,0.23,0.23,0.23,0.24,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.02,0.02,0.02,0.02,0.02,0.02,0.02,-0.19,-0.02,0.03,-0.1,-0.14,0.22,0.92,0.09,0.14
is_open,0.14,0.06,0.04,0.03,1.0,0.09,0.01,0.15,0.09,0.05,-0.2,0.0,0.18,0.27,0.13,0.11,0.05,0.07,0.25,0.23,0.23,0.05,0.18,0.22,0.25,0.05,0.25,0.07,0.22,0.03,0.01,0.03,0.08,0.05,0.03,0.06,0.01,0.0,0.01,0.2,0.21,0.2,0.22,0.19,0.23,0.23,0.23,0.23,0.23,0.25,0.24,0.22,0.23,0.09,0.08,0.08,0.08,0.08,0.08,0.02,0.05,0.01,0.01,0.01,0.01,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.23,0.1,0.04,-0.11,-0.02,0.22,0.04,0.13,0.1
satisfied,0.18,0.11,0.71,-0.02,0.09,1.0,-0.32,0.17,0.06,0.07,-0.13,0.04,0.15,0.16,0.1,0.06,0.14,0.08,0.16,0.17,0.17,0.08,0.13,0.12,0.19,0.04,0.19,0.1,0.16,0.11,0.02,0.04,0.08,0.04,0.02,0.05,0.0,0.0,0.01,0.06,0.12,0.06,0.06,0.07,0.16,0.16,0.15,0.16,0.15,0.16,0.16,0.16,0.15,0.1,0.11,0.1,0.11,0.11,0.11,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.06,0.06,0.06,0.06,0.06,0.06,0.06,0.06,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.16,-0.16,-0.17,-0.06,0.02,0.14,0.08,0.19,0.21
unsatisfied,0.17,0.07,-0.79,-0.09,0.01,-0.32,1.0,0.06,0.07,0.15,-0.06,0.05,0.05,0.07,0.1,0.1,0.15,0.1,0.1,0.08,0.1,0.13,0.11,0.07,0.04,0.12,0.06,0.17,0.1,0.11,0.07,0.06,0.04,0.08,0.07,0.03,0.0,0.0,0.01,0.13,0.18,0.13,0.13,0.14,0.06,0.06,0.06,0.06,0.06,0.09,0.07,0.07,0.12,0.06,0.07,0.09,0.11,0.07,0.07,0.06,0.06,0.06,0.06,0.06,0.06,0.06,0.06,0.06,0.06,0.06,0.06,0.06,0.06,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.04,0.17,0.16,0.09,0.06,0.14,0.13,0.19,0.18
ByAppointmentOnly,0.1,0.07,0.14,0.07,0.15,0.17,0.06,1.0,0.13,0.11,0.13,0.06,0.25,0.23,0.18,0.1,0.12,0.08,0.22,0.21,0.21,0.09,0.12,0.12,0.2,0.11,0.22,0.06,0.18,0.11,0.04,0.04,0.25,0.04,0.04,0.03,0.01,0.0,0.01,0.12,0.11,0.11,0.11,0.12,0.2,0.2,0.2,0.2,0.19,0.2,0.2,0.2,0.2,0.14,0.14,0.15,0.15,0.14,0.14,0.05,0.05,0.07,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.09,0.09,0.09,0.09,0.09,0.1,0.1,0.09,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.19,0.06,0.12,0.12,0.16,0.47,0.09,0.11,0.16
BusinessAcceptsCreditCards,0.27,0.27,0.07,0.12,0.09,0.06,0.07,0.13,1.0,0.25,0.35,0.06,0.17,0.15,0.18,0.21,0.15,0.11,0.17,0.16,0.17,0.1,0.11,0.13,0.16,0.11,0.17,0.06,0.16,0.13,0.06,0.06,0.05,0.06,0.05,0.04,0.0,0.0,0.01,0.24,0.24,0.24,0.24,0.24,0.14,0.14,0.14,0.14,0.16,0.14,0.13,0.14,0.15,0.11,0.11,0.12,0.12,0.11,0.12,0.05,0.05,0.07,0.05,0.05,0.05,0.05,0.06,0.06,0.06,0.06,0.06,0.06,0.06,0.03,0.03,0.03,0.03,0.03,0.02,0.03,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.41,0.04,0.04,0.2,0.19,0.25,0.19,0.09,0.11
BikeParking,0.12,0.08,0.13,0.22,0.05,0.07,0.15,0.11,0.25,1.0,0.53,0.12,0.28,0.25,0.33,0.28,0.28,0.18,0.3,0.28,0.26,0.21,0.19,0.28,0.26,0.22,0.26,0.1,0.29,0.12,0.11,0.12,0.09,0.12,0.1,0.04,0.01,0.01,0.02,0.46,0.46,0.46,0.46,0.47,0.25,0.24,0.25,0.25,0.24,0.25,0.24,0.25,0.27,0.23,0.24,0.25,0.25,0.24,0.25,0.12,0.12,0.17,0.12,0.12,0.12,0.11,0.12,0.12,0.12,0.12,0.12,0.12,0.12,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.01,0.01,0.01,0.02,0.01,0.01,0.01,0.61,0.05,0.04,0.14,0.28,0.34,0.29,0.13,0.19


Correlation insights:
1) Most of the variables are uncorrelated.
2) Some groups of variables with a very high number of missing values are correlated within the group, such as dietary restrictions, hair specializes in, best nights, music, good for meal, ambience and business parking. For each of these groups, it might be better to select only one variable for the model, if any of them turns out to be significant.
3) All opening time variables are highly correlated, so we will include only one of them in our models. 'avg_time_open_week' would probably be ideal.
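Instead of scanning the heatmap by eye, the strongly associated pairs can also be listed programmatically. The sketch below works on a small excerpt of the matrix with values copied from the table above; `strong_pairs` is a hypothetical helper and the 0.7 threshold is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Small excerpt of the association matrix, values copied from the table above
cols = ["city", "state", "stars", "review_count", "satisfied"]
corr = pd.DataFrame(
    [[1.00, 0.92, 0.20, 0.16, 0.18],
     [0.92, 1.00, 0.11, 0.10, 0.11],
     [0.20, 0.11, 1.00, 0.06, 0.71],
     [0.16, 0.10, 0.06, 1.00, -0.02],
     [0.18, 0.11, 0.71, -0.02, 1.00]],
    index=cols, columns=cols,
)

def strong_pairs(matrix, threshold=0.7):
    """List variable pairs whose absolute association is at least `threshold`."""
    # Keep only the upper triangle so each pair appears once
    # and the diagonal of ones is dropped
    upper = np.triu(np.ones(matrix.shape, dtype=bool), k=1)
    pairs = matrix.where(upper).stack().dropna()
    strong = pairs[pairs.abs() >= threshold]
    return strong.sort_values(key=lambda s: s.abs(), ascending=False)

print(strong_pairs(corr))
```

The same function could be called on the full `correlation_matrix` to get a shortlist of variable groups from which only one representative is kept in the model.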