[Back](https://keqideng.github.io/data_analysis_portfolio_project/)
# Chinese Eateries Analysis in U.S.
Date: Sept 23, 2021
Prepared by ***Keqi Deng***

>This dataset is a subset of Yelp’s businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge which is a chance for students to conduct research or analysis on Yelp’s data and share their discoveries. In the dataset you’ll find information about businesses across 11 metropolitan areas in four countries.

In [20]:
import gc # garbage collector
import numpy as np # linear algebra
from collections import Counter # for counting common words
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # visualization
plt.style.use('fivethirtyeight') # use the FiveThirtyEight plotting style
import seaborn as sns # visualization
from wordcloud import WordCloud, STOPWORDS # this module is for making wordcloud in python
import re # regular expression
import string # for finding punctuation in text
import nltk # preprocessing text
from textblob import TextBlob
# import plotly for visualization
import plotly
import plotly.offline as py # make offline
py.init_notebook_mode(connected=True)
import plotly.tools as tls
import plotly.graph_objs as go
from plotly.graph_objs import *
import plotly.figure_factory as fig_fact
# remove warnings
import warnings
warnings.filterwarnings('ignore')
# this allows plotting inside the notebook
%matplotlib inline

## Data Source
The data used in this analysis comes from the [Yelp Open Dataset](https://www.yelp.com/dataset), last updated on Jan 21, 2021.

In [21]:
yelp_df = pd.read_json('yelp_dataset/yelp_academic_dataset_business.json', lines=True)
yelp_df.head(2)

| | business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | attributes | categories | hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6iYb2HFDywm3zjuRg0shjw | Oskar Blues Taproom | 921 Pearl St | Boulder | CO | 80302 | 40.017544 | -105.283348 | 4.0 | 86 | 1 | {'RestaurantsTableService': 'True', 'WiFi': 'u... | Gastropubs, Food, Beer Gardens, Restaurants, B... | {'Monday': '11:0-23:0', 'Tuesday': '11:0-23:0'... |
| 1 | tCbdrRPZA0oiIYSmHG3J0w | Flying Elephants at PDX | 7000 NE Airport Way | Portland | OR | 97218 | 45.588906 | -122.593331 | 4.0 | 126 | 1 | {'RestaurantsTakeOut': 'True', 'RestaurantsAtt... | Salad, Soup, Sandwiches, Delis, Restaurants, C... | {'Monday': '5:0-18:0', 'Tuesday': '5:0-17:0', ... |

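The business file is stored as JSON Lines (one JSON object per line), which is why `lines=True` is passed to `pd.read_json`. If memory becomes a concern with a larger file, the same call can read in chunks. The following is a minimal sketch under stated assumptions: the records, the file name `sample_business.json`, and the chunk size are all hypothetical stand-ins, not part of the Yelp data.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical records standing in for lines of a JSON Lines file.
records = [
    {"business_id": "a1", "stars": 4.0, "categories": "Restaurants, Chinese"},
    {"business_id": "b2", "stars": 3.5, "categories": "Salad, Soup"},
    {"business_id": "c3", "stars": 2.5, "categories": None},
]

# Write one JSON object per line, the layout that lines=True expects.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample_business.json")
with open(path, "w", encoding="utf-8") as fh:
    for rec in records:
        fh.write(json.dumps(rec) + "\n")

# With chunksize set, read_json yields DataFrames of at most that many rows
# instead of loading the whole file at once.
chunks = []
for chunk in pd.read_json(path, lines=True, chunksize=2):
    chunks.append(chunk)

# Stitch the pieces back together with a fresh 0..n-1 index.
sample_df = pd.concat(chunks, ignore_index=True)
print(len(chunks), len(sample_df))
```

In practice each chunk could be filtered before it is kept, so that only the rows of interest stay in memory.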

In [22]:
# Dataset information
yelp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160585 entries, 0 to 160584
Data columns (total 14 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   business_id   160585 non-null  object 
 1   name          160585 non-null  object 
 2   address       160585 non-null  object 
 3   city          160585 non-null  object 
 4   state         160585 non-null  object 
 5   postal_code   160585 non-null  object 
 6   latitude      160585 non-null  float64
 7   longitude     160585 non-null  float64
 8   stars         160585 non-null  float64
 9   review_count  160585 non-null  int64  
 10  is_open       160585 non-null  int64  
 11  attributes    145593 non-null  object 
 12  categories    160470 non-null  object 
 13  hours         133244 non-null  object 
dtypes: float64(3), int64(2), object(9)
memory usage: 17.2+ MB

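Three columns in the `info()` output have fewer non-null entries than the 160585 rows: `attributes`, `categories`, and `hours`. The size of each gap follows by subtraction. The sketch below does that arithmetic with the counts copied from the output, then shows how the same figures could be read off a loaded DataFrame. The `toy` frame is a hypothetical stand-in, since the real data is not loaded here.

```python
import pandas as pd

# Non-null counts copied from the info() output above.
total_rows = 160585
non_null = {"attributes": 145593, "categories": 160470, "hours": 133244}

# Missing entries per column = total rows - non-null entries.
missing = {col: total_rows - n for col, n in non_null.items()}
print(missing)  # {'attributes': 14992, 'categories': 115, 'hours': 27341}

# On a loaded DataFrame, isna().sum() gives the same kind of count directly.
# Toy frame with hypothetical values, for illustration only.
toy = pd.DataFrame({
    "categories": ["Chinese", None, "Soup"],
    "hours": [None, None, "11:0-23:0"],
})
missing_toy = toy.isna().sum()
print(missing_toy)
```

The 115 rows without `categories` matter for the next step, because any text filter on that column has to decide how to treat missing values.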

In [23]:
# Dataset length:
len(yelp_df)

160585

## Dataset Modification
> Filter the dataset to businesses labeled *Chinese*, then plot their locations on a map.

In [24]:
chinese_df = yelp_df[yelp_df.categories.str.contains('Chinese', na=False)]
chinese_df.head(2)

| | business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | attributes | categories | hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 53 | djolOChjDtxniurUFP_SXA | Panda Express | 1460 Rinehart Rd | Sanford | FL | 32771 | 28.800686 | -81.331712 | 2.5 | 47 | 1 | {'RestaurantsGoodForGroups': 'True', 'GoodForK... | Restaurants, Chinese, Fast Food | {'Monday': '10:30-21:0', 'Tuesday': '10:30-21:... |
| 75 | 3ME_CSB1bo4F0QMhQRUeOA | Yan's China Bistro | 146 Humphrey St | Swampscott | MA | 1907 | 42.468081 | -70.916752 | 4.0 | 74 | 1 | {'RestaurantsGoodForGroups': 'True', 'NoiseLev... | Restaurants, Chinese | {'Monday': '11:30-22:0', 'Tuesday': '11:30-22:... |

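The filter above is a substring match on the comma-separated `categories` text, and `na=False` makes rows with a missing value count as non-matches rather than producing NaN in the mask. A substring match also accepts any label that merely contains the word. The sketch below contrasts it with an exact label match. This is an alternative shown for comparison, not the method used in this analysis; the `toy` rows and the helper `has_category` are hypothetical.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "name": ["Panda Express", "Chinese Medicine Clinic", "Deli", "Unknown"],
    "categories": [
        "Restaurants, Chinese, Fast Food",
        "Traditional Chinese Medicine, Health & Medical",
        "Salad, Soup, Delis",
        None,
    ],
})

# Substring match, as in the cell above; na=False maps missing values to False.
substring_mask = toy["categories"].str.contains("Chinese", na=False)


def has_category(cell, wanted):
    """Return True if `wanted` is one of the comma-separated labels in `cell`."""
    if not isinstance(cell, str):
        return False  # missing categories never match
    return wanted in [label.strip() for label in cell.split(",")]


# Exact label match: only rows whose label list contains "Chinese" itself.
exact_mask = toy["categories"].apply(lambda cell: has_category(cell, "Chinese"))

print(toy.loc[substring_mask, "name"].tolist())
print(toy.loc[exact_mask, "name"].tolist())
```

Whether the stricter match changes the result on the real data depends on which category labels occur in it, which would need to be checked against the dataset.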

In [27]:
# build one hover label per business: "<name> . rating: <stars>"
hov_label = chinese_df[['name', 'stars']].astype(str).apply(lambda x: ' . rating: '.join(x), axis=1).tolist()

# use mapbox token
tkn = 'pk.eyJ1IjoicGF0cmlja2RkZCIsImEiOiJja3R5dGluOWEzNzE3MzFvMzR0MjRlZWVtIn0.ezHYtubmTTHs1z2n11c7yQ'

# one Scattermapbox trace; a plain list replaces the deprecated Data wrapper
data = [Scattermapbox(lat=chinese_df.latitude.tolist(),
                      lon=chinese_df.longitude.tolist(),
                      mode='markers',
                      text=hov_label)]
layout = Layout(title='Chinese Restaurants on Yelp',
                autosize=True,
                hovermode='closest',
                mapbox=dict(accesstoken=tkn,
                            bearing=0,
                            center=dict(lat=39.5, lon=-99),
                            style='light',
                            pitch=0,
                            zoom=3))

fig = dict(data=data, layout=layout)
plotly.offline.iplot(fig, filename='mapbox')

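The map cell passes a plain `dict` with `data` and `layout` keys to `iplot`, so the whole figure can also be written as nested dictionaries. The sketch below builds such a specification for two rows taken from the output above and reads the Mapbox token from an environment variable instead of the notebook. The variable name `MAPBOX_TOKEN` is an assumption for illustration, and the sketch only constructs the specification; displaying it would still go through plotly as in the cell above.

```python
import os

import pandas as pd

# Two rows with the columns the map cell uses (values from the output above).
toy = pd.DataFrame({
    "name": ["Panda Express", "Yan's China Bistro"],
    "stars": [2.5, 4.0],
    "latitude": [28.800686, 42.468081],
    "longitude": [-81.331712, -70.916752],
})

# One hover label per row, in the same format as hov_label.
hover = [
    f"{name} . rating: {stars}"
    for name, stars in zip(toy["name"], toy["stars"].astype(str))
]

# Assumption: the token is stored in an environment variable named MAPBOX_TOKEN.
token = os.environ.get("MAPBOX_TOKEN", "")

# Figure specification as nested dicts: one scattermapbox trace plus a layout.
fig_spec = {
    "data": [{
        "type": "scattermapbox",
        "lat": toy["latitude"].tolist(),
        "lon": toy["longitude"].tolist(),
        "mode": "markers",
        "text": hover,
    }],
    "layout": {
        "title": "Chinese Restaurants on Yelp",
        "hovermode": "closest",
        "mapbox": {
            "accesstoken": token,
            "center": {"lat": 39.5, "lon": -99},
            "style": "light",
            "zoom": 3,
        },
    },
}
print(hover)
```

Keeping the token outside the notebook means it is not published along with the exported page.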
In [None]:
chinese_df.state.unique()
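
`unique()` lists which states appear but not how many businesses fall in each. A possible follow-up is `value_counts()`, sketched here on a hypothetical `toy` frame because the cell above has no recorded output.

```python
import pandas as pd

# Hypothetical state column for illustration only.
toy = pd.DataFrame({"state": ["FL", "MA", "FL", "OR", "MA", "FL"]})

# Distinct states, in order of first appearance.
states = toy["state"].unique()

# Number of rows per state, sorted from most to least frequent.
counts = toy["state"].value_counts()

print(list(states))
print(counts)
```

On the real data, the equivalent call would be `chinese_df.state.value_counts()`.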