# BDA 507 - Introduction to Computer Programming (Python) 

## Term Project - 06.01.2020

Nilay Kamar
**Student ID**: 311902021
**Email**: kamarn@mef.edu.tr

# Zomato Data Analysis | Caddebostan vs. Nisantasi, Istanbul, Turkey 

Zomato is an application-based platform for finding the best restaurants, cafes, and bars in a given city. Its stated vision is better food for more people: it aims to connect people to food in every context and works closely with restaurants to enable a sustainable ecosystem. Users can review and rate the restaurants they visit, and restaurants can list features such as payment types, gluten-free options, pet-friendliness, and whether alcohol is available.

In this project, I examined whether food prices and restaurant preferences differ significantly between Nişantaşı and Caddebostan, two expensive and popular subzones of Istanbul. I also wanted to know where people like to go in Caddebostan and Nişantaşı, how those places are rated, and which features attract people there.

For that, I used an API. Zomato shares its data through an [API](https://developers.zomato.com/api), and after getting an API key you can easily access restaurant data for a given city, subzone, or country. I collected data about restaurants and their highlights for Caddebostan and Nişantaşı, two popular subzones of Istanbul.
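As a rough sketch of the step from an API response to a table, the snippet below flattens a small, hand-written JSON payload into flat rows. The payload shape and the nested field names (`restaurants`, `restaurant`, `location`, `user_rating`) are assumptions modeled on the columns described later in this notebook, not a verified copy of Zomato's response format, and all sample values are invented.

```python
import json

# Hypothetical sample payload; the structure is assumed and the values are made up.
SAMPLE_JSON = """
{
  "restaurants": [
    {"restaurant": {
        "id": "1",
        "name": "Example Cafe",
        "location": {"locality": "Caddebostan", "latitude": "40.96", "longitude": "29.06"},
        "user_rating": {"aggregate_rating": "4.1", "votes": "120"},
        "average_cost_for_two": 150
    }},
    {"restaurant": {
        "id": "2",
        "name": "Example Bistro",
        "location": {"locality": "Nisantasi", "latitude": "41.05", "longitude": "28.99"},
        "user_rating": {"aggregate_rating": "3.8", "votes": "85"},
        "average_cost_for_two": 220
    }}
  ]
}
"""

def flatten_restaurants(payload):
    """Turn the nested payload into a list of flat dicts, one per restaurant."""
    rows = []
    for item in payload.get("restaurants", []):
        r = item["restaurant"]
        rows.append({
            "res_id": int(r["id"]),
            "name": r["name"],
            "locality": r["location"]["locality"],
            "latitude": float(r["location"]["latitude"]),
            "longitude": float(r["location"]["longitude"]),
            "aggregate_rating": float(r["user_rating"]["aggregate_rating"]),
            "votes": int(r["user_rating"]["votes"]),
            "average_cost_for_two": r["average_cost_for_two"],
        })
    return rows

rows = flatten_restaurants(json.loads(SAMPLE_JSON))
for row in rows:
    print(row["res_id"], row["name"], row["average_cost_for_two"])
```

A list of flat dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)` and then saved, for example to an Excel file.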

### Loading Libraries and data

Below are the libraries used to perform EDA (exploratory data analysis) in this project.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import plotly.offline as py
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=False)

In [None]:
res_df = pd.read_excel(r'/Users/nilaykamar/PycharmProjects/Term_Project/restaurants.xlsx') #restaurant data

In [None]:
high_df = pd.read_excel(r'/Users/nilaykamar/PycharmProjects/Term_Project/highlights.xlsx') #restaurant's highlights data

In [None]:
cuisines_df = pd.read_excel(r'/Users/nilaykamar/PycharmProjects/Term_Project/cuisines.xlsx') #cuisines data

In [None]:
foodie_df = pd.read_excel(r'/Users/nilaykamar/PycharmProjects/Term_Project/foodie_level.xlsx') #foodie level data

### Getting basic ideas

In [None]:
print("Restaurant dataset contains {} rows and {} columns".format(res_df.shape[0],res_df.shape[1]))
print("Highlight dataset contains {} rows and {} columns".format(high_df.shape[0],high_df.shape[1]))
print("Cuisines dataset contains {} rows and {} columns".format(cuisines_df.shape[0],cuisines_df.shape[1]))
print("Foodie level dataset contains {} rows and {} columns".format(foodie_df.shape[0],foodie_df.shape[1]))

In [None]:
res_df.info()

In [None]:
high_df.info()

In [None]:
cuisines_df.info()

In [None]:
foodie_df.info()

Since there are separate data frames for cuisines, highlights, and foodie levels, these columns can be dropped from the restaurant data. The `price_range` column is dropped as well.

In [None]:
res_df = res_df.drop(['cuisines', 'highlights', 'foodie_level', 'price_range'], axis=1)
res_df.head(5)

In [None]:
high_df = high_df.drop(['Unnamed: 0'], axis=1)
cuisines_df = cuisines_df.drop(['Unnamed: 0'], axis=1)
foodie_df = foodie_df.drop(['Unnamed: 0'], axis=1)

The foodie level and highlight data contain missing values. A missing value here means that the restaurant has no highlight or no foodie post, so these rows can be dropped.

In [None]:
print(high_df.isnull().sum())
print(foodie_df.isnull().sum())

In [None]:
high_df = high_df.dropna() 
high_df.count()

In [None]:
foodie_df = foodie_df.dropna()
foodie_df.count()

In [None]:
high_df.head()

In [None]:
cuisines_df.head()

In [None]:
foodie_df.head()

#### Column Description

Restaurant dataset:
 
 - **res_id**: a unique id of restaurants
 - **name**: restaurant name
 - **locality**: subzone name of restaurants
 - **latitude**: latitude of restaurant's location
 - **longitude**: longitude of restaurant's location
 - **establishment**: restaurant types
 - **all_reviews_count**: count of reviews
 - **aggregate_rating**: average ratings of restaurants
 - **rating_text**: text label that Zomato assigns to the rating
 - **rating_color**: color that Zomato assigns to the rating
 - **votes**: number of votes for the given restaurant
 - **photo_count**: number of photos posted for the given restaurant
 - **average_cost_for_two**: average price for two people at the given restaurant
 
Highlight dataset:

 - **res_id**: a unique id of restaurants
 - **locality**: subzone name of restaurants
 - **highlight**: feature that the restaurant offers its customers
 
Cuisines dataset:

 - **res_id**: a unique id of restaurants
 - **locality**: subzone name of restaurants
 - **cuisine**: cuisine of the restaurant
 
Foodie level dataset:

 - **res_id**: a unique id of restaurants
 - **locality**: subzone name of restaurants
 - **foodie_level**: level that Zomato assigns to a foodie (*foodie*: a Zomato user)

#### Outlier Detection

In [None]:
sns.boxplot(x=res_df['average_cost_for_two'])

The boxplot shows no outliers in the average price.
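By default the seaborn boxplot draws its whiskers with the common 1.5 × IQR rule, so the same check can also be done numerically. Below is a minimal sketch on made-up prices; the `prices` list and the `iqr_outliers` helper are illustrative and not part of the project data.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical average costs for two people (invented numbers).
prices = [80, 100, 120, 130, 150, 160, 180, 200, 900]
print(iqr_outliers(prices))  # → [900]
```

An empty list from such a check would agree with a boxplot that shows no points beyond the whiskers.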

## Exploratory Data Analysis

### Common restaurant types in Nişantaşı and Caddebostan

In [None]:
plt.figure(figsize=(10,7))
types=res_df[res_df['locality'] == 'Nişantaşı']['establishment'].value_counts()[:100]

sns.barplot(x=types, y=types.index, palette='deep')
plt.title("Most popular restaurant types in Nisantasi")
plt.xlabel("Number of restaurants")

In [None]:
plt.figure(figsize=(10,7))
types=res_df[res_df['locality'] == 'Caddebostan']['establishment'].value_counts()[:20]

sns.barplot(x=types, y=types.index, palette='deep')
plt.title("Most popular restaurant types in Caddebostan")
plt.xlabel("Number of restaurants")

As the graphs above show, cafes are a popular restaurant type in both Caddebostan and Nişantaşı. Fine dining restaurants seem to be on the rise in Nişantaşı.

### Highlights of restaurants

In [None]:
high_df[high_df['locality'] == 'Nişantaşı'].highlight.value_counts().nlargest(40).plot(kind='bar', figsize=(20,5))
plt.title("Highlights of restaurants at Nisantasi")
plt.ylabel('Number of restaurants')
plt.xlabel('Highlights');

In [None]:
high_df[high_df['locality'] == 'Caddebostan'].highlight.value_counts().nlargest(40).plot(kind='bar', figsize=(20,5))
plt.title("Highlights of restaurants at Caddebostan")
plt.ylabel('Number of restaurants')
plt.xlabel('Highlights');

Restaurants offer similar features in both Caddebostan and Nişantaşı. Interestingly, popular eating trends such as gluten-free options and organic food have started to appear among the listed features. Based on the limited data about people's preferences, smoking seems to matter less in Caddebostan than in Nişantaşı.
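Because the two subzones may not contain the same number of restaurants, raw counts can be misleading when highlights are compared. A hedged sketch of comparing shares instead, using invented `(res_id, locality, highlight)` records and a hypothetical `highlight_share` helper:

```python
# Invented (res_id, locality, highlight) records, for illustration only.
records = [
    (1, "Caddebostan", "Smoking Area"),
    (1, "Caddebostan", "Credit Card"),
    (2, "Caddebostan", "Credit Card"),
    (3, "Nisantasi", "Smoking Area"),
    (3, "Nisantasi", "Credit Card"),
    (4, "Nisantasi", "Smoking Area"),
    (5, "Nisantasi", "Credit Card"),
    (6, "Nisantasi", "Smoking Area"),
]

def highlight_share(records, highlight):
    """Share of restaurants per locality that list the given highlight."""
    restaurants = {}   # locality -> set of all restaurant ids
    with_feature = {}  # locality -> set of ids that list the highlight
    for res_id, locality, h in records:
        restaurants.setdefault(locality, set()).add(res_id)
        if h == highlight:
            with_feature.setdefault(locality, set()).add(res_id)
    return {loc: len(with_feature.get(loc, set())) / len(ids)
            for loc, ids in restaurants.items()}

shares = highlight_share(records, "Smoking Area")
print(shares)
```

With pandas, the same idea can be expressed by dividing the highlight counts of each locality by that locality's number of distinct `res_id` values.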

### Average cost for two people

In [None]:
plt.figure(figsize=(10,7))
cost=res_df[res_df['locality'] == 'Nişantaşı']['average_cost_for_two']
# distplot is deprecated; histplot with a KDE overlay needs seaborn >= 0.11
sns.histplot(cost, bins=20, kde=True, stat='density')
plt.title("Average Price in Nisantasi")
plt.xlabel("Average cost for two people")

In [None]:
plt.figure(figsize=(10,7))
cost=res_df[res_df['locality'] == 'Caddebostan']['average_cost_for_two']
# distplot is deprecated; histplot with a KDE overlay needs seaborn >= 0.11
sns.histplot(cost, bins=20, kde=True, stat='density')
plt.title("Average Price in Caddebostan")
plt.xlabel("Average cost for two people")

According to the graphs above, prices seem to be higher in Nişantaşı than in Caddebostan.
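The histograms suggest a gap, but they do not say whether it is statistically significant. One way to check is a permutation test on the difference in mean prices. The sketch below uses invented numbers and a hypothetical `permutation_test` helper; it is not the analysis run in this notebook.

```python
import random
from statistics import mean

def permutation_test(a, b, n_iter=10_000, seed=0):
    """Two-sided permutation test for the difference in means of a and b."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        # Reassign the group labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_iter + 1)

# Invented average costs for two people in each subzone.
nisantasi = [180, 220, 250, 200, 300, 270, 240, 210]
caddebostan = [120, 150, 140, 160, 130, 170, 155, 145]

p_value = permutation_test(nisantasi, caddebostan)
print(f"p-value: {p_value:.4f}")
```

A small p-value would indicate that a gap this large is unlikely if the subzone labels were assigned at random. With the real data, the two lists would be the `average_cost_for_two` values of each locality.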

### Rating distribution

In [None]:
plt.figure(figsize=(10,7))
rating=res_df[res_df['locality'] == 'Nişantaşı']['aggregate_rating']
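# distplot is deprecated; histplot with a KDE overlay needs seaborn >= 0.11
sns.histplot(rating, bins=20, kde=True, stat='density')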
plt.title("Average rating in Nişantaşı")
plt.xlabel("Average Rating")

In [None]:
plt.figure(figsize=(10,7))
rating=res_df[res_df['locality'] == 'Caddebostan']['aggregate_rating']
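# distplot is deprecated; histplot with a KDE overlay needs seaborn >= 0.11
sns.histplot(rating, bins=20, kde=True, stat='density')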
plt.title("Average rating in Caddebostan")
plt.xlabel("Average Rating")

Nişantaşı and Caddebostan have nearly the same range of average ratings.
Restaurants rated above 4.5 are rare.

### Average price vs. average rating

In [None]:
plt.figure(figsize=(10,7))
sns.scatterplot(x="average_cost_for_two",y='aggregate_rating', hue='locality', data=res_df)
plt.show()

### Average price vs. Establishments

In [None]:
fig, ax = plt.subplots(figsize=(10,6))
# pass x and y by keyword; recent seaborn versions no longer accept them positionally
sns.boxplot(x=res_df[res_df['locality'] == 'Nişantaşı']['average_cost_for_two'], y=res_df[res_df['locality'] == 'Nişantaşı']['establishment'], showfliers=False)

plt.title("Average price by restaurant types at Nisantasi")
ax.set_xlabel('Average Price')
ax.set_ylabel('Restaurant types')
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10,6))
# pass x and y by keyword; recent seaborn versions no longer accept them positionally
sns.boxplot(x=res_df[res_df['locality'] == 'Caddebostan']['average_cost_for_two'], y=res_df[res_df['locality'] == 'Caddebostan']['establishment'], showfliers=False)
plt.xticks(rotation=90)

plt.title("Average price by restaurant types at Caddebostan")
ax.set_xlabel('Average Price')
ax.set_ylabel('Restaurant types')
plt.show()

- Cafes seem to be more expensive in Nişantaşı than in Caddebostan.
- While meyhanes are the most expensive restaurant type in Caddebostan, fine dining restaurants hold that rank in Nişantaşı.

### Average rating

In [None]:
plt.figure(figsize=(10,6))
rating=res_df[res_df['locality'] == 'Nişantaşı']['aggregate_rating'].value_counts()
sns.barplot(x=rating.index,y=rating)
plt.xlabel("Ratings")
plt.ylabel('count')

In [None]:
plt.figure(figsize=(10,6))
rating=res_df[res_df['locality'] == 'Caddebostan']['aggregate_rating'].value_counts()
sns.barplot(x=rating.index,y=rating)
plt.xlabel("Ratings")
plt.ylabel('count')

### Cuisines distribution

In [None]:
plt.figure(figsize=(15,10))
cuisines=cuisines_df[cuisines_df['locality'] == 'Nişantaşı']['cuisine'].value_counts()
# pass x and y by keyword; recent seaborn versions no longer accept them positionally
sns.barplot(x=cuisines, y=cuisines.index)
plt.xlabel('Count')
plt.title("Most popular cuisines of Nisantasi")

In [None]:
plt.figure(figsize=(15,10))
cuisines=cuisines_df[cuisines_df['locality'] == 'Caddebostan']['cuisine'].value_counts()
# pass x and y by keyword; recent seaborn versions no longer accept them positionally
sns.barplot(x=cuisines, y=cuisines.index)
plt.xlabel('Count')
plt.title("Most popular cuisines of Caddebostan")

### Correlation matrix

A correlation matrix is a table of correlation coefficients between pairs of variables. A correlation coefficient measures how strong the linear relationship between two variables is. Each numeric attribute of the dataset is compared with every other attribute, which lets us see which pairs are most strongly correlated. Highly correlated pairs carry largely the same information, so they can be examined further to decide which attribute of each pair is more useful for modeling.
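For reference, the coefficient that `DataFrame.corr()` reports by default is the Pearson correlation. A minimal sketch of the formula on invented numbers follows; the `pearson_r` helper and the two lists are illustrative only.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y divided by the product of their spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented vote and review counts for five restaurants.
votes = [10, 20, 30, 40, 50]
reviews = [3, 5, 9, 11, 16]
print(round(pearson_r(votes, reviews), 3))
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.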

In [None]:
plt.figure(figsize=(12,8))
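# keep numeric columns only; corr() raises on text columns in recent pandas versions
c= res_df[res_df['locality'] == 'Caddebostan'].select_dtypes(include='number').corr()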
sns.heatmap(c, cmap='BrBG', annot=True)
c

In [None]:
plt.figure(figsize=(12,8))
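# keep numeric columns only; corr() raises on text columns in recent pandas versions
c= res_df[res_df['locality'] == 'Nişantaşı'].select_dtypes(include='number').corr()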
sns.heatmap(c, cmap='BrBG', annot=True)
c

According to the correlation matrices above, reviews, votes, and photo counts are strongly related to one another. This relationship is stronger in Caddebostan than in Nişantaşı.

In [None]:
for i in range(0, len(res_df.columns), 5):
    sns.pairplot(data=res_df,
                x_vars=res_df.columns[i:i+5],
                y_vars=['average_cost_for_two'])

In [None]:
sns.lmplot(x="votes", y="all_reviews_count", data=res_df, fit_reg=False, hue="locality")

In [None]:
sns.lmplot(x="photo_count", y="all_reviews_count", data=res_df, fit_reg=False, hue="locality")

### Foodie levels

In [None]:
x=foodie_df[foodie_df['locality'] == 'Nişantaşı']['foodie_level'].value_counts()
colors = ['#FEBFB3', '#E1396C']

trace=go.Pie(labels=x.index,values=x,textinfo="value",
             marker=dict(colors=colors, 
                         line=dict(color='#000000', width=2)))

layout=go.Layout(title="Foodie levels of Nişantaşı",width=500,height=500)
fig=go.Figure(data=[trace],layout=layout)
py.iplot(fig, filename='pie_chart_subplots')

In [None]:
x=foodie_df[foodie_df['locality'] == 'Caddebostan']['foodie_level'].value_counts()
colors = ['#FEBFB3', '#E1396C']

trace=go.Pie(labels=x.index,values=x,textinfo="value",
             marker=dict(colors=colors, 
                         line=dict(color='#000000', width=2)))

layout=go.Layout(title="Foodie levels of Caddebostan",width=500,height=500)
fig=go.Figure(data=[trace],layout=layout)
py.iplot(fig, filename='pie_chart_subplots')

Users whom Zomato classifies as connoisseurs prefer Caddebostan for eating and relaxing.

### Plotting restaurants on map

In [None]:
BBox_c = (res_df[res_df['locality'] == 'Caddebostan']['longitude'].min(),
          res_df[res_df['locality'] == 'Caddebostan']['longitude'].max(),
          res_df[res_df['locality'] == 'Caddebostan']['latitude'].min(),
          res_df[res_df['locality'] == 'Caddebostan']['latitude'].max())

In [None]:
map_c = plt.imread('/Users/nilaykamar/PycharmProjects/Term_Project/map.png')

In [None]:
fig, ax = plt.subplots(figsize = (8,7))
ax.scatter(res_df[res_df['locality'] == 'Caddebostan']['longitude'],
           res_df[res_df['locality'] == 'Caddebostan']['latitude'],
           zorder=1, c='b', s=10)

ax.set_title('Plotting Restaurants Data on Caddebostan Map')
ax.set_xlim(BBox_c[0],BBox_c[1])
ax.set_ylim(BBox_c[2],BBox_c[3])
ax.imshow(map_c, zorder=0, extent = BBox_c, aspect= 'equal')

In [None]:
BBox_n = (res_df[res_df['locality'] == 'Nişantaşı']['longitude'].min(),
          res_df[res_df['locality'] == 'Nişantaşı']['longitude'].max(),
          res_df[res_df['locality'] == 'Nişantaşı']['latitude'].min(),
          res_df[res_df['locality'] == 'Nişantaşı']['latitude'].max())
BBox_n

In [None]:
map_n = plt.imread('/Users/nilaykamar/PycharmProjects/Term_Project/map-2.png')

In [None]:
fig, ax = plt.subplots(figsize = (8,7))
ax.scatter(res_df[res_df['locality'] == 'Nişantaşı']['longitude'],
           res_df[res_df['locality'] == 'Nişantaşı']['latitude'],
           zorder=1, c='b', s=10)

ax.set_title('Plotting Restaurants Data on Nişantaşı Map')
ax.set_xlim(BBox_n[0],BBox_n[1])
ax.set_ylim(BBox_n[2],BBox_n[3])
ax.imshow(map_n, zorder=0, extent = BBox_n, aspect= 'equal')

## Where should you go in Nişantaşı?

In [None]:
res_n = res_df[res_df['locality'] == 'Nişantaşı']
# idxmax returns index labels, so the rows must be selected with .loc rather than .iloc
res_n.loc[res_n.groupby('establishment')['aggregate_rating'].agg(pd.Series.idxmax)].sort_values(by='aggregate_rating', ascending=False)

## Where should you go in Caddebostan?

In [None]:
res_c = res_df[res_df['locality'] == 'Caddebostan']
res_c.loc[res_c.groupby('establishment')['aggregate_rating'].agg(pd.Series.idxmax)].sort_values(by='aggregate_rating', ascending=False)

## Results

From examining the Zomato dataset, the following results can be drawn:

 - Restaurants are more expensive in Nişantaşı than in Caddebostan.
 - Many restaurants share similar features, such as accepting credit cards and cash, air conditioning, and indoor/outdoor seating.
 - Popular eating trends such as vegan options, gluten-free options, and organic food are starting to shape restaurant features. Nişantaşı is more responsive to these trends.
 - By contrast, Japanese cuisine is on the rise in Nişantaşı compared to Caddebostan.
 
I also learned how to:

 - Work with JSON files after getting the data via an API
 - Clean a dataset
 - Plot different types of graphs
 - Interpret heat maps
 - Visualize results
 - Compare two categories according to their features