<a href="https://colab.research.google.com/github/pixlricha/Zomato-data-analysis/blob/main/Zomato_Reviews_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = 'tomato-reviews:https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-data-sets%2F2826904%2F4875434%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240617%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240617T051124Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3Db2402d5f671719f6b08fb0ec4c4300e53320c9b043608b245193c22e0ef031d1e3ad1ed22beac773db3ad52220e3e73789aff7ac9675f711a541710aba091ed303b5f2854c5459f8c61e8608216f1c047de628f3b5ac958cfc0f78f1d99c97ba822815b911835d2fdb2c606052f72d851d759c3a9f873dba37e5cdeba229685de6419f4e20c0dde785237610f4ea324cceb1dc3555c72d5ea06b3d71e48f653904cd45480f97184e002dfcb250824c5c813018bc773799e1084975f35f34a685c936dae597c96336ba786c50fa323523848d00f53a581fc563e55c6284b9141e793d40df78db95461fcf1fdd923bc22be1a63e3ee1cd1f635100727c074f16b9'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null
shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
  os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
  pass
try:
  os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
  pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
            if filename.endswith('.zip'):
              with ZipFile(tfile) as zfile:
                zfile.extractall(destination_path)
            else:
              with tarfile.open(tfile.name) as tar_archive:
                tar_archive.extractall(destination_path)
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


> #  About Dataset

This dataset contains information about restaurants in Bangalore that are listed on Zomato. The data was scraped from Zomato in two phases. After going through the structure of the website, I found that for each neighborhood there are 6-7 categories of restaurants, viz. Buffet, Cafes, Delivery, Desserts, Dine-out, Drinks & nightlife, Pubs and bars.
So, here we are trying to find the best restaurants for customers depending on their needs.

# Possible Findings :

* 1) How many restaurants accept online orders through Zomato?

* 2) Find the best location from the dataset.

* 3) Find the types of restaurants and their counts.

* 4) Find the count of restaurants that have a table booking facility.

* 5) Find the number of restaurants at given locations.

* 6) Find the most famous restaurant chains (franchises, i.e. restaurants having more than one branch) in Bangalore.

* 7) Find how many voters gave a rating for each 'type' and the aggregate rating of that 'type'.

* 8) Gaussian Rest Type(Normal Distribution) of Rating.

* 9) Find how many restaurants have Chinese and North Indian food in their food type.

* 10) Find the most profitable type of restaurant.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Reading Dataset

In [None]:
zmt= pd.read_csv('/kaggle/input/tomato-reviews/zomato.csv')

In [None]:
zmt.head()

In [None]:
zmt.info()

Here we can see that we have 17 columns, some of which contain NaN values or have an incorrect data type.

Ex: the approx_cost(for two people) column holds integer values but has the object data type.

# Dropping Unnecessary Columns
Unnecessary Columns:

Unnecessary columns are those that are not useful for the analysis.

Ex: phone. We cannot predict anything from phone numbers, so we can drop this column along with any others we are not going to use.

In [None]:
zmt.drop(['url', 'reviews_list', 'menu_item', 'address', 'phone', 'dish_liked'], axis=1, inplace=True)

In [None]:
zmt.head(2)

Now only the columns that are useful for the analysis remain.

# Renaming Columns
We rename the columns for better readability.

In [None]:
zmt.rename(columns={'name':'restaurants', 'rate': 'rating', 'cuisines':'food_type','listed_in(type)':'type', 'listed_in(city)':'city', 'approx_cost(for two people)':'cost'}, inplace=True)

In [None]:
zmt.head(2)

# Dropping NaN values

In [None]:
zmt.dropna(inplace=True)

> # Cleaning Individual Columns

**Columns : 'restaurants'**

Column contains Restaurant Names

In [None]:
zmt.groupby('restaurants').count().head()
# Group the rows by restaurant name and count the rows in each group.

Here we can see that some restaurant names contain garbled characters,

so we need to remove those characters with pattern matching to recover the original names.

In [None]:
# The regex '[Ãx][^A-Za-z]+' matches the garbled characters; each match is replaced with an empty string.
zmt['restaurants']=zmt['restaurants'].str.replace('[Ãx][^A-Za-z]+', '', regex=True)

In [None]:
zmt.groupby('restaurants').count().head()

Here we can see that the garbled characters were removed from the restaurant names (compare the 5th row before and after the transformation).
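To see what the pattern actually matches, here is a small sketch on made-up names (these strings are illustrative assumptions, not rows taken from the dataset). It uses Python's `re.sub`, which applies the same kind of regular expression that `str.replace(..., regex=True)` applies to each value.

```python
import re

# The pattern used in the cleaning cell above.
PATTERN = r'[Ãx][^A-Za-z]+'

# Hypothetical restaurant names, made up for illustration only.
samples = ['CafÃ© Felix', 'Box 8', 'Truffles']

# Remove every match of the pattern from each name.
cleaned = [re.sub(PATTERN, '', name) for name in samples]
print(cleaned)  # ['CafFelix', 'Bo', 'Truffles']
```

In this sketch the match also consumes the space after the garbled characters, and because `x` is part of the character class, a name like the hypothetical 'Box 8' is trimmed too. It may be worth checking a few cleaned names by eye.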

While reading the dataset I also found some misspelled names. Let me correct them as well.

In [None]:
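# Expand 'Caf-' and a truncated 'Caf' to 'Cafe'; the lookahead leaves names that already contain 'Cafe' unchanged.
zmt['restaurants']=zmt['restaurants'].str.replace(r'Caf-|Caf(?![a-z])', 'Cafe', regex=True)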

**Column: 'online_order'**

This column shows whether a restaurant accepts online orders.

In [None]:
zmt['online_order'].unique()


Here we can see that only Yes and No values are present, so there is no need to clean the column.

**Column:'book_table'**

This column shows whether a restaurant has a table booking facility.

In [None]:
zmt['book_table'].unique()

Here we get the same result (only Yes and No values), so no cleaning is needed.

**Column: 'rating'**

This column shows the rating of the restaurant out of 5.

In [None]:
zmt['rating'].unique()

Here we can see that the rating column holds string values (object datatype), including 'NEW' and '-' values. We know that a rating is always out of 5 and in decimal format. So, we need to clean our 'rating' column.

Changes needed:

* Handle the 'NEW' and '-' values, which are not numeric ratings.
* Remove '/5' from the rating.
* Convert the datatype from object to float (the float datatype holds decimal values).

Let's try to make all the changes in one block of code.

In [None]:
def clean_rating(val):
    # Strip the '/5' suffix and any surrounding spaces from the raw rating string
    val = val.replace('/5', '').strip()
    # 'NEW' and '-' are placeholders, not ratings: store them as missing (NaN)
    # instead of reusing the value left over from the previous row
    return np.nan if val in ('NEW', '-') else float(val)

zmt['rating'] = zmt['rating'].apply(clean_rating)  # update the rating column with cleaned float values

In [None]:
zmt['rating'].unique(),zmt['rating'].dtype

Now the column holds cleaned values with the float datatype.
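As an alternative to looping over the values, the same cleaning can be sketched with vectorized pandas calls. The sample strings below are assumptions made up for illustration; `pd.to_numeric(..., errors='coerce')` turns anything that is not a number, such as 'NEW' or '-', into NaN.

```python
import pandas as pd

# Hypothetical raw ratings in the same shape as the 'rating' column.
raw = pd.Series(['4.1/5', '3.9 /5', 'NEW', '-'])

# Drop the '/5' suffix, trim spaces, then convert; non-numeric values become NaN.
without_suffix = raw.str.replace('/5', '', regex=False).str.strip()
cleaned = pd.to_numeric(without_suffix, errors='coerce')

print(cleaned.dtype)         # float64
print(cleaned.isna().sum())  # 2
```

Whether to keep the NaN rows or drop them afterwards is a separate decision; this sketch only shows the conversion step.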



**Column: 'votes'**

This column contains the number of votes the restaurant received.

In [None]:
zmt['votes'].isnull().value_counts()

Here we can see that there are no null values and the column has the correct datatype. So, there is nothing to clean.

**Column: 'location'**

Column contains locations of restaurants.

In [None]:
zmt.location.unique()

Here also everything is correct, so there is no need to perform data cleaning.

**Column: 'rest_type' and 'food_type'**

These columns show the restaurant type and the types of food the restaurant serves.

In [None]:
zmt.rest_type.unique(), zmt.rest_type.isnull().value_counts()

In [None]:
zmt.food_type.unique(), zmt.food_type.isnull().value_counts()

As we can see, the 'rest_type' and 'food_type' columns are also correct, so there is no need to perform any cleaning operations.

**Column: 'cost'**

Column contains approximate cost for two people.



In [None]:
zmt.cost.unique()

Here we can see that the cost column holds string values, some containing ','. We know that cost should be an integer, so we need to make some transformations.

Changes needed:

* Remove ',' from the values
* Change the datatype from object to integer (int)

In [None]:
zmt['cost']=zmt['cost'].apply(lambda x:x.replace(",", "")).astype(int)

In [None]:
zmt.cost.unique(), zmt.cost.dtype

Here, a lambda function replaced ',' with an empty string and astype(int) converted the datatype from object to int. We stored the result back in the cost column and got cleaned data.
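The same transformation can be sketched on a tiny Series to make the before and after visible. The cost strings here are hypothetical examples, not values read from the dataset.

```python
import pandas as pd

# Hypothetical cost strings; values above 999 carry a thousands separator.
cost = pd.Series(['800', '1,200', '300'])

# Remove the separator as a literal comma, then convert to integers.
cost_int = cost.str.replace(',', '', regex=False).astype(int)

print(cost_int.tolist())  # [800, 1200, 300]
```

Using the `.str` accessor instead of `apply` with a lambda gives the same result here and keeps the operation vectorized.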

**Column: 'type'**

This column shows which type of service arrangement the restaurant is listed under.

In [None]:
zmt.type.unique()

Here also all values are correct and have the correct datatype, so no cleaning is needed.

**Column: 'city'**

Column contains name of cities of restaurants

In [None]:
zmt.city.unique()

Here also all values are correct and have the correct datatype, so no cleaning is needed.

# Dropping Duplicates

Dropping duplicates means removing repeated rows from the dataset.

In [None]:
zmt.duplicated().value_counts()

We got 80 duplicate rows in our dataset, so we need to remove them.

In [None]:
zmt.drop_duplicates(keep='last',inplace=True)
zmt.reset_index(drop=True,inplace=True)

In [None]:
zmt.duplicated().value_counts()

As the final cleaning step, we removed all duplicate rows from our dataset, which leaves us with fully cleaned data.
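How `duplicated` and `drop_duplicates(keep='last')` behave can be sketched on a toy frame. The rows below are made up for illustration.

```python
import pandas as pd

# Hypothetical rows: row 2 repeats row 0 exactly.
toy = pd.DataFrame({
    'restaurants': ['A', 'B', 'A'],
    'city': ['X', 'Y', 'X'],
})

# duplicated() flags every repeat after the first occurrence.
n_dupes = int(toy.duplicated().sum())

# keep='last' keeps the final copy of each repeated row; reset the index afterwards.
deduped = toy.drop_duplicates(keep='last').reset_index(drop=True)

print(n_dupes)                         # 1
print(deduped['restaurants'].tolist())  # ['B', 'A']
```

With `keep='last'` the first copy is the one that gets dropped, which is why 'B' comes before 'A' in the result.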

In [None]:
zmt

In [None]:
zmt.info()

In [None]:
zmt.to_csv('./zomato_clean_data.csv')
# Save the cleaned dataset to the output directory.

**Therefore, we cleaned our data successfully and kept 43453 of the original 51717 rows.**

> #  Data Visualization

In data visualization we plot our data as it is and try to extract useful information from the dataset.

Here we are going to use two Python libraries for visualization.

* Seaborn
* Matplotlib

So, first we are going to import them.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

**We know Zomato is mostly focused on online delivery, so let's find out:**

**1) How many restaurants accept online orders through Zomato?**

In [None]:
zmt.head(1)

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(x='online_order', data=zmt)
plt.show()

By observing the graph, we can say that 25000+ restaurants accept online orders through Zomato and about 15000 restaurants do not.
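Reading counts off a bar chart is approximate. A hedged sketch of how exact numbers could be obtained with `value_counts` follows; the Series below is a made-up stand-in for `zmt['online_order']`.

```python
import pandas as pd

# Hypothetical Yes/No values standing in for the 'online_order' column.
orders = pd.Series(['Yes', 'No', 'Yes', 'Yes', 'No'], name='online_order')

counts = orders.value_counts()                # absolute counts per value
shares = orders.value_counts(normalize=True)  # fraction of rows per value

print(counts['Yes'], counts['No'])  # 3 2
```

The same two calls applied to the real column would give the exact counts and shares behind the bars in the plot.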

**2) Find best location by seeing dataset.**

How can we find it?

We have two columns, 'rating' and 'votes', so by taking the median of those columns with respect to 'location' we can find the best location.

In [None]:
plt.figure(figsize=(16,10))
ax= plt.subplot(2,1,1)
loc_rating=zmt.groupby('location').agg({'rating':'median'})
rating_sorted_loc=loc_rating.sort_values('rating',ascending=False).head(5).reset_index()
sns.barplot(x='location',y='rating', data=rating_sorted_loc)
ax.set_title("Best locations by rating and votes")
ax= plt.subplot(2,1,2)
loc_votes=zmt.groupby('location').agg({'votes':'median'})
votes_sorted_loc=loc_votes.sort_values('votes',ascending=False).head(5).reset_index()
sns.barplot(x='location',y='votes', data=votes_sorted_loc)



By observing the above graphs, 'Lavelle Road' has a high rating as well as a high number of votes compared to the other locations.

So, comparing the 'votes' and 'rating' of the locations, we can say that 'Lavelle Road' is the best location.
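The two rankings do not have to agree, so it can help to compute both medians in one table and compare the winners. This is a minimal sketch on invented data; the location names 'P', 'Q' and 'R' and all numbers are assumptions.

```python
import pandas as pd

# Hypothetical rows: one row per restaurant.
toy = pd.DataFrame({
    'location': ['P', 'P', 'Q', 'Q', 'R'],
    'rating':   [4.0, 4.4, 3.0, 3.2, 4.5],
    'votes':    [100, 300, 10, 30, 50],
})

# Median rating and median votes per location, side by side.
summary = toy.groupby('location').agg(
    rating=('rating', 'median'),
    votes=('votes', 'median'),
)

top_by_rating = summary['rating'].idxmax()
top_by_votes = summary['votes'].idxmax()
print(top_by_rating, top_by_votes)  # R P
```

In this toy data the best location by rating ('R') differs from the best by votes ('P'), so a location can be called best with confidence only when it leads on both measures.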

**3) Find the types of restaurants and their counts.**

In [None]:
zmt.head(1)

In [None]:
plt.figure(figsize=(10,20))
rest_types=zmt.groupby('rest_type')['restaurants'].count().reset_index()

sorted_rest_types=rest_types.sort_values('restaurants',ascending=False).head(40)

rest_types.rest_type.count(),sns.barplot(x='restaurants',y='rest_type',data=sorted_rest_types,orient='h')

In the output above, the first line shows the number of distinct rest_type values, which is 87. We plotted only the 40 types with the highest restaurant counts.

Ex:

About 14000 Quick Bites restaurants.

About 10000 Casual Dining restaurants.

and so on...

**Find the cost distribution of restaurants.**

In [None]:
zmt.head(1)

In [None]:
zmt['cost'] = pd.to_numeric(zmt['cost'], errors='coerce')  # 'coerce' to handle non-numeric values
plt.figure(figsize=(10, 5))
sns.kdeplot(x='cost', data=zmt)
plt.show()

By observing the above KDE plot, we can say that most restaurants cost between Rs. 1 and Rs. 1000 for two people, and the rest cost more than Rs. 1000.

**4) Find the count of restaurants that have a table booking facility.**

Here we could use a countplot, but let's draw a pointplot to see how it looks.

In [None]:
plt.figure(figsize=(10,5))
table_booking= zmt.groupby('book_table')['restaurants'].count().reset_index()
sns.pointplot(x='book_table',y='restaurants',color='b',data=table_booking)
plt.show()

From the above pointplot we can say that only about 7000 restaurants have a table booking facility and 35000+ restaurants do not.

**5) Find the number of restaurants at given locations.**

* 'BTM'
* 'Basavanagudi'
* 'West Bangalore'
* 'Whitefield'
* 'Yeshwantpur'

In [None]:
df=pd.DataFrame(zmt.groupby('location')['restaurants'].count()).reset_index()
criteria=df['location'].isin(['BTM', 'Basavanagudi','West Bangalore','Whitefield','Yeshwantpur'])
plt.figure(figsize=(10,5))
sns.barplot(x='location',y='restaurants',data=df[criteria])
plt.show()

From the plot we can see how many restaurants are present at each of the given locations.

Ex:

The BTM location has 4000+ restaurants.

**6) Find the most famous restaurant chains (franchises, i.e. restaurants having more than one branch) in Bangalore.**

We are working on a Zomato dataset for Bangalore, so we can assume that all locations belong to Bangalore.

* In this problem we need to find, at each location, the restaurant that has more than 1 branch and a high rating.

In [None]:
df1=pd.DataFrame(zmt.groupby(['location', 'restaurants','rating']).count()).reset_index()
df1

In [None]:
sns.displot(df1['cost'])

Note that every remaining column now holds a count: the number of rows for that restaurant (with that rating) at the given location.

We can see that some restaurants have a count of 1 and some have more than 1. We want the restaurants that have a count greater than 1 and a high rating.

In [None]:
chain_restaurants = df1[df1['book_table'] > 1]
famous_restaurants = chain_restaurants.groupby('location')[['restaurants', 'rating']].max().reset_index()
famous_restaurants

Finally, for each location we got a restaurant that has more than one branch and the highest rating. We got 89 results in total. Let's plot only the first 5 to get an idea.
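One detail worth checking in this step: `groupby(...)[[...]].max()` takes the maximum of each column separately, so the restaurant name (the alphabetical maximum) and the rating (the numeric maximum) may come from different rows. The sketch below, on invented data, contrasts that with selecting the whole top-rated row via `idxmax`. It is an alternative shown for comparison, not the method used above.

```python
import pandas as pd

# Hypothetical chain restaurants; names, locations and ratings are made up.
chains = pd.DataFrame({
    'location':    ['P', 'P', 'Q'],
    'restaurants': ['Alpha', 'Zeta', 'Beta'],
    'rating':      [4.8, 3.5, 4.1],
})

# Column-wise maximum: each column is maximised independently.
colwise = chains.groupby('location')[['restaurants', 'rating']].max().reset_index()

# Row-wise alternative: keep the full row that holds the highest rating per location.
best_rows = chains.loc[chains.groupby('location')['rating'].idxmax()].reset_index(drop=True)

print(colwise.loc[0, 'restaurants'])    # Zeta
print(best_rows.loc[0, 'restaurants'])  # Alpha
```

In the toy data the column-wise maximum pairs 'Zeta' with the rating 4.8 although no row contains that pair, while the `idxmax` version returns 'Alpha', the restaurant that actually holds the top rating.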

In [None]:
sns.catplot(x='location', y='rating', hue="restaurants", kind='bar', height=7, data=famous_restaurants.head())


Here we can see the famous restaurants that have more than one branch at a location, along with their ratings.

Ex.:

At the BTM location in Bangalore, "eat.fit" is a famous restaurant franchise with the maximum rating (4.9) compared to the other restaurant franchises at the same location.

At the Banashankari location in Bangalore, "Yo Roll Corner" is a famous restaurant franchise with the maximum rating (4.6) compared to the other restaurant franchises at the same location.

At the Banaswadi location in Bangalore, "Zam Zam Restaurant" is a famous restaurant franchise with the maximum rating (4.0) compared to the other restaurant franchises at the same location.
and so on...

**7) Find how many voters gave a rating for each 'type' and the aggregate rating of that 'type'.**

In [None]:
zmt.head(1)

In [None]:
plt.figure(figsize=(15,30))
df2=zmt.groupby('type').agg({'votes':'sum', 'rating':'mean'}).nlargest(7,['votes']).reset_index()
sns.catplot(x='type', y='rating', hue='votes', kind='bar', height=8, data=df2)
plt.show()

**8) Gaussian Rest Type(Normal Distribution) of Rating.**

In [None]:
sns.displot(zmt['rating'])

The ratings follow a roughly normal (bell-shaped) distribution, and we observe that most ratings fall between 3.5 and 4.5.

**9) Find how many restaurants have Chinese and North Indian food in their food type.**

In [None]:
Chinese=len([i for i in zmt['food_type'] if 'Chinese' in i])
North_Indian=len([i for i in zmt['food_type'] if 'North Indian' in i])
Restaurant_count=[Chinese,North_Indian]
Food_Type=['Chinese','North Indian']
df3 = pd.DataFrame({'Food_Type': Food_Type, 'Restaurant_count': Restaurant_count})

In [None]:
sns.barplot(x='Food_Type', y='Restaurant_count', data=df3)
plt.show()

Here we got the actual counts of restaurants that serve Chinese food and North Indian food.
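The same counts can be sketched with `str.contains`, which also makes it easy to count restaurants that serve both cuisines. The food_type strings below are hypothetical examples.

```python
import pandas as pd

# Hypothetical comma-separated cuisine lists, in the style of 'food_type'.
food_type = pd.Series([
    'North Indian, Chinese',
    'Chinese, Thai',
    'South Indian',
    'North Indian',
])

# Literal substring tests (regex=False), one boolean per restaurant.
has_chinese = food_type.str.contains('Chinese', regex=False)
has_north_indian = food_type.str.contains('North Indian', regex=False)

n_chinese = int(has_chinese.sum())
n_north_indian = int(has_north_indian.sum())
n_both = int((has_chinese & has_north_indian).sum())
print(n_chinese, n_north_indian, n_both)  # 2 2 1
```

Keeping the boolean masks around means the "both" count is a single `&` away, which a list comprehension per cuisine does not give directly.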

**10) Find the most profitable type of restaurant.**

In [None]:
df4=zmt.groupby('type').agg({'cost':'mean'})
df4.cost.plot(kind='pie', autopct='%1.1f%%', figsize=(9,9), shadow=True)
plt.show()

Here, looking at each type's share of the average cost, we can observe that restaurants with the Drinks & nightlife facility have the highest average cost. Taking cost as a rough proxy for profit, they are more profitable than the other types.
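The percentages in the pie chart are each type's mean cost divided by the sum of all the mean costs. A small sketch of that arithmetic, on made-up types and costs:

```python
import pandas as pd

# Hypothetical rows; the type names and costs are assumptions for illustration.
toy = pd.DataFrame({
    'type': ['Buffet', 'Buffet', 'Delivery', 'Delivery'],
    'cost': [1000, 1400, 300, 500],
})

# Mean cost per type, then each type's share of the total of those means.
mean_cost = toy.groupby('type')['cost'].mean()
share = mean_cost / mean_cost.sum() * 100

top_type = mean_cost.idxmax()
print(top_type)  # Buffet
```

Here the mean costs are 1200 and 400, so the pie would show 75% and 25%. The shares rank the types by average cost for two people and do not measure revenue or margin directly.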



# Conclusion:

By exploring the given dataset we learned several things:

* How to clean our data.
* How to interpret data by visualizing it.