
# Problem Description

Zomato is an Indian restaurant aggregator and food delivery start-up founded by Deepinder Goyal
and Pankaj Chaddah in 2008. It provides information, menus and user reviews of restaurants, and
offers food delivery from partner restaurants in select cities. India is known for the diverse
cuisines served across its many restaurants and hotel resorts, a culinary expression of unity in
diversity. The restaurant business in India keeps evolving, and more Indians are warming up to
restaurant food, whether by dining out or having it delivered. The growing number of restaurants
in every state motivates a closer look at the data for insights, facts and figures about the
Indian food industry in each city. This project therefore analyses the Zomato restaurant data for
each city in India.

The project serves both customers and the company. The task is to analyse the sentiment of the
customer reviews in the data, present the conclusions as visualizations, and cluster the Zomato
restaurants into segments. Visualizing the data makes it quick to analyse at a glance. The
analysis also addresses business cases that can help customers find the best restaurant in their
locality and help the company grow and improve in the areas where it currently lags. The data
also holds valuable information on cuisine and cost that can be used for cost vs. benefit
analysis, and the reviewer metadata can be used to identify critics in the industry.

# Data Description

## Restaurants Data

* Name - Name of the restaurant
* Links - URL link of the restaurant
* Cost - Estimated per-person cost of dining
* Collections - Tagging of restaurants w.r.t. Zomato categories
* Cuisines - Cuisines served by the restaurant
* Timings - Restaurant opening hours

## Review Data

* Reviewer - Name of the reviewer
* Review - Review text
* Rating - Rating provided
* Metadata - Reviewer metadata: number of reviews and followers
* Time - Date and time of the review
* Pictures - Number of pictures posted with the review
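
The Metadata field packs two counts into one string, so it usually has to be split before it can help identify prolific reviewers. The following is a minimal sketch, not the notebook's own method: the helper `extract_count` and the column names `n_reviews` and `n_followers` are hypothetical, the sample strings are taken from outputs shown later in this notebook, and it assumes counts carry no thousands separators and that a missing follower count means zero.

```python
import pandas as pd

# Sample values as they appear in the Metadata column of the review data.
metadata = pd.Series([
    "1 Review , 2 Followers",
    "3 Reviews , 2 Followers",
    "1 Review , 1 Follower",
    "1 Review",
])


def extract_count(series, label):
    """Return the integer that precedes `label` in each string (0 if absent)."""
    # Hypothetical helper: captures the digits before e.g. "Review" or "Reviews".
    counts = series.str.extract(rf"(\d+)\s+{label}", expand=False)
    # Assumption: an absent count (NaN after extraction) is treated as 0.
    return pd.to_numeric(counts).fillna(0).astype(int)


reviewer_stats = pd.DataFrame({
    "n_reviews": extract_count(metadata, "Review"),
    "n_followers": extract_count(metadata, "Follower"),
})

print(reviewer_stats)
```

Keeping the two counts as separate integer columns makes it straightforward to sort or filter reviewers by activity.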

# Knowing the data

In [46]:
#importing all important packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
import time
from wordcloud import WordCloud
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import precision_score,recall_score,f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
sns.set_style("whitegrid",{'grid.linestyle': '--'})

In [47]:
#mounting drive

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [48]:
#restaurant data
restaurants_data = pd.read_csv('/content/drive/MyDrive/ZomatoProject/Zomato Restaurant names and Metadata.csv')

#review data
review_data = pd.read_csv('/content/drive/MyDrive/ZomatoProject/Zomato Restaurant reviews.csv')

## Restaurants Data

In [49]:
#data head
restaurants_data.head()

| | Name | Links | Cost | Collections | Cuisines | Timings |
|---|---|---|---|---|---|---|
| 0 | Beyond Flavours | https://www.zomato.com/hyderabad/beyond-flavou... | 800 | Food Hygiene Rated Restaurants in Hyderabad, C... | Chinese, Continental, Kebab, European, South I... | 12noon to 3:30pm, 6:30pm to 11:30pm (Mon-Sun) |
| 1 | Paradise | https://www.zomato.com/hyderabad/paradise-gach... | 800 | Hyderabad's Hottest | Biryani, North Indian, Chinese | 11 AM to 11 PM |
| 2 | Flechazo | https://www.zomato.com/hyderabad/flechazo-gach... | 1300 | Great Buffets, Hyderabad's Hottest | Asian, Mediterranean, North Indian, Desserts | 11:30 AM to 4:30 PM, 6:30 PM to 11 PM |
| 3 | Shah Ghouse Hotel & Restaurant | https://www.zomato.com/hyderabad/shah-ghouse-h... | 800 | Late Night Restaurants | Biryani, North Indian, Chinese, Seafood, Bever... | 12 Noon to 2 AM |
| 4 | Over The Moon Brew Company | https://www.zomato.com/hyderabad/over-the-moon... | 1200 | Best Bars & Pubs, Food Hygiene Rated Restauran... | Asian, Continental, North Indian, Chinese, Med... | 12noon to 11pm (Mon, Tue, Wed, Thu, Sun), 12no... |


In [50]:
#data shape
restaurants_data.shape

(105, 6)

In [51]:
#information of data
restaurants_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105 entries, 0 to 104
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Name         105 non-null    object
 1   Links        105 non-null    object
 2   Cost         105 non-null    object
 3   Collections  51 non-null     object
 4   Cuisines     105 non-null    object
 5   Timings      104 non-null    object
dtypes: object(6)
memory usage: 5.0+ KB


In [52]:
#checking for null values

restaurants_data.isnull().mean()

Name           0.000000
Links          0.000000
Cost           0.000000
Collections    0.514286
Cuisines       0.000000
Timings        0.009524
dtype: float64
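
The missing-value fractions above can be turned into a simple rule for deciding which columns to drop and which rows to filter. This is a sketch under stated assumptions: the frame `toy` is made-up data standing in for `restaurants_data`, and the 0.5 cut-off in `threshold` is a hypothetical choice, not a value the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for restaurants_data; the values are illustrative only.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Collections": [np.nan, "Great Buffets", np.nan, np.nan],
    "Timings": ["11 AM to 11 PM", None, "12 Noon to 2 AM", "11 AM to 11 PM"],
})

# Fraction of missing values per column, as computed with isnull().mean().
missing_share = toy.isnull().mean()

# Hypothetical cut-off: drop any column that is more than half empty.
threshold = 0.5
sparse_columns = missing_share[missing_share > threshold].index.tolist()

# Drop the sparse columns first, then drop the few rows that still have gaps.
cleaned = toy.drop(columns=sparse_columns).dropna()

print(sparse_columns)
print(cleaned.shape)
```

Dropping sparse columns before dropping rows avoids discarding rows only because of a column that will be removed anyway.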

In [53]:
#getting description of the data

restaurants_data.describe().transpose()

| | count | unique | top | freq |
|---|---|---|---|---|
| Name | 105 | 105 | Beyond Flavours | 1 |
| Links | 105 | 105 | https://www.zomato.com/hyderabad/beyond-flavou... | 1 |
| Cost | 105 | 29 | 500 | 13 |
| Collections | 51 | 42 | Food Hygiene Rated Restaurants in Hyderabad | 4 |
| Cuisines | 105 | 92 | North Indian, Chinese | 4 |
| Timings | 104 | 77 | 11 AM to 11 PM | 6 |


In [54]:
#checking for duplicated rows

duplicated_rows = restaurants_data.duplicated().sum()
print(duplicated_rows)

0


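Observation - The restaurants data consists of 105 rows and 6 columns with no duplicated rows. Collections is missing for about 51% of the restaurants, Timings is missing for one, and Cost is stored as text (object) rather than as a number.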

## Review Data

In [55]:
#head of data
review_data.head()

| | Restaurant | Reviewer | Review | Rating | Metadata | Time | Pictures |
|---|---|---|---|---|---|---|---|
| 0 | Beyond Flavours | Rusha Chakraborty | The ambience was good, food was quite good . h... | 5 | 1 Review , 2 Followers | 5/25/2019 15:54 | 0 |
| 1 | Beyond Flavours | Anusha Tirumalaneedi | Ambience is too good for a pleasant evening. S... | 5 | 3 Reviews , 2 Followers | 5/25/2019 14:20 | 0 |
| 2 | Beyond Flavours | Ashok Shekhawat | A must try.. great food great ambience. Thnx f... | 5 | 2 Reviews , 3 Followers | 5/24/2019 22:54 | 0 |
| 3 | Beyond Flavours | Swapnil Sarkar | Soumen das and Arun was a great guy. Only beca... | 5 | 1 Review , 1 Follower | 5/24/2019 22:11 | 0 |
| 4 | Beyond Flavours | Dileep | Food is good.we ordered Kodi drumsticks and ba... | 5 | 3 Reviews , 2 Followers | 5/24/2019 21:37 | 0 |


In [56]:
review_data.shape

(10000, 7)

In [57]:
review_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Restaurant  10000 non-null  object
 1   Reviewer    9962 non-null   object
 2   Review      9955 non-null   object
 3   Rating      9962 non-null   object
 4   Metadata    9962 non-null   object
 5   Time        9962 non-null   object
 6   Pictures    10000 non-null  int64 
dtypes: int64(1), object(6)
memory usage: 547.0+ KB


In [58]:
review_data.isnull().sum()

Restaurant     0
Reviewer      38
Review        45
Rating        38
Metadata      38
Time          38
Pictures       0
dtype: int64

In [59]:
review_data.describe(include='all').transpose()

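| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Restaurant | 10000.0 | 100.0 | Beyond Flavours | 100.0 | | | | | | | |
| Reviewer | 9962.0 | 7446.0 | Parijat Ray | 13.0 | | | | | | | |
| Review | 9955.0 | 9364.0 | good | 237.0 | | | | | | | |
| Rating | 9962.0 | 10.0 | 5 | 3832.0 | | | | | | | |
| Metadata | 9962.0 | 2477.0 | 1 Review | 919.0 | | | | | | | |
| Time | 9962.0 | 9782.0 | 7/29/2018 20:34 | 3.0 | | | | | | | |
| Pictures | 10000.0 | | | | 0.7486 | 2.570381 | 0.0 | 0.0 | 0.0 | 0.0 | 64.0 |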


In [60]:
#rows rated 'Like' are not numeric; this assignment blanks every column of those rows
review_data.loc[review_data['Rating'] == 'Like'] = np.nan

review_data['Rating'] = review_data['Rating'].astype('float64')

print(review_data.groupby('Restaurant')['Rating'].mean())

Restaurant
10 Downing Street                        3.80
13 Dhaba                                 3.48
3B's - Buddies, Bar & Barbecue           4.76
AB's - Absolute Barbecues                4.88
Absolute Sizzlers                        3.62
                                         ... 
Urban Asia - Kitchen & Bar               3.65
Yum Yum Tree - The Arabian Food Court    3.56
Zega - Sheraton Hyderabad Hotel          4.45
Zing's Northeast Kitchen                 3.65
eat.fit                                  3.20
Name: Rating, Length: 100, dtype: float64
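
An alternative way to handle the non-numeric rating is `pd.to_numeric` with `errors="coerce"`, which converts only the Rating column and leaves the rest of each row intact. This is a sketch on made-up data, not the approach the cell above takes: the frame `toy_reviews` and its values are illustrative assumptions.

```python
import pandas as pd

# Toy reviews; 'Like' mimics the non-numeric rating found in this dataset.
toy_reviews = pd.DataFrame({
    "Restaurant": ["Beyond Flavours", "Beyond Flavours", "Paradise", "Paradise"],
    "Rating": ["5", "Like", "4", "3.5"],
    "Pictures": [0, 2, 1, 0],
})

# errors="coerce" turns anything that is not a number into NaN,
# without modifying the other columns of that row.
toy_reviews["Rating"] = pd.to_numeric(toy_reviews["Rating"], errors="coerce")

# mean() skips NaN, so the 'Like' row does not affect the average.
mean_rating = toy_reviews.groupby("Restaurant")["Rating"].mean()

print(mean_rating)
print(toy_reviews.dtypes)
```

Because only Rating is coerced, Pictures keeps its integer dtype and the Restaurant name of the affected row is preserved.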


In [61]:
review_data.dtypes

Restaurant     object
Reviewer       object
Review         object
Rating        float64
Metadata       object
Time           object
Pictures      float64
dtype: object
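
The Time column is still stored as text at this point. A minimal sketch of parsing it into timestamps follows. The sample strings come from the outputs above, and the month/day/year order is inferred from values such as `5/25/2019`, which cannot be read day-first. The variable names are hypothetical.

```python
import pandas as pd

# Sample timestamps as they appear in the Time column, plus one missing value.
times = pd.Series(["5/25/2019 15:54", "5/24/2019 22:54", "7/29/2018 20:34", None])

# Explicit format: month/day/year followed by a 24-hour clock time.
# errors="coerce" maps unparseable or missing entries to NaT.
parsed = pd.to_datetime(times, format="%m/%d/%Y %H:%M", errors="coerce")

# Once parsed, calendar parts are available through the .dt accessor.
review_year = parsed.dt.year
review_hour = parsed.dt.hour

print(parsed)
```

Stating the format explicitly avoids ambiguous day/month guesses.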

# Data Wrangling

In [62]:
#dropping the Collections column because about 51% of its values are missing
restaurants_data.drop('Collections', axis=1, inplace=True)

#dropping the one row with a missing Timings value
restaurants_data.dropna(subset=['Timings'],inplace=True)

print(restaurants_data.isnull().sum())

Name        0
Links       0
Cost        0
Cuisines    0
Timings     0
dtype: int64


In [63]:
#changing cost datatype from object to float

restaurants_data['Cost'] = restaurants_data['Cost'].str.replace(',', '').astype(float)
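
The conversion above can be checked on a few values in isolation. This is a sketch under an assumption: the comma-formatted strings such as `"1,300"` are illustrative, inferred from the fact that the cell strips commas, and are not copied from the raw file.

```python
import pandas as pd

# Illustrative cost strings; the comma-formatted ones are an assumption.
cost_raw = pd.Series(["800", "1,300", "500", "1,200"])

# Remove thousands separators as literal text, then cast to float.
cost = cost_raw.str.replace(",", "", regex=False).astype(float)

print(cost)
print(cost.dtype)
```

Passing `regex=False` makes it explicit that the comma is matched as a literal character rather than as a pattern.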
