# Taste the problems of Zomato


This notebook works through a sample of Zomato data to answer stakeholder questions, identifying patterns in the data and turning them into insights that address those questions.


***Questions to Answer***

- **Do a greater number of restaurants provide online delivery as opposed to offline services?**
- **Which types of restaurants are the most favored by the general public?**
- **What price range is preferred by couples for their dinner at restaurants?**

Import the libraries used in this notebook:

In [30]:
# %pip install numpy pandas seaborn matplotlib

In [31]:
import pandas as pd
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
import prettytable
prettytable.DEFAULT='DEFAULT'

In [32]:
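# Let's load the data. pd.read_csv raises an exception on a bad path (it never returns None),
# so the failure case is handled with try/except.
try:
    df=pd.read_csv('D:\\Drive into Analysis\\Zomato Data Analysis\\Zomato-data-.csv')
    print(df.head(5))
except FileNotFoundError:
    print('Failed to load the data from the given path')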

                    name online_order book_table   rate  votes  \
0                  Jalsa          Yes        Yes  4.1/5    775   
1         Spice Elephant          Yes         No  4.1/5    787   
2        San Churro Cafe          Yes         No  3.8/5    918   
3  Addhuri Udupi Bhojana           No         No  3.7/5     88   
4          Grand Village           No         No  3.8/5    166   

   approx_cost(for two people) listed_in(type)  
0                          800          Buffet  
1                          800          Buffet  
2                          800          Buffet  
3                          300          Buffet  
4                          600          Buffet  


In [33]:
# Check for null values 
df.isna().sum()

name                           0
online_order                   0
book_table                     0
rate                           0
votes                          0
approx_cost(for two people)    0
listed_in(type)                0
dtype: int64

In [34]:
# Let's check the data type of each column
df.dtypes

name                           object
online_order                   object
book_table                     object
rate                           object
votes                           int64
approx_cost(for two people)     int64
listed_in(type)                object
dtype: object

In [35]:
# Since 'rate' is stored as an object (e.g. '4.1/5'), keep only the part before the '/' so it can be converted to float
def handle(value):
    value=str(value).split('/')
    value=value[0]
    return value
df['rate']=df['rate'].apply(handle)
df.head(2)

Unnamed: 0,name,online_order,book_table,rate,votes,approx_cost(for two people),listed_in(type)
0,Jalsa,Yes,Yes,4.1,775,800,Buffet
1,Spice Elephant,Yes,No,4.1,787,800,Buffet
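
The same cleanup can also be done without a row-wise `apply`. The sketch below is one possible alternative, not the method used in this notebook. It runs on a small hypothetical frame (`sample`) that mimics the `rate` format shown above, and it uses `pd.to_numeric(..., errors='coerce')` so that a non-numeric entry becomes `NaN` instead of raising. The `'NEW'` entry is an assumption added for illustration; nothing in the output above shows such a value in this data set.

```python
import pandas as pd

# Hypothetical sample mimicking the 'rate' column format ("4.1/5").
# 'NEW' is an invented non-numeric entry used only to show the coercion.
sample = pd.DataFrame({'rate': ['4.1/5', '3.8/5', '3.7/5', 'NEW']})

# Keep the part before '/', using vectorised string methods.
numerator = sample['rate'].str.split('/').str[0]

# Convert to float; anything that is not a number becomes NaN.
sample['rate_num'] = pd.to_numeric(numerator, errors='coerce')

print(sample)
```

Writing to a new column (`rate_num`, a hypothetical name) keeps the raw strings available for checking.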


In [36]:
# Convert rate to float
df['rate']=df['rate'].astype(float)
df.dtypes

name                            object
online_order                    object
book_table                      object
rate                           float64
votes                            int64
approx_cost(for two people)      int64
listed_in(type)                 object
dtype: object

In [37]:
# There are no null values in the data set, so let's look at the summary statistics

df.describe(include='all')

Unnamed: 0,name,online_order,book_table,rate,votes,approx_cost(for two people),listed_in(type)
count,148,148,148,148.0,148.0,148.0,148
unique,145,2,2,,,,4
top,San Churro Cafe,No,No,,,,Dining
freq,2,90,140,,,,110
mean,,,,3.633108,264.810811,418.243243,
std,,,,0.402271,653.676951,223.085098,
min,,,,2.6,0.0,100.0,
25%,,,,3.3,6.75,200.0,
50%,,,,3.7,43.5,400.0,
75%,,,,3.9,221.75,600.0,


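The statistics above make the following points clear:
- Ratings are tightly clustered (mean 3.63, std 0.40).
- Votes are widely spread (mean 264.8, std 653.7), while the approximate cost shows a moderate spread (mean 418.2, std 223.1).
- There are 145 unique restaurant names.
- The total row count is 148.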
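
One way to put numbers on "tight" versus "wide" spread is the coefficient of variation (standard deviation divided by mean). The sketch below only reuses the mean and std values printed by `describe()` above; the dictionary name `summary` is illustrative.

```python
# Mean and standard deviation copied from the describe() output above.
summary = {
    'rate': (3.633108, 0.402271),
    'votes': (264.810811, 653.676951),
    'approx_cost(for two people)': (418.243243, 223.085098),
}

# Coefficient of variation = std / mean.
# Larger values mean a wider spread relative to the typical value.
cv = {col: std / mean for col, (mean, std) in summary.items()}

for col, value in cv.items():
    print(f'{col}: {value:.2f}')
```

With these inputs the ratio is about 0.11 for rate, 2.47 for votes and 0.53 for cost, which matches the ordering described in the bullets.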


In [38]:
# Information about data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148 entries, 0 to 147
Data columns (total 7 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   name                         148 non-null    object 
 1   online_order                 148 non-null    object 
 2   book_table                   148 non-null    object 
 3   rate                         148 non-null    float64
 4   votes                        148 non-null    int64  
 5   approx_cost(for two people)  148 non-null    int64  
 6   listed_in(type)              148 non-null    object 
dtypes: float64(1), int64(2), object(4)
memory usage: 8.2+ KB


In [39]:
# Let's rename the columns appropriately
df.rename(columns={'name':'Restaurant','online_order':'Online','book_table':'Bookings',
                   'rate':'Ratings(5)','votes':'Vote','listed_in(type)':'Type',
                   'approx_cost(for two people)':'Average Cost( For 2)'},inplace=True)

In [40]:
# Check whether the columns were renamed
df.head(5)

Unnamed: 0,Restaurant,Online,Bookings,Ratings(5),Vote,Average Cost( For 2),Type
0,Jalsa,Yes,Yes,4.1,775,800,Buffet
1,Spice Elephant,Yes,No,4.1,787,800,Buffet
2,San Churro Cafe,Yes,No,3.8,918,800,Buffet
3,Addhuri Udupi Bhojana,No,No,3.7,88,300,Buffet
4,Grand Village,No,No,3.8,166,600,Buffet


In [41]:
# Normalisation of votes and average cost (useful for algorithms that need features on comparable scales)

# Z-score normalisation centres the values at 0 with unit standard deviation; the result is not bounded, so outliers still appear as large scores
df['Vote']=(df['Vote']-df['Vote'].mean())/df['Vote'].std()

# Min-max scaling maps the values into the range 0-1; it is sensitive to outliers because the minimum and maximum define the scale
df['Average Cost( For 2)']=(df['Average Cost( For 2)']-df['Average Cost( For 2)'].min())/(df['Average Cost( For 2)'].max()-df['Average Cost( For 2)'].min())

In [42]:
df.head(5)

Unnamed: 0,Restaurant,Online,Bookings,Ratings(5),Vote,Average Cost( For 2),Type
0,Jalsa,Yes,Yes,4.1,0.780491,0.823529,Buffet
1,Spice Elephant,Yes,No,4.1,0.798849,0.823529,Buffet
2,San Churro Cafe,Yes,No,3.8,0.999254,0.823529,Buffet
3,Addhuri Udupi Bhojana,No,No,3.7,-0.270487,0.235294,Buffet
4,Grand Village,No,No,3.8,-0.151162,0.588235,Buffet
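
Because the normalisation cell overwrites `Vote` and `Average Cost( For 2)` in place, the raw counts and prices are no longer in `df` for later questions about price range. A minimal sketch of the same two formulas writing to new columns instead is shown below. The sample frame reuses the first five rows shown earlier, and the column names `Vote_z` and `Cost_minmax` are hypothetical.

```python
import pandas as pd

# Hypothetical sample built from the first five rows shown earlier.
sample = pd.DataFrame({
    'Vote': [775, 787, 918, 88, 166],
    'Average Cost( For 2)': [800, 800, 800, 300, 600],
})

# Z-score: subtract the mean, divide by the sample standard deviation
# (pandas uses ddof=1 by default).
votes = sample['Vote']
sample['Vote_z'] = (votes - votes.mean()) / votes.std()

# Min-max: rescale linearly so the minimum maps to 0 and the maximum to 1.
cost = sample['Average Cost( For 2)']
sample['Cost_minmax'] = (cost - cost.min()) / (cost.max() - cost.min())

print(sample)
```

The scaled values here differ from the notebook output because the mean, standard deviation, minimum and maximum are taken over five rows only, not over all 148.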


In [43]:
# Create dummy (one-hot) columns for Type; this encodes the categories for algorithms that cannot process categorical data directly

dummy=pd.get_dummies(df['Type'])
dummy.head(5)

Unnamed: 0,Buffet,Cafes,Dining,other
0,True,False,False,False
1,True,False,False,False
2,True,False,False,False
3,True,False,False,False
4,True,False,False,False


In [44]:
df=pd.concat([df,dummy],axis=1)
df.head(5)

Unnamed: 0,Restaurant,Online,Bookings,Ratings(5),Vote,Average Cost( For 2),Type,Buffet,Cafes,Dining,other
0,Jalsa,Yes,Yes,4.1,0.780491,0.823529,Buffet,True,False,False,False
1,Spice Elephant,Yes,No,4.1,0.798849,0.823529,Buffet,True,False,False,False
2,San Churro Cafe,Yes,No,3.8,0.999254,0.823529,Buffet,True,False,False,False
3,Addhuri Udupi Bhojana,No,No,3.7,-0.270487,0.235294,Buffet,True,False,False,False
4,Grand Village,No,No,3.8,-0.151162,0.588235,Buffet,True,False,False,False


In [45]:
df.drop(columns=['Type'],inplace=True)
df.head(5)

Unnamed: 0,Restaurant,Online,Bookings,Ratings(5),Vote,Average Cost( For 2),Buffet,Cafes,Dining,other
0,Jalsa,Yes,Yes,4.1,0.780491,0.823529,True,False,False,False
1,Spice Elephant,Yes,No,4.1,0.798849,0.823529,True,False,False,False
2,San Churro Cafe,Yes,No,3.8,0.999254,0.823529,True,False,False,False
3,Addhuri Udupi Bhojana,No,No,3.7,-0.270487,0.235294,True,False,False,False
4,Grand Village,No,No,3.8,-0.151162,0.588235,True,False,False,False
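
The three steps above (create the dummies, concatenate them, drop `Type`) can also be collapsed into one call. The sketch below runs on a hypothetical sample frame. When `columns=[...]` is passed, `pd.get_dummies` prefixes the new column names with the source column name by default, so `prefix=''` and `prefix_sep=''` are set here to keep the bare category names used in this notebook.

```python
import pandas as pd

# Hypothetical sample; 'Hypothetical Cafe' is an invented row for illustration.
sample = pd.DataFrame({
    'Restaurant': ['Jalsa', 'Spice Elephant', 'Hypothetical Cafe'],
    'Type': ['Buffet', 'Buffet', 'Cafes'],
})

# One call: encodes 'Type', appends the dummy columns, and drops 'Type'.
encoded = pd.get_dummies(sample, columns=['Type'], prefix='', prefix_sep='')

print(encoded.columns.tolist())
```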


In [46]:
df['Buffet'].value_counts()


Buffet
False    141
True       7
Name: count, dtype: int64

In [47]:
df['Cafes'].value_counts()


Cafes
False    125
True      23
Name: count, dtype: int64

In [48]:
df['other'].value_counts()



other
False    140
True       8
Name: count, dtype: int64

In [49]:
df['Dining'].value_counts()

Dining
True     110
False     38
Name: count, dtype: int64
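
The four `value_counts()` calls above can be condensed, since summing a boolean column counts its `True` entries. The sketch below uses a hypothetical sample; on the notebook's `df` the equivalent expression would be `df[['Buffet', 'Cafes', 'Dining', 'other']].sum()`.

```python
import pandas as pd

# Hypothetical one-hot sample with the same four type columns.
sample = pd.DataFrame({
    'Buffet': [True, False, False, False],
    'Cafes':  [False, True, False, False],
    'Dining': [False, False, True, True],
    'other':  [False, False, False, False],
})

# Sum each boolean column to count restaurants per type, largest first.
type_counts = (
    sample[['Buffet', 'Cafes', 'Dining', 'other']]
    .sum()
    .sort_values(ascending=False)
)

print(type_counts)
```

Going by the individual outputs above, the full data set has 110 Dining, 23 Cafes, 8 other and 7 Buffet entries, which adds up to the 148 rows.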

In [50]:
df.describe(include='all')

Unnamed: 0,Restaurant,Online,Bookings,Ratings(5),Vote,Average Cost( For 2),Buffet,Cafes,Dining,other
count,148,148,148,148.0,148.0,148.0,148,148,148,148
unique,145,2,2,,,,2,2,2,2
top,San Churro Cafe,No,No,,,,False,False,True,False
freq,2,90,140,,,,141,125,110,140
mean,,,,3.633108,-2.4004820000000003e-17,0.374404,,,,
std,,,,0.402271,1.0,0.262453,,,,
min,,,,2.6,-0.4051096,0.0,,,,
25%,,,,3.3,-0.3947834,0.117647,,,,
50%,,,,3.7,-0.338563,0.352941,,,,
75%,,,,3.9,-0.06587476,0.588235,,,,
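
The `describe()` output above already speaks to the first question: `Online` has two unique values, and the most frequent one is `No`, with 90 of 148 rows. A sketch of how the shares could be computed directly is shown below. The series is rebuilt from those two counts (90 `No`, hence 58 `Yes`), not read from the file; on the notebook's `df` the equivalent would be `df['Online'].value_counts(normalize=True)`.

```python
import pandas as pd

# Series rebuilt from the describe() counts: 90 'No' out of 148 rows.
online = pd.Series(['No'] * 90 + ['Yes'] * (148 - 90), name='Online')

# normalize=True returns proportions instead of raw counts.
shares = online.value_counts(normalize=True)

print(shares.round(3))
```

On these counts about 61% of the listed restaurants do not take online orders and about 39% do.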


In [51]:
# Create labels for low and high ratings (two equal-width bins between the minimum and maximum rating)
bins=np.linspace(df['Ratings(5)'].min(),df['Ratings(5)'].max(),3)
labels=['Low','High']
df['Rating Labels']=pd.cut(df['Ratings(5)'],bins,labels=labels,include_lowest=True)
df.head(5)

Unnamed: 0,Restaurant,Online,Bookings,Ratings(5),Vote,Average Cost( For 2),Buffet,Cafes,Dining,other,Rating Labels
0,Jalsa,Yes,Yes,4.1,0.780491,0.823529,True,False,False,False,High
1,Spice Elephant,Yes,No,4.1,0.798849,0.823529,True,False,False,False,High
2,San Churro Cafe,Yes,No,3.8,0.999254,0.823529,True,False,False,False,High
3,Addhuri Udupi Bhojana,No,No,3.7,-0.270487,0.235294,True,False,False,False,High
4,Grand Village,No,No,3.8,-0.151162,0.588235,True,False,False,False,High
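
To see where the Low/High boundary falls, it helps to print the bin edges. In the sketch below the ratings are a hypothetical sample, not the notebook's data. With `np.linspace(min, max, 3)` the middle edge is the midpoint between the smallest and largest rating, so the split is by value range and not by the median.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings used only to illustrate the binning.
ratings = pd.Series([2.6, 3.3, 3.7, 3.9, 4.6])

# Three edges give two equal-width bins: [min, mid] and (mid, max].
bins = np.linspace(ratings.min(), ratings.max(), 3)

# include_lowest=True makes the first bin include the minimum rating.
rating_labels = pd.cut(ratings, bins, labels=['Low', 'High'], include_lowest=True)

print(bins)
print(rating_labels.value_counts())
```

Since equal-width bins can leave the two groups very unequal in size, `pd.qcut` is an alternative when groups of similar size are wanted.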
