In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv("../data/raw/zomato.csv")

df.shape


(51717, 17)

In [2]:
import os
os.getcwd()


'c:\\Users\\Nishi\\Desktop\\Data_Analytics\\Zomato_Data_Analytics\\notebooks'

In [3]:
df.shape
df.columns
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51717 entries, 0 to 51716
Data columns (total 17 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   url                          51717 non-null  object
 1   address                      51717 non-null  object
 2   name                         51717 non-null  object
 3   online_order                 51717 non-null  object
 4   book_table                   51717 non-null  object
 5   rate                         43942 non-null  object
 6   votes                        51717 non-null  int64 
 7   phone                        50509 non-null  object
 8   location                     51696 non-null  object
 9   rest_type                    51490 non-null  object
 10  dish_liked                   23639 non-null  object
 11  cuisines                     51672 non-null  object
 12  approx_cost(for two people)  51371 non-null  object
 13  reviews_list                 51

In [4]:
df = df.drop(columns=[
    'url',
    'phone',
    'menu_item',
    'reviews_list'
])


These add no analytical value: “I removed identifier and text-heavy columns that weren’t useful for structured analysis.”

In [None]:
df = df[df['rate'].notna()]
df = df[~df['rate'].isin(['NEW', '-'])]


In [None]:
df['rate'] = df['rate'].apply(lambda x: float(x.split('/')[0]))


“Unrated restaurants were excluded to prevent skewed insights.”
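The two rating cells above can also be combined into one coercion step. This is a minimal sketch on a small hypothetical sample (the `sample`, `parsed`, and `rated` names and the example values are assumptions, not rows from the dataset): it takes the text before the slash and lets `pd.to_numeric` turn placeholders such as 'NEW' and '-' into NaN, so the lambda never has to handle unexpected strings.

```python
import pandas as pd

# Hypothetical sample mirroring the kinds of values the raw `rate` column can hold
sample = pd.DataFrame({"rate": ["4.1/5", "3.8 /5", "NEW", "-", None]})

# Keep the text before the slash, trim whitespace, and coerce
# anything that is not a number (e.g. 'NEW', '-') to NaN
parsed = pd.to_numeric(
    sample["rate"].str.split("/").str[0].str.strip(),
    errors="coerce",
)

# Drop rows without a usable numeric rating
rated = sample.assign(rate=parsed).dropna(subset=["rate"])
print(rated)
```

Coercing first and dropping afterwards means a new placeholder value in the raw data would become NaN rather than raise a `ValueError`.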

In [5]:
#Convert to numeric:
df['approx_cost(for two people)'] = (
    df['approx_cost(for two people)']
    .str.replace(',', '')
    .astype(float)
)


In [6]:
# Remove invalid costs (rows with a missing cost are dropped too, because NaN > 0 evaluates to False):
df = df[df['approx_cost(for two people)'] > 0]
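
A more defensive variant of the cost conversion, sketched on a hypothetical sample (the `sample`, `cost`, and `valid` names and the values are assumptions for illustration). It uses `pd.to_numeric(..., errors="coerce")` so that any unexpected string becomes NaN instead of raising, and then applies the same `> 0` filter as above.

```python
import pandas as pd

col = "approx_cost(for two people)"

# Hypothetical sample: plain number, thousands separator, missing value, zero
sample = pd.DataFrame({col: ["800", "1,200", None, "0"]})

# Strip the thousands separator literally (regex=False), then coerce to float
cost = pd.to_numeric(
    sample[col].str.replace(",", "", regex=False),
    errors="coerce",
)

# Keep strictly positive costs; NaN rows fail the comparison and are dropped
valid = sample.assign(**{col: cost})
valid = valid[valid[col] > 0]
print(valid)
```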


In [7]:
# Too many missing values to impute meaningfully; fill with a placeholder label instead.
df['dish_liked'] = df['dish_liked'].fillna('Not Specified')


“High-missing text features were retained as categorical placeholders.”

In [8]:
df = df[df['cuisines'].notna()]
df = df[df['location'].notna()]


In [9]:
df['online_order'] = df['online_order'].str.lower()
df['book_table'] = df['book_table'].str.lower()
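
If later analysis or modelling needs true booleans rather than lowercase strings, the flags could be mapped as in the sketch below. The sample frame and the `yes_no` mapping are assumptions for illustration; any value outside the mapping would become NaN, so it is worth checking for nulls afterwards.

```python
import pandas as pd

# Hypothetical sample with mixed casing, as the raw flags might appear
sample = pd.DataFrame({
    "online_order": ["Yes", "No", "yes"],
    "book_table": ["No", "No", "Yes"],
})

# Assumed mapping from normalized text to booleans
yes_no = {"yes": True, "no": False}

for col in ["online_order", "book_table"]:
    # Normalize whitespace and case before mapping
    sample[col] = sample[col].str.strip().str.lower().map(yes_no)

print(sample)
```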


In [10]:
df = df.drop_duplicates()
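
To record how many rows the de-duplication removes, the duplicates can be counted first. A small sketch on hypothetical data (the `sample`, `n_dupes`, and `deduped` names are assumptions). It also resets the index, which the cell above does not do; that is why the `df.info()` output below reports an `Index` running from 0 to 51716 rather than a fresh `RangeIndex`.

```python
import pandas as pd

# Hypothetical sample containing one exact duplicate row
sample = pd.DataFrame({
    "name": ["A", "A", "B"],
    "location": ["X", "X", "Y"],
})

# duplicated() flags every repeat of an earlier row
n_dupes = int(sample.duplicated().sum())

# Drop the repeats and renumber the rows from 0
deduped = sample.drop_duplicates().reset_index(drop=True)
print(n_dupes, deduped.shape)
```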


In [11]:
df.shape
df.info()
df.isnull().sum()


<class 'pandas.core.frame.DataFrame'>
Index: 51269 entries, 0 to 51716
Data columns (total 13 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   address                      51269 non-null  object 
 1   name                         51269 non-null  object 
 2   online_order                 51269 non-null  object 
 3   book_table                   51269 non-null  object 
 4   rate                         43610 non-null  object 
 5   votes                        51269 non-null  int64  
 6   location                     51269 non-null  object 
 7   rest_type                    51065 non-null  object 
 8   dish_liked                   51269 non-null  object 
 9   cuisines                     51269 non-null  object 
 10  approx_cost(for two people)  51269 non-null  float64
 11  listed_in(type)              51269 non-null  object 
 12  listed_in(city)              51269 non-null  object 
dtypes: float64(1), int64(

address                           0
name                              0
online_order                      0
book_table                        0
rate                           7659
votes                             0
location                          0
rest_type                       204
dish_liked                        0
cuisines                          0
approx_cost(for two people)       0
listed_in(type)                   0
listed_in(city)                   0
dtype: int64

In [12]:
df.to_csv("../data/clean/zomato_cleaned.csv", index=False)
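
`to_csv` does not create missing folders, so the save fails if `../data/clean` does not exist yet. The sketch below shows one way to guard against that with `pathlib`. It writes a hypothetical one-row frame into a temporary directory so that it can be run without touching the project files; the `sample`, `out_path`, and `reloaded` names are assumptions.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame
sample = pd.DataFrame({"name": ["A"], "votes": [10]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data" / "clean" / "zomato_cleaned.csv"

    # Create the target folder (and any parents) if it is missing
    out_path.parent.mkdir(parents=True, exist_ok=True)

    sample.to_csv(out_path, index=False)

    # Read the file back as a quick sanity check on the saved shape
    reloaded = pd.read_csv(out_path)

print(reloaded.shape)
```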


DAY 2 DONE

Final shape: (51269, 13)

Key Cleaning Steps:
- Removed irrelevant identifier columns
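- Cleaned and standardized ratings (the two rating cells are marked In [None], and the df.info() output above still shows rate as object with 7659 nulls, so this step is not reflected in the saved file)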
- Converted cost to numeric format
- Handled missing text features
- Ensured data consistency for analysis and ML

