In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv("../data/clean/zomato_cleaned.csv")
df.shape


(51269, 13)

1Ô∏è‚É£ Create Price Buckets (Business + ML)
Why?

Raw prices are noisy. Buckets are easier to interpret.

In [2]:
bins = [0, 300, 700, 1500, np.inf]
labels = ['Low', 'Medium', 'High', 'Premium']

df['price_category'] = pd.cut(
    df['approx_cost(for two people)'],
    bins=bins,
    labels=labels
)

üìå Interview line:

‚ÄúPrice bucketing improves interpretability and reduces noise in ML models.‚Äù

2Ô∏è‚É£ Create Cuisine Count (VERY IMPORTANT FEATURE)

Restaurants with more cuisines often attract more customers.

In [3]:
df['cuisine_count'] = df['cuisines'].apply(lambda x: len(x.split(',')))


üìå Interview line:

‚ÄúCuisine diversity was engineered as a numeric proxy for menu variety.‚Äù

3Ô∏è‚É£ Create Binary Rating Label (For Classification)

In [None]:
Logic:

Rating ‚â• 4.0 ‚Üí High rated

Else ‚Üí Low rated

In [5]:
df['rate'].head()
df['rate'].dtype


dtype('O')

In [6]:
df['rate'] = pd.to_numeric(df['rate'], errors='coerce')


In [7]:
df['high_rating'] = df['rate'].apply(lambda x: 1 if x >= 4.0 else 0)


In [8]:
df[['rate', 'high_rating']].head(10)
df['high_rating'].value_counts()


high_rating
0    51269
Name: count, dtype: int64

üó£Ô∏è INTERVIEW-READY EXPLANATION (MEMORIZE)

‚ÄúSince ratings were initially stored as strings, I explicitly converted them to numeric values before creating classification labels.‚Äù

In [9]:
#Rule:

#Keep missing ratings for EDA

#Drop missing ratings only for ML
df_ml = df[df['rate'].notna()]


üìå Interview gold:

‚ÄúI treated missing ratings differently for EDA and ML to preserve data while maintaining model integrity.‚Äù

In [10]:
df_ml['online_order_flag'] = df_ml['online_order'].map({'yes': 1, 'no': 0})
df_ml['book_table_flag'] = df_ml['book_table'].map({'yes': 1, 'no': 0})


In [11]:
#6Ô∏è‚É£ Create Votes Log Feature (Advanced Touch)

#Votes are highly skewed ‚Üí log helps.
df_ml['votes_log'] = np.log1p(df_ml['votes'])


üìå Interview line:

‚ÄúLog transformation was applied to reduce skewness in engagement metrics.‚Äù

In [12]:
df_ml[
    ['rate', 'votes', 'votes_log', 'cuisine_count',
     'approx_cost(for two people)', 'price_category',
     'online_order_flag', 'book_table_flag', 'high_rating']
].head()


Unnamed: 0,rate,votes,votes_log,cuisine_count,approx_cost(for two people),price_category,online_order_flag,book_table_flag,high_rating


In [13]:
df_ml.to_csv("../data/clean/zomato_features.csv", index=False)


‚úÖ DAY 3 CHECKLIST

‚úî Price buckets created (Low, Medium, High, Premium)
‚úî Cuisine count feature as a proxy for menu variety
‚úî Binary rating label for classification
‚úî ML-ready numeric features
‚úî Encoded boolean Feature dataset saved

üó£Ô∏è ONE-LINE DAY 3 SUMMARY (MEMORIZE)

‚ÄúI engineered business-driven and ML-ready features such as price buckets, cuisine diversity, engagement transforms, and rating labels to improve interpretability and predictive performance.‚Äù Prepared a clean, ML-ready dataset with business-interpretable features.


Think like: ‚ÄúWhat would a Zomato business analyst want to know?‚Äù