# Data Analytics Internship Level 2

## Data Processing and Feature Engineering

### Project Title: Feature Engineering for Predictive Modeling
Dataset: zomato.csv  
Source: Kaggle

---

## Task 2: Feature Engineering for Predictive Modeling
 
*• Goal: Enhance predictive model performance by transforming raw data into meaningful features, handling missing values, encoding categorical variables, and optimizing feature selection for improved accuracy and efficiency.*  
*• Select a dataset and create new meaningful features. Handle missing values, outliers, and categorical variables. Prepare the dataset for machine learning.*

---

In [1]:
#Core Libraries

import pandas as pd 
import numpy as np 

import matplotlib.pyplot as plt 
import seaborn as sns 


In [2]:
#data loading 

def data_load(path) :
    data = pd.read_csv(path)
    return data

data = data_load("zomato.csv")

In [3]:
data.head()

Unnamed: 0,url,address,name,online_order,book_table,rate,votes,phone,location,rest_type,dish_liked,cuisines,approx_cost(for two people),reviews_list,menu_item,listed_in(type),listed_in(city)
0,https://www.zomato.com/bangalore/jalsa-banasha...,"942, 21st Main Road, 2nd Stage, Banashankari, ...",Jalsa,Yes,Yes,4.1/5,775,080 42297555\r\n+91 9743772233,Banashankari,Casual Dining,"Pasta, Lunch Buffet, Masala Papad, Paneer Laja...","North Indian, Mughlai, Chinese",800,"[('Rated 4.0', 'RATED\n A beautiful place to ...",[],Buffet,Banashankari
1,https://www.zomato.com/bangalore/spice-elephan...,"2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ...",Spice Elephant,Yes,No,4.1/5,787,080 41714161,Banashankari,Casual Dining,"Momos, Lunch Buffet, Chocolate Nirvana, Thai G...","Chinese, North Indian, Thai",800,"[('Rated 4.0', 'RATED\n Had been here for din...",[],Buffet,Banashankari
2,https://www.zomato.com/SanchurroBangalore?cont...,"1112, Next to KIMS Medical College, 17th Cross...",San Churro Cafe,Yes,No,3.8/5,918,+91 9663487993,Banashankari,"Cafe, Casual Dining","Churros, Cannelloni, Minestrone Soup, Hot Choc...","Cafe, Mexican, Italian",800,"[('Rated 3.0', ""RATED\n Ambience is not that ...",[],Buffet,Banashankari
3,https://www.zomato.com/bangalore/addhuri-udupi...,"1st Floor, Annakuteera, 3rd Stage, Banashankar...",Addhuri Udupi Bhojana,No,No,3.7/5,88,+91 9620009302,Banashankari,Quick Bites,Masala Dosa,"South Indian, North Indian",300,"[('Rated 4.0', ""RATED\n Great food and proper...",[],Buffet,Banashankari
4,https://www.zomato.com/bangalore/grand-village...,"10, 3rd Floor, Lakshmi Associates, Gandhi Baza...",Grand Village,No,No,3.8/5,166,+91 8026612447\r\n+91 9901210005,Basavanagudi,Casual Dining,"Panipuri, Gol Gappe","North Indian, Rajasthani",600,"[('Rated 4.0', 'RATED\n Very good restaurant ...",[],Buffet,Banashankari


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51717 entries, 0 to 51716
Data columns (total 17 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   url                          51717 non-null  object
 1   address                      51717 non-null  object
 2   name                         51717 non-null  object
 3   online_order                 51717 non-null  object
 4   book_table                   51717 non-null  object
 5   rate                         43942 non-null  object
 6   votes                        51717 non-null  int64 
 7   phone                        50509 non-null  object
 8   location                     51696 non-null  object
 9   rest_type                    51490 non-null  object
 10  dish_liked                   23639 non-null  object
 11  cuisines                     51672 non-null  object
 12  approx_cost(for two people)  51371 non-null  object
 13  reviews_list                 51

In [5]:
data.describe()

Unnamed: 0,votes
count,51717.0
mean,283.697527
std,803.838853
min,0.0
25%,7.0
50%,41.0
75%,198.0
max,16832.0


In [33]:
data.describe(include='object')

Unnamed: 0,address,name,online_order,book_table,location,rest_type,cuisines,reviews_list,menu_item,listed_in(type),listed_in(city)
count,33641,33641,33641,33641,33641,33641,33641,33641,33641,33641,33641
unique,7784,5869,2,2,92,68,1908,15375,6975,7,30
top,Delivery Only,Cafe Coffee Day,Yes,No,BTM,Quick Bites,"North Indian, Chinese",[],[],Delivery,BTM
freq,80,84,22511,31808,3714,13452,1833,1050,24565,17992,2179


In [7]:
# data cleaning 

data.duplicated().sum()

np.int64(0)

In [8]:
data[data.duplicated()].head()

Unnamed: 0,url,address,name,online_order,book_table,rate,votes,phone,location,rest_type,dish_liked,cuisines,approx_cost(for two people),reviews_list,menu_item,listed_in(type),listed_in(city)


In [9]:
data.isnull().sum()

url                                0
address                            0
name                               0
online_order                       0
book_table                         0
rate                            7775
votes                              0
phone                           1208
location                          21
rest_type                        227
dish_liked                     28078
cuisines                          45
approx_cost(for two people)      346
reviews_list                       0
menu_item                          0
listed_in(type)                    0
listed_in(city)                    0
dtype: int64

In [10]:
(data.isnull().sum() / len(data))*100    #Missing value percentage

url                             0.000000
address                         0.000000
name                            0.000000
online_order                    0.000000
book_table                      0.000000
rate                           15.033741
votes                           0.000000
phone                           2.335789
location                        0.040606
rest_type                       0.438927
dish_liked                     54.291626
cuisines                        0.087012
approx_cost(for two people)     0.669026
reviews_list                    0.000000
menu_item                       0.000000
listed_in(type)                 0.000000
listed_in(city)                 0.000000
dtype: float64
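The count and percentage views above can be combined into one table, which makes it easier to decide which columns to drop and which to impute. This is a minimal sketch: the helper name `missing_summary` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Hide fully populated columns and sort by severity
    return summary[summary["missing"] > 0].sort_values("percent", ascending=False)

# Toy frame standing in for the Zomato data (values are illustrative only)
toy = pd.DataFrame({
    "rate": ["4.1/5", None, "3.8/5", None],
    "votes": [775, 0, 918, 88],
    "dish_liked": [None, None, None, "Masala Dosa"],
})
report = missing_summary(toy)
print(report)
```

On the real data, calling `missing_summary(data)` would list `dish_liked` first, since it has the highest share of missing values.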

In [11]:
# drop columns 

data.drop(columns= ['phone' , 'url'],inplace= True)

In [12]:
data.drop(columns = 'dish_liked' , inplace= True)  #very high missing values

In [13]:
data['rate'].unique()

array(['4.1/5', '3.8/5', '3.7/5', '3.6/5', '4.6/5', '4.0/5', '4.2/5',
       '3.9/5', '3.1/5', '3.0/5', '3.2/5', '3.3/5', '2.8/5', '4.4/5',
       '4.3/5', 'NEW', '2.9/5', '3.5/5', nan, '2.6/5', '3.8 /5', '3.4/5',
       '4.5/5', '2.5/5', '2.7/5', '4.7/5', '2.4/5', '2.2/5', '2.3/5',
       '3.4 /5', '-', '3.6 /5', '4.8/5', '3.9 /5', '4.2 /5', '4.0 /5',
       '4.1 /5', '3.7 /5', '3.1 /5', '2.9 /5', '3.3 /5', '2.8 /5',
       '3.5 /5', '2.7 /5', '2.5 /5', '3.2 /5', '2.6 /5', '4.5 /5',
       '4.3 /5', '4.4 /5', '4.9/5', '2.1/5', '2.0/5', '1.8/5', '4.6 /5',
       '4.9 /5', '3.0 /5', '4.8 /5', '2.3 /5', '4.7 /5', '2.4 /5',
       '2.1 /5', '2.2 /5', '2.0 /5', '1.8 /5'], dtype=object)

In [14]:
data['rate'] = data['rate'].astype(str)

In [15]:
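# Keep only rows whose rating looks like "x.y/5"; .copy() gives an independent frame for the later column assignments
data = data[data['rate'].str.contains('/')].copy()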
data['rate'].unique()

array(['4.1/5', '3.8/5', '3.7/5', '3.6/5', '4.6/5', '4.0/5', '4.2/5',
       '3.9/5', '3.1/5', '3.0/5', '3.2/5', '3.3/5', '2.8/5', '4.4/5',
       '4.3/5', '2.9/5', '3.5/5', '2.6/5', '3.8 /5', '3.4/5', '4.5/5',
       '2.5/5', '2.7/5', '4.7/5', '2.4/5', '2.2/5', '2.3/5', '3.4 /5',
       '3.6 /5', '4.8/5', '3.9 /5', '4.2 /5', '4.0 /5', '4.1 /5',
       '3.7 /5', '3.1 /5', '2.9 /5', '3.3 /5', '2.8 /5', '3.5 /5',
       '2.7 /5', '2.5 /5', '3.2 /5', '2.6 /5', '4.5 /5', '4.3 /5',
       '4.4 /5', '4.9/5', '2.1/5', '2.0/5', '1.8/5', '4.6 /5', '4.9 /5',
       '3.0 /5', '4.8 /5', '2.3 /5', '4.7 /5', '2.4 /5', '2.1 /5',
       '2.2 /5', '2.0 /5', '1.8 /5'], dtype=object)

In [16]:
data['rate'] =  data['rate'].str.split('/').str[0].astype(float)
data['rate'].unique()

array([4.1, 3.8, 3.7, 3.6, 4.6, 4. , 4.2, 3.9, 3.1, 3. , 3.2, 3.3, 2.8,
       4.4, 4.3, 2.9, 3.5, 2.6, 3.4, 4.5, 2.5, 2.7, 4.7, 2.4, 2.2, 2.3,
       4.8, 4.9, 2.1, 2. , 1.8])
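The three rating steps above (cast to string, keep rows containing `/`, split and convert) can also be expressed as one parsing function that turns anything that is not a rating into `NaN`. The sketch below is an alternative under assumptions: the helper name `parse_rate` and the regular expression are mine, and unlike the notebook it leaves the decision to drop rows to the caller.

```python
import numpy as np
import pandas as pd

def parse_rate(series: pd.Series) -> pd.Series:
    """Convert strings like '4.1/5' or '3.8 /5' to floats; anything else becomes NaN."""
    numerator = series.astype(str).str.extract(
        r"^\s*(\d+(?:\.\d+)?)\s*/\s*5\s*$", expand=False
    )
    return pd.to_numeric(numerator, errors="coerce")

# Sample values taken from the unique() output above
raw = pd.Series(["4.1/5", "3.8 /5", "NEW", "-", np.nan])
rates = parse_rate(raw)
print(rates)
```

Rows with a missing rating can then be removed with `data.dropna(subset=['rate'])`, which matches the effect of the filter used in the notebook.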

In [17]:
data['approx_cost(for two people)'].unique()

array(['800', '300', '600', '700', '550', '500', '450', '650', '400',
       '900', '200', '750', '150', '850', '100', '1,200', '350', '250',
       '950', '1,000', '1,500', '1,300', '199', '1,100', '1,600', '230',
       '130', '1,700', '1,350', '2,200', '1,400', '2,000', '1,800', nan,
       '1,900', '180', '330', '2,500', '2,100', '3,000', '2,800', '3,400',
       '50', '40', '1,250', '3,500', '4,000', '2,400', '2,600', '1,450',
       '70', '3,200', '240', '6,000', '1,050', '2,300', '4,100', '120',
       '5,000', '3,700', '1,650', '2,700', '4,500', '80'], dtype=object)

In [18]:
data['approx_cost(for two people)'] = data['approx_cost(for two people)'].str.replace(',','').astype(float)

In [19]:
data['approx_cost(for two people)'].unique()

array([ 800.,  300.,  600.,  700.,  550.,  500.,  450.,  650.,  400.,
        900.,  200.,  750.,  150.,  850.,  100., 1200.,  350.,  250.,
        950., 1000., 1500., 1300.,  199., 1100., 1600.,  230.,  130.,
       1700., 1350., 2200., 1400., 2000., 1800.,   nan, 1900.,  180.,
        330., 2500., 2100., 3000., 2800., 3400.,   50.,   40., 1250.,
       3500., 4000., 2400., 2600., 1450.,   70., 3200.,  240., 6000.,
       1050., 2300., 4100.,  120., 5000., 3700., 1650., 2700., 4500.,
         80.])
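The comma removal shown above, together with a median fill for the remaining gaps, can be wrapped in a single helper that returns a new Series instead of modifying a column in place. This is a sketch only; `clean_cost` is a hypothetical name, and the sample values are a small subset of the ones listed above.

```python
import numpy as np
import pandas as pd

def clean_cost(series: pd.Series) -> pd.Series:
    """Strip thousands separators, convert to float, and fill gaps with the median."""
    numeric = pd.to_numeric(
        series.astype(str).str.replace(",", "", regex=False),
        errors="coerce",  # missing or non-numeric entries become NaN
    )
    return numeric.fillna(numeric.median())

costs = pd.Series(["800", "1,200", np.nan, "300"])
cleaned = clean_cost(costs)
print(cleaned.tolist())
```

Assigning the result back with `data['approx_cost(for two people)'] = clean_cost(data['approx_cost(for two people)'])` avoids chained in-place calls on a column.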

In [20]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
data['approx_cost(for two people)'] = data['approx_cost(for two people)'].fillna(data['approx_cost(for two people)'].median())


In [21]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 41665 entries, 0 to 51716
Data columns (total 14 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   address                      41665 non-null  object 
 1   name                         41665 non-null  object 
 2   online_order                 41665 non-null  object 
 3   book_table                   41665 non-null  object 
 4   rate                         41665 non-null  float64
 5   votes                        41665 non-null  int64  
 6   location                     41665 non-null  object 
 7   rest_type                    41516 non-null  object 
 8   cuisines                     41654 non-null  object 
 9   approx_cost(for two people)  41665 non-null  float64
 10  reviews_list                 41665 non-null  object 
 11  menu_item                    41665 non-null  object 
 12  listed_in(type)              41665 non-null  object 
 13  listed_in(city)      

In [22]:
data.isnull().sum()

address                          0
name                             0
online_order                     0
book_table                       0
rate                             0
votes                            0
location                         0
rest_type                      149
cuisines                        11
approx_cost(for two people)      0
reviews_list                     0
menu_item                        0
listed_in(type)                  0
listed_in(city)                  0
dtype: int64

In [23]:
data['rest_type'] = data['rest_type'].fillna('Unknown')   #very small proportion of missing values


In [24]:
data['cuisines'] = data['cuisines'].fillna('Not Specified')


In [25]:
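# Handle outliers with the 1.5 * IQR rule

numeric_cols = ['approx_cost(for two people)' , 'votes' , 'rate']
for col in numeric_cols : 
    # Quartiles are computed on the rows that survived the previous column's filter
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5*IQR
    upper = Q3 + 1.5*IQR
    # Keep only rows inside [lower, upper] for this column
    data = data[(data[col] >= lower ) & (data[col] <= upper)]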
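The loop above filters one column at a time, so the quartiles for `votes` and `rate` are computed on data that has already been trimmed by the earlier columns. A variant that computes every fence on the same untrimmed frame and applies one combined mask is sketched below. The helper names `iqr_bounds` and `iqr_mask` and the toy values are assumptions for illustration, and the two approaches can keep slightly different rows.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_mask(df: pd.DataFrame, cols, k: float = 1.5) -> pd.Series:
    """True for rows that lie inside the fences of every listed column."""
    mask = pd.Series(True, index=df.index)
    for col in cols:
        lower, upper = iqr_bounds(df[col], k)
        mask &= df[col].between(lower, upper)  # inclusive on both ends
    return mask

# Toy data: one extreme vote count, ratings all within range
toy = pd.DataFrame({
    "votes": [7, 41, 50, 60, 198, 16832],
    "rate": [3.7, 3.8, 3.9, 4.0, 4.1, 3.6],
})
kept = toy[iqr_mask(toy, ["votes", "rate"])]
print(kept)
```

Whether to drop, cap, or keep such rows depends on the modeling goal; very high vote counts may be genuine signals rather than errors.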

In [26]:
data.isnull().sum()

address                        0
name                           0
online_order                   0
book_table                     0
rate                           0
votes                          0
location                       0
rest_type                      0
cuisines                       0
approx_cost(for two people)    0
reviews_list                   0
menu_item                      0
listed_in(type)                0
listed_in(city)                0
dtype: int64

#### Feature Creation

---

In [27]:
#Binary encoding (Yes/No -> 1/0)

data['online_order_flag'] = data['online_order'].map({'Yes' : 1 , 'No' : 0})
data['book_table_flag']  = data['book_table'].map({'Yes':1,'No':0})

#Encodes the availability of online ordering and table booking as binary flags.

In [28]:
#Votes - Popular restaurant feature

median_votes = data['votes'].median()

data['popular_restaurant'] = np.where(data['votes']>=median_votes , 1 ,0)

#Restaurants with vote counts at or above the median are flagged as popular.

In [29]:
#Cost Buckets (right-inclusive bins: 0-300, 300-700, 700-1500, 1500-5000)
data['cost_category'] = pd.cut(
    data['approx_cost(for two people)'],bins = [0,300,700,1500,5000],labels=['low','Medium','High','Premium']
)



print("Restaurants are segmented into low, Medium, High, and Premium cost tiers.")
print(data['cost_category'])

Restaurants are segmented into low, Medium, High, and Premium cost tiers.
3           low
4        Medium
5        Medium
6          High
8        Medium
          ...  
51705      High
51706      High
51708      High
51709      High
51711      High
Name: cost_category, Length: 33641, dtype: category
Categories (4, object): ['low' < 'Medium' < 'High' < 'Premium']
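Two details of `pd.cut` matter for these buckets. Bins are closed on the right by default, so a cost of exactly 300 lands in `low`, and any value above the last edge (5000) becomes `NaN`. The raw column contained a cost of 6,000 before outlier removal, so an open-ended top bin may be safer if the outlier step changes. The sketch below compares both variants on toy values; using `np.inf` as the last edge is my assumption, not the notebook's choice.

```python
import numpy as np
import pandas as pd

labels = ['low', 'Medium', 'High', 'Premium']
costs = pd.Series([40, 300, 301, 700, 1500, 6000])

# Bins as used in the notebook: the top edge is 5000
closed = pd.cut(costs, bins=[0, 300, 700, 1500, 5000], labels=labels)

# Assumed variant: open-ended top bin so large costs are still categorized
open_ended = pd.cut(costs, bins=[0, 300, 700, 1500, np.inf], labels=labels)

print(pd.DataFrame({"cost": costs, "closed": closed, "open_ended": open_ended}))
```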


In [30]:
#City - High Rated City Flag

# Cities ranked by mean rating, best first
top_cities = (
    data.groupby('listed_in(city)')['rate'].mean().sort_values(ascending=False).index
)

# NOTE: top_cities holds every city because no cutoff (such as .head(n)) is applied,
# so this flag is 1 for all rows until a cutoff is chosen.
data['High_rated_city'] = data['listed_in(city)'].isin(top_cities).astype(int)
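Because `top_cities` is the full ranked index of cities, `isin` matches every row and `High_rated_city` is constant. For the flag to carry information, the ranking needs a cutoff. The sketch below keeps only the `n` cities with the highest mean rating; the helper name `flag_top_groups`, the value of `n`, and the toy ratings are all assumptions for illustration.

```python
import pandas as pd

def flag_top_groups(df: pd.DataFrame, group_col: str, value_col: str, n: int) -> pd.Series:
    """Return 1 for rows whose group is among the n groups with the highest mean value."""
    means = df.groupby(group_col)[value_col].mean()
    top = means.nlargest(n).index
    return df[group_col].isin(top).astype(int)

# Toy data with three cities (ratings are made up)
toy = pd.DataFrame({
    "listed_in(city)": ["BTM", "BTM", "Indiranagar", "Indiranagar", "Whitefield", "Whitefield"],
    "rate": [3.5, 3.7, 4.2, 4.0, 3.9, 3.8],
})
toy["High_rated_city"] = flag_top_groups(toy, "listed_in(city)", "rate", n=1)
print(toy)
```

If `rate` is later used as the prediction target, a flag derived from it should be computed on the training split only, otherwise target information leaks into the features.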


In [31]:
#Feature check 

print("\n\t\t\t Feature Check Preview")
data[[
    'rate',
    'votes',
    'online_order_flag',
    'book_table_flag',
    'popular_restaurant',
    'High_rated_city',
    'cost_category'
]]


			 Feature Check Preview


Unnamed: 0,rate,votes,online_order_flag,book_table_flag,popular_restaurant,High_rated_city,cost_category
3,3.7,88,0,0,1,1,low
4,3.8,166,0,0,1,1,Medium
5,3.8,286,1,0,1,1,Medium
6,3.6,8,0,0,0,1,High
8,4.0,324,1,0,1,1,Medium
...,...,...,...,...,...,...,...
51705,3.8,128,1,1,1,1,High
51706,3.7,27,0,0,0,1,High
51708,2.8,161,0,0,1,1,High
51709,3.7,34,0,0,0,1,High


In [32]:
print("\n\t\t\t\t Final dataset Preview")

data.head().T


				 Final dataset Preview


Unnamed: 0,3,4,5,6,8
address,"1st Floor, Annakuteera, 3rd Stage, Banashankar...","10, 3rd Floor, Lakshmi Associates, Gandhi Baza...","37, 5-1, 4th Floor, Bosco Court, Gandhi Bazaar...","19/1, New Timberyard Layout, Beside Satellite ...","1, 30th Main Road, 3rd Stage, Banashankari, Ba..."
name,Addhuri Udupi Bhojana,Grand Village,Timepass Dinner,Rosewood International Hotel - Bar & Restaurant,Penthouse Cafe
online_order,No,No,Yes,No,Yes
book_table,No,No,No,No,No
rate,3.7,3.8,3.8,3.6,4.0
votes,88,166,286,8,324
location,Banashankari,Basavanagudi,Basavanagudi,Mysore Road,Banashankari
rest_type,Quick Bites,Casual Dining,Casual Dining,Casual Dining,Cafe
cuisines,"South Indian, North Indian","North Indian, Rajasthani",North Indian,"North Indian, South Indian, Andhra, Chinese","Cafe, Italian, Continental"
approx_cost(for two people),300.0,600.0,600.0,800.0,700.0


---
## Overall Insights

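• Cleaning reduced the dataset from 51,717 to 33,641 rows: rows without a numeric rating were dropped, then IQR outliers in cost, votes, and rating were removed.  
• Binary flags (`online_order_flag`, `book_table_flag`) turned the Yes/No columns into 0/1 values.  
• Cost categories grouped restaurants into four price tiers.  
• `popular_restaurant` marks restaurants at or above the median vote count. `High_rated_city` is 1 for every row as written, so it needs a cutoff before it adds information.  
• The numeric and binary features are ready for predictive modeling; the remaining text columns (such as `rest_type`, `cuisines`, and `location`) still need encoding or removal.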
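As a final preparation step, the remaining categorical columns can be one-hot encoded so that every feature is numeric. The sketch below uses `pd.get_dummies` on a toy frame that mimics a few of the engineered columns. Treating `rate` as the target and the choice of columns to encode are assumptions, since the notebook does not name a target.

```python
import pandas as pd

# Toy frame mimicking a few engineered columns (values are illustrative)
toy = pd.DataFrame({
    "rate": [3.7, 3.8, 4.0],
    "votes": [88, 166, 324],
    "online_order_flag": [0, 0, 1],
    "cost_category": pd.Categorical(
        ["low", "Medium", "Medium"],
        categories=["low", "Medium", "High", "Premium"],
        ordered=True,
    ),
    "listed_in(type)": ["Buffet", "Buffet", "Delivery"],
})

# Assumed target: rate. Everything else becomes the feature matrix.
y = toy["rate"]
X = pd.get_dummies(
    toy.drop(columns="rate"),
    columns=["cost_category", "listed_in(type)"],
    dtype=int,
)
print(X.columns.tolist())
```

High-cardinality text columns such as `cuisines` or `address` would produce very wide matrices with this approach, so they usually need a different treatment or can be dropped.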

