### Level 1

**Task 1: Data Exploration and Preprocessing**

- Explore the dataset and identify the number of rows and columns.
- Check for missing values in each column and handle them accordingly.
- Perform data type conversion if necessary.
- Analyze the distribution of the target variable ("Aggregate rating") and identify any class imbalances.

In [5]:
%pip install pandas numpy matplotlib seaborn scikit-learn folium


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
df = pd.read_csv("/Users/pranay/Downloads/Cognifyz Internship Task/Dataset.csv")
print(df.head())
print()
print(df.shape)

   Restaurant ID         Restaurant Name  Country Code              City  \
0        6317637        Le Petit Souffle           162       Makati City   
1        6304287        Izakaya Kikufuji           162       Makati City   
2        6300002  Heat - Edsa Shangri-La           162  Mandaluyong City   
3        6318506                    Ooma           162  Mandaluyong City   
4        6314302             Sambo Kojin           162  Mandaluyong City   

                                             Address  \
0  Third Floor, Century City Mall, Kalayaan Avenu...   
1  Little Tokyo, 2277 Chino Roces Avenue, Legaspi...   
2  Edsa Shangri-La, 1 Garden Way, Ortigas, Mandal...   
3  Third Floor, Mega Fashion Hall, SM Megamall, O...   
4  Third Floor, Mega Atrium, SM Megamall, Ortigas...   

                                     Locality  \
0   Century City Mall, Poblacion, Makati City   
1  Little Tokyo, Legaspi Village, Makati City   
2  Edsa Shangri-La, Ortigas, Mandaluyong City   
3      SM 

In [7]:
# Rows and Columns

print("Rows, Columns:", df.shape)

Rows, Columns: (9551, 21)


In [9]:
# Missing values

missing = df.isnull().sum().sort_values(ascending=False)
print("Missing values in each column:\n", missing[missing > 0])

Missing values in each column:
 Cuisines    9
dtype: int64
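
The raw count is easier to judge next to the share of rows it represents: 9 missing values out of 9551 rows is roughly 0.09% of the `Cuisines` column. A minimal sketch of a helper that reports both, assuming a hypothetical function name `missing_summary` and a toy frame in place of the real dataset:

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "Cuisines": ["Japanese", None, "Italian", "Cafe"],
    "Votes": [10, 20, 30, 40],
})

report = missing_summary(toy)
print(report)
```

On the real data this would be called as `missing_summary(df)`.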


In [11]:
# Handle missing values
# Strategy:
# - For numeric columns: fill with the median
# - For categorical columns: fill with the mode (most frequent value)

for col in df.columns:
    if df[col].isnull().any():
        # is_numeric_dtype also matches numeric dtypes other than int64/float64 (e.g. int32, float32)
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])

print("Missing values after handling:\n", df.isnull().sum().sum())

Missing values after handling:
 0
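
Because `Cuisines` is the only column with gaps, the mode fill above assigns the most common cuisine to the 9 affected restaurants. One possible alternative, shown here only as a sketch, is to mark those rows with an explicit placeholder so they stay distinguishable. The label `"Unknown"` and the toy data are assumptions, not part of the original notebook:

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "Restaurant Name": ["A", "B", "C", "D"],
    "Cuisines": ["North Indian", None, "North Indian", "Cafe"],
})

# Strategy used in the notebook: fill with the most frequent value.
mode_filled = toy["Cuisines"].fillna(toy["Cuisines"].mode()[0])

# Alternative (assumption): fill with an explicit placeholder label.
label_filled = toy["Cuisines"].fillna("Unknown")

print(mode_filled.value_counts())
print(label_filled.value_counts())
```

With so few missing rows either choice has little effect on the overall distribution; the placeholder mainly avoids inflating the most frequent category.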


In [13]:
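# Data type conversion (example)
# Convert object columns that hold numeric strings to a numeric dtype

for col in df.columns:
    if df[col].dtype == "object":
        try:
            df[col] = pd.to_numeric(df[col])
        except (ValueError, TypeError):
            # Column holds non-numeric text; leave it as object
            pass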

print("Data types:\n", df.dtypes)

Data types:
 Restaurant ID             int64
Restaurant Name          object
Country Code              int64
City                     object
Address                  object
Locality                 object
Locality Verbose         object
Longitude               float64
Latitude                float64
Cuisines                 object
Average Cost for two      int64
Currency                 object
Has Table booking        object
Has Online delivery      object
Is delivering now        object
Switch to order menu     object
Price range               int64
Aggregate rating        float64
Rating color             object
Rating text              object
Votes                     int64
dtype: object
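
The flag-like columns (`Has Table booking`, `Has Online delivery`, and similar) remain `object` after the numeric conversion. If they hold the strings `"Yes"` and `"No"`, which is an assumption since their values are not shown above, they could be mapped to booleans. A minimal sketch on toy data, with `yes_no_to_bool` as a hypothetical helper name:

```python
import pandas as pd


def yes_no_to_bool(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy with "Yes"/"No" strings in `columns` mapped to True/False."""
    out = frame.copy()
    for col in columns:
        # Values other than "Yes"/"No" would become NaN and need separate handling.
        out[col] = out[col].map({"Yes": True, "No": False})
    return out


# Toy frame; the Yes/No contents are assumed, not taken from the dataset output.
toy = pd.DataFrame({
    "Has Table booking": ["Yes", "No", "Yes"],
    "Has Online delivery": ["No", "No", "Yes"],
})

converted = yes_no_to_bool(toy, ["Has Table booking", "Has Online delivery"])
print(converted)
```

Checking `df[col].unique()` for each candidate column first would confirm whether the mapping covers every value.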


In [15]:
# Target Distribution (Aggregate rating)

print("Aggregate rating summary:\n", df["Aggregate rating"].describe())

Aggregate rating summary:
 count    9551.000000
mean        2.666370
std         1.516378
min         0.000000
25%         2.500000
50%         3.200000
75%         3.700000
max         4.900000
Name: Aggregate rating, dtype: float64


In [17]:
# Ten most frequent rating values

rating_counts = df["Aggregate rating"].value_counts().head(10)
print("Top rating counts:\n", rating_counts)

Top rating counts:
 Aggregate rating
0.0    2148
3.2     522
3.1     519
3.4     498
3.3     483
3.5     480
3.0     468
3.6     458
3.7     427
3.8     400
Name: count, dtype: int64
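
The counts above show the imbalance directly: 2148 of the 9551 rows (about 22.5%) have a rating of 0.0, roughly four times the next most frequent value. A rating of 0.0 may indicate unrated restaurants, which could be checked against the `Rating text` column. A minimal sketch for expressing the counts as shares, using toy ratings and a hypothetical helper name `class_shares`:

```python
import pandas as pd


def class_shares(ratings: pd.Series) -> pd.Series:
    """Return each rating value's share of all rows, largest first."""
    return ratings.value_counts(normalize=True).sort_values(ascending=False)


# Toy ratings standing in for df["Aggregate rating"] (illustrative only).
toy_ratings = pd.Series([0.0, 0.0, 0.0, 3.2, 3.2, 3.4, 4.1, 4.5])

shares = class_shares(toy_ratings)
zero_share = (toy_ratings == 0).mean()

print(shares.round(3))
print("Share of zero ratings:", round(zero_share, 3))
```

On the real data this would be `class_shares(df["Aggregate rating"])`.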
