Let's start by analyzing the raw data. I'll check for:

- Missing values
- Incorrect formatting (e.g., inconsistent phone numbers, addresses, or ratings)
- Duplicates
- Unnecessary columns

I'll process the file now.

In [1]:
import pandas as pd

# Load the raw data file
file_path = r"C:\Users\visha\OneDrive\Desktop\restaurants raw data.csv"
df = pd.read_csv(file_path)

# Display basic info and first few rows
df.info(), df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3199 entries, 0 to 3198
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    3199 non-null   object 
 1   Food    3199 non-null   object 
 2   Street  3199 non-null   object 
 3   Phone   3199 non-null   object 
 4   Rating  3199 non-null   float64
dtypes: float64(1), object(4)
memory usage: 125.1+ KB


(None,
                   Name   Food                      Street            Phone  \
 0              Unknown  Salad           Yonge St, Toronto  +1 919-344-3561   
 1      The Black Briik  Sushi  Bloor Street West, Toronto  +1 907-190-6640   
 2    Prague Restaurant    BBQ           Yonge St, Toronto  +1 899-455-5080   
 3               Pi Co.  pizza         College St, Toronto  +1 840-290-9973   
 4  Prime Doner Shwarma  Tacos     Spadina Avenue, Toronto  +1 864-238-7803   
 
    Rating  
 0     2.7  
 1     1.6  
 2     4.1  
 3     2.7  
 4     4.7  )

Data Overview:
- Total Entries: 3,199 restaurants
- Columns:
  - Name: Restaurant name (some values are "Unknown" and need fixing)
  - Food: Type of cuisine (no missing values)
  - Street: Street name and city (some might need corrections)
  - Phone: Phone numbers (need validation)
  - Rating: Restaurant rating (float, format looks correct)
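
Before cleaning, it can help to put numbers on each of these checks. The following is a minimal sketch, not the code that was run above: the `audit` helper and the four sample rows are hypothetical, and it assumes the same column names and the +1 XXX-XXX-XXXX phone format.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real file
sample = pd.DataFrame({
    "Name": ["Unknown", "Pi Co.", "Pi Co.", "Example Cafe"],
    "Food": ["Salad", "pizza", "pizza", "Sushi"],
    "Street": ["Yonge St, Toronto", "College St, Toronto",
               "College St, Toronto", "Bloor Street West, Toronto"],
    "Phone": ["+1 919-344-3561", "+1 840-290-9973",
              "+1 840-290-9973", "9073334444"],
    "Rating": [2.7, 2.7, 2.7, 5.6],
})

def audit(df):
    """Count the issues listed in the overview, one number per check."""
    phone_ok = df["Phone"].astype(str).str.fullmatch(r"\+1 \d{3}-\d{3}-\d{4}")
    return {
        "missing_values": int(df.isna().sum().sum()),
        "unknown_names": int((df["Name"] == "Unknown").sum()),
        "bad_phones": int((~phone_ok).sum()),
        "ratings_out_of_range": int((~df["Rating"].between(1.0, 5.0)).sum()),
        "duplicate_name_street": int(df.duplicated(subset=["Name", "Street"]).sum()),
    }

report = audit(sample)
print(report)
```

Running `audit(df)` on the loaded file would show how many rows each cleaning step is expected to touch.
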
Cleaning Steps:
1. Fix "Unknown" Names
   - Replace "Unknown" with "Unnamed Restaurant" or use the street name as a fallback.
2. Validate Phone Numbers
   - Ensure they follow the format: +1 XXX-XXX-XXXX
   - Flag any number that does not match as "Invalid".
3. Normalize Street Names
   - Standardize naming (e.g., convert "Bloor Street West" to "Bloor St W").
4. Handle Ratings
   - Clip ratings to the range 1.0 to 5.0.
   - Round ratings to 1 decimal place (if needed).
5. Remove Duplicates
   - Identify and drop entries that share the same name and street.

I'll now clean the data accordingly.

In [2]:
# Cleaning Process

# 1. Replace "Unknown" Names with "Unnamed Restaurant"
df["Name"] = df["Name"].replace("Unknown", "Unnamed Restaurant")

# 2. Validate & Format Phone Numbers (Ensuring correct format)
import re

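def validate_phone(phone):
    pattern = r"\+1 \d{3}-\d{3}-\d{4}"  # Expected format: +1 XXX-XXX-XXXX
    # fullmatch rejects trailing characters that re.match would let through
    if isinstance(phone, str) and re.fullmatch(pattern, phone.strip()):
        return phone.strip()
    return "Invalid"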

df["Phone"] = df["Phone"].apply(validate_phone)

# 3. Standardize Street Names (shorten common street words)
street_replacements = {
    "Street": "St",
    "Avenue": "Ave",
    "Road": "Rd",
    "Boulevard": "Blvd",
    "Drive": "Dr",
    "West": "W",
    "East": "E",
    "North": "N",
    "South": "S"
}

for key, value in street_replacements.items():
    # \b limits the match to whole words, so names like "Weston" stay intact
    df["Street"] = df["Street"].str.replace(rf"\b{key}\b", value, regex=True)

# 4. Ensure Ratings are within valid range (1.0 - 5.0) and round to 1 decimal
df["Rating"] = df["Rating"].clip(1.0, 5.0).round(1)

# 5. Remove Duplicate Entries (rows sharing the same Name and Street)
df.drop_duplicates(subset=["Name", "Street"], keep="first", inplace=True)

# Display cleaned data overview
df.info(), df.head()


<class 'pandas.core.frame.DataFrame'>
Index: 3097 entries, 0 to 3198
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    3097 non-null   object 
 1   Food    3097 non-null   object 
 2   Street  3097 non-null   object 
 3   Phone   3097 non-null   object 
 4   Rating  3097 non-null   float64
dtypes: float64(1), object(4)
memory usage: 145.2+ KB


(None,
                   Name   Food                Street            Phone  Rating
 0   Unnamed Restaurant  Salad     Yonge St, Toronto  +1 919-344-3561     2.7
 1      The Black Briik  Sushi   Bloor St W, Toronto  +1 907-190-6640     1.6
 2    Prague Restaurant    BBQ     Yonge St, Toronto  +1 899-455-5080     4.1
 3               Pi Co.  pizza   College St, Toronto  +1 840-290-9973     2.7
 4  Prime Doner Shwarma  Tacos  Spadina Ave, Toronto  +1 864-238-7803     4.7)
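
Step 2 only checks the phone format and marks everything else as "Invalid". If the raw file contained numbers written in other styles, a variant could try to reformat them first. This is a rough sketch under that assumption; `normalize_phone` is a hypothetical helper, not part of the cell above, and it only handles 10-digit North American numbers.

```python
import re

def normalize_phone(phone):
    """Return the number as +1 XXX-XXX-XXXX, or "Invalid" if it cannot be recovered."""
    digits = re.sub(r"\D", "", str(phone))  # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the leading country code
    if len(digits) != 10:
        return "Invalid"
    return f"+1 {digits[:3]}-{digits[3:6]}-{digits[6:]}"

print(normalize_phone("(919) 344-3561"))   # reformatted
print(normalize_phone("+1 907-190-6640"))  # already in the target format
print(normalize_phone("12345"))            # too short to recover
```

It could be applied with `df["Phone"] = df["Phone"].apply(normalize_phone)` in place of the validation step.
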

Cleaned Data Overview:
- Total Entries: ✅ 3,097 (102 duplicate rows removed)
- Key Fixes:
  - Restaurant Names: "Unknown" replaced with "Unnamed Restaurant".
  - Phone Numbers: Checked against +1 XXX-XXX-XXXX; non-matching values marked as "Invalid".
  - Street Names: Standardized (e.g., "Bloor Street West" → "Bloor St W").
  - Ratings: Clipped to the range 1.0 to 5.0 and rounded to 1 decimal place.
  - Duplicates Removed: Entries with the same name and street were removed.
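
One caveat on the duplicate step: once every "Unknown" becomes "Unnamed Restaurant", two different unnamed places on the same street look identical by name and street, so some of the removed rows may not be true duplicates. A possible alternative, sketched here on hypothetical rows, is to require a matching phone number as well before dropping placeholder-named entries. Using Phone as the tie-breaker is my assumption, not something the data confirms.

```python
import pandas as pd

# Hypothetical rows: two different unnamed places on one street, plus one true duplicate
sample = pd.DataFrame({
    "Name": ["Unnamed Restaurant", "Unnamed Restaurant", "Pi Co.", "Pi Co."],
    "Street": ["Yonge St, Toronto"] * 2 + ["College St, Toronto"] * 2,
    "Phone": ["+1 919-344-3561", "+1 899-455-5080",
              "+1 840-290-9973", "+1 840-290-9973"],
})

is_placeholder = sample["Name"] == "Unnamed Restaurant"

# Named rows: duplicates by Name + Street, as in step 5
named = sample[~is_placeholder].drop_duplicates(subset=["Name", "Street"], keep="first")
# Placeholder rows: also require the same phone before treating them as one place
unnamed = sample[is_placeholder].drop_duplicates(subset=["Street", "Phone"], keep="first")

deduped = pd.concat([named, unnamed]).sort_index()
print(len(deduped))
```

Here both unnamed rows survive because their phone numbers differ, while the repeated "Pi Co." row is still dropped.
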
Now, I'll save the cleaned data so you can use it.

In [4]:
import os

# Create directory if it doesn't exist
os.makedirs("/mnt/data", exist_ok=True)

# Save the cleaned data to a new CSV file
cleaned_file_path = "/mnt/data/restaurants_cleaned.csv"
df.to_csv(cleaned_file_path, index=False)


In [5]:
# Also save a copy in the current working directory
cleaned_file_path = "restaurants_cleaned.csv"
df.to_csv(cleaned_file_path, index=False)


The dataset is now structured, validated, and ready for analysis or visualization.
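
To confirm that the saved file still meets these guarantees after a round trip through CSV, a few checks can be run on the reloaded data. This is only a sketch: `check_cleaned` is a hypothetical helper, and the two-row CSV text is a stand-in for restaurants_cleaned.csv built from the first rows shown earlier.

```python
import io
import pandas as pd

def check_cleaned(df):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if df.isna().any().any():
        problems.append("missing values")
    if (df["Name"] == "Unknown").any():
        problems.append('leftover "Unknown" names')
    if not df["Rating"].between(1.0, 5.0).all():
        problems.append("ratings outside 1.0-5.0")
    if df.duplicated(subset=["Name", "Street"]).any():
        problems.append("duplicate Name/Street pairs")
    return problems

# Hypothetical two-row stand-in for restaurants_cleaned.csv
csv_text = """Name,Food,Street,Phone,Rating
Unnamed Restaurant,Salad,"Yonge St, Toronto",+1 919-344-3561,2.7
Pi Co.,pizza,"College St, Toronto",+1 840-290-9973,2.7
"""
reloaded = pd.read_csv(io.StringIO(csv_text))
print(check_cleaned(reloaded))
```

On the real file this would be `check_cleaned(pd.read_csv("restaurants_cleaned.csv"))`.
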