## Overview

This notebook documents the ETL process for restaurant metadata, combining the `metadata-places` files with the customer reviews for each state. It walks through extraction, transformation, and loading, and explains the techniques used and the reasoning behind each step. The result is a cleaned dataset of restaurants across five states: New York, California, Florida, Pennsylvania, and Texas.


In [1]:
import os
import re
import json
import pandas as pd
import concurrent.futures
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter


## Loading Data

In this phase, we focus on preparing and integrating the restaurant metadata with reviews. The process unfolds as follows:

- **`Define Paths`**: We start by specifying the directories for both the metadata of places and the reviews from various states, ensuring a structured approach to data handling.

- **`Collect GMap IDs`**: A function is employed to traverse through state review directories, collecting unique Google Maps IDs. This step is crucial for correlating reviews with the corresponding restaurant metadata.

- **`Filter Metadata`**: With the collected GMap IDs, we filter the comprehensive metadata to retain only those entries that match our review dataset. This filtration ensures our dataset is precise and relevant to the analysis.

- **`Data Integration`**: The two functions are then run in sequence, and the filtered, state-tagged metadata is loaded into a pandas DataFrame. This DataFrame is the input to the rest of the ETL process.

- **`Filter by Restaurant`**: Further refine the dataset by filtering for entries that specifically relate to food services such as restaurants, cafes, and eateries, excluding non-food-related establishments. This is achieved by matching against a predefined list of food-related keywords and exclusions to ensure the dataset is strictly relevant to the restaurant analysis.

- **`Check Categories`**: After filtering, we examine the categories of the remaining entries to validate the effectiveness of our filtering criteria and to understand the distribution of restaurant types within our dataset. This step provides insight into the variety of dining establishments covered in our analysis and helps identify any potential areas for further refinement in our data selection criteria.

**Note**: Only places that appear in the review files of the selected states are loaded, and of those only food-related establishments are kept. This avoids loading and processing metadata that the analysis never uses, which saves time and memory.

**Defining paths**

In [9]:
metadata_places_path = '../data/raw/Google Maps/metadata-places/'
state_review_paths = {
    'New York': '../data/raw/Google Maps/State Review/review-New_York/',
    'California': '../data/raw/Google Maps/State Review/review-California/',
    'Florida': '../data/raw/Google Maps/State Review/review-Florida/',
    'Pennsylvania': '../data/raw/Google Maps/State Review/review-Pennsylvania/',
    'Texas': '../data/raw/Google Maps/State Review/review-Texas/'
}

**Collect GMap IDs**

In [10]:
def collect_gmap_ids_from_reviews(state_review_paths):
    state_gmap_ids = {}
    for state, state_path in state_review_paths.items():
        state_gmap_ids[state] = set()
        # Iterate over each JSON file
        for filename in os.listdir(state_path):
            file_path = os.path.join(state_path, filename)
            with open(file_path, 'r') as file:
                for line in file:
                    review_data = json.loads(line)
                    state_gmap_ids[state].add(review_data['gmap_id'])
    return state_gmap_ids
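
The collection step can be tried on a throwaway directory before pointing it at the real review folders. The sketch below is a slightly more tolerant variant, not the function used in this notebook: `collect_gmap_ids_tolerant` is a hypothetical name, and the `.json` extension filter and the blank-line skipping are assumptions about how the raw files might look.

```python
import json
import os
import tempfile


def collect_gmap_ids_tolerant(state_review_paths):
    """Hypothetical variant: same output shape as the original function,
    but skips blank lines and files that do not end in .json (assumption)."""
    state_gmap_ids = {}
    for state, state_path in state_review_paths.items():
        ids = set()
        for filename in sorted(os.listdir(state_path)):
            if not filename.endswith('.json'):
                continue
            file_path = os.path.join(state_path, filename)
            with open(file_path, 'r', encoding='utf-8') as file:
                for line in file:
                    line = line.strip()
                    if not line:
                        continue  # tolerate empty lines between records
                    ids.add(json.loads(line)['gmap_id'])
        state_gmap_ids[state] = ids
    return state_gmap_ids


# Toy data: one JSON-lines file with a repeated ID, plus an unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, '1.json'), 'w', encoding='utf-8') as f:
        f.write('{"gmap_id": "a", "rating": 5}\n\n')
        f.write('{"gmap_id": "b", "rating": 4}\n')
        f.write('{"gmap_id": "a", "rating": 3}\n')
    with open(os.path.join(tmp, 'notes.txt'), 'w', encoding='utf-8') as f:
        f.write('not json')
    toy_ids = collect_gmap_ids_tolerant({'Toy State': tmp})

print(toy_ids)
```

Because the IDs are stored in a set, a place reviewed many times is still counted once per state.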

**Filter Metadata**

In [11]:
def filter_metadata_places(metadata_places_path, state_gmap_ids):
    filtered_metadata = []
    for filename in os.listdir(metadata_places_path):
        file_path = os.path.join(metadata_places_path, filename)
        with open(file_path, 'r') as file:
            for line in file:
                place_data = json.loads(line)
                for state, gmap_ids in state_gmap_ids.items():
                    if place_data['gmap_id'] in gmap_ids:
                        place_data['state'] = state 
                        filtered_metadata.append(place_data)
                        break  # Break to avoid adding the same place under multiple states
    return filtered_metadata
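
The inner loop above checks every state's set for every place. A possible alternative, sketched here with hypothetical helper names (`build_gmap_id_to_state`, `tag_places`) and toy data, is to invert the mapping once so that each place needs a single dictionary lookup. The first state that lists an ID wins, which mirrors the `break` in the original function.

```python
def build_gmap_id_to_state(state_gmap_ids):
    """Invert {state: {gmap_id, ...}} into {gmap_id: state}."""
    lookup = {}
    for state, gmap_ids in state_gmap_ids.items():
        for gmap_id in gmap_ids:
            # setdefault keeps the first state seen for an ID
            lookup.setdefault(gmap_id, state)
    return lookup


def tag_places(places, lookup):
    """Keep only places whose gmap_id is known and attach the state."""
    tagged = []
    for place in places:
        state = lookup.get(place['gmap_id'])
        if state is not None:
            tagged.append({**place, 'state': state})
    return tagged


# Toy data: 'b' appears under two states, 'z' under none.
toy_state_ids = {'New York': {'a', 'b'}, 'Texas': {'b', 'c'}}
toy_places = [{'gmap_id': 'a'}, {'gmap_id': 'b'}, {'gmap_id': 'z'}]

toy_lookup = build_gmap_id_to_state(toy_state_ids)
toy_tagged = tag_places(toy_places, toy_lookup)
print(toy_tagged)
```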

**Data Integration**

In [12]:
# Step 1: Collect GMap IDs from state reviews with state information
state_gmap_ids = collect_gmap_ids_from_reviews(state_review_paths)

# Step 2: Filter metadata places based on collected GMap IDs and include state information
filtered_metadata = filter_metadata_places(metadata_places_path, state_gmap_ids)

# Step 3: Convert the filtered metadata to a DataFrame
df_filtered_metadata = pd.DataFrame(filtered_metadata)


In [13]:
df_filtered_metadata.tail()

Unnamed: 0,name,address,gmap_id,description,latitude,longitude,category,avg_rating,num_of_reviews,price,hours,MISC,state,relative_results,url
324132,Audiomotive Creations,"Audiomotive Creations, 1167 Niagara Falls Blvd...",0x89d3726f24abe7a1:0xa756f654e5f0ea9c,,42.986004,-78.822423,"[Auto parts store, Electronics store]",4.9,548,,"[[Saturday, 9AM–6PM], [Sunday, Closed], [Monda...",,New York,"[0x89d36814d72ff6b5:0x9f609a9ca26e5db1, 0x89d3...",https://www.google.com/maps/place//data=!4m2!3...
324133,The Alibi Bar & Grill,"The Alibi Bar & Grill, 5822 Camp Rd, Hamburg, ...",0x89d304cb1436ea91:0xeb8ec964c0277a1e,,42.728979,-78.836644,[Bar],4.3,155,$,"[[Saturday, 12PM–4AM], [Sunday, 12PM–4AM], [Mo...","{'Service options': ['Outdoor seating', 'Takeo...",New York,"[0x89d31ad0f3d9db0b:0x975a9c5fcea9860d, 0x89d3...",https://www.google.com/maps/place//data=!4m2!3...
324134,Peacemaker Brewing Company,"Peacemaker Brewing Company, 39 Coach St, Canan...",0x89d126876236a7ad:0x8271b070396653b1,,42.884818,-77.281542,[Brewery],4.8,68,,"[[Saturday, 12–10PM], [Sunday, 12–5PM], [Monda...",{'From the business': ['Identifies as veteran-...,New York,"[0x89d1269180ea2eef:0x956e4a0ddb33852b, 0x89d1...",https://www.google.com/maps/place//data=!4m2!3...
324135,SVS Vision Optical Centers,"SVS Vision Optical Centers, 1551 Niagara Falls...",0x89d3721399f576df:0xe572aaa0d5ae28fc,,42.997026,-78.82158,"[Eye care center, Contact lenses supplier, Opt...",4.7,68,,"[[Saturday, 9AM–1PM], [Sunday, Closed], [Monda...","{'Highlights': ['LGBTQ friendly', 'Transgender...",New York,"[0x89d37250fe70683f:0x321b9ff0973f91a4, 0x89d3...",https://www.google.com/maps/place//data=!4m2!3...
324136,House of Gourmet -- 食全食美,"House of Gourmet -- 食全食美, 2865 Sheridan Dr Sui...",0x89d373386b718fff:0xf82c90ef48e7b23,,42.980251,-78.827337,"[Chinese restaurant, Sichuan restaurant]",4.3,58,,"[[Saturday, 11:15AM–9:45PM], [Sunday, 11:15AM–...","{'Service options': ['Delivery', 'Takeout', 'D...",New York,"[0x89d3731867671331:0x643f6dd3ba475c6e, 0x89d3...",https://www.google.com/maps/place//data=!4m2!3...


**Filter by restaurant**:
- Defined food-related keywords and a list of exclusions so that only restaurant-related places are kept.
- Created a function that applies both patterns to each place's category list.

In [14]:
food_keywords = [
    'restaurant', 'cafe', '\\bfood\\b', 'dining', 'eatery', 'bistro', 'bakery',
    'grill', 'kitchen', 'pizzeria', 'steakhouse', 'sushi', 'tavern', 'diner'
]

exclusions = [
    'supplier','ATM','gas station', 'school', 'bank', 'area', 'company', 'broker', 'bark', 'stool', 'dart',
    'store', 'shop', 'bar', 'lounge', 'venue', 'service', 'club', 'remodeler', 'boutique',
    'market', 'pharmacy', 'furniture', 'grocery', 'hardware', 'book', 'garden', 'home', 'office', 
    'electronics', 'clothing', 'gift', 'toy', 'jewelry', 'florist', 'repair', 'maintenance',
    'construction', 'contractor', 'installer', 'wholesaler', 'retailer', 'distributor',
    'manufacturer', 'producer', 'facility', 'center', 'park', 'gallery', 'studio', 'salon', 'spa',
    'gym', 'fitness', 'health', 'wellness', 'event', 'entertainment', 'amusement',
    'recreation', 'cultural', 'education', 'tutoring', 'learning', 'training', 'consultant', 
    'counseling', 'legal', 'financial', 'insurance', 'real estate', 'accommodation', 'lodging', 
    'rental', 'automotive', 'mechanic', 'pet', 'veterinary', 'storage', 'security', 'transportation',
    'delivery', 'logistics', 'utility', 'energy', 'sanitation', 'cleaning', 'waste', 'recycling',
]

include_pattern = re.compile(r'\b(?:' + '|'.join(food_keywords) + r')\b', re.IGNORECASE)
exclude_pattern = re.compile(r'\b(?:' + '|'.join(exclusions) + r')\b', re.IGNORECASE)



In [15]:
def is_food_service(category_list):
    """
    Check if any of the categories in the list match the food-related keywords
    and ensure none match the exclusions.
    """
    category_list = category_list if isinstance(category_list, list) else []
    if any(exclude_pattern.search(category) for category in category_list):
        return False
    return any(include_pattern.search(category) for category in category_list)

restaurant_metadata = [place for place in filtered_metadata if is_food_service(place.get('category'))]
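
To see how the two patterns interact, the following sketch applies the same logic to a few sample category lists. The keyword lists are deliberately shortened for illustration, and the `_demo` names are hypothetical. Note that exclusions take priority, and that the `\b` word boundaries make `bar` match "Bar & grill" but not "Barbecue restaurant".

```python
import re

# Shortened lists for illustration only; the notebook uses longer ones.
food_keywords_demo = ['restaurant', 'cafe', 'grill', 'bakery']
exclusions_demo = ['bar', 'store', 'supplier']

include_demo = re.compile(r'\b(?:' + '|'.join(food_keywords_demo) + r')\b', re.IGNORECASE)
exclude_demo = re.compile(r'\b(?:' + '|'.join(exclusions_demo) + r')\b', re.IGNORECASE)


def is_food_service_demo(category_list):
    """Same rule as is_food_service: any exclusion rejects the place outright."""
    category_list = category_list if isinstance(category_list, list) else []
    if any(exclude_demo.search(category) for category in category_list):
        return False
    return any(include_demo.search(category) for category in category_list)


samples = {
    'plain restaurant': ['Chinese restaurant', 'Sichuan restaurant'],
    'bar and grill': ['Bar & grill'],        # rejected: whole word "Bar"
    'barbecue': ['Barbecue restaurant'],     # kept: "bar" is only a prefix here
    'mixed': ['Bakery', 'Grocery store'],    # rejected: one excluded category is enough
    'missing': None,                         # rejected: no categories at all
}
results = {label: is_food_service_demo(cats) for label, cats in samples.items()}
print(results)
```

The "mixed" case shows the trade-off of this design: a place is dropped as soon as any one of its categories matches an exclusion, even if another category is clearly food-related.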


In [16]:
df_restaurant_metadata = pd.DataFrame(restaurant_metadata)
df_restaurant_metadata.head()


Unnamed: 0,name,address,gmap_id,description,latitude,longitude,category,avg_rating,num_of_reviews,price,hours,MISC,state,relative_results,url
0,San Soo Dang,"San Soo Dang, 761 S Vermont Ave, Los Angeles, ...",0x80c2c778e3b73d33:0xbdc58662a4a97d49,,34.058092,-118.29213,[Korean restaurant],4.4,18,,"[[Thursday, 6:30AM–6PM], [Friday, 6:30AM–6PM],...","{'Service options': ['Takeout', 'Dine-in', 'De...",California,"[0x80c2c78249aba68f:0x35bf16ce61be751d, 0x80c2...",https://www.google.com/maps/place//data=!4m2!3...
1,Vons Chicken,"Vons Chicken, 12740 La Mirada Blvd, La Mirada,...",0x80dd2b4c8555edb7:0xfc33d65c4bdbef42,,33.916402,-118.010855,[Restaurant],4.5,18,,"[[Thursday, 11AM–9:30PM], [Friday, 11AM–9:30PM...","{'Service options': ['Outdoor seating', 'Curbs...",California,,https://www.google.com/maps/place//data=!4m2!3...
2,Golden Castle,"Golden Castle, 1906 E 12th St, Austin, TX 78702",0x8644b59b8fe872e5:0x5e638876caa84cc3,,30.273985,-97.719563,[Restaurant],4.5,8,,"[[Thursday, 5PM–12AM], [Friday, 5PM–12AM], [Sa...","{'Service options': ['Delivery', 'Takeout', 'D...",Texas,,https://www.google.com/maps/place//data=!4m2!3...
3,Oneyda's Bakery,"Oneyda's Bakery, 600 Goodlette-Frank Rd #101, ...",0x88dae191ee505917:0x6ba3e25388d3fad4,,26.154754,-81.790528,"[Bakery, Deli]",4.6,19,$,"[[Thursday, 8AM–6PM], [Friday, 8AM–6PM], [Satu...",{'Service options': ['Delivery']},Florida,"[0x88dae1997e122d6b:0xfd776fa851f06d29, 0x88da...",https://www.google.com/maps/place//data=!4m2!3...
4,Top Cat Seafood Restaurant,"Top Cat Seafood Restaurant, 3117 Martin Luther...",0x864e9891e381f3df:0x4cefe6219bc9199c,,32.77313,-96.764484,[Seafood restaurant],3.9,8,,"[[Thursday, 12–8PM], [Friday, 12–8PM], [Saturd...","{'Service options': ['Takeout', 'Dine-in', 'De...",Texas,"[0x864e988def7880bf:0x981331c7ea3d01cd, 0x864e...",https://www.google.com/maps/place//data=!4m2!3...


**Check Categories**

In [17]:

exploded_categories = df_restaurant_metadata.explode('category')

category_counts = exploded_categories['category'].value_counts()

df_category_counts = category_counts.reset_index()
df_category_counts.columns = ['Category', 'Count']

df_category_counts.head()


Unnamed: 0,Category,Count
0,Restaurant,9331
1,Mexican restaurant,2913
2,Pizza restaurant,2080
3,Chinese restaurant,1949
4,Cafe,1646


## Data Transformations

This section outlines the transformation steps applied to clean and structure the restaurant metadata for analysis.

### Initial Cleanup
- **Dropped Columns**: Columns such as `description`, `relative_results`, `MISC`, `hours`, and `url` were removed to focus on relevant data.

### Price Column Transformation
- **Normalization**: Transformed `$` symbols in the `price` column to numeric values (`1` to `4`, with `0` for missing) to represent price ranges.

### Geographic Data Extraction
- **City and Postal Code**: Used regex to extract `city` and `postal_code` from the `address` column for detailed geographic analysis.

### Data Structuring
- **Rearranging Columns**: Columns were reordered to improve data accessibility.
- **Dummy Variables for Categories**: Expanded the `category` column into dummy variables for detailed category analysis.
- **Creation of Dummy Tables**: Generated a separate table with `gmap_id` and category dummies for nuanced analysis.

### Data Integrity
- **Duplicate Handling**: Identified and removed duplicates based on a composite key (`name`, `address`, `city`, `postal_code`).
- **Missing Values**: Addressed missing values in essential columns, using default values or removal where necessary.

### Geolocation Correction
- **Reverse Geocoding**: Populated missing `city` and `postal_code` fields using latitude and longitude data to enhance dataset accuracy.

### Final Adjustments
- **Data Type Conversions**: Adjusted data types for specific columns like `postal_code` to ensure consistency across the dataset.

After these transformations, the dataset has a consistent column layout, no duplicate entries on the composite key, and no missing values in the retained columns.


**Initial Cleanup**

In [18]:
df_restaurant_clean = df_restaurant_metadata.drop(columns=['description', 'relative_results', 'MISC','hours','url'])

**Price Column Transformation**
- 1 = Inexpensive, usually $10 and under
- 2 = Moderately expensive, usually $10 to $25
- 3 = Expensive, usually $25 to $45
- 4 = Very expensive, usually $50 and up
- 0 = No price data available


In [19]:
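# First pass: missing prices stay empty and are then labelled "No Data".
# This leaves the column with mixed types (numbers and strings); the next cell
# redefines the function and overwrites `price_numeric` with an all-integer version.
def price_to_numeric(price):
    if pd.isnull(price):
        return None
    return len(price)  # '$' -> 1, '$$' -> 2, ...

df_restaurant_clean['price_numeric'] = df_restaurant_clean['price'].apply(price_to_numeric)
df_restaurant_clean['price_numeric'] = df_restaurant_clean['price_numeric'].fillna("No Data")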

In [20]:
def price_to_numeric(price):
    if pd.isnull(price):
        return 0  
    else:
        return len(price)  

df_restaurant_clean['price_numeric'] = df_restaurant_clean['price'].apply(price_to_numeric)
df_restaurant_clean.drop(columns=['price'], inplace=True)
df_restaurant_clean['price_numeric'] = df_restaurant_clean['price_numeric'].astype(int)
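
The same mapping can also be written without a helper function by using pandas string methods. This is a minimal sketch on toy values, assuming the `price` column contains only runs of `$` characters or missing values.

```python
import pandas as pd

# Toy prices; None stands for a place without price information.
prices = pd.Series(['$', '$$', None, '$$$$', '$$$'])

# The length of the string is the price level; missing values become 0.
price_numeric = prices.str.len().fillna(0).astype(int)
print(price_numeric.tolist())
```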


**Geographic Data Extraction**:
- `city` and `postal_code` are extracted from the `address` column.

In [21]:
city_regex = r',\s*([^,]+),\s*[A-Z]{2}\s+\d{5}'
postal_code_regex = r'(\d{5})$'

df_restaurant_clean['city'] = df_restaurant_clean['address'].str.extract(city_regex, expand=False)
df_restaurant_clean['postal_code'] = df_restaurant_clean['address'].str.extract(postal_code_regex, expand=False)
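
As a worked example, the two patterns can be applied to three addresses that appear in the outputs of this notebook. The first two follow the "street, city, ST 12345" layout; the third has no ZIP code, so both patterns fail and return a missing value. Rows like that are the ones handled later by reverse geocoding.

```python
import pandas as pd

city_regex = r',\s*([^,]+),\s*[A-Z]{2}\s+\d{5}'
postal_code_regex = r'(\d{5})$'

addresses = pd.Series([
    'Golden Castle, 1906 E 12th St, Austin, TX 78702',
    'China King Express, 6938 Erie Rd, Derby, NY 14047',
    'Horseshoe Inn, W Perimeter Rd, Coldspring, NY',  # no ZIP code at the end
])

# The city is the comma-separated field right before "ST 12345".
cities = addresses.str.extract(city_regex, expand=False)
# The postal code is the five digits at the very end of the string.
postal_codes = addresses.str.extract(postal_code_regex, expand=False)

print(pd.DataFrame({'city': cities, 'postal_code': postal_codes}))
```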


### Data Structuring

**Rearranging Columns**

In [22]:

new_column_order = ['name', 'address','state', 'city', 'postal_code', 'latitude', 'longitude', 'avg_rating', 'num_of_reviews', 'price_numeric', 'gmap_id','category']
df_restaurant_clean = df_restaurant_clean.reindex(columns=new_column_order)


In [23]:
df_restaurant_clean.tail()

Unnamed: 0,name,address,state,city,postal_code,latitude,longitude,avg_rating,num_of_reviews,price_numeric,gmap_id,category
26909,Luthun,"Luthun, 432 E 13th St, New York, NY 10009",New York,New York,10009,40.729911,-73.981883,4.8,68,0,0x89c2592cf2935ed9:0x6672264426649f94,"[New American restaurant, American restaurant,..."
26910,To Two Boonsik,"To Two Boonsik, 97 Canal St, New York, NY 10002",New York,New York,10002,40.715612,-73.993828,4.3,45,0,0x89c25a286e6ea2db:0x1291d4bbf41d5ed4,[Korean restaurant]
26911,Matsunichi,"Matsunichi, 14-18 Elizabeth St #32-33, New Yor...",New York,New York,10013,40.715866,-73.997244,4.2,48,0,0x89c25a27a5c98845:0xad087746b0c3381a,[Sushi restaurant]
26912,China King Express,"China King Express, 6938 Erie Rd, Derby, NY 14047",New York,Derby,14047,42.698185,-78.988253,4.1,118,1,0x89d31f176b64da79:0x202faca0f650e880,[Chinese restaurant]
26913,House of Gourmet -- 食全食美,"House of Gourmet -- 食全食美, 2865 Sheridan Dr Sui...",New York,Tonawanda,14150,42.980251,-78.827337,4.3,58,0,0x89d373386b718fff:0xf82c90ef48e7b23,"[Chinese restaurant, Sichuan restaurant]"


**Dummy Variables for Categories & Creation of Dummy Tables**

In [24]:
temp_df = df_restaurant_clean[['gmap_id', 'category']]
category_dummies = temp_df['category'].str.get_dummies(sep=', ')
restaurant_cat_dummies = pd.concat([temp_df[['gmap_id']], category_dummies], axis=1)
df_restaurant_clean = df_restaurant_clean.drop(columns=['category'])

In [25]:
df_restaurant_clean.head()

Unnamed: 0,name,address,state,city,postal_code,latitude,longitude,avg_rating,num_of_reviews,price_numeric,gmap_id
0,San Soo Dang,"San Soo Dang, 761 S Vermont Ave, Los Angeles, ...",California,Los Angeles,90005,34.058092,-118.29213,4.4,18,0,0x80c2c778e3b73d33:0xbdc58662a4a97d49
1,Vons Chicken,"Vons Chicken, 12740 La Mirada Blvd, La Mirada,...",California,La Mirada,90638,33.916402,-118.010855,4.5,18,0,0x80dd2b4c8555edb7:0xfc33d65c4bdbef42
2,Golden Castle,"Golden Castle, 1906 E 12th St, Austin, TX 78702",Texas,Austin,78702,30.273985,-97.719563,4.5,8,0,0x8644b59b8fe872e5:0x5e638876caa84cc3
3,Oneyda's Bakery,"Oneyda's Bakery, 600 Goodlette-Frank Rd #101, ...",Florida,Naples,34102,26.154754,-81.790528,4.6,19,1,0x88dae191ee505917:0x6ba3e25388d3fad4
4,Top Cat Seafood Restaurant,"Top Cat Seafood Restaurant, 3117 Martin Luther...",Texas,Dallas,75215,32.77313,-96.764484,3.9,8,0,0x864e9891e381f3df:0x4cefe6219bc9199c


In [26]:
restaurant_cat_dummies.tail()


Unnamed: 0,gmap_id,'Afghani restaurant','African restaurant','African restaurant'],'American restaurant','American restaurant'],'Animal shelter'],'Argentinian restaurant','Armenian restaurant','Armenian restaurant'],...,['West African restaurant',['West African restaurant'],['Wholesale bakery',['Wholesale bakery'],['Winery',['Wok restaurant',['Wok restaurant'],['Yakiniku restaurant',['Yakitori restaurant'],['Yemenite restaurant']
26909,0x89c2592cf2935ed9:0x6672264426649f94,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26910,0x89c25a286e6ea2db:0x1291d4bbf41d5ed4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26911,0x89c25a27a5c98845:0xad087746b0c3381a,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26912,0x89d31f176b64da79:0x202faca0f650e880,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26913,0x89d373386b718fff:0xf82c90ef48e7b23,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
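
The column labels in the output above still carry brackets and quotes (for example `'African restaurant'` next to `'African restaurant']`), which suggests the `category` values were handled as the text form of a list, so one category can end up under several labels. The sketch below shows one possible way to get one clean column per category. It is not the method used in this notebook; `to_category_list` is a hypothetical helper and the data is a toy example.

```python
import ast

import pandas as pd


def to_category_list(value):
    """Return a list of category strings from a list or from its text form."""
    if isinstance(value, list):
        return value
    if isinstance(value, str):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return [value]  # plain string: treat it as a single category
        return list(parsed) if isinstance(parsed, (list, tuple)) else [value]
    return []  # None / NaN: no categories


toy = pd.DataFrame({
    'gmap_id': ['id_1', 'id_2', 'id_3'],
    'category': [
        ['Bakery', 'Deli'],                               # real list
        "['Chinese restaurant', 'Sichuan restaurant']",   # text form of a list
        None,                                             # missing
    ],
})

# One row per (place, category), then count and cap at 1 to get 0/1 flags.
exploded = toy.assign(category=toy['category'].map(to_category_list)).explode('category')
toy_dummies = pd.crosstab(exploded['gmap_id'], exploded['category']).clip(upper=1)

# Places without any category drop out of the crosstab; add them back as all zeros.
toy_dummies = toy_dummies.reindex(toy['gmap_id'], fill_value=0).rename_axis('gmap_id').reset_index()
print(toy_dummies)
```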


### Data Integrity

**Duplicate Handling**:
- Duplicates must be identified with a composite key, because many restaurants share a name but operate at different locations.

In [27]:

composite_key = ['name', 'address', 'city', 'postal_code']
duplicate_rows = df_restaurant_clean.duplicated(subset=composite_key, keep=False)

num_duplicate_rows = duplicate_rows.sum()
print(f"Number of rows with potential duplicates based on name, address, city, and postal code: {num_duplicate_rows}")

potential_duplicates = df_restaurant_clean[duplicate_rows]

sorted_potential_duplicates = potential_duplicates.sort_values(by='name')

print("Potential duplicate rows sorted by name:")
sorted_potential_duplicates.head()




Number of rows with potential duplicates based on name, address, city, and postal code: 412
Potential duplicate rows sorted by name:


Unnamed: 0,name,address,state,city,postal_code,latitude,longitude,avg_rating,num_of_reviews,price_numeric,gmap_id
63,'Round The Lake Kitchen,"'Round The Lake Kitchen, 12057 PA-618, Conneau...",Pennsylvania,Conneaut Lake,16316,41.630465,-80.325579,4.5,27,0,0x88324fdf1a955263:0x6b4077533d6e6061
264,'Round The Lake Kitchen,"'Round The Lake Kitchen, 12057 PA-618, Conneau...",Pennsylvania,Conneaut Lake,16316,41.630465,-80.325579,4.5,27,0,0x88324fdf1a955263:0x6b4077533d6e6061
96,1903 Taphouse & Co.,"1903 Taphouse & Co., 175 N Main St, Bishop, CA...",California,Bishop,93514,37.361898,-118.395555,4.7,8,0,0x80be3dd544115ed9:0x5d500f8046469ea3
297,1903 Taphouse & Co.,"1903 Taphouse & Co., 175 N Main St, Bishop, CA...",California,Bishop,93514,37.361898,-118.395555,4.7,8,0,0x80be3dd544115ed9:0x5d500f8046469ea3
14869,2 Korean Girls,"2 Korean Girls, 2801a Florida Ave, Coconut Gro...",Florida,Coconut Grove,33133,25.729263,-80.240075,4.6,45,0,0x88d9b76485f1a105:0x92bc2888314d41f4


In [28]:
df_restaurant_clean = df_restaurant_clean.drop_duplicates(subset=composite_key)
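
A small illustration of why the composite key matters, using made-up rows: three entries share a name, but only two of them are the same place. Deduplicating on `name` alone would wrongly discard the second location.

```python
import pandas as pd

# Fictional rows for illustration.
toy = pd.DataFrame({
    'name': ['Cafe Uno', 'Cafe Uno', 'Cafe Uno'],
    'address': ['1 Main St', '1 Main St', '9 Oak Ave'],
    'city': ['Austin', 'Austin', 'Dallas'],
    'postal_code': ['78702', '78702', '75215'],
})
composite_key = ['name', 'address', 'city', 'postal_code']

# keep=False flags every member of a duplicate group, which is useful for inspection.
flagged = toy.duplicated(subset=composite_key, keep=False)

# drop_duplicates keeps the first row of each group by default.
deduped = toy.drop_duplicates(subset=composite_key)
by_name_only = toy.drop_duplicates(subset=['name'])

print(flagged.tolist())
print(len(deduped), len(by_name_only))
```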

**Missing Values**
- Checked every column for missing values.
- Manually checked the rows with an empty `address`: all of them were closed establishments, so they were removed from the data.
- Missing `postal_code` and `city` values were filled by reverse geocoding, described in the Geolocation Correction step below.

In [None]:
missing_values_count = df_restaurant_clean.isnull().sum()
print("Count of missing values in each column:")
print(missing_values_count)


In [31]:
rows_with_missing_values = df_restaurant_clean[df_restaurant_clean.isnull().any(axis=1)]
print("Rows with at least one missing value:")
rows_with_missing_values.head()


Rows with at least one missing value:


Unnamed: 0,name,address,state,city,postal_code,latitude,longitude,avg_rating,num_of_reviews,price_numeric,gmap_id
36,李小龍台吃,"〒11362 New York, Queens, Northern Blvd, 李小龍台吃",New York,,,40.770059,-73.735522,4.0,27,0,0x89c289efdb82221b:0xed627c2af97c2069
75,glazed,"Texas, Sugar Land, glazed邮政编码: 77479",Texas,,77479.0,29.590894,-95.625716,4.3,8,0,0x8640e6bd6e861357:0xb96bb36e5cd02ee3
928,Me Bakery,"New York, Flushing, 47th Ave, Me Bakery邮政编码: 1...",New York,,11358.0,40.752362,-73.785966,4.5,24,0,0x89c261a1df3470c3:0x1e414e9bf91c369f
942,Dada Sushi,"92562 California, Murrieta, California Oaks Rd...",California,,,33.574236,-117.204089,4.2,34,0,0x80dc8385c954fc1f:0xa65cf0047f021902
1115,Juicy King Crab Express,"New York, Bronx, E Tremont Ave, Juicy King Cra...",New York,,10457.0,40.845333,-73.890552,4.6,133,0,0x89c2f596580efdf1:0x1cec36682aa17eba


**Address Empty:** Closed Establishments

In [32]:
rows_with_missing_address = df_restaurant_clean[df_restaurant_clean['address'].isnull()]
rows_with_missing_address.head()


Unnamed: 0,name,address,state,city,postal_code,latitude,longitude,avg_rating,num_of_reviews,price_numeric,gmap_id
3229,Pasquale's Pizzeria and Restaurant at Wallkill,,New York,,,41.710932,-74.112969,4.1,54,0,0x89dd29ab5dd170f1:0x7cb3c8e119a5bbe9
3948,Namaste Tashi Delek,,New York,,,40.759291,-73.884518,4.3,95,0,0x89c25f072d742ce9:0x2a930399a15215e3
14070,Sabaidee Thai Grille,,California,,,37.275179,-119.273197,4.0,38,0,0x809ac6a6aa8b8cdf:0xeb219547c732cc3c
14447,Rios Mexican & Katracho Kitchen,,Texas,,,29.31909,-94.978222,4.5,28,0,0x863f7fedb1f1dc55:0xe6f2db0b0f2d483f
14534,Fox & Fawn Bakehouse,,California,,,38.109238,-122.125753,4.9,24,0,0x80856e3126db6b41:0x9412edbc6d853576


In [33]:
df_restaurant_clean = df_restaurant_clean.dropna(subset=['address'])

**Geolocation Correction**
- Populated the missing `postal_code` and `city` columns using the Nominatim API.
- Multiple workers were used to speed up this process.

NOTE: Values still missing after the functions were applied were set to `No Data` (for `city`) and `00000` (for `postal_code`). These rows were kept because their other fields were deemed necessary and may be useful for future examination despite the two missing values.

In [34]:
geolocator = Nominatim(user_agent='pef999@hotmail.com')
reverse_geocode = RateLimiter(geolocator.reverse, min_delay_seconds=1)


In [35]:
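def reverse_geocode_row(row, index):
    """Return (index, city, postal_code) for a row; values are None when nothing is found."""
    try:
        # Only query rows that are missing a city or a postal code
        if pd.isnull(row['city']) or pd.isnull(row['postal_code']):
            location = reverse_geocode((row['latitude'], row['longitude']), exactly_one=True)
            if location is None:
                # The geocoder returned no result for these coordinates
                print(f"No location found for row {index}")
                return (index, None, None)
            address = location.raw.get('address', {})
            # Fall back from city to town to village
            city = address.get('city', address.get('town', address.get('village')))
            postal_code = address.get('postcode')

            print(f"Row {index} updated: City = {city}, Postal Code = {postal_code}")
            return (index, city, postal_code)
    except Exception as e:
        print(f"Error retrieving location for row {index}: {e}")
    return (index, None, None)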

In [36]:
def update_city_postal_code_parallel(df):
 
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:  # Adjust max_workers as needed
        future_to_index = {executor.submit(reverse_geocode_row, row, index): index for index, row in df.iterrows()}
        for future in concurrent.futures.as_completed(future_to_index):
            index = future_to_index[future]
            try:
                result = future.result()
                results.append(result)
            except Exception as e:
                print(f"Error processing row {index}: {e}")
                
    for index, city, postal_code in results:
        if city:
            df.at[index, 'city'] = city
        if postal_code:
            df.at[index, 'postal_code'] = postal_code

    return df
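
The function above submits one task per row, including rows that need no lookup. A possible refinement, sketched here under assumptions, is to submit only the rows with a missing `city` or `postal_code` and to fill only the fields that are actually empty. `fill_missing_locations` and `fake_lookup` are hypothetical names; the lookup is passed in as a plain callable so the sketch runs without network access. In the notebook that callable would wrap the rate-limited Nominatim call, and because the rate limiter spaces requests at least one second apart, extra workers are likely to help only when individual responses are slower than that delay.

```python
import concurrent.futures

import pandas as pd


def fill_missing_locations(df, lookup, max_workers=4):
    """Fill empty city/postal_code cells using lookup(lat, lon) -> (city, postal_code).

    `lookup` is a hypothetical callable standing in for the geocoder.
    Returns a copy; the input DataFrame is left unchanged.
    """
    needs_lookup = df[df['city'].isnull() | df['postal_code'].isnull()]
    df = df.copy()
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            executor.submit(lookup, row['latitude'], row['longitude']): index
            for index, row in needs_lookup.iterrows()
        }
        for future in concurrent.futures.as_completed(futures):
            index = futures[future]
            try:
                city, postal_code = future.result()
            except Exception as exc:
                print(f"Lookup failed for row {index}: {exc}")
                continue
            # Only fill cells that are empty; existing values are kept.
            if city and pd.isnull(df.at[index, 'city']):
                df.at[index, 'city'] = city
            if postal_code and pd.isnull(df.at[index, 'postal_code']):
                df.at[index, 'postal_code'] = postal_code
    return df


def fake_lookup(latitude, longitude):
    """Stand-in for the real geocoder; always returns the same answer."""
    return ('Sugar Land', '77479')


toy = pd.DataFrame(
    {
        'city': ['Austin', None],
        'postal_code': ['78702', None],
        'latitude': [30.273985, 29.590894],
        'longitude': [-97.719563, -95.625716],
    },
    index=[2, 75],
)
filled = fill_missing_locations(toy, fake_lookup)
print(filled[['city', 'postal_code']])
```

Unlike the original function, this variant never overwrites a city that was already extracted from the address, which may or may not be the desired behaviour.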

In [37]:
df_restaurant_clean = update_city_postal_code_parallel(df_restaurant_clean)


Row 36 updated: City = City of New York, Postal Code = 11362
Row 928 updated: City = City of New York, Postal Code = 11358
Row 1985 updated: City = City of New York, Postal Code = 10003
Row 942 updated: City = Murrieta, Postal Code = 92562
Row 1408 updated: City = Philadelphia, Postal Code = 19111
Row 1865 updated: City = Baldwin, Postal Code = 11510
Row 75 updated: City = Sugar Land, Postal Code = 77479
Row 1648 updated: City = None, Postal Code = 14738
Row 2545 updated: City = City of New York, Postal Code = 11207
Row 3650 updated: City = Middlesex Township, Postal Code = 17013
Row 1444 updated: City = City of New York, Postal Code = 10451
Row 2642 updated: City = Falls Township, Postal Code = 19030
Row 3912 updated: City = Trappe, Postal Code = 19426
Row 1115 updated: City = City of New York, Postal Code = 10457
Row 3871 updated: City = City of New York, Postal Code = 11416
Row 3704 updated: City = City of New York, Postal Code = 11361
Row 2299 updated: City = Jericho, Postal Code =

In [44]:
# check rows that still have missing values
rows_with_missing_values = df_restaurant_clean[df_restaurant_clean['city'].isnull() | df_restaurant_clean['postal_code'].isnull()]
rows_with_missing_values

Unnamed: 0,name,address,state,city,postal_code,latitude,longitude,avg_rating,num_of_reviews,price_numeric,gmap_id
1648,Horseshoe Inn,"Horseshoe Inn, W Perimeter Rd, Coldspring, NY",New York,,14738.0,42.05541,-78.938788,4.9,18,0,0x89d29fd67c9bc79b:0xb6c8a298729a5333
5249,UN1QUE TEA & BAR,"California, Hacienda Heights, S Hacienda Blvd,...",California,,91745.0,33.99515,-117.966782,4.5,58,0,0x80c2d5f8c0b01c57:0xa4f968b7b3666706
7362,Redland Ballfield BBQ,"Redland Ballfield BBQ, hwy 64 & VZ County Road...",Texas,,,32.37849,-95.50357,4.7,36,0,0x8649b0e69f9442ef:0x40a536f0bde607d3
7779,Smile Market 2,"95827 California, Sacramento, Folsom Blvd, Smi...",California,,95827.0,38.571112,-121.340102,4.8,17,0,0x809add2c297d2171:0x251dd1d6afd0a4e5
14386,Yummy Pho & Grill,"Texas, Spring, Kuykendahl Rd, Yummy Pho & Gril...",Texas,,77389.0,30.113487,-95.552523,4.6,58,0,0x864733fa1d17774b:0x396f849380fd56b4
15905,炭香村,"〒91748 California, Rowland Heights, Colima Rd,...",California,,91748.0,33.98703,-117.89492,4.1,28,0,0x80c32af6815467e5:0x2c68b44b22cc06b9
18275,P.J. Clarke's,"P.J. Clarke's, 250 Vesey St, New York, NY 1028...",New York,City of New York,,40.713711,-74.016239,4.2,1684,2,0x89c25a1b0cd3c8cb:0xe29a00ff230959b8
18883,Chuck Wagon,"Chuck Wagon, 1203 Academy Ave, Sanger, CA 9365...",California,Sanger,,36.699852,-119.554766,4.6,1118,1,0x8094f90050be4dcb:0x8a52470b825bc0f3
19965,Amigos Mexican Restaurant,"Amigos Mexican Restaurant, 285 N Main St, Bish...",California,Bishop,,37.362931,-118.395612,4.2,276,1,0x80be3dffaf1721d3:0x85df382562b75703
20313,Mastro's Ocean Club,"Mastro's Ocean Club, 18412 Pacific Coast Hwy, ...",California,Topanga,,34.039687,-118.576133,4.5,1297,4,0x80c2a3e3c883c457:0xae5fd9b1d21d56ae


In [48]:
df_restaurant_clean['city'] = df_restaurant_clean['city'].str.replace(r'^City of\s+', '', regex=True)


In [45]:
df_restaurant_clean['postal_code'] = df_restaurant_clean['postal_code'].fillna('00000')
df_restaurant_clean['city'] = df_restaurant_clean['city'].fillna('No Data')


In [56]:
# Rename the column 'name' to 'restaurant_name'
df_restaurant_clean = df_restaurant_clean.rename(columns={'name': 'restaurant_name'})


In [58]:
df_restaurant_clean.head()

Unnamed: 0,restaurant_name,address,state,city,postal_code,latitude,longitude,avg_rating,num_of_reviews,price_numeric,gmap_id
0,San Soo Dang,"San Soo Dang, 761 S Vermont Ave, Los Angeles, ...",California,Los Angeles,90005,34.058092,-118.29213,4.4,18,0,0x80c2c778e3b73d33:0xbdc58662a4a97d49
1,Vons Chicken,"Vons Chicken, 12740 La Mirada Blvd, La Mirada,...",California,La Mirada,90638,33.916402,-118.010855,4.5,18,0,0x80dd2b4c8555edb7:0xfc33d65c4bdbef42
2,Golden Castle,"Golden Castle, 1906 E 12th St, Austin, TX 78702",Texas,Austin,78702,30.273985,-97.719563,4.5,8,0,0x8644b59b8fe872e5:0x5e638876caa84cc3
3,Oneyda's Bakery,"Oneyda's Bakery, 600 Goodlette-Frank Rd #101, ...",Florida,Naples,34102,26.154754,-81.790528,4.6,19,1,0x88dae191ee505917:0x6ba3e25388d3fad4
4,Top Cat Seafood Restaurant,"Top Cat Seafood Restaurant, 3117 Martin Luther...",Texas,Dallas,75215,32.77313,-96.764484,3.9,8,0,0x864e9891e381f3df:0x4cefe6219bc9199c


In [59]:
missing_values_count = df_restaurant_clean.isnull().sum()
print("Count of missing values in each column:")
print(missing_values_count)


Count of missing values in each column:
restaurant_name    0
address            0
state              0
city               0
postal_code        0
latitude           0
longitude          0
avg_rating         0
num_of_reviews     0
price_numeric      0
gmap_id            0
dtype: int64


**Data Type Conversions**
- Most data types were already correct, so only `postal_code` was converted, from `object` to `int`.

In [61]:
df_restaurant_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26686 entries, 0 to 26913
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   restaurant_name  26686 non-null  object 
 1   address          26686 non-null  object 
 2   state            26686 non-null  object 
 3   city             26686 non-null  object 
 4   postal_code      26686 non-null  int32  
 5   latitude         26686 non-null  float64
 6   longitude        26686 non-null  float64
 7   avg_rating       26686 non-null  float64
 8   num_of_reviews   26686 non-null  int64  
 9   price_numeric    26686 non-null  int32  
 10  gmap_id          26686 non-null  object 
dtypes: float64(3), int32(2), int64(1), object(5)
memory usage: 3.2+ MB


In [60]:
try:
    df_restaurant_clean['postal_code'] = df_restaurant_clean['postal_code'].astype(int)
except ValueError as e:
    print(f"Conversion error: {e}")
    print("Postal codes might contain non-numeric values or NaNs.")

## Loading/Saving Transformed Data

After the data has been cleaned and transformed, it is saved in a structured format for later analysis or processing. The steps are:

### Preparation for Saving
- **Base Save Path**: A base directory (`../data/Processed/`) is defined to store the processed files. This helps in organizing the output data systematically.

### Saving Dataframes
- **File Paths**: The full paths for saving the files are defined, incorporating the base path and the desired filenames (`google_restaurant_clean.parquet` for the cleaned restaurant data and `google_restaurant_cat_dummies.parquet` for the category dummies).
- **Parquet Format**: The data is saved as Parquet, a columnar file format with built-in compression and encoding that is fast to read and write, which suits a dataset of this size.
- **Dataframe Saving**: The `to_parquet` method is used to save `df_restaurant_clean` and `restaurant_cat_dummies` dataframes to their respective file paths. The `index=False` parameter is specified to avoid saving dataframe indices, keeping the files lean and focused on the data content.

Saving the transformed data this way keeps it readily accessible for later analysis.


In [64]:
save_path = '../data/Processed/'

os.makedirs(save_path, exist_ok=True)

clean_file_path = os.path.join(save_path, 'google_restaurant_clean.parquet')
dummies_file_path = os.path.join(save_path, 'google_restaurant_cat_dummies.parquet')

df_restaurant_clean.to_parquet(clean_file_path, index=False)
restaurant_cat_dummies.to_parquet(dummies_file_path, index=False)
