# Extraction, Transformation, and Loading of data 👨🏽‍💻 👩🏽‍💻

In this notebook, the data is loaded and the type of each column is reviewed to determine which columns are useful. The notebook also checks for duplicate records (rows) and for empty or null values in each column. The libraries used to manipulate the data are imported first, including our custom module, 'Tools'. Finally, the processed data is exported, ready for analysis.

## Importing the necessary libraries 📚

These libraries help us manipulate the data to ensure consistency and quality. We also import our custom module, 'Tools', which supports the entire process.

In [76]:
import pandas as pd
import re
from geopy.geocoders import Nominatim
import Tools as T
import warnings
warnings.filterwarnings("ignore")

## Data Loading 📂🔄

In [77]:
df_business = pd.read_csv('business.csv')
df_business

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,...,state.1,postal_code.1,latitude.1,longitude.1,stars.1,review_count.1,is_open.1,attributes.1,categories.1,hours.1
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,,93101,34.426679,-119.711197,5.0,7,...,,,,,,,,,,
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,,63123,38.551126,-90.335695,3.0,15,...,,,,,,,,,,
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,,85711,32.223236,-110.880452,3.5,22,...,,,,,,,,,,
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,CA,19107,39.955505,-75.155564,4.0,80,...,,,,,,,,,,
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,MO,18054,40.338183,-75.471659,4.5,13,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
150341,IUQopTMmYQG-qRtBk-8QnA,Binh's Nails,3388 Gateway Blvd,Edmonton,IN,T6J 5H2,53.468419,-113.492054,3.0,13,...,,,,,,,,,,
150342,c8GjPIOTGVmIemT7j5_SyQ,Wild Birds Unlimited,2813 Bransford Ave,Nashville,DE,37204,36.115118,-86.766925,4.0,5,...,,,,,,,,,,
150343,_QAMST-NrQobXduilWEqSw,Claire's Boutique,"6020 E 82nd St, Ste 46",Indianapolis,AB,46250,39.908707,-86.065088,3.5,8,...,,,,,,,,,,
150344,mtGm22y5c2UHNXDFAjaPNw,Cyclery & Fitness Center,2472 Troy Rd,Edwardsville,AB,62025,38.782351,-89.950558,4.0,24,...,,,,,,,,,,


A review is conducted using a function from our custom module, which provides detailed information about the DataFrame, including data types, quantity, and percentage of null values in each column.

In [78]:
T.analyze_data(df_business)

Unnamed: 0,Name,Unique Data Types,% of Non-null Values,% of Null Values,Number of Null Values
0,business_id,[<class 'str'>],100.0,0.0,0
1,name,[<class 'str'>],100.0,0.0,0
2,address,"[<class 'str'>, <class 'float'>]",96.59,3.41,5127
3,city,[<class 'str'>],100.0,0.0,0
4,state,"[<class 'float'>, <class 'str'>]",100.0,0.0,3
5,postal_code,"[<class 'str'>, <class 'float'>]",99.95,0.05,73
6,latitude,[<class 'float'>],100.0,0.0,0
7,longitude,[<class 'float'>],100.0,0.0,0
8,stars,[<class 'float'>],100.0,0.0,0
9,review_count,[<class 'int'>],100.0,0.0,0


## Transformations 🔀

We remove the unnecessary columns: the duplicated columns (those with the '.1' suffix), which contain only null values. Several attempts were made to filter them out while reading the file, but all of them raised errors, so the column names were copied by hand and the columns dropped explicitly. Three more columns ('is_open', 'hours', and 'attributes') were removed because they are not relevant to our data analysis and recommendation system.

In [79]:
df_business = df_business.drop(columns=['business_id.1', 'name.1','address.1', 'city.1', 'state.1', 'postal_code.1', 'latitude.1','longitude.1', 'stars.1', 'review_count.1', 'is_open.1', 'attributes.1','categories.1', 'hours.1','is_open','hours','attributes'])
df_business

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,categories
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,,93101,34.426679,-119.711197,5.0,7,"Doctors, Traditional Chinese Medicine, Naturop..."
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,,63123,38.551126,-90.335695,3.0,15,"Shipping Centers, Local Services, Notaries, Ma..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,,85711,32.223236,-110.880452,3.5,22,"Department Stores, Shopping, Fashion, Home & G..."
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,CA,19107,39.955505,-75.155564,4.0,80,"Restaurants, Food, Bubble Tea, Coffee & Tea, B..."
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,MO,18054,40.338183,-75.471659,4.5,13,"Brewpubs, Breweries, Food"
...,...,...,...,...,...,...,...,...,...,...,...
150341,IUQopTMmYQG-qRtBk-8QnA,Binh's Nails,3388 Gateway Blvd,Edmonton,IN,T6J 5H2,53.468419,-113.492054,3.0,13,"Nail Salons, Beauty & Spas"
150342,c8GjPIOTGVmIemT7j5_SyQ,Wild Birds Unlimited,2813 Bransford Ave,Nashville,DE,37204,36.115118,-86.766925,4.0,5,"Pets, Nurseries & Gardening, Pet Stores, Hobby..."
150343,_QAMST-NrQobXduilWEqSw,Claire's Boutique,"6020 E 82nd St, Ste 46",Indianapolis,AB,46250,39.908707,-86.065088,3.5,8,"Shopping, Jewelry, Piercing, Toy Stores, Beaut..."
150344,mtGm22y5c2UHNXDFAjaPNw,Cyclery & Fitness Center,2472 Troy Rd,Edwardsville,AB,62025,38.782351,-89.950558,4.0,24,"Fitness/Exercise Equipment, Eyewear & Optician..."
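
The duplicated columns all carry a `.1` suffix, which pandas appends when a CSV header repeats a column name. As a minimal sketch, assuming that suffix is the only thing marking the duplicates, the list of columns to drop could be derived from the column names instead of being copied by hand. The helper name `suffixed_duplicates` and the sample column list are hypothetical.

```python
def suffixed_duplicates(columns, suffix=".1"):
    """Return the names that end with `suffix` and whose base name is also present."""
    names = list(columns)
    present = set(names)
    return [c for c in names if c.endswith(suffix) and c[: -len(suffix)] in present]


# Hypothetical column list shaped like the one in this notebook.
columns = ["business_id", "name", "state", "business_id.1", "name.1", "state.1"]
to_drop = suffixed_duplicates(columns)
print(to_drop)
```

With the real data, this would be used as `df_business.drop(columns=suffixed_duplicates(df_business.columns))`, followed by dropping the three remaining columns by name.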


## Column ``State``

The DataFrame is filtered to the states selected by the team. These states were chosen by population, using figures obtained through web scraping from Wikipedia. The index is then reset to make the data easier to handle.

In [80]:
df_business = df_business[df_business['state'].isin(['CA', 'TX', 'FL', 'PA', 'NY'])]
df_business = df_business.reset_index(drop=True)
df_business

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,categories
0,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,CA,19107,39.955505,-75.155564,4.0,80,"Restaurants, Food, Bubble Tea, Coffee & Tea, B..."
1,n_0UpQx1hsNbnPUSlodU8w,Famous Footwear,"8522 Eager Road, Dierbergs Brentwood Point",Brentwood,PA,63144,38.627695,-90.340465,2.5,13,"Sporting Goods, Fashion, Shoe Stores, Shopping..."
2,qkRM_2X51Yqxk3btlwAQIg,Temple Beth-El,400 Pasadena Ave S,St. Petersburg,PA,33707,27.766590,-82.732983,3.5,5,"Synagogues, Religious Organizations"
3,UJsufbvfyfONHeWdvAHKjA,Marshalls,21705 Village Lakes Sc Dr,Land O' Lakes,FL,34639,28.190459,-82.457380,3.5,6,"Department Stores, Shopping, Fashion"
4,jaxMSoInw8Poo3XeMJt8lQ,Adams Dental,15 N Missouri Ave,Clearwater,FL,33755,27.966235,-82.787412,5.0,10,"General Dentistry, Dentists, Health & Medical,..."
...,...,...,...,...,...,...,...,...,...,...,...
65570,1jx1sfgjgVg0nM6n3p0xWA,Savaya Coffee Market,11177 N Oracle Rd,Oro Valley,PA,85737,32.409552,-110.943073,4.5,41,"Specialty Food, Food, Coffee & Tea, Coffee Roa..."
65571,9U1Igcpe954LoWZRmNc-zg,Hand & Stone Massage And Facial Spa,"1100 S Columbus Blvd, Ste 24",Philadelphia,PA,19147,39.932756,-75.144504,3.0,32,"Day Spas, Beauty & Spas, Skin Care, Massage"
65572,t_SGoRT5yt14OWr64TOulA,Sherwood Park Kwik Lube,979 Fir St,Sherwood Park,PA,T8A 4N5,53.513215,-113.328680,5.0,5,"Oil Change Stations, Automotive, Auto Repair"
65573,x_2IrYgFiQn7GOTTgWRbAw,The Vac & Sew Center,"200 Haddonfield Berlin Rd, Ste 5",Voorhees,PA,08043,39.857700,-74.987230,4.0,5,"Appliances & Repair, Home & Garden, Appliances..."


We verify that the state filter and the column removal were both applied correctly.

In [81]:
T.analyze_data(df_business)

Unnamed: 0,Name,Unique Data Types,% of Non-null Values,% of Null Values,Number of Null Values
0,business_id,[<class 'str'>],100.0,0.0,0
1,name,[<class 'str'>],100.0,0.0,0
2,address,"[<class 'str'>, <class 'float'>]",96.59,3.41,2238
3,city,[<class 'str'>],100.0,0.0,0
4,state,[<class 'str'>],100.0,0.0,0
5,postal_code,"[<class 'str'>, <class 'float'>]",99.97,0.03,22
6,latitude,[<class 'float'>],100.0,0.0,0
7,longitude,[<class 'float'>],100.0,0.0,0
8,stars,[<class 'float'>],100.0,0.0,0
9,review_count,[<class 'int'>],100.0,0.0,0


## Column ``Categories``

We will begin with the 'Categories' column, reviewing null values before proceeding with filtering based on various categories.

In [82]:
df_business_categories_null = df_business[df_business['categories'].isna()]
df_business_categories_null.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,categories
841,SMYXOLPyM95JvZ-oqnsWUA,A A Berlin Glass & Mirror Co,60 W White Horse Pike,Berlin,PA,8009,39.800416,-74.937181,3.0,5,
1418,xT3J-SP5g49g2FjQfLEQfg,Luxury Perfume,5135 Meadowood Mall Cir,Reno,PA,89502,39.475623,-119.78335,2.0,5,
1984,mKxCNYEoKt6d_1rXmvRwww,Green Envy,3520 N Highway 94,Saint Charles,FL,63301,38.826533,-90.472224,1.5,5,
2338,9QoKKDZB_YuDeS5TxRW8bg,Our 365 Portraits,9109 Watson Rd,Saint Louis,PA,63126,38.561429,-90.371805,1.0,10,
5483,ZERQMWb1PFzCfbfknqq-fA,Pilot Air Freight,314 N Middletown Rd,Media,PA,19063,39.917976,-75.441892,1.5,8,


Now, the data marked as floating-point is examined to determine if changes are necessary in the column and to unify everything under a single data type if needed.

In [83]:
df_business_cat_float = df_business[df_business['categories'].apply(lambda x:isinstance(x,float))]
df_business_cat_float.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,categories
841,SMYXOLPyM95JvZ-oqnsWUA,A A Berlin Glass & Mirror Co,60 W White Horse Pike,Berlin,PA,8009,39.800416,-74.937181,3.0,5,
1418,xT3J-SP5g49g2FjQfLEQfg,Luxury Perfume,5135 Meadowood Mall Cir,Reno,PA,89502,39.475623,-119.78335,2.0,5,
1984,mKxCNYEoKt6d_1rXmvRwww,Green Envy,3520 N Highway 94,Saint Charles,FL,63301,38.826533,-90.472224,1.5,5,
2338,9QoKKDZB_YuDeS5TxRW8bg,Our 365 Portraits,9109 Watson Rd,Saint Louis,PA,63126,38.561429,-90.371805,1.0,10,
5483,ZERQMWb1PFzCfbfknqq-fA,Pilot Air Freight,314 N Middletown Rd,Media,PA,19063,39.917976,-75.441892,1.5,8,
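
The rows selected by the `isinstance(x, float)` filter coincide with the null rows because a missing value in a text column is stored as NaN, and NaN is itself a Python float. A short illustration using only the standard library:

```python
import math

missing = float("nan")  # the kind of value used for a missing entry

# NaN is a float, so the isinstance filter selects it.
is_float = isinstance(missing, float)
# The same value is what a null check reports as missing.
is_nan = math.isnan(missing)
# A real category string is never a float.
text_is_float = isinstance("Restaurants, Food", float)

print(is_float, is_nan, text_is_float)
```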


The rows whose category is of type float are the same rows that were flagged as null, so we remove them. The DataFrame is then filtered by keywords in the categories to keep only food-related places.

In [84]:
df_business.dropna(subset=['categories'],inplace=True)

categories_exclude = ['shopping', 'beauty', 'salon','Sports Bars','Pets', 'Pet Adoption', 'Nightlife','Gastropubs','Automotive','Custom Cakes', 'Desserts', 'Cupcakes', 'Ice Cream & Frozen Yogurt', 'Organic Stores', 'Health Markets', 'Grocery','Cupcakes', 'Street Vendors', 'Food Trucks','Acai Bowls',
'Home Services', 'Painters', 'Contractors', 'Pressure Washers', 'Shopping', 'Fences & Gates', 'Flooring', 'Home & Garden', 'Door Sales/Installation', 'Kitchen & Bath', 'Home Inspectors','Health & Medical', 'Pharmacy', 'Convenience Stores', 'Drugstores','Flowers & Gifts', 'Chocolatiers & Shops', 'Florists', 'Gift Shops', 'American (New)', 'Music Venues', 'Breakfast & Brunch', 'Arts & Entertainment', 'Bars', 'American (Traditional)', 'Dive Bars', 'Pool Halls','Farmers Market','Building Supplies', 'Masonry/Concrete', 'Countertop Installation','Active Life', 'Advertising', 'Afghan', 'African', 'Airport Terminals', 'Airports', 'American (New)', 'American (Traditional)', 'Amusement Parks', 'Water Delivery', 'Water Stores', 'Web Design', 'Wedding Planning', 'Wholesalers', 'Wine & Spirits', 'Wine Tasting Classes', 'Wine Tours', 'Wraps', 'Yelp Events', 'Walking Tours'
]

categories_include = ['restaurant', 'cafe', 'food', 'dining', 'eatery', 'bistro', 'bakery', 'grill', 'kitchen', 'pizzeria', 'steakhouse', 'sushi', 'tavern', 'diner']
# Build the inclusion and exclusion patterns
include_pattern = re.compile('|'.join(categories_include), flags=re.IGNORECASE)
exclude_pattern = re.compile('|'.join(categories_exclude), flags=re.IGNORECASE)

# Apply the filters using regular expressions
mask = df_business['categories'].str.contains(include_pattern)
mask2 = df_business['categories'].str.contains(exclude_pattern)

df_business = df_business[mask & ~mask2]
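
One caveat about the patterns above: joining the category names with `|` treats each name as a regular expression, so names containing metacharacters, such as the parentheses in 'American (New)', are not matched literally. The sketch below shows the difference on a made-up category string and how `re.escape` makes the match literal. It is only an illustration; applying it in the notebook would change which rows are excluded.

```python
import re

names = ["American (New)", "Wine & Spirits"]
text = "Restaurants, American (New), Pizza"  # made-up example row

# As in the cell above: the parentheses become a regex group, not literal characters.
raw_pattern = re.compile("|".join(names), flags=re.IGNORECASE)

# Escaping each name first makes every character literal.
escaped_pattern = re.compile("|".join(re.escape(n) for n in names), flags=re.IGNORECASE)

raw_hit = raw_pattern.search(text) is not None
escaped_hit = escaped_pattern.search(text) is not None
print(raw_hit, escaped_hit)
```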


The filtering is checked to ensure it has been done correctly, excluding various categories.

In [85]:
df_business

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,categories
0,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,CA,19107,39.955505,-75.155564,4.0,80,"Restaurants, Food, Bubble Tea, Coffee & Tea, B..."
5,0bPLkL0QhhPO5kt1_EXmNQ,Zio's Italian Market,2575 E Bay Dr,Largo,FL,33771,27.916116,-82.760461,4.5,100,"Food, Delis, Italian, Bakeries, Restaurants"
9,kfNv-JZpuN6TVNSO6hHdkw,Hibachi Express,6625 E 82nd St,Indianapolis,PA,46250,39.904320,-86.053080,4.0,20,"Steakhouses, Asian Fusion, Restaurants"
11,sqSqqLy0sN8n2IZrAbzidQ,Domino's Pizza,3001 Highway 31 W,White House,CA,37188,36.464747,-86.659187,3.5,8,"Pizza, Chicken Wings, Sandwiches, Restaurants"
12,Mjboz24M9NlBeiOJKLEd_Q,DeSandro on Main,4105 Main St,Philadelphia,PA,19127,40.022466,-75.218314,3.0,41,"Pizza, Restaurants, Salad, Soup"
...,...,...,...,...,...,...,...,...,...,...,...
65555,uriD7RFuHhLJeDdKaf0nFA,Pizza Guru,3534 State St,Santa Barbara,PA,93105,34.440689,-119.739681,4.0,299,"Restaurants, Pizza, Food"
65560,gPr1io7ks0Eo3FDsnDTYfg,Tata Cafe,7201 Germantown Ave,Philadelphia,PA,19119,40.060414,-75.191084,4.0,21,"Sandwiches, Restaurants, Italian"
65563,wVxXRFf10zTTAs11nr4xeA,PrimoHoagies,6024 Ridge Ave,Philadelphia,CA,19128,40.032483,-75.214430,3.0,55,"Restaurants, Specialty Food, Food, Sandwiches,..."
65566,8n93L-ilMAsvwUatarykSg,Kitchen Gia,3716 Spruce St,Philadelphia,PA,19104,39.951018,-75.198240,3.0,22,"Coffee & Tea, Food, Sandwiches, American (Trad..."


Once this column is processed, it will later be converted into a dataframe of dummy variables. We now continue with the other columns that contain two data types and/or null values, reviewing each one to make sure no missing data remains.

## Column ``Address``

We obtain the null values in the column using a function called 'nulls' from our 'Tools' module.

In [86]:
T.nulls(df_business,'address')

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,categories
1121,siwG4ZM7RjUDO52DI84m3w,Ray's Vegan Soul,,St. Petersburg,PA,,27.767601,-82.640292,5.0,5,"Soul Food, Vegan, Event Planning & Services, R..."
1818,cqIYskWPVQ0rz1XSG0ohxg,City Food Tours,,Philadelphia,FL,19102,39.955624,-75.164753,4.5,48,"Hotels & Travel, Specialty Food, Food, Food To..."
2483,4BbzHEVKSHuctibot056og,City of Tucson,,Tucson,PA,85701,32.222872,-110.965398,3.0,52,"Coffee & Tea, Local Flavor, Public Services & ..."
3398,UYhbbpuD199E7BabX93vAg,Buggin' Out Boils,,New Orleans,PA,70117,29.968268,-90.032653,5.0,5,"Event Planning & Services, Caterers, Cajun/Cre..."
4189,iySMW1yQs9qjNnkScAWtmA,MrBeast Burger,,Indianapolis,FL,46204,39.767958,-86.159432,1.5,13,"Food, Restaurants, Burgers, Food Delivery Serv..."
6287,6PJt5TQWp3qQW5UOkDPwjA,Guy Fieri's Flavortown Kitchen,,Indianapolis,FL,46204,39.771343,-86.157371,3.0,14,"Restaurants, Food, Chicken Wings, Food Deliver..."
6352,oId0VeETbxUWL8ynfJ6nJw,MrBeast Burger,,Brandon,PA,33510,27.956046,-82.312629,3.5,7,"Restaurants, Food, Food Delivery Services, Bur..."
10738,4J55A3x6jQPdApokTLMxzw,Live Oak Grill Catering,,Carpinteria,FL,93013,34.398884,-119.518456,4.5,7,"Food Delivery Services, Event Planning & Servi..."
12109,QNsY9cttooythe2eGPybZQ,Metabolic Meals,,Saint Louis,FL,63129,38.613212,-90.32073,4.0,158,"Food, Food Delivery Services"
13395,jg1AdBM-COOEY1HMwdXf1g,My Big Fat Greek Truck,,Saint Louis,FL,63101,38.630539,-90.192822,3.5,19,"Greek, Restaurants"


To find the addresses of places that do not have them, we will use the Geopy library. Through reverse geocoding, we can obtain addresses using latitude and longitude.

In [87]:
geolocalizated = Nominatim(user_agent='My_Address')

for index, row in df_business.iterrows():
    if pd.isnull(row['address']):
        try:
            location = geolocalizated.reverse((row['latitude'], row['longitude']), language='en')
            df_business.at[index, 'address'] = location.address
        except Exception as e:
            print(f"The address for row {index} could not be obtained: {e}")

## Column ``Postal_code``

Since we had records without postal codes, we decided to perform a similar process to that of obtaining addresses for places and impute them using Geopy through reverse geocoding. This involves obtaining addresses and other necessary values, in this case, the postal code, through the provided latitude and longitude.

In [88]:
geolocalizated = Nominatim(user_agent='My_Address')

filas_sin_postal_code = df_business[pd.isnull(df_business['postal_code'])]


for index, row in filas_sin_postal_code.iterrows():
    try:
        location = geolocalizated.reverse((row['latitude'], row['longitude']), language='en')
        
        postal_code = location.raw.get('address', {}).get('postcode')
        if postal_code:
            # Update the 'postal_code' column in the DataFrame
            df_business.at[index, 'postal_code'] = postal_code

    except Exception as e:
        print(f"The postal code for row {index} could not be obtained: {e}")
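
A row that lacks both its address and its postal code is looked up twice by the two loops, although a single reverse-geocoding response carries both values. As a sketch, assuming one lookup per row is acceptable, the two imputations could share one call. The `reverse` argument stands in for something like `geolocalizated.reverse`; here it is replaced by a fake function so the example runs without network access. The helper name `impute_location`, the `FakeLocation` class, and the sample record are hypothetical, and the record uses `None` for missing values where the DataFrame would hold NaN.

```python
def impute_location(record, reverse):
    """Fill a missing 'address' / 'postal_code' in a dict using one reverse lookup."""
    if record.get("address") is not None and record.get("postal_code") is not None:
        return record  # nothing to impute, so no lookup is made
    location = reverse((record["latitude"], record["longitude"]))
    if location is None:  # a reverse geocoder may return no result
        return record
    filled = dict(record)
    if filled.get("address") is None:
        filled["address"] = location.address
    if filled.get("postal_code") is None:
        postcode = location.raw.get("address", {}).get("postcode")
        if postcode:
            filled["postal_code"] = postcode
    return filled


class FakeLocation:
    """Stand-in exposing the two attributes the notebook reads: .address and .raw."""
    address = "123 Example St, Springfield"
    raw = {"address": {"postcode": "00000"}}


calls = []


def fake_reverse(coords):
    calls.append(coords)  # record each lookup so the call count can be checked
    return FakeLocation()


row = {"address": None, "postal_code": None, "latitude": 27.767601, "longitude": -82.640292}
result = impute_location(row, fake_reverse)
print(result["address"], result["postal_code"], len(calls))
```

geopy also provides `geopy.extra.rate_limiter.RateLimiter` to space out calls, which helps respect the public Nominatim service's limit of one request per second.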


## Column ``Name``

We start looking for empty values to check the consistency of the data.

In [89]:
T.empty_values(df_business,'name')

The column "name" does not have empty values


We check that the names correspond to restaurants or similar businesses in the food industry.

In [90]:
names = df_business['name'].value_counts()
names.head()

name
Starbucks      316
McDonald's     296
Subway         225
Dunkin'        207
Burger King    136
Name: count, dtype: int64

## Column ``City``

We look for empty values to check the consistency of the data.

In [91]:
T.empty_values(df_business,'city')

The column "city" does not have empty values


We check that the names of the different cities correspond to the states that were selected through a web scraping process based on population.

In [92]:
cities = df_business['city'].value_counts()
cities.head()

city
Philadelphia    1639
Indianapolis     818
Tampa            782
Edmonton         704
Tucson           679
Name: count, dtype: int64

## Columns ``Latitude`` and ``Longitude``

We look for empty values to check the consistency of the data.

In [93]:
T.empty_values(df_business,'latitude')
T.empty_values(df_business,'longitude')

The column "latitude" does not have empty values
The column "longitude" does not have empty values


## Column ``stars``

We look for empty values to check the consistency of the data.

In [94]:
T.empty_values(df_business,'stars')

The column "stars" does not have empty values


A count of the values is performed, and the percentage that each possible value represents is calculated using a function from our custom module. This is done with the aim of achieving a better visualization of the data.

In [95]:
T.count_and_percentage(df_business,'stars')

The values of stars:
4.0    3473
3.5    3188
4.5    2227
3.0    2215
2.5    1451
2.0    1003
1.5     565
5.0     452
1.0     101

The percentage that each value represents:
4.0    23.67
3.5    21.72
4.5    15.18
3.0    15.09
2.5     9.89
2.0     6.83
1.5     3.85
5.0     3.08
1.0     0.69


## Column ``review_count``

We look for empty values to check the consistency of the data.

In [96]:
T.empty_values(df_business,'review_count')

The column "review_count" does not have empty values


We perform a groupby operation to identify the restaurants with the most reviews on Yelp, regardless of the rating.

In [97]:
reviews = df_business.groupby(by='name')['review_count'].sum().sort_values(ascending=False)
reviews.head()

name
Starbucks                     8611
McDonald's                    7391
Chipotle Mexican Grill        4144
Dunkin'                       3929
Drago's Seafood Restaurant    3160
Name: review_count, dtype: int64

Finally, we will look for duplicates in the 'df_business' dataframe to complete the cleaning process.

In [98]:
T.duplicates(df_business)

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,categories


After completing the analysis of all columns and confirming that there are no duplicates, we will proceed with the conversion of a column into dummy variables. To do this, the values of this column will be transferred to a new dataframe called df_categories.

## Dummy Variables

### Column ``categories``

Various operations will be performed, such as creating a new DataFrame, separating categories using the split function, removing null values, and obtaining dummy variables with the ID of each business.

In [99]:
df_categories = df_business[['business_id', 'categories']].copy()  # copy so the assignments below do not write into a view of df_business

df_categories['categories'] = df_categories['categories'].replace('No_data', None)

df_categories['categories'] = df_categories['categories'].dropna().str.split(', ')

df_categories_exploded = df_categories.explode('categories')

df_dummies = pd.get_dummies(df_categories_exploded['categories'], prefix='Category')

df_dummies = df_categories_exploded[['business_id']].join(df_dummies)

df_dimmi = df_dummies.groupby('business_id').sum().reset_index()

The summed counts are then binarized with a threshold: any count greater than 1 becomes 1, and everything else, including a count of exactly 1, becomes 0. The resulting 0/1 table is the one used in the recommendation system.

In [100]:
threshold = 1
df_dimmi.iloc[:, 1:] = (df_dimmi.iloc[:, 1:] > threshold).astype(int)
df_dimmi

Unnamed: 0,business_id,Category_American (New),Category_American (Traditional),Category_Arabic,Category_Argentine,Category_Armenian,Category_Asian Fusion,Category_Australian,Category_Austrian,Category_Bagels,...,Category_Trinidadian,Category_Turkish,Category_Ukrainian,Category_Uzbek,Category_Vegan,Category_Vegetarian,Category_Venezuelan,Category_Venues & Event Spaces,Category_Vietnamese,Category_Waffles
0,-0EdehHjIQc0DtYU8QcAig,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,-0FX23yAacC4bbLaGPvyxw,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,-0dKgi_Hpcis921nOpM85Q,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,-0eUa8TsXFFy0FCxHYmrjg,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,-0gRYq5UjMtZbELj0KHxzA,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14670,zy-BlEF0mkvAwOk3ru1WLA,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14671,zy0fsu1Wns7g6Om0SxU67g,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14672,zyMkbavgHASQtqVwaock9A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14673,zzIF9qp2UoHN48EeZH_IDg,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
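
The summed counts can exceed 1 because `join` aligns on the index, and after `explode` each business's index label is repeated once per category. A business with k categories therefore yields k × k joined rows, while a business with a single category sums to 1 and is set to 0 by the `> 1` threshold. The sketch below reproduces this on made-up data and shows one possible alternative, `Series.str.get_dummies`, which builds 0/1 indicators directly from the comma-separated string. This is an illustration under those assumptions, not the notebook's method.

```python
import pandas as pd

# Made-up rows shaped like df_business[['business_id', 'categories']].
df = pd.DataFrame({
    "business_id": ["a1", "b2"],
    "categories": ["Restaurants, Pizza", "Restaurants"],
})

# Same steps as the notebook: explode, get_dummies, join on the repeated index.
exploded = df.assign(categories=df["categories"].str.split(", ")).explode("categories")
dummies = pd.get_dummies(exploded["categories"], prefix="Category")
joined_counts = exploded[["business_id"]].join(dummies).groupby("business_id").sum()

# Alternative: one 0/1 row per business, straight from the original string.
direct = df["categories"].str.get_dummies(sep=", ").add_prefix("Category_")
direct.insert(0, "business_id", df["business_id"])

print(joined_counts)
print(direct)
```

In `joined_counts`, business `a1` (two categories) gets a count of 2 in each of its columns, while `b2` (one category) gets 1; `direct` holds 1 for every category a business has, so no threshold is needed.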


The rows whose value is 0 in every category column are identified.

In [101]:
rows_with_all_zeros = df_dimmi[df_dimmi.iloc[:, 1:].eq(0).all(axis=1)]
rows_with_all_zeros.head()

Unnamed: 0,business_id,Category_American (New),Category_American (Traditional),Category_Arabic,Category_Argentine,Category_Armenian,Category_Asian Fusion,Category_Australian,Category_Austrian,Category_Bagels,...,Category_Trinidadian,Category_Turkish,Category_Ukrainian,Category_Uzbek,Category_Vegan,Category_Vegetarian,Category_Venezuelan,Category_Venues & Event Spaces,Category_Vietnamese,Category_Waffles
756,2INVBDZ45z4Tbg3VtdDyTg,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
911,2s5h1TMhYPdRUUNcQFr3WA,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1137,3tTVN74KJAN0z1DJ_dsE6w,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1249,4IvQU16RBKuLtpgx8yLqmQ,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2004,7RSsRMQCO2l0GpGjlvEFfA,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The IDs of the rows with 0 in every column are collected and removed from the dummy table; the result is stored in df_dimmies.

In [102]:
business_ids_to_remove = rows_with_all_zeros['business_id'].tolist()
df_dimmies = df_dimmi[~df_dimmi['business_id'].isin(business_ids_to_remove)]
df_dimmies.shape

(14615, 248)

We quickly review those categories that are most repeated with the aim of visualizing them.

In [103]:
values_categories = df_categories['categories'].value_counts().head()
values = pd.DataFrame(values_categories)
values

Unnamed: 0_level_0,count
categories,Unnamed: 1_level_1
"[Restaurants, Pizza]",409
"[Pizza, Restaurants]",347
"[Restaurants, Chinese]",311
"[Restaurants, Mexican]",298
"[Mexican, Restaurants]",287


## Exporting the Data 🌐

In [104]:
df_business.to_parquet('df_business.parquet')
df_dimmies.to_parquet('df_dummies.parquet')

The conclusions from this notebook are as follows. The file contained duplicate columns, which were removed, and the data was filtered by the states and categories selected by the team. Missing restaurant addresses were filled in through reverse geocoding, that is, from latitude and longitude, and missing postal codes were added the same way. The restaurant names, cities, ratings, and review counts were then reviewed, and the top values of each were listed. Finally, a table of dummy variables was created from the 'categories' column for use in our machine learning recommendation system.

Once this ETL process is finished, we will proceed to the next one for [reviews](ETL_reviews.ipynb).