# ETL for Yelp data Files 

We will perform the extraction, transformation and data loading of the data included in the files provided in the Yelp folder

### 1- IMPORT LIBRARIES


In [1]:
import builtin.utils as ut
import pandas as pd
from ast import literal_eval
import ast

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 3, Finished, Available)

___
### 2- LOAD CLEAN DATA

First we are going to extract all the information from the data lake by using the Pandas library, and proceed to analyze the dataframes one by one

In [2]:
yelp_bus = '/lakehouse/default/Files/df_database/Yelp_parquet/business.pkl.parquet'
yelp_check = '/lakehouse/default/Files/df_database/Yelp_parquet/checkin.json.parquet'
yelp_review = '/lakehouse/default/Files/df_database/Yelp_parquet/review.json.parquet'
yelp_tip = '/lakehouse/default/Files/df_database/Yelp_parquet/tip.json.parquet'
yelp_user = '/lakehouse/default/Files/df_database/Yelp_parquet/user.parquet'

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 4, Finished, Available)

In [3]:
business = pd.read_parquet(yelp_bus)
checkin = pd.read_parquet(yelp_check)
review = pd.read_parquet(yelp_review)
tip = pd.read_parquet(yelp_tip)
user = pd.read_parquet(yelp_user)

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 5, Finished, Available)

Now we have the following dataframes:

- Business
- Checkin
- Review
- Tip
- User

___
### 3 - TRANSFORM


#### 3.1 Business Dataframe

In [4]:
business.head()

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 6, Finished, Available)

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,,93101,34.426679,-119.711197,5.0,7,0,"{'AcceptsInsurance': None, 'AgesAllowed': None...","Doctors, Traditional Chinese Medicine, Naturop...",
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,,63123,38.551126,-90.335695,3.0,15,1,"{'AcceptsInsurance': None, 'AgesAllowed': None...","Shipping Centers, Local Services, Notaries, Ma...","{'Friday': '8:0-18:30', 'Monday': '0:0-0:0', '..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,,85711,32.223236,-110.880452,3.5,22,0,"{'AcceptsInsurance': None, 'AgesAllowed': None...","Department Stores, Shopping, Fashion, Home & G...","{'Friday': '8:0-23:0', 'Monday': '8:0-22:0', '..."
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,CA,19107,39.955505,-75.155564,4.0,80,1,"{'AcceptsInsurance': None, 'AgesAllowed': None...","Restaurants, Food, Bubble Tea, Coffee & Tea, B...","{'Friday': '7:0-21:0', 'Monday': '7:0-20:0', '..."
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,MO,18054,40.338183,-75.471659,4.5,13,1,"{'AcceptsInsurance': None, 'AgesAllowed': None...","Brewpubs, Breweries, Food","{'Friday': '12:0-22:0', 'Monday': None, 'Satur..."


In [5]:
ut.data_summ(business)

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 7, Finished, Available)

      Column                            Data_type  No_miss_Qty  %Missing  Missing_Qty
 business_id                      [<class 'str'>]       150346      0.00            0
        name                      [<class 'str'>]       150346      0.00            0
     address                      [<class 'str'>]       150346      0.00            0
        city                      [<class 'str'>]       150346      0.00            0
       state  [<class 'NoneType'>, <class 'str'>]       150343      0.00            3
 postal_code                      [<class 'str'>]       150346      0.00            0
    latitude                    [<class 'float'>]       150346      0.00            0
   longitude                    [<class 'float'>]       150346      0.00            0
       stars                    [<class 'float'>]       150346      0.00            0
review_count                      [<class 'int'>]       150346      0.00            0
     is_open                      [<class 'int'>]     

Unnamed: 0,Column,Data_type,No_miss_Qty,%Missing,Missing_Qty
0,business_id,[<class 'str'>],150346,0.0,0
1,name,[<class 'str'>],150346,0.0,0
2,address,[<class 'str'>],150346,0.0,0
3,city,[<class 'str'>],150346,0.0,0
4,state,"[<class 'NoneType'>, <class 'str'>]",150343,0.0,3
5,postal_code,[<class 'str'>],150346,0.0,0
6,latitude,[<class 'float'>],150346,0.0,0
7,longitude,[<class 'float'>],150346,0.0,0
8,stars,[<class 'float'>],150346,0.0,0
9,review_count,[<class 'int'>],150346,0.0,0



The content ~head~ of the dataframe is observed, it has 14 columns with the information of:

- **Businnes ID**:    STR Contains business identification 
- **Name**:    STR Contains the name of the business
- **Address**:    STR Contains the complete address of the business
- **City**:    STR Contains the city where the business is located
- **State**:    STR Contains the state where the business is located
- **Postal Code**:    STR Contains the postal code of the business location
- **Latitude**:    FLOAT Contains Latitude
- **Longitude**:    FLOAT Contains Longitude
- **Is Open**:    INT Contains 1 and 0 to indicate whether the business is open or closed respectively 
- **Stars**:    FLOAT Contains the number of stars, between 0 and 5
- **Review_count**:    INT Contains the count of reviews of the business
- **Attributes**:    STR Contains business attributes as values
- **Categories**:    STR Contains business categories
- **Hours**:    STR Contains the hours it is open

For the project proposal that we are carrying out, several of the columns are relevant and others are not.
Therefore, we proceed to eliminate those columns that are not relevant to us, among them: "Postal_code", "is_open", "attributes", "hours"

In [6]:
business = business.drop(["postal_code", "is_open", "attributes", "hours"], axis=1)
business

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 8, Finished, Available)

Unnamed: 0,business_id,name,address,city,state,latitude,longitude,stars,review_count,categories
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,,34.426679,-119.711197,5.0,7,"Doctors, Traditional Chinese Medicine, Naturop..."
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,,38.551126,-90.335695,3.0,15,"Shipping Centers, Local Services, Notaries, Ma..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,,32.223236,-110.880452,3.5,22,"Department Stores, Shopping, Fashion, Home & G..."
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,CA,39.955505,-75.155564,4.0,80,"Restaurants, Food, Bubble Tea, Coffee & Tea, B..."
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,MO,40.338183,-75.471659,4.5,13,"Brewpubs, Breweries, Food"
...,...,...,...,...,...,...,...,...,...,...
150341,IUQopTMmYQG-qRtBk-8QnA,Binh's Nails,3388 Gateway Blvd,Edmonton,IN,53.468419,-113.492054,3.0,13,"Nail Salons, Beauty & Spas"
150342,c8GjPIOTGVmIemT7j5_SyQ,Wild Birds Unlimited,2813 Bransford Ave,Nashville,DE,36.115118,-86.766925,4.0,5,"Pets, Nurseries & Gardening, Pet Stores, Hobby..."
150343,_QAMST-NrQobXduilWEqSw,Claire's Boutique,"6020 E 82nd St, Ste 46",Indianapolis,AB,39.908707,-86.065088,3.5,8,"Shopping, Jewelry, Piercing, Toy Stores, Beaut..."
150344,mtGm22y5c2UHNXDFAjaPNw,Cyclery & Fitness Center,2472 Troy Rd,Edwardsville,AB,38.782351,-89.950558,4.0,24,"Fitness/Exercise Equipment, Eyewear & Optician..."


In [7]:
business["categories"] = business["categories"].apply(lambda x: x.split(', ') if pd.notna(x) else [])

rt_df = ut.explode_column(business, "categories")

rt_df = ut.replace_all_nulls(rt_df)

unique_categories = rt_df['categories'].unique()
for category in unique_categories:
    print(category)

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 9, Finished, Available)

Doctors
Traditional Chinese Medicine
Naturopathic/Holistic
Acupuncture
Health & Medical
Nutritionists
Shipping Centers
Local Services
Notaries
Mailbox Centers
Printing Services
Department Stores
Shopping
Fashion
Home & Garden
Electronics
Furniture Stores
Restaurants
Food
Bubble Tea
Coffee & Tea
Bakeries
Brewpubs
Breweries
Burgers
Fast Food
Sandwiches
Ice Cream & Frozen Yogurt
Sporting Goods
Shoe Stores
Sports Wear
Accessories
Synagogues
Religious Organizations
Pubs
Italian
Bars
American (Traditional)
Nightlife
Greek
Vietnamese
Food Trucks
Diners
Breakfast & Brunch
General Dentistry
Dentists
Cosmetic Dentists
Delis
Sushi Bars
Japanese
Automotive
Auto Parts & Supplies
Auto Customization
Vape Shops
Tobacco Shops
Personal Shopping
Vitamins & Supplements
Car Rental
Hotels & Travel
Truck Rental
Korean
Cafes
Wine Bars
Books
Mags
Music & Video
Bookstores
Steakhouses
Asian Fusion
Hot Dogs
Pet Services
Pet Groomers
Pets
Veterinarians
Women's Clothing
Children's Clothing
Men's Clothing
Adult
Seaf

In [8]:
keywords = ['museum', 'park', 'Hotel', 'Motel', 'Hostel', 'restaurant', 'Restaurant', 'forest', 'gallery', 'mall', 'pub', 'Zoo', 'roller coaster']

keywords_lower = [keyword.lower() for keyword in keywords]

mask = rt_df['categories'].str.contains('|'.join(keywords))

filtered_dataframe = rt_df[mask]

filtered_dataframe

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 10, Finished, Available)

Unnamed: 0,business_id,name,address,city,state,latitude,longitude,stars,review_count,categories
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,CA,39.955505,-75.155564,4.0,80,Restaurants
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,MO,40.338183,-75.471659,4.5,13,Brewpubs
5,CF33F8-E6oudUQ46HnavjQ,Sonic Drive-In,615 S Main St,Ashland City,AZ,36.269593,-87.058943,2.0,6,Restaurants
8,k0hlBqXX-Bt0vf1op7Jr1w,Tsevi's Pub And Grill,8025 Mackenzie Rd,Affton,TN,38.565165,-90.321087,3.0,19,Restaurants
9,bBDDEgkFA1Otx9Lfe7BZUQ,Sonic Drive-In,2312 Dickerson Pike,Nashville,MO,36.208102,-86.768170,1.5,10,Restaurants
...,...,...,...,...,...,...,...,...,...,...
150331,qQ7FHvkGEMqoPKKXPk4gjA,La Quinta by Wyndham NW Tucson Marana,6020 West Hospitality Rd,Tucson,AZ,32.358587,-111.093308,2.5,67,Hotels & Travel
150331,qQ7FHvkGEMqoPKKXPk4gjA,La Quinta by Wyndham NW Tucson Marana,6020 West Hospitality Rd,Tucson,AZ,32.358587,-111.093308,2.5,67,Hotels
150336,WnT9NIzQgLlILjPT0kEcsQ,Adelita Taqueria & Restaurant,1108 S 9th St,Philadelphia,MO,39.935982,-75.158665,4.5,35,Restaurants
150339,2O2K6SXPWv56amqxCECd4w,The Plum Pit,4405 Pennell Rd,Aston,PA,39.856185,-75.427725,4.5,14,Restaurants


In [9]:
filtered_dataframe = ut.group_column(filtered_dataframe, 'categories')
filtered_dataframe

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 11, Finished, Available)

Unnamed: 0,business_id,name,address,city,state,latitude,longitude,stars,review_count,categories
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,CA,39.955505,-75.155564,4.0,80,[Restaurants]
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,MO,40.338183,-75.471659,4.5,13,[Brewpubs]
5,CF33F8-E6oudUQ46HnavjQ,Sonic Drive-In,615 S Main St,Ashland City,AZ,36.269593,-87.058943,2.0,6,[Restaurants]
8,k0hlBqXX-Bt0vf1op7Jr1w,Tsevi's Pub And Grill,8025 Mackenzie Rd,Affton,TN,38.565165,-90.321087,3.0,19,[Restaurants]
9,bBDDEgkFA1Otx9Lfe7BZUQ,Sonic Drive-In,2312 Dickerson Pike,Nashville,MO,36.208102,-86.768170,1.5,10,[Restaurants]
...,...,...,...,...,...,...,...,...,...,...
150327,cM6V90ExQD6KMSU3rRB5ZA,Dutch Bros Coffee,1181 N Milwaukee St,Boise,ID,43.615401,-116.284689,4.0,33,[Restaurants]
150331,qQ7FHvkGEMqoPKKXPk4gjA,La Quinta by Wyndham NW Tucson Marana,6020 West Hospitality Rd,Tucson,AZ,32.358587,-111.093308,2.5,67,"[Hotels & Travel, Hotels]"
150336,WnT9NIzQgLlILjPT0kEcsQ,Adelita Taqueria & Restaurant,1108 S 9th St,Philadelphia,MO,39.935982,-75.158665,4.5,35,[Restaurants]
150339,2O2K6SXPWv56amqxCECd4w,The Plum Pit,4405 Pennell Rd,Aston,PA,39.856185,-75.427725,4.5,14,[Restaurants]


In [10]:
df1 = filtered_dataframe[['business_id', 'longitude', 'latitude']]
df1

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 12, Finished, Available)

Unnamed: 0,business_id,longitude,latitude
3,MTSW4McQd7CbVtyjqoe9mw,-75.155564,39.955505
4,mWMc6_wTdE0EUBKIGXDVfA,-75.471659,40.338183
5,CF33F8-E6oudUQ46HnavjQ,-87.058943,36.269593
8,k0hlBqXX-Bt0vf1op7Jr1w,-90.321087,38.565165
9,bBDDEgkFA1Otx9Lfe7BZUQ,-86.768170,36.208102
...,...,...,...
150327,cM6V90ExQD6KMSU3rRB5ZA,-116.284689,43.615401
150331,qQ7FHvkGEMqoPKKXPk4gjA,-111.093308,32.358587
150336,WnT9NIzQgLlILjPT0kEcsQ,-75.158665,39.935982
150339,2O2K6SXPWv56amqxCECd4w,-75.427725,39.856185


In [11]:
df1 = df1.drop_duplicates().reset_index(drop=True)
df1

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 13, Finished, Available)

Unnamed: 0,business_id,longitude,latitude
0,MTSW4McQd7CbVtyjqoe9mw,-75.155564,39.955505
1,mWMc6_wTdE0EUBKIGXDVfA,-75.471659,40.338183
2,CF33F8-E6oudUQ46HnavjQ,-87.058943,36.269593
3,k0hlBqXX-Bt0vf1op7Jr1w,-90.321087,38.565165
4,bBDDEgkFA1Otx9Lfe7BZUQ,-86.768170,36.208102
...,...,...,...
57821,cM6V90ExQD6KMSU3rRB5ZA,-116.284689,43.615401
57822,qQ7FHvkGEMqoPKKXPk4gjA,-111.093308,32.358587
57823,WnT9NIzQgLlILjPT0kEcsQ,-75.158665,39.935982
57824,2O2K6SXPWv56amqxCECd4w,-75.427725,39.856185


In [1]:
# Take the coordinates and assign the city where it is located using the API
file_location = '/lakehouse/default/Files/df_database/state.parquet'
api_key = 'AIzaSyAHo-DQhxRt2aCx1rHZHq9_4BpKQoGs'
df_state = ut.state_dataframe(df1, file_location, api_key)


StatementMeta(, , , SessionStarting, )

In [None]:
df_state

StatementMeta(, , , Waiting, )

In [14]:
unique_states = df_state['State'].unique()

# Imprimir los valores únicos
print("Valores únicos en la columna 'State':")
print(unique_states)

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 16, Finished, Available)

Valores únicos en la columna 'State':
['Pennsylvania' 'Tennessee' 'Missouri' 'Florida' 'Indiana' 'Louisiana'
 'Alberta' 'Nevada' 'Idaho' 'Illinois' 'Arizona' 'New Jersey' 'California'
 'Delaware' None 'NY' 'ME' 'TX' 'MD']


We filter by the states that are of interest to us

In [15]:
allowed_states = ['California', 'Florida', 'Nevada', 'Indiana', 'Lousiana', 'Illinois', 'New Jersey']

df_state = df_state[df_state['State'].isin(allowed_states)]
df_state

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 17, Finished, Available)

Unnamed: 0,business_id,longitude,latitude,State
4,eEOYSgkmpB90uNA7lDOMRA,-82.456320,27.955269,Florida
5,il_Ro8jwPlHresjw9EGmBg,-86.127217,39.637133,Indiana
6,0bPLkL0QhhPO5kt1_EXmNQ,-82.760461,27.916116,Florida
11,kfNv-JZpuN6TVNSO6hHdkw,-86.053080,39.904320,Indiana
12,9OG5YkX1g2GReZM0AskizA,-119.789339,39.476117,Nevada
...,...,...,...,...
60870,-bZQH8yjm7ntTyGeLQwh8Q,-82.750395,27.916787,Florida
60872,BIyT7Kr7tMJqlfp4oOOYQg,-82.316887,27.853745,Florida
60873,8BUr8GviR2o_b-brO21wwQ,-119.856886,34.412966,California
60883,uriD7RFuHhLJeDdKaf0nFA,-119.739681,34.440689,California


In [16]:
business = pd.merge(business, df_state, on=['business_id', 'longitude', 'latitude'], how = "inner")

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 18, Finished, Available)

In [17]:
business = business.drop(columns=['state'])

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 19, Finished, Available)

Having the Business dataframe without these columns makes it easier for us to manage the relevant information throughout this process.

___
#### 3.2 Checkin Dataframe

In [18]:
checkin.head()

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 20, Finished, Available)

Unnamed: 0,business_id,date
0,---kPU91CF4Lq2-WlRu9Lw,"2020-03-13 21:10:56, 2020-06-02 22:18:06, 2020..."
1,--0iUa4sNDFiZFrAdIWhZQ,"2010-09-13 21:43:09, 2011-05-04 23:08:15, 2011..."
2,--30_8IhuyMHbSOcNWd6DQ,"2013-06-14 23:29:17, 2014-08-13 23:20:22"
3,--7PUidqRWpRSpXebiyxTg,"2011-02-15 17:12:00, 2011-07-28 02:46:10, 2012..."
4,--7jw19RH9JKXgFohspgQw,"2014-04-21 20:42:11, 2014-04-28 21:04:46, 2014..."


In [19]:
checkin.info()

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 21, Finished, Available)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131930 entries, 0 to 131929
Data columns (total 2 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   business_id  131930 non-null  object
 1   date         131930 non-null  object
dtypes: object(2)
memory usage: 2.0+ MB


###### The date-time combination is not useful for our purposes, so we proceed to count the number of those records per row.

In [20]:
def count_element (date):
    dates = date.split(",")
    return len(dates)

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 22, Finished, Available)

In [21]:
checkin["count_checkin"] = checkin["date"].apply(count_element) 
checkin.drop(columns=['date'], inplace=True)

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 23, Finished, Available)

In [22]:
checkin.head()

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 24, Finished, Available)

Unnamed: 0,business_id,count_checkin
0,---kPU91CF4Lq2-WlRu9Lw,11
1,--0iUa4sNDFiZFrAdIWhZQ,10
2,--30_8IhuyMHbSOcNWd6DQ,2
3,--7PUidqRWpRSpXebiyxTg,10
4,--7jw19RH9JKXgFohspgQw,26


In this way we have information on the number of records that exist by business ID.

___
#### 3.3 Review Dataframe

We create a brief visualization to understand the information contained in the dataframe

In [23]:
review.head(2)

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 25, Finished, Available)

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,KU_O5udG6zpxOg-VcAEodg,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3,0,0,0,"If you decide to eat here, just be aware it is...",2018-07-07 22:09:11
1,BiTunyQ73aT9WBnpR9DZGw,OyoGAe7OKpv6SyGZT5g77Q,7ATYjTIgM3jUlt4UM3IypQ,5,1,0,1,I've taken a lot of spin classes over the year...,2012-01-03 15:28:18


In [24]:
ut.data_summ(review)

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 26, Finished, Available)

     Column                                            Data_type  No_miss_Qty  %Missing  Missing_Qty
  review_id                                      [<class 'str'>]      6990280       0.0            0
    user_id                                      [<class 'str'>]      6990280       0.0            0
business_id                                      [<class 'str'>]      6990280       0.0            0
      stars                                      [<class 'int'>]      6990280       0.0            0
     useful                                      [<class 'int'>]      6990280       0.0            0
      funny                                      [<class 'int'>]      6990280       0.0            0
       cool                                      [<class 'int'>]      6990280       0.0            0
       text                                      [<class 'str'>]      6990280       0.0            0
       date [<class 'pandas._libs.tslibs.timestamps.Timestamp'>]      6990280       0.0    

Unnamed: 0,Column,Data_type,No_miss_Qty,%Missing,Missing_Qty
0,review_id,[<class 'str'>],6990280,0.0,0
1,user_id,[<class 'str'>],6990280,0.0,0
2,business_id,[<class 'str'>],6990280,0.0,0
3,stars,[<class 'int'>],6990280,0.0,0
4,useful,[<class 'int'>],6990280,0.0,0
5,funny,[<class 'int'>],6990280,0.0,0
6,cool,[<class 'int'>],6990280,0.0,0
7,text,[<class 'str'>],6990280,0.0,0
8,date,[<class 'pandas._libs.tslibs.timestamps.Timest...,6990280,0.0,0


In [25]:
review = review.drop(["review_id", "useful", "funny", "cool"], axis=1)

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 27, Finished, Available)

The columns "review_id", "useful", "funny", "cool" and "date" which contain irrelevant content are removed from the dataframe

___
#### 3.4 Tip Dataframe

In [26]:
tip.head(2)

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 28, Finished, Available)

Unnamed: 0,user_id,business_id,text,date,compliment_count
0,AGNUgVwnZUey3gcPCJ76iw,3uLgwr0qeCNMjKenHJwPGQ,Avengers time with the ladies.,2012-05-18 02:17:21,0
1,NBN4MgHP9D3cw--SnauTkA,QoezRbYQncpRqyrLH6Iqjg,They have lots of good deserts and tasty cuban...,2013-02-05 18:35:10,0


We can see the content of the Tip dataframe which has 6 columns:
- User id
- Business_id
- Text
- Date
- Compliment_count

The information contained in this dataframe is mainly to store tips or shorter reviews aimed at giving advice. This dataframe is irrelevant to us due to the fact that we have repeated information but more briefly about the review information.

___
#### 3.5 User Dataframe

In [27]:
user.head()

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 29, Finished, Available)

Unnamed: 0,user_id,name,review_count,yelping_since,useful,funny,cool,elite,friends,fans,...,compliment_more,compliment_profile,compliment_cute,compliment_list,compliment_note,compliment_plain,compliment_cool,compliment_funny,compliment_writer,compliment_photos
0,qVc8ODYU5SZjKXVBgXdI7w,Walker,585,2007-01-25 16:47:26,7217,1259,5994,2007,"NSCy54eWehBJyZdG2iE84w, pe42u7DcCH2QmI81NX-8qA...",267,...,65,55,56,18,232,844,467,467,239,180
1,j14WgRoU_-2ZE1aw1dXrJg,Daniel,4333,2009-01-25 04:35:42,43091,13066,27281,"2009,2010,2011,2012,2013,2014,2015,2016,2017,2...","ueRPE0CX75ePGMqOFVj6IQ, 52oH4DrRvzzl8wh5UXyU0A...",3138,...,264,184,157,251,1847,7054,3131,3131,1521,1946
2,2WnXYQFK0hXEoTxPtV2zvg,Steph,665,2008-07-25 10:41:00,2086,1010,1003,20092010201120122013,"LuO3Bn4f3rlhyHIaNfTlnA, j9B4XdHUhDfTKVecyWQgyA...",52,...,13,10,17,3,66,96,119,119,35,18
3,SZDeASXq7o05mMNLshsdIA,Gwen,224,2005-11-29 04:38:33,512,330,299,200920102011,"enx1vVPnfdNUdPho6PH_wg, 4wOcvMLtU6a9Lslggq74Vg...",28,...,4,1,6,2,12,16,26,26,10,9
4,hA5lMy-EnncsH4JoR-hFGQ,Karen,79,2007-01-05 19:40:59,29,15,7,,"PBK4q9KEEBHhFvSXCUirIw, 3FWPpM7KU1gXeOM_ZbYMbA...",1,...,1,0,0,0,1,1,0,0,0,0


In [28]:
ut.data_summ(user)

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 30, Finished, Available)

            Column         Data_type  No_miss_Qty  %Missing  Missing_Qty
           user_id   [<class 'str'>]      2105597       0.0            0
              name   [<class 'str'>]      2105597       0.0            0
      review_count   [<class 'int'>]      2105597       0.0            0
     yelping_since   [<class 'str'>]      2105597       0.0            0
            useful   [<class 'int'>]      2105597       0.0            0
             funny   [<class 'int'>]      2105597       0.0            0
              cool   [<class 'int'>]      2105597       0.0            0
             elite   [<class 'str'>]      2105597       0.0            0
           friends   [<class 'str'>]      2105597       0.0            0
              fans   [<class 'int'>]      2105597       0.0            0
     average_stars [<class 'float'>]      2105597       0.0            0
    compliment_hot   [<class 'int'>]      2105597       0.0            0
   compliment_more   [<class 'int'>]      2105597  

Unnamed: 0,Column,Data_type,No_miss_Qty,%Missing,Missing_Qty
0,user_id,[<class 'str'>],2105597,0.0,0
1,name,[<class 'str'>],2105597,0.0,0
2,review_count,[<class 'int'>],2105597,0.0,0
3,yelping_since,[<class 'str'>],2105597,0.0,0
4,useful,[<class 'int'>],2105597,0.0,0
5,funny,[<class 'int'>],2105597,0.0,0
6,cool,[<class 'int'>],2105597,0.0,0
7,elite,[<class 'str'>],2105597,0.0,0
8,friends,[<class 'str'>],2105597,0.0,0
9,fans,[<class 'int'>],2105597,0.0,0


We can see that we have this dataframe oriented to the information of the user who made the review, organized with 22 columns:

- user_id
- name   
- review_count   
- yelping_since   
- useful   
- funny   
- cool   
- elite   
- friends   
- fans   
- average_stars 
- compliment_hot   
- compliment_more   
- compliment_profile   
- compliment_cute   
- compliment_list   
- compliment_note   
- compliment_plain   
- compliment_cool   
- compliment_funny   
- compliment_writer   
- compliment_photos   

Most of these columns store quantities of elements that are not relevant to us.

In [29]:
user = user[['user_id', 'name']]

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 31, Finished, Available)

In this way we leave the User dataframe, with the only relevant columns that are 'user_id' and 'name', which store the user IDs and respective names.

___
### 4 - Unification of all information - YELP

##### Observations:
- the information provided in the Tip file has abbreviated user reviews. As the Review file contains more detailed information on comments and ratings from users, we proceeded to use it to unify the files and not use the Tip file.
- Tables are linked via business_id and user_id

A main dataframe is created which will unite the information from the dataframes:
Business, Review, Checkin and User


In [30]:
dy_df = pd.merge(business, checkin, on= "business_id")
dy_df = pd.merge(dy_df, review, on="business_id" )
dy_df = pd.merge(dy_df, user, on="user_id" )
dy_df

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 32, Finished, Available)

Unnamed: 0,business_id,name_x,address,city,latitude,longitude,stars_x,review_count,categories,State,count_checkin,user_id,stars_y,text,date,name_y
0,eEOYSgkmpB90uNA7lDOMRA,Vietnamese Food Truck,,Tampa Bay,27.955269,-82.456320,4.0,10,"[Vietnamese, Food, Restaurants, Food Trucks]",Florida,4,nnu9h6du4E6oqMasPgKR3Q,5,I eat pho about 4 times a week and from a spec...,2019-04-04 16:03:00,Dana
1,AODksDNj5mH953cyhUG1Qw,Thai 5 Fast Food,3424 S Dale Mabry Hwy,Tampa,27.912060,-82.505768,4.0,336,"[Laotian, American (Traditional), Asian Fusion...",Florida,495,nnu9h6du4E6oqMasPgKR3Q,1,Had gotten the Pad see eew in the past and did...,2018-02-04 18:25:53,Dana
2,F9D2mQ4u-D5i0EWrcI01DQ,Thai Gourmet Market,5831 Memorial Hwy,Tampa,27.984573,-82.568798,4.5,194,"[Event Planning & Services, Ethnic Food, Asian...",Florida,499,nnu9h6du4E6oqMasPgKR3Q,5,"BEST PAD Thai I have ever had!! Seriously, not...",2016-06-11 23:14:50,Dana
3,eEOYSgkmpB90uNA7lDOMRA,Vietnamese Food Truck,,Tampa Bay,27.955269,-82.456320,4.0,10,"[Vietnamese, Food, Restaurants, Food Trucks]",Florida,4,JlEdjZvhAbFCU-ObZQb1lw,5,I've been in Wesley Chapel area for about 2 ye...,2018-10-23 00:36:29,Linh
4,Wk21f0DAM7uj3DaJ_rMI-Q,Brunchies - Lutz,24400 State Rd 54,Lutz,28.185389,-82.413001,4.0,272,"[Restaurants, Breakfast & Brunch, Coffee & Tea...",Florida,298,JlEdjZvhAbFCU-ObZQb1lw,5,My experience was wonderful. My server Michea...,2018-01-13 20:08:35,Linh
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2516649,esBGrrmuZzSiECyRBoKvvA,Colony Grill - St. Petersburg,670 Central Ave,St. Petersburg,27.770872,-82.643069,4.5,38,"[Bars, Beer Bar, Nightlife, Wine Bars, Pizza, ...",Florida,20,V0vxjcA66r80_nZsTyCg8Q,5,"We are from Connecticut, so expectations were ...",2021-11-12 22:04:25,Barbara
2516650,esBGrrmuZzSiECyRBoKvvA,Colony Grill - St. Petersburg,670 Central Ave,St. Petersburg,27.770872,-82.643069,4.5,38,"[Bars, Beer Bar, Nightlife, Wine Bars, Pizza, ...",Florida,20,NmJ6gpoLY6WGNJkk6jA8Zg,1,You would think that a restaurant called Colon...,2021-11-10 13:21:11,Elizabeth
2516651,esBGrrmuZzSiECyRBoKvvA,Colony Grill - St. Petersburg,670 Central Ave,St. Petersburg,27.770872,-82.643069,4.5,38,"[Bars, Beer Bar, Nightlife, Wine Bars, Pizza, ...",Florida,20,l-Kwk2mDOwEBXG8X5sU6lw,5,Ive been coming to Colony Grille in CT for ove...,2021-11-03 19:28:06,Paul From Norwalk
2516652,esBGrrmuZzSiECyRBoKvvA,Colony Grill - St. Petersburg,670 Central Ave,St. Petersburg,27.770872,-82.643069,4.5,38,"[Bars, Beer Bar, Nightlife, Wine Bars, Pizza, ...",Florida,20,QGMX1H6WQTp-o8Qq3uO3dQ,5,Colony Grill occupies one of the larger eatery...,2021-12-27 19:56:12,William


In [31]:
Yelp = dy_df[["name_x", "address", "city", "State", "latitude", "longitude", "stars_x", "review_count", "date", "count_checkin", "user_id", "stars_y", "text", "name_y", "categories"]]

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 33, Finished, Available)

In [32]:
Yelp = Yelp.rename(columns={
    "name_x": 'Business_Name',
    "address": 'Address',
    "city": 'City',
    "state": 'State',
    "latitude": 'Latitude',
    "longitude": 'Longitude',
    "stars_x": 'Ranking',
    "review_count": 'Review_Count',
    "date" : "Date",
    "count_checkin": 'Checkin_Count',
    "user_id": 'User_Id',
    "stars_y": 'Stars',
    "text": 'Text',
    "name_y": 'User_Name',
    "categories": 'Category'
})



StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 34, Finished, Available)

Once the Dataframe has been filtered and regrouped, we end up generating a parquet file which contains all the filtered information separated into the columns:
- 'Business_Name',
- 'Address',
- 'City',
- 'State',
- 'Latitude',
- 'Longitude',
- 'Ranking',
- 'Review_Count',
- 'Checkin_Count',
- 'User_Id',
- 'Stars',
- 'Text',
- 'User_Name',
- 'Category’


In [33]:
Yelp

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 35, Finished, Available)

Unnamed: 0,Business_Name,Address,City,State,Latitude,Longitude,Ranking,Review_Count,Date,Checkin_Count,User_Id,Stars,Text,User_Name,Category
0,Vietnamese Food Truck,,Tampa Bay,Florida,27.955269,-82.456320,4.0,10,2019-04-04 16:03:00,4,nnu9h6du4E6oqMasPgKR3Q,5,I eat pho about 4 times a week and from a spec...,Dana,"[Vietnamese, Food, Restaurants, Food Trucks]"
1,Thai 5 Fast Food,3424 S Dale Mabry Hwy,Tampa,Florida,27.912060,-82.505768,4.0,336,2018-02-04 18:25:53,495,nnu9h6du4E6oqMasPgKR3Q,1,Had gotten the Pad see eew in the past and did...,Dana,"[Laotian, American (Traditional), Asian Fusion..."
2,Thai Gourmet Market,5831 Memorial Hwy,Tampa,Florida,27.984573,-82.568798,4.5,194,2016-06-11 23:14:50,499,nnu9h6du4E6oqMasPgKR3Q,5,"BEST PAD Thai I have ever had!! Seriously, not...",Dana,"[Event Planning & Services, Ethnic Food, Asian..."
3,Vietnamese Food Truck,,Tampa Bay,Florida,27.955269,-82.456320,4.0,10,2018-10-23 00:36:29,4,JlEdjZvhAbFCU-ObZQb1lw,5,I've been in Wesley Chapel area for about 2 ye...,Linh,"[Vietnamese, Food, Restaurants, Food Trucks]"
4,Brunchies - Lutz,24400 State Rd 54,Lutz,Florida,28.185389,-82.413001,4.0,272,2018-01-13 20:08:35,298,JlEdjZvhAbFCU-ObZQb1lw,5,My experience was wonderful. My server Michea...,Linh,"[Restaurants, Breakfast & Brunch, Coffee & Tea..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2516649,Colony Grill - St. Petersburg,670 Central Ave,St. Petersburg,Florida,27.770872,-82.643069,4.5,38,2021-11-12 22:04:25,20,V0vxjcA66r80_nZsTyCg8Q,5,"We are from Connecticut, so expectations were ...",Barbara,"[Bars, Beer Bar, Nightlife, Wine Bars, Pizza, ..."
2516650,Colony Grill - St. Petersburg,670 Central Ave,St. Petersburg,Florida,27.770872,-82.643069,4.5,38,2021-11-10 13:21:11,20,NmJ6gpoLY6WGNJkk6jA8Zg,1,You would think that a restaurant called Colon...,Elizabeth,"[Bars, Beer Bar, Nightlife, Wine Bars, Pizza, ..."
2516651,Colony Grill - St. Petersburg,670 Central Ave,St. Petersburg,Florida,27.770872,-82.643069,4.5,38,2021-11-03 19:28:06,20,l-Kwk2mDOwEBXG8X5sU6lw,5,Ive been coming to Colony Grille in CT for ove...,Paul From Norwalk,"[Bars, Beer Bar, Nightlife, Wine Bars, Pizza, ..."
2516652,Colony Grill - St. Petersburg,670 Central Ave,St. Petersburg,Florida,27.770872,-82.643069,4.5,38,2021-12-27 19:56:12,20,QGMX1H6WQTp-o8Qq3uO3dQ,5,Colony Grill occupies one of the larger eatery...,William,"[Bars, Beer Bar, Nightlife, Wine Bars, Pizza, ..."


In [34]:
Yelp.to_parquet('/lakehouse/default/Files/df_database/files_to_EDA/Yelp_to_EDA.parquet')

StatementMeta(, 3c52b027-0896-4fe6-9576-75033c0ca337, 36, Finished, Available)