# Extraction, Transformation, and Loading of data 👨🏽‍💻 👩🏽‍💻

In this notebook, the data is loaded and the type of each column is reviewed to determine which columns are useful. The notebook also checks for duplicate records (rows) and for empty or null values in each column. The libraries used to manipulate the data are imported first, including our custom module called 'Tools'. Finally, the processed data is exported, ready for analysis.

## Importing the necessary libraries 📚

These libraries help us manipulate the data to ensure consistency and quality. Additionally, we import our custom module named 'Tools' (aliased as `T`), which is used throughout the process.

In [9]:
import pandas as pd
import Tools as T
import warnings
warnings.filterwarnings("ignore")

## Data Loading 📂🔄

In [10]:
df_rest_reviews = T.OpenJsonYelp('review-001.json')
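
`T.OpenJsonYelp` belongs to the custom module, and its implementation is not shown in this notebook. As a rough sketch only, assuming the review file is in JSON Lines format (one JSON object per line), a loader could read it in chunks with `pd.read_json`. The function name `load_json_lines`, the chunk size, and the sample records are illustrative assumptions, not the module's actual code.

```python
import io
import pandas as pd

def load_json_lines(path_or_buffer, chunksize=100_000):
    """Read a JSON Lines source in chunks and return one DataFrame.

    Hypothetical helper: not the actual implementation of T.OpenJsonYelp.
    """
    with pd.read_json(path_or_buffer, lines=True, chunksize=chunksize) as reader:
        chunks = [chunk for chunk in reader]
    # Concatenate the chunks and rebuild a clean 0..n-1 index.
    return pd.concat(chunks, ignore_index=True)

# Tiny in-memory sample standing in for the real file.
sample = io.StringIO(
    '{"review_id": "r1", "business_id": "b1", "stars": 3.0}\n'
    '{"review_id": "r2", "business_id": "b2", "stars": 5.0}\n'
    '{"review_id": "r3", "business_id": "b1", "stars": 4.0}\n'
)
df_sample = load_json_lines(sample, chunksize=2)
```

Reading in chunks keeps peak memory lower while parsing, which may matter for a file with several million reviews.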

## Transformations 🔀

We will drop the three columns that hold the votes each review received:

* cool
* funny
* useful

They are not needed because we will conduct our own sentiment analysis.

In [11]:
df_rest_reviews = df_rest_reviews.drop(columns=['cool','funny','useful'])
df_rest_reviews

Unnamed: 0,review_id,user_id,business_id,stars,text,date
0,KU_O5udG6zpxOg-VcAEodg,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3.0,"If you decide to eat here, just be aware it is...",2018-07-07 22:09:11
1,BiTunyQ73aT9WBnpR9DZGw,OyoGAe7OKpv6SyGZT5g77Q,7ATYjTIgM3jUlt4UM3IypQ,5.0,I've taken a lot of spin classes over the year...,2012-01-03 15:28:18
2,saUsX_uimxRlCVr67Z4Jig,8g_iMtfSiwikVnbP2etR0A,YjUWPpI6HXG530lwP-fb2A,3.0,Family diner. Had the buffet. Eclectic assortm...,2014-02-05 20:30:30
3,AqPFMleE6RsU23_auESxiA,_7bHUi9Uuf5__HHc_Q8guQ,kxX2SOes4o-D3ZQBkiMRfA,5.0,"Wow! Yummy, different, delicious. Our favo...",2015-01-04 00:01:03
4,Sx8TMOWLNuJBWer-0pcmoA,bcjbaE6dDog4jkNY91ncLQ,e4Vwtrqf-wpJfwesgvdgxQ,4.0,Cute interior and owner (?) gave us tour of up...,2017-01-14 20:54:15
...,...,...,...,...,...,...
6990275,H0RIamZu0B0Ei0P4aeh3sQ,qskILQ3k0I_qcCMI-k6_QQ,jals67o91gcrD4DC81Vk6w,5.0,Latest addition to services from ICCU is Apple...,2014-12-17 21:45:20
6990276,shTPgbgdwTHSuU67mGCmZQ,Zo0th2m8Ez4gLSbHftiQvg,2vLksaMmSEcGbjI5gywpZA,5.0,"This spot offers a great, affordable east week...",2021-03-31 16:55:10
6990277,YNfNhgZlaaCO5Q_YJR4rEw,mm6E4FbCMwJmb7kPDZ5v2Q,R1khUUxidqfaJmcpmGd4aw,4.0,This Home Depot won me over when I needed to g...,2019-12-30 03:56:30
6990278,i-I4ZOhoX70Nw5H0FwrQUA,YwAMC-jvZ1fvEUum6QkEkw,Rr9kKArrMhSLVE9a53q-aA,5.0,For when I'm feeling like ignoring my calorie-...,2022-01-19 18:59:27


The output above confirms that the three columns were dropped correctly. Now, we can check for null values and determine the data types present in each column.

In [12]:
T.analyze_data(df_rest_reviews)

Unnamed: 0,Name,Unique Data Types,% of Non-null Values,% of Null Values,Number of Null Values
0,review_id,[<class 'str'>],100.0,0.0,0
1,user_id,[<class 'str'>],100.0,0.0,0
2,business_id,[<class 'str'>],100.0,0.0,0
3,stars,[<class 'float'>],100.0,0.0,0
4,text,[<class 'str'>],100.0,0.0,0
5,date,[<class 'str'>],100.0,0.0,0
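
The code of `T.analyze_data` is not included here. A minimal sketch of a summary with the same columns as the output above might look like the following; the name `summarize_columns` and the sample data are assumptions made for illustration.

```python
import pandas as pd

def summarize_columns(df):
    """Per-column summary of Python value types and null counts.

    Hypothetical stand-in for T.analyze_data, whose code is not shown.
    """
    rows = []
    for name in df.columns:
        col = df[name]
        n_null = int(col.isna().sum())
        pct_null = round(100 * n_null / len(df), 2) if len(df) else 0.0
        rows.append({
            "Name": name,
            # Type names of the non-null values found in the column.
            "Unique Data Types": sorted({type(v).__name__ for v in col.dropna()}),
            "% of Non-null Values": round(100 - pct_null, 2),
            "% of Null Values": pct_null,
            "Number of Null Values": n_null,
        })
    return pd.DataFrame(rows)

sample = pd.DataFrame({"text": ["good", None], "stars": [5.0, 3.0]})
summary = summarize_columns(sample)
```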


We need to cross-reference the reviews with the business dataset and keep only the reviews of places related to food. This is done with an inner join on the business identifier, `business_id`, which is present in both datasets. From now on, we will work with the filtered dataframe, which keeps the name 'df_rest_reviews'.

In [13]:
df_rest = pd.read_parquet('df_business.parquet')
df_rest2 = df_rest[['business_id']]
df_rest_reviews = pd.merge(df_rest2,df_rest_reviews,how='inner',on='business_id')
df_rest_reviews

Unnamed: 0,business_id,review_id,user_id,stars,text,date
0,MTSW4McQd7CbVtyjqoe9mw,BXQcBN0iAi1lAUxibGLFzA,6_SpY41LIHZuIaiDs5FMKA,4.0,This is nice little Chinese bakery in the hear...,2014-05-26 01:09:53
1,MTSW4McQd7CbVtyjqoe9mw,uduvUCvi9w3T2bSGivCfXg,tCXElwhzekJEH6QJe3xs7Q,4.0,This is the bakery I usually go to in Chinatow...,2013-10-05 15:19:06
2,MTSW4McQd7CbVtyjqoe9mw,a0vwPOqDXXZuJkbBW2356g,WqfKtI-aGMmvbA9pPUxNQQ,5.0,"A delightful find in Chinatown! Very clean, an...",2013-10-25 01:34:57
3,MTSW4McQd7CbVtyjqoe9mw,MKNp_CdR2k2202-c8GN5Dw,3-1va0IQfK-9tUMzfHWfTA,5.0,I ordered a graduation cake for my niece and i...,2018-05-20 17:58:57
4,MTSW4McQd7CbVtyjqoe9mw,D1GisLDPe84Rrk_R4X2brQ,EouCKoDfzaVG0klEgdDvCQ,4.0,HK-STYLE MILK TEA: FOUR STARS\n\nNot quite su...,2013-10-25 02:31:35
...,...,...,...,...,...,...
3024653,2O2K6SXPWv56amqxCECd4w,Kt3gFeW1rhZz7RuiV-6Tcw,eWz12w7dzYlfrGnhTQ82Fg,5.0,This is my favorite food truck! I only wish I ...,2019-07-14 14:25:35
3024654,2O2K6SXPWv56amqxCECd4w,ruy3Ycey_gGbwkE_3TX1Fg,lDyhGApbGZ0_BoeJzRQq7g,5.0,This food truck was stupid. Stupidly delicious...,2021-06-25 23:22:26
3024655,2O2K6SXPWv56amqxCECd4w,C_l8NTpvNOEUorEmEOusaA,-TTJ75--0NEAjvFCOV7rBg,5.0,Bubba never disappoints i go to his fb page an...,2016-12-09 21:38:05
3024656,2O2K6SXPWv56amqxCECd4w,q39JOIkHmIhdmYnjEhZCdQ,8yFNNU7UmQcfzmcTvzTlOA,1.0,The truck was invited to our office for a part...,2020-02-19 22:59:06
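
Whether `df_business.parquet` can contain repeated `business_id` values is not shown in this notebook. If it did, an inner join would repeat the matching reviews. The sketch below, on made-up data, shows one way to guard against that with `drop_duplicates` and the `validate` argument of `pd.merge`, plus an equivalent filter based on `isin`.

```python
import pandas as pd

# Made-up data for illustration only.
reviews = pd.DataFrame({
    "review_id": ["r1", "r2", "r3", "r4"],
    "business_id": ["b1", "b2", "b1", "b3"],
})
# Hypothetical business table; note the repeated "b1" row.
businesses = pd.DataFrame({"business_id": ["b1", "b3", "b1"]})

# De-duplicate the keys so that each review can match at most once.
keys = businesses[["business_id"]].drop_duplicates()

# validate="one_to_many" makes pandas raise if the left keys are not unique.
merged = pd.merge(keys, reviews, how="inner", on="business_id",
                  validate="one_to_many")

# Equivalent row filter that keeps the column order of `reviews`.
filtered = reviews[reviews["business_id"].isin(keys["business_id"])]
```

Both approaches keep the same reviews; the `isin` filter also preserves the original row order, while the merge orders rows by the left table's keys.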


Now, we can check again for null values and determine the data types present in each column.

In [14]:
T.analyze_data(df_rest_reviews)

Unnamed: 0,Name,Unique Data Types,% of Non-null Values,% of Null Values,Number of Null Values
0,business_id,[<class 'str'>],100.0,0.0,0
1,review_id,[<class 'str'>],100.0,0.0,0
2,user_id,[<class 'str'>],100.0,0.0,0
3,stars,[<class 'float'>],100.0,0.0,0
4,text,[<class 'str'>],100.0,0.0,0
5,date,[<class 'str'>],100.0,0.0,0


## Column ``date``

It was observed that the `date` column stores dates and times as strings. Therefore, the pandas library was used to convert it to a date format (datetime). pandas was chosen for its versatility compared to standard Python functions.

In [15]:
df_rest_reviews['date'] = pd.to_datetime(df_rest_reviews['date'],format='%Y-%m-%d %H:%M:%S')
df_rest_reviews

Unnamed: 0,business_id,review_id,user_id,stars,text,date
0,MTSW4McQd7CbVtyjqoe9mw,BXQcBN0iAi1lAUxibGLFzA,6_SpY41LIHZuIaiDs5FMKA,4.0,This is nice little Chinese bakery in the hear...,2014-05-26 01:09:53
1,MTSW4McQd7CbVtyjqoe9mw,uduvUCvi9w3T2bSGivCfXg,tCXElwhzekJEH6QJe3xs7Q,4.0,This is the bakery I usually go to in Chinatow...,2013-10-05 15:19:06
2,MTSW4McQd7CbVtyjqoe9mw,a0vwPOqDXXZuJkbBW2356g,WqfKtI-aGMmvbA9pPUxNQQ,5.0,"A delightful find in Chinatown! Very clean, an...",2013-10-25 01:34:57
3,MTSW4McQd7CbVtyjqoe9mw,MKNp_CdR2k2202-c8GN5Dw,3-1va0IQfK-9tUMzfHWfTA,5.0,I ordered a graduation cake for my niece and i...,2018-05-20 17:58:57
4,MTSW4McQd7CbVtyjqoe9mw,D1GisLDPe84Rrk_R4X2brQ,EouCKoDfzaVG0klEgdDvCQ,4.0,HK-STYLE MILK TEA: FOUR STARS\n\nNot quite su...,2013-10-25 02:31:35
...,...,...,...,...,...,...
3024653,2O2K6SXPWv56amqxCECd4w,Kt3gFeW1rhZz7RuiV-6Tcw,eWz12w7dzYlfrGnhTQ82Fg,5.0,This is my favorite food truck! I only wish I ...,2019-07-14 14:25:35
3024654,2O2K6SXPWv56amqxCECd4w,ruy3Ycey_gGbwkE_3TX1Fg,lDyhGApbGZ0_BoeJzRQq7g,5.0,This food truck was stupid. Stupidly delicious...,2021-06-25 23:22:26
3024655,2O2K6SXPWv56amqxCECd4w,C_l8NTpvNOEUorEmEOusaA,-TTJ75--0NEAjvFCOV7rBg,5.0,Bubba never disappoints i go to his fb page an...,2016-12-09 21:38:05
3024656,2O2K6SXPWv56amqxCECd4w,q39JOIkHmIhdmYnjEhZCdQ,8yFNNU7UmQcfzmcTvzTlOA,1.0,The truck was invited to our office for a part...,2020-02-19 22:59:06
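
With an explicit `format`, `pd.to_datetime` raises an error by default if a value does not match. As a small sketch on made-up values, passing `errors="coerce"` instead turns unparseable strings into `NaT`, so they can be counted before deciding how to handle them.

```python
import pandas as pd

# Made-up values; the last one is deliberately malformed.
raw = pd.Series(["2018-07-07 22:09:11", "2012-01-03 15:28:18", "not a date"])

# Values that do not match the format become NaT instead of raising.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Number of values that could not be parsed.
n_failed = int(parsed.isna().sum())

# Once converted, date parts are available through the .dt accessor.
years = parsed.dt.year
```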


## Column ``review_id``

We will start by searching for null values using a function from our custom module.

In [16]:
T.nulls(df_rest_reviews,'review_id')

The column "review_id" does not have nulls


Unnamed: 0,business_id,review_id,user_id,stars,text,date


We are going to check if there are any empty records.

In [17]:
T.empty_values(df_rest_reviews,'review_id')

The column "review_id" does not have empty values
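
The helpers `T.nulls` and `T.empty_values` come from the custom module and are not listed here. A minimal sketch of equivalent checks, assuming "empty" means a string that is blank after stripping whitespace, could be written as follows; the function names `null_rows` and `empty_rows` are hypothetical.

```python
import pandas as pd

def null_rows(df, column):
    """Rows where `column` is null (None or NaN)."""
    return df[df[column].isna()]

def empty_rows(df, column):
    """Rows where `column` holds a string that is blank after stripping."""
    is_blank = df[column].map(lambda v: isinstance(v, str) and v.strip() == "")
    return df[is_blank.astype(bool)]

# Made-up sample: one null and one whitespace-only value.
sample = pd.DataFrame({"review_id": ["r1", None, "  ", "r4"]})
nulls_found = null_rows(sample, "review_id")
empties_found = empty_rows(sample, "review_id")
```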


We are going to check whether there are any duplicate records.

In [18]:
T.duplicates(df_rest_reviews)

Unnamed: 0,business_id,review_id,user_id,stars,text,date


No fully duplicated rows were found, and the `review_id` column has no empty or null values, which indicates that the information is complete and consistent. Note that this check compares entire rows, so on its own it does not show whether a user has reviewed the same place more than once.
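
A whole-row duplicate check, uniqueness of `review_id`, and repeated reviews by the same user for the same business are three different questions. The sketch below, on made-up data, shows how each could be checked separately; it is an illustration and not the code of `T.duplicates`.

```python
import pandas as pd

# Made-up data: user u1 reviewed business b1 twice with different content.
sample = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "user_id": ["u1", "u1", "u2"],
    "business_id": ["b1", "b1", "b1"],
    "stars": [5.0, 2.0, 4.0],
})

# 1) Rows that are identical in every column.
full_row_duplicates = sample[sample.duplicated(keep=False)]

# 2) Whether the review identifier is unique.
review_id_is_unique = sample["review_id"].is_unique

# 3) (user, business) pairs with more than one review.
reviews_per_pair = sample.groupby(["user_id", "business_id"]).size()
repeat_pairs = reviews_per_pair[reviews_per_pair > 1]
```

In this sample there are no duplicated rows and every `review_id` is unique, yet one user still has two reviews for the same business.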

## Column ``text``

We will start by searching for null values using a function from our custom module.

In [19]:
T.nulls(df_rest_reviews,'text')

The column "text" does not have nulls


Unnamed: 0,business_id,review_id,user_id,stars,text,date


We are going to check if there are any empty records.

In [20]:
T.empty_values(df_rest_reviews,'text')

The column "text" does not have empty values


Some review texts appear more than once, up to 10 times. The most frequent ones are shown below for reference.

In [21]:
df_rest_reviews['text'].value_counts().head()

text
DO NOT PARK HERE!\nthey are too quick to boot you!\n$144 to remove a boot.\nIf you lose track of time (a normal affair in New Orleans, having fun) and even if (like I did) you add time, they get you in between!!!!\nAVOID ALL PREMIUM PARKING LOTS!!\nThere are so many other options near by!\nTrust me!
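
To look more closely at repeated texts, one option is to isolate the rows whose `text` occurs more than once and count how many distinct users posted each one. This is only a sketch on made-up data, not a step of the original notebook.

```python
import pandas as pd

# Made-up data: the same text posted three times by two users.
sample = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "text": ["Avoid this lot!", "Avoid this lot!", "Avoid this lot!", "Great food"],
})

# Texts that occur more than once, with their counts.
text_counts = sample["text"].value_counts()
repeated_texts = text_counts[text_counts > 1]

# All rows whose text is repeated somewhere else in the data.
rows_with_repeated_text = sample[sample["text"].duplicated(keep=False)]

# Number of distinct users behind each repeated text.
users_per_text = rows_with_repeated_text.groupby("text")["user_id"].nunique()
```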

## Column ``stars``

We will start by searching for null values using a function from our custom module.

In [22]:
T.nulls(df_rest_reviews,'stars')

The column "stars" does not have nulls


Unnamed: 0,business_id,review_id,user_id,stars,text,date


We are going to check if there are any empty records.

In [23]:
T.empty_values(df_rest_reviews,'stars')

The column "stars" does not have empty values


It can be observed that 5 stars is the most common rating, accounting for about 46% of the reviews, followed by 4 stars (about 21%) and 1 star (about 15%). This is presented for visualization purposes.

In [24]:
T.count_and_percentage(df_rest_reviews,'stars')

The values of stars:
5.0    1393524
4.0     631449
1.0     462985
3.0     300476
2.0     236224

The percentage that each value represents:
5.0    46.07
4.0    20.88
1.0    15.31
3.0     9.93
2.0     7.81
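
The counts and percentages above can be reproduced with `value_counts`. The sketch below is an assumed equivalent of `T.count_and_percentage`, whose code is not shown; the name `counts_with_percentage` and the sample data are hypothetical.

```python
import pandas as pd

def counts_with_percentage(df, column):
    """Return absolute counts and percentages (2 decimals) for `column`.

    Hypothetical stand-in for T.count_and_percentage.
    """
    counts = df[column].value_counts()
    # normalize=True gives fractions that sum to 1; scale them to percent.
    percentages = (df[column].value_counts(normalize=True) * 100).round(2)
    return counts, percentages

sample = pd.DataFrame({"stars": [5.0, 5.0, 4.0, 1.0]})
counts, percentages = counts_with_percentage(sample, "stars")
```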


## Column ``business_id``

We will start by searching for null values using a function from our custom module.

In [25]:
T.nulls(df_rest_reviews,'business_id')

The column "business_id" does not have nulls


Unnamed: 0,business_id,review_id,user_id,stars,text,date


We are going to check if there are any empty records.

In [26]:
T.empty_values(df_rest_reviews,'business_id')

The column "business_id" does not have empty values


It can be observed that the five most-reviewed businesses have between roughly 4,600 and 7,500 reviews each. These data are presented for visualization purposes.

In [27]:
df_rest_reviews['business_id'].value_counts().head()

business_id
ac1AeYqs8Z4_e2X5M3if2A    7516
ytynqOUb3hjKeJfRj5Tshw    5778
oBNrLz4EDhiscSlbOl8uAw    5264
_C7QiQQc47AOEv4PE3Kong    4969
GBTPC53ZrG1ZBY3DT8Mbcw    4661
Name: count, dtype: int64

## Column ``user_id``

We will start by searching for null values using a function from our custom module.

In [28]:
T.nulls(df_rest_reviews,'user_id')

The column "user_id" does not have nulls


Unnamed: 0,business_id,review_id,user_id,stars,text,date


We are going to check if there are any empty records.

In [29]:
T.empty_values(df_rest_reviews,'user_id')

The column "user_id" does not have empty values


It can be observed that the five most active users have between roughly 700 and 1,350 reviews each. These data are presented for visualization purposes.

In [30]:
df_rest_reviews['user_id'].value_counts().head()

user_id
_BcWyKQL16ndpBdggh2kNA    1345
Xw7ZjaGfr0WNVt6s_5KZfA     821
0Igx-a1wAstiBDerGxXk2A     791
ET8n-r7glWYqZhuR6GcdNw     751
-G7Zkl1wIWBBmD0KRy_sCw     726
Name: count, dtype: int64

After analyzing all the columns, an additional column called 'id_review_yelp' will be added. It assigns each review a sequential identifier that marks the record as coming from Yelp, which will help us identify the source platform when connecting this information with Google's data.

In [31]:
df_rest_reviews['id_review_yelp'] = range(0,len(df_rest_reviews))

## Exporting the Data 🌐

In [32]:
df_rest_reviews.to_parquet('df_reviews.parquet')

The conclusions from this notebook are as follows. Columns that were not suitable for our purposes, such as feeding a recommendation system, were deleted. The reviews were filtered to keep only those of food-related businesses. The dates and times stored as strings were converted to datetime. Then, the review ID, text, stars, business ID, and user ID columns were checked for null and empty values, and the most frequent values of several of them were listed.

Once this ETL is completed, we will move on to the next one for [users](ETL_user.ipynb).