# Extraction, Transformation, and Loading of data 👨🏽‍💻 👩🏽‍💻

In this notebook, we load the data and review the type of each column to decide which columns are useful. We also check for duplicate records (rows) and for null or empty values in each column. Before these transformations, we import the libraries used throughout the notebook to manipulate the data, including our custom module `Tools`. Finally, the processed data is exported, ready for analysis.

## Importing the necessary libraries 📚

These libraries help us manipulate the data to ensure consistency and quality. We also import our custom module `Tools` (aliased as `T`), which provides helper functions used throughout this process.

In [1]:
import pandas as pd
import pyarrow.parquet as pq
import Tools as T
import warnings
warnings.filterwarnings("ignore")


## Data Loading 📂🔄

In [2]:
file_path = 'user-002.parquet'
df_u = pq.read_table(file_path)
df_u_pd = df_u.to_pandas()  # to_pandas() already returns a pandas DataFrame
df_u_pd

Unnamed: 0,user_id,name,review_count,yelping_since,useful,funny,cool,elite,friends,fans,...,compliment_more,compliment_profile,compliment_cute,compliment_list,compliment_note,compliment_plain,compliment_cool,compliment_funny,compliment_writer,compliment_photos
0,qVc8ODYU5SZjKXVBgXdI7w,Walker,585,2007-01-25 16:47:26,7217,1259,5994,2007,"NSCy54eWehBJyZdG2iE84w, pe42u7DcCH2QmI81NX-8qA...",267,...,65,55,56,18,232,844,467,467,239,180
1,j14WgRoU_-2ZE1aw1dXrJg,Daniel,4333,2009-01-25 04:35:42,43091,13066,27281,"2009,2010,2011,2012,2013,2014,2015,2016,2017,2...","ueRPE0CX75ePGMqOFVj6IQ, 52oH4DrRvzzl8wh5UXyU0A...",3138,...,264,184,157,251,1847,7054,3131,3131,1521,1946
2,2WnXYQFK0hXEoTxPtV2zvg,Steph,665,2008-07-25 10:41:00,2086,1010,1003,20092010201120122013,"LuO3Bn4f3rlhyHIaNfTlnA, j9B4XdHUhDfTKVecyWQgyA...",52,...,13,10,17,3,66,96,119,119,35,18
3,SZDeASXq7o05mMNLshsdIA,Gwen,224,2005-11-29 04:38:33,512,330,299,200920102011,"enx1vVPnfdNUdPho6PH_wg, 4wOcvMLtU6a9Lslggq74Vg...",28,...,4,1,6,2,12,16,26,26,10,9
4,hA5lMy-EnncsH4JoR-hFGQ,Karen,79,2007-01-05 19:40:59,29,15,7,,"PBK4q9KEEBHhFvSXCUirIw, 3FWPpM7KU1gXeOM_ZbYMbA...",1,...,1,0,0,0,1,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2105592,4QGxxakRZeOlg_qDuxmTeQ,Jennilee,38,2012-01-19 23:33:02,74,9,6,,kmwNG5LZSHFmveg6wYYdrw,0,...,1,0,0,0,1,4,0,0,1,0
2105593,tmelBbVBGAzXBVfH2u_R6g,Gerry,19,2009-06-09 16:34:54,14,5,2,,"BFYdCAMFyjYHDwesndEXEg, _9fTIqfSJc7g3V_o76XRVg...",1,...,1,0,0,0,0,1,0,0,0,0
2105594,tpBznnD6uJN3m_pJubj09w,Emily,26,2013-08-13 23:18:11,4,1,2,,"bKV3ly2MuK-K1cptMrFknQ, liel18zRoSB4tEkUP7i6Cg...",0,...,0,0,0,0,1,0,0,0,0,0
2105595,Kst_srPw7GdYydMFYdCtzw,Heatheranne,25,2015-01-10 00:06:25,21,2,5,,"dzHTk52vbGtbktRm_B-wEg, fOfFLV7IbBDN6lzARaLqdg...",0,...,0,0,0,0,0,1,0,0,0,0


## Transformations 🔀

We have identified the columns that matter for both our data warehouse and our recommendation system. We select those columns and discard the rest.

In [3]:
df_user = df_u_pd[['user_id','name','review_count','useful']].copy()  # copy so later edits do not touch df_u_pd
df_user

Unnamed: 0,user_id,name,review_count,useful
0,qVc8ODYU5SZjKXVBgXdI7w,Walker,585,7217
1,j14WgRoU_-2ZE1aw1dXrJg,Daniel,4333,43091
2,2WnXYQFK0hXEoTxPtV2zvg,Steph,665,2086
3,SZDeASXq7o05mMNLshsdIA,Gwen,224,512
4,hA5lMy-EnncsH4JoR-hFGQ,Karen,79,29
...,...,...,...,...
2105592,4QGxxakRZeOlg_qDuxmTeQ,Jennilee,38,74
2105593,tmelBbVBGAzXBVfH2u_R6g,Gerry,19,14
2105594,tpBznnD6uJN3m_pJubj09w,Emily,26,4
2105595,Kst_srPw7GdYydMFYdCtzw,Heatheranne,25,21


Now, we will review the composition of the columns, their data types, and whether they contain null values or not.

In [4]:
T.analyze_data(df_user)

Unnamed: 0,Name,Unique Data Types,% of Non-null Values,% of Null Values,Number of Null Values
0,user_id,[<class 'str'>],100.0,0.0,0
1,name,[<class 'str'>],100.0,0.0,0
2,review_count,[<class 'int'>],100.0,0.0,0
3,useful,[<class 'int'>],100.0,0.0,0
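The source of `T.analyze_data` is not shown in this notebook. As a rough illustration only, a report with the same columns could be built with plain pandas as below. The function name `summarize_columns` and the sample frame are hypothetical and are not part of `Tools`.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for T.analyze_data: one summary row per column."""
    total = len(df)
    rows = []
    for col in df.columns:
        null_count = int(df[col].isna().sum())
        null_pct = round(100 * null_count / total, 2) if total else 0.0
        # Collect the Python type names of the non-null values in this column.
        type_names = sorted({type(v).__name__ for v in df[col].dropna()})
        rows.append({
            "Name": col,
            "Unique Data Types": type_names,
            "% of Non-null Values": round(100 - null_pct, 2),
            "% of Null Values": null_pct,
            "Number of Null Values": null_count,
        })
    return pd.DataFrame(rows)


# Small made-up frame, used only to exercise the sketch.
sample = pd.DataFrame({"user_id": ["a", "b", None], "review_count": [10, 20, 30]})
summary = summarize_columns(sample)
print(summary)
```

Iterating over every value to collect its type is slow on a frame with millions of rows, so a real helper would probably rely on `df.dtypes` or sample the data.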


First, we will check for duplicates to eliminate them if they exist.

In [5]:
duplicados = T.duplicates(df_user)
duplicados

Unnamed: 0,user_id,name,review_count,useful
0,qVc8ODYU5SZjKXVBgXdI7w,Walker,585,7217
1,j14WgRoU_-2ZE1aw1dXrJg,Daniel,4333,43091
2,2WnXYQFK0hXEoTxPtV2zvg,Steph,665,2086
3,SZDeASXq7o05mMNLshsdIA,Gwen,224,512
4,hA5lMy-EnncsH4JoR-hFGQ,Karen,79,29
...,...,...,...,...
2105592,4QGxxakRZeOlg_qDuxmTeQ,Jennilee,38,74
2105593,tmelBbVBGAzXBVfH2u_R6g,Gerry,19,14
2105594,tpBznnD6uJN3m_pJubj09w,Emily,26,4
2105595,Kst_srPw7GdYydMFYdCtzw,Heatheranne,25,21


Many duplicates have been found. We will look up 3 of the reported IDs to determine whether those rows are really repeated or whether the result is an error in the function.

In [6]:
list_dupl = ['qVc8ODYU5SZjKXVBgXdI7w','j14WgRoU_-2ZE1aw1dXrJg','2WnXYQFK0hXEoTxPtV2zvg']
user_dupl = df_user[df_user['user_id'].isin(list_dupl)]
user_dupl

Unnamed: 0,user_id,name,review_count,useful
0,qVc8ODYU5SZjKXVBgXdI7w,Walker,585,7217
1,j14WgRoU_-2ZE1aw1dXrJg,Daniel,4333,43091
2,2WnXYQFK0hXEoTxPtV2zvg,Steph,665,2086
1987897,qVc8ODYU5SZjKXVBgXdI7w,Walker,585,7217
1987898,j14WgRoU_-2ZE1aw1dXrJg,Daniel,4333,43091
1987899,2WnXYQFK0hXEoTxPtV2zvg,Steph,665,2086
1998997,qVc8ODYU5SZjKXVBgXdI7w,Walker,585,7217
1998998,j14WgRoU_-2ZE1aw1dXrJg,Daniel,4333,43091
1998999,2WnXYQFK0hXEoTxPtV2zvg,Steph,665,2086
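Checking three IDs by hand is a spot check. A way to confirm the same thing across the whole frame is to count how often each duplicated key occurs. The following is a minimal sketch on made-up data; the frame and variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up data: u1 appears three times, u2 twice, u3 once.
df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u1", "u2", "u1"],
    "review_count": [5, 7, 9, 5, 7, 5],
})

# keep=False marks every member of a duplicate group, not only the repeats.
dup_mask = df.duplicated(keep=False)
dup_rows = df[dup_mask]

# Number of occurrences of each user_id that has at least one duplicate row.
occurrences = dup_rows["user_id"].value_counts()
print(occurrences)
```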


Indeed, the dataframe has many duplicate users. We will proceed to eliminate them to improve data quality and ensure coherence.

In [7]:
df_user = df_user.drop_duplicates()
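`drop_duplicates()` compares whole rows, so it can be worth confirming how many rows were removed and whether `user_id` is unique afterwards. A minimal sketch with made-up data follows; resetting the index is an optional extra step that the notebook does not perform.

```python
import pandas as pd

# Made-up data with two fully repeated rows.
df = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u3", "u2"],
    "name": ["Ann", "Bob", "Ann", "Cy", "Bob"],
})

rows_before = len(df)
# reset_index(drop=True) renumbers the surviving rows from 0 (optional).
deduped = df.drop_duplicates().reset_index(drop=True)
rows_removed = rows_before - len(deduped)

# Full-row deduplication does not by itself guarantee unique keys, so check.
ids_are_unique = deduped["user_id"].is_unique
print(rows_removed, ids_are_unique)
```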

## Column ``user_id``

Let's check for null values to ensure data quality.

In [8]:
T.nulls(df_user,'user_id')

The column "user_id" does not have nulls


Unnamed: 0,user_id,name,review_count,useful


Next, we confirm that the column does not contain empty values either.

In [9]:
T.empty_values(df_user,'user_id')

The column "user_id" does not have empty values
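The implementations of `T.nulls` and `T.empty_values` are not shown here. As an assumption about what such checks typically involve, the sketch below counts nulls with `isna()` and treats "empty" as a string that is blank after stripping whitespace. The function names `count_nulls` and `count_empty` are hypothetical.

```python
import pandas as pd


def count_nulls(df: pd.DataFrame, column: str) -> int:
    """Count missing values (None / NaN) in one column."""
    return int(df[column].isna().sum())


def count_empty(df: pd.DataFrame, column: str) -> int:
    """Count non-null values that are empty or whitespace-only strings."""
    as_text = df[column].dropna().astype(str).str.strip()
    return int((as_text == "").sum())


# Made-up column: one null, one empty string, one whitespace-only string.
sample = pd.DataFrame({"user_id": ["a1", "", "  ", None, "b2"]})
null_total = count_nulls(sample, "user_id")
empty_total = count_empty(sample, "user_id")
print(null_total, empty_total)
```

Keeping the two checks separate matters because an empty string is not null as far as pandas is concerned, so `isna()` alone would miss it.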


Duplicates were already checked and removed at the row level, so it is not necessary to repeat that check for each remaining column.

## Column ``name``

Let's check for null values to ensure data quality.

In [10]:
T.nulls(df_user,'name')

The column "name" does not have nulls


Unnamed: 0,user_id,name,review_count,useful


Next, we confirm that the column does not contain empty values either.

In [11]:
T.empty_values(df_user,'name')

The column "name" does not have empty values


## Column ``review_count``

Let's check for null values to ensure data quality.

In [12]:
T.nulls(df_user,'review_count')

The column "review_count" does not have nulls


Unnamed: 0,user_id,name,review_count,useful


Next, we confirm that the column does not contain empty values either.

In [13]:
T.empty_values(df_user,'review_count')

The column "review_count" does not have empty values


Let's see which 5 users have written the most reviews.

In [14]:
group = df_user.groupby(by='user_id')['review_count'].sum().sort_values(ascending=False)
group.head()

user_id
Hi10sGSZNxQH3NLyWSZ1oA    17473
8k3aO-mPeyhbR5HUucA5aA    16978
hWDybu_KvYLSdEFzGrniTw    16567
RtGqdDBvvBCjcu5dUqwfzA    12868
P5bUL3Engv-2z6kKohB6qQ     9941
Name: review_count, dtype: int64

The top five users have each written between about 9,900 and 17,500 reviews on Yelp.
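Once the frame holds one row per user, the same ranking can be obtained without grouping. A minimal sketch using `DataFrame.nlargest`, which also keeps the `name` column visible, is shown below. The data is made up for illustration.

```python
import pandas as pd

# Made-up frame with one row per user.
df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "name": ["Ann", "Bob", "Cy", "Dee"],
    "review_count": [120, 950, 30, 400],
})

# Rows with the largest review_count, in descending order.
top = df.nlargest(2, "review_count")[["user_id", "name", "review_count"]]
print(top)
```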

## Column ``useful``

Let's check for null values to ensure data quality.

In [15]:
T.nulls(df_user,'useful')

The column "useful" does not have nulls


Unnamed: 0,user_id,name,review_count,useful


Next, we confirm that the column does not contain empty values either.

In [16]:
T.empty_values(df_user,'useful')

The column "useful" does not have empty values


For reference, we display the top 5 users by number of useful votes received.

In [17]:
group2 = df_user.groupby(by='user_id')['useful'].sum().sort_values(ascending=False)
group2.head()

user_id
Hi10sGSZNxQH3NLyWSZ1oA    206296
--2vR0DIsmQ6WfcSzKWigw    205765
JjXuiru1_ONzDkYVrHN0aw    183512
lvthTfCQGD0qaEk6jCdRdQ    182788
hWDybu_KvYLSdEFzGrniTw    173089
Name: useful, dtype: int64

The first user on this list (`Hi10sGSZNxQH3NLyWSZ1oA`) is also the user who has written the most reviews on Yelp. A second user, `hWDybu_KvYLSdEFzGrniTw`, appears in both top-5 lists as well.
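Rather than comparing the two rankings by eye, the overlap can be computed directly from the two index objects. A minimal sketch on made-up data follows; the variable names and `top_n` value are assumptions.

```python
import pandas as pd

# Made-up frame indexed by user_id, one row per user.
df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "review_count": [900, 100, 500, 50],
    "useful": [800, 700, 20, 10],
}).set_index("user_id")

top_n = 2
top_by_reviews = df["review_count"].nlargest(top_n).index
top_by_useful = df["useful"].nlargest(top_n).index

# Users that appear in both top-N rankings.
in_both = top_by_reviews.intersection(top_by_useful)
print(list(in_both))
```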

After analyzing all the columns, we add a column called 'id_user_yelp'. It will let us identify these users as coming from the Yelp platform when this data is later connected with Google's data.

In [18]:
df_user['id_user_yelp'] = range(0,len(df_user))
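Because the frame was deduplicated earlier, its index has gaps, while the new column runs from 0 without gaps. The sketch below shows one way to assign such a sequential identifier on an explicit copy, which avoids relying on suppressed warnings about modifying a derived frame. The data and the name `clean` are made up; only `id_user_yelp` comes from the notebook.

```python
import pandas as pd

# Made-up data with one repeated row.
df = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u3"],
    "useful": [1, 2, 1, 3],
})

# Work on an explicit, re-indexed copy of the deduplicated rows.
clean = df.drop_duplicates().reset_index(drop=True).copy()

# Sequential surrogate key: 0, 1, 2, ... one value per remaining row.
clean["id_user_yelp"] = range(len(clean))
print(clean)
```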

We check the consistency once again before exporting the data.

In [19]:
T.analyze_data(df_user)

Unnamed: 0,Name,Unique Data Types,% of Non-null Values,% of Null Values,Number of Null Values
0,user_id,[<class 'str'>],100.0,0.0,0
1,name,[<class 'str'>],100.0,0.0,0
2,review_count,[<class 'int'>],100.0,0.0,0
3,useful,[<class 'int'>],100.0,0.0,0
4,id_user_yelp,[<class 'int'>],100.0,0.0,0


## Exporting the Data 🌐

In [20]:
df_user.to_parquet('df_user.parquet')

The conclusions from this notebook are as follows. The file that was read had 22 columns, of which only 4 were needed. A large number of duplicate rows were found and removed. The `user_id`, `name`, `review_count`, and `useful` columns were then checked for null and empty values, and top-5 rankings by review count and by useful votes were produced. Finally, an additional column, `id_user_yelp`, was added at the end of the cleaning process to serve as our team's identifier when manipulating the data.

This is the last ETL notebook; the initial stage is considered complete. To access the data produced by the ETL processes, click [here](https://drive.google.com/drive/u/0/folders/1ypi9kclaOXgQj0ABxTXgW6N_1_ELCVpc).