# Preliminary Exploratory Data Analysis (Pre-EDA)

In this Preliminary Exploratory Data Analysis (Pre-EDA) we take a brief look at the available data. It is carried out after a concise extract, transform and load (ETL) process intended to improve data quality.
The Pre-EDA focuses on the quality of the data provided by the client, so that we can decide how the data will be treated during the development of the project.

In [1]:
# Always run this cell first: it imports the helper module used in the rest of the notebook

import builtin.utils as ut


## 1. Memory Usage Calculation

Our main objective is to compare the size of the original files with the size of the same data stored in .parquet format, both reported in megabytes (printed as Mb in the output below).

This shows how much storage we save by reducing the footprint of the database.

In [2]:
# Size of each original data folder, as returned by the helper and rounded to 2 decimals
estados_size = round(ut.calculate_original_memory_usage('/lakehouse/default/Files/original/reviews-estados'), 2)
sitios_size = round(ut.calculate_original_memory_usage('/lakehouse/default/Files/original/metadata-sitios'), 2)
yelp_size = round(ut.calculate_original_memory_usage('/lakehouse/default/Files/original/Yelp'), 2)

# Total size of the original data
total_size = estados_size + sitios_size + yelp_size
print(f'Total size of original data: \nEstados: {estados_size}Mb\n Sitios: {sitios_size}Mb\n Yelp: {yelp_size}Mb\n Total size: {total_size}Mb\n')

# Size of the same data after conversion to .parquet
parquet_size = ut.calculate_parquet_memory_usage('/lakehouse/default/Files/df_database')
print(f'Total size of compressed data: {round(parquet_size,2)}Mb\n')

# Absolute savings: original size minus .parquet size
diff_size = total_size - parquet_size
print(f'Savings in resource and size usage: {round(diff_size,2)}Mb\n')

# Relative savings, as a percentage of the original size
size_perc = (diff_size/total_size)*100
print(f'Percentage savings in resources: {round(size_perc,2)}%\n')



Total size of original data: 
Estados: 24917.16Mb
 Sitios: 2832.32Mb
 Yelp: 8465.34Mb
 Total size: 36214.82Mb

Total size of compressed data: 10897.09Mb

Savings in resource and size usage: 25317.73Mb

Percentage savings in resources: 69.91%
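
The helpers `calculate_original_memory_usage` and `calculate_parquet_memory_usage` live in our utils module and are not shown in this notebook. As a rough sketch only, assuming the measurement amounts to adding up file sizes on disk, the same comparison could be written as follows. The names `folder_size_mb` and `savings_summary` are hypothetical and are not part of the project code.

```python
import os
from typing import Tuple


def folder_size_mb(folder_path: str) -> float:
    """Total size of all files under folder_path, in megabytes (1 MB = 1024**2 bytes)."""
    total_bytes = 0
    for root, _dirs, files in os.walk(folder_path):
        for file_name in files:
            total_bytes += os.path.getsize(os.path.join(root, file_name))
    return total_bytes / (1024 ** 2)


def savings_summary(original_mb: float, compressed_mb: float) -> Tuple[float, float]:
    """Return (absolute savings in MB, savings as a percentage of the original size)."""
    diff = original_mb - compressed_mb
    return diff, diff / original_mb * 100


# Totals taken from the output above
diff, pct = savings_summary(36214.82, 10897.09)
print(f"Savings: {round(diff, 2)} MB ({round(pct, 2)}%)")
```

Whether the real helpers count 1 MB as 1024**2 or 10**6 bytes is not stated here, so the conversion factor in this sketch is an assumption.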



## 2. Data Quality

To assess the quality of the data and decide how to approach our work, we inspect each file and review the data contained in each column of the tables under study.


### 2.1 Data quality of files in the 'Metadata_sitios_parquet' folder

Let's apply the "data_summ" function that we developed in "utils.py" (called below through "ut.data_summ_on_parquet") to summarize the data in this folder.

In [3]:
folder_path = '/lakehouse/default/Files/df_database/Metadata_sitios_parquet/'

# Run data_summ_on_parquet on the folder to get the summaries
summaries = ut.data_summ_on_parquet(folder_path)



metadata-sitios Summary

Total rows:  275001

Total full null rows:  0
          Column                                     Data_type  No_miss_Qty  %Missing  Missing_Qty
            name           [<class 'str'>, <class 'NoneType'>]       274998      0.00            3
         address           [<class 'str'>, <class 'NoneType'>]       269428      2.03         5573
         gmap_id                               [<class 'str'>]       275001      0.00            0
     description           [<class 'NoneType'>, <class 'str'>]        27900     89.85       247101
        latitude                             [<class 'float'>]       275001      0.00            0
       longitude                             [<class 'float'>]       275001      0.00            0
        category [<class 'numpy.ndarray'>, <class 'NoneType'>]       273662      0.49         1339
      avg_rating                             [<class 'float'>]       275001      0.00            0
  num_of_reviews                      
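
The summary helper itself is defined in "utils.py" and is not reproduced in this notebook. To make the columns of the table easier to read, here is a minimal, dependency-free sketch of how such a per-column summary could be computed. It assumes each column is available as a plain Python list; `is_missing` and `summarize_columns` are hypothetical names, and the actual helper may work differently.

```python
import math


def is_missing(value) -> bool:
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def summarize_columns(columns: dict) -> list:
    """Build one summary row per column from a {column_name: list_of_values} mapping."""
    summary = []
    for name, values in columns.items():
        total = len(values)
        missing = sum(1 for v in values if is_missing(v))
        # Distinct Python types found in the column, in order of first appearance
        types = list(dict.fromkeys(type(v).__name__ for v in values))
        summary.append({
            "Column": name,
            "Data_type": types,
            "No_miss_Qty": total - missing,
            "%Missing": round(missing / total * 100, 2) if total else 0.0,
            "Missing_Qty": missing,
        })
    return summary


# Tiny illustrative input (made-up values, not taken from the client data)
example = {"name": ["a", None, "c", "d"], "rating": [4.5, 3.0, 5.0, 1.0]}
for row in summarize_columns(example):
    print(row)
```

A column that lists both a value type and `NoneType` under Data_type, as `name` does in this example, is one that contains missing entries.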

## Conclusions

The summary above shows the following.

The database has enough records to support the analysis. In each folder we have:
Metadata-sitios:

275001 records

0 fully null rows

7 columns containing null data, of which 5 have more than 20% missing values

Later, during the final EDA, we will inspect the content of these columns to assess the quality of the information provided. If a column is of little relevance, or cannot be used given its share of null data, we will drop it.
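
The rule described above (discard columns whose share of null data is too high to work with) can be sketched as a simple threshold filter. This is only an illustration: `columns_to_drop` is a hypothetical name, the 20% cut-off is an assumed value rather than a rule fixed by the project, and the input covers only the rows of the metadata-sitios summary that are visible in the output.

```python
def columns_to_drop(pct_missing: dict, threshold: float) -> list:
    """Return the columns whose percentage of missing values exceeds the threshold."""
    return [col for col, pct in pct_missing.items() if pct > threshold]


# %Missing values copied from the visible rows of the metadata-sitios summary
metadata_pct_missing = {
    "name": 0.00, "address": 2.03, "gmap_id": 0.00, "description": 89.85,
    "latitude": 0.00, "longitude": 0.00, "category": 0.49, "avg_rating": 0.00,
}

# 20.0 is an illustrative cut-off, not a decision taken in this Pre-EDA
print(columns_to_drop(metadata_pct_missing, threshold=20.0))  # ['description']
```

In practice the threshold would be combined with a check of each column's relevance, since a sparse column can still carry useful information.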

### 2.2 Data quality of files in the 'Review_estados_parquet' folder


Let's apply the "data_summ" function that we developed in "utils.py" (called below through "ut.data_summ_on_parquet") to summarize the data in this folder.

In [4]:
folder_path = '/lakehouse/default/Files/df_database/Review_estados_parquet/'

# Run data_summ_on_parquet on the folder to get the summaries
summaries = ut.data_summ_on_parquet(folder_path)


Alabama Summary

Total rows:  1800000

Total full null rows:  0
 Column                                     Data_type  No_miss_Qty  %Missing  Missing_Qty
user_id                             [<class 'float'>]      1800000      0.00            0
   name                               [<class 'str'>]      1800000      0.00            0
   time                               [<class 'int'>]      1800000      0.00            0
 rating                               [<class 'int'>]      1800000      0.00            0
   text           [<class 'str'>, <class 'NoneType'>]       966601     46.30       833399
   pics [<class 'NoneType'>, <class 'numpy.ndarray'>]        36170     97.99      1763830
   resp          [<class 'NoneType'>, <class 'dict'>]       207283     88.48      1592717
gmap_id                               [<class 'str'>]      1800000      0.00            0
   date                               [<class 'str'>]      1800000      0.00            0
Alaska Summary

Total rows:  521515


## Conclusions
The summaries above show the following.

Reviews-estados:

51 files, each with between 300 thousand and 2.8 million records

0 fully null rows

3 columns with more than 35% null data, the same three in every file (text, pics and resp in the Alabama summary)

Later, during the final EDA, we will inspect the content of these columns to assess the quality of the information provided. If a column is of little relevance, or cannot be used given its share of null data, we will drop it.

### 2.3 Data quality of files in the 'Yelp_parquet' folder


Let's apply the "data_summ" function that we developed in "utils.py" (called below through "ut.data_summ_on_parquet") to summarize the data in this folder.

In [5]:
folder_path = '/lakehouse/default/Files/df_database/Yelp_parquet/'

# Run data_summ_on_parquet on the folder to get the summaries
summaries = ut.data_summ_on_parquet(folder_path)


business.pkl Summary

Total rows:  150346

Total full null rows:  0
      Column                            Data_type  No_miss_Qty  %Missing  Missing_Qty
 business_id                      [<class 'str'>]       150346      0.00            0
        name                      [<class 'str'>]       150346      0.00            0
     address                      [<class 'str'>]       150346      0.00            0
        city                      [<class 'str'>]       150346      0.00            0
       state  [<class 'NoneType'>, <class 'str'>]       150343      0.00            3
 postal_code                      [<class 'str'>]       150346      0.00            0
    latitude                    [<class 'float'>]       150346      0.00            0
   longitude                    [<class 'float'>]       150346      0.00            0
       stars                    [<class 'float'>]       150346      0.00            0
review_count                      [<class 'int'>]       150346      0.00

## Conclusions

The summaries above show the following.

Yelp:
4 files with 9.1 million records in total

0 fully null rows

1 file with 3 columns containing a low percentage of null data

Later, during the final EDA, we will inspect the content of these columns to assess the quality of the information provided. If a column is of little relevance, or cannot be used given its share of null data, we will drop it.