# Exploratory Data Analysis - Google reviews

This notebook carries out an exploratory analysis of **restaurant** **reviews** posted on **Google Maps** in the **state** of **California**.

We first analyze the **reviews** written by **users** of the application, and then complement them with information about the **businesses** being reviewed, specifically **restaurants**:

In [1]:
# First, import the libraries used throughout the notebook:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
#import pickle
warnings.filterwarnings('ignore')
sns.set()
import gc

## User reviews

As explained above, this dataset contains the reviews written by users:

In [6]:
# First, load the file:
df_reviews = pd.read_parquet("Datasets/reviews_california.parquet")
df_reviews.head()

Unnamed: 0,user_id,name,time,rating,text,pics,resp,gmap_id
0,1.089912e+20,Song Ro,1609909927056,5,Love there korean rice cake.,,,0x80c2c778e3b73d33:0xbdc58662a4a97d49
1,1.112903e+20,Rafa Robles,1612849648663,5,Good very good,,,0x80c2c778e3b73d33:0xbdc58662a4a97d49
2,1.126404e+20,David Han,1583643882296,4,They make Korean traditional food very properly.,,,0x80c2c778e3b73d33:0xbdc58662a4a97d49
3,1.174403e+20,Anthony Kim,1551938216355,5,Short ribs are very delicious.,,,0x80c2c778e3b73d33:0xbdc58662a4a97d49
4,1.005808e+20,Mario Marzouk,1494910901933,5,Great food and prices the portions are large,,,0x80c2c778e3b73d33:0xbdc58662a4a97d49


In [9]:
# Check the general information of the dataset:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2700000 entries, 0 to 2699999
Data columns (total 8 columns):
 #   Column   Dtype  
---  ------   -----  
 0   user_id  float64
 1   name     object 
 2   time     int64  
 3   rating   int64  
 4   text     object 
 5   pics     object 
 6   resp     object 
 7   gmap_id  object 
dtypes: float64(1), int64(2), object(5)
memory usage: 164.8+ MB


The "time", "pics" and "resp" columns are dropped, since they will not be used in the analysis:

In [10]:
df_reviews = df_reviews.drop(columns=["time", "pics", "resp"])

In [11]:
# Count duplicated rows:
df_reviews.duplicated().sum()

75245

In [12]:
# Drop the duplicated rows:
df_reviews = df_reviews.drop_duplicates()
df_reviews.shape

(2624755, 5)

In [13]:
# Count null values per column:
df_reviews.isnull().sum()

user_id          0
name             0
rating           0
text       1163822
gmap_id          0
dtype: int64

As shown above, the **"text"** column, which holds the **review comments** written by users, has a **large number of missing values**.

We decide to **keep** these rows: they account for about 44% of the dataset (1,163,822 of 2,624,755 rows), so dropping them would discard important information, and the missing comments can likely be filled in from the rating the user entered.
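The share of missing comments can be computed directly instead of being estimated by eye. The following is a minimal sketch on a toy dataframe; `toy_reviews` and its values are illustrative assumptions, not data from the notebook:

```python
import pandas as pd

# Toy stand-in for df_reviews; the values are illustrative only.
toy_reviews = pd.DataFrame({
    "rating": [5, 4, 1, 3],
    "text": ["Great food", None, None, "Average"],
})

# isna() yields booleans, so their mean is the fraction of missing values.
missing_share = toy_reviews["text"].isna().mean()
print(f"{missing_share:.1%} of comments are missing")
```

The same expression applied to `df_reviews["text"]` should give the proportion implied by the counts shown above.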

We therefore check the numeric columns of the dataset:

In [14]:
df_reviews.describe()

Unnamed: 0,user_id,rating
count,2624755.0,2624755.0
mean,1.093545e+20,4.314977
std,5.238717e+18,1.123237
min,1e+20,1.0
25%,1.048919e+20,4.0
50%,1.093178e+20,5.0
75%,1.138792e+20,5.0
max,1.184467e+20,5.0


We can see that the **minimum value** of **"rating"** is **1** and the **maximum value** is **5**, which implies a rating **scale** of **1 to 5 stars**.

With this in mind, we can now **replace** the **null values** in the comments using a **satisfaction scale** based on the **rating entered by the user**, namely:

    - Rating less than or equal to 1: "Very Dissatisfied"
    - Rating less than or equal to 2: "Dissatisfied"
    - Rating less than or equal to 3: "Neutral"
    - Rating less than or equal to 4: "Satisfied"
    - Rating less than or equal to 5: "Very Satisfied"

In [15]:
# First, define the function that assigns the satisfaction scale:
def asignar_escala(rating):
    if rating <= 1:
        return "Very Dissatisfied"
    elif rating <= 2:
        return "Dissatisfied"
    elif rating <= 3:
        return "Neutral"
    elif rating <= 4:
        return "Satisfied"
    else:
        return "Very Satisfied"
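On a dataframe with millions of rows, a row-by-row `apply` can be slow. As a possible alternative, the same thresholds can be expressed with `pd.cut`, which bins the whole column at once. This is a sketch under the assumption that ratings are numeric; `toy_ratings`, `bins` and `labels` are names introduced here for illustration:

```python
import numpy as np
import pandas as pd

labels = ["Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"]
# Right-closed bins reproduce the "<=" thresholds: (-inf, 1], (1, 2], (2, 3], (3, 4], (4, inf].
bins = [-np.inf, 1, 2, 3, 4, np.inf]

toy_ratings = pd.Series([1, 2, 3, 4, 5])

# pd.cut returns a categorical; astype(str) converts it to plain text labels.
escala = pd.cut(toy_ratings, bins=bins, labels=labels).astype(str)
print(escala.tolist())
```

The result matches what `asignar_escala` returns for each rating, without a Python-level loop.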

In [16]:
# Then create a dataframe column by applying the function defined above:
df_reviews["escala_satisfaccion"] = df_reviews["rating"].apply(asignar_escala)
df_reviews.head()

Unnamed: 0,user_id,name,rating,text,gmap_id,escala_satisfaccion
0,1.089912e+20,Song Ro,5,Love there korean rice cake.,0x80c2c778e3b73d33:0xbdc58662a4a97d49,Very Satisfied
1,1.112903e+20,Rafa Robles,5,Good very good,0x80c2c778e3b73d33:0xbdc58662a4a97d49,Very Satisfied
2,1.126404e+20,David Han,4,They make Korean traditional food very properly.,0x80c2c778e3b73d33:0xbdc58662a4a97d49,Satisfied
3,1.174403e+20,Anthony Kim,5,Short ribs are very delicious.,0x80c2c778e3b73d33:0xbdc58662a4a97d49,Very Satisfied
4,1.005808e+20,Mario Marzouk,5,Great food and prices the portions are large,0x80c2c778e3b73d33:0xbdc58662a4a97d49,Very Satisfied


In [17]:
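# Replace null values. The result is assigned back to the column, because calling
# fillna(inplace=True) on a column selection may not update the dataframe in newer pandas versions:
df_reviews["text"] = df_reviews["text"].fillna(df_reviews["escala_satisfaccion"])

# Check that no null values remain in the dataframe: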
df_reviews.isnull().sum()

user_id                0
name                   0
rating                 0
text                   0
gmap_id                0
escala_satisfaccion    0
dtype: int64

In [18]:
# Drop the helper column added earlier:
df_reviews = df_reviews.drop(columns="escala_satisfaccion")
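Since the helper column is created only to be dropped again, the fill can also be done in a single step. The sketch below assumes ratings are the integers 1 to 5 (as the `describe()` output suggests); `ESCALA` and `toy_reviews` are hypothetical names used only for this example:

```python
import pandas as pd

# Hypothetical lookup table mirroring the satisfaction scale described in the text.
ESCALA = {
    1: "Very Dissatisfied",
    2: "Dissatisfied",
    3: "Neutral",
    4: "Satisfied",
    5: "Very Satisfied",
}

toy_reviews = pd.DataFrame({
    "rating": [5, 1, 3],
    "text": ["Great food", None, None],
})

# map() builds the fallback label per row; fillna() uses it only where text is missing.
toy_reviews["text"] = toy_reviews["text"].fillna(toy_reviews["rating"].map(ESCALA))
print(toy_reviews["text"].tolist())
```

Existing comments are left untouched, and no temporary column has to be added or removed.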

With the null values filled in, we look at the top 10 users by number of reviews:

In [19]:
top_user_reviews = df_reviews["user_id"].value_counts()
top_user_reviews.head(10)

user_id
1.033885e+20    307
1.077740e+20    144
1.030183e+20    130
1.119374e+20    130
1.150273e+20    125
1.164464e+20    110
1.055059e+20    105
1.087680e+20    104
1.056125e+20    104
1.021802e+20    102
Name: count, dtype: int64

In [20]:
# Check the bottom 10:
top_user_reviews.tail(10)

user_id
1.143127e+20    1
1.131172e+20    1
1.129661e+20    1
1.147637e+20    1
1.016916e+20    1
1.059341e+20    1
1.137927e+20    1
1.159818e+20    1
1.098696e+20    1
1.122199e+20    1
Name: count, dtype: int64
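One caveat that may be worth checking: `user_id` is stored as `float64`, and identifiers of this magnitude (around 1e20) cannot all be represented exactly in that type. Whether this actually merges distinct users in this dataset has not been verified here; the sketch below only illustrates the effect with two made-up identifiers:

```python
# Two distinct 21-digit identifiers (made up for illustration).
id_a = 108991200000000000001
id_b = 108991200000000000002

# float64 keeps roughly 15 to 17 significant digits, so both round to the same float.
same_as_float = float(id_a) == float(id_b)
print(same_as_float)

# Keeping identifiers as strings preserves every digit.
same_as_str = str(id_a) == str(id_b)
print(same_as_str)
```

If the source files allow it, reading `user_id` as a string would avoid this ambiguity when counting reviews per user.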

## Businesses (metadata)

The **"metadata"** is the **information on** the various **businesses listed** in **Google Maps**, in this case for the state of **California**.

In [5]:
# Load the file:
metadata_california = pd.read_parquet("Datasets/metadata_california.parquet")
metadata_california.head()

Unnamed: 0,index,name,address,gmap_id,description,latitude,longitude,category,avg_rating,num_of_reviews,price,hours,MISC,state,relative_results,url
0,0,San Soo Dang,"San Soo Dang, 761 S Vermont Ave, Los Angeles, ...",0x80c2c778e3b73d33:0xbdc58662a4a97d49,,34.058092,-118.29213,[Korean restaurant],4.4,18,,"[[Thursday, 6:30AM–6PM], [Friday, 6:30AM–6PM],...",{'Accessibility': ['Wheelchair accessible entr...,Open ⋅ Closes 6PM,"[0x80c2c78249aba68f:0x35bf16ce61be751d, 0x80c2...",https://www.google.com/maps/place//data=!4m2!3...
1,1,Nobel Textile Co,"Nobel Textile Co, 719 E 9th St, Los Angeles, C...",0x80c2c632f933b073:0xc31785961fe826a6,,34.036694,-118.249421,[Fabric store],4.3,7,,"[[Thursday, 9AM–5PM], [Friday, 9AM–5PM], [Satu...","{'Accessibility': None, 'Activities': None, 'A...",Open ⋅ Closes 5PM,"[0x80c2c62c496083d1:0xdefa11317fe870a1, 0x80c2...",https://www.google.com/maps/place//data=!4m2!3...
2,2,Matrix International Textiles,"Matrix International Textiles, 1363 S Bonnie B...",0x80c2cf163db6bc89:0x219484e2edbcfa41,,34.015505,-118.181839,[Fabric store],3.5,6,,"[[Thursday, 8:30AM–5:30PM], [Friday, 8:30AM–5:...",{'Accessibility': ['Wheelchair accessible entr...,Open ⋅ Closes 5:30PM,"[0x80c2cf042a5d9561:0xd0024ad6f81f1335, 0x80c2...",https://www.google.com/maps/place//data=!4m2!3...
3,3,Vons Chicken,"Vons Chicken, 12740 La Mirada Blvd, La Mirada,...",0x80dd2b4c8555edb7:0xfc33d65c4bdbef42,,33.916402,-118.010855,[Restaurant],4.5,18,,"[[Thursday, 11AM–9:30PM], [Friday, 11AM–9:30PM...",{'Accessibility': ['Wheelchair accessible entr...,Open ⋅ Closes 9:30PM,,https://www.google.com/maps/place//data=!4m2!3...
4,4,Black Tie Ski Rental Delivery of Mammoth,"Black Tie Ski Rental Delivery of Mammoth, 501 ...",0x80960c29f2e3bf29:0x4b291f0d275a5699,,37.638754,-118.966055,"[Ski rental service, Snowboard rental service]",5.0,34,,"[[Thursday, 8AM–5PM], [Friday, 8AM–5PM], [Satu...",{'Accessibility': ['Wheelchair accessible entr...,Open ⋅ Closes 5PM,"[0x80960dcd6ba76731:0x9a6875ced2f9228e, 0x8096...",https://www.google.com/maps/place//data=!4m2!3...


In [6]:
# Check the general information of the dataset:
metadata_california.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73405 entries, 0 to 73404
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   index             73405 non-null  int64  
 1   name              73403 non-null  object 
 2   address           72978 non-null  object 
 3   gmap_id           73405 non-null  object 
 4   description       11072 non-null  object 
 5   latitude          73405 non-null  float64
 6   longitude         73405 non-null  float64
 7   category          73334 non-null  object 
 8   avg_rating        73405 non-null  float64
 9   num_of_reviews    73405 non-null  int64  
 10  price             10853 non-null  object 
 11  hours             63907 non-null  object 
 12  MISC              66572 non-null  object 
 13  state             64610 non-null  object 
 14  relative_results  69916 non-null  object 
 15  url               73405 non-null  object 
dtypes: float64(3), int64(2), object(11)
memo

In [15]:
# Reset the index:
metadata_california = metadata_california.reset_index()

In [10]:
# Check for null values in gmap_id:
metadata_california["gmap_id"].isnull().sum()

0

In [7]:
# Check for duplicated values in gmap_id:
metadata_california["gmap_id"].duplicated().sum()

0

In [12]:
# Drop duplicated gmap_id values:
metadata_california = metadata_california.drop_duplicates(subset="gmap_id").reset_index()
metadata_california.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73405 entries, 0 to 73404
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   index             73405 non-null  int64  
 1   name              73403 non-null  object 
 2   address           72978 non-null  object 
 3   gmap_id           73405 non-null  object 
 4   description       11072 non-null  object 
 5   latitude          73405 non-null  float64
 6   longitude         73405 non-null  float64
 7   category          73334 non-null  object 
 8   avg_rating        73405 non-null  float64
 9   num_of_reviews    73405 non-null  int64  
 10  price             10853 non-null  object 
 11  hours             63907 non-null  object 
 12  MISC              66572 non-null  object 
 13  state             64610 non-null  object 
 14  relative_results  69916 non-null  object 
 15  url               73405 non-null  object 
dtypes: float64(3), int64(2), object(11)
memo

In [13]:
# Export the metadata of the California businesses in parquet format:
metadata_california.to_parquet('metadata_california.parquet', engine="pyarrow")

In [13]:
metadata_california.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73405 entries, 0 to 73404
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   index             73405 non-null  int64  
 1   name              73403 non-null  object 
 2   address           72978 non-null  object 
 3   gmap_id           73405 non-null  object 
 4   description       11072 non-null  object 
 5   latitude          73405 non-null  float64
 6   longitude         73405 non-null  float64
 7   category          73334 non-null  object 
 8   avg_rating        73405 non-null  float64
 9   num_of_reviews    73405 non-null  int64  
 10  price             10853 non-null  object 
 11  hours             63907 non-null  object 
 12  MISC              66572 non-null  object 
 13  state             64610 non-null  object 
 14  relative_results  69916 non-null  object 
 15  url               73405 non-null  object 
dtypes: float64(3), int64(2), object(11)
memo

Once we have all the data for the California businesses, we filter those that fall under the restaurant category:

In [14]:
# List all the categories covered by the businesses:
categorias = metadata_california["category"].explode().unique()
categorias

array(['Korean restaurant', 'Fabric store', 'Restaurant', ...,
       'Angler fish restaurant', 'Contemporary Louisiana restaurant',
       'Office of Vital Records'], dtype=object)

We count the total number of categories:

In [15]:
len(categorias)

3045

There are 3045 categories, of which we are only interested in those containing the word "restaurant", so we filter for them:

In [23]:
# Explode the dataframe so that each category gets its own row:
apertura_cat = metadata_california.explode("category")
apertura_cat = apertura_cat.reset_index()
apertura_cat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 168602 entries, 0 to 168601
Data columns (total 17 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   level_0           168602 non-null  int64  
 1   index             168602 non-null  int64  
 2   name              168600 non-null  object 
 3   address           167309 non-null  object 
 4   gmap_id           168602 non-null  object 
 5   description       37577 non-null   object 
 6   latitude          168602 non-null  float64
 7   longitude         168602 non-null  float64
 8   category          168531 non-null  object 
 9   avg_rating        168602 non-null  float64
 10  num_of_reviews    168602 non-null  int64  
 11  price             35420 non-null   object 
 12  hours             154022 non-null  object 
 13  MISC              156673 non-null  object 
 14  state             155036 non-null  object 
 15  relative_results  161930 non-null  object 
 16  url               16

In [25]:
# Drop rows with null values in the "category" column:
apertura_cat = apertura_cat.dropna(subset="category")
apertura_cat["category"].isnull().sum()

0

In [26]:
# Define the keyword to search for:
palabra_clave = "restaurant"

# Build the keyword filter (case-insensitive substring match):
filtro = apertura_cat["category"].str.contains(palabra_clave, case=False)

# Create the dataframe that contains restaurant information only:
restaurantes = apertura_cat[filtro]
restaurantes.head()

Unnamed: 0,level_0,index,name,address,gmap_id,description,latitude,longitude,category,avg_rating,num_of_reviews,price,hours,MISC,state,relative_results,url
0,0,0,San Soo Dang,"San Soo Dang, 761 S Vermont Ave, Los Angeles, ...",0x80c2c778e3b73d33:0xbdc58662a4a97d49,,34.058092,-118.29213,Korean restaurant,4.4,18,,"[[Thursday, 6:30AM–6PM], [Friday, 6:30AM–6PM],...",{'Accessibility': ['Wheelchair accessible entr...,Open ⋅ Closes 6PM,"[0x80c2c78249aba68f:0x35bf16ce61be751d, 0x80c2...",https://www.google.com/maps/place//data=!4m2!3...
3,3,3,Vons Chicken,"Vons Chicken, 12740 La Mirada Blvd, La Mirada,...",0x80dd2b4c8555edb7:0xfc33d65c4bdbef42,,33.916402,-118.010855,Restaurant,4.5,18,,"[[Thursday, 11AM–9:30PM], [Friday, 11AM–9:30PM...",{'Accessibility': ['Wheelchair accessible entr...,Open ⋅ Closes 9:30PM,,https://www.google.com/maps/place//data=!4m2!3...
107,52,52,La Potranca,"La Potranca, 12821 Venice Blvd., Los Angeles, ...",0x80c2baf50d29bf63:0x5bd904b842b9fcc,,34.000181,-118.441249,Restaurant,4.2,13,,"[[Thursday, 10AM–2AM], [Friday, 10AM–2AM], [Sa...","{'Accessibility': None, 'Activities': None, 'A...",Closed ⋅ Opens 10AM,"[0x80c2bac345536273:0x8b015c3512788465, 0x80c2...",https://www.google.com/maps/place//data=!4m2!3...
174,81,81,Cowboy Burgers & BBQ,"Cowboy Burgers & BBQ, 13101 Ramona Blvd, Baldw...",0x80c2d765f8c90a3d:0x16afb75943e7ad50,"American grub such as BBQ ribs, hamburgers, sa...",34.079995,-117.988951,Hamburger restaurant,3.7,38,,"[[Thursday, 6AM–9PM], [Friday, 6AM–9PM], [Satu...",{'Accessibility': ['Wheelchair accessible entr...,Closed ⋅ Opens 6AM,"[0x80c2d765f8dd4ebf:0xb6baf31e3e536ffa, 0x80c2...",https://www.google.com/maps/place//data=!4m2!3...
175,81,81,Cowboy Burgers & BBQ,"Cowboy Burgers & BBQ, 13101 Ramona Blvd, Baldw...",0x80c2d765f8c90a3d:0x16afb75943e7ad50,"American grub such as BBQ ribs, hamburgers, sa...",34.079995,-117.988951,American restaurant,3.7,38,,"[[Thursday, 6AM–9PM], [Friday, 6AM–9PM], [Satu...",{'Accessibility': ['Wheelchair accessible entr...,Closed ⋅ Opens 6AM,"[0x80c2d765f8dd4ebf:0xb6baf31e3e536ffa, 0x80c2...",https://www.google.com/maps/place//data=!4m2!3...
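Because the filter runs on the exploded dataframe, a business with several restaurant categories appears once per category (Cowboy Burgers & BBQ shows up twice above). If one row per business is preferred, the category lists can be tested before exploding. This is a sketch on toy data; `toy_meta` and `has_restaurant` are hypothetical names, and it assumes each `category` cell is a list-like of strings or a missing value:

```python
import numpy as np
import pandas as pd

toy_meta = pd.DataFrame({
    "gmap_id": ["a", "b", "c", "d"],
    "category": [
        ["Hamburger restaurant", "American restaurant"],
        ["Fabric store"],
        None,
        ["Restaurant"],
    ],
})

def has_restaurant(cats, keyword="restaurant"):
    """Return True if any category in the list-like contains the keyword."""
    # Missing values (None / NaN) are not list-like and are treated as no match.
    if not isinstance(cats, (list, tuple, np.ndarray)):
        return False
    return any(keyword in str(cat).lower() for cat in cats)

mask = toy_meta["category"].apply(has_restaurant)
toy_restaurants = toy_meta[mask]
print(toy_restaurants["gmap_id"].tolist())
```

Either way, a plain substring match also accepts categories such as "Bar restaurant furniture store", so the resulting category list is worth reviewing by hand.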


In [30]:
# Check the general information of the dataframe:
restaurantes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16285 entries, 0 to 16284
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   index             16285 non-null  int64  
 1   name              16285 non-null  object 
 2   address           16275 non-null  object 
 3   gmap_id           16285 non-null  object 
 4   description       8471 non-null   object 
 5   latitude          16285 non-null  float64
 6   longitude         16285 non-null  float64
 7   category          16285 non-null  object 
 8   avg_rating        16285 non-null  float64
 9   num_of_reviews    16285 non-null  int64  
 10  price             10392 non-null  object 
 11  hours             15374 non-null  object 
 12  MISC              16239 non-null  object 
 13  state             15395 non-null  object 
 14  relative_results  14220 non-null  object 
 15  url               16285 non-null  object 
dtypes: float64(3), int64(2), object(11)
memo

In [29]:
# Drop unnecessary columns and reset the index:
restaurantes = restaurantes.drop(columns=["level_0", "index"])
restaurantes = restaurantes.reset_index()
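The `level_0` and `index` columns dropped here are by-products of earlier `reset_index()` calls, which by default keep the old index as a new column. A small sketch of the difference, using a toy dataframe (`toy` is an illustrative name):

```python
import pandas as pd

toy = pd.DataFrame({"name": ["x", "y", "z"]}, index=[0, 3, 107])

# Default behaviour: the old index is kept as a new "index" column.
kept = toy.reset_index()

# drop=True discards the old index, so no extra column has to be removed later.
dropped = toy.reset_index(drop=True)

print(kept.columns.tolist())
print(dropped.columns.tolist())
```

Passing `drop=True` wherever the old index carries no information would keep the column list stable across cells.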

Next, we look at the number of restaurants per category:

In [31]:
restaurantes["category"].value_counts()

category
Restaurant                           3769
Mexican restaurant                   1259
Fast food restaurant                 1146
Takeout Restaurant                    851
Pizza restaurant                      703
                                     ... 
Chophouse restaurant                    1
Czech restaurant                        1
Venezuelan restaurant                   1
Yucatan restaurant                      1
Contemporary Louisiana restaurant       1
Name: count, Length: 203, dtype: int64

We list all the restaurant types present:

In [34]:
categorias_restaurantes = restaurantes["category"].unique()
categorias_restaurantes.sort()
categorias_restaurantes

array(['Afghani restaurant', 'African restaurant', 'American restaurant',
       'Angler fish restaurant', 'Argentinian restaurant',
       'Armenian restaurant', 'Asian fusion restaurant',
       'Asian restaurant', 'Australian restaurant', 'Austrian restaurant',
       'Authentic Japanese restaurant', 'Bangladeshi restaurant',
       'Bar restaurant furniture store', 'Barbecue restaurant',
       'Basque restaurant', 'Belgian restaurant', 'Biryani restaurant',
       'Brazilian restaurant', 'Breakfast restaurant',
       'British restaurant', 'Brunch restaurant', 'Buffet restaurant',
       'Burmese restaurant', 'Burrito restaurant', 'Cajun restaurant',
       'Californian restaurant', 'Cambodian restaurant',
       'Cantonese restaurant', 'Caribbean restaurant',
       'Central American restaurant', 'Cheesesteak restaurant',
       'Chicken restaurant', 'Chicken wings restaurant',
       'Chilean restaurant', 'Chinese noodle restaurant',
       'Chinese restaurant', 'Chophouse resta

Finally, we export all the California restaurants in parquet format:

In [35]:
# Export the full file in parquet format:
restaurantes.to_parquet('restaurantes_california.parquet', engine="pyarrow")

In [37]:
gc.collect()

1965

## Final dataset: restaurant reviews in California

In [38]:
# Join the reviews dataframe with the California restaurants dataframe:
reviews_completo = pd.merge(df_reviews, restaurantes, on="gmap_id", how="inner")
print(reviews_completo.shape)
reviews_completo.head()

(1957104, 23)


Unnamed: 0,user_id,name_x,time,rating,text,pics,resp,gmap_id,index,name_y,...,longitude,category,avg_rating,num_of_reviews,price,hours,MISC,state,relative_results,url
0,1.089912e+20,Song Ro,1609909927056,5,Love there korean rice cake.,,,0x80c2c778e3b73d33:0xbdc58662a4a97d49,0,San Soo Dang,...,-118.29213,Korean restaurant,4.4,18,,"[[Thursday, 6:30AM–6PM], [Friday, 6:30AM–6PM],...",{'Accessibility': ['Wheelchair accessible entr...,Open ⋅ Closes 6PM,"[0x80c2c78249aba68f:0x35bf16ce61be751d, 0x80c2...",https://www.google.com/maps/place//data=!4m2!3...
1,1.112903e+20,Rafa Robles,1612849648663,5,Good very good,,,0x80c2c778e3b73d33:0xbdc58662a4a97d49,0,San Soo Dang,...,-118.29213,Korean restaurant,4.4,18,,"[[Thursday, 6:30AM–6PM], [Friday, 6:30AM–6PM],...",{'Accessibility': ['Wheelchair accessible entr...,Open ⋅ Closes 6PM,"[0x80c2c78249aba68f:0x35bf16ce61be751d, 0x80c2...",https://www.google.com/maps/place//data=!4m2!3...
2,1.126404e+20,David Han,1583643882296,4,They make Korean traditional food very properly.,,,0x80c2c778e3b73d33:0xbdc58662a4a97d49,0,San Soo Dang,...,-118.29213,Korean restaurant,4.4,18,,"[[Thursday, 6:30AM–6PM], [Friday, 6:30AM–6PM],...",{'Accessibility': ['Wheelchair accessible entr...,Open ⋅ Closes 6PM,"[0x80c2c78249aba68f:0x35bf16ce61be751d, 0x80c2...",https://www.google.com/maps/place//data=!4m2!3...
3,1.174403e+20,Anthony Kim,1551938216355,5,Short ribs are very delicious.,,,0x80c2c778e3b73d33:0xbdc58662a4a97d49,0,San Soo Dang,...,-118.29213,Korean restaurant,4.4,18,,"[[Thursday, 6:30AM–6PM], [Friday, 6:30AM–6PM],...",{'Accessibility': ['Wheelchair accessible entr...,Open ⋅ Closes 6PM,"[0x80c2c78249aba68f:0x35bf16ce61be751d, 0x80c2...",https://www.google.com/maps/place//data=!4m2!3...
4,1.005808e+20,Mario Marzouk,1494910901933,5,Great food and prices the portions are large,,,0x80c2c778e3b73d33:0xbdc58662a4a97d49,0,San Soo Dang,...,-118.29213,Korean restaurant,4.4,18,,"[[Thursday, 6:30AM–6PM], [Friday, 6:30AM–6PM],...",{'Accessibility': ['Wheelchair accessible entr...,Open ⋅ Closes 6PM,"[0x80c2c78249aba68f:0x35bf16ce61be751d, 0x80c2...",https://www.google.com/maps/place//data=!4m2!3...
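When the right-hand table has several rows per `gmap_id` (one per restaurant category), an inner merge repeats each review once per matching row. The sketch below shows the effect on toy data and one possible safeguard, the `validate` argument of `pd.merge`; the toy frames and the choice to keep only the first category per business are assumptions made for illustration:

```python
import pandas as pd

toy_reviews = pd.DataFrame({"gmap_id": ["a", "a", "d"], "rating": [5, 4, 3]})

# Business "a" appears twice because it has two restaurant categories.
toy_restaurants = pd.DataFrame({
    "gmap_id": ["a", "a", "d"],
    "category": ["Hamburger restaurant", "American restaurant", "Restaurant"],
})

# Naive merge: each review of "a" is matched with both category rows.
naive = pd.merge(toy_reviews, toy_restaurants, on="gmap_id", how="inner")

# Keep one row per business; validate raises MergeError if the right side
# still contains duplicate keys.
unique_restaurants = toy_restaurants.drop_duplicates(subset="gmap_id")
checked = pd.merge(
    toy_reviews, unique_restaurants, on="gmap_id", how="inner", validate="many_to_one"
)

print(len(naive), len(checked))
```

Dropping duplicates keeps only the first category of each business, so if every category matters, aggregating the categories back into a list per `gmap_id` before merging may be a better fit.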


In [39]:
reviews_completo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1957104 entries, 0 to 1957103
Data columns (total 23 columns):
 #   Column            Dtype  
---  ------            -----  
 0   user_id           float64
 1   name_x            object 
 2   time              int64  
 3   rating            int64  
 4   text              object 
 5   pics              object 
 6   resp              object 
 7   gmap_id           object 
 8   index             int64  
 9   name_y            object 
 10  address           object 
 11  description       object 
 12  latitude          float64
 13  longitude         float64
 14  category          object 
 15  avg_rating        float64
 16  num_of_reviews    int64  
 17  price             object 
 18  hours             object 
 19  MISC              object 
 20  state             object 
 21  relative_results  object 
 22  url               object 
dtypes: float64(4), int64(4), object(15)
memory usage: 343.4+ MB


In [17]:
# Export the full file in parquet format:
reviews_completo.to_parquet('reviews_california_completo.parquet', engine="pyarrow")