## Import the required libraries

These libraries help us manipulate the data so that it is consistent and of good quality. We also import our own module, 'Herramientas', which supports the whole process.

In [10]:
import pandas as pd
from geopy.geocoders import Nominatim
import Herramientas as Her
import warnings
warnings.filterwarnings("ignore")

## Data loading

In [11]:
df_business = pd.read_csv('business.csv')
df_business

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,...,state.1,postal_code.1,latitude.1,longitude.1,stars.1,review_count.1,is_open.1,attributes.1,categories.1,hours.1
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,,93101,34.426679,-119.711197,5.0,7,...,,,,,,,,,,
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,,63123,38.551126,-90.335695,3.0,15,...,,,,,,,,,,
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,,85711,32.223236,-110.880452,3.5,22,...,,,,,,,,,,
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,CA,19107,39.955505,-75.155564,4.0,80,...,,,,,,,,,,
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,MO,18054,40.338183,-75.471659,4.5,13,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
150341,IUQopTMmYQG-qRtBk-8QnA,Binh's Nails,3388 Gateway Blvd,Edmonton,IN,T6J 5H2,53.468419,-113.492054,3.0,13,...,,,,,,,,,,
150342,c8GjPIOTGVmIemT7j5_SyQ,Wild Birds Unlimited,2813 Bransford Ave,Nashville,DE,37204,36.115118,-86.766925,4.0,5,...,,,,,,,,,,
150343,_QAMST-NrQobXduilWEqSw,Claire's Boutique,"6020 E 82nd St, Ste 46",Indianapolis,AB,46250,39.908707,-86.065088,3.5,8,...,,,,,,,,,,
150344,mtGm22y5c2UHNXDFAjaPNw,Cyclery & Fitness Center,2472 Troy Rd,Edwardsville,AB,62025,38.782351,-89.950558,4.0,24,...,,,,,,,,,,


We review the data with a function from our custom module, which reports detailed information about the DataFrame: the data types present in each column, and the count and percentage of null values per column.

In [12]:
Her.analizar_datos(df_business)

Unnamed: 0,Nombre,Tipos de Datos Únicos,% de Valores No Nulos,% de Valores Nulos,Cantidad de Valores Nulos
0,business_id,[<class 'str'>],100.0,0.0,0
1,name,[<class 'str'>],100.0,0.0,0
2,address,"[<class 'str'>, <class 'float'>]",96.59,3.41,5127
3,city,[<class 'str'>],100.0,0.0,0
4,state,"[<class 'float'>, <class 'str'>]",100.0,0.0,3
5,postal_code,"[<class 'str'>, <class 'float'>]",99.95,0.05,73
6,latitude,[<class 'float'>],100.0,0.0,0
7,longitude,[<class 'float'>],100.0,0.0,0
8,stars,[<class 'float'>],100.0,0.0,0
9,review_count,[<class 'int'>],100.0,0.0,0
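
The source of 'Her.analizar_datos' is not shown in this notebook, so the following is only a minimal sketch of how a comparable summary could be built with pandas. The function name `summarize_columns` and its column labels are assumptions made for illustration, not the module's actual API.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for Her.analizar_datos: one summary row per column."""
    rows = []
    for col in df.columns:
        null_pct = round(df[col].isna().mean() * 100, 2)
        rows.append({
            "column": col,
            # Python types found in the column (a missing value shows up as float NaN)
            "types": sorted({type(v).__name__ for v in df[col]}),
            "non_null_pct": round(100 - null_pct, 2),
            "null_pct": null_pct,
            "null_count": int(df[col].isna().sum()),
        })
    return pd.DataFrame(rows)

# Toy data: 'address' mixes strings with missing values, like the real column.
sample = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "address": ["1 Main St", float("nan"), "2 Oak Ave", float("nan")],
})
summary = summarize_columns(sample)
print(summary)
```

A column that reports both `str` and `float` in such a summary is usually a text column with missing values, which is the pattern seen for 'address' and 'postal_code' above.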


## Transformations

We drop the columns that are not needed, starting with the duplicated columns (suffix '.1'), whose values are all null. While reading the file we tried to filter the DataFrame in several ways, but every attempt produced a different error, so we copied the column names and dropped them explicitly. We also dropped three columns ('is_open', 'hours' and 'attributes') that we considered not relevant to our data analysis and the recommendation system.

In [13]:
df_business = df_business.drop(columns=['business_id.1', 'name.1','address.1', 'city.1', 'state.1', 'postal_code.1', 'latitude.1','longitude.1', 'stars.1', 'review_count.1', 'is_open.1', 'attributes.1','categories.1', 'hours.1','is_open','hours','attributes'])
df_business

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,categories
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,,93101,34.426679,-119.711197,5.0,7,"Doctors, Traditional Chinese Medicine, Naturop..."
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,,63123,38.551126,-90.335695,3.0,15,"Shipping Centers, Local Services, Notaries, Ma..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,,85711,32.223236,-110.880452,3.5,22,"Department Stores, Shopping, Fashion, Home & G..."
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,CA,19107,39.955505,-75.155564,4.0,80,"Restaurants, Food, Bubble Tea, Coffee & Tea, B..."
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,MO,18054,40.338183,-75.471659,4.5,13,"Brewpubs, Breweries, Food"
...,...,...,...,...,...,...,...,...,...,...,...
150341,IUQopTMmYQG-qRtBk-8QnA,Binh's Nails,3388 Gateway Blvd,Edmonton,IN,T6J 5H2,53.468419,-113.492054,3.0,13,"Nail Salons, Beauty & Spas"
150342,c8GjPIOTGVmIemT7j5_SyQ,Wild Birds Unlimited,2813 Bransford Ave,Nashville,DE,37204,36.115118,-86.766925,4.0,5,"Pets, Nurseries & Gardening, Pet Stores, Hobby..."
150343,_QAMST-NrQobXduilWEqSw,Claire's Boutique,"6020 E 82nd St, Ste 46",Indianapolis,AB,46250,39.908707,-86.065088,3.5,8,"Shopping, Jewelry, Piercing, Toy Stores, Beaut..."
150344,mtGm22y5c2UHNXDFAjaPNw,Cyclery & Fitness Center,2472 Troy Rd,Edwardsville,AB,62025,38.782351,-89.950558,4.0,24,"Fitness/Exercise Equipment, Eyewear & Optician..."
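
As an alternative to typing out the duplicated column names, they could be selected programmatically. The sketch below runs on a toy frame, not on 'business.csv'; it assumes the duplicates can be recognised by the '.1' suffix that pandas appends to repeated header names and by being entirely null.

```python
import pandas as pd

# Toy frame imitating the duplicated '.1' columns seen in the loaded data.
toy = pd.DataFrame({
    "business_id": ["a", "b"],
    "state": ["PA", "FL"],
    "business_id.1": [None, None],
    "state.1": [None, None],
})

# Select columns that carry the '.1' suffix AND contain only nulls.
dup_cols = [c for c in toy.columns if c.endswith(".1") and toy[c].isna().all()]
cleaned = toy.drop(columns=dup_cols)
print(list(cleaned.columns))
```

Checking for all-null content before dropping guards against removing a column that merely happens to end in '.1' but holds real data.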


## Column ``State``

The data is filtered to the states selected through web scraping of Wikipedia, based on the states with the largest populations. The index is then reset to make the records easier to handle.

In [14]:
df_business = df_business[df_business['state'].isin(['CA', 'TX', 'FL', 'PA', 'NY'])]
df_business = df_business.reset_index(drop=True)
df_business

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,categories
0,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,CA,19107,39.955505,-75.155564,4.0,80,"Restaurants, Food, Bubble Tea, Coffee & Tea, B..."
1,n_0UpQx1hsNbnPUSlodU8w,Famous Footwear,"8522 Eager Road, Dierbergs Brentwood Point",Brentwood,PA,63144,38.627695,-90.340465,2.5,13,"Sporting Goods, Fashion, Shoe Stores, Shopping..."
2,qkRM_2X51Yqxk3btlwAQIg,Temple Beth-El,400 Pasadena Ave S,St. Petersburg,PA,33707,27.766590,-82.732983,3.5,5,"Synagogues, Religious Organizations"
3,UJsufbvfyfONHeWdvAHKjA,Marshalls,21705 Village Lakes Sc Dr,Land O' Lakes,FL,34639,28.190459,-82.457380,3.5,6,"Department Stores, Shopping, Fashion"
4,jaxMSoInw8Poo3XeMJt8lQ,Adams Dental,15 N Missouri Ave,Clearwater,FL,33755,27.966235,-82.787412,5.0,10,"General Dentistry, Dentists, Health & Medical,..."
...,...,...,...,...,...,...,...,...,...,...,...
65570,1jx1sfgjgVg0nM6n3p0xWA,Savaya Coffee Market,11177 N Oracle Rd,Oro Valley,PA,85737,32.409552,-110.943073,4.5,41,"Specialty Food, Food, Coffee & Tea, Coffee Roa..."
65571,9U1Igcpe954LoWZRmNc-zg,Hand & Stone Massage And Facial Spa,"1100 S Columbus Blvd, Ste 24",Philadelphia,PA,19147,39.932756,-75.144504,3.0,32,"Day Spas, Beauty & Spas, Skin Care, Massage"
65572,t_SGoRT5yt14OWr64TOulA,Sherwood Park Kwik Lube,979 Fir St,Sherwood Park,PA,T8A 4N5,53.513215,-113.328680,5.0,5,"Oil Change Stations, Automotive, Auto Repair"
65573,x_2IrYgFiQn7GOTTgWRbAw,The Vac & Sew Center,"200 Haddonfield Berlin Rd, Ste 5",Voorhees,PA,08043,39.857700,-74.987230,4.0,5,"Appliances & Repair, Home & Garden, Appliances..."


We verify that both the filter and the column removal were applied correctly.

In [15]:
Her.analizar_datos(df_business)

Unnamed: 0,Nombre,Tipos de Datos Únicos,% de Valores No Nulos,% de Valores Nulos,Cantidad de Valores Nulos
0,business_id,[<class 'str'>],100.0,0.0,0
1,name,[<class 'str'>],100.0,0.0,0
2,address,"[<class 'str'>, <class 'float'>]",96.59,3.41,2238
3,city,[<class 'str'>],100.0,0.0,0
4,state,[<class 'str'>],100.0,0.0,0
5,postal_code,"[<class 'str'>, <class 'float'>]",99.97,0.03,22
6,latitude,[<class 'float'>],100.0,0.0,0
7,longitude,[<class 'float'>],100.0,0.0,0
8,stars,[<class 'float'>],100.0,0.0,0
9,review_count,[<class 'int'>],100.0,0.0,0


## Column ``Categories``

We begin with the 'categories' column, reviewing its null values before filtering by category.

In [16]:
df_business_categories_null = df_business[df_business['categories'].isna()]
df_business_categories_null.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,categories
841,SMYXOLPyM95JvZ-oqnsWUA,A A Berlin Glass & Mirror Co,60 W White Horse Pike,Berlin,PA,8009,39.800416,-74.937181,3.0,5,
1418,xT3J-SP5g49g2FjQfLEQfg,Luxury Perfume,5135 Meadowood Mall Cir,Reno,PA,89502,39.475623,-119.78335,2.0,5,
1984,mKxCNYEoKt6d_1rXmvRwww,Green Envy,3520 N Highway 94,Saint Charles,FL,63301,38.826533,-90.472224,1.5,5,
2338,9QoKKDZB_YuDeS5TxRW8bg,Our 365 Portraits,9109 Watson Rd,Saint Louis,PA,63126,38.561429,-90.371805,1.0,10,
5483,ZERQMWb1PFzCfbfknqq-fA,Pilot Air Freight,314 N Middletown Rd,Media,PA,19063,39.917976,-75.441892,1.5,8,


Next, we examine the values flagged as float to determine whether the column needs changes to bring everything under a single data type.

In [17]:
df_business_cat_float = df_business[df_business['categories'].apply(lambda x:isinstance(x,float))]
df_business_cat_float.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,categories
841,SMYXOLPyM95JvZ-oqnsWUA,A A Berlin Glass & Mirror Co,60 W White Horse Pike,Berlin,PA,8009,39.800416,-74.937181,3.0,5,
1418,xT3J-SP5g49g2FjQfLEQfg,Luxury Perfume,5135 Meadowood Mall Cir,Reno,PA,89502,39.475623,-119.78335,2.0,5,
1984,mKxCNYEoKt6d_1rXmvRwww,Green Envy,3520 N Highway 94,Saint Charles,FL,63301,38.826533,-90.472224,1.5,5,
2338,9QoKKDZB_YuDeS5TxRW8bg,Our 365 Portraits,9109 Watson Rd,Saint Louis,PA,63126,38.561429,-90.371805,1.0,10,
5483,ZERQMWb1PFzCfbfknqq-fA,Pilot Air Freight,314 N Middletown Rd,Media,PA,19063,39.917976,-75.441892,1.5,8,


We verified that the rows flagged as float are the same rows that we filtered as null (a missing value is stored as the float NaN), so they are dropped. In addition, the 'categories' column is filtered by keywords to identify the businesses related to food. Note that the final assignment does not remove rows: a business whose categories do not pass the filter stays in the DataFrame with its 'categories' value set to null.

In [18]:
df_business.dropna(subset=['categories'],inplace=True)

categories_exclude = ['shopping', 'beauty', 'salon','Sports Bars','Pets', 'Pet Adoption', 'Nightlife','Gastropubs','Automotive','Custom Cakes', 'Desserts', 'Cupcakes', 'Ice Cream & Frozen Yogurt', 'Organic Stores', 'Specialty Food', 'Health Markets', 'Grocery','Cupcakes', 'Street Vendors', 'Food Delivery Services', 'Food Trucks','Acai Bowls',
'Home Services', 'Painters', 'Contractors', 'Pressure Washers', 'Shopping', 'Fences & Gates', 'Flooring', 'Home & Garden', 'Door Sales/Installation', 'Kitchen & Bath', 'Home Inspectors','Health & Medical', 'Pharmacy', 'Convenience Stores', 'Drugstores','Flowers & Gifts', 'Chocolatiers & Shops', 'Florists', 'Gift Shops', 'American (New)', 'Music Venues', 'Breakfast & Brunch', 'Arts & Entertainment', 'Bars', 'American (Traditional)', 'Dive Bars', 'Pool Halls','Farmers Market','Building Supplies', 'Masonry/Concrete', 'Countertop Installation','Active Life', 'Advertising', 'Afghan', 'African', 'Airport Terminals', 'Airports', 'American (New)', 'American (Traditional)', 'Amusement Parks', 'Water Delivery', 'Water Stores', 'Web Design', 'Wedding Planning', 'Wholesalers', 'Wine & Spirits', 'Wine Tasting Classes', 'Wine Tours', 'Wraps', 'Yelp Events', 'Walking Tours'
]

categories_include = ['restaurant', 'cafe', 'food', 'dining', 'eatery', 'bistro', 'bakery', 'grill', 'kitchen', 'pizzeria', 'steakhouse', 'sushi', 'tavern', 'diner']

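# str.contains treats each joined keyword list as one case-insensitive regular expression.
mask = df_business['categories'].str.contains('|'.join(categories_include), case=False)
mask2 = df_business['categories'].str.contains('|'.join(categories_exclude), case=False)

# Keep the category text only where a food keyword matches and no excluded keyword does.
# All other rows remain in the DataFrame with 'categories' set to NaN.
df_business['categories'] = df_business['categories'][mask & ~mask2]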
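
Because `str.contains` interprets its pattern as a regular expression by default, keywords with special characters such as 'American (New)' are not matched literally. The following standard-library sketch illustrates the effect on a single string; escaping with `re.escape` is shown as one possible remedy and is not what the cell above does.

```python
import re

categories = "Restaurants, American (New), Pizza"

# Unescaped: the parentheses form a regex group, so the pattern looks for
# the text "American New" and misses the literal category name.
raw_pattern = "American (New)"
raw_hit = re.search(raw_pattern, categories, flags=re.IGNORECASE) is not None

# Escaped: re.escape makes every special character literal before joining.
keywords = ["American (New)", "Wine & Spirits"]
safe_pattern = "|".join(re.escape(k) for k in keywords)
safe_hit = re.search(safe_pattern, categories, flags=re.IGNORECASE) is not None

print(raw_hit, safe_hit)
```

Applying such escaping to the keyword lists would change which rows are excluded, so the results shown in this notebook would need to be regenerated.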

We check that the filter was applied correctly and that the excluded categories no longer appear.

In [19]:
df_business

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,categories
0,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,CA,19107,39.955505,-75.155564,4.0,80,"Restaurants, Food, Bubble Tea, Coffee & Tea, B..."
1,n_0UpQx1hsNbnPUSlodU8w,Famous Footwear,"8522 Eager Road, Dierbergs Brentwood Point",Brentwood,PA,63144,38.627695,-90.340465,2.5,13,
2,qkRM_2X51Yqxk3btlwAQIg,Temple Beth-El,400 Pasadena Ave S,St. Petersburg,PA,33707,27.766590,-82.732983,3.5,5,
3,UJsufbvfyfONHeWdvAHKjA,Marshalls,21705 Village Lakes Sc Dr,Land O' Lakes,FL,34639,28.190459,-82.457380,3.5,6,
4,jaxMSoInw8Poo3XeMJt8lQ,Adams Dental,15 N Missouri Ave,Clearwater,FL,33755,27.966235,-82.787412,5.0,10,
...,...,...,...,...,...,...,...,...,...,...,...
65570,1jx1sfgjgVg0nM6n3p0xWA,Savaya Coffee Market,11177 N Oracle Rd,Oro Valley,PA,85737,32.409552,-110.943073,4.5,41,
65571,9U1Igcpe954LoWZRmNc-zg,Hand & Stone Massage And Facial Spa,"1100 S Columbus Blvd, Ste 24",Philadelphia,PA,19147,39.932756,-75.144504,3.0,32,
65572,t_SGoRT5yt14OWr64TOulA,Sherwood Park Kwik Lube,979 Fir St,Sherwood Park,PA,T8A 4N5,53.513215,-113.328680,5.0,5,
65573,x_2IrYgFiQn7GOTTgWRbAw,The Vac & Sew Center,"200 Haddonfield Berlin Rd, Ste 5",Voorhees,PA,08043,39.857700,-74.987230,4.0,5,


Now that this column has been processed, it will later be converted into a DataFrame of dummy variables. Next we continue with the remaining columns that show two data types and/or null values, reviewing each one to confirm that no empty values remain.

## Column ``Address``

We retrieve the rows with null values in this column using a function called 'nulls' from our 'Herramientas' module.

In [20]:
Her.nulls(df_business,'address')

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,categories
17,fSCNwMtNNQY9QT69Cj9fiA,Sierra Pro Events,,Sparks,PA,89431,39.540154,-119.748395,5.0,7,
27,n7AQvGvNHlmun3kqXeBKVQ,Roy's Appliance Service,,Meridian,FL,83646,43.643494,-116.436000,5.0,5,
29,7PDi_iyik3jraDAzWwwR4Q,Chase JP Morgan Bank Credit Card Services,,Wilmington,FL,19850,39.749361,-75.643331,1.5,111,
48,bYjnX_J1bHZob10DoSFkqQ,Tinkle Belle Diaper Service,,Santa Barbara,PA,93101,34.420334,-119.710749,5.0,17,
101,MyE_zdul_JO-dOHOug4GQQ,Watson Adventures Scavenger Hunts,,Philadelphia,FL,19019,40.119713,-75.009710,3.0,8,
...,...,...,...,...,...,...,...,...,...,...,...
65466,TXar07-aamk0g5AuSZfxkw,Everest Tree Service,,St. Petersburg,FL,33713,27.789678,-82.680746,4.5,13,
65483,b-49gLY_9W3rRgW-8juz4A,National Property Inspections,,Newtown,PA,18940,40.228337,-74.932260,1.0,5,
65506,CWaEYOc5g4AB-h88lnKntQ,Driving Miss Daisy Shuttle,,Nashville,PA,37214,36.140540,-86.672641,3.0,6,
65525,IRBhPAC4ZoDpXazpoB3epQ,Good Stuff Baked Treats,,Santa Barbara,PA,93101,34.420334,-119.710749,5.0,9,


To find addresses for the businesses that lack them, we use the Geopy library. Through reverse geocoding, we obtain the address from the latitude and longitude.

In [21]:
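# Nominatim needs a user_agent string that identifies the application.
geolocalizated = Nominatim(user_agent='My_Address')

# Reverse-geocode only the rows whose address is missing.
for index, row in df_business.iterrows():
    if pd.isnull(row['address']):
        try:
            location = geolocalizated.reverse((row['latitude'], row['longitude']), language='en')
            # reverse() returns None when nothing is found at the coordinates.
            if location is not None:
                df_business.at[index, 'address'] = location.address
        except Exception as e:
            print(f"Could not get the address for row {index}: {e}")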
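
The fill-in step can also be written so that the lookup is passed in as a function, which makes it possible to try the logic offline. This is a sketch under assumptions: `fill_missing_addresses` and `fake_reverse` are hypothetical names, and the fake lookup stands in for the real geocoder.

```python
import pandas as pd

def fill_missing_addresses(df, reverse):
    """Fill null 'address' values using reverse(lat, lon) -> address string or None.

    `reverse` is a hypothetical callable supplied by the caller.
    """
    out = df.copy()
    missing = out["address"].isna()
    for idx in out.index[missing]:
        found = reverse(out.at[idx, "latitude"], out.at[idx, "longitude"])
        if found is not None:  # leave the null in place when nothing is found
            out.at[idx, "address"] = found
    return out

# Offline stand-in for the geocoder so the sketch runs without network access.
def fake_reverse(lat, lon):
    return "State St, Santa Barbara" if lat > 34 else None

toy = pd.DataFrame({
    "address": ["935 Race St", None, None],
    "latitude": [39.95, 34.42, 27.78],
    "longitude": [-75.15, -119.71, -82.68],
})
filled = fill_missing_addresses(toy, fake_reverse)
print(filled)
```

With geopy, the callable could wrap `Nominatim.reverse` in `geopy.extra.rate_limiter.RateLimiter` (for example with `min_delay_seconds=1`) so that requests to the public Nominatim service are spaced out.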

## Column ``Postal_code``

Since some records had no postal code, we followed a process similar to the one used for addresses and imputed them with Geopy through reverse geocoding. This means obtaining the address and other required values, in this case the postal code, from the given latitude and longitude.

In [22]:
geolocalizated = Nominatim(user_agent='My_Address')

# Only the rows without a postal code need a lookup.
filas_sin_postal_code = df_business[pd.isnull(df_business['postal_code'])]

for index, row in filas_sin_postal_code.iterrows():
    try:
        location = geolocalizated.reverse((row['latitude'], row['longitude']), language='en')
        # reverse() returns None when nothing is found; skip those rows.
        if location is None:
            continue
        postal_code = location.raw.get('address', {}).get('postcode')
        if postal_code:
            # Update the 'postal_code' column in the DataFrame
            df_business.at[index, 'postal_code'] = postal_code
    except Exception as e:
        print(f"Could not get the postal code for row {index}: {e}")
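
Some businesses share exactly the same coordinates (two of the rows listed in the address section both sit at 34.420334, -119.710749), so repeated lookups could be avoided with a cache. The sketch below uses only the standard library; `slow_lookup` is a hypothetical stand-in for the network call and returns a fixed response.

```python
from functools import lru_cache

calls = []  # records every real lookup, to show how many were made

def slow_lookup(lat, lon):
    """Stand-in for a network reverse-geocoding call (hypothetical)."""
    calls.append((lat, lon))
    return {"address": {"postcode": "93101"}}

@lru_cache(maxsize=None)
def cached_postcode(lat, lon):
    raw = slow_lookup(lat, lon)
    # Same defensive access pattern as in the cell above.
    return raw.get("address", {}).get("postcode")

first = cached_postcode(34.420334, -119.710749)
second = cached_postcode(34.420334, -119.710749)  # served from the cache
print(first, second, len(calls))
```

The cache key is the exact coordinate pair, so this only helps for rows whose latitude and longitude are identical.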


## Column ``Name``

We start looking for empty values to check the consistency of the data.

In [23]:
Her.empty_values(df_business,'name')

The column "name" does not have empty values
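
The implementation of 'Her.empty_values' is not shown here. As a rough sketch only, a check of this kind might count nulls together with strings that are empty or contain just whitespace; that definition of "empty" and the name `count_empty` are assumptions.

```python
import pandas as pd

def count_empty(df: pd.DataFrame, column: str) -> int:
    """Count nulls plus empty or whitespace-only strings (assumed definition)."""
    col = df[column]
    is_blank = col.astype(str).str.strip().eq("")
    return int((col.isna() | is_blank).sum())

toy = pd.DataFrame({"name": ["Starbucks", "  ", None, "Subway", ""]})
n_empty = count_empty(toy, "name")
print(f'The column "name" has {n_empty} empty values')
```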


We check that the names correspond to restaurants or similar businesses in the food industry.

In [24]:
names = df_business['name'].value_counts()
names.head()

name
Starbucks     329
McDonald's    298
Subway        227
Dunkin'       220
Taco Bell     163
Name: count, dtype: int64

## Column ``City``

We look for empty values to check the consistency of the data.

In [25]:
Her.empty_values(df_business,'city')

The column "city" does not have empty values


We check that the city names correspond to the states that were selected through the population-based web scraping.

In [26]:
cities = df_business['city'].value_counts()
cities.head()

city
Philadelphia    6379
Tucson          3964
Tampa           3936
Indianapolis    3314
Nashville       3042
Name: count, dtype: int64

## Columns ``Latitude`` and ``Longitude``

We look for empty values to check the consistency of the data.

In [27]:
Her.empty_values(df_business,'latitude')
Her.empty_values(df_business,'longitude')

The column "latitude" does not have empty values
The column "longitude" does not have empty values


## Column ``stars``

We look for empty values to check the consistency of the data.

In [28]:
Her.empty_values(df_business,'stars')

The column "stars" does not have empty values


We count the values and compute the percentage that each possible value represents, using a function from our own module, to get a clearer view of the data.

In [29]:
Her.cantidad_porcentaje(df_business,'stars')

Los valores de stars:
4.0    13581
4.5    11890
3.5    11560
3.0     8089
5.0     7084
2.5     6114
2.0     4172
1.5     2187
1.0      852

El porcentaje que representa cada valor:
4.0    20.73
4.5    18.14
3.5    17.64
3.0    12.34
5.0    10.81
2.5     9.33
2.0     6.37
1.5     3.34
1.0     1.30
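
A count-and-percentage summary like the one above can be obtained directly from `value_counts`. This is a small sketch on made-up ratings, not the code of 'Her.cantidad_porcentaje'.

```python
import pandas as pd

stars = pd.Series([4.0, 4.0, 4.5, 3.5, 4.0, 5.0, 4.5, 3.5], name="stars")

# Absolute frequency of each rating.
counts = stars.value_counts()
# Share of rows per rating, expressed as a percentage with two decimals.
percent = (stars.value_counts(normalize=True) * 100).round(2)

print(counts)
print(percent)
```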


## Column ``review_count``

In [30]:
Her.empty_values(df_business,'review_count')

The column "review_count" does not have empty values


We use a groupby to identify the restaurants with the most reviews on Yelp, regardless of rating. Because the grouping is by name, the total for a chain adds up the reviews of all its locations.

In [31]:
reviews = df_business.groupby(by='name')['review_count'].sum().sort_values(ascending=False)
reviews.head()

name
Starbucks                     8939
McDonald's                    7480
Oceana Grill                  7400
Reading Terminal Market       5721
Ruby Slipper - New Orleans    5193
Name: review_count, dtype: int64
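
To tell a chain with many locations apart from a single very popular business, the same groupby can report several aggregates at once. The sketch below uses toy numbers; the output column names are chosen for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["Starbucks", "Starbucks", "Oceana Grill", "Starbucks"],
    "review_count": [30, 25, 70, 10],
})

# Named aggregation: total reviews, number of rows (locations) and the
# largest review count of any single location, per business name.
per_name = (
    toy.groupby("name")["review_count"]
       .agg(total_reviews="sum", locations="size", largest_single="max")
       .sort_values("total_reviews", ascending=False)
)
print(per_name)
```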

Finally, we look for duplicates in the 'df_business' DataFrame to complete the cleaning process.

In [32]:
Her.duplicates(df_business)

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,categories
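
The code of 'Her.duplicates' is not shown, so whether it compares whole rows or only a key is unknown. The sketch below contrasts the two options with pandas on toy data.

```python
import pandas as pd

toy = pd.DataFrame({
    "business_id": ["a1", "b2", "a1", "c3"],
    "name": ["Target", "Subway", "Target #2", "Subway"],
})

# keep=False marks every member of a duplicated group, not just the repeats.
full_dups = toy[toy.duplicated(keep=False)]
# Restricting the check to the key column catches rows that share an ID
# but differ in some other field.
id_dups = toy[toy.duplicated(subset="business_id", keep=False)]

print(len(full_dups), list(id_dups.index))
```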


With the analysis of all columns complete and no duplicates found, we convert one column into a dummy table. The column is copied to a new DataFrame and then dropped from the original 'df_business'.

## Dummies

### Column ``categories``

Several operations are performed: creating a new DataFrame, splitting the categories with the split function, dropping null values, and building the dummy variables keyed by each business ID.

In [33]:

# Work on an explicit copy so the assignments below do not act on a slice of df_business.
df_categories = df_business[['business_id', 'categories']].copy()

df_categories['categories'] = df_categories['categories'].replace('No_data', None)

df_categories['categories'] = df_categories['categories'].dropna().str.split(', ')

df_categories_exploded = df_categories.explode('categories')

df_dummies = pd.get_dummies(df_categories_exploded['categories'], prefix='Category')

df_dummies = df_categories_exploded[['business_id']].join(df_dummies)

df_dimmi = df_dummies.groupby('business_id').sum().reset_index()
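
For comparison, pandas can split and one-hot encode a delimited string column in a single step with `Series.str.get_dummies`. This is an alternative sketch on toy data, not the approach used in the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "business_id": ["a1", "b2", "c3"],
    "categories": ["Restaurants, Pizza", "Pizza, Restaurants, Italian", None],
})

# str.get_dummies splits on the separator and builds 0/1 columns directly;
# a row with a null string gets zeros in every column.
dummies = toy["categories"].str.get_dummies(sep=", ").add_prefix("Category_")
encoded = pd.concat([toy[["business_id"]], dummies], axis=1)
print(encoded)
```

Since each cell is already 0 or 1 per business, this variant needs no groupby or thresholding afterwards.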

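A threshold operation is applied: for each business, a category that appears one or more times (repeats included) is set to 1 and every other category to 0, so that the table can be used by the recommendation system.

In [34]:
threshold = 1

# Counts at or above the threshold become 1. A strict '>' with threshold = 1
# would turn a category that appears exactly once into 0.
df_dimmi.iloc[:, 1:] = (df_dimmi.iloc[:, 1:] >= threshold).astype(int)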

df_dimmi

Unnamed: 0,business_id,Category_American (New),Category_American (Traditional),Category_Arabic,Category_Argentine,Category_Armenian,Category_Asian Fusion,Category_Australian,Category_Bagels,Category_Bakeries,...,Category_Trinidadian,Category_Turkish,Category_Ukrainian,Category_Uzbek,Category_Vegan,Category_Vegetarian,Category_Venezuelan,Category_Venues & Event Spaces,Category_Vietnamese,Category_Waffles
0,---kPU91CF4Lq2-WlRu9Lw,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,--30_8IhuyMHbSOcNWd6DQ,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,--9osgUCSDUWUkoTLdvYhQ,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,--FcbSxK1AoEtEAxOgBaCw,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,--MbOh2O1pATkXa7xbU6LA,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65524,zzW99n4VJr1Atte1Uhub1A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
65525,zzZqlYfZZIcN02C8SLcuBw,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
65526,zzfj1-iPfw0cwnOjY0yUgA,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
65527,zzg-Il9zxsaVXlCDrcG7hg,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
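
The effect of the comparison operator in the threshold step can be seen on a tiny example. With a threshold of 1, a strict `>` maps a category that occurs exactly once to 0, whereas `>=` maps any occurrence to 1; the numbers below are made up.

```python
import pandas as pd

# Per-business category counts after groupby().sum(); a category listed twice
# for the same business would show up as 2.
counts = pd.DataFrame({"Category_Pizza": [0, 1, 2], "Category_Vegan": [1, 0, 0]})

threshold = 1
strict = (counts > threshold).astype(int)      # a single occurrence becomes 0
inclusive = (counts >= threshold).astype(int)  # any occurrence becomes 1

print(strict)
print(inclusive)
```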


We take a quick look at the most frequent values. Because the column now holds lists, the counts are per combination of categories (order included), not per individual category.

In [35]:
values_categories = df_categories['categories'].value_counts().head()
values = pd.DataFrame(values_categories)
values

categories,count
"[Restaurants, Pizza]",409
"[Pizza, Restaurants]",347
"[Restaurants, Chinese]",311
"[Restaurants, Mexican]",298
"[Mexican, Restaurants]",287


As mentioned earlier, the 'categories' column is now dropped, since we have our table of dummy variables and no longer need it in the business DataFrame.

In [36]:
df_business = df_business.drop(columns='categories')

## Export the data

In [37]:
df_business.to_csv('df_business.csv')

With this ETL complete, we move on to the next one, for the reviews.