<h1 align='center'>WebScraping TripAdvisor</h1>

---

The code below extracts the **[requested information](https://github.com/mozilla/geckodriver/releases/download/v0.28.0/geckodriver-v0.28.0-win64.zip "Word document on Google Drive")** from the **[TripAdvisor](https://www.tripadvisor.cl/Restaurants-g294305-Santiago_Santiago_Metropolitan_Region.html "TripAdvisor website")** page for Ximena. The next cell only **silences any warnings** that might be raised while running the code; it does not contribute to understanding the process.

In [1]:
%%capture --no-display

import warnings
warnings.filterwarnings('ignore')

The previous cell keeps unnecessary warnings out of this report so that it stays readable. Next come the **instructions for installing the libraries** needed to run the code. Because installation requires a shell command, the instructions are printed as the output of a cell.

In [2]:
import os

print('If this is the first time you run this program, please open the Anaconda PowerShell terminal' +
      f' and enter the following command: "\033[4mpip install -r {os.getcwd()}\\requirements.txt\033[0m"')

If this is the first time you run this program, please open the Anaconda PowerShell terminal and enter the following command: "pip install -r C:\Users\nicol\Proyectos\GitHub\Webscraping-TripAdvisor\requirements.txt"
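The printed command hard-codes Windows path separators. As a minimal sketch, assuming `requirements.txt` sits in the working directory, the same command can be built portably with `pathlib`. The helper name `pip_install_command` is hypothetical and not part of the project.

```python
import sys
from pathlib import Path


def pip_install_command(project_dir=None):
    """Return the pip command for the requirements file in project_dir."""
    base = Path(project_dir) if project_dir is not None else Path.cwd()
    requirements = base / "requirements.txt"
    # `python -m pip` targets the interpreter that is running this notebook
    return f'"{sys.executable}" -m pip install -r "{requirements}"'


print(pip_install_command())
```

Calling pip through `sys.executable` makes it more likely that the packages land in the environment the notebook is actually using.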


The first essential part of any program is the **import of Python libraries**. If this first cell raises errors, please contact Nicolás Ganter by email: nicolas@ganter.cl

In [3]:
import time
import pickle
import pandas as pd
from tqdm import tqdm
from datetime import datetime

from distributed import Client, LocalCluster
import utils

The code contains a few **variables that deserve special attention**. One of them is the location of the *Selenium* *driver*, which launches a *Firefox* instance to browse the page and extract the links required in the first *webcrawling* stage. If you have not installed the driver yet, use this **[download link](https://github.com/mozilla/geckodriver/releases/download/v0.28.0/geckodriver-v0.28.0-win64.zip "geckodriver download link")**, extract the package, and move its files to the binaries folder of your Python installation.

In [4]:
geckodriver_path = r'C:\Users\nicol\anaconda3\Library\bin\geckodriver'
time_id = datetime.today().strftime('%Y%m%d')
basic_url = 'https://www.tripadvisor.cl'
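A wrong driver path only surfaces later, when Selenium tries to start Firefox. The sketch below shows one way to check the location up front. The helper `resolve_driver` is hypothetical, and the fallback to the system `PATH` is an assumption about how the driver may have been installed.

```python
import shutil
from pathlib import Path


def resolve_driver(explicit_path=None, name="geckodriver"):
    """Return a usable driver path, or None when nothing is found."""
    if explicit_path:
        candidate = Path(explicit_path)
        # On Windows the executable carries an .exe suffix
        for path in (candidate, candidate.with_name(candidate.name + ".exe")):
            if path.is_file():
                return str(path)
    # Fall back to searching the directories listed in PATH
    return shutil.which(name)
```

A call such as `resolve_driver(geckodriver_path)` returning `None` would indicate that the driver still has to be downloaded or moved.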

Next we create the cluster so that each task runs in a distributed fashion. In particular, we use a *LocalCluster* to start a number of *workers* on a single machine, each with a preloaded configuration.

In [5]:
cluster = LocalCluster(threads_per_worker=1, preload='worker_setup.py')
client = Client(cluster)
client

Client: Scheduler: tcp://127.0.0.1:62524, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 6, Cores: 6, Memory: 17.02 GB
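The notebook does not show `worker_setup.py`. Since the scraping function later reads `get_worker().browser.driver`, one plausible shape for such a preload script is sketched below. `dask_setup` and `dask_teardown` are the hook names Dask looks for in preload modules. The `Browser` class, `make_firefox`, and the lazy start-up are assumptions made for illustration, not the author's code.

```python
# worker_setup.py (hypothetical contents)

class Browser:
    """Thin holder so tasks can reach the driver as worker.browser.driver."""

    def __init__(self, driver_factory):
        self._factory = driver_factory
        self._driver = None

    @property
    def driver(self):
        # Start the browser lazily, the first time a task needs it
        if self._driver is None:
            self._driver = self._factory()
        return self._driver

    def close(self):
        if self._driver is not None:
            self._driver.quit()
            self._driver = None


def make_firefox():
    # Imported here so the module loads even where Selenium is absent
    from selenium import webdriver
    return webdriver.Firefox()


def dask_setup(worker):
    # Called by Dask once per worker when the preload script is loaded
    worker.browser = Browser(make_firefox)


def dask_teardown(worker):
    # Called by Dask when the worker shuts down
    worker.browser.close()
```

With `threads_per_worker=1`, each worker would own exactly one browser, so two tasks never drive the same Firefox window at once.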


<h2>Crawling the restaurants</h2>
The following cell **extracts the links** for every restaurant listed on the pages reached through the URL on its second line. In this case, the restaurants of Santiago de Chile are extracted. At the end, the cell prints the number of restaurants extracted, the number of restaurants available according to the page, and the percentage captured by the program. Note that a full crawl takes about 10 minutes, so the cell takes a shortcut through *pickles* (Python's native serialization format) and reports the date on which the stored information was captured.

In [6]:
start = time.time()
url = basic_url + '/Restaurants-g294305-Santiago_Santiago_Metropolitan_Region.html'
info = utils.info_restaurants(url, geckodriver_path)

cwd = os.getcwd()
dict_pickles = utils.check_files(dir_files=cwd, keyword='urls')

if len(dict_pickles) == 0:
    urls = utils.gen_pickle(url, geckodriver_path, info['pages'], basic_url, time_id)

else:
    last_pickle = utils.last_pickle(dict_pickles)
    with open(last_pickle, 'rb') as file:
        urls = pickle.load(file)
    
print('Obtained {} restaurants out of {}, which corresponds to an extraction of {}%'
      .format(len(urls), info['max_restaurants'], round(len(urls) / info['max_restaurants'] * 100, 2)))

stop = time.time()
print(f'This process took {round(stop-start, 2)} seconds to run.\n')

Data loaded from pickle 20210205_4830_urls.pickle, extracted on February 5, 2021.
Obtained 4830 restaurants out of 4848, which corresponds to an extraction of 99.63%
This process took 11.43 seconds to run.
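The pickle shortcut follows a load-or-build pattern: reuse the newest stored file when one exists, otherwise do the slow work and store the result. The sketch below illustrates that pattern only. `load_or_build` is a hypothetical helper, and the file-name layout `<date>_<count>_<keyword>.pickle` is inferred from the file names printed in this notebook, not taken from `utils`.

```python
import pickle
import re
from datetime import datetime
from pathlib import Path


def load_or_build(directory, keyword, build):
    """Load the newest '<date>_<count>_<keyword>.pickle', or build and save one."""
    directory = Path(directory)
    # Match the whole name so that 'urls' does not also pick up 'review_urls'
    pattern = re.compile(rf"\d{{8}}_\d+_{re.escape(keyword)}\.pickle")
    # Names start with YYYYMMDD, so lexicographic order is chronological
    cached = sorted(p for p in directory.iterdir() if pattern.fullmatch(p.name))
    if cached:
        with open(cached[-1], "rb") as file:
            return pickle.load(file)

    data = build()
    stamp = datetime.today().strftime("%Y%m%d")
    target = directory / f"{stamp}_{len(data)}_{keyword}.pickle"
    with open(target, "wb") as file:
        pickle.dump(data, file)
    return data
```

Pickle files should only be loaded when they come from a trusted source, since unpickling can execute arbitrary code.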



<h2>Scraping the restaurants</h2>
This concludes the most complex and critical part: collecting the links for the restaurants. Although link collection resumes later at the review level, **we now extract the requested information** for each restaurant in the list. Because the work runs in parallel, progress is not shown in the notebook; it can only be followed by opening the *Dashboard*, whose link appears in the output of the cluster cell (In [5]).

In [7]:
start = time.time()
dict_dataframes = utils.check_files(dir_files=cwd, keyword='dataframe')

if len(dict_dataframes) == 0:
    futures = [client.submit(utils.get_restaurant, url_restaurant) for url_restaurant in list(set(urls))]
    results = client.gather(futures)
    
    dict_structure = {'id':[], 'Nombre restaurante':[], 'Promedio de calificaciones':[],
                      'N° de opiniones':[], 'Calificación de viajeros por categoría':[],
                      'Toman medidas de seguridad':[], 'Rankings':[],
                      'Tipo de comida y servicios':[], 'url':[]}
    
    df_restaurants = utils.build_dataframe(dict_structure, results, time_id)
    df_restaurants.to_pickle(f'{time_id}_dataframe_of_{df_restaurants.shape[0]}_restaurants.pickle')
    print(f'Saved "{time_id}_dataframe_of_{df_restaurants.shape[0]}_restaurants.pickle" to "{os.getcwd()}".')
    
else:
    last_pickle = utils.last_pickle(dict_dataframes)
    with open(last_pickle, 'rb') as file:
        df_restaurants = pickle.load(file)
    
stop = time.time()
print(f'This process took {round(stop-start, 2)} seconds to run.\n')

Data loaded from pickle 20210205_dataframe_of_4830_restaurants.pickle, extracted on February 5, 2021.
This process took 0.09 seconds to run.
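Progress can also be reported from the notebook itself by consuming futures in completion order instead of gathering them all at once. The sketch below shows the idea with the standard library's `concurrent.futures`, which is a swapped-in stand-in for the Dask client used here; `distributed` offers its own `as_completed` for Dask futures. The `scrape` function and the example URLs are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape(url):
    # Hypothetical stand-in for utils.get_restaurant
    return {"url": url, "length": len(url)}


urls_demo = ["https://example.com/a", "https://example.com/bb"]
results_demo = []

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(scrape, url) for url in urls_demo]
    # Futures are yielded as they finish, so a counter can be updated live
    for done, future in enumerate(as_completed(futures), start=1):
        results_demo.append(future.result())
        print(f"{done}/{len(futures)} finished")
```

Results arrive in completion order rather than submission order, so each result should carry its own URL or identifier, as it does in this sketch.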



<h2>Crawling the reviews</h2>

In [8]:
start = time.time()
dict_files = utils.check_files(dir_files=cwd, keyword='review_urls')

if len(dict_files) == 0:
    futures = [client.submit(utils.review_urls, url_restaurant) for url_restaurant in list(set(urls))]
    results = client.gather(futures)
    
    dict_reviews = {key:value for key, value in results if isinstance(value, list)}
    n_reviews = len(dict_reviews)
    
    with open(f'{time_id}_{n_reviews}_review_urls.pickle', 'wb') as file:
        pickle.dump(dict_reviews, file)
    
    print(f'Saved "{time_id}_{n_reviews}_review_urls.pickle" to "{os.getcwd()}".')
    
else:
    last_pickle = utils.last_pickle(dict_files)
    with open(last_pickle, 'rb') as file:
        dict_reviews = pickle.load(file)
    
stop = time.time()
print(f'This process took {round(stop-start, 2)} seconds to run.',
      'Approximately {} reviews are available for extraction.\n'.format(len(dict_reviews) * 10))

Data loaded from pickle 20210205_4321_review_urls.pickle, extracted on February 5, 2021.
This process took 0.01 seconds to run. Approximately 43210 reviews are available for extraction.



<h2>Scraping the reviews</h2>

In [9]:
'''
start = time.time()
dict_files = utils.check_files(dir_files=cwd, keyword='scraped_reviews')
url_reviews = utils.prepare_urls(dict_reviews)

if len(dict_files) == 0:
    #browsers = utils.gen_browsers(client, geckodriver_path)
    futures = [client.submit(utils.get_reviews, url) for url in url_reviews]
    results = client.gather(futures)
    
    dict_structure = {'id':[], 'date_review':[], 'comments':[], 'date_stayed':[], 'response_body':[],
                      'user_name':[], 'user_reviews':[], 'useful_votes':[]}

    df_reviews = utils.build_dataframe(dict_structure, results, time_id)
    df_pathname = f'{time_id}_dataframe_of_{df_reviews.shape[0]}_scraped_reviews.pickle'

    df_reviews.to_pickle(df_pathname)
    print(f'Saved "{df_pathname}" to "{os.getcwd()}"')
        
else:
    last_pickle = utils.last_pickle(dict_files)
    df_reviews = pd.read_pickle(last_pickle)

stop = time.time()
print(f'This process took {round(stop-start, 2)} seconds to run.',
      '{} reviews were extracted.\n'.format(df_reviews.shape[0]))
'''


In [10]:
dict_files = utils.check_files(dir_files=cwd, keyword='scraped_reviews')
url_reviews = utils.prepare_urls(dict_reviews)

In [35]:
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
import re

def get_reviews(url):
    try:
        html = requests.get(url['scraping'], timeout=600)
    
    except Exception as e:
        dict_reviews = {'id':hash(url['identifier']), 'restaurant':'timeout', 'grade':'timeout',
                        'date_review':'timeout', 'comments':'timeout', 'date_stayed':'timeout',
                        'response_body':'timeout', 'user_name':'timeout', 'user_reviews':'timeout',
                        'useful_votes':'timeout', 'url':url, 'Error':e}
        
        return dict_reviews
        
    soup = BeautifulSoup(html.text, 'lxml')
    soup_reviews = soup.find_all('div', class_ = 'reviewSelector')
    dict_months = {'enero':1, 'febrero':2, 'marzo':3, 'abril':4,
                  'mayo':5, 'junio':6, 'julio':7, 'agosto':8,
                  'septiembre':9, 'octubre':10, 'noviembre':11, 'diciembre':12}

    dict_reviews = {'id':[], 'restaurant':[], 'grade':[], 'date_review':[], 'comments':[],
                    'date_stayed':[], 'response_body':[], 'user_name':[], 'user_reviews':[],
                    'useful_votes':[], 'url':[]}
    try:
        restaurant = soup.find('h1', class_ = '_3a1XQ88S').text
        
    except Exception as e:
        restaurant = e
        
    for i, review in enumerate(soup_reviews):
        dict_reviews['id'].append(hash(url['identifier']))
        dict_reviews['restaurant'].append(restaurant)
        grade = str(review.find('div', class_ = 'ui_column is-9').span)
        re_grade = int(re.search('_([0-9]+)">', grade).group(1))
        dict_reviews['grade'].append(re_grade)

        try:
            raw_date = re.search('([0-9]+) de ([a-z]+) de ([0-9]+)',
                                 review.find('span', class_ = 'ratingDate').text)

            day, month, year = raw_date.group(1), dict_months[raw_date.group(2)], raw_date.group(3)
            dict_reviews['date_review'].append('{}/{:02d}/{}'.format(day, month, year))

        except Exception:
            dict_reviews['date_review'].append(review.find('span', class_ = 'ratingDate').text)
        
        # In the next lines, we are going to extract the actual reviews.
        try:
            basic_review = review.find('p', class_ = 'partial_entry').text
            extended_review = review.find('span', class_ = 'postSnippet').text
            complete_review = basic_review.replace(f'...{extended_review}Más',
                                                   f' {extended_review}')
        except Exception as e:
            button_code = review.find('span', class_ = 'taLnk ulBlueLinks')
                
            if (button_code != None) and ('browser' not in locals()):
                browser = None
                
                #while browser == None:
                #    browser = utils.browser_call(browsers)
                #    if browser == None:
                #        time.sleep(1)
                        
                get_worker().browser.driver.get(url['scraping'])                
                button = get_worker().browser.driver.find_element_by_class_name('taLnk.ulBlueLinks').click()
                
                html_selenium = get_worker().browser.driver.page_source
                soup_selenium = BeautifulSoup(html_selenium, 'lxml')
                
                reviews_selenium = soup_selenium.find_all('div', class_ = 'reviewSelector')
                complete_review = reviews_selenium[i].find('p', class_ = 'partial_entry').text
                
            else:
                complete_review = basic_review
                
        finally:
            complete_review = str(complete_review).replace('\n', ' ')    
            dict_reviews['comments'].append(complete_review)
            
        # The following lines extract the dates
        raw_date = re.search(': ([a-z]+) de ([0-9]+)',
                             review.find('div', class_ = 'prw_rup prw_reviews_stay_date_hsx').text)
        try:
            month, year = dict_months[raw_date.group(1)], raw_date.group(2)
            dict_reviews['date_stayed'].append('{:02d}/{}'.format(month, year))
            
        except Exception as e:
            month, year = review.find('div', class_ = 'prw_rup prw_reviews_stay_date_hsx'), e
            dict_reviews['date_stayed'].append(f'{month} with error: {year}')
            

        try:
            full_response = review.find('div', class_ = 'mgrRspnInline')
            local_body = []

            for match in [r'(.*)\n', r'(.*)\.\.\.Más']:
                re_body = re.search(match, full_response.find('p', class_ = 'partial_entry').text)

                if re_body is not None:
                    local_body.append(re_body.group(1)) # TODO: add a marker for full extraction
                    
            dict_reviews['response_body'].append(' '.join(local_body))

        except Exception:
            full_response = None
            dict_reviews['response_body'].append(None)

        full_response = review.find('div', class_ = 'entry')
        
        try:
            dict_reviews['user_name'].append(review.find('div', class_ = 'info_text pointer_cursor').text)
            
        except Exception as e:
            dict_reviews['user_name'].append('URL {} raised an error of type {}'.format(url['scraping'], e))
        try:
            dict_reviews['user_reviews'].append(int(re.match('([0-9]+)',
                                                             review.find('span',
                                                                         class_ = 'badgeText').text).group(1)))
        except Exception as e:
            dict_reviews['user_reviews'].append('URL {} raised an error of type {}'.format(url['scraping'], e))

        # Number of useful votes; 0 when the tag is absent
        votes_tag = review.find('span', class_ = 'numHlpIn')
        dict_reviews['useful_votes'].append(votes_tag.text if votes_tag is not None else 0)
        
        dict_reviews['url'].append(url)
        #browser.hang()

    return dict_reviews

In [36]:
task = client.submit(get_reviews, url_reviews[3])
task.result()

UnboundLocalError: local variable 'complete_review' referenced before assignment
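The error above is typical of a `finally` block that reads a variable assigned only inside `try` or `except`. One plausible reading, not confirmed by the traceback shown, is that the fallback in the `except` branch itself failed before `complete_review` was assigned. The sketch below reproduces that failure mode with hypothetical functions and shows one way to avoid it.

```python
def risky_lookup(fail):
    # Hypothetical stand-in for the BeautifulSoup lookups
    if fail:
        raise ValueError("lookup failed")
    return "full text"


def broken(fail):
    try:
        text = risky_lookup(fail)
    except ValueError:
        # Deliberately undefined name: the fallback fails before `text` is assigned
        text = undefined_fallback()
    finally:
        # Runs even while the exception propagates, and `text` does not exist yet
        text = str(text)
    return text


def fixed(fail):
    text = None  # bound before the try block, so `finally` can always read it
    try:
        text = risky_lookup(fail)
    except ValueError:
        text = "fallback text"
    finally:
        text = str(text).replace("\n", " ")
    return text
```

In `broken(True)` the `UnboundLocalError` raised inside `finally` replaces the original exception, which is why the traceback points at the clean-up line rather than at the real cause.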

In [None]:
from distributed import get_worker

start = time.time()

def get_reviews(url):
    # First, our request to the main url gets protected against possible timeout errors
    try:
        html = requests.get(url['scraping'], timeout=600)
    
    except Exception as e:
        dict_reviews = {'id':hash(url['identifier']), 'restaurant':'timeout', 'grade':'timeout',
                        'date_review':'timeout', 'comments':'timeout', 'date_stayed':'timeout',
                        'response_body':'timeout', 'user_name':'timeout', 'user_reviews':'timeout',
                        'useful_votes':'timeout', 'url':url, 'Error':e}
        
        return dict_reviews
    
    # If we managed to get a response, our html code is converted to a soup and divided by its reviews
    soup = BeautifulSoup(html.text, 'lxml')
    soup_reviews = soup.find_all('div', class_ = 'reviewSelector')
    
    # A dictionary with the equivalent values per month is created to assist with our dates
    dict_months = {'enero':1, 'febrero':2, 'marzo':3, 'abril':4,
                  'mayo':5, 'junio':6, 'julio':7, 'agosto':8,
                  'septiembre':9, 'octubre':10, 'noviembre':11, 'diciembre':12}
    
    # Our main dictionary is instanced and prepared to be filled through the following for loop
    dict_reviews = {'id':[], 'restaurant':[], 'grade':[], 'date_review':[], 'comments':[],
                    'date_stayed':[], 'response_body':[], 'user_name':[], 'user_reviews':[],
                    'useful_votes':[], 'url':[]}
    
    # Although rare, some restaurants have no name, or the name field raises an error
    try:
        restaurant = soup.find('h1', class_ = '_3a1XQ88S').text
        
    except Exception as e:
        restaurant = e
    
    # With everything that is common to all reviews settled, we can iterate through the reviews
    for i, review in enumerate(soup_reviews):

        dict_reviews['id'].append(hash(url['identifier']))
        dict_reviews['restaurant'].append(restaurant)
        grade = str(review.find('div', class_ = 'ui_column is-9').span)
        re_grade = int(re.search('_([0-9]+)">', grade).group(1))
        dict_reviews['grade'].append(re_grade)

        try:
            raw_date = re.search('([0-9]+) de ([a-z]+) de ([0-9]+)',
                                 review.find('span', class_ = 'ratingDate').text)

            day, month, year = raw_date.group(1), dict_months[raw_date.group(2)], raw_date.group(3)
            dict_reviews['date_review'].append('{}/{:02d}/{}'.format(day, month, year))

        except Exception:
            dict_reviews['date_review'].append(review.find('span', class_ = 'ratingDate').text)
        
        # First we extract our review text in one of the following three ways
        basic_review = None  # reset per review so a stale value is never reused
        try:
            # Alternative 1: the page already holds the full text, split across two tags
            basic_review = review.find('p', class_ = 'partial_entry').text
            extended_review = review.find('span', class_ = 'postSnippet').text
            dict_reviews['comments'].append(basic_review.replace(f'...{extended_review}Más',
                                                                 f' {extended_review}'))
        except Exception:
            # There is no guarantee that the "Más" button exists; find() returns None if it does not
            button_code = review.find('span', class_ = 'taLnk ulBlueLinks')
            
            # Alternative 2: expand the review with Selenium and read the page again
            if button_code is not None:
                driver = get_worker().browser.driver
                driver.get(url['scraping'])
                driver.find_element_by_class_name('taLnk.ulBlueLinks').click()
                time.sleep(1)

                soup_selenium = BeautifulSoup(driver.page_source, 'lxml')
                reviews_selenium = soup_selenium.find_all('div', class_ = 'reviewSelector')
                dict_reviews['comments'].append(reviews_selenium[i].find('p', class_ = 'partial_entry').text)
                
            # Alternative 3: keep the partial text (None when nothing was found)
            else:
                dict_reviews['comments'].append(basic_review)

        # The following lines extract the dates
        raw_date = re.search(': ([a-z]+) de ([0-9]+)',
                             review.find('div', class_ = 'prw_rup prw_reviews_stay_date_hsx').text)
        try:
            month, year = dict_months[raw_date.group(1)], raw_date.group(2)
            dict_reviews['date_stayed'].append('{:02d}/{}'.format(month, year))
            
        except Exception as e:
            month, year = review.find('div', class_ = 'prw_rup prw_reviews_stay_date_hsx'), e
            dict_reviews['date_stayed'].append(f'{month} with error: {year}')
            

        try:
            full_response = review.find('div', class_ = 'mgrRspnInline')
            local_body = []

            for match in [r'(.*)\n', r'(.*)\.\.\.Más']:
                re_body = re.search(match, full_response.find('p', class_ = 'partial_entry').text)

                if re_body is not None:
                    local_body.append(re_body.group(1)) # TODO: add a marker for full extraction
                    
            dict_reviews['response_body'].append(' '.join(local_body))

        except Exception:
            full_response = None
            dict_reviews['response_body'].append(None)

        full_response = review.find('div', class_ = 'entry')
                
        # After extracting our comments, we ... author name
        try:
            dict_reviews['user_name'].append(review.find('div', class_ = 'info_text pointer_cursor').text)
            
        except Exception as e:
            dict_reviews['user_name'].append('URL {} raised an error of type {}'.format(url['scraping'], e))
            
        try:
            dict_reviews['user_reviews'].append(int(re.match('([0-9]+)',
                                                             review.find('span',
                                                                         class_ = 'badgeText').text).group(1)))
        except Exception as e:
            dict_reviews['user_reviews'].append('URL {} raised an error of type {}'.format(url['scraping'], e))

        # Number of useful votes; 0 when the tag is absent
        votes_tag = review.find('span', class_ = 'numHlpIn')
        dict_reviews['useful_votes'].append(votes_tag.text if votes_tag is not None else 0)
        
        dict_reviews['url'].append(url)    

    return dict_reviews


urls_test = url_reviews[0:100]

futures = [client.submit(get_reviews, url) for url in urls_test]
results = client.gather(futures)
stop = time.time()

dict_structure = {'id':[], 'date_review':[], 'comments':[], 'date_stayed':[], 'response_body':[],
                  'user_name':[], 'user_reviews':[], 'useful_votes':[]}

df_reviews = utils.build_dataframe(dict_structure, results, time_id)

print(f'This process took {round(stop-start, 2)} seconds')
display(df_reviews)

#for i, url in enumerate(url_reviews):
#    test = client.submit(search_url, url)
#    print(url['scraping'], '\n', test.result(), '\n\n')
#    
#    if i > 3:
#        break
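The date handling inside `get_reviews` can be isolated into a small function, which makes it easier to test away from the scraper. The sketch below is one possible version, assuming review dates appear as `5 de febrero de 2021`. The names `MONTHS` and `parse_spanish_date` are hypothetical, and returning `None` on failure is a design choice that differs from the cell above, which stores the raw text instead.

```python
import re
from datetime import date

MONTHS = {'enero': 1, 'febrero': 2, 'marzo': 3, 'abril': 4,
          'mayo': 5, 'junio': 6, 'julio': 7, 'agosto': 8,
          'septiembre': 9, 'octubre': 10, 'noviembre': 11, 'diciembre': 12}


def parse_spanish_date(text):
    """Return a date for text containing e.g. '5 de febrero de 2021', else None."""
    match = re.search(r'(\d{1,2}) de ([a-z]+) de (\d{4})', text.lower())
    if match is None or match.group(2) not in MONTHS:
        return None
    try:
        return date(int(match.group(3)), MONTHS[match.group(2)], int(match.group(1)))
    except ValueError:
        # Impossible calendar dates, such as 31 de febrero
        return None
```

Storing real `date` objects, or their `isoformat()` strings, keeps the column sortable once it reaches pandas or Excel.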

In [64]:
df_reviews.to_excel('test01.xlsx')
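The `id` column in the exported table comes from Python's built-in `hash()`. For strings, that hash is salted per interpreter process unless `PYTHONHASHSEED` is fixed, so the same restaurant may receive different ids on different workers or on different runs. A minimal sketch of a reproducible alternative follows. `stable_id` is a hypothetical helper, not part of `utils`.

```python
import hashlib


def stable_id(identifier, digits=12):
    """Return a process-independent integer id derived from a string."""
    digest = hashlib.sha256(identifier.encode('utf-8')).hexdigest()
    # 12 hex digits are 48 bits, small enough to survive a spreadsheet's float storage
    return int(digest[:digits], 16)
```

Replacing `hash(url['identifier'])` with `stable_id(url['identifier'])` would make ids comparable between the restaurant table and the review table, provided both are built from the same identifier strings.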

<h2>Generating tables</h2>

In [13]:
#start = time.time()
#df_restaurants.to_excel(f'{time_id}_excel_with_{df_restaurants.shape[0]}_restaurants.xlsx')
#df_reviews.to_excel(f'{time_id}_excel_with_{df_reviews.shape[0]}_reviews.xlsx')
#stop = time.time()
#
#print(f'This process took {round(stop-start, 2)} seconds to run.',
#      '{} reviews were extracted.\n'.format(df_reviews.shape[0]))

<h2>Extracting communes</h2>

In [14]:
#import googlemaps
#
#gmaps = googlemaps.Client(key='YOUR_API_KEY')  # keep real keys out of the notebook
#
#def get_location(address):
#    geocode = dict(*gmaps.geocode([address]))['address_components']
#    location = str(geocode[3]['long_name'])
#    
#    return location

In [15]:
#df_restaurants = pd.read_pickle('20210205_dataframe_of_4830_restaurants.pickle')
#addresses = df_restaurants['Dirección'].to_list()
#locations = []
#
#for address in tqdm(addresses):
#    try:
#        location = get_location(address)
#        locations.append(location)
#        
#    except:
#        locations.append(address)
        
    #time.sleep(3)

In [16]:
#df_restaurants['Comuna'] = locations
#df_restaurants.to_excel(f'{time_id}_excel_with_{df_restaurants.shape[0]}_restaurants_with_locations.xlsx')

In [17]:
#url = url_reviews[7]
#
#print(url['scraping'])