# TripAdvisor restaurants info scraping
Takes a city as an argument and scrapes the summary data of each restaurant in that city from the TripAdvisor (TA) restaurant listing pages.

The raw datasets generated are then curated and aggregated into a single dataset covering all cities.

https://www.tripadvisor.fr/Restaurants-g274772-Krakow_Lesser_Poland_Province_Southern_Poland.html#EATERY_OVERVIEW_BOX

In [1]:
#! /usr/bin/env python3
# coding: utf-8

import requests
from bs4 import BeautifulSoup
import datetime
import pandas as pd
import numpy as np
import logging
import glob2

In [2]:
#Variables that will be used globally through the script
url0 = 'https://www.tripadvisor.com'
today = datetime.datetime.now()
today_date = str(today.year) + '/' + str(today.month) + '/' + str(today.day)

#Enable display of info messages
logging.basicConfig(level=logging.INFO)

## Scrape data from the summary list of restaurants

The TripAdvisor URL used to page through the restaurant list is built as follows:
https://www.tripadvisor.com/RestaurantSearch-g1225481-oa15, where
- g1225481 is the ID of the city
- oa15 is the offset into the list of results; the scraper starts at oa0 and increments the offset by 30 to go to the next page.
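
The URL scheme described above can be sketched as a small function. This is only an illustration: `build_page_url` and `PAGE_SIZE` are hypothetical names that do not appear in the scraper, and the page size of 30 is taken from the description above.

```python
# Sketch only: `build_page_url` and `PAGE_SIZE` are hypothetical names.
SEARCH_URL = 'https://www.tripadvisor.com/RestaurantSearch-'
PAGE_SIZE = 30  # restaurants per listing page, as described above


def build_page_url(city_id, page_index):
    """Return the listing URL for a 1-based page index."""
    if page_index < 1:
        raise ValueError("page_index starts at 1")
    offset = (page_index - 1) * PAGE_SIZE
    return '{}{}-oa{}'.format(SEARCH_URL, city_id, offset)


# First three pages for the city ID g187849
for page in (1, 2, 3):
    print(build_page_url('g187849', page))
```

Deriving the offset from the page index, rather than keeping a separate counter, means the two values cannot drift apart.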

Restaurants are sorted by ranking by default, best ranked first.
The information is heterogeneous:
- all restaurants have a name, an ID and a URL
- not all have a cuisine style, a rank, a rating or reviews
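
Because the fields are not always present, one possible way to keep every row uniform is to start each record from a template in which all columns are NaN, then fill in whatever was found. The sketch below assumes the column names used by the scraper; `new_record` is a hypothetical helper, not part of the notebook.

```python
import math

# Column names as used by the scraper; `new_record` is a hypothetical helper.
COLUMNS = ['Name', 'URL_TA', 'ID_TA', 'Rating', 'Ranking',
           'Price Range', 'Cuisine Style', 'Number of Reviews', 'Reviews']


def new_record(found):
    """Build one row: every column present, NaN where nothing was scraped."""
    unknown = set(found) - set(COLUMNS)
    if unknown:
        raise KeyError("Unexpected fields: {}".format(sorted(unknown)))
    record = {column: float('nan') for column in COLUMNS}
    record.update(found)
    return record


# A listing that only exposes the mandatory fields
row = new_record({'Name': 'La Fonte', 'ID_TA': 'd6212120'})
print(row['Name'])
print(math.isnan(row['Rating']))
```

Starting from a fresh template for each restaurant also avoids carrying a value over from the previously parsed listing.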

### Scraper

In [3]:
def scraper(city):
    query = '/TypeAheadJson?action=API&startTime='+today_date+'&uiOrigin=GEOSCOPE&source=GEOSCOPE&interleaved=true&type=geo&neighborhood_geos=true&link_type=eat&details=true&max=12&injectNeighborhoods=true&query='+city
    url = url0 + query
    #Query the API and get a JSON answer that Python reads as dictionary objects
    api_response = requests.get(url).json()
    geo = api_response['results'][0]['url']  #Get the URL from the results/1st element/Url key
    restaurants_url = url0 + geo
    logging.info("Scraping {} restaurants info".format(city))
    print(restaurants_url)

    #Prepare the scrolling requests using a URL such as
    #https://www.tripadvisor.com/RestaurantSearch-g1225481-oa15
    scroll_url0= 'https://www.tripadvisor.com/RestaurantSearch-'
    b = restaurants_url.find('-')
    e = restaurants_url.find('-', b+1)
    city_id = restaurants_url[b+1:e]
    
    #Initialize the lists of parameters to scrape and the dataframe containing all data
    inc_page=0
    resto_dict = {}
    dataset = pd.DataFrame(resto_dict)
    #columns=['Name', 'URL_TA', 'ID_TA', 'Rating', 'Ranking', 'Price Range', 'Cuisine Style', 'Number of Reviews', 'Reviews'])
    
    #Get the total number of pages
    r = requests.get(scroll_url0+city_id,
                     headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',},
                     cookies= {"SetCurrency":"EUR"})
    soup = BeautifulSoup(r.text, "lxml")
    page_tag = soup.find_all(class_="pageNumbers")[0] #tag that displays number of pages at bottom of webpage
    a_tags = page_tag.find_all('a')  #last item of the returned list is the last page button
    tot_pages=int(a_tags[-1].contents[0])  #integer from text content of the <a>
    logging.info("{} pages to explore".format(tot_pages))
    
    #Explore all the pages that display restaurants
    for page_index in range (1, tot_pages+1):
        
        #URL of the current webpage
        scroll_url = scroll_url0 + city_id + '-oa' + str(inc_page)
        print("Scraping page n°{}".format(page_index))
        print(scroll_url)

        #Scrape HTML content of the current webpage using the library BeautifulSoup
        r = requests.get(scroll_url,
                 headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',},
                cookies= {"SetCurrency":"EUR"})
        soup = BeautifulSoup(r.text, "lxml")


        #Restaurants list starts with tag <div id="EATERY_SEARCH_RESULTS">
        data_bloc = soup.find_all(attrs={"id": "EATERY_SEARCH_RESULTS"}) #contains the data bloc in a list object
        data_bloc = data_bloc[0]  #easier to manipulate

        #Start from an empty record so that no value carries over from the previous page
        resto_dict = {}

        #First restaurant of page has a particular class attribute
        if data_bloc.find_all(class_="listing rebrand listingIndex-1 first") != []:
            resto_soup = data_bloc.find_all(class_="listing rebrand listingIndex-1 first")[0]
        else:
            resto_soup = data_bloc.find_all(class_="listing rebrand first")[0]

        #Get the url, id and name of restaurants
        url_name_tag = resto_soup.find_all(class_="property_title")[0] #tag containing the data
        #Get restaurant URL
        resto_dict['URL_TA'] = url_name_tag.get('href')
        #Get the restaurant ID within its URL (-dxxxxxxx-Reviews)
        b = url_name_tag.get('href').find('-d')
        e= url_name_tag.get('href').find('-R')
        resto_dict['ID_TA'] = url_name_tag.get('href')[b+1:e]
        #Get names
        resto_dict['Name'] = url_name_tag.contents[0][1:-1]

        #Get the ranking of the restaurant
        if resto_soup.find_all(class_="popIndex rebrand popIndexDefault") != []:
            ranking_tag = resto_soup.find_all(class_="popIndex rebrand popIndexDefault")[0]
            resto_dict['Ranking'] = ranking_tag.contents[0][1:-1]
        else:
            resto_dict['Ranking'] = np.nan #put a NaN instead

        #Get the rating of the restaurant from <span> tags
        if resto_soup.find_all('span') != []:
            span_tags = resto_soup.find_all('span')
            for tag in span_tags:
                if tag.get('alt') is not None:
                    resto_dict['Rating'] = tag.get('alt')
        else:
            resto_dict['Rating'] = np.nan

        #Information from <div class="cuisines">  
        #!! some restaurants have neither price range nor cuisine styles, instead <div class="cuisine_margin">
        cuisines_tags = resto_soup.find_all(class_="cuisines") #1 element of the list is 1 restaurant
        if resto_soup.find_all(class_="cuisines") != []:
            for item in cuisines_tags:
                #Get price range from <span class="item price">
                if item.find(class_="item price") is not None:
                    price_range = item.find(class_="item price") #unique tag with price range
                    resto_dict['Price Range'] = price_range.contents[0]
                else:
                    resto_dict['Price Range'] = np.nan
                #Get cuisine styles from <span class="item cuisine"> tags (several/restaurant)
                if item.find_all(class_="item cuisine") != []:
                    cuisines = item.find_all(class_="item cuisine")  # list of <a> tags with the cuisine style as text
                    resto_dict['Cuisine Style'] = [tag.contents[0] for tag in cuisines]
                else:
                    resto_dict['Cuisine Style'] = np.nan

        #Get number of reviews
        if resto_soup.find_all(class_="reviewCount") != []:
            numb_tag = resto_soup.find_all(class_="reviewCount")[0]
            resto_dict['Number of Reviews'] = numb_tag.find('a').contents[0][1:-9]
        else:
            resto_dict['Number of Reviews'] = np.nan
            
        #Get 2 reviews (text+date) from <ul class="review_stubs review_snippets rebrand"> and <li> tags within
        ul_tags = resto_soup.find_all(class_="review_stubs review_snippets rebrand")
        if ul_tags != []:
            for reviews_set in ul_tags:
                rev_texts = reviews_set.find_all(dir="ltr")
                rev_dates = reviews_set.find_all(class_="date")
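                #Able to pick up an empty displayed review "", as in the branches below
                resto_dict['Reviews'] = [[tag.find('a').contents[0] if tag.find('a').contents != [] else np.nan for tag in rev_texts], #text is in a <a> tag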
                                          [tag.contents[0] for tag in rev_dates]]
        else:
            resto_dict['Reviews'] = np.nan
            
        #Append the dataset
        dataset = pd.concat([dataset, pd.DataFrame([resto_dict])])
            
        #For the rest of the list, from restaurant 2 to 30:
        try:
            inc_rest = 0
            for i in range (2, 31):
                resto_dict = {}
                resto_bloc_id = "listing rebrand listingIndex-" + str(i)
                if data_bloc.find_all(class_=resto_bloc_id) != []:
                    resto_soup = data_bloc.find_all(class_=resto_bloc_id)[0] #Bloc for one restaurant

                    #Get the url, id and name of restaurants
                    url_name_tag = resto_soup.find_all(class_="property_title")[0] #tag containing the data
                    #Get restaurant URL
                    resto_dict['URL_TA'] = url_name_tag.get('href')
                    #Get the restaurant ID within its URL (-dxxxxxxx-Reviews)
                    b = url_name_tag.get('href').find('-d')
                    e= url_name_tag.get('href').find('-R')
                    resto_dict['ID_TA'] = url_name_tag.get('href')[b+1:e]
                    #Get names
                    resto_dict['Name'] = url_name_tag.contents[0][1:-1]

                    #Get the ranking of the restaurant
                    if resto_soup.find_all(class_="popIndex rebrand popIndexDefault") != []:
                        ranking_tag = resto_soup.find_all(class_="popIndex rebrand popIndexDefault")[0]
                        resto_dict['Ranking'] = ranking_tag.contents[0][1:-1]
                    else:
                        resto_dict['Ranking'] = np.nan

                    #Get the rating of the restaurant from <span> tags
                    span_tags = resto_soup.find_all('span')
                    if resto_soup.find_all('span') != []:
                        for tag in span_tags:
                            if tag.get('alt') is not None:
                                resto_dict['Rating'] = tag.get('alt')
                    else:
                        resto_dict['Rating'] = np.nan

                    #Information from <div class="cuisines">  
                    #!! some restaurants have neither price range nor cuisine styles, instead <div class="cuisine_margin">
                    cuisines_tags = resto_soup.find_all(class_="cuisines") #1 element of the list is 1 restaurant
                    if resto_soup.find_all(class_="cuisines") != []:
                        for item in cuisines_tags:
                            #Get price range from <span class="item price">
                            if item.find(class_="item price") is not None:
                                price_range = item.find(class_="item price") #unique tag with price range
                                resto_dict['Price Range'] = price_range.contents[0]
                            else:
                                resto_dict['Price Range'] = np.nan
                            #Get cuisine styles from <span class="item cuisine"> tags (several/restaurant)
                            if item.find_all(class_="item cuisine") != []:
                                cuisines = item.find_all(class_="item cuisine")  # list of <a> tags with the cuisine style as text
                                resto_dict['Cuisine Style'] = [tag.contents[0] for tag in cuisines]
                            else: 
                                resto_dict['Cuisine Style'] = np.nan

                    #Get number of reviews
                    if resto_soup.find_all(class_="reviewCount") != []:
                        numb_tag = resto_soup.find_all(class_="reviewCount")[0]
                        resto_dict['Number of Reviews'] = numb_tag.find('a').contents[0][1:-9]
                    else:
                        resto_dict['Number of Reviews'] = np.nan

                    #Get 2 reviews (text+date) from <ul class="review_stubs review_snippets rebrand"> and <li> tags within
                    ul_tags = resto_soup.find_all(class_="review_stubs review_snippets rebrand")
                    if resto_soup.find_all(class_="review_stubs review_snippets rebrand") != []:
                        for reviews_set in ul_tags:
                            rev_texts = reviews_set.find_all(dir="ltr")
                            rev_dates = reviews_set.find_all(class_="date")
                            #Able to pick up empty displayed review "" (St morris Argentijns, Amsterdam)
                            resto_dict['Reviews'] = [[tag.find('a').contents[0] if tag.find('a').contents != [] else np.nan for tag in rev_texts], #text is in a <a> tag
                                                  [tag.contents[0] for tag in rev_dates]]
                    else:
                        resto_dict['Reviews'] = np.nan

                    #Append the dataset
                    dataset = pd.concat([dataset, pd.DataFrame([resto_dict])])

                else: #tag of restaurant is instead "listing rebrand"
                    resto_soup = data_bloc.find_all(class_="listing rebrand")[inc_rest]

                    #Get the url, id and name of restaurants
                    url_name_tag = resto_soup.find_all(class_="property_title")[0] #tag containing the data
                    #Get restaurant URL
                    resto_dict['URL_TA'] = url_name_tag.get('href')
                    #Get the restaurant ID within its URL (-dxxxxxxx-Reviews)
                    b = url_name_tag.get('href').find('-d')
                    e= url_name_tag.get('href').find('-R')
                    resto_dict['ID_TA'] = url_name_tag.get('href')[b+1:e]
                    #Get names
                    resto_dict['Name'] = url_name_tag.contents[0][1:-1]

                    #Get the ranking of the restaurant
                    if resto_soup.find_all(class_="popIndex rebrand popIndexDefault") != []:
                        ranking_tag = resto_soup.find_all(class_="popIndex rebrand popIndexDefault")[0]
                        resto_dict['Ranking'] = ranking_tag.contents[0][1:-1]
                    else:
                        resto_dict['Ranking'] = np.nan

                    #Get the rating of the restaurant from <span> tags
                    span_tags = resto_soup.find_all('span')
                    if resto_soup.find_all('span') != []:
                        for tag in span_tags:
                            if tag.get('alt') is not None:
                                resto_dict['Rating'] = tag.get('alt')
                    else:
                        resto_dict['Rating'] = np.nan

                    #Information from <div class="cuisines">  
                    #!! some restaurants have neither price range nor cuisine styles, instead <div class="cuisine_margin">
                    cuisines_tags = resto_soup.find_all(class_="cuisines") #1 element of the list is 1 restaurant
                    if resto_soup.find_all(class_="cuisines") != []:
                        for item in cuisines_tags:
                            #Get price range from <span class="item price">
                            if item.find(class_="item price") is not None:
                                price_range = item.find(class_="item price") #unique tag with price range
                                resto_dict['Price Range'] = price_range.contents[0]
                            else:
                                resto_dict['Price Range'] = np.nan
                            #Get cuisine styles from <span class="item cuisine"> tags (several/restaurant)
                            if item.find_all(class_="item cuisine") != []:
                                cuisines = item.find_all(class_="item cuisine")  # list of <a> tags with the cuisine style as text
                                resto_dict['Cuisine Style'] = [tag.contents[0] for tag in cuisines]
                            else: 
                                resto_dict['Cuisine Style'] = np.nan

                    #Get number of reviews
                    if resto_soup.find_all(class_="reviewCount") != []:
                        numb_tag = resto_soup.find_all(class_="reviewCount")[0]
                        resto_dict['Number of Reviews'] = numb_tag.find('a').contents[0][1:-9]
                    else:
                        resto_dict['Number of Reviews'] = np.nan

                    #Get 2 reviews (text+date) from <ul class="review_stubs review_snippets rebrand"> and <li> tags within
                    ul_tags = resto_soup.find_all(class_="review_stubs review_snippets rebrand")
                    if resto_soup.find_all(class_="review_stubs review_snippets rebrand") != []:
                        for reviews_set in ul_tags:
                            rev_texts = reviews_set.find_all(dir="ltr")
                            rev_dates = reviews_set.find_all(class_="date")
                            #Able to pick up empty displayed review "" (St morris Argentijns, Amsterdam)
                            resto_dict['Reviews'] = [[tag.find('a').contents[0] if tag.find('a').contents != [] else np.nan for tag in rev_texts], #text is in a <a> tag
                                                  [tag.contents[0] for tag in rev_dates]]
                    else:
                        resto_dict['Reviews'] = np.nan

                    #Append the dataset
                    dataset = pd.concat([dataset, pd.DataFrame([resto_dict])])

                    inc_rest += 1

            #Increment to next page to display the next 30 restaurants
            inc_page += 30
    
        #Stop paging when no further restaurant bloc can be found
        except IndexError:
            logging.info("Last restaurant reached")
            break
    
    #Save dataframe as csv file
    dataset.to_csv(city + '_TA_restaurants_raw.csv', sep=',', encoding="utf-8")
    print("File created in current directory: {}_TA_restaurants_raw.csv".format(city))

    return dataset
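
The scraper extracts the city ID and the restaurant ID by slicing between positions returned by `str.find`. Since `str.find` returns -1 when nothing matches, an unexpected URL would produce a wrong slice rather than an error. As a hedged alternative, the same extraction could be written with regular expressions. The patterns below are assumptions inferred from the URLs shown in this notebook, and the function names are hypothetical.

```python
import re

# Patterns inferred from URLs such as
#   .../Restaurants-g274772-Krakow_...html
#   /Restaurant_Review-g188113-d6212120-Reviews-...
CITY_ID = re.compile(r'-(g\d+)-')
RESTAURANT_ID = re.compile(r'-(d\d+)-Reviews')


def extract_city_id(url):
    """Return the 'g…' city ID, or None if the URL has no such part."""
    match = CITY_ID.search(url)
    return match.group(1) if match else None


def extract_restaurant_id(href):
    """Return the 'd…' restaurant ID, or None if the link has no such part."""
    match = RESTAURANT_ID.search(href)
    return match.group(1) if match else None


print(extract_city_id('https://www.tripadvisor.com/Restaurants-g188057-Geneva.html'))
print(extract_restaurant_id('/Restaurant_Review-g188113-d6212120-Reviews-La'))
```

Returning `None` on a failed match lets the caller decide whether to skip the listing or store a NaN.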

### Test on a mid-sized city

In [None]:
#Test
scraper('Krakow')

### Scrape all the European capitals for restaurant data

In [None]:
#Run the scraper to get data from the European capitals
euro_capitals = ['Paris', 'London', 'Budapest', 'Madrid', 'Lisbon', 'Berlin', 'Rome',
                 'Athens', 'Vienna', 'Warsaw', 'Ljubljana', 'Dublin',
                 'Bruxelles', 'Prague', 'Amsterdam', 'Luxembourg', 'Bratislava',
                 'Copenhagen', 'Oslo', 'Helsinki', 'Stockholm', 'Geneva']
for city in euro_capitals:
    scraper(city)
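
A run over many cities takes a long time, and with the plain loop above an exception raised for one city (for example, a search that returns no result) would stop the whole run. The sketch below shows one possible way to keep going and report the failures at the end. `scrape_cities` is a hypothetical helper, and `fake_scraper` is a stand-in used only so that the example runs without network access.

```python
import logging


def scrape_cities(cities, scrape):
    """Call scrape(city) for each city; return (succeeded, failed) name lists."""
    succeeded, failed = [], []
    for city in cities:
        try:
            scrape(city)
        except Exception:  # broad on purpose: any failure should only skip this city
            logging.exception("Scraping %s failed, moving on", city)
            failed.append(city)
        else:
            succeeded.append(city)
    return succeeded, failed


def fake_scraper(city):
    """Stand-in for `scraper`, for demonstration only."""
    if city == 'Nowhere':
        raise IndexError("no results")


done, failed = scrape_cities(['Paris', 'Nowhere', 'Rome'], fake_scraper)
print(done, failed)
```

With the real function, the call would be `scrape_cities(euro_capitals, scraper)`, and the `failed` list could be retried later.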

In [4]:
#Run the scraper to get data from other major European cities (starting from Milan)
euro_second = ['Marseille', 'Barcelona', 'Oporto', 'Milan', 'Munich', 'Edinburg', 'Krakow', 'Zurich']
for city in euro_second[3:] :
    scraper(city)

INFO:root:Scraping Milan restaurants info


https://www.tripadvisor.com/Restaurants-g187849-Milan_Lombardy.html


INFO:root:223 pages to explore


Scraping page n°1
https://www.tripadvisor.com/RestaurantSearch-g187849-oa0
Scraping page n°2
https://www.tripadvisor.com/RestaurantSearch-g187849-oa30
Scraping page n°3
https://www.tripadvisor.com/RestaurantSearch-g187849-oa60
Scraping page n°4
https://www.tripadvisor.com/RestaurantSearch-g187849-oa90
Scraping page n°5
https://www.tripadvisor.com/RestaurantSearch-g187849-oa120
Scraping page n°6
https://www.tripadvisor.com/RestaurantSearch-g187849-oa150
Scraping page n°7
https://www.tripadvisor.com/RestaurantSearch-g187849-oa180
Scraping page n°8
https://www.tripadvisor.com/RestaurantSearch-g187849-oa210
Scraping page n°9
https://www.tripadvisor.com/RestaurantSearch-g187849-oa240
Scraping page n°10
https://www.tripadvisor.com/RestaurantSearch-g187849-oa270
Scraping page n°11
https://www.tripadvisor.com/RestaurantSearch-g187849-oa300
Scraping page n°12
https://www.tripadvisor.com/RestaurantSearch-g187849-oa330
Scraping page n°13
https://www.tripadvisor.com/RestaurantSearch-g187849-oa360


Scraping page n°106
https://www.tripadvisor.com/RestaurantSearch-g187849-oa3150
Scraping page n°107
https://www.tripadvisor.com/RestaurantSearch-g187849-oa3180
Scraping page n°108
https://www.tripadvisor.com/RestaurantSearch-g187849-oa3210
Scraping page n°109
https://www.tripadvisor.com/RestaurantSearch-g187849-oa3240
Scraping page n°110
https://www.tripadvisor.com/RestaurantSearch-g187849-oa3270
Scraping page n°111
https://www.tripadvisor.com/RestaurantSearch-g187849-oa3300
Scraping page n°112
https://www.tripadvisor.com/RestaurantSearch-g187849-oa3330
Scraping page n°113
https://www.tripadvisor.com/RestaurantSearch-g187849-oa3360
Scraping page n°114
https://www.tripadvisor.com/RestaurantSearch-g187849-oa3390
Scraping page n°115
https://www.tripadvisor.com/RestaurantSearch-g187849-oa3420
Scraping page n°116
https://www.tripadvisor.com/RestaurantSearch-g187849-oa3450
Scraping page n°117
https://www.tripadvisor.com/RestaurantSearch-g187849-oa3480
Scraping page n°118
https://www.tripadvi

Scraping page n°209
https://www.tripadvisor.com/RestaurantSearch-g187849-oa6240
Scraping page n°210
https://www.tripadvisor.com/RestaurantSearch-g187849-oa6270
Scraping page n°211
https://www.tripadvisor.com/RestaurantSearch-g187849-oa6300
Scraping page n°212
https://www.tripadvisor.com/RestaurantSearch-g187849-oa6330
Scraping page n°213
https://www.tripadvisor.com/RestaurantSearch-g187849-oa6360
Scraping page n°214
https://www.tripadvisor.com/RestaurantSearch-g187849-oa6390
Scraping page n°215
https://www.tripadvisor.com/RestaurantSearch-g187849-oa6420
Scraping page n°216
https://www.tripadvisor.com/RestaurantSearch-g187849-oa6450
Scraping page n°217
https://www.tripadvisor.com/RestaurantSearch-g187849-oa6480
Scraping page n°218
https://www.tripadvisor.com/RestaurantSearch-g187849-oa6510
Scraping page n°219
https://www.tripadvisor.com/RestaurantSearch-g187849-oa6540
Scraping page n°220
https://www.tripadvisor.com/RestaurantSearch-g187849-oa6570
Scraping page n°221
https://www.tripadvi

INFO:root:Last restaurant reached


File created in current directory: Milan_TA_restaurants_raw.csv


INFO:root:Scraping Munich restaurants info


https://www.tripadvisor.com/Restaurants-g187309-Munich_Upper_Bavaria_Bavaria.html


INFO:root:100 pages to explore


Scraping page n°1
https://www.tripadvisor.com/RestaurantSearch-g187309-oa0
Scraping page n°2
https://www.tripadvisor.com/RestaurantSearch-g187309-oa30
Scraping page n°3
https://www.tripadvisor.com/RestaurantSearch-g187309-oa60
Scraping page n°4
https://www.tripadvisor.com/RestaurantSearch-g187309-oa90
Scraping page n°5
https://www.tripadvisor.com/RestaurantSearch-g187309-oa120
Scraping page n°6
https://www.tripadvisor.com/RestaurantSearch-g187309-oa150
Scraping page n°7
https://www.tripadvisor.com/RestaurantSearch-g187309-oa180
Scraping page n°8
https://www.tripadvisor.com/RestaurantSearch-g187309-oa210
Scraping page n°9
https://www.tripadvisor.com/RestaurantSearch-g187309-oa240
Scraping page n°10
https://www.tripadvisor.com/RestaurantSearch-g187309-oa270
Scraping page n°11
https://www.tripadvisor.com/RestaurantSearch-g187309-oa300
Scraping page n°12
https://www.tripadvisor.com/RestaurantSearch-g187309-oa330
Scraping page n°13
https://www.tripadvisor.com/RestaurantSearch-g187309-oa360


INFO:root:Last restaurant reached


File created in current directory: Munich_TA_restaurants_raw.csv


INFO:root:Scraping Edinburg restaurants info


https://www.tripadvisor.com/Restaurants-g186525-Edinburgh_Scotland.html


INFO:root:63 pages to explore


Scraping page n°1
https://www.tripadvisor.com/RestaurantSearch-g186525-oa0
Scraping page n°2
https://www.tripadvisor.com/RestaurantSearch-g186525-oa30
Scraping page n°3
https://www.tripadvisor.com/RestaurantSearch-g186525-oa60
Scraping page n°4
https://www.tripadvisor.com/RestaurantSearch-g186525-oa90
Scraping page n°5
https://www.tripadvisor.com/RestaurantSearch-g186525-oa120
Scraping page n°6
https://www.tripadvisor.com/RestaurantSearch-g186525-oa150
Scraping page n°7
https://www.tripadvisor.com/RestaurantSearch-g186525-oa180
Scraping page n°8
https://www.tripadvisor.com/RestaurantSearch-g186525-oa210
Scraping page n°9
https://www.tripadvisor.com/RestaurantSearch-g186525-oa240
Scraping page n°10
https://www.tripadvisor.com/RestaurantSearch-g186525-oa270
Scraping page n°11
https://www.tripadvisor.com/RestaurantSearch-g186525-oa300
Scraping page n°12
https://www.tripadvisor.com/RestaurantSearch-g186525-oa330
Scraping page n°13
https://www.tripadvisor.com/RestaurantSearch-g186525-oa360


INFO:root:Last restaurant reached


File created in current directory: Edinburg_TA_restaurants_raw.csv


INFO:root:Scraping Krakow restaurants info


https://www.tripadvisor.com/Restaurants-g274772-Krakow_Lesser_Poland_Province_Southern_Poland.html


INFO:root:46 pages to explore


Scraping page n°1
https://www.tripadvisor.com/RestaurantSearch-g274772-oa0
Scraping page n°2
https://www.tripadvisor.com/RestaurantSearch-g274772-oa30
Scraping page n°3
https://www.tripadvisor.com/RestaurantSearch-g274772-oa60
Scraping page n°4
https://www.tripadvisor.com/RestaurantSearch-g274772-oa90
Scraping page n°5
https://www.tripadvisor.com/RestaurantSearch-g274772-oa120
Scraping page n°6
https://www.tripadvisor.com/RestaurantSearch-g274772-oa150
Scraping page n°7
https://www.tripadvisor.com/RestaurantSearch-g274772-oa180
Scraping page n°8
https://www.tripadvisor.com/RestaurantSearch-g274772-oa210
Scraping page n°9
https://www.tripadvisor.com/RestaurantSearch-g274772-oa240
Scraping page n°10
https://www.tripadvisor.com/RestaurantSearch-g274772-oa270
Scraping page n°11
https://www.tripadvisor.com/RestaurantSearch-g274772-oa300
Scraping page n°12
https://www.tripadvisor.com/RestaurantSearch-g274772-oa330
Scraping page n°13
https://www.tripadvisor.com/RestaurantSearch-g274772-oa360


INFO:root:Last restaurant reached


File created in current directory: Krakow_TA_restaurants_raw.csv


In [6]:
scraper('Geneva')
scraper('Zurich')

INFO:root:Scraping Geneva restaurants info


https://www.tripadvisor.com/Restaurants-g188057-Geneva.html


INFO:root:53 pages to explore


Scraping page n°1
https://www.tripadvisor.com/RestaurantSearch-g188057-oa0
Scraping page n°2
https://www.tripadvisor.com/RestaurantSearch-g188057-oa30
Scraping page n°3
https://www.tripadvisor.com/RestaurantSearch-g188057-oa60
Scraping page n°4
https://www.tripadvisor.com/RestaurantSearch-g188057-oa90
Scraping page n°5
https://www.tripadvisor.com/RestaurantSearch-g188057-oa120
Scraping page n°6
https://www.tripadvisor.com/RestaurantSearch-g188057-oa150
Scraping page n°7
https://www.tripadvisor.com/RestaurantSearch-g188057-oa180
Scraping page n°8
https://www.tripadvisor.com/RestaurantSearch-g188057-oa210
Scraping page n°9
https://www.tripadvisor.com/RestaurantSearch-g188057-oa240
Scraping page n°10
https://www.tripadvisor.com/RestaurantSearch-g188057-oa270
Scraping page n°11
https://www.tripadvisor.com/RestaurantSearch-g188057-oa300
Scraping page n°12
https://www.tripadvisor.com/RestaurantSearch-g188057-oa330
Scraping page n°13
https://www.tripadvisor.com/RestaurantSearch-g188057-oa360


INFO:root:Last restaurant reached


File created in current directory: Geneva_TA_restaurants_raw.csv


INFO:root:Scraping Zurich restaurants info


https://www.tripadvisor.com/Restaurants-g188113-Zurich.html


INFO:root:56 pages to explore


Scraping page n°1
https://www.tripadvisor.com/RestaurantSearch-g188113-oa0
Scraping page n°2
https://www.tripadvisor.com/RestaurantSearch-g188113-oa30
Scraping page n°3
https://www.tripadvisor.com/RestaurantSearch-g188113-oa60
Scraping page n°4
https://www.tripadvisor.com/RestaurantSearch-g188113-oa90
Scraping page n°5
https://www.tripadvisor.com/RestaurantSearch-g188113-oa120
Scraping page n°6
https://www.tripadvisor.com/RestaurantSearch-g188113-oa150
Scraping page n°7
https://www.tripadvisor.com/RestaurantSearch-g188113-oa180
Scraping page n°8
https://www.tripadvisor.com/RestaurantSearch-g188113-oa210
Scraping page n°9
https://www.tripadvisor.com/RestaurantSearch-g188113-oa240
Scraping page n°10
https://www.tripadvisor.com/RestaurantSearch-g188113-oa270
Scraping page n°11
https://www.tripadvisor.com/RestaurantSearch-g188113-oa300
Scraping page n°12
https://www.tripadvisor.com/RestaurantSearch-g188113-oa330
Scraping page n°13
https://www.tripadvisor.com/RestaurantSearch-g188113-oa360


INFO:root:Last restaurant reached


File created in current directory: Zurich_TA_restaurants_raw.csv


Unnamed: 0,Cuisine Style,ID_TA,Name,Number of Reviews,Price Range,Ranking,Rating,Reviews,URL_TA
0,"[Italian, Pizza, Mediterranean, European, Vege...",d6212120,La Fonte,704,$$ - $$$,"#1 of 1,672 Restaurants in Zurich",4.5 of 5 bubbles,"[[Lovely!, If you love pizza, reserve your tab...",/Restaurant_Review-g188113-d6212120-Reviews-La...
0,"[Indian, Asian, Vegetarian Friendly, Vegan Opt...",d10169867,Tamarind Hill Indian Restaurant,418,$$ - $$$,"#2 of 1,672 Restaurants in Zurich",4.5 of 5 bubbles,"[[Fantastic, Little india], [01/10/2018, 01/08...",/Restaurant_Review-g188113-d10169867-Reviews-T...
0,"[Swiss, European, Central European, Vegetarian...",d1190632,Differente Hotel Krone Unterstrass,505,$$ - $$$,"#3 of 1,672 Restaurants in Zurich",4.5 of 5 bubbles,"[[New year's eve diner, Simply top quality], [...",/Restaurant_Review-g188113-d1190632-Reviews-Di...
0,"[Mediterranean, Swiss, European, Vegetarian Fr...",d697845,Haus zum Rueden,157,$$$$,"#4 of 1,672 Restaurants in Zurich",4.5 of 5 bubbles,"[[Great food in a great setting, Wow! Totally ...",/Restaurant_Review-g188113-d697845-Reviews-Hau...
0,"[French, Swiss, European, Central European, Ve...",d825545,Didi's Frieden,546,$$$$,"#5 of 1,672 Restaurants in Zurich",4.5 of 5 bubbles,"[[A must in zurich, Good efficient and friendl...",/Restaurant_Review-g188113-d825545-Reviews-Did...
0,"[Asian, Malaysian, Vegetarian Friendly, Vegan ...",d4745718,My Kitchen,361,$$ - $$$,"#6 of 1,672 Restaurants in Zurich",4.5 of 5 bubbles,"[[Sedap., Definitely to go place!], [01/05/201...",/Restaurant_Review-g188113-d4745718-Reviews-My...
0,"[Mediterranean, European, Central European, Ve...",d8051689,"Gustav Restaurant & Bar, Cafe & Confiserie",196,$$$$,"#7 of 1,672 Restaurants in Zurich",4.5 of 5 bubbles,"[[Delicious Surprise, Special treat], [12/28/2...",/Restaurant_Review-g188113-d8051689-Reviews-Gu...
0,"[Swiss, European, Central European, Vegetarian...",d802913,Kindli,492,$$$$,"#8 of 1,672 Restaurants in Zurich",4.5 of 5 bubbles,"[[Wonderful traditional Swiss restaurant, Grea...",/Restaurant_Review-g188113-d802913-Reviews-Kin...
0,"[Swiss, Vegetarian Friendly, Gluten Free Options]",d11899165,Raclette Factory - Rindermarkt,146,$$ - $$$,"#9 of 1,672 Restaurants in Zurich",4.5 of 5 bubbles,"[[Amazing place!, perfect for a lunch], [01/06...",/Restaurant_Review-g188113-d11899165-Reviews-R...
0,"[Indian, Swiss, European, Mediterranean, Asian...",d697846,Haus Hiltl,3085,$$ - $$$,"#10 of 1,672 Restaurants in Zurich",4.5 of 5 bubbles,"[[Amazing Place, Amazing restaurant], [01/10/2...",/Restaurant_Review-g188113-d697846-Reviews-Hau...
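
The `Ranking` and `Rating` columns above are stored as display strings. During curation they could be converted to numbers. The sketch below is one way to do it, under the assumption that every value follows the formats visible in this sample ("#1 of 1,672 Restaurants in Zurich", "4.5 of 5 bubbles"). The function names are hypothetical.

```python
import re

RANKING = re.compile(r'#([\d,]+) of ([\d,]+)')
RATING = re.compile(r'([\d.]+) of 5 bubbles')


def parse_ranking(text):
    """'#1 of 1,672 Restaurants in Zurich' -> (1, 1672); None if not parseable."""
    if not isinstance(text, str):  # missing values are NaN floats
        return None
    match = RANKING.search(text)
    if match is None:
        return None
    rank, total = (int(part.replace(',', '')) for part in match.groups())
    return rank, total


def parse_rating(text):
    """'4.5 of 5 bubbles' -> 4.5; None if not parseable."""
    if not isinstance(text, str):
        return None
    match = RATING.search(text)
    return float(match.group(1)) if match else None


print(parse_ranking('#10 of 1,672 Restaurants in Zurich'))
print(parse_rating('4.5 of 5 bubbles'))
```

Both functions return `None` for missing or unexpected values, so rows without a ranking or rating are not dropped silently.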


---

# Data Curation
## Exploration of raw datasets
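
One point to keep in mind when reloading the raw CSV files: a list-valued column such as `Cuisine Style` comes back as a string like `"['Pub', 'Gastropub']"`, not as a Python list. A minimal sketch of one way to restore the list uses `ast.literal_eval` from the standard library. `parse_list_cell` is a hypothetical helper, and returning an empty list for missing values is a design choice of this example.

```python
import ast


def parse_list_cell(cell):
    """Turn "['Pub', 'Gastropub']" back into a list; return [] otherwise."""
    if not isinstance(cell, str):
        return []  # NaN (a float) marks a missing value
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []  # not a valid Python literal
    return value if isinstance(value, list) else []


print(parse_list_cell("['Pub', 'Gastropub']"))
print(parse_list_cell(float('nan')))
```

On a loaded dataset this could be applied with `dataset['Cuisine Style'].apply(parse_list_cell)`. Note that a `Reviews` cell containing a bare `nan` is not a valid literal and would come back as an empty list with this helper.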

In [7]:
#Explore the datasets obtained from the scraper
raw_csv_files = glob2.glob('*raw.csv')
print (raw_csv_files)
print("{} files in the directory".format(len(raw_csv_files)))
for file in raw_csv_files:
    print('\n' + file)
    dataset = pd.read_csv(file, sep=',',  encoding="utf-8")
    print(dataset.info())
    print(dataset.head())
    print(dataset.tail())

    #Count the unique values in Price range and rating columns
    print(dataset['Price Range'].value_counts(dropna=False))
    print(dataset['Rating'].value_counts(dropna=False))

['Amsterdam_TA_restaurants_raw.csv', 'Athens_TA_restaurants_raw.csv', 'Barcelona_TA_restaurants_raw.csv', 'Berlin_TA_restaurants_raw.csv', 'Bratislava_TA_restaurants_raw.csv', 'Bruxelles_TA_restaurants_raw.csv', 'Budapest_TA_restaurants_raw.csv', 'Copenhagen_TA_restaurants_raw.csv', 'Dublin_TA_restaurants_raw.csv', 'Edinburg_TA_restaurants_raw.csv', 'Geneva_TA_restaurants_raw.csv', 'Helsinki_TA_restaurants_raw.csv', 'Koln_TA_restaurants_raw.csv', 'Krakow_TA_restaurants_raw.csv', 'Lisbon_TA_restaurants_raw.csv', 'Ljubljana_TA_restaurants_raw.csv', 'London_TA_restaurants_raw.csv', 'Luxembourg_TA_restaurants_raw.csv', 'Madrid_TA_restaurants_raw.csv', 'Marseille_TA_restaurants_raw.csv', 'Milan_TA_restaurants_raw.csv', 'Munich_TA_restaurants_raw.csv', 'Oporto_TA_restaurants_raw.csv', 'Oslo_TA_restaurants_raw.csv', 'Paris_TA_restaurants_raw.csv', 'Prague_TA_restaurants_raw.csv', 'Rome_TA_restaurants_raw.csv', 'Stockholm_TA_restaurants_raw.csv', 'Toulouse_TA_restaurants_raw.csv', 'Vienna_TA_r

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8425 entries, 0 to 8424
Data columns (total 10 columns):
Unnamed: 0           8425 non-null int64
Cuisine Style        6388 non-null object
ID_TA                8425 non-null object
Name                 8425 non-null object
Number of Reviews    7264 non-null object
Price Range          5407 non-null object
Ranking              7795 non-null object
Rating               7793 non-null object
Reviews              7793 non-null object
URL_TA               8425 non-null object
dtypes: int64(1), object(9)
memory usage: 658.3+ KB
None
   Unnamed: 0                                      Cuisine Style      ID_TA  \
0           0  ['International', 'Mediterranean', 'Fusion', '...   d8003030   
1           0  ['International', 'Mediterranean', 'Spanish', ...   d8531409   
2           0  ['Mediterranean', 'European', 'Spanish', 'Vege...   d3353672   
3           0  ['Mediterranean', 'European', 'Spanish', 'Vege...  d12643500   
4           0  ['Medit

      Unnamed: 0                    Cuisine Style      ID_TA  \
1065           0             ['Pub', 'Gastropub']  d13289864   
1066           0  ['Sushi', 'Thai', 'Vietnamese']  d13324256   
1067           0                   ['Bar', 'Pub']  d13330002   
1068           0           ['Bar', 'Cafe', 'Pub']  d13336292   
1069           0                     ['Barbecue']  d13355257   

                       Name Number of Reviews Price Range Ranking Rating  \
1065  Bratislavska Kozlovna               NaN    $$ - $$$     NaN    NaN   
1066             Bamboo Snp               NaN    $$ - $$$     NaN    NaN   
1067       mini BAR by SPIN               NaN           $     NaN    NaN   
1068           Shisha Cubes               NaN        $$$$     NaN    NaN   
1069             Pivarnicka               NaN           $     NaN    NaN   

     Reviews                                             URL_TA  
1065     NaN  /Restaurant_Review-g274924-d13289864-Reviews-B...  
1066     NaN  /Restaurant_

      Unnamed: 0                                      Cuisine Style  \
2104           0                      ['Italian', 'Pizza', 'Grill']   
2105           0                       ['European', 'Scandinavian']   
2106           0                             ['European', 'Danish']   
2107           0                             ['European', 'Danish']   
2108           0  ['International', 'Mediterranean', 'European',...   

          ID_TA                      Name Number of Reviews Price Range  \
2104  d13347420         Pizzaport Femoren               NaN    $$ - $$$   
2105  d13347425  Shabaz Kaffebar & Kokken               NaN           $   
2106  d13347436               Natur Torst               NaN    $$ - $$$   
2107  d13347443            Yolk Kobenhavn               NaN         NaN   
2108  d13347451                     Sonny               NaN    $$ - $$$   

     Ranking Rating Reviews                                             URL_TA  
2104     NaN    NaN     NaN  /Restaurant_

   Unnamed: 0                                      Cuisine Style      ID_TA  \
0           0  ['French', 'European', 'Vegetarian Friendly', ...   d2437056   
1           0  ['Japanese', 'Sushi', 'Asian', 'Vegetarian Fri...   d5049938   
2           0  ['French', 'European', 'Vegetarian Friendly', ...    d697832   
3           0  ['French', 'European', 'Central European', 'Ve...    d697864   
4           0  ['French', 'European', 'Contemporary', 'Vegeta...  d12173937   

                     Name Number of Reviews Price Range  \
0                 Bayview               383        $$$$   
1                   Izumi               509        $$$$   
2  Bistrot du Boeuf Rouge               578    $$ - $$$   
3           Le Chat-Botte               369        $$$$   
4                Intensus                68        $$$$   

                             Ranking            Rating  \
0  #1 of 1,578 Restaurants in Geneva  4.5 of 5 bubbles   
1  #2 of 1,578 Restaurants in Geneva  4.5 of 5 bubbles

None
   Unnamed: 0                                      Cuisine Style     ID_TA  \
0           0  ['African', 'Vegetarian Friendly', 'Vegan Opti...  d2722617   
1           0  ['Barbecue', 'Asian', 'Korean', 'Grill', 'Vege...  d2182710   
2           0  ['Italian', 'Mediterranean', 'European', 'Vege...   d718264   
3           0  ['Diner', 'German', 'Brew Pub', 'European', 'C...   d695516   
4           0  ['German', 'Bar', 'European', 'Pub', 'Central ...   d966510   

                                       Name Number of Reviews Price Range  \
0               Shaka Zulu Restaurant & Bar               456    $$ - $$$   
1                              Bulgogi-Haus               406    $$ - $$$   
2                                 Pasta Bar               472    $$ - $$$   
3  Restaurant Gaststaette Bei Oma Kleinmann             1,013    $$ - $$$   
4                               Lommerzheim               780           $   

                              Ranking            Rating  \
0  #

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18212 entries, 0 to 18211
Data columns (total 10 columns):
Unnamed: 0           18212 non-null int64
Cuisine Style        14504 non-null object
ID_TA                18212 non-null object
Name                 18212 non-null object
Number of Reviews    15352 non-null object
Price Range          12163 non-null object
Ranking              16434 non-null object
Rating               16435 non-null object
Reviews              16436 non-null object
URL_TA               18212 non-null object
dtypes: int64(1), object(9)
memory usage: 1.4+ MB
None
   Unnamed: 0                                      Cuisine Style      ID_TA  \
0           0  ['Cafe', 'Middle Eastern', 'Persian', 'Vegetar...  d10517849   
1           0  ['Italian', 'Mediterranean', 'Vegetarian Frien...  d13149344   
2           0      ['Seafood', 'British', 'Gluten Free Options']  d12518072   
3           0  ['Mediterranean', 'European', 'Turkish', 'Midd...  d10444968   
4           

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9543 entries, 0 to 9542
Data columns (total 10 columns):
Unnamed: 0           9543 non-null int64
Cuisine Style        6402 non-null object
ID_TA                9543 non-null object
Name                 9543 non-null object
Number of Reviews    8232 non-null object
Price Range          4969 non-null object
Ranking              8803 non-null object
Rating               8812 non-null object
Reviews              8814 non-null object
URL_TA               9543 non-null object
dtypes: int64(1), object(9)
memory usage: 745.6+ KB
None
   Unnamed: 0                                      Cuisine Style      ID_TA  \
0           0  ['Mediterranean', 'European', 'Spanish', 'Cont...  d10428302   
1           0  ['International', 'European', 'Spanish', 'Cont...   d6884911   
2           0  ['International', 'Mediterranean', 'European',...   d2006562   
3           0  ['International', 'Mediterranean', 'European',...  d11896546   
4           0         

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2995 entries, 0 to 2994
Data columns (total 10 columns):
Unnamed: 0           2995 non-null int64
Cuisine Style        2039 non-null object
ID_TA                2995 non-null object
Name                 2995 non-null object
Number of Reviews    2545 non-null object
Price Range          1751 non-null object
Ranking              2744 non-null object
Rating               2743 non-null object
Reviews              2744 non-null object
URL_TA               2995 non-null object
dtypes: int64(1), object(9)
memory usage: 234.1+ KB
None
   Unnamed: 0                                      Cuisine Style     ID_TA  \
0           0  ['Austrian', 'European', 'Central European', '...   d692576   
1           0  ['German', 'European', 'Diner', 'Central Europ...  d7736677   
2           0  ['Seafood', 'Mediterranean', 'Greek', 'Grill',...  d2660543   
3           0  ['German', 'Bar', 'European', 'Vegetarian Frie...  d1344151   
4           0  ['Seafood', 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1213 entries, 0 to 1212
Data columns (total 10 columns):
Unnamed: 0           1213 non-null int64
Cuisine Style        914 non-null object
ID_TA                1213 non-null object
Name                 1213 non-null object
Number of Reviews    1074 non-null object
Price Range          779 non-null object
Ranking              1137 non-null object
Rating               1138 non-null object
Reviews              1138 non-null object
URL_TA               1213 non-null object
dtypes: int64(1), object(9)
memory usage: 94.8+ KB
None
   Unnamed: 0                                      Cuisine Style     ID_TA  \
0           0  ['Contemporary', 'European', 'Scandinavian', '...   d805342   
1           0  ['European', 'Scandinavian', 'Norwegian', 'Veg...  d2447055   
2           0  ['French', 'European', 'Scandinavian', 'Norweg...  d1045646   
3           0  ['Italian', 'Mediterranean', 'European', 'Vege...  d1512099   
4           0  ['French', 'Eur

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4859 entries, 0 to 4858
Data columns (total 10 columns):
Unnamed: 0           4859 non-null int64
Cuisine Style        4032 non-null object
ID_TA                4859 non-null object
Name                 4859 non-null object
Number of Reviews    3711 non-null object
Price Range          2674 non-null object
Ranking              4172 non-null object
Rating               4172 non-null object
Reviews              4172 non-null object
URL_TA               4859 non-null object
dtypes: int64(1), object(9)
memory usage: 379.7+ KB
None
   Unnamed: 0                                      Cuisine Style      ID_TA  \
0           0  ['Italian', 'Mediterranean', 'European', 'Wine...   d8374547   
1           0  ['Indian', 'Vegetarian Friendly', 'Vegan Optio...  d10920105   
2           0  ['International', 'Bar', 'Vegetarian Friendly'...   d6494862   
3           0  ['International', 'European', 'Vegetarian Frie...  d10299686   
4           0  ['Inter

2704  /Restaurant_Review-g189852-d13360173-Reviews-M...  
NaN         1290
$$ - $$$    1164
$            173
$$$$          78
Name: Price Range, dtype: int64
4 of 5 bubbles      856
3.5 of 5 bubbles    556
4.5 of 5 bubbles    528
NaN                 242
3 of 5 bubbles      238
5 of 5 bubbles      164
2.5 of 5 bubbles     62
2 of 5 bubbles       42
1.5 of 5 bubbles      9
1 of 5 bubbles        8
Name: Rating, dtype: int64

Toulouse_TA_restaurants_raw.csv
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1606 entries, 0 to 1605
Data columns (total 10 columns):
Unnamed: 0           1606 non-null int64
Cuisine Style        1046 non-null object
ID_TA                1606 non-null object
Name                 1606 non-null object
Number of Reviews    1412 non-null object
Price Range          831 non-null object
Ranking              1483 non-null object
Rating               1481 non-null object
Reviews              1481 non-null object
URL_TA               1606 non-null object
dtypes: int64(1),

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1667 entries, 0 to 1666
Data columns (total 10 columns):
Unnamed: 0           1667 non-null int64
Cuisine Style        1288 non-null object
ID_TA                1667 non-null object
Name                 1667 non-null object
Number of Reviews    1494 non-null object
Price Range          1121 non-null object
Ranking              1596 non-null object
Rating               1595 non-null object
Reviews              1595 non-null object
URL_TA               1667 non-null object
dtypes: int64(1), object(9)
memory usage: 130.3+ KB
None
   Unnamed: 0                                      Cuisine Style      ID_TA  \
0           0  ['Italian', 'Pizza', 'Mediterranean', 'Europea...   d6212120   
1           0  ['Indian', 'Asian', 'Vegetarian Friendly', 'Ve...  d10169867   
2           0  ['Swiss', 'European', 'Central European', 'Veg...   d1190632   
3           0  ['Mediterranean', 'Swiss', 'European', 'Vegeta...    d697845   
4           0  ['Frenc

#### Problems to fix in the raw datasets:
- Every column except the unnamed one is of object type:
    - Price Range and Rating are categorical variables, taking 3 and 9 values respectively (NaN not included)
    - Rating, Ranking and Number of Reviews should be numerical data
- There is no meaningful row index; the rank is a candidate replacement
- The 'Unnamed: 0' column should be deleted
- The columns should be reordered
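
The string formats seen in the output above ('4.5 of 5 bubbles', '#9 of 1,672 Restaurants in Zurich', '1,013') can be turned into numbers with a few small helpers. The sketch below is only an illustration: the function names `parse_rating`, `parse_ranking` and `parse_count` are hypothetical, and it assumes every non-missing cell follows exactly the formats shown in the raw output.

```python
import math
import re

def parse_rating(text):
    """'4.5 of 5 bubbles' -> 4.5; anything else (e.g. NaN) -> NaN."""
    match = re.match(r"\s*(\d+(?:\.\d+)?) of 5 bubbles", str(text))
    return float(match.group(1)) if match else math.nan

def parse_ranking(text):
    """'#9 of 1,672 Restaurants in Zurich' -> 9.0; anything else -> NaN."""
    match = re.match(r"\s*#([\d,]+) of ", str(text))
    return float(match.group(1).replace(",", "")) if match else math.nan

def parse_count(text):
    """'1,013' -> 1013.0; anything that is not a plain integer string -> NaN."""
    cleaned = str(text).replace(",", "").strip()
    return float(cleaned) if cleaned.isdigit() else math.nan

print(parse_rating("4.5 of 5 bubbles"))
print(parse_ranking("#9 of 1,672 Restaurants in Zurich"))
print(parse_count("1,013"))
print(parse_rating(float("nan")))
```

Each helper can be passed to `Series.apply` on the matching column. Matching the pattern explicitly, rather than slicing a fixed number of characters, means an unexpected format yields NaN instead of a silently wrong number.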

## Dataset Curation & Aggregated Dataset Creation

Curates the raw datasets so that:
- number of reviews, rating and rank hold usable numerical values
- price range is a categorical type
- lists are ready to be parsed
- rows are in order and the unused first column ('Unnamed: 0') is deleted
- a column with the city name is added for further concatenation

==> Creates a global dataset by concatenating the curated datasets

Using the rank of the restaurant as the index is not possible, as there are many NaN values at the end of the table for each city.
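
The 'lists ready to be parsed' item refers to columns such as 'Cuisine Style', which come back from the CSV as strings like "['Pub', 'Gastropub']". A minimal sketch of how such a cell could be parsed later is given below; the helper name `parse_list_cell` is hypothetical, and the choice to return an empty list for missing or malformed cells is an assumption, not something the notebook prescribes.

```python
import ast

def parse_list_cell(cell):
    """Turn the string "['Pub', 'Gastropub']" into a real Python list.

    Non-string cells (e.g. NaN) and malformed strings give an empty list.
    """
    if not isinstance(cell, str):
        return []
    try:
        value = ast.literal_eval(cell)  # evaluates literals only, never arbitrary code
    except (ValueError, SyntaxError):
        return []
    return value if isinstance(value, list) else []

print(parse_list_cell("['Pub', 'Gastropub']"))
print(parse_list_cell(float("nan")))
```

`ast.literal_eval` is used rather than `eval` because the cell content comes from scraped pages and should not be executed as code.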

In [11]:
#Curates all the raw.csv files in the current directory and creates a new curated & aggregated dataset
raw_csv_files = glob2.glob('*raw.csv')
print("{} cities raw datasets available".format(len(raw_csv_files)))
curated_dataset = pd.DataFrame()

#Curates all the raw files from scraper
for file in raw_csv_files:
    city = file[:file.find('_')]
    print(city + ': Curating ' + file)
    dataset = pd.read_csv(file, sep=',',  encoding="utf-8")
    
    #broadcast the city name in the dataset
    dataset['City'] = city
    
    #Remove the unused first column 'Unnamed: 0'
    dataset = dataset.drop('Unnamed: 0', axis=1)
    
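    #Rating column into numerical data by slicing off ' of 5 bubbles' (13 characters)
    dataset['Rating'] = dataset['Rating'].apply(lambda x: str(x)[:-13])
    #errors='coerce' turns unparseable values into NaN, consistent with the other columns
    dataset['Rating'] = pd.to_numeric(dataset['Rating'], errors='coerce')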
    
    #Ranking column into numerical by slicing '#' and ' of xx restaurants in city'
    dataset['Ranking'] = dataset['Ranking'].apply(lambda x: str(x)[1:(str(x).find(' of'))].replace(',',''))
    dataset['Ranking'] = pd.to_numeric(dataset['Ranking'], errors='coerce')
    
    #Number of reviews column into numerical
    dataset['Number of Reviews'] =  dataset['Number of Reviews'].apply(lambda x: str(x).replace(',',''))
    dataset['Number of Reviews'] = pd.to_numeric(dataset['Number of Reviews'], errors='coerce')
    
    #Price range as categorical type:
    dataset['Price Range'] = dataset['Price Range'].astype('category')
    
    #Re-order columns
    dataset = dataset[['Name', 'City', 'Cuisine Style', 'Ranking', 'Rating', 
                       'Price Range', 'Number of Reviews', 'Reviews', 'URL_TA', 'ID_TA']]
    
    #Create a curated csv file
    dataset.to_csv(file.replace('raw', 'curated'),  sep=',', encoding="utf-8")
    print('Curated dataset created')

    #Append the curated dataset with the new curated dataset of the current city
    curated_dataset = pd.concat([curated_dataset, dataset])
    
#Categorize the city column data
curated_dataset['City'] = curated_dataset['City'].astype('category')

#Create the aggregated curated csv file
curated_dataset.to_csv('TA_restaurants_curated.csv', sep=',', encoding="utf-8")
logging.info('Aggregated curated dataset created')
print(curated_dataset.info())

32 cities raw datasets available
Amsterdam: Curating Amsterdam_TA_restaurants_raw.csv
Curated dataset created
Athens: Curating Athens_TA_restaurants_raw.csv
Curated dataset created
Barcelona: Curating Barcelona_TA_restaurants_raw.csv
Curated dataset created
Berlin: Curating Berlin_TA_restaurants_raw.csv
Curated dataset created
Bratislava: Curating Bratislava_TA_restaurants_raw.csv
Curated dataset created
Bruxelles: Curating Bruxelles_TA_restaurants_raw.csv
Curated dataset created
Budapest: Curating Budapest_TA_restaurants_raw.csv
Curated dataset created
Copenhagen: Curating Copenhagen_TA_restaurants_raw.csv
Curated dataset created
Dublin: Curating Dublin_TA_restaurants_raw.csv
Curated dataset created
Edinburg: Curating Edinburg_TA_restaurants_raw.csv
Curated dataset created
Geneva: Curating Geneva_TA_restaurants_raw.csv
Curated dataset created
Helsinki: Curating Helsinki_TA_restaurants_raw.csv
Curated dataset created
Koln: Curating Koln_TA_restaurants_raw.csv
Curated dataset created
Kr

INFO:root:Aggregated curated dataset created


<class 'pandas.core.frame.DataFrame'>
Int64Index: 124995 entries, 0 to 1666
Data columns (total 10 columns):
Name                 124995 non-null object
City                 124995 non-null category
Cuisine Style        94046 non-null object
Ranking              115645 non-null float64
Rating               115658 non-null float64
Price Range          77555 non-null category
Number of Reviews    108020 non-null float64
Reviews              115673 non-null object
URL_TA               124995 non-null object
ID_TA                124995 non-null object
dtypes: category(2), float64(3), object(5)
memory usage: 8.8+ MB
None
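
The index line of the summary above ('124995 entries, 0 to 1666') suggests that the row labels restart at 0 for each city, so the aggregated index is not unique. If a unique index is wanted, `pd.concat` accepts `ignore_index=True`. The sketch below uses two tiny made-up frames (`city_a`, `city_b`), not the real data.

```python
import pandas as pd

# Two small stand-ins for per-city curated datasets (made-up values)
city_a = pd.DataFrame({"Name": ["A1", "A2"], "City": "Geneva"})
city_b = pd.DataFrame({"Name": ["B1", "B2", "B3"], "City": "Zurich"})

# Default behaviour: each frame keeps its own labels, so 0 and 1 appear twice
stacked = pd.concat([city_a, city_b])

# ignore_index=True discards the old labels and numbers the rows 0..n-1
renumbered = pd.concat([city_a, city_b], ignore_index=True)

print(stacked.index.is_unique)
print(renumbered.index.is_unique)
```

With a unique index, label-based selection such as `.loc[0]` returns a single row instead of one row per city.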
