
### Hostelworld web scraping

I'm starting a project that works with hostel data. I decided to scrape one of the biggest hostel portals in the world, [Hostelworld](https://www.hostelworld.com), and in this notebook I collect data on hostels in my favorite city, Berlin.

Goal: scrape all hostels in Berlin and collect the following for each one:

- Name
- Link
- Distance from centre (km)
- Average rating
- Number of reviews
- Average price per night (USD)

Web address for scraping: https://www.hostelworld.com/hostels/Berlin

Open question: how to detect the number of result pages automatically. For now, the loop further down uses a fixed range of pages.

In [0]:
from requests import get
from bs4 import BeautifulSoup

import pandas as pd
import numpy as np

import time

import re

In [0]:
url = 'https://www.hostelworld.com/hostels/Berlin'
response = get(url)

# create soup
soup = BeautifulSoup(response.text, 'html.parser')

# collect the individual containers; each one holds the information for a single hostel
holstel_containers = soup.find_all(class_='fabresult rounded clearfix hwta-property')

In [4]:
# how many hostels on the page
len(holstel_containers)

30

I find it easiest to explore a single container first to work out how to extract each field. Once I know how to find what I need, I can write a loop that repeats the same steps for every container.

In [5]:
first_hostel = holstel_containers[0]
print(first_hostel.prettify())

<div class="fabresult rounded clearfix hwta-property" data-id="11286" data-name="EastSeven Berlin Hostel" id="searchResults_11286" url="https://www.hostelworld.com/hosteldetails.php/EastSeven-Berlin-Hostel/Berlin/11286">
 <div class="fab-carousel-skeleton carousel-skeleton">
  <div class="fab-carousel-container small-12 medium-5 large-3 columns rounded" data-images="https://a.hwstatic.com/propertyimages/1/11286/5025.jpg,https://a.hwstatic.com/propertyimages/1/11286/501.jpg,https://a.hwstatic.com/propertyimages/1/11286/506.jpg,https://a.hwstatic.com/propertyimages/1/11286/507.jpg,https://a.hwstatic.com/propertyimages/1/11286/5010.jpg,https://a.hwstatic.com/propertyimages/1/11286/5012.jpg,https://a.hwstatic.com/propertyimages/1/11286/5013.jpg,https://a.hwstatic.com/propertyimages/1/11286/5014.jpg,https://a.hwstatic.com/propertyimages/1/11286/5015.jpg,https://a.hwstatic.com/propertyimages/1/11286/5016.jpg,https://a.hwstatic.com/propertyimages/1/11286/5017.jpg,https://a.hwstatic.com/proper

In [6]:
# Hostel name
first_hostel.h2.a.text

'EastSeven Berlin Hostel'

In [7]:
# hostel link
first_hostel.h2.a.get('href')

'https://www.hostelworld.com/hosteldetails.php/EastSeven-Berlin-Hostel/Berlin/11286'

In [8]:
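# distance from city centre in km
# note: the fixed slice [12:18] assumes the distance always starts at character 12 of the address line
first_hostel.find(class_= "addressline").text[12:18].replace('k','').replace('m','').strip()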

'1.2'
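The fixed slice works for this hostel, but it leaves stray characters when the distance has fewer digits. A minimal sketch of a regex-based alternative follows. The helper name `parse_distance_km` and the sample strings are hypothetical, and the pattern assumes the address line contains a number directly followed by "km".

```python
import re

# Assumption: the address line contains a number followed by "km",
# for example "1.2km from city centre". The real page text may differ.
_DISTANCE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*km")

def parse_distance_km(address_text):
    """Return the distance in km as a float, or None if nothing matches."""
    match = _DISTANCE_RE.search(address_text)
    return float(match.group(1)) if match else None

# Made-up sample strings, for illustration only
print(parse_distance_km("Hostel - 1.2km from city centre"))  # 1.2
print(parse_distance_km("Hostel - 1km from city centre"))    # 1.0
```

Because the pattern looks for the number itself rather than a character position, it does not depend on how long the text before the distance is.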

In [9]:
# average rating
first_hostel.find(class_='hwta-rating-score').text.replace('\n', '').strip()

'9.5'

In [10]:
# number of reviews
first_hostel.find(class_="hwta-rating-counter").text.replace('\n', '').strip()

'6940'

In [11]:
# average price per night in US$ ([3:] drops the three-character currency prefix)
first_hostel.find(class_= "price").text.replace('\n', '').strip()[3:]

'19.13'
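Each lookup above assumes the element exists in the container. `find` returns `None` when nothing matches, so calling `.text` on the result would raise an `AttributeError` for a hostel that lacks, say, a rating. Below is a small sketch of a defensive helper. The name `safe_text` and the HTML fragment are made up for illustration, and whether Hostelworld actually omits these elements is an assumption.

```python
from bs4 import BeautifulSoup

def safe_text(container, class_name, default=None):
    """Return the stripped text of the first element with class_name, or default."""
    element = container.find(class_=class_name)
    return element.get_text(strip=True) if element is not None else default

# Made-up HTML fragment: the second hostel has no rating element
html = """
<div class="hostel"><div class="hwta-rating-score"> 9.5 </div></div>
<div class="hostel"></div>
"""
soup = BeautifulSoup(html, "html.parser")
containers = soup.find_all(class_="hostel")

for container in containers:
    print(safe_text(container, "hwta-rating-score"))
```

Returning a default instead of raising keeps all the result lists the same length, which matters when they are later combined into one DataFrame.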

Now I'll apply the same logic to every container on each results page:

In [0]:
# first, create the empty lists
hostel_names= []
hostel_links= []
hostel_distance= []
hostel_ratings= []
hostel_reviews= []
hostel_prices= []

for page in range(1, 4): # iterate over result pages 1 to 3 (hard-coded) and build the containers
  url = 'https://www.hostelworld.com/hostels/Berlin?page=' + str(page)
  response = get(url)
  soup = BeautifulSoup(response.text, 'html.parser')
  holstel_containers = soup.find_all(class_='fabresult rounded clearfix hwta-property')

  for container in holstel_containers: # iterate over the results on each page, one container per hostel
    hostel_names.append(container.h2.a.text)
    hostel_links.append(container.h2.a.get('href'))
    hostel_distance.append(container.find(class_="addressline").text[12:18].replace('k','').replace('m','').strip())
    hostel_ratings.append(container.find(class_='hwta-rating-score').text.replace('\n', '').strip())
    hostel_reviews.append(container.find(class_="hwta-rating-counter").text.replace('\n', '').strip())
    hostel_prices.append(container.find(class_="price").text.replace('\n', '').strip()[3:])
  time.sleep(2) # pause between requests to avoid overloading the website
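The page range in the loop above is fixed at pages 1 to 3. One possible way to avoid hard-coding it is to keep requesting pages until one comes back with no results. The sketch below illustrates that idea with a stand-in for the real request. `scrape_all_pages` and `fetch_containers` are hypothetical names, and the notebook does not confirm that Hostelworld returns an empty result list past the last page, so treat that as an assumption.

```python
import time

def scrape_all_pages(fetch_containers, max_pages=50, delay_seconds=0.0):
    """Collect containers page by page until a page comes back empty.

    fetch_containers is a hypothetical callable that takes a page number
    and returns the list of hostel containers found on that page.
    """
    all_containers = []
    for page in range(1, max_pages + 1):  # max_pages is a safety cap
        containers = fetch_containers(page)
        if not containers:  # assumption: a page past the end has no results
            break
        all_containers.extend(containers)
        time.sleep(delay_seconds)  # pause between requests
    return all_containers

# Stand-in for the request and parsing step: three pages of fake results
fake_site = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}
results = scrape_all_pages(lambda page: fake_site.get(page, []))
print(len(results))  # 5
```

In the notebook, `fetch_containers` would wrap the `get` and `find_all` calls, and `delay_seconds` would be set to 2 to match the existing pause.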

In [13]:
hw_berlin = pd.DataFrame({
    'hostel_name': hostel_names,
    'distance_centre_km': hostel_distance,
    'average_rating': hostel_ratings,
    'number_reviews': hostel_reviews,
    'average_price_usd': hostel_prices,
    'hw_link': hostel_links
})

hw_berlin.head()

| | hostel_name | distance_centre_km | average_rating | number_reviews | average_price_usd | hw_link |
|---|---|---|---|---|---|---|
| 0 | EastSeven Berlin Hostel | 1.2 | 9.5 | 6940 | 19.13 | https://www.hostelworld.com/hosteldetails.php/... |
| 1 | Industriepalast Hostel Berlin | 3.3 | 8.6 | 1608 | 14.34 | https://www.hostelworld.com/hosteldetails.php/... |
| 2 | PLUS Berlin | 3.4 | 9.1 | 16510 | 12.36 | https://www.hostelworld.com/hosteldetails.php/... |
| 3 | Circus Hostel | 1 fr | 9.3 | 5350 | 21.63 | https://www.hostelworld.com/hosteldetails.php/... |
| 4 | ONE80º Hostel - Alexanderplatz | 0.6 | 8.4 | 4347 | 13.68 | https://www.hostelworld.com/hosteldetails.php/... |


In [14]:
# remove non-numeric characters from the distance_centre_km column (e.g. '1 fr' becomes '1')

hw_berlin.distance_centre_km = [re.sub('[^0-9.]','', x) for x in hw_berlin.distance_centre_km]

hw_berlin.distance_centre_km.head(-5)

0      1.2
1      3.3
2      3.4
3        1
4      0.6
      ... 
78     6.4
79     7.1
80    12.9
81    16.3
82     1.9
Name: distance_centre_km, Length: 83, dtype: object
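The output shows `dtype: object`, so the cleaned values are still strings. If numeric work is planned (sorting, averages, plots), one option is `pd.to_numeric`. The sketch below uses a few values copied from the output above rather than the full column.

```python
import pandas as pd

# Sample values taken from the cleaned column (still strings at this point)
distances = pd.Series(["1.2", "3.3", "3.4", "1", "0.6"], name="distance_centre_km")

# errors="coerce" turns anything unparseable into NaN instead of raising
distances_num = pd.to_numeric(distances, errors="coerce")

print(distances_num.dtype)  # float64
print(distances_num.mean())
```

The same call could be applied to the rating, review count and price columns, which are also scraped as text.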

Finally, let's save the DataFrame as a CSV file:

In [0]:
hw_berlin.to_csv('hw_berlin.csv')
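By default `to_csv` also writes the row index as an unnamed first column, which appears as `Unnamed: 0` when the file is read back with `read_csv`. If that column is not wanted, passing `index=False` leaves it out. A small sketch with a one-row stand-in DataFrame and an in-memory buffer instead of a file on disk:

```python
import io
import pandas as pd

# One-row stand-in for hw_berlin
df = pd.DataFrame({
    "hostel_name": ["EastSeven Berlin Hostel"],
    "average_rating": ["9.5"],
})

buffer = io.StringIO()          # in-memory stand-in for a file on disk
df.to_csv(buffer, index=False)  # index=False omits the row-index column
buffer.seek(0)

round_trip = pd.read_csv(buffer)
print(list(round_trip.columns))  # ['hostel_name', 'average_rating']
```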