

![](images/pune.jpg)

# [Scraping the goibibo website for hotels in Pune and their prices](https://www.goibibo.com/hotels/hotels-in-pune-ct/)


Before scraping any website, check its __robots.txt__ file (also known as the robots exclusion protocol). It tells crawlers which pages or details they should not crawl or scrape.

![](images/robots.PNG)
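The same check can be done in code with the standard library's `urllib.robotparser`. The sketch below is only an illustration: the rules in `SAMPLE_ROBOTS_TXT` and the `example.com` URLs are made up so that it runs offline, and `is_allowed` is a hypothetical helper name. They are not goibibo's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (NOT goibibo's real file), kept inline
# so the example runs without a network connection.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /hotels/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "*", "https://www.example.com/hotels/pune/"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "*", "https://www.example.com/private/data"))
```

For a live site, `RobotFileParser` can also download the file itself via `set_url()` followed by `read()`.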

In [1]:
# importing required libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### Reading the web page into Python

In [2]:
# target URL to scrape
url = "https://www.goibibo.com/hotels/hotels-in-pune-ct/"

In [3]:
r = requests.get(url)

The code above fetches our web page from the URL and stores the result in a "response" object called `r`. That response object has a `text` attribute, which contains the same HTML we see when viewing the page source in the Chrome web browser.
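A request can fail or hang, so it can help to wrap the fetch in a small function that sets a timeout and checks the HTTP status. This is a minimal sketch, not part of the original notebook: `fetch_html` is a hypothetical helper name, and the `User-Agent` value and 10-second timeout are arbitrary placeholders.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML, raising on HTTP errors."""
    # Placeholder User-Agent (an assumption); replace it with something
    # that identifies your own scraper.
    headers = {"User-Agent": "example-scraper/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of silently
    # returning an error page.
    response.raise_for_status()
    return response.text


# Usage (performs a real network request, so it is left commented out):
# html = fetch_html("https://www.goibibo.com/hotels/hotels-in-pune-ct/")
```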

In [4]:
# print the first 500 characters of the HTML
print(r.text[0:500])

<!DOCTYPE html><html><head><title>Hotels in Pune - Book 933 Pune Hotels with 𝘂𝗽𝘁𝗼 𝟱𝟬% off @ ₹184</title><meta charset="utf-8"/><meta content="IE=edge" http-equiv="X-UA-Compatible"/><meta content="width=device-width, initial-scale=1.0" name="viewport"/><meta content="Best Pune Hotels with upto 50% off from Goibibo. Check 111003 reviews and 56982 photos for 933  Pune Hotels. Use coupon code GETSETGO and grab best deals starting from  @ ₹184 on Pune online hotel booking. ✔ Lowest Price Guarantee ✔ 


#### Parsing the HTML using Beautiful Soup

In [5]:
soup = BeautifulSoup(r.text, 'lxml')

The code above parses the HTML (stored in r.text) into a special object called soup that the Beautiful Soup library understands. In other words, __Beautiful Soup is reading the HTML and making sense of its structure__.

(Note that __lxml__ is a third-party parser that has to be installed separately, for example with `pip install lxml`. The parser included with the Python standard library is `html.parser`, and Beautiful Soup can use either one. See [differences between parsers](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers) to learn more.)
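If it is not certain that lxml is installed, one option is to fall back to the built-in parser. The sketch below assumes that behaviour is acceptable for the page being scraped; `make_soup` is a hypothetical helper name. Beautiful Soup raises `FeatureNotFound` when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html: str) -> BeautifulSoup:
    """Parse html with lxml if available, otherwise with html.parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # lxml is not installed; use the parser that ships with Python.
        return BeautifulSoup(html, "html.parser")


demo_soup = make_soup("<p>Hotels in Pune</p>")
print(demo_soup.p.text)
```

The two parsers can build slightly different trees from malformed HTML, so it is safer to stick with one parser once the selectors have been worked out.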

#### Collecting all of the records

We can take advantage of the patterns in the page's HTML to build our dataset. __Each record (hotel) is tagged in a consistent way in the HTML. This is the pattern that allows us to build our dataset__.

The Beautiful Soup methods required for this task are:

1. find()
2. find_all()

There is an excellent tutorial on these methods [(Searching the tree)](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree) in the Beautiful Soup documentation.

## Determining the pattern

![](images/goibibo_hotels.PNG)

In [6]:
results = soup.find_all('div', attrs = {'class': "width100 fl htlListSeo hotel-tile-srp-container hotel-tile-srp-container-template new-htl-design-tile-main-block"})

This code searches the soup object for all `<div>` tags with the attribute __class="width100 fl htlListSeo hotel-tile-srp-container hotel-tile-srp-container-template new-htl-design-tile-main-block"__. It returns a special Beautiful Soup object (called a "ResultSet") containing the search results.
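Passing the full class string matches only tags whose `class` attribute is exactly that string, in that order. Searching for a single class name is usually less brittle. The HTML below is a made-up fragment for illustration, not goibibo's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two hotel tiles with differently ordered classes.
html = """
<div class="width100 fl hotel-tile-srp-container">A</div>
<div class="hotel-tile-srp-container fl">B</div>
<div class="banner">C</div>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# Exact match on the whole class string: only the first div qualifies.
exact = soup_demo.find_all(
    "div", attrs={"class": "width100 fl hotel-tile-srp-container"}
)

# Match on one class name: any div carrying that class qualifies.
by_one_class = soup_demo.find_all("div", class_="hotel-tile-srp-container")

print(len(exact), len(by_one_class))  # 1 2
```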

__results__ acts like a Python list, so we can check its length:

In [7]:
len(results)

10

There are 10 results, which seems reasonable for a single listing page. (If this number did not seem reasonable, we would examine the HTML further to determine whether our assumptions about its patterns were incorrect.)

## Extracting the data (Hotel name)

Web scraping is often an iterative process, in which you experiment with your code until it works exactly as you desire. To simplify the experimentation, we'll start by only working with the first record in the results object, and then later on we'll modify our code to use a loop:

In [8]:
first_result = results[0].find('p',
              attrs = {'style':"font-size: 18px; font-weight: bolder; color: #141823;font-family: 'Quicksand', sans-serif;"})
first_result

<p style="font-size: 18px; font-weight: bolder; color: #141823;font-family: 'Quicksand', sans-serif;">Mint Koregaon Park next to Osho Ashram</p>

Although `first_result` may look like a Python string, you'll notice that there are no quote marks around it. Instead, it's another special Beautiful Soup object (called a "Tag") that has specific methods and attributes. 

Since we want __to extract the text between the opening and closing tags__, we can access its `text` attribute.

In [9]:
(results[0].find('p',
              attrs = {'style':"font-size: 18px; font-weight: bolder; color: #141823;font-family: 'Quicksand', sans-serif;"})
.text)

'Mint Koregaon Park next to Osho Ashram'

## Extracting the price

In [10]:
results[0].find('li', attrs = {"class":"htl-tile-discount-prc"}).text

'2499'
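The price comes back as a string. To sort or do arithmetic on it, it has to be converted to a number. The helper below is a sketch under the assumption that prices are whole rupees; `parse_price` is a hypothetical name, and inputs with a currency symbol or thousands separator are assumed formats, not something observed on the page.

```python
import re


def parse_price(text: str) -> int:
    """Convert a scraped price string such as '2499' to an integer."""
    # Drop everything that is not a digit (currency symbols, commas, spaces).
    # Assumes whole-number prices: a decimal point would be dropped too.
    digits = re.sub(r"\D", "", text)
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)


print(parse_price("2499"))     # 2499
print(parse_price("₹ 2,499"))  # 2499
```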

You can apply these two methods to either the initial soup object or a Tag object (such as first_result):

- __find()__: searches for the first matching tag, and returns a Tag object
- __find_all()__: searches for all matching tags, and returns a ResultSet object (which you can treat like a list of Tags)

You can extract information from a Tag object (such as `first_result`) using these __two attributes__:

- __text__: extracts the text of a Tag, and returns a string
- __contents__: extracts the children of a Tag, and returns a list of Tags and strings

It's important to keep track of whether you are interacting with a Tag, ResultSet, list, or string, because that affects which methods and attributes you can access.
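The differences between these objects are easy to see on a tiny, made-up HTML fragment (the hotel names here are placeholders):

```python
from bs4 import BeautifulSoup

html = '<ul><li class="name">Hotel <b>A</b></li><li class="name">Hotel B</li></ul>'
demo = BeautifulSoup(html, "html.parser")

first = demo.find("li")          # a single Tag (first match)
all_items = demo.find_all("li")  # a ResultSet (list-like) of Tags
missing = demo.find("table")     # None when nothing matches

print(type(first).__name__)      # Tag
print(type(all_items).__name__)  # ResultSet
print(first.text)                # Hotel A
print(first.contents)            # ['Hotel ', <b>A</b>]
print(missing)                   # None
```

Because `find()` returns `None` when there is no match, calling `.text` on its result without checking raises an `AttributeError`.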

## Building the dataset

Now that we've figured out how to extract the hotel name and price, we can create a loop to repeat this process on all 10 results. We'll store the output in a list of tuples called records:

In [11]:
records = []
for result in results:
    hotel_name = result.find('p',
        attrs = {'style':"font-size: 18px; font-weight: bolder; color: #141823;font-family: 'Quicksand', sans-serif;"}).text
    price = result.find('li', attrs = {"class":"htl-tile-discount-prc"}).text
    records.append((hotel_name, price))
    
len(records)

10

Since there were 10 results, we have 10 records.
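The loop above assumes every tile contains both a name and a price. If one of them is missing, `find()` returns `None` and `.text` fails. A more defensive variant might look like the sketch below. `extract_records` is a hypothetical helper, and the sample HTML (including the `hotel-name` class) is invented for the demonstration.

```python
from bs4 import BeautifulSoup


def extract_records(tiles, name_attrs, price_attrs):
    """Return (name, price) tuples, skipping tiles that lack either field."""
    records = []
    for tile in tiles:
        name_tag = tile.find("p", attrs=name_attrs)
        price_tag = tile.find("li", attrs=price_attrs)
        if name_tag is None or price_tag is None:
            continue  # incomplete tile: skip it instead of crashing
        records.append(
            (name_tag.get_text(strip=True), price_tag.get_text(strip=True))
        )
    return records


# Made-up markup: the second tile has no price.
sample = """
<div class="tile"><p class="hotel-name"> Hotel A </p>
  <li class="htl-tile-discount-prc">2499</li></div>
<div class="tile"><p class="hotel-name">Hotel B</p></div>
"""
tiles = BeautifulSoup(sample, "html.parser").find_all("div", class_="tile")
print(extract_records(tiles, {"class": "hotel-name"},
                      {"class": "htl-tile-discount-prc"}))
```

`get_text(strip=True)` also trims surrounding whitespace, which the plain `text` attribute keeps.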

In [12]:
records[:3]

[('Mint Koregaon Park next to Osho Ashram', '2499'),
 ('The Deccan Royaale', '2299'),
 ('Park Central Comfort-e-Suites, Pune', '2501')]

## Applying a tabular data structure

In [13]:
df = pd.DataFrame(records, columns = ["Hotel Name", "Price"])
df

| | Hotel Name | Price |
|---|---|---|
| 0 | Mint Koregaon Park next to Osho Ashram | 2499 |
| 1 | The Deccan Royaale | 2299 |
| 2 | Park Central Comfort-e-Suites, Pune | 2501 |
| 3 | Hotel Mint Ivy Viman Nagar | 2499 |
| 4 | Lemon Tree Premier, City Center Pune | 4082 |
| 5 | THE E- SQUARE HOTEL | 3000 |
| 6 | The Grand Tulip, Swargate | 2430 |
| 7 | Kapila Business Hotel | 2900 |
| 8 | Hotel Aurora Towers | 3000 |
| 9 | Hotel Vinstar Serviced Apartments | 2200 |
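The `Price` column still holds strings at this point. As a possible next step, the sketch below converts it to numbers, sorts by price, and produces CSV text. It uses the first three records shown earlier; `df_demo`, `cheapest_first` and `csv_text` are illustrative names.

```python
import pandas as pd

# First three records from the scrape shown earlier.
sample_records = [
    ("Mint Koregaon Park next to Osho Ashram", "2499"),
    ("The Deccan Royaale", "2299"),
    ("Park Central Comfort-e-Suites, Pune", "2501"),
]
df_demo = pd.DataFrame(sample_records, columns=["Hotel Name", "Price"])

# Convert the scraped strings to integers so they sort numerically.
df_demo["Price"] = pd.to_numeric(df_demo["Price"])

cheapest_first = df_demo.sort_values("Price").reset_index(drop=True)
print(cheapest_first)

# With no path argument, to_csv returns the CSV as a string;
# pass a filename instead to write it to disk.
csv_text = cheapest_first.to_csv(index=False)
```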


## Overall code required to do the extraction


```
# importing required libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

# target URL to scrape
url = "https://www.goibibo.com/hotels/hotels-in-pune-ct/"

r = requests.get(url)

soup = BeautifulSoup(r.text, 'lxml')

results = soup.find_all('div', attrs = {'class': "width100 fl htlListSeo hotel-tile-srp-container hotel-tile-srp-container-template new-htl-design-tile-main-block"})

records = []
for result in results:
    hotel_name = result.find('p',
        attrs = {'style':"font-size: 18px; font-weight: bolder; color: #141823;font-family: 'Quicksand', sans-serif;"}).text
    price = result.find('li', attrs = {"class":"htl-tile-discount-prc"}).text
    records.append((hotel_name, price))


df = pd.DataFrame(records, columns = ["Hotel Name", "Price"])
print(df)
```

## References
1. [Analytics Vidhya: Introduction to Web Scraping in Python](https://www.analyticsvidhya.com/blog/2019/10/web-scraping-hands-on-introduction-python/)
2. [Data School: Web scraping with Python](https://www.dataschool.io/python-web-scraping-of-president-trumps-lies/)