In [1]:
# importing required libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

## References:
1. [Analytics Vidhya: Introduction to Web Scraping in Python](https://www.analyticsvidhya.com/blog/2019/10/web-scraping-hands-on-introduction-python/)
2. [Data School: Web scraping with Python](https://www.dataschool.io/python-web-scraping-of-president-trumps-lies/)

#### Reading the web page into Python

In [2]:
# target URL to scrape
url = "https://www.goibibo.com/hotels/find-hotels-in-Pune/1554245012668028405/1554245012668028405/%7B%22ci%22:%2220191025%22,%22co%22:%2220191029%22,%22r%22:%221-2-0%22%7D/?{}&sec=dom"

In [3]:
r = requests.get(url)

The code above fetches the web page at `url` and stores the result in a "response" object called `r`. That response object has a `text` attribute, which contains the same HTML we see when viewing the page source in a browser such as Chrome.
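Before parsing, it can help to confirm that the request actually succeeded. The sketch below wraps the fetch in a hypothetical helper, `fetch_html`, which is not part of the original notebook. The 10-second timeout is an arbitrary assumption, and `raise_for_status()` turns a 4xx or 5xx reply into an exception instead of letting an error page be parsed silently.

```python
import requests


def fetch_html(url, timeout=10):
    """Hypothetical helper: return the page HTML or raise on an HTTP error."""
    # timeout (in seconds) keeps the call from hanging on a slow server
    response = requests.get(url, timeout=timeout)
    # raises requests.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return response.text
```

With this helper, `html = fetch_html(url)` would take the place of the `requests.get(url)` call and the later `r.text` lookups.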

In [4]:
# print the first 500 characters of the HTML
print(r.text[0:500])


<!doctype html>
<html lang="en-us">
<head>
<script>
          var starttime = new Date();
        </script>
<title data-react-helmet="true">Results</title>
<meta data-react-helmet="true" name="description" property="og:description" content="Goibibo provides you online hotel bookings all over the world. Book cheap, budget and luxury hotels at best price from leading hotel booking site. Free cancellation on many hotels"/><meta data-react-helmet="true" name="keywords" content="Goibibo, online hote


#### Parsing the HTML using Beautiful Soup

In [6]:
soup = BeautifulSoup(r.text, 'lxml')

The code above parses the HTML (stored in r.text) into a special object called soup that the Beautiful Soup library understands. In other words, __Beautiful Soup is reading the HTML and making sense of its structure__.

(Note that __lxml__ is a third-party parser that must be installed separately, for example with `pip install lxml`. The parser included with the Python standard library is __html.parser__, and Beautiful Soup can use other parsers as well. See [differences between parsers](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers) to learn more.)
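The parser is selected by the second argument to `BeautifulSoup`. As a minimal sketch, the example below parses a made-up HTML string (not taken from the Goibibo page) with Python's built-in `html.parser`, so it runs even where lxml is not installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration
sample_html = "<html><body><p class='ico20 fb'>Hotel A</p></body></html>"

# "html.parser" ships with Python; "lxml" could be passed here instead
# if the lxml package is installed
demo_soup = BeautifulSoup(sample_html, "html.parser")

first_paragraph = demo_soup.find("p")
print(first_paragraph.get_text())  # Hotel A
```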

#### Collecting all of the records

We can take advantage of patterns in the page's HTML to build our dataset. __Each record is tagged in a consistent way in the HTML, and that consistency is what allows us to extract the records__.

The Beautiful Soup methods required for this task are:

1. find()
2. find_all()

There is an excellent tutorial on these methods [(Searching the tree)](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree) in the Beautiful Soup documentation.
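The difference between the two methods can be sketched on a small, made-up HTML fragment (the tag and class names here are illustrative assumptions, not the real page structure): `find()` returns the first matching tag or `None`, while `find_all()` returns every match in a list-like ResultSet.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration
sample_html = """
<ul>
  <li class="hotel">Hotel A</li>
  <li class="hotel">Hotel B</li>
</ul>
"""
demo = BeautifulSoup(sample_html, "html.parser")

first_item = demo.find("li")        # first matching Tag
all_items = demo.find_all("li")     # ResultSet with every match
missing = demo.find("table")        # None, because nothing matches

names = [item.get_text() for item in all_items]
print(first_item.get_text())  # Hotel A
print(names)                  # ['Hotel A', 'Hotel B']
print(missing)                # None
```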

In [7]:
results = soup.find_all('p', attrs = {'class':'ico20 fb'})

This code searches the soup object for all `<p>` tags with the attribute __class="ico20 fb"__. It returns a special Beautiful Soup object (called a "ResultSet") containing the search results.
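A class value that contains a space is matched against the whole attribute string, so the order of the class names matters. The sketch below uses made-up HTML to compare that exact-string search with a single-class search and with a CSS selector. The selector variant is an alternative shown for comparison, not something the original notebook uses.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration
sample_html = """
<p class="ico20 fb">Hotel A</p>
<p class="fb ico20">Hotel B</p>
<p class="ico20">Hotel C</p>
"""
demo = BeautifulSoup(sample_html, "html.parser")

# Exact string match on the full class attribute: order-sensitive
exact = demo.find_all("p", attrs={"class": "ico20 fb"})

# Single class name: matches any tag that has this class among others
single = demo.find_all("p", class_="ico20")

# CSS selector: matches tags that have both classes, in any order
both = demo.select("p.ico20.fb")

print([p.get_text() for p in exact])   # ['Hotel A']
print([p.get_text() for p in single])  # ['Hotel A', 'Hotel B', 'Hotel C']
print([p.get_text() for p in both])    # ['Hotel A', 'Hotel B']
```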

__results__ acts like a Python list, so we can check its length:

In [8]:
len(results)

0
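A length of 0 means that no `<p>` tag in the downloaded HTML carried exactly that class value. One possible explanation, which is an assumption and not something the output above confirms, is that the hotel listings are built in the browser by JavaScript and are therefore absent from the HTML that `requests` receives. The hypothetical helper below, `diagnose_empty_result`, is a rough way to check: it reports whether the class string occurs anywhere in the raw HTML and how many tags the parser saw.

```python
from bs4 import BeautifulSoup


def diagnose_empty_result(html, tag, class_value):
    """Hypothetical helper: summarise why a class-based search may be empty."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        # Does the exact class string occur anywhere in the raw HTML text?
        "class_string_in_html": class_value in html,
        # How many tags of this kind did the parser see at all?
        "tag_count": len(soup.find_all(tag)),
        # How many of them carry exactly this class attribute value?
        "match_count": len(soup.find_all(tag, attrs={"class": class_value})),
    }


# Made-up pages standing in for r.text
static_page = '<div><p class="ico20 fb">Hotel A</p></div>'
script_only_page = "<div id='root'></div><script>/* built in the browser */</script>"

report_static = diagnose_empty_result(static_page, "p", "ico20 fb")
report_script = diagnose_empty_result(script_only_page, "p", "ico20 fb")
```

On the real page, `diagnose_empty_result(r.text, 'p', 'ico20 fb')` would show whether the class string is present in the response at all.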