# Web Scraping

Web scraping is a technique for extracting large amounts of data from websites and saving it to a local file on your computer or to a database in tabular form.


A note of caution: web scraping is subject to rules and guidelines. Not every website allows its content to be scraped, and legal restrictions may apply. Always read the website’s terms and conditions on web scraping before you attempt it.
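Besides the terms and conditions, many sites publish a `robots.txt` file that states which paths automated clients may visit. The sketch below shows one way to check a URL against such rules with Python's standard `urllib.robotparser` module. The rules and the `example.com` URLs are made up for illustration, and passing this check does not replace reading the site's terms.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real site serves its own file at /robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
# parse() accepts the file's lines, so no network request is made here
parser.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) reports whether the rules permit the request
allowed = parser.can_fetch("*", "https://example.com/hotels/")
blocked = parser.can_fetch("*", "https://example.com/private/data")

print("hotels page allowed:", allowed)
print("private page allowed:", blocked)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would download the real file instead of parsing a string.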

## Popular Libraries for Web Scraping

You’ll come across multiple libraries and frameworks in Python for web scraping. Here are three popular ones:

<li>BeautifulSoup: BeautifulSoup is a Python parsing library for extracting data from HTML and XML documents. It detects encodings automatically and handles HTML documents gracefully, even ones with special characters. We can navigate the parsed document and search it for the elements we need, which makes extracting data from web pages quick. In this course, we will learn in detail how to build web scrapers using BeautifulSoup.
<li>Scrapy: Scrapy is a Python framework for large-scale web scraping. It gives you the tools to extract data from websites efficiently, process it as you want, and store it in your preferred structure and format.
<li>Selenium: Selenium is a popular tool for automating browsers. It is primarily used for testing in the industry but is also handy for web scraping.

In [1]:
!pip install beautifulsoup4



In [2]:
!pip install scrapy



In [3]:
!pip install selenium
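To give a feel for how BeautifulSoup navigates a parsed document, here is a minimal sketch that runs on an inline HTML string instead of a live page. The markup, the class names `hotel-card` and `price`, and the hotel names are all hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet; class names and values are made up for illustration
html = """
<div class="hotel-card"><a href="/hotel-a">Hotel A</a><p class="price">1200</p></div>
<div class="hotel-card"><a href="/hotel-b">Hotel B</a><p class="price">950</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

hotels = []
# find_all returns every matching tag; find returns the first match inside a tag
for card in soup.find_all("div", class_="hotel-card"):
    name = card.find("a").get_text(strip=True)
    price = int(card.find("p", class_="price").get_text(strip=True))
    hotels.append({"hotel_name": name, "room_price": price})

print(hotels)
```

The same `find_all` / `find` pattern is what the scraper in the following sections applies to a real page.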



## Components of Web Scraping


<li>Crawl: Navigate to the target website by making an HTTP request and download the response.

<li>Parse and Transform: Once we have received the response, pass the downloaded data to an HTML parser such as BeautifulSoup and extract the fields we need.

<li>Store: Save the extracted data as JSON or CSV.
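The Store step can be sketched with the standard library alone. The example below serialises the same records as JSON and as CSV. The records are hypothetical placeholders, and the output is kept in memory so the sketch runs without touching the file system.

```python
import csv
import io
import json

# Hypothetical records standing in for scraped data
rows = [
    {"hotel_name": "Hotel A", "room_price": "1200"},
    {"hotel_name": "Hotel B", "room_price": "950"},
]

# JSON: one call serialises the whole list of dictionaries
json_text = json.dumps(rows, indent=2)

# CSV: DictWriter maps dictionary keys to columns
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["hotel_name", "room_price"])
writer.writeheader()
writer.writerows(rows)
csv_text = csv_buffer.getvalue()

print(json_text)
print(csv_text)
```

To write real files, the `io.StringIO` buffer could be swapped for `open("hotels.csv", "w", newline="")`, where the file name is again only an example.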

In [4]:
### Step 1: Crawl

''' Navigate to the target website and download
the source code of the web page
'''

import requests
from bs4 import BeautifulSoup
import pandas as pd

# list that will collect one dictionary per hotel
scraped_data = []

# the target URL
url = "https://www.goibibo.com/hotels/hotels-in-shimla-ct/"

# headers: send a browser-like User-Agent string with the request
headers = { 'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36" }

# send the request to download the page
response = requests.get(url, headers=headers)

# parse the downloaded HTML
data = BeautifulSoup(response.text, 'html.parser')

print(data)



<!DOCTYPE html>

<html lang="en">
<head>
<script>
          var starttime = new Date();
        </script>
<title data-react-helmet="true">Hotels in Shimla, Book 535 Shimla Hotels starting at ₹1013</title>
<meta content="#2d67b2" data-react-helmet="true" name="theme-color"/><meta content="122023101161980" data-react-helmet="true" property="fb:app_id"/><meta content="239522418693" data-react-helmet="true" property="fb:pages"/><meta content="l3rQIge7B2N_G1cQl0VZP0y7-nE" data-react-helmet="true" name="alexaVerifyID"/><meta charset="utf-8" data-react-helmet="true"/><meta content="width=device-width,initial-scale=1.0, maximum-scale=1.0, user-scalable=0" data-react-helmet="true" name="viewport"/><meta content="Book Hotels in Shimla at lowest Prices on Goibibo. Get Free Cancellation and Instant Refund on 535  Shimla Hotels starting from  ₹1013. Book from 45 goSafe Hotels in Shimla, ensuring clean and safe hotel stay in current Coronavirus scenario. Use code GETSETGO for discounts upto 30% off

In [5]:
# find all the sections with the specified class name
# (attrs must be a dictionary: 'class' is the key, the class string is the value)
cards_data = data.find_all('div', attrs={'class': 'HotelCardstyles__HotelNameWrapperDiv-sc-1s80tyk-12 biniNQ'})
# total number of cards
print('Total Number of Cards Found : ', len(cards_data))
# source code of the hotel cards
for card in cards_data:
    print(card)

Total Number of Cards Found :  30
<div class="HotelCardstyles__HotelNameWrapperDiv-sc-1s80tyk-12 biniNQ"><a class="HotelCardstyles__HotelNameSeoAnchor-sc-1s80tyk-13 chrWIo" content="Snow Valley Resorts" href='/hotels/snow-valley-resorts-hotel-in-shimla-1953288179585905793/?hquery={"ci":"20210912","co":"20210913","r":"1-2-0","ibp":"v15"}&amp;hmd=a4c6d9937744207af45aef42197f58166101a5d21fa7cd2fc06f985a24d86238e88fd658fd119126f613891dcd42d1297444fe559a8b2784f87354e7326cffddd2235d0bfcb2b9c4854733f71ef1d49eebaa9b8c73226f1e564733e1a1fc113e18fb44fcc2fc6e91ba686f46e6a7a011c544641f2dd4fb554ab76fff173a76aad3205d3464d6e2e9cba156cbf2043f95614566239a94a0fb2bc6f392cba6c5e71d132a8414739125cec93c2bb398c05c51c475c89fa8d8764b77d418c3810bb2977b75abfad7f7b9affd6b9f1a6c5be9b0bb0441c144d41d93055fa446d755e9ad19b14430f4f0791c4d6cc4a60e05321ec1c40fa1f732194072996eddc4b550ea4b30680b31f1c214629d6329781ab75a3a654b1b20e5e668a692b3e94428280b7c84144999fbe4c2bd9d0786b769f9d08bbcb679bdf9462941bec0dcdc811a25cb069346447

We have filtered the card data out of the complete source code of the web page; each card contains the information about a single hotel. To find the tag that holds the hotel name, use the browser's Inspect Element tool on a hotel name, and do the same for the room price. The next cell extracts the hotel name from each card:

In [6]:


for card in cards_data:
    # the first <a> tag inside the card holds the hotel name
    hotel_name = card.find('a')

    # dictionary holding the details of the current card
    card_details = {}
    card_details['hotel_name'] = hotel_name.text

    print(hotel_name.text)

Snow Valley Resorts
Marigold Sarovar Portico, Shimla
Goldenfern Resort Shimla
The Zion Hotel
Regenta Resort & Spa Mashobra Shimla
Hotel Dhroov
Torrentium Lodge
Hotel Shingar
The Oberoi Cecil
Rocky Knob (Explore World Art in One Property)
The Rock Castle
landmark shimla - With Elevator Access To Mall Road
Hotel Combermere
Hotel Prestige
Summit Le Royale
Solo Home
Meena Bagh Shimla
Shivanchal Homestay
Hotel Atithi
Belvilla Serene Mountain Getaway with Exquisite Views
Marley Villa
Kalawati Homes Vacation Rentals
Resort Eutopia
Honeymoon Inn
Clarkes Hotel, A grand heritage hotel since 1898
Fairmount Hotel
Royal Tulip Shimla, Kufri
OYO 1706 Hotel The Alpine Heritage Residency
Hotel Sangeet
Marina- Shimla First Designer Boutique Hotel


In [7]:
# find all the price sections with the specified class name
# (attrs must be a dictionary: 'class' is the key, the class string is the value)
cards_data = data.find_all('div', attrs={'class': "HotelCardstyles__CurrentPriceTextWrapper-sc-1s80tyk-27 eqtvkm"})
# total number of cards
print('Total Number of Cards Found : ', len(cards_data))
# source code of the price cards
for card in cards_data:
    print(card)

Total Number of Cards Found :  30
<div class="HotelCardstyles__CurrentPriceTextWrapper-sc-1s80tyk-27 eqtvkm"><svg class="RupeeIcon-sc-5hlwf0-0 bendgm" height="1.5rem" viewbox="0 0 32 32" width="1.5rem" xmlns="http://www.w3.org/2000/svg"><path d="M21.482 7.945h3.536c.982 0 1.786.818 1.786 1.818s-.804 1.818-1.786 1.818h-3.536a9.429 9.429 0 01-2.625 5.109 9.509 9.509 0 01-6.75 2.891h-.679l9.661 9.255c0 .018.018.018.036.036.679.673.696 1.782.036 2.473a1.742 1.742 0 01-2.518.091L5.714 19a1.78 1.78 0 01-.554-1.364c.036-.964.839-1.727 1.786-1.691h5.179a5.902 5.902 0 004.214-1.836 6.327 6.327 0 001.482-2.527H6.946c-.982 0-1.786-.818-1.786-1.818s.804-1.818 1.786-1.818h10.875C17 5.455 14.714 3.782 12.125 3.764H6.946c-.982 0-1.786-.818-1.786-1.818S5.964.128 6.946.128h18.071c.982 0 1.786.818 1.786 1.818s-.804 1.818-1.804 1.818h-5.464a8.504 8.504 0 011.946 4.182z"></path></svg><p class="HotelCardstyles__CurrentPrice-sc-1s80tyk-28 inUyrJ" itemprop="priceRange">3163</p></div>
<div class="HotelCardsty

In [8]:


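# Re-select the hotel-name wrappers so that each price can be paired with its hotel.
# Assumption: both selections return the cards in the same order as on the page.
name_cards = data.find_all('div', attrs={'class': 'HotelCardstyles__HotelNameWrapperDiv-sc-1s80tyk-12 biniNQ'})

for name_card, price_card in zip(name_cards, cards_data):
    hotel_name = name_card.find('a')
    room_price = price_card.find('p', attrs={'class': "HotelCardstyles__CurrentPrice-sc-1s80tyk-28 inUyrJ"})

    # build a fresh dictionary per hotel; reusing a single dictionary
    # would make every stored row identical to the last one
    card_details = {
        'hotel_name': hotel_name.text,
        'room_price': room_price.text,
    }
    scraped_data.append(card_details)

    print(room_price.text)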

3163
4389
2730
2695
6815
3308
2544
2539
11000
2529
1831
2963
5628
976
2428
1110
8386
651
788
5559
8898
10745
2490
3800
7125
1776
10794
1029
2032
10421


## Store the Extracted Data in a CSV File

In [9]:
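# build a DataFrame from the list of dictionaries (one row per hotel)
dataFrame = pd.DataFrame(scraped_data)

# save it as CSV without the index column
dataFrame.to_csv('hotels_data.csv', index=False)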
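A quick way to sanity-check the saved file is to read it back and convert the prices to numbers. The sketch below assumes the CSV has `hotel_name` and `room_price` columns. It uses an in-memory string with hypothetical rows in place of `hotels_data.csv` so that it runs on its own.

```python
import csv
import io

# Stand-in for the contents of hotels_data.csv; the rows are hypothetical
csv_text = "hotel_name,room_price\nHotel A,1200\nHotel B,950\n"

reader = csv.DictReader(io.StringIO(csv_text))

# scraped prices are strings, so convert them before doing arithmetic
records = [
    {"hotel_name": row["hotel_name"], "room_price": int(row["room_price"])}
    for row in reader
]

cheapest = min(records, key=lambda record: record["room_price"])
print("Rows read:", len(records))
print("Cheapest hotel:", cheapest["hotel_name"], cheapest["room_price"])
```

With the real file, `io.StringIO(csv_text)` would be replaced by `open('hotels_data.csv', newline='')`.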