# Web Scraping

Web scraping, also called web harvesting or web data extraction, is a form of data scraping used to extract data from websites. <br>
In this notebook I use the <b>Zomato</b> website to scrape data on the top restaurants in my city with the BeautifulSoup library.<br>
I also clean the scraped data and save it in CSV format.<br>
The restaurant data contains <b>Name, Cuisines, Ratings, Votes, and Cost For Two People</b>.
### BeautifulSoup Documentation : https://www.crummy.com/software/BeautifulSoup/bs4/doc/
### Important Note
Read through the website’s Terms and Conditions to understand how you can legally use the data. Most sites prohibit you from using the data for commercial purposes.

In [None]:
#Install Packages
#!pip install beautifulsoup4
#!pip install requests

In [1]:
#Importing Packages
import requests
from bs4 import BeautifulSoup
from pandas import DataFrame
import numpy

In [2]:
#Set a custom User-Agent header.
#With the following header the request looks like it comes from a regular browser.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/mumbai/ulhasnagar-restaurants",headers=headers)

In [3]:
content = response.content
#BeautifulSoup Object Initialization
soup = BeautifulSoup(content,"html.parser")

In [None]:
#Printing the HTML document of the webpage
#print(soup.prettify())
#Commented out because it prints the whole HTML document of the webpage

The data to be scraped is wrapped inside the div with class <b>search-snippet-card</b>.

In [None]:
#Collect every restaurant card on the page (the cells below rely on top_rest)
top_rest = soup.find_all("div",attrs={"class": "search-snippet-card"})
#top_rest
#Displaying top_rest is commented out because it prints the HTML of every card
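
To see what `find_all` returns without contacting the live site, here is a minimal sketch on a hand-written HTML fragment. The fragment and the names `sample_html`, `sample_soup`, `sample_cards`, and `card_names` are assumptions made for illustration; the markup only imitates the class names used in this notebook and is not copied from Zomato.

```python
from bs4 import BeautifulSoup

# Hand-written HTML that only imitates the class names used in this notebook;
# it is NOT the real Zomato page.
sample_html = """
<div class="search-snippet-card">
  <a class="result-title">Cafe One</a>
</div>
<div class="search-snippet-card">
  <a class="result-title">Cafe Two</a>
</div>
<div class="footer">not a restaurant card</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list-like ResultSet with one Tag per matching <div>
sample_cards = sample_soup.find_all("div", attrs={"class": "search-snippet-card"})

# Each card can be searched again for the elements nested inside it
card_names = [card.find("a", attrs={"class": "result-title"}).text.strip()
              for card in sample_cards]

print(len(sample_cards))
print(card_names)
```

Only the two divs carrying the `search-snippet-card` class are returned; the footer div is ignored.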

Now that we have the whole card, we look for the individual features we need to scrape:
1. Name of the Restaurant
2. Cuisines
3. Ratings
4. Votes
5. Cost For Two

In [7]:
#This line of code gives the name of the restaurant
top_rest[0].find("a",attrs={"class":"result-title"}).text.strip()

"McDonald's"

In [9]:
#This line of code gives the Votes
#Restaurant 11 has no votes, so the lookup returns the 'Cuisines:' label instead
top_rest[11].find("span",attrs={"class":"grey-text"}).text.strip()

'Cuisines:'

In [10]:
#This line of code gives Ratings
top_rest[0].find("div",attrs={"class":"rating-popup"}).text.strip()

'3.9'

In [15]:
#This line of code gives the Cuisines of the restaurant
top_rest[0].find("span",attrs={"class":"col-s-11"}).text

'Burger, Fast Food'

In [13]:
#This line of code gives Cost For Two
top_rest[0].find("span",attrs={"class":"col-s-11 col-m-12 pl0"}).text

'₹400'

Now that we know the code to scrape the individual features, we have to scrape these features for all of the restaurants. I have created a custom class <b>Scraping</b> to do the work.

In [20]:
class Scraping:
    
    def __init__(self,url,headers):
        """
        Param url: URL of the website
        Param headers: User Agent
        """
        self.url = url
        self.headers = headers
    
    def get_data(self):
        """
        This function returns scraped data in the form of a list containing dictionaries
        """
        #Empty list to store the data
        datalist = []
        
        #Loop through each webpage
        #For my city Zomato has 15 webpages, each listing 15 restaurants
        for page in range(15):
            
            #Getting request
            response = requests.get("{}&page={}".format(self.url,page+1)
                                    ,headers=self.headers)
            content = response.content
            
            #BeautifulSoup Object Initialization
            soup = BeautifulSoup(content,"html.parser")
            
            #Getting the element from which we will find the data we required
            top_rest = soup.find_all("div",attrs={"class": "search-snippet-card"})
            
            #Loop through each restaurant card
            for card in top_rest:
                #Empty dictionary to store the data in key-value pairs
                record = {}

                #find() returns None when an element is missing, so .text raises
                #AttributeError; in that case an empty string is stored instead

                #Restaurant Name
                try:
                    record['Restaurant']=card.find("a",attrs={"class":"result-title"}).text.strip()
                except AttributeError:
                    record['Restaurant']=''

                #Votes
                try:
                    record['Votes']=card.find("span",attrs={"class":"grey-text"}).text
                except AttributeError:
                    record['Votes']=''

                #Ratings
                try:
                    record['Ratings']=card.find("div",attrs={"class":"rating-popup"}).text.strip()
                except AttributeError:
                    record['Ratings']=''

                #Cuisines
                try:
                    record['Cuisines']=card.find("span",attrs={"class":"col-s-11"}).text
                except AttributeError:
                    record['Cuisines']=''

                #Cost For Two
                try:
                    record['Cost For Two']=card.find("span",attrs={"class":"col-s-11 col-m-12 pl0"}).text
                except AttributeError:
                    record['Cost For Two']=''

                #Append each dictionary to datalist
                datalist.append(record)
        return datalist
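
The five try/except blocks in `get_data` all follow the same pattern: look up an element, read its text, and fall back to an empty string. As a possible alternative, that pattern could be factored into one helper. The sketch below is not part of the original class; `safe_text` and `demo_card` are hypothetical names, and the HTML is a hand-written assumption.

```python
from bs4 import BeautifulSoup


def safe_text(card, tag, css_class, default=""):
    """Return the stripped text of the first matching element, or `default`."""
    element = card.find(tag, attrs={"class": css_class})
    # find() returns None when nothing matches, so check before reading .text
    return element.text.strip() if element is not None else default


# Hand-written card with a name but no rating element (assumed markup)
demo_card = BeautifulSoup(
    '<div class="search-snippet-card"><a class="result-title"> Cafe One </a></div>',
    "html.parser",
)

demo_name = safe_text(demo_card, "a", "result-title")
demo_rating = safe_text(demo_card, "div", "rating-popup")

print(repr(demo_name), repr(demo_rating))
```

Note that this helper strips whitespace from every field, whereas `get_data` strips only the name and the rating, so swapping it in would slightly change the raw values.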

In [21]:
#Creating an object of the class Scraping
myObject = Scraping(url = "https://www.zomato.com/mumbai/ulhasnagar-restaurants?nearby=0",headers = headers)
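
The URL passed here already ends with a query string (`?nearby=0`), which is why `get_data` can append `&page=N` to it. A short sketch of the URLs that the loop requests, using the same formatting expression (`base_url` and `page_urls` are names chosen for this illustration):

```python
base_url = "https://www.zomato.com/mumbai/ulhasnagar-restaurants?nearby=0"

# Same formatting expression as in get_data(); pages are numbered from 1
page_urls = ["{}&page={}".format(base_url, page + 1) for page in range(15)]

print(page_urls[0])
print(page_urls[-1])
```

A base URL without any `?` part would produce a malformed address with this expression, so the query string in the constructor argument matters.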

In [22]:
#Calling method get_data()
data_in_list = myObject.get_data()

In [23]:
#View top 5 values
data_in_list[:5]

[{'Restaurant': "McDonald's",
  'Votes': '347 votes',
  'Ratings': '3.9',
  'Cuisines': 'Burger, Fast Food',
  'Cost For Two': '₹400'},
 {'Restaurant': 'Natural Ice Cream',
  'Votes': '93 votes',
  'Ratings': '4.0',
  'Cuisines': 'Ice Cream, Desserts',
  'Cost For Two': '₹300'},
 {'Restaurant': 'Jumboking',
  'Votes': '45 votes',
  'Ratings': '4.2',
  'Cuisines': 'Burger, Fast Food',
  'Cost For Two': '₹150'},
 {'Restaurant': 'Jai Mata Di Fast Food',
  'Votes': '16 votes',
  'Ratings': '3.4',
  'Cuisines': 'Chinese, Fast Food',
  'Cost For Two': '₹200'},
 {'Restaurant': 'Ming Yang',
  'Votes': '126 votes',
  'Ratings': '3.8',
  'Cuisines': 'Chinese, Thai',
  'Cost For Two': '₹950'}]

Before saving anything, we convert the list of dictionaries into a pandas DataFrame so the data can be inspected and cleaned.

In [64]:
#First convert the list into a dataframe
data = DataFrame(data=data_in_list,columns=["Restaurant","Cuisines","Ratings","Votes","Cost For Two"])
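
The `columns` argument both selects and orders the dictionary keys, so the DataFrame columns do not have to follow the order in which the keys were filled. A small sketch with two made-up records (the values and the names `sample_records` and `sample_df` are assumptions for illustration, not scraped data):

```python
from pandas import DataFrame

# Two made-up records with the same keys that get_data() produces
sample_records = [
    {"Restaurant": "Cafe One", "Votes": "12 votes", "Ratings": "4.1",
     "Cuisines": "Cafe", "Cost For Two": "₹300"},
    {"Restaurant": "Cafe Two", "Votes": "5 votes", "Ratings": "NEW",
     "Cuisines": "Bakery", "Cost For Two": "₹1,000"},
]

# `columns` selects and orders the dictionary keys
sample_df = DataFrame(
    data=sample_records,
    columns=["Restaurant", "Cuisines", "Ratings", "Votes", "Cost For Two"],
)

print(list(sample_df.columns))
print(sample_df.shape)
```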

In [65]:
data.head()

| | Restaurant | Cuisines | Ratings | Votes | Cost For Two |
|---|---|---|---|---|---|
| 0 | McDonald's | Burger, Fast Food | 3.9 | 347 votes | ₹400 |
| 1 | Natural Ice Cream | Ice Cream, Desserts | 4.0 | 93 votes | ₹300 |
| 2 | Jumboking | Burger, Fast Food | 4.2 | 45 votes | ₹150 |
| 3 | Jai Mata Di Fast Food | Chinese, Fast Food | 3.4 | 16 votes | ₹200 |
| 4 | Ming Yang | Chinese, Thai | 3.8 | 126 votes | ₹950 |


In [66]:
data.shape

(221, 5)

In total we got 221 entries.

In [67]:
data["Ratings"].unique()

array(['3.9', '4.0', '4.2', '3.4', '3.8', '3.7', '3.1', '3.6', '3.5',
       '3.3', 'NEW', '2.9', '3.2', '-', '3.0', '2.6', '2.7', '2.5'],
      dtype=object)

As we can see, the Ratings column has <b>NEW</b> and <b>-</b> values, which are not numeric.

In [68]:
data["Votes"].unique()

array(['347 votes', '93 votes', '45 votes', '16 votes', '126 votes',
       '108 votes', '346 votes', '4 votes', '57 votes', '101 votes',
       '51 votes', 'Cuisines: ', '137 votes', '393 votes', '163 votes',
       '33 votes', '9 votes', '21 votes', '206 votes', '14 votes',
       '103 votes', '63 votes', '71 votes', '32 votes', '25 votes',
       '46 votes', '11 votes', '91 votes', '7 votes', '17 votes',
       '27 votes', '56 votes', '12 votes', '50 votes', '67 votes',
       '18 votes', '13 votes', '5 votes', '8 votes', '191 votes',
       '42 votes', '76 votes', '61 votes', '129 votes', '6 votes',
       '40 votes', '323 votes', '69 votes', '105 votes', '89 votes',
       '97 votes', '158 votes', '38 votes', '60 votes', '24 votes',
       '35 votes', '15 votes', '28 votes', '52 votes', '10 votes',
       '22 votes', '241 votes', '19 votes', '47 votes', '107 votes',
       '82 votes', '49 votes', '62 votes', '118 votes', '53 votes',
       '20 votes', '131 votes', '95 votes', '159

The Votes column has some <b>Cuisines:</b> values. This is because some restaurants did not have any votes.
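
One plausible explanation, sketched below, is that the vote count and the "Cuisines:" label share the `grey-text` class, so `find` falls through to the label when the vote span is absent. The markup here is an assumption written for this illustration and has not been checked against the real page; `first_grey_text` and the two card variables are hypothetical names.

```python
from bs4 import BeautifulSoup

# Assumed markup: the vote count and the "Cuisines:" label both use "grey-text"
card_with_votes = BeautifulSoup(
    '<div><span class="grey-text">12 votes</span>'
    '<span class="grey-text">Cuisines: </span></div>',
    "html.parser",
)
card_without_votes = BeautifulSoup(
    '<div><span class="grey-text">Cuisines: </span></div>',
    "html.parser",
)


def first_grey_text(card):
    # find() returns only the FIRST element that matches
    return card.find("span", attrs={"class": "grey-text"}).text


print(repr(first_grey_text(card_with_votes)))
print(repr(first_grey_text(card_without_votes)))
```

Under that assumption no exception is raised for a restaurant without votes, which is why the `except` branch never stores an empty string and the label text ends up in the column.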

In [69]:
data["Cost For Two"].unique()

array(['₹400', '₹300', '₹150', '₹200', '₹950', '₹600', '₹700', '₹350',
       '₹500', '₹450', '₹250', '₹750', '₹650', '₹850', '₹1,000', '₹1,600',
       '₹900', '₹1,500', '₹800', '₹100', '₹50', '₹1,200', '₹1,300',
       '₹550'], dtype=object)

The Cost For Two column has the object data type because its values are strings that include the ₹ symbol and thousands separators.

# Data Cleaning

I have replaced the non-numeric placeholder values found above.

In [70]:
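#Replace the placeholder ratings "NEW" and "-" with 0
data["Ratings"] = numpy.where((data["Ratings"]=="NEW") | (data["Ratings"]=="-"),0.0,data["Ratings"])
#The remaining ratings are still strings, so convert the column to float
data["Ratings"] = data["Ratings"].astype(float)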

In [71]:
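#Replace the "Cuisines: " placeholder with 0 votes
data["Votes"] = numpy.where((data["Votes"]=="Cuisines: "),"0 votes",data["Votes"])
#Remove the " votes" string and convert to int dtype
#regex=True is needed because pandas 2.0+ treats the pattern as a literal string by default
data['Votes'] = data['Votes'].str.replace(r'\D+', '', regex=True).astype('int')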

In [72]:
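#Cleaning the Cost For Two column: drop the currency symbol and commas, then convert to int
#regex=True is needed because pandas 2.0+ treats the pattern as a literal string by default
data['Cost For Two'] = data['Cost For Two'].str.replace(r'\D+', '', regex=True).astype('int')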
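
The cleaning rules used in this section can also be written per value in plain Python, which makes them easy to check on individual strings. This is only a sketch of the same rules, not the code the notebook runs; `clean_rating` and `digits_to_int` are hypothetical helper names.

```python
import re


def clean_rating(value):
    """Map the placeholder ratings 'NEW' and '-' to 0.0, else parse a float."""
    return 0.0 if value in ("NEW", "-") else float(value)


def digits_to_int(value):
    """Drop every non-digit character and parse the rest; no digits means 0."""
    digits = re.sub(r"\D+", "", value)
    return int(digits) if digits else 0


print(clean_rating("3.9"), clean_rating("NEW"), clean_rating("-"))
print(digits_to_int("347 votes"), digits_to_int("Cuisines: "), digits_to_int("₹1,000"))
```

The same `\D+` pattern handles both columns: it removes the word "votes" from the vote counts and the ₹ symbol and comma from the prices.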

In [73]:
#Viewing data after cleaning
data.iloc[:10,:]

| | Restaurant | Cuisines | Ratings | Votes | Cost For Two |
|---|---|---|---|---|---|
| 0 | McDonald's | Burger, Fast Food | 3.9 | 347 | 400 |
| 1 | Natural Ice Cream | Ice Cream, Desserts | 4.0 | 93 | 300 |
| 2 | Jumboking | Burger, Fast Food | 4.2 | 45 | 150 |
| 3 | Jai Mata Di Fast Food | Chinese, Fast Food | 3.4 | 16 | 200 |
| 4 | Ming Yang | Chinese, Thai | 3.8 | 126 | 950 |
| 5 | Domino's Pizza | Pizza, Fast Food | 3.8 | 108 | 400 |
| 6 | Amar Fast Food & Restaurant | Street Food, Chinese, North Indian, Biryani, F... | 3.7 | 346 | 600 |
| 7 | Box Office Cafe Bar and Bistro | North Indian, Chinese, Continental, Fast Food | 3.1 | 4 | 700 |
| 8 | Hunger Ground | North Indian, Chinese | 3.7 | 57 | 350 |
| 9 | Diet With Delight | Chinese, Fast Food, Street Food, Healthy Food,... | 3.8 | 101 | 350 |


Now we will save the cleaned DataFrame in CSV format, which is easy to read.

In [74]:
data.to_csv(path_or_buf='zomato_res.csv',sep=",",index=False,encoding='utf-8')
#The above code creates a CSV file named zomato_res.csv in the current working directory.
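
To confirm that a file written this way can be read back, a round trip on a tiny made-up DataFrame might look like the sketch below. It writes to a temporary directory instead of `zomato_res.csv`; the data and the names `sample`, `csv_path`, and `reloaded` are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Tiny made-up frame standing in for the cleaned restaurant data
sample = pd.DataFrame({"Restaurant": ["Cafe One"], "Cost For Two": [300]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")
    # Same arguments as the to_csv call in this notebook, different path
    sample.to_csv(path_or_buf=csv_path, sep=",", index=False, encoding="utf-8")
    reloaded = pd.read_csv(csv_path, encoding="utf-8")

print(list(reloaded.columns))
print(reloaded.shape)
```

Because `index=False` is used when writing, the reloaded frame has no extra index column.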

# Conclusion
The data was successfully scraped from the Zomato website, cleaned, and saved in CSV format.