<h1>Riyadh Coffee Shops</h1>

_Author: Mohammed Alosaimi_
<hr>

The motivation behind this project is to collect data on coffee shops in Riyadh, Saudi Arabia. The data was obtained from Google search results via web scraping and can be used for analysis and for identifying opportunities. However, with roughly 200 raw records or fewer, the dataset is too small to support an effective machine learning solution. Each record has the following fields:

- shop name
- rating
- number of ratings
- price
- shop type
- key words
- address
- reviews
<br><br>
<hr>

In this Jupyter Notebook, we scrape data from Google search results using two powerful Python libraries, BeautifulSoup and Selenium. Selenium drives the browser, and BeautifulSoup extracts the data from each page through its Document Object Model (DOM).
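
As a minimal sketch of what extracting data through the DOM looks like, the snippet below parses a small hand-written HTML fragment with BeautifulSoup. The markup, the class names `shop`, `name` and `score`, and the helper `extract_shops` are invented for illustration and do not reflect the structure of Google's pages.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written only for this example.
sample_html = """
<div class="shop"><span class="name">Cafe A</span><span class="score">4.3</span></div>
<div class="shop"><span class="name">Cafe B</span></div>
"""


def extract_shops(html):
    """Return (name, score) pairs; score is None when the tag is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for shop in soup.find_all('div', {'class': 'shop'}):
        name = shop.find('span', {'class': 'name'}).text
        score_tag = shop.find('span', {'class': 'score'})
        rows.append((name, score_tag.text if score_tag is not None else None))
    return rows


shops = extract_shops(sample_html)
print(shops)
```

Checking for a missing tag explicitly, as done for `score_tag`, is one way to handle incomplete entries without wrapping every lookup in a `try` block.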

In [1]:
# import libraries
import requests
import pandas as pd
import numpy as np
from time import sleep
from bs4 import BeautifulSoup
import selenium
import re

In [2]:
# import web scraping libraries
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

In [3]:
# base url to visit
base_url = 'https://www.google.com'

In [4]:
# create a driver object to start crawling
driver = webdriver.Chrome(executable_path='./chromedriver/chromedriver')

In [5]:
# open the desired url
driver.get(base_url)

In [6]:
# locate the search input box
input_location = driver.find_element_by_name('q')
driver.implicitly_wait(10)  # implicit wait: element lookups retry for up to 10 seconds
# type the search terms
input_location.send_keys('Riyadh Coffee shops')
# hit enter
input_location.send_keys(Keys.ENTER)
driver.implicitly_wait(10)  # re-apply the 10-second implicit wait
# click on "more places" to fetch more results
driver.find_element_by_class_name('cMjHbjVt9AZ__button').click()
driver.implicitly_wait(10)  # re-apply the 10-second implicit wait
sleep(2)  # pause briefly so the results can render
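
Note that `implicitly_wait` does not pause the script; it sets how long Selenium keeps retrying an element lookup, which is why the code also calls `sleep`. A common alternative to fixed sleeps is to poll until a condition holds. The helper below is a hypothetical, library-free sketch of that idea (`wait_until` and `page_ready` are names invented here); Selenium ships its own `WebDriverWait` class for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition` repeatedly until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f seconds' % timeout)
        time.sleep(poll)


# Stand-in for a page that becomes ready on the third check.
attempts = {'count': 0}


def page_ready():
    attempts['count'] += 1
    return attempts['count'] >= 3


wait_until(page_ready, timeout=5, poll=0.01)
print(attempts['count'])
```

With a real driver, the condition could be a lookup such as `lambda: driver.find_elements_by_class_name('VkpGBb')`, which returns an empty (falsy) list until matching elements exist.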

In [7]:
# get the page source
page_source = driver.page_source

In [8]:
# parse result as html
soup = BeautifulSoup(page_source, 'html')

In [9]:
# dataframe to store results
df = pd.DataFrame(columns=['shop_name', 'rating', 'numbers_of_rating',
                           'price', 'shop_type', 'key_words', 'address', 'reviews'])
# instantiate lists for each column to append results to
shop_names = []
rating = []
number_of_ratings = []
price = []
shop_types = []
key_words = []
address = []
reviews = []

In [10]:
# scrape up to 14 result pages; index selects the next page link to click
for index, button in enumerate(range(14), 1):

    sleep(2)
    driver.implicitly_wait(10)
    # loop through the tag where the shop name is located
    for i in soup.findAll('div', {'role': 'heading', 'aria-level': '3', 'class': 'dbg0pd'}):

        try:  # the shop name is usually inside a span tag

            # append shop names to the shop_names list
            shop_names.append(i.div.span.text)

        except Exception:  # no span tag, so fall back to the text of the div
            shop_names.append(i.div.text)

    # loop through the span tag which has the rest of the shop details
    for i in soup.findAll('span', {'class': 'rllt__details lqhpac'}):

        try:  # try and find shop rating in a span tag

            rating.append(i.find('div').find('span', {'class': 'BTtC6e'}).text)

            # if there is rating, then there exists number of ratings as well
            if i.find('div').text.split()[1].strip('()')[0].isdigit():

                number_of_ratings.append(
                    i.find('div').text.split()[1].strip('()'))

            else:  # if element content doesn't contain digit, then append nan

                number_of_ratings.append(np.nan)

        except Exception:  # no rating found, so append nan for both fields

            rating.append(np.nan)
            number_of_ratings.append(np.nan)

        try:  # try and find the price details of the shop

            # the price, when present, is the second '·'-separated field
            price_field = i.find('div').text.split('·')[1].strip()

            if '$' in price_field:

                price.append(price_field)

            else:  # if the field doesn't contain a dollar sign, then append nan

                price.append(np.nan)

        except Exception:  # no result was found, so append nan
            price.append(np.nan)

        try:  # the shop type is the last '·'-separated field
            shop_types.append(i.find('div').text.split('·')[-1].strip())

        except Exception:  # no result was found, so append nan
            shop_types.append(np.nan)

        try:  # find key words such as cosy, casual, etc.

            key_words.append(
                i.find('div', {'class': 'tlDDJd'}).text.replace('\xa0·\xa0', ', '))

        except Exception:  # no key words found, so append nan
            key_words.append(np.nan)

    # visit each shop in the list to collect its address and reviews
    for single_item in driver.find_elements_by_class_name('VkpGBb'):

        try:  # fetch the details of a single shop

            sleep(4)  # sleep for four seconds
            driver.implicitly_wait(15)  # element lookups retry for up to 15 seconds
            single_item.click()  # click on a single shop at a time
            sleep(2)  # sleep for two seconds
            driver.implicitly_wait(10)  # reduce the implicit wait back to 10 seconds
            page_source = driver.page_source  # get the page source
            sleep(2)  # sleep for two seconds
            soup = BeautifulSoup(page_source, 'html')  # parse result as html
            sleep(2)

            try:  # try and find the address
                # append address of the shop to the address list
                address.append(soup.find_all('div', {
                               'data-attrid': 'kc:/location/location:address'})[0].find_all('span')[1].text)

            except Exception:  # no address found, so append nan
                address.append(np.nan)

            try:  # collect the reviews of a single shop

                dummy_list = []  # holds the reviews of the current shop

                # loop through the reviews shown for this shop
                for review in soup.find_all('div', {'class': 'Jtu6Td'}):

                    if review.text != '':  # check that the fetched result is not empty
                        # append review to the dummy list
                        dummy_list.append(review.text)
                    else:
                        # append nan to dummy list
                        dummy_list.append(np.nan)
                # after the loop, append the dummy list to the reviews list
                reviews.append(dummy_list)

            except Exception:  # no review found, so append nan
                reviews.append(np.nan)

        except Exception:  # could not fetch the shop details
            address.append(np.nan)
            reviews.append(np.nan)

    sleep(2)  # sleep for two seconds

    # re-locate the page links, since previously found elements no longer apply after a page change
    buttons = driver.find_elements_by_class_name('fl')

    try:  # find and click the link to the next page
        buttons[index].click()  # click on the next page
        sleep(2)  # sleep for 2 seconds
        driver.implicitly_wait(10)  # keep the 10-second implicit wait

        page_source = driver.page_source  # get the page source

        soup = BeautifulSoup(page_source, 'html')  # parse result as html

    except Exception:  # there are no further pages
        break  # stop paginating
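
The index-based splitting in the loop above depends on how many `·`-separated fields a result has. As a hedged alternative sketch, the hypothetical helper `parse_details` below classifies each field by its content instead of its position. The sample strings are assumed formats modelled on the rows shown in this notebook, not text captured from Google.

```python
import re


def parse_details(text):
    """Split a details string like '4.1 (1,571) · $$ · Coffee shop' into fields."""
    fields = [part.strip() for part in text.split('·') if part.strip()]
    parsed = {'rating': None, 'number_of_ratings': None,
              'price': None, 'shop_type': None}
    for field in fields:
        rating_match = re.fullmatch(r'(\d(?:\.\d)?)\s*\(([\d,]+)\)', field)
        if rating_match:
            parsed['rating'] = float(rating_match.group(1))
            parsed['number_of_ratings'] = int(rating_match.group(2).replace(',', ''))
        elif set(field) <= {'$'}:
            parsed['price'] = field
        else:
            # assume any other field is the shop type; the last one wins
            parsed['shop_type'] = field
    return parsed


full = parse_details('4.1 (1,571) · $$ · Coffee shop')
partial = parse_details('Café')
print(full)
print(partial)
```

Because every field starts as `None`, a result with no rating or price still produces one complete record, which keeps the columns aligned.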

In [11]:
# loop through the lists of results and append them row by row to the dataframe
for index, i in enumerate(zip(shop_names, rating, number_of_ratings, price, shop_types, key_words, address, reviews)):
    df.loc[index] = [i[0], i[1], i[2], i[3], i[4], i[5], i[6], i[7]]
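
Note that `zip` stops at the shortest list, so if one list ended up shorter than the others, rows would be dropped without any error. A minimal sketch of a stricter approach, using invented sample values rather than the scraped lists: check the lengths first, then build the dataframe in one step. The helper name `build_frame` is hypothetical.

```python
import numpy as np
import pandas as pd

# Invented sample values standing in for the scraped lists.
columns = {
    'shop_name': ['Cafe A', 'Cafe B'],
    'rating': ['4.3', np.nan],
    'price': ['$$', np.nan],
}


def build_frame(columns):
    """Build a dataframe from equal-length lists, failing loudly on a mismatch."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('column lengths differ: %s' % lengths)
    return pd.DataFrame(columns)


sample_df = build_frame(columns)
print(sample_df.shape)
```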

In [13]:
# print the head of the dataframe
df.head()

| | shop_name | rating | numbers_of_rating | price | shop_type | key_words | address | reviews |
|---|---|---|---|---|---|---|---|---|
| 0 | Elixir Bunn Coffee | 4.3 | 218 | | Café | | King Abdullah Rd, حي الحمراء، Riyadh 13215 | ['Amazing new branch for my favorite coffee ho... |
| 1 | Chamonix Cafe | 4.0 | 563 | `$$$` | Coffee shop | Late-night food, Breakfast, Outdoor seating | 9259 Wadi Al Awsat, Al nbsp;2430, Riyadh | ['Its really romantic and lovely place with go... |
| 2 | dr.CAFE COFFEE | 4.1 | 1571 | `$$` | Coffee shop | Cosy, Casual, Vegetarian options | As Sulimaniyah, Khurais Road Abi Al Arab Stree... | ["Sandwich wasn't tasty and it was expensive. ... |
| 3 | The Shaky | 3.9 | 52 | | Coffee shop | Cosy, Casual, Groups | لوكاليزر مول بوابة رقم 7 طريق الأمير محمد بن ع... | ['Nice place', "It's delicious you can build y... |
| 4 | قرمز كافيه - قهوة مختصة | 4.1 | 956 | `$$` | Coffee shop | Cosy, Casual, Groups | 2659 Dammam Branch Road Al Yarmuk Riyadh 13243... | ['This coffee shop is a two story shop with a ... |


In [14]:
# print the shape of the dataframe
df.shape

(198, 8)

In [15]:
# print columns datatype
df.dtypes

shop_name             object
rating               float64
numbers_of_rating     object
price                 object
shop_type             object
key_words             object
address               object
reviews               object
dtype: object

In [16]:
# save the dataframe as a csv file
df.to_csv('riyad_coffee_shops.csv')

> Now that the data has been saved as a CSV file, we load it again for cleaning: some rows have missing values, and some columns have data types that do not suit their values.

In [17]:
# load in the data
df = pd.read_csv('riyad_coffee_shops.csv', index_col=0)
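
One side effect of the CSV round trip is that the lists in the `reviews` column come back as plain strings. A minimal sketch of restoring them with `ast.literal_eval`, using a small in-memory CSV with invented rows instead of the real file; the helper name `to_list` is hypothetical.

```python
import ast
import io

import pandas as pd

# Invented rows; the real data lives in riyad_coffee_shops.csv.
original = pd.DataFrame({
    'shop_name': ['Cafe A', 'Cafe B'],
    'reviews': [['Nice place', "It's delicious"], float('nan')],
})
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)
loaded = pd.read_csv(buffer, index_col=0)


def to_list(value):
    """Turn the string form of a list back into a list; anything else becomes []."""
    if not isinstance(value, str):
        return []  # missing reviews are read back as NaN (a float)
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []  # e.g. a list containing a bare nan cannot be evaluated
    return parsed if isinstance(parsed, list) else []


before = type(loaded.reviews[0]).__name__
loaded['reviews'] = loaded.reviews.map(to_list)
print(before, loaded.reviews.tolist())
```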

In [18]:
# print the head
df.head()

| | shop_name | rating | numbers_of_rating | price | shop_type | key_words | address | reviews |
|---|---|---|---|---|---|---|---|---|
| 0 | Elixir Bunn Coffee | 4.3 | 218 | | Café | | King Abdullah Rd, حي الحمراء، Riyadh 13215 | ['Amazing new branch for my favorite coffee ho... |
| 1 | Chamonix Cafe | 4.0 | 563 | `$$$` | Coffee shop | Late-night food, Breakfast, Outdoor seating | 9259 Wadi Al Awsat, Al nbsp;2430, Riyadh | ['Its really romantic and lovely place with go... |
| 2 | dr.CAFE COFFEE | 4.1 | 1571 | `$$` | Coffee shop | Cosy, Casual, Vegetarian options | As Sulimaniyah, Khurais Road Abi Al Arab Stree... | ["Sandwich wasn't tasty and it was expensive. ... |
| 3 | The Shaky | 3.9 | 52 | | Coffee shop | Cosy, Casual, Groups | لوكاليزر مول بوابة رقم 7 طريق الأمير محمد بن ع... | ['Nice place', "It's delicious you can build y... |
| 4 | قرمز كافيه - قهوة مختصة | 4.1 | 956 | `$$` | Coffee shop | Cosy, Casual, Groups | 2659 Dammam Branch Road Al Yarmuk Riyadh 13243... | ['This coffee shop is a two story shop with a ... |


In [19]:
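# change the data type of numbers_of_rating to integer
def to_int(x):
    try:
        # drop thousands separators before converting
        return int(float(str(x).replace(',', '')))
    except (TypeError, ValueError):
        return 0  # missing or unparsable counts become 0

df.numbers_of_rating = df.numbers_of_rating.map(to_int)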
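
The same conversion can also be written without a helper function. The sketch below is one vectorized alternative built on `pd.to_numeric`, shown on invented sample values rather than the scraped column.

```python
import numpy as np
import pandas as pd

counts = pd.Series(['218', '1,571', np.nan, 'n/a'])  # invented sample values

cleaned = (
    pd.to_numeric(counts.str.replace(',', '', regex=False), errors='coerce')
    .fillna(0)       # missing or unparsable counts become 0
    .astype(int)
)
print(cleaned.tolist())
```

Here `errors='coerce'` turns anything that is not a number into NaN, so the missing value and the unparsable string both end up as 0.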

In [20]:
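# map the values of the price column to numeric values (missing prices become 0)
price_map_values = {np.nan: 0, '$': 1, '$$': 2, '$$$': 3}
# assign the result back instead of replacing in place on a column selection
df['price'] = df.price.replace(price_map_values)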
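
As a small worked example of this encoding, the sketch below applies an equivalent mapping with `Series.map` to invented sample values. Unlike `replace`, `map` turns any symbol missing from the dictionary into NaN, which is then filled with 0.

```python
import numpy as np
import pandas as pd

prices = pd.Series(['$', '$$$', np.nan, '$$'])  # invented sample values
price_levels = {'$': 1, '$$': 2, '$$$': 3}

# map() returns NaN for anything not in the dictionary, which fillna() turns into 0
encoded = prices.map(price_levels).fillna(0).astype(int)
print(encoded.tolist())
```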

In [21]:
df.head()

| | shop_name | rating | numbers_of_rating | price | shop_type | key_words | address | reviews |
|---|---|---|---|---|---|---|---|---|
| 0 | Elixir Bunn Coffee | 4.3 | 218 | 0 | Café | | King Abdullah Rd, حي الحمراء، Riyadh 13215 | ['Amazing new branch for my favorite coffee ho... |
| 1 | Chamonix Cafe | 4.0 | 563 | 3 | Coffee shop | Late-night food, Breakfast, Outdoor seating | 9259 Wadi Al Awsat, Al nbsp;2430, Riyadh | ['Its really romantic and lovely place with go... |
| 2 | dr.CAFE COFFEE | 4.1 | 1571 | 2 | Coffee shop | Cosy, Casual, Vegetarian options | As Sulimaniyah, Khurais Road Abi Al Arab Stree... | ["Sandwich wasn't tasty and it was expensive. ... |
| 3 | The Shaky | 3.9 | 52 | 0 | Coffee shop | Cosy, Casual, Groups | لوكاليزر مول بوابة رقم 7 طريق الأمير محمد بن ع... | ['Nice place', "It's delicious you can build y... |
| 4 | قرمز كافيه - قهوة مختصة | 4.1 | 956 | 2 | Coffee shop | Cosy, Casual, Groups | 2659 Dammam Branch Road Al Yarmuk Riyadh 13243... | ['This coffee shop is a two story shop with a ... |


In [22]:
# normalize near-duplicate values of the shop_type column
# (.loc assigns on the dataframe itself, which avoids the SettingWithCopyWarning)
df.loc[df.shop_type.str.contains('Coffee', na=False), 'shop_type'] = 'coffee shop'
df.loc[df.shop_type.str.contains('Caf', na=False), 'shop_type'] = 'cafe'
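
Assignments of this kind are safest through `.loc` with a boolean mask; chained indexing such as `df.shop_type[mask] = value` may write to a temporary copy and trigger pandas' `SettingWithCopyWarning`. A minimal sketch on invented values:

```python
import numpy as np
import pandas as pd

types = pd.DataFrame({'shop_type': ['Coffee shop', 'Café', 'Cafe', np.nan]})  # invented values

# na=False makes missing values count as "no match" instead of putting NaN into the mask
types.loc[types.shop_type.str.contains('Coffee', na=False), 'shop_type'] = 'coffee shop'
types.loc[types.shop_type.str.contains('Caf', na=False), 'shop_type'] = 'cafe'
print(types.shop_type.tolist())
```

The missing value in the last row is left untouched, so it can still be handled separately later.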


In [23]:
# replace the null values of the key_words column with the string 'none'
df['key_words'] = df.key_words.fillna('none')

In [24]:
# save the dataframe as a csv file
df.to_csv('riyad_coffee_shops_clean.csv')

> You can find the cleaned data on [Kaggle](https://www.kaggle.com/mohammedhalosaimi/riyadh-coffee-shops)