In [1]:
''' This is an introduction to collecting data from the web via web scraping with Python. The code cells are provided for your reference so that you can practice with them. This is not an all-inclusive tutorial for beginners: it covers only the necessary parts, with some examples. References are listed at the end of this tutorial.
For this workshop, suppose that we want to do some market research in the gaming industry. We will collect data from an online shop for video games in Switzerland and create a dataset that includes the title and the price of each game, as well as the link to each game's webpage.'''

In [2]:
''' 1. Installing and Importing Necessary Libraries
In the first step, we may need to install the two libraries that are required to simulate a web browsing experience in Python: “selenium” and “webdriver_manager.” After installation, we import one object from each library: first, the “webdriver” module from the “selenium” library; second, if you are a Google Chrome user, the “ChromeDriverManager” class from the “chrome” module of the “webdriver_manager” library. The import cell also loads “re” and “pandas,” which are used later to clean the price and to build the dataset.'''

In [None]:
pip install selenium

In [None]:
pip install webdriver_manager

In [32]:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import os
import re
import pandas as pd

In [10]:
''' 2. Initiating the Browsing Environment
In this step, you initiate the browsing environment for your Python scraper. A browser window will open automatically on your computer. Keep this window open; do not close it! The following code is specific to Google Chrome users. For other browsers, refer to the references at the end of the tutorial and check “WebDriver Installation.”'''

In [None]:
driver = webdriver.Chrome()

In [12]:
''' 3. Inspecting the Parent Webpage
Now, we look at the initial webpage that we want to scrape. We store the URL of that webpage for later use, and then open it with the webdriver (the simulation/testing environment).'''

In [15]:
url = 'https://www.wog.ch/index.cfm/budget/platform/Playstation-4'
driver.get(url)

In [14]:
'''  It may take a couple of seconds for the page to load, so make sure you wait long enough. You also need to know how to open the “Inspect” (or “Inspect Element”) tool in your browser. For instance, in Google Chrome, to inspect the title of a game, right-click the title and choose “Inspect” at the bottom of the menu. A panel opens that shows the HTML source code along with other data. This panel allows us to find the path to each specific item on the webpage that we are looking for.'''
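Instead of waiting by hand, the script can poll until the page is ready. Selenium ships its own explicit-wait helper (WebDriverWait); the sketch below is only a library-free stand-in that illustrates the same polling idea. The function name `wait_until` and its parameters are assumptions made for this illustration, and the stand-in condition replaces a real browser so that the cell runs on its own.

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)

# Hypothetical usage with the driver from this tutorial (not run here):
# wait_until(lambda: driver.find_elements('xpath', '//*[@id="wrapper"]'))

# Stand-in condition so the sketch runs without a browser:
# it reports "not ready" twice and "ready" on the third check.
checks = iter([False, False, True])
page_ready = wait_until(lambda: next(checks), timeout=2.0, poll=0.01)
print(page_ready)
```

In the hypothetical usage line, `find_elements` returns an empty list while the element is missing, so the condition only becomes truthy once the page content has appeared.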

In [16]:
''' 4. Downloading and Storing the HTML Data
We may want to download the HTML data of the webpage and store it in a variable so that we can extract features from it. To do that, we can use the “BeautifulSoup” class from the “bs4” library. Beautiful Soup is well suited to parsing HTML and XML documents.'''

In [17]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')

In [19]:
''' We can also save a copy of the raw data on the hard drive for later use. Here, we save it as a text file; because only a file name is given, the file is written to the current working directory.'''


In [20]:
with open('soup.txt', 'w', encoding='utf-8') as file:
    file.write(str(soup))
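To see where such a file ends up and to check that it can be read back, the step can be sketched with the standard library alone. The helper name `save_raw_html`, the sample HTML string, and the temporary folder are assumptions made for this illustration; in the tutorial itself the text would come from `str(soup)` and the folder would be your working directory.

```python
from pathlib import Path
import tempfile

def save_raw_html(html, folder, filename="soup.txt"):
    """Write the HTML text to folder/filename as UTF-8 and return the full path."""
    path = Path(folder) / filename
    path.write_text(html, encoding="utf-8")
    return path

# A relative file name such as 'soup.txt' is resolved against this directory.
print("Current working directory:", Path.cwd())

# Stand-in for str(soup); a temporary folder keeps the sketch free of side effects.
sample_html = "<html><body><p>Preis: CHF 42.90</p></body></html>"
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_raw_html(sample_html, tmp)
    saved_name = saved_path.name
    restored_html = saved_path.read_text(encoding="utf-8")

print(saved_name, len(restored_html))
```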

In [21]:
''' 5. Locating the Elements in the HTML Source
Now, we locate the information (elements) named in the exercise goal in the HTML source, using the Inspect window of the browser. Please check the reference on “Locating Elements with Selenium” at the end of the tutorial to familiarize yourself with the different ways to locate elements. For example, we find the title of the first game on the list using the following code:'''

In [22]:
Title = driver.find_element('xpath',r'//*[@id="wrapper"]/main/div[2]/div/div/div[1]/div[1]/div/div/div[4]').text

In [25]:
''' In this example, we located the title and found the path as follows:
1. Right-click the game title on the webpage and choose “Inspect” to open the Inspect window.
2. In the Inspect window, find the game title and right-click it. In the menu that opens, choose “Copy,” and then “Copy XPath” from the submenu.
3. The path is now on your clipboard. Paste it as the second argument of the “find_element” method, with 'xpath' as the first argument.
4. Since the title is the text content of the element we just found, we read its “text” property to extract it.
Now that we are familiar with the process, we extract the link to the dedicated game webpage. This link is stored in an “a” element inside the same game entry, next to the title element. We only need to get the “href” attribute of that element.'''

In [26]:
Link = driver.find_element('xpath',r'//*[@id="wrapper"]/main/div[2]/div/div/div[1]/div[1]/div/div/a').get_attribute('href')

In [27]:
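''' Let’s now locate and extract the price of the first game on the list. Since the price is the text content of the underlying element, we again read the “text” property. The extracted string may also contain non-numeric characters such as a currency label, so we strip everything except digits and separators with a regular expression and convert the result to a number.'''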

In [28]:
Price = driver.find_element('xpath',r'//*[@id="wrapper"]/main/div[2]/div/div/div[1]/div[1]/div/div/div[7]/div[1]/p').text

In [31]:
Price = float(re.sub(r'[^\d,.]', '', Price))
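The conversion above works when the cleaned string is a plain decimal such as 42.90, but `float` raises an error if a comma survives the filter. A slightly more tolerant variant could look like the sketch below. The helper name `parse_price` and the sample price strings are assumptions made for this illustration; the formats the shop actually uses have not been checked here.

```python
import re

def parse_price(raw):
    """Turn a scraped price string into a float, or None if it holds no digits."""
    # Same filter as above: keep only digits, commas and dots.
    cleaned = re.sub(r"[^\d,.]", "", raw)
    # Drop leading/trailing separators, e.g. the dot left over from "42.-".
    cleaned = cleaned.strip(",.")
    if "," in cleaned and "." in cleaned:
        # Assumption: with both present, the comma separates thousands.
        cleaned = cleaned.replace(",", "")
    else:
        # Assumption: a lone comma is a decimal separator.
        cleaned = cleaned.replace(",", ".")
    return float(cleaned) if cleaned else None

for sample in ["CHF 42.90", "1,299.00", "42,90", "n/a"]:
    print(repr(sample), parse_price(sample))
```

Returning None for strings without digits keeps a later loop over many games from stopping at the first product that has no readable price.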

In [33]:
game_record = pd.DataFrame( { 'Title' : Title, 
                             'Link' : Link, 
                             'Price' : Price },
                          index=[1])

In [34]:
game_record

   Title                    Link                                               Price
1  Assassin's Creed Mirage  https://www.wog.ch/index.cfm/details/product/1...  42.9


In [35]:
''' 6. Automating the Extraction Process
Suppose we need to extract those three features for all the games in the list on the webpage. To do that, we need to locate the root element that leads to all three attributes; this element changes from one game on the list to the next. We can look at the path for the title of the second game and check where the shared path ends.
The path to the first game’s title:
‘//*[@id="wrapper"]/main/div[2]/div/div/div[1]/div[1]/div/div/div[4]’
The path to the second game’s title:
‘//*[@id="wrapper"]/main/div[2]/div/div/div[1]/div[2]/div/div/div[4]’
The two paths share the root ‘//*[@id="wrapper"]/main/div[2]/div/div/div[1]/div’ and differ only in the index that follows it ([1] for the first game, [2] for the second). The remainder, ‘div/div/div[4]’, is again identical. Without the index, the shared root matches every game entry, so we can use it to extract all the elements that contain the attributes we need. Since there are multiple elements this time, we must use the “find_elements” method. We store all the elements in a variable for further processing.'''

In [36]:
elements = driver.find_elements('xpath',r'//*[@id="wrapper"]/main/div[2]/div/div/div[1]/div')
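The shared root used above can also be derived mechanically by comparing the two copied paths step by step. The sketch below is one way to do that; the helper name `split_shared_root` is an assumption made for this illustration, and splitting on "/" only works because none of the predicates in these two paths contains a slash.

```python
import re

first_title = '//*[@id="wrapper"]/main/div[2]/div/div/div[1]/div[1]/div/div/div[4]'
second_title = '//*[@id="wrapper"]/main/div[2]/div/div/div[1]/div[2]/div/div/div[4]'

def split_shared_root(path_a, path_b):
    """Return (root, tail): root matches every entry, tail is relative to one entry."""
    steps_a, steps_b = path_a.split("/"), path_b.split("/")
    for i, (step_a, step_b) in enumerate(zip(steps_a, steps_b)):
        if step_a != step_b:
            break
    else:
        raise ValueError("the two paths do not differ within their shared length")
    # Drop the trailing index, e.g. 'div[1]' -> 'div', so the step matches all entries.
    differing_step = re.sub(r"\[\d+\]$", "", steps_a[i])
    root = "/".join(steps_a[:i] + [differing_step])
    tail = "/".join(steps_a[i + 1:])
    return root, tail

root_xpath, title_tail = split_shared_root(first_title, second_title)
print(root_xpath)
print(title_tail)
```

The first value is the path passed to `find_elements`, and the second is the relative path that locates the title inside a single game element.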

In [37]:
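''' Let’s try to find the third game’s title and weblink. Instead of putting the game number into the path, we pick the third element from the list (index 2, because Python lists start at 0) and use the remainder of the path, relative to that element, as the XPath address.'''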

In [39]:
Title = elements[2].find_element('xpath',r'div/div/div[4]').text
Weblink = elements[2].find_element('xpath',r'div/div/a').get_attribute('href')

In [40]:
''' Now we can iterate through the list of elements and extract the attributes of every game.'''

In [41]:
game_records = []
for element in elements:
    # XPaths are relative to the current game element
    Title = element.find_element('xpath',r'div/div/div[4]').text
    Link = element.find_element('xpath',r'div/div/a').get_attribute('href')
    # Price extraction is left disabled here
    #Price = re.sub(r'[^\d,.]', '',element.find_element(
        #'xpath',r'div/div/div[7]/div[1]/p').text)
    game_records.append([Title,Link])
game_data = pd.DataFrame(data = game_records, columns = ['Title','Weblink'])
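If some entries on the page lack one of the sub-elements (a price, for instance), `find_element` raises an exception and the loop stops. One common way around this is to call `find_elements`, which returns an empty list when nothing matches. The sketch below illustrates that idea; the helper names and the `FakeNode`/`FakeRow` stand-in classes are assumptions made so that the cell runs without a browser, and they are not part of Selenium.

```python
def first_text(element, xpath):
    """Text of the first match below element, or None if nothing matches."""
    matches = element.find_elements('xpath', xpath)
    return matches[0].text if matches else None

def first_attribute(element, xpath, name):
    """Attribute of the first match below element, or None if nothing matches."""
    matches = element.find_elements('xpath', xpath)
    return matches[0].get_attribute(name) if matches else None

def extract_record(element):
    """Collect title, link and raw price text for one game entry."""
    return {
        'Title': first_text(element, 'div/div/div[4]'),
        'Weblink': first_attribute(element, 'div/div/a', 'href'),
        'Price': first_text(element, 'div/div/div[7]/div[1]/p'),
    }

# Hypothetical stand-ins that mimic the two Selenium calls used above.
class FakeNode:
    def __init__(self, text='', attributes=None):
        self.text = text
        self._attributes = attributes or {}

    def get_attribute(self, name):
        return self._attributes.get(name)

class FakeRow:
    def __init__(self, children):
        self._children = children  # maps a relative XPath to a FakeNode

    def find_elements(self, by, xpath):
        node = self._children.get(xpath)
        return [node] if node is not None else []

rows = [
    FakeRow({'div/div/div[4]': FakeNode('Game A'),
             'div/div/a': FakeNode(attributes={'href': 'https://example.com/a'}),
             'div/div/div[7]/div[1]/p': FakeNode('CHF 10.00')}),
    # The second entry has no price element at all.
    FakeRow({'div/div/div[4]': FakeNode('Game B'),
             'div/div/a': FakeNode(attributes={'href': 'https://example.com/b'})}),
]
records = [extract_record(row) for row in rows]
print(records)
```

With the real page, the same function could be applied to the list returned by `driver.find_elements`, and the resulting list of dictionaries can be passed directly to `pd.DataFrame`.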

In [42]:
game_data

   Title                                              Weblink
0  Assassin's Creed Mirage                            https://www.wog.ch/index.cfm/details/product/1...
1  Assassin's Creed Mirage - Deluxe Steelbook Edi...  https://www.wog.ch/index.cfm/details/product/1...
2  Bloodborne - Game of the Year Edition              https://www.wog.ch/index.cfm/details/product/4...
3  Bud Spencer & Terence Hill: Slaps and Beans 2      https://www.wog.ch/index.cfm/details/product/1...
4  Call of Duty: Modern Warfare                       https://www.wog.ch/index.cfm/details/product/8...
5  Call of Duty: Modern Warfare III                   https://www.wog.ch/index.cfm/details/product/1...
6  Call of Duty: Modern Warfare III - Limited...      https://www.wog.ch/index.cfm/details/product/1...
7  Call of Duty: Modern Warfare III - Steelbook...    https://www.wog.ch/index.cfm/details/product/1...
8  Controller Dualshock 4 -Glacier White-             https://www.wog.ch/index.cfm/details/product/3...
9  Controller Dualshock 4 -Jet Black-                 https://www.wog.ch/index.cfm/details/product/5...
