# <span style="color:blue">Scraping data from Coolblue</span>
![image.png](attachment:image.png)

For the course Online Data Collection & Management at Tilburg University, our team aims to scrape from the Coolblue website. Coolblue is a Dutch e-commerce company. However, Coolblue claims that they are more than just an ecommerce company: she build end-to-end  solution for her customers. They are active in The Netherlands, Belgium, and Germany, both online and in her 15 physical stores.

In this scrape we scrape Coolblue’s supply in laptops. All laptops Coolblue has (in this code referred to as "al") are compared to Coolblue’s Choice laptops (in this code referred to as "cc") regarding the names, prices, and urls of these laptops.

Coolblue’s Choice indicates the best products for Coolblue’s customers, according to experts and other consumers. The products in this category are hardly ever returned, and have the best reviews. It is unique and very customer-oriented. It makes the choice to buy a certain product easier and more accessible.

In this notebook we will discuss the scraping in three chapters:
1. The preparation before scraping
2. The Coolblues Choice and All Laptops scraper
3. Saving the scraped data in csv files

# <span style="color:blue">1. The preparation before scraping</span>
Before the Coolblue website can be scraped, a few things need to be prepared:
* Import necessary libraries
* Set driver and define the base url
* Accept cookies

### 1.1 Import necessary libraries

In [1]:
import requests # Lets us make HTML request to Coolblue's website server for retrieving the data from their page
from bs4 import BeautifulSoup # Easy to use for beginners and will allow us to extract data from HTML files
import time # Pauses the execution of the commands. We use this because of the amount of data we gather
from time import sleep # Sleep package is needed to obey retrieval limits
import csv # Stores the scraped data in a csv file in the end
from datetime import datetime # Adds current day to csv file

import selenium # To open selenium webdriver to gather information from the websites
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from re import search
from selenium.webdriver import ActionChains

### 1.2 Set driver and define the base url

In [2]:
# Automatically installs/updates and opens ChromeDriver
driver = webdriver.Chrome(ChromeDriverManager().install())



Current google-chrome version is 100.0.4896
Get LATEST chromedriver version for 100.0.4896 google-chrome
Driver [/Users/irisvanwalraven/.wdm/drivers/chromedriver/mac64/100.0.4896.60/chromedriver] found in cache


In [3]:
# Defining the base url
base_url = 'https://www.coolblue.nl'
driver.get(base_url)
sleep(2) # To load the full page

### 1.3 Accept cookies

In [4]:
driver.find_element_by_name("accept_cookie").click()
print("ok")

ok


# <span style="color:blue">2. The Coolblues Choice and All Laptops scraper</span>

### 2.1 Coolblues Choice
In this step the page urls are defined

In [5]:
def generate_page_urls_cc(base_url, num_pages):
    page_urls_cc = []
    
    for counter in range(1,num_pages +1):
        coolblues_choice_urls = base_url + '/en/laptops/coolblues-choice?page=' + str(counter)
        page_urls_cc.append(coolblues_choice_urls)
    return page_urls_cc

In [6]:
page_urls_cc = generate_page_urls_cc(base_url, 2) # Coolblues Choice only has 2 pages, so we define 2 page urls
page_urls_cc

['https://www.coolblue.nl/en/laptops/coolblues-choice?page=1',
 'https://www.coolblue.nl/en/laptops/coolblues-choice?page=2']

In the next step, an information list is defined where the following information is collected per laptop: titles, urls and prices. 

In [7]:
def extract_laptop_urls_cc(page_urls_cc):
    information_list_cc = []
    
    for page_url_cc in page_urls_cc: 
        driver.get(page_url_cc) # To gather information from the predefined page urls
        time.sleep(2)
        res_cc = driver.page_source.encode('utf-8')
        soup_cc = BeautifulSoup(res_cc, "html.parser")
        information_cc = soup_cc.find_all(class_="product-grid__card")
        
        # For each laptop on that page look up the title, url and price and store it in a list
        for info_cc in information_cc: 
            title_cc = info_cc.find("img").attrs["alt"].replace(" Buy a laptop?","")
            url_cc = "https://www.coolblue.nl" + info_cc.find("a").attrs["href"]
            price_cc = info_cc.find(class_="sales-price__current js-sales-price-current").text
            information_list_cc.append({"title": title_cc,
                                        "url": url_cc,
                                        "price": price_cc}) 
        
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);') # Scroll down the page to collect data from all laptops
            
        sleep(1)  # Pause 1 second after each request
            
    return information_list_cc

In [8]:
information_list_cc = extract_laptop_urls_cc(page_urls_cc) # To create the list with information per laptop
information_list_cc 

[{'title': 'HP 15s-fq2960nd',
  'url': 'https://www.coolblue.nl/en/product/873236/hp-15s-fq2960nd.html',
  'price': '499,-'},
 {'title': 'Lenovo ThinkBook 15 G2 - 20VE0049MH',
  'url': 'https://www.coolblue.nl/en/product/872868/lenovo-thinkbook-15-g2-20ve0049mh.html',
  'price': '1.059,-'},
 {'title': 'Apple MacBook Air (2020) 16GB/256GB Apple M1 with 7-core GPU Space Gray',
  'url': 'https://www.coolblue.nl/en/product/874171/apple-macbook-air-2020-16gb-256gb-apple-m1-with-7-core-gpu-space-gray.html',
  'price': '1.359,-'},
 {'title': 'HP Pavilion 15-eh1907nd',
  'url': 'https://www.coolblue.nl/en/product/889428/hp-pavilion-15-eh1907nd.html',
  'price': '599,-'},
 {'title': 'Lenovo IdeaPad 5 14ARE05 81YM00F7MH',
  'url': 'https://www.coolblue.nl/en/product/900571/lenovo-ideapad-5-14are05-81ym00f7mh.html',
  'price': '749,-'},
 {'title': 'HP Chromebook 14a-na0170nd',
  'url': 'https://www.coolblue.nl/en/product/868651/hp-chromebook-14a-na0170nd.html',
  'price': '249,-'},
 {'title': 'Le

In [9]:
print("The number of laptops in Coolblues choice is " +str(len(information_list_cc)))

The number of laptops in Coolblues choice is 43


### 2.1 All Laptops
In this step the page urls are defined

In [10]:
def generate_page_urls_al(base_url, num_pages):
    page_urls_al = []
    
    for counter in range(1,num_pages +1):
        all_laptops_urls = base_url + '/en/laptops/filter?page=' + str(counter)
        page_urls_al.append(all_laptops_urls)
    return page_urls_al

In [11]:
page_urls_al = generate_page_urls_al(base_url, 20) # All Laptops has 20 pages, so we define 20 page urls
page_urls_al

['https://www.coolblue.nl/en/laptops/filter?page=1',
 'https://www.coolblue.nl/en/laptops/filter?page=2',
 'https://www.coolblue.nl/en/laptops/filter?page=3',
 'https://www.coolblue.nl/en/laptops/filter?page=4',
 'https://www.coolblue.nl/en/laptops/filter?page=5',
 'https://www.coolblue.nl/en/laptops/filter?page=6',
 'https://www.coolblue.nl/en/laptops/filter?page=7',
 'https://www.coolblue.nl/en/laptops/filter?page=8',
 'https://www.coolblue.nl/en/laptops/filter?page=9',
 'https://www.coolblue.nl/en/laptops/filter?page=10',
 'https://www.coolblue.nl/en/laptops/filter?page=11',
 'https://www.coolblue.nl/en/laptops/filter?page=12',
 'https://www.coolblue.nl/en/laptops/filter?page=13',
 'https://www.coolblue.nl/en/laptops/filter?page=14',
 'https://www.coolblue.nl/en/laptops/filter?page=15',
 'https://www.coolblue.nl/en/laptops/filter?page=16',
 'https://www.coolblue.nl/en/laptops/filter?page=17',
 'https://www.coolblue.nl/en/laptops/filter?page=18',
 'https://www.coolblue.nl/en/laptops/

In the next step, an information list is defined where the following information is collected per laptop: titles, urls and prices. 

In [12]:
def extract_laptop_urls_al(page_urls_al):
    information_list_al = []
    
    for page_url_al in page_urls_al:
            driver.get(page_url_al) # To gather information from the predefined page urls
            time.sleep(2)
            res_al = driver.page_source.encode('utf-8')
            soup_al = BeautifulSoup(res_al, "html.parser")
            information_al = soup_al.find_all(class_="product-grid__card")
        
        # For each laptop on that page look up the title, url and price and store it in a list
            for info_al in information_al:
                title_al = info_al.find("img").attrs["alt"].replace(" Buy a laptop?","")
                url_al = "https://www.coolblue.nl" + info_al.find("a").attrs["href"]
                if 
                price_al = info_al.find(class_="sales-price__current js-sales-price-current") # We want to get the price with code: '.text' but it gives an error so we use the whole html line
                information_list_al.append({"title": title_al,
                                            "url": url_al,
                                            "price": price_al}) 
                
            driver.execute_script('window.scrollTo(0, document.body.scrollHeight);') # Scroll down the page to collect data from all laptops
        
            sleep(5) # Pause 5 second after each request   
        
    return information_list_al

In [13]:
information_list_al = extract_laptop_urls_al(page_urls_al) # To create the list with information per laptop
information_list_al

[{'title': 'Lenovo ThinkBook 15 G2 - 20VE0049MH',
  'url': 'https://www.coolblue.nl/en/product/872868/lenovo-thinkbook-15-g2-20ve0049mh.html',
  'price': <strong class="sales-price__current js-sales-price-current">1.059,-</strong>},
 {'title': 'HP 15s-fq2960nd',
  'url': 'https://www.coolblue.nl/en/product/873236/hp-15s-fq2960nd.html',
  'price': <strong class="sales-price__current js-sales-price-current">499,-</strong>},
 {'title': '',
  'url': 'https://www.coolblue.nlhttps://www.coolblue.nl/en/c/app.html',
  'price': None},
 {'title': 'Apple MacBook Air (2020) 16GB/256GB Apple M1 with 7-core GPU Space Gray',
  'url': 'https://www.coolblue.nl/en/product/874171/apple-macbook-air-2020-16gb-256gb-apple-m1-with-7-core-gpu-space-gray.html',
  'price': <strong class="sales-price__current js-sales-price-current">1.359,-</strong>},
 {'title': 'Lenovo IdeaPad 5 14ARE05 81YM00F7MH',
  'url': 'https://www.coolblue.nl/en/product/900571/lenovo-ideapad-5-14are05-81ym00f7mh.html',
  'price': <strong

In [14]:
print("The number of all laptops is " +str(len(information_list_al)))

The number of all laptops is 480


# <span style="color:blue">3. Saving the scraped data in csv files</span>

### 3.1 For Coolblues Choice

In [15]:
with open("coolblues-choice.csv", "w", encoding = 'utf-8') as csv_file: # <<- this is the line with the "flag"l see exercises below
    writer = csv.writer(csv_file, delimiter = ";")
    writer.writerow(["title", "url", "price"])
    for cc_info in information_list_cc: # here we reference the book_descriptions list - make sure it's loaded otherwise you get an error! (Cell > Run All Above)
        writer.writerow([cc_info['title'], cc_info['url'], cc_info['price']])
print('done cc!')

done cc!


### 3.1 For All Laptops

In [16]:
with open("all-laptops.csv", "w", encoding = 'utf-8') as csv_file: # <<- this is the line with the "flag"l see exercises below
    writer = csv.writer(csv_file, delimiter = ";")
    writer.writerow(["title", "url", "price"])
    for al_info in information_list_al: # here we reference the book_descriptions list - make sure it's loaded otherwise you get an error! (Cell > Run All Above)
        writer.writerow([al_info['title'], al_info['url'], al_info['price']])
print('done al!')

done al!
