# How to web scrape data using Selenium and BeautifulSoup in Python - an example on www.mobile.de (2/3 part)

# Introduction
In this tutorial I will show you how I scraped mobile.de to collect data about cars. I hope that even experienced web scrapers will learn something.
This is the 2nd part of a 3 part series.

You can find part 1 and part 2 on the following links:
- part 1: 
- part 2: 

All the scripts that I used can be found on the following GitHub link: https://github.com/krinya/mobile_de_scraping

If you want to scrape a website and you need my help you can contact me: menyhert.kristof@gmail.com


# Import some packages that we are going to use
If you do not have any of the packages install them by using pip/ pip3

In [1]:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
import time
from bs4 import BeautifulSoup 
import pandas as pd
import numpy as np
import re
import requests
from random import *
from tqdm import tqdm #progress bar
from datetime import datetime
import os
import glob
import pickle #for saving data

# Load the data

Load the data that we scraped in part 1. We have links to all the model and make combinations that are present on mobile.de.
what we would like to achieve is to go to that - make and model combination - link which will list all the cars for that given category. After this we want to grab all the links for the adds and save them from that page.

In part 3 we will use these ad links to get the car related data (e.g.: price, condition, mileage, fuel type, first registration date, etc.)

In [2]:
make_model_data = pd.read_csv('data/make_and_model_links.csv')
make_model_data

Unnamed: 0,car_make,id1,car_model,id2,link
0,Mercedes-Benz,17200,190,126,https://suchen.mobile.de/fahrzeuge/search.html...
1,Mercedes-Benz,17200,200,127,https://suchen.mobile.de/fahrzeuge/search.html...
2,Mercedes-Benz,17200,220,128,https://suchen.mobile.de/fahrzeuge/search.html...
3,Mercedes-Benz,17200,230,129,https://suchen.mobile.de/fahrzeuge/search.html...
4,Mercedes-Benz,17200,240,130,https://suchen.mobile.de/fahrzeuge/search.html...
...,...,...,...,...,...
2230,Wiesmann,25650,MF 30,4,https://suchen.mobile.de/fahrzeuge/search.html...
2231,Wiesmann,25650,MF 35,6,https://suchen.mobile.de/fahrzeuge/search.html...
2232,Wiesmann,25650,MF 4,7,https://suchen.mobile.de/fahrzeuge/search.html...
2233,Wiesmann,25650,MF 5,8,https://suchen.mobile.de/fahrzeuge/search.html...


# Function to scrape one make and model combination:
I will summarize what I did here. I decided not to go into very detail because I think it is more or less straightforward.
1) We need to figure out how many pages a given make and model combination has. This is indicated on a button. We grab this number. See the picture below.
   ![last_button_picture](images/last_page_button.png)
2) We loop through all the pages by generating a link with incrementing a link that contains the page number.
3) We grab all the links for the adds listed on the page
4) Save this to a Pandas DataFrame

In [3]:
def scrape_links_for_one_make_model(make_model_input_link = '', sleep = 1, make_model_input_data = make_model_data, save_to_csv = True):

    chrome_options = webdriver.ChromeOptions()
    prefs = {"profile.managed_default_content_settings.images": 2} # this is to not load images
    chrome_options.add_experimental_option("prefs", prefs)

    #start a driver
    driver = webdriver.Chrome(ChromeDriverManager().install(), chrome_options=chrome_options)

    #get the number of pages
    driver.get(make_model_input_link)
    make_model_link_lastpage_source = driver.page_source
    make_model_link_soup = BeautifulSoup(make_model_link_lastpage_source, 'html.parser')

    last_button = make_model_link_soup.findAll('span', {'class': 'btn btn--muted btn--s'})
    
    #if there is only one page, then this gives an error so we need to check for that
    try:
        print("This many pages found: ", last_button[len(last_button)-1].text)
        last_button_number = last_button[len(last_button)-1].text
        last_button_number = int(last_button_number)
    except:
        last_button_number = int(1)
    
    driver.close()

    #start scraping the ads
    
    links_on_multiple_pages = []

    for i in tqdm(range(1, last_button_number + 1)):

        #start a new driver every time
        #we need this to avoid getting blocked by the website. If we don't do this, we will get captcha

        chrome_options = webdriver.ChromeOptions()
        prefs = {"profile.managed_default_content_settings.images": 2} # this is to not load images
        chrome_options.add_experimental_option("prefs", prefs)

        #start a driver
        driver = webdriver.Chrome(ChromeDriverManager().install(), chrome_options=chrome_options)

        #we need to navigate to the page
        one_page_link = make_model_input_link + "&pageNumber=" + str(i)

        driver.get(one_page_link)
        time.sleep(sleep)
        base_source = driver.page_source
        base_soup = BeautifulSoup(base_source, 'html.parser')

        #get all the links
        cars_add_list_all = base_soup.findAll('a', {'class': re.compile('^link--muted no--text--decoration')})

        links_on_one_page = []

        for i in range(len(cars_add_list_all)):

            link = cars_add_list_all[i]['href']
            
            if not link.endswith('SellerAd'):
                # filter out links that end with 'SellerAd' (these are links to ads and we do not need them)
                links_on_one_page.append(link)

        for elements in links_on_one_page:
            links_on_multiple_pages.append(elements)

        driver.close() #close the driver

    links_on_one_page_df = pd.DataFrame({'ad_link' : links_on_multiple_pages})
    #drop duplicates
    links_on_one_page_df = links_on_one_page_df.drop_duplicates()

    links_on_one_page_df['make_model_link'] = make_model_input_link #via this we can see which make and model the links belong to
    
    #datetime string
    now = datetime.now() 
    datetime_string = str(now.strftime("%Y%m%d_%H%M%S"))

    links_on_one_page_df['download_date_time'] = datetime_string

    #check is the make and model is in the dataframe
    if isinstance(make_model_input_data, pd.DataFrame):
        #join the dataframes to get make and model information
        links_on_one_page_df = pd.merge(links_on_one_page_df, make_model_input_data, how = 'left', left_on= ['make_model_link'], right_on = ['link'])

    #save the dataframe if save_to_csv is True
    if save_to_csv:
        #check if folder exists and if not create it
        if not os.path.exists('data/make_model_ads_links'):
            os.makedirs('data/make_model_ads_links')

        links_on_one_page_df.to_csv(str('data/make_model_ads_links/links_on_one_page_df' + datetime_string + '.csv'), index = False)

    return(links_on_one_page_df)

This will output a lots of strings. I do not know how to do this silently. If you know how to do this, please let me know.
But otherwise, you can use the following code to get the links.

This will grab all the links for all the Skoda Roomsters. It has aproximately 30 pages so it will take some time.

![skoda_roomsters](images/skoda_roomster_page.png)

(I commented out the script because we will do this for multiple make and model combinations in the next few lines)

In [4]:
# link_on_multiple_pages_data = scrape_links_for_one_make_model(
#     make_model_input_link = 'https://suchen.mobile.de/fahrzeuge/search.html?dam=0&isSearchRequest=true&ms=22900;13&ref=quickSearch&sfmr=false&vc=Car',
#     make_model_input_data = make_model_data,
#     save_to_csv=True)

In [5]:
#link_on_multiple_pages_data

# From here we just need to create a loop that goes throug all the make and model combinations.
Or we can just give the function explicitly which - make and model - link combination we would like to scrape. 

In [6]:
def multiple_link_on_multiple_pages_data(make_model_input_links = [], sleep = 1, make_model_input_data = make_model_data, save_to_csv = True):

    multiple_make_model_data = pd.DataFrame()

    for one_make_model_link in make_model_input_links:
        
        one_page_adds = scrape_links_for_one_make_model(make_model_input_link = one_make_model_link, sleep = sleep, make_model_input_data = make_model_input_data, save_to_csv = save_to_csv)
        multiple_make_model_data = pd.concat([multiple_make_model_data, one_page_adds], ignore_index=True)
    
    return(multiple_make_model_data)


Randomly get 3 make and model combination links.

In [7]:
#generate 3 random numbers between 1 1000
random_numbers = np.random.randint(0, len(make_model_data), 3)
random_numbers
for link in make_model_data.link[random_numbers]:
    print(link)

https://suchen.mobile.de/fahrzeuge/search.html?dam=0&isSearchRequest=true&ms=9900;4&ref=quickSearch&sfmr=false&vc=Car
https://suchen.mobile.de/fahrzeuge/search.html?dam=0&isSearchRequest=true&ms=3500;93&ref=quickSearch&sfmr=false&vc=Car
https://suchen.mobile.de/fahrzeuge/search.html?dam=0&isSearchRequest=true&ms=3500;37&ref=quickSearch&sfmr=false&vc=Car


Run this function to get the links for adds:
(This will output many things. I do not know how to do this silently. If you know how to do this, please let me know.)

In [8]:
multi_data = multiple_link_on_multiple_pages_data(make_model_input_links = make_model_data['link'][random_numbers])



Current google-chrome version is 97.0.4692
Get LATEST driver version for 97.0.4692
Driver [C:\Users\menyh\.wdm\drivers\chromedriver\win32\97.0.4692.71\chromedriver.exe] found in cache
  driver = webdriver.Chrome(ChromeDriverManager().install(), chrome_options=chrome_options)
  0%|          | 0/1 [00:00<?, ?it/s]

Current google-chrome version is 97.0.4692
Get LATEST driver version for 97.0.4692
Driver [C:\Users\menyh\.wdm\drivers\chromedriver\win32\97.0.4692.71\chromedriver.exe] found in cache
  driver = webdriver.Chrome(ChromeDriverManager().install(), chrome_options=chrome_options)
100%|██████████| 1/1 [00:10<00:00, 10.74s/it]


Current google-chrome version is 97.0.4692
Get LATEST driver version for 97.0.4692
Driver [C:\Users\menyh\.wdm\drivers\chromedriver\win32\97.0.4692.71\chromedriver.exe] found in cache


This many pages found:  42


  0%|          | 0/42 [00:00<?, ?it/s]

Current google-chrome version is 97.0.4692
Get LATEST driver version for 97.0.4692
Driver [C:\Users\menyh\.wdm\drivers\chromedriver\win32\97.0.4692.71\chromedriver.exe] found in cache
  2%|▏         | 1/42 [00:17<12:03, 17.64s/it]

Current google-chrome version is 97.0.4692
Get LATEST driver version for 97.0.4692
Driver [C:\Users\menyh\.wdm\drivers\chromedriver\win32\97.0.4692.71\chromedriver.exe] found in cache
  5%|▍         | 2/42 [00:24<07:29, 11.24s/it]

Current google-chrome version is 97.0.4692
Get LATEST driver version for 97.0.4692
Driver [C:\Users\menyh\.wdm\drivers\chromedriver\win32\97.0.4692.71\chromedriver.exe] found in cache
  7%|▋         | 3/42 [00:34<06:53, 10.59s/it]

Current google-chrome version is 97.0.4692
Get LATEST driver version for 97.0.4692
Driver [C:\Users\menyh\.wdm\drivers\chromedriver\win32\97.0.4692.71\chromedriver.exe] found in cache
 10%|▉         | 4/42 [00:46<07:13, 11.40s/it]

Current google-chrome version is

This many pages found:  7


Driver [C:\Users\menyh\.wdm\drivers\chromedriver\win32\97.0.4692.71\chromedriver.exe] found in cache
 14%|█▍        | 1/7 [00:10<01:01, 10.30s/it]

Current google-chrome version is 97.0.4692
Get LATEST driver version for 97.0.4692
Driver [C:\Users\menyh\.wdm\drivers\chromedriver\win32\97.0.4692.71\chromedriver.exe] found in cache
 29%|██▊       | 2/7 [00:20<00:51, 10.37s/it]

Current google-chrome version is 97.0.4692
Get LATEST driver version for 97.0.4692
Driver [C:\Users\menyh\.wdm\drivers\chromedriver\win32\97.0.4692.71\chromedriver.exe] found in cache
 43%|████▎     | 3/7 [00:30<00:41, 10.25s/it]

Current google-chrome version is 97.0.4692
Get LATEST driver version for 97.0.4692
Driver [C:\Users\menyh\.wdm\drivers\chromedriver\win32\97.0.4692.71\chromedriver.exe] found in cache
 57%|█████▋    | 4/7 [00:51<00:43, 14.49s/it]

Current google-chrome version is 97.0.4692
Get LATEST driver version for 97.0.4692
Driver [C:\Users\menyh\.wdm\drivers\chromedriver\win32\97.0.4692.71\chromedr

# We are done with the scraping part, we can crate a new data frame using the induvidual scraped files in the 'data/make_model_ads_links' folder

A function to concat files in a given folder. It also saves the file to CSV and/or pickle.

In [9]:
# concatenate the dataframes in one folder to get one file (with different columns)
def concatenate_dfs(indir, save_to_csv = True, save_to_pickle = True):
    

    fileList=glob.glob(str(str(indir) + "*.csv"))
    print("Found this many CSVs: ", len(fileList), " In this folder: ", str(os.getcwd()) + "/" + str(indir))

    output_file = pd.concat([pd.read_csv(filename) for filename in fileList])

    if save_to_csv:
        output_file.to_csv("data/make_model_ads_links_concatinated.csv", index=False)

    if save_to_pickle:
        output_file.to_pickle("data/make_model_ads_links_concatinated.pkl")

    return(output_file)


In [10]:
make_model_ads_data = concatenate_dfs(indir= "data/make_model_ads_links/", save_to_csv = False, save_to_pickle = True)
#(I use pickle to save the dataframe because it is faster and smaller than csv)

Found this many CSVs:  28  In this folder:  c:\Users\menyh\Documents\mobile_de_scraping/data/make_model_ads_links/


If we want to load the pickle file we can do that this way:

In [11]:
make_model_ads_data = pd.read_pickle("data/make_model_ads_links_concatinated.pkl")

In [12]:
make_model_ads_data

Unnamed: 0,ad_link,make_model_link,download_date_time,car_make,id1,car_model,id2,link
0,https://suchen.mobile.de/fahrzeuge/details.htm...,https://suchen.mobile.de/fahrzeuge/search.html...,20220117_160748,Skoda,22900,Roomster,13,https://suchen.mobile.de/fahrzeuge/search.html...
1,https://suchen.mobile.de/fahrzeuge/details.htm...,https://suchen.mobile.de/fahrzeuge/search.html...,20220117_160748,Skoda,22900,Roomster,13,https://suchen.mobile.de/fahrzeuge/search.html...
2,https://suchen.mobile.de/fahrzeuge/details.htm...,https://suchen.mobile.de/fahrzeuge/search.html...,20220117_160748,Skoda,22900,Roomster,13,https://suchen.mobile.de/fahrzeuge/search.html...
3,https://suchen.mobile.de/fahrzeuge/details.htm...,https://suchen.mobile.de/fahrzeuge/search.html...,20220117_160748,Skoda,22900,Roomster,13,https://suchen.mobile.de/fahrzeuge/search.html...
4,https://suchen.mobile.de/fahrzeuge/details.htm...,https://suchen.mobile.de/fahrzeuge/search.html...,20220117_160748,Skoda,22900,Roomster,13,https://suchen.mobile.de/fahrzeuge/search.html...
...,...,...,...,...,...,...,...,...
133,https://suchen.mobile.de/fahrzeuge/details.htm...,https://suchen.mobile.de/fahrzeuge/search.html...,20220118_144643,BMW,3500,735,37,https://suchen.mobile.de/fahrzeuge/search.html...
134,https://suchen.mobile.de/fahrzeuge/details.htm...,https://suchen.mobile.de/fahrzeuge/search.html...,20220118_144643,BMW,3500,735,37,https://suchen.mobile.de/fahrzeuge/search.html...
135,https://suchen.mobile.de/fahrzeuge/details.htm...,https://suchen.mobile.de/fahrzeuge/search.html...,20220118_144643,BMW,3500,735,37,https://suchen.mobile.de/fahrzeuge/search.html...
136,https://suchen.mobile.de/fahrzeuge/details.htm...,https://suchen.mobile.de/fahrzeuge/search.html...,20220118_144643,BMW,3500,735,37,https://suchen.mobile.de/fahrzeuge/search.html...


# Some quick statistics about our scraped data

We already scraped this many make and model combinations.

In [13]:
count_ads = make_model_ads_data.groupby(['car_make', 'car_model', 'download_date_time'], dropna=False).agg(number_of_ads=('ad_link', 'count'))
count_ads = count_ads.reset_index()
count_ads

Unnamed: 0,car_make,car_model,download_date_time,number_of_ads
0,Aston Martin,Rapide,20220117_174327,49
1,BMW,535,20220117_173238,901
2,BMW,732,20220117_175728,5
3,BMW,735,20220118_144643,138
4,BMW,214 Gran Tourer,20220117_162428,2
5,BMW,M4,20220118_144517,880
6,BMW,X6,20220117_163645,943
7,Bentley,Continental GTC,20220117_175658,167
8,Chevrolet,Matiz,20220117_162756,317
9,Ferrari,GTC4Lusso,20220117_182631,73


If we want to keep only the lastly scraped data for each make and model combination we can do this the following way:

In [14]:
latest_scrape = make_model_ads_data.groupby(['car_make', 'car_model'], dropna=False).agg(number_of_ads=('ad_link', 'count'), latest_scrape=('download_date_time', 'max'))
latest_scrape = latest_scrape.reset_index()
latest_scrape

Unnamed: 0,car_make,car_model,number_of_ads,latest_scrape
0,Aston Martin,Rapide,49,20220117_174327
1,BMW,535,901,20220117_173238
2,BMW,732,5,20220117_175728
3,BMW,735,138,20220118_144643
4,BMW,214 Gran Tourer,2,20220117_162428
5,BMW,M4,880,20220118_144517
6,BMW,X6,943,20220117_163645
7,Bentley,Continental GTC,167,20220117_175658
8,Chevrolet,Matiz,317,20220117_162756
9,Ferrari,GTC4Lusso,73,20220117_182631


In [15]:
make_model_ads_data_latest = pd.merge(make_model_ads_data, latest_scrape[['car_make', 'car_model', 'latest_scrape']], how = 'left', left_on = ['car_make', 'car_model'], right_on = ['car_make', 'car_model'])
make_model_ads_data_latest = make_model_ads_data_latest.reset_index(drop=True)
# keep rows where download_date_time is equal to latest_scrape
make_model_ads_data_latest = make_model_ads_data_latest[make_model_ads_data_latest['download_date_time'] == make_model_ads_data_latest['latest_scrape']]
make_model_ads_data_latest = make_model_ads_data_latest.reset_index(drop=True)
# drop the latest_scrape column
make_model_ads_data_latest = make_model_ads_data_latest.drop(columns = ['latest_scrape'])
make_model_ads_data_latest

Unnamed: 0,ad_link,make_model_link,download_date_time,car_make,id1,car_model,id2,link
0,https://suchen.mobile.de/fahrzeuge/details.htm...,https://suchen.mobile.de/fahrzeuge/search.html...,20220117_162428,BMW,3500,214 Gran Tourer,116,https://suchen.mobile.de/fahrzeuge/search.html...
1,https://suchen.mobile.de/fahrzeuge/details.htm...,https://suchen.mobile.de/fahrzeuge/search.html...,20220117_162428,BMW,3500,214 Gran Tourer,116,https://suchen.mobile.de/fahrzeuge/search.html...
2,https://suchen.mobile.de/fahrzeuge/details.htm...,https://suchen.mobile.de/fahrzeuge/search.html...,20220117_162448,Mercedes-Benz,17200,CLS 400 Shooting Brake,263,https://suchen.mobile.de/fahrzeuge/search.html...
3,https://suchen.mobile.de/fahrzeuge/details.htm...,https://suchen.mobile.de/fahrzeuge/search.html...,20220117_162448,Mercedes-Benz,17200,CLS 400 Shooting Brake,263,https://suchen.mobile.de/fahrzeuge/search.html...
4,https://suchen.mobile.de/fahrzeuge/details.htm...,https://suchen.mobile.de/fahrzeuge/search.html...,20220117_162448,Mercedes-Benz,17200,CLS 400 Shooting Brake,263,https://suchen.mobile.de/fahrzeuge/search.html...
...,...,...,...,...,...,...,...,...
8562,https://suchen.mobile.de/fahrzeuge/details.htm...,https://suchen.mobile.de/fahrzeuge/search.html...,20220118_144643,BMW,3500,735,37,https://suchen.mobile.de/fahrzeuge/search.html...
8563,https://suchen.mobile.de/fahrzeuge/details.htm...,https://suchen.mobile.de/fahrzeuge/search.html...,20220118_144643,BMW,3500,735,37,https://suchen.mobile.de/fahrzeuge/search.html...
8564,https://suchen.mobile.de/fahrzeuge/details.htm...,https://suchen.mobile.de/fahrzeuge/search.html...,20220118_144643,BMW,3500,735,37,https://suchen.mobile.de/fahrzeuge/search.html...
8565,https://suchen.mobile.de/fahrzeuge/details.htm...,https://suchen.mobile.de/fahrzeuge/search.html...,20220118_144643,BMW,3500,735,37,https://suchen.mobile.de/fahrzeuge/search.html...


# Next
In the next part which will be the 3rd part of this web scraping series I will show you how to get the data for each advartisment links that we collected in this script.
I hope you will check that out too.
When it is going to be published you will find it on the following link: