# Finding a Good Value BMW: Predicting Re-Sale Prices (notebook I)

This is project 2 of the Metis Data Science Bootcamp in Singapore and was presented in February 2020. Main activities in the project were:

1) **Scraping information** from BMW ads in an online marketplace for used cars in Germany using Beautiful Soup (notebook 1, see below)

2) **Analysing, cleansing and pre-processing** the scraped data (notebook 2)

3) Fitting and evaluating a **multivariate regression model** to determine the fair market value of used BMWs based on features such as mileage, age, and CO2 emissions (notebook 3)

## Content of notebook: Scraping information from 500+ online ads for used cars

<span style="color: red;"> Do not run this notebook - it will throw error messages since the website is masked! </span>

The web scraping approach can be applied to other websites, with some customization to match their architecture.

Notebooks 2 and 3 can be run safely since this repo also contains the scraped data as pkl files.

In [1]:
from bs4 import BeautifulSoup
import requests
import time
import random
import pickle
import pandas as pd

Check the T&C of the website:

In [4]:
'''
url_robots_tc = 'xxxxxx'
response_robots  = requests.get(url_robots_tc)
print(response_robots.text)
'''

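The variable names in the disabled cell above suggest that the check reads the site's robots.txt. As a minimal sketch of how such a file could be evaluated programmatically, the standard library's `urllib.robotparser` can parse the text and answer whether a path may be fetched. The robots.txt content and the helper `is_allowed` are hypothetical; the real file of the masked website is not shown in this notebook.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (assumption, not the masked website's real file)
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /details/
"""

def is_allowed(robots_txt, path, user_agent="Mozilla/5.0"):
    """Return True if the given robots.txt text permits user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

print(is_allowed(SAMPLE_ROBOTS_TXT, "/details/242162/"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "/admin/settings"))
```

With a live site, the text passed to `is_allowed` would be `response_robots.text` from the cell above. A robots.txt check does not replace reading the written terms and conditions.
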
### From the search results page for 'BMW 116', capture the URL extensions that lead to the detail page of each car

A search for used BMW cars (model "116") on the website xxxxxxxx (masked here) yielded 517 results.

In [3]:
url = 'xxx' 
response = requests.get(url)

In [4]:
response.status_code

200

In [5]:
response.text[:100]

'<!DOCTYPE html>\n<html>\n<head>\n<meta charset="utf-8">\n\n\n<meta http-equiv="X-UA-Compatible" content="I'

In [6]:
page = response.text

In [7]:
soup = BeautifulSoup(page, "lxml")

In [18]:
print(soup.prettify())

Loop through the results pages to generate a list of URL extensions that will have to be appended to the main URL ('xxx') in order to access the pages with the detailed information on the used cars.


In [49]:
number_search_results = 517
number_results_per_page = 20
number_of_loops = number_search_results//number_results_per_page + 1
number_of_loops

26
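The page count above uses floor division plus one, which gives the right answer for 517 results but would request one page too many if the number of results were an exact multiple of the page size. A small sketch of the rounding-up alternative follows; the function name `pages_needed` is made up for illustration.

```python
import math

def pages_needed(total_results, results_per_page):
    """Number of result pages, rounding up only when there is a partial last page."""
    return math.ceil(total_results / results_per_page)

print(pages_needed(517, 20))  # same result as 517 // 20 + 1
print(pages_needed(520, 20))  # 520 // 20 + 1 would give one page more
```
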

In [50]:
URL_extensions = []
user_agent = {'User-agent': 'Mozilla/5.0'}
for i in range(number_of_loops):
    url_loop = 'xxx' + str(i)  # masked search URL with the loop index appended
    response = requests.get(url_loop, headers=user_agent)
    # Pause for 0.5 to 2.5 seconds between requests
    time.sleep(.5 + 2*random.random())
    soup = BeautifulSoup(response.text, "lxml")
    # Each search result sits in a teaser div; its picture link points to the detail page
    for div in soup.find_all('div', class_="gwmlistteaser clearfix"):
        for link in div.find_all('a', class_="pictureblock"):
            URL_extensions.append(link.get("href"))
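Since the website is masked, the selection logic of the nested loops cannot be tried out against the real pages. The sketch below reproduces it offline on a made-up HTML snippet. It deliberately swaps Beautiful Soup for the standard library's `html.parser` so that it runs without extra packages. The sample markup and the class `TeaserLinkParser` are assumptions, and unlike the `class_="gwmlistteaser clearfix"` filter above it matches any div that carries the class `gwmlistteaser`.

```python
from html.parser import HTMLParser

class TeaserLinkParser(HTMLParser):
    """Collect hrefs of 'pictureblock' links nested inside 'gwmlistteaser' divs."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0  # > 0 while inside a teaser div (nested divs are counted)
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "div" and (self.div_depth > 0 or "gwmlistteaser" in classes):
            self.div_depth += 1
        elif tag == "a" and self.div_depth > 0 and "pictureblock" in classes:
            if attributes.get("href"):
                self.links.append(attributes["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1

# Made-up markup that mimics the structure the loop above relies on
SAMPLE_HTML = """
<div class="gwmlistteaser clearfix">
  <a class="pictureblock" href="/details/242162/"><img src="a.jpg"></a>
  <div class="text"><a href="/dealer/1/">Dealer</a></div>
</div>
<div class="gwmlistteaser clearfix">
  <a class="pictureblock" href="/details/578367/"><img src="b.jpg"></a>
</div>
<a class="pictureblock" href="/ads/banner/">outside a teaser</a>
"""

parser = TeaserLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.links)
```

Only the two links inside teaser divs are collected; the dealer link and the link outside a teaser are skipped.
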

In [51]:
# Example of what has been scraped
URL_extensions[0:5]

['/details/242162/',
 '/details/578367/',
 '/details/691149/',
 '/details/229773/',
 '/details/705923/']

Copy the list under a model-specific name to distinguish it from the lists scraped for other models:

In [52]:
URL_extensions_BMW116 = URL_extensions.copy()

In [54]:
with open('pickles/URL_extensions_BMW116.pkl', 'wb') as f:
    pickle.dump(URL_extensions_BMW116, f)

### Generate a list of URLs from which to scrape the details of the used cars

In [5]:
# The base URL is masked here
base_URL = 'xxx'
with open('pickles/URL_extensions_BMW116.pkl', 'rb') as f:
    URL_extensions_BMW116 = pickle.load(f)
URLs_BMW116 = [base_URL + URL_extension for URL_extension in URL_extensions_BMW116]

In [56]:
len(URL_extensions_BMW116)

517

In [6]:
#URLs_BMW116[0:5]
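Because the real URLs are hidden, here is a small sketch of what the combined URLs look like. It uses a placeholder base URL (`https://www.example.com` is an assumption, not the masked website) and `urllib.parse.urljoin` instead of plain string concatenation, which avoids a missing or doubled slash between the base and the extension.

```python
from urllib.parse import urljoin

# Placeholder base URL (assumption); the real one is masked in this notebook
base_url = "https://www.example.com"
url_extensions = ["/details/242162/", "/details/578367/"]

# urljoin resolves each extension against the base URL
detail_urls = [urljoin(base_url, extension) for extension in url_extensions]
print(detail_urls)
```
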

In [58]:
# If the scraping loop is run twice without resetting the list, the same extensions are appended again
# and the list contains duplicates. To be on the safe side, check that the number of unique URL
# extensions (see below) equals the number of captured URL extensions.
unique_URL_extensions_BMW116 = list(dict.fromkeys(URL_extensions_BMW116))
len(unique_URL_extensions_BMW116)

517
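The duplicate check relies on `dict.fromkeys`, which keeps only the first occurrence of each key and preserves insertion order. A short illustration with a made-up list (the helper name `unique_in_order` is hypothetical):

```python
def unique_in_order(items):
    """Drop repeated items while keeping the first occurrence of each in place."""
    return list(dict.fromkeys(items))

sample = ['/details/242162/', '/details/578367/', '/details/242162/']
deduplicated = unique_in_order(sample)
print(deduplicated)
# Equal lengths mean there were no duplicates; here the lengths differ
print(len(sample), len(deduplicated))
```
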

In [60]:
# Note: the below file is not on GitHub
with open('pickles/URLs_BMW116.pkl', 'wb') as f:
    pickle.dump(URLs_BMW116, f)

### Scrape information from the detail page of each advertised car

In [62]:
# Two HTML strings are scraped per ad: the price and a long HTML string containing all other
# information in the ad (aka 'feature_lake'); the latter string is parsed in a later step.
pickle_steps = 180
start_value = 0
number_of_URLs_captured = len(URL_extensions_BMW116) # This is 517 here

car_URL_BMW116 = []
price_BMW116 = []
feature_lake_BMW116 = []
user_agent = {'User-agent': 'Mozilla/5.0'}

for i in range(start_value, number_of_URLs_captured, pickle_steps):
    URLs_sample_BMW116 = URLs_BMW116[i:i+pickle_steps]
    for detail_URL_BMW116 in URLs_sample_BMW116:
        car_URL_BMW116.append(detail_URL_BMW116)
        response = requests.get(detail_URL_BMW116, headers=user_agent)
        time.sleep(1)
        soup = BeautifulSoup(response.text, "lxml")
        try:
            # Price text: keep the first token and drop the thousands separator
            p = soup.find('p', class_="price").text.split()[0].replace('.','')
            lake_content = soup.find_all('table', class_="articletable")[0:3]
            lake_extraction = str(lake_content)
        except (AttributeError, IndexError):
            # No price element found: the ad is treated as expired
            p = "expired"
            lake_extraction = "expired"
        # Append both values together so that the lists stay aligned
        price_BMW116.append(p)
        feature_lake_BMW116.append(lake_extraction)
    # Extra random pause after each batch of URLs
    time.sleep(.5 + 2*random.random())

...
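The price extraction chain in the loop above can be tried in isolation. The sketch below assumes that the price element contains text such as `9.260 €` (German notation with a dot as thousands separator); the sample strings and the helper name `parse_price` are illustrative assumptions.

```python
def parse_price(price_text):
    """Take the first whitespace-separated token and remove the thousands dots."""
    return price_text.split()[0].replace('.', '')

# Sample texts are assumptions about how the price appears in the ad
print(parse_price("9.260 €"))
print(parse_price("\n  12.990 EUR\n"))
```

The result is still a string, so the conversion to a number is left to a later cleaning step.
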

In [63]:
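# Check that the information from all the ads was extracted successfully:
# both lists must have one entry per ad (a cell only displays its last expression)
assert len(price_BMW116) == len(feature_lake_BMW116)
len(feature_lake_BMW116)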

517

In [65]:
with open('pickles/car_URL_BMW116.pkl', 'wb') as f:
    pickle.dump(car_URL_BMW116, f)

In [66]:
with open('pickles/price_BMW116.pkl', 'wb') as f:
    pickle.dump(price_BMW116, f)

In [67]:
with open('pickles/feature_lake_BMW116.pkl', 'wb') as f:
    pickle.dump(feature_lake_BMW116, f)

### Parse the HTML string

Loading the pickled HTML strings makes it possible to re-run the notebook from this point without re-scraping the website, which takes a long time.

In [68]:
price_BMW116 = pickle.load(open('pickles/price_BMW116.pkl', 'rb'))

In [69]:
feature_lake_BMW116 = pickle.load(open('pickles/feature_lake_BMW116.pkl', 'rb'))
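The save and load steps repeat the same pattern for every list. One possible way to wrap them, sketched here with hypothetical helper names `save_pickle` and `load_pickle`, also makes sure the file handle is closed after loading. The round trip runs in a temporary directory so that it does not touch the `pickles/` folder.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Write obj to path in pickle format."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Read a pickled object from path and close the file afterwards."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Round trip with made-up data in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'price_demo.pkl')
    save_pickle(['9260', 'expired'], demo_path)
    restored = load_pickle(demo_path)
print(restored)
```
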

In [70]:
registration_date_string = []

for feature in feature_lake_BMW116:
    try:
        regd = feature.split('\n<td>')[2].split('</td>')[0]
        registration_date_string.append(regd)
    except:
        registration_date_string.append("expired")

In [71]:
len(registration_date_string)

517
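All parsing cells in this section follow the same pattern: split the HTML string on `'\n<td>'`, pick a chunk by index, and cut it off at the closing `</td>`. The sketch below shows the pattern on a made-up excerpt. The sample string, its German labels and the helper `extract_cell` are assumptions for illustration; the real `feature_lake` strings are not shown in this notebook.

```python
# Hypothetical excerpt of one 'feature_lake' string (labels and layout are assumptions)
sample_feature = (
    '[<table class="articletable">\n<tr>\n<td>Erstzulassung</td>\n<td>05/2013</td>\n</tr>'
    '\n<tr>\n<td>Kilometerstand</td>\n<td>123.787 km</td>\n</tr>\n</table>]'
)

def extract_cell(feature, index, default="expired"):
    """Return the text of the index-th '<td>' chunk, or default if it does not exist."""
    try:
        return feature.split('\n<td>')[index].split('</td>')[0]
    except IndexError:
        return default

print(extract_cell(sample_feature, 2))  # value cell of the first row
print(extract_cell(sample_feature, 4))  # value cell of the second row
print(extract_cell("expired", 2))       # expired ads have no table cells
```

With such a helper, a list like `registration_date_string` could be built as `[extract_cell(feature, 2) for feature in feature_lake_BMW116]`, with field-specific clean-up applied afterwards.
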

In [73]:
# unit of mileage is kilometers
mileage_km = []

for feature in feature_lake_BMW116:
    try:
        car_mileage = feature.split('\n<td>')[4].split('</td>')[0].replace('.','').split()[0]
        mileage_km.append(car_mileage)
    except:
        mileage_km.append("expired")

In [74]:
inspection_duedate_string = []

for feature in feature_lake_BMW116:
    try:
        duedate = feature.split('\n<td>')[6].split('</td>')[0]
        inspection_duedate_string.append(duedate)
    except:
        inspection_duedate_string.append("expired")

In [75]:
len(inspection_duedate_string)

517

In [76]:
horse_power_ps = []

for feature in feature_lake_BMW116:
    try:
        # Keep the first token inside the parentheses of the power cell
        hp = feature.split('\n<td>')[8].split('(')[1].split()[0]
        horse_power_ps.append(hp)
    except:
        horse_power_ps.append("expired")
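The horsepower cell is parsed differently from the other fields: the value is taken from inside a pair of parentheses. The sketch below assumes that the cell reads something like `85 kW (116 PS)`; that format is a guess based on the parsing code, and the helper name `parse_horse_power` is made up.

```python
def parse_horse_power(cell_text):
    """Return the first token after the opening parenthesis."""
    return cell_text.split('(')[1].split()[0]

# Assumed cell content; the real markup of the masked website is not shown
print(parse_horse_power("85 kW (116 PS)</td>"))
```
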

In [77]:
doors_number = []

for feature in feature_lake_BMW116:
    try:
        doors = feature.split('\n<td>')[12].split('</td>')[0]
        doors_number.append(doors)
    except:
        doors_number.append("expired")

In [78]:
gears_type = []

for feature in feature_lake_BMW116:
    try:
        gears = feature.split('\n<td>')[14].split('</td>')[0]
        gears_type.append(gears)
    except:
        gears_type.append("expired")

In [79]:
prior_owners = []

for feature in feature_lake_BMW116:
    try:
        owners = feature.split('\n<td>')[18].split('</td>')[0]
        prior_owners.append(owners)
    except:
        prior_owners.append("expired")

In [81]:
colour = []

for feature in feature_lake_BMW116:
    try:
        c = feature.split('\n<td>')[20].split('</td>')[0]
        colour.append(c)
    except:
        colour.append("expired")

In [82]:
# unit for cylinder capacity is ccm
cylinder_capacity = []

for feature in feature_lake_BMW116:
    try:
        cc = feature.split('\n<td>')[22].split()[0]
        cylinder_capacity.append(cc)
    except:
        cylinder_capacity.append("expired")

In [83]:
aircon = []

for feature in feature_lake_BMW116:
    try:
        ac = feature.split('\n<td>')[24].split('</td>')[0]
        aircon.append(ac)
    except:
        aircon.append("expired")

In [84]:
fuel_type = []

for feature in feature_lake_BMW116:
    try:
        ft = feature.split('\n<td>')[26].split('</td>')[0]
        fuel_type.append(ft)
    except:
        fuel_type.append("expired")

In [85]:
environmental_certificate = []

for feature in feature_lake_BMW116:
    try:
        ec = feature.split('\n<td>')[28].split('</td>')[0]
        environmental_certificate.append(ec)
    except:
        environmental_certificate.append("expired")

In [86]:
emission_class = []

for feature in feature_lake_BMW116:
    try:
        emc = feature.split('\n<td>')[30].split('</td>')[0].replace(' ','').replace(' ','')
        emission_class.append(emc)
    except:
        emission_class.append("expired")

In [87]:
# unit of CO2 emissions is grams per kilometer
emissions = []

for feature in feature_lake_BMW116:
    try:
        emiss = feature.split('\n<td>')[32].split()[0]
        emissions.append(emiss)
    except:
        emissions.append("expired")

In [88]:
# unit of fuel consumption is liters per 100 km; removing the decimal comma stores the value multiplied by 100 (e.g. 410 stands for 4,10), which allows converting it into a float later on
avg_fuel_cons = []

for feature in feature_lake_BMW116:
    try:
        afc = feature.split('\n<td>')[34].split(' ')[0].replace(',','')
        avg_fuel_cons.append(afc)
    except:
        avg_fuel_cons.append("expired")
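Removing the decimal comma means that the stored strings are 100 times the actual consumption, provided every value was written with exactly two decimals. The sketch below shows two ways to get back to liters per 100 km: dividing the stored value by 100, or replacing the comma with a dot before converting. Both helper names are hypothetical and the sample values are illustrative.

```python
def restore_scaled(value, scale=100):
    """Undo the comma removal, assuming the original had exactly two decimals."""
    return int(value) / scale

def parse_decimal_comma(text):
    """Convert a number written with a decimal comma, such as '4,10', to a float."""
    return float(text.replace(',', '.'))

print(restore_scaled("410"))
print(parse_decimal_comma("4,10"))
```

The second variant does not depend on the number of decimals in the ad, which makes it the safer choice if the format varies between ads.
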

In [90]:
# unit of fuel consumption is liter per 100 km
fuel_cons_city = []

for feature in feature_lake_BMW116:
    try:
        fcc = feature.split('\n<td>')[36].split(' ')[0].replace(',','')
        fuel_cons_city.append(fcc)
    except:
        fuel_cons_city.append("expired")

In [91]:
# unit of fuel consumption is liter per 100 km
fuel_cons_highway = []

for feature in feature_lake_BMW116:
    try:
        fch = feature.split('\n<td>')[38].split(' ')[0].replace(',','')
        fuel_cons_highway.append(fch)
    except:
        fuel_cons_highway.append("expired")

In [95]:
df_raw_BMW116 = pd.DataFrame({
    'price': price_BMW116,
    'registration_date_string': registration_date_string,
    'mileage_km': mileage_km,
    'inspection_duedate_string': inspection_duedate_string,
    'horse_power_ps': horse_power_ps,
    'doors_number': doors_number,
    'gears_type': gears_type,
    'prior_owners': prior_owners,
    'colour': colour,
    'cylinder_capacity': cylinder_capacity,
    'aircon': aircon,
    'fuel_type': fuel_type,
    'environmental_certificate': environmental_certificate,
    'emission_class': emission_class,
    'emissions': emissions,
    'avg_fuel_cons': avg_fuel_cons,
    'fuel_cons_city': fuel_cons_city,
    'fuel_cons_highway': fuel_cons_highway,
})
df_raw_BMW116.head(15)

| | price | registration_date_string | mileage_km | inspection_duedate_string | horse_power_ps | doors_number | gears_type | prior_owners | colour | cylinder_capacity | aircon | fuel_type | environmental_certificate | emission_class | emissions | avg_fuel_cons | fuel_cons_city | fuel_cons_highway |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9260 | 05/2013 | 123787 | - | 116 | 5 | Schaltgetriebe | 1 | Alpinweiß(Weiß) | 1995 | Klimaanlage | Diesel | Grün | Euro5 | 109 | 410 | 510 | 360 |
| 1 | 9395 | 08/2013 | 127400 | - | 116 | 3 | Schaltgetriebe | 2 | schwarz 2 | 1995 | Klimaautomatik | Diesel | Grün | Euro5 | 114 | 430 | 530 | 380 |
| 2 | 9480 | 06/2012 | 131255 | 05/2021 | 136 | 5 | Automatik | 1 | (grau) Mineralgrau | 1598 | Klimaanlage | Benzin | Grün | Euro5 | 131 | 560 | 720 | 470 |
| 3 | 9580 | 09/2014 | 111655 | - | 116 | 5 | Schaltgetriebe | 3 | Schwarz (SCHWARZ 2) | 1995 | Klimaanlage | Diesel | Grün | Euro5 | 114 | 430 | 530 | 380 |
| 4 | 9850 | 06/2012 | 75450 | 10/2021 | 136 | 5 | Schaltgetriebe | 3 | Alpinweiss (Weiss) | 1598 | Klimaanlage | Benzin | Grün | Euro5 | 129 | 550 | 710 | 460 |
| 5 | 9860 | 06/2012 | 98000 | - | 136 | 5 | Schaltgetriebe | 2 | Alpinweiß(Weiß) | 1598 | Klimaanlage | Benzin | Grün | Euro5 | 125 | 540 | 700 | 450 |
| 6 | 9920 | 04/2013 | 100000 | - | 136 | 3 | Schaltgetriebe | 1 | Blau (TIEFSEEBLAU METALLIC) | 1598 | Klimaanlage | Benzin | Grün | Euro6 | 125 | 550 | 720 | 460 |
| 7 | 9920 | 07/2012 | 94635 | 08/2021 | 136 | 5 | Schaltgetriebe | - | schwarz / schwarz II | 1598 | Klimaanlage | Benzin | Grün | Euro5 | 130 | 550 | 710 | 460 |
| 8 | 9960 | 11/2012 | 82000 | - | 116 | 5 | Schaltgetriebe | 1 | Alpinweiß(Weiß) | 1598 | Klimaanlage | Diesel | Grün | Euro5 | 99 | 380 | 440 | 340 |
| 9 | 9980 | 03/2012 | 81664 | - | 136 | 5 | Schaltgetriebe | 1 | (schwarz) schwarz II | 1598 | Klimaautomatik | Benzin | Grün | Euro5 | 129 | 550 | 710 | 460 |


### Pickle dataframe with scraped features

In [96]:
with open('pickles/df_raw_BMW116.pkl', 'wb') as f:
    pickle.dump(df_raw_BMW116, f)