# PARSING DATA FROM A WEBSITE

In this project, I scraped data from a car insurance website using BeautifulSoup. I will use this data for analysis and prediction to find out which cars should be marketed the most to attract more customers.

For scraping data from this website, I'll perform the following tasks:

[**Task 1**](#task1): Importing the libraries

[**Task 2**](#task2): Creating the base url and choosing the header

[**Task 3**](#task3): Extracting product links on the first page

[**Task 4**](#task4): Extracting product links on all the pages

[**Task 5**](#task5): Extracting information of the first product

[**Task 6**](#task6): Extracting information of all the products


<a id='task1'></a>
# Task 1: Importing the libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

<a id='task2'></a>
# Task 2: Creating the base url and choosing the header

In [2]:
baseurl = 'https://www.theaa.com'

In [3]:
header = {
    'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36"
}

<a id='task3'></a>
# Task 3: Extracting product links on the first page

In [4]:
# requests.get returns a Response object; its .content attribute holds the raw bytes of the page,
# which BeautifulSoup parses below.
source = requests.get('https://www.theaa.com/used-cars/displaycars?sortby=closest&page=1&pricefrom=0&priceto=1000000&classid=102')
soup = BeautifulSoup(source.content, 'lxml')

In [5]:
productlist = soup.find_all('div', class_='vl-item')
print(productlist)

[<div class="vl-item clearfix">
<h3 class="vl-title" title="MINI Clubman COOPER CLASSIC">
<a class="black-link" href="/used-cars/cardetails/73-576234" rel="nofollow">
<span class="make-model-text">MINI Clubman</span>
<span>COOPER CLASSIC</span>
</a>
</h3>
<button class="shortlist-btn" data-table-name="vcarsdna" data-table-uid="576234" title="Click to add this vehicle to your shortlist" type="button">
</button>
<div class="two-column">
<a class="image-link" href="/used-cars/cardetails/73-576234" rel="nofollow">
<img alt="MINI Clubman COOPER CLASSIC" src="https://thumb.vcars.co.uk/vcarsdna/576234_1.jpg"/>
</a>
<div class="right-column">
<div class="vl-price">
<strong class="total-price strong-inline new-transport--bold">£17,950</strong>
</div>
<div class="clearfix">
<div class="vl-location"><span class="icon-location"></span><strong class="strong-inline">Enfield</strong></div>
</div>
<ul class="vl-specs">
<li>2019</li>
<li aria-hidden="true">•</li>
<li>7,000 miles</li>
<li aria-hidden="t

Creating a for loop to get the links of cars so that I can access information about each car at a later stage.

In [6]:
productlinks= []
for item in productlist:
    for link in item.find_all('a', href = True, class_ = 'black-link'):
        print(link['href'])

/used-cars/cardetails/73-576234
/used-cars/cardetails/73-556699
/used-cars/cardetails/73-578180
/used-cars/cardetails/6-1775625
/used-cars/cardetails/6-1773922
/used-cars/cardetails/6-1745145
/used-cars/cardetails/73-429940
/used-cars/cardetails/14-2553383
/used-cars/cardetails/14-2585617
/used-cars/cardetails/14-2585614
/used-cars/cardetails/14-2554709
/used-cars/cardetails/14-2561657
/used-cars/cardetails/14-2598037
/used-cars/cardetails/14-2598723
/used-cars/cardetails/14-2598725
/used-cars/cardetails/14-2596960
/used-cars/cardetails/14-2575853
/used-cars/cardetails/14-2570428
/used-cars/cardetails/14-2576361
/used-cars/cardetails/14-2575852
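
As a self-contained illustration of the same lookup, the sketch below parses a trimmed copy of the first listing printed earlier and pulls out the link with a CSS selector. It uses `html.parser` so it runs without `lxml`, and the `select` form is only an alternative to the nested `find_all` loops, not what this notebook ran.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one listing block from the page source shown above
html = """
<div class="vl-item clearfix">
  <h3 class="vl-title" title="MINI Clubman COOPER CLASSIC">
    <a class="black-link" href="/used-cars/cardetails/73-576234" rel="nofollow">
      <span class="make-model-text">MINI Clubman</span>
      <span>COOPER CLASSIC</span>
    </a>
  </h3>
  <a class="image-link" href="/used-cars/cardetails/73-576234" rel="nofollow"></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Only anchors with class "black-link" inside a "vl-item" div are kept,
# so the image link to the same car is not collected twice.
anchors = soup.select("div.vl-item a.black-link[href]")
links = [a["href"] for a in anchors]

# get_text with a separator keeps a space between the two <span> elements
titles = [a.get_text(" ", strip=True) for a in anchors]

print(links)
print(titles)
```

Filtering on the `black-link` class matters because each listing links to the same car twice, once from the title and once from the image.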


Concatenating the base URL with each car link gives the complete URL of that car's page.

In [7]:
for item in productlist:
    for link in item.find_all('a', href = True, class_='black-link'):
        productlinks.append(baseurl + link['href'])
print(productlinks)

['https://www.theaa.com/used-cars/cardetails/73-576234', 'https://www.theaa.com/used-cars/cardetails/73-556699', 'https://www.theaa.com/used-cars/cardetails/73-578180', 'https://www.theaa.com/used-cars/cardetails/6-1775625', 'https://www.theaa.com/used-cars/cardetails/6-1773922', 'https://www.theaa.com/used-cars/cardetails/6-1745145', 'https://www.theaa.com/used-cars/cardetails/73-429940', 'https://www.theaa.com/used-cars/cardetails/14-2553383', 'https://www.theaa.com/used-cars/cardetails/14-2585617', 'https://www.theaa.com/used-cars/cardetails/14-2585614', 'https://www.theaa.com/used-cars/cardetails/14-2554709', 'https://www.theaa.com/used-cars/cardetails/14-2561657', 'https://www.theaa.com/used-cars/cardetails/14-2598037', 'https://www.theaa.com/used-cars/cardetails/14-2598723', 'https://www.theaa.com/used-cars/cardetails/14-2598725', 'https://www.theaa.com/used-cars/cardetails/14-2596960', 'https://www.theaa.com/used-cars/cardetails/14-2575853', 'https://www.theaa.com/used-cars/card
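
Plain concatenation works here because every `href` on the page starts with `/`. A slightly more defensive option, sketched below, is `urllib.parse.urljoin` from the standard library, which also copes with a link that is already absolute. The second `href` is a hypothetical case added for illustration.

```python
from urllib.parse import urljoin

baseurl = 'https://www.theaa.com'

hrefs = [
    '/used-cars/cardetails/73-576234',                        # relative, as on the listing page
    'https://www.theaa.com/used-cars/cardetails/73-556699',   # hypothetical: already absolute
]

# urljoin resolves relative paths against the base and leaves absolute URLs unchanged
full_links = [urljoin(baseurl, href) for href in hrefs]

for url in full_links:
    print(url)
```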

<a id='task4'></a>
# Task 4: Extracting product links on all the pages

Each results page lists 20 cars, and I want all of them, so I use a for loop over the result pages to collect every car link. Note that `range(1, 20)` covers pages 1 to 19.

In [8]:
productlinks = []
for i in range(1,20):
    source = requests.get(f'https://www.theaa.com/used-cars/displaycars?sortby=closest&page={i}&pricefrom=0&priceto=1000000&classid=102')
    soup = BeautifulSoup(source.content, 'lxml')
    productlist = soup.find_all('div', class_='vl-item')
    for item in productlist:
        for link in item.find_all('a', href = True, class_='black-link'):
            productlinks.append(baseurl + link['href'])
print(productlinks)

['https://www.theaa.com/used-cars/cardetails/73-576234', 'https://www.theaa.com/used-cars/cardetails/73-556699', 'https://www.theaa.com/used-cars/cardetails/73-578180', 'https://www.theaa.com/used-cars/cardetails/6-1775625', 'https://www.theaa.com/used-cars/cardetails/6-1773922', 'https://www.theaa.com/used-cars/cardetails/6-1745145', 'https://www.theaa.com/used-cars/cardetails/73-429940', 'https://www.theaa.com/used-cars/cardetails/14-2553383', 'https://www.theaa.com/used-cars/cardetails/14-2585617', 'https://www.theaa.com/used-cars/cardetails/14-2585614', 'https://www.theaa.com/used-cars/cardetails/14-2554709', 'https://www.theaa.com/used-cars/cardetails/14-2561657', 'https://www.theaa.com/used-cars/cardetails/14-2598037', 'https://www.theaa.com/used-cars/cardetails/14-2598723', 'https://www.theaa.com/used-cars/cardetails/14-2598725', 'https://www.theaa.com/used-cars/cardetails/14-2596960', 'https://www.theaa.com/used-cars/cardetails/14-2575853', 'https://www.theaa.com/used-cars/card
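
The page URLs in the loop above differ only in the `page` parameter. The sketch below builds them from a dictionary of query parameters and removes repeated links while keeping their order. The helper names `page_url` and `unique_in_order` are hypothetical, and the de-duplication step assumes a car could show up on two pages if the listings shift while the loop runs.

```python
from urllib.parse import urlencode

LISTING_URL = 'https://www.theaa.com/used-cars/displaycars'

def page_url(page):
    """Build the listing URL for one results page."""
    params = {
        'sortby': 'closest',
        'page': page,
        'pricefrom': 0,
        'priceto': 1000000,
        'classid': 102,
    }
    return f'{LISTING_URL}?{urlencode(params)}'

def unique_in_order(links):
    """Drop repeated links, keeping the first occurrence of each."""
    return list(dict.fromkeys(links))

# Same page range as the loop above: pages 1 to 19
urls = [page_url(i) for i in range(1, 20)]
print(urls[0])

# Example with made-up short paths to show the de-duplication
print(unique_in_order(['/a', '/b', '/a', '/c']))
```

Keeping the parameters in a dictionary makes it easier to change the price range or class filter without editing a long URL string.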

<a id='task5'></a>
# Task 5: Extracting information of the first product

Let's try to get the information for each product now. We'll start with a test link and then loop over all the items.

In [9]:
testlink= 'https://www.theaa.com/used-cars/cardetails/14-2556661'
r = requests.get(testlink, headers = header)
soup = BeautifulSoup(r.content, 'lxml')
print(soup)

<!DOCTYPE html>
<html lang="en-GB">
<head>
<meta charset="utf-8"/>
<title>Used Ford Fiesta for Sale | AA Cars</title>
<meta content="" name="description"/>
<meta content="width=device-width, initial-scale=1, maximum-scale=1" name="viewport"/>
<meta content="telephone=no" name="format-detection"/>
<meta content="#ffcc00" name="theme-color"/>
<link href="https://www.theaa.com/Assets/images/favicon.png" rel="shortcut icon" type="image/x-icon"/>
<meta content="Ford Fiesta 5Dr Titanium 1.0 100PS" property="og:title"/>
<meta content="product" property="og:type"/>
<meta content="https://www.theaa.com/used-cars/cardetails/14-2556661" property="og:url"/>
<meta content="https://thumb.vcars.co.uk/gforce/2556661_1.jpg" property="og:image"/>
</style>
<meta content="noindex, nofollow" name="robots"/>
<script>
            (function (i, s, o, g, r, a, m) {
                i['GoogleAnalyticsObject'] = r;
                i[r] = i[r] || function () {
                        (i[r].q = i[r].q || []).push(a

Let's extract the necessary information from this page.

In [10]:
name = soup.find('div', class_ = 'col-xs-12 col-md-8').h1.text
print(name)

FordFiesta 5Dr Titanium 1.0 100PS


In [11]:
price = soup.find('div', class_ = 'price').strong.text
print(price)

£11,399


Some of the data on this page is marked up as list items, so we'll use a for loop to extract it from the list.

In [12]:
result = []
for li in soup.find('ul', class_ = 'vd-specs-list clearfix').find_all('li'):
    result.append(list(li.stripped_strings))

print(result)

[['Mileage:', '12,937'], ['Year:', '2019'], ['Fuel type:', 'Petrol'], ['Transmission:', 'Manual'], ['Body type:', 'Hatchback'], ['Colour:', 'Shadow black (premium colour)'], ['Doors:', '5'], ['Engine size:', '1.0 L'], ['CO', '2', 'Emissions:', '110 g/km']]


Since we've extracted the list items as nested lists, we can take each value with `pop()`, which removes and returns the last element of a list.

In [13]:
mileage = result[0].pop()
print(mileage)

12,937


In [14]:
year = result[1].pop()
print(year)

2019


In [15]:
fuel = result[2].pop()
print(fuel)

Petrol


In [16]:
engine_size = result[7].pop()
print(engine_size)

1.0 L


In [17]:
emission = result[8].pop()
print(emission)

110 g/km
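
Indexing by position (`result[7]`, `result[8]`) assumes every car page lists the same specs in the same order. One alternative, sketched below, is to key each value by its label instead. The function name `specs_to_dict` is hypothetical, and the input is the `result` list printed earlier.

```python
def specs_to_dict(rows):
    """Turn [label..., value] rows into a {label: value} dictionary."""
    specs = {}
    for parts in rows:
        if len(parts) < 2:
            continue  # skip rows that have a label but no value
        *label_parts, value = parts
        # 'CO', '2', 'Emissions:' arrives as three strings; join them into one label
        label = ''.join(label_parts).rstrip(':').strip()
        specs[label] = value
    return specs

result = [['Mileage:', '12,937'], ['Year:', '2019'], ['Fuel type:', 'Petrol'],
          ['Transmission:', 'Manual'], ['Body type:', 'Hatchback'],
          ['Colour:', 'Shadow black (premium colour)'], ['Doors:', '5'],
          ['Engine size:', '1.0 L'], ['CO', '2', 'Emissions:', '110 g/km']]

specs = specs_to_dict(result)
print(specs['Mileage'])
print(specs.get('Engine size', 'not given'))
print(specs.get('CO2Emissions', 'not given'))
```

With `dict.get`, a spec that is missing from a page falls back to `'not given'` rather than silently picking up the value of a different spec.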


There's another list on this page, so I created a second for loop to extract information from it.

In [18]:
ig_results = []
for li in soup.find('ul',class_ = 'tco clearfix').find_all('li'):
    ig_results.append(list(li.stripped_strings))
    
print(ig_results)

[['Mpg:', '49.6 miles', 'This is the average amount of miles it takes the car to use a gallon of fuel. The figure is used to see how efficient\n                                    the car is with its fuel. For a typical petrol-powered family hatchback, anything above 40mpg is considered respectable. The higher the\n                                    figure here, the more efficient the car is.'], ['Mpg urban:', '48.7 miles', "This number is calculated when the car is driven on a test cycle through towns and cities. As vehicles are more likely\n                                    to be continually stopping and starting, they'll use more fuel getting the car back up to speed – lowering its fuel efficiency compared\n                                    to the average."], ['Mpg extra urban:', '68.9 miles', "When driven on roads in and out of town, the car will be able to cruise at points – using less fuel. This figure takes\n                                    that into account, as when a c

In [19]:
in_group = ig_results[3].pop(-2)
print(in_group)

10


Now that I have extracted all the information I require for a single car, I shall create a dictionary for this car.

In [20]:
car = {
    'name': name,
    'price': price,
    'mileage': mileage,
    'year': year,
    'fuel': fuel,
    'engine_size': engine_size,
    'emission': emission,
    'insurance group': in_group
}
print(car)

{'name': 'FordFiesta 5Dr Titanium 1.0 100PS', 'price': '£11,399', 'mileage': '12,937', 'year': '2019', 'fuel': 'Petrol', 'engine_size': '1.0 L', 'emission': '110 g/km', 'insurance group': '10'}
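
All the values in this dictionary are strings, including the price and the mileage, so they need converting before any numerical analysis. A minimal sketch of one way to do that with a regular expression follows. The helper name `to_number` is hypothetical and this step is not part of the notebook's run.

```python
import re

def to_number(text):
    """Return the first number found in text as int or float, or None if there is none."""
    match = re.search(r'\d[\d,]*(?:\.\d+)?', text)
    if match is None:
        return None
    digits = match.group().replace(',', '')  # drop thousands separators
    return float(digits) if '.' in digits else int(digits)

car = {'name': 'FordFiesta 5Dr Titanium 1.0 100PS', 'price': '£11,399',
       'mileage': '12,937', 'year': '2019', 'fuel': 'Petrol',
       'engine_size': '1.0 L', 'emission': '110 g/km', 'insurance group': '10'}

numeric_fields = ['price', 'mileage', 'year', 'engine_size', 'emission', 'insurance group']
cleaned = {key: to_number(car[key]) for key in numeric_fields}
print(cleaned)
```

Values such as `'not given'` contain no digits, so they come back as `None` and can be treated as missing data later on.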


<a id='task6'></a>
# Task 6: Extracting information of all the products

Now I want to extract this information for all the cars, so I'll use a for loop again.

In [21]:
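carlist = []
for link in productlinks:
    r = requests.get(link, headers = header)
    soup = BeautifulSoup(r.content, 'lxml')
    # find() returns None when an element is missing, which raises AttributeError below
    try:
        name = soup.find('div', class_ = 'col-xs-12 col-md-8').h1.text
    except AttributeError:
        name = 'not given'
    try:
        price = soup.find('div', class_ = 'price').strong.text
    except AttributeError:
        price = 'not given'

    # Collect the spec list items; result stays empty if the list is missing
    result = []
    specs = soup.find('ul', class_ = 'vd-specs-list clearfix')
    if specs is not None:
        for li in specs.find_all('li'):
            result.append(list(li.stripped_strings))

    # A missing row or an empty row raises IndexError
    try:
        mileage = result[0].pop()
    except IndexError:
        mileage = 'not given'
    try:
        year = result[1].pop()
    except IndexError:
        year = 'not given'
    # Extract fuel for each car; otherwise the value left over from Task 5 is reused
    try:
        fuel = result[2].pop()
    except IndexError:
        fuel = 'not given'
    try:
        engine_size = result[7].pop()
    except IndexError:
        engine_size = 'not given'
    try:
        emission = result[8].pop()
    except IndexError:
        emission = 'not given'

    car = {
        'name': name,
        'price': price,
        'mileage': mileage,
        'year': year,
        'fuel': fuel,
        'engine_size': engine_size,
        'emission': emission
    }

    carlist.append(car)
    print('Saving:', car['name'])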

df = pd.DataFrame(carlist)
print(df.head)

Saving: MINIClubman COOPER CLASSIC
Saving: MINICooper Convertible COOPER CLASSIC
Saving: AudiA1 SPORTBACK TFSI SPORT
Saving: BMW1 Series 1.6 116i M Sport Sports Hatch (s/s) 5dr
Saving: BMW1 Series 2.0 116i Sport 3dr
Saving: not given
Saving: BMW1 SERIES 120i SPORT
Saving: FordFiesta 5Dr ST-Line X Edition 1.0 125PS Auto
Saving: FordFiesta 5Dr Titanium 1.0 Hybrid 125PS
Saving: FordFiesta 5Dr ST-2 1.5 200PS
Saving: FordFiesta 5Dr Titanium 1.0 Hybrid 125PS
Saving: FordFiesta 5Dr ST-Line Edition 1.0 Hybrid 125PS
Saving: FordFiesta Vignale Fiesta 5Dr Vignale 1.0 140PS
Saving: FordFiesta Vignale Fiesta 5Dr Vignale Edition 1.0 125PS
Saving: FordFiesta 5Dr Active X 1.0 100PS
Saving: FordFiesta 5Dr ST-Line X 1.0 125PS
Saving: FordFiesta 5Dr ST-Line 1.0 125PS
Saving: FordFiesta 5Dr ST-Line X 1.0 125PS
Saving: FordFiesta 5Dr Active X 1.0 125PS
Saving: FordFiesta 5Dr Titanium 1.0 100PS
Saving: FordFiesta 5Dr Titanium 1.0 100PS
Saving: FordFiesta 3Dr ST-Line 1.0 140PS
Saving: FordFiesta 5Dr Zetec 1.

Saving: Mercedes-BenzA-Class A180d Sport 5dr Auto
Saving: Mercedes-BenzA-Class A180d Sport 5dr Auto
Saving: Mercedes-BenzA-Class A180 SE 5dr
Saving: Mercedes-BenzA-Class A180d AMG Line 5dr
Saving: Mercedes-BenzA-Class A180 AMG Line Premium 5dr
Saving: Mercedes-BenzA-Class A180 AMG Line Premium 5dr
Saving: Mercedes-BenzA-Class A200d AMG Line Premium 5dr
Saving: Mercedes-BenzA-Class A180d AMG Line Premium 5dr Auto
Saving: Mercedes-BenzA-Class A180d AMG Line Premium 5dr Auto
Saving: Mercedes-BenzA-Class A200d AMG Line 5dr
Saving: Mercedes-BenzA-Class A160 AMG Line 5dr
Saving: Mercedes-BenzA-Class A180d Sport 5dr Auto
Saving: Mercedes-BenzA-Class A180 AMG Line 5dr
Saving: Mercedes-BenzA-Class A180 Sport 5dr Auto
Saving: VauxhallCorsa 1.2 Turbo SRi Nav Premium 5dr Petrol Hatchback
Saving: CitroenC3 Aircross 1.2 PureTech Flair 5dr Petrol Hatchback
Saving: CitroenC3 Aircross 1.2 PureTech 110 Feel 5dr Petrol Hatchback
Saving: VauxhallCorsa 1.4 SE 5dr Auto Petrol Hatchback
Saving: VauxhallCorsa

Saving: VolkswagenPolo BlueMotion Tech SE
Saving: AudiA1 TFSI SPORT
Saving: VolkswagenPolo MATCH
Saving: VolkswagenPolo SE
Saving: VolkswagenPolo MATCH EDITION
Saving: VolkswagenPolo BlueMotion Tech SE
Saving: HyundaiI10 SE
Saving: NissanMicra ACENTA
Saving: VolkswagenPolo BlueMotion Tech SE
Saving: AudiA1 TDI SPORT ONE OF A KIND COLOUR!
Saving: KiaPicanto 2
Saving: VolkswagenPolo MATCH
Saving: VolkswagenPolo MATCH EDITION
Saving: VolkswagenPolo MATCH EDITION
Saving: VauxhallADAM GLAM
Saving: ToyotaYaris D-4D TR
<bound method NDFrame.head of                                                   name    price mileage  year  \
0                           MINIClubman COOPER CLASSIC  £17,950   7,000  2019   
1                MINICooper Convertible COOPER CLASSIC  £16,750   6,000  2019   
2                          AudiA1 SPORTBACK TFSI SPORT  £11,950  25,000  2017   
3    BMW1 Series 1.6 116i M Sport Sports Hatch (s/s...   £8,795  46,000  2014   
4                       BMW1 Series 2.0 116i Sp
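
The loop above repeats the same try/except pattern for every field. One possible way to shorten it, sketched here with a hypothetical helper called `last_or_default`, is to wrap the lookup once and drive it from a mapping of field names to row positions. The positions are the ones used in this notebook, and the short `result` list stands in for a page that is missing some specs.

```python
def last_or_default(rows, index, default='not given'):
    """Return the last string of rows[index], or default if that row is missing or empty."""
    try:
        return rows[index][-1]
    except IndexError:
        return default

# Row positions taken from the spec list used in Task 5
FIELDS = {'mileage': 0, 'year': 1, 'fuel': 2, 'engine_size': 7, 'emission': 8}

# A shortened spec list, as if the page only showed three specs
result = [['Mileage:', '12,937'], ['Year:', '2019'], ['Fuel type:', 'Petrol']]

car = {field: last_or_default(result, index) for field, index in FIELDS.items()}
print(car)
```

Unlike `pop()`, reading `rows[index][-1]` leaves `result` unchanged, so the same list can be inspected again while debugging.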

In [22]:
df.head(30)

| Unnamed: 0 | name | price | mileage | year | fuel | engine_size | emission |
|---|---|---|---|---|---|---|---|
| 0 | MINIClubman COOPER CLASSIC | £17,950 | 7000 | 2019 | Petrol | 1.5 L | 130 g/km |
| 1 | MINICooper Convertible COOPER CLASSIC | £16,750 | 6000 | 2019 | Petrol | 1.5 L | 118 g/km |
| 2 | AudiA1 SPORTBACK TFSI SPORT | £11,950 | 25000 | 2017 | Petrol | 1.4 L | 119 g/km |
| 3 | BMW1 Series 1.6 116i M Sport Sports Hatch (s/s... | £8,795 | 46000 | 2014 | Petrol | 1.6 L | 131 g/km |
| 4 | BMW1 Series 2.0 116i Sport 3dr | £3,495 | 85000 | 2010 | Petrol | 2.0 L | 143 g/km |
| 5 | not given | £3,195 | 95000 | 2013 | Petrol | 1.6 L | 127 g/km |
| 6 | BMW1 SERIES 120i SPORT | £15,250 | 18000 | 2016 | Petrol | 1.6 L | 133 g/km |
| 7 | FordFiesta 5Dr ST-Line X Edition 1.0 125PS Auto | £19,499 | 2383 | 2020 | Petrol | 1.0 L | 133 g/km |
| 8 | FordFiesta 5Dr Titanium 1.0 Hybrid 125PS | £17,699 | 598 | 2020 | Petrol | 1.0 L | 115 g/km |
| 9 | FordFiesta 5Dr ST-2 1.5 200PS | £17,499 | 6349 | 2019 | Petrol | 1.5 L | 136 g/km |


# FINAL DATA

In [23]:
df

| Unnamed: 0 | name | price | mileage | year | fuel | engine_size | emission |
|---|---|---|---|---|---|---|---|
| 0 | MINIClubman COOPER CLASSIC | £17,950 | 7000 | 2019 | Petrol | 1.5 L | 130 g/km |
| 1 | MINICooper Convertible COOPER CLASSIC | £16,750 | 6000 | 2019 | Petrol | 1.5 L | 118 g/km |
| 2 | AudiA1 SPORTBACK TFSI SPORT | £11,950 | 25000 | 2017 | Petrol | 1.4 L | 119 g/km |
| 3 | BMW1 Series 1.6 116i M Sport Sports Hatch (s/s... | £8,795 | 46000 | 2014 | Petrol | 1.6 L | 131 g/km |
| 4 | BMW1 Series 2.0 116i Sport 3dr | £3,495 | 85000 | 2010 | Petrol | 2.0 L | 143 g/km |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 375 | VolkswagenPolo MATCH | £5,980 | 40000 | 2013 | Petrol | 1.2 L | 128 g/km |
| 376 | VolkswagenPolo MATCH EDITION | £5,780 | 62000 | 2014 | Petrol | 1.2 L | 128 g/km |
| 377 | VolkswagenPolo MATCH EDITION | £5,780 | 66000 | 2014 | Petrol | 1.2 L | 128 g/km |
| 378 | VauxhallADAM GLAM | £4,980 | 39000 | 2013 | Petrol | 1.4 L | 129 g/km |
