# Scraping

The website [www.finn.no] offers many things for sale.

1. Create a function that, when given a page number, returns a list of the URLS of specific motorcycles that are for sale on that page.

```
def scrape_cycles(page_number):
    - get html
    - parse html
    - extract a list of urls
    
    return list_of_urls
```


In [10]:
from requests_html import HTMLSession

def scrape_cycles(page_number):
    url = 'https://www.finn.no/mc/all/search.html?page=' + str(page_number)
    session = HTMLSession()
    r = session.get(url)
    parsed_html = r.html
    
    links = []
    for e in parsed_html.find('.ads__unit__link'):
        link = list(e.absolute_links)[0]
        links.append(link)
    return links
    
scrape_cycles(10)

['https://www.finn.no/mc/all/ad.html?finnkode=154025365',
 'https://www.finn.no/mc/all/ad.html?finnkode=154083996',
 'https://www.finn.no/mc/all/ad.html?finnkode=154073713',
 'https://www.finn.no/mc/all/ad.html?finnkode=154083066',
 'https://www.finn.no/mc/all/ad.html?finnkode=128352780',
 'https://www.finn.no/mc/all/ad.html?finnkode=154082124',
 'https://www.finn.no/mc/all/ad.html?finnkode=154081268',
 'https://www.finn.no/mc/all/ad.html?finnkode=148655873',
 'https://www.finn.no/mc/all/ad.html?finnkode=154082109',
 'https://www.finn.no/mc/all/ad.html?finnkode=154080637',
 'https://www.finn.no/mc/all/ad.html?finnkode=154080845',
 'https://www.finn.no/mc/all/ad.html?finnkode=154078850',
 'https://www.finn.no/mc/all/ad.html?finnkode=147763955',
 'https://www.finn.no/mc/all/ad.html?finnkode=154079796',
 'https://www.finn.no/mc/all/ad.html?finnkode=154077560',
 'https://www.finn.no/mc/all/ad.html?finnkode=154073112',
 'https://www.finn.no/mc/all/ad.html?finnkode=154079285',
 'https://www.

2. Below is a section of a listing page refering to just one article. This piece can be used by `requests_html` using the following code:

```
from requests_html import HTML
parsed_html_section = HTML(html=html_section)
```
Create a function that scrapes this section and returns a dictionary with entries for the vehicle description ("Yamaha MT-07 2016"), mileage ("7000 km"), and price ("95000 kr").

```
def scrape_listing(parsed_html_section):
    - extract description
    - extract mileage
    - extract price

    info = {'description' : description,
            'mileage'     : mileage,
            'price'       : price}
    
    
    return info
```

Hints:
* ```'dog, river, cat'.split(', ')``` will return ```['dog', 'river', `cat')```
* ```'kr 95\xa0000'.encode('ascii','ignore')``` will return ```kr 95000```

In [13]:
html_section = '''
<article class="ads__unit">
                        <div aria-owns="result-item-heading-153469431"></div>

    <div class="ads__unit__img">
        <div class="ads__unit__img__ratio img-format img-format--ratio3by2  img-format--centered">
            <img src="https://images.finncdn.no/dynamic/480w/2019/8/vertical-3/01/1/153/469/431_787813931.jpg" class="img-format__img" alt="" srcset="https://images.finncdn.no/dynamic/960w/2019/8/vertical-3/01/1/153/469/431_787813931.jpg 954w, https://images.finncdn.no/dynamic/640w/2019/8/vertical-3/01/1/153/469/431_787813931.jpg 640w, https://images.finncdn.no/dynamic/480w/2019/8/vertical-3/01/1/153/469/431_787813931.jpg 480w, https://images.finncdn.no/dynamic/320w/2019/8/vertical-3/01/1/153/469/431_787813931.jpg 320w, https://images.finncdn.no/dynamic/240w/2019/8/vertical-3/01/1/153/469/431_787813931.jpg 240w" sizes="(min-width: 768px) 240px, 40vw">
        </div>
    </div>

<div class="ads__unit__content">
    <h2 class="ads__unit__content__title ads__unit__content__title--fav-placeholder" id="result-item-heading-153469431">
        <a id="153469431" href="/mc/all/ad.html?finnkode=153469431" class="ads__unit__link" data-finnkode="153469431" data-listposition="3" data-adtypeterm="mc" data-search-resultitem="">

            Yamaha MT-07 2016, 7&nbsp;000 km, kr 95&nbsp;000,-
        </a>
    </h2>
        <div class="ads__unit__content__status u-position-relative">
                <div data-call-out-box-position="" class="u-position-absolute u-top u-right" style="z-index: 1" data-controller="favoriteHeartReact" data-base-resource-url="https://www.finn.no/favorittliste/podium-resource/favorittlistePodlet/favorite-api" data-ad-id="153469431"><button aria-haspopup="dialog" aria-label="Hjertemerke" aria-pressed="false" class="button button--pill icon icon--heart-neutral" title="Legg til favoritt"></button></div>
        </div>
    <span class="ads__unit__content__details">
                <span>Straumen</span>
    </span>

            <p class="ads__unit__content__keys">
                        <span>2016</span>
                        <span>7&nbsp;000 km</span>
                        <span>95&nbsp;000 kr</span>
            </p>
    <p>
        <span class="u-float-left">
                <span class="ads__unit__content__list truncate">Privat</span>
        </span>
    </p>
</div>


                </article>
                '''

{'https://example.org/mc/all/ad.html?finnkode=153469431'}

In [112]:
from requests_html import HTML



def scrape_listing(parsed_html_section):
    year = parsed_html_section.find('h2')[0].text.split(',')[0]
    miles = parsed_html_section.find('h2')[0].text.split(',')[1]
    price = parsed_html_section.find('h2')[0].text.split(',')[2]

        
    info = {'description' : year,
                'mileage'     : miles.replace('\xa0',''),
                'price'       : price.replace('\xa0','')}
    # fix for cases with no mileage listed
    if 'kr ' in miles:
        info['mileage'] = 'Missing'
        info['price'] = miles
    return info

scrape_listing(parsed_html_section)

{'description': 'Yamaha MT-07 2016',
 'mileage': ' 7000 km',
 'price': ' kr 95000'}

3. Create a function that returns a dataframe of the listing information for all the lisitings on a specific search page. Use the function that you created in 2.

Hint:
* Each of the elements returned by `.find()` using `requests_html` can also be searched. For example, run the following script on your `parsed_html` page:
```for parsed_html_section in parsed_html.find('article'):
    print('New section')
    print(parsed_html_section.find('h2'))
    print('')
    ```



In [113]:

def scrape_all_listing(page_number):
    url = 'https://www.finn.no/mc/all/search.html?page=' + str(page_number)
    session = HTMLSession()
    r = session.get(url)
    parsed_html = r.html
    
    listings = []
    
    # iterate over each section
    for parsed_html_section in parsed_html.find('.ads__unit'):
        # Only look at the sectionst that actually have listings
        if len(parsed_html_section.absolute_links) > 0:
            info = scrape_listing(parsed_html_section)
            info['url'] = list(parsed_html_section.absolute_links)[0]
        listings.append(info)
    
            
    return listings
    
scrape_all_listing(3)

[{'description': 'Harley-Davidson 2009',
  'mileage': ' 59000 km',
  'price': ' kr 228000',
  'url': 'https://www.finn.no/mc/all/ad.html?finnkode=149299855'},
 {'description': 'Honda CX500 1980',
  'mileage': ' 50700 km',
  'price': ' kr 25000',
  'url': 'https://www.finn.no/mc/all/ad.html?finnkode=154339239'},
 {'description': 'Yamaha SR400 2016',
  'mileage': ' 5620 km',
  'price': ' kr 49000',
  'url': 'https://www.finn.no/mc/all/ad.html?finnkode=154343845'},
 {'description': 'BMW 90/6 1974',
  'mileage': 'Missing',
  'price': ' kr 25\xa0000',
  'url': 'https://www.finn.no/mc/all/ad.html?finnkode=154342347'},
 {'description': 'Harley-Davidson Sportster 2003',
  'mileage': ' 17000 km',
  'price': ' kr 85000',
  'url': 'https://www.finn.no/mc/all/ad.html?finnkode=154344379'},
 {'description': 'Honda 900rr 1998',
  'mileage': ' 31000 km',
  'price': ' kr 49000',
  'url': 'https://www.finn.no/mc/all/ad.html?finnkode=154325859'},
 {'description': 'Honda 900rr 1998',
  'mileage': ' 31000 

In [114]:
import pandas as pd

pd.DataFrame(scrape_all_listing(13))

Unnamed: 0,description,mileage,price,url
0,Harley-Davidson FXCWC 2008,49500 km,kr 169900,https://www.finn.no/mc/all/ad.html?finnkode=14...
1,BMW R1250GS *DEMO* 2019,7000 km,kr 259000,https://www.finn.no/mc/all/ad.html?finnkode=15...
2,Honda CBR 125 R 2012,33400 km,kr 14500,https://www.finn.no/mc/all/ad.html?finnkode=15...
3,Vespa VESPA PRIMAVERA 2015,6995 km,kr 22500,https://www.finn.no/mc/all/ad.html?finnkode=15...
4,Yamaha YZF-R125 2008,21300 km,kr 27000,https://www.finn.no/mc/all/ad.html?finnkode=15...
5,Suzuki VZ 800 2005,11500 km,kr 47500,https://www.finn.no/mc/all/ad.html?finnkode=15...
6,Suzuki VZ 800 2005,11500 km,kr 47500,https://www.finn.no/mc/all/ad.html?finnkode=15...
7,BMW F750GS 2019,Missing,kr 166 000,https://www.finn.no/mc/all/ad.html?finnkode=15...
8,Suzuki Burgman AN400ZA Limited 2017,13000 km,kr 74000,https://www.finn.no/mc/all/ad.html?finnkode=15...
9,BMW F800GT KAMPANJE 2019,Missing,kr 155 900,https://www.finn.no/mc/all/ad.html?finnkode=15...
