# 1. Collecting Data
------------------------------

**Overview**
 - Data are scraped from otomoto.pl, one of the most popular Polish online marketplaces for car buyers and sellers.
 - A specific car model is chosen: Opel (GM) Astra.
 - From each offer I scrape information such as price, year of production, mileage, etc.
 - The scraped information is put into a pandas DataFrame.
 - The DataFrame is saved to a CSV file.
-------------------------


## 1.1 Scraping links to offers
---------------------------------------

1) Getting the page source

In [8]:
from requests import get as rqst_get
from contextlib import closing

def get_content_from_url(url):
    # Download the page and return its raw bytes; closing() releases the connection.
    with closing(rqst_get(url, stream=True)) as cnt:
        return cnt.content

2) Extracting links from BeautifulSoup objects

In [9]:
from bs4 import BeautifulSoup

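def extract_links(bfs):
    # Collect the href of every offer-title anchor.
    # href=True skips anchors without an href, which str() would turn into 'None'.
    links = []
    for link in bfs.find_all('a', class_='offer-title__link', href=True):
        links.append(str(link['href']))
    return links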

3) Exploring website for links

In [10]:
def explore_website(core_url, suffix='?page=', n_pages=3):
    # Start with the links on the landing page.
    bfs = BeautifulSoup(get_content_from_url(core_url), 'html.parser')
    links = extract_links(bfs)
    # Then walk the numbered result pages: core_url?page=1 ... ?page=n_pages.
    for i in range(1, n_pages + 1):
        new_url = core_url + suffix + str(i)
        bfs = BeautifulSoup(get_content_from_url(new_url), 'html.parser')
        links += extract_links(bfs)
    links = list(set(links))  # delete duplicates (order is not preserved)
    return links
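
To inspect the paging logic without touching the network, the sketch below only builds the URLs that `explore_website` would request, and shows order-preserving de-duplication as a possible alternative to `list(set(...))`. The helper names `build_page_urls` and `dedupe_keep_order` are hypothetical and not part of the original notebook.

```python
def build_page_urls(core_url, suffix='?page=', n_pages=3):
    """Return the URLs explore_website would fetch, in request order."""
    urls = [core_url]
    for i in range(1, n_pages + 1):
        urls.append(core_url + suffix + str(i))
    return urls


def dedupe_keep_order(links):
    """Drop duplicates while keeping first-seen order (hypothetical helper)."""
    return list(dict.fromkeys(links))


urls = build_page_urls('https://www.otomoto.pl/osobowe/opel/astra/', n_pages=2)
print(urls)
print(dedupe_keep_order(['a', 'b', 'a', 'c', 'b']))
```

A `set` does not keep insertion order, so the order of `links` may differ between runs; `dict.fromkeys` keeps the first occurrence of each link in place (Python 3.7+).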

4) Getting links to offers

In [11]:
url_to_offers = 'https://www.otomoto.pl/osobowe/opel/astra/'
links = explore_website(url_to_offers, n_pages=200)

-----------------------

## 1.2 Scraping data from offers
-----------------------

1) Finding the price in the page source (if it cannot be read, 'Null' is stored instead)

In [12]:
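def find_price(bfs):
    # Returns ['Cena', value]; 'Cena' is Polish for 'price'.
    price = ['Cena']
    try:
        price_text = list(bfs.find(class_='offer-price__number').stripped_strings)[0]
        price.append(float(price_text.replace(' ', '')))
    except (AttributeError, IndexError, ValueError):
        # Missing element, empty element, or text that is not a number.
        price.append('Null')
    return price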
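
The string handling inside `find_price` can be tried in isolation. This is a minimal sketch, assuming the price text looks like `45 900` and may contain non-breaking spaces or a decimal comma (both are assumptions about the page, not something the notebook confirms). `parse_price` is a hypothetical helper.

```python
def parse_price(text):
    """Convert a price string such as '45 900' to float, or return 'Null'."""
    # Assumption: spaces (regular or non-breaking) are thousands separators
    # and a comma, if present, is the decimal mark.
    cleaned = text.replace('\xa0', '').replace(' ', '').replace(',', '.')
    try:
        return float(cleaned)
    except ValueError:
        return 'Null'


print(parse_price('45 900'))
print(parse_price('12\xa0500,50'))
print(parse_price('n/a'))
```

Keeping the parsing in a separate function makes it possible to check the conversion on sample strings before running the full scrape.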

2) Finding items (parameter name and value) in the page source

In [13]:
def find_items(bfs):
    # Each item becomes a list of its text fragments, e.g. [name, value].
    items = []
    try:
        items_class = bfs.find_all(class_='offer-params__item')
        for item_class in items_class:
            items.append(list(item_class.stripped_strings))
    except Exception:
        items = 'Null'
    return items

3) Finding features in the page source

In [14]:
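def find_features(bfs):
    # Each feature becomes a list of its text fragments.
    features = []
    try:
        features_class = bfs.find_all(class_='offer-features__item')
        for feature_class in features_class:
            features.append(list(feature_class.stripped_strings))
    except Exception:
        # Same sentinel as find_items, so add_to_dataframe skips the features.
        features = 'Null'
    return features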

4) Adding a row to the DataFrame

In [20]:
from numpy import zeros, repeat, array

def add_to_dataframe(df, price, items, features):
    # Append a zero-filled row and remember its index.
    current_index = 0
    if len(df.index) == 0:
        df.loc[0] = zeros(df.columns.size)
    else:
        df.loc[df.index[-1]+1] = zeros(df.columns.size)
        current_index = df.index[-1]
    df.loc[current_index, 'Cena'] = price[1]
    if items != 'Null':
        # Each item is expected to be a [name, value] pair.
        for item, value in items:
            if item not in df.columns:
                # New column: earlier rows get 0.
                df[item] = repeat(0, len(df))
            df.loc[current_index, item] = value
    if features != 'Null':
        features = array(features).flatten()
        # Features are one-hot encoded: 1 if the offer lists the feature, else 0.
        for feature in features:
            if feature not in df.columns:
                df[feature] = repeat(0, len(df))
            df.loc[current_index, feature] = 1
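
Growing a DataFrame one row and one column at a time, as `add_to_dataframe` does, tends to get slower as the table grows. One possible alternative is to turn every offer into a plain dictionary first. The sketch below assumes each entry of `items` is a list whose first string is the parameter name, and each entry of `features` is a list of strings. `offer_to_row` is a hypothetical helper, and the sample values are made up for illustration.

```python
def offer_to_row(price, items, features):
    """Build one flat record from the outputs of the find_* functions."""
    row = {'Cena': price[1]}
    if items != 'Null':
        for entry in items:
            # Assumption: first string is the parameter name, the rest its value.
            if len(entry) >= 2:
                row[entry[0]] = ' '.join(entry[1:])
    if features != 'Null':
        for entry in features:
            if isinstance(entry, str):
                entry = [entry]  # tolerate a bare string instead of a list
            for name in entry:
                row[name] = 1  # feature present
    return row


# Made-up sample data in the shape the find_* functions return.
sample = offer_to_row(['Cena', 45900.0],
                      [['Rok produkcji', '2015'], ['Przebieg', '120 000 km']],
                      [['ABS'], ['ESP']])
print(sample)
```

A list of such dictionaries can be passed to `pd.DataFrame(rows)`; columns missing from a given offer come out as NaN and can be filled with `fillna(0)` to mirror the zero-filled layout used above.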

5) Gathering data into a DataFrame for every link

In [21]:
import pandas as pd
data = pd.DataFrame({'Cena': []})
for offer_link in links:
    offer_bfs = BeautifulSoup(get_content_from_url(offer_link), 'html.parser')
    price = find_price(offer_bfs)
    items = find_items(offer_bfs)
    features = find_features(offer_bfs)
    add_to_dataframe(data, price, items, features)
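
The loop above stops at the first download that raises an error and sends its requests back to back. A hedged sketch of a more forgiving loop follows. The `fetch` argument stands in for `get_content_from_url`; `scrape_all`, `delay`, and the fake fetcher in the usage lines are hypothetical names introduced only for this illustration.

```python
import time


def scrape_all(links, fetch, delay=1.0):
    """Fetch every link, pausing between requests and skipping failures."""
    pages, failed = {}, []
    for link in links:
        try:
            pages[link] = fetch(link)
        except Exception as exc:  # broad on purpose: any failure skips this link
            failed.append((link, repr(exc)))
        time.sleep(delay)  # pause so the server is not hit in a tight loop
    return pages, failed


def fake_fetch(url):
    # Stand-in for get_content_from_url so the example runs offline.
    if 'bad' in url:
        raise IOError('simulated network error')
    return b'<html></html>'


pages, failed = scrape_all(['offer-1', 'bad-offer', 'offer-2'], fake_fetch, delay=0)
print(len(pages), len(failed))
```

Collecting the failed links separately makes it possible to retry them later instead of restarting the whole scrape.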

--------------------------------

## 1.3 Saving DataFrame to file
------------------------------

In [23]:
data.to_csv('car_offers.csv')