<h1>Data scraping</h1>

Task: estimate the market value of a used motorcycle

Data source: https://moto.drom.ru/

Target variable: price in rubles

Features:
1. Motorcycle model
2. Mileage
3. Class
4. Year of manufacture
5. Engine displacement
6. Number of engine strokes
7. Condition
8. Documents
9. City
10. Listing publication date

Column names in the table (the delimiter is ','):

target: price

1. model
2. mileage 
3. motorcycle_class
4. year 
5. engine_capacity
6. engine_strokes 
7. damaged
8. documents
9. city
10. date 
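
As a minimal sketch of how these names line up with the CSV header, the snippet below builds the header row from a list. The leading `id` column comes from the header string used in the parsing cell of this notebook; the names `COLUMNS` and `header_line` are illustrative assumptions, not part of the original code.

```python
# Column names in the order they are written to the CSV file.
# COLUMNS and header_line are hypothetical names used for illustration only.
COLUMNS = [
    "id", "price", "model", "mileage", "motorcycle_class", "year",
    "engine_capacity", "engine_strokes", "damaged", "documents", "city", "date",
]

def header_line(columns=COLUMNS, delimiter=","):
    """Join the column names into a single header row."""
    return delimiter.join(columns)

print(header_line())
```

Keeping the names in one list makes it harder for the header and the row layout to drift apart.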

Algorithm:
1. Download all listing pages (the page number is tracked, so if I get banned, I know where to resume)
2. Extract the list of links from them (plain parsing)
3. Then visit each link (the link number is tracked, so if I get banned, I know where to resume)
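
The "know where to resume" idea can be sketched roughly as follows. This is an illustration under assumptions, not the notebook's actual loop: `process_from` and its `handler` callback are hypothetical names, and the handler stands in for a download that may fail.

```python
def process_from(items, handler, start=1):
    """Process items numbered from 1, beginning at position `start`.

    Returns the number of the first item whose handler returned False,
    or None if every item was handled.
    """
    for number, item in enumerate(items, 1):
        if number < start:
            continue  # already processed in an earlier run
        if not handler(item):
            return number  # resume from here next time
    return None

# Example: the handler fails on the third item, as if a ban happened there.
pages = ["page1", "page2", "page3", "page4"]
failed_at = process_from(pages, lambda p: p != "page3")
print(failed_at)

# A later run restarts from the reported position.
print(process_from(pages, lambda p: True, start=failed_at))
```

Returning the failing position, rather than only printing it, lets the next run pass it straight back as `start`.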

<b>Importing libraries</b>

In [1]:
import numpy as np
import pandas as pd
import requests
import time
from bs4 import BeautifulSoup

<b>Class describing a motorcycle</b>

In [2]:
class Motorcycle:
    def __init__(self, soup_motorcycle, index=None):
        self.soup = soup_motorcycle
        if (index is None):
            self.index = 0
        else:
            self.index = index
            
        self.warning_string_ = "Index: " + str(self.index) + "; Warning: "
        
    def warning_(self, string):
        print(self.warning_string_ + string)
        
    def parse_information(self):
        self.price = self.get_price()
        if (self.price is None):
            self.warning_('price not found')
            return False
        
        self.model = self.get_feature({"data-field" : "model"})
        if (self.model is not None):
            comma_index = self.model.find(',')
            if (comma_index != -1):
                self.model = self.model[:comma_index]
            
        self.mileage = self.get_feature({"data-field" : "motoMileage"})
        self.motorcycle_class = self.get_feature({"data-field" : "motoBodyType"})
        self.year = self.get_feature({"data-field" : "year"})
        self.engine_capacity = self.get_feature({"data-field" : "displacement"})
        self.engine_strokes = self.get_feature({"data-field" : "motoEngineType"})  
        self.damaged = self.get_feature({"data-field" : "motoDriveCondition"})
        self.documents = self.get_feature({"data-field" : "hasDocuments"})
        self.city = self.get_feature({"title" : "Выбранный город"})
        
        self.date = self.get_feature({"class" : "viewbull-header__actuality"})
        if (self.date is not None):
            comma_index = self.date.find(',')
            if (comma_index != -1):
                self.date = self.date[comma_index+2:]
            
        return True
        
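    def get_price(self):
        # Returns the price text, or None if it is missing or only an estimate
        price = self.soup.select_one('span.viewbull-summary-price__value')
        if (price is None):
            return None

        price = price.text.strip()
        # An empty string counts as "not found"; '≈' marks an approximate price
        if (not price or price[0] == '≈'):
            return None

        return price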
    
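    def get_feature(self, attrs):
        # Text of the first tag matching attrs, or None if there is no such tag
        feature = self.soup.find(attrs=attrs)
        if (feature is None):
            return None

        feature = feature.text.strip()
        # Drop a trailing '?'; endswith is safe on an empty string, unlike feature[-1]
        if (feature.endswith('?')):
            feature = feature[:-1].strip()
        return feature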

    def __repr__(self):
        return "Motorcycle class"
        
    def __str__(self):
        string = str(self.index) + ',' + str(self.price) + ',' + str(self.model) + ',' + \
                    str(self.mileage) + ',' + str(self.motorcycle_class) + ',' + \
                    str(self.year) + ',' + str(self.engine_capacity) + ',' + \
                    str(self.engine_strokes) + ',' + str(self.damaged) + ',' + \
                    str(self.documents) + ',' + str(self.city) + ',' + str(self.date)
        return string
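
The `__str__` method joins the fields with plain commas, so a field that itself contains a comma would add an extra column to the row. One possible way around this, shown only as a sketch, is to let the standard `csv` module quote such fields. The helper name `to_csv_row` and the sample record are hypothetical.

```python
import csv
import io

def to_csv_row(values):
    """Format one record as a CSV line, quoting fields that contain commas.

    Note: csv.writer writes None as an empty field, not as the text 'None'.
    """
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="").writerow(values)
    return buffer.getvalue()

# Hypothetical record: the model text contains a comma.
print(to_csv_row([1, "100000", "Honda CB400, SF", None]))
```

A file written this way can still be read back with `pd.read_csv`, since quoted fields are part of the usual CSV conventions.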

<b>Connection settings</b>

In [3]:
city = 'moskva'
url_template = 'https://moto.drom.ru'
url_list = url_template + '/' + city + '/sale/?status=archive&page='

In [4]:
# Proxy list site: http://spys.one/proxys/RU/

headers = None
proxies = None
#proxies = { 'https': '5.53.19.82:56907' }

<b>Helper functions for connecting</b>

In [5]:
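def get_links(url_page, proxies=None, headers=None, sleep=False, sleep_seconds=5):
    # Returns the list of listing-link tags, or None if the request failed
    try:
        response = requests.get(url_page, proxies=proxies, headers=headers, timeout=30)
        response.raise_for_status()
    except requests.RequestException:
        return None

    soup = BeautifulSoup(response.content, 'html.parser')

    if (sleep):
        # Random pause of 0..sleep_seconds-1 seconds between requests
        time.sleep(np.random.randint(sleep_seconds))

    return soup.select('.bulletinLink')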

In [6]:
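def get_motorcycle_soup(url_motorcycle, proxies=None, headers=None, sleep=False, sleep_seconds=5):
    # Returns the parsed listing page, or None if the request itself failed
    try:
        html_motorcycle = requests.get(url_motorcycle, proxies=proxies, headers=headers, timeout=30).content
    except requests.RequestException:
        return None

    soup_motorcycle = BeautifulSoup(html_motorcycle, 'html.parser')

    if (sleep):
        # Random pause of 0..sleep_seconds-1 seconds between requests
        time.sleep(np.random.randint(sleep_seconds))

    return soup_motorcycle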
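
Note that `np.random.randint(sleep_seconds)` draws an integer from 0 to `sleep_seconds - 1`, so the pause can be zero. If a guaranteed minimum pause is wanted, one option is sketched below. The function `polite_delay` and its `min_seconds` parameter are assumptions introduced for illustration, not part of the original helpers.

```python
import random

def polite_delay(max_seconds=5.0, min_seconds=1.0, rng=random):
    """Pick a pause length in [min_seconds, max_seconds] seconds."""
    if max_seconds < min_seconds:
        raise ValueError("max_seconds must be at least min_seconds")
    return rng.uniform(min_seconds, max_seconds)

# time.sleep(polite_delay()) could then stand in for the np.random.randint call.
delay = polite_delay(rng=random.Random(0))
print(1.0 <= delay <= 5.0)
```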

<b>Building the list of links</b>

In [7]:
current_page = 1
last_page = 142
current_link = 1

with open('data/links.txt', 'a') as f_output:
    for page in range(current_page, last_page+1):
        links = get_links(url_list + str(page), proxies, headers)

        if (links is None):
            print("Error while getting motorcycle links")
            print("Page: " + str(page))
            print("Link: " + str(current_link))
            break

        for link in links:
            print(str(current_link) + '. ' + url_template + link['href'], file=f_output)
            current_link += 1
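
Each line written to `data/links.txt` has the form `N. URL`. As a rough sketch of how such lines could be read back, and how a run could resume from a given link number, consider the helpers below. `parse_link_line` and `links_from` are hypothetical names, and the sample URLs are made up for the example.

```python
def parse_link_line(line):
    """Split a 'N. URL' line from links.txt into (N, URL)."""
    number, _, url = line.strip().partition(". ")
    return int(number), url

def links_from(lines, start=1):
    """Yield (number, url) pairs whose number is at least `start`."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        number, url = parse_link_line(line)
        if number >= start:
            yield number, url

# Made-up sample lines in the same 'N. URL' format.
sample = [
    "1. https://moto.drom.ru/moskva/sale/example-1.html\n",
    "2. https://moto.drom.ru/moskva/sale/example-2.html\n",
]
print(list(links_from(sample, start=2)))
```

Stripping the line before splitting also removes the trailing newline, which would otherwise end up in the URL.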

<b>Parsing the motorcycle pages</b>

In [8]:
features = "id,price,model,mileage,motorcycle_class,year,engine_capacity,engine_strokes,damaged,documents,city,date"

with open('data/links.txt', 'r') as links_file:
    with open('data/motorcycles.csv', 'a', encoding='utf-8') as motorcycle_file:
        print(features, file=motorcycle_file)
        for i, link in enumerate(links_file, 1):
            link = link[link.find('h'):].strip()  # drop the "N. " prefix and the trailing newline
            
            soup = get_motorcycle_soup(link, proxies, headers, False)
            if (soup is None):
                print("Error while getting motorcycle information")
                print("Link: " + str(link))
                break

            motorcycle = Motorcycle(soup, i)
            if (motorcycle.parse_information()):
                print(motorcycle, file=motorcycle_file)
            else:
                print("Error while parsing motorcycle information")
                print("Link: " + str(link))

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/ducati-diavel-2012-57570517.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/prodam-honda-cbr929-66695315.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/prodaju-ktm-690-duke-2012-g-v-66639732.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/harly-davidson-2013-57731456.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/bmw-k1600-gtl-66373834.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/prodam-motocikl-suzuki-desperado-vz-800-65163540.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/honda-cbr600f-66032691.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/prodam-motocikl-ural-s-koljaskoj-8.103-10-65426129.html

Error while parsing motor

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/2015-harley-davidson-sportster-1200-62508523.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/harley-davidson-tour-glide-classic-fltc-1983-62508350.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/harley-davidson-xlh-1000-sportster-1973-62215671.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/novyj-voshod-3-61692998.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/prodam-motocikl-jawa-350-1965-54607089.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/prodam-motocikl-ural-m62-54607506.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/cafe-racer-suzuki-59528132.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/elektromotocikl-ot-60.000-rub-

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/nonda-vfr750f-45077973.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/suzuki-katana-44944622.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/suzuki-44944611.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/suzuki-gsf1200-44944576.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/bmwb-k1200lt-44944499.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/honda-vfr-800-44944484.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/yamaha-xvs1100-44944462.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/kawasaki-vn900-44944429.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/motocikly-s-aukcionov-japonii-moskv

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/prodam-honda-vt1300cx-fury-34742394.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/motocikl-na-zap-chasti-honda-nx650-dominator-mod-pd02-33576782.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/motocikl-po-zap-chastjam-motorazborka-honda-cb500-mod-pc26-pc32-33576925.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/motocikl-po-zap-chastjam-motorazborka-honda-vfr750-mod-rc24-33576822.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/motocikl-na-razborku-motorazborka-kawasaki-zzr-600-e-33606234.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/kawasaki-vn-1700-2010-33924474.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/kupit-nedorogoj-motocikl-honda-shadow750-34258796.html



Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/suzuki-haybussa-gsx-1300r-2006-g-21291866.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/yamaha-stryker-1300-2011-g-21291755.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/kawasaki-ninja-zx10r-2006-g-21291613.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/ducati-1199-panigale-2012-21237465.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/honda-hornet-2006-6300-21014746.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/honda-cbr-600f-2001-21014341.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/prodazha-motociklov-20948178.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/motocikly-iz-japonii-prjamye-postavki-bolshoj-vybor-nizkie-cen

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/prodaetsja-honda-gold-wing-1200-i-dr-bajki-5587895.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/honda-gl1500se-1992g-v-5286782.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/sale/suzuki-b-king-1.3-184-l-s-5330574.html

Error while parsing motorcycle information
Link: https://moto.drom.ru/moskva/sale/hromirovannye-diski-prjamotoki-yoshimura-xenon-dempfer-rulja-5269050.html

