## Scrape cars on sale on various trader websites

The code below scrapes cars listed on autotrader.com for a given set of zipcodes and stores them in a SQLite database for further analysis. This particular trader seems to be actively fighting scraping of its offers by regularly modifying its HTML structure, CSS files, and class names.

Issues right now:
* The list is limited to the top 1000 results per search, in this case only the most expensive cars, so the results are _clearly biased_.
* I stopped the script after 10 zipcodes in TX (~10000 results).
* The script is slow: it uses Selenium to step through every search result page. Not ideal.
* Not using proxies. I'm assuming that autotrader will ban me at some point.
* Also, _the site uses multiple CSS templates that render the same (possibly to hinder scraping...). This script only works with one of them and raises an error when the wrong one is loaded._ Next step would be to write a script for the other template and automatically identify which CSS template is used.
* Because I'm searching in a 25-mile radius around each zipcode, the same cars can show up several times. For now I'm dealing with this with a SQL command after scraping.
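
The post-scrape de-duplication mentioned in the last bullet could look something like the sketch below. It assumes the `sales` table has the columns written by the scraping loop (`car_id`, `new_used`, `year`, `make`, `type`, `mileage`, `price`, `zipcode`, `source`) and treats two rows as the same car when everything except `car_id` and `zipcode` matches. That matching rule and the helper name `drop_duplicate_cars` are assumptions for illustration, not necessarily the command actually used; two genuinely different cars with identical attributes would be merged.

```python
import sqlite3


def drop_duplicate_cars(conn):
    """Keep the lowest car_id in each group of look-alike listings.

    Returns the number of rows removed. Hypothetical helper: the grouping
    columns are an assumption about what identifies "the same car".
    """
    before = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
    conn.execute("""
        DELETE FROM sales
        WHERE car_id NOT IN (
            SELECT MIN(car_id)
            FROM sales
            GROUP BY new_used, year, make, type, mileage, price, source
        )
    """)
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
    return before - after


# Small in-memory demo with made-up rows: the first two differ only by zipcode.
demo = sqlite3.connect(":memory:")
demo.execute("""
    CREATE TABLE sales (car_id INTEGER, new_used TEXT, year INTEGER, make TEXT,
                        type TEXT, mileage INTEGER, price INTEGER,
                        zipcode TEXT, source TEXT)
""")
demo.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (0, "Used", 2015, "Ford", "F-150 XLT", 34120, 25995, "75001", "http://www.autotrader.com"),
        (1, "Used", 2015, "Ford", "F-150 XLT", 34120, 25995, "75002", "http://www.autotrader.com"),
        (2, "New", 2016, "Honda", "Civic", None, 19000, "75001", "http://www.autotrader.com"),
    ],
)
removed = drop_duplicate_cars(demo)
```

In the notebook itself the same function would be called as `drop_duplicate_cars(conn)` once all zipcodes have been scraped.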

Possible solutions:
* Identify how results are transferred and intercept the AJAX/JSON requests?
* Write a script for the other CSS templates and automatically identify which one is in use.
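
A minimal sketch of the template-detection idea from the second bullet: collect every CSS class name on the page and compare it against a per-template signature. Only the `legacy` signature is based on class names this notebook already relies on (`ymm`, `primary-price`); the `alternate` entry, and the names `TEMPLATE_SIGNATURES`, `ClassCollector` and `detect_template`, are hypothetical placeholders to be replaced once the other template has been inspected.

```python
from html.parser import HTMLParser

# Class names each template is expected to contain.
# "legacy" uses classes the scraping loop looks for;
# "alternate" is a made-up placeholder, not taken from the real site.
TEMPLATE_SIGNATURES = {
    "legacy": {"ymm", "primary-price"},
    "alternate": {"listing-title", "listing-price"},  # hypothetical
}


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in a page."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def detect_template(html):
    """Return the first template whose signature classes all appear, else None."""
    collector = ClassCollector()
    collector.feed(html)
    for name, signature in TEMPLATE_SIGNATURES.items():
        if signature <= collector.classes:
            return name
    return None
```

Calling `detect_template(browser.page_source)` before parsing would let the loop pick the matching parser, or reload the page when it returns `None`, instead of failing on the wrong template.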


In [None]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import sqlite3
import time

%load_ext sql
%config SqlMagic.autopandas=True
%matplotlib inline

In [None]:
# %qtconsole

In [None]:
# Connect to dbs
conn = sqlite3.connect('data/sales.db')
%sql sqlite:///data/zipcode.sqlite

In [None]:
# Get the list of zipcodes; only TX for now (LIMIT 10 because the scrape is very slow)
zips = %sql select ZipCode from zipcode where St = 'TX' limit 10

In [None]:
trader = 'http://www.autotrader.com'



In [None]:
for z in zips.ZipCode:
    
    print(z)

    browser = webdriver.Chrome()
    browser.get(trader)

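    # Find the search box. The generated field name changes between site versions;
    # 'fyc-form-j_id_bb-j_id_bn-j_id_cp-zipcode' was the previous one.
    # 'name' is the locator string behind By.NAME (find_element_by_name was removed
    # in recent Selenium 4 releases).
    elem = browser.find_element('name', 'fyc-form-j_id_be-j_id_bq-j_id_cs-zipcode')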
    elem.send_keys(Keys.ESCAPE)
    elem.send_keys(str(z))
    elem.send_keys(Keys.RETURN)
    time.sleep(20)    

    not_end = 0
    while not not_end:
        soup = BeautifulSoup(browser.page_source, "lxml")

        # Create DataFrame
        # columns = ['NewUsed', 'Year', 'Make', 'Type','Mileage','Price','ZipCode','Source']
        all_cars = pd.DataFrame() #columns=columns)

        #Make and Type
        brands = soup.find_all("span", class_="atcui-truncate ymm")
        #Price and Mileage
        details = soup.find_all("div", class_="atcui-section atcui-clearfix   listing-content ")

        for cars, entry in zip(brands, details):
            carsSTR = cars.text.split()
            car_nu = carsSTR[0]
            car_year = int(carsSTR[1])
            car_make = carsSTR[2]
            car_type = " ".join(carsSTR[3:])

            # Mileage or price can be missing or non-numeric; store NULL in that case
            try:
                milesSTR = entry.find_all("span", class_="mileage")[0].text
                car_miles = int(milesSTR.split()[0].replace(',',''))
            except (IndexError, ValueError):
                car_miles = None

            try:
                priceSTR = entry.find_all("h4", class_="primary-price")[0].text
                car_price = int(priceSTR.replace('$','').replace(',',''))
            except (IndexError, ValueError):
                car_price = None


    #         print('{} {} {} {} {} {}'.format(car_year, car_nu, car_make, car_type, car_price, car_miles))
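            # DataFrame.append was removed in pandas 2.0, so use pd.concat.
            # Values are listed in the same order as the column names assigned below.
            all_cars = pd.concat([all_cars,
                                  pd.DataFrame([[car_nu, car_year, car_make,
                                                 car_type, car_miles, car_price,
                                                 z, trader]])],
                                 ignore_index=True)

        # Skip the write when the page yielded no listings
        if not all_cars.empty:
            all_cars.columns = ['new_used', 'year', 'make', 'type', 'mileage', 'price', 'zipcode', 'source']
            # Offset the index so car_id continues after the rows already stored
            index_max = conn.execute('SELECT MAX(car_id) FROM sales').fetchall()
            if index_max[0][0] is not None:
                all_cars.index = all_cars.index + 1 + index_max[0][0]
            # The `flavor` argument is no longer accepted by to_sql in current pandas
            all_cars.to_sql('sales', conn, if_exists='append', index_label='car_id')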

        next_counter = 0
        while next_counter < 4:
            try: 
                # 'class name' is the locator string behind By.CLASS_NAME
                browser.find_element('class name', 'pagination-button-next').click()
                next_counter = 4
            except:
                if next_counter == 3:
                    not_end = 1
                time.sleep(2)
                next_counter = next_counter + 1

    browser.quit()
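
The field extraction inside the loop above could also be pulled into small helpers that can be tested without a browser. This is only a sketch: `parse_title` and `parse_number` are hypothetical names, and like the loop itself the title parser assumes a single-word make, so a title such as "Used 2014 Land Rover LR4" would be split as make "Land" and type "Rover LR4".

```python
import re


def parse_title(title):
    """Split a listing title like 'Used 2015 Ford F-150 XLT'.

    Returns (new_used, year, make, type). Raises ValueError when the title
    has fewer than three words or the second word is not a year.
    """
    parts = title.split()
    if len(parts) < 3:
        raise ValueError("unexpected listing title: {!r}".format(title))
    return parts[0], int(parts[1]), parts[2], " ".join(parts[3:])


def parse_number(text):
    """Return the first integer found in text such as '$25,995' or
    '34,120 miles', or None when the text contains no digits."""
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else None
```

Returning `None` rather than a placeholder string keeps the `mileage` and `price` columns numeric, with missing values stored as SQL `NULL`.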






In [None]:
conn.close()