# 1. Introduction #

### I will scrape the information I need to analyse the used luxury sedan market in Singapore from sgcarmart.com, the largest online car marketplace in the country. ###

To respect the rules of sgcarmart.com, I checked its robots.txt. These are the rules I need to abide by:
User-agent: *
Crawl-delay: 5
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /mail/
Disallow: /dealer/
Disallow: /directory/premium/
Disallow: /includes/
Disallow: /phpads/
Disallow: /update/
Disallow: /upload/

Hence, I will only scrape the following information from each posting: (1) Name of Posting (2) Price (3) Depreciation Value (4) Mileage (5) Engine Capacity (6) Registered Date (7) Power (8) Number of Previous Owners.
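
The rules above can also be checked programmatically before any request is sent. The following is a minimal sketch using Python's standard `urllib.robotparser`; the rules are pasted in as a string so that it runs without network access, and the helper name `is_allowed` and the two test URLs are my own illustrations rather than part of the original notebook.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt listing above.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /mail/
Disallow: /dealer/
Disallow: /directory/premium/
Disallow: /includes/
Disallow: /phpads/
Disallow: /update/
Disallow: /upload/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, agent="*"):
    """Return True if robots.txt permits `agent` to fetch `url` (hypothetical helper)."""
    return parser.can_fetch(agent, url)


# Delay in seconds requested by the site for all user agents.
crawl_delay = parser.crawl_delay("*")

print(crawl_delay)
print(is_allowed("https://www.sgcarmart.com/used_cars/listing.php"))
print(is_allowed("https://www.sgcarmart.com/dealer/"))
```

In a live run, `parser.set_url(...)` followed by `parser.read()` would fetch the current robots.txt instead of relying on a pasted copy, which may be out of date.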

# 2. Import Libraries #

In [1]:
import requests
from bs4 import BeautifulSoup as bs
from time import sleep
import re
import csv

In [2]:
with open('sg_used_cars.csv','w',newline='') as f:
    header = ['name','price','depre','mileage','engine_cap','reg_date','power','owners']
    writer = csv.writer(f)
    writer.writerow(header)
    

# 3. Get Links For All Postings #

I will store the links to all the car postings on a listing page in a list, then access them one by one to extract the data.

In [3]:
def store_links(soup):
    links = []
    for item in soup.findAll('strong'):
        try:
            link = item.find('a')
            if 'info' in link['href']:
                links.append(link['href'])
        except:
            continue

    return links
    

# 4. Enter Link & Retrieve Info Needed #

In [4]:
def get_name(soup):
    try:
        name = soup.find('div',{'id':'toMap'}).text.strip()
    except:
        name = 'NA'
    return name

def get_price(soup):
    try:
        price = soup.find('td',{'class':'font_red'}).text.strip()[1:]
    except:
        price = 'NA'
    return price

def get_depre(soup):
    try:
        depre = re.findall(r'\d+,\d+',soup.findAll('tr',{'class':'row_bg'})[1].find('td',{'class':None}).text)[0]
    except:
        depre = 'NA'
    return depre

def get_miles(soup):
    try:
        mileage = re.findall(r'\d+,\d+',soup.find('div',{'class':'row_info'}).text.strip())[0]
    except:
        mileage = 'NA'
    return mileage

def get_engcap(soup):
    try:
        engine_cap = re.findall(r'\d+,*\d+',soup.findAll('div',{'class':'row_info'})[4].text)[0]
    except:
        engine_cap = 'NA'
    return engine_cap

def get_regdate(soup):
    try:
        reg_date = re.findall(r'\d{2}-\w{3}-\d{4}',soup.findAll('tr',{'class':'row_bg'})[1].findAll('td',{'class':None})[-1].text)[0]
    except:
        reg_date = 'NA'
    return reg_date

def get_power(soup):
    try:
        power = re.findall(r'\d+\.\d+',soup.findAll('div',{'class':'row_info'})[-2].text)[0]
    except:
        power = 'NA'
    return power


def get_owners(soup):
    try:
        owners = soup.findAll('div',{'class':'row_info'})[-1].text
    except:
        owners = 'NA'
    return owners

def access_link(links):
    info = []
    for link in links:
        front = 'https://www.sgcarmart.com/used_cars/'
        url = front + link
        html = requests.get(url)
        soup = bs(html.text,'lxml')
        name = get_name(soup)
        price = get_price(soup)
        depre = get_depre(soup)
        mileage = get_miles(soup)
        eng_cap = get_engcap(soup)
        reg_date = get_regdate(soup)
        power = get_power(soup)
        owners = get_owners(soup)
        info.append([name,price,depre,mileage,eng_cap,reg_date,power,owners])
        sleep(5)  # wait between requests, matching Crawl-delay: 5 in robots.txt
    
    return info
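
To make the regular expressions in the helper functions easier to follow, here is a small sketch that applies the same patterns to sample strings. The sample texts are hypothetical stand-ins for what the page cells might contain (the real markup may differ), and `first_match` is an illustrative helper, not a function from the notebook.

```python
import re

# Hypothetical cell texts; the actual strings on the site may differ.
samples = {
    "depre": "$12,340 /yr",
    "mileage": "85,000 km (9.5k /yr)",
    "engine_cap": "1,998 cc",
    "reg_date": "15-Jan-2015 (4yrs 2mths COE left)",
    "power": "135.0 kW (181 bhp)",
}

# The same patterns used in get_depre, get_miles, get_engcap,
# get_regdate and get_power.
patterns = {
    "depre": r"\d+,\d+",
    "mileage": r"\d+,\d+",
    "engine_cap": r"\d+,*\d+",
    "reg_date": r"\d{2}-\w{3}-\d{4}",
    "power": r"\d+\.\d+",
}


def first_match(pattern, text):
    """Return the first regex match in `text`, or 'NA' if there is none."""
    matches = re.findall(pattern, text)
    return matches[0] if matches else "NA"


extracted = {field: first_match(patterns[field], samples[field]) for field in samples}
for field, value in extracted.items():
    print(field, value)
```

One consequence of these patterns is worth noting: `\d+,\d+` requires a thousands separator, so a value below 1,000 (for example a text such as `950 km`) would not match and would be recorded as `NA`. Likewise, `\d+\.\d+` only matches a power figure written with a decimal point.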

# 5. Save to CSV file #

After accessing all the links on a page, I save the results to my CSV file before scraping the next page. This is helpful when scraping a large amount of information, because the pages already saved are not lost if the script stops working midway.

In [5]:
def save_info(info):
    
    with open('sg_used_cars.csv','a',newline='') as f:

        writer = csv.writer(f)
        for row in info:
            writer.writerow(row)
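
Because each page is appended as soon as it is scraped, the CSV file also shows how far a stopped run got. The sketch below illustrates this under my own assumptions: `append_rows` and `count_saved` are hypothetical helpers, the rows are placeholder values rather than scraped data, and a temporary file is used so that the real `sg_used_cars.csv` is not touched.

```python
import csv
import os
import tempfile

HEADER = ['name', 'price', 'depre', 'mileage', 'engine_cap',
          'reg_date', 'power', 'owners']


def append_rows(path, rows):
    """Append rows to the CSV, writing the header first if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(HEADER)
        writer.writerows(rows)


def count_saved(path):
    """Return the number of data rows (excluding the header) already in the CSV."""
    with open(path, newline='') as f:
        return max(sum(1 for _ in csv.reader(f)) - 1, 0)


# Placeholder rows standing in for two scraped pages of one posting each.
page_1 = [['Car A', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA']]
page_2 = [['Car B', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA']]

demo_path = os.path.join(tempfile.mkdtemp(), 'demo_cars.csv')
append_rows(demo_path, page_1)
append_rows(demo_path, page_2)

saved = count_saved(demo_path)
print(saved)
```

If a run stops partway, a count like this could be used to decide which listing page to restart from, assuming every page contributes the same number of rows.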
            

# 6. Run Program #

I added a `print` counter that shows the number of pages scraped and saved so far, so I can tell that the script is still working.

In [6]:
url = 'https://www.sgcarmart.com/used_cars/listing.php?BRSR={}00&RPG=100&AVL=2&VEH=12'
page_count = 0
for page in range(10):
    html = requests.get(url.format(page))
    soup = bs(html.text,'lxml')
    links = store_links(soup)
    info = access_link(links)
    save_info(info)
    page_count += 1
    print(page_count)
    sleep(5)
    

1
2
3
4
5
6
7
8
9
10
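
As a rough illustration of what the loop above requests and how long it waits, the sketch below builds the listing URLs and adds up the sleep time. It rests on assumptions that I have inferred from the URL rather than confirmed: that `BRSR` is the offset of the first result and `RPG=100` means 100 postings per page. The function names and the delay parameters are illustrative only.

```python
BASE_URL = ('https://www.sgcarmart.com/used_cars/listing.php'
            '?BRSR={}00&RPG=100&AVL=2&VEH=12')


def page_urls(n_pages):
    """Build the listing URLs the same way the loop does (page 0 gives BRSR=000)."""
    return [BASE_URL.format(page) for page in range(n_pages)]


def min_sleep_seconds(n_pages, postings_per_page, posting_delay, page_delay):
    """Total time spent in sleep() calls alone, ignoring network and parsing time."""
    return n_pages * (postings_per_page * posting_delay + page_delay)


urls = page_urls(3)
for u in urls:
    print(u)

# Assumes 100 postings per page and a 5-second wait after every request.
total = min_sleep_seconds(n_pages=10, postings_per_page=100,
                          posting_delay=5, page_delay=5)
print(total / 60, 'minutes')
```

Under these assumptions, the waiting time alone for ten pages comes to 5,050 seconds, or a little over 84 minutes, which is why saving after every page is worthwhile.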
