# Scraping 'Apple Watch Series 4' from Carousell 

## Web Scraping with BeautifulSoup (Python)

### BeautifulSoup
- __Python Package__ to parse HTML and XML content
- __Object__ to parse data from content (HTML Parser)
    - creates a parse tree for parsed pages that can be used to extract data from HTML
- Functions/Methods to extract data from the content
- https://www.crummy.com/software/BeautifulSoup/bs4/doc/

### Functions/Methods
 - .find_all()
     - returns all the tags that match the search
     - soup.find_all('a')
         - <class 'bs4.element.ResultSet'>
         - <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
         - <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
         - <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
     
 - .find()
     - <class 'bs4.element.Tag'>
     - returns only the first tag that matches the search
     - soup.find(id="link3")
         - <a class="link" href="http://example.com/example3" id="link3">This returns just the matching element by ID</a>
 - .select()
     - returns the tags that match a CSS selector
     - soup.select("p > a")
         - <class 'list'>
         - <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
         - <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
         - <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
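The three methods above can be tried on a small HTML snippet. This is a minimal sketch: the snippet is adapted from the BeautifulSoup documentation, and the built-in `html.parser` is used here only so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Sample HTML adapted from the BeautifulSoup documentation.
html_doc = """
<p class="story">Once upon a time there were three little sisters:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

all_links = soup.find_all("a")      # every <a> tag in the document
third_link = soup.find(id="link3")  # first tag whose id is "link3"
child_links = soup.select("p > a")  # <a> tags that are direct children of a <p>

print(len(all_links))
print(third_link.get_text())
print([a["href"] for a in child_links])
```

`find()` returns a single `Tag` (or `None` when nothing matches), while `find_all()` and `select()` return list-like collections, so their results can be looped over or indexed.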
 
### Steps to scrape a website
1. Identify the structure of the website
    - Look out for odd cases that require additional functions to handle
2. Plan the flow of how you want to scrape the website
    - Navigation among the pages
3. Estimate the amount of data you will need to scrape
    - As the amount of data increases, the program takes longer to run
    - More data also means more requests sent to the web server
4. Use 'Inspect element' to inspect the source code of the page
    - Understand the HTML code and identify what data you need
5. Write the code
    - Plan how to store the data
    - Plan the structure of the code
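Step 3 can be made concrete with a rough estimate. The sketch below assumes 40 items per page and a 5-second pause between requests, which are the values used later in this notebook; the total of 900 listings and the helper name `estimate_run` are hypothetical and only illustrate the arithmetic.

```python
import math

def estimate_run(total_items, items_per_page=40, delay_seconds=5):
    """Return (number of pages, seconds spent waiting between requests)."""
    pages = math.ceil(total_items / items_per_page)
    # Only the deliberate pauses are counted; download and parsing
    # time come on top of this figure.
    return pages, pages * delay_seconds

pages, seconds = estimate_run(900)  # 900 is a hypothetical listing count
print(pages, "pages, at least", seconds, "seconds of waiting")
```

The number of pages is also the number of requests sent to the server, so this figure helps decide whether the delay between requests is reasonable.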

## 16/07/2019

In [6]:
from bs4 import BeautifulSoup
import urllib.request  # the request submodule must be imported explicitly
import time

# Function to send request to page and parse html page into BeautifulSoup Object
def request_page(url):
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
    req = urllib.request.Request(url, headers=headers)
    html = urllib.request.urlopen(req)
    read_html = html.read()
    page = BeautifulSoup(str(read_html,"utf-8"), "html5lib")
    return page  # Return a BeautifulSoup object

count_pages = 1
apple_watch_list = []  # store all the Apple Watch listings
url = 'https://sg.carousell.com'
extension = '/search/products/?query=Apple%20Watch%20series%204'

while extension is not None:
    count_items = 0
    apple_watch_page = request_page(url+extension)
    all_apple_watch = apple_watch_page.select('div.g-_a > div > div') #g-_a changes daily
    print("Number of items: ", len(all_apple_watch), "Page: ", count_pages)
    print("URL: ", url + extension)

    for index in all_apple_watch:
        apple_watch_dict ={} #store all the info of an apple watch
        
        # Retrieve data from the listing card using its HTML tags.
        # The class names below (J-A, J-t, ...) are auto-generated and change daily.
        # Each field starts as None so that a missing tag does not silently
        # reuse the value from the previous listing.
        username = time_posted = title = price = description = status = None
        username_tag = index.find(name='div', class_='J-A')
        if username_tag is not None:
            username = username_tag.get_text()
        time_tag = index.find(name='time', class_='J-t')
        if time_tag is not None and time_tag.find(name='span') is not None:
            time_posted = time_tag.find(name='span').get_text()
        title_tag = index.find(name='div', class_='J-m')
        if title_tag is not None:
            title = title_tag.get_text()
        info = index.find(name='div', class_='J-k')
        info_divs = info.find_all('div') if info is not None else []
        if len(info_divs) >= 3:
            price, description, status = info_divs[:3]
        
        # Storing data into dictionary
        apple_watch_dict['username'] = username
        apple_watch_dict['time_posted'] = time_posted
        apple_watch_dict['title'] = title
        apple_watch_dict['price'] = str(price)[5:-6]
        apple_watch_dict['description'] = str(description)[5:-6]
        apple_watch_dict['status'] = str(status)[5:-6]

        # append the dictionary into a list 
        apple_watch_list.append(apple_watch_dict)
        
        count_items += 1
    
    print("Number of items scraped: ", count_items)
    print("-" * 40)
        
    # Retrieve new extension (next extension is for the Next page)
    extension = apple_watch_page.find('li', class_='pagination-next pagination-btn')
    extension = extension.select('a')[0].get('href')

    count_pages += 1
    time.sleep(5)

print("--Scraping done!--")

Number of items:  40 Page:  1
URL:  https://sg.carousell.com/search/products/?query=Apple%20Watch%20series%204
Number of items scraped:  40
----------------------------------------
Number of items:  40 Page:  2
URL:  https://sg.carousell.com/search/products/?query=Apple%20Watch%20series%204&session=eyJhZ2dyZWdhdGVfY291bnQiOjQwLCJzZXNzaW9uX2luaXRfYXQiOiIyMDE5LTA3LTE2VDE1OjM1OjE0LjkyMjQxNjIyN1oiLCJzaWduYXR1cmVfaGFzaCI6IncxTC9xS0M4M2c4eldka0gzdWYxdGZxOWgyaz0iLCJzbG90cyI6eyJkZWZhdWx0Ijp7InN0YXJ0IjozOH0sImZpcnN0X3RpbWVfbGlzdGVyIjp7InN0YXJ0Ijo0fX19
Number of items scraped:  40
----------------------------------------
Number of items:  40 Page:  3
URL:  https://sg.carousell.com/search/products/?query=Apple%20Watch%20series%204&session=eyJhZ2dyZWdhdGVfY291bnQiOjgwLCJzZXNzaW9uX2luaXRfYXQiOiIyMDE5LTA3LTE2VDE1OjM1OjE0LjkyMjQxNjIyN1oiLCJzaWduYXR1cmVfaGFzaCI6IncxTC9xS0M4M2c4eldka0gzdWYxdGZxOWgyaz0iLCJzbG90cyI6eyJkZWZhdWx0Ijp7InN0YXJ0Ijo3OH0sImZpcnN0X3RpbWVfbGlzdGVyIjp7InN0YXJ0Ijo0fX19
Number of ite

Number of items:  40 Page:  21
URL:  https://sg.carousell.com/search/products/?query=Apple%20Watch%20series%204&session=eyJhZ2dyZWdhdGVfY291bnQiOjgwMCwic2Vzc2lvbl9pbml0X2F0IjoiMjAxOS0wNy0xNlQxNTozNToxNC45MjI0MTYyMjdaIiwic2lnbmF0dXJlX2hhc2giOiJ3MUwvcUtDODNnOHpXZGtIM3VmMXRmcTloMms9Iiwic2xvdHMiOnsiZGVmYXVsdCI6eyJzdGFydCI6Nzk4fSwiZmlyc3RfdGltZV9saXN0ZXIiOnsic3RhcnQiOjR9fX0%3D
Number of items scraped:  40
----------------------------------------
Number of items:  40 Page:  22
URL:  https://sg.carousell.com/search/products/?query=Apple%20Watch%20series%204&session=eyJhZ2dyZWdhdGVfY291bnQiOjg0MCwic2Vzc2lvbl9pbml0X2F0IjoiMjAxOS0wNy0xNlQxNTozNToxNC45MjI0MTYyMjdaIiwic2lnbmF0dXJlX2hhc2giOiJ3MUwvcUtDODNnOHpXZGtIM3VmMXRmcTloMms9Iiwic2xvdHMiOnsiZGVmYXVsdCI6eyJzdGFydCI6ODM4fSwiZmlyc3RfdGltZV9saXN0ZXIiOnsic3RhcnQiOjR9fX0%3D
Number of items scraped:  40
----------------------------------------
Number of items:  40 Page:  23
URL:  https://sg.carousell.com/search/products/?query=Apple%20Watch%20series%20

## 17/07/2019

In [1]:
from bs4 import BeautifulSoup
import urllib.request  # the request submodule must be imported explicitly
import time

# Function to send request to page and parse html page into BeautifulSoup Object
def request_page(url):
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
    req = urllib.request.Request(url, headers=headers)
    html = urllib.request.urlopen(req)
    read_html = html.read()
    page = BeautifulSoup(str(read_html,"utf-8"), "html5lib")
    return page

count_pages = 1
apple_watch_list = []  # store all the Apple Watch listings
url = 'https://sg.carousell.com'
extension = '/search/products/?query=Apple%20Watch%20series%204'

while extension is not None:
    count_items = 0
    apple_watch_page = request_page(url+extension)
    all_apple_watch = apple_watch_page.select('div.g-_a > div > div') #g-_a changes daily
    print("Number of items: ", len(all_apple_watch), "Page: ", count_pages)
    print("URL: ", url + extension)

    for index in all_apple_watch:
        apple_watch_dict ={} #store all the info of an apple watch
        
        # Retrieve data from the listing card using its HTML tags.
        # The class names below (J-A, J-t, ...) are auto-generated and change daily.
        # Each field starts as None so that a missing tag does not silently
        # reuse the value from the previous listing.
        username = time_posted = title = price = description = status = None
        username_tag = index.find(name='div', class_='J-A')
        if username_tag is not None:
            username = username_tag.get_text()
        time_tag = index.find(name='time', class_='J-t')
        if time_tag is not None and time_tag.find(name='span') is not None:
            time_posted = time_tag.find(name='span').get_text()
        title_tag = index.find(name='div', class_='J-m')
        if title_tag is not None:
            title = title_tag.get_text()
        info = index.find(name='div', class_='J-k')
        info_divs = info.find_all('div') if info is not None else []
        if len(info_divs) >= 3:
            price, description, status = info_divs[:3]
        
        # Storing data into dictionary
        apple_watch_dict['username'] = username
        apple_watch_dict['time_posted'] = time_posted
        apple_watch_dict['title'] = title
        apple_watch_dict['price'] = str(price)[7:-6]
        apple_watch_dict['description'] = str(description)[5:-6]
        apple_watch_dict['status'] = str(status)[5:-6]

        # append the dictionary into a list 
        apple_watch_list.append(apple_watch_dict)
        
        count_items += 1
    
    print("Number of items scraped: ", count_items)
    print("-" * 40)
        
    # Retrieve new extension (next extension is for the Next page)
    extension = apple_watch_page.find('li', class_='pagination-next pagination-btn')
    extension = extension.select('a')[0].get('href')

    count_pages += 1
    time.sleep(5)

print("--Scraping done!--")

Number of items:  40 Page:  1
URL:  https://sg.carousell.com/search/products/?query=Apple%20Watch%20series%204
Number of items scraped:  40
----------------------------------------
Number of items:  40 Page:  2
URL:  https://sg.carousell.com/search/products/?query=Apple%20Watch%20series%204&session=eyJhZ2dyZWdhdGVfY291bnQiOjQwLCJzZXNzaW9uX2luaXRfYXQiOiIyMDE5LTA3LTE4VDAzOjIxOjMxLjczOTIzMjg4M1oiLCJzaWduYXR1cmVfaGFzaCI6IncxTC9xS0M4M2c4eldka0gzdWYxdGZxOWgyaz0iLCJzbG90cyI6eyJkZWZhdWx0Ijp7InN0YXJ0IjozOH0sImZpcnN0X3RpbWVfbGlzdGVyIjp7InN0YXJ0Ijo1fX19
Number of items scraped:  40
----------------------------------------
Number of items:  40 Page:  3
URL:  https://sg.carousell.com/search/products/?query=Apple%20Watch%20series%204&session=eyJhZ2dyZWdhdGVfY291bnQiOjgwLCJzZXNzaW9uX2luaXRfYXQiOiIyMDE5LTA3LTE4VDAzOjIxOjMxLjczOTIzMjg4M1oiLCJzaWduYXR1cmVfaGFzaCI6IncxTC9xS0M4M2c4eldka0gzdWYxdGZxOWgyaz0iLCJzbG90cyI6eyJkZWZhdWx0Ijp7InN0YXJ0Ijo3OH0sImZpcnN0X3RpbWVfbGlzdGVyIjp7InN0YXJ0Ijo1fX19
Number of ite

AttributeError: 'NoneType' object has no attribute 'select'
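
The `AttributeError` above most likely comes from the pagination step: when a page has no `<li class="pagination-next pagination-btn">` element (for example on the last page), `find()` returns `None`, and calling `.select()` on `None` fails. A minimal sketch of a guarded lookup follows; the helper name `next_page_href` and the two HTML snippets are hypothetical and only stand in for real Carousell pages.

```python
from bs4 import BeautifulSoup

def next_page_href(page):
    """Return the href of the 'next page' link, or None if there is none."""
    next_item = page.find('li', class_='pagination-next pagination-btn')
    if next_item is None:
        return None  # no "next" button: this is the last page
    link = next_item.find('a')
    if link is None:
        return None
    return link.get('href')

# Hypothetical snippets standing in for a middle page and a last page.
middle_page = BeautifulSoup(
    '<ul><li class="pagination-next pagination-btn">'
    '<a href="/search/products/?page=2">Next</a></li></ul>',
    'html.parser',
)
last_page = BeautifulSoup('<ul><li class="pagination-btn">1</li></ul>', 'html.parser')

print(next_page_href(middle_page))
print(next_page_href(last_page))
```

With a helper like this, the loop can assign `extension = next_page_href(apple_watch_page)` and the `while extension is not None` condition ends the scrape cleanly on the last page.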

## Print to check

In [2]:
print(apple_watch_list)

[{'username': 'koestoer', 'time_posted': '6 days ago', 'title': '🚚 Apple Watch Series 4', 'price': '730', 'description': 'Brand new Apple Watch Series 4 + Cellular Space Grey 44mm', 'status': 'New'}, {'username': 'strapatelier', 'time_posted': '2 days ago', 'title': '🚚 Apple Watch Adapter (Black)', 'price': '7', 'description': 'Now you have the freedom to switch and interchange your Apple watch with any of our straps!   Simply purchase the right adapter for your Apple watch series, you can bring out the versatility for your smart device!  Lug sizes 38/40 is equivalent to 22mm straps Lug sizes 42/44 is equivalent to 24mm st', 'status': 'New'}, {'username': 'cey15', 'time_posted': '3 days ago', 'title': 'Apple Watch Rugged Case 40 and 44mm', 'price': '14', 'description': '- Apple Watch Case for latest Apple Watch 4 - 40 and 44 MM size available  -Protect your apple watch from scratches  - Black / Red / Clear colour available.     # Apple Watch , Apple , Watch series 4', 'status': 'New'},

In [3]:
# enumerate() supplies the running count, starting from 1
for count, i in enumerate(apple_watch_list, start=1):
    print(count, " :", i)

1  : {'username': 'koestoer', 'time_posted': '6 days ago', 'title': '🚚 Apple Watch Series 4', 'price': '730', 'description': 'Brand new Apple Watch Series 4 + Cellular Space Grey 44mm', 'status': 'New'}
2  : {'username': 'strapatelier', 'time_posted': '2 days ago', 'title': '🚚 Apple Watch Adapter (Black)', 'price': '7', 'description': 'Now you have the freedom to switch and interchange your Apple watch with any of our straps!   Simply purchase the right adapter for your Apple watch series, you can bring out the versatility for your smart device!  Lug sizes 38/40 is equivalent to 22mm straps Lug sizes 42/44 is equivalent to 24mm st', 'status': 'New'}
3  : {'username': 'cey15', 'time_posted': '3 days ago', 'title': 'Apple Watch Rugged Case 40 and 44mm', 'price': '14', 'description': '- Apple Watch Case for latest Apple Watch 4 - 40 and 44 MM size available  -Protect your apple watch from scratches  - Black / Red / Clear colour available.     # Apple Watch , Apple , Watch series 4', 'stat

## Export data (list of dictionaries) to CSV

In [21]:
import csv

keys = apple_watch_list[0].keys()
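# newline='' avoids blank rows on Windows; utf-8 keeps emoji in titles intact
with open('carousell_scraping_2.csv', 'w', newline='', encoding='utf-8') as output_file: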
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(apple_watch_list)

## Pandas Library 

### Pandas
 - DataFrame object for data manipulation and analysis.
 - Pandas is built on top of the __NumPy__ package
 
### Functions
 - .head()
     - Displays the first 5 rows
 - .tail(2)
     - Displays the last 2 rows
 - .info()
     - Essential details about your dataset, such as the number of rows and columns, the number of non-null values, the type of data in each column, and how much memory your DataFrame is using.
 - .shape
     - An attribute (no parentheses) that holds (rows, columns)
 - https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
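
The calls listed above can be tried on a tiny DataFrame. This is only a sketch: the rows are hypothetical values shaped like the scraped listings, not real data from the CSV.

```python
import pandas as pd

# Hypothetical rows shaped like the scraped listings.
df = pd.DataFrame({
    "username": ["user_a", "user_b", "user_c"],
    "price": [730.0, 7.0, 14.0],
    "status": ["New", "New", "Used"],
})

print(df.head())    # first 5 rows (all 3 rows here)
print(df.tail(2))   # last 2 rows
df.info()           # column dtypes, non-null counts, memory usage
print(df.shape)     # (rows, columns); an attribute, so no parentheses
```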
 
 

In [36]:
import pandas as pd
import csv

df = pd.read_csv('carousell_scraping_2.csv')

In [37]:
df["price"] = df["price"].str.replace(",","").astype(float)

In [39]:
df['price'] = pd.to_numeric(df['price'])

In [49]:
filter_df = df[(df["price"] > 400) & (df["price"] < 600) & (df["title"].str.contains('strap') == False)]

In [50]:
filter_df 

Unnamed: 0,username,time_posted,title,price,description,status
0,sabspore,8 days ago,Authentic Hermes Apple Watch Strap,500.0,Authentic Hermes Apple Watch Strap. Hermes Ora...,New
12,linda_sales,New Carouseller,Apple Watch series 4 44mm,550.0,Apple Watch Series 4 44mm Silver Aluminium whi...,Used
14,tinmichael,3 days ago,Apple Watch Series 4 40 mm Nike GPS with warranty,420.0,Apple Watch Series 4 40 mm Nike GPS with warra...,Used
18,davee0,New Carouseller,Apple Watch series 4 40mm,460.0,Condition is as shown. Warranty till 18 nov 20...,Used
19,ballonbleu,10 days ago,Original Hermes Apple Watch Strap for 42/44mm ...,580.0,This is an authentic Hermes Apple Watch silico...,New
23,armourdilo,3 days ago,🚚 Apple Watch Series 4,520.0,Looking to purchase Apple Watch Series 4. Eith...,Used
25,cm261,New Carouseller,Apple Watch Series 4 GPS 40mm (Brand New-Seale...,570.0,"Unopened pack of Apple Watch Series4, got as a...",New
26,binbiz,14 days ago,Apple watch series 42mm black,550.0,Brand New Apple Watch series 4 42mm Full box,New
32,eg_hub,7 days ago,Apple series 4 watch,549.9,In Stocks Available! Brand new and sealed! ...,New
33,cyberist,8 days ago,Apple Watch Series 4,510.0,apple Watch 44mm but still great condition as ...,Used
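
The output above still contains listings such as "Authentic Hermes Apple Watch Strap", because `str.contains('strap')` is case-sensitive and does not match "Strap". A minimal sketch of a case-insensitive variant follows; the three rows are copied from the output above, and `na=False` is an added assumption so that missing titles count as non-matches.

```python
import pandas as pd

# Three rows copied from the filtered output above.
df = pd.DataFrame({
    "title": [
        "Authentic Hermes Apple Watch Strap",
        "Apple Watch series 4 44mm",
        "Apple Watch Series 4 40 mm Nike GPS with warranty",
    ],
    "price": [500.0, 550.0, 420.0],
})

in_range = (df["price"] > 400) & (df["price"] < 600)
# case=False matches "strap", "Strap" and "STRAP"; na=False treats missing titles as no match
is_strap = df["title"].str.contains("strap", case=False, na=False)
filter_df = df[in_range & ~is_strap]
print(filter_df)
```

Applying this stricter filter to the full dataset would change the row count and the summary statistics reported by `describe()`.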


In [51]:
filter_df.describe()

Unnamed: 0,price
count,88.0
mean,539.282955
std,40.990855
min,420.0
25%,520.0
50%,550.0
75%,570.0
max,599.0
