*A document in progress showing my scraper development.*

### Import modules
*(Always do this step before running any other code chunks)*

In [20]:
import urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup
import certifi
import urllib3
import pandas as pd 
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

## New listings (brand = lululemon)

Here, I specify the URL for **sold/completed** listings of **lululemon** brand clothing in **New** condition. (It's all in the link). Reminder to check the [robot.txt](https://www.ebay.com/robots.txt) for the site you want to scrape- don't get blocked as a bot!

I'm also identifying the page for BeautifulSoup.

In [21]:
urlpage = 'https://www.ebay.com/sch/i.html?_nkw=lululemon&LH_ItemCondition=1500|1000&LH_Complete=1&rt=nc&LH_Sold=1'

http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
r = http.request('GET', urlpage)
page = urllib.request.urlopen(urlpage).read()
soup = BeautifulSoup(page, 'html.parser')
# print(page)

I'm using the chunk below for testing snippets of code- ignore or use it as a scratchpad.

In [33]:
# For testing
item_containers = soup.find_all('div', {'class': 's-item__info clearfix'})
print(len(item_containers)) # should be about 4 dozen
# item_containers[0]

48
<class 'bs4.element.ResultSet'>
0


### Running the scraper
On this site, there are 48 listings per page. We are starting to scrape at page 1; I want to scrape 500 pages. I'm retrieving 1) title/item details in a messy HTML chunk, to be cleaned later; 2) price.

In [34]:
page_num = 1

# Create lists to store the scraped information
summary = []
price = []

while page_num<=500:
    html = requests.get(urlpage.format(page_num)).text
    soup = BeautifulSoup(html, 'html.parser')

    for item in soup.select('div.s-item__info.clearfix'):
        if item.select_one("h3.s-item__title"):
            summary.append(item.select_one("h3.s-item__title").text)
        if item.select_one("span.s-item__price"):
            price.append(item.select_one("span.s-item__price").text)

    page_num=page_num+1

# Check the results
print(len(summary))
print(len(price))

23904
23904


### Storing the data in a pandas dataframe

In [39]:
new_eBay_lululemon_df = pd.DataFrame({'Summary': summary, 'Price': price})
print(new_eBay_lululemon_df.info())
new_eBay_lululemon_df.head()

# Ew that looks like a lot of memory usage but I'll, uh, deal with it later

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23904 entries, 0 to 23903
Data columns (total 2 columns):
Summary    23904 non-null object
Price      23904 non-null object
dtypes: object(2)
memory usage: 373.6+ KB
None


Unnamed: 0,Summary,Price
0,"Sold Sep 16, 2019Lululemon red padded Sports ...",$13.49
1,"Sold Sep 16, 2019NWT $128.00 Lululemon On The...",$64.00
2,"Sold Sep 16, 2019Lululemon Skinny Will Pant 2...",$48.00
3,"Sold Sep 16, 2019Lululemon Hotty Hot Short Sz...",$41.00
4,"Sold Sep 16, 2019Ladies Lululemon Tank Flowy ...",$16.00


## Pre-owned listings (brand = lululemon)

Here's the workflow again, with less commentary. In this instance, I am applying the scraper to retrieve **sold/completed** listings of **lululemon** brand clothing in **Pre-owned** condition.

In [40]:
urlpage = 'https://www.ebay.com/sch/i.html?_nkw=lululemon&LH_Complete=1&LH_Sold=1&rt=nc&LH_ItemCondition=3000'

http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
r = http.request('GET', urlpage)
page = urllib.request.urlopen(urlpage).read()
soup = BeautifulSoup(page, 'html.parser')

In [42]:
page_num = 1

# Create lists to store the scraped information
summary = []
price = []

while page_num<=500:
    html = requests.get(urlpage.format(page_num)).text
    soup = BeautifulSoup(html, 'html.parser')

    for item in soup.select('div.s-item__info.clearfix'):
        if item.select_one("h3.s-item__title"):
            summary.append(item.select_one("h3.s-item__title").text)
        if item.select_one("span.s-item__price"):
            price.append(item.select_one("span.s-item__price").text)

    page_num=page_num+1

# Check the results
print(len(summary))
print(len(price))

23904
23904


In [43]:
po_eBay_lululemon_df = pd.DataFrame({'Summary': summary, 'Price': price})
print(po_eBay_lululemon_df.info())
po_eBay_lululemon_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23904 entries, 0 to 23903
Data columns (total 2 columns):
Summary    23904 non-null object
Price      23904 non-null object
dtypes: object(2)
memory usage: 373.6+ KB
None


Unnamed: 0,Summary,Price
0,"Sold Sep 16, 2019lululemon Quarter Legnth Jac...",$25.00
1,"Sold Sep 16, 2019Lululemon Commission Warpstr...",$35.00
2,"Sold Sep 16, 2019EUC Lululemon Pace Setter Sk...",$44.99
3,"Sold Sep 16, 2019Lululemon Knot Gonna Fly Tee...",$29.99
4,"Sold Sep 16, 2019Lululemon Core Shorts Men’s ...",$34.00


## New listings (brand = Reformation)

The scraper is retrieving **sold/completed** listings of **Reformation** brand clothing in **New** condition.

In [45]:
urlpage = 'https://www.ebay.com/sch/i.html?_nkw=reformation&_sacat=15724&LH_TitleDesc=0&LH_Sold=1&LH_Complete=1&rt=nc&LH_ItemCondition=1000%7C1500'

http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
r = http.request('GET', urlpage)
page = urllib.request.urlopen(urlpage).read()
soup = BeautifulSoup(page, 'html.parser')

In [46]:
page_num = 1

# Create lists to store the scraped information
summary = []
price = []

while page_num<=35: # Only ~2000 listings
    html = requests.get(urlpage.format(page_num)).text
    soup = BeautifulSoup(html, 'html.parser')

    for item in soup.select('div.s-item__info.clearfix'):
        if item.select_one("h3.s-item__title"):
            summary.append(item.select_one("h3.s-item__title").text)
        if item.select_one("span.s-item__price"):
            price.append(item.select_one("span.s-item__price").text)

    page_num=page_num+1

# Check the results
print(len(summary))
print(len(price))

1632
1632


In [47]:
new_eBay_ref_df = pd.DataFrame({'Summary': summary, 'Price': price})
print(new_eBay_ref_df.info())
new_eBay_ref_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1632 entries, 0 to 1631
Data columns (total 2 columns):
Summary    1632 non-null object
Price      1632 non-null object
dtypes: object(2)
memory usage: 25.6+ KB
None


Unnamed: 0,Summary,Price
0,"Sold Sep 16, 2019$320 REFORMATION WOMEN'S ROM...",$45.47
1,"Sold Sep 16, 2019Reformation Florence Skirt S...",$51.08
2,"Sold Sep 16, 2019REFORMATION Navy paisley min...",$49.82
3,"Sold Sep 16, 2019$78 REFORMATION IRIS RIBBED ...",$49.99
4,"Sold Sep 16, 2019Reformation Lasker Coat Size...",$66.00


## Pre-owned listings (brand = Reformation)

The scraper is retrieving **sold/completed** listings of **Reformation** brand clothing in **Pre-owned** condition.

In [48]:
urlpage = 'https://www.ebay.com/sch/i.html?_nkw=reformation&_sacat=15724&LH_TitleDesc=0&LH_ItemCondition=3000&rt=nc&LH_Sold=1&LH_Complete=1'

http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
r = http.request('GET', urlpage)
page = urllib.request.urlopen(urlpage).read()
soup = BeautifulSoup(page, 'html.parser')

In [49]:
page_num = 1

# Create lists to store the scraped information
summary = []
price = []

while page_num<=28: # Only ~1500 listings
    html = requests.get(urlpage.format(page_num)).text
    soup = BeautifulSoup(html, 'html.parser')

    for item in soup.select('div.s-item__info.clearfix'):
        if item.select_one("h3.s-item__title"):
            summary.append(item.select_one("h3.s-item__title").text)
        if item.select_one("span.s-item__price"):
            price.append(item.select_one("span.s-item__price").text)

    page_num=page_num+1

# Check the results
print(len(summary))
print(len(price))

1344
1344


In [50]:
po_eBay_ref_df = pd.DataFrame({'Summary': summary, 'Price': price})
print(po_eBay_ref_df.info())
po_eBay_ref_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1344 entries, 0 to 1343
Data columns (total 2 columns):
Summary    1344 non-null object
Price      1344 non-null object
dtypes: object(2)
memory usage: 21.1+ KB
None


Unnamed: 0,Summary,Price
0,"Sold Sep 16, 2019Reformation Grilfriend Colle...",$39.99
1,"Sold Sep 16, 2019REFORMATION JEANS Size XS Ri...",$14.80
2,"Sold Sep 16, 2019Reformation Bea Skirt (4)",$103.28
3,"Sold Sep 15, 2019Reformation York White Linen...",$43.61
4,"Sold Sep 15, 2019Reformation Navy Rena Dress ...",$40.00
