# CSD 1: Beautiful Soup

1. Go to [ebay.com](https://www.ebay.com) and search for little boys' t-shirts. The eBay website, like any modern website, is filled with text, images and links. If you are using Google Chrome, right-click anywhere on the page and choose "View page source" to see the raw HTML behind it.

2. The job of the Python library [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is to help you parse this raw HTML and extract what you want. Run the following piece of code by clicking on it and then pressing the "play" icon in the menu above, or just Ctrl + Enter:

In [1]:
from bs4 import BeautifulSoup
import requests

url = "https://www.ebay.com/b/Boys-Short-Sleeve-Sleeve-Tops-T-Shirts-Sizes-4-Up/175521/bn_4278610?rt=nc&LH_ItemCondition=1000&LH_BIN=1&LH_PrefLoc=3&_pgn=1"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')  # name the parser explicitly to avoid a bs4 warning

The above code imports Beautiful Soup, imports the requests library for handling web connections, assigns an ebay search results page address to a variable called `url`, "requests" this URL, stores the response in a variable called `r`, makes a `BeautifulSoup` object out of the response's `content`, and assigns it to a variable called `soup`.
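A slightly more defensive variant of the cell above can be sketched as follows. The helper name `fetch_soup` and the 10-second timeout are assumptions made for illustration, not part of the exercise. The sketch names the parser explicitly and raises an error on a failed HTTP response rather than parsing an error page.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it (hypothetical helper, not part of the exercise)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx answers instead of parsing an error page
    response.raise_for_status()
    # Naming the parser avoids bs4's "no parser was explicitly specified" warning
    return BeautifulSoup(response.content, "html.parser")
```

With this helper, `soup = fetch_soup(url)` would stand in for the last two lines of the cell.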

3. Print the raw HTML:

In [2]:
print(soup.prettify())

<!DOCTYPE html>
<!--[if IE 9]><html class="ie9" lang="en"><![endif]-->
<!--[if gt IE 9]><!-->
<html lang="en">
 <!--<![endif]-->
 <head>
  <meta charset="utf-8"/>
  <meta content="Shop eBay for great deals on Boys' Short Sleeve Sleeve Tops &amp; T-Shirts (Sizes 4 &amp; Up). You'll find new or used products in Boys' Short Sleeve Sleeve Tops &amp; T-Shirts (Sizes 4 &amp; Up) on eBay. Free shipping on selected items." property="og:description">
   <link href="https://ir.ebaystatic.com" rel="preconnect"/>
   <title>
    Boys&amp;apos Short Sleeve Sleeve Tops &amp; T-Shirts (Sizes 4 &amp; Up) | eBay
   </title>
   <meta content="Shop eBay for great deals on Boys' Short Sleeve Sleeve Tops &amp; T-Shirts (Sizes 4 &amp; Up). You'll find new or used products in Boys' Short Sleeve Sleeve Tops &amp; T-Shirts (Sizes 4 &amp; Up) on eBay. Free shipping on selected items." name="description"/>
   <meta content="eBay" property="og:site_name"/>
   <meta content="unsafe-url" name="referrer"/>
   <meta c

4. Replace the `### YOUR CODE HERE ###` comment to print just the title of the page.

Hint 1: The [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

Hint 2: type `soup`, then `.`, then press TAB to get possible object members or methods from Jupyter

Hint 3: Keep Calm and [Stack Overflow](https://stackoverflow.com/a/51550)

In [None]:
print(### YOUR CODE HERE ###)

5. This is an HTML paragraph tag or element: `<p>This is a paragraph</p>`

This is a hyperlink tag: `<a href="https://www.google.com/">Google it!</a>`

This is how we `find_all` links in a page:

In [33]:
all_links = soup.find_all('a')
print('all_links is a: ' + str(type(all_links)))
print()
print('all_links length is: %d' % len(all_links))
print()
print('first 5 elements in all_links:')
print(all_links[:5])

all_links is a: <class 'bs4.element.ResultSet'>

all_links length is: 223

first 5 elements in all_links:
[<a class="gh-acc-a" href="#mainContent" id="gh-hdn-stm">Skip to main content</a>, <a _sp="m570.l2586" href="https://www.ebay.com/" id="gh-la">eBay<img alt="" height="200" id="gh-logo" role="presentation" src="https://ir.ebaystatic.com/rs/v/fxxj3ttftm5ltcqnto1o4baovyl.png" style="clip:rect(47px, 118px, 95px, 0px); position:absolute; top:-47px;left:0" width="250"/></a>, <a _sp="m570.l2614" href="https://www.ebay.com/sch/ebayadvsearch" id="gh-as-a" title="Advanced Search">Advanced</a>, <a href="https://signin.ebay.com/ws/eBayISAPI.dll?SignIn&amp;_trksid=m570.l3348">Sign in</a>, <a _sp="m570.l3188" class="gh-p" href="https://www.ebay.com/globaldeals"> Daily Deals</a>]
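To see what `find_all` returns without depending on the live eBay page, here is a minimal sketch on a small hand-written HTML string. The markup and the `demo_` names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written HTML standing in for a real page (an assumption for this sketch)
demo_html = """
<p>This is a paragraph</p>
<a href="https://www.google.com/">Google it!</a>
<a href="https://www.ebay.com/">eBay</a>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

demo_links = demo_soup.find_all("a")
# A ResultSet behaves like a list: it has a length and can be indexed and iterated
print(len(demo_links))
for link in demo_links:
    print(link.get_text())
```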


6. Find all image elements in our eBay page and put them in a variable called `images`. You might want to find out what the [HTML tag for an image](https://www.w3schools.com/tags/tag_img.asp) is first.

In [13]:
images = ### YOUR CODE HERE ###

7. Notice that the hyperlink tag from (5) has an attribute called `href` holding the link's address. To access this attribute on a BeautifulSoup element, you can treat the element as a dictionary and the attribute name as its key:

In [20]:
print(all_links[100])
print()
print(type(all_links[100]))
print()
print(all_links[100]['href'])

<a _sp="p2489527.m4335.l8656" class="s-item__link" data-track='{"eventFamily":"LST","eventAction":"ACTN","actionKind":"NAVSRC","operationId":"2489527","flushImmediately":false,"eventProperty":{"moduledtl":"mi:4335|iid:1|li:8656|luid:34|scen:Listings","parentrq":"f6e800341680ada3df2db78cffe26d87","pageci":"2a56ab15-a8c9-413d-825e-8554d935631d"}}' href="https://www.ebay.com/itm/Nike-Little-Boy-XS-X-Small-6-7-Black-Short-Sleeve-T-Shirt-Top-Tee-NWT/202598557784?hash=item2f2bd0a858:g:x6YAAOSwDcJcZ5Ed"><h3 class="s-item__title" role="text"><span class="LIGHT_HIGHLIGHT">New Listing</span>Nike Little Boy XS X Small 6-7 Black Short Sleeve T Shirt Top Tee NWT</h3></a>

<class 'bs4.element.Tag'>

https://www.ebay.com/itm/Nike-Little-Boy-XS-X-Small-6-7-Black-Short-Sleeve-T-Shirt-Top-Tee-NWT/202598557784?hash=item2f2bd0a858:g:x6YAAOSwDcJcZ5Ed


8. To get the actual dictionary of an element use the `attrs` member:

In [23]:
print(type(all_links[100].attrs))
print()
print(all_links[100].attrs.keys())

<class 'dict'>

dict_keys(['class', 'href', 'data-track', '_sp'])
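The dictionary-style access from steps 7 and 8 can be tried out on a single hand-written element. This is only a sketch: the markup below is made up, not taken from eBay. It also shows two details worth knowing: square brackets fail on a missing attribute while `.get()` does not, and `class` comes back as a list because an element can carry several classes.

```python
from bs4 import BeautifulSoup

# Hand-written link element (an assumption for this sketch, not taken from eBay)
demo_soup = BeautifulSoup(
    '<a class="s-item__link big" href="https://www.ebay.com/">eBay</a>',
    "html.parser",
)
link = demo_soup.find("a")

print(link["href"])       # square brackets raise KeyError if the attribute is missing
print(link.get("title"))  # .get() returns None instead of raising
print(link.attrs)         # the underlying dict of attributes
print(link["class"])      # multi-valued attributes such as class come back as a list
```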


9. Get a `list` of all image titles from the `images` object, **except for the first one**. Print that list.

Hint: `alt`

In [None]:
image_titles = ### YOUR CODE HERE ###
print(image_titles)

10. What is the attribute for an image JPEG file address?

Some images have the attribute `src` and some `data-src`. This is one way to combine the two. Make sure you understand:

In [50]:
image_files_src = [img['src'] for img in images[1:]]
image_files_datasrc = [img.get('data-src', None) for img in images[1:]]
image_files = [src if datasrc is None else datasrc for src, datasrc in zip(image_files_src, image_files_datasrc)]
image_files[:5]

['https://i.ebayimg.com/thumbs/images/m/mbAW78EZTb7qwbMe51QkaBQ/s-l225.jpg',
 'https://i.ebayimg.com/thumbs/images/m/mymwuucjBVuu2IjMcivCm7g/s-l225.jpg',
 'https://i.ebayimg.com/thumbs/images/m/m0mi9HnUFy8AZsJW8Q3dwAA/s-l225.jpg',
 'https://i.ebayimg.com/thumbs/images/m/mVWR712d-Hg2GBjKxZpCEEw/s-l225.jpg',
 'https://i.ebayimg.com/thumbs/images/m/mrsjFAzx_5DkmDUqJ9mPyRQ/s-l225.jpg']
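The same preference (use `data-src` when it exists, otherwise fall back to `src`) can also be written as a single comprehension. The sketch below runs on hand-written markup with made-up `example.com` addresses. Because it uses `.get()` for both attributes, an image with neither attribute yields `None`, whereas `img['src']` would raise a `KeyError`.

```python
from bs4 import BeautifulSoup

# Hand-written images (made-up addresses, for illustration only)
demo_html = """
<img src="https://example.com/a.jpg" alt="shirt A">
<img src="https://example.com/placeholder.gif" data-src="https://example.com/b.jpg" alt="shirt B">
<img alt="no address at all">
"""
demo_images = BeautifulSoup(demo_html, "html.parser").find_all("img")

# Prefer data-src when present, otherwise use src (None if both are missing)
demo_files = [
    img.get("data-src") if img.get("data-src") is not None else img.get("src")
    for img in demo_images
]
print(demo_files)
```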

11. Let's find a shirt's price. If you scan the HTML you'll see you need the `span` element, and in it the class `s-item__price`. This is how we do it in BeautifulSoup:

In [49]:
price_elements = soup.find_all('span', class_ = 's-item__price')
print(price_elements[:5])

[<span class="s-item__price">ILS 78.83<span class="DEFAULT"> to </span>ILS 144.67</span>, <span class="s-item__price">ILS 72.32</span>, <span class="s-item__price">ILS 54.19</span>, <span class="s-item__price">ILS 54.27</span>, <span class="s-item__price">ILS 46.96</span>]
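The keyword is spelled `class_`, with a trailing underscore, because `class` is a reserved word in Python. The same filter can be expressed as a CSS selector with `select`. Here is a minimal sketch on hand-written markup; the `s-item__shipping` span is made up to show that other spans are skipped.

```python
from bs4 import BeautifulSoup

# Hand-written spans (an assumption for this sketch)
demo_html = """
<span class="s-item__price">ILS 72.32</span>
<span class="s-item__shipping">Free shipping</span>
<span class="s-item__price">ILS 54.19</span>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Two ways of asking for <span> elements that carry the class s-item__price
by_keyword = demo_soup.find_all("span", class_="s-item__price")
by_selector = demo_soup.select("span.s-item__price")

print([el.get_text() for el in by_keyword])
print([el.get_text() for el in by_selector])
```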


From each of these `price_elements` we extract the actual price text with the `get_text` method:

In [42]:
print(price_elements[0].get_text())

ILS 78.83 to ILS 144.67


You can see that some prices come as a range. To get the minimum price, for example, we could split this string into its elements:

In [44]:
print(price_elements[0].get_text().split(' '))

['ILS', '78.83', 'to', 'ILS', '144.67']


Get the second element:

In [47]:
print(price_elements[0].get_text().split(' ')[1])

78.83


And convert it to a float:

In [48]:
print(float(price_elements[0].get_text().split(' ')[1]))

78.83
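Splitting on spaces relies on the price always being the second token. An alternative sketch pulls every number out of the text with a regular expression and takes the smallest. It assumes prices look like the examples above: a currency code followed by a decimal number, possibly with thousands separators. The function name `min_price_from_text` is made up for illustration.

```python
import re


def min_price_from_text(text):
    """Return the smallest number found in a price string, or None (hypothetical helper)."""
    # Match digits with optional thousands separators and an optional decimal part
    numbers = re.findall(r"\d[\d,]*(?:\.\d+)?", text)
    if not numbers:
        return None
    return min(float(n.replace(",", "")) for n in numbers)


print(min_price_from_text("ILS 78.83 to ILS 144.67"))
print(min_price_from_text("ILS 1,250.00"))
print(min_price_from_text("See price in cart"))
```

The function takes plain text, so it would be applied to the result of `get_text()` rather than to the element itself.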


12. Your task is to complete the `parse_price` function so that in the end the `prices` variable holds a list of all the shirt prices:

In [51]:
def parse_price(price_element):
    try:
        price = ### YOUR CODE HERE ###
    except Exception:
        price = None
    return price

prices = [parse_price(price_e) for price_e in price_elements]

13. It's time to actually download the shirt images! The following function accepts an image file address, a shirt title and a file name for the image, and attempts to download the image to the current directory under the specified file name:

In [52]:
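def download_image(url, title, file_name):
    # Try to fetch the image; on a network or URL error, signal failure with empty strings
    try:
        response = requests.get(url)
    except requests.RequestException:
        return '', ''
    # Write the raw bytes of the response to disk
    with open(file_name, "wb") as file:
        file.write(response.content)
    return title, file_name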

Download the first image from our page, name it 'test.jpg'. Make sure it was downloaded correctly and see what the function returns:

In [None]:
download_image(### YOUR CODE HERE ###)

14. We will now download all of the page's images, using a loop. 

First, create a folder named 'boys' in the current directory. You can do it right here in this notebook!

In [None]:
!mkdir boys

While downloading, fill in the blanks to correctly create a dictionary called `images_data` which will hold the title of the image, its file name, and the shirt's price:

In [None]:
from ipywidgets import IntProgress
from IPython.display import display

images_data = {'title': {},
               ### YOUR CODE HERE ###: {},
               'price': {}}

f = IntProgress(min = 0, max = len(images[1:])) # instantiate a progress bar
display(f) # display the bar

for i in range(### YOUR CODE HERE ###):
    ### YOUR CODE HERE ### = download_image(image_files[i], image_titles[i], './boys/' + str(i) + '.jpg')
    images_data['title'][### YOUR CODE HERE ###] = title
    images_data['file_name'][### YOUR CODE HERE ###] = file_name
    images_data[### YOUR CODE HERE ###][### YOUR CODE HERE ###] = prices[i]
    f.value += 1

15. One thing that would prove useful later on is having a dataset which summarizes all we have gathered. That's what `images_data` is for. We're going to use `pandas` to make it a `DataFrame` we can easily read and write:

In [60]:
import pandas as pd
images_data_df = pd.DataFrame(images_data)
images_data_df

| | title | file_name | price |
|---|---|---|---|
| 0 | Polo Ralph Lauren Boys Polo Shirt Classic Mesh... | ./boys/0.jpg | 78.83 |
| 1 | Tommy Hilfiger Kids T-Shirt Big Boys Solid Cre... | ./boys/1.jpg | 72.32 |
| 2 | Los Angeles Dodgers Primary Logo Kids Shirt | ./boys/2.jpg | 54.19 |
| 3 | Brand New NWT Nike Boys Youth Kids Graphic T S... | ./boys/3.jpg | 54.27 |
| 4 | Kansas City Royals Primary Logo Kids Shirt | ./boys/4.jpg | 46.96 |
| 5 | NINJAGO COLE LLOYD JAY & KAI Blue Tee T-Shirt ... | ./boys/5.jpg | 50.61 |
| 6 | Kids Camo T-Shirt Short Sleeve Military Tee Ar... | ./boys/6.jpg | 36.14 |
| 7 | NIKE Boys T shirt DIFF COLORS Sizes Athletic 1... | ./boys/7.jpg | 38.89 |
| 8 | TOMMY HILFIGER Boys Polo Shirt Diff Colors Siz... | ./boys/8.jpg | 53.0 |
| 9 | Skin Industries Children Boy's T-Shirt " Adam ... | ./boys/9.jpg | 32.52 |
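The `images_data` layout, a dict of columns in which each column is a dict keyed by row number, maps directly to a `DataFrame`. Here is a minimal sketch with made-up rows, including a round trip through CSV text. The `demo_data` values and the in-memory buffer are assumptions for illustration.

```python
import io

import pandas as pd

# Made-up rows in the same shape as images_data: {column: {row_index: value}}
demo_data = {
    "title": {0: "Shirt A", 1: "Shirt B"},
    "file_name": {0: "./boys/0.jpg", 1: "./boys/1.jpg"},
    "price": {0: 78.83, 1: None},
}
demo_df = pd.DataFrame(demo_data)
print(demo_df)

# Write to CSV text and read it back; index_col=0 restores the row index
buffer = io.StringIO()
demo_df.to_csv(buffer)
buffer.seek(0)
restored_df = pd.read_csv(buffer, index_col=0)
print(restored_df)
```

Without `index_col=0`, the saved index comes back as an ordinary column named `Unnamed: 0`.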


16. This was fun: we got 48 images. But we want roughly 200 times as many, plus the same number of shirt images for girls. The following code was run to get all the boys' shirt images. You can run it to see that it works, or just skim it to make sure you understand how the different elements are combined:

In [None]:
boys_url = 'https://www.ebay.com/b/Boys-Short-Sleeve-Sleeve-Tops-T-Shirts-Sizes-4-Up/175521/bn_4278610?rt=nc&LH_ItemCondition=1000&LH_BIN=1&LH_PrefLoc=3&_pgn='
max_pages = 400
boys_items_data = {'title': {}, 'file_id': {}, 'price': {}}
f = IntProgress(min = 0, max = max_pages)
display(f)
all_items_counter = 0

for page_num in range(max_pages):
    url = boys_url + str(page_num)
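    try:
        r = requests.get(url)
    except requests.RequestException:
        # page_num is an int, so format it rather than concatenating it to a string
        print('Stopped at page: %d' % page_num)
        break
    soup = BeautifulSoup(r.content, 'html.parser')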
    images = soup.find_all('img')[1:]
    image_titles = [img['alt'] for img in images]
    image_files_src = [img['src'] for img in images]
    image_files_datasrc = [img.get('data-src', None) for img in images]
    image_files = [src if datasrc is None else datasrc for src, datasrc in zip(image_files_src, image_files_datasrc)]
    
    price_elements = soup.find_all('span', class_ = 's-item__price')
    prices = [parse_price(price_e) for price_e in price_elements]
    # Without one price per image we cannot match them up, so drop this page's prices
    if len(prices) != len(images):
        print('Found unequal number of prices in page_num %d' % page_num)
        prices = [None] * len(images)
        
    for i in range(len(images)):
        title, file_name = download_image(image_files[i], image_titles[i], './boys/' + str(all_items_counter + i) + '.jpg')
        boys_items_data['title'][all_items_counter + i] = title
        boys_items_data['file_id'][all_items_counter + i] = all_items_counter + i
        boys_items_data['price'][all_items_counter + i] = prices[i]
    all_items_counter += len(images)
    f.value += 1

17. This is how you'll get all the boys' and girls' images more quickly, using images that were already downloaded for you. You should only need to do this once.

First download the compressed file from a remote server:

In [74]:
url = "http://www.tau.ac.il/~saharon/DScourse/ebay_boys_girls_shirts.tar.gz"
r = requests.get(url)

with open("ebay_boys_girls_shirts.tar", "wb") as file:
    file.write(r.content)

Next decompress the file in the datasets folder:

In [76]:
import tarfile

with tarfile.open("ebay_boys_girls_shirts.tar") as tar:
    tar.extractall('.')
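`tarfile.open` in its default read mode detects compression by itself. That is why the file above can be opened even though it was saved under a `.tar` name while, judging by the URL, it holds gzip-compressed data. Here is a self-contained sketch on a tiny archive built in a temporary folder. All file names in it are made up.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a tiny gzip-compressed archive, but save it under a plain .tar name
    source_path = os.path.join(tmp_dir, "hello.txt")
    with open(source_path, "w") as f:
        f.write("hello")
    archive_path = os.path.join(tmp_dir, "demo.tar")
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_path, arcname="demo/hello.txt")

    # The default mode "r" works out the compression on its own
    with tarfile.open(archive_path) as tar:
        member_names = tar.getnames()
        tar.extractall(os.path.join(tmp_dir, "out"))

    # Read the extracted file back to confirm the round trip
    extracted_path = os.path.join(tmp_dir, "out", "demo", "hello.txt")
    with open(extracted_path) as f:
        extracted_text = f.read()

print(member_names)
print(extracted_text)
```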

18. You now have all ~33K boys' and girls' shirt images in your datasets folder. Check that you can read the four CSVs holding the metadata for the train and test sets of images:

In [78]:
folder = 'ebay_boys_girls_shirts/'
boys_train_df = pd.read_csv(folder + 'boys_train.csv')
girls_train_df = pd.read_csv(folder + 'girls_train.csv')
boys_test_df = pd.read_csv(folder + 'boys_test.csv')
girls_test_df = pd.read_csv(folder + 'girls_test.csv')
print('N boys train images: %d' % boys_train_df.shape[0])
print('N girls train images: %d' % girls_train_df.shape[0])
print('N boys test images: %d' % boys_test_df.shape[0])
print('N girls test images: %d' % girls_test_df.shape[0])

N boys train images: 10000
N girls train images: 10000
N boys test images: 2500
N girls test images: 2500
