To open this notebook in Google Colab and start coding, click on the Colab icon below.

<table style="border:2px solid orange" align="left">
    <td style="border:2px solid orange">
        <a target="_blank" href="https://colab.research.google.com/github/neuefische/ds-meetups/blob/main/02_Web_Scraping_With_Beautiful_Soup/2_web scraping_bs4.ipynb">
        <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
    </td>
</table>

# Web Scraping

Web scraping is the process of extracting and storing data from websites for analytical or other purposes. To do this, it helps to know the basics of HTML and CSS, because you have to identify the elements of a webpage you want to scrape. If you want to refresh your knowledge about these elements, check out the [HTML basics notebook](./01_HTML_Basics.ipynb).

In this notebook we will go through all the important steps of web scraping with Python and BeautifulSoup.

### Learning objectives for this Notebook

At the end of this notebook you should:
- be able to look at the structure of a real website
- be able to figure out what information is relevant to you and how to find it (Locating Elements)
- know how to download the HTML content with requests and parse it with BeautifulSoup
- know how to loop over an entire website structure and extract information
- know how to save the data afterwards


The easiest way to locate an element is to open the dev tools of your browser and inspect the element that you need. A handy shortcut in Chrome is Ctrl + Shift + C (Cmd + Shift + C on macOS): it lets you pick the element with your mouse instead of having to right click and choose "Inspect" each time (the same works in Firefox).

## Locating Elements

For locating an element on a website you can use:

- Tag name
- Class name
- IDs
- XPath
- CSS selectors

![alt text](./images/html_elements.png)

XPath is a technology that uses path expressions to select nodes or node-sets in an XML document (or in our case an HTML document). [Read here for more information](https://www.scrapingbee.com/blog/practical-xpath-for-web-scraping/)
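To get a feeling for path expressions, here is a minimal sketch that uses Python's built-in ```xml.etree.ElementTree``` module. It only supports a subset of XPath and needs well-formed markup, so it is not a replacement for a real HTML parser. The snippet and its class names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed snippet; the class names are made up for illustration
snippet = """
<ul class="product-list">
  <li><span class="name">Guppy</span><span class="price">2,99</span></li>
  <li><span class="name">Neon Tetra</span><span class="price">1,49</span></li>
</ul>
""".strip()

root = ET.fromstring(snippet)

# './/span[@class="price"]' selects every <span> with class="price" at any depth
price_nodes = root.findall('.//span[@class="price"]')
prices = [node.text for node in price_nodes]

# './li[2]/...' selects inside the second <li> (XPath positions start at 1)
second_name = root.find('./li[2]/span[@class="name"]').text

print(prices)
print(second_name)
```

The same idea (describe the path to an element instead of searching the whole text) is what we will use with tag names and classes in BeautifulSoup.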

## Is Web Scraping Legal?

Unfortunately, there’s not a cut-and-dried answer here. Some websites explicitly allow web scraping. Others explicitly forbid it. Many websites don’t offer any clear guidance one way or the other.

Before scraping any website, we should look for a terms and conditions page to see if there are explicit rules about scraping. If there are, we should follow them. If there are not, then it becomes more of a judgement call.

Remember, though, that web scraping consumes server resources for the host website. If we’re just scraping one page once, that isn’t going to cause a problem. But if our code is scraping 1,000 pages once every ten minutes, that could quickly get expensive for the website owner.

Thus, in addition to following any and all explicit rules about web scraping posted on the site, it’s also a good idea to follow these best practices:

### Web Scraping Best Practices:

- Never scrape more frequently than you need to.
- Consider caching the content you scrape so that it’s only downloaded once.
- Build pauses into your code using functions like ```time.sleep()``` to keep from overwhelming servers with too many requests too quickly.
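The last two points can be combined in a few lines. The following is only a sketch: ```make_polite_fetcher``` and ```fake_fetch``` are hypothetical helper names, and the download itself is replaced by a stand-in function so that the example runs without touching any server.

```python
import time

def make_polite_fetcher(fetch, pause_seconds=3.0):
    """Wrap `fetch` (a function url -> content) with a cache and a pause."""
    cache = {}

    def polite_fetch(url):
        # Serve repeated URLs from the cache without contacting the server
        if url in cache:
            return cache[url]
        # Pause before every real request so the server is not flooded
        time.sleep(pause_seconds)
        cache[url] = fetch(url)
        return cache[url]

    return polite_fetch

# Stand-in for a real download such as `lambda url: requests.get(url).content`
calls = []

def fake_fetch(url):
    calls.append(url)
    return f"<html>content of {url}</html>"

get_page = make_polite_fetcher(fake_fetch, pause_seconds=0.01)
first = get_page("https://example.com/page")
second = get_page("https://example.com/page")  # answered from the cache

print(len(calls))
```

The cache here only lives in memory; for a longer project you would probably write the downloaded pages to disk instead.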

# The Problem we want to solve

![](images/sad_larissa.png)

Larissa's sister broke her aquarium. We decided to get her a new one, because Christmas is near and we want to cheer Larissa up! And because we know how to code and can't decide which fish we want to get, we will solve this problem with web scraping!

## BeautifulSoup

The library we will use today to find fish we can give Larissa for Christmas is [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). It is a library for extracting data from HTML and XML files.

The first thing we’ll need to do to scrape a web page is to download the page. We can download pages using the Python ```requests``` library.

The ```requests``` library will make a GET request to a web server, which will download the HTML contents of a given web page for us. There are several different types of requests we can make using ```requests```, of which GET is just one.

In [None]:
import time
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

In [None]:
# get the content of the website
page = requests.get("https://www.interaquaristik.de/tiere/zierfische")
html = page.content

We can use the BeautifulSoup library to parse this document and extract the information from it.

We have already imported the library above, so we only have to create an instance of the BeautifulSoup class to parse our document:

In [None]:
# parse the html and save it into a BeautifulSoup instance
bs = BeautifulSoup(html, 'html.parser')

We can now print out the HTML content of the page, formatted nicely, using the prettify method on the BeautifulSoup object.

In [None]:
print(bs.prettify())

This step isn't strictly necessary, and we won't always bother with it, but looking at the prettified HTML makes the structure of the page clearer and the nested tags easier to read.

As all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the ```children``` property of ``bs``.

Note that ```children``` returns an iterator, so we need to call the list function on it:

In [None]:
list(bs.children)

And then we can have a closer look at the children, for example the ```head```:

In [None]:
bs.find('head')

Here you can try out different tags like ```body```, headers like ```h1``` or ```title```:

In [None]:
bs.find('insert your tag here')

But what if we have more than one element with the same tag? Then we can just use the ```.find_all()``` method of BeautifulSoup:

In [None]:
bs.find_all('article')

You can also search for more than one tag at once, for example if you want to look for all headers on the page:

In [None]:
titles = bs.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
print(titles)

Often we are not interested in the tags themselves, but in the content they contain. With the ```.get_text()``` method we can easily extract the text from between the tags. So let's check whether we are really scraping the right page to buy the fish:

In [None]:
bs.find('title').get_text()
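If you want to try these methods without requesting the shop again, you can play with a small HTML string. The page below is made up for this sketch and is not the real shop HTML; ```soup``` is used as the variable name so that it does not overwrite ```bs```.

```python
from bs4 import BeautifulSoup

# Made-up miniature page, not the real shop HTML
html_demo = """
<html>
  <head><title>Tiny Fish Shop</title></head>
  <body>
    <h1>Fish</h1>
    <article><h2>Guppy</h2></article>
    <article><h2>Neon Tetra</h2></article>
  </body>
</html>
"""

soup = BeautifulSoup(html_demo, "html.parser")

# find() returns the first match
page_title = soup.find("title").get_text()

# find_all() returns every match as a list-like ResultSet
articles = soup.find_all("article")

# A list of tag names matches any of them, in document order
headings = [tag.get_text() for tag in soup.find_all(["h1", "h2"])]

# A tag that does not exist gives None, so check before calling get_text()
missing = soup.find("table")

print(page_title, len(articles), headings, missing)
```

The ```None``` result for a missing tag is worth remembering: calling ```.get_text()``` on it raises an ```AttributeError```, which is a common surprise when a page looks slightly different than expected.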

### Searching for tags by class and id
We introduced ```classes``` and ```ids``` earlier, but it probably wasn’t clear why they were useful.

```Classes``` and ```ids``` are used by ```CSS``` to determine which ```HTML``` elements to apply certain styles to. For web scraping they are also pretty useful, as we can use them to specify the elements we want to scrape. In our case the ```ids``` are not that useful because there are only a few of them, but one example would be:

In [None]:
bs.find_all('div', id='page-body')

But it seems that the ```classes``` could be useful for finding the fish and their prices. Can you spot the necessary classes in the dev tools of your browser?

In [None]:
# class of the description of the fish
bs.find_all(class_="insert your class here for the name")

In [None]:
# class of the price of the fish
bs.find_all(class_="insert your class here for the price")
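Here is a small sketch of searching by ```id``` and ```class``` on a made-up snippet (the class names ```card```, ```name``` and ```cost``` are invented and are not the ones used by the shop). It also shows ```.select()```, which takes the same lookup written as a CSS selector.

```python
from bs4 import BeautifulSoup

# Made-up snippet; these class names are not the ones used by the shop
html_demo = """
<div id="page-body">
  <div class="card featured"><a class="name">Guppy</a><span class="cost">2,99</span></div>
  <div class="card"><a class="name">Neon Tetra</a><span class="cost">1,49</span></div>
</div>
"""
soup = BeautifulSoup(html_demo, "html.parser")

# An id should be unique on a page, so find() is enough
body = soup.find(id="page-body")

# class_ matches elements that carry the given class, even among several classes
cards = soup.find_all("div", class_="card")
names = [tag.get_text() for tag in soup.find_all(class_="name")]

# The same kind of lookup as a CSS selector: '.' means class, '#' means id
costs = [tag.get_text() for tag in soup.select("#page-body .card .cost")]
featured = [tag.get_text() for tag in soup.select("div.card.featured .name")]

print(len(cards), names, costs, featured)
```

Which style you use is mostly a matter of taste; CSS selectors are handy when you need to describe a nested path in one expression.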

## Extracting all the important information from the page
Now that we know how to extract each individual piece of information, we can save this information to a list. Let's start with the price:

In [None]:
# We will search for the price
prices = bs.find_all(class_= "price")

prices_lst = [price.get_text() for price in prices]
prices_lst

We seem to be on the right track, but as you can see the text still contains special characters, spaces and line breaks. So web scraping goes hand in hand with cleaning your data:

In [None]:
prices_lst = [price.strip() for price in prices_lst]
prices_lst[:5]

That looks a little bit better, but we want only the number so that we can work with the prices later. We have to remove the currency sign and convert the string to a float:

In [None]:
# We are removing the currency sign and the asterisk from the end of the string and keeping only the first part
prices_lst = [price.replace('\xa0€ *', '') for price in prices_lst]
prices_lst[:5]

In [None]:
# Now we have to replace the comma with a dot to convert the string to a float
prices_lst = [price.replace(',', '.') for price in prices_lst]
prices_lst[:5]

In [None]:
# So let's try to convert the strings into floats
prices_lst = [float(price) for price in prices_lst]

But when we try to convert the strings to floats we get an error message: there seem to be prices which start with ```ab```.
So let me introduce you to a very handy thing called ```Regular expressions```, or ```regex``` for short. A regex is a sequence of characters that specifies a search pattern. In Python you can use regex with the ```re``` library. So let's have a look at how many of the prices still contain any kind of letters.

In [None]:
# with the regex pattern we are looking for strings that contain any
# kind of letters (re.search scans the whole string, re.match only its start)
for price in prices_lst:
    if re.search("[A-Za-z]", price):
        print(price)

So there are some prices with an "ab" (German for "from") in front of them, so let's remove the letters:

In [None]:
# Now we remove the leading 'ab ' and convert the string to a float
prices_lst = [float(price.replace('ab ', '')) for price in prices_lst]
prices_lst[:5]
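The cleaning steps above can also be folded into a single regex. This is only a sketch under the assumption that the prices use the German number format (comma as decimal separator, optional dot as thousands separator); ```parse_price``` is a hypothetical helper and not part of any library.

```python
import re

# Digits with optional thousands dots, followed by an optional ",<decimals>" part
PRICE_PATTERN = re.compile(r"(\d{1,3}(?:\.\d{3})+|\d+)(?:,(\d+))?")

def parse_price(raw):
    """Return the first number found in `raw` as a float, or None if there is none."""
    match = PRICE_PATTERN.search(raw)
    if match is None:
        return None
    whole = match.group(1).replace(".", "")   # drop thousands separators
    fraction = match.group(2) or "0"          # no decimals means ".0"
    return float(f"{whole}.{fraction}")

examples = ["ab 12,99\xa0€ *", "3,49\xa0€", "1.299,00\xa0€ *", "Preis auf Anfrage"]
parsed = [parse_price(text) for text in examples]
print(parsed)
```

Returning ```None``` for text without a number keeps such rows visible, so we can decide later whether to drop them or look at them more closely.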

Now it worked! So let's do the same with the descriptions of the fish:

In [None]:
# Find all the descriptions of the fish and save them in a variable
descriptions = bs.find_all(class_='thumb-title small')

# Get only the text of the descriptions
descriptions_lst = [description.get_text() for description in descriptions]
descriptions_lst

In [None]:
# Clean the text by removing spaces and paragraphs
descriptions_lst = [description.strip() for description in descriptions_lst]
descriptions_lst[:5]

Let's see if we can get the links to the images of the fish, so that we can later look up what the fish look like. In most cases we can use the ```img``` tag for that:

In [None]:
# find all images of the fish
image_lst = bs.find('ul', {'class': 'product-list row grid'})
images = image_lst.find_all('img')
images

There are only two results for the ```img``` tag, so let's have a look at what the tag of the other images is.

They have the tag ```picture```, so let's extract those:

In [None]:
# Extract all the pictures of the fish by first using the tag ul and then the tag picture
picture_lst = bs.find('ul', {'class': 'product-list row grid'})
pictures = picture_lst.find_all('picture')
pictures[:5]

That looks more like all pictures! 
However, it seems some of the fish have specials like 'Sonderangebot' (special offer) or 'Neuheit' (new item). Wouldn't it be nice to have this information as well? Here it gets a little bit tricky, because 'Sonderangebot' and 'Neuheit' do not have the same ```classes``` in the ```span```, but if we go one tag higher we can get all of them:

In [None]:
# Extracting all the special offers by using the div tag and the class 'special-tags p-2'
specials = bs.find_all('div', {'class' : 'special-tags p-2'})
specials

If we want only the text from the ```span``` we now can iterate over the specials list and extract the text:

In [None]:
# to get only the text from the specials we are iterating over all specials
for special in specials:
    # and then get the text of the first span of each special object
    special_text = special.find("span").get_text().strip()
    print(special_text)

Nice, that will help us decide which fish to buy!

But so far we have only scraped the first page, and there are more fish on the next pages: 29 pages of fish in total. So how can we automate this? <br>
This is the link of the first page: https://www.interaquaristik.de/tiere/zierfische <br>
The link of the second page looks like this: https://www.interaquaristik.de/tiere/zierfische?page=2 <br>
The third: https://www.interaquaristik.de/tiere/zierfische?page=3 <br>

So the only thing that changes is the ending... Let's use this! But don't forget each request is causing traffic for the server, so we will set a sleep timer between requests!

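```
link = 'https://www.interaquaristik.de/tiere/zierfische'
# the shop has 29 pages, so we count from 1 to 29
for page_number in range(1, 30):
    # pause between the requests to reduce the traffic for the server
    time.sleep(3)
    # the first page has no page parameter, all other pages end with ?page=<number>
    url = link if page_number == 1 else link + f'?page={page_number}'
    print(url)
    page = requests.get(url)
    html = page.content
```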

This will be our starting point!
We will save our results in a pandas data frame so that we can work with the data later. Therefore we will create an empty data frame and append our data to it.

In [None]:
# Creating an empty Dataframe for later use
df = pd.DataFrame()

But first let's create some functions for the scraping part:
1. for the description
2. for the price
3. for the images
4. for specials

In [None]:
# Creating a function to get all the descriptions
def get_description(lst_name):
    ''' 
    Get all the descriptions of the fish by class_ = 'thumb-title small'
    and append them to the input list.
    Input: list
    Output: list
    '''
    # find all the descriptions and save them to a list
    fish = bs.find_all(class_='thumb-title small')
    # iterate over the list fish to get the text and strip the strings
    for name in fish:
        lst_name.append(name.get_text().strip())
    return lst_name

In [None]:
# Creating a function to get all the prices
def get_price(lst_name):
    ''' 
    Get all the prices of the fish by class_ = 'prices',
    convert them to floats and append them to the input list.
    Input: list
    Output: list
    '''
    # find all the prices and save them to a list
    prices = bs.find_all(class_='prices')
    # iterate over the prices
    for price in prices:
        # try to clean the strings from spaces, letters and paragraphs and convert it into a float
        try:
            price = float(price.get_text()\
                .strip()\
                .replace('\xa0€ *','')\
                .replace(',','.')\
                .replace('ab ', '')
            )
        except ValueError:
            # in some cases there is no * in the string like here: '\xa0€ *' with the except we try to intercept this
            price = price.get_text()\
                .split('\n')[0]\
                .replace('\xa0€','')\
                .replace(',','.')\
                .replace('ab ', '')\
                .strip()
            # an empty string means there is no price, so we store 0.0
            if price == '':
                price = 0.0
            else:
                price = float(price)
        # append the price to the input list
        lst_name.append(price)
    return lst_name

In [None]:
# Creating a function to get all the images
def get_image(lst_name_1, lst_name_2):
    ''' 
    Get all the images of the fish by tag = 'ul' and class_ = 'product-list row grid'
    and save the link of the image to lst_name_1 and the name of the fish to lst_name_2.
    Input: list_1, list_2
    Output: list_1, list_2
    '''
    # find the product list once and collect all images and pictures in it
    listings = bs.find('ul', {'class': 'product-list row grid'})
    images = listings.find_all('img')
    pictures = listings.find_all('picture')
    # iterate over the images and save the link to the image in one list and the name of the fish in another one
    for image in images:
        lst_name_1.append(image['src'])
        lst_name_2.append(image['alt'].strip())
    # iterate over the pictures and save the link to the image in one list and the name of the fish in another one
    for picture in pictures:
        lst_name_1.append(picture['data-iesrc'])
        lst_name_2.append(picture['data-alt'].strip())
    return lst_name_1, lst_name_2

In [None]:
def get_special(lst_name_1, lst_name_2):
    ''' 
    Get all the specials of the fish by tag = 'div' and class_ = 'thumb-inner'
    and save the special to lst_name_1 and the index of the article to lst_name_2.
    Input: list_1, list_2
    Output: list_1, list_2
    '''
    # use the article as tag to get the index of all articles
    article_lst = bs.find_all('div', {'class' : 'thumb-inner'})
    # iterate over all articles with enumerate to get the single articles and the index
    for idx, article in enumerate(article_lst):
        # get the specials of the article
        spans = article.find('div', {'class' : 'special-tags p-2'})
        # and if there is a special save the special and the index each to a list
        if spans is not None:
            special = spans.find("span").get_text().strip()
            lst_name_1.append(special)
            lst_name_2.append(idx)
    return lst_name_1, lst_name_2
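The four functions above collect every piece of information in its own list, so the pieces have to be matched again afterwards by name or index. An alternative, shown here only as a sketch, is to walk over one product card at a time and build one record per fish. The markup below is made up and merely mimics the class names used in this notebook; ```text_or_none``` and ```parse_cards``` are hypothetical helpers.

```python
from bs4 import BeautifulSoup

# Made-up markup that only mimics the class names used in this notebook
html_demo = """
<ul class="product-list row grid">
  <li><div class="thumb-inner">
    <div class="special-tags p-2"><span class="badge">Neuheit</span></div>
    <a class="thumb-title small"> Guppy </a>
    <div class="price"> 2,99\xa0€ * </div>
  </div></li>
  <li><div class="thumb-inner">
    <a class="thumb-title small"> Neon Tetra </a>
    <div class="price"> ab 1,49\xa0€ * </div>
  </div></li>
</ul>
"""

def text_or_none(parent, selector):
    """Return the stripped text of the first match inside `parent`, else None."""
    tag = parent.select_one(selector)
    return tag.get_text().strip() if tag is not None else None

def parse_cards(soup):
    """Build one dictionary per product card."""
    records = []
    for card in soup.select("div.thumb-inner"):
        records.append({
            "fish_names": text_or_none(card, ".thumb-title"),
            "price_text": text_or_none(card, ".price"),
            "special": text_or_none(card, ".special-tags span"),
        })
    return records

records = parse_cards(BeautifulSoup(html_demo, "html.parser"))
print(records)
```

A list of dictionaries like this can be passed directly to ```pd.DataFrame(records)```, and a fish without a special simply gets ```None``` in that column.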

Now we will combine it all so that we can scrape all pages:

**NOTE:** We have commented out the code, because we don't want to overwhelm the server with the requests of participants in the meetup. Feel free to run the code after the meetup.

In [None]:
#link = 'https://www.interaquaristik.de/tiere/zierfische'
#
## for loop over the page numbers 1 to 29
#for page_number in range(1, 30):
#    # sleep timer to reduce the traffic for the server
#    time.sleep(3)
#    # create the lists for the functions
#    fish_names = []
#    fish_prices = []
#    picture_lst = []
#    picture_name = []
#    index_lst = []
#    special_lst = []
#    # the first page is the main page, the next pages get their number appended with an f-string
#    url = link if page_number == 1 else link + f'?page={page_number}'
#    # get the content
#    page = requests.get(url)
#    html = page.content
#    bs = BeautifulSoup(html, 'html.parser')
#    # call the functions to get the information
#    get_description(fish_names)
#    get_price(fish_prices)
#    get_image(picture_lst, picture_name)
#    get_special(special_lst, index_lst)
#    # create a pandas dataframe for the names and prices
#    fish_dict = {
#        'fish_names': fish_names,
#        'fish_prices in EUR': fish_prices
#    }
#    df_fish_info = pd.DataFrame(data=fish_dict)
#    # create a pandas dataframe for the pictures
#    picture_dict = {
#        'fish_names': picture_name,
#        'pictures': picture_lst
#    }
#    df_picture = pd.DataFrame(data=picture_dict)
#
#    # merge those two dataframes on the fish names
#    df_ = pd.merge(df_fish_info, df_picture, on='fish_names', how='outer')
#
#    # create a pandas dataframe for the specials
#    specials_dict = {
#        'special': special_lst
#    }
#    real_index = pd.Series(index_lst)
#    df_specials = pd.DataFrame(data=specials_dict)
#    df_specials.set_index(real_index, inplace=True)
#
#    # merge the dataframes on the index
#    df_ = pd.merge(df_, df_specials, left_index=True, right_index=True, how='outer')
#    # add the temporary dataframe to the dataframe we created earlier outside the for loop
#    # (DataFrame.append was removed in pandas 2.0, so we use pd.concat)
#    df = pd.concat([df, df_])

In [None]:
#checking if everything worked
df.head()

The web scraping part is over and the following part is only about looking at the data.
We will save the dataframe to a csv file so that we don't have to scrape the info again!

### Checking for duplicates

Duplicates are something that can happen quickly while scraping, so let's count how often each fish name occurs:

In [None]:
df.pivot_table(columns=['fish_names'], aggfunc='size')

It seems like we have some duplicates. Let's drop them!

In [None]:
df.drop_duplicates(inplace=True)
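To see what ```drop_duplicates``` does, here is a small sketch on a made-up data frame (```df_demo``` and its values are invented, only the column names follow this notebook).

```python
import pandas as pd

# Made-up data; only the column names follow this notebook
df_demo = pd.DataFrame({
    "fish_names": ["Guppy", "Neon Tetra", "Guppy", "Guppy"],
    "fish_prices in EUR": [2.99, 1.49, 2.99, 3.49],
})

# duplicated() marks every row that repeats an earlier row in all columns
n_exact_duplicates = int(df_demo.duplicated().sum())

# drop_duplicates() keeps the first occurrence; reset_index tidies the row labels
df_unique = df_demo.drop_duplicates().reset_index(drop=True)

# subset= compares only the listed columns, here the name alone
df_unique_names = df_demo.drop_duplicates(subset=["fish_names"])

print(n_exact_duplicates, len(df_unique), len(df_unique_names))
```

Without ```subset``` only rows that are identical in every column are dropped, so the same fish with two different prices stays in the data.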

In [None]:
# save the dataframe to a csv file without index
df.to_csv('fish_data.csv', index=False)

Because we haven't run the code for scraping all pages, we uploaded the data we scraped earlier to GitHub and can now load it into pandas:

In [None]:
# reading the csv file from github
df = pd.read_csv('https://raw.githubusercontent.com/neuefische/ds-meetups/main/02_Web_Scraping_With_Beautiful_Soup/fish_data.csv')

In [None]:
#checking if everything worked
df.head()

We want fish for Larissa that she has never had before, that is why we are looking for new items (Neuheiten).

In [None]:
# Query over the dataframe and keeping only the fish with the special Neuheit
df_special_offer = df.query('special == "Neuheit"')

In [None]:
df_special_offer.head()

We have a budget of around 250 € and we want to buy at least 10 fish, so we will filter out fish that cost more than 25 €!

In [None]:
# Filtering only for the fish that are cheaper than 25 EUR
df_final = df_special_offer[df_special_offer['fish_prices in EUR'] <= 25]

In [None]:
df_final.head()

So let's write some code that chooses the fish for us:

In [None]:
# our budget
BUDGET = 250
# a list for the fish we will buy
shopping_bag = []
# a variable where we save the running total
price = 0
# we keep picking fish until our budget is reached (the last fish may push us slightly over it)
while price <= BUDGET:
    # samples the dataframe randomly
    df_temp = df_final.sample(1)
    # getting the name from the sample
    name = df_temp['fish_names'].iloc[0]
    # getting the price from the sample
    fish_price = df_temp['fish_prices in EUR'].iloc[0]
    # updating our price
    price += fish_price
    # adding the fish name and its price to the shopping bag
    shopping_bag.append((name, fish_price))

In [None]:
pd.set_option('display.max_colwidth', None)

print(f"We are at a price point of {round(price, 2)} Euro and these are the fish we chose:")
res = pd.DataFrame(shopping_bag, columns=["Name", "Price [€]"])
display(res)
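The loop above keeps adding fish until the total has passed the budget. If the budget is a hard limit, a possible variant is to pick only among the fish that still fit. The sketch below uses plain Python and made-up offers; ```fill_shopping_bag``` is a hypothetical helper, and it assumes that all prices are greater than zero.

```python
import random

def fill_shopping_bag(offers, budget, seed=None):
    """Pick random (name, price) offers until no offer fits the remaining budget."""
    rng = random.Random(seed)
    bag, total = [], 0.0
    while True:
        # Only offers with a positive price that still fit are candidates
        affordable = [
            offer for offer in offers
            if offer[1] > 0 and total + offer[1] <= budget
        ]
        if not affordable:
            break
        name, price = rng.choice(affordable)
        bag.append((name, price))
        total += price
    return bag, round(total, 2)

# Made-up offers for illustration
offers = [("Guppy", 2.99), ("Neon Tetra", 1.49), ("Discus", 24.90)]
bag, total = fill_shopping_bag(offers, budget=250, seed=42)
print(len(bag), total)
```

To use it with our data, the offers could be built from the two columns of ```df_final```, for example with ```list(zip(df_final['fish_names'], df_final['fish_prices in EUR']))```.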

# Christmas can come!

![](images/happy_larissa.png)