**Important notes:**

- Activate the correct conda environment before you start.

- Make sure you fill in any place that says `YOUR CODE HERE` or `YOUR ANSWER HERE`, as well as your name and (if necessary) collaborators below.

- The line `raise NotImplementedError()` prevents you from handing in assignments with empty cells. Simply delete this line when you start working on a cell that contains it.

- Before you turn this problem in (i.e., after you have completed all tasks), make sure everything runs as expected. Restart the kernel and run all cells:
  - in *Jupyter Notebook*: in the menu bar, select `Kernel` and click on `Restart & Run All`
  - in *Visual Studio Code*: select `Restart` and then `Run All`

Good luck!

In [None]:
NAME = ""
COLLABORATORS = ""

In [None]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

---

# Web Scraping in Python with Beautiful Soup, Requests and pandas

*This tutorial is mainly based on the tutorial [Build a Web Scraper with Python in 5 Minutes](https://www.kdnuggets.com/2022/02/build-web-scraper-python-5-minutes.html) by Natassha Selvaraj as well as the [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).*

In this tutorial, you will learn how to:

1. Scrape the web page [“Quotes to Scrape”](https://quotes.toscrape.com/) using [Requests](https://docs.python-requests.org/en/latest/). 


1. Pull data out of HTML using [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).


1. Use [SelectorGadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb) to inspect the CSS of the web page.


1. Store the scraped data in a [pandas](https://pandas.pydata.org/) dataframe.

## Prerequisites

To start this tutorial, you need: 

- A basic understanding of HTML, CSS and CSS selectors.
- Google's web browser [Chrome](https://support.google.com/chrome/answer/95346?hl=en&co=GENIE.Platform%3DDesktop) and the [Chrome extension SelectorGadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb).
- Familiarity with [Chrome DevTools](https://developer.chrome.com/docs/devtools/).

> To learn more about HTML, CSS, Chrome DevTools and SelectorGadget, follow the instructions in this [web scraping basics tutorial](https://kirenz.github.io/codelabs/codelabs/webscraping/#0).

## Setup

In [None]:
import pandas as pd

import requests
from bs4 import BeautifulSoup

## Scrape website with Requests

- First, we use `requests` to fetch the website with a GET request.

- `requests.get()` sends the request to the given URL and returns a response object (we call it `html`):

In [None]:
url = 'http://quotes.toscrape.com/'

Hint:

`___ = ___.get(___)`


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

- Check if the request was successful (with `.status_code`):

In [None]:
html.status_code

- A status code of 200 means that the request has succeeded.
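Every HTTP status code has a standard reason phrase. The following minimal sketch uses `http.HTTPStatus` from the Python standard library to look it up. The helper names `describe_status` and `is_success` are hypothetical and not part of this tutorial's solution:

```python
from http import HTTPStatus


def describe_status(code):
    """Return the standard reason phrase for an HTTP status code."""
    return HTTPStatus(code).phrase


def is_success(code):
    """Status codes from 200 to 299 signal a successful request."""
    return 200 <= code < 300


print(describe_status(200))  # OK
print(describe_status(404))  # Not Found
```

With a real response object, the same check could be written as `is_success(html.status_code)`.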

In [None]:
"""Run this cell to check that your code returns the correct output"""
assert html.status_code == 200
assert html.url == "http://quotes.toscrape.com/"

## Investigate HTML with Beautiful Soup

- We can use the response object to access attributes such as `content`, `text` and `headers`.

- In our example, we only need `text`.

- Therefore, we use `html.text`, which returns the body of the response as a string.

- Passing `html.text` to `BeautifulSoup` together with the parser name `'html.parser'` gives us a Beautiful Soup object:

In [None]:
soup = BeautifulSoup(html.text, 'html.parser')

- `soup` represents the document as a nested data structure:

In [None]:
print(soup.prettify())

Next, we take a look at some ways to navigate that data structure.

### Get all text

- A common task is extracting all the text from a page (since the output is quite large, we don't actually print the output of the following function):

In [None]:
# print(soup.get_text())
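To see what `get_text()` does without printing a whole page, here is a minimal sketch on a hypothetical inline HTML snippet (the snippet and the variable names are made up for illustration). It also shows the optional `separator` and `strip` arguments:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration.
sample_html = "<p>Hello <b>world</b></p><p>Bye</p>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Join all text fragments with a space and strip surrounding whitespace.
clean_text = sample_soup.get_text(separator=" ", strip=True)
print(clean_text)
```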

### Investigate title

- Print the complete HTML title (`.title`):

In [None]:
soup.title

- Show the name of the title tag (`.title.name`):

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
"""Run this cell to check that your code returns the correct output"""
assert soup.title.name == "title"

- Only print the text of the title (`.title.string`):

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
"""Run this cell to check that your code returns the correct output"""
assert soup.title.string == "Quotes to Scrape"

- Show the name of the parent tag of title:

In [None]:
soup.title.parent.name

### Investigate hyperlinks

- Show the first hyperlink in the document:

In [None]:
soup.a

### Investigate a text element

In [None]:
soup.span.text
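The attribute-style navigation used in this section can also be tried on a small inline document. The following sketch uses a hypothetical miniature page (not the "Quotes to Scrape" site), and all variable names are chosen for illustration only:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, used only for illustration.
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <a href="/first">First link</a>
    <a href="/second">Second link</a>
    <span class="note">A short note</span>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

title_name = sample_soup.title.name          # name of the tag
title_text = sample_soup.title.string        # text inside the tag
parent_name = sample_soup.title.parent.name  # tag that encloses <title>
first_href = sample_soup.a["href"]           # attribute of the first <a> only
span_text = sample_soup.span.text            # text of the first <span>

print(title_name, title_text, parent_name, first_href, span_text)
```

Note that `sample_soup.a` only gives access to the first link, even though the page contains two.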

### Extract specific elements with find and find_all

- Attribute-style access such as `soup.a` only returns the first matching tag. Since the page contains many `div` tags, we can’t use the previous approaches to extract all relevant information.

- Instead, we use the `find` and `find_all` methods to extract specific HTML tags from the web page.

- `find` returns the first element that matches our specification, and `find_all` returns all matching elements on the page.

- Let's say our goal is to obtain all quotes, authors and tags from the website [“Quotes to Scrape”](https://quotes.toscrape.com/).

- We want to store all information in a pandas dataframe (every row should contain a quote as well as the corresponding author and tags).   

- First, we use [SelectorGadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb) in Google Chrome to inspect the website. 


> Review the [web scraping basics tutorial](https://kirenz.github.io/codelabs/codelabs/webscraping/#0) to learn how to inspect websites.
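Before applying `find` and `find_all` to the real page, the difference between the two can be sketched on a hypothetical HTML snippet. The tag classes `book`, `name` and `writer` below are invented for this illustration and do not occur on the quotes site:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with invented class names, used only for illustration.
sample_html = """
<div class="book"><span class="name">Book A</span><small class="writer">Ann</small></div>
<div class="book"><span class="name">Book B</span><small class="writer">Ben</small></div>
<div class="other">Unrelated</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find returns only the first match (or None if nothing matches).
first_book = sample_soup.find("div", {"class": "book"})
first_writer = first_book.find("small", {"class": "writer"}).text
missing = sample_soup.find("div", {"class": "does-not-exist"})

# find_all returns every match as a list-like object.
all_books = sample_soup.find_all("div", {"class": "book"})
names = [book.find("span", {"class": "name"}).text for book in all_books]

print(first_writer, names, missing)
```

Because `find` returns `None` when nothing matches, calling `.text` on its result raises an error for a misspelled class name, which is a common source of bugs.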

#### Extract all quotes

Task: Extract all quotes

- First, we use the div class "quote" to retrieve all relevant information regarding the quotes:

In [None]:
quotes_all = soup.find_all('div', {'class': 'quote'})

In [None]:
quotes_all

- Next, we can iterate through our new `quotes_all` object and extract only the text of the quotes:

  - we want to store the text of all quotes in a new list called `quotes_text` (you need to provide an empty list first)
  - to extract the quotes, note that the text of each quote is available in the `span` tag with the class `text` (see output above)
  - finally, we can use the attribute `.text` to make sure we only extract text
  

Some hints:  
  
```python
# create an empty list
quotes_text = []

# use a for loop to append each quote's text to quotes_text
for i in ___:
    ___.append((___.find('___', {'___':'___'})).___)
```

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
"""Run this cell to check that your code returns the correct output"""
assert quotes_text[0] == "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”"

Take a look at `quotes_text`:

In [None]:
 # quotes_text

- Next, we want to store the data in a pandas dataframe (to make later data preprocessing steps easier):

In [None]:
df_quotes = pd.DataFrame({"quote" : quotes_text})
df_quotes

#### Extract all authors

Task: Extract all authors 

In [None]:
soup

- In this example, we don't want to create a new object (like  `quotes_all`) as an intermediate step. 


- Instead, we use a different approach:
  - create an empty list with the name `authors_text`
  - use the `soup` object and call the `find_all()` method directly in a for loop to extract the authors (take a look at the code where we created `quotes_all`):
  

Hint:

```python
___ = []

for i in ___.___("___",{"___": "___"}):
    ___.___((___.___("___", {"___": "___"})).___)
```

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
"""Run this cell to check that your code returns the correct output"""
assert authors_text[0] == "Albert Einstein"

We create a new dataframe:

- call the dataframe: df_authors
- name the column: author

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
"""Run this cell to check that your code returns the correct output"""
assert df_authors.iloc[0,0] == "Albert Einstein"

We can combine the two dataframes with `join`, which performs a left join on the row index by default:

In [None]:
df1 = df_quotes.join(df_authors)
df1
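To illustrate what a left join on the index means, here is a minimal sketch with two small, made-up dataframes of different length (the names `left`, `right` and `joined` are hypothetical):

```python
import pandas as pd

# Made-up data, used only for illustration.
left = pd.DataFrame({"quote": ["q0", "q1", "q2"]})
right = pd.DataFrame({"author": ["a0", "a1"]})

# join matches rows by index and keeps every row of the left dataframe.
joined = left.join(right)
print(joined)
```

The row with index 2 has no partner in `right`, so its `author` value is filled with `NaN`. In this tutorial both dataframes have the same length and index, so no missing values appear.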

#### Extract all tags

Task: Extract all tags

- We use the same process as for the authors to obtain the tags.

- Information about the tags is available in the `div` elements with the class "tags".

- This time, we need to extract the `content` attribute of the `meta` tag inside each of these elements. It holds all tags of a quote as one comma-separated string.

- Store the results in a list called `tags_text`.
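On this site, the `content` value of a quote comes as one comma-separated string such as `change,deep-thoughts,thinking,world`. If the individual tags are needed later, such a string can be split into a Python list. This is an optional sketch and not part of the task; the variable names are illustrative:

```python
# Example value in the format found in the `content` attribute.
tags_string = "change,deep-thoughts,thinking,world"

# Split the string at every comma to get one list entry per tag.
tags_list = tags_string.split(",")
print(tags_list)
```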

Hint:

```python

___ = []

for i in ___.___("___",{"___": "___"}):
    ___.___((___.___("___"))['___'])
    
```    

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
"""Run this cell to check that your code returns the correct output"""
assert tags_text[0] == "change,deep-thoughts,thinking,world"

We create a new dataframe:

- call the dataframe: df_tags
- name the column: tags

In [None]:
df_tags = pd.DataFrame({"tags" : tags_text})


## Create dataframe for all quotes, authors and tags

Finally, we can combine all information in one single dataframe:

In [None]:
df2 = df1.join(df_tags)
df2

- Next, we want to store ALL quotes with the corresponding authors and tags information in a pandas dataframe.  


- Note that the site has a total of ten pages and we want to collect the data from all of them (we only extracted content from the first page so far). 


- The website's URL address is structured as follows:

  - page 1: https://quotes.toscrape.com/page/1/
  - page 2: https://quotes.toscrape.com/page/2/
  - ...
  - page 10: https://quotes.toscrape.com/page/10/

- This means we can use the part "https://quotes.toscrape.com/page/" as root and iterate over the pages 1 to 10.
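The URL pattern can be sketched as follows. To keep the illustration short, it only builds the addresses of the first three pages; `first_page`, `last_page` and `page_urls` are hypothetical names:

```python
root = "https://quotes.toscrape.com/page/"

# Illustration only: build the URLs of pages 1 to 3.
first_page, last_page = 1, 3

# range() excludes its upper bound, hence last_page + 1.
page_urls = [f"{root}{page}/" for page in range(first_page, last_page + 1)]

for page_url in page_urls:
    print(page_url)
```

Keep in mind that the upper bound of `range()` is exclusive when you write your own loop.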

We will proceed as follows:

1. Store the root URL without the page number in a variable called `root`.


2. Prepare three empty lists: `quotes`, `authors` and `tags`.


3. Create a loop that iterates over pages 1 to 10 of the site.


4. Append the scraped data to our lists.


5. Create a dataframe.

- Note that we use almost the same code as before.

Hint:

```python
# store the root url without the page number (it needs to end with /)
root = 'http://___/'

# create empty lists



# loop over pages 1 to 10
for pages in range(__, __):

    html = requests.get(___ + str(pages))

    # name the parser explicitly to avoid a parser-guessing warning
    soup = BeautifulSoup(___.text, 'html.parser')

    for i in soup.find_all("div", {"class": "quote"}):
        quotes.append((i.find("span", {"class": "text"})).text)

    for j in soup.find_all("div", {"class": "quote"}):
        authors.append((j.find("small", {"class": "author"})).text)

    for k in soup.find_all("div", {"class": "tags"}):
        tags.append((k.find("meta"))['content'])

# create dataframe
df = pd.DataFrame(
    {'Quotes': quotes,
     'Authors': authors,
     'Tags': tags
    })
```


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
"""Run this cell to check that your code returns the correct output"""
assert df.iloc[0, 2] == "change,deep-thoughts,thinking,world"
assert len(df) == 100  # 10 pages with 10 quotes each
assert df.iloc[0, 1] == "Albert Einstein"

- Show the result:

In [None]:
df.head()

- Congratulations! You have successfully completed this tutorial.