# Web Scraping 101 (oDCM)

*After finishing this tutorial, you can extract data from multiple pages on the web, and export such data to CSV files so that you can use it in an analysis.*

--- 

## Learning Objectives

* Generate lists of entities to scrape data from
* Map navigation path on a website using URLs, and understand how to use parameters to modify results
* Select data for extraction on a website using CSS selectors
* Write data to CSV file, and enrich with relevant metadata
* Bundle data capture in Python functions and modularize extraction code
* Loop through a list of URLs to capture data in bulk, using functions
* Understand the difference between Jupyter Notebooks and “raw” Python files, and run collection via the command line/terminal

--- 

## Acknowledgements
This tutorial has been inspired by various open-access online resources, which we list for further reference at the [course website](https://odcm.hannesdatta.com/docs/about). 

--- 

## Support Needed?
For technical issues outside of scheduled classes, please check the [support section](https://odcm.hannesdatta.com/docs/course/support) on the course website.

---

## 1. Seed Generation


### 1.1 Collecting Links


__Importance__

In web scraping, we typically refer to a "seed" as a starting point for a data collection. Without a seed, there's no data to collect.

For example, before we can crawl through all books available on [this site](https://books.toscrape.com/catalogue/category/books_1/index.html), we first need to generate a *list of all books on the page*.

One way to get there would be to:

1. first scrape all book links (“seeds”) from the overview page, and 
2. then iterate over all links to scrape the product description (or anything else on that page). 

Note that the overview page allows us to "navigate" to the individual book pages, either by clicking on the book cover or the book title (see red boxes in the figure below). 

<img src="images/books_links.png" align="left" width=80%/>

__Let's try it out__

Let's now check out how the links from the book covers or book titles are encoded in the website's source code.

Open the [book catalogue](https://books.toscrape.com/catalogue/category/books_1/index.html), and inspect the underlying HTML code with the Chrome Inspector (right click --> inspect element). 

The book covers (`<img>`) are surrounded by `<a>` tags, which contain a link (`href`) to the book. 

Also, the book titles (`<h3>`) are surrounded by `<a>` tags with the relevant links to the book pages.

<img src="images/inspector_links.png" align="left" width=80%/>

How could we tell a computer to capture the links to the various books on the site?

One simple way is to select *elements by their tags*. For example, to extract all links (`<a>` tags). 

__Exercise 1__

Please run the code cell below, which extracts all links (the `a` tag!), and prints the URL (`href`) to the screen. 

If you look at these links more closely, you'll notice that we're not interested in many of these links... 

Make a list of all links we're *not* interested in (i.e., those *not* pointing to a book page). Which ones are those? Can you find out why they are there?

In [None]:
# Run this code now
import requests
from bs4 import BeautifulSoup

# make a get request to the books overview page (see Webdata for Dummies tutorial)
url = 'https://books.toscrape.com/catalogue/category/books_1/index.html'
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")

# print the href attribute of every <a> tag on the page
for link in soup.find_all("a"): print(link.attrs["href"])

__Solution__

The links we want to ignore are...

* "Books to Scrape" link at the top
* "Home" breadcrumb link 
* Left sidebar with all book genres (e.g., Travel)
* The next button at the bottom

These links are present on the page, because they are used by users to navigate on the page. This can also be seen on the animation:

<img src="images/books_overview.gif" align="left" width=50%/>

### 1.2 Collecting *More Specific* Links

__Importance__

We've just discovered that selecting elements by their tags gives us many irrelevant links. But how can we narrow down these links? In other words, __how can we scrape only the book links we're interested in?__

To answer this question, we need to briefly revisit the notion of __HTML classes__. 

A __class__ is often used as a reference in the code, for example to make all text elements with a given class blue or to increase their font size. In the Chrome Inspector screenshot shown earlier, you find an `<article>` tag with class `product_pod`. Nested inside it is a `<div>` that contains the image and the link we're after. 

Every link to a book is *nested within this class* (nested = "part of"). The "wrong links" extracted above (i.e., the ones in the page's header and sidebar) are *not*. 

Thus, if we can tell our scraper that we're only interested in the `<a>` tags *within the `product_pod` class*, we end up with our desired selection of links.

__Let's try it out__

Like before, we'll use `.find_all()` to capture all matching elements on the page. The difference, however, is that we specify __a class (`class_=`)__, rather than an HTML tag. From the inspector, we know the class name (`product_pod`). 

The result is a list with __all 20 `product_pod` elements__ on the page (i.e., one for each book). 

Run the code below, in which we pick the __first book__ from the list (A Light in the Attic, element `[0]`), and extract the `<a>` tag nested within the `product_pod` class. 

Finally, we pull out the `href` attribute from the `<a>` tag which gives us the book link.

In [27]:
import requests
from bs4 import BeautifulSoup

# make a get request to the books overview page (see Webdata for Dummies tutorial)
url = 'https://books.toscrape.com/catalogue/category/books_1/index.html'
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")

# return the href attribute in the <a> tag nested within the first product class element
soup.find_all(class_="product_pod")[0].find("a").attrs["href"]

'../../a-light-in-the-attic_1000/index.html'

Note the `../../` in front of the link: it tells the browser to go back two directories from the directory of the current page:
* Current URL: https://books.toscrape.com/catalogue/category/books_1/index.html
* 1 step back: https://books.toscrape.com/catalogue/category/
* 2 steps back: https://books.toscrape.com/catalogue/

Thereafter, it appends `a-light-in-the-attic_1000/index.html` to the URL which forms the full link to the [A Light in the Attic](https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html) book. 

Pretty cool, right?

#### Exercise 2
1. Modify the script to extract the link from the *second book* (Tipping the Velvet), using BeautifulSoup.
2. Create a new variable `book_url` that concatenates the base URL (`https://books.toscrape.com/catalogue/`) and the string you extracted in exercise 2.1 (`../../tipping-the-velvet_999/index.html`). Use *slicing* to remove the `../../` part in between. The final output should be: `https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html` 
3. The `replace` method offers a more convenient way to "search and replace" in a string. The syntax is: `my_string = old_string.replace('text-to-replace', 'replace-by-text')`. Use `replace` to solve the previous exercise 2.2.

#### Solutions

In [31]:
# Question 1
url_book = soup.find_all(class_="product_pod")[1].find("a").attrs["href"]
print(url_book)

../../tipping-the-velvet_999/index.html


In [32]:
# Question 2 
base_url = "https://books.toscrape.com/catalogue/"
book_url = base_url + url_book[6:]
print(book_url)

https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html


In [30]:
# Question 3
base_url = "https://books.toscrape.com/catalogue/"
book_url = base_url + url_book
book_url = book_url.replace('../', '')
print(book_url)

https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html
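As an alternative to slicing and `replace`, relative links can also be resolved with `urljoin` from `urllib.parse` (part of Python's standard library). The sketch below is an optional variation, not part of the exercise: it combines the URL of the page on which the link was found with the relative link, applying the same "go back two directories" logic a browser uses. The variable names are chosen for illustration only.

```python
from urllib.parse import urljoin

# URL of the overview page on which the relative link was found
overview_url = "https://books.toscrape.com/catalogue/category/books_1/index.html"

# relative link as extracted from the href attribute
relative_link = "../../tipping-the-velvet_999/index.html"

# urljoin resolves the ../../ steps against the overview page's directory
full_book_url = urljoin(overview_url, relative_link)
print(full_book_url)
```

A possible advantage of this approach is that it does not depend on how many `../` steps a link contains.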


### 1.3 Iterating over items

__Importance__

Ideally, we'd like our code to extract the URL from *every* book on the page, not just *one* product.

In other words, we need a way to *iterate*/*loop* through the entire page to assemble a list of links (product pages) to scrape.

__Let's try it out__

Let's set up this exercise.

1. We have a BeautifulSoup object, holding all of the book previews (`soup.find_all(class_="product_pod")`)
2. We have an empty list, `book_urls`, that we would like to fill
3. We write a loop, which iterates through 1. and fills in 2.

Run the code below!

In [35]:
# list of all books on the overview page
books = soup.find_all(class_="product_pod")
book_urls = []

for book in books: 
    book_url = book.find("a").attrs["href"]
    book_urls.append(book_url)
    
# print the first five urls
print(book_urls[0:5])

['../../a-light-in-the-attic_1000/index.html', '../../tipping-the-velvet_999/index.html', '../../soumission_998/index.html', '../../sharp-objects_997/index.html', '../../sapiens-a-brief-history-of-humankind_996/index.html']
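The same kind of list can also be built with a CSS selector, using BeautifulSoup's `.select()` method. The sketch below runs on a small, hand-written HTML snippet that only imitates the structure of the overview page (an assumption made so that the example works offline; the real page contains more markup). The names `example_html` and `example_soup` are illustrative.

```python
from bs4 import BeautifulSoup

# simplified stand-in for the overview page (assumption: the real page has more markup)
example_html = """
<a href="index.html">Home</a>
<article class="product_pod">
  <div class="image_container">
    <a href="../../a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic"></a>
  </div>
  <h3><a href="../../a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
</article>
<article class="product_pod">
  <div class="image_container">
    <a href="../../tipping-the-velvet_999/index.html"><img alt="Tipping the Velvet"></a>
  </div>
  <h3><a href="../../tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>
</article>
"""
example_soup = BeautifulSoup(example_html, "html.parser")

# "article.product_pod h3 a" selects every <a> inside an <h3>
# inside an <article> with class product_pod
title_links = example_soup.select("article.product_pod h3 a")

selected_urls = [link["href"] for link in title_links]
selected_titles = [link["title"] for link in title_links]
print(selected_urls)
print(selected_titles)
```

Note that the "Home" link is skipped because it is not nested within a `product_pod` element.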


In practice, it may be more convenient to create a *dictionary* in which the `book_title` is the key and the `book_url` the value. This way it is more intuitive to look up the URL from a given book because you don't have to remember the exact position in the list but can simply pass it the title of the book. 

In the Chrome Inspector screenshot at the beginning of this section, you can see that the book title is stored in the `alt` attribute of the `<img>` tag (as well as in the `title` attribute of the second `<a>` tag). Using a similar approach as above, we collect the `book_title` and `book_url` of each book, and use these records to update `book_dict`.

In [36]:
book_dict = {}

for book in books: 
    book_title = book.find("img").attrs["alt"] 
    book_url = book.find("a").attrs["href"]
    book_dict[book_title] = book_url

As a result, we can simply pass the book title (mind the capitals!) to the dictionary to obtain the corresponding URL.

In [37]:
print(book_dict['A Light in the Attic'])

../../a-light-in-the-attic_1000/index.html


#### Exercise 3
1. Like exercise 2.2, write code that transforms the relative URLs (`../..`) in `book_dict` into full URLs. Tip: you can use `for key, value in book_dict.items():` to iterate over the key-value pairs in the dictionary and update URLs accordingly. 
2. One of the books on `books.toscrape.com` is [Black Dust](https://books.toscrape.com/catalogue/black-dust_976/index.html). What happens once you pass this title as a key to `book_dict`? Why is that? 

#### Solutions

In [40]:
# Question 1
for key, value in book_dict.items():
    book_dict[key] = (base_url + value).replace('../','')

In [41]:
# Question 2 
book_dict["Black Dust"] # it throws an error because the key does not exist (this book is shown on the 2nd page and we only scraped the first one!)

KeyError: 'Black Dust'
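If you would rather avoid the `KeyError`, you can check whether a key exists before looking it up. The sketch below uses a small stand-in dictionary (`example_dict`, a hypothetical name) so that it runs on its own; the helper `lookup_book_url()` is likewise only an illustration.

```python
# small stand-in for book_dict, holding a single book from the first page
example_dict = {
    "A Light in the Attic": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
}

def lookup_book_url(title, books):
    '''return the URL for a title, or None if the title has not been scraped'''
    return books.get(title)

found = lookup_book_url("A Light in the Attic", example_dict)
missing = lookup_book_url("Black Dust", example_dict)

print("Black Dust" in example_dict)  # membership test, does not raise an error
print(found)
print(missing)
```

`dict.get()` returns `None` for unknown keys instead of raising an error, which can be convenient inside loops.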

### 1.4 Page Navigation

__Importance__

Alright - what have we learnt up to this point?

- Section 1.1 taught us how to extract links from a page, 
- Section 1.2 taught us how to extract *more specific links* from a page, and finally
- Section 1.3 taught us how to assemble a list of *links* to *all* books listed on a specific page.

So... what's missing?

Exactly! The site [`books.toscrape.com`](https://books.toscrape.com/catalogue/category/books_1/index.html) contains __1000 books__, spread across __50 pages__. 

So, the goal of this section is to navigate through the __entire book assortment__, not only the first 20 books.



__Let's try it out__

Open [the website](https://books.toscrape.com/catalogue/category/books_1/index.html), and click on the "next" button at the bottom of the page.

<img src="images/books.png" align="left" width=60%/>




Repeat this a couple of times, and observe how the URL in your navigation bar is changing...

- `https://books.toscrape.com/catalogue/category/books_1/page-1.html`
- `https://books.toscrape.com/catalogue/category/books_1/page-2.html`
- `https://books.toscrape.com/catalogue/category/books_1/page-3.html`

Can you guess the next one...?

Indeed! The URL can be divided into a __fixed base part__ (`https://books.toscrape.com/catalogue/category/books_1/`), and a __counter__ that is dependent on the page you're visiting (e.g., `page-1.html`). 

__Now let's create a list of all 50 URLs!__ 

First, we create a counter variable, which we now set to 1 (but it can take on any value later on). Then, we concatenate the `base_url` with the counter (note that we have to convert the integer counter to a string before we can do that, using the `str` function).

In [44]:
base_url = "https://books.toscrape.com/catalogue/category/books_1/"
counter = 1
full_url = base_url + "page-" + str(counter) + ".html" 
print(full_url)

https://books.toscrape.com/catalogue/category/books_1/page-1.html


In a similar fashion, we generate a list of 50 `page_urls` with a for loop that starts at 1 and ends at 50 (`range(1, 51)` stops *before* 51). 

In [45]:
base_url = "https://books.toscrape.com/catalogue/category/books_1/"
page_urls = []

for counter in range(1, 51):
    full_url = base_url + "page-" + str(counter) + ".html" 
    page_urls.append(full_url)

As expected, this gives a list of all page URLs that contain books. 

In [46]:
# print the number of page urls in the list
print("The number of page urls in the list is: " + str(len(page_urls)))

The number of page urls in the list is: 50
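The loop above can also be written more compactly. The sketch below builds the same 50 URLs with an f-string inside a list comprehension; the names `category_url` and `page_urls_compact` are chosen for illustration so that they do not overwrite the variables used in this notebook.

```python
category_url = "https://books.toscrape.com/catalogue/category/books_1/"

# one URL per page number; range(1, 51) yields 1, 2, ..., 50
page_urls_compact = [f"{category_url}page-{counter}.html" for counter in range(1, 51)]

print(page_urls_compact[0])
print(page_urls_compact[-1])
print(len(page_urls_compact))
```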


#### Exercise 4
In this exercise, we practice generating a seed for another website, [`quotes.toscrape.com`](https://quotes.toscrape.com/), which displays 100 famous quotes from GoodReads, categorized by tag. 

<img src="images/quotes.png" align="left" width=60% style="border: 1px solid black" />

1. Make yourself comfortable with how the [site](https://quotes.toscrape.com) works and ask yourself questions such as: how does the navigation work, how many pages are there, what is the base URL, and how does it change if I move to the next page?
2. Generate a list `quote_page_urls` that contains the page URLs we need if we'd like to scrape all 100 quotes.

#### Solutions
1. The 100 quotes are evenly spread across 10 pages. The base URL is `https://quotes.toscrape.com/page/` followed by a page number between 1 and 10.

In [47]:
# Question 2
base_url = "https://quotes.toscrape.com/page/"
quote_page_urls = []

for counter in range(1, 11):
    full_url = base_url + str(counter)
    quote_page_urls.append(full_url)

print(quote_page_urls)

['https://quotes.toscrape.com/page/1', 'https://quotes.toscrape.com/page/2', 'https://quotes.toscrape.com/page/3', 'https://quotes.toscrape.com/page/4', 'https://quotes.toscrape.com/page/5', 'https://quotes.toscrape.com/page/6', 'https://quotes.toscrape.com/page/7', 'https://quotes.toscrape.com/page/8', 'https://quotes.toscrape.com/page/9', 'https://quotes.toscrape.com/page/10']


In summary, we have defined our seed and thought about a data extraction strategy to obtain the book links on a page. Since there are multiple pages, we needed to generate a list of URLs as an input for our scraper, which we'll further refine in the next chapter.  

--- 

## 2. Data Extraction


### 2.1 Timers

__Importance__

Before we start running the scraper, we need to realize that sending many requests in quick succession can overload a server. Therefore, it's highly recommended to pause between requests. This reduces the risk that your IP address (i.e., the numerical label assigned to each device connected to the internet) gets blocked, after which you could no longer visit (and scrape) the website. 

__Let's try it out__

In Python, you can import the `sleep` function from the `time` module, which pauses the execution of subsequent commands for a given number of seconds. For example, the print statement after `sleep(5)` will only be executed after 5 seconds:


In [48]:
# run this cell again to see the timer in action yourself!
from time import sleep
sleep(5)
print("I'll be printed to the console after 5 seconds!")

I'll be printed to the console after 5 seconds!


__Exercise 5__

Modify the code above to sleep for 2 minutes. Go grab a coffee in between. Did it take you longer than 2 minutes?

(if you want to abort the running code, just select the cell and push the "stop" button)

In [53]:
# Solution
sleep(2*60)
print("Done!")


KeyboardInterrupt: 
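To check how long a pause really takes, you can measure it. The sketch below wraps `sleep` in a small helper, `polite_pause()` (a hypothetical name, not part of any library), that waits for a random duration between a lower and an upper bound and reports the time that actually passed. Randomizing the pause is an optional variation on the fixed timer above; the short durations are chosen only to keep the example quick.

```python
import random
from time import sleep, perf_counter

def polite_pause(min_seconds, max_seconds):
    '''sleep for a random duration between min_seconds and max_seconds and return that duration'''
    duration = random.uniform(min_seconds, max_seconds)
    sleep(duration)
    return duration

start = perf_counter()
waited = polite_pause(0.1, 0.3)
elapsed = perf_counter() - start

print(f"Asked to wait {waited:.2f} seconds, actually waited {elapsed:.2f} seconds")
```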

### 2.2 Modularization

With this addition to our toolkit, let's finish up our book URL scraper by putting together everything we have learned thus far. Following Python conventions, let's modularize our  code into functions to improve readability and reusability. 

First, we define a function `generate_page_urls()` that takes a base URL and an upper limit of the number of pages (50) as input parameters. This way we can easily update our scraper if more books are added or if the base URL changes. 

Second, the `extract_book_urls()` function takes a list of page URLs as input and returns a dictionary of book titles and URLs. Note the two-step structure of the for-loops: on every page, we create a `books` object which we subsequently loop over by extracting the `book_title` and `book_url` from each book. These records are added to the dictionary `book_dict` which is eventually returned by the function.

In [15]:
def generate_page_urls(base_url, num_pages):
    '''generate a list of full page urls from a base url and a counter that takes on the values between 1 and num_pages'''
    page_urls = []
    
    for counter in range(1, num_pages + 1):
        counter_url = f"page-{counter}.html"
        full_url = base_url + counter_url 
        page_urls.append(full_url)
        
    return page_urls
    
def extract_book_urls(page_urls):
    '''collect the book title and url for every book on all page urls'''
    book_dict = {}
    
    for page_url in page_urls: 
        res = requests.get(page_url)
        soup = BeautifulSoup(res.text, "html.parser")
        books = soup.find_all(class_="product_pod")

        for book in books: 
            book_title = book.find("img").attrs["alt"] 
            book_url = "https://books.toscrape.com/catalogue/" + book.find("a").attrs["href"][6:]
            book_dict[book_title] = book_url
            
        sleep(1)  # pause 1 second after each request
            
    return book_dict
    
base_url = "https://books.toscrape.com/catalogue/category/books_1/"
page_urls = generate_page_urls(base_url, 2) # to save time and resources we only scrape the first 2 pages
book_dict = extract_book_urls(page_urls)

Although this code works without problems, there is one little improvement that we can make. If the number of pages changes, we need to manually update the `num_pages` parameter. For example, we may miss out once new books are added which appear on page 51 and further. 


### 2.3 Next Page Button

A general solution is therefore to look up whether there is a `next` button on the page (HTML code below). If so, it means a next page exists and we keep on incrementing the page counter by 1. If not, it means we have reached the last page. 

<img src="images/next_page.png" align="left" width=60% style="border: 1px solid black" />

As such, we write the function `check_next_page()` which takes a URL as an input and returns the outgoing link of the next button (if present):

In [15]:
def check_next_page(url):
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    next_btn = soup.find(class_= "next")
    return next_btn.find("a").attrs["href"] if next_btn else None

page_1 = "https://books.toscrape.com/catalogue/page-1.html"
print(f"The next page is: {check_next_page(page_1)}")

The next page is: page-2.html


We pass the function the first page of the bookshop, and it returns the link to the second page (note that `page-2.html` is a relative path from the current URL). Now let's check what happens once we pass it the 50th page!

#### Exercise 4
1. Pass `https://books.toscrape.com/catalogue/page-50.html` to `check_next_page()` and observe the output. Is that what you expected? 
2. Write a function `next_page_url()` that checks whether the output of `check_next_page()` is not equal to `None` (i.e., anything but `None`). If so, it should return a new variable `page_url` that concatenates the base URL and the relative path to the next page. If not, it should print the statement `This is already the last page!`

#### Solutions

In [16]:
# Question 1 
output = check_next_page("https://books.toscrape.com/catalogue/page-50.html")
print(output) # the output is None because page 50 is the last one

None


In [17]:
# Question 2 
def next_page_url(url):
    base_url = "https://books.toscrape.com/catalogue/"
    if url is not None: 
        page_url = base_url + url 
        return page_url 
    else: 
        print("This is already the last page!")
        
next_page_url(check_next_page("https://books.toscrape.com/catalogue/page-50.html"))

This is already the last page!


---
As a last step, we revise the `extract_book_urls()` function. Instead of generating the list of page URLs up front, we now use a `while` loop that keeps running as long as there is another page. At the end of each iteration, we update `page_url` with the link of the next button (using `check_next_page()`). On the last page, there is no next page URL and thus we break out of the while loop. 

All in all, we have modularized our code into functions, made it future-proof (e.g. if new books are added), and reduced the number of lines of code to get the job done! 

In [18]:
from urllib.parse import urljoin

def extract_book_urls(page_url):
    '''collect the book title and url for every book on all page urls'''
    book_dict = {}

    while page_url: 
        res = requests.get(page_url)
        soup = BeautifulSoup(res.text, "html.parser")
        books = soup.find_all(class_="product_pod")

        for book in books: 
            book_title = book.find("img").attrs["alt"] 
            # resolve the relative link against the current page, so it works with or without a leading ../../
            book_url = urljoin(page_url, book.find("a").attrs["href"])
            book_dict[book_title] = book_url

        sleep(1)  # pause 1 second after each request
        
        # call check_next_page() only once per page to avoid a duplicate request
        next_page = check_next_page(page_url)
        if next_page is not None: 
            page_url = urljoin(page_url, next_page)
        else: 
            break
        
    return book_dict

book_dict = extract_book_urls("https://books.toscrape.com/catalogue/page-1.html")
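The control flow of such a "follow the next button" loop can be tried out without sending any requests. In the sketch below, a plain dictionary (`fake_site`, a made-up stand-in with `example.com` URLs) plays the role of the website: each page URL maps to the relative link of its next button, or to `None` on the last page. The function `follow_next_links()` is hypothetical and only illustrates the loop.

```python
from urllib.parse import urljoin

# made-up stand-in for a website: page URL -> href of the next button (None = last page)
fake_site = {
    "https://example.com/catalogue/page-1.html": "page-2.html",
    "https://example.com/catalogue/page-2.html": "page-3.html",
    "https://example.com/catalogue/page-3.html": None,
}

def follow_next_links(start_url, next_links):
    '''visit pages one by one until a page without a next link is reached'''
    visited = []
    page_url = start_url

    while page_url:
        visited.append(page_url)
        next_href = next_links.get(page_url)
        # build the absolute URL of the next page, or stop if there is none
        page_url = urljoin(page_url, next_href) if next_href else None

    return visited

visited_pages = follow_next_links("https://example.com/catalogue/page-1.html", fake_site)
print(visited_pages)
```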

#### Exercise 5
Try to run the function, inspect the output for yourself, and answer the following questions (you may need to wait for a bit as the scraper loops through all 50 pages!).
1. How many books are there in `book_dict`? Does this align with your initial expectations? Can you come up with a plausible explanation? (p.s. this is a tricky one, you have been warned!)
2. Your good friend recommended the book `The Activist's Tao Te Ching: Ancient Advice for a Modern Revolution`. After looking up the reviews on [GoodReads](https://www.goodreads.com/book/show/25986790-the-activist-s-tao-te-ching?ac=1&from_search=true&qid=jpcvOsxKfP&rank=1) you decide to look for a copy of the book online. Does [books.toscrape.com](https://books.toscrape.com) offer a copy in their store? If so, do they have enough stock currently? 

#### Solutions
1. There are 999 books (use: `len(book_dict)` to get the answer). The reason you find 999 books - rather than 1000 as listed on the homepage - is because there is one duplicate (Do Androids Dream of Electric Sheep? (Blade Runner #1)). Dictionaries can only store distinct keys and thus duplicate keys will be overwritten. 

In [19]:
# Question 2
# first we check whether the book is present in the dictionary (mind the capitals!)
"The Activist's Tao Te Ching: Ancient Advice for a Modern Revolution" in book_dict

# then we look up the corresponding URL -> on the page we find a stock level of 16 which is more than sufficient
book_dict["The Activist's Tao Te Ching: Ancient Advice for a Modern Revolution"]

'https://books.toscrape.com/catalogue/the-activists-tao-te-ching-ancient-advice-for-a-modern-revolution_928/index.html'
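The overwriting behaviour from question 1 can be reproduced on a tiny example. The records below are made up for illustration (`example.com` URLs and placeholder titles). Using the title as the key silently drops one record, whereas using the URL as the key keeps all of them, because URLs are unique.

```python
# made-up records: two different books share the same title
example_records = [
    ("Book A", "https://example.com/book-a_1/index.html"),
    ("Book B", "https://example.com/book-b_2/index.html"),
    ("Book A", "https://example.com/book-a_3/index.html"),
]

# keyed by title: the second "Book A" overwrites the first one
by_title = {}
for title, url in example_records:
    by_title[title] = url

# keyed by URL: every record is kept
by_url = {url: title for title, url in example_records}

print(len(example_records), len(by_title), len(by_url))
```

Which key to use depends on how you want to look up records later; the title is easier to remember, the URL is safer against duplicates.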

### 2.4 Page-Level Data Collection

Do you remember trying to obtain the URL of the [Black Dust](https://books.toscrape.com/catalogue/black-dust_976/index.html) book in exercise 3? Let's see whether it works this time...

In [20]:
print(book_dict["Black Dust"])

https://books.toscrape.com/catalogue/black-dust_976/index.html


Excellent, it works flawlessly! But why did we need the book URLs in the first place? It forms the seed for other web scraping efforts. For example, the product descriptions can only be obtained from the book pages themselves which means we need to loop over all book URLs to extract the right information. In the follow-up exercise, we'll look at how to do this. 

#### Exercise 6
1. We'd like to extract the product description of our 3 favorite books. Fill in the blanks below to finish the `get_book_description()` function. Each `#` represents a single missing character (e.g. `####` means the solution requires 4 characters). 
2. Run the function and inspect the output. If you look carefully, you may spot `â\x80\x99` sequences throughout the product descriptions. Look up the original text on the book pages and compare it side-by-side with the output in `book_descriptions`. 

In [None]:
def get_book_description(books):
    book_descriptions = {}

    for book in books: 
        page_url = book_dict[####]

        res = requests.get(########)
        soup = BeautifulSoup(res.text, "html.parser")

        # tip: look at the Chrome Inspector screenshot below 
        description = soup.find(id="content_inner").find_all("p")[#].get_text()
        book_descriptions[####] = ###########

    return book_descriptions

favorite_books = ["Black Dust", "The Grand Design", "Twenties Girl"]
book_descriptions = get_book_description(##############)

#### Solutions

In [22]:
def get_book_description(books):
    book_descriptions = {}

    for book in books: 
        page_url = book_dict[book]

        res = requests.get(page_url)
        soup = BeautifulSoup(res.text, "html.parser")

        # tip: look at the Chrome Inspector screenshot below 
        description = soup.find(id="content_inner").find_all("p")[3].get_text()
        book_descriptions[book] = description

    return book_descriptions

favorite_books = ["Black Dust", "The Grand Design", "Twenties Girl"]
book_descriptions = get_book_description(favorite_books) # â\x80\x99 is a mis-decoded apostrophe (’), e.g., in can’t

<img src="images/black_dust.png" align="left" width=70% style="border: 1px solid black" />
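The odd `â\x80\x99` sequence is what a curly apostrophe looks like when UTF-8 bytes are decoded with the wrong character encoding. The sketch below reproduces the effect on a single word and then reverses it. It does not send any requests; the remark about `requests` in the comments is an assumption about why the scraped text is garbled (the page being UTF-8 encoded, but decoded as Latin-1).

```python
original = "can’t"  # contains a curly apostrophe (U+2019)

# encode as UTF-8, then decode with the wrong encoding (Latin-1)
garbled = original.encode("utf-8").decode("latin-1")
print(repr(garbled))

# reversing both steps restores the original text
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)

# With requests, setting res.encoding = "utf-8" before reading res.text
# should avoid the problem (assumption: the page is UTF-8 encoded).
```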

---
### 2.5 CSV Export

Lastly, we convert the dictionary into a Comma-Separated Values (CSV) file which you can open up in any spreadsheet program (e.g., Excel). More specifically, we'd like to have a file with three columns: one for the book title, one for the product description, and another one for the current date and time. The latter helps you to distinguish between data from scrapers you run repeatedly. For example, you may run the book scraper at the beginning of every month to keep track of price changes of any of the books. Although you could store the data of each extraction moment in a separate file (e.g., `2021_01_01_book_prices.csv` for January 2021, `2021_02_01_book_prices.csv` for February 2021), we recommend always including a timestamp column in your scraped datasets. After all, losing or overwriting data can be disastrous (especially for scrapers) as you may never be able to obtain historical data again (e.g., the price of a book 2 months ago).

In that light, we import the `datetime` class from the `datetime` module. Its `now()` method returns the current date and time, which we'll incorporate into our final dataset. Run the cell again, and you'll see how the values update to the current time: 

In [23]:
from datetime import datetime

now = datetime.now()
print(now)

2020-12-14 21:25:06.955207


In essence, CSV files are simply text files in which a special character (the delimiter) marks the start of a new column. Below you find a screenshot of the `book_descriptions.csv` file opened in a basic text editor. Every `;` starts a new column, and every line break (enter) starts a new row.

<img src="images/csv_files.png" align="left" width=50% style="border: 1px solid black" />

Excel then applies this logic - splitting on semicolons and line breaks - to assign the data points to their respective cells: 

<img src="images/excel.png" align="left" width=50% style="border: 1px solid black" />

It gets more complicated once the delimiter also appears in the data itself. For example, a comma is often used as a delimiter, but the product descriptions here also contain commas (e.g., `No matter how busy he keeps himself, successful Broadway...`). A program that naively splits each line on commas would regard the part after the comma (`successful Broadway...`) as a new column, whereas it actually still belongs to the product description. Python's `csv` writer guards against this by wrapping such fields in quotes, but setting the delimiter to `;` is still a safer choice here. 

We can write to a text file with the `csv` library. The first row is the header and contains the three column names (`"title", "description", "date_time"`). Thereafter, we iterate over the key-value pairs in the dictionary and add the current date and time to each row. Importantly, the `w` mode passed to `open()` means that the file is overwritten every time the cell is executed. If you, however, want to append data to an existing file and avoid losing historical data, you can swap `w` for `a`. 

In [24]:
import csv 

with open("book_descriptions.csv", "w", newline="") as csv_file:  # newline="" is recommended by the csv docs to avoid blank rows on Windows
    writer = csv.writer(csv_file, delimiter = ";")
    writer.writerow(["title", "description", "date_time"])
    now = datetime.now()
    for title, description in book_descriptions.items(): 
        writer.writerow([title, description, now])
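To see how the `csv` library deals with a delimiter that shows up inside a field, the sketch below writes a made-up row to an in-memory text buffer (`io.StringIO`) instead of a file and reads it back. The row content is invented for illustration; no file is created on disk.

```python
import csv
import io

# made-up data: the description contains both a semicolon and a comma
example_rows = [
    ["title", "description"],
    ["Example Book", "Part one; part two, part three"],
]

# write to an in-memory buffer rather than to a file
buffer = io.StringIO(newline="")
csv.writer(buffer, delimiter=";").writerows(example_rows)
raw_text = buffer.getvalue()
print(raw_text)  # the description is wrapped in double quotes

# reading it back with the same delimiter restores the original rows
read_back = list(csv.reader(io.StringIO(raw_text, newline=""), delimiter=";"))
print(read_back)
```

Because the writer quotes the field that contains the delimiter, the reader still returns two columns per row.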

#### Exercise 7 
1. Run the cell above and look at the `book_descriptions.csv` file in Excel. Make sure it looks like the screenshot above (3 columns x 4 rows). Depending on the language settings on your machine, the data may not be correctly distributed over the columns. In that case, go to the "Data" tab in Excel, click the "Text to Columns" button in the ribbon, choose "Delimited", put a checkmark in front of "Semicolon", and choose "Finish".

<img src="images/text_to_column.gif" align="left" width=60% style="border: 1px solid black" />

2. Close Excel, change the flag to `a`, and run the cell again. Open the `book_descriptions.csv` file again (and repeat the Text to Columns procedure if necessary). How does the output differ from the previous step? Why is that? 

#### Solutions
2. It shows the same data, including the header, twice (below one another). It goes beyond the scope of this course to define better alternatives (e.g., save data to a database).
---
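One simple way to avoid the repeated header when appending is to write it only if the file does not exist yet (or is still empty). The sketch below illustrates this idea with a hypothetical helper, `append_descriptions()`, and writes to a temporary folder so that it does not touch your `book_descriptions.csv`. The description text is a placeholder.

```python
import csv
import os
import tempfile
from datetime import datetime

def append_descriptions(path, descriptions):
    '''append rows to a CSV file; write the header only if the file is new or empty'''
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0

    with open(path, "a", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file, delimiter=";")
        if needs_header:
            writer.writerow(["title", "description", "date_time"])
        now = datetime.now()
        for title, description in descriptions.items():
            writer.writerow([title, description, now])

# demo in a temporary folder with placeholder data
demo_path = os.path.join(tempfile.mkdtemp(), "book_descriptions_demo.csv")
demo_data = {"Black Dust": "A short placeholder description."}

append_descriptions(demo_path, demo_data)
append_descriptions(demo_path, demo_data)  # second run appends, without a second header

with open(demo_path, newline="", encoding="utf-8") as csv_file:
    demo_rows = list(csv.reader(csv_file, delimiter=";"))

print(len(demo_rows))  # one header row plus two data rows
```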

At the beginning of this tutorial, we set out the promise of writing multi-page scrapers from start to finish. Although the examples we have studied are relatively simple, the same principles (seed definition, data extraction plan, page-level data collection) apply to any other website you'd like to scrape. 

Now that you have hopefully got the hang of using Jupyter notebooks, we're going to introduce you to an alternative that goes hand in hand with what you have learned thus far but overcomes some of its limitations.

## 3. Executing Python Files

### 3.1 Jupyter Notebooks vs Spyder
Jupyter Notebooks are ideal for combining programming and markdown (e.g., text, plots, equations), which makes them the default choice for sharing and presenting data analyses from a reproducibility standpoint. Since we can execute code blocks one by one, they are suitable for developing and debugging code on the fly. There are, however, some limitations to Jupyter Notebooks, which make us consider other Integrated Development Environments (IDEs) such as Spyder. First, the order in which you run cells within a notebook may affect the results. While prototyping you may lose sight of the top-down hierarchy, which can cause problems once you restart the kernel (e.g., a library is imported after it is used). Second, there is no easy way to browse through directories and files within a Jupyter Notebook. Third, notebooks handle neither large codebases nor big data particularly well. For those reasons, we recommend starting out in Jupyter Notebooks, moving code into functions along the way, and, once all seems to be running well, copy-pasting all necessary code into Spyder. From there you can save it as a Python file (`.py`) - rather than a notebook (`.ipynb`) - and execute the file from the command line. Next, we introduce you to the Spyder IDE, and you learn how to run Python files from the command line. 
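To give an idea of what a "raw" Python file may look like, here is a minimal sketch of a script layout. The file name `my_scraper.py` and the function names are hypothetical; the actual scraping code is left out and replaced by a print statement. The `if __name__ == "__main__":` block only runs when the file is executed directly, for example with `python my_scraper.py 5` in the terminal.

```python
import sys

def parse_num_pages(args, default=2):
    '''read the number of pages from the command-line arguments, or fall back to a default'''
    if args and args[0].isdigit():
        return int(args[0])
    return default

def main(args):
    num_pages = parse_num_pages(args)
    # placeholder: this is where the scraping functions would be called
    print(f"Would scrape {num_pages} page(s)")
    return num_pages

if __name__ == "__main__":
    # sys.argv[0] is the script name; the remaining items are the arguments
    main(sys.argv[1:])
```

Keeping the work inside functions, as practiced earlier in this tutorial, makes it easy to move code from a notebook into such a file.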

### 3.2 Introduction to Spyder
The first time you use Spyder, click the green "Install" button in Anaconda Navigator. After that, start Spyder by clicking the blue "Launch" button (alternatively, type `spyder` in the terminal). 

<img src="images/anaconda_navigator.png" width=90% align="left" style="border: 1px solid black" />

The main interface consists of three panels: 
1. **Code editor** = where you write Python code (i.e., the content of code cells in a notebook)
2. **Variable / files** = depending on the tab you choose, either an overview of all declared variables (e.g., to look up their type or change their values) or a file explorer (e.g., to open other Python files)
3. **Console** = the output of running the Python script from the code editor (what normally appears below each cell in a notebook)

<img src="images/spyder.png" width=90% align="left" style="border: 1px solid black" />

In the `webscraping_101.py` file shown above, we have put together all code snippets from this notebook that are needed to scrape and store the URLs of all books. To run the entire script (lines 1 to 46), click the green play button. Alternatively, highlight the parts of the script you want to execute and click the run selection button.

<img src="images/toolbar.png" width=40% align="left" style="border: 1px solid black" />

Once the script is running, you may need to interrupt it, for example because it is taking too long or you have spotted a bug. Click the red rectangle in the console to stop the execution. 

<img src="images/interrupt.gif" width=80% align="left" style="border: 1px solid black" />

#### Exercise 8
1. Install (if needed) and start Spyder, then open `webscraping_101.py` (`File` > `Open`). Compare this notebook and the Python script in Spyder side by side: which do you find clearer? 
2. Run the script and then open the `book_urls.csv` file in Excel. Where is the file stored on your computer? How many records are there?

#### Solutions
1. This is a matter of personal preference, but the `.py` file arguably looks neater because all the code is in one view (e.g., all import statements are grouped together rather than spread throughout the notebook).
2. Exported files are stored in the current working directory (unless you specify a different path). The `book_urls.csv` file contains 1000 rows (999 records and 1 header row).
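If you are unsure where the file ended up, you can ask Python directly. The snippet below is a small sketch that prints the current working directory and counts the records in `book_urls.csv`. It assumes the file has exactly one header row, and the helper name `count_records` is made up for this example.

```python
import csv
import os


def count_records(filename):
    """Count the data rows in a CSV file, excluding the header row."""
    with open(filename, newline="", encoding="utf-8") as csv_file:
        reader = csv.reader(csv_file)
        next(reader, None)  # skip the header row (assumed to be present)
        return sum(1 for _ in reader)


# Files written with a relative name such as "book_urls.csv" end up here.
print("Working directory:", os.getcwd())

csv_path = os.path.abspath("book_urls.csv")
if os.path.exists(csv_path):
    print(f"{csv_path} contains {count_records(csv_path)} records")
else:
    print(f"{csv_path} not found; run the script first")
```

You can run this in the Spyder console or in a notebook cell, as long as it is started from the same folder as the script.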

### 3.3 Run Python Files 
* *Mac*
    1. Open the terminal and navigate to the folder in which the `.py` file has been saved (use `cd` to change directories and `ls` to list all files).
    2. Run the Python script by typing `python <FILENAME.py>` (e.g., `python webscraping_101.py`).

<img src="images/running_python.gif" width=60% align="left" style="border: 1px solid black" />

* *Windows*
    1. Open Windows Explorer and navigate to the folder in which the `.py` file has been saved. Type `cmd` in the address bar and press Enter to open the command prompt in that folder. Alternatively, open the command prompt from the start menu (and use `cd` to change directories and `dir` to list files).
    2. Activate Anaconda by typing `conda activate`.
    3. Run the Python script by typing `python <FILENAME.py>` (e.g., `python webscraping_101.py`).
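A quick way to check that the steps above work on your machine is to run a tiny test script before running the scraper. The sketch below could be saved under a file name of your choice (here, hypothetically, `check_env.py`) and started with `python check_env.py`. It reports which Python interpreter is being used and from which folder the script was launched. The function name `describe_environment` is an assumption for this example.

```python
# check_env.py (hypothetical file name)
import os
import sys


def describe_environment():
    """Collect basic facts about how this script is being run."""
    return {
        "interpreter": sys.executable,             # path to the Python program in use
        "python_version": sys.version.split()[0],  # e.g., the "3.x.y" part
        "working_directory": os.getcwd(),          # folder the script was started from
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If the interpreter path does not point to your Anaconda installation, the Anaconda environment is probably not active in that terminal.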

### 3.4 Wrap-up

Although we have barely scratched the surface of Spyder and the command line, this is already the end of the tutorial. Keep up the good work!
