## We are scraping:
https://quotes.toscrape.com/

#### **Steps:**

1. Import the necessary packages:
   - `import requests` 
   - `from bs4 import BeautifulSoup` 
2. Use `requests.get()` to retrieve the page to be scraped.

   - `page = requests.get("page_url")`, where `page_url` is the URL of the page to be scraped.
   - Check `page.status_code` to confirm that the request succeeded: `200` means success, while an error code such as `404` means the page does not exist.
3. Use `BeautifulSoup` to create a soup object: `page_soup = BeautifulSoup(markup=page.content, features="html.parser")` or simply `page_soup = BeautifulSoup(page.content, "html.parser")`.

4. Alternatively, we can use `page_soup = BeautifulSoup(page.text, "html.parser")`. `page.content` returns the raw response body as `bytes`, whereas `page.text` returns it decoded as a `str`.

   This object now holds the parsed HTML of the page we are dealing with.

5. The next step is to inspect the HTML and extract the relevant details. This can be achieved with the `find` and `find_all` methods (older camelCase aliases such as `findAll` and `findChildren` also exist) in conjunction with the `get` and `get_text` methods.
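
As a rough illustration of step 5, the sketch below runs `find`, `find_all`, `get` and `get_text` on a small hand-written HTML string instead of a live page. The contents of `sample_html` are made up for this example and only loosely modelled on the site's markup.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, hand-written for illustration only.
sample_html = """
<div class="quote">
  <span class="text">A sample quote.</span>
  <small class="author">Sample Author</small>
  <a href="/author/Sample-Author">(about)</a>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag (or None if nothing matches).
author = soup.find("small", class_="author").get_text()

# find_all() returns a list of every matching tag.
links = soup.find_all("a")

# get() reads an attribute value from a tag.
about_href = links[0].get("href")

print(author)      # Sample Author
print(about_href)  # /author/Sample-Author
```

`get_text()` gives the text inside a tag, while `get()` gives the value of one of its attributes, which is why the two are usually used together.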


In [1]:
import requests
from bs4 import BeautifulSoup

Let's do a trial run of the starting page to do a basic inspection before we code a scraper to dig through the entire site.

In [3]:
page = requests.get("https://quotes.toscrape.com/")
page_soup = BeautifulSoup(page.content, "html.parser")
page_soup

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Quotes to Scrape</title>
<link href="/static/bootstrap.min.css" rel="stylesheet"/>
<link href="/static/main.css" rel="stylesheet"/>
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="

We can write the following code to extract the author, quote and tags. We store them as a list of dictionaries, one per quote, where the key is the author's name and the value is a two-element list: the quote text followed by a list of its tags.

In [35]:
# Each quote on the page has one text span, one author tag and one tags div.
quote_data = page_soup.find_all("span", class_="text")
author_data = page_soup.find_all("small", class_="author")
tags_data = page_soup.find_all("div", class_="tags")

quotes = []
for author, quote, tags in zip(author_data, quote_data, tags_data):
    # The individual tags are the <a> links inside the tags div.
    tags_list = [tag.get_text() for tag in tags.find_all("a")]

    # One dictionary per quote: {author: [quote text, list of tags]}.
    quotes.append({author.get_text(): [quote.get_text(), tags_list]})

In [36]:
quotes

[{'Albert Einstein': ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
   ['change', 'deep-thoughts', 'thinking', 'world']]},
 {'J.K. Rowling': ['“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
   ['abilities', 'choices']]},
 {'Albert Einstein': ['“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
   ['inspirational', 'life', 'live', 'miracle', 'miracles']]},
 {'Jane Austen': ['“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
   ['aliteracy', 'books', 'classic', 'humor']]},
 {'Marilyn Monroe': ["“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
   ['be-yourself', 'inspirational']]},
 {'Albert Einstein': ['“Try not to become a man of success. Rather become a man of value.”',
  

So we have been able to extract all quotes and related data from the first page. Now we just need to automate this for the rest of the pages. Let us see how.

The first thing is to see how the page URLs vary:

- https://quotes.toscrape.com/page/1/
- https://quotes.toscrape.com/page/2/
- https://quotes.toscrape.com/page/3/
- and so on

In [41]:
page_num = 1

base_url = f"https://quotes.toscrape.com/page/{page_num}/"

We also need a condition for when the pages of quotes run out. If we open an absurdly high page number, "https://quotes.toscrape.com/page/200/", we see that it does not return an error; it just shows a page that says "No quotes found!". So this is our condition to stop scraping: whenever our URL hits such a page, we should stop. Let us explore what exactly this condition will be.

In [56]:
break_page = requests.get("https://quotes.toscrape.com/page/200/")
break_soup = BeautifulSoup(break_page.content, "html.parser")
break_data = break_soup.find("div", class_ = 'col-md-8').get_text(strip=True)
break_data
# print(break_data[1].get_text(strip=True))

'Quotes to Scrape'

Unfortunately, I was unable to work out how to reach this break condition from the parsed page, so we will do it the old-fashioned way: checking by hand shows that there are 10 pages of quotes, so we scrape pages 1 through 10 and stop before the 11th.
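
One possible way to detect the stopping condition, offered as a sketch and not as the method used in this notebook: `find` returns only the first match, and the first `col-md-8` div on the page is the header, which is why the cell above prints `'Quotes to Scrape'`. Instead of searching for the message text, we can check whether the page contains any `div` with class `quote`. The helper `has_quotes` and both HTML strings below are hypothetical and hand-written so that the example runs without a network call; the empty-page markup is an assumption about what the site returns.

```python
from bs4 import BeautifulSoup


def has_quotes(html: str) -> bool:
    """Return True if the page contains at least one quote block."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("div", class_="quote") is not None


# Hand-written stand-ins for a normal page and an empty page (assumed markup).
page_with_quotes = """
<div class="col-md-8"><h1>Quotes to Scrape</h1></div>
<div class="col-md-8">
  <div class="quote"><span class="text">A sample quote.</span></div>
</div>
"""

empty_page = """
<div class="col-md-8"><h1>Quotes to Scrape</h1></div>
<div class="col-md-8">No quotes found!</div>
"""

print(has_quotes(page_with_quotes))  # True
print(has_quotes(empty_page))        # False
```

Checking for the absence of quote blocks avoids depending on the exact wording of the "No quotes found!" message.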

We define a function that scrapes a single page (essentially our trial exercise above). The function takes the page URL as its argument and returns the list of quotes found on that page.

In [63]:
def page_scraper(url: str) -> list:
    """Scrape one page and return a list of {author: [quote, tags]} dictionaries."""
    page = requests.get(url)
    page.raise_for_status()  # stop early on an HTTP error such as 404
    page_soup = BeautifulSoup(page.text, "html.parser")

    quote_data = page_soup.find_all("span", class_="text")
    author_data = page_soup.find_all("small", class_="author")
    tags_data = page_soup.find_all("div", class_="tags")

    page_quotes = []
    for author, quote, tags in zip(author_data, quote_data, tags_data):
        # The individual tags are the <a> links inside the tags div.
        tags_list = [tag.get_text() for tag in tags.find_all("a")]

        # One dictionary per quote: {author: [quote text, list of tags]}.
        page_quotes.append({author.get_text(): [quote.get_text(), tags_list]})

    return page_quotes

Next we will use the above function to do the actual scraping for quotes:

In [64]:
final_quotes_data = []

# Pages 1 through 10 hold quotes; stop before page 11.
for page_num in range(1, 11):
    page_url = f"https://quotes.toscrape.com/page/{page_num}/"
    final_quotes_data.extend(page_scraper(page_url))

In [66]:
len(final_quotes_data)

100
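
The hard-coded limit of 10 pages could also be replaced by a loop that stops at the first page returning no quotes, since `page_scraper` returns an empty list when nothing on the page matches. The following is a minimal sketch under that assumption. The names `scrape_all`, `scrape_page`, `max_pages` and the `fake_scraper` stand-in are hypothetical and introduced only for this example; in the notebook one would pass `page_scraper` as `scrape_page`.

```python
from typing import Callable


def scrape_all(scrape_page: Callable[[str], list], max_pages: int = 50) -> list:
    """Collect quotes page by page until a page comes back empty."""
    all_quotes = []
    for page_num in range(1, max_pages + 1):
        url = f"https://quotes.toscrape.com/page/{page_num}/"
        page_quotes = scrape_page(url)
        if not page_quotes:  # empty page: we have run past the last one
            break
        all_quotes.extend(page_quotes)
    return all_quotes


def fake_scraper(url: str) -> list:
    """Offline stand-in: pretends the site has 3 pages with 2 quotes each."""
    page_num = int(url.rstrip("/").split("/")[-1])
    if page_num > 3:
        return []
    return [{"Author": [f"quote {page_num}-{i}", []]} for i in range(2)]


print(len(scrape_all(fake_scraper)))  # 6
```

The `max_pages` cap is a safety net so the loop still ends if the site never returns an empty page.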

In [67]:
final_quotes_data[15:20]

[{'Douglas Adams': ['“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”',
   ['life', 'navigation']]},
 {'Elie Wiesel': ["“The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.”",
   ['activism',
    'apathy',
    'hate',
    'indifference',
    'inspirational',
    'love',
    'opposite',
    'philosophy']]},
 {'Friedrich Nietzsche': ['“It is not a lack of love, but a lack of friendship that makes unhappy marriages.”',
   ['friendship',
    'lack-of-friendship',
    'lack-of-love',
    'love',
    'marriage',
    'unhappy-marriage']]},
 {'Mark Twain': ['“Good friends, good books, and a sleepy conscience: this is the ideal life.”',
   ['books', 'contentment', 'friends', 'friendship', 'life']]},
 {'Allen Saunders': ['“Life is what happens to us while we are making other plans.

Finally, let us write all this data to a file (in this case, a `.txt` file).

In [69]:
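# encoding="utf-8" so the curly quotation marks are written the same way on every platform
with open("quotes.txt", "w", encoding="utf-8") as file:
    for quote in final_quotes_data:
        file.write(f"{quote}\n")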
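
Since each record is a dictionary, JSON is another option that keeps the data easy to load back later. This is a sketch of an alternative, not what the notebook does: the file name `quotes.json` and the two-item `sample_quotes` list are placeholders standing in for `final_quotes_data`.

```python
import json

# Placeholder data in the same shape as final_quotes_data.
sample_quotes = [
    {"Author A": ["“First sample quote.”", ["tag-one", "tag-two"]]},
    {"Author B": ["“Second sample quote.”", []]},
]

# ensure_ascii=False keeps the curly quotation marks readable in the file.
with open("quotes.json", "w", encoding="utf-8") as file:
    json.dump(sample_quotes, file, ensure_ascii=False, indent=2)

# Reading the file back gives the same list of dictionaries.
with open("quotes.json", encoding="utf-8") as file:
    loaded = json.load(file)

print(loaded == sample_quotes)  # True
```

Unlike the `.txt` output, which stores the Python representation of each dictionary, the JSON file can be parsed directly with `json.load`.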