# Welcome to Web Scraping

In this assignment, we scrape the quotes, tags, and author details from a website and collect them into a pandas dataframe.

We begin by importing the necessary packages.

- For information on BeautifulSoup, please see [the documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#insert-before-and-insert-after)

In [1]:
import pandas as pd
import urllib.request
import requests
from bs4 import BeautifulSoup

BeautifulSoup turns raw HTML into a tree that is easy to search. We fetch and parse the first page as follows:

- We store the site's URL in a variable
- We use urllib.request to download the page's raw HTML
- We use BeautifulSoup to parse that HTML into something neat

In [2]:
firstUrl = "http://quotes.toscrape.com"
firstpage = urllib.request.urlopen(firstUrl).read() # download the raw HTML
soup = BeautifulSoup(firstpage, "html.parser") # name the parser explicitly
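
To make the parsing step concrete, here is a minimal sketch that runs BeautifulSoup on a small hand-written HTML string rather than the live site. The markup and the names `sample_html` and `sample_soup` are made up for illustration; the parser is named explicitly as `"html.parser"`, Python's built-in one.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on a quote listing (not copied from the site).
sample_html = """
<div class="quote">
  <span class="text">First quote</span>
  <small class="author">Author A</small>
</div>
<div class="quote">
  <span class="text">Second quote</span>
  <small class="author">Author B</small>
</div>
"""

# Naming the parser avoids the warning bs4 emits when it has to pick one itself.
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first match (or None); find_all() returns every match.
first_author = sample_soup.find(class_="author").text
all_texts = [span.text for span in sample_soup.find_all(class_="text")]

print(first_author)
print(all_texts)
```

Because `find()` returns `None` when nothing matches, any later `.text` access should be guarded, which is what the `is None` checks in the loops of this notebook do.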

## Cells 3-5

The goal for the next three cells is the following:

- Extract each column from every page
- To do this, we need to loop over the site's pages, following each page's "next" link
- As we visit each page, we extract all quotes, tags, and author information
- Once we obtain the text, we append it to a list
- The comments in cell 3 apply equally to cell 4

In [3]:
allQuote = [] # empty list that will hold every quote
firstUrl = "http://quotes.toscrape.com"
firstPage = firstUrl

while True: # loop over the pages
    firstPageContent = urllib.request.urlopen(firstPage).read()
    soup = BeautifulSoup(firstPageContent, "html.parser") # name the parser explicitly

    nextPage = soup.find(class_='next') # find the element with class "next"
    if nextPage is None: # no "next" link: stop (this also skips the quotes on the last page)
        break
    else:
        firstPage = firstUrl + nextPage.a.get("href") # build the URL of the next page
        quotes = soup.find_all(class_='text') # find all elements with class "text"
        for quote in quotes:
            quotesauthors = quote.text # keep only the text of each element
            allQuote.append(quotesauthors) # append the quote to the list
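
Because the loop above breaks as soon as it finds no `next` link, the final page is never scraped. One possible way around this, sketched below, is to collect the quotes first and only then look for the link. The helper `scrape_all_quotes` and its `fetch` parameter are hypothetical names, not part of the original notebook; `fetch` is any function that takes a URL and returns HTML, which lets the sketch be tried on in-memory pages.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup


def scrape_all_quotes(start_url, fetch):
    """Collect the text of every class='text' element across all pages.

    `fetch` is a hypothetical callable: URL in, HTML (str or bytes) out.
    """
    quotes = []
    url = start_url
    while url is not None:
        soup = BeautifulSoup(fetch(url), "html.parser")
        # Scrape the current page *before* deciding whether to stop.
        quotes.extend(tag.text for tag in soup.find_all(class_="text"))
        next_link = soup.find(class_="next")
        if next_link is None:
            url = None  # no further page: leave the loop
        else:
            # urljoin resolves a relative href such as "/page/2/" against the current URL.
            url = urljoin(url, next_link.a.get("href"))
    return quotes


# Two made-up pages standing in for a real site.
fake_site = {
    "http://example.test/": '<span class="text">one</span>'
                            '<li class="next"><a href="/page/2/">Next</a></li>',
    "http://example.test/page/2/": '<span class="text">two</span>',
}
print(scrape_all_quotes("http://example.test/", fake_site.__getitem__))
```

Against a live site, `fetch` could be something like `lambda url: urllib.request.urlopen(url).read()`. Note that switching to this pattern would change the number of rows collected, so every column would need to be scraped the same way.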

In [4]:
allTag = []
firstUrl = "http://quotes.toscrape.com"
firstPage = firstUrl

while True:
    firstPageContent = urllib.request.urlopen(firstPage).read()
    soup = BeautifulSoup(firstPageContent, "html.parser")

    nextPage = soup.find(class_='next')
    if nextPage is None:
        break
    else:
        firstPage = firstUrl + nextPage.a.get("href")
        tags = soup.find_all(class_='tags')
        for tag in tags:
            quotestag = tag.text
            allTag.append(quotestag)
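
The two loops above crawl the whole site twice, once per column. As a rough sketch of an alternative, each quote block could be parsed once, pulling the quote text and its tags together. The function name `parse_quotes` is hypothetical, and the sample markup (a `quote` container holding `text` and `tag` elements) is an assumption about the page layout that should be checked against the real HTML.

```python
from bs4 import BeautifulSoup


def parse_quotes(html):
    """Return one dict per quote block, with its text and a list of tag names."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for block in soup.find_all(class_="quote"):
        text = block.find(class_="text")
        rows.append({
            "Quote": text.text if text is not None else None,
            # Individual class='tag' links avoid the whitespace-heavy text
            # of the surrounding class='tags' container.
            "Tags": [a.text for a in block.find_all(class_="tag")],
        })
    return rows


# Assumed, simplified markup for illustration only.
sample = """
<div class="quote">
  <span class="text">A quote</span>
  <div class="tags">Tags:
    <a class="tag" href="/tag/life/">life</a>
    <a class="tag" href="/tag/love/">love</a>
  </div>
</div>
"""
print(parse_quotes(sample))
```

Keeping each quote and its tags in the same record also guarantees that the two columns stay aligned row by row.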

## Loop within a loop

Here's what stays the same:

- Four empty lists, for the same reason as before
- We loop over the same pages in the same way

Here's the difference:

- Every quote links to a page with information about its author.
- We have to follow each of those links and extract the information from the author page

In [5]:
allTitle = []
allBirth = []
allLocation = []
allDescription = []
firstUrl = "http://quotes.toscrape.com"
firstPage = firstUrl

while True:
    firstPageContent = urllib.request.urlopen(firstPage).read()
    soup = BeautifulSoup(firstPageContent, "html.parser")

    nextPage = soup.find(class_='next')
    if nextPage is None:
        break
    else:
        firstPage = firstUrl + nextPage.a.get("href") # the page-to-page logic ends here
        for link in soup.find_all(class_='quote'): # loop over every element with class "quote"
            if link.a.get('href') is None: # skip quotes without an author link (avoids a NoneType error)
                continue
            else:
                infoUrl = firstUrl + link.a.get("href") # URL of the author page
                infoPage = urllib.request.urlopen(infoUrl).read()
                infoSoup = BeautifulSoup(infoPage, "html.parser") # works just like soup, but for the author page
                title = infoSoup.find(class_="author-title").text
                birth = infoSoup.find(class_="author-born-date").text
                location = infoSoup.find(class_="author-born-location").text
                description = infoSoup.find(class_="author-description").text # find each class and extract its text
                allTitle.append(title)
                allBirth.append(birth)
                allLocation.append(location)
                allDescription.append(description) # append each value to its list, like before
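
The inner loop downloads an author's page once per quote, so an author with several quotes is fetched several times. A small cache keyed by URL is one way to avoid the repeats. In this sketch `make_cached_fetch` and `fetch` are hypothetical names; `fetch` stands in for whatever function downloads a page.

```python
def make_cached_fetch(fetch):
    """Wrap `fetch` (URL -> page content) so each URL is downloaded only once."""
    cache = {}

    def cached_fetch(url):
        if url not in cache:
            cache[url] = fetch(url)  # first request: download and remember
        return cache[url]            # later requests: reuse the stored copy

    return cached_fetch


# Stand-in downloader that records how often it is called.
calls = []


def fake_fetch(url):
    calls.append(url)
    return "<html>page for " + url + "</html>"


cached = make_cached_fetch(fake_fetch)
for url in ["/author/A", "/author/B", "/author/A"]:
    cached(url)

print(len(calls))  # 2: the repeated URL was served from the cache
```

With real downloads, `fetch` could be something like `lambda url: urllib.request.urlopen(url).read()`; the rest of the loop would stay as it is.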

## A dictionary of lists

Here are the final steps:

- We filled the lists so that each one can become a column
- Now that we have all the values, we choose column names and build a dictionary that maps each name to its list
- The beauty of pandas is that we can transform a dictionary into a dataframe
- For information on `orient='index'`, see [the pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.from_dict.html)
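
As a small illustration of why `orient='index'` followed by a transpose can be useful: it tolerates lists of different lengths, padding the shorter ones with missing values, whereas `pd.DataFrame(data)` raises an error in that case. The column contents below are made up.

```python
import pandas as pd

# Made-up columns of unequal length.
toy = {"Quote": ["q1", "q2", "q3"], "Tag": ["t1", "t2"]}

# orient='index' turns each key into a row; short rows are padded with missing values.
by_row = pd.DataFrame.from_dict(toy, orient="index")

# Transposing puts the keys back as column names.
by_column = by_row.transpose()

print(by_column.shape)
print(by_column)
```

When every list has the same length, `pd.DataFrame(toy)` would build the same columns directly, without the transpose.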

In [6]:
data = {'Quote': allQuote, 'Tag': allTag, 'Title': allTitle, 'Birth': allBirth, 'Location': allLocation, 'Description': allDescription}
Mydata = pd.DataFrame.from_dict(data, orient='index') # each key becomes a row
Mydata2 = Mydata.transpose() # transpose so that each key becomes a column instead
Mydata2 # We have a shape of (90, 6), so 90 observations and 6 columns

|  | Quote | Tag | Title | Birth | Location | Description |
|---|---|---|---|---|---|---|
| 0 | “The world as we have created it is a process ... | \n Tags:\n \nchange\ndee... | Albert Einstein | March 14, 1879 | in Ulm, Germany | \n In 1879, Albert Einstein was born in... |
| 1 | “It is our choices, Harry, that show what we t... | \n Tags:\n \nabilities\n... | J.K. Rowling | July 31, 1965 | in Yate, South Gloucestershire, England, The U... | \n See also: Robert GalbraithAlthough s... |
| 2 | “There are only two ways to live your life. On... | \n Tags:\n \ninspiration... | Albert Einstein | March 14, 1879 | in Ulm, Germany | \n In 1879, Albert Einstein was born in... |
| 3 | “The person, be it gentleman or lady, who has ... | \n Tags:\n \naliteracy\n... | Jane Austen | December 16, 1775 | in Steventon Rectory, Hampshire, The United Ki... | \n Jane Austen was an English novelist ... |
| 4 | “Imperfection is beauty, madness is genius and... | \n Tags:\n \nbe-yourself... | Marilyn Monroe | June 01, 1926 | in The United States | \n Marilyn Monroe (born Norma Jeane Mor... |
| ... | ... | ... | ... | ... | ... | ... |
| 85 | “Some day you will be old enough to start read... | \n Tags:\n \nage\nfairyt... | C.S. Lewis | November 29, 1898 | in Belfast, Ireland | \n CLIVE STAPLES LEWIS (1898–1963) was ... |
| 86 | “We are not necessarily doubting that God will... | \n Tags:\n \ngod\n | C.S. Lewis | November 29, 1898 | in Belfast, Ireland | \n CLIVE STAPLES LEWIS (1898–1963) was ... |
| 87 | “The fear of death follows from the fear of li... | \n Tags:\n \ndeath\nlife\n | Mark Twain | November 30, 1835 | in Florida, Missouri, The United States | \n Samuel Langhorne Clemens, better kno... |
| 88 | “A lie can travel half way around the world wh... | \n Tags:\n \nmisattribut... | Mark Twain | November 30, 1835 | in Florida, Missouri, The United States | \n Samuel Langhorne Clemens, better kno... |
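
The `Tag` column still carries the raw text of the tag container, including newlines and the `Tags:` label. One possible cleanup, sketched here with a hypothetical helper `clean_tags`, splits that text into individual tag names. The input string imitates the pattern visible in the output above.

```python
def clean_tags(raw):
    """Split the raw text of a tags container into a list of tag names."""
    parts = [line.strip() for line in raw.splitlines()]
    # Drop blank lines and the leading "Tags:" label.
    return [part for part in parts if part and part != "Tags:"]


# Example input modelled on the pattern in the Tag column.
raw = "\n Tags:\n \ndeath\nlife\n"
print(clean_tags(raw))  # ['death', 'life']
```

Applied to the dataframe, this could look like `Mydata2['Tag'].apply(clean_tags)`, assuming every entry in the column is a string.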


In [7]:
Mydata2.to_csv("../data/raw/Mydata.csv")

In [8]:
Mydata2.to_excel("../data/raw/Mydata.xlsx")
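
By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. If that is not wanted, passing `index=False` is one option. The sketch below uses a made-up frame and an in-memory buffer instead of a file on disk.

```python
import io
import pandas as pd

# Made-up frame standing in for the scraped data.
toy_frame = pd.DataFrame({"Quote": ["q1", "q2"], "Title": ["Author A", "Author B"]})

# Default: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy_frame.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# index=False leaves the index out of the file.
without_index = io.StringIO()
toy_frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = list(pd.read_csv(without_index).columns)

print(cols_default)
print(cols_no_index)
```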