# Guide to Web Scraping
- Perian Bootcamp
- Follow along by running each command
- Take notes as you go
- Learn what each command does

In [None]:
conda install requests
conda install lxml
conda install bs4

# If you are not using the Anaconda installation, you can use pip install instead of conda install, for example:

pip install requests
pip install lxml
pip install bs4 
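
Before moving on, it can help to confirm that the three packages are importable from the running kernel. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is made up for this guide and is not part of any of the installed libraries.

```python
import importlib.util


def missing_packages(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The import names used in this guide: requests, lxml and bs4.
# An empty list means all three can be imported.
print(missing_packages(["requests", "lxml", "bs4"]))
```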

# Now let's see what we can do with these libraries.

## Example Task 0 - Grabbing the title of a page

Remember that the page title lives in the HTML block with the title tag. Let's go through the main steps as listed below:

In [None]:
import requests
# Step 1: Use the requests library to grab the page
# Note: this may fail if a firewall is blocking Python/Jupyter
# Note: if it fails the first time, try running it a second time
res = requests.get("http://www.example.com")
type(res)
res.text
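
The note about running the request twice can be automated with a small retry loop. This is a sketch, not part of the requests API: `call_with_retries` is a hypothetical helper that accepts any zero-argument callable, so it is shown here with a stand-in function rather than a live request.

```python
import time


def call_with_retries(func, attempts=3, delay=1.0):
    """Call func() up to `attempts` times, sleeping `delay` seconds between failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as error:  # deliberately broad for this sketch
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay)
    # Every attempt failed, so re-raise the most recent error.
    raise last_error


# Stand-in for a request that fails once and then succeeds.
calls = []


def flaky():
    calls.append(1)
    if len(calls) < 2:
        raise ConnectionError("simulated network failure")
    return "page text"


result = call_with_retries(flaky, attempts=3, delay=0)
```

With the libraries from this guide, the call could look like `call_with_retries(lambda: requests.get("http://www.example.com"))`.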

In [None]:
import bs4
soup = bs4.BeautifulSoup(res.text,"lxml")
soup
soup.select('title')
title_tag = soup.select('title')
title_tag[0]
type(title_tag[0])
title_tag[0].getText()

## Example Task 1 - Grabbing all elements of a class

Let's try to grab all the section headings of the Wikipedia Article on Grace Hopper from this URL: https://en.wikipedia.org/wiki/Grace_Hopper

In [None]:
# First, request the page
res = requests.get('https://en.wikipedia.org/wiki/Grace_Hopper')

# Create a soup from the response text
soup = bs4.BeautifulSoup(res.text,"lxml")

# Note: depending on your IP address, this class may have a different name,
# in which case the selection below returns an empty list
soup.select(".toctext")

for item in soup.select(".toctext"):
    print(item.text)

## Example Task 2 - Getting an Image from a Website

Let's attempt to grab the image of the Deep Blue computer from this Wikipedia article: https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)

In [None]:
res = requests.get("https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)")
soup = bs4.BeautifulSoup(res.text,'lxml')
image_info = soup.select('.thumbimage')
image_info

len(image_info)
computer = image_info[0]

type(computer)
computer['src']

image_link = requests.get('https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Deep_Blue.jpg/220px-Deep_Blue.jpg')


# The raw content (it's a binary file, meaning we will need to use binary read/write methods for saving it)
image_link.content

# Let's write this to a file. Note the 'wb' mode, which denotes binary writing.
# The with-block closes the file automatically, even if the write fails.
with open('my_new_file_name.jpg','wb') as f:
    f.write(image_link.content)
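
Assuming `computer['src']` comes back protocol-relative (starting with `//`), which would explain why the full `https://` address is typed out by hand above, the standard library can resolve it against the page URL. The sketch below also wraps the binary write in a small function; `resolve_src` and `save_bytes` are hypothetical helper names, and the sample bytes stand in for `image_link.content`.

```python
import os
import tempfile
from urllib.parse import urljoin


def resolve_src(page_url, src):
    """Turn a relative or protocol-relative src into an absolute URL."""
    return urljoin(page_url, src)


def save_bytes(path, data):
    """Write binary data to path; the with-block closes the file automatically."""
    with open(path, "wb") as f:
        f.write(data)


page_url = "https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)"
# Assumed value of computer['src']; the real attribute may differ.
src = "//upload.wikimedia.org/wikipedia/commons/thumb/b/be/Deep_Blue.jpg/220px-Deep_Blue.jpg"
image_url = resolve_src(page_url, src)

# Placeholder bytes standing in for image_link.content.
sample_path = os.path.join(tempfile.gettempdir(), "my_new_file_name.jpg")
save_bytes(sample_path, b"\xff\xd8\xff")
```

The resolved `image_url` could then be passed to `requests.get` instead of a hard-coded address.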

## Example Project - Working with Multiple Pages and Items

We will do the following:

- Figure out the URL structure to go through every page
- Scrape every page in the catalogue
- Figure out what tag/class represents the star rating
- Filter by that star rating using an if statement
- Store the results in a list

We can see that the URL structure is the following:

http://books.toscrape.com/catalogue/page-1.html
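
Because only the page number changes, every catalogue URL can be produced from one template. A minimal sketch, assuming the catalogue has 50 pages (the same range used in the scraping loop in this guide):

```python
base_url = 'http://books.toscrape.com/catalogue/page-{}.html'

# Pages are numbered from 1, so the range ends at 51 to include page 50.
page_urls = [base_url.format(n) for n in range(1, 51)]

print(page_urls[0])
print(page_urls[-1])
```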

In [None]:
base_url = 'http://books.toscrape.com/catalogue/page-{}.html'

res = requests.get(base_url.format('1'))

soup = bs4.BeautifulSoup(res.text,"lxml")

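# Each book on the page sits in an element with the class "product_pod"
products = soup.select(".product_pod")

# Take the first book as an example to explore its tags
example = products[0]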

example.select('.star-rating.Three')


example.select('.star-rating.Two')


example.select('a')


example.select('a')[1]

example.select('a')[1]['title']
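
The rating is encoded as a second class name next to `star-rating`, which is why selectors such as `.star-rating.Two` work. The sketch below turns such a list of class names into a number; `rating_from_classes` is a hypothetical helper, and the names `One`, `Four` and `Five` are assumed by analogy with the `Two` and `Three` used above.

```python
# Assumed mapping; only 'Two' and 'Three' appear in the cells above.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_from_classes(classes):
    """Return the numeric rating found in a list of class names, or None."""
    for name in classes:
        if name in RATING_WORDS:
            return RATING_WORDS[name]
    return None


print(rating_from_classes(["star-rating", "Three"]))  # 3
```

BeautifulSoup returns the `class` attribute of a tag as a list, so a tag selected with `.star-rating` could be passed in as `tag['class']`.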

In [None]:

two_star_titles = []

for n in range(1,51):

    scrape_url = base_url.format(n)
    res = requests.get(scrape_url)
    
    soup = bs4.BeautifulSoup(res.text,"lxml")
    books = soup.select(".product_pod")
    
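    for book in books:
        # select() returns an empty list when the book has no two-star rating
        if book.select('.star-rating.Two'):
            # The second <a> tag of each book carries the full title
            two_star_titles.append(book.select('a')[1]['title'])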

two_star_titles